Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Modern CNN architectures have very high model complexity and demand high computational efficiency. Mobile devices, however, present the challenge of accommodating these models within limited memory while meeting the low latency required to maintain user engagement. Performing inference on mobile devices such as smartphones, AR/VR devices, and drones therefore carries a large computation and memory overhead. To enable more efficient inference on mobile devices equipped with integer-only arithmetic hardware, a quantization scheme makes it possible to substitute integer-only arithmetic for the more sophisticated floating-point arithmetic hardware. Quantization is therefore a trade-off between accuracy and on-device latency: the task at hand is to reduce model sizes and inference times with minimal loss of accuracy. This also significantly affects the training procedure, requiring care to maintain end-to-end model accuracy.
This paper presents such a quantization scheme along with a hardware/software co-designed training procedure. The improvements are demonstrated on MobileNet (a model family designed for runtime efficiency) running ImageNet classification and COCO object detection on modern CPUs.
Contributions of the paper:
 A quantization scheme that quantizes weights and activations as 8-bit integers, and a few other parameters (such as bias vectors) as 32-bit integers. This scheme is derived from Ref.^{1}, which suggested fixed-point arithmetic to accelerate training, and Ref.^{2}, which suggested 8-bit fixed-point arithmetic to accelerate inference on x86 architectures.
 A quantized inference framework for integer-arithmetic-only hardware.
 A co-designed quantized training framework to maintain the accuracy of inference.
 An implementation of these frameworks on MobileNet models running on ARM CPUs, performing classification (ImageNet^{3}) and object detection (COCO^{4}).
Details
The quantization scheme employs integer-only arithmetic for inference and floating-point arithmetic for training. Both representations maintain a high degree of correspondence with each other because the same quantization scheme is adopted separately in each.
The quantized value q denotes the bit-representation of a value.
The real value r denotes the actual numerical value.
The real value to quantized value mapping is given as follows:

r = S (q − Z)

where S and Z are constants called quantization parameters. q can be quantized as a B-bit integer for B-bit quantization; here, B is 8 bits. Bias vectors are quantized as 32-bit integers.
Here, a single set of quantization parameters is used for all values within each weights array and within each activations array; separate arrays can use separate quantization parameters. The mapping can be implemented with SIMD instructions, as opposed to the alternative of a lookup table, for better performance.
8-bit Quantization
“Scale” S is an arbitrary positive real number. In software, it is a floating-point number, just like the real value r, of type float. Note that for inference, all floating-point quantities need to be eliminated (discussed ahead).
“Zero-point” Z is the quantized value corresponding to the real value 0, and is of the same type as q, i.e. uint8.
Using the above mapping, the real value 0 is exactly representable, by the quantized value Z. NN implementations often 0-pad arrays around their boundaries, and this property of the mapping ensures that the padding introduces no quantization error.
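The affine mapping r = S (q − Z) and the exact-zero property can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the helper names `quantize_params`, `quantize`, and `dequantize` are assumptions:

```python
import numpy as np

def quantize_params(r_min, r_max, q_min=0, q_max=255):
    """Choose S and Z so that [r_min, r_max] maps onto [q_min, q_max].

    The range is widened to include 0 so that the zero-point Z is a
    valid quantized value and the real value 0 is exactly representable.
    """
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))  # q that represents real 0
    return scale, zero_point

def quantize(r, scale, zero_point):
    """Real values -> uint8 via q = round(r / S + Z)."""
    q = np.round(np.asarray(r, dtype=np.float64) / scale + zero_point)
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """uint8 -> real values via r = S * (q - Z)."""
    return scale * (q.astype(np.float64) - zero_point)

r = np.array([-1.0, 0.0, 0.5, 2.0])
S, Z = quantize_params(r.min(), r.max())
q = quantize(r, S, Z)
# The real value 0 maps exactly to the zero-point Z, so 0-padding is exact.
assert dequantize(np.array([Z], dtype=np.uint8), S, Z)[0] == 0.0
```

Round-tripping any value through `quantize` and `dequantize` incurs at most half a scale step of error, while 0 round-trips exactly, which is what makes zero-padded convolutions safe under this scheme.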
How to do Integer-Arithmetic-Only Matrix Multiplication?
Currently, the scale S and the real values r are floating-point, and we need an integer-arithmetic-only inference scheme built on the mapping r = S (q − Z).
Consider two N×N matrices of real values, r1 and r2, and their product matrix r3 = r1 r2. Each entry r_α^(i,j) (α = 1, 2, 3) has a quantized counterpart q_α^(i,j) with quantization parameters (S_α, Z_α), such that

r_α^(i,j) = S_α (q_α^(i,j) − Z_α)

By performing the matrix multiplication, we have:

S3 (q3^(i,k) − Z3) = Σ_{j=1..N} S1 (q1^(i,j) − Z1) · S2 (q2^(j,k) − Z2)

Rewriting,

q3^(i,k) = Z3 + M Σ_{j=1..N} (q1^(i,j) − Z1) (q2^(j,k) − Z2)

where M := S1 S2 / S3 remains the only non-integer constant, which can be calculated offline from the quantization scales S1, S2, and S3. Empirically, it is determined that 0 < M < 1, and M can therefore be expressed in the normalized form

M = 2^(−n) M0

where M0 ∈ [0.5, 1) and n is a non-negative integer. Since M0 is well represented as a fixed-point integer multiplier, multiplication by M reduces to an integer multiplication followed by a bit shift.
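The derivation above can be sketched end-to-end in NumPy: quantize two matrices, multiply using only integer operations (with M applied as a fixed-point multiply and shift), and compare against the floating-point product. The helper names, the choice of a 31-bit fixed-point multiplier, and the example quantization parameters are all assumptions for illustration, not the paper's kernel:

```python
import numpy as np

def normalize_multiplier(M, bits=31):
    """Express 0 < M < 1 as M ~= 2^-n * (M0 / 2^bits), with M0/2^bits in [0.5, 1).

    M0 is stored as a fixed-point integer, so multiplying by M becomes an
    integer multiply followed by a right shift of (bits + n).
    """
    n = 0
    while M < 0.5:
        M *= 2.0
        n += 1
    M0 = int(round(M * (1 << bits)))  # fixed-point representation of M0
    return M0, n

def quantized_matmul(q1, Z1, q2, Z2, Z3, M, bits=31):
    """q3 = Z3 + M * sum_j (q1 - Z1)(q2 - Z2), using integer arithmetic only."""
    M0, n = normalize_multiplier(M, bits)
    # Integer accumulator over (q - Z) offsets (int64 here for simplicity;
    # real 8-bit kernels accumulate in int32).
    acc = (q1.astype(np.int64) - Z1) @ (q2.astype(np.int64) - Z2)
    # Round-to-nearest fixed-point multiply by M: add half the divisor
    # before the arithmetic right shift.
    acc = (acc * M0 + (1 << (bits + n - 1))) >> (bits + n)
    return np.clip(acc + Z3, 0, 255).astype(np.uint8)

# Example quantization parameters (assumed for the demo).
S1 = S2 = 0.01; Z1 = Z2 = 128
S3 = 0.02; Z3 = 128
r1 = np.array([[0.5, -1.0], [0.25, 0.75]])
r2 = np.array([[1.0, 0.5], [-0.5, 0.25]])
q1 = np.round(r1 / S1 + Z1).astype(np.uint8)
q2 = np.round(r2 / S2 + Z2).astype(np.uint8)
q3 = quantized_matmul(q1, Z1, q2, Z2, Z3, S1 * S2 / S3)
r3_approx = S3 * (q3.astype(np.float64) - Z3)  # close to r1 @ r2
```

The only use of floating point is the offline computation of M0 and n; at inference time everything inside `quantized_matmul` is integer multiplies, adds, and shifts.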
Strengths of paper and mechanisms
Weaknesses of paper and mechanism
Ideas for improvement
Lessons learned
Some other interesting approaches
Reference: Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H. and Kalenichenko, D., 2018. “Quantization and training of neural networks for efficient integer-arithmetic-only inference”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2704-2713).

1. Gupta, Suyog, et al. “Deep learning with limited numerical precision.” International Conference on Machine Learning, 2015.
2. Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. “Improving the speed of neural networks on CPUs.” (2011).
3. Deng, Jia, et al. “ImageNet: A large-scale hierarchical image database.” 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
4. Lin, Tsung-Yi, et al. “Microsoft COCO: Common objects in context.” European Conference on Computer Vision. Springer, Cham, 2014.