Amitabh's Weblog

# Quantization and Training of NN for efficient Integer-Arithmetic-Only Inference

Modern CNN architectures have very high model complexity and demand high computational efficiency. Mobile devices, however, must fit models within limited memory and meet low latency targets to maintain user engagement. Performing inference on mobile devices such as smartphones, AR/VR devices, and drones therefore carries a large computation and memory overhead. To make inference more efficient on mobile hardware that supports only integer arithmetic, a quantization scheme replaces the more sophisticated floating-point arithmetic with integer-only arithmetic. Quantization is therefore a tradeoff between accuracy and on-device latency: the task at hand is to reduce model size and inference time with minimal loss of accuracy. This also significantly affects the training procedure and requires the computer architect to preserve end-to-end model accuracy.

This paper presents such a quantization scheme along with a hardware/software co-designed training procedure. The improvements are demonstrated on MobileNet (a model family designed for run-time efficiency) running ImageNet classification and COCO object detection on modern CPUs.

#### Contributions of the paper:

1. A quantization scheme that quantizes weights and activations as 8-bit integers, and a few other parameters (such as bias vectors) as 32-bit integers. The scheme builds on Ref. 1, which suggests fixed-point arithmetic to accelerate training, and Ref. 2, which suggests 8-bit fixed-point arithmetic to accelerate inference on x86 architectures.
2. A quantization inference framework for integer-arithmetic only hardware.
3. A co-designed quantization training framework to maintain the accuracy of inference.
4. An implementation of the frameworks on MobileNet running on ARM CPUs, performing classification (ImageNet, Ref. 3) and object detection (COCO, Ref. 4).

## Details

The quantization scheme uses integer-only arithmetic for inference and floating-point arithmetic for training. The two representations correspond closely to each other because the same quantization scheme is applied, separately, to each.

$q \rightarrow$ quantized value, denotes bit-representation of values.

$r \rightarrow$ real value, denotes the actual numerical value.

The mapping between the quantized value $q$ and the real value $r$ is given as follows:

$r = S(q - Z)$

where $S$ and $Z$ are constants called quantization parameters. For $B$-bit quantization, $q$ is a $B$-bit integer; here $B = 8$. Bias vectors are quantized as 32-bit integers.

Here, a single set of quantization parameters is used for all values within each weights array and within each activations array; separate arrays use separate quantization parameters. Because the mapping is plain arithmetic, it can be implemented with SIMD instructions, which tends to perform better than the alternative of using a look-up table.

#### 8-bit Quantization

$r=S(q-Z)$

$S \rightarrow$ “Scale” is an arbitrary positive real number. In software, it is a floating-point number of type float, just like the real value $r$. Note that for inference, the floating-point quantities need to be eliminated (discussed ahead).

$Z \rightarrow$ “Zero-point” is the quantized value corresponding to the real value $0$, and is of the same type as $q$, i.e. uint8.

With this mapping, the real value 0 is exactly representable by a quantized value (namely $q = Z$). NN implementations often zero-pad arrays around their boundaries, and this property lets the padding be represented without quantization error.
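The mapping $r = S(q - Z)$ can be tried out with a few lines of NumPy. This is only a minimal sketch: the function names are hypothetical, and deriving $S$ and $Z$ from a min/max range of the real values is an assumption made for illustration, not something this post prescribes.

```python
import numpy as np

def choose_quantization_params(r_min, r_max, num_bits=8):
    """Pick (S, Z) so that [r_min, r_max] maps onto [0, 2**num_bits - 1].

    Assumption: parameters are derived from a min/max range of the real values.
    """
    # Widen the range to contain 0 so that real 0 lands exactly on an integer.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    q_min, q_max = 0, 2 ** num_bits - 1
    scale = (r_max - r_min) / (q_max - q_min)
    if scale == 0.0:  # degenerate all-zero range
        scale = 1.0
    zero_point = int(round(q_min - r_min / scale))
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    """Real values -> uint8 quantized values, inverting r = S(q - Z)."""
    q = np.round(np.asarray(r, dtype=np.float64) / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Quantized values -> real values, r = S(q - Z)."""
    return scale * (q.astype(np.int32) - zero_point)

# Example: real 0 quantizes to the zero-point and dequantizes back to 0.
S, Z = choose_quantization_params(-1.0, 3.0)
q = quantize(np.array([0.0, 0.5, 3.0]), S, Z)
print(S, Z, q, dequantize(q, S, Z))
```

The cast to `int32` before subtracting the zero-point avoids wrap-around in `uint8` arithmetic.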

#### How to do Integer-Arithmetic-Only Matrix Multiplication?

Currently, $r$ and $S$ are floating point, and we need to have an integer-arithmetic-only inference scheme using the mapping $r=S(q-Z)$.

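Consider two $N \times N$ matrices of real numbers, $r_1$ and $r_2$, with product $r_3 = r_1 r_2$. Denote the entries of each matrix $r_\alpha$ ($\alpha = 1, 2, 3$) by $r_{\alpha}^{(i,j)}$ for $1 \leq i,j \leq N$, and its quantization parameters by $(S_\alpha, Z_\alpha)$, so that $r_{\alpha}^{(i,j)} = S_\alpha (q_{\alpha}^{(i,j)} - Z_\alpha)$.

By performing matrix multiplication, we have:

$S_3 (q_3^{(i,k)} - Z_3) = \sum_{j=1}^{N} S_1 (q_1^{(i,j)} - Z_1) \, S_2 (q_2^{(j,k)} - Z_2)$

rewriting,

$q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2)$

where $M$ remains the only non-integer constant, which can be calculated offline from the quantization scales $S_1$, $S_2$ and $S_3$:

$M = \dfrac{S_1 S_2}{S_3}$

Empirically, it is determined that $0 < M < 1$, and $M$ can therefore be expressed in the normalized form as follows:

$M = 2^{-n} M_0$

where $M_0 \in [0.5, 1)$ and $n$ is a non-negative integer.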
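To make the integer-only evaluation concrete, the sketch below computes $q_3 = Z_3 + M \sum_j (q_1 - Z_1)(q_2 - Z_2)$ with $M = S_1 S_2 / S_3$ written as $2^{-n} M_0$. The function names are hypothetical. Storing $M_0$ as a 32-bit fixed-point integer (Q31) and rounding ties upward in the final shift are assumptions made for illustration. Only the offline step `normalize_multiplier` touches floating point.

```python
import numpy as np

def normalize_multiplier(M):
    """Offline step: split M in (0, 1) into (M0_fixed, n), M ~= 2**-n * M0_fixed / 2**31."""
    if not 0.0 < M < 1.0:
        raise ValueError("expected 0 < M < 1")
    M0, n = M, 0
    while M0 < 0.5:  # doubling a float is exact, so M == 2**-n * M0 still holds
        M0 *= 2.0
        n += 1
    # Q31 fixed point (assumption): M0 in [0.5, 1) becomes an integer in [2**30, 2**31).
    M0_fixed = min(int(round(M0 * (1 << 31))), (1 << 31) - 1)
    return M0_fixed, n

def quantized_matmul(q1, q2, Z1, Z2, Z3, M0_fixed, n):
    """Integer-only q3 = Z3 + M * sum_j (q1 - Z1)(q2 - Z2) for uint8 inputs."""
    # Accumulate the products of zero-point-shifted values in int32.
    acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)
    # Multiply by M0 in 64-bit, then shift right by 31 + n with rounding.
    shift = 31 + n
    prod = acc.astype(np.int64) * M0_fixed
    scaled = (prod + (1 << (shift - 1))) >> shift
    # Add the output zero-point and saturate to the uint8 range.
    return np.clip(scaled + Z3, 0, 255).astype(np.uint8)

# Example with made-up quantization parameters.
S1, Z1 = 0.02, 128
S2, Z2 = 0.01, 100
S3, Z3 = 0.05, 120
q1 = np.array([[130, 120], [255, 0]], dtype=np.uint8)
q2 = np.array([[100, 110], [90, 200]], dtype=np.uint8)
M0_fixed, n = normalize_multiplier(S1 * S2 / S3)
print(M0_fixed, n)
print(quantized_matmul(q1, q2, Z1, Z2, Z3, M0_fixed, n))
```

The `int32` accumulator is wide enough here because each product of shifted 8-bit values fits in 16 bits plus sign. Very large inner dimensions would need an overflow check.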

## Some other interesting approaches

Reference: Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H. and Kalenichenko, D., 2018. “Quantization and training of neural networks for efficient integer-arithmetic-only inference”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2704-2713).

1. Gupta, Suyog, et al. “Deep learning with limited numerical precision.” International Conference on Machine Learning. 2015.

2. Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. “Improving the speed of neural networks on CPUs.” (2011).

3. Deng, Jia, et al. “Imagenet: A large-scale hierarchical image database.” 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.

4. Lin, Tsung-Yi, et al. “Microsoft coco: Common objects in context.” European conference on computer vision. Springer, Cham, 2014.