SNPE Quantization Algorithm - Tech It Yourself

## Wednesday, 23 June 2021

Source: developer.qualcomm

# Overview

• Non-quantized DLC files use 32-bit floating point representations of network parameters.
• Quantized DLC files use fixed point representations of network parameters, generally 8-bit weights and 8-bit or 32-bit biases. The fixed point representation is the same as the one used in TensorFlow quantized models.

# Quantization Algorithm

• Quantization converts floating point data to TensorFlow-style 8-bit fixed point format.
• The following requirements are satisfied:
• Full range of input values is covered.
• Minimum range of 0.01 is enforced.
• Floating point zero is exactly representable.
• Quantization algorithm inputs:
• Set of floating point values to be quantized.
• Quantization algorithm outputs:
• Set of 8-bit fixed point values.
• Encoding parameters:
• encoding-min - minimum floating point value representable (by fixed point value 0)
• encoding-max - maximum floating point value representable (by fixed point value 255)
• Algorithm
1. Compute the true range (min, max) of input data.
2. Compute the encoding-min and encoding-max.
3. Quantize the input floating point values.
4. Output:
• fixed point values
• encoding-min and encoding-max parameters

## Details

1. Compute the true range of the input floating point data.
• Find the smallest and largest values in the input data.
• Together, these two values represent the true range of the input data.
2. Compute the encoding-min and encoding-max.
• These parameters are used in the quantization step.
• These parameters define the range and floating point values that will be representable by the fixed point format.
• encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0
• encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255
• every floating point value of the form encoding-min + k * step size, for integer k from 0 to 255, will be representable, where step size = (encoding-max - encoding-min) / 255
1. encoding-min and encoding-max are first set to the true min and true max computed in the previous step
2. First requirement: the encoding range must be at least 0.01
• encoding-max is adjusted to max(true max, true min + 0.01)
3. Second requirement: floating point value of 0 must be exactly representable
• encoding-min or encoding-max may be further adjusted
3. Handling 0.
1. Case 1: Inputs are strictly positive
• the encoding-min is set to 0.0
• zero floating point value is exactly representable by smallest fixed point value 0
• e.g. input range = [5.0, 10.0]
• encoding-min = 0.0, encoding-max = 10.0
2. Case 2: Inputs are strictly negative
• encoding-max is set to 0.0
• zero floating point value is exactly representable by the largest fixed point value 255
• e.g. input range = [-20.0, -6.0]
• encoding-min = -20.0, encoding-max = 0.0
3. Case 3: Inputs are both negative and positive
• encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable
• e.g. input range = [-5.1, 5.1]
• encoding-min and encoding-max are first set to -5.1 and 5.1, respectively
• encoding range is 10.2 and the step size is 10.2/255 = 0.04
• zero is not yet representable: the closest representable values are -0.02 and +0.02, at fixed point values 127 and 128, respectively
• encoding-min and encoding-max are both shifted by -0.02, so the new encoding-min is -5.12 and the new encoding-max is 5.08
• floating point zero is now exactly representable by the fixed point value of 128
4. Quantize the input floating point values.
• The encoding-min and encoding-max parameters determined in the previous step are used to quantize all the input floating point values to their fixed point representation.
• Quantization formula is:
• quantized value = round(255 * (floating point value - encoding-min) / (encoding-max - encoding-min))
• the quantized value is then clamped to lie between 0 and 255
5. Outputs
• the fixed point values
• encoding-min and encoding-max parameters
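
The range adjustment described above (minimum range of 0.01, then the three zero-handling cases) can be sketched in Python. This is only an illustration of the steps as written here, not SNPE source code: the function name `compute_encoding` and its parameters are hypothetical, and the way Case 3 picks the nearest grid point with `round` is an assumption.

```python
def compute_encoding(values, num_steps=255, min_range=0.01):
    """Return (encoding_min, encoding_max) for a list of floats.

    Hypothetical helper that follows the steps described in the text.
    """
    true_min, true_max = min(values), max(values)

    # First requirement: the encoding range must be at least min_range.
    enc_min = true_min
    enc_max = max(true_max, true_min + min_range)

    # Second requirement: floating point zero must be exactly representable.
    if enc_min > 0.0:
        # Case 1: strictly positive inputs, zero maps to fixed point value 0.
        enc_min = 0.0
    elif enc_max < 0.0:
        # Case 2: strictly negative inputs, zero maps to fixed point value 255.
        enc_max = 0.0
    else:
        # Case 3: mixed signs. Shift the whole range so that zero lands on a
        # grid point (assumption: the nearest grid index is chosen with round).
        step = (enc_max - enc_min) / num_steps
        zero_index = round(-enc_min / step)
        enc_min = -zero_index * step
        enc_max = enc_min + num_steps * step
    return enc_min, enc_max


if __name__ == "__main__":
    for data in ([5.0, 10.0], [-20.0, -6.0], [-5.1, 5.1]):
        print(data, compute_encoding(data))
```

For the three input ranges used in the cases above, this sketch gives (0.0, 10.0), (-20.0, 0.0) and, up to floating point rounding, (-5.12, 5.08). In the last case zero sits halfway between two grid points, so which neighbor is chosen depends on the rounding rule used.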

## Quantization Example

• Inputs:
• input values = [-1.8, -1.0, 0, 0.5]
• encoding-min is set to -1.8 and encoding-max to 0.5
• encoding range is 2.3, which is larger than the required 0.01
• encoding-min is adjusted to −1.803922 and encoding-max to 0.496078 to make zero exactly representable
• step size (delta or scale) is 0.009020
• Outputs:
• quantized values are [0, 89, 200, 255]
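
As a quick check of this example, the quantization formula from the Details section can be applied directly to the adjusted encoding parameters. The helper name `quantize` is hypothetical, and Python's built-in `round` (which rounds exact ties to the nearest even integer) is assumed to be an acceptable stand-in for the rounding rule, which the text does not specify.

```python
def quantize(values, enc_min, enc_max, num_steps=255):
    """Map floats to fixed point values in [0, num_steps] (hypothetical helper)."""
    quantized = []
    for x in values:
        # quantized value = round(255 * (x - encoding-min) / (encoding-max - encoding-min))
        q = round(num_steps * (x - enc_min) / (enc_max - enc_min))
        # Clamp to the representable fixed point range.
        quantized.append(max(0, min(num_steps, q)))
    return quantized


if __name__ == "__main__":
    # Encoding parameters taken from the example above.
    print(quantize([-1.8, -1.0, 0.0, 0.5], -1.803922, 0.496078))
    # prints [0, 89, 200, 255]
```

Note that 0.5 lies slightly above encoding-max, so its unclamped value rounds to 255 and the clamp leaves it there.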

## Dequantization Example

• Inputs:
• quantized values = [0, 89, 200, 255]
• encoding-min = −1.803922, encoding-max = 0.496078
• step size is 0.009020
• Outputs:
• dequantized values = [−1.8039, −1.0011, 0.0000, 0.4961]
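
Dequantization is the inverse mapping: each fixed point value is multiplied by the step size and offset by encoding-min. The sketch below assumes exactly that relationship; the name `dequantize` is hypothetical and not taken from the SNPE SDK.

```python
def dequantize(q_values, enc_min, enc_max, num_steps=255):
    """Map fixed point values back to floats (hypothetical helper)."""
    # step size = (encoding-max - encoding-min) / 255
    step = (enc_max - enc_min) / num_steps
    return [enc_min + q * step for q in q_values]


if __name__ == "__main__":
    # Quantized values and encoding parameters from the example above.
    for value in dequantize([0, 89, 200, 255], -1.803922, 0.496078):
        print(f"{value:.3f}")
```

The results agree with the values listed above to about three decimal places. Small differences in the fourth decimal place can appear depending on whether the step size is rounded to 0.009020 before use.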