Int8#

To push higher performance during inference computations, recent work has focused on computations that use activations and weights stored at lower precision to achieve higher throughput. Int8 computations offer improved performance over higher-precision types because they enable packing more computations into a single instruction, at the cost of reduced (but acceptable) accuracy.

Workflow#

oneDNN supports symmetric and asymmetric quantization models.

Quantization Model#

For each int8 tensor, the oneDNN library allows the user to specify scaling factors and zero-points (also referred to as quantization parameters), and assumes the following mathematical relationship:

$x_{f32}[:] = scale_{x} \cdot (x_{int8}[:] - zp_{x})$

where $$scale_{x}$$ is a scaling factor in float format, $$zp_{x}$$ is the zero point in int32 format, and $$[:]$$ denotes elementwise application of the formula to the arrays. To provide the best performance, oneDNN does not compute these scaling factors and zero-points as part of primitive computation. The user should provide them through the attribute mechanism.
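The relationship above can be illustrated with a minimal pure-Python sketch. The helper names `dequantize` and `quantize` and the parameter values are hypothetical and chosen only for illustration; they are not part of the oneDNN API, and the rounding mode (Python's round-half-to-even) is an assumption.

```python
def dequantize(x_int8, scale, zp):
    """Apply x_f32 = scale * (x_int8 - zp) elementwise."""
    return [scale * (q - zp) for q in x_int8]


def quantize(x_f32, scale, zp, lo=-128, hi=127):
    """Inverse mapping: round(x / scale) + zp, saturated to [lo, hi]."""
    return [max(lo, min(hi, round(v / scale) + zp)) for v in x_f32]


# Illustrative quantization parameters (assumed, not taken from oneDNN).
scale_x, zp_x = 0.5, 3
x_int8 = [-128, 0, 3, 127]

x_f32 = dequantize(x_int8, scale_x, zp_x)
x_back = quantize(x_f32, scale_x, zp_x)
```

With these values every int8 element maps to an f32 value that quantizes back to the same integer, and the element equal to the zero point maps to exactly 0.0.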

These quantization parameters can either be computed ahead of time using calibration tools (static quantization) or at runtime based on the actual minimum and maximum values of a tensor (dynamic quantization). Either method can be used in conjunction with oneDNN, as the quantization parameters are passed to the oneDNN primitives at execution time.
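As a rough sketch of dynamic quantization, the parameters can be derived from the observed minimum and maximum of an f32 tensor. The function `dynamic_quant_params` is hypothetical, and the specific min/max scheme (extending the range to include 0.0, targeting the signed int8 range) is one common choice assumed here, not something oneDNN prescribes.

```python
def dynamic_quant_params(values, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max of an f32 tensor.

    Follows the convention x_f32 = scale * (x_int8 - zp).
    """
    lo = min(min(values), 0.0)  # extend range so 0.0 stays representable
    hi = max(max(values), 0.0)
    if hi == lo:                # all-zero tensor: any scale works
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zp = round(qmin - lo / scale)      # integer that maps (approximately) to lo
    zp = max(qmin, min(qmax, zp))      # keep the zero point in range
    return scale, zp


scale, zp = dynamic_quant_params([-1.0, 0.0, 3.0])
```

The resulting `scale` (an f32 value) and `zp` (an integer) are the kind of values that would then be handed to a primitive at execution time.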

To support int8 quantization, primitives should be created and executed as follows:

• during primitive creation, if one or more inputs are int8 (signed or unsigned), then the primitive will behave as a quantized integer operation.

• still during primitive creation, the dimensionality of the scaling factors and zero-points should be provided using masks (e.g. one scale per tensor, one scale per channel, …).

• finally, during primitive execution, the user must provide the actual quantization parameters as arguments to the execute function. Scales shall be f32 values, and zero-points shall be int32 values.
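To make the notion of a mask concrete, here is a small sketch that counts how many quantization parameters a mask implies. It assumes the convention that bit `i` of the mask being set means the parameter varies along dimension `i` of the tensor, so a mask of 0 means a single per-tensor value. The helper `num_quant_params` and the example shapes are hypothetical.

```python
from math import prod


def num_quant_params(dims, mask):
    """Number of scales/zero-points implied by a mask.

    Assumed convention: bit i set -> parameter varies along dimension i.
    """
    return prod(d for i, d in enumerate(dims) if mask & (1 << i))


weights_dims = (64, 32, 3, 3)    # (oc, ic, kh, kw), illustrative shape
src_dims = (8, 32, 56, 56)       # (n, ic, ih, iw), illustrative shape

per_tensor = num_quant_params(weights_dims, 0)        # one scale overall
per_oc = num_quant_params(weights_dims, 1 << 0)       # one scale per oc
per_src_channel = num_quant_params(src_dims, 1 << 1)  # one per channel
```

The count tells the user how many f32 scales (or int32 zero-points) must be supplied at execution time for a given mask.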

Note

For performance reasons, each primitive implementation can support only a subset of quantization parameter masks. For example, convolution typically supports per-channel scales (no zero-point) for weights, and a per-tensor scaling factor and zero-point for activations.

Note

Some primitives might use quantization parameters to dequantize/quantize intermediate values. This is the case, for example, for the RNN primitive, which dequantizes before applying nonlinear functions and requantizes before executing matrix multiplication operations.

Numerical behavior#

Primitive implementations are allowed to convert int8 inputs to wider datatypes (e.g. int16 or int32), as those conversions do not impact accuracy.

During execution, primitives should avoid integer overflows and maintain integer accuracy by using wider datatypes (e.g. int32) for intermediate values and accumulators. Those are then converted as necessary before the result is written to the output memory objects. During that conversion, the behavior in case of underflow/overflow is undefined (e.g. when converting s32 to int8). However, it is highly encouraged for implementations to saturate values.
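The following sketch illustrates why a wide accumulator and a saturating down-conversion matter. The helpers `saturate_to_int8` and `wrap_to_int8` are hypothetical; the wrapping variant is shown only to contrast with saturation and is not a statement about what any oneDNN implementation does.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_to_int8(acc_s32):
    """Clamp a wide accumulator value into the int8 range."""
    return max(INT8_MIN, min(INT8_MAX, acc_s32))


def wrap_to_int8(acc_s32):
    """Two's-complement truncation to 8 bits, shown only for contrast."""
    return (acc_s32 + 128) % 256 - 128


# A dot product of int8 values quickly exceeds the int8 and int16 ranges,
# which is why the accumulator must be wider (e.g. int32).
src = [127, 127, 127, 127]
wei = [127, 127, 127, 127]
acc = sum(s * w for s, w in zip(src, wei))

saturated = saturate_to_int8(acc)
wrapped = wrap_to_int8(300)
```

Saturation keeps an out-of-range result at the nearest representable value, whereas truncation can produce a value with the wrong magnitude or sign.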

When multiple operations are fused in a single primitive using the post-op mechanism, those are assumed to be computed in f32 precision. As a result, the destination quantization parameters are applied after the post-ops as follows:

$\dst[:] = post\_ops(OP(src[:], weights[:], ...)) / scale_{\dst} + zp_{\dst}$

Quantizing/dequantizing values between post-operations can still be achieved using one of eltwise post-ops, binary post-ops, or the scale parameter of the appropriate post-operation.
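The ordering described in this section (operation, then post-ops in f32, then destination scale and zero point) can be sketched as follows. The function `quantize_dst`, the choice of ReLU as the post-op, and the numeric values are all illustrative assumptions; rounding and saturation to signed int8 are added here to produce an integer result.

```python
def relu(v):
    return max(v, 0.0)


def quantize_dst(op_result_f32, post_ops, scale_dst, zp_dst):
    """dst = post_ops(OP(...)) / scale_dst + zp_dst, then round and saturate."""
    out = []
    for v in op_result_f32:
        for f in post_ops:              # post-ops run in f32, in order
            v = f(v)
        q = round(v / scale_dst + zp_dst)  # destination params applied last
        out.append(max(-128, min(127, q)))
    return out


# Pretend these are the f32 results of the main operation (assumed values).
op_result = [-2.0, 1.0, 100.0]
dst_int8 = quantize_dst(op_result, [relu], scale_dst=0.5, zp_dst=10)
```

Note that the negative input is first clipped by the post-op and only then mapped to the destination zero point.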

Example: Convolution Quantization Workflow#

Consider a convolution without bias. The tensors are represented as:

• $$\src_{f32}[:] = scale_{\src} \cdot (\src_{int8}[:] - zp_{\src})$$

• $$\weights_{f32}[:] = scale_{\weights} \cdot \weights_{int8}[:]$$

• $$\dst_{f32}[:] = scale_{\dst} \cdot (\dst_{int8}[:] - zp_{\dst})$$

Here the $$\src_{f32}, \weights_{f32}, \dst_{f32}$$ tensors are never computed; all the work happens on int8 tensors. The task is to compute the $$\dst_{int8}$$ tensor from the $$\src_{int8}$$ and $$\weights_{int8}$$ tensors passed at execution time, together with the corresponding quantization parameters $$scale_{\src}$$, $$scale_{\weights}$$, $$scale_{\dst}$$ and $$zp_{\src}$$, $$zp_{\dst}$$. Mathematically, the computations are:

$\dst_{int8}[:] = \operatorname{f32\_to\_int8}( scale_{\src} \cdot scale_{\weights} \cdot \operatorname{s32\_to\_f32}(conv_{s32}(\src_{int8}, \weights_{int8}) - zp_{\src} \cdot comp_{s32}) / scale_{\dst} + zp_{\dst} )$

where

• $$conv_{s32}$$ is a regular convolution that takes source and weights with the int8 data type and computes the result in the int32 data type (int32 is chosen to avoid overflows during the computations);

• $$comp_{s32}$$ is a compensation term that accounts for a non-zero source zero point. This term is computed by the oneDNN library and can typically be pre-computed ahead of time, for example during weights reorder.

• $$\operatorname{f32\_to\_int8}()$$ converts an f32 value to int8, with potential saturation if the value is out of the range of the int8 data type.

• $$\operatorname{s32\_to\_f32}()$$ converts an int32 value to f32 with potential rounding. This conversion is typically necessary to apply f32 scaling factors.
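The formula and its terms can be checked numerically with a small sketch in which a single dot product stands in for the convolution. All names (`conv_s32`, `quantized_dot`, `reference_dot`) and values are hypothetical. For a dot product the compensation term reduces to the sum of the int8 weights; how oneDNN actually computes and stores the compensation is not shown here.

```python
def conv_s32(src_i8, wei_i8):
    """Stand-in for conv_s32: one dot product with an integer accumulator."""
    return sum(s * w for s, w in zip(src_i8, wei_i8))


def f32_to_int8(v):
    """Round and saturate to the signed int8 range."""
    return max(-128, min(127, round(v)))


def quantized_dot(src_i8, wei_i8, scale_src, zp_src, scale_wei,
                  scale_dst, zp_dst):
    comp_s32 = sum(wei_i8)  # depends only on weights, so it can be precomputed
    acc = conv_s32(src_i8, wei_i8) - zp_src * comp_s32
    return f32_to_int8(scale_src * scale_wei * float(acc) / scale_dst + zp_dst)


def reference_dot(src_i8, wei_i8, scale_src, zp_src, scale_wei,
                  scale_dst, zp_dst):
    """Dequantize everything, compute in f32, then quantize the result."""
    src_f32 = [scale_src * (s - zp_src) for s in src_i8]
    wei_f32 = [scale_wei * w for w in wei_i8]
    dst_f32 = sum(s * w for s, w in zip(src_f32, wei_f32))
    return f32_to_int8(dst_f32 / scale_dst + zp_dst)


args = ([10, 20, 30], [1, -2, 3], 0.5, 5, 0.25, 0.25, -3)
dst_q = quantized_dot(*args)
dst_ref = reference_dot(*args)
```

Both paths agree, which shows that subtracting $$zp_{\src} \cdot comp_{s32}$$ from the integer accumulator is equivalent to subtracting the zero point from every source element before multiplying.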

Per-Channel Scaling#

Primitives may have limited support for multiple scales per quantized tensor. The most popular use case is the Convolution and Deconvolution primitives, which support per-output-channel scaling factors for the weights, meaning that the actual convolution computations need to scale different output channels differently.

• $$\src_{f32}(n, ic, ih, iw) = scale_{\src} \cdot \src_{int8}(n, ic, ih, iw)$$

• $$\weights_{f32}(oc, ic, kh, kw) = scale_{\weights}(oc) \cdot \weights_{int8}(oc, ic, kh, kw)$$

• $$\dst_{f32}(n, oc, oh, ow) = scale_{\dst} \cdot \dst_{int8}(n, oc, oh, ow)$$

Note that now the weights’ scaling factor depends on $$oc$$.

To compute the $$\dst_{int8}$$ we need to perform the following:

$\dst_{int8}(n, oc, oh, ow) = \operatorname{f32\_to\_int8}( \frac{scale_{\src} \cdot scale_{\weights}(oc)}{scale_{\dst}} \cdot conv_{s32}(\src_{int8}, \weights_{int8})|_{(n, oc, oh, ow)} ).$
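As a minimal sketch of the formula above, the combined factor $$scale_{\src} \cdot scale_{\weights}(oc) / scale_{\dst}$$ is applied to each output channel's int32 accumulator. The function `requantize_per_oc` and the numeric values are hypothetical, and only one accumulator per output channel is shown for brevity.

```python
def requantize_per_oc(acc_s32, scale_src, scale_wei, scale_dst):
    """acc_s32[oc] is an int32 accumulator for output channel oc."""
    out = []
    for oc, acc in enumerate(acc_s32):
        combined = scale_src * scale_wei[oc] / scale_dst  # depends on oc
        out.append(max(-128, min(127, round(combined * acc))))
    return out


# Same accumulator value in both channels, but different weight scales.
dst = requantize_per_oc([100, 100], scale_src=0.5,
                        scale_wei=[0.5, 0.25], scale_dst=0.25)
```

Identical accumulator values yield different int8 outputs because each output channel carries its own weights scale.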

The user is responsible for preparing quantized weights accordingly. To do that, oneDNN provides reorders that can perform per-channel scaling:

$\weights_{int8}(oc, ic, kh, kw) = \operatorname{f32\_to\_int8}( \weights_{f32}(oc, ic, kh, kw) / scale_{\weights}(oc) ).$
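The per-channel weights quantization above can be sketched in plain Python. This is not the oneDNN reorder itself: the function `quantize_weights_per_oc` is hypothetical, and choosing each scale as the channel's maximum absolute value divided by 127 is a common symmetric scheme assumed here for illustration.

```python
def quantize_weights_per_oc(weights_f32):
    """Quantize weights with one scale per output channel.

    weights_f32[oc] is the flattened (ic, kh, kw) slice for channel oc.
    """
    scales, weights_i8 = [], []
    for row in weights_f32:
        max_abs = max(abs(v) for v in row)
        # Assumed scheme: map the largest magnitude in the channel to 127.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        weights_i8.append(
            [max(-128, min(127, round(v / scale))) for v in row]
        )
    return scales, weights_i8


w_f32 = [
    [127.0, -64.0, 1.0],   # channel 0: large dynamic range
    [0.5, -0.5, 0.0],      # channel 1: small dynamic range
]
scales, w_i8 = quantize_weights_per_oc(w_f32)
```

With a single per-tensor scale, the second channel would collapse to values near zero; per-channel scales let both channels use the full int8 range.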

The Quantization Model section describes what kind of quantization model oneDNN supports.

Support#

oneDNN supports int8 computations for inference by allowing the user to specify that primitive input and output memory objects use int8 data types.