HPTQ Changes, Unload Support for Multi Output Layers
oguzhanbsolak committed Aug 29, 2024
1 parent 14790f8 commit d46edef
Showing 14 changed files with 232 additions and 73 deletions.
21 changes: 15 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
# ADI MAX78000/MAX78002 Model Training and Synthesis

-July 22, 2024
+August 27, 2024

**Note: This branch requires PyTorch 2. Please see the archive-1.8 branch for PyTorch 1.8 support. [KNOWN_ISSUES](KNOWN_ISSUES.txt) contains a list of known issues.**

@@ -1620,13 +1620,15 @@ When using the `-8` command line switch, all module outputs are quantized to 8-b
The last layer can optionally use 32-bit output for increased precision. This is simulated by adding the parameter `wide=True` to the module function call.
-##### Weights: Quantization-Aware Training (QAT)
+##### Weights and Activations: Quantization-Aware Training (QAT)
Quantization-aware training (QAT) is enabled by default. QAT is controlled by a policy file, specified by `--qat-policy`.
-* After `start_epoch` epochs, training will learn an additional parameter that corresponds to a shift of the final sum of products.
+* After `start_epoch` epochs, an intermediate epoch with no backpropagation is run to collect activation statistics. Each layer's activation range is then determined based on the range/resolution trade-off from the collected activations. QAT then begins, and an additional parameter (`output_shift`) is learned to shift activations, compensating for the scaling down of weights and biases.
* `weight_bits` describes the number of bits available for weights.
* `overrides` allows specifying the `weight_bits` on a per-layer basis.
+* `outlier_removal_z_score` defines the z-score threshold for outlier removal during activation range calculation. (default: 8.0)
+* `shift_quantile` defines the quantile of the parameter distribution to be used for the `output_shift` parameter. (default: 1.0)

By default, weights are quantized to 8-bits after 30 epochs as specified in `policies/qat_policy.yaml`. A more refined example that specifies weight sizes for individual layers can be seen in `policies/qat_policy_cifar100.yaml`.
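For orientation, a minimal policy sketch using the keys described above; the layer name `fc` and all values are illustrative, not copied from the shipped policies:

```yaml
start_epoch: 30               # epochs of floating-point training before QAT
weight_bits: 8                # default weight width
outlier_removal_z_score: 8.0  # z-score cutoff during activation range calculation
shift_quantile: 0.985         # quantile of the parameter distribution for output_shift
overrides:                    # per-layer weight_bits (structure assumed)
  fc:
    weight_bits: 4
```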

@@ -1745,7 +1747,7 @@ For both approaches, the `quantize.py` software quantizes an existing PyTorch ch
#### Quantization-Aware Training (QAT)
-Quantization-aware training is the better performing approach. It is enabled by default. QAT learns additional parameters during training that help with quantization (see [Weights: Quantization-Aware Training (QAT)](#weights-quantization-aware-training-qat). No additional arguments (other than input, output, and device) are needed for `quantize.py`.
+Quantization-aware training is the better performing approach. It is enabled by default. QAT learns additional parameters during training that help with quantization (see [Weights and Activations: Quantization-Aware Training (QAT)](#weights-and-activations-quantization-aware-training-qat)). No additional arguments (other than input, output, and device) are needed for `quantize.py`.
The input checkpoint to `quantize.py` is either `qat_best.pth.tar`, the best QAT epoch’s checkpoint, or `qat_checkpoint.pth.tar`, the final QAT epoch’s checkpoint.
@@ -2004,7 +2006,7 @@ The behavior of a training session might change when Quantization Aware Training
While there can be multiple reasons for this, check two important settings that can influence the training behavior:
* The initial learning rate may be set too high. Reduce LR by a factor of 10 or 100 by specifying a smaller initial `--lr` on the command line, and possibly by reducing the epoch `milestones` for further reduction of the learning rate in the scheduler file specified by `--compress`. Note that the selected optimizer and the batch size both affect the learning rate.
-* The epoch when QAT is engaged may be set too low. Increase `start_epoch` in the QAT scheduler file specified by `--qat-policy`, and increase the total number of training epochs by increasing the value specified by the `--epochs` command line argument and by editing the `ending_epoch` in the scheduler file specified by `--compress`. *See also the rule of thumb discussed in the section [Weights: Quantization-Aware Training (QAT)](#weights:-auantization-aware-training \(qat\)).*
+* The epoch when QAT is engaged may be set too low. Increase `start_epoch` in the QAT scheduler file specified by `--qat-policy`, and increase the total number of training epochs by increasing the value specified by the `--epochs` command line argument and by editing the `ending_epoch` in the scheduler file specified by `--compress`. *See also the rule of thumb discussed in the section [Weights and Activations: Quantization-Aware Training (QAT)](#weights-and-activations-quantization-aware-training-qat).*
@@ -2209,6 +2211,7 @@ The following table describes the most important command line arguments for `ai8
| `--no-unload` | Do not create the `cnn_unload()` function | |
| `--no-kat` | Do not generate the `check_output()` function (disable known-answer test) | |
| `--no-deduplicate-weights` | Do not deduplicate weights and bias values | |
+| `--scale-output` | Use scales from the checkpoint to recover output range while generating `cnn_unload()` function | |
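As an illustration, the new switch slots into an existing demo command line as follows; this exact combination is hypothetical, with `$TARGET` and `$COMMON_ARGS` as defined in the gen-demos scripts:

```sh
python ai8xize.py --test-dir $TARGET --prefix cifar-100-residual \
  --checkpoint-file trained/ai85-cifar100-residual-qat8-q.pth.tar \
  --config-file networks/cifar100-ressimplenet.yaml \
  --softmax --scale-output $COMMON_ARGS "$@"
```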
### YAML Network Description
@@ -2330,6 +2333,12 @@ The following keywords are required for each `unload` list item:
`width`: Data width (optional, defaults to 8) — either 8 or 32
`write_gap`: Gap between data words (optional, defaults to 0)
+When `--scale-output` is specified, scales from the checkpoint file are used to recover the output range. If an 8-bit output has a non-zero scale, the output is scaled and kept in 16 bits; if the scale is zero, the output remains 8 bits. 32-bit outputs are always kept in 32 bits.
+Example:
+![Unload Array](docs/unload_example.png)
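The width selection reduces to a simple rule; a minimal sketch (hypothetical helper, not generator code):

```python
def unload_width(output_width: int, final_scale: int) -> int:
    """Storage width used by cnn_unload() when --scale-output is given.

    An 8-bit output with a non-zero checkpoint scale is scaled and kept
    in 16 bits; with a zero scale it stays 8 bits. 32-bit outputs are
    always kept in 32 bits.
    """
    if output_width == 32:
        return 32
    return 16 if final_scale != 0 else 8
```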
##### `layers` (Mandatory)
`layers` is a list that defines the per-layer description, as shown below:
@@ -2654,7 +2663,7 @@ Example:
By default, the final layer is used as the output layer. Output layers are checked using the known-answer test, and they are copied from hardware memory when `cnn_unload()` is called. The tool also checks that output layer data isn’t overwritten by any later layers.
When specifying `output: true`, any layer (or a combination of layers) can be used as an output layer.
-*Note:* When `unload:` is used, output layers are not used for generating `cnn_unload()`.
+*Note:* When `--no-unload` is used, output layers are not used for generating `cnn_unload()`.
Example:
`output: true`
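A sketch of a two-output network excerpt; layer names and addresses are hypothetical and most fields are elided:

```yaml
layers:
  # ... earlier layers ...
  - name: detection_head        # hypothetical output layer 1
    operation: conv2d
    out_offset: 0x4000
    processors: 0xffffffffffffffff
    output: true                # checked by the KAT and copied by cnn_unload()
  - name: classification_head   # hypothetical output layer 2
    operation: conv2d
    out_offset: 0x6000
    processors: 0xffffffffffffffff
    output: true
```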
Binary file added docs/unload_example.png
3 changes: 2 additions & 1 deletion gen-demos-max78000.sh
@@ -12,7 +12,8 @@ python ai8xize.py --test-dir $TARGET --prefix cifar-100-simplewide2x-mixed --che
python ai8xize.py --test-dir $TARGET --prefix cifar-100-residual --checkpoint-file trained/ai85-cifar100-residual-qat8-q.pth.tar --config-file networks/cifar100-ressimplenet.yaml --softmax $COMMON_ARGS --boost 2.5 "$@"
python ai8xize.py --test-dir $TARGET --prefix kws20_v3 --checkpoint-file trained/ai85-kws20_v3-qat8-q.pth.tar --config-file networks/kws20-v3-hwc.yaml --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix kws20_nas --checkpoint-file trained/ai85-kws20_nas-qat8-q.pth.tar --config-file networks/kws20-nas-hwc.yaml --softmax $COMMON_ARGS "$@"
-python ai8xize.py --test-dir $TARGET --prefix faceid --checkpoint-file trained/ai85-faceid-qat8-q.pth.tar --config-file networks/faceid.yaml --fifo $COMMON_ARGS "$@"
+python izer/add_fake_passthrough.py --input-checkpoint-path trained/ai85-faceid_112-qat-q.pth.tar --output-checkpoint-path trained/ai85-fakepass-faceid_112-qat-q.pth.tar --layer-name fakepass --layer-depth 128 --layer-name-after-pt linear --low-memory-footprint "$@"
+python ai8xize.py --test-dir $TARGET --prefix faceid_112 --checkpoint-file trained/ai85-fakepass-faceid_112-qat-q.pth.tar --config-file networks/ai85-faceid_112.yaml --fifo $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix cats-dogs --checkpoint-file trained/ai85-catsdogs-qat8-q.pth.tar --config-file networks/cats-dogs-hwc.yaml --fifo --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix camvid_unet --checkpoint-file trained/ai85-camvid-unet-large-fakept-q.pth.tar --config-file networks/camvid-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 --new-kernel-loader "$@"
python ai8xize.py --test-dir $TARGET --prefix aisegment_unet --checkpoint-file trained/ai85-aisegment-unet-large-fakept-q.pth.tar --config-file networks/aisegment-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 --new-kernel-loader "$@"
4 changes: 2 additions & 2 deletions gen-demos-max78002.sh
@@ -12,7 +12,7 @@ python ai8xize.py --test-dir $TARGET --prefix cifar-100-simplewide2x-mixed --che
python ai8xize.py --test-dir $TARGET --prefix cifar-100-residual --checkpoint-file trained/ai85-cifar100-residual-qat8-q.pth.tar --config-file networks/cifar100-ressimplenet.yaml --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix kws20_v3_1 --checkpoint-file trained/ai87-kws20_v3-qat8-q.pth.tar --config-file networks/ai87-kws20-v3-hwc.yaml --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix kws20_v2_1 --checkpoint-file trained/ai87-kws20_v2-qat8-q.pth.tar --config-file networks/ai87-kws20-v2-hwc.yaml --softmax $COMMON_ARGS "$@"
-python ai8xize.py --test-dir $TARGET --prefix faceid --checkpoint-file trained/ai85-faceid-qat8-q.pth.tar --config-file networks/faceid.yaml --fifo $COMMON_ARGS "$@"
+python ai8xize.py --test-dir $TARGET --prefix mobilefacenet-112 --checkpoint-file trained/ai87-mobilefacenet-112-qat-q.pth.tar --config-file networks/ai87-mobilefacenet-112.yaml --fifo $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix cats-dogs --checkpoint-file trained/ai85-catsdogs-qat8-q.pth.tar --config-file networks/cats-dogs-hwc-no-fifo.yaml --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix camvid_unet --checkpoint-file trained/ai85-camvid-unet-large-fakept-q.pth.tar --config-file networks/camvid-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 "$@"
python ai8xize.py --test-dir $TARGET --prefix aisegment_unet --checkpoint-file trained/ai85-aisegment-unet-large-fakept-q.pth.tar --config-file networks/aisegment-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 "$@"
@@ -21,5 +21,5 @@ python ai8xize.py --test-dir $TARGET --prefix cifar-100-effnet2 --checkpoint-fil
python ai8xize.py --test-dir $TARGET --prefix cifar-100-mobilenet-v2-0.75 --checkpoint-file trained/ai87-cifar100-mobilenet-v2-0.75-qat8-q.pth.tar --config-file networks/ai87-cifar100-mobilenet-v2-0.75.yaml --softmax $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix imagenet --checkpoint-file trained/ai87-imagenet-effnet2-q.pth.tar --config-file networks/ai87-imagenet-effnet2.yaml $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix facedet_tinierssd --checkpoint-file trained/ai87-facedet-tinierssd-qat8-q.pth.tar --config-file networks/ai87-facedet-tinierssd.yaml --sample-input tests/sample_vggface2_facedetection.npy $COMMON_ARGS "$@"
-python ai8xize.py --test-dir $TARGET --prefix pascalvoc_fpndetector --checkpoint-file trained/ai87-pascalvoc-fpndetector-qat8-q.pth.tar --config-file networks/ai87-pascalvoc-fpndetector.yaml --fifo --sample-input tests/sample_pascalvoc_256_320.npy --overwrite --no-unload $COMMON_ARGS "$@"
+python ai8xize.py --test-dir $TARGET --prefix pascalvoc_fpndetector --checkpoint-file trained/ai87-pascalvoc-fpndetector-qat8-q.pth.tar --config-file networks/ai87-pascalvoc-fpndetector.yaml --fifo --sample-input tests/sample_pascalvoc_256_320.npy --no-unload $COMMON_ARGS "$@"
python ai8xize.py --test-dir $TARGET --prefix kinetics --checkpoint-file trained/ai85-kinetics-qat8-q.pth.tar --config-file networks/ai85-kinetics-actiontcn.yaml --overlap-data --softmax --zero-sram $COMMON_ARGS "$@"
23 changes: 19 additions & 4 deletions izer/backend/max7800x.py
@@ -1,5 +1,5 @@
###################################################################################################
-# Copyright (C) 2019-2023 Maxim Integrated Products, Inc. All Rights Reserved.
+# Copyright (C) 2019-2024 Maxim Integrated Products, Inc. All Rights Reserved.
#
# Maxim Integrated Products, Inc. Default Copyright Notice:
# https://www.maximintegrated.com/en/aboutus/legal/copyrights.html
@@ -69,6 +69,7 @@ def create_net(self) -> str: # pylint: disable=too-many-locals,too-many-branche
fast_fifo_quad = state.fast_fifo_quad
fifo = state.fifo
final_layer = state.final_layer
+final_scale = state.final_scale
first_layer_used = state.first_layer_used
flatten = state.flatten
forever = state.forever
@@ -136,6 +137,7 @@ def create_net(self) -> str: # pylint: disable=too-many-locals,too-many-branche
riscv = state.riscv
riscv_cache = state.riscv_cache
riscv_flash = state.riscv_flash
scale_output = state.scale_output
simple1b = state.simple1b
simulated_sequence = state.simulated_sequence
snoop = state.snoop
@@ -1152,7 +1154,8 @@ def create_net(self) -> str: # pylint: disable=too-many-locals,too-many-branche
conv_str = ', no convolution, '
apb.output(conv_str +
           f'{output_chan[ll]}x{output_dim_str[ll]} output\n', embedded_code)
-
+apb.output('\n', embedded_code)
+apb.output(f'// Final Scales: {final_scale}\n', embedded_code)
apb.output('\n', embedded_code)

apb.header()
@@ -3553,8 +3556,20 @@ def run_eltwise(
elif block_mode:
assets.copy('assets', 'blocklevel-ai' + str(device), base_directory, test_name)
elif embedded_code:
-output_count = output_chan[terminating_layer] \
-    * output_dim[terminating_layer][0] * output_dim[terminating_layer][1]
+output_count = 0
+for i in range(terminating_layer + 1):
+    if output_layer[i]:
+        if output_width[i] != 32:
+            if scale_output:
+                output_count += (output_chan[i] * output_dim[i][0] * output_dim[i][1]
+                                 + (32 // (2 * output_width[i]) - 1)) \
+                                // (32 // (2 * output_width[i]))
+            else:
+                output_count += (output_chan[i] * output_dim[i][0] * output_dim[i][1]
+                                 + (32 // output_width[i] - 1)) \
+                                // (32 // output_width[i])
+        else:
+            output_count += output_chan[i] * output_dim[i][0] * output_dim[i][1]
insert = summary_stats + \
'\n/* Number of outputs for this network */\n' \
f'#define CNN_NUM_OUTPUTS {output_count}'
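The replacement sums per-layer word counts instead of counting only the terminating layer: 8-bit outputs pack four values per 32-bit word (two when `--scale-output` widens them to 16 bits), while 32-bit outputs take one word per value. A standalone sketch of the arithmetic with made-up shapes:

```python
# Hypothetical output layers: (width_bits, channels, height, width)
outputs = [
    (8, 100, 1, 1),  # e.g. a classifier head
    (32, 4, 8, 8),   # e.g. a wide 32-bit head
]
scale_output = True  # --scale-output stores scaled 8-bit values as 16-bit

output_count = 0
for width, chan, h, w in outputs:
    values = chan * h * w
    if width != 32:
        stored_bits = 2 * width if scale_output else width
        per_word = 32 // stored_bits
        output_count += (values + per_word - 1) // per_word  # ceiling division
    else:
        output_count += values  # one 32-bit word per value

print(output_count)  # 50 + 256 = 306 words
```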
11 changes: 9 additions & 2 deletions izer/checkpoint.py
@@ -1,5 +1,5 @@
###################################################################################################
-# Copyright (C) 2019-2023 Maxim Integrated Products, Inc. All Rights Reserved.
+# Copyright (C) 2019-2024 Maxim Integrated Products, Inc. All Rights Reserved.
#
# Maxim Integrated Products, Inc. Default Copyright Notice:
# https://www.maximintegrated.com/en/aboutus/legal/copyrights.html
@@ -56,6 +56,7 @@ def load(
bias_min = []
bias_max = []
bias_size = []
+final_scale = {}

checkpoint = torch.load(checkpoint_file, map_location='cpu')
print(f'Reading {checkpoint_file} to configure network weights...')
@@ -251,6 +252,12 @@ def load(
# Add implicit shift based on quantization
output_shift[seq] += 8 - abs(quantization[seq])

+final_scale_name = '.'.join([layer, 'final_scale'])
+if final_scale_name in checkpoint_state:
+    w = checkpoint_state[final_scale_name].numpy().astype(np.int64)
+    final_scale[seq] = w.item()
+else:
+    final_scale[seq] = 0
layers += 1
seq += 1
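For context, a sketch of the state-dict entry this loop consumes; the layer names and the stored value are hypothetical:

```python
import torch

# An HPTQ-aware QAT checkpoint may carry one scalar scale per layer:
checkpoint_state = {'conv1.final_scale': torch.tensor(3)}

final_scale = {}
for seq, layer in enumerate(['conv1', 'conv2']):
    name = '.'.join([layer, 'final_scale'])
    if name in checkpoint_state:
        final_scale[seq] = int(checkpoint_state[name])
    else:
        final_scale[seq] = 0  # layers without a stored scale default to 0

print(final_scale)  # {0: 3, 1: 0}
```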

@@ -286,4 +293,4 @@ def load(
sys.exit(1)

return layers, weights, bias, output_shift, \
-    input_channels, output_channels
+    input_channels, output_channels, final_scale
3 changes: 3 additions & 0 deletions izer/commandline.py
@@ -464,6 +464,8 @@ def get_parser() -> argparse.Namespace:
help='GitHub repository name for update checking')
group.add_argument('--yamllint', metavar='S', default='yamllint',
help='name of linter for YAML files (default: yamllint)')
+group.add_argument('--scale-output', action='store_true', default=False,
+                   help="scale output with final layer scale factor (default: false)")

args = parser.parse_args()

@@ -691,6 +693,7 @@ def set_state(args: argparse.Namespace) -> None:
state.rtl_preload_weights = args.rtl_preload_weights
state.runtest_filename = args.runtest_filename
state.sample_filename = args.sample_filename
+state.scale_output = args.scale_output
state.simple1b = args.simple1b
state.sleep = args.deepsleep
state.slow_load = args.slow_load
8 changes: 6 additions & 2 deletions izer/izer.py
@@ -1,5 +1,5 @@
###################################################################################################
-# Copyright (C) 2019-2023 Maxim Integrated Products, Inc. All Rights Reserved.
+# Copyright (C) 2019-2024 Maxim Integrated Products, Inc. All Rights Reserved.
#
# Maxim Integrated Products, Inc. Default Copyright Notice:
# https://www.maximintegrated.com/en/aboutus/legal/copyrights.html
@@ -74,6 +74,7 @@ def main():

# If not using test data, load weights and biases
# This also configures the network's output channels
+final_scale = None
if cfg['arch'] != 'test':
if not args.checkpoint_file:
eprint('--checkpoint-file is a required argument.')
@@ -96,7 +97,7 @@ def main():
else:
# PyTorch checkpoint file selected
layers, weights, bias, output_shift, \
-    input_channels, output_channels = \
+    input_channels, output_channels, final_scale = \
checkpoint.load(
args.checkpoint_file,
cfg['arch'],
@@ -134,6 +135,8 @@ def main():
params['bypass'],
filename=args.bias_input,
)
+if final_scale is None:
+    final_scale = {ll: 0 for ll in range(cfg_layers)}
if cfg_layers > layers:
# Add empty weights/biases and channel counts for layers not in checkpoint file.
# The checkpoint file does not contain weights for non-convolution operations.
@@ -630,6 +633,7 @@ def main():
state.eltwise = eltwise
state.final_layer = final_layer
state.first_layer_used = min_layer
+state.final_scale = final_scale
state.flatten = flatten
state.in_offset = input_offset
state.in_sequences = in_sequences
7 changes: 6 additions & 1 deletion izer/quantize.py
@@ -1,5 +1,5 @@
###################################################################################################
-# Copyright (C) 2019-2023 Maxim Integrated Products, Inc. All Rights Reserved.
+# Copyright (C) 2019-2024 Maxim Integrated Products, Inc. All Rights Reserved.
#
# Maxim Integrated Products, Inc. Default Copyright Notice:
# https://www.maximintegrated.com/en/aboutus/legal/copyrights.html
@@ -241,6 +241,11 @@ def get_max_bit_shift(t, clamp_bits, shift_quantile, return_bit_shift=False):
out_shift_name = '.'.join([layer, 'output_shift'])
out_shift = torch.Tensor([-1 * get_max_bit_shift(params_r, clamp_bits,
shift_quantile, True)])
+threshold_name = '.'.join([layer, 'threshold'])
+if threshold_name in checkpoint_state:
+    threshold = checkpoint_state[threshold_name]
+    out_shift = (out_shift - threshold).clamp(min=-7.-clamp_bits,
+                                              max=23.-clamp_bits)
new_checkpoint_state[out_shift_name] = out_shift
if new_masks_dict is not None:
new_masks_dict[out_shift_name] = out_shift
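With HPTQ, a layer's learned activation `threshold` is folded into the stored shift and clamped to the representable range; a minimal numeric sketch with made-up values and `clamp_bits = 8` assumed:

```python
import torch

clamp_bits = 8                    # assumed weight clamp width
out_shift = torch.tensor([-2.0])  # hypothetical value from get_max_bit_shift()
threshold = torch.tensor([3.0])   # hypothetical learned activation threshold

# Subtract the threshold, then clamp to [-7 - clamp_bits, 23 - clamp_bits]
out_shift = (out_shift - threshold).clamp(min=-7. - clamp_bits,
                                          max=23. - clamp_bits)
print(out_shift)  # tensor([-5.]), inside [-15.0, 15.0]
```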