analogdevicesinc / ai8x-synthesis
Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
License: Apache License 2.0
https://github.com/MaximIntegratedAI/ai8x-synthesis/blob/develop/izer/backend/max7800x.py#L1673
instance = ffs(output_processor_map[ll] >> group * tc.dev.P_SHARED) \
& ~(tc.dev.P_SHARED-1)
val |= (instance + group * tc.dev.P_SHARED) * tc.dev.INSTANCE_SIZE
According to the code, the actual value written to the register is 1/4 * (out_offset (from YAML) + 0x8000 * smallest_out_processor_group_index).
(Below is a copy of documentation from the same file.)
Configure SRAM write pointer -- write ptr is global
(unless depth-wise w/o broadcast is used).
Get offset to first available instance of the first used
processor of the next layer.
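The relationship stated above can be written out as a short calculation. This is only a sketch of the formula as described in this post, not the code in max7800x.py: the function name `write_pointer` and the example values are hypothetical, and the 0x8000 stride per processor group is taken from the sentence above.

```python
def write_pointer(out_offset: int, first_group: int, group_stride: int = 0x8000) -> int:
    """Register value per the formula quoted in the post (an assumption, not izer code)."""
    return (out_offset + group_stride * first_group) // 4


# Hypothetical example: out_offset 0x2000 from the YAML, lowest output processor group is 1
example = write_pointer(0x2000, 1)
print(hex(example))
```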
As I understand it, CNN.c is basically a driver for the accelerator, and it has the necessary functions to load the data, etc.
What I don't understand is why those functions are generated after synthesis. Are they network/data independent, and why is that?
I trained the newly added tinier-ssd model with the included training script(scripts/train_svhn_tinierssd.sh)
During synthesis, I'm able to quantize the weights by this command:
python quantize.py ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best.pth.tar ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar --device MAX78000 -v -c networks/svhn-tinierssd.yaml
But ai8xize script is failing due to below error:
ai8xize command:
python ai8xize.py --test-dir generated_svhn_tinierssd --prefix tinierssd_svhn --checkpoint-file ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar --config-file networks/svhn-tinierssd.yaml --device MAX78000 --compact-data --timer 0 --display-checkpoint --verbose --overwrite --mexpress
log:
Reading ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar to configure network weights...
e2.op.bias
32 64 (2048, 3, 3) 8 -2 -128 127 18432 base.fire3.op.weight (64,) 8 -20 127 64 base.fir
e3.op.bias
64 64 (4096, 3, 3) 8 -2 -128 127 36864 base.fire4.op.weight (64,) 8 -81 104 64 base.fir
e4.op.bias
64 64 (4096, 3, 3) 8 -2 -128 127 36864 base.fire5.op.weight (64,) 8 -116 115 64 base.fir
e5.op.bias
64 64 (4096, 3, 3) 8 -2 -128 127 36864 base.fire6.op.weight (64,) 8 -128 108 64 base.fire6.op.bias
64 128 (8192, 3, 3) 8 -2 -128 127 73728 base.fire7.op.weight (128,) 8 -128 127 128 base.fire7.op.bias
128 32 (4096, 3, 3) 8 -2 -128 127 36864 base.fire8.op.weight (32,) 8 -110 127 32 base.fire8.op.bias
32 32 (1024, 3, 3) 8 -1 -128 127 9216 base.fire9.op.weight (32,) 8 -38 113 32 base.fire9.op.bias
32 32 (1024, 3, 3) 8 -1 -109 127 9216 base.fire10.op.weight (32,) 8 -2 61 32 base.fire10.op.bias
32 16 (512, 3, 3) 8 -2 -128 127 4608 aux_convs.conv12_1.op.weight (16,) 8 -41 40 16 aux_convs.conv12_1.op.bias
16 16 (256, 3, 3) 8 -1 -123 127 2304 aux_convs.conv12_2.op.weight (16,) 8 -3 88 16 aux_convs.conv12_2.op.bias
32 16 (512, 3, 3) 8 -4 -64 127 4608 pred_convs.loc_fire8.op.weight (16,) 8 -44 127 16 pred_convs.loc_fire8.op.bias
32 16 (512, 3, 3) 8 -1 -110 88 4608 pred_convs.loc_fire9.op.weight (16,) 8 -38 54 16 pred_convs.loc_fire9.op.bias
32 16 (512, 3, 3) 8 -1 -92 107 4608 pred_convs.loc_fire10.op.weight (16,) 8 -55 52 16 pred_convs.loc_fire10.op.bias
16 16 (256, 3, 3) 8 -1 -108 103 2304 pred_convs.loc_conv12_2.op.weight (16,) 8 -35 34 16 pred_convs.loc_conv12_2.op.bias
32 44 (1408, 3, 3) 8 -2 -87 97 12672 pred_convs.cl_fire8.op.weight (44,) 8 -124 127 44 pred_convs.cl_fire8.op.bias
32 44 (1408, 3, 3) 8 -1 -126 127 12672 pred_convs.cl_fire9.op.weight (44,) 8 -93 125 44 pred_convs.cl_fire9.op.bias
32 44 (1408, 3, 3) 8 -1 -124 127 12672 pred_convs.cl_fire10.op.weight (44,) 8 -75 88 44 pred_convs.cl_fire10.op.bias
16 44 (704, 3, 3) 8 -1 -128 127 6336 pred_convs.cl_conv12_2.op.weight (44,) 8 -59 50 44 pred_convs.cl_conv12_2.op.bias
TOTAL: 20 parameter layers, 336,336 parameters, 336,336 bytes
Configuring data set: svhn_74.
tinierssd_svhn...
NOTICE: --overwrite specified, writing to generated_svhn_tinierssd/tinierssd_svhn even though it exists.
Arranging weights... ________________________________________ 100%
Storing weights... ________________________________________ 100%
Creating network... ________________________________________ 10%**ERROR: Processor 0: Layer 2 output for CHW=0,27,25 is overwriting input at offset 0x00402000 that was created by layer 1, CHW=0,0,0.**
I see that the weights.h file is mostly empty.
Is there something I'm missing, or does the config YAML for tinierssd need to be corrected?
Creating network... ________________________________________ 10%
During synthesis I get this error: "Layer 8: 3 input channels (before flattening) using 1 pass, and 1 operand (3 processors per pass), but the enabled processor map 0x00000000ffffffff has 32 bits instead of the expected number of 3." I could not solve it. Can you share a YAML file for the ai85kws20netv2batchnorm model?
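The "32 bits instead of the expected number of 3" part of the message can be reproduced by counting the bits set in the processor map. The sketch below only illustrates that count; the helper name and the 3-bit example map are hypothetical, and it does not say which layer of the YAML needs to change.

```python
def enabled_processors(processor_map: int) -> int:
    """Count the bits set in a 64-bit processor map."""
    return bin(processor_map & 0xFFFF_FFFF_FFFF_FFFF).count("1")


reported_map = 0x00000000FFFFFFFF       # map quoted in the error message
three_bit_map = 0x0000000000000007      # hypothetical map with exactly 3 bits set

print(enabled_processors(reported_map))   # 32
print(enabled_processors(three_bit_map))  # 3
```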
Hi
I trained your ai85unetlarge on MS COCO with only 2 classes (Background, Person).
When trying to synthesize the trained model I ran into some problems:
First I had to modify aisegment-unet-large-fakept.yaml because the last layer ended up smaller.
64 64 (4096, 1, 1) 8 -1 -120 125 4096 conv.op.weight (64,) 8 -31 0 64 conv.op.bias
TOTAL: 19 parameter layers, 282,220 parameters, 282,220 bytes
Configuring data set: CamVid_s352_c3_reduced.
vs.
64 32 (2048, 1, 1) 8 2 -13 15 2048 conv.op.weight (32,) 8 -81 79 32 conv.op.bias
TOTAL: 19 parameter layers, 280,140 parameters, 280,140 bytes
Configuring data set: coco_s352.
That's why I changed the last layer to output_processors: 0x00000000ffffffff. Is that correct?
I'd like to use your UNet-Demo with my trained network. After some investigation I found out that you used a (I assume older) unet_v5 version and an AISegment_352_reduced dataset in order to fit the sample data into SRAM.
Is there a possibility to synthesize without all the KAT and sample data?
I just want to use UNet-demo with the camera and don't care about KAT for now. I tried --no-kat, --synthesize-input, --synthesize-words and --max-verify-length, but when inspecting izer/sampledata.py I saw that I cannot reach these steps after if shape[0] < 1 or shape[0] > 4: anyway, as my input shape is (48,88,88).
the build error is the following:
c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: C:/MaximSDK/Examples/MAX78000/CNN/UNet-demo/build/UNet-demo.elf section `.bss' will not fit in region `SRAM'
c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: region RAM overflowed with stack
c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: region `SRAM' overflowed by 148408 bytes
collect2.exe: error: ld returned 1 exit status
make: *** [/C/MaximSDK/Libraries/CMSIS/Device/Maxim/MAX78000/Source/GCC/gcc.mk:299: /c/MaximSDK/Examples/MAX78000/CNN/UNet-demo/build/UNet-demo.elf] Error 1
"make all -r -j 8" terminated with exit code 2. Build might be incomplete.
But it's pretty clear the sample data won't fit; the numpy pickle I generated with train.py is 2.9MB, which makes total sense given the (48x88x88x8bit) dimensions.
edit: nvm, I was slightly confused because the file size is 2.9MB, but when I load the pickle the array has the size 371712, which does indeed fit into data memory.
I commented out //#define USE_SAMPLEDATA // shows the sample data.
Hi,
I'm trying to make the board produce some sound. Could you point me to a way to do this, or help?
Sorry, this is my first time using the MAX78000 EVKit, and I don't fully understand how to set the value of out_offset in the YAML file. Do I need to calculate it based on something in particular? Thank you so much!
Best regards,
Jason
Hi, I am modifying ai87_fpndetector for my use case by changing the input shape, removing the first residual in the backbone, and other smaller mods in the FPN accordingly (see .py file attached). To create the yaml for my own model, I'm modifying the original ai87_fpndetector_pascalvoc for my own use case. But I am getting this error for my yaml:
ERROR: Layer 73 (loc_60_80_res0_preprocess): HWC (4 channels/word) 8-bit 60x80 output (size 19200)
with output offset 0x10ae0 and expansion 1x exceeds data memory instance size of 81920.
What does this mean in general, and how should I approach cases where memory is exceeded? How is this excess calculated?
Here is my complete yaml and py file
modFPN.zip
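One plausible reading of the numbers in the error message is that the output offset plus the output size (times the expansion factor) is compared against the data memory instance size. The sketch below only reproduces that arithmetic with the values from the message; the function name is hypothetical and the exact comparison used by the synthesis tool is an assumption.

```python
def instance_overflow(out_offset: int, out_size: int,
                      expansion: int = 1, instance_size: int = 81920) -> int:
    """Bytes by which offset + size * expansion exceeds the instance size (0 if it fits)."""
    end = out_offset + out_size * expansion
    return max(0, end - instance_size)


# Values quoted in the error message for Layer 73
excess = instance_overflow(out_offset=0x10AE0, out_size=19200)
print(0x10AE0 + 19200, excess)  # 87520 5600
```

Under this assumption, a lower output offset (or a smaller output) would be needed for the layer to fit.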
The command I use to generate the project is:
./ai8xize.py --verbose --log --test-dir demos --prefix ai85-faceid-qat8 --checkpoint-file trained/ai85-faceid-qat8.pth.tar --config-file networks/faceid.yaml --device MAX78000 --compact-data --mexpress --softmax
Hi, running the train-quantize-eval-synthesize pipeline for the Camvid example, I encountered an error on the synthesis stage:
Arranging weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Storing weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Creating network... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0%ERROR: Processor 0: Layer 0 output for CHW=0,0,64 is overwriting input at offset 0x00400700 that was created by the input loader.
Creating network... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0%
Network-related files (ai85net-unet.py, camvid-unet-large.yaml, camvid-unet-large-fakept.yaml) and the Camvid dataset are unchanged. These are the commands I use:
# train
python train.py --lr 0.001 --optimizer adam --epochs 5 --batch-size 4 --gpus 0 \
--deterministic --compress policies/schedule.yaml --qat-policy \
policies/qat_policy_camvid.yaml --model ai85unetlarge --dataset CamVid_s352_c3 \
--use-bias --wd 0 --out-fold-ratio 4 --truncate-test \
--device MAX78000 \
--out-dir ai85unetlarge_artifacts/train
# quantize
python quantize.py ai85unetlarge_artifacts/train/last/best.pth.tar ai85unetlarge_artifacts/train/last/best-q.pth.tar \
--device MAX78000 -v
(In the *-fakept.yaml case, I add:
python izer/add_fake_passthrough.py --input-checkpoint-path ai85unetlarge_artifacts/train/last/best-q.pth.tar --output-checkpoint-path ai85unetlarge_artifacts/train/last/best-q-pt.pth.tar --layer-name pt --layer-depth 56 --layer-name-after-pt upconv3
)
# eval
python train.py --model ai85unetlarge --dataset CamVid_s352_c3 --truncate-test --out-fold-ratio 4 --evaluate \
--save-sample 1 \
--exp-load-weights-from ai85unetlarge_artifacts/train/last/best-q-pt.pth.tar -8 \
--device MAX78000 \
--use-bias \
--batch-size 2 \
--out-dir ai85unetlarge_artifacts/eval
# synthesize
python ai8xize.py --test-dir synthed_net --prefix ai85unetlarge --checkpoint-file \
ai85unetlarge_artifacts/train/last/best-q.pth.tar --config-file networks/camvid-unet-large.yaml \
--sample-input ai85unetlarge_artifacts/eval/sample_CamVid_s352_c3.npy \
--device MAX78000 \
--compact-data --mexpress --timer 0 --display-checkpoint --verbose --overwrite --board-name FTHR_RevA
quantize.py contains code to quantize a PyTorch model, but I couldn't find an equivalent script to quantize a TensorFlow model trained as per the develop-tf branch of the ai8x-training repository. What is the script to quantize a TensorFlow model for the MAX78000 device?
Hi,
I have been using the chip MAX78000 for quite some time now, and I am really impressed by the product.
However, I cannot explain why I have an inference time of less than 1ms for a 5 million MACs network. I got the number of MACs from the file cnn.h created with the synthesis tool, and the 1ms has been measured with an oscilloscope on the LED2 (red led on the ev board), that should light up when the network is running.
The accelerator is running at 50 MHz, so it takes 50,000 cycles to complete the inference, and the result is correct and passes the check. But that would mean that the accelerator does 5e6/5e4 = 100 MAC/cycle, and based on the datasheet's "Nominal 1 output channel per clock", I think the maximum theoretical value should be 64 MAC/cycle, equal to the number of processors.
I am trying to understand this behaviour; do you think this is normal? Am I right to assume a maximum of 64 MAC/cycle? My take is that the accelerator is not really running my network...
neural_net.zip
Thank you for your time!
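The arithmetic in the question above can be checked with a few lines. This sketch only reproduces the poster's numbers (50 MHz clock, 1 ms, 5 million MACs); it makes no claim about the accelerator's actual peak throughput, and the variable names are illustrative.

```python
clock_hz = 50_000_000        # accelerator clock from the post
inference_time_s = 1e-3      # upper bound measured on LED2
total_macs = 5_000_000       # MAC count reported in cnn.h

cycles = round(clock_hz * inference_time_s)
macs_per_cycle = total_macs / cycles

print(cycles, macs_per_cycle)  # 50000 100.0
```

Since the measured time is "less than 1 ms", 100 MAC/cycle is a lower bound on the implied rate.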
Section "Generating a Random Sample Input" provides the following line as an example:
np.save(os.path.join('tests', 'sample_mnist'), a, allow_pickle=False, fix_imports=False)
where a is a random tensor initialized explicitly as dtype=np.int64 in the previous line. The input being of an integer type seems to be important for the synthesis tool because it fails at the "bitwise and" operator inside the parameter load function in izer/load.py at this line when the type is e.g., float64. The example already uses an input cast to int64, which gives a hint, but the error message when you don't use an int type is pretty cryptic:
ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
and it was a bit cumbersome to trace the izer/... files to understand that my input wasn't saved with the correct type (I just copied the line I pasted above to save my sample data for a new dataset, which therefore skipped the typecasting, and numpy defaulted to float64 even though I had 256 unique values in the -128...+127 range). So my suggestion:
maybe a very small note under Section "Generating a Random Sample Input" in the readme.md, reiterating that the data should be dtype=np.int64
or, even though it's redundant, adding an explicit typecast to the "saving line" in the example (which everybody will probably copy without reading the randgen line, like me):
np.save(os.path.join('tests', 'sample_mnist'), a.astype('int64'), allow_pickle=False, fix_imports=False)
the second suggestion might induce even worse "hidden" errors like casting float input data in small ranges to int64 without notice, flooring them all to 0, -1 etc. though (depending on rounding settings), so I personally vote for the first one
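A middle ground between the two suggestions is to validate before casting, so that float data with fractional values is rejected instead of being silently truncated. The sketch below is a hypothetical helper (`as_int64_sample` is not part of the repository), and the -128..127 range is taken from the description above; it also reproduces the failure of the bitwise AND on float64 data.

```python
import numpy as np


def as_int64_sample(data: np.ndarray) -> np.ndarray:
    """Return data as int64, refusing non-integral or out-of-range values."""
    arr = np.asarray(data)
    if not np.array_equal(arr, np.rint(arr)):
        raise ValueError("sample data contains non-integer values")
    if arr.min() < -128 or arr.max() > 127:
        raise ValueError("sample data is outside the -128..127 range")
    return arr.astype(np.int64)


rng = np.random.default_rng(0)
# Integral values stored with a float dtype, as in the situation described above
a = rng.integers(-128, 128, size=(1, 28, 28)).astype(np.float64)

try:
    a & 0xFF                      # bitwise AND is not defined for float64
    float_and_ok = True
except TypeError:
    float_and_ok = False

checked = as_int64_sample(a)      # safe: all values are integral and in range
masked = checked & 0xFF           # works once the dtype is an integer type
```

The checked array can then be passed to np.save exactly as in the example line from the readme.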
Hi, I am curious if this can support models like resnet, where a layer adds an output from the last layer and one from some layers before. I am not sure if the synthesis could cover this. Or models like FCN-8 and this (https://github.com/dvu4/CarND-Semantic-Segmentation).
I know the synthesis in its current form does not support this; however, is there a way to run two passes with fewer processors by setting the registers? (I am comfortable with the registers on the accelerator, so you can explain using them.)
I noticed that it's possible to train some simple early exit models via ai8x-training, thanks to code such as the following:
https://github.com/analogdevicesinc/ai8x-training/blob/develop/train.py#L1534
However, I'm unsure how to synthesize and run early exit on the device. Could you please share an example? I'd like to see which layer of the model a test inference exits from on-device.
I want to know what is the purpose of update_old_model_params in train.py?
elif args.load_model_path:
    # print('2222')
    update_old_model_params(args.load_model_path, model)
    if qat_policy is not None:
        checkpoint = torch.load(args.load_model_path,
                                map_location=lambda storage, loc: storage)
        if checkpoint.get('epoch', None) >= qat_policy['start_epoch']:
            ai8x.fuse_bn_layers(model)
    model = apputils.load_lean_checkpoint(model, args.load_model_path,
                                          model_device=args.device)
    ai8x.update_model(model)
This can lead to incorrect parameter loading when using multi-GPU training. This may require optimization.
There is a dark blue aggregator and three light blue aggregators.
Many thanks!
Hi,
I am getting this error when I try to implement AvgPool1d in synthesis.
"Pooling or zero-padding results in a zero data dimension (input [1800, 1], result [112, 0])."
My network configuration file:
# input: 64x1800x1,
- out_offset: 0x0000
in_dim: 1800
processors: 0xFFFF.FFFF.FFFF.FFFF
avg_pool: 16
pool_stride: 16
operation: none
name: l16_gap1
How do I specify it to do a 1d average pool?
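The "result [112, 0]" in the error message is what the usual pooling size formula gives when a 16-wide pool with stride 16 is applied to both dimensions of [1800, 1]. The sketch below only shows that arithmetic; it assumes the standard floor formula and does not claim to mirror the tool's code.

```python
def pooled_dim(size: int, pool: int, stride: int) -> int:
    """Standard output length of a pooling window without padding."""
    return (size - pool) // stride + 1


result = [pooled_dim(1800, 16, 16), pooled_dim(1, 16, 16)]
print(result)  # [112, 0]
```

This suggests the pooling is being applied in two dimensions for this layer, so the second dimension of size 1 collapses to 0.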
Hi, I have trouble understanding the logic of the processors setting in the YAML file. For example:
# HWC (little data) configuration for CIFAR-100
# Simple Model
arch: ai85ressimplenet
dataset: CIFAR100
layers:
# Layer 0
- out_offset: 0x2000
  processors: 0x7000000000000000
  operation: conv2d
  kernel_size: 3x3
  pad: 1
  activate: ReLU
  data_format: HWC
# Layer 1
- out_offset: 0x0000
  processors: 0x0ffff00000000000
  operation: conv2d
  kernel_size: 3x3
  pad: 1
  activate: ReLU
# Layer 2 - re-form data with gap
- out_offset: 0x2000
  processors: 0x00000000000fffff
  output_processors: 0x00000000000fffff
  operation: passthrough
  write_gap: 1
# Layer 3
- in_offset: 0x0000
  in_sequences: 1
  out_offset: 0x2004
  processors: 0x00000000000fffff
  operation: conv2d
  kernel_size: 3x3
  pad: 1
  activate: ReLU
  write_gap: 1
# Layer 4 - Residual-1
- in_sequences: [2, 3]
  in_offset: 0x2000
  out_offset: 0x0000
  processors: 0x00000000000fffff
  eltwise: add
  operation: conv2d
  kernel_size: 3x3
  pad: 1
  activate: ReLU
# Layer 5
- out_offset: 0x2000
  processors: 0xfffff00000000000
  output_processors: 0x000000fffff00000
  max_pool: 2
  pool_stride: 2
  pad: 1
  operation: conv2d
  kernel_size: 3x3
  activate: ReLU
Why doesn't the sample start enabling the processors from the first (rightmost) one? What is the logic behind this?
Hi, there. I am stuck again on the network part. The last layer of this simple model does not seem to work: Layer 3: "flatten" exceeds supported input dimensions (512 * 1 > 256)).
I assumed it might need 8-bit data until I found an example in this Maxim guide link. In example 4, layer 6, the input data is much larger than 8 bits after flattening. So I am wondering how to solve this.
Another question concerns the same layer. I tried setting flatten for Layer 3 to False, but it gives an error about the Layer 2 output and the Layer 3 input needing a different number of processors. So I'd like to know more about this once the first problem is solved. Thanks a lot.
layers:
# ai8x.FusedConv1dBNReLU(1, 8, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
- data_format: CHW
  op: Conv1d
  pad: 1
  activate: ReLU
  kernel_size: 3
  stride: 1
  processors: 0x0000000000000001
  out_offset: 0x0000
# ai8x.FusedConv1dBNReLU(8, 16, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
- op: Conv1d
  pad: 1
  activate: ReLU
  kernel_size: 3
  stride: 1
  processors: 0x0000000000000ff0
  out_offset: 0x2000
# ai8x.FusedConv1dBNReLU(16, 2, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
- op: Conv1d
  pad: 1
  activate: ReLU
  kernel_size: 3
  stride: 1
  processors: 0x000000000ffff000
  out_offset: 0x0000
# Flatten & ai8x.Linear(512 * 2, 512)
- op: Linear
  flatten: True
  activate: None
  processors: 0xffffffffffffffff
  out_offset: 0x2000
I got a few questions while going through the user guide of MAX78000.
How is the master quadrant selected, which collects the partial sum-of-products, calculates the sum of all products, and writes to the next layer?
(I have a feeling that it is the first quadrant (or the first processor), judging by the mnist-chw-ai85.yaml file, which avoids using the first quadrant throughout several layers.)
Also, is there any difference in performance if we choose to/not to run convolution on the master quadrants?
Can we parallelize computation for more processors when we have less channels? If I only have 32 channels to compute and have 60 output channels, can I parallelize the computation to the entire 64 processors?
Why doesn't the CIFAR-100 example use parallel mode for the first layer? For example, by using processors 0x111 instead of 0x007 and inputting the data to three memories. If it is because of the channel format (HWC), is changing the format to CHW more costly than the benefit of running this in three parallel groups?
Many thanks!
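To make the processor maps in these questions easier to read, a mask can be decoded into processor indices, quadrants, and shared memory groups. This is a minimal sketch: it assumes bit 0 is processor 0, four quadrants of 16 processors, and 4 processors per shared data memory instance (the last value is an assumption suggested by the P_SHARED grouping quoted earlier on this page). The helper name is hypothetical.

```python
def decode_processors(processor_map: int, per_quadrant: int = 16, per_memory: int = 4):
    """Return (processor indices, quadrants used, shared-memory groups used)."""
    procs = [p for p in range(64) if (processor_map >> p) & 1]
    quadrants = sorted({p // per_quadrant for p in procs})
    memories = sorted({p // per_memory for p in procs})
    return procs, quadrants, memories


for mask in (0x007, 0x111, 0x7000000000000000):
    procs, quadrants, memories = decode_processors(mask)
    print(f"{mask:#018x}: processors={procs} quadrants={quadrants} memories={memories}")
```

Under these assumptions, 0x007 places all three channels in one memory group, while 0x111 spreads them over three groups within the same quadrant.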
Hello
I try to play with a "cat-dogs" example. I managed to train, quantize, evaluate and generate C code. The code works perfectly with sample data (test is passed on board).
On the other hand, I compile the cat dogs demo that uses the same network but also uses the onboard camera, which makes it more fun. I had tested the demo with images of cats and dogs on my computer screen and it works just amazing.
The problem arises when I try to combine the two projects. That is, I copy cnn.h, cnn.c, weights.h, and logs from the previously generated cats-vs-dogs folder to the cats-dogs_demo folder. The board reacts to the button press and probably takes a picture, but the CNN always gives the same output.
My question is: what is the difference between cats-vs-dogs and cats-dogs_demo examples on the CNN level (is it used differently)?
Thank you in advance for your answer!
at izer/izer.py, line 188
# Work with 1D input data
if data.ndim < 3:
data = np.expand_dims(data, axis=2)
It seems that this would not work with 1D data, only with data that is at least 2D. Handling for such cases could be added.
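The behaviour described above can be reproduced directly with NumPy: with recent NumPy versions, expanding a 1-D array at axis 2 raises an error, while a 2-D array works. The `to_3d` helper is a hypothetical illustration of one way to handle both cases; whether the resulting shape is what the rest of the tool expects is not verified here.

```python
import numpy as np


def to_3d(data: np.ndarray) -> np.ndarray:
    """Hypothetical helper: append trailing axes until the array is 3-D."""
    while data.ndim < 3:
        data = np.expand_dims(data, axis=data.ndim)
    return data


one_d = np.zeros(16, dtype=np.int64)
two_d = np.zeros((2, 16), dtype=np.int64)

try:
    np.expand_dims(one_d, axis=2)   # axis 2 is out of bounds for a 2-D result
    one_d_ok = True
except ValueError:                  # NumPy's AxisError derives from ValueError
    one_d_ok = False

print(one_d_ok)                                 # False with recent NumPy
print(np.expand_dims(two_d, axis=2).shape)      # (2, 16, 1)
print(to_3d(one_d).shape)                       # (16, 1, 1)
```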
Hi, I am working on the FPN detector example. If I modify my FPN so that I use the 64x80 output and drop the 4x5 output, what is the best way to set up the classification and regression memory locations for this larger feature map set?
For example, the comments show a mapping of the memory:
# Class predictions : (32x40 + 16x20 + 8x10 + 4x5) * 6 * 21 = 10200 * 21
# 0x0000 - 0xD480 (1700 x 32: wide & multi-pass)
# 0x0000 - 0xA000: 32x40x121 (&wide)
# 0xA000 - 0xC800: 16x20x121 (&wide)
# 0xC800 - 0xD200: 8x10x121 (&wide)
# 0xD200 - 0xD480: 4x5x121 (&wide)
#
# Location predictions: (32x40 + 16x20 + 8x10 + 4x5) * 6 * 4 = 10200 * 4
# 0xD500 - 0xEF90 (1700 x 4)
#
# 0xD500 - 0xE900: 32x40x24
# 0xE900 - 0xEE00: 16x20x24
# 0xEE00 - 0xEF40: 8x10x24
# 0xEF40 - 0xEF90: 4x5x24
I am reworking this for my own scenario of dropping the 4x5 but using the 64x80 and where I have only 2 classes and same filter shapes like this:
# Class predictions : (64x80 + 32x40 + 16x20 + 8x10) * 6 * 2 =
# 0x0000 - 0xD480 (6820 x 16: wide & NOT multi-pass)
# 0x0000 - 0x2800: 64x80x12 (&wide)
# 0x2800 - 0x4800: 32x40x12 (&wide)
# 0x4800 - 0x6800: 16x20x12 (&wide)
# 0x6800 - 0xD200: 8x10x12 (&wide)
I set the out_offset of largest classification output to 0x0000. Then out_offset of 32x40 to 2800, and so on.
While this setup synthesizes, would it cause memory overwrites/corruption of the 4 outputs since the memory locations overlap?
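The address ranges in the original comments are consistent with each feature map occupying height x width x a fixed number of bytes per location: 32 for the class predictions ("1700 x 32") and 4 for the location predictions ("1700 x 4"). The sketch below reproduces those ranges and then applies the same rule to the modified set of feature maps with the "x 16" stride from the proposed comment. The rule is inferred from the numbers only, and the `layout` helper is hypothetical.

```python
def layout(base: int, maps, bytes_per_location: int):
    """Assign consecutive ranges; each map takes H * W * bytes_per_location."""
    ranges, start = [], base
    for h, w in maps:
        end = start + h * w * bytes_per_location
        ranges.append((f"{h}x{w}", start, end))
        start = end
    return ranges


original = [(32, 40), (16, 20), (8, 10), (4, 5)]
cls_ranges = layout(0x0000, original, 32)   # matches 0x0000 - 0xD480 in the comments
loc_ranges = layout(0xD500, original, 4)    # matches 0xD500 - 0xEF90 in the comments

modified = [(64, 80), (32, 40), (16, 20), (8, 10)]
mod_cls = layout(0x0000, modified, 16)      # assumption: 16 bytes per location

for name, start, end in mod_cls:
    print(f"{name}: {start:#07x} - {end:#07x}")
```

Under this assumption the 64x80 output alone would span 0x00000 to 0x14000, which differs from the 0x0000 to 0x2800 range in the proposed map, so the proposed offsets may be worth rechecking against the rule used in the original file.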
Hi, there.
I have finished the training part and am now quantizing. But it gives the error TypeError: sequence item 1: expected str instance, NoneType found.
Tracing back here.
Model keys (state_dict): conv1.weight, bn1.weight, bn1.bias, bn1.running_mean, bn1.running_var, bn1.num_batches_tracked, conv2.weight, bn2.weight, bn2.bias, bn2.running_mean, bn2.running_var, bn2.num_batches_tracked, conv3.weight, fc.weight, fc.bias
Traceback (most recent call last):
  File "quantize.py", line 29, in <module>
    main()
  File "/home/vapor/code/AIoT/ai8x-synthesis/izer/quantize.py", line 297, in main
    convert_checkpoint(args.input, args.output, args)
  File "/home/vapor/code/AIoT/ai8x-synthesis/izer/quantize.py", line 158, in convert_checkpoint
    bias_name = '.'.join([layer, operation, 'bias'])
TypeError: sequence item 1: expected str instance, NoneType found
Bias exists but cannot be joined. Thanks a lot if anyone could help with this!
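The exception itself is easy to reproduce: str.join fails as soon as one item in the list is None. The keys listed above (conv1.weight, fc.bias) have two components, while keys in other logs on this page (for example base.fire3.op.weight) have an extra operation component, so it is plausible, though not confirmed here, that `operation` ends up as None for this model. The `bias_key` helper is a hypothetical workaround sketch, not the repository's code.

```python
def bias_key(layer, operation):
    """Build a bias key, skipping a missing operation component."""
    parts = [layer, operation, "bias"]
    return ".".join(p for p in parts if p is not None)


try:
    ".".join(["conv1", None, "bias"])   # same call shape as in the traceback
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)
print(bias_key("conv1", None))          # conv1.bias
print(bias_key("base.fire3", "op"))     # base.fire3.op.bias
```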
Hello! This is my first time using the MAX78000. I saw the sample code for TinySSD, but I don't know how to calculate the memory allocation for each layer. I think it should be CNN.C that displays the data configuration of each layer.
This should correspond to the layer offset for each layer of the YAML file.
https://github.com/MaximIntegratedAI/ai8x-synthesis/blob/develop/networks/svhn-tinierssd.yaml
Below is my question
Thank you very much for your help.
Hi,
My ultimate goal is to deploy a custom CNN model, which is available as a PyTorch file. The YAML file asks for the network structure used in training and the dataset. Currently it appears to me that I need to redefine my network using the ai8x libraries and retrain my model using the Maxim tools so that I have the required files. Is this the way, or am I misreading the documentation? If I (hopefully) misread it, do you have another source/documentation that describes how to use a pre-trained model?
Thank you for the support in advance.
Using the gen-demos-max78000.sh script to synthesize a Camvid model on MAX78000:
python ai8xize.py --test-dir $TARGET --prefix camvid_unet --checkpoint-file trained/ai85-camvid-unet-large-fakept-q.pth.tar --config-file networks/camvid-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 --new-kernel-loader --overwrite "$@"
and SerialLoader.py from aisegment_unet-demo to test the inference pipeline, I arrive at these predictions:
displayed by:
ax[0].imshow(img_resize1)
ax[1].imshow(colors, cmap="Greys")
ax[2].imshow(img_resize1)
ax[2].imshow(colors, cmap="Greys", alpha=0.2)
While some masks look correct (the top part), I want to ask whether the striped patterns and the quality of prediction in general are expected from the trained model trained/ai85-camvid-unet-large-fakept-q.pth.tar.
I am running evaluation on the tinierssd weights saved in trained from the repo. However, I am getting the error, and everything pauses after:
Can someone direct me as to what I can do to fix it?
My full printout is this:
I am able to evaluate other models: cifar10, mnist. But this keeps happening for tinierssd with SVHN
Hello,
I would like to deploy a model on the MAX78000FTHR Board.
While the documentation is really complete (thank you!), it is a bit dense and I have trouble finding relevant information.
Specifically, once the project is built, I don't know how to test it on the device.
For benchmarking purposes, I would like to deploy a set of models on the board and collect a few metrics (speed, accuracy, latency, power consumption) for each model. How can I achieve this?
Would it be possible to write a special section, or maybe a wiki, about how to deploy a model from PyTorch to the device, and get the output of the model?
Thanks for your help!
When generating "deployment code" for networks that have layer weights with bitwidths different than 8, the corresponding bias values for those layers seem to get constrained to the same number of steps as the weights. For instance, a layer with 8-bit weights gets "8-bit" bias values (i.e., there are 256 different possible choices for each bias element from the dictionary {0x00, 0x01, ... 0xfe, 0xff}), but a layer with 2-bit weights gets only 4 different bias value possibilities, i.e., can only use the dictionary {0x00, 0x01, 0xfe, 0xff}, which effectively makes it 2-bit.
I've noticed this while generating code for the mixed-precision CIFAR-100 simplenet checkpoint (i.e., "ai85-cifar100-qat-mixed-q.pth.tar"). Steps to reproduce:
./ai8xize.py --verbose --log --test-dir sdk/Examples/MAX78000/CNN --prefix cifar-100-mixed --checkpoint-file trained/ai85-cifar100-qat-mixed-q.pth.tar --config-file networks/cifar100-simple.yaml --softmax --device MAX78000 --compact-data --mexpress --timer 0 --display-checkpoint --boost 2.5
I think this behavior is caused by the following: In quantize.py, bias values are left shifted by 7 so that "PyTorch can still use them to run a model.". The comment also states that "This needs to be reversed before loading the weights into the hardware."; and this is done -> When generating code, izer/izer.py calls the function load from izer/checkpoint.py, which right shifts the bias values by 7 according to the BIAS_DIV constant defined in izer/tornadocnn.py. However, this procedure causes the biases to be effectively rounded to the weight bit precision, and these values (which are, e.g., {-2,-1,0,+1} for 2-bit) are transferred to the hardware as bias. Then, as far as I understand, these bias values are left shifted by 7 before being applied to (summed with) the convolution output, and then the sum is scaled down to get it back to the original order of magnitude (as per the leftmost part of the block diagram in the User Guide, Figure 26-2. I think the right shift by 7 mentioned in the "multiplication" section of the readme.md also corresponds to this by the way. If not, I'm very confused, because then the bias values and the weight-activation multiplication result would be off by a factor of 128)
My question is: I couldn't find anything on the User Guide or on any other piece of the documentation that would suggest that this is intended behavior, but is it intended behavior? By reading the User Guide, I can't see any reason for disallowing e.g., the use of 0x25, 0xd3 etc. (8-bit) bias values for a layer with 2-bit weights. Am I missing something? Because if this is intended, the following statement on the readme.md is not clear to me:
On MAX78000/MAX78002, weights can be 1, 2, 4, or 8 bits wide... Bias values are always 8 bits wide. Data is 8 bits wide...
18.04.2021 edit: grammar
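The two bias "dictionaries" mentioned above follow from the two's-complement byte encoding of signed integers of a given width. The sketch below only enumerates those encodings to make the comparison concrete; the helper name is hypothetical and it does not model what quantize.py or izer do internally.

```python
def representable_bytes(bits: int) -> list:
    """Two's-complement byte encodings of every signed integer that fits in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return sorted({v & 0xFF for v in range(lo, hi + 1)})


two_bit = representable_bytes(2)
eight_bit = representable_bytes(8)

print([f"{b:#04x}" for b in two_bit])   # ['0x00', '0x01', '0xfe', '0xff']
print(len(eight_bit))                   # 256
```

A bias value such as 0x25 or 0xd3 appears in the 8-bit set but not in the 2-bit set, which is the restriction the question is about.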
Hi, there. I have a MAX78002 EVKit. I want to use the model with the camera as on the MAX78000. When I change to HWC I get this error in the debug output: "Data mismatch (338/623) at address 0x51800544: Expected 0xbb80f37f, read 0xbc80f37f."
I modified the YAML as follows:
processors: from 0x0000000100010001 to 0x0000000000000007
and set streaming to True.
Is it possible to use the camera in CHW format?
TIA
Hello,
I wanted to compare the results of KWS with and without MFCC calculation.
So I changed the KWS20_v3 model so that it directly supports MFCC calculations.
What I did was change __gen_datasets in KWS20 so that it generates (16000x1) instead of (128x128). During training, the getitem function converts inp into a numpy array, calculates the MFCCs and converts the result back to torch.
For the new model, I wrote the corresponding YAML and generated an example .npy file (for ai8xize.py).
I used all the standard kws20_v3 scripts for training, evaluation and quantization and only adjusted the path to the new model file and the model name.
During training I get around 60% - 80% when evaluating the training results (depending on the chosen layers).
But after using quantize.py the evaluation result drops below 10%.
Usually I don't expect a model to drop that much after quantization. Might there be any other files I have to change?
And a second question:
It's said that the linear layer cannot be larger than 1023. A model with a layer size of around 800 was still causing trouble: ai8xize.py threw an error because the layer size is -3. After changing it to a size below 255 (more CNN layers before it), it worked fine.
So is it true that I can't have larger layer sizes if I use quantization?
In the .yaml description of the ai87_fpndetector, the intro comments describe the planned memory setup. I am confused as to some of the numbers there.
First, why use 121, and not 126 in the Class predictions? In the regression, you only use 24 = 6*4.
Second, I'm a bit confused about the address mapping in classification and regression. Could you please explain how the memory is mapped here? For example, for the classification address range 0x0000-0xA000, is it correct to say it is calculated as 4 bytes per address, hence 0xA000 == 40960 * 4 = 163840, which is the rounded-up version of 32*40*121 = 154880?
Also, in the regression, the numbers seem too small for the data. For example, for regression layer 4x5x64, the memory is planned for 0xF000-0xF0A0 = 640, which is much less than 4*5*64 = 1280. Similarly for others, like the 32x40x64 layer: 0xF6E0-0x10AE0 = 20480 << 32*40*64 = 81920.
What is the recommended way to define these memory allocations in a systematic fashion especially if I modify the filter shapes eg increase filter sizes like changing 32x40 to 40x80 as when I setup the model with a different image input size? The memory locations seem to be referenced in the nms.c that is called in the sdk examples here L222. So any changes would then need to be propagated to the example to make it work for a new but similar model.
# Model Outputs:
#
# Class predictions : (32x40 + 16x20 + 8x10 + 4x5) * 6 * 21 = 10200 * 21
# 0x0000 - 0xD480 (1700 x 32: wide & multi-pass)
#
# 0x0000 - 0xA000: 32x40x121 (&wide)
# 0xA000 - 0xC800: 16x20x121 (&wide)
# 0xC800 - 0xD200: 8x10x121 (&wide)
# 0xD200 - 0xD480: 4x5x121 (&wide)
#
# Location predictions: (32x40 + 16x20 + 8x10 + 4x5) * 6 * 4 = 10200 * 4
# 0xD500 - 0xEF90 (1700 x 4)
#
# 0xD500 - 0xE900: 32x40x24
# 0xE900 - 0xEE00: 16x20x24
# 0xEE00 - 0xEF40: 8x10x24
# 0xEF40 - 0xEF90: 4x5x24
#
#
# FPN_out_4_5 : 0xF000-0xF0A0 (4x5x64, gap:1, protect after Layer 34 until x)
# FPN_out_8_10 : 0xF0A0-0xF1E0 (8x10x64, protect after Layer 37 until x)
# FPN_out_16_20: 0xF1E0-0xF6E0 (16x20x64, protect after Layer 40 until x)
# FPN_out_32_40: 0xF6E0-0x10AE0 (32x40x64, protect after Layer 43 until x)
#
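The address arithmetic in the question above can be reproduced with a short Python sketch. It assumes, as the question does, that each address step covers 4 bytes; that factor and the helper name `span_bytes` are illustrative assumptions and are not confirmed by the YAML file or by the tool.

```python
BYTES_PER_ADDRESS = 4  # assumption taken from the question, not from the tool


def span_bytes(start, end, bytes_per_address=BYTES_PER_ADDRESS):
    """Bytes covered by the half-open address range [start, end)."""
    return (end - start) * bytes_per_address


# (label, start, end, height, width, channels) copied from the YAML comments
regions = [
    ("class 32x40x121", 0x0000, 0xA000, 32, 40, 121),
    ("FPN_out_4_5", 0xF000, 0xF0A0, 4, 5, 64),
    ("FPN_out_32_40", 0xF6E0, 0x10AE0, 32, 40, 64),
]

for label, start, end, h, w, c in regions:
    planned = span_bytes(start, end)
    needed = h * w * c
    print(f"{label}: planned {planned} bytes vs. H*W*C = {needed}")
```

Under that assumption the script reports 163840 versus 154880 for the first class-prediction range, 640 versus 1280 for FPN_out_4_5, and 20480 versus 81920 for FPN_out_32_40, which are the figures quoted in the question.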
Hi there!
I am currently working on a project with the goal to port an existing model onto the MAX78000 platform due to its advanced convolution abilities. Everything works just fine except the last two layers of the net which consist of two linear layers back to back.
According to the readme.md the chaining of linear layers is possible when omitting the 'flatten' step. I am assuming that the first Linear Layer requires a 'flatten' operation but the following linear layers do not, is that correct? Sadly I did not find any examples with more than one linear layer in series.
Anyway I tried a lot of different setups using different values for flatten, in_dim, in_channel etc. but it does not seem to work. It seems like the second linear layer has trouble with the output of the first linear layer which comes as 1x1 features - m channels. The output of the ai8xize.py script is as follows:
Configuring device: MAX78000
Reading ../../network_path.yaml to configure network...
Reading ../../pth_path.tar to configure network weights...
Checkpoint for epoch 1, model pth - weight and bias data:
InCh OutCh Weights Quant Shift Min Max Size Key Bias Quant Min Max Size Key
16 20 (320, 1) 8 -2 -126 127 320 Convs.0.weight (20,) 8 -86 115 20 Convs.0.bias
20 20 (400, 2) 8 N/A -81 81 800 Convs.1.weight (20,) 8 -77 77 20 Convs.1.bias
20 20 (400, 2) 8 -2 -81 81 800 Convs.2.weight (20,) 8 -68 71 20 Convs.2.bias
20 20 (400, 2) 8 N/A -81 81 800 Convs.3.weight (20,) 8 -68 77 20 Convs.3.bias
100 32 (1, 32, 100) 8 N/A -102 102 3200 Lins.0.weight (32,) 8 -95 102 32 Lins.0.bias
32 10 (1, 10, 32) 8 -2 -89 90 320 Lins.1.weight (10,) 8 -78 82 10 Lins.1.bias
TOTAL: 6 parameter layers, 6,362 parameters, 6,362 bytes
Configuring data set: Dataset.
prefix...
WARNING: --overwrite specified, writing to sdk/Examples/MAX78000/CNN/prefix even though it exists.
ERROR: The configured kernel dimensions (1x1) for layer 11 do not match the weights file (32x10)!
with the last three layers in the yaml file looking as follows:
  # Layer 9: add
  - operation: Add
    in_offset: 0x0000
    out_offset: 0x2000
    in_sequences: [7, 8]
    processors: 0xfffff00000000
    output_processors: 0xfffff00000000
  # Layer 10: Flatten + Linear
  - operation: MLP
    flatten: true
    activation: ReLU
    in_offset: 0x2000
    out_offset: 0x0000
    processors: 0x000fffff
  # Layer 11: Linear
  - operation: MLP
    activation: None
    out_offset: 0x2000
    output_width: 32
    processors: 0xffffffff00000000
It is very strange, since the 'flatten' before the first layer also rearranges the features (arranged as HxWxC, not just 1x1xm where m = C*W*H) and it matches the dimensions of the weights without a problem, while the second layer does not.
Any advice is highly appreciated.
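When checking a configuration like the one above, it can help to count how many processors each mask enables and compare that with the channel counts in the weight table. The following is a minimal Python sketch; the helper name `enabled_processors` is hypothetical, and the idea that the number of set bits should correspond to a layer's input channel count is an assumption to verify against the ai8x-synthesis documentation.

```python
def enabled_processors(mask_text):
    """Count the set bits in a hex processor mask.

    Accepts plain hex ('0x000fffff') as well as the dotted form used in
    some YAML files ('0x0000.0000.0000.0007').
    """
    mask = int(mask_text.replace(".", ""), 16)
    return bin(mask).count("1")


# Masks copied from the three layers in the post
masks = {
    "Layer 9 processors": "0xfffff00000000",
    "Layer 10 processors": "0x000fffff",
    "Layer 11 processors": "0xffffffff00000000",
}

for name, text in masks.items():
    print(f"{name}: {enabled_processors(text)} processors enabled")
```

For the three layers shown, this gives 20, 20, and 32 enabled processors, which lines up with the 20 output channels of Convs.3 and the 32 input channels of Lins.1 in the weight table.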
Issue: for the same input in Q8 mode [-128, 127], the PyTorch prediction with ai8x.set_device(87, True, False) does not match the synthesis output (KAT).
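import torch
import ai8x

# Assumption: `device` was defined elsewhere in the original script
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ai87unet(...)  # model class and arguments as in the original post

class Args:
    def __init__(self, act_mode_8bit):
        self.act_mode_8bit = act_mode_8bit
        self.truncate_testset = False

args = Args(act_mode_8bit=True)
ai8x.set_device(87, True, False)  # True to simulate device
checkpoint = torch.load('qat_best_q8.pth.tar')
state_dict = checkpoint['state_dict']
ai8x.fuse_bn_layers(model)
model.load_state_dict(state_dict, strict=True)
model = model.to(device)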
I observed that the synthesis step does not generate the KAT using PyTorch; instead, it runs the input through custom code. This code generates the expected final output, which is then checked against what the device computes (KAT), and this test passes. Therefore the device computes exactly what is expected.
The main problem arises when analyzing the KAT output: the synthesis produces an output that is several times worse than the PyTorch prediction in Q8 mode (for the same input).
Has something similar been observed before?
I can provide more information if needed. Thank you for your help.
Based on the CamVid example and the corresponding camvid-unet-large.yaml, which implements a U-Net with 3 concatenation operations, your implementation suggests using a single passthrough layer at the bottleneck:
  # Layer 7: pt
  - in_offset: 0x5000
    out_offset: 0x4004
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: None
    write_gap: 1
    in_sequences: [5]
  # Layer 8: upconv3
  - in_offset: 0x6000
    out_offset: 0x4000
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    activate: None
    write_gap: 1
    in_sequences: [6]
  # Layer 9: dec3
  - out_offset: 0x2000
    in_offset: 0x4000
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [8, 7]
Could you explain the reasoning behind not adding passthrough layers to do every concatenation in this network? In which cases do I need to use camvid-unet-large-fakept.yaml & izer/add_fake_passthrough.py instead?
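To make the role of `write_gap: 1` and the 0x4004 / 0x4000 offsets in the excerpt above more concrete, here is a small Python sketch of how two layers' outputs could interleave in the same memory region. It assumes 4-byte words and that `write_gap: 1` skips one word after every write; both are simplifying assumptions for illustration, and `write_addresses` is a hypothetical helper, not part of ai8x-synthesis.

```python
WORD_BYTES = 4  # assumption: each write covers one 32-bit word


def write_addresses(out_offset, count, write_gap=0):
    """Byte addresses of `count` consecutive writes starting at out_offset,
    skipping `write_gap` words after each write."""
    stride = (write_gap + 1) * WORD_BYTES
    return [out_offset + i * stride for i in range(count)]


# Offsets taken from Layer 7 (pt) and Layer 8 (upconv3) in the excerpt
pt = write_addresses(0x4004, 4, write_gap=1)
upconv = write_addresses(0x4000, 4, write_gap=1)

print("pt     :", [hex(a) for a in pt])
print("upconv :", [hex(a) for a in upconv])
print("merged :", [hex(a) for a in sorted(pt + upconv)])
```

Under these assumptions the two layers never write to the same word, and together they fill a contiguous region starting at 0x4000, which is where Layer 9 reads its concatenated input (`in_offset: 0x4000`, `in_sequences: [8, 7]`).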
Given the following U-Net definition for a regression task, could you help me understand why using multiple passthrough layers prevents it from running inference properly?
---
arch: unetmedium
dataset: customdataset
layers:
  # Layer 0: enc1
  - out_offset: 0x4000
    processors: 0x0000.0000.0000.0007
    data_format: HWC
    output_processors: 0x0f00.0000.0000.0000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
  # Layer 1: enc2
  - out_offset: 0x4000
    processors: 0x0f00.0000.0000.0000
    output_processors: 0x0000.0000.0000.ff00
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 2: enc3
  - out_offset: 0x0000
    processors: 0x0000.0000.0000.00ff
    output_processors: 0xffff.ffff.0000.0000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 3: bneck
  - out_offset: 0x6000
    processors: 0xffff.ffff.0000.0000
    output_processors: 0xffff.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 4: pt
  - in_offset: 0x0000
    out_offset: 0x4000
    processors: 0xffff.ffff.0000.0000
    output_processors: 0xffff.ffff.0000.0000
    operation: None
    write_gap: 1
    in_sequences: [2]
  # Layer 5: upconv3
  - in_offset: 0x6000
    out_offset: 0x4004
    processors: 0xffff.ffff.ffff.ffff
    output_processors: 0x0000.0000.ffff.ffff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    activate: None
    write_gap: 1
    in_sequences: [3]
  # Layer 6: dec3
  - in_offset: 0x4000
    out_offset: 0x2000
    processors: 0xffff.ffff.ffff.ffff
    output_processors: 0x0fff.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [5, 4]
  # Layer 7: pt
  - in_offset: 0x4000
    out_offset: 0x4000
    processors: 0x0000.0000.0000.ff00
    output_processors: 0x0000.0000.0000.ff00
    operation: None
    write_gap: 1
    in_sequences: [1]
  # Layer 8: upconv2
  - in_offset: 0x2000
    out_offset: 0x4004
    processors: 0x0fff.ffff.ffff.ffff
    output_processors: 0x0000.0000.0000.00ff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    write_gap: 1
    in_sequences: [6]
    activate: None
  # Layer 9: dec2
  - out_offset: 0x2000
    in_offset: 0x4000
    processors: 0x0000.0000.0000.ffff
    output_processors: 0x0000.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [8, 7]
  # Layer 10: pt
  - in_offset: 0x4000
    out_offset: 0x0000
    processors: 0x0f00.0000.0000.0000
    output_processors: 0x0f00.0000.0000.0000
    operation: None
    write_gap: 1
    in_sequences: [0]
    name: pt3
  # Layer 11: upconv1
  - in_offset: 0x2000
    out_offset: 0x0004
    processors: 0x0000.ffff.ffff.ffff
    output_processors: 0x00f0.0000.0000.0000
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    write_gap: 1
    activate: None
    in_sequences: [9]
  # Layer 12: dec1
  - in_offset: 0x0000
    out_offset: 0x4000
    processors: 0x0ff0.0000.0000.0000
    output_processors: 0x0000.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [11, 10]
  # Layer 13: dec0
  - out_offset: 0x0000
    processors: 0x0000.ffff.ffff.ffff
    output_processors: 0x0000.0000.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
  # Layer 14: conv
  - out_offset: 0x4000
    processors: 0x0000.0000.ffff.ffff
    output_processors: 0x0000000000000001
    operation: conv2d
    kernel_size: 1x1
    pad: 0
    activate: None
A few things about synthesis are unclear to me.
First of all, why do we run this step?
(ai8x-synthesis) $ python quantize.py proj/qat_best.pth.tar proj/proj_q8.pth.tar --device MAX78000
Secondly, after running:
python ai8xize.py --verbose --test-dir demos --prefix ai85-kws20 --checkpoint-file proj/proj_q8.pth.tar --config-file networks/kws20-hwc.yaml --device MAX78000 --compact-data --mexpress --softmax
the files cnn.c, cnn.h, weights.h, and softmax.c are created. How are these files generated, and what does each of them contain?
Currently I am trying to deploy a CNN model onto MAX78000 that has been trained using Keras (*.h5). I converted the model to ONNX and extracted the architectural description to YAML.
I then tried generating the C files using ai8xize.py, however I am getting errors that my YAML file contains unknown keys. Comparing the YAML file of my model (generated from original Keras model) and some of the YAML sample files in the folder ai8x-synthesis/networks, the architecture description looks very different.
How can I obtain the architecture description as expected by ai8xize from my Keras model?
It would be great if I do not have to retrain the CNN model using PyTorch to ensure comparability towards other platforms, where the toolchain is Keras-based.
Another question: since TensorFlow >= 2.6.0 no longer supports YAML, only JSON, will the toolchain be adapted accordingly?
Kind regards,
asti205
What would be the best way to synthesize models obtained with quantization-aware training via the PyTorch quantization API?