analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices

License: Apache License 2.0

Python 79.31% SystemVerilog 0.06% Makefile 2.16% C 11.31% Shell 2.33% Tcl 0.19% Jupyter Notebook 4.63%

max78000 maxim machine-learning maxim-integrated ai artificial-intelligence deep-learning analog-devices max78002

ai8x-synthesis's People

Contributors

jdk-maxim rotx-maxim ermanok tchenot m7analog aniktash maximgorkem asimov-aiz jessexm shaunclack ntanganiya vicloginov denizkilinc yeqiao jwwarren1 maximreza seldauyanik-maxim dplozza jake-carter bobbycounts khpeterson swap2ag azra26 catapangan lucacaronti gio200023 reidfathom5 reidbo arg-nctu odie8683 alicangok marieltinaco iggmercano oguzhanbsolak ivangilmercano rajclemente peterhamfelt eyuboglumerve jss-on asyatrhl taesikgong kent0311 danpfister rakechen-0307 cencarna lochanamendis

ai8x-synthesis's Issues

Out_offset (YAML) and SRAM write pointer

https://github.com/MaximIntegratedAI/ai8x-synthesis/blob/develop/izer/backend/max7800x.py#L1673

instance = ffs(output_processor_map[ll] >> group * tc.dev.P_SHARED) \
  & ~(tc.dev.P_SHARED-1)
val |= (instance + group * tc.dev.P_SHARED) * tc.dev.INSTANCE_SIZE

According to the code, the actual value written to the register is 1/4 * (out_offset (from YAML) + 0x8000 * smallest_out_processor_group_index).

Are out_offset/in_offset given in bytes while the write pointer is a word address, where each address points at a 32-bit word? I think this is why I am getting confused.
When writing data, where and how does the address translation happen? This addressing scheme does not match the one in the user guide, where the SRAM of each quadrant is separate.

(Below is a copy of documentation from the same file.)

Configure SRAM write pointer -- write ptr is global
(unless depth-wise w/o broadcast is used).
Get offset to first available instance of the first used
processor of the next layer.
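
The reading described in this report can be written out as a short sketch. It only restates the poster's formula (YAML byte offset plus 0x8000 per shared memory instance, divided by 4 to obtain a 32-bit word address). The names `BYTES_PER_WORD`, `BYTES_PER_INSTANCE` and `write_ptr_words` are illustrative and are not identifiers from izer; whether the register really holds a word address is the open question of the issue.

```python
# Sketch of the poster's interpretation, not code from izer.
BYTES_PER_WORD = 4           # assumption: one address = one 32-bit word
BYTES_PER_INSTANCE = 0x8000  # assumption: taken from the formula quoted in the issue


def write_ptr_words(out_offset_bytes: int, instance_index: int) -> int:
    """Convert a YAML byte offset plus an instance index to a word address."""
    byte_address = out_offset_bytes + BYTES_PER_INSTANCE * instance_index
    return byte_address // BYTES_PER_WORD


if __name__ == "__main__":
    for offset, index in [(0x0000, 0), (0x2000, 0), (0x2000, 1)]:
        print(hex(offset), index, hex(write_ptr_words(offset, index)))
```

If the printed values match what the generated cnn.c writes to the write pointer register for the same layers, the byte-versus-word interpretation is at least consistent.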

CNN.c

As I understand it, CNN.c is basically a driver for the accelerator, and it has the necessary functions to load the data and so on.
What I don't understand is why those functions are generated after synthesis. Are they network/data independent, and why is that?

Problem synthesizing the tinierssd svhn model

I trained the newly added tinier-ssd model with the included training script (scripts/train_svhn_tinierssd.sh).
During synthesis, I was able to quantize the weights with this command:
python quantize.py ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best.pth.tar ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar --device MAX78000 -v -c networks/svhn-tinierssd.yaml

But the ai8xize script fails with the error below.
ai8xize command:

python ai8xize.py --test-dir generated_svhn_tinierssd --prefix tinierssd_svhn --checkpoint-file ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar --config-file networks/svhn-tinierssd.yaml --device MAX78000 --compact-data --timer 0 --display-checkpoint --verbose --overwrite --mexpress

log:

Reading ../ai8x-training-latest/logs/2022.07.11-214500_SVHN_tinierssd/qat_best_q.pth.tar to configure network weights...

e2.op.bias
   32    64  (2048, 3, 3)        8    -2 -128  127   18432 base.fire3.op.weight                      (64,)          8  -20  127   64 base.fire3.op.bias
   64    64  (4096, 3, 3)        8    -2 -128  127   36864 base.fire4.op.weight                      (64,)          8  -81  104   64 base.fire4.op.bias
   64    64  (4096, 3, 3)        8    -2 -128  127   36864 base.fire5.op.weight                      (64,)          8 -116  115   64 base.fire5.op.bias
   64    64  (4096, 3, 3)        8    -2 -128  127   36864 base.fire6.op.weight                      (64,)          8 -128  108   64 base.fire6.op.bias
   64   128  (8192, 3, 3)        8    -2 -128  127   73728 base.fire7.op.weight                      (128,)         8 -128  127  128 base.fire7.op.bias
  128    32  (4096, 3, 3)        8    -2 -128  127   36864 base.fire8.op.weight                      (32,)          8 -110  127   32 base.fire8.op.bias
   32    32  (1024, 3, 3)        8    -1 -128  127    9216 base.fire9.op.weight                      (32,)          8  -38  113   32 base.fire9.op.bias
   32    32  (1024, 3, 3)        8    -1 -109  127    9216 base.fire10.op.weight                     (32,)          8   -2   61   32 base.fire10.op.bias
   32    16  (512, 3, 3)         8    -2 -128  127    4608 aux_convs.conv12_1.op.weight              (16,)          8  -41   40   16 aux_convs.conv12_1.op.bias
   16    16  (256, 3, 3)         8    -1 -123  127    2304 aux_convs.conv12_2.op.weight              (16,)          8   -3   88   16 aux_convs.conv12_2.op.bias
   32    16  (512, 3, 3)         8    -4  -64  127    4608 pred_convs.loc_fire8.op.weight            (16,)          8  -44  127   16 pred_convs.loc_fire8.op.bias
   32    16  (512, 3, 3)         8    -1 -110   88    4608 pred_convs.loc_fire9.op.weight            (16,)          8  -38   54   16 pred_convs.loc_fire9.op.bias
   32    16  (512, 3, 3)         8    -1  -92  107    4608 pred_convs.loc_fire10.op.weight           (16,)          8  -55   52   16 pred_convs.loc_fire10.op.bias
   16    16  (256, 3, 3)         8    -1 -108  103    2304 pred_convs.loc_conv12_2.op.weight         (16,)          8  -35   34   16 pred_convs.loc_conv12_2.op.bias
   32    44  (1408, 3, 3)        8    -2  -87   97   12672 pred_convs.cl_fire8.op.weight             (44,)          8 -124  127   44 pred_convs.cl_fire8.op.bias
   32    44  (1408, 3, 3)        8    -1 -126  127   12672 pred_convs.cl_fire9.op.weight             (44,)          8  -93  125   44 pred_convs.cl_fire9.op.bias
   32    44  (1408, 3, 3)        8    -1 -124  127   12672 pred_convs.cl_fire10.op.weight            (44,)          8  -75   88   44 pred_convs.cl_fire10.op.bias
   16    44  (704, 3, 3)         8    -1 -128  127    6336 pred_convs.cl_conv12_2.op.weight          (44,)          8  -59   50   44 pred_convs.cl_conv12_2.op.bias
TOTAL: 20 parameter layers, 336,336 parameters, 336,336 bytes
TOTAL: 20 parameter layers, 336,336 parameters, 336,336 bytes
Configuring data set: svhn_74.
tinierssd_svhn...
NOTICE: --overwrite specified, writing to generated_svhn_tinierssd/tinierssd_svhn even though it exists.
Arranging weights... ________________________________________ 100%
Storing weights...   ________________________________________ 100%
Creating network...  ________________________________________  10%
ERROR: Processor 0: Layer 2 output for CHW=0,27,25 is overwriting input at offset 0x00402000 that was created by layer 1, CHW=0,0,0.

I see that the weights.h file is mostly empty.
Is there something I'm missing, or does the config YAML for tinierssd need to be corrected?
Creating network... ________________________________________ 10%

YAML file

In synthesis I get this error: "Layer 8: 3 input channels (before flattening) using 1 pass, and 1 operand (3 processors per pass), but the enabled processor map 0x00000000ffffffff has 32 bits instead of the expected number of 3." I could not solve it. Can you share a YAML file for the ai85kws20netv2batchnorm model?
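
The numbers in that message can be reproduced by counting the set bits in the processor map. The sketch below is only a reading of the error text: `enabled_processors` and `expected_per_pass` are hypothetical helper names, and the rule "processors per pass = channels divided by passes, rounded up" is an assumption inferred from the message rather than taken from izer.

```python
def enabled_processors(processor_map: int) -> int:
    """Number of bits set in a 64-bit processor map."""
    return bin(processor_map).count("1")


def expected_per_pass(channels: int, passes: int = 1) -> int:
    """Assumed rule: channels spread evenly over the passes, rounded up."""
    return -(-channels // passes)


if __name__ == "__main__":
    for pmap in (0x00000000FFFFFFFF, 0x0000000000000007):
        print(hex(pmap), enabled_processors(pmap), "enabled, expected",
              expected_per_pass(3, passes=1))
```

Under that assumption, a map with three bits set (for example 0x7) would satisfy the count for a 3-channel, single-pass layer, whereas 0x00000000ffffffff enables 32.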

synthesize without KAT/sample data

Hi
I trained your ai85unetlarge on MS COCO with only 2 classes (Background, Person).
When trying to synthesize the trained model I ran into some problems:

First I had to modify aisegment-unet-large-fakept.yaml because the last layer ended up smaller.

   64    64  (4096, 1, 1)        8    -1 -120  125    4096 conv.op.weight                            (64,)          8  -31    0   64 conv.op.bias             
TOTAL: 19 parameter layers, 282,220 parameters, 282,220 bytes
Configuring data set: CamVid_s352_c3_reduced.

vs.

   64    32  (2048, 1, 1)        8     2  -13   15    2048 conv.op.weight                            (32,)          8  -81   79   32 conv.op.bias             
TOTAL: 19 parameter layers, 280,140 parameters, 280,140 bytes
Configuring data set: coco_s352.

That's why I changed the last layer to output_processors: 0x00000000ffffffff. Is that correct?

I'd like to use your UNet-Demo with my trained network. After some investigation I found out that you used a (I assume older) unet_v5 version and an AISegment_352_reduced dataset in order to fit the sample data into SRAM.
Is it possible to synthesize without all the KAT and sample data?

I just want to use UNet-demo with the camera and don't care about KAT for now. I tried --no-kat, --synthesize-input, --synthesize-words and --max-verify-length, but when inspecting izer/sampledata.py I saw that I cannot reach these steps anyway because of the check if shape[0] < 1 or shape[0] > 4:, as my input shape is (48,88,88).

The build error is the following:

c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: C:/MaximSDK/Examples/MAX78000/CNN/UNet-demo/build/UNet-demo.elf section `.bss' will not fit in region `SRAM'
c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: region RAM overflowed with stack
c:/maximsdk/tools/gnutools/10.3/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/bin/ld.exe: region `SRAM' overflowed by 148408 bytes
collect2.exe: error: ld returned 1 exit status
make: *** [/C/MaximSDK/Libraries/CMSIS/Device/Maxim/MAX78000/Source/GCC/gcc.mk:299: /c/MaximSDK/Examples/MAX78000/CNN/UNet-demo/build/UNet-demo.elf] Error 1
"make all -r -j 8" terminated with exit code 2. Build might be incomplete.

But it's pretty clear the sample data won't fit: the NumPy pickle I generated with train.py is 2.9 MB, which makes total sense given the (48x88x88x8bit) dimensions.
Edit: never mind, I was slightly confused because the file size is 2.9 MB, but when I load the pickle the array has size 371712, which does indeed fit into data memory.
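
The two sizes mentioned in the edit can be reconciled with a little arithmetic. The sketch below assumes the sample was saved as int64 (8 bytes per element); that is a guess that happens to match the 2.9 MB file size and is not confirmed in this report.

```python
from math import prod

shape = (48, 88, 88)               # input shape quoted in the issue
elements = prod(shape)             # number of 8-bit samples the network needs

bytes_as_int8 = elements * 1       # size if stored as 8-bit values
bytes_as_int64 = elements * 8      # assumption: file saved with dtype int64

if __name__ == "__main__":
    print("elements:", elements)
    print("as int8 :", bytes_as_int8, "bytes")
    print("as int64:", bytes_as_int64, "bytes (about",
          round(bytes_as_int64 / 1e6, 2), "MB)")
```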

I commented out //#define USE_SAMPLEDATA // shows the sample data.

Sound

Hi,

I'm trying to make the board produce some sound. Could you show me how, or point me to some help?

out_offset in yaml file

Sorry, this is my first time using the MAX78000 EVKit, and I don't fully understand how to set the value of out_offset in the YAML file. Do I need to calculate it from something in particular? Thank you so much!

Best regards,
Jason

understanding error in synthesis yaml

Hi, I am modifying ai87_fpndetector for my use case by changing the input shape, removing the first residual in the backbone, and other smaller mods in the FPN accordingly (see .py file attached). To create the yaml for my own model, I'm modifying the original ai87_fpndetector_pascalvoc for my own use case. But I am getting this error for my yaml:

ERROR: Layer 73 (loc_60_80_res0_preprocess): HWC (4 channels/word) 8-bit 60x80 output (size 19200) 
with output offset 0x10ae0 and expansion 1x exceeds data memory instance size of 81920.

What does this mean in general, and how should I approach cases where memory is exceeded? How is this excess even calculated?
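
One plausible reading of the numbers in that message is that the output offset plus the output size is compared with the size of one data memory instance. The sketch below only reproduces that arithmetic with the values from the error; the function name `exceeds_instance` is hypothetical, and the actual check in izer may include factors (such as the expansion or write gap) that are not modeled here.

```python
def exceeds_instance(out_offset: int, out_size: int, instance_size: int):
    """Return (overflow?, bytes beyond the end of the instance)."""
    end = out_offset + out_size
    return end > instance_size, end - instance_size


if __name__ == "__main__":
    # Values copied from the error message above.
    over, excess = exceeds_instance(0x10AE0, 19200, 81920)
    print("end of output:", 0x10AE0 + 19200)
    print("overflow:", over, "by", excess, "bytes")
```

Under this reading, lowering out_offset for that layer so that offset plus size stays within 81920 would avoid the error, provided the lower region is not still needed by another layer.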

Here is my complete yaml and py file
modFPN.zip

ERROR: Streaming in the first layer requires use of a FIFO.

The command I use to generate the project is:

./ai8xize.py --verbose --log --test-dir demos --prefix ai85-faceid-qat8 --checkpoint-file trained/ai85-faceid-qat8.pth.tar --config-file networks/faceid.yaml --device MAX78000 --compact-data --mexpress --softmax

Synthesis error in the Camvid example

Hi, running the train-quantize-eval-synthesize pipeline for the Camvid example, I encountered an error on the synthesis stage:

Arranging weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Storing weights...   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
Creating network...  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0%ERROR: Processor 0: Layer 0 output for CHW=0,0,64 is overwriting input at offset 0x00400700 that was created by the input loader.
Creating network...  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0%

Network-related files (ai85net-unet.py, camvid-unet-large.yaml, camvid-unet-large-fakept.yaml) and the Camvid dataset are unchanged. These are the commands I use:

# train
python train.py --lr 0.001 --optimizer adam --epochs 5 --batch-size 4 --gpus 0 \
--deterministic --compress policies/schedule.yaml --qat-policy \
policies/qat_policy_camvid.yaml --model ai85unetlarge --dataset CamVid_s352_c3 \
--use-bias --wd 0 --out-fold-ratio 4 --truncate-test  \
--device MAX78000 \
--out-dir ai85unetlarge_artifacts/train

# quantize
python quantize.py ai85unetlarge_artifacts/train/last/best.pth.tar ai85unetlarge_artifacts/train/last/best-q.pth.tar \
 --device MAX78000 -v
(In the *-fakept.yaml case, I add:
python izer/add_fake_passthrough.py --input-checkpoint-path ai85unetlarge_artifacts/train/last/best-q.pth.tar --output-checkpoint-path ai85unetlarge_artifacts/train/last/best-q-pt.pth.tar --layer-name pt --layer-depth 56 --layer-name-after-pt upconv3
)
 
# eval
python train.py --model ai85unetlarge --dataset CamVid_s352_c3 --truncate-test --out-fold-ratio 4 --evaluate \
--save-sample 1 \
--exp-load-weights-from ai85unetlarge_artifacts/train/last/best-q-pt.pth.tar -8 \
--device MAX78000 \
--use-bias \
--batch-size 2 \
--out-dir ai85unetlarge_artifacts/eval

# synthesize
python ai8xize.py --test-dir synthed_net --prefix ai85unetlarge --checkpoint-file \
 ai85unetlarge_artifacts/train/last/best-q.pth.tar --config-file networks/camvid-unet-large.yaml \
--sample-input ai85unetlarge_artifacts/eval/sample_CamVid_s352_c3.npy \
--device MAX78000  \
--compact-data --mexpress --timer 0 --display-checkpoint --verbose --overwrite --board-name FTHR_RevA

Can't find quantize.py for Tensorflow model quantization in develop-tf branch

quantize.py contains code to quantize a PyTorch model, but I couldn't find an equivalent script to quantize a TensorFlow model trained with the develop-tf branch of the ai8x-training repository. What is the script to quantize a TensorFlow model for the MAX78000 device?

MAC/Cycle mismatch

Hi,
I have been using the chip MAX78000 for quite some time now, and I am really impressed by the product.

However, I cannot explain why I have an inference time of less than 1ms for a 5 million MACs network. I got the number of MACs from the file cnn.h created with the synthesis tool, and the 1ms has been measured with an oscilloscope on the LED2 (red led on the ev board), that should light up when the network is running.

The accelerator is running at 50 MHz, so the inference takes about 50,000 cycles to complete; the result is correct and passes the check. But that would mean the accelerator performs 5e6/5e4 = 100 MAC/cycle, and according to the datasheet ("Nominal 1 output channel per clock") I think the maximum theoretical value should be 64 MAC/cycle, equal to the number of processors.

I am trying to understand this behaviour, do you think this is normal? Am I right taking into account a max of 64MAC/Cycle? My take is that the accelerator is not really running my network...
neural_net.zip

Thank you for your time!
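
The arithmetic in this report can be checked with a few lines. The clock, inference time and MAC count are the figures given above. The parameter `macs_per_processor_per_cycle` is hypothetical: it shows how the theoretical ceiling would change if each processor performed more than one multiply-accumulate per clock (for example a full 3x3 kernel), which is an assumption to verify against the user guide rather than something stated in this thread.

```python
CLOCK_HZ = 50_000_000      # accelerator clock from the report
INFERENCE_US = 1_000       # measured inference time, about 1 ms
TOTAL_MACS = 5_000_000     # MAC count reported in cnn.h
PROCESSORS = 64

cycles = CLOCK_HZ * INFERENCE_US // 1_000_000
macs_per_cycle = TOTAL_MACS / cycles


def peak_macs_per_cycle(macs_per_processor_per_cycle: int) -> int:
    """Theoretical ceiling under an assumed per-processor throughput."""
    return PROCESSORS * macs_per_processor_per_cycle


if __name__ == "__main__":
    print("cycles:", cycles)
    print("observed MAC/cycle:", macs_per_cycle)
    print("ceiling at 1 MAC/processor/cycle:", peak_macs_per_cycle(1))
    print("ceiling at 9 MAC/processor/cycle:", peak_macs_per_cycle(9))
```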

possible small improvement to the readme.md about sample inputs for new datasets

Section "Generating a Random Sample Input" provides the following line as an example:

np.save(os.path.join('tests', 'sample_mnist'), a, allow_pickle=False, fix_imports=False)

where a is a random tensor initialized explicitly as dtype=np.int64 in the previous line. The input being of an integer type seems to be important for the synthesis tool because it fails at the "bitwise and" operator inside the parameter load function in izer/load.py at this line when the type is e.g., float64. The example already uses an input cast to int64, which gives a hint, but the error message when you don't use an int type is pretty cryptic:

ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

and it was a bit cumbersome to trace the izer/... files to understand that my input wasn't saved with the correct type (I just copied the line I pasted above to save my sample data for a new dataset, which therefore skipped the typecasting, and numpy defaulted to float64 even though I had 256 unique values in the -128...+127 range). So my suggestion:

maybe a very small note under Section "Generating a Random Sample Input" in the readme.md, reiterating that the data should be dtype=np.int64
or, even though it's redundant, adding an explicit typecast to the "saving line" in the example (which everybody will probably copy without reading the randgen line, like me):

np.save(os.path.join('tests', 'sample_mnist'), a.astype('int64'), allow_pickle=False, fix_imports=False)

the second suggestion might induce even worse "hidden" errors like casting float input data in small ranges to int64 without notice, flooring them all to 0, -1 etc. though (depending on rounding settings), so I personally vote for the first one
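
A minimal sketch of the first suggestion, assuming only NumPy: the helper `save_sample` (a hypothetical name, not part of the repository) refuses non-integer data instead of casting it silently, and the last lines reproduce the cryptic `bitwise_and` failure on a float64 array. The shape and file name are illustrative.

```python
import os
import tempfile

import numpy as np


def save_sample(path, data):
    """Save sample input, rejecting non-integer dtypes instead of casting."""
    if not np.issubdtype(data.dtype, np.integer):
        raise TypeError(f"sample data must be an integer dtype, got {data.dtype}")
    np.save(path, data.astype(np.int64), allow_pickle=False, fix_imports=False)


rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(1, 28, 28), dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_mnist.npy")
    save_sample(sample_path, a)
    loaded = np.load(sample_path)

# The failure mode from the report: bitwise operations reject float64 input.
try:
    np.bitwise_and(a.astype(np.float64), 0xFF)
    float_failed = False
except TypeError:
    float_failed = True

if __name__ == "__main__":
    print(loaded.dtype, loaded.shape, "float input rejected:", float_failed)
```

Raising early keeps the concern raised about the second suggestion in view: float data in a small range is never floored to integers without notice.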

Resnet support?

Hi, I am curious if this can support models like resnet, where a layer adds an output from the last layer and one from some layers before. I am not sure if the synthesis could cover this. Or models like FCN-8 and this (https://github.com/dvu4/CarND-Semantic-Segmentation).

Can I set 4 processors for 8 channels?

I know the synthesis tool in its current form does not support this. However, is there a way to run two passes with fewer processors by setting the registers? (I am comfortable with the registers on the accelerator, so you can explain using them.)

Early exit inference on MAX78000FTHR

I noticed that it's possible to train some simple early exit models via ai8x-training, thanks to code such as the following:
https://github.com/analogdevicesinc/ai8x-training/blob/develop/train.py#L1534

However, I'm unsure how to synthesize and run early exit on the device. Could you please share an example? I'd like to see which layer of the model a test inference exits from on-device.

BUG:Using multiple GPUs to train a model will cause model evaluation errors!!!

I want to know what the purpose of update_old_model_params in train.py is.

elif args.load_model_path:
    # print('2222')
    update_old_model_params(args.load_model_path, model)
    if qat_policy is not None:
        checkpoint = torch.load(args.load_model_path,
                                map_location=lambda storage, loc: storage)
        if checkpoint.get('epoch', None) >= qat_policy['start_epoch']:
            ai8x.fuse_bn_layers(model)
    model = apputils.load_lean_checkpoint(model, args.load_model_path,
                                          model_device=args.device)
    ai8x.update_model(model)

This can lead to incorrect parameter loading when using multi-GPU training. This may require optimization.

Block diagram in userguide page 381

There is a dark blue aggregator and three light blue aggregators.

Is the dark blue aggregator used when the quadrant is the master quadrant?
If a quadrant is not the master quadrant would the dark blue aggregator work as if it were a light blue aggregator?
I want to know more about how the output of each processor is passed on to other parts. Do the outputs first go to the light blue aggregator, which makes a partial sum of products? Are the results all concatenated and sent to the master quadrant? Why is there an input from a light blue aggregator to the group's shared memory?
(For a single pass, where there are fewer than 64 input channels) Does the data write work in parallel with the processors, or does it wait for partial/entire results (and do the processors also wait for data writes to finish)?
For multiple passes, how is it different? Does the multi-pass accumulator use data SRAM, or is it a cache within the dark blue aggregator?

Many thanks!

AvgPool1d reduces the feature dimension in synthesis

Hi,

I am getting this error when I try to implement AvgPool1d in synthesis.

"Pooling or zero-padding results in a zero data dimension (input [1800, 1], result [112, 0])."

My network configuration file:

  # input: 64x1800x1, 
  - out_offset: 0x0000
    in_dim: 1800
    processors: 0xFFFF.FFFF.FFFF.FFFF
    avg_pool: 16
    pool_stride: 16
    operation: none
    name: l16_gap1

How do I specify it to do a 1d average pool?
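
The dimensions in the error message can be reproduced with the usual pooling output formula, applied to both axes of the [1800, 1] input. This is a sketch under the assumption that the tool treats the pool as two-dimensional here; `pooled_dim` is a hypothetical helper and not a function from izer.

```python
def pooled_dim(size: int, pool: int, stride: int) -> int:
    """Standard pooling output size without padding (floor division)."""
    return (size - pool) // stride + 1


in_dims = [1800, 1]                       # input reported in the error
result = [pooled_dim(d, 16, 16) for d in in_dims]

if __name__ == "__main__":
    print("input", in_dims, "-> result", result)
```

The second axis collapses to zero because a 16-wide window cannot fit into a dimension of size 1, which matches the reported result [112, 0].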

processors

Hi, I have trouble understanding the logic of the processors setting in the YAML file. For example:

# HWC (little data) configuration for CIFAR-100
# Simple Model

arch: ai85ressimplenet
dataset: CIFAR100

layers:
  # Layer 0
  - out_offset: 0x2000
    processors: 0x7000000000000000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    data_format: HWC

  # Layer 1
  - out_offset: 0x0000
    processors: 0x0ffff00000000000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU

  # Layer 2 - re-form data with gap
  - out_offset: 0x2000
    processors: 0x00000000000fffff
    output_processors: 0x00000000000fffff
    operation: passthrough
    write_gap: 1

  # Layer 3
  - in_offset: 0x0000
    in_sequences: 1
    out_offset: 0x2004
    processors: 0x00000000000fffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    write_gap: 1

  # Layer 4 - Residual-1
  - in_sequences: [2, 3]
    in_offset: 0x2000
    out_offset: 0x0000
    processors: 0x00000000000fffff
    eltwise: add
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU

  # Layer 5
  - out_offset: 0x2000
    processors: 0xfffff00000000000
    output_processors: 0x000000fffff00000
    max_pool: 2
    pool_stride: 2
    pad: 1
    operation: conv2d
    kernel_size: 3x3
    activate: ReLU

Why doesn't the sample start enabling processors from the first (rightmost) one? What is the logic behind this?
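
To see which processors each mask in the example selects, the bit patterns can be decoded as below. This sketch assumes that bit 0 (the rightmost bit) corresponds to processor 0 and that the 64 processors are organized as 4 quadrants of 16. It does not answer why the example chooses these processors; it only makes the choice visible.

```python
PROCESSORS_PER_QUADRANT = 16  # assumption: 64 processors in 4 groups of 16


def decode(processor_map: int) -> list:
    """Indices of the processors enabled by a 64-bit map (bit 0 = processor 0)."""
    return [i for i in range(64) if (processor_map >> i) & 1]


def quadrants(processor_map: int) -> list:
    """Quadrants touched by the enabled processors."""
    return sorted({i // PROCESSORS_PER_QUADRANT for i in decode(processor_map)})


if __name__ == "__main__":
    for name, pmap in [("Layer 0", 0x7000000000000000),
                       ("Layer 1", 0x0FFFF00000000000),
                       ("Layer 2", 0x00000000000FFFFF)]:
        procs = decode(pmap)
        print(name, "processors", procs[0], "to", procs[-1],
              "count", len(procs), "quadrants", quadrants(pmap))
```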

Flatten and Linear of generating C model

Hi there. I'm stuck again on the network part. The last layer of this simple model does not seem to work: Layer 3: "flatten" exceeds supported input dimensions (512 * 1 > 256).
I assumed it might need 8-bit data until I found an example in this Maxim guide link. In example 4, layer 6, the input data is much larger than 8 bits after flattening. So I am wondering how to solve this.

Another question follows from this. I tried setting flatten for Layer 3 to False, but that gives an error about the Layer 2 output and the Layer 3 input needing a different number of processors. So I'd like to know about this once the first problem is solved. Thanks a lot.

layers:
  # ai8x.FusedConv1dBNReLU(1, 8, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
  - data_format: CHW
    op: Conv1d
    pad: 1
    activate: ReLU
    kernel_size: 3
    stride: 1
    processors: 0x0000000000000001
    out_offset: 0x0000

  # ai8x.FusedConv1dBNReLU(8, 16, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
  - op: Conv1d
    pad: 1
    activate: ReLU
    kernel_size: 3
    stride: 1
    processors: 0x0000000000000ff0
    out_offset: 0x2000

  # ai8x.FusedConv1dBNReLU(16, 2, kernel_size=3, stride=1, padding=1, bias=True, batchnorm='Affine')
  - op: Conv1d
    pad: 1
    activate: ReLU
    kernel_size: 3
    stride: 1
    processors: 0x000000000ffff000
    out_offset: 0x0000

  # Flatten & ai8x.Linear(512 * 2, 512)
  - op: Linear
    flatten: True
    activate: None
    processors: 0xffffffffffffffff
    out_offset: 0x2000

Which register specifies a master quadrant for calculating sum-of-products?

I got a few questions while going through the user guide of MAX78000.

How is the master quadrant selected, the one that collects the partial sums of products, calculates the sum of all products, and writes to the next layer?
(I have a feeling that it is the first quadrant (or the first processor), judging by the mnist-chw-ai85.yaml file, which avoids using the first quadrant throughout several layers.)
Also, is there any difference in performance if we choose to run, or not to run, convolution on the master quadrant?
Can we parallelize computation across more processors when we have fewer channels? If I only have 32 input channels and 60 output channels, can I parallelize the computation across all 64 processors?
Why doesn't the cifar-100 example use parallel mode for the first layer? For example, by using processors 0x111 instead of 0x007 and feeding the data to three memories. If it is because of the channel format (HWC), is changing the format to CHW more costly than the benefit of running this in three parallel groups?

Many thanks!

CNN files migration from cats-vs-dogs to cats-dogs_demo issue

Hello

I am trying out the "cats-vs-dogs" example. I managed to train, quantize, evaluate and generate C code. The code works perfectly with sample data (the test passes on the board).
I also compiled the cats-dogs demo, which uses the same network but also uses the onboard camera, which makes it more fun. I tested the demo with images of cats and dogs on my computer screen and it works amazingly well.
The problem arises when I try to combine the two projects. That is, I copy cnn.h, cnn.c, weights.h, and the logs from the previously generated cats-vs-dogs folder to the cats-dogs_demo folder. The board reacts to the button press and probably takes a picture, but the CNN always gives the same output.
My question is: what is the difference between the cats-vs-dogs and cats-dogs_demo examples at the CNN level (is the CNN used differently)?

Thank you in advance for your answer!

A problem in izer.py

at izer/izer.py, line 188

    # Work with 1D input data
    if data.ndim < 3:
        data = np.expand_dims(data, axis=2)

It seems that this does not work with 1D data, only with data that is at least 2D. Handling for the remaining cases could be added to solve this.
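
The behavior can be reproduced with NumPy alone. In the sketch below, a 2-D array gains a trailing axis as intended, while a truly 1-D array makes `np.expand_dims(..., axis=2)` fail on recent NumPy versions, because axis 2 is out of range for a 2-D result. `ensure_3d` is a hypothetical variant that pads trailing axes one at a time; whether the resulting (N, 1, 1) layout is what izer expects is an assumption.

```python
import numpy as np


def ensure_3d(data):
    """Hypothetical helper: append trailing axes until the array is 3-D."""
    while data.ndim < 3:
        data = np.expand_dims(data, axis=data.ndim)
    return data


two_d = np.zeros((4, 100))
one_d = np.zeros(100)

# The original snippet works for 2-D input.
shape_from_2d = np.expand_dims(two_d, axis=2).shape

# For 1-D input the same call raises on recent NumPy (AxisError is a ValueError).
try:
    np.expand_dims(one_d, axis=2)
    one_d_failed = False
except ValueError:
    one_d_failed = True

if __name__ == "__main__":
    print("2-D input ->", shape_from_2d)
    print("1-D input fails with axis=2:", one_d_failed)
    print("ensure_3d on 1-D input ->", ensure_3d(one_d).shape)
```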

confusion for memory overwrite

Hi, I am working on the FPN detector example. If I modify my FPN so that I use the 64x80 output and drop the 4x5 output, what is the best way to set up the classification and regression memory locations for this larger set of feature maps?

For example, the comments show a mapping of the memory:

# Class predictions   : (32x40 + 16x20 + 8x10 + 4x5) * 6 * 21 = 10200 * 21
#                       0x0000 - 0xD480 (1700 x 32: wide & multi-pass)
# 0x0000 - 0xA000: 32x40x121 (&wide)
# 0xA000 - 0xC800: 16x20x121 (&wide)
# 0xC800 - 0xD200: 8x10x121 (&wide)
# 0xD200 - 0xD480: 4x5x121 (&wide)
#
# Location predictions: (32x40 + 16x20 + 8x10 + 4x5) * 6 * 4 = 10200 * 4
#                       0xD500 - 0xEF90 (1700 x 4)
#
# 0xD500 - 0xE900: 32x40x24
# 0xE900 - 0xEE00: 16x20x24
# 0xEE00 - 0xEF40: 8x10x24
# 0xEF40 - 0xEF90: 4x5x24

I am reworking this for my own scenario of dropping the 4x5 but using the 64x80 and where I have only 2 classes and same filter shapes like this:

# Class predictions   : (64x80 + 32x40 + 16x20 + 8x10) * 6 * 2 = 
#                       0x0000 - 0xD480 (6820 x 16: wide & NOT multi-pass)
# 0x0000 - 0x2800: 64x80x12 (&wide)
# 0x2800 - 0x4800: 32x40x12 (&wide)
# 0x4800 - 0x6800: 16x20x12 (&wide)
# 0x6800 - 0xD200: 8x10x12 (&wide)

I set the out_offset of the largest classification output to 0x0000, then the out_offset of the 32x40 output to 0x2800, and so on.

While this setup synthesizes, would it cause memory overwrites/corruption of the 4 outputs since the memory locations overlap?
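
One way to sanity-check such a map is to recompute the region boundaries from the feature map sizes. The sketch below infers a stride of 32 bytes per pixel for the class predictions and 4 bytes per pixel for the location predictions from the original comments, since those values reproduce the original offsets exactly. The 16 bytes per pixel used for the modified case is taken from the "x 16" note in the proposed comment and is an assumption, as are the helper names `layout` and `has_overlap`.

```python
def layout(maps, bytes_per_pixel, start=0):
    """Place feature maps back to back; return (begin, end) byte offsets."""
    regions, offset = [], start
    for height, width in maps:
        size = height * width * bytes_per_pixel
        regions.append((offset, offset + size))
        offset += size
    return regions


def has_overlap(regions):
    """True if any two (begin, end) regions intersect."""
    ordered = sorted(regions)
    return any(a_end > b_begin
               for (_, a_end), (b_begin, _) in zip(ordered, ordered[1:]))


original = [(32, 40), (16, 20), (8, 10), (4, 5)]
modified = [(64, 80), (32, 40), (16, 20), (8, 10)]

class_regions = layout(original, 32)              # reproduces 0x0000 - 0xD480
loc_regions = layout(original, 4, start=0xD500)   # reproduces 0xD500 - 0xEF90
modified_regions = layout(modified, 16)           # assumed 16 bytes per pixel

if __name__ == "__main__":
    for label, regions in [("class", class_regions), ("loc", loc_regions),
                           ("modified class", modified_regions)]:
        print(label, [(hex(b), hex(e)) for b, e in regions])
```

Under the 16 bytes per pixel assumption, the 64x80 region alone spans 0x14000 bytes, which is larger than the 0x2800 in the proposed comment. Whether the outputs actually overwrite each other depends on how the hardware strides the data, so the computed boundaries are a starting point for checking rather than a verdict.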

Problems at quantizing

Hi, there.
I have finished the training part and am now quantizing. But it gives the error TypeError: sequence item 1: expected str instance, NoneType found.
The traceback is here.
Model keys (state_dict): conv1.weight, bn1.weight, bn1.bias, bn1.running_mean, bn1.running_var, bn1.num_batches_tracked, conv2.weight, bn2.weight, bn2.bias, bn2.running_mean, bn2.running_var, bn2.num_batches_tracked, conv3.weight, fc.weight, fc.bias
Traceback (most recent call last):
  File "quantize.py", line 29, in <module>
    main()
  File "/home/vapor/code/AIoT/ai8x-synthesis/izer/quantize.py", line 297, in main
    convert_checkpoint(args.input, args.output, args)
  File "/home/vapor/code/AIoT/ai8x-synthesis/izer/quantize.py", line 158, in convert_checkpoint
    bias_name = '.'.join([layer, operation, 'bias'])
TypeError: sequence item 1: expected str instance, NoneType found
The bias exists but cannot be joined. Thanks a lot if anyone can help with this!

MAX78000 tinySSD yaml file memory question

Hello! This is my first time using the MAX78000. I saw the sample code for TinySSD, but I don't know how to calculate the memory allocation for each layer. I think CNN.C is the file that shows the data configuration of each layer.

This should correspond to the layer offset for each layer of the YAML file.
https://github.com/MaximIntegratedAI/ai8x-synthesis/blob/develop/networks/svhn-tinierssd.yaml

Below is my question

Could you please tell me how to choose the offset address for each layer? How should it be planned?
Why does layer 0 need out_offset and in_offset set at the beginning?
The data fed into layer 0 at the beginning is 3x74x74 = 16428 = 0x402C. So does the occupied memory extend to 0x50402000 + 0x402C = 0x5040602C?
Why is the layer 2 out_offset changed to 0x1000?

Thank you very much for your help.

Application for Custom Models

Hi,

My ultimate goal is to deploy a custom CNN model, which is available as a PyTorch file. The YAML file asks for the network structure used in training and for the dataset. Currently it appears to me that I need to redefine my network using the ai8x libraries and retrain my model with the Maxim tools so that I have the required files. Is that the way, or am I misreading the documentation? If I (hopefully) misread it, do you have another source or documentation that describes how to use a pre-trained model?

Thank you for the support in advance.

Test results with a pretrained Camvid

Using the gen-demos-max78000.sh script to synthesize a Camvid model on MAX78000:

python ai8xize.py --test-dir $TARGET --prefix camvid_unet --checkpoint-file trained/ai85-camvid-unet-large-fakept-q.pth.tar --config-file networks/camvid-unet-large-fakept.yaml $COMMON_ARGS --overlap-data --mlator --no-unload --max-checklines 8192 --new-kernel-loader --overwrite "$@"

and SerialLoader.py from aisegment_unet-demo to test the inference pipeline, I arrive at these predictions:

displayed by:

ax[0].imshow(img_resize1)
ax[1].imshow(colors, cmap="Greys")
ax[2].imshow(img_resize1)
ax[2].imshow(colors, cmap="Greys", alpha=0.2)

While some masks look correct (the top part), I want to ask whether the striped patterns and the overall prediction quality are expected from the trained model trained/ai85-camvid-unet-large-fakept-q.pth.tar.

--sample-input doesn't work for me

Hi,

I am trying to run the synthesis step with --sample-input. I created sample_kws_20.npy using the --save-sample 10 argument. However, this is what I get back.

sample tinierssd evaluation after synthesis throwing errors

I am running evaluation on the tinierssd weights saved in trained from the repo. However, I am getting the error, and everything pauses after:

Can someone direct me as to what I can do to fix it?

My full printout is this:

I am able to evaluate other models: cifar10, mnist. But this keeps happening for tinierssd with SVHN

Documentation improvement : end-to-end deployment

Hello,

I would like to deploy a model on the MAX78000FTHR Board.
While the documentation is really complete (thank you!), it is a bit dense and I have trouble finding relevant information.
Specifically, once the project is built, I don't know how to test it on the device.

For benchmarking purposes, I would like to deploy a set of models on the board and collect a few metrics (speed, accuracy, latency, power consumption) for each model. How can I achieve this?

Would it be possible to write a special section, or maybe a wiki, about how to deploy a model from PyTorch to the device, and get the output of the model?

Thanks for your help!

about bias quantization

When generating "deployment code" for networks that have layer weights with bitwidths different than 8, the corresponding bias values for those layers seem to get constrained to the same number of steps as the weights. For instance, a layer with 8-bit weights gets "8-bit" bias values (i.e., there are 256 different possible choices for each bias element from the dictionary {0x00, 0x01, ... 0xfe, 0xff}), but a layer with 2-bit weights gets only 4 different bias value possibilities, i.e., can only use the dictionary {0x00, 0x01, 0xfe, 0xff}, which effectively makes it 2-bit.

I've noticed this while generating code for the mixed-precision CIFAR-100 simplenet checkpoint (i.e., "ai85-cifar100-qat-mixed-q.pth.tar"). Steps to reproduce:

clone+setup ai8x-synthesis, checkout bb712b9 (latest development commit AFAIK, Apr 16 2:02 AM GMT+3)
generate code for "ai85-cifar100-qat-mixed-q.pth.tar":
./ai8xize.py --verbose --log --test-dir sdk/Examples/MAX78000/CNN --prefix cifar-100-mixed --checkpoint-file trained/ai85-cifar100-qat-mixed-q.pth.tar --config-file networks/cifar100-simple.yaml --softmax --device MAX78000 --compact-data --mexpress --timer 0 --display-checkpoint --boost 2.5
check file sdk/Examples/MAX78000/CNN/cifar-100-mixed/weights.h for the arrays "BIAS_0", 1, 2 and 3. Bias values corresponding to the 2-bit layers only take the values {0xfe, 0xff, 0x00, 0x01}, which correspond respectively to {-2,-1,0,+1}, and the case is similar for the bias values corresponding to 4-bit-weight layers (16 values there). I identified the indices of the values corresponding to 2-bit and 4-bit layers by explicitly setting various elements to 0 in the state_dict and regenerating the code, but even just looking at the bare array with the real values actually hints at what I'm trying to say.
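To make the byte values above easier to read, here is a minimal sketch that lists the signed levels an n-bit value can take and their 8-bit two's complement encodings, as they would appear in a byte array. The helper names are illustrative and not part of ai8x-synthesis.

```python
def signed_levels(bits):
    """All integers representable as a signed two's complement value of `bits` bits."""
    half = 1 << (bits - 1)
    return list(range(-half, half))


def as_byte(value):
    """Encode a small signed integer as an unsigned 8-bit two's complement byte."""
    return value & 0xFF


def bias_dictionary(bits):
    """Hex strings of the byte encodings for every `bits`-bit signed level."""
    return [f"0x{as_byte(v):02x}" for v in signed_levels(bits)]


print(signed_levels(2))         # the four 2-bit levels
print(bias_dictionary(2))       # their byte encodings
print(len(bias_dictionary(4)))  # number of 4-bit levels
```

A layer restricted to 2-bit levels can therefore only show four distinct bytes, while a full 8-bit bias could show 256.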

I think this behavior is caused by the following: in quantize.py, bias values are left shifted by 7 so that "PyTorch can still use them to run a model." The comment also states that "This needs to be reversed before loading the weights into the hardware", and this is done: when generating code, izer/izer.py calls the function load from izer/checkpoint.py, which right shifts the bias values by 7 according to the BIAS_DIV constant defined in izer/tornadocnn.py. However, this procedure causes the biases to be effectively rounded to the weight bit precision, and these values (which are, e.g., {-2,-1,0,+1} for 2-bit) are transferred to the hardware as bias. Then, as far as I understand, these bias values are left shifted by 7 before being applied to (summed with) the convolution output, and then the sum is scaled down to get it back to the original order of magnitude (as per the leftmost part of the block diagram in the User Guide, Figure 26-2). I think the right shift by 7 mentioned in the "multiplication" section of the readme.md also corresponds to this, by the way. If not, I'm very confused, because then the bias values and the weight-activation multiplication result would be off by a factor of 128.
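The round trip described above can be sketched as follows. This is not the code from quantize.py or izer/checkpoint.py; it is a simplified model that assumes the bias is first rounded and clamped to the signed range of the weight bit width, then multiplied by 128, and later divided by 128 again. The function names and the scaling by 2**(bits-1) are assumptions made for illustration.

```python
import math

BIAS_DIV = 128  # assumed to equal 2**7, matching the shift by 7 described above


def quantize_bias(bias, weight_bits):
    """Round and clamp a float bias to the weight bit range, then shift left by 7."""
    scale = 2 ** (weight_bits - 1)
    q = math.floor(bias * scale + 0.5)
    q = max(-scale, min(scale - 1, q))
    return q * BIAS_DIV


def load_bias(stored):
    """Undo the shift, as the loading step is described to do."""
    return stored // BIAS_DIV


def distinct_hw_biases(weight_bits, steps=2001):
    """Sweep float biases over [-1, 1] and collect the values that reach the hardware."""
    values = set()
    for i in range(steps):
        bias = -1.0 + 2.0 * i / (steps - 1)
        values.add(load_bias(quantize_bias(bias, weight_bits)))
    return sorted(values)


print(distinct_hw_biases(2))
print(len(distinct_hw_biases(4)), len(distinct_hw_biases(8)))
```

Under these assumptions the shift pair itself is lossless; the loss of resolution comes from the clamp to the weight bit width that happens before the shift.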

My question is: I couldn't find anything in the User Guide or in any other piece of the documentation that would suggest that this is intended behavior, but is it intended behavior? By reading the User Guide, I can't see any reason for disallowing, e.g., the use of 0x25, 0xd3 etc. (8-bit) bias values for a layer with 2-bit weights. Am I missing something? Because if this is intended, the following statement in the readme.md is not clear to me:

On MAX78000/MAX78002, weights can be 1, 2, 4, or 8 bits wide... Bias values are always 8 bits wide. Data is 8 bits wide...

18.04.2021 edit: grammar

Change face-tinierSSD from CHW to HWC

Hi there. I have a MAX78002 EVKit. I want to use the model with the camera, as on the MAX78000. When I change to HWC, I get this error in the debug output: "Data mismatch (338/623) at address 0x51800544: Expected 0xbb80f37f, read 0xbc80f37f."

I modified the YAML as follows:
processors: from 0x0000000100010001 to 0x0000000000000007
and set streaming to True.
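One way to see what this processors change does is to decode both masks into processor indices. The sketch below assumes 64 processors arranged in four groups of 16; the helper names are illustrative and not part of the izer.

```python
PROCESSORS_PER_GROUP = 16  # assumed group size


def enabled_processors(mask):
    """Indices of the bits set in a 64-bit processor mask."""
    return [i for i in range(64) if (mask >> i) & 1]


def groups_used(mask):
    """Processor groups touched by the mask, under the assumed group size."""
    return sorted({p // PROCESSORS_PER_GROUP for p in enabled_processors(mask)})


old_mask = 0x0000000100010001  # value before the change
new_mask = 0x0000000000000007  # value after the change

print(enabled_processors(old_mask), groups_used(old_mask))
print(enabled_processors(new_mask), groups_used(new_mask))
```

The old mask selects one processor in each of three groups, while the new mask selects three adjacent processors in the first group.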

Is it possible to use the camera in CHW format?

TIA

Bad Evaluation after training new KWS-Model

Hello,
I wanted to compare the results of KWS with and without MFCC calculation.
So I changed the KWS20_v3 model so that it directly supports MFCC calculations.
What I did was change __gen_datasets in KWS20 so that it generates (16000x1) instead of (128x128). During training, the
getitem function converts inp into a numpy array, calculates the MFCCs, and converts the result back into a torch tensor.

For the new model, I wrote the corresponding YAML and generated an example .npy file (for ai8xize.py).
I used all the standard kws20_v3 scripts for training, evaluation, and quantization, and only adjusted the path to the new model file and the model name.

During training, evaluation accuracy is around 60% to 80% (depending on the chosen layers).
But after running quantize.py, the evaluation accuracy drops below 10%.

I would not normally expect a model to drop that much after quantization. Might there be any other files I have to change?

And a second question:
It's said that a linear layer cannot be larger than 1023. A model with a layer of size around 800 was still causing trouble: ai8xize.py threw an error because the layer size is -3. After changing it to a size below 255 (by adding more CNN layers before it), it worked fine.
So is it true that I can't have larger layer sizes if I use quantization?

clarification on the ai87_fpndetector yaml for memory setup

In the .yaml description of the ai87_fpndetector, the intro comments describe the planned memory setup. I am confused as to some of the numbers there.

First, why use 121 and not 126 (= 6*21) in the class predictions? In the regression, you only use 24 = 6*4.

Second, I'm a bit confused about the address mapping in classification and regression. Could you please explain how the memory is mapped here? For example, for the classification address range 0x0000-0xA000, is it correct to say it is calculated as 4 bytes per address, hence 0xA000 == 40960 * 4 = 163840, which is the rounded-up version of 32*40*121 = 154880?
Also, in the regression, the numbers seem too small for the data. For example, for regression layer 4x5x64, the memory is planned for 0xF000-0xF0A0 = 640, which is much less than 4*5*64 = 1280. Similarly for others, like the 32x40x64 layer: 0xF6E0:0x10AE0 = 20480 << 32*40*64 = 81920.
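The arithmetic in this question can be reproduced with a short sketch. It only restates the poster's calculation, assuming 4 bytes per address, and compares each address span with the element count height*width*channels. It does not model how the hardware actually lays out channels in memory, and the helper names are illustrative.

```python
BYTES_PER_ADDRESS = 4  # the poster's assumption, not confirmed by the source


def span_bytes(start, end, bytes_per_address=BYTES_PER_ADDRESS):
    """Size of the address range [start, end) under the assumed bytes per address."""
    return (end - start) * bytes_per_address


def tensor_elements(height, width, channels):
    """Number of elements in a height x width x channels tensor."""
    return height * width * channels


checks = [
    ("32x40x121", 0x0000, 0xA000, (32, 40, 121)),
    ("4x5x64", 0xF000, 0xF0A0, (4, 5, 64)),
    ("32x40x64", 0xF6E0, 0x10AE0, (32, 40, 64)),
]
for name, start, end, shape in checks:
    print(name, span_bytes(start, end), tensor_elements(*shape))
```

The first span is larger than its element count, while the two 64-channel spans are smaller, which is the discrepancy the question is about.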

What is the recommended way to define these memory allocations in a systematic fashion especially if I modify the filter shapes eg increase filter sizes like changing 32x40 to 40x80 as when I setup the model with a different image input size? The memory locations seem to be referenced in the nms.c that is called in the sdk examples here L222. So any changes would then need to be propagated to the example to make it work for a new but similar model.

# Model Outputs:
#
# Class predictions   : (32x40 + 16x20 + 8x10 + 4x5) * 6 * 21 = 10200 * 21
#                       0x0000 - 0xD480 (1700 x 32: wide & multi-pass)
#
# 0x0000 - 0xA000: 32x40x121 (&wide) 
# 0xA000 - 0xC800: 16x20x121 (&wide)
# 0xC800 - 0xD200: 8x10x121 (&wide)
# 0xD200 - 0xD480: 4x5x121 (&wide)
#
# Location predictions: (32x40 + 16x20 + 8x10 + 4x5) * 6 * 4 = 10200 * 4
#                       0xD500 - 0xEF90 (1700 x 4)
#
# 0xD500 - 0xE900: 32x40x24
# 0xE900 - 0xEE00: 16x20x24
# 0xEE00 - 0xEF40: 8x10x24
# 0xEF40 - 0xEF90: 4x5x24
#
#
# FPN_out_4_5  : 0xF000-0xF0A0 (4x5x64, gap:1, protect after Layer 34 until x)
# FPN_out_8_10 : 0xF0A0-0xF1E0 (8x10x64, protect after Layer 37 until x)
# FPN_out_16_20: 0xF1E0-0xF6E0 (16x20x64, protect after Layer 40 until x)
# FPN_out_32_40: 0xF6E0-0x10AE0 (32x40x64, protect after Layer 43 until x)
#

Chaining of Linear Layers

Hi there!

I am currently working on a project with the goal to port an existing model onto the MAX78000 platform due to its advanced convolution abilities. Everything works just fine except the last two layers of the net which consist of two linear layers back to back.

According to the readme.md the chaining of linear layers is possible when omitting the 'flatten' step. I am assuming that the first Linear Layer requires a 'flatten' operation but the following linear layers do not, is that correct? Sadly I did not find any examples with more than one linear layer in series.
Anyway I tried a lot of different setups using different values for flatten, in_dim, in_channel etc. but it does not seem to work. It seems like the second linear layer has trouble with the output of the first linear layer which comes as 1x1 features - m channels. The output of the ai8xize.py script is as follows:

Configuring device: MAX78000
Reading ../../network_path.yaml to configure network...
Reading ../../pth_path.tar to configure network weights...
Checkpoint for epoch 1, model pth - weight and bias data:
 InCh OutCh  Weights         Quant Shift  Min  Max   Size Key                                       Bias       Quant  Min  Max Size Key
   16    20  (320, 1)            8    -2 -126  127    320 Convs.0.weight                            (20,)          8  -86  115   20 Convs.0.bias
   20    20  (400, 2)            8 N/A    -81   81    800 Convs.1.weight                            (20,)          8  -77   77   20 Convs.1.bias
   20    20  (400, 2)            8    -2  -81   81    800 Convs.2.weight                            (20,)          8  -68   71   20 Convs.2.bias
   20    20  (400, 2)            8 N/A    -81   81    800 Convs.3.weight                            (20,)          8  -68   77   20 Convs.3.bias
  100    32  (1, 32, 100)        8 N/A   -102  102   3200 Lins.0.weight                             (32,)          8  -95  102   32 Lins.0.bias
   32    10  (1, 10, 32)         8    -2  -89   90    320 Lins.1.weight                             (10,)          8  -78   82   10 Lins.1.bias
TOTAL: 6 parameter layers, 6,362 parameters, 6,362 bytes
Configuring data set: Dataset.
prefix...
WARNING: --overwrite specified, writing to sdk/Examples/MAX78000/CNN/prefix even though it exists.
ERROR: The configured kernel dimensions (1x1) for layer 11 do not match the weights file (32x10)!

with the last three layers in the yaml file looking as follows:

# Layer 9: add
- operation: Add
  in_offset: 0x0000
  out_offset: 0x2000
  in_sequences: [7, 8]
  processors: 0xfffff00000000
  output_processors: 0xfffff00000000

# Layer 10: Flatten + Linear
- operation: MLP
  flatten: true
  activation: ReLU
  in_offset: 0x2000
  out_offset: 0x0000
  processors: 0x000fffff

# Layer 11: Linear
- operation: MLP
  activation: None
  out_offset: 0x2000
  output_width: 32
  processors: 0xffffffff00000000

It is very strange, since the 'flatten' before the first layer also rearranges the features (features arranged in HxWxC and not only in 1x1xm, where m = C*W*H) and it matches the dimensions of the weights without a problem, while in the second layer it does not.

Any advice is highly appreciated.

Synthesis accuracy issues

Issue: for the same input in Q8 mode [-128;127]: ai8x.set_device(87, True, False) (PyTorch prediction) =/= Synthesis (KAT)

Pytorch step

The provided u-net was trained with quant-aware training to perform segmentation.
The QAT weights were converted to the format q8 with the provided tool.
The Q8 weights were evaluated using code similar to:

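import torch

import ai8x  # from the ai8x-training repository

# ai87unet is the poster's model class; its import and arguments are elided in the original.
model = ai87unet(...)

class Args:
    """Minimal stand-in for the command-line arguments."""
    def __init__(self, act_mode_8bit):
        self.act_mode_8bit = act_mode_8bit
        self.truncate_testset = False

args = Args(act_mode_8bit=True)
ai8x.set_device(87, True, False)  # second argument True: simulate the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load('qat_best_q8.pth.tar', map_location=device)
state_dict = checkpoint['state_dict']
ai8x.fuse_bn_layers(model)
model.load_state_dict(state_dict, strict=True)
model = model.to(device)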

This code successfully simulates the device with full 8-bit integers; the output is in the range [-128;127] and the accuracy is maintained.

Synthesis step

I observed that the synthesis step does not generate the KAT by using Pytorch, but it runs the input through a custom code. This code generates the final output, which is then computed on the device (KAT) and this test is passed successfully. Therefore the device computes exactly what is expected.

Issue encountered

The main problem arises when analyzing the KAT output. The synthesis is producing an output that is by factors worse than the output (prediction) of PyTorch in Q8 mode (for the same input).

Was something similar ever observed?
I can provide more information if needed. Thank you for your help.

Passthrough connections in Unet

Based on the Camvid example and the corresponding camvid-unet-large.yaml, having a Unet with 3 concatenation operations your implementation suggests using a single passthrough layer at the bottleneck:

  # Layer 7: pt
  - in_offset: 0x5000
    out_offset: 0x4004
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: None
    write_gap: 1
    in_sequences: [5]
  # Layer 8: upconv3
  - in_offset: 0x6000
    out_offset: 0x4000
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    activate: None
    write_gap: 1
    in_sequences: [6]
  # Layer 9: dec3
  - out_offset: 0x2000
    in_offset: 0x4000
    processors: 0x00ffffffffffffff
    output_processors: 0x00ffffffffffffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [8, 7]
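The offsets in Layers 7 and 8 (0x4004 and 0x4000, both with write_gap: 1) suggest that the two outputs are interleaved word by word. The sketch below illustrates that reading under two assumptions that are not confirmed here: each output word is 4 bytes, and write_gap: 1 skips one word after every written word. The function name is illustrative.

```python
WORD_BYTES = 4  # assumed size of one output word


def write_addresses(out_offset, count, write_gap=0):
    """Byte offsets of the first `count` output words for a layer."""
    stride = (write_gap + 1) * WORD_BYTES
    return [out_offset + i * stride for i in range(count)]


pt = write_addresses(0x4004, 4, write_gap=1)      # Layer 7: pt
upconv = write_addresses(0x4000, 4, write_gap=1)  # Layer 8: upconv3

print([hex(a) for a in pt])
print([hex(a) for a in upconv])
print([hex(a) for a in sorted(pt + upconv)])
```

With these assumptions the two layers never write to the same word, and together they fill a contiguous region starting at 0x4000, which Layer 9 then reads with in_offset: 0x4000.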

Could you explain the reasoning behind not adding passthrough layers to do every concatenation in this network? In which cases do I need to use camvid-unet-large-fakept.yaml & izer/add_fake_passthrough.py instead?

Given the following Unet definition for a regression task, could you help me understand why using multiple passthrough layers prevents it from running inference properly?

---
arch: unetmedium
dataset: customdataset

layers:
  # Layer 0: enc1
  - out_offset: 0x4000
    processors: 0x0000.0000.0000.0007
    data_format: HWC
    output_processors: 0x0f00.0000.0000.0000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
  # Layer 1: enc2
  - out_offset: 0x4000
    processors: 0x0f00.0000.0000.0000
    output_processors: 0x0000.0000.0000.ff00
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 2: enc3
  - out_offset: 0x0000
    processors: 0x0000.0000.0000.00ff
    output_processors: 0xffff.ffff.0000.0000
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 3: bneck
  - out_offset: 0x6000
    processors: 0xffff.ffff.0000.0000
    output_processors: 0xffff.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    max_pool: 2
    pool_stride: 2
    activate: ReLU
  # Layer 4: pt
  - in_offset: 0x0000
    out_offset: 0x4000
    processors: 0xffff.ffff.0000.0000
    output_processors: 0xffff.ffff.0000.0000
    operation: None
    write_gap: 1
    in_sequences: [2]
  # Layer 5: upconv3
  - in_offset: 0x6000
    out_offset: 0x4004
    processors: 0xffff.ffff.ffff.ffff
    output_processors: 0x0000.0000.ffff.ffff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    activate: None
    write_gap: 1
    in_sequences: [3]
  # Layer 6: dec3
  - in_offset: 0x4000
    out_offset: 0x2000
    processors: 0xffff.ffff.ffff.ffff
    output_processors: 0x0fff.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [5, 4]
  # Layer 7: pt
  - in_offset: 0x4000
    out_offset: 0x4000
    processors: 0x000.0000.0000.0ff00
    output_processors: 0x000.0000.0000.0ff00
    operation: None
    write_gap: 1
    in_sequences: [1]
  # Layer 8: upconv2
  - in_offset: 0x2000
    out_offset: 0x4004
    processors: 0x0fff.ffff.ffff.ffff
    output_processors: 0x0000.0000.0000.00ff
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    write_gap: 1
    in_sequences: [6]
    activate: None
  # Layer 9: dec2
  - out_offset: 0x2000
    in_offset: 0x4000
    processors: 0x0000.0000.0000.ffff
    output_processors: 0x0000.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [8, 7]
  # Layer 10: pt
  - in_offset: 0x4000
    out_offset: 0x0000
    processors: 0x0f00.0000.0000.0000
    output_processors: 0x0f00.0000.0000.0000
    operation: None
    write_gap: 1
    in_sequences: [0]
    name: pt3
  # Layer 11: upconv1
  - in_offset: 0x2000
    out_offset: 0x0004
    processors: 0x0000.ffff.ffff.ffff
    output_processors: 0x00f0.0000.0000.0000
    operation: convtranspose2d
    kernel_size: 3x3
    pad: 1
    write_gap: 1
    activate: None
    in_sequences: [9]
  # Layer 12: dec1
  - in_offset: 0x0000
    out_offset: 0x4000
    processors: 0x0ff0.0000.0000.0000
    output_processors: 0x0000.ffff.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    in_sequences: [11, 10]
  # Layer 13: dec0
  - out_offset: 0x0000
    processors: 0x0000.ffff.ffff.ffff
    output_processors: 0x0000.0000.ffff.ffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
  # Layer 14: conv
  - out_offset: 0x4000
    processors: 0x0000.0000.ffff.ffff
    output_processors: 0x0000000000000001
    operation: conv2d
    kernel_size: 1x1
    pad: 0
    activate: None
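When reviewing a definition like the one above, it can help to check the processor masks mechanically. The sketch below covers only the first four sequential layers and assumes, as a working rule that is not confirmed here, that a layer's processors should equal the output_processors of the layer feeding it. The helper names are illustrative, and a flagged layer is only a candidate for review, not a confirmed cause of the inference problem.

```python
def parse_mask(text):
    """Parse a hex mask written with optional dot separators, e.g. 0x0000.0000.0000.ff00."""
    return int(text.replace(".", ""), 16)


# Masks copied from Layers 0 to 3 of the definition above.
layers = {
    0: {"processors": "0x0000.0000.0000.0007",
        "output_processors": "0x0f00.0000.0000.0000"},
    1: {"processors": "0x0f00.0000.0000.0000",
        "output_processors": "0x0000.0000.0000.ff00"},
    2: {"processors": "0x0000.0000.0000.00ff",
        "output_processors": "0xffff.ffff.0000.0000"},
    3: {"processors": "0xffff.ffff.0000.0000",
        "output_processors": "0xffff.ffff.ffff.ffff"},
}


def sequential_mismatches(layer_table):
    """Layers whose processors differ from the previous layer's output_processors."""
    flagged = []
    for idx in sorted(layer_table):
        if idx - 1 not in layer_table:
            continue
        produced = parse_mask(layer_table[idx - 1]["output_processors"])
        consumed = parse_mask(layer_table[idx]["processors"])
        if produced != consumed:
            flagged.append(idx)
    return flagged


print(sequential_mismatches(layers))
```

Layers that use in_sequences would need the same comparison against the union of their inputs' output_processors, which this sketch leaves out.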

Synthesis Questions

A few things about synthesis are unclear to me.
First of all, why do we run this step?
(ai8x-synthesis) $ python quantize.py proj/qat_best.pth.tar proj/proj_q8.pth.tar --device MAX78000

Second:
python ai8xize.py --verbose --test-dir demos --prefix ai85-kws20 --checkpoint-file proj/proj_q8.pth.tar --config-file networks/kws20-hwc.yaml --device MAX78000 --compact-data --mexpress --softmax
This creates the files cnn.c, cnn.h, weights.h, and softmax.c. But how are these files created? I want to understand. And what do these files mean?

Preparing a pre-trained Keras model for deployment

Currently I am trying to deploy a CNN model onto MAX78000 that has been trained using Keras (*.h5). I converted the model to ONNX and extracted the architectural description to YAML.
I then tried generating the C files using ai8xize.py, however I am getting errors that my YAML file contains unknown keys. Comparing the YAML file of my model (generated from original Keras model) and some of the YAML sample files in the folder ai8x-synthesis/networks, the architecture description looks very different.
How can I obtain the architecture description as expected by ai8xize from my Keras model?

It would be great if I do not have to retrain the CNN model using PyTorch to ensure comparability towards other platforms, where the toolchain is Keras-based.

Another question: since TensorFlow >=2.6.0 no longer supports YAML, only JSON, will the toolchain be adapted?

Kind regards,
asti205

Synthesis of models trained with Pytorch Quantization API

What would be the best way to synthesize models obtained with quantization-aware training via the PyTorch quantization API?