pulp-platform / snitch_cluster
An energy-efficient RISC-V floating-point compute cluster.
Home Page: https://pulp-platform.github.io/snitch_cluster/
License: Apache License 2.0
Please add +define+COMMON_CELLS_ASSERTS_OFF in target/common/common.mk.
Env: docker (main)
Run:
cd /repo/target/snitch_cluster
make bin/snitch_cluster.vlt
This will report an ifndef COMMON_CELLS_ASSERTS_OFF error for the common cells.
fseq_fpu_yield is currently calculated as (fpss_fpu_issues / snitch_fseq_offloads) / fpss_fpu_rel_issues:
snitch_cluster/util/trace/gen_trace.py, lines 750 to 753 in 699d404
fpss_fpu_rel_issues is in turn calculated as fpss_fpu_issues / fpss_issues:
snitch_cluster/util/trace/gen_trace.py, lines 736 to 737 in 699d404
It follows that fseq_fpu_yield = fpss_issues / snitch_fseq_offloads, which equals fseq_yield by definition:
snitch_cluster/util/trace/gen_trace.py, lines 748 to 749 in 699d404
I believe the original intent was for fseq_fpu_yield to represent the FREP yield of FPU-proper instructions only. In that case we could simply calculate fseq_fpu_yield = fpss_fpu_issues / snitch_fseq_offloads. Perhaps even better, we could rectify snitch_fseq_offloads to count only instructions destined for the FPU proper.
I'm not sure of the usefulness of this metric altogether, so perhaps we could also just remove the duplicate fseq_fpu_yield.
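To make the collapse explicit, here is a small numeric sketch with made-up counter values (the variable names mirror the metrics in gen_trace.py; the numbers themselves are illustrative, not from a real trace):

```python
# Made-up counter values; names mirror the metrics in gen_trace.py.
snitch_fseq_offloads = 200
fpss_issues = 180
fpss_fpu_issues = 90

# Current computation:
fpss_fpu_rel_issues = fpss_fpu_issues / fpss_issues  # 0.5
fseq_fpu_yield = (fpss_fpu_issues / snitch_fseq_offloads) / fpss_fpu_rel_issues

# The fpss_fpu_issues terms cancel, leaving fseq_yield:
fseq_yield = fpss_issues / snitch_fseq_offloads  # 0.9
assert abs(fseq_fpu_yield - fseq_yield) < 1e-12

# Proposed fix: restrict the numerator to FPU-proper issues.
fseq_fpu_yield_fixed = fpss_fpu_issues / snitch_fseq_offloads  # 0.45
```

Whatever the counter values, the current formula always reproduces fseq_yield, which is why the metric is redundant as computed today.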
I have built the Docker image and cloned the snitch_cluster repo in /repo. Then, in target/snitch_cluster, I ran the command make bin/snitch_cluster.vlt, for which I get:
work-vlt/Vtestharness.h:11:10: fatal error: verilated_heavy.h: No such file or directory
What am I doing wrong here?
Thanks in advance.
Parameter ssr_nr_credits is defined in the default HW configuration. However, this parameter is actually not used anywhere; the correct name for this parameter would be data_credits. Regardless of what value is set for ssr_nr_credits, the default for data_credits defined in the schema file is used:
snitch_cluster/docs/schema/snitch_cluster.schema.json, lines 576 to 581 in e90dceb
To avoid running into the same situation in the future, the HW configuration file should be validated against the schema upon hardware generation, producing an error if some parameter is not defined in the schema. This is done to some extent in the Generator class:
snitch_cluster/util/clustergen/cluster.py, lines 73 to 80 in e90dceb
But only against the root schema, which doesn't include all parameters. The ssr_nr_credits parameter is in the remote schema for the SnitchClusterTB class:
snitch_cluster/util/clustergen/cluster.py, lines 357 to 369 in e90dceb
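As a minimal sketch of the kind of check that could catch this (a hypothetical helper, not the actual Generator API), generation could flag any config key that the schema does not declare:

```python
def find_unknown_params(cfg: dict, schema_props: dict) -> list:
    """Return config keys that are not declared in the schema's properties."""
    return [key for key in cfg if key not in schema_props]

# Hypothetical schema excerpt: only data_credits is declared.
schema_props = {"data_credits": {"type": "number", "default": 4}}
cfg = {"ssr_nr_credits": 8}  # misspelled parameter, silently ignored today

unknown = find_unknown_params(cfg, schema_props)
assert unknown == ["ssr_nr_credits"]  # generation should abort with an error here
```

With a JSON-Schema validator, setting additionalProperties to false in the (merged) schemas would achieve the same effect; the subtlety noted above is that the check must also cover the remote schemas, not only the root one.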
After the change of @SamuelRiedel in pulp-platform/snitch#69, I get assertion failures:
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[8].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 350 ns Started: 349 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[8].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[0].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 361 ns Started: 360 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[0].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[4].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 418 ns Started: 417 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[4].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
Can you reproduce this using the attached binary? Is it a concern?
Due to the handling of atomics, there is a potential for deadlocking the DMA network if the DMA can't issue any more writes because the reads are stalled on axi_demux.
Since we have two dedicated networks, a narrow one (usually single-core requests) and a wide one (burst-type transfers), we do not necessarily need atomic support on the wide network. With atomics disabled, the axi_demux can drop the dependency between read and write channels.
See pulp-platform/snitch#116 for a reference where this happened, plus a short-term mitigation by increasing the transaction buffers.
Following from the discussion in the fork from KU Leuven, I would like to have support for Verilator 5. Verilator 5 has better performance, support for timing constructs, and support for assertions. It would also allow projects using the Snitch cluster to use Verilator 5. I think in general it would be useful to support more recent versions of tools, as this makes the upgrading process later down the line a lot easier (make small steps instead of big ones).
Upgrading to Verilator 5 would involve the following steps:
- Add the -no-timing flag to the current tests, which disables the new timing features in Verilator 5
- Add verilated_threads.o to the Verilator build targets, OR compile both the SystemVerilog and C++ sources using Verilator, so Verilator can manage which Verilator files should be included
If you have any questions or comments, please let me know!
During a dry run, the Simulation object will not create any process upon launch:
snitch_cluster/util/sim/Simulation.py, lines 26 to 44 in 8cae8d2
This results in an error when successful() is invoked, as on a CustomSimulation:
snitch_cluster/util/sim/Simulation.py, lines 163 to 164 in 8cae8d2
return self.process.returncode == 0
AttributeError: 'NoneType' object has no attribute 'returncode'
This can be solved in two ways:
- Treat dry runs separately in sim_utils.py
- Handle dry runs in the Simulation class, assuming that they complete immediately and are always successful
The second option is preferable for reuse. Then there is no need to treat the completion of dry runs separately in sim_utils.py:
snitch_cluster/util/sim/sim_utils.py, line 144 in 8cae8d2
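A minimal sketch of the second option, using a simplified stand-in for the Simulation class (not the actual Simulation.py API): a dry run never spawns a process, so the status methods treat it as an immediately completed, successful run instead of dereferencing a None process.

```python
import subprocess

class Simulation:
    def __init__(self, cmd, dry_run=False):
        self.cmd = cmd
        self.dry_run = dry_run
        self.process = None  # stays None for dry runs

    def launch(self):
        if self.dry_run:
            print(" ".join(self.cmd))  # only echo the command
            return
        self.process = subprocess.Popen(self.cmd)

    def completed(self):
        # Dry runs complete immediately.
        return self.dry_run or (self.process is not None
                                and self.process.poll() is not None)

    def successful(self):
        if self.dry_run:
            return True  # dry runs are assumed to always succeed
        return self.process is not None and self.process.returncode == 0

sim = Simulation(["make", "all"], dry_run=True)
sim.launch()
assert sim.completed() and sim.successful()  # no AttributeError
```

Keeping the dry-run logic inside the class means callers such as run_simulations() need no special casing at all.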
By default, run_simulations() runs all tests under the same run directory as specified by the run_dir argument, creating a unique subdirectory for each simulation based on the test name, if there is more than one test:
snitch_cluster/util/sim/sim_utils.py, lines 134 to 142 in 8cae8d2
I followed the tutorial and used Docker to run the Verilator simulation, with the software code sw/apps/dnn/flashattention_2. However, I encountered the following error during the simulation:
VCD wave generation enabled
[fesvr] Wrote 36 bytes of bootrom to 0x1000
[fesvr] Wrote entry point 0x80000000 to bootloader slot 0x1020
[fesvr] Wrote 56 bytes of bootdata to 0x1024
[Tracer] Logging Hart 8 to logs/trace_hart_00000008.dasm
[Tracer] Logging Hart 0 to logs/trace_hart_00000000.dasm
[Tracer] Logging Hart 1 to logs/trace_hart_00000001.dasm
[Tracer] Logging Hart 2 to logs/trace_hart_00000002.dasm
[Tracer] Logging Hart 3 to logs/trace_hart_00000003.dasm
[Tracer] Logging Hart 4 to logs/trace_hart_00000004.dasm
[Tracer] Logging Hart 5 to logs/trace_hart_00000005.dasm
[Tracer] Logging Hart 6 to logs/trace_hart_00000006.dasm
[Tracer] Logging Hart 7 to logs/trace_hart_00000007.dasm
[Illegal Instruction Core 0] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 1] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 2] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 6] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 4] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 0] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 1] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 2] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 6] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 4] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 0] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 1] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 2] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 6] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 4] PC: 000000000000 Data: 00000000
......
I checked the instruction at address 8000b140, and the content is as follows:
8000b140: 53 71 31 18 fdiv.s ft2, ft2, ft3
How can I solve this error and run the simulation correctly? I encountered similar errors in other test cases (such as gelu and softmax).
Thank you very much!
The TB for hw/ip/reqrsp_interface
was never adjusted to the AMO fixes merged with pulp-platform/snitch#149.
The output of cycle-accurate simulation for this code is not correct:
#include <snrt.h>
#include <printf.h>

#define f64 double
#define i32 int
#define B 2
#define N 32

void __attribute__((noinline)) my_func(double* x, double* y) {
    for (int n = 0; n < N; n++) {
        for (int b = 0; b < B; b++) {
            x[b * N + n] *= y[b];
        }
    }
    for (int b = 0; b < B; b++) {
        y[b] = 0;
    }
}

int main() {
    unsigned tid = snrt_cluster_core_idx();
    if (tid != 0) {
        return 0;
    }
    double* y = (f64*) snrt_l1alloc(B * sizeof(f64));
    double* x = (f64*) snrt_l1alloc(B * N * sizeof(f64));
    double* z = (f64*) snrt_l1alloc(B * N * sizeof(f64));
    y[0] = 3.0;
    y[1] = 2.0;
    for (int n = 0; n < N; n++) {
        for (int b = 0; b < B; b++) {
            x[b * N + n] = n + 1;
            z[b * N + n] = (n + 1) * y[b];
        }
    }
    my_func(x, y);
    i32 ok = 1;
    for (int i = 0; i < B * N; i++) {
        if ((x[i] - z[i]) * (x[i] - z[i]) > 1e-3) {
            printf("Error: mismatch at dst, %d, %f (computed) != %f (expected) \n", (int)i, (double)x[i], (double)z[i]);
            ok = 0;
            break;
        }
    }
    if (ok) {
        printf("success, exiting...\n");
        return 0;
    } else {
        printf("FAILURE, exiting...\n");
        return 1;
    }
}
Observed output:
Error: mismatch at dst, 31, 0.000000 (computed) != 96.000000 (expected)
The issue is suspected to come from the lack of synchronization between the INT and FPU units. It can be seen from the assembly (https://godbolt.org/z/z3oEz4aen) that no synchronization is even supposed to happen:
my_func: # @my_func
fld ft0, 0(a1) # everything below goes to FPU
fld ft1, 0(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 0(a0)
fld ft0, 8(a1)
fld ft1, 32(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 32(a0)
fld ft0, 0(a1)
fld ft1, 8(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 8(a0)
fld ft0, 8(a1)
fld ft1, 40(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 40(a0)
fld ft0, 0(a1)
fld ft1, 16(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 16(a0)
fld ft0, 8(a1)
fld ft1, 48(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 48(a0)
fld ft0, 0(a1)
fld ft1, 24(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 24(a0)
fld ft0, 8(a1)
fld ft1, 56(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 56(a0)
sw zero, 12(a1) # everything below goes to INT
sw zero, 8(a1)
sw zero, 4(a1)
sw zero, 0(a1)
ret
Currently, the load-store queue in Snitch is quite limited and, since the addition of store response handling, easily becomes full. The idea would be to compress back-to-back stores, as we are not interested in the actual value but just in the fact that we have an outstanding store.
I think something like a compressable_fifo would be a good start, where elements of the same type pushed back-to-back could increment a counter instead of occupying an actual queue item. Critical paths need to be checked as well.
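A behavioral sketch of the idea (the compressable_fifo itself would of course be RTL; the class name and interface here are hypothetical): identical back-to-back pushes increment a repeat counter on the tail entry instead of occupying a new slot.

```python
from collections import deque

class CompressibleFifo:
    def __init__(self):
        self._q = deque()  # entries are [value, count]

    def push(self, value):
        if self._q and self._q[-1][0] == value:
            self._q[-1][1] += 1  # compress into the tail entry
        else:
            self._q.append([value, 1])

    def pop(self):
        value, count = self._q[0]
        if count > 1:
            self._q[0][1] -= 1  # one repeat consumed
        else:
            self._q.popleft()
        return value

    def slots_used(self):
        return len(self._q)

f = CompressibleFifo()
for _ in range(8):
    f.push("store_ack")  # eight identical back-to-back store responses
assert f.slots_used() == 1  # one physical slot instead of eight
assert f.pop() == "store_ack"
```

For store responses the "value" degenerates to a single token, so in the extreme the queue entry is just a counter of outstanding stores.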
Conversion from double to single precision (fcvt.s.d) isn't working when used in combination with SSRs. It hangs indefinitely on Verilator but works on Banshee.
This small test demonstrates the issue.
test_fcvt.zip
The new Perfetto UI does not fully support traces in the legacy TraceViewer JSON format.
https://perfetto.dev/docs/faq#why-does-perfetto-not-support-lt-some-obscure-json-format-feature-gt-
Users are recommended to emit TrackEvent instead, Perfetto's native trace format.
This guide explains how to represent common JSON events using TrackEvent.
The generated documentation lacks some proper titles. That should be fixed.
crt0 currently doesn't initialize the .bss section. A mutex placed in .bss (e.g. clint_mutex in interrupt.c) is therefore in an uninitialized state, leading to deadlocks. A fix to this should also conform to multi-cluster systems which are not participating in the cluster-wide barrier at the end of crt0.
Opening this issue as a reference to the previous PR #7
The link to the Snitch paper in this section of the documentation is broken: https://pulp-platform.github.io/snitch_cluster/rm/custom_instructions.html#xfrep-extension-for-floating-point-repetition
Link to the smallFloat ISA extension specification in Snitch's custom instructions documentation.
A draft of the documentation can be found at this link: https://gist.github.com/nazavode/5d804bfc2f7cb7d6a5da99ce48381593
It should be included in the repository, possibly in the source code, or in an alternative easily maintainable form.
Hi, I really like the documentation for this project.
However, the documentation does not seem to mention the "Build MUSL dependency" step which is executed in the CI:
- name: Build MUSL dependency
  run: |
    cd sw/deps
    mkdir install
    cd musl
    CC=$LLVM_BINROOT/clang ./configure --disable-shared \
      --prefix=../install/ --enable-wrapper=all \
      CFLAGS="-mcpu=snitch -menable-experimental-extensions"
    make -j4
    make install
    cd ../../../
- name: Build Software
  run: |
    make -C target/snitch_cluster sw
The Batchnorm and Maxpool layers are included in the CI, testing that no error occurs during simulation, but the results are not verified.
I am currently following the tutorial for the Snitch cluster at https://pulp-platform.github.io/snitch_cluster/ug/tutorial.html and reached the debugging/benchmarking step. However, when trying to analyze the performance, it seems that the tstart and tend metrics are always 0. This value persists from the DASM file to the text, CSV, and JSON results of the other benchmarking steps.
The command I ran: bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf
The resulting files are attached.
trace_hart_00000000.dasm.txt
hart_00000000_perf.json
perf.csv
event.csv
Running on docker on Linux, amd64:
$ uname -a
Linux b0f761a4bb94 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Tested running on
Things I tried changing:
Do not hardcode the barrier CSR address, but get it from riscv-opcodes' encoding.h:
snitch_cluster/sw/snRuntime/src/sync.h, lines 57 to 60 in 8a020de
Hi,
I cloned this repo and tried: make
This checks out a nonfree submodule first, which asks for a username and password.
Any help?
Update both Bender.lock
and Bender.yml
files.
Due to a bug in util/clustergen/cluster.py it is currently impossible to instantiate a cluster with no xssr cores. This is because num_ssrs_max must be equal to at least 1 for the following template to produce valid SystemVerilog:
localparam snitch_ssr_pkg::ssr_cfg_t [${cfg['num_ssrs_max']}-1:0] SsrCfgs [${cfg['nr_cores']}] = '{
...
}
A simple one-line fix to cluster.py appears to solve the issue:
self.cfg['num_ssrs_max'] = max(len(core['ssrs']) for core in cores)
# to
self.cfg['num_ssrs_max'] = max(max(len(core['ssrs']) for core in cores), 1)
I don't know enough about how the xssr configuration works to determine if my fix is sufficient -- I haven't managed to get the CI test suite to fully pass. At least the following fails:
/workspaces/stitch_cluster/target/snitch_cluster/sw/tests/build/alias.elf test failed
Given that xssr is disabled, it seems reasonable that some tests might fail. Unfortunately, my laptop doesn't have enough memory to run the full suite in a reasonable amount of time.
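A toy reproduction of the num_ssrs_max computation illustrates the clamp (the dict shape is a simplified stand-in for the real cluster config):

```python
# A cluster whose cores have no SSRs configured.
cores = [{"ssrs": []}, {"ssrs": []}]

# Current computation: yields 0, which breaks the SsrCfgs template.
num_ssrs_max = max(len(core["ssrs"]) for core in cores)
assert num_ssrs_max == 0

# Proposed fix: clamp to 1 so the generated parameter range stays valid.
num_ssrs_max_fixed = max(max(len(core["ssrs"]) for core in cores), 1)
assert num_ssrs_max_fixed == 1
```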
The iis-setup.sh script currently downloads the spike-dasm binary built for AlmaLinux:
Lines 31 to 32 in d6b7f25
Pisoc machines use CentOS 7.8, and the AlmaLinux spike-dasm build is thus incompatible with them.
The make sw target raises several warnings, which should be cleaned up.
The memory map documentation is no longer up to date after introducing the zero memory. The whole Snitch cluster hardware documentation could be replaced with the relevant section from our internal Occamy documentation.
The trace output by gen_trace.py contains performance metrics at the end of the trace, e.g.:
## Performance metrics
Performance metrics for section 0 @ (517, 8635):
tstart 3760.0000
snitch_loads 7
snitch_stores 25
tend 11879.0000
fpss_loads 0
snitch_avg_load_latency 79.4286
snitch_occupancy 0.0299
snitch_fseq_rel_offloads 0.1164
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 0
fpss_avg_fpu_latency 2.0
fpss_avg_load_latency 0
fpss_occupancy 0.0039
fpss_fpu_occupancy 0.0039
fpss_fpu_rel_occupancy 1.0
cycles 8119
total_ipc 0.0339
When passing the trace through the annotate.py script, the performance metrics section is garbled:
## Performance tion 0 @ (517, 8635):
tstart 3760.0000
snitch_loads 7
snitch_stores 25
tend 11879.0000
fpss_loads 0
snitch_avg_load_latency 79.4286
snitch_occupancy 0.0299
snitch_fseq_rel_offloads 0.1164
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 0
fpss_avg_fpu_latency 2.0
fpss_avg_load_latency 0
fpss_occupancy 0.0039
fpss_fpu_occupancy 0.0039
fpss_fpu_rel_occupancy 1.0
cycles 8119
total_ipc 0.0339
The annotate.py script should be extended to ignore the final part of the trace, preserving the performance metrics.
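A sketch of the proposed behavior (the helper and marker handling here are hypothetical, not the actual annotate.py internals): once the performance-metrics footer starts, pass lines through verbatim instead of annotating them.

```python
METRICS_MARKER = "## Performance metrics"

def annotate_trace(lines, annotate_line):
    """Annotate trace lines, but pass the metrics footer through verbatim."""
    out = []
    in_metrics = False
    for line in lines:
        if line.startswith(METRICS_MARKER):
            in_metrics = True  # everything from here on is the footer
        out.append(line if in_metrics else annotate_line(line))
    return out

trace = ["0 1000 ... fadd.d", METRICS_MARKER, "tstart 3760.0000"]
result = annotate_trace(trace, lambda l: l + "  # src annotation")
assert result[1] == METRICS_MARKER       # footer header preserved verbatim
assert result[2] == "tstart 3760.0000"   # metrics no longer garbled
```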
Is this comment still relevant? It seems the source line does contain a unit.
I want to simulate sw/apps/blas/gemm/build/gemm.elf with src/banshee.yaml using Banshee, but encountered the following errors:
ERROR banshee::tran > Unsupported instruction 0x800027b8: <illegal 0xf60b80d3>
ERROR banshee::tran > Unsupported instruction 0x800027c0: <illegal 0xe6010bd3>
ERROR banshee::tran > Unsupported instruction 0x800027e0: <illegal 0xf60c01d3>
ERROR banshee::tran > Unsupported instruction 0x800027e4: <illegal 0xf6078253>
ERROR banshee::tran > Unsupported instruction 0x800027e8: <illegal 0xf60b82d3>
ERROR banshee::tran > Unsupported instruction 0x800027f4: <illegal 0xe6028bd3>
......
All these unsupported instructions are fmv.b.x and fmv.x.b. How can I resolve these errors?
By the way, Banshee's README suggests using .bin files as input, but the snitch_cluster tutorial uses .elf files as input for Banshee. What's the difference? And how do I correctly run a Banshee simulation for the SW test cases in snitch_cluster? I tried using llvm-objcopy to convert the .elf file into a .bin file, but the generated .bin file could not be properly read by Banshee.
When used in the FlashAttention-2 layer, the GEMM FP32 kernel yields results that are wrong by orders of magnitude.
I have been using the default snitch_cluster as part of a synthesis evaluation.
Here are some findings from my synthesis runs of the cluster source code which could be interesting for increasing general tool compatibility. The description contains the fixes I applied to my local repo to circumvent the errors during elaboration/synthesis.
- snitch_cc: the tool inferred a latch for the signal addr. Fix: add the default assignment addr = '0;
- axi_dma_tc_snitch_fe: the tool inferred a latch for the signal status. Fix: add the default assignment status = '0;
- reqrsp_to_tcdm: the tool complained about an unpacked-to-packed assignment. Fix: use rr_req_chan_t as the cast target, which technically is not the same type and only by chance has the same bitfields which are of interest for this assignment.
- snitch_ssr: the tool did not recognize derive_isect_cfg() as a constant function.
Otherwise the cluster synthesized out of the box 🚀
For some dimensions of the optimized kernels, the SSR region will stall, since the buffer will not be consumed entirely due to SIMD/loop unrolling. Constraints on the dimensions need to be added in datagen.py.
Hi,
In the previous repository there was a spike-dasm tool.
Is it possible that this was removed in the current one?
Thanks!
Communication with tohost/fromhost is not thread-safe and should be guarded with a mutex.