pulp-platform / snitch_cluster
An energy-efficient RISC-V floating-point compute cluster.
Home Page: https://pulp-platform.github.io/snitch_cluster/
License: Apache License 2.0
Please add +define+COMMON_CELLS_ASSERTS_OFF in target/common/common.mk.
Env: docker (main)
Run:
cd /repo/target/snitch_cluster
make bin/snitch_cluster.vlt
This will report an ifndef COMMON_CELLS_ASSERTS_OFF error for the common cells.
fseq_fpu_yield is currently calculated as (fpss_fpu_issues / snitch_fseq_offloads) / fpss_fpu_rel_issues:
snitch_cluster/util/trace/gen_trace.py, lines 750 to 753 in 699d404
fpss_fpu_rel_issues is in turn calculated as fpss_fpu_issues / fpss_issues:
snitch_cluster/util/trace/gen_trace.py, lines 736 to 737 in 699d404
It follows that fseq_fpu_yield = fpss_issues / snitch_fseq_offloads, which equals fseq_yield by definition:
snitch_cluster/util/trace/gen_trace.py, lines 748 to 749 in 699d404
I believe the original intent was for fseq_fpu_yield to represent the FREP yield of FPU-proper instructions only. In that case we could simply calculate fseq_fpu_yield = fpss_fpu_issues / snitch_fseq_offloads. Perhaps even better, we could rectify snitch_fseq_offloads to count only instructions destined for the FPU proper.
I'm not sure of the usefulness of this metric altogether, so perhaps we could also just remove the duplicate fseq_fpu_yield.
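To make the collapse explicit, here is a small numeric sketch with made-up counter values (the variable names mirror the metrics in gen_trace.py; the numbers themselves are illustrative, not from a real trace):

```python
# Made-up counter values; names mirror the metrics in gen_trace.py.
snitch_fseq_offloads = 200
fpss_issues = 180
fpss_fpu_issues = 90

# Current computation:
fpss_fpu_rel_issues = fpss_fpu_issues / fpss_issues  # 0.5
fseq_fpu_yield = (fpss_fpu_issues / snitch_fseq_offloads) / fpss_fpu_rel_issues

# The fpss_fpu_issues terms cancel, leaving fseq_yield:
fseq_yield = fpss_issues / snitch_fseq_offloads  # 0.9
assert abs(fseq_fpu_yield - fseq_yield) < 1e-12

# Proposed fix: restrict the numerator to FPU-proper issues.
fseq_fpu_yield_fixed = fpss_fpu_issues / snitch_fseq_offloads  # 0.45
```

Whatever the counter values, the current formula always reproduces fseq_yield, which is why the metric is redundant as computed today.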
I have built the Docker image and cloned the snitch_cluster repo in /repo. Then, in target/snitch_cluster, I ran the command make bin/snitch_cluster.vlt, for which I get:
work-vlt/Vtestharness.h:11:10: fatal error: verilated_heavy.h: No such file or directory
What am I doing wrong here?
Thanks in advance.
Parameter ssr_nr_credits is defined in the default HW configuration. However, this parameter is actually not used anywhere; the correct name for this parameter would be data_credits. Regardless of what value is set for ssr_nr_credits, the default for data_credits defined in the schema file is used:
snitch_cluster/docs/schema/snitch_cluster.schema.json, lines 576 to 581 in e90dceb
To avoid running into the same situation in the future, the HW configuration file should be validated against the schema upon hardware generation, producing an error if some parameter is not defined in the schema. This is done to some extent in the Generator class:
snitch_cluster/util/clustergen/cluster.py, lines 73 to 80 in e90dceb
But only against the root schema, which doesn't include all parameters. The ssr_nr_credits parameter is in the remote schema for the SnitchClusterTB class:
snitch_cluster/util/clustergen/cluster.py, lines 357 to 369 in e90dceb
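As a minimal sketch of the kind of check that could catch this (a hypothetical helper, not the actual Generator API), generation could flag any config key that the schema does not declare:

```python
def find_unknown_params(cfg: dict, schema_props: dict) -> list:
    """Return config keys that are not declared in the schema's properties."""
    return [key for key in cfg if key not in schema_props]

# Hypothetical schema excerpt: only data_credits is declared.
schema_props = {"data_credits": {"type": "number", "default": 4}}
cfg = {"ssr_nr_credits": 8}  # misspelled parameter, silently ignored today

unknown = find_unknown_params(cfg, schema_props)
assert unknown == ["ssr_nr_credits"]  # generation should abort with an error here
```

With a JSON-Schema validator, setting additionalProperties to false in the (merged) schemas would achieve the same effect; the subtlety noted above is that the check must also cover the remote schemas, not only the root one.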
After the change of @SamuelRiedel in pulp-platform/snitch#69, I get assertion failures:
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[8].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 350 ns Started: 349 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[8].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[0].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 361 ns Started: 360 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[0].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
# ** Error: [ASSERT FAILED] [tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[4].i_snitch_cc.i_snitch.InstructionInterfaceStable] InstructionInterfaceStable (/home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv:2599)
# Time: 418 ns Started: 417 ns Scope: tb_bin.i_dut.i_snitch_cluster.i_cluster.gen_core[4].i_snitch_cc.i_snitch.InstructionInterfaceStable File: /home/noah/git/snitch-dace/snitch/hw/ip/snitch/src/snitch.sv Line: 2599
Can you reproduce this using the attached binary? Is it a concern?
Due to the handling of atomics, there is a potential for deadlocking the DMA network if the DMA can't issue any more writes because the reads are stalled on axi_demux.
Since we have two dedicated networks, a narrow one (usually single-core requests) and a wide one (burst-type transfers), we do not necessarily need atomic support on the wide network. With atomics disabled, the axi_demux can drop the dependency between read and write channels.
See pulp-platform/snitch#116 for a reference where this happened, plus a short-term mitigation by increasing the transaction buffers.
Following from the discussion in the fork from KU Leuven, I would like to have support for Verilator 5. Verilator 5 has better performance, support for timing constructs, and support for assertions. It would also allow projects using the Snitch cluster to use Verilator 5. I think in general it would be useful to support more recent versions of tools, as this makes the upgrading process later down the line a lot easier (make small steps instead of big ones).
Upgrading to Verilator 5 would involve the following steps:
- Add the -no-timing flag to the current tests, which disables the new timing features in Verilator 5
- Add verilated_threads.o to the Verilator build targets, OR compile both the SystemVerilog and C++ sources using Verilator, so Verilator can manage which Verilator files should be included
If you have any questions or comments, please let me know!
During a dry run, the Simulation object will not create any process upon launch:
snitch_cluster/util/sim/Simulation.py, lines 26 to 44 in 8cae8d2
This results in an error when successful() is invoked, as on a CustomSimulation:
snitch_cluster/util/sim/Simulation.py, lines 163 to 164 in 8cae8d2
return self.process.returncode == 0
AttributeError: 'NoneType' object has no attribute 'returncode'
This can be solved in two ways:
- Treat dry runs separately in sim_utils.py
- Handle dry runs in the Simulation class, assuming that they complete immediately and are always successful
The second option is preferable for reuse. Then there is no need to treat the completion of dry runs separately in sim_utils.py:
snitch_cluster/util/sim/sim_utils.py, line 144 in 8cae8d2
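A minimal sketch of the second option, using a simplified stand-in for the Simulation class (not the actual Simulation.py API): a dry run never spawns a process, so the status methods treat it as an immediately completed, successful run instead of dereferencing a None process.

```python
import subprocess

class Simulation:
    def __init__(self, cmd, dry_run=False):
        self.cmd = cmd
        self.dry_run = dry_run
        self.process = None  # stays None for dry runs

    def launch(self):
        if self.dry_run:
            print(" ".join(self.cmd))  # only echo the command
            return
        self.process = subprocess.Popen(self.cmd)

    def completed(self):
        # Dry runs complete immediately.
        return self.dry_run or (self.process is not None
                                and self.process.poll() is not None)

    def successful(self):
        if self.dry_run:
            return True  # dry runs are assumed to always succeed
        return self.process is not None and self.process.returncode == 0

sim = Simulation(["make", "all"], dry_run=True)
sim.launch()
assert sim.completed() and sim.successful()  # no AttributeError
```

Keeping the dry-run logic inside the class means callers such as run_simulations() need no special casing at all.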
By default, run_simulations() runs all tests under the same run directory as specified by the run_dir argument, creating a unique subdirectory for each simulation based on the test name, if there is more than one test:
snitch_cluster/util/sim/sim_utils.py, lines 134 to 142 in 8cae8d2
I followed the tutorial and used Docker to run the Verilator simulation, with the software code sw/apps/dnn/flashattention_2. However, I encountered the following error during the simulation:
VCD wave generation enabled
[fesvr] Wrote 36 bytes of bootrom to 0x1000
[fesvr] Wrote entry point 0x80000000 to bootloader slot 0x1020
[fesvr] Wrote 56 bytes of bootdata to 0x1024
[Tracer] Logging Hart 8 to logs/trace_hart_00000008.dasm
[Tracer] Logging Hart 0 to logs/trace_hart_00000000.dasm
[Tracer] Logging Hart 1 to logs/trace_hart_00000001.dasm
[Tracer] Logging Hart 2 to logs/trace_hart_00000002.dasm
[Tracer] Logging Hart 3 to logs/trace_hart_00000003.dasm
[Tracer] Logging Hart 4 to logs/trace_hart_00000004.dasm
[Tracer] Logging Hart 5 to logs/trace_hart_00000005.dasm
[Tracer] Logging Hart 6 to logs/trace_hart_00000006.dasm
[Tracer] Logging Hart 7 to logs/trace_hart_00000007.dasm
[Illegal Instruction Core 0] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 1] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 2] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 6] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 4] PC: 00008000b140 Data: 18317153
[Illegal Instruction Core 0] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 1] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 2] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 6] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 4] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 0] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 1] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 2] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 6] PC: 000000000000 Data: 00000000
[Illegal Instruction Core 4] PC: 000000000000 Data: 00000000
......
I checked the instruction at address 8000b140, and the content is as follows:
8000b140: 53 71 31 18 fdiv.s ft2, ft2, ft3
How can I solve this error and run the simulation correctly? I encountered similar errors in other test cases (such as gelu and softmax).
Thank you very much!
The TB for hw/ip/reqrsp_interface
was never adjusted to the AMO fixes merged with pulp-platform/snitch#149.
The output of cycle-accurate simulation for this code is not correct:
#include <snrt.h>
#include <printf.h>

#define f64 double
#define i32 int
#define B 2
#define N 32

void __attribute__((noinline)) my_func(double* x, double* y) {
    for (int n = 0; n < N; n++) {
        for (int b = 0; b < B; b++) {
            x[b * N + n] *= y[b];
        }
    }
    for (int b = 0; b < B; b++) {
        y[b] = 0;
    }
}

int main() {
    unsigned tid = snrt_cluster_core_idx();
    if (tid != 0) {
        return 0;
    }
    double* y = (f64*) snrt_l1alloc(B * sizeof(f64));
    double* x = (f64*) snrt_l1alloc(B * N * sizeof(f64));
    double* z = (f64*) snrt_l1alloc(B * N * sizeof(f64));
    y[0] = 3.0;
    y[1] = 2.0;
    for (int n = 0; n < N; n++) {
        for (int b = 0; b < B; b++) {
            x[b * N + n] = n + 1;
            z[b * N + n] = (n + 1) * y[b];
        }
    }
    my_func(x, y);
    i32 ok = 1;
    for (int i = 0; i < B * N; i++) {
        if ((x[i] - z[i]) * (x[i] - z[i]) > 1e-3) {
            printf("Error: mismatch at dst, %d, %f (computed) != %f (expected) \n", (int)i, (double)x[i], (double)z[i]);
            ok = 0;
            break;
        }
    }
    if (ok) {
        printf("success, exiting...\n");
        return 0;
    } else {
        printf("FAILURE, exiting...\n");
        return 1;
    }
}
Observed output:
Error: mismatch at dst, 31, 0.000000 (computed) != 96.000000 (expected)
The issue is suspected to come from the lack of synchronization between the INT and FPU units. It can be seen from the assembly (https://godbolt.org/z/z3oEz4aen) that no synchronization is even supposed to happen:
my_func: # @my_func
fld ft0, 0(a1) # everything below goes to FPU
fld ft1, 0(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 0(a0)
fld ft0, 8(a1)
fld ft1, 32(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 32(a0)
fld ft0, 0(a1)
fld ft1, 8(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 8(a0)
fld ft0, 8(a1)
fld ft1, 40(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 40(a0)
fld ft0, 0(a1)
fld ft1, 16(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 16(a0)
fld ft0, 8(a1)
fld ft1, 48(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 48(a0)
fld ft0, 0(a1)
fld ft1, 24(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 24(a0)
fld ft0, 8(a1)
fld ft1, 56(a0)
fmul.d ft0, ft0, ft1
fsd ft0, 56(a0)
sw zero, 12(a1) # everything below goes to INT
sw zero, 8(a1)
sw zero, 4(a1)
sw zero, 0(a1)
ret
Currently, the load-store queue in Snitch is quite limited and, since the addition of store response handling, easily becomes full. The idea would be to compress back-to-back stores, as we are not interested in the actual value but just in the fact that we have an outstanding store.
I think something like a compressable_fifo would be a good start, where elements of the same type pushed back-to-back could increment a counter instead of occupying an actual queue item. Critical paths need to be checked as well.
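A behavioral sketch of the idea (the compressable_fifo itself would of course be RTL; the class name and interface here are hypothetical): identical back-to-back pushes increment a repeat counter on the tail entry instead of occupying a new slot.

```python
from collections import deque

class CompressibleFifo:
    def __init__(self):
        self._q = deque()  # entries are [value, count]

    def push(self, value):
        if self._q and self._q[-1][0] == value:
            self._q[-1][1] += 1  # compress into the tail entry
        else:
            self._q.append([value, 1])

    def pop(self):
        value, count = self._q[0]
        if count > 1:
            self._q[0][1] -= 1  # one repeat consumed
        else:
            self._q.popleft()
        return value

    def slots_used(self):
        return len(self._q)

f = CompressibleFifo()
for _ in range(8):
    f.push("store_ack")  # eight identical back-to-back store responses
assert f.slots_used() == 1  # one physical slot instead of eight
assert f.pop() == "store_ack"
```

For store responses the "value" degenerates to a single token, so in the extreme the queue entry is just a counter of outstanding stores.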
Conversion from double to single precision (fcvt.s.d) isn't working when used in combination with SSRs. It hangs indefinitely on Verilator but works on Banshee.
This small test demonstrates the issue.
test_fcvt.zip
The new Perfetto UI does not fully support traces in the legacy TraceViewer JSON format.
https://perfetto.dev/docs/faq#why-does-perfetto-not-support-lt-some-obscure-json-format-feature-gt-
Users are recommended to emit TrackEvent instead, Perfetto's native trace format.
This guide explains how to represent common JSON events using TrackEvent.
The generated documentation lacks some proper titles. That should be fixed.
crt0 currently doesn't initialize the .bss section. A mutex placed in .bss (e.g. clint_mutex in interrupt.c) is therefore in an uninitialized state, leading to deadlocks. A fix to this should also conform to multi-cluster systems which are not participating in the cluster-wide barrier at the end of crt0.
Opening this issue as a reference to the previous PR #7
The link to the Snitch paper in this section of the documentation is broken: https://pulp-platform.github.io/snitch_cluster/rm/custom_instructions.html#xfrep-extension-for-floating-point-repetition
Link to the smallFloat ISA extension specification in Snitch's custom instructions documentation.
A draft of the documentation can be found at this link: https://gist.github.com/nazavode/5d804bfc2f7cb7d6a5da99ce48381593
It should be included in the repository, possibly in the source code, or in an alternative easily maintainable form.
Hi, I really like the documentation for this project.
However, the documentation does not seem to mention the "Build MUSL dependency" step which is executed in the CI:
- name: Build MUSL dependency
  run: |
    cd sw/deps
    mkdir install
    cd musl
    CC=$LLVM_BINROOT/clang ./configure --disable-shared \
      --prefix=../install/ --enable-wrapper=all \
      CFLAGS="-mcpu=snitch -menable-experimental-extensions"
    make -j4
    make install
    cd ../../../
- name: Build Software
  run: |
    make -C target/snitch_cluster sw
The Batchnorm and Maxpool layers are included in the CI, testing that no error occurs during simulation, but the results are not verified.
I am currently following the tutorial for the Snitch cluster at https://pulp-platform.github.io/snitch_cluster/ug/tutorial.html and reached the debugging/benchmarking step. However, when trying to analyze the performance, it seems that the tstart and tend metrics are always 0. This value persists from the DASM file to the text, CSV, and JSON results of the other benchmarking steps.
The command I ran: bin/snitch_cluster.vlt sw/apps/blas/axpy/build/axpy.elf
The resulting files are attached.
trace_hart_00000000.dasm.txt
hart_00000000_perf.json
perf.csv
event.csv
Running on docker on Linux, amd64:
$ uname -a
Linux b0f761a4bb94 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Tested running on
Things I tried changing:
Do not hardcode the barrier CSR address, but get it from riscv-opcodes' encoding.h:
snitch_cluster/sw/snRuntime/src/sync.h, lines 57 to 60 in 8a020de
Hi,
I cloned this repo and tried: make
This checks out a nonfree submodule first, which asks for a username and password.
Any help?
Update both Bender.lock
and Bender.yml
files.
Due to a bug in util/clustergen/cluster.py it is currently impossible to instantiate a cluster with no xssr cores. This is because num_ssrs_max must be equal to at least 1 for the following template to produce valid SystemVerilog:
localparam snitch_ssr_pkg::ssr_cfg_t [${cfg['num_ssrs_max']}-1:0] SsrCfgs [${cfg['nr_cores']}] = '{
...
}
A simple one-line fix to cluster.py appears to solve the issue:
self.cfg['num_ssrs_max'] = max(len(core['ssrs']) for core in cores)
# to
self.cfg['num_ssrs_max'] = max(max(len(core['ssrs']) for core in cores), 1)
I don't know enough about how the xssr configuration works to determine if my fix is sufficient -- I haven't managed to get the CI test suite to fully pass. At least the following fails:
/workspaces/stitch_cluster/target/snitch_cluster/sw/tests/build/alias.elf test failed
Given that xssr is disabled, it seems reasonable that some tests might fail. Unfortunately, my laptop doesn't have enough memory to run the full suite in a reasonable amount of time.
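A toy reproduction of the num_ssrs_max computation illustrates the clamp (the dict shape is a simplified stand-in for the real cluster config):

```python
# A cluster whose cores have no SSRs configured.
cores = [{"ssrs": []}, {"ssrs": []}]

# Current computation: yields 0, which breaks the SsrCfgs template.
num_ssrs_max = max(len(core["ssrs"]) for core in cores)
assert num_ssrs_max == 0

# Proposed fix: clamp to 1 so the generated parameter range stays valid.
num_ssrs_max_fixed = max(max(len(core["ssrs"]) for core in cores), 1)
assert num_ssrs_max_fixed == 1
```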
The iis-setup.sh script currently downloads the spike-dasm binary built for AlmaLinux:
Lines 31 to 32 in d6b7f25
Pisoc machines use CentOS 7.8, and the AlmaLinux spike-dasm build is thus incompatible with them.
The make sw target raises several warnings, which should be cleaned up.
The memory map documentation is no longer up to date after introducing the zero memory. The whole Snitch cluster hardware documentation could be replaced with the relevant section from our internal Occamy documentation.
The trace output by gen_trace.py contains performance metrics at the end of the trace, e.g.:
## Performance metrics
Performance metrics for section 0 @ (517, 8635):
tstart 3760.0000
snitch_loads 7
snitch_stores 25
tend 11879.0000
fpss_loads 0
snitch_avg_load_latency 79.4286
snitch_occupancy 0.0299
snitch_fseq_rel_offloads 0.1164
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 0
fpss_avg_fpu_latency 2.0
fpss_avg_load_latency 0
fpss_occupancy 0.0039
fpss_fpu_occupancy 0.0039
fpss_fpu_rel_occupancy 1.0
cycles 8119
total_ipc 0.0339
When passing the trace through the annotate.py script, the performance metrics section is garbled:
## Performance tion 0 @ (517, 8635):
tstart 3760.0000
snitch_loads 7
snitch_stores 25
tend 11879.0000
fpss_loads 0
snitch_avg_load_latency 79.4286
snitch_occupancy 0.0299
snitch_fseq_rel_offloads 0.1164
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 0
fpss_avg_fpu_latency 2.0
fpss_avg_load_latency 0
fpss_occupancy 0.0039
fpss_fpu_occupancy 0.0039
fpss_fpu_rel_occupancy 1.0
cycles 8119
total_ipc 0.0339
The annotate.py script should be extended to ignore the final part of the trace, preserving the performance metrics.
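A sketch of the proposed behavior (the helper and marker handling here are hypothetical, not the actual annotate.py internals): once the performance-metrics footer starts, pass lines through verbatim instead of annotating them.

```python
METRICS_MARKER = "## Performance metrics"

def annotate_trace(lines, annotate_line):
    """Annotate trace lines, but pass the metrics footer through verbatim."""
    out = []
    in_metrics = False
    for line in lines:
        if line.startswith(METRICS_MARKER):
            in_metrics = True  # everything from here on is the footer
        out.append(line if in_metrics else annotate_line(line))
    return out

trace = ["0 1000 ... fadd.d", METRICS_MARKER, "tstart 3760.0000"]
result = annotate_trace(trace, lambda l: l + "  # src annotation")
assert result[1] == METRICS_MARKER       # footer header preserved verbatim
assert result[2] == "tstart 3760.0000"   # metrics no longer garbled
```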
Is this comment still relevant? It seems the source line does contain a unit.
I want to simulate sw/apps/blas/gemm/build/gemm.elf with src/banshee.yaml using Banshee, but encountered the following errors:
ERROR banshee::tran > Unsupported instruction 0x800027b8: <illegal 0xf60b80d3>
ERROR banshee::tran > Unsupported instruction 0x800027c0: <illegal 0xe6010bd3>
ERROR banshee::tran > Unsupported instruction 0x800027e0: <illegal 0xf60c01d3>
ERROR banshee::tran > Unsupported instruction 0x800027e4: <illegal 0xf6078253>
ERROR banshee::tran > Unsupported instruction 0x800027e8: <illegal 0xf60b82d3>
ERROR banshee::tran > Unsupported instruction 0x800027f4: <illegal 0xe6028bd3>
......
All these unsupported instructions are fmv.b.x and fmv.x.b. How can I resolve these errors?
By the way, Banshee's README suggests using .bin files as input, but the snitch_cluster tutorial uses .elf files as input for Banshee. What's the difference? And how do I correctly run a Banshee simulation for the SW test cases in snitch_cluster? I tried using llvm-objcopy to convert the .elf file into a .bin file, but the generated .bin file could not be properly read by Banshee.
When used in the FlashAttention-2 layer, the GEMM FP32 kernel yields results that are wrong by orders of magnitude.
I have been using the default snitch_cluster as part of a synthesis evaluation.
Here are some findings from my synthesis runs of the cluster source code which could be interesting for increasing general tool compatibility. The description contains the fixes I applied to my local repo to circumvent the errors during elaboration/synthesis.
- snitch_cc: the tool inferred a latch for the signal addr. Fix: add the default assignment addr = '0;
- axi_dma_tc_snitch_fe: the tool inferred a latch for the signal status. Fix: add the default assignment status = '0;
- reqrsp_to_tcdm: the tool complained about an unpacked-to-packed assignment. Fix: use rr_req_chan_t as the cast target, which technically is not the same type and only by chance has the same bitfields which are of interest for this assignment.
- snitch_ssr: the tool did not recognize derive_isect_cfg() as a constant function.
Otherwise the cluster synthesized out of the box 🚀
For some dimensions of the optimized kernels, the SSR region will stall, since the buffer will not be consumed entirely due to SIMD/loop unrolling. Constraints on the dimensions need to be added in datagen.py.
Hi,
In the previous repository there was a spike-dasm tool.
Is it possible that this was removed in the current one?
Thanks!
Communication with tohost/fromhost is not thread-safe and should be guarded with a mutex.