vproc / vicuna
RISC-V Zve32x Vector Coprocessor
License: Other
The I$ is configured with a size of 4 KB and the parameter WAY_LEN is 128 bits. If MEM_BYTE_W is then configured for 128 bits, there is a compilation error.
parameter int unsigned MEM_BYTE_W = 4, // Memory data width (bytes)
parameter int unsigned WAY_LEN = 256 // Cache way length (lines)
Compilation Error
/data/shared/mulberry/users/kwali/nna_icache_exp/nna/external/vicuna/rtl/vproc_cache.sv, 45
"val"
Packed union members must have same size.
Member "val" has different size (1 bits) from next member (3 bits).
I am mostly done with setting up the simulation environment for Questasim. I can run most test cases successfully except the vloxei8.v instructions.
The issue arises when the vproc_tb is in the following configuration:
MEM_W=32
MEM_SZ=262144
MEM_LATENCY=5
VREG_W=512
VMEM_W=256
VMUL_W=128
VALU_W=128
VSLD_W=128
ICACHE_SZ=8192
ICACHE_LINE_W=128
DCACHE_SZ=65536
DCACHE_LINE_W=512
(This is the second configuration defined in the .config files in the test directories, and the default values set in the sim Makefile.)
The issue occurs while simulating the execution of the vloxei8.S code (the main core is Ibex): the coprocessor (and Ibex) gets stuck.
vloxei8.v v16, (a1), v8, v0.t
mem_req.addr signal. Afterwards, the mem_req.addr signal goes to an undefined state (e.g. 32'hXXXXXXX0)
cpu_gnt_o signal goes to 1'bX)
I have not further investigated this issue, but the problem is either:
Before I continue investigating, I would like to ask: Is this a known issue? If not, then I might have introduced it with my ASIC vector register or it is caused by using Questasim.
My compiler produced the following instruction, which Vicuna reports as illegal. It simulates correctly in spike, though:
42802557 vmv.x.s a0,v8
Line 203 in 691857c
Is this correct? mode_i.alu.op_mask is 2 bits while the other masked signals are only 1 bit.
Hello,
I found that there is no test data in the conv_3x3.S file. Does this mean that it is still under development, or does it work properly now? I ran the program with my own data but did not get the expected result. If it works properly now, could you please provide your test data?
Thank you
Hi @michael-platzer,
I have a few questions regarding an observation I made:
While simulating vicuna I found that some instruction sequences (for example vle8_8.S) have out-of-order (OoO) result transactions. I know that this is allowed by the x-interface, but I am not sure whether this is intended on vicuna.
The figure below shows the waveforms of the x-interface while executing vle8_8.S. It can be seen that the third offloaded vector instruction (vmv.v.v v2, v0) with id = 2 has its result transaction before the second offloaded vector instruction (vle8.v v0, (a0)) with id = 1. The same is true for the fourth and fifth instructions.
The following command and configuration were used (and the simulation completed successfully):
make lsu/vle8_8 SIMULATOR=questa COMPILER=llvm
[CONFIG ] lsu/vle8_8 VREG_W=128 VMEM_W=32 VMUL_W=32
[SUCCESS] lsu/vle8_8/vle8_8 293 cycles ( 12 - 305)
Additionally the ram_type was set to RAM_ASIC.
vle8.v v0, (a0) followed by vmv.v.v v2, v0 would cause a data hazard (i.e. hindering the move instruction from executing before the load instruction has completed), why does this not happen here?
Fractional LMUL is not working, and indeed:
Line 388 in 6816693
I believe the minimum SEW is 8 bits and the maximum SEW is 32 bits, so Vicuna is supposed to support fractional LMUL down to 1/4? Please correct me if I am wrong.
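To check the numbers in this question, here is a minimal C sketch of the rule as I read it in the RVV 1.0 spec: an implementation has to support LMUL >= SEWMIN/ELEN, and for a given fractional LMUL only SEW values up to LMUL * ELEN. The function names are invented for illustration and are not taken from the Vicuna sources.

```c
#include <stdbool.h>

/* Hypothetical helpers, not part of Vicuna. */

/* Denominator of the smallest fractional LMUL that must be supported:
 * LMUL >= SEWMIN / ELEN, i.e. LMUL >= 1 / (ELEN / SEWMIN). */
unsigned min_lmul_denominator(unsigned sew_min, unsigned elen)
{
    return elen / sew_min;
}

/* For LMUL = 1/lmul_den, SEW must lie between SEWMIN and LMUL * ELEN. */
bool sew_ok_for_fractional_lmul(unsigned sew, unsigned lmul_den,
                                unsigned sew_min, unsigned elen)
{
    return sew >= sew_min && sew * lmul_den <= elen;
}

/* VLMAX = VLEN * LMUL / SEW, with LMUL given as lmul_num / lmul_den. */
unsigned vlmax(unsigned vlen, unsigned sew,
               unsigned lmul_num, unsigned lmul_den)
{
    return (vlen * lmul_num) / (sew * lmul_den);
}
```

With SEWMIN = 8 and ELEN = 32, min_lmul_denominator returns 4, which corresponds to LMUL down to 1/4 as assumed in the question.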
Line 798 in 224f33a
I see an X prop issue that starts from here. The deq_state is initialized to X because the queue data are not reset. This eventually drives the Ibex CPI response to X, and Ibex flags this as an invalid instruction. This occurs on the first instruction issued by Ibex to Vicuna. When I initialize the queue data to 0 on reset, my test passes. Thoughts?
In numerous places you assign a default of X, for example, in vproc_lsu.sv line 134:
state_init_q <= '{busy: 1'b0, default: 'x};
From my understanding, this assigns the busy member of state_init_q to 0 on reset, and it defaults all the other members to X for simulation. The issue is that some members of state_init_q are of type enum (if you dive down into all the types). The enums have type logic, which should be able to accept X, but for some reason I'm getting an error here trying to assign an enum to X. I even added X as one of the enum entries, but it didn't help.
Do you need to default all these signals to X?
Hi @michael-platzer,
In order to be compliant with the x-interface, the coprocessor must set the associated id for each memory request transaction. This is not the case at the moment, as you can see from the screenshot below (id stays "don't care", even during memory transactions).
Hi @michael-platzer and @stevobailey, I hope you are doing well. I have been working with the RISC-V V ISA and its intrinsics, and I want to start coding at the intrinsics level on Vicuna.
I have written this simple vector add code:
#include <uart.h>
#include <riscv_vector.h>
#include <stddef.h>
void* memcpy(void* dest, const void* src, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        ((char*)dest)[i] = ((const char*)src)[i];
    }
    return dest; // memcpy must return the destination pointer
}
int compare(int* ref, int* actual, int n) {
    int r = 1; // initialized so that n == 0 does not return garbage
    for (int i = 0; i < n; ++i) {
        if (ref[i] - actual[i] == 0) {
            r = 1;
        }
        else {
            r = 0;
            break;
        }
    }
    return r;
}
// index arithmetic
void add(int* a, int* b, int* c, int n) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
void vec_add(int* a, int* b, int* c, int n) {
    while (n > 0) {
        size_t vl = vsetvl_e32m1(n);
        vint32m1_t vb = vle32_v_i32m1(b, vl);
        vint32m1_t vc = vle32_v_i32m1(c, vl);
        vint32m1_t va = vadd_vv_i32m1(vb, vc, vl);
        vse32_v_i32m1(a, va, vl);
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
    }
}
int main(void) {
    // data
    int a[31] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
    int b[31] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };
    // compute
    int result[31];
    int vecresult[31];
    add(result, a, b, 31);
    vec_add(vecresult, a, b, 31);
    // compare
    uart_puts(compare(vecresult, result, 31) ? "pass" : "fail");
    return 0;
}
All I want to do is check whether the scalar add and the vector add using the V extension give the same result.
The code compiles with warnings about pointers in the vector add function, but without errors. However, when I run it on the board, the UART gives no output at all, not even "fail".
Do you know what the problem is?
Not sure if this is a Verilog requirement or just good practice, but I'm having issues because not all module parameters have default values. Can you add them? The default can be 0, which may fail gracefully if not overwritten.
vproc_alu.sv
vproc_elem.sv
vproc_lsu.sv
vproc_mul_block.sv
vproc_mul.sv
vproc_sld.sv
vproc_vregfile.sv
vproc_vregpack.sv
vproc_vregunpack.sv
Hi @michael-platzer , @stevobailey and @moimfeld ....
In demo project you have added this test code:
#include <uart.h>
int main(void) {
    uart_puts("Hello world!\n");
    return 0;
}
I ran this code on the demo project a while back, as you may remember, but I forgot to ask: Tera Term displays Hello world! continuously, as if the code were inside a while(1) statement (like in the picture below). Why is that? I expected to see a single Hello world.
Even now, when I run my vector codes, I see continuous prints of "pass" (since I check the success or failure of my code), as if all the code were inside an implicit while(1) loop. Because of this, the timer that I added never stopped, so I put an empty while(1) at the end of my code to force the program to stop there so that I can measure the run time. Can you clarify?
This issue might not be reproducible without the UVM environment. It is planned to open-source the environment in the next week.
As discussed yesterday, I am in the process of setting up a UVM environment to verify Vicuna. The environment drives the core-side signals of the x-interface channels and therefore "emulates" a core. Any handshake signal controlled by the environment can be configured to have a random delay. Here are a few examples of what this means:
commit_valid can be configured to have a random delay (in clock cycles) w.r.t. the issue handshake
mem_ready can be configured to have a random delay (in clock cycles) w.r.t. the assertion of the mem_valid signal
mem_result_valid can be configured to have a random delay (in clock cycles) w.r.t. the memory request/response handshake
(This is not a complete list of all signals with random delay)
Note: Even though there is random delay on certain transactions, the core-side is strictly in-order. So no transaction initiated by the environment is OoO.
When turning on random delay for the result_ready signal of the result interface, the coprocessor stalls indefinitely when a result transaction is not immediately accepted. Below you can find a picture of the x-interface signals. After 590 ns the coprocessor stalls indefinitely. I have not further investigated this observation.
# ------------------------------------------------------------
# |
# |
# | Next Instruction Sequence Info
# |
# | Number of Instructions: 7
# |
# | 1. Instruction: 0002f2d7
# | 2. Instruction: 02050007
# | 3. Instruction: 5e000157
# | 4. Instruction: 0002f2d7
# | 5. Instruction: 00058107
# | 6. Instruction: 0002f2d7
# | 7. Instruction: 02050127
# |
# |
# ------------------------------------------------------------
This sequence corresponds to the following assembly (vle8_8.S), where only the vector instructions are offloaded:
la a0, vdata_start
li t0, 16
vsetvli t0, t0, e8,m1,tu,mu
vle8.v v0, (a0)
vmv.v.v v2, v0
vsetvli t0, t0, e8,m1,tu,mu
vle8.v v2, (a1), v0.t
li t0, 16
vsetvli t0, t0, e8,m1,tu,mu
vse8.v v2, (a0)
la a0, vdata_start
la a1, vdata_end
j spill_cache
result_valid delay is disabled)
Can Vicuna handle misaligned data? When simulating an 8-bit vector add using Spike with unaligned data, it works correctly. But when the same program is run on the Vicuna RTL, it fails. I think it should work even on misaligned data.
I can recreate this by running your vector add test (alu/vadd_8.S), but adding an offset to the data:
...
.data
.align 10
.global vdata_start
.global vdata_end
.byte 0
vdata_start:
.word 0x323b3f47
.word 0x47434b3a
.word 0x302f2e32
...
[CONFIG ] alu/vadd_8 VREG_W=128 VMEM_W=32 VMUL_W=32
[ ERROR ] alu/vadd_8/vadd_8 607 cycles ( 12 - 619)
incorrect memory content; diff:
--- vadd_8.ref.vmem
+++ vadd_8.dump.vmem
@@ -1,7 +1,7 @@
333c4048
48444c3b
31302f33
-e9414b52
+e8414b52
3f44383b
37424d54
5e4b5049
make: *** [Makefile:64: alu/vadd_8] Error 1
Hi @michael-platzer, since the topic is different I thought I had better open a new issue. Sorry if I am asking a lot, but I think my low-level questions may help you end up with a great and complete repo as well :). In the test directory we can test Vicuna and see the number of clock cycles in Verilator with two different configs. I want to write the same set of programs (for example vadd_8.s) in C without the vector extension, for execution on Ibex alone, and see how many clock cycles Ibex needs for them, so that I can compare Ibex with Ibex+Vicuna for a sample vector code. Any suggestions?
Hi @michael-platzer , @stevobailey and @moimfeld ... I have a question about reduction sum intrinsic function:
I have written following code in normal C:
signed short int dense(signed char* inDense, signed char* wf, signed short int inDenseSize, signed short int biaseDense) {
    signed short int i;
    signed short int outDense = 0;
    for (i = 0; i < inDenseSize; i++) {
        outDense += inDense[i] * wf[i];
    }
    outDense += biaseDense;
    return outDense;
}
I want to convert it to vector code. I have written the following for this purpose, but it seems to have a problem, since I cannot execute it on Vicuna.
signed short int vec_dense(signed char* in1, signed char* in2, signed short int n, signed short int bias) {
    signed short int out;
    size_t vlmax = vsetvlmax_e16m1();
    vint16m1_t vec_zero = vmv_v_x_i16m1(0, vlmax);
    vint16m2_t vout = vmv_v_x_i16m2(0, vlmax);
    while (n > 0) {
        size_t vl = vsetvl_e8m1(n);
        vint8m1_t vin1 = vle8_v_i8m1(in1, vl);
        vint8m1_t vin2 = vle8_v_i8m1(in2, vl);
        vout = vwmul_vv_i16m2(vin1, vin2, vl);
        in1 += vl;
        in2 += vl;
        n -= vl;
    }
    vint16m1_t vec_sum = vredsum_vs_i16m2_i16m1(vec_zero, vout, vec_zero, vlmax);
    out = vmv_x_s_i16m1_i16(vec_sum);
    return out + bias;
}
Can you suggest?
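One thing visible in vec_dense as posted is that vout is overwritten on every loop iteration, so only the products of the last chunk reach the reduction. The plain C sketch below models one possible restructuring, in which each chunk is reduced and added to a running sum. It is a scalar emulation for checking the arithmetic, not intrinsic code; the name dense_chunked and the chunk parameter (a stand-in for the vl granted per iteration) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a strip-mined dot product. chunk stands in for the
 * vector length (vl) granted per iteration and must be nonzero. */
int16_t dense_chunked(const int8_t *in1, const int8_t *in2,
                      size_t n, int16_t bias, size_t chunk)
{
    uint16_t acc = 0;                 /* wraps modulo 2^16, like SEW=16 */

    while (n > 0) {
        size_t vl = n < chunk ? n : chunk;
        uint16_t partial = 0;

        /* models a widening multiply followed by a reduction sum */
        for (size_t i = 0; i < vl; i++)
            partial += (uint16_t)((int16_t)in1[i] * (int16_t)in2[i]);

        acc += partial;               /* accumulate across chunks */
        in1 += vl;
        in2 += vl;
        n   -= vl;
    }
    return (int16_t)(uint16_t)(acc + (uint16_t)bias);
}
```

For inputs that do not overflow 16 bits, the result should match dense() for any chunk size, which makes it a convenient reference when debugging a vectorized version.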
Hi,
as you have mentioned in #16 attaching DDR is really board specific and I see in that issue other people like me are struggling to attach it to the core... I have nexys 4 DDR as well and I think it is really good to have at least one example of how to attach DDR to one specific board (any board even nexys video of your demo project).
The following CSR registers in the CPU probably need to be changed to properly flag having a vector extension. See the privileged spec:
There may be others. Note that setting the VS bits in the mstatus register were required to get simulation working with spike. I'm not sure what hardware is supposed to do with the VS bits.
Hi @moimfeld, I noticed that you added an LLVM installation to the sw directory. I tried to build LLVM with make llvm LLVM_DIR=/opt/llvm, but first I got a cmake version error (I had an old version of cmake, and the LLVM installation requires a newer one). After I updated cmake I tried again, and this time I get this error:
HEAD is now at 5177676... Updated MLIR type stubs to work with pytype
/bin/sh: 5: cmake: not found
/home/kiian/Desktop/VicunaRepos/vicuna/sw//toolchain.mk:22: recipe for target 'llvm' failed
make: *** [llvm] Error 127
It's weird, because I have cmake, and when I run cmake --version the terminal responds with this:
cmake version 3.22.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).
I have Ubuntu 16.04 on VMware. Can you help me solve this problem? I have tried several times and cannot solve it.
This issue is related to #59, but for delay on a different signal. The problematic delay here is mem_result_valid. When a memory request is served in the same clock cycle as it was issued (i.e. no delay), the coprocessor will stall forever. Below you can find a picture of the x-interface signals. After 490 ns the coprocessor stalls indefinitely. I have not further investigated this observation.
This is the corresponding snippet from the x-interface documentation which states that memory result transactions can happen in the same cycle as the memory request transaction.
Memory result interface transactions cannot be initiated before the corresponding memory request interface handshake is completed. They are allowed to be initiated at the same time as or after completion of the memory request interface handshake.
The problematic instruction sequence is the same as in #59.
Line 69 in 66ed264
Unfortunately, some linters and synthesis tools don't accept this syntax. It seems silly, but I usually have to pass in the parameter at the module level (in addition to it being in the interface).
I need Vicuna + Ibex to support unaligned memory requests, but I do not need the predictable timing that Vicuna guarantees. Can you let me know if the following proposal will work? Assume a memory width of 32 bits.
If I pass 0 to the ADDR_ALIGNED parameter, then Vicuna memory requests will output the full address. I will write an adapter that aligns the memory data to the address. So unaligned reads will read two words and shift them down to become aligned. Unaligned writes will write two words by shifting up the write data and byte enable. I can pipeline this to reduce the latency overhead. Do you see an issue with this?
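The proposal can be modeled in software to check the shifting logic before writing RTL. The sketch below assumes a 32-bit little-endian memory represented as an array of words and a 4-bit byte enable; all function names are invented for this example and nothing here is taken from Vicuna or Ibex.

```c
#include <stdint.h>

/* Software model of the proposed adapter (illustrative names only). */

/* Unaligned read: fetch two words and shift them down. */
uint32_t read32_unaligned(const uint32_t *mem, uint32_t addr)
{
    uint32_t idx = addr >> 2;
    unsigned off = addr & 3u;          /* byte offset within the word */

    if (off == 0)
        return mem[idx];               /* aligned: single access */
    return (mem[idx] >> (8 * off)) | (mem[idx + 1] << (32 - 8 * off));
}

/* Merge the bytes of data selected by the 4-bit byte enable be. */
static uint32_t merge_bytes(uint32_t old, uint32_t data, unsigned be)
{
    uint32_t mask = 0;
    for (unsigned b = 0; b < 4; b++)
        if (be & (1u << b))
            mask |= 0xFFu << (8 * b);
    return (old & ~mask) | (data & mask);
}

/* Unaligned write: shift data and byte enable up across two words. */
void write32_unaligned(uint32_t *mem, uint32_t addr,
                       uint32_t data, unsigned be)
{
    uint32_t idx = addr >> 2;
    unsigned off = addr & 3u;
    uint64_t wdata = (uint64_t)data << (8 * off);
    unsigned wbe   = (be & 0xFu) << off;      /* up to 7 enable bits */

    mem[idx] = merge_bytes(mem[idx], (uint32_t)wdata, wbe & 0xFu);
    if (wbe >> 4)                              /* spills into next word */
        mem[idx + 1] = merge_bytes(mem[idx + 1],
                                   (uint32_t)(wdata >> 32), wbe >> 4);
}
```

The second word is only touched when the access actually crosses a word boundary, which is also where a hardware adapter would need its extra memory cycle.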
Hi @michael-platzer,
The x-interface specification states:
Note that a coprocessor shall be able to tolerate memory result transactions for which it did not perform the corresponding memory request handshake itself.
It might be beneficial to understand/ask what the reason behind this rule is before addressing this issue. If you don't know the reason then I can open an issue on the x-interface repository.
My UVM environment can be configured to generate random memory result transactions (i.e. asserting mem_result_valid with unrelated id and random data / exc / dbg signals). At the moment there is a problem with generating an "unrelated" id because the memory request transaction does not have a defined id (see #61).
Still, what I find is that the coprocessor stalls indefinitely when it is "bombarded" with random unrelated memory transactions. You can see the coprocessor stall in the screenshot below.
The problematic Instruction Sequence is the same as in #59
Can you help me understand what the operand widths mean? Let's take the default parameters as a starting point:
parameter int unsigned VREG_W = 128, // vector register width in bits
parameter int unsigned VMUL_W = 64, // MUL unit operand width in bits
parameter int unsigned VALU_W = 64, // ALU unit operand width in bits
parameter int unsigned VSLD_W = 64, // SLD unit operand width in bits
From a RISC-V standards perspective, VREG_W equals VLEN, so VLEN is 128. ELEN is 32, since this is a 32-bit vector unit. If VALU_W is 64, the operands are 64 bits. If I want to do vector add on 32-bit vectors, can it only add two elements at a time (because 64-bit operands / 32-bit elements = 2 elements per operand)? So doubling VALU_W to 128 would require twice the hardware but take half as many cycles to perform the addition?
In your paper, this statement
The ability to individually configure the throughput for each unit improves the performance of heavily used operations by increasing the respective unit’s data-path width (e.g., widening the data-path of the multiplier unit).
means, if I use lots of multiplies in my application, I should increase VMUL_W to improve performance at the cost of HW, right? But if I use lots of adds, I should increase VALU_W to improve performance at the cost of HW.
Put another way, VMUL_W, VALU_W, and VSLD_W don't affect functionality (i.e. software), just performance and overhead. Yes?
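The arithmetic behind this question can be written down as a small idealized model. It assumes that a unit consumes one operand of its configured width (for example VALU_W bits) per pass, which is the reading used in the question and is not confirmed here; pipeline fill, hazards and memory stalls are ignored, and the function names are invented.

```c
/* Idealized model: ignores pipeline fill, hazards and memory stalls. */

/* Elements of width sew (bits) processed per pass through a unit
 * whose operand width is unit_w bits (e.g. VALU_W). */
unsigned elems_per_pass(unsigned unit_w, unsigned sew)
{
    return unit_w / sew;
}

/* Passes needed for vl elements of width sew, rounded up. */
unsigned passes_needed(unsigned vl, unsigned sew, unsigned unit_w)
{
    unsigned bits = vl * sew;
    return (bits + unit_w - 1) / unit_w;
}
```

Under this model, four 32-bit elements take two passes at VALU_W = 64 and one pass at VALU_W = 128, which is the halving described above.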
Line 180 in 66ed264
A commercial linting tool flags this as an error. Can you make X_ID_WIDTH a localparam, then pass it to vcore_xif and use it here and on line 196?
Thanks!
This is Moritz, I will be doing the verification of Vicuna. I am currently setting up a simulation environment for Questasim to replicate your tests, and now I have a question about the register file.
Since I am targeting ASIC I also want to add an option for an ASIC register file. I have a working version of the ASIC registerfile here: https://github.com/moimfeld/vicuna/blob/asic_dev/rtl/vproc_vregfile.sv .
But my question is if the XORs in the following lines are only there for the correct functionality of the "XOR" RAM, or if they have some other purpose (like for example masking)?
Line 92 in 94a7f47
Line 164 in 94a7f47
I ask because my ASIC version does not use these XORs at the moment, and I want to make sure that I don't lose any functionality.
Line 524 in 66ed264
Here, state_init_masked is used before its declaration on line 562. Can you move the declaration above line 524?
Thanks!
Hi @michael-platzer and @stevobailey, sorry, I was unfortunately dealing with COVID-19 and couldn't bother you with more of my basic questions :p. But here is one:
In demo_top.sv you attached a UART, ram32 and hwreg_iface. I want to have Ibex as the main core, Vicuna as the vector coprocessor, and a VGA output, a camera module and LEDs attached as well. I have done this before using only Ibex and a Wishbone B4 interface for the mentioned peripherals with the help of ibex_wb (https://github.com/pbing/ibex_wb). I wonder if I can do it here as well? Even without a Wishbone interface, is it possible to treat those peripherals like the UART and ram32 in your top module and attach them to the Ibex memory bus? Any suggestions? (Assume I will eventually attach the DDR2 memory of my FPGA board as well.)
Line 790 in 66ed264
Sorry to bombard you with issues, but I'm finally getting around to running some lint/simulations on recent updates you made.
This connects a 32-bit signal to a 38-bit input port. Shouldn't this be queue_data_q.rs1 instead of queue_data_q.rs1.r.xval?
There are numerous unique case statements that have incomplete entries. I see that verilator simulation ignores them, but it would be better to fix them. Possible solutions:
unique from the case statement
Thoughts?
I compiled some code and ended up with the following instruction. It simulates in Spike as expected, but Vicuna returns an illegal instruction for it:
c6a40457 vwredsum.vs v8,v10,v8
When manually decoded, I think it looks like:
[6:0] OP-V = 1010111 (0x57 = vector arithmetic instruction)
[11:7] vd = 01000 (0x08 = v8)
[14:12] funct3 = 000 (0x00 = OPIVV)
[19:15] vs1 = 01000 (0x08 = v8)
[24:20] vs2 = 01010 (0x0A = v10)
[25] vm = 1 (0x01 = unmasked)
[31:26] funct6 = 110001 (0x31 = vwredsum)
This looks good to me, though I'm no expert on the RISC-V vector extension. I decoded this using the 1.0 spec, but I assume this is the same in version 0.10.
At a high level, this is just part of an add reduction function. I'm using Vicuna to sum all the elements of a vector.
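The manual decode above can be reproduced with a few bit-field helpers, which makes it easy to check other instruction words against the same layout. This is a minimal sketch; the helper names are invented, and it only extracts fields without judging whether an encoding is legal.

```c
#include <stdint.h>

/* Extract bits [hi:lo] of a 32-bit instruction word. */
static uint32_t bits(uint32_t insn, unsigned hi, unsigned lo)
{
    return (insn >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Field layout of an OP-V arithmetic instruction, as listed above. */
uint32_t opv_opcode(uint32_t insn) { return bits(insn, 6, 0);   }
uint32_t opv_vd(uint32_t insn)     { return bits(insn, 11, 7);  }
uint32_t opv_funct3(uint32_t insn) { return bits(insn, 14, 12); }
uint32_t opv_vs1(uint32_t insn)    { return bits(insn, 19, 15); }
uint32_t opv_vs2(uint32_t insn)    { return bits(insn, 24, 20); }
uint32_t opv_vm(uint32_t insn)     { return bits(insn, 25, 25); }
uint32_t opv_funct6(uint32_t insn) { return bits(insn, 31, 26); }
```

Applied to 0xc6a40457, these helpers give the same field values as the table above (opcode 0x57, vd 8, funct3 0, vs1 8, vs2 10, vm 1, funct6 0x31).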
Hi,
Where can I find which instructions are not implemented?
Please use non-blocking assignments in all always_ff blocks. I see blocking assignments in these locations
vproc_alu.sv lines 370-380
vproc_elem.sv lines 380-394
vproc_lsu.sv lines 502-510
vproc_mul.sv lines 359-367
vproc_sld.sv lines 330-344
Which version of verilator are you using? With 4.210, I need to make a couple of changes to verilator_main.cpp for it to work. First, I include the following:
#include "Vvproc_top___024root.h"
I also have to change the following line:
Line 239 in beb86f9
to become (adds rootp in the hierarchy):
top->rst_ni, top->mem_req_o, top->mem_addr_o, top->rootp->vproc_top__DOT__v_core__DOT__vreg_rd_hazard_map_q, top->rootp->vproc_top__DOT__v_core__DOT__vreg_wr_hazard_map_q, 0);
I'm unsure if the following code is legal or not.
I am attempting to add reduce a vector group down to a single element. I get something like below when compiling with GCC:
# initialize v1 to 32-bit element, LMUL=1 value 0
0107f7d7 vsetvli a5,a5,e32,m1,tu,mu
5e0030d7 vmv.v.i v1,0
# now add reduce v4, which is also 32-bit elements but has LMUL=4, into v1
0127f757 vsetvli a4,a5,e32,m4,tu,mu
0240a0d7 vredsum.vs v1,v4,v1
According to the 0.10 spec:
When LMUL=4, the vector register group contains four vector registers, and instructions specifying an LMUL=4 vector register
group using vector register numbers that are not multiples of four are reserved.
So, since LMUL=4, the source v1 and destination v1 are both flagged by Vicuna as invalid registers, since v1 is not a multiple of 4. But this is a strange case. While LMUL=4 for v4, LMUL=1 for v1. This assumes we can associate different LMUL values for different registers/register groups. So there's no fundamental reason why this instruction is invalid.
I do not know if this is valid or not. Spike correctly simulates this. But LLVM compiles using v8 and v16 instead of v1 and v4, so that avoids this issue. However, LLVM produces the following code elsewhere, which is the same problem. Here, I load an 8-bit vector into v9 using LMUL=1, then load a 16-bit vector into v10 using LMUL=2. Then I sign extend the 8-bit register into 16 bits, from v9 to v12. But LMUL=2 for this instruction, so Vicuna flags v9 as an invalid register.
040576d7 vsetvli a3,a0,e8,m1,ta,mu
02058487 vle8.v v9,(a1)
049576d7 vsetvli a3,a0,e16,m2,ta,mu
02065507 vle16.v v10,(a2)
4a93a657 vsext.vf2 v12,v9
eec52857 vwmul.vv v16,v12,v10
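As far as I can tell from the 1.0 spec, the scalar operand and result of a reduction always occupy a single vector register regardless of LMUL, and the source of vsext.vf2 has EMUL = LMUL/2. Under that reading, both sequences above use legal register numbers. The sketch below expresses only the alignment check under these assumptions; it ignores the register overlap rules, and the function names are hypothetical.

```c
#include <stdbool.h>

/* A register group with integer EMUL >= 1 must start at a register
 * number that is a multiple of EMUL; fractional EMUL is treated as 1. */
bool vreg_aligned(unsigned vreg, unsigned emul)
{
    return emul <= 1 || (vreg % emul) == 0;
}

/* vredsum.vs vd, vs2, vs1: only vs2 is a register group. */
bool vredsum_regs_ok(unsigned vd, unsigned vs2, unsigned vs1, unsigned lmul)
{
    (void)vd; (void)vs1;   /* single registers: any number is allowed */
    return vreg_aligned(vs2, lmul);
}

/* vsext.vf2 vd, vs2: the source operand has EMUL = LMUL / 2. */
bool vsext_vf2_regs_ok(unsigned vd, unsigned vs2, unsigned lmul)
{
    return vreg_aligned(vd, lmul) && vreg_aligned(vs2, lmul / 2);
}
```

With these checks, vredsum.vs v1,v4,v1 at LMUL=4 and vsext.vf2 v12,v9 at LMUL=2 both pass, while a misaligned group such as v5 at LMUL=4 is rejected.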
Can you help me debug a dot product issue? I am taking two vectors, 9 16-bit elements each, and performing a dot product on them. I am using a 16-bit to 32-bit multiply, followed by a regular reduction sum (not width widening). Below is the relevant assembly.
800129cc <vect_dotProduct>:
800129cc: c500f757 vsetivli a4,1,e32,m1,ta,mu
800129d0: 5e003457 vmv.v.i v8,0
800129d4: c50d beqz a0,800129fe <vect_dotProduct+0x32>
800129d6: 04057757 vsetvli a4,a0,e8,m1,ta,mu
800129da: 04877057 vsetvli zero,a4,e16,m1,ta,mu
800129de: 0205d487 vle16.v v9,(a1)
800129e2: 02065507 vle16.v v10,(a2)
800129e6: ee952657 vwmul.vv v12,v9,v10
800129ea: 01107057 vsetvli zero,zero,e32,m2,tu,mu
800129ee: 02c42457 vredsum.vs v8,v12,v8
800129f2: 00171793 slli a5,a4,0x1
800129f6: 95be add a1,a1,a5
800129f8: 8d19 sub a0,a0,a4
800129fa: 963e add a2,a2,a5
800129fc: fd69 bnez a0,800129d6 <vect_dotProduct+0xa>
800129fe: c500f057 vsetivli zero,1,e32,m1,ta,mu
80012a02: 0206e427 vse32.v v8,(a3)
80012a06: 8082 ret
I see it loading the vectors correctly, then multiplying them correctly. However, when it gets to the reduction sum, the pipe_in_ctrl_i.vl_part_0 signal is high. This means the following line of RTL never actually adds the elements into the result. I see the elements appearing one-by-one in the elem_q signal, but the result stays 0.
Line 229 in 06f1d2c
What is the next step in debugging this? What sets the vl_part_0 signal high? Thanks!
In several places, the sensitivity list for reset fails linting. For example, in vproc_lsu.sv, line 413:
always_ff @(posedge clk_i or negedge async_rst_n) begin : vproc_lsu_stage_vreg
if (~async_rst_n | (~ASYNC_RESET & ~rst_ni)) begin
state_vreg_q <= '{busy: 1'b0, default: 'x};
end
Having async_rst_n with other signals in the sensitivity list is the problem. I suggest you find a linting tool. Verilator has one, though I'm not sure if it catches this.
This page shows you how to fix the issue:
https://www.intel.com/content/www/us/en/programmable/quartushelp/13.0/mergedProjects/msgs/msgs/evrfx_veri_if_condition_does_not_match_sensitivity_list_edge.htm
Lines 27 to 39 in 4947abf
There's an asynchronous reset issue since data is assigned here but not reset in either of the blocks above. This puts the asynchronous reset into combinational logic for the input of data.
The assignment to data should either
clk_i in the sensitivity list.
There's a combinational loop when using the Ibex core:
vproc_top.vect_instr_gnt
vproc_top.vect_instr_commit
vproc_top.v_core.queue_push
vproc_top.v_core.dec_ready
Line numbers:
https://github.com/vproc/vicuna/blob/main/rtl/vproc_top.sv#L172
https://github.com/vproc/vicuna/blob/main/rtl/vproc_top.sv#L335
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L263
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L269
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L279
https://github.com/vproc/vicuna/blob/main/rtl/vproc_core.sv#L227
Hi @michael-platzer, I have another question. I have read your paper (https://publik.tuwien.ac.at/files/publik_296583.pdf) multiple times, and Figure 2 shows Ibex and Vicuna with both data and instruction caches and an external memory. The demo project currently uses a RAM built from FPGA BRAM, and by default no caches are enabled. I would like to replace the BRAM with the DDR2 memory of my Nexys 4 DDR board so that I can run much larger applications such as CNNs. I am somewhat familiar with the Vivado MIG IP and have generated it, but attaching it to the core as an external memory and, most importantly, filling it with the .vmem file is a problem (unlike MicroBlaze, there is no bootloader for Ibex). Since you have done this, can you help? Would it be possible to add an example to this repo as well? I don't think anyone has mentioned it, and I have not found anything useful. I have really tried, but initializing the DDR with a .vmem file is hard for me to understand and solve (as you of course know, $readmemh cannot be used to fill DDR with data). Here are some of my discussions with the Ibex developers (I used my other GitHub account):
https://github.com/lowRISC/ibex/issues/1466
Thank you...
Another instruction that needs to be implemented:
62850427 vs4r.v v8,(a0)
Hi @michael-platzer, I could not generate your demo project via the .tcl files, probably because my Vivado has problems, so I manually added the required RTL files. My board is a Nexys 4 DDR and it does not have a differential clock as used in the top module of the demo. What can I do? Besides, I built all of the tests in the sim directory by running make and I have all of the .vmem files. I want to test one of them in this demo project by adding the .mem file and get a simple output for a start. I am somewhat familiar with Ibex, so I know how to do that, but after generating the .bit file, what should I expect as output? I installed Tera Term (https://osdn.net/projects/ttssh2/releases/) because I saw UART support, but I still don't know what will happen after running, say, alu/vadd_8.vmem on the core. Can you clarify? Sorry if this is too basic, I have just started the project.
Line 92 in dd20efa
I'm seeing an error that one of these accesses is out-of-bounds. I'm guessing it's not smart enough to ignore one of these outputs based off the ternary operator condition.
I changed the alu/test_configs.conf file to
VREG_W=128 VMEM_W=64 VMUL_W=64
and now simulation fails. I believe the data cache is performing a width conversion to reduce a wide Vicuna data port into a 32-bit port for arbitration with the Ibex instruction port. So when you remove the data cache, you cannot have a wide Vicuna data port.
Can you add support for a wide Vicuna data port without a data cache? Or, at the very least, check the input parameters and error when using a wide Vicuna data port without a data cache?
[CONFIG ] alu/vxor_8 VREG_W=128 VMEM_W=64 VMUL_W=64
[ ERROR ] alu/vxor_8/vxor_8 527 cycles ( 12 - 539)
incorrect memory content; diff:
...
@@ -1,7 +1,7 @@
6861651d
-1d191160
+47434b3a
6a757468
-b21a100b
+e8404a51
3f44383b
37424d54
5e4b5049
make: *** [Makefile:35: alu/vxor_8] Error 1
Hi @michael-platzer, I have tried to compile two simple CNN programs with your Makefile and link.ld file and load them into the BRAM of my board using bootserdow. After running make -f /path/to/vicuna/sw/Makefile PROG=test OBJ=test.o for the first program, I got the following error:
I searched for the error, and one solution was to add this code:
void *memcpy(void *dest, const void *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        ((char*)dest)[i] = ((const char*)src)[i];
    }
    return dest; // memcpy must return the destination pointer
}
I did that and the error was gone and the output is correct. Is this the only way? Meaning, should I always add this code at the beginning of all my programs, or is there a better way? And most importantly, what does this error mean?
Oddly, I also found the following version, but its output differs. This one's output is wrong!
void * memcpy ( void * destination, const void * source, int num ){
    int i=0;
    *((int*)destination) = *((int*)source);
}
test.md
I had to change the format from .c to .md in order to send it...
From my understanding, the memory arbiter stalls requests from the scalar core if the vector core has outstanding loads or stores. But I do not think this stalling is correct. Consider the following assembly code:
.text
.global main
main:
li a0, 64
la a1, vdata_start
la a2, vdata_mid
la a3, vref_end
vsetvli a4,a0,e32,m8,tu,mu
vle32.v v8,(a1)
vle32.v v16,(a2)
vadd.vv v8,v8,v16
vse32.v v8,(a3)
lw a4, 0(a3)
la a0, vdata_start
la a1, vdata_end
j spill_cache
.data
.align 10
.global vdata_start
.global vdata_mid
.global vdata_end
vdata_start:
.rept 64
.word 0x323b3f47
.endr
vdata_mid:
.rept 64
.word 0x47434b3a
.endr
vdata_end:
.align 10
.global vref_start
.global vref_end
vref_start:
// not correct, but we don't care for this test
.rept 64
.word 0xe2fa599a
.endr
vref_end:
It sets the vector length to 64. It loads 2 vectors, each with 64 32b entries. Then it adds them and stores the result. Then the scalar core immediately tries to read the first word in the result. This read should be stalled until the vector core completes its store instruction. But the waveform shows otherwise.
Arrow points to Ibex memory read, which occurs too early:
The top group is the data memory interface coming from the scalar core (Ibex). The middle group is the data memory interface coming from the vector core (Vicuna). You can see the memory read from Ibex is granted before Vicuna starts loading its vectors.
I believe this is because pending_store_o is not registered, so it is low between the time Vicuna receives the store instruction and the time it actually starts executing it. You can see pending_store_o in the bottom group of the above image.
The assembly code above should work in your test framework, though it will fail. But you can run it and generate a waveform to reproduce this issue.
Not sure what the exact error is, but see #47 for a new test case that fails. The provided assembly passes when VREG_W is 128, but not when it is larger. Can you first confirm that it is a bug?
Line 765 in 66ed264
I think this signal should be XIF_ID_W bits wide, not 1 bit wide.