riscv / riscv-bfloat16
Home Page: https://jira.riscv.org/browse/RVG-122
License: Creative Commons Attribution 4.0 International
Page six of the specification talks about "encopding", probably a typo of "encoding".
There are no scalar operands on the instructions in Zvfbfmin. Could we shift the Zfbfmin dependency to Zvfbfwma, which does have scalar operands?
In line with the other issue (#51), there is a question mark over rounding-mode support. This one is less clear-cut: Google has not defined which rounding mode its hardware uses, Intel uses RNE, ARM uses RTO or makes it selectable, and NVIDIA uses RTZ.
My argument is that, given the original intent of trading precision for hardware efficiency, the natural choice would be RTZ rounding (as that is free: it is a pure truncation, as the sketch below shows). The current draft specifies a selectable rounding mode, which is consistent with the other floating-point extensions but would be quite costly in comparison to just RTZ.
Is there a way to enable a choice to do just RTZ (or even make it the default)? I guess further sub-extensions would work for options, but that becomes a bit unwieldy.
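For context, here is a minimal bit-level sketch of the two conversions in C (illustrative only, not from the spec; the helper names are mine). RTZ is a pure truncation of the FP32 encoding, while RNE needs an incrementer spanning the kept upper half:

#include <stdint.h>
#include <string.h>

/* RTZ: drop the low 16 bits of the FP32 encoding -- no arithmetic at all. */
static uint16_t fp32_to_bf16_rtz(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* RNE: add a rounding bias before truncating; the carry can ripple through
 * the entire upper half, so this costs an incrementer that RTZ avoids.
 * (NaN inputs would need a separate check, omitted here.) */
static uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u); /* round to nearest, ties to even */
    return (uint16_t)(bits >> 16);
}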
The vfwmaccbf16 and vfwmulbf16 descriptions do not specify that SEW=16 is required, unlike the vector conversion instructions. Is this correct?
The v0.0.1 document does not mention NaN boxing. I can see cases where it should be done, and others where the upper bits should simply be ignored. It could be instruction-specific. Please clarify the behavior.
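For reference, NaN boxing in RISC-V means a narrower value held in a wider F register must have all upper bits set to one, and instructions that check boxing treat anything else as a canonical NaN. A minimal sketch for FLEN=32 (illustrative only; the helper names are mine):

#include <stdbool.h>
#include <stdint.h>

/* Box a BF16 value into a 32-bit F-register image: upper 16 bits all ones. */
static uint32_t nanbox_bf16(uint16_t x) { return 0xFFFF0000u | (uint32_t)x; }

/* A properly boxed 16-bit value has every upper bit set. */
static bool is_boxed_16(uint32_t reg) { return (reg >> 16) == 0xFFFFu; }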
Please can you confirm that BF16 operations are intended to be potentially mixed with FP16 operations without any CSR modifications?
I previously proposed #34, which addressed this. 94fcf6b added extra text on vector extension dependencies, however:
In specification section 2.2.5, the supported rounding modes are listed, with the exception of "DYN".
Is the dynamic rounding mode reserved for FCVT.BF16.S and FCVT.S.BF16?
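For reference, the rounding-mode encodings defined by the base F extension are:

/* RISC-V frm encodings from the unprivileged spec.  DYN (0b111) is only
 * meaningful in an instruction's rm field, where it means "use the frm CSR". */
enum frm { RNE = 0, RTZ = 1, RDN = 2, RUP = 3, RMM = 4, DYN = 7 };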
The PDF specification does not seem to specify full bit patterns to decode the described instructions - many fields just have names. Is this defined somewhere else, or are precise decodes still undefined?
Thanks.
https://github.com/riscv/riscv-bfloat16/blob/main/doc/insns/fcvt_BF16_S.adoc mentions an S.B16 field, but it seems the other mention has been commented out. Should this mention be changed to BF16.S?
The following says that a BF16 implementation must implement FP16:
The BFloat16 extensions depend on the half-precision floating-point extensions (Zfh and Zfhmin), which in turn rely on the single-precision floating-point extension (F).
Can you please clarify this requirement? Is the motivation to get loads and stores? This appears to be overkill.
The instructions that convert to BFLOAT16 (FCVT.BF16.S, vfncvtbf16.f.f.w) do not say that Underflow can be signalled. Is this correct?
In IEEE-754, Underflow should be signalled if the result of a floating-point operation is both tiny (nonzero and smaller in magnitude than the minimum normal) and inexact.
Should conversion of a subnormal FP32 argument which rounds away fraction bits and produces a subnormal or zero result therefore signal Underflow? (See the worked example below.)
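A concrete bit-level instance of the case in question (illustrative only):

#include <stdint.h>

/* 0x00008000 encodes a nonzero FP32 subnormal whose significand lies
 * entirely in the discarded low half.  Converting to BF16 gives +0
 * (under truncation and under RNE alike), so the result is both tiny
 * and inexact -- the IEEE-754 Underflow condition described above. */
uint32_t in  = 0x00008000u;          /* nonzero FP32 subnormal */
uint16_t out = (uint16_t)(in >> 16); /* == 0x0000, BF16 +zero   */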
The descriptions of the vector instructions here refer to the RM field, but vector instructions have no such field. The descriptions should say "the current rounding mode in fcsr", or something similar.
The current version of the spec draft states:
The BF16 extensions do not add any new load or store instructions, as the FLH and FSH 16-bit load and store instructions introduced by the half-precision extensions work just fine for BF16 values.
This is only true for implementations that use the standard IEEE encoding to store floating-point numbers in the RISC-V F registers. An implementation-specific internal encoding would most likely not encode BF16 and half-precision values the same way. Has this issue already been raised in the group discussion?
I'll start by apologising for letting this linger for so long, making it painful this late in the process.
I've recently finished looking into subnormal support in other ISAs (for a RISC-V Summit Europe abstract).
The summary is that other ISAs mostly flush subnormals, namely Google's TPU [1], Intel's AVX-512_BF16 [2], and ARM's v8.2-A [3]. ARM does have optional extended BF16 support, where subnormal handling becomes selectable, and NVIDIA also supports subnormals [4]. As far as I'm aware, subnormal flushing applies to conversions as well as arithmetic.
I'd argue Google's TPU is the closest thing to a standard for BF16. Also, the motivation for the BF16 format is to trade precision for hardware efficiency, as ML often does not seem to need that precision. Both points argue for flushing subnormals (hardware support for this extension would then be very cheap; see the sketch below).
Thoughts? I can highlight this issue on the FP SIG list for further input.
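As a minimal sketch of what flushing would mean on a BF16 operand (hypothetical helper, not from the draft):

#include <stdint.h>

/* Flush-to-zero on a BF16 encoding: a zero exponent field (bits 14..7)
 * marks a subnormal or zero; replace it with a like-signed zero before
 * use, as the flushing ISAs cited below do. */
static uint16_t bf16_ftz(uint16_t x) {
    return ((x & 0x7F80u) == 0) ? (uint16_t)(x & 0x8000u) : x;
}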
[1] S. Wang, P. Kanwar. “BFloat16: The secret to high performance on Cloud TPUs”.
[2] Intel. “BFLOAT16 Hardware Numerics Definitions”.
[3] Arm® Architecture Reference Manual, Armv8-A.
[4] Fasi et al. “Numerical behavior of NVIDIA tensor cores”.
The Zfa extension describes fround.h as "encoded like FCVT.H.S, but with rs2=4". This collides with the proposed fcvt.bf16.s encoding, which also uses 4 in the rs2 position:
field bits<32> Inst = { 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, rs1{4}, rs1{3}, rs1{2}, rs1{1}, rs1{0}, frm{2}, frm{1}, frm{0}, rd{4}, rd{3}, rd{2}, rd{1}, rd{0}, 1, 0, 1, 0, 0, 1, 1 };
The argument order for vfwmaccbf16 is not consistent with similar instructions in the base vector extension. The order is specified here as:
vfwmaccbf16.vv vd, vs2, vs1, vm
vfwmaccbf16.vf vd, vs2, rs1, vm
Whereas similar instructions in the base vector extension are like this:
vfwmacc.vv vd, vs1, vs2, vm # vd[i] = +(vs1[i] * vs2[i]) + vd[i]
vfwmacc.vf vd, rs1, vs2, vm # vd[i] = +(f[rs1] * vs2[i]) + vd[i]
Should these instructions be redefined to match the similar base instructions, as sketched below? Note that only the FMA-type instructions in the base extension use this operand order; most other binary operations use the order as defined here. I'm not sure why this is.
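Concretely, matching the base FMA ordering would mean redefining them as (hypothetical):
vfwmaccbf16.vv vd, vs1, vs2, vm # vd[i] = +(vs1[i] * vs2[i]) + vd[i]
vfwmaccbf16.vf vd, rs1, vs2, vm # vd[i] = +(f[rs1] * vs2[i]) + vd[i]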
Thanks.
Hi, I have downloaded the PDF from https://github.com/riscv/riscv-bfloat16/releases/tag/20230322 and see that the vfwmaccbf16 encoding is 100011, but vfwmaccbf16 in https://github.com/riscv/riscv-bfloat16/blob/main/doc/insns/vfwmaccbf16.adoc is written as 111011. Which one should be followed?
I've started this PR on the psABI doc to look ahead to the (minor) modifications needed to account for BFloat16. Feel free to close this if there's an objection to having a tracking issue for something at another repo, but I thought it was worth advertising to those working on this spec, who may have additional insight.
All of the existing floating-point extensions have a *inx variant (zfinx, zdinx, zhinx, and zhinxmin). Do you plan to define one for zfbfinxmin?
Is there a plan for a full set of operations for BF16, e.g. like FP32 has in the F standard? Are we driving towards a BF16 FADD etc.?
Can anyone tell me the status of RISC-V BF16? BF16 is very important, arguably even more important than FP16. I am wondering when we will have a RISC-V extension for BF16. Will the scalar BF16 instructions be similar to Zfh? Thanks~
The table naming the FP formats in https://github.com/riscv/riscv-bfloat16/blob/main/doc/riscv-bfloat16-format.adoc is titled "Obligatory Floating Point Format Table" and lists some formats specified in IEEE-754 but also some which are not (e.g. BF16 and TF32). Is the term "obligatory" appropriate in that context?
The encoding for fcvt.bf16.s conflicts with fround.h in the Zfa extension:
" If the Zfh extension is implemented, FROUND.H and FROUNDNX.H instructions are analogously
defined to operate on half-precision numbers. They are encoded like FCVT.H.S, but with rs2=4
and 5, respectively,"
Could it make sense to add right away (in Zvfbfmin) a BF16 version of RVV 1.0's vfncvt.rod.f.f.w vd, vs2, vm? Round-to-odd might be quite useful for the use cases of this type of conversion, and I do not think it is available by default in https://github.com/riscv/riscv-bfloat16/blob/main/doc/insns/vfncvtbf16_f_f_w.adoc (since frm does not offer round-to-odd).
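For reference, round-to-odd on this conversion amounts to truncation plus ORing the sticky bit into the result LSB, which is what makes it attractive for avoiding double rounding. A minimal C sketch (illustrative only; the helper name is mine):

#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 with round-to-odd: truncate, then set the result's LSB if
 * any discarded bit was nonzero (the "sticky" bit).  A subsequent rounding
 * of the result is then immune to double-rounding error. */
static uint16_t fp32_to_bf16_rod(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint16_t r = (uint16_t)(bits >> 16);
    if (bits & 0xFFFFu) r |= 1u;
    return r;
}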
Does the BF16 extension support the FNORM FPR encoding?
FP64 and FP32 have corresponding unique load and store operations to match the operand data size. Overloading the FP16 loads and stores to support BF16 matches the operand size but does not specify the end format.