skeeto / branchless-utf8 Goto Github PK

View Code? Open in Web Editor NEW

589.0 27.0 33.0 15 KB

Branchless UTF-8 decoder

License: The Unlicense

Makefile 2.70% C 97.30%

branchless-utf8's Introduction

Branchless UTF-8 Decoder

Full article: A Branchless UTF-8 Decoder

Example usage

#define N (1 << 20)  // 1 MiB

// input buffer with 3 bytes of zero padding
char buf[N+3];
char *end = buf + fread(buf, 1, N, stdin);
end[0] = end[1] = end[2] = 0;

// output buffer: parsed code points
int len = 0;
uint32_t cp[N];

int errors = 0;
for (char *p = buf; p < end;) {
    int e;
    p = utf8_decode(p, cp+len++, &e);
    errors |= e;
}
if (errors) {
    // decode failure
}

branchless-utf8's People

Contributors

Stargazers

Watchers

branchless-utf8's Issues

Faster version of reference decoder

Hi, I read your http://nullprogram.com/blog/2017/10/06/ - interesting work. For what it's worth, note that at the end of http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#variations there is an improved version of the decoder that saves a shift for every byte, compared to the original version at the beginning of the article. Might be interesting to use the improved version as reference instead.

overreads

it's documented behavior so not strictly a bug, however you could avoid overreads by by computing the offsets from a table, there's only a few.

doesn't reject codepoints over U+10FFFF

It does appear to handle overlong encodings through the use of the minimum table but sequences starting from 0x110000 and up are incorrectly handled instead of rejected. This is easier understood since it must accept a lead byte of the values in that range (since byte 0xF4 is valid), however there is no logic for rejecting the invalid encodings starting with that byte, for which there are many.

lengths table?

Hi,
Thanks for writing and creating this repository. I am looking to understand the algorithm used but am bit perplexed by the usage of the lengths table. I understand that after shifting right 3 bits the first byte can only get max int value of 31. However, I am not sure how you derived the table with utf8 lengths from those numbers. Would it be possible to provide some explanation on how you derived such a table? or perhaps a resource somewhere in the internet?

Much Thanks.
B.

Invalid first byte bug

If first byte of the sequence is invalid then function can return zero as error code (valid utf-8).

Look fmt issue fmtlib/fmt#3038 , PR fmtlib/fmt#3044 and https://godbolt.org/z/EbeEex4Gf for details.

Example Usage

In case you're interested in a demonstration, I've added one to my fork which I used to see how to use UTF-8 in C. It does add a zlib dependency and a klib header file (which I've included), but it shows how one would use the decoder in practice.

If not, no worries. I just found myself looking at your code and not knowing at all how to use it. I imagine there might be others who are interested in branchless coding who might be in a similar position.

Smaller lengths array?

Found this code referenced inside imgui, so far as I can tell I'm not sure why the lengths array needs to contain 32 results.

The reason being is that the bits that control the length of a utf8 sequence are the leading 1s up at the front of a byte with a terminating 0 (assuming there was software out there dealing with utf8 in a bitstream).

0 -> ascii -> 1 byte
10 -> continuation -> 1 byte
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence

111110 -> presumably a 5 byte sequence
1111110 -> presumably a 6 byte sequence
11111110 -> presumably a 7 byte sequence
11111111 -> presumably an 8 byte sequence

However, currently utf8 only deals with at worst 4 byte code points so while the pattern could continue at the moment there aren't 5 byte sequences. Which means you could just drop the terminating 0 for the 4 byte sequence and work from there.

Now I'm not sure about the rest of the code dealing with errors and masks and shifting around...but presumably...an equivalent function exists, but with a smaller table.

    static const char lengths[] = {
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4]; // here we just grab the upper nibble (the 4 bits which determine the length)
    // I kept the 0s which appear to contribute to determining the error.
    // in theory the code past this point works in a similar fashion except for the error handling of the erroneous sequence of 11111

For the remaining decoding section here:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.

    *c  = (uint32_t)((s[0] << len) >> len) << 18;
    // alternatively
    *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;

Similar concepts could apply for the shiftc and shifte tables as these are multiples of 6 and 2.

   //*c >>= shiftc[len];
   *c >>= 24 - 6*len;

   //*e >>= shifte[len];
   *e >>= 8 - 2*len;

I'm guessing you've probably written code like this already, but I was curious.

Setting the error argument value

What is the deal with the value returned in the error argument? Are the values in the bits 6, 7, and 8 supposed to shift with the rest? What is the intended use of the value in argument e? Is it supposed to be used as a simple boolean or a structured bit field? Thanks.

skeeto / branchless-utf8 Goto Github PK

branchless-utf8's Introduction

Branchless UTF-8 Decoder

Example usage

branchless-utf8's People

Contributors

Stargazers

Watchers

Forkers

branchless-utf8's Issues

Faster version of reference decoder

overreads

doesn't reject codepoints over U+10FFFF

lengths table?

Invalid first byte bug

Example Usage

Smaller lengths array?

Setting the error argument value

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent