Code Monkey home page Code Monkey logo

branchless-utf8's Introduction

Branchless UTF-8 Decoder

Full article: A Branchless UTF-8 Decoder

Example usage

#define N (1 << 20)  // 1 MiB

// input buffer with 3 bytes of zero padding
char buf[N+3];
char *end = buf + fread(buf, 1, N, stdin);
end[0] = end[1] = end[2] = 0;

// output buffer: parsed code points
int len = 0;
uint32_t cp[N];

int errors = 0;
for (char *p = buf; p < end;) {
    int e;
    p = utf8_decode(p, cp+len++, &e);
    errors |= e;
}
if (errors) {
    // decode failure
}

branchless-utf8's People

Contributors

skeeto avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

branchless-utf8's Issues

overreads

it's documented behavior so not strictly a bug, however you could avoid overreads by by computing the offsets from a table, there's only a few.

doesn't reject codepoints over U+10FFFF

It does appear to handle overlong encodings through the use of the minimum table but sequences starting from 0x110000 and up are incorrectly handled instead of rejected. This is easier understood since it must accept a lead byte of the values in that range (since byte 0xF4 is valid), however there is no logic for rejecting the invalid encodings starting with that byte, for which there are many.

lengths table?

Hi,
Thanks for writing and creating this repository. I am looking to understand the algorithm used but am bit perplexed by the usage of the lengths table. I understand that after shifting right 3 bits the first byte can only get max int value of 31. However, I am not sure how you derived the table with utf8 lengths from those numbers. Would it be possible to provide some explanation on how you derived such a table? or perhaps a resource somewhere in the internet?

Much Thanks.
B.

Example Usage

In case you're interested in a demonstration, I've added one to my fork which I used to see how to use UTF-8 in C. It does add a zlib dependency and a klib header file (which I've included), but it shows how one would use the decoder in practice.

If not, no worries. I just found myself looking at your code and not knowing at all how to use it. I imagine there might be others who are interested in branchless coding who might be in a similar position.

Smaller lengths array?

Found this code referenced inside imgui, so far as I can tell I'm not sure why the lengths array needs to contain 32 results.

The reason being is that the bits that control the length of a utf8 sequence are the leading 1s up at the front of a byte with a terminating 0 (assuming there was software out there dealing with utf8 in a bitstream).

0 -> ascii -> 1 byte
10 -> continuation -> 1 byte
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence

111110 -> presumably a 5 byte sequence
1111110 -> presumably a 6 byte sequence
11111110 -> presumably a 7 byte sequence
11111111 -> presumably an 8 byte sequence

However, currently utf8 only deals with at worst 4 byte code points so while the pattern could continue at the moment there aren't 5 byte sequences. Which means you could just drop the terminating 0 for the 4 byte sequence and work from there.

Now I'm not sure about the rest of the code dealing with errors and masks and shifting around...but presumably...an equivalent function exists, but with a smaller table.

    static const char lengths[] = {
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4]; // here we just grab the upper nibble (the 4 bits which determine the length)
    // I kept the 0s which appear to contribute to determining the error.
    // in theory the code past this point works in a similar fashion except for the error handling of the erroneous sequence of 11111

For the remaining decoding section here:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.

    *c  = (uint32_t)((s[0] << len) >> len) << 18;
    // alternatively
    *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;

Similar concepts could apply for the shiftc and shifte tables as these are multiples of 6 and 2.

   //*c >>= shiftc[len];
   *c >>= 24 - 6*len;
   //*e >>= shifte[len];
   *e >>= 8 - 2*len;

I'm guessing you've probably written code like this already, but I was curious.

Setting the error argument value

What is the deal with the value returned in the error argument? Are the values in the bits 6, 7, and 8 supposed to shift with the rest? What is the intended use of the value in argument e? Is it supposed to be used as a simple boolean or a structured bit field? Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.