Found this code referenced inside imgui; as far as I can tell, I'm not sure why the lengths
array needs to contain 32 entries.
The reason is that the bits controlling the length of a UTF-8 sequence are the leading 1s at the front of the byte, followed by a terminating 0 (presumably so that software dealing with UTF-8 as a bitstream could delimit the prefix).
0 -> ascii -> 1 byte
10 -> continuation byte -> not valid as a sequence start (hence the 0s in the table)
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence
111110 -> presumably a 5 byte sequence
1111110 -> presumably a 6 byte sequence
11111110 -> presumably a 7 byte sequence
11111111 -> presumably an 8 byte sequence (no room left for a terminating 0)
However, UTF-8 as currently defined (RFC 3629) tops out at 4-byte sequences, so while the pattern could continue, at the moment there are no 5-byte sequences. That means you could just drop the terminating 0 from the 4-byte prefix and work from there.
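As an aside, since the length is literally the count of leading 1s, it could in principle be computed with no table at all. A sketch of that idea (my own, untested against the original code, and __builtin_clz is GCC/Clang-specific):

```c
#include <stdint.h>

/* Length of a UTF-8 sequence from its leading byte, by counting the
 * leading 1s directly instead of doing a table lookup. Returns 1 for
 * ASCII, 0 for a continuation byte, 2..4 for multi-byte leading bytes,
 * and >4 for the invalid 11111xxx bytes. */
static int utf8_length_clz(uint8_t b)
{
    if (b < 0x80)
        return 1;                  /* 0xxxxxxx: ASCII */
    /* Place the byte in the top 8 bits and invert: the leading zeros
     * of the inverse are the leading ones of the byte. */
    int ones = __builtin_clz(~((uint32_t)b << 24));
    return ones == 1 ? 0 : ones;   /* 10xxxxxx: continuation -> 0 */
}
```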
Now, I'm not sure about the rest of the code dealing with errors and masks and shifting around... but presumably an equivalent function exists with the smaller table.
static const char lengths[] = {
    1, 1, 1, 1, 1, 1, 1, 1, /* 0xxx: ASCII */
    0, 0, 0, 0,             /* 10xx: continuation byte, not a sequence start */
    2, 2,                   /* 110x */
    3,                      /* 1110 */
    4                       /* 1111 */
};
static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};    /* payload bits of the leading byte */
static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536}; /* smallest code point for each length */
static const int shiftc[] = {0, 18, 12, 6, 0};
static const int shifte[] = {0, 6, 4, 2, 0};
unsigned char *s = buf;
int len = lengths[s[0] >> 4]; // here we just grab the upper nibble (the 4 bits which determine the length)
// I kept the 0s, which appear to contribute to determining the error.
In theory the code past this point works in a similar fashion, except for the error handling of the erroneous 11111 prefixes (leading bytes 0xF8..0xFF), which a 4-bit index can no longer tell apart from 11110.
For the remaining decoding section here:
*c = (uint32_t)(s[0] & masks[len]) << 18;
*c |= (uint32_t)(s[1] & 0x3f) << 12;
*c |= (uint32_t)(s[2] & 0x3f) << 6;
*c |= (uint32_t)(s[3] & 0x3f) << 0;
*c >>= shiftc[len];
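Putting the fragments together, here is how I'd sketch the whole function with the 16-entry table. This is my reconstruction, modelled on the branchless decoder these snippets come from: the error-accumulation lines are assumptions on my part, and I've bolted on one extra check for the 0xF8..0xFF leading bytes the 4-bit index can't reject.

```c
#include <stdint.h>

/* Sketch: branchless-style UTF-8 decode with the 16-entry length table.
 * Assumes the buffer is padded so that reading 4 bytes is always safe.
 * Writes the code point to *c, a nonzero value to *e on malformed
 * input, and returns a pointer just past the consumed sequence. */
static void *utf8_decode_16(void *buf, uint32_t *c, int *e)
{
    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4];

    /* Assemble the code point from up to 4 bytes, then shift out the
     * positions that shorter sequences never filled. */
    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) << 6;
    *c |= (uint32_t)(s[3] & 0x3f) << 0;
    *c >>= shiftc[len];

    /* Accumulate error conditions (assumed, by analogy with the
     * 32-entry original). */
    *e  = (*c < mins[len]) << 6;        /* overlong encoding? */
    *e |= ((*c >> 11) == 0x1b) << 7;    /* surrogate half? */
    *e |= (*c > 0x10FFFF) << 8;         /* out of range? */
    *e |= (s[1] & 0xc0) >> 2;
    *e |= (s[2] & 0xc0) >> 4;
    *e |= (s[3] & 0xc0) >> 6;
    *e ^= 0x2a;                         /* top two bits of each tail byte 10? */
    *e >>= shifte[len];

    /* The check the 32-entry table gave for free: a 4-bit index maps
     * every 1111xxxx byte to length 4, so reject 11111xxx explicitly. */
    *e |= (s[0] & 0xf8) == 0xf8;

    return s + len + !len;
}
```

For len == 0 (a stray continuation byte) the mins[0] entry of 4194304 guarantees the overlong test fires, and the + !len term still advances the pointer by one byte.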
I haven't exactly tested this... but it strikes me that the masking operation doesn't need a table:
*c = (uint32_t)((uint8_t)(s[0] << len) >> len) << 18; // cast back to 8 bits so the length bits actually fall off the top (s[0] << len promotes to int)
// alternatively
*c = (uint32_t)(s[0] & (0xff >> len)) << 18;
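To convince myself, here's a quick brute-force check (mine, not from the original code) that 0xff >> len agrees with masks[len] for every byte that's valid as a sequence start. The extra bit the shifted mask leaves set sits on the prefix's terminating 0, which is always clear in a well-formed leading byte; the two do differ for len == 0 and for the invalid 0xF8..0xFF bytes, but those are error cases anyway.

```c
/* Brute-force check that b & (0xff >> len) == b & masks[len] for
 * every byte valid as the start of a UTF-8 sequence. Returns 1 on
 * success, 0 if any valid leading byte disagrees. */
static int check_mask_equivalence(void)
{
    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    static const int masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};

    for (int b = 0; b <= 0xff; b++) {
        int len = lengths[b >> 4];
        /* Skip continuation bytes and 11111xxx: invalid as leading bytes. */
        if (len == 0 || (b & 0xf8) == 0xf8)
            continue;
        if ((b & masks[len]) != (b & (0xff >> len)))
            return 0;
    }
    return 1;
}
```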
Similar arithmetic could replace the shiftc and shifte tables, since their nonzero entries step by 6 and 2 (the formulas match the tables for len 1 through 4; for len == 0 they give 24 and 8 instead of 0, but that's an error case anyway):
//*c >>= shiftc[len];
*c >>= 24 - 6*len;
//*e >>= shifte[len];
*e >>= 8 - 2*len;
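And a matching sanity check (again mine) that the arithmetic reproduces both shift tables over the lengths that matter:

```c
/* Verify 24 - 6*len and 8 - 2*len reproduce shiftc[] and shifte[]
 * for len 1..4; len == 0 is the error path, where the decoded value
 * gets discarded. Returns 1 on success. */
static int check_shift_formulas(void)
{
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    for (int len = 1; len <= 4; len++) {
        if (shiftc[len] != 24 - 6 * len)
            return 0;
        if (shifte[len] != 8 - 2 * len)
            return 0;
    }
    return 1;
}
```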
I'm guessing you've probably written code like this already, but I was curious.