In this section: https://github.com/privatezero/flac_markdown/blob/master/flac.md#coded-number
It's talking about UTF-8, and UCS-2 aka UTF-16, so which encoding format does it use?
(btw sorry about the formatting, idk why the UTF encoding process is bolded)
UTF-8 encoding: The number of leading 1 bits in the first byte tells you how many bytes make up the code point. A leading 0 means it's a single-byte, ASCII-compatible character with 7 payload bits; otherwise the leading byte starts with 2, 3, or 4 ones followed by a zero bit, giving the total byte count. (A byte starting with 0b10 is a continuation byte, never a leading byte.)
Example: 🦄 is U+1F984, encoded in UTF-8 as the bytes 0xF0 0x9F 0xA6 0x84. The leading byte, 0xF0 (0b11110000), says there are 4 bytes in this code point. Every subsequent byte has 0b10 in its top 2 bits, marking it as a continuation byte, and you strip those prefix bits when decoding:
  (0xF0 & 0x07) << 18 = 0x00000
+ (0x9F & 0x3F) << 12 = 0x1F000
+ (0xA6 & 0x3F) <<  6 = 0x00980
+ (0x84 & 0x3F) <<  0 = 0x00004
                      = 0x1F984
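The mask-and-shift arithmetic above can be sketched in Python (a minimal 4-byte decoder for illustration, not a full validating UTF-8 decoder):

```python
def decode_utf8_4byte(b: bytes) -> int:
    # Leading byte 0b11110xxx carries 3 payload bits; each of the three
    # continuation bytes (0b10xxxxxx) carries 6 payload bits.
    return ((b[0] & 0x07) << 18) | \
           ((b[1] & 0x3F) << 12) | \
           ((b[2] & 0x3F) << 6)  | \
            (b[3] & 0x3F)

cp = decode_utf8_4byte(bytes([0xF0, 0x9F, 0xA6, 0x84]))
print(hex(cp))  # 0x1f984
print(chr(cp))  # 🦄
```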
UCS-2, i.e., UTF-16 before surrogate pairs existed, is just a straight 16-bit value with no special encoding.
UTF-16 proper (often misidentified as UCS-2): if the codepoint is 0xD7FF or less, or it's at least 0xE000 and at most 0xFFFF, it's stored as-is in a single 16-bit value; otherwise it's split into a surrogate pair like this:
Since we're encoding the same unicorn from above, we need two 16-bit code units, because 0x1F984 is above 0xFFFF.
So, we take the codepoint and subtract 0x10000, giving 0xF984. For the high surrogate, we divide 0xF984 by 0x400 to get 0x003E, then add 0xD800 to get 0xD83E.
For the low surrogate, we take 0xF984 mod 0x400 to get 0x184, then add 0xDC00 to get 0xDD84.
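The surrogate-pair math above can be sketched as (a minimal sketch; only valid for codepoints above 0xFFFF, no validation):

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # same as v // 0x400
    low  = 0xDC00 + (v & 0x3FF)  # same as v %  0x400
    return high, low

print([hex(u) for u in to_surrogates(0x1F984)])  # ['0xd83e', '0xdd84']
```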
So which encoding is the spec actually using? The wording is very confusing.