Text - What is a Unicode code unit?

A code unit is the smallest fixed-size group of bits that a character encoding uses to represent encoded text: 8 bits in UTF-8, 16 bits in UTF-16 and 32 bits in UTF-32. It can be thought of as the "word" of a character encoding.
Each code point is encoded, via a character encoding, to one or more code units.
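
As a rough illustration of the "one or more code units" rule, the following Python sketch counts the code units that a few sample characters produce in each UTF encoding. The helper name code_unit_count and the choice of sample characters are assumptions made for this example; the big-endian codec names are used only so that Python does not prepend a byte order mark.

```python
def code_unit_count(ch: str, encoding: str, unit_bytes: int) -> int:
    """Return how many code units `ch` occupies in the given encoding.

    `unit_bytes` is the code unit size in bytes (1, 2 or 4).
    """
    return len(ch.encode(encoding)) // unit_bytes


# (Python codec name, code unit size in bytes) for each UTF encoding.
# The "-be" codecs are chosen so no byte order mark is added.
ENCODINGS = {
    "UTF-8": ("utf-8", 1),
    "UTF-16": ("utf-16-be", 2),
    "UTF-32": ("utf-32-be", 4),
}

# Hypothetical sample characters, picked to show different lengths.
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    counts = {
        name: code_unit_count(ch, codec, size)
        for name, (codec, size) in ENCODINGS.items()
    }
    print(f"U+{ord(ch):04X}", counts)
```

The count grows with the code point in UTF-8 and UTF-16, while UTF-32 always uses a single code unit.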

An example:

Encoding the musical symbol G clef 𝄞 (code point U+1D11E) in each of the three UTF character encodings gives:

Character encoding   Code unit size   Encoded value         Description
UTF-8                8-bit            0xF0 0x9D 0x84 0x9E   4 bytes: a sequence of 4 code units, each 8 bits long
UTF-16               16-bit           0xD834 0xDD1E         4 bytes: a sequence of 2 code units, each 16 bits long
UTF-32               32-bit           0x0001D11E            4 bytes: a sequence of 1 code unit, 32 bits long
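
The encoded values for the G clef can be checked with a short Python sketch. The helper name code_units is an assumption made for this example, and the big-endian codecs are used only to keep a byte order mark out of the output; this is one possible way to inspect code units, not the only one.

```python
from typing import List


def code_units(text: str, encoding: str, unit_bytes: int) -> List[int]:
    """Encode `text` and split the resulting bytes into code units.

    Each code unit is returned as an integer; `unit_bytes` is the
    code unit size in bytes (1, 2 or 4).
    """
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_bytes], "big")
        for i in range(0, len(data), unit_bytes)
    ]


G_CLEF = "\U0001D11E"  # musical symbol G clef

utf8_units = code_units(G_CLEF, "utf-8", 1)
utf16_units = code_units(G_CLEF, "utf-16-be", 2)
utf32_units = code_units(G_CLEF, "utf-32-be", 4)

# Print each sequence in hexadecimal, padded to the code unit width.
print("UTF-8 :", " ".join(f"0x{u:02X}" for u in utf8_units))
print("UTF-16:", " ".join(f"0x{u:04X}" for u in utf16_units))
print("UTF-32:", " ".join(f"0x{u:08X}" for u in utf32_units))
```

All three sequences occupy 4 bytes in total; they differ only in how those bytes are grouped into code units.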





© Richard McGrath