Text - Code point vs. code unit

A code point (sometimes called a code position) is a numerical value, in the range U+0000 to U+10FFFF, that identifies a Unicode character.
Think of this number as an index into the Unicode character set. A code point is an abstract number: it isn't yet encoded, so it isn't yet in a form that can be stored in a document.

An example

The musical symbol G clef
𝄞
is represented by the code point U+1D11E (Decimal 119,070).
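
As a quick check, the short Python sketch below looks up the code point of the G clef. It is only an illustration; the variable names g_clef and code_point are made up for this example, and the built-in functions ord() and chr() do the conversion.

```python
# The G clef, written with a Unicode escape so the source file stays ASCII.
g_clef = "\U0001D11E"

# ord() maps a one-character string to its code point (an integer).
code_point = ord(g_clef)

print(f"U+{code_point:04X}")   # hexadecimal form of the code point
print(code_point)              # decimal form of the code point

# chr() goes the other way: code point -> character.
assert chr(code_point) == g_clef
```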

A code point is encoded as a sequence of integers (called code units), whose bit size depends upon the selected character encoding.


An example

The three most popular character encodings for Unicode are UTF-8, UTF-16, and UTF-32. The table below shows how each one encodes the G clef (U+1D11E).

Character Encoding   Code Unit size   Encoded value         Description
UTF-8                8-bit            0xF0 0x9D 0x84 0x9E   4 bytes. A sequence of 4 code units, each 8 bits in length
UTF-16               16-bit           0xD834 0xDD1E         4 bytes. A sequence of 2 code units, each 16 bits in length
UTF-32               32-bit           0x0001D11E            4 bytes. A sequence of 1 code unit, 32 bits in length
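
The encoded values above can be reproduced with a few lines of Python. This is a minimal sketch: the helper code_units() is a hypothetical name invented for this example, and it uses the big-endian codecs ("utf-16-be", "utf-32-be") so that no byte order mark is added and the code units read in the same order as written above.

```python
import struct

g_clef = "\U0001D11E"  # the G clef, U+1D11E


def code_units(text, encoding, fmt, size):
    """Encode text, then split the bytes into big-endian code units.

    fmt is a struct format character ("B", "H" or "I") and size is
    the width of one code unit in bytes (1, 2 or 4).
    """
    data = text.encode(encoding)
    count = len(data) // size
    return list(struct.unpack(f">{count}{fmt}", data))


utf8 = code_units(g_clef, "utf-8", "B", 1)
utf16 = code_units(g_clef, "utf-16-be", "H", 2)
utf32 = code_units(g_clef, "utf-32-be", "I", 4)

# Print each sequence of code units in hexadecimal.
for name, units, width in (("UTF-8", utf8, 2),
                           ("UTF-16", utf16, 4),
                           ("UTF-32", utf32, 8)):
    print(name, " ".join(f"0x{u:0{width}X}" for u in units))
```

Each encoding yields a different number of code units (4, 2 and 1), but all three occupy 4 bytes for this particular character.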

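Note that only in UTF-32 is the encoded value always equal to the code point number. This is by design: a 32-bit code unit is wide enough to hold any code point directly, whereas UTF-8 and UTF-16 must split larger code points across several code units.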
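
To see why the UTF-16 value differs from the code point, the sketch below works the conversion by hand. It assumes the standard UTF-16 rule for code points above U+FFFF (subtract 0x10000, then split the remaining 20 bits across a high and a low surrogate); the variable names are illustrative only.

```python
code_point = 0x1D11E  # the G clef

# UTF-32: the single code unit is simply the code point itself.
utf32_unit = code_point

# UTF-16: code points above U+FFFF do not fit in 16 bits, so they are
# split into two code units (a surrogate pair).
offset = code_point - 0x10000       # leaves a 20-bit value
high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate

print(f"UTF-32: 0x{utf32_unit:08X}")
print(f"UTF-16: 0x{high:04X} 0x{low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(code_point).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```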


© Richard McGrath