Text - Code point vs. code unit

A code point (sometimes called a code value) is a numerical value that represents a Unicode character. Valid code points range from 0 to 0x10FFFF.
Think of this number as an index into the Unicode character set. A code point is an abstract number: it is not yet encoded, so it is not yet in a form that can be stored in a document.

An example

The musical symbol G clef

𝄞

is represented by the code point U+1D11E (decimal 119,070).
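
As a small illustration of the code point as a plain number, here is a sketch in Python (chosen only for this example; the variable names are illustrative). It assumes Python 3, where the built-in functions ord and chr convert between a character and its code point.

```python
# The G clef written as a single-character string via its code point escape.
g_clef = "\U0001D11E"

# ord() maps a character to its code point (an ordinary integer).
code_point = ord(g_clef)
print(f"U+{code_point:04X}")   # U+1D11E
print(code_point)              # 119070

# chr() maps a code point back to the character.
print(chr(0x1D11E) == g_clef)  # True

# Python 3 strings are sequences of code points, so the length is 1.
print(len(g_clef))             # 1
```

Nothing here has been encoded yet: the integer 119070 says which character is meant, not how it is laid out in bytes.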

A code point is encoded as a sequence of integers (called code units), whose bit size depends upon the selected character encoding.

An example

The three most popular character encodings for Unicode are UTF-8, UTF-16, and UTF-32. The table below shows how each one encodes the G clef (U+1D11E).

Character encoding   Code unit size   Encoded value         Description
UTF-8                8-bit            0xF0 0x9D 0x84 0x9E   4 bytes: a sequence of 4 code units, each 8 bits long
UTF-16               16-bit           0xD834 0xDD1E         4 bytes: a sequence of 2 code units, each 16 bits long
UTF-32               32-bit           0x0001D11E            4 bytes: a sequence of 1 code unit, 32 bits long
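
The encoded values above can be reproduced with a short Python sketch. The helper code_units is a hypothetical name introduced for this example. It assumes the big-endian codecs "utf-16-be" and "utf-32-be", so that no byte order mark is added and each code unit prints in its natural order.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> list[str]:
    """Encode text, then split the bytes into code units of unit_bytes each.

    Each code unit is returned as a zero-padded hexadecimal string.
    """
    data = text.encode(encoding)
    width = unit_bytes * 2  # two hex digits per byte
    return [
        f"0x{int.from_bytes(data[i:i + unit_bytes], 'big'):0{width}X}"
        for i in range(0, len(data), unit_bytes)
    ]


g_clef = "\U0001D11E"

utf8_units = code_units(g_clef, "utf-8", 1)
utf16_units = code_units(g_clef, "utf-16-be", 2)
utf32_units = code_units(g_clef, "utf-32-be", 4)

print(utf8_units)   # ['0xF0', '0x9D', '0x84', '0x9E']
print(utf16_units)  # ['0xD834', '0xDD1E']
print(utf32_units)  # ['0x0001D11E']
```

All three encodings happen to need 4 bytes for this character. What differs is how those bytes are grouped into code units.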

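Note that, for this character, only the UTF-32 encoded value is equal to the code point number. This is by design: UTF-32 always stores a code point directly in a single 32-bit code unit, whereas UTF-8 and UTF-16 must spread the bits of a large code point such as this one across several smaller code units.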
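
To show why the UTF-8 and UTF-16 values look nothing like 0x1D11E, here is a sketch in Python of the bit arithmetic for code points above 0xFFFF. The function names are hypothetical. The sketch deliberately handles only the range 0x10000 to 0x10FFFF, the case the G clef falls into, and is not a complete encoder.

```python
def utf16_surrogate_pair(cp: int) -> list[int]:
    """Split a code point in 0x10000..0x10FFFF into two 16-bit code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (v >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits go into the low surrogate
    return [high, low]


def utf8_four_bytes(cp: int) -> list[int]:
    """Pack a code point in 0x10000..0x10FFFF into four 8-bit code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    return [
        0xF0 | (cp >> 18),           # lead byte 11110xxx carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # continuation bytes 10xxxxxx carry 6 bits each
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ]


cp = 0x1D11E
utf16 = utf16_surrogate_pair(cp)
utf8 = utf8_four_bytes(cp)

print([hex(u) for u in utf16])  # ['0xd834', '0xdd1e']
print([hex(b) for b in utf8])   # ['0xf0', '0x9d', '0x84', '0x9e']
```

In both cases the bits of the code point are still present, but they are interleaved with marker bits that identify each code unit's role, so the encoded value no longer reads as the code point number.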
