Text - What is Unicode? A summary

Unicode is a standard that aims to represent every character of every written language in the world.

Each Unicode character is identified by a code point. A code point is a number that represents a specific character.
Think of this number as an index into a book of characters (one per page). A code point is not yet in a form that can be included in a document, because it has not been encoded.

For example, the musical symbol G clef 𝄞 is represented by the code point U+1D11E (Decimal 119,070).
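
As a small illustration of the code point as a plain number, here is a sketch assuming Python 3, where ord() and chr() convert between a character and its code point. The variable names are only for illustration.

```python
# The G clef, written using its code point as an escape sequence.
g_clef = "\U0001D11E"

# ord() maps a character to its code point (an integer).
code_point = ord(g_clef)
print(code_point)                # 119070
print(f"U+{code_point:04X}")     # U+1D11E

# chr() goes the other way: code point -> character.
print(chr(0x1D11E) == g_clef)    # True
```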

A character encoding is an algorithm that maps a code point to a sequence of code units (bytes, words, or double-words) that can be included in a document.

The three most popular character encodings for Unicode are UTF-8, UTF-16, and UTF-32.

UTF-8 is a variable-width character encoding that uses 1, 2, 3, or 4 bytes per code point. The code points that encode to 1 byte are backwards compatible with ASCII.
UTF-16 is a variable-width character encoding that uses 2 or 4 bytes per code point.
UTF-32 is a fixed-width character encoding that uses 4 bytes per code point.
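
The widths listed above can be checked with a short sketch, assuming Python 3 and its built-in codecs. The sample characters and the helper name encoded_sizes are illustrative choices, not part of the original text. The "-be" codec variants are used so that no byte order mark is added to the output.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes in UTF-8.
samples = {
    "A": "A",                 # U+0041
    "e-acute": "\u00E9",      # U+00E9
    "euro": "\u20AC",         # U+20AC
    "g-clef": "\U0001D11E",   # U+1D11E
}

def encoded_sizes(ch):
    """Return the encoded length in bytes for (UTF-8, UTF-16, UTF-32)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
# A (1, 2, 4)
# e-acute (2, 2, 4)
# euro (3, 2, 4)
# g-clef (4, 4, 4)

# The 1-byte UTF-8 form of an ASCII character is the same byte as in ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```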


A code unit is a fixed-size sequence of bits (8 bits for UTF-8, 16 bits for UTF-16, 32 bits for UTF-32) that is output by a character encoding. It can be thought of as a character encoding word.
Each code point, via a character encoding, is encoded to 1 or more code units.
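
To make the idea of code units concrete, the following sketch (assuming Python 3) splits the encoded bytes of a character into code units of a given size. The helper code_units and its parameters are hypothetical names introduced for this illustration. The euro sign (U+20AC) is used as the sample character.

```python
def code_units(text, encoding, unit_bytes):
    """Encode text, then group the bytes into code units of unit_bytes each."""
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_bytes], "big")
        for i in range(0, len(data), unit_bytes)
    ]

euro = "\u20AC"

# Big-endian codecs are used so the bytes of each unit read in natural order.
print([hex(u) for u in code_units(euro, "utf-8", 1)])      # ['0xe2', '0x82', '0xac']
print([hex(u) for u in code_units(euro, "utf-16-be", 2)])  # ['0x20ac']
print([hex(u) for u in code_units(euro, "utf-32-be", 4)])  # ['0x20ac']
```

The same code point becomes three 8-bit code units in UTF-8, but a single code unit in UTF-16 and in UTF-32.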

An example:

Encoding the musical symbol G clef 𝄞 (code point U+1D11E) using the 3 UTF character encodings, we have:

Character Encoding   Code Unit Size   Encoded Value         Description
UTF-8                8-bit            0xF0 0x9D 0x84 0x9E   4 bytes. A sequence of 4 code units, each 8 bits in length
UTF-16               16-bit           0xD834 0xDD1E         4 bytes. A sequence of 2 code units, each 16 bits in length
UTF-32               32-bit           0x0001D11E            4 bytes. A sequence of 1 code unit, 32 bits in length
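
The encoded values above can be derived by hand from the code point. The following is a sketch of that bit arithmetic, assuming Python 3. It covers only the case needed here (a code point above U+FFFF, which takes 4 bytes in UTF-8 and a surrogate pair in UTF-16) and is not a general-purpose encoder. The variable names are illustrative.

```python
cp = 0x1D11E  # code point of the G clef

# UTF-32: the code point itself is the single 32-bit code unit.
utf32 = [cp]

# UTF-16: subtract 0x10000, then split the 20-bit result into two
# 10-bit halves and add them to the surrogate bases 0xD800 and 0xDC00.
v = cp - 0x10000
utf16 = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

# UTF-8 (4-byte form): a lead byte 11110xxx followed by three
# continuation bytes 10xxxxxx, each carrying 6 bits of the code point.
utf8 = [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
]

print([f"0x{u:02X}" for u in utf8])    # ['0xF0', '0x9D', '0x84', '0x9E']
print([f"0x{u:04X}" for u in utf16])   # ['0xD834', '0xDD1E']
print([f"0x{u:08X}" for u in utf32])   # ['0x0001D11E']

# Cross-check the hand-computed UTF-8 bytes against Python's own codec.
print(bytes(utf8) == chr(cp).encode("utf-8"))   # True
```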





© Richard McGrath