Text - What is a Unicode surrogate pair?

Surrogate pairs are used in UTF-16 only. The term refers to a sequence of 2 code units that together form a single code point.

In detail

UTF-16 is a variable-width character encoding. Each code point (character) is encoded in either 2 bytes or 4 bytes, depending on the code point number.

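For code points between 0x0 and 0xFFFF (i.e. 0 to 65,535) the code point can be encoded in a single code unit (16 bits). The values 0xD800 to 0xDFFF within this range are reserved for surrogates and do not represent characters on their own.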
For code points between 0x10000 and 0x10FFFF the code point requires 2 code units (a 16-bit high word, and a 16-bit low word).
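
As a minimal sketch of the rule above, the following Python function returns how many 16-bit code units a code point needs. The function name utf16_code_unit_count is a hypothetical one chosen for illustration, and rejecting the reserved surrogate values 0xD800 to 0xDFFF is an assumption about how invalid input should be handled.

```python
def utf16_code_unit_count(code_point: int) -> int:
    """Return the number of 16-bit code units UTF-16 needs for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption: reserved surrogate values are treated as invalid input.
        raise ValueError("surrogate values cannot be encoded on their own")
    # 0x0 - 0xFFFF fits in one code unit; 0x10000 - 0x10FFFF needs two.
    return 1 if code_point <= 0xFFFF else 2


print(utf16_code_unit_count(0xFFFF))   # 1 code unit  (2 bytes)
print(utf16_code_unit_count(0x10000))  # 2 code units (4 bytes)
```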

In UTF-16, these two code units together form what is called a surrogate pair.

The first value of the surrogate pair (2 bytes) is called the high surrogate code unit (or leading surrogate).
Its value will be in the range: 0xD800 - 0xDBFF.
The second value of the surrogate pair (2 bytes) is called the low surrogate code unit (or trailing surrogate).
Its value will be in the range: 0xDC00 - 0xDFFF.
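
The two ranges follow from how a code point is split across the pair. The sketch below shows the standard UTF-16 arithmetic: subtract 0x10000 to get a 20-bit value, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate. The function names to_surrogate_pair and from_surrogate_pair are hypothetical, chosen only for this illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in 0x10000 - 0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value, 0x00000 - 0xFFFFF
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> 0xD800 - 0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> 0xDC00 - 0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))                  # 0xd800 0xdc00
print(hex(from_surrogate_pair(high, low)))  # 0x10000
```

The highest code point, 0x10FFFF, maps to the top of both ranges (0xDBFF, 0xDFFF), which is why the ranges end where they do.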

An example

The highest code point of Plane 0 (0xFFFF) can be encoded with a single code unit.
The lowest code point of Plane 1 (0x10000) requires 2 code units: a high surrogate and a low surrogate.


Code Point   UTF-16 code units   UTF-16 bytes   UTF-16 data
U+FFFF       1                   2              0xFFFF
U+10000      2                   4              0xD800 0xDC00
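
The two example values above can be checked against Python's built-in UTF-16 codec. This is only a quick sketch: the helper name utf16_units is hypothetical, and the big-endian "utf-16-be" variant is used as an assumption so that no byte order mark is added to the output.

```python
import struct
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    # Each code unit is one unsigned 16-bit big-endian value.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


for code_point in (0xFFFF, 0x10000):
    units = utf16_units(chr(code_point))
    print("U+%04X" % code_point, len(units), len(units) * 2,
          " ".join("0x%04X" % u for u in units))
# U+FFFF 1 2 0xFFFF
# U+10000 2 4 0xD800 0xDC00
```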


© Richard McGrath