Character Encoding
Also known as: Unicode, UTF-8, ASCII, text encoding
The standard that maps characters to numbers so text can be stored and transmitted as bits — from ASCII's original 128 code points to Unicode's 150,000 and UTF-8's dominant variable-length encoding.
- Primary domain
- Algorithms & Mathematics
- Sub-category
- Information Theory, Mathematical & Numerical Analysis
In simple terms
Computers store text as numbers. A character encoding is the table that says which number represents which character: 65 = ‘A’, 66 = ‘B’, and so on. The challenge is agreeing on one table that covers every writing system on Earth — Chinese, Arabic, emoji, and all. Unicode is that agreement; UTF-8 is the way to turn Unicode’s code points into actual bytes.
The Visual Map
Character → code point → bytes, for three very different characters:
flowchart LR
A["'A'"] --> CPA["U+0041"] -->|"UTF-8: 1 byte"| BA["41"]
Z["'中'"] --> CPZ["U+4E2D"] -->|"UTF-8: 3 bytes"| BZ["e4 b8 ad"]
E["'😀'"] --> CPE["U+1F600"] -->|"UTF-8: 4 bytes"| BE["f0 9f 98 80"]
More detail
ASCII (1963) mapped 128 characters (the Latin alphabet, digits, punctuation, and control codes) to 7 bits. Enough for English; nowhere near enough for the world. The next decades saw hundreds of incompatible extensions — Latin-1, Windows-1252, Shift-JIS, Big5, ISO-8859 — each covering a region but producing mojibake (garbled text) when files moved between systems.
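To make mojibake concrete, here is a minimal sketch that encodes a string as UTF-8 and then decodes the same bytes with the wrong table. Latin-1 is chosen only for illustration; any mismatched legacy encoding produces a similar effect.

```python
# The same bytes read through two different tables.
text = "café"
raw = text.encode("utf-8")        # b'caf\xc3\xa9': 'é' becomes two bytes

garbled = raw.decode("latin-1")   # wrong table: each byte becomes its own character
restored = raw.decode("utf-8")    # right table: the original text comes back

print(garbled)   # cafÃ©
print(restored)  # café
```

The bytes never changed; only the table used to interpret them did.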
Unicode solved the coordination problem by defining one universal code point for every character: U+0041 for ‘A’, U+4E2D for ‘中’, U+1F600 for ‘😀’. As of Unicode 16.0 it covers over 150,000 characters across 168 scripts. A code point is an integer, not bytes; the encoding determines how that integer is stored:
- UTF-32 — every code point as a fixed 4-byte integer. Simple but wastes space for ASCII-heavy text.
- UTF-16 — 2 bytes for most characters, 4 bytes (a surrogate pair) for rarer ones. Used by Java, JavaScript internals, and Windows APIs.
- UTF-8 — variable width: 1 byte for ASCII (U+0000–U+007F), 2–4 bytes for everything else. The leading bits of the first byte signal the length. UTF-8 is backward-compatible with ASCII and is the dominant encoding on the web (~98% of web pages).
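The size differences between the three encoding forms can be checked directly. The sketch below uses a hypothetical helper, `encoded_sizes`, and Python's little-endian codec names so that no byte-order mark is added to the counts.

```python
# Bytes needed per character in each Unicode encoding form.
# The "-le" codec variants are used so Python does not prepend a BOM.
def encoded_sizes(ch: str) -> dict[str, int]:
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in "A中😀":
    print(repr(ch), encoded_sizes(ch))
# 'A' {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# '中' {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# '😀' {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```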
Important properties of UTF-8:
- Self-synchronising: you can find the start of any character by scanning for a byte whose high bits are 0xxxxxxx or 11xxxxxx; continuation bytes always start 10xxxxxx.
- ASCII-safe: any byte with value < 128 is a single-byte ASCII character; no multi-byte sequence ever contains a byte in that range.
- Not fixed-width: a string’s length in bytes ≠ its number of characters. Python’s len("中") is 1 (code points), but len("中".encode("utf-8")) is 3 (bytes).
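The self-synchronising property can be sketched in a few lines. The helper name `char_start` is hypothetical, and the sketch assumes its input is already valid UTF-8: from any byte offset it steps backwards over continuation bytes until it reaches a lead byte.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i].

    Assumes `data` is valid UTF-8 and 0 <= i < len(data).
    """
    # Continuation bytes look like 10xxxxxx: (byte & 0b1100_0000) == 0b1000_0000.
    while i > 0 and (data[i] & 0b1100_0000) == 0b1000_0000:
        i -= 1
    return i

data = "A中😀".encode("utf-8")   # 41 | e4 b8 ad | f0 9f 98 80
print([char_start(data, i) for i in range(len(data))])
# [0, 1, 1, 1, 4, 4, 4, 4]
```

Because a character is at most 4 bytes, the loop never takes more than three steps on valid input.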
Every program that handles text (web servers, databases, editors, terminals, network protocols) must know which encoding that text is in. Encoding bugs surface in many forms: wrong byte lengths crash or truncate string operations, misidentified encodings corrupt data, and injected malformed bytes can bypass security checks. “Always use UTF-8” is the modern consensus because it is universal, compact, and ASCII-compatible. You still have to know that it is variable-width, what a BOM is (a UTF-8 BOM is the three bytes EF BB BF that some Windows tools prepend; other software may skip it or misread it as content), and why \n vs \r\n line endings are a closely related concern.
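As an illustration of the BOM issue, the sketch below uses Python's standard "utf-8-sig" codec, which writes the BOM when encoding and strips it when decoding. How other languages and tools treat a BOM varies, so take this as one runtime's behaviour rather than a general rule.

```python
import codecs

# Python's "utf-8-sig" codec writes the BOM on encode and strips it on decode.
with_bom = "hi".encode("utf-8-sig")
print(with_bom)                               # b'\xef\xbb\xbfhi'
print(with_bom.startswith(codecs.BOM_UTF8))   # True

# Decoding as plain UTF-8 keeps the BOM as an invisible U+FEFF character.
naive = with_bom.decode("utf-8")
print(len(naive), repr(naive))                # 3 '\ufeffhi'

# Decoding as utf-8-sig removes it (and also accepts input without a BOM).
clean = with_bom.decode("utf-8-sig")
print(len(clean), repr(clean))                # 2 'hi'
```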
Under the Hood
UTF-8’s bit-level scheme — the first byte’s leading bits announce the sequence length:
for ch in "A中😀":
    cp = ord(ch)
    encoded = ch.encode("utf-8")
    pattern = " ".join(f"{b:08b}" for b in encoded)
    print(f"{ch!r} U+{cp:04X} {len(encoded)} byte(s): {pattern}")
# 'A' U+0041 1 byte(s): 01000001
# '中' U+4E2D 3 byte(s): 11100100 10111000 10101101
# '😀' U+1F600 4 byte(s): 11110000 10011111 10011000 10000000
Read the first byte: 0xxxxxxx = standalone ASCII, 110xxxxx = “I start a 2-byte character”, 1110xxxx = “I start a 3-byte character”, 11110xxx = “I start a 4-byte character”, and every continuation byte begins 10. The code point’s bits are simply distributed across the x positions; that’s the whole encoding.
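To show how the code point's bits land in the x positions, here is a hand-written encoder checked against Python's built-in one. The function `utf8_encode` is a hypothetical illustration, not something to use in place of `str.encode`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aé中😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex(" "))
# A 41
# é c3 a9
# 中 e4 b8 ad
# 😀 f0 9f 98 80
```

Each branch shifts the code point right in 6-bit steps and masks with 0x3F, which is the "distribute the bits across the x positions" step written out.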
Engineering Trade-offs
- UTF-8 vs UTF-16 vs UTF-32. UTF-8 is compact for ASCII-heavy text and ASCII-compatible, but East Asian text costs 3 bytes per character (vs 2 in UTF-16). UTF-32 buys O(1) code-point indexing nobody much needs at double-to-quadruple the memory. The web picked UTF-8; Java, C#, and Windows are wedded to UTF-16 by history — and pay for it in surrogate-pair bugs.
- Which “length” do you mean? Bytes (storage, Content-Length), code points (Unicode logic), UTF-16 code units (JavaScript’s .length), or grapheme clusters (what users see; ‘👩‍👩‍👧’ is one grapheme but several code points). Picking the wrong unit truncates emoji mid-character or rejects valid names. Databases and UIs routinely disagree here.
- Validate or trust. Treating bytes as UTF-8 without validation is fast; malformed sequences (overlong encodings, lone surrogates) have been used to smuggle ../ past security filters. Strict validation at the boundary costs a scan and saves a CVE.
- Normalisation. ‘é’ can be one code point (U+00E9) or two (e + combining accent). Comparing or deduplicating user text without NFC/NFD normalisation silently treats equal-looking strings as different, but normalising everywhere costs CPU and can alter byte-faithful data.
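The validation and normalisation points above can be sketched together as a single boundary function. `accept_text` is a hypothetical helper, and choosing NFC is an assumption; whether to normalise at all depends on whether the data must stay byte-faithful.

```python
import unicodedata

def accept_text(raw: bytes) -> str:
    """Decode untrusted bytes strictly, then normalise to NFC.

    Hypothetical helper: raises UnicodeDecodeError on malformed UTF-8.
    """
    text = raw.decode("utf-8", errors="strict")   # strict is also the default
    return unicodedata.normalize("NFC", text)

# Two byte-level spellings of 'é' compare equal only after normalisation.
composed = "\u00e9".encode("utf-8")        # single code point U+00E9
decomposed = "e\u0301".encode("utf-8")     # 'e' + combining acute accent
print(composed == decomposed)                            # False
print(accept_text(composed) == accept_text(decomposed))  # True

# Overlong '/' (c0 af) and a lone surrogate (ed a0 80) are both rejected.
for bad in (b"..\xc0\xaf", b"\xed\xa0\x80"):
    try:
        accept_text(bad)
    except UnicodeDecodeError as exc:
        print("rejected:", bad, "->", exc.reason)
```

Decoding first and filtering afterwards matters: a check for "../" run on raw bytes would not see the overlong form, while the strict decoder refuses it outright.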
Real-world examples
- A web server that doesn’t declare Content-Type: text/html; charset=utf-8 risks browsers guessing Latin-1 or Shift-JIS and rendering garbled text.
- strlen in C counts bytes, not characters: a UTF-8 string of 10 emoji can be 40 bytes.
- MySQL’s historical utf8 charset is actually a broken 3-byte-max encoding that cannot store emoji; utf8mb4 is the correct UTF-8.
- Emoji rendering can combine a base code point with skin-tone modifiers and zero-width joiners, so a single rendered “character” can span 10 or more code points.
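Two of the examples above are easy to check in code. The sketch below uses a hypothetical helper, `needs_four_byte_utf8`, to find the characters a 3-byte-max column could not hold (anything above U+FFFF), and then confirms the byte count for ten emoji.

```python
def needs_four_byte_utf8(text: str) -> list[str]:
    """Return the characters in `text` that take 4 bytes in UTF-8.

    These are the code points above U+FFFF, which a 3-byte-max column cannot hold.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]

print(needs_four_byte_utf8("héllo 中 😀"))   # ['😀']

ten_emoji = "😀" * 10
print(len(ten_emoji), len(ten_emoji.encode("utf-8")))   # 10 40
```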
Common misconceptions
- “Unicode and UTF-8 are the same thing.” Unicode is the code-point standard; UTF-8 is one encoding of it. There are others (UTF-16, UTF-32).
- “String length = character count.” Only in fixed-width encodings. In UTF-8, length in bytes, length in code points, and length in grapheme clusters (user-visible characters) can all differ.
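The different "lengths" can be measured for a single visible emoji. The sketch below covers bytes, code points, and UTF-16 code units; counting grapheme clusters is not attempted here because it needs the Unicode text segmentation rules.

```python
s = "👩\u200d👩\u200d👧"   # woman + ZWJ + woman + ZWJ + girl: one visible family emoji

utf8_bytes = len(s.encode("utf-8"))
code_points = len(s)                            # Python's len() counts code points
utf16_units = len(s.encode("utf-16-le")) // 2   # what JavaScript's .length reports

print(utf8_bytes, code_points, utf16_units)     # 18 5 8
```

One user-visible character, three different numbers, and none of them is 1.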
Try it yourself
Watch one character become three bytes on disk:
printf '中' | od -A x -t x1
# 000000 e4 b8 ad
python3 -c "
s = 'héllo 中 😀'
print('code points:', len(s))
print('utf-8 bytes:', len(s.encode('utf-8')))
print('utf-16 bytes:', len(s.encode('utf-16-le')))
"
Same string, three different “lengths” — every encoding bug you’ll ever debug starts with confusing two of them.
Learn next
- Bits — the raw material every encoding compiles down to.
- Hexadecimal — the notation you’ll read encoded bytes in.
- HTTP: where charset=utf-8 headers decide how the web reads your bytes.