What Every Content Creator Should Know About Emoji Encoding

Emoji are text, not images. They have codepoints, surrogates, ZWJ sequences, and variation selectors. Most content creators treat them as opaque glyphs. The encoding matters, and the bugs are real.

Forgemoji Editorial·Emoji culture researchers + platform-specific guides writers

Published June 20, 2026·Reviewed by The Forgemoji editorial team·8 min read


Most content creators treat emoji as opaque images. You click the picker, you find the right face, you send it, and the recipient device renders it however it renders it. This works in 95% of cases. The other 5% — the mojibake, the broken clusters, the question marks in place of emoji, the platform-specific renderings — happen because emoji are text, and the encoding matters.

Emoji are text, not images

The most important fact about emoji is that they are text, not images. Each emoji has a Unicode codepoint (a number assigned by the Unicode Consortium), and the codepoint is stored as part of the text stream. When you send a message with 😂, the message contains the codepoint U+1F602, which the recipient device then renders using its own glyph.

The same codepoint renders differently on different platforms because each platform draws its own glyph. Apple, Google, Samsung, and Microsoft each ship their own emoji font, and some apps bundle their own set (Twitter's open-source Twemoji, for example), so 😂 can look noticeably different from one device to the next. The codepoint is the same; the drawing is platform-specific. This is the same model as the Latin alphabet: the codepoint for "A" is U+0041, and the way the "A" looks depends on the font.

The basics: codepoints and UTF-8

Every Unicode character has a codepoint, written in hex with a U+ prefix. Most emoji codepoints are in the Supplementary Multilingual Plane (SMP), which means they are above U+FFFF; a minority, such as the heart at U+2764, sit in the Basic Multilingual Plane below that boundary. This is where the encoding gets interesting.

In UTF-8 (the dominant encoding on the web), each codepoint is encoded as 1-4 bytes. Emoji codepoints above U+FFFF take 4 bytes each; the few below U+FFFF take 3. In UTF-16 (the encoding used by JavaScript strings and Windows), codepoints above U+FFFF are encoded as surrogate pairs: two 16-bit code units that together represent a single codepoint. This is why 😂 is 2 JavaScript "characters" long, not 1: "😂".length === 2 in JavaScript. This trips up a lot of front-end code that does character counting.
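The byte and code-unit counts described above can be checked directly. The following is a minimal sketch, assuming a runtime with the standard TextEncoder global (modern browsers and Node.js); the helper name measure is hypothetical and not part of any library.

```javascript
// Measure one string three ways: codepoints, UTF-8 bytes, UTF-16 code units.
// `measure` is an illustrative helper name, not a standard API.
function measure(text) {
  return {
    // Spreading a string iterates by codepoint, not by UTF-16 code unit.
    codepoints: [...text].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
    // TextEncoder always encodes to UTF-8.
    utf8Bytes: new TextEncoder().encode(text).length,
    // String.prototype.length counts UTF-16 code units.
    utf16Units: text.length,
  };
}

console.log(measure("A"));         // codepoints ["U+0041"],  utf8Bytes 1, utf16Units 1
console.log(measure("\u00E9"));    // codepoints ["U+00E9"],  utf8Bytes 2, utf16Units 1
console.log(measure("\u{1F602}")); // codepoints ["U+1F602"], utf8Bytes 4, utf16Units 2
```

Passing any emoji from your own content through a helper like this is a quick way to see how many codepoints it really contains.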

Common emoji codepoints

Emoji	Codepoint	UTF-8 bytes	UTF-16 code units
😂	U+1F602	4	2
❤️	U+2764 U+FE0F	6	2
👨‍👩‍👧	U+1F468 U+200D U+1F469 U+200D U+1F467	18	8
🇺🇸	U+1F1FA U+1F1F8	8	4

The last row is interesting. 🇺🇸 (Flag of United States) is two regional indicator codepoints (U+1F1FA for "U" and U+1F1F8 for "S") that combine to form the flag glyph. There are no codepoints for the flags themselves; the renderer derives each flag from the pair. A platform that supports the pair draws 🇺🇸, and one that does not falls back to showing the two letter symbols side by side. Either way, the underlying data is two codepoints, not one.
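The mapping from a two-letter country code to a flag is plain arithmetic: the regional indicator symbols run from U+1F1E6 (letter A) to U+1F1FF (letter Z), in alphabetical order. A rough sketch of that mapping follows; the function name flagFromCountryCode is hypothetical, and whether the result displays as a flag still depends on the rendering platform.

```javascript
// REGIONAL INDICATOR SYMBOL LETTER A; the other 25 letters follow in order.
const REGIONAL_INDICATOR_A = 0x1f1e6;

// Illustrative helper: turn "US" into the two regional indicator codepoints.
function flagFromCountryCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("expected a two-letter country code");
  }
  const points = [...code.toUpperCase()].map(
    // "A" is char code 65, so the offset from "A" selects the indicator.
    (letter) => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - 65)
  );
  return String.fromCodePoint(...points);
}

const usFlag = flagFromCountryCode("US");
console.log(usFlag);                                // two codepoints: U+1F1FA U+1F1F8
console.log([...usFlag].length);                    // 2
console.log(usFlag.length);                         // 4 (UTF-16 code units)
```

The helper does not check that the code is an assigned country; an unassigned pair simply has no flag glyph to draw.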

Why some emoji are multiple codepoints

Some emoji are single codepoints (😂 is just U+1F602). Some are sequences. 👨‍👩‍👧 (Family: Man, Woman, Girl) is five codepoints: three person emoji joined by two zero-width joiners (U+200D). In order, they are U+1F468 (man) + U+200D (ZWJ) + U+1F469 (woman) + U+200D (ZWJ) + U+1F467 (girl). The ZWJ is an instruction to the renderer: bind the codepoints on either side of me into a single glyph.

This is a powerful mechanism. Anyone can assemble a family of a different composition, such as 👨‍👩‍👧‍👦 (two parents, two kids) or 👨‍👨‍👧 (two dads, one daughter), by placing the codepoints in sequence. Whether the result displays as a single glyph depends on whether the rendering platform knows that ZWJ sequence. Most modern platforms support the sequences Unicode recommends, but the rendering can vary, and older platforms or unsupported combinations fall back to showing the component emoji as separate glyphs.
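Because a ZWJ sequence is ordinary text, it can be assembled and taken apart with plain string operations. The sketch below assumes nothing beyond standard JavaScript; joinWithZwj and splitZwjSequence are hypothetical helper names.

```javascript
const ZWJ = "\u200D"; // ZERO WIDTH JOINER

// Illustrative helpers: glue emoji together with ZWJ, and split them again.
function joinWithZwj(...parts) {
  return parts.join(ZWJ);
}

function splitZwjSequence(sequence) {
  return sequence.split(ZWJ);
}

// Man + ZWJ + woman + ZWJ + girl.
const family = joinWithZwj("\u{1F468}", "\u{1F469}", "\u{1F467}");

console.log(family);                          // one glyph where the platform supports it
console.log([...family].length);              // 5 codepoints
console.log(family.length);                   // 8 UTF-16 code units
console.log(splitZwjSequence(family).length); // 3 component emoji
```

Splitting is also a cheap way to produce a fallback: if a target platform is known not to support a sequence, the components can be shown or described individually.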

Surrogate pairs and JavaScript

The surrogate pair issue is the source of a lot of emoji bugs in JavaScript. JavaScript strings are UTF-16, and codepoints above U+FFFF are stored as two 16-bit code units (a high surrogate and a low surrogate). This means "😂".length === 2, not 1, because the string contains two UTF-16 code units.

The most common bug is character counting. A naive message.length returns the count of UTF-16 code units, not the count of user-perceived characters. A message with 5 single-codepoint emoji such as 😂 and 10 ASCII characters will report a length of 20, not 15. There are two fixes, and they count different things. The spread operator ([...message].length) counts codepoints, so it gets 😂 right but still counts a ZWJ family as 5. The Intl.Segmenter API counts grapheme clusters, which is the closest match to what a user perceives as one character.
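The three ways of counting can be compared side by side. This is a minimal sketch, assuming a runtime that ships Intl.Segmenter (current browsers and recent Node.js versions); countLengths is a hypothetical helper name.

```javascript
// Illustrative helper: report the length of a string at three levels.
function countLengths(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16Units: text.length,                         // what .length reports
    codepoints: [...text].length,                    // spread iterates by codepoint
    graphemes: [...segmenter.segment(text)].length,  // user-perceived characters
  };
}

// "hi " followed by the man + ZWJ + woman + ZWJ + girl family sequence.
const message = "hi \u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(countLengths(message));
// utf16Units: 11, codepoints: 8, graphemes: 4
```

For a character counter shown to users, the grapheme count is usually the one to display; the other two matter when a storage or protocol limit is defined in code units or codepoints.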

Variation selectors

Variation selectors are codepoints that change the rendering of the preceding codepoint. The most common is U+FE0F (VARIATION SELECTOR-16), which forces the emoji style of a character that also has a text style.

The clearest example is the heart. ❤ (U+2764) renders as a text-style heart glyph by default — a heavy, dark red heart that looks like a printer dingbat. ❤️ (U+2764 U+FE0F) is the same heart codepoint followed by VS-16, which forces the emoji style — a bright red, glossy heart glyph. The two render very differently, and the difference is one codepoint.

This is why choosing the heart from your emoji picker can produce a different result than entering the bare codepoint by hand. The picker typically appends the variation selector automatically. The bare codepoint, typed by hand, does not carry one.
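Since the selector is a real codepoint, the two hearts are different strings, which matters for comparison, search, and deduplication. The sketch below shows one way to add or ignore the selector. It assumes a single base character as input; the helper names are hypothetical, and U+FE0E (VARIATION SELECTOR-15, the text-style counterpart of VS-16) is included for completeness even though it is not discussed above.

```javascript
const VS15 = "\uFE0E"; // VARIATION SELECTOR-15: request text style
const VS16 = "\uFE0F"; // VARIATION SELECTOR-16: request emoji style

// Illustrative helpers, intended for a single base character such as U+2764.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE0E\uFE0F]/g, "");
}

function requestEmojiStyle(base) {
  return stripVariationSelectors(base) + VS16;
}

function requestTextStyle(base) {
  return stripVariationSelectors(base) + VS15;
}

// Compare two strings while ignoring which presentation was requested.
function sameIgnoringStyle(a, b) {
  return stripVariationSelectors(a) === stripVariationSelectors(b);
}

const bareHeart = "\u2764";
const emojiHeart = requestEmojiStyle(bareHeart);

console.log(bareHeart === emojiHeart);                 // false
console.log(sameIgnoringStyle(bareHeart, emojiHeart)); // true
console.log(emojiHeart.length);                        // 2
```

Stripping selectors is only appropriate for comparison keys; the stored text should keep whatever the author entered.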

Skin tone modifiers

The Fitzpatrick scale emoji modifiers (U+1F3FB to U+1F3FF) change the skin tone of an emoji that supports it. 🏋🏻 (Weight Lifter with light skin tone) is the combination U+1F3CB (Weight Lifter) + U+1F3FB (light skin tone). The modifier directly follows the base codepoint and takes the place of the VS-16 that the unmodified 🏋️ (U+1F3CB U+FE0F) carries. Without the modifier, the default yellow emoji is used. With the modifier, a specific skin tone is rendered.

The important thing to know is that skin tone modifiers only work on emoji that are explicitly designed to accept them. Adding a skin tone modifier to an emoji that does not support it does not recolor the emoji; the modifier is typically displayed as a separate color swatch next to it, and platforms handle such combinations inconsistently. The Forgemoji generator follows the Unicode spec: supported combinations get a skin tone, unsupported combinations get the default yellow.
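One way to apply a modifier only where it is supported is to test the base against the Unicode Emoji_Modifier_Base property, which JavaScript regular expressions expose through property escapes with the u flag. The following is a sketch, not a description of how the Forgemoji generator works: applySkinTone and the tone names are hypothetical, and it handles single-codepoint bases only (not ZWJ sequences).

```javascript
// The five Fitzpatrick modifiers, U+1F3FB through U+1F3FF.
const SKIN_TONES = new Map([
  ["light", 0x1f3fb],
  ["mediumLight", 0x1f3fc],
  ["medium", 0x1f3fd],
  ["mediumDark", 0x1f3fe],
  ["dark", 0x1f3ff],
]);

// Illustrative helper: returns the base unchanged when it cannot take a modifier.
function applySkinTone(base, tone) {
  const modifier = SKIN_TONES.get(tone);
  if (modifier === undefined) {
    throw new RangeError("unknown skin tone: " + tone);
  }
  // Drop VS-16 first: the modifier takes its place in the sequence.
  const bare = base.replace(/\uFE0F/g, "");
  if (!/^\p{Emoji_Modifier_Base}$/u.test(bare)) {
    return base; // unsupported base (or a multi-codepoint sequence)
  }
  return bare + String.fromCodePoint(modifier);
}

const lifter = applySkinTone("\u{1F3CB}\uFE0F", "light");
console.log([...lifter].length);                              // 2 codepoints
console.log(lifter === "\u{1F3CB}\u{1F3FB}");                 // true
console.log(applySkinTone("\u{1F602}", "dark") === "\u{1F602}"); // true: 😂 takes no modifier
```

Which codepoints count as modifier bases depends on the Unicode version built into the runtime, so very new emoji may be reported as unsupported on older engines.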

Common bugs and how to avoid them

•Mojibake in databases. Storing emoji in a database that expects ISO-8859-1 or other pre-Unicode encodings will corrupt the data. Use UTF-8 throughout the stack.
•Length calculations. A naive character count (UTF-16 code units) overestimates user-perceived length for emoji-heavy content. Use grapheme clusters.
•Search indexing. If your search engine tokenizes on whitespace, the emoji will be indexed as part of the surrounding text. Most modern search engines (Elasticsearch, OpenSearch, Algolia) handle emoji correctly with the right tokenizers.
•Accessibility. Screen readers announce the CLDR short name. For custom emoji or AI-generated emoji, the short name may be wrong or missing. Test with screen readers.
•DB storage limits. In MySQL, VARCHAR(255) counts characters (codepoints), not bytes and not user-perceived characters. With utf8mb4 encoding, each character can take up to 4 bytes, so a full column can occupy up to 1020 bytes, and a single ZWJ family emoji uses 5 of the 255 characters. MySQL's legacy utf8 (utf8mb3) character set cannot store 4-byte emoji at all. Plan accordingly.
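The length and storage items above interact when text has to be cut to fit a limit: slicing by code units can split a surrogate pair, and slicing by codepoints can split a ZWJ sequence. A hedged sketch of a safer truncation follows. It assumes the limit is expressed in codepoints (as with a MySQL VARCHAR column) and that Intl.Segmenter is available; truncateToCodepointBudget is a hypothetical helper name.

```javascript
// Illustrative helper: keep whole grapheme clusters while staying within
// a budget measured in codepoints.
function truncateToCodepointBudget(text, maxCodepoints) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(text)) {
    const cost = [...segment].length; // codepoints in this cluster
    if (used + cost > maxCodepoints) break; // never keep part of a cluster
    result += segment;
    used += cost;
  }
  return result;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 5 codepoints
const text = "ab" + family;

console.log(truncateToCodepointBudget(text, 4) === "ab"); // true: family dropped whole
console.log(truncateToCodepointBudget(text, 7) === text); // true: everything fits

// A naive slice by code units can leave half of a surrogate pair behind.
console.log("ab\u{1F602}".slice(0, 3) === "ab\uD83D");    // true: lone high surrogate
```

If the limit is in bytes instead, the same loop works with the cost computed as new TextEncoder().encode(segment).length.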

Practical takeaways

•Always use UTF-8 throughout your stack. There is no good reason to use a pre-Unicode encoding in 2026.
•For length-sensitive code (database columns, character counters), use grapheme clusters or codepoints, not UTF-16 code units.
•Test emoji rendering on the platforms that matter to your users. iOS, Android, and Windows render differently.
•For ZWJ sequences, test on older devices. The rendering was inconsistent in 2018 and is not yet consistent in 2026.
•Variation selectors matter. If you want the emoji style, add VS-16. If you do not, leave it off.

Forgemoji outputs a clean transparent PNG. The encoding is your problem, not ours — but we have a deep-dive on ZWJ sequences in the engineering notes.
