How Forgemoji Works
Every emoji generated by Forgemoji goes through the same five-stage pipeline, regardless of whether you use the simple "pick two emojis" mode or the photo-to-emoji mode. The whole thing usually takes 10 to 20 seconds, with the majority of that spent waiting on a single image generation call.
We are deliberately transparent about how it works, because the model is not magic. Most failures are predictable once you know what the pipeline is doing, and once you know that, the right prompt choice is obvious.
The pipeline, step by step
- Prompt assembly
Two emojis are passed to a prompt builder. The builder looks up the official Unicode short name for each emoji, then composes a single English prompt that combines both. If you switched the input language in the picker, the prompt is translated first using a small in-house lookup table (no LLM call for translation).
- Image generation
The prompt goes to a text-to-image model. The output is 1024×1024. We run a primary and a secondary model with prompt-driven routing; we do not expose which model served a given request.
- Background removal
A separate rembg model strips the background. The rembg step runs on its own GPU worker, so it can overlap with the next request.
- Resize and format
The clean, transparent 1024×1024 image is resized to 256×256 using Lanczos resampling. A 256×256 PNG is what most chat platforms expect for custom emoji.
- Optional animation
If you toggled on the animated export, the 256×256 PNG is run through a small motion-synthesis model that produces a 24-frame loop at 12 fps. We support 6 animation styles: bob, sway, sparkle, glow, wiggle, and pulse. Animation adds 8 to 12 seconds on top of the base generation time.
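The prompt assembly step above can be sketched roughly as follows. This is a minimal illustration, not the production builder: it uses the Unicode character name from Python's `unicodedata` module as a stand-in for the official short name, and the `emoji_name` and `build_prompt` functions and the prompt template are hypothetical.

```python
import unicodedata

# Code points that only affect presentation or join sequences. They add
# nothing descriptive to a prompt, so this sketch skips them.
VARIATION_SELECTOR_16 = 0xFE0F
ZERO_WIDTH_JOINER = 0x200D
_SKIPPED = {VARIATION_SELECTOR_16, ZERO_WIDTH_JOINER}


def emoji_name(emoji: str) -> str:
    """Build a lowercase name from the Unicode names of the emoji's code points."""
    names = [unicodedata.name(ch).lower() for ch in emoji if ord(ch) not in _SKIPPED]
    if not names:
        raise ValueError("no nameable code points in input")
    return " ".join(names)


def build_prompt(subject: str, modifier: str) -> str:
    """Compose a single English prompt from two emoji (hypothetical template)."""
    return f"kawaii emoji of a {emoji_name(subject)} combined with a {emoji_name(modifier)}"


print(build_prompt("\N{CAT FACE}", "\N{TOP HAT}"))
```

A real builder would likely use curated short names rather than raw character names, which can be verbose for some emoji.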
Under the hood
Here are a few more details that are easy to skip over but matter for understanding what the pipeline is and isn't doing. If you only care about the high-level flow, the previous section is enough. If you want to understand the failure modes (which is the only way to pick good prompts), read this section. See the about page for team context.
The text-to-image model is not a search engine. It does not "know" what a given emoji pair looks like by retrieving an image; it samples pixels conditioned on the prompt. Two identical requests produce different outputs because the latent noise seed is different.
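The role of the seed can be illustrated with a toy example. The sketch below is not the service's sampler; it only shows that the starting noise is fully determined by the seed, so a fresh seed per request yields a different starting point even when the prompt is identical. The `sample_latent_noise` function and the seed values are made up for illustration.

```python
import random


def sample_latent_noise(seed: int, size: int = 4) -> list:
    """Draw a small vector of Gaussian noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = sample_latent_noise(seed=1234)
same_b = sample_latent_noise(seed=1234)
other = sample_latent_noise(seed=5678)

print(same_a == same_b)  # True: same seed, same starting noise
print(same_a == other)   # False: different seed, different starting noise
```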
The rembg model is trained on natural images. It works on kawaii-style emoji because they have well-defined silhouettes, but it is not perfect: very thin outlines sometimes get partially eaten. If you see a strange semi-transparent halo, regenerate.
The pipeline runs as a sequence of independent GPU workers. There is no batch queue in the user-facing path; every request gets a fresh allocation. The rembg worker is the bottleneck during peak hours.
We do not store the generated emoji. The image is returned to your browser as a base64-encoded PNG inside the JSON response. The original prompt and the model output are logged for 30 days for abuse detection, then deleted.
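Reading the image out of such a response could look like the sketch below. The JSON field name `image` is an assumption (the actual response schema is not documented here), and the sketch assumes a bare base64 string rather than a `data:` URL.

```python
import base64
import json

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(response_body: str, field: str = "image") -> bytes:
    """Decode the base64 PNG carried in a JSON response body.

    The field name is hypothetical; adjust it to the real response.
    """
    payload = json.loads(response_body)
    png_bytes = base64.b64decode(payload[field], validate=True)
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return png_bytes


# Usage with a stand-in response body (not a real, complete PNG).
fake_body = json.dumps(
    {"image": base64.b64encode(PNG_SIGNATURE + b"...").decode("ascii")}
)
print(len(extract_png(fake_body)))
```

Because the image is not stored on the server, saving these bytes on the client is the only way to keep a result.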
Free tier traffic and Pro tier traffic share the same GPU pool. Pro is faster in practice during peak hours only because we shed free-tier load first when the queue backs up.
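The shedding behavior can be sketched as below. This is an illustration of the idea only: the `Request` type, the `capacity` threshold, and the choice to drop the newest free-tier requests first are assumptions, not the production policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    tier: str  # "free" or "pro"


def shed_load(queue, capacity):
    """Return (kept, shed) for a queue ordered oldest to newest.

    Free-tier requests are dropped first, newest first, until the queue
    fits within capacity. Pro requests are never dropped in this sketch,
    so the queue can stay over capacity if it holds too many of them.
    """
    overflow = len(queue) - capacity
    if overflow <= 0:
        return list(queue), []
    free_newest_first = [r for r in reversed(queue) if r.tier == "free"]
    shed = free_newest_first[:overflow]
    shed_ids = {r.request_id for r in shed}
    kept = [r for r in queue if r.request_id not in shed_ids]
    return kept, shed


queue = [Request("a", "free"), Request("b", "pro"), Request("c", "free"), Request("d", "pro")]
kept, shed = shed_load(queue, capacity=3)
print([r.request_id for r in kept], [r.request_id for r in shed])
```

Both tiers draw from the same pool in this model; Pro requests only fare better once the queue exceeds capacity.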
The picker UI is not part of the generation pipeline. It runs entirely in your browser, persists the last 50 generations in localStorage, and posts the chosen pair to the API.
Tips for better results
- Pick the second emoji carefully. The first emoji is the "subject": it sets the body shape, the dominant color, and the face. The second emoji is the "modifier": it contributes small features (a hat, a prop, a pattern). Pairs where both are equally important tend to come out muddy.
- Avoid abstract or contradictory pairs. With such pairs the model will not know which feature to emphasize, so the result will look like a blurry average. If the pair feels like a bad joke, the model will turn it into a bad visual.
- Regenerate aggressively. The 4th or 5th attempt on the same pair is usually the best. The model's randomness is a feature, not a bug. Treat the first 2-3 generations as warm-up.
- Use the photo mode for face emoji. The picker is strongest on pairs where both emoji have well-defined kawaii references in the training data.
Known limitations
The biggest limitation is text rendering. The model cannot spell words, so an emoji pair that includes 🔤 or 🅰️ will produce shapes that look vaguely like letters but are not legible.
Skin tone modifiers (the Fitzpatrick scale variants) are not handled. The model picks whatever skin tone its latent prior likes, and you cannot influence it from the picker.
Country flags (🏳️‍🌈 etc.) are also not handled. We tried prompt-side workarounds and they all produced worse results than just not generating.
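Since skin tone modifiers and flags are not handled, a client may want to detect them before sending a request. The sketch below checks code point ranges defined by Unicode: the Fitzpatrick modifiers U+1F3FB to U+1F3FF, the regional indicator symbols U+1F1E6 to U+1F1FF that make up country flags, and the two waving-flag base characters. The `unsupported_reason` function is hypothetical and is not part of the picker.

```python
from typing import Optional

SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)  # U+1F3FB..U+1F3FF
REGIONAL_INDICATORS = range(0x1F1E6, 0x1F200)  # U+1F1E6..U+1F1FF
FLAG_BASES = {0x1F3F3, 0x1F3F4}  # waving white flag, waving black flag


def unsupported_reason(emoji: str) -> Optional[str]:
    """Return why an emoji falls under a known limitation, or None if it does not."""
    code_points = [ord(ch) for ch in emoji]
    if any(cp in SKIN_TONE_MODIFIERS for cp in code_points):
        return "skin tone modifier"
    if any(cp in REGIONAL_INDICATORS or cp in FLAG_BASES for cp in code_points):
        return "flag"
    return None


print(unsupported_reason("\U0001F44B\U0001F3FD"))  # waving hand + medium skin tone
print(unsupported_reason("\U0001F1EB\U0001F1F7"))  # regional indicators F + R
print(unsupported_reason("\U0001F431"))            # cat face
```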
A note for engineers
We do not currently publish an SDK. The endpoint is small enough to call directly with fetch. We will publish one when the API stabilizes.
Privacy and data
The pipeline does not write your emoji to disk. The prompt you sent and the response are logged separately for 30 days, then deleted. The full data handling rules are in the privacy policy.
Lois Chen, Engineering & Pipeline
Reviewed May 2, 2026
How we wrote this: This page documents the v2 pipeline as of May 2026. Earlier pipeline versions (Qwen-Image-2510 + rembg-1.4) behaved differently on hairline edges and are no longer in production. We update this page when the pipeline changes, not on a fixed schedule.
Sources: Internal pipeline documentation and changelog.