TECHNICAL DEEP DIVE

How Tokenization Works and Why It Matters for AI Product Decisions

By Institute of AI PM · 14 min read · May 3, 2026

TL;DR

Tokenization is how AI models break text into processable units — and it silently shapes your product's cost, latency, context window budget, and multilingual quality. The word "unhappiness" might be one token or three depending on the tokenizer, and that difference compounds across millions of requests. AI PMs who understand tokenization can optimize prompts, predict costs accurately, avoid multilingual quality cliffs, and design features that work within real context window constraints instead of theoretical ones.

How Tokenizers Turn Text Into Numbers

Before an LLM can process any text, it must convert that text into a sequence of integers. Each integer maps to a "token" — a chunk of text that the model treats as a single unit. The tokenizer defines the vocabulary of possible chunks and the rules for splitting any input into those chunks. Different tokenization algorithms produce different splits, different vocabulary sizes, and different trade-offs.

1

Byte Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm in modern LLMs (GPT-4, Claude, Llama). It starts with individual characters, then iteratively merges the most frequent adjacent pairs into new tokens. After training on a large corpus, common words like 'the' become single tokens, while rare words like 'defenestration' get split into subwords like ['def', 'en', 'est', 'ration']. OpenAI's cl100k tokenizer (used in GPT-4) has roughly 100,000 tokens in its vocabulary.
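To make the merge rule concrete, here is a minimal from-scratch sketch of BPE training on a tiny invented corpus. This is not the cl100k tokenizer itself, just the frequency-based merge loop it is built on; the word list and the number of merges are purely illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the toy corpus and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, weighted by frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(1, 6):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step}: {pair[0]!r} + {pair[1]!r}")
print(list(corpus))  # the words, now split into learned subword units
```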

2

WordPiece

Used by Google's BERT and some encoder models. Similar to BPE but uses a likelihood-based criterion to choose which pairs to merge rather than raw frequency. In practice, the resulting tokenizations are similar to BPE. The key difference for PMs: WordPiece tokenizers tend to produce slightly different token counts than BPE for the same text, which means cost estimates from one model family don't transfer directly to another.
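If you want to see that count difference yourself, a quick comparison sketch follows. It assumes the transformers and tiktoken packages are installed and downloads the bert-base-uncased WordPiece tokenizer on first run; the sample sentence is invented, so treat the printed counts as illustrative.

```python
from transformers import AutoTokenizer
import tiktoken

text = "Tokenization quietly shapes cost, latency, and context window budgets."

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece (BERT)
bpe = tiktoken.get_encoding("cl100k_base")                      # BPE (GPT-4 family)

print("WordPiece tokens:", len(wordpiece.tokenize(text)))
print("BPE tokens:      ", len(bpe.encode(text)))
```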

3

SentencePiece

A language-agnostic tokenizer that operates directly on raw text bytes rather than assuming pre-tokenized words. Used by Llama, Mistral, and many multilingual models. Its advantage: it treats spaces, punctuation, and non-Latin scripts uniformly, which makes it better for multilingual applications. It can implement either BPE or a unigram language model internally.

4

How tokenization actually works step by step

The tokenizer receives raw text, normalizes it (lowercasing, unicode normalization depending on config), then applies its trained merge rules greedily from left to right. 'I'm unhappy' might become ['I', "'m", ' un', 'happy'] in one tokenizer or ['I', "'", 'm', ' unhappy'] in another. Each resulting token is mapped to an integer ID, and that sequence of IDs is what the model actually processes. The model never sees your text — it sees numbers.
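You can watch this happen with OpenAI's open-source tiktoken library. The sketch below encodes a short string, prints the integer IDs the model would actually see, and decodes each ID back to its text chunk; the exact splits and IDs depend on the encoding you load, so the output is illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "I'm unhappy"
ids = enc.encode(text)                   # text -> integer token IDs
pieces = [enc.decode([i]) for i in ids]  # each ID mapped back to its text chunk

print(ids)     # the sequence of integers the model processes
print(pieces)  # the chunk boundaries this particular tokenizer chose
assert enc.decode(ids) == text           # encoding and decoding round-trip losslessly
```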

5

Vocabulary size trade-offs

Larger vocabularies (100K+ tokens) mean common words and phrases are single tokens, reducing sequence length and improving efficiency. But larger vocabularies require larger embedding matrices in the model, increasing memory usage. Smaller vocabularies (32K tokens) produce longer sequences for the same text, consuming more context window and increasing cost. Most frontier models have settled on 100K-150K token vocabularies as the practical optimum.
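The memory cost of a bigger vocabulary is simple arithmetic: vocabulary size times embedding dimension times bytes per weight. The hidden size and fp16 precision below are assumptions for illustration, not any specific model's configuration.

```python
def embedding_memory_gb(vocab_size, hidden_dim=8192, bytes_per_weight=2):
    """Approximate memory for the token embedding matrix alone (fp16 assumed)."""
    return vocab_size * hidden_dim * bytes_per_weight / 1e9

for vocab in (32_000, 100_000, 150_000):
    print(f"{vocab:>7}-token vocabulary -> ~{embedding_memory_gb(vocab):.2f} GB of embeddings")
```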

Why Token Count Matters More Than Word Count

Every AI API prices by tokens, not words. Every context window is measured in tokens, not characters. Every latency metric scales with token count. If you're thinking in words, you're estimating wrong — often by 30-40%.

1

The word-to-token ratio is not constant

In English, 1 word averages roughly 1.3 tokens with modern BPE tokenizers. But this ratio varies dramatically by content type. Technical documentation with code snippets can run 1.8 tokens per word. JSON payloads can exceed 2 tokens per word because punctuation characters each consume a token. Conversational English is closer to 1.1 tokens per word. Using a flat '1 word = 1 token' estimate will consistently underestimate your costs.

Example: A 500-word customer support response costs roughly 650 tokens. A 500-word JSON API response containing structured data might cost 900-1,000 tokens. This difference directly impacts your per-request cost by 35-50%.
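Rather than trusting a flat ratio, measure it on your own content. A sketch with tiktoken follows; the prose and JSON samples are invented stand-ins, so swap in real traffic before drawing conclusions.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose": "Thanks for reaching out! Your refund was issued today and should "
             "appear on your statement within five business days.",
    "json": json.dumps({"status": "refunded", "amount": 42.50,
                        "currency": "USD", "eta_days": 5}),
}

for name, text in samples.items():
    words = max(len(text.split()), 1)
    tokens = len(enc.encode(text))
    print(f"{name}: {tokens} tokens / {words} words = {tokens / words:.2f} tokens per word")
```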

2

System prompts are tokenized on every request

Your system prompt is tokenized and billed on every single API call. A 1,000-token system prompt at $3/million input tokens costs $0.003 per request just for the system prompt. At 1 million requests per day, that is $3,000 per day — $90,000 per month — just for the system prompt. Every word in your system prompt has a running cost. This is why prompt optimization is a cost lever, not just an engineering exercise.

Example: Reducing a 1,200-token system prompt to 800 tokens saves 400 tokens per request. At 1 million daily requests with GPT-4o ($2.50/1M input tokens), that saves $1,000 per day, or $30,000 per month.
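That math is worth wiring into a spreadsheet or a small helper like the sketch below; the price per million input tokens changes over time, so treat the $2.50 figure as an example rather than a current quote.

```python
def daily_prompt_cost(prompt_tokens, requests_per_day, price_per_million_input):
    """Dollars per day spent on the system prompt alone."""
    return prompt_tokens * requests_per_day * price_per_million_input / 1_000_000

before = daily_prompt_cost(1_200, 1_000_000, 2.50)  # original prompt
after = daily_prompt_cost(800, 1_000_000, 2.50)     # trimmed prompt
print(f"daily savings from trimming: ${before - after:,.0f}")  # ~$1,000/day in this example
```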

3

Context window usage determines what your product can do

A 128K token context window sounds enormous, but tokens fill up fast. A 10-page PDF might consume 8,000-12,000 tokens. Your system prompt takes 500-2,000 tokens. RAG context takes 1,000-5,000 tokens. Conversation history accumulates linearly. The effective context window available for new generation is always smaller than the advertised maximum — and running near the limit degrades quality and increases latency.

Example: A customer support agent with a 2,000-token system prompt, 3,000 tokens of retrieved knowledge base articles, and 4,000 tokens of conversation history has already consumed 9,000 tokens before the user's latest message. In a 16K context window, only 7,000 tokens remain for the response and any additional context.
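This kind of budget check is worth automating. Here is a sketch using the illustrative figures from the example above; plug in your own measured component sizes.

```python
CONTEXT_WINDOW = 16_000  # advertised maximum for this illustrative model

committed = {
    "system_prompt": 2_000,
    "retrieved_articles": 3_000,
    "conversation_history": 4_000,
}

remaining = CONTEXT_WINDOW - sum(committed.values())
print(f"tokens already committed: {sum(committed.values())}")
print(f"left for the latest user message and the response: {remaining}")
```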

Tokenization's Impact on Cost and Latency

Tokenization isn't just a preprocessing step — it is the primary determinant of both your API cost and your inference latency. Every optimization strategy starts with understanding how token count drives both.

Input tokens drive cost on every request

Most providers price input tokens at roughly a quarter to a half of the output-token rate, but input volume is typically 3-10x higher due to system prompts, context, and conversation history. A 10% reduction in average input token count can therefore cut your total API cost by 5-8% at scale. Audit your system prompts, trim retrieved context, and truncate conversation history strategically.

Output tokens drive latency

LLMs generate tokens sequentially — each output token requires a forward pass through the model. Time-to-last-token scales linearly with output length. A 500-token response takes roughly 5x longer than a 100-token response. If your feature doesn't need long responses, constrain max_tokens to reduce latency and cost simultaneously.
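Capping output length is usually a one-line change. The sketch below uses the OpenAI Python SDK and assumes an API key is configured; the model name and the cap are illustrative, and other providers expose an equivalent parameter.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Summarize this support ticket in two sentences: ..."}],
    max_tokens=120,       # hard cap on output tokens -> bounded latency and cost
)
print(response.choices[0].message.content)
```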

Token-inefficient formats multiply costs

JSON, XML, and markdown with heavy formatting produce significantly more tokens than plain text for the same information content. A structured JSON response might use 3x the tokens of the same data as plain text. If your downstream system can parse unstructured or minimally structured output, you can save substantially on output token costs.
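You can quantify the gap on your own payloads with a few lines of tiktoken; the record and field names below are invented, and the delimited format assumes the consuming system knows the field order.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record = {"order_id": "A-1042", "status": "shipped", "carrier": "UPS", "eta": "2026-05-07"}

as_json = json.dumps(record, indent=2)
as_delimited = "|".join(str(v) for v in record.values())  # schema agreed out of band

print("JSON tokens:     ", len(enc.encode(as_json)))
print("delimited tokens:", len(enc.encode(as_delimited)))
```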

Tokenization overhead in streaming

When streaming responses, each token is delivered as it's generated. The tokenizer's granularity determines the minimum unit of streaming output. Very short tokens (single characters, punctuation) can create overhead in streaming infrastructure without meaningful user-visible progress. This is why streaming sometimes appears to 'stutter' on certain types of content.

Multilingual Tokenization Challenges

Tokenizers are trained predominantly on English text. This creates a systemic bias: non-English text is tokenized less efficiently, consuming more tokens for the same semantic content. For AI products serving global users, this has direct cost and quality implications.

Token inflation in non-Latin scripts

Chinese, Japanese, Korean, Arabic, and Hindi text can require 2-4x more tokens per semantic unit compared to English. A 100-word English passage might use 130 tokens, while the equivalent Chinese passage uses 250-350 tokens. This means your non-English users are effectively getting a smaller context window and paying more per interaction, even though they're sending the same amount of information.

Quality degradation at the tokenization level

When a tokenizer splits a word into many small subword tokens, the model has to reconstruct meaning across those fragments. This additional reconstruction step can reduce output quality for rare or morphologically complex languages. Turkish, Finnish, and Hungarian — languages with extensive agglutination — are particularly affected. The model isn't 'worse' at these languages; it's working with a less efficient representation.

Inconsistent pricing across languages

Because API pricing is per-token, the same task costs different amounts in different languages. A customer support interaction in Japanese might cost 2.5x what the identical interaction costs in English — not because the model does more work, but because the tokenizer produces more tokens. If your product serves multiple language markets, your cost model must account for per-language token inflation.

Testing tokenization before launching in new markets

Before expanding your AI product to a new language market, run representative text samples through the tokenizer and compare token counts against your English baseline. If the ratio exceeds 2x, you may need to adjust pricing, reduce system prompt length, or increase context window allocation for that language. Use the model provider's tokenizer tool (like OpenAI's tiktoken) to get exact counts.
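A pre-launch check can be a short script. The sketch below compares invented Japanese and Chinese samples against an English baseline with tiktoken; replace the strings with representative text from your own product surfaces, and adjust the 2x threshold to match your cost model.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

baseline = "Your order has shipped and should arrive within three business days."
candidates = {
    "ja": "ご注文の商品は発送されました。3営業日以内にお届けする予定です。",
    "zh": "您的订单已发货，预计三个工作日内送达。",
}

base_tokens = len(enc.encode(baseline))
for lang, text in candidates.items():
    ratio = len(enc.encode(text)) / base_tokens
    flag = "review pricing and prompt budgets" if ratio > 2 else "within budget"
    print(f"{lang}: {ratio:.1f}x the English token count -> {flag}")
```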

Token-Aware Product Design Patterns

Understanding tokenization unlocks specific design patterns that reduce cost, improve latency, and create better user experiences. These are the practical patterns every AI PM should know.

1

Implement token budgets per feature

Assign explicit token budgets to each component of your AI pipeline: system prompt (max 800 tokens), retrieved context (max 2,000 tokens), conversation history (max 3,000 tokens), output (max 1,000 tokens). Monitor actual usage against these budgets. When a component consistently exceeds its budget, optimize it — don't just increase the context window. Token budgets create accountability and prevent cost creep.
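Enforcement can start as a simple lookup plus a warning, as in this sketch; the component names and limits mirror the illustrative budgets above, and the tokenizer should match whatever model you actually call.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

BUDGETS = {"system_prompt": 800, "retrieved_context": 2_000,
           "conversation_history": 3_000, "output": 1_000}

def check_budget(component: str, text: str) -> int:
    """Count tokens for one pipeline component and warn when it exceeds its budget."""
    tokens = len(enc.encode(text))
    if tokens > BUDGETS[component]:
        print(f"WARNING: {component} is {tokens} tokens (budget {BUDGETS[component]})")
    return tokens

check_budget("system_prompt", "You are a support agent for Acme. Be concise...")
```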

2

Use conversation summarization for long sessions

Instead of sending full conversation history, periodically summarize older messages into a compressed representation. A 20-turn conversation might accumulate 8,000 tokens of history. Summarizing turns 1-15 into a 500-token summary cuts that history to roughly 2,500 tokens, about a 70% reduction, while preserving essential context. Implement this as an automatic background process triggered when conversation history exceeds a threshold.
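A minimal version of the trigger looks like the sketch below. Here summarize() stands in for a call to a small, cheap model and is an assumption rather than a specific provider API; the five-message tail and 3,000-token threshold are illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
HISTORY_BUDGET = 3_000  # illustrative threshold

def history_tokens(messages):
    return sum(len(enc.encode(m["content"])) for m in messages)

def compact_history(messages, summarize):
    """Replace all but the most recent turns with a single summary message."""
    if history_tokens(messages) <= HISTORY_BUDGET:
        return messages
    older, recent = messages[:-5], messages[-5:]
    summary = summarize(older)  # e.g. a cheap model prompted to compress the older turns
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```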

3

Design output format for token efficiency

If your UI displays structured information, decide whether the model should output JSON (token-expensive but easy to parse) or a minimal structured format (token-efficient but requires custom parsing). For internal AI features where the output feeds another system, consider using delimited plain text instead of JSON — it can reduce output tokens by 40-60% with no loss of information.

4

Build tokenization into your analytics pipeline

Track token counts per request alongside your standard product metrics. Correlate token usage with user satisfaction, task completion rate, and revenue. You may discover that your highest-cost requests (long context, long output) don't correspond to your highest-value user interactions — creating an opportunity to optimize without affecting the user experience. Add token-per-session and cost-per-session to your product dashboard.

5

Pre-compute and cache token counts for static content

System prompts, few-shot examples, and knowledge base documents don't change between requests. Pre-compute their token counts so you can accurately predict per-request cost without calling the tokenizer at runtime. This also lets you optimize: if a knowledge base article tokenizes to 3,000 tokens but you only have a 2,000-token retrieval budget, you know in advance to summarize it or split it.
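The precomputation itself is a few lines run at build or ingest time, as in this sketch; the asset texts and the output file name are placeholders for your real prompts and knowledge base documents.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

static_assets = {
    "system_prompt": "You are a support agent for Acme. Be concise and cite policy when relevant.",
    "kb_returns_policy": "Customers may return items within 30 days of delivery for a full refund.",
}

token_counts = {name: len(enc.encode(text)) for name, text in static_assets.items()}
print(token_counts)

# Persist next to the assets so request-time cost estimates never re-tokenize them.
with open("token_counts.json", "w") as f:
    json.dump(token_counts, f, indent=2)
```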

Turn Token Literacy Into Product Advantage

The AI PM Masterclass teaches you to think in tokens, optimize costs, and make technical architecture decisions that ship better AI products. Taught by a Salesforce Sr. Director PM.