Extended Thinking in Production: Engineering Reasoning Models at Scale
TL;DR
Extended thinking gives Claude and o3 a dedicated reasoning pass before producing a final answer. For Claude, you control it via budget_tokens (10K–100K) and can read the thinking chain. For OpenAI o3, you set reasoning_effort (low/medium/high). The mechanics matter for product decisions: thinking tokens are billed, latency grows with budget, and the quality gain is task-specific. This guide covers how to configure both APIs, the cost and latency math, routing logic for mixed-complexity workloads, and async production patterns that make extended thinking viable without wrecking your UX.
The AI PM Minute
One tactic to make you a sharper AI PM, twice a week. 60 seconds to read. Free.
No fluff. Unsubscribe anytime.
What Extended Thinking Actually Is
Standard language model inference is a single forward pass: input goes in, output comes out. Extended thinking inserts a dedicated reasoning step before the final output. The model works through the problem privately, checks its own logic, considers alternatives, and only commits to an answer once it has reasoned through the solution. This is test-time compute in practice: spend more GPU cycles at inference time to get a better answer, rather than relying purely on what was baked in during training.
For Claude (Sonnet 4.6, Opus 4.8, and newer models), the thinking chain is visible in the API response as a thinking content block. You can log it, display it in debugging interfaces, or use it as an audit trail. For OpenAI o3 and o4, the reasoning is hidden: you see the final answer but not the intermediate reasoning tokens. The approach differs, but the underlying mechanism is the same: the model allocates additional compute to reason before committing.
Thinking tokens
Tokens generated during the reasoning pass that are not shown in the final output (or are shown separately). For Claude, these are billed at the same rate as input tokens. For o3, they are billed separately as reasoning tokens.
Output tokens
The final visible response after reasoning is complete. This is what the user or downstream system actually receives. Output token count is typically much smaller than thinking token count on hard tasks.
Budget vs. actual
Budget tokens set a ceiling on how much the model can think, not a floor. On simple subtasks the model may use far fewer than the budget. On genuinely hard problems it will approach the ceiling.
Streaming thinking
Claude supports streaming thinking tokens as they are generated. This lets you show a live 'thinking...' indicator or progressively reveal the reasoning chain in debugging UIs, reducing the perceived latency of long thinking passes.
The Configuration Levers: Budget Tokens and Effort Levels
The two providers expose thinking depth differently. Claude gives you a numeric budget; OpenAI gives you a categorical effort level. Both control the same underlying tradeoff: more compute at inference time in exchange for better reasoning quality.
Claude: budget_tokens
Set budget_tokens inside the thinking parameter block. Minimum 1,024. Practical range: 10K for routine analysis, 32K for complex multi-step reasoning, 100K for the hardest research and planning tasks. Claude will use as many tokens as needed up to this ceiling. You can also set an effort level (low/medium/high/max) which the API translates into a suggested budget.
OpenAI o3/o4: reasoning_effort
Pass reasoning_effort: 'low', 'medium', or 'high' in the API call. Low maps to roughly 1K reasoning tokens; high maps to roughly 20K+. You cannot inspect the reasoning chain. If you need auditability or want to tune by task, Claude is the better fit since the thinking is visible and the budget is numeric.
When to use a small budget
10K-16K tokens covers most analytical tasks: contract clause extraction, code review, multi-step data analysis. Start here for new use cases. Measure quality improvement vs. a standard model. Scale up only if accuracy gaps remain on hard examples.
When to use a large budget
32K-100K tokens for genuinely hard planning tasks: complex agentic workflows, long document synthesis with cross-referencing, research tasks requiring multi-hypothesis evaluation. Beyond 32K the returns diminish quickly on most commercial tasks. Benchmark before assuming more is better.
Pro tip: start small, calibrate up
Run your evaluation set at 10K, 32K, and 64K budget tokens. Plot accuracy vs. budget. Most tasks show a steep quality curve from 0 to 16K then flatten. The 2-4x budget increase to go from 16K to 64K rarely yields 2-4x quality improvement. Find your curve before committing to a larger budget in production.
Cost and Latency: The Real Numbers
Extended thinking is not free. Every thinking token costs money and adds latency. The product decision is whether the accuracy improvement on your specific task justifies the premium. Generic benchmarks do not answer this question for you. Only your evaluation set on your task does.
Claude Sonnet 4.6 with 10K thinking budget
Cost: 10K thinking tokens at ~$3/1M input tokens = roughly $0.03 per call in thinking cost alone. Add output tokens at $15/1M. On a 500-output-token response: $0.03 thinking + $0.0075 output = $0.0375 total. About 3x a standard no-thinking call.
Latency: 8-15 seconds additional latency on top of standard generation. Plan for 12-20 seconds total response time.
Claude Opus 4.8 with 32K thinking budget
Cost: 32K thinking tokens at ~$15/1M input tokens = $0.48 per call in thinking cost. A serious budget for a serious task. Justify with high-value workflows where a wrong answer has real business cost.
Latency: 20-60 seconds total. Not suitable for any real-time UX. Design for async delivery.
OpenAI o3 with high reasoning effort
Cost: Reasoning tokens billed separately at o3 rates. A high-effort call can add $0.25-$2.00 per call in reasoning cost depending on task complexity. OpenAI does not expose the token count, so budget by estimating average call cost from your logs.
Latency: 15-45 seconds for high-effort. Same async design requirement as Claude large-budget calls.
Build AI Products That Use Models Intelligently
The AI PM Masterclass teaches how to make model selection and configuration decisions that actually ship. Taught live by a Salesforce Sr. Director PM who has made these calls in production.
Routing: When to Use Extended Thinking
Not every request needs extended thinking. A routing layer that sends only complex requests to a reasoning model is the core architectural decision. The routing criteria depend on your task distribution. Here are the most reliable signals.
Multi-step reasoning required
Use extended thinking: Legal clause cross-referencing, financial model validation, complex code debugging with multiple dependencies, strategic planning across competing constraints.
Skip extended thinking: Single-fact retrieval, summarization of a single document, classification into predefined categories, creative generation tasks.
Verifiable correctness matters
Use extended thinking: Mathematical calculations, code that will run in production, medical information where wrong answers are dangerous, compliance determinations.
Skip extended thinking: Marketing copy, brainstorming, exploratory analysis where approximate answers are acceptable.
Cost of being wrong is high
Use extended thinking: Contract review, underwriting decisions, engineering specifications, anything where an error triggers rework or liability.
Skip extended thinking: Drafting internal memos, generating example responses for review, low-stakes Q&A.
Task complexity is input-dependent
Use extended thinking: Route dynamically. Use a lightweight classifier to estimate complexity from input features (question length, entity count, presence of 'if/then' conditions, numerical content). Route above a threshold to extended thinking.
Skip extended thinking: If your task set is homogeneous (all simple or all complex), static routing is simpler and almost as good.
Production Architecture Patterns
The latency of extended thinking makes synchronous request-response architectures painful for users. The three patterns below make extended thinking viable in production without sacrificing UX.
Async job processing
User submits a task, gets a job ID immediately. A background worker calls the extended thinking API. The result is stored and the user is notified when ready. Works well for document review, report generation, and any task the user does not need to watch in real time. Design the submission and retrieval flow as separate API calls.
Streaming with progress indicators
Stream the thinking tokens to show a live 'reasoning...' state. Claude supports streaming thinking blocks via the Anthropic streaming API. This does not reduce total latency but reduces perceived latency dramatically. Users who can see the model thinking tolerate 30-second waits much better than users staring at a spinner.
Tiered SLA by task type
Commit to different response times by task class. Simple Q&A: under 3 seconds (no extended thinking). Standard analysis: under 15 seconds (10K budget). Complex reasoning: under 60 seconds, delivered async (32K+ budget). Set user expectations per task type and instrument each tier separately so latency regressions are caught per tier, not averaged across all requests.
Thinking result caching
If your workflow re-processes the same documents or runs the same reasoning task repeatedly, cache the thinking result, not just the output. Claude prefix caching applies to the prompt but not the thinking block. Cache at the application layer: store the (input_hash, budget) -> (thinking, output) mapping with appropriate TTLs. This is especially valuable for batch processing pipelines where the same analytical framework is applied to many similar documents.
Fallback to standard on timeout
Set a hard timeout on extended thinking calls. If the thinking budget is not exhausted within your SLA, fall back to a standard-model response. For user-facing features, a good-enough fast answer beats a perfect slow answer that errors out. Instrument fallback rate: if it exceeds 5%, your budget or SLA is misconfigured.
Measuring Whether It Is Actually Helping
Extended thinking costs 3-10x more per call than standard inference. The only way to know if it is worth it is to measure accuracy improvement on your actual task, not on synthetic benchmarks. Here is the evaluation framework.
1. Build a hard eval set
Assemble 50-100 test cases where the correct answer is verifiable and where standard models fail at a meaningful rate. Easy cases do not differentiate thinking from no-thinking. Use real examples from your error logs, customer edge cases, or domain experts who can identify where the model currently falls short.
2. Run side-by-side at multiple budgets
Test: (a) standard model, no thinking. (b) same model, 10K budget. (c) same model, 32K budget. Measure accuracy on each condition. You want to see the accuracy curve: how much does each additional thinking token contribute? Most tasks show most of the gain in the first 10K tokens.
3. Calculate cost-adjusted ROI
ROI = (accuracy improvement * task value per call) / (cost premium per call). If extended thinking improves accuracy by 15 percentage points and each wrong answer costs the business $10 in rework, the break-even cost premium is $1.50 per call. At $0.40 per call premium, you have a 3.75x return on the extra inference spend.
4. Monitor reasoning quality in production
Log and sample thinking blocks. Review them weekly. Look for: circular reasoning (model repeating the same point without progress), unnecessary uncertainty (model thinks too long about obvious subtasks), and plan abandonment (model starts a reasoning approach then switches mid-thinking). These are signals your prompt or task framing is fighting the thinking process.
Ship AI Systems That Make Better Decisions
The AI PM Masterclass teaches how to architect AI systems, select and configure models, and measure what matters in production. Join the next cohort.
Related Articles
Before you go: get the AI PM Minute
One tactic to make you a sharper AI PM, twice a week. 60 seconds to read. Free.
No fluff. Unsubscribe anytime.