TECHNICAL DEEP DIVE

AI Reasoning Models Explained: o3, Chain-of-Thought, and When to Use Them

By Institute of AI PM · 13 min read · Apr 18, 2026

TL;DR

Reasoning models (OpenAI's o-series, Anthropic's extended thinking, Google's Gemini thinking mode) spend compute on internal deliberation before responding. They outperform standard models on complex, multi-step problems requiring logical reasoning, math, and careful analysis. They cost more and respond more slowly. AI PMs need to know when that trade-off is worth it, and when it's a waste.

What Reasoning Models Actually Do

Standard LLMs generate responses token by token, left to right, committing to each token as it is produced. Reasoning models use a different inference strategy: they first generate an internal chain-of-thought (exploring the problem, checking their work, backtracking out of dead ends) and only then produce a final answer.

1. The thinking token budget

Reasoning models allocate a budget of tokens for internal thinking. Higher thinking budgets produce better results on hard problems but cost more and take longer. Most providers expose this as a controllable parameter. For simple tasks, a small thinking budget (or none) is optimal. For complex reasoning, more thinking tokens improve accuracy.
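
Here's a minimal sketch of setting a thinking budget via Anthropic's Messages API (parameter surface current as of this writing; OpenAI's o-series exposes a coarser reasoning_effort setting instead of a raw token count, and the model name below is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=16000,                  # cap covering thinking + visible output
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,         # max tokens spent on internal reasoning
    },
    messages=[{
        "role": "user",
        "content": "Plan a zero-downtime migration of 40 services to a new region.",
    }],
)

# Thinking arrives as separate content blocks; the answer is the text block.
answer = next(b.text for b in response.content if b.type == "text")
```

Tune the budget per task class rather than globally; the right value for a routing decision and a contract analysis can differ by an order of magnitude.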

2. Chain-of-thought vs extended thinking

Chain-of-thought (CoT) prompting asks a standard model to show its reasoning in the response. Extended thinking (built into reasoning models) performs that reasoning internally, without cluttering the response with intermediate steps. CoT adds visible output tokens; extended thinking adds hidden reasoning tokens that are still billed. Both improve accuracy on hard problems.
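
A hedged illustration of the difference (prompts only; the warehouse numbers are invented):

```python
# Chain-of-thought prompting: a standard model is told to reason out loud,
# so intermediate steps land in the visible response and must be parsed out.
cot_prompt = (
    "A warehouse ships 1,240 units/day with a 3% return rate. "
    "How many net units ship in a 30-day month? "
    "Think step by step, then give the final number on its own line."
)

# Extended thinking: same question with no scaffolding; a reasoning model
# deliberates in hidden thinking tokens and returns only the answer.
plain_prompt = (
    "A warehouse ships 1,240 units/day with a 3% return rate. "
    "How many net units ship in a 30-day month?"
)
```

With CoT you extract the final answer from the reasoning; with extended thinking the response arrives already clean.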

3. How reasoning improves accuracy

With standard LLMs, early token choices commit the model to a reasoning path. If that path is wrong, the model cannot backtrack; it can only continue in the wrong direction. Reasoning models explore multiple paths, evaluate them, and choose the best, producing answers that are more often correct, better calibrated, and more likely to catch their own errors.

4. What doesn't improve with reasoning

Reasoning helps with problems that have verifiable correct answers: math, logic, code, analysis. It doesn't help with knowledge retrieval (the model either knows a fact or doesn't), stylistic generation (writing a product description doesn't benefit from chain-of-thought), or tasks requiring very fast responses. Don't use reasoning models for everything.

When to Use Reasoning Models

Use reasoning models for

Multi-step math or quantitative analysis. Code generation and debugging where correctness is verifiable. Legal, medical, or financial document analysis requiring careful interpretation. Complex planning or scheduling problems. Agentic tasks where the model must reason about tool use and action consequences. Anything where a wrong answer is costly.

Don't use reasoning models for

Creative writing, marketing copy, or summarization — standard models are equally good and much faster. Simple factual retrieval or question-answering on well-known topics. High-volume, low-stakes tasks where latency and cost dominate. Real-time interactions where users expect immediate responses. Classification, extraction, or structured formatting tasks.

Hybrid routing strategies

Most production systems should route by task complexity: fast standard models for simple tasks, reasoning models for complex ones. Complexity classifiers can route automatically. Or use a standard model to attempt the task first, and escalate to a reasoning model only when confidence is low or the task fails a validation check.
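
A hedged sketch of both patterns; every helper and model id here is a placeholder, not a vendor API:

```python
FAST_MODEL = "standard-model-v1"        # cheap, low latency (placeholder id)
REASONING_MODEL = "reasoning-model-v1"  # slower, more accurate (placeholder id)

def classify_complexity(task: str) -> str:
    """Placeholder: in production this is a small classifier or heuristic."""
    markers = ("prove", "plan", "debug", "reconcile", "multi-step")
    return "complex" if any(m in task.lower() for m in markers) else "simple"

def call_model(model: str, task: str) -> str:
    """Placeholder for a real provider SDK call."""
    return f"[{model}] response to: {task}"

def route(task: str) -> str:
    """Pattern 1: route up front by estimated complexity."""
    model = REASONING_MODEL if classify_complexity(task) == "complex" else FAST_MODEL
    return call_model(model, task)

def answer_with_escalation(task: str, validate) -> str:
    """Pattern 2: try the cheap model first, escalate when validation fails."""
    draft = call_model(FAST_MODEL, task)
    return draft if validate(draft) else call_model(REASONING_MODEL, task)
```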

Evaluating reasoning model ROI

Compare accuracy improvement (for your specific task, on your specific test set) against cost and latency increase. A reasoning model that improves accuracy by 15% while costing 5x more may be worth it for high-stakes tasks and not worth it for bulk processing. Always measure on your task, not benchmarks.
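
One way to make that comparison concrete is cost per correct answer. The numbers below are invented, using the 15-point gain and 5x price from above with an assumed 70% baseline:

```python
std_cost, std_accuracy = 0.002, 0.70   # $/request, fraction correct (assumed)
rsn_cost, rsn_accuracy = 0.010, 0.85   # 5x the cost, +15 points accuracy

std_cpc = std_cost / std_accuracy      # ≈ $0.0029 per correct answer
rsn_cpc = rsn_cost / rsn_accuracy      # ≈ $0.0118 per correct answer

# ~4x more expensive per correct answer: worth it only when a wrong answer
# costs more than that gap (rework, risk, lost trust).
print(f"standard: ${std_cpc:.4f}  reasoning: ${rsn_cpc:.4f} per correct answer")
```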

Cost and Latency Reality

Reasoning tokens cost money

Thinking tokens are typically billed at the same rate as output tokens. A task that generates 5,000 thinking tokens costs significantly more than the same task without thinking. At scale, this matters enormously. Benchmark thinking token counts on representative inputs before committing to reasoning models in production.
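
Most provider responses report reasoning spend in the usage object, which makes this benchmarking straightforward. A sketch against OpenAI's chat completions API (field location current as of this writing; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="o3-mini",  # illustrative reasoning model
    messages=[{"role": "user", "content": "A representative production task goes here."}],
)

# Reasoning tokens are counted inside completion tokens and billed at the
# output-token rate.
details = resp.usage.completion_tokens_details
print("reasoning tokens:", details.reasoning_tokens)
print("total completion tokens:", resp.usage.completion_tokens)
```

Run this over a few hundred representative inputs and look at the distribution, not just the mean; thinking spend is often heavy-tailed.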

Latency is substantially higher

A standard model response for a complex task might take 3–5 seconds. The same task with extended thinking might take 15–30 seconds. For conversational products, this latency is user-hostile. Reasoning models work better in asynchronous workflows (batch processing, background analysis) than in real-time user-facing interactions.

Thinking budget controls the cost-quality curve

Most reasoning model APIs let you set a maximum thinking token budget. Lower budget = faster, cheaper, potentially less accurate. Higher budget = slower, more expensive, potentially more accurate. Find the budget that achieves your required accuracy at acceptable cost by testing across a range of budget values on your specific task.
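
A minimal sweep harness, assuming you already have a graded test set; eval_case() below simulates results and is a placeholder for real model calls:

```python
import random

BUDGETS = [0, 1024, 4096, 16384]   # thinking-token budgets to compare

def eval_case(case: str, budget: int) -> tuple[bool, float]:
    """Placeholder: call the model with `budget` thinking tokens and grade
    the answer. Here we just simulate accuracy rising with budget."""
    p_correct = min(0.95, 0.60 + budget / 40_000)
    tokens_used = 200 + budget * 0.7 * random.random()
    return random.random() < p_correct, tokens_used

def sweep(test_set: list[str]) -> None:
    for budget in BUDGETS:
        results = [eval_case(c, budget) for c in test_set]
        accuracy = sum(ok for ok, _ in results) / len(results)
        avg_tokens = sum(t for _, t in results) / len(results)
        print(f"budget={budget:>6}  accuracy={accuracy:.1%}  "
              f"avg tokens={avg_tokens:,.0f}")

sweep([f"case-{i}" for i in range(200)])
# Pick the smallest budget that clears your accuracy bar.
```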

Cost per decision vs cost per request

For high-stakes decisions (medical diagnosis support, legal risk assessment, financial analysis), weigh the cost of a reasoning model against the cost of getting the decision wrong. In these contexts, a 10x price premium is often trivially cheap relative to the stakes. Frame cost decisions in terms of the decision being made, not the compute being used.
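
The same arithmetic, per decision rather than per request (all numbers illustrative):

```python
error_cost = 5_000.00            # $ impact of one wrong risk assessment (assumed)
std_cost, std_err = 0.01, 0.20   # $/request and error rate, standard model
rsn_cost, rsn_err = 0.10, 0.05   # 10x the request cost, a quarter of the errors

std_per_decision = std_cost + std_err * error_cost   # ≈ $1,000.01
rsn_per_decision = rsn_cost + rsn_err * error_cost   # ≈ $250.10

# 10x the compute spend, ~4x cheaper per decision once error cost is priced in.
print(f"standard: ${std_per_decision:,.2f}  reasoning: ${rsn_per_decision:,.2f}")
```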

Reasoning Models in Product Architecture

1. Offline planning, online execution

Use a reasoning model for the planning or decision-making step (determine the strategy, build the plan, assess the situation) and a standard model for execution (write the content, format the output, generate the response). This concentrates reasoning cost where it has the most impact and keeps execution fast and cheap.
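
A hedged sketch of the split; call_model() and the model ids are placeholders for your provider SDK:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call."""
    return f"[{model}] output for: {prompt[:50]}"

def generate_report(request: str) -> str:
    # One expensive reasoning call up front to produce the plan.
    plan = call_model("reasoning-model-v1",
                      f"Produce a step-by-step outline for: {request}")
    # Many cheap, fast execution calls, one per plan step.
    sections = [call_model("standard-model-v1", f"Write this section: {step}")
                for step in plan.splitlines() if step.strip()]
    return "\n\n".join(sections)
```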

2. Reasoning as a validation layer

Generate output with a standard model, then use a reasoning model to verify it — check the logic, catch errors, validate against requirements. This critic pattern uses reasoning selectively (only on outputs that pass a complexity threshold) and keeps overall cost manageable while improving accuracy on the outputs that matter most.
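
A sketch of the critic pattern; needs_review() and call_model() are placeholders:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call."""
    return f"[{model}] response to: {prompt[:40]}"

def needs_review(task: str) -> bool:
    """Placeholder complexity threshold: e.g. length, topic, or stakes."""
    return len(task) > 500

def generate_then_verify(task: str) -> str:
    draft = call_model("standard-model-v1", task)
    if not needs_review(task):   # most outputs skip the expensive critic
        return draft
    critique = call_model(
        "reasoning-model-v1",
        "Check this answer for logical errors and unmet requirements.\n"
        f"Task: {task}\nAnswer: {draft}\n"
        "Reply APPROVED, or list the problems.",
    )
    if critique.strip() == "APPROVED":
        return draft
    # One cheap revision pass guided by the critic's findings.
    return call_model("standard-model-v1", f"{task}\nFix these issues:\n{critique}")
```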

3. Async reasoning workflows

For tasks where latency is acceptable (document analysis, report generation, research synthesis), reasoning models work well in asynchronous workflows: the user submits a task, the system processes it with extended thinking time, and delivers the result. Design the UX around expected wait times rather than forcing real-time constraints on tasks that don't require them.
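
A sketch of the job-shaped flow; the in-memory store and inline worker below stand in for real infrastructure (Celery, SQS, and so on):

```python
import uuid

JOBS: dict[str, dict] = {}   # stand-in for a persistent job store

def call_model(model: str, task: str) -> str:
    """Placeholder for a real API call that may think for 30+ seconds."""
    return f"[{model}] analysis of: {task}"

def submit(task: str) -> str:
    """Returns a job id immediately; the user is free to navigate away."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "queued", "task": task, "result": None}
    worker(job_id)   # placeholder: a real system enqueues to a background worker
    return job_id

def worker(job_id: str) -> None:
    """Runs off the request path, so long thinking time is acceptable."""
    job = JOBS[job_id]
    job["status"] = "running"
    job["result"] = call_model("reasoning-model-v1", job["task"])
    job["status"] = "done"   # then notify: email, webhook, in-app badge
```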

4. Agentic reasoning for complex tasks

In multi-step agentic systems, reasoning models are most valuable at decision nodes — points where the agent must evaluate a situation, plan next steps, or choose between options. Standard models handle tool execution, data retrieval, and output formatting. Reasoning models handle judgment.
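
In code, this can be as simple as selecting the model per step type; everything below is a placeholder sketch:

```python
JUDGMENT_STEPS = {"plan", "evaluate", "choose_action", "recover_from_error"}

def model_for(step_type: str) -> str:
    """Reasoning model at decision nodes, standard model everywhere else."""
    return ("reasoning-model-v1" if step_type in JUDGMENT_STEPS
            else "standard-model-v1")

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call."""
    return f"[{model}] {prompt}"

def run_step(step: dict) -> str:
    return call_model(model_for(step["type"]), step["prompt"])
```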

Master AI Model Selection in the AI PM Masterclass

Reasoning models, model routing, and technical architecture decisions are core to the AI PM Masterclass curriculum. Taught by a Salesforce Sr. Director PM.