Unit Economics of AI Products: The Margin Math Every PM Needs
TL;DR
Pricing decides what customers pay. Unit economics decides whether you have a business. Most AI products ship with gross margins between negative 20% and 40% — far below the 70%+ that SaaS investors expect. The four cost drivers you can actually control are model selection, token efficiency, caching, and routing. This guide walks through the COGS stack of an AI inference, how to build a unit economics model in 30 minutes, and three patterns for fixing negative-margin features without raising prices.
Why Unit Economics Beats Pricing in AI
In traditional SaaS, an additional seat costs you almost nothing to serve. Storage and compute are rounding errors. Gross margins of 75-85% are the floor. AI products break this model. Every inference burns GPU time, every token costs money, and the cost is roughly proportional to how much your customer uses the product. Suddenly a heavy user is a loss center, not a profit center.
This is why "just raise the price" is rarely the right answer. If your COGS scales linearly with usage and your pricing doesn't, raising the price slows the bleeding but doesn't change the shape of the curve. Unit economics is the discipline of understanding that shape — and bending it.
The metric that matters
Gross margin per active user, broken out by usage decile. Top-decile users (the "power users") often have negative margins. If they do, you have a structural problem, not a pricing one.
The COGS Stack: What an AI Inference Actually Costs
When a customer makes one request to your AI product, you incur cost across five layers. Most PMs only track the first one.
Model inference (tokens in + tokens out)
The headline cost. Frontier models like GPT-4 class or Claude Opus class run $5-15 per million input tokens and $15-75 per million output tokens. Output tokens are the expensive part. A 2,000-token response from a frontier model costs roughly $0.03-0.15 per call.
Retrieval (embeddings + vector search)
If you use RAG, every query embeds the user's input and runs a vector search. Embedding APIs cost $0.02-0.13 per million tokens. Vector DB queries cost less per call but add up — Pinecone or Weaviate hosted instances start at $70-300/month plus per-query fees.
Orchestration overhead
Agent loops, tool calls, retries, and reasoning chains. A multi-step agent task can make 5-20 model calls per user request. Coding agents and research agents are the worst offenders — a single Devin-style task can burn $2-10 in inference.
Infrastructure (servers, bandwidth, observability)
Your API servers, queues, logging, eval pipelines, and feature flags. Negligible per call but real at scale. Budget 5-15% of inference cost.
Human-in-the-loop (when applicable)
For products with review steps — content moderation, legal review, data labeling — human time is often the biggest line item and dwarfs inference cost. $1-5 per reviewed item is common.
Add these together and you get your fully-loaded cost per inference. For a typical chat product running on a frontier model, expect $0.05-0.30 per active conversation. For an agentic product running multi-step tasks, expect $0.50-5.00 per task. Industry-published benchmarks (OpenAI, Anthropic, AWS Bedrock pricing pages) are the source of truth for the model component.
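Summed up, the stack above fits in a few lines of Python. The prices and overhead figures below are illustrative placeholders inside the ranges cited, not quotes from any provider:

```python
# Sketch: fully-loaded cost of one AI inference. All prices and
# per-layer estimates are illustrative assumptions, not vendor quotes.

def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    price_in_per_m: float = 10.0,    # $/1M input tokens (illustrative)
    price_out_per_m: float = 30.0,   # $/1M output tokens (illustrative)
    retrieval_cost: float = 0.001,   # embeddings + vector query, per call
    model_calls: int = 1,            # orchestration: model calls per request
    infra_overhead: float = 0.10,    # 10%, inside the 5-15% budget above
    human_review: float = 0.0,       # $ per reviewed item, if applicable
) -> float:
    inference = model_calls * (
        input_tokens / 1e6 * price_in_per_m
        + output_tokens / 1e6 * price_out_per_m
    )
    return inference * (1 + infra_overhead) + retrieval_cost + human_review

# One chat turn: ~1,500 tokens in, 2,000 out, a single model call.
print(round(cost_per_call(1_500, 2_000), 4))
```

Swap in your actual provider prices and the per-layer figures from your own logs; the structure is what matters.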
Gross Margin: The Number That Decides Your Runway
Gross margin is revenue minus COGS, divided by revenue. SaaS investors expect 70%+ at scale because the comparable companies they've seen all hit it. AI products that pretend to be SaaS but ship with sub-50% margins look healthy on top-line metrics and rot from the inside.
Negative gross margin
You lose money on every customer. Common in the first 6-12 months of a new AI feature. Survivable only if you have a clear path to positive margin within 12 months — otherwise it's a slow death.
0-40% gross margin
You're treading water. Investors will valuation-discount you against SaaS comps. You probably can't fund growth from gross profit. Acceptable only as a transition state, not a destination.
40-70% gross margin
You're a real business but you'll trade at lower multiples than software comps. This is where most well-run AI products live in 2026. Defensible if your retention and net dollar retention are strong.
70%+ gross margin
You've earned the SaaS multiple. Usually means heavy use of cheaper models, aggressive caching, or pricing that captures value disproportionate to compute. Rare for products that use frontier models as their main service.
The traps PMs fall into: averaging margins across all users hides the negative-margin tail, ignoring trial users distorts new-cohort economics, and using list price instead of effective price (net of discounts) inflates the picture. Track gross margin at the cohort level, weighted by realized revenue.
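The averaging trap is easy to demonstrate: a naive average of per-cohort margin percentages can look far healthier than the revenue-weighted figure. The cohort numbers below are made up for illustration:

```python
# Sketch: naive average of cohort margins vs revenue-weighted gross margin.
# Cohort figures are synthetic, chosen to show the gap.
cohorts = [
    {"name": "light", "revenue": 10_000, "cogs": 1_000},   # 90% margin
    {"name": "mid",   "revenue": 20_000, "cogs": 8_000},   # 60% margin
    {"name": "power", "revenue": 30_000, "cogs": 33_000},  # -10% margin
]

naive_avg = sum(
    (c["revenue"] - c["cogs"]) / c["revenue"] for c in cohorts
) / len(cohorts)

total_rev = sum(c["revenue"] for c in cohorts)
weighted = (total_rev - sum(c["cogs"] for c in cohorts)) / total_rev

print(f"naive {naive_avg:.0%}, weighted {weighted:.0%}")
```

The unweighted average flatters you because the worst-margin cohort is also the biggest revenue cohort; weighting by realized revenue surfaces that.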
The 4 Cost Drivers You Can Actually Control
Model selection (the 10x lever)
Frontier-class vs mid-tier vs small fine-tuned model is often a 5-50x cost difference. Most product surfaces don't need frontier intelligence — they need consistency. Route only the hard 10-20% of queries to the expensive model and the rest to a smaller one.
Token efficiency (system prompts, context, output length)
Bloated system prompts get re-sent on every call. Stuffed context windows inflate input cost on every call, and under the hood attention compute scales quadratically with context length. Verbose outputs blow up the expensive token bucket. Trimming a 4,000-token system prompt down to 800 cuts that prompt's input cost by 80% on every single inference.
Caching (prompt caching, response caching, semantic caching)
Prompt caching (offered by OpenAI, Anthropic, Google) discounts cached input tokens by 50-90%. For repetitive system prompts and few-shot examples, this is free margin. Semantic caching of common questions can eliminate 20-40% of inference entirely for FAQ-style products.
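A back-of-envelope sketch of what prompt caching does to per-call input cost, assuming an illustrative $10 per million input tokens and the top-end 90% cached-token discount:

```python
# Sketch: input-cost impact of prompt caching on a repeated system prompt.
# Price and discount are assumptions at the top of the ranges cited above.
def input_cost(system_tokens, user_tokens, price_per_m=10.0, cache_discount=0.9):
    cached = system_tokens / 1e6 * price_per_m * (1 - cache_discount)
    fresh = user_tokens / 1e6 * price_per_m
    return cached + fresh

no_cache = input_cost(4_000, 500, cache_discount=0.0)
cached = input_cost(4_000, 500, cache_discount=0.9)
print(f"${no_cache:.4f} -> ${cached:.4f} per call")
```

The system prompt dominates when it is several times larger than the user message, which is why caching it is "free margin."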
Routing and batching
Route by intent: simple queries to small models, hard ones to frontier. Batch non-realtime workloads (overnight summaries, reports) to batch APIs at 50% discount. The infrastructure to do this well typically pays for itself in under a quarter.
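A minimal sketch of intent routing. The keyword heuristic and model names are hypothetical stand-ins; production routers typically use a cheap classifier model or logged task metadata:

```python
# Sketch: route simple queries to a small model, hard ones to a frontier
# model. The signals and model labels are illustrative placeholders.
def route(query: str) -> str:
    hard_signals = ("analyze", "prove", "refactor", "multi-step")
    if len(query) > 500 or any(s in query.lower() for s in hard_signals):
        return "frontier-model"   # the expensive 10-20% of traffic
    return "small-model"          # everything else

print(route("What are your business hours?"))
print(route("Analyze this contract for indemnification risk"))
```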
Notice what is not on this list: raising prices, cutting features, or removing the AI. Those are levers too, but they affect demand. The four above affect cost without affecting what the customer experiences. Pull those first.
Make AI Margin Decisions With Confidence
The AI PM Masterclass walks through real unit economics models — including the spreadsheets, the dashboards, and the conversations with finance — taught live by a Salesforce Sr. Director PM.
Build a Unit Economics Model in 30 Minutes
You don't need a CFO or a Looker dashboard for the first pass. A spreadsheet and an afternoon will tell you more than most AI startups know about themselves.
Step 1: Define the unit
What you do: Pick the smallest billable event: a query, a conversation, a generated document, a completed task. This is what you'll cost and revenue per.
PM Implication: If you can't define the unit cleanly, your pricing probably isn't aligned with value delivered. Fix that first.
Step 2: Pull last 30 days of usage data
What you do: Number of units delivered, total input tokens, total output tokens, average task length, error/retry rate, and which model served each call. Most stacks log this via OpenTelemetry, Helicone, or LangSmith.
PM Implication: If your logs don't capture model and tokens per call, your unit economics work is blocked. Instrument first, model second.
Step 3: Compute fully-loaded cost per unit
What you do: Sum inference + retrieval + orchestration + infra + (human-in-the-loop if any). Divide by units. This is your COGS per unit — the C in unit economics.
PM Implication: If this number surprises you by more than 30%, your team's intuition is wrong about where the money goes. That's normal and that's why you do the exercise.
Step 4: Map revenue per unit
What you do: Effective revenue (net of discounts, refunds, trial dilution) divided by units delivered. For seat-based pricing, divide each seat's MRR by the units that seat delivered.
PM Implication: Heavy users almost always have lower revenue-per-unit than light users. This is where the margin curve bends negative.
Step 5: Plot gross margin by usage decile
What you do: Sort users by usage, group into 10 buckets, compute gross margin per bucket. The shape of this curve is your business.
PM Implication: If the top decile is negative and the bottom decile is 80%, your pricing isn't capturing value from power users. Move to usage-based or tiered pricing before chasing growth.
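The five steps compress into a short script. The usage distribution, per-unit cost, and seat price below are synthetic assumptions chosen to show the typical shape of the curve:

```python
# Sketch of steps 1-5: gross margin by usage decile under flat seat pricing.
# Usage follows a synthetic exponential curve (an assumption) so that heavy
# users consume far more units than light ones, as they do in practice.
users = sorted(round(10 * 1.04 ** i) for i in range(100))  # units per user
COST_PER_UNIT = 0.12   # fully-loaded COGS per unit, from step 3
PRICE = 30.0           # effective monthly revenue per user, from step 4

margins = []
for d in range(10):
    bucket = users[d * 10:(d + 1) * 10]        # step 5: 10 usage buckets
    revenue = PRICE * len(bucket)
    cogs = COST_PER_UNIT * sum(bucket)
    margins.append((revenue - cogs) / revenue)
    print(f"decile {d + 1}: {margins[-1]:.0%}")
```

With these assumptions the bottom deciles run healthy margins while the top decile goes negative, which is exactly the curve the exercise is meant to expose.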
When the Numbers Don't Work: 3 Patterns to Fix Negative Margins
If you ran the spreadsheet and the top decile is bleeding, you have three structural moves before you touch the price tag.
Pattern 1: Cascade the model stack
Default to a small fast model. Escalate to a larger one only when the small one's confidence is low or the task requires it. Done right, this cuts inference cost 60-80% with no measurable quality loss on most product surfaces. Cursor and GitHub Copilot both use cascading internally.
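A minimal cascade sketch. `small_model`, `frontier_model`, and the confidence score are hypothetical placeholders; real systems derive confidence from logprobs, a verifier model, or task heuristics:

```python
# Sketch: model cascade. Escalate to the expensive model only when the
# cheap model is not confident. All names here are illustrative stubs.
def small_model(query):
    # Placeholder: returns (answer, confidence in [0, 1]).
    return "draft answer", 0.62

def frontier_model(query):
    return "careful answer"

def cascade(query, threshold=0.8):
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"                  # cheap path, most traffic
    return frontier_model(query), "frontier"    # escalate the hard cases

print(cascade("tricky query"))
```

The threshold is the tuning knob: raise it and quality-sensitive traffic escalates more often; lower it and cost falls at some quality risk.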
Pattern 2: Convert flat pricing to usage tiers
If a single flat price covers all usage, your lightest users subsidize the heaviest ones straight into your loss column. Adding usage caps or overage charges on the top tier rarely drives churn when aimed at the right segment: the users who hit caps are usually willing to pay more because they're getting more value.
Pattern 3: Move expensive workloads to batch
Anything that doesn't need to be real-time (overnight reports, periodic syncs, bulk analyses) belongs on a batch API. OpenAI and Anthropic both offer ~50% discounts for batch processing. For products with substantial async workloads, this single change has shifted gross margin by 10-20 points.
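The margin shift is simple arithmetic. Assuming, hypothetically, that half of COGS is batch-eligible and the batch API halves its cost:

```python
# Sketch: gross-margin shift from moving async workloads to a ~50%-off
# batch API. The 50/50 workload split is an illustrative assumption.
revenue = 100_000
cogs_realtime = 30_000   # interactive traffic, stays on the live API
cogs_async = 30_000      # reports/syncs that tolerate batch latency

before = (revenue - (cogs_realtime + cogs_async)) / revenue
after = (revenue - (cogs_realtime + cogs_async * 0.5)) / revenue
print(f"{before:.0%} -> {after:.0%}")
```

A 15-point swing from one infrastructure change, squarely in the 10-20 point range cited above.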
Pattern 4 (bonus): Pre-compute and cache
If 30% of your queries are minor variations on a few common patterns, cache the answers semantically. The hard part is the cache invalidation strategy, but the cost reduction is dramatic. RAG products benefit the most.
If none of these moves close the gap, you have a deeper problem: the product's value isn't priceable above its cost. That's a strategy question, not a margin question, and it's the moment to stop optimizing and start rethinking what you're selling.
Build AI Products That Actually Have Margins
The AI PM Masterclass covers unit economics, cost modeling, and pricing strategy alongside the product craft. Stop shipping features that lose money on every customer.