TECHNICAL DEEP DIVE

AI Infrastructure for Product Managers: GPUs, Inference, and the Latency-Cost Trade-off

By Institute of AI PM · 14 min read · Mar 22, 2026

TL;DR

You don't need to manage GPUs — but you need to understand why your AI feature costs what it costs, why latency varies, and how infrastructure decisions made by your engineering team affect what you can ship. This guide gives PMs the vocabulary and mental models to make intelligent infrastructure trade-offs without getting lost in hardware specs.

Why PMs Need to Understand AI Infrastructure

Two scenarios where infrastructure knowledge makes you a better PM:

Scenario A: Your AI feature works great in development but times out for 5% of users in production. Your engineer says "we're hitting GPU memory limits on long prompts." Do you know what trade-offs you're choosing between?

Scenario B: Finance asks you to justify the $80K/month AI cost. You need to explain the cost model, defend the current architecture, and identify where costs can be reduced without hurting quality.

In both scenarios, the PM who understands inference infrastructure makes better decisions faster. The PM who doesn't is blocked waiting for engineering to translate everything.

The Hardware Layer: GPUs and Why They Matter

LLMs run on GPUs — Graphics Processing Units — because inference is dominated by matrix multiplications, which are massively parallel, and GPUs are optimized for exactly that workload. The dominant GPUs for AI inference in 2026 are NVIDIA's H100 and H200, with AMD's MI300X emerging as a competitor.

GPU Memory (VRAM)

Determines what model sizes can be served. An 80GB H100 can hold a 70B-parameter model in 4-bit quantization (~35GB of weights), with the remaining memory available for the KV cache and batching.
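The weight-memory math behind that claim is simple enough to check yourself: memory for weights ≈ parameter count × bytes per parameter. A minimal sketch (the function name is ours, and this deliberately ignores KV cache and runtime overhead, which add more on top):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough VRAM needed for model weights alone (excludes KV cache and overhead)."""
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param  # billions of params x bytes each = GB

# A 70B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

This is why quantization is an infrastructure decision, not just a quality one: halving bits per parameter halves the GPUs you need to hold the model.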

GPU Availability

Directly affects latency and pricing. During peak demand, providers may queue requests or increase prices.

Cloud Providers

AWS, GCP, Azure and inference providers all compete on GPU availability and pricing.

You don't choose the GPU. But you do choose the provider and model, and those choices determine which GPU your workload runs on.

Inference Providers: The Decision That Shapes Everything

When you call an AI API, you're paying for someone else's GPU time. The inference provider landscape breaks into three categories:

Frontier model providers (hosted only)

OpenAI, Anthropic, Google. You call their API, they run their proprietary models on their infrastructure. No deployment options.

Best for: fastest access to the strongest models (GPT-4o, Claude, Gemini, o-series reasoning models)

Open model hosters

Together AI, Fireworks AI, Replicate, Modal. They run open-source models (Llama, Mistral, Qwen) on their infrastructure.

Best for: cost-sensitive use cases, model fine-tuning, specific open-source models

Self-hosted inference

You run models on your own cloud instances (AWS EC2 GPU, GCP A100 nodes). Maximum control, potential cost savings at scale.

Best for: regulated industries, very high volume, proprietary fine-tuned models

The PM Decision Matrix

Speed to ship: Frontier API (OpenAI/Anthropic)
Cost at scale: Open model hoster → evaluate self-hosted
Data privacy: Self-hosted or providers with DPA + VPC options
Ultra-low latency: Groq (LPU architecture), Cerebras

Latency: Where the Time Goes

AI latency has two components that behave differently:

Time to First Token (TTFT)

How long from sending the request until the first token streams back. For streaming UIs, this is the perceived latency.

Tokens Per Second (TPS)

How fast tokens stream after the first one arrives. Larger models are slower. Output length directly affects total time.

Total time = TTFT + (output tokens / TPS)
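The formula above is worth internalizing with concrete numbers. A minimal sketch (the latency figures here are illustrative, not benchmarks of any specific model):

```python
def total_latency_s(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total response time = time to first token + streaming time for the rest."""
    return ttft_s + output_tokens / tps

# Hypothetical numbers: 0.8s TTFT, a 400-token answer, 60 tokens/sec
print(f"{total_latency_s(0.8, 400, 60):.1f} s")  # 7.5 s
```

Note how the output-length term dominates: cutting the answer from 400 to 100 tokens saves ~5 seconds, far more than any realistic TTFT optimization. This is why max_tokens and prompt design are latency levers, not just cost levers.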

Techniques to reduce latency:

  • Speculative decoding: A small draft model generates candidate tokens; the large model verifies them in parallel. 2–3x speedup.
  • Prompt caching: Cache the processed representation of repeated system prompts. 50–80% discount on cached tokens.
  • Streaming: Always stream for user-facing features. Users tolerate 3–4 seconds to first token if text is visibly flowing.
  • Model routing: Use a fast small model for simple queries, a slower large model for complex ones.
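Model routing can be as simple as a heuristic in front of your API calls. A minimal sketch, with placeholder model names and a deliberately naive complexity check (production routers often use a small classifier model instead):

```python
# Hypothetical model router: a cheap heuristic decides which model serves a query.
FAST_MODEL = "small-fast-model"    # placeholder names, not real model IDs
LARGE_MODEL = "large-capable-model"

def route(query: str) -> str:
    """Send short, simple-looking queries to the fast model; everything else to the large one."""
    looks_complex = len(query) > 300 or any(
        kw in query.lower() for kw in ("analyze", "compare", "step by step")
    )
    return LARGE_MODEL if looks_complex else FAST_MODEL

print(route("What's our refund policy?"))            # small-fast-model
print(route("Analyze these three contracts for risk"))  # large-capable-model
```

Even a crude router like this can shift the bulk of traffic to the cheap, fast model, since most real-world query distributions are dominated by simple requests.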

Apply These Concepts in the AI PM Masterclass

You'll build products using real inference providers and make the model selection, infrastructure, and cost decisions described in this guide — live, with a Salesforce Sr. Director PM.

Cost Structure: The Complete Picture

AI infrastructure cost has three components:

1. Input Tokens

Every token in your prompt costs money. Context window management matters economically.

2. Output Tokens

Typically 3–5x more expensive per token than input. Always specify max_tokens.
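Capping output is usually one parameter on the request. An illustrative payload shape only — the model ID is a placeholder, and field names follow the common max_tokens convention rather than any one provider's exact schema:

```python
# Illustrative request payload: max_tokens puts a hard ceiling on output tokens,
# which is also a hard ceiling on the most expensive part of the bill.
request = {
    "model": "claude-sonnet",  # placeholder model ID
    "max_tokens": 300,         # an email draft rarely needs more
    "messages": [{"role": "user", "content": "Draft a follow-up email to the client."}],
}
```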

3. Operational Overhead

For self-hosted — GPU instance costs, MLOps engineering time, monitoring infrastructure.

Sample cost calculation for a product feature

• Feature: AI-generated email draft

• Average prompt: 500 tokens (system + user instructions + user context)

• Average output: 200 tokens

• Model: Claude Sonnet ($3/M input, $15/M output)

• Cost per generation: (500 × $3/M) + (200 × $15/M) = $0.0015 + $0.0030 = $0.0045 per draft

• At 10,000 drafts/day: $45/day, ~$1,350/month
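The calculation above generalizes into a reusable cost model you can run against any feature. A minimal sketch using the sample numbers from this section (function name is ours; prices are per million tokens):

```python
def feature_cost(prompt_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float,
                 requests_per_day: int, days: int = 30) -> tuple[float, float]:
    """Return (cost per request, monthly cost) for a token-priced API feature."""
    per_request = (prompt_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6
    return per_request, per_request * requests_per_day * days

# Email-draft feature: 500 prompt tokens, 200 output tokens, $3/M in, $15/M out
per_draft, monthly = feature_cost(500, 200, 3.0, 15.0, 10_000)
print(f"${per_draft:.4f} per draft, ${monthly:,.0f}/month")  # $0.0045 per draft, $1,350/month
```

Parameterizing the model this way lets you answer "what if volume 10x's?" or "what if we switch to a cheaper model?" in seconds, which is exactly the conversation Finance will want to have.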

The Questions PMs Should Ask Engineering

What's our P50/P99 latency for this feature? What's the user-facing impact?

What model are we using and what's the cost per request at our projected volume?

Are we using prompt caching? If not, what's the blocking issue?

What's our fallback if the primary inference provider has an outage?

If we 10x our user volume, what breaks first — and what's the cost?

Master AI Infrastructure & Cost Modeling

AI infrastructure and cost modeling are core to the AI PM Masterclass. You'll build real cost models and make architecture trade-off decisions. Book a free strategy call to learn more.