TECHNICAL DEEP DIVE

AI Model Deployment: What Product Managers Need to Know About Getting Models to Production

By Institute of AI PM · 14 min read · Apr 9, 2026

TL;DR

Shipping an AI model is not like shipping a traditional feature. Models have unique deployment challenges: latency non-determinism, version drift, cost scaling, cold start issues, and behavioral changes across model updates. This guide covers the full deployment stack — from API vs. self-hosting trade-offs to versioning strategy to production rollout patterns — so you can make informed decisions and ask the right questions of your engineering team.

The AI Deployment Stack: From Model to User

AI deployment involves more layers than traditional software. Understanding each layer helps you diagnose latency issues, cost overruns, and unexpected behavior.

1

Model weights

The trained neural network parameters. For API-based products, you never touch these. For self-hosted deployments, these live on your GPU infrastructure. Size ranges from 1GB (small models) to 700GB+ (large frontier models).

2

Inference server

Software that loads model weights onto hardware and serves prediction requests. Options: vLLM, TGI (Text Generation Inference), TensorRT-LLM, Ollama (local dev). Inference servers handle batching, KV caching, and quantization — all of which affect performance.
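
As a concrete sketch of this layer, here is roughly what loading open weights with vLLM looks like; the model name and sampling settings are illustrative placeholders, not recommendations:

```python
# Minimal vLLM sketch: load open weights onto local GPUs and serve a batch.
# The model name and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts internally (continuous batching + KV caching).
prompts = ["Summarize our Q3 roadmap.", "Draft a release note for feature X."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```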

3

API gateway / proxy layer

Handles authentication, rate limiting, routing between models, request logging, and retries. Tools like LiteLLM or custom middleware sit here. This is where you implement cost controls and model fallbacks.
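
What a model fallback in this layer can look like, sketched with the OpenAI Python client; the model names and the blanket exception handling are simplifying assumptions:

```python
# Sketch of a proxy-layer fallback: try the primary model, retry on the
# fallback if it fails. Model names and the broad except are simplifications.
from openai import OpenAI

client = OpenAI()

def complete_with_fallback(messages, primary="gpt-4o", fallback="gpt-4o-mini"):
    last_error = None
    for model in (primary, fallback):
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            return model, resp.choices[0].message.content
        except Exception as exc:  # production code should target rate-limit/timeout errors
            last_error = exc
    raise last_error
```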

4

Application layer

Your product code: prompt construction, context management, response parsing, and business logic. This is what you own regardless of whether you use APIs or self-host.

5

Observability stack

Logging, tracing, metrics, and alerting specifically for AI: token counts, latency distributions, error rates, and model output quality. Standard APM tools don't capture AI-specific signals without custom instrumentation.
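
A minimal instrumentation sketch, assuming the OpenAI Python client; the metric name and print-to-stdout sink are placeholders for your real logging pipeline:

```python
# Sketch of AI-specific instrumentation: emit token counts and latency per
# request. The metric name and stdout sink stand in for your real pipeline.
import json
import time
from openai import OpenAI

client = OpenAI()

def instrumented_completion(model, messages):
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages)
    print(json.dumps({
        "metric": "llm_request",
        "model": model,
        "latency_ms": round((time.monotonic() - start) * 1000),
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
    }))
    return resp
```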

API vs. Self-Hosted: The Core Trade-off

Managed API (OpenAI, Anthropic, Google)

ADVANTAGES

  • Zero infrastructure overhead — no GPUs to manage
  • Access to frontier models you couldn't self-host
  • Automatic scaling — no capacity planning required
  • Provider invests in model improvements over time

TRADE-OFFS

  • Data leaves your infrastructure (compliance risk)
  • Per-token pricing becomes expensive at very high volume
  • You can't control model version updates
  • Rate limits can throttle high-traffic features

Best for: Most products at most stages. The infrastructure savings outweigh the costs until you're processing 100M+ tokens/day.

Self-Hosted (vLLM, TGI on your own GPUs)

ADVANTAGES

  • Data stays in your infrastructure — critical for regulated industries
  • Fixed infrastructure cost regardless of volume
  • Full control over model version and update timing
  • Customize: quantization, fine-tuned models, custom sampling

TRADE-OFFS

  • Significant DevOps complexity and ongoing maintenance
  • GPU costs are high and require capacity planning
  • You maintain availability, scaling, and failover
  • Limited to open-weight models (Llama, Mistral, Qwen)

Best for: High-volume products (>100M tokens/day), regulated industries (healthcare, finance, legal), or products requiring custom fine-tuned models.
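
A back-of-envelope sketch of where that break-even sits. Every number below is an illustrative assumption; substitute your provider's real rates and your actual GPU pricing:

```python
# Back-of-envelope API vs. self-hosted comparison. Every number is an
# illustrative assumption; plug in real rates before drawing conclusions.
TOKENS_PER_DAY = 100_000_000            # the ~100M tokens/day threshold above
API_PRICE_PER_1M_TOKENS = 5.00          # assumed blended input/output rate, USD
GPU_COST_PER_DAY = 8 * 24 * 2.50        # assumed: 8 GPUs at $2.50/GPU-hour

api_cost = TOKENS_PER_DAY / 1_000_000 * API_PRICE_PER_1M_TOKENS
print(f"API:         ${api_cost:,.0f}/day")
print(f"Self-hosted: ${GPU_COST_PER_DAY:,.0f}/day, before DevOps headcount")
```

At these assumed rates the two options cost roughly the same per day at 100M tokens, which is why the self-hosted case only clears once volume (or compliance) justifies the added operational burden.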

Latency, Throughput & the Performance Triangle

AI inference latency has two distinct components that require different optimization strategies. Understanding the difference matters for product decisions about streaming, UX, and capacity.

Time to First Token (TTFT)

How long before the user sees the first word. Dominated by prompt processing (prefill) — longer contexts mean higher TTFT. Streaming improves perceived responsiveness but doesn't reduce TTFT itself. Target: under 500ms for interactive features.

Time to Last Token (TTLT) / Generation speed

Total time to complete the response. Dominated by output length and model size. Optimize via smaller models for high-output tasks, or accept the trade-off and stream to improve perceived performance.

Throughput vs. latency trade-off

Inference servers can batch multiple requests together to improve GPU utilization. More batching = lower cost per request but higher latency for individual requests. Find the right batch size for your latency SLA.

Streaming as a UX strategy

Streaming tokens as they're generated dramatically improves perceived performance. A response that takes 8s to complete feels much faster when the first tokens appear in 400ms. Implement streaming for any user-facing generation.
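
A sketch that streams a response and measures TTFT along the way, assuming an OpenAI-compatible streaming API; the model name and prompt are placeholders:

```python
# Sketch: stream a response and measure TTFT and total generation time.
# Assumes an OpenAI-compatible streaming API; model and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()
start = time.monotonic()
ttft = None

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.monotonic() - start  # first visible token
        print(chunk.choices[0].delta.content, end="", flush=True)

print(f"\nTTFT: {ttft:.2f}s, total: {time.monotonic() - start:.2f}s")
```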

Ship Real AI Products in the AI PM Masterclass

Deployment architecture, cost optimization, and production AI systems are covered in the masterclass — taught live by a Salesforce Sr. Director PM.

Model Versioning: The Problem Nobody Talks About

Model updates silently break AI products. Unlike a traditional API where v1.0 is stable forever, LLMs change behavior across versions — sometimes significantly. Model versioning is one of the most underestimated operational challenges in AI product management.

Silent behavior changes on auto-updated models

Providers often auto-update model versions behind the same API endpoint (e.g., 'gpt-4-turbo' may have changed 3 times). Your prompts may degrade without any version notification. Always pin to specific model version IDs in production.
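
In practice, pinning can be as simple as a checked-in mapping from feature to exact version ID; the feature names here are hypothetical:

```python
# Pin exact version IDs per feature instead of floating aliases.
# Feature names are hypothetical; the IDs show the pinned format.
MODEL_VERSIONS = {
    "support_summarizer": "gpt-4-turbo-2024-04-09",   # not "gpt-4-turbo"
    "ticket_classifier": "claude-3-5-sonnet-20240620",
}
```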

Prompts are brittle across model versions

A prompt tuned for gpt-4-0125-preview may produce worse results on gpt-4-turbo-2024-04-09. When upgrading model versions, always re-evaluate your full prompt library. Never assume prompt portability.

Deprecation timelines are shorter than expected

Model providers deprecate versions faster than traditional software. OpenAI's policy is typically 6 months' notice. Build model version upgrade paths into your roadmap — they're not optional maintenance.

Testing before version migration

Before migrating to a new model version in production: run your entire evaluation set, compare output quality, check for formatting regressions, test edge cases. Treat model upgrades like database migrations.
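
A sketch of what that gate can look like; the tiny eval set, the exact-match scorer, and the 2-point regression threshold are all illustrative assumptions:

```python
# Sketch of a pre-migration eval gate: score the current and candidate model
# versions on the same eval set before switching. The eval set, scorer, and
# threshold below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

EVAL_SET = [  # in practice: your full evaluation set, loaded from storage
    {"messages": [{"role": "user", "content": "Reply with exactly: OK"}],
     "expected": "OK"},
]

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0

def run_evals(model: str) -> float:
    scores = []
    for case in EVAL_SET:
        resp = client.chat.completions.create(model=model, messages=case["messages"])
        scores.append(exact_match(resp.choices[0].message.content, case["expected"]))
    return sum(scores) / len(scores)

current = run_evals("gpt-4-0125-preview")
candidate = run_evals("gpt-4-turbo-2024-04-09")
assert candidate >= current - 0.02, "candidate regresses -- block the migration"
```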

Rollout Patterns for AI Features

Traditional feature rollout patterns apply to AI, but with additional considerations unique to non-deterministic systems. These patterns reduce risk when deploying new models or significantly changed prompts.

1

Shadow mode

Run the new model in parallel with the current model on all production traffic. Log both outputs but serve only the current model's response to users. Compare quality offline before switching. Zero user risk, full production traffic coverage.
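
A sketch of shadow mode in application code; the model IDs, the thread-pool shadowing, and stdout logging are simplifying assumptions:

```python
# Sketch of shadow mode: serve the current model, run the candidate on the
# same input in the background, and log both for offline comparison. Model
# IDs, the thread pool, and stdout logging are simplifying assumptions.
import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
shadow_pool = ThreadPoolExecutor(max_workers=4)

def call(model, messages):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def handle_request(messages):
    served = call("gpt-4-0125-preview", messages)  # what the user sees

    def shadow():
        candidate = call("gpt-4-turbo-2024-04-09", messages)  # never shown
        print(json.dumps({"served": served, "shadow": candidate}))

    shadow_pool.submit(shadow)
    return served
```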

2

Canary deployment (traffic percentage)

Route 1–5% of traffic to the new model. Monitor quality metrics, latency, and error rates in real-time. Ramp up gradually if metrics hold. Gate on: error rate, output quality score, and user feedback signals.
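
A minimal sketch of the traffic split; the percentage and model IDs are placeholders, and real canaries typically live in the gateway layer rather than application code:

```python
# Sketch of a per-request canary split; percentage and IDs are placeholders.
import random

CANARY_PCT = 0.05  # start at 1-5%, ramp as metrics hold

def pick_model():
    return "new-model-id" if random.random() < CANARY_PCT else "current-model-id"
```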

3

A/B testing by user cohort

Split users into control (current model) and treatment (new model) groups. Measure downstream business metrics: task completion, user satisfaction, retention. Required for any model change that affects core user workflows.
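
A sketch of sticky cohort assignment; the salt and 50/50 split are illustrative:

```python
# Sketch of sticky cohort assignment: hashing the user ID keeps each user in
# the same group across sessions. The salt and 50/50 split are illustrative.
import hashlib

def assign_cohort(user_id: str, salt: str = "model-ab-2026-04") -> str:
    bucket = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"
```

Unlike the per-request canary split above, hashing the user ID keeps each user's experience consistent, which is what lets you attribute downstream metrics to the model change.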

4

Feature flag with instant rollback

Implement a hard kill switch that reverts 100% of traffic to the previous model within 60 seconds. AI systems can fail in ways that are hard to detect immediately — fast rollback is non-negotiable for production AI features.
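
A sketch of the kill-switch check, using an environment variable as a stand-in for a real feature-flag service:

```python
# Sketch of a kill switch checked on every request, so flipping it reverts
# traffic immediately. The env var stands in for a real feature-flag service.
import os

def select_model():
    if os.environ.get("AI_MODEL_ROLLBACK") == "1":  # flipped by on-call in seconds
        return "previous-pinned-model-id"
    return "new-pinned-model-id"
```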

Ship AI Features Confidently After the Masterclass

Deployment architecture, rollout strategy, and production AI systems are core curriculum. Stop guessing — learn the patterns that senior AI PMs use in production.