Model Versioning: Managing AI Models in Production
TL;DR
Code versioning is a solved problem — git tracks every change, diffs show exactly what changed, and rollbacks are instant. Model versioning is fundamentally harder. Models are large binary artifacts that can't be meaningfully diffed, their behavior changes non-deterministically even with the same weights, and degradation is often silent until users notice. This guide covers what you must version (it's not just the model weights), how to implement rollback strategies that actually work, how to safely roll out model updates using A/B testing and gradual deployment, and how to manage the full model lifecycle from experimentation through retirement.
Why Model Versioning Is Harder Than Code Versioning
Software engineering has decades of tooling for code versioning. AI model management borrows some of these concepts but introduces unique challenges that make it a fundamentally different discipline. Understanding these differences is the starting point for building a reliable model management practice.
Models can't be diffed
A code diff tells you exactly what changed: line 47 was modified, a new function was added, a dependency was updated. Model weights are billions of floating-point numbers. There's no meaningful way to 'diff' two model versions and understand what changed behaviorally. Two models with very similar weights can produce dramatically different outputs on edge cases. The only way to understand behavioral changes is through evaluation — running both versions against a test set and comparing outputs.
Trade-off: This means model versioning requires a robust evaluation pipeline, not just artifact storage. Without automated evaluation on every version change, you're deploying blind.
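Since weights can't be diffed, the practical "diff" is behavioral. Below is a minimal sketch of that comparison, assuming hypothetical generate and score functions supplied by your own evaluation stack (they are not part of any specific library):

```python
# Sketch: behavioral "diff" between two model versions.
# generate(version, input) and score(example, output) are assumed to be
# provided by your own inference and grading code.
def behavioral_diff(eval_set, generate, score, old_version, new_version):
    regressions, improvements = [], []
    for example in eval_set:
        old_ok = score(example, generate(old_version, example["input"]))
        new_ok = score(example, generate(new_version, example["input"]))
        if old_ok and not new_ok:
            regressions.append(example["id"])   # passed before, fails now
        elif new_ok and not old_ok:
            improvements.append(example["id"])  # failed before, passes now
    return {"regressions": regressions, "improvements": improvements}
```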
Models degrade silently
Code either works or throws an error. Models degrade gradually — accuracy drops by 2%, latency increases by 100ms, hallucination rate creeps up. These regressions don't trigger alerts unless you've explicitly built monitoring for them. A model that was excellent at launch can become mediocre over months as the data distribution shifts, and you won't know unless you're measuring continuously.
Trade-off: Model versioning must be coupled with continuous evaluation and monitoring. A version number without associated evaluation metrics is meaningless — you need to know not just which version is deployed, but how it's performing right now.
Non-deterministic behavior
The same model with the same weights can produce different outputs for the same input due to temperature settings, sampling strategies, and floating-point arithmetic variations across hardware. This means you can't rely on snapshot testing (run input X, expect output Y) the way you would with deterministic code. Your evaluation must be statistical: whether the distribution of outputs remains acceptable, not whether every individual output matches.
Trade-off: Evaluation sets need to be large enough for statistical significance. A 20-example test set isn't sufficient — you need hundreds of examples with multiple runs to detect regression with confidence.
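One way to make the comparison statistical is to run the full evaluation several times per version and compare the spread, sketched below with Python's standard library and a hypothetical run_eval(version) function that returns one aggregate score (e.g., pass rate) per pass over the evaluation set:

```python
import statistics

# Sketch: statistical comparison across repeated evaluation runs.
# run_eval(version) is assumed to return one aggregate score per run.
def compare_versions(run_eval, old_version, new_version, n_runs=5):
    old_scores = [run_eval(old_version) for _ in range(n_runs)]
    new_scores = [run_eval(new_version) for _ in range(n_runs)]
    old_mean = statistics.mean(old_scores)
    new_mean = statistics.mean(new_scores)
    spread = statistics.stdev(old_scores + new_scores)
    return {
        "old_mean": old_mean,
        "new_mean": new_mean,
        "delta": new_mean - old_mean,
        # Rough heuristic: is the gap larger than ~2 standard errors?
        # Use a proper significance test for real decisions.
        "likely_real": abs(new_mean - old_mean) > 2 * spread / (n_runs ** 0.5),
    }
```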
The blast radius is different
A code bug typically affects a specific feature or endpoint. A model regression affects every interaction that uses that model. If your LLM powers search, summarization, and chat simultaneously, a single model update can degrade all three features at once. The blast radius of a bad model version is often much larger than a bad code deployment.
Trade-off: This is why gradual rollout is even more important for models than for code. Deploying a new model version to 100% of traffic immediately is the AI equivalent of pushing to production without testing.
The 4 Things You Must Version
Most teams version only the model weights. This is insufficient. A model's behavior is determined by the weights, the data it was trained on, how it was evaluated, and the inference configuration that controls its runtime behavior. All four must be versioned together as an atomic unit.
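One way to keep the four components atomic is a single version manifest that references all of them. The sketch below is illustrative; the field names and storage layout are assumptions, not any particular registry's schema:

```python
from dataclasses import dataclass, asdict
import json

# Sketch: one manifest that versions all four components together.
@dataclass(frozen=True)
class ModelVersionManifest:
    version: str                # e.g. "search-ranker-v14"
    weights_ref: str            # registry URI or pinned API model identifier
    training_data_hash: str     # snapshot id from your data versioning tool
    eval_report_ref: str        # path/URI of the pre-deployment eval report
    inference_config_hash: str  # hash of the prompt + sampling config

def save_manifest(manifest: ModelVersionManifest, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(manifest), f, indent=2, sort_keys=True)
```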
Model weights and architecture
The model itself — weights, architecture definition, and any adapter layers (LoRA, QLoRA). For self-hosted models, store the full checkpoint in a model registry (MLflow, Weights & Biases, Amazon SageMaker Model Registry). For API-based models (OpenAI, Anthropic), version the specific model identifier (e.g., 'gpt-4-0613' vs 'gpt-4-turbo-2024-04-09'). Pin to specific model versions in production — never use 'latest' or auto-updating model aliases.
Trade-off: Model checkpoints are large (gigabytes to terabytes). Storage costs add up when keeping many versions. Implement a retention policy: keep the last N versions readily available, archive older versions to cold storage, and delete versions older than your retention window.
Training data snapshot
The exact dataset used to train or fine-tune each model version. This includes training data, validation data, and any data preprocessing or filtering applied. Store a hash or snapshot ID, not a copy of the full dataset (which may be terabytes). Use data versioning tools like DVC, LakeFS, or Delta Lake to track dataset versions. Without data versioning, you can't reproduce a model, diagnose regressions, or prove compliance.
Trade-off: Data versioning adds pipeline complexity and storage overhead. For teams using third-party model APIs (not fine-tuning), version the prompt template and few-shot examples instead — these are your equivalent of 'training data' and have the same impact on output quality.
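A minimal sketch of the hash-not-copy approach, using only the Python standard library; the flat directory-of-files layout is an assumption, and dedicated tools like DVC or LakeFS do this more robustly:

```python
import hashlib
from pathlib import Path

# Sketch: a stable content hash for a dataset directory, so the version
# manifest can reference the exact training snapshot without copying it.
def dataset_hash(data_dir):
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```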
Evaluation results
The performance metrics from running each model version against your evaluation set. Store accuracy, latency percentiles (p50, p95, p99), cost per request, and any domain-specific metrics (hallucination rate, format compliance, safety scores). Every model version should have an evaluation report that was generated before deployment, not after. These results are your decision record for why a model version was deployed or rejected.
Trade-off: Evaluation sets themselves evolve — new edge cases are discovered, requirements change. Version your evaluation sets alongside your models. A model that 'passed evaluation' on an outdated test set provides false confidence.
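A sketch of what that per-version evaluation record might look like; metric names are illustrative, scores is a list of per-example quality scores, and latencies_ms is a list of per-request latencies in milliseconds:

```python
import statistics

# Sketch: assemble the pre-deployment evaluation record for one version.
def build_eval_report(version, scores, latencies_ms, cost_per_request):
    pct = statistics.quantiles(latencies_ms, n=100)  # pct[i] is the (i+1)th percentile
    return {
        "version": version,
        "accuracy": statistics.mean(scores),
        "latency_p50_ms": pct[49],
        "latency_p95_ms": pct[94],
        "latency_p99_ms": pct[98],
        "cost_per_request": cost_per_request,
        "n_examples": len(scores),
    }
```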
Inference configuration
Temperature, top-p, max tokens, stop sequences, system prompt, tool definitions, and any post-processing logic. The same model weights with different temperature settings produce meaningfully different outputs. Version the full inference config as a JSON or YAML file alongside the model version. In many production systems, prompt changes are more frequent than model changes — and they have just as much impact on output quality.
Trade-off: Config changes are often treated casually ('just a prompt tweak') but can cause significant behavior shifts. Apply the same versioning and evaluation rigor to config changes as to model changes. A prompt change without re-evaluation is an untested deployment.
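One lightweight way to enforce that rigor is to hash the canonical config so any "just a prompt tweak" produces a new version identifier. A sketch, with illustrative field values:

```python
import hashlib
import json

# Sketch: treat the inference config as a versioned artifact. Hashing the
# canonical JSON means any change, however small, yields a new version id.
def config_version(config):
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config = {
    "model": "gpt-4-0613",          # pinned model identifier, never "latest"
    "temperature": 0.2,
    "top_p": 1.0,
    "max_tokens": 512,
    "system_prompt": "You are a concise support assistant.",
}
print(config_version(config))  # changes whenever any field changes
```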
Model Rollback Strategies and When to Use Them
Instant rollback (hot standby)
Keep the previous model version loaded and ready to serve traffic. Rollback is a traffic switch — redirect requests from the new version to the old one. This is the fastest rollback (seconds) but doubles your infrastructure cost since you're running two model instances simultaneously. Use this for critical, high-traffic features where downtime is unacceptable. Maintain hot standby for at least 48-72 hours after any model deployment.
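A sketch of what the traffic switch can look like in application code, assuming hypothetical endpoint objects with an infer method; real deployments usually perform the switch at the load balancer or gateway instead:

```python
# Sketch: hot-standby routing. Both versions stay loaded; rollback is a
# pointer swap, not a redeploy.
class HotStandbyRouter:
    def __init__(self, active_endpoint, standby_endpoint):
        self.active = active_endpoint
        self.standby = standby_endpoint

    def route(self, request):
        return self.active.infer(request)

    def rollback(self):
        # Instant rollback: swap traffic back to the previous version.
        self.active, self.standby = self.standby, self.active
```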
Warm rollback (cached artifacts)
Store previous model versions in a registry with pre-built deployment artifacts (container images, model files in fast storage). Rollback requires redeploying the previous version, which takes minutes to tens of minutes depending on model size and infrastructure. This is the standard approach for most production systems — reasonable rollback time without the cost of running duplicate instances indefinitely.
Config-only rollback
When the regression is caused by a prompt or inference config change rather than a model weight change, you can roll back just the config without redeploying the model. This is the fastest and cheapest rollback path. It requires that config and model versions are independently deployable — a design decision you should make early. Many production regressions are config issues, not model issues.
Partial rollback (feature-level)
If a model serves multiple features, you may only need to roll back the affected feature. Route feature A traffic to the new model version and feature B traffic to the old version. This requires feature-level routing logic but preserves improvements in unaffected features. Most valuable in multi-use model deployments where a regression only impacts specific use cases.
A/B Testing and Gradual Rollout for Model Updates
Shadow mode deployment
Run the new model version in parallel with the production version, but don't serve its outputs to users. Log both versions' outputs for the same inputs and compare them offline. This is the safest first step — you get real-world performance data without any user risk. Run shadow mode for at least 1-2 weeks to capture enough query diversity, including edge cases that don't appear in evaluation sets. Measure latency, cost, output quality, and failure rates side-by-side.
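A sketch of shadow-mode request handling, assuming hypothetical model objects with an infer method and a dict-like request; in practice the shadow call should run off the hot path (async or queued) so it cannot add user-facing latency:

```python
import logging

logger = logging.getLogger("shadow_eval")

# Sketch: shadow-mode serving. Only the production version's output reaches
# the user; the candidate's output is logged for offline comparison.
def handle_request(request, prod_model, shadow_model):
    prod_output = prod_model.infer(request)
    try:
        shadow_output = shadow_model.infer(request)  # ideally async, off the hot path
        logger.info({"request_id": request["id"],
                     "prod": prod_output, "shadow": shadow_output})
    except Exception as err:
        logger.warning("shadow inference failed: %s", err)
    return prod_output  # users only ever see the production output
```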
Percentage-based traffic splitting
Route a small percentage of traffic (1-5%) to the new model version and monitor key metrics. If metrics hold or improve after 24-48 hours, increase to 10%, then 25%, then 50%, then 100%. At each stage, compare quality metrics, latency, error rates, and user satisfaction signals between the two cohorts. This is the standard gradual rollout pattern and should be the default for any model update in a production system serving real users.
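A sketch of deterministic percentage bucketing, which keeps each user on the same version as the rollout percentage increases; the version names and hashing scheme are illustrative:

```python
import hashlib

# Sketch: deterministic percentage rollout. Hashing the user id assigns each
# user a stable bucket from 0-99; only rollout_percent changes between stages.
def pick_version(user_id, rollout_percent, old_version="v13", new_version="v14"):
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_version if bucket < rollout_percent else old_version

print(pick_version("user-42", rollout_percent=5))   # 5% stage
print(pick_version("user-42", rollout_percent=25))  # same user, 25% stage
```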
Segment-based rollout
Instead of random traffic splitting, roll out the new version to specific user segments first — internal users, beta users, lower-risk use cases, or specific geographies. This is particularly valuable when the new model version changes behavior in ways that might require user communication or documentation updates. It also lets you gather qualitative feedback from a known group before broad deployment.
Automatic rollback triggers
Define quantitative thresholds that trigger automatic rollback without human intervention. If error rate exceeds 2x baseline, if p95 latency exceeds 5 seconds, if hallucination rate exceeds its threshold — automatically shift traffic back to the previous version and alert the team. These guardrails are essential for overnight deployments and for teams operating across time zones. Set them conservatively — it's better to roll back unnecessarily than to let a bad model serve users for hours.
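A sketch of such a guardrail check; the threshold values and the rollback/alert hooks are assumptions to be wired into your own monitoring and deployment tooling:

```python
# Sketch: automatic rollback trigger based on quantitative guardrails.
THRESHOLDS = {
    "error_rate_multiplier": 2.0,   # vs. baseline error rate
    "p95_latency_seconds": 5.0,
    "hallucination_rate": 0.05,
}

def should_rollback(current, baseline):
    return (
        current["error_rate"] > baseline["error_rate"] * THRESHOLDS["error_rate_multiplier"]
        or current["p95_latency_seconds"] > THRESHOLDS["p95_latency_seconds"]
        or current["hallucination_rate"] > THRESHOLDS["hallucination_rate"]
    )

def check_and_rollback(current, baseline, rollback, alert):
    if should_rollback(current, baseline):
        rollback()   # shift traffic back to the previous version
        alert("automatic rollback triggered", current)
```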
Model Lifecycle Management for AI PMs
Experimentation phase
The team is evaluating multiple model candidates — different architectures, fine-tuning approaches, or third-party APIs. The PM's role is to define the evaluation criteria before experimentation begins, not after. What metrics must the model pass? What latency is acceptable? What cost ceiling exists? Without these criteria defined upfront, teams optimize for the wrong metric or endlessly iterate without a clear 'good enough' bar. Document these criteria as a model acceptance contract.
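A sketch of what a model acceptance contract can look like when written down as code rather than a slide; the fields and threshold values are illustrative, and the eval_report keys match the per-version evaluation record described earlier:

```python
from dataclasses import dataclass

# Sketch: acceptance criteria defined before experimentation begins.
@dataclass(frozen=True)
class AcceptanceContract:
    min_accuracy: float = 0.85
    max_p95_latency_ms: float = 1500.0
    max_cost_per_request: float = 0.02

    def accepts(self, eval_report):
        return (
            eval_report["accuracy"] >= self.min_accuracy
            and eval_report["latency_p95_ms"] <= self.max_p95_latency_ms
            and eval_report["cost_per_request"] <= self.max_cost_per_request
        )
```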
Staging and validation phase
A model candidate has passed offline evaluation and needs to be validated in a production-like environment. Deploy to a staging environment with production-like traffic (replayed or mirrored). Run the full evaluation suite. Test edge cases, adversarial inputs, and failure modes. Validate that the model integrates correctly with guardrails, post-processing, and downstream systems. This phase catches integration issues that lab evaluation misses — format mismatches, timeout handling, error propagation.
Production monitoring phase
The model is live and serving users. Continuous monitoring tracks quality metrics, latency, cost, and user feedback signals. Define a monitoring dashboard that the PM reviews weekly. Establish alert thresholds for automatic detection of degradation. Schedule periodic re-evaluation against the original benchmark to catch gradual drift. Most importantly, maintain a clear owner for the model in production — ambiguous ownership is the top cause of undetected model degradation.
Retirement and migration phase
Every model eventually needs to be replaced — the provider deprecates it, a better option becomes available, or business requirements change. Plan model retirement as a product migration: define the timeline, communication plan, and rollback path. Don't wait until the deprecation deadline. Start evaluating replacement models at least 3 months before the current model's end-of-life. Rushed model migrations are the single most common cause of AI product quality regressions.