AI Model Migration Template: How to Plan and Execute a Safe Model Upgrade
TL;DR
Model migrations are one of the highest-risk changes in an AI product. A model upgrade that improves average quality can simultaneously degrade quality for specific use cases, change output format in ways that break integrations, or alter behavior in ways users didn't expect. The teams that migrate models safely use a structured process: evaluate before deploying, roll out gradually, monitor closely, and have a tested rollback plan before they flip the switch. This template covers the full migration process.
The Model Migration Risk Profile
Model migrations are non-deterministic changes — you can't fully predict how the new model will behave on your production traffic until it's actually serving your production traffic. The goal of the migration process is not to eliminate this uncertainty but to bound it: test enough to know where the risks are, and build the infrastructure to catch problems quickly and roll back if needed.
Quality regression on specific use cases
A model that is better on average can be worse for specific prompts or domains. Your evaluation suite must cover all major use cases, not just the most common ones. A model that is 5% better on average but 30% worse on your power user use case is not an improvement.
Output format and structure changes
New models often produce differently formatted outputs — different punctuation patterns, different use of headers, different response length distributions. If downstream code parses model output, format changes can break integrations in ways that aren't caught by quality evaluation.
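As an illustrative guard against this, a structural check can be run over new-model outputs before migration. The sketch below assumes a downstream parser that expects a JSON object with certain keys; `check_format` and `REQUIRED_KEYS` are hypothetical stand-ins for whatever invariants your own integration actually depends on:

```python
import json

# Hypothetical invariants a downstream parser might depend on.
REQUIRED_KEYS = {"answer", "sources"}

def check_format(raw_output):
    """Return a list of format problems; an empty list means compatible."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return ["top-level JSON is not an object"]
    problems = []
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not isinstance(parsed.get("sources", []), list):
        problems.append("'sources' is not a list")
    return problems
```

Running this over a batch of new-model outputs catches integration-breaking drift that a quality rubric would score as fine.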
Behavior changes on edge cases
Safety behavior, refusal patterns, and response to adversarial prompts can change significantly between model versions. Your red team prompt library should be run against any new model before migration.
Latency and cost changes
New models often have different latency and cost profiles than the models they replace. Confirm that the new model meets your latency SLAs and that the cost impact is within budget. A model that is better but 2x more expensive requires a pricing and margin decision, not just a deployment decision.
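A back-of-envelope gate like the following makes the latency/cost check concrete. The SLA and budget numbers are placeholders, and `percentile` is a simple nearest-rank implementation:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

def passes_ops_gate(latencies_ms, cost_per_call,
                    latency_sla_ms=2000, budget_per_call=0.01):
    """True if p95 latency meets the SLA and per-call cost fits the budget.
    Threshold values here are illustrative, not recommendations."""
    return (percentile(latencies_ms, 95) <= latency_sla_ms
            and cost_per_call <= budget_per_call)
```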
The Model Migration Process Template
Phase 1: Evaluation (1–2 weeks)
Run your full evaluation suite (all 100–200 representative prompts) on the new model. Run your red team prompt library. Measure latency and cost. Document all quality differences — both improvements and regressions. Identify any output format changes that could break downstream integrations. Decision gate: is the new model a clear improvement on our primary dimensions? Is the cost/latency acceptable?
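The per-use-case comparison this decision gate needs can be sketched as follows. `score_deltas_by_use_case` is a hypothetical helper that surfaces segment-level regressions an overall average would hide:

```python
from collections import defaultdict
from statistics import mean

def score_deltas_by_use_case(results):
    """results: iterable of (use_case, old_score, new_score) rows from the
    evaluation suite. Returns {use_case: mean(new) - mean(old)}, so a
    regression in one segment stays visible even when the overall
    average improves."""
    old, new = defaultdict(list), defaultdict(list)
    for use_case, old_score, new_score in results:
        old[use_case].append(old_score)
        new[use_case].append(new_score)
    return {uc: mean(new[uc]) - mean(old[uc]) for uc in old}
```

A model that is "5% better on average" should only pass the gate if no major use case shows a meaningful negative delta.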
Phase 2: Staging environment testing (1 week)
Deploy the new model to staging. Replay a week of production traffic through the staging environment. Compare outputs side-by-side with production model outputs. Run integration tests to verify output format compatibility. Identify any unexpected behaviors not caught in Phase 1. Decision gate: are there any integration-breaking format changes? Any unexpected behaviors that require mitigation?
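One way to sketch the side-by-side replay is below. `prod_model` and `staging_model` are assumed to be any prompt-to-output callables, and the length-ratio flag is just a cheap proxy for the response-length distribution shifts worth a human review:

```python
def replay_and_flag(prompts, prod_model, staging_model, max_len_ratio=1.5):
    """Run each prompt through both models and flag pairs whose output
    lengths diverge sharply -- a cheap first filter for the pairs most
    worth a human side-by-side review."""
    flagged = []
    for prompt in prompts:
        old_out, new_out = prod_model(prompt), staging_model(prompt)
        lengths = sorted([len(old_out), len(new_out)])
        if lengths[1] / max(1, lengths[0]) > max_len_ratio:
            flagged.append({"prompt": prompt, "prod": old_out, "staging": new_out})
    return flagged
```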
Phase 3: Staged rollout (1–2 weeks)
Deploy to 5% of traffic. Monitor quality metrics, user feedback rates, and error rates for 48 hours. If metrics are stable or improving, expand to 20%, then 50%, then 100%. At each stage, compare metrics to baseline. Define rollback trigger criteria before rollout starts: 'if quality score drops >5% or negative feedback rate increases >20%, we roll back.'
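The staged-rollout decision above can be encoded so the rollback check is mechanical rather than a judgment call. The thresholds below mirror the example criteria (5% quality drop, 20% negative-feedback rise); the metric names and dict shapes are hypothetical:

```python
STAGES = [5, 20, 50, 100]  # traffic percentages from the rollout plan

def rollout_decision(stage_pct, baseline, current,
                     max_quality_drop=0.05, max_feedback_rise=0.20):
    """Return 'rollback', the next traffic percentage, or 'done' at 100%.
    baseline/current are dicts with 'quality' and 'neg_feedback' rates
    (both assumed nonzero in baseline)."""
    quality_drop = (baseline["quality"] - current["quality"]) / baseline["quality"]
    feedback_rise = ((current["neg_feedback"] - baseline["neg_feedback"])
                     / baseline["neg_feedback"])
    if quality_drop > max_quality_drop or feedback_rise > max_feedback_rise:
        return "rollback"
    next_idx = STAGES.index(stage_pct) + 1
    return STAGES[next_idx] if next_idx < len(STAGES) else "done"
```

Encoding the criteria this way forces the team to write down the exact thresholds before the rollout starts.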
Phase 4: Full deployment and monitoring (ongoing)
Deploy to 100% of traffic. Maintain old model infrastructure for a minimum of 2 weeks post-migration (this enables fast rollback if delayed issues emerge). Monitor quality metrics daily for the first 2 weeks. Update the prompt library and evaluation suite with any learnings from the migration. Publish a changelog entry communicating the model update to users.
Rollback Planning Template
Rollback trigger criteria (define before deployment)
Quality score drops more than X% vs. baseline. Negative feedback rate increases more than Y% vs. baseline. Error rate increases more than Z% vs. baseline. Any critical safety incident with a confirmed root cause in the new model. Quality on a specific high-priority use case drops more than W%.
Rollback execution procedure
Who is authorized to initiate rollback (typically the on-call engineer, the PM, or either acting alone). How rollback is executed (feature flag flip, config change, or deployment rollback; document the exact steps). How long rollback takes (target: under 15 minutes). How users are notified of the rollback if needed.
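As a minimal sketch of the feature-flag-flip variant, assuming model routing reads from a small JSON flag file (a hypothetical setup; real deployments typically use a feature-flag service, but the shape of the procedure is the same):

```python
import json
import pathlib
import time

FLAG_FILE = pathlib.Path("model_flag.json")  # hypothetical flag store

def active_model():
    """Which model the router would currently send traffic to."""
    return json.loads(FLAG_FILE.read_text())["model"]

def rollback(previous_model, actor):
    """Flip the routing flag back to the previous model, recording who
    flipped it and when so the incident report has a timeline."""
    FLAG_FILE.write_text(json.dumps({
        "model": previous_model,
        "changed_by": actor,
        "changed_at": time.time(),
    }))
```

Because the flip is a single config write rather than a redeploy, the sub-15-minute target is achievable, and the recorded actor/timestamp feeds directly into the post-rollback incident report.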
Post-rollback process
Incident report filed documenting: rollback trigger, timeline, metrics at time of rollback decision. Root cause investigation for why the pre-rollout evaluation didn't catch the issue. Decision on next steps: fix the issue and re-attempt migration, or stay on current model. No re-attempt without addressing the root cause of the rollback.
Get All AI PM Templates in the Masterclass
Model migration, evaluation frameworks, and the full AI PM toolkit are part of the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.
Model Migration Mistakes
Evaluating only on representative cases, missing edge cases
Evaluation suites naturally overrepresent common cases. A new model that performs excellently on common queries but fails badly on a rare but high-stakes use case (legal document analysis, medical dosage questions) will pass your evaluation with high scores. Build evaluation coverage for high-stakes edge cases even if they're rare in terms of volume.
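One hedge against this volume bias is to score the evaluation suite with stakes-based weights rather than traffic frequency. `stakes_weighted_score` below is an illustrative helper, not a standard metric:

```python
def stakes_weighted_score(results):
    """results: list of (score, weight) pairs where the weight reflects how
    costly a failure would be, not how often the case appears in traffic.
    A rare but high-stakes case with a large weight can pull the overall
    score down even when the plain average looks healthy."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight
```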
No rollback plan before deployment
Teams that don't define rollback trigger criteria before deployment end up in arguments during incidents: 'is this bad enough to roll back?' Define the criteria before you're in the middle of an incident. Objective trigger criteria remove the need for real-time judgment calls under pressure.
Communicating the migration as purely positive
Model migrations almost always involve tradeoffs — some things improve, some things change in ways users may not prefer. User communication that only describes improvements without acknowledging changes sets users up for 'it felt different' confusion. Acknowledge the changes alongside the improvements.
Decommissioning old model infrastructure immediately
Immediately decommissioning the old model infrastructure after migration removes your ability to roll back quickly if delayed issues emerge. Keep the old model infrastructure live for a minimum of 2 weeks post-full-deployment before decommissioning.
Model Migration Checklist
Pre-migration
Full evaluation suite run on new model. Red team prompt library run on new model. Latency and cost comparison completed. Output format compatibility verified. Rollback trigger criteria documented. Old model infrastructure confirmed available for rollback. User communication drafted.
During staged rollout
Quality metrics monitored at each traffic percentage stage. Rollback criteria reviewed at each stage before expanding. Rollback execution verified (can we actually roll back quickly?). PM and on-call engineer in sync during initial rollout hours.
Post-migration
Changelog entry published to users. Old model infrastructure maintained for 2 weeks. Quality monitoring elevated for first 2 weeks. Evaluation suite updated with any learnings. Migration retrospective completed.
Run AI Operations Like an Expert in the Masterclass
Model migrations, quality operations, and the complete AI PM toolkit — covered in the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.