TECHNICAL DEEP DIVE

Model Merging for AI Product Managers: SLERP, TIES, and Model Soups Explained

By Institute of AI PM·14 min read·Jun 10, 2026

TL;DR

Model merging combines the weights of two or more separately trained models into a single model — no additional training required, near-zero compute cost. The main techniques (SLERP, TIES, DARE, model soups) let open-weights product teams stack domain expertise from different fine-tunes, recover capability lost during fine-tuning, and experiment with capability combinations in hours instead of weeks. Model merging only works with open-weights models you host yourself. If you're exclusively on API providers like OpenAI or Anthropic, this isn't applicable yet — but as open-source models close the capability gap, it's worth understanding.

What Model Merging Actually Does

When you fine-tune a base model (say, Llama 3.1 70B) on a medical dataset, you're nudging billions of parameters, spread across hundreds of weight matrices, away from their original values. The resulting model knows more about medicine but may have lost some general capability in the process. Fine-tune a different copy of the same base on legal documents and you get a model that's good at legal reasoning but has similar tradeoffs.

Model merging asks: what if you could combine the medical fine-tune and the legal fine-tune into a single model that's good at both? The answer, experimentally, is often yes — and the mechanism is surprisingly simple: arithmetic on weight matrices.

Because both fine-tuned models started from the same base, their weights live in the same high-dimensional space. The delta from base to medical fine-tune represents "medical knowledge added." The delta from base to legal fine-tune represents "legal knowledge added." Adding those deltas together on top of the base model can — under the right conditions — produce a model that carries both skill sets.
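
The delta arithmetic described above can be sketched in a few lines. This is a toy illustration, not a production implementation: checkpoints are modeled as plain dictionaries of float lists, and the names task_vector, apply_task_vectors, and the "layer.w" parameter are made up for the example.

```python
# Toy "checkpoints": parameter name -> flat list of weights.
# Real checkpoints hold tensors; lists keep the sketch dependency-free.
Weights = dict[str, list[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Delta from the base model to a fine-tune, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, deltas: list[Weights], scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) sum of all deltas back onto the base."""
    merged = {}
    for name, base_vals in base.items():
        summed = [sum(d[name][i] for d in deltas) for i in range(len(base_vals))]
        merged[name] = [b + scale * s for b, s in zip(base_vals, summed)]
    return merged


base = {"layer.w": [1.0, 2.0, 3.0]}
medical = {"layer.w": [1.5, 2.0, 3.0]}  # hypothetical fine-tune: moved weight 0
legal = {"layer.w": [1.0, 2.0, 2.0]}    # hypothetical fine-tune: moved weight 2

merged = apply_task_vectors(
    base, [task_vector(medical, base), task_vector(legal, base)]
)
print(merged)  # {'layer.w': [1.5, 2.0, 2.0]}
```

In this toy case the two fine-tunes changed different weights, so the deltas add cleanly. When they change the same weights in opposite directions, simple addition starts to interfere, which is the problem the techniques below try to manage.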

Two hard constraints to check before going further:

Same base model required: You can only merge fine-tunes that share an identical base checkpoint. Merging a Llama 3.1 fine-tune with a Mistral fine-tune doesn't work — the weight spaces aren't aligned.
Weights must be accessible: Model merging requires access to actual model weights — the numerical parameters stored in checkpoint files. API-based models (GPT, Claude, Gemini) don't expose weights. Merging is exclusively a play for teams using open-weights models like Llama, Mistral, Qwen, or Gemma.
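
A quick pre-flight check for the first constraint might look like the sketch below. It compares parameter names and shapes only, so it is a necessary test rather than a sufficient one: two models can have identical layouts and still come from different base checkpoints. The function name and the parameter names in the example are hypothetical.

```python
Shapes = dict[str, tuple[int, ...]]


def merge_blockers(shapes_a: Shapes, shapes_b: Shapes) -> list[str]:
    """List layout differences that would prevent element-wise merging."""
    problems = []
    for name in sorted(shapes_a.keys() - shapes_b.keys()):
        problems.append(f"only in A: {name}")
    for name in sorted(shapes_b.keys() - shapes_a.keys()):
        problems.append(f"only in B: {name}")
    for name in sorted(shapes_a.keys() & shapes_b.keys()):
        if shapes_a[name] != shapes_b[name]:
            problems.append(
                f"shape mismatch: {name} {shapes_a[name]} vs {shapes_b[name]}"
            )
    return problems


# Hypothetical layouts for two checkpoints (name -> tensor shape).
model_a = {"embed.weight": (32000, 4096), "block0.q_proj.weight": (4096, 4096)}
model_b = {
    "embed.weight": (32768, 4096),
    "block0.q_proj.weight": (4096, 4096),
    "block0.extra.weight": (4096,),
}

for problem in merge_blockers(model_a, model_b):
    print(problem)
```

An empty list means the layouts line up; it does not prove the two models share a base, so provenance still has to be confirmed from how each model was trained.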

With those constraints understood, here's the landscape of techniques currently in use.

The Main Techniques: SLERP, TIES, DARE, and Model Soups

Each technique solves a different problem with weight combination. They're not mutually exclusive — some are combined in practice.

Linear Interpolation (LERP)

θ_merged = (1−α)×θ_A + α×θ_B

How it works: The simplest approach: take a weighted average of two models' weights. A single hyperparameter α controls the blend (0.5 = equal mix). Fast and easy to implement.

Limitation: Doesn't preserve the geometry of the weight space well. With α near 0 or 1 the result is essentially one of the parents and behaves like it; near the midpoint, the averaged weights can perform worse than either parent. Works best when the models are already very similar (e.g., checkpoints from the same training run).
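
As a minimal sketch of the formula above, assuming each model is a dictionary of flat weight lists (the name lerp_merge and the toy parameter "w" are illustrative only):

```python
def lerp_merge(theta_a, theta_b, alpha):
    """Element-wise (1 - alpha) * theta_a + alpha * theta_b for every parameter."""
    return {
        name: [(1 - alpha) * a + alpha * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 2.0]}

print(lerp_merge(model_a, model_b, alpha=0.25))  # {'w': [1.0, 2.0]}
```

Setting alpha to 0 returns model A unchanged and alpha to 1 returns model B, which is why the interesting (and risky) region is the middle of the range.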

SLERP (Spherical Linear Interpolation)

Interpolation along the great-circle arc between two unit vectors in weight space

How it works: Instead of interpolating weights linearly (which can shrink the magnitude of the resulting vectors), SLERP interpolates along the surface of a unit hypersphere. This preserves the 'direction' of each model's weights — a property that correlates better with capability retention.

Limitation: SLERP only works cleanly when merging exactly two models at a time. Multi-model merges require chaining pairwise SLERPs or switching to a different technique.
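
The difference from linear interpolation is easiest to see on two orthogonal unit vectors. The sketch below is one common formulation of SLERP applied to a single flattened weight vector; real tools typically apply it tensor by tensor, and the fallback to linear interpolation for near-parallel vectors is a design choice made here for numerical safety.

```python
import math


def lerp(v0, v1, t):
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def slerp(v0, v1, t, eps=1e-8):
    """Interpolate along the arc between v0 and v1 instead of the straight chord."""
    n0, n1 = norm(v0), norm(v1)
    if n0 < eps or n1 < eps:
        return lerp(v0, v1, t)  # no meaningful direction to rotate from
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)                # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(v0, v1, t)  # (anti)parallel vectors: the arc is undefined
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


a, b = [1.0, 0.0], [0.0, 1.0]
print(round(norm(lerp(a, b, 0.5)), 4))   # 0.7071  (magnitude shrinks)
print(round(norm(slerp(a, b, 0.5)), 4))  # 1.0     (magnitude preserved)
```

The formula takes exactly two endpoints, which is why merges of three or more models need chained pairwise SLERPs or a different method.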

TIES Merging (Trim, Elect Sign & Merge)

Three-step process: trim small deltas, elect the dominant sign, merge matching-sign weights

How it works: Addresses a real problem: most fine-tuned weights differ only slightly from the base model, but those tiny differences create noisy interference when averaging. TIES first trims each model's weight updates, keeping only the largest-magnitude fraction and zeroing the rest as noise. It then resolves sign conflicts parameter by parameter, electing the sign with the greater total magnitude across the source models (not a simple head count). Finally, it averages only the updates that agree with the elected sign. Results are meaningfully better than naive averaging when merging 3+ models.

Limitation: Requires tuning the trim threshold. The right value depends on the models and the task — there's no universal default.
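
The three steps can be sketched on a single flattened parameter vector as follows. This is a simplified illustration under stated assumptions: the names ties_merge, density (fraction of updates kept per model), and lam (scaling applied to the merged update) are chosen for the example, and the toy numbers are picked so the arithmetic is exact.

```python
def ties_merge(base, finetunes, density=0.5, lam=1.0):
    """Trim, elect sign, and merge deltas for one flattened parameter vector."""
    deltas = [[f - b for f, b in zip(ft, base)] for ft in finetunes]

    # 1. Trim: per model, keep only the top `density` fraction by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, round(density * len(d)))
        threshold = sorted((abs(x) for x in d), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in d])

    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in trimmed]
        # 2. Elect sign: the direction with the larger total magnitude wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 3. Merge: average only the surviving updates that agree with it.
        agreeing = [x for x in column if x * sign > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * mean)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
ft_a = [1.0, 0.125, -0.5, 0.0625]   # hypothetical fine-tune A
ft_b = [0.5, -1.0, 0.75, 0.0625]    # hypothetical fine-tune B

naive = [(a + b) / 2 for a, b in zip(ft_a, ft_b)]
print(naive)                          # [0.75, -0.4375, 0.125, 0.0625]
print(ties_merge(base, [ft_a, ft_b])) # [1.0, -1.0, 0.75, 0.0]
```

In the third position the two fine-tunes disagree in sign; naive averaging leaves a washed-out 0.125, while the sign election keeps the larger update at full strength.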

DARE (Drop And REscale)

Randomly zero out delta weights, then rescale the survivors to compensate

How it works: Similar to TIES but uses random dropout of delta weights rather than magnitude-based trimming. Each delta is dropped with probability p, and the survivors are scaled up by 1/(1−p) so the expected size of the update stays the same. Often combined with TIES in practice (DARE-TIES). The method comes from the paper 'Language Models are Super Mario,' which showed that fine-tunes can often be applied as sparse delta additions to the base.

Limitation: Performance varies by model pair. Requires eval to know whether DARE is giving you a better merge than TIES or SLERP for your specific use case.
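
A minimal sketch of the drop-and-rescale step, assuming a drop probability drop_rate and a rescale factor of 1 / (1 - drop_rate) on the survivors. The function name and the uniform toy delta are illustrative only.

```python
import random


def dare(delta, drop_rate, rng):
    """Randomly zero each delta with probability drop_rate; rescale the rest."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)  # keeps the expected update unchanged
    return [0.0 if rng.random() < drop_rate else x * scale for x in delta]


rng = random.Random(0)                # fixed seed so the run is repeatable
delta = [0.01] * 10_000               # toy delta: every weight moved by 0.01
sparse = dare(delta, drop_rate=0.9, rng=rng)

kept = sum(1 for x in sparse if x != 0.0)
mean_before = sum(delta) / len(delta)
mean_after = sum(sparse) / len(sparse)
print(f"kept {kept} of {len(delta)} deltas")
print(f"mean update before: {mean_before:.4f}, after: {mean_after:.4f}")
```

Roughly nine in ten deltas are removed, yet the average update stays close to its original value, which is the property that lets several sparse deltas be added to the same base with less overlap.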

Model Soups

Average multiple fine-tuned checkpoints from the same base, trained with different hyperparameters

How it works: Instead of merging different fine-tunes on different tasks, model soups average checkpoints from training runs that used different random seeds, learning rates, or data orders. Each 'ingredient' is a fine-tune of the same base on the same task. The averaged model is often better than the best individual ingredient on held-out data; the paper's 'greedy soup' variant adds a checkpoint only if it does not reduce held-out accuracy, so it should not fall below the best single ingredient on that validation set. Introduced in the 2022 paper 'Model Soups' by Wortsman et al.

Limitation: Only applies to multiple runs on the same task/dataset. You're not stacking different capabilities — you're getting more robust, better-calibrated performance on the same task.
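
The sketch below shows a uniform soup and a greedy soup on toy weight vectors. The evaluate function is a stand-in for held-out validation accuracy: here it simply scores closeness to a made-up target vector, which is an assumption for illustration and not how a real eval works.

```python
def average(checkpoints):
    """Uniform soup: element-wise mean of all checkpoints."""
    n = len(checkpoints)
    return [sum(values) / n for values in zip(*checkpoints)]


def greedy_soup(checkpoints, evaluate):
    """Add checkpoints best-first, keeping each only if the soup does not get worse."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(average(soup))
    for candidate in ranked[1:]:
        score = evaluate(average(soup + [candidate]))
        if score >= best:
            soup.append(candidate)
            best = score
    return average(soup)


TARGET = [1.0, 1.0]  # hypothetical "ideal" weights, for the toy score only


def evaluate(weights):
    """Toy stand-in for validation accuracy: higher is better."""
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


runs = [[1.5, 1.0], [0.5, 1.0], [3.0, 3.0]]  # third run is a bad outlier

print(average(runs))                # uniform soup is dragged toward the outlier
print(greedy_soup(runs, evaluate))  # [1.0, 1.0]
```

The greedy variant costs one extra evaluation per candidate, which is usually negligible next to the training runs that produced the checkpoints.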

Why This Matters for AI Product Teams

The product implication isn't "you should merge models." It's "you have a new option on the model customization menu — one that's dramatically cheaper and faster than the alternatives."

Stack domain expertise without extra training

Your customer success team fine-tuned a model on your support ticket history. Your engineering team fine-tuned one on your internal API docs. Merging them produces a model that handles both support and technical questions — without paying for a third fine-tuning run.

Recover from catastrophic forgetting

Fine-tuning on a narrow domain often degrades general capability. A model becomes great at contract review but worse at following open-ended instructions. Merging the domain fine-tune with the base model (at a controlled ratio) restores some general capability while preserving most domain gain.

Rapidly prototype capability combinations

Instead of commissioning a new fine-tuning run (days of GPU time, thousands of dollars), an ML engineer can try a merge in under an hour on a single GPU. Iteration speed on capability exploration goes from days to hours.

Get ensemble benefits at single-model cost

Model soups often outperform the best single fine-tuned checkpoint on evals. If you're already doing fine-tuning with multiple hyperparameter configurations, souping the best checkpoints costs almost nothing and can meaningfully improve production quality.

The practical upshot: if your ML team runs fine-tuning experiments anyway, model merging should be part of their toolkit by default. The tooling is mature (mergekit on GitHub has 10K+ stars and supports all the techniques above), and the worst outcome is a merge that doesn't improve on either parent — which costs you an hour of engineering time, not a week of GPU budget.

When Model Merging Beats Fine-Tuning (and When It Doesn't)

Model merging is not a replacement for fine-tuning. It's a complement. The decision tree is relatively clear once you know what each approach can and can't do.

Scenario: You have two existing fine-tunes and want to combine their capabilities

Try merging first: Start with SLERP or TIES. Run your eval suite. If the merged model meets your quality bar, you saved a fine-tuning run.

Scenario: You need to adapt a base model to a new domain with no existing fine-tune

Fine-tune (or LoRA/QLoRA) first: Merging can't inject knowledge that isn't already in one of the source models. You need at least one domain fine-tune before merging adds value.

Scenario: Your fine-tuned model is accurate on-domain but has degraded instruction following

Merge the fine-tune with a strong instruction-following model built on the same base: This is one of the most common productive uses of merging in production, recovering general capability that fine-tuning eroded.

Scenario: You've run 5 fine-tuning experiments with different LR schedules

Model soup them: Averaging those checkpoints often outperforms the best single checkpoint. Low cost, and frequently a meaningful quality gain. Confirm on held-out data before shipping.

Scenario: You want very precise behavior on a specific task

Fine-tune, don't merge: Merging dilutes specialization. If you need tight control over outputs for a narrow task, a dedicated fine-tune will outperform a merged model.

Practical Use Cases and How Teams Implement This

The community around model merging has produced a rich collection of merged models on Hugging Face, many of which are released as finished products. Here's how production teams approach it:

Multilingual + domain specialist

Take a model strong at multilingual tasks (e.g., Qwen 2.5) and merge it with a domain fine-tune of that same base trained on English-only technical content. The merged model can handle both the additional languages and the specialized domain, often better than either model alone on multilingual domain queries.

Code + instruction following

Fine-tune on code; merge with a strong instruction-following model. Result: a model that follows complex, multi-part prompts AND writes high-quality code. Much harder to get both from a single fine-tuning run without massive dataset curation.

Safety + capability

Capability-focused fine-tunes often degrade safety alignment. Merging with a safety-tuned model (or the RLHF checkpoint) at a modest weight recovers refusal behavior on clearly harmful prompts without killing capability. A blunt but often effective fix.

Customer persona adaptation

One enterprise customer needs formal tone; another needs casual. Instead of two deployment endpoints running separate models, merge each customer's tone fine-tune with the base and serve from a single infrastructure layer with per-customer checkpoints.

The main tooling ecosystem: mergekit (open source, supports SLERP, TIES, DARE, linear interpolation, and more; YAML config-driven) is the de facto standard. It reads and writes checkpoints in the standard Hugging Face format, so a merged model loads with the transformers library like any other checkpoint.

A typical merge workflow: define the source models and technique in a YAML config file, run mergekit's command-line tool (mergekit-yaml) on it, load the output into vLLM or Ollama, and run your eval suite. If the quality bar is met, ship. If not, adjust the merge weights or try a different technique. Total time: 30 minutes to 2 hours depending on model size.
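
To make the config step concrete, the sketch below writes out a small TIES config. The field names (merge_method, base_model, models, parameters, dtype) follow mergekit's YAML schema as I understand it from its README, and the command-line entry point is mergekit-yaml in the versions I am aware of; both may differ in the release you install, so treat this as an assumption to check. All model paths are hypothetical placeholders.

```python
import tempfile
from pathlib import Path

# Hypothetical local paths; replace with your own base model and fine-tunes.
CONFIG = """\
merge_method: ties
base_model: ./models/base
models:
  - model: ./models/support-finetune
    parameters:
      weight: 0.5
      density: 0.5
  - model: ./models/api-docs-finetune
    parameters:
      weight: 0.5
      density: 0.5
dtype: bfloat16
"""

config_path = Path(tempfile.mkdtemp()) / "merge.yml"
config_path.write_text(CONFIG)
print(f"Wrote merge config to {config_path}")

# Next step, from a shell with mergekit installed (not run by this script):
#   mergekit-yaml <path to merge.yml> ./merged-model
```

Keeping the config file in version control alongside the eval results gives you a reproducible record of exactly what was merged.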

The Limits of Model Merging: What It Can't Do

Model merging generates real value, but its failure modes are just as worth understanding as its capabilities.

Unpredictable quality on downstream tasks

Merging two capable models doesn't guarantee a capable merged model. The weight combination may create interference that degrades both capabilities. You won't know until you eval. Treat merging as hypothesis generation, not guaranteed improvement.

Can't add knowledge that isn't already there

Merging redistributes what already exists in the source models' weights. It cannot inject new factual knowledge, new reasoning patterns, or new skills that neither source model has. For new capability acquisition, you still need training data and compute.

Same-architecture requirement is a real constraint

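You cannot merge a Llama 3.1 checkpoint with a Mistral Nemo checkpoint: the architectures and weight layouts differ. Matching architectures alone is not enough either. Two models with identical layouts generally merge poorly unless they descend from the same base checkpoint, because independently trained weights are not aligned with each other. The community is working on cross-architecture merging, but it's experimental. In practice, you need to commit to a base model family for your fine-tune ecosystem before merging becomes practical.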

No version semantics

A merged model is a new artifact with no automatic provenance tracking. If you're shipping to production, your model governance pipeline needs to capture: which source models were merged, which technique, which ratios, and the eval results. Without this, debugging regressions becomes very hard.
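
One lightweight way to capture that provenance is a small record saved next to the merged checkpoint. The sketch below is only a suggestion: the MergeRecord class, its field names, the model identifiers, and the eval numbers are all hypothetical placeholders, not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MergeRecord:
    technique: str                       # e.g. "ties", "slerp", "soup"
    base_model: str                      # identifier of the shared base
    sources: dict[str, float]            # source model id -> merge weight
    parameters: dict[str, float] = field(default_factory=dict)
    eval_results: dict[str, float] = field(default_factory=dict)


record = MergeRecord(
    technique="ties",
    base_model="example-org/base-model",            # placeholder id
    sources={
        "example-org/support-finetune": 0.5,        # placeholder id
        "example-org/api-docs-finetune": 0.5,       # placeholder id
    },
    parameters={"density": 0.5},
    eval_results={"support_eval": 0.0, "api_docs_eval": 0.0},  # fill in after evals
)

serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
print(serialized)
```

Storing this JSON with the checkpoint means a later regression can be traced back to specific source models and settings.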

Not applicable to commercial API models

If your product runs on GPT, Claude, or Gemini, model merging is irrelevant until those providers offer weight access. Follow the open-weights model capability gap — as Llama, Qwen, and Gemma close in on closed-source quality, the merging option becomes more viable for more teams.

The product manager's job here is setting realistic expectations with the ML team. Model merging is a fast, cheap experiment — not a production shortcut that bypasses evals. Merge, measure, decide. Any merge going into production should pass the same regression eval suite as any other model change.
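
The "merge, measure, decide" rule can be expressed as a simple gate, sketched below. The metric names, scores, and the max_drop tolerance are invented for illustration; a real gate would use your own eval suite and thresholds.

```python
def regression_gate(baseline, candidate, max_drop):
    """Return (passed, failures) comparing a merged model against the current one.

    A metric fails if it is missing from the candidate or drops by more
    than max_drop relative to the baseline.
    """
    failures = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is None or cand_score < base_score - max_drop:
            failures[metric] = (base_score, cand_score)
    return (not failures, failures)


# Invented scores for illustration only.
current_model = {"instruction_following": 0.75, "domain_qa": 0.5}
merged_model = {"instruction_following": 0.625, "domain_qa": 0.75}

passed, failures = regression_gate(current_model, merged_model, max_drop=0.0625)
print(passed)    # False
print(failures)  # {'instruction_following': (0.75, 0.625)}
```

Here the merge improves the domain metric but regresses instruction following beyond the tolerance, so it would not ship without another iteration.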
