LoRA and PEFT Explained for AI Product Managers
TL;DR
Fully fine-tuning a 70B model requires 8+ A100s and weeks of compute. LoRA (Low-Rank Adaptation) cuts that to a single GPU and hours by freezing the model and training small adapter matrices that add up to less than 1% of its parameter count, and it typically gets you 90–95% of the quality. As a product manager, understanding LoRA and the broader PEFT (Parameter-Efficient Fine-Tuning) family helps you make the build vs. prompt vs. fine-tune decision with real numbers, not intuition. This guide covers how LoRA works, when it's worth the investment, and how it compares to the alternatives.
Why Fine-Tuning Exists and When PMs Should Care
A base language model knows a lot about the world but nothing about your product, your users' vocabulary, or your quality bar. Prompting is the first tool for customization — you describe the task in the system prompt and give examples. But prompting has limits: you're borrowing the model's general behavior, not shaping it.
Fine-tuning updates the model's weights directly on examples of correct behavior for your specific task. The result is a model that has internalized your domain — it follows your format reliably, uses your terminology, and doesn't require lengthy few-shot examples in every prompt. The catch has always been cost: full fine-tuning requires gradient updates across all parameters, which is prohibitively expensive for large models.
When prompting is enough
The task is general-purpose and the model already understands it. Style adjustments, summarization, Q&A over documents. A well-crafted system prompt with 2-3 examples handles 80% of use cases. Don't fine-tune what prompting can solve.
When fine-tuning adds value
You need consistent format, specialized vocabulary, or behaviors the model doesn't perform reliably with prompting alone. Legal clause classification, medical coding, customer support in your brand voice. You have 500+ labeled examples to work with.
When fine-tuning is essential
Latency constraints prevent large prompts. Cost pressure makes long system prompts unsustainable at scale. You need to distill a larger model's behavior into a smaller, faster, cheaper one for production deployment.
How LoRA Works: The Core Idea Without the Math
When a model learns a new task, it doesn't need to change all of its billions of weights. The research finding behind LoRA is that the weight updates needed are low-rank — they can be expressed as the product of two much smaller matrices. LoRA exploits this by freezing the original model weights and injecting pairs of small trainable matrices at each attention layer. Only these small matrices are updated during training.
A weight matrix in a transformer might be 4096 × 4096 (approximately 16.8M parameters). Instead of updating it directly, LoRA trains a matrix A of shape 4096 × r and a matrix B of shape r × 4096, where r (the "rank") is typically 4–64. With r=16, that's about 131K parameters instead of 16.8M, a 128× reduction. At inference, the product of A and B is simply added to the frozen base weights, with zero architectural change.
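The parameter arithmetic in this paragraph can be checked in a few lines. This is a minimal sketch; the function name lora_param_count is illustrative and not taken from any library.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: A is d_in x r, B is r x d_out."""
    return d_in * r + r * d_out


d = 4096   # width of the square weight matrix from the example
rank = 16  # the LoRA rank r

full = d * d                           # parameters touched by a direct update
lora = lora_param_count(d, d, rank)    # parameters trained by LoRA instead

print(f"full matrix : {full:,} parameters")
print(f"LoRA pair   : {lora:,} parameters")
print(f"reduction   : {full / lora:.0f}x")
```

Changing rank in the snippet shows how quickly the adapter grows or shrinks: the count scales linearly with r.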
Rank (r) is the key knob
Lower rank means fewer parameters, faster training, and smaller adapter files — but less expressive power. Higher rank captures more nuance but costs more. Most tasks work well at r=8 to r=16. Start low and only increase if quality is insufficient.
Alpha scales the contribution
The LoRA alpha hyperparameter scales how much the adapter update influences the output relative to the frozen weights. In the standard formulation the update is multiplied by alpha / r. Common practice: set alpha = 2r, which holds that multiplier at 2 and keeps behavior predictable as you adjust rank during experimentation.
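To see why tying alpha to the rank helps, here is a small sketch that assumes the standard scaling factor of alpha / r (some LoRA variants scale differently). The helper name lora_scale is hypothetical.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the adapter update, assuming alpha / r scaling."""
    return alpha / r


for r in (4, 8, 16, 64):
    fixed_alpha = lora_scale(alpha=16, r=r)     # alpha held constant
    tied_alpha = lora_scale(alpha=2 * r, r=r)   # alpha = 2r
    print(f"r={r:>2}  alpha=16 -> scale {fixed_alpha:.2f}   alpha=2r -> scale {tied_alpha:.2f}")
```

With a fixed alpha, the adapter's influence shrinks as the rank grows, so a rank experiment also silently changes the update strength. With alpha = 2r, only the rank changes.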
Which layers need LoRA
Applying LoRA to the query and value projections in attention layers captures most of the benefit. Adding LoRA to all projection matrices and feedforward layers helps on complex tasks but increases adapter size significantly.
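A rough size comparison makes the trade-off concrete. The sketch below assumes a hypothetical 7B-class model (hidden size 4096, feed-forward width 11008, 32 layers, square attention projections) and Llama-style layer names; none of these figures come from the article, and real models differ.

```python
HIDDEN = 4096   # assumed hidden size
FFN = 11008     # assumed feed-forward width
LAYERS = 32     # assumed number of transformer layers
RANK = 16

# Assumed (input_dim, output_dim) of each projection inside one layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def adapter_params(targets, r=RANK):
    """Total LoRA parameters when adapters are attached to the given projections."""
    per_layer = sum(r * (SHAPES[name][0] + SHAPES[name][1]) for name in targets)
    return per_layer * LAYERS


qv_only = adapter_params(["q_proj", "v_proj"])
all_linear = adapter_params(list(SHAPES))

for label, count in (("query + value only", qv_only), ("all projections + FFN", all_linear)):
    size_mb = count * 2 / 1e6   # 2 bytes per parameter at 16-bit precision
    print(f"{label:<22} {count:>12,} params  ~{size_mb:.0f} MB")
print(f"size ratio: {all_linear / qv_only:.1f}x")
```

Under these assumptions, covering every linear layer makes the adapter several times larger than targeting query and value alone, which matters most when you store or swap many adapters.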
Adapters merge at inference
LoRA weights can be merged into the base model before deployment, adding zero latency overhead. Alternatively, you can keep the adapters separate, serve several of them on a single base model, and swap them per request: one shared base, many specialized behaviors.
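The claim that merging changes nothing about the output can be demonstrated with tiny matrices. This is a toy sketch in plain Python, assuming the alpha / r scaling convention; the sizes and helper names are made up for illustration and bear no relation to a real model.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add_scaled(X, Y, scale):
    """Return X + scale * Y, element by element."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r, alpha = 6, 2, 4            # toy sizes, not realistic
W0 = rand_matrix(d, d, rng)      # frozen base weights
A = rand_matrix(d, r, rng)       # adapter matrix, d x r
B = rand_matrix(r, d, rng)       # adapter matrix, r x d
x = rand_matrix(1, d, rng)       # one input, as a row vector
scale = alpha / r                # assumed scaling convention

# Unmerged serving: base path plus a separate low-rank adapter path.
unmerged = add_scaled(matmul(x, W0), matmul(matmul(x, A), B), scale)

# Merged serving: fold the adapter into the weights once, then one matmul per call.
W_merged = add_scaled(W0, matmul(A, B), scale)
merged = matmul(x, W_merged)

max_diff = max(abs(u - m) for u, m in zip(unmerged[0], merged[0]))
print(f"largest difference between the two outputs: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding. The merged version does the same amount of work as the original model, which is where the zero-overhead claim comes from; the unmerged version pays a small extra cost per call in exchange for swappable adapters.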
The PEFT Family: LoRA, QLoRA, and What Else Exists
LoRA is the most widely used PEFT (Parameter-Efficient Fine-Tuning) method, but the family is broader. Understanding the landscape helps you have informed conversations with your ML team about what's being used and why — and helps you spot when someone's recommending a heavier approach than the task requires.
LoRA — Low-Rank Adaptation
How it works: Injects trainable low-rank matrices at attention layers. Original weights are frozen. Only adapter weights are updated during training.
When to use: The go-to for most use cases. Typically recovers 90-95% of full fine-tuning quality at 1-5% of the compute cost. Works on all major open-weight models (Llama 3, Mistral, Qwen, Gemma).
QLoRA — Quantized LoRA
How it works: The base model is quantized to 4-bit precision before LoRA adapters are added. Dramatically reduces GPU memory — a 70B model fits on a single 48GB A6000.
When to use: When you need to fine-tune a large model on limited hardware. Slight quality trade-off vs. full-precision LoRA. The standard method for fine-tuning 30B+ models on commodity cloud GPUs.
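A back-of-the-envelope calculation shows why 4-bit quantization changes the hardware picture. The sketch counts weight storage only; it deliberately ignores activations, adapter weights, optimizer state, and quantization overhead, so real memory use is higher.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


for bits in (16, 8, 4):
    print(f"70B model at {bits:>2}-bit: {weight_memory_gb(70, bits):.0f} GB of weights")
```

At 16-bit the weights alone exceed any single GPU, while at 4-bit they leave some headroom on a 48GB card for the adapters and training state.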
Adapter Layers
How it works: Small neural network modules inserted between transformer layers. Only adapter modules are trained; the rest of the model is frozen.
When to use: Older technique, largely superseded by LoRA. You may encounter it in legacy codebases or papers from 2021-2023. LoRA is preferred for new work.
Prefix Tuning / Prompt Tuning
How it works: Trainable virtual tokens prepended to the input at each layer (prefix) or just the first layer (prompt tuning). The model learns to attend to these tokens to shift behavior.
When to use: Very small parameter counts. Useful when you need the base model entirely frozen. Doesn't give the same depth of behavioral control as LoRA for complex tasks.
Real Costs and Timelines: What Fine-Tuning Actually Takes
The most common mistake is treating fine-tuning as a purely technical decision. It's a product investment with real cost, timeline, and maintenance implications. Here's what you need to plan for in 2026.
Data preparation
500-5,000 labeled examples for most tasks. Budget 2-6 weeks for collection, annotation, and quality review. This is consistently the longest and most expensive part — not the GPU time itself. Bad data produces fine-tuned models that confidently do the wrong thing.
Compute cost
A Llama 3.1 70B QLoRA run on 1,000 examples costs roughly $10-50 on cloud GPU providers in 2026. Full-precision LoRA on an 8B model is under $5. Budget $100-500 for experimentation across different rank settings and learning rates.
Evaluation infrastructure
You need evals before you start — without a quality baseline on the base model, you can't tell if fine-tuning improved things. Budget 1-2 weeks to build an eval set. This work pays off repeatedly beyond the first fine-tuning run.
Ongoing maintenance
Models need re-fine-tuning when the base model updates or when your data distribution shifts. Plan for quarterly re-runs at minimum. Unplanned re-runs after quality regressions are the most expensive form of fine-tuning cost.
Serving cost delta
LoRA adapters merged into the base model add zero latency overhead. The ROI calculation often comes from prompt cost savings — removing a 2,000-token system prompt at 10M daily calls saves meaningful inference dollars. Run the math.
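Running the math for the prompt-savings case takes only a few lines. The token price below is a placeholder assumption, not a quote from any provider; substitute your own contract rate.

```python
def daily_prompt_savings(tokens_removed: int, calls_per_day: int,
                         usd_per_million_input_tokens: float) -> float:
    """Dollars saved per day by no longer sending tokens_removed on every call."""
    tokens_per_day = tokens_removed * calls_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_input_tokens


ASSUMED_PRICE = 0.50   # hypothetical USD per 1M input tokens

daily = daily_prompt_savings(2_000, 10_000_000, ASSUMED_PRICE)
print(f"input tokens removed per day: {2_000 * 10_000_000:,}")
print(f"daily savings  : ${daily:,.0f}")
print(f"annual savings : ${daily * 365:,.0f}")
```

Compare the annual figure against the data preparation, compute, evaluation, and maintenance costs listed above to see whether the fine-tune pays for itself.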
The Decision Framework: LoRA vs. Full Fine-Tuning vs. Prompting
These three approaches are not mutually exclusive — most mature AI products use all of them, applied to different tasks or product stages. The question is which to reach for first and when to escalate.
Start with prompting when...
You're still validating the use case. Data volume is under 100 examples. Your quality bar can be met by the base model with guidance. Iteration speed matters more than inference cost. Never fine-tune before you've optimized your prompt.
Move to LoRA when...
Prompting is inconsistent and you have 500+ high-quality examples. Inference cost is a growing constraint from long prompts at scale. You need reliable format adherence or consistent domain vocabulary across all outputs.
Use full fine-tuning when...
You need maximum quality and have the compute budget. You're distilling a larger model's behavior into a smaller, faster one for production. You need to modify the model's base knowledge, not just steer its task behavior.
Use QLoRA when...
You want LoRA quality but need to run on a single consumer or cloud GPU. You're working with 30B+ models. The slight quality trade-off vs. full-precision LoRA is acceptable; for most tasks it is hard to notice.
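The four cases above can be sketched as a simple rule-of-thumb function. This is a deliberate simplification: the thresholds (100 and 500 examples, 30B parameters) come from the text, but the function name, the parameter names, and the order in which the rules are checked are assumptions made for illustration.

```python
def recommend_approach(
    labeled_examples: int,
    prompting_meets_quality_bar: bool,
    prompt_cost_is_a_constraint: bool,
    needs_new_base_knowledge: bool,
    has_multi_gpu_budget: bool,
    model_size_billions: float,
) -> str:
    """Rule-of-thumb recommendation; a conversation starter, not a verdict."""
    if labeled_examples < 100:
        return "prompting"
    if prompting_meets_quality_bar and not prompt_cost_is_a_constraint:
        return "prompting"
    if labeled_examples < 500:
        return "prompting (collect more examples first)"
    if needs_new_base_knowledge and has_multi_gpu_budget:
        return "full fine-tuning"
    if model_size_billions >= 30 and not has_multi_gpu_budget:
        return "QLoRA"
    return "LoRA"


# Example: inconsistent prompting, 2,000 examples, costly prompts, an 8B model.
print(recommend_approach(
    labeled_examples=2_000,
    prompting_meets_quality_bar=False,
    prompt_cost_is_a_constraint=True,
    needs_new_base_knowledge=False,
    has_multi_gpu_budget=False,
    model_size_billions=8,
))
```

Writing the rules down this way is mostly useful for exposing disagreements: if your ML team would order the checks differently, that is the discussion worth having before any GPU time is booked.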
The PM heuristic
Before investing in fine-tuning, run a structured prompt optimization experiment. Spend two days getting your best possible result with few-shot examples and chain-of-thought prompting. If that doesn't meet the quality bar, you now have a baseline, a labeled dataset, and a quality definition, and fine-tuning becomes a lower-risk investment because you know exactly what you're trying to beat.