AI Make-or-Buy: Foundation Models, APIs, or Custom Models?
TL;DR
"Make or buy" in AI is no longer binary. You have four real options: hosted frontier APIs, hosted smaller models, self-hosted open models, and fine-tuned or custom-trained models. Each has a distinct profile across cost, latency, control, and switching cost. This guide gives you a decision framework so you stop choosing based on hype and start choosing based on the actual product requirement.
The Four Options Most Teams Don't Distinguish Clearly
Option A: Frontier hosted API
GPT-5, Claude Opus, Gemini Ultra via API. Best quality, highest per-token cost, vendor lock-in risk, fastest path to market.
Option B: Smaller hosted models
GPT-4o-mini, Claude Haiku, Gemini Flash. 10-30x cheaper than frontier. Often good enough for narrow tasks. Same vendor risk.
Option C: Self-hosted open models
Llama 4, Mistral, Qwen on your own GPUs or via Together/Replicate. Full control, lower per-token cost at scale, real ops complexity.
Option D: Fine-tuned or custom-trained
Fine-tune an open base or train from scratch. Highest specialization, highest investment, only justifiable when prior options demonstrably fail.
When Frontier APIs Are the Right Call
Frontier APIs are the right starting point for almost everyone. They give you the highest quality and the fastest experimentation cycle, and they let you focus engineering on the layer above the model where most differentiation actually lives.
Pre-product-market fit
You don't know what you're building yet. Pay the premium, learn fast. Optimize later.
Low to moderate volume
If you're burning under $20K/month on inference, the optimization work isn't worth the engineering time.
Quality is the product
When you're selling intelligence (legal research, complex code generation), quality compounds. The cheapest model that can't do the job is the most expensive one.
Your team is small
Self-hosting requires real ops investment. If you have 3 engineers, don't spend 1 of them on inference infra.
When Smaller Hosted Models Win
The smaller-model tier is the most underused option. Most production AI workloads — classification, extraction, routing, summarization — don't need frontier intelligence. They need to be consistent, fast, cheap, and accurate enough.
High-volume narrow tasks
Tag classification, intent detection, entity extraction. Smaller models hit 95%+ of frontier quality at 1/20th the cost.
Latency-sensitive UX
Sub-second response targets. Smaller models can be 3-5x faster end-to-end. Streaming UX feels instant.
Multi-step pipelines
Most agent steps don't need frontier quality. Use a small model for routine steps and a frontier model only when you actually need the IQ.
Cost-pressured products
Free tiers, freemium, B2C scale. Smaller models make unit economics work where frontier APIs don't.
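The multi-step pipeline pattern above can be sketched as a per-step router: default every step to the cheap model and escalate only when needed. The model names, routing table, and confidence threshold below are illustrative assumptions, not a real SDK.

```python
# Sketch of per-step model routing in a pipeline. The model names,
# routing table, and confidence threshold are illustrative assumptions.

ROUTES = {
    "classify": "small-model",
    "extract": "small-model",
    "summarize": "small-model",
    "plan": "frontier-model",   # planning tends to need the IQ
}

def pick_model(step, confidence, threshold=0.8):
    """Default each step to the cheap model; escalate to the frontier
    model when the small model's confidence falls below the threshold."""
    model = ROUTES.get(step, "frontier-model")  # unknown steps play safe
    if model == "small-model" and confidence < threshold:
        return "frontier-model"
    return model
```

The escalation hook matters more than the table: it means the frontier model is a fallback you pay for occasionally, not a default you pay for always.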
When Self-Hosting Open Models Pays Off
Self-hosting is rarely the right first move and frequently the right move at scale. The breakeven generally lives somewhere between $50K and $200K of monthly inference spend — below that, the engineering cost of self-hosting eats the savings.
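The breakeven above is simple arithmetic: API spend scales with volume, while dedicated GPUs are a fixed cost plus ops. The sketch below makes the comparison concrete under assumed prices (the token rate, GPU rate, and ops overhead are illustrative, not vendor quotes).

```python
# Back-of-envelope breakeven between per-token API pricing and
# dedicated GPUs. All prices here are illustrative assumptions.

def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """API spend is linear in volume."""
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_gpu_cost(num_gpus, usd_per_gpu_hour, ops_overhead_usd):
    """Dedicated GPUs are a fixed cost plus the people to run them."""
    return num_gpus * usd_per_gpu_hour * 24 * 30 + ops_overhead_usd

# 20B tokens/month at an assumed $5 per 1M tokens:
api = monthly_api_cost(20e9, 5.0)            # $100,000 / month
# Assumed 8 GPUs at $2.50/hr plus $15K/month of ops time:
hosted = monthly_gpu_cost(8, 2.50, 15_000)   # $29,400 / month
```

At low volume the fixed GPU and ops costs dominate and the API wins; the crossover lands roughly in the spend range quoted above, which is why the breakeven is a band, not a number.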
Regulatory/data residency
When data cannot leave your VPC. Self-hosting becomes the only legal option for many regulated workloads.
Predictable, high volume
When you know your QPS, dedicated GPU economics beat per-token pricing by 3-10x.
Fine-tuning + custom behavior
Open models give you weights. Closed APIs give you knobs. If your differentiation requires weight-level control, self-host.
Vendor lock-in mitigation
Multi-vendor architecture is its own form of insurance. Self-hosted backups protect against API price hikes and outages.
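The insurance framing above usually takes the form of a thin provider abstraction with a fallback chain. This is a minimal sketch; the provider callables are stand-ins for real vendor clients, so wire your actual SDK calls in behind the same signature.

```python
# Sketch of a provider-agnostic fallback chain. The provider
# callables are stand-ins for real vendor clients (an assumption);
# real code would wrap each SDK behind this signature.

def complete_with_fallback(prompt, providers):
    """Try each provider callable in order; return the first success.

    `providers` is a list of functions taking a prompt string and
    returning a completion string, ordered by preference (e.g.
    primary hosted API first, self-hosted backup last).
    """
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # production code would narrow this
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Keeping the abstraction this thin is deliberate: the moment prompts depend on one vendor's quirks, the backup stops being a real backup.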
When Custom-Trained or Fine-Tuned Is the Answer
Custom training is the option most teams over-explore and fewest actually need. The right reason to fine-tune is documented failure of cheaper options, not novelty. Fine-tuning is the answer only when prompt engineering, RAG, and bigger models have all genuinely been tried and don't close the gap.
Repeated, narrow output format
If you need outputs in a specific structure that prompting can't consistently produce, fine-tuning earns its cost.
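As a concrete illustration of the structured-output case, hosted fine-tuning APIs commonly accept chat-format JSONL training files shaped like the sketch below. The exact schema varies by provider and the invoice task and fields here are invented, so treat this as a shape sketch, not a spec.

```python
# Sketch of a chat-format JSONL fine-tuning file for a structured-
# extraction task. The {"messages": [...]} shape is common across
# hosted fine-tuning APIs; confirm the exact schema with your provider.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #42, total $310.00, due 2025-07-01."},
        {"role": "assistant", "content": '{"number": 42, "total": 310.0, "due": "2025-07-01"}'},
    ]},
]

def to_jsonl(rows):
    """One JSON object per line: the usual fine-tuning upload format."""
    return "\n".join(json.dumps(row) for row in rows)
```

Every example shows the exact target structure, which is the point: format fine-tunes teach the shape of the output, and typically need far fewer examples than capability fine-tunes.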
Domain-specific reasoning
Legal, medical, scientific reasoning where frontier models still misfire on subtle distinctions. The fine-tune is the moat.
Latency or cost wall
When you need inference 10-100x cheaper than frontier and an off-the-shelf small model's quality drop is unacceptable. Distill frontier outputs into a fine-tuned smaller model.
Brand voice and tone
When tone matters more than capability. A fine-tune locks the voice across thousands of generations consistently.