AI Make-or-Buy: Foundation Models, APIs, or Custom Models?
TL;DR
"Make or buy" in AI is no longer binary. You have four real options: hosted frontier APIs, hosted smaller models, self-hosted open models, and fine-tuned or custom-trained models. Each has a distinct profile across cost, latency, control, and switching cost. This guide gives you a decision framework so you stop choosing based on hype and start choosing based on the actual product requirement.
The Four Options Most Teams Don't Distinguish Clearly
Option A: Frontier hosted API
GPT-5, Claude Opus, Gemini Ultra via API. Best quality, highest per-token cost, vendor lock-in risk, fastest path to market.
Option B: Smaller hosted models
GPT-4o-mini, Claude Haiku, Gemini Flash. 10-30x cheaper than frontier. Often good enough for narrow tasks. Same vendor risk.
Option C: Self-hosted open models
Llama 4, Mistral, Qwen on your own GPUs or via Together/Replicate. Full control, lower per-token cost at scale, real ops complexity.
Option D: Fine-tuned or custom-trained
Fine-tune an open base or train from scratch. Highest specialization, highest investment, only justifiable when prior options demonstrably fail.
When Frontier APIs Are the Right Call
Frontier APIs are the right starting point for almost everyone. They give you the highest quality and the fastest experimentation cycle, and they let you focus engineering on the layer above the model where most differentiation actually lives.
Pre-product-market fit
You don't know what you're building yet. Pay the premium, learn fast. Optimize later.
Low to moderate volume
If you're burning under $20K/month on inference, the optimization work isn't worth the engineering time.
Quality is the product
When you're selling intelligence (legal research, complex code generation), quality compounds. The cheapest model that can't do the job is the most expensive one.
Your team is small
Self-hosting requires real ops investment. If you have 3 engineers, don't spend 1 of them on inference infra.
When Smaller Hosted Models Win
The smaller-model tier is the most underused option. Most production AI workloads — classification, extraction, routing, summarization — don't need frontier intelligence. They need to be consistent, fast, cheap, and accurate enough.
High-volume narrow tasks
Tag classification, intent detection, entity extraction. Smaller models hit 95%+ of frontier quality at 1/20th the cost.
Latency-sensitive UX
Sub-second response targets. Smaller models can be 3-5x faster end-to-end. Streaming UX feels instant.
Multi-step pipelines
Most agent steps don't need frontier quality. Use a small model for routine steps and a frontier model only when you actually need the IQ.
Cost-pressured products
Free tiers, freemium, B2C scale. Smaller models make unit economics work where frontier APIs don't.
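The multi-step pipeline pattern above can be sketched as a per-step router: default every step to the cheap model and escalate only when needed. The model names, routing table, and confidence threshold below are illustrative assumptions, not a real SDK.

```python
# Sketch of per-step model routing in a pipeline. The model names,
# routing table, and confidence threshold are illustrative assumptions.

ROUTES = {
    "classify": "small-model",
    "extract": "small-model",
    "summarize": "small-model",
    "plan": "frontier-model",   # planning tends to need the IQ
}

def pick_model(step, confidence, threshold=0.8):
    """Default each step to the cheap model; escalate to the frontier
    model when the small model's confidence falls below the threshold."""
    model = ROUTES.get(step, "frontier-model")  # unknown steps play safe
    if model == "small-model" and confidence < threshold:
        return "frontier-model"
    return model
```

The escalation hook matters more than the table: it means the frontier model is a fallback you pay for occasionally, not a default you pay for always.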
When Self-Hosting Open Models Pays Off
Self-hosting is rarely the right first move and frequently the right move at scale. The breakeven generally lives somewhere between $50K and $200K of monthly inference spend — below that, the engineering cost of self-hosting eats the savings.
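The breakeven above is simple arithmetic: API spend scales with volume, while dedicated GPUs are a fixed cost plus ops. The sketch below makes the comparison concrete under assumed prices (the token rate, GPU rate, and ops overhead are illustrative, not vendor quotes).

```python
# Back-of-envelope breakeven between per-token API pricing and
# dedicated GPUs. All prices here are illustrative assumptions.

def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """API spend is linear in volume."""
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_gpu_cost(num_gpus, usd_per_gpu_hour, ops_overhead_usd):
    """Dedicated GPUs are a fixed cost plus the people to run them."""
    return num_gpus * usd_per_gpu_hour * 24 * 30 + ops_overhead_usd

# 20B tokens/month at an assumed $5 per 1M tokens:
api = monthly_api_cost(20e9, 5.0)            # $100,000 / month
# Assumed 8 GPUs at $2.50/hr plus $15K/month of ops time:
hosted = monthly_gpu_cost(8, 2.50, 15_000)   # $29,400 / month
```

At low volume the fixed GPU and ops costs dominate and the API wins; the crossover lands roughly in the spend range quoted above, which is why the breakeven is a band, not a number.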
Regulatory/data residency
When data cannot leave your VPC. Self-hosting becomes the only legal option for many regulated workloads.
Predictable, high volume
When you know your QPS, dedicated GPU economics beat per-token pricing by 3-10x.
Fine-tuning + custom behavior
Open models give you weights. Closed APIs give you knobs. If your differentiation requires weight-level control, self-host.
Vendor lock-in mitigation
Multi-vendor architecture is its own form of insurance. Self-hosted backups protect against API price hikes and outages.
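The insurance framing above usually takes the form of a thin provider abstraction with a fallback chain. This is a minimal sketch; the provider callables are stand-ins for real vendor clients, so wire your actual SDK calls in behind the same signature.

```python
# Sketch of a provider-agnostic fallback chain. The provider
# callables are stand-ins for real vendor clients (an assumption);
# real code would wrap each SDK behind this signature.

def complete_with_fallback(prompt, providers):
    """Try each provider callable in order; return the first success.

    `providers` is a list of functions taking a prompt string and
    returning a completion string, ordered by preference (e.g.
    primary hosted API first, self-hosted backup last).
    """
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # production code would narrow this
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Keeping the abstraction this thin is deliberate: the moment prompts depend on one vendor's quirks, the backup stops being a real backup.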
When Custom-Trained or Fine-Tuned Is the Answer
Custom training is the option most teams over-explore and fewest actually need. The right reason to fine-tune is documented failure of cheaper options, not novelty. Fine-tuning is the answer only when prompt engineering, RAG, and bigger models have all genuinely been tried and don't close the gap.
Repeated, narrow output format
If you need outputs in a specific structure that prompting can't consistently produce, fine-tuning earns its cost.
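As a concrete illustration of the structured-output case, hosted fine-tuning APIs commonly accept chat-format JSONL training files shaped like the sketch below. The exact schema varies by provider and the invoice task and fields here are invented, so treat this as a shape sketch, not a spec.

```python
# Sketch of a chat-format JSONL fine-tuning file for a structured-
# extraction task. The {"messages": [...]} shape is common across
# hosted fine-tuning APIs; confirm the exact schema with your provider.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #42, total $310.00, due 2025-07-01."},
        {"role": "assistant", "content": '{"number": 42, "total": 310.0, "due": "2025-07-01"}'},
    ]},
]

def to_jsonl(rows):
    """One JSON object per line: the usual fine-tuning upload format."""
    return "\n".join(json.dumps(row) for row in rows)
```

Every example shows the exact target structure, which is the point: format fine-tunes teach the shape of the output, and typically need far fewer examples than capability fine-tunes.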
Domain-specific reasoning
Legal, medical, scientific reasoning where frontier models still misfire on subtle distinctions. The fine-tune is the moat.
Latency or cost wall
When you need inference 10-100x cheaper than frontier and an off-the-shelf small model's quality drop is unacceptable. Distill frontier outputs into a fine-tuned smaller model.
Brand voice and tone
When tone matters more than capability. A fine-tune locks the voice across thousands of generations consistently.