Classification Systems: When to Use Rules, ML, or LLMs
TL;DR
Classification is the most common AI task in production, but the right approach varies wildly by use case. Rule-based systems cost nothing to run and are fully explainable but brittle. Traditional ML classifiers (logistic regression, gradient boosted trees, fine-tuned transformers) offer strong accuracy at low latency and cost. LLMs provide zero-shot flexibility but at 100–1000x the cost per classification. Most production systems use a hybrid: rules handle the easy cases, ML handles the middle, and LLMs handle the long tail. This guide gives you the decision framework.
The Classification Spectrum: Rules, ML, and LLMs
Classification — assigning an input to one or more predefined categories — is the backbone of most production AI systems. Spam filtering, content moderation, ticket routing, intent detection, document categorization, lead scoring — all classification problems. The question isn't whether you need classification; it's which approach gives you the best accuracy-to-cost ratio for your specific problem.
Think of classification approaches as a spectrum. On one end, you have hand-coded rules that are free to run but expensive to maintain. In the middle, you have trained ML models that learn patterns from labeled data. On the far end, you have LLMs that can classify with zero labeled data but at dramatically higher cost per prediction.
Rule-based systems
Trade-off: Perfect explainability and zero cost, but every new edge case requires manual engineering. Maintenance cost grows linearly with complexity.
Traditional ML classifiers
Trade-off: Strong accuracy at near-zero cost, but requires labeled training data (hundreds to thousands of examples per class) and retraining when patterns shift.
Fine-tuned small models
Trade-off: Near-LLM quality at a fraction of the cost, but requires fine-tuning infrastructure, evaluation pipelines, and ongoing data collection.
LLM-based classification
Trade-off: Immediate flexibility with no training data needed, but 100–1000x the cost, high latency, and non-deterministic outputs make it impractical at high volume.
When Each Approach Is the Right Choice
The decision framework isn't about which approach is “best” — it's about which approach fits your constraints. Volume, latency requirements, how fast your categories change, how much labeled data you have, and your error tolerance all factor in. Here's when each approach wins.
Use rules when the logic is explicit and stable
If you can write the classification logic as if-then statements and it won't change often, rules are the right answer. Examples: routing emails by domain (@company.com goes to enterprise), classifying transactions by amount thresholds, filtering content with known banned keywords. Rules are also the right first step when you're exploring a new classification problem — start with rules, observe where they fail, and use those failures as training data for ML.
PM signal: If your team can enumerate all the classification logic in a meeting, start with rules.
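As a minimal sketch of what this tier looks like in code (the domains, keywords, and categories here are made up, and real rule sets usually live in config rather than source):

```python
# Hypothetical rule sets; in production these belong in versioned config.
BANNED_KEYWORDS = {"free money", "act now"}
ENTERPRISE_DOMAINS = {"company.com", "bigcorp.com"}

def classify_email(sender: str, body: str) -> str | None:
    """Return a category, or None to fall through to the next tier."""
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in ENTERPRISE_DOMAINS:
        return "enterprise"
    if any(kw in body.lower() for kw in BANNED_KEYWORDS):
        return "spam"
    return None  # no rule fired; escalate to ML or human review
```

Returning None instead of guessing is the important design choice: rules should only answer when they are certain, which is what makes them safe as the first tier of a cascade.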
Use traditional ML when you have labeled data and need scale
Logistic regression, random forests, gradient boosted trees (XGBoost, LightGBM), and SVMs are the workhorses of production classification. They handle millions of predictions per second at near-zero marginal cost. If you have 500+ labeled examples per category, these models typically match or beat LLMs on accuracy while costing roughly 1000x less. For text classification specifically, fine-tuned BERT-class models (DistilBERT, DeBERTa) reach 95%+ accuracy on many standard benchmarks.
PM signal: If you have historical labeled data or can label 1000 examples, ML classifiers are almost always the right choice.
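A minimal scikit-learn baseline, assuming you already have parallel lists `texts` and `labels` of labeled examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# texts: list[str], labels: list[str] (your labeled examples)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word + bigram features
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(f"holdout accuracy: {clf.score(X_test, y_test):.3f}")
```

A pipeline like this trains in seconds, serves predictions in microseconds, and is a strong baseline to beat before reaching for anything more expensive.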
Use fine-tuned small models when you need LLM-like understanding at ML-like cost
Fine-tuning a small language model (GPT-4o-mini, Gemini Flash, Llama 8B) on your specific classification task gives you the reasoning capability of an LLM at 10-50x lower cost. You need 200-2000 labeled examples for effective fine-tuning. The resulting model runs fast enough for real-time use and produces stable, near-deterministic outputs at temperature=0.
PM signal: If your classification requires nuanced language understanding but LLM costs are prohibitive, fine-tuning is the sweet spot.
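A sketch of kicking off a fine-tune with the OpenAI SDK; the training file name is illustrative, and the exact model snapshot names change over time, so check the current docs:

```python
from openai import OpenAI

client = OpenAI()

# Upload JSONL training data: one chat-formatted example per line,
# each ending with the gold label as the assistant message.
training_file = client.files.create(
    file=open("tickets_train.jsonl", "rb"), purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name varies; check current docs
)
print(job.id)  # poll the job; it yields a "ft:..." model ID that you call
# at inference time with temperature=0 like any other chat model
```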
Use LLMs when you have no training data or categories change constantly
LLMs shine when you can't build a training dataset — either because you're launching a new product and have no historical data, or because your categories change so frequently that retraining a model isn't practical. Zero-shot and few-shot classification with LLMs lets you ship immediately and iterate on categories without any retraining. Use LLMs as the starting approach, then graduate to ML classifiers once you've accumulated enough labeled data from LLM predictions.
PM signal: If you need to classify something new tomorrow with no existing data, LLMs are the only option that works on day one.
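A zero-shot sketch where the category list lives entirely in the prompt, so changing categories is a string edit rather than a retrain (model name and categories are placeholders):

```python
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["billing", "technical", "account", "other"]  # edit freely

def classify(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the user message into exactly one of: "
                + ", ".join(CATEGORIES) + ". Reply with the category only."
            )},
            {"role": "user", "content": text},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"  # guard malformed outputs
```

The final guard matters: LLMs occasionally return labels outside the list, and snapping those to "other" keeps downstream systems from seeing invented categories.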
Hybrid Classification Architectures
The most effective production classification systems don't pick one approach — they layer multiple approaches in a cascade. The principle is simple: handle easy cases cheaply and only escalate to expensive approaches when necessary. This is the same logic behind tiered customer support: don't send every question to a senior engineer when a FAQ page handles 60% of them.
The cascade architecture typically reduces total classification cost by 70–90% compared to sending everything to an LLM, while maintaining equivalent or better accuracy.
Rules-first cascade
Route inputs through rules first. High-confidence matches get classified immediately (cost: $0). Everything else falls through to an ML classifier. Only ambiguous cases that the ML model flags as low-confidence go to the LLM. In practice, rules handle 40-60% of volume, ML handles 30-40%, and only 10-20% hits the LLM.
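A sketch of the routing logic, assuming a rules function that returns None when no rule fires (like the email router above), a scikit-learn-style model exposing `predict_proba`, and an LLM fallback like the zero-shot helper:

```python
def cascade_classify(text: str, rules_fn, ml_model, llm_fn,
                     ml_threshold: float = 0.85) -> tuple[str, str]:
    """Return (label, tier) so cost and accuracy can be tracked per tier."""
    # Tier 1: rules. Free and deterministic; answers only when a rule fires.
    label = rules_fn(text)
    if label is not None:
        return label, "rules"

    # Tier 2: ML. Accept the prediction only above the confidence threshold.
    probs = ml_model.predict_proba([text])[0]
    best = probs.argmax()
    if probs[best] >= ml_threshold:
        return ml_model.classes_[best], "ml"

    # Tier 3: LLM. Expensive; reserved for the ambiguous long tail.
    return llm_fn(text), "llm"
```

Returning the tier alongside the label is what makes per-tier accuracy tracking (discussed below) possible.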
Confidence-gated escalation
The ML classifier runs on every input but only returns a prediction when confidence exceeds a threshold (e.g., 0.85). Below-threshold inputs escalate to a more expensive model. This gives you a single knob to tune the cost-accuracy trade-off: lower the threshold for cheaper operation, raise it for higher accuracy.
LLM-as-labeler pipeline
Use LLMs to classify the first batch of inputs, then use those LLM-generated labels as training data for a traditional ML classifier. Once the ML model reaches target accuracy, switch production traffic to it. The LLM becomes your labeling engine, not your inference engine. This pattern lets you launch with LLM quality and migrate to ML cost.
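A sketch of the handoff, reusing the zero-shot `classify` helper from earlier; `load_unlabeled_texts`, `deploy`, the holdout variables, and the 0.92 target are hypothetical placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Phase 1: the LLM labels raw inputs (spot-check a sample by hand first).
unlabeled = load_unlabeled_texts()               # hypothetical loader
llm_labels = [classify(t) for t in unlabeled]    # zero-shot helper from above

# Phase 2: train the cheap student model on the LLM-generated labels.
student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
student.fit(unlabeled, llm_labels)

# Phase 3: gate the cutover on a human-labeled holdout, not LLM labels,
# so labeling noise cannot inflate the measured accuracy.
if student.score(holdout_texts, holdout_labels) >= 0.92:  # example target
    deploy(student)                              # hypothetical deploy step
```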
Ensemble with voting
Run multiple classifiers in parallel and use majority voting or weighted consensus. A rules engine, an ML model, and a small LLM each produce a prediction. Agreement means high confidence; disagreement triggers human review or escalation. More expensive than cascading but produces higher accuracy on critical classification tasks where errors are costly.
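A minimal majority-vote sketch; `escalate_to_review` is a hypothetical hook into your human review queue:

```python
from collections import Counter

def ensemble_classify(text: str, classifiers) -> str:
    """Majority vote across independent classifiers (rules, ML, small LLM)."""
    votes = [clf(text) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    if count <= len(votes) // 2:  # no majority: the classifiers disagree
        return escalate_to_review(text, votes)  # hypothetical review queue
    return label
```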
Key design decision: where to set confidence thresholds
Every cascade architecture requires a confidence threshold that determines when to escalate. Set it too high and everything escalates (expensive). Set it too low and you get cheap but inaccurate classifications. Start with a threshold that routes ~20% of traffic to the expensive path, then adjust based on accuracy metrics. Track accuracy separately for each tier so you know where errors originate.
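One simple way to find that starting threshold, assuming a representative validation set and a scikit-learn-style model:

```python
import numpy as np

# Confidence of the top prediction on a representative validation set.
conf = ml_model.predict_proba(val_texts).max(axis=1)

# Pick the threshold that routes ~20% of traffic to the expensive tier:
# the 20th percentile of confidence is the value 20% of scores fall below.
threshold = float(np.percentile(conf, 20))
print(f"escalate below confidence {threshold:.2f}")
```

From there, move the percentile up or down based on the per-tier accuracy numbers rather than guessing at raw confidence values.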
Design Production AI Architectures in the Masterclass
Classification cascades, model selection frameworks, and cost optimization patterns are covered hands-on in the AI PM Masterclass — taught by a Salesforce Sr. Director PM.
Evaluation and Monitoring for Classification Systems
Classification evaluation is deceptively simple — accuracy alone tells you almost nothing useful. A spam filter that classifies everything as “not spam” has 95% accuracy if only 5% of emails are spam. AI PMs need to understand the metrics that actually matter and build monitoring that catches degradation before users do.
Precision vs. recall: the fundamental trade-off
Precision measures “of everything we classified as X, how many were actually X.” Recall measures “of everything that was actually X, how many did we catch.” You can't maximize both simultaneously. Content moderation systems typically optimize for high recall (catch everything bad, accept some false positives). Fraud detection optimizes for high precision (only flag real fraud, accept missing some). The PM decision is which type of error is more costly for your users.
Trade-off: High precision = fewer false alarms but more missed catches. High recall = catches more but with more false positives.
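Both metrics are one call each in scikit-learn; the toy labels below are illustrative:

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

# pos_label picks which class plays the role of "X" in the definitions above.
p = precision_score(y_true, y_pred, pos_label="spam")  # of flagged, correct?
r = recall_score(y_true, y_pred, pos_label="spam")     # of actual, caught?
print(f"precision={p:.2f} recall={r:.2f}")
```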
F1 score and when it misleads
F1 is the harmonic mean of precision and recall — a single number that balances both. It's useful for comparing models during development but dangerous as a production metric because it obscures the precision-recall trade-off. A model with 90% precision and 70% recall has the same F1 as one with 70% precision and 90% recall, but those models behave very differently in production. Always report precision and recall separately.
Trade-off: F1 simplifies comparison but hides which error type dominates. Always decompose.
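The arithmetic behind that claim, since F1 is symmetric in precision and recall:

```python
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall

print(f1(0.90, 0.70))  # 0.7875
print(f1(0.70, 0.90))  # 0.7875: identical F1, very different behavior
```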
Per-class performance: the metric most teams skip
Aggregate metrics hide class-level failures. Your model might have 95% overall accuracy but 40% accuracy on the category that matters most. Report precision, recall, and volume for each class individually. Small classes are especially prone to poor performance because the model has fewer training examples. If one class represents 2% of volume but 50% of business value, its metrics should be your primary KPI.
Trade-off: Per-class metrics create more complexity but prevent the most common classification failure: invisible underperformance on minority classes.
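Continuing the earlier scikit-learn example, `classification_report` gives exactly this breakdown, including per-class support (volume):

```python
from sklearn.metrics import classification_report

# One row per class: precision, recall, F1, and support (example count).
# Watch the low-support rows; that is where invisible failures hide.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```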
Monitoring for distribution shift
Classification models degrade when the distribution of incoming data shifts away from the training distribution. Monitor the distribution of predicted classes over time — if a category that historically received 15% of predictions suddenly drops to 5%, something has changed. Track confidence score distributions too: a model that becomes less confident over time is seeing inputs it wasn't trained for. Set up alerts for both predicted-class distribution changes and confidence score degradation.
Trade-off: Drift monitoring requires infrastructure investment but catches model degradation weeks before accuracy metrics show it.
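A common way to quantify this kind of shift is the Population Stability Index; the class shares below are illustrative, and PSI above 0.2 is a widely used rule-of-thumb alert level:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """Population Stability Index between two class-share distributions."""
    e = expected + eps  # smoothing so empty classes don't divide by zero
    a = actual + eps
    return float(np.sum((a - e) * np.log(a / e)))

baseline = np.array([0.15, 0.50, 0.35])   # historical predicted-class shares
this_week = np.array([0.05, 0.55, 0.40])  # current shares, same class order

print(f"PSI = {psi(baseline, this_week):.3f}")  # alert if above ~0.2
```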
Classification System Design Patterns for Production
Moving classification from a notebook to production introduces challenges around taxonomy management, multi-label classification, handling ambiguous inputs, and maintaining classification consistency over time. Here are the patterns that experienced AI PMs use.
Taxonomy versioning
Categories change over time — new ones get added, old ones merge, definitions evolve. Version your taxonomy like you version code. Every prediction should be tagged with the taxonomy version it was classified against. This lets you re-evaluate historical predictions against updated taxonomies and makes auditing possible.
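A minimal sketch of what tagging looks like at the record level (field names and version strings are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Prediction:
    input_id: str
    label: str
    confidence: float
    taxonomy_version: str  # e.g. "2025-03": which category set was live
    model_version: str     # which classifier produced this label
    classified_at: datetime

pred = Prediction("tkt-123", "billing", 0.91, "2025-03", "tfidf-lr-v7",
                  datetime.now(timezone.utc))
```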
Hierarchical classification
Instead of one flat list of 200 categories, structure classification as a tree: first classify into 10 top-level categories, then sub-classify within each. This reduces the per-level classification complexity, improves accuracy at each level, and lets you use different models for different levels (rules for top-level, ML for sub-categories).
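A sketch of the two-stage version, assuming one classifier per level with a scikit-learn-style `predict`:

```python
def hierarchical_classify(text: str, top_level_clf, sub_clfs) -> tuple[str, str]:
    """Two-stage tree: pick a top-level category, then a sub-category.

    sub_clfs maps each top-level label to its own classifier, so every
    model only ever chooses among a handful of options.
    """
    top = top_level_clf.predict([text])[0]  # e.g. "hardware"
    sub = sub_clfs[top].predict([text])[0]  # e.g. "hardware/keyboard"
    return top, sub
```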
Rejection class (none of the above)
Always include an explicit “unknown” or “other” category. Forcing the model to pick from predefined categories when the input doesn't fit any of them produces confidently wrong predictions. An “other” class acts as a safety valve: inputs classified as “other” get routed to human review, and patterns in the “other” bucket tell you when to add new categories.
Multi-label vs. multi-class design
Multi-class classification assigns exactly one label. Multi-label allows multiple labels per input. The choice affects your model architecture, training data format, and evaluation metrics. Use multi-class when categories are mutually exclusive (an email is either spam or not). Use multi-label when inputs can belong to multiple categories (a support ticket can be about billing AND technical issues).
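In scikit-learn, the multi-label version is a binarized label matrix plus one independent binary classifier per label; the toy tickets below are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "card was charged twice",
    "app crashes on login",
    "crash right after my invoice was emailed",
]
labels = [["billing"], ["technical"], ["billing", "technical"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per label

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one model per label
)
clf.fit(texts, Y)
print(mlb.inverse_transform(clf.predict(["invoice error crashed the app"])))
```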
Human-in-the-loop for low confidence
Route low-confidence predictions to human reviewers rather than making a potentially wrong automated decision. The human reviews serve two purposes: they correct the immediate prediction, and the corrected labels become training data for model improvement. Design the review interface to show the model's top predictions and confidence scores so reviewers can confirm or override quickly.
Shadow mode deployment
Before switching from one classification system to another, run the new system in shadow mode: it classifies every input but its predictions aren't used for decisions. Compare shadow predictions against the existing system's predictions and against ground truth labels. This reveals accuracy differences, latency impacts, and edge cases before any user is affected.
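A sketch of the serving-path logic; in production the shadow call would typically run async so it adds no user-facing latency:

```python
import json
import time

def classify_with_shadow(text: str, live_clf, shadow_clf,
                         log_path: str = "shadow.jsonl") -> str:
    """Serve the live prediction; log the shadow prediction for offline diffing."""
    live_label = live_clf(text)

    start = time.perf_counter()
    shadow_label = shadow_clf(text)  # logged, never acted on
    latency_ms = (time.perf_counter() - start) * 1000

    with open(log_path, "a") as f:
        f.write(json.dumps({
            "text": text, "live": live_label, "shadow": shadow_label,
            "agree": live_label == shadow_label, "shadow_ms": latency_ms,
        }) + "\n")
    return live_label  # only the live system affects users
```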
Build Classification Systems That Scale
Classification architectures, model selection, cost optimization, and production deployment patterns are all covered in the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.