AI Prompt Management Template: Version, Test, and Govern Your Prompt Library
TL;DR
Prompts are product code. If you're not versioning, testing, and governing your prompt library, you're shipping AI features the way you would never ship code: with no version history, no tests, and no approvals. A prompt change that breaks in production is an incident — and most teams discover this the hard way. This template gives you the structure to treat prompts as first-class product artifacts.
Prompt Library Structure
Organize your prompts so they can be found, understood, and maintained by anyone on the team — not just the person who wrote them.
Naming convention
Use the format [product-area]/[feature]/[task]. Examples: support/ticket-routing/classify or content/email-drafting/generate-subject. Avoid names like 'new_prompt_v2_final' — they're impossible to search for, and nobody will ever be confident it's safe to delete them.
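Here's a minimal sketch of enforcing the convention in code; the regex and helper name are illustrative, not part of any particular tool:

```python
import re

# Illustrative pattern for the [product-area]/[feature]/[task] convention:
# three lowercase, hyphen-separated segments divided by slashes.
PROMPT_NAME_PATTERN = re.compile(r"^[a-z0-9-]+/[a-z0-9-]+/[a-z0-9-]+$")

def is_valid_prompt_name(name: str) -> bool:
    """Return True if a prompt name follows product-area/feature/task."""
    return bool(PROMPT_NAME_PATTERN.match(name))

assert is_valid_prompt_name("support/ticket-routing/classify")
assert not is_valid_prompt_name("new_prompt_v2_final")
```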
Required metadata per prompt
Author, creation date, last modified date, associated feature, model it was designed for, average token count, and cost per call estimate. This metadata is what allows you to do cost audits and track which prompts need updating after a model upgrade.
Prompt description field
One paragraph describing: what this prompt does, what inputs it receives, what output format it returns, and what use cases it covers. Future team members (and future you) will thank you.
Test case references
Link to the evaluation dataset and test cases for this prompt. Without this link, prompt changes are made without running tests — a common source of silent regressions.
Deprecation status
Mark prompts as Active, Deprecated (still in use but scheduled for replacement), or Archived (no longer in production). Never delete — archive. A deleted prompt becomes a mystery outage the moment someone finds a code path that still references it.
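Pulling the pieces above together, one way the full prompt record could look as a data structure (the field names and the status enum are illustrative, not a standard schema):

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class PromptStatus(Enum):
    ACTIVE = "active"          # running in production
    DEPRECATED = "deprecated"  # still in use, scheduled for replacement
    ARCHIVED = "archived"      # no longer in production; never deleted

@dataclass
class PromptRecord:
    name: str                    # e.g. "support/ticket-routing/classify"
    author: str
    created: date
    last_modified: date
    feature: str                 # associated product feature
    target_model: str            # model the prompt was designed for
    avg_token_count: int
    est_cost_per_call_usd: float
    description: str             # what it does, inputs, output format, use cases
    eval_dataset: str            # link to the test cases for this prompt
    status: PromptStatus = PromptStatus.ACTIVE
```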
Version Control for Prompts
Semantic versioning
Use major.minor.patch for prompts. Major: structural change that breaks downstream parsing. Minor: behavior change that improves but doesn't break. Patch: wording tweak, typo fix. This helps engineers know what to test when you ship a new version.
Change log format
For each version: what changed (the diff), why it changed (the reason), what test results showed (before/after metrics), and who approved the change. A prompt change without a change log is a black box you can't audit.
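As a concrete illustration (the version number, metric names, and values are invented for the example), a single change-log entry might look like this:

```python
# One entry per version; the keys mirror the four questions above.
changelog_entry = {
    "version": "2.1.0",  # minor bump: behavior improved, nothing downstream breaks
    "what_changed": "Added two few-shot examples covering refund-related tickets.",
    "why": "Refund-related tickets were frequently misrouted by the previous version.",
    "test_results": {"routing_accuracy_before": 0.89, "routing_accuracy_after": 0.94},
    "approved_by": "reviewer-name",
}
```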
Rollback protocol
Every prompt in production should have a documented rollback path: what the previous version was, where to find it, and who can authorize a rollback. Practice rollbacks before you need them in an incident.
Branch strategy
Maintain dev, staging, and production versions of critical prompts just as you would for code. Test in staging with production-representative data before promoting. Never edit production prompts directly.
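One way to keep the environment pinning and the rollback path in a single place; the structure and names here are assumptions, not any specific tool's format:

```python
# Each environment pins an explicit prompt version. "previous" is the documented
# rollback target; "rollback_approvers" records who can authorize a rollback.
PROMPT_ENVIRONMENTS = {
    "support/ticket-routing/classify": {
        "dev":        {"version": "2.2.0-dev"},
        "staging":    {"version": "2.2.0"},
        "production": {
            "version": "2.1.0",
            "previous": "2.0.3",
            "rollback_approvers": ["pm-lead", "eng-lead"],
        },
    },
}

def rollback(envs: dict, prompt_name: str) -> None:
    """Swap production back to its documented previous version."""
    prod = envs[prompt_name]["production"]
    prod["version"], prod["previous"] = prod["previous"], prod["version"]
```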
Prompt Testing Protocol
Build a representative test set (minimum 50 cases per prompt)
Test cases should cover: happy path inputs, edge cases, adversarial inputs, and known historical failure modes. If you don't have a test set, creating one is the first task before editing the prompt.
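A sketch of what one entry in that test set could look like, with the coverage category as an explicit tag so you can verify all four buckets are represented (names are illustrative):

```python
from dataclasses import dataclass
from typing import Literal

Category = Literal["happy_path", "edge_case", "adversarial", "historical_failure"]

@dataclass
class PromptTestCase:
    case_id: str
    category: Category   # which coverage bucket this case exercises
    input_text: str      # what the prompt will receive
    expected: str        # expected label, schema, or reference output
```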
Regression testing: new version must match or beat old version
Run both versions against your test set. The new version must achieve ≥ the old version's score on all primary metrics. A prompt that improves average performance but introduces regressions in specific categories is not ready.
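A sketch of that gate, assuming each evaluation run produces a mapping of metric (or category) name to score; the per-category entries are where hidden regressions show up:

```python
def passes_regression_gate(old_scores: dict[str, float],
                           new_scores: dict[str, float]) -> bool:
    """New version must match or beat the old one on every primary metric.

    Both arguments map metric or category name to score, e.g.
    {"accuracy:overall": 0.93, "accuracy:edge_case": 0.81}.
    """
    return all(new_scores.get(metric, 0.0) >= old_score
               for metric, old_score in old_scores.items())
```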
Human review for qualitative changes
For prompts that generate free-form text (emails, summaries, recommendations), automated metrics aren't sufficient. Have 2–3 team members blind-evaluate a sample from the new version versus the old. Use a rubric, not vibes.
Canary deployment for high-stakes prompts
Route 5% of traffic to the new prompt version before full rollout. Monitor error rate, parse failure rate, and user feedback signals for 24–48 hours before proceeding. Gate: no metric degradation before expanding.
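A minimal sketch of deterministic canary routing: hashing a stable user or request ID keeps each user on one version for the whole canary window (the function and variable names are illustrative):

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic routed to the new prompt version

def prompt_version_for(user_id: str,
                       stable_version: str = "2.1.0",
                       canary_version: str = "2.2.0") -> str:
    """Deterministically route ~5% of users to the canary version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary_version if bucket < CANARY_PERCENT else stable_version
```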
Deployment and Access Control
Tiered edit permissions
Not everyone should be able to edit production prompts. Define tiers: Read-only (all team members can read prompts), Edit (PM and ML engineer can edit in dev/staging), Promote (only PM and eng lead can promote to production). Document who approves each tier change.
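Here's one way the tiers could be written down as configuration rather than tribal knowledge; the role names and structure are illustrative:

```python
# Which roles hold each tier, and who signs off on changes to tier membership.
PROMPT_PERMISSIONS = {
    "read":    {"roles": ["all"],               "tier_change_approver": "eng-lead"},
    "edit":    {"roles": ["pm", "ml-engineer"], "tier_change_approver": "eng-lead"},  # dev/staging only
    "promote": {"roles": ["pm", "eng-lead"],    "tier_change_approver": "eng-lead"},  # to production
}

def can(role: str, action: str) -> bool:
    roles = PROMPT_PERMISSIONS[action]["roles"]
    return "all" in roles or role in roles
```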
Approval workflow for production changes
Any change to a production prompt requires: test results (automated), a reviewer who is not the author, and a documented reason. This is not bureaucracy — it's the minimum viable process to prevent a well-intentioned edit from causing an incident.
Environment promotion checklist
Before promoting from staging to production: test suite passing, human review complete, monitoring configured, rollback plan documented, stakeholders notified of behavior change. Make this a literal checkbox list.
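A sketch of that checklist as an actual gate, folding in the approval rule that the reviewer cannot be the author; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class PromotionRequest:
    author: str
    reviewer: str
    reason: str                     # documented reason for the change
    tests_passing: bool
    human_review_complete: bool
    monitoring_configured: bool
    rollback_plan_documented: bool
    stakeholders_notified: bool

def ready_to_promote(req: PromotionRequest) -> bool:
    """Every box must be checked, and the reviewer must not be the author."""
    return (req.tests_passing
            and req.human_review_complete
            and req.monitoring_configured
            and req.rollback_plan_documented
            and req.stakeholders_notified
            and bool(req.reason)
            and req.reviewer != req.author)
```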
Monitoring and Incident Response
Parse failure rate
Track the % of responses that fail to match your expected schema. A sudden increase in parse failures indicates a prompt regression or model behavior change. Alert threshold: >2% over a 1-hour rolling window.
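A sketch of that alert over a one-hour rolling window, assuming you log a timestamp and a parse-success flag per call; the threshold comes from the text above, everything else is illustrative:

```python
from datetime import datetime, timedelta

ALERT_THRESHOLD = 0.02       # alert when more than 2% of responses fail to parse
WINDOW = timedelta(hours=1)  # rolling window

def parse_failure_alert(events: list[tuple[datetime, bool]],
                        now: datetime) -> bool:
    """events holds (timestamp, parse_ok) per call; returns True if the alert fires."""
    recent = [ok for ts, ok in events if now - ts <= WINDOW]
    if not recent:
        return False
    failure_rate = recent.count(False) / len(recent)
    return failure_rate > ALERT_THRESHOLD
```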
Output quality score
For each prompt, define an automated quality proxy: LLM-as-judge, keyword presence, or output length distribution. Track it over time. A gradual shift in quality scores is an early signal of model drift.
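As one cheap proxy, output length drifting away from its historical distribution is often an early warning; this sketch just measures how far the recent mean has moved, and the interpretation threshold is up to you:

```python
import statistics

def length_drift(recent_lengths: list[int], baseline_lengths: list[int]) -> float:
    """How many baseline standard deviations the recent mean output length has moved."""
    baseline_mean = statistics.mean(baseline_lengths)
    baseline_std = statistics.stdev(baseline_lengths) or 1.0  # avoid divide-by-zero
    return abs(statistics.mean(recent_lengths) - baseline_mean) / baseline_std
```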
Prompt injection attempts
Log and alert on inputs that contain patterns consistent with prompt injection: instruction-like text in user inputs, attempts to override the system prompt, or unusual formatting. Don't wait for a successful injection to start monitoring.
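A minimal sketch of pattern-based flagging for this monitoring; the patterns are illustrative and will not catch sophisticated injections, which is why they feed logging and alerting rather than acting as a hard blocker:

```python
import re

# Illustrative patterns for instruction-like text in user inputs.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?(previous|above) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
    re.compile(r"disregard .+ and instead", re.I),
]

def looks_like_injection(user_input: str) -> bool:
    """Flag inputs worth logging and alerting on; not a definitive filter."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```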
Incident classification
When a prompt causes a production incident, classify it: regression (change caused it), model drift (model behavior changed under a fixed prompt), adversarial input, or data issue. Classification drives the right remediation — not all prompt incidents are prompt problems.