Prompt Engineering Practice for Aspiring AI Product Managers
TL;DR
Prompt engineering is the gateway skill for AI PMs, but most aspiring PMs practice it wrong. They write one prompt, get a cool output, and stop. Real prompt engineering is iteration with measurement: writing five variants, running each against 30 to 50 inputs, scoring the outputs, and choosing the variant that performs best on the target metric. This guide gives you a 30 day practice plan, 12 exercises that build compounding skill, the evaluation discipline that separates serious AI PMs from tinkerers, and the four mistakes that signal a shallow practitioner to hiring managers.
Why Most Aspiring AI PMs Practice Prompt Engineering Wrong
Prompt engineering looks easy. Open ChatGPT, type something, get a response. The accessibility hides the depth. PMs who practice the surface form develop bad habits that hiring managers spot in the first 10 minutes of a technical conversation. Here are the four most common failures.
Practicing on a single input at a time
The most common mistake is testing prompts one input at a time. Type a query, read the response, decide the prompt is good, move on. This trains the wrong intuition because LLMs are highly variable: a prompt that produces a good output on one input may fail on the next 20. PMs who practice this way ship features that work in the demo and fail in production. Hiring managers detect this immediately by asking, "How do you know that prompt is good?"
Tradeoff: Testing on multiple inputs takes 10x more time per iteration but builds the right reflexes. The unlock is to keep a small evaluation set (20 to 50 inputs) for every prompt task you work on. The set takes 30 minutes to build and pays back across every prompt iteration that follows.
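One lightweight way to keep that evaluation set is a small JSONL file you append to over time. The sketch below is a minimal example; the field names (id, input, notes) and the file name are illustrative assumptions, not a required schema.

import json

# A minimal evaluation set: one JSON object per line, 20 to 50 realistic inputs.
# Field names are illustrative; keep whatever columns your task actually needs.
eval_set = [
    {"id": 1, "input": "Customer asks for a refund 45 days after purchase", "notes": "edge case: outside policy window"},
    {"id": 2, "input": "Order arrived damaged and the customer is angry", "notes": "typical case"},
    # ... add the rest of your 20 to 50 inputs here
]

with open("eval_set.jsonl", "w") as f:
    for row in eval_set:
        f.write(json.dumps(row) + "\n")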
Optimizing for impressive outputs rather than reliable ones
Aspiring PMs gravitate toward prompts that produce flashy, creative, or emotionally resonant outputs. These are demo friendly but production hostile. A prompt that works for a demo may produce inconsistent length, off topic content, or tone drift across 100 production inputs. Hiring managers care about reliability because that is what ships. Demos are not products.
Tradeoff: Reliable prompts often feel boring (constrained format, terse instructions, explicit examples) but they ship. The skill is being willing to choose the boring prompt over the flashy one when the use case demands consistency. Practice both kinds so you can recognize the difference.
Memorizing prompt techniques without understanding why they work
Chain of thought, few shot, role prompting, ReAct: these techniques get repeated like incantations without the underlying intuition. A PM who knows that few shot examples typically improve format compliance by 30 to 60 percent on classification tasks but rarely improve open ended generation can apply the technique with judgment. A PM who knows only the names will use techniques where they do not help and add latency and cost for no benefit.
Tradeoff: Building the intuition requires reading the original papers (Chain of Thought, ReAct, Self Consistency) plus running your own ablations. This takes 15 to 25 hours of focused work spread over several weeks. The payoff is technical conversations where you can explain when to use each technique rather than reciting names.
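To make the ablation idea concrete, here is a minimal sketch of a zero shot versus few shot comparison on a classification task. The call_model helper is a placeholder for whichever provider SDK you use, and the labeled examples are ones you write yourself; neither is a real library call.

# Zero shot vs few shot ablation on a classification task.
def call_model(prompt):
    raise NotImplementedError  # placeholder: swap in your provider's API call

ZERO_SHOT = (
    "Classify the support ticket as billing, bug, or feature request.\n"
    "Ticket: {ticket}\nCategory:"
)

FEW_SHOT = (
    "Classify the support ticket as billing, bug, or feature request.\n"
    "Ticket: I was charged twice this month.\nCategory: billing\n"
    "Ticket: The export button does nothing when I click it.\nCategory: bug\n"
    "Ticket: Please add dark mode.\nCategory: feature request\n"
    "Ticket: {ticket}\nCategory:"
)

def accuracy(template, labeled_examples):
    # labeled_examples is a list of (ticket_text, correct_label) pairs you wrote.
    correct = 0
    for ticket, label in labeled_examples:
        answer = call_model(template.format(ticket=ticket)).strip().lower()
        correct += int(answer == label)
    return correct / len(labeled_examples)

# Run accuracy(ZERO_SHOT, examples) and accuracy(FEW_SHOT, examples) on the same
# set; the gap tells you whether the extra tokens are buying you anything.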
Treating prompts as static rather than evolving
Hiring managers ask candidates how often they update prompts in production. The expected answer for a serious AI product is monthly to quarterly, with version control and evaluation gates. Aspiring PMs who treat prompts as one shot artifacts that ship and never change reveal that they have not worked in or studied production AI systems.
Tradeoff: Prompt versioning, evaluation gates, and rollout discipline are operational overhead that does not feel like product work. PMs who skip them ship faster initially and slower later when prompt changes break production. Build the habit of treating prompts like code from day one of practice.
A 30 Day Practice Plan With 12 Exercises
The plan below assumes 30 to 45 minutes per day. Each exercise is designed to build a specific skill and to compound on the previous ones. By day 30 you will have built an evaluation set, run dozens of comparisons, and developed real intuition about when techniques help and when they do not.
Week 1: Build your evaluation set and baseline
Day 1 to 2: Pick a single task you care about (summarizing meeting notes, classifying customer support tickets, generating PRD outlines, drafting Slack responses). Day 3 to 5: Build an evaluation set of 30 to 50 representative inputs. Spend the time to make these realistic, not synthetic. Day 6 to 7: Write a baseline prompt (your first attempt, no optimization) and run it across the full set. Score each output on a 1 to 5 scale. This is your starting point.
Tradeoff: Building the evaluation set is the slowest, least exciting part. Most learners skip it and regret it within two weeks because they have no way to tell if changes are improvements. Push through; this is the single highest leverage 5 hours of the 30 day plan.
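A minimal harness for the Week 1 baseline run can look like the sketch below: it reads the evaluation set, calls the model once per input, and writes a CSV you score by hand. The call_model helper, the prompt wording, and the file names are assumptions to adapt to your own task.

import csv
import json

def call_model(prompt):
    raise NotImplementedError  # placeholder: swap in your provider's API call

BASELINE_PROMPT = "Summarize the following meeting notes in five bullet points:\n\n{text}"

with open("eval_set.jsonl") as f:
    inputs = [json.loads(line) for line in f]

with open("baseline_run.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "input", "output", "score_1_to_5", "notes"])
    for row in inputs:
        output = call_model(BASELINE_PROMPT.format(text=row["input"]))
        writer.writerow([row["id"], row["input"], output, "", ""])  # fill the last two columns by hand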
Week 2: Iterate on prompt structure
Day 8 to 10: Try four variants of the prompt that change only structure (instruction first vs context first, with vs without explicit format, with vs without role). Run each across the full evaluation set and score. Day 11 to 12: Add a few shot variant with 3 examples. Day 13 to 14: Compare all variants in a single table. Identify which structural change moved the score the most. Hypothesize why.
Tradeoff: Running a prompt across 50 inputs takes 5 to 15 minutes per variant depending on the model. Plan around the wait. Many learners use this time to read about a technique they will try next week, which compounds the learning.
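When you reach the Day 13 to 14 comparison, a few lines of aggregation turn the scored outputs into the table. The scores.csv layout assumed below (variant, input_id, score columns) is just one possible format.

import csv
from collections import defaultdict

# Collect every 1 to 5 score, grouped by prompt variant.
scores = defaultdict(list)
with open("scores.csv") as f:
    for row in csv.DictReader(f):
        scores[row["variant"]].append(int(row["score"]))

for variant, vals in sorted(scores.items()):
    mean = sum(vals) / len(vals)
    print(f"{variant}: mean {mean:.2f}, worst {min(vals)}, n={len(vals)}")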
Week 3: Test advanced techniques
Day 15 to 17: Add chain of thought (asking the model to reason step by step before answering). Score and compare. Day 18 to 20: Add self consistency (run the same prompt 3 times with temperature above 0 and take the majority answer). Score and compare. Day 21: Try a smaller and a larger model on your best prompt and compare cost, latency, and quality. By the end of week 3 you should have data on which techniques actually move the metric for your task.
Tradeoff: Advanced techniques cost more and slow down responses. The exercise teaches you when the cost is worth it (often: yes for analytical tasks, no for simple classification). This intuition is the most valuable thing you build in the 30 days.
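As one example of what the Week 3 work looks like in code, here is a sketch of self consistency. The call_model helper is a placeholder for your provider's SDK, and three samples matches the plan above.

from collections import Counter

def call_model(prompt, temperature=0.0):
    raise NotImplementedError  # placeholder: swap in your provider's API call

def self_consistent_answer(prompt, n_samples=3, temperature=0.7):
    # Sample the same prompt several times above temperature 0, keep the majority answer.
    answers = [call_model(prompt, temperature=temperature).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]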
Week 4: Stress test and ship
Day 22 to 24: Add 10 adversarial inputs to your evaluation set (jailbreak attempts, edge cases, malformed inputs, off topic queries). Test how your best prompt handles them. Day 25 to 27: Add a guardrails layer (a second prompt that checks the output of the first) and re evaluate. Day 28 to 30: Write up your work as a case study with the evaluation set, the variants you tested, the scores, and what you learned. This case study becomes a portfolio piece.
Tradeoff: Adversarial testing is uncomfortable because your prompt will fail in interesting ways. That is the point. Hiring managers want PMs who have stared at failure modes and thought about mitigations rather than PMs who have only seen the happy path.
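The Day 25 to 27 guardrails layer can be as simple as a second prompt that passes or fails the first prompt's output before anyone sees it. The sketch below is one way to wire that up; call_model, the CHECK_PROMPT wording, and the fallback message are all assumptions to replace with your own.

def call_model(prompt):
    raise NotImplementedError  # placeholder: swap in your provider's API call

CHECK_PROMPT = (
    "You are reviewing an AI generated answer before it is shown to a user.\n"
    "Answer to review:\n{answer}\n\n"
    "Reply PASS if the answer is on topic, safe, and follows the required format. "
    "Otherwise reply FAIL with a one line reason."
)

def guarded_answer(task_prompt, fallback="Sorry, I can't help with that request."):
    answer = call_model(task_prompt)
    verdict = call_model(CHECK_PROMPT.format(answer=answer))
    return answer if verdict.strip().upper().startswith("PASS") else fallback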
The Evaluation Habits That Separate Serious AI PMs From Tinkerers
Hiring managers can tell within 10 minutes whether a candidate has internalized evaluation discipline or just memorized the vocabulary. The four habits below are the markers they look for in a technical screen.
Always run prompts across multiple inputs before judging
When asked "Is this a good prompt?" the right answer is never "On this one example, yes." The right answer is "On my evaluation set of 30 inputs it scored 4.1 out of 5, versus 3.4 for the previous version." PMs who reflexively reach for evaluation sets demonstrate the operating mindset of someone who has shipped AI features rather than just played with them.
Distinguish between average quality and tail quality
A prompt that scores 4.2 on average but produces a 1 out of 5 output on 8 percent of inputs is not safe to ship. Tail risk matters more than mean quality for most production AI features. PMs who track both signals (average and worst case) make better launch decisions than PMs who track only the headline number.
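Both numbers come out of the same column of scores. The sketch below shows the arithmetic on an illustrative list of 1 to 5 scores chosen to echo the example above.

scores = [5, 4, 5, 4, 1, 5, 4, 5, 5, 4, 5, 5]  # illustrative per input scores

mean_score = sum(scores) / len(scores)
failure_rate = sum(1 for s in scores if s <= 2) / len(scores)

print(f"mean {mean_score:.1f}, worst {min(scores)}, share scoring 1 or 2: {failure_rate:.0%}")
# A healthy looking mean (4.3 here) can coexist with a failure rate (8% here) too high to ship.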
Score blind whenever possible
When comparing prompt variants, do not look at which variant produced which output until after you have scored. Rater bias is real even with simple rubrics. Mixing outputs from different variants and scoring blind is a 30 second discipline that prevents you from fooling yourself. Hiring managers who hear about this practice immediately update their assessment of the candidate upward.
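One way to operationalize blind scoring with almost no overhead is to shuffle the outputs from both variants under anonymous ids, score from the shuffled sheet, and only unblind afterward. The sketch below assumes you already have the two lists of outputs in memory.

import random

def make_blind_sheet(outputs_a, outputs_b):
    rows = [("A", out) for out in outputs_a] + [("B", out) for out in outputs_b]
    random.shuffle(rows)
    sheet = [(f"item-{n}", out) for n, (_, out) in enumerate(rows)]        # score from this
    key = {f"item-{n}": variant for n, (variant, _) in enumerate(rows)}    # unblind with this afterward
    return sheet, key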
Rebaseline when models change
When the underlying model updates (a new GPT version, a new Claude release), re run your evaluation set on the new model with your existing prompt. Sometimes performance improves automatically. Sometimes it regresses in ways that need new prompts. PMs who do not rebaseline get surprised in production. Build this into your monthly routine as soon as you have a real evaluation set.
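A rebaseline run is the same loop you already have, pointed at two model versions. In the sketch below, call_model and score_output are placeholders for your provider call and your existing 1 to 5 rubric; only the comparison logic is the point.

def call_model(prompt, model):
    raise NotImplementedError  # placeholder: swap in your provider's API call

def score_output(output, item):
    raise NotImplementedError  # placeholder: your 1 to 5 rubric or judge prompt

def rebaseline(prompt_template, eval_set, old_model, new_model):
    deltas = []
    for item in eval_set:
        prompt = prompt_template.format(text=item["input"])
        old = score_output(call_model(prompt, old_model), item)
        new = score_output(call_model(prompt, new_model), item)
        deltas.append(new - old)
    return sum(deltas) / len(deltas)  # positive means the new model improved on your existing prompt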
Use a spreadsheet before you use a tool
There are excellent prompt evaluation platforms (Promptfoo, Langfuse, Braintrust, Humanloop) but starting with a tool is a trap. PMs who jump to tools before they have run evaluations manually never develop the underlying intuition. Spend your first 30 days in a Google Sheet with columns for input, prompt variant A output, prompt variant B output, A score, B score, and notes. Once the discipline is automatic, then adopt a tool. The tool is the cherry on top, not the foundation.
Develop Real Prompt Engineering Skill
Prompt design, evaluation harnesses, and the technical practice habits AI PMs need are core curriculum in the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.
Resources to Practice Against and Read Alongside
The 30 day plan above is the practice. The list below is the supporting material. Read selectively rather than exhaustively; reading without practicing is how learners fool themselves into thinking they have skill.
The Anthropic Prompt Engineering Guide
Anthropic publishes the most practical, opinionated prompt engineering guide of any frontier lab. Read it once in week one for orientation, then again in week four, when your evaluation experience will let you recognize what they mean by each technique. The second read produces 10x the insight of the first.
The OpenAI Cookbook
A repository of working examples for common AI tasks. Read these as case studies, not as templates to copy. Understanding why each example was built the way it was matters more than reusing the prompt. Pay particular attention to the structured output and function calling sections, which are foundational for production work.
The Chain of Thought, ReAct, and Self Consistency papers
The original papers behind the techniques you will use most often. They are short (8 to 12 pages each) and accessible. Reading them gives you the historical context and the experimental data that justifies why these techniques work. PMs who have read the original papers stand out in technical conversations.
Hamel Husain's applied LLM blog and talks
Hamel writes the most pragmatic content on evaluation, error analysis, and shipping LLM features. His talks at the AI Engineer World's Fair are required watching. Look for his posts on looking at your data and on writing your own LLM-as-judge prompts. These two ideas alone are worth more than most full courses.
The Latent Space podcast
Weekly conversations with practitioners building AI products in production. Listen during commutes or workouts. After 8 to 12 episodes you will have absorbed the working vocabulary and the current debates of the field, which gives you something concrete to talk about in interviews and on social platforms where the AI PM community gathers.