AI PRODUCT MANAGEMENT

How to Run an AI Product Standup That Actually Surfaces Risk

By Institute of AI PM · 11 min read · May 4, 2026

TL;DR

Most AI teams run the standard three question standup (yesterday, today, blockers) and wonder why incidents and quality regressions still take them by surprise. The format does not surface the risks that matter on AI teams: silent quality drift, evaluation set gaps, on call fatigue, and dependency slippage with applied science or platform teams. This guide gives AI PMs a 15 minute standup format with a sample agenda, four role specific prompts, a tight escalation rule, and a weekly risk roll up that turns standup from a status meeting into the cheapest risk surfacing tool the team has.

Why the Classic Standup Misses AI Risk

The standard agile standup was designed for software teams shipping deterministic code. It asks each team member what they did yesterday, what they will do today, and what is blocking them. On an AI team, this format reliably misses the risks that bite hardest, because those risks do not show up as blockers.

1. Quality drift does not feel like a blocker

When the model has degraded by 2 percentage points on the evaluation set, the engineer running the eval will not raise it as a blocker because no one is blocked. They will note it in a Slack thread or wait for the weekly review. By the time the PM hears about it, the team is two sprints into work built on top of a model that regressed. Standups must explicitly ask about quality movement; otherwise it stays invisible until it is large.

Tradeoff: Asking about quality every day adds about 2 minutes to standup. Skipping the question saves the time but loses the early warning. The teams that run weekly quality reviews instead of daily checks routinely catch drift only after it has compounded.

2. On call burden is hidden until someone burns out

AI products generate more incidents per shipped feature than traditional software because model behavior changes under load and over time. The engineer who was on call last week may have been paged 8 times overnight and is exhausted today, but they will not say so unprompted. Standups that do not have a dedicated on call check in produce surprise resignations and silent quality decay because the on call engineer is too tired to investigate root causes.

Tradeoff: A short on call check in feels redundant on quiet weeks. It is not. The 1 minute spent every Monday surfacing on call load is the cheapest health signal the team has.

3. Evaluation set gaps surface as failed releases, not as blockers

A team building an AI feature without strong evaluation coverage on a relevant input slice will not call this out in standup, because there is no daily work item that says "fix the evaluation set." The gap shows up at acceptance review, when the team realizes they cannot defend the quality of the release. Standups need a regular prompt about evaluation coverage so the gap is exposed weeks earlier.

Tradeoff: Asking about evaluation coverage is annoying for the engineering lead because the answer is usually "we need to do more work." That is the point. Surfacing the gap creates the work item that prevents the late stage scramble.

4. Cross team dependencies degrade quietly

AI features almost always depend on other teams (platform, data, legal, applied science). When a dependency starts slipping, the dependent team often does not flag it because they hope to recover. The PM hears about the slip 2 sprints later when the date moves. A standup that explicitly asks about cross team dependencies surfaces these slips while they are small and recoverable.

Tradeoff: Listing dependencies every day duplicates information visible in roadmap tools. Yes, but the redundancy creates a forcing function for the PM to follow up before the slip becomes a date change.

A 15 Minute Standup Format That Surfaces Risk

The format below replaces the classic three question standup with a four part structure that fits in 15 minutes for a team of 6 to 9 people. Each part has a hard timebox. Anything that needs more discussion gets parked and handled in a followup with a named owner.

Part 1: On call and incident pulse (2 minutes)

The on call engineer reports overnight pages, current incidents, and any unresolved customer reports. Format: number of pages, top theme, hours of sleep impacted. If the on call engineer has been paged more than 5 times in the last week, the PM owns finding 1 hour of recovery time today (no meetings, no new work). This is non negotiable. Also surface any incidents in flight that the wider team should know about.

Tradeoff: Adding a recovery rule may slow short term progress because an engineer is offline. It prevents the larger cost, an exhausted engineer making a quality mistake or quitting.
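The page count rule above is mechanical enough to automate so it never depends on memory. Below is a minimal Python sketch of the check; the page log format, the function name, and the trailing 7 day window are assumptions for illustration, not a prescribed tool.

```python
from datetime import datetime, timedelta

PAGE_THRESHOLD = 5  # more than 5 pages in 7 days triggers the recovery rule


def needs_recovery_block(page_timestamps, now=None):
    """Return True if the on call engineer was paged more than
    PAGE_THRESHOLD times in the trailing 7 days."""
    now = now or datetime.now()
    window_start = now - timedelta(days=7)
    recent = [t for t in page_timestamps if t >= window_start]
    return len(recent) > PAGE_THRESHOLD


# Hypothetical example: 6 pages in the last week means the PM owes
# the on call engineer a recovery hour today.
now = datetime(2026, 5, 4, 9, 0)
pages = [now - timedelta(hours=h) for h in (2, 14, 30, 50, 90, 140)]
print(needs_recovery_block(pages, now=now))  # True
```

Running this from the pager tool's export each morning makes Part 1 a report of fact rather than a judgment call.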

Part 2: Quality and evaluation movement (3 minutes)

The applied scientist or ML engineer reports any movement in the eval scores from the last 24 hours and flags any inputs that have started failing. Format: eval set name, score change, hypothesis for cause, planned investigation. If a metric moved more than 1 percentage point and no one knows why, that becomes the top investigation of the day. Skip this section only on days with no eval runs, and announce the skip explicitly so the silence is not interpreted as good news.

Tradeoff: Daily eval reporting requires daily eval runs, which costs compute. The cost is small relative to the cost of catching a regression late. Set the eval to run automatically overnight so the morning report is ready before standup.
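Where the team already logs eval scores per run, the 1 percentage point rule can be checked automatically before standup. A minimal sketch, assuming scores are stored as name-to-percent dicts; the function name and data shapes are hypothetical:

```python
DRIFT_THRESHOLD_PP = 1.0  # flag any move larger than 1 percentage point


def flag_drift(previous, latest, threshold_pp=DRIFT_THRESHOLD_PP):
    """previous/latest: dicts of {eval_set_name: score_in_percent}.
    Returns a list of (eval_set_name, delta_pp) worth raising in standup."""
    flagged = []
    for name, new_score in latest.items():
        old_score = previous.get(name)
        if old_score is None:
            continue  # new eval set, nothing to compare against yet
        delta = new_score - old_score
        if abs(delta) > threshold_pp:
            flagged.append((name, round(delta, 2)))
    return flagged


# Hypothetical scores from the last two overnight runs.
prev = {"intent_accuracy": 91.4, "safety_pass_rate": 99.1}
curr = {"intent_accuracy": 89.2, "safety_pass_rate": 99.3}
print(flag_drift(prev, curr))  # [('intent_accuracy', -2.2)]
```

Wiring this into the overnight eval job means the morning report in Part 2 is already written when standup starts.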

Part 3: Work in flight by stream (7 minutes)

Each engineer reports against the three backlog streams (model work, bugs, features) with a focus on what is at risk, not what is on track. Use the prompt: what changed since yesterday, and what is more or less likely to land? Avoid status theater. The PM listens for soft slip signals ("the work is slightly bigger than I thought," "the dependency has not responded") and parks them for follow up after the meeting.

Tradeoff: A risk first standup is harder for engineers because they have to surface what they are unsure about. Coach the team to do this for two weeks until it becomes habit. The early discomfort is worth the prevention of late surprises.

Part 4: Cross team dependency and decision check (3 minutes)

The PM names every cross team dependency the team is waiting on and the date they expect movement. If a date has slipped or the contact has gone quiet, the PM owns following up that day. Also surface decisions the team needs from outside (legal review, design approval, exec sign off). Decisions that have been waiting more than 5 business days get escalated by name in standup, which creates accountability for the PM to push.

Tradeoff: Naming slow decisions can feel confrontational. It is also the only reliable way to keep AI projects moving, because cross functional decisions are the most common source of multi week delays.
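The 5 business day rule lends itself to a small script run before standup. A sketch, assuming pending decisions are tracked as description and request-date pairs; the names and data shapes are assumptions for illustration:

```python
from datetime import date, timedelta

ESCALATION_AGE_BDAYS = 5  # decisions older than this get named in standup


def business_days_between(start, end):
    """Count weekdays strictly after start, up to and including end."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Mon=0 .. Fri=4
            days += 1
    return days


def decisions_to_escalate(pending, today):
    """pending: list of (description, date_requested) tuples."""
    return [
        desc for desc, requested in pending
        if business_days_between(requested, today) > ESCALATION_AGE_BDAYS
    ]


# Hypothetical pending decisions as of Monday, May 4, 2026.
pending = [
    ("legal review of output disclosure", date(2026, 4, 20)),
    ("design approval for error states", date(2026, 4, 30)),
]
print(decisions_to_escalate(pending, today=date(2026, 5, 4)))
# ['legal review of output disclosure']
```

Printing this list at the top of Part 4 turns "who are we waiting on" from recall into a checklist.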

Role Specific Prompts That Surface Hidden Signal

Generic standup questions produce generic answers. The PM should rotate through role specific prompts that pull out the signal each role uniquely sees. Use one of the prompts below per role each week, alternating across the month.

For the applied scientist or ML engineer

What input slice are you most uncertain about right now? Where is the eval set probably wrong? What model behavior have you noticed that you cannot yet explain? These prompts surface the soft tail risks that quantitative metrics do not capture. Document the answers in a running list and revisit it weekly. The recurring themes are usually the next major investigation.

For the software engineer building product surfaces

Which AI behavior is hardest to design around in your code right now? Are there model outputs that break your UI assumptions? What error states are you guessing at because product has not specified? These prompts expose the gaps between model behavior and product intent that are usually only discovered at acceptance review.

For the on call engineer

What pattern are you seeing in pages that we are not yet tracking as a backlog item? What manual fix did you have to repeat? Which alerts are noisy and which are actually useful? These prompts convert tribal on call knowledge into backlog items and reduce repeat firefighting. Most teams underinvest in this conversion and pay for it later in burnout.

For the design or research partner

What user behavior have you observed that surprised you? Where are users misinterpreting AI output? What disclosure or explanation is missing? Design and research are often closest to the user reality of an AI product. The PM who does not pull this signal into standup is flying half blind.

The Escalation Rule: When an Item Leaves Standup

If any item raised in standup looks like it could become a launch blocker, a customer incident, or a dependency that slips by more than a week, the PM calls it out and schedules a 30 minute followup the same day with the right 3 to 5 people. Do not solve it in standup. Standups that turn into discussions punish the rest of the team and train people to stay quiet next time. The escalation rule turns standup into a routing layer, which is its highest value role.

Run Standups That Catch Risk Early

Standup design, sprint operations, and risk surfacing for AI teams are taught live in the AI PM Masterclass by a Salesforce Sr. Director PM.

A Weekly Risk Roll Up That Closes the Loop

Standup surfaces risk every day. A weekly risk roll up converts that signal into a written record and a list of decisions. Without the roll up, the same risk gets raised in standup three weeks in a row with no resolution, and the team learns that surfacing risk is pointless. Run the roll up every Friday for 30 minutes with the PM, tech lead, and on call engineer.

Section 1: Top three risks of the week with state changes

List the top three risks raised in standup this week. For each, note whether the risk got better, worse, or held steady. State changes matter more than the absolute risk level because they show whether actions are working. If a risk has held steady or worsened for three weeks running, escalate it out of the team to leadership with a written ask. Leadership cannot help with risks they have not heard about.
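The three week escalation rule can be encoded directly if the roll up records a weekly state per risk. A minimal sketch, assuming states are logged as the strings better, steady, and worse; the representation and function name are assumptions for illustration:

```python
ESCALATE_AFTER_WEEKS = 3  # steady-or-worse streak that triggers escalation


def should_escalate(weekly_states):
    """weekly_states: list of 'better' | 'steady' | 'worse', oldest first.
    Escalate when the most recent ESCALATE_AFTER_WEEKS entries are all
    steady or worse; any 'better' week resets the streak."""
    recent = weekly_states[-ESCALATE_AFTER_WEEKS:]
    return (
        len(recent) == ESCALATE_AFTER_WEEKS
        and all(state in ("steady", "worse") for state in recent)
    )


print(should_escalate(["worse", "steady", "steady"]))  # True
print(should_escalate(["steady", "better", "worse"]))  # False
```

Even kept in a spreadsheet rather than code, recording the state transitions each Friday is what makes the escalation trigger objective.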

Section 2: On call load and recovery actions

Report total pages this week, the distribution across the team, and whether any engineer is at risk of burnout. Plan one concrete action to reduce on call load next week (a noisy alert to fix, a doc to write, a runbook to update). The plan must be assigned by name and dated, not left as a general intent. Track completed on call load reduction items in a running log so the team sees the trend.

Section 3: Evaluation set health and coverage gaps

Note any evaluation set updates this week (new examples added, examples removed, scoring rubrics changed) and any coverage gaps that became visible. Decide one coverage investment for next week (a new slice to score, a new safety category, a new size cohort). Eval coverage that is not invested in degrades, because the input space drifts faster than the test set.

Section 4: Dependencies that need PM action next week

List every cross team dependency that is at risk and the action the PM will take to unblock it. Common actions: send a written summary to the dependency owner, schedule a meeting with the right decision maker, or escalate to a director with a single page brief. The PM owns these actions, not the engineering team. Closing the loop on dependencies is the highest leverage activity an AI PM does in any given week.

Master AI Product Operations in the Masterclass

Standup design, weekly cadence, and risk management for AI teams are core curriculum, taught live by a Salesforce Sr. Director PM.