Physical AI for Product Managers: Embodied AI, Robotics, and VLA Models
TL;DR
Physical AI — robots and autonomous machines powered by foundation models — raised $55.8 billion in 2026 alone. Vision-Language-Action (VLA) models are the architectural breakthrough making general-purpose robots possible. Unlike software AI, physical AI failures have real-world consequences: your product can drop parts, collide with people, or cause fires. This guide covers the VLA architecture, the sim-to-real gap, real-time inference constraints, and what these mean for PMs deciding whether to build for the physical world.
What Physical AI Actually Is
Physical AI refers to AI systems that perceive, reason about, and act in the physical world — not just in text or pixels. Before the foundation model era, robots ran on hand-coded behavior trees and narrow computer vision models. They were brittle: a warehouse robot trained to pick blue bins would fail on red ones. Physical AI replaces those brittle stacks with a model brain that can follow natural language instructions, reason about novel situations, and generalize across objects it has never seen.
The physical AI market is projected to grow from $3.8 billion in 2026 to over $7.24 billion by 2030 — and that estimate was made before robotics companies raised $55.8 billion in a single year. The categories that are furthest along:
Warehouse and logistics robots
Amazon's Sequoia and Sparrow systems now pick over 75% of items autonomously. Autonomous mobile robots (AMRs) route themselves dynamically based on live floor state, not pre-programmed paths.
Manufacturing cobots
Collaborative robots like Figure's humanoid, Apptronik's Apollo, and Boston Dynamics' Atlas are being piloted at BMW and GE. They assemble parts, hand tools to human workers, and inspect quality.
Autonomous vehicles
Waymo operates robotaxi fleets in four US cities. The AI stack — perception, prediction, planning — is a physical AI system that must reason about other vehicles, pedestrians, and edge cases in real time.
Agricultural and inspection robots
Drones and ground robots that spray fields, inspect pipelines, and survey construction sites. These run on GPS, computer vision, and increasingly LLM-based instruction following.
As an AI PM, you do not need to build a robot from scratch to care about physical AI. Products built on top of physical AI platforms — fleet management software, robot task orchestration, human-robot collaboration tools, digital twins — all require PMs who understand how the underlying AI stack works.
Vision-Language-Action (VLA) Models: The Architecture Behind General Robots
The breakthrough enabling general-purpose physical AI is the Vision-Language-Action (VLA) model. A VLA takes visual input from the robot's cameras plus a natural language instruction ("pick up the red cup and place it in the bin on the left") and outputs low-level motor commands (joint angles, end-effector positions, gripper force) at 30 to 100 Hz. In its end-to-end form it replaces separate perception, planning, and control modules with a single model, although production stacks often keep specialized perception and control layers running alongside it.
Vision Encoder
Processes one or more camera streams (RGB, depth, wrist-mounted) into visual tokens. Usually a pretrained image model (ViT, DINOv2, or a CLIP-style encoder) whose output is projected into the same embedding space as the language tokens.
Language Input
The natural language task instruction is tokenized and embedded exactly as in an LLM. This is what allows VLAs to follow free-form commands without re-training: 'sort the packages by size' is as valid as 'pick up the cup.'
Transformer Backbone
A large transformer (often a fine-tuned LLM backbone) fuses the visual tokens and language tokens into a joint representation. It reasons about what the robot sees relative to what it is being asked to do.
Action Head
Instead of text tokens, the final stage outputs actions, typically joint positions or end-effector deltas. Some VLAs emit discrete action tokens, with each action dimension quantized into bins; others use a diffusion head or flow-matching decoder to produce continuous, smooth motor trajectories.
Inference Loop
The full stack runs at 30-100 Hz, meaning the model re-evaluates the scene and outputs new motor commands many times per second. At 30 Hz each cycle has about 33 ms; at 100 Hz it has 10 ms. That is a far tighter deadline than the multi-second responses users tolerate from a text LLM, and meeting it requires aggressive optimization: quantization, hardware accelerators, and on-device deployment.
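To see why the control rate forces quantization, here is a rough back-of-envelope sketch. It assumes a dense model that reads every weight from memory once per control step, which is a simplification that ignores activations, caching, and action chunking. The 7B model size and the 200 GB/s memory bandwidth are illustrative assumptions, not the specification of any particular model or chip.

```python
# Back-of-envelope sketch: time per control step and the weight-read
# bandwidth a dense model needs at a given control rate.
# Assumption: every weight is read from memory once per control step.


def control_period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    return 1000.0 / rate_hz


def weight_traffic_gb_per_s(params_billion: float,
                            bytes_per_param: float,
                            rate_hz: float,
                            passes_per_step: int = 1) -> float:
    """Memory traffic in GB/s from reading the weights each step.

    params_billion * bytes_per_param is the model size in GB.
    """
    return params_billion * bytes_per_param * rate_hz * passes_per_step


if __name__ == "__main__":
    HYPOTHETICAL_DEVICE_GB_PER_S = 200.0  # assumed edge-device bandwidth

    for rate in (30, 100):
        print(f"{rate} Hz -> {control_period_ms(rate):.1f} ms per step")

    for label, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
        needed = weight_traffic_gb_per_s(7, bytes_per_param, 30)
        verdict = "fits" if needed <= HYPOTHETICAL_DEVICE_GB_PER_S else "does not fit"
        print(f"7B @ {label}, 30 Hz: {needed:.0f} GB/s needed ({verdict})")
```

Under these assumptions only the 4-bit variant fits within the assumed bandwidth, which is the kind of result that turns quantization from an optimization into a requirement.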
Leading open-weight VLA models in 2026 include pi0 (Physical Intelligence), OpenVLA, Octo, and InternVLA-M1. Google DeepMind's RT-2 was an early influential VLA; its successor models are deployed in real warehouse settings. The competitive landscape looks like the early LLM market: several open-weight contenders, proprietary frontier models from well-funded labs, and a rapidly closing quality gap between them.
The Three Technical Challenges That Determine Whether Physical AI Ships
Software AI products fail gracefully: a bad output produces a wrong answer, which a user ignores or corrects. Physical AI products fail consequentially: a bad output moves a 50 kg arm into a human. Understanding these three challenges tells you whether a physical AI product is ready to ship.
The Sim-to-Real Gap
What it is: Most physical AI models are trained in simulation — faster, cheaper, and safer than physical rollouts. But simulation physics is imperfect. Friction, deformable objects, lighting variation, and sensor noise all differ between simulation and the physical world. A model that achieves 95% success in simulation may achieve 60% in deployment.
PM implication: The sim-to-real gap is your hidden deployment risk. Demand real-world pilot data before any launch commitment. Treat simulation success rates as a ceiling, not a floor. Build in a 3-6 month real-environment validation phase that does not appear on marketing timelines.
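When pilot data arrives, the trial count matters as much as the headline success rate. The sketch below shows one way to report the gap: it computes the difference between simulated and real success rates and a Wilson score interval for the real-world rate. The function names and the example of 60 successes in 100 trials are assumptions for illustration, chosen to match the 95% versus 60% example above.

```python
import math


def sim_to_real_gap(sim_rate: float, real_rate: float) -> float:
    """Absolute drop in success rate from simulation to deployment."""
    return sim_rate - real_rate


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(successes=60, trials=100)
    print(f"gap: {sim_to_real_gap(0.95, 0.60):.2f}")
    print(f"real-world success rate is plausibly between {low:.2f} and {high:.2f}")
```

With only 100 pilot trials the plausible range spans roughly twenty percentage points, which is a reason to ask how many real-world trials sit behind any quoted success rate.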
Real-Time Inference Constraints
What it is: A language model can take 2-5 seconds to respond; users tolerate this. A robot picking an object from a moving conveyor belt must decide and act in under 50 ms. Running a large VLA model at 30 Hz (about 33 ms per control step) on edge hardware requires quantization, model distillation, and often specialized edge accelerators (NVIDIA Jetson, Google Edge TPU, or vendor-specific chips).
PM implication: Latency is a hard product constraint in physical AI, not a performance metric to optimize post-launch. Build latency budgets into your technical spec before committing to product capabilities. A line like 'this feature requires 200 ms of compute' decides feasibility outright: it cannot run inside a 33 ms control cycle, no matter how valuable the feature is.
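A latency budget can be as simple as a table of per-stage timings checked against the control period. The following is a minimal sketch of that check. The stage names, the timings, and the 20% safety margin are hypothetical values for illustration; real numbers would come from profiling on the target hardware.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    period_ms: float    # time per control step at the target rate
    total_ms: float     # sum of all stage latencies
    headroom_ms: float  # usable time left over (negative means over budget)
    fits: bool


def check_latency_budget(stages_ms: dict, rate_hz: float,
                         safety_margin: float = 0.2) -> BudgetReport:
    """Compare summed stage latencies with the usable part of the period.

    safety_margin reserves a fraction of each period for jitter.
    """
    period = 1000.0 / rate_hz
    usable = period * (1.0 - safety_margin)
    total = sum(stages_ms.values())
    return BudgetReport(period, total, usable - total, total <= usable)


if __name__ == "__main__":
    # Hypothetical per-stage timings in milliseconds.
    stages = {
        "camera_capture": 5.0,
        "preprocessing": 3.0,
        "vla_inference": 18.0,
        "action_decode": 2.0,
        "command_dispatch": 1.0,
    }
    for rate in (30, 20):
        report = check_latency_budget(stages, rate_hz=rate)
        status = "OK" if report.fits else "OVER BUDGET"
        print(f"{rate} Hz: {report.total_ms:.0f} ms used of "
              f"{report.period_ms:.1f} ms period -> {status}")
```

In this example the pipeline misses the 30 Hz target once the margin is reserved but fits at 20 Hz, so the product question becomes whether the use case can tolerate the lower rate or the model has to shrink.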
Safety Certification and Liability
What it is: Autonomous systems in physical environments face regulatory and liability requirements that software AI does not. Medical robots may require FDA clearance. Industrial cobots must meet ISO 10218 and ISO/TS 15066 safety standards. Autonomous vehicles navigate state and federal regulation. A product launch decision is not just an engineering decision — it is a legal and regulatory one.
PM implication: Map the regulatory path on day one, not at launch. Identify whether your physical AI product is a medical device, an industrial machine, or a consumer product — each has a different compliance track with a different timeline. Compliance delays are measured in years, not sprints.
Build AI Products That Actually Ship
The AI PM Masterclass covers the technical foundations behind AI products — including how architectural decisions translate into product constraints. Taught live by a Salesforce Sr. Director PM.
Physical AI Product Architecture: How It Differs from Software AI
A software AI product stack (API call → LLM → response → UI) is relatively simple. A physical AI product stack has additional layers at every level. Understanding the stack tells you where the product decisions live.
Hardware layer
The physical robot body: motors, sensors (cameras, LiDAR, force/torque sensors, IMU), compute hardware (edge GPU, TPU, custom ASIC). Hardware choices constrain the software stack: an underpowered edge chip cannot run a 7B parameter VLA at 30 Hz.
Perception layer
Real-time processing of sensor data into a representation the planning model can use. Camera streams, depth maps, point clouds. A separate perception stack often runs alongside the VLA for tasks like object detection and pose estimation where lower latency specialized models outperform the VLA.
Planning layer (the VLA)
The core intelligence: given visual input and language instruction, produce an action. This is where foundation model capability lives — and where most research investment is going in 2026.
Control layer
Converts high-level action plans into low-level motor commands: joint angles, PWM signals, torque limits. Often runs on a separate real-time system (an RTOS or real-time Linux, frequently with ROS 2 middleware on top) that provides deterministic timing guarantees the AI model layer cannot.
Telemetry and remote monitoring
Unlike software products, physical AI products cannot be patched instantly if something goes wrong in the field. Remote monitoring, over-the-air update infrastructure, and telemetry logging are not optional — they are how you iterate without shipping engineering teams to every deployment site.
Fleet management
Most physical AI deployments involve fleets of identical or similar robots. Fleet management software — deployment, monitoring, task assignment, maintenance scheduling — is often where the software PM contribution is highest and most independent of robotics hardware expertise.
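The split between the planning layer and the control layer described above can be pictured as a guard that sits between the model and the motors. The sketch below is a simplified illustration, not a certified safety function: it limits how far each joint may move in one control step and keeps the result inside the joint's range. The function name, the limit values, and the units (radians) are assumptions for the example.

```python
def clamp_joint_command(current, target, lower, upper, max_step):
    """Return a per-joint command that respects step and range limits.

    current, target: joint positions now and as requested by the model
    lower, upper:    allowed range for each joint
    max_step:        largest change allowed per joint in one control step
    """
    if not (len(current) == len(target) == len(lower) == len(upper)):
        raise ValueError("all joint lists must have the same length")

    safe = []
    for q_now, q_target, q_min, q_max in zip(current, target, lower, upper):
        # Limit how far the joint may move in this step.
        step = max(-max_step, min(max_step, q_target - q_now))
        # Keep the commanded position inside the joint's range.
        safe.append(max(q_min, min(q_max, q_now + step)))
    return safe


if __name__ == "__main__":
    command = clamp_joint_command(
        current=[0.0, 1.0],
        target=[0.5, 1.05],   # what the model asked for
        lower=[-1.0, -1.0],
        upper=[1.0, 1.02],
        max_step=0.1,
    )
    print(command)
```

The design point is that the guard runs in the deterministic control layer, so a bad model output is bounded regardless of what the planning layer produces.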
What Physical AI Means for Product Decisions
Several product management principles that work well for software AI need to be relearned for physical AI. The differences are not cosmetic.
Failure modes are physical
A hallucinating LLM outputs a wrong answer. A hallucinating physical AI model moves a robot arm into a human or drops a fragile object. Your risk framework must include physical harm scenarios, not just output quality scenarios. Red-teaming in physical AI includes literal adversarial physical environments.
Iteration cycles are slower
Software AI: deploy a new prompt or model to production in minutes. Physical AI: test on physical hardware, which requires scheduling robot time, resetting environments after each test, and often shipping test units to partner sites. Plan for 10x longer iteration cycles than your software PM experience suggests.
Buyers are often not operators
The person who buys a warehouse robotics system (an operations VP) is not the person who interacts with it daily (a floor worker). The robot itself is not a user in the usual sense either: it serves a task, not a person. User research therefore requires observational fieldwork on physical sites, not just interviews and session recordings.
Hardware-software co-design is the roadmap
Shipping a new model capability may require new sensors or upgraded edge compute. Roadmap dependencies between software (model quality), hardware (sensor resolution, compute power), and compliance (safety certification) mean roadmap planning for physical AI is inherently cross-disciplinary.
Data collection is a physical logistics problem
Training data for VLAs comes from robot demonstrations — either human teleoperation or prior robot runs. Collecting 10,000 demonstration trajectories requires running robots in the real world for weeks, which has a physical cost measured in engineering hours, facility time, and hardware wear.
Uptime SLAs have physical consequences
A software product that goes down for 30 minutes costs user trust. A robot picking line that goes down for 30 minutes stops production at a cost of thousands of dollars per minute. Reliability engineering and on-call expectations are different in kind, not just degree.
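The uptime point can be made concrete by converting an availability target into expected downtime and cost. This is a minimal sketch: the cost of 2,000 dollars per minute is a hypothetical figure standing in for "thousands of dollars per minute", and a month is taken as 30 days.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def downtime_minutes_per_month(availability: float) -> float:
    """Expected downtime per month for an availability such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_MONTH


def downtime_cost_per_month(availability: float, cost_per_minute: float) -> float:
    """Expected monthly cost of downtime at a given cost per minute."""
    return downtime_minutes_per_month(availability) * cost_per_minute


if __name__ == "__main__":
    COST_PER_MINUTE = 2000.0  # hypothetical cost of a stopped picking line
    for availability in (0.99, 0.999, 0.9999):
        minutes = downtime_minutes_per_month(availability)
        cost = downtime_cost_per_month(availability, COST_PER_MINUTE)
        print(f"{availability:.2%} uptime -> {minutes:.1f} min/month, "
              f"about ${cost:,.0f}")
```

Framing the SLA this way lets reliability work be weighed against its dollar value when prioritizing the roadmap.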
How to Break Into Physical AI as an AI PM
Most physical AI companies are not looking for PMs who can code robots. They need PMs who understand AI, think rigorously about safety, and can work across hardware, software, and regulatory domains. The fastest path in depends on your background.
From software AI PM
The strongest transfer path. Companies like Figure, Apptronik, and Boston Dynamics actively recruit software AI PMs who have shipped AI products to production. Your evaluation and reliability experience transfers directly. The gap to close: robotics hardware vocabulary (kinematics, actuators, SLAM) and safety certification basics.
From hardware PM or embedded systems
Excellent foundation for the hardware-software co-design requirements. Gap to close: AI/ML product experience (evals, model selection, prompt management, inference cost). A side project deploying an open-weight VLA, even in simulation using NVIDIA Isaac Sim or MuJoCo, builds credibility fast.
From operations or industrial backgrounds
Domain expertise in manufacturing, logistics, or healthcare is underrated in physical AI. Companies struggle to find PMs who both understand AI and can walk a factory floor. Gap to close: the technical fundamentals of VLAs and the data flywheel for physical AI training.
Companies hiring physical AI PMs in 2026
On the hardware-software stack: Figure, Apptronik, Boston Dynamics, 1X, Agility Robotics, AGIBOT. On fleet management and enterprise deployment software built on top of these platforms: Machina Labs, Robust AI, Dexterity AI. Traditional robotics companies adding foundation model layers: FANUC, ABB Robotics, Yaskawa, Universal Robots. Autonomous vehicle companies: Waymo, Zoox, Nuro, Aurora. Large tech companies building internal physical AI teams: Amazon Robotics, Tesla (Optimus), Apple.