Edge AI: Running Models on Device for Speed and Privacy
TL;DR
Not every AI feature needs a cloud API call. Edge deployment runs models directly on user devices — phones, laptops, embedded hardware — eliminating network latency, cutting marginal inference costs to zero, and keeping sensitive data entirely on-device. Apple Intelligence, Google's on-device Gemini Nano, and Samsung Galaxy AI all run models on-device. But edge AI comes with hard constraints: model size limits, device fragmentation, battery drain, and model updates that wait on app releases or background downloads. This guide covers when edge AI makes product sense, what the technical constraints mean for your roadmap, and how to design AI features that work within on-device limitations.
What Edge AI Means and Why It Matters for Products
Edge AI means running machine learning inference directly on the user's device rather than sending data to a cloud server. The "edge" is the endpoint where data is generated — a smartphone, a laptop, a camera, an IoT sensor, or an embedded processor in a car.
Zero-latency inference
Cloud AI features have a latency floor: network round-trip time plus server processing time. Even with the fastest APIs, this means 200–800ms minimum for a single inference call. On-device inference eliminates network latency entirely. A model running on an iPhone's Neural Engine can complete inference in 10–50ms. For real-time features — live camera filters, voice transcription, predictive keyboard, gesture recognition — that gap is the difference between a feature that feels instant and one that feels sluggish.
Privacy by architecture
When data never leaves the device, privacy is guaranteed by design — not by policy. Apple's on-device processing for photos, Siri requests, and keyboard predictions means user data cannot be intercepted in transit, stored on Apple's servers, or accessed by Apple employees. This is a fundamentally stronger privacy guarantee than any cloud-based system can offer, regardless of encryption or data handling policies. For products in healthcare, finance, or markets with strict data sovereignty laws (EU, China), edge AI may be the only viable architecture.
Offline functionality
Cloud AI features fail when connectivity fails. Edge AI features work everywhere: airplanes, subways, rural areas, developing markets with unreliable connectivity. Google Translate's on-device translation, Apple's offline dictation, and Samsung's on-device photo editing all work without any network connection. If your users need AI features in environments with poor or no connectivity, edge deployment is not optional.
Cost structure inversion
Cloud AI costs scale linearly with usage: every inference call has a cost. Edge AI has a high fixed cost (model development, optimization, device testing) but zero marginal cost per inference. Once the model is on the device, the user can run it a million times and your server bill does not change. For high-frequency AI features — autocomplete, spam filtering, photo enhancement — edge deployment can reduce inference costs by 95%+ at scale.
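To see how quickly the fixed cost of edge deployment can pay for itself, here is a back-of-envelope break-even sketch. Every number below is an illustrative assumption, not real pricing.

```python
# Back-of-envelope cost comparison: cloud per-call pricing vs. edge fixed cost.
# All figures are illustrative assumptions, not real pricing.

cloud_cost_per_call = 0.002        # assumed $ per inference via a hosted API
calls_per_user_per_month = 600     # e.g. a high-frequency feature like autocomplete
monthly_active_users = 50_000

edge_fixed_cost = 250_000          # assumed one-time cost: optimization, QA, device testing
edge_marginal_cost = 0.0           # on-device inference adds nothing to the server bill

monthly_cloud_cost = cloud_cost_per_call * calls_per_user_per_month * monthly_active_users
months_to_break_even = edge_fixed_cost / monthly_cloud_cost

print(f"Cloud bill per month: ${monthly_cloud_cost:,.0f}")        # $60,000
print(f"Edge pays for itself after {months_to_break_even:.1f} months")  # ~4.2 months
```

The exact numbers will differ for every product, but the shape of the curve is the point: cloud costs grow with usage, edge costs do not.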
When to Deploy on Edge vs. Cloud vs. Hybrid
The edge-vs-cloud decision is not binary. Most production AI products use a hybrid architecture where some inference runs on-device and some runs in the cloud. The decision depends on model complexity, latency requirements, privacy constraints, and connectivity assumptions.
Deploy on edge when latency is critical
Real-time features that users interact with continuously — keyboard predictions, voice activity detection, face detection, on-device search, gesture recognition — must run on-device. Any perceptible delay makes these features feel broken. Edge deployment is also essential for features that need to respond to sensor data in real-time: fall detection on Apple Watch, driver attention monitoring in vehicles, and real-time translation overlays in AR glasses.
Decision criteria: If your feature requires sub-100ms response time and users interact with it more than once per session, strongly prefer edge deployment.
Deploy in cloud when model capability matters most
Complex reasoning, long-form generation, multi-modal understanding, and tasks that require access to large knowledge bases cannot run on current edge hardware. GPT-4-class models require hundreds of gigabytes of memory and massive compute. Even aggressively quantized, they do not fit on mobile devices. If your feature needs frontier model capabilities — detailed analysis, creative writing, complex code generation — cloud deployment is currently the only option.
Decision criteria: If the task requires a model larger than 3B parameters and quality cannot be sacrificed, deploy in the cloud and optimize latency through streaming and caching.
Use hybrid when you need both
The most sophisticated products use a tiered architecture: a small, fast model on-device handles initial processing, filtering, and simple tasks, while complex tasks are routed to a larger cloud model. Apple Intelligence uses this pattern: on-device models handle text rewriting and summarization for short content, while longer or more complex tasks are routed to Apple's Private Cloud Compute. This hybrid approach optimizes for latency on common cases while preserving capability for complex cases.
Decision criteria: If 60–80% of your use cases can be handled by a small model, deploy that model on-device and route the remaining 20–40% to the cloud. This gives you the best UX for most users while maintaining quality for edge cases.
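A minimal routing sketch of this tiered pattern is below. The word-count limit, confidence heuristic, and model interfaces are hypothetical placeholders, not a prescribed design.

```python
# Tiered router sketch: a small on-device model handles common cases,
# harder or longer requests fall through to a cloud model.

MAX_ON_DEVICE_WORDS = 512          # assumed input-length limit for the small model
CONFIDENCE_FLOOR = 0.7             # below this, escalate to the cloud

def handle_request(text, on_device_model, cloud_client):
    # Route long or complex inputs straight to the cloud.
    if len(text.split()) > MAX_ON_DEVICE_WORDS:
        return cloud_client.complete(text)

    # Try the local model first; assume it returns a result plus a confidence score.
    result, confidence = on_device_model.run(text)
    if confidence >= CONFIDENCE_FLOOR:
        return result

    # Fall back to the cloud when the small model is unsure.
    return cloud_client.complete(text)
```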
Technical Constraints of Edge Deployment
Edge deployment is not just "take a model and put it on a phone." Device hardware imposes hard constraints that fundamentally limit what models can do. Understanding these constraints helps PMs set realistic expectations and make informed trade-offs.
Model size and memory limits
A flagship iPhone has 6–8GB of RAM shared between the OS, apps, and the model. A mid-range Android has 4–6GB. This means on-device models must typically be under 1–2GB. A 7B parameter model at 4-bit quantization requires ~3.5GB — feasible on flagship devices but too large for older or budget hardware. Most production on-device models are 0.5B–3B parameters, heavily quantized. This constrains what tasks the model can handle: classification, short-form generation, and entity extraction work well; long-form reasoning does not.
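A quick way to sanity-check whether a model fits: parameters times bits per weight, plus an allowance for activations and runtime buffers. The 20% overhead factor below is an assumption; real overhead depends on the runtime and context length.

```python
# Rough on-device memory footprint from parameter count and quantization level.

def model_footprint_gb(params_billion, bits_per_weight, overhead=0.2):
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

for params, bits in [(7, 4), (3, 4), (1, 8)]:
    print(f"{params}B params @ {bits}-bit ≈ {model_footprint_gb(params, bits):.1f} GB")
# 7B @ 4-bit: ~3.5 GB of weights alone (~4.2 GB with overhead) — flagship-only territory.
# 3B @ 4-bit: ~1.8 GB; 1B @ 8-bit: ~1.2 GB — closer to what mid-range devices can hold.
```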
Compute and neural accelerator access
Modern phones have dedicated neural processing units (NPUs): Apple's Neural Engine (16-core, 35 TOPS), Qualcomm's Hexagon NPU, Google's Tensor TPU. These chips are optimized for specific operations (matrix multiplication, convolution) and specific precisions (INT8, FP16). Your model must be compatible with the target NPU's supported operations. Custom layers or unusual architectures may fall back to CPU execution, which can be 10–50x slower.
Battery and thermal throttling
Sustained AI inference drains battery and generates heat. Running a vision model at 30fps for camera features can consume 15–25% of battery per hour. When the device overheats, the OS throttles CPU and NPU performance, causing your model to slow down or drop frames. Products that use continuous on-device inference must budget for power consumption and design graceful degradation when thermal throttling kicks in.
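One way to sketch that graceful degradation, assuming a platform hook that reports thermal state (iOS, for example, exposes ProcessInfo.thermalState). The tiers and frame rates below are illustrative assumptions.

```python
# Degradation sketch: lower the camera-model frame rate as the device heats up.
# `read_thermal_state` stands in for a platform API returning one of these states.

FRAME_RATE_BY_THERMAL_STATE = {
    "nominal": 30,    # full quality
    "fair": 15,       # halve the inference rate before the OS throttles us
    "serious": 5,     # keep the feature alive at minimal cost
    "critical": 0,    # pause on-device inference entirely
}

def target_fps(read_thermal_state):
    state = read_thermal_state()
    return FRAME_RATE_BY_THERMAL_STATE.get(state, 30)
```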
Device fragmentation and testing
Unlike cloud deployment where you control the hardware, edge deployment means your model runs on thousands of different device configurations. An iPhone 16 Pro has different NPU capabilities than an iPhone 12. A Samsung Galaxy S24 has different memory and processor constraints than a Pixel 8. You must test on representative devices across your user base and handle devices that cannot run the model at all — either by falling back to cloud or disabling the feature gracefully.
The update problem
Cloud models can be updated instantly. Edge models require an app update or a background model download. If you discover a quality issue or safety vulnerability in your on-device model, patching it means pushing an update to millions of devices and waiting for users to install it. Some users will be running old model versions for weeks or months. Your product design must account for model version heterogeneity — and your incident response plan must include a cloud fallback for critical issues.
Learn AI Architecture Decisions in the Masterclass
Edge vs. cloud deployment, model optimization, and infrastructure trade-offs are core to the AI PM Masterclass curriculum. Taught by a Salesforce Sr. Director PM with real production experience.
Edge AI Frameworks and Tools PMs Should Know
You do not need to use these tools yourself, but you need to know what they do and what trade-offs they involve. When your engineering team proposes an edge deployment approach, you should be able to evaluate whether it fits your product constraints.
Apple Core ML and MLX
Apple's on-device ML frameworks. Core ML is the deployment format for iOS, iPadOS, macOS, watchOS, and visionOS. It supports neural networks, decision trees, and classical ML models. MLX is Apple's newer framework optimized for Apple Silicon (M-series chips). Core ML models run on the Neural Engine with hardware acceleration. If your product targets Apple devices, Core ML is the standard deployment path. Conversion from PyTorch/TensorFlow is well-supported but not always lossless — some operations are not supported and require workarounds.
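For a sense of what that conversion step involves, here is a minimal sketch using the real coremltools and torchvision packages, with a stock MobileNet standing in for your own trained model:

```python
# Minimal PyTorch -> Core ML conversion sketch with coremltools.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_small().eval()   # stand-in for your model
example_input = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example_input.shape)],
    compute_units=ct.ComputeUnit.ALL,      # let Core ML use the Neural Engine where it can
    convert_to="mlprogram",
)
mlmodel.save("MobileNetV3.mlpackage")      # ship this inside the iOS/macOS app bundle
```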
Google TensorFlow Lite and MediaPipe
TensorFlow Lite (TFLite) is Google's on-device inference framework, supporting Android, iOS, and embedded Linux. MediaPipe builds on TFLite with pre-built, optimized pipelines for common tasks: face detection, hand tracking, object detection, text classification. If your edge AI feature maps to a MediaPipe solution, you can ship in days instead of months. For custom models, TFLite provides quantization tools and delegate support for hardware acceleration on Qualcomm, Samsung, and MediaTek NPUs.
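A minimal sketch of converting a Keras model to TFLite with default dynamic-range quantization; the toy classifier below is a placeholder for your own trained network:

```python
# Keras -> TensorFlow Lite conversion with default quantization.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),       # e.g. a small classifier head
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # shrinks weights to 8-bit where possible
tflite_bytes = converter.convert()

with open("classifier.tflite", "wb") as f:
    f.write(tflite_bytes)                                  # bundle or download to the device
```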
ONNX Runtime and cross-platform deployment
ONNX (Open Neural Network Exchange) is a vendor-neutral model format that runs on multiple hardware backends. ONNX Runtime supports CPU, GPU, NPU, and specialized accelerators across Windows, macOS, Linux, Android, and iOS. If your product needs to deploy the same model across multiple platforms — a desktop app and a mobile app, for example — ONNX provides a single model format with platform-specific optimization. The trade-off: ONNX may not squeeze out the last 10–20% of performance that platform-native formats (Core ML, TFLite) can achieve.
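A sketch of loading one ONNX model with platform-appropriate execution providers. The provider names are real ONNX Runtime backends; the model file and input shape are placeholders for your own export.

```python
# Load one ONNX model and pick the best available hardware backend per platform.
import numpy as np
import onnxruntime as ort

preferred = [
    "CoreMLExecutionProvider",   # Apple devices
    "NnapiExecutionProvider",    # Android NPUs/DSPs via NNAPI
    "CPUExecutionProvider",      # universal fallback
]
available = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("model.onnx", providers=available)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: np.zeros((1, 64), dtype=np.float32)})
```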
Qualcomm AI Engine and Samsung NPU SDK
For Android devices, chipset-specific SDKs provide the deepest hardware optimization. Qualcomm's AI Engine Direct and Samsung's ONE (On-device Neural Engine) let you target specific NPU features for maximum throughput and efficiency. The trade-off: chipset-specific optimization fragments your deployment. Code optimized for Qualcomm Snapdragon may not work on Samsung Exynos or MediaTek Dimensity. Most teams start with TFLite or ONNX for broad compatibility and selectively optimize for dominant chipsets in their user base.
Product Design Patterns for Edge AI
Edge AI constraints require different product design thinking than cloud AI. The best edge AI products embrace the constraints rather than fighting them. These design patterns have emerged from teams that have shipped successful on-device AI features.
Graceful degradation across device tiers
Design three tiers of your AI feature: full capability for flagship devices with NPU access, reduced capability for mid-range devices, and cloud fallback for older devices. Apple Intelligence requires an A17 Pro or M-series chip. Devices below that threshold get a cloud-based alternative or the feature is unavailable. Your product specification should define minimum device requirements for each tier and the user experience for devices that do not meet them. Never ship an edge AI feature that crashes or freezes on unsupported devices.
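A sketch of that tier gate, with assumed capability checks and thresholds standing in for real platform queries (chip model, RAM, NPU availability):

```python
# Decide at startup which variant of the AI feature a device gets.
# Thresholds are illustrative assumptions; tune them to your model and user base.

def select_feature_tier(has_npu, ram_gb, os_supported):
    if not os_supported:
        return "disabled"            # feature hidden entirely — never a crash or freeze
    if has_npu and ram_gb >= 8:
        return "full_on_device"      # flagship: full model, all capabilities
    if ram_gb >= 6:
        return "reduced_on_device"   # mid-range: smaller model, fewer features
    return "cloud_fallback"          # older devices: route requests to the cloud

print(select_feature_tier(has_npu=True, ram_gb=8, os_supported=True))   # full_on_device
```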
Progressive model loading
Do not force users to download a 500MB model before they can use a feature. Load the model on first use, download in the background, and provide a cloud fallback during the download. Google Translate downloads language packs on-demand (30–50MB per language pair) and falls back to cloud translation until the download completes. This pattern respects users' storage and bandwidth while ensuring the feature is always available.
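A sketch of the pattern, with the downloader, local model, and cloud client as hypothetical interfaces:

```python
# Serve from the cloud until the on-device model finishes downloading in the background.
import threading

class ProgressiveModel:
    def __init__(self, downloader, cloud_client):
        self.local_model = None
        self.cloud_client = cloud_client
        # Kick off the download without blocking first use of the feature.
        threading.Thread(target=self._download, args=(downloader,), daemon=True).start()

    def _download(self, downloader):
        self.local_model = downloader.fetch_and_load()   # e.g. a 30-50MB language pack

    def predict(self, text):
        if self.local_model is not None:
            return self.local_model.run(text)            # free, private, works offline
        return self.cloud_client.complete(text)          # fallback while downloading
```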
On-device preprocessing with cloud reasoning
Use edge AI for the compute-intensive preprocessing step and send only the processed result to the cloud for final reasoning. For example: run speech recognition on-device to convert audio to text (keeping the audio private), then send only the transcript to a cloud LLM for understanding and response. This hybrid pattern gives you the privacy benefits of edge processing with the reasoning capabilities of cloud models.
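A sketch of this split, with the local transcriber and cloud LLM as hypothetical interfaces:

```python
# Transcribe audio on-device so the raw recording never leaves the phone,
# then send only the text to a cloud model for reasoning.

def answer_voice_query(audio_bytes, local_transcriber, cloud_llm):
    transcript = local_transcriber.transcribe(audio_bytes)   # runs on the NPU; audio stays local
    response = cloud_llm.complete(f"User said: {transcript}\nRespond helpfully.")
    return transcript, response
```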
Federated learning for continuous improvement
Traditional model updates require centralizing user data for retraining. Federated learning trains model updates on-device using local data, then sends only the model updates (not the data) to a central server for aggregation. Google uses federated learning to improve Gboard predictions without ever seeing what users type. This pattern lets you improve edge models continuously while maintaining the privacy guarantee. The trade-off: federated learning is significantly more complex to implement than centralized training and requires careful privacy accounting.
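For intuition, here is a minimal federated-averaging sketch: devices contribute weight deltas (never raw data), and the server combines them weighted by how much local data each device trained on. Real systems add secure aggregation and differential privacy on top of this.

```python
# Minimal federated averaging (FedAvg-style) aggregation sketch.
import numpy as np

def federated_average(updates):
    """updates: list of (weight_delta_vector, num_local_samples) from devices."""
    total = sum(n for _, n in updates)
    return sum(delta * (n / total) for delta, n in updates)

# Three devices report deltas computed on their own local data.
client_updates = [
    (np.array([0.10, -0.20]), 120),
    (np.array([0.05,  0.00]),  80),
    (np.array([-0.02, 0.10]), 200),
]
global_delta = federated_average(client_updates)   # applied to the shared model
```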
Master AI Deployment & Infrastructure in the AI PM Masterclass
Edge deployment, model optimization, and infrastructure architecture decisions are core to the AI PM Masterclass. You'll learn to evaluate technical trade-offs and make shipping decisions. Book a free strategy call to learn more.