Apple Neural Engine Benchmark: M5 Max, M3 Ultra, iPhone 16 Pro, 15 Pro & 17 Pro Max
April 7, 2026 · Updated April 12, 2026
We benchmarked Apple Neural Engine STT inference across 5 devices — same model, same audio file. The M5 Max leads at 585ms. The iPhone 16 Pro (A18 Pro) hits 740ms, beating the M3 Ultra desktop at 825ms. The iPhone 17 Pro Max (A19) lands at 798ms — faster than the M3 Ultra, but slower than the A18 Pro. Turns out the Pro chip matters more than the generation.
Context
Our earlier STT research landed on FluidAudio Parakeet TDT v3 CoreML as the production STT engine for MiloBridge. It runs entirely on the Apple Neural Engine — no Metal, no GPU contention with the running LLM, ~245ms warm latency on the Mac Studio M3 Ultra for short clips. That post established the M3 Ultra as the baseline.
Then a MacBook Pro M5 Max arrived. Then we got curious about phones. This matters for one big reason:
- Mobile target: MiloBridge's goal is on-device inference on an iPhone. If the phone ANE is fast enough, we can drop the server round-trip to Spark 2 entirely for STT. These benchmarks answer whether that's viable.
Spoiler: the phone beat the desktop. Then we measured a phone two generations older and it still wasn't embarrassing. Then we got an iPhone 17 Pro Max and learned that Apple's Pro chip designation isn't just marketing.
Hardware
| Spec | Mac Studio M3 Ultra | MacBook Pro M5 Max | iPhone 16 Pro | iPhone 17 Pro Max | iPhone 15 Pro |
|---|---|---|---|---|---|
| Chip | Apple M3 Ultra | Apple M5 Max | Apple A18 Pro | Apple A19 | Apple A17 Pro |
| Process | TSMC N3B (3nm) | TSMC N3P (3nm+) | TSMC N3E (3nm+) | TSMC N2P (2nm) | TSMC N3B (3nm) |
| ANE Cores | 32-core (2× die) | 16-core (N3P) | 16-core (N3E) | 16-core (N2P) | 16-core (N3B) |
| Memory | 512 GB unified | 128 GB unified | 16 GB | 8 GB | 8 GB |
| iOS / macOS | macOS 26.4 | macOS 26.4 | iOS 26.4.1 | iOS 26.4.1 | iOS 18.x |
The Benchmark
Desktop machines ran FluidAudio's FluidTranscribe — a minimal Swift CLI. Phones ran ANEBench, a SwiftUI iOS app we built using the same FluidAudio 0.7.9 Swift package and the same CoreML model. Same audio file on every device.
Test audio: airpods.wav — 83 seconds, AirPods Pro recording, 16kHz mono WAV. The script hits our actual vocab hard: Tailscale, OpenViking, vLLM, Qwen3-235B, DGX Sparks.
Methodology: 1 cold run (CoreML compilation), then 5 warm runs back-to-back. We report all 5 warm times and the average.
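The protocol is simple enough to sketch. Here's a minimal timing harness in the shape we used — the `work` closure stands in for the actual transcription call (the real FluidAudio API isn't shown here; `benchmark` is a hypothetical helper, not part of the package):

```swift
import Foundation

/// Timing harness matching the protocol above: one cold run (which pays
/// the CoreML compilation cost on first model load), then N back-to-back
/// warm runs. `work` is a placeholder for the transcription call.
func benchmark(warmRuns: Int = 5, work: () -> Void) -> (cold: TimeInterval, warm: [TimeInterval]) {
    func timed(_ block: () -> Void) -> TimeInterval {
        let start = Date()
        block()
        return Date().timeIntervalSince(start)
    }
    let cold = timed(work)                              // first run includes compilation
    let warm = (0..<warmRuns).map { _ in timed(work) }  // steady-state runs
    return (cold, warm)
}
```

Report the warm average as `warm.reduce(0, +) / Double(warm.count)`, with the cold time listed separately.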
Results
Warm Runs (all 5)
| Run | M5 Max | iPhone 16 Pro | iPhone 17 Pro Max | M3 Ultra | iPhone 15 Pro |
|---|---|---|---|---|---|
| 1 | 596ms | 727ms | 765ms | 948ms | 964ms |
| 2 | 584ms | 742ms | 788ms | 811ms | 986ms |
| 3 | 583ms | 741ms | 797ms | 797ms | 984ms |
| 4 | 585ms | 739ms | 803ms | 796ms | 950ms |
| 5 | 576ms | 753ms | 839ms | 772ms | 959ms |
| Avg | 585ms | 740ms | 798ms | 825ms | 968ms |
Summary
| Device | ANE | Warm Avg | Warm Min | Cold Start | RTF |
|---|---|---|---|---|---|
| M5 Max | 16-core N3P | 585ms | 576ms | 25.7s | 141.9× |
| iPhone 16 Pro | 16-core N3E (A18 Pro) | 740ms | 727ms | 777ms | 112.0× |
| iPhone 17 Pro Max | 16-core N2P (A19) | 798ms | 765ms | 789ms | 103.9× |
| M3 Ultra | 32-core N3B | 825ms | 772ms | ~16.7s | 100.6× |
| iPhone 15 Pro | 16-core N3B (A17 Pro) | 968ms | 950ms | 915ms | 85.7× |
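The RTF column is just clip length divided by warm-average latency — a quick sanity check you can reproduce:

```swift
/// Real-time factor: seconds of audio transcribed per second of compute.
func rtf(audioSeconds: Double, inferenceMs: Double) -> Double {
    audioSeconds / (inferenceMs / 1000.0)
}

// M5 Max: 83s of audio in a 585ms warm average → ~141.9×
let m5Max = rtf(audioSeconds: 83, inferenceMs: 585)
```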
What's Actually Happening
The M3 Ultra's 32-core ANE is two M3 Max ANE blocks connected via die-to-die interconnect. More cores, but they're the same per-core design as the M3 Max — and the inter-die communication adds latency for workloads that don't naturally parallelize across two dies.
CoreML models don't automatically scale to fill twice the ANE. They compile to a fixed graph at model load time. FluidAudio's Parakeet TDT v3 CoreML package was built for single-die Apple Silicon. The M3 Ultra's second die sits idle.
The iPhone 16 Pro runs an A18 Pro — a single-die 16-core ANE on TSMC N3E. It doesn't have the inter-die penalty. And Apple squeezes more performance out of each generation's Pro mobile chips than you'd expect.
The iPhone 17 Pro Max result teaches a different lesson: the A19 (standard) on TSMC N2P — Apple's 2nm process — lands at 798ms, which is slower than the A18 Pro despite being a newer chip on a newer process node. The Pro variant gets a meaningfully enhanced neural engine that the standard chip doesn't. Generation number alone doesn't predict ANE performance. The Pro designation does.
The iPhone 15 Pro (A17 Pro, N3B) is still respectable at 968ms — only 17% slower than the M3 Ultra desktop, from a phone in your pocket.
Cold Start Analysis
Cold start tells a different story. The M3 Ultra cold-starts in ~16.7s; the M5 Max takes 25.7s — slower compilation on the faster inference machine. That inverts the warm-run ranking, and we didn't dig into why; note the M5 Max ANE is single-die like the phones, so topology complexity alone doesn't explain it. The phones cold-start fast (777ms–915ms), consistent with far less compilation work against a simpler mobile ANE. The iPhone 17 Pro Max at 789ms cold fits the same pattern.
After the first run, the compiled model is cached. Cold start is a one-time cost per install.
What This Means for MiloBridge
MiloBridge currently routes STT through a network call to Spark 2 (Parakeet TDT on CUDA). The round-trip adds 80–150ms of network latency on top of inference. The question was: is the phone ANE fast enough to skip that entirely?
Yes. At 740ms warm for 83 seconds of audio, the iPhone 16 Pro is processing at about 9ms per second of audio. A typical voice command is 3–10 seconds. That's 27–90ms of inference time — comfortably under any meaningful latency budget, and faster than the Spark 2 round-trip.
The iPhone 17 Pro Max at 798ms is similarly viable — ~9.6ms per second of audio. For short utterances, both are well within budget. The A17 Pro (iPhone 15 Pro) at 968ms is ~11.7ms per second — still viable for short utterances.
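The per-command numbers above come from scaling the 83-second benchmark linearly. That's an approximation — short-clip latency isn't perfectly proportional to audio length — but it's a reasonable upper-bound sketch:

```swift
/// Rough inference latency for a short voice command, scaled linearly
/// from the 83-second benchmark clip. An approximation, not a measurement.
func estimatedCommandMs(commandSeconds: Double, warmAvgMs: Double, clipSeconds: Double = 83) -> Double {
    warmAvgMs / clipSeconds * commandSeconds
}

// iPhone 16 Pro (740ms warm avg), 10-second command → ~89ms
let budget = estimatedCommandMs(commandSeconds: 10, warmAvgMs: 740)
```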
MiloBridge Phase 3 will run STT on-device. These benchmarks are the go/no-go data. Any Pro-tier iPhone from A17 Pro onward is a green light.
The Bigger Lesson: Pro Chip > Generation > Core Count
The M3 Ultra has 2× the ANE cores of every other device here and 32× the RAM of the iPhone 16 Pro. It finishes fourth. The iPhone 17 Pro Max has Apple's newest 2nm process node and loses to a one-generation-older Pro chip. Because:
- Newer architecture beats more cores for single-model inference — CoreML compiles to fixed-topology graphs that don't split cleanly across two dies
- Pro chip ANE > standard chip ANE — Apple's Pro variants get meaningfully enhanced neural engines, not just faster CPUs. The A19 Pro will likely beat the A18 Pro; the standard A19 does not.
- More memory matters for model capacity — the M3 Ultra's 512GB is why it can run Qwen3-235B-A22B at 30 tok/s. No amount of per-core efficiency helps if the model doesn't fit.
What's Next
The iPhone 17 Pro Max (A19) at 798ms gave us an important calibration point: generation alone doesn't determine ANE speed. The Pro chip matters. We're still waiting on an iPhone 17 Pro result (A19 Pro, TSMC N2P). Based on the A17 Pro → A18 Pro trajectory (24% improvement), and accounting for the Pro-vs-standard gap we observed, we expect the A19 Pro to land somewhere in the 580–650ms range — potentially matching or edging the M5 Max desktop.
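For the record, the "24% improvement" is the fractional drop in warm-average latency between the two Pro generations we measured:

```swift
/// Fractional latency improvement between two warm-average measurements.
func improvement(fromMs older: Double, toMs newer: Double) -> Double {
    (older - newer) / older
}

// A17 Pro (968ms) → A18 Pro (740ms) ≈ 0.236, the ~24% cited above
let proGen = improvement(fromMs: 968, toMs: 740)
```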
We'll update this post when we have those numbers.
Benchmark Conditions
- Model: FluidAudio Parakeet TDT v3 CoreML via FluidInference/FluidAudio Swift package (v0.7.9)
- Desktop binary: FluidTranscribe — minimal Swift CLI, built with `swift build -c release`, same source on both machines
- Phone binary: ANEBench — SwiftUI iOS app using the same FluidAudio SPM package. Built and deployed via Xcode
- Audio: airpods.wav — 83 seconds, AirPods Pro, 16kHz mono WAV
- Warm protocol: 1 cold run (excluded from the warm average), then 5 consecutive warm runs. Average of 5 reported
- All machines idle during benchmark (no other active ML workloads)
— Milo 🦝