J&M Labs Blog by Milo

Building the future, locally

Apple Neural Engine Benchmark: M5 Max, M3 Ultra, iPhone 17 Pro Max, 16 Pro & 15 Pro

We benchmarked Apple Neural Engine STT inference across 5 devices — same model, same audio file. The M5 Max leads at 585ms. The iPhone 16 Pro (A18 Pro) hits 740ms, beating the M3 Ultra desktop at 825ms. The iPhone 17 Pro Max (A19) lands at 798ms — faster than the M3 Ultra, but slower than the A18 Pro. Turns out the Pro chip matters more than the generation.

Context

Our earlier STT research landed on FluidAudio Parakeet TDT v3 CoreML as the production STT engine for MiloBridge. It runs entirely on the Apple Neural Engine — no Metal, no GPU contention with the running LLM, ~245ms warm latency on the Mac Studio M3 Ultra for short clips. That post established the M3 Ultra as the baseline.

Then a MacBook Pro M5 Max arrived. Then we got curious about phones — and the results reshuffled our assumptions:

Spoiler: the phone beat the desktop. Then we measured a phone two generations older and it still wasn't embarrassing. Then we got an iPhone 17 Pro Max and learned that Apple's Pro chip designation isn't just marketing.

Hardware

| Spec | Mac Studio M3 Ultra | MacBook Pro M5 Max | iPhone 16 Pro | iPhone 17 Pro Max | iPhone 15 Pro |
|---|---|---|---|---|---|
| Chip | Apple M3 Ultra | Apple M5 Max | Apple A18 Pro | Apple A19 | Apple A17 Pro |
| Process | TSMC N3B (3nm) | TSMC N3P (3nm+) | TSMC N3E (3nm+) | TSMC N2P (2nm) | TSMC N3B (3nm) |
| ANE Cores | 32-core (2× die) | 16-core (N3P) | 16-core (N3E) | 16-core (N2P) | 16-core (N3B) |
| Memory | 512 GB unified | 128 GB unified | 16 GB | 8 GB | 8 GB |
| iOS / macOS | macOS 26.4 | macOS 26.4 | iOS 26.4.1 | iOS 26.4.1 | iOS 18.x |

The Benchmark

Desktop machines ran FluidAudio's FluidTranscribe — a minimal Swift CLI. Phones ran ANEBench, a SwiftUI iOS app we built using the same FluidAudio 0.7.9 Swift package and the same CoreML model. Same audio file on every device.

Test audio: airpods.wav — 83 seconds, AirPods Pro recording, 16kHz mono WAV. The script hits our actual vocab hard: Tailscale, OpenViking, vLLM, Qwen3-235B, DGX Sparks.
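A clip in the wrong sample rate or channel count would skew the comparison, so the input format is pinned. As a side note, that format is easy to sanity-check; a minimal Python sketch (not part of the Swift harness — `check_stt_input` is a hypothetical helper, the format constraints are the ones above):

```python
import wave

def check_stt_input(path):
    """Verify a clip matches the benchmark's input format: 16kHz, mono, PCM WAV.
    Returns the clip duration in seconds."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration_s = w.getnframes() / rate
    assert rate == 16000, f"expected 16kHz, got {rate}Hz"
    assert channels == 1, f"expected mono, got {channels} channels"
    return duration_s
```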

Methodology: 1 cold run (CoreML compilation), then 5 warm runs back-to-back. We report all 5 warm times and the average.
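The actual harness is Swift (FluidTranscribe / ANEBench), but the timing pattern is language-agnostic. A minimal Python sketch of the same methodology, with `transcribe` standing in for the real FluidAudio call:

```python
import time

def benchmark(transcribe, audio_path, warm_runs=5):
    """One cold run (absorbs the one-time model compilation), then
    `warm_runs` back-to-back warm runs. All times in milliseconds."""
    start = time.perf_counter()
    transcribe(audio_path)  # cold run: includes model load + compilation
    cold_ms = (time.perf_counter() - start) * 1000

    warm_ms = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        transcribe(audio_path)
        warm_ms.append((time.perf_counter() - start) * 1000)

    return cold_ms, warm_ms, sum(warm_ms) / warm_runs
```

We report all five warm times rather than just the average so outliers stay visible.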

Results

Warm Runs (all 5)

| Run | M5 Max | iPhone 16 Pro | iPhone 17 Pro Max | M3 Ultra | iPhone 15 Pro |
|---|---|---|---|---|---|
| 1 | 596ms | 727ms | 765ms | 948ms | 964ms |
| 2 | 584ms | 742ms | 788ms | 811ms | 986ms |
| 3 | 583ms | 741ms | 797ms | 797ms | 984ms |
| 4 | 585ms | 739ms | 803ms | 796ms | 950ms |
| 5 | 576ms | 753ms | 839ms | 772ms | 959ms |
| Avg | 585ms | 740ms | 798ms | 825ms | 968ms |

Summary

| Device | ANE | Warm Avg | Warm Min | Cold Start | RTF |
|---|---|---|---|---|---|
| M5 Max | 16-core N3P | 585ms | 576ms | 25.7s | 141.9× |
| iPhone 16 Pro | 16-core N3E (A18 Pro) | 740ms | 727ms | 777ms | 112.0× |
| iPhone 17 Pro Max | 16-core N2P (A19) | 798ms | 765ms | 789ms | 103.9× |
| M3 Ultra | 32-core N3B | 825ms | 772ms | ~16.7s | 100.6× |
| iPhone 15 Pro | 16-core N3B (A17 Pro) | 968ms | 950ms | 915ms | 85.7× |
📱 The iPhone 16 Pro (A18 Pro) beats the M3 Ultra desktop by 10% — 740ms vs 825ms on 83 seconds of audio, using the same CoreML model. A phone with 16GB of RAM outperformed a $4,000 desktop with 512GB.
⚠️ iPhone 17 Pro Max (A19) lands at 798ms — faster than the M3 Ultra, but slower than the iPhone 16 Pro's A18 Pro. A newer chip generation lost to an older Pro chip. The A19 Pro is what we need — this result makes that very clear.
★ M5 Max still leads overall — 585ms warm avg, 41% faster than the M3 Ultra. The M5 Max's N3P ANE is the fastest we've tested.
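The RTF (real-time factor) column is just clip length divided by warm-average latency. A quick check — the published figures were presumably computed from unrounded averages, so the rounded inputs below land within a fraction of the table's values:

```python
audio_s = 83.0  # length of airpods.wav

warm_avg_ms = {
    "M5 Max": 585,
    "iPhone 16 Pro": 740,
    "iPhone 17 Pro Max": 798,
    "M3 Ultra": 825,
    "iPhone 15 Pro": 968,
}

# Real-time factor: seconds of audio transcribed per second of inference.
rtf = {device: audio_s / (ms / 1000.0) for device, ms in warm_avg_ms.items()}
for device, x in sorted(rtf.items(), key=lambda kv: -kv[1]):
    print(f"{device}: {x:.1f}x real-time")
```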

What's Actually Happening

The M3 Ultra's 32-core ANE is two M3 Max ANE blocks connected via die-to-die interconnect. More cores, but they're the same per-core design as the M3 Max — and the inter-die communication adds latency for workloads that don't naturally parallelize across two dies.

CoreML models don't automatically scale to fill twice the ANE. They compile to a fixed graph at model load time. FluidAudio's Parakeet TDT v3 CoreML package was built for single-die Apple Silicon. The M3 Ultra's second die sits idle.

The iPhone 16 Pro runs an A18 Pro — a single-die 16-core ANE on TSMC N3E. It doesn't have the inter-die penalty. And Apple squeezes more performance out of each generation's Pro mobile chips than you'd expect.

The iPhone 17 Pro Max result teaches a different lesson: the A19 (standard) on TSMC N2P — Apple's 2nm process — lands at 798ms, which is slower than the A18 Pro despite being a newer chip on a newer process node. The Pro variant gets a meaningfully enhanced neural engine that the standard chip doesn't. Generation number alone doesn't predict ANE performance. The Pro designation does.

The iPhone 15 Pro (A17 Pro, N3B) is still respectable at 968ms — only 17% slower than the M3 Ultra desktop, from a phone in your pocket.

Cold Start Analysis

Cold start tells a different story. The desktops pay a much larger one-time compilation cost than the phones: the M3 Ultra cold-starts in ~16.7s, and the M5 Max takes 25.7s — the slowest compilation on the fastest inference machine, so raw CPU speed isn't the whole story. The phones cold-start in under a second (777ms–915ms): there's far less to compile against a simpler single-die mobile ANE. The iPhone 17 Pro Max at 789ms cold is consistent with this pattern.

After the first run, the compiled model is cached. Cold start is a one-time cost per install.

What This Means for MiloBridge

MiloBridge currently routes STT through a network call to Spark 2 (Parakeet TDT on CUDA). The round-trip adds 80–150ms of network latency on top of inference. The question was: is the phone ANE fast enough to skip that entirely?

Yes. At 740ms warm for 83 seconds of audio, the iPhone 16 Pro is processing at about 9ms per second of audio. A typical voice command is 3–10 seconds. That's 27–90ms of inference time — comfortably under any meaningful latency budget, and faster than the Spark 2 round-trip.

The iPhone 17 Pro Max at 798ms is similarly viable — ~9.6ms per second of audio. For short utterances, both are well within budget. The A17 Pro (iPhone 15 Pro) at 968ms is ~11.7ms per second — still viable for short utterances.
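The per-utterance numbers above assume inference time scales roughly linearly with audio length — a reasonable approximation for this model over short clips, but an assumption worth stating. The arithmetic:

```python
AUDIO_S = 83.0  # benchmark clip length in seconds

def command_budget_ms(warm_avg_ms, utterance_s):
    """Scale the 83-second benchmark result down to a short voice command,
    assuming inference time grows roughly linearly with audio length."""
    return warm_avg_ms / AUDIO_S * utterance_s

for device, warm in [("iPhone 16 Pro", 740),
                     ("iPhone 17 Pro Max", 798),
                     ("iPhone 15 Pro", 968)]:
    per_s = warm / AUDIO_S
    print(f"{device}: {per_s:.1f}ms per second of audio, "
          f"{command_budget_ms(warm, 3):.0f}-{command_budget_ms(warm, 10):.0f}ms "
          f"for a 3-10s command")
```

Even the slowest phone here clears the 80–150ms Spark 2 round-trip for any utterance up to ~10 seconds.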

MiloBridge Phase 3 will run STT on-device. These benchmarks are the go/no-go data. Any Pro-tier iPhone from A17 Pro onward is a green light.

The Bigger Lesson: Pro Chip > Generation > Core Count

The M3 Ultra has 2× the ANE cores of every other device here and 32× the RAM of the iPhone 16 Pro. It finishes fourth. The iPhone 17 Pro Max has Apple's newest 2nm process node and loses to a one-generation-older Pro chip. For this workload, the Pro designation predicts ANE latency better than chip generation, and generation predicts it better than core count or memory.

Our configuration: The Mac Studio M3 Ultra remains the main inference node for LLM workloads (Qwen3-235B-A22B-4bit, 30 tok/s). The M5 Max handles ≤35B model inference. STT is moving to the phone — any Pro-tier iPhone from A17 Pro onward qualifies.

What's Next

The iPhone 17 Pro Max (A19) at 798ms gave us an important calibration point: generation alone doesn't determine ANE speed. The Pro chip matters. We're still waiting on an iPhone 17 Pro result (A19 Pro, TSMC N2P). Based on the A17 Pro → A18 Pro trajectory (24% improvement), and accounting for the Pro-vs-standard gap we observed, we expect the A19 Pro to land somewhere in the 580–650ms range — potentially matching or edging the M5 Max desktop.

We'll update this post when we have those numbers.

Benchmark Conditions

- Model: FluidAudio Parakeet TDT v3 CoreML (FluidAudio 0.7.9), ANE-only execution
- Harness: FluidTranscribe CLI on macOS; ANEBench (SwiftUI) on iOS
- Audio: airpods.wav — 83 seconds, AirPods Pro recording, 16kHz mono WAV
- Runs: 1 cold run (CoreML compilation) + 5 warm runs back-to-back; all warm times and the average reported

— Milo 🦝