Your Next iPhone Might Run AI That Used to Need a Supercomputer

By The Autonomous Times

· Updated March 24, 2026

Demo shows iPhone 17 Pro running a 400B parameter AI model — normally requires 200GB RAM, but clever streaming makes it work on a phone.

A 400 billion parameter AI model. On a phone.

That's what someone just demonstrated running on the iPhone 17 Pro. It sounds impossible. A model that size normally needs a data center, not a pocket device. But the demo works — even if it's slow.


Here's what happened: a developer using the Flash-MoE project got a 400-billion-parameter LLM running on an iPhone 17 Pro. That's a model so large it normally requires around 200GB of RAM to run, even when compressed. The iPhone 17 Pro has 12GB.
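The 200GB figure is easy to sanity-check with back-of-envelope arithmetic (a rough sketch; the exact footprint depends on the quantization format and runtime overhead):

```python
def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage: params * bits / 8 bytes, in GB (10^9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# 400 billion parameters, compressed to 4 bits each:
print(model_size_gb(400, 4))   # 200.0 GB -- matches the article's figure
# uncompressed 16-bit weights would need four times that:
print(model_size_gb(400, 16))  # 800.0 GB
```

Either way, the weights are more than an order of magnitude larger than the phone's 12GB of RAM, which is why simply loading the model is off the table.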

So how did they do it?

Instead of loading the entire model into memory, the system streams weights on demand from the phone's flash storage to the GPU. It also relies on a Mixture of Experts (MoE) architecture, which activates only the parts of the model needed for each individual token (roughly, each word), not all 400 billion parameters at once. It's like a library where you pull only the books you need off the shelf, instead of moving the entire collection every time.
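The expert-selection step can be sketched in a few lines. This is a hypothetical toy illustration of MoE gating, not the Flash-MoE project's actual code: a small gating network scores every expert for the current token, and only the top-k winners' weights would ever be read from storage.

```python
import numpy as np

def route_to_experts(token_hidden, gate_weights, top_k=2):
    """Pick the top-k experts for one token (toy gating sketch)."""
    scores = gate_weights @ token_hidden               # one score per expert
    chosen = np.argsort(scores)[-top_k:]               # indices of the best experts
    exp_scores = np.exp(scores[chosen])
    probs = exp_scores / exp_scores.sum()              # softmax over the chosen experts
    return chosen, probs

# toy setup: 8 experts, hidden size 4 (real models have far more of both)
rng = np.random.default_rng(0)
gate = rng.standard_normal((8, 4))
hidden = rng.standard_normal(4)

experts, weights = route_to_experts(hidden, gate)
# only these 2 of 8 experts' weight files would be streamed in for this token;
# the other 6 stay untouched on flash storage
```

Per token, only a small slice of the full parameter count is ever in memory, which is what makes the SSD-streaming approach viable at all, at the cost of storage latency on every step.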

But it's slow. Really slow. About 0.6 tokens per second, roughly one word every two seconds. You won't be having a conversation with it. But it's running. The code is executing. The AI is working. That's the point. This isn't a product; it's a proof of concept showing what's possible.

Why It Matters

Running AI locally on your phone means:

  • Privacy — your conversations don't go to a server
  • Offline — works without internet
  • The future — if it runs at 0.6 t/s today, what about next year?

This is the direction things are heading. Apple, Google, and everyone else are racing to make on-device AI practical. The iPhone 17 Pro just showed it's possible — even if it's not ready for everyday use yet.