
Two roads to the same chip: a decade of AI silicon, and what it means for your business

Ant

The chart that started this

I built this chart to map a decade of AI silicon, and it tells a story most people outside the industry have missed.

Two companies. Two opposite bets in 2017. By 2025, the same chip.

Chart: A decade of AI silicon. NVIDIA and Apple both went neural in 2017, took opposite roads (tensor cores in the GPU vs a separate Neural Engine plus unified memory), and converged by 2025 on matrix units plus large unified memory on one chip.

Two bets in 2017

NVIDIA and Apple both went "neural" in 2017, then bet opposite ways.

NVIDIA put the matrix engines inside the GPU. Volta shipped the first Tensor Cores, dedicated matrix-multiply units sitting right next to the graphics cores, and NVIDIA wrapped its CUDA software moat around them. If you wanted to train a model, you wrote CUDA, and CUDA ran on tensor cores in the GPU.

Apple went the other direction. It put its neural hardware in a separate Neural Engine (the NPU in the iPhone's A11 Bionic), kept apart from the GPU, and bet everything on unified memory: one pool of RAM that the CPU, GPU, and Neural Engine all share, with no copying between them.

Same starting word, "neural." Completely different architectures.

NVIDIA's road: tensor cores, then lower precision every generation

NVIDIA's decade reads like a precision ladder, and each rung buys more AI throughput per watt:

  • Volta (2017): first Tensor Cores, FP16
  • Turing (2018): INT8 and INT4, tensor cores reach consumer RTX cards
  • Ampere (2020): TF32 and structured sparsity
  • Hopper (2022): FP8 and the Transformer Engine, built for the LLM era
  • Blackwell (2024): native FP4

The pattern: keep the matrix units in the GPU, keep dropping the precision, keep CUDA in front of all of it. Lower precision means more math per watt, because each value takes fewer bits, so the same silicon area and memory bandwidth can handle more values at once. That is why an FP4 chip can do far more AI work than an FP16 chip of the same size.
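To put a number on the precision ladder, here is a minimal Python sketch that counts bits and nothing else. The bit widths are the standard ones for these formats; the function names are my own, made up for illustration. It shows an upper bound on the density gain, not a benchmark. Real speedups depend on the hardware and the workload.

```python
# Bits per value for three rungs of the precision ladder.
BITS_PER_VALUE = {"FP16": 16, "FP8": 8, "FP4": 4}


def values_per_byte(fmt: str) -> float:
    """How many values of this format fit in one byte of memory traffic."""
    return 8 / BITS_PER_VALUE[fmt]


def density_vs_fp16(fmt: str) -> float:
    """How many values of this format fit in the space of one FP16 value."""
    return BITS_PER_VALUE["FP16"] / BITS_PER_VALUE[fmt]


if __name__ == "__main__":
    for fmt in BITS_PER_VALUE:
        print(f"{fmt}: {values_per_byte(fmt):.1f} values per byte, "
              f"{density_vs_fp16(fmt):.0f}x the density of FP16")
```

Every byte moved carries four times as many FP4 values as FP16 values. That is the simple idea behind each step down the ladder.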

Apple's road: unified memory, neural engine kept apart

Apple's road was about memory, not matrix units:

  • A11 (2017): first Neural Engine, a dedicated NPU, separate from the GPU
  • M1 (2020): 16-core Neural Engine and one unified memory pool for CPU, GPU, and NPU. But the GPU still had no tensor cores.
  • M2 through M4 (2022 to 2024): faster Neural Engines and more memory, but GPU matrix math still ran on generic ALUs through the shared FP32 pipeline.

For years Apple had the memory story nobody else had, and NVIDIA had the matrix-units-in-the-GPU story nobody else had. Each was strong in exactly the place the other was weak.

The convergence: 2025 to 2026

Then they swapped signatures.

NVIDIA borrowed Apple's move. With GB10 Grace Blackwell, the chip in DGX Spark, NVIDIA finally shipped unified memory: 128GB coherent across CPU and GPU. The DGX Station goes further with the GB300 Grace Blackwell Ultra: 748GB of coherent memory and roughly 20 PFLOPS of FP4, enough to run trillion-parameter-class models from a machine that sits next to your desk.

Apple borrowed NVIDIA's move. The M5, in October 2025, is the first Apple chip with dedicated matrix units in every GPU core, instead of routing matmul through generic ALUs. Apple calls them Neural Accelerators; functionally, they are its tensor cores. The rumored M5 Ultra, expected late 2026, would fuse two M5 Max dies for around 256GB of unified memory with GPU tensor cores on one chip.

Two roads, one destination: matrix units in the GPU, plus a large unified memory pool, on a single chip you can put on a desk.
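Whether a model "fits on the desk" is mostly arithmetic: parameter count times bits per parameter, compared against the memory pool. Here is a rough Python sketch of that check. It counts weights only, and the 20% headroom reserved for everything else (KV cache, activations, the operating system) is my assumption, not a vendor figure. The function names are hypothetical.

```python
# Bits per weight at each precision.
BITS_PER_VALUE = {"FP16": 16, "FP8": 8, "FP4": 4}


def weights_gb(num_params: float, fmt: str) -> float:
    """Decimal gigabytes needed to hold the weights alone."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e9


def fits(num_params: float, fmt: str, memory_gb: float,
         headroom: float = 0.2) -> bool:
    """True if the weights fit after reserving `headroom` of the pool.

    The 20% default headroom is an assumption for illustration only.
    """
    return weights_gb(num_params, fmt) <= memory_gb * (1 - headroom)


if __name__ == "__main__":
    one_trillion = 1e12
    for fmt in BITS_PER_VALUE:
        size = weights_gb(one_trillion, fmt)
        verdict = "fits" if fits(one_trillion, fmt, 748) else "does not fit"
        print(f"1T parameters at {fmt}: {size:.0f} GB of weights, "
              f"{verdict} in 748 GB")
```

A trillion parameters at FP4 is about 500GB of weights, which is why the trillion-parameter claim only holds at the bottom of the precision ladder. At FP8 the same model needs about 1,000GB and no longer fits.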

Why this matters if you run a business

Here is the part that actually changes decisions.

This convergence matters because the hardware is now strong enough to run frontier-scale models (hundreds of billions to a trillion parameters) locally. Not in a hyperscaler's data center. On a machine in your office.

That opens two doors that were closed two years ago.

Control. When you send prompts to a frontier model over an API, your data leaves your building, and what happens to it depends on the provider and the plan you are on. Some tiers retain inputs, some can use them to improve future models, and the terms change over time. Running a capable model locally keeps sensitive material (client records, contracts, internal documents) inside your own walls.

Economics. Cloud AI is priced per token. That is fine when usage is light. When a workflow runs thousands of times a day, the per-token meter adds up fast, and you are renting forever. Local hardware is mostly a pay-once cost, plus power and upkeep. Past a certain volume, owning the compute is cheaper than renting it by the token.
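The "certain volume" is a break-even point, and you can estimate it before you talk to anyone. Here is a minimal Python sketch. Every number in the example is hypothetical and there only to show the shape of the math: the hardware price, the monthly running cost, the token volume, and the per-token rate are placeholders, not quotes. It also ignores real differences in model quality, staffing, and depreciation.

```python
from typing import Optional


def monthly_cloud_cost(tokens_per_month: float,
                       usd_per_million_tokens: float) -> float:
    """What the per-token meter charges for one month of usage."""
    return tokens_per_month / 1e6 * usd_per_million_tokens


def break_even_months(hardware_usd: float,
                      monthly_running_usd: float,
                      tokens_per_month: float,
                      usd_per_million_tokens: float) -> Optional[float]:
    """Months until owning beats renting, or None if it never does."""
    cloud = monthly_cloud_cost(tokens_per_month, usd_per_million_tokens)
    monthly_saving = cloud - monthly_running_usd
    if monthly_saving <= 0:
        # Usage is too light: the cloud bill is below local running costs.
        return None
    return hardware_usd / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs: a $50,000 machine, $500 a month to run it,
    # and a cloud rate of $5 per million tokens.
    heavy = break_even_months(50_000, 500, 2e9, 5.0)
    light = break_even_months(50_000, 500, 20e6, 5.0)
    print(f"Heavy usage (2B tokens a month): {heavy:.1f} months")
    print(f"Light usage (20M tokens a month): {light}")
```

With these made-up inputs, the heavy workload pays off the hardware in a little over five months, and the light workload never does. Swap in your own volumes and quotes. The shape of the answer matters more than my placeholder figures.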

The honest version

I am not going to tell you everything in that chart is shipping today. It is not.

The M5 Ultra is still a rumor. As of now, no shipping Apple chip pairs more than 128GB with GPU tensor cores, and even that is well below the DGX Station's 748GB. The Station is real, but it is priced like the frontier machine it is. Local AI is not automatically the right answer, and for plenty of teams the cloud is still the correct call.

What changed is that "run frontier-scale AI in your own building" went from impossible to a real option you can put on the table. Two years ago that sentence did not make sense. Now it is a line in a budget.

Where we come in

Deciding where your AI should run (in the cloud, on-premises, or in a hybrid of the two) is exactly the kind of question we help operators work through at Grid Theory: what the control and cost math looks like for your specific workloads, what is worth running locally, and what is better left in the cloud.

If you are weighing it, book a discovery call and we will map it to your business.
