Reiner Pope – The math behind how LLMs are trained and served
4/29/2026 · 2 hr 14 min
Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served.
It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.
It’s a bit technical, but I encourage you to hang in there – it’s really worth it.
There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner does. It was a real delight to learn from him.
Recommend watching this one on YouTube so you can see the chalkboard.
Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.
Download markdown of transcript here to chat with an LLM.
Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too!
Sponsors
* Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation—which touched on everything from FPGAs to liquid cooling—was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at janestreet.com/dwarkesh
* Google’s Gemma 4 is the first open model that’s let me shut off the internet and create a fully disconnected “focus machine”. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4
* Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post. And if you have something to visualize yourself, go to cursor.com/dwarkesh
Timestamps
(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism spreads model layers across racks
(01:03:37) – Why Ilya said, “As we now know, pipelining is not wise.”
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript preview
First 90 seconds
Dwarkesh Patel · Host 0:00
Today I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup. Previously, he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews: it's going to be a blackboard lecture. We're going to get up in a second. In fact, we built this whole new studio specifically with this format in mind, so it's a pleasure to get to inaugurate it with you. We're going to be talking about model architecture, ML infra, and many other things. The reason I think this is an important topic is that once you actually understand how training and inference work in a cluster, a lot of things start making sense: why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is. You need to understand the details to get there, and you need a blackboard to understand the details. So Reiner, thank you so much for doing this.
Reiner Pope · Guest 0:50
Yeah, very happy to be here.
Dwarkesh Patel · Host 0:51
Just a heads up: this is a lecture with graphs and equations and all that stuff, so if you can, I would really recommend watching it on a video platform like YouTube. Full disclosure: I am an angel investor in MatX, but that's unrelated to this podcast. Reiner, maybe to kick us off, I'll ask this question. We have a couple of companies, like Claude and Codex and Cursor, offering something like a fast mode, where for 6x the price, they'll stream you tokens at 2.5x the speed. Mechanically, I'm curious what's going on here. One, why is it the case that you can pay more to get faster latency? And two, could you keep going? Could you pay a hundred
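Since the preview cuts off here, here is a toy sketch of the tradeoff the question gestures at. This is an editor's illustration, not Reiner's answer: it assumes decode is memory-bandwidth bound (each decode step must stream every model weight from memory, shared across the whole batch), and every number in it is hypothetical.

```python
# Toy model of the fast-mode tradeoff (all numbers hypothetical).
# Assumption: decode is memory-bandwidth bound, so each step streams
# every model weight from HBM once, no matter how many user requests
# (the batch) share that step.

WEIGHT_BYTES = 2 * 70e9      # hypothetical 70B-param model, 2 bytes/param
BW_PER_CHIP = 3e12           # hypothetical HBM bandwidth per chip, bytes/s
COST_PER_CHIP_SEC = 0.002    # hypothetical $ per chip-second

def decode_economics(n_chips: int, batch_size: int):
    """Per-user speed and per-token cost for one model replica."""
    step_time = WEIGHT_BYTES / (BW_PER_CHIP * n_chips)  # seconds per decode step
    tokens_per_sec = 1 / step_time                      # each user gets 1 token/step
    cost_per_token = n_chips * COST_PER_CHIP_SEC * step_time / batch_size
    return tokens_per_sec, cost_per_token

# Standard mode: fewer chips, big batch -> cheap but slower per user.
std_speed, std_cost = decode_economics(n_chips=8, batch_size=256)
# Fast mode: more chips, smaller batch -> faster per user, pricier per token.
fast_speed, fast_cost = decode_economics(n_chips=32, batch_size=64)

print(f"standard: {std_speed:6.1f} tok/s, ${std_cost:.2e}/token")
print(f"fast:     {fast_speed:6.1f} tok/s, ${fast_cost:.2e}/token")
```

In this idealized model, per-user speed scales with chip count while per-token cost scales inversely with batch size, so the fast configuration is both faster and more expensive per token. Real deployments also pay communication overhead as chip count grows, which would be consistent with speedups like 2.5x costing a disproportionate 6x rather than scaling linearly.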