Ilya Sutskever – We're moving from the age of scaling to the age of research - part 6/17
2025-11-25_17-29 • 1h 36m 3s
Dwarkesh Patel (Host)
00:00.460
It still seems better than models. I mean, obviously models are better than the average human at language, math, and coding, but are they better than the average human at learning?
Ilya Sutskever (Co-founder and Chief Scientist)
00:09.580
Oh yeah. Oh yeah, absolutely. What I meant to say is that language, math, and coding, and especially math and coding, suggest that whatever it is that makes people good at learning is probably not so much a complicated prior, but something more fundamental.
Dwarkesh Patel (Host)
00:29.140
Wait, I'm not sure I understand. Why should that be the case?
Ilya Sutskever (Co-founder and Chief Scientist)
00:32.340
So consider a skill that people exhibit with some kind of great reliability. If the skill is one that was very useful to our ancestors for many millions of years, hundreds of millions of years, you could argue that maybe humans are good at it because of evolution, because we have a prior: an evolutionary prior that's encoded in some very non-obvious way and that somehow makes us so good at it. But if people exhibit great ability, reliability, robustness, and ability to learn in a domain that really did not exist until recently, then this is more an indication that people might just have better machine learning, period.
Dwarkesh Patel (Host)
01:29.180
But then how should we think about what that is? What is the ML analogy for it? There are a couple of interesting things about it. It takes fewer samples. It's more unsupervised. You don't have to set up a pre-built verifiable reward: a teenager learning how to drive a car is not exactly getting one; it comes from their interaction with the machine and with the environment. And yeah, it takes much fewer samples, and it seems more unsupervised. It seems more robust.
Ilya Sutskever (Co-founder and Chief Scientist)
02:07.260
Much more robust. The robustness of people is really staggering.
Dwarkesh Patel (Host)
02:12.380
Yeah. Okay, so do you have a unified way of thinking about why all these things are happening at once? What is the ML analogy that could realize something like this?
Ilya Sutskever (Co-founder and Chief Scientist)
02:24.020
So this is where one of the things that you've been asking about comes in: how can the teenage driver self-correct and learn from their experience without an external teacher? And the answer is: well, they have their value function. Right? They have a general sense, which is, by the way, also extremely robust in people. Whatever the human value function is, with a few exceptions around addiction, it's actually very, very robust. And so for something like a teenager that's learning to drive: they start to drive, and they immediately have a sense of how they're driving, how badly they're doing, how unconfident they are. And then, of course, the learning speed of any teenager is so fast; after 10 hours, you're good to go. Yeah.
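The episode doesn't spell out a mechanism, but the closest textbook analogue of a value function supplying its own training signal is temporal-difference learning, where an internal value estimate corrects itself against its own later predictions rather than against an external label. A minimal sketch under that reading; the toy corridor environment and all names are illustrative, not anything from the conversation:

```python
# A minimal sketch, not Sutskever's method: tabular TD(0) on a toy
# "corridor", where an internal value function turns raw experience
# into a learning signal with no external teacher or labels.
import numpy as np

rng = np.random.default_rng(0)
n_states = 10           # state 9 is the goal ("driving well"), reward 1
V = np.zeros(n_states)  # the agent's internal value function
alpha, gamma = 0.1, 0.9

for episode in range(2000):
    s = 0
    while s < n_states - 1:
        s_next = s + rng.integers(0, 2)   # noisy progress: stay or advance
        r = 1.0 if s_next == n_states - 1 else 0.0
        done = s_next == n_states - 1
        # TD error: the gap between what the value function predicted
        # and what actually happened -- the self-generated "how am I
        # doing?" signal, analogous to the teenager's instant feedback.
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha * td_error
        s = s_next

print(np.round(V, 2))  # values rise smoothly from start toward the goal
```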
Dwarkesh Patel (Host)
03:18.120
It seems like humans have some solution, but I'm curious: how are they doing it, and why is it so hard? How do we need to reconceptualize the way we're training models to make something like this possible?
Ilya Sutskever (Co-founder and Chief Scientist)
03:27.360
You know, that is a great question to ask, and it's a question I have a lot of opinions about. But unfortunately, we live in a world where not all machine learning ideas are discussed freely, and this is one of them. There's probably a way to do it. I think it can be done. The fact that people are like that is, I think, proof that it can be done. There may be another blocker, though, which is that there is a possibility that human neurons actually do more compute than we think. And if that is true, and if that plays an important role, then things might be more difficult. But regardless, I do think it points to the existence of some machine learning principle that I have opinions on, but that circumstances unfortunately make hard to discuss in detail.
Dwarkesh Patel (Host)
04:28.040
Even though nobody listens to this podcast, Ilya.
Ilya Sutskever (Co-founder and Chief Scientist)
04:31.640
Yeah.
Dwarkesh Patel (Host)
04:32.200
So I have to say that prepping for Ilya was pretty tough, because neither I nor anybody else had any idea what he's working on and what SSI is trying to do. I had no basis to come up with my questions, and the only thing I could go off of, honestly, was trying to think from first principles about what the bottlenecks to AGI are, because clearly Ilya is working on them in some way.
Part of this question involved thinking about RL scaling, because everybody's asking how well RL will generalize and how we can make it generalize better.
As part of this, I was reading this paper that came out recently on RL scaling, and it showed that the learning curve in RL actually looks like a sigmoid. I found this very curious. Why should it be a sigmoid, where it learns very little for a long time, then quickly learns a lot, and then asymptotes? This is very different from the power law you see in pre-training, where the model learns a bunch at the very beginning and then less and less over time.
And it actually reminded me of a note that I had written down after a conversation with a researcher friend, where he pointed out that the number of samples you need to take in order to find a correct answer scales exponentially with how different your current probability distribution is from the target probability distribution.
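To make that exponential claim concrete (a reconstruction, not the friend's or the paper's derivation): for a task where the model currently puts probability p on the correct answer, sampling until the first success is geometric, so

```latex
% My reconstruction of the claim, not the paper's derivation.
\[
  \mathbb{E}[\text{samples to first correct answer}]
    = \frac{1}{p}
    = 2^{-\log_2 p},
\]
% and $-\log_2 p$ is the cross-entropy in bits between the one-hot
% target distribution and the model's current distribution -- so the
% sample count is exponential in that measure of how different the
% two distributions are.
```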
And I was thinking about how these two ideas are related. I had this vague idea that they should be connected, but I really didn't know how. I don't have a math background, so I couldn't really formalize it. But I wondered if Gemini 3 could help me out here. So I took a picture of my notebook, took the paper, put them both into Gemini 3's context, and asked it to find the connection.
It thought for a bit, and then it realized that the correct way to model the information you gain from a single yes-or-no outcome in RL is as the entropy of a binary random variable.
It made a graph showing how the bits gained per sample in RL versus supervised learning scale as the pass rate increases. And as soon as I saw the graph that Gemini 3 made, immediately a ton of things started making sense to me.
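These formulas are my reading of the graph described here, not Gemini's actual output: a single pass/fail outcome carries at most the binary entropy H(p) bits, while being shown the correct answer, as in supervised learning, carries its full surprisal of -log2(p) bits. At low pass rates RL gets almost nothing per sample, which is where the sigmoid's long flat start would come from. A small Python sketch of the two curves:

```python
# Sketch of the comparison described above; the exact curves in the
# episode may differ. These are the standard information-theoretic
# quantities for the two kinds of feedback.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-4, 1 - 1e-4, 1000)  # pass rate

# RL: one yes/no outcome per rollout -> at most the binary entropy.
h_rl = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# SL: you are shown the correct answer -> its surprisal under the model.
h_sl = -np.log2(p)

plt.plot(p, h_rl, label="RL: H(p) bits per sample")
plt.plot(p, h_sl, label="SL: -log2(p) bits per sample")
plt.xlabel("pass rate p")
plt.ylabel("bits gained per sample (upper bound)")
plt.ylim(0, 8)
plt.legend()
plt.show()
```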
Then I wanted to see if there was any empirical basis for this theory. So I asked Gemini to code an experiment to show whether the improvement in loss scales this way with pass rate. I just took the code that Gemini outputted and copy-pasted it into a Google Colab notebook, and I was able to run this toy ML experiment and visualize its results without a single bug.
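The Colab code itself isn't shown, so here is a hypothetical version of such a toy experiment: a softmax policy over K candidate answers, comparing the one-step improvement in loss from a REINFORCE update (binary reward on a sampled answer) against a supervised cross-entropy update (shown the correct answer), across initial pass rates. All parameters are illustrative assumptions, not the episode's actual setup:

```python
# Hypothetical reconstruction of the toy experiment -- not the code
# Gemini actually produced. One softmax "policy" over K answers;
# measure how much one gradient step improves -log p(correct),
# for RL-style vs. SL-style feedback, at several initial pass rates.
import numpy as np

rng = np.random.default_rng(0)
K, lr, trials = 100, 1.0, 2000

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_gain(p0, mode):
    # Build logits so the correct answer (index 0) has pass rate ~p0.
    z = np.zeros(K)
    z[0] = np.log(p0 * (K - 1) / (1 - p0))
    pi = softmax(z)
    if mode == "sl":
        # Cross-entropy gradient toward the known correct answer.
        grad = pi.copy()
        grad[0] -= 1.0
    else:
        # REINFORCE: sample an answer, reward 1 iff it is correct.
        a = rng.choice(K, p=pi)
        r = 1.0 if a == 0 else 0.0
        onehot = np.zeros(K)
        onehot[a] = 1.0
        grad = -r * (onehot - pi)  # gradient of -r * log pi(a)
    z_new = z - lr * grad
    # Improvement in the loss -log p(correct) from this single step.
    return -np.log(pi[0]) + np.log(softmax(z_new)[0])

for p0 in [0.001, 0.01, 0.1, 0.5]:
    rl = np.mean([one_step_gain(p0, "rl") for _ in range(trials)])
    sl = one_step_gain(p0, "sl")
    print(f"pass rate {p0:>5}: RL gain {rl:.4f}  SL gain {sl:.4f}")
```

At low pass rates the RL column collapses toward zero (most rollouts earn no reward and carry no signal), while the SL column stays roughly constant, pinned by the fixed learning rate rather than by the large surprisal of the rare correct answer.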
It's interesting, because the results looked similar, but not identical, to what we should have expected. So I downloaded the chart, put it into Gemini, and asked it: what is going on here?
And it came up with a hypothesis that I think is actually correct, which is that by having a fixed learning rate, we were capping how much supervised learning could improve in the beginning; in fact, we should decrease the learning rate over time. This actually gives us an intuitive understanding of why, in practice, we have learning-rate schedulers that decrease the learning rate over time.
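As a concrete reference point (a standard pattern, not the experiment's actual code), this is what such a decaying schedule looks like in PyTorch, with a placeholder model and step count:

```python
# Standard decaying-learning-rate pattern; model and sizes are placeholders.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Cosine decay: large steps early, ever smaller steps later.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # shrink the learning rate toward ~0 over T_max steps
```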
I did this entire flow, from coming up with a vague initial question, to building a theoretical understanding, to running some toy ML experiments, all with Gemini 3. This feels like the first model that can actually come up with new connections I wouldn't have anticipated. It's now become the default place I go when I want to brainstorm new ways to think about a problem. If you want to read more about RL scaling, you can check out the blog post that I wrote with a little help from Gemini 3. And if you want to check out Gemini 3 yourself, go to gemini.google.
I'm