Dwarkesh Patel (Host) 00:00.460
It still seems better than models. I mean, obviously models are better than the average human at language, math, and coding, but are they better than the average human at learning?
Ilya Sutskever (Co-founder and Chief Scientist) 00:09.580
Oh yeah. Oh yeah, absolutely. What I meant to say is that language, math, and coding, and especially math and coding, suggest that whatever it is that makes people good at learning is probably not so much a complicated prior, but something more fundamental.
Dwarkesh Patel (Host) 00:29.140
Wait, I'm not sure I understand. Why should that be the case?
Ilya Sutskever (Co-founder and Chief Scientist) 00:32.340
So consider a skill that people exhibit with some kind of great reliability. If the skill is one that was very useful to our ancestors for many millions of years, hundreds of millions of years even, you could argue that maybe humans are good at it
Ilya Sutskever (Co-founder and Chief Scientist) 00:54.420
because of evolution, because we have a prior: an evolutionary prior that's encoded in some very non-obvious way and that somehow makes us so good at it. But if people exhibit great reliability, robustness, and ability to learn in a domain that really did not exist until
Ilya Sutskever (Co-founder and Chief Scientist) 01:17.900
recently, then this is more an indication that people might just have better machine learning, period.
Dwarkesh Patel (Host) 01:29.180
But then how should we think about what that is? What is the ML analogy for it? There are a couple of interesting things about it. It takes fewer samples. It's more unsupervised. You don't have to set up, like, a child learning to drive a car. No, not a child,
Dwarkesh Patel (Host) 01:46.020
a teenager learning how to drive a car is not exactly getting some pre-built verifiable reward. It comes from their interaction with the machine and with the environment. And it takes much fewer samples, it seems more
Dwarkesh Patel (Host) 02:04.340
unsupervised. It seems more robust.
Ilya Sutskever (Co-founder and Chief Scientist) 02:07.260
Much more robust. The robustness of people is really staggering.
Dwarkesh Patel (Host) 02:12.380
Yeah, so okay, do you have a unified way of thinking about why all these things are happening at once? What is the ML analogy that could realize something like this?
Ilya Sutskever (Co-founder and Chief Scientist) 02:24.020
So this is where, you know, one of the things that you've been asking about is how the teenage driver can kind of self-correct and learn from their experience without an external teacher. And the answer is, well, they have their value function. Right? They have
Ilya Sutskever (Co-founder and Chief Scientist) 02:41.860
a general sense, which is also, by the way, extremely robust in people. Whatever the human value function is, with a few exceptions around addiction, it's actually very, very robust. And so for something like a teenager that's learning
Ilya Sutskever (Co-founder and Chief Scientist) 03:01.220
to drive, they start to drive and they immediately already have a sense of how they're driving, how badly, how unconfidently. And then, of course, the learning speed of any teenager is so fast: after 10 hours you're good to go. Yeah. It seems like
Dwarkesh Patel (Host) 03:18.120
humans have some solution, but I'm curious about how they're doing it, why it's so hard, and how we need to reconceptualize the way we're training models to make something like this possible?
Ilya Sutskever (Co-founder and Chief Scientist) 03:27.360
You know, that is a great question to ask, and it's a question I have a lot of opinions about. But unfortunately, we live in a world where not all machine learning ideas are discussed freely, and this is one of them. So there's probably a way to do it. I think it can be
Ilya Sutskever (Co-founder and Chief Scientist) 03:50.000
done. The fact that people are like that is, I think, proof that it can be done. There may be another blocker, though, which is that there is a possibility that human neurons actually do more compute than we think. And if that is true, and if that plays an important role, then
Ilya Sutskever (Co-founder and Chief Scientist) 04:11.040
things might be more difficult. But regardless, I do think it points to the existence of some machine learning principle that I have opinions on, but unfortunately, circumstances make it hard to discuss in detail.
Dwarkesh Patel (Host) 04:28.040
even though nobody listens to this podcast, Ilya.
Ilya Sutskever (Co-founder and Chief Scientist) 04:31.640
Yeah.
Dwarkesh Patel (Host) 04:32.200
So I have to say that prepping for Ilya was pretty tough, because neither I nor anybody else had any idea what he's working on or what SSI is trying to do. I had no basis to come up with my questions, and the only thing I could go off of, honestly, was trying to think from first
Dwarkesh Patel (Host) 04:48.160
principles about what the bottlenecks to AGI are, because clearly Ilya is working on them in some way. Part of this question involved thinking about RL scaling, because everybody's asking how well RL will generalize and how we can make it generalize better. As part of this, I was
Dwarkesh Patel (Host) 05:02.320
reading this paper that came out recently on RL scaling, and it showed that actually the learning curve in RL looks like a sigmoid. I found this very curious. Why should it be a sigmoid, where it learns very little for a long time, then quickly learns a lot, and then it
Dwarkesh Patel (Host) 05:16.400
asymptotes? This is very different from the power law you see in pre-training, where the model learns a bunch at the very beginning and then less and less over time. And it actually reminded me of a note that I had written down after I had a conversation with a researcher
Dwarkesh Patel (Host) 05:29.000
friend, where he pointed out that the number of samples you need to take in order to find a correct answer scales exponentially with how different your current probability distribution is from the target probability distribution.
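To make that concrete with a rough sketch (my framing here, not my friend's or the paper's): if the current policy assigns probability p to the correct answer, sampling until the first success is roughly a geometric process, so you need about 1/p samples on average; and if p falls off like 2^-D for a divergence of D bits from the target distribution, the sample count blows up exponentially with D.

```python
# Rough illustration of the intuition above (my sketch, not the paper's math):
# expected samples to the first success of a geometric process is 1/p, and
# p ~ 2^(-D) when the policy sits D bits of divergence away from the target.
for divergence_bits in [1, 2, 5, 10, 20]:
    p_correct = 2.0 ** (-divergence_bits)
    expected_samples = 1.0 / p_correct
    print(f"divergence = {divergence_bits:2d} bits -> p(correct) = {p_correct:.2e}, "
          f"expected samples to first success ~ {expected_samples:,.0f}")
```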
Dwarkesh Patel (Host) 05:43.040
And I was thinking about how these two ideas are related. I had this vague idea that they should be connected, but I really didn't know how. I don't have a math background, so I couldn't really formalize it. But I wondered if Gemini 3 could help me out here. And so I took a picture of my notebook and I took the
Dwarkesh Patel (Host) 05:55.360
paper and I put them both in the context of Gemini 3 and I asked it to find the connection. And it thought a bunch, and then it realized that the correct way to model the information you gain from a single yes-or-no outcome in RL is as the entropy of a binary random variable.
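To sketch what that comparison looks like (my reconstruction; the plot Gemini made may have differed): a single pass/fail reward carries at most H(p) bits, the entropy of a Bernoulli variable with success probability p, while being shown the correct answer carries roughly -log2(p) bits when the model currently assigns it probability p.

```python
import math

def rl_bits_per_sample(p: float) -> float:
    """Entropy of a single pass/fail outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def supervised_bits_per_sample(p: float) -> float:
    """Surprisal of a correct answer the model currently assigns probability p."""
    return -math.log2(p)

print(f"{'pass rate':>10} {'RL bits':>10} {'supervised bits':>16}")
for p in [0.001, 0.01, 0.1, 0.3, 0.5, 0.9]:
    print(f"{p:>10.3f} {rl_bits_per_sample(p):>10.3f} {supervised_bits_per_sample(p):>16.3f}")
```

If this framing is right, the RL signal carries almost no information per sample at very low pass rates and peaks around a 50% pass rate, which lines up with a learning curve that stays flat for a long time before taking off.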
Dwarkesh Patel (Host) 06:12.440
It made a graph which showed how the bits gained per sample in RL versus supervised learning scale as the pass rate increases. And as soon as I saw the graph that Gemini 3 made, immediately a ton of things started making sense to me. Then I wanted to see if there was any empirical
Dwarkesh Patel (Host) 06:27.760
basis to this theory. So I asked Gemini to code an experiment to show whether the improvement in loss scales in this way with pass rate. I just took the code that Gemini outputted, copy-pasted it into a Google Colab notebook, and I was able to run this toy ML experiment and visualize its results without a single bug.
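The actual notebook isn't shown in the episode, but a toy experiment in the same spirit might look something like the sketch below: a tiny softmax policy over a few options with one correct answer, trained either with a supervised cross-entropy gradient or a REINFORCE-style update from a binary reward, while tracking how much the loss improves per sample at different pass rates (the constants and setup here are purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
K, CORRECT, LR, STEPS = 16, 3, 0.5, 2000   # options, correct index, fixed LR, samples

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run(mode):
    """Return (pass rate before update, loss drop from update) for each sample."""
    z = np.zeros(K)                        # logits of a tiny softmax "policy"
    history = []
    for _ in range(STEPS):
        p = softmax(z)
        loss_before = -np.log(p[CORRECT])
        if mode == "supervised":
            # Cross-entropy gradient toward the known correct answer.
            grad = p.copy(); grad[CORRECT] -= 1.0
        else:
            # REINFORCE with a binary reward: learn only when a sample is correct.
            a = rng.choice(K, p=p)
            grad = np.zeros(K)
            if a == CORRECT:
                grad = p.copy(); grad[a] -= 1.0
        z -= LR * grad
        loss_after = -np.log(softmax(z)[CORRECT])
        history.append((p[CORRECT], loss_before - loss_after))
    return history

for mode in ["supervised", "rl"]:
    hist = run(mode)
    low = np.mean([d for pr, d in hist if pr < 0.2])    # improvement at low pass rate
    high = np.mean([d for pr, d in hist if pr >= 0.2])  # improvement at higher pass rate
    print(f"{mode:>10}: avg loss drop per sample at pass rate <0.2 = {low:.4f}, >=0.2 = {high:.4f}")
```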
Dwarkesh Patel (Host) 06:44.120
It's interesting because the results looked similar, but not identical, to what we would have expected. And so I downloaded this chart and I put it into Gemini and I asked it, what is going on here? And it came up with a hypothesis that I
Dwarkesh Patel (Host) 06:56.880
think is actually correct, which is that we're capping how much supervised learning can improve in the beginning by having a fixed learning rate, and in fact we should decrease the learning rate over time. It actually gives us an intuitive understanding for why, in practice, we have learning rate schedulers that decrease the learning rate over time.
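For context, a learning rate scheduler just shrinks the step size as training proceeds. A minimal version (illustrative only, not the scheduler from the experiment) looks something like this, starting with large steps early and annealing toward a small floor:

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float = 0.5, lr_min: float = 0.01) -> float:
    """Cosine decay from lr_max down to lr_min over the course of training."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for step in [0, 250, 500, 750, 1000]:
    print(f"step {step:4d}: lr = {cosine_lr(step, 1000):.3f}")
```

The connection to the hypothesis above, as I read it: with a fixed learning rate you can't take the big early steps that the high-information supervised signal would allow, so decaying from a large rate to a small one serves both ends of training.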
Dwarkesh Patel (Host) 07:10.280
I did this entire flow, from coming up with a vague initial question, to building a theoretical understanding, to running some toy ML experiments, all with Gemini 3. This feels like the first model where
Dwarkesh Patel (Host) 07:26.160
it can actually come up with new connections that I wouldn't have anticipated. It's now become the default place I go when I want to brainstorm new ways to think about a problem. If you want to read more about RL scaling, you can check out the blog post that I wrote
Dwarkesh Patel (Host) 07:38.800
with a little help from Gemini 3. And if you want to check out Gemini 3 yourself, go to gemini.google. I'm