Dwarkesh Patel (Host) 00:00.720
That's a very interesting way to put it. But let me ask you the question you just posed. What are we scaling, and what would it mean to have a recipe? Because I'm not aware of a very clean relationship, one that almost looks like a law of physics, which
Dwarkesh Patel (Host) 00:18.160
existed in pre-training. There was a power law between data, compute, parameters, and loss. What is the kind of relationship we should be seeking, and how should we think about what this new recipe might look like?
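For reference, the pre-training relationship being alluded to here is usually written as a Chinchilla-style power law. The form below is the standard statement of it, with the fitted constants left abstract rather than taken from any particular paper.

```latex
% Pre-training scaling law (Chinchilla-style form): loss as a power law
% in parameter count N and training tokens D, for a fixed recipe.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% E is the irreducible loss; A, B, \alpha, \beta are fitted constants.
```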
Ilya Sutskever (Co-founder and Chief Scientist) 00:32.800
So, we've already witnessed a transition from one type of scaling to a different type of scaling: from pre-training to RL. Now people are scaling RL. Based on what people say on Twitter, they spend more compute on RL than on pre-training at this point, because RL
Ilya Sutskever (Co-founder and Chief Scientist) 00:55.120
can actually consume quite a bit of compute. You do very, very long rollouts. Yes. So it takes a lot of compute to produce those rollouts, and then you get a relatively small amount of learning per rollout, so you really can spend a lot of
Ilya Sutskever (Co-founder and Chief Scientist) 01:08.440
compute. At this point, I wouldn't even call it scaling. I would say, "Hey, what are you doing? Is the thing you are doing the most productive thing you could be doing? Can you find a
Ilya Sutskever (Co-founder and Chief Scientist) 01:26.240
more productive way of using your compute?" We've discussed the value function business earlier, and maybe once people get good at value functions, they'll be using their resources more productively. And if you find a whole other way of training models, you could
Ilya Sutskever (Co-founder and Chief Scientist) 01:46.280
say, is this scaling, or is it just using your resources? I think it becomes a little bit ambiguous, in the sense that back in the age of research, people would say, "Hey, let's try this and this and this. Let's try that and that and that. Oh, look,
Ilya Sutskever (Co-founder and Chief Scientist) 01:59.320
something interesting is happening." And I think there will be a return to that.
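To make the compute asymmetry concrete, here is a minimal sketch in toy Python, not any lab's actual stack, of why long rollouts are expensive: each episode spends thousands of forward passes to earn a single scalar reward, which then drives just one small gradient step. All shapes and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy REINFORCE loop illustrating "lots of compute per rollout,
# little learning per rollout". Everything here is illustrative.
rng = np.random.default_rng(0)
theta = np.zeros(8)  # toy policy parameters

def policy_step(state, theta):
    # One "token" of generation: compute scales with rollout length.
    p = 1.0 / (1.0 + np.exp(-(state @ theta)))
    return rng.random() < p

def rollout(theta, length=4096):
    # A long chain-of-thought-style rollout: `length` forward passes...
    states = rng.normal(size=(length, theta.size))
    actions = np.array([policy_step(s, theta) for s in states])
    reward = float(actions.mean() > 0.5)  # ...for one sparse scalar reward
    return states, actions, reward

def reinforce_update(theta, states, actions, reward, lr=1e-3):
    # The single terminal reward is all the learning signal extracted
    # from the thousands of forward passes above.
    p = 1.0 / (1.0 + np.exp(-(states @ theta)))
    grad = ((actions - p)[:, None] * states).mean(axis=0) * reward
    return theta + lr * grad

for episode in range(10):
    s, a, r = rollout(theta)
    theta = reinforce_update(theta, s, a, r)
```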
Dwarkesh Patel (Host) 02:04.280
So if we're back in the era of research, stepping back, what is the part of the recipe that we need to think most about? When you say value function, people are already trying the current recipe but with an LLM as a judge and so forth. You could say that's the value
Dwarkesh Patel (Host) 02:18.400
function, but it sounds like you have something much more fundamental in mind. Should we even rethink pre-training itself, and not just add more steps to the end of that process?
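For readers who haven't seen the "LLM as a judge" pattern the host is referring to: a second model scores each rollout, and that score stands in for a reward or value signal. The sketch below is a hedged illustration of the pattern; `call_judge_model` is a hypothetical stub, not a real API.

```python
def call_judge_model(prompt: str) -> str:
    # Hypothetical stub; swap in whatever inference client you use.
    return "7"

def judge_score(task: str, rollout: str) -> float:
    # Ask the judge model for a 0-10 rating and normalize it to [0, 1].
    prompt = (
        "Rate the following answer from 0 to 10.\n"
        f"Task: {task}\nAnswer: {rollout}\nScore:"
    )
    reply = call_judge_model(prompt)
    try:
        return float(reply.strip()) / 10.0
    except ValueError:
        return 0.0  # unparseable judge output yields no reward

rollouts = ["first candidate answer", "second candidate answer"]
rewards = [judge_score("summarize the paper", r) for r in rollouts]
print(rewards)  # these scalars can then drive an RL update
```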
Ilya Sutskever (Co-founder and Chief Scientist) 02:30.400
So, the discussion about the value function, I think it was interesting. I want to emphasize that the value function is something that's going to make our RL more efficient, and I think that makes a difference. But I think that anything you can do with a
Ilya Sutskever (Co-founder and Chief Scientist) 02:49.080
value function, you can do without, just more slowly. The thing which I think is the most fundamental is that these models somehow just generalize dramatically worse than people. Yes. And it's super obvious. That seems like a very fundamental thing.
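A quick way to see the "efficiency, not capability" point about value functions: in policy-gradient methods, subtracting a learned baseline V(s) from the return leaves the expected gradient unchanged but shrinks its variance, so the same answer just takes fewer rollouts without one. The toy numbers below are illustrative assumptions; the reward is deliberately action-independent, so the true gradient is zero and both estimators converge to it, one much faster than the other.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_estimates(n_rollouts, use_baseline):
    # REINFORCE gradient terms (a - p) * s * (R - b) for a toy policy.
    grads = []
    for _ in range(n_rollouts):
        s = rng.normal(size=4)                    # toy state features
        p = 0.5                                   # toy action probability
        a = rng.random() < p
        reward = 1.0 + rng.normal()               # noisy return, mean 1
        baseline = 1.0 if use_baseline else 0.0   # V(s) ~= E[return]
        grads.append((a - p) * s * (reward - baseline))
    return np.array(grads)

for flag in (False, True):
    g = grad_estimates(10_000, use_baseline=flag)
    # Mean is ~0 either way (the baseline adds no bias); the variance,
    # i.e. the number of rollouts needed, drops roughly in half with it.
    print(f"baseline={flag}: mean={g.mean(axis=0).round(3)}, "
          f"var={g.var(axis=0).mean():.3f}")
```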
Dwarkesh Patel (Host) 03:07.520
So this is the crux: generalization. And there are two sub-questions. One is about sample efficiency: why should it take so much more data for these models to learn than for humans? The second is that, even separate from the amount of data it takes, there's
Dwarkesh Patel (Host) 03:24.320
a question of why it is so much harder to teach the thing we want to a model than to a human. Which is to say, with a human, we don't necessarily need a verifiable reward. You're probably mentoring a bunch of researchers right now: you're talking with
Dwarkesh Patel (Host) 03:41.120
them, you're showing them your code, and you're showing them how you think. And from that, they're picking up your way of thinking and how they should do research. You don't have to set up a verifiable reward for them that's like, "Okay, this is the next part of their
Dwarkesh Patel (Host) 03:52.400
curriculum." And now this is the next part of their curriculum and oh it was this training was unstable. and we got a there's not this schleppy bespoke process. So, perhaps these two issues are actually related in some way. But I'd be curious to explore this this second thing
Dwarkesh Patel (Host) 04:07.220
which is more like continual learning, and this first thing, which feels more like sample efficiency.
Ilya Sutskever (Co-founder and Chief Scientist) 04:13.380
Yeah. So one possible explanation for human sample efficiency that needs to be considered is evolution. Evolution has given us a small amount of the most useful information possible. And for things like vision, hearing, and
Ilya Sutskever (Co-founder and Chief Scientist) 04:36.300
locomotion, I think there's a pretty strong case that evolution actually has given us a lot. For example, human dexterity far exceeds that of robots. I mean, robots can become dexterous too if you subject them to a huge amount of training in simulation. But to train a robot in the
Ilya Sutskever (Co-founder and Chief Scientist) 04:55.780
real world to quickly pick up a new skill the way a person does seems very out of reach. And here you could say, "Oh yeah, locomotion. All our ancestors needed great locomotion, squirrels and so on." So for locomotion, maybe we've got some unbelievable prior. You could
Ilya Sutskever (Co-founder and Chief Scientist) 05:13.940
make the same case for vision. I believe Yann LeCun made the point, "Oh, teenagers learn to drive at 16 after just 10 hours of practice." Which is true. But our vision is so good. At least for me, when I remember myself being a five-year-old, I
Ilya Sutskever (Co-founder and Chief Scientist) 05:31.820
was very excited about cars back then, and I'm pretty sure my car recognition was already more than adequate for self-driving as a five-year-old. You don't get to see that much data as a five-year-old; you spend most of your time in your parents' house, so you have very low data
Ilya Sutskever (Co-founder and Chief Scientist) 05:46.300
diversity. But you could say maybe that's evolution too. But then language and math and coding, probably not.