Dwarkesh Patel (Host) 00:00.520
I like this idea that the real reward hacking is the human researchers who are too focused on the evals. I think there are two ways to think about what you just pointed out. One is: look, if it's the case that simply by becoming superhuman at a
Dwarkesh Patel (Host) 00:20.260
coding competition, a model will not automatically become more tasteful and exercise better judgment about how to improve your codebase, well, then you should expand the suite of environments so that you're not just testing it on having the best performance in coding
Dwarkesh Patel (Host) 00:36.020
competition; it should also be able to make the best kind of application for X or Y or Z. The other way, and maybe this is what you're hinting at, is to ask why it should be the case in the first place that becoming superhuman at coding competitions doesn't make you a
Dwarkesh Patel (Host) 00:52.380
more tasteful programmer more generally. Maybe the thing to do is not to keep stacking up the number and diversity of environments, but to figure out an approach which lets you learn from one environment and improve your performance on something else.
Ilya Sutskever (Co-founder and Chief Scientist) 01:07.820
I have a human analogy which might be helpful. Let's take the case of competitive programming, since you mentioned that. Suppose you have two students. One of them decided they want to be the best competitive programmer, so they
Ilya Sutskever (Co-founder and Chief Scientist) 01:24.260
will practice 10,000 hours in that domain. They will solve all the problems, memorize all the proof techniques, and be very skilled at quickly and correctly implementing all the algorithms, and by doing so they became the best, or one of the
Ilya Sutskever (Co-founder and Chief Scientist) 01:43.140
best. Student number two thought competitive programming was cool; maybe they practiced for 100 hours, much less, and they also did really well. Which one do you think is going to do better in their career later on?
Dwarkesh Patel (Host) 01:56.100
The second. Right.
Ilya Sutskever (Co-founder and Chief Scientist) 01:57.500
And I think that's basically what's going on. The models are much more like the first student, but even more so, because then we say, "Okay, so the model should be good at competitive programming. So let's get every single competitive programming problem ever. And then let's do some
Ilya Sutskever (Co-founder and Chief Scientist) 02:10.820
data augmentation so we have even more competitive programming problems."
Dwarkesh Patel (Host) 02:14.100
Yes.
Ilya Sutskever (Co-founder and Chief Scientist) 02:14.700
And we train on that. And so now you've got this great competitive programmer. With this analogy, I think it's more intuitive that, okay, if it's so well trained, all the different algorithms and all the
Ilya Sutskever (Co-founder and Chief Scientist) 02:28.340
different proof techniques are right at its fingertips. And it's more intuitive that with this level of preparation it would not necessarily generalize to other things.
Dwarkesh Patel (Host) 02:39.780
But then what is the analogy for what the second student is doing before they do the 100 hours of fine-tuning?
Ilya Sutskever (Co-founder and Chief Scientist) 02:48.020
I think it's that they have it. I think it's the "it" factor.
Dwarkesh Patel (Host) 02:53.620
Yeah.
Ilya Sutskever (Co-founder and Chief Scientist) 02:54.140
Right? And when I was an undergrad, I remember there was a student like this who studied with me. So I know it exists.
Dwarkesh Patel (Host) 03:01.100
Yeah. I think it's interesting to distinguish it from whatever pre-training does. One way to understand what you just said about not having to choose the data in pre-training is to say it's actually not dissimilar to the 10,000 hours of practice. It's just that you get
Dwarkesh Patel (Host) 03:15.660
that 10,000 hours of practice for free, because it's already somewhere in the pre-training distribution. But maybe you're suggesting there's actually not that much generalization from pre-training. There's just so much data in pre-training, but it's
Dwarkesh Patel (Host) 03:29.300
not necessarily generalizing better than RL.
Ilya Sutskever (Co-founder and Chief Scientist) 03:31.220
The main strength of pre-training is that, A, there is so much of it.
Ilya Sutskever (Co-founder and Chief Scientist) 03:36.340
And, B, you don't have to think hard about what data to put into pre-training. It's very natural data, and it includes a lot of what people do.
Ilya Sutskever (Co-founder and Chief Scientist) 03:50.340
People's thoughts and a lot of their features; it's like the whole world as projected by people onto text.
Ilya Sutskever (Co-founder and Chief Scientist) 04:00.140
And pre-training tries to capture that using a huge amount of data. Pre-training is very difficult to reason about, because it's so hard to understand the manner in which the model relies on pre-training data. And whenever the model makes a mistake, could it be
Ilya Sutskever (Co-founder and Chief Scientist) 04:20.420
because something, by chance, is not as well supported by the pre-training data? "Support by pre-training" is maybe a loose term. I don't know if I can add anything more useful on this, but I don't think there is a human analog to pre-training.
Dwarkesh Patel (Host) 04:39.380
Here are some analogies that people have proposed for the human analog to pre-training, and I'm curious to get your thoughts on why they're potentially wrong. One is to think about the first 18 or 15 or 13 years of a person's life, when they aren't necessarily economically
Dwarkesh Patel (Host) 04:55.540
productive, but they are doing something that is making them understand the world better, and so forth. The other is to think about evolution as doing some kind of search for 3 billion years, which then results in a human lifetime instance. I'm curious if you
Dwarkesh Patel (Host) 05:15.460
think either of these is actually analogous to pre-training, or how you would think about what lifetime human learning is like, if not pre-training?