Dwarkesh Patel (Host) 00:00.520
I like this idea that the real reward hacking is the human researchers who are too focused on the evals. I think there are two ways to think about what you just pointed out. One is, look, if it's the case that simply by becoming superhuman at a
Dwarkesh Patel (Host) 00:20.260
coding competition, a model will not automatically become more tasteful and exercise better judgment about how to improve a code base, then you should expand the suite of environments so that you're not just testing it on having the best performance in coding
Dwarkesh Patel (Host) 00:36.020
competitions; it should also be able to build the best application for X or Y or Z. The other way, and maybe this is what you're hinting at, is to ask why it should be the case in the first place that becoming superhuman at coding competitions doesn't make you a
Dwarkesh Patel (Host) 00:52.380
more tasteful programmer more generally. Maybe the thing to do is not to keep stacking up the number and diversity of environments, but to figure out an approach that lets you learn from one environment and improve your performance on something else.
Ilya Sutskever (Co-founder and Chief Scientist) 01:07.820
I have a human analogy which might be helpful. Let's take the case of competitive programming, since you mentioned it. Suppose you have two students. One of them decided they want to be the best competitive programmer, so
Ilya Sutskever (Co-founder and Chief Scientist) 01:24.180
they will practice 10,000 hours in that domain. They will solve all the problems, memorize all the proof techniques, and become very skilled at quickly and correctly implementing all the algorithms, and by doing so they become the best, or one of
Ilya Sutskever (Co-founder and Chief Scientist) 01:43.020
the best. Student number two thought competitive programming was cool, and maybe they practiced for 100 hours, much less, and they also did really well. Which one do you think is going to do better in their career later on?
Dwarkesh Patel (Host) 01:55.260
The second.
Ilya Sutskever (Co-founder and Chief Scientist) 01:56.100
Right? And I think that's basically what's going on. The models are much more like the first student, but even more so, because then we say, "Okay, the model should be good at competitive programming. Let's get every single competitive programming problem ever. And
Ilya Sutskever (Co-founder and Chief Scientist) 02:10.300
then let's do some data augmentation so we have even more competitive programming problems, and we train on that." So now you've got this great competitive programmer. With this analogy, I think it's more intuitive that,
Ilya Sutskever (Co-founder and Chief Scientist) 02:23.060
okay, if it's so well trained, all the different algorithms and all the different proof techniques are right at its fingertips. And it's more intuitive that with this level of preparation it would not necessarily generalize to other things.
Dwarkesh Patel (Host) 02:39.780
But then what is the analogy for what the second student is doing before they do the 100 hours of fine-tuning?
Ilya Sutskever (Co-founder and Chief Scientist) 02:48.020
I think they have it. It's the "it" factor. Right? When I was an undergrad, I remember there was a student like this who studied with me, so I know it exists.
Dwarkesh Patel (Host) 03:01.100
Yeah. I think it's interesting to distinguish that from whatever pre-training does. One way to understand what you just said, about not having to choose the data in pre-training, is to say it's actually not dissimilar to the 10,000 hours of practice. It's just that you get
Dwarkesh Patel (Host) 03:15.660
that 10,000 hours of practice for free, because it's already somewhere in the pre-training distribution. But maybe you're suggesting there's actually not that much generalization from pre-training. There's just so much data in pre-training, and it's
Dwarkesh Patel (Host) 03:29.300
not necessarily generalizing better than RL.
Ilya Sutskever (Co-founder and Chief Scientist) 03:31.220
The main strength of pre-training is that, A, there is so much of it, and B, you don't have to think hard about what data to put into it. It's very natural data, and it includes a lot of what people do: people's
Ilya Sutskever (Co-founder and Chief Scientist) 03:50.780
thoughts and a lot of the features of the world. It's like the whole world as projected by people onto text, and pre-training tries to capture that using a huge amount of data. Pre-training is very difficult to reason about because it's so hard to
Ilya Sutskever (Co-founder and Chief Scientist) 04:11.060
understand the manner in which the model relies on pre-training data. Whenever the model makes a mistake, could it be because something, by chance, is not well supported by the pre-training data? "Supported by pre-training" is maybe a loose term. I don't know if
Ilya Sutskever (Co-founder and Chief Scientist) 04:31.340
I can add anything more useful on this, but I don't think there is a human analog to pre-training.
Dwarkesh Patel (Host) 04:39.380
Here are some analogies people have proposed for the human analog to pre-training, and I'm curious to get your thoughts on why they're potentially wrong. One is to think about the first 18, or 15, or 13 years of a person's life, when they aren't necessarily economically
Dwarkesh Patel (Host) 04:55.540
productive, but they are doing something that is making them understand the world better, and so forth. The other is to think of evolution as doing some kind of search for 3 billion years, which then results in a human lifetime instance. I'm curious whether you
Dwarkesh Patel (Host) 05:15.460
think either of these is actually analogous to pre-training, or how you would think about what lifetime human learning is like, if not pre-training.