Ilya Sutskever – We're moving from the age of scaling to the age of research - part 6/17
2025-11-25_17-29 • 1h 36m 3s
Dwarkesh Patel (Host)
00:00.460
It still seems better than models. I mean, obviously models are better than the average human at language, math, and coding, but are they better than the average human at learning?
Ilya Sutskever (Co-founder and Chief Scientist)
00:09.580
Oh yeah. Oh yeah, absolutely. What I meant to say is that language, math, and coding, and especially math and coding, suggest that whatever it is that makes people good at learning is probably not so much a complicated prior, but something more fundamental.
Dwarkesh Patel (Host)
00:29.140
Wait, I'm not sure I understand. Why should that be the case?
Ilya Sutskever (Co-founder and Chief Scientist)
00:32.340
So consider a skill that people exhibit with some kind of great reliability. If the skill is one that was very useful to our ancestors for many millions of years, hundreds of millions of years, you could argue that maybe humans are good at it because of evolution, because we have a prior: an evolutionary prior that's encoded in some very non-obvious way and that somehow makes us so good at it. But if people exhibit great ability, reliability, robustness, and ability to learn in a domain that really did not exist until recently, then this is more an indication that people might just have better machine learning, period.
Dwarkesh Patel (Host)
01:29.180
But then how should we think about what that is? What is the ML analogy for it? There are a couple of interesting things about it. It takes fewer samples. It's more unsupervised. You don't have to set up a pre-built verifiable reward: a teenager learning how to drive a car is not exactly getting one; it comes from their interaction with the machine and with the environment. And yeah, it takes much fewer samples, and it seems more unsupervised. It seems more robust.
Ilya Sutskever (Co-founder and Chief Scientist)
02:07.260
Much more robust. The robustness of people is really staggering.
Dwarkesh Patel (Host)
02:12.380
Yeah. Okay, so do you have a unified way of thinking about why all these things are happening at once? What is the ML analogy that could realize something like this?
Ilya Sutskever (Co-founder and Chief Scientist)
02:24.020
So this is where one of the things that you've been asking about comes in: how can the teenage driver self-correct and learn from their experience without an external teacher? And the answer is: well, they have their value function. Right? They have a general sense, which is, by the way, also extremely robust in people. Whatever the human value function is, with a few exceptions around addiction, it's actually very, very robust. And so for something like a teenager that's learning to drive: they start to drive, and they immediately have a sense of how they're driving, how badly they're doing, how unconfident they are. And then, of course, the learning speed of any teenager is so fast; after 10 hours, you're good to go. Yeah.
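The episode doesn't spell out a mechanism, but the closest textbook analogue of a value function supplying its own training signal is temporal-difference learning, where an internal value estimate corrects itself against its own later predictions rather than against an external label. A minimal sketch under that reading; the toy corridor environment and all names are illustrative, not anything from the conversation:

```python
# A minimal sketch, not Sutskever's method: tabular TD(0) on a toy
# "corridor", where an internal value function turns raw experience
# into a learning signal with no external teacher or labels.
import numpy as np

rng = np.random.default_rng(0)
n_states = 10           # state 9 is the goal ("driving well"), reward 1
V = np.zeros(n_states)  # the agent's internal value function
alpha, gamma = 0.1, 0.9

for episode in range(2000):
    s = 0
    while s < n_states - 1:
        s_next = s + rng.integers(0, 2)   # noisy progress: stay or advance
        r = 1.0 if s_next == n_states - 1 else 0.0
        done = s_next == n_states - 1
        # TD error: the gap between what the value function predicted
        # and what actually happened -- the self-generated "how am I
        # doing?" signal, analogous to the teenager's instant feedback.
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha * td_error
        s = s_next

print(np.round(V, 2))  # values rise smoothly from start toward the goal
```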
Dwarkesh Patel (Host)
03:18.120
It seems like humans have some solution, but I'm curious: how are they doing it, and why is it so hard? How do we need to reconceptualize the way we're training models to make something like this possible?
Ilya Sutskever (Co-founder and Chief Scientist)
03:27.360
You know, that is a great question to ask, and it's a question I have a lot of opinions about. But unfortunately, we live in a world where not all machine learning ideas are discussed freely, and this is one of them. There's probably a way to do it. I think it can be done. The fact that people are like that is, I think, proof that it can be done. There may be another blocker, though, which is that there is a possibility that human neurons actually do more compute than we think. And if that is true, and if that plays an important role, then things might be more difficult. But regardless, I do think it points to the existence of some machine learning principle that I have opinions on, but that circumstances unfortunately make hard to discuss in detail.
Dwarkesh Patel (Host)
04:28.040
Even though nobody listens to this podcast, Ilya.
Ilya Sutskever (Co-founder and Chief Scientist)
04:31.640
Yeah.
Dwarkesh Patel (Host)
04:32.200
So I have to say that prepping for Ilya was pretty tough, because neither I nor anybody else had any idea what he's working on and what SSI is trying to do. I had no basis to come up with my questions, and the only thing I could go off of, honestly, was trying to think from first principles about what the bottlenecks to AGI are, because clearly Ilya is working on them in some way.
Part of this question involved thinking about RL scaling, because everybody's asking how well RL will generalize and how we can make it generalize better.
As part of this, I was reading this paper that came out recently on RL scaling, and it showed that the learning curve in RL actually looks like a sigmoid. I found this very curious. Why should it be a sigmoid, where it learns very little for a long time, then quickly learns a lot, and then asymptotes? This is very different from the power law you see in pre-training, where the model learns a bunch at the very beginning and then less and less over time.
And it actually reminded me of a note that I had written down after a conversation with a researcher friend, where he pointed out that the number of samples you need to take in order to find a correct answer scales exponentially with how different your current probability distribution is from the target probability distribution.
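To make that exponential claim concrete (a reconstruction, not the friend's or the paper's derivation): for a task where the model currently puts probability p on the correct answer, sampling until the first success is geometric, so

```latex
% My reconstruction of the claim, not the paper's derivation.
\[
  \mathbb{E}[\text{samples to first correct answer}]
    = \frac{1}{p}
    = 2^{-\log_2 p},
\]
% and $-\log_2 p$ is the cross-entropy in bits between the one-hot
% target distribution and the model's current distribution -- so the
% sample count is exponential in that measure of how different the
% two distributions are.
```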
And I was thinking about how these two ideas are related. I had this vague idea that they should be connected, but I really didn't know how. I don't have a math background, so I couldn't really formalize it. But I wondered if Gemini 3 could help me out here. So I took a picture of my notebook, took the paper, put them both into Gemini 3's context, and asked it to find the connection.
It thought for a bit, and then it realized that the correct way to model the information you gain from a single yes-or-no outcome in RL is as the entropy of a binary random variable.
It made a graph showing how the bits gained per sample in RL versus supervised learning scale as the pass rate increases. And as soon as I saw the graph that Gemini 3 made, immediately a ton of things started making sense to me.
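These formulas are my reading of the graph described here, not Gemini's actual output: a single pass/fail outcome carries at most the binary entropy H(p) bits, while being shown the correct answer, as in supervised learning, carries its full surprisal of -log2(p) bits. At low pass rates RL gets almost nothing per sample, which is where the sigmoid's long flat start would come from. A small Python sketch of the two curves:

```python
# Sketch of the comparison described above; the exact curves in the
# episode may differ. These are the standard information-theoretic
# quantities for the two kinds of feedback.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-4, 1 - 1e-4, 1000)  # pass rate

# RL: one yes/no outcome per rollout -> at most the binary entropy.
h_rl = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# SL: you are shown the correct answer -> its surprisal under the model.
h_sl = -np.log2(p)

plt.plot(p, h_rl, label="RL: H(p) bits per sample")
plt.plot(p, h_sl, label="SL: -log2(p) bits per sample")
plt.xlabel("pass rate p")
plt.ylabel("bits gained per sample (upper bound)")
plt.ylim(0, 8)
plt.legend()
plt.show()
```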
Then I wanted to see if there was any empirical basis for this theory. So I asked Gemini to code an experiment to show whether the improvement in loss scales this way with pass rate. I just took the code that Gemini outputted and copy-pasted it into a Google Colab notebook, and I was able to run this toy ML experiment and visualize its results without a single bug.
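The Colab code itself isn't shown, so here is a hypothetical version of such a toy experiment: a softmax policy over K candidate answers, comparing the one-step improvement in loss from a REINFORCE update (binary reward on a sampled answer) against a supervised cross-entropy update (shown the correct answer), across initial pass rates. All parameters are illustrative assumptions, not the episode's actual setup:

```python
# Hypothetical reconstruction of the toy experiment -- not the code
# Gemini actually produced. One softmax "policy" over K answers;
# measure how much one gradient step improves -log p(correct),
# for RL-style vs. SL-style feedback, at several initial pass rates.
import numpy as np

rng = np.random.default_rng(0)
K, lr, trials = 100, 1.0, 2000

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_gain(p0, mode):
    # Build logits so the correct answer (index 0) has pass rate ~p0.
    z = np.zeros(K)
    z[0] = np.log(p0 * (K - 1) / (1 - p0))
    pi = softmax(z)
    if mode == "sl":
        # Cross-entropy gradient toward the known correct answer.
        grad = pi.copy()
        grad[0] -= 1.0
    else:
        # REINFORCE: sample an answer, reward 1 iff it is correct.
        a = rng.choice(K, p=pi)
        r = 1.0 if a == 0 else 0.0
        onehot = np.zeros(K)
        onehot[a] = 1.0
        grad = -r * (onehot - pi)  # gradient of -r * log pi(a)
    z_new = z - lr * grad
    # Improvement in the loss -log p(correct) from this single step.
    return -np.log(pi[0]) + np.log(softmax(z_new)[0])

for p0 in [0.001, 0.01, 0.1, 0.5]:
    rl = np.mean([one_step_gain(p0, "rl") for _ in range(trials)])
    sl = one_step_gain(p0, "sl")
    print(f"pass rate {p0:>5}: RL gain {rl:.4f}  SL gain {sl:.4f}")
```

At low pass rates the RL column collapses toward zero (most rollouts earn no reward and carry no signal), while the SL column stays roughly constant, pinned by the fixed learning rate rather than by the large surprisal of the rare correct answer.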
It's interesting, because the results looked similar, but not identical, to what we should have expected. So I downloaded the chart, put it into Gemini, and asked it: what is going on here?
And it came up with a hypothesis that I think is actually correct, which is that by having a fixed learning rate, we were capping how much supervised learning could improve in the beginning; in fact, we should decrease the learning rate over time. This actually gives us an intuitive understanding of why, in practice, we have learning-rate schedulers that decrease the learning rate over time.
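As a concrete reference point (a standard pattern, not the experiment's actual code), this is what such a decaying schedule looks like in PyTorch, with a placeholder model and step count:

```python
# Standard decaying-learning-rate pattern; model and sizes are placeholders.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Cosine decay: large steps early, ever smaller steps later.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # shrink the learning rate toward ~0 over T_max steps
```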
I did this entire flow, from coming up with a vague initial question, to building a theoretical understanding, to running some toy ML experiments, all with Gemini 3. This feels like the first model that can actually come up with new connections I wouldn't have anticipated. It's now become the default place I go when I want to brainstorm new ways to think about a problem. If you want to read more about RL scaling, you can check out the blog post that I wrote with a little help from Gemini 3. And if you want to check out Gemini 3 yourself, go to gemini.google.
I'm