WEBVTT

00:00.330 --> 00:02.970
-: Hello and welcome to this Python tutorial.

00:02.970 --> 00:05.640
All right, so in this new code section

00:05.640 --> 00:07.920
we are going to implement experience replay.

00:07.920 --> 00:09.480
So we're gonna make a new class

00:09.480 --> 00:11.430
which we will call replay memory

00:11.430 --> 00:14.477
and that will implement experience replay exactly

00:14.477 --> 00:16.860
as you saw in the intuition lectures.

00:16.860 --> 00:19.050
But first, let's give a quick reminder

00:19.050 --> 00:21.510
about what experience replay is.

00:21.510 --> 00:24.309
So you know, all this artificial intelligence is based

00:24.309 --> 00:26.378
on Markov decision processes

00:26.378 --> 00:29.130
and Markov decision processes consist

00:29.130 --> 00:31.950
of looking at a series of events.

00:31.950 --> 00:34.470
So the events are, you know, for example

00:34.470 --> 00:39.030
going from one state s_t to the next state, s_(t+1).

00:39.030 --> 00:40.980
But if the events were like that, well

00:40.980 --> 00:44.520
since the next state is very correlated to the current state

00:44.520 --> 00:46.980
well, the network would not be learning very well.

00:46.980 --> 00:49.650
So for those coming from the Deep Learning course,

00:49.650 --> 00:51.030
that's exactly the same

00:51.030 --> 00:54.428
as when we learned our time series with only one time step.

00:54.428 --> 00:56.340
It was not learning anything

00:56.340 --> 00:58.860
because one time step was not sufficient enough

00:58.860 --> 01:03.060
for a model to learn to understand long-term correlations.

01:03.060 --> 01:04.350
So that's the same here

01:04.350 --> 01:07.200
and that's why we have to implement experience replay.

01:07.200 --> 01:08.310
So how does it work?

01:08.310 --> 01:09.240
Well, that's very simple.

01:09.240 --> 01:12.180
Instead of only considering the current state,

01:12.180 --> 01:14.287
that is, only one state at time t,

01:14.287 --> 01:16.830
we are going to consider more in the past.

01:16.830 --> 01:20.400
So exactly like for our LSTMs. And therefore our series

01:20.400 --> 01:23.348
of events will not be s_t and s_(t+1).

01:23.348 --> 01:27.450
This will be, for example, the 100 states in the past.

01:27.450 --> 01:30.720
So s_(t-100), s_(t-99),

01:30.720 --> 01:33.990
up to s_(t-1), and then s_t.

01:33.990 --> 01:37.800
So in other words, we put the 100 last transitions

01:37.800 --> 01:39.810
into what we call the memory.

01:39.810 --> 01:42.150
And that way we have a long-term memory

01:42.150 --> 01:43.860
as opposed to a short-term memory

01:43.860 --> 01:46.350
or even should I say, an instant memory.

01:46.350 --> 01:48.600
And that makes the whole deep Q-learning process

01:48.600 --> 01:50.250
work much better.

01:50.250 --> 01:55.140
And then, once we create this memory of the last 100 events

01:55.140 --> 01:58.560
we will sample, that is we will take some random batches

01:58.560 --> 02:02.310
of these transitions to make our next update.

02:02.310 --> 02:06.150
That is our next move, by selecting the next action.
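
NOTE
A minimal sketch of the sampling idea described above, assuming the memory is a plain Python list and that a batch is drawn uniformly without replacement using random.sample. The placeholder transitions and the batch_size value are illustrative assumptions, not values from the lecture.
```python
import random
# Hypothetical stand-in transitions: (state, next_state) pairs for t = 0..99.
memory = [(t, t + 1) for t in range(100)]
# Assumed batch size; the lecture does not specify one here.
batch_size = 10
# Draw a random batch of distinct transitions from the memory.
batch = random.sample(memory, batch_size)
```
Because the batch is drawn at random, consecutive (highly correlated) transitions are unlikely to dominate any single batch.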

02:06.150 --> 02:09.150
And therefore, in this replay memory class that

02:09.150 --> 02:11.730
we're implementing for experience replay

02:11.730 --> 02:13.500
we will make three functions.

02:13.500 --> 02:15.690
First of all, the __init__ function as usual.

02:15.690 --> 02:17.490
That's the case for any class.

02:17.490 --> 02:20.910
And so in this __init__ function we will define the variables

02:20.910 --> 02:23.850
that will be attached to the future instances of the class.

02:23.850 --> 02:26.220
That is the future objects that will be created

02:26.220 --> 02:27.510
from this class.

02:27.510 --> 02:30.420
And so very simply, these variables will be the memory

02:30.420 --> 02:33.783
of the 100 transitions, the 100 events, and the capacity

02:33.783 --> 02:35.640
that is, the number 100.

02:35.640 --> 02:37.860
You will be welcome to try a longer memory

02:37.860 --> 02:39.750
by increasing the capacity.

02:39.750 --> 02:42.166
So that's the first function, the __init__ function.

02:42.166 --> 02:46.500
And then we'll make two other functions, one push function

02:46.500 --> 02:49.560
to make sure that the memory doesn't ever contain more

02:49.560 --> 02:51.180
than 100 transitions.

02:51.180 --> 02:53.430
And for this, we'll use the capacity by just

02:53.430 --> 02:55.470
doing one simple "if" condition

02:55.470 --> 02:58.740
and then finally we will make the sample function.

02:58.740 --> 03:02.040
And that will be of course to sample some transitions

03:02.040 --> 03:05.490
in this memory of the last 100 transitions.

03:05.490 --> 03:08.520
All right, so let's start by introducing the class.

03:08.520 --> 03:11.160
So as usual, we start with class

03:11.160 --> 03:12.570
And then we give a name to the class.

03:12.570 --> 03:15.753
So we call it ReplayMemory.

03:17.160 --> 03:22.160
And then in parentheses we input object, then colon.

03:22.560 --> 03:23.520
And then here we go.

03:23.520 --> 03:27.150
We start with the first function, the __init__ function.

03:27.150 --> 03:29.460
So that's exactly the same as before.

03:29.460 --> 03:33.390
We start with def, then two underscores, init,

03:33.390 --> 03:36.720
two underscores again, and then the variables.

03:36.720 --> 03:40.470
So there is of course self, which is the variable attached

03:40.470 --> 03:43.680
to the future instances of the class, the future objects.

03:43.680 --> 03:45.720
And then we're gonna have another variable

03:45.720 --> 03:49.200
for you to be able to try some other experience replay,

03:49.200 --> 03:50.640
some other memories.

03:50.640 --> 03:52.920
And that's gonna be the capacity.

03:52.920 --> 03:56.340
So this capacity will simply be the number 100

03:56.340 --> 03:57.990
because we're gonna make experience replay

03:57.990 --> 04:00.930
with the 100 last transitions.

04:00.930 --> 04:03.930
All right, and then, colon, and here we go.

04:03.930 --> 04:08.100
Let's go inside the function and let's define the variables

04:08.100 --> 04:10.410
of our replay memory objects.

04:10.410 --> 04:14.261
So, the first one will be self.capacity.

04:14.261 --> 04:16.200
(typing)

04:16.200 --> 04:17.880
And as you probably understood

04:17.880 --> 04:21.150
this will be the capacity that is the maximum number

04:21.150 --> 04:24.627
of transitions we want to have in our memory of events.

04:24.627 --> 04:25.830
And this will be equal

04:25.830 --> 04:30.000
to the argument we will input when creating an object

04:30.000 --> 04:31.920
of the replay memory class.

04:31.920 --> 04:34.500
And therefore that is capacity.

04:34.500 --> 04:36.660
That's the argument of the __init__ function.

04:36.660 --> 04:38.160
So capacity.

04:38.160 --> 04:43.050
So again, not to be confused, self.capacity is the name

04:43.050 --> 04:45.450
of the variable that is attached to the object

04:45.450 --> 04:48.930
and capacity here is the argument we will input

04:48.930 --> 04:52.830
when creating an object of the replay memory class.

04:52.830 --> 04:56.130
All right, and then we have a second variable.

04:56.130 --> 04:57.960
That's of course the memory.

04:57.960 --> 05:00.303
So self.memory.

05:01.740 --> 05:02.760
All right.

05:02.760 --> 05:05.940
And so what will this memory variable be equal to?

05:05.940 --> 05:10.830
Well, this memory is supposed to contain the last 100 events

05:10.830 --> 05:14.130
and therefore this should be a simple list.

05:14.130 --> 05:17.280
You know, a list which will contain the last 100 events,

05:17.280 --> 05:19.470
the last 100 transitions.

05:19.470 --> 05:22.230
And to initialize the list, nothing could be simpler.

05:22.230 --> 05:24.720
We just add some brackets like that.

05:24.720 --> 05:26.070
And here we go.

05:26.070 --> 05:27.780
Our memory is initialized.
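
NOTE
Put together, the class header and __init__ function dictated above might look like the following sketch. The CamelCase spelling ReplayMemory is an assumed rendering of the spoken class name, and the usage line at the end is only an illustration.
```python
class ReplayMemory(object):
    def __init__(self, capacity):
        # Maximum number of transitions to keep (100 in the lecture's example).
        self.capacity = capacity
        # Starts as an empty list; transitions are added as states are reached.
        self.memory = []
# Illustrative usage: a memory meant to hold at most 100 transitions.
memory = ReplayMemory(100)
```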

05:27.780 --> 05:30.300
So of course, at the beginning of the experiments

05:30.300 --> 05:32.790
or more precisely the beginning of the exploration

05:32.790 --> 05:34.590
the memory will be an empty list.

05:34.590 --> 05:37.020
And then we will put the transitions each time

05:37.020 --> 05:38.580
we reach a future state.

05:38.580 --> 05:41.490
And speaking of that, that's exactly what we'll do

05:41.490 --> 05:42.570
with the next function.

05:42.570 --> 05:44.820
That's what we're gonna call the push function.

05:44.820 --> 05:48.300
We will make this push function to append the events

05:48.300 --> 05:49.830
in this memory list.

05:49.830 --> 05:52.650
And then we'll use the capacity to make sure

05:52.650 --> 05:56.190
that this memory list always contains 100 events

05:56.190 --> 05:57.480
and never more.
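
NOTE
One possible shape for the push function, sketched under assumptions since the lecture defers the actual code to the next tutorial. The parameter name event and the choice to delete the oldest entry (index 0) when the list exceeds capacity are guesses at the "one simple if condition" mentioned earlier.
```python
class ReplayMemory(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
    def push(self, event):
        # Append the newest transition to the end of the list.
        self.memory.append(event)
        # Assumed capacity check: drop the oldest entry once over the limit.
        if len(self.memory) > self.capacity:
            del self.memory[0]
# Illustrative usage: push 150 dummy events into a memory of capacity 100.
memory = ReplayMemory(100)
for t in range(150):
    memory.push(t)
```
With this check the list grows until it reaches the capacity and then keeps only the most recent entries.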

05:57.480 --> 05:59.670
All right, so let's do this in the next tutorial.

05:59.670 --> 06:01.563
And until then, enjoy AI.