WEBVTT

00:00.570 --> 00:02.490
Instructor: Hello and welcome to this tutorial.

00:02.490 --> 00:04.080
Now we're gonna make the for loop

00:04.080 --> 00:05.970
that will compute the policy loss

00:05.970 --> 00:07.410
and the value loss.

00:07.410 --> 00:09.090
And once we have these two losses,

00:09.090 --> 00:11.100
we will be able to use our optimizer

00:11.100 --> 00:13.110
to apply stochastic gradient descent

00:13.110 --> 00:14.790
to reduce the losses.

00:14.790 --> 00:17.670
All right, so there we go. We start here.

00:17.670 --> 00:19.830
By the way, in the previous tutorial,

00:19.830 --> 00:21.330
we implemented this section

00:21.330 --> 00:23.280
and I forgot to remove the indent.

00:23.280 --> 00:24.390
Sorry about that.

00:24.390 --> 00:26.670
So starting from R here,

00:26.670 --> 00:28.500
is not in the for loop

00:28.500 --> 00:30.480
and now we're starting a new for loop.

00:30.480 --> 00:32.850
So I'm starting here with for.

00:32.850 --> 00:34.050
And now what we're gonna do,

00:34.050 --> 00:35.370
is we're gonna start from

00:35.370 --> 00:38.640
the last step that was done during the exploration

00:38.640 --> 00:40.470
and we're gonna move backward in time.

00:40.470 --> 00:44.880
So that's why here, I'm doing for i in reversed range of

00:47.160 --> 00:51.210
len of rewards, because rewards is a list.

00:51.210 --> 00:53.340
And since each step of the exploration

00:53.340 --> 00:54.870
is associated to a reward,

00:54.870 --> 00:57.270
because at each step we get reward,

00:57.270 --> 01:00.000
then len of rewards is this number of steps.

01:00.000 --> 01:01.830
And this reversed here is used

01:01.830 --> 01:04.260
so that we can move back in time.

01:04.260 --> 01:05.130
So there we go.

01:05.130 --> 01:06.990
And now, what we're gonna do,

01:06.990 --> 01:09.030
is update the cumulative reward.

01:09.030 --> 01:10.080
That is R.

01:10.080 --> 01:11.610
And we're gonna update it this way.

01:11.610 --> 01:12.630
That's actually the same

01:12.630 --> 01:14.190
as what we did for Doom.

01:14.190 --> 01:16.500
It will be equal to Gamma,

01:16.500 --> 01:18.330
which we get from our parameters.

01:18.330 --> 01:22.797
So I'm taking params first, params dot Gamma times R

01:24.060 --> 01:26.760
plus the reward of the step,

01:26.760 --> 01:29.490
which we can get by taking the list rewards,

01:29.490 --> 01:31.470
and taking the index i.

01:31.470 --> 01:34.050
So first this will be the reward of the last step.

01:34.050 --> 01:35.000
Then it'll be the reward

01:35.000 --> 01:37.020
of the previous step, et cetera.

01:37.020 --> 01:41.070
And each time we update R by multiplying it by Gamma,

01:41.070 --> 01:43.590
and then adding this reward at this step.

01:43.590 --> 01:44.790
And so by doing this,

01:44.790 --> 01:47.190
remember, we will get in the end,

01:47.190 --> 01:49.140
so I'm gonna write it as a comment,

01:49.140 --> 01:51.750
we will get R, the cumulative reward.

01:51.750 --> 01:52.830
That will be equal

01:52.830 --> 01:56.040
at the end of the loop to R zero,

01:56.040 --> 01:59.820
the reward of step zero, plus Gamma

01:59.820 --> 02:03.900
times R one, the reward of the first step,

02:03.900 --> 02:06.570
plus Gamma squared

02:06.570 --> 02:10.470
times R two, the reward of the second step,

02:10.470 --> 02:14.850
plus, dot, dot, dot, plus Gamma

02:14.850 --> 02:18.153
at the power of N minus one,

02:19.080 --> 02:22.800
times the reward obtained at step

02:22.800 --> 02:26.160
N minus one, where N is the number of steps.

02:26.160 --> 02:29.850
But then, be careful: at the end we will have Gamma

02:29.850 --> 02:32.523
at the power of the number of steps,

02:33.510 --> 02:37.110
times the value, the value of the V function

02:37.110 --> 02:40.110
applied to the last state.

02:40.110 --> 02:42.537
This is what we should get at the end.

02:42.537 --> 02:45.900
And we will get that because, remember, here

02:45.900 --> 02:48.480
we got this value of the last step,

02:48.480 --> 02:51.180
because this was done at the end of this for loop here.

02:51.180 --> 02:53.340
And so we got the value,

02:53.340 --> 02:56.400
and we set R to be equal to that value.

02:56.400 --> 02:58.320
So right now, R,

02:58.320 --> 03:00.450
at the beginning of this second for loop here,

03:00.450 --> 03:03.570
will be equal to this value of the last state,

03:03.570 --> 03:05.550
but then by doing this,

03:05.550 --> 03:07.260
this is what we'll get in the end.

03:07.260 --> 03:09.030
R equal R zero, plus Gamma,

03:09.030 --> 03:11.790
R one, plus Gamma squared R two plus Gamma

03:11.790 --> 03:13.230
at the power of N minus one

03:13.230 --> 03:16.560
times the reward at step N minus one, plus Gamma

03:16.560 --> 03:17.393
at the power of the number

03:17.393 --> 03:20.970
of steps times this value of the last state.

03:20.970 --> 03:23.580
So that's the main thing to understand

03:23.580 --> 03:26.070
in this computation of the cumulative reward.

03:26.070 --> 03:28.680
And that's why it is important to start

03:28.680 --> 03:32.070
from it by initializing R with the value here and

03:32.070 --> 03:36.810
doing this reversed for loop to get this final equation.
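
NOTE
The backward update of the cumulative reward described above can be sketched
in plain Python. This is only an illustration under assumptions: the function
name discounted_return and all the numbers are made up here, and the tutorial's
own code works on PyTorch variables rather than plain floats.
```python
def discounted_return(rewards, last_value, gamma):
    """Walk backward through the steps, applying R = gamma * R + rewards[i]."""
    R = last_value  # R is initialized with the value of the last state
    for i in reversed(range(len(rewards))):
        R = gamma * R + rewards[i]
    return R
# Hypothetical three-step exploration.
rewards = [1.0, 0.0, 2.0]
gamma = 0.5
last_value = 4.0
R = discounted_return(rewards, last_value, gamma)
# Closed form from the comment in the video:
# r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * V(last state)
closed_form = 1.0 + 0.5 * 0.0 + 0.25 * 2.0 + 0.125 * 4.0
```
With these made-up numbers the loop and the closed form should agree, which is
the reason R has to start from the value of the last state.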

03:36.810 --> 03:38.340
Perfect. And now, now

03:38.340 --> 03:41.250
that we have the right value for the cumulative reward

03:41.250 --> 03:43.830
well, we will compute the advantage.

03:43.830 --> 03:46.530
And the advantage here is just the advantage

03:46.530 --> 03:49.140
of getting this reward compared to the value.

03:49.140 --> 03:53.130
So I'm going to introduce a new variable, Advantage

03:53.130 --> 03:57.030
and therefore it will be equal to this cumulative reward

03:57.030 --> 04:01.590
minus the value of the V function obtained at the step I.

04:01.590 --> 04:06.590
So therefore that is R minus values i.

04:07.140 --> 04:09.660
Perfect. And now that we have the cumulative reward

04:09.660 --> 04:13.170
and the advantage, then we can get the value loss.

04:13.170 --> 04:15.210
This is the first loss we can get now.

04:15.210 --> 04:18.510
So we are gonna get our value loss variable

04:18.510 --> 04:20.940
and this will be updated the following way.

04:20.940 --> 04:24.780
Remember so far that the value loss was initialized to zero.

04:24.780 --> 04:28.410
And so we're gonna take the value loss again

04:28.410 --> 04:32.160
and add 0.5 times

04:32.160 --> 04:35.370
the square of the advantage, and we can get it this way.

04:35.370 --> 04:39.450
Advantage dot pow two.

04:39.450 --> 04:41.730
So that just means the square of the advantage,

04:41.730 --> 04:43.500
Advantage at the power of two.

04:43.500 --> 04:45.840
And that is exactly the value loss

04:45.840 --> 04:49.170
the loss generated by the predictions

04:49.170 --> 04:53.070
of the value of the V function output by the critic.

04:53.070 --> 04:54.330
And so it makes sense

04:54.330 --> 04:57.150
that this is the value loss because remember

04:57.150 --> 04:59.700
the advantage A of the action A

04:59.700 --> 05:01.530
and the state S is the difference

05:01.530 --> 05:04.830
between the Q value and the value of the V function.

05:04.830 --> 05:08.580
And so when we play the optimal action, well

05:08.580 --> 05:13.410
we get the stationary state with Q optimal

05:13.410 --> 05:18.410
of the optimal action A star played in the state S

05:18.420 --> 05:22.170
equals the optimal value V star of the state S.

05:22.170 --> 05:24.690
So it's quite intuitive to understand that,

05:24.690 --> 05:27.573
when the advantage is not equal to zero,

05:27.573 --> 05:30.210
then there will be a difference between these two

05:30.210 --> 05:33.360
and therefore that's how the loss is measured.
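
NOTE
A minimal sketch of the value loss accumulation described above, in plain
Python. The function name value_loss_from_rollout and the numbers are
assumptions made for illustration. It also assumes that values holds one more
entry than rewards, the extra one being the value of the last state.
```python
def value_loss_from_rollout(rewards, values, gamma):
    """Accumulate 0.5 * advantage^2 while moving backward in time."""
    R = values[-1]  # start from the value of the last state
    value_loss = 0.0  # the value loss is initialized to zero
    for i in reversed(range(len(rewards))):
        R = gamma * R + rewards[i]  # cumulative reward
        advantage = R - values[i]  # cumulative reward minus the critic's value
        value_loss = value_loss + 0.5 * advantage ** 2
    return value_loss
# Made-up rewards and critic outputs, one value per state plus the last state.
rewards = [1.0, 0.0, 2.0]
values = [1.0, 2.0, 3.0, 4.0]
loss = value_loss_from_rollout(rewards, values, gamma=0.5)
```
When the critic's value matches the cumulative reward at a step, the advantage
is zero and that step adds nothing to the loss.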

05:33.360 --> 05:36.300
Okay, so value loss computed.

05:36.300 --> 05:39.000
One loss down. We have now one more to go.

05:39.000 --> 05:40.260
It is the policy loss

05:40.260 --> 05:42.960
and that's exactly what we're going to compute right now.

05:42.960 --> 05:46.080
And to compute it, we need to consider again

05:46.080 --> 05:48.810
the generalized advantage estimation.

05:48.810 --> 05:50.700
Because to compute the policy loss

05:50.700 --> 05:53.640
we need the generalized advantage estimation.

05:53.640 --> 05:56.460
And to get the generalized advantage estimation

05:56.460 --> 05:59.790
we need first the temporal difference of the state values.

05:59.790 --> 06:03.210
So we have multiple things to compute here

06:03.210 --> 06:05.880
and we're gonna start with this temporal difference.

06:05.880 --> 06:07.740
Once we get the temporal difference

06:07.740 --> 06:10.440
we will get the generalized advantage estimation.

06:10.440 --> 06:13.140
And once we get the generalized advantage estimation

06:13.140 --> 06:14.880
we will get the policy loss.

06:14.880 --> 06:18.943
All right, so let's start with the temporal difference TD.

06:19.790 --> 06:22.530
So TD is equal to

06:22.530 --> 06:26.580
the reward of the step I,

06:26.580 --> 06:31.580
plus gamma, which we get thanks to our params list.

06:31.860 --> 06:36.860
So params dot gamma times the value of the step

06:37.500 --> 06:41.760
i plus one, and we add dot data to access it,

06:41.760 --> 06:46.083
minus the value of the step I.

06:46.920 --> 06:50.250
And same, we add dot data.

06:50.250 --> 06:52.650
All right, so that's the formula of the temporal difference

06:52.650 --> 06:54.150
of the state values.

06:54.150 --> 06:58.500
And now we can update the generalized advantage estimation

06:58.500 --> 07:01.770
and how is it updated? Well, we take our GAE

07:01.770 --> 07:05.820
and we multiply it by Gamma,

07:05.820 --> 07:08.790
params dot Gamma times tau,

07:08.790 --> 07:10.950
which we access with our parameters as well.

07:10.950 --> 07:14.190
So we take params dot tau

07:14.190 --> 07:18.480
and we add this temporal difference of the state values.

07:18.480 --> 07:19.860
So be careful.

07:19.860 --> 07:23.970
We are in a for loop and each time we multiply the GAE

07:23.970 --> 07:27.120
by Gamma and by tau, and we add the temporal difference.

07:27.120 --> 07:28.530
So it's important to understand

07:28.530 --> 07:29.940
that at the end of this loop

07:29.940 --> 07:34.940
well this generalized advantage estimation will be equal to

07:35.670 --> 07:39.240
the sum on all the steps

07:39.240 --> 07:44.160
of Gamma times tau at the power of i

07:44.160 --> 07:48.180
times the temporal difference at the step I.

07:48.180 --> 07:50.700
All right, so important to keep that in mind.
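
NOTE
The temporal difference and the GAE update described above can be sketched in
plain Python. The function name gae_from_rollout and the numbers are
assumptions for illustration, and values is assumed to hold one more entry
than rewards so that values[i + 1] exists at the last step.
```python
def gae_from_rollout(rewards, values, gamma, tau):
    """Backward recursion: gae = gae * gamma * tau + TD at step i."""
    gae = 0.0
    for i in reversed(range(len(rewards))):
        td = rewards[i] + gamma * values[i + 1] - values[i]  # temporal difference
        gae = gae * gamma * tau + td
    return gae
# Made-up rewards and state values.
rewards = [1.0, 0.0, 2.0]
values = [1.0, 2.0, 3.0, 4.0]
gamma, tau = 0.5, 0.5
gae = gae_from_rollout(rewards, values, gamma, tau)
# Sum over the steps of (gamma * tau)^i * TD(i), as stated in the video.
tds = [rewards[i] + gamma * values[i + 1] - values[i] for i in range(len(rewards))]
closed_form = sum((gamma * tau) ** i * tds[i] for i in range(len(tds)))
```
The value returned here is the GAE reached at step 0, once the loop has gone
through every step backward.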

07:50.700 --> 07:54.300
And now that we have the generalized advantage estimation

07:54.300 --> 07:55.860
and the temporal difference

07:55.860 --> 07:59.040
we can finally compute the policy loss.

07:59.040 --> 08:00.180
So let's do this.

08:00.180 --> 08:05.180
We are going to update our policy loss the following way

08:05.520 --> 08:08.880
by taking the old policy loss.

08:08.880 --> 08:13.590
And we subtract the log probabilities obtained

08:13.590 --> 08:16.710
at the step I that we multiply

08:16.710 --> 08:21.090
by this generalized advantage estimation that we have to put

08:21.090 --> 08:24.120
in a variable because then we will compute the gradients.

08:24.120 --> 08:27.150
So it has to be attached to gradients in the dynamic graph.

08:27.150 --> 08:32.150
And then we add minus 0.01 times the entropy,

08:33.180 --> 08:37.230
the entropy obtained at the step i in the for loop.

08:37.230 --> 08:39.510
And again, now be careful.

08:39.510 --> 08:42.540
This is the computation inside the for loop,

08:42.540 --> 08:44.820
which means that at the end of the for loop

08:44.820 --> 08:48.160
what we will get is policy loss

08:50.160 --> 08:54.250
equals minus sum over the steps

08:55.170 --> 08:58.830
of the product log of the policy

08:58.830 --> 09:03.830
at the step I times the generalized advantage estimation

09:04.170 --> 09:06.100
plus this 0.01

09:07.140 --> 09:08.970
times the entropy

09:08.970 --> 09:10.680
at the step I. So there we go.

09:10.680 --> 09:14.160
And now what is the policy at the step I well,

09:14.160 --> 09:15.900
that's the softmax probabilities

09:15.900 --> 09:18.510
of the actions and the entropy at the step I.

09:18.510 --> 09:21.420
Well, you know what it is, it's what we computed earlier

09:21.420 --> 09:22.920
and what we appended to the list.

09:22.920 --> 09:24.300
So we already have that.

09:24.300 --> 09:28.230
But this pi i here is the softmax probability

09:28.230 --> 09:30.150
of the actions.

09:30.150 --> 09:32.130
And why do we put a minus here?

09:32.130 --> 09:33.600
That's because the log

09:33.600 --> 09:37.170
of the probability and the entropy are negated values.

09:37.170 --> 09:40.140
And since we want to minimize their absolute value

09:40.140 --> 09:42.840
we must see this loss as a [inaudible] likelihood

09:42.840 --> 09:44.340
as opposed to a distance.

09:44.340 --> 09:47.310
You know, we want to maximize the probability

09:47.310 --> 09:51.480
of playing the action that will maximize the advantage.

09:51.480 --> 09:53.160
That's the whole idea behind it.

09:53.160 --> 09:55.020
We want to maximize the probability

09:55.020 --> 09:58.230
of playing the action that will maximize the advantage.

09:58.230 --> 10:00.510
And for those of you who might be wondering

10:00.510 --> 10:02.160
what is the purpose of this

10:02.160 --> 10:05.880
entropy coefficient that is this factor 0.01 here.

10:05.880 --> 10:08.370
Well, the purpose of it is just to prevent

10:08.370 --> 10:10.260
from falling too quickly

10:10.260 --> 10:13.440
into a trap where we have a distribution

10:13.440 --> 10:16.740
of probabilities with zeros for all the actions

10:16.740 --> 10:19.530
except one which has a probability of one.

10:19.530 --> 10:22.500
And if that happens, that would minimize the entropy.

10:22.500 --> 10:25.320
So that's why we're adding this small coefficient

10:25.320 --> 10:28.530
0.01 here that will make the entropy increase

10:28.530 --> 10:30.063
in the gradient descent.
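
NOTE
A minimal sketch of the policy loss update with its entropy term, in plain
Python. The names policy_loss_from_rollout, entropy, step_probs and actions, as
well as all the numbers, are assumptions for illustration. The tutorial's code
keeps the log probabilities and entropies as PyTorch variables so gradients
can flow; plain floats are used here only to show the arithmetic.
```python
import math
def entropy(probs):
    """Entropy of a probability distribution, skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
def policy_loss_from_rollout(log_probs, entropies, rewards, values, gamma, tau, entropy_coef=0.01):
    """Accumulate -log_prob * GAE - entropy_coef * entropy, moving backward."""
    policy_loss = 0.0
    gae = 0.0
    for i in reversed(range(len(rewards))):
        td = rewards[i] + gamma * values[i + 1] - values[i]
        gae = gae * gamma * tau + td
        policy_loss = policy_loss - log_probs[i] * gae - entropy_coef * entropies[i]
    return policy_loss
# Made-up softmax outputs for two steps and the action picked at each step.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
actions = [0, 1]
log_probs = [math.log(p[a]) for p, a in zip(step_probs, actions)]
entropies = [entropy(p) for p in step_probs]
loss = policy_loss_from_rollout(log_probs, entropies, rewards=[1.0, 0.0], values=[0.0, 0.0, 0.0], gamma=0.5, tau=1.0)
```
A distribution that puts probability one on a single action has zero entropy,
which is the trap the 0.01 coefficient is meant to delay: subtracting the
entropy from the loss rewards distributions that stay spread out.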

10:30.960 --> 10:33.270
Okay? So now the good news is

10:33.270 --> 10:35.460
that the most difficult part is done.

10:35.460 --> 10:36.750
We have the two losses

10:36.750 --> 10:39.000
and therefore what we only need to do now

10:39.000 --> 10:40.890
and we already know how to do it

10:40.890 --> 10:42.390
is to perform stochastic gradient

10:42.390 --> 10:45.150
descent to reduce these two losses.

10:45.150 --> 10:47.250
And so what we're gonna do now is

10:47.250 --> 10:51.360
get out of this loop and we're gonna take our optimizer,

10:51.360 --> 10:53.550
the one we made separately.

10:53.550 --> 10:55.890
Then remember, the first thing we have to do is to

10:55.890 --> 10:58.590
initialize all the gradient parameters to zero.

10:58.590 --> 11:00.660
And to do this, we add dot

11:00.660 --> 11:04.830
then the zero underscore grad method.

11:04.830 --> 11:06.960
All right? So that's done then.

11:06.960 --> 11:09.660
Now we're going to perform backward propagation

11:09.660 --> 11:10.743
but we're gonna give twice as much

11:10.743 --> 11:12.780
importance to the policy loss

11:12.780 --> 11:15.960
than the value loss because the policy loss is smaller.

11:15.960 --> 11:19.560
So to do this, we are gonna put in parenthesis

11:19.560 --> 11:22.267
policy underscore loss plus

11:23.323 --> 11:25.773
0.5 value loss.

11:26.820 --> 11:30.720
So 0.5 times the value loss.

11:30.720 --> 11:34.200
And we're gonna add here dot and we apply the

11:34.200 --> 11:38.490
backward method to perform backward propagation.

11:38.490 --> 11:40.950
And thanks to this trick here with the policy loss

11:40.950 --> 11:43.980
plus half of the value loss, we give twice as

11:43.980 --> 11:47.460
much importance to the policy loss than the value loss.

11:47.460 --> 11:50.160
Okay? Then we're gonna use another trick

11:50.160 --> 11:52.170
which is to prevent the gradient

11:52.170 --> 11:54.390
from taking extremely large values

11:54.390 --> 11:57.000
and therefore degenerate the algorithm.

11:57.000 --> 12:01.170
And the trick to do that is to get first our torch library

12:01.170 --> 12:02.490
then the NN module

12:02.490 --> 12:07.490
from the torch library, then the utils submodule.

12:07.500 --> 12:10.110
And now we're gonna use the function clip

12:10.110 --> 12:13.530
underscore grad underscore norm.

12:13.530 --> 12:16.540
And we are going to input our model parameters

12:17.850 --> 12:21.180
with a second input, which will be 40.

12:21.180 --> 12:23.460
And that trick will basically make sure

12:23.460 --> 12:26.310
that the gradients won't take extremely large values

12:26.310 --> 12:28.200
and degenerate the algorithm.

12:28.200 --> 12:29.790
And for those of you who might be wondering

12:29.790 --> 12:31.800
what this 40 is exactly

12:31.800 --> 12:33.960
well that just means that we're using this value

12:33.960 --> 12:38.670
so that the norm of the gradient stays between zero and 40

12:38.670 --> 12:40.560
and therefore that's how we prevent the gradient

12:40.560 --> 12:42.930
from taking too large values.
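
NOTE
To illustrate what clipping the gradient norm at 40 means, here is a plain
Python stand-in. It is not PyTorch's function: the name clip_by_total_norm, the
flat list of gradient values and the small 1e-6 term are assumptions made for
this sketch. In recent PyTorch releases the real function is spelled
torch.nn.utils.clip_grad_norm_, with a trailing underscore.
```python
import math
def clip_by_total_norm(grads, max_norm):
    """Rescale gradient values so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds max_norm; otherwise leave untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm
# A made-up gradient with norm 500 is rescaled to a norm just under 40.
clipped, norm_before = clip_by_total_norm([300.0, 400.0], max_norm=40.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
# A made-up gradient with norm 5 is already below 40 and is left as it is.
small, _ = clip_by_total_norm([3.0, 4.0], max_norm=40.0)
```
The direction of the gradient is kept; only its length is capped, which is how
the norm stays between zero and 40.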

12:42.930 --> 12:45.150
Okay, so now we're almost done.

12:45.150 --> 12:49.830
Remember we made this ensure shared grads function

12:49.830 --> 12:52.650
at the beginning of the file, which is to ensure

12:52.650 --> 12:56.880
that the agent and the shared model share the same gradient.

12:56.880 --> 12:59.130
And to do this, to make sure of it

12:59.130 --> 13:01.170
we can apply this function here.

13:01.170 --> 13:05.767
And so we are gonna add ensure shared grads

13:07.170 --> 13:10.410
to make sure that the model

13:10.410 --> 13:14.670
and the shared model share the same gradient.

13:14.670 --> 13:16.650
All right, so that's just a precaution.

13:16.650 --> 13:19.680
I'm not sure that's totally necessary, but you know

13:19.680 --> 13:22.020
at least we won't get any issue here.

13:22.020 --> 13:25.020
Okay? And finally, last line of code.

13:25.020 --> 13:29.310
We are of course going to perform the optimization step

13:29.310 --> 13:32.190
to reduce the losses and you know how to do it.

13:32.190 --> 13:35.070
Of course, we take our optimizer

13:35.070 --> 13:39.330
and we add dot step with parenthesis,

13:39.330 --> 13:43.530
and there we go. The training of our brains is over.

13:43.530 --> 13:44.760
So congratulations.

13:44.760 --> 13:47.040
I hope this wasn't too overwhelming.

13:47.040 --> 13:47.873
Don't worry.

13:47.873 --> 13:49.620
I will provide the code with all the comments.

13:49.620 --> 13:51.660
So if you missed any detail

13:51.660 --> 13:53.370
you can have a look at the comments.

13:53.370 --> 13:56.010
And don't worry if you haven't understood everything.

13:56.010 --> 13:57.390
This is very advanced.

13:57.390 --> 14:00.780
But, rest assured, this is also the most powerful.

14:00.780 --> 14:04.110
Remember, who is it made from? The creator of PyTorch.

14:04.110 --> 14:06.810
So, we are really working with the best here

14:06.810 --> 14:07.980
the state of the art.

14:07.980 --> 14:08.970
So it's totally normal

14:08.970 --> 14:11.220
if you didn't get everything the first time.

14:11.220 --> 14:13.770
But, by working on it many times

14:13.770 --> 14:16.470
you will definitely get more and more comfortable.

14:16.470 --> 14:19.200
So now we're done with the training.

14:19.200 --> 14:22.890
So basically we made all the most important things.

14:22.890 --> 14:26.130
You know, we made the brains by building the architectures

14:26.130 --> 14:28.920
of the neural networks with the convolutions, the LSTM

14:28.920 --> 14:30.540
and the fully connected layers.

14:30.540 --> 14:34.500
We trained this brain by making this train code here.

14:34.500 --> 14:37.290
So basically the heart of the algorithm is done.

14:37.290 --> 14:39.750
You made the A3C, congratulations.

14:39.750 --> 14:41.880
Now we have a few more things to do

14:41.880 --> 14:44.520
but that is just to get the fun part, you know

14:44.520 --> 14:48.030
we need to make this test dot py file

14:48.030 --> 14:50.790
which will test the agent

14:50.790 --> 14:54.120
and provide the videos of the AI playing Breakout.

14:54.120 --> 14:56.220
So this will be very fun to watch.

14:56.220 --> 14:59.880
We will not code all the lines of this test dot py file

14:59.880 --> 15:02.820
because as we said, we did the most important things,

15:02.820 --> 15:04.560
all related to A3C,

15:04.560 --> 15:07.140
but I will of course explain the code

15:07.140 --> 15:10.110
and eventually we have this main dot py file

15:10.110 --> 15:11.880
which will execute the code.

15:11.880 --> 15:14.190
And from the moment we execute this code

15:14.190 --> 15:15.990
all the codes will be generated.

15:15.990 --> 15:19.530
So the brains will be made, the training will happen

15:19.530 --> 15:22.170
and the AI will play new games of Breakout

15:22.170 --> 15:24.060
and we will get all the videos.

15:24.060 --> 15:26.580
So I can't wait to eventually watch them.

15:26.580 --> 15:27.540
We are gonna see

15:27.540 --> 15:29.970
if the AI is smart enough to catch the balls.

15:29.970 --> 15:33.810
So now I'm gonna see you in the next tutorial for this test

15:33.810 --> 15:37.290
dot py, so that we can test the AI on some new games.

15:37.290 --> 15:39.213
And until then, enjoy AI.