WEBVTT

00:01.080 --> 00:01.913
Hello and welcome back

00:01.913 --> 00:04.020
to the course on Artificial Intelligence.

00:04.020 --> 00:05.550
Today we're continuing our journey

00:05.550 --> 00:07.140
into the world of A3C

00:07.140 --> 00:11.190
and we're talking about the asynchronous side of A3C.

00:11.190 --> 00:12.480
So we have our abbreviation,

00:12.480 --> 00:14.400
Asynchronous Advantage Actor Critic,

00:14.400 --> 00:15.720
and today we are going to find out

00:15.720 --> 00:19.080
what asynchronous here stands for, what it means.

00:19.080 --> 00:21.000
And let's go back a step.

00:21.000 --> 00:23.820
Let's look at what we started this whole course with.

00:23.820 --> 00:25.350
We started with reinforcement learning

00:25.350 --> 00:26.850
and what it's all about:

00:26.850 --> 00:29.340
that the agent is in a certain state,

00:29.340 --> 00:30.720
they observe the state,

00:30.720 --> 00:32.610
they make certain decisions,

00:32.610 --> 00:34.440
they take actions in that state,

00:34.440 --> 00:36.750
and then the state is changed.

00:36.750 --> 00:39.150
So they get into a new state, plus they get a reward.

00:39.150 --> 00:41.040
So they get reward for taking that action,

00:41.040 --> 00:46.040
or some sort of reward, which could be a penalty as well.

00:46.050 --> 00:47.790
And they end up in a new state.

00:47.790 --> 00:49.950
And based on that, now they take another action.

00:49.950 --> 00:52.410
Again, they get reward and end up in a new state

00:52.410 --> 00:54.593
and they take another action and so on.

00:54.593 --> 00:57.177
And so that is the basis behind

00:57.177 --> 00:59.670
all of reinforcement learning.

00:59.670 --> 01:02.910
And that's what we've been using in Q-Learning,

01:02.910 --> 01:05.880
in Deep Q-Learning, in Deep Convolutional Q-Learning.

01:05.880 --> 01:07.785
And that has allowed our agents to beat

01:07.785 --> 01:10.770
gradually more and more complex environments.

01:10.770 --> 01:15.360
But now we're going to introduce an even better concept

01:15.360 --> 01:18.480
and take this to an even further level.

01:18.480 --> 01:22.890
What A3C introduces through this asynchronous element is,

01:22.890 --> 01:26.380
instead of having one agent attack the environment,

01:26.380 --> 01:31.170
you have three agents or whatever number of agents,

01:31.170 --> 01:34.260
so several agents attacking the same environment.

01:34.260 --> 01:37.050
And the key here, the reason it's called asynchronous,

01:37.050 --> 01:39.360
is that they're initialized differently.

01:39.360 --> 01:40.950
So their starting points are different.

01:40.950 --> 01:44.010
So for instance, as you'll see in the practical tutorials,

01:44.010 --> 01:46.560
you set a random seed and you set it differently

01:46.560 --> 01:48.000
for each of the agents.

01:48.000 --> 01:51.240
And that way, because their starting points are different,

01:51.240 --> 01:53.190
they're going to first go through environments

01:53.190 --> 01:54.023
in different ways

01:54.023 --> 01:55.800
and then they're gonna explore in different ways.

01:55.800 --> 01:57.270
And then in the next iterations,

01:57.270 --> 01:58.620
they're also gonna explore in different ways.
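
NOTE
A minimal sketch of the seeding idea above, in plain Python rather than the course's own code. The explore function, the number-line environment and the seed values 0, 1 and 2 are assumptions made up for illustration.
```python
import random
def explore(seed, steps=10):
    """Hypothetical worker: random left/right steps on a number line."""
    rng = random.Random(seed)  # each worker owns a generator with its own seed
    position = 0
    visited = [position]
    for _ in range(steps):
        position += rng.choice([-1, 1])  # stand-in for picking an action
        visited.append(position)
    return visited
# One trajectory per worker; the seed is the only thing that differs.
trajectories = {seed: explore(seed) for seed in (0, 1, 2)}
for seed, path in trajectories.items():
    print(seed, path)
```
Re-running with the same seed reproduces the same path, so any difference between workers comes from the seed alone.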

01:58.620 --> 02:01.200
And so for instance, if we have three agents,

02:01.200 --> 02:03.510
all of a sudden you're getting triple

02:03.510 --> 02:05.010
the amount of experience.

02:05.010 --> 02:06.810
Instead of just one agent going through

02:06.810 --> 02:08.430
and exploring an environment

02:08.430 --> 02:10.802
and trying to understand how to operate

02:10.802 --> 02:13.830
in that environment, you now have three,

02:13.830 --> 02:16.530
or however many of them, going through that

02:16.530 --> 02:17.790
and getting this experience.

02:17.790 --> 02:20.093
And so each one of them is learning

02:20.093 --> 02:21.900
through this bigger experience.

02:21.900 --> 02:23.361
And apart from

02:23.361 --> 02:25.830
just giving a broader range of experience,

02:25.830 --> 02:29.700
it also reduces the chances of one agent

02:29.700 --> 02:31.350
getting stuck in a local maximum.

02:31.350 --> 02:34.530
So for instance, say one agent finds a way to

02:34.530 --> 02:37.650
beat the environment which is not the optimal one,

02:37.650 --> 02:39.900
and if it deviates to the left or to the right

02:39.900 --> 02:42.260
from that solution that it found, it always

02:42.260 --> 02:43.470
gets more penalized.

02:43.470 --> 02:45.630
It might get stuck in that local maximum.

02:45.630 --> 02:46.830
It might just keep doing that

02:46.830 --> 02:48.150
thinking that that's the optimal solution,

02:48.150 --> 02:49.680
when it's actually not.

02:49.680 --> 02:54.120
Well, the likelihood of several agents getting stuck

02:54.120 --> 02:57.997
in that same local maximum

02:57.997 --> 02:59.760
decreases with the number of agents.

02:59.760 --> 03:02.790
So the probability of one agent getting stuck

03:02.790 --> 03:05.490
in a certain local maximum might be high,

03:05.490 --> 03:08.130
or in any case might be a certain value.

03:08.130 --> 03:09.630
But the probability, when you have three of them,

03:09.630 --> 03:10.950
of all three of them getting stuck

03:10.950 --> 03:12.960
in that local maximum is much lower.
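
NOTE
A back-of-the-envelope version of that argument, assuming (hypothetically) that each agent gets stuck independently and with the same probability. Agents that share what they learn are not truly independent, so this is intuition only. The name prob_all_stuck and the value 0.5 are illustrative.
```python
def prob_all_stuck(p_single, n_agents):
    """Chance that every one of n independent agents gets stuck."""
    return p_single ** n_agents
for n_agents in (1, 3, 8):
    print(n_agents, prob_all_stuck(0.5, n_agents))
```
Under that assumption the probability shrinks geometrically as agents are added.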

03:12.960 --> 03:16.770
And as long as they share experience between each other

03:16.770 --> 03:17.970
they can help each other out.

03:17.970 --> 03:19.800
So if one of them gets stuck, for instance

03:19.800 --> 03:21.030
it's stuck in a local maximum,

03:21.030 --> 03:23.250
it simply thinks that's the best one,

03:23.250 --> 03:24.420
that's the best solution

03:24.420 --> 03:25.830
all the time and keeps doing that.

03:25.830 --> 03:28.140
Well, as long as it interacts with the other agents,

03:28.140 --> 03:30.810
so let's say this guy gets stuck in a local maximum

03:30.810 --> 03:32.430
as long as it interacts with the other agents

03:32.430 --> 03:35.130
through the way we build our whole algorithm,

03:35.130 --> 03:37.620
A3C algorithm, they will help him out.

03:37.620 --> 03:39.240
They will give him knowledge that actually no,

03:39.240 --> 03:40.350
hey, you should explore this

03:40.350 --> 03:44.197
and he will be more likely to get out of that.

03:44.197 --> 03:47.250
And also overall, the environment will know that,

03:47.250 --> 03:49.530
hey, even though this is a great maximum,

03:49.530 --> 03:51.660
these other agents have seen better options

03:51.660 --> 03:53.010
and we should keep exploring

03:53.010 --> 03:55.200
because it looks like there are better options.

03:55.200 --> 03:56.850
So in a very short,

03:56.850 --> 03:58.470
kind of rough, intuitive understanding,

03:58.470 --> 04:00.309
those are some of the advantages

04:00.309 --> 04:02.550
of having these asynchronous agents.

04:02.550 --> 04:04.200
First of all, you have more experience

04:04.200 --> 04:06.180
to choose from and to learn from.

04:06.180 --> 04:08.460
You could get to the solution faster.

04:08.460 --> 04:11.760
And generally speaking, there's a lesser chance

04:11.760 --> 04:16.650
of getting stuck in a certain local maximum.

04:16.650 --> 04:18.307
So let's see how this all plays out

04:18.307 --> 04:20.692
in this model that we've built so far.

04:20.692 --> 04:23.700
As you remember, this is what we've gotten so far

04:23.700 --> 04:25.170
through the actor-critic.

04:25.170 --> 04:26.760
And this is where it all ties in.

04:26.760 --> 04:29.490
So far, as you remember from the previous tutorial,

04:29.490 --> 04:30.960
we did introduce this, you know,

04:30.960 --> 04:33.720
we had this already even in Deep Convolutional Q-Learning.

04:33.720 --> 04:36.768
We've just named it the actor now, and we've introduced the critic,

04:36.768 --> 04:38.760
but so far it doesn't really make sense.

04:38.760 --> 04:40.170
What's the point of having this critic

04:40.170 --> 04:42.090
and measuring the value of the state

04:42.090 --> 04:44.280
or predicting the value of the state

04:44.280 --> 04:46.470
using the same neural network,

04:46.470 --> 04:48.570
this same approach?

04:48.570 --> 04:50.340
But now this is the part

04:50.340 --> 04:52.410
where it's gonna start making more sense.

04:52.410 --> 04:54.870
What we're going to do is we're going to replicate this

04:54.870 --> 04:56.910
because now we have multiple agents.

04:56.910 --> 04:58.410
So with multiple agents,

04:58.410 --> 04:59.400
this is what it would look like.

04:59.400 --> 05:02.850
So the first way of imagining it

05:02.850 --> 05:05.190
is now we have these three things.

05:05.190 --> 05:06.930
Well remember what we said

05:06.930 --> 05:09.240
about them sharing the experience between each other.

05:09.240 --> 05:11.370
So actually, right now

05:11.370 --> 05:12.270
they're all independent.

05:12.270 --> 05:14.370
You have one playing the game, another one playing the game

05:14.370 --> 05:15.390
another one playing the game.

05:15.390 --> 05:18.090
It's like launching your agent

05:18.090 --> 05:19.500
on three different computers.

05:19.500 --> 05:21.000
You put three different computers next to each other

05:21.000 --> 05:23.040
and you launch them and you know, that's great.

05:23.040 --> 05:26.070
Indeed, that way

05:26.070 --> 05:28.590
you'll get more experience, you'll get more variety,

05:28.590 --> 05:29.880
especially if they're initialized differently.

05:29.880 --> 05:31.950
So we're gonna assume from here on that they're

05:31.950 --> 05:33.030
always initialized differently.

05:33.030 --> 05:35.580
Even though we have the same picture here, we are going to

05:35.580 --> 05:37.920
know that they're actually initialized differently.

05:37.920 --> 05:40.440
So it's not gonna be like identical training,

05:40.440 --> 05:43.538
identical learning from this game.

05:43.538 --> 05:45.660
And so if you put three computers

05:45.660 --> 05:47.400
side by side and you launch them, yes,

05:47.400 --> 05:49.740
you're gonna have more experience

05:49.740 --> 05:52.227
because you're gonna have three agents playing

05:52.227 --> 05:56.603
and also you're going to have a bigger variety

05:56.603 --> 06:00.120
of possible solutions. So that's true.

06:00.120 --> 06:01.650
But the problem is that they're not sharing

06:01.650 --> 06:02.670
that experience among each other.

06:02.670 --> 06:04.110
They're not learning from each other.

06:04.110 --> 06:06.810
So they don't have that synergy,

06:06.810 --> 06:08.400
they don't have the advantage,

06:08.400 --> 06:10.680
or the extra power, that they would get

06:10.680 --> 06:11.513
if they were cooperating.

06:11.513 --> 06:12.660
You know how, if you have

06:12.660 --> 06:14.730
a team of people,

06:14.730 --> 06:17.910
they work better together than each one of them separately.

06:17.910 --> 06:21.210
So here, you've got 1+1+1, and it's three.

06:21.210 --> 06:23.970
But in a team 1+1+1 is not three, it's like 33

06:23.970 --> 06:26.040
because they leverage each other's strengths

06:26.040 --> 06:28.380
and mitigate each other's weaknesses.

06:28.380 --> 06:29.250
And same thing here.

06:29.250 --> 06:31.260
So if you put these three computers side by side,

06:31.260 --> 06:32.850
yes, you'll have more experience,

06:32.850 --> 06:34.680
more variety, and possibly one of them will

06:34.680 --> 06:36.270
get to a better solution than the others.

06:36.270 --> 06:37.800
That's great, but it'll be even better

06:37.800 --> 06:39.750
if they start sharing that experience.

06:39.750 --> 06:41.130
And how do they do that?

06:41.130 --> 06:43.980
Well, it's through this V that we calculated.

06:43.980 --> 06:47.640
So this V value, that's the output of our network

06:47.640 --> 06:49.560
is actually like that.

06:49.560 --> 06:52.950
So they have this same V.

06:52.950 --> 06:55.920
So all the time, all these agents

06:55.920 --> 06:58.050
are contributing to the same critic.

06:58.050 --> 07:01.260
They don't have separate critics, they have a common critic.

07:01.260 --> 07:03.330
And that's the key of how

07:03.330 --> 07:06.270
the actor-critic ties in with the asynchronous part.

07:06.270 --> 07:08.400
So there's one critic that's watching

07:08.400 --> 07:09.870
as they get experience.

07:09.870 --> 07:12.060
So how do we calculate the V?

07:12.060 --> 07:14.940
Well, as you remember,

07:14.940 --> 07:17.610
we calculate the V through the values that we get,

07:17.610 --> 07:20.760
so the rewards that we get from the environment.

07:20.760 --> 07:25.620
And so as the agents explore their environment,

07:25.620 --> 07:28.320
they are calculating, they're predicting the V

07:28.320 --> 07:30.720
plus they have the V that they can calculate.

07:30.720 --> 07:32.430
This all ties back

07:32.430 --> 07:33.720
into what we've already discussed

07:33.720 --> 07:35.209
in the previous sections of this course.

07:35.209 --> 07:40.140
So they already have a V that they can predict,

07:40.140 --> 07:42.630
or expect, through the rewards

07:42.630 --> 07:44.591
that they know exist in this maze

07:44.591 --> 07:47.190
and that they've already explored.

07:47.190 --> 07:48.023
And as they explore them,

07:48.023 --> 07:49.620
of course, that value can change.

07:49.620 --> 07:51.342
But they also have the V that

07:51.342 --> 07:53.460
is the output of the neural network.

07:53.460 --> 07:55.650
So as they're going through this,

07:55.650 --> 07:58.201
they're going to be adjusting their neural networks

07:58.201 --> 08:01.530
in order to better match that expected V.
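
NOTE
A minimal sketch of the two V values just described: one computed from rewards, one predicted by the network. It assumes the computed V is a discounted sum of rewards and that the mismatch is scored with a squared error. The discount factor gamma and both function names are assumptions for illustration, not the course's code.
```python
def discounted_return(rewards, gamma=0.99):
    """V computed from the rewards the agent actually collected."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total  # fold in rewards from last to first
    return total
def value_loss(predicted_value, rewards, gamma=0.99):
    """Squared gap between the network's V and the V computed from rewards."""
    target = discounted_return(rewards, gamma)
    return 0.5 * (target - predicted_value) ** 2
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(value_loss(0.75, [1.0, 1.0, 1.0], gamma=0.5))
```
Training would adjust the network to shrink this loss, which is what matching the expected V amounts to in this sketch.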

08:01.530 --> 08:03.660
So basically, this is shared.

08:03.660 --> 08:06.861
The critic part is shared between the agents

08:06.861 --> 08:10.590
and that is how they share the information

08:10.590 --> 08:11.423
between each other.

08:11.423 --> 08:14.220
That's how they are able to, kind of,

08:14.220 --> 08:15.510
see what's going on in the environment,

08:15.510 --> 08:17.550
share it with each other and then use that,

08:17.550 --> 08:20.940
as we'll see further in the next part, on the advantage,

08:20.940 --> 08:23.461
in order to optimize

08:23.461 --> 08:25.710
how they're behaving in that environment.

08:25.710 --> 08:29.550
And the other thing to note here is, so this was A3C

08:29.550 --> 08:32.453
This is like the core of A3C up to here.

08:32.453 --> 08:36.420
This is one version of A3C, but there's actually

08:36.420 --> 08:39.672
an even better implementation of this A3C

08:39.672 --> 08:42.499
which you'll actually hear Hadelin talk about

08:42.499 --> 08:45.330
in one of the first tutorials

08:45.330 --> 08:46.890
in the practical side of things.

08:46.890 --> 08:48.630
And what he'll be talking about

08:48.630 --> 08:51.212
is how the creator of PyTorch

08:51.212 --> 08:54.660
actually made an adjustment to some of the code

08:54.660 --> 08:58.380
that was shared on GitHub.

08:58.380 --> 09:00.450
As you can see right now, they have separate neural networks

09:00.450 --> 09:02.280
and they only share the V.

09:02.280 --> 09:03.990
The adjustment that was made was

09:03.990 --> 09:06.510
actually to take all of these neural networks

09:06.510 --> 09:09.330
and put them into one, take them and put them together.

09:09.330 --> 09:11.670
So ultimately, there's only one neural network

09:11.670 --> 09:15.060
here shared among the agents.

09:15.060 --> 09:18.998
So before, each one of them had one neural network

09:18.998 --> 09:21.720
which was shared for the actor and for the critic.

09:21.720 --> 09:23.760
One neural network shared for the actor, for the critic,

09:23.760 --> 09:25.770
one neural network shared for the actor, for the critic.

09:25.770 --> 09:27.840
Now they all have one neural network,

09:27.840 --> 09:29.760
which is shared for the actor-critic,

09:29.760 --> 09:31.980
actor-critic, actor-critic.

09:31.980 --> 09:34.788
And then the critic is here, in common.

09:34.788 --> 09:37.560
So let's move these pictures

09:37.560 --> 09:39.960
to the left here, to make some space.

09:39.960 --> 09:43.200
And this is basically the architecture,

09:43.200 --> 09:44.760
or the structure,

09:44.760 --> 09:47.839
that we are going to be using in the practical tutorials.

09:47.839 --> 09:51.810
I know that this may sound a bit overwhelming

09:51.810 --> 09:53.370
at this stage, but we've got

09:53.370 --> 09:55.830
one more A to talk about, which is the advantage.

09:55.830 --> 09:57.871
And there we'll see a bit better,

09:57.871 --> 10:00.360
in action, how this is going.

10:00.360 --> 10:02.910
So we'll talk about the intuition in action there.

10:02.910 --> 10:05.460
But generally speaking, this is what it is.

10:05.460 --> 10:09.840
There's one network which each of the agents uses.

10:09.840 --> 10:11.370
So they share, basically what that means

10:11.370 --> 10:13.110
is that they share the weights.

10:13.110 --> 10:15.540
The weights of the network are shared

10:15.540 --> 10:16.920
between agents and when they update it,

10:16.920 --> 10:19.647
they update the whole network, not just their own network.

10:19.647 --> 10:22.314
And then they have outputs, they have

10:22.314 --> 10:25.260
these actions for each agent

10:25.260 --> 10:26.910
and then they have the critic that is shared

10:26.910 --> 10:27.743
which is gonna be monitored.
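
NOTE
A toy sketch of several workers updating one shared set of parameters, written with plain Python threads. A single shared number stands in for the whole network's weights. SharedCritic, worker, the learning rate and the random stand-in returns are all assumptions for illustration. Workers push updates whenever they are ready, without waiting for one another, which is what "asynchronous" usually refers to in A3C.
```python
import random
import threading
class SharedCritic:
    """Hypothetical shared value estimate that every worker updates."""
    def __init__(self, learning_rate=0.1):
        self.value = 0.0
        self.updates = 0
        self.learning_rate = learning_rate
        self._lock = threading.Lock()
    def update(self, observed_return):
        # Move the shared V a small step toward what one worker observed.
        with self._lock:  # lock keeps this toy example's bookkeeping exact
            self.value += self.learning_rate * (observed_return - self.value)
            self.updates += 1
def worker(critic, seed, steps):
    rng = random.Random(seed)  # different seed per worker
    for _ in range(steps):
        observed_return = rng.uniform(0.5, 1.5)  # stand-in for a real return
        critic.update(observed_return)
critic = SharedCritic()
threads = [threading.Thread(target=worker, args=(critic, seed, 100)) for seed in (0, 1, 2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(critic.updates, round(critic.value, 3))
```
Every update lands on the same object, so what one worker learns is immediately visible to the others.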

10:27.743 --> 10:29.913
So I know this is

10:31.080 --> 10:32.850
a lot of stuff right now,

10:32.850 --> 10:35.550
but hopefully it's slowly coming together.

10:35.550 --> 10:37.170
At least the main takeaway from here

10:37.170 --> 10:39.690
is that the critic, because it's shared,

10:39.690 --> 10:44.280
that's how the agents are able to make sure

10:44.280 --> 10:46.530
that they're cooperating together

10:46.530 --> 10:48.840
in order to get to the result much faster.

10:48.840 --> 10:50.984
And then in the next tutorial we'll see even further

10:50.984 --> 10:53.534
how all of this adds up, how this comes together.

10:53.534 --> 10:57.960
And for now, we would like to recommend

10:57.960 --> 11:01.200
some additional reading.

11:01.200 --> 11:06.200
So this is a blog by Jaromir Janisch,

11:06.840 --> 11:08.640
it's called "Let's Make an A3C: Implementation."

11:08.640 --> 11:11.820
There are actually two parts, implementation and theory.

11:11.820 --> 11:14.190
There's the link and it's very similar

11:14.190 --> 11:16.302
to what Hadelin will be implementing

11:16.302 --> 11:19.410
in the practical side of the tutorial.

11:19.410 --> 11:23.010
So it's not specifically

11:23.010 --> 11:24.600
just for this tutorial,

11:24.600 --> 11:28.230
but it's for this whole section, and you can find

11:28.230 --> 11:30.579
some additional information, some additional insights there.

11:30.579 --> 11:33.360
And so that's why we're bringing it up here.

11:33.360 --> 11:35.220
But nevertheless, in the next tutorial

11:35.220 --> 11:38.040
we're going to start pulling all of this together,

11:38.040 --> 11:39.180
everything we've discussed so far.

11:39.180 --> 11:40.530
And I look forward to seeing you next time.

11:40.530 --> 11:42.363
And until then, enjoy AI.
