WEBVTT

00:00.600 --> 00:01.433
-: Hello, and welcome back

00:01.433 --> 00:04.050
to the course on artificial intelligence.

00:04.050 --> 00:05.430
I hope you're enjoying the course so far.

00:05.430 --> 00:09.030
And today we're talking about action selection policies.

00:09.030 --> 00:11.010
All right, let's dive straight into it.

00:11.010 --> 00:14.460
Previously, we talked about adding a neural network

00:14.460 --> 00:15.990
to our simple Q learning

00:15.990 --> 00:20.990
and so far we are getting quite into deep Q learning.

00:21.210 --> 00:24.390
We've talked about the learning part quite a bit

00:24.390 --> 00:26.640
including adding some elements to it.

00:26.640 --> 00:28.860
And today we're talking about this part,

00:28.860 --> 00:30.000
we're talking about the acting.

00:30.000 --> 00:31.290
So let's have a look.

00:31.290 --> 00:34.650
So here we've got what we discussed about the acting

00:34.650 --> 00:38.490
that once you input the values, the parameters

00:38.490 --> 00:40.080
or the vector describing the state

00:40.080 --> 00:42.480
the agent is currently in that environment,

00:42.480 --> 00:45.570
then that is after all the learning is done

00:45.570 --> 00:47.370
or even before the learning is done,

00:47.370 --> 00:49.530
basically we get all the Q values.

00:49.530 --> 00:51.120
So we're not interested in the learning right now.

00:51.120 --> 00:52.020
We're interested in acting.

00:52.020 --> 00:53.970
So once we have these Q values,

00:53.970 --> 00:57.330
how do we understand which one we need to use?

00:57.330 --> 01:00.690
Well, if you think about it, Q values are simply,

01:00.690 --> 01:01.920
these are the predictions for the Q values.

01:01.920 --> 01:05.490
So as we did in the simple Q learning algorithm,

01:05.490 --> 01:06.323
what did we do?

01:06.323 --> 01:09.180
We just selected the one with the best,

01:09.180 --> 01:10.410
with the highest Q value.

01:10.410 --> 01:12.540
Once we have the one with the highest Q value,

01:12.540 --> 01:13.680
we just take that action

01:13.680 --> 01:16.380
because it just brings us the highest Q value

01:16.380 --> 01:18.000
and we know how the Q value is calculated:

01:18.000 --> 01:20.370
it's the immediate reward that we expect to receive

01:20.370 --> 01:23.100
plus the decay factor times the value of the next state.

01:23.100 --> 01:24.780
And it's a recursive calculation.
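
NOTE
The recursive calculation just described can be written out as a formula. This is a sketch under assumed notation that does not appear in the lecture: s is the current state, a the action, r the immediate reward, gamma the decay factor, s' the next state, and the value of the next state is taken to be the best Q value available there.
```latex
Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')
```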

01:24.780 --> 01:25.613
So why not?

01:25.613 --> 01:28.410
Why wouldn't you take the best Q value?

01:28.410 --> 01:30.840
And that's kind of the end of it.

01:30.840 --> 01:32.970
But as you can see here, it's not as simple.

01:32.970 --> 01:34.590
Here, we're using a softmax function

01:34.590 --> 01:35.970
and this is where we're going to talk

01:35.970 --> 01:37.920
about action selection policies.

01:37.920 --> 01:39.060
So here, in reality,

01:39.060 --> 01:40.864
we don't have to have just a softmax function.

01:40.864 --> 01:44.850
We can have different action selection policies.

01:44.850 --> 01:49.470
For example, we've got epsilon greedy, epsilon soft

01:49.470 --> 01:50.850
and we've got the softmax.

01:50.850 --> 01:53.280
And those are kind of like the most commonly

01:53.280 --> 01:54.960
used action selection policies.

01:54.960 --> 01:56.310
Of course, there are others.

01:56.310 --> 01:58.560
For instance, the most basic one is,

01:58.560 --> 02:00.600
here's a very simple action selection policy.

02:00.600 --> 02:03.990
Just select the best one, the one with the highest Q value.
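
NOTE
A minimal sketch of the "just take the highest Q value" policy described here. The function name and the example Q values are hypothetical and chosen only for illustration; index 1 plays the role of Q2.
```python
def greedy_action(q_values):
    """Return the index of the action with the highest predicted Q value."""
    # Ties go to the lowest index, because max() keeps the first maximum it sees.
    return max(range(len(q_values)), key=lambda i: q_values[i])
# Hypothetical Q values for four actions (assumption, not from the lecture).
q_values = [1.2, 3.4, -0.5, 0.8]
best = greedy_action(q_values)
```
This policy never explores, which is the weakness the rest of the lecture addresses.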

02:03.990 --> 02:06.330
But why doesn't that action policy fly

02:06.330 --> 02:09.210
and why do we have different types of action policy,

02:09.210 --> 02:10.530
action selection policies?

02:10.530 --> 02:15.530
Well, it all boils down to exploration versus exploitation.

02:15.540 --> 02:19.680
And that is the core of reinforcement learning

02:19.680 --> 02:21.898
because we've already talked about this a little bit

02:21.898 --> 02:25.002
that your agent, when it's operating in an environment,

02:25.002 --> 02:27.420
it might predict certain Q values,

02:27.420 --> 02:29.040
which might be good.

02:29.040 --> 02:31.860
And it might not turn out great,

02:31.860 --> 02:33.570
it might turn out that those Q values are bad

02:33.570 --> 02:34.980
and it'll be forced to explore.

02:34.980 --> 02:36.990
So if we, for instance in this case

02:36.990 --> 02:40.050
predict that Q2 is the best one, and then it takes Q2,

02:40.050 --> 02:41.193
takes action two,

02:42.510 --> 02:44.040
so from here it takes action two

02:44.040 --> 02:46.890
and then it gets a very negative reward.

02:46.890 --> 02:50.460
Then the environment is forcing the agent to go and explore

02:50.460 --> 02:51.630
because now it's gonna learn

02:51.630 --> 02:54.870
that, Oh, actually I thought Q2 is gonna be very good

02:54.870 --> 02:56.820
but it turned out very bad.

02:56.820 --> 02:58.380
So the result turned out very bad.

02:58.380 --> 02:59.910
So the network's gonna update itself.

02:59.910 --> 03:01.380
So next time he's in the state,

03:01.380 --> 03:02.310
he's gonna probably,

03:02.310 --> 03:04.088
he might still choose Q2 if it was,

03:04.088 --> 03:06.810
like if it was very, very favorable.

03:06.810 --> 03:09.180
So you might think that's like,

03:09.180 --> 03:12.480
he might need a couple of times, a couple of penalties,

03:12.480 --> 03:15.000
punishments in order to learn that Q2 is a bad action.

03:15.000 --> 03:17.340
But maybe he'll already soon learn

03:17.340 --> 03:18.540
that, Okay I'm gonna take a different action,

03:18.540 --> 03:20.040
I'm gonna take this action

03:20.040 --> 03:22.140
because now it has the best Q value.

03:22.140 --> 03:24.553
So sometimes the environment forces the agent

03:24.553 --> 03:27.600
to take different, to explore different actions

03:27.600 --> 03:30.318
but sometimes the agent might get,

03:30.318 --> 03:33.540
find itself stuck in a local maximum.

03:33.540 --> 03:36.060
It might find that it found,

03:36.060 --> 03:37.950
like through its initial exploration,

03:37.950 --> 03:39.930
it found that, oh this is a pretty cool action.

03:39.930 --> 03:42.150
Like, I'm going to go right here.

03:42.150 --> 03:43.584
And that's pretty cool action

03:43.584 --> 03:47.310
but the problem is that it thinks it's the best action

03:47.310 --> 03:49.080
simply because it hasn't explored,

03:49.080 --> 03:50.040
it's explored going up,

03:50.040 --> 03:51.810
it's explored going left,

03:51.810 --> 03:52.890
it's explored going right

03:52.890 --> 03:54.900
but it hasn't explored going down

03:54.900 --> 03:57.570
from that specific state that it's in.

03:57.570 --> 04:00.960
And now that it's kind of like biased towards this action

04:00.960 --> 04:02.430
and thinks it's a good action, it's gonna keep taking,

04:02.430 --> 04:04.830
it's gonna keep getting, it's gonna keep taking this action,

04:04.830 --> 04:06.600
it's gonna keep getting a good reward

04:06.600 --> 04:10.380
but what if this action would've been even better

04:10.380 --> 04:13.314
if this action would've been so much better

04:13.314 --> 04:15.900
that if it knew about this action,

04:15.900 --> 04:17.340
it would actually switch to this action.

04:17.340 --> 04:19.830
But because it got stuck in a local maximum

04:19.830 --> 04:21.360
and it's getting these good rewards,

04:21.360 --> 04:23.610
it's just going to be reinforced.

04:23.610 --> 04:25.710
This is going to keep reinforcing itself that

04:25.710 --> 04:27.150
or the environment's going to reinforce it

04:27.150 --> 04:29.550
that this is a good action to take, keep doing that.

04:29.550 --> 04:32.460
But the reality is that there's this other action

04:32.460 --> 04:33.810
that it hasn't found yet

04:33.810 --> 04:37.050
or hasn't even explored that would've been much better.
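
NOTE
The local maximum problem can be reproduced in a few lines. This is a toy sketch, not the course code: a single state with two actions and fixed rewards (1 and 5), Q estimates that start at zero, and a purely greedy agent. All names, rewards and the learning rate are assumptions made for the example.
```python
def run_greedy(rewards, steps=100, alpha=0.5):
    """Always take the action with the highest current Q estimate."""
    q = [0.0] * len(rewards)       # initial Q estimates
    counts = [0] * len(rewards)    # how often each action was taken
    for _ in range(steps):
        a = max(range(len(q)), key=lambda i: q[i])  # ties go to the lowest index
        counts[a] += 1
        q[a] += alpha * (rewards[a] - q[a])         # move the estimate toward the reward
    return q, counts
# Action 0 pays 1, action 1 pays 5, but the greedy agent tries action 0 first
# and is then rewarded often enough that it never looks at action 1.
q, counts = run_greedy([1.0, 5.0])
```
After the run, counts is [100, 0]: the better action is never tried, so its estimate stays at zero.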

04:37.050 --> 04:38.220
And so what we want to do,

04:38.220 --> 04:41.490
is we want to come up with an action selection policy

04:41.490 --> 04:45.810
that allows our agent not to get stuck in a local maximum.

04:45.810 --> 04:48.510
Yes, it's important to keep doing the good actions.

04:48.510 --> 04:50.190
That's the exploitation part.

04:50.190 --> 04:52.110
We wanna exploit what we've found

04:52.110 --> 04:53.940
but at the same time, we still want to explore.

04:53.940 --> 04:55.830
We never want to stop exploring.

04:55.830 --> 04:57.990
Like in life, you never wanna stop learning,

04:57.990 --> 04:59.263
you stop learning, you die.

04:59.263 --> 05:00.600
There's a saying like that.

05:00.600 --> 05:02.730
That when you're not growing, you're dying

05:02.730 --> 05:03.563
or something like that.

05:03.563 --> 05:05.700
So you want to keep learning

05:05.700 --> 05:07.740
and your agent wants to keep learning.

05:07.740 --> 05:10.410
And that's where these action selection policies come in.

05:10.410 --> 05:12.330
So we've got three listed here.

05:12.330 --> 05:14.160
So the first one is epsilon greedy.

05:14.160 --> 05:15.630
It's a very simple one.

05:15.630 --> 05:18.330
It sounds pretty complex in the sense

05:18.330 --> 05:20.160
that like it's got a cool name

05:20.160 --> 05:22.074
and usually things (indistinct) names are,

05:22.074 --> 05:23.190
It's actually not.

05:23.190 --> 05:26.460
So basically what it does, is it'll select the one

05:26.460 --> 05:31.440
with the best Q value and like epsilon greedy,

05:31.440 --> 05:32.940
you might hear it in other places.

05:32.940 --> 05:35.220
It's just like a selection policy.

05:35.220 --> 05:39.270
So in this case, we're using it to select our Q values,

05:39.270 --> 05:40.103
our action.

05:40.103 --> 05:42.960
So you'll select the one with the highest Q value,

05:42.960 --> 05:45.990
all the time, except for epsilon percent of the time.

05:45.990 --> 05:49.320
So for instance, if you set epsilon to 10%,

05:49.320 --> 05:52.320
then you're going to, or 0.1,

05:52.320 --> 05:53.970
then 10% of the time,

05:53.970 --> 05:56.700
the action is going to be selected at random.

05:56.700 --> 05:58.110
So 90% of the time,

05:58.110 --> 06:00.450
you're still going to be selecting the best action

06:00.450 --> 06:02.130
based on the highest Q value.

06:02.130 --> 06:05.580
But 10% of the time it's gonna be selecting a random action.

06:05.580 --> 06:07.710
Uniformly, it's going to be absolutely

06:07.710 --> 06:09.510
randomly taking an action

06:09.510 --> 06:13.230
or if you set epsilon to 0.05,

06:13.230 --> 06:15.485
that means that 95% of the time

06:15.485 --> 06:18.240
the agent is gonna be taking the action

06:18.240 --> 06:19.200
with the highest Q value,

06:19.200 --> 06:21.273
but 5% of the time it's still going to be selecting

06:21.273 --> 06:22.470
a random action.

06:22.470 --> 06:25.770
So it's going to be going out there and exploring.
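
NOTE
A minimal sketch of epsilon greedy as described above, assuming the random pick is uniform over all actions (so it can also land on the best one). The function name and the rng parameter are hypothetical.
```python
import random
def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise take the best action."""
    if rng.random() < epsilon:
        # Exploration: every action, including the best one, is equally likely.
        return rng.randrange(len(q_values))
    # Exploitation: the action with the highest Q value.
    return max(range(len(q_values)), key=lambda i: q_values[i])
# Example call with hypothetical Q values and epsilon = 0.1 (10% exploration).
action = epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.1)
```
Because the random pick can also hit the best action, with four actions and epsilon = 0.1 the best action is taken slightly more than 90% of the time.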

06:25.770 --> 06:28.650
So epsilon soft is very similar.

06:28.650 --> 06:30.000
By the way, that's kind of like

06:30.000 --> 06:31.230
why it's called epsilon greedy

06:31.230 --> 06:36.230
because you're greedily selecting the action,

06:36.600 --> 06:38.250
the good action except

06:38.250 --> 06:40.290
for that little epsilon percent of the time.

06:40.290 --> 06:43.860
So the lower the epsilon,

06:43.860 --> 06:45.720
the more greedily you're selecting

06:45.720 --> 06:50.400
that kind of the action that is the optimal action.

06:50.400 --> 06:52.560
And the less you are leaving,

06:52.560 --> 06:54.720
less chances you're leaving for exploration.

06:54.720 --> 06:56.010
Epsilon soft is the opposite.

06:56.010 --> 06:58.062
So basically you're selecting,

06:58.062 --> 07:00.630
at random, you're selecting one

07:00.630 --> 07:02.010
minus epsilon percent of the time.

07:02.010 --> 07:04.620
So if your epsilon is like 0.1, so 10%,

07:04.620 --> 07:08.250
then only 10% of the time you're taking this action

07:08.250 --> 07:12.390
and 90% of the time you're selecting a random action.

07:12.390 --> 07:14.760
So very simple, just inverted algorithms.
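
NOTE
A sketch of the inverted rule exactly as it is described in this lecture: the best action epsilon of the time, a uniformly random action otherwise. Names are hypothetical. Be aware that textbooks such as Sutton and Barto use "epsilon-soft" more broadly, for any policy that gives every action at least epsilon divided by the number of actions as its probability.
```python
import random
def epsilon_soft_lecture(q_values, epsilon, rng=random):
    """Lecture's version: greedy with probability epsilon, random otherwise."""
    if rng.random() < epsilon:
        # Only epsilon of the time do we take the action with the highest Q value.
        return max(range(len(q_values)), key=lambda i: q_values[i])
    # The remaining 1 - epsilon of the time we pick uniformly at random.
    return rng.randrange(len(q_values))
# Example call with hypothetical Q values and epsilon = 0.1.
action = epsilon_soft_lecture([0.1, 0.9, 0.3, 0.2], epsilon=0.1)
```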

07:14.760 --> 07:18.420
And softmax is kind of like the next step from,

07:18.420 --> 07:20.700
or it's a more advanced version,

07:20.700 --> 07:23.610
I would say, of the epsilon greedy algorithm.

07:23.610 --> 07:25.290
Although they both have merit

07:25.290 --> 07:26.850
and they both have their place.

07:26.850 --> 07:29.100
We are going to be using softmax in our coding,

07:29.100 --> 07:30.900
in our practical side of things.

07:30.900 --> 07:32.250
So that's why we're going to talk

07:32.250 --> 07:34.383
in a bit more detail about softmax.

07:35.340 --> 07:36.360
So let's have a look.

07:36.360 --> 07:37.860
So let's move on to softmax.

07:37.860 --> 07:40.590
Hopefully it's pretty clear about epsilon greedy.

07:40.590 --> 07:43.356
So it's a pretty straightforward algorithm: select

07:43.356 --> 07:46.200
this one most of the time

07:46.200 --> 07:47.760
except for sometimes go and explore.

07:47.760 --> 07:49.860
And now we also see why it's important

07:49.860 --> 07:51.540
to do that exploration

07:51.540 --> 07:53.765
so that we don't end up in local maximums

07:53.765 --> 07:56.070
in our optimization process.

07:56.070 --> 07:58.890
So now we're gonna talk a bit more about softmax,

07:58.890 --> 08:02.506
there's a tutorial on softmax at the end of the course,

08:02.506 --> 08:05.520
I think it's in annex number two

08:05.520 --> 08:08.460
where we talk about the concept behind softmax.

08:08.460 --> 08:09.960
I'm just going to refresh a little bit here.

08:09.960 --> 08:12.870
So there we're talking about convolutional neural networks.

08:12.870 --> 08:14.723
And by the way, we are going to be covering convolution.

08:14.723 --> 08:17.010
We're not covering convolutional neural networks

08:17.010 --> 08:19.050
in this section of the course.

08:19.050 --> 08:21.514
In this section we're still using a vector

08:21.514 --> 08:23.430
but in the next section of the course,

08:23.430 --> 08:27.060
when we're creating an AI to play Doom,

08:27.060 --> 08:29.400
we are going to be using convolutional neural networks.

08:29.400 --> 08:31.620
So it could be beneficial for you

08:31.620 --> 08:33.313
to look at convolutional neural networks

08:33.313 --> 08:36.330
and then take the softmax function

08:36.330 --> 08:38.310
or you can learn a bit more about softmax

08:38.310 --> 08:41.820
after you take the convolutional neural networks annex

08:41.820 --> 08:43.260
of the course later on.

08:43.260 --> 08:45.150
But here's a quick refresher.

08:45.150 --> 08:47.430
So here we've got a convolutional neural network

08:47.430 --> 08:48.960
which decides whether it's a dog or a cat.

08:48.960 --> 08:53.580
So here we've got the voting process between these neurons

08:53.580 --> 08:56.760
and this one says that it's got the features

08:56.760 --> 09:01.760
the fluffy ears, pointed face type of thing

09:02.220 --> 09:05.130
and the kind of like the features,

09:05.130 --> 09:08.520
the types of eyes, the way the eyes look

09:08.520 --> 09:09.960
all these features that belong to a dog.

09:09.960 --> 09:11.550
So it's a 95% chance

09:11.550 --> 09:13.920
that it's a dog and a 5% chance that it's a cat.

09:13.920 --> 09:16.080
But the question is, how did we get,

09:16.080 --> 09:17.850
and in that tutorial we're talking

09:17.850 --> 09:20.850
about how did we get these values to add up to one.

09:20.850 --> 09:22.851
Well, whatever the convolutional

09:22.851 --> 09:25.950
or our whole neural network,

09:25.950 --> 09:27.330
so the convolutional neural network

09:27.330 --> 09:30.420
plus the fully connected layers, whatever it spat out,

09:30.420 --> 09:31.860
whatever values it spat out,

09:31.860 --> 09:33.554
we applied a softmax function over here.

09:33.554 --> 09:36.390
This is where we introduce the formula

09:36.390 --> 09:37.223
for the softmax function.

09:37.223 --> 09:38.760
This is what it looks like.

09:38.760 --> 09:40.560
And then we got these values.

09:40.560 --> 09:43.470
And so basically that's a quick refresher.

09:43.470 --> 09:46.233
This is the formula for the softmax.

09:46.233 --> 09:49.560
What it does, is it takes however many outputs you have,

09:49.560 --> 09:50.970
doesn't matter.

09:50.970 --> 09:55.970
It will take them and it will squash them all into values

09:56.040 --> 09:58.620
between zero and one, regardless of how big they are.

09:58.620 --> 10:00.000
Just by looking at this formula,

10:00.000 --> 10:02.580
you can see that there's a total sum at the bottom.

10:02.580 --> 10:04.170
So these values are gonna be zero,

10:04.170 --> 10:05.003
between zero and one.

10:05.003 --> 10:08.700
And also all these values are going to add up to one always.
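
NOTE
The squashing behaviour described here can be checked with a short sketch of the softmax function. The input numbers are hypothetical Q values; subtracting the maximum before exponentiating is a common numerical safeguard added here and is not mentioned in the lecture.
```python
import math
def softmax(values):
    """Squash arbitrary real numbers into positive values that sum to one."""
    # p_i = exp(v_i) / sum_j exp(v_j). Subtracting the max first avoids
    # overflow for large inputs and does not change the result.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
# Hypothetical raw Q values: not between zero and one, not summing to one.
probs = softmax([1.0, 4.0, 0.5, 0.8])
```
The largest input always receives the largest output, so the ordering of the Q values is preserved.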

10:08.700 --> 10:12.630
And so that's very beneficial for us

10:12.630 --> 10:15.243
because when we're using the softmax function,

10:16.110 --> 10:19.740
what happens is we get these Q values,

10:19.740 --> 10:21.420
we select this best Q value.

10:21.420 --> 10:25.140
But in reality, what happens is, these Q values that we get,

10:25.140 --> 10:26.760
they're actual numbers, right?

10:26.760 --> 10:28.950
So there's some kind of numbers,

10:28.950 --> 10:30.420
they don't have to add up to one,

10:30.420 --> 10:31.740
they don't have to be between zero and one,

10:31.740 --> 10:33.150
just some numbers.

10:33.150 --> 10:34.650
But when we apply softmax,

10:34.650 --> 10:36.120
we don't just select the best one,

10:36.120 --> 10:38.250
We actually get numbers like that.

10:38.250 --> 10:41.700
So we get numbers in the range between zero and one

10:41.700 --> 10:44.310
and that also add up to one.

10:44.310 --> 10:47.340
And so what other things do we know that add up to one?

10:47.340 --> 10:48.270
Well, probabilities,

10:48.270 --> 10:50.160
we know that probabilities always have to add up to one.

10:50.160 --> 10:52.680
So that is why we can say

10:52.680 --> 10:53.910
here we've got Q values,

10:53.910 --> 10:57.990
but here all of a sudden we've got probability.

10:57.990 --> 11:00.120
So we can say that the likelihood

11:00.120 --> 11:02.820
of this being the best action is 90%,

11:02.820 --> 11:04.920
this being the best action is 5%, 2%, 3%.

11:05.880 --> 11:08.220
Because we know the higher your Q value,

11:08.220 --> 11:09.087
the better the action.

11:09.087 --> 11:11.850
And so if we squashed them to zero to one,

11:11.850 --> 11:13.140
then these become probabilities

11:13.140 --> 11:15.090
and we can deal with them as such.

11:15.090 --> 11:20.090
And therefore now is when the action is selected.

11:20.490 --> 11:22.920
And that's how we come up with Q2.

11:22.920 --> 11:24.540
But if you look at it closely,

11:24.540 --> 11:26.547
this isn't a strict 100%

11:26.547 --> 11:28.590
and these are not strict 0%.

11:28.590 --> 11:29.903
So this is a 5%, 2%, 3%.

11:30.810 --> 11:35.810
So the most natural way to apply the softmax

11:36.900 --> 11:41.400
in order to preserve exploration in the algorithm,

11:41.400 --> 11:44.640
is to use these exact probabilities

11:44.640 --> 11:48.600
as how often we are going to be taking that action.

11:48.600 --> 11:51.960
So these probabilities actually represent the distribution

11:51.960 --> 11:54.480
of these actions that we're taking.

11:54.480 --> 11:57.810
So basically softmax makes it very easy for us

11:57.810 --> 11:58.980
to come up with a way

11:58.980 --> 12:01.770
to combine exploitation and exploration.

12:01.770 --> 12:03.540
So the best action

12:03.540 --> 12:05.070
will always have the highest probability

12:05.070 --> 12:06.750
because it has the highest Q value.

12:06.750 --> 12:08.550
And therefore here we're going to be,

12:08.550 --> 12:10.530
just we're gonna use these as our distribution

12:10.530 --> 12:11.363
and we're gonna say,

12:11.363 --> 12:14.340
"Okay we're gonna be taking Q2 90% of the time

12:14.340 --> 12:16.620
but 5% of the time we're still gonna be taking Q1

12:16.620 --> 12:18.470
and 2% of the time we're gonna take Q3

12:18.470 --> 12:21.390
and 3% of the time we're gonna be taking Q4."
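
NOTE
Using the probabilities as a distribution amounts to a weighted random draw. The sketch below uses the 5%, 90%, 2%, 3% figures quoted in the lecture; the function name, the fixed seed and the number of draws are assumptions for illustration.
```python
import random
def sample_action(probabilities, rng=random):
    """Pick an action index with chance equal to its softmax probability."""
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]
# Probabilities for Q1..Q4 as quoted in the lecture (index 1 is Q2).
lecture_probs = [0.05, 0.90, 0.02, 0.03]
rng = random.Random(0)          # fixed seed so the run is repeatable
counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[sample_action(lecture_probs, rng)] += 1
```
Over many draws the counts approach the stated shares, so Q2 dominates while every other action still gets tried occasionally.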

12:21.390 --> 12:24.561
And the beauty here is also that as these values update

12:24.561 --> 12:27.090
as the agent goes through the network

12:27.090 --> 12:28.170
more and more and more,

12:28.170 --> 12:33.170
it becomes more familiar with the environment

12:34.200 --> 12:35.250
and therefore these update.

12:35.250 --> 12:37.520
So this value for instance might become,

12:37.520 --> 12:40.290
like it might ascertain

12:40.290 --> 12:42.690
that this value is actually less or this actually is higher.

12:42.690 --> 12:45.600
And so these probabilities will also change

12:45.600 --> 12:47.070
as an agent goes through.

12:47.070 --> 12:49.200
So even though here we've got Q2,

12:49.200 --> 12:50.370
that's not to say

12:50.370 --> 12:53.700
that sometimes, 5% of the time to be more precise,

12:53.700 --> 12:56.880
we won't be selecting Q1 as the action to take.

12:56.880 --> 13:00.150
So sometimes we'll be taking action one.

13:00.150 --> 13:01.948
Sometimes we'll be taking action two,

13:01.948 --> 13:04.200
action three 2% of the time

13:04.200 --> 13:06.480
and action four we'll be taking about 3% of the time.

13:06.480 --> 13:11.040
So every action has a chance to play in this process

13:11.040 --> 13:13.170
as long as we have enough iterations

13:13.170 --> 13:15.090
and the agent goes through lots and lots of times

13:15.090 --> 13:17.940
through these states that they're in.

13:17.940 --> 13:22.940
And that's how any kind of deep learning algorithm works

13:22.950 --> 13:24.960
that you want to do this many, many, many times

13:24.960 --> 13:27.180
so that you learn from experience.

13:27.180 --> 13:29.580
And therefore, as you can see here,

13:29.580 --> 13:31.860
it's a very natural transition to,

13:31.860 --> 13:34.230
we're not just randomly, like in an epsilon greedy algorithm,

13:34.230 --> 13:37.440
We're not just randomly selecting the actions,

13:37.440 --> 13:40.280
we're selecting them based on their softmax values,

13:40.280 --> 13:44.190
which makes it, like has some logic behind it,

13:44.190 --> 13:46.620
not just at random 10% of the time,

13:46.620 --> 13:48.120
we're selecting a random action

13:48.120 --> 13:50.010
but there's some logic behind how we're doing it

13:50.010 --> 13:53.220
and based on the Q values that we've explored.

13:53.220 --> 13:56.970
And so, that's the action selection policy

13:56.970 --> 13:58.590
that we are going to be using in this course.

13:58.590 --> 14:01.830
You're welcome to definitely check out the epsilon greedy,

14:01.830 --> 14:04.050
action selection policy if you like,

14:04.050 --> 14:05.487
but we are going to be predominantly using

14:05.487 --> 14:08.760
the softmax action selection policy.

14:08.760 --> 14:11.460
And I've got an interesting reading for you.

14:11.460 --> 14:15.150
So this is called Adaptive Epsilon Greedy Exploration

14:15.150 --> 14:17.460
in Reinforcement Learning Based on Value Differences.

14:17.460 --> 14:18.900
It's a 2010 article.

14:18.900 --> 14:22.620
And it's interesting because Michel,

14:22.620 --> 14:24.090
I'm not sure how to pronounce it.

14:24.090 --> 14:28.410
Michel Tokic introduces a different type

14:28.410 --> 14:29.243
of algorithm.

14:29.243 --> 14:32.520
So an adjusted epsilon greedy algorithm

14:32.520 --> 14:37.260
called the VDBE algorithm

14:37.260 --> 14:39.060
or epsilon VDBE algorithm.

14:39.060 --> 14:40.380
You can see it over here.

14:40.380 --> 14:42.780
And he actually compares it

14:42.780 --> 14:44.220
to the epsilon greedy and softmax.

14:44.220 --> 14:46.650
And it's an epsilon greedy algorithm

14:46.650 --> 14:49.530
which basically the main idea behind it,

14:49.530 --> 14:53.820
is to adjust the value of epsilon

14:53.820 --> 14:56.580
depending on the state the agent is in.

14:56.580 --> 14:59.880
So if the agent is very certain about the state they're in,

14:59.880 --> 15:01.500
then epsilon should be smaller.

15:01.500 --> 15:02.700
So there should be less exploration.

15:02.700 --> 15:04.080
If the agent is uncertain,

15:04.080 --> 15:06.330
epsilon should be higher, should be more exploration.
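
NOTE
The idea of an epsilon that shrinks when the agent is certain and grows when it is not can be sketched as follows. This is a simplified illustration of that idea only and should not be read as the exact update rule from the article. The function name, the use of the size of the latest Q value change as the uncertainty signal, and the parameters sigma and delta are all assumptions.
```python
import math
def update_epsilon(epsilon, value_change, sigma=1.0, delta=0.25):
    """Nudge a state's epsilon up after big value changes, down after small ones."""
    # Map the size of the change to [0, 1): 0 when nothing changed,
    # close to 1 when the Q value moved a lot (the agent was surprised).
    x = math.exp(-abs(value_change) / sigma)
    surprise = (1.0 - x) / (1.0 + x)
    # Blend the old epsilon with the surprise signal.
    return delta * surprise + (1.0 - delta) * epsilon
calm = update_epsilon(0.5, value_change=0.0)       # estimate was stable
surprised = update_epsilon(0.5, value_change=10.0) # estimate moved a lot
```
Starting from 0.5, a stable estimate lowers epsilon (less exploration) and a large change raises it (more exploration).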

15:06.330 --> 15:08.777
So it is a 2010 article.

15:08.777 --> 15:13.777
I'm not sure if this new proposed algorithm is widely used

15:14.790 --> 15:18.015
or has been accepted in the community

15:18.015 --> 15:21.160
or if artificial intelligence has moved

15:21.160 --> 15:23.100
kind of away from this suggestion.

15:23.100 --> 15:25.530
But nevertheless, it will definitely help you

15:25.530 --> 15:29.430
reinforce your knowledge about action selection policies

15:29.430 --> 15:31.500
which we discussed, the epsilon greedy, the softmax.

15:31.500 --> 15:32.880
It'll give you an opportunity

15:32.880 --> 15:34.020
to compare them side by side,

15:34.020 --> 15:37.080
and also see in which direction people actually think

15:37.080 --> 15:39.330
when they want to improve artificial intelligence.

15:39.330 --> 15:40.680
So if you're ever planning

15:40.680 --> 15:44.520
on creating really interesting algorithms

15:44.520 --> 15:46.530
that are pushing the edge

15:46.530 --> 15:47.970
of artificial intelligence

15:47.970 --> 15:50.640
and pushing the envelope in this space,

15:50.640 --> 15:52.860
then this could be a good way for you

15:52.860 --> 15:56.220
to see in which direction people think sometimes

15:56.220 --> 16:00.210
when they're trying to improve the norms

16:00.210 --> 16:01.350
of artificial intelligence

16:01.350 --> 16:04.050
or the norms that existed back then in 2010.

16:04.050 --> 16:04.883
So there we go.

16:04.883 --> 16:06.840
Hopefully you enjoyed today's tutorial

16:06.840 --> 16:10.020
about the action selection policies,

16:10.020 --> 16:12.278
and we learned about epsilon greedy, epsilon soft

16:12.278 --> 16:13.797
and the softmax.

16:13.797 --> 16:15.990
And now you are even more prepared

16:15.990 --> 16:18.270
for the practical side of things.

16:18.270 --> 16:20.850
And on that note, I look forward to seeing you next time.

16:20.850 --> 16:22.773
And until then, enjoy AI.
