1
00:00:11,520 --> 00:00:12,720
As a general pattern.

2
00:00:12,720 --> 00:00:18,210
It note that for the prediction problem, usually we are interested in finding VIVES, which is what

3
00:00:18,210 --> 00:00:18,900
we just did.

4
00:00:19,560 --> 00:00:24,450
For the control problem, we are usually interested in working with the action value queue of ESA.

5
00:00:25,290 --> 00:00:30,450
The reason for this is, if you recall, the queue function makes it easy to look up what action to

6
00:00:30,450 --> 00:00:32,340
take given a state s.

7
00:00:33,000 --> 00:00:35,890
We always take the arguments over all possible actions.

8
00:00:35,910 --> 00:00:37,470
We call this the greedy action.

9
00:00:38,220 --> 00:00:42,840
In this way, we maximise what we believe to be the sum of future rewards.

10
00:00:43,440 --> 00:00:47,160
So just keep this general pattern in mind for the prediction problem.

11
00:00:47,160 --> 00:00:51,240
We work with V of S for the control problem we work with Q of ESA.

12
00:00:56,320 --> 00:01:01,480
In order to understand how to apply the Monte Carlo approach to the control problem, we first have

13
00:01:01,480 --> 00:01:06,220
to understand the principle of policy, iteration and policy improvement.

14
00:01:07,240 --> 00:01:09,010
Let's consider two basic facts.

15
00:01:09,700 --> 00:01:16,300
Number one, given a policy, we can use Monte Carlo to evaluate the value function, whether that be

16
00:01:16,300 --> 00:01:18,130
the state value or the action value.

17
00:01:18,580 --> 00:01:20,350
We just saw that in the previous lecture.

18
00:01:21,400 --> 00:01:27,580
Number two, given an action value function, we can always choose based on this what we believe to

19
00:01:27,580 --> 00:01:29,650
be the best action given the current state.

20
00:01:30,220 --> 00:01:32,710
This is just the arg max over all possible actions.

21
00:01:32,760 --> 00:01:32,970
A.

22
00:01:38,050 --> 00:01:41,140
Well, it turns out that these two facts are interdependent.

23
00:01:41,950 --> 00:01:45,460
Given a policy, we can find its corresponding action value.

24
00:01:45,850 --> 00:01:49,990
And from that, we can take the arguments to find a possibly better policy.

25
00:01:50,930 --> 00:01:54,170
But what if this policy is different from the original policy?

26
00:01:54,980 --> 00:01:57,680
Then we can find the value function for this new policy.

27
00:01:58,400 --> 00:02:04,040
And from that, we can take the arguments again to find yet another new and possibly better, but at

28
00:02:04,040 --> 00:02:05,420
least as good policy.

29
00:02:06,650 --> 00:02:08,960
From there we can find the value function again.

30
00:02:09,710 --> 00:02:14,600
So you see, this is just a loop where we go back and forth, finding the value function, a given a

31
00:02:14,600 --> 00:02:16,700
policy and improving the policy.

32
00:02:16,880 --> 00:02:18,200
Given that value function.

33
00:02:20,220 --> 00:02:25,830
You can see that we have named these two steps back to finding the value function is called the evaluation

34
00:02:25,830 --> 00:02:31,470
step and the act of finding the best policy given that value function is called the improvement step.

35
00:02:33,290 --> 00:02:38,270
And just to be clear, the reason why this is a loop is because they both change each other.

36
00:02:38,600 --> 00:02:43,970
So by doing step one, you change the policy and by doing step two, you change the value.

37
00:02:49,010 --> 00:02:54,380
It's been proven, although we won't discuss it here, that performing these steps leads to an monotonic

38
00:02:54,380 --> 00:02:56,060
improvement in the policy.

39
00:02:56,810 --> 00:03:01,880
Therefore, if we just keep doing this process, eventually we will end up at the optimal policy.

40
00:03:06,920 --> 00:03:08,840
So how can we apply this to Monte Carlo?

41
00:03:09,590 --> 00:03:10,610
Here's a rough outline.

42
00:03:11,660 --> 00:03:16,340
First, we start by initializing Q of essay and a policy to both be random.

43
00:03:17,510 --> 00:03:23,240
Next, we enter a loop that goes for a predetermined at number of episodes inside the loop.

44
00:03:23,510 --> 00:03:27,710
We first evaluate the policy by finding Q of essay for the given policy.

45
00:03:28,250 --> 00:03:30,440
We call this the policy evaluation step.

46
00:03:32,120 --> 00:03:38,060
Once we've done that, we find a new policy where for each step, we take the action to be the arguments

47
00:03:38,060 --> 00:03:39,170
overall actions.

48
00:03:39,410 --> 00:03:41,600
For Q of essay with a given state.

49
00:03:42,410 --> 00:03:44,510
This is called the policy improvement step.

50
00:03:45,470 --> 00:03:50,000
This is pretty simple and not unlike the prediction problem, except for one small difference.

51
00:03:55,060 --> 00:04:01,300
Note that earlier when we discussed the evaluation problem, we discussed how to find VIVES, but now

52
00:04:01,300 --> 00:04:06,940
we have to find a cure essay in order to find you of essay where we evaluate our policy.

53
00:04:07,330 --> 00:04:11,440
We need to keep track of not only the states and rewards, but also the actions.

54
00:04:11,980 --> 00:04:15,850
So we're going to record a triples of states actions and rewards.

55
00:04:16,270 --> 00:04:19,330
So S1, A1, R1, S2, A2, R2, and so on.

56
00:04:20,290 --> 00:04:23,350
Here's what the pseudocode for the evaluation step would look like.

57
00:04:24,100 --> 00:04:29,890
As you can see, this is pretty similar to calculating V of S except when we saw the sample returns.

58
00:04:30,100 --> 00:04:33,220
We index the dictionary by both state and action.

59
00:04:38,300 --> 00:04:39,290
In the second part.

60
00:04:39,590 --> 00:04:43,550
Q is again a dictionary, but now the key is a state action tuple.

61
00:04:44,270 --> 00:04:46,940
Other than that, the process is exactly the same.

62
00:04:47,540 --> 00:04:51,650
We still calculate the sample mean of each list of returns for a given key.

63
00:04:56,720 --> 00:05:00,500
At this point, our solution works, but it's not an ideal solution.

64
00:05:01,130 --> 00:05:02,150
Let's think about why.

65
00:05:03,140 --> 00:05:11,390
First, if you recall via us only stores big -- values but q of essay stores, big -- times big values.

66
00:05:12,170 --> 00:05:14,780
As you know, with Monte Carlo sampling.

67
00:05:14,810 --> 00:05:18,530
The more samples you collect, the more accurate your answer becomes.

68
00:05:19,280 --> 00:05:25,010
When we use Q of essay, we have many more values to estimate and therefore we have to collect many

69
00:05:25,010 --> 00:05:27,740
more samples in order to get an accurate estimate.

70
00:05:32,730 --> 00:05:34,020
But there is a second problem.

71
00:05:34,950 --> 00:05:40,920
Each policy evaluation step is a monte Carlo estimation, which means that we must calculate a large

72
00:05:40,920 --> 00:05:41,940
number of samples.

73
00:05:42,900 --> 00:05:46,260
But now we've nested this loop inside another loop.

74
00:05:47,070 --> 00:05:48,240
What's the effect of this?

75
00:05:49,020 --> 00:05:53,850
Well, let's say our policy iteration loop needs to run 1000 times to find the optimal policy.

76
00:05:54,780 --> 00:05:59,790
Now, let's say our inner loop needs to run 1000 times in order to get an accurate estimate of queue

77
00:05:59,790 --> 00:06:00,440
of DNA.

78
00:06:01,440 --> 00:06:03,150
How many episodes do we have to play?

79
00:06:04,200 --> 00:06:07,590
The answer is 1000 times 1000, which is 1 million.

80
00:06:08,370 --> 00:06:13,950
As you can see, our data usage is not efficient and the number of times we need to play the game grows

81
00:06:13,950 --> 00:06:14,730
quite fast.

82
00:06:19,850 --> 00:06:23,510
A better solution is to use what's called generalized policy iteration.

83
00:06:24,410 --> 00:06:28,700
At first, this approach may seem kind of wacky, but in fact it's been shown to work.

84
00:06:29,420 --> 00:06:35,900
The idea is this for the policy evaluation step, instead of playing multiple episodes, in order to

85
00:06:35,900 --> 00:06:40,910
get a good estimate of the value of our policy, we are only going to play one episode.

86
00:06:41,660 --> 00:06:46,790
After playing this one episode, we'll get a series of states actions and rewards from which we can

87
00:06:46,790 --> 00:06:48,740
calculate the corresponding returns.

88
00:06:49,930 --> 00:06:55,690
From that, we can loop through each state action in return and use the latest sample of the return

89
00:06:55,750 --> 00:06:56,950
to update queue of S.

90
00:06:58,180 --> 00:07:00,430
Now you'll notice that I've just put some pseudocode here.

91
00:07:00,460 --> 00:07:01,780
Update queue of a sun.

92
00:07:01,810 --> 00:07:04,030
But I haven't told you exactly how we're going to do that.

93
00:07:04,360 --> 00:07:05,710
We'll be discussing this shortly.

94
00:07:06,700 --> 00:07:11,890
The important thing to note right now is that we'll only keep a single running copy of queue of S in

95
00:07:12,880 --> 00:07:15,700
the next and final step is the same as before.

96
00:07:16,180 --> 00:07:20,370
We update the policy by taking the arguments over queue for a given step.

97
00:07:20,710 --> 00:07:21,970
Overall actions.

98
00:07:27,100 --> 00:07:28,720
So what's going on with this line here?

99
00:07:29,260 --> 00:07:32,980
How do we update you of and a given the old estimate of queue of us?

100
00:07:32,980 --> 00:07:39,460
And for this we need to take a short digression back to sample means and expected values yet again.

101
00:07:40,180 --> 00:07:45,730
As I'm sure you're tired of me repeating by now, this is the expression for the sample mean we sum

102
00:07:45,730 --> 00:07:48,790
up all the samples and we divide by the total number of samples.

103
00:07:49,570 --> 00:07:53,020
The question we have to ask is, is this an efficient calculation?

104
00:07:58,340 --> 00:07:59,300
The answer is no.

105
00:07:59,990 --> 00:08:01,950
Summing up and values is no event.

106
00:08:02,630 --> 00:08:05,620
The more values we store, the more time it takes to calculate.

107
00:08:06,170 --> 00:08:07,010
This is not good.

108
00:08:07,850 --> 00:08:13,310
So the question becomes, is there a way to reduce the time it takes to calculate the sample mean?

109
00:08:13,670 --> 00:08:17,900
Every time I collect a new sample to see how this is done.

110
00:08:17,930 --> 00:08:22,850
Let's express the sample mean in terms of the previous sample mean and the latest sample.

111
00:08:24,880 --> 00:08:28,900
Before we move on, please make sure you can derive this yourself on paper.

112
00:08:29,800 --> 00:08:36,700
In the derivation you see here, x bar n means the sample mean it using the first n samples thus ex-parte

113
00:08:36,700 --> 00:08:40,930
and minus one means the sample mean it using the first and minus one samples.

114
00:08:46,140 --> 00:08:51,900
The trick is we can turn the above expression into something that looks like and in fact is gradient

115
00:08:51,900 --> 00:08:52,380
descent.

116
00:08:53,310 --> 00:08:55,530
We can write it so it looks a little more familiar.

117
00:08:55,950 --> 00:09:03,030
The new estimate is equal to the old estimate plus one over n times, the sample minus the old estimate.

118
00:09:04,110 --> 00:09:10,110
In this scenario, the target is the latest sample we've collected, while the old estimate is our prediction.

119
00:09:10,830 --> 00:09:12,240
One over n is the learning rate.

120
00:09:12,510 --> 00:09:17,310
So this learning rate is one that decays over time by using such a learning rate.

121
00:09:17,400 --> 00:09:19,710
We get back exactly the sample mean.

122
00:09:20,470 --> 00:09:22,830
Remember, all we've done so far is basic algebra.

123
00:09:27,860 --> 00:09:32,420
Here's what it would look like if we translated it into our Monte Carlo reinforcement learning code

124
00:09:32,510 --> 00:09:33,410
to update queue.

125
00:09:34,490 --> 00:09:41,210
The new estimate of queue of SNP is updated as the old estimate plus one over ten times the latest return

126
00:09:41,390 --> 00:09:42,560
minus the old estimate.

127
00:09:43,700 --> 00:09:46,640
Just note that the equal sign here means assignment.

128
00:09:46,910 --> 00:09:48,620
We're not saying that both sides are equal.

129
00:09:49,190 --> 00:09:52,670
The left side is the new estimate and the right side holds the old estimate.

130
00:09:52,910 --> 00:09:54,650
We just haven't bothered to subscript them.

131
00:09:55,850 --> 00:10:00,560
However, we are not done yet because we are going to make yet another modification to this.

132
00:10:05,720 --> 00:10:11,300
Remember that in our latest generalized policy iteration loop we are updating the same Q dictionary

133
00:10:11,330 --> 00:10:12,770
using different policies.

134
00:10:13,640 --> 00:10:19,850
This policy is being updated on each step and therefore the samples which we are using to calculate

135
00:10:19,850 --> 00:10:22,970
Q of SNP are not coming from the same distribution.

136
00:10:24,080 --> 00:10:29,440
In this case we don't want to use exactly the sample mean intuitively.

137
00:10:29,450 --> 00:10:32,180
The oldest samples come from the oldest policies.

138
00:10:32,240 --> 00:10:33,650
They don't really matter that much.

139
00:10:34,340 --> 00:10:37,700
The newest samples come from the newest policies and they matter more.

140
00:10:42,760 --> 00:10:47,170
The key is to use a constant learning rate instead of one that decays over time.

141
00:10:49,570 --> 00:10:53,380
Doing so gives us what is called the exponentially decaying average.

142
00:10:54,280 --> 00:10:59,230
The idea is when you take the conventional sample mean each sample is weighted equally.

143
00:11:00,220 --> 00:11:05,950
As we said, we don't want that because old values come from old policies and so they come from a different

144
00:11:05,950 --> 00:11:08,620
distribution than the one we are now interested in.

145
00:11:09,760 --> 00:11:15,370
On the other hand, the exponentially decaying average weights each sample in an exponentially decaying

146
00:11:15,370 --> 00:11:15,900
fashion.

147
00:11:16,690 --> 00:11:22,090
The most recent sample matters most, and the weights for each sample decay exponentially as you go

148
00:11:22,090 --> 00:11:24,040
backwards in the order they were collected.

149
00:11:29,340 --> 00:11:35,160
So now we can express our generalized policy iteration quote again, but this time we can fill in this

150
00:11:35,160 --> 00:11:38,100
little missing detail about how to update queue of essay.

151
00:11:39,660 --> 00:11:45,420
So this is the exact same pseudocode as before, except I've replaced the middle block of code with

152
00:11:45,420 --> 00:11:46,320
the actual update.

153
00:11:46,350 --> 00:11:47,760
We can use the update queue.

154
00:11:49,310 --> 00:11:54,470
Unfortunately, at this point there is a subtle but important detail we have yet to discuss.

155
00:11:55,010 --> 00:11:59,450
So we are not quite done specifying our algorithm, but we've laid out most of the groundwork.