1
00:00:00,690 --> 00:00:08,760
Up to this point, we've been training our models with one fixed learning rate throughout the process,

2
00:00:08,760 --> 00:00:13,980
so we could fix our learning rate, as we did, to, say, 0.01.

3
00:00:13,980 --> 00:00:18,360
And we use this same learning rate throughout our whole training process.

4
00:00:18,870 --> 00:00:25,230
But it happens that if this learning rate is too large, then we risk diverging.

5
00:00:25,470 --> 00:00:32,830
And if the learning rate is too small, it would take too long for our model to converge.
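To make these two failure modes concrete, here is a minimal sketch. It assumes a toy loss f(w) = w² and plain gradient descent; neither the loss nor the helper name `gradient_descent` comes from the video, they are only there for illustration.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Run plain gradient descent on the toy loss f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # standard update rule
    return w

too_large = gradient_descent(lr=1.1)    # each step multiplies w by -1.2, so |w| grows
too_small = gradient_descent(lr=0.001)  # each step multiplies w by 0.998, so w barely moves
moderate = gradient_descent(lr=0.1)     # each step multiplies w by 0.8, so w shrinks quickly

print(abs(too_large), abs(too_small), abs(moderate))
```

With the large rate the weight moves further from the minimum at w = 0 on every step, with the tiny rate it is still close to where it started after 50 steps, and the moderate rate gets very near zero.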

6
00:00:32,850 --> 00:00:40,200
So let's consider this plot right here. It is actually a very simplified plot, as what really happens

7
00:00:40,200 --> 00:00:41,900
is way more complex than this.

8
00:00:41,910 --> 00:00:46,470
So let's consider this plot and we have this.

9
00:00:47,560 --> 00:00:53,940
And here we have the loss and the weights.

10
00:00:53,950 --> 00:00:57,400
So yeah we have loss and then weights or parameters.

11
00:00:58,060 --> 00:01:03,520
And our aim is actually to modify these weights such that the loss is minimized.

12
00:01:03,530 --> 00:01:07,630
So our aim is to get to this position right here.

13
00:01:08,800 --> 00:01:11,800
Now we start with the case where we have a high learning rate.

14
00:01:11,800 --> 00:01:17,950
So if you are dealing with a high learning rate, then it will be easier for our model to find its way

15
00:01:17,950 --> 00:01:24,820
to this minimum position right here or to some position close to this minimum position, let's say at

16
00:01:24,820 --> 00:01:25,900
this point here.

17
00:01:26,500 --> 00:01:33,340
But the problem here is that once it gets to this position, it could also very easily diverge from

18
00:01:33,340 --> 00:01:33,490
it.

19
00:01:33,490 --> 00:01:39,940
So you could very easily get back to another point around here and then repeat this kind of process

20
00:01:39,940 --> 00:01:43,200
again where the model just kind of diverges.

21
00:01:43,210 --> 00:01:45,480
So you could have something like this.

22
00:01:45,490 --> 00:01:51,920
It could come down here and then get back to some point around this and so on and so forth.

23
00:01:51,940 --> 00:01:58,090
So we may have this case where the model doesn't really converge because the learning rate is too high.

24
00:01:58,090 --> 00:02:01,240
And then, for the small learning rates:

25
00:02:02,520 --> 00:02:04,110
We may start training.

26
00:02:04,110 --> 00:02:11,310
And then if, say, we find ourselves at this point here, that is, we've modified the weights

27
00:02:11,310 --> 00:02:19,650
so that the loss value happens to be at this point here, it becomes difficult for us to get to this

28
00:02:20,190 --> 00:02:21,240
ultimate,

29
00:02:22,210 --> 00:02:31,030
or global, minimum, since the learning rate is too small and we are making very small changes,

30
00:02:31,030 --> 00:02:39,250
so we may find ourselves just staying around this local minimum instead of going towards this global

31
00:02:39,250 --> 00:02:40,990
minimum, which is this.

32
00:02:42,040 --> 00:02:50,290
And so to bring in a balance, what we could do is when we start the training process, we could use

33
00:02:50,290 --> 00:02:57,730
a relatively high learning rate so that the model approaches this global minimum faster.

34
00:02:57,730 --> 00:03:05,440
And then after a certain number of epochs, we start or we modify the learning rate so that it becomes

35
00:03:05,440 --> 00:03:08,110
or it takes in very small values.

36
00:03:08,110 --> 00:03:15,160
And since it now takes in small values, we now start making very small changes so that we can get

37
00:03:15,160 --> 00:03:20,200
to this global minimum without risking divergence.

38
00:03:21,460 --> 00:03:28,510
Now, one way of doing this is, let's suppose that for the first ten

39
00:03:28,510 --> 00:03:36,400
epochs you train your model with a learning rate of, say, 0.1, and then from epoch 10 to 20 you're going to train

40
00:03:36,400 --> 00:03:39,520
the model with a learning rate of, say, 0.01.

41
00:03:39,520 --> 00:03:41,380
Let's say we divide it by a factor of ten.

42
00:03:41,380 --> 00:03:48,010
So we now move to 0.01 and the next ten, again, let's say we're training all this for 30 epochs and

43
00:03:48,010 --> 00:03:51,610
then the next we go to 0.001.
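The manual plan just described can be written down as a small function. This is only a sketch of that 30-epoch plan; the name `step_schedule` and the 0-based epoch index are assumptions for illustration.

```python
def step_schedule(epoch):
    """Return the learning rate for a 0-based epoch index: 0.1, then 0.01, then 0.001."""
    rates = [0.1, 0.01, 0.001]
    # Move to the next rate every ten epochs and stay on the last one afterwards.
    stage = min(epoch // 10, len(rates) - 1)
    return rates[stage]

for epoch in (0, 9, 10, 19, 20, 29):
    print(epoch, step_schedule(epoch))
```

Looking the rates up in a list, rather than repeatedly dividing by ten, avoids small floating-point drift in the printed values.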

44
00:03:51,610 --> 00:03:57,040
So you could train your model and then after ten epochs, you restart the training by modifying the

45
00:03:57,040 --> 00:03:59,020
learning rate in your optimizer.

46
00:03:59,020 --> 00:04:01,530
And then again, you do the same here.

47
00:04:01,540 --> 00:04:08,310
But now the problem with this is you always have to be there to ensure that after the training, that is,

48
00:04:08,350 --> 00:04:11,950
after the ten epochs, you modify this manually.

49
00:04:11,980 --> 00:04:14,440
Now what if we're able to do this automatically?

50
00:04:15,400 --> 00:04:20,350
As usual, this is made possible by TensorFlow callbacks. With this callback,

51
00:04:20,350 --> 00:04:23,020
that is, the LearningRateScheduler callback,

52
00:04:23,020 --> 00:04:31,030
we could define a function which takes in the current epoch number and then modifies the learning rate based

53
00:04:31,030 --> 00:04:41,050
on the current epoch or based on a mixture of the current epoch and some predefined function.

54
00:04:42,080 --> 00:04:45,550
So as you could see here, we have the learning rate scheduler.

55
00:04:45,560 --> 00:04:50,450
It takes in a schedule function, and then we could specify the verbosity.

56
00:04:50,810 --> 00:04:56,610
Now, this is an example of this scheduler method, which has been defined here.

57
00:04:56,750 --> 00:05:02,120
What it does is, if the epoch number is less than ten, then you're going to use this predefined learning

58
00:05:02,120 --> 00:05:02,720
rate.

59
00:05:03,020 --> 00:05:03,860
And then.

60
00:05:04,630 --> 00:05:12,580
In the case where the epoch number is greater than or equal to ten, then we start to reduce the learning

61
00:05:12,580 --> 00:05:14,360
rate in an exponential manner.
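A minimal sketch of such a scheduler method follows. The function signature (epoch, then current learning rate) is what the callback passes in; the decay factor exp(-0.1) and the starting rate of 0.01 are assumptions for illustration, and the loop only mimics how the callback would feed each epoch's rate into the next call.

```python
import math

def scheduler(epoch, lr):
    """Keep lr unchanged for the first 10 epochs, then decay it by exp(-0.1) per epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

lr = 0.01
history = []
for epoch in range(15):
    lr = scheduler(epoch, lr)  # the returned rate becomes the input for the next epoch
    history.append(lr)

print(history[9], history[10], history[14])
```

Because each call multiplies the previous rate by the same factor, the rate after epoch 10 falls off exponentially rather than in fixed steps.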

62
00:05:14,380 --> 00:05:15,990
So there we go.

63
00:05:16,000 --> 00:05:17,810
Let's take this off.

64
00:05:17,830 --> 00:05:24,850
Now, what we're saying here is we're modifying the learning rate such that, before

65
00:05:24,850 --> 00:05:28,390
ten epochs, we have this fixed learning rate.

66
00:05:29,490 --> 00:05:30,230
That's it.

67
00:05:30,240 --> 00:05:33,750
And then after this, we start to reduce this.

68
00:05:33,750 --> 00:05:38,250
So the learning rate starts dropping as we continue with the training.

69
00:05:38,250 --> 00:05:46,350
So now we don't really need to be monitoring the training manually because this callback will automatically

70
00:05:46,350 --> 00:05:48,540
modify the learning rate for you.

71
00:05:48,930 --> 00:05:54,600
We could simply copy out this example which has been given to us right here and then make use of it

72
00:05:54,600 --> 00:05:56,730
in our training process.

73
00:05:56,730 --> 00:06:02,280
So here, we have to include this in our code.

74
00:06:02,280 --> 00:06:07,230
So yeah, we paste in this scheduler, let's do this.

75
00:06:07,230 --> 00:06:13,140
And then for this one we have our learning rate scheduling

76
00:06:14,490 --> 00:06:15,390
callback.

77
00:06:15,390 --> 00:06:19,050
So learning rate scheduler.

78
00:06:19,290 --> 00:06:21,540
And then we do this import.

79
00:06:21,540 --> 00:06:29,220
So here again we just have LearningRateScheduler, we run that and then get back to this position

80
00:06:30,150 --> 00:06:30,990
right here.

81
00:06:31,650 --> 00:06:35,040
Let's do this one, two, three.

82
00:06:35,040 --> 00:06:35,850
So that's fine.

83
00:06:35,850 --> 00:06:43,410
So notice that here we have these other callbacks: LearningRateScheduler, and we have

84
00:06:43,410 --> 00:06:47,280
CSVLogger, EarlyStopping, LearningRateScheduler, and our other callbacks.

85
00:06:47,280 --> 00:06:48,330
So that's fine.

86
00:06:49,170 --> 00:06:52,710
Let's get back to that and then we are going to define.

87
00:06:52,710 --> 00:06:58,980
So yeah, we've had the learning rate scheduler or this scheduler method, but we're yet to define our

88
00:07:00,090 --> 00:07:01,320
scheduler callback.

89
00:07:01,320 --> 00:07:12,690
So we have, let's say, scheduler_callback equals LearningRateScheduler, and then

90
00:07:12,690 --> 00:07:15,420
it takes in this scheduler method right here.

91
00:07:15,420 --> 00:07:16,350
So that's fine.

92
00:07:16,350 --> 00:07:23,070
Notice how the scheduler method takes in the current epoch number and the learning

93
00:07:23,070 --> 00:07:23,640
rate.

94
00:07:23,640 --> 00:07:31,950
So here we have this given to us already, and so we'll modify this to take, say for example, three.

95
00:07:31,950 --> 00:07:38,250
So after three epochs we're going to modify the learning rate and then we could print out the learning

96
00:07:38,250 --> 00:07:44,940
rate, so we could say, for example, the current learning rate. Or actually, what we could do is, let's take

97
00:07:44,940 --> 00:07:46,680
this off, let's not have that.

98
00:07:46,680 --> 00:07:50,760
Let's specify the verbosity to be equal to one.

99
00:07:50,760 --> 00:07:58,650
So yeah, we have the verbosity specified as one and then we'll take off this.

100
00:07:59,250 --> 00:08:05,490
So you could also check out the CSVLogger, the logged CSV file.

101
00:08:05,490 --> 00:08:11,610
So you'll notice how throughout we've been logging these values, since we're just adding

102
00:08:11,610 --> 00:08:15,390
up or stacking up the values on the previous values we've had already.

103
00:08:15,390 --> 00:08:19,170
So for now what we could do is take this off.

104
00:08:19,170 --> 00:08:25,560
Now let's take all this off and then focus on just the scheduler.

105
00:08:25,560 --> 00:08:29,010
So we have scheduler callback and that's fine.

106
00:08:29,700 --> 00:08:34,890
Let's ensure that the cells have been run already, so that should be okay.

107
00:08:35,130 --> 00:08:40,650
Now we have this error: verbosity, unexpected keyword argument.

108
00:08:40,650 --> 00:08:48,510
Let's get back to this here. We actually have verbose, so let's modify that to verbose

109
00:08:48,510 --> 00:08:49,230
equals one.
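Putting the pieces together, the corrected cell might look roughly like the sketch below. The three-epoch threshold and the `verbose=1` keyword follow what is said here; the decay factor exp(-0.1) is an assumption, and the guarded import is only there so the snippet still runs where TensorFlow is not installed.

```python
import math

try:
    import tensorflow as tf  # optional: only needed to build the actual callback
except ImportError:
    tf = None

def scheduler(epoch, lr):
    """Keep lr fixed for the first 3 epochs, then decay it exponentially."""
    if epoch < 3:
        return lr
    return lr * math.exp(-0.1)  # assumed decay factor

scheduler_callback = None
if tf is not None:
    # The keyword is `verbose`, not `verbosity`; 1 prints the rate set at each epoch.
    scheduler_callback = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)

# Usage, assuming a compiled `model` and a `train_dataset` already exist:
# model.fit(train_dataset, epochs=10, callbacks=[scheduler_callback])
```

Passing `verbosity=1` instead is what triggers the unexpected keyword argument error mentioned above.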

110
00:08:49,230 --> 00:08:50,400
We run that again.

111
00:08:50,940 --> 00:08:51,750
That should be fine.

112
00:08:51,750 --> 00:08:54,660
We could now get back to our training.

113
00:08:54,660 --> 00:09:01,050
So we're expecting that below three epochs, let's say equal three epochs, we have a given learning

114
00:09:01,050 --> 00:09:05,850
rate and then above that we have a learning rate which decreases exponentially.

115
00:09:05,970 --> 00:09:11,640
So it is here that the training process starts.

116
00:09:11,640 --> 00:09:18,540
We do not really need to print out any value because when we set verbose to one, it outputs this.

117
00:09:18,540 --> 00:09:23,700
So here we have LearningRateScheduler setting learning rate to 0.00999.

118
00:09:23,730 --> 00:09:30,750
That's practically 0.01, which is what we gave here as part of this

119
00:09:30,750 --> 00:09:31,800
learning rate setting.

120
00:09:31,800 --> 00:09:38,280
And then as time goes on, it's going to decrease this value and then always output the current learning

121
00:09:38,280 --> 00:09:38,910
rate.

122
00:09:39,780 --> 00:09:47,400
So as we carry out this training process, recall that the aim of working with these kinds

123
00:09:47,400 --> 00:09:54,510
of learning rate schedulers is that we want to actually get the best of both worlds.

124
00:09:54,510 --> 00:10:02,070
So what we want is speed, because a very small learning rate doesn't ensure speed.

125
00:10:02,070 --> 00:10:06,240
And so we want speed, and then we also want

126
00:10:07,330 --> 00:10:09,070
stability when training.

127
00:10:09,070 --> 00:10:16,750
So because we want these two, we are going to modify our learning rates such that we always get

128
00:10:16,750 --> 00:10:19,360
this throughout our training process.

129
00:10:19,360 --> 00:10:25,390
And so that's why whenever we're starting with training, we have high learning rates which could ensure

130
00:10:25,390 --> 00:10:26,200
speed.

131
00:10:26,200 --> 00:10:33,130
And then as soon as we have trained for a given period of time, for a given number of epochs, and

132
00:10:33,130 --> 00:10:36,490
we're trying to approach this global minimum right here,

133
00:10:36,490 --> 00:10:43,630
we now seek stability by reducing the learning rate so that we don't get to this point and have
 
134
00:10:43,630 --> 00:10:44,430
to diverge.

135
00:10:44,440 --> 00:10:52,150
So if we want a more stable kind of training process, what we do is reduce this learning rate.

136
00:10:53,280 --> 00:10:55,650
After the training is complete, here's what we get.

137
00:10:55,680 --> 00:11:02,910
You could see that we have this learning rate, which is now modified after a given number of epochs.

138
00:11:02,910 --> 00:11:10,830
So you see how the learning rate starts decreasing as we go on with the training process.

139
00:11:11,800 --> 00:11:14,860
And so that's how we implement learning rate scheduling.

140
00:11:14,980 --> 00:11:24,550
But before moving on, let's check out this tutorial provided by the MXNet project developers, which

141
00:11:24,550 --> 00:11:28,900
covers other learning rate scheduling techniques.

142
00:11:28,930 --> 00:11:32,770
So here we have this learning rate scheduling with warm up.

143
00:11:33,640 --> 00:11:41,470
Here it's slanted triangular, and as you could see, the learning rate actually falls between these

144
00:11:41,470 --> 00:11:42,910
two values here.

145
00:11:42,910 --> 00:11:48,020
So we have our learning rate between one and two.

146
00:11:48,040 --> 00:11:53,650
So you could always define your max learning rate and then your min learning rate.

147
00:11:53,650 --> 00:11:55,240
So you could always have this.

148
00:11:55,240 --> 00:12:06,130
And then with the warm up, what we do is we are going to increase this learning rate linearly up to

149
00:12:06,820 --> 00:12:14,680
the max for a given number of epochs, and then once we attain this, once we get to this number of

150
00:12:14,680 --> 00:12:19,100
epochs, we now start decreasing this learning rate.

151
00:12:19,120 --> 00:12:23,740
So yeah, the decrease is linear, so it's a linear function we're using.

152
00:12:23,740 --> 00:12:28,960
And then once it gets to the minimum learning rate, we just maintain that constant value right up to

153
00:12:28,960 --> 00:12:29,550
the end.
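As a rough sketch of this slanted triangular shape: the bounds of 1 and 2 are the values read off the plot, while the function name and the warm-up and decay lengths (5 and 15 epochs) are assumptions picked only for illustration.

```python
def triangular_schedule(epoch, min_lr=1.0, max_lr=2.0, warmup_epochs=5, decay_epochs=15):
    """Linear warm-up to max_lr, linear decay back to min_lr, then constant at min_lr."""
    if epoch < warmup_epochs:
        # Warm-up: rise linearly from min_lr towards max_lr.
        return min_lr + (max_lr - min_lr) * epoch / warmup_epochs
    if epoch < warmup_epochs + decay_epochs:
        # Decay: fall linearly from max_lr back to min_lr.
        progress = (epoch - warmup_epochs) / decay_epochs
        return max_lr - (max_lr - min_lr) * progress
    # After the decay phase, hold the minimum rate until the end of training.
    return min_lr

for epoch in (0, 5, 10, 20, 30):
    print(epoch, triangular_schedule(epoch))
```

To use something like this with the scheduler callback, the function would be wrapped so that it accepts the epoch and the current rate and simply ignores the latter.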

154
00:12:29,560 --> 00:12:31,660
So another

155
00:12:33,710 --> 00:12:39,470
learning rate scheduling technique we could use here has this linear increase, that's the warm

156
00:12:39,470 --> 00:12:46,700
up and then we decrease this exponentially right up to this minimum value.

157
00:12:47,730 --> 00:12:56,130
Now, this technique of warm up was developed in this paper by Priya Goyal et al., and they found that

158
00:12:56,130 --> 00:13:01,920
having a smooth, linear warm up in the learning rate at the start of the training improved the stability

159
00:13:01,920 --> 00:13:05,310
of the optimizer and led to better solutions.

160
00:13:05,550 --> 00:13:13,680
Then from here we move on to the cosine learning rate scheduling method, where there is this smooth decrease

161
00:13:13,680 --> 00:13:18,600
in the learning rate, which resembles the cosine function.

162
00:13:18,630 --> 00:13:22,140
Now this is what the cosine function actually looks like.

163
00:13:22,140 --> 00:13:25,950
So here we have this and that.

164
00:13:25,950 --> 00:13:28,530
So here is our cosine function.

165
00:13:28,530 --> 00:13:36,360
And then if you notice, you'll find that this portion actually looks similar to what we have here.

166
00:13:36,720 --> 00:13:41,460
And now, after we get to a given number of epochs, we just maintain that.

167
00:13:41,460 --> 00:13:44,190
So we have defined the min and then the max.

168
00:13:44,190 --> 00:13:47,550
And then once we get to the min, we just maintain that min throughout.
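A minimal sketch of a cosine-shaped decay that then holds the minimum rate could look like this. It uses the first half-period of the cosine function; the bounds of 1 and 2, the 20-epoch decay length, and the name `cosine_schedule` are assumptions for illustration.

```python
import math

def cosine_schedule(epoch, min_lr=1.0, max_lr=2.0, decay_epochs=20):
    """Follow half a cosine wave from max_lr down to min_lr, then stay at min_lr."""
    if epoch >= decay_epochs:
        return min_lr
    # cos goes from 1 to -1 over [0, pi], so this factor goes smoothly from 1 to 0.
    factor = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epochs))
    return min_lr + (max_lr - min_lr) * factor

for epoch in (0, 5, 10, 15, 20, 25):
    print(epoch, cosine_schedule(epoch))
```

The rate changes slowly near the start and near the end of the decay and fastest in the middle, which is the smooth shape seen in the plot.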

169
00:13:47,550 --> 00:13:51,390
Then we also have this stepwise decay scheduling.

170
00:13:51,690 --> 00:13:53,250
Let's take this off here.

171
00:13:53,250 --> 00:13:55,800
You see how we start with the warm up.

172
00:13:55,800 --> 00:14:02,010
So this warm up is kind of like a method used very much in practice.

173
00:14:02,010 --> 00:14:06,240
And then from here we have these stepwise reductions.

174
00:14:06,240 --> 00:14:09,930
So we go in for this fixed learning rate.

175
00:14:09,930 --> 00:14:15,900
After a given number of epochs, we now drop the learning rate and then we drop it and so on and so

176
00:14:15,900 --> 00:14:16,170
forth.

177
00:14:16,170 --> 00:14:21,330
So here is our stepwise scheduling with warm up.

178
00:14:21,600 --> 00:14:24,630
Then from here we have this cool down.

179
00:14:26,270 --> 00:14:32,990
Where we follow the stepwise method, and then after a given number of epochs, we now reduce the learning

180
00:14:32,990 --> 00:14:34,180
rate linearly.

181
00:14:34,220 --> 00:14:36,560
Hence the term cool down.

182
00:14:36,890 --> 00:14:46,310
We also have this one-cycle scheduling technique proposed by Leslie Smith and Nicholay Topin.

183
00:14:47,000 --> 00:14:53,660
And this is what it looks like. You see that we have this warm up and then this linear decrease.

184
00:14:53,660 --> 00:15:00,080
Once we get to this initial position, we also have again this linear decrease with a different slope

185
00:15:00,620 --> 00:15:01,340
with a

186
00:15:02,350 --> 00:15:03,670
smaller slope.

187
00:15:03,670 --> 00:15:12,160
And then once we get to this minimum, or this final minimum, we just maintain the

188
00:15:12,160 --> 00:15:13,120
learning rate.

189
00:15:14,260 --> 00:15:21,250
Now, finally, we have the cyclical scheduling methods originally proposed by Leslie Smith.

190
00:15:21,700 --> 00:15:27,640
The idea of cyclically increasing and decreasing the learning rate

191
00:15:27,640 --> 00:15:32,020
has been shown to give faster convergence and more optimal solutions.

192
00:15:32,020 --> 00:15:38,230
So as you could see here, we have the learning rate, which is going to be bouncing from the

193
00:15:38,230 --> 00:15:40,630
minimum to the maximum, as you can see here.

194
00:15:40,630 --> 00:15:47,080
So we have this linear increase to the max and then linear drop, linear increase, linear drop, and so

195
00:15:47,080 --> 00:15:48,040
on and so forth.
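The bouncing pattern can be sketched as a triangular wave. The bounds of 1 and 2 again come from the plot, while the name `cyclical_schedule` and the 10-epoch cycle length are assumptions chosen only to keep the example short.

```python
def cyclical_schedule(epoch, min_lr=1.0, max_lr=2.0, cycle_length=10):
    """Triangular cyclical rate: linear rise to max_lr, linear fall to min_lr, repeated."""
    half = cycle_length / 2
    position = epoch % cycle_length                   # where we are inside the current cycle
    distance_from_peak = abs(position - half) / half  # 1 at the cycle edges, 0 at the peak
    return min_lr + (max_lr - min_lr) * (1.0 - distance_from_peak)

for epoch in (0, 2, 5, 8, 10, 15, 20):
    print(epoch, cyclical_schedule(epoch))
```

Each cycle starts at the minimum, reaches the maximum halfway through, and returns to the minimum before the next cycle begins.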

196
00:15:48,610 --> 00:15:55,090
And then finally we have this cyclical cosine annealing scheduling right here.

197
00:15:55,090 --> 00:16:02,320
We see how we have the cosine annealing and then we have this full cyclical process as we go from the highest

198
00:16:02,320 --> 00:16:03,010
to the lowest.

199
00:16:03,010 --> 00:16:11,680
And then this next cycle goes from the midpoint between the min and the max, that's this, right to the min

200
00:16:11,680 --> 00:16:12,970
learning rate.

201
00:16:12,970 --> 00:16:18,160
And then we take up this final cycle from this quarter of the total.

202
00:16:18,160 --> 00:16:21,370
So we had 0.5, which is a quarter of two.

203
00:16:21,370 --> 00:16:25,440
So we start from this and then right to the min.

204
00:16:25,450 --> 00:16:32,290
Now notice also that as we go into the cycle, the cycle lengths keep increasing.

205
00:16:32,290 --> 00:16:39,910
So we move from this to this length and finally this length, based on the schedule you want to implement,

206
00:16:39,910 --> 00:16:44,250
all you need to do is to modify this scheduler method right here.

207
00:16:44,260 --> 00:16:48,760
The next callback we'll be looking at is that of model checkpointing.

208
00:16:48,760 --> 00:16:56,080
With this ModelCheckpoint callback, we are able to save the model's weights at some frequency.

209
00:16:56,080 --> 00:17:05,500
So previously, after training our model, we get to load, or

210
00:17:05,500 --> 00:17:08,260
rather we get to save, the model like here.

211
00:17:08,260 --> 00:17:12,460
So here we save the model or we save our weights

212
00:17:12,460 --> 00:17:21,280
after we're done with the training. Here, we could be doing this weight saving during the training process,

213
00:17:21,280 --> 00:17:24,880
thanks to this ModelCheckpoint callback.