1
00:00:00,120 --> 00:00:05,940
Hello, everyone, and welcome to this new and exciting session in which we are going to discuss the

2
00:00:05,940 --> 00:00:07,260
ResNet model.

3
00:00:07,290 --> 00:00:13,830
This model was first introduced in the paper entitled "Deep Residual Learning for Image Recognition"

4
00:00:13,830 --> 00:00:16,950
by Kaiming He et al. To date,

5
00:00:16,950 --> 00:00:19,140
that's about seven years later,

6
00:00:19,140 --> 00:00:26,790
this model is still widely used, and the high performance obtained when working with the ResNet model

7
00:00:26,790 --> 00:00:33,420
comes from the fact that the ResNet model relies on this residual block right here, which permits

8
00:00:33,420 --> 00:00:43,140
us to get even better error rates, as we can see here, compared to the VGG and GoogLeNet models.

9
00:00:43,140 --> 00:00:51,030
In this section, we are going to focus on understanding how this residual block works and how the ResNet

10
00:00:51,030 --> 00:00:54,450
model is constructed based on this block.

11
00:00:54,570 --> 00:01:02,910
These curves right here depict the limits of models like VGG, which are just based on stacking up

12
00:01:02,910 --> 00:01:06,550
conv layers. To best understand this plot,

13
00:01:06,570 --> 00:01:10,740
recall that with AlexNet we had fewer layers.

14
00:01:10,740 --> 00:01:18,000
So we started out with AlexNet, with fewer layers, and then we moved on to VGG, where we had

15
00:01:18,000 --> 00:01:23,940
the VGG-16 version, then the VGG-19 version.

16
00:01:23,940 --> 00:01:30,090
And then we expect that if we keep increasing this number of layers, then the error rates should be

17
00:01:30,090 --> 00:01:30,800
dropping.

18
00:01:30,810 --> 00:01:34,890
But what happens actually is the opposite.

19
00:01:34,890 --> 00:01:43,290
So what goes on here is, you see, for this 20-layer network we have a lower error rate as compared to

20
00:01:43,290 --> 00:01:45,930
this 56-layer network.

21
00:01:45,930 --> 00:01:52,020
So we expect that this instead should be lower than this, because here we stack more conv

22
00:01:52,050 --> 00:01:54,930
layers, but we instead get the opposite.

23
00:01:55,080 --> 00:01:59,340
Now this same phenomenon is witnessed with the test set.

24
00:01:59,340 --> 00:02:01,560
So here there's test, and there's train.

25
00:02:01,560 --> 00:02:08,130
So for the test too, we have the 20-layer network performing better than the 56-layer one.

26
00:02:08,700 --> 00:02:17,820
And so it's clear that just blindly stacking up conv layers wouldn't help in dropping

27
00:02:17,820 --> 00:02:21,990
the training and test errors, even though the deeper models are more expensive.

28
00:02:22,740 --> 00:02:30,510
And this is why the ResNet model introduces this residual learning, which is based on the residual

29
00:02:30,510 --> 00:02:31,890
block which we've seen already.

30
00:02:31,890 --> 00:02:33,390
Here is the residual block.

31
00:02:33,690 --> 00:02:39,390
Now note that this weight layer here, and this other weight layer here, are simply convolutional layers.

32
00:02:39,840 --> 00:02:47,400
And so now, unlike before, where we'd just stack this here (let's call this WL, for weight

33
00:02:47,400 --> 00:02:53,640
layer), we stack this weight layer, and then we stack this other one, and then we keep stacking up

34
00:02:53,640 --> 00:02:54,540
that way.

35
00:02:55,420 --> 00:02:57,190
We'll just keep stacking up like this.

36
00:02:57,210 --> 00:03:03,900
Now what we do is we create a connection between the input and the output.

37
00:03:04,410 --> 00:03:08,730
So we create this connection, and then here there's an addition.

38
00:03:08,730 --> 00:03:16,350
So we get this output and we add it to this other output right here to produce this new output.

39
00:03:16,350 --> 00:03:24,390
So here, if we suppose that this input is x, and then what goes on in here is F(x)... let's change

40
00:03:24,390 --> 00:03:25,710
this to red.

41
00:03:25,710 --> 00:03:30,060
What goes on in here is F(x), let's write this.

42
00:03:30,910 --> 00:03:40,330
Here we have this F(x); then our output can be given as H(x), which is simply equal to

43
00:03:40,330 --> 00:03:42,700
F(x) + x.

44
00:03:42,700 --> 00:03:46,150
That's the output of the weight layers plus the input,

45
00:03:46,510 --> 00:03:50,050
to produce this new output H(x).
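
Note: the relation H(x) = F(x) + x described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the names `relu`, `dense` and `residual_block` are hypothetical, small weight matrices stand in for the convolutional weight layers, and the ReLU that the paper applies after the addition is left out.

```python
def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def dense(values, weights):
    """Multiply a vector by a weight matrix (a stand-in for a conv weight layer)."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Return H(x) = F(x) + x, where F is two weight layers with a ReLU in between."""
    f_x = dense(relu(dense(x, w1)), w2)
    # Element-wise addition of the skip connection: shapes of F(x) and x must match.
    return [f + xi for f, xi in zip(f_x, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights F(x) is zero, so the block simply copies its input.
print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]
```

With zero weights the block behaves as an identity mapping, which is the property the rest of this section relies on.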

46
00:03:51,010 --> 00:04:00,160
But again, to better understand why we need this residual block right here, we need to first understand

47
00:04:00,160 --> 00:04:08,290
why models which are based on just stacking up conv layers, like VGG, actually underfit, even

48
00:04:08,290 --> 00:04:11,680
when you increase the number of layers.

49
00:04:11,800 --> 00:04:18,450
Now the reason for that is exploding and/or vanishing gradients.

50
00:04:18,460 --> 00:04:24,910
So let's explain what it means for gradients to be vanishing.

51
00:04:25,120 --> 00:04:30,040
Now recall that in the gradient descent process we have a weight.

52
00:04:30,190 --> 00:04:35,680
And this weight is updated such that we take its previous value.

53
00:04:35,680 --> 00:04:43,840
So we have the previous value of the weight minus the learning rate, let's call it LR, generally

54
00:04:43,840 --> 00:04:51,790
denoted alpha, so minus this learning rate times the partial derivative of the loss with respect to that

55
00:04:51,790 --> 00:04:52,810
given weight.

56
00:04:53,590 --> 00:05:00,430
Now, during the training process, in order to compute this partial derivative right here very efficiently,

57
00:05:00,520 --> 00:05:07,820
the method used is backpropagation, or rather, one of the most common methods used is backpropagation.

58
00:05:07,840 --> 00:05:13,720
Now, the way backpropagation works is that you have, say, this model, let's call this model M,

59
00:05:13,720 --> 00:05:17,160
and then you have some input right here, with the output.

60
00:05:17,170 --> 00:05:25,870
Obviously you have what the model outputs; let's call the model output y hat. And then y,

61
00:05:25,930 --> 00:05:31,390
that's not from the model; we have y right here, which is what the model is expected to output.

62
00:05:31,420 --> 00:05:38,590
Now it's this difference that produces the loss, and we're finding the partial derivative of this

63
00:05:38,710 --> 00:05:46,570
difference with respect to each and every weight which makes up this model.

64
00:05:47,290 --> 00:05:52,280
Then if we split this up into different layers, let's just say we have different layers like this.

65
00:05:52,300 --> 00:05:58,360
Note that each layer is composed of several weights, but let's say we just split this up like

66
00:05:58,360 --> 00:05:58,790
this.

67
00:05:58,810 --> 00:06:03,010
Now, this layer has its own weights.

68
00:06:03,010 --> 00:06:10,300
But one point to note is that during the backpropagation process, to obtain the partial derivative

69
00:06:10,300 --> 00:06:17,770
of the loss with respect to this weight here, we make use of the partial derivatives of the loss with

70
00:06:17,770 --> 00:06:25,770
respect to the weights which come after this layer right here.

71
00:06:25,780 --> 00:06:35,260
So to get this here, we will need these different partial derivatives. Let's get back.

72
00:06:36,670 --> 00:06:42,820
To get, for example, the partial derivative with respect to this weight here,

73
00:06:42,820 --> 00:06:46,510
we will need these others right here.

74
00:06:46,810 --> 00:06:55,180
Now, before going on to explain the problem with

75
00:06:55,180 --> 00:07:02,980
this, we need to understand that, for example (let's take this one),

76
00:07:02,980 --> 00:07:07,990
the partial derivative of the loss with respect to these different weights

77
00:07:07,990 --> 00:07:09,450
(there are many weights in this layer,

78
00:07:09,460 --> 00:07:13,150
let's say the weights for this layer) is actually equal to

79
00:07:13,510 --> 00:07:22,630
some value, let's say alpha one, times the partial derivative of the loss

80
00:07:22,630 --> 00:07:25,630
with respect to these weights here.

81
00:07:25,630 --> 00:07:29,800
So let's consider that this is the sixth layer:

82
00:07:29,800 --> 00:07:33,160
one, two, three, four, five, six, so WL6.

83
00:07:33,160 --> 00:07:35,050
Here we have WL7, layer seven.

84
00:07:35,050 --> 00:07:44,230
So because we're multiplying here, it means that if, while getting this partial derivative, we obtain

85
00:07:44,230 --> 00:07:46,780
a value very close to zero.

86
00:07:46,810 --> 00:07:55,930
If we get a value very close to zero, say 0.000001, for example, it means that it's going to affect

87
00:07:55,930 --> 00:08:01,210
this other partial derivative, in the sense that this too will be a very small value.

88
00:08:01,210 --> 00:08:10,570
And if these partial derivatives are too small, then we will not get a change in this weight, because

89
00:08:10,570 --> 00:08:18,070
you have the new weight you're trying to get being equal to the previous weight minus a very small value here.

90
00:08:18,070 --> 00:08:22,480
And so there will be little or no change in the weights.

91
00:08:22,480 --> 00:08:30,220
And that is why, even though you keep increasing this number of layers... let's take this off.

92
00:08:30,490 --> 00:08:36,940
Even though we keep increasing this number of layers, we cannot achieve better performance due to

93
00:08:36,940 --> 00:08:44,320
this vanishing gradient problem, as the model is now finding it difficult to update its weights such

94
00:08:44,320 --> 00:08:50,980
that the training error can be decreased, since the gradients right here are vanishing,

95
00:08:50,980 --> 00:08:53,580
that is, getting towards zero.
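
Note: the argument above (a gradient built from a chain of multiplications shrinks towards zero, so the update "previous weight minus learning rate times gradient" barely moves the weight) can be illustrated with a toy calculation. This is a simplified sketch, not real backpropagation: the function names are hypothetical, and the per-layer factor of 0.5 is an arbitrary assumption chosen only to show the trend.

```python
def chained_gradient(local_factors, upstream=1.0):
    """Multiply an upstream gradient by each layer's local factor, as the chain rule does."""
    grad = upstream
    for factor in local_factors:
        grad *= factor
    return grad


def sgd_step(weight, grad, learning_rate):
    """One gradient descent update: w_new = w_old - learning_rate * dL/dw."""
    return weight - learning_rate * grad


# Assumed toy setting: every layer scales the gradient by 0.5.
shallow = chained_gradient([0.5] * 5)   # gradient reaching a weight 5 layers back
deep = chained_gradient([0.5] * 50)     # gradient reaching a weight 50 layers back

print(shallow, deep)
print(sgd_step(1.0, shallow, 0.1))  # the weight moves noticeably
print(sgd_step(1.0, deep, 0.1))     # the weight is practically unchanged
```

The deeper chain leaves a gradient so small that the early weight hardly changes, which is the behaviour described above.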

96
00:08:53,590 --> 00:09:02,200
So now we've just seen that making our network deeper or increasing the number of layers makes it difficult

97
00:09:02,200 --> 00:09:07,480
to propagate information from one far end to the other end.

98
00:09:07,960 --> 00:09:18,100
And so what the authors suppose is that if the added layers can be constructed as identity mappings,

99
00:09:18,160 --> 00:09:24,980
a deeper model should have training error no greater than its shallower counterpart.

100
00:09:25,000 --> 00:09:31,930
So this means that you have this model, let's suppose this is the shallower model,

101
00:09:31,930 --> 00:09:34,570
and then this is the deeper model right here.

102
00:09:35,920 --> 00:09:42,820
We are saying that if we construct this deeper model such that it's identical to the shallower model,

103
00:09:42,820 --> 00:09:46,620
so basically it's the same as this here (let's change this to blue),

104
00:09:46,630 --> 00:09:54,340
so we have this same shallower model, and then the remaining layers here, these other layers here, are

105
00:09:54,340 --> 00:10:03,160
constructed such that each is the identity function, a group of identity functions which are stacked

106
00:10:03,160 --> 00:10:04,050
together,

107
00:10:04,060 --> 00:10:13,420
then the training error of this one, although deeper, shouldn't be greater than this training error

108
00:10:13,420 --> 00:10:17,470
right here or the training error of this shallower model.

109
00:10:18,310 --> 00:10:26,200
And so this means that if we want to pass information from this point to this point right here, and

110
00:10:26,200 --> 00:10:35,740
the weights or the values in here are damping this information, then there is this path right

111
00:10:35,740 --> 00:10:37,960
here which permits us to

112
00:10:38,950 --> 00:10:43,960
simply copy this input information to the output.

113
00:10:44,980 --> 00:10:49,120
And obviously this is simply the identity function.

114
00:10:49,120 --> 00:10:55,440
And so here, if after passing through, say, 20 layers, we get to this point (let's consider that

115
00:10:55,450 --> 00:10:57,640
each and every one of these is a single layer),

116
00:10:57,640 --> 00:11:03,430
so we've gone through 20 layers and we've gotten to this point where the values we get here are

117
00:11:03,430 --> 00:11:11,320
almost zero, such that when this information passes through, it's also going to be practically zero,

118
00:11:11,320 --> 00:11:19,330
then there is this path which at least restores this exact same input we have here.

119
00:11:19,480 --> 00:11:28,480
And so this means that, just as the authors of the paper supposed, in this example we took here, if

120
00:11:28,480 --> 00:11:38,680
we make our model or neural network deeper by adding these residual blocks, then there will be no increase

121
00:11:38,680 --> 00:11:40,000
in the error rate.

122
00:11:40,750 --> 00:11:47,410
And in practice, this instead leads to a decrease in the error rate, which is exactly what we want.

123
00:11:48,340 --> 00:11:56,320
And one other argument which accounts for the fact that these residual blocks help in improving the performance

124
00:11:56,320 --> 00:12:05,770
of the model is the fact that, since we have several paths, this residual model now looks like a combination

125
00:12:05,770 --> 00:12:07,540
of several shallow models.

126
00:12:07,540 --> 00:12:09,790
So it looks like you're combining different...

127
00:12:09,790 --> 00:12:11,500
Let's draw it better.

128
00:12:11,500 --> 00:12:12,610
Let's get back.

129
00:12:12,640 --> 00:12:22,150
It looks like we're combining this shallow model here with this other shallow model with this other

130
00:12:22,150 --> 00:12:31,690
shallow model and then producing what we call an ensemble of models or an ensemble of shallow models,

131
00:12:31,690 --> 00:12:39,520
to be more precise, which helps in making the overall model much more performant as compared to when

132
00:12:39,520 --> 00:12:41,560
we just have a single path.

133
00:12:42,340 --> 00:12:47,620
So another way you can look at this is that for this, let's take this first shallow model.

134
00:12:47,620 --> 00:12:54,260
You could have information which passes this way and then gets here and then goes this way.

135
00:12:54,280 --> 00:12:55,660
So that's the first model.

136
00:12:55,660 --> 00:13:03,910
And then another time you could have information which goes this way or goes this way, goes this way,

137
00:13:03,910 --> 00:13:05,980
and you have this other model.

138
00:13:05,980 --> 00:13:13,390
You could have another situation where the information goes this way, goes this way, and then goes just

139
00:13:13,390 --> 00:13:22,510
straight, giving us this other model. Later, in the paper entitled "Visualizing the Loss Landscape

140
00:13:22,510 --> 00:13:30,220
of Neural Nets", Li et al. produced this visualization here, which shows

141
00:13:30,820 --> 00:13:36,970
a ResNet without skip connections and a ResNet with skip connections.

142
00:13:36,970 --> 00:13:46,180
And it shows how easy it is for the weights to find their way, or to get to the optimal

143
00:13:46,180 --> 00:13:53,350
weights which minimize the loss, as compared to when there are no skip connections here.

144
00:13:53,380 --> 00:13:59,860
It should also be noted that this addition we have here is an element-wise addition, and so we have

145
00:13:59,860 --> 00:14:07,300
to ensure that the dimensions of this input match the dimensions of the output we have here for

146
00:14:07,300 --> 00:14:09,820
this operation to be valid.

147
00:14:09,850 --> 00:14:16,720
Now, in the case where these two aren't equal, we need to make some modifications

148
00:14:16,720 --> 00:14:19,060
to this skip connection right here.

149
00:14:19,480 --> 00:14:27,970
Now, when we look at these three models compared, we have this VGG-19, we have this 34-layer plain

150
00:14:28,180 --> 00:14:31,270
convolutional network,

151
00:14:31,270 --> 00:14:35,110
then we have this 34-layer residual network.

152
00:14:35,110 --> 00:14:39,370
We find that we have these skip connections.

153
00:14:39,370 --> 00:14:41,430
That's our residual block.

154
00:14:41,440 --> 00:14:43,750
So this is our residual block right here.

155
00:14:43,750 --> 00:14:49,500
We stack these residual blocks now; instead of just stacking the conv layers as we did with the VGGs,

156
00:14:49,510 --> 00:14:55,570
now we stack the residual blocks. And then you can see here, sometimes this line is solid,

157
00:14:55,570 --> 00:14:57,400
sometimes it is dotted.

158
00:14:57,430 --> 00:15:03,640
Now when it's dotted like this, it's simply because there's going to be a change in the dimensions.

159
00:15:03,640 --> 00:15:08,530
So if you notice here, every time we have these dotted lines, you find that there is

160
00:15:08,530 --> 00:15:10,870
a change here in the number of channels.

161
00:15:10,870 --> 00:15:17,680
So here you have 64 channels, and then here you have 128.

162
00:15:18,310 --> 00:15:26,620
And so since we get in this 64-channel input and we want to match this with the 128-channel output we get in here,

163
00:15:26,620 --> 00:15:29,530
we need to make some adjustments here.

164
00:15:29,560 --> 00:15:37,390
Now, as we can see in the paper here, there are actually two ways of making these adjustments.

165
00:15:37,390 --> 00:15:41,950
The first way is this, (A); the next one is this, (B). For

166
00:15:41,950 --> 00:15:49,330
(A), the shortcut still performs identity mapping, with extra zero entries padded for increasing the

167
00:15:49,330 --> 00:15:50,380
dimensions.

168
00:15:50,380 --> 00:15:55,330
So to get from 64 to 128, we add these extra zero entries.

169
00:15:55,840 --> 00:16:00,190
Then either we do that or we take (B).

170
00:16:00,190 --> 00:16:07,660
That's the projection shortcut, which was presented in equation two, used to match the dimensions,

171
00:16:07,660 --> 00:16:11,560
and this is actually done using one-by-one convolutions.

172
00:16:12,490 --> 00:16:18,550
Now to better understand how one-by-one convolutions work, let's take this example where we have

173
00:16:18,550 --> 00:16:24,370
this input of size ten by ten. As for the kernel size, since it's one by one, the kernel size is one.

174
00:16:24,370 --> 00:16:28,660
So we have just this one weight right here.

175
00:16:28,660 --> 00:16:34,260
And then you see you just go through each and every pixel value.

176
00:16:34,270 --> 00:16:43,030
So here we have this and then notice that the input is the same shape as the output.

177
00:16:43,660 --> 00:16:49,870
And so if, for example, you have this input made of two channels, let's add a second channel.

178
00:16:49,870 --> 00:16:51,820
Let's suppose this input has two channels.

179
00:16:51,820 --> 00:17:02,200
So we have this one channel and this other channel right here. Then, to obtain an output of, say, four channels...

180
00:17:02,200 --> 00:17:04,750
So we have this input with two channels.

181
00:17:04,750 --> 00:17:07,870
So it's ten by ten by two, two channels.

182
00:17:07,870 --> 00:17:16,120
And if we want to have this output to get to four channels, all we need to do now is just make use

183
00:17:16,120 --> 00:17:21,490
of the one-by-one convolution, and then

184
00:17:21,490 --> 00:17:27,460
we have four of these different weights, that is, four different kernels,

185
00:17:27,460 --> 00:17:28,870
since it's obviously one by one.

186
00:17:28,870 --> 00:17:36,190
If it were three by three, then we would have something like this; we would have four of this.

187
00:17:36,190 --> 00:17:37,450
Now it's one by one.

188
00:17:37,450 --> 00:17:39,370
We have here four of this.

189
00:17:39,370 --> 00:17:48,010
And then in that case, this one will give this output, then this other one will give this other channel

190
00:17:48,040 --> 00:17:48,730
here.

191
00:17:50,240 --> 00:17:50,960
This other

192
00:17:50,990 --> 00:17:52,450
one (let's change the color),

193
00:17:52,460 --> 00:17:57,680
this other one will give another channel, which we'll just add here.

194
00:17:59,190 --> 00:18:08,040
Then this other one here, let's take that to be green, will produce this other channel.

195
00:18:08,040 --> 00:18:12,150
So that's how we can move from these two channels to four channels.

196
00:18:12,150 --> 00:18:20,820
And so in the case of the ResNets, where we get some input, let's say a 64-channel

197
00:18:20,820 --> 00:18:24,030
input, and we want that at the end

198
00:18:25,130 --> 00:18:34,640
what we have here should match with this output, which is already 128, 128

199
00:18:34,670 --> 00:18:35,210
here,

200
00:18:35,390 --> 00:18:44,180
then we could add these one-by-one convolutions with a certain number of filters such that we

201
00:18:44,180 --> 00:18:47,380
could get this desired number of channels here.

202
00:18:47,390 --> 00:18:56,840
So here now we can have 128 one-by-one filters, and then this now will match up with this output right

203
00:18:56,840 --> 00:19:00,830
here such that we could carry out the element-wise addition.
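
Note: the way a one-by-one convolution changes the number of channels, as in the two-to-four channel example above, can be sketched in plain Python. This is an illustrative sketch under assumptions: `conv1x1` is a hypothetical helper, biases are ignored, and the kernel weights are arbitrary. Each one-by-one kernel holds one weight per input channel and produces one output channel.

```python
def conv1x1(feature_map, kernels):
    """Apply a 1x1 convolution (no bias).

    feature_map: height x width x in_channels nested lists.
    kernels: out_channels x in_channels weights, one row per 1x1 kernel.
    """
    return [
        [
            # Each kernel mixes the channels of a single pixel into one output value.
            [sum(w * c for w, c in zip(kernel, pixel)) for kernel in kernels]
            for pixel in row
        ]
        for row in feature_map
    ]


# A 10x10 input with 2 channels, as in the example above.
image = [[[1.0, 2.0] for _ in range(10)] for _ in range(10)]
# Four 1x1 kernels (arbitrary illustrative weights) give four output channels.
kernels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

out = conv1x1(image, kernels)
print(len(out), len(out[0]), len(out[0][0]))  # 10 10 4
```

The spatial size stays 10 by 10 while the channel count goes from 2 to 4; using 128 such kernels on a 64-channel input would, in the same way, give the 128 channels needed for the element-wise addition.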

204
00:19:01,250 --> 00:19:08,480
And now the difference with VGG and other previous convnets is that instead of making use of the

205
00:19:08,480 --> 00:19:14,060
max pool, as you have here (you see this pooling layer with pool size two),

206
00:19:14,090 --> 00:19:21,680
what we do here is use a three-by-three convolutional layer, but with strides.

207
00:19:21,680 --> 00:19:30,410
So we specify a stride of two, and this permits us to downsample these feature maps right here.

208
00:19:31,070 --> 00:19:32,760
Again, we could check this out here.

209
00:19:32,780 --> 00:19:34,610
Let's suppose we have this three.

210
00:19:34,610 --> 00:19:41,110
And if we want to downsample this, what we could do is increase the stride.

211
00:19:41,120 --> 00:19:42,620
Let's take that to two.

212
00:19:42,620 --> 00:19:44,990
And you see, this is going to be downsampled.

213
00:19:44,990 --> 00:19:51,550
And if we take the padding to one, you find that the output is half of what we have here.
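
Note: the halving seen in the demo follows from the usual output-size formula for a convolution, floor((n + 2p - k) / s) + 1, with input size n, kernel size k, stride s and padding p. The helper below is a hypothetical sketch, and the input sizes 56 and 10 are just illustrative choices.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 3x3 kernel, padding 1: stride 1 keeps the size, stride 2 halves it.
print(conv_output_size(56, 3, stride=1, padding=1))  # 56
print(conv_output_size(56, 3, stride=2, padding=1))  # 28
print(conv_output_size(10, 3, stride=2, padding=1))  # 5
```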

214
00:19:51,560 --> 00:19:59,120
Then the authors also make use of augmentation and then batch normalization, where they apply this batch

215
00:19:59,150 --> 00:20:06,320
normalization layer right after each convolution and before the ReLU activation.

216
00:20:06,890 --> 00:20:14,870
Now, batch normalization is this technique for accelerating deep neural network training by reducing

217
00:20:14,870 --> 00:20:17,060
internal covariate shift.

218
00:20:18,700 --> 00:20:21,070
To better understand batch normalization,

219
00:20:21,070 --> 00:20:24,880
we'll start by explaining the notion of covariate shift.

220
00:20:25,660 --> 00:20:28,520
To better understand the notion of covariate shift,

221
00:20:28,540 --> 00:20:35,740
let's suppose that we're trying to build a model which classifies, or which says whether, an input image

222
00:20:35,740 --> 00:20:42,050
like this one or say this one is that of a car.

223
00:20:42,070 --> 00:20:47,290
So either it's a car or it is not a car.

224
00:20:48,140 --> 00:20:55,910
Now, if you're building this kind of system and you create batches of these kinds

225
00:20:55,910 --> 00:21:01,670
of toy cars and you pass them through the model, the model learns how to see this and know that it's a car,

226
00:21:01,670 --> 00:21:05,420
and to see some other image and know that that's not a car.

227
00:21:06,410 --> 00:21:16,790
Then later on, when you take a car from this other distribution and you pass it into our system, it becomes

228
00:21:16,790 --> 00:21:22,820
difficult for the weights of this model to adapt to this change.

229
00:21:22,820 --> 00:21:27,680
in distribution, though the inputs are still cars.

230
00:21:28,520 --> 00:21:32,930
And to visualize this, let's consider this plot right here.

231
00:21:33,440 --> 00:21:36,980
What we're going to have is something like this.

232
00:21:36,980 --> 00:21:47,000
So we'll have this here for car and then this for not car, and then we'll build this classifier or

233
00:21:47,000 --> 00:21:55,010
this model which distinguishes an image of a car from an image which is not a car by, say, this function, for

234
00:21:55,010 --> 00:21:55,850
example.

235
00:21:56,210 --> 00:22:03,290
Now, when you bring some other distribution, like say, this distribution, you would have something

236
00:22:03,290 --> 00:22:04,250
like this.

237
00:22:04,550 --> 00:22:14,660
You see, you have something like this here for this other distribution, and not car is about

238
00:22:14,660 --> 00:22:15,350
this.

239
00:22:15,860 --> 00:22:23,060
And then you would need to have something like this to separate the cars from images which are not cars.

240
00:22:23,570 --> 00:22:31,790
And this then makes it difficult for us to have a function which separates images, which are cars from

241
00:22:31,790 --> 00:22:36,830
those which are not cars when these images come from these two different distributions.

242
00:22:37,160 --> 00:22:42,020
Now, this shift here is known as the covariate shift.

243
00:22:42,620 --> 00:22:50,270
And so that's why most times before passing the image into the model, what we do is we normalize this

244
00:22:50,270 --> 00:22:50,840
input.

245
00:22:50,840 --> 00:22:57,530
So let's suppose we have an input x. We generally carry out some normalization in order to account

246
00:22:57,530 --> 00:22:59,080
for this covariate shift.

247
00:22:59,090 --> 00:23:06,860
So now after normalization, what we're going to have is that all those images, be they from this distribution

248
00:23:06,860 --> 00:23:16,580
or this other distribution right here, will now have been normalized to reduce the effect of the shift.

249
00:23:16,580 --> 00:23:26,240
And so now we could have a single function which separates the cars from the

250
00:23:26,240 --> 00:23:30,560
non-cars with much more ease.

251
00:23:31,460 --> 00:23:39,800
Now, that said, what if this kind of covariate shift instead happens in the hidden layers?

252
00:23:39,800 --> 00:23:43,340
That's these layers which make up the model right here.

253
00:23:43,340 --> 00:23:50,690
So let's suppose that we have some conv layers like this, stacked with the activation functions, and then

254
00:23:51,680 --> 00:23:59,330
we have the inputs to these layers, which depend on the weights of the layers before them, now coming from

255
00:24:00,140 --> 00:24:02,120
different distributions.

256
00:24:02,780 --> 00:24:08,270
Then in this case we have an internal covariate shift.

257
00:24:08,510 --> 00:24:16,190
And to remedy the situation, we now make use of the batch normalization and the algorithm for the batch

258
00:24:16,190 --> 00:24:18,260
normalization is described in the paper.

259
00:24:18,260 --> 00:24:21,980
So here we have a mini-batch and we obtain its mean.

260
00:24:21,980 --> 00:24:27,080
So here we try to obtain the average value of the different activations in the mini-batch.

261
00:24:27,080 --> 00:24:35,480
Then we also obtain the standard deviation, which is sigma, and the variance, which is sigma squared.

262
00:24:35,510 --> 00:24:38,840
So basically we obtain the mean and we obtain the variance.

263
00:24:38,840 --> 00:24:42,650
And it's this that we now make use of to normalize our data.

264
00:24:42,650 --> 00:24:49,730
So now you take every value, you subtract the mean, that's this here, which is calculated here, and

265
00:24:49,730 --> 00:24:54,410
then you divide by this standard deviation.

266
00:24:55,490 --> 00:25:02,390
And then we add a small epsilon to avoid having a very small number or zero in the denominator.

267
00:25:02,660 --> 00:25:03,710
So that's it.

268
00:25:04,070 --> 00:25:08,840
This is how the normalization or the batch normalization process goes on.

269
00:25:09,830 --> 00:25:15,170
And it should be noted that there are other normalization techniques like the layer and group normalization,

270
00:25:15,170 --> 00:25:22,640
which are kind of similar to this, but different in the sense that with batch normalization, this

271
00:25:22,640 --> 00:25:26,690
mean is calculated over a given mini-batch.

272
00:25:26,690 --> 00:25:33,050
So this is the mean of the values calculated over the mini-batch, and the standard deviation from that

273
00:25:33,050 --> 00:25:33,940
mini batch.

274
00:25:33,950 --> 00:25:46,250
Now, after getting this new value of x, x hat, what we now do is multiply it by gamma and add

275
00:25:46,250 --> 00:25:46,880
beta.

276
00:25:46,880 --> 00:25:47,660
Now this gamma

277
00:25:47,900 --> 00:25:50,660
and beta are actually trainable parameters.

278
00:25:50,660 --> 00:25:58,250
So when working with batch normalization in, say, TensorFlow or PyTorch, you'll notice that the batch

279
00:25:58,850 --> 00:26:01,200
normalization layer will also have its parameters.

280
00:26:01,220 --> 00:26:08,690
Now, the role of this gamma and this beta is to scale and shift.

281
00:26:08,690 --> 00:26:17,150
And these parameters are learned along with the original model parameters and restore the representation

282
00:26:17,150 --> 00:26:18,710
power of the network.

283
00:26:18,710 --> 00:26:25,340
And so when we set gamma to be the square root of the variance and beta to be the expectation,

284
00:26:25,370 --> 00:26:33,650
or the mean, of x, then we could recover the original activations, if that were the optimal thing to

285
00:26:33,650 --> 00:26:34,220
do.

286
00:26:34,850 --> 00:26:43,070
So essentially what this is saying here is, if it's instead optimal for us not to use batch normalization,

287
00:26:43,070 --> 00:26:51,410
then we could adapt the values of gamma and beta such that we get this original value of x here.

288
00:26:51,500 --> 00:26:55,010
And the way this can be done is quite simple.

289
00:26:55,250 --> 00:27:05,750
All we need to do here is multiply this x hat by, let's say, the square root of sigma squared plus epsilon.

290
00:27:05,750 --> 00:27:10,070
So we set gamma to the square root of sigma squared plus epsilon.

291
00:27:10,400 --> 00:27:19,670
And then once we've multiplied this, you see, when you take this here and multiply by this, with

292
00:27:19,670 --> 00:27:27,200
gamma set like this, this cancels out with this and you're left with x_i minus the mean.

293
00:27:27,290 --> 00:27:36,020
Now when you're left with x_i minus the mean, if beta is equal to the mean, then you will have x_i minus the

294
00:27:36,020 --> 00:27:36,650
mean

295
00:27:37,720 --> 00:27:47,440
plus beta, which in this case is the mean, and you see it cancels out and gives you x_i, which is this

296
00:27:47,440 --> 00:27:49,180
original value of x.

297
00:27:49,180 --> 00:27:57,370
So if we set our beta to be this, or rather our gamma to be this and our beta to be the mean, then

298
00:27:57,370 --> 00:28:00,310
in that case we derive our original value of x.

299
00:28:00,310 --> 00:28:07,480
And so that's why you see these two parameters are trainable, such that we get the best values or the

300
00:28:07,480 --> 00:28:11,800
most optimal values for gamma and beta.
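
To make this gamma and beta argument concrete, here is a minimal sketch in plain Python. The helper name batch_norm, the sample values, and the epsilon of 1e-5 are all assumptions made for illustration, not something shown in the session. It normalizes a small mini-batch, then checks that setting gamma to the square root of the variance plus epsilon and beta to the mean gives back the original values.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of scalars, then scale by gamma and shift by beta.

    Hypothetical helper for illustration; real layers work per channel on tensors.
    """
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

xs = [1.0, 2.0, 4.0, 7.0]  # made-up activations
eps = 1e-5
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# gamma = 1 and beta = 0: plain normalization, outputs are centered on zero.
normalized = batch_norm(xs, gamma=1.0, beta=0.0, eps=eps)

# gamma = sqrt(var + eps) and beta = mean: the scale and shift undo the normalization.
recovered = batch_norm(xs, gamma=math.sqrt(var + eps), beta=mean, eps=eps)
```

In a real network gamma and beta are not set by hand like this; they are learned, so the network can land on these values only if that happens to be the best thing to do.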

301
00:28:12,400 --> 00:28:20,410
Then there's also this initialization; that is, the model, or the network, is trained from scratch.

302
00:28:20,500 --> 00:28:27,190
Stochastic gradient descent is used with a mini-batch size of 256. The learning rate starts from 0.1 and is divided

303
00:28:27,190 --> 00:28:29,190
by ten when the error plateaus.

304
00:28:29,200 --> 00:28:35,380
So basically when we get to the point where, let's have this, when we get to a point where the error

305
00:28:35,380 --> 00:28:45,150
starts to plateau, then at that point we could update the learning rate from 0.1 to

306
00:28:45,150 --> 00:28:46,210
0.01.

307
00:28:46,210 --> 00:28:49,600
And then if it plateaus, if it drops, you see, and it plateaus.

308
00:28:49,810 --> 00:28:50,710
Let's go this way.

309
00:28:50,710 --> 00:28:52,540
It drops and plateaus again.

310
00:28:52,540 --> 00:28:56,980
Then we carry out this same computation.

311
00:28:57,250 --> 00:29:04,330
Now, training is for up to 60 times ten to the four iterations. Weight decay and momentum are used, and

312
00:29:04,330 --> 00:29:06,670
there is no use of dropout.
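
The divide-by-ten-on-plateau rule could be sketched roughly as follows. This is only an illustration: the function name step_on_plateau, the patience of 2 checks, the min_delta threshold, and the error values are all assumptions, since the session only says that the rate starts at 0.1 and is divided by ten when the error plateaus.

```python
def step_on_plateau(errors, lr=0.1, factor=10.0, patience=2, min_delta=1e-3):
    """Return the learning rate in effect at each error check.

    The rate is divided by `factor` once the error has failed to improve by
    at least `min_delta` for `patience` consecutive checks (assumed criterion).
    """
    best = float("inf")
    stale = 0
    rates = []
    for err in errors:
        rates.append(lr)
        if err < best - min_delta:
            best = err      # still improving, reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr /= factor  # plateau detected: 0.1 -> 0.01 -> 0.001 ...
                stale = 0
    return rates

# Made-up validation errors: drop, plateau, drop again, plateau again.
errors = [0.9, 0.7, 0.6, 0.6, 0.6, 0.4, 0.35, 0.35, 0.35]
lrs = step_on_plateau(errors)
```

With these made-up errors, the first five checks run at 0.1 and the remaining ones at 0.01, which matches the drop, plateau, drop again pattern described above.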

313
00:29:06,760 --> 00:29:13,510
So again, in testing here we have different scales which are used and averaged.

314
00:29:13,510 --> 00:29:20,660
So we pass the image at these different scales and then the average of the scores is recorded.

315
00:29:20,680 --> 00:29:27,520
Now, before we move forward to check out some results, it's important to note here that after this

316
00:29:27,520 --> 00:29:32,220
last conv layer, we do not carry out flattening.

317
00:29:32,230 --> 00:29:40,240
Instead, what is done here is average pooling. In order to better understand how the global average

318
00:29:40,240 --> 00:29:41,290
pooling works,

319
00:29:41,290 --> 00:29:45,250
let's consider this example from politico.com.

320
00:29:45,430 --> 00:29:51,700
So right here, we're supposing that we have this as the output of the final layer.

321
00:29:51,700 --> 00:29:58,150
Then what we do is, instead of just flattening, that is, just picking all those values and flattening

322
00:29:58,150 --> 00:30:02,020
them out and then passing through a fully connected layer.

323
00:30:02,020 --> 00:30:08,950
What we are going to do is, depending on the number of neurons we want in the next fully connected layer,

324
00:30:08,950 --> 00:30:12,850
we are going to create a certain number of channels.

325
00:30:12,850 --> 00:30:16,180
So here, for example, we have this depth here.

326
00:30:16,180 --> 00:30:17,650
The number of channels here is three.

327
00:30:17,650 --> 00:30:19,930
We have the height and we have the width.

328
00:30:19,930 --> 00:30:27,850
And then since we know that we're actually doing global average pooling, then for each and every

329
00:30:27,850 --> 00:30:32,860
one of these channels, for this channel, this channel and this channel, we're going to get the average

330
00:30:32,860 --> 00:30:33,310
value.

331
00:30:33,310 --> 00:30:34,480
So here, the average value

332
00:30:34,480 --> 00:30:35,470
is this eight.

333
00:30:35,500 --> 00:30:37,720
Here, the average value is three.

334
00:30:37,750 --> 00:30:39,820
Here, the average value is this five.

335
00:30:39,820 --> 00:30:47,890
And so if you want 1000 of these, then you should have a depth of 1000 right here. And now with this,

336
00:30:47,890 --> 00:30:54,690
you see that it looks quite similar to the flatten layer, as now you have all these different values.

337
00:30:55,120 --> 00:31:00,610
These different single values can now be passed into a fully connected layer.
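
As a small sketch of this, assuming a feature map stored as nested Python lists (channels, then rows, then columns), global average pooling keeps one value per channel while flattening keeps every value. The function names and the numbers are made up for illustration; the numbers were simply picked so the three averages come out as 8, 3 and 5, like in the example.

```python
def global_average_pool(feature_map):
    """Return one average per channel of a [channels][height][width] nested list."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def flatten(feature_map):
    """Return every value of the feature map as one long list."""
    return [v for channel in feature_map for row in channel for v in row]

# Made-up 3-channel, 2x2 output of the last conv layer.
feature_map = [
    [[6.0, 10.0], [8.0, 8.0]],  # channel 0
    [[1.0, 5.0], [2.0, 4.0]],   # channel 1
    [[5.0, 5.0], [3.0, 7.0]],   # channel 2
]

pooled = global_average_pool(feature_map)  # one value per channel
flat = flatten(feature_map)                # channels * height * width values
```

So the number of values going into the fully connected layer equals the number of channels, which is why a depth of 1000 gives 1000 values.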

338
00:31:00,910 --> 00:31:06,670
Now, it should be noted that in certain tasks like in classification, which we are trying to do,

339
00:31:06,940 --> 00:31:15,460
this global average pooling will be great, as the position of the pixels doesn't really matter.

340
00:31:15,700 --> 00:31:21,640
Now, this simply means that here, since you get this average and you get this average and you get this

341
00:31:21,640 --> 00:31:31,060
average, it means that this position of this pixel wouldn't be close to this other pixel, as in the

342
00:31:31,060 --> 00:31:32,710
case of the flattening.

343
00:31:33,850 --> 00:31:42,190
But since like in our case, we're interested in saying whether a person is angry, happy or sad, the

344
00:31:42,190 --> 00:31:51,040
positions of these output values right here won't really matter as much as they would have mattered

345
00:31:51,040 --> 00:31:58,480
in the case where we're dealing with an object detection or say, object counting problem, where the

346
00:31:58,480 --> 00:32:05,590
particular position of the person or of whatever we are trying to detect actually counts.

347
00:32:05,590 --> 00:32:06,370
So, to better

348
00:32:06,370 --> 00:32:09,370
explain this again, let's consider these two examples,

349
00:32:09,370 --> 00:32:14,080
example one and example two right here. For classification problems,

350
00:32:14,080 --> 00:32:17,800
all we're interested in is detecting that this person is happy.

351
00:32:17,800 --> 00:32:24,040
So whether we have a face this way or this way, the position doesn't really matter.

352
00:32:24,040 --> 00:32:29,260
All we're interested in is in knowing whether this person is happy, angry or sad.

353
00:32:29,560 --> 00:32:35,290
Now, for object detection, the exact position of the person matters.

354
00:32:35,290 --> 00:32:37,030
And so the exact

355
00:32:37,350 --> 00:32:40,740
position of these neurons here will matter.

356
00:32:40,740 --> 00:32:45,690
And so employing global average pooling for such tasks isn't a great idea.

357
00:32:45,780 --> 00:32:53,520
So in summary, if you have a task where the position doesn't matter that much, then you could use

358
00:32:53,520 --> 00:32:54,840
global average pooling.
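
One way to see why position stops mattering is the sketch below, which uses made-up single-channel maps as nested Python lists (the helper names are hypothetical). The same activation is placed in two different corners: the global average is identical for both, while the flattened vectors differ, so only the flattened version still tells you where the activation was.

```python
def channel_average(channel):
    """Global average of one [height][width] channel."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)

def flatten_channel(channel):
    """All values of one channel in row-major order."""
    return [v for row in channel for v in row]

# Same activation of 9.0, once in the top-left and once in the bottom-right.
top_left = [
    [9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
bottom_right = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0],
]

same_average = channel_average(top_left) == channel_average(bottom_right)
same_flat = flatten_channel(top_left) == flatten_channel(bottom_right)
```

Here same_average is true and same_flat is false, which is the trade-off: fine for saying whether a face is happy, not fine when the location itself is the answer.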

359
00:32:54,870 --> 00:33:00,300
If not, then you're advised to use the flatten layer. From here, you could see the different variants of

360
00:33:00,300 --> 00:33:01,130
the ResNets.

361
00:33:01,140 --> 00:33:09,060
You see here we have the 18-layer ResNet, that's ResNet-18, then ResNet-34, ResNet-50, ResNet-101,

362
00:33:09,060 --> 00:33:10,590
and ResNet-152.

363
00:33:10,590 --> 00:33:16,920
And here we could see that with the plain networks, that's without the residual block, you find

364
00:33:16,920 --> 00:33:20,550
this 18 layer performing better than the 34 layer.

365
00:33:20,550 --> 00:33:27,720
But once we have the ResNet block, you find the 34-layer performing better than the 18-layer,

366
00:33:27,720 --> 00:33:29,760
meaning that we could now go deeper.

367
00:33:29,760 --> 00:33:39,210
We also have this table right here, which shows the VGG model, GoogLeNet, PReLU-net, and

368
00:33:39,210 --> 00:33:48,630
then the different plain networks and the residual networks. It shows clearly that ResNet-152 performs

369
00:33:48,630 --> 00:33:55,970
best regardless of the fact that it is deeper than the 101 and 50 counterparts.

370
00:33:55,980 --> 00:34:02,400
And before we move on, also note that here this ResNet block, as you can see here, is composed

371
00:34:02,400 --> 00:34:11,640
of these two conv layers for the 34-layer, while from the 50 to the 152 layers, the ResNet blocks are

372
00:34:11,640 --> 00:34:16,980
composed of three layers or three conv layers, as you could see right here.
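
The two-conv versus three-conv blocks also explain where the names 18, 34, 50, 101 and 152 come from. The sketch below is an illustration only: the per-stage block counts are taken from the architecture table of the ResNet paper rather than from this session, the names are hypothetical, and the count ignores the extra convs on projection shortcuts.

```python
# Blocks per stage for each variant (assumed from the paper's architecture table).
CONFIGS = {
    18: ("basic", [2, 2, 2, 2]),
    34: ("basic", [3, 4, 6, 3]),
    50: ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}

# Two conv layers per basic block, three per bottleneck block.
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def count_weighted_layers(block_type, blocks_per_stage):
    """First conv layer + convs inside all blocks + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1

layer_counts = {
    name: count_weighted_layers(block_type, stages)
    for name, (block_type, stages) in CONFIGS.items()
}
```

Each computed count matches the number in the model's name, so ResNet-34 and ResNet-50 have the same number of blocks and differ only in how many conv layers each block holds.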