1
00:00:00,210 --> 00:00:06,360
Hello, everyone, and welcome to this new and exciting session in which we are going to look at another

2
00:00:06,360 --> 00:00:11,070
state of the art convolutional neural network based model.

3
00:00:11,310 --> 00:00:14,200
And talking about state of the art models.

4
00:00:14,220 --> 00:00:22,320
Ten years ago, in the ImageNet visual recognition challenge, the AlexNet convolutional neural network

5
00:00:22,320 --> 00:00:30,840
beat all previous state-of-the-art solutions. Before this AlexNet solution, state-of-the-art methods

6
00:00:30,840 --> 00:00:37,280
achieved a top-5 error rate of 25.3%.

7
00:00:37,290 --> 00:00:45,130
But with AlexNet, this error rate dropped to 15.3%.

8
00:00:45,150 --> 00:00:47,250
Actually, the previous figure is 26.2%.

9
00:00:47,280 --> 00:00:55,500
This breakthrough has led to the widespread adoption of convolutional neural networks in solving recognition

10
00:00:55,500 --> 00:00:57,030
tasks like this one.

11
00:00:57,030 --> 00:01:03,510
And although today we would hardly use this kind of model, that is to say the AlexNet model, we are

12
00:01:03,510 --> 00:01:10,830
going to discuss this model because it was a precursor to most of the modern convnets we have today,

13
00:01:10,830 --> 00:01:14,190
like the MobileNets, DenseNets, and EfficientNets.

14
00:01:14,550 --> 00:01:21,000
That said, we are going to see what made the AlexNet model so powerful.

15
00:01:21,000 --> 00:01:27,720
The AlexNet model was first published in this paper, entitled ImageNet Classification with Deep Convolutional

16
00:01:27,720 --> 00:01:28,870
Neural Networks.

17
00:01:28,890 --> 00:01:35,280
Just from the title, you could get some idea that this was one of the first times

18
00:01:35,280 --> 00:01:39,120
convnets were used for this ImageNet challenge.

19
00:01:39,270 --> 00:01:44,940
This paper was by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.

20
00:01:45,660 --> 00:01:48,510
In the abstract, they start by presenting the results.

21
00:01:48,510 --> 00:01:56,550
As you can see here, on the test data they achieved top-1 and top-5 error rates of 37.5% and 17%,

22
00:01:56,550 --> 00:01:59,790
which is considerably better than the previous state of the art.
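
To make the top-1 and top-5 error rates mentioned above concrete, here is a minimal sketch of how a top-k error rate can be computed. The function name `top_k_error` and the tiny score table are hypothetical and made up for illustration; they are not from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)

# Hypothetical scores for 3 examples over 4 classes (made-up numbers).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: counted as correct for k = 1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: miss for k = 1, correct for k = 2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: miss for both k = 1 and k = 2
]
labels = [1, 2, 3]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

With 1000 classes, the same function with k=5 would give the top-5 error: a prediction counts as correct if the true label is anywhere among the five highest-scoring classes.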

23
00:01:59,820 --> 00:02:09,630
Now, this model was a 60-million-parameter model composed of conv layers and max pooling layers, with

24
00:02:09,630 --> 00:02:16,230
three final fully connected layers, as we shall see in the model section shortly.

25
00:02:16,230 --> 00:02:21,600
Note here the 1000-way softmax, because we have 1000 classes.

26
00:02:21,960 --> 00:02:28,560
They also tried to reduce overfitting by implementing strategies like dropout and data augmentation.

27
00:02:28,560 --> 00:02:31,860
That said, here they discuss the dataset which was used,

28
00:02:32,340 --> 00:02:41,250
which is obviously ImageNet, a dataset of 15 million labeled high-resolution images, although they finally

29
00:02:41,250 --> 00:02:50,220
used roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. It's also

30
00:02:50,220 --> 00:02:57,690
important to note that the images which were used for training were downsampled to 256 by 256.

31
00:02:58,080 --> 00:03:05,370
As for the overall architecture, as we can see here, we have conv layers followed by max pooling

32
00:03:05,370 --> 00:03:05,640
layers.

33
00:03:05,640 --> 00:03:10,080
You see here: conv layer, max pooling, conv layer, max pooling.

34
00:03:10,080 --> 00:03:17,040
Then we have several conv layers, this max pooling layer, and then we have three dense layers.

35
00:03:17,040 --> 00:03:23,370
You see here we have this dense layer, this dense layer, and this final dense layer with our 1000-way

36
00:03:23,370 --> 00:03:24,270
output.

37
00:03:25,500 --> 00:03:32,880
Then another point to note here is that, at the time, the nonlinearity used was most often the tanh

38
00:03:32,880 --> 00:03:33,950
or the sigmoid.

39
00:03:33,960 --> 00:03:35,880
Let's get back to this top here.

40
00:03:36,090 --> 00:03:41,520
You see, here they talk about the nonlinearity which was used just here.

41
00:03:41,520 --> 00:03:47,340
Previously, they mostly used the tanh or the sigmoid, as you see here.

42
00:03:47,760 --> 00:03:53,130
But it turns out that after working with the ReLU... The ReLU, if you can recall, we've seen

43
00:03:53,130 --> 00:03:58,500
this in the previous section. The ReLU is simply this function which takes in a value x, and then

44
00:03:58,500 --> 00:04:00,630
if x is negative, the value is zero.

45
00:04:00,660 --> 00:04:03,240
If x is positive, it remains the same value.

46
00:04:03,240 --> 00:04:09,210
So basically we have x and f of x here,

47
00:04:10,010 --> 00:04:12,870
which is our ReLU function, which is zero

48
00:04:12,890 --> 00:04:20,450
if x is less than zero, and it is x if x is greater than or equal to zero.

49
00:04:20,450 --> 00:04:23,030
So this is our ReLU function right here.
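
The definition just described can be written as a short sketch. The helper names `relu`, `relu_grad`, and `tanh_grad` are hypothetical, and the gradient comparison is added only as one commonly cited intuition (the tanh saturates for large inputs, the ReLU does not); it is not a reproduction of the paper's experiment.

```python
import math

def relu(x):
    """f(x) = 0 if x < 0, and f(x) = x if x >= 0."""
    return x if x >= 0 else 0.0

def relu_grad(x):
    """Slope of the ReLU: 1 on the positive side, 0 on the negative side."""
    return 1.0 if x >= 0 else 0.0

def tanh_grad(x):
    """Slope of tanh(x), which shrinks toward 0 as |x| grows (saturation)."""
    return 1.0 - math.tanh(x) ** 2

# For a large positive input, the tanh slope is almost zero
# while the ReLU slope stays at 1.
for x in (-2.0, 0.5, 5.0):
    print(x, relu(x), relu_grad(x), tanh_grad(x))
```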

50
00:04:23,030 --> 00:04:32,090
And what they discovered was that the ReLU permitted them to train their model much faster than the

51
00:04:32,090 --> 00:04:36,830
previously used nonlinearities, like the tanh of x.

52
00:04:36,830 --> 00:04:44,120
So as you can see here, after just a few epochs, let's say five epochs, they attain

53
00:04:44,120 --> 00:04:49,790
this training error rate, as compared to this other nonlinearity here.

54
00:04:49,790 --> 00:04:54,560
So this is what we get when we use the ReLU.

55
00:04:54,560 --> 00:05:02,570
And this is what we get when we use some other nonlinearity like the tanh of x. It's important to note that

56
00:05:02,570 --> 00:05:10,910
to date, most convnets we build make use of this ReLU nonlinearity.

57
00:05:10,910 --> 00:05:16,160
Another thing they did to speed up the training was to use multiple GPUs.

58
00:05:16,160 --> 00:05:25,400
The actual number of GPUs they use here is two, and they devise a method of communication between

59
00:05:25,400 --> 00:05:28,010
these two GPUs to speed up calculation.

60
00:05:28,010 --> 00:05:34,550
From here, the authors make use of this normalization strategy for regularization known as the local

61
00:05:34,550 --> 00:05:42,110
response normalization, and this normalization strategy was used alongside the ReLU nonlinearity.

62
00:05:42,380 --> 00:05:46,670
So from here, we have some inputs.

63
00:05:46,670 --> 00:05:56,120
Let's take this schematic from this post by Aqeel Anwar, where it shows this even more clearly.

64
00:05:56,300 --> 00:06:07,850
Here we have some input and then we normalize it based on its surroundings, hence the term local response

65
00:06:07,880 --> 00:06:15,320
normalization. Here he explains that there is inter-channel local response normalization, and there

66
00:06:15,320 --> 00:06:18,650
is intra-channel local response normalization.

67
00:06:18,790 --> 00:06:26,090
As you can see, intra-channel is all between pixels of a given channel, or neurons of a given channel.

68
00:06:26,090 --> 00:06:27,980
And here this is inter-channel.

69
00:06:27,980 --> 00:06:33,410
So this is carried out between pixels of different channels.

70
00:06:34,010 --> 00:06:38,390
Now, that said, the exact mathematical formula used here is this one.

71
00:06:38,390 --> 00:06:48,860
So here we have a given neuron, and then we divide its value by this summation right here, which is the

72
00:06:48,860 --> 00:06:52,520
sum of the squares of some neighboring values.

73
00:06:52,520 --> 00:06:57,860
And according to the authors' terms here, you see, this sort of response

74
00:06:57,860 --> 00:07:02,180
normalization implements a form of lateral inhibition.

75
00:07:02,180 --> 00:07:09,710
So take note of this. It's inspired by the type found in real neurons, creating competition for big

76
00:07:09,710 --> 00:07:15,050
activities amongst neuron outputs computed using different channels.

77
00:07:15,350 --> 00:07:22,460
So this means that if we consider these three neighboring channels, or rather the three neighboring

78
00:07:22,460 --> 00:07:30,410
neurons from those three different channels, if we take this particular neuron right here and we try

79
00:07:30,410 --> 00:07:38,720
to normalize it with our response normalization layer, given that it is surrounded by this pixel whose value

80
00:07:38,720 --> 00:07:46,250
is relatively high because of this squared term right here and this is the squared, right?

81
00:07:46,250 --> 00:07:55,670
Here, meaning that you take this value, say one, divided by some summation, and then you have this alpha;

82
00:07:55,700 --> 00:07:57,290
obviously you have this k right here.

83
00:07:57,290 --> 00:07:58,220
Let's add that.

84
00:07:58,220 --> 00:07:59,780
Let's just put it right here.

85
00:07:59,780 --> 00:08:04,340
We have this alpha and then we have this value squared.

86
00:08:04,340 --> 00:08:08,450
So obviously this will be a function of, basically, this value here.

87
00:08:08,450 --> 00:08:15,380
So when you square this value, it means that this overall value here will become very small, hence

88
00:08:15,380 --> 00:08:17,630
the term lateral inhibition.

89
00:08:17,630 --> 00:08:26,180
And so for a neuron to maintain a relatively high value after going through this local response normalization

90
00:08:26,180 --> 00:08:35,150
layer right here, it has to ensure that it has one of the highest values among the surrounding neurons.
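
The inter-channel formula walked through above can be sketched in plain Python. This is a minimal sketch for a single spatial position, assuming the form b_i = a_i / (k + alpha * sum of a_j squared over neighboring channels) ** beta with the paper's reported defaults (n = 5, k = 2, alpha = 1e-4, beta = 0.75). The function name `local_response_norm` and the example activations are hypothetical, and the demo uses exaggerated constants so the inhibition effect is easy to see.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel value by its n neighboring channels.

    a: activations at ONE spatial position, one value per channel.
    """
    num_channels = len(a)
    out = []
    for i in range(num_channels):
        # Window of neighboring channels, clipped at the borders.
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sum_sq) ** beta)
    return out

# Exaggerated constants (not the paper's) to make the effect visible.
quiet = local_response_norm([1.0, 1.0, 1.0], n=3, k=1.0, alpha=1.0)
loud = local_response_norm([1.0, 10.0, 1.0], n=3, k=1.0, alpha=1.0)

# The same input value of 1.0 comes out much smaller when its
# neighbor is large, while the large neuron stays the largest.
print(quiet[0], loud[0], loud[1])
```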

91
00:08:36,140 --> 00:08:42,740
Nonetheless, this local response normalization as compared to other normalization techniques like the

92
00:08:42,740 --> 00:08:50,060
batch normalization, layer normalization, and group normalization, hasn't proven to be very effective

93
00:08:50,060 --> 00:08:56,000
when it comes to regularizing a neural network, and it's not used by modern convnets.

94
00:08:56,000 --> 00:09:02,900
We saw in the previous sessions that pooling layers permit us to downsample information from

95
00:09:02,900 --> 00:09:09,230
the inputs, such that as we go deeper in the neural network, we have a reduced

96
00:09:09,320 --> 00:09:16,340
number of features. Now, in this paper, they make use of this pooling layer, more specifically the max

97
00:09:16,340 --> 00:09:17,420
pool layer.

98
00:09:17,840 --> 00:09:20,410
And the way it works is quite straightforward.

99
00:09:20,420 --> 00:09:28,480
So suppose we have this kind of input, and then we have a three by three max pool layer with, initially,

100
00:09:28,490 --> 00:09:29,690
a stride of one.

101
00:09:29,870 --> 00:09:35,730
What we're going to have here is these positions, which we are going to fill here.

102
00:09:35,750 --> 00:09:39,320
Let's get back and then pick some values.

103
00:09:39,320 --> 00:09:45,890
So let's say we have a value of one, two, and then all these other values.

104
00:09:45,920 --> 00:09:52,130
If we want to carry out the max pool operation with a stride of one, we'll start with this here.

105
00:09:53,180 --> 00:09:53,600
See?

106
00:09:53,600 --> 00:09:58,070
And because it is max pool, we're going to pick the max of all this.

107
00:09:58,070 --> 00:10:00,860
So the highest value is 11.

108
00:10:00,980 --> 00:10:03,200
And so here we're going to have 11.

109
00:10:03,200 --> 00:10:07,460
And then the next thing we'll do is we are going to shift this.

110
00:10:07,460 --> 00:10:10,430
So we shift this here.

111
00:10:10,430 --> 00:10:14,210
Since the stride is one, we're going to go one step to the right.

112
00:10:14,210 --> 00:10:17,240
And so we have this now here.

113
00:10:17,990 --> 00:10:21,840
Notice how we still pick out three by three pixels.

114
00:10:21,860 --> 00:10:23,630
Let's now take this one off.

115
00:10:23,630 --> 00:10:26,600
So we've done the shift and now we have this position.

116
00:10:26,600 --> 00:10:27,790
We take the max here.

117
00:10:27,800 --> 00:10:30,650
The max here again is going to give us 11.

118
00:10:30,650 --> 00:10:35,030
So we have 11 here, and then we're going to do another shift.

119
00:10:35,030 --> 00:10:39,380
So from here we're going to take this here; let's take this off.

120
00:10:39,530 --> 00:10:42,470
We do this other shift and then we still have this.

121
00:10:42,470 --> 00:10:44,810
The max here again is going to be 11.

122
00:10:44,810 --> 00:10:48,290
So you see at the top we have 11, 11, 11.

123
00:10:48,290 --> 00:10:52,730
And then from here, we'll move on to this next one.

124
00:10:52,730 --> 00:10:57,920
So we will go downward, one step downward.

125
00:10:57,920 --> 00:11:00,590
We'll have this here.

126
00:11:01,250 --> 00:11:04,790
And the max here is still going to be 11.

127
00:11:04,790 --> 00:11:09,530
So we're going to have 11 here, and then we'll move this way.

128
00:11:09,530 --> 00:11:10,250
This way.

129
00:11:10,250 --> 00:11:12,200
We'll go downward and all of that.

130
00:11:12,200 --> 00:11:17,330
So if we move this way, I think we should have 11 still. The other way,

131
00:11:17,330 --> 00:11:19,490
11. We go downward;

132
00:11:19,530 --> 00:11:21,050
we still have 11. Practically,

133
00:11:21,050 --> 00:11:23,030
we will have 11 everywhere.

134
00:11:23,030 --> 00:11:26,120
So we'll have 11 here and 11 here.

135
00:11:26,120 --> 00:11:31,940
So this is going to be our output from this input here after the max pool operation.

136
00:11:32,210 --> 00:11:41,390
Now, when we modify this stride from 1 to 2, as it was illustrated

137
00:11:41,390 --> 00:11:48,290
in the paper, instead of moving one step, we move two steps.

138
00:11:48,290 --> 00:11:53,120
So if we take this off here, you'll see that we'll start with this.

139
00:11:53,120 --> 00:11:55,400
So from the first one, we're going to get 11.

140
00:11:55,400 --> 00:11:57,170
So this is with stride equal to two.

141
00:11:57,170 --> 00:11:59,390
So the first one we get 11.

142
00:11:59,540 --> 00:12:05,270
And then from here, instead of moving just one step like previously, where we had

143
00:12:05,270 --> 00:12:07,220
this here and then we moved one step,

144
00:12:07,220 --> 00:12:08,720
now we're going to move two steps.

145
00:12:08,720 --> 00:12:10,550
So we'll move this way.

146
00:12:11,430 --> 00:12:13,380
And then again we have 11.

147
00:12:13,380 --> 00:12:18,170
And then instead of going one step downward, we're going to go two steps downward.

148
00:12:18,180 --> 00:12:20,760
And so we'll end up here.

149
00:12:22,210 --> 00:12:27,910
And then we have a maximum of 11, and then we'll go two steps again this way.

150
00:12:28,090 --> 00:12:29,780
So let's take this one off.

151
00:12:31,200 --> 00:12:32,490
Take this one off.

152
00:12:32,640 --> 00:12:38,070
We go two steps again, and then we get a maximum of 11 right here.

153
00:12:38,250 --> 00:12:46,590
So as you can see here, when they talk of overlapping pooling, they actually use a stride of two,

154
00:12:46,620 --> 00:12:48,090
just as we have described.
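
The sliding-window walk-through above can be reproduced with a small sketch. The 5 by 5 input values are made up for illustration (only the maximum of 11 in the center mirrors the example), and `max_pool2d` is a hypothetical helper, not a library function.

```python
def max_pool2d(x, size=3, stride=1):
    """Slide a size-by-size window over x and keep the max of each window."""
    rows, cols = len(x), len(x[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [x[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out

# Made-up 5x5 input whose largest value, 11, sits in the center,
# so every 3x3 window contains it.
x = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 1],
    [2, 3, 11, 4, 5],
    [6, 7, 8, 9, 1],
    [2, 3, 4, 5, 6],
]

pooled_s1 = max_pool2d(x, size=3, stride=1)  # 3x3 output, windows shift by one
pooled_s2 = max_pool2d(x, size=3, stride=2)  # 2x2 output, windows still overlap
```

With a 3 by 3 window and a stride of 2, neighboring windows share one row or column, which is what makes the pooling overlapping.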

155
00:12:48,090 --> 00:12:54,570
And they found this to give an improvement in the results, though these improvements

156
00:12:54,570 --> 00:13:05,040
aren't very large, so in practice we generally use classical non-overlapping max pooling, with the stride

157
00:13:05,040 --> 00:13:11,490
equal to the kernel size, and we also use a two by two kernel size.

158
00:13:11,490 --> 00:13:18,030
So instead of using three by three kernel sizes, as you see here, most times in modern convnets we

159
00:13:18,030 --> 00:13:21,240
generally use a two by two pooling size.

160
00:13:21,270 --> 00:13:28,620
Now, getting back to the general architecture, we can see here that this very first conv layer has

161
00:13:28,620 --> 00:13:32,460
a kernel size of 11 by 11.

162
00:13:34,290 --> 00:13:44,970
And although these kinds of kernels permit the network to capture a much larger spatial context, we'll see

163
00:13:44,970 --> 00:13:53,700
that they are computationally much more expensive compared to kernels with smaller filter sizes.
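
As a rough illustration of that cost, the sketch below counts the weights of one 11 by 11 conv layer and compares it with a stack of five 3 by 3 layers, which covers the same 11 by 11 receptive field when all strides are one. The channel count of 96 on both sides, the omission of biases, and the helper names are assumptions chosen only for the arithmetic; this is not AlexNet's actual layer configuration.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in one conv layer (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convs of the same kernel size."""
    return 1 + layers * (kernel - 1)

channels = 96  # hypothetical: same number of input and output channels

large = conv_weights(11, channels, channels)          # one 11x11 layer
small_stack = 5 * conv_weights(3, channels, channels)  # five 3x3 layers

print(stacked_receptive_field(3, 5))  # receptive field of the 3x3 stack
print(large, small_stack)             # weight counts to compare
```

Per pair of input and output channels, the single large kernel needs 121 weights while the stack needs 5 times 9, that is 45, so the stack is cheaper under these assumptions.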

164
00:13:53,790 --> 00:14:00,750
And as we'll see in subsequent sections, the convnets developed after this didn't use these kinds of

165
00:14:00,750 --> 00:14:08,010
large kernel sizes, as they were able to make use of smaller filters to still capture

166
00:14:08,010 --> 00:14:10,920
the large spatial context that

167
00:14:10,920 --> 00:14:15,870
the 11 by 11 filters capture. Then, to overcome overfitting,

168
00:14:15,900 --> 00:14:22,290
the authors make use of data augmentation and the dropout technique.

169
00:14:22,290 --> 00:14:26,820
So you can check out our previous sessions, where we talk about these two different techniques.

170
00:14:27,000 --> 00:14:31,500
Now, that said, you can see here the training details.

171
00:14:31,500 --> 00:14:39,390
And then one very interesting advantage of working with the 11 by 11 kernel size filters is the fact

172
00:14:39,390 --> 00:14:42,480
that we could have visualizations like this.

173
00:14:42,480 --> 00:14:49,080
So because those kernel sizes are large enough, we can visualize them in this manner.

174
00:14:49,080 --> 00:14:55,890
And then clearly from here we see how our conv layer captures these kinds of low level features.

175
00:14:55,890 --> 00:14:58,230
Like here we have a slanted line.

176
00:14:58,230 --> 00:15:09,840
Here we have many slanted lines, we have this vertical line, we have these horizontal lines right here,

177
00:15:09,840 --> 00:15:12,780
and then we have this checkerboard pattern.

178
00:15:12,780 --> 00:15:18,300
We have these colors, sometimes dual, sometimes single color.

179
00:15:18,300 --> 00:15:26,520
And so we see how this first conv layer's parameters capture low-level features. In this section, they

180
00:15:26,520 --> 00:15:29,520
discuss these record-breaking results.

181
00:15:29,520 --> 00:15:39,300
As you can see, we have this top-1 error rate, which at that time was about 45.7%.

182
00:15:39,300 --> 00:15:46,410
And then with this convnet model, this dropped to 37.5%.

183
00:15:46,440 --> 00:15:52,630
Then for the top-5, we have a drop from 25.7% to 17%.

184
00:15:52,650 --> 00:16:03,000
Now, they also developed this other variant, which comes with an even better top five error rate of

185
00:16:03,000 --> 00:16:04,650
15.3%.

186
00:16:04,650 --> 00:16:10,950
And so as you can see here, we're moving from this previous method, that is SIFT + FVs, which

187
00:16:10,950 --> 00:16:18,990
had a 26.2% top-5 error rate, to the CNN, which has a 15.3% error rate.

188
00:16:19,890 --> 00:16:27,330
Then in this section on qualitative results, we see the different inputs, the correct labels, and

189
00:16:27,330 --> 00:16:31,590
then what the model predicts as the top five best predictions.

190
00:16:31,590 --> 00:16:33,570
See, the model does well here.

191
00:16:33,570 --> 00:16:37,740
Those here were all correct predictions.

192
00:16:38,370 --> 00:16:39,530
Here it's wrong.

193
00:16:39,630 --> 00:16:43,980
You see, it predicts convertible when it's actually a grille here.

194
00:16:43,980 --> 00:16:48,360
It does this wrongly. Here it's also wrong.

195
00:16:48,360 --> 00:16:57,270
But unlike here, this label doesn't even occur among the top five best predictions.

196
00:16:57,270 --> 00:16:59,430
And so that's it for this breakthrough model.

197
00:16:59,430 --> 00:17:03,840
We are going to look at other convnet models in the next sections.