1
00:00:00,210 --> 00:00:05,670
Hello, everyone, and welcome to this new and exciting session in which we are going to discuss the

2
00:00:05,670 --> 00:00:07,020
VGG model.

3
00:00:07,650 --> 00:00:14,490
VGG actually stands for Visual Geometry Group, and this was presented in the paper by Karen Simonyan

4
00:00:14,490 --> 00:00:22,680
and Andrew Zisserman, entitled Very Deep Convolutional Networks for Large-Scale Image Recognition.

5
00:00:22,680 --> 00:00:30,600
In this session we are going to discuss the different methods which the authors of the VGG paper used to drop

6
00:00:30,600 --> 00:00:45,330
the top-1 validation error rate from 38.1% to 23.7%, where this 38.1% was achieved by the breakthrough

7
00:00:45,360 --> 00:00:48,810
ConvNet model, which is the AlexNet model.

8
00:00:49,050 --> 00:00:56,280
Now, in the previous session, in which we treated the AlexNet model, we saw the power of working with

9
00:00:56,280 --> 00:01:02,850
ConvNets in solving recognition tasks. One thing we could already notice very clearly from this

10
00:01:02,850 --> 00:01:05,400
model is that it's quite shallow.

11
00:01:05,400 --> 00:01:12,120
And so Simonyan and Zisserman go even deeper with the VGG model.

12
00:01:12,120 --> 00:01:19,710
In this paper, the authors investigated the effect of the convolutional network depth

13
00:01:19,740 --> 00:01:24,180
on its accuracy. Note these two words: depth and accuracy.

14
00:01:24,180 --> 00:01:33,450
So unlike AlexNet, where the depth is relatively small and which is actually a shallow network, here

15
00:01:33,450 --> 00:01:45,020
the authors use a deeper convolutional neural network and make use of smaller convolution filters.

16
00:01:45,030 --> 00:01:53,850
Now recall that with AlexNet, from the very first layer, we already had 11 by 11 filters, and we argued that

17
00:01:53,850 --> 00:01:59,490
this helped in capturing large spatial dependencies.

18
00:01:59,520 --> 00:02:06,540
Now we'll explain how it's possible for us to make use of these smaller and more economical convolutional

19
00:02:06,540 --> 00:02:16,770
filters while still capturing large spatial dependencies, like the bigger five by five and 11 by 11

20
00:02:16,770 --> 00:02:24,210
filters would do. To better understand why it's better to work with two three by three conv layers as

21
00:02:24,210 --> 00:02:28,650
compared to working with a single five by five conv layer.

22
00:02:28,650 --> 00:02:30,930
Let's consider the following examples.

23
00:02:30,930 --> 00:02:34,380
So here we have these three examples.

24
00:02:34,380 --> 00:02:35,880
Let's start with this first one here.

25
00:02:35,880 --> 00:02:37,830
So let's have this.

26
00:02:37,830 --> 00:02:43,740
We start with this part here and this one, if you notice, it's five by five.

27
00:02:43,740 --> 00:02:53,070
So here you have a kernel size of five, an input size of ten, no padding, a dilation of one and a stride of

28
00:02:53,070 --> 00:02:53,670
one.

29
00:02:53,850 --> 00:02:59,880
So here we have this here and you can see we have this output, which is six by six.

30
00:02:59,880 --> 00:03:01,710
So you have the six by six output.

31
00:03:01,710 --> 00:03:06,360
And the way each and every pixel in the output is gotten is quite simple.

32
00:03:06,360 --> 00:03:08,430
You have the kernel right here.

33
00:03:08,430 --> 00:03:09,540
Let's have this kernel.

34
00:03:09,540 --> 00:03:18,480
We have this kernel, which has been passed over a particular patch of the input, and then this produces

35
00:03:18,480 --> 00:03:19,710
the outputs.

36
00:03:19,710 --> 00:03:27,270
So simply take these kernel values, multiply by these values, add them up and then get this output.
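
The multiply-and-add step described here can be sketched in plain Python. This is only an illustration: the 10 by 10 input values, the all-ones 5 by 5 kernel and the function name `conv_output_at` are assumptions made for the example and do not come from the visualizer shown in the session.

```python
def conv_output_at(image, kernel, row, col):
    """Multiply the kernel with the input patch whose top-left corner is
    (row, col), element by element, and add everything up."""
    k = len(kernel)
    return sum(
        image[row + i][col + j] * kernel[i][j]
        for i in range(k)
        for j in range(k)
    )


# Hypothetical 10x10 input where each pixel equals row + column,
# and a hypothetical 5x5 kernel filled with ones.
image = [[r + c for c in range(10)] for r in range(10)]
kernel = [[1] * 5 for _ in range(5)]

# One output pixel: the kernel placed on the top-left 5x5 patch.
top_left = conv_output_at(image, kernel, 0, 0)

# Sliding the kernel over every valid position gives the 6x6 output.
output = [
    [conv_output_at(image, kernel, r, c) for c in range(6)]
    for r in range(6)
]
print(top_left, len(output), len(output[0]))
```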

37
00:03:27,600 --> 00:03:31,260
Now this is the case of five by five.

38
00:03:31,260 --> 00:03:34,590
So the receptive field span here is quite great.

39
00:03:34,590 --> 00:03:42,570
See, we could get or we could capture this information in this patch in the input right here.

40
00:03:42,750 --> 00:03:46,680
Now for the number of parameters, you could simply count this.

41
00:03:46,680 --> 00:03:50,520
We have five times five, which is 25 parameters.

42
00:03:50,520 --> 00:03:52,920
Here we have 25 parameters.

43
00:03:52,920 --> 00:03:59,880
And then for the learning capabilities, this one is quite limited here.

44
00:04:00,150 --> 00:04:09,210
Here we have 25, and here we'll say this is quite limited because, if we suppose that we have an input,

45
00:04:09,210 --> 00:04:15,600
let's say we have this input which is passed through a single conv layer, and then obviously the conv

46
00:04:15,600 --> 00:04:20,130
layer ends with a nonlinearity, in our case the ReLU nonlinearity.

47
00:04:20,130 --> 00:04:24,900
So let's say we have the nonlinearity right here and then we get the output.

48
00:04:25,020 --> 00:04:36,420
Now this doesn't capture as much complex information as we would be able to capture if we had two conv

49
00:04:36,420 --> 00:04:37,470
layers stacked.

50
00:04:37,500 --> 00:04:42,450
Now here, when you stack this first one, this is a case of three by three and this is the case of

51
00:04:42,450 --> 00:04:43,050
five by five.

52
00:04:43,050 --> 00:04:49,560
So here, after this first three by three, we then have this other three by three with this other

53
00:04:49,560 --> 00:04:50,310
nonlinearity.

54
00:04:50,310 --> 00:04:57,810
So here we are able to capture much more complex information from our input

55
00:04:57,810 --> 00:04:59,370
data as compared to

56
00:04:59,520 --> 00:05:02,270
when we just have one single conv layer.

57
00:05:02,280 --> 00:05:04,230
So that's why here,

58
00:05:04,380 --> 00:05:06,450
let's take this off here,

59
00:05:06,450 --> 00:05:09,960
the learning capability isn't as much as with these

60
00:05:09,960 --> 00:05:11,310
two three by three layers.

61
00:05:11,520 --> 00:05:15,290
Now, let's get to the receptive field span for the two three by three layers.

62
00:05:15,300 --> 00:05:19,620
To understand this, we are going to take this example right here.

63
00:05:19,620 --> 00:05:22,440
So take note that here we have input size of ten.

64
00:05:22,440 --> 00:05:23,870
So we're going to maintain this.

65
00:05:23,880 --> 00:05:26,250
We have input size ten.

66
00:05:26,290 --> 00:05:26,990
Okay.

67
00:05:27,000 --> 00:05:32,880
Now, the kernel size is three, unlike here where we have five. Padding, dilation and stride remain the

68
00:05:32,880 --> 00:05:33,520
same.

69
00:05:33,540 --> 00:05:37,380
Now notice that the output we have here is eight by eight.

70
00:05:37,380 --> 00:05:39,100
Unlike here, where we have six by six.

71
00:05:39,120 --> 00:05:45,540
Now since we have two three by three conv layers, we're going to get this output just like

72
00:05:45,540 --> 00:05:47,460
the output we get here is this.

73
00:05:47,460 --> 00:05:50,130
So let's draw, let's put this here.

74
00:05:50,130 --> 00:05:52,830
We have this input, there it is.

75
00:05:52,830 --> 00:05:57,780
And then we get this output, which is going to be the input of another three by three.

76
00:05:57,780 --> 00:06:02,220
So this is the first three by three and then this is the next three by three.

77
00:06:02,220 --> 00:06:07,680
So this will be the input of another three by three conv layer and then we should produce the output.

78
00:06:07,680 --> 00:06:13,650
So what we're trying to show you is when we have this output here, which is the input of this next

79
00:06:13,650 --> 00:06:14,160
layer.

80
00:06:14,160 --> 00:06:19,740
So these two here are combined as one; this forms one.

81
00:06:19,740 --> 00:06:24,330
So this here, this eight by eight, is now an input.

82
00:06:24,330 --> 00:06:26,760
So here, instead of having ten, we have eight.

83
00:06:26,760 --> 00:06:28,770
So we take an input size of eight.

84
00:06:28,770 --> 00:06:29,550
There we go.

85
00:06:29,550 --> 00:06:35,550
Same kernel size; padding, dilation and stride are the same. And here is what we want you to notice.

86
00:06:35,550 --> 00:06:37,270
Let's take this off.

87
00:06:37,290 --> 00:06:42,060
What I want you to notice here is the fact that this output is six by six.

88
00:06:42,060 --> 00:06:47,640
And so this means that this here, if we follow this, let's take the mouse.

89
00:06:47,640 --> 00:06:55,440
If we follow, you see, this here captures the same information as this one here would capture.
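
The sizes read off the visualizer can be checked with the standard convolution output-size formula. The sketch below is a minimal illustration; the function names `conv_output_size` and `stacked_receptive_field` are assumptions made for this example, and the receptive-field helper only covers the stride-1, dilation-1 case used in the session.

```python
def conv_output_size(input_size, kernel_size, padding=0, dilation=1, stride=1):
    """Output width (or height) of a convolution along one dimension."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (input_size + 2 * padding - effective_kernel) // stride + 1


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked conv layers, assuming stride 1 and
    dilation 1 everywhere: each layer adds (kernel_size - 1) pixels."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


single_5x5 = conv_output_size(10, 5)                     # one 5x5 layer
after_first_3x3 = conv_output_size(10, 3)                # first 3x3 layer
after_second_3x3 = conv_output_size(after_first_3x3, 3)  # second 3x3 layer

print(single_5x5, after_first_3x3, after_second_3x3)
print(stacked_receptive_field([5]), stacked_receptive_field([3, 3]))
```

Both routes end at a six by six output from a ten by ten input, and two stacked three by three layers see the same five by five patch of the input as one five by five layer.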

90
00:06:55,530 --> 00:07:03,510
And so we could confidently say that the receptive field span here is quite great. Now, for the number of

91
00:07:03,510 --> 00:07:04,230
parameters.

92
00:07:04,230 --> 00:07:06,750
Here we have nine and here we have nine.

93
00:07:06,750 --> 00:07:09,540
Nine plus nine is 18.

94
00:07:09,540 --> 00:07:19,110
So 18 here. With 18, you see clearly that this model is now cheaper as compared to a model which uses a five

95
00:07:19,110 --> 00:07:21,990
by five or even 11 by 11.
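
The 25 versus 18 count is for a single input and output channel without biases. As a hedged sketch, the same comparison can be repeated with more channels; the helper `conv_weights` and the choice of 64 channels are assumptions made only for this illustration.

```python
def conv_weights(kernel_size, in_channels=1, out_channels=1):
    """Number of weights in one conv layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


# Single-channel case, as counted in the session.
one_5x5 = conv_weights(5)
two_3x3 = 2 * conv_weights(3)

# Hypothetical case with 64 input and 64 output channels per layer.
one_5x5_64 = conv_weights(5, 64, 64)
two_3x3_64 = 2 * conv_weights(3, 64, 64)

print(one_5x5, two_3x3)        # 25 18
print(one_5x5_64, two_3x3_64)  # 102400 73728
```

The ratio stays at 18 to 25 whatever the channel count, as long as every layer keeps the same number of channels.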

96
00:07:22,200 --> 00:07:26,910
Now, for the learning capabilities, we've seen this already: because we stacked two, we're able to

97
00:07:26,910 --> 00:07:29,880
capture much more complex information from the inputs.

98
00:07:29,880 --> 00:07:31,350
And so this is great.

99
00:07:31,350 --> 00:07:41,940
So in all we see that it's better to use these conv layers with smaller kernel sizes.

100
00:07:42,120 --> 00:07:48,690
We want to thank Edward Yang for providing this convolution visualizer, which you can find at ezyang

101
00:07:48,690 --> 00:07:50,160
dot GitHub dot IO.

102
00:07:50,460 --> 00:07:58,530
So at this point we've understood why the authors of this paper prefer to work with these smaller kernel

103
00:07:58,530 --> 00:08:01,890
size convolutional layers, that is, three by three.

104
00:08:02,340 --> 00:08:08,700
And with the smaller conv layers they were able to push the depth to 16.

105
00:08:08,700 --> 00:08:20,820
That is, between 16 and 19 layers, where the 16-layer version is VGG16 and the 19-layer version is

106
00:08:20,820 --> 00:08:22,800
VGG19.

107
00:08:24,150 --> 00:08:28,230
In this table we have the summary of those models.

108
00:08:28,230 --> 00:08:33,090
Here we'll focus on the 16 and 19 weight layer models.

109
00:08:33,090 --> 00:08:35,280
So here we have 16 weight layers.

110
00:08:35,280 --> 00:08:41,040
You see, we start with two conv layers and then max pool, and then two conv layers and then max pool, and

111
00:08:41,040 --> 00:08:46,080
then three conv layers, max pool, three conv layers, max pool, and then three conv layers.

112
00:08:46,080 --> 00:08:53,310
Then from here we have the max pool, we have a flatten layer, and then we have these three fully connected

113
00:08:53,310 --> 00:08:56,820
layers which end up with a softmax.
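
The layer order just listed can be written down as a short sketch that also confirms the weight-layer count. The block sizes for the 16-layer model follow the description above; the block sizes for the 19-layer model are an assumption taken from the 19-layer column of the paper's table, and the names `layer_sequence` and `weight_layers` are made up for this illustration.

```python
VGG16_BLOCKS = [2, 2, 3, 3, 3]  # conv layers per block, as listed in the session
VGG19_BLOCKS = [2, 2, 4, 4, 4]  # assumption: the 19-layer column of the paper's table
NUM_FC = 3                      # three fully connected layers


def layer_sequence(blocks, num_fc=NUM_FC):
    """Spell out the layer order: conv blocks, each followed by a max pool,
    then flatten, the fully connected layers and a softmax."""
    seq = []
    for n in blocks:
        seq += ["conv3x3"] * n + ["maxpool"]
    seq += ["flatten"] + ["fc"] * num_fc + ["softmax"]
    return seq


def weight_layers(seq):
    """Only conv and fully connected layers carry weights."""
    return sum(1 for name in seq if name in ("conv3x3", "fc"))


vgg16 = layer_sequence(VGG16_BLOCKS)
vgg19 = layer_sequence(VGG19_BLOCKS)
print(weight_layers(vgg16), weight_layers(vgg19))
```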

114
00:08:56,820 --> 00:09:04,080
We are dealing with a multiclass classification problem, hence the softmax. The authors also noted that the usage

115
00:09:04,080 --> 00:09:09,930
of the LRN normalization, that is, local response normalization, which we saw in AlexNet, did

116
00:09:09,930 --> 00:09:16,590
not improve performance, but instead led to increased memory consumption and computation time. In this

117
00:09:16,590 --> 00:09:19,020
section they describe the training process,

118
00:09:19,170 --> 00:09:24,960
then from here testing, and we could get some results.

119
00:09:26,160 --> 00:09:34,050
And here we see we have this top-1 validation error, and we also notice that we have this ConvNet

120
00:09:34,050 --> 00:09:35,670
layer, or rather this ConvNet model:

121
00:09:35,670 --> 00:09:37,850
here A, A-LRN, B, C, D.

122
00:09:37,920 --> 00:09:47,310
You could get them by checking this table. Here you have A, A with the LRN normalization, B, C,

123
00:09:47,310 --> 00:09:48,930
D, and E right here.

124
00:09:48,930 --> 00:09:54,420
So basically these are the different models, and here are the results.

125
00:09:54,420 --> 00:09:59,160
So when working with the local response normalization, we

126
00:09:59,210 --> 00:10:03,340
notice that this seems to have even the highest error.

127
00:10:03,350 --> 00:10:09,130
So that's why the authors did not make use of this normalization technique.

128
00:10:09,140 --> 00:10:12,290
Now we get the best results with VGG19.

129
00:10:12,290 --> 00:10:20,750
So with this 19 here we get 25.5 for top-1 and then 8.0 for top-5.

130
00:10:20,750 --> 00:10:26,030
Now, if you're new to this notion of top-1 val error and top-5 val error, you could check

131
00:10:26,030 --> 00:10:31,550
out the previous section where we discussed the top-1 accuracy and top-5 accuracy.
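
As a quick reminder, a top-k error counts a prediction as wrong only when the true class is not among the k highest-scoring classes. The sketch below uses a made-up three-class score table purely for illustration; the function name `top_k_error` and all of the numbers are assumptions, not results from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k classes
    with the highest scores."""
    wrong = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Hypothetical scores for 4 samples over 3 classes.
scores = [
    [0.10, 0.60, 0.30],
    [0.50, 0.20, 0.30],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 2, 0, 1]

print(top_k_error(scores, labels, 1))  # top-1 error
print(top_k_error(scores, labels, 2))  # top-2 error, same idea as top-5
```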

132
00:10:31,550 --> 00:10:32,390
Here again,

133
00:10:32,390 --> 00:10:40,490
now we have this comparison with the state-of-the-art solutions at the time. You have here AlexNet, OverFeat,

134
00:10:41,420 --> 00:10:51,650
Inception, MSRA, Clarifai, that is, the model by Clarifai, and Zeiler and Fergus's model.

135
00:10:51,650 --> 00:10:57,260
So here we see that VGG had, at the time, among the best results.

136
00:10:57,260 --> 00:11:05,240
And with this we can conclude that stacking up those conv layers with smaller kernel size actually helps

137
00:11:05,240 --> 00:11:07,250
in getting better results.

138
00:11:07,250 --> 00:11:12,950
And in subsequent sections we'll see the limit of just stacking up many conv layers, as with VGG.