1
00:00:00,180 --> 00:00:07,140
Hello, everyone, and welcome to this session in which we treat this modern convolutional neural network

2
00:00:07,140 --> 00:00:10,320
architecture known as EfficientNet.

3
00:00:10,800 --> 00:00:19,170
In the EfficientNet paper, the authors propose a more controlled manner of designing convolutional neural

4
00:00:19,170 --> 00:00:26,220
networks such that it suits our demands in accuracy and speed.

5
00:00:27,150 --> 00:00:34,020
And as you can see in those plots, you see that we could choose suitable parameters such that we could

6
00:00:34,020 --> 00:00:42,600
modify or increase our accuracy while taking note of how this affects the speed.

7
00:00:42,630 --> 00:00:51,000
That said, in this section, we'll see how Mingxing Tan and Quoc V. Le build a system for automatically

8
00:00:51,000 --> 00:00:55,670
scaling our convolutional neural networks much more efficiently.

9
00:00:55,680 --> 00:01:03,180
ConvNets are commonly developed at a fixed resource budget and then scaled up for better accuracy

10
00:01:03,270 --> 00:01:05,430
if more resources are available.

11
00:01:05,430 --> 00:01:10,100
So in the case of the ResNets, we had ResNet-34.

12
00:01:10,110 --> 00:01:12,780
So here, ResNet-34.

13
00:01:12,810 --> 00:01:19,540
Then after that we had ResNet-50, then ResNet, say, 152.

14
00:01:19,740 --> 00:01:30,930
And depending on the kind of setting, we are going to pick the ResNet model which will permit us

15
00:01:30,930 --> 00:01:38,410
to run without any problems of latency while maintaining a reasonable accuracy.

16
00:01:38,430 --> 00:01:45,090
So this means that if we are working in a high compute environment, then we could afford to work with

17
00:01:45,090 --> 00:01:45,600
this.

18
00:01:45,600 --> 00:01:53,100
Whereas if we are working in a low-compute environment, then we would have to work with this model with

19
00:01:53,100 --> 00:01:56,160
fewer conv layers.

20
00:01:56,850 --> 00:02:05,460
Now, that said, in this paper, the authors propose a more systematic study of how this model scaling

21
00:02:05,460 --> 00:02:06,360
can be done.

22
00:02:07,260 --> 00:02:13,650
And unlike other methods where we just scale by increasing the depth, here they propose scaling by increasing

23
00:02:13,650 --> 00:02:22,020
the depth, increasing the width (the number of channels), and the resolution, that is, the size of the

24
00:02:22,020 --> 00:02:23,130
input image.

25
00:02:23,700 --> 00:02:30,930
And so here they propose a new scaling method that uniformly scales all dimensions of depth, width, and

26
00:02:30,930 --> 00:02:37,230
resolution using a simple yet highly effective compound coefficient.

27
00:02:38,420 --> 00:02:40,110
You could see the results right here.

28
00:02:40,130 --> 00:02:42,230
You see, for example, the ResNet-50.

29
00:02:43,220 --> 00:02:44,300
Let's extrapolate.

30
00:02:44,300 --> 00:02:46,720
Let's pick this one here.

31
00:02:46,730 --> 00:02:50,060
Although this has more parameters than the ResNet-50.

32
00:02:51,080 --> 00:02:53,880
Let's take instead the B4, because it has fewer parameters.

33
00:02:53,900 --> 00:03:00,830
So you see it has fewer parameters than the ResNet-50, but its accuracy, that is, its top-1

34
00:03:00,830 --> 00:03:07,310
accuracy on ImageNet, is much greater than that of this ResNet-50 here.

35
00:03:07,310 --> 00:03:14,510
So we have the EfficientNet-B4 version, which is at about 83% top-1 accuracy.

36
00:03:14,510 --> 00:03:19,760
While this is only at about 76% top-1 accuracy.

37
00:03:20,420 --> 00:03:26,270
Now in this figure, we see how we have this baseline, some sort of baseline, like in the case of

38
00:03:26,270 --> 00:03:33,440
the ResNet, we could say this is ResNet-18, and then we have this deeper model, the scaled-up one.

39
00:03:33,470 --> 00:03:35,540
This could be ResNet, say, 50.

40
00:03:35,600 --> 00:03:38,750
Now, in this case, they have this baseline.

41
00:03:38,750 --> 00:03:45,650
First of all, this baseline is obtained by carrying out an automatic neural architecture search.

42
00:03:45,650 --> 00:03:53,170
So we get this baseline and the different layers we have for this baseline.

43
00:03:53,180 --> 00:03:56,930
And then note that this baseline has a depth.

44
00:03:56,960 --> 00:03:59,000
As you can see, we have this depth.

45
00:03:59,000 --> 00:04:06,380
And when we scale deeper, when we carry out depth scaling, you see we have many more layers added

46
00:04:06,380 --> 00:04:07,700
to this one.

47
00:04:07,700 --> 00:04:12,980
And then when we do width scaling, we increase the number of channels.

48
00:04:12,980 --> 00:04:19,730
So you see we have this smaller number of channels for the baseline, and then the width scaling permits us to increase

49
00:04:19,730 --> 00:04:21,680
this number of channels.

50
00:04:21,680 --> 00:04:25,970
Then also we have the resolution scaling, which has to do with the inputs.

51
00:04:25,970 --> 00:04:28,880
So here we have this input, height times width.

52
00:04:28,880 --> 00:04:33,740
And now, after carrying out resolution scaling, we see we increase this resolution.

53
00:04:33,740 --> 00:04:38,930
This means that we may work with a base of 224 by 224.

54
00:04:38,930 --> 00:04:46,640
And then after scaling, we may get to say 640 by 640.

55
00:04:47,510 --> 00:04:54,470
Then from here we also have the compound scaling, which is what is used in this paper where we don't

56
00:04:54,470 --> 00:05:02,540
only focus on the width or the depth or the resolution, but we scale all of these systematically to achieve

57
00:05:02,540 --> 00:05:09,140
the best possible results while maintaining reasonable speeds.

58
00:05:09,710 --> 00:05:17,840
That said, we can see from these different plots that when you increase, like here, the width,

59
00:05:17,840 --> 00:05:19,820
that is the number of channels which is increased.

60
00:05:19,820 --> 00:05:25,310
You notice that as we increase this number of channels, at some point it starts to plateau.

61
00:05:25,310 --> 00:05:31,520
And then when we increase the depth, at some point it starts to plateau. Then when we also increase

62
00:05:31,520 --> 00:05:33,890
this input size, that is, the resolution,

63
00:05:33,890 --> 00:05:35,510
at some point it starts to plateau.

64
00:05:35,510 --> 00:05:42,470
And so this is why the authors propose a technique where we could combine all this such that we get

65
00:05:42,470 --> 00:05:44,420
even better results.

66
00:05:45,260 --> 00:05:46,340
And there we go.

67
00:05:46,340 --> 00:05:49,700
We see the effect of compound scaling.

68
00:05:49,700 --> 00:05:56,360
You see that we have this depth and then the resolution. You see, when the depth is 1 and the resolution

69
00:05:56,360 --> 00:06:01,460
is 1, we have this blue curve here; you see we have worse results here.

70
00:06:01,460 --> 00:06:09,800
Whereas when we double the depth and increase the resolution by 1.3, you see we have the best results

71
00:06:09,800 --> 00:06:10,670
right here.

72
00:06:12,020 --> 00:06:17,300
That said, we will now dive a bit deeper and look at this compound coefficient which they spoke

73
00:06:17,300 --> 00:06:19,010
of at the very beginning.

74
00:06:19,100 --> 00:06:24,800
So we go down here and we have this formula right here.

75
00:06:24,860 --> 00:06:27,020
See this formula right here?

76
00:06:28,340 --> 00:06:28,880
All right.

77
00:06:28,880 --> 00:06:34,110
Look at this equation, which is equation 3, where we have these different formulas,

78
00:06:34,130 --> 00:06:41,600
the three formulas: the depth equals alpha to the power phi, and this phi is a user-specified coefficient

79
00:06:41,600 --> 00:06:46,070
that controls how many more resources are available for model scaling.

80
00:06:46,070 --> 00:06:51,460
So this is some sort of scaling coefficient. Right here it is: phi, phi, phi.

81
00:06:51,470 --> 00:06:54,410
And then here we have alpha, beta, and gamma.

82
00:06:54,440 --> 00:07:01,660
Now, this is designed such that alpha times beta squared times gamma squared is approximately equal to 2, and alpha

83
00:07:01,730 --> 00:07:01,970
is greater

84
00:07:02,000 --> 00:07:06,320
than or equal to 1, beta is always greater than or equal to 1, and gamma is always greater than or equal to 1.

85
00:07:06,320 --> 00:07:09,410
So now we are going to carry out a grid search.

86
00:07:09,410 --> 00:07:17,330
So we're going to search for the best values for this alpha, beta and gamma and then fix them.

87
00:07:17,330 --> 00:07:18,770
Obviously, they're constant.

88
00:07:18,770 --> 00:07:24,770
So we're going to fix these and then start varying phi, such that we carry out the scaling in a

89
00:07:24,770 --> 00:07:26,420
more systematic manner.

90
00:07:27,220 --> 00:07:33,220
And there we go: in order to find the values for alpha, beta, and gamma, they fixed

91
00:07:33,220 --> 00:07:40,690
phi to be equal to 1, and then they obtained alpha 1.2, beta 1.1, and gamma 1.15.

92
00:07:40,690 --> 00:07:47,350
All of these such that we have this constraint. Now, they then fix alpha, beta, and gamma as constants

93
00:07:47,350 --> 00:07:52,060
and scale up the baseline network with the different phis, as we already explained.

94
00:07:53,350 --> 00:07:58,690
And it's based on these different values of phi that we obtain the different versions of

95
00:07:58,690 --> 00:08:01,180
EfficientNet, going from B1 to B7.
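
To make the compound-scaling rule concrete, here is a minimal sketch in Python. It uses the constants quoted above (alpha 1.2, beta 1.1, gamma 1.15) and the 224-pixel base resolution. The function name `compound_scale` and the integer phi values in the loop are assumptions made for illustration; they are not necessarily the exact settings behind the published B1 to B7 models.

```python
# Constants found by the grid search with phi fixed to 1 (as quoted in the talk).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # bases for depth, width, resolution


def compound_scale(phi, base_resolution=224):
    """Return (depth multiplier, width multiplier, input resolution) for a given phi.

    Hypothetical helper: depth = alpha^phi, width = beta^phi,
    resolution = base_resolution * gamma^phi.
    """
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


# The design constraint: alpha * beta^2 * gamma^2 should be approximately 2.
constraint = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {constraint:.3f}")

# Illustrative phi values only.
for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r:.0f}px")
```

With phi equal to 0 the multipliers are all 1, which corresponds to the unscaled baseline network.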

96
00:08:02,300 --> 00:08:08,150
Now, before moving on, it's important to take note of this EfficientNet-B0, which is our baseline

97
00:08:08,150 --> 00:08:08,700
network.

98
00:08:08,720 --> 00:08:13,430
Remember, we have some baseline network which we have seen here.

99
00:08:13,460 --> 00:08:14,630
Let's go this way.

100
00:08:14,660 --> 00:08:17,990
We have this baseline network right here.

101
00:08:18,020 --> 00:08:26,720
This one, this baseline network, which we are going to scale such that we have better results while

102
00:08:27,560 --> 00:08:30,830
working with compute constraints.

103
00:08:31,100 --> 00:08:37,880
Now, that said, let's take this off and then scroll back down to this baseline model

104
00:08:37,880 --> 00:08:40,760
which is given just here in this table.

105
00:08:40,760 --> 00:08:45,350
You see we have this baseline model, EfficientNet-B0.

106
00:08:45,350 --> 00:08:51,710
And then you'll notice first that the resolution is 224 by 224, meaning that we're going to start with

107
00:08:51,710 --> 00:08:59,690
image sizes of 224 by 224, and note that different image sizes could be used for the different models,

108
00:08:59,690 --> 00:09:07,280
although the best or the most adapted resolution for each model should be preferably used.

109
00:09:07,460 --> 00:09:13,580
Now, that said, here you see we have a usual conv layer and then we have this MBConv right here.

110
00:09:13,610 --> 00:09:20,000
Now, before getting to the MBConv, also note that after carrying out the neural architecture search,

111
00:09:20,690 --> 00:09:29,660
the authors noticed that we could also make use of this five by five kernel or five by five kernel size

112
00:09:29,660 --> 00:09:30,350
filters.

113
00:09:30,350 --> 00:09:38,540
So unlike what we had discussed in previous sessions, these five by five kernel size filters are still

114
00:09:38,540 --> 00:09:40,010
very useful.

115
00:09:40,010 --> 00:09:45,920
Then, getting to the MBConv, we find here that they say its main building block

116
00:09:45,920 --> 00:09:49,550
is this mobile inverted bottleneck.

117
00:09:49,550 --> 00:09:53,000
So recall the mobile inverted bottleneck which we found in Sandler et al.

118
00:09:53,000 --> 00:09:59,780
Sandler et al. is the MobileNetV2 paper we had seen already, to which they also add the

119
00:09:59,780 --> 00:10:02,540
squeeze-and-excitation optimization.

120
00:10:02,810 --> 00:10:08,090
Now, in MobileNetV3, the squeeze-and-excitation optimization was added.

121
00:10:08,090 --> 00:10:15,350
So here we have basically the MobileNet inverted residual block, which we have seen already.

122
00:10:15,350 --> 00:10:22,740
And then if we check out this MobileNetV3 paper, which you can feel free to look at, you would

123
00:10:22,760 --> 00:10:24,860
have this squeeze-and-excitation.

124
00:10:24,860 --> 00:10:26,660
Right here, let's zoom into this.

125
00:10:26,780 --> 00:10:32,810
You see here we have the MobileNetV2 bottleneck with residual; there's the residual.

126
00:10:32,840 --> 00:10:34,940
Then our bottleneck as usual.

127
00:10:34,940 --> 00:10:44,570
Here we have this low dimension input getting in and then it gets expanded and then we have this low

128
00:10:44,570 --> 00:10:49,820
dimension output which is produced by this final layer right here.

129
00:10:49,850 --> 00:10:59,210
Now, with this squeeze-and-excitation, to better understand this squeeze-and-excitation layer, we should

130
00:10:59,210 --> 00:11:02,960
or we could get back to how the conv layers actually work.

131
00:11:02,960 --> 00:11:09,320
You see that to get this output, let's take this off to get this output.

132
00:11:09,320 --> 00:11:16,670
For example, we carry out multiplications and additions for each and every channel.

133
00:11:16,670 --> 00:11:23,440
Here, that's for each and every channel: the input and the filters which correspond to this channel.

134
00:11:23,450 --> 00:11:29,270
And then to produce this negative one right here, all of these are added up with equal weights.

135
00:11:29,270 --> 00:11:35,090
So the output from this computation, let's call it alpha, will be put here, plus the output from

136
00:11:35,090 --> 00:11:35,930
this computation.

137
00:11:35,930 --> 00:11:36,830
Let's call it beta.

138
00:11:36,830 --> 00:11:43,430
will be put here, plus the output from this computation, let's call it gamma, will be put here, and we'll

139
00:11:43,430 --> 00:11:47,350
get this value or this output of negative one at this position.
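
The equal-weight summation described here can be sketched in a few lines of Python. The patch and filter values below are hypothetical, chosen only for illustration, and are not the numbers shown on the slide; `channel_response` is an assumed helper name.

```python
def channel_response(patch, kernel):
    """Multiply-and-add one channel's input patch with its matching filter slice."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))


# Hypothetical 2x2 input patches, one per channel (3 channels).
patches = [
    [[1, 0], [2, 1]],
    [[0, 1], [1, 0]],
    [[2, 2], [0, 1]],
]
# Hypothetical 2x2 filter slices, one per channel.
kernels = [
    [[1, -1], [0, 1]],
    [[-1, 0], [1, 1]],
    [[0, -1], [1, 0]],
]

# One partial result per channel: the "alpha", "beta", "gamma" of the talk.
per_channel = [channel_response(p, k) for p, k in zip(patches, kernels)]

# A plain convolution adds them with equal weights of 1.
output = sum(per_channel)
print(per_channel, output)
```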

140
00:11:47,360 --> 00:11:56,840
Now what the squeeze and excitation layer brings in is some weights on this addition operation right

141
00:11:56,840 --> 00:11:57,200
here.

142
00:11:57,200 --> 00:12:04,670
So instead of just having a weight of one, here one and here one, we're going to have some modified

143
00:12:04,670 --> 00:12:12,770
parameter or some parameters added here such that certain channels influence the output more than some

144
00:12:12,770 --> 00:12:13,490
others.

145
00:12:15,080 --> 00:12:23,900
And so here, instead of one, we can have a weight a for alpha, here a weight b, and here a c. Getting

146
00:12:23,900 --> 00:12:31,790
back to this paper, the way this is done is as follows: we start by carrying out some pooling, and the result

147
00:12:31,790 --> 00:12:37,840
of this pooling will be a one by one by C output.

148
00:12:37,850 --> 00:12:39,140
Now C is the number of channels.

149
00:12:39,140 --> 00:12:46,850
So if here we have C channels, we have this output here of C, so this output here will be one

150
00:12:46,850 --> 00:12:52,370
by one by C. You notice how this is small, and then we have C, the size.

151
00:12:52,370 --> 00:12:56,930
The size C is exactly the same size right here.

152
00:12:57,140 --> 00:13:01,820
So we have exactly this size is the same as this.

153
00:13:02,510 --> 00:13:05,480
And therefore the height and width is one by one.

154
00:13:05,750 --> 00:13:13,250
Now, once we get this, we pass this through two fully connected layers, you see, with this ReLU

155
00:13:13,250 --> 00:13:14,180
activation.

156
00:13:14,180 --> 00:13:20,570
And then here we have this hard sigmoid activation after this fully connected layer.

157
00:13:20,570 --> 00:13:31,580
And then here we get this output of this same number of channels C which will match with this one.

158
00:13:33,140 --> 00:13:39,860
But now what we get here will be multiplied by each and every channel here.

159
00:13:39,860 --> 00:13:48,800
So this now will serve as the weights, because remember, we wrote this as alpha

160
00:13:49,400 --> 00:14:03,350
plus beta plus gamma, and then we had a, b, and then c. So this a, b, and c is actually this output

161
00:14:03,350 --> 00:14:04,040
right here.

162
00:14:04,040 --> 00:14:07,250
We're supposing that the number of channels equals three.

163
00:14:07,250 --> 00:14:13,310
So we have this three here and these are the values which we get after going through this fully connected

164
00:14:13,310 --> 00:14:14,000
layer.

165
00:14:14,000 --> 00:14:18,290
And then we take this now and multiply by each and every channel.

166
00:14:18,290 --> 00:14:21,680
So if you break this up into three parts, let's remove this.

167
00:14:22,580 --> 00:14:27,950
There is this and we could cut this, let's cut this into three parts.

168
00:14:27,950 --> 00:14:30,200
So we have one, two, three.

169
00:14:30,200 --> 00:14:32,120
So we suppose this is one by one by three.

170
00:14:32,120 --> 00:14:37,910
So if we have this, this first part will multiply this chunk.

171
00:14:37,910 --> 00:14:44,360
So we'll multiply this chunk and then this other part will multiply this next chunk.

172
00:14:44,360 --> 00:14:48,350
And then this other part here will multiply this other chunk.

173
00:14:48,350 --> 00:14:57,530
And so now we have these channels whose contribution to this output is now weighted.
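
The squeeze-and-excitation steps just described (global pooling to 1 by 1 by C, two fully connected layers with ReLU and hard sigmoid, then channel-wise multiplication) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the tiny 3-channel feature map, and the weight values are all hypothetical, and the hard sigmoid is taken to be relu6(x + 3) / 6.

```python
def relu(x):
    return max(0.0, x)


def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid: relu6(x + 3) / 6.
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def dense(vector, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def squeeze_excite(feature_map, w1, b1, w2, b2):
    """feature_map is a list of C channels, each an H x W list of lists."""
    # Squeeze: global average pooling gives one value per channel (1 x 1 x C).
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    # Excitation: FC + ReLU, then FC + hard sigmoid, back to C values in [0, 1].
    hidden = [relu(h) for h in dense(pooled, w1, b1)]
    scales = [hard_sigmoid(s) for s in dense(hidden, w2, b2)]
    # Reweight: every value in channel i is multiplied by scales[i].
    scaled = [[[v * s for v in row] for row in ch]
              for ch, s in zip(feature_map, scales)]
    return scaled, scales


# Hypothetical 3-channel, 2x2 feature map and hand-picked weights.
fmap = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
]
w1, b1 = [[1.0, 1.0, 1.0]], [0.0]                   # C=3 -> 1 hidden unit
w2, b2 = [[1.0], [0.0], [-1.0]], [0.0, 0.0, 0.0]    # 1 hidden unit -> C=3

scaled, scales = squeeze_excite(fmap, w1, b1, w2, b2)
print(scales)
```

The three values in `scales` play the role of the weights a, b, and c from the explanation: each one multiplies a whole channel.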

174
00:14:58,250 --> 00:15:01,160
That said, we also have the expansion factor.

175
00:15:01,160 --> 00:15:02,240
Here it is six.

176
00:15:03,320 --> 00:15:11,690
Then, getting back to the results, we see how the EfficientNets perform better than the corresponding

177
00:15:11,720 --> 00:15:19,520
or other ConvNets with a similar number of parameters, or even more parameters.

178
00:15:19,520 --> 00:15:24,650
Like here we see how the EfficientNet-B0 outperforms the

179
00:15:24,650 --> 00:15:32,390
ResNet-50, though you see this great difference in the number of parameters, as the EfficientNet is more efficient,

180
00:15:32,390 --> 00:15:36,140
or rather has fewer parameters, as compared to the

181
00:15:36,140 --> 00:15:37,040
ResNet-50.

182
00:15:37,040 --> 00:15:43,580
We see EfficientNet-B1 compared to ResNet-152: see, 60 million versus 7.8 million.

183
00:15:43,580 --> 00:15:50,900
But this one is more accurate than the ResNet-152. You could check this out right up to

184
00:15:50,900 --> 00:15:51,140
EfficientNet-B7.

185
00:15:51,170 --> 00:15:54,230
You see we have this GPipe.

186
00:15:54,920 --> 00:16:00,350
They are both at 97, but this one has 557 million parameters,

187
00:16:00,350 --> 00:16:00,680
while

188
00:16:00,680 --> 00:16:04,340
this one is only at 66 million parameters.

189
00:16:05,060 --> 00:16:14,300
And we could also look at the floating point operations. You see here we have fewer floating point operations

190
00:16:15,380 --> 00:16:19,970
for the EfficientNet-B0 while still having higher accuracy.

191
00:16:20,930 --> 00:16:29,600
We also see that if we scale the MobileNets and the ResNets, they still wouldn't get better

192
00:16:29,600 --> 00:16:31,850
results compared to the EfficientNet.

193
00:16:32,810 --> 00:16:38,630
And this shows the power of the neural architecture search, which was used in getting our baseline.

194
00:16:38,900 --> 00:16:42,590
Now we'll go down and check this out here.

195
00:16:42,590 --> 00:16:47,210
We have these results right here.

196
00:16:47,210 --> 00:16:55,310
You see the class activation map, which is a visualization technique which permits practitioners to understand

197
00:16:55,310 --> 00:17:07,190
how the model or rather what portions of the inputs helped in producing the outputs shows clearly here

198
00:17:07,190 --> 00:17:16,970
that when we use compound scaling, we have the map which is more focused on relevant regions, as you

199
00:17:16,970 --> 00:17:25,100
could see right here, as compared to the baseline model and these other models with

200
00:17:25,100 --> 00:17:28,850
depth scaling, width scaling, and resolution scaling.