1
00:00:00,120 --> 00:00:06,270
Hello, everyone, and welcome to this new and amazing section in which we are going to treat the

2
00:00:06,270 --> 00:00:08,010
MobileNet architecture.

3
00:00:08,280 --> 00:00:15,960
This was first developed by Google researchers in 2017, when we had MobileNet version one, and in 2018

4
00:00:15,960 --> 00:00:19,260
they again developed MobileNet version two.

5
00:00:19,260 --> 00:00:22,950
And after this, there was MobileNet version three.

6
00:00:23,100 --> 00:00:29,760
But here, we are just going to focus on MobileNet version two, from the paper entitled Inverted Residuals and

7
00:00:29,760 --> 00:00:31,350
Linear Bottlenecks.

8
00:00:31,950 --> 00:00:40,260
By just looking at the name, we could guess the type of environments for which this model was

9
00:00:40,260 --> 00:00:40,820
built.

10
00:00:40,830 --> 00:00:48,960
In fact, the MobileNets have been built for environments with low compute resources, like mobile

11
00:00:48,960 --> 00:00:50,790
and edge devices.

12
00:00:51,000 --> 00:00:59,640
So in this section we are going to focus on what permits this model, that is MobileNetV2, to

13
00:00:59,640 --> 00:01:05,430
perform quite well in terms of speed while producing high quality results.

14
00:01:05,430 --> 00:01:12,090
There are two major techniques which make MobileNet version two very powerful, or which permit

15
00:01:12,090 --> 00:01:21,360
us to work at higher speeds while still maintaining reasonable quality results.

16
00:01:21,720 --> 00:01:31,770
Now these two are the depthwise separable convolutions and the inverted residual bottleneck which we have

17
00:01:31,770 --> 00:01:32,610
right here.

18
00:01:32,850 --> 00:01:38,940
Here are the separable convolutions: this is the regular convolution, and here is the separable convolution.

19
00:01:39,660 --> 00:01:45,870
And we'll start by explaining what the depthwise separable convolution is.

20
00:01:46,230 --> 00:01:55,250
A depthwise separable convolution is simply a combination of a depthwise convolution and a pointwise convolution.

21
00:01:55,260 --> 00:02:03,900
Now this pointwise convolution is no different from a normal convolution layer, but with a kernel

22
00:02:03,900 --> 00:02:07,800
size of one, so a one by one convolution.

23
00:02:07,800 --> 00:02:13,050
Here, that's the pointwise convolution, and before this we have the depthwise convolution.

24
00:02:13,080 --> 00:02:22,620
So the depthwise separable convolution is these two put together in sequence

25
00:02:22,980 --> 00:02:24,190
or sequentially.

26
00:02:24,210 --> 00:02:32,880
Now to understand what this depthwise convolution is, let's actually get back to this demo where we

27
00:02:32,880 --> 00:02:36,500
saw how the usual convolution operation works.

28
00:02:36,510 --> 00:02:40,050
As you can see here, we have these inputs.

29
00:02:40,050 --> 00:02:44,700
Basically, let's toggle the movement.

30
00:02:46,140 --> 00:02:47,820
So yeah, we have some inputs.

31
00:02:47,850 --> 00:02:50,040
Now this input is three dimensional.

32
00:02:50,040 --> 00:02:56,700
As you see here we have one, two, three, so 0, 1, 2: we have these three dimensions, that is, three channels.

33
00:02:56,700 --> 00:02:59,730
Let's change the color so it's clear for you to see.

34
00:02:59,730 --> 00:03:09,570
We have this first dimension, we have the second dimension, and then we have this other third dimension.

35
00:03:09,930 --> 00:03:18,090
Now, what goes on during the convolution operation is that we have this kernel, so we have this here

36
00:03:18,090 --> 00:03:20,570
and then we also have some kernel.

37
00:03:20,580 --> 00:03:23,340
So in this case, a three by three kernel.

38
00:03:24,510 --> 00:03:28,490
We have this kernel here, three by three.

39
00:03:28,500 --> 00:03:34,560
And the reason why it has three channels is because we have three channels here in the input.

40
00:03:34,560 --> 00:03:42,120
So because the inputs have three channels, we have a three channel kernel.

41
00:03:42,270 --> 00:03:48,750
Now, here we could add a fourth channel right here.

42
00:03:48,750 --> 00:03:53,040
And then we'll also have here a kernel with four channels.

43
00:03:53,130 --> 00:04:01,590
And then what goes on during the convolution operation is exactly what we're seeing here.

44
00:04:01,590 --> 00:04:07,380
So we have this one which is placed at a given position.

45
00:04:07,380 --> 00:04:11,580
Let's pick its color: it is placed at this position, for example.

46
00:04:11,580 --> 00:04:18,510
And then we multiply all the values. Let's say we are at this point here.

47
00:04:18,510 --> 00:04:23,580
You see, at this point we can see how this filter is applied.

48
00:04:23,580 --> 00:04:30,210
So we match this filter channel with this one, and we match this one with this in red.

49
00:04:30,210 --> 00:04:33,960
We match this one with this in green.

50
00:04:34,410 --> 00:04:36,710
Now, since there are three, obviously we don't have four.

51
00:04:36,720 --> 00:04:43,290
Now let's get back to this operation where we take this one and multiply with all the values we have

52
00:04:43,290 --> 00:04:43,620
here.

53
00:04:43,620 --> 00:04:51,390
So when we multiply all the values we have here, we get, for example, in this case one times zero plus

54
00:04:51,390 --> 00:04:54,450
zero times zero plus one times zero.

55
00:04:54,450 --> 00:04:56,930
You see, all this at the top cancels.

56
00:04:56,940 --> 00:04:59,940
And then we have zero times zero, zero times one.

57
00:05:00,080 --> 00:05:00,800
Those cancel.

58
00:05:00,800 --> 00:05:07,550
We have one times one here, so we have one, and then we have negative one times one, negative one.

59
00:05:07,550 --> 00:05:11,450
So we have negative one and then negative one times two.

60
00:05:11,480 --> 00:05:13,390
We have negative two.

61
00:05:13,400 --> 00:05:21,160
So this gives us a value of negative two and then we move to the next one.

62
00:05:21,170 --> 00:05:23,080
This one, let's change the color.

63
00:05:23,090 --> 00:05:24,890
You see, we have one times.

64
00:05:25,130 --> 00:05:26,420
All this at the top is zero.

65
00:05:26,420 --> 00:05:29,990
So we multiply all this, because it's just simply matching.

66
00:05:29,990 --> 00:05:35,570
So when we have one times zero, we have zero; zero times zero, zero; one times zero, zero; one times

67
00:05:35,570 --> 00:05:42,200
zero, zero; negative one times two, negative two; then negative one times zero,

68
00:05:42,200 --> 00:05:46,760
we have zero; negative one times zero, zero; negative one times two, negative two.

69
00:05:46,760 --> 00:05:49,730
So this gives us negative four.

70
00:05:50,030 --> 00:05:51,590
At this point we have negative four.

71
00:05:51,590 --> 00:05:52,970
We move to the next one.

72
00:05:53,420 --> 00:05:55,270
Here we have negative one.

73
00:05:55,280 --> 00:05:56,750
All this is zero, obviously.

74
00:05:56,750 --> 00:05:58,820
So we just get to this one: one times one,

75
00:05:58,820 --> 00:06:04,190
we have one; zero; negative one; and then zero.

76
00:06:04,190 --> 00:06:05,450
So we have one.

77
00:06:05,450 --> 00:06:14,960
Let's write that: we have one and minus one, because this is negative one times one, and one minus one gives

78
00:06:14,960 --> 00:06:15,830
us zero.

79
00:06:15,830 --> 00:06:17,240
So here we have negative six.

80
00:06:17,240 --> 00:06:23,270
When you add all this up, basically we take this, we multiply and then we add; here, we take this,

81
00:06:23,270 --> 00:06:27,320
multiply, we add; take this, multiply, we add; and we add all these values.

82
00:06:27,320 --> 00:06:31,010
And this gives us here negative six.

83
00:06:31,040 --> 00:06:34,820
Now, because we have this bias of one, we add plus one, it gives us negative five.

84
00:06:34,820 --> 00:06:36,500
That's how we obtain this value here.

85
00:06:36,710 --> 00:06:39,050
So we obtain this one right here.

86
00:06:39,200 --> 00:06:47,060
Now we'll repeat the same process for all these filters and all the different positions.

87
00:06:47,270 --> 00:06:51,590
So basically, let's toggle this so you see what happens.

88
00:06:51,830 --> 00:06:58,280
You see we repeat this process, we move, we move, we go down and that's it.

89
00:06:59,030 --> 00:07:00,140
There we go.

90
00:07:00,140 --> 00:07:04,380
So we repeat this to the end and we have this final value right here.

91
00:07:04,400 --> 00:07:07,970
And you see that we get this.

92
00:07:07,970 --> 00:07:12,460
So what we have is let's take this off, let's take this orange off.

93
00:07:12,470 --> 00:07:15,980
What we have is now we have this output.

94
00:07:15,980 --> 00:07:23,480
So we have the input, which has a number of channels of three here.

95
00:07:23,480 --> 00:07:28,310
The filter also has this number of channels, three, and then we have an output which has a number of channels of

96
00:07:28,310 --> 00:07:28,940
one.

97
00:07:28,940 --> 00:07:35,150
But if we want the number of channels to be equal to, say, two, like in this case where for the output we have

98
00:07:35,150 --> 00:07:39,920
one channel and this other channel, then we need to increase the number of filters we have here.

99
00:07:39,920 --> 00:07:42,140
So basically here we have how many filters.

100
00:07:42,140 --> 00:07:45,170
We could create this again.

101
00:07:45,170 --> 00:07:46,610
So we will have now two.

102
00:07:46,610 --> 00:07:52,340
If we want to have two output channels, then we will need to have two of these three dimensional filters.

103
00:07:52,340 --> 00:08:00,980
So we will have this again, we would have this with its own weights, obviously, and we will have

104
00:08:00,980 --> 00:08:02,720
this and that's it.

105
00:08:02,720 --> 00:08:09,440
So because we have these two now, we will no longer have an output with one channel, but now an output

106
00:08:09,440 --> 00:08:11,310
with two channels, as you could see here.

107
00:08:11,330 --> 00:08:15,920
Now notice how as we click this, we move to this next.

108
00:08:15,920 --> 00:08:18,770
So this here is this one here.

109
00:08:18,770 --> 00:08:19,910
Let's pause this.

110
00:08:20,000 --> 00:08:25,160
So this here is this one right here.

111
00:08:25,670 --> 00:08:30,400
And this one is this one right here.

112
00:08:30,440 --> 00:08:35,180
Now, with this, we are able to get this other channel for the output.

113
00:08:35,180 --> 00:08:40,710
And so that's how the convolution operation works for a normal convolution.
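To make the walkthrough above concrete, here is a minimal plain-Python sketch of a standard convolution. It is only an illustration, not the demo's own code: the function name `conv2d`, the nested-list layout and the all-ones toy values are assumptions chosen so the result is easy to check by hand, and it uses no padding.

```python
# Sketch of a standard (regular) convolution, assuming no padding.
# inp:     nested lists shaped [channels][height][width]
# filters: nested lists shaped [num_filters][channels][k][k]
# biases:  one number per filter

def conv2d(inp, filters, biases, stride=1):
    channels = len(inp)
    height, width = len(inp[0]), len(inp[0][0])
    k = len(filters[0][0])
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for f, filt in enumerate(filters):
        plane = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                # Multiply and add over ALL input channels, then add the bias.
                total = biases[f]
                for c in range(channels):
                    for di in range(k):
                        for dj in range(k):
                            total += inp[c][i * stride + di][j * stride + dj] * filt[c][di][dj]
                row.append(total)
            plane.append(row)
        output.append(plane)  # one output channel per filter
    return output

# Toy example: 3-channel 4x4 input of ones, two 3x3x3 filters of ones.
ones_input = [[[1] * 4 for _ in range(4)] for _ in range(3)]
ones_filter = [[[1] * 3 for _ in range(3)] for _ in range(3)]
result = conv2d(ones_input, [ones_filter, ones_filter], [1, 0])
print(len(result))   # 2 output channels, because we used 2 filters
print(result[0])     # [[28, 28], [28, 28]]  (27 products plus bias 1)
```

Notice that the number of output channels comes only from the number of filters we pass in, exactly as in the demo.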

114
00:08:40,730 --> 00:08:48,470
Now, if we get to the depthwise convolution, we will find that it is different from this method.

115
00:08:48,470 --> 00:08:50,390
For the depthwise convolution,

116
00:08:50,390 --> 00:08:55,910
as the name goes, the computations are done depthwise.

117
00:08:55,910 --> 00:09:09,680
So first of all, here this output will only be obtained from interactions between this channel and this

118
00:09:09,680 --> 00:09:11,480
channel right here.

119
00:09:11,690 --> 00:09:19,400
So what goes on here is we take this and then we pass it around, as we usually do, and then we obtain

120
00:09:19,700 --> 00:09:22,520
this new output right here.

121
00:09:22,520 --> 00:09:31,580
So if this is three by three, then we obtain some values one, two, three, four, five, six, seven,

122
00:09:31,580 --> 00:09:32,180
eight, nine.

123
00:09:32,180 --> 00:09:38,210
So these values will be different from this one, because the way we compute these values is different

124
00:09:38,210 --> 00:09:40,340
from the way these values were computed.

125
00:09:40,340 --> 00:09:47,180
The way we computed this value was we take this here, pass it here, take this, pass this, take this,

126
00:09:47,510 --> 00:09:56,060
pass this here, and then add all our resulting sums to obtain this value of negative five as we saw.

127
00:09:56,060 --> 00:09:59,090
But in this case, what we'll obtain for this

128
00:09:59,480 --> 00:10:01,200
value will not be this negative five.

129
00:10:01,220 --> 00:10:09,950
What we'll obtain here will be one times zero, zero times zero, plus one times zero; then zero, zero, and one times one,

130
00:10:09,950 --> 00:10:13,880
we will have one; and then here negative one times one,

131
00:10:13,880 --> 00:10:18,080
negative one; negative one times two, negative two.

132
00:10:18,110 --> 00:10:19,100
It gives us negative two.

133
00:10:19,100 --> 00:10:25,640
So what we'll obtain here will be negative two, and then we'll move on to the next, we'll move on

134
00:10:25,640 --> 00:10:28,760
to the next, we'll go to this next position.

135
00:10:28,760 --> 00:10:34,510
We'll get some value, as we've seen here and up to this last value right here.

136
00:10:34,520 --> 00:10:41,150
So the way we got this was we took all this for the different channels and then we added them

137
00:10:41,150 --> 00:10:42,770
up to get this value.

138
00:10:42,770 --> 00:10:49,610
Here, we just get this directly by taking each and every channel and just producing the output like this.

139
00:10:49,610 --> 00:10:56,480
So this means that with this, let's take this off, because

140
00:10:56,480 --> 00:11:02,330
we are having each channel of the filter producing its own output,

141
00:11:02,330 --> 00:11:06,680
we're going to have these three producing three different outputs.

142
00:11:06,680 --> 00:11:12,920
So here we already have three outputs, unlike here where we had two outputs, and the two outputs here were

143
00:11:12,920 --> 00:11:19,550
controlled by the number of kernels we used: because here we use two kernels, we have two outputs.

144
00:11:19,550 --> 00:11:22,400
But here this doesn't matter.

145
00:11:22,400 --> 00:11:28,550
Here, the number of input channels we have will dictate the number of

146
00:11:28,550 --> 00:11:29,000
outputs.

147
00:11:29,000 --> 00:11:31,160
So here we have three channels.

148
00:11:31,160 --> 00:11:34,630
So we just obviously have these three different outputs.

149
00:11:34,640 --> 00:11:40,070
Now since we have the three outputs and we want to be able to control the number of output channels,

150
00:11:40,070 --> 00:11:45,500
what we'll do now after the depthwise convolution, which is in fact what we've just explained here,

151
00:11:45,500 --> 00:11:49,490
is we're going to add now the one by one convolution.

152
00:11:49,490 --> 00:11:55,400
That's the pointwise convolution. After adding the one by one convolution, we are going to specify

153
00:11:55,400 --> 00:12:01,880
the number of channels here and this number of channels of this one by one convolution that will permit

154
00:12:01,880 --> 00:12:08,960
us to move from a certain number of channels, like in this case three, to another number of channels, or to

155
00:12:09,350 --> 00:12:12,140
a given number of channels, let's say two.

156
00:12:12,140 --> 00:12:17,450
So after getting an output with three channels, we now pass to this

157
00:12:17,450 --> 00:12:22,490
pointwise convolution and now we're going to get just two channels.
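As a rough sketch of the two steps just described, here is the depthwise step followed by the pointwise step in plain Python. The function names `depthwise_conv` and `pointwise_conv`, the list layout, and the toy weights are assumptions made for illustration; biases and padding are left out.

```python
def depthwise_conv(inp, kernels, stride=1):
    # inp: [C][H][W], kernels: [C][k][k].
    # Channel c is only ever combined with kernel c: no mixing across channels.
    k = len(kernels[0])
    out_h = (len(inp[0]) - k) // stride + 1
    out_w = (len(inp[0][0]) - k) // stride + 1
    return [
        [
            [
                sum(
                    plane[i * stride + di][j * stride + dj] * kern[di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for plane, kern in zip(inp, kernels)
    ]

def pointwise_conv(inp, weights):
    # weights: [C_out][C_in]. A 1x1 convolution mixes the channels
    # at every spatial position and lets us choose C_out freely.
    height, width = len(inp[0]), len(inp[0][0])
    return [
        [
            [sum(w * inp[c][i][j] for c, w in enumerate(row_w)) for j in range(width)]
            for i in range(height)
        ]
        for row_w in weights
    ]

x = [[[1] * 4 for _ in range(4)] for _ in range(3)]             # 3 channels, 4x4
dw = depthwise_conv(x, [[[1] * 3 for _ in range(3)] for _ in range(3)])
pw = pointwise_conv(dw, [[1, 1, 1], [1, 0, -1]])                # 3 channels -> 2 channels
print(len(dw), len(pw))  # 3 2
print(pw)                # [[[27, 27], [27, 27]], [[0, 0], [0, 0]]]
```

The depthwise output always has as many channels as the input; only the pointwise weights decide that we end with two.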

158
00:12:24,300 --> 00:12:25,590
To better understand this.

159
00:12:25,590 --> 00:12:29,250
Let's take this depthwise convolution image from Papers with Code.

160
00:12:29,250 --> 00:12:30,510
So here.

161
00:12:30,510 --> 00:12:32,340
Let's try to reduce this anyway.

162
00:12:32,340 --> 00:12:33,150
Let's have this.

163
00:12:33,150 --> 00:12:36,240
You see, we have one, two, three.

164
00:12:36,270 --> 00:12:37,020
You see this?

165
00:12:37,020 --> 00:12:37,680
One, two, three.

166
00:12:37,680 --> 00:12:39,330
This is a five by five kernel.

167
00:12:39,330 --> 00:12:41,370
And then we pass this.

168
00:12:41,370 --> 00:12:46,980
Notice how each and every one is now responsible for its own output.

169
00:12:46,980 --> 00:12:47,910
See this?

170
00:12:49,140 --> 00:12:53,250
See, we have this orange filter for this particular channel.

171
00:12:53,250 --> 00:12:54,270
It gives us an output.

172
00:12:54,270 --> 00:12:56,010
The red gives us an output.

173
00:12:56,270 --> 00:13:03,420
The yellow gives us an output, unlike previously, where we'd take this, pass this here, and then add

174
00:13:03,420 --> 00:13:04,950
all this together.

175
00:13:06,440 --> 00:13:12,620
But now instead, what we do is we just simply carry out that addition at the level of the channel,

176
00:13:12,620 --> 00:13:16,070
and then we get this output right here.

177
00:13:17,120 --> 00:13:23,350
Now, let's see why the depthwise separable convolution is more efficient.

178
00:13:23,360 --> 00:13:26,290
And we'll do this by calculating the number of filter weights.

179
00:13:26,300 --> 00:13:33,170
So here we find the number of weights we need to get from this input, which is one, two, three,

180
00:13:33,170 --> 00:13:34,520
four, five, six, seven.

181
00:13:34,520 --> 00:13:46,010
So we have this seven by seven by three input, which we want to convert to this three by three by

182
00:13:46,010 --> 00:13:47,420
two output.

183
00:13:47,420 --> 00:13:54,920
And here the number of parameters we use can be calculated as follows: here we have nine

184
00:13:54,920 --> 00:13:55,880
times three.

185
00:13:56,000 --> 00:14:01,100
Obviously, this nine times three comes from the fact that for each of these, we have nine,

186
00:14:01,100 --> 00:14:02,690
and times three.

187
00:14:02,690 --> 00:14:09,080
The three is from here actually, because we have three input channels, then we'll have three filter

188
00:14:09,080 --> 00:14:09,650
channels.

189
00:14:09,650 --> 00:14:15,440
And so we're going to have the filter size, which is three by three.

190
00:14:16,040 --> 00:14:19,100
I think we should change this color so it makes it clearer.

191
00:14:19,190 --> 00:14:20,390
Let's change this.

192
00:14:20,390 --> 00:14:23,300
So here we have seven by seven by three.

193
00:14:23,300 --> 00:14:24,950
The three is for this number of channels.

194
00:14:24,950 --> 00:14:32,420
Here we have three by three, and this is three by three for a single one, then times three gives this

195
00:14:32,420 --> 00:14:33,470
number of weights.

196
00:14:33,470 --> 00:14:38,600
But now, because we want to have let's change this color again here, because we want to have an output

197
00:14:38,600 --> 00:14:46,600
with two channels, then we would multiply this again by the number of output channels two.

198
00:14:46,610 --> 00:14:50,150
So this is like a general formula to get the number of parameters.

199
00:14:50,480 --> 00:14:52,550
We're going to omit the biases.

200
00:14:52,550 --> 00:14:59,750
So here we have three by three by three, that's 27, and 27 times two gives 54 parameters.

201
00:14:59,960 --> 00:15:07,460
Now this means that if we want to have a number of output channels of, say, 16, then this will

202
00:15:07,730 --> 00:15:12,320
give us 432.

203
00:15:12,980 --> 00:15:19,820
Now, let's consider that we are dealing with the depthwise convolution and the pointwise convolution,

204
00:15:19,820 --> 00:15:25,460
which together form the depthwise separable convolution. With the depthwise, we will have the following.

205
00:15:26,480 --> 00:15:29,750
First of all, we have to note that this is no longer needed.

206
00:15:29,750 --> 00:15:30,740
We just need this.

207
00:15:30,740 --> 00:15:36,860
So what we have will be three by three by.

208
00:15:38,660 --> 00:15:39,350
Three.

209
00:15:39,830 --> 00:15:46,850
Now this three comes from the number of input channels. Now, to get the number of outputs,

210
00:15:47,420 --> 00:15:54,650
we wouldn't carry out any multiplication here, because basically the output from the depthwise convolution

211
00:15:54,650 --> 00:16:01,800
is a three by three by three tensor in this case.

212
00:16:01,880 --> 00:16:04,940
Or we could just say it's a three channel output.

213
00:16:05,210 --> 00:16:08,960
Since we have three channel inputs, we'll have three channel output.

214
00:16:08,960 --> 00:16:15,170
So here, to obtain the outputs, we just need this. We saw this already: we just take this, multiply,

215
00:16:15,200 --> 00:16:22,940
get the output, get this first one; we take this, multiply, get this next one; take this, multiply,

216
00:16:22,970 --> 00:16:25,690
get this next one and we're good to go.

217
00:16:25,700 --> 00:16:30,470
So once we have this, we now add the weights for the pointwise

218
00:16:30,470 --> 00:16:38,090
convolution. Now for the pointwise, it's a one by one kernel, so you see it's quite cheap compared to

219
00:16:38,090 --> 00:16:39,650
the three by three kernels.

220
00:16:39,650 --> 00:16:44,060
So one by one here we have one by one.

221
00:16:44,060 --> 00:16:45,830
So let's just put this here.

222
00:16:45,830 --> 00:16:50,450
We have one by one now times.

223
00:16:52,470 --> 00:16:55,260
The inputs, just like with the usual convolution.

224
00:16:55,260 --> 00:16:59,280
It's the same thing actually, because here, for the usual convolution,

225
00:16:59,280 --> 00:16:59,610
Yeah.

226
00:16:59,610 --> 00:17:05,970
to calculate the number of weights, we just take the kernel size

227
00:17:05,970 --> 00:17:10,590
times the kernel size times the number of input channels times the number of output channels.

228
00:17:10,590 --> 00:17:15,570
So here we have the kernel size times kernel size times the number of input channels, which in this

229
00:17:15,570 --> 00:17:19,950
case is three and the number of output channels which is two.

230
00:17:20,490 --> 00:17:22,380
So that's what we have now.

231
00:17:22,380 --> 00:17:26,220
If we multiply this, we have 27 plus six.

232
00:17:26,220 --> 00:17:30,840
That gives us something like 33.

233
00:17:30,870 --> 00:17:32,520
You see that this gives us thirty-three.

234
00:17:32,550 --> 00:17:40,110
Now, one interesting point to note here is that if we modify this and say we want

235
00:17:40,110 --> 00:17:45,240
to have 16 output channels, then we'll change this two and put 16.

236
00:17:45,240 --> 00:17:50,180
In that case, our answer will be 27.

237
00:17:50,190 --> 00:18:00,780
Here we have 27 plus 48. Now 27 plus 48, let's get that quickly: that gives us 75.

238
00:18:00,780 --> 00:18:08,400
So you see that here we have increased the number of output channels from 2 to 16, and the number of weights is 75.

239
00:18:08,400 --> 00:18:11,700
But when we did that here, the number of weights went to 432.

240
00:18:11,700 --> 00:18:20,430
So clearly the depthwise separable convolution is one that is way cheaper than the

241
00:18:20,430 --> 00:18:22,020
normal convolution.
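The weight counts we just worked out by hand can be checked with a small sketch. The helper names `standard_conv_params` and `separable_conv_params` are made up for this illustration, and biases are omitted as in the calculation above.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    # kernel size x kernel size x input channels x output channels
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv_params(kernel_size, in_channels, out_channels):
    # Depthwise part: one k x k kernel per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise part: a regular 1 x 1 convolution.
    pointwise = 1 * 1 * in_channels * out_channels
    return depthwise + pointwise

for out_channels in (2, 16):
    print(
        out_channels,
        standard_conv_params(3, 3, out_channels),
        separable_conv_params(3, 3, out_channels),
    )
# 2 54 33
# 16 432 75
```

Going from 2 to 16 output channels multiplies the standard count by eight, while the separable count only grows through the cheap one by one part.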

242
00:18:22,800 --> 00:18:30,600
And in the paper the authors argue that this kind of convolution permits us to reduce the computational

243
00:18:30,600 --> 00:18:36,030
cost by 8 to 9 times compared to that of a standard convolution,

244
00:18:36,030 --> 00:18:41,400
while we have only a small reduction in accuracy.
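To see where a figure like 8 to 9 times can come from, here is a small sketch that counts multiply-add operations for a three by three kernel. The cost formulas are the usual ones (every output position pays the per-position weight cost); the layer size used below, 56 by 56 with 64 input and 128 output channels, is just an assumed example and not taken from the lecture.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    # Every output position needs k*k*c_in multiply-adds per output channel.
    return h * w * c_in * c_out * k * k

def separable_conv_cost(h, w, c_in, c_out, k):
    # Depthwise: k*k per channel per position; pointwise: c_in*c_out per position.
    return h * w * c_in * (k * k + c_out)

# Assumed example layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel.
h, w, c_in, c_out, k = 56, 56, 64, 128, 3
ratio = standard_conv_cost(h, w, c_in, c_out, k) / separable_conv_cost(h, w, c_in, c_out, k)
print(round(ratio, 2))  # a value between 8 and 9 for this layer
```

The ratio simplifies to k*k*c_out / (k*k + c_out), so with a three by three kernel it gets closer to 9 as the number of output channels grows.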

245
00:18:42,060 --> 00:18:48,660
That said, from this diagram here, you should now understand why, when presenting a regular convolution,

246
00:18:48,660 --> 00:18:52,260
the authors have this filter.

247
00:18:52,260 --> 00:18:59,460
You see the filter here which has this depth into the input.

248
00:18:59,460 --> 00:19:02,280
You see this depth right here, See this?

249
00:19:02,280 --> 00:19:09,180
Whereas when we present the separable convolution, we have this filter which doesn't have any depth.

250
00:19:09,660 --> 00:19:16,500
And that's because here we have no inter-channel computation, as compared to this one.

251
00:19:16,710 --> 00:19:22,440
And then after, we have this pointwise. Now the pointwise is a regular convolution.

252
00:19:22,440 --> 00:19:28,590
So we see we have this depth again, but now it's smaller in size because it's just a one by one filter.

253
00:19:28,590 --> 00:19:34,170
The next improvement we should look at is this inverted residual block right here.

254
00:19:34,170 --> 00:19:40,190
So it's first of all called inverted in comparison to the residual block.

255
00:19:40,200 --> 00:19:47,850
Now, in the residual block, as you may notice, you have this relatively large number of channels, and

256
00:19:47,850 --> 00:19:52,560
then it becomes small in the middle and then it becomes large in the outputs.

257
00:19:52,560 --> 00:19:59,010
So this means that we have some inputs, we have some inputs, and then we have our residual block and

258
00:19:59,010 --> 00:20:00,600
then we have some outputs.

259
00:20:00,600 --> 00:20:06,210
Then we have obviously this link right here from the input to the output.

260
00:20:06,240 --> 00:20:12,270
Now with this, you see the data or the input is big,

261
00:20:12,270 --> 00:20:15,330
it goes to small and then big.

262
00:20:15,330 --> 00:20:22,340
But here, what we have is we pass in a relatively small number of channels, or an input

263
00:20:22,340 --> 00:20:24,300
with a relatively smaller number of channels.

264
00:20:24,300 --> 00:20:32,580
So we start small, then in our block the number of channels is increased, and then at the output

265
00:20:32,630 --> 00:20:37,750
the number of channels is reduced, hence the term inverted residual block.

266
00:20:37,770 --> 00:20:46,200
Now, in addition to the fact that we're using depthwise convolutions instead of the normal convolutions,

267
00:20:46,500 --> 00:20:55,380
the fact that we have relatively lower dimensional data getting into this block and lower dimensional

268
00:20:55,380 --> 00:21:04,890
data getting out means that we could transport very low dimensional data throughout our MobileNet.

269
00:21:04,890 --> 00:21:12,270
So here we have low dimensional data getting in, low dimensional data getting out, and then inside

270
00:21:12,270 --> 00:21:21,810
we have this expansion layer right here, and this expansion layer permits us to capture as much information

271
00:21:21,810 --> 00:21:24,900
as possible from our input features.

272
00:21:25,290 --> 00:21:28,440
One thing to also note is the fact that we're using ReLU6.

273
00:21:28,440 --> 00:21:35,610
ReLU6 is different from the usual ReLU. With the usual ReLU, for all x less

274
00:21:35,610 --> 00:21:36,060
than zero,

275
00:21:36,060 --> 00:21:39,240
the value is zero, and for all x greater than zero, the value is x.

276
00:21:39,240 --> 00:21:42,030
So we have this y equals x line right here.

277
00:21:42,030 --> 00:21:47,940
But with ReLU6, as from the value six, we actually clip this output.

278
00:21:47,940 --> 00:21:51,900
So what we have here is, well, let's get

279
00:21:52,030 --> 00:21:52,620
back.

280
00:21:52,630 --> 00:21:56,080
What we have here is we have the values.

281
00:21:56,230 --> 00:21:59,580
It remains x, but once we get to six, it gets clipped.

282
00:21:59,590 --> 00:22:04,450
So for all values of x greater than six, the value remains at six.

283
00:22:04,450 --> 00:22:05,950
So that's ReLU6.
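The difference between the two activations fits in a couple of lines. This is just a sketch on plain Python floats; the function names and the sample values are chosen for illustration.

```python
def relu(x):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, x)

def relu6(x):
    # Same as ReLU, but the output is clipped at six.
    return min(max(0.0, x), 6.0)

samples = [-2.0, 0.0, 3.0, 6.0, 10.0]
print([relu(v) for v in samples])   # [0.0, 0.0, 3.0, 6.0, 10.0]
print([relu6(v) for v in samples])  # [0.0, 0.0, 3.0, 6.0, 6.0]
```

Only the last sample differs: ten passes through the usual ReLU unchanged, while ReLU6 clips it to six.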

284
00:22:05,950 --> 00:22:10,870
And then one other important point is the fact that because we are doing or we are carrying out this

285
00:22:10,870 --> 00:22:18,760
projection from high dimensional data to low dimensional data, the ReLU nonlinearity generally

286
00:22:19,270 --> 00:22:22,480
will cause us to lose too much information.

287
00:22:22,480 --> 00:22:28,990
And because of that, there is no ReLU activation in the final layer here.

288
00:22:29,800 --> 00:22:32,050
You can see this in the summary right here.

289
00:22:32,050 --> 00:22:33,880
You see we have the input.

290
00:22:34,600 --> 00:22:39,130
Then we have the expansion factor, which is t.

291
00:22:39,160 --> 00:22:42,760
This means that now we have this hyperparameter which we can tune.

292
00:22:42,760 --> 00:22:50,470
So if we want better results, we can tune this expansion factor so that it permits us to get these better

293
00:22:50,470 --> 00:22:51,130
results.

294
00:22:51,130 --> 00:22:56,500
So here we have this expansion factor t, which expands the number of channels.

295
00:22:56,500 --> 00:23:06,110
So we get in with k channels, and then we now move to tk, and then we have this tk which now takes us to k prime.

296
00:23:06,190 --> 00:23:12,670
Now also note that here we have ReLU6, and here ReLU6, but here we have no activation.
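To summarize how the channels move through one bottleneck block, here is a small sketch that only tracks shapes and activations, not the actual weights. The function name `bottleneck_layers` and the example sizes (a 56 by 56 map with k = 24, t = 6, k prime = 24) are assumptions for illustration, and the stride handling assumes the height and width divide evenly.

```python
def bottleneck_layers(h, w, k, t, k_out, stride=1):
    expanded = t * k  # the expansion factor t widens the channels
    out_h, out_w = h // stride, w // stride  # assumes h and w divide evenly by stride
    return [
        # (layer, input shape, output shape, activation)
        ("1x1 conv (expand)", (h, w, k), (h, w, expanded), "ReLU6"),
        ("3x3 depthwise", (h, w, expanded), (out_h, out_w, expanded), "ReLU6"),
        ("1x1 conv (project)", (out_h, out_w, expanded), (out_h, out_w, k_out), "linear"),
    ]

layers = bottleneck_layers(56, 56, k=24, t=6, k_out=24)
for name, in_shape, out_shape, activation in layers:
    print(name, in_shape, "->", out_shape, activation)
# 1x1 conv (expand) (56, 56, 24) -> (56, 56, 144) ReLU6
# 3x3 depthwise (56, 56, 144) -> (56, 56, 144) ReLU6
# 1x1 conv (project) (56, 56, 144) -> (56, 56, 24) linear
```

So the data is narrow going in, wide in the middle, and narrow again going out, with no nonlinearity on the last projection.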

297
00:23:13,240 --> 00:23:17,230
Now that said, this is the summary of our MobileNet version two.

298
00:23:17,230 --> 00:23:20,920
So here you have MobileNet version two, and you have the different bottlenecks.

299
00:23:20,920 --> 00:23:21,880
There we go.

300
00:23:21,880 --> 00:23:31,240
And then we have this conv2d, the average pool, and then a conv2d. This figure right here also shows

301
00:23:31,240 --> 00:23:38,740
us how MobileNet version two outperforms MobileNetV1, ShuffleNet and NASNet.

302
00:23:38,740 --> 00:23:43,420
You see that with MobileNet version two right here,

303
00:23:43,990 --> 00:23:52,270
if you pick, for example, these two, you see the number of operations or the computation

304
00:23:52,270 --> 00:24:00,550
cost here is almost the same, but we see this great difference in accuracy, where MobileNetV2

305
00:24:00,580 --> 00:24:02,980
outperforms MobileNetV1.

306
00:24:04,150 --> 00:24:11,020
Then apart from classification, MobileNetV2 has been used in other tasks like object detection,

307
00:24:11,740 --> 00:24:19,270
semantic segmentation and other computer vision tasks where we have low compute resources.