1
00:00:00,080 --> 00:00:04,970
After preparing the data, the next step will be to dive into modeling.

2
00:00:05,000 --> 00:00:14,120
First things first, we shall start by explaining the SegFormer model which we are going to use, and then

3
00:00:14,120 --> 00:00:18,380
we will dive into model creation with TensorFlow.

4
00:00:18,410 --> 00:00:25,460
The SegFormer model was first introduced in this paper entitled SegFormer: Simple and Efficient Design for Semantic

5
00:00:25,460 --> 00:00:27,230
Segmentation with Transformers.

6
00:00:27,230 --> 00:00:28,940
That was in 2021.

7
00:00:28,940 --> 00:00:33,080
And this model, which you can see right here,

8
00:00:33,140 --> 00:00:34,190
essentially,

9
00:00:34,800 --> 00:00:40,050
is made of two parts, that is, the encoder section and the decoder section.

10
00:00:40,050 --> 00:00:46,230
Now, talking about the encoder and decoder sections, we could go back to 2015.

11
00:00:46,230 --> 00:00:53,460
That's six years before the SegFormer paper, when the U-Net paper was published.

12
00:00:53,580 --> 00:00:58,950
The U-Net architecture itself is made of an encoder and a decoder block.

13
00:00:58,950 --> 00:01:04,350
So here on the left of the U, you have this encoder block.

14
00:01:04,350 --> 00:01:05,520
Let's make that transparent.

15
00:01:05,520 --> 00:01:08,100
So you can see that we have this encoder block.

16
00:01:08,490 --> 00:01:12,900
And then to the right we have the decoder block.

17
00:01:13,440 --> 00:01:14,730
Let's change the color.

18
00:01:15,550 --> 00:01:17,290
There we go. Okay.

19
00:01:17,290 --> 00:01:25,270
So we have this encoder and then the decoder blocks which make up our U-Net architecture.

20
00:01:25,540 --> 00:01:33,880
Now, unlike with image classification, where our output will be some vector in which each position

21
00:01:33,880 --> 00:01:42,760
in that vector corresponds to a specific class, which we'll expect our model to pick from among all the other classes,

22
00:01:42,760 --> 00:01:47,950
here we have an output which is similar in shape to the input.

23
00:01:48,220 --> 00:01:53,680
That is, we're going from an input image of shape height by width by three, where three

24
00:01:53,680 --> 00:01:59,080
here represents the number of channels, to an output of shape height by width by one.

25
00:01:59,080 --> 00:02:03,010
Or sometimes we could just consider this to be height by width.

26
00:02:03,040 --> 00:02:11,590
So the important point to note here is the fact that the kind of output in the case of semantic

27
00:02:11,590 --> 00:02:16,240
segmentation is different from that of image classification.

28
00:02:16,240 --> 00:02:21,430
And hence a U-Net architecture makes a lot more sense here.

29
00:02:21,430 --> 00:02:27,880
It should also be noted that in practice this is generally height divided by four and width divided by four,

30
00:02:27,880 --> 00:02:37,030
because it's much more computationally expensive to have to generate or reproduce this output with the

31
00:02:37,030 --> 00:02:39,070
exact same dimensions as the input.

32
00:02:39,070 --> 00:02:43,300
So here we have height divided by four and width divided by four.

33
00:02:43,300 --> 00:02:47,440
So this figure will look somewhat like this instead.

34
00:02:47,440 --> 00:02:50,800
So we have this input and then this output.

35
00:02:50,800 --> 00:02:58,660
Getting back to our U-Net architecture, in these blocks we have right here, we have the usual conv

36
00:02:58,660 --> 00:02:59,260
layers.

37
00:02:59,260 --> 00:03:04,510
And then we have the downsampling by the max-pool layers. We have here max pool.

38
00:03:04,510 --> 00:03:13,300
So essentially we are making use of conv layers and max pooling to downsample the inputs.

39
00:03:13,300 --> 00:03:23,830
And then in the other turn, after this downward turn, this upward turn, we have again the conv layers

40
00:03:23,830 --> 00:03:27,700
and the upsampling layers.

41
00:03:28,150 --> 00:03:37,300
Then in between these downward and upward turns, we crop out parts of our features in these blocks.

42
00:03:37,300 --> 00:03:38,260
You could notice that.

43
00:03:38,260 --> 00:03:41,140
Or you could take note of that with the dotted lines.

44
00:03:41,140 --> 00:03:47,260
And then we concatenate this to the blocks in the upward turn.

45
00:03:47,260 --> 00:03:54,700
We suppose that you already have an idea of how the conv layers work, but let's take this simple example

46
00:03:54,700 --> 00:04:02,740
which will show how a simple conv layer works, and also how the upsample layer, which is the

47
00:04:02,740 --> 00:04:07,810
Conv2DTranspose layer, works, which can be used for the upsampling.

48
00:04:07,810 --> 00:04:10,540
So let's suppose we have this example.

49
00:04:10,870 --> 00:04:13,870
Then in this case, uh, if this is a kernel.

50
00:04:13,870 --> 00:04:15,580
So let's say this is a kernel.

51
00:04:15,580 --> 00:04:18,040
And then this is an input feature.

52
00:04:18,040 --> 00:04:20,260
In this case too, we have an input feature.

53
00:04:20,260 --> 00:04:22,420
And then here we have the kernel.

54
00:04:22,420 --> 00:04:30,370
So we will take this kernel and multiply it by this part of our input.

55
00:04:30,370 --> 00:04:31,960
So this is two by two.

56
00:04:31,990 --> 00:04:33,010
Here's two by two.

57
00:04:33,040 --> 00:04:36,160
We take two times zero, four times zero, all of that is zero.

58
00:04:36,190 --> 00:04:40,120
Then one times one, that's one, plus six times six.

59
00:04:40,120 --> 00:04:42,010
That's 36 plus one.

60
00:04:42,010 --> 00:04:43,930
So we have 37.

61
00:04:43,930 --> 00:04:46,630
So at this point we have 37.

62
00:04:46,780 --> 00:04:49,780
And then we are going to take this off.

63
00:04:49,780 --> 00:04:52,660
We're going to take this off or we're going to just shift this.

64
00:04:52,900 --> 00:04:54,190
We shift this.

65
00:04:55,440 --> 00:04:59,070
You see, we now multiply four times zero plus six times zero.

66
00:04:59,070 --> 00:04:59,790
That's zero.

67
00:04:59,790 --> 00:05:03,660
Six times one, that's six plus four times six.

68
00:05:03,660 --> 00:05:04,650
That's 24.

69
00:05:04,650 --> 00:05:07,110
So six plus 24 is 30.

70
00:05:07,440 --> 00:05:08,700
We have 30.

71
00:05:08,700 --> 00:05:11,340
We just have here 30.

72
00:05:11,340 --> 00:05:16,560
And then for the next we'll just move this again.

73
00:05:16,560 --> 00:05:22,470
So let's move that, move this and then we multiply again.

74
00:05:22,470 --> 00:05:25,980
We have zero times one plus zero times six.

75
00:05:25,980 --> 00:05:27,090
That's still zero.

76
00:05:27,090 --> 00:05:31,890
And then one times one, that's one, plus six times six, that's 37.

77
00:05:31,890 --> 00:05:34,200
So again here we're going to have 37.

78
00:05:34,410 --> 00:05:36,930
So we see the output we are obtaining.

79
00:05:37,050 --> 00:05:40,380
And then we will shift this again.

80
00:05:42,010 --> 00:05:44,170
And multiply six.

81
00:05:44,350 --> 00:05:51,280
This is six times zero plus four times zero plus six times one plus four times six.

82
00:05:51,280 --> 00:05:52,480
So that's 30.

83
00:05:52,510 --> 00:05:54,970
So again here we have 30.

84
00:05:55,580 --> 00:05:56,690
So that's fine.

85
00:05:56,690 --> 00:05:59,060
So now we have this output.

86
00:05:59,880 --> 00:06:01,050
And there we go.
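To check the arithmetic of this walkthrough, here is a minimal pure-Python sketch of the sliding-window multiply-and-sum. The 3 by 3 input layout and the 2 by 2 kernel are read off the numbers spoken above, so treat them as an assumption about what is on screen, and `conv2d_valid` is just an illustrative helper name, not a TensorFlow function. Note that, as in deep learning frameworks, the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Values reconstructed from the spoken example (assumed layout).
image = [[2, 4, 6],
         [1, 6, 4],
         [1, 6, 4]]
kernel = [[0, 0],
          [1, 6]]

result = conv2d_valid(image, kernel)
print(result)  # [[37, 30], [37, 30]]
```

Each output value is one position of the kernel, so a 3 by 3 input and a 2 by 2 kernel give a 2 by 2 output.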

87
00:06:01,050 --> 00:06:08,730
So that's how the conv layer works. As for the conv transpose layer, which could be used for upsampling,

88
00:06:08,730 --> 00:06:10,980
it works in a slightly different way.

89
00:06:10,980 --> 00:06:13,080
So again we have this kernel.

90
00:06:13,080 --> 00:06:22,110
Then we have this patch, or this part. We take five times zero, five times zero.

91
00:06:22,140 --> 00:06:28,350
We select this patch and multiply by all these different values here.

92
00:06:28,350 --> 00:06:30,660
That's for all the different positions.

93
00:06:30,660 --> 00:06:34,140
And now we have this zero.

94
00:06:34,880 --> 00:06:36,980
Here we have zero.

95
00:06:37,710 --> 00:06:39,540
Five times one is five.

96
00:06:41,030 --> 00:06:42,290
There we go.

97
00:06:42,320 --> 00:06:44,450
Um, zero times six.

98
00:06:44,480 --> 00:06:47,480
Or rather, five times six is 30.

99
00:06:47,840 --> 00:06:49,550
So that's what we obtain.

100
00:06:50,090 --> 00:06:52,460
And then we'll move this patch again.

101
00:06:52,460 --> 00:06:53,840
Or we'll move this position again.

102
00:06:53,840 --> 00:06:55,190
So we go to this.

103
00:06:55,220 --> 00:06:56,630
Here we have two.

104
00:06:57,500 --> 00:06:58,340
There we go.

105
00:06:58,340 --> 00:07:01,010
We have now two times zero.

106
00:07:01,490 --> 00:07:06,380
But note that we are not going to do two times zero.

107
00:07:06,530 --> 00:07:07,610
Um two times zero.

108
00:07:07,610 --> 00:07:08,840
Then put it out this way.

109
00:07:08,840 --> 00:07:10,820
We're not going to have two times zero like this.

110
00:07:10,820 --> 00:07:13,610
Two times one and then two times six.

111
00:07:13,610 --> 00:07:20,600
This way. What we're going to have instead is these frontier values, which are going

112
00:07:20,600 --> 00:07:23,210
to be added to these other frontier values.

113
00:07:23,210 --> 00:07:25,940
So it's going to be a combination of those values.

114
00:07:25,940 --> 00:07:27,710
So let's have this this way.

115
00:07:27,710 --> 00:07:29,810
You see it's going to be something like this.

116
00:07:29,810 --> 00:07:31,400
So this is superposed.

117
00:07:31,400 --> 00:07:32,420
Let's just put it out.

118
00:07:32,450 --> 00:07:33,890
This way. Okay.

119
00:07:33,890 --> 00:07:36,260
So that's how it's going to look.

120
00:07:36,260 --> 00:07:44,720
And then for the next, let's shift this position too. For this next one we have one times zero.

121
00:07:44,900 --> 00:07:46,220
Let's write that out.

122
00:07:46,220 --> 00:07:47,870
Let's change the color.

123
00:07:47,870 --> 00:07:51,080
So you can see that clearly. The green.

124
00:07:51,080 --> 00:07:52,010
And there we go.

125
00:07:52,010 --> 00:07:54,110
So we have one times zero, that's zero.

126
00:07:54,110 --> 00:07:55,520
Again we have zero.

127
00:07:55,520 --> 00:08:00,200
So we'll have zero, zero.

128
00:08:00,350 --> 00:08:01,820
See we added that zero again.

129
00:08:01,820 --> 00:08:07,490
And then one times one is one and then one times six is or rather.

130
00:08:07,490 --> 00:08:09,290
Yeah one times six is six.

131
00:08:09,290 --> 00:08:10,790
So we have six.

132
00:08:11,180 --> 00:08:16,340
And then for this last position we will take this and move to zero.

133
00:08:16,340 --> 00:08:18,740
Well for zero obviously all values will be zero.

134
00:08:18,740 --> 00:08:21,320
So we'll just pick another color.

135
00:08:21,320 --> 00:08:24,290
Let's pick um say black.

136
00:08:24,290 --> 00:08:25,310
Let's pick black.

137
00:08:25,310 --> 00:08:26,600
And then there we go.

138
00:08:26,600 --> 00:08:30,200
We would take that black.

139
00:08:30,950 --> 00:08:31,820
There we go.

140
00:08:31,820 --> 00:08:38,270
We have zero, here we have zero, here we have zero.

141
00:08:38,270 --> 00:08:40,640
And then here we have zero.

142
00:08:40,640 --> 00:08:41,300
Okay.

143
00:08:41,300 --> 00:08:44,840
So now this is this final result.

144
00:08:44,840 --> 00:08:48,890
Then from here, let's drag this up so you can see it more clearly.

145
00:08:49,220 --> 00:08:56,660
From here we just have this output matrix, which is going to be, let's take this pen.

146
00:08:57,170 --> 00:08:59,840
Our output matrix now is going to be zero.

147
00:08:59,840 --> 00:09:01,670
So here we have zero.

148
00:09:01,670 --> 00:09:05,240
We have zero, we have zero.

149
00:09:05,240 --> 00:09:06,770
We have five plus zero.

150
00:09:06,770 --> 00:09:08,300
That's five.

151
00:09:09,090 --> 00:09:13,470
And then we have 30 plus two plus zero plus zero.

152
00:09:13,470 --> 00:09:14,820
That's 32.

153
00:09:15,860 --> 00:09:17,510
Then we have 12 plus zero.

154
00:09:17,510 --> 00:09:18,590
That is 12.

155
00:09:18,800 --> 00:09:24,050
Then we have one, then we have six, then we have zero.

156
00:09:24,050 --> 00:09:24,980
So that's it.

157
00:09:24,980 --> 00:09:32,810
And so now we see clearly how we are able to go from an input, which is two by two, to an output,

158
00:09:32,810 --> 00:09:34,190
which is three by three.
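The scatter-and-add idea behind this upsampling can be sketched in a few lines of pure Python. This is only an illustration under assumptions: the 2 by 2 input (5, 2, 1, 0) and the kernel (0, 0, 1, 6) are reconstructed from the values spoken above, stride 1 is assumed because a 2 by 2 input gave a 3 by 3 output, and `conv2d_transpose` here is a hypothetical helper, not the Keras layer itself.

```python
def conv2d_transpose(image, kernel, stride=1):
    """Each input value scales the whole kernel; the scaled copies are
    placed on the output grid and overlapping cells are added up."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - 1) * stride + kh
    out_w = (len(image[0]) - 1) * stride + kw
    out = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += value * kernel[a][b]
    return out


# Values reconstructed from the spoken example (assumed layout).
image = [[5, 2],
         [1, 0]]
kernel = [[0, 0],
          [1, 6]]

upsampled = conv2d_transpose(image, kernel)
for row in upsampled:
    print(row)
# [0, 0, 0]
# [5, 32, 12]
# [1, 6, 0]
```

The centre cell is where all four scaled kernels overlap: 30 from the five, 2 from the two, and 0 from each of the other two.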

159
00:09:34,190 --> 00:09:37,940
So that's a simple upsampler. At this point,

160
00:09:37,940 --> 00:09:40,400
we're still in 2015 with the U-Net model.

161
00:09:40,400 --> 00:09:48,650
Now we move to 2021 with our SegFormer model, and with the SegFormer model, we'll still use the encoder

162
00:09:48,650 --> 00:09:50,540
and decoder as we've seen already.

163
00:09:50,540 --> 00:09:56,120
But now instead of our blocks made of conv layers, we have the transformers.

164
00:09:56,120 --> 00:10:03,020
The SegFormer architecture came with five major methods, which made it the state of the art in semantic

165
00:10:03,020 --> 00:10:05,750
segmentation back in 2021.

166
00:10:05,750 --> 00:10:10,100
The very first of these is the hierarchical feature representation.

167
00:10:10,520 --> 00:10:17,600
Now, as we had seen with the U-Net, we already had this hierarchical representation.

168
00:10:17,600 --> 00:10:22,220
But then with the ViT, that's the Vision Transformer,

169
00:10:22,220 --> 00:10:26,630
the transformer which precedes other transformers like the SegFormer,

170
00:10:26,630 --> 00:10:29,810
we didn't have this hierarchical structure.

171
00:10:29,810 --> 00:10:38,570
And so what the SegFormer did was borrow this hierarchical structure from other models like the U-Net.

172
00:10:38,780 --> 00:10:45,860
The reason why the authors decided to work with this hierarchical structure is because when we represent

173
00:10:45,860 --> 00:10:54,380
features hierarchically, at higher resolutions we are able to extract more coarse features,

174
00:10:54,380 --> 00:11:00,560
as compared to the lower resolutions, where we tend to extract the finer features.

175
00:11:00,560 --> 00:11:09,230
And so with this, we mimic the kind of feature extraction mechanism we had already with the convolutional

176
00:11:09,230 --> 00:11:10,340
neural networks.

177
00:11:10,340 --> 00:11:18,350
Now that said, let's take this off and see how, in practice, the patch merging is done.

178
00:11:18,350 --> 00:11:22,550
Now, it should be noted that with the ViT, let's shift this, with the ViT,

179
00:11:22,940 --> 00:11:29,750
if you have an input like this one, where you have eight pixels by eight, you could count them.

180
00:11:29,750 --> 00:11:35,210
We're actually having one, two, three, four, five, six, seven, eight pixels by eight pixels.

181
00:11:35,210 --> 00:11:39,560
Now, for the input, let's get to the paper.

182
00:11:39,710 --> 00:11:40,700
Here's the ViT paper.

183
00:11:40,700 --> 00:11:45,470
As you can see, in this case it was actually 16 by 16.

184
00:11:45,470 --> 00:11:53,360
So what was done here was to flatten these patches, and here, if we have ViT, then we would

185
00:11:53,360 --> 00:12:00,350
simply take this one and then this next one. Let's instead do it this way.

186
00:12:00,350 --> 00:12:01,310
So it's simpler.

187
00:12:01,310 --> 00:12:05,870
So we take like this and we flatten it out this way.

188
00:12:06,440 --> 00:12:07,970
Copy this other one.

189
00:12:08,510 --> 00:12:10,610
Let's cut that and paste.

190
00:12:11,240 --> 00:12:17,600
There we go, right up to where we have all this flattened out.

191
00:12:17,600 --> 00:12:25,940
And so at the end we are obtaining this output feature which is one by one, because obviously here

192
00:12:25,940 --> 00:12:29,960
you have one by one, but it has a certain number of channels.

193
00:12:29,960 --> 00:12:34,220
In this case, since we have 64 pixels then we will have 64 channels.

194
00:12:34,220 --> 00:12:39,110
But if this input was eight by eight by three, that is if we had three channels, then the output will

195
00:12:39,110 --> 00:12:42,500
be one by one by 64 times three.

196
00:12:43,450 --> 00:12:49,510
In a way, you understand that this is how we were able to flatten out the patches.

197
00:12:49,510 --> 00:12:57,190
Now we're going to do something similar with the SegFormer, where we are going to regroup these different

198
00:12:57,190 --> 00:12:58,210
pixels together.

199
00:12:58,210 --> 00:13:01,300
So let's take this off.

200
00:13:01,300 --> 00:13:02,410
Let's just put it up here.

201
00:13:02,410 --> 00:13:03,820
So you see that clearly.

202
00:13:03,970 --> 00:13:08,530
Eight by eight by three to one by one by 64 times three.

203
00:13:08,560 --> 00:13:15,640
Now, as we were saying, we want to go from, as you see in the paper, the original height

204
00:13:15,640 --> 00:13:17,710
by width by, let's say three.

205
00:13:17,710 --> 00:13:25,060
And now we want to obtain height divided by four by width divided by four by a channel, let's say C1.

206
00:13:25,240 --> 00:13:31,000
So getting back here, given that we have eight by eight, it means that we are going to obtain two

207
00:13:31,000 --> 00:13:33,340
by two because eight divided by four is two.

208
00:13:33,340 --> 00:13:38,890
Now obviously in practice it's not actually eight by eight; the inputs will generally be, let's

209
00:13:38,890 --> 00:13:40,690
say, 512 by 512.

210
00:13:40,690 --> 00:13:48,280
And so it makes sense that you could go up to, let's say, height divided by 32 by width divided by 32.

211
00:13:48,310 --> 00:13:51,760
So getting back here our output is going to be two by two.

212
00:13:51,790 --> 00:13:57,010
So we'll make sure that this part, let's take off this eight from here,

213
00:13:57,130 --> 00:14:03,010
we'll make sure that this part, this quarter you see, let's copy that quarter.

214
00:14:03,820 --> 00:14:05,950
Um we put it out here.

215
00:14:06,670 --> 00:14:08,650
Then we flatten this out.

216
00:14:08,650 --> 00:14:12,430
So again, just like with this we just simply flatten this out.

217
00:14:12,430 --> 00:14:15,280
So this first quarter we flatten that out.

218
00:14:15,280 --> 00:14:17,410
Let's break this up with a pen.

219
00:14:17,410 --> 00:14:18,820
So you see that clearly.

220
00:14:18,820 --> 00:14:21,340
So we have this like that.

221
00:14:21,610 --> 00:14:22,270
There we go.

222
00:14:22,270 --> 00:14:25,930
So this first part we take it here and then we flatten it out.

223
00:14:26,530 --> 00:14:31,000
Then once we flatten out like this we move to this next quarter.

224
00:14:31,150 --> 00:14:33,580
We take all this and then flatten out again.

225
00:14:33,580 --> 00:14:38,620
So what we do is simply copy this and paste. Okay.

226
00:14:38,620 --> 00:14:40,360
So let's copy that and paste.

227
00:14:40,360 --> 00:14:44,110
Maybe we should change the color so it looks different.

228
00:14:44,110 --> 00:14:45,160
There we go.

229
00:14:45,490 --> 00:14:47,320
Um let's copy this color code.

230
00:14:47,320 --> 00:14:51,460
So we just um recolor this part here okay.

231
00:14:51,460 --> 00:14:54,460
So paste it out and there we go okay.

232
00:14:54,460 --> 00:14:55,270
So that's fine.

233
00:14:55,270 --> 00:15:04,720
And then this other zone. For this other zone, let's pick this red, or let's pick yellow.

234
00:15:04,840 --> 00:15:05,440
Okay.

235
00:15:05,440 --> 00:15:08,470
So we pick that copy the color code.

236
00:15:08,470 --> 00:15:10,960
And then we're going to do the same with this.

237
00:15:10,960 --> 00:15:12,220
So paste out.

238
00:15:12,220 --> 00:15:14,770
You see this top left quarter here

239
00:15:14,770 --> 00:15:15,550
is in blue.

240
00:15:15,550 --> 00:15:18,670
Then this top right quarter here

241
00:15:18,670 --> 00:15:19,600
is in orange.

242
00:15:19,600 --> 00:15:22,990
We have this other quarter yellow and this other quarter green.

243
00:15:23,230 --> 00:15:25,990
Um, if we join this together, this should be below.

244
00:15:25,990 --> 00:15:30,400
So let's put this around about this, um, make it this way.

245
00:15:30,400 --> 00:15:30,910
Okay.

246
00:15:30,910 --> 00:15:31,420
That's fine.

247
00:15:31,420 --> 00:15:37,930
So, you see, if we bring all this together, now you have this output, which is two by two, you see,

248
00:15:37,930 --> 00:15:40,810
because here you have two pixels by two.

249
00:15:40,900 --> 00:15:42,130
So this is it.

250
00:15:42,130 --> 00:15:46,570
We have two by two, and then we have a certain number of channels.
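The regrouping just described can be written down as a small pure-Python sketch. It assumes a single-channel 8 by 8 grid like the one on screen, and `merge_patches` is an illustrative helper name only. A patch size of eight gives the ViT-style one by one by 64 result, and a patch size of four gives the two by two grid where each quarter is flattened into 16 channels.

```python
def merge_patches(image, patch):
    """Split an H x W grid into non-overlapping patch x patch blocks and
    flatten every block into one channel vector."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "grid must divide evenly"
    out = []
    for top in range(0, h, patch):
        out_row = []
        for left in range(0, w, patch):
            block = [image[top + a][left + b]
                     for a in range(patch) for b in range(patch)]
            out_row.append(block)
        out.append(out_row)
    return out


def shape(t):
    """(rows, cols, channels) of the nested-list result."""
    return (len(t), len(t[0]), len(t[0][0]))


# An 8 x 8 grid whose values are just the pixel indices 0..63.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

whole = merge_patches(image, 8)     # flatten everything at once
quarters = merge_patches(image, 4)  # one vector per quarter

print(shape(whole))     # (1, 1, 64)
print(shape(quarters))  # (2, 2, 16)
```

With three input channels, each vector would simply be three times as long, which is the 64 times three mentioned above.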

251
00:15:46,570 --> 00:15:54,940
And that's exactly what the authors did to downsample the inputs, or the input features.

252
00:15:54,940 --> 00:16:01,330
The next method which was implemented in the SegFormer paper was the overlapped patch merging.

253
00:16:01,330 --> 00:16:05,980
Now, the overlapped patch merging is simply the same as the patch merging which we've just seen,

254
00:16:05,980 --> 00:16:09,310
but with the difference that this time around it is overlapped.

255
00:16:09,310 --> 00:16:16,960
So unlike what we've just seen, which is actually non-overlapping, with the overlapped patch merging

256
00:16:16,960 --> 00:16:18,700
we'll do something a little bit different.

257
00:16:18,700 --> 00:16:25,390
So before we took this quarter, we took this quarter this next quarter this quarter and this other

258
00:16:25,390 --> 00:16:25,930
quarter.

259
00:16:25,930 --> 00:16:27,820
And then we produced this output.

260
00:16:27,820 --> 00:16:30,880
Now instead what we'll do is we are going to overlap.

261
00:16:30,880 --> 00:16:36,880
That is, get into the other quarters while trying to produce this output.

262
00:16:36,880 --> 00:16:38,350
So this overlaps now.

263
00:16:38,350 --> 00:16:46,000
So we will make use of all this part to produce this first part of our output.

264
00:16:46,000 --> 00:16:48,250
And then for the next we just shift this.

265
00:16:48,250 --> 00:16:50,170
So let's shift this too.

266
00:16:50,320 --> 00:16:51,760
We'll move this this way.

267
00:16:51,760 --> 00:16:58,150
You see, it takes in part of, um, this quadrant, this quadrant, this quadrant and this quadrant,

268
00:16:58,150 --> 00:17:04,990
and this ensures that we have local continuity between the different patches, unlike with the

269
00:17:04,990 --> 00:17:07,180
non-overlapping patch merging.

270
00:17:07,180 --> 00:17:11,740
So for the next one we simply drag, we pull this.

271
00:17:11,740 --> 00:17:12,610
There we go.

272
00:17:12,610 --> 00:17:13,990
Let's pull it this way.

273
00:17:13,990 --> 00:17:16,150
So you see for this other part we just do this.

274
00:17:16,150 --> 00:17:19,330
And then for this other part that's the bottom left.

275
00:17:19,330 --> 00:17:22,960
For the bottom left we'll simply do it this way okay.

276
00:17:22,960 --> 00:17:23,620
So that's it.

277
00:17:23,620 --> 00:17:26,410
So that's the overlapped patch merging.
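To see the overlap concretely, here is a rough pure-Python sketch that only extracts the windows. The kernel size 7, stride 4 and padding 3 are the first-stage settings as I recall them from the SegFormer paper, not something stated in this video, so treat them as an assumption. In the real model a strided convolution also projects every window to a chosen number of channels, which this sketch leaves out, and `overlapped_windows` is a made-up helper name.

```python
def overlapped_windows(image, kernel=7, stride=4, padding=3):
    """Zero-pad the grid, then read kernel x kernel windows every
    `stride` pixels, so that neighbouring windows share pixels."""
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    out = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = [padded[top + a][left + b]
                      for a in range(kernel) for b in range(kernel)]
            out_row.append(window)
        out.append(out_row)
    return out


# 8 x 8 grid with pixel ids 1..64, so that 0 always means padding.
image = [[r * 8 + c + 1 for c in range(8)] for r in range(8)]
windows = overlapped_windows(image)

print(len(windows), len(windows[0]))  # 2 2

# Real pixels seen by the top-left and top-right windows.
top_left = {v for v in windows[0][0] if v != 0}
top_right = {v for v in windows[0][1] if v != 0}
print(len(top_left & top_right))  # 12
```

The output is still two by two, but the two top windows now share 12 pixels, which is the local continuity that the non-overlapping version lacks.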

278
00:17:26,410 --> 00:17:30,280
And then we'll move to the next where we have the efficient self-attention.

279
00:17:30,880 --> 00:17:37,450
The efficient self-attention, as the name suggests, provides a more efficient way of carrying out the

280
00:17:37,450 --> 00:17:38,650
self-attention.

281
00:17:38,650 --> 00:17:43,660
Now, the self-attention, from the original Attention Is All You Need

282
00:17:43,660 --> 00:17:44,140
paper,

283
00:17:44,140 --> 00:17:44,920
this is the Transformer

284
00:17:44,920 --> 00:17:49,060
paper, can be shown with this formula.

285
00:17:49,060 --> 00:17:59,080
So essentially the computational cost of this operation, Q by K transpose, is big O of N squared.

286
00:17:59,080 --> 00:18:05,740
And the reason why we have big O of N squared is simply because if Q, that's our query, is given to be

287
00:18:05,860 --> 00:18:13,420
an N by E matrix, and then we have K, which is also an N by E matrix,

288
00:18:13,420 --> 00:18:21,100
then multiplying this by K transpose would be multiplying this by an E by N matrix.

289
00:18:21,100 --> 00:18:23,950
And so you end up with an n by n matrix.

290
00:18:23,950 --> 00:18:32,200
This means that if you have a very large value for N, then you would end up with a very large

291
00:18:32,200 --> 00:18:36,940
matrix, which is also very expensive to compute.

292
00:18:36,940 --> 00:18:42,490
Now that said, in the domain of computer vision, values of n are pretty large.

293
00:18:42,490 --> 00:18:46,900
So N in our case will be height times width.

294
00:18:46,900 --> 00:18:50,170
So it will be H times W.

295
00:18:50,170 --> 00:18:59,680
So if you have a height of, let's say 512 and a width of 512, then you'll be having 512 times 512,

296
00:18:59,680 --> 00:19:01,750
which is pretty large.

297
00:19:01,750 --> 00:19:06,100
And so the role of this efficient self-attention method is to

298
00:19:06,310 --> 00:19:07,720
see how to compute this

299
00:19:07,720 --> 00:19:08,590
product

300
00:19:09,040 --> 00:19:11,080
much more efficiently.

301
00:19:11,080 --> 00:19:16,360
So let's take this off and show you how this is actually done.

302
00:19:16,660 --> 00:19:24,280
So what we will do here is take this H times W and divide it by a factor R.

303
00:19:24,280 --> 00:19:31,720
So if it's divided by R, then let's just take this from here, copy

304
00:19:31,720 --> 00:19:31,810
it,

305
00:19:31,810 --> 00:19:37,420
and put it this way. We have H times W divided by R.

306
00:19:37,420 --> 00:19:40,300
And then this c will be multiplied by r.

307
00:19:40,300 --> 00:19:42,460
So we will be able to reshape it.

308
00:19:42,700 --> 00:19:43,180
Now,

309
00:19:43,180 --> 00:19:48,160
note that if you just divide this by R and maintain C, you cannot reshape.

310
00:19:48,160 --> 00:19:51,010
So that's why you need to divide by r and then multiply by r.

311
00:19:51,010 --> 00:19:52,750
So you will be able to reshape.

312
00:19:52,750 --> 00:20:01,900
Now that said, what this means is you're going to have this side reduced, because now this

313
00:20:01,900 --> 00:20:04,360
is the h times w divided by r.

314
00:20:04,390 --> 00:20:05,470
Let's divide it here.

315
00:20:05,470 --> 00:20:07,060
We have divided by r.

316
00:20:07,060 --> 00:20:10,330
So this side is going to be reduced.

317
00:20:10,360 --> 00:20:13,000
See it's going to be reduced divided by r.

318
00:20:13,000 --> 00:20:18,790
And then this other side is going to be increased because now you've multiplied by c.

319
00:20:18,790 --> 00:20:21,640
So let's multiply here by C, by R.

320
00:20:21,640 --> 00:20:22,180
Sorry.

321
00:20:22,180 --> 00:20:24,340
So we multiply by r and that's increased.

322
00:20:24,370 --> 00:20:32,680
Now from this point, what we're going to do next is simply project this to C.

323
00:20:32,680 --> 00:20:37,780
So with a linear layer we can obtain C from C times R.

324
00:20:37,780 --> 00:20:43,630
So just like in the paper, here you have the paper, there we go.

325
00:20:43,660 --> 00:20:48,460
We have this linear projection which takes C times R and projects it to C.

326
00:20:48,460 --> 00:20:55,000
So just as in the paper, we're going to get back to our original shape

327
00:20:55,000 --> 00:20:57,130
in this other direction.

328
00:20:57,130 --> 00:21:01,000
So now we're going to have C instead of C by R.

329
00:21:01,030 --> 00:21:02,770
Let's actually come and measure it here.

330
00:21:02,770 --> 00:21:05,080
So you see it's actually exactly C.

331
00:21:05,230 --> 00:21:06,790
So you see that's C.

332
00:21:06,790 --> 00:21:10,630
But H times W divided by R is still maintained.

333
00:21:10,630 --> 00:21:20,020
So we can just copy this from here and then paste it here. We have H times W divided by

334
00:21:20,020 --> 00:21:20,410
r.

335
00:21:20,410 --> 00:21:22,990
And then we have now c here.

336
00:21:22,990 --> 00:21:29,080
So you see we have gone from H times W by C to H times W divided by R by C.

337
00:21:29,470 --> 00:21:36,520
And now, instead of multiplying, so this here, H times W,

338
00:21:36,520 --> 00:21:39,370
let's call this N, this was an N by C.

339
00:21:39,370 --> 00:21:52,330
So instead of multiplying N by C with N by C transpose, now you're going to multiply N divided

340
00:21:52,330 --> 00:21:58,930
by r by c times n by c transpose.

341
00:21:58,930 --> 00:22:12,640
And so we go from big O of N squared in complexity to big O of N squared divided by R, as we have in the

342
00:22:12,640 --> 00:22:13,180
paper.

343
00:22:13,180 --> 00:22:18,610
So here in the paper, we go from O of N squared to O of N squared divided by R.

344
00:22:18,610 --> 00:22:24,280
And this is very important because um, as we've said already in computer vision, the value of n is

345
00:22:24,280 --> 00:22:29,260
generally very large, because we're dealing with high values for the height and the width.

346
00:22:29,260 --> 00:22:34,240
And then they set R to 64, 16, 4, 1 from stage one to four.

347
00:22:34,240 --> 00:22:38,620
So depending on the stage, let's increase this,

348
00:22:38,620 --> 00:22:44,170
depending on the stage, you will have different values for R. Copy and paste. Okay.

349
00:22:44,170 --> 00:22:46,420
So depending on the stage you have different values for r.

350
00:22:46,570 --> 00:22:51,130
So this makes the self-attention much more efficient.
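The reshape and the saving can be put into numbers with a short pure-Python sketch. It covers only the reshape from N by C to N divided by R by C times R and the size of the score matrix. The learned linear layer that maps C times R back to C is left out, the helper names are made up for illustration, and the 512 by 512 input with a first-stage map at height and width divided by four and R equal to 64 is an assumption taken from the figures mentioned in this lecture.

```python
def reduce_sequence(seq, r):
    """Reshape an N x C sequence into (N / r) x (C * r) by joining every
    r consecutive rows, which is what a row-major reshape does."""
    n = len(seq)
    assert n % r == 0, "N must be divisible by r"
    return [[x for row in seq[i:i + r] for x in row]
            for i in range(0, n, r)]


def score_matrix_entries(n, r=1):
    """Number of entries in Q K^T when Q has N rows and K has N / r rows."""
    return n * (n // r)


# Tiny demonstration: N = 8 tokens with C = 2 channels, reduced with r = 4.
seq = [[i, i + 0.5] for i in range(8)]
reduced = reduce_sequence(seq, 4)
print(len(reduced), len(reduced[0]))  # 2 8

# Assumed first-stage numbers: 512 x 512 input, feature map at H/4 x W/4.
side = 512 // 4
n = side * side
print(n)                            # 16384
print(score_matrix_entries(n))      # 268435456
print(score_matrix_entries(n, 64))  # 4194304
```

Dividing the rows of K by R shrinks the score matrix by the same factor R, which is the move from N squared to N squared divided by R.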

351
00:22:51,430 --> 00:22:52,630
There we go.

352
00:22:52,630 --> 00:22:53,500
So that's it.

353
00:22:53,500 --> 00:23:01,720
So at this point we now move on to the next method, which is the Mix feed-forward network, Mix-FFN. In the

354
00:23:02,020 --> 00:23:02,770
ViT paper,

355
00:23:02,770 --> 00:23:10,210
the positional information, or information about the position, is added on to the linear projection

356
00:23:10,210 --> 00:23:11,560
of the flattened patches.

357
00:23:11,560 --> 00:23:14,230
So here we have patch plus positional embedding.

358
00:23:14,230 --> 00:23:22,300
You can see here, for example, zero, one, two, up to nine, to give us a position, or to give

359
00:23:22,300 --> 00:23:25,780
us information about the position of each and every patch.

360
00:23:25,780 --> 00:23:33,670
It turns out that this way of getting information about position works fine when we have fixed-size

361
00:23:33,670 --> 00:23:34,540
inputs.

362
00:23:34,960 --> 00:23:41,650
But when we are in test mode and the inputs we are getting are different, that is, the shape

363
00:23:41,650 --> 00:23:45,880
of the input is different from the shape of that which we used when we were training,

364
00:23:45,910 --> 00:23:52,690
then this way of storing or locating each and every patch doesn't work so well.

365
00:23:52,690 --> 00:23:59,740
And so what the authors of the SegFormer paper did was simply encode positional information

366
00:23:59,740 --> 00:24:03,160
using this three by three convolutional layer.

367
00:24:03,160 --> 00:24:05,890
So that said, we've treated all the different blocks

368
00:24:06,280 --> 00:24:07,870
which make up our encoder.

369
00:24:07,870 --> 00:24:10,570
We can now look at the decoder path.

370
00:24:10,570 --> 00:24:15,310
Now essentially we have a lightweight all-MLP decoder.

371
00:24:15,310 --> 00:24:22,210
And the reason why we need just a lightweight all-MLP decoder without any transformers is because of

372
00:24:15,310 --> 00:24:22,210
the hierarchical architecture we have at the level of the encoder, which permits us to have

373
00:24:22,210 --> 00:24:28,210
a large effective receptive field. Below,

374
00:24:28,210 --> 00:24:31,120
in this figure, you can see the receptive fields of DeepLabv3+ and the SegFormer

375
00:24:31,120 --> 00:24:38,050
compared.

376
00:24:38,590 --> 00:24:44,320
We can see how, even at stage four, the receptive field of DeepLabv3+ is still

377
00:24:44,320 --> 00:24:49,780
pretty small compared to those of the SegFormer at the different stages.

378
00:24:49,780 --> 00:24:58,000
And this is simply because of the hierarchical structure, or architecture, at the level of the encoder.

379
00:24:58,000 --> 00:25:05,200
And because we have this feature-rich information passed into the decoder,

380
00:25:05,200 --> 00:25:12,850
we do not need to have a very complex decoder architecture.

381
00:25:12,850 --> 00:25:19,840
So at the level of the decoder, we just have this lightweight all-MLP decoder, because the encoder

382
00:25:19,840 --> 00:25:27,370
has already done a lot of work when it comes to extracting the coarse and the fine features

383
00:25:27,370 --> 00:25:28,900
from the inputs.

384
00:25:29,200 --> 00:25:36,460
And that said, you can see how, with fewer parameters, the SegFormer models outperform the other models

385
00:25:36,760 --> 00:25:38,440
as of 2021.
