1
00:00:11,670 --> 00:00:17,880
In this lecture, we are going to discuss convolution in the preparation for building convolutional

2
00:00:17,910 --> 00:00:18,870
neural networks.

3
00:00:19,530 --> 00:00:25,050
Of course, given the name, it's pretty obvious that a convolutional neural network is a neural network

4
00:00:25,110 --> 00:00:26,370
with convolution.

5
00:00:26,940 --> 00:00:32,430
And so in order to understand CNNs, we must first understand convolution.

6
00:00:37,580 --> 00:00:42,620
First, I want to mention that people sometimes treat convolution as this mysterious thing.

7
00:00:43,220 --> 00:00:48,740
But as with our discussion of machine learning, we want to take the dumb-as-possible approach.

8
00:00:49,250 --> 00:00:51,530
Sure, convolution can be complex.

9
00:00:51,890 --> 00:00:55,310
It's a central operation in signal processing and computer vision.

10
00:00:55,850 --> 00:00:57,320
But in fact, it's quite simple.

11
00:00:57,860 --> 00:00:59,540
There are only two requirements.

12
00:00:59,840 --> 00:01:01,220
First, can you add?

13
00:01:01,850 --> 00:01:03,440
Second, can you multiply?

14
00:01:04,220 --> 00:01:08,840
If your answer to these two questions is yes, then you understand convolution.

15
00:01:09,800 --> 00:01:13,120
Believe it or not, convolution is just adding and multiplying.

16
00:01:18,280 --> 00:01:20,530
So let's start by not doing any math.

17
00:01:20,800 --> 00:01:25,120
Let's just look at convolution qualitatively. With convolution,

18
00:01:25,180 --> 00:01:27,460
there are three objects you want to pay attention to.

19
00:01:28,060 --> 00:01:29,560
First is the input image.

20
00:01:30,130 --> 00:01:31,660
Second, there is the filter.

21
00:01:32,290 --> 00:01:35,170
Note that another name for the filter is the kernel.

22
00:01:35,200 --> 00:01:38,080
So we're going to use those two terms interchangeably.

23
00:01:39,340 --> 00:01:41,110
Finally, there's the output image.

24
00:01:41,770 --> 00:01:47,020
The output image is what you get when you convolve the input image with the filter.

25
00:01:47,770 --> 00:01:54,010
In other words, if I perform the convolution operation on the input image and the filter, I get

26
00:01:54,010 --> 00:01:54,970
the output image.

27
00:01:55,840 --> 00:01:59,680
The mathematical symbol for convolution is a star or asterisk.

28
00:02:00,040 --> 00:02:05,170
Not to be confused with the asterisk that we use in computer programming for multiplication.

29
00:02:05,890 --> 00:02:11,770
So if you're running convolution in code, it will not be an asterisk because in code we already use

30
00:02:11,770 --> 00:02:13,390
the asterisk for multiplication.

31
00:02:18,490 --> 00:02:23,050
It's helpful to think of some examples of convolution to understand what it can do.

32
00:02:23,920 --> 00:02:25,660
There are two examples I really like.

33
00:02:26,440 --> 00:02:27,940
The first example is blurring.

34
00:02:28,690 --> 00:02:34,630
So the input image is just the original image and the output image is a blurred version of that image.

35
00:02:35,290 --> 00:02:38,920
You might recognize this operation from applications such as Photoshop.

36
00:02:44,120 --> 00:02:48,200
The second example I really like is edge detection. As input,

37
00:02:48,230 --> 00:02:53,810
we have the original image and as output, we get white lines where there are edges in the original

38
00:02:53,810 --> 00:02:56,690
image and we get black where there are no edges.

39
00:02:57,230 --> 00:03:01,430
So the output image highlights all the edges in the original input image.

40
00:03:06,640 --> 00:03:12,430
So that's your first lesson on how to view convolution. This perspective is that convolution is an

41
00:03:12,430 --> 00:03:13,770
image modifier.

42
00:03:14,410 --> 00:03:20,620
The input is the original image and the output is a modified or transformed version of that image.

43
00:03:21,310 --> 00:03:25,960
In other words, you might think of this as a feature transformation on the input image.

44
00:03:26,740 --> 00:03:29,860
You know, that sounds curiously like what a neural network does.

45
00:03:30,610 --> 00:03:35,770
In any case, what makes blurring and edge detection actually work if they are both convolution?

46
00:03:36,490 --> 00:03:39,400
What makes one convolution different from another convolution?

47
00:03:40,240 --> 00:03:44,020
Well, the answer is the filter. When you use a Gaussian filter,

48
00:03:44,170 --> 00:03:47,710
this blurs the image. When you use an edge detection filter,

49
00:03:47,770 --> 00:03:49,000
you get edge detection.

50
00:03:49,060 --> 00:03:51,460
You can think of that as sharpening the image.

51
00:03:52,450 --> 00:03:56,980
Your next question might be how do we find or design these filters in the first place?

52
00:03:57,700 --> 00:03:59,770
That's something we'll discuss later in the section.

53
00:04:00,130 --> 00:04:05,540
But the answer, as you might have come to expect, is just another dumb-as-possible approach.

54
00:04:10,740 --> 00:04:13,980
OK, so let's get into the nitty gritty of how convolution works.

55
00:04:14,520 --> 00:04:18,240
I promise you that all you need to know is how to add and multiply.

56
00:04:18,840 --> 00:04:19,830
Let's see if that's true.

57
00:04:20,880 --> 00:04:21,620
We're going to use

58
00:04:21,630 --> 00:04:23,640
very tiny images for this example.

59
00:04:24,000 --> 00:04:27,810
Although in reality, the actual images we work with will be much bigger.

60
00:04:28,530 --> 00:04:31,650
This is just so we can feasibly do these calculations by hand.

61
00:04:32,760 --> 00:04:39,730
So let's start with an image with rows 0, 10, 10, 0; then 20, 30, 30, 20; then 10, 20, 20, 10; and 0, 5,

62
00:04:39,730 --> 00:04:40,530
5, 0.

63
00:04:41,100 --> 00:04:43,770
And we also have the filter with rows 1, 0 and 0, 2.

64
00:04:44,550 --> 00:04:47,180
So how do we convolve this image with this filter?

65
00:04:52,390 --> 00:04:57,040
Let's imagine overlaying the filter at the upper left corner of the image.

66
00:04:57,790 --> 00:05:03,550
Then all we need to do is element-wise multiplication and add up all the results.

67
00:05:04,060 --> 00:05:11,530
So we have one times zero plus zero times 10 plus zero times 20 plus two times 30.

68
00:05:11,980 --> 00:05:13,270
That is equal to 60.

69
00:05:14,020 --> 00:05:17,260
This gives us our first output in the output image.

70
00:05:17,860 --> 00:05:21,700
And as promised, all you needed to do was multiply and add.
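To make the "multiply and add" step concrete, here is a minimal sketch of that first output in plain Python. The variable names (`patch`, `kernel`, `first_output`) are only illustrative and do not come from the lecture.

```python
# The 2x2 patch of the example image under the filter
# when the filter sits at the upper-left corner.
patch = [[0, 10],
         [20, 30]]

# The 2x2 filter (kernel) from the example.
kernel = [[1, 0],
          [0, 2]]

# Multiply matching entries, then add up all four products.
first_output = sum(
    patch[r][c] * kernel[r][c]
    for r in range(2)
    for c in range(2)
)

print(first_output)  # 1*0 + 0*10 + 0*20 + 2*30 = 60
```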

71
00:05:26,880 --> 00:05:33,270
Now, let's move our filter over one space to the right and then let's do our element-wise multiplication

72
00:05:33,270 --> 00:05:41,550
and add up the results again. So we get one times ten plus zero times ten plus zero times thirty plus

73
00:05:41,550 --> 00:05:42,490
two times thirty.

74
00:05:42,720 --> 00:05:43,610
That's 70.

75
00:05:44,310 --> 00:05:48,450
This gives us our second output, which goes to the right of the first output.

76
00:05:53,610 --> 00:05:58,260
Now, let's move the filter over one more space to the right and do the same operation again.

77
00:05:58,920 --> 00:06:06,420
We get one times ten plus zero times zero plus zero times 30 plus two times 20, and that's 50.

78
00:06:07,140 --> 00:06:11,340
This gives us a third output, which goes to the right of the first two outputs.

79
00:06:16,460 --> 00:06:19,920
Now you realize that there's no more space to move our filter to the right anymore.

80
00:06:20,540 --> 00:06:27,500
So let's instead zigzag back to the left, but go down one row and then let's repeat our calculation.

81
00:06:28,190 --> 00:06:34,820
We get one times 20 plus zero times 30 plus zero times ten plus two times 20.

82
00:06:35,000 --> 00:06:36,050
That's equal to 60.

83
00:06:36,890 --> 00:06:39,800
This gives us our first output in the second row.

84
00:06:41,390 --> 00:06:46,460
So you can see that the output location corresponds to where we've placed the filter along the original

85
00:06:46,460 --> 00:06:47,010
image.

86
00:06:52,170 --> 00:06:56,430
So we're not going to go through the rest of the calculations, since that would be a bit redundant.

87
00:06:56,880 --> 00:07:02,100
But what I want you to do is if you don't understand this, do the rest of the calculations by hand

88
00:07:02,490 --> 00:07:04,350
to make sure that these results are correct.

89
00:07:06,360 --> 00:07:11,220
So we have 60, 70, 50, 60, 70, 50, and then 20, 30, 20.

90
00:07:16,330 --> 00:07:22,810
One helpful exercise to get a better idea of how convolution works is to write pseudocode and even

91
00:07:22,810 --> 00:07:25,960
real code to implement the algorithm we just described.

92
00:07:26,590 --> 00:07:29,170
This will help us uncover a few hidden details

93
00:07:29,200 --> 00:07:35,260
you may not have considered. Just by starting the code, you realize one thing you need to take care

94
00:07:35,260 --> 00:07:37,240
of, which is: what size

95
00:07:37,510 --> 00:07:39,400
should I initialize the output array as?

96
00:07:39,400 --> 00:07:47,380
If you recall, in our example the input image had length four while the kernel had length two. The output

97
00:07:47,420 --> 00:07:48,250
had length three.

98
00:07:49,090 --> 00:07:50,020
So what's the pattern?

99
00:07:50,950 --> 00:07:57,160
I claim that if you have an array of length N and a filter of length K, then there are N minus K

100
00:07:57,160 --> 00:08:01,570
plus one distinct possible positions you can put the filter into.

101
00:08:02,620 --> 00:08:04,300
You might want to draw this on paper

102
00:08:04,420 --> 00:08:06,540
if you don't see right away why this is true.
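If drawing it out is not convincing, the count can also be checked by listing the positions. This is a small sketch for the one-dimensional case; `filter_positions` is a made-up helper name, not something from the lecture.

```python
def filter_positions(n, k):
    """Return every start index where a length-k filter fits in a length-n array."""
    # A filter starting at index s covers s .. s + k - 1,
    # so the last start that still fits is n - k.
    return list(range(n - k + 1))


positions = filter_positions(4, 2)
print(positions)       # [0, 1, 2]
print(len(positions))  # 3, which is N - K + 1 = 4 - 2 + 1
```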

103
00:08:11,660 --> 00:08:14,900
So the first step in our pseudocode is to initialize the output array.

104
00:08:15,860 --> 00:08:16,940
But here's another detail

105
00:08:16,970 --> 00:08:23,390
you may not have considered. The output height will be the input height minus kernel height plus one. The

106
00:08:23,510 --> 00:08:26,900
output width will be the input width minus kernel width plus one.

107
00:08:27,740 --> 00:08:31,160
But in our example, both our image and kernel were square.

108
00:08:31,910 --> 00:08:34,340
So you might be wondering, is this always the case?

109
00:08:34,610 --> 00:08:35,510
What's the convention?

110
00:08:36,260 --> 00:08:39,620
And the answer is that for images, usually these are not square.

111
00:08:40,400 --> 00:08:42,580
This just has to do with cameras and screens.

112
00:08:43,130 --> 00:08:47,510
Most screens like your computer screen and your TV screen are not square.

113
00:08:48,050 --> 00:08:50,780
And therefore, cameras do not take square pictures.

114
00:08:51,650 --> 00:08:57,770
Therefore, images we find in the wild that we want to use as data are typically also not square.

115
00:08:58,730 --> 00:09:02,030
Some neural networks do use square images for convenience.

116
00:09:02,210 --> 00:09:04,940
So when you build a data set, you make them square.

117
00:09:05,450 --> 00:09:08,210
And one example of this is MNIST, which we've already seen.

118
00:09:09,020 --> 00:09:12,970
On the other hand, kernels are almost always square by convention.

119
00:09:18,070 --> 00:09:24,120
Let's move on. The next step is just to fill in the output array by performing the convolution algorithm.

120
00:09:25,360 --> 00:09:29,440
So first we loop through zero up to output height with the index

121
00:09:29,580 --> 00:09:29,840
i.

122
00:09:31,960 --> 00:09:36,600
Inside that, we loop through zero up to output width with the index j.

123
00:09:37,630 --> 00:09:40,300
So the pair (i, j) will always index our output.

124
00:09:41,530 --> 00:09:43,800
Then we loop through each position of the kernel.

125
00:09:44,830 --> 00:09:47,900
So we have ii going from zero up to kernel height,

126
00:09:48,040 --> 00:09:51,190
and we have jj going from zero up to kernel width.

127
00:09:52,150 --> 00:09:58,540
Finally, inside all these loops, we have our main calculation, which, as promised, is just multiplication

128
00:09:58,540 --> 00:10:02,560
and addition. We multiply the input image at position

129
00:10:02,710 --> 00:10:06,200
(i + ii, j + jj) by the kernel

130
00:10:06,220 --> 00:10:08,200
at position (ii, jj).

131
00:10:08,830 --> 00:10:12,670
And we add this result to the output image at (i, j).

132
00:10:13,830 --> 00:10:19,510
Now it's OK to accumulate using plus equals because we initialize the output image to be all zeros.

133
00:10:20,890 --> 00:10:26,050
As an exercise, you might want to try and put this into code so that you can confirm that it works

134
00:10:26,080 --> 00:10:27,670
and returns the expected result.
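One possible way to turn the pseudocode into real code is sketched below in plain Python with nested lists. The function name `convolve2d_dl` is an assumption chosen here for illustration (it is not a library function), and the loop indices follow the i, j, ii, jj naming used in the pseudocode.

```python
def convolve2d_dl(image, kernel):
    """Deep-learning style convolution (no kernel flip), 'valid' positions only."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])

    # Output size: input size minus kernel size plus one, per dimension.
    out_h = in_h - k_h + 1
    out_w = in_w - k_w + 1

    # Start from all zeros so that accumulating with += is safe.
    output = [[0] * out_w for _ in range(out_h)]

    for i in range(out_h):               # (i, j) indexes the output
        for j in range(out_w):
            for ii in range(k_h):        # (ii, jj) indexes the kernel
                for jj in range(k_w):
                    output[i][j] += image[i + ii][j + jj] * kernel[ii][jj]
    return output


# The 4x4 image and 2x2 filter from the worked example.
image = [[0, 10, 10, 0],
         [20, 30, 30, 20],
         [10, 20, 20, 10],
         [0, 5, 5, 0]]
kernel = [[1, 0],
          [0, 2]]

result = convolve2d_dl(image, kernel)
for row in result:
    print(row)
# [60, 70, 50]
# [60, 70, 50]
# [20, 30, 20]
```

The printed rows match the values worked out by hand earlier in the lecture.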

135
00:10:32,760 --> 00:10:38,670
The inner part of the pseudocode is key because it helps us understand the equation that defines convolution.

136
00:10:39,660 --> 00:10:47,070
So here we can see that the (i, j)-th entry of the output, A convolved with W, is the sum over i'

137
00:10:47,220 --> 00:10:51,310
and the sum over j' of A at (i + i',

138
00:10:51,390 --> 00:10:53,460
j + j') times

139
00:10:53,540 --> 00:10:54,970
W at (i',

140
00:10:54,990 --> 00:10:55,560
j').
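Written out, the spoken equation would look roughly like the following. The summation limits use K_h and K_w for the kernel height and width; those two symbols are an assumption added here, since the lecture only says "sum over i prime and j prime".

```latex
(A * W)_{ij} = \sum_{i'=0}^{K_h - 1} \sum_{j'=0}^{K_w - 1} A_{i + i',\, j + j'} \, W_{i',\, j'}
```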

141
00:10:56,880 --> 00:11:00,420
Now you might ask, why are we looking at these complicated equations

142
00:11:00,660 --> 00:11:03,360
if PyTorch already does all this work for us?

143
00:11:04,230 --> 00:11:10,080
And the answer is that this will help you immensely in understanding the different perspectives on convolution

144
00:11:10,410 --> 00:11:11,850
that we are going to discuss later.

145
00:11:16,950 --> 00:11:22,740
Moreover, if you're a curious person and you go on Wikipedia to read about convolution, you'll see

146
00:11:22,740 --> 00:11:24,330
something very similar to this.

147
00:11:25,020 --> 00:11:30,510
In this example, you can think of X as the filter, Y as the input image, and Z as the output image.

148
00:11:35,640 --> 00:11:40,230
Now, you might notice something weird about this equation from Wikipedia, which is that instead of

149
00:11:40,230 --> 00:11:42,090
plus signs, we have minus signs.

150
00:11:42,480 --> 00:11:43,020
Why is that?

151
00:11:43,500 --> 00:11:44,970
Is the Lazy Programmer wrong?

152
00:11:45,870 --> 00:11:48,840
And the answer is, in fact, all of deep learning is wrong.

153
00:11:49,560 --> 00:11:52,410
But for better or worse, that's just the way we do things.

154
00:11:53,040 --> 00:11:57,960
In the end, it doesn't make a difference because the filters we use will be learned using gradient

155
00:11:57,960 --> 00:11:59,880
descent, in other words, automatically.

156
00:12:01,260 --> 00:12:06,420
So the process of finding the filter will be done automatically using gradient descent, which will

157
00:12:06,420 --> 00:12:08,700
find the best values that optimize our loss function.

158
00:12:09,000 --> 00:12:11,840
So it doesn't matter if the filter is reversed or not.

159
00:12:16,950 --> 00:12:22,320
In fact, if you use a library like SciPy, you'll notice that it already has a function called

160
00:12:22,350 --> 00:12:22,850
convolve2d.

161
00:12:23,790 --> 00:12:28,470
The problem is if you use this function as is, you'll get a totally different answer than we did.

162
00:12:29,710 --> 00:12:33,840
And as a side note, you might want to try it yourself to confirm that what I'm saying is true.

163
00:12:35,250 --> 00:12:40,980
Now, this is because convolve2d does a proper convolution and not the deep learning version of convolution

164
00:12:41,310 --> 00:12:42,930
with plus instead of minus.

165
00:12:43,740 --> 00:12:49,770
In order to make SciPy's convolution work the same way, we have to flip the filter both horizontally

166
00:12:49,770 --> 00:12:53,130
and vertically and set the mode argument equal to valid.

167
00:12:54,650 --> 00:13:01,030
Now, as a side note, convolution is a commutative operation. A convolved with W is the same as W convolved

168
00:13:01,070 --> 00:13:04,970
with A. Therefore, it doesn't matter which input we flip.
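The effect of the flip can be checked without any library. The sketch below does not call SciPy; it writes out the minus-sign ("proper") definition by hand, restricted to the valid positions, and compares it with the values from the worked example. The names `true_convolve2d` and `flip` are illustrative assumptions.

```python
def flip(kernel):
    """Flip a kernel both vertically and horizontally."""
    return [row[::-1] for row in kernel[::-1]]


def true_convolve2d(image, kernel):
    """Proper convolution (minus signs in the indices), valid positions only."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for p in range(k_h):
                for q in range(k_w):
                    # The kernel index is subtracted from the image index, so the
                    # kernel is effectively applied upside down and mirrored.
                    r = i + (k_h - 1) - p
                    c = j + (k_w - 1) - q
                    output[i][j] += image[r][c] * kernel[p][q]
    return output


image = [[0, 10, 10, 0],
         [20, 30, 30, 20],
         [10, 20, 20, 10],
         [0, 5, 5, 0]]
kernel = [[1, 0],
          [0, 2]]

# Used as is, proper convolution gives a different answer.
print(true_convolve2d(image, kernel))
# [[30, 50, 40], [60, 80, 70], [25, 45, 40]]

# Flipping the filter first recovers the deep learning result.
print(true_convolve2d(image, flip(kernel)))
# [[60, 70, 50], [60, 70, 50], [20, 30, 20]]
```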

169
00:13:10,180 --> 00:13:14,440
In fact, what we are doing in deep learning is actually known as the cross correlation.

170
00:13:15,210 --> 00:13:21,070
I actually think this is a much more helpful and descriptive name compared to convolution. Convolution,

171
00:13:21,070 --> 00:13:21,850
in and of itself,

172
00:13:21,880 --> 00:13:23,410
probably doesn't mean much to you.

173
00:13:23,980 --> 00:13:25,690
But the word correlation does.

174
00:13:26,320 --> 00:13:29,530
You probably think of the word correlated as sameness.

175
00:13:30,100 --> 00:13:35,740
So if I say X and Y are correlated, that means to you that there is some degree of similarity between

176
00:13:35,740 --> 00:13:36,400
X and Y.

177
00:13:37,150 --> 00:13:42,040
Therefore, you might think of a CNN as a correlational neural network rather than a convolutional neural

178
00:13:42,040 --> 00:13:42,520
network.

179
00:13:43,150 --> 00:13:49,240
The only difference between correlation and convolution is that convolution reverses the orientation

180
00:13:49,240 --> 00:13:51,700
of the filter, whereas correlation does not.

181
00:13:56,830 --> 00:14:02,260
The final topic I want to talk about in this lecture is this mode argument. To understand this,

182
00:14:02,290 --> 00:14:06,730
it's helpful to look at these animations, which kind of summarize how convolution works.

183
00:14:07,480 --> 00:14:11,830
Basically, you're sliding the filter across every possible position in the input image.

184
00:14:12,640 --> 00:14:17,290
In this animation, the motion of the filter is bounded by the edges of the image.

185
00:14:17,860 --> 00:14:21,490
Because of this, the output image is always smaller than the input image.

186
00:14:26,570 --> 00:14:32,300
But you might wonder, what if I want the output image to be the same size as the input image? In this

187
00:14:32,300 --> 00:14:34,330
case, what you can do is add padding.

188
00:14:35,090 --> 00:14:40,310
This is equivalent to adding a virtual array of zeros around the input image so that the filter can

189
00:14:40,310 --> 00:14:41,930
extend out to those values.

190
00:14:43,480 --> 00:14:47,590
The reason I say it's virtual is because you wouldn't really want to allocate the space in code.

191
00:14:48,220 --> 00:14:52,870
There's no reason to, since you already know that anything multiplied by zero is still zero.

192
00:14:53,680 --> 00:14:58,750
So in other words, to quote unquote add padding, you can just pretend there are zeros surrounding

193
00:14:58,750 --> 00:15:05,320
the input image as many zeros as you need to ensure that the output size is equal to the input size.
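As a rough sketch of what "virtual" padding could look like in code, the version below never allocates the zeros. It simply skips any position that falls outside the image, which has the same effect as multiplying by zero. The function name `same_convolve2d` is made up for this example, and it assumes the kernel height and width are odd so the padding is the same on every side.

```python
def same_convolve2d(image, kernel):
    """'Same' mode, no kernel flip; assumes odd kernel height and width."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_h, pad_w = k_h // 2, k_w // 2    # virtual zeros on each side

    # The output has the same size as the input.
    output = [[0] * in_w for _ in range(in_h)]
    for i in range(in_h):
        for j in range(in_w):
            for ii in range(k_h):
                for jj in range(k_w):
                    r = i + ii - pad_h
                    c = j + jj - pad_w
                    # Out-of-bounds positions are the virtual zeros: skip them.
                    if 0 <= r < in_h and 0 <= c < in_w:
                        output[i][j] += image[r][c] * kernel[ii][jj]
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ones = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]

for row in same_convolve2d(image, ones):
    print(row)
# [12, 21, 16]
# [27, 45, 33]
# [24, 39, 28]
```

With an all-ones kernel each output is just the sum of the neighbors that actually exist, so the corner and edge values are easy to verify by hand.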

194
00:15:10,520 --> 00:15:14,540
You may have noticed, however, that even with padding, we still lose some information.

195
00:15:15,230 --> 00:15:20,480
What I mean by that is there are outputs we could calculate if we extended the padding even further

196
00:15:20,840 --> 00:15:22,700
that would be non-zero outputs.

197
00:15:23,300 --> 00:15:28,310
So one other thing you can do, which is not as common these days, is to extend the padding further

198
00:15:28,640 --> 00:15:31,250
so you can catch all these non-zero outputs.

199
00:15:32,090 --> 00:15:35,300
This results in an output size of N plus K minus one.

200
00:15:35,630 --> 00:15:39,020
If your input image has length N and your filter has length K.

201
00:15:39,740 --> 00:15:41,600
Again, you should draw this out on paper

202
00:15:41,960 --> 00:15:43,550
if you're not convinced this is true.

203
00:15:48,650 --> 00:15:51,950
To summarize the three modes of convolution, let's look at this table.

204
00:15:52,730 --> 00:15:55,190
The first one we discussed was called valid convolution.

205
00:15:55,730 --> 00:15:59,210
This applies when the kernel can only touch the original input image.

206
00:15:59,750 --> 00:16:02,180
The output size is N minus K plus one.

207
00:16:03,200 --> 00:16:05,810
The second one we discussed was called same convolution.

208
00:16:06,500 --> 00:16:12,050
In this scenario, we add some padding just enough so that the output size is the same as the input

209
00:16:12,050 --> 00:16:12,950
size, N.

210
00:16:14,410 --> 00:16:18,520
The third mode we discussed is called full convolution. In this scenario,

211
00:16:18,550 --> 00:16:24,330
we extend the filter out as far as possible so that at least one point on the filter overlaps with one

212
00:16:24,330 --> 00:16:25,690
point on the input image.

213
00:16:26,230 --> 00:16:28,930
The output size is N plus K minus one.

214
00:16:29,980 --> 00:16:33,790
Normally in deep learning, we use valid or same convolutions.
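The three output sizes can be checked with a one-dimensional sketch. Unlike the "virtual" padding described earlier, this version really does allocate the zeros, purely to keep the code short. The helper name `correlate1d` is an assumption, and so is the choice to put the extra zero on the right when the total padding is odd; libraries may pick a different convention.

```python
def correlate1d(x, w, mode="valid"):
    """1D deep-learning style convolution (no flip) in valid, same, or full mode."""
    k = len(w)
    # Total number of zeros to add for each mode.
    total_pad = {"valid": 0, "same": k - 1, "full": 2 * (k - 1)}[mode]
    left = total_pad // 2
    right = total_pad - left
    padded = [0] * left + list(x) + [0] * right

    # Slide the filter over every position where it fits in the padded array.
    return [
        sum(padded[s + t] * w[t] for t in range(k))
        for s in range(len(padded) - k + 1)
    ]


x = [0, 10, 10, 0]   # N = 4
w = [1, 2]           # K = 2

for mode in ("valid", "same", "full"):
    out = correlate1d(x, w, mode)
    print(mode, len(out), out)
# valid 3 [20, 30, 10]
# same 4 [20, 30, 10, 0]
# full 5 [0, 20, 30, 10, 0]
```

The lengths 3, 4, and 5 are N - K + 1, N, and N + K - 1 for N = 4 and K = 2.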
