1
00:00:00,510 --> 00:00:08,400
This YOLO paper, published several years ago, was one of the first to come up with a single neural network

2
00:00:08,400 --> 00:00:16,230
which predicts bounding boxes and class probabilities directly from full images in one evaluation.

3
00:00:16,500 --> 00:00:21,870
Now, back then, models... or, to be precise:

4
00:00:21,870 --> 00:00:28,400
Object detection models followed this kind of pipeline where we have a region proposal generator, a

5
00:00:28,410 --> 00:00:32,160
feature extractor, and then a classification unit.

6
00:00:32,160 --> 00:00:41,790
So you can look at this with a simple R-CNN model, where we have the input image, then extract regions,

7
00:00:41,790 --> 00:00:50,580
that is, regions where the model thinks we could have objects, where the model proposes the locations

8
00:00:50,580 --> 00:00:51,840
of objects.

9
00:00:51,840 --> 00:01:01,230
And then from here, each proposed location is passed into this feature extractor, which extracts

10
00:01:01,500 --> 00:01:04,920
features from those warped regions.

11
00:01:04,920 --> 00:01:07,530
As you could see here, for example, this region.

12
00:01:07,530 --> 00:01:16,170
And then we have the classifier which tells us whether this is a person, an airplane or say, a TV

13
00:01:16,170 --> 00:01:17,070
monitor.

14
00:01:17,370 --> 00:01:26,940
Now, with YOLO, as we were saying, we have a single network. So let's get here. You see, we

15
00:01:26,940 --> 00:01:31,920
do not have all those different stages in our pipeline here.

16
00:01:31,920 --> 00:01:38,510
We just have our input image, a single neural network, and then we have the outputs.

17
00:01:38,520 --> 00:01:45,270
See that there's also this additional non-max suppression here, but we'll look at this shortly.

18
00:01:45,570 --> 00:01:52,860
Now, the performance of YOLO is quite impressive, or was quite impressive, in terms of speed,

19
00:01:52,860 --> 00:02:00,030
especially as we can now obtain speeds of up to 45 frames per second.

20
00:02:00,030 --> 00:02:07,080
And with Fast YOLO, the smaller version of YOLO, we obtain up to 155 frames per second while

21
00:02:07,080 --> 00:02:11,910
achieving double the mean average precision of other real time detectors.

22
00:02:11,910 --> 00:02:15,210
We'll look at the mean average precision subsequently.

23
00:02:15,210 --> 00:02:24,360
This is the metric generally used in object detection, and a higher mean average precision means the

24
00:02:24,360 --> 00:02:27,390
object detection model is performing better.

25
00:02:28,110 --> 00:02:34,440
Another advantage of YOLO to take note of is the fact that it reasons globally about an image.

26
00:02:34,800 --> 00:02:41,070
So unlike the sliding-window and region-proposal-based techniques, we looked at the

27
00:02:41,070 --> 00:02:47,190
region-proposal-based techniques like R-CNN here. Now, for sliding windows:

28
00:02:47,190 --> 00:02:59,070
The way it works is, we have this image right here, and then this window.

29
00:02:59,070 --> 00:03:01,380
You see, you can look at this as a window.

30
00:03:01,380 --> 00:03:07,230
So this window is slid through the whole image. Unlike here, where we simply pass in the

31
00:03:07,230 --> 00:03:14,100
image as an input, with a sliding window you would have to take each window and pass it into our neural

32
00:03:14,100 --> 00:03:14,700
network.

33
00:03:14,700 --> 00:03:18,480
And you do that while you're sliding through the full image.
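To get a feel for that cost, here is a minimal Python sketch of a sliding window over a 224 by 224 image. The window size of 64 and the stride of 32 are assumptions picked only for illustration; each returned position would need its own pass through the network.
```python
def sliding_window_positions(image_size, window, stride):
    """Return the top-left (x, y) corner of every window position."""
    steps = range(0, image_size - window + 1, stride)
    return [(x, y) for y in steps for x in steps]
# Assumed values for illustration only: 64-pixel window, 32-pixel stride.
positions = sliding_window_positions(224, 64, 32)
print(len(positions))  # how many separate network evaluations this would take
```
With YOLO, by contrast, the full image goes through the network once.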

34
00:03:18,480 --> 00:03:26,580
So, getting back to what we were saying, as compared to the sliding-window and the

35
00:03:26,600 --> 00:03:35,430
R-CNN kinds of models, YOLO performs better in the sense that it sees the entire image during training

36
00:03:35,430 --> 00:03:36,300
and test time.

37
00:03:36,300 --> 00:03:42,690
So it implicitly encodes contextual information about classes as well as their appearance.

38
00:03:42,690 --> 00:03:49,620
So you find that models like R-CNN mistake background patches in an image for objects because it

39
00:03:49,620 --> 00:03:51,960
can't see the larger context.

40
00:03:52,380 --> 00:03:53,070
Okay.

41
00:03:53,070 --> 00:03:58,800
Another point is that YOLO learns generalizable representations of objects.

42
00:03:58,800 --> 00:04:06,330
So you'll see here that YOLO, which wasn't trained on these kinds of paintings, performs

43
00:04:06,330 --> 00:04:07,170
quite well.

44
00:04:07,170 --> 00:04:14,220
You see, this particular version of YOLO was trained on the Pascal VOC

45
00:04:14,220 --> 00:04:15,090
dataset.

46
00:04:15,090 --> 00:04:20,940
But when we test it on these paintings, you see that it does well.

47
00:04:20,940 --> 00:04:27,720
So YOLO, compared to other models, learns more generalizable representations of objects.

48
00:04:27,720 --> 00:04:34,170
Let's now go in depth and see how the YOLO algorithm works.

49
00:04:34,410 --> 00:04:41,880
First, remember that we have this kind of model, our YOLO model, which you can see here.

50
00:04:42,660 --> 00:04:43,560
There we go.

51
00:04:43,590 --> 00:04:50,220
You see, this YOLO model has several conv layers and then ends with these fully connected layers.

52
00:04:50,220 --> 00:04:54,630
So we have the feature extractor and the classifier unit right here.

53
00:04:54,630 --> 00:04:56,880
Anyways, we are not going to get into this.

54
00:04:56,880 --> 00:04:59,760
Now let's just consider that we have this model.

55
00:05:00,120 --> 00:05:04,140
And then we have inputs like this one.

56
00:05:04,170 --> 00:05:08,940
See, this is our input, and then we have some output right here.

57
00:05:09,390 --> 00:05:14,040
Now, this output is obviously meant to be a bounding box.

58
00:05:14,040 --> 00:05:21,710
So here we could draw this bounding box for this woman here.

59
00:05:21,720 --> 00:05:27,660
So this bounding box is for this person.

60
00:05:28,170 --> 00:05:33,690
And then let's take this off and add the stroke.

61
00:05:34,740 --> 00:05:35,960
Let's change that color.

62
00:05:35,970 --> 00:05:40,230
So here we have this person detected.

63
00:05:40,230 --> 00:05:45,180
And then we also have this other person right here.

64
00:05:45,390 --> 00:05:47,340
So we'll have something like this.

65
00:05:47,340 --> 00:05:48,300
See that?

66
00:05:48,480 --> 00:05:57,750
So we now have these two detections, one for this woman and another one for this other person.

67
00:05:57,750 --> 00:06:05,580
And so we're going to build a model, which is this one, which takes in these kinds of input-output pairs

68
00:06:05,580 --> 00:06:15,840
and then learns to take those inputs and predict the outputs correctly, such that when given a new

69
00:06:15,840 --> 00:06:25,560
input, it could tell us where every object is located and what type of object it is precisely.

70
00:06:25,920 --> 00:06:31,800
Now, that said, the actual output wouldn't be this image

71
00:06:31,800 --> 00:06:38,490
with these bounding boxes. The actual output will be different from what we are seeing here.

72
00:06:38,640 --> 00:06:47,520
Now, the way these outputs are created is using some sort of encoding system where we have this input

73
00:06:47,520 --> 00:06:51,900
which is broken up into some grid cells.

74
00:06:51,900 --> 00:06:54,660
So right here, take the pen.

75
00:06:54,660 --> 00:07:04,800
So right here, let's suppose that we have this 224 by 224 input image, that's this

76
00:07:04,800 --> 00:07:07,830
image here, 224 by 224, this one here.

77
00:07:07,830 --> 00:07:12,510
And then we break this up into several grid cells.

78
00:07:12,510 --> 00:07:15,420
So this is a single grid cell, another one, and so on and so forth.

79
00:07:15,450 --> 00:07:26,640
Now, given that we have seven grid cells along each side, we have a seven by seven output. See

80
00:07:26,640 --> 00:07:32,400
that we have one, two, three, four, five, six, seven, one, two, three, four, five, six,

81
00:07:32,400 --> 00:07:32,820
seven.

82
00:07:32,820 --> 00:07:33,750
So that's it.

83
00:07:33,750 --> 00:07:43,590
And now each of these is going to be 32 by 32, so it's like 32 by 32 patches.

84
00:07:43,590 --> 00:07:50,970
So we take each patch here, combine them to form 224 by 224 or better still we take the 224 image and

85
00:07:50,970 --> 00:07:55,110
break it up into 32 by 32 grid cells.

86
00:07:55,110 --> 00:08:02,610
See that? Now, once we have this ready, once we have broken up our image

87
00:08:02,610 --> 00:08:13,740
into these grid cells, we are then going to encode the outputs based on the locations of the centers

88
00:08:13,740 --> 00:08:16,410
of our bounding boxes right here.

89
00:08:16,410 --> 00:08:22,620
So we're going to redraw these bounding boxes so that you see that clearly.

90
00:08:22,860 --> 00:08:25,890
Let's increase this to say five.

91
00:08:25,920 --> 00:08:30,870
Okay, so here we have this bounding box right here.

92
00:08:31,260 --> 00:08:32,670
See this bounding box?

93
00:08:33,150 --> 00:08:37,890
And then it has a center at about this position here.

94
00:08:37,920 --> 00:08:39,150
See that position?

95
00:08:39,630 --> 00:08:45,330
If we want to locate this here, it falls around here.

96
00:08:45,330 --> 00:08:46,930
So you see, it's about this.

97
00:08:46,950 --> 00:08:52,350
Now, for this other person, we have another bounding box like this.

98
00:08:53,070 --> 00:08:55,260
See that sort of bounding box?

99
00:08:55,260 --> 00:08:59,760
And the center is around the child's nose.

100
00:08:59,760 --> 00:09:01,680
So it's around here. See that?

101
00:09:01,770 --> 00:09:08,880
So what we're going to do now is we're going to have every one of these... let's change this color.

102
00:09:08,880 --> 00:09:15,360
We're going to have every one of these cells here

103
00:09:16,510 --> 00:09:18,850
Having certain values.

104
00:09:19,090 --> 00:09:27,530
Now, in the case where a cell like this one... let's make this a bit more transparent. In the case where

105
00:09:27,530 --> 00:09:33,280
a cell like this one, let's do 20 so you could see that better.

106
00:09:33,280 --> 00:09:40,060
So, as we were saying, in the case where we have a cell like this one which doesn't contain

107
00:09:40,060 --> 00:09:45,610
an object, then we'll say, okay, this first value will be a zero.

108
00:09:45,640 --> 00:09:50,140
You see that this first value will be zero.

109
00:09:50,980 --> 00:09:52,120
Let's get back.

110
00:09:52,120 --> 00:09:55,200
And here our first value is zero.

111
00:09:55,210 --> 00:09:59,380
So for this cell here, the first value is zero.

112
00:09:59,410 --> 00:10:07,600
Now, for this other cell here, there is an object.

113
00:10:08,170 --> 00:10:12,250
See, this value will be a one.

114
00:10:13,130 --> 00:10:13,840
You see that?

115
00:10:14,170 --> 00:10:21,520
So each and every cell here takes certain values based on whether there

116
00:10:21,520 --> 00:10:23,110
is an object or not.

117
00:10:24,550 --> 00:10:31,420
Now, as you can see, this cell here has a zero, this one a zero; in fact, all are zero except these

118
00:10:31,420 --> 00:10:35,350
two which we put in red, which will take values of one.

119
00:10:35,350 --> 00:10:43,680
And this is simply because it happens that the centers of the bounding boxes fall in those cells.

120
00:10:43,690 --> 00:10:45,100
So that's the first step.

121
00:10:45,280 --> 00:10:53,890
Now, once we have this first step, the next thing we want to do is we want to locate the exact position

122
00:10:53,890 --> 00:10:56,160
of our objects.

123
00:10:56,170 --> 00:11:01,840
So the first thing is, we want to know whether there is an object at all, and that's done by encoding

124
00:11:01,840 --> 00:11:02,800
it like this.

125
00:11:02,830 --> 00:11:06,310
The next thing we want to know is the exact position.

126
00:11:06,400 --> 00:11:15,160
Now, this exact position obviously depends on the way we want to represent our bounding

127
00:11:15,160 --> 00:11:15,880
boxes.

128
00:11:15,910 --> 00:11:26,980
Now, we could represent our bounding boxes by specifying x min, y min and x max, y max.

129
00:11:27,760 --> 00:11:34,780
With this kind of representation, an object like this maybe right here, or let's say a person right

130
00:11:34,780 --> 00:11:42,460
here, we'll make use of these values to locate this person.

131
00:11:42,460 --> 00:11:46,810
So we'll make use of this point here, which is x min, y min.

132
00:11:46,810 --> 00:11:55,360
And this other point here, which is x max, y max, with respect to the origin, which is the top left corner.

133
00:11:55,360 --> 00:11:57,630
So this is our origin right here.

134
00:11:58,000 --> 00:11:58,780
You see that?

135
00:11:58,780 --> 00:12:03,130
So we go x steps to the right and y steps downward.

136
00:12:03,130 --> 00:12:09,730
Then here, x steps to the right, y steps downward, to locate this person.

137
00:12:09,760 --> 00:12:16,000
Now, once we've done this, we could just put this out here. So we could say,

138
00:12:16,030 --> 00:12:18,000
okay, we're creating our output.

139
00:12:18,010 --> 00:12:20,470
Remember, our aim here is to create our outputs.

140
00:12:20,470 --> 00:12:22,180
So we are creating our outputs.

141
00:12:22,180 --> 00:12:27,790
We know for every cell or for every grid cell where objects are located.

142
00:12:27,790 --> 00:12:29,620
That's it then.

143
00:12:29,620 --> 00:12:32,620
Now, to get the bounding boxes, we could make use of this.

144
00:12:32,740 --> 00:12:40,540
But what's important to note here is the notation used by the authors of the YOLOv1 paper was instead

145
00:12:40,540 --> 00:12:51,250
x center, y center, then the width and the height of the bounding box.

146
00:12:51,250 --> 00:12:52,660
Obviously not that of the image.

147
00:12:52,660 --> 00:12:58,960
So here we are no longer making use of x min, y min, x max, y max.

148
00:12:58,960 --> 00:13:04,480
We make use of x center and y center, which locate the center.

149
00:13:04,480 --> 00:13:06,130
So we'll go x.

150
00:13:06,130 --> 00:13:07,150
Y.

151
00:13:08,200 --> 00:13:17,050
And then we look for the width of this box, the width and also the height of the box.

152
00:13:17,920 --> 00:13:19,600
So let's get back.

153
00:13:19,600 --> 00:13:22,110
So that's basically how we do that.

154
00:13:22,120 --> 00:13:25,090
And then there's another spatial encoding which is done.

155
00:13:25,870 --> 00:13:32,860
And what they actually do is for the width and the height of the bounding box, they're going to divide

156
00:13:32,860 --> 00:13:39,770
this by big W, where big W is the width of the whole image.

157
00:13:39,770 --> 00:13:43,630
So if our image is 224 by 224 we'll take this width.

158
00:13:43,630 --> 00:13:51,190
Let's suppose that the width is, say, 160, so we'll take 160 divided by 224 and get our width and then

159
00:13:51,190 --> 00:13:53,410
for the height we'll do the same thing.

160
00:13:53,410 --> 00:13:58,690
So if we have a height of, say, 200, we'll take 200 divided by 224 and we get that.

161
00:13:58,690 --> 00:14:03,560
So it's h divided by the height of the whole image.

162
00:14:03,580 --> 00:14:04,360
See that?

163
00:14:04,480 --> 00:14:12,160
Now, once we have that, for this x c, y c, we're going to do something similar, but

164
00:14:12,160 --> 00:14:18,250
not dividing by the whole width and also not dividing by the height of the image.

165
00:14:18,250 --> 00:14:27,010
What we are going to do here is we are going to have this x c with respect to its specific grid cell.

166
00:14:27,250 --> 00:14:28,450
Now, let's explain.

167
00:14:28,450 --> 00:14:31,430
Let's pull this away so you can see that clearly.

168
00:14:31,450 --> 00:14:38,350
So suppose that we have this here, and we've already seen what to do for the width and the height.

169
00:14:38,350 --> 00:14:43,540
We take that divide by the total width and for the height divided by the total height, total width

170
00:14:43,540 --> 00:14:45,760
and total height in our example is 224.

171
00:14:45,850 --> 00:14:48,700
Now for the x and Y.

172
00:14:48,730 --> 00:14:56,100
See here, for example, we have this x c, y c, and for this other object, its x c, y c.

173
00:14:56,110 --> 00:14:59,260
Let's consider this example here.

174
00:14:59,260 --> 00:15:03,310
So what we're saying is we are not going to take this with respect to the full image.

175
00:15:03,310 --> 00:15:10,330
Instead, we're going to say we're going to take this grid cell and suppose that this distance is one

176
00:15:10,330 --> 00:15:13,960
and we take this and suppose that this distance too is one.

177
00:15:13,960 --> 00:15:16,470
Now, our origin here is this.

178
00:15:16,480 --> 00:15:18,400
See, this point is our origin.

179
00:15:19,000 --> 00:15:23,950
Now, for this cell, this will be our origin and this will be our distance one.

180
00:15:23,950 --> 00:15:25,750
And here's our distance one.

181
00:15:26,110 --> 00:15:32,020
Now, if this is our distance one, this distance one, then this point here, let's say this point

182
00:15:32,020 --> 00:15:37,150
here will be a fraction of one basically.

183
00:15:37,150 --> 00:15:39,200
So it will be a value between zero and one.

184
00:15:39,220 --> 00:15:45,130
Now, if we take this distance, we could approximate this to be about, say, 0.56.

185
00:15:45,130 --> 00:15:48,400
And then this distance is about, say, 0.7.

186
00:15:48,400 --> 00:15:55,840
So in this case x c will be this distance and y c will be this distance.

187
00:15:55,880 --> 00:16:01,120
In that case, we'll have 0.56 and 0.7.

188
00:16:01,540 --> 00:16:02,380
So that's it.

189
00:16:02,620 --> 00:16:08,080
Now, once we have that, the next thing we'll do is we'll just simply put that out here.

190
00:16:08,080 --> 00:16:10,510
So we'll have that x c.

191
00:16:11,230 --> 00:16:19,630
With respect to the grid cell, we write that as x c g, and then we'll have y c g. That's it.

192
00:16:19,870 --> 00:16:23,850
So here we now know how to obtain these values.

193
00:16:23,860 --> 00:16:25,870
Let's change the color back.

194
00:16:25,870 --> 00:16:28,060
So here we would have, say,

195
00:16:28,090 --> 00:16:32,380
0.56 and 0.7.

196
00:16:32,380 --> 00:16:33,770
So that's how we have this.

197
00:16:33,790 --> 00:16:40,090
Now, once we get this, remember we already have the one here, so the next one we have is x

198
00:16:40,090 --> 00:16:43,650
c g. Now, for the next part.

199
00:16:43,660 --> 00:16:53,380
If we have a dataset where we have, say, 20 classes, like the Pascal VOC dataset, or the COCO data

200
00:16:53,380 --> 00:17:00,040
set where we have 80 classes, what we would have from here... See, we've had whether the object is there

201
00:17:00,040 --> 00:17:05,320
or not. If the object is there, we want to get its location. Next, after the location:

202
00:17:05,320 --> 00:17:09,070
We want to know what object exactly is found there.

203
00:17:09,070 --> 00:17:12,100
So here, we suppose that we have 20 classes.

204
00:17:12,130 --> 00:17:17,010
See? Now, initially every one here is zero, so we have 20.

205
00:17:17,020 --> 00:17:18,940
We're just going to align 20 zeros.

206
00:17:18,940 --> 00:17:22,360
So basically we have these 20 zeros here.

207
00:17:22,390 --> 00:17:31,790
Now, for the person, in our classes we could decide that the person occupies the third position in

208
00:17:31,790 --> 00:17:32,620
a list of classes.

209
00:17:32,620 --> 00:17:38,710
Or let's just check out the list of Pascal VOC classes.

210
00:17:38,830 --> 00:17:44,350
As you can see here, we have the 20 different classes, and then if we count them, we have one,

211
00:17:44,350 --> 00:17:52,500
two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15.

212
00:17:52,510 --> 00:17:55,720
Notice how this person is in the 15th position.

213
00:17:55,720 --> 00:17:59,320
15, 16, 17, 18, 19, 20.

214
00:17:59,530 --> 00:18:07,480
Okay, so we have the person at that 15th position and so.

215
00:18:07,600 --> 00:18:15,010
This means that when encoding our image here or when encoding our output, we would have to take that

216
00:18:15,010 --> 00:18:16,280
into consideration.

217
00:18:16,510 --> 00:18:17,680
We'll simply do:

218
00:18:19,330 --> 00:18:27,610
One, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15.

219
00:18:27,610 --> 00:18:35,560
So here, at this position, we'll change this and put a one instead of the zero which all the other

220
00:18:35,560 --> 00:18:36,600
classes get.

221
00:18:36,610 --> 00:18:42,340
So we take this off and then we have a one right here.

222
00:18:42,580 --> 00:18:43,150
See that?

223
00:18:43,150 --> 00:18:46,490
So we have 15, 16, 17, 18, 19, 20.

224
00:18:46,510 --> 00:18:54,040
Now if you have another data set with fewer classes, say eight classes, this will be of length eight.
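This class encoding can be sketched as a one-hot list. The class names below are the 20 Pascal VOC classes in alphabetical order, which is the ordering assumed here because it puts the person in the 15th position as counted above; the function name is hypothetical.
```python
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]
def one_hot(class_name, classes=VOC_CLASSES):
    """Return a list of zeros with a single one at the class position."""
    vector = [0] * len(classes)
    vector[classes.index(class_name)] = 1
    return vector
person = one_hot("person")
print(person.index(1) + 1)  # position of the one, counting from one
```
With a dataset of eight classes, passing a list of eight names gives a vector of length eight.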

225
00:18:54,520 --> 00:19:01,810
Now we've seen how for each and every one of these grid cells, and there are actually 49 of them because

226
00:19:01,810 --> 00:19:10,360
we have seven by seven, we have this encoding. So what our model sees, or what our model gets

227
00:19:10,360 --> 00:19:17,350
as output isn't this, but instead this kind of output.

228
00:19:17,470 --> 00:19:24,190
So now we are going to prepare our data set so that we have an input image, which is this, and then

229
00:19:24,190 --> 00:19:28,180
an output label which is going to be this label right here.

230
00:19:28,180 --> 00:19:35,410
And then the model is going to try to correctly produce these kinds of outputs.

231
00:19:35,650 --> 00:19:41,350
And so as we've just discussed, our output will be of shape.

232
00:19:41,350 --> 00:19:49,600
This output here will be of shape seven by seven by 20,

233
00:19:49,630 --> 00:19:53,080
See, the number of classes, 20, plus five.

234
00:19:53,380 --> 00:19:56,530
Just say five plus 20.

235
00:19:58,950 --> 00:20:05,810
Which is in fact seven by seven by 25.

236
00:20:05,820 --> 00:20:09,600
So this is what our label will look like.
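Putting the pieces together, a label of shape seven by seven by 25 could be built roughly like this. This is only a sketch under the assumptions of this example (224 by 224 image, 20 classes, one box per cell); the function name and the sample box are hypothetical.
```python
GRID, NUM_CLASSES, IMAGE_SIZE = 7, 20, 224
CELL = IMAGE_SIZE / GRID
def build_label(boxes):
    """boxes: list of (x_center, y_center, width, height, class_index) in pixels."""
    label = [[[0.0] * (5 + NUM_CLASSES) for _ in range(GRID)] for _ in range(GRID)]
    for x_c, y_c, w, h, class_index in boxes:
        col = min(int(x_c // CELL), GRID - 1)
        row = min(int(y_c // CELL), GRID - 1)
        cell = label[row][col]  # a later box in the same cell overwrites an earlier one
        cell[0] = 1.0                 # an object's center falls in this cell
        cell[1] = x_c / CELL - col    # x center relative to the grid cell
        cell[2] = y_c / CELL - row    # y center relative to the grid cell
        cell[3] = w / IMAGE_SIZE      # width relative to the whole image
        cell[4] = h / IMAGE_SIZE      # height relative to the whole image
        cell[5 + class_index] = 1.0   # one-hot class entry
    return label
# Hypothetical person box (class index 14, counting from zero).
label = build_label([(113.92, 118.4, 160, 200, 14)])
print(len(label), len(label[0]), len(label[0][0]))
```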

237
00:20:11,000 --> 00:20:17,480
So when we get our inputs and outputs, that is, the input and the bounding boxes with the specific

238
00:20:17,480 --> 00:20:18,080
classes.

239
00:20:18,080 --> 00:20:24,320
We're going to create these kinds of outputs from the labels we get, depending on our dataset.

240
00:20:25,430 --> 00:20:32,720
Nonetheless, when we zoom into this model right here, we notice that we take in some input and then

241
00:20:32,720 --> 00:20:39,740
at the level of the output, we have seven by seven by 30 instead of seven by seven by 25.

242
00:20:39,770 --> 00:20:42,890
Now, the reason why we have this is simple.

243
00:20:43,640 --> 00:20:45,890
We have the first position.

244
00:20:45,890 --> 00:20:47,330
Let's reduce this.

245
00:20:47,330 --> 00:20:53,510
We have the first position which tells us whether there is an object or not.

246
00:20:53,510 --> 00:20:54,770
So that's one.

247
00:20:55,040 --> 00:21:02,770
And then the next positions give us the location with the x c, y c, w, and h.

248
00:21:02,780 --> 00:21:04,400
So here we have the position.

249
00:21:05,210 --> 00:21:10,730
We have the position, so we have x, y, w, h.

250
00:21:12,050 --> 00:21:20,340
Now the next ones give us the classes, or tell us what class our object belongs to.

251
00:21:20,360 --> 00:21:25,040
So here, we would have a series of zeros.

252
00:21:25,170 --> 00:21:29,330
At some point we will have a one, and then we have zeros again.

253
00:21:29,330 --> 00:21:31,450
But the length of this here is 20.

254
00:21:31,460 --> 00:21:35,840
Now 20 plus five is 25, obviously, which is less than 30.

255
00:21:35,870 --> 00:21:39,950
Now, to understand why this is actually 30.

256
00:21:40,520 --> 00:21:41,480
Oops.

257
00:21:41,480 --> 00:21:43,940
To understand why this is actually 30.

258
00:21:44,360 --> 00:21:47,330
What we'll do is let's take this off.

259
00:21:47,960 --> 00:22:00,890
What we'll do is we'll suppose that we have for every cell two boxes responsible for locating the object.

260
00:22:01,490 --> 00:22:07,580
So instead of having here only this particular box...

261
00:22:07,580 --> 00:22:09,440
So we suppose that this is a box.

262
00:22:09,680 --> 00:22:10,910
Let's take the pen.

263
00:22:10,910 --> 00:22:12,710
We suppose that this here...

264
00:22:13,640 --> 00:22:16,400
This here is a box.

265
00:22:17,060 --> 00:22:18,970
This one is a box.

266
00:22:18,980 --> 00:22:26,360
Instead of having only this box, we are going to make two boxes responsible for locating the object.

267
00:22:26,360 --> 00:22:30,800
Remember, this is what permits us to locate the object because, first of all, it tells us whether the

268
00:22:30,800 --> 00:22:32,510
object is found in the cell or not.

269
00:22:32,510 --> 00:22:36,140
And here it gives us the exact coordinates of the object.

270
00:22:36,140 --> 00:22:38,300
This one, for the classes, is separate.

271
00:22:38,300 --> 00:22:43,690
So the next thing we'll do is we'll take this and multiply it by two.

272
00:22:43,700 --> 00:22:46,100
So let's shift this again.

273
00:22:46,550 --> 00:22:49,880
We have this and then we paste that out.

274
00:22:49,910 --> 00:22:54,590
See? Now, if we had the number of boxes be three, then we'd just duplicate this twice.

275
00:22:54,590 --> 00:22:59,660
In the YOLOv1 paper, they considered B to be equal to two.

276
00:22:59,690 --> 00:23:02,270
You could check that out here: B equals two.

277
00:23:02,480 --> 00:23:10,850
So that said, we see we now have two boxes responsible for locating the object in that image.
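The arithmetic behind the 25 and the 30 can be sketched as follows. The function name is hypothetical; S, B and C stand for the grid size, the number of boxes per cell and the number of classes.
```python
def output_depth(num_boxes, num_classes):
    """Each box carries 5 values (objectness, x, y, w, h); the classes are shared."""
    return num_boxes * 5 + num_classes
S, B, C = 7, 2, 20
label_shape = (S, S, output_depth(1, C))   # one box per cell in the label
model_shape = (S, S, output_depth(B, C))   # B boxes per cell in the model output
print(label_shape, model_shape)  # (7, 7, 25) (7, 7, 30)
```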

278
00:23:10,850 --> 00:23:18,320
So when designing the labels, we could design with just one because we know that this is the correct

279
00:23:18,320 --> 00:23:18,920
answer.

280
00:23:18,920 --> 00:23:24,380
But what the model will predict will be two values.

281
00:23:24,380 --> 00:23:25,910
You see, it will predict these two.

282
00:23:25,910 --> 00:23:27,650
It will predict this and predict this.

283
00:23:27,830 --> 00:23:34,280
Now, the reason why we're doing this is because we want that one of these boxes or any one of these

284
00:23:34,280 --> 00:23:39,290
boxes to be more specialized, depending on the size of the object.

285
00:23:39,320 --> 00:23:45,800
Now, as you've seen here, we're dealing with relatively large objects with respect to the image,

286
00:23:45,800 --> 00:23:46,310
actually.

287
00:23:46,310 --> 00:23:53,960
But what if we have an image of a person where the person is, say, just very small? We have a very,

288
00:23:55,220 --> 00:24:03,740
very small person-to-image ratio like this, where we will have a bounding box which will be small compared

289
00:24:03,740 --> 00:24:05,360
to the whole image.

290
00:24:05,360 --> 00:24:13,520
In that case, what we want is for the model's boxes to be specialized such that maybe this first

291
00:24:13,520 --> 00:24:21,020
box will learn to detect the smaller objects while this one learns to detect the larger objects.

292
00:24:21,020 --> 00:24:21,860
So that's it.

293
00:24:21,860 --> 00:24:24,710
That's how we construct this output.

294
00:24:24,710 --> 00:24:31,610
So we understand how to construct the labels and then how to construct the model outputs.

295
00:24:31,640 --> 00:24:41,840
Remember, we have to update the weights such that the difference between the labels and the model output,

296
00:24:42,110 --> 00:24:45,530
let's say O, is minimized.

297
00:24:46,520 --> 00:24:52,670
Getting back to the paper, we see the exact structure of the YOLO model.

298
00:24:52,670 --> 00:24:59,960
So right here we start with a conv layer, max-pool layer, conv layer, max-pool.

299
00:24:59,960 --> 00:25:07,790
We have several conv layers, then the max-pool, then we have these other conv layers here, times four,

300
00:25:07,820 --> 00:25:10,520
the max-pool, conv layers, then more conv

301
00:25:10,870 --> 00:25:11,630
layers.

302
00:25:11,660 --> 00:25:15,530
Then finally, we have the fully connected layers for classification.

303
00:25:15,530 --> 00:25:20,240
So this is for feature extraction, this here, feature extraction.

304
00:25:20,240 --> 00:25:23,420
And then this is for classification.

305
00:25:24,350 --> 00:25:32,480
Now, for the training, what they do is they pretrain this model on ImageNet with 1000 classes.

306
00:25:32,480 --> 00:25:35,720
But note that this pre training is for the problem of classification.

307
00:25:35,720 --> 00:25:46,340
So it's a visual classification problem, and they pretrain this model for about a week here and achieve

308
00:25:46,790 --> 00:25:49,570
a top-5 accuracy of 88%.

309
00:25:49,580 --> 00:25:56,870
And then from this model they add four convolutional layers and two fully connected layers.

310
00:25:56,870 --> 00:25:59,060
with randomly initialized weights.

311
00:25:59,180 --> 00:26:07,340
So during the training for object detection, we have weights from the previous

312
00:26:07,340 --> 00:26:16,220
training, that is, the pretrained weights from ImageNet, and then they add some... they

313
00:26:16,220 --> 00:26:17,510
actually spoke of four.

314
00:26:17,510 --> 00:26:24,260
So it should be this one, this one, and these last two here.

315
00:26:24,920 --> 00:26:33,820
So they add four conv layers and two fully connected layers which have been randomly initialized.

316
00:26:33,830 --> 00:26:37,580
As is said here. Now, following the example,

317
00:26:37,580 --> 00:26:37,870
Okay.

318
00:26:37,940 --> 00:26:42,770
detection often requires fine-grained visual information, so we increase the input resolution

319
00:26:42,770 --> 00:26:47,750
of the network from 224 by 224 to 448 by 448.

320
00:26:47,750 --> 00:26:54,170
So what they did was they trained on 224 by 224 images, and then at detection time

321
00:26:54,170 --> 00:26:57,250
they used 448 by 448 images.

322
00:26:57,260 --> 00:27:03,730
Now, they also use a linear activation function for the final layer, and all the other layers use the following

323
00:27:04,040 --> 00:27:14,390
leaky ReLU. So they use a linear activation for the final layer and then the leaky ReLU for all the other layers as the activation

324
00:27:14,390 --> 00:27:15,140
function.

325
00:27:15,140 --> 00:27:26,930
Now, as a recall, we have our ReLU, which is simply this. What the ReLU does is, all values,

326
00:27:28,000 --> 00:27:35,950
all values it takes which are less than zero are sent to zero, and all values greater than

327
00:27:35,950 --> 00:27:38,020
zero are maintained.

328
00:27:38,020 --> 00:27:45,250
So if you pass a value like three to the ReLU, you get back three.

329
00:27:45,490 --> 00:27:55,930
But if you pass, say, -0.5, what you will get will be zero because all negative values are sent

330
00:27:55,930 --> 00:27:58,460
to zero and all positive values are maintained.

331
00:27:58,480 --> 00:28:11,620
Basically, x remains x if x is greater than or equal to zero, and it goes to zero if x is less than zero.

332
00:28:11,770 --> 00:28:15,740
Now, a note on the final layer before we get to all the others.

333
00:28:15,760 --> 00:28:16,920
The final layer uses a linear activation.

334
00:28:16,930 --> 00:28:18,500
So its output is passed through unchanged.

335
00:28:18,520 --> 00:28:23,950
Now, for all the other layers, they use the leaky ReLU. For the leaky ReLU,

336
00:28:23,980 --> 00:28:30,680
what we have is not this here, not a straight horizontal line, but instead something like this.

337
00:28:30,700 --> 00:28:32,260
See something a bit slanted.

338
00:28:32,830 --> 00:28:35,560
And then here we still maintain the positive.

339
00:28:35,560 --> 00:28:36,460
Still maintain.

340
00:28:36,460 --> 00:28:39,370
So we still have x.

341
00:28:40,490 --> 00:28:45,830
It still remains x for x greater than or equal to zero.

342
00:28:46,250 --> 00:28:56,630
But for this one it goes to 0.1x for x less than zero.

343
00:28:56,780 --> 00:29:04,610
So for all negative values, we will have 0.1x. So it means our gradient here, this gradient, is going

344
00:29:04,610 --> 00:29:05,900
to be 0.1.

345
00:29:05,900 --> 00:29:07,100
So don't bother too much

346
00:29:07,100 --> 00:29:12,950
if you don't understand the notion of gradient. Anyway, we have this.

347
00:29:13,490 --> 00:29:22,970
It means that if we send a value like, say, -0.5, now what we get will be -0.5 times 0.1, and no longer

348
00:29:22,970 --> 00:29:23,540
zero.

349
00:29:23,540 --> 00:29:24,740
So that is the difference.

350
00:29:24,740 --> 00:29:32,480
And this is the activation function used everywhere in the model except for the last layer.
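(Editor's sketch, not part of the talk.) The two activations described above can be written in a few lines of Python. The function names `relu` and `leaky_relu` are illustrative choices, and the slope of 0.1 is the value quoted in the video for negative inputs.

```python
def relu(x: float) -> float:
    """Plain ReLU: keep non-negative values, send negative values to zero."""
    return x if x >= 0 else 0.0


def leaky_relu(x: float, slope: float = 0.1) -> float:
    """Leaky ReLU: keep non-negative values, scale negative values by `slope`."""
    return x if x >= 0 else slope * x


# Compare the two on the values used in the explanation.
for value in (3.0, -0.5):
    print(value, relu(value), leaky_relu(value))
```

The only difference is the negative branch: the leaky version keeps a small non-zero slope there instead of a flat line.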

351
00:29:32,750 --> 00:29:35,810
So that's what's defined right here.

352
00:29:35,840 --> 00:29:41,840
Now, the loss function that they use here is the sum-squared error.

353
00:29:41,840 --> 00:29:44,840
So that's the simple loss function they use.

354
00:29:44,840 --> 00:29:47,180
And that's why at the beginning they say,

355
00:29:49,880 --> 00:29:54,260
we frame object detection as a regression problem.

356
00:29:54,380 --> 00:29:55,280
So that's it.

357
00:29:55,430 --> 00:29:58,130
So it's like a simple regression problem.

358
00:29:58,130 --> 00:30:05,840
Actually, that is the sum of the squared differences between the model output and the expected output,

359
00:30:05,840 --> 00:30:07,040
which is the labels.

360
00:30:07,820 --> 00:30:16,160
Now that we understand globally how the model is built and the training process, let's look in

361
00:30:16,160 --> 00:30:19,100
depth into this loss function.

362
00:30:19,580 --> 00:30:26,240
So right here we have this loss function, and then we're supposing that we have these labels here.

363
00:30:26,240 --> 00:30:27,890
So here are our labels.

364
00:30:27,890 --> 00:30:30,280
And then here's what the model predicts.

365
00:30:30,290 --> 00:30:39,500
Remember, we have seven by seven grid cells, so seven by seven by 30 for

366
00:30:39,500 --> 00:30:46,760
the model's predictions, whereas the labels are seven by seven by 25.

367
00:30:46,790 --> 00:30:47,330
See that?

368
00:30:47,660 --> 00:30:53,210
So we have this first five for the location, this location, this location.

369
00:30:53,210 --> 00:30:56,360
We have to add this one here.

370
00:30:57,500 --> 00:30:58,220
Oops.

371
00:30:58,220 --> 00:31:00,440
Let's change the color back to red.

372
00:31:00,620 --> 00:31:02,090
So you see that clearer.

373
00:31:02,090 --> 00:31:04,520
So we have this location.

374
00:31:04,520 --> 00:31:08,660
Here we go.

375
00:31:09,380 --> 00:31:11,420
We have this location here.

376
00:31:14,390 --> 00:31:14,860
That's it.

377
00:31:14,870 --> 00:31:20,630
We have these five, these five, and we have the classes here.

378
00:31:20,630 --> 00:31:26,450
We also have the classes and then we have this five right here.
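(Editor's sketch, not part of the talk.) The depths 30 and 25 follow from simple arithmetic. The names `S`, `B`, and `C` below are assumptions borrowed from common YOLO notation (grid size, boxes per cell, number of classes), with `C = 20` for the PASCAL VOC classes.

```python
S = 7    # grid is S x S cells
B = 2    # box predictors per cell
C = 20   # classes (PASCAL VOC)

# Each box carries 5 numbers: x, y, w, h and a confidence score.
pred_depth = B * 5 + C    # two predicted boxes plus the class scores
label_depth = 1 * 5 + C   # a single ground-truth box plus the class vector

print((S, S, pred_depth))   # shape of the model output
print((S, S, label_depth))  # shape of the label tensor
```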

379
00:31:26,720 --> 00:31:31,760
Now, the way we obtain the loss is we break it up into several parts.

380
00:31:31,910 --> 00:31:33,750
We'll look at this part first.

381
00:31:33,770 --> 00:31:34,920
We'll start with this part.

382
00:31:34,940 --> 00:31:37,880
Now, this part is just basically adding the parts up.

383
00:31:37,880 --> 00:31:38,820
The first part.

384
00:31:38,840 --> 00:31:39,740
The next part.

385
00:31:39,740 --> 00:31:40,310
This part.

386
00:31:40,310 --> 00:31:40,940
This part.

387
00:31:40,940 --> 00:31:42,050
And this part.

388
00:31:42,050 --> 00:31:43,250
For this first part.

389
00:31:43,250 --> 00:31:43,940
Here's this.

390
00:31:44,390 --> 00:31:51,410
It punishes the model when it makes errors with respect to whether there is an object in a particular

391
00:31:51,410 --> 00:31:52,470
grid cell or not.

392
00:31:52,490 --> 00:32:01,220
So if we have for this grid cell one as a label, we expect the model to predict a one here, or a

393
00:32:01,220 --> 00:32:02,660
one right here.

394
00:32:02,900 --> 00:32:08,030
So this means that what will happen here is we are going to go through, we see the sum from i equals

395
00:32:08,030 --> 00:32:09,950
zero to S squared here.

396
00:32:09,950 --> 00:32:11,570
In our case, S is seven.

397
00:32:11,570 --> 00:32:14,790
So S squared is seven squared, which is 49.

398
00:32:14,810 --> 00:32:18,800
So we go through each and every grid cell here, which is logical.

399
00:32:18,800 --> 00:32:20,480
We go to each and every one of these.

400
00:32:20,480 --> 00:32:28,520
So we go to each and every one of those 49 different grid cells, and then we'll calculate the difference.

401
00:32:28,560 --> 00:32:29,180
See this?

402
00:32:29,660 --> 00:32:35,150
We have C i and C i hat, that's C minus C hat, squared.

403
00:32:35,180 --> 00:32:42,470
So we have this here minus this, squared, and then we add all those up.

404
00:32:42,740 --> 00:32:48,050
Now, also notice that there's a double sum here. The reason why we have this double sum is because

405
00:32:48,050 --> 00:32:56,240
we are actually going to take this minus this, plus this minus this. See that? Take this minus this,

406
00:32:56,240 --> 00:32:57,620
plus this, minus this.

407
00:32:57,620 --> 00:33:01,730
If we had, say, three boxes, then we'd have three of these.

408
00:33:01,730 --> 00:33:05,540
We'd have one, two, and we'd add another one before the classes.

409
00:33:05,540 --> 00:33:07,990
So in that case we will go three times.

410
00:33:08,000 --> 00:33:10,010
So hopefully that's clear.

411
00:33:10,130 --> 00:33:22,970
But one thing you notice is this notation here: 1 obj i, where obj is actually object, so 1 of

412
00:33:22,970 --> 00:33:23,900
object.

413
00:33:24,440 --> 00:33:26,330
So you notice this notation right here.

414
00:33:26,360 --> 00:33:32,500
Now this notation, let's get back and just circle that out here so it's clear.

415
00:33:32,510 --> 00:33:34,310
So this is, this is it right here.

416
00:33:34,310 --> 00:33:35,370
You notice this.

417
00:33:35,390 --> 00:33:46,070
And what it says here in the paper is that this 1 obj i notation denotes if an object appears in cell i, and

418
00:33:46,070 --> 00:33:54,770
this 1 obj ij, because this is actually ij, not i, denotes that the jth bounding box,

419
00:33:55,070 --> 00:34:01,070
here we have two bounding boxes, so either this bounding box or this bounding box, predictor

420
00:34:01,070 --> 00:34:11,270
in cell i. So i is the cell; you clearly see that i goes to 49 and then j goes to two.

421
00:34:12,830 --> 00:34:18,880
So this should be i equals 1 to 49, and then this should be j equals 1 to 2.

422
00:34:19,520 --> 00:34:22,460
This means that there's a slight error in the notation right here.

423
00:34:22,760 --> 00:34:28,670
Anyways, we understand that we're going from 1 to 49 and then we're going from 1 to 2 because we basically

424
00:34:28,670 --> 00:34:33,410
go into each cell here and we also go into each and every one of these boxes.

425
00:34:33,560 --> 00:34:42,710
Now, getting back to our 1 obj ij notation, we've already seen that this

426
00:34:42,710 --> 00:34:45,920
denotes that the jth bounding box predictor in cell

427
00:34:45,920 --> 00:34:51,530
i, a given cell, is responsible for that prediction.

428
00:34:51,800 --> 00:34:54,090
You see that? Now, what does this mean?

429
00:34:54,110 --> 00:35:02,570
It means that if a particular box, like here, if this box is not responsible for the

430
00:35:02,570 --> 00:35:09,290
prediction, then we are not going to include it when computing this error.

431
00:35:09,770 --> 00:35:16,520
Now, how do we know whether this box or this box is responsible for the prediction?

432
00:35:16,550 --> 00:35:18,800
The way we get this is simple.

433
00:35:19,430 --> 00:35:24,500
Let's pull this to the right or let's reduce that so we could get more space.

434
00:35:24,500 --> 00:35:35,540
So what happens is let's suppose we have this image and then we have one object here and we have another

435
00:35:35,540 --> 00:35:36,440
object here.

436
00:35:36,650 --> 00:35:39,140
Then we have some bounding boxes.

437
00:35:39,440 --> 00:35:44,840
So we have this bounding box and we have this other bounding box.

438
00:35:45,590 --> 00:35:51,110
Now, for this one here, we have a particular cell.

439
00:35:51,110 --> 00:35:57,560
Let's suppose we've broken this up and then we have a given grid cell like this one which is responsible

440
00:35:57,560 --> 00:35:58,880
for predicting this object.

441
00:35:58,910 --> 00:36:06,170
Now, the first box you see, this first box will predict maybe this bounding box, and then the next

442
00:36:06,170 --> 00:36:09,680
box will predict maybe this bounding box.

443
00:36:09,950 --> 00:36:12,740
Now, when we say this one will predict this bounding

444
00:36:12,960 --> 00:36:21,210
box and the other will predict that one, this is actually because they have different values for x, y, w,

445
00:36:21,210 --> 00:36:22,200
h, see?

446
00:36:22,200 --> 00:36:25,950
This quadruple here is different from this other quadruple.

447
00:36:25,950 --> 00:36:30,600
And because they are different, it means that obviously the bounding box that you will get will be

448
00:36:30,600 --> 00:36:31,230
different.

449
00:36:31,230 --> 00:36:35,580
And because these two bounding boxes you get will be different,

450
00:36:35,610 --> 00:36:46,440
it means that you can now compare which of these two is closest to the actual bounding box, which is

451
00:36:46,440 --> 00:36:46,650
this.

452
00:36:46,650 --> 00:36:49,110
So this is the actual and this is what the model predicts.

453
00:36:49,110 --> 00:36:54,240
So we're comparing these two; they are competing for which of them is closest to the actual.

454
00:36:54,240 --> 00:37:00,870
So let's suppose that the actual is something like this. Suppose that the actual is something like this

455
00:37:02,370 --> 00:37:03,180
here.

456
00:37:03,180 --> 00:37:04,800
So we have something like this.

457
00:37:05,040 --> 00:37:05,610
Okay.

458
00:37:05,610 --> 00:37:10,020
So in that case it's clear that this one let's look for a neutral color.

459
00:37:10,020 --> 00:37:20,640
Now, it's clear that this because we remember we are having this one, this box competing with this

460
00:37:20,640 --> 00:37:24,360
black box, competing with this black box.

461
00:37:24,360 --> 00:37:27,760
So this is B1 and B2 competing.

462
00:37:28,230 --> 00:37:31,770
But the blue box is the actual one.

463
00:37:31,800 --> 00:37:41,460
B, we just call that B. So we're going to compare the difference between B1 and B and the difference

464
00:37:41,460 --> 00:37:48,390
between B2 and B. Now, the one which resembles B the most, that's the one which has the least,

465
00:37:48,390 --> 00:37:57,960
the smallest difference, will be the one responsible, as they say here, for that prediction.

466
00:37:57,960 --> 00:37:58,800
You see that?

467
00:37:58,800 --> 00:38:03,300
So in our case here, it's clear that B1 is responsible.

468
00:38:03,300 --> 00:38:07,830
That's for this particular case, because for a different grid cell, we may have B2 responsible; for

469
00:38:07,830 --> 00:38:10,620
whatever grid cell, you may have again B2 or B1.

470
00:38:10,620 --> 00:38:12,690
It just depends on what?

471
00:38:12,690 --> 00:38:23,850
On the difference between the bounding box predicted by that specific box and the actual bounding box.

472
00:38:23,970 --> 00:38:29,480
Now, that said, another question you may ask yourself is how do we compare these bounding boxes?

473
00:38:29,490 --> 00:38:33,990
Now, the way we compare these bounding boxes is by using the IoU score.

474
00:38:33,990 --> 00:38:43,350
So if we have two bounding boxes like this, we have these two bounding boxes and then we have let's

475
00:38:43,350 --> 00:38:45,750
see this one.

476
00:38:46,860 --> 00:38:48,480
So we have this other box here.

477
00:38:49,650 --> 00:38:53,400
And then we also have this one, something like this.

478
00:38:56,160 --> 00:39:04,780
If we are to compare how close this pair of boxes is compared to how close

479
00:39:04,780 --> 00:39:05,730
this other pair is,

480
00:39:05,730 --> 00:39:14,160
you see clearly that this pair is closer, or simply put, they're closer to each other as compared

481
00:39:14,160 --> 00:39:16,860
to this other pair right here.

482
00:39:17,340 --> 00:39:23,730
Now, the way we look at this is we compute the area shared by the two boxes.

483
00:39:24,150 --> 00:39:24,920
So that's it.

484
00:39:24,930 --> 00:39:28,080
You look for this area of the intersection.

485
00:39:28,080 --> 00:39:29,940
So this is the intersection.

486
00:39:30,510 --> 00:39:35,250
And then, so here we have what we call the IoU.

487
00:39:35,250 --> 00:39:37,130
Actually, let's just put it right here.

488
00:39:37,140 --> 00:39:44,400
IoU actually stands for intersection over union, which is equal to

489
00:39:44,400 --> 00:39:53,280
the intersection, we'll call it intersection, divided by the union.

490
00:39:54,810 --> 00:40:01,380
So if you take, for example, this intersection here and divide by the area, this area plus this area,

491
00:40:01,950 --> 00:40:05,460
with this intersection counted only once, that's the union.

492
00:40:05,670 --> 00:40:08,130
Basically, this is the intersection.

493
00:40:08,130 --> 00:40:09,660
And then let's change this.

494
00:40:09,900 --> 00:40:14,100
So you can see it more clearly, and this is the union.

495
00:40:14,580 --> 00:40:15,390
See this?

496
00:40:15,690 --> 00:40:17,040
This is our union.

497
00:40:17,790 --> 00:40:18,590
That's it.

498
00:40:18,600 --> 00:40:19,770
So that's our union.

499
00:40:19,770 --> 00:40:20,820
And this is our intersection.

500
00:40:20,850 --> 00:40:25,970
So we take this area divided by all this area, and we get the IoU score.

501
00:40:25,980 --> 00:40:30,660
We're going to repeat the same process for this one where this is our union.

502
00:40:32,280 --> 00:40:37,830
And you can see clearly that this one would have a higher IoU compared to this one.

503
00:40:37,830 --> 00:40:46,980
And so this is how we compute or we know which of the boxes is responsible for that prediction.
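(Editor's sketch, not part of the talk.) The IoU comparison and the choice of the responsible box can be sketched as follows. The corner format `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `responsible_index` are assumptions made for illustration; YOLO itself stores boxes as centre, width, and height.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlapping rectangle; width/height clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Subtract the intersection so it is counted only once in the union.
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def responsible_index(predicted_boxes, true_box):
    """Index of the predicted box with the highest IoU against the true box."""
    scores = [iou(box, true_box) for box in predicted_boxes]
    return max(range(len(scores)), key=scores.__getitem__)


b1, b2 = (0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)
actual = (0.0, 0.0, 2.0, 1.5)
print(responsible_index([b1, b2], actual))
```

Here the first box overlaps the actual box while the second does not, so the first is the one treated as responsible.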

504
00:40:47,220 --> 00:40:51,540
Now, getting back here, let's get back to our loss function.

505
00:40:51,550 --> 00:40:52,680
As we're saying.

506
00:40:52,680 --> 00:41:00,300
We're going to have that if this box, for example, if it happens that this box B1 here, this

507
00:41:00,300 --> 00:41:08,580
box here is responsible for the prediction, then we would have this difference times one. See, times

508
00:41:08,580 --> 00:41:09,190
one.

509
00:41:09,210 --> 00:41:13,590
Now, if this box is not responsible, then we have times zero.

510
00:41:13,590 --> 00:41:18,380
So this is not going to be considered when we compute this loss.

511
00:41:18,390 --> 00:41:19,080
See that?

512
00:41:20,130 --> 00:41:20,760
That's it.

513
00:41:21,000 --> 00:41:22,080
So it's true.

514
00:41:22,080 --> 00:41:29,160
We sum over the two boxes, but actually we're going to take only or consider only one when calculating

515
00:41:29,160 --> 00:41:29,910
this difference.

516
00:41:29,910 --> 00:41:32,220
For this other one, we're going to omit it.

517
00:41:32,430 --> 00:41:34,560
Now we move on to the next.

518
00:41:34,560 --> 00:41:44,820
This other one here permits us to correctly predict when

519
00:41:44,820 --> 00:41:45,830
there is no object.

520
00:41:45,840 --> 00:41:47,400
Notice here, this is no object.

521
00:41:47,400 --> 00:41:48,540
Here is object.

522
00:41:48,660 --> 00:41:56,880
So what we have in here is for the cells where we have an object like you see, where the label we have

523
00:41:56,880 --> 00:42:04,320
this cell and this cell, we're going to use this one here; where there is no object, we're going to use instead

524
00:42:04,320 --> 00:42:05,430
this one here.

525
00:42:06,210 --> 00:42:14,550
And here, basically, when there is no object, we're just going to take the output or the value

526
00:42:14,550 --> 00:42:23,250
we have here, minus the value we have here then plus the value we have here, minus the value we have

527
00:42:23,280 --> 00:42:23,880
here.

528
00:42:24,090 --> 00:42:30,450
Now, also note that we have this lambda noobj. Now, in the paper, they talk a little bit

529
00:42:30,450 --> 00:42:32,460
more about this here.

530
00:42:32,460 --> 00:42:35,610
Here we have "to remedy this".

531
00:42:35,790 --> 00:42:37,740
First of all, let's understand this.

532
00:42:38,370 --> 00:42:38,780
Yeah.

533
00:42:38,800 --> 00:42:42,780
They say they use the sum-squared error because it is easy to optimize.

534
00:42:42,780 --> 00:42:47,850
However, it does not perfectly align with our goal of maximizing the average precision.

535
00:42:48,720 --> 00:42:55,260
It weights localization error equally with classification error, which may not be ideal.

536
00:42:55,650 --> 00:43:02,520
Also, in every image, many grid cells do not contain any object, so this pushes the confidence scores

537
00:43:02,520 --> 00:43:10,560
of those cells towards zero, often overpowering the gradients from cells that do contain objects.

538
00:43:10,920 --> 00:43:15,900
Now this can lead to model instability, causing training to diverge early on.

539
00:43:16,380 --> 00:43:22,860
So to remedy this, they increase the loss from the bounding box coordinate predictions and decrease

540
00:43:22,860 --> 00:43:24,600
the loss from the confidence predictions

541
00:43:24,600 --> 00:43:29,820
for boxes that don't contain objects. We use a parameter,

542
00:43:29,880 --> 00:43:31,310
or rather we use two parameters,

543
00:43:31,320 --> 00:43:38,340
lambda coord, this is for the positioning, and lambda noobj, for when we have no objects, to

544
00:43:38,340 --> 00:43:39,360
accomplish this.

545
00:43:39,360 --> 00:43:45,270
So we set lambda coord to five and lambda noobj to 0.5.

546
00:43:45,600 --> 00:43:54,060
So as we were saying, this lambda noobj here is 0.5 and lambda coord is five, as they have given

547
00:43:54,060 --> 00:43:54,420
us

548
00:43:54,420 --> 00:43:54,660
right

549
00:43:54,820 --> 00:43:55,310
here.

550
00:43:55,450 --> 00:44:03,100
Now, what we can deduce from this, from looking at these formulas here is that the model will be punished

551
00:44:03,100 --> 00:44:11,920
more severely if a particular grid cell was meant to predict an object

552
00:44:11,920 --> 00:44:20,710
and it didn't predict that, as compared to when it doesn't have an object and it didn't predict that

553
00:44:20,710 --> 00:44:21,500
correctly.

554
00:44:21,580 --> 00:44:28,630
So for the object, we have more punishment as compared to this one, because this lambda noobj

555
00:44:28,630 --> 00:44:29,830
is 0.5.

556
00:44:30,010 --> 00:44:37,840
Now, for the coordinates, it receives the highest punishment here because we have lambda coord

557
00:44:37,840 --> 00:44:41,140
equal to five. For the classes, it's still equal to one.

558
00:44:41,140 --> 00:44:45,670
So here we have 1, 0.5, 1, and 5, 5.

559
00:44:46,090 --> 00:44:50,680
Now, getting back here, we have this one for the classes.

560
00:44:51,310 --> 00:44:54,040
Basically, what we have here is we have this condition.

561
00:44:54,190 --> 00:44:57,420
So this is 1 obj i, or object of i.

562
00:44:58,050 --> 00:45:00,810
This one was 1 obj ij.

563
00:45:01,210 --> 00:45:03,010
So notice that this is now just i.

564
00:45:03,050 --> 00:45:11,770
Now, basically what we have here is we calculate this difference only when that grid cell has an object.

565
00:45:11,890 --> 00:45:19,300
So if, like in the label here, if we have a one, like for these two grid cells, then we'll go ahead

566
00:45:19,300 --> 00:45:20,680
and compute this difference.

567
00:45:21,130 --> 00:45:28,480
But in the case where we have no object, like in this cell and this cell and this cell and this cell, or

568
00:45:28,480 --> 00:45:34,180
all of the cells except these two, then we wouldn't get into this.

569
00:45:34,180 --> 00:45:36,220
So we'll just skip that. Note,

570
00:45:36,220 --> 00:45:37,360
also, we have this.

571
00:45:38,710 --> 00:45:41,590
Oh, sorry, we actually wouldn't get into this.

572
00:45:41,590 --> 00:45:43,630
So here we have just one.

573
00:45:44,110 --> 00:45:50,050
Now, what we do here is similar to what we're going to do for the coordinates.

574
00:45:50,050 --> 00:45:59,500
With the coordinates, we have the same process where if there is no object, we wouldn't go ahead to

575
00:45:59,500 --> 00:46:00,930
compute this difference.

576
00:46:00,940 --> 00:46:03,980
So like here, if this is zero, then we'll skip this.

577
00:46:04,000 --> 00:46:08,110
Now, if it is one, then we'll go ahead and compute the difference between this and this.

578
00:46:08,110 --> 00:46:09,160
And that's what we have here.

579
00:46:09,190 --> 00:46:16,780
Now, this is x minus x bar, squared, plus y minus y bar, squared; basically this x minus

580
00:46:16,780 --> 00:46:19,910
this, squared, plus this minus this, squared.

581
00:46:19,930 --> 00:46:26,320
And also notice that these two are squared and then added up.

582
00:46:27,370 --> 00:46:35,680
And then here we have the square root of the width minus the square root of the other width, or the predicted

583
00:46:35,680 --> 00:46:36,190
width.

584
00:46:36,220 --> 00:46:41,070
All of that squared plus the square root of the height minus the square root of the predicted height,

585
00:46:41,080 --> 00:46:42,040
all of that squared.

586
00:46:42,190 --> 00:46:47,170
And then they add this up and multiply by lambda coord.

587
00:46:47,950 --> 00:46:55,180
But it should be noted that we are only going to compute this if we happen to fall in a box which is

588
00:46:55,180 --> 00:46:58,350
responsible for that prediction.

589
00:46:58,360 --> 00:47:03,910
So if this box is responsible for the prediction, it means that we're not

590
00:47:03,910 --> 00:47:07,720
going to compute this for this box right here.

591
00:47:07,750 --> 00:47:09,490
We're not going to compare this and this.

592
00:47:09,490 --> 00:47:13,240
We're going to do this and this because this box is responsible.

593
00:47:13,240 --> 00:47:18,310
And we've seen already what it means by a box being responsible for the prediction.
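(Editor's sketch, not part of the talk.) Putting the pieces together, the loss for a single grid cell could look roughly like the Python below. Everything here is a simplified assumption: the function name `cell_loss`, the tuple and dict layouts, the use of 1 as the target confidence (as in the video's labels), and the choice to apply the lambda noobj term to every box that is not responsible. Widths and heights are assumed non-negative so the square roots are defined.

```python
import math

LAMBDA_COORD = 5.0   # weight on the coordinate terms
LAMBDA_NOOBJ = 0.5   # weight on confidence for boxes with no object


def cell_loss(pred_boxes, pred_classes, label=None, responsible=None):
    """Sum-squared loss for one grid cell.

    pred_boxes:   list of (x, y, w, h, confidence), one per box predictor
    pred_classes: list of predicted class scores for the cell
    label:        None if the cell has no object, else a dict with
                  "box": (x, y, w, h) and "classes": one-hot list
    responsible:  index of the predictor responsible for the object
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if label is not None and j == responsible:
            tx, ty, tw, th = label["box"]
            # Coordinate terms: centre, then square-rooted width and height.
            loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
            loss += LAMBDA_COORD * (
                (math.sqrt(tw) - math.sqrt(w)) ** 2
                + (math.sqrt(th) - math.sqrt(h)) ** 2
            )
            # Confidence term for the responsible box (target of 1).
            loss += (1.0 - conf) ** 2
        else:
            # Down-weighted confidence term (target of 0).
            loss += LAMBDA_NOOBJ * (0.0 - conf) ** 2
    if label is not None:
        # Class term is only computed when the cell contains an object.
        loss += sum((t - p) ** 2 for t, p in zip(label["classes"], pred_classes))
    return loss


empty_cell = cell_loss([(0.5, 0.5, 0.2, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3, 0.4)], [0.1, 0.9])
print(empty_cell)
```

For an empty cell only the down-weighted confidence terms contribute, which is what keeps the many empty cells from dominating the gradient.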

594
00:47:19,690 --> 00:47:21,470
Getting back to the paper.

595
00:47:21,490 --> 00:47:23,140
Let's get back up here.

596
00:47:23,260 --> 00:47:30,610
Now, what we have also is that the sum-squared error equally weights errors in large boxes and small

597
00:47:30,610 --> 00:47:31,530
boxes.

598
00:47:31,540 --> 00:47:40,570
So the error metric should reflect that small deviations in large boxes matter less than in small boxes.

599
00:47:40,600 --> 00:47:45,910
To partially address this, we predict the square root of the bounding box width and height instead

600
00:47:45,910 --> 00:47:47,880
of the width and height directly.

601
00:47:47,890 --> 00:47:58,240
So what they're saying is, if we have this couple here, these two boxes, and then we have this deviation

602
00:47:58,990 --> 00:48:03,100
or we have the difference, which we're trying to compute for the loss.

603
00:48:03,100 --> 00:48:07,990
And then we have also these smaller boxes here with this similar difference.

604
00:48:07,990 --> 00:48:11,050
So let's try to have something similar to that.

605
00:48:11,320 --> 00:48:19,840
We have something like this, hopefully similar enough. So with the initial method

606
00:48:19,840 --> 00:48:34,420
we will have w minus w bar, or hat, or tilde, squared, plus the height minus the height bar, squared, where

607
00:48:34,420 --> 00:48:40,570
w is what the label expects and w bar is what the model predicts.

608
00:48:40,570 --> 00:48:43,360
So the same, the same for the H and the H bar.

609
00:48:43,390 --> 00:48:50,140
Now what they want is that if we have this difference here or better.

610
00:48:50,140 --> 00:48:54,580
So let's say that what is going on with this here is that.

611
00:48:56,590 --> 00:49:02,670
If we have here these two boxes, with this difference, because it's an equal difference,

612
00:49:02,680 --> 00:49:04,560
you see, the difference is the same.

613
00:49:04,570 --> 00:49:09,490
If we have this here, then let's say the difference is five.

614
00:49:09,910 --> 00:49:15,580
Let's suppose that the difference is five here, that's the width difference, and the height

615
00:49:15,580 --> 00:49:16,720
difference is five.

616
00:49:16,720 --> 00:49:20,530
Then we'll have five squared plus five squared, which is 50.

617
00:49:21,460 --> 00:49:26,950
Now for this small box here, this is five and then here is five.

618
00:49:27,430 --> 00:49:29,320
Then we also have 50.

619
00:49:30,250 --> 00:49:32,110
But this is not what we want.

620
00:49:32,350 --> 00:49:41,950
The reason why we don't want this is because this kind of difference for smaller boxes is more important

621
00:49:41,950 --> 00:49:45,520
than this difference for these bigger boxes.

622
00:49:45,520 --> 00:49:54,160
So it's just like supposing that you have, say, a loaf of bread like this.

623
00:49:54,160 --> 00:50:02,620
Suppose we have a loaf of bread, and then you have this part here which you cut off.

624
00:50:02,830 --> 00:50:10,600
Now compared to a case where you have the smaller loaf of bread, you will find that cutting off this

625
00:50:10,600 --> 00:50:15,090
part here is like cutting off practically a third of my loaf.

626
00:50:15,100 --> 00:50:21,910
Whereas for this case this was like cutting off, say, 1/10 of my loaf.

627
00:50:21,910 --> 00:50:27,430
So clearly, here the loss is less felt as compared to this other one.

628
00:50:27,550 --> 00:50:32,800
And so, as they say in the paper, to remedy the situation, they add the square root here.

629
00:50:32,950 --> 00:50:36,550
Now let's add the square root and see this difference.

630
00:50:36,550 --> 00:50:46,090
Now, for the square root, let's say this was 30 and this was 25, so we had five, and then there

631
00:50:46,090 --> 00:50:47,340
was 30 and 25.

632
00:50:47,350 --> 00:50:56,770
The square root of 30 now... let's say this is 30, width 30 and height 30, for this one.

633
00:50:56,770 --> 00:51:02,230
So this one has width and height 30, and the other one has width and height 25.

634
00:51:02,800 --> 00:51:07,490
So this gave us the difference, which was five, and that's how we had this difference here.

635
00:51:07,510 --> 00:51:16,420
Now, when we take the square root of 30, let's take the square root of 30, the square root

636
00:51:17,320 --> 00:51:21,760
of 30, that's 5.47.

637
00:51:22,390 --> 00:51:23,740
So 5.47.

638
00:51:23,740 --> 00:51:35,740
Now, minus the square root of 25, which is 5, will be 0.47. See, 0.47, which when you square, let's compute that directly,

639
00:51:35,740 --> 00:51:49,810
which when you square it will give you approximately 0.22. See? So you have 0.22 plus 0.22.

640
00:51:50,890 --> 00:51:51,820
Now that will give us

641
00:51:51,820 --> 00:51:59,020
0.44

642
00:51:59,200 --> 00:52:01,260
for these two big boxes.

643
00:52:01,270 --> 00:52:07,630
Now, for the smaller boxes, let's suppose that this one was, say, because we want to have a difference

644
00:52:07,630 --> 00:52:11,710
of five, we could say ten and five.

645
00:52:11,950 --> 00:52:15,100
So here we had ten by ten and here we have five by five.

646
00:52:16,330 --> 00:52:21,910
In that case we will have square root of ten minus square root of five.

647
00:52:23,380 --> 00:52:30,310
And now when you compute that and square it, you will have about 0.86.

648
00:52:31,120 --> 00:52:35,530
So what I mean here is when you take the square root of ten, minus the square root of five, and you

649
00:52:35,530 --> 00:52:40,810
square it, it gives you about 0.86, and 0.86 plus 0.86

650
00:52:40,810 --> 00:52:43,320
will give you about 1.7.

651
00:52:43,330 --> 00:52:44,410
Let's say 1.7.

652
00:52:44,410 --> 00:52:49,150
Anyway, that's already much greater than 0.44.

653
00:52:49,180 --> 00:52:57,040
It means that the model is now penalised more for making this error than for making this error,

654
00:52:57,580 --> 00:53:05,590
so that now solves the problem where the model would have been penalised

655
00:53:05,590 --> 00:53:14,560
in the same way for the same difference, even when the size difference between these two

656
00:53:14,560 --> 00:53:17,170
boxes is quite considerable.
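
The arithmetic above can be checked with a short sketch. It is a simplified illustration, not the full YOLO loss: it looks at a single predicted box against a single ground-truth box, leaves out the coordinate weighting and the sum over grid cells, and uses the raw sizes from the talk rather than normalised ones. The function name `size_loss` and the `use_sqrt` flag are made up for this example. Because the talk rounds along the way, its figures (0.44 and about 1.6) come out slightly lower than the exact values, which are roughly 0.46 and 1.72.

```python
import math


def size_loss(w_pred, h_pred, w_true, h_true, use_sqrt=True):
    """Squared-error size term for one box (hypothetical helper).

    With use_sqrt=True the error is taken on sqrt(width) and sqrt(height),
    as described in the talk; with use_sqrt=False it is taken on the raw sizes.
    """
    f = math.sqrt if use_sqrt else (lambda v: v)
    return (f(w_pred) - f(w_true)) ** 2 + (f(h_pred) - f(h_true)) ** 2


# Same absolute error of 5 in width and height, on a big and a small box.
big_sqrt = size_loss(30, 30, 25, 25)
small_sqrt = size_loss(10, 10, 5, 5)
big_raw = size_loss(30, 30, 25, 25, use_sqrt=False)
small_raw = size_loss(10, 10, 5, 5, use_sqrt=False)

print(f"raw sizes:    big={big_raw:.2f}  small={small_raw:.2f}")
print(f"square roots: big={big_sqrt:.2f}  small={small_sqrt:.2f}")
```

Without the square root both boxes get the same penalty; with it, the small box is penalised several times more for the same absolute error.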

657
00:53:17,380 --> 00:53:24,310
Now, that said, from here: they train for about 135 epochs on the training and validation data sets from Pascal

658
00:53:24,310 --> 00:53:28,570
VOC 2007 and 2012.

659
00:53:29,260 --> 00:53:31,180
When testing on 2012,

660
00:53:31,180 --> 00:53:36,470
we also include the VOC 2007 test data for training.

661
00:53:36,490 --> 00:53:41,240
Throughout training we use a batch size of 64, a momentum of 0.9 and a decay,

662
00:53:41,410 --> 00:53:44,460
that's a weight decay, of 0.0005.

663
00:53:44,540 --> 00:53:51,820
The learning rate schedule is as follows: for the first epochs, we slowly raise the learning rate from ten to the negative three

664
00:53:51,820 --> 00:53:53,200
to ten to the negative two.

665
00:53:53,200 --> 00:53:54,520
So they start with a

666
00:53:54,740 --> 00:53:57,860
relatively low learning rate, slowly increasing it.

667
00:53:58,340 --> 00:54:03,830
Because if we start at a high learning rate, the model often diverges due to unstable gradients.

668
00:54:03,830 --> 00:54:06,650
So we continue training with ten to the negative two.

669
00:54:06,680 --> 00:54:10,580
That's after this slow increase to ten to the negative two.

670
00:54:10,580 --> 00:54:16,640
And then you continue training with this ten to the negative two for 75 epochs, then ten to the negative three

671
00:54:16,640 --> 00:54:17,600
for 30 epochs.

672
00:54:17,600 --> 00:54:23,180
So after these 75 epochs they drop to ten to the negative three, and then finally drop again to ten to the negative

673
00:54:23,180 --> 00:54:25,130
four for 30 epochs.

674
00:54:25,130 --> 00:54:28,580
That makes now 75 plus 30.

675
00:54:28,580 --> 00:54:35,540
That's 105, plus 30, 135 epochs. Okay.
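
The schedule just described can be written as a small function. This is only a sketch under stated assumptions: the talk does not say how many epochs the warm-up lasts or what shape it has, so `warmup_epochs=5` and the linear ramp are assumptions, and the warm-up is counted inside the first 75 epochs. The function name `learning_rate` is also made up for this example.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-based epoch index (hypothetical helper).

    Assumptions: linear warm-up over `warmup_epochs` epochs, counted as
    part of the first 75 epochs. The plateau values follow the talk.
    """
    if epoch < warmup_epochs:
        # Slowly raise the rate from 1e-3 towards 1e-2.
        frac = epoch / warmup_epochs
        return 1e-3 + frac * (1e-2 - 1e-3)
    if epoch < 75:
        return 1e-2   # main phase, up to epoch 75
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs, up to 135


for e in (0, 2, 10, 74, 75, 104, 105, 134):
    print(e, learning_rate(e))
```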

676
00:54:35,540 --> 00:54:42,410
So, to avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate 0.5

677
00:54:42,410 --> 00:54:46,790
after the first connected layer prevents co-adaptation between the layers.

678
00:54:46,790 --> 00:54:52,850
For this augmentation, they introduce random scaling and translations of up to 20% of the original

679
00:54:52,850 --> 00:54:53,690
image size.

680
00:54:53,690 --> 00:55:01,340
We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV

681
00:55:01,340 --> 00:55:02,360
color space.
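
As a rough illustration of the augmentation just described, the sketch below draws one random set of parameters and applies the colour part to a single pixel, using only the Python standard library. The function names and the dictionary keys are made up for this example, and the choice to sample each colour factor as either f or 1/f (with f between 1 and 1.5) is an assumption; the talk only says "up to a factor of 1.5". Shifting or scaling the image also means the bounding-box labels have to be transformed the same way, which is not shown here.

```python
import colorsys
import random


def sample_augmentation(rng, max_shift=0.2, max_factor=1.5):
    """Draw one random set of augmentation parameters (hypothetical helper)."""

    def factor():
        # Assumption: either strengthen (f) or weaken (1/f), f in [1, max_factor].
        f = rng.uniform(1.0, max_factor)
        return f if rng.random() < 0.5 else 1.0 / f

    return {
        "scale": rng.uniform(1.0 - max_shift, 1.0 + max_shift),
        "dx": rng.uniform(-max_shift, max_shift),  # shift as a fraction of width
        "dy": rng.uniform(-max_shift, max_shift),  # shift as a fraction of height
        "saturation": factor(),
        "exposure": factor(),
    }


def adjust_pixel(rgb, saturation, exposure):
    """Scale S and V of one RGB pixel (floats in [0, 1]) in HSV space."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, s * saturation)
    v = min(1.0, v * exposure)
    return colorsys.hsv_to_rgb(h, s, v)


rng = random.Random(0)  # fixed seed so the draw is repeatable
params = sample_augmentation(rng)
print(params)
print(adjust_pixel((0.2, 0.4, 0.6), params["saturation"], params["exposure"]))
```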

682
00:55:03,140 --> 00:55:08,120
Now, after the detections have been made, let's get back to the top.

683
00:55:08,930 --> 00:55:17,120
That's, let's say, after the model has been trained, we will get detections like this.

684
00:55:17,480 --> 00:55:20,360
Let's enlarge this. We get detections like this.

685
00:55:20,360 --> 00:55:23,330
So we might have many more detections than expected.

686
00:55:23,330 --> 00:55:31,700
So we're going to apply the non-max suppression algorithm to remove those cells, or rather to remove

687
00:55:31,700 --> 00:55:38,480
those bounding boxes which are repeated around a certain region and focus only on bounding boxes which

688
00:55:38,480 --> 00:55:41,240
have the highest probability scores.

689
00:55:41,240 --> 00:55:46,910
So like this one, you see, the thickness here signifies the probability score, the probability of

690
00:55:46,910 --> 00:55:49,110
an object being at that location.

691
00:55:49,130 --> 00:55:49,700
See that?

692
00:55:49,880 --> 00:55:53,810
So that's why we are left with this after the non-max suppression.

693
00:55:54,440 --> 00:56:00,770
Now, the way this non-max suppression algorithm works is as follows: after

694
00:56:00,770 --> 00:56:05,190
the model has been trained, we pass this input image and we may get predictions like this.

695
00:56:05,210 --> 00:56:11,810
Now let's suppose that these two white ones have the highest probabilities.

696
00:56:11,810 --> 00:56:14,450
And then we also have some other predictions.

697
00:56:15,800 --> 00:56:18,380
We have maybe say this other prediction here.

698
00:56:19,610 --> 00:56:23,180
Then we have, say, another prediction around here.

699
00:56:24,350 --> 00:56:25,430
Something like this.

700
00:56:26,600 --> 00:56:31,100
Now, what we're going to do is we are going to consider that for a particular object.

701
00:56:31,100 --> 00:56:37,490
Let's say this bounding box right here; for a particular bounding box, we look at its probability

702
00:56:37,490 --> 00:56:42,170
and compare it with that of the bounding boxes around it.

703
00:56:42,170 --> 00:56:46,970
And obviously, to know which bounding boxes surround or are very close to this bounding box, we will look

704
00:56:46,970 --> 00:56:48,250
at the IOU.

705
00:56:48,260 --> 00:56:57,530
So if we fix the IOU threshold at 0.5, it means that if we're taking this box, for example,

706
00:56:57,530 --> 00:57:06,250
considering this box, then any box with an IOU, that is, any box that is close enough to this one here such

707
00:57:06,260 --> 00:57:14,450
that its IOU is greater than 0.5, meaning that they are very close, then we are going to remove that

708
00:57:14,450 --> 00:57:15,230
bounding box.

709
00:57:15,230 --> 00:57:19,730
So it means that we're going to take this off because they are already very close.

710
00:57:20,990 --> 00:57:25,510
So you see that. Now, you could play around with this value, meaning that you could take,

711
00:57:25,520 --> 00:57:34,520
say, 0.2 or even 0.7, depending on the data set you're working with.

712
00:57:34,520 --> 00:57:40,120
So what we're saying is, because these two are very close to each other, we take that off.

713
00:57:40,130 --> 00:57:43,220
Now, obviously they must be placed on the same object.

714
00:57:43,400 --> 00:57:50,390
Now, if we have another box like this one, if we had another box like this one, see, suppose we

715
00:57:50,390 --> 00:57:58,340
had another box like this one, then this box would not be taken off because the IOU is less than 0.5.

716
00:57:58,340 --> 00:58:02,540
So when the IOU is greater than 0.5, we know that they are very close to each other.

717
00:58:02,540 --> 00:58:03,980
We compare their probabilities.

718
00:58:03,980 --> 00:58:08,420
The one with the highest probability is going to win, the other one is going to be taken off, hence

719
00:58:08,420 --> 00:58:11,990
the term non-max suppression.

720
00:58:11,990 --> 00:58:14,960
So if you're not a max, we're going to suppress you.

721
00:58:14,960 --> 00:58:16,580
So we suppress all of that.

722
00:58:16,580 --> 00:58:19,880
And you see this one is left. Now, for this one here,

723
00:58:19,880 --> 00:58:22,790
we're going to say, okay, this one has the highest probability.

724
00:58:22,850 --> 00:58:29,330
We're not going to compare this with this because obviously the IOU will be less than 0.5.

725
00:58:29,450 --> 00:58:35,060
Now, we're going to take this one and this one. We're going to compare these two; the IOU is going to be

726
00:58:35,060 --> 00:58:36,740
greater than 0.5.

727
00:58:36,740 --> 00:58:39,950
And so here we're going to remove this other one.

728
00:58:39,950 --> 00:58:43,250
So this one here will be taken off.

729
00:58:43,740 --> 00:58:45,260
You see, this one is still left.
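
The procedure walked through above can be sketched in a few lines of plain Python. This is a minimal greedy version under stated assumptions: boxes are given as corner coordinates (x1, y1, x2, y2), all boxes are treated as belonging to one class, and the function names `iou` and `non_max_suppression` as well as the example boxes and scores are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [
    (10, 10, 50, 50),      # strongest box on the first object
    (12, 12, 52, 52),      # near-duplicate of the first box
    (100, 100, 140, 140),  # a separate object
    (30, 30, 70, 70),      # overlaps the first box, but with IOU below 0.5
]
scores = [0.9, 0.6, 0.8, 0.7]

keep = non_max_suppression(boxes, scores)
print(keep)  # [0, 2, 3]
```

The near-duplicate box is suppressed because its IOU with the strongest box is above 0.5, while the box that only partly overlaps is kept. Changing `iou_threshold` to a lower value such as 0.1 would remove that partly overlapping box as well.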

730
00:58:45,410 --> 00:58:53,630
So that will be it; we'll be left now with these three predictions. Anyway, generally when training our

731
00:58:53,930 --> 00:59:04,820
model, we aim to even be able to avoid non-max suppression as a whole, and other variants have

732
00:59:04,820 --> 00:59:09,890
been developed to try to reduce that dependence on non-max suppression.

733
00:59:09,980 --> 00:59:17,060
Also, you'll have variants like YOLO9000, YOLOv2, YOLOv3, YOLOv4; you also

734
00:59:17,060 --> 00:59:23,390
have YOLOv5, YOLOX, YOLOR, which perform even better than this YOLOv1

735
00:59:23,390 --> 00:59:31,940
we discussed. And right here you have some tables which compare it to other methods.

736
00:59:31,970 --> 00:59:39,650
See, you could always review this. See here Fast R-CNN, and then you have YOLO.

737
00:59:39,770 --> 00:59:48,980
You see that it performs better than YOLO, but this one is faster than Fast

738
00:59:49,160 --> 00:59:49,610
R-CNN.

739
00:59:50,480 --> 00:59:54,890
We also have this comparison table for different objects.

740
00:59:54,890 --> 00:59:56,420
See, for different objects.

741
00:59:56,420 --> 01:00:03,020
We see the precision for these different objects and we compare it across these different methods.

742
01:00:03,020 --> 01:00:04,610
Here we have YOLO.

743
01:00:05,210 --> 01:00:15,410
Then you could go down here and see the recall. We've already had a tutorial on precision

744
01:00:15,410 --> 01:00:17,570
and recall, so you should be able to understand this.

745
01:00:17,570 --> 01:00:22,370
Or if you're new to that, you could check out our previous videos here.

746
01:00:22,700 --> 01:00:29,480
These are some quantitative results on the VOC 2007, Picasso and People-Art data sets. See that it performs

747
01:00:29,480 --> 01:00:33,380
best on this VOC 2007.

748
01:00:33,680 --> 01:00:38,720
It performs best on Picasso; it performs best on the People-Art data set.

749
01:00:40,250 --> 01:00:41,450
Let's get back here.

750
01:00:41,870 --> 01:00:42,650
Oh, okay.

751
01:00:42,650 --> 01:00:43,280
This is Fast

752
01:00:43,280 --> 01:00:47,660
R-CNN, while here we're comparing with a simple

753
01:00:47,660 --> 01:00:50,360
R-CNN. Okay, so that's it.

754
01:00:51,080 --> 01:00:57,380
We see that on Picasso and People-Art it outperforms other methods like R-CNN,

755
01:00:57,380 --> 01:01:05,210
meaning that it has better generalization capacity compared to other methods and techniques.

756
01:01:05,360 --> 01:01:09,860
Here we have some limitations of YOLO. YOLO imposes strong spatial constraints on bounding

757
01:01:09,860 --> 01:01:10,640
boxes,

758
01:01:10,640 --> 01:01:18,140
since each grid cell only predicts two boxes and can only have one class.

759
01:01:18,350 --> 01:01:21,980
So this means that, let's get back here,

760
01:01:21,980 --> 01:01:23,510
this means that if we have,

761
01:01:24,070 --> 01:01:31,900
if we had a person who was, say, standing right behind the person here, it

762
01:01:31,900 --> 01:01:36,130
would have been difficult to predict both this person and this other person.

763
01:01:36,130 --> 01:01:42,220
And also, in the case where we have images where the objects are quite small, so we have some very

764
01:01:42,220 --> 01:01:52,240
small objects packed together like this, this YOLO algorithm, or this model, will find it difficult

765
01:01:52,240 --> 01:01:55,870
to detect each and every one of them.
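
To see why close or crowded objects are a problem, the sketch below maps object centres to grid cells. It assumes a 7 by 7 grid on a 448 by 448 input, the values used in the YOLOv1 paper; the function name `responsible_cell` and the example centres are made up for illustration. Each object is assigned to the cell that contains its centre, so two objects whose centres land in the same cell compete for that cell's single class prediction.

```python
from collections import Counter


def responsible_cell(cx, cy, img_w=448, img_h=448, S=7):
    """Return (row, col) of the grid cell containing the box centre."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# Hypothetical object centres in pixels: two people close together, one dog apart.
centres = {
    "person_front": (200, 230),
    "person_behind": (215, 240),
    "dog": (300, 100),
}

cells = {name: responsible_cell(cx, cy) for name, (cx, cy) in centres.items()}
crowded = [cell for cell, n in Counter(cells.values()).items() if n > 1]

print(cells)
print("cells holding more than one object centre:", crowded)
```

Here both people fall into cell (3, 3), so a single cell would have to account for both of them, while the dog gets a cell of its own.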

766
01:01:57,280 --> 01:02:01,480
Getting back to the paper, our model struggles with small objects that appear in groups such as flocks

767
01:02:01,480 --> 01:02:02,940
of birds.

768
01:02:02,950 --> 01:02:07,570
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in

769
01:02:07,570 --> 01:02:11,380
new or unusual aspect ratios or configurations.

770
01:02:11,380 --> 01:02:17,730
So this has been trained on the Pascal VOC data set, where the objects have certain aspect ratios.

771
01:02:17,740 --> 01:02:18,700
Now, it may turn out that

772
01:02:18,700 --> 01:02:24,370
you have a different data set where the aspect ratios are different. By aspect ratio, we simply mean

773
01:02:24,370 --> 01:02:26,920
the width to height ratio.

774
01:02:26,920 --> 01:02:35,250
So this ratio right here, this aspect ratio, can be, say, two by five.

775
01:02:35,260 --> 01:02:39,820
If we take height by width, the yellow one is going to be, like, say, three by two.

776
01:02:39,970 --> 01:02:45,700
So what goes on here is, you've trained this on the Pascal VOC data set, where you

777
01:02:45,700 --> 01:02:55,810
have a specific, or a general kind of, aspect ratio or aspect ratios, but when this is taken to different

778
01:02:55,810 --> 01:03:02,110
images where the aspect ratios aren't similar to those of Pascal VOC, then the model

779
01:03:02,110 --> 01:03:05,950
finds it difficult, or struggles, to generalize in such situations.

780
01:03:07,510 --> 01:03:14,140
Finally, while we train on a loss function that approximates detection performance,

781
01:03:14,140 --> 01:03:22,000
our loss function treats errors the same in small bounding boxes versus large bounding boxes, and

782
01:03:22,060 --> 01:03:26,700
they say that the main source of error is incorrect localization.

783
01:03:26,830 --> 01:03:31,060
That said, we're done with this review of the YOLO paper.

784
01:03:31,090 --> 01:03:35,860
In the next section, we are going to build this YOLO from scratch.