1
00:00:00,450 --> 00:00:03,420
Welcome to the section on object detection.

2
00:00:03,810 --> 00:00:10,920
Object detection is one of the most popular computer vision tasks and also a very important one.

3
00:00:12,030 --> 00:00:21,810
Object detection entails correctly classifying objects which are in an image and also saying exactly

4
00:00:21,810 --> 00:00:25,020
where these objects are located in the image.

5
00:00:25,020 --> 00:00:33,150
So if we have this image, we see clearly that we have an airplane, another airplane, a person, another person,

6
00:00:33,150 --> 00:00:34,640
and then we could say a car.

7
00:00:34,650 --> 00:00:37,560
So this too is some sort of car.

8
00:00:37,590 --> 00:00:47,100
Now, an object detector not only classifies these objects, but also localizes their exact positions

9
00:00:47,100 --> 00:00:48,240
in the image.

10
00:00:48,240 --> 00:00:51,960
So this airplane, for example, has this bounding box.

11
00:00:51,960 --> 00:00:58,440
Now, a bounding box is basically a square or rectangular box which surrounds the object.

12
00:00:58,440 --> 00:01:03,840
And then for this person, we have this bounding box, for this one, this bounding box, and for this other airplane,

13
00:01:03,840 --> 00:01:08,730
we have this bounding box which is in this other bounding box.

14
00:01:09,030 --> 00:01:16,290
So unlike a classification problem, where when given this kind of input, we have as an output one

15
00:01:16,290 --> 00:01:20,440
hot vector, for example, whose length is the number of classes we're dealing with.

16
00:01:20,460 --> 00:01:27,960
So supposing we have five classes, then when given an input, we have to correctly say whether that

17
00:01:27,960 --> 00:01:30,780
input belongs to one of the five classes.

18
00:01:31,020 --> 00:01:38,880
Now, with object detection, we not only have this, but we also have the positions in the image.

19
00:01:38,880 --> 00:01:45,210
So like this, we have this position like this, and we have several conventions for these positions.

20
00:01:45,210 --> 00:01:52,320
One of the most popular conventions is the center convention, where we have the x center, y center

21
00:01:52,320 --> 00:01:53,760
width and then height.

22
00:01:53,790 --> 00:01:55,040
Now, what does this mean?

23
00:01:55,050 --> 00:02:01,470
This means that based on this coordinate system right here, so we could define a coordinate system right here where

24
00:02:01,470 --> 00:02:03,260
we have this origin.

25
00:02:03,270 --> 00:02:08,060
Now, recall that we are used to working with this kind of coordinate system.

26
00:02:08,070 --> 00:02:12,090
Now the coordinate system for image data is considered to be this.

27
00:02:12,090 --> 00:02:13,890
So our origin starts from here.

28
00:02:13,890 --> 00:02:17,700
We move in the x direction and then in the y direction.

29
00:02:17,700 --> 00:02:24,990
So this is our reference, and it is based on this point, on this origin, that we actually define

30
00:02:24,990 --> 00:02:25,830
positions.

31
00:02:25,830 --> 00:02:30,060
So if we have this airplane, let's consider this bigger airplane.

32
00:02:30,060 --> 00:02:37,230
So we could have this bigger airplane, we could define its bounding box by its center and then its

33
00:02:37,230 --> 00:02:37,740
width and height.

34
00:02:37,740 --> 00:02:45,510
So once we have a center, and if we're given this width and height, we can clearly see that it

35
00:02:45,510 --> 00:02:48,120
is in this bounding box.

36
00:02:48,120 --> 00:02:52,500
So this is the first convention. The center has the x center and the y center.

37
00:02:52,500 --> 00:02:56,670
The X center is basically the distance from this to the center.

38
00:02:56,670 --> 00:02:58,650
So suppose that the center is right here.

39
00:02:58,650 --> 00:03:06,210
So if our center is here, then the distance from here to this is x center and the distance from the top

40
00:03:06,210 --> 00:03:08,130
to this is Y center.

41
00:03:08,130 --> 00:03:09,840
So basically that's what we have.

42
00:03:10,410 --> 00:03:17,310
So if we want to link this up like this, we could see clearly how we obtain x center, the distance from

43
00:03:17,310 --> 00:03:24,780
here to here, and then y center, the distance from this origin to this, which now gives us the center.

44
00:03:24,780 --> 00:03:26,450
So that's how we get a center.

45
00:03:26,460 --> 00:03:32,850
Now we are given the width and then the height. If we're given the width and height, obviously to get

46
00:03:32,850 --> 00:03:39,120
these points, or rather, because we actually have four points right here, to get all those four points which

47
00:03:39,120 --> 00:03:47,160
make up the bounding box, we could start from, say, this point. For this point, to obtain the x and y

48
00:03:47,160 --> 00:03:54,570
coordinates right here, we simply take this x coordinate and subtract the width divided by two

49
00:03:54,570 --> 00:03:57,540
because from this to this is the width divided by two on the diagram.

50
00:03:57,540 --> 00:04:01,020
It doesn't show clearly that this point is at the center.

51
00:04:01,020 --> 00:04:02,760
But normally this should be at the center.

52
00:04:02,760 --> 00:04:06,390
So we could rearrange this bounding box now.

53
00:04:06,390 --> 00:04:13,170
So as we said, to get this, we have this x center minus the width divided by two to obtain this and

54
00:04:13,170 --> 00:04:20,370
then to get the Y, we have the Y center plus because this is actually the positive direction and then

55
00:04:20,370 --> 00:04:22,650
this is the negative direction for the X.

56
00:04:22,650 --> 00:04:24,660
So for y, this is positive.

57
00:04:24,660 --> 00:04:28,770
This is negative for Y, this is positive for X and then this is negative for x.

58
00:04:28,950 --> 00:04:36,930
Now, to obtain the Y, as we said, we have this y center plus the width of this, or rather plus the

59
00:04:36,930 --> 00:04:39,200
height of this divided by two.

60
00:04:39,210 --> 00:04:40,680
That's how we obtain this.

61
00:04:40,950 --> 00:04:47,430
Now, to obtain this, since we're going from this, we take the x center minus the width divided

62
00:04:47,430 --> 00:04:47,850
by two.

63
00:04:47,850 --> 00:04:49,890
We obtain the x coordinate here.

64
00:04:49,890 --> 00:04:56,040
To obtain the y coordinate we have the Y center minus because this is the negative direction.

65
00:04:56,040 --> 00:04:59,820
So we have the y center minus the height divided by two.

66
00:05:00,350 --> 00:05:04,010
This is very similar to how we obtain the x here.

67
00:05:04,040 --> 00:05:05,840
There's an x coordinate right here.

68
00:05:05,840 --> 00:05:10,760
We have the x center here plus the width divided by two to obtain the x coordinate.

69
00:05:10,760 --> 00:05:13,490
Right here is the x center plus the width divided by two.

70
00:05:13,490 --> 00:05:18,350
Obviously these two have the same x coordinate, but different y coordinates. To obtain the y coordinate here,

71
00:05:18,350 --> 00:05:22,940
we have this y center minus the height divided by two.

72
00:05:22,970 --> 00:05:24,080
To obtain the y coordinate

73
00:05:24,080 --> 00:05:27,530
here, we have the y center plus the height divided by two.

74
00:05:27,890 --> 00:05:33,440
Either way, once we have these two coordinates, this point here and then this one, we could always obtain

75
00:05:33,440 --> 00:05:35,120
this and this automatically.

76
00:05:35,120 --> 00:05:41,720
Now another convention is the x min, y min, x max, y max convention, where we're just given these

77
00:05:41,720 --> 00:05:42,320
coordinates.

78
00:05:42,320 --> 00:05:48,140
If we're given these coordinates and then these coordinates, we could obviously obtain all of this because

79
00:05:48,140 --> 00:05:55,340
when one is given this and then given this, we could just get the whole box automatically.

80
00:05:55,340 --> 00:06:02,450
So here are the two main conventions we use to actually locate an object in the image.
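
To make these two conventions concrete, here is a minimal Python sketch of how one could convert between them. The function names center_to_corners and corners_to_center are illustrative names of our own, not taken from any library, and the numbers in the example are made up. The sketch assumes the image coordinate system described earlier, with the origin at the top-left corner, x increasing to the right and y increasing downward.

```python
from typing import Tuple

# A box is four floats; which four depends on the convention in use.
Box = Tuple[float, float, float, float]


def center_to_corners(x_center: float, y_center: float,
                      width: float, height: float) -> Box:
    """Convert (x center, y center, width, height) to (x min, y min, x max, y max)."""
    x_min = x_center - width / 2   # move left by half the width
    y_min = y_center - height / 2  # move up by half the height
    x_max = x_center + width / 2   # move right by half the width
    y_max = y_center + height / 2  # move down by half the height
    return x_min, y_min, x_max, y_max


def corners_to_center(x_min: float, y_min: float,
                      x_max: float, y_max: float) -> Box:
    """Convert (x min, y min, x max, y max) to (x center, y center, width, height)."""
    width = x_max - x_min
    height = y_max - y_min
    return x_min + width / 2, y_min + height / 2, width, height


# Hypothetical box: center at (100, 60), 80 wide and 40 high.
box = center_to_corners(100.0, 60.0, 80.0, 40.0)
print(box)                      # (60.0, 40.0, 140.0, 80.0)
print(corners_to_center(*box))  # (100.0, 60.0, 80.0, 40.0)
```

Going from one convention to the other and back returns the original values, which is why either convention is enough to locate the object.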
