1
00:00:00,060 --> 00:00:06,480
Hi, and welcome back to the course. In this section, we'll take a look at a very useful deep learning

2
00:00:06,480 --> 00:00:08,430
based tracker called DeepSORT.

3
00:00:08,580 --> 00:00:09,960
So let's get started.

4
00:00:10,500 --> 00:00:14,400
So before we begin, do you remember what a tracker is?

5
00:00:15,000 --> 00:00:17,160
Well, don't worry, I'll refresh your memory shortly.

6
00:00:17,790 --> 00:00:23,180
Now, you might think a tracker is just something that tracks a moving object in a scene.

7
00:00:23,190 --> 00:00:24,630
And, well, that is correct.

8
00:00:24,930 --> 00:00:27,750
And that's exactly what mean shift and optical flow did.

9
00:00:28,530 --> 00:00:33,960
But there are a lot more complicated cases where you'd want to use a much more advanced

10
00:00:33,960 --> 00:00:34,260
tracker.

11
00:00:34,740 --> 00:00:36,690
Now let's take a look at one of those cases.

12
00:00:37,320 --> 00:00:39,390
So imagine this scene here.

13
00:00:40,680 --> 00:00:44,940
OK, forget about the text on top; I'll explain it based on the images right now.

14
00:00:45,660 --> 00:00:51,570
So we want to track multiple vehicles in this frame, as you can easily see.

15
00:00:51,570 --> 00:00:54,330
Firstly, we can have bounding boxes on these vehicles.

16
00:00:54,690 --> 00:00:56,520
And isn't that tracking them?

17
00:00:56,700 --> 00:01:04,530
Well, it's not, technically, because what if we wanted to put a bounding box over each vehicle and

18
00:01:04,530 --> 00:01:11,070
have that linked to a certain ID number, like, say, this is car number one, this is car number two, and

19
00:01:11,070 --> 00:01:11,580
so on?

20
00:01:11,670 --> 00:01:13,500
And as you can see, this car,

21
00:01:13,980 --> 00:01:16,770
look at this car here, the first big car in the left image.

22
00:01:17,250 --> 00:01:22,980
Let's say that's car number one, and you can see that four seconds later it has gone up here.

23
00:01:23,190 --> 00:01:23,550
All right.

24
00:01:24,240 --> 00:01:32,100
Now what's going to happen in your YOLO detector, or most other detectors, is that the number

25
00:01:32,100 --> 00:01:36,000
of cars on the screen might change constantly from frame to frame.

26
00:01:36,510 --> 00:01:40,180
You can see sometimes you might have four or five cars detected here.

27
00:01:40,200 --> 00:01:42,360
Sometimes we have three cars detected.

28
00:01:42,900 --> 00:01:49,620
So what that means is that whatever number you assign, say you assign the number bounding box

29
00:01:49,620 --> 00:01:57,390
one to a vehicle, there's no guarantee that bounding box one here is going to be bounding box one here

30
00:01:57,840 --> 00:01:58,740
on the next frame.

31
00:01:59,700 --> 00:02:03,540
So essentially, I'm trying to pose the problem presented here.

32
00:02:03,990 --> 00:02:09,900
The problem is that when we go from frame to frame, because bounding boxes tend to be unstable, and that's

33
00:02:10,140 --> 00:02:12,660
mainly unstable because of a moving scene,

34
00:02:12,840 --> 00:02:20,520
I should say, then you're going to have the number of the bounding box

35
00:02:20,520 --> 00:02:24,630
not aligned to the vehicle or the object that is being tracked.

36
00:02:24,780 --> 00:02:26,970
Because in this case, this one could be number one.

37
00:02:27,420 --> 00:02:29,180
Over here, it could be number three.

38
00:02:29,700 --> 00:02:32,550
And there's no way to actually track them individually.

39
00:02:32,760 --> 00:02:39,180
So if you wanted to assign, let's say, this vehicle as car number one, you'd want it to always be car

40
00:02:39,180 --> 00:02:44,100
number one while it's visible in the frame. So that's essentially what tracking is.

41
00:02:45,900 --> 00:02:51,960
So I'll do a quick recap on some of the previous trackers you would have seen in the OpenCV lesson.

42
00:02:52,560 --> 00:02:59,400
So the first one is mean shift, and mean shift is quite useful, although it's quite simple, when you're

43
00:02:59,400 --> 00:03:02,580
tracking a single moving object in a scene like this.

44
00:03:02,640 --> 00:03:05,940
Imagine we're tracking this tank moving along this makeshift road.

45
00:03:06,930 --> 00:03:14,190
It uses histograms, basically histograms of the color intensity or whatever metric you want to use

46
00:03:14,190 --> 00:03:15,150
from the color space.

47
00:03:15,720 --> 00:03:20,310
And it tries to find the likelihood of the object's location over the histogram.

48
00:03:20,760 --> 00:03:27,680
So it's sort of a mode-seeking type of algorithm that basically centers the bounding box upon the object

49
00:03:27,690 --> 00:03:28,860
that is being tracked.

50
00:03:28,920 --> 00:03:32,190
And it works well when it's dealing with one object.

51
00:03:32,850 --> 00:03:38,250
However, there are a lot more complicated cases where you have two objects, and mean shift is not going

52
00:03:38,250 --> 00:03:38,640
to work.
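[Editor's sketch] The mode-seeking idea described above can be illustrated in plain Python. This is a minimal sketch, not the OpenCV implementation from the earlier lesson: the `mean_shift` function, the `weights` likelihood map, and the window layout `(x, y, w, h)` are all assumptions made for illustration. The window is repeatedly re-centered on the weighted centroid of the likelihood values it covers until it stops moving.

```python
import math


def mean_shift(weights, window, max_iter=20):
    """Shift an (x, y, w, h) window toward the local mode of a 2D weight map.

    `weights` is a hypothetical per-pixel likelihood map (list of rows),
    standing in for a histogram back-projection of the tracked object.
    """
    x, y, w, h = window
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        total = sum_x = sum_y = 0.0
        # Accumulate the weighted centroid of the pixels under the window.
        for r in range(max(0, y), min(rows, y + h)):
            for c in range(max(0, x), min(cols, x + w)):
                total += weights[r][c]
                sum_x += weights[r][c] * c
                sum_y += weights[r][c] * r
        if total == 0:
            break  # no evidence under the window, nothing to follow
        # Re-center the window on the centroid (round half up).
        new_x = math.floor(sum_x / total - (w - 1) / 2 + 0.5)
        new_y = math.floor(sum_y / total - (h - 1) / 2 + 0.5)
        if (new_x, new_y) == (x, y):
            break  # converged: the window sits on the mode
        x, y = new_x, new_y
    return x, y, w, h


# A 10x10 map with one 3x3 blob of high likelihood at rows 5-7, cols 6-8.
weights = [[1.0 if 5 <= r <= 7 and 6 <= c <= 8 else 0.0 for c in range(10)]
           for r in range(10)]
tracked = mean_shift(weights, (4, 3, 3, 3))  # start slightly off the blob
```

Because the window only climbs toward one nearby peak, a second object with a similar histogram can pull it away, which is the limitation mentioned above.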

53
00:03:39,750 --> 00:03:41,730
So what about optical flow?

54
00:03:42,000 --> 00:03:46,800
Well, optical flow is probably much better than mean shift in some cases.

55
00:03:47,790 --> 00:03:55,050
However, again, it doesn't solve the problem of keeping a consistent ID on a certain vehicle.

56
00:03:55,620 --> 00:04:00,660
It's really good for tracking movement, and vehicles too, because what it does is look at,

57
00:04:01,320 --> 00:04:06,720
basically, the relative motion or movement between the object and the viewer, that is, you or the camera.

58
00:04:07,470 --> 00:04:15,090
So now we can move on to something that was actually quite groundbreaking when it was, I guess,

59
00:04:15,090 --> 00:04:15,690
published.

60
00:04:16,560 --> 00:04:24,060
Basically, Kalman filtering is what engineers like myself used back in the day. Ten years ago, I was

61
00:04:24,060 --> 00:04:29,910
an engineer, an electrical and computer engineer, and I did some robotics as well.

62
00:04:30,240 --> 00:04:35,400
And Kalman filters were amazing at cleaning up

63
00:04:35,400 --> 00:04:42,210
sensor data. Say you have a distance sensor or some sort of light sensor; a Kalman filter

64
00:04:42,210 --> 00:04:43,920
would be ideal just to clean it up.

65
00:04:44,040 --> 00:04:50,190
And how it works is that basically it takes prior knowledge of the state of the reading of that sensor.

66
00:04:50,670 --> 00:04:53,940
So you can see initially it's going to be a little bit off.

67
00:04:53,940 --> 00:04:59,070
The error is going to be quite big because there are no historical measurements yet, but as you can see,

68
00:04:59,070 --> 00:04:59,750
as we get more

69
00:04:59,800 --> 00:05:05,260
historical measurements, the estimate using the Kalman filter gets nearer and nearer to the actual

70
00:05:05,260 --> 00:05:08,140
values. These X's here are different

71
00:05:09,160 --> 00:05:14,350
measured temperatures, in this case, over time, and you can see they're pretty much all over the place; they're

72
00:05:14,350 --> 00:05:14,980
unstable.

73
00:05:15,370 --> 00:05:19,970
But using the Kalman filtering algorithm, we get a much more stable reading this way.
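[Editor's sketch] The smoothing behaviour described above can be reproduced with a one-dimensional Kalman filter. This is a minimal sketch under stated assumptions: the function name `kalman_1d`, the constant-value model, the noise variances, and the sample temperature readings are all made up for illustration and are not taken from the slide.

```python
def kalman_1d(measurements, meas_var, process_var=1e-4,
              init_est=0.0, init_var=1e3):
    """Filter noisy scalar readings, assuming the true value is near-constant.

    meas_var    : assumed variance of the sensor noise
    process_var : assumed variance of how much the true value drifts per step
    init_var    : large initial uncertainty, since there is no history yet
    """
    est, var = init_est, init_var
    estimates = []
    for z in measurements:
        # Predict: the value is assumed to stay put, but uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = var / (var + meas_var)
        est += gain * (z - est)
        var *= 1.0 - gain
        estimates.append(est)
    return estimates


# Hypothetical noisy temperature readings scattered around 25 degrees.
readings = [24.1, 26.3, 25.4, 23.8, 25.9, 24.6, 25.2, 24.9, 26.0, 24.3]
smoothed = kalman_1d(readings, meas_var=1.0)
```

With a large `init_var`, the first estimate simply follows the first reading; as more measurements arrive the gain shrinks, so each new noisy reading moves the estimate less and the output settles down.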

74
00:05:21,040 --> 00:05:28,570
So now that we've discussed some of the older tracking algorithms, let's take a look at some of the

75
00:05:28,570 --> 00:05:30,250
challenges faced with these.

76
00:05:31,030 --> 00:05:36,830
So a lot of these are quite computationally expensive. Now, we can get real-time performance from them,

77
00:05:37,720 --> 00:05:44,140
but again, they use a lot of resources, and it's probably not the most efficient way, given their performance.

78
00:05:44,860 --> 00:05:50,620
They are also very susceptible to noise and fast camera movements, as well as occlusions.

79
00:05:52,000 --> 00:05:55,150
So generally, you can see here these two people being tracked.

80
00:05:55,150 --> 00:06:02,290
This is one and this is two. You can see we keep them in frame here, and it maintains the correct IDs as

81
00:06:02,290 --> 00:06:06,520
the guys walk past, and I believe it maintains the IDs correctly.

82
00:06:07,000 --> 00:06:13,660
But you can see that, in a tracking scenario, this can actually cause a number of problems for those previous

83
00:06:13,660 --> 00:06:14,200
trackers.

84
00:06:15,100 --> 00:06:17,080
So now this brings us to DeepSORT.

85
00:06:17,230 --> 00:06:25,030
DeepSORT is a very widely used tracker right now, and it has actually been recently integrated

86
00:06:25,060 --> 00:06:27,160
into NVIDIA's DeepStream.

87
00:06:27,640 --> 00:06:34,330
DeepStream is basically a production deployment tool for computer vision models and applications, and

88
00:06:34,660 --> 00:06:37,140
DeepSORT was recently brought into it.

89
00:06:37,210 --> 00:06:42,070
So now we have a version of better tracking on that NVIDIA DeepStream platform.

90
00:06:43,090 --> 00:06:47,080
And basically, it performs multi-object tracking, as you can see in this image here.

91
00:06:47,770 --> 00:06:53,320
You can see each person has a unique ID, and as they move about the frame (this isn't a video,

92
00:06:53,800 --> 00:06:55,840
but you can imagine people moving here),

93
00:06:56,230 --> 00:07:01,810
they will keep the same ID consistently across frames until they exit or leave the frame.

94
00:07:02,710 --> 00:07:10,570
So just to give a quick overview of how the DeepSORT algorithm works: it basically performs Kalman

95
00:07:10,570 --> 00:07:18,100
filtering in the image space and frame-by-frame data association using the Hungarian method, with an

96
00:07:18,100 --> 00:07:22,150
association metric that measures bounding box overlap.

97
00:07:22,300 --> 00:07:27,970
So what does that mean? I just read this out word for word here, but it's a pretty good summary

98
00:07:27,970 --> 00:07:29,500
of what it's doing.

99
00:07:29,620 --> 00:07:29,940
Yeah.

100
00:07:30,670 --> 00:07:34,810
So what's happening is that it's looking at each frame.

101
00:07:35,050 --> 00:07:36,940
So we have a frame-by-frame association.

102
00:07:36,940 --> 00:07:41,860
So imagine we have the bounding boxes from one frame and then the bounding boxes for the next frame.

103
00:07:42,340 --> 00:07:45,190
We have to find some sort of association between them.

104
00:07:45,640 --> 00:07:48,940
And that's what the Kalman filtering in image space is doing.

105
00:07:49,390 --> 00:07:55,390
And next, the Hungarian method is doing something similar, using a metric that compares the bounding box overlap

106
00:07:55,390 --> 00:07:56,440
with the previous frame.
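[Editor's sketch] The frame-to-frame association step described above can be sketched as follows. This is an illustration under assumptions, not DeepSORT's actual code: the names `iou` and `associate`, the box layout `(x1, y1, x2, y2)`, and the `min_iou` threshold are hypothetical. To stay dependency-free, a brute-force search over assignments stands in for the Hungarian method; it finds the same optimal matching for a handful of boxes but does not scale. The appearance feature descriptor that DeepSORT also uses is left out.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Match previous-frame tracks to new detections by maximum total IoU.

    tracks     : dict mapping track ID -> box from the previous frame
    detections : list of boxes from the current frame
    Returns a dict mapping detection index -> track ID.
    """
    ids = list(tracks)
    n = len(detections)
    # Pad with None so surplus detections can stay unmatched.
    padded = ids + [None] * max(0, n - len(ids))

    def total_overlap(assignment):
        return sum(iou(tracks[t], detections[i])
                   for i, t in enumerate(assignment) if t is not None)

    # Brute force over all assignments (stand-in for the Hungarian method).
    best = max(permutations(padded, n), key=total_overlap, default=())
    return {i: t for i, t in enumerate(best)
            if t is not None and iou(tracks[t], detections[i]) >= min_iou}


# Two tracked cars; the detector returns them in a different order,
# plus one new box that overlaps nothing.
tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [(21, 0, 31, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
matches = associate(tracks, detections)
```

Here detection 0 is matched back to track 2 and detection 1 to track 1, so each car keeps its ID even though the detector's ordering changed; the third box is left unmatched and would start a new track.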

107
00:07:56,920 --> 00:07:59,670
So that's how the DeepSORT algorithm works.

108
00:07:59,680 --> 00:08:03,970
And if you want to take a look at it in more depth, you can read the paper here.

109
00:08:04,600 --> 00:08:10,420
I've also compiled some notes on these slides here. They're quite lengthy, so I'm not going to

110
00:08:10,420 --> 00:08:13,690
read them out; they're difficult to explain by just reading them.

111
00:08:14,320 --> 00:08:20,470
But if you want, this is basically a summary of the paper: the different metrics, the association and

112
00:08:20,530 --> 00:08:28,510
assignment algorithm. It talks about a lot of these things here, and it basically tells you how DeepSORT

113
00:08:28,510 --> 00:08:35,050
actually solves a lot of these problems that we've found. And we have a final summary of

114
00:08:35,170 --> 00:08:40,450
the DeepSORT architecture and some of the more informative things about the feature

115
00:08:40,450 --> 00:08:44,380
descriptor and how it uses that to match the objects.

116
00:08:44,980 --> 00:08:47,890
So that's it for this DeepSORT lesson.

117
00:08:47,890 --> 00:08:55,060
In the next section, we'll take a look at the Google Colab notebook where we actually use DeepSORT

118
00:08:55,060 --> 00:08:55,780
with YOLO.

119
00:08:57,280 --> 00:08:59,260
So I'll see you in the next lesson.

120
00:08:59,440 --> 00:08:59,830
Thank you.
