1
00:00:02,010 --> 00:00:03,180
Hi and welcome back.

2
00:00:03,210 --> 00:00:09,240
So in this lecture, we'll take a look at the new architecture and evolution that YOLO took all the

3
00:00:09,240 --> 00:00:10,320
way up to version five.

4
00:00:10,770 --> 00:00:11,940
So let's get started.

5
00:00:12,090 --> 00:00:17,970
So firstly, most conventional object detectors consist of two to three main parts.

6
00:00:18,200 --> 00:00:24,570
OK, we have the backbone, which is typically a classifier network that has been trained on ImageNet,

7
00:00:25,000 --> 00:00:28,620
and popular ones are ResNet, VGG, DenseNet and Darknet.

8
00:00:28,740 --> 00:00:30,840
Darknet was the one that was used for YOLO.

9
00:00:31,110 --> 00:00:33,240
It's a CSP network that's been used for YOLO.

10
00:00:34,140 --> 00:00:38,220
We then have the head, for loss calculations, predictions and inference.

11
00:00:38,790 --> 00:00:41,790
And then the neck, which was introduced in recent detectors.

12
00:00:42,270 --> 00:00:48,360
It is directly leveraged into the backbones for enhancing the richness and semantic representation of

13
00:00:48,360 --> 00:00:51,720
the extracted features for objects of different shapes and sizes.

14
00:00:52,710 --> 00:00:58,560
So let's take a look at the initial YOLO version one and YOLO version two networks.

15
00:00:59,040 --> 00:01:05,700
So YOLO version one, which came out in 2015, was basically the first single-stage detector, along

16
00:01:05,700 --> 00:01:09,150
with SSD around that time. They were developed simultaneously,

17
00:01:09,150 --> 00:01:14,640
so the researchers probably didn't have much knowledge of what was going on in the SSD research world.

18
00:01:15,300 --> 00:01:18,630
It utilized batch normalization and leaky

19
00:01:18,990 --> 00:01:20,550
ReLU activations as well.

20
00:01:20,940 --> 00:01:23,670
And this is a diagram of the architecture here.

21
00:01:27,070 --> 00:01:32,590
Now YOLO version two implemented several changes, such as removing the fully connected layer

22
00:01:33,010 --> 00:01:39,340
at the end, facilitating resolution independence, and a few new versions of YOLO version two, such as

23
00:01:39,340 --> 00:01:43,150
Tiny YOLO version two, were also released.

24
00:01:43,240 --> 00:01:50,140
The reason they made a Tiny YOLO version two was because it was easy to deploy on embedded systems,

25
00:01:50,140 --> 00:01:53,500
and there was a big need for object detection on the edge.

26
00:01:53,980 --> 00:02:00,280
So you can have all of these Android cameras or little Raspberry Pi type cameras running on the edge

27
00:02:00,280 --> 00:02:07,210
with low computational power, but able to run object detector models like

28
00:02:07,510 --> 00:02:09,400
Tiny YOLO.

29
00:02:11,170 --> 00:02:13,240
So let's take a look at YOLO version three.

30
00:02:13,810 --> 00:02:16,180
So YOLO version three was actually quite good.

31
00:02:16,720 --> 00:02:21,330
It actually had a number of features that improved the network there.

32
00:02:21,910 --> 00:02:25,750
It was inspired by ResNets, and it had feature pyramid networks inside of it.

33
00:02:26,350 --> 00:02:32,860
The researchers utilized a new feature extractor backbone called Darknet 53, which had skip connections

34
00:02:33,310 --> 00:02:40,600
similar to ResNet and three prediction heads like the FPN, which were available to use, and it actually

35
00:02:40,600 --> 00:02:45,640
performed very, very well for a few years, maybe from 2016 to 2019.

36
00:02:46,150 --> 00:02:53,290
It was the industry standard, using YOLO version three for object detection, but in 2020 YOLO version

37
00:02:53,290 --> 00:02:58,090
four came out and basically it shortlisted three different backbones.

38
00:02:58,090 --> 00:02:59,830
So you can see the backbones they used here.

39
00:03:00,340 --> 00:03:06,010
And these basically were tested extensively in the research paper and provided different speeds and

40
00:03:06,010 --> 00:03:07,930
accuracy, according to what you want to use.

41
00:03:07,940 --> 00:03:14,830
However, Darknet 53 was generally the best choice for most datasets, and that was the one they continued

42
00:03:14,830 --> 00:03:20,260
with most of the experiments in the research paper, and one of the big features of YOLO version four

43
00:03:20,270 --> 00:03:23,350
was a modified path aggregation network called PAN.

44
00:03:23,770 --> 00:03:29,140
It uses spatial pyramid pooling, tightly coupled with the Darknet 53 model.

45
00:03:29,590 --> 00:03:33,010
So this aided in increasing the receptive field of the model.

46
00:03:33,220 --> 00:03:36,910
So you had a lot better bounding box predictions at that point.

47
00:03:37,370 --> 00:03:41,390
And this is an overview of the very complicated

48
00:03:41,500 --> 00:03:47,350
YOLO model. It has a lot of good features; they basically implemented something called a bag of tricks,

49
00:03:47,740 --> 00:03:55,030
which allowed us to get a lot of performance out of it with basically minimal computational penalties.

50
00:03:56,140 --> 00:04:01,030
And you can see this is a summary of some optimizations that were made here in the model.

51
00:04:01,390 --> 00:04:03,820
The bag of specials... they call it the bag of tricks.

52
00:04:03,820 --> 00:04:09,400
There's the bag of freebies, and the bag of specials is at inference time. So at training time,

53
00:04:09,880 --> 00:04:15,130
there are things like class label smoothing, different data augmentation techniques such as Mosaic

54
00:04:15,130 --> 00:04:16,120
and CutMix.

55
00:04:16,600 --> 00:04:19,960
We have DropBlock regularization.

56
00:04:20,440 --> 00:04:22,780
We had self-adversarial training.

57
00:04:23,200 --> 00:04:29,530
We had something called CIoU loss, cross mini-batch normalization, and then during testing they had

58
00:04:29,530 --> 00:04:32,830
a bunch of different things that optimized performance as well.

59
00:04:33,370 --> 00:04:39,610
So this bag of freebies and bag of specials were introduced by the researchers, and they got a lot of

60
00:04:40,180 --> 00:04:41,410
benefits out of them.

61
00:04:42,310 --> 00:04:45,220
Now there's YOLO version five by Ultralytics.

62
00:04:45,610 --> 00:04:51,430
It's a heavily optimized PyTorch version of YOLO version four that has been open sourced by this company

63
00:04:51,430 --> 00:04:52,660
called Ultralytics.

64
00:04:52,990 --> 00:04:59,740
And you definitely should check out the GitHub because it's a very, very good implementation of YOLO.

65
00:05:00,500 --> 00:05:05,260
Well, it's basically YOLO version four, but they call it YOLO version five because it makes so many different

66
00:05:05,260 --> 00:05:06,040
optimizations

67
00:05:06,040 --> 00:05:11,830
that you get very, very good performance out of the box on your own datasets with YOLO version

68
00:05:11,830 --> 00:05:18,310
five, and it's quite easy to use, extensively developed, and there are a number of ways to configure it

69
00:05:18,320 --> 00:05:19,240
if you needed to.

70
00:05:19,720 --> 00:05:26,380
I've made a number of customizations to YOLO version five, on their model, and it's quite

71
00:05:26,530 --> 00:05:31,600
fun to work with, very efficient, and also easy to train on multiple GPUs as well.

72
00:05:32,200 --> 00:05:38,290
So we'll stop there for now on YOLO, and in the next section, we'll take a look at EfficientDet,

73
00:05:38,740 --> 00:05:41,800
which is a different object detection model coming out from Google.

74
00:05:42,280 --> 00:05:44,590
So stay tuned for that lesson.

75
00:05:44,710 --> 00:05:45,130
Thank you.