1
00:00:11,650 --> 00:00:16,960
In this lecture we are going to introduce a new type of layer which has been shown to be very helpful

2
00:00:16,960 --> 00:00:22,630
in training convolutional neural networks. Here's something to consider.

3
00:00:22,660 --> 00:00:28,030
Recall that early on we looked at why it's important to normalize your data before passing it into a

4
00:00:28,030 --> 00:00:31,780
model such as linear regression or logistic regression.

5
00:00:31,780 --> 00:00:38,190
We do this by subtracting the mean and dividing by the standard deviation, but because this is an operation

6
00:00:38,190 --> 00:00:43,000
on the data we can only do it at the very first layer of the neural network.

7
00:00:43,080 --> 00:00:49,830
What happens if the data becomes unnormalized at some later layer? Then that layer has to deal with

8
00:00:49,960 --> 00:00:53,950
unnormalized data, which is not as good as normalized data.

9
00:00:53,970 --> 00:00:58,790
So this begs the question: how can we make the data at every layer normalized?

10
00:01:03,670 --> 00:01:06,050
The answer is batch normalization.

11
00:01:06,400 --> 00:01:12,400
If you recall, inside the PyTorch training loop, we're actually doing batch gradient descent. On the

12
00:01:12,400 --> 00:01:13,060
outer loop,

13
00:01:13,090 --> 00:01:18,910
we loop through each epoch, and on the inner loop we look at a chunk of data on each step and do gradient

14
00:01:18,910 --> 00:01:22,300
descent with respect to just this chunk of data.

15
00:01:22,300 --> 00:01:27,820
This works because we know that a batch of data will have approximately the same statistics as the full

16
00:01:27,860 --> 00:01:32,870
dataset.

17
00:01:33,000 --> 00:01:35,020
So what if we did something like this?

18
00:01:35,190 --> 00:01:42,000
We inject a layer that does normalization for us. For example, in an ANN, instead of dense, dense, dense,

19
00:01:42,180 --> 00:01:47,340
we have batch norm, dense, batch norm, dense, batch norm, dense. At each of these layers,

20
00:01:47,370 --> 00:01:53,730
all it does is take the batch of data, subtract its mean, and divide by its standard deviation.

21
00:01:53,730 --> 00:01:56,880
In fact, this is pretty much what batch normalization does,

22
00:02:01,990 --> 00:02:05,440
but there's something extra that batch normalization does.

23
00:02:05,440 --> 00:02:07,340
Here's another question to consider.

24
00:02:07,690 --> 00:02:10,990
How do we know that normalization is actually good?

25
00:02:10,990 --> 00:02:12,730
In fact, we don't.

26
00:02:12,760 --> 00:02:17,770
Perhaps there's a better amount of shifting and a better amount of scaling that we can use that leads

27
00:02:17,770 --> 00:02:19,790
to a more optimal model.

28
00:02:19,840 --> 00:02:26,680
And so this is the full description of the batch normalization layer. It first does normalization using

29
00:02:26,680 --> 00:02:29,510
the batch mean and the batch standard deviation.

30
00:02:29,830 --> 00:02:36,220
Then it rescales the data and reshifts the data so that the final output has an optimal location and

31
00:02:36,220 --> 00:02:37,960
an optimal scale.

32
00:02:37,960 --> 00:02:43,620
Of course, these new parameters, gamma and beta, are learned automatically using gradient descent.

33
00:02:48,700 --> 00:02:53,610
One alternative perspective on batch normalization is that it acts as a regularizer.

34
00:02:54,460 --> 00:03:00,430
As you know, we use regularization to prevent overfitting, with methods such as dropout.

35
00:03:00,460 --> 00:03:04,090
So how does batch normalization do regularization?

36
00:03:04,090 --> 00:03:07,920
The idea is, because each batch of data is slightly different,

37
00:03:08,080 --> 00:03:11,320
it's going to have a different mean and standard deviation.

38
00:03:11,320 --> 00:03:17,750
It's not the true mean and standard deviation of the entire dataset. These differences are like noise

39
00:03:17,780 --> 00:03:22,030
which slightly change what the neural network sees on each iteration.

40
00:03:22,040 --> 00:03:27,380
In other words, because the neural network always has to learn noisy images, it becomes impervious

41
00:03:27,380 --> 00:03:28,160
to the noise.

42
00:03:33,300 --> 00:03:38,250
The final thing I want to mention in this lecture is that while we discussed batch norm in the context

43
00:03:38,250 --> 00:03:45,400
of a feedforward neural network, batch norm is not commonly applied in between dense layers. As you recall,

44
00:03:45,410 --> 00:03:48,000
this is the CNN section of the course.

45
00:03:48,170 --> 00:03:53,990
And so the reason we're talking about batch normalization now is because batch norm is mostly used with

46
00:03:53,990 --> 00:03:59,800
convolutions. Now, adding batch norm also adds some hyperparameter choices.

47
00:03:59,930 --> 00:04:05,570
In particular, you have to decide whether or not you should use batch norm, and if you decide you should,

48
00:04:05,840 --> 00:04:07,640
then you have to decide where to put it.

49
00:04:08,600 --> 00:04:14,300
Luckily, we can just follow the conventions in this regard. Batch norm layers are usually applied after

50
00:04:14,300 --> 00:04:18,410
convolution layers. Well, after or before depends on which way you look at it.

51
00:04:18,830 --> 00:04:23,750
So maybe a better way of looking at it is that they go in between convolution layers.

52
00:04:23,960 --> 00:04:30,830
One common pattern would be like this: convolution to batch norm to convolution to batch norm, back to

53
00:04:30,830 --> 00:04:35,440
convolution, back to batch norm, and then flatten, and then your dense layers.

54
00:04:35,480 --> 00:04:38,850
So that's just to give you an idea of what it might look like.

55
00:04:38,960 --> 00:04:44,060
You're always encouraged to try things on your own, but also to read about the most popular CNNs, such

56
00:04:44,060 --> 00:04:49,160
as VGG, ResNet, and Inception, to see how they're doing things.

57
00:04:49,160 --> 00:04:54,840
This is the approach we're going to take in the next lecture when we improve our CIFAR-10 results.

58
00:04:55,050 --> 00:05:00,270
By the way, if you want to read more about batch norm, you can check out the paper included in

59
00:05:00,270 --> 00:05:01,830
extra_reading.txt.

60
00:05:01,830 --> 00:05:08,430
The title is "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate

61
00:05:08,430 --> 00:05:08,970
Shift."
