1
00:00:11,680 --> 00:00:16,350
In this lecture we are going to discuss stochastic gradient descent.

2
00:00:16,360 --> 00:00:21,710
Previously we looked at regular gradient descent and how it can be used to optimize a function.

3
00:00:21,850 --> 00:00:28,750
But as you may know in tensor flow 2.0 the gradient descent option is called S G D which is short for

4
00:00:28,750 --> 00:00:31,060
us to castigate gradient descent.

5
00:00:31,090 --> 00:00:36,840
So what makes it stochastic and how is it different from non stochastic gradient descent.

6
00:00:41,990 --> 00:00:45,110
To understand this let's first think of a simpler problem.

7
00:00:45,260 --> 00:00:49,970
Let's suppose we want to measure the average height of all humans in the entire world.

8
00:00:49,970 --> 00:00:52,590
Now that would be quite a hefty task.

9
00:00:52,610 --> 00:00:58,910
There are 7 billion people in the world and it would take a very long time to ask each person individually.

10
00:00:59,300 --> 00:01:04,970
But what if instead you took just one thousand random people and took the average height of these 1000

11
00:01:05,060 --> 00:01:11,300
people would you expect the average height of these 1000 to be similar to the average height of the

12
00:01:11,300 --> 00:01:13,310
entire population.

13
00:01:13,310 --> 00:01:14,920
The answer is yes.

14
00:01:15,230 --> 00:01:21,710
It should be clear that the main advantage of asking only 1000 people instead of 7 billion people is

15
00:01:21,710 --> 00:01:28,550
that it would take much less time.

16
00:01:28,650 --> 00:01:31,410
OK so how is this relevant for deplaning.

17
00:01:31,430 --> 00:01:35,030
If you recall in deep learning we have to calculate the cost.

18
00:01:35,060 --> 00:01:41,090
And from this cost we have to calculate is gradient so that we can do a gradient descent.

19
00:01:41,090 --> 00:01:46,250
The problem is this cost depends on the number of training samples in our dataset.

20
00:01:46,280 --> 00:01:52,270
So imagine you are training on the famous image in that dataset which has about 1 million images.

21
00:01:52,280 --> 00:01:58,010
This means that for each iteration of your training loop you have to add up all the errors for all 1

22
00:01:58,010 --> 00:01:59,630
million samples.

23
00:01:59,630 --> 00:02:01,790
That's going to take an extremely long time

24
00:02:06,880 --> 00:02:12,120
but what if instead I took the average error over just 1000 images.

25
00:02:12,190 --> 00:02:14,400
Of course they must be chosen randomly.

26
00:02:14,800 --> 00:02:21,460
The same idea applies this average error is probably close to the average error over the entire dataset

27
00:02:22,210 --> 00:02:29,230
and so if we do gradient descent using only these 1000 images we will probably end up taking a very

28
00:02:29,230 --> 00:02:30,490
similar step.

29
00:02:30,490 --> 00:02:33,250
Just with 1000 times less computation

30
00:02:38,370 --> 00:02:44,910
in actuality we typically use smaller back sizes like 32 64 or 128.

31
00:02:44,970 --> 00:02:51,330
The basic algorithm looks like this instead of just one loop over the epochs we have two nested loops

32
00:02:51,990 --> 00:02:55,330
one over the epochs and one over the training set.

33
00:02:55,380 --> 00:03:01,140
The idea is we are going to split up our data into batches and for each batch we are going to take one

34
00:03:01,140 --> 00:03:03,230
step of gradient descent.

35
00:03:03,450 --> 00:03:09,780
As you can see in this pseudocode we are using a batch size of 32 so the number of iterations of the

36
00:03:09,780 --> 00:03:13,180
inner loop is n divided by 32.

37
00:03:13,230 --> 00:03:20,620
We can specify the start index as I times thirty two and the end index as I plus 1 times 32.

38
00:03:20,790 --> 00:03:26,670
So this will count from zero to thirty to thirty two to sixty four sixty four ninety six and so forth

39
00:03:28,130 --> 00:03:32,400
on each epoch we will make one pass through the entire dataset.

40
00:03:32,450 --> 00:03:37,040
It's important to randomize the data and each epoch because running through the same samples in the

41
00:03:37,040 --> 00:03:41,240
same order each time could lead to undesirable patterns being learned

42
00:03:46,430 --> 00:03:47,530
as an exercise.

43
00:03:47,540 --> 00:03:53,250
You may want to try this out for yourself to prove that it is faster to set up your experiment.

44
00:03:53,360 --> 00:03:57,320
Here's what you should do in the tensor flow 2.0 fit function.

45
00:03:57,320 --> 00:03:59,670
You have an argument called that size.

46
00:03:59,840 --> 00:04:05,060
You can take any of the examples from this course and explicitly set the back size to a small value

47
00:04:05,090 --> 00:04:10,100
such as 32 and explicitly set it to the size of the entire dataset.

48
00:04:10,100 --> 00:04:12,910
Whatever that may be from there.

49
00:04:12,980 --> 00:04:17,380
Compare the time it takes for each method to converge to a minimum.

50
00:04:17,600 --> 00:04:20,320
You should find that with batch gradient descent.

51
00:04:20,420 --> 00:04:28,580
The training process converges much faster and in fact sometimes your dataset may be so large that doing

52
00:04:28,580 --> 00:04:31,580
one step of gradient descent cannot even fit into memory.