1
00:00:00,050 --> 00:00:03,830
We've just looked at the section on corrective measures.

2
00:00:03,860 --> 00:00:07,100
Now we're going to dive into TensorFlow data sets.

3
00:00:07,100 --> 00:00:10,580
So right here in the documentation, you open up tf.data.

4
00:00:10,610 --> 00:00:12,050
Click on the overview.

5
00:00:12,410 --> 00:00:19,760
This gives an idea of what TensorFlow datasets are all about, and also why they're important or why

6
00:00:19,760 --> 00:00:21,980
we should generally want to use them.

7
00:00:21,980 --> 00:00:31,220
The tf.data API, as we shall see throughout this course, will let us build complex data

8
00:00:31,220 --> 00:00:35,720
processing pipelines in a very efficient manner.

9
00:00:35,720 --> 00:00:44,240
This API lets us work with large amounts of data, and also perform complex

10
00:00:44,240 --> 00:00:47,510
computations on this data effortlessly.

11
00:00:47,690 --> 00:00:53,930
The first method we shall explore in the tf.data API is the from_tensor_slices method.

12
00:00:53,930 --> 00:00:55,940
So let's just copy this out.

13
00:00:55,940 --> 00:00:57,830
We get back to our code.

14
00:00:58,130 --> 00:01:00,500
Then we change this to train.

15
00:01:00,500 --> 00:01:01,970
So this is our train data set.

16
00:01:01,970 --> 00:01:08,990
And now we have our from_tensor_slices method, which is going to take a tuple that will

17
00:01:08,990 --> 00:01:13,340
contain the x train and the y train.

18
00:01:13,340 --> 00:01:15,500
So that's going to be our train data set.

19
00:01:15,500 --> 00:01:19,310
Now we could do: for i in train_dataset.

20
00:01:19,730 --> 00:01:25,910
Let's print out i, and let's break once we're done with the first sample.

21
00:01:25,910 --> 00:01:27,590
So that's it.

22
00:01:27,590 --> 00:01:31,280
You see we have this input and then we have the output.

23
00:01:31,280 --> 00:01:36,920
Let's say for i, j, and then print out i and also print out j.

24
00:01:37,100 --> 00:01:40,490
So you see that we have the i, the input.

25
00:01:40,760 --> 00:01:43,160
And then we also have the j.

26
00:01:43,340 --> 00:01:45,170
That's the output.
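As a rough sketch of this step: the arrays below are made-up stand-ins (8 features, 1 target), and the names x_train, y_train and train_dataset are assumed spellings of what is said aloud, not the course's actual code.
```python
import numpy as np
import tensorflow as tf
# Hypothetical stand-in data: 100 samples with 8 features and 1 target each.
x_train = np.random.rand(100, 8).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")
# from_tensor_slices slices the tuple along the first axis,
# so every element of the dataset is one (input, output) pair.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
for i, j in train_dataset:
    print(i)  # one input sample, shape (8,)
    print(j)  # the matching output, shape (1,)
    break     # stop after the first sample
```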

27
00:01:45,170 --> 00:01:53,930
Now, getting back here, you will find that it's very easy to do things like shuffling, batching

28
00:01:53,930 --> 00:01:57,290
and also prefetching, which we shall look at shortly.

29
00:01:57,290 --> 00:01:59,300
Now let's get back to the documentation.

30
00:01:59,480 --> 00:02:02,840
We could scroll down and here we have the shuffle method.

31
00:02:02,840 --> 00:02:09,770
You see that this shuffle method takes in a buffer size and a seed, because you

32
00:02:09,770 --> 00:02:14,930
want to be able to reproduce a particular setting of your data set.

33
00:02:14,930 --> 00:02:21,050
So the seed is for when you want to shuffle your data in such a way that, when you reshuffle it again, you get the

34
00:02:21,050 --> 00:02:23,030
exact same shuffling order.

35
00:02:23,030 --> 00:02:24,200
So that's it.

36
00:02:24,200 --> 00:02:27,080
And then you have reshuffle_each_iteration.

37
00:02:27,080 --> 00:02:33,200
So each time you iterate over your data, you may want to reshuffle again.

38
00:02:33,200 --> 00:02:34,550
Anyways, that's it.

39
00:02:34,580 --> 00:02:36,950
We have this definition here.

40
00:02:36,950 --> 00:02:40,550
It randomly shuffles the elements of this dataset.

41
00:02:40,550 --> 00:02:44,240
And the dataset fills a buffer with buffer_size elements.

42
00:02:44,240 --> 00:02:50,330
So here we're told that, for instance, if your dataset contains 10,000 elements but the buffer size is set

43
00:02:50,330 --> 00:02:57,350
to 1000, then the shuffle will initially select a random element from only the first 1000 elements

44
00:02:57,350 --> 00:02:58,100
in the buffer.

45
00:02:58,100 --> 00:03:03,500
So when you set the buffer size to 1000, what you're saying is that you want to shuffle from the first

46
00:03:03,500 --> 00:03:08,810
1000 elements, and once an element is selected, its space in the buffer is replaced by the next

47
00:03:08,810 --> 00:03:11,750
(that is, the 1,001st) element.

48
00:03:11,750 --> 00:03:14,390
This maintains the 1000-element buffer.

49
00:03:14,390 --> 00:03:22,460
So what this essentially means: suppose you have a dataset with, say, ten elements and a buffer size

50
00:03:22,460 --> 00:03:23,780
of let's say three.

51
00:03:23,780 --> 00:03:33,260
So let's create this ten-element dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

52
00:03:33,260 --> 00:03:36,140
So here we have 8, 9, 10.

53
00:03:36,140 --> 00:03:39,290
So we have our ten-element dataset.

54
00:03:39,290 --> 00:03:45,200
And if we have a buffer size of three then the shuffling will be done such that the first element will

55
00:03:45,200 --> 00:03:48,980
be picked from either 1, 2 or 3.

56
00:03:48,980 --> 00:03:52,370
Once we pick an element, let's say we randomly pick out two.

57
00:03:52,400 --> 00:03:58,970
If we randomly pick out two, then what this means is we are going to take this off, and then we will

58
00:03:58,970 --> 00:04:01,790
now pick between one, three and four.

59
00:04:01,790 --> 00:04:06,470
So that our buffer size remains three, because now we've picked out two.

60
00:04:06,470 --> 00:04:08,240
And so we're left with one, three and four.

61
00:04:08,240 --> 00:04:09,800
Now let's randomly pick an element.

62
00:04:09,800 --> 00:04:10,850
We pick four.

63
00:04:10,850 --> 00:04:12,920
So this is no longer in our buffer.

64
00:04:12,920 --> 00:04:14,780
Five now moves into the buffer.

65
00:04:14,780 --> 00:04:19,940
And then we could randomly pick out whatever other element and so on and so forth right up to the end.

66
00:04:19,940 --> 00:04:24,980
So that's essentially what is being described here in the documentation.
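To make the buffer idea concrete, here is a small pure-Python simulation of it. This is only a sketch of the behavior just described, not TensorFlow's actual implementation, and buffered_shuffle is a made-up helper name.
```python
import random
def buffered_shuffle(elements, buffer_size, seed=None):
    """Yield elements in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    source = iter(elements)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Pick a random slot, emit it, and refill that slot with the next element.
    for item in source:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # The source is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
order = list(buffered_shuffle(range(1, 11), buffer_size=3, seed=0))
print(order)  # the first value is always one of 1, 2 or 3
```
Note that with a small buffer an element can never appear much earlier than its original position, which is why a buffer size close to the dataset size gives a more thorough shuffle.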

67
00:04:25,280 --> 00:04:30,830
reshuffle_each_iteration controls whether the shuffle order should be different for each epoch.

68
00:04:30,830 --> 00:04:35,690
So after every epoch, you could decide whether you want to reshuffle or not.
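A quick way to see the effect of this argument, sketched on a toy range dataset rather than the course data (the seed value 42 is arbitrary):
```python
import tensorflow as tf
dataset = tf.data.Dataset.range(10)
# With reshuffle_each_iteration=False, every pass over the dataset reuses one order.
fixed = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=False)
first_epoch = [int(x) for x in fixed.as_numpy_iterator()]
second_epoch = [int(x) for x in fixed.as_numpy_iterator()]
print(first_epoch, second_epoch)
# With reshuffle_each_iteration=True, each pass is shuffled again.
reshuffled = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=True)
print([int(x) for x in reshuffled.as_numpy_iterator()])
print([int(x) for x in reshuffled.as_numpy_iterator()])
```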

69
00:04:35,690 --> 00:04:41,060
So getting back to the code, we have our train dataset.

70
00:04:41,060 --> 00:04:42,800
Well we already had this.

71
00:04:42,800 --> 00:04:44,390
So we'll just say train data set.

72
00:04:44,810 --> 00:04:50,720
equals train_dataset.shuffle.

73
00:04:50,720 --> 00:04:53,660
And then we provide a buffer size.

74
00:04:53,660 --> 00:04:59,660
We could take a buffer size of, for example, let's say 16.

75
00:05:00,160 --> 00:05:02,590
And then we have the batch.

76
00:05:02,590 --> 00:05:04,870
So let's specify let's take this off.

77
00:05:04,870 --> 00:05:09,730
We have batch, and then we pass the batch size.

78
00:05:09,730 --> 00:05:11,560
We could also have your buffer size.

79
00:05:11,560 --> 00:05:17,560
Anyway, let's just put in a buffer size variable here.

80
00:05:17,560 --> 00:05:19,420
And then we could define this above.

81
00:05:19,420 --> 00:05:22,150
So now we have the buffer size.

82
00:05:22,270 --> 00:05:26,500
We have the buffer size which is set to 16.

83
00:05:26,500 --> 00:05:31,360
And we have the batch size which we could set to 64.

84
00:05:31,360 --> 00:05:35,230
So now we have our buffer size set and the batch size set.

85
00:05:35,230 --> 00:05:37,900
Another thing we could do is prefetching.

86
00:05:37,990 --> 00:05:43,630
Getting back to the documentation, we read that most dataset input pipelines should end with

87
00:05:43,630 --> 00:05:44,590
a call to prefetch.

88
00:05:44,590 --> 00:05:49,210
This allows later elements to be prepared while the current element is being processed.

89
00:05:49,210 --> 00:05:56,800
This often improves latency and throughput, at the cost of using additional

90
00:05:56,800 --> 00:05:59,410
memory to store the prefetched elements.

91
00:05:59,410 --> 00:06:00,070
So.

92
00:06:00,070 --> 00:06:03,790
Here we're told that prefetch has a buffer size.

93
00:06:03,790 --> 00:06:08,320
So we have this buffer_size argument, and it represents the maximum number of elements that

94
00:06:08,320 --> 00:06:09,820
will be buffered when prefetching.

95
00:06:09,820 --> 00:06:16,270
And if the value tf.data.AUTOTUNE is used, then the buffer size is dynamically tuned.

96
00:06:16,270 --> 00:06:17,920
So we'd like to use this.

97
00:06:17,920 --> 00:06:23,980
So to understand the concept of prefetching, let's suppose that this red is for loading the data.

98
00:06:23,980 --> 00:06:26,650
And this blue is for training.

99
00:06:26,650 --> 00:06:31,960
So let's say we load a batch and then we train and then we load again.

100
00:06:31,960 --> 00:06:33,010
Let's copy this.

101
00:06:33,010 --> 00:06:35,320
We load let's drag this.

102
00:06:35,320 --> 00:06:37,030
This way we load.

103
00:06:37,450 --> 00:06:43,870
And then after loading, you see, we go ahead with the training.

104
00:06:44,620 --> 00:06:46,000
There we go.

105
00:06:46,000 --> 00:06:49,210
And then let's say we load again.

106
00:06:49,210 --> 00:06:50,800
So let's copy this.

107
00:06:51,580 --> 00:06:58,840
Let's say we load again, and we go ahead again and train on that loaded data.

108
00:06:58,870 --> 00:07:04,360
The whole point of prefetching is that while you're training the model, you could be loading the data

109
00:07:04,360 --> 00:07:04,660
set.

110
00:07:04,660 --> 00:07:10,630
So instead of having to always wait for the model training to complete before loading, you could instead,

111
00:07:10,630 --> 00:07:13,030
do both simultaneously.

112
00:07:13,030 --> 00:07:17,230
So let's take this off so you could have this here.

113
00:07:17,230 --> 00:07:21,400
So now you're done with loading.

114
00:07:21,400 --> 00:07:23,350
You start training and you're also loading.

115
00:07:23,350 --> 00:07:28,690
So once you're done with this training, you see, you can now pick up from here.

116
00:07:28,690 --> 00:07:32,290
Well, you pick up from here, because the loading must be complete before you can train on that data.

117
00:07:32,290 --> 00:07:33,220
Let's take this off.

118
00:07:33,220 --> 00:07:40,030
And then, obviously, while you're training again on this, you could also start loading.

119
00:07:40,030 --> 00:07:41,980
So we could start loading from here.

120
00:07:41,980 --> 00:07:46,690
And let's take this off and then we could pick up from here.

121
00:07:46,690 --> 00:07:47,740
Let's take this off.

122
00:07:47,740 --> 00:07:50,680
We could pick up from here and start training.

123
00:07:50,680 --> 00:07:53,200
So you see that at the end.

124
00:07:53,200 --> 00:07:56,710
So, for the first scenario.

125
00:07:56,740 --> 00:07:58,780
Let's let's reduce this.

126
00:07:59,650 --> 00:08:03,370
For the first scenario, we took a certain time.

127
00:08:03,370 --> 00:08:09,850
Let's say we took a time t to complete.

128
00:08:09,850 --> 00:08:18,130
But in the second scenario, we've taken a time of about two-thirds of t.
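The two-thirds figure can be checked with a simplified timing model. The sketch below assumes three batches and that one load and one training step each take one unit of time, which is how the drawing reads; the function names are made up for illustration.
```python
def sequential_time(num_batches, load_time, train_time):
    # Without prefetching, every load and every training step runs back to back.
    return num_batches * (load_time + train_time)
def prefetched_time(num_batches, load_time, train_time):
    # With prefetching, the first load cannot overlap with anything.
    # After that, each training step overlaps with the next load,
    # so each overlapped pair costs the slower of the two.
    # The last training step has nothing left to load alongside it.
    overlapped = (num_batches - 1) * max(load_time, train_time)
    return load_time + overlapped + train_time
t = sequential_time(3, 1.0, 1.0)
t_prefetch = prefetched_time(3, 1.0, 1.0)
print(t, t_prefetch, t_prefetch / t)  # 6 units versus 4 units, a ratio of 2/3
```
With more batches the ratio moves toward one half when loading and training take equal time, so the saving in the drawing is a lower bound for that case.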

129
00:08:18,130 --> 00:08:23,380
And so that's why here we just call prefetch.

130
00:08:23,380 --> 00:08:28,270
And then we pass tf.data.AUTOTUNE.

131
00:08:28,270 --> 00:08:29,650
So that's it.

132
00:08:29,770 --> 00:08:31,570
We run that and there we go.

133
00:08:31,570 --> 00:08:34,180
We have our train dataset.

134
00:08:34,720 --> 00:08:35,920
That's fine.

135
00:08:35,920 --> 00:08:38,680
We repeat the same for the validation and the testing.

136
00:08:38,680 --> 00:08:40,390
So take this off.

137
00:08:40,390 --> 00:08:50,710
We take train off and put val, then here we have val, and here we also have val.

138
00:08:50,710 --> 00:08:53,950
And instead of train we have Val.

139
00:08:53,950 --> 00:08:56,320
So where we had train, here we have val.

140
00:08:56,320 --> 00:08:58,000
We do the same for the test.

141
00:08:58,000 --> 00:09:02,410
So let's run this and then do the same for the test data set.

142
00:09:02,500 --> 00:09:08,710
We have test test and take this off.

143
00:09:08,710 --> 00:09:12,700
We have test and yeah test.

144
00:09:12,700 --> 00:09:18,820
So that's it for the train, validation and test datasets.

145
00:09:18,820 --> 00:09:21,280
You will notice that the shape is now different.

146
00:09:21,280 --> 00:09:23,110
Let's actually print out just the shapes.

147
00:09:23,110 --> 00:09:26,230
So let's say i.shape and then j.shape.

148
00:09:27,100 --> 00:09:29,710
Run that and see what we get.

149
00:09:29,710 --> 00:09:32,920
You see now we have 64 by 8 and 64 by 1.

150
00:09:32,920 --> 00:09:37,360
And that's simply because our batch size was set to 64.
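Putting the three calls together, the pipeline described so far might look like the sketch below. The data is a made-up stand-in, and BUFFER_SIZE, BATCH_SIZE and the other names are assumed spellings of what is said aloud, not the course's actual code.
```python
import numpy as np
import tensorflow as tf
BUFFER_SIZE = 16
BATCH_SIZE = 64
# Hypothetical stand-in data: 640 samples with 8 features and 1 target each.
x_train = np.random.rand(640, 8).astype("float32")
y_train = np.random.rand(640, 1).astype("float32")
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle individual samples, group them into batches, then prefetch batches.
train_dataset = (
    train_dataset.shuffle(buffer_size=BUFFER_SIZE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
for i, j in train_dataset:
    print(i.shape, j.shape)  # each element is now a whole batch
    break
```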

151
00:09:37,420 --> 00:09:40,750
Now we're not going to change anything at the level of the model.

152
00:09:40,750 --> 00:09:42,520
So we still have our model.

153
00:09:42,520 --> 00:09:43,630
Let's rerun that.

154
00:09:43,630 --> 00:09:48,460
So we reinitialize the parameters, and we recompile.

155
00:09:48,460 --> 00:09:53,170
And then here we're going to replace all this with train data set.

156
00:09:53,170 --> 00:09:54,880
So we'll just say train data set.

157
00:09:54,880 --> 00:09:55,870
And that's fine.

158
00:09:55,870 --> 00:09:59,920
And then for the validation now we have val data set.

159
00:10:00,280 --> 00:10:04,720
So take this off and rerun that again.

160
00:10:04,720 --> 00:10:07,600
And our training should all be fine.

161
00:10:07,600 --> 00:10:11,890
So you see, we go on with the training as usual.

162
00:10:11,890 --> 00:10:14,440
And we could check out our plots.

163
00:10:14,440 --> 00:10:19,480
And now the evaluation will be on the Val data set.

164
00:10:19,480 --> 00:10:21,520
So let's take this off.

165
00:10:21,520 --> 00:10:22,750
Run that again.

166
00:10:22,750 --> 00:10:26,080
Actually, it's going to be the test dataset.

167
00:10:26,110 --> 00:10:27,310
Test data set.

168
00:10:27,340 --> 00:10:28,090
Take that off.

169
00:10:28,090 --> 00:10:29,140
Run that again.

170
00:10:29,140 --> 00:10:30,670
And that should be fine.
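As a hedged sketch of these last steps: the model below is only a placeholder (the course's own model is defined elsewhere and is not shown in this section), the data is random stand-in data, and make_dataset is a made-up helper.
```python
import numpy as np
import tensorflow as tf
def make_dataset(num_samples, batch_size=64, buffer_size=16):
    # Hypothetical stand-in data with 8 features and 1 target per sample.
    x = np.random.rand(num_samples, 8).astype("float32")
    y = np.random.rand(num_samples, 1).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    return dataset.shuffle(buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)
train_dataset = make_dataset(256)
val_dataset = make_dataset(64)
test_dataset = make_dataset(64)
# Placeholder model standing in for the one built earlier in the course.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
# Each dataset element is already an (inputs, targets) batch,
# so no separate y or batch_size argument is passed.
history = model.fit(train_dataset, validation_data=val_dataset, epochs=2, verbose=0)
test_loss = model.evaluate(test_dataset, verbose=0)
print(history.history["val_loss"], test_loss)
```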

171
00:10:30,670 --> 00:10:39,220
So you see, we've replaced our initial data pipeline with a tf.data API

172
00:10:39,370 --> 00:10:40,660
pipeline.