1
00:00:11,710 --> 00:00:17,920
In this lecture, we are going to look at a Colab notebook that does text classification using RNNs.

2
00:00:18,550 --> 00:00:24,460
The specific text classification we're going to work on is spam detection, which, as you can probably

3
00:00:24,460 --> 00:00:27,820
surmise on your own, is a very practical application.

4
00:00:28,750 --> 00:00:34,270
Note that there are some other examples of text classification, such as sentiment analysis, where

5
00:00:34,270 --> 00:00:40,250
you can try to guess whether a passage is positive or negative and also just document classification.

6
00:00:40,750 --> 00:00:45,820
For example, if you're a news site and you have thousands of articles, you may want to automatically

7
00:00:45,820 --> 00:00:51,550
classify them into different categories on your site, such as technology, local news, lifestyle,

8
00:00:51,550 --> 00:00:53,020
health, weather and so forth.

9
00:00:53,560 --> 00:00:56,980
So there are lots of basic applications of text classification.

10
00:00:58,850 --> 00:01:04,280
This lecture is going to walk you through a prepared Colab notebook, although a very good exercise,

11
00:01:04,280 --> 00:01:09,650
which I always recommend, is once you know how this is done, to try and recreate it yourself with

12
00:01:09,650 --> 00:01:11,310
as few references as possible.

13
00:01:12,110 --> 00:01:16,940
As usual, you can look at the title of the notebook to determine what notebook we are currently looking

14
00:01:16,940 --> 00:01:17,240
at.

15
00:01:19,180 --> 00:01:24,620
To start, we're going to load in our data set, which is a spam detection data set stored as a CSV.

16
00:01:25,300 --> 00:01:28,360
We use the wget command to download the data first.

17
00:01:37,650 --> 00:01:40,500
You'll notice that this is a very strangely formatted file.

18
00:01:43,880 --> 00:01:48,070
If we use the head command, you'll see that there appear to be some invalid characters.

19
00:01:48,620 --> 00:01:53,300
In addition, each line ends with three commas, but none of those columns contain any data.

20
00:01:58,700 --> 00:02:05,750
The next step is to load in our CSV using pd.read_csv. Notice that I've used an encoding called ISO

21
00:02:05,750 --> 00:02:07,610
8859-1.

22
00:02:08,330 --> 00:02:13,250
The default is UTF-8, which does not work on this file due to invalid characters.

23
00:02:17,470 --> 00:02:20,940
The next step is to do a df.head() to see what our data looks like.

24
00:02:26,090 --> 00:02:31,760
It seems that the first two columns are simply called V1 and V2, and there are three completely empty

25
00:02:31,760 --> 00:02:32,720
columns after that.

26
00:02:33,380 --> 00:02:37,460
This makes sense since each line of our file ended with three commas.

27
00:02:40,480 --> 00:02:44,470
So in the next line, I'm going to drop these three unnecessary columns.

28
00:02:48,120 --> 00:02:50,310
The next step is to do a df.head() again.

29
00:02:54,530 --> 00:02:57,560
And we see that the three columns are gone as expected.

30
00:03:00,310 --> 00:03:05,560
Next, because I don't like having my columns named v1 and v2, I'm going to name them something

31
00:03:05,560 --> 00:03:07,480
a little better: labels and data.

32
00:03:11,670 --> 00:03:17,130
The next step is to do a df.head() again to check that the columns have been successfully renamed.

33
00:03:22,960 --> 00:03:28,420
So currently, the labels are called ham and spam. As you know, in machine learning, we like our

34
00:03:28,420 --> 00:03:31,690
columns to be integers from zero up to K minus one.

35
00:03:32,500 --> 00:03:36,340
In our case, since K is two, we want our labels to be zero and one.

36
00:03:37,000 --> 00:03:42,550
We'll map ham to zero and spam to one. We'll assign this new column to b_labels.
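The loading and cleanup steps described so far can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the two sample rows are made up, the data is read from an in-memory buffer instead of the downloaded file, and the empty columns are dropped by position rather than by name.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical sample rows).
# The byte 0xA3 is a pound sign in ISO-8859-1 and is not valid UTF-8 on its own.
raw = (
    b"v1,v2,,,\n"
    b"ham,See you at lunch,,,\n"
    b"spam,Win \xa3100 now,,,\n"
)

df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")

# Drop the three empty columns created by the trailing commas.
df = df.drop(columns=df.columns[2:])

# Rename the remaining columns and map the string labels to integers.
df.columns = ["labels", "data"]
df["b_labels"] = df["labels"].map({"ham": 0, "spam": 1})
print(df)
```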

37
00:03:47,300 --> 00:03:50,510
The next step is to split our data into train and test.

38
00:03:54,670 --> 00:03:58,630
The next step is to check the shape of our data sets just as a sanity check.

39
00:04:03,580 --> 00:04:08,560
So in the next few blocks of code, we're going to begin writing the code for text preprocessing.

40
00:04:09,670 --> 00:04:12,740
We'll start by initializing our word to index dictionary.

41
00:04:13,420 --> 00:04:18,330
Currently, this will only contain the padding token, which is assigned to have the index zero.

42
00:04:19,150 --> 00:04:21,040
So that's the format of this dictionary.

43
00:04:21,220 --> 00:04:24,970
The key is the word and the value is its corresponding index.

44
00:04:25,540 --> 00:04:29,610
We also have a variable called idx, which I've initialized to one.

45
00:04:30,850 --> 00:04:35,660
Basically, this variable will store our current index as we loop through the train set.

46
00:04:36,370 --> 00:04:41,230
So notice it's initialized to one because zero is already taken by the padding token.

47
00:04:42,040 --> 00:04:46,410
But you'll see precisely how this word to index dictionary is used very shortly.

48
00:04:51,650 --> 00:04:57,350
The next step is to loop through each document in the train set. Our goal right now is to populate

49
00:04:57,350 --> 00:05:03,530
the word to index dictionary. That is, we want to store all the words in the train set and assign

50
00:05:03,530 --> 00:05:05,810
them all unique integer indices.

51
00:05:06,440 --> 00:05:07,600
So inside the loop,

52
00:05:07,640 --> 00:05:13,040
we begin by grabbing the data column, lowercasing the string and then calling the split method.

53
00:05:13,610 --> 00:05:16,010
This is the simplest way to do tokenization.

54
00:05:16,700 --> 00:05:22,370
Lots of people think tokenization is some magical operation, but in reality it's just a fancy string

55
00:05:22,370 --> 00:05:22,760
split.

56
00:05:23,450 --> 00:05:28,760
There are other things you can do, like split on punctuation or completely remove punctuation, but we

57
00:05:28,760 --> 00:05:30,500
won't consider those scenarios.

58
00:05:31,640 --> 00:05:35,050
Of course, it's easy to replace this line with anything you like.

59
00:05:35,390 --> 00:05:42,130
For example, the tokenizer from NLTK, spaCy or Gensim. It's essentially no effort to do that.

60
00:05:42,320 --> 00:05:44,900
So if you want to try that on your own, please do.

61
00:05:46,220 --> 00:05:50,390
The next step is to loop through each of our tokens. Inside this loop,

62
00:05:50,420 --> 00:05:54,940
we're going to check whether or not the token is currently in our word to index dictionary.

63
00:05:55,490 --> 00:06:01,010
If this is not the case, then we need to create a new entry, otherwise we don't need to do anything.

64
00:06:02,270 --> 00:06:08,060
So inside the if statement, we simply add this token as a key and set idx to be the value.

65
00:06:08,630 --> 00:06:12,620
We then increment idx to prepare it for the next token we find.

66
00:06:13,190 --> 00:06:19,760
OK, so hopefully you can see how this loop will populate the word2idx dictionary with unique indices

67
00:06:19,760 --> 00:06:22,430
assigned to each word in the training corpus.
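The vocabulary-building loop just described can be sketched like this. It is a minimal sketch under stated assumptions: the two training documents are made up, and the name of the padding token ("<PAD>") is a guess; only the names word2idx and idx come from the lecture.

```python
# Hypothetical stand-in for the training documents.
train_docs = ["Free entry in a weekly draw", "See you in a bit"]

# Index 0 is reserved for the padding token (token name is an assumption).
word2idx = {"<PAD>": 0}
idx = 1

for doc in train_docs:
    # Simplest possible tokenization: lowercase, then split on whitespace.
    tokens = doc.lower().split()
    for token in tokens:
        if token not in word2idx:
            word2idx[token] = idx
            idx += 1

print(word2idx)
print(len(word2idx))
```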

68
00:06:28,010 --> 00:06:33,170
The next step is to simply print out the word to index dictionary to confirm that it contains what we

69
00:06:33,170 --> 00:06:33,800
expect.

70
00:06:36,850 --> 00:06:39,540
OK, so hopefully this is what you expected to see.

71
00:06:42,850 --> 00:06:47,860
The next step is to check the size of the word2idx dictionary to see how many unique tokens we

72
00:06:47,860 --> 00:06:48,460
have.

73
00:06:53,190 --> 00:06:56,790
So as you can see, we have about ten thousand unique tokens.

74
00:07:00,530 --> 00:07:06,170
The next step, as you recall, is that we need to convert our text data into data that can be accepted

75
00:07:06,170 --> 00:07:07,310
by a neural network.

76
00:07:07,790 --> 00:07:13,470
This means we have to convert each document into a list of word indices. In order to do this,

77
00:07:13,490 --> 00:07:19,910
we'll start by creating an empty list called train_sentences_as_int, hopefully for obvious reasons.

78
00:07:20,720 --> 00:07:22,670
The next step is to loop through the train set

79
00:07:22,700 --> 00:07:28,340
once again. Note that I've made a little comment here, which reminds you that we could have done this

80
00:07:28,340 --> 00:07:30,880
step and the previous step all at once.

81
00:07:31,460 --> 00:07:37,310
That is, we could have populated word2idx and converted the data into lists of integers at the

82
00:07:37,310 --> 00:07:38,010
same time.

83
00:07:38,510 --> 00:07:40,750
So you may want to try that as an exercise.

84
00:07:41,840 --> 00:07:44,450
I think doing it separately makes things a little more clear.

85
00:07:45,950 --> 00:07:51,530
OK, so inside the loop, we're going to re-tokenize the current document. So you can see how this

86
00:07:51,530 --> 00:07:55,370
is a bit inefficient since we've tokenized every document twice.

87
00:07:56,870 --> 00:08:03,350
Once we have the tokens, we then map them to integers using the word2idx dictionary. After we get

88
00:08:03,350 --> 00:08:08,870
the result, which I've called sentence_as_int, we then append that to our list of train inputs.
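As a rough illustration of this conversion step, here is a minimal sketch. The tiny word2idx dictionary and the documents are made up for the example; the variable names follow the ones mentioned in the lecture.

```python
# Hypothetical vocabulary and training documents.
word2idx = {"<PAD>": 0, "free": 1, "entry": 2, "see": 3, "you": 4}
train_docs = ["Free entry", "See you"]

train_sentences_as_int = []
for doc in train_docs:
    # Tokenize again, exactly as when the vocabulary was built.
    tokens = doc.lower().split()
    # Every training token is guaranteed to be in word2idx.
    sentence_as_int = [word2idx[token] for token in tokens]
    train_sentences_as_int.append(sentence_as_int)

print(train_sentences_as_int)
```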

89
00:08:14,530 --> 00:08:19,660
The next step is to do the same kind of loop for the test set, but this time it's going to be a bit

90
00:08:19,660 --> 00:08:20,320
different.

91
00:08:20,950 --> 00:08:26,380
The main reason is because we can't be sure that every word that appears in the test set also appeared

92
00:08:26,380 --> 00:08:32,020
in the train set. One common way of dealing with this is to simply not include those words.

93
00:08:32,500 --> 00:08:36,610
Another common method is to have a special token just for unknown words.

94
00:08:36,850 --> 00:08:38,020
But then you would have to train

95
00:08:38,020 --> 00:08:40,720
your neural network weights for unknown words as well.

96
00:08:41,530 --> 00:08:46,830
In this case, we're just going to ignore any words that do not appear in our word2idx dictionary.

97
00:08:47,320 --> 00:08:50,020
Other than that, this loop is the same as above.
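The test-set version of the loop might look like the sketch below. The vocabulary and test documents are hypothetical; the only difference from the train loop is the membership check that skips unknown words.

```python
# Hypothetical vocabulary built from the train set only.
word2idx = {"<PAD>": 0, "free": 1, "entry": 2, "see": 3, "you": 4}
test_docs = ["Free lunch", "See you"]  # "lunch" never appeared in training

test_sentences_as_int = []
for doc in test_docs:
    tokens = doc.lower().split()
    # Silently drop any token that is not in the vocabulary.
    sentence_as_int = [word2idx[token] for token in tokens if token in word2idx]
    test_sentences_as_int.append(sentence_as_int)

print(test_sentences_as_int)
```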

98
00:08:54,910 --> 00:08:58,960
OK, so the next step is to check the size of our train and test inputs.

99
00:09:02,070 --> 00:09:04,710
So as you can see, they are the same as before.

100
00:09:08,400 --> 00:09:10,490
The next step is to create a data generator.

101
00:09:11,160 --> 00:09:16,230
This is optional since theoretically you could just loop through the data structures we already have.

102
00:09:16,830 --> 00:09:21,460
However, this makes the subsequent PyTorch code more similar to what you've seen before.

103
00:09:22,530 --> 00:09:29,130
So this function is going to take in three arguments, the inputs X, the targets Y and the batch size.

104
00:09:29,910 --> 00:09:30,960
Inside the function,

105
00:09:30,960 --> 00:09:36,450
we're going to start by shuffling the inputs and targets. As you recall, when we do batch gradient

106
00:09:36,450 --> 00:09:36,940
descent,

107
00:09:37,260 --> 00:09:39,510
we like to shuffle the data on each epoch.

108
00:09:40,530 --> 00:09:45,840
We then compute the number of batches, which is the ceiling of the size of Y, divided by the batch

109
00:09:45,840 --> 00:09:46,500
size.

110
00:09:47,040 --> 00:09:52,170
The reason we want to use the ceiling is because if they don't divide evenly, we still want to include

111
00:09:52,170 --> 00:09:53,410
the leftover points.

112
00:09:54,780 --> 00:09:57,240
The next step is to loop through each of our batches.

113
00:09:59,280 --> 00:10:05,640
Inside the loop, we're going to start by computing the end index of the batch. Normally this would

114
00:10:05,640 --> 00:10:08,120
just be i plus one times batch size.

115
00:10:08,430 --> 00:10:11,660
But remember, the final batch may not be a full batch.

116
00:10:12,120 --> 00:10:17,100
So we need to account for that by taking the min of the usual value and the length of y.

117
00:10:18,390 --> 00:10:25,290
The next step is to index X and y to obtain X_batch and y_batch using the indices we just found.

118
00:10:26,490 --> 00:10:30,750
OK, so please double check these indices and make sure they make sense to you.

119
00:10:31,200 --> 00:10:35,570
If they do not make sense to you, then apply my rule when in doubt, print it out.
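In the spirit of printing it out, here is a small sketch of the batch index arithmetic. The sample count of 70 is made up, and the loop variable name i is an assumption; the ceiling and the min follow the description above.

```python
import math

n_samples = 70   # hypothetical dataset size
batch_size = 32

# Ceiling so the leftover points still form a (smaller) final batch.
n_batches = math.ceil(n_samples / batch_size)

bounds = []
for i in range(n_batches):
    start = i * batch_size
    # The final batch may not be full, so cap the end index.
    end = min((i + 1) * batch_size, n_samples)
    bounds.append((start, end))
    print(i, start, end)
```

With these numbers, the last batch covers only indices 64 up to 70, which is six samples instead of 32.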

120
00:10:38,920 --> 00:10:44,890
OK, so currently X_batch is a list of documents, each represented as a list of integers.

121
00:10:45,550 --> 00:10:51,050
There's one problem with this, which is that each document in our dataset could have a different length.

122
00:10:51,070 --> 00:10:54,550
In fact, they most likely do. By default,

123
00:10:54,550 --> 00:10:57,190
RNNs don't work with sequences of variable length.

124
00:10:57,580 --> 00:11:02,380
So unless you want to write some additional code to handle that, then the typical way to deal with

125
00:11:02,380 --> 00:11:06,040
this is to make each sample in the batch have the same length.

126
00:11:06,790 --> 00:11:10,930
In order to do this, we first have to find the longest sequence in the batch.

127
00:11:11,500 --> 00:11:18,190
Of course, this is so that we can include all the data that appears in the batch, so all the shorter

128
00:11:18,190 --> 00:11:22,160
sequences will have padding and the longest sequence will not have any padding.

129
00:11:23,140 --> 00:11:28,130
OK, so max_len is just the maximum length of the documents in the batch.

130
00:11:28,870 --> 00:11:31,900
The next step is to loop through each sample in the batch.

131
00:11:32,590 --> 00:11:36,430
Inside this loop, we first grab the sample, which is called little x.

132
00:11:36,910 --> 00:11:41,740
So little x at this point is a list of integers representing the document.

133
00:11:42,340 --> 00:11:44,540
As always, when in doubt, print it out.

134
00:11:45,340 --> 00:11:48,310
OK, so the next step is to determine the amount of padding.

135
00:11:49,000 --> 00:11:55,810
Of course, this is just max_len minus the length of x. This is because after padding, we want all

136
00:11:55,810 --> 00:11:57,930
the documents to have the length max_len.

137
00:11:58,960 --> 00:12:05,770
So we multiply max_len minus the length of x by a list containing zero, which, as you recall, doesn't

138
00:12:05,770 --> 00:12:07,180
actually do multiplication.

139
00:12:07,750 --> 00:12:12,860
When you use the asterisk operator with a list, it simply repeats the list that many times.

140
00:12:13,090 --> 00:12:15,480
So what we get back is a list of zeros.

141
00:12:16,240 --> 00:12:21,080
The final step is to join the padding and x together and store this in X_batch.

142
00:12:21,550 --> 00:12:25,660
So this will overwrite what was in X_batch at the start of this loop.
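The padding loop can be sketched as follows. The batch contents are made up, and the loop variable j is an assumption; the padding is placed in front of each sequence, which matches the padded rows starting with zeros.

```python
# Hypothetical batch: three documents of different lengths.
X_batch = [[5, 2, 9], [7], [4, 1]]

# Longest sequence in this batch; it will receive no padding.
max_len = max(len(x) for x in X_batch)

for j in range(len(X_batch)):
    x = X_batch[j]
    # [0] * n repeats the list n times, giving a list of n zeros.
    pad = [0] * (max_len - len(x))
    # Overwrite the original entry with the padded version.
    X_batch[j] = pad + x

print(X_batch)
```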

143
00:12:28,570 --> 00:12:33,010
Now, once this inner loop is complete, X_batch will be a non-jagged array.

144
00:12:33,580 --> 00:12:38,440
So at this point we can convert both X_batch and y_batch into Torch tensors.

145
00:12:39,220 --> 00:12:42,970
The final step in this function is to yield X_batch and y_batch.

146
00:12:43,400 --> 00:12:45,270
As you recall, this is a generator.

147
00:12:45,400 --> 00:12:47,500
So we want to yield and not return.
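Putting the pieces together, the generator could look roughly like the sketch below. This is not the notebook's exact code: the function name data_generator, the use of random.shuffle, and the integer dtype of the targets are all assumptions, and the sample data is made up.

```python
import math
import random

import torch


def data_generator(X, y, batch_size=32):
    # Shuffle inputs and targets together (the notebook may use another helper).
    order = list(range(len(y)))
    random.shuffle(order)
    X = [X[i] for i in order]
    y = [y[i] for i in order]

    n_batches = math.ceil(len(y) / batch_size)
    for i in range(n_batches):
        end = min((i + 1) * batch_size, len(y))
        X_batch = X[i * batch_size:end]
        y_batch = y[i * batch_size:end]

        # Left-pad every sequence with zeros up to the longest one in the batch.
        max_len = max(len(x) for x in X_batch)
        X_batch = [[0] * (max_len - len(x)) + x for x in X_batch]

        # Target dtype is an assumption; it depends on the loss being used.
        yield (
            torch.tensor(X_batch, dtype=torch.long),
            torch.tensor(y_batch, dtype=torch.long),
        )


# Hypothetical data: five documents already converted to integer lists.
X = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10], [11]]
y = [0, 1, 0, 1, 0]
for inputs, targets in data_generator(X, y, batch_size=2):
    print(inputs.shape, targets.shape)
```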

148
00:12:53,120 --> 00:12:55,250
The next step is to test our new function.

149
00:12:55,820 --> 00:12:59,030
This will also show you how our generator is to be used.

150
00:12:59,600 --> 00:13:06,210
As expected, we use it in a loop and it takes in our inputs and labels as arguments. Inside the loop,

151
00:13:06,230 --> 00:13:11,630
we're going to print out one input target pair along with their shapes. We'll then break out of the

152
00:13:11,630 --> 00:13:13,640
loop since we only need to see one.

153
00:13:17,730 --> 00:13:24,210
OK, so here you can see the input tensor, which is a 2D array of integers as expected, where most

154
00:13:24,210 --> 00:13:26,490
of the rows just start with a bunch of zeros.

155
00:13:27,000 --> 00:13:28,920
This makes sense since that's the padding.

156
00:13:29,970 --> 00:13:36,060
We can see that the target tensor is just an array of zeros and ones with a size of 32 as expected.

157
00:13:39,800 --> 00:13:44,480
The next step is to do the same thing, but with our test data set, just so you can see how we can

158
00:13:44,480 --> 00:13:45,900
loop through that data as well.

159
00:13:50,790 --> 00:13:53,240
As you can see, it's pretty much the same thing.

160
00:13:58,270 --> 00:14:01,120
The next step is to set the device to be the GPU.

161
00:14:06,520 --> 00:14:07,970
Next, we define the model.

162
00:14:08,590 --> 00:14:14,800
This is the same as the RNNs we looked at earlier, but with an embedding layer up front. So as input

163
00:14:14,800 --> 00:14:20,320
we take in the vocab size, the embedding dimension, the number of hidden units, the number of hidden

164
00:14:20,320 --> 00:14:26,530
layers and the number of outputs. The two relevant things for us in this section are, of course,

165
00:14:26,530 --> 00:14:32,980
the vocab size V and the embedding dimension D, since that specifies the size of our embedding matrix.

166
00:14:36,820 --> 00:14:42,010
Next, we create our three layers, the embedding, the LSTM and the final dense layer.

167
00:14:45,600 --> 00:14:51,390
In the forward function, we pass our data through each of the layers. First, we create our initial

168
00:14:51,390 --> 00:14:54,150
LSTM states, h0 and c0.

169
00:14:57,420 --> 00:15:00,100
Next, we pass our data through the embedding layer.

170
00:15:00,480 --> 00:15:02,470
This gives us an N by T by D.

171
00:15:04,140 --> 00:15:07,020
Next, we pass our data through the LSTM layer.

172
00:15:07,440 --> 00:15:12,680
This gives us an N by T by M. Next, we do a global max pool.

173
00:15:13,080 --> 00:15:16,750
This collapses the T dimension and gives us an N by M.

174
00:15:18,150 --> 00:15:22,880
Finally, we pass our data through the last dense layer and this gives us an N by K.
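The model and the shape changes described above can be sketched like this. It is a minimal sketch assuming PyTorch's nn.Embedding, nn.LSTM and nn.Linear; the class name, argument names and the hyperparameter values in the usage lines are made up for illustration.

```python
import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, n_vocab, embed_dim, n_hidden, n_rnnlayers, n_outputs):
        super().__init__()
        self.M = n_hidden
        self.L = n_rnnlayers
        # Embedding matrix of size V x D.
        self.embed = nn.Embedding(n_vocab, embed_dim)
        self.rnn = nn.LSTM(
            input_size=embed_dim,
            hidden_size=n_hidden,
            num_layers=n_rnnlayers,
            batch_first=True,
        )
        self.fc = nn.Linear(n_hidden, n_outputs)

    def forward(self, X):
        # Initial hidden and cell states, each L x N x M.
        h0 = torch.zeros(self.L, X.size(0), self.M, device=X.device)
        c0 = torch.zeros(self.L, X.size(0), self.M, device=X.device)

        out = self.embed(X)                # N x T x D
        out, _ = self.rnn(out, (h0, c0))   # N x T x M
        out, _ = torch.max(out, dim=1)     # global max pool over T: N x M
        return self.fc(out)                # N x K


# Hypothetical sizes: V=50, D=20, M=15, one layer, K=1.
model = RNN(n_vocab=50, embed_dim=20, n_hidden=15, n_rnnlayers=1, n_outputs=1)
X = torch.randint(0, 50, (4, 7))  # N=4 documents, T=7 tokens each
out = model(X)
print(out.shape)
```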

175
00:15:28,980 --> 00:15:35,940
Next, we instantiate the RNN and move the parameters to the GPU. As before, I'm going to speed through

176
00:15:35,940 --> 00:15:38,550
the usual boilerplate code which we've seen.

177
00:15:44,200 --> 00:15:46,450
The next step is to create the loss and optimizer.

178
00:15:50,950 --> 00:15:55,840
The next step is to create simple lambda functions that can be used to call our data generators for

179
00:15:55,840 --> 00:16:01,470
train and test. You'll see that these shorter names are much easier to use throughout the notebook.

180
00:16:06,020 --> 00:16:08,190
The next step is to define the training function.

181
00:16:09,140 --> 00:16:15,650
The thing to pay attention to here is how we use the train_gen and test_gen lambda functions. Also note

182
00:16:15,650 --> 00:16:19,190
that we explicitly move the inputs and targets to the GPU.

183
00:16:20,030 --> 00:16:22,670
Other than that, you've essentially seen all this before.

184
00:16:29,700 --> 00:16:31,860
The next step is to call the training function.

185
00:16:42,050 --> 00:16:45,560
The next step is to plot the train and test loss per iteration.

186
00:16:49,740 --> 00:16:52,170
OK, so the loss per iteration looks good.

187
00:16:56,980 --> 00:17:03,490
The next step is to calculate the accuracy. Again, notice that we simply use our train_gen and test_gen

188
00:17:03,490 --> 00:17:04,630
lambda functions.

189
00:17:12,780 --> 00:17:18,870
All right, so we end up doing pretty well, in the high 90s. So conceivably, if you wanted to, say, create

190
00:17:18,870 --> 00:17:24,750
your own Web server that processes some type of email or digital message, you can now filter spam very

191
00:17:24,750 --> 00:17:25,440
accurately.

192
00:17:26,190 --> 00:17:31,320
As usual, please take this opportunity to try different hyperparameters yourself and see if you

193
00:17:31,320 --> 00:17:33,050
can improve the results on your own.

194
00:17:33,720 --> 00:17:37,110
So please try that as an exercise and I'll see you in the next lecture.
