1
00:00:11,120 --> 00:00:17,990
So in this lecture, we'll be looking at the notebook to do text classification using an RNN. We'll begin

2
00:00:17,990 --> 00:00:22,190
by downloading our data set, which will be the BBC News data once again.

3
00:00:22,850 --> 00:00:27,960
But note that at this point in the course, you've learned the skills to plug in any data set you like,

4
00:00:27,980 --> 00:00:31,190
so please feel free to do that as an exercise.

5
00:00:38,040 --> 00:00:41,160
The next step is to import everything we need for this notebook.

6
00:00:48,190 --> 00:00:52,390
The next step is to load in our CSV using pd.read_csv.

7
00:00:57,470 --> 00:01:02,120
The next step is to call df.head() to remind ourselves what our DataFrame looks like.

8
00:01:06,190 --> 00:01:09,570
As you can see, we have two columns, text and labels.

9
00:01:13,400 --> 00:01:17,150
The next step is to assign a numerical target to each of our labels.

10
00:01:21,540 --> 00:01:25,410
The next step is to determine the number of classes, which we'll call K.
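As a rough sketch of these two steps, assuming a small made-up list of labels rather than the real BBC News column, the mapping and K might be computed like this (the names label_to_index and targets are illustrative, not taken from the notebook):

```python
# Hypothetical label column; the real notebook reads these from the BBC News CSV.
labels = ["business", "sport", "tech", "sport", "politics", "business"]

# Map each distinct label to an integer index (sorted so the mapping is reproducible).
label_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}

# Replace each label string with its integer target.
targets = [label_to_index[label] for label in labels]

# K is the number of classes, which later sets the size of the output layer.
K = len(label_to_index)

print(label_to_index)
print(targets, K)
```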

11
00:01:30,200 --> 00:01:33,200
The next step is to split our data into train and test sets.

12
00:01:38,350 --> 00:01:42,220
The next step is to convert our text sentences into sequences.

13
00:01:43,030 --> 00:01:49,030
Note that we are no longer using TF-IDF, but instead will be representing each document as a list of

14
00:01:49,030 --> 00:01:49,690
integers.

15
00:01:50,380 --> 00:01:53,260
Note that we've set the max vocab size to two thousand.
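To make the idea concrete, here is a minimal pure-Python sketch of turning documents into lists of integers with a capped vocabulary. It is not the tokenizer used in the notebook; the helper names fit_word_index and texts_to_sequences, the whitespace tokenization, and the tiny max_vocab_size are all assumptions for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, max_vocab_size):
    """Convert each text to integers, dropping words whose index is >= max_vocab_size."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < max_vocab_size]
        for text in texts
    ]

train_texts = ["the match was won", "the market was down", "the match was lost"]
word_index = fit_word_index(train_texts)

# The full mapping keeps every word, even though only the most frequent ones are used.
print("true vocab size:", len(word_index))

sequences = texts_to_sequences(train_texts, word_index, max_vocab_size=4)
print(sequences)
```

Note that the word-to-index mapping still contains every word seen in training, which is why the true vocab size can be much larger than the max vocab size.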

16
00:02:00,000 --> 00:02:05,460
The next step is to assign our word-to-index mapping to a variable and also to check the true vocab

17
00:02:05,460 --> 00:02:06,090
size.

18
00:02:10,580 --> 00:02:13,970
As you can see, there are over 27000 tokens.

19
00:02:18,620 --> 00:02:21,080
The next step is to pad our training sequences.

20
00:02:25,460 --> 00:02:31,070
As you can see, our documents are quite long, with a maximum length of a few thousand tokens.

21
00:02:31,850 --> 00:02:34,100
This makes sense since they are news articles.

22
00:02:37,760 --> 00:02:40,430
The next step is to pad the test sequences as well.

23
00:02:40,790 --> 00:02:47,390
But this time, we set the max length to T. This will emulate how we use this model in the real world,

24
00:02:47,780 --> 00:02:52,130
since we wouldn't know the length of any future data that we want to use this model on.
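The padding logic can be sketched in plain Python as follows. The helper pad_left is hypothetical; it is meant to mimic the default behavior of Keras pad_sequences (zeros added at the front, long sequences truncated from the front), and the example sequences are made up.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad short sequences and left-truncate long ones to a common length."""
    if maxlen is None:
        # No length given: use the longest sequence, as we do for the train set.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen tokens if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

train_sequences = [[1, 3, 2], [1, 2], [4, 1, 3, 2, 5]]
test_sequences = [[2, 1], [1, 2, 3, 4, 5, 6, 7]]

# T comes from the training data only.
train_padded = pad_left(train_sequences)
T = len(train_padded[0])

# Test documents are forced to length T, whatever their original length.
test_padded = pad_left(test_sequences, maxlen=T)

print(T)
print(test_padded)
```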

25
00:02:58,740 --> 00:03:00,450
The next step is to create a model.

26
00:03:01,680 --> 00:03:04,170
You can see that I've chosen an embedding size of 20.

27
00:03:04,560 --> 00:03:07,470
But you should feel free to change this as an exercise.

28
00:03:12,790 --> 00:03:18,820
The next step is to create our RNN layers, so we have the input, followed by embedding, followed

29
00:03:18,820 --> 00:03:22,750
by LSTM, followed by global max pooling, followed by dense.
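A minimal sketch of this layer stack in Keras might look like the following. Only the embedding size of 20 comes from the lecture; the values of V, T, M and K are placeholder assumptions, and in the notebook they would come from the tokenizer, the padded data and the labels.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model

# Illustrative sizes; only D = 20 is mentioned in the lecture.
V = 2000   # vocabulary size (word indices run from 1 to V, 0 is padding)
T = 500    # padded sequence length
D = 20     # embedding dimensionality
M = 32     # number of LSTM hidden units
K = 5      # number of classes

i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)               # each token becomes a D-dimensional vector
x = LSTM(M, return_sequences=True)(x)    # one M-dimensional hidden state per time step
x = GlobalMaxPooling1D()(x)              # max over time, leaving M features
x = Dense(K)(x)                          # K outputs, one per class (logits, no activation)
model = Model(i, x)

model.summary()
```

Since the final dense layer here has no activation, this sketch assumes a loss that works on logits, such as sparse categorical cross-entropy with from_logits=True.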

30
00:03:29,620 --> 00:03:33,430
Now, it's important to remember that there is no formula for hyperparameters.

31
00:03:33,760 --> 00:03:36,220
You simply choose them based on experiments.

32
00:03:37,180 --> 00:03:41,110
As such, your exercise for this lecture is to do these experiments.

33
00:03:41,680 --> 00:03:48,010
I've listed some ideas, such as using multiple LSTM layers instead of just one, using the GRU

34
00:03:48,010 --> 00:03:52,690
instead of the LSTM, and using the SimpleRNN instead of the LSTM.

35
00:03:53,500 --> 00:03:58,150
In addition, try setting return_sequences to False and not using global max pooling.

36
00:03:58,810 --> 00:04:01,840
As you recall, this will just keep the final hidden state.
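To see the difference between the two options, here is a small sketch on made-up hidden states for a single document. The numbers are hypothetical and no actual RNN is run; the point is only what each option keeps.

```python
# Hypothetical hidden states for one document: 4 time steps, 3 hidden units.
hidden_states = [
    [0.1, 0.9, -0.2],
    [0.4, 0.2, 0.0],
    [0.8, -0.5, 0.3],
    [0.2, 0.1, 0.1],
]

# Global max pooling: for each hidden unit, take the max over all time steps.
max_pooled = [max(column) for column in zip(*hidden_states)]

# return_sequences=False: keep only the hidden state at the final time step.
final_state = hidden_states[-1]

print(max_pooled)
print(final_state)
```

Both results have one value per hidden unit, so either one can be passed to the dense layer.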

37
00:04:02,800 --> 00:04:07,450
Note that for all the above options, you should also try different values for the number of hidden

38
00:04:07,450 --> 00:04:08,020
units.

39
00:04:09,190 --> 00:04:12,970
And finally, note that you can also do combinations of the above.

40
00:04:13,510 --> 00:04:19,510
So, for example, multiple GRU layers instead of just one, or a SimpleRNN with return_sequences

41
00:04:19,510 --> 00:04:20,500
set to False.

42
00:04:25,000 --> 00:04:28,840
The next step is to call compile and fit, both of which you've seen before.

43
00:04:43,110 --> 00:04:45,330
The next step is to plot the loss per epoch.

44
00:04:51,260 --> 00:04:52,970
So the loss per epoch looks good.

45
00:04:57,030 --> 00:04:59,610
The next step is to plot the accuracy per epoch.

46
00:05:05,150 --> 00:05:10,910
So the accuracy per epoch looks good. As expected, performance on the train set is better.

47
00:05:12,590 --> 00:05:16,850
Note that training an LSTM network appears to be less stable than training a CNN.

48
00:05:17,390 --> 00:05:21,290
This is just par for the course as your models get more complex.

49
00:05:21,620 --> 00:05:27,050
Your loss per epoch starts to get more erratic and less robust to different hyperparameter values.

50
00:05:29,230 --> 00:05:34,860
So as a final exercise for this lecture, please continue computing other metrics like the F1 and the

51
00:05:34,860 --> 00:05:35,560
AUC.
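As one possible starting point for the F1 part of this exercise, here is a pure-Python sketch of the macro-averaged F1 score. The function name macro_f1 and the toy predictions are assumptions; in practice you would likely use a library routine on the model's predicted classes.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Average the per-class F1 scores, treating each class as one-vs-rest."""
    f1_scores = []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Made-up targets and predictions for a 3-class problem.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```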

52
00:05:37,060 --> 00:05:42,550
In addition, compare the performance of this model to previous models we used on this dataset.

53
00:05:43,150 --> 00:05:45,040
You may be surprised at the result.

54
00:05:45,730 --> 00:05:51,550
Consider why this is the case and think about how this might help guide your decisions about which models

55
00:05:51,550 --> 00:05:53,530
to use in the real world.