1
00:00:11,160 --> 00:00:15,810
So in this lecture, we'll be looking at the notebook for how to do topic modeling with LDA.

2
00:00:16,650 --> 00:00:21,240
We'll begin this notebook by downloading our data set, which is the BBC news data once again.

3
00:00:22,200 --> 00:00:27,210
Note that although this data set is for classification and comes with labels, we won't be using those

4
00:00:27,210 --> 00:00:29,580
labels since LDA is unsupervised.

5
00:00:30,240 --> 00:00:36,000
In addition, a nice and simple exercise you can always try is to plug in a new data set into this notebook

6
00:00:36,000 --> 00:00:37,470
and repeat the analysis.

7
00:00:45,120 --> 00:00:46,920
The next step is to do our imports.

8
00:00:47,610 --> 00:00:53,330
Note that LDA appears in the decomposition module, which is also the home of other methods like PCA,

9
00:00:53,430 --> 00:00:56,430
SVD, NMF, and factor analysis.

10
00:00:56,970 --> 00:01:00,900
So that should give you some idea of what family of algorithms LDA belongs to.

11
00:01:01,800 --> 00:01:05,700
Also, take notice that we are using the CountVectorizer and not TF-IDF.

12
00:01:06,420 --> 00:01:11,820
This is because the model itself is based on word counts and not distributions based on TF-IDF.

13
00:01:18,500 --> 00:01:21,260
The next step is to download the stop words from NLTK.

14
00:01:26,420 --> 00:01:31,730
The next step is to convert our stop words list into a set, since we'll be adding more stop words in the next

15
00:01:31,730 --> 00:01:32,180
step.

16
00:01:36,820 --> 00:01:39,880
The next step is to add more words to our set of stop words.

17
00:01:40,570 --> 00:01:45,550
Basically, I ran the script originally without these stop words, but added them later when I saw them

18
00:01:45,550 --> 00:01:46,900
appear in the topics.

19
00:01:47,590 --> 00:01:53,680
So this is an example of how one would refine their code based on the results they see. If you see generic

20
00:01:53,680 --> 00:01:54,680
words in your output,

21
00:01:54,700 --> 00:01:58,810
it's likely that they are not useful, and so you can go back and simply remove them.
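
As a rough sketch of this refinement step: the starting list and the extra words below are hypothetical placeholders, not the actual NLTK list or the words added in the notebook.

```python
# Hypothetical starting list; the notebook starts from NLTK's English stop words.
base_stops = ["the", "a", "and", "of", "to", "in"]

# Convert the list to a set so that adding words and checking membership is cheap.
stops = set(base_stops)

# Generic words noticed in the topic plots on a first run (illustrative only).
stops = stops.union({"said", "would", "could", "also", "mr"})

print(sorted(stops))
```

A set also removes duplicates automatically, so adding a word that is already present is harmless.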

22
00:02:03,650 --> 00:02:07,070
The next step is to load in our data using pd.read_csv.

23
00:02:11,800 --> 00:02:15,880
The next step is to call df.head() to remind ourselves what this data looks like.

24
00:02:21,700 --> 00:02:25,990
As you can see, the data set is a two column data frame of text and labels.

25
00:02:30,690 --> 00:02:35,400
The next step is to create a CountVectorizer, passing in the stop words we previously defined.

26
00:02:40,510 --> 00:02:44,170
The next step is to transform our text into a count vector format.
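
A minimal sketch of these two steps, assuming scikit-learn is installed. The three toy documents and the tiny stop word set are made up for illustration; the notebook uses the BBC text column and the full stop word set instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the BBC articles (hypothetical).
docs = [
    "the stock market fell as oil prices rose",
    "the team won the match in the final game",
    "oil firms reported strong sales growth",
]

# Tiny stop word set for illustration; passed as a list, which
# scikit-learn accepts for the stop_words argument.
stops = {"the", "as", "in"}
vectorizer = CountVectorizer(stop_words=list(stops))

# X is a sparse documents-by-vocabulary matrix of raw word counts.
X = vectorizer.fit_transform(docs)
print(X.shape)
```

Each row of X is one document and each column is one vocabulary word, which is the input format LDA expects.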

27
00:02:50,920 --> 00:02:56,140
The next block of code is a simple comment stating that you could potentially split the data into train

28
00:02:56,140 --> 00:02:59,830
and test at this point and use the test set to evaluate the model.

29
00:03:00,760 --> 00:03:06,460
Note that for LDA, common metrics are the log likelihood of the data or the perplexity, which is just

30
00:03:06,470 --> 00:03:08,830
the exponential of the negative per-word log likelihood.
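
To make the relationship concrete, here is a small sketch. The function name is made up for illustration, and it assumes the log likelihood is normalized by the number of words, which is the usual convention for perplexity.

```python
import math

def perplexity(total_log_likelihood, n_words):
    """Exponential of the negative average log likelihood per word."""
    return math.exp(-total_log_likelihood / n_words)

# Sanity check: a model that assigns probability 1/1000 to every word
# should have a perplexity of about 1000, regardless of document length.
n_words = 500
log_likelihood = n_words * math.log(1 / 1000)
print(perplexity(log_likelihood, n_words))
```

Lower perplexity means the model assigns higher probability to the data, so lower is better.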

31
00:03:13,040 --> 00:03:15,330
The next step is to create our LDA instance.

32
00:03:15,890 --> 00:03:18,440
Note that I've chosen 10 components arbitrarily.

33
00:03:19,190 --> 00:03:24,410
As usual, this can be optimized if you have some metric you care about, which is generally chosen

34
00:03:24,410 --> 00:03:25,790
based on out of sample data.

35
00:03:26,480 --> 00:03:31,580
For example, if you're doing classification, then you might want to use test accuracy to choose the

36
00:03:31,580 --> 00:03:32,960
best n_components.

37
00:03:33,860 --> 00:03:39,080
Also note that I've set the random state. Since there's some randomness in the LDA algorithm,

38
00:03:39,380 --> 00:03:44,180
the results will be different each time you run this. Setting the random state will ensure that we

39
00:03:44,180 --> 00:03:45,410
get the same results.
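
Here is a small sketch of that reproducibility point, assuming scikit-learn and NumPy are installed. The random count matrix, the 3 components, and the seed value are arbitrary stand-ins rather than the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 20 documents, 30 vocabulary words.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 30))

# Two models with the same random_state, fit on the same data.
lda_a = LatentDirichletAllocation(n_components=3, random_state=12345).fit(X)
lda_b = LatentDirichletAllocation(n_components=3, random_state=12345).fit(X)

# With a fixed seed, the learned topics-by-words matrices match.
same = np.allclose(lda_a.components_, lda_b.components_)
print(lda_a.components_.shape, same)
```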

40
00:03:50,600 --> 00:03:54,290
The next step is to fit the LDA model, passing in our count matrix.

41
00:03:58,400 --> 00:04:03,080
Note that the fitting process takes quite some time compared to other scikit-learn models.

42
00:04:03,710 --> 00:04:08,960
Generally speaking, Bayesian learning and variational inference have the potential to be slower than

43
00:04:08,960 --> 00:04:10,730
similar non-Bayesian methods.

44
00:04:16,070 --> 00:04:18,350
The next step is to define a function called plot_top_words.

45
00:04:18,380 --> 00:04:22,220
Now, the details of this function aren't that important.

46
00:04:22,490 --> 00:04:24,140
So we won't be discussing what's in it.

47
00:04:24,620 --> 00:04:26,900
This code is simply copied and pasted from the

48
00:04:26,900 --> 00:04:28,130
scikit-learn documentation.

49
00:04:28,490 --> 00:04:31,730
And basically, it generates the plots I showed you in the previous lecture.

50
00:04:32,480 --> 00:04:35,570
What's important to know is what it does, not how it works.

51
00:04:36,440 --> 00:04:40,460
So basically, the job of this function is to make a bar plot for each topic.

52
00:04:41,060 --> 00:04:45,990
As you recall, each topic is expressed in terms of words. For each topic,

53
00:04:46,010 --> 00:04:51,200
what we want is a bar plot showing the sorted word count for the top words pertaining to that topic.

54
00:04:51,950 --> 00:04:56,210
In any case, this is much easier to see in a picture, so it will all make sense shortly.

55
00:04:57,350 --> 00:05:02,270
Note that for simplicity, we'll plot just the top 10 words per topic, which, as you can see, we

56
00:05:02,270 --> 00:05:05,060
have set as the default argument for n_top_words.

57
00:05:10,840 --> 00:05:14,950
The next step is to call the plot_top_words function, which we just defined above.

58
00:05:16,040 --> 00:05:21,130
Recall that the second argument corresponds to the feature names, which are just column indices to the

59
00:05:21,130 --> 00:05:21,970
LDA model.

60
00:05:23,110 --> 00:05:27,910
What we really need is an index-to-word mapping, which is just the reverse of the word-to-index

61
00:05:27,910 --> 00:05:30,550
mapping, which you saw earlier in this course.

62
00:05:31,240 --> 00:05:35,380
Now, since we didn't create the count matrix ourselves, we don't actually have this mapping.

63
00:05:35,920 --> 00:05:38,650
However, it is stored in the CountVectorizer object.

64
00:05:39,590 --> 00:05:44,200
Specifically, we can obtain the index-to-word mapping by calling this weirdly named function,

65
00:05:44,200 --> 00:05:45,400
get_feature_names_out.

66
00:05:46,090 --> 00:05:50,260
This will give us a list of feature names which we can then pass in to plot top words.
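
To show what the index-to-word mapping is used for, here is a stripped-down sketch of the lookup without any plotting. The helper name top_words, the six-word vocabulary, and the weights are all made up; in the notebook the weights would come from the fitted model's components_ attribute and the names from get_feature_names_out.

```python
import numpy as np

# Hypothetical index-to-word mapping: position i holds the word for column i.
feature_names = np.array(["film", "game", "market", "music", "oil", "team"])

# Hypothetical topics-by-words weights (2 topics, 6 words).
components = np.array([
    [0.1, 5.0, 0.2, 0.1, 0.3, 4.0],
    [0.2, 0.1, 6.0, 0.3, 3.5, 0.1],
])

def top_words(components, feature_names, n_top_words=10):
    """Return the highest-weighted words for each topic, best first."""
    result = []
    for weights in components:
        # Sort column indices by weight, largest first, and keep the top few.
        top_idx = np.argsort(weights)[::-1][:n_top_words]
        result.append([str(feature_names[i]) for i in top_idx])
    return result

topics = top_words(components, feature_names, n_top_words=2)
print(topics)  # [['game', 'team'], ['market', 'oil']]
```

The model only ever sees column indices, so the feature names are what turn each topic back into readable words.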

67
00:05:58,960 --> 00:06:02,950
OK, so as you can see, we get a plot just like the one I showed you in the lecture.

68
00:06:03,760 --> 00:06:07,150
What is nice about all this is that the results make a lot of sense.

69
00:06:07,780 --> 00:06:13,750
For instance, the first topic consists of the words people, mobile, technology, music and so forth.

70
00:06:14,290 --> 00:06:19,600
This is clearly related to consumer technologies such as phones and digital music, which of course

71
00:06:19,600 --> 00:06:21,310
are commonly played from phones.

72
00:06:22,300 --> 00:06:29,290
The second topic has the words US, big, company, market, firm, sales, growth, oil and so forth.

73
00:06:29,860 --> 00:06:33,790
This is probably related to the economy, finance and the stock market.

74
00:06:34,720 --> 00:06:40,690
The third topic has the words government, US, economic, economy, world and so forth.

75
00:06:41,500 --> 00:06:43,690
Notice how the word US shows up again.

76
00:06:44,380 --> 00:06:49,930
So for topics, this is acceptable since a single word can be associated with multiple topics.

77
00:06:50,710 --> 00:06:55,210
This topic also seems to be related to the economy, but on a more global scale.

78
00:06:56,770 --> 00:07:02,500
The fourth topic has the words TV, people, broadband, search, video, web and so forth.

79
00:07:03,100 --> 00:07:06,370
So this seems to be related to TV and internet services.

80
00:07:07,840 --> 00:07:12,220
The fifth topic has the words law, lord, government, bill and so on.

81
00:07:12,760 --> 00:07:16,060
So basically words related to the law and legal matters.

82
00:07:19,750 --> 00:07:23,500
As an exercise, please go through topics six up to 10 now yourself.

83
00:07:24,250 --> 00:07:29,830
You should find that these cover even more topics like film, music, software, politics and sports.

84
00:07:32,110 --> 00:07:37,330
Also, if you recall the labels in this data set, you can relate these topics to the ground truth labels

85
00:07:37,330 --> 00:07:37,870
as well.

86
00:07:38,560 --> 00:07:41,290
In fact, the topics are more fine-grained than the labels.

87
00:07:41,770 --> 00:07:46,480
For example, both music and film would belong to entertainment, despite the fact that they should

88
00:07:46,480 --> 00:07:48,010
likely be separate topics.

89
00:07:52,890 --> 00:07:58,020
Now, we just got a chance to look at the topics, which, as you recall, are a property of the model.

90
00:07:58,830 --> 00:08:02,640
The next step is to look at our documents to see how they relate to our topics.

91
00:08:03,360 --> 00:08:06,720
As you recall, LDA essentially gives us two matrices.

92
00:08:07,140 --> 00:08:10,860
One with documents by topics and another with topics by words.

93
00:08:11,340 --> 00:08:13,140
We just saw topics by words.

94
00:08:13,500 --> 00:08:15,870
So now we want to see documents by topics.

95
00:08:16,920 --> 00:08:19,500
To get this matrix, we call transform on X.

96
00:08:19,800 --> 00:08:21,150
And this gives us back Z.

97
00:08:22,050 --> 00:08:28,300
As you recall, Z is the letter we commonly use in machine learning for hidden variables. In clustering,

98
00:08:28,320 --> 00:08:31,720
this would represent cluster IDs. For topic modeling,

99
00:08:31,740 --> 00:08:34,320
this represents a distribution over topics.
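
A minimal sketch of what transform returns, assuming scikit-learn and NumPy are installed. The random count matrix and the 3 components are placeholders rather than the notebook's BBC data and 10 topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 20 documents, 30 vocabulary words.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 30))

lda = LatentDirichletAllocation(n_components=3, random_state=12345)
lda.fit(X)

# Z is documents by topics; each row is one document's topic distribution.
Z = lda.transform(X)
print(Z.shape)        # (20, 3)
print(Z.sum(axis=1))  # each entry is approximately 1
```

Because each row sums to one, a row of Z can be read directly as the proportion of each topic in that document.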

100
00:08:41,970 --> 00:08:47,720
So the next step is to pick a random document and plot its distribution. We'll begin by setting NumPy's

101
00:08:47,730 --> 00:08:50,340
random seed so that we get a consistent result.

102
00:08:51,870 --> 00:08:56,010
The next step is to select a random row index from our data frame, which we'll call i.

103
00:08:56,970 --> 00:09:02,670
The next step is to grab the ith row of Z, which will give us a 1D array, which represents a distribution

104
00:09:02,670 --> 00:09:03,660
over topics.

105
00:09:05,070 --> 00:09:09,420
The next step is to define a list of topics which are just integers from one up to 10.

106
00:09:10,110 --> 00:09:15,090
As you recall, topics are latent variables, which means they don't have any inherent meaning other

107
00:09:15,090 --> 00:09:16,320
than what we assign to them.

108
00:09:18,210 --> 00:09:23,580
The next step is to draw a bar chart plotting the topic distribution along with the topics themselves,

109
00:09:23,580 --> 00:09:24,990
as well as the true label.
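
The steps leading up to the bar chart can be sketched roughly as follows. The Z matrix here is a hypothetical stand-in sampled from a Dirichlet distribution, not the output of the fitted model, and the seed value is arbitrary.

```python
import numpy as np

# Hypothetical stand-in for Z: 50 documents, 10 topics, each row sums to 1.
rng = np.random.default_rng(0)
Z = rng.dirichlet(alpha=np.full(10, 0.1), size=50)

# Fix the seed so the same document is picked on every run.
np.random.seed(0)
i = np.random.choice(Z.shape[0])

# The i-th row is a 1-D array: this document's distribution over topics.
z = Z[i]

# Topics are just the integers 1 through 10; they have no inherent meaning.
topics = np.arange(10) + 1

# The topic that carries the most weight for this document.
dominant_topic = topics[np.argmax(z)]
print(i, dominant_topic)
```

From here, a bar chart call such as matplotlib's plt.barh(topics, z) would draw the distribution, with the true label placed in the title.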

110
00:09:32,630 --> 00:09:39,080
OK, so as you can see, essentially all of the probability goes to topic 10, and the true label is sport.

111
00:09:39,830 --> 00:09:43,580
So let's go back up to topic 10 and make sure that this makes sense.

112
00:09:46,950 --> 00:09:54,210
OK, so topic 10 has the top words game, England, games, players, time, play, Wales and so forth.

113
00:09:54,840 --> 00:09:58,020
So clearly this is related to some sport they play in England.

114
00:10:03,930 --> 00:10:08,700
The next step is to print the article to further check that this topic assignment makes sense.

115
00:10:15,040 --> 00:10:18,520
OK, so the article title is "Charvis set to lose fitness bid".

116
00:10:19,330 --> 00:10:20,600
So this makes sense.

117
00:10:20,650 --> 00:10:23,380
Clearly, this article is about some famous athlete.

118
00:10:28,410 --> 00:10:31,380
The next step is to repeat the same code for a new article.

119
00:10:37,550 --> 00:10:42,620
OK, so as you can see, this article has a majority of the weight on topic seven and a small amount

120
00:10:42,620 --> 00:10:44,300
of weight on topics nine and three.

121
00:10:45,170 --> 00:10:47,840
Note that the label for this article is entertainment.

122
00:10:48,650 --> 00:10:50,480
So let's see what these topics mean.

123
00:10:57,110 --> 00:11:04,130
OK, so topic seven is related to the words film, best, world, awards and so forth, so that makes sense.

124
00:11:04,220 --> 00:11:06,350
Clearly, this has to do with entertainment.

125
00:11:12,850 --> 00:11:17,590
So again, let's print the article to make sure that this really is related to entertainment and

126
00:11:17,590 --> 00:11:18,670
specifically film.

127
00:11:25,610 --> 00:11:30,350
OK, so the title is "Oscars steer clear of controversy", which makes complete sense.

128
00:11:30,740 --> 00:11:33,080
The Oscars are awards for those in the film industry.