WEBVTT

0
00:00.420 --> 00:05.420
In the last lesson, we got started using Beautiful Soup and we saw how we could

1
00:05.430 --> 00:10.430
use it to parse through the HTML of a website and pull out the pieces that we're

2
00:10.560 --> 00:11.393
interested in.

3
00:12.210 --> 00:16.920
Now it's no fun scraping a website that you've already got access to locally.

4
00:17.490 --> 00:21.900
It's much better if we can get hold of something that's currently live on the

5
00:21.900 --> 00:26.520
internet. So I'm going to go ahead and comment out all of this code,

6
00:27.960 --> 00:32.960
and I'm going to be using Beautiful Soup to get hold of a live website and grab

7
00:33.900 --> 00:34.800
data from it.

8
00:35.580 --> 00:40.170
And the website that we're going to be using is Y Combinator's Hacker

9
00:40.170 --> 00:41.003
News website.

10
00:41.670 --> 00:46.650
This is where anybody can post a link to a news piece that they've discovered

11
00:46.650 --> 00:51.300
that's tech related, or you could show off things that you've built.

12
00:51.660 --> 00:52.200
For example,

13
00:52.200 --> 00:56.430
I've just been looking at this guy's website that he built called My Desk Tour

14
00:57.030 --> 00:59.760
where you can post a picture of your desk setup,

15
01:00.150 --> 01:04.110
and you can see all of the tools and gear that they've got.

16
01:04.860 --> 01:08.160
So we're going to be scraping the main Hacker News website

17
01:08.190 --> 01:10.980
which is under this particular URL.

18
01:11.520 --> 01:15.450
And this is usually where I go to find the latest tech news.

19
01:16.200 --> 01:20.760
We're going to copy this URL, and we're going to go back to our main.py.

20
01:21.270 --> 01:25.050
And in order to download the data from that website,

21
01:25.080 --> 01:28.860
we're going to be using our handy friend, which is requests.

22
01:29.520 --> 01:34.520
Now requests allows us to get hold of the data from a particular URL,

23
01:36.060 --> 01:39.240
which in this case is news.ycombinator.com.

24
01:40.500 --> 01:43.080
And once we've made that request,

25
01:43.110 --> 01:48.110
then we can save the data that we get back in a response variable.

26
01:49.070 --> 01:50.840
And once we've got the response,

27
01:50.870 --> 01:54.350
we can actually print out the text of the response

28
01:54.890 --> 01:59.890
and this is basically equivalent to what we did when we opened up our HTML file

29
02:00.590 --> 02:04.040
and we read the file contents, the text of the file.
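
NOTE
A minimal sketch of the download step described above, not the exact code
typed on screen. The helper name fetch_html, the timeout value and the
raise_for_status() check are assumptions added for illustration.
```python
import requests
# URL taken from the address mentioned in the lesson.
HN_URL = "https://news.ycombinator.com/"
def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # Raise an error on 4xx/5xx responses.
    return response.text
# Usage (needs network access): print(fetch_html(HN_URL)[:200])
```
The returned string plays the same role as the contents read from the local
HTML file in the previous lesson.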

30
02:04.700 --> 02:06.950
So now if I go ahead and run this,

31
02:07.010 --> 02:10.850
then you can see that it's going to print out loads of stuff,

32
02:11.150 --> 02:15.860
but this is basically the code that represents this particular page.

33
02:16.310 --> 02:19.700
So in fact, if you right click and click view page source,

34
02:19.730 --> 02:24.470
you'll see that this is exactly the same HTML code that we're getting back over

35
02:24.470 --> 02:28.790
here. So we don't actually want all of this jumbled mess.

36
02:29.180 --> 02:34.180
What we're more interested in is the specific titles and the links

37
02:34.760 --> 02:37.100
for each of these pieces.

38
02:37.730 --> 02:41.480
It shows by default 30 of the top articles,

39
02:41.870 --> 02:43.940
and this is ranked by an algorithm.

40
02:43.970 --> 02:48.970
So it's most recent and also getting a lot of traction,

41
02:49.040 --> 02:50.420
so a lot of upvotes,

42
02:50.840 --> 02:55.730
but it doesn't represent the most upvoted items. So you can see here, in fact,

43
02:55.760 --> 02:58.340
the most upvoted at least today anyways,

44
02:58.430 --> 03:01.210
is Mozilla laying off 250 employees.

45
03:01.960 --> 03:06.960
So what if I wanted to get hold of the article title and the link of the post

46
03:07.840 --> 03:10.300
from this page that has the most points?

47
03:10.690 --> 03:14.590
I don't want to have to manually check all of this. I want to do it with code.

48
03:15.280 --> 03:18.040
So let's go ahead and scrape it. Now,

49
03:18.070 --> 03:21.970
if I right click on each of these titles and I click on inspect,

50
03:22.420 --> 03:26.530
then it takes me to the precise line of code in the HTML

51
03:26.560 --> 03:30.070
that's responsible for rendering this component.

52
03:30.550 --> 03:34.210
This is actually an anchor tag, so that's the

53
03:34.240 --> 03:39.240
a tag. And the text in the anchor tag is the title of the article,

54
03:41.470 --> 03:45.430
and then the href is the link that will take me to the actual story. So

55
03:45.460 --> 03:47.230
if I click on this, you can see

56
03:47.230 --> 03:51.820
it takes me to the actual news piece about Joan Feynman.

57
03:52.960 --> 03:54.880
So what about this point? Well,

58
03:54.880 --> 03:57.850
let's go ahead and right click on it and click inspect.

59
03:58.180 --> 04:03.180
You can see this is in a span and it has a class of score while this title is in

60
04:06.910 --> 04:10.570
an anchor tag with an href, and it has a class of storylink.

61
04:11.170 --> 04:13.420
So with those two pieces of information,

62
04:13.570 --> 04:17.350
we can use Beautiful Soup to scrape all of the titles,

63
04:17.500 --> 04:20.020
all of the links and all of their points.

64
04:20.470 --> 04:24.550
And we can compare all those points and figure out which one has the highest

65
04:24.550 --> 04:28.030
points on this page. So let's go ahead and do that.

66
04:29.020 --> 04:33.310
So I'm gonna save the yc_webpage as the response.text,

67
04:34.300 --> 04:38.710
and then I'm going to use Beautiful Soup to parse that webpage.

68
04:39.220 --> 04:40.480
So BeautifulSoup

69
04:40.540 --> 04:45.540
and then I'm going to pass in the actual HTML document that we want to parse.

70
04:45.910 --> 04:48.220
So this is the YC webpage,

71
04:48.820 --> 04:52.750
and then we provide the parser that we're going to use to parse it.

72
04:52.750 --> 04:56.890
So html.parser, with an ER at the end.

73
04:57.580 --> 05:01.930
And this is our soup. Once we've created our soup,

74
05:02.620 --> 05:06.580
the next step is actually to dig in the soup and find the parts that we want.

75
05:06.730 --> 05:11.290
So, for example, if I want to get hold of the title of the whole page,

76
05:11.290 --> 05:16.060
then I can just say print soup.title. And now you'll see

77
05:16.060 --> 05:19.000
it gives me the title which is Hacker News.

78
05:19.780 --> 05:24.100
And that's the same as what you see here in the tab bar.
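
NOTE
A small sketch of making the soup and reading the page title. The HTML
string is a made-up stand-in for response.text, so the example runs without
a network connection.
```python
from bs4 import BeautifulSoup
# Stand-in for response.text; the real page is much longer.
yc_webpage = "<html><head><title>Hacker News</title></head><body></body></html>"
soup = BeautifulSoup(yc_webpage, "html.parser")
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just the text inside it
```
Printing soup.title shows the tag itself; soup.title.string is one way to
get only the text that appears in the browser tab.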

79
05:24.910 --> 05:26.710
Now, what if I didn't want the title?

80
05:26.710 --> 05:31.150
What if I actually wanted to get hold of this text here,

81
05:31.750 --> 05:34.120
the title of each of these articles?

82
05:34.810 --> 05:39.810
See if you can figure out how to get hold of this text and print it out in your

83
05:40.330 --> 05:44.590
code. Remember it has the class storylink,

84
05:44.950 --> 05:46.360
and it's an anchor tag.

85
05:46.960 --> 05:50.410
Pause the video and see if you can get this title,

86
05:50.590 --> 05:53.380
so yours might be different from what I've got on screen of course.

87
05:53.440 --> 05:57.590
It depends on what's showing up on Hacker News on the day you are doing this.

88
05:58.010 --> 06:02.810
But get the title of the first article printed out using BeautifulSoup.

89
06:03.490 --> 06:04.323
<v 1>Okay.</v>

90
06:05.800 --> 06:08.920
<v 0>All right. What we want to do is we want to use find.</v>

91
06:09.430 --> 06:14.430
So we're going to find the first instance from this webpage where the actual

92
06:15.850 --> 06:20.080
name of the tag is equal to a, so that's an anchor tag,

93
06:20.770 --> 06:25.770
and then the class is equal to storylink.

94
06:27.250 --> 06:29.530
So I'm just gonna copy that and paste it in.

95
06:30.220 --> 06:34.600
Remember that in order to not clash with the reserved class keyword,

96
06:34.630 --> 06:37.210
we have to add an underscore afterwards.

97
06:38.200 --> 06:41.350
Now this should be our article tag.

98
06:42.100 --> 06:44.230
And if we go ahead and print it,

99
06:44.470 --> 06:49.470
then you can see that we get this exact anchor tag.

100
06:50.350 --> 06:54.130
But if we want to get hold of the text that's actually in the anchor tag,

101
06:54.190 --> 06:58.450
then we have to go one step further and call the getText method

102
06:58.570 --> 07:03.190
that's also from Beautiful Soup. So now when I run that, you can see

103
07:03.310 --> 07:06.670
I only get the actual text of the article.
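
NOTE
A sketch of the find step on made-up markup that mimics the structure
described in the lesson (an anchor tag with class storylink). The live
site's markup may differ from this sample. get_text() is the snake_case
spelling of the getText() call used on screen.
```python
from bs4 import BeautifulSoup
# Made-up markup; example.com links are placeholders.
sample_html = """
<a class="storylink" href="https://example.com/first">First story</a>
<a class="storylink" href="https://example.com/second">Second story</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
# find() returns only the first matching tag.
article_tag = soup.find(name="a", class_="storylink")
article_text = article_tag.get_text()
print(article_text)
```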

104
07:07.510 --> 07:12.310
Let's work on some of the other pieces. So this is the article text.

105
07:13.840 --> 07:18.840
And then if we want to get hold of the article_link and the article_upvotes.

106
07:22.300 --> 07:26.020
See if you can figure out how to complete these two parts as well.

107
07:26.200 --> 07:30.880
So we want the HTML link that is, of course, all of this HTTP,

108
07:30.880 --> 07:34.330
et cetera. And then we also want to get hold of the upvote

109
07:34.390 --> 07:37.150
which is this little number right here.

110
07:38.110 --> 07:41.080
It's inside a span with a class of score.

111
07:41.380 --> 07:42.213
<v 1>Right?</v>

112
07:44.500 --> 07:48.190
<v 0>All right. So let's do the first thing, which is article link. Well,</v>

113
07:48.190 --> 07:52.840
we can actually already tap into the same article tag we already got up here.

114
07:53.260 --> 07:55.060
And instead of saying getText,

115
07:55.090 --> 08:00.090
we can use the get method to get the specific value of an attribute.

116
08:02.080 --> 08:04.990
So what we want is of course, the href.
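
NOTE
A sketch of reading an attribute from a tag that was already found. The
markup and the example.com link are placeholders.
```python
from bs4 import BeautifulSoup
sample_html = '<a class="storylink" href="https://example.com/first">First story</a>'
soup = BeautifulSoup(sample_html, "html.parser")
article_tag = soup.find(name="a", class_="storylink")
# get() looks up an attribute by name on the tag.
article_link = article_tag.get("href")
print(article_link)
# get() returns None when the attribute is missing.
print(article_tag.get("title"))
```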

117
08:06.520 --> 08:09.190
And then the article_upvote,

118
08:09.250 --> 08:14.250
we'll have to tap into our soup and find the tag with a name that is span

119
08:17.140 --> 08:21.400
because this is what we're looking for, and has a class of score.
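
NOTE
A sketch of finding the score span and then reading its text, using a
made-up span that follows the structure described in the lesson.
```python
from bs4 import BeautifulSoup
sample_html = '<span class="score">19 points</span>'
soup = BeautifulSoup(sample_html, "html.parser")
score_tag = soup.find(name="span", class_="score")
print(score_tag)  # the whole span tag
article_upvote = score_tag.get_text()
print(article_upvote)  # only the text inside the span
```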

120
08:21.880 --> 08:22.713
<v 1>Right?</v>

121
08:24.400 --> 08:29.400
<v 0>Like this. Finding this particular tag is not enough.</v>

122
08:29.680 --> 08:34.570
This actually just gets us the tag. If we want to go further

123
08:34.570 --> 08:38.350
and we actually want to get the text that's inside that span

124
08:38.650 --> 08:41.230
which is of course the 19 points,

125
08:41.800 --> 08:45.280
then we have to dig one step deeper and call the

126
08:45.310 --> 08:49.450
getText method like this. Now,

127
08:49.480 --> 08:54.480
if I go ahead and print out the article_text and the article_link,

128
08:55.710 --> 08:59.550
and also finally the article_upvote,

129
09:00.090 --> 09:01.200
then you can see

130
09:01.200 --> 09:05.730
I get all three pieces of data that I'm interested in. Now,

131
09:05.790 --> 09:10.620
instead of getting the first occurrence, I want to get all of the ones that are

132
09:10.650 --> 09:15.570
on this page, so all 30 results. Now, in order to do that,

133
09:15.810 --> 09:20.810
I have to change the find to find_all both here and here.

134
09:23.280 --> 09:23.790
This way

135
09:23.790 --> 09:28.790
we get a list of all of the articles and I'll get a list of all of the

136
09:31.860 --> 09:32.693
article_upvotes.

137
09:33.990 --> 09:38.040
So now it's going to be a little bit different. In order to get all of the text

138
09:38.070 --> 09:41.730
and all of the links, then I have to use a for loop.

139
09:42.450 --> 09:46.410
So I'll say for article tag in articles,

140
09:46.740 --> 09:47.940
so articles is of course,

141
09:47.940 --> 09:52.410
this list where we find all of the anchor tags with a class of storylink,

142
09:53.130 --> 09:56.820
and then I'm going to loop through each one of those and for each of the tags,

143
09:56.850 --> 10:00.030
I'm going to get the text and also get the href.

144
10:02.100 --> 10:06.150
I'm going to create two new lists, article_texts and article_links.

145
10:08.550 --> 10:12.930
And then I'm going to save each of the new articles into those lists.
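
NOTE
A sketch of the find_all plus for loop step, again on made-up markup. The
third anchor has no storylink class, to show that it is skipped.
```python
from bs4 import BeautifulSoup
sample_html = """
<a class="storylink" href="https://example.com/first">First story</a>
<a class="storylink" href="https://example.com/second">Second story</a>
<a href="https://example.com/other">Not a story link</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
# find_all() returns every matching tag, in page order.
articles = soup.find_all(name="a", class_="storylink")
article_texts = []
article_links = []
for article_tag in articles:
    article_texts.append(article_tag.get_text())
    article_links.append(article_tag.get("href"))
print(article_texts)
print(article_links)
```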

147
10:20.690 --> 10:23.000
<v 0>Like this. And in fact,</v>

148
10:23.030 --> 10:28.030
we could probably simplify this a little bit by refactoring and renaming the

149
10:30.020 --> 10:31.040
article text,

150
10:31.070 --> 10:35.240
so the singular version into just text and the article_link

151
10:36.770 --> 10:38.420
to just the link.

152
10:39.740 --> 10:44.300
So now let's print out the lists. So the article_texts,

153
10:45.530 --> 10:49.010
the article_links and the article_upvotes.

154
10:49.760 --> 10:51.440
And this find_all

155
10:51.440 --> 10:55.760
gives me a list and I can't call getText on the list.

156
10:56.120 --> 10:59.720
So I'll also need to create a new list.

157
11:00.050 --> 11:02.360
So I'm going to choose to use list comprehension here.

158
11:03.260 --> 11:08.260
So I'm going to say for score in all of the scores,

159
11:10.370 --> 11:13.850
we're going to create a list using each of those scores

160
11:14.090 --> 11:17.450
and we're going to call getText in order to get each of them.
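
NOTE
A sketch comparing the list comprehension with the equivalent for loop.
The spans are sample data; the name article_upvotes_loop is only there to
hold the second result for comparison.
```python
from bs4 import BeautifulSoup
sample_html = """
<span class="score">40 points</span>
<span class="score">7 points</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")
scores = soup.find_all(name="span", class_="score")
# List comprehension version.
article_upvotes = [score.get_text() for score in scores]
# Equivalent for loop, written out in full.
article_upvotes_loop = []
for score in scores:
    article_upvotes_loop.append(score.get_text())
print(article_upvotes)
```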

161
11:18.500 --> 11:21.560
This is the same as writing out a for loop like this

162
11:21.650 --> 11:26.180
but it's obviously much shorter. Now, when I hit run,

163
11:26.210 --> 11:28.880
you can see that each of my lists are ordered.

164
11:29.150 --> 11:33.950
So this is the first article's text, this is the first article's link,

165
11:34.220 --> 11:36.560
and this is the first article's points.

166
11:38.660 --> 11:43.660
What we want to do is we want to get the article_upvotes into a number format,

167
11:45.080 --> 11:47.750
so an integer. And to do that,

168
11:47.780 --> 11:51.380
we of course have to get rid of the word "points" that comes afterwards.

169
11:51.830 --> 11:56.740
But notice how each of these items are strings. So that means we can split the string

170
11:56.890 --> 12:01.270
by the space and only get hold of the first item from the resulting list.

171
12:02.110 --> 12:06.070
Let me show you what I mean. So we've got all of the article upvotes,

172
12:06.190 --> 12:09.670
let's go ahead and just print out the first item.

173
12:10.380 --> 12:11.213
<v 2>Right.</v>

174
12:13.770 --> 12:18.090
<v 0>So now we just get the first item, which is 40 points. Now,</v>

175
12:18.090 --> 12:21.840
if I take that item and I call the split method,

176
12:21.900 --> 12:26.400
then it's going to split every word in the sentence. By default,

177
12:26.400 --> 12:31.350
it splits by the space. Now, if I run this code,

178
12:32.220 --> 12:32.910
you can see

179
12:32.910 --> 12:37.910
I get a list where I've got the first item being 40 and the second being points.

180
12:39.390 --> 12:42.900
So it's basically split that string by the space.

181
12:43.740 --> 12:48.300
Now the next stage is I could get hold of just the first item that comes from

182
12:48.300 --> 12:50.130
that list, which is now 40.

183
12:50.820 --> 12:54.180
If I now finally wrap it around an int,

184
12:54.270 --> 12:59.130
then I can turn that into an actual number. Don't worry if your number changes

185
12:59.130 --> 13:03.600
because you're pulling data live from a website. That upvote number can change at

186
13:03.600 --> 13:06.540
any second. So this is

187
13:06.540 --> 13:10.260
how we can get hold of the actual number from the upvotes.
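
NOTE
The three steps just described (split, take the first item, convert to an
integer) can be sketched on a plain string. The value "40 points" is the
sample shown in the lesson; a live page will give a different number.
```python
first_upvote = "40 points"
# split() with no argument splits on whitespace.
parts = first_upvote.split()
print(parts)  # ['40', 'points']
# Take the first item and convert it from a string to an integer.
number = int(parts[0])
print(number)  # 40
```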

188
13:10.890 --> 13:14.790
Now we're going to apply all of this .split

189
13:14.850 --> 13:19.740
and also getting hold of the first item into our list comprehension.

190
13:20.100 --> 13:22.920
So for each of the scores that soup finds,

191
13:23.160 --> 13:27.420
we're going to get hold of the text and then split the text and then get the first

192
13:27.420 --> 13:30.150
item from the text. And then finally,

193
13:30.180 --> 13:35.180
we wrap all of this around an int and turn it into an integer. Then if I go ahead

194
13:36.960 --> 13:39.120
and uncomment all these lines of code,

195
13:39.540 --> 13:43.350
then you can see I've got all of these numbers being printed out,

196
13:43.980 --> 13:46.110
which means I can now sort them.
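
NOTE
A sketch of the same conversion folded into the list comprehension. The
score values are sample data, and the sketch assumes every score text
starts with a whole number followed by a space.
```python
from bs4 import BeautifulSoup
sample_html = """
<span class="score">40 points</span>
<span class="score">1312 points</span>
<span class="score">1 point</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")
# Text of each span, split on whitespace, first item, converted to int.
article_upvotes = [
    int(score.get_text().split()[0])
    for score in soup.find_all(name="span", class_="score")
]
print(article_upvotes)
```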

197
13:47.670 --> 13:51.930
I want to get the index of the list item that has the highest value.

198
13:52.140 --> 13:55.950
And then I want to use that index to pick out the title, text,

199
13:56.100 --> 13:58.920
and also the link from these two lists,

200
13:59.400 --> 14:02.610
because they're all ordered in exactly the same way.

201
14:02.610 --> 14:07.610
So this first item corresponds to this first link, which corresponds to this first

202
14:07.950 --> 14:12.690
upvote. And I want to pose this to you as a challenge.

203
14:13.050 --> 14:16.080
Can you print out the title and link for the Hacker

204
14:16.080 --> 14:18.840
News story with the highest number of upvotes?

205
14:19.350 --> 14:22.080
Since we're working with three different lists at this point,

206
14:22.320 --> 14:27.320
you'll have to find the index of the largest number inside the article_upvotes

207
14:27.390 --> 14:28.830
list to accomplish this.

208
14:29.310 --> 14:32.910
I'll give you a few seconds to pause the video before I show you the solution.

209
14:36.330 --> 14:36.750
All right,

210
14:36.750 --> 14:41.750
here's the solution. We can use the max function that Python comes with to get

211
14:42.660 --> 14:46.440
the largest number from our article_upvotes.

212
14:48.450 --> 14:52.970
And then we can print this largest number and see if it works.

213
14:53.660 --> 14:58.040
So we've got 1,312. Now,

214
14:58.040 --> 15:00.650
once we've gotten hold of the largest number,

215
15:00.800 --> 15:04.160
then we can find its index from this list.

216
15:04.610 --> 15:07.640
So we can say article_upvotes.index

217
15:07.700 --> 15:11.300
and then we find the index of this largest number.

218
15:11.410 --> 15:12.243
<v 2>Right.</v>

219
15:16.270 --> 15:19.780
<v 0>Now, if we hit run, you can see that we're getting index number 27.</v>

220
15:20.470 --> 15:22.870
So instead of just printing out that index,

221
15:22.960 --> 15:27.960
we can print, instead, the article_texts with that index, so passing in the largest

222
15:30.010 --> 15:35.010
index, and also the article_links and passing in the same index.
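
NOTE
A sketch of the max and index lookup on sample lists standing in for the
scraped data. It assumes the three lists have the same length and order;
the length check is an addition for illustration, not part of the lesson.
```python
# Sample data standing in for the three scraped lists.
article_texts = ["First story", "Second story", "Third story"]
article_links = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]
article_upvotes = [40, 1312, 19]
# The index lookup only makes sense if the lists line up one-to-one.
assert len(article_texts) == len(article_links) == len(article_upvotes)
largest_number = max(article_upvotes)
# index() returns the position of the first occurrence of that value.
largest_index = article_upvotes.index(largest_number)
print(article_texts[largest_index])
print(article_links[largest_index])
```
If an entry on a live page had no score span, the upvotes list would be
shorter than the other two and the index would point at the wrong story.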

223
15:37.690 --> 15:38.980
So now if I hit run,

224
15:39.070 --> 15:44.070
you can see that the most popular article at the moment on this page has this

225
15:45.100 --> 15:48.640
title text and this particular link. Of course for you

226
15:48.640 --> 15:52.870
it will be different because it depends on what's currently showing up on Hacker

227
15:52.870 --> 15:55.600
News. But if I refresh this page,

228
15:55.630 --> 16:00.630
you can see that this article with 1,313 points is of course the most popular

229
16:02.500 --> 16:05.890
article and it is the one that's about Mozilla.

230
16:07.030 --> 16:12.030
You can imagine a use case for this where every day we scrape all the data on Y

231
16:12.250 --> 16:16.840
Combinator and then we send ourselves, through a text message or an email,

232
16:17.170 --> 16:20.860
the most upvoted title and article

233
16:21.100 --> 16:24.040
so that we can just look at that one thing.

234
16:25.150 --> 16:30.150
And you've seen now how we can use the requests module to get hold of the text,

235
16:31.090 --> 16:34.750
the HTML code from a particular website,

236
16:35.020 --> 16:38.350
and then use Beautiful Soup to parse through that website

237
16:38.770 --> 16:43.770
and then to get hold of these specific parts that we want by using find_all or

238
16:44.470 --> 16:45.303
find,

239
16:45.460 --> 16:49.900
and then getting hold of the text or getting hold of the link or getting hold of

240
16:49.900 --> 16:51.640
any other thing that we want.

241
16:52.480 --> 16:56.920
So now that we've seen how we can do web scraping using Beautiful Soup, in the

242
16:56.920 --> 16:57.670
next lesson

243
16:57.670 --> 17:02.670
I want to talk a little bit about the ethics of scraping websites and when to do

244
17:03.370 --> 17:06.340
it and what you can use the data you get from this

245
17:06.340 --> 17:10.420
for. So for all of that and more, I'll see you in the next lesson.