WEBVTT

0
00:00.060 --> 00:00.750
Hey guys,

1
00:00.750 --> 00:05.330
welcome to Day 45 of 100 Days of Code. Now,

2
00:05.330 --> 00:08.450
today, we're going to be getting back to coding with Python,

3
00:08.900 --> 00:12.080
and we're going to be learning how to scrape the web for data

4
00:12.320 --> 00:14.810
using a module called BeautifulSoup.

5
00:15.770 --> 00:19.040
Now, we've been working with APIs for quite a while,

6
00:19.550 --> 00:20.600
and we know that

7
00:20.630 --> 00:25.630
we can use a website's API to access its data or to interact with the

8
00:26.120 --> 00:27.860
website using code.

9
00:28.520 --> 00:32.810
But some websites don't have an API, or their API

10
00:32.810 --> 00:35.630
doesn't let us do all the things that we want to do.

11
00:36.500 --> 00:40.460
So this is where we start thinking about using web scraping,

12
00:41.090 --> 00:44.540
where we look through the underlying HTML code

13
00:44.570 --> 00:48.110
of a website to get hold of the information that we want.

14
00:49.130 --> 00:52.880
So the aim of today is to learn how to make soup,

15
00:53.390 --> 00:57.410
but not this kind of soup. We're going to be making BeautifulSoup.

16
00:57.980 --> 00:59.960
What exactly is BeautifulSoup? Well,

17
00:59.990 --> 01:04.990
it's a module that helps developers like us make sense of websites.

18
01:06.170 --> 01:09.770
We could think of a lot of websites as a bit of a spaghetti soup,

19
01:10.190 --> 01:14.000
even something seemingly as simple as the Google front page,

20
01:14.270 --> 01:16.850
when you right-click on it and view the page source,

21
01:17.120 --> 01:20.060
you can see that it's horrendously complicated.

22
01:20.570 --> 01:25.100
And if you want to make sense of this webpage and pull out the relevant parts

23
01:25.100 --> 01:25.933
of the data,

24
01:26.270 --> 01:31.130
then you'll need an HTML parser like BeautifulSoup so that you can

25
01:31.130 --> 01:36.130
find and pull out the HTML elements that you're interested in from this

26
01:36.680 --> 01:39.140
soup of jumbled HTML code.

27
01:39.770 --> 01:42.380
And once we've mastered this skill,

28
01:42.500 --> 01:46.040
then we'll be able to take any website, for example,

29
01:46.070 --> 01:48.980
Empire's 100 Greatest Movies Of All Time,

30
01:49.220 --> 01:53.000
this is a huge list of a hundred movies that apparently everyone should have

31
01:53.000 --> 01:54.950
watched at some point in their life,

32
01:55.400 --> 01:58.460
and we can pull out the parts relevant to us,

33
01:58.700 --> 02:02.540
namely the title and the ranking of each movie,

34
02:02.840 --> 02:07.840
and we're going to use it to compile a list of movies that we have to watch so

35
02:07.880 --> 02:11.390
that we can look at the list, cross out the ones that we've already seen,

36
02:11.720 --> 02:16.280
and then pick one at random from the list so that we can watch all of the

37
02:16.280 --> 02:19.550
hundred greatest movies of all time. That's the goal.

38
02:19.760 --> 02:24.290
And once you're ready, head over to the next lesson and we're going to get started

39
02:24.350 --> 02:25.850
using BeautifulSoup.