WEBVTT

0
00:00.300 --> 00:02.970
Hey guys, welcome to day 53

1
00:02.970 --> 00:05.430
of 100 Days of Code.

2
00:05.430 --> 00:09.060
Today it's time for your Capstone project,

3
00:09.060 --> 00:11.880
and it's time to review everything that we've learned

4
00:11.880 --> 00:14.190
over the last 10 days or so.

5
00:14.190 --> 00:17.070
Everything to do with web scraping.

6
00:17.070 --> 00:18.870
The project that we're going to be tackling

7
00:18.870 --> 00:21.360
is a data entry job.

8
00:21.360 --> 00:23.970
Now, there are a lot of data entry jobs out there

9
00:23.970 --> 00:27.510
where you're kind of just meant to transfer data

10
00:27.510 --> 00:29.040
from one format to another.

11
00:29.040 --> 00:32.700
So maybe you have it in a physical print copy,

12
00:32.700 --> 00:35.370
or maybe it's on a website, maybe it's in a PDF,

13
00:35.370 --> 00:37.620
and you just have to transfer it somewhere else,

14
00:37.620 --> 00:40.800
usually typing it into a spreadsheet.

15
00:40.800 --> 00:43.890
The inspiration for this project came from

16
00:43.890 --> 00:45.810
when I was browsing Reddit actually

17
00:45.810 --> 00:48.090
on the r/Python subreddit,

18
00:48.090 --> 00:51.540
which is a really good community for you to actually look at

19
00:51.540 --> 00:54.000
and see what other people are doing with Python

20
00:54.000 --> 00:56.880
and see the latest and greatest things built

21
00:56.880 --> 00:59.100
or news about Python.

22
00:59.100 --> 01:01.650
Now, one of the posts I saw was asking

23
01:01.650 --> 01:04.890
whether anyone has automated their job completely,

24
01:04.890 --> 01:06.810
basically using Python.

25
01:06.810 --> 01:09.570
Now we've seen how powerful Python can be,

26
01:09.570 --> 01:11.640
especially when we apply it to web scraping

27
01:11.640 --> 01:14.400
using Beautiful Soup and Selenium.

28
01:14.400 --> 01:17.040
And looking through all the comments,

29
01:17.040 --> 01:19.590
there are actually a lot of people who have done this,

30
01:19.590 --> 01:22.170
including this one guy

31
01:22.170 --> 01:27.170
who pretty much automated his entire job.

32
01:27.600 --> 01:30.420
And the jobs that tend to be easily automated

33
01:30.420 --> 01:33.300
using Python are data entry jobs,

34
01:33.300 --> 01:36.270
moving data from one format to another.

35
01:36.270 --> 01:39.690
And if you think about it, if that job is in fact remote,

36
01:39.690 --> 01:44.690
so if you search on Indeed.com for a remote data entry job

37
01:44.790 --> 01:47.700
and you get up and running with the company

38
01:47.700 --> 01:49.440
and you start doing it manually,

39
01:49.440 --> 01:52.800
and then once you've understood what it is you have to do,

40
01:52.800 --> 01:56.910
for example, gathering statistical data, preparing reports,

41
01:56.910 --> 01:58.830
and maintaining spreadsheets,

42
01:58.830 --> 02:02.190
if you realize that this is a large part of your job

43
02:02.190 --> 02:05.610
and you can automate it pretty much with Python,

44
02:05.610 --> 02:08.940
then you could probably get Python to do 70% of your job

45
02:08.940 --> 02:11.160
and spend the remaining 30% of the day

46
02:11.160 --> 02:13.950
doing the rest of the job, but still being paid

47
02:13.950 --> 02:16.950
as a full-on worker with full benefits.

48
02:16.950 --> 02:18.450
So this is something that a lot of people

49
02:18.450 --> 02:22.050
in the Python community have talked about and explored.

50
02:22.050 --> 02:24.960
And this is something that we are going to be trying out

51
02:24.960 --> 02:29.430
using both Beautiful Soup and Selenium in this project.

52
02:29.430 --> 02:31.140
In our case, we're going to be tackling

53
02:31.140 --> 02:33.480
a research data entry job

54
02:33.480 --> 02:36.450
where we're researching house prices

55
02:36.450 --> 02:40.080
that fit particular criteria for a client

56
02:40.080 --> 02:41.580
on the Zillow website

57
02:41.580 --> 02:45.300
and then we're going to be transferring that data into a form,

58
02:45.300 --> 02:48.420
which will create a spreadsheet in Google Sheets.

59
02:48.420 --> 02:52.380
And that is usually how, as a data entry person,

60
02:52.380 --> 02:54.180
we would make our money.

61
02:55.230 --> 02:58.110
Now, because this is a Capstone project,

62
02:58.110 --> 03:00.360
we're going to be using everything that we've learned

63
03:00.360 --> 03:02.100
in this section.

64
03:02.100 --> 03:05.430
That means Beautiful Soup as well as Selenium.

65
03:05.430 --> 03:06.810
You might have to brush up

66
03:06.810 --> 03:08.220
on some of the things you learned,

67
03:08.220 --> 03:10.290
especially the stuff on Beautiful Soup,

68
03:10.290 --> 03:12.090
which we covered a few days ago

69
03:12.090 --> 03:14.010
and we're going to combine all the skills

70
03:14.010 --> 03:15.720
that you've learned so far.

71
03:15.720 --> 03:17.400
And this project really is going to test

72
03:17.400 --> 03:20.640
all of your web scraping skills that you've acquired so far

73
03:20.640 --> 03:23.070
and see how far you can run with it.

74
03:23.070 --> 03:24.480
Because it's a Capstone project,

75
03:24.480 --> 03:26.610
there's not going to be a lot of guidance,

76
03:26.610 --> 03:28.230
so you're going to have to persevere

77
03:28.230 --> 03:30.570
and try to see if you can solve your own problems

78
03:30.570 --> 03:33.900
and see if you can get to the end outcome.

79
03:33.900 --> 03:36.960
Let's say that you have a client who wants you

80
03:36.960 --> 03:41.100
to compile a list of all the places that they can rent

81
03:41.100 --> 03:45.570
in San Francisco up to $3,000 per month,

82
03:45.570 --> 03:48.870
and it has to have at least one bedroom.

83
03:48.870 --> 03:50.790
Now, San Francisco is notorious

84
03:50.790 --> 03:53.820
for really expensive housing,

85
03:53.820 --> 03:55.440
and it's also often really difficult

86
03:55.440 --> 03:58.350
to actually find somewhere that you want to live.

87
03:58.350 --> 04:01.320
On Zillow, you can already filter on these things.

88
04:01.320 --> 04:03.900
So for example, you could say, this is the area,

89
04:03.900 --> 04:06.810
San Francisco, California, that I want to rent in,

90
04:06.810 --> 04:09.060
and then of course, changing it to for rent

91
04:09.060 --> 04:10.440
rather than for sale,

92
04:10.440 --> 04:15.390
switching the maximum price up to $3,000

93
04:15.390 --> 04:17.760
and then adding in the extra requirement

94
04:17.760 --> 04:20.910
that it must have at least one bedroom.

95
04:20.910 --> 04:24.270
Now, we could use the live version of Zillow

96
04:24.270 --> 04:25.860
to do this project,

97
04:25.860 --> 04:29.280
but the problem is that websites frequently get updated.

98
04:29.280 --> 04:32.790
Companies like Zillow continuously improve their site,

99
04:32.790 --> 04:36.060
so they might change the structure of their HTML,

100
04:36.060 --> 04:38.460
the names of their CSS classes

101
04:38.460 --> 04:42.660
and add pop-ups to the website or introduce CAPTCHAs,

102
04:42.660 --> 04:45.300
which cause issues for Selenium.

103
04:45.300 --> 04:46.890
For this course, I want to make sure

104
04:46.890 --> 04:49.410
that we can all practice writing our code

105
04:49.410 --> 04:52.410
in a stable environment that doesn't change

106
04:52.410 --> 04:55.710
and that my code solution continues to work.

107
04:55.710 --> 04:59.070
That's why I've created a clone of Zillow's website

108
04:59.070 --> 05:01.590
so that you can practice and test your knowledge

109
05:01.590 --> 05:03.693
of Beautiful Soup and Selenium.

110
05:04.590 --> 05:08.370
Open up the Zillow clone website inside your Chrome browser.

111
05:08.370 --> 05:10.590
And as you can see, I've created a snapshot

112
05:10.590 --> 05:12.060
of the Zillow site

113
05:12.060 --> 05:15.030
where I've already narrowed down the search criteria.

114
05:15.030 --> 05:17.850
I've picked San Francisco as the location,

115
05:17.850 --> 05:21.240
and I've picked for rent rather than for sale,

116
05:21.240 --> 05:24.810
and I've specified the price as up to $3,000

117
05:24.810 --> 05:27.990
for a one-bedroom apartment.

118
05:27.990 --> 05:31.170
So the URL you should use with Selenium for this project

119
05:31.170 --> 05:36.170
should read, https://appbrewery.github.io/Zillow-Clone.

120
05:42.630 --> 05:44.910
Now, in addition to using that URL,

121
05:44.910 --> 05:47.490
you're also going to be using Beautiful Soup

122
05:47.490 --> 05:50.820
to scrape through all of this data.

123
05:50.820 --> 05:54.930
And what we want is the price, the address,

124
05:54.930 --> 05:58.830
and also the URL that this will link to.

125
05:58.830 --> 06:01.080
So for example, when I click on this,

126
06:01.080 --> 06:04.680
it will link to the actual listing of the place.

127
06:04.680 --> 06:06.510
And then once you've scraped all of that data

128
06:06.510 --> 06:10.260
using Beautiful Soup, then you're going to be using Selenium

129
06:10.260 --> 06:13.740
to automatically fill in a Google Form.

130
06:13.740 --> 06:15.870
So we're going to be adding in the address of the property,

131
06:15.870 --> 06:18.210
the price per month, and the link to the property,

132
06:18.210 --> 06:21.120
and of course, we're going to fill out one of these forms

133
06:21.120 --> 06:24.000
per listing that we have on Zillow.

134
06:24.000 --> 06:26.700
And once all of those forms have been submitted,

135
06:26.700 --> 06:31.470
then you'll have the option to turn it into a spreadsheet.

136
06:31.470 --> 06:34.770
Whenever you create a form in Google Forms,

137
06:34.770 --> 06:37.290
you can see that when you go to the responses tab,

138
06:37.290 --> 06:39.420
you can click on this button

139
06:39.420 --> 06:42.690
in order to create a Google Sheet from the responses

140
06:42.690 --> 06:44.100
that have been submitted

141
06:44.100 --> 06:45.570
and this is what you end up with,

142
06:45.570 --> 06:49.290
a spreadsheet with the address of the property,

143
06:49.290 --> 06:52.380
the price per month, and a link to the property.

144
06:52.380 --> 06:54.810
So this way, once you've done this research,

145
06:54.810 --> 06:56.760
then you can send it to your client

146
06:56.760 --> 06:59.100
so that they can filter down on each of the listings

147
06:59.100 --> 07:02.880
that match their criteria and decide which one they want to go

148
07:02.880 --> 07:04.740
and view in person.

149
07:04.740 --> 07:07.530
So this of course makes their job a little bit easier,

150
07:07.530 --> 07:10.230
and this is our research task

151
07:10.230 --> 07:12.360
that we're going to complete today.

152
07:12.360 --> 07:14.760
So the first part of scraping the data

153
07:14.760 --> 07:17.190
for the relevant listings is going to be done

154
07:17.190 --> 07:20.160
using Beautiful Soup and then the second part

155
07:20.160 --> 07:23.790
where we're going to be filling in this form is going to be done

156
07:23.790 --> 07:25.620
using Selenium.

157
07:25.620 --> 07:28.020
So that is the project.

158
07:28.020 --> 07:30.930
And once you're ready, head over to the next lesson

159
07:30.930 --> 07:33.990
and take a look at the requirements of the project

160
07:33.990 --> 07:37.203
and we can get started with the Capstone project.