WEBVTT

00:01.830 --> 00:07.200
So the first thing we're going to do is import Puppeteer.

00:09.150 --> 00:12.120
So const puppeteer equals require puppeteer.

00:16.750 --> 00:19.270
Then we'll write our main function.

00:19.270 --> 00:22.030
So async function main.

00:22.690 --> 00:25.600
And we'll create the browser instance.

00:25.600 --> 00:32.170
So const browser equals await puppeteer dot launch.

00:33.490 --> 00:38.620
With headless set to false.

00:40.560 --> 00:41.340
Let me see.

00:42.210 --> 00:43.500
So you can see.

00:47.170 --> 00:48.550
All of it like that.

00:48.550 --> 00:54.400
So const browser await puppeteer launch, and we set this to false so we can see the browser when it's

00:54.400 --> 00:58.720
launched and then we'll create a page inside of the browser.

00:58.720 --> 01:02.650
So await browser new page.

01:04.000 --> 01:10.480
And then inside of this page, we're going to set a specific viewport for it because, well, we're

01:10.480 --> 01:13.180
going to be using something with the width and height.

01:13.180 --> 01:17.920
So it's pretty useful to set an exact height and width.

01:18.670 --> 01:26.260
And with that, we're going to go to the site with the demo page.

01:26.260 --> 01:34.060
So the URL for this infinite scroll demo, I'm going to put the link in the resources as well

01:34.060 --> 01:37.630
if you don't want to type it out.

01:39.510 --> 01:44.400
Then we're going to set a target number of items we want to get.

01:44.400 --> 01:47.870
So target item count, we set to 100.

01:47.880 --> 01:52.950
So this means we get 100 of these different boxes.

01:52.950 --> 01:55.680
So once we get 100, we're done.

01:56.280 --> 02:01.110
So it's going to scroll down and down until we get 100 items.

02:01.860 --> 02:04.710
Then we're going to say const items.

02:04.860 --> 02:07.110
Await scrape.

02:07.770 --> 02:09.330
Infinite.

02:10.140 --> 02:19.410
Infinite scroll items and pass in the page to another function we're going to write here, which is

02:19.410 --> 02:23.940
going to extract items and then the target item count.

02:25.530 --> 02:26.580
And

02:29.310 --> 02:31.920
then we can console log the items.
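
NOTE
A minimal sketch of the main function described so far, not the lecturer's exact file.
Assumptions: DEMO_URL is a placeholder (the real link is in the course resources), the
viewport size and the '.box' selector are guesses, the `launcher` parameter is added so the
sketch can run against a stand-in object, and scrapeInfiniteScrollItems is only a stub
that does not scroll yet.
```javascript
// Sketch only: URL, viewport size, selector, and the injectable `launcher` are assumptions.
const DEMO_URL = 'https://example.com/infinite-scroll'; // placeholder, not the course URL
// Runs inside the page: collect the text of every matching element.
function extractItems() {
  const extractedItems = Array.from(document.querySelectorAll('.box')); // assumed selector
  return extractedItems.map((element) => element.innerText);
}
// Stub: evaluates the extractor once; the scrolling logic comes in the next lecture.
async function scrapeInfiniteScrollItems(page, extract, targetItemCount) {
  const items = await page.evaluate(extract);
  return items.slice(0, targetItemCount);
}
async function main(launcher = require('puppeteer')) {
  const browser = await launcher.launch({ headless: false }); // false = visible window
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 }); // assumed size
  await page.goto(DEMO_URL);
  const targetItemCount = 100;
  const items = await scrapeInfiniteScrollItems(page, extractItems, targetItemCount);
  console.log(items);
  await browser.close(); // added here so the process can exit; not shown in the lecture
  return items;
}
```
Calling main() with no argument would use the installed puppeteer package.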

02:34.100 --> 02:42.350
So this function is what we're going to look at first, which is returning the items of these boxes

02:42.350 --> 02:43.580
on the page.

02:43.820 --> 02:50.180
So let's go ahead and write that function extract items.

02:52.100 --> 02:54.290
And we're going to just.

02:55.070 --> 02:59.510
Get an extracted items array.

03:01.060 --> 03:08.500
So this is where we call array from document query selector all.

03:10.800 --> 03:14.210
With the selector for the boxes.

03:15.000 --> 03:19.440
So let me just explain a bit about what is going on here.

03:19.740 --> 03:24.330
So this document query selector all is a function

03:24.330 --> 03:26.550
we are running inside of the browser.

03:26.550 --> 03:33.540
So if we paste that inside here, you can see a node list of 40 boxes.

03:33.540 --> 03:38.700
So there are 40 boxes in the DOM right now, these ones.

03:40.190 --> 03:41.900
And that's a node list.

03:41.900 --> 03:45.470
And unfortunately, that is not an array.

03:45.980 --> 03:49.910
We want to make it into an array so we can use dot map on it.

03:50.390 --> 03:52.220
And so let me show you.

03:52.220 --> 03:55.940
So we do const items, and then we do extracted

03:56.960 --> 03:58.670
items dot map.

03:59.180 --> 04:05.000
And we say element, return the element's inner text.

04:07.850 --> 04:08.830
Like so.

04:08.840 --> 04:11.030
And then we return the items.

04:11.960 --> 04:13.010
So.

04:13.880 --> 04:21.470
It's going to iterate over every one of these node list items and get their inner text.

04:21.470 --> 04:25.250
So infinite scroll box one, infinite scroll box two, and so on.

04:26.180 --> 04:36.170
And so the node list is transformed into an array, and then we map over the array and return each element's

04:36.170 --> 04:37.220
inner text.

04:37.340 --> 04:43.820
And then we finally return this array of all these text strings.
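
NOTE
A small sketch of why the node list is converted before mapping. It runs in plain Node,
so a hand-made array-like object stands in for the browser's NodeList; the object and
its sample texts are illustrative assumptions, not output captured from the demo page.
```javascript
// Stand-in for a NodeList: indexed entries plus length, but no map method.
const nodeListLike = {
  0: { innerText: 'Infinite Scroll Box 1' },
  1: { innerText: 'Infinite Scroll Box 2' },
  length: 2,
};
console.log(typeof nodeListLike.map); // "undefined", so it cannot be mapped directly
// Array.from copies any array-like or iterable value into a real array.
const extractedItems = Array.from(nodeListLike);
// Now map is available: pull the text out of each element.
const items = extractedItems.map((element) => element.innerText);
console.log(items);
```
In the browser the same two steps apply to the result of document.querySelectorAll.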

04:45.970 --> 04:46.990
Now.

04:47.290 --> 04:47.770
Um.

04:49.490 --> 04:52.310
So let's see if that works.

04:52.310 --> 04:53.510
Just for now.

04:53.540 --> 04:54.190
Let's see.

04:54.200 --> 04:55.940
So I comment out this.

04:57.200 --> 05:00.230
Then we can run extract items.

05:01.280 --> 05:07.700
Actually, you can run await page dot evaluate, and inside evaluate

05:07.700 --> 05:12.920
you can pass in a string, like this.

05:16.990 --> 05:17.590
Oops.

05:22.270 --> 05:29.140
And then we put single quotes here so we don't have to escape the double quotes.

05:29.140 --> 05:30.790
So like this.

05:32.710 --> 05:37.900
You can pass in a string, which is basically a console command for the browser.

05:38.020 --> 05:39.610
You can also pass in a function.

05:39.610 --> 05:41.590
So let's try and see.

05:46.210 --> 05:47.440
What we get here.

05:47.440 --> 05:52.540
So if I just do console.log result here for this one.

05:54.440 --> 05:55.480
And let's see.

05:58.360 --> 06:00.700
Let me open up the terminal.

06:03.660 --> 06:05.220
So I do node

06:07.440 --> 06:08.850
index.js.

06:10.900 --> 06:14.380
So I'm not getting anything because I'm not running the function.

06:14.410 --> 06:16.030
The main function, of course.

06:16.990 --> 06:20.500
So down here below type main.

06:20.680 --> 06:22.540
So we run the main function.

06:29.500 --> 06:31.630
So now the browser is running.

06:31.840 --> 06:35.620
Let's see what it says in here.

06:35.890 --> 06:37.150
And here we get this list.

06:37.160 --> 06:40.780
There's no text in it, but that's okay because it's, um,

06:42.340 --> 06:46.450
it's a node list of DOM elements, you could say.

06:46.450 --> 06:47.020
Right.

06:47.440 --> 06:52.360
But you can also go and say extract items.

06:53.540 --> 06:57.500
So you pass in the function here, extract items.

06:57.740 --> 06:59.210
So now we're just passing.

06:59.450 --> 07:01.580
Passing in this function here.

07:03.690 --> 07:05.850
And let's see how that looks.

07:16.850 --> 07:18.410
Now it's running the site again.

07:20.130 --> 07:25.230
And now we get the text of these boxes.

07:27.910 --> 07:29.980
So all the text in here.

07:30.880 --> 07:33.610
Um, so yeah, you can see it works fine.

07:33.960 --> 07:37.000
It says something about items is not defined.

07:40.890 --> 07:42.100
Yeah, it's this.

07:42.210 --> 07:45.850
It's this console log here because I haven't defined items.

07:45.870 --> 07:53.010
But anyway, you can see now that we are passing in a function to the page evaluate and it runs the

07:53.010 --> 07:54.390
function in here.
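
NOTE
A hedged sketch of the two ways of calling page.evaluate discussed above: with a string
expression and with a function. The helper name evaluateBothWays and the '.box' selector
are assumptions for illustration; `page` is expected to be a Puppeteer page (or any
object with a compatible evaluate method).
```javascript
// Runs inside the page: returns the text of every matching element.
function extractItems() {
  return Array.from(document.querySelectorAll('.box')).map((el) => el.innerText); // assumed selector
}
// Hypothetical helper showing both call styles side by side.
async function evaluateBothWays(page) {
  // 1) String form: evaluated in the page like an expression typed into the console.
  const count = await page.evaluate("document.querySelectorAll('.box').length");
  // 2) Function form: the function is sent to the page and executed there.
  const texts = await page.evaluate(extractItems);
  return { count, texts };
}
```
The string form here returns a number rather than the node list itself, because only
serializable values come back from the page intact.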

07:55.220 --> 08:01.460
And so I hope that it's clear for you now what exactly is going on in this section.

08:01.460 --> 08:08.600
But in the next lecture, we're going to flesh out this function here and we're going to see how we're

08:08.600 --> 08:10.070
scrolling the page.
