WEBVTT

00:00.630 --> 00:01.980
Eden: Hey there, Eden here,

00:01.980 --> 00:05.730
and in this part, we are going to implement LangChain code

00:05.730 --> 00:10.110
to ingest our Medium blog into our vector store.

00:10.110 --> 00:12.330
It's going to include taking our data

00:12.330 --> 00:15.810
and loading it into a LangChain document object,

00:15.810 --> 00:17.160
splitting this object

00:17.160 --> 00:20.550
with a LangChain text splitter into smaller chunks,

00:20.550 --> 00:24.180
embedding those chunks and turning them into vectors,

00:24.180 --> 00:26.670
and then we're going to store those vectors

00:26.670 --> 00:28.740
in the Pinecone vector store.

00:28.740 --> 00:30.450
And now this may sound like a lot,

00:30.450 --> 00:32.550
but actually in LangChain it would only take us

00:32.550 --> 00:33.810
a couple of lines of code.

00:33.810 --> 00:35.700
And that's what's cool about LangChain.

00:35.700 --> 00:38.340
It saves us a lot of boilerplate code.

00:38.340 --> 00:40.770
This video will include the imports

00:40.770 --> 00:44.250
and the LangChain objects that we'll be working with.

00:44.250 --> 00:46.410
And the following video will include

00:46.410 --> 00:48.750
the ingestion implementation.

00:48.750 --> 00:52.140
And let's start by importing from LangChain,

00:52.140 --> 00:53.283
the text loader.

00:55.860 --> 00:58.860
So when we talk to LLMs, we send them data

00:58.860 --> 01:00.660
and that data is text.

01:00.660 --> 01:03.570
So the text can be something that we write to the prompt,

01:03.570 --> 01:06.960
but we can attach to it some other things like

01:06.960 --> 01:08.700
what if we want to process something

01:08.700 --> 01:10.380
on our WhatsApp messages

01:10.380 --> 01:13.140
or if we want to download some PDF files

01:13.140 --> 01:14.340
and process them,

01:14.340 --> 01:18.750
or if we want to do something with a Notion notebook.

01:18.750 --> 01:21.450
So all of those elements I described earlier

01:21.450 --> 01:24.570
are simply text, but they have different formats

01:24.570 --> 01:26.490
and different semantic meaning.

01:26.490 --> 01:30.570
So the document loaders are actually class implementations

01:30.570 --> 01:33.000
of how to load and process data

01:33.000 --> 01:36.780
in order to make it digestible by the large language model.

01:36.780 --> 01:39.060
So to be honest, it's pretty easy to understand.

01:39.060 --> 01:41.430
So I don't mind getting my hands dirty

01:41.430 --> 01:45.600
and going inside this repo and reviewing the source code.

01:45.600 --> 01:48.750
So this is the LangChain Python implementation.

01:48.750 --> 01:51.090
It's a GitHub repository, which is public,

01:51.090 --> 01:52.890
so you can easily find it online.

01:52.890 --> 01:54.963
I also list it in the course's resources.

01:55.950 --> 01:59.880
So I'll head up and go to the LangChain directory.

01:59.880 --> 02:02.370
And right over there, there is this directory

02:02.370 --> 02:05.280
which is called document_loaders.

02:05.280 --> 02:08.160
So inside it is the entire implementation

02:08.160 --> 02:10.380
of all the document loaders.

02:10.380 --> 02:15.150
So for example, here is the WhatsApp document loader

02:15.150 --> 02:18.780
and if we go up, there is the text document loader

02:18.780 --> 02:20.610
under the text.py file.

02:20.610 --> 02:23.070
So you can see this is the entire implementation

02:23.070 --> 02:25.950
of the text document loader, which we're going to use.

02:25.950 --> 02:29.730
So all we're doing right now is simply taking a file path,

02:29.730 --> 02:34.170
opening it up like we would usually open a file in Python,

02:34.170 --> 02:37.620
attaching some metadata that is going to hold the source,

02:37.620 --> 02:41.700
which is going to be equal to the file path of the file.

02:41.700 --> 02:43.230
And that's pretty much it.

02:43.230 --> 02:46.530
We wrap it inside a list and return it.

02:46.530 --> 02:49.200
So nothing novel here, nothing special.

02:49.200 --> 02:52.620
Simply an easy wrapper that helps us abstract things

02:52.620 --> 02:55.170
and you'll see it later because we will have the same

02:55.170 --> 02:57.423
interface for all documents.
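
NOTE
A minimal sketch of the text loader idea just described, written in plain Python
and not copied from LangChain's source. The Document and SimpleTextLoader names
are assumptions made for illustration: read a file, attach the path as "source"
metadata, wrap the result in a list, and return it.
```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List
@dataclass
class Document:
    # Hypothetical stand-in for a LangChain document: text plus metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)
class SimpleTextLoader:
    # Illustrative loader only; the real TextLoader may differ in details.
    def __init__(self, file_path: str, encoding: str = "utf-8") -> None:
        self.file_path = file_path
        self.encoding = encoding
    def load(self) -> List[Document]:
        # Open the file the usual Python way and read all of its text.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        # Record where the text came from, then wrap the document in a list.
        return [Document(page_content=text, metadata={"source": self.file_path})]
```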

02:58.470 --> 03:00.690
So let's check out what's happening

03:00.690 --> 03:03.630
with the WhatsApp document loader.

03:03.630 --> 03:05.520
Let's see its implementation.

03:05.520 --> 03:07.050
So I'm going to go back

03:07.050 --> 03:11.970
and I'll head down to the bottom of this directory.

03:11.970 --> 03:14.730
And yeah, this is the WhatsApp chat loader.

03:14.730 --> 03:18.090
Now here we're also loading a file, which is the text file

03:18.090 --> 03:20.133
that represents our WhatsApp chat.

03:21.150 --> 03:25.740
And all we are doing right now is simply opening up the file

03:25.740 --> 03:28.533
like we did with the regular text loader,

03:29.370 --> 03:33.210
then using some regular expressions to extract the names

03:33.210 --> 03:36.600
of the sender, of the receiver, the text,

03:36.600 --> 03:38.490
and the date that it was sent.

03:38.490 --> 03:40.740
And simply concatenating one row

03:40.740 --> 03:43.890
after another, noting who sent each row.

03:43.890 --> 03:46.230
So it's simply taking the file, loading it,

03:46.230 --> 03:50.790
doing its formatting and returning it as is.

03:50.790 --> 03:54.330
So basically we're making the data

03:54.330 --> 03:57.450
more ingestible by the LLM.
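
NOTE
A rough sketch of the regular-expression step, assuming a hypothetical export
format of "date - sender: text" per line. The pattern and the format_chat name
are assumptions for illustration; the real WhatsApp loader handles more formats.
```python
import re
from typing import List
# Assumed line shape: "12/31/23, 9:15 PM - Alice: Hello"
LINE_PATTERN = re.compile(r"^(?P<date>[^-]+) - (?P<sender>[^:]+): (?P<text>.+)$")
def format_chat(raw_chat: str) -> str:
    rows: List[str] = []
    for line in raw_chat.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            # Skip lines that do not look like a chat message.
            continue
        # Concatenate one row after another, noting who sent each row.
        rows.append(f"{match['sender']} on {match['date']}: {match['text']}")
    return "\n".join(rows)
```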

03:57.450 --> 03:59.790
So it's super, super easy to use

03:59.790 --> 04:03.600
and there are tons of implementations of different file

04:03.600 --> 04:06.690
and document loaders that we have at our disposal.

04:06.690 --> 04:08.820
So that's what makes LangChain so great

04:08.820 --> 04:12.870
that it has all this variety of options that we can choose

04:12.870 --> 04:14.040
to handle our documents.

04:14.040 --> 04:17.160
We can handle documents of any kind, of WhatsApp,

04:17.160 --> 04:21.360
of Google Drive, of Notion, of PDF, whatever we want,

04:21.360 --> 04:23.700
LangChain can help the LLM digest it

04:23.700 --> 04:25.113
in the best possible way.

04:26.730 --> 04:28.920
Cool, so let's go back to the code

04:28.920 --> 04:32.223
and now let's import the character text splitter.

04:34.560 --> 04:36.840
Now text splitters are here to help us

04:36.840 --> 04:39.120
with long pieces of text.

04:39.120 --> 04:42.030
Those texts have tons of tokens inside them

04:42.030 --> 04:45.510
and if we send them directly to the LLM,

04:45.510 --> 04:47.280
then our request will probably fail

04:47.280 --> 04:49.710
because it surpasses the token limit

04:49.710 --> 04:50.970
the model enforces.

04:50.970 --> 04:52.520
So for example, in GPT-3.5,

04:53.700 --> 04:56.520
we have a 4K token limit.

04:56.520 --> 04:58.350
So to conclude, the text splitter

04:58.350 --> 05:00.660
allows us to take text which is large

05:00.660 --> 05:02.490
and to split it into chunks.

05:02.490 --> 05:04.230
Now to be honest, text splitters

05:04.230 --> 05:06.330
have a lot of logic in there

05:06.330 --> 05:08.790
because there are a lot of splitting strategies

05:08.790 --> 05:11.190
and there are a lot of smart ways to do it

05:11.190 --> 05:14.280
with calculating the appropriate chunk size.

05:14.280 --> 05:16.530
Now the chunk size is not trivial

05:16.530 --> 05:18.240
because it may change according to

05:18.240 --> 05:19.980
what we want to accomplish.

05:19.980 --> 05:22.110
Now this depends on the LLM we're choosing

05:22.110 --> 05:23.670
and on the embedding system,

05:23.670 --> 05:26.400
but we'll cover it later in this course.

05:26.400 --> 05:28.890
In the meantime, let's take a look at the documentation

05:28.890 --> 05:31.020
of the character text splitter.

05:31.020 --> 05:33.810
Now notice that we give it the separator

05:33.810 --> 05:36.480
that we want to split on, and we supply the chunk size.

05:36.480 --> 05:39.390
Now here it's a size of 1000 characters

05:39.390 --> 05:41.220
and the chunk overlap.

05:41.220 --> 05:45.060
The chunk overlap parameter specifies the amount of overlap

05:45.060 --> 05:47.190
between the chunks when we split the text

05:47.190 --> 05:48.570
into smaller parts.

05:48.570 --> 05:51.120
This overlap can be super helpful

05:51.120 --> 05:54.960
to ensure that the text isn't split up in a way

05:54.960 --> 05:58.500
that disturbs the context or the meaning.

05:58.500 --> 06:01.380
The length function is usually len

06:01.380 --> 06:05.190
and it helps LangChain to determine the chunk size.

06:05.190 --> 06:08.190
Now sometimes you want to do it with tokens

06:08.190 --> 06:10.200
and there are special functions we can write

06:10.200 --> 06:13.503
that help us find out how many tokens our chunk has.
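
NOTE
A simplified sketch of how a separator, chunk size, chunk overlap, and length
function can work together. It is an illustration under assumptions, not
LangChain's actual splitting algorithm; split_text and count_words are
hypothetical names, and count_words is only a crude stand-in for a tokenizer.
```python
from typing import Callable, List
def split_text(text: str, separator: str = "\n\n", chunk_size: int = 1000,
               chunk_overlap: int = 0,
               length_function: Callable[[str], int] = len) -> List[str]:
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and length_function(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full, so emit it.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit the overlap budget and
            # still leave room for the incoming piece.
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks
def count_words(text: str) -> int:
    # Crude token estimate: whitespace-separated words.
    return len(text.split())
```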

06:16.290 --> 06:19.380
Alright, let's now import our embeddings object

06:19.380 --> 06:21.333
and we'll use OpenAI embeddings.

06:26.040 --> 06:29.790
Now I remind you that an embedding model or encoder

06:29.790 --> 06:33.150
or whatever you want to call it, is simply a black box,

06:33.150 --> 06:36.420
which takes in text as input

06:36.420 --> 06:40.863
and outputs vectors in an embedding vector space.

06:41.790 --> 06:43.650
So I have a text that I want to embed.

06:43.650 --> 06:45.180
How do I actually do it?

06:45.180 --> 06:49.050
Well, a lot of embedding providers simply expose to us

06:49.050 --> 06:51.540
an /embed API that we can use

06:51.540 --> 06:54.960
where we send the text and get back the vector.
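
NOTE
To make the "text in, vector out" black box concrete, here is a toy stand-in
that hashes words into a small fixed-size vector. It is an assumption made for
illustration only (toy_embed is a hypothetical name) and carries none of the
semantic quality of a real embedding model.
```python
import hashlib
import math
from typing import List
def toy_embed(text: str, dimensions: int = 8) -> List[float]:
    vector = [0.0] * dimensions
    for word in text.lower().split():
        # Hash each word to pick a bucket deterministically.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimensions] += 1.0
    # Normalize to unit length so vectors are comparable by direction.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```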

06:54.960 --> 06:57.420
There are a lot of models that aspire to create

06:57.420 --> 07:00.450
a good and smart embedding, the one of OpenAI,

07:00.450 --> 07:03.750
the latest of them, text-embedding-ada-002,

07:03.750 --> 07:05.100
is a very good one

07:05.100 --> 07:06.270
because in embedding

07:06.270 --> 07:08.640
the price has great significance.

07:08.640 --> 07:12.060
Sometimes you embed an entire database

07:12.060 --> 07:14.550
and if a certain model is cheaper

07:14.550 --> 07:16.560
then it's much better to use.

07:16.560 --> 07:20.220
So de facto, the text-embedding-ada-002

07:20.220 --> 07:23.913
is cheaper by 98% than their previous one.

07:25.080 --> 07:29.700
So if we return back to the import of OpenAI embedding,

07:29.700 --> 07:31.620
then the whole idea

07:31.620 --> 07:34.200
of text embedding models in LangChain

07:34.200 --> 07:38.040
is to create a uniform interface for us

07:38.040 --> 07:39.750
to access different embeddings

07:39.750 --> 07:42.630
from different embeddings providers.

07:42.630 --> 07:46.320
So it doesn't matter if we're using Cohere, Hugging Face,

07:46.320 --> 07:49.020
OpenAI, it doesn't really matter who we are using.

07:49.020 --> 07:50.640
We have the same interface.

07:50.640 --> 07:52.980
So switching from one to another

07:52.980 --> 07:56.010
is pretty straightforward with just changing a parameter.

07:56.010 --> 07:59.130
So that's the entire abstraction of the embeddings.

07:59.130 --> 08:03.250
And here we are going to use the OpenAI embeddings.
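
NOTE
A sketch of what a uniform embeddings interface buys us. The method names
embed_documents and embed_query follow LangChain's embeddings interface as I
understand it; the two provider classes and ingest are hypothetical toys, so
treat this as an illustration and not the library's code.
```python
from typing import List, Protocol
class Embeddings(Protocol):
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...
    def embed_query(self, text: str) -> List[float]: ...
class LengthEmbeddings:
    # Hypothetical provider A: character count and word count.
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]
    def embed_query(self, text: str) -> List[float]:
        return [float(len(text)), float(len(text.split()))]
class VowelEmbeddings:
    # Hypothetical provider B: one dimension per vowel.
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]
    def embed_query(self, text: str) -> List[float]:
        lowered = text.lower()
        return [float(lowered.count(v)) for v in "aeiou"]
def ingest(texts: List[str], embeddings: Embeddings) -> List[List[float]]:
    # The calling code stays the same whichever provider is passed in.
    return embeddings.embed_documents(texts)
```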

08:03.250 --> 08:06.240
Okay, we're getting closer to the end of the video.

08:06.240 --> 08:09.030
That's good because it's a long video.

08:09.030 --> 08:11.853
So now we're importing Pinecone.

08:14.880 --> 08:18.330
Now we talked about embeddings that we create from the text,

08:18.330 --> 08:20.310
the vectors.

08:20.310 --> 08:23.430
We need to store those vectors somewhere.

08:23.430 --> 08:25.020
We need persistent storage.

08:25.020 --> 08:27.840
We want the ability to search in the vector space

08:27.840 --> 08:30.960
for the closest vectors to the current one.

08:30.960 --> 08:33.990
We also want to be able to add new vectors

08:33.990 --> 08:35.340
to the vector space.

08:35.340 --> 08:39.000
So all of this is being handled by the vector databases

08:39.000 --> 08:41.070
that do all the work for us.
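
NOTE
A tiny in-memory sketch of the two operations just mentioned: adding vectors
and searching for the closest ones by cosine similarity. InMemoryVectorStore is
a hypothetical name; a real vector database such as Pinecone adds persistence
and indexing, which this illustration leaves out.
```python
import math
from typing import List, Sequence, Tuple
class InMemoryVectorStore:
    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []
    def add(self, text: str, vector: Sequence[float]) -> None:
        # Store the text next to its vector.
        self._rows.append((text, list(vector)))
    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    def search(self, query: Sequence[float], k: int = 1) -> List[str]:
        # Rank every stored vector by similarity to the query (brute force).
        ranked = sorted(self._rows, key=lambda row: self._cosine(query, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```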

08:41.070 --> 08:43.380
Now, an excellent vector database

08:43.380 --> 08:46.980
that has recently gone viral is called Pinecone

08:46.980 --> 08:50.940
and we'll be exploring and using Pinecone in these videos.

08:50.940 --> 08:51.990
It has a free tier,

08:51.990 --> 08:54.660
so don't worry about spending a lot of money.

08:54.660 --> 08:56.400
And that's all for our imports.

08:56.400 --> 08:57.500
And in the next video,

08:57.500 --> 09:00.003
we will be implementing the ingestion.