WEBVTT

00:00.960 --> 00:02.160
Hello everyone!

00:02.560 --> 00:08.080
In today's video we will learn about retrieval-augmented generation, also known as RAG.

00:08.640 --> 00:15.480
Before we take a deep dive into what RAG is, let's understand what an LLM is and how it is used.

00:16.120 --> 00:23.440
An LLM is a large language model that is trained on vast volumes of data and uses billions of parameters

00:23.440 --> 00:31.720
to generate original output for tasks like answering questions, translating languages, and completing

00:31.720 --> 00:38.840
sentences. The model bases its response on the textual content it has ingested during training.

00:39.280 --> 00:43.760
Now let's understand the challenges with LLMs.

00:44.480 --> 00:46.160
Lack of specific data.

00:47.120 --> 00:52.480
Language models are limited to providing generic answers based on their training data.

00:53.080 --> 01:01.550
If a user were to ask domain-specific questions, a traditional LLM may not be able to provide

01:01.550 --> 01:02.990
accurate answers.

01:03.710 --> 01:06.190
Another challenge is hallucination.

01:06.510 --> 01:13.710
The model bases its response on the textual content it has ingested during training, and there is no

01:13.750 --> 01:21.790
telling exactly what went into that data or how the model recombines it to generate novel text.

01:22.350 --> 01:25.190
Another challenge is generic responses.

01:25.630 --> 01:31.510
Language models often provide generic responses and aren't tailored to specific contexts.

01:31.870 --> 01:39.030
This can be a major drawback in a customer support scenario, since individual user preferences are

01:39.070 --> 01:44.190
usually required to facilitate a personalized customer experience.

01:44.790 --> 01:49.990
These are some of the challenges that I have listed here, but there are a lot more.

01:51.070 --> 01:56.510
RAG in general aims to solve some of the challenges that are listed here.

01:57.430 --> 02:01.380
Let's move on to understand what RAG is.

02:01.380 --> 02:08.220
Retrieval-augmented generation, also known as RAG, is the process of optimizing the output of a language model

02:08.220 --> 02:15.100
so it references an authoritative knowledge base outside of its training data source before generating

02:15.100 --> 02:16.140
a response.

02:16.660 --> 02:20.740
Now let's understand how RAG works.

02:21.260 --> 02:23.700
Step one is data collection.

02:24.220 --> 02:29.060
You must first gather all the data that is needed for your application.

02:29.420 --> 02:33.740
This can be domain-specific data that the LLM is not trained on.

02:34.180 --> 02:36.700
Step two is data chunking.

02:37.100 --> 02:43.260
Data chunking is the process of breaking your data into smaller, more manageable pieces.

02:43.660 --> 02:45.180
Large language models

02:45.180 --> 02:51.660
have a context window, and embedding models also have a context window.

02:52.020 --> 02:56.340
These models cannot process thousands of line items.

02:56.580 --> 02:59.260
They have a limited context that they can process.

02:59.740 --> 03:02.370
That is where data chunking is needed.

03:02.850 --> 03:09.370
You need to understand the model's context limitation and chunk the data accordingly.
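
NOTE
Editor's sketch of the chunking step described above, not the method used in the video.
It assumes a simple fixed-size window with overlap, and it counts words as a rough stand-in
for tokens; real models measure their context window in tokens. The function name chunk_words
and the default sizes are hypothetical.
```python
def chunk_words(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most max_words words that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```
The overlap is one common way to avoid cutting a sentence's context in half at a chunk boundary;
the right sizes depend on the embedding model you pick.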

03:09.850 --> 03:12.090
Step three is document embedding.

03:12.530 --> 03:20.450
Now that the source data has been broken into smaller parts, it needs to be converted into a vector representation.

03:20.850 --> 03:28.050
This involves transforming text data into embeddings, which are numeric representations that capture

03:28.050 --> 03:30.690
the semantic meaning behind the text.
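
NOTE
Editor's sketch of what "text in, vector out" looks like. This is a toy: it hashes words into
buckets, so it only reflects which words appear, not semantic meaning. A real embedding model
is learned from data. The names toy_embed and cosine and the dimension of 64 are assumptions
made for illustration.
```python
import hashlib
import math
def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hash each word into one of `dim` buckets, then scale the counts to unit length."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```
Cosine similarity is a common way to compare two embeddings: values near 1 mean the vectors
point in nearly the same direction.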

03:31.250 --> 03:35.570
Embeddings are a deeper topic that needs more understanding.

03:36.050 --> 03:42.130
I have covered it as part of another video where you can understand how embedding works.

03:43.410 --> 03:46.490
Step four is handling user queries.

03:46.850 --> 03:52.210
So once you embed the smaller chunks, you have to handle the user queries.

03:52.730 --> 04:00.650
When a user query enters the system, it must also be converted into an embedding, or vector representation.

04:01.130 --> 04:08.400
The same model must be used for both document and query embedding to ensure uniformity between the two

04:08.400 --> 04:09.560
processes.

04:10.520 --> 04:17.400
So if you use a vector embedding model for document embedding, it has to be the same model that embeds

04:17.400 --> 04:18.640
the user query.

04:19.240 --> 04:25.200
If the two models are different, then the vector embeddings of the two models are different and you

04:25.200 --> 04:27.800
won't get a uniform search experience.

04:28.160 --> 04:35.040
So it is very important that you use the same embedding model for document embedding and handling the user

04:35.040 --> 04:35.880
queries.
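
NOTE
Editor's sketch of one way to enforce the "same embedding model" rule in code: the index
records which model built it and rejects vectors from any other model. The class VectorIndex,
its fields, and the model names are hypothetical and not tied to any real vector database.
```python
from dataclasses import dataclass, field
@dataclass
class VectorIndex:
    """Remembers which embedding model produced its vectors."""
    model_name: str  # hypothetical identifier, e.g. "embedder-v1"
    dim: int  # vector length that model produces
    rows: list = field(default_factory=list)  # (chunk_text, vector) pairs
    def _check(self, vector: list, model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
    def add(self, chunk_text: str, vector: list, model_name: str) -> None:
        """Store a document chunk's vector after checking its origin."""
        self._check(vector, model_name)
        self.rows.append((chunk_text, vector))
    def check_query(self, vector: list, model_name: str) -> None:
        """Call before searching; raises if the query came from another model."""
        self._check(vector, model_name)
```
A dimension check alone is not enough, because two different models can produce vectors of
the same length that are still not comparable, which is why the model name is stored too.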

04:36.480 --> 04:42.560
And then the fifth and the final step is generating responses with an LLM.

04:43.320 --> 04:49.320
The retrieved text chunks, along with the initial user query, are fed into a language model.

04:49.680 --> 04:53.160
This is how the different steps in RAG orchestration work.

04:53.520 --> 04:56.120
Let's understand this using a diagram.

04:56.640 --> 05:01.710
So here is the diagram where the entire RAG system is explained.

05:01.990 --> 05:08.870
The very first step here, on the bottom left, is data collection and chunking.

05:09.190 --> 05:15.070
Those are steps one and two, where we collect the data and chunk it.

05:15.390 --> 05:22.470
Once you chunk the data, you send it to an embedding model, create a vector embedding using the

05:22.470 --> 05:25.870
model, and store it in the vector database.

05:26.230 --> 05:31.390
So steps one and two are data collection and chunking.

05:31.910 --> 05:32.910
Then in step three,

05:32.950 --> 05:35.230
you take the chunks that were created,

05:35.550 --> 05:41.350
embed them using an embedding model, and then store them in the vector database.

05:42.510 --> 05:47.270
This is usually a batch process that keeps running over and over again.

05:47.590 --> 05:55.670
Then, once the data is stored in the vector database, the user passes a query and a prompt.

05:55.990 --> 06:03.180
So when they pass this to the RAG orchestrator, the orchestrator takes just the query out of

06:03.180 --> 06:11.140
the prompt and the query, embeds it using the same embedding model, and then searches for the

06:11.140 --> 06:14.060
relevant content in the vector database.
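
NOTE
Editor's sketch of the search step: score every stored chunk against the query vector and
keep the best matches. It assumes a brute-force cosine search over an in-memory list; a real
vector database would typically use an index instead. The chunks and three-number vectors
below are made up by hand for illustration, and top_k is a hypothetical helper.
```python
import math
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
def top_k(query_vec: list[float], store: list, k: int = 2) -> list:
    """store holds (chunk_text, vector) pairs; returns k (score, chunk_text) pairs, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
# Hand-written stand-ins for embedded chunks; a real system would get these from the embedding model.
store = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "How do refunds work?"
results = top_k(query_vec, store, k=2)
```
Here results lists the refunds chunk first and the shipping chunk second, because the query
vector points in nearly the same direction as the refunds vector.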

06:14.700 --> 06:21.740
Once it finds the relevant content in the vector database, it extracts that content

06:21.940 --> 06:24.420
and gives it back to the RAG orchestrator.

06:24.780 --> 06:32.060
The RAG orchestrator combines the prompt, the query, and the context info that it retrieved and sends

06:32.060 --> 06:32.300
it to the LLM.

06:32.340 --> 06:38.580
The LLM then processes the data and generates the response.
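
NOTE
Editor's sketch of how an orchestrator might combine the prompt, the query, and the retrieved
context into one input for the LLM. The layout of the final string and the name build_prompt
are assumptions; the actual call to an LLM is left out because it depends on the provider.
```python
def build_prompt(system_prompt: str, query: str, context_chunks: list[str]) -> str:
    """Join the instructions, numbered context chunks, and the user's question into one string."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{numbered}\n\nQuestion: {query}\nAnswer:"
prompt = build_prompt(
    "Answer using only the context below.",
    "How do refunds work?",
    ["Refunds are issued within 14 days.", "Shipping takes 3 to 5 business days."],
)
```
Numbering the chunks is optional; it just makes it easier to ask the model to point at the
chunk it relied on.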

06:39.580 --> 06:41.100
And that is step five.

06:41.540 --> 06:46.060
So these are the different steps that are involved in the RAG orchestrator.

06:46.220 --> 06:48.060
And this is how RAG works.

06:48.780 --> 06:53.580
Let's take a real-world example and understand this in a little more detail.

06:53.980 --> 06:56.700
Thank you and see you in the next video.