WEBVTT

00:00.320 --> 00:05.840
In this video, we're going to have a look at how you can build a self-correcting RAG pipeline, RAG

00:05.840 --> 00:08.180
standing for retrieval augmented generation.

00:08.180 --> 00:12.710
And after that, we're going to look at how this is integrated inside of LangGraph.

00:12.710 --> 00:19.010
So the first thing I want to bring your attention to is there was a paper done in 2023 which talked

00:19.010 --> 00:27.410
about how if you were able to grade documents and also test for hallucinations, then it was much better

00:27.410 --> 00:33.720
than simply selecting 3 or 5 different documents for the retrieval step to help in grounding the LLM

00:33.720 --> 00:37.080
in factual based answers or consistency of answers.

00:37.080 --> 00:43.050
And so you've got this kind of setup over here where you've got a question, and you then do some retrieval.

00:43.050 --> 00:46.050
You grade how well that retrieval is done.

00:46.050 --> 00:49.140
You decide how relevant the docs are.

00:49.140 --> 00:56.230
And then you do a generation step and you check for any hallucinations, and then you check to see does

00:56.230 --> 00:57.130
it answer the question.

00:57.130 --> 01:01.900
And if it doesn't answer the question, then you rewrite the question and then you go back and do more

01:01.900 --> 01:03.370
retrieval and more work.

01:03.370 --> 01:06.640
But if it does answer the question, then you finish at the answer stage.

01:06.640 --> 01:08.290
So this is what we're going to implement.

01:08.290 --> 01:11.950
We'll just be doing a simple version of it and trying to keep it as simple as possible.

01:11.950 --> 01:17.990
And what we'll do to start with is you're going to build the specific functions and you'll see how these

01:17.990 --> 01:18.560
are built.

01:18.560 --> 01:20.360
And then we'll tie that into a graph.

01:20.360 --> 01:24.860
So rather than trying to build the graph straight away, we'll first see how these individual pieces work.

01:24.860 --> 01:28.340
And then we'll package it up inside of LangGraph.

01:28.340 --> 01:31.070
So we're going to install a couple of packages first.

01:31.070 --> 01:36.830
You'll also need to set your OpenAI API key, and optionally a LangSmith key as well,

01:36.830 --> 01:40.130
if you want to follow along with LangSmith and see the tracing aspect.

01:40.130 --> 01:43.980
Now the first thing we're going to do is we're going to index some blog posts.

01:44.010 --> 01:50.190
Lilian Weng here has produced some really high-quality blog posts on what AI agents are and what prompt engineering is.

01:50.190 --> 01:56.610
And so we can use LangChain's community web-based loaders to load those documents.

01:56.610 --> 02:01.770
And then, as well as that, split those into chunks of 250 characters.

02:01.770 --> 02:06.760
And then we can load that into an in-memory Chroma vector store.

02:06.760 --> 02:13.450
And we can also get the retriever by calling .as_retriever() on that vector store as well.

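NOTE
A minimal sketch of the indexing step just described, in plain Python so it runs without any API keys.
The video uses LangChain's web loaders and an in-memory Chroma store. Here Document, split_into_chunks
and ToyRetriever are hypothetical stand-ins: text is cut into 250-character chunks and chunks are ranked
by shared words rather than by embeddings.
```python
import re
from dataclasses import dataclass
@dataclass
class Document:
    page_content: str
def split_into_chunks(text: str, chunk_size: int = 250) -> list[Document]:
    """Cut text into consecutive chunks of at most chunk_size characters."""
    return [Document(text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]
def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))
class ToyRetriever:
    """Stand-in for a vector store retriever: ranks by shared words, not embeddings."""
    def __init__(self, docs: list[Document], k: int = 4):
        self.docs, self.k = docs, k
    def invoke(self, question: str) -> list[Document]:
        q = _words(question)
        ranked = sorted(self.docs, key=lambda d: len(q & _words(d.page_content)), reverse=True)
        return ranked[: self.k]
```
Calling ToyRetriever(split_into_chunks(text)).invoke("agent memory") returns the top k chunks, mirroring how the real retriever is invoked later on.
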
02:13.600 --> 02:14.020
All right.

02:14.050 --> 02:19.930
Now that we've got the documents embedded inside of Chroma, we can then create a retrieval grader.

02:19.930 --> 02:24.520
And the retrieval grader is only going to be responsible for figuring out whether the documents are

02:24.520 --> 02:27.910
relevant to that question with either a yes or a no.

02:28.690 --> 02:35.220
Notice here how we're setting up a Pydantic model, and we're also setting up GPT-4o as the chat model.

02:35.220 --> 02:40.230
And we're using .with_structured_output(), which will produce an output matching this Pydantic

02:40.230 --> 02:44.700
model, so that we have a binary score that will either represent yes or no.

02:44.700 --> 02:50.610
And the prompt basically is something along the lines of you're grading the documents based on a question.

02:50.610 --> 02:54.940
It doesn't have to be a stringent test, but the document has to have some sort of relevance to the question or

02:54.940 --> 02:59.290
the keywords or the semantic meaning and give it a binary, you know, yes or no.

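NOTE
A rough sketch of the binary grader idea, assuming a dataclass in place of the Pydantic model and a
keyword check in place of the LLM. GradeDocuments mirrors the yes/no schema described above, while
toy_retrieval_grader is a hypothetical stand-in and not the structured-output chain from the video.
```python
from dataclasses import dataclass
@dataclass
class GradeDocuments:
    """Binary relevance score; the video uses a Pydantic model for this."""
    binary_score: str
    def __post_init__(self):
        if self.binary_score not in ("yes", "no"):
            raise ValueError("binary_score must be 'yes' or 'no'")
def toy_retrieval_grader(question: str, document: str) -> GradeDocuments:
    """Stand-in for the LLM grader: 'yes' if any longer question word appears in the document."""
    keywords = [w for w in question.lower().split() if len(w) > 3]
    hit = any(w in document.lower() for w in keywords)
    return GradeDocuments(binary_score="yes" if hit else "no")
```
Constraining the output to yes or no is what lets the later graph edges branch on the result without parsing free text.
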
02:59.290 --> 03:04.390
And then we set up our grade prompt, and the grade prompt is then piped into the

03:04.390 --> 03:07.210
structured LLM grader and we can come up with a question.

03:07.210 --> 03:09.640
So in this scenario we decided on "agent memory".

03:10.390 --> 03:16.240
And you can see that we call .invoke on the retriever

03:16.240 --> 03:17.260
to get the relevant documents.

03:17.260 --> 03:19.450
We'll be actually looking inside of those results.

03:19.450 --> 03:25.060
We can get the second document's page content, and we'll just pass that one doc into the grader as the

03:25.060 --> 03:26.170
document text.

03:26.170 --> 03:27.880
And so this is how that's going to work.

03:27.880 --> 03:30.100
So just as an example, here you go.

03:30.100 --> 03:34.780
So you can see that it does think that this document specifically is useful.

03:34.780 --> 03:37.720
This doc text, the second article, is useful.

03:37.720 --> 03:40.840
And what we can then do is do a generation step.

03:40.840 --> 03:45.050
So we've got our LangChain Hub import, which allows us to pull prompts from LangChain Hub.

03:45.050 --> 03:50.390
And we've also got a string output parser, which, when we combine it in LangChain Expression Language,

03:50.390 --> 03:53.270
will also return a string output.

03:53.270 --> 03:59.660
We've then also got a format_docs function, which allows us to easily format the documents.

03:59.660 --> 04:02.120
And you can see here this is how it's working.

04:02.120 --> 04:04.970
So we're passing in the context which is all the docs.

04:04.970 --> 04:06.390
And we're asking the question.

04:06.390 --> 04:09.540
And then we get a generation out that is trying to answer that question.

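NOTE
A small sketch of how the documents and question could be combined before generation. format_docs joins
each document's page content with a blank line. build_prompt is a hypothetical layout used only for
illustration, since the video pulls its RAG prompt from LangChain Hub and sends it to a chat model.
```python
from dataclasses import dataclass
@dataclass
class Document:
    page_content: str
def format_docs(docs: list[Document]) -> str:
    """Join the page content of each document, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)
def build_prompt(question: str, docs: list[Document]) -> str:
    """Hypothetical prompt layout: context first, then the question."""
    return f"Context:\n{format_docs(docs)}\n\nQuestion: {question}\nAnswer:"
```
The resulting string is what a model would receive as grounding context, so the answer can be checked against the same documents afterwards.
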
04:09.540 --> 04:09.720
Right.

04:09.720 --> 04:11.460
So this is the generation code.

04:11.460 --> 04:14.670
And we've also got something for grading the hallucinations.

04:14.670 --> 04:19.050
So this is a binary score for hallucinations present in the generation answer which is either going

04:19.050 --> 04:20.490
to be yes or no.

04:20.490 --> 04:21.990
And pretty much the same code.

04:21.990 --> 04:26.040
You've got some sort of grader that's being set up with the dot with structured output.

04:26.040 --> 04:31.580
And we also have a system message that talks about you're looking at a generated answer.

04:31.610 --> 04:35.210
Is this answer grounded in the documents?

04:35.300 --> 04:41.660
So you'll see the chat prompt template here actually contains both the system message and also the documents

04:41.660 --> 04:42.920
and the generation.

04:42.920 --> 04:47.420
And so we then pass the hallucination prompt into the structured grader.

04:47.420 --> 04:52.730
And then we can invoke that with the generation we got in the previous step along with all of the documents.

04:53.750 --> 04:56.150
And you'll see that it returns

04:56.150 --> 05:02.330
yes, meaning that the answer in this scenario is grounded in, or supported by, the set of facts.

05:02.330 --> 05:05.090
We can also grade the answer as well.

05:05.090 --> 05:08.030
So making sure that the answer addresses the question.

05:08.030 --> 05:12.590
So not just looking for hallucinations, but making sure the answer directly addresses the question.

05:12.590 --> 05:14.660
And it's very similar.

05:14.660 --> 05:19.680
The same sort of syntax that we're following along here where we're setting up a grader and we have

05:19.680 --> 05:21.150
that answer prompt as well.

05:21.210 --> 05:27.420
And so in this scenario it says no, which basically means that it doesn't think that it resolves the

05:27.420 --> 05:27.750
question.

05:27.750 --> 05:31.020
So it didn't hallucinate, but it didn't resolve the question.

05:31.020 --> 05:35.400
And we've also got a question rewriter where we're setting up a chat model.

05:35.400 --> 05:40.590
We've got a system prompt that tells the model to rewrite the question, and then we're giving it the initial

05:40.590 --> 05:42.840
question and asking it to formulate an improved question.

05:42.840 --> 05:48.180
And so if you remember what our original question was, it was agent memory.

05:48.180 --> 05:52.680
And the rewritten version is: what is the role and function of agent memory in artificial intelligence systems?

05:52.680 --> 05:57.570
Now that we've got all the individual pieces, we can start setting up our graph.

05:57.570 --> 06:00.090
So in our graph state we'll have the question.

06:00.090 --> 06:03.690
We'll have the generation and we'll have a list of documents.

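NOTE
A minimal sketch of the graph state just described, assuming a TypedDict with the three keys mentioned:
question, generation and documents. Documents are plain strings here for simplicity, and the update is
merged by hand only to illustrate how a node's return value changes the state.
```python
from typing import List, TypedDict
class GraphState(TypedDict):
    """State passed between nodes: the question, the generation, and the documents."""
    question: str
    generation: str
    documents: List[str]
state: GraphState = {"question": "agent memory", "generation": "", "documents": []}
# A node returns only the keys it changes; here that update is merged in by hand.
update = {"documents": ["Memory lets an agent recall earlier steps."]}
state = {**state, **update}
```
Keeping the state this small makes each node easy to reason about, since every node reads and writes the same three keys.
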
06:04.650 --> 06:07.120
And we'll then start setting up the nodes.

06:07.120 --> 06:13.630
So we have a retrieve node, whose sole purpose is to get the question and invoke the retriever,

06:13.630 --> 06:19.030
and then just to add the documents that came back from that retriever to the LangGraph state.

06:19.030 --> 06:24.790
So then you've got the generate section where we will take both the question and the documents that

06:24.790 --> 06:30.160
we returned from the retriever, and we will invoke the rag chain with the context being the documents

06:30.160 --> 06:35.600
and the question being the question. We'll then return the generation as an extra key.

06:35.630 --> 06:37.640
You can see here on the far right.

06:37.640 --> 06:39.470
We then grade the documents.

06:39.470 --> 06:42.200
And this grading happens by filtering the documents.

06:42.200 --> 06:44.330
So we get the question and the documents as well.

06:44.330 --> 06:47.570
And we can run that retrieval grader.

06:47.570 --> 06:48.710
We'll get a binary score.

06:48.710 --> 06:51.410
And if it's yes then we want to keep that document.

06:51.410 --> 06:53.600
And if not then it's not relevant.

06:53.600 --> 06:56.250
And we return documents as the filtered docs.

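NOTE
A sketch of the grade documents node as a filter, assuming documents are plain strings and the grader
returns "yes" or "no". make_grade_documents and keyword_grader are hypothetical names, and the keyword
check stands in for the LLM retrieval grader used in the video.
```python
from typing import Callable, List, TypedDict
class GraphState(TypedDict, total=False):
    question: str
    generation: str
    documents: List[str]
def make_grade_documents(grader: Callable[[str, str], str]):
    """Build a node that keeps only documents the grader scores 'yes'."""
    def grade_documents(state: GraphState) -> GraphState:
        question = state["question"]
        filtered_docs = [d for d in state["documents"] if grader(question, d) == "yes"]
        return {"documents": filtered_docs, "question": question}
    return grade_documents
def keyword_grader(question: str, document: str) -> str:
    """Hypothetical stand-in for the LLM retrieval grader."""
    return "yes" if any(w in document.lower() for w in question.lower().split()) else "no"
```
An empty documents list coming out of this node is the signal the next edge uses to decide on a query rewrite.
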
06:56.250 --> 07:02.640
And we also have a transform_query node, which allows us to take in a question, take in the documents,

07:02.640 --> 07:06.720
and then better rewrite the question from that as well.

07:06.720 --> 07:07.620
Then the edges.

07:07.620 --> 07:10.320
So we need to decide whether we're going to generate.

07:10.320 --> 07:15.090
So it takes the filtered documents, and you've got the question as well.

07:15.090 --> 07:21.280
So if there are no filtered documents, then none of the documents are relevant and we need to transform the

07:21.280 --> 07:21.670
query.

07:21.670 --> 07:23.230
So we go back into the query.

07:23.260 --> 07:26.020
Else we have all the relevant documents.

07:26.020 --> 07:28.420
So go and generate the answer.

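NOTE
The decision just described can be sketched as a small routing function. This assumes the state is a
dict whose documents key holds the filtered documents, and that the returned strings are the names of
the next nodes; the exact names here are illustrative.
```python
def decide_to_generate(state: dict) -> str:
    """Conditional edge: pick the next node from the filtered documents."""
    if not state["documents"]:
        # Every retrieved document was graded not relevant, so rewrite the question.
        return "transform_query"
    # At least one relevant document survived the filter, so answer from it.
    return "generate"
```
Because the function only returns a label, the mapping from label to node can be declared separately when the graph is wired up.
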
07:29.410 --> 07:35.200
And we've got a final one here which is grade generation versus documents and questions.

07:35.200 --> 07:38.410
So it takes in the question, the documents, and the generation.

07:38.410 --> 07:43.490
It runs a hallucination grader on the documents and the generation.

07:43.490 --> 07:49.580
And if the grade is yes, then the decision is that the generation is grounded in the documents, and then we also grade

07:49.580 --> 07:50.690
it on its answer.

07:50.690 --> 07:58.040
And if the answer is graded yes, then the decision is that the answer both is

07:58.040 --> 08:00.680
grounded in the documents and also addresses the question.

08:00.680 --> 08:06.380
And if it falls into one of these else branches, not useful or not supported, then it's basically failing at that

08:06.380 --> 08:06.980
point.

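NOTE
A sketch of the two-stage check after generation, assuming both graders are callables that return "yes"
or "no". make_generation_check is a hypothetical helper; in the video both graders are LLM chains with
structured output rather than plain functions.
```python
from typing import Callable
Grader = Callable[..., str]  # returns "yes" or "no"
def make_generation_check(hallucination_grader: Grader, answer_grader: Grader):
    """Build the conditional edge that runs after the generate node."""
    def grade_generation(state: dict) -> str:
        grounded = hallucination_grader(state["documents"], state["generation"])
        if grounded != "yes":
            return "not supported"  # not grounded in the documents
        answers = answer_grader(state["question"], state["generation"])
        return "useful" if answers == "yes" else "not useful"
    return grade_generation
```
The hallucination check runs first, so the answer grader is only consulted for generations that are already grounded in the documents.
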
08:06.990 --> 08:09.840
And so then we can start setting up our graph with our nodes.

08:09.840 --> 08:13.980
So we add in retrieve, grade documents, generate, and transform query.

08:13.980 --> 08:18.600
And we set the entry point to the retrieve node.

08:18.600 --> 08:22.320
And then we add an edge from retrieve to the grade documents.

08:22.320 --> 08:26.280
And we add a conditional edge from the grade documents.

08:26.280 --> 08:32.770
And remember this is either going to go and transform the query and trigger another round of document retrieval,

08:32.770 --> 08:35.170
or it's going to actually start generating the answer.

08:36.130 --> 08:37.750
And you can see the edge from transform query:

08:37.750 --> 08:40.330
this is where we would then go back to retrieve,

08:40.330 --> 08:42.820
if we were to go into transform query.

08:42.820 --> 08:46.960
And then we've also got a conditional edge for when we generate.

08:47.050 --> 08:50.710
If it's not supported, then we regenerate.

08:50.710 --> 08:52.300
If it's useful then we end.

08:52.300 --> 08:55.310
And if it's not useful then we transform the query.

08:55.310 --> 08:59.390
So we go back into the transform query, which would then take us back into retrieval, which would

08:59.390 --> 09:00.650
then start grading the documents.

09:00.650 --> 09:05.240
And then after that, you would then be able to see whether those documents are good enough.

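NOTE
The wiring described above, sketched as a plain-Python state machine rather than the LangGraph API, so
it runs without extra packages. The node lambdas are toy stand-ins, run_graph is a hypothetical helper,
and max_steps is an assumed cap that is not part of the walkthrough at this point.
```python
def run_graph(nodes, edges, routers, state, entry="retrieve", max_steps=10):
    """Walk the graph until END; max_steps is an assumed cap to stop endless loops."""
    current, trace = entry, []
    for _ in range(max_steps):
        trace.append(current)
        state = {**state, **nodes[current](state)}
        if current in routers:
            route, mapping = routers[current]
            current = mapping[route(state)]
        else:
            current = edges[current]
        if current == "END":
            break
    return state, trace
nodes = {
    "retrieve": lambda s: {"documents": ["doc about " + s["question"]]},
    "grade_documents": lambda s: {"documents": [d for d in s["documents"] if "memory" in d]},
    "transform_query": lambda s: {"question": s["question"] + " memory"},
    "generate": lambda s: {"generation": "answer from " + s["documents"][0]},
}
edges = {"retrieve": "grade_documents", "transform_query": "retrieve"}
routers = {
    "grade_documents": (lambda s: "generate" if s["documents"] else "transform_query",
                        {"generate": "generate", "transform_query": "transform_query"}),
    "generate": (lambda s: "useful",
                 {"not supported": "generate", "useful": "END", "not useful": "transform_query"}),
}
final_state, trace = run_graph(nodes, edges, routers, {"question": "agent"})
```
With the toy nodes, the first retrieval is filtered out, the question is rewritten, and the second pass reaches generate, which is the extra cycle the graph is designed to allow.
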
09:05.240 --> 09:07.820
And so then we can run our app and have a look at this.

09:07.820 --> 09:12.530
So we've got the question: how do the different types of agent memory work?

09:12.530 --> 09:13.790
And we can have a look and see.

09:13.790 --> 09:17.210
So it starts by retrieving the documents.

09:17.210 --> 09:21.990
And then we can see that we're checking the documents, and we've assessed them and decided

09:21.990 --> 09:23.220
to generate.

09:23.220 --> 09:27.870
And then after that we've generated, and we've checked for hallucinations.

09:27.870 --> 09:30.780
We've made a decision that the generation is grounded in the documents.

09:30.780 --> 09:36.240
And also we've made a decision that the generation addresses the question.

09:36.240 --> 09:38.640
And so we've decided that this is the output.

09:38.640 --> 09:42.780
But the important thing is what happens if a document wasn't relevant.

09:42.780 --> 09:45.220
So you can see here: check document relevance to question.

09:45.220 --> 09:47.050
And it decided the document was relevant.

09:47.050 --> 09:50.560
But there's some documents that aren't relevant which means those get removed.

09:50.560 --> 09:55.120
And again, it's graded the documents and it's made a decision to generate.

09:55.120 --> 09:56.950
It's checked for hallucinations.

09:56.950 --> 10:00.040
The answer is grounded in the documents.

10:00.040 --> 10:03.970
And the generation is actually addressing the question.

10:03.970 --> 10:05.110
So it's decided to generate.

10:05.110 --> 10:11.060
But if we were to have documents that weren't addressing our question, that's

10:11.060 --> 10:13.580
where we would start seeing this extra cycle kick in.

10:13.580 --> 10:17.480
So if I scroll back up to the top, we've encapsulated this as a graph.

10:17.480 --> 10:22.220
But the important point is, if we grade these documents and we don't find any that are relevant, then

10:22.220 --> 10:24.710
we're going to have to go and start rewriting the question.

10:24.710 --> 10:26.060
And the same is true here.

10:26.060 --> 10:31.800
If the answer doesn't specifically address the question, then we go back in here and you might set

10:31.800 --> 10:34.350
a maximum number of iterations to go round on.

10:34.350 --> 10:40.560
But hopefully this gives you a sort of rough understanding about how you could build an adaptive retrieval

10:40.560 --> 10:41.400
based system.

10:41.400 --> 10:44.460
So obviously there are easier ways to do retrieval.

10:44.460 --> 10:49.680
One would be to simply retrieve some documents and generate the answer.

10:50.130 --> 10:54.840
But this is trying to encapsulate where we're looking to see whether the documents that are returned

10:54.840 --> 10:55.930
are actually useful.

10:55.930 --> 10:58.660
And also, is the answer hallucinated?

10:58.690 --> 11:00.790
Does the answer refer to the documents?

11:00.790 --> 11:05.080
And if it does, and all of those things are true, then it goes through to the final output.

11:05.080 --> 11:08.200
So this is what you might end up working towards.

11:08.200 --> 11:13.660
But for simpler ones, I would definitely start with just having the question, having the retrieval

11:13.660 --> 11:16.060
node and then having the final answer.

11:16.060 --> 11:18.760
But this is something that you could definitely work up to with time.

11:18.760 --> 11:19.150
All right.

11:19.150 --> 11:19.510
Cool.

11:19.510 --> 11:20.560
I'll see you in the next one.
