WEBVTT

00:00.120 --> 00:06.200
In this video, we take the evaluators that we learned about in our previous videos and use them to evaluate a RAG pipeline.

00:06.600 --> 00:13.640
The goal here is to evaluate your RAG pipeline with the model-based metrics that Haystack offers.

00:14.160 --> 00:20.000
The assumption is that you already know how to execute pipelines, and you're fairly familiar with

00:20.000 --> 00:21.880
the pipelines in Haystack.

00:22.280 --> 00:25.560
So first things first, let's create a RAG pipeline.

00:25.800 --> 00:33.440
In this case, I'll quickly run through the process, where I have to load the dataset, import

00:33.440 --> 00:40.560
those documents, embed them with Sentence Transformers, and write them to an in-memory document store.

00:41.440 --> 00:44.160
So for that, let me first import the documents.

00:44.440 --> 00:52.400
In this case, I'm going to use a pre-existing dataset from Hugging Face, the PubMedQA instruction dataset.

00:52.680 --> 00:54.320
It will load datasets.

00:54.560 --> 00:56.200
It's a huge dataset.

00:56.200 --> 00:59.360
So we will use only the top 25 rows.

00:59.680 --> 01:05.190
And then for each document I will use the context as the content, and collect all the documents.

01:05.550 --> 01:11.910
So once we have all the documents, let's understand what a document looks like.

01:12.270 --> 01:19.150
So what I'm going to do is quickly print a document so you can see what the documents

01:19.150 --> 01:20.030
look like.

01:21.110 --> 01:28.990
So now I'll just quickly execute this so we can see what the documents look like.

01:29.430 --> 01:30.350
There you go.

01:31.110 --> 01:32.390
So there's a document here.

01:32.390 --> 01:33.670
And it has content.

01:33.670 --> 01:36.870
And each document has content and a document ID.

01:37.510 --> 01:41.710
Now let's move on to get two different things here.

01:42.190 --> 01:45.830
Also the instructions from the document that we have.

01:46.150 --> 01:50.030
So we wrote the documents here very similarly.

01:50.070 --> 01:51.750
We get the context.

01:51.910 --> 01:55.510
We also get the instructions and the response.

01:55.950 --> 02:03.300
So the instructions are the questions, and the responses are the ground truth answers.

02:04.340 --> 02:07.260
So we get three different values from the documents.

02:07.580 --> 02:12.500
So now we have documents, questions, and ground truth answers.

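NOTE
A minimal sketch of the extraction step described above, assuming each dataset row is a dict with "context", "instruction", and "response" fields. The toy rows and the helper name split_rows are hypothetical stand-ins, not the real PubMedQA data or a Haystack API. In the video the contexts are wrapped in Haystack documents rather than kept as plain strings.
```python
# Toy rows (hypothetical) shaped like the three fields named in the video.
rows = [
    {"context": "Study A text.", "instruction": "Does A help?", "response": "Yes."},
    {"context": "Study B text.", "instruction": "Does B help?", "response": "No."},
]
def split_rows(rows, limit=25):
    """Return parallel lists of contexts, questions, and ground truth answers."""
    subset = rows[:limit]  # keep only the first `limit` rows
    contexts = [row["context"] for row in subset]
    questions = [row["instruction"] for row in subset]
    answers = [row["response"] for row in subset]
    return contexts, questions, answers
contexts, questions, answers = split_rows(rows)
print(len(contexts), questions[0], answers[0])
```
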
02:12.980 --> 02:19.660
Let's initialize the in-memory document store, the document writer, and the Sentence Transformers embedder.

02:20.060 --> 02:28.180
So in this case, I'm initializing the document store as an in-memory document store. Then there is the Sentence

02:28.180 --> 02:30.660
Transformers document embedder.

02:31.060 --> 02:33.540
We're using MiniLM L6.

02:34.020 --> 02:37.900
And then we initialize the document writer with the document store.

02:38.260 --> 02:41.700
And if there is a duplicate then we skip.

02:42.100 --> 02:44.020
Now let's create a pipeline.

02:44.300 --> 02:53.060
So I created a pipeline here for indexing, with two components: the document embedder

02:53.060 --> 02:55.460
and the document writer.

02:56.140 --> 03:01.540
And then you connect the document embedder's documents output to the document writer's documents input.

03:01.980 --> 03:07.180
And then we run the indexing pipeline with all the documents that we had.

03:07.420 --> 03:11.700
So that's step one, where the indexing pipeline runs.

03:11.980 --> 03:16.260
Now let's move on to the other phase, where we build the RAG application.

03:17.140 --> 03:25.180
So in this case, we need the prompt builder, the answer builder, the Sentence Transformers text embedder,

03:25.500 --> 03:27.660
the generator, and the retriever.

03:28.140 --> 03:31.900
This will help us create the RAG pipeline.

03:33.020 --> 03:35.820
And now what I'm going to do is provide a prompt.

03:35.980 --> 03:38.820
So this is the prompt, built from all the documents.

03:38.980 --> 03:46.300
We say: answer the following question based on the given context, then we loop over the documents.

03:46.300 --> 03:48.900
Then comes the question, and then the answer.

03:49.300 --> 03:51.780
And then moving on we connect.

03:51.980 --> 03:54.260
We'll initialize the RAG pipeline.

03:54.500 --> 03:55.740
This is a pipeline.

03:55.740 --> 04:04.420
So in this case, in the RAG pipeline we are adding components like the Sentence Transformers text embedder

04:04.610 --> 04:08.290
and the retriever, which works on the in-memory document store.

04:09.170 --> 04:11.250
We add the document store here.

04:11.490 --> 04:15.530
Then the prompt builder, a generator, and the answer builder.

04:15.890 --> 04:19.450
Now we have added all the components to the RAG pipeline.

04:19.690 --> 04:23.210
And then we will connect different components to each other.

04:23.730 --> 04:29.490
So in this case, I'll begin by connecting the query embedder to the retriever's query embedding.

04:30.250 --> 04:31.930
Then the retriever is a sender.

04:31.930 --> 04:36.890
Here it will send the retrieved documents to the prompt builder's documents input.

04:37.890 --> 04:45.130
The prompt builder sends its prompt to the generator, and the generator's replies are sent to the answer builder to

04:45.170 --> 04:46.570
build the answers.

04:47.570 --> 04:54.650
There's one more connection needed in the RAG pipeline: the retriever, which retrieves

04:54.650 --> 04:59.170
the documents, also sends them to the answer builder's documents input.

04:59.530 --> 05:02.770
So now everything is connected to each other.

05:03.010 --> 05:11.000
And then we will ask the question, do high levels of procalcitonin in the early phase after pediatric

05:11.000 --> 05:15.880
liver transplant indicate poor postoperative outcome?

05:16.400 --> 05:17.960
That's the question here.

05:17.960 --> 05:20.840
And then I will go ahead and run the pipeline.

05:21.800 --> 05:27.080
So in this case here I have the code that we can use to run the pipeline.

05:27.640 --> 05:30.520
So here we have the question that we asked.

05:30.840 --> 05:33.360
The prompt builder gets the same question.

05:33.360 --> 05:35.880
And the answer builder gets the same question.

05:35.880 --> 05:36.920
Everything's great.

05:37.400 --> 05:38.960
And then let's run this.

05:39.320 --> 05:40.920
It's a big dataset.

05:41.920 --> 05:49.320
So what I'm going to do is even though we are getting just top 25, I'll pause the video and we'll come

05:49.320 --> 05:49.920
back.

05:50.360 --> 05:52.040
It was quicker than I thought.

05:52.520 --> 05:59.200
So in this case, we got the generated response from the generator about the question that was asked

05:59.200 --> 05:59.680
here.

06:00.680 --> 06:05.360
So I hope you know about all of this and you're aware of how to execute the pipeline.

06:05.760 --> 06:07.280
This is where we start with

06:07.430 --> 06:12.150
the evaluators, to check how good the results that we got were.

06:12.590 --> 06:19.990
So to evaluate that, we'll need to extract the questions, the ground truth answers, and the ground truth

06:19.990 --> 06:20.830
documents.

06:21.230 --> 06:25.230
So this is the original set of questions that we have stored.

06:26.270 --> 06:29.830
So we extract that top 25 and store it.

06:30.310 --> 06:35.670
And so this is the first set of questions, zipped together with their answers and documents.

06:36.070 --> 06:41.110
And we will use it to evaluate the RAG pipeline going forward.

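NOTE
A small sketch of keeping questions, answers, and documents aligned while taking the top 25, assuming the "zip" mentioned above refers to Python's built-in zip function rather than an archive file. The list contents are made-up placeholders, and plain slicing is an assumption about how the subset is chosen.
```python
# Placeholder data (hypothetical): three parallel lists of equal length.
all_questions = [f"question {i}" for i in range(100)]
all_answers = [f"answer {i}" for i in range(100)]
all_docs = [f"document {i}" for i in range(100)]
# zip keeps each question aligned with its own answer and document.
triples = list(zip(all_questions, all_answers, all_docs))[:25]
# Unzip back into three parallel tuples of length 25.
questions, ground_truth_answers, ground_truth_docs = zip(*triples)
print(len(questions), questions[0], ground_truth_docs[24])
```
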
06:41.110 --> 06:45.710
Let's initialize the RAG answers and the retrieved docs.

06:46.070 --> 06:52.990
So now, instead of executing this once, what we want to do is execute this over the list of

06:52.990 --> 06:53.990
25.

06:54.950 --> 07:00.550
So we get a list of 25 questions, ground truth answers and ground truth docs.

07:00.910 --> 07:08.270
And then we capture the responses generated by the LLM as RAG answers, along with the retrieved documents.

07:08.620 --> 07:10.780
So we execute this in a loop.

07:10.980 --> 07:14.580
So if you see here we have the list of questions that we got here.

07:15.420 --> 07:18.900
So we execute the entire pipeline that we had here.

07:19.180 --> 07:20.780
We execute in a loop.

07:20.980 --> 07:25.420
And for each response that we get, we save the answer.

07:25.780 --> 07:31.740
And we also store the retrieved documents as the document part of the answer.

07:32.020 --> 07:36.260
So we'll have 25 sets of responses that we receive.

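NOTE
A sketch of the collection loop described above, with a stand-in function in place of the real pipeline run. fake_rag_run and its return shape (a dict with "answer" and "documents" keys) are hypothetical. The real pipeline returns its own result structure, from which the video reads the answer and its documents.
```python
def fake_rag_run(question):
    """Hypothetical stand-in for running the RAG pipeline on one question."""
    return {"answer": f"Answer to: {question}", "documents": [f"doc for {question}"]}
def collect_responses(questions, run=fake_rag_run):
    """Run every question and keep the answers and retrieved docs in order."""
    rag_answers = []
    retrieved_docs = []
    for question in questions:
        response = run(question)
        rag_answers.append(response["answer"])
        retrieved_docs.append(response["documents"])
    return rag_answers, retrieved_docs
rag_answers, retrieved_docs = collect_responses(["Q1", "Q2", "Q3"])
print(len(rag_answers), len(retrieved_docs))
```
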
07:36.780 --> 07:39.900
Now we will go ahead and evaluate this.

07:40.260 --> 07:44.220
So what this means is that we create an evaluation pipeline.

07:44.700 --> 07:48.660
And then to the eval pipeline I'll add components.

07:48.860 --> 07:54.980
And this time the component that we want to add is the faithfulness evaluator.

07:55.380 --> 07:56.940
And then let's import that.

07:57.260 --> 08:00.060
We have seen how this works independently.

08:00.060 --> 08:04.580
But now it's time where we see how this works within a pipeline.

08:05.500 --> 08:07.020
There was a typo there.

08:07.020 --> 08:16.930
And then I'm going to quickly use the SAS evaluator and provide the model, which is a MiniLM model, to evaluate.

08:17.450 --> 08:19.250
So now we have things in place.

08:19.650 --> 08:23.490
We add the component and now let's run the pipeline.

08:23.890 --> 08:32.250
To run the pipeline, we'll have to provide three different things for the faithfulness evaluator.

08:32.730 --> 08:36.010
Here we have to provide the questions that we had.

08:36.010 --> 08:37.690
List of questions.

08:37.890 --> 08:41.330
Then the ground truth documents and the rag answers.

08:41.690 --> 08:46.450
So these are the answers that we got from the generator, or the LLM.

08:46.930 --> 08:48.730
And this is the ground truth doc.

08:48.770 --> 08:50.410
That is the relevant answer.

08:50.410 --> 08:52.170
And these are the questions.

08:52.570 --> 09:00.850
So faithfulness helps us understand how well the RAG answer is grounded in the ground truth docs.

09:01.850 --> 09:05.050
And similarly for the SAS evaluator.

09:05.090 --> 09:12.730
We have got all the predicted answers and the ground truth answers right here from the list of the ground

09:12.730 --> 09:15.690
truth answers that we originally captured.

09:16.010 --> 09:19.090
So these are actually the ground truth documents.

09:19.090 --> 09:21.530
And here are the ground truth answers.

09:22.010 --> 09:29.690
The SAS evaluator does need the ground truth answers to evaluate against the generated answers.

09:30.690 --> 09:33.010
So now we get everything that we want.

09:33.050 --> 09:36.970
And then let's go ahead and run this and print the results.

09:37.850 --> 09:38.930
Run this.

09:39.530 --> 09:45.490
I will pause this until the pipeline gets executed and see you back soon.

09:46.170 --> 09:50.370
So now the pipeline got executed along with the evaluators running on it.

09:50.730 --> 09:55.250
So now let's go ahead and copy that and paste it in a Visual Studio editor.

09:55.570 --> 09:57.650
And let's analyze this.

09:58.930 --> 10:05.570
So if you notice here, at the root of the JSON, one of the evaluators is faithfulness.

10:06.090 --> 10:07.810
And scroll down a little bit here.

10:08.010 --> 10:11.290
The other one here is the SAS evaluator.

10:11.490 --> 10:18.480
So we got two distinct evaluators and the evaluations they performed. In the case of faithfulness,

10:18.640 --> 10:20.320
the result has statements.

10:20.640 --> 10:23.400
So in this case here there were statements.

10:23.560 --> 10:28.480
Statement one, two, three, four, and five.

10:28.960 --> 10:31.080
And then each statement had a score.

10:31.080 --> 10:34.240
And they were evaluated as 1, 1, 1, 1.

10:34.240 --> 10:36.640
And this is the overall score.

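NOTE
A sketch of how the numbers in the faithfulness output relate to each other, assuming each answer's score is the mean of its statement scores and the overall score is the mean over all answers. The result dicts below are made-up values, and the key names are assumptions modelled on what the video shows on screen.
```python
from statistics import mean
# Made-up results (hypothetical): one fully supported answer, one half supported.
results = [
    {"statements": ["s1", "s2", "s3", "s4"], "statement_scores": [1, 1, 1, 1]},
    {"statements": ["s1", "s2"], "statement_scores": [1, 0]},
]
def answer_score(result):
    """Mean of the per-statement scores, or 0.0 when there are no statements."""
    scores = result["statement_scores"]
    return mean(scores) if scores else 0.0
individual_scores = [answer_score(result) for result in results]
overall_score = mean(individual_scores)
print(individual_scores, overall_score)
```
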
10:37.040 --> 10:39.760
Same goes for the other 24 answers.

10:39.760 --> 10:46.360
And you can check how the responses look when you execute this in your own time.

10:47.240 --> 10:50.280
But in a nutshell, this is how it works.

10:50.480 --> 10:58.320
Whereas the SAS evaluator had a score, which is the average of all scores, and the individual

10:58.320 --> 11:00.080
scores are here.

11:00.600 --> 11:08.920
These are the 25 SAS evaluator scores for the 25 items that we looped over.

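NOTE
A sketch of the averaging seen in the SAS evaluator output, assuming each individual score is a cosine similarity between an embedding of the predicted answer and an embedding of the ground truth answer. The vectors are hand-made toy values, not the output of a real embedding model, and the helper name cosine is hypothetical.
```python
import math
def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
# Toy embeddings (hypothetical): one pair identical, one pair 45 degrees apart.
predicted = [[1.0, 0.0], [1.0, 1.0]]
ground_truth = [[1.0, 0.0], [1.0, 0.0]]
individual_scores = [cosine(p, g) for p, g in zip(predicted, ground_truth)]
score = sum(individual_scores) / len(individual_scores)
print(individual_scores, score)
```
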
11:09.320 --> 11:12.840
So this is how the evaluators work in Haystack.
