WEBVTT

00:00.280 --> 00:01.920
Let's dive into the Faithfulness Evaluator.

00:01.920 --> 00:10.480
The FaithfulnessEvaluator component evaluates answers generated by a Haystack pipeline, such as

00:10.480 --> 00:13.800
a RAG pipeline, without ground truth labels.

00:14.240 --> 00:20.280
The component splits the generated answer into statements and checks each of them against the provided

00:20.280 --> 00:22.680
context with an LLM.

00:23.160 --> 00:29.960
A higher faithfulness score is better, and it indicates that a large number of statements in the generated

00:29.960 --> 00:33.080
answer can be inferred from the context.

00:34.040 --> 00:41.840
A faithfulness score can be used to better understand how often and when the generator in a RAG pipeline

00:41.880 --> 00:43.040
hallucinates.

00:43.600 --> 00:48.200
Let's work through a real-world example with the FaithfulnessEvaluator.

00:48.640 --> 00:50.800
So here's the PyCharm editor.

00:51.160 --> 00:58.640
Let me import, from the Haystack components module called evaluators, a specific evaluator called FaithfulnessEvaluator.

00:59.120 --> 01:00.760
I'll ask the question here:

01:01.780 --> 01:04.460
Who created the Python programming language?

01:04.780 --> 01:06.980
Let's provide the context here.

01:07.220 --> 01:09.620
Which is about the creator of Python.

01:10.020 --> 01:15.180
Python was created in the late 1980s and is a programming language.

01:15.500 --> 01:22.940
And then it has two sentences here, the creator of it and what Python as a language entails.

01:23.380 --> 01:25.140
So this is the context.

01:25.180 --> 01:28.180
Now let's go to the predicted answer for this.

01:29.100 --> 01:32.340
So let's say the generator provided an answer like this.

01:32.980 --> 01:33.500
Here.

01:34.100 --> 01:41.380
This is the answer that came back from the generator, which is: Python is a high-level, general-purpose programming

01:41.380 --> 01:44.260
language that was created by George Lucas.

01:44.620 --> 01:46.020
Now let's run.

01:46.060 --> 01:52.820
Let's initialize, sorry, the evaluator as FaithfulnessEvaluator, and assign result from evaluator

01:53.100 --> 01:53.580
dot run.

01:53.900 --> 01:59.100
Run the evaluator here with questions, contexts, and predicted answers.

02:00.060 --> 02:02.080
Now, once we've run the evaluator,

02:02.360 --> 02:05.400
let's print the individual scores from the result.

02:05.680 --> 02:06.560
That's one.

02:07.040 --> 02:09.240
Now let's print the results.

02:09.400 --> 02:10.960
Print result scores.

02:11.400 --> 02:15.760
So in this case we have the entire program ready.

02:16.000 --> 02:17.640
Let's execute this.

02:18.760 --> 02:21.680
I'll go ahead and execute this. I got an error.

02:21.960 --> 02:25.360
So I believe it's not scores, it's results.

02:25.560 --> 02:27.840
I'll execute this one more time.

02:28.160 --> 02:29.520
So now perfect.

02:29.880 --> 02:34.080
So if you see here the score is 0.5.

02:34.480 --> 02:35.920
Individual score.

02:36.400 --> 02:43.480
And the reason we got a score of 0.5 and not one is that there are two statements here.

02:43.760 --> 02:47.320
One of them can be inferred from the context and the other one cannot.

02:47.760 --> 02:49.960
And that is why the statement scores are one and zero.

02:50.160 --> 02:51.400
And the overall score is not one.

02:52.440 --> 02:55.320
It's an average of the two statement scores.

02:55.560 --> 02:59.920
That's why we got 0.5 as the individual score.
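
NOTE
A minimal sketch of the averaging described above, not the library's own code. The name faithfulness_score is hypothetical, and the statement scores [1, 0] are taken from the walkthrough.
```python
def faithfulness_score(statement_scores):
    # Mean of the 0/1 statement scores; an answer with no statements is scored 0.0 here (an assumption).
    return sum(statement_scores) / len(statement_scores) if statement_scores else 0.0
# One statement supported by the context (1) and one not supported (0), as in the example.
individual_score = faithfulness_score([1, 0])
print(individual_score)
```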

03:00.400 --> 03:03.650
So now if you notice the creator is also different here.

03:04.650 --> 03:08.450
So here it is Guido van Rossum and here it is George Lucas.

03:08.890 --> 03:10.410
That's the difference here.

03:10.810 --> 03:16.090
So what I can do here is take the creator from the context and put it in the predicted answer.

03:16.370 --> 03:21.450
And let's see how the question and context and the answers are evaluated.

03:21.450 --> 03:22.570
And we get the answer.

03:22.570 --> 03:29.010
What cut the score here is that the creator in the context was different from the one in the predicted answer.

03:29.290 --> 03:32.450
That's what was causing the score to go down.

03:32.690 --> 03:33.970
And we receive.

03:34.810 --> 03:37.490
And let's understand how this works here.

03:38.010 --> 03:39.810
Faithfulness evaluator.

03:40.090 --> 03:47.010
Now if you scroll down a little bit on this class, you will notice this is the prompt that

03:47.010 --> 03:52.450
is used by the FaithfulnessEvaluator to evaluate the predicted answers.

03:52.850 --> 03:59.530
So you see here: your task is to judge the faithfulness or groundedness of statements based on the context

03:59.530 --> 04:00.610
information.

04:01.010 --> 04:05.150
So context information here is the source of truth.

04:06.190 --> 04:11.030
Extract the statements from the provided predicted answer to a question.

04:11.310 --> 04:19.190
Calculate the faithfulness score for each statement made in the predicted answer, and we are determining

04:19.190 --> 04:21.470
that the score is one.

04:21.470 --> 04:28.910
If the statement can be inferred from the provided context, or zero if it cannot be inferred.

04:29.310 --> 04:36.510
So that is the prompt that is passed, along with the contexts, questions, and predicted answers,

04:36.710 --> 04:39.630
to the OpenAI model for evaluation.
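
NOTE
A rough sketch of the per-statement judging step, under stated assumptions. In the video an LLM does the judging; toy_judge below is a hypothetical stand-in that only checks whether capitalized names appear in the context, and score_statements is also a made-up name.
```python
from typing import Callable, List
def toy_judge(statement: str, context: str) -> int:
    # Hypothetical stand-in for the LLM: 1 if every capitalized word in the statement occurs in the context, else 0.
    names = [word.strip(".,") for word in statement.split() if word[0].isupper()]
    return int(all(name in context for name in names))
def score_statements(statements: List[str], context: str, judge: Callable[[str, str], int]) -> List[int]:
    # Each statement gets 1 if the judge deems it inferable from the context, otherwise 0.
    return [judge(statement, context) for statement in statements]
context = "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language."
statements = ["Python is a high-level general-purpose programming language.", "Python was created by George Lucas."]
statement_scores = score_statements(statements, context, toy_judge)
print(statement_scores)
```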

04:39.870 --> 04:44.030
That is about faithfulness evaluator and how it works.

04:44.990 --> 04:52.190
We will use the FaithfulnessEvaluator in an end-to-end pipeline in a more advanced topic later on, but

04:52.190 --> 04:57.590
this video is more about how faithfulness works.

04:57.830 --> 05:02.950
Let's move on to another evaluator called the SASEvaluator.
