WEBVTT

00:00.080 --> 00:02.160
So what is the SAS evaluator?

00:02.400 --> 00:09.520
SAS stands for semantic answer similarity. You can use the SASEvaluator component to evaluate answers predicted

00:09.520 --> 00:15.000
by a Haystack pipeline, such as a RAG pipeline, against ground truth labels.

00:15.320 --> 00:21.120
You can provide a bi-encoder or cross-encoder model to initialize a SAS evaluator.

00:27.240 --> 00:32.640
By default, it uses the sentence-transformers model

00:32.680 --> 00:35.040
paraphrase-multilingual-mpnet-base-v2.

00:35.840 --> 00:41.280
Note that only one predicted answer is compared to the ground truth answer at a time.

00:41.680 --> 00:48.360
The component does not support multiple ground truth answers for the same question or multiple answers

00:48.360 --> 00:50.120
predicted for the same question.

00:50.680 --> 00:53.360
Let's look at an example of the SAS evaluator.

00:54.720 --> 00:57.840
So here I have imported SASEvaluator.

00:58.040 --> 01:02.620
We assign it to a variable and run a warm-up to load the model.

01:03.620 --> 01:08.020
Now let's run this SAS evaluator and provide the ground truth answers.

01:08.420 --> 01:15.060
In this case, let's say we use Berlin and Paris, and let's set the predicted answers that we

01:15.100 --> 01:18.460
got from the generator to Berlin and Lyon.

01:18.860 --> 01:24.020
Now, when we run this, we can print the result's individual scores,

01:24.380 --> 01:26.660
or we can just print the score.

01:27.100 --> 01:30.900
Let's save this SAS evaluator and then run it.

01:31.500 --> 01:34.820
It might take a minute or so to download the model.

01:35.100 --> 01:36.820
Sometimes it's quicker.

01:37.980 --> 01:38.860
There you go.

01:39.300 --> 01:41.700
We actually got the answer pretty quickly.

01:41.940 --> 01:43.260
It's 0.7.

01:43.620 --> 01:50.980
So what it did is compare Berlin with Berlin and Paris with Lyon, and give us the score.

01:51.380 --> 01:54.060
Let's check this one more time, printing the result's

01:54.300 --> 01:55.900
individual scores.

01:56.220 --> 01:57.780
Now let's run this again.

01:58.120 --> 02:05.160
We'll get the individual scores, that is, the similarity between Berlin and Berlin, and between Paris and Lyon.

02:06.120 --> 02:12.960
So if you notice, the similarity was one for Berlin because it matched, whereas with Paris it was

02:12.960 --> 02:13.440
low.

02:13.800 --> 02:18.440
So the average of these two is 0.7587.
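
NOTE
A minimal sketch of the arithmetic just described, assuming the overall score is the plain mean of the per-pair scores. The second score is not stated in the recording; 0.5174 is only what the reported average of 0.7587 implies when the Berlin pair scores 1. The helper name overall_score is hypothetical and is not part of Haystack.
```python
def overall_score(individual_scores):
    # Plain arithmetic mean of the per-pair similarity scores.
    return sum(individual_scores) / len(individual_scores)
# Assumed values: 1.0 for Berlin/Berlin, 0.5174 implied for Paris/Lyon.
individual_scores = [1.0, 0.5174]
print(round(overall_score(individual_scores), 4))
```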

02:18.880 --> 02:25.880
This is the semantic answer similarity evaluator, which evaluates predicted answers against ground truth answers.

02:26.280 --> 02:31.000
Let's move on to the third evaluator, which is the context relevance evaluator.

02:31.160 --> 02:38.560
You can use the ContextRelevanceEvaluator component to evaluate documents retrieved by a Haystack pipeline,

02:38.760 --> 02:42.600
such as a RAG pipeline, without ground truth labels.

02:43.000 --> 02:49.800
Unlike the SAS evaluator, we do not need ground truth labels here.

02:50.480 --> 02:57.060
The component breaks up the context into multiple statements and checks whether each statement is relevant

02:57.060 --> 02:58.620
for answering a question.

02:59.740 --> 03:07.140
The final score for context relevance is a number from 0 to 1, and represents the proportion

03:07.140 --> 03:10.660
of statements that are relevant to the provided question.
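
NOTE
A rough sketch of that scoring rule, assuming each statement extracted from the context has already been judged relevant or not. The names relevance_score and statement_is_relevant are hypothetical; in Haystack the judging is delegated to a language model, which is not reproduced here.
```python
def relevance_score(statement_is_relevant):
    # Proportion of statements judged relevant.
    # Returning 0.0 for an empty list is an assumption of this sketch.
    if not statement_is_relevant:
        return 0.0
    return sum(statement_is_relevant) / len(statement_is_relevant)
# Hypothetical judgments for three statements extracted from one context.
print(relevance_score([True, True, False]))
```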

03:11.180 --> 03:17.180
Let's take a real-world example and understand how the context relevance evaluator works.

03:17.500 --> 03:19.140
So now let's understand this.

03:19.380 --> 03:24.580
Here I have imported the evaluator, ContextRelevanceEvaluator.

03:24.900 --> 03:33.260
Here I'll ask a question very similar to the one I asked before: who created the Python programming language?

03:33.780 --> 03:42.220
And then we give the same context here that we provided in our previous example: who created Python,

03:42.220 --> 03:46.380
when it was created, and what the Python language is all about.

03:46.940 --> 03:54.560
Now, instead of using the faithfulness evaluator, in this case we'll set the evaluator to the context

03:54.720 --> 03:59.360
relevance evaluator, which evaluates the context.

04:00.320 --> 04:07.520
We don't need the predicted answer here, which is unlike the faithfulness evaluator where we have predicted

04:07.520 --> 04:08.760
answers as well.

04:09.240 --> 04:14.640
This is just to evaluate whether, given this question, this context can

04:14.640 --> 04:16.280
answer the question or not.

04:17.040 --> 04:17.600
Let's go on.

04:17.640 --> 04:19.840
The result comes from the evaluator's run method.

04:20.240 --> 04:24.200
We provide the list of questions as questions and the contexts as contexts.

04:24.600 --> 04:26.360
We print the score from the result's

04:26.680 --> 04:27.400
score field,

04:27.920 --> 04:32.880
print the individual scores, and then also print the results.

04:33.280 --> 04:35.160
Let's go ahead and execute this.

04:35.160 --> 04:39.920
So if you notice here, when this executes, we get 1 for all of them.

04:39.920 --> 04:41.920
And the result is here as well.

04:42.320 --> 04:43.400
So we got one.

04:43.400 --> 04:48.000
Because clearly the question here is in the context.

04:48.320 --> 04:52.720
Actually, it's the other way around: the context here is relevant to the question.
