WEBVTT

00:00.760 --> 00:01.840
Hello everyone!

00:02.000 --> 00:05.600
In today's video, we will learn about evaluators in Haystack.

00:06.080 --> 00:07.480
What are evaluators?

00:07.680 --> 00:15.400
Haystack has the tools needed to evaluate entire pipelines or individual components like retrievers, readers,

00:15.560 --> 00:16.800
and generators.

00:17.240 --> 00:24.040
We can use evaluation and its results to judge how well the system performs, compare the performance

00:24.040 --> 00:28.720
of different models, and identify underperforming components.

00:29.720 --> 00:37.600
End-to-end evaluation checks how the full pipeline is used and evaluates only the final outputs.

00:38.080 --> 00:40.960
The pipeline is approached as a black box.

00:41.440 --> 00:43.680
What are the different types of evaluation?

00:43.840 --> 00:45.600
Model-based evaluation.

00:45.920 --> 00:53.720
Model-based evaluation uses LLMs with prompt instructions or smaller fine-tuned models to score aspects

00:53.720 --> 00:54.560
of a pipeline's

00:54.560 --> 00:55.160
output.

00:56.280 --> 00:58.400
Statistical evaluation.

00:58.880 --> 01:04.670
It requires no model and is thus a more lightweight way to score pipeline outputs.

01:05.030 --> 01:12.310
Most statistical evaluators require ground-truth labels, such as the documents relevant to the query or

01:12.310 --> 01:13.710
the expected answer.

01:14.190 --> 01:18.710
We will not cover statistical evaluation as part of this video.

01:19.190 --> 01:23.990
We'll cover this in a more advanced tutorial in the Advanced Topics section.

01:24.950 --> 01:27.070
Model-based evaluation.

01:27.310 --> 01:34.150
Model-based evaluation in Haystack uses a language model to check the results of a pipeline.

01:34.710 --> 01:40.190
This method is easy to use because it usually doesn't need labels for the outputs.

01:40.550 --> 01:48.230
It's often used with retrieval-augmented generative (RAG) pipelines, but can work with any pipeline.

01:48.670 --> 01:50.510
Large language model.

01:50.670 --> 01:57.350
A common strategy for model-based evaluation involves using a large language model, such as OpenAI's

01:57.390 --> 02:06.580
GPT models, as the evaluator model, often referred to as the golden model. Small cross-encoder

02:06.580 --> 02:07.260
models.

02:07.500 --> 02:10.700
These models can calculate semantic similarity.

02:11.260 --> 02:20.420
This method of using small cross-encoder models as evaluators is faster and cheaper to run, but is less flexible

02:20.420 --> 02:23.260
in terms of what aspects you can evaluate.

02:23.660 --> 02:26.780
What are the different types of model-based evaluation?

02:27.500 --> 02:32.180
Faithfulness evaluator, also known as LLM as a judge.

02:32.780 --> 02:39.940
Faithfulness, also called groundedness, evaluates to what extent a generated answer is based on the

02:39.940 --> 02:41.620
retrieved documents.

02:42.660 --> 02:51.140
An LLM is used to extract statements from the answer and check the faithfulness of each statement separately.

02:51.780 --> 02:58.850
If the answer is not based on the documents, the answer, or at least part of it, is called a hallucination.

02:59.610 --> 03:06.930
Another type of model-based evaluation is the SAS evaluator, also known as semantic answer similarity.

03:07.370 --> 03:13.770
Semantic answer similarity uses a transformer-based cross-encoder architecture to evaluate the semantic

03:13.770 --> 03:17.850
similarity of two answers, rather than their lexical overlap.

03:17.850 --> 03:19.690
Context relevance evaluator.

03:19.690 --> 03:28.010
Context relevance refers to how relevant the retrieved documents are to the query. An LLM is

03:28.010 --> 03:29.970
used to judge that aspect.

03:30.250 --> 03:36.530
It first extracts statements from the documents and then checks how many of them are relevant for answering

03:36.530 --> 03:37.290
the query.

03:37.770 --> 03:41.410
There are other evaluators that Haystack provides.

03:41.730 --> 03:48.770
I have listed just three of them and will go through each of them in detail in this video, but you

03:48.770 --> 03:54.370
can check out the other evaluators that Haystack offers in its documentation.

03:54.690 --> 03:55.610
Let's take a.