WEBVTT

00:00.720 --> 00:02.600
Hello everyone and welcome.

00:02.880 --> 00:10.520
In today's session, we will cover evaluation capabilities in Amazon Bedrock. Evaluations help you streamline

00:10.520 --> 00:14.280
testing and improve generative AI applications.

00:14.880 --> 00:19.440
As of now, Bedrock offers a couple of ways to evaluate your results.

00:19.760 --> 00:23.360
One of them is model-based evaluation.

00:24.440 --> 00:27.960
Here we have automatic and human evaluations.

00:28.320 --> 00:34.360
In automatic, you can evaluate performance using just the model and metrics that you select.

00:34.760 --> 00:37.080
Bedrock now offers model

00:37.080 --> 00:38.040
as a judge.

00:38.400 --> 00:45.680
It's a pre-trained model that evaluates your model's performance using metrics that you selected. Under

00:45.680 --> 00:46.920
the human option,

00:47.200 --> 00:54.560
you can work with an AWS-managed team, where you can evaluate responses from up to two models.

00:55.440 --> 01:00.230
You can define evaluation metrics specific to your use case.

01:00.870 --> 01:08.110
You can also do the same with your own work team if you have one. The other way to do this is knowledge bases.

01:08.550 --> 01:16.590
You can now run automatic knowledge base evaluations to assess and optimize RAG applications using

01:16.590 --> 01:18.950
Amazon Bedrock Knowledge Bases.

01:19.430 --> 01:22.390
So now let's go ahead and see how this works in practice.

01:23.230 --> 01:27.150
To evaluate knowledge bases we'll have to first create one.

01:27.550 --> 01:32.870
In this case here I created a knowledge base by the name evaluation KB.

01:33.230 --> 01:40.070
And the data source that I have here is from a PDF that is publicly available called Lambda PDF.

01:40.590 --> 01:44.630
So this is the bucket first link Udemy Bedrock evaluation.

01:44.630 --> 01:49.270
And in here I have lambda-dg.pdf.

01:50.150 --> 01:51.310
That is my source.

01:51.470 --> 01:55.390
So I'll go here, select this option, and add a document from S3.

01:55.950 --> 01:59.820
Browse DG choose add.

02:03.340 --> 02:05.180
And then I'll add this as a source.

02:05.220 --> 02:06.460
It checks the permissions.

02:06.780 --> 02:10.060
So now the knowledge base is successfully synced with Lambda.

02:12.660 --> 02:18.660
So let me go over the source that I have right now, so that you will be able to understand and relate

02:18.660 --> 02:23.060
to some of the post-processing that I would do for evaluations.

02:24.140 --> 02:25.620
So here is my source.

02:25.980 --> 02:30.580
In this case, I have the AWS Lambda Developer Guide.

02:30.740 --> 02:32.900
And you can find this on their website.

02:33.180 --> 02:38.940
I'll also attach it to the section where you can download it, but it basically covers different aspects

02:38.940 --> 02:47.540
of Lambda, serverless, and how you can use the Lambda service on the AWS console.

02:47.980 --> 02:53.980
So I'll go ahead and test whether my data is ingested or not in OpenSearch Serverless.

02:54.700 --> 02:57.610
So in this case here I will open a new tab.

02:57.850 --> 02:59.330
Go to OpenSearch.

03:00.010 --> 03:03.410
In here I'll go to the serverless dashboards.

03:03.770 --> 03:11.810
And it did create a new collection. I'll go in here, go to the dashboard URL, and then in developer tools

03:11.810 --> 03:13.090
run this query.

03:14.210 --> 03:20.850
So now I see that the source URL is first link Udemy bedrock evaluation Lambda.

03:23.690 --> 03:25.650
So it did sync up the source.

03:25.890 --> 03:30.930
And my vector database is ready with Lambda as a source.

03:31.170 --> 03:33.330
Now I'll go ahead and close all of this.

03:33.610 --> 03:37.650
And now I would go ahead and create a knowledge base evaluation.

03:37.650 --> 03:42.090
I'll give the name here first link Udemy Lambda Evaluation.

03:42.330 --> 03:47.010
I'll provide a description, which is optional: Lambda knowledge base evaluation.

03:47.410 --> 03:52.970
Then you have to select a model that would do the RAG on the existing data source.

03:54.000 --> 04:00.360
So in this case, I'll go ahead and select Claude 3.5 Sonnet.

04:00.720 --> 04:03.320
I'm not going to add tags as of now.

04:03.560 --> 04:07.600
And then I'll select the knowledge base KB that we just created.

04:08.040 --> 04:13.200
In here there are two different options available for knowledge base evaluation.

04:13.520 --> 04:19.760
One is retrieval only, and the other one is retrieval and response generation.

04:21.040 --> 04:26.840
I will select the retrieval and response generation option available here.

04:27.320 --> 04:33.920
It not only retrieves the content, but also generates a response, and then evaluates the entire

04:33.920 --> 04:34.840
pipeline.

04:35.240 --> 04:37.680
I'll have to select the model for that.

04:37.680 --> 04:39.840
I'll go ahead and select a Claude model,

04:39.880 --> 04:41.480
Haiku. Apply.

04:42.600 --> 04:47.120
So this is a very important part here, where you have to select the metrics.

04:47.320 --> 04:50.120
There are about five quality metrics that you can measure:

04:50.120 --> 04:58.180
helpfulness, correctness, logical coherence, faithfulness, and completeness.

04:58.940 --> 05:05.740
I'll go over each one of them in detail right now before we create the pipeline or evaluation.

05:06.700 --> 05:14.460
Let's quickly go over the metrics that are designed for evaluating retrieval with response generation.

05:15.060 --> 05:20.740
The score is an average score for responses across all prompts in your data set.

05:21.260 --> 05:25.980
The scores are normalized from 0 to 1 for ease of interpretability.
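
NOTE
Editor's sketch of the averaging just described, assuming each prompt gets one score per metric that is already normalized to the 0-1 range. The per-prompt values and the function name metric_summary are hypothetical and are not taken from the job shown in the video.
```python
from statistics import mean
# Hypothetical per-prompt scores for a single metric, each in the 0-1 range.
per_prompt_scores = {
    "What is an AWS Lambda trigger?": 0.75,
    "What is an AWS Lambda event?": 1.0,
}
def metric_summary(scores):
    """Average the per-prompt scores into one dataset-level score."""
    values = list(scores.values())
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("scores are expected to be normalized to the 0-1 range")
    return mean(values)
summary = metric_summary(per_prompt_scores)
print(f"dataset-level score: {summary:.3f}")
```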

05:27.020 --> 05:29.740
The first one that we saw is correctness.

05:30.180 --> 05:33.900
Correctness means accurately answering the question.

05:34.300 --> 05:41.580
It checks how accurate the responses generated by the language model were in the context of the data given

05:41.580 --> 05:43.380
to the large language model.

05:43.780 --> 05:45.780
The next one is completeness.

05:46.100 --> 05:51.210
Completeness means answering and resolving all aspects of the question.

05:52.210 --> 05:57.410
It checks how fully the answer covers the question, given the context provided to the language model.

05:57.810 --> 05:58.850
Helpfulness.

05:59.370 --> 06:04.050
Helpfulness means holistically useful responses to the questions.

06:04.330 --> 06:13.130
Then there is logical coherence. Logical coherence means responses are free from logical gaps, inconsistencies,

06:13.130 --> 06:14.570
or contradictions.

06:14.970 --> 06:16.650
The last one is faithfulness.

06:17.570 --> 06:22.810
Faithfulness means avoiding hallucination with respect to the retrieved text chunks.

06:23.330 --> 06:27.370
The higher the score, the more faithful the generated responses are.

06:27.970 --> 06:30.050
Now let's go back to the console.

06:30.650 --> 06:33.810
So now we've selected helpfulness as one metric.

06:34.650 --> 06:38.490
Let's select one more metric, correctness.

06:39.090 --> 06:44.850
Scroll down a little bit, and then what we'll see is responsible AI, such as harmfulness.

06:45.290 --> 06:50.760
It measures how harmful the responses are, including hate, insults, and violence.

06:51.360 --> 06:58.120
I'll select that. There are a couple of other responsible AI options available, such as refusal.

06:59.120 --> 07:04.160
It measures how evasive the responses are in refusing to answer questions.

07:04.560 --> 07:12.000
Stereotyping measures generalized statements about individuals or groups of people in responses.

07:12.560 --> 07:16.280
So now I'll go ahead and select the data set for evaluation.

07:16.800 --> 07:21.520
Go ahead and select the same lambda source that we have as the evaluation.

07:22.480 --> 07:26.520
Where do I want the S3 location to store the results?

07:26.800 --> 07:33.640
For that I'm going to create one more directory here or a folder in S3 bucket.

07:34.040 --> 07:40.000
Amazon S3 buckets under first link Udemy bedrock evaluation.

07:40.360 --> 07:46.910
I'll go ahead and create results and create a folder that would store all of our results.

07:47.310 --> 07:48.150
Go back.

07:48.350 --> 07:49.990
Browse results.

07:50.430 --> 07:53.270
So this is the thing that is a little confusing here.

07:54.030 --> 07:55.190
It doesn't

07:55.190 --> 07:57.390
give the option to choose here.

07:57.870 --> 08:04.270
What I'll have to do is copy the S3 location evaluation and results.

08:04.670 --> 08:06.990
I'll go ahead and create a new service role.

08:07.310 --> 08:10.110
You can also use an existing service role.

08:10.390 --> 08:15.190
Let's go ahead and give Amazon Bedrock service role and hit the create button.

08:15.590 --> 08:17.590
So we ran into a problem.

08:17.910 --> 08:20.150
So what happened here is the data.

08:20.190 --> 08:21.750
I think I made a mistake.

08:22.070 --> 08:25.390
So I had to provide a dataset for evaluation.

08:26.430 --> 08:29.710
This is not the same as the dataset used as the source.

08:29.910 --> 08:31.710
I'll have to give a data source here.

08:32.030 --> 08:38.990
So for that, for the dataset option, I'll have to provide a JSON that has, for each conversation turn,

08:38.990 --> 08:40.590
the reference responses.

08:40.870 --> 08:42.670
The content here is text.

08:43.070 --> 08:52.100
A trigger is a source or configuration that invokes a Lambda function, such as an AWS service. And the question

08:52.100 --> 08:56.340
or the prompt that goes in there is: what is an AWS Lambda trigger?

08:56.820 --> 08:59.140
Same goes with the other line item.

08:59.460 --> 09:02.300
What is an AWS Lambda event?

09:02.740 --> 09:05.020
And the answer should be along these lines.

09:05.380 --> 09:10.260
So I'll go ahead and upload this file to the S3 bucket.

09:11.340 --> 09:19.100
So now I've uploaded the JSON file, evaluation dataset. I'll go ahead and select that file.

09:19.140 --> 09:21.220
Browse evaluation data set.

09:21.620 --> 09:24.100
This will be the evaluation criteria.

09:24.580 --> 09:31.340
Now, if you notice here, you do not have the required CORS setting, which is another problem

09:31.780 --> 09:32.700
in this bucket.

09:32.700 --> 09:39.700
Here I'll have to go to permissions and allow a CORS setting, which is the cross-origin resource sharing

09:39.740 --> 09:40.500
option.

09:41.650 --> 09:43.890
Evaluations do need this.

09:43.890 --> 09:45.930
So I'll go ahead and paste the solution.

09:45.930 --> 09:54.850
So we say the allowed methods are GET, PUT, POST, and DELETE, allowed from all origins, and the exposed header is

09:54.890 --> 09:55.730
Access-Control-Allow-Origin.
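
NOTE
Editor's sketch of the CORS rule described above, written as the JSON that the bucket's CORS editor under Permissions accepts. The methods, origins, and exposed header follow the narration; the AllowedHeaders wildcard is an assumption, because the header list is not read out in the video.
```python
import json
# CORS rule mirroring the settings named in the walkthrough.
cors_rules = [
    {
        "AllowedHeaders": ["*"],  # assumption: not read out in the video
        "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["Access-Control-Allow-Origin"],
    }
]
# Print the JSON text that can be pasted into the CORS editor.
cors_json = json.dumps(cors_rules, indent=4)
print(cors_json)
```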

09:56.090 --> 09:58.010
So I'll go ahead and save the changes.

09:58.050 --> 10:01.490
Go back to my evaluation job here.

10:02.170 --> 10:03.490
Add a new one.

10:03.490 --> 10:07.170
I'll go ahead and actually create a new one. Now, you know what,

10:07.210 --> 10:08.210
let me select

10:11.130 --> 10:12.730
the S3 bucket.

10:12.730 --> 10:19.050
So I'll come down here to the datasets and evaluation results S3 location.

10:20.050 --> 10:25.730
So if you notice here the very first thing that it needs is the data set for evaluation.

10:26.050 --> 10:27.050
What does this mean?

10:27.410 --> 10:33.690
This means that we'll have to provide prompts and the desired result, what it should look like, to the

10:33.690 --> 10:41.600
large language model for it to assess the responses. Let me share with you what this JSON file

10:41.600 --> 10:42.440
looks like.

10:42.880 --> 10:45.680
So if you notice here there are two line items.

10:45.680 --> 10:49.040
And it's very important to understand how this works.

10:50.240 --> 10:55.320
This section would determine the metrics that you want to generate from the responses.

10:55.840 --> 11:00.360
So here we have conversation turns as the root element,

11:00.520 --> 11:02.080
and then a reference response.

11:02.680 --> 11:05.040
So what does a response look like to us?

11:05.360 --> 11:13.320
It's like a text that says a trigger is a source or configuration that invokes a Lambda function, such

11:13.320 --> 11:15.160
as an AWS service.

11:15.760 --> 11:23.160
Now this is the context when the question asked is what is an AWS Lambda trigger?

11:23.680 --> 11:26.720
Then the response should look something like this.

11:27.200 --> 11:30.880
That is the expectation that we are setting up for the metrics.

11:30.920 --> 11:36.480
Similarly, for another question or prompt, what is an AWS Lambda event?

11:37.040 --> 11:43.190
The content should look like: an event is a JSON document, defined by the AWS service or

11:43.710 --> 11:49.950
the application invoking a Lambda function, that is provided as input to the Lambda function.
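
NOTE
Editor's sketch of the two-line dataset described above, assuming the file is JSON Lines with one conversation per line. The camelCase key names (conversationTurns, referenceResponses, prompt, content, text) are an assumption based on the spoken field names, the reference wording is approximate, and the file name evaluation_dataset.jsonl is hypothetical.
```python
import json
# Prompts and reference answers as read out in the walkthrough (wording approximate).
examples = [
    ("What is an AWS Lambda trigger?",
     "A trigger is a source or configuration that invokes a Lambda function, such as an AWS service."),
    ("What is an AWS Lambda event?",
     "An event is a JSON document, defined by the AWS service or the application invoking a Lambda function, that is provided as input to the Lambda function."),
]
def to_record(prompt, reference):
    """Build one dataset record; the key names are assumed, not read from the video."""
    return {
        "conversationTurns": [
            {
                "prompt": {"content": [{"text": prompt}]},
                "referenceResponses": [{"content": [{"text": reference}]}],
            }
        ]
    }
# One JSON object per line.
lines = [json.dumps(to_record(p, r)) for p, r in examples]
with open("evaluation_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```
The resulting file is what would be uploaded to the S3 bucket and selected as the dataset for evaluation.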

11:50.470 --> 11:56.830
So now with this, let me upload this to the S3 bucket and give the location for the data set.

11:57.150 --> 11:59.390
So I uploaded the data set here.

11:59.790 --> 12:02.710
Now let me go back to the evaluation data set.

12:02.790 --> 12:11.390
Browse and provide the JSON file that would be used to evaluate the RAG result from the retrieve-and-generate

12:11.390 --> 12:12.430
responses.

12:12.870 --> 12:16.630
The second option I have here is results for evaluation.

12:17.350 --> 12:20.670
So where do we want to store the results for evaluation?

12:20.790 --> 12:24.470
For that, I have created a directory, results.

12:24.670 --> 12:25.990
I'll select that.

12:26.350 --> 12:30.350
And then down here I'm going to use an existing role.

12:30.750 --> 12:33.350
Let's give the role name as provision role.

12:34.310 --> 12:37.410
And then I'll go ahead and click the create button.

12:37.730 --> 12:39.730
Looks like we have some errors.

12:40.090 --> 12:43.930
So it does not have the required CORS setting.

12:43.930 --> 12:46.210
So for the bucket that we use for evaluation,

12:46.250 --> 12:52.010
evaluation jobs do need the permission for cross-origin resource sharing.

12:52.930 --> 12:55.090
So I go here to edit.

12:55.330 --> 12:56.050
All right.

12:56.450 --> 13:02.330
So in here I already had the CORS setting available, which has allowed headers.

13:03.290 --> 13:09.730
It allows all four methods for the other jobs to access the bucket. For allowed origins,

13:09.730 --> 13:17.450
it allows all origins. Expose headers is Access-Control-Allow-Origin.

13:17.770 --> 13:19.770
So now I'll go ahead and save the changes.

13:20.770 --> 13:22.610
Go back here and create.

13:23.010 --> 13:23.930
Let's go up.

13:24.490 --> 13:25.490
No errors.

13:26.210 --> 13:27.650
So there was an error.

13:27.850 --> 13:33.690
The provision role does not have access to Claude 3.5.

13:34.770 --> 13:37.720
So I'll have to go ahead and make changes and retry it.

13:37.960 --> 13:40.400
So I might go ahead and select a different model.

13:40.440 --> 13:44.160
Maybe that works. I'll select Haiku. Create.

13:44.760 --> 13:45.560
It worked.

13:45.880 --> 13:49.160
So I gave permission for Haiku and not Sonnet.

13:49.480 --> 13:56.040
So you can change the IAM permissions for your account and make it work for now.
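
NOTE
Editor's sketch of the kind of IAM statement that would let a role invoke both models. It covers only model invocation; a real evaluation service role needs further permissions (for example for S3 and the knowledge base). The region, the model version identifiers, and the Sid are assumptions and are not shown in the video.
```python
import json
# Assumed values: region and exact model versions are placeholders.
REGION = "us-east-1"
MODEL_IDS = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
]
def model_arn(region, model_id):
    """Foundation model ARNs leave the account field empty."""
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEvaluationModels",
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": [model_arn(REGION, m) for m in MODEL_IDS],
        }
    ],
}
print(json.dumps(policy, indent=2))
```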

13:56.480 --> 14:00.360
The entire evaluation pipeline is up and running.

14:01.000 --> 14:05.880
This does take a while, so I'll pause the video and grab some coffee and come back.

14:06.440 --> 14:09.880
So after a while the job is now complete.

14:10.520 --> 14:16.960
Let me go to the evaluation job and let's understand some of the evaluations that it came up with.

14:18.000 --> 14:21.480
So now let's evaluate the results in the metrics summary.

14:21.520 --> 14:27.840
Here we gave three different metrics: correctness, helpfulness, and harmfulness.

14:28.200 --> 14:29.800
The correctness is one.

14:30.080 --> 14:34.310
That means it was satisfied with the responses that we got.

14:34.670 --> 14:39.430
So let's scroll down a little bit and let's understand what the chart looks like.

14:40.430 --> 14:45.670
So in this case here helpfulness is 0.83.

14:46.150 --> 14:50.470
And let's say you want to understand why the helpfulness was not one.

14:50.950 --> 14:57.870
So in this case here the conversation that we had was what is an AWS Lambda trigger.

14:58.630 --> 15:01.030
This is the output we got from the LLM.

15:01.830 --> 15:06.990
There were five sources that were chunked and they were referenced for summary.

15:07.350 --> 15:09.550
And then this is the ground truth.

15:10.030 --> 15:13.350
This is what we gave in the JSON file for evaluation.

15:13.590 --> 15:18.470
And it compared this ground truth versus the generation output content.

15:19.110 --> 15:21.270
And the score is not one.

15:22.310 --> 15:24.070
It's close to 0.8.

15:24.310 --> 15:28.470
Sometimes LLMs are a little stricter in conversation.

15:28.470 --> 15:35.380
So you do get scores that are a little lower at times, but if you want to understand why you got

15:35.380 --> 15:42.740
a score of 0.83, you can go ahead and read the explanation for that particular conversation

15:42.740 --> 15:43.860
that was used.

15:44.300 --> 15:50.500
So this is how you can take a deep dive into the evaluations of each metric that you select.

15:50.540 --> 15:53.420
Then there is correctness. Correctness

15:53.420 --> 15:57.260
gives us one out of one for the two conversations that we gave.

15:57.620 --> 15:59.100
Harmfulness is zero.

15:59.420 --> 16:01.260
There was no harmful content.

16:01.500 --> 16:04.540
You can go here and check the same data set.

16:05.540 --> 16:07.940
What is an AWS Lambda trigger?

16:08.300 --> 16:16.060
There was the output, the retrieved count, the retrieved chunks, and the ground truth that we provided, and the score is

16:16.060 --> 16:16.740
zero.

16:16.940 --> 16:21.580
There is no insult, harm or violent content generated.

16:21.860 --> 16:26.260
So these are the evaluations that were created from the data that we had.

16:27.260 --> 16:27.980
Thank you.

16:28.020 --> 16:29.700
I'll see you in the next video.