WEBVTT

00:00.880 --> 00:02.760
Hello everyone and welcome.

00:03.080 --> 00:07.040
In today's session, we will cover model-as-a-judge evaluations.

00:07.160 --> 00:10.520
For this, I will go ahead and click the Create button.

00:10.960 --> 00:14.080
There, I'll select the model-as-a-judge option.

00:14.640 --> 00:19.640
So here I'll give the name as LLM as a judge evaluation.

00:20.480 --> 00:22.000
The description is model

00:22.000 --> 00:27.320
as a judge evaluation. I'll go ahead and select the model: Anthropic, then Haiku.

00:27.720 --> 00:33.600
So this is what is needed for the automatic evaluation that generates the evaluation metrics.

00:34.040 --> 00:38.000
Then here we have another model that we want to evaluate.

00:39.040 --> 00:47.320
So in this case I'll select the same model, Haiku from Anthropic, and click Apply, just like in our previous

00:47.320 --> 00:48.040
video.

00:48.280 --> 00:52.600
I'll select two different metrics here: helpfulness and correctness.

00:53.040 --> 00:58.200
For the responsible AI metric, I'll go with harmfulness.

00:58.600 --> 01:06.470
Here I'll have to provide the prompt dataset and the evaluation results location, very similar to how we did

01:06.470 --> 01:08.070
it in our previous video.

01:08.550 --> 01:12.750
So I'll go ahead and share with you what the evaluation dataset looks like.

01:13.110 --> 01:16.390
So here there is a prompt, right: a 15-word

01:16.510 --> 01:17.950
summary of the text.

01:17.990 --> 01:19.830
Fargate is a technology.

01:19.830 --> 01:22.550
You can go through all of this.

01:22.990 --> 01:25.270
I'll upload it along with this section.

01:26.070 --> 01:27.990
And then there's a response.

01:28.190 --> 01:36.870
AWS Fargate allows running containers without managing servers or clusters, simplifying

01:36.870 --> 01:39.150
container deployment and scaling.

01:39.590 --> 01:46.150
So there are two line items that I am providing for the LLM to evaluate.

01:46.630 --> 01:49.590
And this is what the content, or the prompt, looks like.

01:50.030 --> 01:53.270
And then this is the response from the LLM.
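
NOTE
A minimal sketch of how two line items like the ones described above could be
written as JSON Lines, one JSON object per line. The field names "prompt" and
"referenceResponse" and both placeholder prompts are assumptions made for
illustration; the video does not show the exact schema.
```python
import json
#
# Assumption: the field names "prompt" and "referenceResponse" are illustrative;
# check the dataset format required by your evaluation job before using them.
def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
#
records = [
    {
        # Placeholder prompt: the full source text is not reproduced here.
        "prompt": "Give a 15-word summary of the text: [text about Fargate]",
        "referenceResponse": (
            "AWS Fargate allows running containers without managing servers "
            "or clusters, simplifying container deployment and scaling."
        ),
    },
    {
        # Second line item, left as a placeholder.
        "prompt": "[second prompt]",
        "referenceResponse": "[second response]",
    },
]
#
if __name__ == "__main__":
    print(to_jsonl(records))
```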

01:54.030 --> 01:54.830
All right.

01:54.830 --> 01:57.390
Let me upload this to the S3 bucket.

01:57.390 --> 01:57.990
Right.

01:58.310 --> 02:03.710
So the upload is complete. In my bucket, I uploaded the file

02:04.430 --> 02:08.230
large language model as a judge evaluation dataset.

02:08.590 --> 02:16.150
So now I'll go to the job that we are creating, browse, and provide the evaluation dataset as large

02:16.150 --> 02:17.990
language model as a judge.

02:18.230 --> 02:20.230
And then I'll have the results here.

02:20.430 --> 02:21.750
You can see results.

02:21.950 --> 02:29.070
Or I can actually create another folder here with the name as judge evaluation results.

02:29.870 --> 02:31.030
Then create a folder.

02:31.350 --> 02:33.550
Go back up and browse to results.

02:33.710 --> 02:34.670
So this is the path.

02:34.670 --> 02:35.630
This is not clear.

02:35.630 --> 02:39.150
But what I'll do is go ahead and add this path here.

02:39.190 --> 02:42.070
Results as a judge.

02:42.070 --> 02:42.990
Results.

02:43.150 --> 02:47.190
I'll use an existing role, the Bedrock provisioning role.

02:47.470 --> 02:50.150
And then I'm not giving a new KMS key.

02:50.630 --> 02:52.310
Go ahead and create.

02:52.350 --> 02:57.030
It looks like there's a problem here. What I missed is that there's a space in here.
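
NOTE
A small sketch of a check that would catch a stray space like the one above
before submitting the job. It assumes the results location is an S3 URI; the
function name check_s3_uri and the bucket name in the example are hypothetical.
```python
def check_s3_uri(uri):
    """Return a list of problems found in an S3 URI string (empty if none)."""
    problems = []
    if not uri.startswith("s3://"):
        problems.append("missing s3:// prefix")
    if any(ch.isspace() for ch in uri):
        problems.append("contains whitespace")
    return problems
#
if __name__ == "__main__":
    # Hypothetical bucket and folder names, with a stray space in the path.
    print(check_s3_uri("s3://example-bucket/results /"))
```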

02:57.230 --> 02:58.910
Let's hit the create button.

02:58.910 --> 03:02.790
Use an existing role, the Bedrock provisioning role.

03:03.310 --> 03:04.110
Create.

03:05.100 --> 03:05.980
There you go.

03:06.340 --> 03:08.060
The job is in progress.

03:08.220 --> 03:10.100
I will go ahead and pause.

03:10.380 --> 03:12.380
It takes a little while for this to run.

03:12.580 --> 03:18.580
So the evaluation job that we ran for LLM as a judge is completed.

03:18.940 --> 03:22.580
So now I'll go ahead and click on this and we'll go check out the summary.

03:22.980 --> 03:24.940
This was the job that we ran.

03:24.980 --> 03:31.180
The description, the model, and all the details that we had provided.

03:31.700 --> 03:34.700
And then we had three different scores that we wanted.

03:34.900 --> 03:40.460
They were helpfulness, correctness, and harmfulness.

03:41.500 --> 03:43.580
So these three were the metrics.

03:43.780 --> 03:52.060
And if you notice here, the breakdown is provided, where we have the score as 0.83 for helpfulness.

03:52.580 --> 03:54.980
And then here are the prompt details.

03:55.140 --> 03:56.540
Expand this.

03:57.020 --> 04:00.260
So if you notice here we have two different line items.

04:00.620 --> 04:04.380
One asks for a 15-word summary of this text,

04:04.700 --> 04:09.930
and then "Fargate is a technology" and all the details that we provided.

04:09.930 --> 04:11.290
This is the output.

04:11.770 --> 04:14.290
This is the ground truth that we wanted.

04:14.610 --> 04:22.730
So this is how it compares the input and the output with the actual ground truth that is available, or that

04:22.730 --> 04:23.570
we passed.

04:23.890 --> 04:25.850
Here's the score that it provided.

04:25.850 --> 04:30.050
And it gives you the details about why it scored it as 0.83.

04:31.090 --> 04:35.130
So if you go through it, you'll understand why this is 0.83.

04:35.610 --> 04:36.890
Say okay.

04:37.330 --> 04:42.370
The same goes for correctness. The score here is one out of one.

04:42.890 --> 04:47.130
And you go to the prompt details and then expand this.

04:47.290 --> 04:55.410
And you'll see a very similar way of evaluating: prompt input, generation output, and ground truth values.

04:56.290 --> 05:03.530
So one thing that I wanted to highlight is that all scores have been normalized to values between zero

05:03.530 --> 05:04.330
and one.

05:04.690 --> 05:07.520
So this is always going to be between zero and one.
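
NOTE
To illustrate what normalizing a score to the zero-to-one range can mean, here
is a min-max normalization sketch. The video does not say how the scores are
normalized; the 1-to-7 rating scale and the rating of 6 are assumptions chosen
only because they happen to give a value that rounds to 0.83.
```python
def normalize(rating, low, high):
    """Min-max normalize a rating from the range [low, high] to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (rating - low) / (high - low)
#
if __name__ == "__main__":
    # Hypothetical: a rating of 6 on an assumed 1-to-7 scale.
    print(round(normalize(6, 1, 7), 2))
```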

05:07.840 --> 05:10.120
There is no harmful content.

05:10.200 --> 05:11.280
The value is zero.

05:11.320 --> 05:14.560
Here, let's check the prompt

05:14.720 --> 05:15.920
details here.

05:16.400 --> 05:20.920
And then the score is zero because there is no harmful content.

05:21.360 --> 05:21.880
Great.

05:22.280 --> 05:23.760
So we got all the details.

05:24.080 --> 05:30.320
Now, if you want to check out the evaluation of the entire dataset for RAG and large language

05:30.320 --> 05:30.880
model,

05:31.320 --> 05:35.320
then you can go to S3 and go to Buckets.

05:35.840 --> 05:40.520
In here we have Bedrock evaluation as the bucket, and results.

05:40.920 --> 05:42.840
So we have two results.

05:43.880 --> 05:48.200
This was for RAG, and this is for large language model as a judge.

05:48.200 --> 05:50.680
And then you go here and dive deeper.

05:50.680 --> 05:52.440
And then general data sets.

05:52.720 --> 05:53.280
Yeah.

05:53.320 --> 05:54.080
There you go.

05:54.520 --> 05:57.600
So this is where you find all the relevant details.

05:58.560 --> 06:03.640
You can download this file and go through the entire data set instead of just two.
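
NOTE
Once the file is downloaded, a short script can walk every record instead of
only the two shown in the console. This sketch assumes the file is JSON Lines;
the keys in the sample text are hypothetical stand-ins, not the real output
schema.
```python
import json
#
def read_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
#
def top_level_keys(records):
    """Collect the sorted set of top-level keys seen across all records."""
    return sorted({key for record in records for key in record})
#
# Hypothetical sample standing in for the downloaded file's contents.
SAMPLE = '{"prompt": "p1", "score": 0.5}\n{"prompt": "p2", "score": 1.0}\n'
#
if __name__ == "__main__":
    rows = read_jsonl(SAMPLE)
    print(len(rows), top_level_keys(rows))
```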

06:03.960 --> 06:09.760
The same goes for the results of RAG, where you go in here: inference configs,

06:09.800 --> 06:11.240
zero, datasets,

06:11.480 --> 06:13.240
RAG datasets.

06:13.640 --> 06:15.720
And this is the JSON line item.

06:15.720 --> 06:18.120
This is what you can download.

06:18.480 --> 06:20.880
I'll quickly show you what this looks like.

06:21.520 --> 06:23.320
So here's what it looks like.

06:23.600 --> 06:26.880
There are conversation turns and the input record.

06:26.880 --> 06:32.880
The output details are here, where there's the model that we selected.

06:33.800 --> 06:41.760
Here is the text chunk and the retrieved passage content from the text and the details that were provided,

06:42.240 --> 06:45.960
the location of the file or the data source.

06:46.440 --> 06:50.880
All of this you can find here for the entire PDF that we selected.

06:51.120 --> 06:54.840
Well, actually, this is not the entire PDF.

06:55.840 --> 06:59.240
This is like two line items with five chunks each.

06:59.560 --> 07:03.320
So that is what it is. Because the chunk size is five,

07:03.560 --> 07:06.600
it would go ahead and check it for five different chunk items.

07:07.040 --> 07:10.440
Thank you so much and I'll see you in another video.
