WEBVTT

00:01.320 --> 00:03.900
-: Hey, I'm gonna walk you through what evals are.

00:03.900 --> 00:05.880
So this is maybe something you've heard of.

00:05.880 --> 00:08.700
Evals are short for evaluation metrics

00:08.700 --> 00:10.290
and it's how we measure alignment

00:10.290 --> 00:13.680
between what we get in terms of responses from the AI

00:13.680 --> 00:15.600
and the business goals that we've been set.

00:15.600 --> 00:18.840
So for example, the accuracy, the reliability,

00:18.840 --> 00:21.600
or the quality of the responses.

00:21.600 --> 00:23.490
This is an example of evals.

00:23.490 --> 00:24.510
You might have seen this

00:24.510 --> 00:26.640
whenever a new model is released.

00:26.640 --> 00:29.010
This is the example from Claude 3.

00:29.010 --> 00:31.140
They check how well it did

00:31.140 --> 00:34.320
against generally accepted eval benchmarks.

00:34.320 --> 00:38.820
MMLU is one of them. GPQA Diamond.

00:38.820 --> 00:42.930
There's HumanEval, which is a coding eval benchmark.

00:42.930 --> 00:44.520
And typically these are question

00:44.520 --> 00:47.910
and answer sets, and you get a percentage correct.

00:47.910 --> 00:49.950
And if that percentage keeps going up,

00:49.950 --> 00:51.060
then you're doing a good job.

00:51.060 --> 00:54.990
And if that percentage is very high compared

00:54.990 --> 00:58.410
to GPT-4 or some of the other models out there,

00:58.410 --> 00:59.760
then you've hit the holy grail

00:59.760 --> 01:02.310
and people will start trying your model,

01:02.310 --> 01:03.390
trying to use your model.

01:03.390 --> 01:06.480
So it's something that they focus on quite a bit.

01:06.480 --> 01:10.500
And this is an example of what these evals look like.

01:10.500 --> 01:13.050
This is an abstract algebra one.

01:13.050 --> 01:16.350
You can see there's a question, the statement there,

01:16.350 --> 01:19.500
and then there's answer A, B, C, D.

01:19.500 --> 01:23.220
And we just see how well the AI does

01:23.220 --> 01:25.950
in terms of that question and answer.

01:25.950 --> 01:27.840
So we know the answer to that question

01:27.840 --> 01:29.910
and that the answer is B.

01:29.910 --> 01:33.240
And if the AI predicts B, then it gets it right.

01:33.240 --> 01:36.330
Same, this was one for professional accounting.

01:36.330 --> 01:40.140
It's asking a question and this answers A, B, C, D.

01:40.140 --> 01:42.510
And the correct answer is A,

01:42.510 --> 01:45.003
so we can tell whether the AI got it or not.
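
NOTE
A minimal sketch of scoring a multiple-choice benchmark like the ones described above.
The question ids, the answer key and the predictions are made up for illustration;
real benchmarks ship their own data files and harnesses.
```python
# Hypothetical answer key and model predictions, keyed by question id.
answer_key = {"abstract_algebra_1": "B", "professional_accounting_1": "A"}
predictions = {"abstract_algebra_1": "B", "professional_accounting_1": "C"}
def percent_correct(predictions, answer_key):
    """Return the percentage of questions where the predicted letter matches the key."""
    correct = sum(
        1 for qid, expected in answer_key.items()
        if predictions.get(qid, "").strip().upper() == expected
    )
    return 100.0 * correct / len(answer_key)
score = percent_correct(predictions, answer_key)
print(f"{score:.1f}% correct")
```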

01:46.560 --> 01:48.360
And the reason why evals are important

01:48.360 --> 01:50.280
is they're surprisingly often all you need

01:50.280 --> 01:52.770
as Greg Brockman from OpenAI says.

01:52.770 --> 01:54.900
They're usually the most important part

01:54.900 --> 01:57.513
of any prompt engineering project that I work on.

01:58.650 --> 02:00.030
Now, what makes them so hard

02:00.030 --> 02:03.660
is that these tools are not reliable.

02:03.660 --> 02:06.240
Like, they quite often fail

02:06.240 --> 02:08.160
and you don't know that they're failing

02:08.160 --> 02:11.880
unless you're running it 10 or 100 or even 1,000 times.

02:11.880 --> 02:14.190
And seeing how often it fails

02:14.190 --> 02:17.400
and knowing what failure is, is really important

02:17.400 --> 02:19.977
and you need to be able to capture that in your evals.
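
NOTE
A rough sketch of measuring how often a prompt fails by running it many times.
flaky_model is a hypothetical stand-in for a real model call, and is_failure is
whatever definition of failure fits your task; both are assumptions, not part of the talk.
```python
import random
def flaky_model(prompt, rng):
    # Stand-in for a real model call (hypothetical): returns bad output about 20% of the time.
    return "ok" if rng.random() >= 0.2 else "not valid"
def is_failure(response):
    # Your own definition of failure; here, anything other than the expected string.
    return response != "ok"
def failure_rate(prompt, runs, seed=0):
    """Run the prompt `runs` times and return the fraction of runs that failed."""
    rng = random.Random(seed)
    failures = sum(is_failure(flaky_model(prompt, rng)) for _ in range(runs))
    return failures / runs
for runs in (10, 100, 1000):
    print(runs, failure_rate("Write a headline", runs))
```
With only 10 runs the estimate can be far off, which is why the larger run counts matter.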

02:19.977 --> 02:23.100
But the other big problem is cost and latency.

02:23.100 --> 02:25.110
So if your evals are too expensive

02:25.110 --> 02:26.850
or they're too slow to run

02:26.850 --> 02:28.680
then it limits how many tests you can run.

02:28.680 --> 02:31.650
If every time you test a new prompt, you have to send it

02:31.650 --> 02:33.510
to your boss and then wait for them to come back

02:33.510 --> 02:35.580
to you a week later, then you're not gonna be able

02:35.580 --> 02:37.170
to iterate very fast.

02:37.170 --> 02:38.670
The other big thing is hallucination.

02:38.670 --> 02:40.860
It's pretty hard to fact check models

02:40.860 --> 02:43.350
or fact check humans (indistinct).

02:43.350 --> 02:45.090
And these models tend to make things up

02:45.090 --> 02:46.380
or miss important things.

02:46.380 --> 02:47.850
It's really hard to measure

02:47.850 --> 02:50.730
because you very quickly run into the fact

02:50.730 --> 02:54.120
that there's almost no objective truth in the world

02:54.120 --> 02:56.790
unless the question is very simple.

02:56.790 --> 02:59.190
Most tasks are not simple, particularly the ones

02:59.190 --> 03:00.753
where evals matter the most.

03:01.950 --> 03:03.810
There are three types of eval:

03:03.810 --> 03:06.660
programmatic, synthetic and human eval.

03:06.660 --> 03:09.060
Programmatic is the holy grail. It's the ideal one.

03:09.060 --> 03:12.990
If you can quickly and cheaply calculate whether the answer

03:12.990 --> 03:16.260
is correct, then you can move a lot faster

03:16.260 --> 03:19.470
and you get answers within milliseconds.

03:19.470 --> 03:21.720
A good example of this is checking AI answers

03:21.720 --> 03:24.000
to multiple choice questions as we already saw.

03:24.000 --> 03:24.870
It's fast and cheap

03:24.870 --> 03:28.440
and the one weakness is that it's pretty hard to come up

03:28.440 --> 03:32.700
with good programmatic measures that capture complex tasks.

03:32.700 --> 03:35.310
Because we're limited here to multiple choice,

03:35.310 --> 03:38.790
if it's a freeform answer, that's a lot harder to calculate.

03:38.790 --> 03:42.870
In that case, we tend to move towards synthetic evals.

03:42.870 --> 03:46.680
Synthetic evals is just asking typically GPT-4,

03:46.680 --> 03:48.300
or whatever the best model is at the time,

03:48.300 --> 03:50.160
whether a response is concise

03:50.160 --> 03:52.560
or asking whether the response is correct

03:52.560 --> 03:55.080
or checking some other action in the response.

03:55.080 --> 03:57.810
That's something that GPT-4 is actually pretty good at,

03:57.810 --> 04:00.360
particularly when you're measuring the responses

04:00.360 --> 04:04.410
from lesser models or open source models.

04:04.410 --> 04:06.630
It's cheaper and faster than using a human,

04:06.630 --> 04:09.210
but the problem is that if AI is bad at the task,

04:09.210 --> 04:12.030
then it's potentially also bad at checking the task.

04:12.030 --> 04:15.660
It's also a lot costlier and slower than programmatic evals.

04:15.660 --> 04:18.480
I have some synthetic evals that cost me a couple

04:18.480 --> 04:20.190
of dollars every time I run them.

04:20.190 --> 04:23.043
And while that's cheaper than paying a human to evaluate,

04:24.450 --> 04:26.280
it takes a couple of minutes to run

04:26.280 --> 04:29.640
and I find I get distracted and start checking my emails

04:29.640 --> 04:31.710
and then I forget what I was doing.

04:31.710 --> 04:33.660
So it's a bit of a productivity killer

04:33.660 --> 04:35.580
if you have to wait a long time

04:35.580 --> 04:39.660
and you could easily spend like $100 just testing.

04:39.660 --> 04:41.400
Human evals is ideal.

04:41.400 --> 04:44.160
That's an example of getting your boss to check the work

04:44.160 --> 04:47.610
or getting an editor to add comments to a document

04:47.610 --> 04:50.880
or even just you looking at like eyeballing the responses

04:50.880 --> 04:52.227
like you would do in ChatGPT.

04:52.227 --> 04:54.660
But this has a high likelihood of accuracy

04:54.660 --> 04:56.610
or at least, you're likely to agree

04:56.610 --> 04:59.430
with the results if you like the responses.

04:59.430 --> 05:01.500
But the problem is it's weakest.

05:01.500 --> 05:04.080
It's the weakest way to evaluate

05:04.080 --> 05:08.130
because it's very expensive, it's slow, it takes a long time

05:08.130 --> 05:09.930
to get people to give feedback.

05:09.930 --> 05:12.900
They can also be pretty subjective as well.

05:12.900 --> 05:16.020
Different people will evaluate the same response

05:16.020 --> 05:17.280
in different ways.

05:17.280 --> 05:19.560
So let's dive into each one a little bit deeper.

05:19.560 --> 05:21.090
Programmatic evals.

05:21.090 --> 05:23.970
The simplest one I really like is just word length.

05:23.970 --> 05:26.040
Like when you're generating blog content,

05:26.040 --> 05:26.970
one of the things I find

05:26.970 --> 05:29.400
is that it doesn't generate long enough content.

05:29.400 --> 05:32.310
So one of the things I test fairly often is like,

05:32.310 --> 05:36.210
how do I get it to produce a longer piece of content?

05:36.210 --> 05:39.150
And this is an example here from a Google Sheets test I did

05:39.150 --> 05:42.330
where I just ran the prompt 10 times,

05:42.330 --> 05:44.760
ran another prompt 10 times where I had,

05:44.760 --> 05:46.530
I'll lose my job all in caps

05:46.530 --> 05:48.330
if it doesn't make the blog post longer.

05:48.330 --> 05:50.550
And that did actually improve the word length.

05:50.550 --> 05:52.110
It was 13% higher.

05:52.110 --> 05:53.340
The reason why this is good

05:53.340 --> 05:54.870
is because it's simple to calculate.

05:54.870 --> 05:58.167
You can do it, even in Excel.
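
NOTE
The same word-length comparison can be sketched in a few lines of Python.
The sample outputs below are made up; in the test described here each prompt was run 10 times.
```python
from statistics import mean
def word_count(text):
    return len(text.split())
def average_length_lift(baseline_outputs, variant_outputs):
    """Percent change in mean word count of the variant over the baseline."""
    base = mean(word_count(t) for t in baseline_outputs)
    variant = mean(word_count(t) for t in variant_outputs)
    return 100.0 * (variant - base) / base
# Made-up outputs standing in for real generations from each prompt.
baseline = ["one two three four", "one two three four five six"]
variant = ["one two three four five six", "one two three four five six"]
lift = average_length_lift(baseline, variant)
print(f"{lift:.0f}% change in average word count")
```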

05:58.167 --> 05:59.550
Q&amp;A is the most obvious one.

05:59.550 --> 06:01.170
So one that pretty much everyone does.

06:01.170 --> 06:04.800
You generate some set of questions and answers

06:04.800 --> 06:07.680
and you know what the answers should be,

06:07.680 --> 06:11.040
therefore, you can check whether it actually follows suit.

06:11.040 --> 06:12.840
So this is an example from a bank.

06:12.840 --> 06:14.310
We have transaction descriptions

06:14.310 --> 06:16.920
and we know what the transaction type

06:16.920 --> 06:20.190
and transaction category should be for these descriptions

06:20.190 --> 06:22.530
and we can check whether it actually categorizes them

06:22.530 --> 06:23.520
into the right format.

06:23.520 --> 06:27.120
So you can see on the example labeled four,

06:27.120 --> 06:28.890
purchased books from the bookstore,

06:28.890 --> 06:31.390
it labeled that as entertainment instead of other.
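
NOTE
A small sketch of checking labeled transactions against model output.
The bookstore row mirrors the example on screen; the ATM row and the tuple layout are assumptions.
```python
# Hypothetical labeled rows: (description, expected (type, category), predicted (type, category)).
rows = [
    ("ATM cash withdrawal", ("withdrawal", "other"), ("withdrawal", "other")),
    ("Purchased books from the bookstore", ("withdrawal", "other"), ("withdrawal", "entertainment")),
]
def find_mismatches(rows):
    """Return (description, expected, predicted) for every row the model got wrong."""
    return [(d, exp, pred) for d, exp, pred in rows if exp != pred]
mismatches = find_mismatches(rows)
accuracy = 1 - len(mismatches) / len(rows)
print(f"accuracy: {accuracy:.0%}")
for description, expected, predicted in mismatches:
    print(f"{description!r}: expected {expected}, got {predicted}")
```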

06:33.300 --> 06:34.680
The other one is hallucinations.

06:34.680 --> 06:37.320
And we came up with a relatively elegant way to test this.

06:37.320 --> 06:42.320
With one of my clients, we were building product descriptions

06:42.660 --> 06:45.900
or product ads based on the copy that was on the website.

06:45.900 --> 06:50.430
So what we did is we just built a function in Google Sheets

06:50.430 --> 06:52.800
that would check the copy on the website

06:52.800 --> 06:55.230
and divide up all the words.

06:55.230 --> 06:57.960
So we'd just see what words were available on the website

06:57.960 --> 07:00.780
and if any words were in the ad that weren't on the website,

07:00.780 --> 07:02.520
then that was counted as a hallucination

07:02.520 --> 07:04.140
so we could very quickly

07:04.140 --> 07:07.713
and easily A/B test even in Google Sheets.
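
NOTE
The original check was a Google Sheets function; this is a Python sketch of the same idea,
not the client's actual implementation. The website and ad copy are invented examples.
```python
import re
def words(text):
    # Lowercase and keep runs of letters, digits, and apostrophes as words.
    return set(re.findall(r"[a-z0-9']+", text.lower()))
def unsupported_words(ad_copy, website_copy):
    """Words that appear in the ad but nowhere in the website copy."""
    return words(ad_copy) - words(website_copy)
website = "Our waterproof hiking boots are made from recycled leather."
ad = "Waterproof hiking boots made from recycled leather, guaranteed for life."
flagged = unsupported_words(ad, website)
print(sorted(flagged))
```
It's a blunt measure: harmless filler words get flagged too, so you'd likely want
a stop-word list before treating every flagged word as a hallucination.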

07:08.550 --> 07:11.760
So the second one I wanna talk through a little bit

07:11.760 --> 07:12.930
of detail is synthetic.

07:12.930 --> 07:14.760
So I'm gonna show you some examples of this.

07:14.760 --> 07:16.770
One is using a vector database,

07:16.770 --> 07:20.190
and this is a lot faster actually than calling GPT-4,

07:20.190 --> 07:21.810
but it can still be quite effective.

07:21.810 --> 07:24.630
Embeddings that you get from a vector search.

07:24.630 --> 07:28.560
Basically how close things are in terms of similarity.

07:28.560 --> 07:30.570
The lower the score, the better here.

07:30.570 --> 07:32.760
And we're looking in this case

07:32.760 --> 07:35.730
at how well it matches the way

07:35.730 --> 07:39.510
that someone writes in the certain style of an author.

07:39.510 --> 07:42.630
And in this case, some of the techniques

07:42.630 --> 07:44.080
were much better than others.
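
NOTE
A sketch of a distance score where lower is better. The talk doesn't say which metric
was used, so cosine distance is an assumption here, and the 3-number vectors are toy
stand-ins for real embeddings from an embedding model or vector database.
```python
import math
def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
# Toy vectors (hypothetical values): one for the author's writing, one per prompt technique.
author_sample = [0.9, 0.1, 0.3]
candidates = {"prompt_a": [0.8, 0.2, 0.3], "prompt_b": [0.1, 0.9, 0.2]}
scores = {name: cosine_distance(author_sample, vec) for name, vec in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```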

07:46.170 --> 07:48.420
Another one that's really popular

07:48.420 --> 07:51.570
for synthetic evals is pairwise comparison.

07:51.570 --> 07:56.070
So you can pass it both of the responses.

07:56.070 --> 08:00.450
In this case, this is the bank transaction example

08:00.450 --> 08:01.770
that we gave before.

08:01.770 --> 08:04.380
We know what the reference answer is, withdrawal other.

08:04.380 --> 08:08.430
And we know that GPT-3.5 has come back

08:08.430 --> 08:09.540
with withdrawal entertainment,

08:09.540 --> 08:11.430
Mistral came back with withdrawal other.

08:11.430 --> 08:14.310
We can pass all that information to an LLM

08:14.310 --> 08:17.190
and ask it to give a verdict, as well as some reasoning.

08:17.190 --> 08:19.980
And this is pretty good for fuzzier tasks

08:19.980 --> 08:23.340
where there's not like a direct match necessarily,

08:23.340 --> 08:26.190
or there might be one version that's better than another.
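
NOTE
A sketch of a pairwise comparison harness. The prompt wording and the reply format are
assumptions, and stub_judge is a hypothetical placeholder for a call to whichever LLM you use.
```python
def build_pairwise_prompt(question, reference, answer_a, answer_b):
    return (
        "You are grading two answers against a reference.\n"
        f"Question: {question}\nReference: {reference}\n"
        f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Reply with 'Verdict: A', 'Verdict: B', or 'Verdict: tie', then one line of reasoning."
    )
def parse_verdict(reply):
    """Pull the verdict out of the judge's reply; None if it didn't follow the format."""
    for line in reply.splitlines():
        if line.lower().startswith("verdict:"):
            verdict = line.split(":", 1)[1].strip().upper()
            return verdict if verdict in {"A", "B", "TIE"} else None
    return None
def stub_judge(prompt):
    # Stand-in for a real LLM call (hypothetical); swap in your own client here.
    return "Verdict: B\nAnswer B matches the reference category."
prompt = build_pairwise_prompt(
    "Purchased books from the bookstore", "withdrawal, other",
    "withdrawal, entertainment", "withdrawal, other",
)
print(parse_verdict(stub_judge(prompt)))
```
Judges can favor whichever answer comes first, so it's worth running each pair in both orders.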

08:28.200 --> 08:29.670
The other big one I use a lot

08:29.670 --> 08:31.950
is I call it like violation finding.

08:31.950 --> 08:33.240
You have a bunch of errors

08:33.240 --> 08:35.610
that typically show up in the responses.

08:35.610 --> 08:38.130
Like for example, if you're generating blog content

08:38.130 --> 08:39.660
or creating a report

08:39.660 --> 08:42.120
and if someone has to go edit that blog content,

08:42.120 --> 08:43.860
a human has to go edit that blog content.

08:43.860 --> 08:46.770
But you can take all of the edits that they typically find

08:46.770 --> 08:49.380
as some of the problems that they see over and over again

08:49.380 --> 08:51.900
and then turn those into rules that you check for.

08:51.900 --> 08:55.290
So you can have one prompt for each type of violation.

08:55.290 --> 08:57.060
Say, for example, it's using the wrong language

08:57.060 --> 08:59.670
or it's not concise enough, whatever it is.

08:59.670 --> 09:00.840
And then you can see

09:00.840 --> 09:03.513
how many violations one report has versus another.
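
NOTE
A sketch of violation finding with one yes/no prompt per rule. The rule names, the
question wording and stub_judge are all hypothetical; a real setup would call an LLM
where the stub is.
```python
# One yes/no check per recurring editor complaint (hypothetical rule names and wording).
RULES = {
    "wrong_language": "Does the text use British spelling? Answer yes or no.",
    "not_concise": "Does the text contain filler or repeated points? Answer yes or no.",
}
def count_violations(text, judge):
    """Ask the judge one question per rule and return the names of rules it says were broken."""
    violated = []
    for name, question in RULES.items():
        reply = judge(f"{question}\n\nText:\n{text}")
        if reply.strip().lower().startswith("yes"):
            violated.append(name)
    return violated
def stub_judge(prompt):
    # Stand-in for a real LLM call (hypothetical): a keyword check so the example runs offline.
    return "yes" if "British" in prompt and "colour" in prompt else "no"
report_a = "The colour palette matters."
report_b = "The color palette matters."
print(len(count_violations(report_a, stub_judge)), len(count_violations(report_b, stub_judge)))
```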

09:05.550 --> 09:07.980
Human evals, most expensive one,

09:07.980 --> 09:11.070
but also most effective if you have the time.

09:11.070 --> 09:14.370
So one example from the image generation space,

09:14.370 --> 09:16.290
I do a lot of cherry-picking.

09:16.290 --> 09:19.230
Cherry-picking is basically just, yeah, running lots

09:19.230 --> 09:21.030
of different types of things

09:21.030 --> 09:23.730
and then choosing which ones turn out the best.

09:23.730 --> 09:26.503
Here's an example where I was making a sci-fi,

09:26.503 --> 09:28.830
a story image catalog.

09:28.830 --> 09:32.970
So basically creating images to go with the sci-fi story.

09:32.970 --> 09:34.560
I was trying to figure out a good prompt

09:34.560 --> 09:36.600
and I wanted some dystopian-looking stuff.

09:36.600 --> 09:38.790
I went and downloaded some images

09:38.790 --> 09:41.310
of people being welded into their houses

09:41.310 --> 09:43.680
during the COVID-19 lockdown

09:43.680 --> 09:45.930
and a few other dystopian images like that.

09:45.930 --> 09:48.840
And then I just used Midjourney Describe

09:48.840 --> 09:52.500
to reverse engineer the prompt. And looking at these prompts,

09:52.500 --> 09:54.780
I could cherry pick my favorite examples

09:54.780 --> 09:56.490
and say, okay, this looks really interesting.

09:56.490 --> 09:59.013
I'm gonna use that scenario in this sci-fi story.

10:00.150 --> 10:04.020
The other one I use a lot is just, I call it thumb ratings.

10:04.020 --> 10:07.140
So thumbs up, thumbs down if the response is good.

10:07.140 --> 10:09.270
This is an example from a marketing team.

10:09.270 --> 10:11.610
If we're generating an email with an image,

10:11.610 --> 10:14.460
we show a customer this image and some copy

10:14.460 --> 10:15.650
and they can give it a quick thumbs up

10:15.650 --> 10:17.700
or thumbs down as to whether it's approved or not.

10:17.700 --> 10:19.920
We can use that approval rate across lots

10:19.920 --> 10:23.520
of these images in order to tell whether it's good or bad.

10:23.520 --> 10:25.410
And we could A/B test those prompts there.

10:25.410 --> 10:26.640
Makes it a little bit easier for people,

10:26.640 --> 10:28.413
like Tinder for ratings.
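
NOTE
A sketch of turning thumbs up and thumbs down into an approval rate per prompt.
The ratings log and the variant names A and B are invented for illustration.
```python
from collections import defaultdict
# Hypothetical ratings log: (prompt variant, True for thumbs up / False for thumbs down).
ratings = [
    ("A", True), ("A", False), ("A", True), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
def approval_rates(ratings):
    """Share of thumbs-up votes per prompt variant."""
    ups, totals = defaultdict(int), defaultdict(int)
    for variant, thumbs_up in ratings:
        totals[variant] += 1
        ups[variant] += int(thumbs_up)
    return {variant: ups[variant] / totals[variant] for variant in totals}
rates = approval_rates(ratings)
print(rates)
```
With only a handful of votes per variant the gap can easily be noise, so collect more
ratings before picking a winner.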

10:29.490 --> 10:30.990
The final one is just manual review.

10:30.990 --> 10:33.480
This is from a project I was working on

10:33.480 --> 10:38.130
where we were analyzing video and using the GPT Vision API.

10:38.130 --> 10:39.390
One of the things we found

10:39.390 --> 10:41.790
is that quite often, it'd refuse the request

10:41.790 --> 10:43.170
and we wouldn't have found this

10:43.170 --> 10:45.300
unless we actually looked for it,

10:45.300 --> 10:48.210
and dug into what was really coming back.

10:48.210 --> 10:49.980
So we dumped all that into Google Sheets

10:49.980 --> 10:52.530
and then we looked a little bit further into the detail

10:52.530 --> 10:55.050
and we'd notice that quite often it would say

10:55.050 --> 10:56.940
that it can't assist with these requests

10:56.940 --> 10:57.960
in certain scenarios.

10:57.960 --> 10:59.490
And we'd dig into what those scenarios were

10:59.490 --> 11:01.740
and then we'd correct for them in the prompt.
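
NOTE
Once manual review has shown what refusals look like, a simple phrase check can flag them
in bulk. The phrase list and the sample responses are assumptions; build yours from what
you actually see coming back.
```python
REFUSAL_PHRASES = ("can't assist", "cannot assist", "unable to help")  # assumed phrasings
def is_refusal(response):
    # Normalize curly apostrophes so "can’t" and "can't" are treated the same.
    text = response.lower().replace("\u2019", "'")
    return any(phrase in text for phrase in REFUSAL_PHRASES)
responses = [
    "The clip shows a person walking a dog.",
    "I'm sorry, I can't assist with these requests.",
]
refusals = [r for r in responses if is_refusal(r)]
print(f"{len(refusals)} of {len(responses)} responses were refusals")
```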

11:03.510 --> 11:06.210
Cool, that is how evals work

11:06.210 --> 11:09.180
and it's a really deep field,

11:09.180 --> 11:10.530
so it's not covering everything,

11:10.530 --> 11:12.930
but at least you have a good understanding now

11:12.930 --> 11:15.870
of what you need to be looking at,

11:15.870 --> 11:17.490
what you need to be thinking about

11:17.490 --> 11:20.100
and how to work through the problem

11:20.100 --> 11:22.113
of evaluating your model output.