WEBVTT

00:00.360 --> 00:04.830
-: Okay, let me show you how to use a tool called promptfoo.

00:04.830 --> 00:09.060
Okay, so promptfoo is an evaluation framework,

00:09.060 --> 00:11.430
which runs locally, it's quite useful,

00:11.430 --> 00:14.760
and you can pretty simply add a bunch of tests

00:14.760 --> 00:16.410
in order to test your prompts.

00:16.410 --> 00:19.977
So, this is the interface you get when you run it.

00:19.977 --> 00:21.750
It works in the command line as well,

00:21.750 --> 00:24.270
and it just gives you kind of like a pass/fail, pass/fail

00:24.270 --> 00:26.070
for various different metrics.

00:26.070 --> 00:27.090
It's really interesting.

00:27.090 --> 00:29.610
Now, the way that you can get it working

00:29.610 --> 00:33.240
is you just copy this command, npx promptfoo@latest init,

00:33.240 --> 00:34.500
then you run that in the terminal,

00:34.500 --> 00:36.900
and this is going to give you a prompts.txt

00:36.900 --> 00:40.560
and a promptfooconfig.yaml file.

00:40.560 --> 00:43.205
You don't actually have to run this init necessarily,

00:43.205 --> 00:45.540
you can just set these up yourself.

00:45.540 --> 00:49.380
But essentially, the way it works is,

00:49.380 --> 00:52.170
if I just maybe open up my config,

00:52.170 --> 00:54.060
you would set in here

00:54.060 --> 00:56.310
everything you need to run your evaluation.

00:56.310 --> 00:57.780
You would put your prompts,

00:57.780 --> 00:59.850
and these can be inline like this

00:59.850 --> 01:02.130
or you could put something like,

01:02.130 --> 01:03.240
if I show you in the guide

01:03.240 --> 01:05.670
actually it's a little bit easier.

01:05.670 --> 01:08.940
You can put like a list of files in here as well,

01:08.940 --> 01:11.070
prompt1, prompt2, etcetera.

01:11.070 --> 01:13.380
And then, you have,

01:13.380 --> 01:14.410
you can just do it inline like this,

01:14.410 --> 01:16.680
which is nice if you're doing a simple test.

01:16.680 --> 01:17.513
Here I've got one prompt,

01:17.513 --> 01:19.830
which is just tell me a funny joke about a topic.

01:19.830 --> 01:20.850
And then another prompt,

01:20.850 --> 01:22.260
just tell me a funny joke about topic

01:22.260 --> 01:24.150
in the style of Dave Chappelle.
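
NOTE
A minimal sketch of the prompts section described above, not the exact file from the recording.
It assumes the {{topic}} template variable syntax; the file names prompt1.txt and prompt2.txt
are hypothetical stand-ins for the "list of files" option mentioned in the guide.
```yaml
# promptfooconfig.yaml (fragment): prompts written inline
prompts:
  - "Tell me a funny joke about {{topic}}"
  - "Tell me a funny joke about {{topic}} in the style of Dave Chappelle"
# Alternative: point at prompt files instead (file names are assumed)
# prompts:
#   - file://prompt1.txt
#   - file://prompt2.txt
```
Inline prompts keep a simple test in one file; separate files are easier once prompts get long.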

01:24.150 --> 01:26.820
Then we have the providers that we want to test against,

01:26.820 --> 01:29.520
so you can just keep adding as many as you want here.
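
NOTE
A hedged sketch of what the providers list might look like. gpt-3.5-turbo is the model
named later in the recording; the second entry is an assumption added only to show that
the list can grow. OpenAI providers expect an API key in the OPENAI_API_KEY environment variable.
```yaml
# promptfooconfig.yaml (fragment): every prompt is run against every provider listed
providers:
  - openai:gpt-3.5-turbo
  - openai:gpt-4   # assumed second provider, for illustration only
```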

01:29.520 --> 01:31.040
And then you have the tests.

01:31.040 --> 01:34.790
So you can have test cases with specific tests.

01:34.790 --> 01:36.330
So in this case, it's going to tell

01:36.330 --> 01:39.510
a joke about the topic of bananas

01:39.510 --> 01:42.750
and we want to make sure that the output contains

01:42.750 --> 01:43.583
the word bananas.

01:43.583 --> 01:45.540
And the way you can do that, which is very simple,

01:45.540 --> 01:48.978
is you just have an assert and then type icontains

01:48.978 --> 01:50.250
and then the value bananas.

01:50.250 --> 01:53.130
And that's just going to check that it has that in there.

01:53.130 --> 01:55.350
So here we're going to have the word avocado,

01:55.350 --> 01:56.850
here we're going to have the word New York

01:56.850 --> 01:58.980
for these different tests that we're going to run.
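
NOTE
A sketch of the test cases walked through above, assuming each case sets a topic variable
and checks the output with a case-insensitive icontains assertion. The topic strings
"avocado toast" and "New York" are approximations of what is on screen.
```yaml
# promptfooconfig.yaml (fragment): one test case per topic
tests:
  - vars:
      topic: bananas
    assert:
      - type: icontains   # passes if the output contains the value, ignoring case
        value: bananas
  - vars:
      topic: avocado toast
    assert:
      - type: icontains
        value: avocado
  - vars:
      topic: New York
    assert:
      - type: icontains
        value: New York
```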

01:58.980 --> 02:01.380
So what this is going to do,

02:01.380 --> 02:03.000
it's going to run every prompt,

02:03.000 --> 02:05.280
every combination of every prompt,

02:05.280 --> 02:10.170
every provider, and then every topic here

02:10.170 --> 02:12.300
because we have the different topics

02:12.300 --> 02:14.640
that are inserted into this prompt.

02:14.640 --> 02:15.960
The other cool thing you can do

02:15.960 --> 02:18.990
is you can set default tests for everything.

02:18.990 --> 02:21.780
So this runs across all of the different outputs.

02:21.780 --> 02:24.600
And you can do basic things like JavaScript functions,

02:24.600 --> 02:26.070
which is pretty cool.

02:26.070 --> 02:27.600
If you want to calculate shorter outputs,

02:27.600 --> 02:30.540
you can do output.length, which is really helpful.

02:30.540 --> 02:33.000
But then, what I mostly use this for

02:33.000 --> 02:37.230
is using an LLM to test whether the prompt was correct.

02:37.230 --> 02:38.880
It's called llm-rubric,

02:38.880 --> 02:41.467
and then you literally just write your prompt in here,

02:41.467 --> 02:42.870
"Ensure the joke will be funny enough

02:42.870 --> 02:44.670
for stand up comedians to tell on stage."

02:44.670 --> 02:47.850
And what it will do is, for every version of this,

02:47.850 --> 02:49.440
it's going to call the LLM again,

02:49.440 --> 02:52.140
it's going to check, you know, this is the question,

02:52.140 --> 02:54.480
and then it's going to give you a response and a reason.

02:54.480 --> 02:56.550
And you can just keep adding as many of these as you want.

02:56.550 --> 02:58.320
I want to make sure there's no disclaimer or text

02:58.320 --> 02:59.190
other than the joke,

02:59.190 --> 03:02.070
that it doesn't describe itself as an AI,

03:02.070 --> 03:04.170
and the joke should not be a dad joke

03:04.170 --> 03:05.970
or too politically correct

03:05.970 --> 03:08.610
because it's like a problem that I've had in general.
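
NOTE
A sketch of the default tests described above, assuming they live under a defaultTest key
so they apply to every output. The rubric wording follows the recording; the 500-character
length limit is an assumed threshold, since the recording only mentions output.length.
```yaml
# promptfooconfig.yaml (fragment): assertions applied to every test case
defaultTest:
  assert:
    - type: javascript
      value: output.length < 500   # assumed threshold favoring shorter outputs
    - type: llm-rubric             # graded by a second LLM call, which returns pass/fail and a reason
      value: Ensure the joke will be funny enough for stand up comedians to tell on stage
    - type: llm-rubric
      value: No disclaimer or text other than the joke
    - type: llm-rubric
      value: Does not describe itself as an AI
    - type: llm-rubric
      value: The joke should not be a dad joke or too politically correct
```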

03:08.610 --> 03:10.170
All right. And once you have this,

03:10.170 --> 03:12.060
you can just keep adding prompts and stuff to it,

03:12.060 --> 03:13.620
which is really helpful.

03:13.620 --> 03:15.570
All right, so, we'll just save this.

03:15.570 --> 03:17.940
And I'll show you how it looks when you run it.

03:17.940 --> 03:19.990
So if we're going into the terminal here,

03:21.690 --> 03:25.143
so you can just run npx promptfoo@latest eval.

03:26.100 --> 03:28.000
And then it's going to run everything.

03:29.880 --> 03:30.840
Here we go.

03:30.840 --> 03:33.360
So I think it caches everything here.

03:33.360 --> 03:35.610
We don't have to run it again

03:35.610 --> 03:37.590
if it's already run, which is nice.

03:37.590 --> 03:39.540
But you can see that we have a lot of fails

03:39.540 --> 03:42.060
and we can actually see that in an interface here.

03:42.060 --> 03:45.660
So if we go use the view command,

03:45.660 --> 03:47.970
it's going to ask us if we want to open a URL.

03:47.970 --> 03:50.310
And here we go, we have our full test.

03:50.310 --> 03:51.690
Here are the different outputs,

03:51.690 --> 03:55.140
you can see that, you know, the topic bananas,

03:55.140 --> 03:56.910
and all of those failed.

03:56.910 --> 03:58.260
It didn't do a good job.

03:58.260 --> 04:01.800
Avocado toast, it actually got six passes here.

04:01.800 --> 04:05.250
So, this one had one fail, five pass, one fail, five pass.

04:05.250 --> 04:08.850
You can click into this and see what failed and passed.

04:08.850 --> 04:11.460
And this is the actual prompt compiled afterwards.

04:11.460 --> 04:12.900
And then this is the output,

04:12.900 --> 04:15.750
and you can see that it was successful and everything.

04:15.750 --> 04:17.700
It contained the word avocado.

04:17.700 --> 04:18.900
It wasn't a dad joke.

04:18.900 --> 04:20.880
It gives the reason: the joke is not a dad joke.

04:20.880 --> 04:23.580
It does not appear to be too politically correct,

04:23.580 --> 04:25.140
blah, blah, blah.

04:25.140 --> 04:26.160
So that's really cool.

04:26.160 --> 04:27.930
And then you can see where it fails.

04:27.930 --> 04:30.007
You can click on this and say,

04:30.007 --> 04:32.613
"Okay, it's not funny enough.

04:33.510 --> 04:35.160
It's not fresh enough."

04:35.160 --> 04:37.260
Okay, so that's pretty interesting.

04:37.260 --> 04:40.200
And if we look at what are the different things that failed,

04:40.200 --> 04:43.150
it's quite often this, "It's just not funny enough," right?

04:44.880 --> 04:46.410
What's this one again?

04:46.410 --> 04:48.390
Okay, no disclaimer or text other than the joke, yeah,

04:48.390 --> 04:49.223
so here we go.

04:49.223 --> 04:51.270
It's because we have this at the beginning.

04:51.270 --> 04:52.830
All right, here we go. (chuckles)

04:52.830 --> 04:53.940
And that's not what we want,

04:53.940 --> 04:55.773
we want just the joke back.

04:57.000 --> 04:59.910
Cool. So, really quick way to do this.

04:59.910 --> 05:04.910
You can also, you know, add more test cases if you want.

05:04.980 --> 05:08.640
And then you can go back and see your different prompts,

05:08.640 --> 05:12.090
and click into it and see the score of the different things,

05:12.090 --> 05:15.030
like, well, it's a pass rate, success rate, etcetera.

05:15.030 --> 05:18.000
You can see the datasets that you've added,

05:18.000 --> 05:19.560
and then you can see the progress,

05:19.560 --> 05:22.230
so like how it has improved over time

05:22.230 --> 05:24.090
as you've tested more stuff.

05:24.090 --> 05:27.870
Now, you can also, in theory, add another evaluation

05:27.870 --> 05:32.070
or set up an evaluation here without using the YAML

05:32.070 --> 05:33.450
if that's easier for you as well.

05:33.450 --> 05:36.723
So you just add the providers, in this case,

05:38.850 --> 05:41.760
yeah, I want gpt-3.5-turbo.

05:41.760 --> 05:43.980
And then I want to add a prompt.

05:43.980 --> 05:45.153
Yeah, hilarious joke.

05:47.970 --> 05:51.960
And then, you can duplicate that if you want to edit it,

05:51.960 --> 05:53.013
which is nice.

05:57.660 --> 05:59.400
Kind of funny joke on topic.

05:59.400 --> 06:01.560
Okay, so now we've got our two prompts,

06:01.560 --> 06:03.060
and then we can add the test case.

06:03.060 --> 06:05.560
You can just say, "I want a joke about chocolate."

06:07.710 --> 06:09.240
All right. So, here we go.

06:09.240 --> 06:10.710
It gives us our YAML.

06:10.710 --> 06:12.780
And we can run that test.

06:12.780 --> 06:14.913
So if we run the evaluation.

06:16.740 --> 06:18.750
Here we go, 100% passing.

06:18.750 --> 06:21.150
There are no tests.

06:21.150 --> 06:23.013
I want to set up an evaluation here.

06:25.050 --> 06:25.883
Add an assert.

06:25.883 --> 06:27.363
We can just say,

06:29.250 --> 06:30.630
here we go.

06:30.630 --> 06:31.463
And just be,

06:33.510 --> 06:34.743
mind-numbingly funny.

06:40.951 --> 06:42.534
Run the evaluation.

06:46.440 --> 06:48.570
Yeah. So, it's just a simple pun

06:48.570 --> 06:50.670
is not mind-numbingly funny. (snickers)
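
NOTE
A rough sketch of the YAML the web interface might generate for the evaluation built above.
The exact prompt wording is not legible from the recording, so both prompt strings are
assumptions; the provider, the chocolate topic, and the rubric text follow what is said.
```yaml
# Assumed equivalent of the evaluation set up in the web UI
prompts:
  - "Tell me a hilarious joke about {{topic}}"       # assumed wording
  - "Tell me a kind of funny joke about {{topic}}"   # assumed wording
providers:
  - openai:gpt-3.5-turbo
tests:
  - vars:
      topic: chocolate
    assert:
      - type: llm-rubric
        value: Mind-numbingly funny
```
Without the assert entry, every output counts as passing, which matches the 100% result seen before the rubric was added.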

06:50.670 --> 06:53.580
Cool. So you can see how simple this is to run.

06:53.580 --> 06:55.770
And yeah, I find it quite quick and easy

06:55.770 --> 06:56.603
to mess around with.

06:56.603 --> 07:00.930
So, feel free to use this for your prompt optimization,

07:00.930 --> 07:03.150
connect it to different things you're going to be doing.

07:03.150 --> 07:05.070
But, yeah, I quite like this.

07:05.070 --> 07:07.350
And I think you can export it as well,

07:07.350 --> 07:10.827
I haven't done that, but I think you can.

07:10.827 --> 07:13.050
Yeah, you can download the table,

07:13.050 --> 07:15.420
JSON as well, which is quite useful.

07:15.420 --> 07:17.733
Cool. All right, enjoy. Happy prompting.