WEBVTT

00:00.200 --> 00:01.790
In-context learning.

00:01.970 --> 00:04.580
Don't worry, it's not as complicated as it sounds.

00:04.580 --> 00:07.340
It's just adding examples into the prompt.

00:07.460 --> 00:10.670
This is the killer application of LLMs.

00:10.670 --> 00:15.470
If you give it examples in the prompt, it is much better at following those examples.

00:15.470 --> 00:20.600
It should do a good job of the task that you're asking it to do. This is one of the most famous

00:20.600 --> 00:27.530
studies, the one that introduced the GPT-3 LLM, but it's still one of the best, I think, in terms of showing

00:27.530 --> 00:32.220
you the value of prompt engineering and adding examples. As you add more examples

00:32.250 --> 00:39.330
to the prompt for bigger models, not only do they get better at performance, they actually keep

00:39.330 --> 00:41.370
getting better over time, which is really nice.

00:41.370 --> 00:45.150
And obviously just adding one example here.

00:45.150 --> 00:47.070
So zero shot is no examples.

00:47.070 --> 00:53.430
Going from zero to one takes you from 10% to almost 50% accuracy on this task.

00:53.760 --> 00:56.990
And I'd say that really matches my experience as well.

00:57.170 --> 01:02.720
This is if you add ten examples and after ten examples, that's when it starts to level off a little

01:02.720 --> 01:03.140
bit.

01:03.140 --> 01:07.700
But yeah, if you can do anything just to get one example into the prompt, that will make a

01:07.700 --> 01:08.660
big difference.

01:09.230 --> 01:09.620
Great.

01:09.620 --> 01:12.320
So if you want to read more about that there's a link to the paper there.

01:12.740 --> 01:16.880
Let's give you an example of using examples in your prompt.

01:16.910 --> 01:19.060
Here we just have a zero shot prompt.

01:19.420 --> 01:21.250
There are no examples in this.

01:21.250 --> 01:24.550
We're just asking it to do something in the style of Steve Jobs.

01:24.550 --> 01:26.050
And then we're telling it the format.

01:26.050 --> 01:31.810
And it's kind of an awkward product [inaudible].

01:32.260 --> 01:34.900
And then we're just giving it examples.

01:34.990 --> 01:36.490
So we have an array of examples.

01:36.490 --> 01:40.300
That is how I usually add things to my prompt because it's cleaner.

01:40.300 --> 01:44.550
And then you can swap out the examples much more easily, rather than writing them all out in the prompt.

01:45.000 --> 01:49.860
So what I'm doing here is the few-shot prompt is just the zero-shot prompt.

01:49.860 --> 01:55.530
And then I've added an examples section and the examples themselves.

01:55.530 --> 01:59.910
And then I'm just iterating through and adding these with a line break.
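
NOTE
A minimal sketch of the approach described here, not the code shown on screen: keep the examples in a list and append them to the zero-shot prompt with line breaks. The prompt wording, the product descriptions, and the names ZERO_SHOT_PROMPT, EXAMPLES, and build_prompt are assumptions for illustration. Only the example product names come from the talk.
```python
# Hypothetical zero-shot prompt; the wording is an assumption, not the on-screen text.
ZERO_SHOT_PROMPT = (
    "Brainstorm a list of product names for a new product, "
    "in the style of Steve Jobs.\n"
    "Return the results as a comma-separated list."
)
# Examples live in a list so they are easy to swap, add, or remove.
# The descriptions are made up; the names are the ones mentioned in the talk.
EXAMPLES = [
    {"description": "A refrigerator that dispenses beer",
     "names": "iBarFridge, iFridgeBeer, iDrinkBeerFridge"},
    {"description": "A watch that can tell accurate time in space",
     "names": "iNaut, iSpace, iTime"},
]
def build_prompt(base, examples, label="Product names"):
    """Append each example to the base prompt, separated by line breaks."""
    lines = [base, "", "Examples:"]
    for ex in examples:
        lines.append(f"Product description: {ex['description']}")
        lines.append(f"{label}: {ex['names']}")
        lines.append("")
    return "\n".join(lines).rstrip()
few_shot_prompt = build_prompt(ZERO_SHOT_PROMPT, EXAMPLES)
print(few_shot_prompt)
```
Adding another example means appending one more entry to EXAMPLES, and the label argument changes how every example is labeled without rewriting the prompt text.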

02:00.120 --> 02:04.020
So when you run this what you get is the full prompt with the examples.

02:04.020 --> 02:05.940
So we've added this section here.

02:06.030 --> 02:10.710
Now if we made a change to these examples and wanted to add something new in, maybe we wanted

02:10.710 --> 02:12.960
to add an additional one here.

02:12.960 --> 02:14.190
Then we could do that.

02:14.190 --> 02:17.130
And it's just a case of changing the array.

02:17.130 --> 02:20.460
And then we have this extra example which is really helpful.

02:21.600 --> 02:23.430
So I really like this tip.

02:23.670 --> 02:25.800
And you can also then just make a change.

02:25.800 --> 02:31.050
So if you wanted this to be labeled differently or whatever, you could

02:31.050 --> 02:35.320
do that here in the code rather than having to rewrite your whole prompt.

02:35.860 --> 02:36.070
All right.

02:36.160 --> 02:37.120
That's our prompt.

02:37.180 --> 02:38.890
And let's see how well we do.

02:38.920 --> 02:42.640
Now, something you can notice about these names is that they all start with I.

02:42.670 --> 02:46.630
And I didn't tell it to make every name start with I.

02:46.630 --> 02:48.340
But it just learned that from the examples.

02:48.340 --> 02:54.970
So here I put in iBarFridge, iFridgeBeer, iDrinkBeerFridge, iNaut, iSpace, iTime. Every single

02:54.970 --> 02:57.940
example product name I put in starts with an I.

02:57.970 --> 03:01.630
It's just like Steve Jobs' iPod or iMac or iPhone.

03:01.930 --> 03:06.820
And I did that specifically. It's not something that I actually really want from a product name generator,

03:06.820 --> 03:11.890
but I did that specifically to show you just how much it follows the examples.

03:11.890 --> 03:17.860
With just three shots, we've managed to get it using the letter I in front of every single name.

03:18.700 --> 03:21.340
Now, we want an evaluation metric for this.

03:21.340 --> 03:23.780
And maybe there are different things to evaluate.

03:23.780 --> 03:30.380
But this is just an example of a programmatic evaluation because we can just tell whether it starts

03:30.380 --> 03:31.730
with the letter I or not.

03:31.730 --> 03:37.010
In code we have something we can run quite quickly, and then we just want to get the percentage that

03:37.010 --> 03:41.210
actually came back starting with an I.

03:41.540 --> 03:47.310
So we split the product names into an array, split by comma, and then we'll see if each one starts with

03:47.310 --> 03:47.820
an I.
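
NOTE
A sketch of the programmatic evaluation described here, under assumptions: the function name percent_starting_with_i and the sample response string are hypothetical, and the check ignores case so that both "iPod" and "IPod" would count.
```python
def percent_starting_with_i(response):
    """Return the percentage of comma-separated names that start with 'i' or 'I'."""
    # Split on commas and drop surrounding whitespace and empty entries.
    names = [name.strip() for name in response.split(",") if name.strip()]
    if not names:
        return 0.0
    hits = sum(name.lower().startswith("i") for name in names)
    return 100 * hits / len(names)
# Hypothetical model output in which two of the three names start with "i".
score = percent_starting_with_i("iFitFoot, iPerfectFit, ShoeSize")
print(f"{score:.1f}% of the names start with an I")
```
Because the check is plain string handling, it runs instantly and can be repeated over many responses.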

03:48.390 --> 03:49.290
We'll run that.

03:49.290 --> 03:53.430
We can see that 66% of the names started with an I.

03:53.520 --> 03:55.140
So this one didn't start with an I.

03:55.170 --> 03:58.560
If we did have an I in front, then it would say 100%.

03:58.560 --> 04:04.170
And then we can see that the evaluation metric is working.

04:04.170 --> 04:05.730
So now that's working.

04:05.730 --> 04:08.790
We can do some A/B testing, which is really fun.

04:09.150 --> 04:17.560
So I want to see how many examples are needed in order to get a result that starts with I every time.

04:17.950 --> 04:24.040
We're including three examples traditionally here, and we want to know if including that third example,

04:24.040 --> 04:25.390
or that second example, is worth it.

04:25.390 --> 04:29.290
Maybe it just learns from one example and it would hit this.

04:29.620 --> 04:31.480
So now we've got the different prompts.

04:31.480 --> 04:34.420
We've got the three-shot prompt, two-shot prompt, one-shot prompt.

04:34.420 --> 04:39.200
And all I've done is kind of build them up by just adding the next example in here.

04:39.200 --> 04:41.210
So we have this zero-shot prompt we already had.

04:41.210 --> 04:45.860
And then we just add the first example to get the one-shot prompt.

04:45.860 --> 04:48.050
And then this one has two examples.

04:48.050 --> 04:49.880
And then this one has three examples.
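
NOTE
One way to sketch building the zero-, one-, two-, and three-shot variants by adding the next example each time. This is an illustration under assumptions, not the on-screen code: the prompt wording, the product descriptions, and the whole third example are made up.
```python
# Hypothetical zero-shot prompt; the wording is an assumption.
ZERO_SHOT_PROMPT = "Brainstorm a list of product names in the style of Steve Jobs."
# Example blocks; descriptions and the third example are invented for illustration.
EXAMPLES = [
    "Product description: A refrigerator that dispenses beer\n"
    "Product names: iBarFridge, iFridgeBeer, iDrinkBeerFridge",
    "Product description: A watch that can tell accurate time in space\n"
    "Product names: iNaut, iSpace, iTime",
    "Product description: A home milkshake maker\n"
    "Product names: iShake, iSmoothie, iShake Mini",
]
# prompts[n] is the n-shot prompt: each one extends the previous with one more example.
prompts = [ZERO_SHOT_PROMPT]
for example in EXAMPLES:
    prompts.append(prompts[-1] + "\n\n" + example)
for shots, prompt in enumerate(prompts):
    print(f"{shots}-shot prompt: {len(prompt)} characters")
```
Building each variant from the previous one keeps the prompts identical apart from the number of examples, which is what makes the comparison fair.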

04:50.450 --> 04:50.780
All right.

04:50.780 --> 04:51.740
Those are the different prompts.

04:51.740 --> 04:57.650
And we're going to test this asynchronously, which is a common pattern, because if we just run it once,

04:57.650 --> 05:00.740
we can see that they all begin with I here.

05:01.070 --> 05:05.360
But that might not be reliable enough for your application.

05:05.360 --> 05:10.610
So what is quite common is you'll create a function which will help you run and evaluate, and then

05:10.610 --> 05:12.770
you would then have a number of runs.

05:12.920 --> 05:15.320
In this case, I'm testing it 30 times.

05:15.320 --> 05:20.900
Now, the reason why we're doing it asynchronously: think of a supermarket where you have multiple

05:20.900 --> 05:28.680
cashiers on shift. They can each handle a customer at the same time, and therefore the line can go down

05:28.770 --> 05:29.640
quite quickly.

05:29.640 --> 05:34.590
Whereas if you only have one cashier on, then everyone's going to have to wait their turn and she can

05:34.590 --> 05:36.090
only process one at a time.

05:36.120 --> 05:40.560
Asynchronous is just a way of getting more cashiers on the desk, right, where I'm going to call

05:40.560 --> 05:48.510
all 30 prompts at the same time, and we should get them back basically as fast as if we'd only called

05:48.510 --> 05:48.960
one.

05:49.450 --> 05:50.980
So that's the benefit of async.

05:51.010 --> 05:53.770
Otherwise it would just take 20 minutes or something

05:53.770 --> 05:55.780
for this to, one at a time,

05:55.780 --> 05:56.410
call...

05:56.410 --> 06:01.750
Call the prompt 30 times. So we run this, and I've added the parts that make it async.

06:01.750 --> 06:08.350
By the way, we have an async API client, and then we have to await that function.

06:08.350 --> 06:13.630
And we have to define the function with async def instead of just def.

06:13.640 --> 06:18.680
And then we're spinning up the tasks and then we're gathering those tasks.

06:18.680 --> 06:23.210
And what that means is now it can take all 30 tasks at the same time and then just gather them.

06:23.210 --> 06:27.890
So when they all arrive back then we're gonna get the correct count.
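
NOTE
A minimal sketch of the async pattern described here: an async def function, awaited calls, tasks spun up together, then gathered. To stay runnable without an API key it uses a stand-in coroutine, fake_model_call, that sleeps instead of calling a real async API client. That function, its canned response, and the other names are assumptions, not the code from the talk.
```python
import asyncio
import time
async def fake_model_call(prompt):
    """Stand-in for an awaited async API call; sleeps instead of using the network."""
    await asyncio.sleep(0.1)
    return "iFitFoot, iPerfectFit, iShoeSize"
def all_start_with_i(response):
    """True if every comma-separated name starts with 'i' or 'I'."""
    names = [n.strip() for n in response.split(",") if n.strip()]
    return bool(names) and all(n.lower().startswith("i") for n in names)
async def run_and_evaluate(prompt, num_runs=30):
    """Start all calls at once, gather the responses, and return the pass percentage."""
    tasks = [asyncio.create_task(fake_model_call(prompt)) for _ in range(num_runs)]
    responses = await asyncio.gather(*tasks)
    correct = sum(all_start_with_i(r) for r in responses)
    return 100 * correct / num_runs
start = time.perf_counter()
score = asyncio.run(run_and_evaluate("one-shot prompt goes here"))
elapsed = time.perf_counter() - start
print(f"{score:.0f}% of runs passed in {elapsed:.2f}s")
```
Run one at a time, 30 calls of 0.1 seconds each would take about 3 seconds. Gathered, they overlap and finish in roughly the time of a single call.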

06:27.890 --> 06:28.700
Like,

06:29.120 --> 06:34.700
what percentage of these tasks have the right result. And we can see here.

06:35.150 --> 06:41.990
Zero-shot actually still had about a third of the names, 27%, begin with I, which is pretty

06:41.990 --> 06:45.200
good, because Steve Jobs did begin a lot of product names with I.

06:45.200 --> 06:50.660
We've found that even with just one example, it hits 100%: every time, it produces a name starting with I. That's

06:50.660 --> 06:51.830
reliable enough for this.

06:51.830 --> 06:57.560
And this is a pretty made-up evaluation metric of starting the name with I.

06:57.590 --> 07:03.240
But if you had any other evaluation metric, you could just drop that in and then you see the results

07:03.240 --> 07:03.480
here.

07:03.510 --> 07:09.510
What you should take away is that it's really important to know how many examples you need to provide,

07:09.510 --> 07:14.250
because the more examples you provide, the more expensive the prompt is and the longer it takes

07:14.250 --> 07:14.910
to run.

07:14.910 --> 07:21.060
And that's going to be something you have to trade off while you're building a real production AI system.
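
NOTE
A small sketch of the trade-off described here: given a score per number of examples, pick the fewest examples that still meet your reliability target, since every extra example adds cost and latency. The function name and the 100% target are assumptions. The two scores are the zero-shot and one-shot results mentioned in the talk.
```python
def fewest_shots_meeting_target(scores, target=100.0):
    """Return the smallest number of examples whose score meets the target, or None."""
    passing = [shots for shots, score in scores.items() if score >= target]
    return min(passing) if passing else None
# Scores from the talk: 27% with zero examples, 100% with one example.
scores = {0: 27.0, 1: 100.0}
best = fewest_shots_meeting_target(scores)
print(f"Fewest examples meeting the target: {best}")
```
With a different evaluation metric or a lower target, the same selection step applies unchanged.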
