WEBVTT

00:00.210 --> 00:03.750
-: Hey, let's start practicing what,

00:03.750 --> 00:06.420
let's start exploring how prompt caching works

00:06.420 --> 00:07.380
in a Jupyter Notebook.

00:07.380 --> 00:08.670
So the first thing we're gonna do

00:08.670 --> 00:11.730
is just run this pip install openai and anthropic command.

00:11.730 --> 00:12.563
That's gonna make sure

00:12.563 --> 00:14.550
that you've got both the OpenAI package

00:14.550 --> 00:17.700
and the Anthropic package installed inside of Python.

00:17.700 --> 00:18.533
After doing that,

00:18.533 --> 00:20.640
we've also got a bunch of different imports

00:20.640 --> 00:21.810
that we're gonna need.

00:21.810 --> 00:25.410
And then you need to set your OpenAI API key,

00:25.410 --> 00:29.160
and we're using the getpass module and the getpass function,

00:29.160 --> 00:32.490
which will basically allow you to do this in a hidden way.

00:32.490 --> 00:34.470
Now I've already set my API keys,

00:34.470 --> 00:37.260
and we're also loading the Anthropic client

00:37.260 --> 00:39.960
and also modifying the OpenAI client

00:39.960 --> 00:42.180
to have our general API key.

00:42.180 --> 00:43.680
Now, the first thing you're gonna need to do

00:43.680 --> 00:47.580
is create a new function called check_openai_caching,

00:47.580 --> 00:50.220
which will take both a system prompt and a user prompt,

00:50.220 --> 00:52.050
and it will have a doc string.

00:52.050 --> 00:55.200
And then what we're then gonna do is have a response,

00:55.200 --> 00:57.210
and that response is gonna basically do

00:57.210 --> 01:01.800
an openai.chatcompletions.create.

01:01.800 --> 01:03.630
We're gonna use the GPT-4o model,

01:03.630 --> 01:05.100
and we'll pass in both the messages

01:05.100 --> 01:07.830
for the system prompt and the user prompt.

01:07.830 --> 01:09.810
Now we're gonna get the usage out of this

01:09.810 --> 01:12.870
by doing usage = response.usage,

01:12.870 --> 01:14.730
and then we're gonna just print some metadata,

01:14.730 --> 01:16.950
so the total tokens, the prompt tokens,

01:16.950 --> 01:18.540
the completion tokens,

01:18.540 --> 01:20.160
and then we're gonna build some logic here

01:20.160 --> 01:20.993
that just allows us

01:20.993 --> 01:24.660
to easily extract the prompt tokens details

01:24.660 --> 01:26.370
and getting the cache tokens from those,

01:26.370 --> 01:28.590
whether they're a dictionary or if they're not a dictionary,

01:28.590 --> 01:30.930
and we have some fallbacks for those here.

01:30.930 --> 01:33.330
And now what we're gonna do is also return the response.

01:33.330 --> 01:35.280
What you should see is the first one

01:35.280 --> 01:37.350
doesn't have any cached tokens.

01:37.350 --> 01:39.720
However, the second time we make that call,

01:39.720 --> 01:41.640
you'll end up with cached tokens.

01:41.640 --> 01:44.340
And that basically means you're getting reduced costs

01:44.340 --> 01:45.480
on your input tokens.

01:45.480 --> 01:48.240
All we did was use that long system message prompt

01:48.240 --> 01:49.980
and then rerunning the response

01:49.980 --> 01:52.050
within that five minute time window

01:52.050 --> 01:54.450
is gonna automatically enable

01:54.450 --> 01:56.880
that bit of the system prompt to be cached.

01:56.880 --> 02:00.483
And you'll get a 50% cost savings using OpenAI with that.

02:01.350 --> 02:04.230
Now, it works slightly different when it comes to Anthropic,

02:04.230 --> 02:05.730
so we're gonna have a different function here

02:05.730 --> 02:10.500
called create_anthropic request.

02:10.500 --> 02:12.958
Now we'll call cached_message,

02:12.958 --> 02:14.550
and I'm gonna take in both a system prompt

02:14.550 --> 02:18.480
and a user prompt, and we're gonna have a doc string.

02:18.480 --> 02:21.690
And then what we need to do is we need to have a response,

02:21.690 --> 02:24.153
which is going to use the Anthropic client.

02:25.590 --> 02:26.490
And then we will have

02:26.490 --> 02:31.490
the .beta.prompt_caching.messages.create.

02:35.040 --> 02:37.920
And inside this you're gonna put a model,

02:37.920 --> 02:40.410
and I'm gonna change this to latest.

02:40.410 --> 02:44.550
Then we're gonna put the max tokens at 1,024.

02:44.550 --> 02:46.320
Then we're gonna put our system,

02:46.320 --> 02:49.560
which we will have a type of text,

02:49.560 --> 02:54.560
and the text will be here from the system prompt.

02:56.970 --> 02:58.320
Now this is the bit that's interesting

02:58.320 --> 03:01.410
is where you actually add a cache control key

03:01.410 --> 03:02.850
inside of here.

03:02.850 --> 03:04.470
And inside of that cache control key,

03:04.470 --> 03:07.620
you put a type of ephemeral.

03:07.620 --> 03:10.590
Now we're also gonna have the messages,

03:10.590 --> 03:13.350
which just have a type of text and a user prompt.

03:13.350 --> 03:15.780
Now, after that, we now need to get usage.

03:15.780 --> 03:19.080
So what we're then gonna do is we're gonna go below here

03:19.080 --> 03:22.740
and we're gonna do usage is equal to response.usage,

03:22.740 --> 03:27.740
and then we're gonna print the cache creation tokens,

03:34.380 --> 03:38.757
which is actually usage.cache_creation_input_tokens.

03:43.230 --> 03:46.323
We also have the cache read tokens,

03:48.570 --> 03:53.570
which is, that is correct, cache_read_input_tokens,

03:53.910 --> 03:56.913
and we also have the regular input tokens.

04:00.240 --> 04:02.463
And that will be the usage.input_tokens.

04:03.420 --> 04:05.400
We have the regular output tokens,

04:05.400 --> 04:06.870
which is just the .output_tokens,

04:06.870 --> 04:09.150
then we're gonna return the response.

04:09.150 --> 04:12.960
Cool, so let's have a look and we'll use the same context

04:12.960 --> 04:15.603
that we had earlier with the long system prompt.

04:17.250 --> 04:19.080
And we're gonna call our new function.

04:19.080 --> 04:22.950
So that will be create_anthropic_cached_message,

04:23.820 --> 04:25.710
and then we're gonna call this twice.

04:25.710 --> 04:26.730
And just one more thing

04:26.730 --> 04:28.920
that we need to do is rather than have a type of text here,

04:28.920 --> 04:32.160
we need a role of user and then the content.

04:32.160 --> 04:35.160
So this, we do actually need to change this to a role.

04:35.160 --> 04:36.900
So I've just updated that function

04:36.900 --> 04:38.460
and then we're gonna run these two again.

04:38.460 --> 04:41.640
And you can see we have created some input tokens.

04:41.640 --> 04:43.860
We don't have any cached read tokens.

04:43.860 --> 04:46.230
And then the second time that we actually do that,

04:46.230 --> 04:49.410
we end up with not creating any cache creation tokens.

04:49.410 --> 04:51.030
We do read from the cache.

04:51.030 --> 04:53.760
We have some regular input and regular output tokens.

04:53.760 --> 04:54.960
So again, you can see

04:54.960 --> 04:58.320
that you can get very granular with Anthropic

04:58.320 --> 05:01.800
by deciding what should be cached, okay?

05:01.800 --> 05:03.090
So you can specifically say

05:03.090 --> 05:05.040
in this specific type of message,

05:05.040 --> 05:07.170
I want to cache this system prompt

05:07.170 --> 05:11.310
and you also have the amount of cache creation tokens

05:11.310 --> 05:12.510
and the cache read.

05:12.510 --> 05:15.270
So you have to specify with Anthropic

05:15.270 --> 05:16.800
using their beta client at the moment

05:16.800 --> 05:18.360
with the prompt caching.

05:18.360 --> 05:21.870
And obviously this is a much different approach to OpenAI.

05:21.870 --> 05:24.930
They do offer additional savings when it comes to Anthropic.

05:24.930 --> 05:28.440
So if you're doing a large amount of work

05:28.440 --> 05:30.960
and a lot of LLM requests,

05:30.960 --> 05:33.210
then you might want to look at Anthropics caching

05:33.210 --> 05:36.030
because it provides 90% cost reductions

05:36.030 --> 05:37.560
after you've created the cache.

05:37.560 --> 05:38.970
However, if you are just,

05:38.970 --> 05:40.860
you know, going about your day-to-day things,

05:40.860 --> 05:42.270
and you're not particularly interested

05:42.270 --> 05:44.700
in latency and cost optimization,

05:44.700 --> 05:46.170
then I would recommend OpenAI.

05:46.170 --> 05:48.000
So it really does depend on the use case

05:48.000 --> 05:49.860
that you're currently working on.

05:49.860 --> 05:52.560
The final function that we're gonna work on is a function

05:52.560 --> 05:53.393
that we're gonna use

05:53.393 --> 05:55.530
and we're gonna calculate the cost savings.

05:55.530 --> 05:58.110
I'm gonna paste in the cached tokens,

05:58.110 --> 05:59.910
and then we're gonna get the model.

05:59.910 --> 06:02.820
And in this case, I'm gonna put claude, the latest.

06:02.820 --> 06:03.840
And then what we're gonna have here

06:03.840 --> 06:08.550
is a way of calculating the cost per million tokens.

06:08.550 --> 06:10.353
So I'm gonna call this prices,

06:11.670 --> 06:13.953
and we're gonna have two models.

06:15.660 --> 06:17.883
We'll have the claude-3.5-sonnet,

06:19.770 --> 06:22.773
and inside there, we will have the base input,

06:24.330 --> 06:28.290
which at this time of writing is 0.003.

06:28.290 --> 06:32.167
So that's essentially $3 million,

06:32.167 --> 06:34.353
$3 per million tokens.

06:37.050 --> 06:39.033
We also have the cache read,

06:41.550 --> 06:44.463
which is 0.0003,

06:45.780 --> 06:48.453
and we also have GPT-4o,

06:51.210 --> 06:54.267
which has a base input of 0.0025.

07:00.810 --> 07:05.690
And we also have a cache read of 0.00125,

07:09.600 --> 07:11.520
but basically half price.

07:11.520 --> 07:14.160
And then what we then need to do is we then need to go

07:14.160 --> 07:17.490
and get the model prices based on the model that we put in.

07:17.490 --> 07:19.803
And then we need to work out the base cost.

07:20.880 --> 07:24.090
And the base cost is basically the number of cached tokens

07:24.090 --> 07:26.010
times by the base input,

07:26.010 --> 07:28.620
the cached cost, which is the cached tokens

07:28.620 --> 07:30.330
times by the cache read.

07:30.330 --> 07:32.460
And then we return these figures

07:32.460 --> 07:33.990
and we'll also print these out as well.

07:33.990 --> 07:36.330
So let's go and print those out.

07:36.330 --> 07:41.330
And we'll also then return the savings, which is here.

07:42.810 --> 07:45.360
So let's also have another variable called savings.

07:48.540 --> 07:49.373
Okay, great, cool.

07:49.373 --> 07:50.430
So now we've got the ability

07:50.430 --> 07:54.360
to do the calculate the cost savings using this.

07:54.360 --> 07:56.640
And then what we could do is, for example,

07:56.640 --> 07:59.040
we could do a couple of different edge cases here.

07:59.040 --> 08:02.490
So we could say OpenAI and we could put in some tokens.

08:02.490 --> 08:06.363
So let's say we're putting in 10,000 tokens.

08:10.260 --> 08:12.450
And we will do the same here.

08:12.450 --> 08:16.410
And you can see here the base cost of that is $25.

08:16.410 --> 08:19.110
The cached cost is $12.50, and the savings $12.50.

08:19.110 --> 08:22.260
So you get 50% cost reduction with OpenAI.

08:22.260 --> 08:26.250
If you look at Anthropic, we have our base cost of $30,

08:26.250 --> 08:30.120
the cached cost, $3, so our savings are $27.

08:30.120 --> 08:33.030
So a saving of 90%.

08:33.030 --> 08:35.100
So hopefully this gives you a good indication

08:35.100 --> 08:37.620
as to how prompt caching works,

08:37.620 --> 08:40.410
how you can use this for both OpenAI and Anthropic,

08:40.410 --> 08:42.750
and also how you can calculate cost saving metrics

08:42.750 --> 08:44.253
based on any input tokens

08:44.253 --> 08:46.800
that prompt caching currently provides

08:46.800 --> 08:48.300
for different types of models.
