WEBVTT

1
00:00:00.000 --> 00:00:05.000
Hi, and welcome to this AI and C-Sharp video on the Microsoft Agent Framework.

2
00:00:05.119 --> 00:00:11.199
Today, we're going to look into OpenAI's cached tokens and how they work.

3
00:00:11.319 --> 00:00:13.720
So, let's jump into it.

4
00:00:13.840 --> 00:00:18.719
So, here I have some sample code and a small prompt.

5
00:00:19.520 --> 00:00:22.520
In that, we just make an Azure OpenAI client here.

6
00:00:22.639 --> 00:00:26.319
We could use the OpenAI client; they work exactly the same.

7
00:00:26.639 --> 00:00:33.040
We are going to give our model a little text with some arithmetic equations,

8
00:00:33.840 --> 00:00:36.840
and we tell it this is the only knowledge it has.

9
00:00:38.240 --> 00:00:42.840
Then we are starting a session, and we are going in

10
00:00:42.959 --> 00:00:48.639
and writing out the answer, of course, but also the token count.

11
00:00:48.759 --> 00:00:55.159
And we're using GPT-5 Mini here, and the price is per million tokens,

12
00:00:55.400 --> 00:01:02.400
and it would be $0.25 for input, and then we can see cached input is one-tenth of that.

13
00:01:03.520 --> 00:01:06.919
Then we have output, and reasoning is part of output.

14
00:01:07.040 --> 00:01:09.440
We're not really going into reasoning today,

15
00:01:10.319 --> 00:01:13.319
but I've just put it out so we have all four of these.

16
00:01:14.760 --> 00:01:20.239
So, if we ask our AI here,

17
00:01:20.360 --> 00:01:23.959
what is this text about?

18
00:01:25.599 --> 00:01:30.599
What we will see is we get our answer back.

19
00:01:31.360 --> 00:01:35.959
And because this is our first question, we don't get any cache or anything.

20
00:01:36.080 --> 00:01:40.599
It's just the input and the output, and much of it was reasoning.

21
00:01:41.199 --> 00:01:44.120
Probably wouldn't need reasoning for this specific question,

22
00:01:44.239 --> 00:01:45.760
but just so we have all of it.

23
00:01:46.519 --> 00:01:48.040
And we can do a follow-up question.

24
00:01:49.160 --> 00:01:54.160
Are the answers correct?

25
00:01:55.480 --> 00:02:00.480
And we will get back that, yes, they are correct,

26
00:02:00.599 --> 00:02:03.599
but what you will notice is we didn't hit the cache.

27
00:02:04.440 --> 00:02:06.959
So, the cache didn't work.

28
00:02:08.119 --> 00:02:12.919
Does that mean we need to set some settings over here? Why is that?

29
00:02:13.880 --> 00:02:16.039
Well, we don't need to set any settings.

30
00:02:16.160 --> 00:02:22.160
Caching is on by default, but caching is only happening

31
00:02:22.759 --> 00:02:27.759
when we hit a thousand or more tokens (1,024, to be exact) with OpenAI.

32
00:02:29.520 --> 00:02:31.320
So, because we were below that,

33
00:02:32.440 --> 00:02:36.440
they, being OpenAI or Azure in this case,

34
00:02:36.559 --> 00:02:42.559
didn't go in and use the cache because probably it's too much of a hassle

35
00:02:42.679 --> 00:02:47.679
for very, very low token counts and would hurt our performance.

36
00:02:48.119 --> 00:02:52.119
So, they have chosen roughly 1,000 tokens as the point where caching begins,

37
00:02:52.240 --> 00:02:54.839
and you can't really change that.

38
00:02:55.679 --> 00:02:58.080
So, in order for us to see caching,

39
00:02:58.199 --> 00:03:01.240
we need to go in and use a longer text.

40
00:03:02.080 --> 00:03:04.679
So, in my case, I have a book here.

41
00:03:04.800 --> 00:03:09.279
It's the preface of Pride and Prejudice,

42
00:03:09.399 --> 00:03:10.880
and then the first chapter.

43
00:03:11.000 --> 00:03:13.399
It's roughly 500 lines of text.

44
00:03:13.399 --> 00:03:15.600
And the reason for a longer text is, of course,

45
00:03:15.720 --> 00:03:19.399
to get over that 1,000-token threshold.

46
00:03:19.520 --> 00:03:23.520
So, if we go in and ask again, what is this text about?

47
00:03:28.720 --> 00:03:31.000
We will now see that once it's come back,

48
00:03:31.119 --> 00:03:35.399
and it will take a little longer now because we have a longer text,

49
00:03:35.520 --> 00:03:39.399
it comes back and gives us a summary of what that text is about.

50
00:03:39.399 --> 00:03:43.800
And we can see our input here is almost 10,000 tokens.

51
00:03:43.919 --> 00:03:47.520
So, we are above when caching should go into effect.

52
00:03:47.639 --> 00:03:51.720
So, right now, it has actually cached some of our tokens

53
00:03:51.839 --> 00:03:55.800
and they are stored on their servers.

54
00:03:56.919 --> 00:04:00.320
Their servers being Azure's in this case, but it's the same with OpenAI.

56
00:04:03.559 --> 00:04:06.559
We can see that we are past the 1,000 tokens.

57
00:04:06.559 --> 00:04:09.160
If we do a follow-up question to this,

58
00:04:09.279 --> 00:04:18.279
what characters are in Pride and Prejudice?

59
00:04:22.079 --> 00:04:24.079
We will get that back.

60
00:04:28.279 --> 00:04:31.079
And we don't really care about what it says here.

61
00:04:31.079 --> 00:04:37.079
We are interested in seeing that of the 9,800 tokens in total

62
00:04:37.200 --> 00:04:41.200
(a little more than before, because we added another question),

63
00:04:42.000 --> 00:04:45.600
9,600 were cached,

64
00:04:46.200 --> 00:04:51.799
meaning we paid only one-tenth of the price for those.

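NOTE
A rough sketch of the cost arithmetic behind the numbers quoted above,
assuming the $0.25 per million input tokens and the one-tenth cached rate
mentioned in the video. The helper name input_cost is hypothetical, and
output and reasoning tokens are left out on purpose.
```python
# Illustrative input-cost arithmetic using the figures quoted in the video.
INPUT_PRICE_PER_MILLION = 0.25  # USD per 1M uncached input tokens (from the video)
CACHED_FACTOR = 0.10            # cached input is billed at one-tenth (from the video)
def input_cost(total_tokens, cached_tokens):
    """Return the input cost in USD for one request (hypothetical helper)."""
    per_token = INPUT_PRICE_PER_MILLION / 1_000_000
    uncached_tokens = total_tokens - cached_tokens
    return uncached_tokens * per_token + cached_tokens * per_token * CACHED_FACTOR
without_cache = input_cost(9_800, 0)      # every input token at the full rate
with_cache = input_cost(9_800, 9_600)     # 9,600 of the 9,800 tokens were cached
print(f"no cache:   ${without_cache:.6f}")
print(f"with cache: ${with_cache:.6f}")
```
With these figures the cached request costs a bit under one-eighth of the
uncached one on the input side, since only 200 tokens pay the full rate.
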
65
00:04:52.600 --> 00:04:58.600
And that's because whenever the beginning of our prompt is exactly the same,

66
00:04:58.720 --> 00:05:04.720
and in this case, this text is exactly the same as we sent it the first time,

67
00:05:05.320 --> 00:05:11.320
it doesn't need to process each one of those 9,600 tokens again.

68
00:05:11.720 --> 00:05:13.720
But it still needed to process the rest.

69
00:05:14.320 --> 00:05:16.320
So, if we asked,

70
00:05:18.920 --> 00:05:22.920
who is Darcy?

71
00:05:29.519 --> 00:05:35.519
We will again see that our cache is quite high.

72
00:05:36.720 --> 00:05:42.720
So, again, we only really paid the full 0.25 rate on the difference between these two.

73
00:05:44.119 --> 00:05:47.320
So, this is really good, we don't need to do anything,

74
00:05:47.440 --> 00:05:53.440
but of course, for the small cases, we don't get any caching value,

75
00:05:53.839 --> 00:05:56.440
but that doesn't matter too much.

76
00:05:56.559 --> 00:05:58.559
So, this is good.

77
00:05:59.359 --> 00:06:03.359
One interesting thing is that if I restart this,

78
00:06:04.359 --> 00:06:09.359
we are in a new session, and if I say, what is this text about?

79
00:06:16.359 --> 00:06:21.359
What we will see is that it's actually still caching.

80
00:06:22.079 --> 00:06:29.079
So, the cache is per API key that we use.

81
00:06:29.480 --> 00:06:33.480
So, even between sessions, if we ask exactly the same thing,

82
00:06:33.799 --> 00:06:36.799
we will get the same cache hit.

83
00:06:37.359 --> 00:06:43.359
But if we go into the book and put in something different here,

84
00:06:46.079 --> 00:06:48.079
we lose our cache.

85
00:06:49.079 --> 00:06:52.079
So, if we do this, what is this text about?

86
00:06:54.200 --> 00:06:58.200
And because I changed the first character, it's not the same text anymore,

87
00:06:58.600 --> 00:07:05.600
the cache is missed, and we see that we end up using full input.

88
00:07:06.200 --> 00:07:11.200
So, what if I go down here and in the middle, just type something in there?

89
00:07:11.480 --> 00:07:14.480
Let's see how that affects our cache.

90
00:07:19.079 --> 00:07:26.079
Here we can see it could cache up to this point, so that was the 1,100 tokens.

91
00:07:27.079 --> 00:07:30.079
So, we only lose the cache from the point where something changes,

92
00:07:30.399 --> 00:07:35.399
but as long as the text before that is exactly the same, we get the cache.

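NOTE
A minimal sketch of the prefix matching described above, not OpenAI's
actual implementation. It assumes a 1,024-token minimum and cache hits in
128-token steps, which is my reading of OpenAI's prompt caching docs and
is not stated in the video. All names here are hypothetical, and the
integer lists merely stand in for real tokens.
```python
# Sketch: only the unchanged leading part of a prompt can be served from cache.
MIN_CACHEABLE = 1024  # assumed minimum prompt length before caching applies
INCREMENT = 128       # assumed granularity of cache hits beyond the minimum
def shared_prefix_len(previous, current):
    """Count the leading tokens that are identical in both prompts."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count
def cacheable_tokens(previous, current):
    """Estimate how many tokens of `current` could come from the cache."""
    shared = shared_prefix_len(previous, current)
    if shared < MIN_CACHEABLE:
        return 0
    return MIN_CACHEABLE + ((shared - MIN_CACHEABLE) // INCREMENT) * INCREMENT
book = list(range(10_000))                            # stand-in for the tokenized book
same_plus_question = book + [42]                      # same text, new question appended
changed_start = [-1] + book[1:]                       # first token edited
changed_middle = book[:5_000] + [-1] + book[5_001:]   # one token edited halfway
print(cacheable_tokens(book, same_plus_question))
print(cacheable_tokens(book, changed_start))
print(cacheable_tokens(book, changed_middle))
```
Editing the very first token leaves nothing to reuse, while an edit in the
middle still lets everything before it count as cached.
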
93
00:07:36.079 --> 00:07:39.079
If you go see OpenAI's documentation,

95
00:07:46.399 --> 00:07:52.399
they talk about the cache staying around for 5 to 10 minutes of inactivity,

96
00:07:53.399 --> 00:07:56.399
and no matter what, the cache will go away after an hour.

97
00:07:57.399 --> 00:08:02.399
So, of course, they need to keep that in memory on their servers,

98
00:08:03.399 --> 00:08:05.399
so cache is fleeting,

99
00:08:06.399 --> 00:08:12.399
but it will stay around long enough for a follow-up question like this.

100
00:08:13.399 --> 00:08:17.399
And everything is transparent, you don't need to write any specific code

101
00:08:17.920 --> 00:08:25.920
in order to hit the cache, so they're just giving you long text cheaper automatically.

102
00:08:26.920 --> 00:08:31.920
And it's roughly a tenth of the price every time we hit the cache,

103
00:08:31.920 --> 00:08:34.919
but, of course, we still pay more overall because we send more tokens.

104
00:08:35.919 --> 00:08:38.919
So, that is actually everything about how caching works.

105
00:08:38.919 --> 00:08:41.919
You don't need to be worried that you need to do anything,

106
00:08:42.440 --> 00:08:45.440
it just works out of the box, which is cool.

107
00:08:46.440 --> 00:08:49.440
So, that's everything, see you on the next one.

