WEBVTT

1
00:00:00.580 --> 00:00:03.480
Let's talk cost and how much it will

2
00:00:03.480 --> 00:00:05.920
cost you to follow along in this course,

3
00:00:06.120 --> 00:00:07.880
but also cost in general.

4
00:00:09.660 --> 00:00:13.140
The topic is really, really difficult to place

5
00:00:13.140 --> 00:00:15.460
in a course like this because it could

6
00:00:15.460 --> 00:00:19.320
either be at the front, but then we'll

7
00:00:19.320 --> 00:00:21.600
talk, in this case, I will talk about

8
00:00:21.600 --> 00:00:24.680
some concepts that haven't been explained yet, or

9
00:00:24.680 --> 00:00:27.160
it can be at the end, which I

10
00:00:27.160 --> 00:00:28.960
think is too late, so I'll put it

11
00:00:28.960 --> 00:00:29.480
up front.

12
00:00:30.580 --> 00:00:31.920
Bear with me, as there are a few

13
00:00:31.920 --> 00:00:34.820
things we haven't talked about yet in this,

14
00:00:36.900 --> 00:00:40.060
and they will be, of course, explained along

15
00:00:40.060 --> 00:00:41.820
the way, and I will give some high-level

16
00:00:41.820 --> 00:00:43.320
explanations in this video.

17
00:00:45.280 --> 00:00:47.380
The good thing about cost when it comes

18
00:00:47.380 --> 00:00:50.700
to AI in general is that it's a

19
00:00:50.700 --> 00:00:53.660
pay-for-what-you-use approach in all

20
00:00:53.660 --> 00:00:54.520
of the providers.

21
00:00:54.520 --> 00:00:57.440
So there's no idle fees, meaning that if

22
00:00:57.440 --> 00:01:00.360
I make an Azure account or an OpenAI

23
00:01:00.360 --> 00:01:03.080
account and don't use it for a month,

24
00:01:03.140 --> 00:01:06.220
it's not like it costs me anything compared

25
00:01:06.220 --> 00:01:08.100
to, for example, if you had an Azure

26
00:01:08.100 --> 00:01:08.880
app service.

27
00:01:09.160 --> 00:01:11.120
Whether you use it or not, you would

28
00:01:11.120 --> 00:01:13.520
pay because it's a constantly running system.

29
00:01:14.720 --> 00:01:18.020
So it won't cost you anything to have

30
00:01:18.020 --> 00:01:22.700
multiple accounts with different providers.

31
00:01:22.700 --> 00:01:26.080
I have, like, 10 different ones, and it doesn't

32
00:01:26.080 --> 00:01:27.960
cost me anything when I don't use them.

33
00:01:29.620 --> 00:01:33.240
However, most providers require you to charge

34
00:01:33.240 --> 00:01:37.160
up your account with around $5.

35
00:01:39.400 --> 00:01:42.440
Azure and Google are the exceptions, where you

36
00:01:42.440 --> 00:01:44.600
can pay as you go, and then you

37
00:01:44.600 --> 00:01:46.000
pay after the fact.

38
00:01:47.360 --> 00:01:50.360
But if you want an OpenAI account, you

39
00:01:50.360 --> 00:01:53.940
need to spend $5 up front, and then

40
00:01:53.940 --> 00:01:56.660
you can, of course, use those as you

41
00:01:56.660 --> 00:01:57.100
go along.

42
00:01:57.960 --> 00:02:00.460
And $5 would be plenty to follow this

43
00:02:00.460 --> 00:02:00.800
course.

44
00:02:03.120 --> 00:02:05.560
There are a few providers that offer free

45
00:02:05.560 --> 00:02:08.199
tiers with some heavy rate limits.

46
00:02:08.720 --> 00:02:10.880
Google has a free tier.

47
00:02:12.240 --> 00:02:14.320
GitHub models have a free tier.

48
00:02:14.320 --> 00:02:17.080
Hugging Face and Cohere have one as well.

49
00:02:17.440 --> 00:02:19.600
But with Cohere, for example, you can make one

50
00:02:19.600 --> 00:02:20.500
call per minute.

51
00:02:21.480 --> 00:02:23.820
And with Google, you can only use the

52
00:02:23.820 --> 00:02:26.860
very, very cheap models, not some of the

53
00:02:26.860 --> 00:02:28.560
more advanced models and so on.

54
00:02:31.560 --> 00:02:36.160
What you pay for is not the amount of

55
00:02:36.160 --> 00:02:38.800
time you spend on the LLM or anything.

56
00:02:38.880 --> 00:02:40.600
You pay in what are called tokens.

57
00:02:41.560 --> 00:02:43.300
Let's learn what those are.

58
00:02:44.720 --> 00:02:52.260
Tokens are a kind of translation from

59
00:02:52.260 --> 00:02:57.280
English to a language of tokens, which are

60
00:02:57.280 --> 00:03:01.420
common chunks of text, so that there is no

61
00:03:01.420 --> 00:03:04.820
need to know every single word in the

62
00:03:04.820 --> 00:03:05.600
English language.

63
00:03:06.120 --> 00:03:08.060
So you can see here I have typed

64
00:03:08.060 --> 00:03:08.540
"Hi,

65
00:03:08.640 --> 00:03:10.780
this is a sample on how many tokens

66
00:03:10.780 --> 00:03:11.940
there are in this sentence."

67
00:03:12.860 --> 00:03:20.400
And you can see that "this", "is", and "a" are each

68
00:03:20.400 --> 00:03:21.420
a token on their own.

69
00:03:22.620 --> 00:03:25.760
So I used very, very common words here.

70
00:03:26.020 --> 00:03:29.500
But had I used an uncommon word like

71
00:03:29.500 --> 00:03:30.160
platypus,

72
00:03:30.640 --> 00:03:34.940
that would have been broken up into multiple

73
00:03:34.940 --> 00:03:37.600
tokens because it won't have a specific token

74
00:03:37.600 --> 00:03:38.160
by itself.

75
00:03:38.160 --> 00:03:41.840
But "Hi" has one token specifically set to

76
00:03:41.840 --> 00:03:42.060
it.

77
00:03:42.800 --> 00:03:47.000
So these 66 characters are 15 tokens in

78
00:03:47.000 --> 00:03:47.320
total.

79
00:03:48.480 --> 00:03:51.540
And you roughly learn that you shouldn't really

80
00:03:51.540 --> 00:03:52.600
care about this too much.

81
00:03:52.700 --> 00:03:55.740
You know, more tokens, fewer tokens, and so

82
00:03:55.740 --> 00:03:55.880
on.

83
00:03:55.940 --> 00:03:57.800
15 tokens is not a lot.
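
NOTE
Editorial sketch, not from the recording: a toy Python tokenizer that illustrates why common words tend to be one token each while a rare word like "platypus" is split into several. The vocabulary HYPOTHETICAL_VOCAB and the function toy_tokenize are invented for illustration. Real providers use learned vocabularies (for example byte-pair encoding), so actual token counts will differ.
```python
# Toy illustration only: real tokenizers learn their vocabulary from data.
# HYPOTHETICAL_VOCAB is made up for this sketch and matches no real model.
HYPOTHETICAL_VOCAB = {"this", " is", " a", " sample", " plat", "yp", "us"}
def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for end in range(len(text), position, -1):
            if text[position:end] in HYPOTHETICAL_VOCAB:
                tokens.append(text[position:end])
                position = end
                break
        else:
            # No vocabulary entry starts here, so fall back to a single character.
            tokens.append(text[position])
            position += 1
    return tokens
common = toy_tokenize("this is a sample")
rare = toy_tokenize("this is a platypus")
print(len(common), common)
print(len(rare), rare)
```
With this made-up vocabulary, the common sentence splits into 4 tokens and the one ending in "platypus" into 6, because the rare word has no single entry of its own.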

84
00:04:00.700 --> 00:04:04.200
You pay in three different or four different

85
00:04:04.200 --> 00:04:04.680
ways.

86
00:04:05.580 --> 00:04:07.700
You pay for input tokens.

87
00:04:07.900 --> 00:04:09.100
That is your prompt.

88
00:04:09.100 --> 00:04:11.620
So if we say, what is the capital

89
00:04:11.620 --> 00:04:12.200
of France?

90
00:04:12.260 --> 00:04:15.140
We pay for the tokens of that sentence.

91
00:04:16.300 --> 00:04:19.120
There will be additional input tokens that you

92
00:04:19.120 --> 00:04:20.240
also pay for.

93
00:04:20.500 --> 00:04:25.300
Like whenever we introduce tools and structured output

94
00:04:25.300 --> 00:04:27.920
and stuff, there will be additional metadata that

95
00:04:27.920 --> 00:04:29.660
needs to be sent to the LLM.

96
00:04:30.300 --> 00:04:32.720
For example, when we say, what is the

97
00:04:32.720 --> 00:04:34.240
weather in Paris?

98
00:04:34.280 --> 00:04:35.420
And we have a weather tool.

99
00:04:36.280 --> 00:04:39.220
We will also pay for the schema that

100
00:04:39.220 --> 00:04:42.220
says how that tool is defined.

101
00:04:44.260 --> 00:04:46.480
So even if you say, hi, and you

102
00:04:46.480 --> 00:04:48.000
have a lot of tools, it will cost

103
00:04:48.000 --> 00:04:48.920
you more tokens.

104
00:04:49.660 --> 00:04:52.880
We will get into that when we talk

105
00:04:52.880 --> 00:04:53.320
about tools.

106
00:04:54.540 --> 00:04:57.420
Then there's something called cached tokens, which is

107
00:04:57.420 --> 00:04:58.940
reuse of tokens.

108
00:04:58.940 --> 00:05:02.580
Because whenever we begin to do conversations, we

109
00:05:02.580 --> 00:05:05.200
will ask a question and then we'll do

110
00:05:05.200 --> 00:05:05.940
a follow-up.

111
00:05:06.460 --> 00:05:09.300
And we don't need to pay full price for

112
00:05:09.300 --> 00:05:11.720
the first part of the conversation again, because

113
00:05:11.720 --> 00:05:14.120
that part has not changed.

114
00:05:15.640 --> 00:05:17.900
It only happens when you begin to use

115
00:05:17.900 --> 00:05:21.700
more tokens, like 1024.

116
00:05:22.080 --> 00:05:24.120
Before that, it will not try to cache

117
00:05:24.120 --> 00:05:24.520
tokens.

118
00:05:26.200 --> 00:05:28.220
But if you have some heavy use

119
00:05:28.220 --> 00:05:32.020
with a lot of follow-ups, most of your input

120
00:05:32.020 --> 00:05:34.440
will probably be cached tokens, and cached tokens

121
00:05:34.440 --> 00:05:35.880
are cheaper in general.

122
00:05:37.960 --> 00:05:40.080
Finally, we have the output tokens, which can

123
00:05:40.080 --> 00:05:41.620
be put into two categories.

124
00:05:42.260 --> 00:05:44.800
The first being what we get back.

125
00:05:45.020 --> 00:05:46.620
So when we ask, what is the capital

126
00:05:46.620 --> 00:05:48.600
of France?

127
00:05:48.660 --> 00:05:50.340
We will get that it's Paris.

128
00:05:51.000 --> 00:05:53.760
And we pay for the words in

129
00:05:53.760 --> 00:05:54.920
"the capital is Paris."

130
00:05:56.120 --> 00:06:02.980
If we use reasoning models, like GPT-5, we

131
00:06:02.980 --> 00:06:06.400
are also getting some tokens that are being

132
00:06:06.400 --> 00:06:08.900
used for the internal reasoning of the model.

133
00:06:09.080 --> 00:06:11.280
So it might think about what is the

134
00:06:11.280 --> 00:06:14.140
capital of France before it answers.

135
00:06:15.360 --> 00:06:18.260
Or if it's a very complex question, it

136
00:06:18.260 --> 00:06:21.820
will talk back and forth with itself and

137
00:06:21.820 --> 00:06:22.580
use those.

138
00:06:22.580 --> 00:06:25.460
You don't really get those tokens back, but

139
00:06:25.460 --> 00:06:27.260
you still pay for them, because they are

140
00:06:27.260 --> 00:06:30.660
still tokens being used by the LLM.

141
00:06:31.180 --> 00:06:33.600
These are called reasoning tokens, and they will

142
00:06:33.600 --> 00:06:35.200
be part of the output tokens.

143
00:06:35.300 --> 00:06:37.000
So you will see an output token count, for

144
00:06:37.000 --> 00:06:40.360
example, of 1,000 output tokens, and then you

145
00:06:40.360 --> 00:06:43.540
will know 300 of them were used for

146
00:06:43.540 --> 00:06:46.580
reasoning, while the remaining 700 were actually the

147
00:06:46.580 --> 00:06:47.700
text you received back.
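
NOTE
Editorial sketch, not from the recording: the 1,000 / 300 / 700 split described above, written out in Python. The class OutputUsage and its field names are hypothetical; real SDKs expose usage details under their own names. The only point is that reasoning tokens are counted inside the output total, so the visible text is the difference.
```python
from dataclasses import dataclass
@dataclass
class OutputUsage:
    """Hypothetical container; real SDKs name these fields differently."""
    output_tokens: int      # everything billed as output, reasoning included
    reasoning_tokens: int   # the part spent on internal reasoning
    @property
    def visible_tokens(self) -> int:
        # Tokens that actually came back as text.
        return self.output_tokens - self.reasoning_tokens
usage = OutputUsage(output_tokens=1000, reasoning_tokens=300)
print(usage.visible_tokens)
```
You are billed for output_tokens as a whole, even though only visible_tokens show up in the answer.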

148
00:06:51.640 --> 00:06:54.240
Let's see some examples, and also why this

149
00:06:54.240 --> 00:06:58.280
can be very difficult to understand, and you

150
00:06:58.280 --> 00:07:01.640
need to learn from experience how tokens work.

151
00:07:03.040 --> 00:07:05.080
So this is one of the examples in

152
00:07:05.080 --> 00:07:07.300
the sample repo, so you can go check

153
00:07:07.300 --> 00:07:08.760
out the code if you want to.

154
00:07:08.760 --> 00:07:17.140
But if we take a simple setup like

155
00:07:17.140 --> 00:07:19.380
our Hello World, and just ask a question,

156
00:07:19.440 --> 00:07:21.980
what is the capital of France, we get

157
00:07:21.980 --> 00:07:24.020
the output, of course, that the capital of

158
00:07:24.020 --> 00:07:25.940
France is Paris.

159
00:07:27.480 --> 00:07:29.740
And if we go to the sample repo,

160
00:07:32.390 --> 00:07:37.030
which we have here, we can go in

161
00:07:37.030 --> 00:07:40.530
and see that I'm just using the chat

162
00:07:40.530 --> 00:07:44.410
client, and then I'm taking out from the

163
00:07:44.410 --> 00:07:48.530
response, which is where we have our text

164
00:07:48.530 --> 00:07:50.970
back, the capital of France, we can also

165
00:07:50.970 --> 00:07:52.510
get a usage object.

166
00:07:53.290 --> 00:07:55.270
And if that usage object is different from

167
00:07:55.270 --> 00:07:58.450
zero, meaning the LLM we are working with

168
00:07:58.450 --> 00:08:02.410
supports reporting usage back, and most of them

169
00:08:02.410 --> 00:08:06.830
do now, we get our input tokens, we

170
00:08:06.830 --> 00:08:09.410
get our cached tokens, we get our output

171
00:08:09.410 --> 00:08:14.390
tokens, and how many of those output

172
00:08:14.390 --> 00:08:15.910
tokens were reasoning tokens.

173
00:08:18.370 --> 00:08:20.670
So that is what we are seeing here.

174
00:08:21.050 --> 00:08:23.290
We are not doing anything special yet.

175
00:08:23.550 --> 00:08:26.770
But something interesting happens, because we are

176
00:08:26.770 --> 00:08:29.790
using GPT-4.1 nano here, one

177
00:08:29.790 --> 00:08:31.650
of the cheapest models you can get.

178
00:08:32.850 --> 00:08:34.870
And it used some input tokens.

179
00:08:35.190 --> 00:08:36.250
What is the capital of France?

180
00:08:37.730 --> 00:08:39.630
There's no extra tools or anything.

181
00:08:39.750 --> 00:08:40.730
So this is the raw.

182
00:08:40.990 --> 00:08:42.870
"What is the capital of France?" is 14

183
00:08:42.870 --> 00:08:43.290
tokens.

184
00:08:44.270 --> 00:08:46.950
It used zero cached tokens, because it's

185
00:08:46.950 --> 00:08:49.490
too short a sentence to begin to cache, and

186
00:08:49.490 --> 00:08:50.330
there's no follow-up.

187
00:08:51.350 --> 00:08:55.170
And it used eight tokens for the output.

188
00:08:55.490 --> 00:08:57.130
The capital of France is Paris.

189
00:08:57.690 --> 00:08:59.990
And zero of them were reasoning, because it's

190
00:08:59.990 --> 00:09:01.250
not a reasoning model.

191
00:09:02.850 --> 00:09:06.030
It spent roughly one second doing this.

192
00:09:06.810 --> 00:09:09.030
In general, you will see that Azure is

193
00:09:09.030 --> 00:09:13.410
faster at answering back than OpenAI is.

194
00:09:14.050 --> 00:09:17.530
But it's all depending on what time of

195
00:09:17.530 --> 00:09:18.710
day you do it.

196
00:09:18.810 --> 00:09:20.510
If you do it in peak hours, it

197
00:09:20.510 --> 00:09:21.190
will be longer.

198
00:09:21.410 --> 00:09:23.310
If you do it in non-peak hours,

199
00:09:23.370 --> 00:09:24.270
it will be shorter.

200
00:09:25.710 --> 00:09:29.690
So you can't really look at the time

201
00:09:29.690 --> 00:09:32.030
and say, okay, every time I will ask,

202
00:09:32.130 --> 00:09:33.210
what is the capital of France?

203
00:09:33.370 --> 00:09:34.490
It will be like this.

204
00:09:34.810 --> 00:09:35.930
Sometimes it will be longer.

205
00:09:36.050 --> 00:09:37.110
Sometimes it will be shorter.

206
00:09:37.730 --> 00:09:40.530
But I did it with different models at

207
00:09:40.530 --> 00:09:42.550
the exact same time, so they can roughly

208
00:09:42.550 --> 00:09:43.090
be compared.

209
00:09:46.180 --> 00:09:51.060
We also have the same code, but now

210
00:09:51.060 --> 00:09:54.000
with another model, GPT-5 nano.

211
00:09:55.460 --> 00:09:58.500
So we get, in this case, it actually

212
00:09:58.500 --> 00:09:59.920
used one token less.

213
00:09:59.920 --> 00:10:02.260
It must have figured out that it wanted

214
00:10:02.260 --> 00:10:03.380
to do it in a different way.

215
00:10:03.940 --> 00:10:07.460
It shouldn't really happen, but that's the way

216
00:10:07.460 --> 00:10:07.800
it is.

217
00:10:10.000 --> 00:10:12.900
We see that it gives a shorter output,

218
00:10:13.360 --> 00:10:15.980
but it used more output tokens, and it's

219
00:10:15.980 --> 00:10:18.060
also using slightly more time.

220
00:10:19.640 --> 00:10:21.560
And the reason for that, we can see,

221
00:10:21.860 --> 00:10:24.700
is it's using more reasoning tokens.

222
00:10:25.800 --> 00:10:29.680
And this is because GPT-5 nano is

223
00:10:29.680 --> 00:10:34.180
a very, very, I won't call it unintelligent,

224
00:10:34.920 --> 00:10:38.060
but not the most intelligent of these models.

225
00:10:38.600 --> 00:10:40.600
So it needs to think, oh, this can

226
00:10:40.600 --> 00:10:41.940
be a serious question.

227
00:10:42.060 --> 00:10:44.300
I need to think how to give that

228
00:10:44.300 --> 00:10:44.780
answer.

229
00:10:46.220 --> 00:10:48.620
So it actually spent some tokens on this.

230
00:10:48.940 --> 00:10:51.400
So in this case, using this one over

231
00:10:51.400 --> 00:10:54.480
here, despite it actually having a higher cost per

232
00:10:54.480 --> 00:10:58.300
token, will cost less.

233
00:10:59.040 --> 00:11:01.300
And we can also see we get the answer back slightly

234
00:11:01.300 --> 00:11:02.400
faster.

235
00:11:04.160 --> 00:11:06.440
But then something interesting happened.

236
00:11:06.880 --> 00:11:09.760
If I go to the biggest and greatest

237
00:11:09.760 --> 00:11:15.060
model, GPT-5.2, as of this recording,

238
00:11:16.080 --> 00:11:19.560
we get exactly the same input, exactly the

239
00:11:19.560 --> 00:11:20.360
same output.

240
00:11:22.140 --> 00:11:24.720
And then we see that our output token

241
00:11:24.720 --> 00:11:27.320
count is only six tokens here, despite it being

242
00:11:27.320 --> 00:11:29.280
the model that can think for a long

243
00:11:29.280 --> 00:11:29.640
time.

244
00:11:30.760 --> 00:11:34.380
This is because GPT-5.2 (we could

245
00:11:34.380 --> 00:11:37.380
also have taken GPT-5) is a more

246
00:11:37.380 --> 00:11:40.240
intelligent model, so it knows that this question

247
00:11:40.240 --> 00:11:41.480
is not a hard question.

248
00:11:42.100 --> 00:11:44.020
So it doesn't need to think about it.

249
00:11:44.920 --> 00:11:47.740
This model has less world knowledge.

250
00:11:48.180 --> 00:11:49.860
This one has more world knowledge.

251
00:11:49.860 --> 00:11:54.560
So this one was just ready to answer the question

252
00:11:54.560 --> 00:11:57.360
back, while here it needed to think about

253
00:11:57.360 --> 00:11:58.980
where it had that in its storage.

254
00:12:00.680 --> 00:12:03.600
So for that reason, you cannot really just

255
00:12:03.600 --> 00:12:07.620
say, bigger model, more cost, in terms of

256
00:12:07.620 --> 00:12:10.500
output, because of things like this.

257
00:12:11.220 --> 00:12:13.900
But had you asked a very scientific question

258
00:12:13.900 --> 00:12:18.080
about some models of the universe and so on,

259
00:12:18.080 --> 00:12:21.720
this model would definitely have begun to use

260
00:12:21.720 --> 00:12:24.160
a lot of reasoning tokens and way more

261
00:12:24.160 --> 00:12:26.320
reasoning tokens than this one over here.

262
00:12:27.540 --> 00:12:29.920
We will talk about reasoning in general and

263
00:12:29.920 --> 00:12:31.560
how much reasoning and so on.

264
00:12:31.980 --> 00:12:35.040
But just to show you, it is very,

265
00:12:35.240 --> 00:12:37.800
very difficult to figure out.

266
00:12:38.300 --> 00:12:41.000
So here we actually use the biggest model

267
00:12:41.000 --> 00:12:43.260
and spend the least amount of tokens.

268
00:12:44.000 --> 00:12:47.480
Those tokens still cost more, which we will

269
00:12:47.480 --> 00:12:48.560
see in one second.

270
00:12:49.300 --> 00:12:54.080
But it is a balancing act on how

271
00:12:54.080 --> 00:12:59.080
cheap or how intelligent you need your

272
00:12:59.080 --> 00:12:59.460
model to be.

273
00:12:59.980 --> 00:13:01.980
This is a simple question.

274
00:13:02.740 --> 00:13:04.940
So we should definitely use a simple model.

275
00:13:05.440 --> 00:13:06.800
In this case, I would use this one

276
00:13:06.800 --> 00:13:09.920
over here, despite this one using slightly fewer tokens.

277
00:13:12.260 --> 00:13:15.740
While here, it is kind of overkill because

278
00:13:15.740 --> 00:13:17.720
it thinks, oh, this is a serious question.

279
00:13:17.840 --> 00:13:20.040
We need to think a lot about it.

280
00:13:20.240 --> 00:13:22.220
So that wouldn't fit well.

281
00:13:24.280 --> 00:13:27.200
While in the bigger models, this is way

282
00:13:27.200 --> 00:13:29.960
overkill for a question like the capital of

283
00:13:29.960 --> 00:13:30.300
France.

284
00:13:30.940 --> 00:13:34.160
It shows that by being good at answering

285
00:13:34.160 --> 00:13:34.420
it.

286
00:13:34.600 --> 00:13:38.000
But still, why spend the tokens for a

287
00:13:38.000 --> 00:13:39.080
higher price?

288
00:13:39.120 --> 00:13:43.640
Because this is like 50 times more expensive

289
00:13:43.640 --> 00:13:44.960
than this one is.

290
00:13:46.280 --> 00:13:47.680
But let's talk price.

291
00:13:50.880 --> 00:13:53.140
And when we talk about these, what is

292
00:13:53.140 --> 00:13:54.500
the capital of France and so on, we

293
00:13:54.500 --> 00:13:57.880
cannot really measure how much it costs because

294
00:13:57.880 --> 00:13:59.520
it's dirt cheap to do that.

295
00:14:00.100 --> 00:14:03.080
So what I've done is I've calculated, I've

296
00:14:03.080 --> 00:14:06.020
taken Hello World, as we saw in the

297
00:14:06.020 --> 00:14:06.860
previous section.

298
00:14:07.480 --> 00:14:09.660
And we asked, what is the capital of

299
00:14:09.660 --> 00:14:10.060
France?

300
00:14:10.640 --> 00:14:13.420
And beyond that, I've also asked how to

301
00:14:13.420 --> 00:14:14.060
make soup.

302
00:14:14.060 --> 00:14:17.380
How to make soup is a system or

303
00:14:17.380 --> 00:14:19.820
is a message that will give quite a

304
00:14:19.820 --> 00:14:21.460
long answer back.

305
00:14:23.060 --> 00:14:28.020
Because there's several steps involved in doing it.

306
00:14:28.800 --> 00:14:33.020
So if I took those questions, when I

307
00:14:33.020 --> 00:14:34.640
press F5, I run it once.

308
00:14:35.200 --> 00:14:38.580
If I press F5 and run it

309
00:14:38.580 --> 00:14:42.500
1,000 times, I would end up roughly spending

310
00:14:42.500 --> 00:14:49.500
52,000 input tokens and 375,000 output

311
00:14:49.500 --> 00:14:49.960
tokens.

312
00:14:50.560 --> 00:14:53.400
This number is higher because how to make

313
00:14:53.400 --> 00:14:56.580
soup gives a longer answer back.

314
00:14:56.680 --> 00:15:00.680
We can try Hello World from the last

315
00:15:00.680 --> 00:15:05.040
section here to say how to make soup

316
00:15:11.910 --> 00:15:17.130
and see that it's a much longer answer

317
00:15:17.130 --> 00:15:17.490
back.

318
00:15:20.890 --> 00:15:22.250
Put it over here.

319
00:15:25.090 --> 00:15:28.170
And right now I'm running it with GPT-5

320
00:15:28.170 --> 00:15:30.330
nano, so it really, really thinks about

321
00:15:30.330 --> 00:15:31.270
how to make soup.

322
00:15:32.350 --> 00:15:34.490
And you can see we got a longer

323
00:15:34.490 --> 00:15:36.790
answer back and we can see we spent

324
00:15:36.790 --> 00:15:40.850
a lot of extra output tokens even if

325
00:15:40.850 --> 00:15:41.910
we took away the reasoning.

326
00:15:42.290 --> 00:15:45.290
It's still 700 tokens compared to the 6

327
00:15:45.290 --> 00:15:47.570
and 8 we saw before and it used

328
00:15:47.570 --> 00:15:48.890
11 seconds to do it.

329
00:15:49.650 --> 00:15:52.450
So longer answers will also take longer to

330
00:15:52.450 --> 00:15:53.650
get back.

331
00:15:56.690 --> 00:16:03.650
So 1,000 times: 52,000 input tokens, 375,000

332
00:16:03.650 --> 00:16:04.570
output tokens.

333
00:16:07.070 --> 00:16:10.450
If we look at the different providers, there

334
00:16:10.450 --> 00:16:13.530
would be no upfront cost in doing this,

335
00:16:13.710 --> 00:16:16.670
these 1,000 times for Azure OpenAI, Microsoft

336
00:16:16.670 --> 00:16:17.230
Foundry.

337
00:16:18.210 --> 00:16:21.910
OpenAI would cost 5 US dollars upfront.

338
00:16:22.190 --> 00:16:24.110
Same with Anthropic.

339
00:16:24.790 --> 00:16:27.830
Google models, GitHub models would cost 0 upfront.

340
00:16:28.950 --> 00:16:31.830
Gemini would be no cost.

341
00:16:32.430 --> 00:16:33.950
xAI would cost 5 upfront.

342
00:16:34.250 --> 00:16:36.350
OpenRouter would cost 5 upfront.

343
00:16:36.810 --> 00:16:37.890
And Hugging Face would be 9.

344
00:16:40.070 --> 00:16:43.170
And we look at these 50,000 input

345
00:16:43.170 --> 00:16:45.370
tokens and think, oh, that might be a

346
00:16:45.370 --> 00:16:46.870
lot, or 375,000.

347
00:16:47.210 --> 00:16:52.330
No, it's not really, because these models are

348
00:16:52.330 --> 00:16:56.010
priced per 1 million tokens.

349
00:16:57.650 --> 00:17:00.030
So Azure OpenAI, even when we went to

350
00:17:00.030 --> 00:17:02.230
one of the big models like GPT-5,

351
00:17:02.930 --> 00:17:05.069
we would end up using...

352
00:17:07.069 --> 00:17:09.910
For every million input tokens, we would pay

353
00:17:09.910 --> 00:17:11.990
1.25 US dollars.

354
00:17:12.650 --> 00:17:15.310
And for every million output tokens, we would spend

355
00:17:15.310 --> 00:17:16.369
10 dollars.

356
00:17:17.470 --> 00:17:20.270
So that would be that input-wise, it

357
00:17:20.270 --> 00:17:24.450
would cost us 0.065 dollars.

358
00:17:25.150 --> 00:17:28.430
And output-wise, it would cost 3.75.

359
00:17:28.870 --> 00:17:30.890
And that's for running it 1,000 times

360
00:17:30.890 --> 00:17:32.430
with the big model.
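
NOTE
Editorial sketch, not from the recording: the arithmetic behind the 0.065 and 3.75 dollar figures, in Python. The function name cost_usd is hypothetical. The token counts (52,000 in, 375,000 out) and the per-million prices (1.25 and 10 dollars) are the ones quoted in this video and may not match current pricing.
```python
PER_MILLION = 1_000_000
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of a token count at a price quoted per one million tokens."""
    return tokens * price_per_million_usd / PER_MILLION
# 1,000 runs of the two sample questions, using the numbers quoted in the video.
input_cost = cost_usd(52_000, 1.25)
output_cost = cost_usd(375_000, 10.00)
total_cost = input_cost + output_cost
print(f"input ${input_cost:.3f}, output ${output_cost:.2f}, total ${total_cost:.3f}")
```
Output tokens dominate the bill here: they are both more numerous and priced higher per million than input tokens.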

361
00:17:33.590 --> 00:17:38.630
We go to OpenAI's pricing.

362
00:17:42.450 --> 00:17:43.710
And let's have a look.

363
00:17:45.090 --> 00:17:47.390
We have 5.2, which costs a little

364
00:17:47.390 --> 00:17:52.770
more, 1.75. But if we look at,

365
00:17:52.870 --> 00:17:56.670
for example, GPT-5 nano here, we are

366
00:17:56.670 --> 00:18:00.090
down to 0.2 instead of 1.25

367
00:18:00.090 --> 00:18:01.370
per million tokens.

368
00:18:01.370 --> 00:18:05.210
And output, not 10 dollars per million, but

369
00:18:05.210 --> 00:18:09.030
0.8. The rough estimate is when you

370
00:18:09.030 --> 00:18:13.010
go down a notch, it's seven times cheaper,

371
00:18:13.250 --> 00:18:15.310
and then seven times cheaper again.

372
00:18:15.390 --> 00:18:17.510
So it would be like 50 times cheaper

373
00:18:17.510 --> 00:18:21.170
to run this 1,000 times with what

374
00:18:21.170 --> 00:18:22.370
we have set up right now.

375
00:18:23.170 --> 00:18:26.370
So take this number, divide it by 50,

376
00:18:27.470 --> 00:18:29.530
and you would roughly have how much

377
00:18:29.530 --> 00:18:31.870
you would pay for this, even if you ran

378
00:18:31.870 --> 00:18:34.630
my samples 1,000 times.

379
00:18:36.430 --> 00:18:38.930
If we look at Foundry, it's the same

380
00:18:38.930 --> 00:18:39.430
pricing.

381
00:18:39.950 --> 00:18:45.170
There could be some measurable storage cost, OpenAI

382
00:18:45.170 --> 00:18:46.550
being exactly the same.

383
00:18:48.010 --> 00:18:50.470
So Azure OpenAI and OpenAI cost exactly the

384
00:18:50.470 --> 00:18:50.710
same.

385
00:18:50.930 --> 00:18:54.210
They have a price agreement that they need

386
00:18:54.210 --> 00:18:54.970
to cost the same.

387
00:18:54.970 --> 00:18:59.690
OpenAI has a premium version where you get

388
00:18:59.690 --> 00:19:04.950
faster response times, but the costs are doubled.

389
00:19:05.930 --> 00:19:08.250
If you went to Anthropic, it would cost

390
00:19:08.250 --> 00:19:08.750
you more.

391
00:19:09.770 --> 00:19:14.810
They are roughly two times more expensive at

392
00:19:14.810 --> 00:19:15.450
most things.

393
00:19:16.530 --> 00:19:19.350
GitHub models could be free, as mentioned.

394
00:19:20.370 --> 00:19:23.070
Gemini costs roughly the same as Azure and

395
00:19:23.070 --> 00:19:23.490
OpenAI.

396
00:19:26.290 --> 00:19:28.850
Grok costs a little more, so they are

397
00:19:28.850 --> 00:19:30.070
on the level of Anthropic.

398
00:19:31.950 --> 00:19:34.670
And things like OpenRouter, where you have multiple

399
00:19:34.670 --> 00:19:38.830
different models, vary, but are roughly the same

400
00:19:38.830 --> 00:19:41.050
prices as the real providers.

401
00:19:42.690 --> 00:19:44.830
So as you can see, even if you

402
00:19:44.830 --> 00:19:47.190
ran it 1,000 times, you wouldn't even

403
00:19:47.190 --> 00:19:50.050
spend $5, and that would be on the

404
00:19:50.050 --> 00:19:50.590
big models.

405
00:19:50.590 --> 00:19:53.910
So all these numbers should be down to

406
00:19:53.910 --> 00:19:57.150
50 times less if you use the nano

407
00:19:57.150 --> 00:19:57.570
models.

408
00:19:59.390 --> 00:20:03.670
And that's actually everything, except we just have

409
00:20:03.670 --> 00:20:05.350
one extra thing.

410
00:20:05.450 --> 00:20:09.250
There are a few costs that don't go

411
00:20:09.250 --> 00:20:10.030
into tokens.

412
00:20:10.850 --> 00:20:13.510
So things like the use of hosted tools.

413
00:20:13.930 --> 00:20:15.950
We haven't talked about hosted tools yet, but

414
00:20:15.950 --> 00:20:18.570
you know them from ChatGPT that it can

415
00:20:18.570 --> 00:20:20.450
do a web search or it can do

416
00:20:20.450 --> 00:20:23.510
code interpreter, meaning it can generate graphs and

417
00:20:23.510 --> 00:20:24.350
stuff like that.

418
00:20:26.010 --> 00:20:29.610
They cost extra, so they are not billed

419
00:20:29.610 --> 00:20:30.330
in tokens.

420
00:20:30.810 --> 00:20:33.390
They are billed per use because, for example,

421
00:20:33.470 --> 00:20:36.250
a code interpreter will spin up a container

422
00:20:36.250 --> 00:20:39.310
in the cloud and execute your code.

423
00:20:41.170 --> 00:20:44.230
In reality, this is actually the highest cost

424
00:20:44.230 --> 00:20:46.930
I have when I use AI in general

425
00:20:46.930 --> 00:20:52.050
because the tokens are so cheap, at least

426
00:20:52.050 --> 00:20:54.570
if you stay with the small models.

427
00:20:54.890 --> 00:20:58.050
If you went to GPT-5.2 Pro,

428
00:20:58.370 --> 00:21:01.710
then you would pay lots and lots

429
00:21:01.710 --> 00:21:02.130
of money.

430
00:21:02.410 --> 00:21:03.610
We can see them up here.

431
00:21:04.650 --> 00:21:07.970
So GPT-5.2 Pro here, it costs

432
00:21:07.970 --> 00:21:12.990
you $21 per million input tokens compared to 1.25,

433
00:21:12.990 --> 00:21:18.430
and $168 per million output tokens.

434
00:21:18.570 --> 00:21:20.810
So if you go to Pro, you definitely

435
00:21:20.810 --> 00:21:25.870
pay, and you are probably over-engineering

436
00:21:25.870 --> 00:21:30.690
your system, but it can be done if

437
00:21:30.690 --> 00:21:34.630
your scenario requires it.

438
00:21:35.030 --> 00:21:37.330
In general, I would just say take the

439
00:21:37.330 --> 00:21:41.790
cheapest model possible, and only when it

440
00:21:41.790 --> 00:21:43.750
proves not to be a good enough model for you

441
00:21:43.750 --> 00:21:45.750
and you need to go up in price,

442
00:21:46.090 --> 00:21:46.830
do that.

443
00:21:47.490 --> 00:21:51.630
But beyond that, stay on the cheap models.

444
00:21:52.370 --> 00:21:55.930
But again, code interpreter, web search, cost you

445
00:21:55.930 --> 00:21:59.270
per thousand calls and so on.

446
00:21:59.750 --> 00:22:02.090
Images are the same and so on.

447
00:22:02.570 --> 00:22:05.730
If you use Microsoft Foundry, where you

448
00:22:05.730 --> 00:22:09.770
have your conversations and agents up in the

449
00:22:09.770 --> 00:22:11.470
cloud, there's a small storage fee.

450
00:22:11.630 --> 00:22:14.570
It's the normal Azure storage.

451
00:22:16.110 --> 00:22:19.190
If you use things like vector searches and

452
00:22:19.190 --> 00:22:21.410
so on, there will, of course, also be

453
00:22:21.410 --> 00:22:24.590
some cost involved in having those, but we

454
00:22:24.590 --> 00:22:26.670
will get into those as well.

455
00:22:28.690 --> 00:22:31.530
But the main thing in all this is

456
00:22:31.530 --> 00:22:35.690
just following this course will cost you pennies.

457
00:22:35.690 --> 00:22:38.290
It will probably be the $5 upfront if

458
00:22:38.290 --> 00:22:42.150
you don't want to use Azure OpenAI or

459
00:22:42.150 --> 00:22:45.730
roughly that amount as you go through it

460
00:22:45.730 --> 00:22:48.010
and run it a lot of times.

461
00:22:49.670 --> 00:22:51.550
So that is it for this one.
