WEBVTT

1
00:00:00.670 --> 00:00:04.190
Congratulations on getting through three of the four

2
00:00:04.190 --> 00:00:06.730
major concepts of LLM.

3
00:00:07.470 --> 00:00:11.530
So we have gone through chat, tools, and

4
00:00:11.530 --> 00:00:12.350
structured output.

5
00:00:13.250 --> 00:00:16.170
We're still missing RAG, but before we go

6
00:00:16.170 --> 00:00:18.610
there, I want to do a little intermission

7
00:00:19.270 --> 00:00:22.210
of the life of an LLM call, meaning

8
00:00:22.210 --> 00:00:25.570
what actually goes on when you call the

9
00:00:25.570 --> 00:00:28.510
LLM, what requests and what response are going

10
00:00:28.510 --> 00:00:28.850
on.

11
00:00:30.290 --> 00:00:33.810
The reason why I've waited for this moment

12
00:00:33.810 --> 00:00:37.270
is that we need to know about tools

13
00:00:37.270 --> 00:00:40.130
and structured output in order to understand this

14
00:00:40.130 --> 00:00:40.510
lecture.

15
00:00:41.610 --> 00:00:44.690
And we now have that, so now is

16
00:00:44.690 --> 00:00:45.830
the time to do this.

17
00:00:47.270 --> 00:00:50.950
I will do this through slides here, but

18
00:00:50.950 --> 00:00:53.050
the code is also in the sample repo.

19
00:00:55.070 --> 00:00:58.910
So what we see here is that we

20
00:00:58.910 --> 00:01:01.670
have a normal system and there should be

21
00:01:01.670 --> 00:01:05.410
nothing new except this little line that enables

22
00:01:05.410 --> 00:01:07.730
us to get the raw calls back and

23
00:01:07.730 --> 00:01:07.930
forth.

24
00:01:08.270 --> 00:01:10.210
There will be a bonus video just after

25
00:01:10.210 --> 00:01:13.130
this explaining a bit more what's going on,

26
00:01:13.150 --> 00:01:15.890
because there's a bit more before this and

27
00:01:15.890 --> 00:01:17.550
after this in order to do this.

28
00:01:18.530 --> 00:01:22.350
But for us, it's just a normal client

29
00:01:22.350 --> 00:01:23.510
and a normal agent.

30
00:01:23.910 --> 00:01:26.530
We are giving it one tool called getWeather,

31
00:01:26.990 --> 00:01:28.290
which is just down here.

32
00:01:29.670 --> 00:01:32.170
We tell it to speak like a pirate

33
00:01:32.170 --> 00:01:33.570
as an instruction.

34
00:01:34.590 --> 00:01:37.050
We make a message called, what is the

35
00:01:37.050 --> 00:01:38.110
weather like in Paris?

36
00:01:38.910 --> 00:01:41.110
And we want it back as a weather

37
00:01:41.110 --> 00:01:43.010
response, meaning structured output.

38
00:01:43.870 --> 00:01:46.330
So nothing new that we haven't learned over

39
00:01:46.330 --> 00:01:48.110
the last couple of sections.

40
00:01:51.090 --> 00:01:54.190
And this is what happens when we call

41
00:01:54.190 --> 00:01:56.810
this specific line of code.

42
00:01:58.410 --> 00:02:01.930
So we call this URL, and in my

43
00:02:01.930 --> 00:02:05.490
case, I had my resource called sensum365ai,

44
00:02:06.050 --> 00:02:07.990
but yours will be something else here.

45
00:02:08.449 --> 00:02:11.610
But beyond that, we're just calling OpenAI's deployments

46
00:02:11.610 --> 00:02:15.690
of chat-gpg5 in the chat system on

47
00:02:15.690 --> 00:02:17.650
completions with a version number.

48
00:02:19.610 --> 00:02:22.190
And then we sent the first thing we

49
00:02:22.190 --> 00:02:22.430
sent.

50
00:02:22.590 --> 00:02:24.750
In general, you should look like these three

51
00:02:24.750 --> 00:02:27.630
sections are one long JSON file.

52
00:02:28.330 --> 00:02:30.990
I just put it up in three columns

53
00:02:30.990 --> 00:02:33.430
here, so it's easier to look at.

54
00:02:34.770 --> 00:02:39.450
But the message, we sent two messages and

55
00:02:39.450 --> 00:02:40.290
not just one.

56
00:02:40.690 --> 00:02:42.390
And that's because, of course, we have our

57
00:02:42.390 --> 00:02:45.090
instructions, which is the system message.

58
00:02:45.590 --> 00:02:47.770
So this is turned into one message.

59
00:02:48.070 --> 00:02:49.670
This is turned into another message.

60
00:02:50.270 --> 00:02:52.190
And that's why we have two messages here,

61
00:02:52.390 --> 00:02:54.650
one in the role of system and one

62
00:02:54.650 --> 00:02:55.690
in the role of user.

63
00:02:57.290 --> 00:03:00.990
We then send the gpt5 as the model,

64
00:03:01.690 --> 00:03:04.890
and then we send this block of code.

65
00:03:05.350 --> 00:03:07.670
This block of code is the JSON schema

66
00:03:07.670 --> 00:03:12.290
for our weather response, meaning this object over

67
00:03:12.290 --> 00:03:12.650
here.

68
00:03:13.750 --> 00:03:16.130
So we know that it's a weather response.

69
00:03:16.290 --> 00:03:19.410
We know that we have a city of

70
00:03:19.410 --> 00:03:23.190
type string, a condition of type string, a

71
00:03:23.190 --> 00:03:26.650
degrees of type integer, and a degrees Celsius

72
00:03:26.650 --> 00:03:27.710
of type integer.

73
00:03:28.210 --> 00:03:30.710
We know that these four of them are

74
00:03:30.710 --> 00:03:33.650
required, which is what we see over here.

75
00:03:34.450 --> 00:03:37.370
And then a bit more like additional properties,

76
00:03:37.650 --> 00:03:39.350
which we have none and so on.

77
00:03:40.550 --> 00:03:43.210
So this is also being sent along with

78
00:03:43.210 --> 00:03:43.690
the messages.

79
00:03:44.570 --> 00:03:47.530
And then the final thing we send is

80
00:03:47.530 --> 00:03:51.430
our tool, which is down here, and turned

81
00:03:51.430 --> 00:03:52.330
into a schema.

82
00:03:53.450 --> 00:03:54.910
That is a function.

83
00:03:55.290 --> 00:03:57.890
The function has no description, because it didn't

84
00:03:57.890 --> 00:03:58.610
provide one.

85
00:03:59.090 --> 00:04:00.370
The name of getWeather.

86
00:04:00.630 --> 00:04:06.630
It had one parameter, which was called city,

87
00:04:07.090 --> 00:04:09.950
and that city was of type string.

88
00:04:13.690 --> 00:04:14.910
And that is what's being sent.

89
00:04:16.310 --> 00:04:19.089
And then you might think, oh, then it

90
00:04:19.089 --> 00:04:22.390
will tell us back it's sunny and 90

91
00:04:22.390 --> 00:04:22.770
degrees.

92
00:04:23.930 --> 00:04:25.530
But that's not what happens.

93
00:04:26.350 --> 00:04:28.890
What happens is it gives us back this

94
00:04:28.890 --> 00:04:29.470
response.

95
00:04:30.650 --> 00:04:33.710
And in this response, we are getting back

96
00:04:33.710 --> 00:04:36.470
a finishing reason of tool calls.

97
00:04:37.270 --> 00:04:40.250
And that means it says, hey, I want

98
00:04:40.250 --> 00:04:41.470
you to make a tool call.

99
00:04:42.330 --> 00:04:44.990
And in our case, it says, I want

100
00:04:44.990 --> 00:04:47.810
you to call getWeather, and I want you

101
00:04:47.810 --> 00:04:50.810
to give the parameter city the value of

102
00:04:50.810 --> 00:04:51.490
Paris.

103
00:04:53.730 --> 00:04:57.670
Beyond that, there's a few small things here

104
00:04:57.670 --> 00:04:59.810
that are mostly empty in our case.

105
00:04:59.990 --> 00:05:01.350
There's some creation date.

106
00:05:01.510 --> 00:05:02.550
There's some ID.

107
00:05:02.810 --> 00:05:07.890
There's what model and what its date is.

108
00:05:09.210 --> 00:05:13.830
It gives us back some information about hate

109
00:05:13.830 --> 00:05:16.730
speech, jailbreak, and so on, if any of

110
00:05:16.730 --> 00:05:18.930
this is violating that, which it's not.

111
00:05:19.710 --> 00:05:22.110
And then it's giving us back some information

112
00:05:22.110 --> 00:05:24.650
about how many tokens we used in order

113
00:05:24.650 --> 00:05:25.450
to do this.

114
00:05:28.350 --> 00:05:31.850
So we can see that it used 190

115
00:05:31.850 --> 00:05:36.390
tokens for input, meaning all we see here

116
00:05:36.390 --> 00:05:38.750
cost 190 tokens.

117
00:05:39.510 --> 00:05:42.530
We saw that it totally used an output

118
00:05:42.530 --> 00:05:50.930
of 536, and 212 of those were for

119
00:05:50.930 --> 00:05:53.570
reasoning, because it was a reasoning model.

120
00:05:54.850 --> 00:05:59.550
So the last token, the number between these

121
00:05:59.550 --> 00:06:03.290
two is this part, because this is the

122
00:06:03.290 --> 00:06:04.410
only thing it generated.

123
00:06:04.670 --> 00:06:08.010
The rest is just logging around our system.

124
00:06:10.230 --> 00:06:13.770
And this means that Agent Framework will receive

125
00:06:13.770 --> 00:06:17.330
this call and say, oh, that means the

126
00:06:17.330 --> 00:06:19.370
user will like to call this tool.

127
00:06:19.690 --> 00:06:22.790
So they invoke this tool on our behalf,

128
00:06:23.710 --> 00:06:26.730
send in Paris as the city, and you

129
00:06:26.730 --> 00:06:28.090
can see I set a breakpoint here.

130
00:06:28.650 --> 00:06:32.290
And now we could expect that the LM

131
00:06:32.290 --> 00:06:35.110
sits and waits for us, and there could

132
00:06:35.110 --> 00:06:37.610
be some time out or something, and that's

133
00:06:37.610 --> 00:06:38.730
absolutely not true.

134
00:06:39.670 --> 00:06:43.290
The LM doesn't care at all what we

135
00:06:43.290 --> 00:06:43.850
do now.

136
00:06:44.570 --> 00:06:46.350
It doesn't wait for us.

137
00:06:46.510 --> 00:06:50.930
It doesn't think it's missing to do something.

138
00:06:51.370 --> 00:06:54.690
It, in its mind, it's complete, because it

139
00:06:54.690 --> 00:06:57.230
told you, hey, I want to call this

140
00:06:57.230 --> 00:06:57.510
tool.

141
00:06:58.630 --> 00:07:00.850
And now it's off serving other people.

142
00:07:01.010 --> 00:07:02.190
There's no connections.

143
00:07:02.530 --> 00:07:03.890
There's no persistent connection.

144
00:07:04.170 --> 00:07:07.670
There's nothing between us and the system right

145
00:07:07.670 --> 00:07:13.590
now, because what happens is when we call

146
00:07:13.590 --> 00:07:16.430
again, we send the entire message once again.

147
00:07:18.550 --> 00:07:22.890
And technically, it's an entire new thing for

148
00:07:22.890 --> 00:07:25.210
the LM, and it doesn't know that we

149
00:07:25.210 --> 00:07:26.030
called it before.

150
00:07:27.570 --> 00:07:30.970
So we sent the same information, speak like

151
00:07:30.970 --> 00:07:34.110
a pirate, user, and stuff like that, but

152
00:07:34.110 --> 00:07:38.170
we're sending this extra part in that, hey,

153
00:07:38.290 --> 00:07:41.370
you told us to call this tool, and

154
00:07:41.370 --> 00:07:42.990
here's the result of that tool.

155
00:07:43.410 --> 00:07:44.690
It's sunny and 90 degrees.

156
00:07:46.370 --> 00:07:49.570
Beyond that, we sent the same two starting

157
00:07:49.570 --> 00:07:49.970
messages.

158
00:07:50.510 --> 00:07:51.950
We sent the model again.

159
00:07:52.270 --> 00:07:55.910
We sent the response format, which we not

160
00:07:55.910 --> 00:07:57.150
used in the first one.

161
00:07:57.290 --> 00:07:58.270
So why did we send it?

162
00:07:58.750 --> 00:08:01.370
Well, it could have been that the question

163
00:08:01.370 --> 00:08:03.070
would just be hello, and then we would

164
00:08:03.070 --> 00:08:05.430
need to use it, because there was no

165
00:08:05.430 --> 00:08:05.970
tool call.

166
00:08:08.270 --> 00:08:12.930
We get the same prompt thing about self

167
00:08:12.930 --> 00:08:18.250
-aid and self-harm and so on, and

168
00:08:21.050 --> 00:08:25.050
when we sent the second request, we sent

169
00:08:25.050 --> 00:08:28.530
the response format exactly in the same structure

170
00:08:28.530 --> 00:08:29.130
as before.

171
00:08:31.130 --> 00:08:33.490
We also sent a tool again, because it

172
00:08:33.490 --> 00:08:35.309
could be that it needed to call the

173
00:08:35.309 --> 00:08:36.590
tool again.

174
00:08:36.970 --> 00:08:39.750
Had the message been, what is the weather

175
00:08:39.750 --> 00:08:43.410
like in Paris and Berlin, it would technically

176
00:08:43.410 --> 00:08:47.690
send one request, call the tool, send one

177
00:08:47.690 --> 00:08:51.170
request, call the tool, and first there, it

178
00:08:51.170 --> 00:08:52.330
would have all this data.

179
00:08:53.250 --> 00:08:54.690
So there's a lot of back and forward

180
00:08:54.690 --> 00:08:58.250
that we really don't see when it's abstracted

181
00:08:58.250 --> 00:09:01.550
away, and when we send this again, we

182
00:09:01.550 --> 00:09:02.930
pay for all these tokens again.

183
00:09:04.250 --> 00:09:06.710
So that is also the reason why the

184
00:09:06.710 --> 00:09:09.330
second we put on a tool or structured

185
00:09:09.330 --> 00:09:12.270
output, it will cost us more.

186
00:09:13.050 --> 00:09:16.930
It's, of course, incredibly powerful, so we shouldn't

187
00:09:16.930 --> 00:09:21.770
think that it's wrong to cost extra, but

188
00:09:21.770 --> 00:09:23.590
that's just the way it works.

189
00:09:24.530 --> 00:09:27.610
So this would call, had we known this

190
00:09:27.610 --> 00:09:29.870
up front, we could have made this call

191
00:09:29.870 --> 00:09:34.050
instead, and it would know, because everything is

192
00:09:34.050 --> 00:09:35.890
here that it needs to know in order

193
00:09:35.890 --> 00:09:39.590
to answer the question now, and the second

194
00:09:39.590 --> 00:09:42.890
response comes back, and in this case, it's

195
00:09:42.890 --> 00:09:45.650
a finishing reason of stop, meaning it's done.

196
00:09:46.930 --> 00:09:49.710
It has sent us what it thinks is

197
00:09:49.710 --> 00:09:53.810
our final result, meaning the city equals Paris,

198
00:09:53.910 --> 00:09:57.150
the condition equals sunny, the degrees Fahrenheit equals

199
00:09:57.150 --> 00:10:00.370
66, and degrees Celsius equals 19.

200
00:10:02.330 --> 00:10:06.570
It gives again back the IDs and stuff,

201
00:10:07.470 --> 00:10:11.590
the self-harm part and stuff, and the

202
00:10:11.590 --> 00:10:13.610
final tokens being used.

203
00:10:14.230 --> 00:10:17.370
So the completion token is small, because it's

204
00:10:17.370 --> 00:10:21.150
only what we get back here, and the

205
00:10:21.150 --> 00:10:23.490
prompt token is bigger than the first time.

206
00:10:23.570 --> 00:10:27.690
It was 190, now it's 227, meaning this

207
00:10:27.690 --> 00:10:31.470
costs the difference between 227 and 190.

208
00:10:35.650 --> 00:10:37.730
And then we're done.

209
00:10:38.110 --> 00:10:41.470
We can take this JSON and turn it

210
00:10:41.470 --> 00:10:45.730
into our object, and that's everything that is

211
00:10:45.730 --> 00:10:46.210
to it.

212
00:10:46.810 --> 00:10:52.250
It is completely request-response, request-response with

213
00:10:52.250 --> 00:10:55.070
a magic happening up with the next token

214
00:10:55.070 --> 00:11:00.910
prediction, but beyond that, it is just a

215
00:11:00.910 --> 00:11:03.210
stateless machine of running.

216
00:11:03.530 --> 00:11:06.310
It's not like it's sitting there waiting or

217
00:11:06.310 --> 00:11:09.550
having a hook into your machine, being able

218
00:11:09.550 --> 00:11:13.330
to call your tools or magically know how

219
00:11:13.330 --> 00:11:14.770
to run C-sharp up there.

220
00:11:15.170 --> 00:11:17.310
All the tools are run on your machine

221
00:11:17.310 --> 00:11:19.990
with your access and so on, and it's

222
00:11:19.990 --> 00:11:23.250
quite a brilliant system in order, in the

223
00:11:23.250 --> 00:11:27.770
way it works, and this is really helping

224
00:11:27.770 --> 00:11:30.950
a lot of people understand what is actually

225
00:11:30.950 --> 00:11:36.010
going on, why it is so incredible, magical,

226
00:11:36.270 --> 00:11:38.570
and at the same time so damn simple.

227
00:11:40.870 --> 00:11:43.850
So now we know the life of an

228
00:11:43.850 --> 00:11:46.810
Olicon, and we are ready to go on

229
00:11:46.810 --> 00:11:47.770
and tackle RAC.
