WEBVTT

1
00:00:00.000 --> 00:00:07.400
Hi, and welcome to this AI and TCR video and the second part of the Reasoning Deep Dive mini-series.

2
00:00:07.400 --> 00:00:12.319
In this video, we're going to look at how OpenAI do its reasoning.

3
00:00:12.319 --> 00:00:20.760
If you haven't seen part one, I heartily encourage you to do so before seeing this video.

4
00:00:20.760 --> 00:00:31.420
So here we are inside the Visual Studio, and over in our sample repo, we have Thinking

5
00:00:31.420 --> 00:00:37.840
and Reasoning settings, and we are running the OpenAI settings here.

6
00:00:37.840 --> 00:00:46.560
So OpenAI, as mentioned in the first video, was the first one to actually have reasoning models.

7
00:00:46.560 --> 00:00:48.439
And the first one was the O1 model.

8
00:00:48.439 --> 00:00:54.639
You can still hit it in certain areas, but it's kind of a legacy model now.

9
00:00:54.639 --> 00:01:05.120
Then they made O3, there was never an O2 because of a legal trade name with the O2 company in the UK.

10
00:01:05.120 --> 00:01:10.239
So the next one was called O3, and the next one was called O4, there's various mini-models

11
00:01:10.239 --> 00:01:12.400
and pro-models of them.

12
00:01:12.400 --> 00:01:19.839
But what you really need in models right now is the ChatGPT5.

13
00:01:19.839 --> 00:01:30.120
So ChatGPT5 was the first model, so for example, the famous 4.0 is not a reasoning model.

14
00:01:30.120 --> 00:01:35.239
We should expect that all future models are reasoning models from OpenAI, because this

15
00:01:35.239 --> 00:01:42.319
is the future, but again, these videos are important because we need to control how much

16
00:01:42.319 --> 00:01:45.040
thinking is going on.

17
00:01:45.040 --> 00:01:50.440
So in all these samples, I'm going to ask the question, what is the capital of France

18
00:01:50.440 --> 00:01:59.639
and how many people live there, with an extra thing for the AI to understand is to answer

19
00:01:59.639 --> 00:02:02.160
back in max three words.

20
00:02:02.160 --> 00:02:07.639
I also do that so it's easier for you to see the output over here.

21
00:02:07.639 --> 00:02:16.039
But what we can do is, if we just take a baseline, and the baseline here, if we go

22
00:02:16.039 --> 00:02:24.440
down and have a look, we can see we're using a ChatGPT5 mini, and we're just asking the

23
00:02:24.440 --> 00:02:33.639
question, nothing changed other than just telling, hey, ChatGPT5 mini, answer this question.

24
00:02:33.639 --> 00:02:38.960
So this is what I call a baseline, and what many people don't know is that reasoning is

25
00:02:38.960 --> 00:02:44.520
on by default on these models, and it's by default medium.

26
00:02:44.520 --> 00:02:50.039
That might sound okay, but for a simple question like, what's the capital of France and how

27
00:02:50.039 --> 00:02:55.399
many people live there, it is probably a bit too much even for that.

28
00:02:55.399 --> 00:03:04.479
So if I just press F10 and let it run the code down here, which is just a normal run

29
00:03:04.479 --> 00:03:11.880
and then give some output, we will see that it takes quite a while.

30
00:03:16.679 --> 00:03:20.360
So it's still thinking, and it came back.

31
00:03:20.360 --> 00:03:27.759
And what we see is it wrote back that Paris is 2.1 million, not the important part in

32
00:03:27.759 --> 00:03:33.759
this video series, but more that we had an input token that's also not too much because

33
00:03:33.759 --> 00:03:44.559
it will always be the same input, but we used 528 output tokens, and of those, 512 of them

34
00:03:44.559 --> 00:03:48.080
were used for reasoning.

35
00:03:48.080 --> 00:03:55.119
So if we do the same, but now we're going to do reasoning equals minimal.

36
00:03:55.119 --> 00:04:02.720
And to do equal minimal, we need to do quite a lot of extra here in the raw format because

37
00:04:02.720 --> 00:04:08.320
of this not being a standard yet on how to do reasoning.

38
00:04:08.320 --> 00:04:17.399
So we need to, inside our agent, give the chat client options of going into the chat

39
00:04:17.399 --> 00:04:23.519
options and inside this doing something called raw representation factory.

40
00:04:23.519 --> 00:04:29.920
And this is an object directly from OpenAI that we need to set and set the reasoning

41
00:04:29.920 --> 00:04:34.279
effort to a number.

42
00:04:34.279 --> 00:04:41.079
And over time, they have made various different versions of this reasoning effort.

43
00:04:41.079 --> 00:04:48.679
There's actually two more now, one called none, and one called extra high.

44
00:04:48.679 --> 00:04:54.839
Some models support them, but for example, it's only the chativity 5.2 and higher that

45
00:04:54.839 --> 00:04:59.040
understand the very high, for example.

46
00:04:59.040 --> 00:05:05.760
But if we do all this extra step, and to me, this is very cumbersome, and I will also show

47
00:05:05.760 --> 00:05:14.000
how I, in real life, use something on top of agent framework in order to make this easier.

48
00:05:14.000 --> 00:05:23.640
But if we do this and let it run, so if I come up here again, write out the chat and

49
00:05:23.640 --> 00:05:30.880
let the code we just saw run, you'll see it comes back much, much faster because now

50
00:05:30.880 --> 00:05:40.079
we had minimal reasoning, gave back the same answer because it's more or less facts that it have.

51
00:05:40.079 --> 00:05:42.000
It doesn't need to reason on it.

52
00:05:42.000 --> 00:05:49.239
So imagine you're the user being up here versus the user being down here when it comes to

53
00:05:49.239 --> 00:05:54.000
such a simple question like this.

54
00:05:54.000 --> 00:06:00.079
You would much rather be the user down here because you got a fast answer, and at the

55
00:06:00.079 --> 00:06:08.359
same time, you much want to be a developer down here because it costs much less tokens.

56
00:06:08.359 --> 00:06:14.679
So controlling reasoning for minimal like this is a good thing, but if it had been a

57
00:06:14.679 --> 00:06:21.000
very, very advanced question, of course, we might end up having problems here because

58
00:06:21.000 --> 00:06:28.559
it couldn't follow the rules or didn't think enough about it.

59
00:06:28.559 --> 00:06:34.519
If we go on the other hand and go much higher because what we have also seen so far has

60
00:06:34.519 --> 00:06:42.959
only just been the chat client, and when we saw down here, we can only really set one

61
00:06:42.959 --> 00:06:47.000
thing about this when it comes to reasoning.

62
00:06:47.000 --> 00:06:54.239
There's nothing more that is about reasoning in the chat client.

63
00:06:54.239 --> 00:07:00.440
But if we go up here and begin to use the responses API, which does it in a slightly

64
00:07:00.440 --> 00:07:02.040
different manner as well.

65
00:07:02.040 --> 00:07:08.760
Here it's a create response, while here it's a chat completion options, making it difficult

66
00:07:08.760 --> 00:07:13.079
for you to understand and remember in day-to-day.

67
00:07:13.079 --> 00:07:18.040
Whenever I need to set these things, I always need to go back to a previous sample, hence

68
00:07:18.040 --> 00:07:23.119
the reason why I've made something better that I will show in one second.

69
00:07:23.119 --> 00:07:29.440
But this is the raw way of doing it, and we can now set something called a reasoning options.

70
00:07:29.440 --> 00:07:38.019
And in that, we can set again an effort level from high, extra high, and so on, but we can

71
00:07:38.019 --> 00:07:46.820
also set a reasoning summon verbosity, because whenever something reasons, we might want

72
00:07:46.820 --> 00:07:49.339
to see what did it reason about?

73
00:07:49.339 --> 00:07:52.500
Why did it come to that conclusion?

74
00:07:52.500 --> 00:08:00.899
And in the chat client part here, that is not an option, while if we use the responses

75
00:08:00.899 --> 00:08:03.779
API, it is an option.

76
00:08:03.779 --> 00:08:16.179
In general, OpenAI is a big mess of different settings with different options, and when you begin to...

77
00:08:16.179 --> 00:08:24.500
When we see in video two and three on Google and Entropic, we will see that this is kind

78
00:08:24.500 --> 00:08:29.299
of a big patchwork of different things as they came along.

79
00:08:29.299 --> 00:08:36.619
They will probably streamline down the road, but here we need to go in and actually go

80
00:08:36.619 --> 00:08:43.780
between auto, concise, and detailed in how much we want to see back.

81
00:08:43.780 --> 00:08:54.099
And we can see it's a summary, so it will actually think more inside, but we only get

82
00:08:54.099 --> 00:08:58.940
the summaries of what it thought about.

83
00:08:59.580 --> 00:09:06.859
If we go into this example, in this case, I've set the reasoning to high to see how

84
00:09:06.859 --> 00:09:11.260
much it thinks, because, again, if you set reasoning to minimal, there will not be a

85
00:09:11.260 --> 00:09:19.179
lot of reasoning summary back, while if we set it to high, there will be a lot of reasoning summary back.

86
00:09:19.179 --> 00:09:26.419
So if I let this run, again, it will think a lot, and I'm even using the ChatGP 5 Mini.

87
00:09:26.419 --> 00:09:33.979
If we use the ChatGP 5, the time we would now sit and wait would be even longer.

88
00:09:33.979 --> 00:09:39.500
And again, high is higher than the default, which was medium, so this will be the longest

89
00:09:39.500 --> 00:09:44.659
time we wait here.

90
00:09:44.659 --> 00:09:48.780
So behind the scenes right now, it's doing the internal monologue, and we can see now

91
00:09:48.780 --> 00:09:54.859
that it come back, that this is determining the response form.

92
00:09:54.859 --> 00:10:04.659
User wants to have a response of maximum three words, and a lot of things it now thinking about.

93
00:10:04.659 --> 00:10:06.859
So we are getting this data back.

94
00:10:06.859 --> 00:10:15.500
I didn't show it, but down here, when we ask the question, we can, inside the messages,

95
00:10:15.500 --> 00:10:21.580
find the contents that are subtype text reasoning content, and then we can write this out, and

96
00:10:21.580 --> 00:10:23.219
this is what we're seeing right now.

97
00:10:23.219 --> 00:10:30.419
So this is not what the user will see, but you could optionally show this or log it or

98
00:10:30.419 --> 00:10:42.940
whatever you wish to do with it in order to understand why did it end up saying Paris 2.1 million.

99
00:10:42.940 --> 00:10:52.340
And we can see, because of reasoning, we now ended up with 1,472 of the tokens were used for reasoning.

100
00:10:52.340 --> 00:11:01.940
So it took a long time to answer this, and that's because this is a long reasoning effort.

101
00:11:01.940 --> 00:11:08.859
You ask the question, and had it been a human, that human would have sat for perhaps 30 seconds,

102
00:11:08.859 --> 00:11:13.739
a minute, before they even answered your question.

103
00:11:13.739 --> 00:11:22.500
But again, if it's more complex, this is really the way for the LLM to actually get the answer

104
00:11:22.500 --> 00:11:29.419
right, at the cost of more tokens and more time.

105
00:11:29.419 --> 00:11:37.900
Let's move on, because I really, really honestly hate code like this.

106
00:11:37.940 --> 00:11:39.419
It's difficult to remember.

107
00:11:39.419 --> 00:11:42.140
It's annoying to look at.

108
00:11:42.140 --> 00:11:52.820
And in real life, when I sit and do these things, I think, yeah, I should use the ChattyPT5

109
00:11:52.820 --> 00:11:56.580
for its intelligence, but I don't need it to reason.

110
00:11:56.580 --> 00:12:02.900
So I want to make it quickly go down to minimal, but that's not easy, because I need to sit

111
00:12:02.900 --> 00:12:10.659
and copy-paste this along, and I make helper methods to do it, and so on.

112
00:12:10.659 --> 00:12:18.359
So I have created what is called the Agent Framework Toolkit.

113
00:12:18.359 --> 00:12:28.619
So what we're about to see here is exactly the same as we saw with the raw here, going minimal.

114
00:12:28.619 --> 00:12:33.900
And the Agent Framework Toolkit does it in a much more concise way, in that we just create

115
00:12:33.900 --> 00:12:37.739
an agent factory, and then we get an agent.

116
00:12:37.739 --> 00:12:46.419
And among the features, just like models and tools and such, is just the reasoning effort directly here.

117
00:12:46.419 --> 00:12:53.859
So this does exactly the same, just in a much more concise manner.

118
00:12:53.859 --> 00:12:59.859
So if we do this, we can see it's fast, not because it's Agent Framework, but because

119
00:12:59.859 --> 00:13:01.900
we chose minimal.

120
00:13:01.900 --> 00:13:08.820
And we get back that it's 1.2 million and a low token count.

121
00:13:08.820 --> 00:13:14.820
And in the same manner, if we want to use Agent Framework Toolkit for responses part,

122
00:13:14.820 --> 00:13:21.059
the only thing we need to tell is it's a client-type of responses and the two settings.

123
00:13:21.059 --> 00:13:29.020
So again, these few lines of code versus all these extra things you need to set in

124
00:13:29.020 --> 00:13:31.659
the raw framework.

125
00:13:31.659 --> 00:13:40.340
So if we do that, again, it will be slow to do it, because we are now doing the reasoning,

126
00:13:40.340 --> 00:13:48.700
and we should roughly end up with the same as we saw in the raw responses API.

127
00:13:48.700 --> 00:13:54.020
What you will also see down here is that when we have reasoning, I have a handy method

128
00:13:54.020 --> 00:13:58.659
just directly on the response to say, get reasoning content.

129
00:13:58.659 --> 00:14:04.140
So you don't need to write all this part as well.

130
00:14:04.140 --> 00:14:09.299
This is just tucked away in an extension method.

131
00:14:09.299 --> 00:14:17.539
So if we get back, it will just give us the reasoning, and we can print it out to the

132
00:14:17.539 --> 00:14:21.940
user, save it, and so on.

133
00:14:21.940 --> 00:14:28.859
So Agent Framework Toolkit does not make reasoning better or faster or anything, but it just

134
00:14:28.859 --> 00:14:32.739
makes your part of the code much more simple to look at.

135
00:14:32.739 --> 00:14:47.580
So if we look at this part of the code versus the raw version with responses API, we can

136
00:14:47.580 --> 00:14:55.599
see it can't even fit on half the screen here.

137
00:14:55.599 --> 00:15:00.099
So it's just a more convenient way to do it.

138
00:15:00.099 --> 00:15:02.940
But that's not the important part here.

139
00:15:02.940 --> 00:15:09.780
If you do it one way or the other, the most important thing is go in for the love of God

140
00:15:09.780 --> 00:15:17.859
and help your users not get slow answers just because you've gone to the new model.

141
00:15:17.859 --> 00:15:25.099
And if you know up front you don't need reasoning, you can go to a chat.gtkform1 model, which

142
00:15:25.099 --> 00:15:32.700
is not a reasoning model, but when it's easy, I tend to stay with five on everything and

143
00:15:32.700 --> 00:15:36.940
then controlling the reasoning on the fly.

144
00:15:36.940 --> 00:15:38.580
So that's everything for OpenAI.

145
00:15:38.580 --> 00:15:41.380
In the next one, we're going to look at Google.

