WEBVTT

1
00:00:00.000 --> 00:00:06.800
Hi and welcome to this AI and CCR video and a little mini-series about reasoning.

2
00:00:06.800 --> 00:00:14.920
So LLMs can do reasoning, thinking, they call it different things. But what is it

3
00:00:14.920 --> 00:00:20.319
and how can we control it? So let's do a small intro and then there will be a

4
00:00:20.319 --> 00:00:28.120
separate video on how OpenAI do it, Google do it and Anthropic do it. But what

5
00:00:28.120 --> 00:00:34.880
is reasoning and why does it matter to us? Well reasoning was introduced into

6
00:00:34.880 --> 00:00:43.000
the o1 model back in September 2024. So OpenAI was the one who did it and it

7
00:00:43.000 --> 00:00:50.119
was sort of an evolution: at that moment in time people were getting a lot

8
00:00:50.119 --> 00:00:58.799
of hallucinations from AI, and someone began to say if you tell it to think step

9
00:00:58.799 --> 00:01:03.439
by step, you have probably heard that if you have been around for that, it thinks

10
00:01:03.439 --> 00:01:10.919
better. And this is kind of an evolution of that by moving such a

11
00:01:10.919 --> 00:01:18.120
think-step-by-step approach into the model itself. Because reasoning is a little like

12
00:01:18.120 --> 00:01:24.519
a human having an inner dialogue. They call it chain of thought in the

13
00:01:24.519 --> 00:01:30.919
technical terms. The idea is to think about a question before answering instead of just saying

14
00:01:30.919 --> 00:01:37.400
what comes to mind. And we know it from ourselves. If we are just asked a question and

15
00:01:37.400 --> 00:01:43.559
we immediately need to answer, or answer too hastily, we might say the wrong

16
00:01:43.559 --> 00:01:50.720
thing. If instead we are asked, let's say, what's the capital of France, we might

17
00:01:50.720 --> 00:01:56.839
know that by heart. But if we want to think a little more about it, we might

18
00:01:56.839 --> 00:02:05.120
come up with a better answer in the end. And that is what reasoning is all about.

19
00:02:05.199 --> 00:02:11.960
How much of this inner dialogue is shown to the user when it comes to, for

20
00:02:11.960 --> 00:02:19.880
example, ChatGPT and Google and so on, varies very much from model to model. Some of

21
00:02:19.880 --> 00:02:26.559
them show all of it. DeepSeek was very famous for doing that. Some of them only

22
00:02:26.559 --> 00:02:33.320
show a summary, and for some of them it's configurable in their APIs what they show.

23
00:02:33.360 --> 00:02:42.360
So reasoning is a good thing. If we think a little about what we answer, we

24
00:02:42.360 --> 00:02:50.360
often get a better answer. But there are some drawbacks. Sometimes reasoning is

25
00:02:50.360 --> 00:02:59.240
simply not needed. In this case, we just say hello, and if the LLM needs to talk

26
00:02:59.240 --> 00:03:04.320
about, think a lot about what hello means, what it is in cultures and so on

27
00:03:04.320 --> 00:03:12.679
and so forth, just saying hello, that can lead to overthinking. And overthinking is

28
00:03:12.679 --> 00:03:19.320
wasting the user's time and the developer's money. Because the reasoning

29
00:03:19.320 --> 00:03:26.880
an LLM can do counts toward the output tokens. It's still going back and

30
00:03:26.880 --> 00:03:32.839
forth like doing a chat loop where you have multiple conversations back and

31
00:03:32.839 --> 00:03:38.639
forth. Because this inner monologue happens, it's still predicting next

32
00:03:38.639 --> 00:03:43.479
tokens and someone needs to pay. So it's output tokens and those are the

33
00:03:43.479 --> 00:03:52.119
expensive ones. So we don't really want an LLM to think so much about just

34
00:03:52.119 --> 00:03:59.600
the question or message hello. While if we ask it a deep question or deeply

35
00:03:59.600 --> 00:04:04.720
technical question, yes, we want it to do that, because otherwise the answer that it would

36
00:04:04.720 --> 00:04:11.919
just give back quickly might be wrong. So it's a balancing act when to reason and

37
00:04:11.919 --> 00:04:17.239
when to not. Some models are slowly getting better and internally

38
00:04:17.239 --> 00:04:21.920
understanding when to think hard about something and when to not. That is called

39
00:04:21.920 --> 00:04:27.519
auto reasoning. Some models don't do it, some do it better, some do it worse

40
00:04:27.519 --> 00:04:34.160
and so on. But if we know upfront that we don't want it to think a lot about this

41
00:04:34.160 --> 00:04:41.440
because we know the kind of scenario we are in in our code, it's better to tell it

42
00:04:41.440 --> 00:04:49.600
upfront how much it should think. And we can control that. So in this mini-series

43
00:04:49.600 --> 00:04:55.040
we're gonna first dive into OpenAI in part one to see how you control it

44
00:04:55.040 --> 00:05:00.760
there. Then Google in part two and Anthropic in part three. And the reason

45
00:05:00.760 --> 00:05:06.440
why we need to check out and do it in different models is because this is not

46
00:05:06.440 --> 00:05:11.480
an industry standard yet and they do it in various different ways. Even Google

47
00:05:11.480 --> 00:05:18.000
does it in two different ways now, slowly moving toward the way OpenAI

48
00:05:18.000 --> 00:05:25.320
do it. And for that reason I'm breaking it down into three parts because it's

49
00:05:25.320 --> 00:05:32.000
also what is called breaking glass because Microsoft Agent Framework

50
00:05:32.000 --> 00:05:37.880
cannot really take a decision on how reasoning is done when not even those

51
00:05:37.880 --> 00:05:44.000
three big ones can agree among themselves. So it's a more low-level

52
00:05:44.160 --> 00:05:49.880
setting and it can be difficult for someone new to Microsoft Agent Framework

53
00:05:49.880 --> 00:05:56.239
to actually figure out how to even set this reasoning. And for that reason we

54
00:05:56.239 --> 00:06:04.600
will go into three separate videos on it after this. So see you in those.