WEBVTT

1
00:00:01.800 --> 00:00:04.620
Hi, and welcome to another video in AI

2
00:00:04.620 --> 00:00:06.600
in C#, where we look into the

3
00:00:06.600 --> 00:00:07.860
Microsoft Agent Framework.

4
00:00:08.520 --> 00:00:10.480
In this video, we're going to look at

5
00:00:10.480 --> 00:00:13.420
how we can work with input data other

6
00:00:13.420 --> 00:00:14.060
than text.

7
00:00:16.660 --> 00:00:19.800
So here we are in Visual Studio, and

8
00:00:19.800 --> 00:00:22.400
we have seen a lot of examples so

9
00:00:22.400 --> 00:00:27.520
far where we have just given some chat

10
00:00:27.520 --> 00:00:30.840
message with some text, or just given the

11
00:00:30.840 --> 00:00:36.160
text directly here, same output.

12
00:00:37.080 --> 00:00:39.240
But what if we want to use other

13
00:00:39.240 --> 00:00:39.720
things?

14
00:00:40.480 --> 00:00:44.820
These LLMs, or most of them, are multimodal,

15
00:00:45.040 --> 00:00:47.560
meaning they are able to

16
00:00:47.560 --> 00:00:50.820
work with images and PDFs and stuff.

17
00:00:52.820 --> 00:00:56.180
In the example, I have put in some

18
00:00:56.180 --> 00:01:01.400
scenarios for text, image, and PDF, because that

19
00:01:01.400 --> 00:01:04.140
is more or less what they can work

20
00:01:04.140 --> 00:01:04.459
with.

21
00:01:07.130 --> 00:01:10.080
Because if you try to upload a .txt

22
00:01:10.080 --> 00:01:12.980
file or some kind of file that you

23
00:01:12.980 --> 00:01:17.620
have, they just come back and say they

24
00:01:17.620 --> 00:01:20.260
are not able to work with them, they

25
00:01:20.260 --> 00:01:22.240
are not able to process them.

26
00:01:23.320 --> 00:01:25.640
And of course, if it's a file or

27
00:01:25.640 --> 00:01:29.480
some kind of content that it doesn't understand,

28
00:01:29.920 --> 00:01:33.420
you can, of course, use normal RAG or

29
00:01:33.420 --> 00:01:39.640
Retrieval-Augmented Generation, and extract data

30
00:01:39.640 --> 00:01:41.380
out of a file, let's say you had

31
00:01:41.380 --> 00:01:45.000
some kind of technical specification in a format

32
00:01:45.000 --> 00:01:47.660
that is not common.

33
00:01:48.480 --> 00:01:50.760
You could, of course, take that data out

34
00:01:50.760 --> 00:01:53.800
and then just give it to the LLM

35
00:01:53.800 --> 00:01:54.500
as text.

36
00:01:55.900 --> 00:01:59.100
But there are these few scenarios, images and PDFs,

37
00:01:59.120 --> 00:02:01.120
that are supported.

38
00:02:02.820 --> 00:02:05.220
In one of the cases, PDF is only

39
00:02:05.220 --> 00:02:09.340
partly supported, but let's see how that works.

40
00:02:10.260 --> 00:02:14.400
And in order to run this sample, you

41
00:02:14.400 --> 00:02:18.780
need both Azure OpenAI and OpenAI, because OpenAI

42
00:02:18.780 --> 00:02:24.840
actually has some features that it supports that

43
00:02:24.840 --> 00:02:27.460
Azure OpenAI, for some reason, doesn't support.

44
00:02:29.870 --> 00:02:32.170
But let's have some breakpoints and let's start

45
00:02:32.170 --> 00:02:35.350
with the normal text, which is nothing special

46
00:02:35.350 --> 00:02:37.890
in the scenario.

47
00:02:42.420 --> 00:02:45.220
And we just used Azure OpenAI here, but

48
00:02:45.220 --> 00:02:48.000
both Azure and OpenAI, of course, can understand

49
00:02:48.000 --> 00:02:48.540
text.

50
00:02:49.720 --> 00:02:52.200
So what is the capital of France and

51
00:02:52.200 --> 00:02:56.480
a response back, that just says the capital

52
00:02:56.480 --> 00:02:57.840
of France is Paris.

53
00:02:58.680 --> 00:02:59.540
So nothing special there.

54
00:03:01.420 --> 00:03:03.740
Images are also supported.

55
00:03:03.740 --> 00:03:11.120
So let's run and see that whenever we

56
00:03:11.120 --> 00:03:14.760
use an image, we can either use an

57
00:03:14.760 --> 00:03:17.920
image that is online, meaning a URL to

58
00:03:17.920 --> 00:03:18.400
an image.

59
00:03:19.000 --> 00:03:20.380
So you can simply go in and say

60
00:03:20.380 --> 00:03:23.260
the chat message, instead of just writing the

61
00:03:23.260 --> 00:03:27.040
text, you can make a list of different

62
00:03:27.040 --> 00:03:28.460
things you want to send in.

63
00:03:29.200 --> 00:03:31.420
In our case here, we want to send

64
00:03:31.420 --> 00:03:34.700
in a text content and a URI content,

65
00:03:35.080 --> 00:03:39.480
because if we just send the image, it

66
00:03:39.480 --> 00:03:41.080
might not know what to do with it.

67
00:03:41.160 --> 00:03:43.740
But in this case, we want to ask

68
00:03:43.740 --> 00:03:45.880
what is in the image and the image

69
00:03:45.880 --> 00:03:49.480
is a picture of Settlers of Catan.

70
00:03:50.620 --> 00:03:55.540
I can show it here, so we know

71
00:03:55.540 --> 00:03:56.680
what's going on.

72
00:03:58.900 --> 00:04:02.540
Just bring it up on another screen and

73
00:04:02.540 --> 00:04:04.360
show that it's this image.

74
00:04:04.980 --> 00:04:06.860
So we're asking, what is in this image?

75
00:04:09.100 --> 00:04:15.420
And if we do that, it takes a little

76
00:04:15.420 --> 00:04:19.899
longer, of course, and let me jump over

77
00:04:19.899 --> 00:04:20.240
here.

78
00:04:20.820 --> 00:04:24.240
So it can look at the image and

79
00:04:24.240 --> 00:04:25.540
tell what's in it.

80
00:04:25.540 --> 00:04:28.100
And we could have asked what colours are

81
00:04:28.100 --> 00:04:31.380
used in the image, what numbers are used,

82
00:04:31.440 --> 00:04:31.940
and so on.

83
00:04:32.220 --> 00:04:37.900
So that's just image recognition and working with

84
00:04:37.900 --> 00:04:38.120
them.

85
00:04:38.420 --> 00:04:40.640
And it doesn't require any special model other

86
00:04:40.640 --> 00:04:42.880
than a multimodal one.

87
00:04:44.360 --> 00:04:47.040
And we can begin to see our input

88
00:04:47.040 --> 00:04:49.680
tokens are becoming bigger, because it needs to

89
00:04:49.680 --> 00:04:52.220
get the entire image and stuff.

90
00:04:53.780 --> 00:04:56.460
If we have local files, I have the

91
00:04:56.460 --> 00:04:59.140
same image, just as an image.jpg. And

92
00:04:59.140 --> 00:05:01.820
I made sure that it's only called image.jpg.

93
00:05:01.820 --> 00:05:05.500
So it's not like it's looking at

94
00:05:05.500 --> 00:05:07.840
the URL and figuring out it's Settlers of

95
00:05:07.840 --> 00:05:08.180
Catan.

96
00:05:08.300 --> 00:05:09.720
It's actually looking at the image.

97
00:05:11.760 --> 00:05:16.420
And images can be processed in two ways,

98
00:05:16.520 --> 00:05:19.780
either as Base64 with a data URI.

99
00:05:20.440 --> 00:05:23.040
And then we will just ask what is

100
00:05:23.040 --> 00:05:24.220
the content of the image.

101
00:05:24.600 --> 00:05:29.400
So we will see exactly the same response

102
00:05:29.400 --> 00:05:33.580
back, more or less, and taking the same

103
00:05:33.580 --> 00:05:36.300
amount of input tokens.

104
00:05:37.180 --> 00:05:39.520
And in the same manner, we can use

105
00:05:39.520 --> 00:05:43.760
read-only memory, which completes the three

106
00:05:43.760 --> 00:05:47.240
options for sending data to the system.

107
00:05:48.340 --> 00:05:51.400
So in these cases, it's data content instead

108
00:05:51.400 --> 00:05:53.020
of URI content.

109
00:05:55.700 --> 00:05:57.720
And we can see we get exactly the

110
00:05:57.720 --> 00:05:59.460
same message back.

111
00:06:00.360 --> 00:06:01.820
So nothing special there.

112
00:06:04.180 --> 00:06:07.360
And these work both for OpenAI and

113
00:06:07.360 --> 00:06:08.280
Azure OpenAI.

114
00:06:09.920 --> 00:06:13.000
But then we come to PDFs.

115
00:06:14.860 --> 00:06:17.540
And let's run again.

116
00:06:18.560 --> 00:06:20.980
And in this case, I have taken a

117
00:06:20.980 --> 00:06:26.540
PDF that is the rules of Settlers of

118
00:06:26.540 --> 00:06:31.440
Catan, just to keep the same theme here.

119
00:06:32.760 --> 00:06:35.560
And what we will see is that this

120
00:06:35.560 --> 00:06:38.360
is using the OpenAI agent, because if I

121
00:06:38.360 --> 00:06:40.560
had used the Azure OpenAI agent, it would

122
00:06:40.560 --> 00:06:43.860
have come back and said, hey, I don't

123
00:06:43.860 --> 00:06:46.340
support files.

124
00:06:47.280 --> 00:06:50.220
So Azure OpenAI, for some reason, can only

125
00:06:50.220 --> 00:06:53.160
do this with images and not PDFs.

126
00:06:54.000 --> 00:06:57.980
And it's not something that has anything to do with

127
00:06:57.980 --> 00:06:59.220
the agent framework.

128
00:06:59.460 --> 00:07:02.060
It is the pure LLM behind the scenes

129
00:07:02.060 --> 00:07:05.660
that is the problem here, not the framework,

130
00:07:06.060 --> 00:07:07.280
because that doesn't care.

131
00:07:09.080 --> 00:07:12.900
But if we have the rules here, because

132
00:07:12.900 --> 00:07:15.760
we can only work with local data and

133
00:07:15.760 --> 00:07:18.400
local acquisition, because if you have some URI,

134
00:07:18.640 --> 00:07:20.380
you, of course, can download it, have it

135
00:07:20.380 --> 00:07:23.640
in memory and give it to the system.

136
00:07:24.480 --> 00:07:26.780
But we can't just point to a PDF

137
00:07:26.780 --> 00:07:29.380
out there in the world and say, what

138
00:07:29.380 --> 00:07:30.260
is in this PDF?

139
00:07:31.360 --> 00:07:33.160
But if we get it as Base64,

140
00:07:33.160 --> 00:07:39.280
just like before, we can ask, what

141
00:07:39.280 --> 00:07:41.500
are the winning conditions in the attached PDF?

142
00:07:43.700 --> 00:07:46.560
And this takes quite a while and takes

143
00:07:46.560 --> 00:07:50.560
like 15,000 tokens, because it's a long

144
00:07:50.560 --> 00:07:52.740
rule book with images and stuff.

145
00:07:52.740 --> 00:07:56.280
But had it been a simpler PDF with

146
00:07:56.280 --> 00:08:00.200
just text, it would probably go faster.

147
00:08:03.640 --> 00:08:07.480
So I get a response back, and it

148
00:08:07.480 --> 00:08:09.860
will tell us that we need to reach 10

149
00:08:09.860 --> 00:08:12.040
victory points, and it will even point out

150
00:08:12.040 --> 00:08:16.580
that it is specified on page 5 and page

151
00:08:16.580 --> 00:08:20.020
14 and tell us a little about the rules.

152
00:08:20.020 --> 00:08:23.980
So that is fairly cool, but we can

153
00:08:23.980 --> 00:08:25.640
see the token cost, of course.

154
00:08:27.600 --> 00:08:32.260
And data in memory works exactly the same

155
00:08:32.260 --> 00:08:32.620
way.

156
00:08:32.820 --> 00:08:37.960
And again, it's only possible using the OpenAI agent.

157
00:08:45.230 --> 00:08:47.710
So it is thinking for the next one,

158
00:08:47.870 --> 00:08:49.250
and then we are done.

159
00:08:49.750 --> 00:08:51.490
But we'll get the same answer back, of

160
00:08:51.490 --> 00:08:51.710
course.

161
00:08:55.900 --> 00:08:59.680
But it's actually making a small mistake here, where it

162
00:08:59.680 --> 00:09:02.400
says it's on page 7, it's about it

163
00:09:02.400 --> 00:09:03.440
and not page 14.

164
00:09:03.860 --> 00:09:05.800
So that's AI for you.

165
00:09:08.340 --> 00:09:12.700
So again, other files, I have tried to

166
00:09:12.700 --> 00:09:17.700
put in .txt and so on, and it

167
00:09:17.700 --> 00:09:22.980
doesn't seem to really be able to understand

168
00:09:22.980 --> 00:09:23.300
that.

169
00:09:23.300 --> 00:09:27.860
It just says, I can't process external files

170
00:09:27.860 --> 00:09:28.340
like that.

171
00:09:28.800 --> 00:09:30.740
And that is the case where you need

172
00:09:30.740 --> 00:09:34.220
to use your C# skills to extract

173
00:09:34.220 --> 00:09:36.480
data from the various files and give it

174
00:09:36.480 --> 00:09:41.060
as text to the system or convert something

175
00:09:41.060 --> 00:09:43.560
to PDF and give it that way.

176
00:09:46.520 --> 00:09:48.920
So if we look at this, you can

177
00:09:48.920 --> 00:09:52.760
see we can give this list of contents.

178
00:09:52.760 --> 00:09:54.500
And you can see there's a bunch of

179
00:09:54.500 --> 00:09:59.720
other contents here, but these are both input

180
00:09:59.720 --> 00:10:00.660
and output content.

181
00:10:00.880 --> 00:10:03.020
And we are not really going to send

182
00:10:03.020 --> 00:10:06.040
an error content at any point.

183
00:10:06.200 --> 00:10:09.620
That is for the LLM to send back

184
00:10:09.620 --> 00:10:15.060
along with MCP tool call results and things like

185
00:10:15.060 --> 00:10:15.260
that.

186
00:10:15.380 --> 00:10:18.080
Hosted files, if it's up in Azure AI

187
00:10:18.080 --> 00:10:20.260
Foundry, it will give it back as a

188
00:10:20.260 --> 00:10:21.060
hosted file.

189
00:10:21.060 --> 00:10:24.260
So it doesn't give the file back, but it

190
00:10:24.260 --> 00:10:25.840
gives a reference to the file that you

191
00:10:25.840 --> 00:10:28.960
can then download by yourself to save tokens.

192
00:10:31.020 --> 00:10:35.340
So that is actually everything there is to

193
00:10:35.340 --> 00:10:36.340
this.

194
00:10:37.280 --> 00:10:39.200
Of course, if you work with images and

195
00:10:39.200 --> 00:10:41.920
PDFs, it's very simple, but the rest is

196
00:10:41.920 --> 00:10:45.980
extraction and giving it as text instead.

197
00:10:48.440 --> 00:10:50.260
So see you in the next one.
