WEBVTT

1
00:00:00.000 --> 00:00:04.599
Hi, and welcome to this AI and C# video on the Microsoft Agent Framework.

2
00:00:05.199 --> 00:00:08.199
Today we're going to look into Foundry Local,

3
00:00:08.800 --> 00:00:14.399
which just came out as 1.0, meaning general availability.

4
00:00:15.199 --> 00:00:20.799
So before we jump into code, let's talk a little about what it is and its history.

5
00:00:21.600 --> 00:00:24.100
Because this is Microsoft's own definition,

6
00:00:24.100 --> 00:00:28.299
Foundry Local is an end-to-end local AI solution, blah, blah, blah.

7
00:00:28.299 --> 00:00:33.400
It's about being able to run LLMs locally.

8
00:00:35.500 --> 00:00:40.400
And using some SDKs, we are, of course, using the C# version.

9
00:00:41.900 --> 00:00:45.400
The alternative to Foundry Local is Ollama,

10
00:00:45.900 --> 00:00:50.400
which can run the Llama models, but also a lot of other models.

11
00:00:51.700 --> 00:00:56.700
But I think, as far as this goes, Microsoft put out Foundry Local

12
00:00:56.799 --> 00:01:00.500
in order to satisfy enterprise businesses

13
00:01:00.500 --> 00:01:08.199
that can't really just be dependent on an open-source tool like Ollama,

14
00:01:08.199 --> 00:01:09.900
which is not owned by Meta.

15
00:01:12.300 --> 00:01:15.599
So let's talk a little history about this,

16
00:01:15.599 --> 00:01:19.000
because I have actually covered Foundry Local before,

17
00:01:19.699 --> 00:01:23.699
but it has been a very bumpy ride on what this is.

18
00:01:24.500 --> 00:01:29.800
It was introduced back in May 2025 at the Build Conference from Microsoft,

19
00:01:30.400 --> 00:01:34.900
and in June, the first version came out, and that was version 0.1,

20
00:01:34.900 --> 00:01:38.099
because we were in preview mode at that time.

21
00:01:38.599 --> 00:01:43.099
So quickly, we got over June, August, September,

22
00:01:43.099 --> 00:01:46.400
we got version 0.1, 0.2, and 0.3.

23
00:01:46.599 --> 00:01:51.599
It had to be installed using WinGet before it would work,

24
00:01:52.199 --> 00:01:54.300
but that was pretty okay to work with,

25
00:01:54.300 --> 00:01:57.500
but it was a little odd that you actually needed to download something

26
00:01:57.500 --> 00:02:00.000
to your machine before it worked.

27
00:02:01.699 --> 00:02:03.500
That might be good for a local machine,

28
00:02:03.500 --> 00:02:07.000
but if you want to run it up in a cloud or something,

29
00:02:07.000 --> 00:02:10.500
it would be very annoying that you can't make the code do it all.

30
00:02:11.000 --> 00:02:14.500
So in my samples back then, I actually automated WinGet

31
00:02:14.800 --> 00:02:16.600
in order to get it to work.

32
00:02:17.500 --> 00:02:20.100
So I was okay with it at that point.

33
00:02:20.100 --> 00:02:22.800
I'm not really a big fan of offline models,

34
00:02:22.800 --> 00:02:28.800
but it worked, and it could connect at that time to Semantic Kernel,

35
00:02:28.800 --> 00:02:32.300
which was the predecessor to the Agent Framework.

36
00:02:34.500 --> 00:02:39.100
Then in November 2025, something very, very odd happened.

38
00:02:42.199 --> 00:02:51.899
They suddenly went from 0.3 to version 0.8.0.1 and 0.8.0.2.

39
00:02:53.300 --> 00:02:57.800
This was a complete rewrite of what we have seen in the early days.

40
00:02:58.600 --> 00:03:03.899
WinGet was gone, and it was made with some open-source NuGet package

41
00:03:03.899 --> 00:03:11.600
that was not the official OpenAI standard,

42
00:03:11.600 --> 00:03:17.100
but something that felt totally randomly picked,

43
00:03:17.500 --> 00:03:22.699
and it made it completely incompatible with everything else Microsoft did.

44
00:03:24.100 --> 00:03:28.800
So when I saw this and tried it out and tried to make it work again,

45
00:03:28.800 --> 00:03:33.300
I simply gave up and went back to 0.3 because I knew that worked.

46
00:03:34.600 --> 00:03:39.500
And since then, they must have come to their senses

47
00:03:39.500 --> 00:03:43.500
because now in April 2026, we have version 1.0.

48
00:03:44.300 --> 00:03:48.500
And again, it's not dependent on WinGet,

49
00:03:50.000 --> 00:03:52.800
and it has compatibility with the Agent Framework,

50
00:03:52.800 --> 00:03:56.100
as we'll see in some code in one second.

51
00:03:56.899 --> 00:04:01.199
But let's also evaluate a bit if it's worth doing

52
00:04:01.199 --> 00:04:05.300
because there are still some oddities in it,

53
00:04:05.300 --> 00:04:10.500
but that's perhaps to be expected from local.

54
00:04:11.399 --> 00:04:12.899
But let's have a look at the code.

55
00:04:16.299 --> 00:04:18.600
So here I have the code.

56
00:04:18.600 --> 00:04:22.399
It's in Zero to First Agent under OpenAI-based

57
00:04:22.399 --> 00:04:24.700
because it's still OpenAI-based,

58
00:04:24.700 --> 00:04:27.399
and Zero to First Agent, Foundry Local.

59
00:04:28.899 --> 00:04:30.600
So let me get rid of this.

60
00:04:31.899 --> 00:04:37.399
So the model I'm going to use is a very, very small model,

61
00:04:37.399 --> 00:04:41.100
Qwen3, 0.6 billion parameters.

62
00:04:41.100 --> 00:04:44.899
And that's because my graphics card is very, very small.

63
00:04:46.799 --> 00:04:52.100
And for that reason, I can also only use the CPU models,

64
00:04:52.100 --> 00:04:55.600
not the GPU or NPU models in here.

65
00:04:56.600 --> 00:04:59.100
So I'm just going to go directly for CPU,

66
00:04:59.100 --> 00:05:02.200
but you can change it here and also go up to bigger models.

67
00:05:03.100 --> 00:05:05.899
What models are available, we will see in one second.

68
00:05:08.299 --> 00:05:12.200
One thing to note is that in order to use this,

69
00:05:12.200 --> 00:05:15.700
you either need to use Foundry Local WinML

70
00:05:15.700 --> 00:05:19.000
or plain Foundry Local on other types of machines.

71
00:05:19.000 --> 00:05:23.500
So you need a conditional NuGet package reference in here to do this,

72
00:05:23.500 --> 00:05:26.100
and you also need a conditional target framework,

73
00:05:26.100 --> 00:05:30.000
in that you need to target Windows before it works,

74
00:05:31.399 --> 00:05:33.500
not just the normal .NET 10.

75
00:05:33.500 --> 00:05:37.299
So already here, it's a bit funky, but it is doable.
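
NOTE
A minimal sketch of the conditional project setup described above, not the
project file from the video. The package IDs Microsoft.AI.Foundry.Local.WinML
and Microsoft.AI.Foundry.Local, the placeholder version 1.0.0, and the Windows
target framework moniker are all assumptions; check them against NuGet.
The Windows branch picks the WinML package and a Windows-specific target,
while every other OS gets the plain package and plain .NET 10.
```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework Condition="$([MSBuild]::IsOSPlatform('Windows'))">net10.0-windows10.0.26100.0</TargetFramework>
    <TargetFramework Condition="!$([MSBuild]::IsOSPlatform('Windows'))">net10.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup Condition="$([MSBuild]::IsOSPlatform('Windows'))">
    <PackageReference Include="Microsoft.AI.Foundry.Local.WinML" Version="1.0.0" />
  </ItemGroup>
  <ItemGroup Condition="!$([MSBuild]::IsOSPlatform('Windows'))">
    <PackageReference Include="Microsoft.AI.Foundry.Local" Version="1.0.0" />
  </ItemGroup>
</Project>
```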

76
00:05:39.299 --> 00:05:43.000
So if we do this, we make a folder

77
00:05:44.100 --> 00:05:46.000
where we can store all these things.

78
00:05:46.000 --> 00:05:49.799
In my case, I would put it in my user folder here,

79
00:05:49.799 --> 00:05:51.799
and you can see it's empty at the moment.

80
00:05:52.799 --> 00:05:57.000
So we're going to make what is called a Foundry Local manager,

81
00:05:57.000 --> 00:06:01.600
and we're going to make it with a name and a path

82
00:06:01.600 --> 00:06:06.600
and where we want to host the models in terms of a local web server.

83
00:06:08.200 --> 00:06:11.500
So we're going to do that, and once it's created,

84
00:06:11.500 --> 00:06:13.600
we can take its instance.

85
00:06:13.600 --> 00:06:17.399
Again, a little funky way of doing things, but it is possible.

86
00:06:18.399 --> 00:06:22.600
Then we're going to get a catalog of models,

87
00:06:23.600 --> 00:06:25.600
and then we can list these models.

88
00:06:27.000 --> 00:06:30.700
And let me run it down here so we can see.

89
00:06:30.700 --> 00:06:32.899
These are the models that are supported,

90
00:06:32.899 --> 00:06:36.899
and among them there is this Qwen3 0.6 billion.

91
00:06:36.899 --> 00:06:40.500
So it's not everything in the world that can be used.

92
00:06:40.500 --> 00:06:46.500
The biggest and most interesting one is GPT-OSS 20 billion parameters,

93
00:06:46.600 --> 00:06:52.700
which is the OpenAI ChatGPT open source model that was released

94
00:06:52.700 --> 00:06:57.600
that is roughly equivalent to, I think,

95
00:06:57.600 --> 00:07:05.100
between GPT-4o and GPT-5.

96
00:07:05.100 --> 00:07:11.799
It's a little stronger than 4o and not as smart as 5.

97
00:07:11.799 --> 00:07:15.299
So it's a fairly okay model, but there's also all the Phi models,

98
00:07:15.299 --> 00:07:18.100
the Qwen models, and the Whisper models, if need be.

99
00:07:19.299 --> 00:07:22.399
So once we have these, we can get the specific model,

100
00:07:22.399 --> 00:07:24.399
and we give it this alias.

101
00:07:26.000 --> 00:07:28.799
And in there we need to find among…

102
00:07:28.799 --> 00:07:32.799
In this case there's only one variant that is CPU-bound,

103
00:07:32.799 --> 00:07:38.899
but there could have been a GPU-bound version of some of them,

104
00:07:38.899 --> 00:07:42.700
and then you would be able to choose the model of…

105
00:07:42.700 --> 00:07:46.100
if it should use GPU, CPU, or NPU.

106
00:07:48.799 --> 00:07:54.299
In our case we do that, and then we check if it's already in the cache,

107
00:07:54.299 --> 00:07:56.899
meaning has it already been downloaded.

108
00:07:56.899 --> 00:08:00.100
And right now we don't have any models downloaded,

109
00:08:00.100 --> 00:08:04.299
so there won't be any, and we'll go in and download.

110
00:08:05.100 --> 00:08:09.299
This will take a little while, not too much, it's not a big model.

111
00:08:09.299 --> 00:08:14.700
This one has download progress reporting, which is broken at the moment,

112
00:08:14.700 --> 00:08:16.700
so don't use it.

113
00:08:17.500 --> 00:08:22.299
It will download, but you can't really show the progress, it has a bug.

114
00:08:22.899 --> 00:08:27.899
So it's a rough 1.0 in my opinion.

115
00:08:28.899 --> 00:08:33.700
But we now have our model downloaded, so we can see in here

116
00:08:34.700 --> 00:08:40.700
that we got a 511-megabyte model here.

117
00:08:40.700 --> 00:08:43.700
So behind the scenes it's ONNX models.

118
00:08:46.299 --> 00:08:49.700
So let's load the model into memory.

119
00:08:53.700 --> 00:08:56.500
And again, on the bigger models this will take longer,

120
00:08:56.500 --> 00:08:59.900
but here it was only 500 megabytes we needed to load in,

121
00:09:00.099 --> 00:09:05.099
and it's CPU bound, so it's taking up my memory and my CPU.

122
00:09:06.099 --> 00:09:10.299
And then we're going to start a small web server that we set up here,

123
00:09:10.299 --> 00:09:13.299
so we have a local URL to do this.

124
00:09:14.299 --> 00:09:16.900
So now we have the Foundry part up and running,

125
00:09:16.900 --> 00:09:22.500
it now runs a mini web server through which we can connect to these models,

126
00:09:23.299 --> 00:09:28.299
similar to when you use Ollama and just start the Ollama service.

127
00:09:29.299 --> 00:09:34.700
And you can, of course, wrap this into any kind of service you want

128
00:09:35.299 --> 00:09:37.099
in order to run these models.

129
00:09:37.099 --> 00:09:39.500
In this case we are just running it in a console app

130
00:09:39.500 --> 00:09:44.900
where we both run the server and consume it,

131
00:09:44.900 --> 00:09:47.900
but in real life you would, of course, have two different processes.

132
00:09:50.099 --> 00:09:57.299
Then we go in and make, based on our service URL, which we have here,

133
00:09:57.299 --> 00:10:02.500
so we can see it gave us on the fly a port.

134
00:10:04.299 --> 00:10:08.700
Up here we didn't specify it, but it will assign a port on the fly.

135
00:10:09.700 --> 00:10:13.099
And then we're going to make a normal OpenAI client,

136
00:10:13.700 --> 00:10:17.099
but instead of an API key, because it's local, we don't need one,

137
00:10:17.099 --> 00:10:20.900
so we can just write anything here. In my case I wrote no API key.

138
00:10:21.500 --> 00:10:23.500
And then we need to pass the endpoint in.

139
00:10:24.500 --> 00:10:28.900
And once we do this, we have everything like normal.

140
00:10:28.900 --> 00:10:32.500
We can do a chat client and we can get an AI agent like normal.
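
NOTE
A minimal sketch of the client step described above, assuming the official
OpenAI NuGet package. The endpoint URL, port, "/v1" suffix, and model id are
hypothetical placeholders, since the real port is assigned at runtime.
It stops at the plain OpenAI chat client; wrapping it as an Agent Framework
agent is not shown because the video does not name the exact calls.
```csharp
using System;
using System.ClientModel;
using OpenAI;
using OpenAI.Chat;
// Hypothetical endpoint: use the URL the local web service reports at startup.
var endpoint = new Uri("http://127.0.0.1:5273/v1");
// No real key is needed for a local server, so any non-empty string will do.
var client = new OpenAIClient(
    new ApiKeyCredential("no-api-key"),
    new OpenAIClientOptions { Endpoint = endpoint });
// "qwen3-0.6b" is an assumed model id; use the id of the variant you loaded.
ChatClient chat = client.GetChatClient("qwen3-0.6b");
ChatCompletion reply = await chat.CompleteChatAsync(
    new UserChatMessage("What is the capital of Sweden?"));
Console.WriteLine(reply.Content[0].Text);
```
The same pattern applies to any OpenAI-compatible local server: only the
endpoint and the model id change.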

141
00:10:33.299 --> 00:10:35.099
And then we can begin to ask our questions.

142
00:10:35.099 --> 00:10:37.500
So first question is, what's the capital of Sweden?

143
00:10:38.700 --> 00:10:41.900
And it will come back using the normal Agent Framework.

144
00:10:44.500 --> 00:10:48.500
And come back and hopefully say something here.

145
00:10:49.500 --> 00:10:51.500
Oh, so it's also showing it's thinking.

146
00:10:52.299 --> 00:10:55.299
Apparently that's the way that model works.

147
00:10:55.900 --> 00:10:59.700
Strange, but different models, different way of doing things.

148
00:11:00.700 --> 00:11:06.700
But it came back at least and said the capital of Sweden is Stockholm. That's okay.

149
00:11:07.299 --> 00:11:09.299
And then we can do the same in streaming.

150
00:11:09.299 --> 00:11:14.500
I will just set a breakpoint down here and let it stream on how to make soup.

151
00:11:14.900 --> 00:11:17.500
First it will give us back the thinking part.

152
00:11:19.299 --> 00:11:21.700
Again, not all models do this.

153
00:11:22.900 --> 00:11:26.700
So we would need to take it away in real life.
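
NOTE
If the thinking part does need to be hidden, one possible approach is to strip
it from the finished reply. This sketch assumes the model wraps its reasoning
in think tags, as Qwen3 does; StripThinking is a hypothetical helper, not part
of any SDK, and it works on a complete reply rather than on streamed chunks.
```csharp
using System;
using System.Text.RegularExpressions;
// Removes every think block, including line breaks inside it (Singleline).
static string StripThinking(string reply) =>
    Regex.Replace(reply, @"<think>.*?</think>\s*", "", RegexOptions.Singleline).Trim();
var raw = "<think>The user asks about Sweden.</think>\n\nThe capital of Sweden is Stockholm.";
Console.WriteLine(StripThinking(raw)); // prints "The capital of Sweden is Stockholm."
```
For streaming, the chunks would have to be buffered until the closing tag
arrives before anything is shown.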

154
00:11:27.700 --> 00:11:32.299
So in this case it was how to make tomato soup and it will show us that.

155
00:11:33.099 --> 00:11:35.299
And then we can stop the web service

156
00:11:35.900 --> 00:11:41.099
and we can unload the model from memory so we can free up our memory.

157
00:11:41.900 --> 00:11:42.900
And then we're done.

158
00:11:43.299 --> 00:11:49.099
So this is completely different from what you have seen previously from my videos

159
00:11:49.099 --> 00:11:51.299
because, again, it's a complete rewrite.

160
00:11:53.700 --> 00:11:54.700
It works.

161
00:11:55.299 --> 00:11:59.900
So if you're into offline models, this is probably a good way to go

162
00:11:59.900 --> 00:12:01.099
or go with Ollama.

163
00:12:01.099 --> 00:12:03.299
They work more or less the same way

164
00:12:04.299 --> 00:12:08.500
in that they make an OpenAI client with a special URL.

165
00:12:09.900 --> 00:12:11.299
But that's everything there is to it.

166
00:12:12.299 --> 00:12:17.900
It's up to you if you feel like this is a good way of doing things.

167
00:12:17.900 --> 00:12:24.500
If you want to go Ollama or, like me, want to use the paid models up in the cloud.

168
00:12:25.099 --> 00:12:27.099
But see you in the next one.