WEBVTT

1
00:00:00.200 --> 00:00:03.059
Hi, in this video we are looking at

2
00:00:03.059 --> 00:00:05.960
how to explore and deploy the many models

3
00:00:05.960 --> 00:00:09.460
Foundry has to offer, and understand the different

4
00:00:09.460 --> 00:00:12.720
deployment types, as it can be quite confusing

5
00:00:12.720 --> 00:00:13.600
to be honest.

6
00:00:14.840 --> 00:00:17.660
I will show you how to deploy the

7
00:00:17.660 --> 00:00:20.240
models you have available, how to get access

8
00:00:20.240 --> 00:00:23.320
to restricted models and how to request more

9
00:00:23.320 --> 00:00:25.200
quota, so let's get started.

10
00:00:25.840 --> 00:00:28.120
If we look at the model catalogue, you

11
00:00:28.120 --> 00:00:31.220
will see there are over 10,000 LLM

12
00:00:31.220 --> 00:00:34.380
models to choose from, but in real life,

13
00:00:34.500 --> 00:00:37.380
unless you have a very niche scenario, you

14
00:00:37.380 --> 00:00:39.840
will only be dealing with a handful of

15
00:00:39.840 --> 00:00:42.380
the model providers and the model types.

16
00:00:44.280 --> 00:00:49.740
Foundry's best-known models come from OpenAI, Anthropic,

17
00:00:50.460 --> 00:00:53.500
xAI, Mistral and DeepSeek.

18
00:00:54.940 --> 00:00:58.500
The only big brand missing is Google Gemini,

19
00:00:58.900 --> 00:01:01.940
but that is to be expected, as Google

20
00:01:01.940 --> 00:01:04.540
and Microsoft are competitors in both the cloud

21
00:01:04.540 --> 00:01:06.300
hosting and the AI space.

22
00:01:07.300 --> 00:01:10.840
Microsoft themselves also have a few models of

23
00:01:10.840 --> 00:01:12.800
their own, but nothing revolutionary.

24
00:01:14.440 --> 00:01:18.000
There's one special model called the model router.

25
00:01:18.600 --> 00:01:21.320
It is not a model as such, but

26
00:01:21.320 --> 00:01:24.680
instead a collection of models that, based on

27
00:01:24.680 --> 00:01:29.420
the question, can choose simpler or more advanced models

28
00:01:29.420 --> 00:01:30.100
on the fly.

29
00:01:31.140 --> 00:01:34.880
I have had mixed results with this, so

30
00:01:34.880 --> 00:01:36.580
I tend to not use it personally, but

31
00:01:36.580 --> 00:01:39.620
it's an interesting idea and behind the scenes,

32
00:01:39.760 --> 00:01:43.540
you can go in and choose which models

33
00:01:43.540 --> 00:01:46.520
it has available to route to.

34
00:01:49.120 --> 00:01:50.820
Not all models are made for the same

35
00:01:50.820 --> 00:01:51.340
purpose.

36
00:01:51.820 --> 00:01:54.940
Some do text, some do images, some do

37
00:01:54.940 --> 00:01:58.500
embeddings and some do speech, so it's a

38
00:01:58.500 --> 00:02:00.800
good thing to go in and check the

39
00:02:00.800 --> 00:02:04.220
capabilities of each model as you go.

40
00:02:05.940 --> 00:02:09.139
Also, not all models work with

41
00:02:09.139 --> 00:02:15.000
the agents and workflow services, and since those

42
00:02:15.000 --> 00:02:17.400
are the ones we are going to focus

43
00:02:17.400 --> 00:02:20.820
on in this series, we will tend to

44
00:02:20.820 --> 00:02:22.980
stick with the models you see on

45
00:02:22.980 --> 00:02:23.500
the screen.

46
00:02:27.710 --> 00:02:32.920
Looking at a specific model, a description is

47
00:02:32.920 --> 00:02:35.720
there and a link to the pricing.

48
00:02:36.740 --> 00:02:41.220
While this screen is okay, I tend to

49
00:02:41.220 --> 00:02:44.500
instead go to the provider's own specific site

50
00:02:44.500 --> 00:02:46.820
as it gives more details and capabilities.

51
00:02:47.780 --> 00:02:51.780
Here is, for example, what GPT 5 has,

52
00:02:52.240 --> 00:02:55.980
and we can see features and pricing directly.

53
00:02:57.860 --> 00:03:00.820
I normally only use models that can do

54
00:03:00.820 --> 00:03:03.160
at least tool calling or function calling, as

55
00:03:03.160 --> 00:03:07.020
it's sometimes called, and structured output, as

56
00:03:07.020 --> 00:03:08.860
I tend to use these features all the

57
00:03:08.860 --> 00:03:09.160
time.

58
00:03:10.020 --> 00:03:12.620
Beyond that, I look for at least an

59
00:03:12.620 --> 00:03:16.940
embedding model for the vector store, and then I choose

60
00:03:16.940 --> 00:03:21.060
non-reasoning chat models and reasoning chat models

61
00:03:21.060 --> 00:03:22.660
for the more advanced workloads.

62
00:03:24.960 --> 00:03:27.600
These are my go-to models as of

63
00:03:27.600 --> 00:03:32.800
December 2025, where GPT 4.1 and 4.1-mini

64
00:03:32.800 --> 00:03:36.980
are my non-reasoning models, 5.1

65
00:03:36.980 --> 00:03:40.080
and 5.1 Mini are for reasoning, and

66
00:03:40.080 --> 00:03:43.340
text-embedding-3-small is for embeddings.

67
00:03:44.340 --> 00:03:48.560
I am slowly getting rid of GPT 4.1

68
00:03:48.560 --> 00:03:53.020
as it's quite pricey compared to how

69
00:03:53.020 --> 00:03:56.220
old it is, but it's really nice when

70
00:03:56.220 --> 00:03:58.620
you just need to set up a model

71
00:03:58.620 --> 00:04:02.900
without needing to configure all the reasoning settings

72
00:04:02.900 --> 00:04:03.660
and so on.

73
00:04:05.900 --> 00:04:08.560
Now let's dive into how to deploy a

74
00:04:08.560 --> 00:04:10.640
model, and you do that by clicking the

75
00:04:10.640 --> 00:04:14.100
model in the details page, and on the

76
00:04:14.100 --> 00:04:17.339
top, choosing the deploy button.

77
00:04:18.120 --> 00:04:20.860
When you press that, there are default settings

78
00:04:20.860 --> 00:04:22.040
and custom settings.

79
00:04:22.220 --> 00:04:24.840
I would very much recommend you choose custom

80
00:04:24.840 --> 00:04:29.120
settings, because if you just press default, it

81
00:04:29.120 --> 00:04:32.500
just creates the model with no questions asked,

82
00:04:33.040 --> 00:04:36.260
so it's better to go with the custom

83
00:04:36.260 --> 00:04:37.260
settings, in my opinion.

84
00:04:39.780 --> 00:04:43.520
What you see is this sidebar that will

85
00:04:43.520 --> 00:04:47.100
pop up next to the deploy button, where you

86
00:04:47.100 --> 00:04:50.060
give a deployment name, and this is special

87
00:04:50.060 --> 00:04:52.840
for Azure in that you can choose your

88
00:04:52.840 --> 00:04:55.940
own name for the deployment, while, for example,

89
00:04:56.060 --> 00:04:59.300
in OpenAI, if you need to reference GPT

90
00:04:59.300 --> 00:05:01.680
5.1, you need to use that exact name,

91
00:05:01.900 --> 00:05:05.120
but here you could call it my-model if

92
00:05:05.120 --> 00:05:05.720
you want to.

93
00:05:06.600 --> 00:05:10.160
But I recommend that you don't change these

94
00:05:10.160 --> 00:05:13.480
names, because these are the ones the different

95
00:05:13.480 --> 00:05:17.440
providers show in their spec sheets and so

96
00:05:17.440 --> 00:05:17.680
on.
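
NOTE
A minimal sketch of why the deployment name matters: on Azure, the name you
pick is what goes into the request path, not the model name itself. This
assumes the Azure OpenAI style endpoint; the resource name "my-resource", the
deployment names, and the api-version value are placeholders, not taken from
the video.
```python
# Sketch: the deployment name (not the model name) goes into the request path.
# The resource name, deployment names, and api-version are placeholder assumptions.
from urllib.parse import quote
def chat_completions_url(resource: str, deployment: str, api_version: str = "2024-10-21") -> str:
    """Build an Azure OpenAI style chat completions URL for a named deployment."""
    base = f"https://{resource}.openai.azure.com"
    path = f"/openai/deployments/{quote(deployment, safe='')}/chat/completions"
    return f"{base}{path}?api-version={api_version}"
# A deployment named after the model keeps code and spec sheets aligned.
print(chat_completions_url("my-resource", "gpt-5.1"))
# A custom name works too, but hides which model sits behind it.
print(chat_completions_url("my-resource", "my-model"))
```
Keeping the deployment name equal to the model name means the URL alone tells
you which model a call is hitting.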

97
00:05:18.880 --> 00:05:23.160
Next, you choose a deployment type, and there

98
00:05:23.160 --> 00:05:24.800
are four different types.

99
00:05:25.860 --> 00:05:30.460
There's something called global standard, meaning that you

100
00:05:30.460 --> 00:05:33.760
get the cheapest pricing, the pricing you normally

101
00:05:33.760 --> 00:05:37.740
see on the spec sheets, but you

102
00:05:37.740 --> 00:05:41.520
cannot assume that every call you make

103
00:05:41.520 --> 00:05:45.560
will stay within the Azure region that you

104
00:05:45.560 --> 00:05:46.380
have chosen.

105
00:05:46.380 --> 00:05:50.650
For a slightly higher price, you can use

106
00:05:50.650 --> 00:05:54.890
data zone standard, which stays within the Azure

107
00:05:54.890 --> 00:05:56.510
region's data zone.

108
00:05:57.310 --> 00:06:00.090
And then there are provisioned versions of these,

109
00:06:00.210 --> 00:06:03.750
where you get more reliability on your models,

110
00:06:04.390 --> 00:06:07.270
but you pay by the hour for each

111
00:06:07.270 --> 00:06:10.930
of them, while with global standard and data zone

112
00:06:10.930 --> 00:06:16.110
standard, you only pay for the tokens you use.
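
NOTE
A rough sketch of the pay-per-token versus pay-per-hour trade-off described
above. Every price and volume here is a made-up placeholder, not real Azure
pricing, and provisioned billing is simplified to a single hourly rate over an
assumed 730-hour month.
```python
# Sketch: compare pay-per-token cost with an hourly provisioned cost.
# All prices and volumes are hypothetical placeholders, not real Azure pricing.
def monthly_token_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Cost when you only pay for the tokens you use."""
    return tokens_per_month / 1_000_000 * price_per_million
def monthly_provisioned_cost(hourly_rate: float, hours: int = 730) -> float:
    """Cost when you pay per hour, whether or not you send traffic."""
    return hourly_rate * hours
def break_even_tokens(hourly_rate: float, price_per_million: float, hours: int = 730) -> float:
    """Monthly token volume at which both billing models cost the same."""
    return monthly_provisioned_cost(hourly_rate, hours) / price_per_million * 1_000_000
print(monthly_token_cost(50_000_000, 2.0))
print(monthly_provisioned_cost(1.0))
print(break_even_tokens(1.0, 2.0))
```
Below the break-even volume the token-based types are cheaper; above it, the
hourly rate starts to pay off, on top of the reliability it buys.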

113
00:06:18.000 --> 00:06:21.060
Beyond that, you can choose a model version.

114
00:06:21.240 --> 00:06:24.200
There's often only one, but some of the

115
00:06:24.200 --> 00:06:28.860
older ones have older versions, and then there are

116
00:06:28.860 --> 00:06:31.620
upgrade policies, so it can automatically move to

117
00:06:31.620 --> 00:06:33.620
the newest version when one becomes available.

118
00:06:34.880 --> 00:06:39.260
Then you choose your tokens per minute, and

119
00:06:39.260 --> 00:06:42.340
you might feel that you don't have enough

120
00:06:42.340 --> 00:06:46.220
tokens per minute, and if you don't, you

121
00:06:46.220 --> 00:06:49.120
need to go to quotas and raise them,

122
00:06:49.320 --> 00:06:51.460
as we'll talk about in a little bit.
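
NOTE
A back-of-the-envelope sketch for judging whether a tokens-per-minute setting
is enough. The traffic figures, the 20 percent headroom, and the current limit
are illustrative assumptions; how the service actually counts tokens against
the limit may differ from this simple sum.
```python
# Sketch: estimate the tokens-per-minute (TPM) setting a workload needs.
# Traffic figures, headroom, and the current limit are illustrative assumptions.
import math
def required_tpm(requests_per_minute: int, prompt_tokens: int, completion_tokens: int, headroom_percent: int = 20) -> int:
    """Peak requests per minute times tokens per request, plus headroom."""
    per_request = prompt_tokens + completion_tokens
    return math.ceil(requests_per_minute * per_request * (100 + headroom_percent) / 100)
needed = required_tpm(requests_per_minute=60, prompt_tokens=1500, completion_tokens=500)
current_limit = 100_000  # hypothetical TPM currently assigned to the deployment
print(needed, "TPM needed;", "raise quota" if needed > current_limit else "current limit is enough")
```
If the estimate exceeds what the slider allows, that is the point where a
quota increase request becomes necessary.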

123
00:06:52.880 --> 00:06:55.020
The final thing is guardrails.

124
00:06:55.480 --> 00:06:59.180
There's the default, the default v2, and you

125
00:06:59.180 --> 00:07:02.420
can make your own guardrails as well, which

126
00:07:02.980 --> 00:07:05.020
we'll cover in a later video.

127
00:07:06.020 --> 00:07:08.840
So once you do this, you deploy it,

128
00:07:09.560 --> 00:07:13.240
and you end up with a model.

129
00:07:15.000 --> 00:07:18.080
Certain models you can't press the deploy button

130
00:07:18.080 --> 00:07:22.740
on, and that is because some models are

131
00:07:22.740 --> 00:07:28.440
behind registration. These are

132
00:07:28.440 --> 00:07:31.420
often models that are under heavy use or

133
00:07:31.420 --> 00:07:37.340
very resource intensive, because Microsoft in general, like

134
00:07:37.340 --> 00:07:40.540
everyone else, has a hard time keeping up

135
00:07:40.540 --> 00:07:44.540
with providing servers for all this load.

136
00:07:45.720 --> 00:07:48.260
So some of them are behind a request

137
00:07:48.260 --> 00:07:51.480
access button, and when you press request access, you

138
00:07:51.480 --> 00:07:53.340
are taken to a website where you fill

139
00:07:53.340 --> 00:07:54.080
out a form.

140
00:07:54.080 --> 00:07:58.960
You need to fill out your address details,

141
00:07:59.240 --> 00:08:02.380
your subscription ID, and why you need access

142
00:08:02.380 --> 00:08:04.140
to the model, and so on.

143
00:08:05.520 --> 00:08:07.920
And it's a bit of a lottery, to

144
00:08:07.920 --> 00:08:08.380
be honest.

145
00:08:09.000 --> 00:08:14.000
Sometimes, after requesting access, I get

146
00:08:14.000 --> 00:08:17.580
access a day later; for others I never

147
00:08:17.580 --> 00:08:18.660
get an answer back.

148
00:08:19.580 --> 00:08:25.900
That is the case whether I'm on my private system

149
00:08:26.160 --> 00:08:31.580
or working with a company that has

150
00:08:31.580 --> 00:08:35.000
dedicated Microsoft contacts and so on.

151
00:08:36.700 --> 00:08:39.580
The contacts I've spoken to say it's

152
00:08:39.580 --> 00:08:43.280
a first come, first served model, which I

153
00:08:43.280 --> 00:08:46.620
don't believe; when a Fortune 500 company comes

154
00:08:46.620 --> 00:08:49.380
along, they will probably push them up the

155
00:08:50.160 --> 00:08:50.720
line.

156
00:08:51.120 --> 00:08:53.020
I've also seen that sometimes I need to

157
00:08:53.020 --> 00:08:56.260
request, not hear for a month, request again,

158
00:08:56.340 --> 00:08:57.740
and then I suddenly get it.

159
00:08:58.280 --> 00:09:02.920
So it's not like it always works.

160
00:09:03.320 --> 00:09:05.200
If you don't hear something in a couple

161
00:09:05.200 --> 00:09:06.880
of weeks, try and request again.

162
00:09:07.160 --> 00:09:09.300
That is my recommendation.

163
00:09:10.260 --> 00:09:13.280
Or at least speak to some Microsoft contacts

164
00:09:13.280 --> 00:09:13.900
that you have.

165
00:09:14.600 --> 00:09:19.160
But it is a real lottery when we

166
00:09:19.160 --> 00:09:21.900
have so few resources at the moment.

167
00:09:26.000 --> 00:09:28.220
Once you have your models, you can go

168
00:09:28.220 --> 00:09:31.180
to the build tab and then in the

169
00:09:31.180 --> 00:09:34.120
model section here, it will list your

170
00:09:34.120 --> 00:09:35.620
different models.

171
00:09:36.440 --> 00:09:38.440
And if you press them, you go into

172
00:09:38.440 --> 00:09:42.200
the playground, but you can also go to the details

173
00:09:42.200 --> 00:09:43.180
where you can edit.

174
00:09:43.620 --> 00:09:48.080
So you can change the number of tokens

175
00:09:48.080 --> 00:09:48.860
and so on.

176
00:09:49.640 --> 00:09:53.260
You can delete your model, not that it

177
00:09:53.260 --> 00:09:55.700
costs you anything just by having it around.

178
00:09:57.060 --> 00:10:00.280
And again, if you find that you use

179
00:10:00.280 --> 00:10:02.320
a model and you constantly run out of

180
00:10:02.320 --> 00:10:05.700
tokens, you might want to request more quota.

181
00:10:05.700 --> 00:10:13.870
And a quota request is, again, a website that

182
00:10:13.870 --> 00:10:16.690
you go to, fill out a form, explain

183
00:10:16.690 --> 00:10:19.990
why you need it and how many thousands

184
00:10:19.990 --> 00:10:22.810
of tokens you need instead.

185
00:10:23.770 --> 00:10:25.510
And again, this is also a lottery.

186
00:10:25.810 --> 00:10:30.270
The worst I have experienced was trying to get

187
00:10:30.270 --> 00:10:34.550
more tokens for one of the models back

188
00:10:34.550 --> 00:10:37.630
in January; it took three months before we

189
00:10:37.630 --> 00:10:38.050
got it.

190
00:10:39.070 --> 00:10:42.110
But I've also experienced going here and the

191
00:10:42.110 --> 00:10:45.590
next day having many more tokens than I

192
00:10:45.590 --> 00:10:46.410
actually requested.

193
00:10:47.730 --> 00:10:51.450
So again, start early on getting the quota

194
00:10:51.450 --> 00:10:53.070
if you need to go live at some

195
00:10:53.070 --> 00:10:59.750
point and ask for a lot because it

196
00:10:59.750 --> 00:11:01.410
can be really, really difficult.

197
00:11:01.410 --> 00:11:07.210
And I have scenarios right now where we

198
00:11:07.210 --> 00:11:10.970
need to go to less capable models because

199
00:11:10.970 --> 00:11:12.930
we simply cannot get the quota for the

200
00:11:12.930 --> 00:11:13.890
high-end ones.

201
00:11:16.490 --> 00:11:20.410
But that is everything about models and model

202
00:11:20.410 --> 00:11:21.050
deployment.

203
00:11:22.830 --> 00:11:23.890
See you in the next one.