WEBVTT

1
00:00:00.270 --> 00:00:01.710
<v Maximilian>So here for this course,</v>

2
00:00:01.710 --> 00:00:04.980
just as in the previous section for LM Studio,

3
00:00:04.980 --> 00:00:06.810
I will use the Gemma 3 model,

4
00:00:06.810 --> 00:00:09.450
but you can of course play around with different models

5
00:00:09.450 --> 00:00:12.030
and use all kinds of models.

6
00:00:12.030 --> 00:00:16.320
Now, once you go to the model detail page of a given model,

7
00:00:16.320 --> 00:00:19.290
you have a dropdown here which allows you

8
00:00:19.290 --> 00:00:22.830
to choose from different flavors

9
00:00:22.830 --> 00:00:26.400
and quantizations of that model.

10
00:00:26.400 --> 00:00:30.600
All these identifiers here are called tags,

11
00:00:30.600 --> 00:00:34.950
so a given model, Gemma 3, has different tags

12
00:00:34.950 --> 00:00:39.510
that then point at different versions of that model.

13
00:00:39.510 --> 00:00:41.940
Now, the obvious tags are, of course, the tags

14
00:00:41.940 --> 00:00:45.510
that simply describe the different parameter sizes,

15
00:00:45.510 --> 00:00:47.490
so the different amounts of parameters,

16
00:00:47.490 --> 00:00:50.970
because the Gemma 3 model was published by Google

17
00:00:50.970 --> 00:00:53.263
with different amounts of parameters,

18
00:00:53.263 --> 00:00:56.430
but then we also got all these tags that refer

19
00:00:56.430 --> 00:01:00.840
to different quantized versions of these models.

20
00:01:00.840 --> 00:01:02.190
The confusing part,

21
00:01:02.190 --> 00:01:06.540
however, is that these versions here are also quantized,

22
00:01:06.540 --> 00:01:09.750
it's just not obvious from the tag names,

23
00:01:09.750 --> 00:01:14.100
but by default, Ollama only has quantized versions

24
00:01:14.100 --> 00:01:17.430
in their model catalog because you typically want

25
00:01:17.430 --> 00:01:20.130
to run quantized versions on your system

26
00:01:20.130 --> 00:01:21.870
for all the reasons you learned about

27
00:01:21.870 --> 00:01:23.370
earlier in this course.

28
00:01:23.370 --> 00:01:26.430
You essentially have pretty much no disadvantages

29
00:01:26.430 --> 00:01:28.290
from using quantized versions,

30
00:01:28.290 --> 00:01:29.700
but you get better performance

31
00:01:29.700 --> 00:01:32.190
and lower hardware requirements.

32
00:01:32.190 --> 00:01:35.220
So therefore, these versions of the model

33
00:01:35.220 --> 00:01:36.930
that you find at the top here,

34
00:01:36.930 --> 00:01:40.410
these tags also point at quantized versions.

35
00:01:40.410 --> 00:01:41.760
You can just think of them

36
00:01:41.760 --> 00:01:45.660
as the officially recommended versions,

37
00:01:45.660 --> 00:01:47.460
because if you select one of these models,

38
00:01:47.460 --> 00:01:50.130
let's say the 12 billion parameter model,

39
00:01:50.130 --> 00:01:53.580
you find more information about it here, and guess what?

40
00:01:53.580 --> 00:01:56.790
You also find quantization information here.

41
00:01:56.790 --> 00:01:59.910
So that proves that this version of the model

42
00:01:59.910 --> 00:02:03.300
with this tag is also quantized,

43
00:02:03.300 --> 00:02:05.820
which is a good thing, of course.

44
00:02:05.820 --> 00:02:08.010
You also find the size of the model here,

45
00:02:08.010 --> 00:02:11.190
which is the amount of disk space this model will occupy

46
00:02:11.190 --> 00:02:12.990
after being downloaded.

47
00:02:12.990 --> 00:02:15.990
And you find some other pieces of information here,

48
00:02:15.990 --> 00:02:18.537
which you can explore in greater detail,

49
00:02:18.537 --> 00:02:21.990
like for example, the default configuration that's applied

50
00:02:21.990 --> 00:02:24.450
to this model, for example here,

51
00:02:24.450 --> 00:02:27.840
a quite high temperature value for high creativity,

52
00:02:27.840 --> 00:02:30.390
but we'll get back to configuring these models

53
00:02:30.390 --> 00:02:32.520
and creating your own customized versions

54
00:02:32.520 --> 00:02:35.103
of these models later in this section.

55
00:02:36.750 --> 00:02:38.370
You also, most importantly,

56
00:02:38.370 --> 00:02:41.550
find a command for running this version

57
00:02:41.550 --> 00:02:44.130
of the model locally on your system.

58
00:02:44.130 --> 00:02:45.960
So here, we find ollama run,

59
00:02:45.960 --> 00:02:47.700
we already talked about this,

60
00:02:47.700 --> 00:02:50.010
but then this here is the model identifier,

61
00:02:50.010 --> 00:02:52.890
and the structure of this identifier is always the same.

62
00:02:52.890 --> 00:02:57.063
It's essentially the model name, gemma3 in this case,

63
00:02:57.930 --> 00:03:00.330
then a colon, and then the tag

64
00:03:00.330 --> 00:03:03.030
of the specific model version you want to run,

65
00:03:03.030 --> 00:03:06.240
so 12b in this case.
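
NOTE
Editor's sketch, not part of the spoken track: the identifier structure
described above (model name, colon, tag) can be illustrated in plain shell.
The variable names are hypothetical, and 12b is assumed to be the literal tag
for the 12 billion parameter version. Nothing here calls Ollama; it only
builds and splits the string.
```shell
# Identifier structure: <model name>:<tag>
name="gemma3"
tag="12b"
identifier="${name}:${tag}"
# Print the command as it would appear on the model detail page.
echo "ollama run ${identifier}"
# Split an identifier back into its parts with POSIX parameter expansion.
echo "${identifier%%:*}"   # model name (everything before the first colon)
echo "${identifier#*:}"    # tag (everything after the first colon)
```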

66
00:03:06.240 --> 00:03:08.010
If you were to choose a different tag

67
00:03:08.010 --> 00:03:12.360
because you want to force some other quantization technique,

68
00:03:12.360 --> 00:03:15.510
for example, you could select that in the dropdown

69
00:03:15.510 --> 00:03:18.423
and then simply use this tag here.

70
00:03:19.290 --> 00:03:22.500
And here, indeed, I will use this version,

71
00:03:22.500 --> 00:03:25.530
because for the Gemma 3 model specifically,

72
00:03:25.530 --> 00:03:27.960
Google released some quantized versions

73
00:03:27.960 --> 00:03:32.940
where they trained that version with quantization in mind.

74
00:03:32.940 --> 00:03:36.870
So in theory, this version should perform a little better

75
00:03:36.870 --> 00:03:38.693
than the quantized version

76
00:03:38.693 --> 00:03:42.810
that was derived from the unquantized raw model,

77
00:03:42.810 --> 00:03:44.100
though in reality,

78
00:03:44.100 --> 00:03:47.280
it might not really make a big difference.

79
00:03:47.280 --> 00:03:50.100
But here, I'll go for this version,

80
00:03:50.100 --> 00:03:52.533
and I'll therefore copy this command

81
00:03:52.533 --> 00:03:57.240
and go back to my command line to run it there.

82
00:03:57.240 --> 00:03:59.970
Now, before I do that, though, I also wanna point

83
00:03:59.970 --> 00:04:03.570
at that Readme, which you'll find on that model detail page

84
00:04:03.570 --> 00:04:06.030
where you can in general learn more

85
00:04:06.030 --> 00:04:08.100
about the model you're about to use,

86
00:04:08.100 --> 00:04:10.410
for example, about the context window it has

87
00:04:10.410 --> 00:04:13.830
and many other things, where in case of the Gemma models,

88
00:04:13.830 --> 00:04:16.770
you learn about these quantization-aware trained models

89
00:04:16.770 --> 00:04:18.180
I just explained.

90
00:04:18.180 --> 00:04:20.883
So that is indeed the model I am running here,

91
00:04:22.140 --> 00:04:24.390
and where you can also learn more about the performance

92
00:04:24.390 --> 00:04:27.750
of this model, how it did in various benchmarks, and so on.

93
00:04:27.750 --> 00:04:29.370
This information is very similar

94
00:04:29.370 --> 00:04:31.530
to what you would find on the detail page

95
00:04:31.530 --> 00:04:33.750
of that model on Hugging Face.

96
00:04:33.750 --> 00:04:36.960
So therefore, I will now run this command,

97
00:04:36.960 --> 00:04:39.900
and what happens now is that as a first step,

98
00:04:39.900 --> 00:04:41.340
the model is downloaded.

99
00:04:41.340 --> 00:04:42.630
Of course, this doesn't happen

100
00:04:42.630 --> 00:04:44.550
every time you run this command,

101
00:04:44.550 --> 00:04:46.170
it just happens the first time

102
00:04:46.170 --> 00:04:48.588
if you didn't download the model before.

103
00:04:48.588 --> 00:04:51.313
By the way, just to also show you this command,

104
00:04:51.313 --> 00:04:55.950
and for that, I stopped this command by pressing Ctrl+C,

105
00:04:55.950 --> 00:04:59.160
besides downloading the model on demand,

106
00:04:59.160 --> 00:05:01.230
when you run it the first time,

107
00:05:01.230 --> 00:05:04.140
you can also use ollama pull

108
00:05:04.140 --> 00:05:06.460
and then that model identifier

109
00:05:08.340 --> 00:05:11.160
to just download it without running it.

110
00:05:11.160 --> 00:05:14.100
Now, in many situations, you of course want to download

111
00:05:14.100 --> 00:05:15.660
and then immediately run it,

112
00:05:15.660 --> 00:05:18.270
but if you know that you will need that model

113
00:05:18.270 --> 00:05:20.700
later that day, but not immediately,

114
00:05:20.700 --> 00:05:23.940
or that you just wanna run it behind the scenes as a server,

115
00:05:23.940 --> 00:05:25.590
something we'll explore later,

116
00:05:25.590 --> 00:05:27.690
then you could also just download it

117
00:05:27.690 --> 00:05:30.273
without running it by using ollama pull.

118
00:05:31.290 --> 00:05:33.240
Either way, once that command is done,

119
00:05:33.240 --> 00:05:34.590
it will have been downloaded,

120
00:05:34.590 --> 00:05:37.980
and then as a next step, you can run it with ollama run,

121
00:05:37.980 --> 00:05:41.370
or if you used ollama run, it will start automatically.
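
NOTE
Editor's sketch, not part of the spoken track: the two workflows described
above (download only, or download and run) might be wrapped like this.
The function names and the MODEL variable are hypothetical, and gemma3:12b is
assumed as the identifier; swap in whichever tag you selected in the dropdown.
The calls are left commented out because the first download is several
gigabytes.
```shell
# Assumed identifier; replace it with the tag you picked on the model page.
MODEL="gemma3:12b"
# Download the model without starting a chat session.
fetch_model() {
  ollama pull "$1"
}
# Start the interactive prompt; this downloads first only if the model is missing.
chat_with_model() {
  ollama run "$1"
}
# Usage (uncomment to execute):
# fetch_model "$MODEL"
# ollama list               # lists the models already on disk
# chat_with_model "$MODEL"
```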

122
00:05:41.370 --> 00:05:44.013
And I'll be back once the download has finished for me,

123
00:05:45.480 --> 00:05:49.050
and now for me, that download process finished.

124
00:05:49.050 --> 00:05:50.730
If you did use ollama run,

125
00:05:50.730 --> 00:05:52.560
it will now already have started.

126
00:05:52.560 --> 00:05:55.590
I, however, used ollama pull, so what I'll do here

127
00:05:55.590 --> 00:05:58.620
to run it is I'll simply run ollama run again,

128
00:05:58.620 --> 00:06:00.300
and it's that same command as before,

129
00:06:00.300 --> 00:06:02.700
but now of course since it was downloaded,

130
00:06:02.700 --> 00:06:06.750
it will not download it, but instead it will just run it.

131
00:06:06.750 --> 00:06:09.870
And once it runs, you'll see something like this,

132
00:06:09.870 --> 00:06:10.980
basically an arrow,

133
00:06:10.980 --> 00:06:14.100
and then it's waiting for your input here,

134
00:06:14.100 --> 00:06:15.780
because again, it doesn't come

135
00:06:15.780 --> 00:06:18.123
with a graphical user interface.

136
00:06:19.110 --> 00:06:23.400
Now, as you see, you can enter /? to get some help,

137
00:06:23.400 --> 00:06:24.840
and that's not a bad idea

138
00:06:24.840 --> 00:06:27.600
because that will tell you that there are a couple

139
00:06:27.600 --> 00:06:29.880
of slash commands available

140
00:06:29.880 --> 00:06:31.830
that you can run here in this mode.

141
00:06:31.830 --> 00:06:34.710
So when it's waiting for your input,

142
00:06:34.710 --> 00:06:36.630
you can set various parameters,

143
00:06:36.630 --> 00:06:40.080
you can view information about the model that's running.

144
00:06:40.080 --> 00:06:43.860
You could also save or load a session, a chat session,

145
00:06:43.860 --> 00:06:47.340
essentially if you wanna save your work, quit the process

146
00:06:47.340 --> 00:06:49.710
and come back to it later, for example.

147
00:06:49.710 --> 00:06:52.530
You can clear your session context,

148
00:06:52.530 --> 00:06:55.410
so clear the chat history if you wanna restart

149
00:06:55.410 --> 00:06:58.200
without closing and restarting the program.

150
00:06:58.200 --> 00:07:01.710
You can close the program by typing /bye,

151
00:07:01.710 --> 00:07:03.360
and you can type this command

152
00:07:03.360 --> 00:07:05.970
to view some helpful keyboard shortcuts

153
00:07:05.970 --> 00:07:07.560
you might wanna use.

154
00:07:07.560 --> 00:07:12.270
So what I'll do here is I'll just ask, hi, how are you?

155
00:07:12.270 --> 00:07:14.310
Which is always kind of a weird question.

156
00:07:14.310 --> 00:07:16.800
I mean, it's just a token generator, but still,

157
00:07:16.800 --> 00:07:19.740
this allows me to show you that this is working

158
00:07:19.740 --> 00:07:21.033
and up and running.

159
00:07:21.870 --> 00:07:24.420
And that is indeed how we can use

160
00:07:24.420 --> 00:07:27.930
an open large language model with the help of Ollama,

161
00:07:27.930 --> 00:07:29.880
and if you wanna chat with it,

162
00:07:29.880 --> 00:07:31.500
that is a great way of doing it.

163
00:07:31.500 --> 00:07:34.074
You might not need a graphical user interface

164
00:07:34.074 --> 00:07:36.600
like the one you get from LM Studio.

165
00:07:36.600 --> 00:07:38.730
For that, you might indeed prefer

166
00:07:38.730 --> 00:07:41.973
the more lightweight approach Ollama has.