WEBVTT

1
00:00:00.300 --> 00:00:03.450
<v Maximilian>The things that should be loaded into memory</v>

2
00:00:03.450 --> 00:00:07.650
when performing inference are, in the end, all the weights,

3
00:00:07.650 --> 00:00:09.360
the so-called parameters

4
00:00:09.360 --> 00:00:12.300
that are associated with these connections

5
00:00:12.300 --> 00:00:16.020
between these different neurons, these different nodes,

6
00:00:16.020 --> 00:00:18.099
that make up the neural network

7
00:00:18.099 --> 00:00:20.700
that was trained to produce this model.

8
00:00:20.700 --> 00:00:23.430
So a large language model, in the end,

9
00:00:23.430 --> 00:00:25.440
is such a neural network.

10
00:00:25.440 --> 00:00:28.500
You could say it was trained as a neural network

11
00:00:28.500 --> 00:00:30.660
with a certain training algorithm.

12
00:00:30.660 --> 00:00:33.150
And as a result of this training process,

13
00:00:33.150 --> 00:00:36.750
we got billions of parameters

14
00:00:36.750 --> 00:00:38.640
because we got billions of connections.

15
00:00:38.640 --> 00:00:40.380
And every connection has a parameter,

16
00:00:40.380 --> 00:00:42.840
a weight associated with it,

17
00:00:42.840 --> 00:00:45.870
that simply describes how a value is transformed

18
00:00:45.870 --> 00:00:48.990
when traveling through these different nodes.

19
00:00:48.990 --> 00:00:51.270
Now, what is the value I'm talking about?

20
00:00:51.270 --> 00:00:54.420
It's, in the end, your prompt, which, to be precise,

21
00:00:54.420 --> 00:00:57.870
is actually deconstructed into so-called tokens,

22
00:00:57.870 --> 00:01:00.660
where every token has a token ID.

23
00:01:00.660 --> 00:01:02.520
And there are tools out there,

24
00:01:02.520 --> 00:01:05.370
you'll find a link to one attached to this lecture,

25
00:01:05.370 --> 00:01:09.630
that show you how a given prompt for a given AI model

26
00:01:09.630 --> 00:01:13.773
is translated to tokens and then to token IDs.

27
00:01:14.700 --> 00:01:18.450
And it's these IDs that are actually used by the model,

28
00:01:18.450 --> 00:01:19.293
you could say.

29
00:01:20.160 --> 00:01:21.450
Now, technically, it's, of course,

30
00:01:21.450 --> 00:01:22.950
a bit more complex than that,

31
00:01:22.950 --> 00:01:25.260
but you can think of these token IDs

32
00:01:25.260 --> 00:01:27.600
as being fed into this neural network.

33
00:01:27.600 --> 00:01:29.790
There, every connection has a weight.

34
00:01:29.790 --> 00:01:33.840
And as these IDs are fed through that trained network

35
00:01:33.840 --> 00:01:36.150
where the weights, of course, have been derived

36
00:01:36.150 --> 00:01:37.890
based on the training data,

37
00:01:37.890 --> 00:01:41.520
in the end as output, you get new token IDs,

38
00:01:41.520 --> 00:01:44.160
which are then translated back to human-readable text.

39
00:01:44.160 --> 00:01:45.690
And that is then, in the end,

40
00:01:45.690 --> 00:01:48.990
the response that was generated by the model.
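
NOTE
A toy sketch of the round trip described above: text to tokens, tokens to token IDs, and IDs back to text. The vocabulary (VOCAB), the whitespace splitting, and the encode/decode helpers are made up for illustration only. Real tokenizers split text into sub-word pieces and use a vocabulary learned for the specific model, and the network in between is left out here entirely.
```python
# Hypothetical, hand-written vocabulary; every real model ships its own.
VOCAB = {"how": 0, "much": 1, "memory": 2, "is": 3, "required": 4, "?": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}
def encode(text):
    """Split a prompt into tokens and map each token to its ID."""
    tokens = text.lower().replace("?", " ?").split()
    return [VOCAB[token] for token in tokens]
def decode(token_ids):
    """Map token IDs back to tokens and join them into readable text."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in token_ids)
prompt_ids = encode("How much memory is required?")
print(prompt_ids)
print(decode(prompt_ids))
```
The model itself only ever sees the list of IDs, and what it produces is again a list of IDs that has to be decoded.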

41
00:01:48.990 --> 00:01:51.900
So these parameters, these weights are important.

42
00:01:51.900 --> 00:01:54.330
And you might have heard the term parameter

43
00:01:54.330 --> 00:01:57.330
in conjunction with large language models before.

44
00:01:57.330 --> 00:01:58.530
You also might have heard

45
00:01:58.530 --> 00:02:01.770
that some of the most capable large language models,

46
00:02:01.770 --> 00:02:06.770
like DeepSeek R1, have hundreds of billions of parameters.

47
00:02:07.080 --> 00:02:09.720
Now, the open models I'll cover in this course

48
00:02:09.720 --> 00:02:11.400
are typically a bit smaller.

49
00:02:11.400 --> 00:02:13.260
For example, the Gemma 3 model,

50
00:02:13.260 --> 00:02:16.207
which is the one I will use primarily in this course,

51
00:02:16.207 --> 00:02:21.090
"only," in quotes, has around 27 billion parameters.

52
00:02:21.090 --> 00:02:22.500
But that still is quite a lot,

53
00:02:22.500 --> 00:02:24.864
even if it is far less than the 600,

54
00:02:24.864 --> 00:02:28.680
700 billion parameters DeepSeek R1 might have,

55
00:02:28.680 --> 00:02:30.990
which, by the way, also is an open model.

56
00:02:30.990 --> 00:02:33.390
You could run that locally as well

57
00:02:33.390 --> 00:02:36.150
if you have the right hardware for that.

58
00:02:36.150 --> 00:02:39.810
Because, of course, this information about the parameters

59
00:02:39.810 --> 00:02:41.730
brings us back to these hardware requirements.

60
00:02:41.730 --> 00:02:43.470
Because I mentioned that the model

61
00:02:43.470 --> 00:02:46.650
should be loaded entirely into VRAM,

62
00:02:46.650 --> 00:02:49.890
or system memory if you don't have any VRAM.

63
00:02:49.890 --> 00:02:52.500
And what I mean by loading the model

64
00:02:52.500 --> 00:02:55.175
is loading all the model parameters

65
00:02:55.175 --> 00:02:59.400
and then all the context, as I explained before.

66
00:02:59.400 --> 00:03:01.890
So all these billions of parameters

67
00:03:01.890 --> 00:03:04.440
must be loaded into your computer memory,

68
00:03:04.440 --> 00:03:08.670
preferably into the VRAM, the memory of your GPU.

69
00:03:08.670 --> 00:03:11.640
So how much memory is that then?

70
00:03:11.640 --> 00:03:14.220
How much memory is required?

71
00:03:14.220 --> 00:03:17.640
Well, every parameter, typically,

72
00:03:17.640 --> 00:03:21.960
is a float32 or float16 value

73
00:03:21.960 --> 00:03:24.240
after that initial training process.

74
00:03:24.240 --> 00:03:25.260
Now, what does this mean?

75
00:03:25.260 --> 00:03:28.680
These are simply data types used in computer science,

76
00:03:28.680 --> 00:03:30.420
used in programming,

77
00:03:30.420 --> 00:03:34.950
and specifically, they're data types that describe fractions

78
00:03:34.950 --> 00:03:36.690
or fractional numbers,

79
00:03:36.690 --> 00:03:41.690
numbers that have digits after the decimal point.

80
00:03:41.940 --> 00:03:45.150
Now, the difference between float32 and float16

81
00:03:45.150 --> 00:03:47.520
is the precision of that number.

82
00:03:47.520 --> 00:03:49.590
And you don't need to understand

83
00:03:49.590 --> 00:03:52.380
the internals for this course, or in general,

84
00:03:52.380 --> 00:03:54.960
for interacting with large language models,

85
00:03:54.960 --> 00:03:58.380
but you can simply think of a float32 number

86
00:03:58.380 --> 00:04:01.920
as being capable of storing more information

87
00:04:01.920 --> 00:04:03.900
about such a fractional number.

88
00:04:03.900 --> 00:04:06.930
It's simply able to store longer fractions,

89
00:04:06.930 --> 00:04:09.000
longer decimal numbers.

90
00:04:09.000 --> 00:04:12.000
Float16 is less precise.

91
00:04:12.000 --> 00:04:14.640
Now, internally, on your computer,

92
00:04:14.640 --> 00:04:18.780
every such number must be stored in some space in memory

93
00:04:18.780 --> 00:04:20.850
because the entire model should be loaded.

94
00:04:20.850 --> 00:04:25.603
And here, a float32 value takes up four bytes,

95
00:04:25.603 --> 00:04:29.760
and a float16 value takes up two bytes in memory.

96
00:04:29.760 --> 00:04:34.230
So that's four or two bytes for every single parameter,

97
00:04:34.230 --> 00:04:36.000
every single weight.
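
NOTE
A small sketch of the two points above, using only Python's standard struct module: the size of each data type in bytes, and the precision lost when the same number is stored as float16 instead of float32. The example weight and the round_trip helper are made up for illustration and are not taken from any real model.
```python
import struct
# struct format "f" is a 32-bit float, "e" is a 16-bit (half precision) float.
FLOAT32_BYTES = struct.calcsize("<f")
FLOAT16_BYTES = struct.calcsize("<e")
def round_trip(value, fmt):
    """Store a value in the given float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]
weight = 0.1234567  # made-up example weight
as_float32 = round_trip(weight, "<f")
as_float16 = round_trip(weight, "<e")
print("bytes per value:", FLOAT32_BYTES, FLOAT16_BYTES)
print("float32 error:", abs(as_float32 - weight))
print("float16 error:", abs(as_float16 - weight))
```
The float16 value should come back noticeably further from the original number than the float32 value, which is the price paid for halving the memory per parameter.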

98
00:04:36.000 --> 00:04:36.870
And therefore,

99
00:04:36.870 --> 00:04:40.050
even if you had just a 2 billion parameter model,

100
00:04:40.050 --> 00:04:41.370
which is really small.

101
00:04:41.370 --> 00:04:43.590
Keep in mind, we are talking about 600

102
00:04:43.590 --> 00:04:46.140
or 700 billion parameter models.

103
00:04:46.140 --> 00:04:47.550
And I mentioned that for this course,

104
00:04:47.550 --> 00:04:52.200
I would dive into using a 27 billion parameter model.

105
00:04:52.200 --> 00:04:55.560
So a 2 billion parameter model is really small,

106
00:04:55.560 --> 00:04:57.900
but even that would require

107
00:04:57.900 --> 00:05:00.540
four to eight gigabytes of memory

108
00:05:00.540 --> 00:05:04.710
if you wanted to load it as is into memory,

109
00:05:04.710 --> 00:05:07.080
or into VRAM preferably.

110
00:05:07.080 --> 00:05:10.830
And that's, of course, absolutely doable for many computers.

111
00:05:10.830 --> 00:05:12.150
But if you, again,

112
00:05:12.150 --> 00:05:15.630
think of that 27 billion parameter model,

113
00:05:15.630 --> 00:05:19.260
well, that would require 54,

114
00:05:19.260 --> 00:05:24.260
or maybe even around 100 gigabytes of video RAM.
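
NOTE
The arithmetic behind these numbers can be sketched as parameter count times bytes per parameter. This is a rough estimate for the weights alone, assuming 1 GB = 10**9 bytes. It ignores the context that also has to be held in memory, and the function name weights_memory_gb is made up for this example.
```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}
def weights_memory_gb(num_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9
for billions in (2, 27):
    for dtype in ("float16", "float32"):
        gb = weights_memory_gb(billions * 10**9, dtype)
        print(f"{billions}B parameters as {dtype}: {gb:.0f} GB")
```
For the 2 billion parameter model this gives 4 and 8 GB, and for the 27 billion parameter model 54 and 108 GB, matching the figures mentioned in the lecture.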

115
00:05:25.050 --> 00:05:29.880
And most computers and most GPUs don't have that.

116
00:05:29.880 --> 00:05:32.523
So that's a problem we need to solve.