WEBVTT

00:00.720 --> 00:02.680
Hello everyone and welcome.

00:03.160 --> 00:06.920
In today's session we will learn about inference parameters.

00:06.920 --> 00:10.080
We will be using inference parameters throughout the course.

00:10.480 --> 00:16.760
However, it's also important that you learn about inference parameters, since these parameters would

00:16.760 --> 00:21.920
help you regulate and manage the responses from the large language models.

00:23.080 --> 00:28.800
So the very first and obvious question is what are inference parameters?

00:29.200 --> 00:36.120
When running model inference, you can adjust the inference parameters to influence the model response.

00:36.560 --> 00:44.000
Inference parameters can change the pool of possible outputs that the model considers during generation,

00:44.240 --> 00:47.960
or they can also limit the final response.

00:48.800 --> 00:56.800
The following categories of parameters are commonly found across different models: randomness and diversity,

00:56.800 --> 00:57.680
and length.

00:58.200 --> 01:02.710
Let's take a deep dive into each one of them and understand what they mean.

01:03.710 --> 01:05.750
Randomness and diversity.

01:06.790 --> 01:11.110
Randomness and diversity refer to the amount of variation in a model's

01:11.110 --> 01:19.350
response. For any given sequence, a model determines a probability distribution over the options for the next token

01:19.350 --> 01:20.510
in the sequence.

01:20.990 --> 01:27.150
You can control these factors by limiting or adjusting the distribution. Foundation

01:27.150 --> 01:34.430
models typically support the following parameters to control the randomness and diversity in the response:

01:35.310 --> 01:36.150
temperature.

01:36.590 --> 01:37.350
Top k.

01:37.950 --> 01:38.750
Top p.

01:39.430 --> 01:41.590
There are other parameters as well.

01:41.790 --> 01:48.950
However, these three parameters are more often used to control the variation in the model's response.

01:49.350 --> 01:55.630
Before we take a deep dive into these parameters, let's first understand certain concepts.

01:56.830 --> 01:57.670
Sampling.

01:58.750 --> 02:01.430
In sampling, we have greedy sampling.

02:01.790 --> 02:08.550
So let's say you're ordering something from a restaurant. Greedy sampling would be equivalent to always

02:08.550 --> 02:13.590
ordering the single most common or most popular dish on the menu.

02:14.070 --> 02:20.790
If the most frequently ordered dish is the Caesar salad, greedy sampling would result in ordering

02:21.110 --> 02:24.230
"I will have the Caesar salad" all the time.

02:24.790 --> 02:32.630
This relates to the language model using top k equals one and temperature equals one, where the model

02:32.630 --> 02:39.710
always chooses the single most likely next word according to its probability distribution.

02:40.470 --> 02:45.510
We'll come back to top k and temperature values later in this video.

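NOTE
A minimal Python sketch of greedy sampling, assuming a toy next-token distribution.
The name greedy_pick and the menu probabilities are illustrative assumptions, not values from the session.
```python
def greedy_pick(probs: dict[str, float]) -> str:
    """Return the single most probable token, as greedy sampling does."""
    return max(probs, key=probs.get)
# Hypothetical menu distribution: the most popular dish wins every time.
menu = {"caesar salad": 0.5, "burger": 0.3, "pasta": 0.2}
choice = greedy_pick(menu)
```
Calling greedy_pick repeatedly on the same distribution returns the same item, which is why greedy output is coherent but never varies.
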
02:46.830 --> 02:52.070
The second sampling that we have to understand is random sampling.

02:52.790 --> 02:58.700
If you're ordering in a restaurant, random sampling would be the equivalent of choosing your order

02:58.700 --> 03:06.300
by literally pulling a menu item at random, with no regards to what type of dish it is, or whether

03:06.300 --> 03:07.780
it even makes sense.

03:08.820 --> 03:15.700
So you might end up ordering something like "I'll have a chicken fried steak soup," or "I'll have a cheeseburger."

03:16.220 --> 03:21.420
Completely random combinations that don't form a coherent dish or a meal.

03:21.860 --> 03:30.100
This relates to language models using moderate values of top k, such as top k equals 50 or top k

03:30.100 --> 03:36.380
equals 100, along with a high temperature of 1.5 to 2.0.

03:36.900 --> 03:44.060
With such settings, the model can generate creative and surprising outputs by sampling from a broad

03:44.060 --> 03:45.980
set of potential next words.

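NOTE
A minimal Python sketch of random sampling, assuming it means drawing each item in proportion to its probability, which is the usual meaning for language models.
The name sample_pick, the seed, and the menu probabilities are illustrative assumptions.
```python
import random
def sample_pick(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
menu = {"caesar salad": 0.5, "burger": 0.3, "pasta": 0.2}
orders = [sample_pick(menu, rng) for _ in range(1000)]
```
Unlike the greedy case, the less popular items do get ordered, just less often than the popular ones.
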
03:47.260 --> 03:48.980
So let's do a quick summary.

03:49.460 --> 03:57.410
Greedy sampling leads to coherent but uncreative outputs like always ordering the single most popular

03:57.410 --> 03:57.970
item.

03:58.490 --> 04:05.650
Random sampling, on the other hand, enables maximum creativity, but outputs are incoherent, like

04:05.650 --> 04:06.370
ordering

04:06.370 --> 04:10.210
by pulling any menu item out of a hat at random.

04:10.490 --> 04:17.410
The goal is tuning these parameters to achieve a desired balance between coherence, that is, sticking

04:17.450 --> 04:25.650
to a conventional menu item, and creativity, that is, sometimes ordering something new or unexpected,

04:25.650 --> 04:27.730
for the specific use case.

04:28.770 --> 04:32.050
So let's first understand the top k parameter.

04:32.330 --> 04:38.930
Top k limits the model's output to the top k most probable tokens at each step.

04:39.370 --> 04:45.090
This will help reduce incoherent or nonsensical output by restricting the model's vocabulary.

04:45.450 --> 04:50.410
Let's say for the prompt, I have the words in the vocabulary and the probabilities are: mat, with

04:50.410 --> 04:52.450
probability of 0.6;

04:52.690 --> 04:53.330
couch,

04:53.650 --> 04:56.810
with probability of 0.2; bed,

04:57.050 --> 04:59.250
0.1; chair,

04:59.610 --> 05:06.690
0.05; car, 0.01; bike, 0.01;

05:07.210 --> 05:09.290
bucket, 0.03.

05:10.050 --> 05:13.530
With top k sampling, let's say k equals three.

05:13.770 --> 05:14.850
It does the following.

05:15.250 --> 05:20.330
It considers only the top three highest-probability words in the distribution after sorting them.

05:20.530 --> 05:29.410
So if k equals three, it only considers the words mat, couch, and bed, since they have the highest probabilities.

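NOTE
A minimal Python sketch of top k filtering over the seven-word vocabulary from this example.
Assumptions: bucket is taken as 0.03 so that the probabilities sum to one, the survivors are renormalized, and the name top_k_filter is illustrative.
```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}
vocab = {"mat": 0.6, "couch": 0.2, "bed": 0.1, "chair": 0.05,
         "car": 0.01, "bike": 0.01, "bucket": 0.03}
top3 = top_k_filter(vocab, k=3)
```
With k equals three, only mat, couch, and bed remain, and the model samples its next word from those three.
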
05:30.890 --> 05:32.650
Top p parameter.

05:33.090 --> 05:40.330
Top p keeps the most probable tokens until their cumulative probability reaches a specified threshold, that is,

05:40.370 --> 05:40.810
p, and filters out the rest.

05:41.250 --> 05:47.650
It allows for more diversity in the output while still avoiding low probability tokens.

05:48.050 --> 05:55.680
Let's say for the prompt, I will have the words in the vocabulary and the probabilities are: salad,

05:55.720 --> 06:00.640
0.4; burger, 0.3; pasta, 0.1;

06:01.040 --> 06:03.320
steak, 0.08.

06:03.680 --> 06:10.800
When you specify a top p parameter value of 0.8, it will include salad with 0.4,

06:11.240 --> 06:13.080
burger with 0.3,

06:13.440 --> 06:15.680
and pasta with 0.1.

06:16.280 --> 06:24.920
Since the cumulative probability of all three of them is 0.8, this covers 80% of the probability mass

06:24.920 --> 06:26.760
in just the top three words.

06:26.760 --> 06:30.720
And it would drop steak, which is 0.08.

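NOTE
A minimal Python sketch of top p filtering with the menu probabilities from this example.
Assumptions: the name top_p_filter is illustrative, the kept probabilities are not renormalized, and a small tolerance absorbs floating-point rounding.
```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the most probable tokens until their cumulative probability reaches p."""
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            break
    return kept
menu = {"salad": 0.4, "burger": 0.3, "pasta": 0.1, "steak": 0.08}
nucleus = top_p_filter(menu, p=0.8)
```
With p equals 0.8, the loop stops once salad, burger, and pasta have been added, so steak is dropped.
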
06:31.800 --> 06:35.040
Let's recap top k and top p parameters.

06:35.480 --> 06:44.160
Top p sampling with p equals 0.8 can consider a broader, more inclusive set of words compared

06:44.160 --> 06:48.920
to using top k equals three. With top k equals three,

06:49.240 --> 06:54.990
the model only considers the top three highest-probability words after the context,

06:55.030 --> 07:03.350
no matter how low the probabilities are. With top p at 0.8, it will include as many words as needed

07:03.350 --> 07:07.110
until the cumulative probability reaches 0.8.

07:08.150 --> 07:11.350
Now let's go and understand the third inference parameter.

07:11.510 --> 07:12.310
Temperature.

07:12.870 --> 07:19.630
Temperature adjusts the randomness or confidence level of the model's prediction by scaling the log

07:19.630 --> 07:20.910
probabilities.

07:21.390 --> 07:29.030
A higher temperature leads to more diverse but potentially nonsensical outputs, while a lower temperature leads

07:29.030 --> 07:32.510
to more focused and predictable responses.

07:33.110 --> 07:43.270
A low temperature value, say 0.2 or 0.5, makes the model more confident and sharpens

07:43.270 --> 07:49.790
the probability distribution, whereas a high temperature value of greater than one makes the model's

07:49.950 --> 07:53.230
predictions more spread out and uncertain.

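NOTE
A minimal Python sketch of temperature scaling, assuming every probability is positive and the temperature is greater than zero.
It divides each log probability by the temperature and renormalizes. The name apply_temperature and the toy distribution are illustrative assumptions.
```python
import math
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide log probabilities by the temperature, then renormalize."""
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}
# Hypothetical distribution used only to show the effect.
probs = {"salad": 0.7, "burger": 0.2, "pasta": 0.1}
sharp = apply_temperature(probs, 0.5)  # low temperature: more peaked
flat = apply_temperature(probs, 2.0)   # high temperature: flatter
```
At 0.5 the top item gains probability, and at 2.0 it loses some to the less likely items, although it remains the most likely one.
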
07:54.270 --> 08:00.070
Now that we have learned about these parameters, let's apply the inference parameters.

08:00.510 --> 08:02.630
Consider the example prompt.

08:03.030 --> 08:05.030
"I hear the hoof beats of."

08:05.550 --> 08:11.950
The model determines the following three words to be the candidates for the next token, and the model

08:11.950 --> 08:14.990
also assigns a probability to each word.

08:15.470 --> 08:18.390
Horse has 0.7 probability.

08:18.750 --> 08:25.510
Zebras has 0.2, and unicorn has 0.1 probability.

08:26.110 --> 08:28.030
So how does this all come together?

08:28.390 --> 08:31.670
Let's say we set the temperature at 0.8.

08:32.110 --> 08:37.190
Top K as two and top P as 0.7.

08:37.830 --> 08:44.790
The very first step is temperature: if you set a high temperature, the probability distribution is flattened and the

08:44.790 --> 08:51.820
probabilities become less different, which would increase the probability of choosing unicorn and

08:51.820 --> 08:54.820
decrease the probability of choosing horse.

08:55.180 --> 08:57.700
So it depends on what you want to select.

08:57.700 --> 09:02.980
If you select 0.8 or 0.5, it would most likely choose horse.

09:03.300 --> 09:11.180
But if you choose 1.5 or 2, the model is more likely to choose unicorn as the next possible token.

09:11.660 --> 09:15.820
The next step it would consider is the top k equals two

09:15.860 --> 09:16.660
filtering.

09:17.060 --> 09:20.980
It selects the top two candidates, horses and zebras.

09:21.700 --> 09:24.140
Top p equals 0.7

09:24.140 --> 09:25.980
filtering is the third step.

09:25.980 --> 09:30.500
It works on what is left after top k equals two.

09:30.780 --> 09:38.740
It then applies top p equals 0.7 filtering from the highest to the lowest scaled probability, keeping

09:38.740 --> 09:44.340
the words in order until the cumulative probability mass reaches 0.7.

09:45.500 --> 09:53.610
So in this case, the model only considers horse, because horse alone already reaches the 0.7 threshold.

09:54.090 --> 10:03.370
However, if the top p value were 0.9, the model would consider horses and zebras as the next probable tokens.

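NOTE
A minimal Python sketch that chains the three steps in the order described here: temperature, then top k, then top p.
Assumptions: the function names are illustrative, nothing is renormalized after the top k step, and real inference libraries may order or renormalize these steps differently.
```python
import math
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide log probabilities by the temperature, then renormalize."""
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens, highest first."""
    return dict(sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k])
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep tokens, highest first, until their cumulative probability reaches p."""
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            break
    return kept
def candidates(probs: dict[str, float], temperature: float, k: int, p: float) -> list[str]:
    """Return the tokens that survive temperature, top k, and top p, in that order."""
    scaled = apply_temperature(probs, temperature)
    return list(top_p_filter(top_k_filter(scaled, k), p))
probs = {"horses": 0.7, "zebras": 0.2, "unicorns": 0.1}
narrow = candidates(probs, temperature=0.8, k=2, p=0.7)
wider = candidates(probs, temperature=0.8, k=2, p=0.9)
```
With top p at 0.7 only horses survives, and with top p at 0.9 both horses and zebras survive, matching the walkthrough.
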
10:03.810 --> 10:08.170
Those were the inference parameters that control the randomness and diversity of the model's behavior.

10:08.610 --> 10:11.250
The other category we have is length.

10:11.770 --> 10:17.690
Foundation models typically support parameters that limit the length of the responses.

10:18.290 --> 10:21.450
Examples of these parameters are provided below.

10:21.930 --> 10:24.530
Response length: an exact value

10:24.530 --> 10:31.850
to specify the minimum or maximum number of tokens to return in the generated response.

10:32.410 --> 10:40.290
It really helps to keep the model from producing excessively long responses, which

10:40.290 --> 10:44.490
increase the latency of the generated responses.

10:45.490 --> 10:48.690
The other parameter we have is stop sequences.

10:49.130 --> 10:54.970
These specify sequences of characters that stop the model from generating further tokens.

10:55.450 --> 11:02.450
If the model generates a stop sequence that you specify, it will stop generating after that sequence.

11:03.370 --> 11:06.090
A third length parameter is penalties.
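NOTE
A minimal Python sketch of how a maximum length and stop sequences limit a response, applied here to finished text for simplicity.
Assumptions: the name truncate_response is illustrative, a whitespace split stands in for a real tokenizer, and the stop sequence itself is kept. Actual models apply these limits while generating.
```python
def truncate_response(text: str, stop_sequences: list[str], max_tokens: int) -> str:
    """Cut a response after the earliest stop sequence, then cap its token count."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index + len(stop))  # keep the stop sequence itself
    tokens = text[:cut].split()  # whitespace split stands in for a real tokenizer
    return " ".join(tokens[:max_tokens])
reply = truncate_response("Answer: 42. END extra text", ["END"], max_tokens=10)
```
Everything after END is discarded, and a smaller max_tokens would shorten the reply further.
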

11:06.650 --> 11:12.010
You can specify the degree to which the model penalizes certain outputs in a response.

11:12.410 --> 11:20.290
Examples include length of the response, repeated tokens in the response, frequency of tokens in a

11:20.290 --> 11:24.810
response, and the type of tokens in a response.

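NOTE
A minimal Python sketch of one common form of frequency penalty: each token's score drops by the penalty times the number of times it has already appeared.
The session does not give a formula, so this formulation, the name apply_frequency_penalty, and the toy scores are all assumptions.
```python
from collections import Counter
def apply_frequency_penalty(scores: dict[str, float], generated: list[str], penalty: float) -> dict[str, float]:
    """Lower each token's score by penalty times its count in the output so far."""
    counts = Counter(generated)
    return {token: score - penalty * counts[token] for token, score in scores.items()}
# Hypothetical raw scores for three candidate tokens.
scores = {"the": 2.0, "cat": 1.0, "sat": 0.5}
adjusted = apply_frequency_penalty(scores, ["the", "cat", "the"], penalty=0.5)
```
Here "the" has appeared twice, so it loses the most and becomes less likely to be repeated.
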
11:25.290 --> 11:29.730
I hope you now understand the inference parameters very well.

11:29.930 --> 11:31.370
We took a deep dive here.

11:31.850 --> 11:34.090
It's very important to go through this.

11:34.410 --> 11:41.170
It will really help you understand how the model response generation can be governed with these parameters.