WEBVTT

00:00.360 --> 00:02.440
Hello everyone and welcome.

00:02.880 --> 00:09.520
In this session we will do a hands on activity on Llama guard 311 B vision model.

00:10.040 --> 00:16.240
For that, you'll have to go to the hugging face and request access to this model.

00:16.360 --> 00:18.960
I'm on this hugging face interface.

00:19.200 --> 00:21.960
And if you notice here this is the model card.

00:22.280 --> 00:25.960
You can read about all the details that they have provided.

00:26.400 --> 00:29.920
Very first thing that you'll have to do is request access.

00:29.920 --> 00:33.280
Now I already raised the request for this model.

00:33.600 --> 00:35.280
I'll show you how this looks like.

00:36.080 --> 00:40.520
So here is the request access that comes along on the model card.

00:40.680 --> 00:46.960
Before you raise a request, it says you have to agree to the terms and share your contact information

00:46.960 --> 00:48.600
to access this model.

00:48.760 --> 00:51.160
And then there is a license agreement.

00:51.160 --> 00:52.440
Expand this.

00:52.440 --> 00:55.240
And all the way at the bottom you send the request.

00:55.240 --> 00:58.640
It gets approved in few minutes to maybe couple hours.

00:59.040 --> 01:01.240
Usually it comes along very quickly.

01:01.360 --> 01:03.870
I'm here on the Google Colab interface.

01:03.870 --> 01:06.430
If you notice here I have the Pro account.

01:07.030 --> 01:12.190
It's very important that you understand that Llama Guard three vision is a very big model.

01:12.550 --> 01:14.750
It needs a lot of compute and memory.

01:15.350 --> 01:23.510
So in this case the resources that I'm using is system Ram as 83.5GB, GPU Ram is 40GB, and the disk

01:23.510 --> 01:25.750
memory is 235 gigabyte.

01:26.230 --> 01:26.510
Okay.

01:26.550 --> 01:28.710
So now I'll go ahead and start with the code.

01:28.710 --> 01:31.510
First check what are my compute allocations.

01:35.390 --> 01:37.190
So I'll run Nvidia SMI command.

01:38.750 --> 01:41.870
So here you get all the details about the Nvidia processes.

01:42.030 --> 01:45.190
How much is compute and memory allocation for the notebook.

01:45.550 --> 01:48.750
Right now no job is running so no processes are running.

01:49.270 --> 01:51.430
Let's go ahead and upgrade the transformers.

01:52.710 --> 01:56.430
So now I'll go ahead and execute the command upgrade transformer.

02:11.900 --> 02:13.980
So I upgraded the transformers.

02:16.540 --> 02:21.780
And then I'll import certain classes from Transformers that we need for running the model.

02:22.380 --> 02:30.580
In this case, I have imported Lama for condition generation auto processor, Lama processor for us

02:30.580 --> 02:32.700
to help download and run the model.

02:33.140 --> 02:36.780
Then from typing import list any.

02:41.100 --> 02:42.540
And also import torch.

02:45.660 --> 02:48.860
I'll go ahead and provide the model ID.

02:53.220 --> 02:58.020
So now with Lama guard multimodal model ID I'll go ahead and load the model.

02:58.300 --> 02:59.780
Looks like there is a typo.

03:08.370 --> 03:13.530
Now I have used the tokenizer from llama processor by passing model id to processor.

03:14.090 --> 03:19.930
Let's get the model from llama for condition generation from pre-trained method where pass model id

03:20.530 --> 03:23.170
torch and device type is auto.

03:23.530 --> 03:24.410
Let me run this.

03:32.330 --> 03:34.250
It takes a while for this to run.

03:35.410 --> 03:37.290
Now we have two problems here.

03:37.290 --> 03:39.850
We don't have a token in our colab secrets.

03:39.850 --> 03:45.250
So let me go ahead and add the code for for accessing the hugging face interface.

03:45.930 --> 03:47.250
I missed a step here.

03:47.250 --> 03:51.050
Before accessing the models from Hugging Face Hub, go to login.

03:51.610 --> 03:54.090
In this case I'll have to provide a token.

03:54.890 --> 03:58.410
I will go to the hugging face interface and get my token.

03:59.090 --> 04:00.410
Got token and login.

04:02.330 --> 04:03.720
Now run the interface.

04:06.840 --> 04:08.560
Now there's a different problem.

04:08.560 --> 04:10.760
It's a client error where you're forbidden.

04:10.760 --> 04:15.680
For this URL, there is a special permission for the model to be publicly accessible.

04:15.680 --> 04:17.200
I'll share that with you right now.

04:17.440 --> 04:20.000
So under your account, go to Access tokens.

04:20.360 --> 04:29.040
When you generate this access token, you have to go edit permission and then give read access to all

04:29.040 --> 04:31.680
public gated repository you can access.

04:35.800 --> 04:41.120
It is very important for you to get access to this model in the Colab account that you own.

04:41.800 --> 04:45.440
I'll come back here to the Google Colab account and run this.

04:48.520 --> 04:49.880
It works seamlessly.

04:51.440 --> 04:53.160
Also, this is a huge model.

04:53.160 --> 04:58.840
It takes time for the notebook to download the model, so I'll go ahead and pause the video until the

04:58.840 --> 05:00.160
model is downloaded.

05:02.040 --> 05:04.510
I will resume the video when this is done.

05:05.550 --> 05:10.230
If you notice the model is downloaded, it's a very big model for hosting.

05:13.670 --> 05:20.070
So here I have used two methods that would help us with the execution of the Llama guard three vision

05:20.070 --> 05:20.790
model.

05:20.790 --> 05:22.670
One of them is display image.

05:22.670 --> 05:29.110
Display image will be used to display the image that will be part of the conversation.

05:29.790 --> 05:32.670
And the other method is llama guard test.

05:33.310 --> 05:34.950
M stands for multimodal.

05:34.950 --> 05:37.550
This method has tokenizer as an input.

05:37.550 --> 05:38.790
Along with tokenizer.

05:38.790 --> 05:46.510
We also pass model the actual model, which will generate the output the conversation image category

05:47.070 --> 05:48.870
and exclude category keys.

05:52.070 --> 05:56.470
Categories is a very important upgrade to the Llama Guard three model.

05:56.990 --> 06:04.070
It would help users add their custom categories that they want to add on top of the 13 default categories

06:04.070 --> 06:05.100
that they offer.

06:05.460 --> 06:08.860
You can also exclude the categories that you don't want.

06:09.180 --> 06:13.620
So let's say you have a model and you want to use it for different use cases.

06:14.260 --> 06:18.620
Every use case has different categories that you want to guardrail against.

06:19.060 --> 06:23.420
So you can exclude the categories that you don't want for your use case.

06:23.820 --> 06:28.020
It's a very convenient feature where you can exclude these categories.

06:28.460 --> 06:31.100
So now let's go ahead and understand the code here.

06:33.100 --> 06:40.100
In this code block if category is not none which basically means that you're providing the categories,

06:40.540 --> 06:44.740
then tokenizer will apply the chat template with the conversation.

06:46.020 --> 06:48.220
You add generation prompt as true.

06:49.460 --> 06:50.900
Set Tokenizes false.

06:51.180 --> 06:52.820
So I set this up as false.

06:52.820 --> 06:56.100
There are no special tokens, so this is false as well.

06:56.100 --> 07:00.780
Then you provide the list of custom categories and exclude category keys.

07:02.980 --> 07:04.860
If you want to exclude anything.

07:05.020 --> 07:07.050
This can be an empty list as well.

07:07.970 --> 07:14.370
And in case you don't have custom categories, then everything else is very similar to what we did above,

07:14.690 --> 07:16.690
except that there are no categories.

07:19.050 --> 07:21.330
So that's what this else block is all about.

07:22.890 --> 07:29.490
Then we will use tokenizer to tokenize the input, and we pass the template along with the image, which

07:29.490 --> 07:31.650
will return the tensors as dt.

07:33.530 --> 07:38.810
Then we'll use the model instance to generate the output responses with inputs.

07:38.810 --> 07:40.250
Do sample as false.

07:40.290 --> 07:45.210
Top p and temperature values are none along with the max new token as 50.

07:45.730 --> 07:51.490
When the output is generated, then we use tokenizer to decode the output, get the response back,

07:51.490 --> 07:52.410
and return.

07:58.410 --> 08:03.090
Here is the code block that would help us invoke the model with necessary inputs.

08:03.250 --> 08:06.720
These inputs are an image Example tax.

08:08.160 --> 08:14.160
So in the content directory in the Colab, I have uploaded the example text jpeg image, and then I

08:14.200 --> 08:16.040
convert it to the RGB format.

08:16.080 --> 08:21.000
Here is the conversation where the role is user and the content is type text.

08:21.040 --> 08:22.240
How can I file tax.

08:22.240 --> 08:24.040
And then we have type as image.

08:24.880 --> 08:29.320
I have commented out this this code that we will go through it later on.

08:33.560 --> 08:38.040
However, with this input that we have I'll pass it to the llama guard.

08:38.040 --> 08:42.480
Multimodal test method with tokenizer model, conversation and image.

08:43.640 --> 08:47.960
Let's go ahead and execute this method or this block of code.

08:54.680 --> 08:56.160
So now this code ran.

08:56.160 --> 09:02.480
And if you notice here we have begin of text start header ID as user and header ID.

09:02.760 --> 09:08.670
And then there is an image because since it sends the image to the model and check if there is unsafe

09:08.670 --> 09:09.310
content.

09:09.870 --> 09:12.870
And here are the default categories that it provides.

09:12.870 --> 09:15.430
The user is asking how can I file taxes?

09:15.430 --> 09:18.350
And then this is the system prompt.

09:18.350 --> 09:20.510
And the response came back as safe.

09:20.630 --> 09:21.910
This is end of turn.

09:23.310 --> 09:29.710
So this is an example where the image was processed with the conversation and determined to be safe.

09:29.950 --> 09:32.750
Now let's understand and do a couple of other things.

09:33.750 --> 09:36.550
How can I provide my own custom categories?

09:36.550 --> 09:40.430
So for that I will have to add the custom category.

09:40.430 --> 09:42.630
I'll go ahead and add the categories here.

09:43.950 --> 09:51.350
So I provided a custom category dictionary with S1 as custom category one and S2 as another category

09:51.350 --> 09:53.550
that says this will be removed, right?

09:53.590 --> 09:55.670
So technically this won't be removed.

09:56.630 --> 09:58.350
This is another category two.

10:00.270 --> 10:01.830
And let's go ahead and run this.

10:05.190 --> 10:08.020
Now I have to pass this value to the method.

10:16.180 --> 10:17.700
Let's run this one more time.

10:22.100 --> 10:22.860
There you go.

10:22.860 --> 10:25.180
So now you have custom category one.

10:25.220 --> 10:27.500
AI models should not talk about this.

10:27.500 --> 10:31.460
This is the custom category one and this is another category.

10:31.940 --> 10:34.580
So now we provided two custom categories.

10:34.580 --> 10:37.500
That model will evaluate on only these categories.

10:37.500 --> 10:40.020
But what about excluding certain categories.

10:40.340 --> 10:46.300
So let's say out of the 13 categories that were provided you don't want to evaluate S2.

10:54.060 --> 10:58.780
So in this case what we can do is we can exclude category S2 as a list.

10:58.780 --> 11:02.340
And then I'll pass exclude category keys to the method.

11:12.690 --> 11:14.330
And then let's execute this.

11:19.650 --> 11:21.330
If you notice there is no S2.

11:21.370 --> 11:23.170
It says S1 and S3.

11:23.410 --> 11:25.250
You can also now do S7.

11:25.250 --> 11:28.010
Privacy is different for different companies.

11:28.010 --> 11:29.610
So let's exclude privacy.

11:38.530 --> 11:43.290
So if you notice it does unsafe content categories without S7.

11:44.170 --> 11:50.210
I hope you learned about Llama Guard three and how it works with default and custom categories.

11:51.890 --> 11:54.290
You can also exclude certain categories.

11:55.130 --> 12:00.850
There has been more configuration options available with this model compared to its previous version,

12:01.690 --> 12:05.370
and it also does the image processing along with the prompt.

12:06.250 --> 12:07.170
Thank you so much.

12:07.170 --> 12:08.810
I'll see you in the next video.