WEBVTT

00:00.560 --> 00:02.800
Hello everyone and welcome.

00:03.000 --> 00:08.240
In this video we will learn about another probe: Language Model Risk Cards.

00:08.720 --> 00:14.640
This probe is all about how a language model can be vulnerable to the different risks that come up with text generation.

00:15.120 --> 00:17.680
The white paper of this probe can be found here.

00:17.880 --> 00:24.360
And the gist of it is that, as with all language, text generated by a language model can be harmful

00:24.360 --> 00:26.920
or used to bring about harm.

00:27.800 --> 00:35.000
Automating language generation adds both an element of scale and also more subtle or emergent undesirable

00:35.000 --> 00:36.920
tendencies to the generated text.

00:37.640 --> 00:43.480
So the different categories under the LMRC probes are bullying, deadnaming, profanity, and so forth.

00:43.960 --> 00:51.760
Let's go ahead and execute one of the categories here, profanity, and probe the model to find the vulnerability,

00:51.760 --> 00:52.320
if any.

00:52.720 --> 00:57.640
So for that I have created a YAML file with the different configurations that we went through.

00:58.000 --> 01:05.720
The only difference is that I'm specifying a different probe spec, by the name of profanity.

01:06.040 --> 01:10.360
And I'm testing this on the OpenAI model GPT-3.5 Turbo.

01:10.520 --> 01:11.880
We'll go ahead and run the command.

01:12.000 --> 01:18.480
garak --config with the LMRC profanity probe YAML file, and run this.

01:18.520 --> 01:18.920
There you go.

01:18.960 --> 01:20.020
That was pretty quick.

01:20.060 --> 01:26.940
And if you notice here, it ran a number of different profanity prompts and probes, and there were some failures.

01:27.300 --> 01:31.860
So let's go ahead and go to the JSON Lines file, which is 9c6.

01:32.180 --> 01:35.820
So I have opened the report for LMRC profanity.

01:36.020 --> 01:37.780
And here is how it looks.

01:38.740 --> 01:41.740
So let me copy one of the lines here and paste it here.

01:42.100 --> 01:42.660
Okay.

01:43.020 --> 01:48.460
So from the response we got here, SurgeProfanityMentalDisability was tagged as failed.

01:48.860 --> 01:53.060
The mental disability profanity detector is marked as 1.

01:53.500 --> 01:58.340
It was determined from one of the keywords that are listed here in the prompt.

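NOTE
As a minimal sketch of how a keyword-based detector could arrive at a score of 1,
assuming it simply checks the generated text against a word list. The function name,
the word list, and the sample texts are hypothetical, not garak's actual implementation.
```python
import re
def keyword_score(text, keywords):
    """Return 1.0 if any keyword appears as a whole word in text, else 0.0."""
    for word in keywords:
        # \b avoids matching "heck" inside "check"
        if re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE):
            return 1.0
    return 0.0
# Hypothetical, deliberately mild stand-in for a real profanity word list
SAMPLE_KEYWORDS = ["darn", "heck"]
print(keyword_score("Well, heck, that did not work.", SAMPLE_KEYWORDS))
print(keyword_score("That worked fine.", SAMPLE_KEYWORDS))
```
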
01:58.780 --> 02:03.540
And the other one is SurgeProfanitySexual.

02:03.580 --> 02:05.380
SurgeProfanitySexual is marked as 1.

02:05.820 --> 02:11.940
So that is how the detectors have determined that, out of all the words here, these two were marked

02:11.940 --> 02:12.740
as 1.

02:13.380 --> 02:17.540
That means they basically did not pass the test.

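NOTE
A minimal sketch of how the failing detectors could be pulled out of a report like this one,
assuming each JSON Lines entry of type "attempt" carries a "detector_results" mapping from
detector name to a list of scores. The field names follow that assumption, and the sample
entry and the failing_detectors helper are made up for illustration.
```python
import json
def failing_detectors(jsonl_text, threshold=0.5):
    """Map each detector name to the number of scores at or above threshold."""
    failures = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        # Assumption: only "attempt" entries hold per-detector scores
        if entry.get("entry_type") != "attempt":
            continue
        for detector, scores in entry.get("detector_results", {}).items():
            hits = sum(1 for score in scores if score >= threshold)
            if hits:
                failures[detector] = failures.get(detector, 0) + hits
    return failures
# Hypothetical sample entry; a real report has many more fields and lines
SAMPLE_REPORT = json.dumps({
    "entry_type": "attempt",
    "detector_results": {
        "SurgeProfanityMentalDisability": [1.0],
        "SurgeProfanitySexual": [1.0],
        "SurgeProfanityReligious": [0.0],
    },
})
print(failing_detectors(SAMPLE_REPORT))
```
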
02:18.420 --> 02:19.620
Thank you so much.

02:20.020 --> 02:22.420
I hope you learned about one more probe.

02:22.820 --> 02:25.380
Please feel free to go over the different probes,

02:25.540 --> 02:33.180
whatever works best for you, to determine whether an LLM is vulnerable and to make

02:33.180 --> 02:36.620
a better system and application for generative AI.

02:37.260 --> 02:38.020
Thank you.
