WEBVTT

00:02.360 --> 00:12.480
Presidio Analyzer is a Python-based service for detecting PII entities in text. During analysis,

00:12.520 --> 00:23.880
it runs a set of different PII recognizers, each one in charge of detecting one or more PII entities

00:24.480 --> 00:26.400
using different mechanisms.

00:27.160 --> 00:38.000
It comes with a set of predefined recognizers, but it can be easily extended with other custom recognizers

00:38.000 --> 00:38.800
as well.

00:40.400 --> 00:50.880
Predefined and custom recognizers can leverage regex, named entity recognition (also known as NER),

00:51.000 --> 00:56.040
and other types of logic, to detect PII in unstructured data.

00:58.320 --> 01:03.120
So here is the install command: pip install presidio-analyzer.

01:03.680 --> 01:16.880
And here is spaCy's en_core_web_lg, the English language processing model used for detecting PII.

01:17.160 --> 01:17.600
Right?

01:17.640 --> 01:20.440
Let's go ahead and install and download them.

01:22.800 --> 01:31.800
The following code sample will be responsible for setting up the analyzer engine, which would load

01:31.800 --> 01:41.200
the NLP model by default, and the other PII recognizers, which go in here.

01:41.840 --> 01:42.280
Right.

01:42.440 --> 01:49.040
And then we'll call the analyzer to get the results for the phone number.

01:49.920 --> 01:59.760
In this case we are setting up the text to anonymize as the text that we want to analyze.

02:00.160 --> 02:04.920
And the entity that we want to detect is the phone number.

02:05.240 --> 02:05.760
Right.

02:05.800 --> 02:07.880
So here is the code for it.

02:07.880 --> 02:09.640
Let's go ahead and run this.

02:11.280 --> 02:14.800
I missed the part where I had to import

02:17.200 --> 02:18.200
the class here,

02:18.320 --> 02:19.440
AnalyzerEngine.

02:19.720 --> 02:20.160
Right.

02:20.480 --> 02:29.480
Later on, we will also need the PatternRecognizer, which is a custom pattern recognizer that we will

02:29.480 --> 02:32.720
create for detecting our custom entities.

02:32.880 --> 02:33.160
Right.

02:33.200 --> 02:35.520
And now I'll go ahead and run this.

02:41.400 --> 02:41.920
Right.

02:42.040 --> 02:52.640
So now, if you notice here, it did detect the phone number, which starts at character offset 46

02:52.680 --> 02:54.600
and runs until 58.

02:55.040 --> 02:55.560
Right.

02:55.600 --> 03:05.430
And the score is 0.75, which means that the confidence level for saying that this particular

03:05.710 --> 03:09.390
entity is a phone number is very high, right?

03:09.630 --> 03:21.670
So that is how we use the analyzer engine to analyze the text and find the phone number entity.

03:22.110 --> 03:22.510
Right.
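
NOTE
To make the result fields concrete, here is a small plain-Python sketch of how
the start and end offsets map back to the text. The sample sentence is an
assumption chosen to match the offsets read out above; start is inclusive and
end is exclusive, like a Python slice.
```python
# Assumed sample text, chosen so the phone number sits at offsets 46 to 58.
text = "His name is Mr. Jones and his phone number is 212-555-5555"
# Values read out in the walkthrough.
start, end, score = 46, 58, 0.75
# Slicing with the offsets recovers the detected span.
detected = text[start:end]
print(detected, score)
```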

03:22.830 --> 03:35.350
Now let's go ahead and understand how the entire class diagram for Presidio Analyzer

03:35.790 --> 03:36.590
is set up.

03:38.030 --> 03:44.910
So here is the class diagram for Presidio Analyzer.

03:46.310 --> 03:55.550
Here is the analyzer engine that we just used to detect

03:55.590 --> 03:57.310
the phone number.

03:57.590 --> 04:01.830
But if you notice here, it has other components as well

04:01.830 --> 04:05.550
in the hierarchy.

04:05.950 --> 04:06.150
Right.

04:06.150 --> 04:10.870
So here on the left-hand side we have the entity recognizer.

04:11.910 --> 04:18.110
Entity recognizer is an object that is responsible for detecting entities in text.

04:18.870 --> 04:24.710
It can be a rule-based recognizer, a machine learning model, or a combination of both.

04:25.390 --> 04:25.790
Right.

04:26.070 --> 04:32.270
Then we have here the recognizer registry.

04:33.150 --> 04:40.310
It is a registry that contains all the entity recognizers that are available in Presidio.

04:40.910 --> 04:48.190
The analyzer engine uses the registry to detect entities in the text.

04:49.390 --> 04:49.790
Right.

04:50.430 --> 05:00.670
Then here we have the NLP engine, which is an object that holds the NLP model that is used by the analyzer

05:00.990 --> 05:01.550
engine.

05:05.030 --> 05:16.190
You can use a variety of different NLP engines to detect the entities in the text.

05:16.230 --> 05:28.510
You can use spaCy, Transformers, or Stanza NLP engines to run the detection on the text.

05:29.270 --> 05:31.710
And then here we have the context-aware enhancer.

05:32.270 --> 05:37.790
It's a module that enhances the detection of entities by using the context of the text.

05:39.830 --> 05:44.270
One particular class that is not covered here is the pattern recognizer.

05:44.830 --> 05:53.110
It is a type of entity recognizer that uses regular expressions to detect entities in the text.

05:53.270 --> 05:57.670
We'll cover that as part of another video with a hands-on activity.

05:58.110 --> 06:04.790
But here is the high-level understanding of how Presidio Analyzer works.

06:05.630 --> 06:06.230
Thank you.

06:06.310 --> 06:07.190
I'll see you in the next.
