WEBVTT

00:01.500 --> 00:09.120
This is a continuation of the text transformation which we have learned, so we will be implementing

00:09.120 --> 00:09.660
the same.

00:15.290 --> 00:19.070
So first of all, let us import pandas and NumPy.

00:19.700 --> 00:25.140
And this is the file which contains all the SMS messages from the collection.

00:25.340 --> 00:27.050
So this is the file which we have.

00:29.600 --> 00:40.450
It lists whether each message is ham or spam, and it has the message next to it,

00:40.730 --> 00:43.560
and both of them are separated by a tab.

00:45.270 --> 00:48.600
So while reading this file, I have

00:49.800 --> 00:58.390
used read_csv, and I provided the delimiter as tab, and I told it that there is no header, and

00:58.590 --> 01:03.150
the names of the columns that I have provided are target and message.
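
NOTE
The lesson loads the file with pandas, presumably something like pd.read_csv(path, sep="\t", header=None, names=["target", "message"]). Below is a minimal standard-library sketch of the same idea, not the lesson's code. The two-line SAMPLE and the helper name read_messages are hypothetical stand-ins, since the real file is not shown.
```python
import csv
import io
# Hypothetical two-row sample standing in for the real tab-separated file.
SAMPLE = "ham\tOk, see you at five\nspam\tFree entry! Text WIN now\n"
def read_messages(handle):
    """Read tab-separated rows with no header into dicts keyed by target and message."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{"target": row[0], "message": row[1]} for row in reader if len(row) == 2]
rows = read_messages(io.StringIO(SAMPLE))
print(rows[0]["target"], "|", rows[0]["message"])
```
Because the file has no header row, the column names have to be supplied by hand, which is what the names argument does in the pandas call.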

01:06.480 --> 01:07.780
So let us view the data.

01:08.040 --> 01:09.790
So this is the data which we have.

01:09.810 --> 01:11.140
This is the target value.

01:11.160 --> 01:17.670
And this is the message value. The target value is basically the value which we will be predicting when we

01:17.670 --> 01:20.110
work on machine learning algorithms.

01:20.310 --> 01:23.860
So this is what we are predicting based on the messages.

01:24.090 --> 01:26.970
So the main focus here is to

01:28.470 --> 01:34.740
analyze the message and find out whether the message is actually ham or spam.

01:35.070 --> 01:40.050
So when we implement some machine learning algorithm on top of this data,

01:40.950 --> 01:48.600
it will provide us a target value for a new message. It will, first of all, learn from this data that these

01:48.600 --> 01:54.370
are a few messages and this is the label, that is, the class which we have provided for each.

01:54.810 --> 01:59.930
So whenever something like "free entry" is written, it will try to classify that as spam.

02:00.270 --> 02:03.440
So what will happen is it will learn from this particular data.

02:03.690 --> 02:11.580
And when new, unseen data is shown to the model which we will be training, it will

02:12.300 --> 02:18.440
be able to classify whether it is actually ham or spam.

02:19.570 --> 02:27.640
So right now, we will not be going very much in depth into how we will train the model, but we will

02:27.640 --> 02:30.970
only look at the data preparation part of it.

02:31.720 --> 02:38.110
So right now we are just working on the data, and how we will actually train the model is a second thing.

02:40.090 --> 02:41.800
So this is the data which we have.

02:42.940 --> 02:45.650
So just ignore this particular step.

02:45.660 --> 02:49.120
We will discuss it once we start with model creation.

02:50.060 --> 02:50.780
And.

02:53.820 --> 03:01.950
These are the different libraries which we are importing. The first one is from feature extraction

03:03.040 --> 03:04.610
in scikit-learn.

03:05.230 --> 03:16.360
It is the TfidfVectorizer. This TfidfVectorizer is what will convert our text data into a

03:16.360 --> 03:20.020
vector form, into the tabular form which we are expecting.

03:21.130 --> 03:28.870
Then from the NLTK corpus, we are importing the stopwords. The stop words are the words which

03:28.870 --> 03:30.820
we want to remove from the text.

03:31.250 --> 03:34.750
For example, we have of, is,

03:35.890 --> 03:36.610
you,

03:37.920 --> 03:38.580
then,

03:39.880 --> 03:40.510
he,

03:41.420 --> 03:49.160
I. All of these are stop words. These are the words which don't really

03:49.550 --> 03:51.810
give context to the text.

03:52.130 --> 03:59.060
They do not add meaning to the text, but they do add to the number of words.

03:59.210 --> 04:05.010
And because we are creating a vector, we are creating a complete vector out of this data.

04:05.150 --> 04:07.950
We don't want the extra words to be present here.

04:08.090 --> 04:14.330
So that is why we are getting a list of stop words, so that we can remove them from the text which

04:14.330 --> 04:15.200
we will be having.

04:16.900 --> 04:24.460
Then we are importing the word tokenizer from nltk.tokenize. This word tokenizer will actually

04:24.460 --> 04:29.470
convert the text into token form, into word-token form.

04:29.710 --> 04:37.180
So this entire message will be converted into tokens: "Ok" would be one token, the next word would be

04:37.450 --> 04:40.360
a different token, the word after that would be a different token, and so on.
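
NOTE
To make the idea of word tokens concrete, here is a rough sketch that uses a regular expression as a simplified stand-in for a real word tokenizer. It is not NLTK's algorithm, and the function name simple_word_tokenize and the sample sentence are made up for illustration.
```python
import re
def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # First alternative: a run of letters, digits, or apostrophes.
    # Second alternative: any single non-space character that is not one of those.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)
tokens = simple_word_tokenize("Ok, see you at five!")
print(tokens)
```
Each word and each punctuation mark comes back as its own list element, which is the shape the later stop-word filtering step works on.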

04:41.080 --> 04:42.910
Then we have the WordNet lemmatizer.

04:43.630 --> 04:49.000
We can use the WordNetLemmatizer, or a stemming method also.

04:49.330 --> 04:52.090
Either of those methods could be used.

04:52.270 --> 05:00.310
Lemmatization, as we have already discussed, is a more appropriate method because it keeps context in

05:00.310 --> 05:00.720
mind.

05:02.080 --> 05:05.670
So we will be using the WordNetLemmatizer here.

05:07.050 --> 05:11.220
So we are creating an object of WordNetLemmatizer,

05:12.180 --> 05:13.530
named lemmer.

05:14.540 --> 05:17.390
And we are getting the set of stop words.

05:18.420 --> 05:24.510
And we are getting these stop words for the English language; you can get stop words for other languages as well.

05:25.570 --> 05:32.030
And let me print the stop words for you. Let us see which words are covered in the stop words.

05:33.010 --> 05:41.660
So here we have a list of stop words, that is: a, above, after, again, all, because, between, below, then, for,

05:41.720 --> 05:48.400
did, few, had, haven, in, is, isn, mustn, myself.

05:48.410 --> 05:56.980
So all the words which don't really give meaning to a sentence, but are just connecting words, are considered

05:56.980 --> 05:57.970
as stop

05:57.970 --> 06:04.930
words. Now, there are several other things which cause issues, like punctuation.

06:05.050 --> 06:12.100
So we can consider punctuation also and add those punctuation marks to our stop words, so that those are also

06:12.100 --> 06:15.310
removed from the sentences.
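
NOTE
A small sketch of adding punctuation to a stop-word set, so that both are filtered in one pass. The five-word STOP_WORDS set and the token list are hypothetical samples, not the full English list the lesson takes from NLTK.
```python
import string
# Tiny hypothetical sample of stop words, for illustration only.
STOP_WORDS = {"a", "the", "is", "in", "of"}
# string.punctuation is a string of ASCII punctuation characters; set() splits it into single marks.
extended = STOP_WORDS | set(string.punctuation)
tokens = ["the", "offer", "is", "free", "!", ","]
kept = [t for t in tokens if t not in extended]
print(kept)
```
Treating punctuation marks as stop words keeps them from becoming features of their own when the vectors are built.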

06:16.700 --> 06:20.370
So what we are doing here is we are defining a particular function.

06:20.720 --> 06:24.850
The function's name is split_into_lemmas.

06:25.490 --> 06:29.220
So this will give the different lemmas to us.

06:29.420 --> 06:34.040
So what we are doing is we are forcibly converting the entire message to lowercase.

06:35.810 --> 06:43.310
After converting the message to lowercase, we are dividing the message into different words, different

06:43.340 --> 06:44.330
word tokens.

06:46.000 --> 06:52.810
After these words have been tokenized, we iterate over all the different words.

06:52.840 --> 07:01.240
This is a blank list which we have created, and we are taking each word from the message one by one.

07:02.240 --> 07:09.020
So for each and every word in the message, we are iterating over each and every word, and then we

07:09.020 --> 07:14.630
are checking: if the word is a part of the stop words, then we do not do anything.

07:14.630 --> 07:16.550
We just continue with our for loop.

07:17.520 --> 07:25.230
And in case the word is not found in the stop words, then we append that word to this particular list which

07:25.230 --> 07:26.220
we have created.

07:28.170 --> 07:39.840
And after we have the list of all the words, from these words we return the lemmatized form

07:39.840 --> 07:40.840
of these words.

07:41.640 --> 07:48.420
So we basically keep a collection of all words except for the stop words, and then lemmatize them and

07:49.020 --> 07:49.890
send them back.
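
NOTE
The steps just described (lowercase, tokenize, skip stop words, lemmatize what is left) can be sketched as below. This is an approximation under loud assumptions: the name split_into_lemmas is our reading of the lesson's function name, STOP_WORDS is a tiny made-up sample, and toy_lemma is a crude plural-stripping stand-in, not the WordNet lemmatizer.
```python
import re
# Tiny hypothetical stop-word sample; the lesson uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "my", "the", "to"}
def toy_lemma(word):
    """Crude stand-in for a lemmatizer: strip one plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
def split_into_lemmas(message):
    """Lowercase, tokenize, skip stop words, then lemmatize what is left."""
    words = []  # the blank list that collects the words we keep
    for word in re.findall(r"[a-z0-9']+", message.lower()):
        if word in STOP_WORDS:
            continue  # stop word: do nothing and carry on with the loop
        words.append(word)
    return [toy_lemma(word) for word in words]
print(split_into_lemmas("The winners are in the finals"))
```
A real lemmatizer looks words up in a dictionary instead of trimming suffixes, so it handles irregular forms that this toy rule gets wrong.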

07:50.870 --> 07:55.970
So this is what we are doing here. Let me quickly run this.

07:57.720 --> 08:05.130
So here we have the train and test sets. This is just a split of the data which we have done; don't

08:05.700 --> 08:06.980
get into the depth of it.

08:07.230 --> 08:09.780
Let's just see what is there in the train data.

08:10.620 --> 08:16.220
So this is the data which we have, the training data, which has the target value and the message value.

08:16.650 --> 08:23.610
And apart from this, the same kind of thing is what we have in the test data

08:23.610 --> 08:24.080
also.

08:25.960 --> 08:36.520
Now we come to the TfidfVectorizer. This is a vectorizer which will initially learn

08:36.520 --> 08:37.950
from the data which we have.

08:38.320 --> 08:45.070
So it will, first of all, learn from the data which we have and work out: what type of words do I need to have?

08:46.020 --> 08:53.850
What type of words should be considered so that we will be able to classify something as ham or spam?

08:55.040 --> 08:58.750
The TfidfVectorizer is just learning from the data for now.

08:58.900 --> 09:06.300
So what we do is we are just defining the object of this TfidfVectorizer, and we are just telling

09:06.310 --> 09:09.670
it: you will use this particular function, split_into_lemmas,

09:09.970 --> 09:15.520
and consider the minimum document frequency to be 20 and the maximum document frequency to be three

09:15.520 --> 09:25.240
thousand, which means: consider only those words which occur in at least 20 messages and in at most

09:25.240 --> 09:26.740
3,000 messages.

09:27.400 --> 09:33.520
So what this means is, let's say we have a word, or let us say

09:33.520 --> 09:39.870
we have some word, and that word does not occur very frequently.

09:39.880 --> 09:41.670
It occurs once in a lifetime.

09:41.680 --> 09:44.230
Out of all the SMSes, it shows up only once or twice.

09:45.730 --> 09:55.140
So considering that as a criterion for classifying something as spam or ham would not be a good measure

09:55.900 --> 10:02.710
for a word which is very rarely used. So we will have to remove all the very rarely used words, because they

10:02.710 --> 10:06.890
will basically add to the list of words which we are considering.

10:07.130 --> 10:10.030
They will add to the size of the feature set which we have.

10:10.860 --> 10:17.640
So we will drop those words. Now, on the other end, when we say maximum document frequency is equal to three

10:17.640 --> 10:20.550
thousand, what that means is, let's say

10:20.580 --> 10:23.630
my name is Taniya,

10:24.640 --> 10:29.680
and the name comes in each and every message which I receive.

10:30.660 --> 10:38.130
So whatever SMSes I'm receiving, all of them have the name Taniya. That means that it

10:38.130 --> 10:43.890
isn't even a relevant word with which to classify something as ham or spam,

10:44.070 --> 10:50.340
because everyone has the information that my name is Taniya, and each and every person would be able

10:50.340 --> 10:53.740
to find out that this person's name is Taniya.

10:53.760 --> 11:01.290
So the ham messages, the spam messages, any kind of messages, all will be sending out the

11:01.290 --> 11:03.660
message to me saying, "Hi Taniya, please buy this,

11:03.900 --> 11:05.010
Taniya, please get this."

11:05.220 --> 11:08.100
So I don't want to consider these kinds of words.

11:09.670 --> 11:16.080
So this is the reason why we set the maximum document frequency here.

11:17.000 --> 11:22.970
So we will be considering only those words which stay within the limit of the maximum document frequency.

11:23.360 --> 11:31.190
So we are saying that those words may occur in a maximum of three thousand documents;

11:31.400 --> 11:37.430
whatever is occurring in more than three thousand messages, don't consider that.
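
NOTE
A plain-Python sketch of what minimum and maximum document frequency thresholds do: count how many documents contain each word, then keep only the words whose count falls inside the range. The helper names, the four toy documents, and the thresholds 2 and 3 (standing in for the lesson's 20 and 3000) are all made up for illustration; this is not scikit-learn's implementation.
```python
from collections import Counter
def document_frequency(tokenized_docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))  # set() so a repeated word counts once per document
    return counts
def select_vocabulary(tokenized_docs, min_df, max_df):
    """Keep words whose document frequency lies between min_df and max_df, inclusive."""
    counts = document_frequency(tokenized_docs)
    return sorted(w for w, n in counts.items() if min_df <= n <= max_df)
docs = [
    ["hi", "taniya", "free", "prize"],
    ["hi", "taniya", "lunch", "today"],
    ["taniya", "free", "offer"],
    ["taniya", "hi", "zebra"],
]
# Toy thresholds standing in for the lesson's 20 and 3000.
print(select_vocabulary(docs, min_df=2, max_df=3))
```
Here the name that appears in every document is dropped by the upper limit, and the words seen in only one document are dropped by the lower limit.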

11:39.230 --> 11:46.970
And right now, we're just specifying the criteria by which our vectorizer would be learning,

11:46.970 --> 11:52.600
how this vectorizer, this TfidfVectorizer, would be learning.

11:52.610 --> 11:54.620
That is what we are defining at this moment.

11:57.000 --> 11:58.350
So now what we do is.

12:02.810 --> 12:03.890
Now, we'll run this.

12:04.810 --> 12:10.270
And then we will fit this TfidfVectorizer on the train data.

12:11.690 --> 12:20.240
So we will fit this on this particular training data. We are telling the TfidfVectorizer: please

12:20.240 --> 12:29.060
learn from the messages which I already have and find out which messages should be considered

12:29.060 --> 12:35.270
and which words should be considered while you are creating the vectors.

12:36.350 --> 12:38.720
So right now, I'm just training my vectorizer.

12:38.760 --> 12:45.260
I'm just telling the vectorizer to learn from this data and find out what it actually

12:45.260 --> 12:46.260
has to do.

12:47.730 --> 12:51.600
And then after I have done that, I will be

12:52.580 --> 12:53.750
transforming

12:54.960 --> 12:56.760
these train

12:57.790 --> 13:05.450
messages. I will be transforming these messages from a continuous text form into the vectorized form.

13:05.620 --> 13:12.310
So once I fit this and then I transform this, I get this training data.

13:13.380 --> 13:16.300
And similarly, I can do the same thing with the testing data.

13:16.490 --> 13:20.560
So you don't need to worry about what training data is and what testing data is.

13:20.790 --> 13:22.520
Don't worry about that as of now.

13:22.770 --> 13:24.090
Just focus on this:

13:24.600 --> 13:30.750
first of all, we tell it to learn how it needs to vectorize from the training data which

13:30.750 --> 13:33.460
we have, and which words it should consider.

13:33.900 --> 13:39.120
Let us see an example of how you can understand this. Let's say

13:39.120 --> 13:42.900
I am working on a geography project.

13:44.570 --> 13:49.690
And the majority of the SMSes which I get are related to geography.

13:50.840 --> 13:59.960
They will be talking about it, and they will be talking about some oceans, some cities, and

13:59.960 --> 14:09.230
some regions, and most of them will be like this. Meanwhile, I have a friend who is working on a chemistry

14:09.230 --> 14:16.310
project, and his messages will have words like nitrogen, hydrogen, and phosphorus, those kinds of words

14:16.310 --> 14:16.460
in them.

14:18.190 --> 14:26.050
Now, what I'm trying to say here is that the words which I will be considering to create a vectorizer

14:26.230 --> 14:31.540
would be different from the words my friend would be considering for creating his vectorizer.

14:32.670 --> 14:39.840
That is the reason why we are, first of all, training. We are, first of all, fitting this

14:40.020 --> 14:48.660
TfidfVectorizer on my training data, so that it can learn which words, what vocabulary, it

14:48.660 --> 14:52.940
needs to consider for creating the vectors.

14:53.100 --> 15:01.830
And only then will it transform the data, so that I have a proper vectorizer for my geography-related project,

15:02.190 --> 15:02.580
and

15:03.540 --> 15:09.150
my friend has a different vectorizer, one which is really good for chemistry, and so on.
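
NOTE
A toy illustration of why the vectorizer is fitted first and only then used to transform: the vocabulary comes from the training messages alone, so a word that never appeared there gets no column. ToyVectorizer is a made-up minimal class that uses raw counts; it only mimics the fit and transform pattern and is not the TfidfVectorizer itself.
```python
class ToyVectorizer:
    """Minimal stand-in for a fit/transform vectorizer, using raw counts."""
    def fit(self, tokenized_docs):
        # Learn the vocabulary: one column index per distinct training word.
        words = sorted({w for tokens in tokenized_docs for w in tokens})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self
    def transform(self, tokenized_docs):
        rows = []
        for tokens in tokenized_docs:
            row = [0] * len(self.vocabulary_)
            for w in tokens:
                if w in self.vocabulary_:  # words never seen during fit are ignored
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows
train = [["river", "city"], ["city", "region", "city"]]
test = [["city", "nitrogen"]]
vec = ToyVectorizer().fit(train)
print(vec.vocabulary_)
print(vec.transform(test))
```
A vectorizer fitted on geography messages has no column for "nitrogen", which is the sense in which each project ends up with its own vectorizer.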

15:11.140 --> 15:18.420
So we just ran this, and here is one example of how we can actually visualize this. Although

15:18.740 --> 15:23.220
the data which will be created by this TfidfVectorizer will be

15:24.830 --> 15:32.090
in a sparse matrix form, it will be creating something like this: there will be a lot of columns,

15:33.090 --> 15:39.600
and all of those columns will be having values, zeros where a word is absent and otherwise a weight based on the count of the word

15:39.600 --> 15:40.380
which is there.

15:40.650 --> 15:48.870
And apart from that, the zeros are left as blank space, so that this matrix does not take up a lot of space.

15:50.130 --> 15:59.400
So this is what the TfidfVectorizer is used for, and how we create vectors using it.
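
NOTE
A rough sketch of the idea behind a sparse matrix: store only the non-zero cells together with the overall shape, instead of every zero. The dictionary layout and the name to_sparse are illustrative assumptions and do not reflect how any particular library stores its sparse matrices internally.
```python
def to_sparse(dense_rows):
    """Store only non-zero cells as {(row, col): value}, plus the matrix shape."""
    cells = {}
    for r, row in enumerate(dense_rows):
        for c, value in enumerate(row):
            if value != 0:
                cells[(r, c)] = value
    shape = (len(dense_rows), len(dense_rows[0]) if dense_rows else 0)
    return cells, shape
dense = [[0, 0, 2, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
cells, shape = to_sparse(dense)
print(cells, shape)
```
With thousands of vocabulary columns and only a handful of words per message, almost every cell is zero, so storing just the non-zero cells saves most of the memory.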

16:01.030 --> 16:08.620
Now, how we will use these vectors which we have generated in classification is what we will

16:08.620 --> 16:18.100
be learning later, and we will see different classification models using such datasets.

16:22.230 --> 16:28.950
Now, to see how this split_into_lemmas is working, I created one message, that is: "my red

16:29.160 --> 16:33.880
socks on the day of the socks in the world, and the nation needs more socks like this."

16:34.530 --> 16:36.480
So we will just run

16:38.450 --> 16:44.000
this particular part. I am doing message.lower() and then printing the message.

16:46.160 --> 16:54.770
Then I'm converting these words into tokens, so each word has been separated and converted into a token.

16:55.700 --> 17:04.670
Now here I am just iterating over the words, and then I am simply printing whether the words

17:06.950 --> 17:09.020
are skipped or appended.

17:11.780 --> 17:13.070
So you can see in my

17:13.950 --> 17:23.430
output that all of these words which are part of the stop words have been skipped, and the

17:23.430 --> 17:29.610
other words have been appended to the list, and then we are just lemmatizing the words.

17:31.270 --> 17:36.520
So these are the lemmas of the words, and you can see that "socks" has been converted into "sock"

17:39.060 --> 17:40.420
at all the places.

17:40.770 --> 17:48.290
So what this will do is, when we are doing vectorization, I mean when we are creating the TfidfVectorizer,

17:48.870 --> 17:52.430
then these will be considered as one single word.