1
00:00:11,070 --> 00:00:16,440
So in this lecture, we will be discussing the concept of class imbalance and some typical methods for

2
00:00:16,440 --> 00:00:18,390
dealing with that in machine learning.

3
00:00:19,290 --> 00:00:22,560
So let's begin by discussing why class imbalance is a problem.

4
00:00:23,310 --> 00:00:26,700
Suppose that you were developing a blood test for some disease.

5
00:00:27,180 --> 00:00:33,060
Thus, you were essentially building a binary classifier where the output is either disease or no disease.

6
00:00:34,320 --> 00:00:38,880
It should be noted that diseases by nature are generally rare in the population.

7
00:00:39,390 --> 00:00:43,800
Otherwise, a large portion of people would be walking around in a disease state.

8
00:00:44,670 --> 00:00:51,780
Suppose that the disease in question is only present in 0.1 percent of the population and thus

9
00:00:51,780 --> 00:00:54,870
99.9 percent do not have the disease.

10
00:00:55,800 --> 00:01:01,560
Now, suppose that the classifier we build is very bad, such that it doesn't really do any computation

11
00:01:01,890 --> 00:01:04,709
except for returning no disease for every input.

12
00:01:05,340 --> 00:01:08,790
This illustrates the problem with using accuracy as a metric.

13
00:01:09,510 --> 00:01:14,490
You see, even with this bad classifier, you would still end up getting 99.9 percent

14
00:01:14,490 --> 00:01:19,860
accuracy simply because that is the percentage of people who do not have the disease.

15
00:01:20,340 --> 00:01:23,940
And this is assuming that the testing for this disease is indiscriminate.

16
00:01:24,390 --> 00:01:29,700
That is, everyone in the population has an equal chance of being tested regardless of their symptoms.

17
00:01:30,120 --> 00:01:33,540
This is probably not realistic, but let's just assume that is true.
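As a rough sketch of this point, assume a made-up population of 100,000 people in which exactly 0.1 percent have the disease. The names `always_no_disease`, `y_true` and `y_pred` are only illustrative and do not come from the lecture.

```python
# Hypothetical population: 100,000 people, 0.1% have the disease (label 1).
n = 100_000
n_sick = n // 1000  # 100 people with the disease
y_true = [1] * n_sick + [0] * (n - n_sick)


def always_no_disease(_sample):
    """The 'very bad' classifier: ignores its input and predicts no disease (0)."""
    return 0


y_pred = [always_no_disease(x) for x in y_true]

# Accuracy is the fraction of predictions that match the true labels.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.999
```

Even though this classifier never finds a single sick person, its accuracy simply mirrors the share of healthy people.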

18
00:01:34,380 --> 00:01:36,420
So what can be done about this issue?

19
00:01:41,120 --> 00:01:46,160
Well, we can start by looking at the results in a more detailed manner in order to understand where

20
00:01:46,160 --> 00:01:47,090
we are going wrong.

21
00:01:47,900 --> 00:01:52,430
In particular, there are four quantities we're interested in looking at instead of just one.

22
00:01:53,150 --> 00:01:59,360
These are the true positives, the true negatives, the false positives and the false negatives.

23
00:01:59,930 --> 00:02:01,670
So these should be pretty intuitive.

24
00:02:02,390 --> 00:02:05,960
We first have to decide which class is positive and which is negative.

25
00:02:06,530 --> 00:02:11,990
So for this example, let's assume that testing positive for the disease is equivalent to the positive

26
00:02:11,990 --> 00:02:12,620
class.

27
00:02:13,430 --> 00:02:19,310
In this case, the number of true positives is the number of people who have the disease that we correctly

28
00:02:19,310 --> 00:02:21,170
predicted have the disease.

29
00:02:21,830 --> 00:02:27,140
On the other hand, the number of false positives is the number of people we predicted to have the disease

30
00:02:27,410 --> 00:02:29,360
but do not really have the disease.

31
00:02:30,590 --> 00:02:36,050
Similarly, the number of true negatives is the number of people who do not have the disease, and we

32
00:02:36,050 --> 00:02:39,050
correctly predicted that they do not have the disease.

33
00:02:39,650 --> 00:02:45,170
And the number of false negatives is the number of people who we predicted to not have the disease but

34
00:02:45,170 --> 00:02:46,730
really do have the disease.

35
00:02:51,270 --> 00:02:56,940
Now, as you may have noticed, we typically organize these numbers in a table. In particular, we call this

36
00:02:56,940 --> 00:02:58,440
a confusion matrix.
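Here is a minimal sketch of counting the four quantities and laying them out as a table, assuming labels are encoded as 1 for disease and 0 for no disease. The function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted disease, really has it
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted disease, does not have it
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted no disease, does not have it
        else:
            fn += 1  # predicted no disease, really has it
    return tp, fp, tn, fn


# Toy data, purely illustrative.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# Rows are the actual class, columns are the predicted class.
print("                 pred disease  pred no disease")
print(f"actual disease   {tp:>12}  {fn:>15}")
print(f"actual healthy   {fp:>12}  {tn:>15}")
```

The row and column order here is one common layout; other tools may order the classes differently.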

37
00:02:59,040 --> 00:03:04,050
Note that a confusion matrix is more general and can be extended to any number of classes.

38
00:03:04,590 --> 00:03:10,140
However, the following techniques in this lecture will be focused on the binary case. Although they

39
00:03:10,140 --> 00:03:12,240
can be extended to the multiclass case,

40
00:03:12,540 --> 00:03:14,460
they don't naturally fit into that picture.

41
00:03:15,030 --> 00:03:20,220
And thus, for the rest of this lecture, let's assume that we're talking about binary classification.

42
00:03:25,000 --> 00:03:28,180
So how can we make use of the numbers we've collected so far?

43
00:03:28,930 --> 00:03:33,820
Well, historically there have been a few ways to do this, and whatever you use tends to depend on

44
00:03:34,120 --> 00:03:35,440
what field you come from.

45
00:03:36,250 --> 00:03:42,070
So if you're in the medical field or life sciences, you tend to report the sensitivity and specificity.

46
00:03:42,670 --> 00:03:45,220
The sensitivity is the true positive rate.

47
00:03:45,760 --> 00:03:50,710
It is equal to the number of true positives divided by the total number of positives.

48
00:03:50,950 --> 00:03:52,690
Hence, true positive rate.

49
00:03:53,230 --> 00:03:57,430
It is the rate at which we can detect those who are positive for the disease.

50
00:03:58,460 --> 00:04:02,750
Another way to think of this is: out of however many actual positives there are,

51
00:04:03,020 --> 00:04:06,500
this is the percentage of those positives you'll be able to detect.

52
00:04:07,700 --> 00:04:13,520
Note that in terms of the four counts we collected, it is equal to the number of true positives divided

53
00:04:13,520 --> 00:04:17,089
by the number of true positives, plus the number of false negatives.

54
00:04:17,540 --> 00:04:20,930
As you recall, the false negatives also have the disease.

55
00:04:22,019 --> 00:04:24,240
The second item is the specificity.

56
00:04:24,720 --> 00:04:30,810
This is the true negative rate. As you might be able to guess, it is equal to the number of true negatives

57
00:04:31,110 --> 00:04:33,330
divided by the total number of negatives.

58
00:04:33,570 --> 00:04:39,690
Hence, true negative rate. It is the rate at which we can detect those who are negative for the disease.

59
00:04:40,260 --> 00:04:45,480
And again, you can derive this to be the number of true negatives divided by the number of true negatives,

60
00:04:45,690 --> 00:04:47,550
plus the number of false positives.

61
00:04:48,270 --> 00:04:52,200
Again, the false positives are among those who do not have the disease.
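The two definitions above can be sketched in a few lines. The function names and the example counts are assumptions for illustration, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    actual_positives = tp + fn
    # Convention assumed here: report 0.0 if there are no actual positives.
    return tp / actual_positives if actual_positives else 0.0


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    actual_negatives = tn + fp
    # Convention assumed here: report 0.0 if there are no actual negatives.
    return tn / actual_negatives if actual_negatives else 0.0


# Illustrative counts: 2 true positives, 1 false negative,
# 4 true negatives, 1 false positive.
sens = sensitivity(tp=2, fn=1)
spec = specificity(tn=4, fp=1)
print(f"sensitivity = {sens:.3f}")  # sensitivity = 0.667
print(f"specificity = {spec:.3f}")  # specificity = 0.800
```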

62
00:04:56,840 --> 00:05:03,050
So let's think about what would happen to the sensitivity and specificity if we use our very bad classifier,

63
00:05:03,080 --> 00:05:05,240
which always predicts no disease.

64
00:05:06,020 --> 00:05:08,030
Suppose there are N people in our study.

65
00:05:08,780 --> 00:05:14,570
As mentioned, 0.001N of these people have the disease, while

66
00:05:14,570 --> 00:05:16,130
0.999N do not.

67
00:05:17,150 --> 00:05:19,640
Now, since we always predict no disease,

68
00:05:19,850 --> 00:05:23,660
this means that we have 0.999N true negatives.

69
00:05:24,110 --> 00:05:27,380
We also have 0.001N false negatives.

70
00:05:28,070 --> 00:05:31,190
We have no true positives because we never predict positive.

71
00:05:31,790 --> 00:05:35,390
And we also have no false positives since we never predict positive.

72
00:05:36,080 --> 00:05:41,960
Because of this, the sensitivity is equal to zero divided by zero plus 0.001N,

73
00:05:42,260 --> 00:05:43,550
which is just zero.

74
00:05:43,910 --> 00:05:46,160
This is because we have no true positives.

75
00:05:46,880 --> 00:05:52,580
On the other hand, the specificity is equal to 0.999N divided by

76
00:05:52,650 --> 00:05:54,890
0.999N plus zero, which is one.

77
00:05:55,520 --> 00:06:01,390
This makes sense because we correctly predicted everyone who does not have the disease by always predicting

78
00:06:01,390 --> 00:06:02,480
no disease.
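Written out with the four counts for the classifier that always predicts no disease (TP = 0, FP = 0, TN = 0.999N, FN = 0.001N), the arithmetic looks like this:

```latex
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{0}{0 + 0.001N} = 0
\qquad
\text{specificity} = \frac{TN}{TN + FP} = \frac{0.999N}{0.999N + 0} = 1
```

For comparison, the accuracy of the same classifier is (TP + TN) / N = 0.999N / N = 0.999, which is why accuracy alone hides the problem.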

79
00:06:03,950 --> 00:06:05,240
So how does this help?

80
00:06:05,900 --> 00:06:10,190
Well, note that we now have two numbers to measure the performance of our predictor.

81
00:06:10,880 --> 00:06:16,700
Previously, we only had accuracy, which was at 99.9 percent, which seems good when

82
00:06:16,700 --> 00:06:17,760
it is really not.

83
00:06:18,260 --> 00:06:23,690
But by using two numbers, we get an indication of how the model is performing by looking

84
00:06:23,690 --> 00:06:25,190
at the two classes separately.

85
00:06:25,970 --> 00:06:31,640
In this case, the specificity is one, which is a good sign, but the sensitivity is zero, which is

86
00:06:31,640 --> 00:06:37,280
a bad sign. If you measured the performance of your predictor and it had poor sensitivity,

87
00:06:37,580 --> 00:06:44,510
you would immediately know that it is failing to detect the disease. And thus, unlike accuracy, this

88
00:06:44,510 --> 00:06:50,570
could be used as a measure of how useful your model would be in the real world. To understand this,

89
00:06:50,600 --> 00:06:55,760
let's continue with our example of trying to predict whether or not a patient has some disease.

90
00:06:56,300 --> 00:07:02,540
The sensitivity tells us how good we are at detecting the disease in people who actually have the disease.

91
00:07:03,020 --> 00:07:07,910
Of course, we want this to be high so that we can provide treatment to these patients.

92
00:07:08,420 --> 00:07:13,340
We would not want to leave them untreated, which is what a bad predictor would do by predicting no

93
00:07:13,340 --> 00:07:14,840
disease for every patient.

94
00:07:16,010 --> 00:07:19,370
On the other hand, we want our specificity to also be high.

95
00:07:19,940 --> 00:07:24,650
As an extreme example, we don't want to tell everyone that they have the disease because most of them don't

96
00:07:24,650 --> 00:07:25,940
actually need treatment.

97
00:07:26,690 --> 00:07:31,310
Now, in practice, typically what happens is not that you treat the patient upon seeing a positive

98
00:07:31,310 --> 00:07:35,420
test, but rather you do further tests to confirm the diagnosis.

99
00:07:36,050 --> 00:07:40,610
At the same time, you don't want to scare patients by telling them that they have the disease when

100
00:07:40,610 --> 00:07:41,660
they really do not.

101
00:07:46,310 --> 00:07:52,040
Now it's worth thinking about what the sensitivity and specificity would be if our model made perfect

102
00:07:52,040 --> 00:07:52,940
predictions.

103
00:07:53,780 --> 00:07:59,270
In this case, the number of false positives is zero and the number of false negatives is also zero.

104
00:08:00,190 --> 00:08:07,300
So the sensitivity is TP over TP plus FN, which is equal to TP over TP, which is one.

105
00:08:08,140 --> 00:08:15,680
Likewise, the specificity is TN over TN plus FP, which is equal to TN over TN, which is also one.

106
00:08:16,930 --> 00:08:21,730
Therefore, the true positive rate is one and the true negative rate would also be one.

107
00:08:22,600 --> 00:08:28,990
Thus, what we are aiming for when we build a binary classifier is for both the sensitivity and specificity

108
00:08:28,990 --> 00:08:29,740
to be one.

109
00:08:30,490 --> 00:08:35,679
In the case of our very bad classifier, we got zero for the sensitivity, which is very bad.

110
00:08:40,299 --> 00:08:44,800
Now, as mentioned, the metrics you use tend to be influenced by your field of study.

111
00:08:45,370 --> 00:08:49,330
I personally find the sensitivity and specificity to make a lot of sense.

112
00:08:49,690 --> 00:08:54,850
But in NLP, we tend to use precision and recall. Intuitively,

113
00:08:54,850 --> 00:08:59,740
you can always remember what these mean because they are named for their use in document retrieval.

114
00:09:00,640 --> 00:09:02,140
So let's start with recall.

115
00:09:02,710 --> 00:09:09,790
The recall happens to be the same thing as the sensitivity, that is, the true positive rate. In terms of

116
00:09:09,790 --> 00:09:11,680
document retrieval or search engines,

117
00:09:11,980 --> 00:09:16,870
it would be the number of relevant documents you found out of the total number of documents you should

118
00:09:16,870 --> 00:09:17,590
have found.

119
00:09:18,100 --> 00:09:22,000
In other words, it's the percentage of documents you were able to recall.

120
00:09:22,960 --> 00:09:25,150
Now the precision is a little bit different.

121
00:09:25,630 --> 00:09:28,690
This is also known as the positive predictive value.

122
00:09:29,350 --> 00:09:34,030
It's equal to the ratio of true positives to true positives, plus false positives.

123
00:09:34,870 --> 00:09:40,210
In other words, it's equal to the number of documents you correctly retrieved divided by the total

124
00:09:40,210 --> 00:09:42,160
number of documents you retrieved.

125
00:09:42,910 --> 00:09:44,560
So why does this make sense?

126
00:09:49,190 --> 00:09:54,350
Well, suppose you built a very bad document retrieval system that just returned all the documents every

127
00:09:54,350 --> 00:09:54,920
time.

128
00:09:55,670 --> 00:09:56,160
Sure.

129
00:09:56,180 --> 00:10:01,070
This would include the documents you wanted, but also tons of documents which you did not want.

130
00:10:01,520 --> 00:10:03,350
Hence, it would be imprecise.

131
00:10:03,830 --> 00:10:06,290
The number of false positives would be very high.

132
00:10:07,010 --> 00:10:12,890
On the other hand, if you can reduce the number of false positives to zero and stop returning irrelevant

133
00:10:12,890 --> 00:10:15,500
documents, then you would be very precise.

134
00:10:16,040 --> 00:10:21,020
If the number of false positives is zero, then the precision would be equal to true positives over

135
00:10:21,020 --> 00:10:23,000
true positives, which would be one.

136
00:10:23,630 --> 00:10:26,210
Hence, this ratio is known as the precision.

137
00:10:26,900 --> 00:10:29,840
How precise were you in returning relevant documents?
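To make the document retrieval picture concrete, here is a small sketch that assumes a made-up corpus of 1,000 documents, of which 10 are relevant. Documents are represented by integer IDs, and `precision_recall` is a hypothetical helper, not something from the lecture.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for sets of retrieved and relevant document IDs."""
    tp = len(retrieved & relevant)  # relevant documents we returned
    fp = len(retrieved - relevant)  # irrelevant documents we returned
    fn = len(relevant - retrieved)  # relevant documents we missed
    # Convention assumed here: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


corpus = set(range(1000))   # hypothetical corpus of 1,000 documents
relevant = set(range(10))   # assume only 10 of them are relevant

# A very bad system that returns every document every time.
p_all, r_all = precision_recall(corpus, relevant)
print(f"return everything: precision = {p_all:.2f}, recall = {r_all:.2f}")

# A system that returns exactly the relevant documents.
p_exact, r_exact = precision_recall(relevant, relevant)
print(f"return only relevant: precision = {p_exact:.2f}, recall = {r_exact:.2f}")
```

Returning everything gives perfect recall but very low precision, while removing all false positives brings the precision up to one.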

138
00:10:34,340 --> 00:10:40,310
So we just discussed two methods of measuring the performance of a binary classifier when we have imbalanced

139
00:10:40,310 --> 00:10:41,090
classes.

140
00:10:41,720 --> 00:10:47,000
One option is to use sensitivity and specificity, while the other is to use precision and recall.

141
00:10:48,170 --> 00:10:53,480
Note that there is one crucial way that these differ from accuracy, which is the typical performance

142
00:10:53,480 --> 00:10:58,190
metric, and that is these give us back two numbers instead of just one.

143
00:10:59,420 --> 00:11:04,850
The problem is this might make it too complicated when it comes to making decisions about which model

144
00:11:04,850 --> 00:11:05,570
is best.

145
00:11:06,830 --> 00:11:12,140
For example, suppose you're comparing two models, but one does better with precision, while the other

146
00:11:12,140 --> 00:11:13,340
does better with recall.

147
00:11:13,880 --> 00:11:15,890
How can you choose which model is best?

148
00:11:16,670 --> 00:11:21,650
Well, one way to answer this question is to distill these metrics back into a single number.

149
00:11:26,340 --> 00:11:31,890
So one method of getting a single number out of the precision and recall is to use the F1 score.

150
00:11:32,640 --> 00:11:37,230
Essentially, the F1 score is the harmonic mean of the precision and recall.

151
00:11:38,100 --> 00:11:42,300
Now you don't have to worry about what that is, but I'll show you the formula in case you ever need

152
00:11:42,300 --> 00:11:43,800
to implement it yourself.

153
00:11:44,400 --> 00:11:49,860
Essentially, it's just two times the product of precision and recall, divided by the sum of precision

154
00:11:49,860 --> 00:11:50,580
and recall.

155
00:11:51,240 --> 00:11:55,740
So just think of the F1 score as a way to take the average of these two numbers.

156
00:11:57,750 --> 00:12:02,640
Now, if you're curious, recall that the regular mean is where you add up all the numbers and divide

157
00:12:02,640 --> 00:12:09,060
by N. The harmonic mean is where you first invert all the numbers, then add them together, then divide

158
00:12:09,060 --> 00:12:11,070
by N, and then invert the result.

159
00:12:11,970 --> 00:12:16,050
So a simple way to think about it is that it's just another way of computing the mean.

160
00:12:16,830 --> 00:12:20,760
So the F1 score is some kind of mean of the precision and recall.
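In case you do want to implement it yourself, here is a minimal sketch. The function names and the example values are illustrative, and returning 0.0 when both precision and recall are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(values):
    """Invert each number, add them, divide by N, then invert the result."""
    mean_of_inverses = sum(1 / v for v in values) / len(values)
    return 1 / mean_of_inverses


# Illustrative values only.
precision, recall = 0.5, 1.0
f1 = f1_score(precision, recall)
hm = harmonic_mean([precision, recall])
regular_mean = (precision + recall) / 2

print(f"F1            = {f1:.4f}")
print(f"harmonic mean = {hm:.4f}")
print(f"regular mean  = {regular_mean:.4f}")
```

The F1 formula and the harmonic mean give the same number, and in this example it is lower than the regular mean because the harmonic mean is pulled toward the smaller of the two values.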