1
00:00:11,060 --> 00:00:14,990
So in this lecture, we will continue our discussion on class imbalance.

2
00:00:15,890 --> 00:00:21,560
Now there's one final metric I want to discuss, which comes from the era of radar in the field of estimation

3
00:00:21,560 --> 00:00:22,340
and detection.

4
00:00:23,150 --> 00:00:27,170
This is where many of the statistical methods we use today were invented.

5
00:00:28,870 --> 00:00:35,890
So one helpful plot that was used was the receiver operating characteristic, or the ROC curve. This

6
00:00:35,890 --> 00:00:41,860
curve is a plot of the true positive rate on the y axis and the false positive rate on the x axis.

7
00:00:42,430 --> 00:00:46,300
Note that the false positive rate is just one minus the true negative rate.

8
00:00:47,230 --> 00:00:53,050
So a typical ROC curve will go up and to the right, and it will be steep on the left side while being more

9
00:00:53,050 --> 00:00:54,370
flat on the right side.

10
00:00:55,150 --> 00:00:58,060
So let's think about how this type of curve arises.

11
00:01:02,600 --> 00:01:08,960
The main idea behind the ROC is that your decision rule can be based on a threshold, and where you define

12
00:01:08,960 --> 00:01:13,460
that threshold to be will affect these true positive and false positive rates.

13
00:01:14,480 --> 00:01:19,730
Now we haven't really discussed this yet, but there is typically a tradeoff between specificity and

14
00:01:19,730 --> 00:01:22,190
sensitivity. In particular,

15
00:01:22,190 --> 00:01:24,410
suppose that our sensitivity is low.

16
00:01:25,070 --> 00:01:30,050
That is, the number of true positives is low because we're not catching enough of the positives.

17
00:01:30,740 --> 00:01:35,600
So if your job is to detect enemy ships on your radar, then you're letting too many of them slip

18
00:01:35,600 --> 00:01:36,080
by.

19
00:01:36,650 --> 00:01:39,380
In other words, our detector is not sensitive enough.

20
00:01:40,220 --> 00:01:44,870
In that case, we can simply make it more sensitive by lowering the threshold of detection.

21
00:01:45,980 --> 00:01:52,100
Normally, we would choose the positive class if the probability of that class is bigger than 50 percent.

22
00:01:52,760 --> 00:01:55,460
However, we don't have to use this as a threshold.

23
00:01:56,000 --> 00:01:59,240
Instead, we could pick 40 percent or 30 percent.

24
00:01:59,840 --> 00:02:05,240
By doing this, our classifier will make more positive detections, hopefully increasing the number

25
00:02:05,240 --> 00:02:06,350
of true positives.

26
00:02:07,130 --> 00:02:09,410
The tradeoff is that this won't be perfect.

27
00:02:09,800 --> 00:02:12,530
We could end up having more false positives as well.

28
00:02:12,920 --> 00:02:17,600
This will increase the false positive rate or, in other words, decrease the specificity.

29
00:02:22,360 --> 00:02:24,460
So that brings us back to the ROC.

30
00:02:25,240 --> 00:02:31,060
Basically, the ROC is formed by letting this threshold of detection go all the way from zero percent

31
00:02:31,330 --> 00:02:34,540
up to 100 percent. At zero percent,

32
00:02:34,570 --> 00:02:40,150
we will predict that everything is positive, so our true positive rate will be one and our false positive

33
00:02:40,150 --> 00:02:41,440
rate will also be one.

34
00:02:42,160 --> 00:02:48,400
This is because our threshold is zero percent. Since any number output

35
00:02:48,400 --> 00:02:52,870
by our model will be greater than zero percent, our model will always predict one.

36
00:02:53,680 --> 00:02:59,200
So we catch all the true positives, or the enemy ships, but all the negatives will be predicted as

37
00:02:59,200 --> 00:03:00,370
positive as well.

38
00:03:01,210 --> 00:03:06,970
At 100 percent, our true positive rate will be zero and our false positive rate will also be zero.

39
00:03:07,630 --> 00:03:12,070
In this case, we never predict positive, so we never detect any enemy ships.

40
00:03:12,520 --> 00:03:18,340
But on the other hand, we never falsely identify something which is not an enemy ship as an enemy ship.
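
To make the threshold sweep concrete, here is a minimal sketch in plain Python. The labels, the scores, and the helper name `rates` are all made up for illustration; they do not come from the lecture. It follows the rule described above, predicting positive whenever the model output is greater than the threshold.

```python
def rates(y_true, scores, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    # Predict positive when the score is strictly greater than the threshold.
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


# Hypothetical targets (1 = enemy ship) and model outputs p(y = 1 | x).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

# Lower thresholds catch more positives but also raise more false alarms.
for t in [0.0, 0.3, 0.5, 0.7, 1.0]:
    tpr, fpr = rates(y_true, scores, t)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

With this toy data, a threshold of zero gives a true positive rate and false positive rate of one, and a threshold of one gives zero for both, matching the two endpoints of the curve.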

41
00:03:23,010 --> 00:03:25,770
The interesting part of the ROC is its shape.

42
00:03:26,520 --> 00:03:30,690
Typically, we draw a line from the bottom left corner up to the top right corner.

43
00:03:31,200 --> 00:03:33,480
This line is the benchmark for random guessing.

44
00:03:34,200 --> 00:03:39,930
That is to say this model would be very dumb and just output a random probability for every input sample.

45
00:03:40,680 --> 00:03:45,900
As an exercise, I recommend thinking about why that would yield a straight line in the ROC.

46
00:03:47,520 --> 00:03:52,770
Now, with this line as our benchmark, we can start to think about what kind of ROC we actually

47
00:03:52,770 --> 00:03:53,550
want to see.

48
00:03:54,180 --> 00:03:59,670
Well, what we want to see, as you can probably tell, is a curve that goes to the top left of our

49
00:03:59,670 --> 00:04:00,360
benchmark.

50
00:04:00,840 --> 00:04:03,720
The further toward the top left it can go, the better it is.

51
00:04:04,290 --> 00:04:06,390
Let's think about why that is the case.

52
00:04:10,990 --> 00:04:16,209
To understand this, let's think about the extreme case where the ROC shoots up to one immediately

53
00:04:16,209 --> 00:04:17,200
and stays there.

54
00:04:17,800 --> 00:04:18,760
What does this mean?

55
00:04:19,630 --> 00:04:24,220
This means that no matter what my threshold is, my true positive rate is perfect.

56
00:04:24,820 --> 00:04:29,980
Therefore, I can drive my false positive rate down to zero while still keeping my true positive rate

57
00:04:29,980 --> 00:04:30,550
at one.

58
00:04:31,570 --> 00:04:37,420
Furthermore, as you recall, a false positive rate of zero corresponds to a specificity of one.

59
00:04:38,170 --> 00:04:44,170
Therefore, in this case, our classifier is perfect for both classes and does not make any mistakes.

60
00:04:44,560 --> 00:04:50,950
Both the sensitivity and specificity are one. Of course, in reality, for most practical datasets,

61
00:04:51,160 --> 00:04:53,530
the curve will be below this perfect case.

62
00:04:58,140 --> 00:04:59,730
Now, here's something to think about.

63
00:05:00,330 --> 00:05:02,070
We've been talking about metrics.

64
00:05:02,640 --> 00:05:06,510
Our goal was to quantify the performance of a model with a single number.

65
00:05:07,230 --> 00:05:10,380
But we seem to have made things worse by introducing a plot

66
00:05:10,740 --> 00:05:11,810
you have to look at.

67
00:05:12,300 --> 00:05:15,720
We seem to have gone in the opposite direction from what we intended.

68
00:05:16,380 --> 00:05:17,970
Of course, we are still not done.

69
00:05:18,600 --> 00:05:21,600
In fact, we can get a single number from this plot.

70
00:05:22,200 --> 00:05:26,040
The single number is called the AUC or the area under the curve.

71
00:05:26,670 --> 00:05:30,960
As you might expect, it is simply the area underneath the ROC curve.

72
00:05:31,770 --> 00:05:34,680
Note that both the height and the width of this box are one.

73
00:05:34,890 --> 00:05:37,590
Therefore, the area of the box is also one.

74
00:05:38,400 --> 00:05:43,430
Now, because the total area of the box is one, that means a perfect AUC is one.

75
00:05:43,560 --> 00:05:45,720
It covers the whole area of the box.

76
00:05:46,860 --> 00:05:52,380
Note that the benchmark splits the box in two and thus the benchmark for random guessing is zero point

77
00:05:52,380 --> 00:05:52,920
five.

78
00:05:53,610 --> 00:05:59,250
Thus, your model's AUC is hopefully larger than zero point five, and ideally you want it to be close

79
00:05:59,250 --> 00:05:59,790
to one.

80
00:06:00,420 --> 00:06:03,480
So this is another alternative to the F1 score.
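
As a rough sketch of how the area could be computed by hand, the following plain Python collects the ROC points and applies the trapezoidal rule. The data and the function names `roc_points` and `area_under` are hypothetical, chosen only for this illustration; this is not how any particular library implements it.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, from threshold high (predict nothing) to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Use each distinct score as a threshold, plus -inf so that the last
    # point predicts everything as positive and lands on (1, 1).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s > t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s > t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical targets and model outputs p(y = 1 | x).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

print("model AUC:", area_under(roc_points(y_true, scores)))
# The random-guessing benchmark is the diagonal from (0, 0) to (1, 1).
print("benchmark AUC:", area_under([(0.0, 0.0), (1.0, 1.0)]))
```

The diagonal benchmark comes out to zero point five, since it cuts the unit box in half, and a curve that bows toward the top left gives a larger area.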

81
00:06:08,240 --> 00:06:14,060
So here's one complication of the AUC method that makes it a bit more complex to use compared to the

82
00:06:14,060 --> 00:06:19,910
F1 score. In particular, it requires your model to output probabilities.

83
00:06:20,480 --> 00:06:24,650
Specifically, the model needs to output P of Y equals one given X.

84
00:06:25,190 --> 00:06:30,260
This is assuming that your classes are assigned the values zero and one, and that one is the positive

85
00:06:30,260 --> 00:06:32,880
class. As you recall,

86
00:06:32,900 --> 00:06:38,300
this is needed since the ROC is formed by applying different thresholds from zero up to one.

87
00:06:39,080 --> 00:06:44,480
So, for example, if the threshold is zero point three and the model outputs zero point four, then

88
00:06:44,480 --> 00:06:47,000
we would assign that sample to the positive class.

89
00:06:47,510 --> 00:06:52,940
But if the threshold were zero point five, then we would assign that same sample to the negative class.

90
00:06:53,660 --> 00:06:57,410
So the model must be able to output this posterior probability.

91
00:06:58,310 --> 00:07:04,190
Now it's worth discussing how one usually does this in the most popular Python library for machine learning,

92
00:07:04,460 --> 00:07:05,750
which is scikit-learn.

93
00:07:07,230 --> 00:07:10,890
Note that if you're implementing models yourself, then this would be obvious.

94
00:07:11,160 --> 00:07:16,020
But if you're using scikit-learn, it may not be obvious if you're not familiar with the API.

95
00:07:16,500 --> 00:07:18,570
So it's worth reviewing how this works.

96
00:07:20,400 --> 00:07:26,040
So normally when you call model.predict inside scikit-learn, you get back an N-length one-dimensional

97
00:07:26,040 --> 00:07:26,590
array,

98
00:07:26,910 --> 00:07:33,630
assuming that you have N input samples. If you have K classes, this array will contain integers from

99
00:07:33,630 --> 00:07:35,130
zero up to K minus one.

100
00:07:35,940 --> 00:07:38,790
So this is the typical way to get a prediction inside

101
00:07:38,790 --> 00:07:39,330
scikit-learn.

102
00:07:40,110 --> 00:07:42,710
The question is how do we get probabilities?

103
00:07:47,190 --> 00:07:50,970
The answer is to call a different function, which is usually called predict_proba.

104
00:07:51,990 --> 00:07:56,700
Now keep in mind, scikit-learn is another library which is part of the NumPy stack, which

105
00:07:56,700 --> 00:08:00,600
means it is constantly changing from month to month and year to year.

106
00:08:01,170 --> 00:08:04,770
So while this is the case today, it may not be the case forever.

107
00:08:05,640 --> 00:08:08,310
In any case, what can we expect from this function?

108
00:08:09,540 --> 00:08:14,790
Well, supposing again that we have N samples and K classes, we will get back a two-dimensional

109
00:08:14,790 --> 00:08:21,480
array of size N by K. This time, the array values are not integers, but floating point numbers.

110
00:08:21,930 --> 00:08:25,350
And furthermore, these floating point numbers are probabilities.

111
00:08:25,950 --> 00:08:33,570
In particular, if we call the output P, then P[n, k] will be the posterior probability that sample n

112
00:08:33,570 --> 00:08:35,130
belongs to class k.

113
00:08:35,909 --> 00:08:40,740
In other words, mathematically, it is p of y sub n equals k given x sub n.

114
00:08:42,270 --> 00:08:47,670
Note that because of this convention, each row of this array will sum to one, since each row of the

115
00:08:47,670 --> 00:08:50,550
matrix is a separate probability distribution.
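
Here is a small sketch of the difference between the two calls, assuming scikit-learn and NumPy are installed. The toy three-class data and the choice of LogisticRegression are assumptions made only for this illustration; any classifier that provides predict_proba should behave similarly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: N = 9 samples, one feature, K = 3 classes.
X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4], [4.0], [4.2], [4.4]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

model = LogisticRegression().fit(X, y)

labels = model.predict(X)    # one integer class label per sample
P = model.predict_proba(X)   # one row of K probabilities per sample

print(labels.shape)          # N-length one-dimensional array
print(P.shape)               # N by K two-dimensional array
print(P.sum(axis=1))         # each row is a probability distribution
```

Here P[n, k] plays the role of the posterior probability that sample n belongs to class k.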

116
00:08:51,960 --> 00:08:57,660
Now, back to the binary case specifically, as you recall, we want to know the probability that the

117
00:08:57,660 --> 00:09:00,670
class is equal to one, and since the output is binary,

118
00:09:00,690 --> 00:09:02,750
there are only two columns in the matrix.

119
00:09:03,330 --> 00:09:08,820
In this case, we want the second column or the column at Index one, since the first column corresponds

120
00:09:08,820 --> 00:09:10,710
to the probability for class zero.

121
00:09:15,420 --> 00:09:21,270
So putting this all together, here is how we would compute the AUC inside scikit-learn from a trained

122
00:09:21,270 --> 00:09:22,140
binary model.

123
00:09:23,520 --> 00:09:29,580
First, suppose we have our target array, which is called Y. This array contains only the integers

124
00:09:29,580 --> 00:09:30,570
zero and one.

125
00:09:31,590 --> 00:09:34,710
Suppose we have a corresponding input matrix called X.

126
00:09:35,430 --> 00:09:41,550
What we need to do is get the predictions for X by calling model.predict_proba, passing in X.

127
00:09:42,180 --> 00:09:45,900
This will give us the posterior probabilities which we will call P.

128
00:09:46,740 --> 00:09:49,710
As you recall, P is an N by two matrix.

129
00:09:50,940 --> 00:09:56,070
Since we only want the probabilities for class one, we will grab the column at Index one.

130
00:09:57,030 --> 00:10:04,410
Finally, we will pass Y and the column at index one of P into the function called roc_auc_score.

131
00:10:05,130 --> 00:10:08,100
This will then return a single number, which is the AUC.
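
The steps just described can be sketched as follows, assuming scikit-learn and NumPy are installed. The toy data and the LogisticRegression model are stand-ins chosen for this illustration, and the model is scored on its own training data only to keep the example short; in practice you would likely score held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical binary problem: Y contains only the integers zero and one.
X = np.array([[0.10], [0.20], [0.35], [0.40], [0.55], [0.60], [0.80], [0.90]])
Y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Any trained binary model with predict_proba would do here.
model = LogisticRegression().fit(X, Y)

# P is an N by two matrix of posterior probabilities.
P = model.predict_proba(X)

# Column at index one holds the probabilities for class one.
p_class1 = P[:, 1]

# A single number summarizing the ROC curve.
auc = roc_auc_score(Y, p_class1)
print("AUC:", auc)
```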

132
00:10:12,860 --> 00:10:18,140
Now, as a final side note, it's worth mentioning that not all classifiers inside scikit-learn have a

133
00:10:18,140 --> 00:10:19,760
function called predict_proba.

134
00:10:20,330 --> 00:10:23,240
This is simply due to how certain classifiers work.

135
00:10:23,810 --> 00:10:27,170
Some algorithms simply do not output probabilities.

136
00:10:28,310 --> 00:10:30,470
One example of this is the perceptron.

137
00:10:31,520 --> 00:10:37,610
Another example is the support vector machine, although there is an ad hoc method to obtain probabilities,

138
00:10:37,820 --> 00:10:39,110
which has some issues.

139
00:10:39,620 --> 00:10:43,250
In particular, the probabilities may not be consistent with the predictions.

140
00:10:43,550 --> 00:10:46,430
And this method does not scale well for large datasets.
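
As a small sketch of what this looks like in practice, assuming scikit-learn and NumPy are installed, the code below checks whether a Perceptron offers predict_proba. It then shows one possible workaround that is not covered in this lecture: roc_auc_score also accepts unthresholded decision scores, such as those returned by decision_function. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import roc_auc_score

# Hypothetical binary toy data.
X = np.array([[0.10], [0.20], [0.35], [0.40], [0.55], [0.60], [0.80], [0.90]])
Y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = Perceptron(random_state=0).fit(X, Y)

# The perceptron does not output probabilities.
print(hasattr(model, "predict_proba"))

# It does expose a real-valued score per sample (signed distance to the
# decision boundary), which can be ranked and thresholded like a probability.
scores = model.decision_function(X)
auc = roc_auc_score(Y, scores)
print("AUC from decision scores:", auc)
```

The scores are not probabilities, so the thresholds no longer run from zero to one, but the ROC only depends on how the samples are ranked.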

141
00:10:47,510 --> 00:10:53,450
On the other hand, since neural networks naturally output probabilities, using the AUC for imbalanced

142
00:10:53,450 --> 00:10:56,660
classes in deep learning is typically a fine choice.