1
00:00:00,090 --> 00:00:06,030
One of the most common things we want to do with a data set is to get a measure of its center.

2
00:00:06,030 --> 00:00:12,090
And there are many ways to think about this idea of center, the first of which is the idea of mean.

3
00:00:12,090 --> 00:00:15,690
And this is the most common measure of center that we use.

4
00:00:15,690 --> 00:00:18,420
We usually think about mean as average.

5
00:00:18,420 --> 00:00:22,770
But when we're being a little more technical, we call it the mean or the arithmetic mean.

6
00:00:22,770 --> 00:00:27,630
And this here is the formula that we use to calculate mean.

7
00:00:27,630 --> 00:00:34,380
But really all this formula says is that we are to add up all of the data points in our data set, all

8
00:00:34,380 --> 00:00:40,260
the values in our data set, and then divide by the number of data points or the number of values that

9
00:00:40,260 --> 00:00:42,090
we have in the data set.

10
00:00:42,210 --> 00:00:49,470
So the way that we read this formula here is to say that this sigma notation here, this big E is called

11
00:00:49,470 --> 00:00:49,890
sigma.

12
00:00:49,890 --> 00:00:51,420
It means to sum up.

13
00:00:51,420 --> 00:00:59,100
And so we are adding up here all of the values of x, and we use this little indicator i here to mean

14
00:00:59,100 --> 00:01:06,330
the first value of x, second value of x, third value of x, etc. So this i equals one notation below

15
00:01:06,330 --> 00:01:13,020
the sigma notation means start with the first value and keep adding until you get to the nth value.

16
00:01:13,020 --> 00:01:19,350
So if there are ten values in the data set, this says start with the first value in the data set,

17
00:01:19,350 --> 00:01:24,780
the i equals one value, and add that to the second value, the third value, the fourth value, etc.,

18
00:01:24,780 --> 00:01:27,600
all the way up to the n equals 10th value.

19
00:01:27,600 --> 00:01:33,630
We're adding all the values together in the data set and then dividing by n equals ten, the number of

20
00:01:33,630 --> 00:01:35,220
data points in the set.

21
00:01:35,220 --> 00:01:37,680
So this is just a fancy formula

22
00:01:37,680 --> 00:01:42,930
to summarize what we're doing here: sum all the data points, divide by the number of data points in

23
00:01:42,930 --> 00:01:43,530
the set.

24
00:01:43,530 --> 00:01:49,710
So for instance, if we have a data set like this one, a very simple data set, we can see we have

25
00:01:49,710 --> 00:01:53,670
five points in the set: one, two, four, six and seven.

26
00:01:53,670 --> 00:01:59,400
There are five values in this data set. To use our formula here to find the mean,

27
00:01:59,400 --> 00:02:05,460
we would simply add all of these values together and then divide by five because there are five values

28
00:02:05,460 --> 00:02:06,480
in the data set.

29
00:02:06,570 --> 00:02:13,020
The result then, of course, the sum of all of the values in the numerator is 20, and 20 divided by five

30
00:02:13,020 --> 00:02:13,710
is four.

31
00:02:13,830 --> 00:02:20,490
And so we can say here that the mean is 20 divided by five, which is equal to four.
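The arithmetic just described can be sketched in a few lines of Python. This is a minimal sketch and not part of the lecture itself: the function name `arithmetic_mean` is an assumption chosen for illustration, and the data set is the five-value example from the narration.

```python
def arithmetic_mean(values):
    """Sum every value, then divide by how many values there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    total = 0
    for x_i in values:  # plays the role of the sigma: i runs from 1 to n
        total += x_i
    return total / len(values)


data = [1, 2, 4, 6, 7]
print(arithmetic_mean(data))  # 20 / 5 = 4.0
```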

32
00:02:20,520 --> 00:02:25,200
Now, the way that we indicate mean depends on what we're talking about.

33
00:02:25,230 --> 00:02:29,280
We mentioned earlier the idea of population versus sample.

34
00:02:29,280 --> 00:02:34,800
Well, if we're taking the mean of the population, let's say that this data set here accounts for the

35
00:02:34,800 --> 00:02:41,370
entire population, then we usually use the Greek letter mu, which looks like this u here.

36
00:02:41,370 --> 00:02:48,150
And we say that the mean is given by mu, and so we would say mu is equal to four; the mean of the population

37
00:02:48,150 --> 00:02:49,260
is equal to four.

38
00:02:49,290 --> 00:02:56,340
If on the other hand we're taking the mean of a sample, we usually indicate that with this notation,

39
00:02:56,550 --> 00:03:01,110
which we read as x bar, it's just x with a bar over the top of it.

40
00:03:01,110 --> 00:03:06,750
And so if you see x bar, it usually means the mean of a sample of the population.

41
00:03:06,750 --> 00:03:12,270
Whereas this notation here mu usually means the mean of the entire population.

42
00:03:12,270 --> 00:03:15,780
The formula is the same in both cases, right?

43
00:03:15,780 --> 00:03:18,990
This could be x bar or it could be mu.

44
00:03:19,020 --> 00:03:24,510
We would still use this same formula to sum all the data points and divide by the number of data points.
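Written out, the formula being described would look roughly like this. The on-screen formula is not part of the transcript, so this is a reconstruction from the narration, using n for the number of values in both cases:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(sample mean)}
\qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(population mean)}
```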

45
00:03:24,510 --> 00:03:27,870
But we would use x bar to indicate that value

46
00:03:27,870 --> 00:03:30,180
if we are finding the mean of the sample, and we use

47
00:03:30,180 --> 00:03:30,780
mu

48
00:03:30,810 --> 00:03:36,480
if we are finding the mean of the population. We really want to think about the mean either way, whether

49
00:03:36,480 --> 00:03:40,170
it's the mean of the sample or the mean of a population, we really want to think about

50
00:03:40,170 --> 00:03:43,020
the mean as the balancing point of the data.

51
00:03:43,020 --> 00:03:48,120
And we also want to realize that there are some different things we can do with mean that are interesting,

52
00:03:48,120 --> 00:03:51,420
including for instance, take a weighted mean.

53
00:03:51,420 --> 00:03:58,410
So let's say we work at a company and we've asked every department manager to survey all of the employees

54
00:03:58,410 --> 00:04:02,820
in their department and get from them an employee satisfaction score.

55
00:04:02,850 --> 00:04:09,120
On a scale of 1 to 10, ten being perfectly satisfied, one being extremely unsatisfied.

56
00:04:09,240 --> 00:04:16,709
So with this first data point here, we have a department that includes 20 employees, and the department

57
00:04:16,709 --> 00:04:17,310
manager,

58
00:04:17,310 --> 00:04:23,550
the department head, has reported to us that the mean employee satisfaction score in their department

59
00:04:23,730 --> 00:04:25,080
is 8.4.

60
00:04:25,470 --> 00:04:32,490
Another department manager has told us that of their seven employees, the mean employee satisfaction

61
00:04:32,490 --> 00:04:34,920
score is 6.1 in that department.

62
00:04:34,920 --> 00:04:41,610
So what you see here is that we have all different department sizes across our company and we're getting

63
00:04:41,610 --> 00:04:47,040
a different employee satisfaction score from each department head. Each department head,

64
00:04:47,040 --> 00:04:52,080
each department manager, has already taken a mean for their own department.

65
00:04:52,080 --> 00:04:58,080
When we have a situation like this, we have to take a weighted mean if we want to get a mean employee

66
00:04:58,080 --> 00:04:59,940
satisfaction score for the entire company.

67
00:05:00,180 --> 00:05:08,280
Because obviously here we can get kind of an intuitive sense that this 8.4 value for satisfaction counts

68
00:05:08,280 --> 00:05:14,160
a little heavier or should count a little heavier than this employee satisfaction score here of 6.1,

69
00:05:14,160 --> 00:05:19,410
because there are 20 employees in this department, only seven employees in this department.

70
00:05:19,410 --> 00:05:25,670
So intuitively, we should weight this 8.4 a little heavier than we weight this 6.1.

71
00:05:25,680 --> 00:05:27,060
So how do we do that?

72
00:05:27,060 --> 00:05:32,760
Well, the formula is virtually the same as what we did up here, except that we're accounting for this

73
00:05:32,760 --> 00:05:34,140
different weighting.

74
00:05:34,140 --> 00:05:42,270
So what we do is we sum up all of the products of the actual score and the weighting, and then we divide

75
00:05:42,270 --> 00:05:44,820
that by the sum of all of the weights.

76
00:05:44,820 --> 00:05:48,600
So if we do that with this data set here, here's what that looks like.

77
00:05:48,630 --> 00:05:54,630
We take the product of 20 and 8.4, we add that to the product of seven and 6.1.

78
00:05:54,630 --> 00:05:59,790
Then we add to that 13 times 9.1 plus 25 times 7.8.

79
00:05:59,790 --> 00:06:04,890
So we're taking all these products in the numerator, adding them together, and then we're dividing

80
00:06:04,890 --> 00:06:06,510
by the total weight.

81
00:06:06,510 --> 00:06:11,220
This denominator should make intuitive sense to us because if we go back to our original formula for

82
00:06:11,220 --> 00:06:17,100
the mean, we're dividing by the number of data points down here. We know that if we add up all of

83
00:06:17,100 --> 00:06:23,490
these department sizes, we get 20 plus seven, plus 13 is 40, plus 25 is 65.

84
00:06:23,490 --> 00:06:27,570
We have 65 total employees across four departments.

85
00:06:27,570 --> 00:06:31,860
So this denominator here is really just like the denominator up here.

86
00:06:31,860 --> 00:06:38,250
We have 65 total employees, so we kind of have this same idea of total number of data points in the

87
00:06:38,250 --> 00:06:39,990
denominator of each formula.

88
00:06:40,020 --> 00:06:42,360
The only thing that's different is this numerator.

89
00:06:42,360 --> 00:06:47,460
And in the numerator here, that's where we account for the different weights based on these different

90
00:06:47,460 --> 00:06:48,660
department sizes.

91
00:06:48,660 --> 00:06:54,450
And if we were to actually do this math, what we'd see here is that we get an approximate employee

92
00:06:54,450 --> 00:07:01,770
satisfaction score of 8.1, and that is a weighted mean based on the data that was turned in by our

93
00:07:01,770 --> 00:07:03,360
department managers.
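The weighted mean calculation above can be checked with a short Python sketch. The function name `weighted_mean` and the variable names are assumptions made for illustration; the department sizes and scores are the four pairs read out in the narration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / total_weight


# Department sizes (weights) and mean satisfaction scores from the narration.
department_sizes = [20, 7, 13, 25]
department_scores = [8.4, 6.1, 9.1, 7.8]

company_score = weighted_mean(department_scores, department_sizes)
print(round(company_score, 1))  # 524 / 65, which rounds to 8.1
```

For comparison, simply averaging the four department means without weights would give 31.4 / 4 = 7.85, which treats the 7-person department the same as the 25-person one.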

94
00:07:03,360 --> 00:07:06,390
So we can calculate a mean, we can calculate a weighted mean.

95
00:07:06,510 --> 00:07:11,160
It's also important to realize that we can calculate what's called a truncated mean.

96
00:07:11,400 --> 00:07:17,250
This comes into play when we have data where maybe we've introduced some outliers.

97
00:07:17,250 --> 00:07:24,900
For instance, if we have this data set here, notice that 16, 18, 21, etc., all the way to 33, this

98
00:07:24,900 --> 00:07:32,010
part of the data set seems to be fairly consistent and then all of a sudden we have this enormous 91

99
00:07:32,010 --> 00:07:35,430
value in the data set, which looks to be an outlier.

100
00:07:35,430 --> 00:07:40,320
It looks to be outside of the normal range of the rest of the data set.

101
00:07:40,320 --> 00:07:47,220
So it's possible, based on the context of how we collected this data, that this 91 value is an extreme

102
00:07:47,220 --> 00:07:48,870
value, it's an outlier.

103
00:07:48,870 --> 00:07:51,720
It doesn't really fit with the rest of the data.

104
00:07:51,720 --> 00:07:57,420
Depending on why we think this value occurred, we may choose to calculate a truncated mean.

105
00:07:57,420 --> 00:08:02,040
And what we would do in that case is say this 91 value looks like an outlier.

106
00:08:02,040 --> 00:08:07,230
So what we're going to do is we're going to ignore it, so we're going to ignore this 91.

107
00:08:07,230 --> 00:08:12,630
But because we're taking a value off of the high end of the data set, we're also going to take a value

108
00:08:12,630 --> 00:08:19,080
off of the low end of the data set to kind of balance it out, leaving us with just these six values

109
00:08:19,080 --> 00:08:22,230
in the data set between 18 and 33.

110
00:08:22,230 --> 00:08:27,720
From there, then we would just calculate the mean of this six point data set.

111
00:08:27,720 --> 00:08:31,860
We would add 18, 21, 27, 32, 32 and 33.

112
00:08:31,860 --> 00:08:39,270
We'd add those together and divide by six because we have six values here and we would calculate a mean,

113
00:08:39,510 --> 00:08:47,760
but it would be very important if we did that to indicate that we eliminated 25% of our data because

114
00:08:47,760 --> 00:08:52,170
there were eight values in the original dataset, we took away two of them.

115
00:08:52,380 --> 00:08:59,190
That means we took away 25% of our data set in order to calculate a truncated mean.
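As a rough sketch of the truncated mean described here, the Python below drops one value from each end of the ordered data and also reports what fraction of the data was removed, since the narration stresses communicating that. The function name `truncated_mean` and its `drop_each_end` parameter are hypothetical names chosen for this illustration.

```python
def truncated_mean(values, drop_each_end=1):
    """Mean after dropping the same number of values from both ends."""
    if drop_each_end < 0:
        raise ValueError("drop_each_end must not be negative")
    ordered = sorted(values)
    if 2 * drop_each_end >= len(ordered):
        raise ValueError("nothing would be left after trimming")
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    removed_fraction = 2 * drop_each_end / len(ordered)
    return sum(kept) / len(kept), removed_fraction


# The eight-value data set from the narration, including the suspected outlier 91.
data = [16, 18, 21, 27, 32, 32, 33, 91]
mean_value, removed = truncated_mean(data)
print(f"truncated mean: {mean_value:.2f}, removed {removed:.0%} of the data")
```

The six kept values sum to 163, so the truncated mean is 163 / 6, compared with 270 / 8 = 33.75 for the full data set.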

116
00:08:59,190 --> 00:09:05,010
If we're ever going to do something like this, we need to be really, really careful when we do it,

117
00:09:05,010 --> 00:09:11,400
because if we're taking out an outlier, doing so may give us a more accurate picture of what's going

118
00:09:11,400 --> 00:09:11,790
on.

119
00:09:11,790 --> 00:09:15,150
But it may also give us a less accurate picture of what's going on.

120
00:09:15,150 --> 00:09:22,050
Maybe this 91 value that we got in our survey, or that's sitting out here in our sample or our population,

121
00:09:22,050 --> 00:09:26,970
maybe this value is totally accurate and extremely important to include in the data set.

122
00:09:26,970 --> 00:09:33,060
Or maybe something just went wrong with this 91 value and it is more appropriate to eliminate it to

123
00:09:33,060 --> 00:09:34,950
get a more accurate mean.

124
00:09:34,950 --> 00:09:39,690
We may feel fairly confident about eliminating this 91 value or we may not.

125
00:09:39,720 --> 00:09:46,260
Either way, if we take away a value from our dataset, it's very important to indicate and to communicate

126
00:09:46,260 --> 00:09:52,260
to anybody looking at this data, looking at the mean that we calculated that we did remove some outlier

127
00:09:52,260 --> 00:09:58,170
data, that we removed 25% of the data set and that we're actually calculating here a truncated mean

128
00:09:58,170 --> 00:09:59,280
as opposed to a

129
00:09:59,470 --> 00:10:02,150
true mean with the original data set.

130
00:10:02,170 --> 00:10:07,960
We always want to be really, really, really careful whenever we are manipulating the original data

131
00:10:07,960 --> 00:10:09,460
set that we collected.

132
00:10:09,460 --> 00:10:15,010
But this is something that we can do if we have really good reason to believe that we'll get a more

133
00:10:15,010 --> 00:10:17,560
accurate picture of what's happening with the data.

134
00:10:17,560 --> 00:10:21,730
And then we just want to make sure to communicate that that's in fact what we have done.

135
00:10:21,730 --> 00:10:24,750
So this is kind of the general idea of mean.

136
00:10:24,760 --> 00:10:28,660
Let's look at a second measure of central tendency, which is the mode.

137
00:10:29,600 --> 00:10:35,240
Now, the mode of a data set is an even simpler concept than the idea of the mean of the set.

138
00:10:35,270 --> 00:10:39,230
It is simply the value that occurs most often in the set.

139
00:10:39,320 --> 00:10:45,230
So for instance, if we're looking at this data set, we can see that every value in the set occurs

140
00:10:45,230 --> 00:10:50,860
once except this value here of 32, which occurs twice.

141
00:10:50,870 --> 00:10:56,210
So because the value 32 occurs more than any other value in the set,

142
00:10:56,240 --> 00:10:59,840
we say that 32 is the mode of the data set.

143
00:10:59,990 --> 00:11:01,100
That would be the mode.

144
00:11:01,130 --> 00:11:02,600
We don't say that

145
00:11:02,600 --> 00:11:04,740
two is the mode of the data set.

146
00:11:04,760 --> 00:11:09,530
Sometimes people think that the count of that highest value will be the mode.

147
00:11:09,530 --> 00:11:10,460
The mode here is not

148
00:11:10,460 --> 00:11:15,110
two, the number of times that 32 occurs; the mode is actually 32.

149
00:11:15,320 --> 00:11:18,410
So we could say that the mode of the data set is 32.

150
00:11:18,440 --> 00:11:24,320
Now, a couple of things to think about with mode: we can also have a bimodal data set.

151
00:11:24,320 --> 00:11:30,590
So this is the same set except that we have inserted an extra data point, an extra value of 18 here.

152
00:11:30,590 --> 00:11:35,630
So 18 occurs twice and 32 occurs twice.

153
00:11:35,930 --> 00:11:43,610
So we might say that this data set is bimodal, that it has two modes: a mode of 18 and a mode of 32.

154
00:11:43,760 --> 00:11:49,130
So we list 18 and 32 and indicate that this is a bimodal data set.

155
00:11:49,250 --> 00:11:53,310
Or we can have the scenario where the data set has no mode.

156
00:11:53,330 --> 00:11:57,200
In this case, every data point occurs only one time.

157
00:11:57,230 --> 00:12:01,310
We have 16, 18, 21, 27, 32, 33, 91.

158
00:12:01,310 --> 00:12:06,790
There's no value in the set that occurs more than any other value in this set.

159
00:12:06,800 --> 00:12:08,570
There's no standout value.

160
00:12:08,570 --> 00:12:11,960
And so we would say that this data set has no mode.
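The three mode scenarios can be sketched in Python as follows. The function name `modes` is an assumption for illustration, and returning an empty list when every value occurs once follows the narration's "no mode" convention rather than any library's behavior. The bimodal data set is assumed to be the eight-value set with one extra 18 inserted.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: the "no mode" case
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([16, 18, 21, 27, 32, 32, 33, 91]))      # [32], the value, not its count
print(modes([16, 18, 18, 21, 27, 32, 32, 33, 91]))  # [18, 32], bimodal
print(modes([16, 18, 21, 27, 32, 33, 91]))          # [], no mode
```

Because the function only counts occurrences, it would work just as well on non-numeric values such as strings.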

161
00:12:11,990 --> 00:12:17,300
Now we'll talk a little bit more about mode when we talk about when to use different types of central

162
00:12:17,300 --> 00:12:17,990
tendency.

163
00:12:17,990 --> 00:12:22,250
But this is the idea for now. And then the last measure of central tendency that we want to talk about

164
00:12:22,250 --> 00:12:24,110
is the idea of the median.

165
00:12:24,500 --> 00:12:32,330
Now, the median of a data set is the center of the data set that we find simply by ordering the entire

166
00:12:32,330 --> 00:12:35,390
data set and looking for the value in the center.

167
00:12:35,600 --> 00:12:41,750
So given this data set here, we've ordered the set from smallest to largest data point.

168
00:12:41,840 --> 00:12:48,920
And so all we want to do is work our way in toward the center. We'll eliminate here, or ignore, the smallest

169
00:12:48,920 --> 00:12:50,180
and the largest value.

170
00:12:50,210 --> 00:12:55,970
We keep working our way in until we get to the center and we see here then that the value left in the

171
00:12:55,970 --> 00:12:57,200
center is 27.

172
00:12:57,200 --> 00:13:00,750
So for this set, the median is 27.

173
00:13:00,770 --> 00:13:05,840
If we do the same thing for this set, working our way from the outside toward the inside, the data

174
00:13:05,840 --> 00:13:08,000
set is already ordered from smallest to largest.

175
00:13:08,010 --> 00:13:10,130
That's important when it comes to the median.

176
00:13:10,130 --> 00:13:12,050
We have to have the data in order.

177
00:13:12,230 --> 00:13:19,580
So if we start working our way in this way, we're left with now two values in the center, 27 and 32.

178
00:13:19,670 --> 00:13:26,750
Now, for the median in particular, we can't just pick the value of 27 or pick the value of 32 because

179
00:13:26,750 --> 00:13:29,170
neither one of them is exactly the center.

180
00:13:29,180 --> 00:13:35,870
So what we say is when we have an even number of data points in our set, we'll always be left with

181
00:13:35,870 --> 00:13:43,370
two data points that are at this center point position, and then in order to calculate median we'll actually

182
00:13:43,370 --> 00:13:49,910
go ahead and take the arithmetic mean, take the mean of these last two centralized data points.

183
00:13:49,910 --> 00:13:57,230
So with 27 and 32, we say 27 plus 32 and we have two data points here.

184
00:13:57,230 --> 00:13:58,940
So we divide by two.

185
00:13:58,970 --> 00:14:02,930
We end up with 59 over two.

186
00:14:03,050 --> 00:14:06,980
And the result then is 29.5.

187
00:14:06,980 --> 00:14:16,520
And so the median of this data set is 29.5, which is really just the mean of 27 and 32, the two most

188
00:14:16,520 --> 00:14:21,230
centralized data points when we order the data set from smallest to largest.
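Both median cases, an odd and an even number of data points, can be sketched in Python like this. The function name `median` is chosen for illustration, and sorting inside the function is a design choice so that the caller does not have to order the data first.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)  # the data must be in order before finding the center
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd count: one value is left in the center
    # even count: take the mean of the two most central values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([16, 18, 21, 27, 32, 33, 91]))      # 27
print(median([16, 18, 21, 27, 32, 32, 33, 91]))  # (27 + 32) / 2 = 29.5
```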

189
00:14:21,320 --> 00:14:26,240
So at the most basic level, that's how we calculate mean, median and mode.

190
00:14:26,240 --> 00:14:31,970
But what's even more interesting is thinking about when to use each of these measures of central tendency

191
00:14:31,970 --> 00:14:35,930
based on the kind of data we have and the context around our data.

192
00:14:36,410 --> 00:14:43,280
So earlier we talked about all different kinds of data, including continuous versus discrete data,

193
00:14:43,280 --> 00:14:47,630
nominal versus ordinal data, numeric versus non-numeric data.

194
00:14:47,630 --> 00:14:54,080
And now we've just introduced these ideas of mean, median and mode, these three different ways of measuring

195
00:14:54,080 --> 00:15:01,070
central tendency of measuring how the data collects itself around some center point, identifying the

196
00:15:01,070 --> 00:15:03,200
center point of a data set.

197
00:15:03,200 --> 00:15:08,750
What's far more interesting is thinking about when it's appropriate to use mean, when it's appropriate

198
00:15:08,750 --> 00:15:11,450
to use median, when it's appropriate to use mode.

199
00:15:11,450 --> 00:15:18,200
It might even be helpful to list more than one of these measures of center, depending on the data we're

200
00:15:18,200 --> 00:15:18,920
looking at.

201
00:15:18,920 --> 00:15:22,340
And not all of these measures are always going to make sense.

202
00:15:22,340 --> 00:15:28,970
For instance, remember we said earlier that nominal data was data that couldn't be ordered.

203
00:15:28,990 --> 00:15:31,420
As an example, we said species of animals.

204
00:15:31,420 --> 00:15:37,750
So if we think about household pets like dogs, cats, birds, etc., that's nominal data.

205
00:15:37,750 --> 00:15:40,600
It's data that doesn't really have an order to it.

206
00:15:40,600 --> 00:15:47,320
If that's the case, if we can't put data in a specific logical order, then there's really no way to

207
00:15:47,320 --> 00:15:50,050
calculate a median for that data.

208
00:15:50,050 --> 00:15:51,370
It really just doesn't make sense.

209
00:15:51,370 --> 00:15:56,800
If you remember when we learned to calculate median, we said that we had to line up the data in order

210
00:15:56,800 --> 00:16:03,580
from some sort of smallest to largest or one end to the other spectrum and then find the center point.

211
00:16:03,580 --> 00:16:08,700
And that's just really not applicable at all if the data is nominal.

212
00:16:08,710 --> 00:16:14,800
Similarly, when we calculated a mean, we added up all the data points and then divided by the number

213
00:16:14,800 --> 00:16:17,170
of data points. That's not going to make sense

214
00:16:17,170 --> 00:16:23,560
if we have non-numeric data. We obviously can't find a sum for a data set that is non-numeric, that

215
00:16:23,560 --> 00:16:24,850
isn't numbers.

216
00:16:24,850 --> 00:16:33,910
We can't add letters, we can't add animal species, we can't add colors in a mathematical numeric sense.

217
00:16:33,910 --> 00:16:39,580
So calculating a mean for non-numeric data just doesn't really make sense.

218
00:16:39,580 --> 00:16:46,360
But in all these other cases we should be able to calculate some kind of mean, median or mode for the data.

219
00:16:46,360 --> 00:16:52,420
For instance, we could think about continuous data like a measurement of height for a group of people.

220
00:16:52,420 --> 00:16:57,370
And we could imagine, of course, that we could calculate a mean height for that group of people.

221
00:16:57,370 --> 00:17:03,790
We could also think about a set of discrete data, maybe the number of children per family in a data

222
00:17:03,790 --> 00:17:04,810
set of families.

223
00:17:04,810 --> 00:17:06,910
Of course, that's going to be a discrete number.

224
00:17:06,910 --> 00:17:13,359
Every family is going to have zero children, one child, two children, three children, etc. No single

225
00:17:13,359 --> 00:17:16,630
family is going to have 2.43 children.

226
00:17:16,630 --> 00:17:19,060
So that would be an example of discrete data.

227
00:17:19,060 --> 00:17:24,790
But of course, we could find a mean for that data simply by adding up the total number of children

228
00:17:24,790 --> 00:17:27,220
and then dividing by the total number of families.

229
00:17:27,220 --> 00:17:28,329
We would get a mean.

230
00:17:28,480 --> 00:17:36,460
But keep in mind that in that example we might come up with a mean of, let's say 3.17 children per

231
00:17:36,460 --> 00:17:41,260
one family, and we have to think about and interpret the context of that data.

232
00:17:41,260 --> 00:17:48,370
We understand that no individual family is going to have 3.17 children while at the same time understanding

233
00:17:48,370 --> 00:17:54,010
that the mean of the data set could be 3.17 children. Skipping around here,

234
00:17:54,010 --> 00:17:59,500
we can think about, for instance, the mode of numeric versus non-numeric data.

235
00:17:59,500 --> 00:18:02,620
We've already looked at the mode of numeric data.

236
00:18:02,620 --> 00:18:08,020
We had that data set earlier where the data point 32 occurred twice in the data set, and we said that

237
00:18:08,020 --> 00:18:13,690
the mode of that data set was 32, but we could have non-numeric data and indicate a mode.

238
00:18:13,690 --> 00:18:17,380
For instance, maybe our family has two dogs

239
00:18:18,110 --> 00:18:19,790
and one cat

240
00:18:20,500 --> 00:18:22,030
and one bird

241
00:18:22,820 --> 00:18:24,200
as pets.

242
00:18:24,200 --> 00:18:27,770
We would say that the mode of this data set is dogs.

243
00:18:27,860 --> 00:18:34,520
This is non-numeric data in terms of dogs, cat and bird, in the sense that our data set would be dog,

244
00:18:34,520 --> 00:18:36,090
dog, cat, bird.

245
00:18:36,110 --> 00:18:41,630
That's non-numeric data, but we could clearly see there that in the set dog, dog, cat, bird, the

246
00:18:41,630 --> 00:18:43,130
mode is dog.

247
00:18:43,130 --> 00:18:45,140
And then maybe one last example.

248
00:18:45,140 --> 00:18:48,980
Let's look at the mean of nominal versus ordinal data here.

249
00:18:49,010 --> 00:18:54,260
Remember that nominal data is data that does not have some kind of specific order.

250
00:18:54,260 --> 00:18:58,100
It can't be ordered, whereas ordinal data can be ordered.

251
00:18:58,100 --> 00:19:02,390
We can put an ordinal data set in a logical specific order.

252
00:19:02,390 --> 00:19:08,360
And the question then of whether we can calculate a mean for a data set that can't be ordered or calculate

253
00:19:08,360 --> 00:19:11,570
the mean for a data set that can be ordered is dependent.

254
00:19:11,570 --> 00:19:16,310
It depends on whether or not the data is numeric or non-numeric.

255
00:19:16,310 --> 00:19:21,860
For instance, you could imagine an ordinal data set of grades, letter grades in a class.

256
00:19:21,860 --> 00:19:30,320
So if a professor is handing out grades A through F, that's data that can clearly be ordered, but

257
00:19:30,320 --> 00:19:35,600
we wouldn't be able to calculate a mean of that data set because this is non-numeric data.

258
00:19:35,600 --> 00:19:41,360
And so we can't add up these grades and then divide by the total number of grades. If, instead of letter

259
00:19:41,360 --> 00:19:48,920
grades, we're being given percentage grades, for instance, students have grades like 92% in the class

260
00:19:48,920 --> 00:19:51,350
or 73% in the class,

261
00:19:51,350 --> 00:19:57,140
now we have numeric data and we could calculate a mean. But if all we have is letter grades A through

262
00:19:57,140 --> 00:20:03,500
F, B plus and C minus, etc., we can order that, but it's not numeric, so we can't calculate a mean.

263
00:20:03,500 --> 00:20:08,660
So maybe we'll be able to calculate a mean for ordinal data if it's numeric data.

264
00:20:08,660 --> 00:20:10,370
Same thing with nominal data.

265
00:20:10,370 --> 00:20:16,520
Nominal data cannot be ordered, but we may still be able to calculate a mean if it's numeric data.

266
00:20:16,520 --> 00:20:20,480
Now, if it's numeric, presumably we should be able to order it.

267
00:20:20,480 --> 00:20:26,750
But the idea here is really just that we have to have numeric data if we're going to be able to calculate

268
00:20:26,750 --> 00:20:27,080
a mean.

269
00:20:27,080 --> 00:20:33,050
So this gives you a general idea of when we may be able to use mean, median and mode depending on the

270
00:20:33,050 --> 00:20:34,160
kind of data we have.

271
00:20:34,160 --> 00:20:37,790
And of course these different types of data are not mutually exclusive.

272
00:20:37,790 --> 00:20:44,780
We can have continuous data that is ordinal and numeric, or we can have discrete data that is nominal

273
00:20:44,780 --> 00:20:45,950
and non-numeric.

274
00:20:45,950 --> 00:20:48,560
So these are overlapping categories.

275
00:20:48,560 --> 00:20:53,450
We just have to think about which measures of central tendency are appropriate based on the kind of

276
00:20:53,450 --> 00:20:54,800
data we're looking at.

277
00:20:54,800 --> 00:21:00,020
And then the last thing is really where the art form of all of this gets introduced, which is this

278
00:21:00,020 --> 00:21:06,560
idea of when it's most appropriate to use each of these particular measures of central tendency.

279
00:21:06,560 --> 00:21:14,000
For example, in economics, in politics, we very often talk about annual household income for the

280
00:21:14,000 --> 00:21:15,440
population of a country.

281
00:21:15,440 --> 00:21:23,390
And when we talk about that statistic, we usually refer to median annual household income, not mean

282
00:21:23,390 --> 00:21:24,650
annual household income.

283
00:21:24,650 --> 00:21:32,150
And the reason is because there are very often a very small number of extremely, extremely, extremely

284
00:21:32,150 --> 00:21:33,410
high income earners.

285
00:21:33,410 --> 00:21:39,350
You think about the very wealthiest people in a country, the very wealthiest people in the world whose

286
00:21:39,350 --> 00:21:46,700
incomes are so, so, so large that including them in the dataset would drastically skew the mean of

287
00:21:46,700 --> 00:21:52,760
the dataset to the point where talking about mean annual household income would really not give us a

288
00:21:52,760 --> 00:22:01,370
good representation of typical annual household income, whereas median is a much better representation

289
00:22:01,430 --> 00:22:07,610
of annual household income because when we look at the median, we eliminate all of those high income

290
00:22:07,610 --> 00:22:12,140
earners, we eliminate all of those extreme outlier high income earners.

291
00:22:12,140 --> 00:22:16,310
We just look at the center point, the median point for household income.

292
00:22:16,310 --> 00:22:22,970
That tends to give us a much, much more typical look at annual household income for the population

293
00:22:22,970 --> 00:22:23,750
of a country.

294
00:22:23,750 --> 00:22:29,900
So when we talk about this statistic, we almost always refer to the median instead of the mean.
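The effect described here, where a few extreme values pull the mean far more than the median, can be illustrated with a small Python sketch. No income figures are given in the lecture, so as an assumption this reuses the small data set from the truncated mean discussion as a stand-in, with 91 playing the role of the extreme earner. It relies on the standard library's `statistics.mean` and `statistics.median`.

```python
import statistics

# Stand-in data: the earlier lecture data set, with and without the extreme value 91.
without_outlier = [16, 18, 21, 27, 32, 32, 33]
with_outlier = without_outlier + [91]

# How far each measure of center moves when the extreme value is included.
mean_shift = statistics.mean(with_outlier) - statistics.mean(without_outlier)
median_shift = statistics.median(with_outlier) - statistics.median(without_outlier)

print(f"mean moves by {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
```

In this stand-in data the mean moves from 179 / 7 to 33.75, while the median only moves from 27 to 29.5, which is the same pattern the narration describes for household income.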

295
00:22:29,900 --> 00:22:32,330
You can even think about a different example.

296
00:22:32,660 --> 00:22:38,930
Imagine we have a race, maybe it's an Ironman triathlon and the race has a time limit.

297
00:22:38,930 --> 00:22:44,390
Let's say there's a segment of the race that must be completed within 90 minutes.

298
00:22:44,390 --> 00:22:50,120
So we say that the maximum time allowed is 90 minutes for this segment of a race.

299
00:22:50,120 --> 00:22:56,240
And for any participants who don't complete that segment in 90 minutes, they're disqualified from the

300
00:22:56,240 --> 00:22:56,930
rest of the race.

301
00:22:56,930 --> 00:22:58,400
Their times don't count.

302
00:22:58,400 --> 00:23:05,330
If we have a scenario like that, we have a bunch of people who are going to finish under this 90-minute

303
00:23:05,330 --> 00:23:10,640
time cap, and then we're going to have a bunch of people who are disqualified who hit this 90-minute

304
00:23:10,640 --> 00:23:13,580
time cap and then they are disqualified from the race.

305
00:23:13,880 --> 00:23:22,220
If we want to take data of those participants and calculate some sort of a center for how long it

306
00:23:22,270 --> 00:23:25,120
took the participants to finish this particular leg,

307
00:23:25,120 --> 00:23:27,090
this particular segment of the race,

308
00:23:27,100 --> 00:23:34,240
in this scenario, median would be a much better tool to use to calculate that measure of center than

309
00:23:34,240 --> 00:23:40,960
mean, because mean doesn't really give us a way to handle all of the people who were disqualified

310
00:23:40,960 --> 00:23:43,150
when they hit this 90-minute time cap.

311
00:23:43,150 --> 00:23:49,480
But for median, you could imagine that we would line up all of the participants' data points in order.

312
00:23:49,480 --> 00:23:52,990
So you can think about lining up our data set like this.

313
00:23:52,990 --> 00:23:59,140
And then we have a bunch of people at the top who were disqualified, indicating those people here in

314
00:23:59,140 --> 00:23:59,800
red.

315
00:23:59,800 --> 00:24:05,590
Well, assuming we have enough participants, these people who were disqualified won't skew our measure

316
00:24:05,590 --> 00:24:13,180
of center if we use median because we simply cross them out as we move to the center here and we end

317
00:24:13,180 --> 00:24:19,210
up then with just these people in the center to calculate the median for the data set and get an actual

318
00:24:19,210 --> 00:24:20,680
idea of center.

319
00:24:20,680 --> 00:24:26,080
So in a scenario like that one, we may be forced to use median instead of mean.
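
Here is a minimal sketch of the race example, assuming made-up segment times and a hypothetical convention where `None` marks a participant who was disqualified at the cap. Treating each disqualified participant as "slower than everyone who finished" lets us line the data up in order and still find the middle value, as long as fewer than half of the participants were disqualified.

```python
import math
from statistics import mean, median

CAP_MINUTES = 90  # time cap from the example

# Hypothetical segment times in minutes; None marks a participant
# who hit the cap and was disqualified (no finishing time recorded).
times = [62, 71, 74, 78, 81, 85, 88, None, None]

# Line everyone up in order, placing disqualified participants at the top.
ranked = sorted(math.inf if t is None else t for t in times)

# The middle value is a real time as long as fewer than half were disqualified.
segment_median = median(ranked)
if math.isinf(segment_median):
    raise ValueError("too many disqualified participants to find a median")

# A mean can only be taken over the finishers, which ignores the slowest people.
finishers = [t for t in times if t is not None]
finishers_mean = mean(finishers)

print("median over all participants:", segment_median)
print("mean over finishers only:", finishers_mean)
```

With these assumed times, the median over all nine participants is 81 minutes, while the mean over the seven finishers is 77 minutes, which understates the typical time because the disqualified participants are left out.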

320
00:24:26,110 --> 00:24:30,970
The same would be true here with our letter grades A, B, C, D, and F.

321
00:24:31,000 --> 00:24:35,860
We couldn't calculate a mean there, but we could find a median of letter grades.
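
The letter grade case can be sketched the same way. This assumes a made-up list of grades and a hypothetical `GRADE_ORDER` list that ranks the letters from lowest to highest. Because the grades have an order but no numeric distance between them, we can find a middle grade and a most common grade, but not a mean.

```python
from statistics import median_low, mode

# Assumed ordering of letter grades, lowest to highest.
GRADE_ORDER = ["F", "D", "C", "B", "A"]

# Hypothetical grades for nine students (illustrative only).
grades = ["B", "A", "C", "B", "D", "B", "F", "A", "C"]

# Convert each grade to its position in the ordering and line them up.
ranks = sorted(GRADE_ORDER.index(g) for g in grades)

# median_low always returns an actual data value, so it maps back to a letter.
median_grade = GRADE_ORDER[median_low(ranks)]

# The mode only needs counts, so it works directly on the letters.
mode_grade = mode(grades)

print("median grade:", median_grade)
print("mode grade:", mode_grade)
```

Using `median_low` rather than `median` is a design choice for this sketch: with an even number of students, `median` would average the two middle ranks, and that average might not correspond to any letter grade.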

322
00:24:35,860 --> 00:24:42,610
And the whole idea here, the whole takeaway is just that when we're thinking about how to look at the

323
00:24:42,610 --> 00:24:49,330
central tendency of the data, we need to first think about whether or not it's even possible to calculate

324
00:24:49,330 --> 00:24:52,750
a mean and/or a median and/or a mode.

325
00:24:52,780 --> 00:24:57,100
It may not even be possible to find all three of these measures for the data set.

326
00:24:57,100 --> 00:25:01,090
So our first step is to think about what we actually can calculate.

327
00:25:01,090 --> 00:25:06,850
And then once we know what we can calculate, to think about which measurement makes more sense.

328
00:25:06,850 --> 00:25:13,270
So going back to the annual household income example, in that case, it would be possible to calculate

329
00:25:13,270 --> 00:25:15,070
a mean, a median and a mode.

330
00:25:15,070 --> 00:25:20,080
We would be able to find all three measures of central tendency for that data.

331
00:25:20,080 --> 00:25:21,520
So that's our first step.

332
00:25:21,520 --> 00:25:24,850
We have all three of these possibilities available to us.

333
00:25:24,850 --> 00:25:30,250
But then our second step is to think about which of these three is most appropriate to represent the

334
00:25:30,250 --> 00:25:30,910
data set.

335
00:25:30,910 --> 00:25:36,400
And that's where we realize that median is going to give us a truer picture of what's going on with

336
00:25:36,400 --> 00:25:37,330
this data set

337
00:25:37,330 --> 00:25:44,200
than the mean will. The mode may give us somewhat of an accurate picture, but the race here is probably

338
00:25:44,200 --> 00:25:49,840
going to be between median and mode. In this example with the race with the time limit,

339
00:25:49,870 --> 00:25:53,200
going back to our first step, which values can we calculate?

340
00:25:53,200 --> 00:25:58,540
We may not even be able to calculate a mean if some of our participants didn't finish in this 90-minute

341
00:25:58,540 --> 00:25:59,290
time cap.

342
00:25:59,290 --> 00:26:02,860
So then the only options available to us are median and mode.

343
00:26:02,860 --> 00:26:07,480
And based on the data set, we could decide which measure is more appropriate.

344
00:26:07,480 --> 00:26:13,540
Which measure gives a better picture of the typical finishing time of our participants.

345
00:26:13,540 --> 00:26:18,040
So as you're thinking through these measures of central tendency and getting comfortable working with

346
00:26:18,040 --> 00:26:22,060
them, always come back to your two-step process. Number one:

347
00:26:22,060 --> 00:26:24,100
Which of these measures can I calculate?

348
00:26:24,130 --> 00:26:31,180
Number two: of the measures I can calculate, which is the most appropriate measure to use to best represent

349
00:26:31,180 --> 00:26:33,310
the center of the data set?

