1
00:00:00,090 --> 00:00:05,220
We're devoting this entire section to talking about the normal distribution because it's such an important

2
00:00:05,220 --> 00:00:11,130
distribution for modeling real-world phenomena, and because it's so important and comes up so

3
00:00:11,130 --> 00:00:14,730
often, we're going to break it down into several pieces here.

4
00:00:14,730 --> 00:00:19,650
We just want to talk about the mean, variance, and standard deviation that are associated with a normal

5
00:00:19,650 --> 00:00:20,400
distribution.

6
00:00:20,400 --> 00:00:26,580
So as a reminder, a normal distribution is a distribution whose probability density function looks

7
00:00:26,580 --> 00:00:27,270
like this.

8
00:00:27,450 --> 00:00:34,590
So if the probability density function is f of x, then this curve here is modeled by this function

9
00:00:34,590 --> 00:00:38,010
f of x. Because of the shape of the normal distribution,

10
00:00:38,010 --> 00:00:43,470
we often call it a bell curve as well, although there are other distributions that have bell shapes.

11
00:00:43,470 --> 00:00:50,160
So this particular kind of bell shape is not exclusive to a normal distribution, but all normal distributions

12
00:00:50,160 --> 00:00:56,820
do have this kind of symmetrical bell-shaped curve or bell-shaped probability density function, which

13
00:00:56,820 --> 00:01:03,960
is centered at the mean of the data, which, if we're talking about a population, we indicate with mu (μ), and

14
00:01:03,960 --> 00:01:07,680
if we're talking about a sample, we indicate with x-bar (x̄).

15
00:01:07,680 --> 00:01:12,450
And that's right at this point right here, the center of the normal distribution.
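
The lesson points at the curve without writing out f(x). For reference, the standard form of the normal probability density function is given below. The symbol sigma for the standard deviation is an assumption here, since the lesson has not named it at this point; mu is the mean at the center of the curve.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```

The curve is symmetric about x = mu, which is why the mean sits at the center of the bell.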

16
00:01:12,450 --> 00:01:17,520
Now when we're talking about a normal distribution, we have to keep in mind that we could still be

17
00:01:17,520 --> 00:01:21,210
referring to either a population or a sample.

18
00:01:21,300 --> 00:01:28,800
Remember that a population is the entire group or set that we're studying, whereas a sample is a small

19
00:01:28,800 --> 00:01:30,900
subsection of that population.

20
00:01:30,900 --> 00:01:36,690
And we distinguish between these two things because let's say, for instance, we're interested in surveying

21
00:01:36,690 --> 00:01:39,870
the entire population of the state of California.

22
00:01:39,870 --> 00:01:45,810
We know right away it's going to be very difficult, if not impossible, to collect survey responses

23
00:01:45,810 --> 00:01:49,860
from every single person, every single citizen of the state.

24
00:01:49,860 --> 00:01:56,370
And so instead of attempting to ask every person in the state the same survey question or to gather

25
00:01:56,370 --> 00:02:03,060
data from every person in the state, the entire population, we'll take a small sample.

26
00:02:03,090 --> 00:02:08,430
The goal, of course, is to make sure that our sample is representative of the population, but that's

27
00:02:08,430 --> 00:02:09,810
an entirely different topic:

28
00:02:09,810 --> 00:02:15,810
figuring out how to take a sample that isn't biased but instead is representative of the population.

29
00:02:15,810 --> 00:02:21,630
Now, for a normal distribution, our formulas for mean, variance, and standard deviation will be different

30
00:02:21,630 --> 00:02:26,250
depending on whether we have data from the entire population or just from the sample.

31
00:02:26,280 --> 00:02:31,830
Of course, keep in mind that normally we'll have sample data because practically speaking for real

32
00:02:31,830 --> 00:02:38,040
world problems, it is usually difficult or impossible to collect data for the entire population.

33
00:02:38,040 --> 00:02:45,120
But to distinguish between these two groups, we want to indicate the size of the population with capital

34
00:02:45,120 --> 00:02:48,420
N and the size of the sample with lowercase n.

35
00:02:48,420 --> 00:02:54,330
So if you see capital N, that means we're talking about the number of subjects in the entire population.

36
00:02:54,330 --> 00:02:59,640
So if we're talking about the state of California, this capital N would represent the number of people

37
00:02:59,640 --> 00:03:01,140
in the entire state.

38
00:03:01,320 --> 00:03:07,950
If we see lowercase n, it means that we've taken a smaller sample of the population and the number

39
00:03:07,950 --> 00:03:11,400
of people in our sample is given by lowercase n.

40
00:03:11,400 --> 00:03:17,190
So let's say based on what we're studying, not related to the California example, our population is

41
00:03:17,190 --> 00:03:18,150
10,000.

42
00:03:18,150 --> 00:03:20,700
Well, capital N would equal 10,000.

43
00:03:20,700 --> 00:03:27,540
We might take a sample of, let's say 1200 people and then lowercase n would be equal to 1200.

44
00:03:27,750 --> 00:03:33,750
So these are the sizes of the population and the sample, capital N and lowercase n. And then the mean

45
00:03:34,320 --> 00:03:40,500
of the population is given by this formula here, whereas the sample mean is given by this formula.

46
00:03:40,530 --> 00:03:44,490
Notice that they are identical, except for two things.

47
00:03:44,490 --> 00:03:51,420
First, in the population mean formula, we're dividing by capital N, whereas in the sample mean formula,

48
00:03:51,420 --> 00:03:56,910
we're dividing by lowercase n. And of course that makes sense based on the fact that we're using capital

49
00:03:56,910 --> 00:04:01,860
N and lowercase n to represent the size of our population and sample, respectively.

50
00:04:02,040 --> 00:04:06,630
And then the only other difference is the notation that we use to indicate the mean, which we talked

51
00:04:06,630 --> 00:04:10,470
about earlier up here as the center of the normal distribution.

52
00:04:10,500 --> 00:04:17,160
The population mean we indicate with this Greek letter mu, and the sample mean we indicate with x-bar,

53
00:04:17,190 --> 00:04:18,930
this x with a bar over it.

54
00:04:18,930 --> 00:04:21,510
But other than that, the formulas are identical.

55
00:04:21,630 --> 00:04:27,360
So all these formulas are telling us is that to define the mean, we are summing up all of the possible

56
00:04:27,360 --> 00:04:33,810
x sub i values, all of the values that our variable takes on, and after we sum them all up, we divide

57
00:04:33,810 --> 00:04:36,360
by either the number in our population,

58
00:04:36,360 --> 00:04:40,680
if we're finding the population mean, or the number in our sample, if we're finding the sample mean.

59
00:04:40,680 --> 00:04:44,730
So add up all the data points, divide by the number of data points we have.

60
00:04:44,730 --> 00:04:48,450
That's all we're doing to calculate the mean of the normal distribution.
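
The two mean formulas are shown on screen but not spoken in full. Based on the description above (sum every x sub i, then divide by the count), they can be written as follows; the index i and the summation limits are the usual convention and are assumed here.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The only differences are the symbol for the mean and whether the count is capital N or lowercase n.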

61
00:04:48,570 --> 00:04:53,940
Now for the variance, this should look familiar because we've already looked at mean, variance, and standard

62
00:04:53,940 --> 00:04:55,170
deviation formulas.

63
00:04:55,170 --> 00:04:59,970
But for the normal distribution, just to confirm here, all we're doing with the variance is

64
00:05:00,000 --> 00:05:05,190
taking all of the data points that we have, all of the x sub i values that we have in our data set.

65
00:05:05,370 --> 00:05:08,630
And for each one we subtract the mean here.

66
00:05:08,640 --> 00:05:13,260
So notice that the formulas for population and sample are again almost identical.

67
00:05:13,290 --> 00:05:18,870
All we're doing is taking the individual x sub i values in our data set, and that goes for population and

68
00:05:18,870 --> 00:05:21,330
sample and then we subtract the mean.

69
00:05:21,330 --> 00:05:26,070
And of course, if that's the population, we're subtracting mu; if it's the sample, we're subtracting x

70
00:05:26,070 --> 00:05:26,520
bar.

71
00:05:26,520 --> 00:05:33,510
But in either case we're just subtracting the mean and that gives us the deviation for each data point.

72
00:05:33,510 --> 00:05:41,700
So this is the deviation of the individual data point x sub i. When we square that value, we now have a

73
00:05:41,700 --> 00:05:44,310
squared deviation or a squared difference.

74
00:05:44,310 --> 00:05:49,380
And then this summation notation just tells us to sum up all the squared deviations.

75
00:05:49,380 --> 00:05:56,220
So once we expand out to this part of the formula here, this is just the sum of all the squared deviations.

76
00:05:56,220 --> 00:06:02,250
And again, notice that up to that point our formulas are identical, except for the mean being represented

77
00:06:02,250 --> 00:06:03,690
by mu or x-bar.

78
00:06:03,720 --> 00:06:09,450
And then in the summation notation for population, we're summing up to capital N for the sample, we're

79
00:06:09,450 --> 00:06:15,120
summing up to lowercase n. So really no difference at all, practically speaking, between the two

80
00:06:15,120 --> 00:06:15,600
formulas.

81
00:06:15,600 --> 00:06:19,830
Up to this point, we're just taking the sum of all of the squared deviations.

82
00:06:19,830 --> 00:06:24,750
But then for population, we divide by the number in our population, capital

83
00:06:24,750 --> 00:06:31,710
N. But for the sample, you might think that we would divide by lowercase n, the number of subjects

84
00:06:31,710 --> 00:06:33,030
in the sample.

85
00:06:33,030 --> 00:06:38,700
But in fact, and this is the most interesting part of all of our formulas for mean, variance, and standard

86
00:06:38,700 --> 00:06:40,890
deviation for a population and a sample,

87
00:06:40,920 --> 00:06:49,440
this particular piece right here is not what we'd expect:

88
00:06:49,440 --> 00:06:52,080
we would think it's n, but it's actually n minus one.

89
00:06:52,080 --> 00:06:59,580
And that's because when we take a sample and measure deviations from its own mean, x-bar, rather than the true population mean, we inherently introduce some bias.

90
00:06:59,580 --> 00:07:05,820
It's virtually impossible for our sample to perfectly represent our population for us to be able to

91
00:07:05,820 --> 00:07:10,260
scale up our sample data to perfectly match our population.

92
00:07:10,260 --> 00:07:15,810
For example, maybe we're working with a population of people and let's say the population is 1000 people,

93
00:07:15,810 --> 00:07:21,060
and we'll even say that exactly 500 of them are women and 500 of them are men.

94
00:07:21,060 --> 00:07:27,930
We might take a sample of 100 people, and even though we randomly sample from the population, meaning

95
00:07:27,930 --> 00:07:33,390
that we are randomly, blindly choosing one person at a time from the population, putting them back

96
00:07:33,390 --> 00:07:35,760
in the population, and then pulling out another person.

97
00:07:35,760 --> 00:07:41,730
And we're collecting a sample of 100 people from that population of 500 women and 500 men.

98
00:07:41,730 --> 00:07:48,660
When we sample that way and we collect 100 people, we might get a sample that's, let's say 52 women

99
00:07:48,660 --> 00:07:50,130
and 48 men.

100
00:07:50,250 --> 00:07:54,960
That's not going to scale up perfectly to 500 women and 500 men.

101
00:07:54,960 --> 00:07:59,760
Our sample is going to be a little bit different in terms of its representation of the population.

102
00:07:59,760 --> 00:08:05,700
We are just hoping and implementing some good sampling techniques to try to make sure that the sample

103
00:08:05,700 --> 00:08:09,960
is as representative of the population as we can possibly make it.

104
00:08:09,960 --> 00:08:13,920
But no matter what we do, we're likely going to introduce bias.

105
00:08:13,920 --> 00:08:20,190
And so because we introduce some bias when we sample from the population, it turns out that because

106
00:08:20,190 --> 00:08:27,690
of that bias, using this n minus one figure to calculate variance actually helps us undo a little

107
00:08:27,690 --> 00:08:34,110
bit of that bias, as opposed to just using lowercase n, the number of subjects in our sample.

108
00:08:34,110 --> 00:08:41,520
So we use this n minus one value to get a better approximation to get rid of some of the bias we introduced

109
00:08:41,520 --> 00:08:45,240
by sampling instead of using the full population.

110
00:08:45,240 --> 00:08:52,530
So when we calculate variance for a sample, we have to remember to divide by n minus one instead of

111
00:08:52,530 --> 00:08:55,830
just lowercase n, the number of subjects in our sample.

112
00:08:55,830 --> 00:09:02,820
So if we take a sample of 100 people, we would be dividing here by 99 instead of dividing by 100.
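
One way to see the effect of dividing by n minus one is a small simulation. The sketch below is not from the lesson: the population (1,000 values drawn around 50), the sample size of 5, and the number of trials are all hypothetical choices made so the difference is easy to see. It draws many samples with replacement, computes the variance both ways, and compares the averages with the true population variance.

```python
import random


def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 1,000 values (not data from the lesson).
population = [random.gauss(50, 10) for _ in range(1000)]
pop_var = sum_sq_dev(population) / len(population)  # divide by capital N

n = 5            # small sample size, chosen to make the bias visible
trials = 10_000  # number of repeated samples

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)  # sampling with replacement
    ss = sum_sq_dev(sample)                   # deviations from x-bar
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / trials
avg_div_n_minus_1 = total_div_n_minus_1 / trials

print(f"population variance:        {pop_var:.2f}")
print(f"average when dividing by n: {avg_div_n:.2f}")
print(f"average when dividing by n-1: {avg_div_n_minus_1:.2f}")
```

Under these assumptions, dividing by n tends to come out too small on average, while dividing by n minus one lands much closer to the population variance. With a sample of 100, as in the example above, the gap between dividing by 99 and by 100 is far smaller.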

113
00:09:02,820 --> 00:09:07,350
So those are our population and sample variance formulas.

114
00:09:07,350 --> 00:09:13,590
And then, as you already know, standard deviation will always just be the square root of variance.
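
Written out, the variance and standard deviation formulas described above look like this. The symbols sigma squared and s squared for the population and sample variance are the conventional names and are assumed here, since the lesson refers to the formulas on screen without naming them.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

\sigma = \sqrt{\sigma^2}
\qquad\qquad
s = \sqrt{s^2}
```

For a sample of 100, the denominator in the second formula is 99.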

115
00:09:13,620 --> 00:09:18,600
Now, if we want to see these formulas in action, of course the easiest way to do this is going to

116
00:09:18,600 --> 00:09:25,290
be to use a calculator or a computer to help us make these calculations, especially as our dataset

117
00:09:25,290 --> 00:09:26,280
gets larger.

118
00:09:26,280 --> 00:09:31,170
To see why, let's work with a data set that I've already created.

119
00:09:31,200 --> 00:09:38,190
This data set only has ten data points in it and already look at all of the calculations we're doing,

120
00:09:38,190 --> 00:09:44,940
let alone if we have a data set of 1,000 subjects or even more. These calculations quickly become impractical

121
00:09:44,940 --> 00:09:46,470
to do by hand.

122
00:09:46,470 --> 00:09:52,350
But here's what we've done here to calculate mean, variance, and standard deviation for this particular

123
00:09:52,350 --> 00:09:52,800
data set.

124
00:09:52,800 --> 00:09:54,600
Let's walk through this step by step.

125
00:09:54,750 --> 00:09:58,980
So what we're going to say is that maybe we're a distribution company.

126
00:09:58,980 --> 00:09:59,670
We're running a

127
00:09:59,730 --> 00:10:00,390
warehouse.

128
00:10:00,390 --> 00:10:05,190
And we want to get information about the age of the employees that work in the warehouse.

129
00:10:05,190 --> 00:10:10,230
So what we're going to do is we're going to take a sample of ten employees.

130
00:10:10,230 --> 00:10:13,320
So we're going to say lowercase n is equal to ten.

131
00:10:13,320 --> 00:10:17,070
We're going to be using a sample instead of surveying the entire population.

132
00:10:17,070 --> 00:10:22,920
Maybe we actually have 200,000 warehouse workers and we're not going to be able to ask every one of

133
00:10:22,920 --> 00:10:24,090
them for their age.

134
00:10:24,090 --> 00:10:27,720
So instead, we're going to take a sample of ten employees.

135
00:10:27,720 --> 00:10:32,520
And all we're doing here, we're not talking about at this point whether this is a good sample.

136
00:10:32,520 --> 00:10:35,790
We'll talk about sampling techniques and representative samples later.

137
00:10:35,790 --> 00:10:42,060
All we're talking about now is how to calculate mean, variance, and standard deviation for this sample.

138
00:10:42,090 --> 00:10:49,680
We're using age in this example only because age, as long as our population is large enough, is going

139
00:10:49,680 --> 00:10:53,310
to be approximately normally distributed around some mean.

140
00:10:53,310 --> 00:10:58,800
So we're going to say here that our sample size is lowercase n equal to ten.

141
00:10:58,800 --> 00:11:05,040
We surveyed ten employees, and this second column here of our table holds all of the responses we got.

142
00:11:05,040 --> 00:11:10,170
So we asked ten warehouse employees their age and we got all ten ages here.

143
00:11:10,170 --> 00:11:13,080
And then to calculate the sample mean,

144
00:11:13,080 --> 00:11:20,250
so we're calculating x-bar here, we found that x-bar, the sample mean, was equal to 43.2.

145
00:11:20,250 --> 00:11:27,810
And we find that by taking the sum of all of these ages and then dividing by n equals ten.

146
00:11:27,810 --> 00:11:32,130
So here, to get x-bar, the sample mean

147
00:11:32,130 --> 00:11:37,710
(this is how we calculated 43.2), we would take 43 plus 35 plus 38.

148
00:11:37,710 --> 00:11:42,180
So 43 plus 35 plus 38.

149
00:11:42,180 --> 00:11:46,830
And we would continue adding until we got to the very last age here 40.

150
00:11:46,830 --> 00:11:53,190
And then we divide that entire sum by the sample size, n equals ten.

151
00:11:53,190 --> 00:11:55,950
And we see that here in our sample mean formula.

152
00:11:55,950 --> 00:12:01,440
We sum up all of our data points, all of our x sub i data points, and then we divide by the sample size,

153
00:12:01,440 --> 00:12:02,610
n equals ten.

154
00:12:02,610 --> 00:12:08,640
And the result there is a mean of 43.2 for the age of those ten employees.

155
00:12:08,670 --> 00:12:16,350
Now, to find variance for our sample, we take each one of our x sub i values and we subtract the mean

156
00:12:16,350 --> 00:12:17,220
x bar.

157
00:12:17,220 --> 00:12:19,500
So that's what we're doing here in this column.

158
00:12:19,500 --> 00:12:21,030
We're finding all of the deviations.

159
00:12:21,030 --> 00:12:26,160
So we're taking 43 and subtracting 43.2 to get -0.2.

160
00:12:26,190 --> 00:12:33,660
We're taking 35 and subtracting 43.2 to get -8.2 all the way down until the last data point.

161
00:12:33,660 --> 00:12:40,980
And so this column here is giving us this x sub i minus x-bar value for each data point.

162
00:12:41,040 --> 00:12:45,930
Then in this last column here, we're squaring each of those deviations.

163
00:12:45,930 --> 00:12:54,960
So -0.2 squared is a positive 0.04; -8.2, when we square it, is a positive 67.24.

164
00:12:54,960 --> 00:12:57,180
So we square all those values.

165
00:12:57,180 --> 00:13:01,080
Now we have all of our individual squared deviations,

166
00:13:01,080 --> 00:13:07,710
this part of the formula. Then our summation notation here tells us to add up all of those squared deviations.

167
00:13:07,710 --> 00:13:13,500
So we have all of our squared deviations and then we add them all together to find this sum.

168
00:13:13,500 --> 00:13:21,540
Now we have the sum of all of our squared deviations, and then all we have to do to find variance is

169
00:13:21,540 --> 00:13:28,590
divide this sum of all the squared deviations by, and remember, n minus one, because we have a sample,

170
00:13:28,590 --> 00:13:29,760
not a population.

171
00:13:29,760 --> 00:13:35,610
And when we find sample variance, we have to divide by n minus one to undo some of that bias.

172
00:13:35,610 --> 00:13:46,290
So here this value for variance is not 1,007.6 divided by n equals ten; it's 1,007.6 divided by ten

173
00:13:46,290 --> 00:13:48,810
minus one, or divided by nine.

174
00:13:48,810 --> 00:13:57,390
So when we divide the sum of our squared deviations by n minus one equals nine instead of n equals ten,

175
00:13:57,570 --> 00:14:00,810
the result we get is 111.96.

176
00:14:00,810 --> 00:14:09,270
And so this is our variance, 111.96, and that's an approximate value because we're rounding there to

177
00:14:09,270 --> 00:14:11,010
two decimal places.

178
00:14:11,010 --> 00:14:13,110
So that's our approximate variance.

179
00:14:13,110 --> 00:14:18,120
And then of course, to find standard deviation, we just take the square root of variance.

180
00:14:18,120 --> 00:14:24,570
So standard deviation will be equal to the square root of variance, which in our case is approximately

181
00:14:24,570 --> 00:14:30,090
the square root of 111.96, which is approximately

182
00:14:31,720 --> 00:14:32,800
10.58.

183
00:14:32,800 --> 00:14:39,640
And so our standard deviation for the sample is approximately 10.58, just over ten and a half.
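
The walkthrough above can be sketched in Python. The first part reuses only the figures quoted in the lesson (a sum of squared deviations of 1,007.6 and n equals ten). The full list of ten ages is not read out, so the second part runs the same steps on a short, hypothetical list of ages; the function name `sample_stats` and that list are illustrative assumptions, not part of the lesson.

```python
import math
import statistics


def sample_stats(data):
    """Return (mean, variance, standard deviation) for a sample,
    dividing the sum of squared deviations by n - 1."""
    n = len(data)
    mean = sum(data) / n                             # x-bar
    squared_devs = [(x - mean) ** 2 for x in data]   # (x_i - x-bar)^2
    variance = sum(squared_devs) / (n - 1)           # n - 1, not n
    return mean, variance, math.sqrt(variance)


# Figures quoted in the lesson: sum of squared deviations and sample size.
sum_sq = 1007.6
n = 10
lesson_variance = sum_sq / (n - 1)
lesson_std = math.sqrt(lesson_variance)
print(round(lesson_variance, 2), round(lesson_std, 2))  # prints 111.96 10.58

# Hypothetical ages, used only to exercise the function end to end.
hypothetical_ages = [29, 34, 41, 47, 52]
mean, variance, std = sample_stats(hypothetical_ages)

# The standard library's sample formulas also divide by n - 1.
print(math.isclose(variance, statistics.variance(hypothetical_ages)))
print(math.isclose(std, statistics.stdev(hypothetical_ages)))
```

Doing it by hand first and then comparing against `statistics.variance` and `statistics.stdev` is a quick check that the n minus one step was applied.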

184
00:14:39,640 --> 00:14:46,150
Remember, and we talked about this before, that the mean is the center of the data. The standard deviation

185
00:14:46,150 --> 00:14:51,280
is the figure that we'll use as a measure of spread. Mean is a measure of central tendency.

186
00:14:51,280 --> 00:14:53,620
Standard deviation is a measure of spread.

187
00:14:53,620 --> 00:14:59,170
So this figure tells us how much this data is spread out around the mean.

188
00:14:59,170 --> 00:15:05,470
Remember that if our standard deviation is smaller, it tells us that our data is more tightly clustered

189
00:15:05,470 --> 00:15:09,490
around the mean, which means all the data gets sucked in closer to the mean.

190
00:15:09,490 --> 00:15:16,000
And our normal distribution might look something more like this with a taller peak and smaller tails

191
00:15:16,000 --> 00:15:19,990
where all this data is pushed in closer to this center,

192
00:15:20,020 --> 00:15:20,890
the mean.

193
00:15:20,890 --> 00:15:27,040
Whereas if our standard deviation is larger, it tells us that our data is pushed further away from

194
00:15:27,040 --> 00:15:27,610
the mean.

195
00:15:27,610 --> 00:15:28,930
It's more spread out.

196
00:15:28,930 --> 00:15:34,030
This measure of spread is larger, and so our normal distribution might look something a little bit

197
00:15:34,030 --> 00:15:40,120
more like this, where we have fatter tails and the distribution is shorter, the data is pushed out

198
00:15:40,120 --> 00:15:44,380
further away from the mean because of that larger standard deviation.
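
The two sketched curves can be compared numerically. The example below is an illustration only: it centers both curves at the lesson's sample mean of 43.2 and picks two hypothetical standard deviations, 5 and 15, then evaluates the density at the peak and at a point 30 units out in the tail using `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

mean_age = 43.2  # sample mean from the lesson, used as the center

# Hypothetical spreads, chosen only to contrast the two shapes.
narrow = NormalDist(mu=mean_age, sigma=5)
wide = NormalDist(mu=mean_age, sigma=15)

# Height of each curve at its center (the peak).
peak_narrow = narrow.pdf(mean_age)
peak_wide = wide.pdf(mean_age)

# Height of each curve far from the center (out in the tail).
tail_point = mean_age + 30
tail_narrow = narrow.pdf(tail_point)
tail_wide = wide.pdf(tail_point)

print(f"peak: sigma=5 -> {peak_narrow:.4f}, sigma=15 -> {peak_wide:.4f}")
print(f"tail: sigma=5 -> {tail_narrow:.2e}, sigma=15 -> {tail_wide:.2e}")
```

The smaller standard deviation gives the taller peak and the thinner tail; the larger one gives the shorter peak and the fatter tail, matching the two drawings.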

199
00:15:44,380 --> 00:15:50,230
And we'll talk more later about how we use and interpret mean, variance, and standard deviation.

200
00:15:50,230 --> 00:15:56,980
The primary goal here is just to understand how to calculate these three values for a normal distribution

201
00:15:56,980 --> 00:16:00,070
for both a population and a sample.

