1
00:00:00,090 --> 00:00:05,760
We've said over and over throughout this section that our goal with any chart or graph is just to display

2
00:00:05,760 --> 00:00:10,080
the data or conclusions that we're trying to communicate as clearly as possible.

3
00:00:10,110 --> 00:00:16,620
Part of what that means is, of course, being aware of some common plot pitfalls, like, for example,

4
00:00:16,650 --> 00:00:21,970
a misleading vertical axis, which is something that unfortunately is too common.

5
00:00:21,990 --> 00:00:29,040
In this chart here, we clearly have a bar graph and at first glance it looks like there's a huge increase

6
00:00:29,040 --> 00:00:31,710
between 2022 and 2023.

7
00:00:31,710 --> 00:00:36,120
And so we might expect some dramatic change between these two years.

8
00:00:36,120 --> 00:00:44,430
But when we look at the y axis, what we see is that the y axis not only starts at 34%, but tops out

9
00:00:44,430 --> 00:00:46,290
up here at 42%.

10
00:00:46,290 --> 00:00:52,020
So the difference between these two years is actually not that significant at all.

11
00:00:52,020 --> 00:00:56,340
In 2022, it looks like we're sitting at about 35% maybe.

12
00:00:56,460 --> 00:01:00,150
And in 2023, it looks like we're sitting at about maybe 40%.

13
00:01:00,150 --> 00:01:05,850
So an increase from 35% to 40% is really not that significant at all.

14
00:01:05,880 --> 00:01:14,010
Imagine instead, if we started this y axis at zero and it went all the way to 100, the bars representing

15
00:01:14,010 --> 00:01:20,130
these two years would look almost identical, and we certainly wouldn't interpret the bar chart as if

16
00:01:20,130 --> 00:01:27,390
there was a big change between 2022 and 2023, but sketched out this way with this particular y axis,

17
00:01:27,390 --> 00:01:29,730
it looks like the increase is dramatic.
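
To put rough numbers on that effect, here is a minimal Python sketch. It assumes the two bars sit at about 35% and 40%, as we estimated by eye, and the helper name `drawn_height_ratio` is made up for this illustration. It compares how tall the second bar is drawn relative to the first when the axis starts at 34 versus when it starts at zero.

```python
def drawn_height_ratio(first, second, axis_start=0.0):
    """Return how many times taller the second bar is drawn than the first.

    A bar is drawn from the bottom of the axis up to its value, so its
    drawn height is (value - axis_start), not the value itself.
    """
    if first <= axis_start or second <= axis_start:
        raise ValueError("both values must lie above the start of the axis")
    return (second - axis_start) / (first - axis_start)


# Assumed readings from the chart: roughly 35% in 2022 and 40% in 2023.
truncated = drawn_height_ratio(35, 40, axis_start=34)
zero_based = drawn_height_ratio(35, 40, axis_start=0)

print(f"axis starts at 34: second bar drawn {truncated:.1f}x as tall")
print(f"axis starts at 0:  second bar drawn {zero_based:.2f}x as tall")
```

With the axis starting at 34, the second bar is drawn six times as tall as the first, even though the value itself is only about 14% larger.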

18
00:01:29,730 --> 00:01:33,960
So we have to be really careful about the scale of this vertical axis.

19
00:01:33,960 --> 00:01:38,790
In this case, the scale is probably too small, but we can also have the opposite scenario where the

20
00:01:38,790 --> 00:01:40,320
scale is too big.

21
00:01:40,320 --> 00:01:45,660
And so a change between bars is actually dramatic, but it doesn't appear dramatic because the scale

22
00:01:45,660 --> 00:01:47,460
of the vertical axis is too big.

23
00:01:47,460 --> 00:01:52,080
So we have to be aware of the scale of any axis in any chart.

24
00:01:52,080 --> 00:01:57,060
We also need to think about where it makes sense to start and end the axis, like we talked about. Here

25
00:01:57,060 --> 00:02:01,800
in this particular chart, when we're talking about percentages, it probably makes more sense to start

26
00:02:01,800 --> 00:02:07,440
the vertical axis at zero and end it at 100 instead of starting it at 34 and ending it at 42.

27
00:02:07,440 --> 00:02:13,140
In fact, it's usually a best practice to start the vertical axis at zero unless we have a really good

28
00:02:13,140 --> 00:02:14,850
reason for not doing so.

29
00:02:14,850 --> 00:02:19,860
Maybe we think the data is actually more clear if we don't start at zero, or maybe we have negative values

30
00:02:19,860 --> 00:02:23,010
in our data and we need to start at some negative value.

31
00:02:23,010 --> 00:02:26,760
So we just need to be really careful about the start and end points and the scale.

32
00:02:26,760 --> 00:02:30,420
We also need to think about a consistent scale across the axis.

33
00:02:30,420 --> 00:02:37,440
So for instance, with this chart here, if we look at the horizontal axis, we have years across the

34
00:02:37,440 --> 00:02:38,310
horizontal axis.

35
00:02:38,310 --> 00:02:44,850
So this is 1960, 1970, 1980, and then all of a sudden we jump to the year 2000 and then the year

36
00:02:44,850 --> 00:02:48,840
2020, which makes this horizontal axis really confusing.

37
00:02:48,840 --> 00:02:51,030
It feels compressed over here on the left side.

38
00:02:51,030 --> 00:02:53,850
And then it's like there are bigger jumps over here on the right side.

39
00:02:53,850 --> 00:02:59,430
It's hard for us to envision where we are in time as we move along this horizontal axis because we don't

40
00:02:59,430 --> 00:03:01,170
keep the scale consistent.
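
One quick way to catch an axis like this is to check the gaps between the tick values. The sketch below is plain Python with made-up helper names, and it assumes the ticks are whole years listed in increasing order, taken from the chart as read aloud here.

```python
def tick_gaps(ticks):
    """Return the distance between each pair of neighboring ticks."""
    return [later - earlier for earlier, later in zip(ticks, ticks[1:])]


def is_evenly_spaced(ticks):
    """True when every neighboring pair of ticks is the same distance apart."""
    return len(set(tick_gaps(ticks))) <= 1


def evenly_spaced_ticks(ticks):
    """Rebuild the axis from first to last tick using the smallest gap found."""
    step = min(tick_gaps(ticks))
    return list(range(ticks[0], ticks[-1] + 1, step))


# Tick values as they appear on the chart's horizontal axis.
years = [1960, 1970, 1980, 2000, 2020]

print(tick_gaps(years))            # gaps between neighboring ticks
print(is_evenly_spaced(years))     # False for this axis
print(evenly_spaced_ticks(years))  # adds back 1990 and 2010
```

The gaps come out as 10, 10, 20, 20, which is exactly the jump we noticed, and rebuilding the axis with a constant step of 10 puts 1990 and 2010 back in.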

41
00:03:01,170 --> 00:03:05,880
We also have a problem here where the horizontal axis isn't labeled very well.

42
00:03:05,880 --> 00:03:10,950
Yes, we say here that the horizontal axis represents year, but because we're crossing over from the

43
00:03:10,950 --> 00:03:18,150
1900s to the 2000s, to start at just 60, 70, 80, and then suddenly show just 00 and

44
00:03:18,150 --> 00:03:22,050
then 20, it just makes this axis more difficult to interpret.

45
00:03:22,050 --> 00:03:26,940
So whenever we're building a chart or a graph, not only do we want to be correct, we don't want to

46
00:03:26,940 --> 00:03:31,800
have data that's wrong in our graph or in our chart, or we don't want the chart to be misleading,

47
00:03:31,800 --> 00:03:34,920
but we also want to do everything we can to make it clear.

48
00:03:34,920 --> 00:03:39,960
So even if there was nothing wrong about this horizontal axis, right, we could fill it in here with

49
00:03:39,960 --> 00:03:42,510
90 and then 10.

50
00:03:42,510 --> 00:03:45,780
Even doing that would help make things a little more clear.

51
00:03:45,780 --> 00:03:51,930
It also might make things more clear to say here 1960 and then show the year 2000.

52
00:03:51,930 --> 00:03:56,130
Just making those couple of changes makes it a lot easier to read the horizontal axis.

53
00:03:56,130 --> 00:04:02,220
So we want to do everything we can to first, of course, not be wrong, but then second, be as clear

54
00:04:02,220 --> 00:04:04,620
as possible with how we're displaying information.

55
00:04:04,620 --> 00:04:10,620
So this axis wasn't labeled really well. In this chart over here, nothing's labeled well at all.

56
00:04:10,620 --> 00:04:12,450
We don't have a chart title.

57
00:04:12,450 --> 00:04:15,180
We don't even know what's being shown here in the chart.

58
00:04:15,180 --> 00:04:19,769
And we'd probably be helped if we maybe displayed the percentage at the top of each bar.

59
00:04:19,769 --> 00:04:24,480
Something like this, maybe 35.5%, let's say.

60
00:04:24,480 --> 00:04:28,260
And then up here, let's say 40.2%,

61
00:04:28,260 --> 00:04:33,600
if those are the real values. Because of the scale of this vertical axis, it's maybe a little hard

62
00:04:33,600 --> 00:04:35,400
to tell the exact value of these bars.

63
00:04:35,400 --> 00:04:40,020
Just adding these values here brings extra clarity to this particular chart.

64
00:04:40,020 --> 00:04:45,510
We should probably also have here a title like we said, so we know what the chart is displaying.

65
00:04:45,510 --> 00:04:48,810
We just want to make sure everything's labeled as well as it can be.

66
00:04:48,810 --> 00:04:53,940
This line plot over here on the right is another example of a chart that's misleading.

67
00:04:53,940 --> 00:04:59,840
The title of the graph says that college is no longer worth it and tries to show that based

68
00:04:59,940 --> 00:05:06,240
on the discrepancy between the rising cost of tuition and the relatively stable earnings that can be

69
00:05:06,240 --> 00:05:09,740
expected after a person completes college and earns their degree.

70
00:05:09,750 --> 00:05:15,030
The problem is, not only am I giving you extra information there, because the legend here only says earnings,

71
00:05:15,030 --> 00:05:18,780
it doesn't say annual earnings or specify some other period.

72
00:05:18,780 --> 00:05:25,650
But in addition to that problem, what the chart is actually showing is total cost of tuition over the

73
00:05:25,650 --> 00:05:28,140
course of the entire college program.

74
00:05:28,140 --> 00:05:30,270
So let's say that's over four years.

75
00:05:30,270 --> 00:05:34,520
The mean total cost has risen to close to $95,000.

76
00:05:34,650 --> 00:05:42,390
The graph is comparing that value to mean earnings of that graduate, but on an annual basis.

77
00:05:42,390 --> 00:05:48,360
So it's showing that the graduate can expect to earn about $50,000 per year after they graduate.

78
00:05:48,360 --> 00:05:53,070
But this line plot makes it look like there's some huge discrepancy in the cost.

79
00:05:53,070 --> 00:05:58,800
But if we really think about the annual earnings of the graduate, if they earn 50,000 a year, then

80
00:05:58,800 --> 00:06:07,170
in their first five years of employment, they can likely expect to earn at least 250,000, that's 50,000 a

81
00:06:07,170 --> 00:06:11,580
year times five years, maybe more, if they get a small raise each year.

82
00:06:11,580 --> 00:06:17,190
And that is compared to the total cost of the four-year degree of close to

83
00:06:18,010 --> 00:06:18,970
95,000.

84
00:06:19,630 --> 00:06:24,130
And when we compare those two figures, the cost of college looks a lot more reasonable.
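
The arithmetic behind that comparison can be sketched in a few lines of Python. The figures are the approximate ones read off the chart, about 95,000 in total tuition and about 50,000 in earnings per year, and the `raise_rate` parameter is a hypothetical addition just to show the "small raise each year" case.

```python
TOTAL_TUITION = 95_000     # approximate total cost of the four-year degree
ANNUAL_EARNINGS = 50_000   # approximate earnings per year after graduating


def cumulative_earnings(annual, years, raise_rate=0.0):
    """Add up earnings over several years, with an optional yearly raise."""
    total = 0.0
    salary = float(annual)
    for _ in range(years):
        total += salary
        salary *= 1 + raise_rate  # hypothetical raise applied for the next year
    return total


five_year = cumulative_earnings(ANNUAL_EARNINGS, 5)
ratio = five_year / TOTAL_TUITION

print(f"five-year earnings: {five_year:,.0f}")
print(f"earnings are about {ratio:.1f}x the total tuition")
```

Putting both numbers on the same time basis, five years of earnings against the whole degree, gives 250,000 against 95,000, which is a very different picture from 50,000 against 95,000.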

85
00:06:24,130 --> 00:06:30,010
Whereas if we compare them this way, the cost of college looks extremely unreasonable because of the

86
00:06:30,010 --> 00:06:30,490
difference,

87
00:06:30,490 --> 00:06:37,060
the discrepancy between this high cost of tuition and the apparent low earning potential of the graduate.

88
00:06:37,060 --> 00:06:41,770
We'll also sometimes see in the real world charts like these ones.

89
00:06:41,770 --> 00:06:43,450
This is a pie chart.

90
00:06:43,450 --> 00:06:47,860
Let's say it's trying to represent support for different political candidates.

91
00:06:47,860 --> 00:06:52,300
First of all, just like this bar chart, we have no title for the chart.

92
00:06:52,300 --> 00:06:59,140
We have no legend, and including both of those would be really helpful. If we had a legend here where

93
00:06:59,140 --> 00:07:06,130
we said that green was "supports candidate A" and blue was "supports candidate B," etc., that would help

94
00:07:06,130 --> 00:07:08,250
us to interpret this pie chart.

95
00:07:08,260 --> 00:07:10,360
We have another problem here with the pie chart.

96
00:07:10,360 --> 00:07:12,940
The data is just flat out misleading.

97
00:07:12,970 --> 00:07:17,650
A pie chart should always sum to 100%. Each piece of the pie,

98
00:07:17,650 --> 00:07:20,560
when we add them all up, they should sum to 100%.

99
00:07:20,560 --> 00:07:28,180
And here these percentages are summing to 165%, which makes the chart completely invalid and unreadable

100
00:07:28,180 --> 00:07:28,930
entirely.
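
A check like this is easy to automate before a pie chart ever gets drawn. In the Python sketch below, the function name, the rounding tolerance, and the individual slice values are all hypothetical. The slice values are invented only so that they add up to the 165% total seen on this chart.

```python
import math


def check_pie_shares(shares, tolerance=0.5):
    """Return (total, is_valid) for a mapping of slice label to percentage.

    The tolerance leaves a little room for rounding in the labels.
    """
    total = math.fsum(shares.values())
    return total, abs(total - 100.0) <= tolerance


# Hypothetical slice values, chosen only so the total matches the 165% on the chart.
broken_pie = {"purple": 65.0, "green": 50.0, "blue": 50.0}

total, is_valid = check_pie_shares(broken_pie)
print(f"slices sum to {total:.0f}%, valid pie chart: {is_valid}")
```

If the slices don't sum to 100%, the data usually needs a second look rather than a quick rescale, because the underlying percentages may not be parts of one whole at all.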

101
00:07:28,930 --> 00:07:33,970
So not only is it just flat out wrong, but we don't really understand what it's saying because we have

102
00:07:33,970 --> 00:07:35,680
no title, we have no legend.

103
00:07:35,680 --> 00:07:41,680
And on top of that, pie charts are often not a great way of displaying most data in general, because

104
00:07:41,680 --> 00:07:46,900
they can be hard to read in the sense that it's difficult to compare each slice against the other.

105
00:07:46,900 --> 00:07:53,050
We can tell that this purple section here appears to be the biggest section, but looking at green versus

106
00:07:53,050 --> 00:07:58,660
blue, if we don't have these labels at first glance, it's hard to interpret whether green is bigger

107
00:07:58,660 --> 00:08:01,150
than blue or vice versa, or they're the same size.

108
00:08:01,150 --> 00:08:07,210
So for something like this, a comparative bar chart might be just a better way to display the data

109
00:08:07,210 --> 00:08:07,990
in general.

110
00:08:08,200 --> 00:08:14,830
So the whole idea here is that whenever we're trying to display any kind of information, we've collected

111
00:08:14,830 --> 00:08:21,220
data and now we want to communicate that data in some summarized way to somebody else, we really need

112
00:08:21,220 --> 00:08:25,390
to think about choosing the best

113
00:08:26,090 --> 00:08:28,040
chart type, right?

114
00:08:28,040 --> 00:08:30,560
Is a bar chart the best option?

115
00:08:30,560 --> 00:08:32,510
Should we use a line plot instead?

116
00:08:32,510 --> 00:08:34,539
Should we use a box and whisker plot?

117
00:08:34,549 --> 00:08:35,900
What kind of graph?

118
00:08:35,900 --> 00:08:40,100
What kind of chart is the best way to display this information?

119
00:08:40,100 --> 00:08:45,770
Once we've chosen the chart that we think best communicates the data that we have to display, then

120
00:08:45,770 --> 00:08:53,900
we want to think about all of the aspects of the chart that affect the clarity of the information being

121
00:08:53,900 --> 00:08:54,560
displayed.

122
00:08:54,560 --> 00:08:56,270
So we should probably have a title.

123
00:08:56,270 --> 00:08:58,520
We should probably label both axes, right?

124
00:08:58,520 --> 00:09:03,380
We should probably have said up here that this horizontal axis represents year.

125
00:09:03,380 --> 00:09:07,190
We should label this vertical axis based on whatever it represents.

126
00:09:07,190 --> 00:09:09,590
We should have a legend, if that's appropriate.

127
00:09:09,590 --> 00:09:12,320
Over here, we had a legend for the line plot.

128
00:09:12,320 --> 00:09:14,870
We should maybe give the units for each axis.

129
00:09:14,870 --> 00:09:21,440
So over here we said that this vertical axis is dollars in units of thousands, so we can interpret

130
00:09:21,440 --> 00:09:22,130
this 45

131
00:09:22,130 --> 00:09:24,950
here as 45,000. Over here,

132
00:09:24,950 --> 00:09:29,450
on this vertical axis, we know we have percentages because we see 34% here.

133
00:09:29,450 --> 00:09:33,860
So we know we have percentages, but we don't know what this vertical axis is indicating.

134
00:09:33,860 --> 00:09:37,070
So we want to think about labeling everything that we can.

135
00:09:37,070 --> 00:09:41,600
We want to think about the scale of each axis and the scale of the data that we're displaying.

136
00:09:41,600 --> 00:09:43,490
We don't want the scale to be too small.

137
00:09:43,490 --> 00:09:45,350
We don't want it to be too big.

138
00:09:45,350 --> 00:09:47,660
We want the scales to be consistent.

139
00:09:47,660 --> 00:09:53,240
We want to start our scales at zero, unless some other starting point definitely makes more sense.

140
00:09:53,240 --> 00:09:58,010
And we want to make sure that the point that we're actually communicating is a valid one.

141
00:09:58,010 --> 00:09:59,660
So let's maybe say here,

142
00:10:00,390 --> 00:10:06,900
validity, right? In this chart here, we're communicating a significant increase across years when in

143
00:10:06,900 --> 00:10:09,570
fact the increase isn't that significant at all.

144
00:10:09,600 --> 00:10:15,360
Over here, we're communicating a significant difference between tuition across four years and then

145
00:10:15,360 --> 00:10:17,360
earnings for just one year.

146
00:10:17,370 --> 00:10:19,950
Maybe that's really not the most appropriate comparison.

147
00:10:19,950 --> 00:10:25,560
Maybe we should compare cost of tuition to expected lifetime earnings or earnings over the first five

148
00:10:25,560 --> 00:10:26,820
years or ten years.

149
00:10:26,820 --> 00:10:32,100
In other words, we really might not be communicating the best conclusion based on the way that we're

150
00:10:32,100 --> 00:10:35,370
displaying our data or what we're including in our data at all.

151
00:10:35,370 --> 00:10:40,980
And then the last thing probably is just pure accuracy or correctness.

152
00:10:40,980 --> 00:10:43,410
This pie chart here is just wrong.

153
00:10:43,440 --> 00:10:48,420
A pie chart always has to sum to 100% by definition, and this one doesn't.

154
00:10:48,420 --> 00:10:50,910
If the data we're displaying is just incorrect,

155
00:10:50,910 --> 00:10:52,560
we obviously want to fix that.

156
00:10:52,560 --> 00:10:56,910
Maybe this 35.5% here was actually wrong and this bar

157
00:10:56,910 --> 00:11:00,420
should have been all the way up here at 45%.

158
00:11:00,420 --> 00:11:07,080
And after we build out our bar chart, we realize that and we can correct the bar and label it appropriately.

159
00:11:07,080 --> 00:11:13,110
Maybe we had a mistake here where in our line plot, instead of having 85 right here, maybe we actually

160
00:11:13,110 --> 00:11:13,950
had 90.

161
00:11:13,950 --> 00:11:17,130
And that's just a mistake and we need to change it to 85.

162
00:11:17,130 --> 00:11:21,930
So these are all things that we want to think about, starting primarily with just choosing the best

163
00:11:21,930 --> 00:11:22,380
chart.

164
00:11:22,380 --> 00:11:27,000
Working through each of these questions: Is my chart clear?

165
00:11:27,000 --> 00:11:30,090
Is the point I'm trying to make even valid?

166
00:11:30,120 --> 00:11:32,040
Is everything in the chart accurate?

167
00:11:32,040 --> 00:11:37,890
And if we've done all that and we've built a chart or a graph that we feel accurately communicates the

168
00:11:37,890 --> 00:11:43,500
data we've collected and shows the point that we're trying to make by summarizing the data in the chart

169
00:11:43,500 --> 00:11:49,380
or the graph, then we can feel pretty confident that we've created a good chart or a graph or a plot

170
00:11:49,380 --> 00:11:55,020
that's actually doing a good job displaying the conclusion that the data should reveal.

