1
00:00:00,507 --> 00:00:01,788
Hello everyone.

2
00:00:01,788 --> 00:00:08,011
In this video lecture we'll learn how
to obtain a histogram of a dataset.

3
00:00:08,011 --> 00:00:12,519
How to superimpose a smooth density
function on a histogram, and

4
00:00:12,519 --> 00:00:16,200
how to change colors and
line width in a histogram.

5
00:00:19,609 --> 00:00:22,838
So I have entered
the following data set and

6
00:00:22,838 --> 00:00:27,872
called it small.size.dataset
because it only has 30 data points.

7
00:00:27,872 --> 00:00:33,031
And we will try to find
the histogram of this data set.

8
00:00:33,031 --> 00:00:38,227
So the histogram routine, or
command, in R is hist.

9
00:00:38,227 --> 00:00:42,768
And it's a function, and the argument
of the function will be our data set.

10
00:00:42,768 --> 00:00:48,036
If we write
small.size.dataset here and,

11
00:00:48,036 --> 00:00:54,737
without doing anything else, just press Enter,
we will get our histogram.
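
A minimal sketch of the call described above. The 30 values are a hypothetical stand-in, since the lecture's actual data points are not shown in the transcript; only the name small.size.dataset and the size of 30 come from the lecture.

```r
# Hypothetical stand-in for the lecture's 30-point data set
# (the real values are not given in the transcript)
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# Draw the default histogram; hist() also returns the bin information
h <- hist(small.size.dataset)

h$breaks  # bin boundaries chosen by R
h$counts  # number of observations in each bin (the default y-axis)
```

Keeping the returned object in h is optional; it is only there so the bin boundaries and counts can be inspected.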

12
00:00:54,737 --> 00:00:57,511
But let me mention a few
things about this histogram.

13
00:00:57,511 --> 00:01:01,539
By default, the y-axis
shows frequency.

14
00:01:01,539 --> 00:01:08,178
For example, there is only one number
in my dataset between 0 and 20.

15
00:01:08,178 --> 00:01:12,557
Usually, we'd like to have this
as a probability or density.

16
00:01:12,557 --> 00:01:14,798
So we'll change that.

17
00:01:14,798 --> 00:01:19,694
By default the title of this histogram is
"Histogram of" followed by the variable name

18
00:01:19,694 --> 00:01:21,930
of the dataset, but we can change it.

19
00:01:21,930 --> 00:01:26,829
And again, by default the x label
is just small.size.dataset; in

20
00:01:26,829 --> 00:01:31,920
other words, it's just the variable name,
the name of the dataset.

21
00:01:31,920 --> 00:01:35,847
But we can change this, for example,

22
00:01:35,847 --> 00:01:40,283
we can go back and
we can change our x label.

23
00:01:43,360 --> 00:01:47,520
So the x label accepts a string,

24
00:01:47,520 --> 00:01:52,684
which is why we have these quotes here, and

25
00:01:52,684 --> 00:01:58,142
I will, for example, put "My data points".

26
00:01:58,142 --> 00:02:03,029
If I press Enter, everything stays the same,
but the x label has changed.

27
00:02:03,029 --> 00:02:07,990
We can change the title too, so
we go back and we say main.

28
00:02:07,990 --> 00:02:12,056
Again, main accepts a string,
and whatever the string is,

29
00:02:12,056 --> 00:02:15,258
it will be the title,
the title of the histogram.

30
00:02:15,258 --> 00:02:20,303
Let me write 'Histogram of my data.'

31
00:02:22,442 --> 00:02:27,359
And now the histogram title has

32
00:02:27,359 --> 00:02:32,472
changed, and I have my labels

33
00:02:32,472 --> 00:02:35,058
correct: my data points, my title.
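
The two label changes can be sketched as follows. The spoken "x label" corresponds to the xlab argument of hist, and the title to main. The data values are a hypothetical stand-in for the lecture's data set.

```r
# Hypothetical stand-in for the lecture's 30-point data set
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# xlab sets the x-axis label and main sets the plot title;
# both take a character string
h <- hist(small.size.dataset,
          xlab = "My data points",
          main = "Histogram of my data")
```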

34
00:02:35,058 --> 00:02:38,138
Now we can also turn off this frequency.

35
00:02:38,138 --> 00:02:38,951
How do we do that?

36
00:02:38,951 --> 00:02:41,833
We go back into the function, we say freq.

37
00:02:41,833 --> 00:02:46,730
freq here stands for frequency,
and it's a Boolean argument.

38
00:02:46,730 --> 00:02:48,407
By default, it is true.

39
00:02:48,407 --> 00:02:53,737
That's why in the histogram
it comes with the frequency.

40
00:02:53,737 --> 00:02:59,535
But we can turn that off by saying it is
FALSE, or just by using capital letter F.

41
00:02:59,535 --> 00:03:03,392
And if I do that,
everything stays the same, well, almost.

42
00:03:03,392 --> 00:03:07,230
The y-axis has changed to density, so
my histogram is scaled down.

43
00:03:07,230 --> 00:03:10,220
Now I have densities here, so the bar areas sum to one.
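
A short sketch of the effect of freq = FALSE, using hypothetical stand-in data. With this setting the bar heights are densities, so it is the bar areas (height times bin width) that sum to one, not the heights themselves.

```r
# Hypothetical stand-in for the lecture's 30-point data set
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# freq = FALSE puts density on the y-axis instead of counts
h <- hist(small.size.dataset, freq = FALSE)

# Area of each bar = density * bin width; the areas add up to 1
total.area <- sum(h$density * diff(h$breaks))
total.area
```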

44
00:03:10,220 --> 00:03:14,141
Now, we can change the colors here;
for

45
00:03:14,141 --> 00:03:18,077
example, we can change the color
of these rectangles.

46
00:03:18,077 --> 00:03:22,616
For example, I can go and say col=,

47
00:03:22,616 --> 00:03:27,173
and again, this
takes a string.

48
00:03:27,173 --> 00:03:29,340
Let's say I'm going to
write this as green.

49
00:03:29,340 --> 00:03:35,756
And if I do that, I have my histogram,
which is green, all green.
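
A sketch of the colored histogram, again with hypothetical stand-in data. In base R the fill color of the bars is set with the col argument of hist.

```r
# Hypothetical stand-in for the lecture's 30-point data set
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# col fills every bar with the named color
h <- hist(small.size.dataset,
          freq = FALSE,
          col  = "green",
          xlab = "My data points",
          main = "Histogram of my data")
```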

50
00:03:35,756 --> 00:03:40,100
We can superimpose a smooth density

51
00:03:40,100 --> 00:03:44,987
function on this histogram.

52
00:03:44,987 --> 00:03:50,035
So the density function is found by
using this command, density.

53
00:03:50,035 --> 00:03:54,481
So I have to put my data set here.

54
00:03:54,481 --> 00:03:57,517
Now, density finds the density, but

55
00:03:57,517 --> 00:04:02,669
if I want to draw its graph over the histogram,
I should use the lines command.

56
00:04:02,669 --> 00:04:05,988
And my argument for
the lines is the density command, and

57
00:04:05,988 --> 00:04:09,185
the argument for
the density is basically my data set.

58
00:04:09,185 --> 00:04:13,673
If I do that,
it superimposes a smooth density,

59
00:04:13,673 --> 00:04:17,858
a probability distribution
function, over my histogram.

60
00:04:17,858 --> 00:04:23,729
I can change the color here,
color of my density function.

61
00:04:23,729 --> 00:04:27,282
So let's say this time
I'm going to use red, and

62
00:04:27,282 --> 00:04:31,280
now it is red, but
I can also play with that line width.

63
00:04:31,280 --> 00:04:37,340
So width is lwd, and I can increase it;
for example, I make this five.

64
00:04:37,340 --> 00:04:44,004
Now the line is
thicker than the previous one.

65
00:04:44,004 --> 00:04:47,518
So I have this histogram,
and I have my

66
00:04:47,518 --> 00:04:53,208
density function superimposed
on my histogram here.
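
The overlay steps can be sketched like this, with hypothetical stand-in data. The histogram is drawn with freq = FALSE so that the bars and the curve share the same density scale; lines then adds the curve from density to the existing plot.

```r
# Hypothetical stand-in for the lecture's 30-point data set
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# Histogram on the density scale
hist(small.size.dataset,
     freq = FALSE,
     col  = "green",
     xlab = "My data points",
     main = "Histogram of my data")

# Kernel density estimate of the same data
d <- density(small.size.dataset)

# Add the curve to the current plot: red, with line width 5
lines(d, col = "red", lwd = 5)
```

Calling density on its own only computes the estimate; nothing is drawn until lines (or plot) is called on it.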

67
00:04:53,208 --> 00:04:56,913
And one last thing I would like
to mention is this bin width.

68
00:04:56,913 --> 00:05:01,941
We can play with this bin width, in other
words, the width of these rectangles,

69
00:05:01,941 --> 00:05:04,541
by increasing the number of break points.

70
00:05:04,541 --> 00:05:11,000
For example, I can go back to my
histogram command and say breaks.

71
00:05:11,000 --> 00:05:13,519
So breaks accepts several kinds of values.

72
00:05:13,519 --> 00:05:19,063
It can take a sequence or vector of break points, or
it can take a single number giving the number of bins.

73
00:05:19,063 --> 00:05:20,949
I'm going to use this as just a number.

74
00:05:20,949 --> 00:05:24,150
For example, if I put 10 here.

75
00:05:24,150 --> 00:05:26,249
Let’s see what happens.

76
00:05:26,249 --> 00:05:31,004
Well if I put 10 here then
I have a smaller bin width,

77
00:05:31,004 --> 00:05:35,667
more break points and
my histogram looks like this.

78
00:05:35,667 --> 00:05:40,338
It might or
might not be useful to use these breaks.

79
00:05:40,338 --> 00:05:44,627
And if I still want to
superimpose my density over it,

80
00:05:44,627 --> 00:05:48,394
I just run the lines command again.
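
A sketch of the breaks step with hypothetical stand-in data. When breaks is a single number, R treats it as a suggestion and rounds the break points to convenient values, so the number of bins actually drawn may differ from 10. Because hist starts a new plot, the density curve has to be added again afterwards.

```r
# Hypothetical stand-in for the lecture's 30-point data set
small.size.dataset <- c(12, 25, 31, 38, 42, 44, 47, 49, 51, 53,
                        55, 56, 58, 60, 61, 63, 64, 66, 68, 70,
                        71, 73, 75, 77, 79, 82, 85, 88, 93, 97)

# Ask for roughly 10 bins; narrower bins than the default
h <- hist(small.size.dataset,
          breaks = 10,
          freq   = FALSE,
          col    = "green")

# Redraw the density curve on top of the new histogram
lines(density(small.size.dataset), col = "red", lwd = 5)

# Number of bins R actually used
n.bins <- length(h$breaks) - 1
n.bins
```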

81
00:05:48,394 --> 00:05:52,276
Okay, so
what have you learned in this video?

82
00:05:52,276 --> 00:05:56,896
You have learned how to obtain a histogram
of a dataset with frequencies or

83
00:05:56,896 --> 00:05:59,021
probabilities on the y-axis.

84
00:05:59,021 --> 00:06:03,617
You have learned how to change the bin
width of a histogram using breaks.

85
00:06:03,617 --> 00:06:08,912
You have learned how to superimpose
a smooth density on a histogram.

86
00:06:08,912 --> 00:06:13,460
And you have learned how to change
colors and line width in a histogram.