1
00:00:00,000 --> 00:00:04,350
This next line of code will
then create a pooling layer.

2
00:00:04,350 --> 00:00:06,450
It's max-pooling because we're

3
00:00:06,450 --> 00:00:08,370
going to take the maximum value.

4
00:00:08,370 --> 00:00:10,365
We're saying it's
a two-by-two pool,

5
00:00:10,365 --> 00:00:11,850
so for every four pixels,

6
00:00:11,850 --> 00:00:14,760
the biggest one will
survive as shown earlier.
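
As a rough illustration of that two-by-two pooling, here is a minimal sketch in plain Python. The helper name max_pool_2x2 and the sample pixel values are made up for this example; in the model itself, a pooling layer does this work.

```python
def max_pool_2x2(image):
    """Downsample a 2D list by keeping the maximum of each 2x2 block."""
    rows = len(image) // 2      # an odd trailing row is dropped
    cols = len(image[0]) // 2   # an odd trailing column is dropped
    return [
        [
            max(
                image[2 * r][2 * c], image[2 * r][2 * c + 1],
                image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4x4 patch of pixel values.
pixels = [
    [0, 64, 128, 128],
    [48, 192, 144, 144],
    [142, 226, 168, 0],
    [255, 0, 0, 64],
]

pooled = max_pool_2x2(pixels)
print(pooled)  # [[192, 144], [255, 168]]
```

Each group of four pixels collapses to its largest value, so the 4x4 patch becomes 2x2.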

7
00:00:14,760 --> 00:00:17,805
We then add another
convolutional layer,

8
00:00:17,805 --> 00:00:20,970
and another max-pooling layer
so that the network can

9
00:00:20,970 --> 00:00:22,200
learn another set of

10
00:00:22,200 --> 00:00:24,480
convolutions on top
of the existing one,

11
00:00:24,480 --> 00:00:27,510
and then again, pool
to reduce the size.

12
00:00:27,510 --> 00:00:29,610
So, by the time the image gets to

13
00:00:29,610 --> 00:00:31,875
the flatten to go into
the dense layers,

14
00:00:31,875 --> 00:00:33,550
it's already much smaller.

15
00:00:33,550 --> 00:00:36,150
It's been quartered, and
then quartered again.

16
00:00:36,150 --> 00:00:38,955
So, its content has been
greatly simplified,

17
00:00:38,955 --> 00:00:41,230
the goal being that
the convolutions will

18
00:00:41,230 --> 00:00:44,525
filter it to the features
that determine the output.

19
00:00:44,525 --> 00:00:46,460
A really useful method on

20
00:00:46,460 --> 00:00:48,890
the model is the
model.summary method.

21
00:00:48,890 --> 00:00:51,460
This allows you to inspect
the layers of the model,

22
00:00:51,460 --> 00:00:52,670
and see the journey of

23
00:00:52,670 --> 00:00:54,605
the image through
the convolutions,

24
00:00:54,605 --> 00:00:55,985
and here is the output.

25
00:00:55,985 --> 00:00:58,190
It's a nice table
showing us the layers,

26
00:00:58,190 --> 00:01:01,940
and some details about them
including the output shape.

27
00:01:01,940 --> 00:01:05,630
It's important to keep an eye
on the output shape column.

28
00:01:05,630 --> 00:01:07,205
When you first look at this,

29
00:01:07,205 --> 00:01:09,905
it can be a little bit
confusing and feel like a bug.

30
00:01:09,905 --> 00:01:12,590
After all, isn't
the data 28 by 28,

31
00:01:12,590 --> 00:01:15,230
so why is the output 26 by 26?

32
00:01:15,230 --> 00:01:17,360
The key to this is
remembering that

33
00:01:17,360 --> 00:01:19,925
the filter is a three by
three filter.

34
00:01:19,925 --> 00:01:22,430
Consider what happens
when you start scanning

35
00:01:22,430 --> 00:01:24,890
through an image starting
on the top left.

36
00:01:24,890 --> 00:01:28,265
So, for example with this
image of the dog on the right,

37
00:01:28,265 --> 00:01:29,690
you can see zoomed into

38
00:01:29,690 --> 00:01:31,835
the pixels at
its top left corner.

39
00:01:31,835 --> 00:01:33,770
You can't calculate
the filter for

40
00:01:33,770 --> 00:01:35,385
the pixel in the top left,

41
00:01:35,385 --> 00:01:36,980
because it doesn't
have any neighbors

42
00:01:36,980 --> 00:01:38,690
above it or to its left.

43
00:01:38,690 --> 00:01:40,560
In a similar fashion,

44
00:01:40,560 --> 00:01:42,390
the next pixel to
the right won't work

45
00:01:42,390 --> 00:01:45,455
either because it doesn't
have any neighbors above it.

46
00:01:45,455 --> 00:01:47,600
So, logically, the
first pixel that you

47
00:01:47,600 --> 00:01:50,015
can do calculations
on is this one,

48
00:01:50,015 --> 00:01:51,620
because this one of course has

49
00:01:51,620 --> 00:01:54,860
all eight neighbors that
a three by three filter needs.

50
00:01:54,860 --> 00:01:56,990
This when you think about it,

51
00:01:56,990 --> 00:01:58,250
means that you can't use

52
00:01:58,250 --> 00:02:00,860
a one pixel margin
all around the image,

53
00:02:00,860 --> 00:02:02,795
so the output of the convolution

54
00:02:02,795 --> 00:02:04,640
will be two pixels smaller on x,

55
00:02:04,640 --> 00:02:06,665
and two pixels smaller on y.

56
00:02:06,665 --> 00:02:09,815
If your filter is five-by-five
for similar reasons,

57
00:02:09,815 --> 00:02:11,975
your output will be
four smaller on x,

58
00:02:11,975 --> 00:02:13,670
and four smaller on y.
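
The margin rule just described can be sketched as a one-line formula. This assumes the filter is never allowed to hang over the edge of the image (no padding) and moves one pixel at a time; the function name conv_output_size is hypothetical.

```python
def conv_output_size(input_size, filter_size):
    """Output length along one axis for an unpadded, stride-1 convolution."""
    # A filter of width k needs (k - 1) / 2 neighbors on each side,
    # so k - 1 pixels are lost in total along each axis.
    return input_size - filter_size + 1


print(conv_output_size(28, 3))  # 26
print(conv_output_size(28, 5))  # 24
```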

59
00:02:13,670 --> 00:02:16,505
So, that's why with
a three by three filter,

60
00:02:16,505 --> 00:02:18,905
our output from the
28 by 28 image,

61
00:02:18,905 --> 00:02:20,980
is now 26 by 26,

62
00:02:20,980 --> 00:02:23,990
we've removed
that one pixel on x and y,

63
00:02:23,990 --> 00:02:25,985
on each of the borders.

64
00:02:25,985 --> 00:02:30,050
So, next is the first of
the max-pooling layers.

65
00:02:30,050 --> 00:02:32,720
Now, remember we specified
it to be two-by-two,

66
00:02:32,720 --> 00:02:34,930
thus turning
four pixels into one,

67
00:02:34,930 --> 00:02:36,855
and halving our x and y.

68
00:02:36,855 --> 00:02:41,195
So, now our output gets
reduced from 26 by 26,

69
00:02:41,195 --> 00:02:43,015
to 13 by 13.

70
00:02:43,015 --> 00:02:45,790
The convolutions will
then operate on that,

71
00:02:45,790 --> 00:02:49,219
and of course, we lose
the one pixel margin as before,

72
00:02:49,219 --> 00:02:51,200
so we're down to 11 by 11,

73
00:02:51,200 --> 00:02:53,285
add another two-by-two
max-pooling

74
00:02:53,285 --> 00:02:55,175
to halve this, rounding down,

75
00:02:55,175 --> 00:02:57,940
and we're down to
five-by-five images.
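
To check that arithmetic, here is a small sketch that traces the image size through the four layers. It assumes three-by-three filters with no padding and two-by-two pooling that rounds down; the trace_shapes helper and its ("conv", 3) / ("pool", 2) layer descriptions are invented for this example.

```python
def trace_shapes(size, layers):
    """Return the side length of a square image after each layer."""
    sizes = [size]
    for kind, n in layers:
        if kind == "conv":
            size = size - n + 1   # unpadded filter trims n - 1 pixels
        elif kind == "pool":
            size = size // n      # pooling divides and rounds down
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        sizes.append(size)
    return sizes


layers = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]
print(trace_shapes(28, layers))  # [28, 26, 13, 11, 5]
```

The rounding down happens at the last step, where 11 divided by 2 becomes 5.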

76
00:02:57,940 --> 00:03:01,760
So, now our dense neural network
is the same as before,

77
00:03:01,760 --> 00:03:04,055
but it's being fed with
five-by-five images

78
00:03:04,055 --> 00:03:06,250
instead of 28 by 28 ones.

79
00:03:06,250 --> 00:03:08,630
But remember, it's
not just one compressed

80
00:03:08,630 --> 00:03:12,110
five-by-five image instead
of the original 28 by 28,

81
00:03:12,110 --> 00:03:14,090
there are a number
of convolutions

82
00:03:14,090 --> 00:03:15,695
per image that we specified,

83
00:03:15,695 --> 00:03:17,665
in this case 64.

84
00:03:17,665 --> 00:03:20,210
So, there are 64 new images

85
00:03:20,210 --> 00:03:22,235
of five-by-five that
have been fed in.

86
00:03:22,235 --> 00:03:23,920
Flatten that out and you have

87
00:03:23,920 --> 00:03:27,710
25 pixels times
64, which is 1600.

88
00:03:27,710 --> 00:03:29,915
So, you can see that
the new flattened layer

89
00:03:29,915 --> 00:03:31,780
has 1,600 elements in it,

90
00:03:31,780 --> 00:03:35,210
as opposed to the 784
that you had previously.

91
00:03:35,210 --> 00:03:38,360
This number is impacted by
the parameters that you

92
00:03:38,360 --> 00:03:41,995
set when defining
the convolutional 2D layers.
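
As a sketch of how those parameters drive the flattened size, the function below repeats the convolution-then-pool arithmetic and multiplies by the number of filters. The name flattened_size and its defaults (three-by-three filters, two-by-two pooling, two convolution and pooling pairs) are assumptions chosen to match this walkthrough.

```python
def flattened_size(input_size, num_filters, filter_size=3, pool_size=2, blocks=2):
    """Element count after `blocks` conv + pool pairs, then flattening."""
    size = input_size
    for _ in range(blocks):
        size = size - filter_size + 1   # unpadded convolution
        size = size // pool_size        # pooling, rounding down
    return size * size * num_filters


print(flattened_size(28, 64))  # 1600
print(flattened_size(28, 32))  # 800
print(flattened_size(28, 16))  # 400
```

With 64 filters the count matches the 1,600 above, and with 16 filters it drops below the original 784.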

93
00:03:41,995 --> 00:03:44,025
Later when you experiment,

94
00:03:44,025 --> 00:03:45,680
you'll see what
the impact of setting

95
00:03:45,680 --> 00:03:48,420
other values for the number
of convolutions will be,

96
00:03:48,420 --> 00:03:50,900
and in particular, you
can see what happens when

97
00:03:50,900 --> 00:03:54,550
you're feeding fewer than 784
overall pixels in.

98
00:03:54,550 --> 00:03:56,505
Training should be faster,

99
00:03:56,505 --> 00:03:59,300
but is there a sweet spot
where it's more accurate?

100
00:03:59,300 --> 00:04:01,100
Well, let's switch
to the workbook,

101
00:04:01,100 --> 00:04:03,390
and we can try it
out for ourselves.