1
00:00:11,700 --> 00:00:16,690
In this lecture we are going to discuss a very popular topic in modern deep learning:

2
00:00:16,840 --> 00:00:21,810
GANs. GAN stands for generative adversarial network.

3
00:00:22,110 --> 00:00:28,140
Yann LeCun, who is one of the forefathers of deep learning, described GANs as the most interesting idea

4
00:00:28,170 --> 00:00:31,740
in machine learning in the past 10 years.

5
00:00:31,920 --> 00:00:37,440
By the way, this is also who the famous LeNet architecture of convolutional neural networks was named

6
00:00:37,440 --> 00:00:41,950
after, the CNN on which all other CNNs are based.

7
00:00:42,060 --> 00:00:52,750
Ian Goodfellow, the inventor of GANs, is now himself a prominent deep learning researcher.

8
00:00:52,810 --> 00:00:58,750
My goal in this lecture is to introduce you to the fundamental theory behind GANs. You'll see that

9
00:00:58,900 --> 00:01:04,120
although they sound complicated, they are nothing but a new and interesting way of combining things we

10
00:01:04,120 --> 00:01:06,190
already know how to build.

11
00:01:06,250 --> 00:01:12,250
This is a common pattern in deep learning. Neural networks themselves are nothing but a bunch of logistic

12
00:01:12,250 --> 00:01:14,220
regressions chained together.

13
00:01:14,230 --> 00:01:19,320
CNNs and RNNs are nothing but neural networks with shared weights.

14
00:01:19,360 --> 00:01:24,700
I like this approach because deep learning allows us to build tons of cool things with only a few simple

15
00:01:24,700 --> 00:01:31,640
tools and some creativity.

16
00:01:31,650 --> 00:01:35,350
First, let's discuss what GANs are even used for in the first place,

17
00:01:35,430 --> 00:01:41,810
since this may not be immediately obvious. The main use case for GANs is to generate data.

18
00:01:42,000 --> 00:01:43,490
Ninety-nine percent of the time,

19
00:01:43,500 --> 00:01:45,870
this means generating images.

20
00:01:46,050 --> 00:01:52,140
The reason GANs have become so popular is that, unlike the models that came before them, GANs are exceptionally

21
00:01:52,170 --> 00:01:53,610
good at this.

22
00:01:53,610 --> 00:01:57,450
Here are some example images produced by state-of-the-art GANs.

23
00:01:57,630 --> 00:02:02,250
Believe it or not, these are not real people, but rather images generated by a computer.

24
00:02:07,370 --> 00:02:12,680
One thing you can do, if you want to see a few more examples and observe for yourself the power of GANs,

25
00:02:13,040 --> 00:02:17,010
is to check out a few websites that display such examples.

26
00:02:17,060 --> 00:02:17,990
One website is

27
00:02:18,020 --> 00:02:19,530
This Person Does Not Exist,

28
00:02:19,540 --> 00:02:20,970
at thispersondoesnotexist.com.

29
00:02:21,050 --> 00:02:22,070
This one is simple.

30
00:02:22,100 --> 00:02:25,700
You go to the page and it shows you a randomly generated face.

31
00:02:25,700 --> 00:02:28,670
You can refresh the page to see more faces.

32
00:02:29,180 --> 00:02:32,740
Another one, which I really like, is whichfaceisreal.com.

33
00:02:32,750 --> 00:02:34,460
This one is more like a game.

34
00:02:34,640 --> 00:02:37,100
So you're shown two images: one is real,

35
00:02:37,130 --> 00:02:39,410
one is fake, and you have to pick the right one.

36
00:02:40,100 --> 00:02:41,650
So check these out for yourself

37
00:02:41,810 --> 00:02:47,180
if you want to see more examples of how good GANs are at generating realistic images.

38
00:02:52,310 --> 00:02:52,680
OK.

39
00:02:52,680 --> 00:02:58,710
So what's the main idea behind GANs? GANs are a system of two neural networks, each of which has its

40
00:02:58,710 --> 00:03:04,260
own objective. The objective of the first neural network is to generate images.

41
00:03:04,260 --> 00:03:10,020
The objective of the second neural network is to discriminate between real and fake images.

42
00:03:10,020 --> 00:03:13,680
This is all you need to understand how GANs get their name.

43
00:03:13,770 --> 00:03:17,180
They are generative because you are generating images.

44
00:03:17,370 --> 00:03:22,380
They are adversarial because you have two neural networks which oppose each other in an adversarial

45
00:03:22,380 --> 00:03:23,340
manner.

46
00:03:23,340 --> 00:03:25,300
They have opposite goals.

47
00:03:25,390 --> 00:03:30,150
The goal of the first neural network is to generate images that look real, and the goal of the second

48
00:03:30,150 --> 00:03:33,150
neural network is to distinguish real from fake images.

49
00:03:38,290 --> 00:03:39,060
Intuitively,

50
00:03:39,070 --> 00:03:42,970
you can think of GANs like a counterfeiter and a shop owner.

51
00:03:42,970 --> 00:03:48,180
The counterfeiter goes to the shop and tries to use fake cash to buy an item.

52
00:03:48,220 --> 00:03:53,770
The shop owner has to detect whether the cash being given by customers is real or fake.

53
00:03:54,190 --> 00:04:00,040
Perhaps at the beginning both the counterfeiter and the shop owner are not experts at what they do.

54
00:04:00,130 --> 00:04:05,410
The counterfeiter is not that good at making fake bills but the shop owner is also not that good at

55
00:04:05,410 --> 00:04:07,500
detecting fake bills.

56
00:04:07,510 --> 00:04:13,960
However, once the shop owner learns to detect the fakes, the counterfeiter may then improve on his counterfeit

57
00:04:13,960 --> 00:04:17,400
design and trick the shop owner once more.

58
00:04:17,740 --> 00:04:23,200
After a while the shop owner may again realize he is being duped and learn to inspect the bills more

59
00:04:23,200 --> 00:04:27,100
carefully to detect these improved fake bills.

60
00:04:27,100 --> 00:04:32,350
The cycle continues as the counterfeiter gets better at tricking the shop owner and the shop owner gets

61
00:04:32,350 --> 00:04:34,300
better at avoiding being tricked.

62
00:04:39,470 --> 00:04:42,420
Of course these analogies only take us so far.

63
00:04:42,500 --> 00:04:44,060
How do we actually build a GAN?

64
00:04:45,230 --> 00:04:51,640
Let's recall all the ingredients we need for a basic neural network that we want to train.

65
00:04:51,650 --> 00:04:54,180
First, we need a neural network. That much is obvious.

66
00:04:54,290 --> 00:04:59,450
Well, in our case we actually have two neural networks. In order to train a neural network,

67
00:04:59,450 --> 00:05:02,330
we need an objective, or a loss function.

68
00:05:02,330 --> 00:05:03,950
This is the big question with GANs.

69
00:05:03,950 --> 00:05:06,050
What is the loss function?

70
00:05:06,050 --> 00:05:09,910
But once we have these two things everything else is nearly trivial.

71
00:05:09,920 --> 00:05:13,310
We call model.fit, which does gradient descent, and that's that.

72
00:05:18,390 --> 00:05:18,730
OK.

73
00:05:18,750 --> 00:05:22,480
So let's dig a little deeper into what the loss function is.

74
00:05:22,480 --> 00:05:29,230
And in fact, for a GAN we're going to have two loss functions: one for the generator and one for the discriminator.

75
00:05:29,230 --> 00:05:34,580
Let's first look at the discriminator since that's more in line with what we are used to already.

76
00:05:35,050 --> 00:05:39,790
Since the discriminator must be able to tell the difference between real and fake images,

77
00:05:39,790 --> 00:05:45,700
this is a binary classification problem. For every input, the discriminator can only predict one of two

78
00:05:45,700 --> 00:05:48,070
categories: real or fake.

79
00:05:48,310 --> 00:05:59,040
As you know, for binary classification the correct loss function is the binary cross-entropy.
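
[Editor's sketch] The discriminator loss described above can be illustrated in plain Python. This is a minimal sketch, not the lecture's code: the function name `binary_cross_entropy`, the clipping constant `eps`, and the example probabilities are all assumptions made for illustration. Labels follow the lecture's convention of 1 for real and 0 for fake.

```python
import math

def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

# Hypothetical discriminator outputs: two real images (label 1), two fakes (label 0).
labels = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]   # discriminator mostly right
unsure = [0.5, 0.5, 0.5, 0.5]      # discriminator guessing

loss_confident = binary_cross_entropy(labels, confident)
loss_unsure = binary_cross_entropy(labels, unsure)
```

A discriminator that separates real from fake gets a lower loss than one that outputs 0.5 everywhere, for which the loss is log 2.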

80
00:05:59,080 --> 00:06:00,520
What about for the generator?

81
00:06:00,520 --> 00:06:02,560
What is the loss function?

82
00:06:02,560 --> 00:06:08,410
This is what I call applying creativity to the basic tools that we already know about.

83
00:06:09,100 --> 00:06:15,970
Intuitively, the basic idea is this: we have our generator network, and our generator network feeds into our discriminator

84
00:06:15,970 --> 00:06:17,190
network.

85
00:06:17,230 --> 00:06:21,350
From this perspective this is just one gigantic neural network.

86
00:06:21,490 --> 00:06:27,250
What we can do is, just like with transfer learning, freeze the layers in the discriminator so

87
00:06:27,250 --> 00:06:33,310
that when we run gradient descent, only the generator network gets trained.

88
00:06:33,320 --> 00:06:35,120
Next, and here is the ingenious part.

89
00:06:35,630 --> 00:06:39,660
We want the generator to get better at producing real-looking images.

90
00:06:39,830 --> 00:06:45,890
And so our loss function will remain the binary cross-entropy, except that we are going to switch the

91
00:06:45,890 --> 00:06:47,270
labels.

92
00:06:47,270 --> 00:06:51,530
In other words, let's say for the discriminator real is one and fake is zero.

93
00:06:51,530 --> 00:06:57,500
We're going to pass in fake images but make the target one such that we're encouraging the discriminator

94
00:06:57,860 --> 00:07:04,810
to identify these fake images as real. But the discriminator weights are frozen, so they won't change

95
00:07:04,810 --> 00:07:07,890
when we do this. Only the generator weights are trained.

96
00:07:13,090 --> 00:07:15,490
Here's what the loss would look like mathematically,

97
00:07:15,550 --> 00:07:20,910
if you're interested. Here, all the y-hats are predictions on fake images,

98
00:07:21,070 --> 00:07:26,680
since the generator is now the input into the discriminator. Because we always want the target to be

99
00:07:26,680 --> 00:07:31,210
one, we only care about the first half of the binary cross-entropy.

100
00:07:31,210 --> 00:07:34,770
So essentially it's just the negative sum over the log of all the y-hats.

101
00:07:35,790 --> 00:07:40,110
Or, stated another way, the negative log-likelihood of the y-hats.
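
[Editor's sketch] With every target fixed to 1, the second half of the binary cross-entropy vanishes and only the log of the predictions remains. The snippet below is a minimal sketch of that reduced loss; the name `generator_loss` and the clipping constant `eps` are assumptions, and it averages over the batch rather than summing, which only changes the loss by a constant factor.

```python
import math

def generator_loss(y_hats, eps=1e-12):
    """Binary cross-entropy with every target set to 1: -mean(log(y_hat)).

    y_hats are the discriminator's predicted probabilities that the
    generated (fake) images are real.
    """
    return -sum(math.log(max(p, eps)) for p in y_hats) / len(y_hats)

# Hypothetical discriminator outputs on a batch of two generated images.
fooled = generator_loss([0.9, 0.95])   # discriminator believes the fakes
caught = generator_loss([0.1, 0.05])   # discriminator spots the fakes
```

The loss is small when the discriminator is fooled and large when it is not, which is exactly the signal the generator needs.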

102
00:07:45,240 --> 00:07:48,440
Now that you understand the basics of how GANs work,

103
00:07:48,480 --> 00:07:52,310
let's look at some details that will help us in our implementation.

104
00:07:52,530 --> 00:07:57,390
We know that neural networks must have both inputs and outputs. For the discriminator,

105
00:07:57,390 --> 00:08:03,240
this is easy: the input is an image and the output is a prediction about whether the input image is real

106
00:08:03,240 --> 00:08:05,520
or fake. For the generator,

107
00:08:05,520 --> 00:08:07,260
this will be a little strange.

108
00:08:07,320 --> 00:08:12,980
We know that the output should be an image, because its job is to generate images, but what is its input?

109
00:08:18,120 --> 00:08:22,440
In fact the input to the generator is nothing but noise.

110
00:08:22,500 --> 00:08:28,680
What we are going to do is generate noise from a multivariate standard normal, and the generator will

111
00:08:28,680 --> 00:08:31,260
learn to map this noise to an image.

112
00:08:31,620 --> 00:08:37,500
Mathematically, what we are doing is saying we have some random vector z, which comes from a standard

113
00:08:37,500 --> 00:08:38,300
normal.

114
00:08:38,610 --> 00:08:44,980
Let's say it has size D equals 100, although this is a hyperparameter, so you can choose what you like.

115
00:08:45,210 --> 00:08:47,500
We call this 100-dimensional space

116
00:08:47,610 --> 00:08:48,810
the latent space.
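
[Editor's sketch] Sampling the generator's input can be sketched with the standard library alone. The names `LATENT_DIM` and `sample_latent` and the batch size of 32 are assumptions for illustration; the lecture only specifies a standard normal and D equals 100.

```python
import random

LATENT_DIM = 100  # D in the lecture; a hyperparameter you are free to change

def sample_latent(batch_size, latent_dim=LATENT_DIM, rng=random):
    """Return batch_size latent vectors with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(batch_size)]

# One noise vector per image we want the generator to produce.
z_batch = sample_latent(batch_size=32)
```

Each inner list is one point in the latent space, and the generator's job is to map each such point to an image.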

117
00:08:53,960 --> 00:08:59,960
In essence, you can think of it as an imaginary space where the generator believes all the images of

118
00:08:59,960 --> 00:09:00,770
digits live.

119
00:09:01,430 --> 00:09:04,130
So maybe the ones are over here.

120
00:09:04,130 --> 00:09:10,070
All the sevens are right beside them, because sevens kind of look like ones. Near the sevens are the nines,

121
00:09:10,100 --> 00:09:16,280
because nines look like sevens if you draw the loop very small. So you can see how these images kind

122
00:09:16,280 --> 00:09:19,770
of morph into one another when they are drawn in different ways.

123
00:09:19,910 --> 00:09:24,080
So this latent space maps all the different kinds of images that you can have.

124
00:09:27,410 --> 00:09:32,390
And of course, this is a source of confusion for both computers and humans.

125
00:09:32,390 --> 00:09:37,940
When an image is right on the boundary between, say, 7 and 1, it's difficult for us to tell what digit it

126
00:09:37,940 --> 00:09:38,770
should be.

127
00:09:38,810 --> 00:09:39,870
Should it be a 7?

128
00:09:39,920 --> 00:09:41,740
Or should it be a 1?

129
00:09:41,870 --> 00:09:47,390
In any case what the generator is doing is it's learning to associate each part of the latent space

130
00:09:47,600 --> 00:09:50,290
with different images in a continuous manner.

131
00:09:52,630 --> 00:09:58,660
Then later when we generate images we're just picking a random part of the space and generating an image

132
00:09:58,780 --> 00:10:06,350
from whatever that part of the space represents.

133
00:10:06,400 --> 00:10:12,260
You can think of the generator as kind of the reverse of a feature transformer or an embedding. With

134
00:10:12,260 --> 00:10:18,160
a feature transformer or an embedding, we're taking an input image or an input text or some other kind of

135
00:10:18,160 --> 00:10:21,100
input and we're mapping it to a vector.

136
00:10:21,400 --> 00:10:24,550
But for a generator in a GAN, we're doing the opposite.

137
00:10:24,730 --> 00:10:27,900
We start with a vector and then we map that to an image.

138
00:10:33,190 --> 00:10:39,090
As always, it's going to be helpful to look at some pseudocode before we move on to the actual implementation.

139
00:10:39,130 --> 00:10:41,650
So here's how it's going to look.

140
00:10:41,680 --> 00:10:43,920
So the first step is to load in our dataset.

141
00:10:43,990 --> 00:10:48,160
We're going to use MNIST for this example. In future courses,

142
00:10:48,160 --> 00:10:53,650
you'll learn about more advanced types of GANs, which will work on more complex types of data.

143
00:10:53,650 --> 00:10:57,710
Next we're going to create our discriminator and generator networks.

144
00:10:57,730 --> 00:11:02,500
These are just standard neural networks: some dense layers, some ReLUs, and so forth.

145
00:11:02,500 --> 00:11:08,230
The trick is we're going to have two optimizers: one that optimizes the parameters of the discriminator

146
00:11:08,560 --> 00:11:12,040
and one that optimizes the parameters of the generator.

147
00:11:12,070 --> 00:11:17,410
Conceptually, you can imagine that the loss function is always applied to the output of the discriminator

148
00:11:17,410 --> 00:11:26,160
only. In either case we will use the binary cross-entropy, but with flipped labels in the generator's case. When we train the discriminator,

149
00:11:26,280 --> 00:11:31,380
half of the data will be real images from the dataset and half of the images will be generated from

150
00:11:31,380 --> 00:11:32,550
the generator.

151
00:11:32,640 --> 00:11:39,030
These will use the true labels. When we train the generator, we will generate random noise, pass that through

152
00:11:39,030 --> 00:11:44,580
the generator, then pass that through the discriminator, and use a flipped label, meaning that we say it's

153
00:11:44,580 --> 00:11:51,200
a one even though it's really a zero. We can conceptualize this chain of generator and discriminator

154
00:11:51,350 --> 00:11:56,990
as a kind of combined model where only the generator weights are parameters. Although that's not exactly

155
00:11:56,990 --> 00:11:58,340
how it will look in code,

156
00:11:58,340 --> 00:11:59,690
it's a good conceptual tool.

157
00:12:04,900 --> 00:12:11,010
Once we've built our two networks, we're going to run a gradient descent loop as usual. Inside the loop,

158
00:12:11,020 --> 00:12:17,470
we're going to train the discriminator and generator in alternating order. In order to train the discriminator,

159
00:12:17,600 --> 00:12:24,190
we'll grab a random sample of images from our dataset and a random sample of images from our generator.

160
00:12:24,190 --> 00:12:30,070
Then we'll do one iteration of gradient descent by calling discriminator.train_on_batch.

161
00:12:30,400 --> 00:12:37,040
We do this for both the real and fake images. Then, for generator training, we'll do the following.

162
00:12:37,040 --> 00:12:43,610
First, we'll generate a batch of noise samples. Then we'll call combined_model.train_on_batch, passing

163
00:12:43,610 --> 00:12:50,160
in the noise and ones for the target to do one iteration of gradient descent on the generator.

164
00:12:50,570 --> 00:12:56,300
After several thousand iterations our generator should be good enough to produce images that look like

165
00:12:56,300 --> 00:12:57,380
the MNIST dataset.
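
[Editor's sketch] The alternating loop described above can be made concrete with a deliberately tiny stand-in, since the lecture's Keras code is not shown here. Everything in this sketch is an assumption for illustration: the data are 1-D numbers drawn from N(4, 1) instead of MNIST images, the generator is the linear map G(z) = a*z + b, the discriminator is a single logistic unit D(x) = sigmoid(w*x + c), and the gradients of the binary cross-entropy are written out by hand instead of calling train_on_batch.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=200, batch=16, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Discriminator step: real samples use target 1, generated samples target 0.
        gw = gc = 0.0
        for _ in range(batch):
            x_real = rng.gauss(4.0, 1.0)
            x_fake = a * rng.gauss(0.0, 1.0) + b
            d_real = sigmoid(w * x_real + c)
            d_fake = sigmoid(w * x_fake + c)
            gw += -(1.0 - d_real) * x_real + d_fake * x_fake
            gc += -(1.0 - d_real) + d_fake
        w -= lr * gw / batch
        c -= lr * gc / batch
        # Generator step: w and c are held fixed ("frozen"), targets flipped to 1.
        ga = gb = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d_fake = sigmoid(w * (a * z + b) + c)
            ga += -(1.0 - d_fake) * w * z
            gb += -(1.0 - d_fake) * w
        a -= lr * ga / batch
        b -= lr * gb / batch
    return a, b, w, c

a, b, w, c = train_toy_gan()
```

Only w and c change in the discriminator step and only a and b change in the generator step, which mirrors freezing the discriminator inside the combined model. No claim is made here about how well this toy converges.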

166
00:13:02,570 --> 00:13:08,180
As a final note for this lecture I want to mention that in terms of model evaluation we're going to

167
00:13:08,180 --> 00:13:11,680
take a bit of a departure from what we normally do.

168
00:13:11,840 --> 00:13:17,210
You recall that for most models what we're looking for after training is that the loss goes down and

169
00:13:17,210 --> 00:13:19,510
the accuracy goes up.

170
00:13:19,520 --> 00:13:26,030
This makes sense if you are doing classification or regression but now we are no longer doing classification

171
00:13:26,030 --> 00:13:27,620
or regression.

172
00:13:27,620 --> 00:13:33,680
Remember that in our case the generator and discriminator are constantly trying to one-up each other.

173
00:13:33,680 --> 00:13:39,050
And so what is interesting is that the loss functions for both the generator and discriminator should

174
00:13:39,050 --> 00:13:40,660
look just like noise.

175
00:13:40,880 --> 00:13:46,570
It should appear, if you only look at the loss per iteration, that nothing is happening at all.

176
00:13:46,910 --> 00:13:53,650
But in fact, something profound is happening: both the generator and discriminator are improving.

177
00:13:53,710 --> 00:13:59,140
You don't see this in the loss per iteration because ideally both the generator and discriminator are

178
00:13:59,140 --> 00:14:00,400
improving at the same rate.

179
00:14:05,540 --> 00:14:09,080
Let's now summarize everything you learned in this lecture.

180
00:14:09,080 --> 00:14:14,420
First, you learned that GANs are not just a single neural network but a system of two neural networks:

181
00:14:14,540 --> 00:14:17,000
a generator and a discriminator.

182
00:14:17,080 --> 00:14:22,320
The discriminator is responsible for telling the difference between real and fake images.

183
00:14:22,550 --> 00:14:29,210
The generator is responsible for creating realistic-looking images. Both the generator and discriminator

184
00:14:29,310 --> 00:14:31,190
learn in tandem.

185
00:14:31,190 --> 00:14:36,350
The loss function for the discriminator is just the binary cross-entropy, because there are two classes:

186
00:14:36,380 --> 00:14:42,860
real and fake. The loss function for the generator is not directly applied to the generator, but rather to

187
00:14:42,920 --> 00:14:47,590
a combined model where the generator feeds into the discriminator.

188
00:14:47,600 --> 00:14:50,900
This allows us to again use the binary cross-entropy,

189
00:14:51,050 --> 00:14:58,360
but there are two important details. First, we freeze the weights of the discriminator so that only the

190
00:14:58,360 --> 00:15:00,360
generator is trained.

191
00:15:00,400 --> 00:15:05,720
Second, we flip the labels so that images generated by the generator are labeled as real.

192
00:15:07,280 --> 00:15:13,640
Next, we learned that the input to the generator is nothing but noise sampled from the latent space. The

193
00:15:13,640 --> 00:15:17,600
latent space is where the encodings for each digit live.

194
00:15:17,600 --> 00:15:23,710
You can think of this as the opposite of a feature transformation or an embedding. An embedding or feature

195
00:15:23,720 --> 00:15:30,080
transformation is like a mapping from an input image or an input word or any other kind of input into

196
00:15:30,080 --> 00:15:31,780
a single vector.

197
00:15:31,850 --> 00:15:33,650
The generator does the opposite.

198
00:15:33,650 --> 00:15:40,280
It takes a vector which lives in the latent space or the embedding space and produces an image.

199
00:15:40,280 --> 00:15:46,250
Lastly, we learned that when training GANs, the loss per iteration should not be informative, in the sense

200
00:15:46,250 --> 00:15:51,680
that it should not decrease over time as it does when we are training other models.

201
00:15:51,680 --> 00:15:57,050
This is because both the discriminator and the generator should be learning at the same rate continually

202
00:15:57,110 --> 00:15:58,460
helping each other improve.