1
00:00:11,730 --> 00:00:16,910
OK so now that you understand how everything works conceptually let's talk about how it will work in

2
00:00:16,910 --> 00:00:18,250
code.

3
00:00:18,260 --> 00:00:23,790
First of all, the data itself is going to be very different from what you're used to.

4
00:00:23,830 --> 00:00:27,410
Whenever we wanted to use ANNs, we remembered the rule:

5
00:00:27,440 --> 00:00:29,390
All data is the same.

6
00:00:29,390 --> 00:00:32,670
But now it appears that my rule has been broken.

7
00:00:32,810 --> 00:00:36,110
All data is in fact not the same.

8
00:00:36,110 --> 00:00:41,570
Recall that with the recommender systems we want to look at, all the data comes in the form of triples:

9
00:00:42,020 --> 00:00:45,400
the inputs are the user and movie, while the target is the rating.

10
00:00:46,280 --> 00:00:51,430
Clearly, a sample of users and movies does not make up an N by D matrix of features.

11
00:00:56,630 --> 00:01:01,970
So an N by D matrix of features is what we want, but what do we actually have?

12
00:01:01,970 --> 00:01:04,550
Well, we can still have N items, that's clear.

13
00:01:04,580 --> 00:01:06,460
That's just the number of samples.

14
00:01:06,590 --> 00:01:13,220
But let's look at it this way: a list of N users, which by the way can contain duplicates, will be an

15
00:01:13,220 --> 00:01:16,370
N length array of categorical objects.

16
00:01:16,700 --> 00:01:23,720
Similarly, a list of N movies, which also may contain duplicates, will be another N-length array of categorical

17
00:01:23,720 --> 00:01:25,270
objects.

18
00:01:25,280 --> 00:01:30,530
Now you might wonder, how can the list of users and movies contain duplicates?

19
00:01:30,530 --> 00:01:36,500
Of course this must be the case. The same user may have watched multiple different movies, in which case

20
00:01:36,710 --> 00:01:38,970
that user will show up multiple times in the data.

21
00:01:39,620 --> 00:01:45,980
Conversely multiple users will of course rate the same movie in which case that movie will show up multiple

22
00:01:45,980 --> 00:01:46,820
times in the data

23
00:01:51,980 --> 00:01:52,920
As we discussed,

24
00:01:52,940 --> 00:02:00,710
the key operation, inspired by NLP, is the embedding. In NLP we have an N by T matrix of word indexes

25
00:02:01,130 --> 00:02:06,230
which we use to index the embedding matrix mapping each word to a feature vector.

26
00:02:06,380 --> 00:02:08,600
We get back an N by T by D

27
00:02:08,600 --> 00:02:10,820
array of word vectors.

28
00:02:10,820 --> 00:02:17,360
Similarly, both users and movies, since they are categorical, will be represented by user indexes

29
00:02:17,360 --> 00:02:23,480
and movie indexes. So the users will be an N-length array of integers, and the movies will also be an

30
00:02:23,480 --> 00:02:25,440
N-length array of integers.

31
00:02:25,670 --> 00:02:31,670
After we index an embedding matrix, we will get back N by D arrays of feature vectors for both

32
00:02:31,730 --> 00:02:34,280
users and movies.
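The lookup just described can be sketched as follows. This is a minimal illustration, assuming PyTorch's `nn.Embedding`; the sizes and ID values are made up for the example and do not come from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
num_users, num_movies, D = 5, 7, 3

user_emb = nn.Embedding(num_users, D)    # weight matrix of shape (num_users, D)
movie_emb = nn.Embedding(num_movies, D)  # weight matrix of shape (num_movies, D)

# N = 4 samples; IDs repeat because one user can rate several movies
# and one movie can be rated by several users
user_ids = torch.tensor([0, 0, 3, 4])
movie_ids = torch.tensor([2, 6, 2, 1])

# Indexing the embedding matrices turns each N-length integer array
# into an N by D array of feature vectors
user_vecs = user_emb(user_ids)
movie_vecs = movie_emb(movie_ids)

print(user_vecs.shape, movie_vecs.shape)
```

Rows 0 and 1 of `user_vecs` are identical because both samples refer to the same user ID.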

33
00:02:34,680 --> 00:02:41,250
As a side note, unlike NLP, where we have to do a lot of text processing, this is not the case with recommenders.

34
00:02:41,910 --> 00:02:42,830
As a quiz question,

35
00:02:42,840 --> 00:02:44,940
I would like you to think about why.

36
00:02:45,060 --> 00:02:49,580
Please take a moment to think about this or pause the video until you think you have the answer.

37
00:02:56,530 --> 00:02:56,920
OK.

38
00:02:56,950 --> 00:03:02,470
So why do we not need all the text preprocessing that we do for NLP with recommenders?

39
00:03:02,480 --> 00:03:08,270
Well, we have to think of the natural state of the data. Data like this is usually stored in a database.

40
00:03:08,900 --> 00:03:09,910
In databases,

41
00:03:09,920 --> 00:03:15,830
we don't pass around strings all the time, because strings take up lots of space and they're inefficient.

42
00:03:15,830 --> 00:03:19,880
For example, checking whether or not one string is the same as another string

43
00:03:19,880 --> 00:03:21,350
is not efficient.

44
00:03:21,350 --> 00:03:25,890
On the other hand comparing whether two numbers are equal is efficient.

45
00:03:25,910 --> 00:03:31,550
So in a database, when you store a movie, you'll probably have a table just for movies where you give

46
00:03:31,550 --> 00:03:37,060
each movie an integer ID, and then in that table you'll have a column for the movie name,

47
00:03:37,130 --> 00:03:39,550
its synopsis, the director, the producer,

48
00:03:39,560 --> 00:03:41,480
its duration, and so forth.

49
00:03:41,630 --> 00:03:46,740
By the way, the director and producer will probably also just be integer IDs.

50
00:03:46,790 --> 00:03:54,040
Similarly, users will be stored the same way, so you'll give users a user ID, and then along with that

51
00:03:54,130 --> 00:04:00,460
you'll store their name, email address, and so on. Therefore, in the ratings table you would not say something

52
00:04:00,460 --> 00:04:04,820
like "Bob rated Star Wars a 5", although that would be its interpretation.

53
00:04:05,080 --> 00:04:09,550
Instead you'll say user 345 rates movie 678 a 5.
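To make the database picture concrete, here is a small sketch using Python's built-in `sqlite3`. The table and column names are hypothetical, and the single row simply mirrors the Bob and Star Wars example from the narration.

```python
import sqlite3

# In-memory database with a hypothetical schema (names are assumptions)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL);
INSERT INTO users   VALUES (345, 'Bob');
INSERT INTO movies  VALUES (678, 'Star Wars');
INSERT INTO ratings VALUES (345, 678, 5);
""")

# What a recommender consumes: integer IDs plus a numeric rating, no strings
triples = conn.execute(
    "SELECT user_id, movie_id, rating FROM ratings"
).fetchall()
print(triples)

# The human-readable interpretation only appears after joining on the IDs
readable = conn.execute("""
    SELECT u.name, m.name, r.rating
    FROM ratings r
    JOIN users u ON u.id = r.user_id
    JOIN movies m ON m.id = r.movie_id
""").fetchall()
print(readable)
```

The ratings table never stores a name, which is why no text preprocessing is needed before building the model inputs.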

54
00:04:14,480 --> 00:04:14,890
okay.

55
00:04:14,920 --> 00:04:19,940
So we've established that there won't be any need to do text preprocessing, and we've established that

56
00:04:19,940 --> 00:04:25,690
the first thing we need to do is map each user and movie to a feature vector using embeddings.

57
00:04:25,730 --> 00:04:26,180
Now what?

58
00:04:27,170 --> 00:04:32,300
Well, now we can concatenate the user vectors and the movie vectors that make up each batch.

59
00:04:32,300 --> 00:04:39,110
This will give us a matrix of size N by 2D, and so now we come back to the situation where we can again

60
00:04:39,110 --> 00:04:41,810
say all data is the same.

61
00:04:41,810 --> 00:04:47,150
Since this is nothing but a two-dimensional data matrix, we can do anything with it, anything that we

62
00:04:47,150 --> 00:04:52,610
would normally do in machine learning on data of the same kind, i.e. we can pass this through a neural

63
00:04:52,610 --> 00:04:53,240
network.

64
00:04:58,150 --> 00:05:04,060
So here's what the code might look like for a recommender model. In the constructor we take in some arguments

65
00:05:04,060 --> 00:05:09,730
for the total number of users the total number of movies the embedding dimension and the number of hidden

66
00:05:09,730 --> 00:05:10,890
units.

67
00:05:11,020 --> 00:05:15,130
Of course you could make this more fancy and let your model be a neural network with multiple hidden

68
00:05:15,130 --> 00:05:17,350
layers but we won't do that.

69
00:05:17,350 --> 00:05:23,440
You may want to try that as an exercise. You can think of the total number of users and the total number

70
00:05:23,440 --> 00:05:27,590
of movies like vocabulary sizes for the users and movies.

71
00:05:27,640 --> 00:05:33,910
So if we call the number of users N, the number of movies M, and the embedding dimension D, then

72
00:05:33,910 --> 00:05:37,010
the user embedding will be a matrix of size N by D,

73
00:05:37,180 --> 00:05:45,100
and the movie embedding will be a matrix of size M by D. By the way, don't confuse this N with the N

74
00:05:45,100 --> 00:05:51,460
that we used previously, which meant the number of samples in a batch. On this slide and within the code,

75
00:05:51,460 --> 00:05:53,120
that's not what we mean.

76
00:05:53,470 --> 00:05:58,240
And the reason for this sort of overloaded use of the letter N is because that's what people normally

77
00:05:58,240 --> 00:06:00,440
use in recommender systems.

78
00:06:00,460 --> 00:06:03,800
So just make sure you're paying attention to the context.

79
00:06:04,000 --> 00:06:09,220
Once we create our two embeddings, all we need to do is create our ANN layers, which for our simple

80
00:06:09,220 --> 00:06:11,540
model can just be two linear layers.

81
00:06:11,950 --> 00:06:16,030
And by the way since this is a scalar regression problem we'll have one output

82
00:06:20,990 --> 00:06:21,390
okay.

83
00:06:21,400 --> 00:06:23,770
So what does the forward function look like.

84
00:06:23,800 --> 00:06:25,750
Probably exactly what you expect.

85
00:06:26,320 --> 00:06:29,500
Well maybe not since there's one thing you haven't seen before.

86
00:06:29,890 --> 00:06:36,220
Earlier in the course we started by looking at very simple ways to build models. Modules themselves are

87
00:06:36,220 --> 00:06:37,240
models.

88
00:06:37,240 --> 00:06:43,090
So if I create a linear module I can call that a model or I can take a whole sequence of modules and

89
00:06:43,090 --> 00:06:50,180
wrap that in a Sequential, and I can call that a model. Later we talked about custom models.

90
00:06:50,180 --> 00:06:51,560
We saw that with RNNs

91
00:06:51,570 --> 00:06:53,650
there's a little extra work that we need to do

92
00:06:53,750 --> 00:06:56,450
that can't be done in a Sequential model, at least not yet.

93
00:06:57,140 --> 00:07:00,250
So custom models were necessary.

94
00:07:00,350 --> 00:07:04,850
There was another reason I mentioned for why custom models are useful and this is it.

95
00:07:05,090 --> 00:07:11,690
This model is unique in this course because it's the first one we've seen that takes in multiple inputs.

96
00:07:11,690 --> 00:07:17,720
As you can see, the arguments into the forward function are not just a single x like we had before, but now

97
00:07:17,720 --> 00:07:24,800
we have two, u and m. u and m in this case are both one-dimensional arrays containing user IDs

98
00:07:24,830 --> 00:07:32,320
and corresponding movie IDs. A Sequential model would not be able to handle an input like this.

99
00:07:32,410 --> 00:07:36,870
So the first thing we do is pass u and m into their respective embeddings.

100
00:07:37,570 --> 00:07:44,130
After that, both the users and movies will be represented by arrays of size num_samples by D.

101
00:07:44,290 --> 00:07:50,860
The next step is to merge or concatenate these two matrices together along axis 1 to get back an array

102
00:07:50,860 --> 00:08:02,710
of size num_samples by 2D. Finally, it's just business as usual to pass this through the ANN layers.
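The constructor and forward function described above might look roughly like this. It is a sketch, not the code from the slide: the class name, argument names, and the ReLU activation between the two linear layers are assumptions.

```python
import torch
import torch.nn as nn


class Recommender(nn.Module):
    def __init__(self, n_users, n_movies, embed_dim, n_hidden):
        super().__init__()
        # User embedding is N by D, movie embedding is M by D
        self.u_emb = nn.Embedding(n_users, embed_dim)
        self.m_emb = nn.Embedding(n_movies, embed_dim)
        # Two linear layers; the input is the 2D-wide concatenation
        self.fc1 = nn.Linear(2 * embed_dim, n_hidden)
        # Scalar regression, so a single output
        self.fc2 = nn.Linear(n_hidden, 1)

    def forward(self, u, m):
        # u and m are 1-D integer tensors of user IDs and movie IDs
        u = self.u_emb(u)                  # (num_samples, D)
        m = self.m_emb(m)                  # (num_samples, D)
        x = torch.cat((u, m), dim=1)       # (num_samples, 2D)
        x = torch.relu(self.fc1(x))        # ReLU is an assumed choice
        return self.fc2(x)                 # (num_samples, 1)


# Hypothetical sizes and IDs, for illustration only
model = Recommender(n_users=10, n_movies=20, embed_dim=4, n_hidden=8)
u = torch.tensor([0, 3, 3])
m = torch.tensor([5, 5, 19])
out = model(u, m)
print(out.shape)
```

Because `forward` takes two arguments, this has to be a custom module; `nn.Sequential` passes a single input from one layer to the next.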

103
00:08:02,870 --> 00:08:03,160
Okay.

104
00:08:03,200 --> 00:08:08,630
So the next thing to consider is how will we load in the data. The data is going to come in the form

105
00:08:08,630 --> 00:08:11,310
of a CSV or a TSV or something like that.

106
00:08:11,810 --> 00:08:14,660
So we can use pandas to load in this dataset.

107
00:08:14,750 --> 00:08:20,990
We can even just turn it into a NumPy array right away, since we don't actually need any pandas functionality.
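Loading such a file could look like the sketch below. The column names `userId`, `movieId`, and `rating` are assumptions, and a small in-memory string stands in for the real CSV so the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for a ratings file on disk; column names are hypothetical
csv_text = """userId,movieId,rating
0,2,4.0
0,6,3.5
3,2,5.0
"""

# For a TSV, pass sep="\t" to read_csv
df = pd.read_csv(io.StringIO(csv_text))

# Convert straight to NumPy arrays, since no pandas functionality is needed later
user_ids = df["userId"].to_numpy()
movie_ids = df["movieId"].to_numpy()
ratings = df["rating"].to_numpy()

print(user_ids, movie_ids, ratings)
```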

108
00:08:20,990 --> 00:08:26,580
But here is one thing to consider: ratings datasets are usually huge.

109
00:08:26,610 --> 00:08:31,050
You might have millions of data points. If your business has millions of users,

110
00:08:31,050 --> 00:08:33,020
this is pretty much guaranteed.

111
00:08:33,120 --> 00:08:38,900
So just like with vision and text you probably want to use some kind of data generator or data iterator

112
00:08:39,630 --> 00:08:44,580
but unlike vision and text, there aren't any specialized libraries that will do all these steps for you.

113
00:08:45,870 --> 00:08:51,870
Luckily, there are generic dataset objects that we can use that just loop over arrays without any regard

114
00:08:51,870 --> 00:08:53,310
to what's in them.

115
00:08:53,340 --> 00:08:59,100
So here's an example of how to use TensorDataset, which takes in an arbitrary-length sequence of data

116
00:08:59,130 --> 00:08:59,960
tensors.

117
00:09:00,330 --> 00:09:04,440
For us, we'll pass in three tensors: the users, the movies, and the ratings.
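A minimal sketch of that step, assuming `torch.utils.data.TensorDataset` and using made-up values for the three tensors:

```python
import torch
from torch.utils.data import TensorDataset

# Made-up data: four (user, movie, rating) triples
user_ids = torch.tensor([0, 0, 3, 4])
movie_ids = torch.tensor([2, 6, 2, 1])
ratings = torch.tensor([4.0, 3.5, 5.0, 2.0])

# TensorDataset accepts any number of tensors that share the same first dimension
dataset = TensorDataset(user_ids, movie_ids, ratings)

print(len(dataset))

# Indexing returns one element from each tensor, i.e. one triple
u, m, r = dataset[1]
print(u.item(), m.item(), r.item())
```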

118
00:09:09,550 --> 00:09:12,680
Later, in order to loop over each dataset for training,

119
00:09:12,850 --> 00:09:15,910
we use the DataLoader object, which we've already seen.

120
00:09:16,330 --> 00:09:20,400
The first argument is the dataset object, which we just looked at.

121
00:09:20,500 --> 00:09:26,050
The second argument is the batch size, and the third argument is whether or not to shuffle the data.

122
00:09:26,050 --> 00:09:31,340
And once you have this, it's just business as usual: define your training function, and inside that function

123
00:09:31,340 --> 00:09:32,920
iterate over the data loader.
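Putting the last steps together, a training loop over the data loader might be sketched like this. The stand-in model, the random data, the MSE loss, and the Adam optimizer are all assumptions made for illustration; the lecture only specifies the three `DataLoader` arguments.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class TinyRecommender(nn.Module):
    """Stand-in model: two embeddings, concatenation, one linear layer."""

    def __init__(self, n_users, n_movies, embed_dim):
        super().__init__()
        self.u_emb = nn.Embedding(n_users, embed_dim)
        self.m_emb = nn.Embedding(n_movies, embed_dim)
        self.out = nn.Linear(2 * embed_dim, 1)

    def forward(self, u, m):
        x = torch.cat((self.u_emb(u), self.m_emb(m)), dim=1)
        return self.out(x)


def train(model, loader, epochs=2, lr=0.01):
    criterion = nn.MSELoss()  # assumed loss for scalar regression
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        total, n_seen = 0.0, 0
        # Each batch is a (users, movies, ratings) tuple of tensors
        for u, m, r in loader:
            optimizer.zero_grad()
            pred = model(u, m).squeeze(1)  # (batch, 1) -> (batch,)
            loss = criterion(pred, r)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(r)
            n_seen += len(r)
        losses.append(total / n_seen)  # mean loss per sample for the epoch
    return losses


# Random stand-in data: 64 ratings from 10 users on 20 movies
user_ids = torch.randint(0, 10, (64,))
movie_ids = torch.randint(0, 20, (64,))
ratings = torch.randint(1, 6, (64,)).float()

dataset = TensorDataset(user_ids, movie_ids, ratings)
# Arguments in order: dataset, batch size, whether to shuffle
loader = DataLoader(dataset, 16, True)

model = TinyRecommender(10, 20, 4)
losses = train(model, loader)
print(losses)
```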