WEBVTT

1
00:00:00.260 --> 00:00:02.580
Let's learn how embeddings work.

2
00:00:03.400 --> 00:00:08.760
So embeddings are what make semantic meaning

3
00:00:08.760 --> 00:00:10.560
and similarity searches possible.

4
00:00:11.080 --> 00:00:14.340
So it's not like it's doing the old

5
00:00:14.340 --> 00:00:18.780
style search of if the two words are

6
00:00:18.780 --> 00:00:21.460
the same or fuzzy search and so on.

7
00:00:21.560 --> 00:00:24.120
It's more a semantic search where we have

8
00:00:24.120 --> 00:00:25.120
AI involved.

9
00:00:26.420 --> 00:00:29.620
So in a normal search, a kitten would

10
00:00:29.620 --> 00:00:31.820
not be close to a cat because it

11
00:00:31.820 --> 00:00:36.300
has totally different length, totally different letters and

12
00:00:36.300 --> 00:00:36.840
so on.

13
00:00:37.440 --> 00:00:40.700
But in a vector store where you had

14
00:00:40.700 --> 00:00:46.260
words like wolf, dog, cat, banana, apple, the

15
00:00:46.260 --> 00:00:48.860
word kitten would go close to a cat.

16
00:00:50.780 --> 00:00:54.920
And behind the scenes, such vectors have 1536

17
00:00:54.920 --> 00:00:56.320
different dimensions.

18
00:00:57.580 --> 00:00:59.760
In our world, we of course have the

19
00:00:59.760 --> 00:01:01.720
three or the four if you take time.

20
00:01:02.360 --> 00:01:05.700
But in mathematics, a vector can have as many

21
00:01:05.700 --> 00:01:08.220
dimensions as you like.

22
00:01:09.380 --> 00:01:12.160
So that is how this works.

23
00:01:12.720 --> 00:01:15.320
This is a bit abstract, so we are

24
00:01:15.320 --> 00:01:17.160
going to go into some code and actually

25
00:01:17.160 --> 00:01:19.660
show a little on how we make these

26
00:01:19.660 --> 00:01:22.160
vectors, what they look like, and also how

27
00:01:22.160 --> 00:01:25.740
we check that something fits together.

28
00:01:28.540 --> 00:01:31.780
So in here, we have an embedding data

29
00:01:31.780 --> 00:01:33.060
sample.

30
00:01:34.080 --> 00:01:37.240
And we're using the normal client here, but

31
00:01:37.240 --> 00:01:39.720
now we're not going to new up a

32
00:01:39.720 --> 00:01:40.940
chat client agent.

33
00:01:41.060 --> 00:01:43.160
We don't use it at all in this

34
00:01:43.160 --> 00:01:43.760
scenario.

35
00:01:43.760 --> 00:01:47.120
Instead, we're taking what is called an embedding

36
00:01:47.120 --> 00:01:47.520
client.

37
00:01:49.000 --> 00:01:51.780
And in OpenAI's terms, one of the best

38
00:01:51.780 --> 00:01:54.100
they have is text-embedding-3-small.

39
00:01:54.760 --> 00:01:57.220
You also have a larger one, but not

40
00:01:57.220 --> 00:02:02.500
all vector stores can support such big embeddings,

41
00:02:02.660 --> 00:02:04.200
so I tend to use this one.

42
00:02:07.699 --> 00:02:10.740
So if we make one of these embedding

43
00:02:10.740 --> 00:02:16.400
generators, we can actually embed this sentence.

44
00:02:16.600 --> 00:02:18.140
In this case, I've just made a little

45
00:02:18.140 --> 00:02:20.820
Q&A part of our Wi-Fi data.

46
00:02:20.940 --> 00:02:22.340
What is the Wi-Fi password at the

47
00:02:22.340 --> 00:02:22.700
office?

48
00:02:23.280 --> 00:02:25.720
And the office password, it's guest 42.

49
00:02:27.160 --> 00:02:29.840
And if we use this embedding generator, just

50
00:02:29.840 --> 00:02:34.340
say generate this text and turn it into

51
00:02:34.340 --> 00:02:34.820
a vector.

52
00:02:36.620 --> 00:02:39.920
So this actually sent a request up to

53
00:02:39.920 --> 00:02:42.260
OpenAI and got a response back.

54
00:02:43.040 --> 00:02:47.020
So embeddings are rather fast and also rather

55
00:02:47.020 --> 00:02:51.120
cheap, way cheaper than normal LLM tokens in

56
00:02:51.120 --> 00:02:53.320
terms of cost and speed.

57
00:02:54.200 --> 00:02:57.720
So what we get back is a vector

58
00:02:57.720 --> 00:03:01.040
and we get when it was created,

59
00:03:01.300 --> 00:03:03.320
what model did it, and then we get

60
00:03:03.320 --> 00:03:04.100
the vector back.

61
00:03:04.200 --> 00:03:07.000
And the vector is just a long series

62
00:03:07.000 --> 00:03:07.900
of numbers.

63
00:03:08.960 --> 00:03:11.320
And I will not claim to understand this

64
00:03:11.320 --> 00:03:15.580
fully because I'm not a mathematician, but we

65
00:03:15.580 --> 00:03:18.480
can technically write it out if we want

66
00:03:18.480 --> 00:03:20.480
to and see how it looks.

67
00:03:21.800 --> 00:03:24.780
And it doesn't give too much meaning for

68
00:03:24.780 --> 00:03:24.960
us.

69
00:03:25.060 --> 00:03:28.640
It's just 1536 of these numbers back and

70
00:03:28.640 --> 00:03:31.420
forward in various different dimensions.

71
00:03:31.740 --> 00:03:33.460
How will it fit that dimension?

72
00:03:36.940 --> 00:03:39.820
This is a rather small embedding.

73
00:03:40.460 --> 00:03:42.600
We can also do a larger embedding here.

74
00:03:42.700 --> 00:03:44.540
I have the first chapter of Pride and

75
00:03:44.540 --> 00:03:48.020
Prejudice that I can turn into a vector.

76
00:03:49.140 --> 00:03:58.220
And all vectors become these 1536 dimensions.

77
00:03:58.980 --> 00:04:03.020
So it's not like this vector I create

78
00:04:03.020 --> 00:04:07.220
here is any bigger or any smaller than

79
00:04:07.220 --> 00:04:08.360
the one we just made.

80
00:04:08.440 --> 00:04:11.340
It has the same things here.

81
00:04:11.820 --> 00:04:15.180
But what happens is the smaller we make

82
00:04:15.180 --> 00:04:18.860
it, the less defined it will be to

83
00:04:18.860 --> 00:04:21.980
know what area this is in, what area

84
00:04:21.980 --> 00:04:22.680
this is in.

85
00:04:23.120 --> 00:04:26.040
And the bigger, the more it needs to

86
00:04:26.040 --> 00:04:29.340
compress the data and lose some of its

87
00:04:29.340 --> 00:04:29.700
meaning.

88
00:04:30.340 --> 00:04:34.520
So I would say a good vector is

89
00:04:34.520 --> 00:04:37.060
roughly in the middle of such things like

90
00:04:37.060 --> 00:04:37.320
this.

91
00:04:37.400 --> 00:04:40.980
This might be okay in terms of size.

92
00:04:41.780 --> 00:04:43.460
But the more you put into a vector,

93
00:04:43.640 --> 00:04:46.180
the more it will lose its value.

94
00:04:47.760 --> 00:04:49.720
But you can also, if you just put

95
00:04:49.720 --> 00:04:51.720
in every single word into a vector, it

96
00:04:51.720 --> 00:04:54.200
wouldn't know what to do with that because

97
00:04:54.200 --> 00:04:56.520
it will just be too generic for anything.

98
00:04:58.120 --> 00:05:00.560
The text behind a vector can technically be very long.

99
00:05:00.940 --> 00:05:02.840
You can put in up to 6000 words

100
00:05:02.840 --> 00:05:05.220
roughly because they're counted in tokens.

101
00:05:05.340 --> 00:05:09.740
We cannot really know exactly precisely, but that's

102
00:05:09.740 --> 00:05:12.420
roughly 10 to 15 pages if you wish

103
00:05:12.420 --> 00:05:12.620
to.

104
00:05:13.640 --> 00:05:16.780
But figuring out how big a vector should

105
00:05:16.780 --> 00:05:19.060
be, should it be one sentence at a

106
00:05:19.060 --> 00:05:21.000
time, should it be one paragraph at a

107
00:05:21.000 --> 00:05:23.540
time, should it be one chapter at a

108
00:05:23.540 --> 00:05:25.980
time, a page, and so on, is a

109
00:05:25.980 --> 00:05:31.320
whole science. We could

110
00:05:31.320 --> 00:05:34.800
make a course just about that by itself.

111
00:05:35.120 --> 00:05:38.160
And I'm no expert, but I'm slowly learning

112
00:05:38.160 --> 00:05:39.380
as we go along.
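
NOTE
Editor's sketch, not the speaker's code: one simple way to split a text into
paragraph-based chunks before embedding each chunk on its own. The function
name chunk_by_paragraph and the max_words budget are assumptions made up for
this illustration; real chunking strategies vary a lot.
```python
def chunk_by_paragraph(text, max_words=200):
    """Group consecutive paragraphs into chunks of at most roughly max_words words."""
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        words = len(paragraph.split())
        if words == 0:
            continue  # skip empty paragraphs
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph.strip())
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
sample = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = chunk_by_paragraph(sample, max_words=5)
print(chunks)
```
Each returned chunk would then be embedded separately, so a smaller budget
gives more focused vectors and a larger one gives broader vectors.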

113
00:05:40.680 --> 00:05:43.160
What we can do is, of course, we

114
00:05:43.160 --> 00:05:45.280
can go in and show that same vector

115
00:05:45.280 --> 00:05:48.860
and we'll just see a whole lot of

116
00:05:48.860 --> 00:05:53.560
numbers again with different values in terms of

117
00:05:53.560 --> 00:05:55.040
the different dimensions.

118
00:05:55.620 --> 00:05:57.780
But again, it's not like just because it

119
00:05:57.780 --> 00:05:59.920
was a longer text that it's a longer

120
00:05:59.920 --> 00:06:00.380
vector.

121
00:06:00.920 --> 00:06:03.060
They will be exactly the same length.
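
NOTE
Editor's sketch, not the speaker's code: a toy stand-in that only shows why a
short and a long text end up as vectors of the same length. toy_embed is a
hypothetical word-hashing function with no semantic meaning at all; a real
model such as text-embedding-3-small is called over the network instead.
```python
import hashlib
import math
DIMENSIONS = 1536  # the vector size mentioned in the transcript
def toy_embed(text, dimensions=DIMENSIONS):
    """Hypothetical stand-in: hash every word into one of `dimensions` buckets."""
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    # Scale to unit length so long texts do not produce larger numbers.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
short_vector = toy_embed("What is the Wi-Fi password at the office?")
long_vector = toy_embed("word " * 5000 + "a much longer text with many more words")
print(len(short_vector), len(long_vector))
```
Both lengths are 1536: the amount of input changes the values in the vector,
never the number of dimensions.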

122
00:06:06.680 --> 00:06:10.920
So now we have some vectors and I've

123
00:06:10.920 --> 00:06:13.980
just asked ChatGPT to make me a cosine

124
00:06:13.980 --> 00:06:19.900
similarity score that can compare two vectors

125
00:06:19.900 --> 00:06:21.600
to each other and see how close they

126
00:06:21.600 --> 00:06:22.140
are together.

127
00:06:22.360 --> 00:06:24.520
Because that's how things work.

128
00:06:25.780 --> 00:06:28.480
So I won't claim that I understand this

129
00:06:28.480 --> 00:06:31.820
code specifically, but it's going in and figuring

130
00:06:31.820 --> 00:06:34.740
out how close are two vectors to each

131
00:06:34.740 --> 00:06:38.900
other and giving a value between zero, meaning

132
00:06:38.900 --> 00:06:41.040
it had absolutely nothing to do with each

133
00:06:41.040 --> 00:06:43.700
other, and one being it was the exact

134
00:06:43.700 --> 00:06:47.580
same sentence that was vectorised.
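
NOTE
Editor's sketch: the function shown on screen is not in the transcript, so
this is the standard cosine similarity formula (dot product divided by the
product of the two vector lengths), written under that assumption. Note that
the formula in general ranges from -1 to 1; scores for typical text
embeddings tend to land on the positive side.
```python
import math
def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # same direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # at right angles
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite direction
print(same, unrelated, opposite)
```
Identical vectors score about 1, vectors at right angles score 0, and
opposite vectors score -1.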

135
00:06:48.620 --> 00:06:51.840
So if we go in and ask, how

136
00:06:51.840 --> 00:06:56.280
close is our Wi-Fi data to the

137
00:06:56.280 --> 00:06:56.860
book data?

138
00:06:57.400 --> 00:07:01.580
And hopefully you will intuitively know that in

139
00:07:01.580 --> 00:07:04.040
Pride and Prejudice, they probably don't talk a lot

140
00:07:04.040 --> 00:07:06.840
about Wi-Fi and offices and so on.

141
00:07:07.020 --> 00:07:09.020
So they're probably not too similar.

142
00:07:10.340 --> 00:07:13.440
And what we see is that that's correct.

143
00:07:13.600 --> 00:07:16.120
It's like 0.10 out of 1.

144
00:07:16.860 --> 00:07:18.160
So a fairly low score.

145
00:07:20.600 --> 00:07:23.100
But if we want to check something a bit

146
00:07:23.100 --> 00:07:27.000
more realistic, we can make a small vector

147
00:07:27.000 --> 00:07:30.600
here that says, how can my customer use

148
00:07:30.600 --> 00:07:31.220
the Wi-Fi?

149
00:07:32.180 --> 00:07:35.380
This is what a real person would perhaps

150
00:07:35.380 --> 00:07:36.180
ask for.

151
00:07:36.180 --> 00:07:38.300
They wouldn't know to ask, what is the

152
00:07:38.300 --> 00:07:39.840
Wi-Fi for the office and so on.

153
00:07:39.920 --> 00:07:42.840
They would just say, I have a customer at

154
00:07:42.840 --> 00:07:44.640
the office at the moment.

155
00:07:44.760 --> 00:07:46.400
They need to get onto the Wi-Fi.

156
00:07:47.280 --> 00:07:49.020
They don't specify the word guest.

157
00:07:49.220 --> 00:07:52.360
They don't specify office or anything.

158
00:07:53.140 --> 00:07:55.600
So we can make a vector of that

159
00:07:55.600 --> 00:07:59.640
and check the similarity search between our Wi-Fi

160
00:07:59.640 --> 00:08:00.800
data and the question.

161
00:08:01.440 --> 00:08:04.380
And here we get a better score, 0.45,

162
00:08:04.380 --> 00:08:08.840
which is a rather okay score, but

163
00:08:08.840 --> 00:08:09.880
of course it's not a one.

164
00:08:10.480 --> 00:08:12.820
But again, a one would never really happen,

165
00:08:13.420 --> 00:08:17.720
because if that were to happen, the end

166
00:08:17.720 --> 00:08:20.600
user would have written exactly this sentence with

167
00:08:20.600 --> 00:08:23.240
colons in exactly the same places and so

168
00:08:23.240 --> 00:08:23.480
on.

169
00:08:24.380 --> 00:08:28.120
So 0.45 is pretty okay, and

170
00:08:28.120 --> 00:08:30.180
if we only had these two, this would

171
00:08:30.180 --> 00:08:31.600
of course be the best match.

172
00:08:32.919 --> 00:08:35.059
We can ask in a different way, what

173
00:08:35.059 --> 00:08:36.720
is the office Wi-Fi?

174
00:08:37.580 --> 00:08:41.120
And if we do that, we get an

175
00:08:41.120 --> 00:08:44.840
even better score, because there's more similarity of

176
00:08:44.840 --> 00:08:48.740
this sentence compared to our original Q&A

177
00:08:48.740 --> 00:08:51.520
in terms of this question.

178
00:08:53.380 --> 00:08:56.600
But again, if the book is totally different,

179
00:08:57.000 --> 00:09:01.960
so we had these three, then this one

180
00:09:01.960 --> 00:09:04.380
would be the best way of querying to

181
00:09:04.380 --> 00:09:05.800
get our right information.
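
NOTE
Editor's sketch: picking the best match is just sorting the stored vectors by
their score against the question. The 3-dimensional vectors and the names
wifi_qa and book_chapter are made up for illustration; real embeddings would
come from the model and a vector store would do this ranking for you.
```python
import math
def cosine_similarity(a, b):
    """Standard cosine similarity for two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
# Hypothetical stored vectors, keyed by a label for the text they represent.
stored = {
    "wifi_qa": [0.9, 0.1, 0.0],
    "book_chapter": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.1]  # hypothetical vector for the user's question
scores = {name: cosine_similarity(query, vector) for name, vector in stored.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
best = ranked[0]
print(best)
```
The absolute score matters less than the order: whichever stored entry scores
highest against the question is the one to hand back.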

182
00:09:07.660 --> 00:09:11.180
So it's all about these numbers, and in

183
00:09:11.180 --> 00:09:13.200
reality, we are not sitting and doing the

184
00:09:13.200 --> 00:09:14.480
match score ourselves.

185
00:09:14.800 --> 00:09:18.940
That's the job of a vector store, and

186
00:09:18.940 --> 00:09:21.240
we are just doing a sample here.

187
00:09:21.920 --> 00:09:25.140
I've never in real life written code like

188
00:09:25.140 --> 00:09:28.220
this, but in order for us to quickly

189
00:09:28.220 --> 00:09:30.720
see that we have different scores for different

190
00:09:30.720 --> 00:09:34.160
vectors when they're compared, it's nice to know.

191
00:09:34.820 --> 00:09:36.480
But again, when we go to the vector

192
00:09:36.480 --> 00:09:40.280
store in the next lecture, we will let

193
00:09:40.280 --> 00:09:42.520
that do this similarity score.

194
00:09:43.440 --> 00:09:46.160
In this case, we're using cosine similarity, where

195
00:09:46.160 --> 00:09:48.320
the scores here land between 0 and 1.

196
00:09:49.280 --> 00:09:51.840
Others use from minus 1 to 1.

197
00:09:52.280 --> 00:09:54.160
Some use from 0 to 2.

198
00:09:54.580 --> 00:09:55.800
So there's different ways.

199
00:09:55.900 --> 00:09:57.760
So you need to check on your vector

200
00:09:57.760 --> 00:10:01.520
store what is a good score, because sometimes

201
00:10:01.520 --> 00:10:03.360
a negative score is a good score.

202
00:10:04.520 --> 00:10:08.020
That's all depending on the implementation of how

203
00:10:08.020 --> 00:10:10.820
the mathematicians have said, this is the best

204
00:10:10.820 --> 00:10:13.360
way for me to check two vectors together.
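
NOTE
Editor's sketch of why the "good" end of the scale differs between stores.
These two conversions are common conventions, assumed here for illustration:
cosine distance (1 minus similarity, range 0 to 2, lower is better) and
negative inner product (more negative is better). Check your own vector
store's documentation for which one it reports.
```python
def cosine_distance(similarity):
    """Turn a cosine similarity (-1..1) into a distance (0..2); lower means closer."""
    return 1.0 - similarity
def negative_inner_product(a, b):
    """Negated dot product; a more negative value means a closer match."""
    return -sum(x * y for x, y in zip(a, b))
for similarity in (1.0, 0.0, -1.0):
    print(similarity, cosine_distance(similarity))
close_score = negative_inner_product([1.0, 0.0], [1.0, 0.0])
far_score = negative_inner_product([1.0, 0.0], [0.0, 1.0])
print(close_score, far_score)
```
The same pair of vectors can therefore show up as 1, as 0, or as a negative
number, depending on which convention the store uses.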

205
00:10:15.180 --> 00:10:17.740
But now we should be ready to go

206
00:10:17.740 --> 00:10:18.940
to our vector store part.
