WEBVTT

00:00.160 --> 00:02.560
It is week two, day four.

00:02.600 --> 00:10.080
It's the start of a two day series in which we wrap up week two, and together, the two days will be

00:10.080 --> 00:13.080
that same combination of challenging and satisfying.

00:13.440 --> 00:15.840
As you will see, it is, of course, a yellow day.

00:15.880 --> 00:20.480
It's a yellow day because we're doing integrations, in particular Supabase, which you had your first

00:20.480 --> 00:20.800
look at.

00:20.840 --> 00:26.200
Hopefully you logged in, you clicked around, you set up your organization, did some of the initial

00:26.280 --> 00:29.160
stuff, got some sense of what we're talking about.

00:29.480 --> 00:33.480
But first, as usual, we have some recapping.

00:33.760 --> 00:39.200
I want to just do a quick refresher on RAG, because the next couple of days are all about building

00:39.200 --> 00:41.600
RAG ourselves, for reals.

00:41.600 --> 00:44.600
So first, a quick refresher on what RAG is.

00:44.640 --> 00:48.880
It's, of course, about making your model appear to be more knowledgeable.

00:48.880 --> 00:52.720
It's retrieval augmented generation, a trick that works well.

00:52.720 --> 00:57.480
The simple version of it is that you just take some information and shove it in the prompt.

00:57.480 --> 01:03.290
So the user asks a question and you give the model some information that might pertain to the question.

01:03.290 --> 01:10.050
And the clever trick that's added to this is all about the technique for figuring out what useful data

01:10.130 --> 01:11.730
we could include in the prompt.

01:11.730 --> 01:17.010
And it uses this thing called vector embeddings, this idea that there are kinds of LLMs that are

01:17.010 --> 01:22.370
good at taking some text and coming up with a set of numbers that represents the meaning of the text.

01:22.370 --> 01:26.490
And then you can use these numbers as a way of finding similar information.

01:26.530 --> 01:29.730
And that gives you a way to look up stuff in your knowledge base

01:29.730 --> 01:32.330
that's likely to be useful in answering a question.

01:32.330 --> 01:37.050
And at the heart of this new idea is a type of LLM that you may not have even known about, called an

01:37.050 --> 01:41.810
embedding model, also known as an encoder, an embedding LLM.

01:42.170 --> 01:45.970
And it's something which is able to take some text as input.

01:45.970 --> 01:52.570
And the output is a bunch of numbers, a list of numbers which is called a vector, also called a vector

01:52.570 --> 01:53.570
embedding.

01:53.690 --> 01:58.130
And these numbers represent the meaning of this input text.

01:58.220 --> 02:00.500
And you can think of it like it's a point in space.

02:00.500 --> 02:02.980
If it were three numbers, it would be a point, like here.

02:02.980 --> 02:06.780
If it's a thousand numbers, then it's a point in multidimensional space.

02:06.940 --> 02:13.260
But basically it's a bunch of numbers, and pieces of text whose numbers are close to each

02:13.300 --> 02:21.620
other in this space represent similar kinds of stuff, and that allows us to search for relevant contextual

02:21.620 --> 02:23.740
information in our knowledge base.

02:23.740 --> 02:29.620
Something that has similar meaning, that pertains to a similar topic as the question that the user is asking.
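
NOTE
Editor's sketch, not from the lecture: one common way to measure "close to each other" is cosine similarity. The three-number vectors below are invented for illustration; a real embedding model would return hundreds or thousands of numbers, and the labels in the comments are hypothetical.
```python
import math
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
# Made-up "embeddings" (assumption: real ones come from an embedding model).
question = [0.9, 0.1, 0.2]       # "How much does it cost to fly to Heathrow?"
london_prices = [0.8, 0.2, 0.1]  # "Ticket prices to London"
baggage_rules = [0.1, 0.9, 0.3]  # an unrelated piece of text
# The related text should score higher than the unrelated one.
print(cosine_similarity(question, london_prices) > cosine_similarity(question, baggage_rules))
```
Other distance measures (dot product, Euclidean distance) are also used; which one fits depends on the embedding model.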

02:29.620 --> 02:31.940
And so, you know, I love this diagram.

02:31.940 --> 02:34.260
You've seen this diagram already a few times.

02:34.300 --> 02:37.500
And if you've done my other courses, you've seen this like a million times and you're like, oh, not this

02:37.500 --> 02:37.860
again.

02:37.900 --> 02:39.900
I'm sorry, I like this diagram.

02:40.180 --> 02:42.700
The user asks a question.

02:42.700 --> 02:47.140
That question comes to our code or to wherever we've got our workflow.

02:47.340 --> 02:54.700
And what we do is we say, I want to know the vector or the vector embeddings associated with this question.

02:54.700 --> 02:58.180
If the question is how much does it cost to fly to Heathrow?

02:58.380 --> 02:59.980
Turn that into a vector.

03:00.300 --> 03:03.620
And the next thing you do is you say, I've got a database of all my information.

03:03.620 --> 03:06.420
I've already turned it into lots of vectors.

03:06.420 --> 03:13.700
And what I can do is retrieve the closest pieces of information to the vector for that question, so

03:13.700 --> 03:19.100
that if the question is how much does it cost to go to Heathrow, then hopefully ticket prices to London

03:19.100 --> 03:21.540
is some of the data that gets retrieved.

03:21.740 --> 03:24.460
And when you retrieve it, you don't retrieve the vectors.

03:24.460 --> 03:29.100
You retrieve the text, the original language about the ticket prices to London.

03:29.100 --> 03:32.300
And it's that language which you put in the prompt to the LLM.

03:32.340 --> 03:37.580
You say, hey, LLM, the user has asked this question, how much does it cost to fly to Heathrow?

03:37.820 --> 03:41.180
Here's some interesting relevant context.

03:41.180 --> 03:43.060
And it includes ticket prices to London.

03:43.060 --> 03:47.100
And the LLM is good enough to connect the dots given that input context.

03:47.100 --> 03:52.580
And it knows that the most likely next tokens will be the answer to the question.

03:52.620 --> 03:53.980
And that's what comes back.

03:53.980 --> 03:55.260
And that is RAG.
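
NOTE
Editor's sketch, not from the lecture: the flow just described (vectorize the question, retrieve the closest text, put that text in the prompt) in miniature. The knowledge base entries, their vectors, the question vector and the prompt wording are all invented; no real embedding model or LLM is called.
```python
import math
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
# Hypothetical knowledge base of (text, vector) pairs; vectors are made up.
KNOWLEDGE_BASE = [
    ("Ticket prices to London: see the London fare table.", [0.8, 0.2, 0.1]),
    ("Baggage rules: one carry-on bag per passenger.", [0.1, 0.9, 0.3]),
]
def retrieve(question_vector: list[float], k: int = 1) -> list[str]:
    """Return the text (not the vectors) of the k closest entries."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda entry: cosine_similarity(question_vector, entry[1]), reverse=True)
    return [text for text, _vector in ranked[:k]]
def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved text in the prompt alongside the user's question."""
    joined = "\n".join(context)
    return f"Use this context to answer.\nContext:\n{joined}\nQuestion: {question}"
question = "How much does it cost to fly to Heathrow?"
question_vector = [0.9, 0.1, 0.2]  # pretend this came from an embedding model
prompt = build_prompt(question, retrieve(question_vector))
print(prompt)
```
The prompt string is what would then be sent to the LLM to generate the answer.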

03:55.350 --> 04:00.190
And you know, when I give these recaps, I like to add something extra to spice it up.

04:00.190 --> 04:05.590
And I will just mention that often when you're building these vector databases, you've got content

04:05.590 --> 04:08.790
that's something like lots of documents with lots of stuff on them.

04:08.950 --> 04:15.830
And one challenging question is: do you create one vector for the whole document?

04:15.830 --> 04:21.270
Or if the document has, say, ticket prices to different places, maybe you should break it up.

04:21.270 --> 04:26.550
And for each of the paragraphs of text or for each of the sections, you should make that be a little

04:26.550 --> 04:31.430
miniature document with a vector associated with it that might get inserted into the prompt.

04:31.710 --> 04:38.670
And this process of potentially taking a document and breaking it up into smaller pieces is called chunking.

04:38.830 --> 04:40.670
You're turning it into chunks.
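
NOTE
Editor's sketch, not from the lecture: one simple chunking strategy, assuming paragraphs are separated by blank lines. The function name, the max_chars parameter and the sample document are hypothetical; real tools offer many other strategies and defaults.
```python
def chunk_by_paragraph(document: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of up to max_chars.
    A single paragraph longer than max_chars is kept whole as its own chunk."""
    chunks, current = [], ""
    for paragraph in (p.strip() for p in document.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
document = "London: fares listed here.\n\nParis: fares listed here.\n\nTokyo: fares listed here."
chunks = chunk_by_paragraph(document, max_chars=40)
print(len(chunks))
```
With a larger max_chars the same paragraphs get packed together into fewer chunks, which is exactly the kind of setting worth testing against your own data.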

04:40.670 --> 04:47.430
And there's again, there's a whole cottage industry of different techniques and strategies for chunking.

04:47.430 --> 04:51.590
And people like to get very excited about it and come up with lots of sort of principles.

04:51.590 --> 04:52.430
You should chunk like this.

04:52.430 --> 04:53.590
You shouldn't chunk like this.

04:53.750 --> 04:59.640
At the end of the day, the only one true principle is that you need to test it.

04:59.640 --> 05:05.000
You need to try different strategies, different approaches to find out what works best for your data

05:05.000 --> 05:06.040
and your questions.

05:06.040 --> 05:12.040
And if you take my AI engineering courses, the core track, I go on and on about it and we build testing

05:12.040 --> 05:19.080
methodologies, and we use things like MRR and all sorts of stuff to get to a point where we can try

05:19.080 --> 05:23.840
out different chunking strategies and pick the one that works the best, but that's more of the

05:23.840 --> 05:24.440
pro stuff.

05:24.440 --> 05:29.640
What we're going to be doing today is just chunking using defaults, which often works very well indeed.

05:29.680 --> 05:29.960
Okay.

05:30.000 --> 05:36.880
And then the other thing that we covered yesterday is RAG versus agentic RAG, which is just the kind

05:36.880 --> 05:38.520
of next step to RAG.

05:38.520 --> 05:42.680
So with traditional RAG, you remember I said someone is chatting with us.

05:42.840 --> 05:44.840
We do vector based retrieval.

05:44.840 --> 05:49.160
We find the vector and we look it up in our knowledge base.

05:49.400 --> 05:54.890
And then that gets sent to the LLM to respond to the question. In agentic RAG,

05:54.890 --> 05:56.290
it's just a bit more than that.

05:56.330 --> 06:00.090
The idea is that the user asks a question to our agent.

06:00.130 --> 06:06.130
Our agent is powered by an LLM that doesn't just respond to the question, but also it manages the whole

06:06.130 --> 06:08.610
workflow of how to respond to the question.

06:08.610 --> 06:14.130
And so it is equipped with a tool that allows for vector based retrieval, and it can choose to use

06:14.130 --> 06:16.410
that tool in order to find relevant context.

06:16.450 --> 06:20.770
And maybe there are some settings that it can apply to decide how to use the tool.

06:20.890 --> 06:25.850
And maybe it's also got other kinds of retrieval tools as well in its arsenal of tools that

06:25.850 --> 06:29.010
it can use to try to best answer the question.

06:29.010 --> 06:34.530
And it can figure out how to pick the right ones, and it can try all of them and use that to assess

06:34.530 --> 06:34.850
the context.

06:34.850 --> 06:36.970
And if the context is not good enough, it can do it again.

06:36.970 --> 06:44.690
And that kind of iterative approach is what makes it agentic, where the LLM is deciding how to do

06:44.690 --> 06:45.090
that.
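
NOTE
Editor's sketch, not from the lecture: the shape of that iterative loop. In a real agent an LLM makes the decisions; here good_enough and rewrite_query are hypothetical rule-based stand-ins for those LLM calls, and search_tool uses keyword matching instead of vector retrieval. All names and documents are invented.
```python
DOCS = [
    "Flights to London Heathrow: fares vary by season.",
    "Baggage allowance: one carry-on bag.",
]
def search_tool(query: str) -> list[str]:
    """Stand-in retrieval tool: matches any query word longer than 3 letters."""
    terms = [word for word in query.lower().split() if len(word) > 3]
    return [doc for doc in DOCS if any(term in doc.lower() for term in terms)]
def good_enough(context: list[str]) -> bool:
    """Stand-in for the LLM judging whether the retrieved context is sufficient."""
    return len(context) > 0
def rewrite_query(query: str) -> str:
    """Stand-in for the LLM reformulating the query before trying again."""
    return query.replace("LHR", "Heathrow")
def gather_context(question: str, max_rounds: int = 3) -> tuple[list[str], int]:
    """Retrieve, assess, and retry with a rewritten query until the context is usable."""
    query = question
    for round_number in range(1, max_rounds + 1):
        context = search_tool(query)
        if good_enough(context):
            return context, round_number
        query = rewrite_query(query)
    return [], max_rounds
context, rounds = gather_context("price for LHR")
print(rounds, context)
```
The first search finds nothing, the query is rewritten, and the second search succeeds; that retry is the part a single-pass pipeline does not have.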

06:45.090 --> 06:49.050
And so that gives you a rough sense of RAG versus agentic RAG.

06:49.090 --> 06:52.220
And now you often hear people saying that RAG is dead.

06:52.500 --> 06:55.260
And they usually say that for one of two reasons.

06:55.340 --> 07:01.580
One reason is that context windows, which are the total amount of information that you can pass in

07:01.620 --> 07:04.740
to an LLM for it to be able to respond to you, loosely

07:04.740 --> 07:09.860
speaking, are getting bigger and bigger and bigger, and so you can cram

07:09.860 --> 07:12.060
more and more into that context.

07:12.060 --> 07:15.260
And so you get to a point where you think, hey, why do I even need RAG?

07:15.260 --> 07:18.620
I can just put my entire knowledge base in the context window.

07:18.620 --> 07:20.340
And so we don't need RAG anymore.

07:20.380 --> 07:23.340
And now I think that is probably a bit of a red herring.

07:23.540 --> 07:28.260
Ultimately, to be scalable, you would need to have some sort of RAG at some point. Like,

07:28.260 --> 07:33.300
we're about to build out a RAG pipeline for a company, and they might have tons and tons of documents.

07:33.300 --> 07:35.140
They might have gigabytes of documents.

07:35.140 --> 07:40.380
You could easily get to a point where you would far surpass what could possibly go in a context window,

07:40.740 --> 07:47.140
but also, surely it's a waste of compute and resources and just a general waste to put

07:47.180 --> 07:50.710
tons of irrelevant content into the context window.

07:50.750 --> 07:56.150
If the LLM is being asked for ticket prices to London, there's surely no benefit to giving ticket

07:56.150 --> 07:59.790
prices to every other city in the world in the context window as well.

07:59.790 --> 08:00.110
So.

08:00.110 --> 08:04.710
It seems to me that the idea that the context window has got so big that we don't need RAG anymore,

08:04.750 --> 08:05.590
that's a red herring.

08:05.590 --> 08:06.350
That's not right.

08:06.350 --> 08:07.470
But there's another reason.

08:07.470 --> 08:11.350
Some people say RAG is dead, and it's because of agentic RAG.

08:11.710 --> 08:19.390
Agentic RAG has come along and shown us this more iterative, more autonomous way of finding and retrieving

08:19.390 --> 08:20.590
relevant content.

08:20.590 --> 08:26.070
And so it's possible that traditional RAG is going to be fully replaced by agentic RAG.

08:26.070 --> 08:29.470
And I think that's probably true, but I think that that's just a terminology thing.

08:29.470 --> 08:34.790
I think agentic RAG is the natural successor to RAG, and it's the way we do RAG nowadays.

08:34.790 --> 08:36.790
I don't think that means RAG is dead.

08:36.830 --> 08:41.950
I think agentic RAG is just the natural evolution of the same kind of trickery.

08:42.110 --> 08:45.430
And so I say, long live agentic RAG.

08:45.630 --> 08:46.510
We're in good shape.

08:46.510 --> 08:47.470
It's here to stay.

08:47.510 --> 08:52.230
Okay, so there are two distinct phases to building out RAG.

08:52.550 --> 08:58.670
The first of them is data ingest, pulling in the information we've got and putting it into our knowledge

08:58.670 --> 09:00.310
base, into our vector store.

09:00.670 --> 09:06.190
So you typically have some data in some source place. That could be text files.

09:06.190 --> 09:08.910
It could be a sheet, it could be PDFs.

09:09.270 --> 09:11.670
And you need to load in that data.

09:11.710 --> 09:15.790
You need to do what's called extract: pull it from the source.

09:15.990 --> 09:21.030
The next thing you need to do is what they call transform, which is saying it's in some format in the

09:21.030 --> 09:21.270
source.

09:21.270 --> 09:23.950
We might need it in a better format for us to work with.

09:24.150 --> 09:28.870
Often you use this to add in metadata, little extra bits of information for the data that might be

09:28.870 --> 09:31.070
used to filter it, or something like that.

09:31.070 --> 09:33.550
So transform is typically what you do next.

09:33.830 --> 09:38.070
And then you would do a few things that are very specific to RAG.

09:38.230 --> 09:42.710
You would typically chunk, which is what I mentioned a few minutes ago, about breaking up your document

09:42.710 --> 09:46.510
into pieces that feel like they're going to be about the right size,

09:46.550 --> 09:49.720
likely to match the questions that are asked.

09:49.720 --> 09:54.560
And there are so many different ways of doing this. The default way is just to split it into things

09:54.560 --> 09:55.440
that make sense.

09:55.440 --> 09:58.680
There's a more sophisticated way called semantic chunking,

09:58.680 --> 10:04.680
where you try to break it into fragments that are most likely to be about the same stuff, but

10:04.680 --> 10:08.480
often the simple approach works just fine, and that's what we will be doing.

10:08.520 --> 10:12.640
So you chunk it, and then you do what's called vectorizing, which is informal.

10:12.640 --> 10:19.320
The proper name for it is to encode or embed or calculate the vector embeddings, which is basically

10:19.320 --> 10:22.640
putting it through an embedding model and coming up with the vector.

10:22.640 --> 10:26.840
But the word people often use is vectorize to find that vector.

10:26.840 --> 10:30.960
And then you load that into your vector store.

10:31.240 --> 10:37.880
And the reason I use this terminology is that you may have heard of extract, transform and load, or ETL.

10:38.080 --> 10:43.280
That's really like a traditional data engineering expression from many years ago.

10:43.280 --> 10:47.330
And we still very much think in terms of extract, transform and load today.

10:47.330 --> 10:50.930
But there is also this chunking and vectorizing that happens along the way.

10:51.250 --> 10:57.890
So these are the data ingest pipelines that you would typically build as part of your RAG solution.
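
NOTE
Editor's sketch, not from the lecture: the ingest steps in order (extract, transform, chunk, vectorize, load). Everything here is hypothetical: the source record is hard-coded, toy_embed is a checksum-based stand-in that is NOT semantic, and the vector store is a plain list rather than a real database.
```python
def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Stand-in for an embedding model: buckets words by a simple checksum."""
    vector = [0.0] * dims
    for word in text.lower().split():
        vector[sum(ord(char) for char in word) % dims] += 1.0
    return vector
def extract() -> list[dict]:
    """Extract: pull raw records from the source (could be text files, a sheet, PDFs)."""
    return [{"name": "fares.txt", "body": "  London fares.\n\nParis fares.  "}]
def transform(record: dict) -> dict:
    """Transform: tidy the text and attach metadata that could later be used to filter."""
    return {"text": record["body"].strip(), "metadata": {"source": record["name"]}}
def chunk(doc: dict) -> list[dict]:
    """Chunk: one chunk per paragraph, each keeping the document's metadata."""
    return [{"text": p.strip(), "metadata": doc["metadata"]} for p in doc["text"].split("\n\n") if p.strip()]
def ingest() -> list[dict]:
    vector_store = []  # a plain list standing in for a real vector store
    for record in extract():
        for piece in chunk(transform(record)):
            piece["vector"] = toy_embed(piece["text"])  # vectorize
            vector_store.append(piece)                  # load
    return vector_store
store = ingest()
print(len(store), store[0]["metadata"])
```
Each stored entry keeps the chunk text, its metadata and its vector together, so a later lookup by vector can hand back the original text.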

10:57.890 --> 11:04.010
And then the second distinct phase, which is sometimes called the RAG pipeline, or question answering,

11:04.210 --> 11:06.210
is where the user asks a question.

11:06.250 --> 11:07.890
Obviously it goes to our agent.

11:07.890 --> 11:12.210
If this is agentic RAG, there's an LLM that manages the workflow.

11:12.210 --> 11:17.930
And one of the things it does is it calls a tool which is able to look up relevant information.

11:17.930 --> 11:19.970
And the way it does that is it

11:19.970 --> 11:25.410
first vectorizes the question and then it looks that up in the vector store.

11:25.450 --> 11:26.690
What information do I have?

11:26.690 --> 11:33.690
What chunks do I have in there where the vectors associated with the chunks are similar to this vector.

11:33.690 --> 11:35.130
And then I get back those chunks.

11:35.130 --> 11:41.130
I get that text and that is what I use to then call the LLM to get my final response.

11:41.610 --> 11:45.170
And that is the second distinct phase of building RAG.