WEBVTT

00:00.080 --> 00:01.280
Implement our agent.

00:01.480 --> 00:05.960
This is going to be a very long and intensive coding video.

00:06.400 --> 00:10.520
So I'm not going to write each line of code throughout this video.

00:10.560 --> 00:16.440
I'm going to use an existing piece of code that I have, but I will definitely walk you through the

00:16.440 --> 00:22.520
code line by line, and then we'll execute the code on the Google Colab notebook.

00:22.560 --> 00:32.120
Having said that, I have a 400 as the GPU compute with 83 gig memory Ram and 235 disk space.

00:32.280 --> 00:34.960
It supports Python three GPUs.

00:35.200 --> 00:38.720
And now I'll go ahead and first install the haystack AI.

00:39.000 --> 00:41.440
So pip install haystack AI.

00:41.440 --> 00:43.840
And I'll also do pip install fast drag.

00:44.880 --> 00:45.640
It's done.

00:46.160 --> 00:52.120
So now I will go ahead and start the first step which is loading the nutritional data dot JSON.

00:52.560 --> 00:55.320
So here is the nutritional data dot JSON.

00:55.560 --> 01:03.010
We first index our data and for that we need this JSON file I have uploaded the file here in the in

01:03.010 --> 01:03.770
the notebook.

01:04.130 --> 01:09.130
Let's quickly go through the structure of the nutrition data JSON file.

01:09.490 --> 01:11.290
So here is the JSON file.

01:11.530 --> 01:16.210
It has six entries here, with each entry having three fields to it.

01:16.530 --> 01:21.530
Image URL which is where the image with the necessary information is stored.

01:22.410 --> 01:24.810
There is a title and there's a content.

01:25.090 --> 01:31.410
Let me share with you one of the images here with the protein bar nutritional fact and see how this

01:31.410 --> 01:32.250
looks like.

01:32.690 --> 01:40.250
So here is the media Amazon file which has the nutritional facts about the protein bar calories, cholesterol,

01:40.530 --> 01:45.250
carbohydrates and all the details with protein and everything.

01:45.650 --> 01:47.810
So now let's go back to the notebook.

01:48.650 --> 01:52.650
So next we'll have to index the documents to in-memory document store.

01:53.010 --> 01:55.450
So here is our in-memory document store.

01:55.450 --> 02:03.940
And then here we'll use the sentence Transformers all mini six v2 model to create the embedding for

02:03.940 --> 02:05.660
each label description.

02:05.660 --> 02:11.060
And then we'll create a pipeline to index our data to in-memory document store.

02:11.460 --> 02:14.860
So here we have put a pipeline called index pipeline.

02:14.860 --> 02:20.820
And then we add those components the sentence transformer model call it document Embedder.

02:21.060 --> 02:24.900
And then here we have another component documents writer.

02:25.940 --> 02:27.780
We'll name it as Doc Writer.

02:28.180 --> 02:33.340
Then we'll connect the document Embedder documents to document writer documents.

02:33.660 --> 02:35.420
Let's execute the code block.

02:35.700 --> 02:42.940
So next we create document objects with the nutrition label content as the content and store the title

02:42.940 --> 02:49.700
and image URL as metadata before passing them to indexing pipeline for processing.

02:50.180 --> 02:55.740
So here for this indexing pipeline, we have documents with content as the root as a root element.

02:55.740 --> 02:57.060
And then metadata field.

02:57.060 --> 02:59.470
Here is title and image library.

03:00.270 --> 03:01.310
Let's run this.

03:01.750 --> 03:05.630
So we got six documents returned to the in-memory database.

03:05.990 --> 03:08.110
Next, we'll build a retrieval pipeline.

03:08.550 --> 03:12.790
We create a document retrieval pipeline for the documents above.

03:13.270 --> 03:15.910
We'll later use this pipeline in our tool.

03:16.310 --> 03:23.390
This pipeline will consist of sentence transformer text embedder in memory and better retrieval to fetch

03:23.390 --> 03:29.790
top one content, and multimodal prompt builder to construct the prompt that are able to use.

03:29.950 --> 03:33.830
So here I have imported the necessary packages.

03:34.150 --> 03:38.830
Here is the prompt template with image and the content that we get from the document.

03:39.310 --> 03:45.750
Then we have sentence transformer text Embedder component to embed our questions.

03:46.510 --> 03:48.710
Uses all the mini LLM model.

03:49.190 --> 03:54.470
Then we have a in-memory retriever component to fetch the top one document.

03:54.950 --> 04:02.320
And then we have multimodal prompt in prompt Builder to construct the prompt that our agent will use,

04:02.640 --> 04:05.360
and then we'll connect Embedder to retriever.

04:05.680 --> 04:12.640
And the retriever will, once it got all the retriever documents, it will embed it to the prompt builder

04:12.640 --> 04:13.920
with the document here.

04:14.680 --> 04:17.800
And then we'll run this pipeline and print it.

04:18.080 --> 04:19.800
Let's see how this works.

04:20.120 --> 04:21.320
So it ran.

04:21.600 --> 04:24.600
We have prompt builder with the prompt as image.

04:24.880 --> 04:31.760
So in this pipeline the multimodal prompt builder component receives one document object one document

04:31.760 --> 04:35.960
object here from the retriever and renders the prompt.

04:36.320 --> 04:43.200
Notice here we have image placeholder in the prompt template for our model in order to inject the images

04:43.200 --> 04:44.480
into this later on.

04:44.760 --> 04:52.720
Additionally, multimodal prompt builder converts the Keyman image into base 64 screen for image to

04:52.720 --> 04:55.320
be processed by multimodal agent.

04:56.240 --> 04:58.860
So now we have created the retrieval pipeline.

04:59.140 --> 05:02.220
The next phase is creating the multi-modal react agent.

05:02.540 --> 05:07.540
For that, we'll have to first define a tool with our retrieval pipeline ready.

05:07.580 --> 05:13.620
We can create our tool using Doc image haystack query tool component from fast drag.

05:14.340 --> 05:19.580
So here from the fast drag we are importing doc with image Haystack query tool.

05:19.900 --> 05:26.460
Very important to understand that this tool has three important attributes or inputs which is name that

05:26.460 --> 05:31.740
is nutrition tool description and the pipeline or YML file.

05:32.180 --> 05:37.340
So the name and a description of this tool is used by our agent to decide when to use it.

05:37.620 --> 05:42.020
We will provide the agent with our retrieval pipeline when we want to invoke it.

05:42.180 --> 05:44.380
Let's run this with the defining tool.

05:45.460 --> 05:52.740
And now with this tool ready, I'll run this tool with a protein bar as input and print the result.

05:53.060 --> 05:53.860
There you go.

05:54.340 --> 05:55.820
So here we got image tag.

05:55.820 --> 05:59.990
And then this image shows a protein bar with chocolate peanut butter.

06:00.630 --> 06:03.830
Nutrition facts per bag as 50g.

06:04.270 --> 06:08.990
This is what we got from the tool which matches the nutrition data content here.

06:09.230 --> 06:09.950
Protein bar.

06:09.990 --> 06:10.550
Chocolate.

06:10.550 --> 06:11.270
Peanut butter.

06:11.270 --> 06:12.110
Nutrition.

06:12.110 --> 06:14.550
Fats per bar 50g.

06:15.070 --> 06:18.470
So the next step is to initialize the generator.

06:19.430 --> 06:22.830
So now here as you notice we have our multimodal agent.

06:22.870 --> 06:27.630
We have to initialize 535 vision h f generator.

06:28.070 --> 06:33.590
This model processes both text and prompt and base64 encoded images.

06:34.030 --> 06:38.790
This makes it well suited for image text like visual question answering.

06:39.190 --> 06:45.950
The vision HRF generator uses hugging face Image Protect model, which will function as an LLM for our

06:45.950 --> 06:46.550
agent.

06:46.910 --> 06:52.470
So this is the model name model Phi 3.5 vision instruct.

06:52.470 --> 06:55.150
And here's the stop by word text criteria.

06:55.150 --> 07:00.800
And then we have initialized the generator by passing the model name or path along with it.

07:00.840 --> 07:07.000
We also say what task it is for image to text is maximum, token is 100.

07:07.320 --> 07:13.040
Stopping criteria is here the observations stopping word and end.

07:13.400 --> 07:16.280
It's very important to understand and learn this.

07:16.280 --> 07:21.560
And we have defined words as observation and end as stop words.

07:21.920 --> 07:25.200
These stop words are specific to model and react prompting.

07:26.120 --> 07:29.480
So now we pass other more specific parameters.

07:29.720 --> 07:32.280
Storage trust remote code is true.

07:32.480 --> 07:36.840
And then attention implementation is eager and device map as auto.

07:37.360 --> 07:43.200
And then we'll warm up this particular model which will help us download this from the hugging face.

07:43.320 --> 07:49.120
You can go look up this model on the hugging face, which would have all the model details on the model

07:49.120 --> 07:49.680
card.

07:49.880 --> 07:52.760
So it is now downloading this model.

07:52.760 --> 07:53.840
It takes a while.

07:54.000 --> 07:55.320
I'll pause the video.
