WEBVTT

00:00.090 --> 00:01.440
-: Hey, in this video, we're gonna have a look

00:01.440 --> 00:05.010
at an implementation of the Realtime API,

00:05.010 --> 00:06.840
which is made by OpenAI.

00:06.840 --> 00:09.480
We're gonna look at how this uses WebSockets

00:09.480 --> 00:12.360
to essentially stream messages from a server

00:12.360 --> 00:13.320
directly to a client,

00:13.320 --> 00:15.840
and a client will then send messages directly to the server.

00:15.840 --> 00:18.240
This was released on October 1st, 2024.

00:18.240 --> 00:19.410
If you actually go and have a look

00:19.410 --> 00:21.360
at the Realtime API documentation,

00:21.360 --> 00:23.700
you'll see this is in beta, so it's still quite early.

00:23.700 --> 00:26.130
It supports things like native speech-to-speech.

00:26.130 --> 00:28.830
It can also do simultaneous multimodal output

00:28.830 --> 00:30.780
and natural steerable voices.

00:30.780 --> 00:33.570
We're gonna have a look at the Realtime console

00:33.570 --> 00:34.680
in this lesson.

00:34.680 --> 00:37.020
And the reason why is there's a demo application

00:37.020 --> 00:38.460
that you can use to have a go

00:38.460 --> 00:41.760
and see how the event stream happens in real time.

00:41.760 --> 00:42.780
I want you to navigate

00:42.780 --> 00:47.070
to openai/openai-realtime-console.

00:47.070 --> 00:49.590
And then what you're going to need to do is click on Code,

00:49.590 --> 00:52.110
click Copy to Clipboard, load up a terminal,

00:52.110 --> 00:55.230
and then we're gonna git clone this repo.

00:55.230 --> 00:57.060
After you've cloned that repository,

00:57.060 --> 00:59.430
now you're gonna need to create an environment variable file

00:59.430 --> 01:02.310
in the root directory, named .env.

01:02.310 --> 01:05.370
And you're gonna need to put both the OPENAI_API_KEY

01:05.370 --> 01:07.350
and also we'll need to add in

01:07.350 --> 01:10.230
the REACT_APP_LOCAL_RELAY_SERVER_URL,

01:10.230 --> 01:12.240
so I'm just gonna copy that as well.

01:12.240 --> 01:15.603
And then make sure to put your API key for OpenAI here.

01:16.950 --> 01:18.210
Okay, so the next thing we wanna do

01:18.210 --> 01:21.180
is just install all of the packages with npm i

01:21.180 --> 01:23.590
and then we wanna run npm run start

01:24.540 --> 01:26.700
and we also want another server,

01:26.700 --> 01:29.763
so go and load up a new tab and do npm run relay.

01:31.860 --> 01:34.230
And then what you should see is a Realtime console.

01:34.230 --> 01:36.360
Go to the bottom and click on Connect.

01:36.360 --> 01:38.610
And there's two different ways that you can do this.

01:38.610 --> 01:41.340
You can see there's a manual way of doing this

01:41.340 --> 01:42.720
and a VAD way.

01:42.720 --> 01:44.580
So the manual way is push-to-talk.

01:44.580 --> 01:47.877
So I can say, "Hey, how's it going? How's everything been?"

01:49.620 --> 01:52.320
And then you can see that the assistant is now replying

01:52.320 --> 01:54.840
like, you know, that everything's been going good.

01:54.840 --> 01:57.150
Or I can do VAD, which will listen in real time

01:57.150 --> 01:59.973
to the audio buffer and will then reply.

02:02.940 --> 02:04.770
So you can now see that the buffer

02:04.770 --> 02:06.210
is constantly appending,

02:06.210 --> 02:08.880
but then when you hit certain drops in volume,

02:08.880 --> 02:10.653
then the assistant will reply.
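
NOTE
Editor's sketch of the two turn-taking modes described above, assuming the
event names from the Realtime API beta (session.update,
input_audio_buffer.append, input_audio_buffer.commit, response.create).
The helper function names are hypothetical and are not taken from the
console's source. Nothing is sent over the network here; the functions
only build the JSON text a client would send over the WebSocket.
```python
import base64
import json
#
def session_update(use_vad: bool) -> str:
    # "server_vad" asks the server to decide when speech has ended;
    # None turns that off, which is the manual (push-to-talk) mode.
    turn_detection = {"type": "server_vad"} if use_vad else None
    return json.dumps({"type": "session.update",
                       "session": {"turn_detection": turn_detection}})
#
def append_audio(pcm_chunk: bytes) -> str:
    # Audio chunks travel as base64 text inside a JSON event.
    audio = base64.b64encode(pcm_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio})
#
def end_of_turn_events(use_vad: bool) -> list:
    # With VAD the server commits the buffer and replies on its own.
    if use_vad:
        return []
    # In manual mode the client commits the buffer and asks for a reply.
    return [json.dumps({"type": "input_audio_buffer.commit"}),
            json.dumps({"type": "response.create"})]
```
In both modes the client keeps sending append events while the microphone
is open; the modes differ only in who decides that the turn is over.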

02:16.080 --> 02:17.977
So the other thing we can do is say,

02:17.977 --> 02:20.160
"I want to get the weather in Chicago."

02:20.160 --> 02:22.110
And if you look on the right-hand side,

02:23.580 --> 02:25.740
the map has now changed to Chicago

02:25.740 --> 02:27.930
and we've got the actual weather output there,

02:27.930 --> 02:29.490
which is great, so you can see that's working,

02:29.490 --> 02:30.540
which is really good.

02:30.540 --> 02:33.180
So you can use tools with these models.
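
NOTE
Editor's sketch of how a weather tool like the one in the demo might be
registered and answered, assuming the Realtime API beta event shapes
(session.update with a tools list, conversation.item.create with a
function_call_output item). The tool name, description, and parameter
schema below are hypothetical and may differ from what the console uses.
```python
import json
#
def register_weather_tool() -> str:
    # Hypothetical tool definition; the model sees the name, description
    # and JSON Schema parameters and may decide to call it.
    tool = {
        "type": "function",
        "name": "get_weather",
        "description": "Look up the current weather for a place.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string",
                             "description": "City name, e.g. Chicago"},
            },
            "required": ["location"],
        },
    }
    return json.dumps({"type": "session.update",
                       "session": {"tools": [tool], "tool_choice": "auto"}})
#
def tool_result_event(call_id: str, result: dict) -> str:
    # The client runs the tool itself and returns the result as a string,
    # tagged with the call_id the server supplied for that function call.
    return json.dumps({"type": "conversation.item.create",
                       "item": {"type": "function_call_output",
                                "call_id": call_id,
                                "output": json.dumps(result)}})
```
After sending the result, a client would typically send a response.create
event so the assistant can describe the weather out loud.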

02:33.180 --> 02:34.950
So the way that this is actually working

02:34.950 --> 02:38.880
is basically we're using WebSocket technology

02:38.880 --> 02:42.840
to stream events from the server that OpenAI is providing,

02:42.840 --> 02:44.460
which is a stateful backend service.

02:44.460 --> 02:46.830
So it stores all the information for us

02:46.830 --> 02:49.950
and it streams those as WebSocket events

02:49.950 --> 02:53.430
to the client side via a WebSocket connection

02:53.430 --> 02:54.263
inside of React.

02:54.263 --> 02:57.420
I think this is gonna open up a lot of individual use cases

02:57.420 --> 03:00.930
and we'll see the ability to even hook up Twilio

03:00.930 --> 03:04.170
and chat and phone numbers with the OpenAI Realtime API.

03:04.170 --> 03:05.310
So we're gonna be able to get

03:05.310 --> 03:07.350
those voice-to-voice interactions.

03:07.350 --> 03:08.670
You're gonna be able to stream

03:08.670 --> 03:12.840
multiple types of modalities, including speech, text,

03:12.840 --> 03:15.030
and also being able to use tools,

03:15.030 --> 03:16.950
giving the agents specific types of tools

03:16.950 --> 03:19.650
that it can use in a real-time setting.

03:19.650 --> 03:23.130
In terms of cost, this is slightly more expensive.

03:23.130 --> 03:26.040
As you can see, it's $5 for a million input tokens

03:26.040 --> 03:28.800
and $20 for a million output tokens.
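
NOTE
Editor's sketch of the arithmetic behind the two prices quoted above
($5 per million input tokens, $20 per million output tokens). The function
name and the token counts in the example are made up for illustration, and
audio tokens are left out because their rates are not stated in the video.
```python
INPUT_USD_PER_MILLION = 5.0    # quoted in the video
OUTPUT_USD_PER_MILLION = 20.0  # quoted in the video
#
def token_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Each price is per one million tokens, so scale the counts down.
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000
#
# Hypothetical exchange: 2,000 input tokens and 500 output tokens.
example_cost = token_cost_usd(2_000, 500)
```
For that hypothetical exchange the input and output sides each come to one
cent, so the total is about two cents.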

03:28.800 --> 03:29.633
And then you can see,

03:29.633 --> 03:32.130
we've got a certain number of audio tokens.

03:32.130 --> 03:34.590
So these are the ones where it's quite expensive.

03:34.590 --> 03:38.670
So even though these token costs are much larger,

03:38.670 --> 03:40.110
it is actually quite cheap

03:40.110 --> 03:42.700
to generate responses using the Realtime API.

03:42.700 --> 03:45.450
Okay, cool, hopefully you found this useful and exciting

03:45.450 --> 03:47.100
and I'll see you in the next one.