WEBVTT

1
00:00:00.000 --> 00:00:06.659
Hi, and welcome to this AI and C-Sharp video on the OpenAI Realtime API.

2
00:00:06.699 --> 00:00:11.159
So, the Realtime API is the way we can do speech-to-speech,

3
00:00:11.199 --> 00:00:14.260
meaning we can ask the question in audio

4
00:00:14.300 --> 00:00:17.659
and actually also get audio responses back.

5
00:00:17.700 --> 00:00:19.659
So, let's go into a demo.

6
00:00:22.159 --> 00:00:25.360
So, here I have the code running,

7
00:00:25.399 --> 00:00:28.860
and I have the output over here.

8
00:00:30.000 --> 00:00:32.959
So, let's first see a demo, and then we'll go to the code.

9
00:00:36.799 --> 00:00:38.299
Hi there.

10
00:00:40.799 --> 00:00:45.259
Hello. How's it going? What can I help you with today?

11
00:00:45.299 --> 00:00:49.259
Fine. Can you tell me what the capital of France is?

12
00:00:52.500 --> 00:00:55.459
Of course. The capital of France is Paris.

13
00:00:55.500 --> 00:00:58.959
It's known for its art, culture, and landmarks like the Eiffel Tower.

14
00:01:00.500 --> 00:01:03.459
Cool. Can you tell me how many people live in that city?

15
00:01:05.300 --> 00:01:09.559
Sure. Paris has a population of around 2 million people in the city itself,

16
00:01:09.599 --> 00:01:13.260
and if you include the greater metropolitan area, it's about 12 million.

17
00:01:14.199 --> 00:01:15.760
Okay. Thank you.

18
00:01:17.800 --> 00:01:21.260
You're welcome. If you have any more questions, feel free to ask.

19
00:01:22.800 --> 00:01:24.860
So, that's the demo.

20
00:01:24.900 --> 00:01:28.360
We talked about Paris, but we can talk about anything.

21
00:01:28.360 --> 00:01:31.019
So, let's have a look at what the code actually does

22
00:01:31.059 --> 00:01:33.220
and why it works the way it works.

23
00:01:34.660 --> 00:01:37.220
So, we first get our normal secrets,

24
00:01:37.260 --> 00:01:42.019
and then we make either our OpenAI client or our Azure OpenAI client,

25
00:01:42.059 --> 00:01:45.019
but we're not actually making an Azure OpenAI client,

26
00:01:45.059 --> 00:01:48.419
because if we do that, it will actually come back with an exception

27
00:01:48.459 --> 00:01:51.919
saying the Realtime API is not supported,

28
00:01:51.959 --> 00:01:56.319
and we should use the OpenAI client instead.

29
00:01:56.319 --> 00:02:01.080
I think this is a first step in switching to everything

30
00:02:01.120 --> 00:02:05.680
being in the OpenAI client and then for Azure setting an endpoint,

31
00:02:05.720 --> 00:02:08.179
because this is also the first time we really used

32
00:02:08.220 --> 00:02:13.380
the new /openai/v1 endpoint with OpenAI.

33
00:02:15.520 --> 00:02:21.580
So, the way we do it here is then we say OpenAI client

34
00:02:21.619 --> 00:02:23.979
and then set the endpoint,

35
00:02:23.979 --> 00:02:26.380
as we have seen a few times before as well.

36
00:02:27.880 --> 00:02:31.179
And once we have the client, we can get the Realtime client,

37
00:02:31.220 --> 00:02:35.779
because the API is not the chat client or the responses API,

38
00:02:35.820 --> 00:02:38.880
it's what is called Realtime client instead.

39
00:02:41.320 --> 00:02:44.880
We also need models, and this is a model we haven't used before,

40
00:02:44.919 --> 00:02:48.779
so if you're on Azure, you might need to deploy this model.

41
00:02:49.979 --> 00:02:52.380
There's GPT Realtime,

42
00:02:52.380 --> 00:02:56.279
in a mini and a non-mini version,

43
00:02:56.320 --> 00:02:59.479
and there's a 1.5 version as well.

44
00:03:02.279 --> 00:03:07.080
These are a bit expensive, like $10 per 1 million input tokens

45
00:03:07.119 --> 00:03:11.080
and $20 per 1 million for output,

46
00:03:11.119 --> 00:03:14.779
and if you go to the non-mini, it's $32 and $64.

47
00:03:15.479 --> 00:03:19.679
In real life, testing it out, I've sat here for a couple of hours

48
00:03:19.779 --> 00:03:22.479
preparing this demo,

49
00:03:22.520 --> 00:03:25.979
and that has cost me roughly half a US dollar.

50
00:03:26.020 --> 00:03:29.380
So, not too, too expensive, but still.
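
NOTE
The pricing mentioned above can be turned into a rough estimate. This is a minimal C# sketch, assuming the quoted prices are per 1 million tokens. The class name RealtimeCost and the token counts in the usage line are hypothetical and were not measured from the demo.
```csharp
using System;
public static class RealtimeCost
{
    // Prices are in USD per 1 million tokens, as quoted in the video.
    public static decimal Estimate(long inputTokens, long outputTokens, decimal inputPerMillion, decimal outputPerMillion) =>
        inputTokens / 1_000_000m * inputPerMillion + outputTokens / 1_000_000m * outputPerMillion;
}
```
For example, RealtimeCost.Estimate(20_000, 15_000, 10m, 20m) evaluates to 0.50, so roughly half a dollar on the mini prices for those assumed token counts.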

51
00:03:31.580 --> 00:03:36.080
You can also give one more model for transcriptions,

52
00:03:36.119 --> 00:03:38.580
meaning when we talk,

53
00:03:38.619 --> 00:03:42.080
that we can actually see what we said in text,

54
00:03:42.119 --> 00:03:44.679
as we saw in our output.

55
00:03:44.720 --> 00:03:48.580
This is optional, so if you want to save that cost

56
00:03:48.580 --> 00:03:52.279
and don't need it anyway, you can get rid of it.

57
00:03:52.320 --> 00:03:55.380
I will show a little later how we can do that.

58
00:03:57.179 --> 00:04:05.380
Then, when we ran it, it said we could press Ctrl-C to cancel.

59
00:04:05.419 --> 00:04:09.880
That is just a Console.CancelKeyPress here.
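
NOTE
A common way to wire Ctrl-C to cancellation in a console app can be sketched as follows. This is not the code from the video: the names CancelSetup and CreateCtrlCTokenSource are hypothetical, and only the standard .NET Console.CancelKeyPress event and CancellationTokenSource are used.
```csharp
using System;
using System.Threading;
public static class CancelSetup
{
    // Returns a token source that is cancelled when the user presses Ctrl-C.
    public static CancellationTokenSource CreateCtrlCTokenSource()
    {
        var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (sender, e) =>
        {
            e.Cancel = true; // keep the process alive so cleanup code can run
            cts.Cancel();    // signal every loop that observes cts.Token
        };
        return cts;
    }
}
```
The token from the returned source would then be passed to the long-running loops so they can stop cleanly.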

60
00:04:11.979 --> 00:04:14.979
And then we take our Realtime Client,

61
00:04:15.020 --> 00:04:17.980
and then we start a conversation,

62
00:04:17.980 --> 00:04:20.880
and that conversation happens with the model.

63
00:04:21.679 --> 00:04:23.679
And then we configure the session,

64
00:04:23.720 --> 00:04:26.679
and there's a bunch of settings we can set on this.

65
00:04:26.720 --> 00:04:32.720
I've taken the most common ones here that I thought we needed.

66
00:04:32.760 --> 00:04:36.079
So, we can give instructions, just like a system message,

67
00:04:36.119 --> 00:04:40.679
saying it should talk in a special way, and so on.

68
00:04:40.720 --> 00:04:43.880
And I'm just giving it that it's a voice assistant,

69
00:04:43.920 --> 00:04:46.380
and it needs to be clear and brief.

70
00:04:48.980 --> 00:04:53.480
Then, we can set some audio options, meaning input audio.

71
00:04:53.519 --> 00:04:59.279
So, for the input, if we want a transcription, we set this.

72
00:04:59.320 --> 00:05:01.679
If we don't, we can just take it away.

73
00:05:02.679 --> 00:05:06.380
You can control some noise reduction, if need be.

74
00:05:06.420 --> 00:05:11.380
And then there's turn detection, which is the main part we do here.

75
00:05:12.279 --> 00:05:18.279
So, it goes in, and you can set various options here,

76
00:05:18.320 --> 00:05:21.279
like whenever I start speaking,

77
00:05:21.320 --> 00:05:26.279
how many milliseconds does it need to go back in the recording

78
00:05:26.320 --> 00:05:29.119
in order to pick up the first words I say,

79
00:05:29.160 --> 00:05:35.160
how long should it wait for a pause until it actually begins answering,

80
00:05:35.679 --> 00:05:38.079
and how long should it wait

81
00:05:38.179 --> 00:05:41.079
before it tries to do a follow-up for us,

82
00:05:41.119 --> 00:05:43.380
meaning if we are just silent, it says,

83
00:05:43.420 --> 00:05:47.019
hey, I can't hear you, is there something we should talk about?

84
00:05:48.519 --> 00:05:52.579
And finally, we have whether we want to be able to interrupt,

85
00:05:52.619 --> 00:05:56.220
meaning in the middle of the AI speaking,

86
00:05:56.260 --> 00:05:58.420
if we want to say something else,

87
00:05:58.459 --> 00:06:02.019
if that should happen or not, or it should finish its sentence.

88
00:06:02.059 --> 00:06:03.220
It doesn't happen instantly,

89
00:06:03.220 --> 00:06:11.119
but at a certain point in a longer answer it gives back,

90
00:06:11.160 --> 00:06:18.160
it can stop itself and move on to the follow-up question.

91
00:06:19.359 --> 00:06:23.359
Then we have output, where we can control how fast it speaks

92
00:06:23.399 --> 00:06:29.399
and which of the different voices we want to have, male or female.

93
00:06:30.339 --> 00:06:34.339
I've used this Marin, it sounds nice.

94
00:06:35.339 --> 00:06:40.339
And for output, we just tell it everything is audio.

95
00:06:40.380 --> 00:06:43.540
Technically, the Realtime API can also work with images and text,

96
00:06:43.579 --> 00:06:45.579
but that's beyond this video.

97
00:06:47.839 --> 00:06:50.839
Then we do all the NAudio stuff.

98
00:06:50.880 --> 00:06:57.880
NAudio is a NuGet package to work with audio on Windows.

99
00:06:59.839 --> 00:07:05.839
And up here, we will put in an audio player.

100
00:07:05.880 --> 00:07:09.000
So essentially, I don't want to go into this,

101
00:07:09.040 --> 00:07:14.040
but it's essentially just that we can talk with the...

102
00:07:14.079 --> 00:07:18.940
or we can play back streaming audio feedback

103
00:07:18.980 --> 00:07:22.980
that we get from the API.

104
00:07:23.940 --> 00:07:26.799
And in the same manner, we have a microphone streamer

105
00:07:26.880 --> 00:07:30.380
where we can start a recording,

106
00:07:30.399 --> 00:07:34.899
and then we can restart that recording every time we have something new,

107
00:07:34.940 --> 00:07:38.040
but otherwise the microphone is open all the time,

108
00:07:38.079 --> 00:07:40.200
waiting for us to say something.

109
00:07:40.239 --> 00:07:42.239
Again, this is...

110
00:07:43.600 --> 00:07:47.600
This is vibe-coded using Codex,

111
00:07:47.640 --> 00:07:50.640
and it's just C# more than anything.

112
00:07:52.339 --> 00:07:54.940
And then we do something a bit special,

113
00:07:54.940 --> 00:07:58.339
because this is an open-ended system.

114
00:07:58.380 --> 00:08:04.079
So instead of doing a simple async and await,

115
00:08:04.119 --> 00:08:06.540
we just have these things as tasks.

116
00:08:06.579 --> 00:08:13.579
So we are keeping a continuous task of uploading microphone audio,

117
00:08:13.619 --> 00:08:17.619
and we have a continuous task of receiving updates back.

118
00:08:18.779 --> 00:08:21.079
Upload is fairly simple.

119
00:08:21.119 --> 00:08:23.179
It's just taking the microphone,

120
00:08:23.179 --> 00:08:27.679
reading everything that has been said so far in chunks,

121
00:08:27.720 --> 00:08:29.920
and putting it into the session.

122
00:08:29.959 --> 00:08:33.960
The session being the one from the Realtime client that we created.

123
00:08:35.320 --> 00:08:38.219
And then we receive the updates,

124
00:08:38.260 --> 00:08:42.460
but before we look at that, let's just see that we just start our microphone,

125
00:08:42.479 --> 00:08:46.479
and then we begin to await receiving updates.

126
00:08:47.380 --> 00:08:50.780
And at some point, we will press Ctrl-C or stop the program,

127
00:08:50.820 --> 00:08:54.520
and we will get a cancellation, where we just cancel everything.

128
00:08:54.559 --> 00:08:58.559
We stop the microphone, we clear the audio player, and then we're done.
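
NOTE
The continuous upload task described above can be sketched roughly like this. It is an illustration under stated assumptions, not the code from the video: two System.Threading.Channels channels stand in for the microphone streamer and the Realtime session, and the names TwoLoops and UploadAsync are hypothetical.
```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
public static class TwoLoops
{
    // Forwards every chunk from the microphone stand-in to the session stand-in
    // until the source completes or the token is cancelled. Returns the chunk count.
    public static async Task<int> UploadAsync(ChannelReader<byte[]> microphone, ChannelWriter<byte[]> session, CancellationToken ct)
    {
        var sent = 0;
        try
        {
            await foreach (var chunk in microphone.ReadAllAsync(ct))
            {
                await session.WriteAsync(chunk, ct);
                sent++;
            }
        }
        catch (OperationCanceledException)
        {
            // Ctrl-C was pressed: fall through to cleanup.
        }
        finally
        {
            session.TryComplete(); // tell the receiving side that no more audio is coming
        }
        return sent;
    }
}
```
In a real program this task would be started without awaiting it immediately, while the main flow awaits the receive loop, and both would share one cancellation token.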

129
00:09:00.219 --> 00:09:04.219
But the most important thing happens down in receive updates.

130
00:09:04.260 --> 00:09:07.260
So every time, we just have a receive update,

131
00:09:07.280 --> 00:09:11.280
and it can do an await foreach over the updates.

132
00:09:11.320 --> 00:09:15.320
And every one of these updates is something different.

133
00:09:15.359 --> 00:09:20.559
The first one we normally get is a session created,

134
00:09:20.599 --> 00:09:23.159
where we can just then write some text out.

135
00:09:23.200 --> 00:09:26.599
We say the session is created, speak naturally,

136
00:09:26.619 --> 00:09:30.619
and wait for the AI to answer back, and press Ctrl-C.

137
00:09:32.119 --> 00:09:34.859
Then there's a buffer speech started,

138
00:09:34.900 --> 00:09:38.700
meaning whenever we begin to speak,

139
00:09:38.719 --> 00:09:42.059
it tells us, oh, I'm in listening mode right now.

140
00:09:42.099 --> 00:09:45.200
So we just write "listening" back in our console here.

141
00:09:45.799 --> 00:09:50.400
And whenever we stop speaking, we get the speech stopped,

142
00:09:50.440 --> 00:09:53.940
and then it begins to think, because it might take a little while

143
00:09:53.979 --> 00:09:59.479
for our question to be answered based on how difficult the answer is.

144
00:10:01.940 --> 00:10:05.440
Then, if we have the transcription in place,

145
00:10:05.479 --> 00:10:08.539
we get a transcription completed object,

146
00:10:08.580 --> 00:10:11.580
where we can just show what we said.

147
00:10:12.359 --> 00:10:19.559
And we can also get the transcript from the other side

148
00:10:19.580 --> 00:10:24.380
on what the AI said.

149
00:10:24.419 --> 00:10:27.320
We can get it back here in just a done,

150
00:10:27.359 --> 00:10:32.659
but there's also a delta version where we can get it back streaming if need be.

151
00:10:32.679 --> 00:10:35.679
I tend not to do that.

152
00:10:36.679 --> 00:10:43.179
But what we do need is we need to get a delta of all the audio that's being said.

153
00:10:43.219 --> 00:10:49.580
So every time a small piece of what is being said is put in,

154
00:10:49.619 --> 00:10:52.080
we enqueue that into the audio player,

155
00:10:52.119 --> 00:10:58.119
so it will know when it can begin to give data back.

156
00:10:59.780 --> 00:11:05.280
If we ever clear the buffer, we can just clear the audio player.

157
00:11:05.979 --> 00:11:10.679
And when our response is done, we flush the pending audio,

158
00:11:10.719 --> 00:11:16.719
so we just have it play out the rest of the data.

159
00:11:17.520 --> 00:11:21.020
And if there are any errors or anything,

160
00:11:21.059 --> 00:11:24.159
we can get, for example, error messages back

161
00:11:24.179 --> 00:11:27.179
if there's something wrong with the response.

162
00:11:27.820 --> 00:11:30.320
And finally, we handle any other errors.
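
NOTE
The receive loop walked through above is essentially an await foreach with a dispatch on the update type. A minimal sketch of that shape follows. All the update records here are hypothetical stand-ins invented for illustration, since the real SDK has its own update classes, and a plain Queue stands in for the audio player.
```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
// Hypothetical update types for illustration only.
public abstract record Update;
public sealed record SessionCreated : Update;
public sealed record SpeechStarted : Update;
public sealed record SpeechStopped : Update;
public sealed record InputTranscribed(string Text) : Update;
public sealed record AudioDelta(byte[] Bytes) : Update;
public sealed record ResponseDone : Update;
public static class Receiver
{
    // Returns the console line for an update, or null when nothing is printed.
    public static string? Describe(Update update) => update switch
    {
        SessionCreated => "Session created. Speak naturally; press Ctrl-C to quit.",
        SpeechStarted => "[listening]",
        SpeechStopped => "[thinking]",
        InputTranscribed t => $"You: {t.Text}",
        _ => null,
    };
    // Consumes updates until the stream ends; audio chunks are queued for playback.
    public static async Task<int> ReceiveAsync(IAsyncEnumerable<Update> updates, Queue<byte[]> audioPlayer)
    {
        var printed = 0;
        await foreach (var update in updates)
        {
            if (update is AudioDelta delta) audioPlayer.Enqueue(delta.Bytes);
            if (Describe(update) is { } line) { Console.WriteLine(line); printed++; }
        }
        return printed;
    }
}
```
Keeping the text mapping in a separate pure method such as Describe makes the loop easier to check without a live microphone, which helps given how awkward this kind of code is to debug.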

163
00:11:31.320 --> 00:11:35.320
So we can try to set some breakpoints here and see it in real life.

164
00:11:36.659 --> 00:11:40.659
This will feel a bit odd because it just begins to...

165
00:11:43.659 --> 00:11:47.659
to say words and pause it again and so on, but let's have a look.

166
00:11:47.659 --> 00:11:51.659
So whenever we, say, start session up here,

167
00:12:04.500 --> 00:12:10.500
start conversation, we get the session created.

168
00:12:12.039 --> 00:12:15.039
There's also a session updated, which technically happens,

169
00:12:15.039 --> 00:12:17.780
but I haven't had any need for it,

170
00:12:17.820 --> 00:12:23.820
but there's a real-time server update session configuration updated.

171
00:12:26.940 --> 00:12:30.940
But we get this, and we are ready to speak.

172
00:12:32.340 --> 00:12:38.340
And now I said something, so it reports that it is listening to that.

173
00:12:39.340 --> 00:12:43.340
And now it hears that we don't speak anymore,

174
00:12:43.380 --> 00:12:45.380
so it's beginning to think about it.

175
00:12:45.419 --> 00:12:49.419
It will become a very odd conversation here.

176
00:12:49.440 --> 00:12:53.440
And now I get the transcript back,

177
00:12:55.479 --> 00:12:57.479
which, yeah, okay.

178
00:12:57.520 --> 00:13:01.520
It couldn't understand what I was saying while we did this,

179
00:13:01.539 --> 00:13:05.539
but now we are getting the audio back in small chunks,

180
00:13:05.880 --> 00:13:07.880
so this one will be hit multiple times.

181
00:13:07.919 --> 00:13:11.919
I'm pressing F5 right now, multiple times.

182
00:13:11.940 --> 00:13:15.940
I'll just get rid of it here.

183
00:13:15.979 --> 00:13:19.979
And now the transcript is done of what it wants to say back.

184
00:13:20.020 --> 00:13:22.020
So, sure, I'm here to help.

185
00:13:22.039 --> 00:13:26.039
That normally comes just before the audio comes back.

186
00:13:29.679 --> 00:13:31.679
So it writes that back, and it now says,

188
00:13:35.760 --> 00:13:39.260
we are done, so now it begins to speak.

189
00:13:39.280 --> 00:13:41.280
So if I press F5 here.

190
00:13:50.320 --> 00:13:52.820
Could you tell me a bit more about what you'd like to talk about

191
00:13:52.859 --> 00:13:54.859
or what kind of support you're looking for?

192
00:13:54.880 --> 00:13:57.380
Of course. I'm here to help.

193
00:13:57.419 --> 00:13:59.419
Take your time and share what's on your mind,

194
00:13:59.460 --> 00:14:03.719
and we can go from there. Thank you.

195
00:14:04.760 --> 00:14:07.260
You're very welcome. Just let me know whenever you're ready,

196
00:14:07.280 --> 00:14:09.280
and we can dive into it together.

197
00:14:10.619 --> 00:14:14.619
So as you can see, it's quite difficult to debug, of course,

198
00:14:15.760 --> 00:14:19.760
hence putting out various messages along the way.

199
00:14:20.960 --> 00:14:23.659
But that's just the nature of trying to speak

200
00:14:23.679 --> 00:14:26.679
while also trying to speak in this video, of course.

201
00:14:28.020 --> 00:14:31.020
So that is actually what's there.

202
00:14:31.679 --> 00:14:34.679
Again, the hard part, for me at least,

203
00:14:34.719 --> 00:14:39.419
was to get someone to make this start and stop recording

204
00:14:39.460 --> 00:14:43.460
of audio streams and playing back the audio streams.

205
00:14:44.280 --> 00:14:47.280
Beyond that, it is fairly simple.

206
00:14:47.320 --> 00:14:50.619
There's tons of settings, and this can do much, much more

207
00:14:50.659 --> 00:14:53.659
than what we are covering in this video.

208
00:14:53.679 --> 00:14:58.679
But beyond that, it's fairly simple to understand

209
00:14:58.719 --> 00:15:01.520
once you go through it line by line.

210
00:15:01.719 --> 00:15:06.219
This is a bit funky, of course, that we are not really making

211
00:15:06.260 --> 00:15:11.260
normal code with async and await, but it makes sense

212
00:15:11.299 --> 00:15:15.299
when we are just having an open microphone waiting for data.

213
00:15:16.619 --> 00:15:20.619
So I think that's everything. See you in the next one.