WEBVTT

1
00:00:00.000 --> 00:00:02.520
Welcome back!

2
00:00:02.520 --> 00:00:09.840
We're now going to move away from OpenAI's text models to take a look at the functionality of their audio models.

3
00:00:09.840 --> 00:00:09.840


4
00:00:09.840 --> 00:00:09.840


5
00:00:09.840 --> 00:00:14.600
OpenAI's Whisper model has speech-to-text capabilities that can be used to

6
00:00:14.600 --> 00:00:21.040
create audio transcripts or translate audio from one language into an English transcript.

7
00:00:21.040 --> 00:00:27.640
The model supports many of the most common audio file formats, but the API limits the size of the audio file, currently to 25 MB.

8
00:00:27.640 --> 00:00:28.680

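NOTE
The format and size limits can be checked locally before a file is uploaded. This is a minimal sketch, assuming the 25 MB cap and the extension list from OpenAI's published audio documentation (verify both against the current docs); check_audio_file, MAX_BYTES, and SUPPORTED are illustrative names, not part of any SDK.
```python
from pathlib import Path
# Assumed limits; treating "25 MB" as 25 * 1024 * 1024 bytes is itself an assumption.
MAX_BYTES = 25 * 1024 * 1024
SUPPORTED = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
def check_audio_file(path, size_bytes=None):
    """Return a list of problems that would likely cause the upload to be rejected."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    # Fall back to the size on disk when no size is supplied.
    size = p.stat().st_size if size_bytes is None else size_bytes
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")
    return problems
# A 4 MB MP3 passes both checks, so no problems are reported.
print(check_audio_file("meeting_recording.mp3", size_bytes=4_000_000))
```
An empty list means the file passes both checks; a recording over the limit would need to be compressed or split before it is sent.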

9
00:00:28.680 --> 00:00:28.680


10
00:00:28.680 --> 00:00:31.920
Whisper has potential applications in automating business

11
00:00:31.920 --> 00:00:37.080
meeting transcripts and in accessibility features like caption generation.

12
00:00:37.080 --> 00:00:37.080


13
00:00:37.080 --> 00:00:37.120


14
00:00:37.120 --> 00:00:41.160
In this video, we'll discuss speech-to-text transcription.

15
00:00:41.160 --> 00:00:41.160


16
00:00:41.160 --> 00:00:41.200


17
00:00:41.200 --> 00:00:47.440
Let's use Whisper to transcribe a meeting recording stored in an MP3 audio file.

18
00:00:47.440 --> 00:00:47.440


19
00:00:47.440 --> 00:00:47.440


20
00:00:47.440 --> 00:00:51.320
The first thing we need to do is to load the file into our Python environment.

21
00:00:51.320 --> 00:00:51.320


22
00:00:51.320 --> 00:00:52.160


23
00:00:52.160 --> 00:00:54.440
There are lots of Python libraries out there for working

24
00:00:54.440 --> 00:00:58.520
with audio files, but we'll be using Python's built-in open function here.

25
00:00:58.520 --> 00:00:58.520


26
00:00:58.520 --> 00:00:59.480


27
00:00:59.480 --> 00:01:03.760
Here, we pass the open function two arguments: the first is the file to be

28
00:01:03.760 --> 00:01:09.240
opened, and the second indicates the mode with which the file should be opened.

29
00:01:09.240 --> 00:01:13.600
Different modes support reading and writing to virtually any file type.

30
00:01:13.600 --> 00:01:13.600


31
00:01:13.600 --> 00:01:14.560


32
00:01:14.560 --> 00:01:20.000
The "rb" here stands for read binary: the file's contents are read as raw bytes rather than

33
00:01:20.000 --> 00:01:27.600
decoded as text, which is what we want for non-text files like audio, video, and images.

34
00:01:27.600 --> 00:01:27.600

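NOTE
To make the "rb" mode concrete, here is a small sketch that writes a stand-in file and reads it back in binary mode. The file name and the bytes written are placeholders chosen for illustration; they are not real MP3 data and do not come from the course.
```python
import os
import tempfile
# Create a small stand-in file so the example runs without a real recording.
tmp_dir = tempfile.mkdtemp()
audio_path = os.path.join(tmp_dir, "meeting_recording.mp3")
with open(audio_path, "wb") as f:
    f.write(b"ID3\x03\x00placeholder")
# "rb" = read + binary: read() returns raw bytes, with no text decoding.
with open(audio_path, "rb") as audio_file:
    header = audio_file.read(3)
print(type(header).__name__, header)
```
Using a with block closes the file automatically once the indented code finishes.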

35
00:01:27.600 --> 00:01:27.600


36
00:01:27.600 --> 00:01:31.480
If the audio file is found in a different directory to the Python script or

37
00:01:31.480 --> 00:01:35.760
notebook we're working in, we also need to prepend the file name with its path.

38
00:01:35.760 --> 00:01:35.760

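NOTE
One way to prepend a path is with pathlib from the standard library. This is a sketch only: "recordings" is a hypothetical folder name and "meeting_recording.mp3" a hypothetical file name.
```python
from pathlib import Path
# "recordings" is a hypothetical folder, relative to where the script runs.
audio_dir = Path("recordings")
audio_path = audio_dir / "meeting_recording.mp3"
# open(audio_path, "rb") would then open the file at this location.
print(audio_path.as_posix())
```
A plain string such as "recordings/meeting_recording.mp3" passed to open works just as well; pathlib mainly helps keep paths portable across operating systems.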

39
00:01:35.760 --> 00:01:35.760


40
00:01:35.760 --> 00:01:40.720
The opened file object can now be used like any other Python variable.

41
00:01:40.720 --> 00:01:41.840


42
00:01:41.840 --> 00:01:41.840


43
00:01:41.840 --> 00:01:46.000
Requests to the Whisper model are sent to the Audio endpoint of the API.

44
00:01:46.000 --> 00:01:46.000


45
00:01:46.000 --> 00:01:46.000


46
00:01:46.000 --> 00:01:54.120
To create a transcription request to this endpoint, we call the transcribe method on the Audio class.

47
00:01:54.120 --> 00:02:00.600
Inside, we specify the audio model to use and the audio file to transcribe.

48
00:02:00.600 --> 00:02:00.600

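NOTE
The narration describes the transcribe method on the Audio class, which belongs to the pre-1.0 openai Python package. The sketch below assumes the 1.x client instead, where the equivalent call is client.audio.transcriptions.create. The helper transcribe_file, the stand-in client, and the file name are all hypothetical; the stand-in only mimics the shape of the real client so the example runs offline.
```python
import os
import tempfile
from types import SimpleNamespace
def transcribe_file(client, audio_path, model="whisper-1"):
    """Send one transcription request and return the transcript text."""
    # Open in binary mode and keep the file open for the duration of the request.
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text
# Offline stand-in for the SDK client (hypothetical, for a dry run only).
def fake_create(model, file):
    return SimpleNamespace(text=f"[{model}] received {len(file.read())} bytes")
fake_client = SimpleNamespace(
    audio=SimpleNamespace(transcriptions=SimpleNamespace(create=fake_create))
)
path = os.path.join(tempfile.mkdtemp(), "meeting_recording.mp3")
with open(path, "wb") as f:
    f.write(b"placeholder")
print(transcribe_file(fake_client, path))
```
With the real package installed and an API key configured, the stand-in would be swapped for OpenAI() imported from openai, and the call would then go over the network.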

49
00:02:00.600 --> 00:02:00.600


50
00:02:00.600 --> 00:02:02.840
Let's print the response to see what's returned.

51
00:02:02.840 --> 00:02:02.840


52
00:02:02.840 --> 00:02:02.840


53
00:02:02.840 --> 00:02:07.840
As with the other endpoints, we receive a JSON response, but

54
00:02:07.840 --> 00:02:11.920
fortunately, this response only contains a single key and value.

55
00:02:11.920 --> 00:02:11.920


56
00:02:11.920 --> 00:02:12.960


57
00:02:12.960 --> 00:02:16.040
We can access the transcript text using the text key.

58
00:02:16.040 --> 00:02:16.040

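NOTE
Extracting the transcript can be sketched with a hand-written response body. The JSON below is hypothetical and its wording is invented for illustration; only the single "text" key reflects what the narration describes. In the 1.x Python client the same value is typically read as an attribute, response.text, rather than with a key.
```python
import json
# Hypothetical response body standing in for what the endpoint returns.
raw = '{"text": "Thanks everyone for joining the meeting."}'
response = json.loads(raw)
# The payload holds a single key, "text", whose value is the transcript.
transcript = response["text"]
print(transcript)
```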

59
00:02:16.040 --> 00:02:16.040


60
00:02:16.040 --> 00:02:18.400
There we have it!

61
00:02:18.400 --> 00:02:18.400


62
00:02:18.400 --> 00:02:18.400


63
00:02:18.400 --> 00:02:22.400
The model did a solid job of transcribing the audio, but note that its

64
00:02:22.400 --> 00:02:26.920
performance may fluctuate with changes in audio quality or different accents.

65
00:02:26.920 --> 00:02:26.920


66
00:02:26.920 --> 00:02:26.920


67
00:02:26.920 --> 00:02:31.640
Sensitive or confidential audio should also not be sent to the model.

68
00:02:31.640 --> 00:02:31.640


69
00:02:31.640 --> 00:02:31.640


70
00:02:31.640 --> 00:02:36.200
Because the Whisper model was also trained on audio in languages other

71
00:02:36.200 --> 00:02:41.680
than English, it can transcribe audio in many other languages with good results.

72
00:02:41.680 --> 00:02:41.680


73
00:02:41.680 --> 00:02:41.680


74
00:02:41.680 --> 00:02:48.200
The process for transcribing non-English audio is exactly the same: we open the audio

75
00:02:48.200 --> 00:02:54.400
file, make a transcription request to the Audio endpoint, and extract the text from the response.

76
00:02:54.400 --> 00:02:54.400


77
00:02:54.400 --> 00:02:54.400


78
00:02:54.400 --> 00:02:59.760
Time to create your own audio transcripts!