Welcome back! We're now going to move away from OpenAI's text models to take a look at the functionality of their audio models.

OpenAI's Whisper model has speech-to-text capabilities that can be used to create audio transcripts or to translate audio in another language into an English transcript. The model supports many of the most common audio file formats, such as MP3 and WAV, but it does limit the size of the audio file (25 MB per request at the time of writing).

Whisper has potential applications in automating business meeting transcripts and in accessibility features like caption generation.

In this video, we'll discuss speech-to-text transcription.

Let's use Whisper to transcribe a meeting recording stored in an MP3 audio file.

The first thing we need to do is to load the file into our Python environment.

There are lots of Python libraries out there for working with audio files, but we'll be using the Python open function here.

The open function takes two arguments: the first is the file to be opened, and the second indicates the mode with which the file should be opened. Different modes support reading and writing to virtually any file type.

The "rb" here stands for read binary. All this means is that we're opening a file that is stored in binary format, which is typical for non-text files like audio, video, and images.

If the audio file is in a different directory from the Python script or notebook we're working in, we also need to prefix the file name with its path.

The opened audio file is now a Python object that we can assign to a variable and pass to other functions.
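As a minimal sketch of this step, the helper below wraps the open call in read-binary mode. The name open_audio, its directory argument, and the file names in the usage note are illustrative assumptions, not part of the course code.

```python
from pathlib import Path


def open_audio(file_name, directory="."):
    """Open an audio file in read-binary mode and return the file object."""
    # Join the directory and file name so the file can live outside the
    # folder that holds our script or notebook.
    path = Path(directory) / file_name
    # "rb" = read binary, the usual mode for audio, video, and images.
    return open(path, "rb")
```

For example, audio_file = open_audio("meeting_recording.mp3", "recordings") would open a hypothetical recording stored in a recordings folder. Calling audio_file.close() once we're done releases the file.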

Requests to the Whisper model are sent to the Audio endpoint of the API.

To create a transcription request to this endpoint, we call the transcribe method on the Audio class. As arguments, we pass the audio model to use and the audio file to transcribe.
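Here is a rough sketch of what that request could look like. It assumes the pre-1.0 version of the openai Python package, which is the one that exposes an Audio class with a transcribe method, and an API key available in the OPENAI_API_KEY environment variable. The helper name transcribe_file and its audio_api parameter are hypothetical; "whisper-1" is the model name the API uses for Whisper.

```python
def transcribe_file(path, audio_api=None, model="whisper-1"):
    """Send the audio file at `path` to the Audio endpoint and return the response."""
    if audio_api is None:
        # Assumption: the pre-1.0 `openai` package is installed. It exposes
        # the Audio class and picks up the key from OPENAI_API_KEY.
        import openai

        audio_api = openai.Audio
    # Open the file in read-binary mode; the `with` block closes it afterwards.
    with open(path, "rb") as audio_file:
        # First argument: the audio model. Second: the opened audio file.
        return audio_api.transcribe(model, audio_file)
```

With a key configured, response = transcribe_file("meeting_recording.mp3") would send a hypothetical meeting recording for transcription. Newer versions of the package route the same request through a client object (client.audio.transcriptions.create), so the call looks different there.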

Let's print the response to see what's returned.

As with the other endpoints, we receive a JSON response, but fortunately, this one contains only a single key and value.

We can access the transcript text using the text key.
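To illustrate, the snippet below uses a plain dictionary as a stand-in for the response, since it has the same single-key shape. The transcript sentence is invented for the example.

```python
# Stand-in for the API response: a single "text" key holding the transcript.
# The sentence itself is made up for illustration.
response = {"text": "Good morning, everyone. Let's get started."}

# Subset the response with the "text" key to extract the transcript.
transcript = response["text"]
print(transcript)
```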

There we have it!

The model did a solid job of transcribing the audio, but note that its accuracy can vary with audio quality and with speakers' accents.

Also, avoid sending sensitive or confidential audio to the model.

Because Whisper was also trained on audio in non-English languages, it can transcribe audio in many other languages with good results.

The process for transcribing non-English audio is exactly the same: we open the audio file, make a transcription request to the Audio endpoint, and extract the text from the response.
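Those three steps can be sketched as a single helper. The function name transcribe_to_text is hypothetical, and the transcribe argument is assumed to be a callable such as openai.Audio.transcribe from the pre-1.0 openai package.

```python
def transcribe_to_text(path, transcribe, model="whisper-1"):
    """Open the file, request a transcript, and return just the text."""
    # Step 1: open the audio file in read-binary mode.
    with open(path, "rb") as audio_file:
        # Step 2: make the transcription request with the model and file.
        response = transcribe(model, audio_file)
    # Step 3: extract the transcript from the response's "text" key.
    return response["text"]
```

The same call works whatever language is spoken in the recording, for instance transcribe_to_text("interview_portuguese.mp3", openai.Audio.transcribe), where the file name is made up for illustration.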

Time to create your own audio transcripts!