WEBVTT

1
00:00:00.000 --> 00:00:05.280
In this video, we'll discuss Whisper's translation capabilities.

2
00:00:05.280 --> 00:00:07.160
Let's dive right in!

5
00:00:07.160 --> 00:00:11.080
The Whisper model can not only transcribe audio into

6
00:00:11.080 --> 00:00:17.760
the language it's in, but also translate and transcribe audio in one go.

9
00:00:17.800 --> 00:00:21.360
This is currently limited to an English transcript, so we can

10
00:00:21.360 --> 00:00:26.360
translate and transcribe German into English, but not German into French.

13
00:00:27.480 --> 00:00:31.240
Like with its transcription functionality, Whisper can translate

14
00:00:31.240 --> 00:00:36.640
audio from most common audio file types up to a particular size limit.

17
00:00:36.640 --> 00:00:46.240
The process for translating and transcribing audio is almost identical to normal transcription, with just one change.

20
00:00:46.240 --> 00:00:53.880
We open the non-English audio to translate and transcribe, which in this case is an m4a file.

23
00:00:53.880 --> 00:00:57.720
This is where the change comes in - instead of using the transcribe

24
00:00:57.720 --> 00:01:05.080
method, we use the translate method, again making the request to the Audio endpoint.

27
00:01:05.080 --> 00:01:09.880
The translated transcription can be extracted from the text key of the response.
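
NOTE
A minimal sketch of the request described in the cues above. It assumes the
current openai Python SDK (openai>=1.0), where the call is
client.audio.translations.create; the narration appears to describe the older
translate method on the Audio class. The helper name translate_audio and the
file name audio.m4a are hypothetical, chosen only for illustration.
```python
# Sketch only: `client` is assumed to be an openai.OpenAI() instance (openai>=1.0).
# "audio.m4a" is a hypothetical file name, not one taken from the video.
def translate_audio(client, path, model="whisper-1"):
    """Translate and transcribe non-English audio into English text."""
    # Open the audio in binary mode so the raw file bytes are uploaded
    with open(path, "rb") as audio_file:
        response = client.audio.translations.create(model=model, file=audio_file)
    # The English transcript is exposed as the `text` field of the response
    return response.text
# Example usage (needs the openai package and an API key):
#   from openai import OpenAI
#   print(translate_audio(OpenAI(), "audio.m4a"))
```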

30
00:01:10.840 --> 00:01:16.800
Looking at the transcript, we can see that it wasn't perfect; it contains two spelling errors.

33
00:01:16.800 --> 00:01:21.080
The performance of Whisper can vary wildly depending on audio quality, the

34
00:01:21.080 --> 00:01:26.840
language the audio is recorded in, and the model's knowledge of the subject matter.

35
00:01:26.840 --> 00:01:31.440
Before building a full-fledged application on top of this model, we'll need to test

36
00:01:31.440 --> 00:01:35.320
that the model's performance is sufficiently good for the particular use case.

39
00:01:36.520 --> 00:01:39.280
Let's see how we can give Whisper a helping hand here.

42
00:01:40.280 --> 00:01:45.720
When we send our audio to the Whisper model, we can also provide an optional prompt.

45
00:01:45.720 --> 00:01:50.240
This prompt can be used to improve the quality of the response

46
00:01:50.240 --> 00:01:54.360
by providing an example of how we want the transcript to be styled.

49
00:01:54.360 --> 00:01:57.840
For example, if we want the transcript to retain

50
00:01:57.840 --> 00:02:05.680
filler words from the audio, like ummms and uhhhhs, we can provide the following prompt along with the audio.

53
00:02:05.680 --> 00:02:09.600
If we know broadly what the topic of the audio is, we can also

54
00:02:09.600 --> 00:02:15.200
provide this context as a prompt to help the model narrow down the correct words.

55
00:02:15.200 --> 00:02:18.720
Here's an example of a prompt to give the model more context.

58
00:02:19.720 --> 00:02:23.960
The transcribe method also supports prompts that can be used in very similar ways.

61
00:02:23.960 --> 00:02:30.600
Let's adapt our last request with a prompt to try to fix the spelling errors in the model's output.

64
00:02:30.600 --> 00:02:36.800
Let's assume that we know the audio discusses AI trends and ChatGPT; we

65
00:02:36.800 --> 00:02:39.040
can define a prompt with this context,

67
00:02:39.440 --> 00:02:41.200
and pass it to the prompt argument.
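
NOTE
A minimal sketch of passing context through the prompt argument, as narrated
above. It assumes the current openai Python SDK (openai>=1.0) and its
client.audio.translations.create call. The helper name translate_with_prompt,
the file name audio.m4a, and the prompt wording are all hypothetical; the
actual prompt shown in the video is not part of this transcript.
```python
# Sketch only: `client` is assumed to be an openai.OpenAI() instance (openai>=1.0).
# The prompt wording and file name below are illustrative, not taken from the video.
def translate_with_prompt(client, path, prompt, model="whisper-1"):
    """Translate audio into English, passing extra context via the prompt argument."""
    with open(path, "rb") as audio_file:
        response = client.audio.translations.create(
            model=model,
            file=audio_file,
            prompt=prompt,  # optional context or style example for the model
        )
    return response.text
# A context prompt for audio assumed to be about AI trends and ChatGPT
context_prompt = "The audio relates to a recent podcast about AI trends and ChatGPT."
# Example usage (needs the openai package and an API key):
#   from openai import OpenAI
#   print(translate_with_prompt(OpenAI(), "audio.m4a", context_prompt))
```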

70
00:02:41.200 --> 00:02:46.840
Printing the response shows that, with a little extra context, the model

71
00:02:46.840 --> 00:02:51.040
was able to accurately determine the correct spelling and style of the words.

74
00:02:51.040 --> 00:02:55.840
In this example, the original response was already pretty close to what we

75
00:02:55.840 --> 00:03:01.680
wanted, but for other cases, it's possible to see quite dramatic improvements using prompts.

78
00:03:01.680 --> 00:03:05.680
Time for you to try using AI to bridge language barriers.

79
00:03:05.680 --> 00:03:08.640
Good luck!