Whisper is all you need for your audio transcription!

Step into the future of transcription with OpenAI's Whisper technology and explore how this advanced tool is reshaping the way we document conversations and preserve the nuances of quiet discussions in the digital era.

Apoorva

July 22, 2024 |

3 mins


In today’s digital world, audio recordings have become an integral part of our lives. From business meetings to personal conversations, we often record audio to keep track of our discussions. In some situations, though, people speak quietly, whether for confidentiality, privacy, or ASMR recordings. Low-volume speech like this is a challenge for many transcription services, which often fail to transcribe it accurately.

Recently, I faced a similar problem when one of datakulture’s clients wanted to explore audio transcription and analyze the transcripts later. As a data analytics consulting company, here is how we solved it with OpenAI’s Whisper.

OpenAI’s Whisper is a general-purpose automatic speech recognition (ASR) model that converts speech into text. Despite its name, it is not built specifically for whispered speech: it is a deep neural network trained on roughly 680,000 hours of multilingual audio collected from the web, which makes it robust to accents, background noise, and varied recording conditions. That robustness can help individuals and businesses transcribe quiet or confidential conversations and keep a record of their discussions.

Breaking down the technology: How OpenAI’s Whisper audio transcription works

Whisper’s architecture is an encoder-decoder Transformer, a type of neural network that is widely used for sequence tasks such as speech recognition and translation.

The process begins by loading the audio recording and resampling it to 16 kHz. The audio is split into 30-second windows, and each window is converted into a log-Mel spectrogram, a representation of how the energy at different frequencies changes over time. This spectrogram is then processed by a deep neural network that has been trained on a large and varied dataset of speech. The network consists of many stacked layers, each of which transforms the output of the previous one; this depth is one big difference between deep learning and classical machine learning.
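To make the fixed-size input concrete, here is a minimal pure-Python sketch of fitting audio to 30-second windows of 16 kHz samples. The constants match the window size Whisper works with, but `split_into_windows` is a hypothetical helper written for illustration; Whisper's own `transcribe()` moves through long audio with a more careful sliding window.

```python
# Constants mirroring the input window Whisper works with.
SAMPLE_RATE = 16_000                     # samples per second
CHUNK_SECONDS = 30                       # length of one input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut long audio, zero-pad short audio."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


def split_into_windows(samples, length=N_SAMPLES):
    """Split a recording into consecutive fixed-size windows.

    Hypothetical helper for illustration only; the last window is zero-padded.
    """
    return [
        pad_or_trim(samples[start:start + length], length)
        for start in range(0, len(samples), length)
    ]
```

A 45-second recording would produce two windows here: one full window and one that is half audio and half silence.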

The first layers of the neural network extract low-level acoustic features from the audio representation, such as patterns in frequency, duration, and intensity. The subsequent layers perform increasingly complex computations to work out which words were spoken and transcribe them accurately.

One of the critical components of Whisper is the part of the network that acts as a language model. A language model is a machine learning model that estimates the probability of the next word (or token) given the words that came before it. In Whisper, this component is trained together with the rest of the network on the transcripts that accompany the training audio, and it predicts each word while also taking the audio into account, so every prediction is informed by both the sound and the sentence so far.
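The idea of next-word prediction can be illustrated with a toy example. The sketch below is a simple bigram counter written for this article, not Whisper's actual language model, which is a neural network; the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def next_word_probabilities(follow_counts, prev_word):
    """Turn the raw counts for `prev_word` into probabilities."""
    counts = follow_counts.get(prev_word.lower(), Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Made-up training text for the illustration.
corpus = [
    "the fed raised interest rates",
    "the fed raised rates again",
    "the fed held rates steady",
]
bigram_model = train_bigram_model(corpus)
print(next_word_probabilities(bigram_model, "fed"))
```

In this corpus "fed" is followed by "raised" twice and "held" once, so the toy model gives "raised" a probability of two thirds. A speech recognizer uses the same kind of signal to prefer a likely word over an acoustically similar but unlikely one.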

Together, these components transcribe the speech one 30-second window at a time rather than as a live stream. The output of the system is a text transcript of the spoken words, which can be used for a variety of purposes such as archiving or analysis. Whisper can also translate speech from other languages into English.

Whisper using Python Code

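Whisper was released in five model sizes (tiny, base, small, medium, and large), four of which (all except large) also have English-only versions such as base.en. Smaller models run faster, while larger models are more accurate.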
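As a rough sketch of how the model sizes map to the names passed to `whisper.load_model()`, the helper below builds a valid name from a size and a language preference. The function `model_name` is hypothetical and only covers the five original sizes mentioned above.

```python
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")  # "large" is multilingual only


def model_name(size, english_only=False):
    """Build the name to pass to whisper.load_model() (hypothetical helper)."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {MODEL_SIZES}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"{size!r} has no English-only version")
        return f"{size}.en"
    return size
```

For example, `model_name("small", english_only=True)` returns `"small.en"`, which could then be passed to `whisper.load_model()`. If the recordings are known to be in English, the English-only versions are usually a reasonable starting point for the smaller sizes.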

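To transcribe an audio file with Whisper, use the Python code below. The lines that start with ! are shell commands meant to be run in a notebook such as Jupyter or Google Colab. Note that ffmpeg is a system tool rather than a Python library, so it is installed with the system package manager instead of pip. The command shown here is for Debian or Ubuntu; adjust it for your system.

!sudo apt update && sudo apt install -y ffmpeg

!pip3 install setuptools-rust

!pip3 install git+https://github.com/openai/whisper.git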

import whisper

model = whisper.load_model("base")

result = model.transcribe("audio.mp3")

print(result["text"])
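Besides the full text, the dictionary returned by `transcribe()` also contains a list of segments with start and end times in seconds. The sketch below formats those segments as timestamped lines; `format_timestamp`, `format_segments`, and `transcribe_with_timestamps` are hypothetical helper names, and the Whisper import is kept inside the last function so that the formatting helpers can be tried without the model installed.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Return one line per segment in the form [start -> end] text."""
    lines = []
    for segment in result["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


def transcribe_with_timestamps(path, model_name="base"):
    """Transcribe `path` and return timestamped lines (needs whisper installed)."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    return format_segments(model.transcribe(path))
```

Calling `transcribe_with_timestamps("audio.mp3")` would return a list of lines that can be printed or saved alongside the plain transcript, which is handy when the transcripts are analyzed later.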

Below is an example usage of whisper.detect_language() and whisper.decode(), which provide lower-level access to the model.

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds

audio = whisper.load_audio("audio.mp3")

audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model

mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language

_, probs = model.detect_language(mel)

print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio

options = whisper.DecodingOptions()

result = whisper.decode(model, mel, options)

# print the recognized text

print(result.text)

Processing time depends on the length of the audio, the model size, and whether a GPU is available. For long recordings, Whisper speech to text can take anywhere from 10 to 30 minutes to produce a transcription.
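Since processing time varies so much with hardware and model size, it can be worth measuring it on your own setup. The small wrapper below is a generic timing sketch using only the standard library; `timed` is a hypothetical helper, not part of Whisper.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With a loaded model, `result, seconds = timed(model.transcribe, "audio.mp3")` gives both the transcript and how long it took, which makes it easy to compare, say, the base and small models on the same file.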

Transcribing audio using Whisper

Here is a speech recognition example that shows how to give Whisper an audio file as input and get text as output.

The audio used here is:

Apoorva Grover · Speech Recognition Marketplace

And the Whisper model transcribed it as:

So the funny thing about the big economic news of the day, the Fed raising interest rates half a percentage point, was that there was only really one tidbit of actual news in the news and the interest rate increase wasn’t it. You knew it was coming. I knew it was coming. Wall Street knew it was coming. Businesses knew it was coming. So on this Fed Day, on this program, something a little bit different. Jay Powell in his own words, five of them, his most used economic words from today’s press conference. Word number one, of course, it’s the biggie. 2% inflation. Inflation. Inflation. Inflation. Inflation. Inflation. Dealing with inflation. Powell’s big worry, the thing keeping him up at night, price stability is the Fed’s whole ballgame right now. Powell basically said as much today. Word number two.

As you can see, the transcription is quite accurate for English. If you want to know more about the OpenAI Whisper model and how it works, you can refer to its GitHub page: OpenAI Whisper. If you are looking to integrate AI into your business or need help building and implementing more artificial intelligence decision-making examples, reach out to us.

From audio to insights—trust our data science consulting company to guide you.

by Apoorva

Apoorva, ex data scientist at datakulture, worked closely with the data science team—supporting research, data exploration, training, and model-building activities. With a strong blend of analytical, creative, and communication skills, she loved spreading knowledge through engaging audiences in events, writing blogs and technical papers, and participation in platforms like Medium.
