Whisper is all you need for your audio transcription!

Step into the future of transcription with OpenAI's Whisper and explore how this tool is reshaping the way we document conversations, quiet discussions included, in the digital era.



Dec 16, 2023



In today's digital world, audio recordings have become an integral part of our lives. From business meetings to personal conversations, we often record audio to keep track of our discussions. In some situations, though, people need to speak quietly, whether for confidentiality, for privacy, or because the recording is ASMR. That is a challenge for transcription, because many services struggle to transcribe such recordings accurately.

Recently, I faced a similar problem when one of Datakulture's clients wanted to explore audio transcription and analyze the transcripts afterwards. This is where OpenAI's Whisper came into play.

OpenAI's Whisper is a general-purpose speech recognition model that converts speech into text. Despite its name, it is not built specifically for whispered speech: according to OpenAI, it was trained on about 680,000 hours of varied, multilingual audio, which makes it fairly robust to accents, background noise, and other difficult recording conditions. That robustness is what makes it worth trying on confidential or quiet conversations, so that individuals and businesses can still keep a written record of their discussions.

Breaking Down the Technology: How OpenAI's Whisper Audio Transcription Works

At its core, OpenAI's Whisper is an encoder-decoder Transformer, a type of neural network, trained end to end to turn speech into text. It is a general speech model rather than one designed only for low or whispered voices.

The process begins by loading an audio recording as a digital signal. The signal is cut into 30-second windows, each window is converted into a log-Mel spectrogram, and the spectrogram is passed to a deep neural network that has been trained on a large and varied dataset of speech. The network consists of many stacked layers, and each stage handles a different part of the analysis.
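To make the windowing step concrete, here is a minimal pure-Python sketch. The constants are the ones I believe the openai-whisper package uses (16 kHz audio, 30-second windows, a hop of 160 samples between spectrogram frames); the helper split_into_windows is my own illustration and not part of the library, whose transcribe() moves its window using predicted timestamps rather than a fixed split.

```python
# Constants believed to mirror the openai-whisper package (assumption).
SAMPLE_RATE = 16_000                       # samples per second
CHUNK_SECONDS = 30                         # length of one input window
HOP_LENGTH = 160                           # samples between spectrogram frames
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS    # 480,000 samples per window
N_FRAMES = N_SAMPLES // HOP_LENGTH         # 3,000 spectrogram frames per window


def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut long input, zero-pad short input."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


def split_into_windows(samples, length=N_SAMPLES):
    """Hypothetical helper: fixed-size windows, with the last one zero-padded."""
    return [
        pad_or_trim(samples[start:start + length], length)
        for start in range(0, len(samples), length)
    ]


if __name__ == "__main__":
    recording = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence as a stand-in
    windows = split_into_windows(recording)
    print(len(windows), "windows of", len(windows[0]), "samples each")
```

A 70-second recording therefore becomes three windows, and the last one is mostly padding.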

The first stage extracts low-level features from the audio signal: the log-Mel spectrogram records how much energy each frequency band carries over time. The subsequent layers of the network perform increasingly complex computations on these features to work out the spoken words and transcribe them.

One of the critical components of Whisper is language modeling. A language model is a machine learning model that estimates how likely a sequence of words is, usually by predicting each word from the words that came before it. In Whisper this role is played by the decoder: it is trained on the transcripts that accompany the audio and predicts the next piece of text from both the audio and the text produced so far.
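The idea of predicting the next word can be shown with a toy example. The sketch below is a simple bigram counter over three made-up sentences. It is not how Whisper's decoder works internally (that is a neural network conditioned on the audio), and the function names are my own.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}


# Made-up training sentences, for illustration only.
corpus = [
    "the fed raised interest rates",
    "the fed raised rates again",
    "the market expected the fed decision",
]
model = train_bigram_model(corpus)
print(next_word_probabilities(model, "fed"))
```

Here "raised" follows "fed" in two of the three cases, so the toy model prefers it. A speech recognizer uses the same kind of preference to choose between words that sound alike.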

The acoustic and language modeling parts work together to transcribe the speech one 30-second window at a time. How fast this runs depends on the model size and the hardware, so real-time performance is not guaranteed. The output of the system is a text transcript of the spoken words, which can be used for a variety of purposes such as archiving, analysis, or translation.

Using Whisper from Python

There are five model sizes (tiny, base, small, medium, and large), four of which also come in English-only versions. Smaller models run faster, while larger ones are more accurate.
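As a small sketch of how the names fit together: to my knowledge the English-only variants use an ".en" suffix (for example "base.en"), and "large" has no English-only version. The helper below is hypothetical and only builds the string you would pass to whisper.load_model().

```python
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}  # "large" is multilingual only


def model_name(size, english_only=False):
    """Hypothetical helper: build the name to pass to whisper.load_model()."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # Fall back to the multilingual model when no English-only variant exists.
    return size


print(model_name("base", english_only=True))
print(model_name("large", english_only=True))
```

For English-only audio, trying the ".en" variant of a small model first is a reasonable starting point before moving up in size.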

Audio transcription can be performed with the following Python code:

    !pip install ffmpeg
    !pip3 install setuptools-rust
    !pip3 install git+

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("audio.mp3")
    print(result["text"])
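Besides "text", the dictionary returned by transcribe() also contains a "segments" list whose entries have "start", "end", and "text" fields. The sketch below turns those into timestamped lines. The helper names are my own, and sample_result is a hand-written stand-in with invented timestamps, not real model output.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Hypothetical helper: one '[start - end] text' line per segment."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)


# Hand-written stand-in for a transcribe() result; the timestamps are invented.
sample_result = {
    "text": " You knew it was coming. I knew it was coming.",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " You knew it was coming."},
        {"start": 2.4, "end": 4.8, "text": " I knew it was coming."},
    ],
}
print(format_segments(sample_result))
```

With the real library, the same helper can be called as format_segments(model.transcribe("audio.mp3")).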

Below is an example usage of whisper.detect_language() and whisper.decode() which provide lower-level access to the model.

    import whisper

    model = whisper.load_model("base")

    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)

    # print the recognized text
    print(result.text)
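In the code above, probs is a dictionary that maps language codes to probabilities, and max() keeps only the single best guess. When the audio is noisy or mixes languages, it can help to look at the runners-up too. Here is a small sketch, where top_languages is my own helper and example_probs holds made-up numbers rather than real model output.

```python
def top_languages(probs, k=3):
    """Hypothetical helper: the k most likely (language, probability) pairs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up probabilities standing in for the output of model.detect_language(mel).
example_probs = {"en": 0.92, "de": 0.03, "nl": 0.02, "fr": 0.01}

for code, probability in top_languages(example_probs):
    print(f"{code}: {probability:.2f}")
```

If the top two probabilities are close, it may be safer to set the language explicitly than to rely on detection.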

Transcribing Audio using Whisper

Here, I will show how you can give Whisper an audio file as input and get the output as text.

The audio used here is:

Speech Recognition Marketplace

And the transcription that we got was:

So the funny thing about the big economic news of the day, the Fed raising interest rates half a percentage point, was that there was only really one tidbit of actual news in the news and the interest rate increase wasn’t it. You knew it was coming. I knew it was coming. Wall Street knew it was coming. Businesses knew it was coming. So on this Fed Day, on this program, something a little bit different. Jay Powell in his own words, five of them, his most used economic words from today’s press conference. Word number one, of course, it’s the biggie. 2% inflation. Inflation. Inflation. Inflation. Inflation. Inflation. Dealing with inflation. Powell’s big worry, the thing keeping him up at night, price stability is the Fed’s whole ballgame right now. Powell basically said as much today. Word number two.
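One way to put a number on how accurate a transcript like this is, assuming you have a trusted reference transcript to compare against, is the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the output into the reference, divided by the number of reference words. The sketch below is a plain edit-distance implementation of my own with a very simple normalization step. It is not part of Whisper, and the two sentences are made up.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion against five reference words.
print(word_error_rate("The Fed raised interest rates.", "the fed raise rates"))
```

A WER of 0 means the transcript matches the reference word for word, and lower is better.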

As you can see, it is pretty accurate for English. If you want to know more about the Whisper model and how it works, you can refer to its GitHub page.
