Whisper is all you need for your audio transcription!
Step into the future of transcription with OpenAI's Whisper technology and explore how this advanced tool is reshaping the way we document conversations and preserve the nuances of quiet discussions in the digital era.
Apoorva
July 22, 2024 |
3 mins
In today’s digital world, audio recordings have become an integral part of our lives. From business meetings to personal conversations, we often use audio recordings to keep a record of our discussions. In some situations, though, people need to speak quietly for reasons such as confidentiality, privacy, or ASMR recordings, and many transcription services struggle to transcribe such low-volume recordings accurately.
Recently, I faced a similar problem: one of datakulture’s clients wanted to explore audio transcription and analyze the transcripts later. This is where OpenAI’s Whisper audio transcription came into play.
OpenAI’s Whisper is an advanced, general-purpose speech recognition model that converts spoken audio into text. It uses state-of-the-art deep learning and was trained on a large, diverse dataset, which makes it robust even on difficult audio such as quiet or low-volume speech. With this technology, individuals and businesses can accurately transcribe their confidential or quiet conversations, enabling them to keep a record of their discussions.
Breaking down the technology: How OpenAI’s Whisper audio transcription works
The architecture of OpenAI’s Whisper combines advanced machine learning techniques and neural networks: it is an encoder-decoder Transformer trained to turn audio into text, and it is robust enough to handle quiet or whispered speech.
The process begins by taking an audio recording and converting it into a digital signal. This digital signal is then processed by a deep neural network that has been trained on a large and diverse dataset of speech collected from the web (around 680,000 hours of multilingual audio). The neural network consists of multiple layers of interconnected nodes, and each layer performs a specific task in the analysis process.
The front of the network works on low-level acoustic features: the audio is first converted into a log-Mel spectrogram, which captures how the signal’s frequency content and intensity change over time. The subsequent layers of the network perform increasingly complex computations to understand the spoken words and accurately transcribe them.
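As a rough sketch of this first step, the snippet below (assuming a local file named audio.mp3) uses helper functions from the openai-whisper package to load a recording and convert it into the log-Mel spectrogram the model actually consumes:
import whisper
# load the recording as a 16 kHz mono waveform
audio = whisper.load_audio("audio.mp3")
# pad or trim the waveform to the model's 30-second context window
audio = whisper.pad_or_trim(audio)
# convert the waveform into a log-Mel spectrogram (the "low-level features")
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)  # e.g. torch.Size([80, 3000]): 80 Mel bands x 3000 time frames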
One of the critical components of OpenAI’s Whisper is its built-in language model. A language model predicts the likelihood of a given sequence of words based on how probable each word is after the ones before it. Whisper’s decoder acts as a language model conditioned on the audio: trained on a vast amount of transcribed speech, it predicts the most likely next word at every step of the transcript.
The acoustic encoder and the language-model-like decoder work together to transcribe the speech accurately. The output of the system is a text transcript of the spoken words, which can be used for a variety of purposes such as archiving, analysis, or translation.
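Because the decoder effectively behaves like a language model over the transcript, you can also nudge its word predictions with a text prompt. A minimal sketch using the transcribe() call introduced in the next section (the prompt text is only an illustration):
import whisper
model = whisper.load_model("base")
# the initial_prompt text biases the decoder's language model, which can help
# with domain-specific terms or spellings it might otherwise get wrong
result = model.transcribe("audio.mp3", initial_prompt="A quiet discussion about the Fed and interest rates.")
print(result["text"])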
Using Whisper with Python code
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
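For English-only audio, the .en variants of the smaller models (tiny.en, base.en, small.en, medium.en) tend to be faster and slightly more accurate than their multilingual counterparts; for example:
import whisper
# small, fast English-only model
fast_model = whisper.load_model("tiny.en")
# multilingual "base" model used in the rest of this post
model = whisper.load_model("base")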
To transcribe an audio file with Whisper, you can use the Python code given below. First, install the dependencies:
# ffmpeg must be installed on the system (it is typically preinstalled on Google Colab; on Debian/Ubuntu: sudo apt install ffmpeg)
!pip3 install setuptools-rust
!pip3 install git+https://github.com/openai/whisper.git
import whisper
# load the multilingual "base" model
model = whisper.load_model("base")
# transcribe the audio file and print the resulting text
result = model.transcribe("audio.mp3")
print(result["text"])
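Beyond the plain text, transcribe() also returns timestamped segments and accepts a few useful options; here is a short sketch (parameter names as defined in the openai-whisper package):
import whisper
model = whisper.load_model("base")
# force the language and disable half-precision (useful when running on CPU)
result = model.transcribe("audio.mp3", language="en", fp16=False)
# each segment carries start/end timestamps alongside its text
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")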
Below is an example usage of whisper.detect_language() and whisper.decode(), which provide lower-level access to the model.
import whisper
model = whisper.load_model("base")
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
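DecodingOptions also lets you control how the decoding runs. Continuing with the model and mel from above, a small sketch (beam search is usually more robust on quiet or noisy audio, at the cost of speed):
# decode with beam search, forcing English output and full precision on CPU
options = whisper.DecodingOptions(language="en", beam_size=5, fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)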
How long Whisper takes to process a recording depends on the length of the audio, the model size, and your hardware; in my runs it took anywhere from 10 to 30 minutes per file.
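If you want to know how long it takes on your own machine, you can simply time the call:
import time
start = time.time()
result = model.transcribe("audio.mp3")
print(f"Transcription took {time.time() - start:.1f} seconds")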
Transcribing audio using Whisper
Here, I will show you how to give an audio file as input and get the output in text format.
The audio used here is:
And the Whisper model transcribed it as:
So the funny thing about the big economic news of the day, the Fed raising interest rates half a percentage point, was that there was only really one tidbit of actual news in the news and the interest rate increase wasn’t it. You knew it was coming. I knew it was coming. Wall Street knew it was coming. Businesses knew it was coming. So on this Fed Day, on this program, something a little bit different. Jay Powell in his own words, five of them, his most used economic words from today’s press conference. Word number one, of course, it’s the biggie. 2% inflation. Inflation. Inflation. Inflation. Inflation. Inflation. Dealing with inflation. Powell’s big worry, the thing keeping him up at night, price stability is the Fed’s whole ballgame right now. Powell basically said as much today. Word number two.
As you can see, it is quite accurate for English audio. If you want to know more about the OpenAI Whisper model and how it works, you can refer to its GitHub page: OpenAI Whisper