
I used OpenAI’s new technology to transcribe audio on my laptop

OpenAI, the company behind the image-generation and meme-spawning program DALL-E and the powerful text-autocomplete engine GPT-3, has launched a new, open-source neural network meant to transcribe audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human-level robustness and accuracy on English speech recognition” and that it can also automatically recognize, transcribe, and translate other languages such as Spanish, Italian, and Japanese.

As someone who constantly records and transcribes interviews, I was immediately hyped about the news — thinking I could write my own app to securely transcribe audio right from my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively secure, there are some interviews where I or my sources feel more comfortable keeping the audio file off the internet.

Using it turned out to be even easier than I expected; I already had Python and various developer tools set up on my computer, so installing Whisper was as simple as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a recorded test audio clip. For a relatively tech-savvy person who doesn’t already have Python, FFmpeg, Xcode, and Homebrew set up, this should take about an hour or two. Someone is already working to make this process very simple and user-friendly, but we’ll talk about that in just a second.
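For context, the single command in question is just a package install, after which Whisper is a regular command-line tool. A minimal sketch, assuming Python, pip, and FFmpeg are already set up, and using the `openai-whisper` package on PyPI — the audio filename here is a placeholder:

```shell
# Install Whisper from PyPI (this also pulls in PyTorch as a dependency)
pip install -U openai-whisper

# Transcribe a local audio file. --model picks the size/speed/accuracy
# trade-off (tiny, base, small, medium, large); --language skips
# auto-detection. Transcripts are written next to the audio file.
whisper interview.mp3 --model small --language en
```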


Command-line apps aren’t for everyone, but for those doing relatively complex work, Whisper is easy to use.

While OpenAI certainly sees this use case as a possibility, it’s pretty clear the company is primarily targeting researchers and developers with this release. In the blog post announcing Whisper, the team said its code could “serve as a foundation for building useful applications and for further research on robust speech processing” and that it hopes “Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.” That approach is still notable, however — the company has limited access to its most popular machine-learning projects like DALL-E and GPT-3, citing a desire to “learn more about real-world use and continue to iterate on our safety systems.”

Image showing a text file with transcribed lyrics for Yung Gravy’s song “Betty (Get Money)”. There are many mistakes in the transcription.

The text files that Whisper produces aren’t very easy to read if you’re using them to write an article.

There’s also the fact that installing Whisper isn’t exactly a user-friendly process for most people. Journalist Peter Stern has teamed up with GitHub developer advocate Christina Warren to try and fix that, announcing that they’re building a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s machine-learning model. I spoke with Stern, and he said he decided the program, called Stage Whisper, should exist after running a few interviews through Whisper and concluding that it was “the best transcription I’ve ever used, other than human transcribers.”

I’ve compared the transcriptions produced by Whisper to what Otter.ai and Trint put out for the same files, and I’d say they’re relatively comparable. All of them have enough flaws that I wouldn’t copy and paste quotes from them into an article without double-checking the audio (which is best practice no matter what service you use). But Whisper’s version definitely works for me; I can search through it to find the sections I need and then double-check those against the recording manually. And since Stage Whisper will essentially be the same model with a GUI wrapped around it, it should, in theory, produce the same results.
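Searching a Whisper-produced transcript for the sections you need takes nothing more than a few lines of plain Python — a minimal sketch, where the helper name, sample text, and search phrase are all hypothetical:

```python
def find_quote(transcript: str, phrase: str, context: int = 80) -> list[str]:
    """Return each case-insensitive match of `phrase`, with up to
    `context` characters of surrounding text on either side."""
    text = transcript.lower()
    needle = phrase.lower()
    hits, start = [], 0
    while (i := text.find(needle, start)) != -1:
        lo, hi = max(0, i - context), i + len(needle) + context
        hits.append(transcript[lo:hi])
        start = i + len(needle)
    return hits

# Usage: load the .txt file Whisper wrote, then look for a half-remembered quote.
transcript = "We can't wait that long. Journalists like us need good apps today."
for snippet in find_quote(transcript, "wait that long"):
    print(f"...{snippet}...")
```

From there, jumping back to the recording to verify the quote is still on you — this only narrows down where to listen.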

Stern admits that tech from Apple and Google could make Stage Whisper obsolete in a few years — the Pixel’s Voice Recorder app has been able to do offline transcriptions for years, a version of that feature is starting to roll out to some other Android devices, and Apple has offline dictation built into iOS (though there’s currently no proper way to use it to transcribe audio files). “But we can’t wait that long,” Stern said. “Journalists like us need good auto-transcription apps today.” He expects to have a bare-bones version of the Whisper-based app ready in about two weeks.

To be clear, Whisper probably won’t fully replace cloud-based services like Otter.ai and Trint, no matter how easy it becomes to use. For one, OpenAI’s model is missing one of the biggest features of traditional transcription services: the ability to label who said what. Stern says Stage Whisper probably won’t support that feature either: “We’re not developing our own machine-learning model.”

The cloud is someone else’s computer – maybe that means it’s a little faster

And while you get the benefits of local processing, you also get the drawbacks — chief among them that your laptop is almost certainly far less powerful than the computers a professional transcription service uses. For example, I fed the audio from a 24-minute-long interview into Whisper running on my M1 MacBook Pro; it took about 52 minutes to transcribe the entire file. (Yes, I made sure it was using the Apple Silicon version of Python instead of the Intel one.) Otter spit out its transcript in less than eight minutes.
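One way to frame that gap is the real-time factor: processing time divided by audio duration. A quick back-of-the-envelope sketch using the numbers above (treating Otter’s “less than eight minutes” as a flat eight):

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """How many minutes of compute each minute of audio costs."""
    return processing_minutes / audio_minutes

# Whisper on an M1 MacBook Pro: 52 minutes for a 24-minute interview
print(round(real_time_factor(52, 24), 2))  # about 2.17x real time

# Otter's cloud service on the same file: under 8 minutes
print(round(real_time_factor(8, 24), 2))   # about 0.33x real time
```

Anything above 1.0 means the transcription runs slower than just listening to the recording yourself.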

OpenAI’s technology does have one big advantage, however: cost. Cloud-based subscription services almost always cost you money if you’re using them professionally (Otter has a free tier, but upcoming changes will make it less useful for people who transcribe a lot of content), and the transcription features built into platforms like Microsoft Word or the Pixel require you to pay for special software or hardware. Stage Whisper — and even Whisper itself — is free and can run on a computer you already own.

Again, OpenAI likely has bigger hopes for Whisper than it being the basis for a secure transcription app — and I’m very excited about what researchers end up doing with it, and about what they’ll learn by poking at a machine-learning model trained on “680,000 hours of multilingual and multitask supervised data collected from the web.” But the fact that it has a real, practical use today makes it all the more exciting.
