OpenAI Open-Sources Whisper, a Multilingual Speech Recognition System • Technology Flow

Speech recognition remains a challenging problem in AI and machine learning. In a step toward addressing it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables “robust” transcription in multiple languages, as well as translation from those languages into English.

Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon, and Meta. What makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web, leading to improved recognition of distinctive voices, background noise, and technical jargon.

“Primary intended users of [the Whisper] models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of current models. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI writes in the GitHub repo for Whisper, from which several versions of the system can be downloaded. “[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities if fine-tuned on certain tasks, such as voice activity detection, speaker classification, or speaker diarization, but have not been rigorously evaluated in these areas.”
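For developers curious what using Whisper as an off-the-shelf ASR solution might look like, here is a minimal sketch. It assumes the `whisper` Python package from the GitHub repo; the model name “base” and the file path are illustrative placeholders, so check the repo’s README for the actual API before relying on it:

```python
# Hedged sketch of calling the open-sourced Whisper package from Python.
# Assumptions not stated in the article: the import name (`whisper`), the
# model name "base", and the example file "speech.mp3" are illustrative.

def transcribe_and_translate(audio_path, model_name="base"):
    """Return (source-language transcript, English translation) for one audio file."""
    import whisper  # imported lazily so the sketch is loadable without the package

    model = whisper.load_model(model_name)  # downloads weights on first use
    transcript = model.transcribe(audio_path)  # transcription task (default)
    english = model.transcribe(audio_path, task="translate")  # source language -> English
    return transcript["text"], english["text"]

# Example usage (requires the package and an audio file on disk):
#   text, english = transcribe_and_translate("speech.mp3")
```

The two `transcribe` calls mirror the two capabilities the article describes: multilingual transcription and translation into English.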

Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI warns that Whisper’s transcriptions may include words that weren’t actually spoken, possibly because it is simultaneously trying to predict the next word in the audio and trying to transcribe the audio itself. Furthermore, Whisper doesn’t perform equally well across languages, suffering from higher error rates with speakers of languages that aren’t well represented in the training data.

Unfortunately, that last bit is nothing new to the world of speech recognition. A 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft made far more errors with users who are Black than with users who are white (a word error rate of about 35% for Black speakers), and such biases have long plagued even the best systems.

However, OpenAI sees Whisper’s transcription capabilities being used to enhance existing accessibility tools.

“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation,” the company continues on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications… [W]hile we expect the technology to be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”

Whisper’s release does not necessarily indicate OpenAI’s future plans. While largely focused on commercial efforts like DALL-E 2 and GPT-3, the company is pursuing several purely theoretical research threads, including AI systems that learn by observing videos.
