by Holly Zheng, '22
When we talk to a friend at a noisy cocktail party, we can usually pick out the friend’s voice and follow what they are saying despite the chatter in the background. This task seems effortless for the human brain, but for voice-controlled devices such as Alexa and Google Home, a “cocktail party” scene calls for a complicated process in which the device must filter out the background voices and pinpoint the target one. Recently, researchers at Google and the Idiap Research Institute in Switzerland proposed a new approach to training such devices for voice filtering.
Before this research, classical deep learning methods for voice filtering faced several challenges. Many models needed to know the total number of speakers ahead of time, which is unrealistic because the number of speakers changes constantly in a real-time scenario. Another challenge is the “permutation problem”: such a model produces one output per speaker, and because the speakers have no natural order, it is ambiguous which output should be matched to which speaker during training.
The researchers at Google and Idiap proposed a new training method so that a voice filter can more accurately recover the speech of a target speaker in the presence of background conversations. The method involves two separately trained networks: a speaker recognition network and a spectrogram masking network. The training data for the speaker recognition network consisted primarily of anonymized voice logs in English from mobile devices, which included 34 million utterances from more than 100 thousand speakers. For each training example, a target spectrogram (a representation of how the frequency content of a sound changes over time) was computed from a clean audio recording of the target speaker. A second, noisy audio recording containing multiple speakers was used to compute a magnitude spectrogram, and a third, reference audio recording of the target speaker was passed through the speaker recognition network to compute speaker embedding vectors called “d-vectors.” The masking network took the d-vectors and the noisy magnitude spectrogram as inputs and generated a “masked spectrogram.” The network was trained to minimize the difference between the target spectrogram and the masked spectrogram.
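The training objective described above can be illustrated with a small Python sketch. It is not the researchers' implementation: the function names, the synthetic sine-wave "voices," and the frame sizes are all assumptions made for illustration, and the learned masking network (which would also take the d-vector as input) is replaced by an "oracle" mask computed directly from the clean target. The sketch only shows how a mask with values between 0 and 1 is applied to the noisy magnitude spectrogram and how the difference from the target spectrogram is measured.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude of a short-time Fourier transform, shaped (frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def apply_mask(noisy_mag, mask):
    """Element-wise masking: each time-frequency cell is scaled by a value in [0, 1]."""
    return noisy_mag * mask

def spectrogram_loss(masked_mag, target_mag):
    """Mean squared difference between the masked and the target spectrogram."""
    return float(np.mean((masked_mag - target_mag) ** 2))

def oracle_mask(noisy_mag, target_mag, eps=1e-8):
    """Stand-in for the learned masking network (assumption): it peeks at the
    clean target, which a real system cannot do at inference time."""
    return np.clip(target_mag / (noisy_mag + eps), 0.0, 1.0)

# Synthetic one-second example at 16 kHz (hypothetical signals, not real speech).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
target = np.sin(2 * np.pi * 440 * t)                      # "target speaker"
interference = (0.8 * np.sin(2 * np.pi * 1200 * t)        # "background speaker"
                + 0.1 * rng.standard_normal(t.size))      # plus some noise
noisy = target + interference

target_mag = magnitude_spectrogram(target)
noisy_mag = magnitude_spectrogram(noisy)

loss_unmasked = spectrogram_loss(noisy_mag, target_mag)
mask = oracle_mask(noisy_mag, target_mag)
loss_masked = spectrogram_loss(apply_mask(noisy_mag, mask), target_mag)

print(f"loss without mask: {loss_unmasked:.4f}")
print(f"loss with oracle mask: {loss_masked:.4f}")
```

In a real system the mask would be predicted by a neural network from the noisy spectrogram and the d-vector, and the loss would be used to update that network's weights.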
One of the criteria that the researchers used to evaluate these models was the Word Error Rate (WER), the percentage of words that the model did not recognize correctly. In addition to several noise-enhanced models, the researchers also trained a baseline model on the same dataset without background-noise audio. The WER of the noise-enhanced model was 23.4%, compared to 55.9% for the baseline without noise enhancement. In addition to the marked decrease in recognition error, another benefit of the newly proposed training method is that it does not require knowledge of the number of speakers ahead of time. The method also solves the permutation problem. Potential improvements include training on a larger and more challenging database and adding more interfering speakers.
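For readers curious how WER is computed, the sketch below uses the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference text, divided by the number of words in the reference. The function name and the example sentences are assumptions chosen for illustration and do not come from the study.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + substitution_cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: one substituted word and one dropped word out of six.
wer = word_error_rate("what is the weather outside today",
                      "what is the whether outside")
print(f"WER: {wer:.1%}")  # prints "WER: 33.3%"
```

Because insertions count as errors, WER computed this way can exceed 100% when the recognizer outputs many extra words.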
This new masking model presents a better training method for the voice filtering task. Such a method is applicable to all types of AI-powered voice recognition devices, including Google Home and Siri. If adopted, this training method could enhance the filtering capability of such devices so that they can better respond to tasks or inquiries from a specific user. As voice recognition becomes more systematic and accurate, maybe one day your Alexa will be able to tell you the weather outside even when you are in the middle of a noisy cocktail party.
Wang, Q., Muckenhirn, H., et al., “VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking,” October 12, 2018. https://arxiv.org/pdf/1810.04826.pdf