Some Applications of Time-Frequency Representations in Speech Processing

From Computer Vision to Audio Processing through Mel Spectrogram

Arthur Findelair
Towards Data Science


Spectrograms are certainly the best-suited representation of audio signals for analysis. Not only do they match our understanding of sound through frequency decomposition, but they also allow us to use 2-dimensional analysis architectures. Given the massive progress in computer vision during the last decade, the possibility of representing sounds as images opens many options. This article analyzes the similarities and the limits of applying computer vision techniques to audio processing, with a particular emphasis on speech recognition and speech synthesis.

“Are we headed towards a future where an AI will be able to out-think us in every way? Then the answer is unequivocally yes.” Elon Musk — Spectrogram generated by the author.

Computer vision is currently one of the hottest research fields in machine learning, and audio processing has been able to ride the wave of ground-breaking deep learning techniques developed for visual analysis. Indeed, audio signals can easily be represented as 2-dimensional data, a form that has proven better suited to information extraction. As their name suggests, time-frequency representations show audio signals through discretization in both time and frequency. Most neural network architectures that manipulate images can thus be used for audio processing. Just as the human brain inspired neural networks, the Mel spectrogram aims to represent signals the way we hear them: this specific time-frequency representation takes into account the non-linear scale of our hearing. This article focuses on a couple of applications of Mel spectrograms and deep learning techniques to speech manipulation. We will see how to generate human-like speech from a given text and how computers can understand what we are saying. Such tasks use architectures like sequence-to-sequence models (Listen, Attend and Spell), Connectionist Temporal Classification, and causal convolutions. We will end with a discussion of the limits of the analogy between computer vision and machine hearing.

How to visualize sound

Sound signals, or waveforms, are produced by variations in air pressure; microphones and speakers interface with sound waves through vibrating membranes. Sound is thus one-dimensional data: in its most natural form, an audio signal is represented as an intensity varying over time. However, this is a poor representation of the information embedded in a signal. Frequencies are the key characteristics that compose sounds, and information is encoded in the evolution of frequencies and their amplitudes over time. Time-frequency (TF) representations are therefore an adequate way to analyze the data embedded in sound. The most common TF representations are spectrograms. A spectrogram is composed of pixels that describe the amplitude associated with a range of frequencies at a specific time step. Temporal position is on the x-axis, whereas frequency bins are on the y-axis. The brighter the pixel, the higher the energy of the associated frequency. Not only are spectrograms a great way to visualize audio signals, they are also easy to compute: multiple techniques have been developed and optimized over the years, the short-time Fourier transform (STFT) being the most common.
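As a minimal sketch (using the librosa library; the file path and parameter values are placeholders, not taken from any paper), a spectrogram can be computed from a waveform in a few lines:

```python
import numpy as np
import librosa

# Load a mono waveform (the path is a placeholder).
y, sr = librosa.load("speech.wav", sr=16000)

# Short-time Fourier transform: each column is the spectrum of a short window.
stft = librosa.stft(y, n_fft=1024, hop_length=256)
spectrogram = np.abs(stft)      # magnitude, shape: (frequency bins, time steps)
print(spectrogram.shape)
```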

Computer vision techniques can be applied to audio processing thanks to the 2-dimensional nature of spectrograms. There are many applications: audio classification, audio separation and segmentation, music genre classification and tagging, music generation, music transcription, voice recognition, speech synthesis, and speech recognition. The list goes on, but we will focus on the latter two applications in this article. A specific type of spectrogram, or more precisely a specific scale, is especially suited to speech manipulation. The way humans hear frequencies is known as pitch, and the human audible spectrum is not flat: we do not perceive frequencies linearly, and lower frequencies are more distinguishable than higher frequencies. The Mel scale takes this logarithmic response of our hearing into account. It is based on an empirical formula such that distances on the y-axis of a spectrogram match how far apart we perceive the frequencies of different sounds to be. Human perception of sound amplitude is not linear either: once again, we hear loudness logarithmically, and the decibel scale takes this characteristic into account. The Mel spectrogram implements both by warping the y-axis to the Mel scale and using the decibel scale for pixel intensity values.
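Continuing the sketch above, librosa can pool the frequency axis into Mel bands and convert amplitudes to decibels (again, the path and parameter values are illustrative assumptions):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder path

# Power spectrogram pooled into 80 Mel-spaced frequency bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Convert power to decibels, mirroring our logarithmic perception of loudness.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)                            # (80 Mel bins, time steps)
```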

Speech recognition

The classical approach

Speech recognition is certainly one of the machine learning techniques best known to the public. Vocal assistants are becoming mainstream on all platforms: speakers (Amazon Echo), game consoles (Kinect), phones and smartwatches (Siri, Google Assistant), etc. While speech recognition has been studied for several decades, it only became accurate enough to be useful in real-world applications recently, thanks to the surge of deep learning.

First, let us have a look at the classical way to build a speech recognition system. This method relies on detecting phonemes in a spectrogram and combining them into sentences based on statistical models. As shown in Figure 1, three models compose the speech recognition pipeline that matches speech features X to a sentence Y:

  • The language model defines the probability of a given sequence of words occurring in a sentence. Such a model is based on the study of the grammar and idioms of a given language. N-gram models were well suited to limited speech input data.
  • The pronunciation model generates a sequence of phonemes for a given sentence. This step is generally performed using look-up tables defined by linguistics experts.
  • The acoustic model finally defines how a given phoneme sounds. It produces data that is compared to the speech features X extracted from an audio recording; spectrograms are generally used as the feature set. Acoustic models were traditionally built as Gaussian Mixture Models.

Given such a model, recognition is performed by inference on the features extracted from the analyzed recording: the system finds the sentence Y that maximizes its probability given the data X.
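In equation form, the classical pipeline decodes with Bayes' rule (a standard formulation of this approach, not quoted from [1]):

```latex
\hat{Y} = \arg\max_{Y} P(Y \mid X)
        = \arg\max_{Y} \; \underbrace{P(X \mid Y)}_{\text{acoustic + pronunciation models}} \; \underbrace{P(Y)}_{\text{language model}}
```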

Fig. 1. Classical speech recognition pipeline [1]

Over the years, components of this model have been optimized using deep learning. The use of neural networks significantly improved the overall accuracy of the models. A typical fully optimized pipeline used a neural language model, a neural network pronunciation model, and a hybrid RNN-HMM (Recurrent Neural Network — Hidden Markov Model) acoustic model. Even the feature extraction has been augmented using CNNs (Convolutional Neural Networks).

However, each learning model is trained independently to perform the specific task its component is supposed to do. The errors of the pipeline's members might not behave well together, and the error compounding along the way limits performance. Also, a human-designed feature pipeline is intuitive and easier to train, but it might not be optimal for the task, which leads to performance limitations and the need for more computational resources. These reasons pushed the development of end-to-end models, where the whole recognition process is trained as a single big component. We will now look at two of the most popular ones: Connectionist Temporal Classification and Sequence-to-Sequence.

Connectionist Temporal Classification

If you have ever said ‘Hey Google’, a model of this kind has certainly already analyzed your voice. The Connectionist Temporal Classification (CTC) model [2,3,4] is currently used in multiple products from Google and Baidu.

Again, the goal is to deduce the sentence Y from an input spectrogram X. But now, each output token {Y(t), t ∈ [1, L]} represents a letter, and each input token {X(t), t ∈ [1, T]} is a subpart of the spectrogram. Note that input spectrogram tokens can overlap and have an arbitrary size. The only requirement is T > L: there must be more input tokens than sentence characters. However, one cannot directly match input to output tokens, for two main reasons. First, as we saw earlier, sounds are associated not with letters but with phonemes, and a given phoneme can be produced by different combinations of letters. Context is therefore critical to identify the letters producing a given phoneme. Secondly, the time resolution of the input tokens does not necessarily match the duration of the pronunciation of a given letter (or phoneme) by the speaker. The same character would (and should, to avoid drop-outs) be inferred in multiple consecutive output tokens. The CTC structure presented in Figure 2 solves both problems.

Fig. 2. Connectionist Temporal Classification structure [1]
Fig. 3. Results of a CTC model over the pronunciation of ‘His friends’ [1]

The spectrogram is fed into a bi-directional recurrent neural network. Each time step takes its temporal neighbors into account, which allows the pronunciation context to be captured. Each frame of the prediction is described by a probability distribution over characters, Y(t) = {s(k,t), k ∈ character map}, where s(k,t) is the logarithmic probability of the presence of character k at time step t. The character map can be composed of letters, numbers, punctuation, a blank character, and even complete words. We can now compute each character's probability over time and form the sentence by selecting the peak character at each step, then collapsing repeats and removing blanks (see Figure 3).
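As a minimal sketch of this greedy (peak-picking) decoding, here it is in plain Python; the character map and frame probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical character map: index 0 is the CTC blank symbol.
char_map = ["_", "h", "i", "s"]

# Fake frame-wise log-probabilities, shape (time steps, len(char_map)).
log_probs = np.log(np.array([
    [0.1, 0.8, 0.05, 0.05],   # peak: 'h'
    [0.1, 0.7, 0.1,  0.1],    # peak: 'h' (repeat of the same emission)
    [0.8, 0.1, 0.05, 0.05],   # peak: blank
    [0.1, 0.1, 0.7,  0.1],    # peak: 'i'
    [0.1, 0.1, 0.1,  0.7],    # peak: 's'
]))

def greedy_ctc_decode(log_probs, char_map, blank=0):
    """Pick the most likely character per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(axis=1)  # peak character index at each time step
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return "".join(char_map[k] for k in collapsed if k != blank)

print(greedy_ctc_decode(log_probs, char_map))  # -> "his"
```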

The CTC structure often struggles with grammar and misspells words. Integrating even a simple language model into the CTC system drastically improves the results [2].

Sequence-To-Sequence

Sequence-to-Sequence (Seq2Seq) models are used in language processing to change the domain in which an object is represented: the two domains could be two specific languages for speech translation, or images and text for image captioning. In our case, we want to express a speech recording in its textual form. Seq2Seq models are composed of two main components, an encoder and a decoder. Given our application field, let us respectively call these the listener and the speller.

Fig. 4. LAS model structure [5]

The Listen operation usually uses a Bidirectional Long Short-Term Memory (BLSTM) RNN to condense the information from a speech spectrogram. Convolutional layers have also proven very efficient at improving accuracy [7]. This step is critical to optimize the representation of the information and eliminate the noise in the audio. The feature vector h extracted by the listener is usually eight times shorter than the initial duration of the audio recording. This reduced resolution is critical to speed up the learning and (most importantly) inference processes. Note that, given the recurrent structure of the network, the chronological order remains untouched.

Still, the feature vector is too long for a speller to extract sentences from directly: the speller does not know where to find the relevant information to infer the character associated with a specific time step. An attention mechanism is required to focus the analysis on the proper time window of the input. This operation is added to our decoder, and the model is then described as a Listen, Attend and Spell (LAS) neural network [5,6]. The decoder sequentially runs through the feature representation h by focusing on the most likely area of interest, one character at a time. Note that this attention mechanism elegantly tackles the issue of repeated character detection encountered in CTC.

The Attend and Spell operation is computed using an attention-based LSTM transducer. The probability distribution over the output character y(i) is a function of the decoder state s(i) and the attention context c(i). In turn, the current state s(i) depends on the previous state s(i-1), the previously emitted character y(i-1), and the previous context c(i-1). This relationship is illustrated in Figure 4.
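Written out, following the formulation of the LAS paper [5], the recursion is:

```latex
s_i = \mathrm{RNN}\!\left(s_{i-1},\, y_{i-1},\, c_{i-1}\right) \\
c_i = \mathrm{AttentionContext}\!\left(s_i,\, h\right) \\
P\!\left(y_i \mid x,\, y_{<i}\right) = \mathrm{CharacterDistribution}\!\left(s_i,\, c_i\right)
```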

Speech recognition methods comparison

Fig. 5. WERs (%) on various test sets for the models compared in [12]

Figure 5 presents the Word Error Rates (WERs) of multiple speech recognition systems. In particular, the CTC-grapheme model is based on the CTC architecture presented above, and the Att. 1-/2-layer dec. models use the LAS technique. Implementation details and the training process are available in the associated paper [12].

The WER is a common evaluation metric in the automatic speech recognition field. It compares model outputs to the ground truth using the Levenshtein distance and can be simplified as WER = (S + I + D) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the ground truth. The best-performing model is a LAS using two decoder (or speller) layers. However, its accuracy improvement over a model using a single speller layer is minor considering the additional computational cost. At first sight, it seems like the CTC-based model is not even usable. However, the implementation presented here does not include a language model, which is understandable, as every model would benefit from such an additional component. But as we saw earlier, the main issue with CTC models is their grammatical incorrectness: outputs are still understandable but perform poorly under a raw comparison with a reference text. Graves et al. [2] reached a WER of 6.7% by adding a trigram language model to a CTC-based model. Note that this accuracy can hardly be compared to the results in Figure 5, as they do not share the same training process.
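For reference, here is a minimal word-level implementation of this metric (a sketch, not the scoring script used in [12]):

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance:
    (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i, j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deletions only
    d[0, :] = np.arange(len(hyp) + 1)   # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / len(ref)

print(wer("his friends are here", "his friend are here"))  # 0.25
```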

Speech synthesis

Now that we know how a virtual assistant understands what we are saying, let's look at how it replies. Speech synthesis, or Text-To-Speech (TTS), was traditionally built using two primary strategies: concatenative and parametric. The concatenative approach stitches together sounds from an extensive database to generate new, audible speech. Parametric methods (typically HMM-based) use a model of complex linguistic-acoustic features defined by linguistic experts. However, both approaches are inflexible and sound robotic. As in the speech recognition field, TTS has been greatly enhanced by neural networks.

State-of-the-art deep learning approaches have shown great results; some models generate speech that can be mistaken for a real person's words. There are currently two main TTS frameworks: Deep Voice 3 from Baidu Research [8] and Tacotron 2 from Google [11]. Both models use a Mel spectrogram as an intermediate representation in their pipeline: a spectrogram is generated from the input text, and then the audio waveform is synthesized from this intermediate representation. The spectrogram generation process globally uses the same seq2seq techniques seen earlier in the speech recognition section. In addition, both systems use the audio synthesizer WaveNet [9,10]. We will now focus on WaveNet and its similarities with deep learning techniques applied to image processing.

WaveNet structure

WaveNet is a generative neural network that produces raw audio waveforms given a Mel spectrogram. One could argue that a simple inverse Fourier transform could perform a similar task far more efficiently. Unfortunately, spectrograms generated from text are currently not accurate enough to produce realistic-sounding speech that way. Given the 2D nature of the input, deep learning techniques from the computer vision field are a big inspiration for the task at hand. Auto-regressive models, auto-encoders, and Generative Adversarial Networks (GANs) are popular models for image generation. WaveNet is actually based on the PixelCNN architecture, an auto-regressive model.

The key to WaveNet's success is causal convolutions (CCs). This is a type of convolution suited to temporal data such as waveforms and spectrograms: it ensures that the model respects the chronological ordering of the input. In contrast to a classical convolution, the emitted prediction probability p(x(t+1) | x(1), …, x(t)) at a given time step t does not depend on future input tokens, as shown in Figure 6. Besides, CCs have no recurrent connections, so they are generally faster to train than RNN models.
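As a rough sketch of the idea (written in PyTorch, not the original implementation), causality can be enforced simply by padding the input on the left, so the output at time t never sees samples after t; the dilation argument anticipates the next paragraph:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees past samples: pad on the left only."""
    def __init__(self, channels_in, channels_out, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels_in, channels_out, kernel_size,
                              dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # zeros before the signal, none after
        return self.conv(x)                  # output at t depends only on inputs <= t

layer = CausalConv1d(1, 16, kernel_size=2)
out = layer(torch.randn(1, 1, 100))
print(out.shape)                             # torch.Size([1, 16, 100])
```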

Fig. 6. Visualization of a stack of causal convolutional layers [9]

However, CC models need many layers to increase their receptive field over the input sequence. For example, the receptive field of the model shown in Figure 6 is only 5, yet a single phoneme can spread across several hundred or thousand time steps in a waveform or a high-resolution spectrogram. Dilated convolutions tackle this issue: they make large skips in the input data, giving the network a better global view of it. The concept is easy to understand through a visual analogy: it is similar to widening your field of view to see the entire landscape rather than a single tree in a photograph. As shown in Figure 7, each additional layer doubles the field of view. This method can lead to vast receptive fields without increasing the data resolution inside the network, so computational efficiency is preserved.
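To make the numbers concrete: the receptive field of a stack of causal convolutions with kernel size k and dilations d_1, …, d_n is 1 + (k − 1) · (d_1 + … + d_n). A quick check (the deeper layer counts below are illustrative assumptions, not taken from the paper):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of (dilated) causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

# A stack like Figure 6: four causal conv layers, kernel size 2, no dilation.
print(receptive_field(2, [1, 1, 1, 1]))                      # 5
# A stack like Figure 7: kernel size 2, dilations doubling 1, 2, 4, 8.
print(receptive_field(2, [1, 2, 4, 8]))                      # 16
# A deeper WaveNet-style stack: dilations 1..512, repeated 3 times (illustrative).
print(receptive_field(2, [2 ** i for i in range(10)] * 3))   # 3070
```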

Fig. 7. Visualization of a stack of dilated causal convolutional layers [9]

WaveNet training and acceleration

Yet, training and inference are very computationally heavy. Training typically requires 10 to 30 hours of a person's voice, and the model has to produce 16 to 24 thousand samples for each second of continuous audio. Unfortunately, the original CC architecture imposes generating samples one by one due to its sequential nature. This is a massive waste of computational power, as today's graphics cards rely on heavy parallel computation; ideally, the waveform should be produced in one go, leveraging parallel computation units. The latest variant of WaveNet [10] starts from white noise and applies changes to morph it into the output speech waveform, and this morphing process is parallel over the entirety of the input data. Training such a network requires another pre-trained model: the teacher, a classical sequential WaveNet. The parallel network, the student, tries to mimic what the teacher does while being more efficient. This is possible by feeding the student's output into the teacher during training (see Figure 8), so that the backpropagation algorithm can tweak the student's parameters. This training method is called Probability Density Distillation.
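At its core (simplifying the formulation in [10]), the distillation loss is the KL divergence from the student's output distribution P_S to the teacher's distribution P_T, evaluated on samples drawn from the student:

```latex
D_{\mathrm{KL}}\!\left(P_S \,\|\, P_T\right) = H\!\left(P_S, P_T\right) - H\!\left(P_S\right)
```

Minimizing it pushes the student to place probability mass where the teacher does (the cross-entropy term) while keeping enough entropy to avoid collapsing onto a single waveform.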

Fig. 8. Overview of Probability Density Distillation of WaveNet [10]

This method is relatively similar to the training of GANs, where a generator tries to fool a discriminator while the discriminator attempts to better distinguish fake inputs from real ones. Here, however, the student attempts to mimic the teacher rather than fool it. Note that only the inference process is accelerated, as the training method requires an initial, fully trained sequential model. But the computational efficiency gain is massive at inference time: waveform generation is over a thousand times faster than with the predecessor. Real-time generation is now absolutely feasible on most hardware platforms, even smartphones or home assistants like Google Nest.

WaveNet performance

Evaluating generative models is hard: it is not possible to compare a model's output to an expected result for a given input, since, by definition, generative models create something new and unseen. WaveNet is no exception. The Mean Opinion Score (MOS) is an effective validation metric for speech synthesis: it is a number describing how well a sound sample would pass as natural human speech, obtained empirically through human rating services. Tacotron 2 [11], which uses WaveNet to synthesize audio from a spectrogram generated by a seq2seq model, yields great results. The model has a MOS of 4.526 ± 0.066, where real professional voice recordings reach 4.582 ± 0.053. Google also carried out a user study where people listened to Tacotron 2 samples and professional voice narrators and guessed which one was more natural. You can try this test yourself and hear synthesized samples on the webpage linked in the paper [11]: https://google.github.io/tacotron/publications/tacotron2. Figure 9 presents the results of the study.

Fig. 9. Synthesized vs. ground truth: 800 ratings on 100 items [11]

Most of the time, people cannot tell synthesized and ground-truth samples apart. A small but statistically significant preference for the real recordings was observed, mainly due to mispronunciations of proper nouns and uncommon vocabulary.

The limits of computer vision applied to audio processing

We have now covered several deep architectures that greatly improve audio processing, most of which were originally developed to solve computer vision problems. The time-frequency representation of sound allows treating sounds as 2D data, similarly to images; convolution layers, most specifically, are of great use. Yet, some characteristics of spectrograms impose substantial limitations on how far such techniques can be pushed.

First, sounds are transparent. Visual objects and sound events do not accumulate in the same way. A dark area in a spectrogram indicates an absence of sound, whereas a dark region in an image may simply correspond to a dark object or a shadow. Apart from reflective or transparent surfaces, pixels in an image generally belong to a single entity; discrete sound events, instead, sum together into a single whole. A frequency bin at a given time step cannot be assumed to belong to a single sound with the associated pixel's magnitude: it can be produced by any number of accumulated sounds, or even by complex interactions such as phase cancellation. Separating overlapping sounds is therefore not convenient, whereas one can easily dissociate objects in a picture.
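A tiny illustration of this additivity (plain numpy, with made-up signals): two sources occupying the same time-frequency bin simply add, and with opposite phases they can even cancel, so the magnitude of a pixel alone cannot tell you how many sources produced it:

```python
import numpy as np

sr = 16000                                   # sample rate in Hz
t = np.arange(sr) / sr                       # one second of time
a = np.sin(2 * np.pi * 440 * t)              # source A: a 440 Hz tone
b = np.sin(2 * np.pi * 440 * t + np.pi)      # source B: same tone, opposite phase

mix = a + b                                  # sounds are "transparent": they just sum
print(np.abs(mix).max())                     # ~0.0: complete phase cancellation
print(np.abs(a + a).max())                   # ~2.0: same bin, twice the amplitude
```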

Secondly, the axes of a spectrogram do not have the same meaning as the axes of an ordinary image. The axes of an image describe the projected position of an object: translation or rotation does not alter the nature of the subject, only its position. This is why CNNs are so powerful; a convolution layer dissociates the nature of an object from its position in the image. A dog remains a dog whether it is in the upper-left or lower-right corner of a picture. However, a translation of a spectrogram along the y-axis alters the frequency content of a sound, and a rotation of a spectrogram has no physical interpretation at all, due to the heterogeneous meaning of the two axes. The temporal dimension of a spectrogram also imposes modifications to model architectures, as we saw with causal convolutions.

Thirdly, the spectral properties of sounds are non-local. Neighboring pixels in an image belong to the same object or form a boundary, but this is not the case for sounds. While the fundamental frequency of a sound is local, its timbre is not: the timbre is composed of all the harmonics, which are spaced out along the y-axis. These harmonic frequencies are not close to each other in the spectrogram, yet they belong to the same sound and move together according to a shared relation.

Conclusion

You should now have a better understanding of the advantages of the time-frequency representation in the context of neural networks. The few architectures described here only provide a glimpse of the possible applications. Nevertheless, CTC and seq2seq architectures encapsulate how classical computer vision models can be adapted to audio processing, the critical difference being the chronological nature of spectrograms. Besides, these methods are stepping stones to more advanced applications like voice cloning.

Undoubtedly, computer vision is a great source of inspiration for the analysis of most 2-dimensional data. It has had a significant impact on audio processing because it allowed end-to-end pipelines without creating brand-new architectures. However, the chronological nature of spectrograms remains a serious obstacle between images and audio. While convolution-based networks led to significant accuracy improvements, such layers are not perfectly suited to spectrograms: there are fundamental differences that call for new techniques, as discussed in the previous section on the limits of computer vision applied to audio processing. Speech-related 2D convolution-based techniques may well be approaching their peak performance, and the remaining error percentages are likely due in part to the unusual characteristics of spectrograms.

Thanks for reading. Connect with me on LinkedIn to continue the discussion!

References

[1] Chris Manning and Richard Socher, Lecture 12: End-to-End Models for Speech Processing (2017), Stanford University

[2] Graves, Alex, and Navdeep Jaitly. Towards End-To-End Speech Recognition with Recurrent Neural Networks (2014), ICML. Vol. 14.

[3] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, Learning acoustic frame labeling for speech recognition with recurrent neural networks (2015), IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

[4] Amodei, Dario, et al., Deep speech 2: End-to-end speech recognition in English and Mandarin (2015), arXiv preprint arXiv:1512.02595

[5] William Chan and Navdeep Jaitly and Quoc V. Le and Oriol Vinyals, Listen, Attend and Spell (2015), arXiv:1508.01211

[6] Chung-Cheng Chiu and Tara N. Sainath and Yonghui Wu and Rohit Prabhavalkar and Patrick Nguyen and Zhifeng Chen and Anjuli Kannan and Ron J. Weiss and Kanishka Rao and Ekaterina Gonina and Navdeep Jaitly and Bo Li and Jan Chorowski and Michiel Bacchiani, State-of-the-art Speech Recognition With Sequence-to-Sequence Models (2018), arXiv:1712.01769

[7] N. Jaitly, D. Sussillo, Q. Le, O. Vinyals, I. Sutskever, and S. Bengio, A Neural Transducer (2016), arXiv preprint arXiv:1511.04868

[8] Wei Ping and Kainan Peng and Andrew Gibiansky and Sercan O. Arik and Ajay Kannan and Sharan Narang and Jonathan Raiman and John Miller, Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (2018), arXiv:1710.07654

[9] Aaron van den Oord and Sander Dieleman and Heiga Zen and Karen Simonyan and Oriol Vinyals and Alex Graves and Nal Kalchbrenner and Andrew Senior and Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio (2016), arXiv:1609.03499

[10] Aaron van den Oord and Yazhe Li and Igor Babuschkin and Karen Simonyan and Oriol Vinyals and Koray Kavukcuoglu and George van den Driessche and Edward Lockhart and Luis C. Cobo and Florian Stimberg and Norman Casagrande and Dominik Grewe and Seb Noury and Sander Dieleman and Erich Elsen and Nal Kalchbrenner and Heiga Zen and Alex Graves and Helen King and Tom Walters and Dan Belov and Demis Hassabis, Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017), arXiv:1711.10433

[11] Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and RJ Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (2018), arXiv:1712.05884

[12] Rohit Prabhavalkar and Kanishka Rao and Tara Sainath and Bo Li and Leif Johnson and Navdeep Jaitly, A Comparison of Sequence-to-Sequence Models for Speech Recognition (2017), Interspeech 2017, ISCA
