Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts (Google Research, Brain Team) ICLR 2020
논문 리딩 계기: GCT731 페이퍼 리딩 과제
리딩 후 느낀 점: 3월 20일(월)에 리딩 종료 예정.
Abstract
Most generative models of audio directly generate samples in one of two domains: time or ****frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods.
In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness(독립적인 음정 및 음량 제어), realistic extrapolation to pitches not seen during training (훈련과정 중에 보지 못한 음정에 대한 현실적인 외삽), blind dereverberation of room acoustics(실내 음향의 블라인드 리버브제거), transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available and we welcome further contributions from the community and domain experts.
Introduction
Neural networks are universal function approximators in the asymptotic limit (Hornik et al., 1989), but their practical success is largely due to the use of strong structural priors such as convolution (LeCun et al., 1989), recurrence (Sutskever et al., 2014; Williams & Zipser, 1990; Werbos, 1990), and self-attention (Vaswani et al., 2017).
These architectural constraints promote generalization and data efficiency to the extent that they align with the data domain. From this perspective, end-to-end learning relies on structural priors to scale, but the practitioner’s toolbox is limited to functions that can be expressed differentiably.
Here, we increase the size of that toolbox by introducing the Differentiable Digital Signal Processing (DDSP) library, which integrates interpretable signal processing elements into modern automatic differentiation software (TensorFlow). While this approach has broad applicability, we highlight its potential in this paper by exploring the example of audio synthesis.
Objects have a natural tendency to periodically vibrate. Small shape displacements are usually restored with elastic forces that conserve energy (similar to a canonical mass on a spring), leading to a harmonic oscillation between kinetic and potential energy (Smith, 2010). Accordingly, human hearing has evolved to be highly sensitive to phase-coherent oscillation, decomposing audio into spectrotemporal responses through the resonant properties of the basilar membrane and tonotopic mappings into the auditory cortex (Moerel et al., 2012; Chi et al., 2005; Theunissen & Elie, 2014). However, neural synthesis models often do not exploit this periodic structure for generation and perception.

As shown in Figure1, most neural synthesis models generate waveforms directly in the time domain, or from their corresponding Fourier coefficients in the frequency domain. While these representations are general and can represent any waveform, they are not free from bias. This is because they often apply a prior over generating audio with aligned wave packets rather than oscillations. For example, strided convolution models–such as SING (Defossez et al., 2018), MCNN (Arik et al., 2019), and WaveGAN(Donahue et al., 2019)–generate waveforms directly with overlapping frames. Since audio oscillates at many frequencies, all with different periods from the fixed frame hop size, the model must precisely align waveforms between different frames and learn filters to cover all possible phase variations. This challenge is visualized on the left of Figure 1.
Rather than predicting waveforms or Fourier coefficients, a third model class directly generates audio with oscillators. Known as vocoders or synthesizers, these models are physically and perceptually motivated and have a long history of research and applications (Beauchamp, 2007; Morise et al., 2016). These “analysis/synthesis” models use expert knowledge and hand-tuned heuristics to extract synthesis parameters (analysis) that are interpretable (loudness and frequencies) and can be used by the generative algorithm (synthesis).
Neural networks have been used previously to some success in modeling pre-extracted synthesis parameters (Blaauw & Bonada, 2017; Chandna et al., 2019), but these models fall short of end-to-end learning. The analysis parameters must still be tuned by hand and gradients cannot flow through the synthesis procedure. As a result, small errors in parameters can lead to large errors in the audio that cannot propagate back to the network. Crucially, the realism of vocoders is limited by the expressivity of a given analysis/synthesis pair.
1.3 Contributions
In this paper, we overcome the limitations outlined above by using the DDSP library to implement fully differentiable synthesizers and audio effects. DDSP models combine the strengths of the above approaches, benefiting from the inductive bias of using oscillators while retaining the expressive power of neural networks and end-to-end training.