MusicLM: Generating Music From Text

Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank (Google Research) arXiv 2023

Why I read this paper: the professor mentioned it in the GCT731 class!

Thoughts after reading: planning to finish reading on Friday, March 3.

Abstract

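This work introduces MusicLM, a text-to-music model developed at Google Research. MusicLM casts conditional music generation as a hierarchical sequence-to-sequence modeling task. It generates music at a 24 kHz sampling rate that remains consistent over several minutes. Experiments show that MusicLM outperforms previous systems in both audio quality and adherence to the text description (caption). MusicLM can also be conditioned on both text and a melody: it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, the authors release MusicCaps, a dataset of about 5,500 music-text pairs in which human experts wrote a high-quality text caption matching each piece of music.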

Introduction

**Conditional neural audio generation** covers a wide range of applications, including text-to-speech, lyrics-conditioned music generation, and audio synthesis from MIDI sequences. These tasks have been facilitated by temporal alignment between the conditioning signal and the corresponding audio output.

In contrast, inspired by progress in text-to-image generation, recent work has explored generating audio from sequence-wide, high-level captions such as *"Whistling with wind blowing"*.

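Generating audio from such coarse captions is a breakthrough, but these models are still limited to simple acoustic scenes made up of a few acoustic events lasting only a few seconds. Turning a single text caption into a rich audio sequence with long-term structure and many stems, such as a music clip, therefore remains an open challenge.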

**AudioLM (Borsos et al., 2022)** was recently proposed as a framework for audio generation. It casts audio synthesis as a language modeling task in a discrete representation space and leverages a hierarchy of coarse-to-fine audio discrete units (or tokens). AudioLM achieves both high fidelity and long-term coherence over dozens of seconds. Moreover, because it makes no assumptions about the content of the audio signal, it learns to generate realistic audio from audio-only corpora (that is, from speech or piano music alone, without any annotation). This ability to model diverse signals suggests that such a system could generate richer outputs if trained on appropriate data.
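To make "a hierarchy of coarse-to-fine discrete units" a bit more concrete, here is a toy sketch of my own. It is not AudioLM's tokenizer: it quantizes a single number in stages, where each stage emits one integer token for whatever the earlier stages missed. The function names, the step sizes, and the example value are all hypothetical, chosen only for illustration.

```python
def residual_quantize(x, step_sizes):
    """Quantize a scalar in stages; each stage encodes what the previous ones missed."""
    tokens = []
    residual = x
    for step in step_sizes:
        index = round(residual / step)  # discrete token for this level
        tokens.append(index)
        residual -= index * step        # leftover handed to the next, finer level
    return tokens


def reconstruct(tokens, step_sizes):
    """Decode from however many levels of tokens are given (coarse levels first)."""
    return sum(index * step for index, step in zip(tokens, step_sizes))


# Hypothetical step sizes, ordered from coarse to fine.
STEP_SIZES = [0.5, 0.125, 0.03125]

value = 0.7
tokens = residual_quantize(value, STEP_SIZES)

# Reconstruction error when only the first k levels are used.
errors = [
    abs(value - reconstruct(tokens[:k], STEP_SIZES))
    for k in range(1, len(tokens) + 1)
]

for k, err in enumerate(errors, start=1):
    print(f"levels used: {k}, tokens: {tokens[:k]}, abs error: {err:.4f}")
```

The first token alone gives a rough approximation, and each additional token refines it. That is the intuition I take from "coarse-to-fine": a model can first get the coarse tokens right and then fill in detail with the finer ones.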

Besides the inherent difficulty of synthesizing high-quality and coherent audio, another impeding factor is the scarcity of paired audio-text data. This is in stark contrast with the image domain, where the availability of massive datasets contributed significantly to the remarkable image generation quality that has recently been achieved (Ramesh et al., 2021; 2022; Saharia et al., 2022; Yu et al., 2022). Moreover, creating text descriptions of general audio is considerably harder than describing images. First, it is not straightforward to unambiguously capture with just a few words the salient characteristics of either acoustic scenes (e.g., the sounds heard in a train station or in a forest) or music (e.g., the melody, the rhythm, the timbre of vocals and the many instruments used in accompaniment). Second, audio is structured along a temporal dimension which makes sequence-wide captions a much weaker level of annotation than an image caption.