This post introduces WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing text-to-speech systems, reducing the gap with human performance by over 50%.

We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.


Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g., Google Voice Search). However, generating speech with computers (a process usually referred to as speech synthesis or text-to-speech, TTS) is still largely based on so-called concatenative TTS, in which a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example, switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.

This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative TTS. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.

WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.


[Wave animation]

Researchers usually avoid modelling raw audio because it ticks by so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
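The autoregressive factorisation described above can be sketched in a few lines. This is a toy illustration only: `toy_conditional` is a hypothetical stand-in for the network's per-sample predictive distribution, not the actual WaveNet model.

```python
import numpy as np

def toy_conditional(history, n_levels=4):
    """Hypothetical stand-in for the per-sample predictive distribution:
    returns a probability vector over quantised amplitude levels,
    conditioned on all previous samples (here, crudely, via their mean)."""
    logits = np.arange(n_levels, dtype=float)
    if len(history) > 0:
        logits = logits - np.mean(history)  # history shifts the distribution
    e = np.exp(logits - logits.max())
    return e / e.sum()

def joint_log_prob(samples):
    """log p(x) = sum_t log p(x_t | x_1..x_{t-1}): every sample is
    conditioned on all previous ones, the fully autoregressive setup."""
    total = 0.0
    for t, x_t in enumerate(samples):
        probs = toy_conditional(samples[:t])
        total += np.log(probs[x_t])
    return total

print(joint_log_prob([0, 1, 2, 3]))
```

At 16,000 samples per second, this chain of conditionals becomes extremely long, which is exactly why modelling raw audio this way is challenging.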

However, our PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not only one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.

[Architecture animation]

The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow the receptive field to grow exponentially with depth and cover thousands of timesteps.
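The exponential growth of the receptive field is easy to verify: each dilated causal layer extends it by (kernel_size − 1) × dilation timesteps. A minimal sketch, assuming kernel size 2 and a doubling dilation schedule (1, 2, 4, ..., 512), which matches the example schedule discussed in the WaveNet paper:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in timesteps) of a stack of dilated causal
    convolutions: each layer adds (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations make the receptive field grow exponentially with
# depth: 10 layers already cover over a thousand timesteps.
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
print(receptive_field(kernel_size=2, dilations=dilations))  # 1024
```

A plain (undilated) stack of the same depth would only cover 11 timesteps, which is why dilation is essential for audio.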

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
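The sample-and-feed-back loop described above can be sketched as follows. Here `predict_next` is a hypothetical placeholder for the trained network (it just returns random distributions), and the 256-level quantisation is one common choice for 8-bit audio; the point is the loop structure, not the model.

```python
import numpy as np

rng = np.random.default_rng(1)
N_LEVELS = 256  # e.g. 8-bit quantisation of the waveform amplitude

def predict_next(history):
    """Hypothetical stand-in for the trained network: returns a
    probability distribution over the next quantised sample value."""
    logits = rng.normal(size=N_LEVELS)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_waveform(n_samples):
    """One-sample-at-a-time generation: draw a value from the predicted
    distribution, feed it back in, and predict the next step."""
    samples = []
    for _ in range(n_samples):
        probs = predict_next(samples)
        value = rng.choice(N_LEVELS, p=probs)
        samples.append(int(value))
    return samples

audio = sample_waveform(100)
print(len(audio))
```

Because each step depends on the one before, none of this loop can be parallelised, which is what makes sampling expensive: one second of 16kHz audio requires 16,000 sequential passes through the network.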


We trained WaveNet on some of Google's TTS datasets so we could evaluate its performance. The following figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google's current best TTS systems (parametric and concatenative) and with human speech, measured in Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50%, for both US English and Mandarin Chinese.
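The "gap reduced by over 50%" claim is a simple ratio on the MOS scale. A small sketch of the computation, using hypothetical illustrative scores (not the actual study numbers, which are in the paper):

```python
def gap_reduction(baseline_mos, new_mos, human_mos):
    """Fraction of the baseline-to-human MOS gap closed by the new system."""
    return (new_mos - baseline_mos) / (human_mos - baseline_mos)

# Hypothetical scores for illustration only: if the best prior system
# scored 3.9, WaveNet 4.3, and human speech 4.6, WaveNet would have
# closed about 57% of the gap.
print(round(gap_reduction(baseline_mos=3.9, new_mos=4.3, human_mos=4.6), 2))  # 0.57
```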

Google's current TTS systems, for both Mandarin Chinese and US English, are considered among the best worldwide, so improving on both with a single model is a major achievement.


Here are some samples from all three systems so you can listen and compare yourself:




In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, and so on) and feeding it into WaveNet. This means the network's predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.
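Schematically, this conditioning means the predictive distribution at each timestep takes a second input alongside the audio history. The sketch below is purely illustrative: the projection matrix `W` and the per-timestep feature vectors are hypothetical stand-ins for the real conditioning pathway, in which the linguistic features enter through the dilated convolution stack itself.

```python
import numpy as np

rng = np.random.default_rng(2)
N_LEVELS = 256
N_FEATURES = 8

# Hypothetical projection from linguistic features to output logits; in
# the real network the features condition the convolutional layers.
W = rng.normal(size=(N_LEVELS, N_FEATURES))

def predict_next(history, feature_vec):
    """Distribution over the next sample, conditioned on past audio AND
    the linguistic/phonetic features for the current timestep."""
    logits = W @ feature_vec
    if history:  # toy dependence on the previous audio sample
        logits = logits * (1.0 + history[-1] / N_LEVELS)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One feature vector per output timestep (e.g. upsampled phoneme features).
features = rng.normal(size=(50, N_FEATURES))
samples = []
for t in range(features.shape[0]):
    probs = predict_next(samples, features[t])
    samples.append(int(rng.choice(N_LEVELS, p=probs)))
print(len(samples))
```

The key point is that the same generation loop runs as before, but the text-derived features steer each step's distribution toward the sounds the text calls for.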

If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds:

Notice that non-speech sounds, such as breathing and mouth movements, are also sometimes generated by WaveNet; this reflects the greater flexibility of a raw-audio model.

As you can hear from these samples, a single WaveNet is able to learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker. Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
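Conditioning on speaker identity is typically done by feeding the network a vector derived from the speaker's ID, applied globally across all timesteps of an utterance. A minimal sketch of that idea, where the embedding table and dimensions are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N_SPEAKERS = 4
EMBED_DIM = 8

# Hypothetical learned embedding table: each speaker ID maps to a vector
# that conditions every timestep of generation (global conditioning).
speaker_embeddings = rng.normal(size=(N_SPEAKERS, EMBED_DIM))

def speaker_condition(speaker_id):
    """Turn a speaker ID into its conditioning vector via a one-hot lookup."""
    one_hot = np.zeros(N_SPEAKERS)
    one_hot[speaker_id] = 1.0
    return one_hot @ speaker_embeddings

v0 = speaker_condition(0)
v1 = speaker_condition(1)
print(v0.shape)  # (8,)
```

Because the speaker vector is the only thing that changes between voices, all speakers share the rest of the network's parameters, which is consistent with the transfer effect described above: what the model learns from one voice benefits the others.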




Because WaveNets can be used to model any audio signal, we thought it would also be fun to try generating music. Unlike the TTS experiments, we didn't condition the network on an input sequence telling it what to play (such as a musical score); instead, we simply let it generate whatever it wanted to. When we trained it on a dataset of classical piano music, it produced fascinating samples like the following:

WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. In fact, it is really surprising that generating 16kHz audio directly, one timestep at a time, with deep neural networks works at all, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.

For more details, take a look at our paper.