Open Voice: Versatile Instant Voice Cloning. Zengyi Qin, MIT, and others.


Index of Science videos:

https://rumble.com/v406mdz-index-of-robert-heinlein-audiobooks..html


We introduce OpenVoice, a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field:
1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning.
2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language.
OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results on our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell.ai.
One. Introduction.
Instant voice cloning (IVC) in text-to-speech (TTS) synthesis means the TTS model can clone the voice of any reference speaker given a short audio sample without additional training on the reference speaker. It is also referred to as Zero-shot TTS. IVC enables the users to flexibly customize the generated voice and exhibits tremendous value in a wide variety of real-world applications, such as media content creation, customized chatbots, and multi-modal interaction between humans and computers or large language models.
An abundance of previous work has been done in IVC. Examples of auto-regressive approaches include VALLE and XTTS, which extract the acoustic tokens or speaker embedding from the reference audio as a condition for the auto-regressive model. The auto-regressive model then sequentially generates acoustic tokens, which are decoded to a raw audio waveform. While these methods can clone the tone color, they do not allow users to flexibly manipulate other important style parameters such as emotion, accent, rhythm, pauses and intonation. Also, auto-regressive models are relatively computationally expensive and have relatively slow inference speed. Examples of non-autoregressive approaches include YourTTS and the recently developed Voicebox, which demonstrate significantly faster inference speed but are still unable to provide flexible control over style parameters besides tone color.

Another common disadvantage of the existing methods is that they typically require a huge MSML dataset in order to achieve cross-lingual voice cloning. Such a combinatorial data requirement can limit their flexibility to include new languages. In addition, since the voice cloning research by tech giants is mostly closed-source, there is no convenient way for the research community to stand on their shoulders and push the field forward.
We present OpenVoice, a flexible instant voice cloning approach targeted at the following key problems in the field:
1) In addition to cloning the tone color, how can we have flexible control of other important style parameters such as emotion, accent, rhythm, pauses and intonation? These features are crucial for generating in-context natural speech and conversations, rather than monotonously narrating the input text. Previous approaches can only clone the monotonous tone color and style from the reference speaker but do not allow flexible manipulation of styles.
2) How can we enable zero-shot cross-lingual voice cloning in a simple way? We put forward two aspects of zero-shot capabilities that are important but not solved by previous studies:
One, if the language of the reference speaker is not present in the MSML dataset, can the model clone their voice?
Two, if the language of the generated speech is not present in the MSML dataset, can the model clone the reference voice and generate speech in that language?
In previous studies, the language of the reference speaker and the language generated by the model both had to exist in great quantity in the MSML dataset. But what if neither of them exists?
3) How can we realize super-fast real-time inference without degrading the quality, which is crucial in a massive commercial production environment?
To address the first two problems, OpenVoice is designed to decouple the components of a voice as much as possible. The generation of language, tone color, and other important voice features is made independent of each other, enabling flexible manipulation of individual voice styles and language types. This is achieved without labeling any voice style in the MSML training set. We would like to clarify that the zero-shot cross-lingual task in this study is different from that in VALLE-X. In VALLE-X, data for all languages needs to be included in the MSML training set, and the model cannot generalize to an unseen language outside the MSML training set. By comparison, OpenVoice is designed to generalize to completely unseen languages outside the MSML training set. The third problem is addressed by default, since the decoupled structure reduces the requirements on model size and computational complexity. We do not require a large model to learn everything. Also, we avoid the use of auto-regressive or diffusion components to speed up the inference.
Our internal version of OpenVoice, before this public release, was used tens of millions of times by users worldwide between May and October 2023. It powers the instant voice cloning backend of MyShell.ai and has witnessed user growth of several hundredfold on this platform. To facilitate research progress in the field, we explain the technology in great detail and make the source code and model weights publicly available.

Two. Approach.
The technical approach is simple to implement but surprisingly effective. We first present the intuition behind OpenVoice, then elaborate on the model structure and training.
Two point 1. Intuition.
The Hard Part. It is obvious that simultaneously cloning the tone color for any speaker, enabling flexible control of all other styles, and adding a new language with little effort could be very challenging. It requires a huge amount of combinatorial data in which the controlled parameters intersect, pairs of well-labeled samples that differ in only one attribute, and a relatively large-capacity model to fit the dataset.
The Easy Part. We also notice that in regular single-speaker TTS, as long as voice cloning is not required, it is relatively easy to add control over other style parameters and add a new language. For example, recording a single-speaker dataset with 10K short audio samples with labeled emotions and intonation is sufficient to train a single-speaker TTS model that provides control over emotion and intonation.
Adding a new language or accent is also straightforward by including another speaker in the dataset. The intuition behind OpenVoice is to decouple the IVC task into separate subtasks where every subtask is much easier to achieve compared to the coupled task. The cloning of tone color is fully decoupled from the control over all remaining style parameters and languages. We propose to use a base speaker TTS model to control the style parameters and languages, and use a tone color converter to embody the reference tone color into the generated voice.
Two point 2. Model Structure.
We illustrate the model structure in Figure 1. The two main components of OpenVoice are the base speaker TTS model and the tone color converter. The base speaker TTS model is a single-speaker or multi-speaker model, which allows control over the style parameters, for example, emotion, accent, rhythm, pauses and intonation, as well as the language. The voice generated by this model is then passed to the tone color converter, which changes the tone color of the base speaker into that of the reference speaker.
Base Speaker TTS Model.
The choice of the base speaker TTS model is very flexible. For example, the VITS model can be modified to accept style and language embedding in its text encoder and duration predictor. Other choices such as InstructTTS can also accept style prompts. It is also possible to use commercially available (and cheap) models such as Microsoft TTS, which accepts speech synthesis markup language (SSML) that specifies the emotion, pauses and articulation. One can even skip the base speaker TTS model, and read the text by themselves in whatever styles and languages they desire. In our OpenVoice implementation, we used the VITS model by default, but other choices are completely feasible. We denote the outputs of the base model as X (Of LI, SI, and CI), where the three parameters represent the language, styles and tone color respectively. Similarly, the speech audio from the reference speaker is denoted as X (Of LO, SO, and CO).
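To make the SSML option mentioned above more concrete, here is a minimal sketch that builds a small SSML document requesting an emotion, a pause and a slower speaking rate. The voice name, style name and sample text are illustrative assumptions, not values taken from the paper. The express-as element is Microsoft's extension to standard SSML, and the styles on offer depend on the chosen voice. The sketch only constructs and parses the markup; it does not call any speech service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(text_before, text_after, voice="en-US-JennyNeural",
               style="cheerful", pause_ms=500, rate="-10%"):
    """Return an SSML string that requests a style, a pause and a speaking rate.

    The default voice and style are illustrative assumptions.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">'
        f'<prosody rate="{rate}">{escape(text_before)}'
        f'<break time="{pause_ms}ms"/>{escape(text_after)}</prosody>'
        f'</mstts:express-as></voice></speak>'
    )


ssml = build_ssml("Hello there.", "Nice to meet you.")
root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
```

The audio returned by such a request would play the role of the base speaker output, which the tone color converter then re-voices.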
Tone Color Converter.
The tone color converter is an encoder-decoder structure with an invertible normalizing flow in the middle. The encoder is a 1D convolutional neural network that takes the short-time Fourier transformed spectrum of X (Of LI, SI, and CI) as input. All convolutions are single strided.
The feature maps output by the encoder are denoted as Y (Of LI, SI, and CI). The tone color extractor is a simple 2D convolutional neural network that operates on the mel-spectrogram of the input voice and outputs a single feature vector that encodes the tone color information. We apply it on X (Of LI, SI, and CI) to obtain vector v (Of CI), then apply it on X (Of LO, SO, and CO) to obtain vector v (Of CO).
The normalizing flow layers take Y (Of LI, SI, and CI) and v (Of CI) as input and output a feature representation Z (Of LI, and SI) that eliminates the tone color information but preserves all remaining style properties. The feature Z (Of LI, and SI) is aligned with the International Phonetic Alphabet (IPA) along the time dimension. Details about how such a feature representation is learned will be explained in the next section. Then we apply the normalizing flow layers in the inverse direction, which takes Z (Of LI, and SI) and v (Of CO) as input and outputs Y (Of LI, SI, and CO). This is a critical step where the tone color CO from the reference speaker is embodied into the feature maps. Then Y (Of LI, SI, and CO) is decoded into the raw waveform X (Of LI, SI, and CO) by HiFi-GAN, which contains a stack of transposed 1D convolutions. The entire model in our OpenVoice implementation is feed-forward without any auto-regressive component.
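The forward and inverse use of the flow layers described above can be illustrated with a toy conditional affine transform. This is only a sketch under strong assumptions: the real flow layers are learned neural networks, whereas here the mapping from a tone color vector to a per-channel shift and scale is a hand-written stand-in, and all numbers are made up. What it does show is the mechanism the text relies on: running the transform forward with one conditioning vector and backward with another swaps the conditioning, while inverting with the original vector recovers the input exactly.

```python
import math


def _affine_params(v):
    """Toy stand-in for a learned conditioning network: maps a tone color
    vector to a per-channel shift and log-scale. Purely illustrative."""
    shift = list(v)
    log_scale = [0.1 * x for x in v]
    return shift, log_scale


def flow_forward(y, v):
    """Forward direction: strip the conditioning v from the feature maps y."""
    shift, log_scale = _affine_params(v)
    return [[(val - shift[ch]) * math.exp(-log_scale[ch]) for val in row]
            for ch, row in enumerate(y)]


def flow_inverse(z, v):
    """Inverse direction: embody the conditioning v into the features z."""
    shift, log_scale = _affine_params(v)
    return [[val * math.exp(log_scale[ch]) + shift[ch] for val in row]
            for ch, row in enumerate(z)]


# Feature maps with 2 channels and 3 time frames (made-up numbers).
y_base = [[0.5, 1.0, 1.5], [2.0, 2.5, 3.0]]
v_base = [0.2, -0.4]   # tone color vector of the base speaker
v_ref = [1.0, 0.3]     # tone color vector of the reference speaker

z = flow_forward(y_base, v_base)        # base tone color stripped
y_converted = flow_inverse(z, v_ref)    # reference tone color embodied
y_roundtrip = flow_inverse(z, v_base)   # same vector both ways recovers y_base
```

In the actual model, whether the forward output is truly free of tone color depends on the training objective, not on invertibility alone.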

The tone color converter is conceptually similar to voice conversion, but with a different emphasis on its functionality, the inductive bias of its model structure, and its training objectives. The flow layers in the tone color converter are structurally similar to those in flow-based TTS methods but have different functionalities and training objectives.
Alternative Ways and Drawbacks.
Although there are alternative ways to extract Z (Of LI, and SI), we empirically found that the proposed approach achieves the best audio quality. One can use HuBERT to extract discrete or continuous acoustic units to eliminate tone color information, but we found that such a method also eliminates emotion and accent from the input speech. When the input is in an unseen language, this type of method also has issues preserving the natural pronunciation of the phonemes. We also studied another approach that carefully constructs an information bottleneck to preserve only the speech content, but we observed that this method is unable to completely eliminate the tone color.
A Remark on Novelty.
OpenVoice does not intend to invent the submodules in the model structure. Both the base speaker TTS model and the tone color converter borrow the model structure from existing work. The contribution of OpenVoice is the decoupled framework that separates the voice style and language control from the tone color cloning. This is very simple, but very effective, especially when one wants to control styles and accents or generalize to new languages. If one wanted the same control in a coupled framework such as XTTS, it could require a tremendous amount of data and computation, and it would be relatively hard for the model to speak every language fluently.
In OpenVoice, as long as the single-speaker TTS speaks fluently, the cloned voice will be fluent.
Decoupling the generation of voice styles and language from the generation of tone color is the core philosophy of OpenVoice. In our experiment section, we also provide our insights on using flow layers in the tone color converter and on the importance of choosing a universal phoneme system for language generalization.
Two point 3. Training.
In order to train the base speaker TTS model, we collected audio samples from two English speakers, with American and British accents, one Chinese speaker and one Japanese speaker. There are 30K sentences in total, and the average sentence length is 7 seconds. The English and Chinese data have emotion classification labels. We modified the VITS model and input the emotion categorical embedding, language categorical embedding and speaker id into the text encoder, duration predictor and flow layers. The training follows the standard procedure provided by the authors of VITS. The trained model is able to change the accent and language by switching between different base speakers, and to read the input text with different emotions. We also experimented with additional training data and confirmed that rhythm, pauses and intonation can be learned in exactly the same way as emotions.
In order to train the tone color converter, we collected 300K audio samples from 20K individuals. Around 180K samples are English, 60K samples are Chinese and 60K samples are Japanese. This is what we call the MSML dataset. The training objectives of the tone color converter are two-fold.
First, we require the encoder-decoder to produce natural sound. During training, we feed the encoder output directly to the decoder, and supervise the generated waveform using the original waveform with a mel-spectrogram loss and the HiFi-GAN loss. We will not present details here, as they have been well explained in previous literature.
Second, we require the flow layers to eliminate as much tone color information as possible from the audio features. During training, for each audio sample, its text is converted to a sequence of phonemes in IPA, and each phoneme is represented by a learnable vector embedding. The sequence of vector embeddings is passed to a transformer encoder to produce the feature representation of the text content. Denote this feature as L is an element of Real c times l, where c is the number of feature channels and l is the number of phonemes in the input text. The audio waveform is processed by the encoder and flow layers to produce the feature representation Z is an element of Real c times t, where t is the length of the features along the time dimension. Then we align L with Z along the time dimension using dynamic time warping (an alternative is monotonic alignment) to produce L bar is an element of Real c times t, and minimize the KL-divergence between L bar and Z. Since L bar does not contain any tone color information, the minimization objective encourages the flow layers to remove tone color information from their output Z. The flow layers are conditioned on the tone color information from the tone color encoder, which further helps the flow layers to identify what information needs to be eliminated. In addition, we do not provide any style or language information for the flow layers to be conditioned upon, which prevents the flow layers from eliminating information other than tone color.
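The alignment step above, which stretches the l phoneme features to the t audio frames, can be sketched with a small dynamic program. This is an illustrative version of the monotonic-alignment alternative mentioned in the text, not the authors' implementation: it uses squared Euclidean distance as the cost, assumes every phoneme covers at least one frame, stores features as lists of per-position vectors, and leaves out the KL-divergence objective entirely. All function names are our own.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def monotonic_align(phonemes, frames):
    """Assign each frame to one phoneme so that the assignment never moves
    backwards, every phoneme is used at least once, and the total squared
    distance is minimal. Returns one phoneme index per frame."""
    l, t = len(phonemes), len(frames)
    if not 0 < l <= t:
        raise ValueError("need 1 <= number of phonemes <= number of frames")
    inf = float("inf")
    cost = [[inf] * t for _ in range(l)]
    cost[0][0] = sq_dist(phonemes[0], frames[0])
    for j in range(1, t):
        for i in range(min(j + 1, l)):
            stay = cost[i][j - 1]
            advance = cost[i - 1][j - 1] if i > 0 else inf
            cost[i][j] = sq_dist(phonemes[i], frames[j]) + min(stay, advance)
    # Backtrack from the last phoneme at the last frame.
    path, i = [0] * t, l - 1
    for j in range(t - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and cost[i - 1][j - 1] <= cost[i][j - 1]:
            i -= 1
    return path


def expand(phonemes, path):
    """Build the frame-length sequence (L bar) by repeating phoneme vectors."""
    return [phonemes[i] for i in path]


# Toy example: 3 phoneme vectors and 5 frame vectors, one channel each.
phoneme_feats = [[0.0], [1.0], [5.0]]
frame_feats = [[0.1], [0.0], [0.9], [1.1], [4.8]]
path = monotonic_align(phoneme_feats, frame_feats)
l_bar = expand(phoneme_feats, path)
```

After expansion, the text-side features have the same length t as the audio-side features, so the two can be compared frame by frame.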

Since the flow layers are invertible, conditioning them on a new piece of tone color information and running its inverse process can add the new tone color back to the feature representations, which are then decoded to the raw waveform with the new tone color embodied.
Three. Experiment.
It is hard to be objective in the evaluation of voice cloning for several reasons. First, different research studies usually have different training and test sets. The numerical comparison could be intrinsically unfair. Even though metrics such as Mean Opinion Score can be evaluated by crowdsourcing, the diversity and difficulty of the test set would significantly influence the results. For example, if many samples in the test set are neutral voices that concentrate on the mean of human voice distributions, then it is relatively easy for most methods to achieve good voice cloning results.
Second, different studies usually have different training sets, where the scale and diversity would have considerable influence on the results.
Third, different studies can have a different focus on their core functionalities. OpenVoice mainly aims at tone color cloning, flexible control over style parameters, and making cross-lingual voice cloning easy even without massive-speaker data for a new language.
These are different from the objectives of previous work on voice cloning or zero-shot TTS. Therefore, instead of comparing numerical scores with existing methods, we mainly focus on analyzing the qualitative performance of OpenVoice itself, and make the audio samples publicly available for relevant researchers to freely evaluate.
Accurate Tone Color Cloning.
We built a test set of reference speakers selected from celebrities, game characters and anonymous individuals. The test set covers a wide voice distribution, including both expressive, unique voices and neutral samples in the human voice distribution. With any of the 4 base speakers and any of the reference speakers, OpenVoice is able to accurately clone the reference tone color and generate speech in multiple languages and accents. We invite the readers to our demo website for qualitative results.
Flexible Control on Voice Styles.
A premise for the proposed framework to flexibly control the speech styles is that the tone color converter modifies only the tone color and preserves all other styles and voice properties. In order to confirm this, we used both our base speaker model and the Microsoft TTS with SSML to generate a speech corpus of 1K samples with diverse styles (emotion, accent, rhythm, pauses and intonation) as the base voices. After converting to the reference tone color, we observed that all styles are well-preserved. In rare cases, the emotion will be slightly neutralized, and one way that we found to solve this problem is to replace the tone color embedding vector of this particular sentence with the average vector of multiple sentences with different emotions from the same base speaker. This gives less emotion information to the flow layers so that they do not eliminate the emotion. Since the tone color converter is able to preserve all the styles from the base voice, controlling the voice styles becomes very straightforward by simply manipulating the base speaker TTS model. The qualitative results are publicly available on our demo website.
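The averaging workaround described above amounts to an element-wise mean over several tone color vectors of the same base speaker. A minimal sketch follows, with made-up three-dimensional vectors and a hypothetical helper name; real tone color embeddings are much larger and come from the tone color extractor.

```python
def average_embedding(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Made-up tone color vectors of one base speaker, extracted from sentences
# spoken with different emotions.
per_sentence = {
    "neutral": [0.25, 0.5, -0.25],
    "happy": [0.5, 0.25, -0.5],
    "sad": [0.0, 0.75, 0.0],
}
v_base_avg = average_embedding(list(per_sentence.values()))
```

The averaged vector would then be used as the base speaker conditioning in place of the vector extracted from the single sentence being converted.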
Cross-Lingual Voice Clone with Ease.
OpenVoice achieves near zero-shot cross-lingual voice cloning without using any massive-speaker data for an unseen language. It does require a base speaker of the language, which can be obtained with minimal difficulty using off-the-shelf models and datasets. On our website, we provide an abundance of samples that demonstrate the cross-lingual voice cloning capabilities of the proposed approach. The cross-lingual capabilities are two-fold:
When the language of the reference speaker is unseen in the MSML dataset, the model is able to accurately clone the tone color of the reference speaker.
When the language of the generated speech is unseen in the MSML dataset, the model is able to clone the reference voice and speak in that language, as long as the base speaker TTS supports that language.
Fast Inference with Low Cost.
Since OpenVoice is a feed-forward structure without any autoregressive component, it achieves very high inference speed. Our experiment shows that a slightly optimized version of OpenVoice, including the base speaker model and the tone color converter, is able to achieve 12 times real-time performance on a single A10G GPU, which means it only takes 85 milliseconds to generate one second of speech.
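The relation between a real-time factor and the time needed per second of speech is a simple reciprocal. The small sketch below (the function names are our own) shows that 12 times real-time corresponds to roughly 83 milliseconds per second of speech, while 85 milliseconds corresponds to a factor just under 12, so the two quoted figures agree after rounding.

```python
def ms_per_audio_second(realtime_factor):
    """Compute time in milliseconds needed to generate one second of speech."""
    return 1000.0 / realtime_factor


def realtime_factor(ms_per_second_of_audio):
    """Real-time factor implied by a per-second generation time in milliseconds."""
    return 1000.0 / ms_per_second_of_audio


latency_at_12x = ms_per_audio_second(12)   # about 83.3 ms
factor_at_85ms = realtime_factor(85)       # about 11.8 times real-time
```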

Through detailed GPU usage analysis, we estimate that the upper bound is around 40 times real-time, but we leave this improvement as future work.
Importance of IPA.
We found that using IPA as the phoneme dictionary is crucial for the tone color converter to perform cross-lingual voice cloning. As we detailed in Section Two point 3, in training the tone color converter, the text is first converted into a sequence of phonemes in IPA, then each phoneme is represented by a learnable vector embedding. The sequence of embeddings is encoded with transformer layers, and a loss is computed against the output of the flow layers, aiming to eliminate the tone color information. IPA itself is a cross-lingual unified phoneme dictionary, which enables the flow layers to produce a language-neutral representation. Even if we input speech audio in an unseen language to the tone color converter, it is still able to process the audio smoothly. We also experimented with other types of phoneme dictionaries, but the resulting tone color converter tended to mispronounce some phonemes in unseen languages. Although the input audio can be correct, there is a high likelihood that the output audio will be problematic and sound non-native.
Four. Discussion.
OpenVoice demonstrates remarkable instant voice cloning capabilities and is more flexible than previous approaches in terms of voice styles and languages. The intuition behind the approach is that it is relatively easy to train a base speaker TTS model to control the voice styles and languages, as long as we do not require the model to have the ability to clone the tone color of the reference speaker. Therefore, we proposed to decouple the tone color cloning from the remaining voice styles and the language, which we believe is the foundational design principle of OpenVoice. In order to facilitate future research, we have made the source code and model weights publicly available.

References to eighteen other publications in text.
