Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics
Abstract:
Deep generative neural models have been developed to disentangle speaker-related from content-related variations of a speech signal by computing separate embeddings for each. By modifying the speaker embedding while leaving the content embedding untouched, speech signals can be generated with certain desirable acoustic or perceptual properties. Here, we take a closer look at the properties of the disentangled speech representations produced by our fully unsupervised factorized variational autoencoder (FVAE) model. We carry out a statistical analysis of the speaker embeddings to investigate whether they indeed encode acoustic signal properties that are known to be characteristic of a speaker. For example, do the embeddings of speakers with similar acoustic voice quality features indeed form well-defined clusters? We further analyze the speech that is synthesized when the speaker embedding is moved along a trajectory connecting the identified cluster centers, to learn more about the structure of the speaker embedding manifold. For example, will the produced speech change continuously from a male to a female voice when it is resynthesized from speaker embeddings taken along a trajectory connecting the cluster centers of male and female speakers?
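The trajectory-based analysis described above can be sketched as follows. This is a minimal illustration, not the actual analysis code: the embedding dimension, group sizes, and the synthetic Gaussian stand-ins for FVAE speaker embeddings are all assumptions made for the example; in the real study the embeddings would come from the FVAE speaker encoder, and each interpolated embedding would be passed to the decoder for resynthesis.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # speaker-embedding dimension (hypothetical)

# Synthetic stand-ins for speaker embeddings of two speaker groups;
# in the real analysis these would be FVAE speaker embeddings.
male_emb = rng.normal(loc=-1.0, scale=0.3, size=(50, D))
female_emb = rng.normal(loc=+1.0, scale=0.3, size=(50, D))

# Cluster centers, here simply the per-group means.
c_male = male_emb.mean(axis=0)
c_female = female_emb.mean(axis=0)

def interpolate(c_a, c_b, num_steps=5):
    """Embeddings along the straight line from c_a to c_b."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * c_a + a * c_b for a in alphas])

# Each row of `trajectory` would be fed to the decoder, together with
# an unchanged content embedding, to resynthesize speech.
trajectory = interpolate(c_male, c_female, num_steps=5)
print(trajectory.shape)
```

The endpoints of the trajectory reproduce the two cluster centers exactly, so the resynthesized speech at the extremes should match the average male and female voice of the model, with the intermediate steps probing how the voice changes in between.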