FVTTS: Face-based Voice Synthesis for Text-To-Speech

Sungkyunkwan University
Interspeech 2024

*Corresponding author

Abstract. A face expresses individual identity and is used in various tasks such as identification, authentication, and personalization. Similarly, a voice is a means of expressing an individual, and personalized voice synthesis based on a voice reference is an active research area. However, voice-based methods are limited by their dependency on voice samples. We propose Face-based Voice synthesis for Text-To-Speech (FVTTS) to synthesize voice from face images, which are more expressive of personal identity than voice samples. A major challenge in face-based TTS is extracting distinct features that are highly related to voice from the face image. Our face encoder is designed to tackle this by integrating global facial attributes with voice-related features to represent personalized characteristics. FVTTS shows superiority on various metrics and adaptability across different data domains. We establish a new standard in face-based TTS, leading the way in personalized voice synthesis.

1. Results on LRS3 dataset

FVTTS synthesizes diverse voices for the faces of multiple speakers. Given different face images of the same speaker, FVTTS generates similar but not identical voices. The synthesized results are shown in the following samples.

Voices synthesized from unseen face images

Face Text Ground Truth yourTTS faceTTS FVTTS (ours)
So when I started my job at the Arnold Foundation I came back to looking at a lot of these questions and I came back.
So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
There's going to be a new system based on donated package tracking technology from the logistics company that I work for.
About a year after 911 researchers examined a group of women who were pregnant when they were exposed to the world.
So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
And he said well I just care so deeply about my customers that I would never sell them one of our crappy products.

Voices synthesized from seen face images

Face Text Ground Truth yourTTS faceTTS FVTTS (ours)
So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
And he was talking about the importance of coaching boys into men and changing the culture of the locker room and giving.
We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.

2. Cross-dataset results

GRID dataset

Using the GRID dataset, which is out of distribution with respect to the seen dataset, we generate new voices.

Face Text Ground Truth yourTTS faceTTS FVTTS (ours)
We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.
We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.
We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.

Animation dataset

To apply FVTTS to animation images, we synthesize voices for animated characters. We select human-like character images from the open-access video-sharing platform YouTube. Note that the model is trained to generate human voices, so the synthesized voices do not resemble the characters' original voices.

Face Text faceTTS FVTTS (ours)
You should see these places, I mean there is a whole world outside of books and maps.
You should see these places, I mean there is a whole world outside of books and maps.
You don't have to live in fear, cause for the first time in forever I will be right there.
Every day's a little harder as I feel my power grow. Don't you know there's part of me that longs to go into the unknown?
But is this what it feels like to be growing apart. When did I become the one who’s always chasing your heart
There is nowhere you could go that I won't be with you and you will do wondrous things.
