FVTTS: Face-based Voice Synthesis for Text-To-Speech

1. Results on LRS3 dataset

FVTTS synthesizes diverse voices for multiple speakers from their face images. Given different face images of the same speaker, FVTTS generates similar but not identical voices. The samples below show the synthesized results.

Voices synthesized from unseen face images

Face	Text	Ground Truth	yourTTS	faceTTS	FVTTS (ours)
	So when I started my job at the Arnold Foundation I came back to looking at a lot of these questions and I came back.
	So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
	There's going to be a new system based on donated package tracking technology from the logistics company that I work for.
	About a year after 911 researchers examined a group of women who were pregnant when they were exposed to the world.
	So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
	And he said well I just care so deeply about my customers that I would never sell them one of our crappy products.

Voices synthesized from seen face images

Face	Text	Ground Truth	yourTTS	faceTTS	FVTTS (ours)
	So people hear about this study and they're like great if I want to get better at my job I just need to upgrade my browser.
	And he was talking about the importance of coaching boys into men and changing the culture of the locker room and giving.
	We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.

2. Cross-dataset results

GRID dataset

On the GRID dataset, which is out of distribution with respect to the dataset seen during training, we generate new voices.

Face	Text	Ground Truth	yourTTS	faceTTS	FVTTS (ours)
	We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.
	We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.
	We have ideas for how to make things better and I want to share three of them that we've picked up in our own work.

Animation dataset

To apply FVTTS to animation images, we synthesize voices for animated characters. We select human-like character images from the open-access video sharing platform YouTube. Note that the model is trained to generate human voices, so the synthesized voices do not resemble the characters' original voices.

Face	Text	faceTTS	FVTTS (ours)
	You should see these places, I mean there is a whole world outside of books and maps.
	You should see these places, I mean there is a whole world outside of books and maps.
	You don't have to live in fear, cause for the first time in forever I will be right there.
	Every day's a little harder as I feel my power grow. Don't you know there's part of me that longs to go into the unknown?
	But is this what it feels like to be growing apart. When did I become the one who’s always chasing your heart
	There is nowhere you could go that I won't be with you and you will do wondrous things.