Setup for acoustic-visual speech synthesis by concatenating bimodal units

This paper presents preliminary work on building a system able to synthesize concurrently the speech signal and a 3D animation of the speaker's face. This is done by concatenating bimodal diphone units, that is, units that comprise both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. Unit selection is based on classic target and join costs from acoustic-only synthesis, which are augmented with a visual join cost. Preliminary results indicate the benefits of the approach, since both the synthesized speech signal and the face animation are of good quality. Planned improvements and enhancements to the system are outlined.
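The unit-selection idea in the abstract (target and join costs from acoustic-only synthesis, augmented with a visual join cost) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Unit` fields, the Euclidean distances, the weights `w_acoustic` and `w_visual`, and the dynamic-programming search are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Unit:
    """A candidate bimodal diphone unit (all fields are hypothetical)."""
    diphone: str
    target_cost: float
    acoustic_start: tuple  # acoustic features at the unit's left edge
    acoustic_end: tuple    # acoustic features at the unit's right edge
    visual_start: tuple    # facial features at the unit's left edge
    visual_end: tuple      # facial features at the unit's right edge


def join_cost(left, right, w_acoustic=1.0, w_visual=1.0):
    """Weighted sum of the acoustic and visual discontinuity at a join."""
    acoustic = dist(left.acoustic_end, right.acoustic_start)
    visual = dist(left.visual_end, right.visual_start)
    return w_acoustic * acoustic + w_visual * visual


def select_units(candidates, w_acoustic=1.0, w_visual=1.0):
    """Dynamic-programming search; candidates[i] lists the units for slot i.

    Returns (total_cost, path) for the cheapest sequence of units.
    """
    # Each entry holds the cheapest path that ends in a given unit.
    best = [(u.target_cost, [u]) for u in candidates[0]]
    for slot in candidates[1:]:
        new_best = []
        for unit in slot:
            cost, path = min(
                ((c + join_cost(p[-1], unit, w_acoustic, w_visual), p)
                 for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + unit.target_cost, path + [unit]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

Under these assumptions, setting `w_visual=0.0` reduces the search to acoustic-only selection, so the same candidate sets can be compared with and without the visual term.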

Asterios Toutios, Utpala Musti, Slim Ouni, Vincent Colotte, Brigitte Wrobel-Dautcourt, Marie-Odile Berger

Bimodal Diphone Units | INTERSPEECH 2010 | Join Cost | Signal Processing | Speech Signal

Post Info

Added	18 May 2011
Updated	18 May 2011
Type	Conference
Year	2010
Where	INTERSPEECH
Authors	Asterios Toutios, Utpala Musti, Slim Ouni, Vincent Colotte, Brigitte Wrobel-Dautcourt, Marie-Odile Berger
