To enable face animation with high-quality synthetic speech on the Internet, Text-to-Speech (TTS) systems need to be implemented as network-based servers shared by many users. The output of a TTS server is used to animate talking heads as defined in MPEG-4. The TTS server creates two sets of data: audio data, and phonemes with optional Facial Animation Parameters (FAPs) such as smile. To animate a talking head on a client, the output of the TTS server must be streamed to that client. Real-time streaming protocols for audio data already exist. We developed a real-time transport protocol with error recovery capability to stream the Phonemes and Facial Animation Parameters (PFAP) used to animate the talking head. The stream is designed for interactive services and allows low-latency communication. The typical bit rate for driving a talking face is less than 800 bit/s.
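The sub-800 bit/s figure can be made plausible with a back-of-the-envelope sketch. The record layout below is purely hypothetical (the abstract does not specify the payload format): a 4-byte record per phoneme carrying a phoneme index, a quantized duration, and an optional FAP identifier with amplitude.

```python
import struct

def encode_record(phoneme_id, duration_ms, fap_id=0, fap_amp=0):
    """Pack one hypothetical PFAP record into 4 bytes.

    Fields (all assumptions, not the actual protocol):
    - phoneme_id: index into a phoneme alphabet (0-255)
    - duration_ms // 4: duration quantized to 4 ms steps
    - fap_id: optional Facial Animation Parameter id (0 = none)
    - fap_amp: FAP amplitude (e.g. intensity of a smile)
    """
    return struct.pack("BBBB", phoneme_id, duration_ms // 4, fap_id, fap_amp)

def bitrate(records_per_second, record_bytes=4):
    """Payload bit rate for a given phoneme rate, ignoring packet headers."""
    return records_per_second * record_bytes * 8

# Conversational speech averages roughly 10-15 phonemes per second, so
# even at 15 records/s this toy encoding needs only 480 bit/s of payload,
# consistent with the under-800 bit/s figure quoted in the abstract.
```

With transport headers amortized over several records per packet, such a stream stays comfortably in the sub-kilobit range, which is why it can ride alongside an existing audio stream at negligible cost.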
Jörn Ostermann, Jürgen Rurainsky, M. Reh