Lips don’t lie

Creating a believable virtual agent or avatar involves many dynamic tasks, one of which is coordinating your character’s facial animation with a sound track. For a chat bot, a typical lip-sync process may look something like this:

  1. Generate the voice track by synthesizing the text the bot needs to speak.
  2. Break down the voice track into phonemes, the smallest structural units of sound that distinguish meaning for a language.
  3. Animate the character’s face to synchronize with the phonemes in the dialogue.

Lip sync (short for lip synchronization) is a technical term for matching lip movements with sung or spoken vocals.

The voice track is typically generated using a Text-to-Speech (TTS) engine that either resides on the device displaying the software agent or is part of a web service. Either way, the input is always the (augmented) text and the output is a WAVE or MP3 sound file. Depending on the capabilities of the TTS engine, the input text can be augmented to generate a more expressive voice. This can be achieved, for instance, by simply inserting punctuation marks into the input text or by using markup, as defined in the Speech Synthesis Markup Language (SSML) or the Emotion Markup Language (EmotionML).

Generally, pitch, volume, and speed (aka speech rate) are the dynamic properties of the speech synthesis process. Examples of input text could look like this:

“Hello? Hello. Hello!”
.. or ..

SSML Markup
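
As a rough illustration of the markup approach, the sketch below wraps plain text in a minimal SSML document whose `<prosody>` element sets pitch, rate, and volume. The helper name `to_ssml` and its defaults are assumptions made for this example; which attribute values a given TTS engine actually honors varies by vendor.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text, pitch="medium", rate="medium", volume="medium"):
    """Wrap plain text in a minimal SSML document with one <prosody> element.

    Hypothetical helper for illustration; not part of any TTS vendor's API.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        # quoteattr() adds the surrounding quotes and escapes the value
        f"<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)} "
        f"volume={quoteattr(volume)}>"
        f"{escape(text)}"  # escape &, < and > in the spoken text
        "</prosody></speak>"
    )


# Example: a higher, slower, louder rendition of the same text
print(to_ssml("Hello? Hello. Hello!", pitch="high", rate="slow", volume="loud"))
```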

Phonemic Representation

A phoneme is a basic unit of a language’s phonology. The table below (Source: LumenVox) shows the full set of IPA and X-SAMPA phonemes used by the LumenVox Text-To-Speech engine for American English.


Where a phoneme is a basic acoustic unit of speech, a viseme is the basic visual unit of speech, representing a lip pose. Visemes and phonemes do not share a one-to-one correspondence, as several phonemes look the same on the face when produced.
The table below shows a standard 22-Viseme model with phoneme mapping.

Viseme Model with 22 mouth shapes

The next table shows a simpler viseme model that maps all phonemes onto a set of only 12 lip poses.

Viseme Model with 12 mouth shapes
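
The many-to-one relationship between phonemes and visemes can be sketched as a simple lookup table. The viseme labels below are hypothetical names chosen for this example and do not come from either model above; only the grouping idea (for instance, that p, b, and m all close the lips) is the point.

```python
# Hypothetical viseme labels; a real model would use the IDs from its own table.
PHONEME_TO_VISEME = {
    # Bilabial sounds: the lips are pressed together
    "p": "lips_closed",
    "b": "lips_closed",
    "m": "lips_closed",
    # Labiodental sounds: the lower lip touches the upper teeth
    "f": "lip_to_teeth",
    "v": "lip_to_teeth",
}


def visemes_for(phonemes, default="rest"):
    """Map a phoneme sequence to viseme labels, using a default for unknowns."""
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]


# Three different phonemes collapse into one and the same lip pose
print(visemes_for(["p", "b", "m"]))
```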

Phonemes and Visemes

TTS engines from different vendors take different approaches to delivering the voice sound file and the phoneme information. iSpeech, for instance, requires two web service calls with the same input text: one to request the voice sound file and another to request the time-coded phoneme data.

A much more involved but integrated solution can, for instance, be created with the Acapela TTS engine, using its NSCAPI interface.


Acapela NSCAPI

When a text synthesis request is processed, this TTS engine calls an event handler (CallBackSpeechEvent) every time a new phoneme is generated.
Once the synthesis is complete (or in between, for streaming), another event handler (CallBackSpeechData) is called, informing the client to pick up the voice sound data.

Here is the data structure provided with every phoneme change during synthesis:



As the NSC_EVENT_DATA_PhoSynch data structure shows, the Acapela NSCAPI event provides not only the phoneme but also the viseme (in this case, for the 22-viseme lip-pose model).

Mapping this down to the 12-lip-pose model, however, can be done using this simple Mouth array:

22 to 12 Lip-Poses Mapping

.. and using something like this in the event-handler function:

22 to 12 Lip Poses Mapping
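
The idea of an index-based mapping array used inside an event handler might look roughly like the Python sketch below. The values in `MOUTH` are placeholders, not the actual 22-to-12 assignment, and the `VisemeRecorder` class with its `on_phoneme` signature is a hypothetical stand-in for the NSCAPI callback, not its real interface.

```python
# Hypothetical mapping: index = viseme ID in a 22-pose model (0..21),
# value = lip pose in a 12-pose model (0..11). Placeholder values only.
MOUTH = [0, 1, 1, 2, 2, 3, 4, 4, 5, 5, 6,
         6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11]


class VisemeRecorder:
    """Collects time-coded lip poses as phoneme events arrive."""

    def __init__(self):
        self.events = []  # list of (offset_ms, pose) tuples

    def on_phoneme(self, viseme_id, offset_ms):
        """Stand-in for the per-phoneme event handler (assumed signature)."""
        pose = MOUTH[viseme_id]
        # Design choice of this sketch: record a pose only when it changes,
        # since consecutive visemes can map to the same reduced lip pose.
        if not self.events or self.events[-1][1] != pose:
            self.events.append((offset_ms, pose))


recorder = VisemeRecorder()
for viseme_id, offset_ms in [(1, 0), (2, 80), (5, 150)]:
    recorder.on_phoneme(viseme_id, offset_ms)
print(recorder.events)
```

Dropping repeated poses keeps the time-coded list short, which matters if the list is later embedded in the sound file as metadata.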

The event handler’s NSC_EVID_TEXT_DONE case already hints that the voice sound data is not yet in a usable format: the TTS engine usually creates PCM or WAVE sound files, which need to be converted and compressed using an MP3 encoder.
Not only is MP3 widely supported and the resulting file much smaller than the original PCM wave file, it also supports embedded metadata.
ID3 is a metadata container that can be used in conjunction with the MP3 audio file format. It allows information such as the title, artist, album, track number, and other information about the file to be stored in the file itself. Adding the time-coded viseme data (here in the form of an XML document) right into the MP3 voice sound file makes it easily available in a single web service call response.

PCM to MP3 w/ ID3 comment injection
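
One way to approach this step is to call the LAME command-line encoder from a small wrapper, as sketched below in Python. The function names, the 22.05 kHz mono input format, and the choice to write only an ID3v2 tag are assumptions of this sketch (LAME's help text notes that comments in version 1 tags are limited to 30 characters, which an XML document would exceed); adjust the flags to match the PCM data your TTS engine actually delivers.

```python
import subprocess


def build_lame_command(pcm_path, mp3_path, comment, sample_rate_khz=22.05):
    """Build a LAME command line that encodes raw PCM and sets the ID3 comment.

    Hypothetical helper; the input format is assumed to be 16-bit mono PCM.
    """
    return [
        "lame",
        "-r",                         # input is raw (headerless) PCM
        "-s", str(sample_rate_khz),   # input sampling frequency in kHz
        "-m", "m",                    # mono
        "--id3v2-only",               # write an ID3v2 tag only
        "--tc", comment,              # comment text stored in the tag
        pcm_path,
        mp3_path,
    ]


def pcm_to_mp3(pcm_path, mp3_path, comment):
    """Run LAME; raises CalledProcessError if the encoder reports failure."""
    subprocess.run(build_lame_command(pcm_path, mp3_path, comment), check=True)


# Show the command that would be executed, without running the encoder
print(build_lame_command("voice.pcm", "voice.mp3", "<lipsync/>"))
```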

Lipsync Xml Document
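
A time-coded viseme list could be serialized into a small XML document along these lines. The element and attribute names (`lipsync`, `viseme`, `start`, `pose`) are invented for this sketch and are not taken from the original document format.

```python
import xml.etree.ElementTree as ET


def build_lipsync_xml(events):
    """Serialize (offset_ms, pose) pairs into a compact XML string.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("lipsync")
    for offset_ms, pose in events:
        # One element per pose change, with its start time in milliseconds
        ET.SubElement(root, "viseme", start=str(offset_ms), pose=str(pose))
    return ET.tostring(root, encoding="unicode")


print(build_lipsync_xml([(0, 1), (150, 3)]))
```

A client can then parse the comment tag with any XML parser and schedule each lip pose at its start offset while the audio plays.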

The convert function above requires the open-source LAME MP3 encoder to be installed. LAME is a high-quality MPEG Audio Layer III (MP3) encoder licensed under the LGPL. Installing LAME on RedHat 6, for instance, could be done like so:

Here is a sample MP3 file containing the Lipsync XML document in the MP3 ID3 comment tag:

An MP3 tag editor like Tagger makes the comment tag visible:


