Alicia vs Alexa – Wolf Paulus

Democratizing the creation of delightful audible content

Have you ever heard an interview with Alicia Keys? She spent most of her childhood in Hell’s Kitchen, one of New York City’s toughest neighborhoods. She puts so much emotion into each word, and pronounces it so deliberately and expressively, that you cannot just listen casually; you have to stay attentively engaged.

Contrast this with listening to voice assistants. The defining moment for me was when I heard a cheery voice assistant casually report a fatal car accident. Such tone-deaf insensitivity triggered something deep within me. It prompted me to research voice assistants’ inappropriate intonation and, ultimately, to land a coveted speaking slot at the 2014 Mobile Voice Conference in San Francisco, where I shared my findings.

Speech Synthesis Markup Language

Today’s speech synthesis is much more natural sounding. Ironically, while progressing from extremely robotic to almost indistinguishably human, speech synthesis didn’t travel through the uncanny valley. One research paper offers a speculative hypothesis for this: users’ positive attitude towards new technologies supersedes their primary affinity towards an artificial conversation partner.

Emphasizing in SSML

SSML, the speech synthesis markup language, makes synthesis highly customizable. Just like I would use a bold or strong tag to emphasize a word when writing an HTML document, I can use SSML’s emphasis tag when creating an SSML document.

Let me apply this concept by emphasizing the word ‘lazy’ in the well-known alphabetic pangram “The quick brown fox jumps over the lazy dog.” If I had to speak this sentence, I would probably also say the word “quick” a little faster, just to play up the fox’s agility. SSML supports that too, by allowing me to adjust the speech rate, which is an attribute of the prosody tag.

Here is how we would synthesize this sentence twice, first without, and then with the modifications:

<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US">
	<p>The quick brown fox jumps over the lazy dog.</p>
	<p>The <prosody rate="fast">quick</prosody> brown fox jumps over the <emphasis level="moderate">lazy</emphasis> dog.</p>
</speak>

👂 Listen to the synthesized speech

The next example uses a phonetic pangram and introduces another prosody attribute, pitch, which raises or lowers the tone. It lets the very young “Justin” voice subtly make fun of the French queen.

<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US">
	<p>The beige hue on the waters of the loch impressed all, including the <prosody pitch="high">French queen</prosody>, before she heard that symphony again, just as young Arthur wanted.</p>
</speak>

👂 Listen to the synthesized speech

Emotional Prosody

If I listen to someone saying something very sad, I notice how their voice expresses profound sorrow. The speech rate slows down, the volume softens, and the pitch deepens, particularly towards the end of the sentence.

SSML amplifies these expressions rather well.

<?xml version="1.0"?>
<speak version="1.1" xml:lang="en-US">
	<p><prosody rate="fast">Despite everyone's recent <emphasis level="moderate">efforts</emphasis>,</prosody> this is very sad news for the workforce and<prosody volume="soft" pitch="low" rate="slow"> their families.</prosody></p>
</speak>

This time I used the synthesizer’s “Joey” voice and the prosody attributes for rate, volume, and pitch:

👂 Listen to the synthesized speech

SSML offers much more. The say-as tag’s interpret-as attribute, for instance, can provide additional context. <say-as interpret-as="address">Acacia Dr.</say-as> would result in the synthesis of the expanded name: “Acacia Drive”. <say-as interpret-as="characters">fox</say-as> would spell out each character: “F O X”, while <say-as interpret-as="ordinal">23</say-as> would be synthesized as “twenty-third”.
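The say-as examples above can also be generated rather than typed by hand. Here is a minimal Python sketch, assuming a hypothetical helper named `say_as` (not part of any SSML library). It uses the standard library's ElementTree so that special characters in the text are escaped. Which interpret-as values are understood depends on the synthesizer, so treat the three values below as examples rather than a guaranteed set.

```python
import xml.etree.ElementTree as ET


def say_as(text: str, interpret_as: str) -> str:
    """Return a <say-as> element as a string, with the text XML-escaped.

    `say_as` is a hypothetical helper for illustration only.
    """
    element = ET.Element("say-as", {"interpret-as": interpret_as})
    element.text = text
    return ET.tostring(element, encoding="unicode")


# The three examples from the paragraph above.
print(say_as("Acacia Dr.", "address"))
print(say_as("fox", "characters"))
print(say_as("23", "ordinal"))
```

Building the element instead of concatenating strings also sidesteps the typographic-quote problem: the attribute value is always wrapped in plain straight quotes.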

SSML is an XML-based markup language and SSML documents have a hierarchical tree structure. Elements in an SSML document can contain sub-elements, text, and attributes. The boundaries of elements are delimited by start-tags and end-tags. However, not every element may be nested inside every other. For instance, the emphasis element may contain a say-as element, but not the other way around. While there is a related standard, EmotionML, the SSML standard does not support the encoding of emotions directly. Instead, some speech synthesis service providers now extend SSML with their own proprietary tags, like IBM’s ‘expressive ssml’.
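The nesting rule mentioned above can be checked mechanically before a document is sent to a synthesizer. The following is a rough Python sketch, not a full SSML validator: the function name `nesting_errors` and the `TEXT_ONLY` table are my own assumptions, the table covers only the say-as rule from the text, and the code assumes the document declares no default XML namespace (as in the examples in this post).

```python
import xml.etree.ElementTree as ET

# Elements that may only contain text. Only the say-as rule from the text
# is listed; a complete table would have to come from the SSML specification.
TEXT_ONLY = {"say-as"}


def nesting_errors(ssml: str) -> list[str]:
    """Report child elements found inside text-only elements.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(ssml)
    errors = []
    for element in root.iter():
        if element.tag in TEXT_ONLY:
            for child in element:
                errors.append(f"<{element.tag}> must not contain <{child.tag}>")
    return errors


allowed = '<speak><emphasis><say-as interpret-as="ordinal">23</say-as></emphasis></speak>'
reversed_nesting = '<speak><say-as interpret-as="ordinal"><emphasis>23</emphasis></say-as></speak>'

print(nesting_errors(allowed))           # no errors expected
print(nesting_errors(reversed_nesting))  # one error expected
```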

Experimenting with SSML and putting SSML tags into content can be a fun but also cumbersome and error-prone process.

Firing up the Wayback Machine

WYSIWYG: what you see is what you get. At the beginning of the 1990s, writers could see for the first time what their copy would look like while writing it. I was there and saw what it meant to content creators; this was truly a BIG deal. The word I wanted to emphasize looked bold on the screen, exactly as it would look later on paper when printed.

Fast-forward 30 years: I spend way less time reading and instead scan Instagram and YouTube and listen to Amazon Echo and Google Home devices.

WYSIWYG for Speech Synthesis

Writing copy for audible (rather than visual) consumption is likely to come with a new set of requirements. Hearing the content auralized or audiated (I think we still struggle to agree on an equivalent of the verb “visualize”) will certainly help content creators during the creation process and ultimately benefit listeners.

I believe that frequently listening to my content while writing it is crucial, but I also think additional visual cues could help other content creators and speed up the process. Here are some ideas for how speech synthesis attributes could be mapped to visual styles:

Emphasis → font-weight, the heavier the font, the more emphasis is placed on the word.
Volume → font-size, the larger the font size, the louder the word is spoken.
Speech Rate → letter-spacing, the more a word is spaced out, the slower the word is spoken.
Pitch → color, as suggested in Nicholas Melendez’s Color of Sound Chart
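To make the mapping above concrete, here is a small Python sketch that turns SSML emphasis and prosody values into CSS declarations. The keys are the standard SSML labels (such as "moderate", "soft", "slow", "low"), but every CSS value on the right-hand side is an illustrative assumption of mine. In particular, the pitch colors are placeholders and are not taken from the Color of Sound Chart. The function name `css_for` is hypothetical as well.

```python
# SSML label -> CSS value. All CSS values are illustrative assumptions.
EMPHASIS_TO_WEIGHT = {"reduced": "300", "none": "400", "moderate": "600", "strong": "800"}
VOLUME_TO_SIZE = {"x-soft": "70%", "soft": "85%", "medium": "100%", "loud": "120%", "x-loud": "140%"}
# Slower speech is spaced out more, faster speech is tightened.
RATE_TO_SPACING = {"x-slow": "0.3em", "slow": "0.15em", "medium": "0", "fast": "-0.03em", "x-fast": "-0.06em"}
# Placeholder colors only, NOT the values from the Color of Sound Chart.
PITCH_TO_COLOR = {"x-low": "#14336b", "low": "#1f4e9c", "medium": "#222222", "high": "#c0392b", "x-high": "#e74c3c"}


def css_for(emphasis=None, volume=None, rate=None, pitch=None) -> str:
    """Return a CSS declaration string for the given SSML labels."""
    declarations = []
    if emphasis is not None:
        declarations.append(f"font-weight: {EMPHASIS_TO_WEIGHT[emphasis]}")
    if volume is not None:
        declarations.append(f"font-size: {VOLUME_TO_SIZE[volume]}")
    if rate is not None:
        declarations.append(f"letter-spacing: {RATE_TO_SPACING[rate]}")
    if pitch is not None:
        declarations.append(f"color: {PITCH_TO_COLOR[pitch]}")
    return "; ".join(declarations)


# The styling of "their families." from the sad-news example.
print(css_for(volume="soft", pitch="low", rate="slow"))
```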

With emphasis and the major prosody attributes mapped to CSS properties, it’s a straightforward task to remove the need for XML tags and use a visual editor instead. Add an eraser and a playback button, and we are getting close to an HWYS (Hear What You See) editor.
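Behind such an editor, the markup still has to be produced somewhere. As a rough sketch of that step, assume the editor hands over the text as a list of (text, style) pairs, where the style is a dictionary with an optional "emphasis" level and optional prosody attributes. This data model and the function name `to_ssml` are my assumptions for illustration, not a description of the demo's actual implementation.

```python
import xml.etree.ElementTree as ET


def to_ssml(segments, lang="en-US") -> str:
    """Build a one-paragraph SSML document from (text, style) pairs.

    style may hold "emphasis" (a level) and prosody attributes such as
    "rate", "volume", or "pitch". An empty style means plain text.
    """
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": lang})
    paragraph = ET.SubElement(speak, "p")
    previous = None  # last child element appended to the paragraph

    for text, style in segments:
        if not style:
            # Plain text lives in .text before the first child element,
            # otherwise in the .tail of the preceding child.
            if previous is None:
                paragraph.text = (paragraph.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            continue

        attrs = dict(style)
        level = attrs.pop("emphasis", None)
        if attrs:  # remaining keys are treated as prosody attributes
            outer = ET.SubElement(paragraph, "prosody", attrs)
            inner = ET.SubElement(outer, "emphasis", {"level": level}) if level else outer
        else:
            outer = inner = ET.SubElement(paragraph, "emphasis", {"level": level})
        inner.text = text
        previous = outer

    return ET.tostring(speak, encoding="unicode")


fox = [
    ("The ", {}),
    ("quick", {"rate": "fast"}),
    (" brown fox jumps over the ", {}),
    ("lazy", {"emphasis": "moderate"}),
    (" dog.", {}),
]
print(to_ssml(fox))
```

For the fox sentence, this reproduces the second paragraph of the first SSML example, without the author ever typing a tag.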

Code ≠ Content

Writing clean, error-free code and writing engaging and emphatic content are two different skills, and not everyone masters both equally, if at all.

I think in some small way, WYSIWYG editors democratized content writing, and readers benefited by reading nicely formatted, well-laid-out content from a broader group of authors.

If we want to broaden the group of authors capable of creating delightful audible content, this will require easy-to-use tools. I think removing the need to write XML tags and assisting the generation of synthetic speech visually is a good first step.

Demo

📺 Watch a 90 second demo
