If you want to add non-trivial speech output to your application, whether it’s a desktop, web, or mobile app, you need a way to convert text into speech (TTS) and ultimately deliver it in a sound format (like MP3) that can be played back on end users’ devices. While some operating systems come with TTS capabilities built in, the quality of the voices may vary more than you like, and a user experience spanning multiple OSes and platforms almost always justifies, or even requires, the deployment of a TTS Web service.
All this is old news, of course, and companies like Nuance, iSpeech, or NeoSpeech provide text-to-speech services that vary greatly in price, quality, and performance. Other TTS providers like Acapela or LumenVox lease their TTS server software, i.e. you get a performance-constrained binary that can be deployed on a Red Hat Linux server in your own server room or, for instance, on Amazon Elastic Compute Cloud (Amazon EC2). The obvious advantages over the completely outsourced approach are quality of service (response time) as well as security and privacy.
Every single fire started with a spark
[from Michelle Branch’s “Spark”]
Getting started with something new sometimes requires little more than a spark, which I hope to provide by showing how to use your Mac as a text-to-speech server that converts text strings to MP3 voice sound files on the fly. When we are done, you can request an MP3 sound file in one of two ways: send an HTTP GET request, which streams an MP3 back in return, or send an HTTP POST request and receive a path to the MP3 file, ready to be downloaded once or multiple times.
.. as easy as one, two, three
1. MP3 Encoding
Not only are MP3 files almost universally playable, they are also considerably smaller than WAV or AIFF files, for instance, and still provide decent sound quality. While each and every Mac comes with text-to-speech capabilities built in, the output is AIFF-encoded and is only accessible through a command-line tool.
Mac2Speech, a speech synthesis server for OS X, on the other hand allows you to use your Mac as a text-to-speech server, converting text strings into MP3 voice sound files on the fly.
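For comparison, the built-in command-line route mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming the stock macOS `say` tool with its `-v` (voice) and `-o` (output file) options. The helper name `build_say_command` and the default file name are made up for this example, and the result is still an AIFF file, not an MP3.

```python
import shlex
import subprocess
import sys


def build_say_command(text, voice="Alex", out_path="speech.aiff"):
    """Return the argument list for macOS's `say` tool (hypothetical helper).

    -v selects the voice, -o writes the audio to a file instead of playing it.
    """
    return ["say", "-v", voice, "-o", out_path, text]


if __name__ == "__main__":
    cmd = build_say_command("Every single fire started with a spark")
    print(shlex.join(cmd))
    # `say` only exists on macOS, so only run it there.
    if sys.platform == "darwin":
        subprocess.run(cmd, check=True)
```

Turning that AIFF file into an MP3 and serving it over HTTP would still be up to you, which is the gap the server is meant to close.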
2. TTS Server
There are no prerequisites. Mac2Speech is an HTTP server that is available on port 8080 by default. The HTTP server, the MP3 encoder, and everything else are encapsulated in a single compact binary. A web user interface, accessible in your web browser via http://localhost:8080, allows for rapid testing and experimentation with different languages and voices.
Once launched, the server puts an icon into the Dock which, when control-clicked, exposes a menu for further configuring the application. Here, for instance, the HTTP port can be changed.
3. Download and Go
Enough theory. Now it’s time to download the dmg file, put the application into your Applications folder, and try it out.
Voices and Languages
The Web user interface gives access to all installed languages and voices. Click the row marked with a small arrow to open a list of all currently installed languages and voices. Changing the language also changes the available voices. More importantly, when changing the language, remember to enter the text to be synthesized in that language.
Unfortunately, none of the pre-installed voices is great, but Apple provides free access to much higher-quality voices, if you care enough to install them. ‘Allison’, for instance, sounds much less robotic, although there is still room for improvement.
To install some of the better voices from Apple, open ‘System Preferences’, then ‘Dictation & Speech’. Now click the ‘Text to Speech’ tab, then the currently selected ‘System Voice’ (e.g. Alex), and select Customize… in the drop-down that appears.
Here you can discover (play) and install some amazingly good voices. Please do yourself a favor and install Allison and Tom, two very good American-English voices.
But there is more. Acapela, for instance, offers natural-sounding text-to-speech voices that easily plug into the Mac2Speech TTS server. The Infovox iVox product, developed by the Acapela Group, lets you download and install additional voices into the OS X voices repository. You can install those HQ voices, try them for a few weeks, and then buy them for $20 to $30 each.
TTS Server Usage
HTTP GET Request
Sending an HTTP GET with voice and text parameters results in a response of type MediaType.APPLICATION_OCTET_STREAM (application/octet-stream), i.e. the audio content is streamed directly in the HTTP response.
If the optional save=true parameter-value pair is sent with the request, an additional HTTP header is included to encourage downloading the MP3 instead of playing it directly.
Here is an example:
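As a minimal sketch in Python, such a request could be built and sent as follows. The parameter names voice, text, and save come from the description above. The `/tts` path, the helper names, and the output file name are assumptions made for illustration, so check the server’s web interface for the actual request URL.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # default port of the server
TTS_PATH = "/tts"                   # hypothetical path, NOT taken from the product


def build_get_url(voice, text, save=False, base_url=BASE_URL, path=TTS_PATH):
    """Build the GET URL with URL-encoded voice and text parameters."""
    params = {"voice": voice, "text": text}
    if save:
        # Optional flag that asks the server to encourage downloading.
        params["save"] = "true"
    return f"{base_url}{path}?{urlencode(params)}"


def fetch_mp3(voice, text, out_path="speech.mp3"):
    """Stream the MP3 response body into a local file."""
    with urlopen(build_get_url(voice, text)) as resp, open(out_path, "wb") as f:
        shutil.copyfileobj(resp, f)
    return out_path


if __name__ == "__main__":
    print(build_get_url("Alex", "Hello world", save=True))
```

Encoding the parameters with `urlencode` matters as soon as the text contains spaces, punctuation, or non-ASCII characters, which is the normal case for text to be spoken.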
HTTP POST Request
Sending an HTTP POST with voice and text parameters creates an MP3 file and stores it temporarily on the server.
The URL of that MP3 file is returned and can be requested until the server is restarted, at which point all temporarily created files are deleted.
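The POST variant can be sketched along the same lines. The assumptions here are that the parameters are sent form-encoded, that the response body contains nothing but the URL of the MP3 file, and that the endpoint lives at a hypothetical `/tts` path. None of these details is confirmed above, so treat the snippet as a starting point rather than a reference client.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

TTS_URL = "http://localhost:8080/tts"  # hypothetical endpoint, NOT taken from the product


def build_post_request(voice, text, url=TTS_URL):
    """Create a POST request carrying form-encoded voice and text parameters."""
    body = urlencode({"voice": voice, "text": text}).encode("utf-8")
    return Request(url, data=body, method="POST")


def request_mp3_url(voice, text):
    """Send the POST and return the response body, assumed to be the MP3 URL."""
    with urlopen(build_post_request(voice, text)) as resp:
        return resp.read().decode("utf-8").strip()


if __name__ == "__main__":
    req = build_post_request("Allison", "Hello world")
    print(req.get_method(), req.full_url, req.data)
```

Because the stored files disappear when the server restarts, a client that caches returned URLs should be prepared to repeat the POST if a later download fails.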