Add a Natural Voice to Your Application With AWS Polly

Voice output can take your project to a whole new level. Are you a maker building a new home automation tool or a professional developer working on a commercial gadget? Follow this tutorial to learn how to add a natural voice to your project with very little effort!

TTS = Text To Speech

Text To Speech systems are nothing new. The first time I heard a computer speak was sometime in the last century, on my Commodore C64. It was a barely intelligible, monotone, robotic voice. Yet so exciting!

Fast forward to 2019 and listen to modern, state-of-the-art TTS systems. Alexa, Siri, Cortana and others all sound so natural that they can easily be mistaken for real human speakers. Wouldn’t it be nice to have your Raspberry Pi project talk to you like that?

Meet AWS Polly

Amazon’s cloud platform, AWS, offers many easy-to-use services for various tasks: database and computing services, an IoT broker and various message queues, right up to ready-to-use image recognition. It also offers the topic of today’s article: AWS Polly, a state-of-the-art Text To Speech service.

AWS Polly, the voice behind Amazon Alexa, currently supports 57 voices across 19 languages. You can choose between male and female voices, children and adults, and different accents (G’day Australia!). For some extra experimentation, you can try having a Japanese voice say some English text.

In this post we will create a simple Python 3 app with all the Text To Speech building blocks that you can then reuse in your project.

We are going to use a Raspberry Pi running Raspbian. That’s just to have a baseline platform; there is nothing Raspberry-specific in the code, and it will run just fine on any internet-connected device where you can get Python installed, be it Windows, Mac, or Linux.

Prerequisites

In general, we can configure and consume AWS services in three different ways:

AWS Web console: the GUI
aws-cli: the command line client
AWS SDK: the Software Development Kit. For Python, it’s the boto3 library.

For starters, let’s install the latter two using pip3.

pi@polly:~ $ sudo pip3 install awscli boto3
Collecting awscli
[...]

Installing collected packages: docutils, pyasn1, rsa, urllib3, six, python-dateutil, jmespath, botocore, s3transfer, colorama, PyYAML, awscli, boto3

Successfully installed PyYAML-3.13 awscli-1.16.25 boto3-1.9.15 botocore-1.12.15 colorama-0.3.9 docutils-0.14 jmespath-0.9.3 pyasn1-0.4.4 python-dateutil-2.7.3 rsa-3.4.2 s3transfer-0.1.13 six-1.11.0 urllib3-1.23

Now we can verify that both aws-cli and boto3 work.

pi@polly:~ $ aws --version
aws-cli/1.16.25 Python/3.5.3 Linux/4.14.70-v7+ botocore/1.12.15

pi@polly:~ $ python3 -c "import boto3; print(boto3.__version__)"
1.9.15

We will also need the pygame library. It comes pre-installed in Raspbian, but if you’re following this tutorial on some other platform, you may need to install it using pip3 install pygame.

Testing audio output

As this tutorial is all about audio output, let’s test that pygame can actually play sounds. We’ll make use of one of the audio files from Scratch that comes with Raspbian, or you can play your own mp3, ogg, or wav file.

In your favourite text editor or in a Python IDE like Thonny open a new file audio_test.py and insert this code:

Simple PyGame audio test

#!/usr/bin/env python3

# audio_test.py - simple pygame audio test
# Author Michael Ludvig

import pygame

# PyGame initialisation
pygame.init()
pygame.mixer.init()

# Audio file from Scratch
audio = '/usr/share/scratch/Media/Sounds/Human/Laugh-male1.wav'

# Play the audio and wait for completion
pygame.mixer.music.load(audio)
pygame.mixer.music.play()
clock = pygame.time.Clock()
while pygame.mixer.music.get_busy():
    clock.tick(10)  # poll about 10 times per second

Save it and run it by pressing F5 in Thonny or from the command line with python3 audio_test.py. You should hear a man laughing.

It’s critical to get this working before we move on. No audio working = no Polly talking!
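If the Scratch sample is not available on your machine, the following minimal sketch writes a one-second test tone using only the Python standard library. The file name test_tone.wav, the helper name write_test_tone() and the 440 Hz default are my own choices for illustration, not part of the original code.

```python
#!/usr/bin/env python3

# make_test_tone.py - write a short sine-wave WAV file for the audio test
# (hypothetical helper, not part of the original article's code)

import math
import struct
import wave


def write_test_tone(path, frequency=440.0, seconds=1.0, sample_rate=22050):
    """Write a mono, 16-bit sine wave to `path` and return the frame count."""
    frame_count = int(seconds * sample_rate)
    frames = bytearray()
    for n in range(frame_count):
        # Half of full scale keeps the tone at a comfortable volume
        angle = 2 * math.pi * frequency * n / sample_rate
        sample = int(0.5 * 32767 * math.sin(angle))
        frames += struct.pack('<h', sample)  # little-endian signed 16-bit
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)             # mono
        wav_file.setsampwidth(2)             # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return frame_count


if __name__ == '__main__':
    write_test_tone('test_tone.wav')
    print('Wrote test_tone.wav')
```

Run it once, then point the audio variable in audio_test.py at test_tone.wav.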

AWS Credentials

Amazon offers a free tier for most services, so we can test them without paying a cent. With the Polly free tier we can convert up to 5 million characters per month in the first year of using the service. That should be plenty for most of us, as it’s roughly 5 days of continuous talking!
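To see where an estimate like that comes from, here is a back-of-the-envelope sketch. The speaking rate of 12 characters per second is purely my assumption for illustration; it is not a figure published for Polly, and real speech varies with voice and text.

```python
# free_tier_estimate.py - rough check of the "days of talking" estimate

FREE_TIER_CHARACTERS = 5000000   # monthly free-tier allowance quoted above
CHARACTERS_PER_SECOND = 12       # assumption: rough speaking rate, not a Polly figure


def talking_days(characters, characters_per_second):
    """Convert a character budget into days of continuous speech."""
    seconds = characters / characters_per_second
    return seconds / (24 * 60 * 60)


if __name__ == '__main__':
    days = talking_days(FREE_TIER_CHARACTERS, CHARACTERS_PER_SECOND)
    print('About {:.1f} days of continuous speech'.format(days))
```

A faster assumed rate gives fewer days and a slower one gives more, so treat the result as an order of magnitude only.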

I want to keep this article focused on Polly, so please follow the steps in a side-post to create your AWS credentials. We will need them for the future demos.

Before continuing, make sure that the credentials are correctly configured by listing the available voices.

pi@polly:~ $ aws polly describe-voices
{
    "Voices": [
        {
            "Gender": "Male",
            "Id": "Russell",
            "LanguageCode": "en-AU",
            "LanguageName": "Australian English",
            "Name": "Russell"
        },
        ... many more voices listed ...
    ]
}

If instead you see an error like this, go back to the Credentials article and double-check all the steps.

pi@polly:~ $ aws polly describe-voices
An error occurred (AccessDeniedException) when calling the DescribeVoices operation: User: arn:aws:iam::123456789012:user/polly is not authorized to perform: polly:DescribeVoices

Hello Polly

With the access credentials in place and pygame audio working, we can finally get AWS Polly to say something.

The official AWS SDK (Software Development Kit) for Python is called boto3 and supports almost all AWS services, including Polly. It automatically handles authentication, request signing, response decoding and so on.

To synthesise speech through AWS Polly, we essentially need only one line of Python code. We’ll be calling polly.synthesize_speech() from boto3.

boto3.client('polly').synthesize_speech(
    OutputFormat='ogg_vorbis',
    VoiceId='Brian', 
    Text='Hello, I am Polly! Even though I sound like Brian.')

Of course, to actually play the synthesised speech we will need a few more lines to initialise the pygame audio output. Save this code as audio_helper.py; we will use it later.

#!/usr/bin/env python3

# audio_helper.py - simple PyGame audio player
# Author Michael Ludvig

import io
import pygame

# PyGame initialisation - upon module loading
pygame.init()
pygame.mixer.init()

# Convert boto3 audio stream to Bytes stream
# for compatibility with pygame
def play_audio_stream(audio_stream):
    audio = io.BytesIO(audio_stream.read())
    play_audio(audio)

# Here we play the audio stream
def play_audio(audio):
    pygame.mixer.music.load(audio)
    pygame.mixer.music.play()
    clock = pygame.time.Clock()
    while pygame.mixer.music.get_busy():
        clock.tick(10)  # poll about 10 times per second

With the boring audio stuff out of the way, the actual Polly-related code is a neat, short program. Save it as hello_polly.py.

#!/usr/bin/env python3

# hello_polly.py - Simple AWS Polly demo
# Author Michael Ludvig

# Import play_audio_stream() from audio_helper.py
from audio_helper import play_audio_stream

# Boto3 is the AWS SDK for Python
import boto3

# Initialise AWS Polly client
# AWS Credentials will be read from ~/.aws/credentials
polly = boto3.client('polly')

# Synthesise Text to OGG Vorbis audio
response = polly.synthesize_speech(OutputFormat='ogg_vorbis', VoiceId='Brian',
             Text='Hello, I am Polly! Even though I sound like Brian.')

# Play the returned audio stream - call the function from audio_helper.py
play_audio_stream(response['AudioStream'])

That’s it in a nutshell! Run it with python3 hello_polly.py and if the stars are aligned, audio unmuted, speakers connected, and AWS credentials valid, you should hear it speak.

Alternatively, listen to the output here: hello_polly.ogg
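If you would rather keep the audio than play it straight away, the returned stream can be written to a file. This is a minimal sketch: the helper name save_speech() and the idea of passing the client in as an argument are my own choices for illustration, while synthesize_speech() and the AudioStream key are the same ones used in hello_polly.py.

```python
# save_speech.py - sketch: write Polly output to a file instead of playing it
# (helper and file names are illustrative, not from the original code)


def save_speech(polly, text, path, voice_id='Brian'):
    """Synthesise `text` with the given Polly client and save it to `path`."""
    response = polly.synthesize_speech(OutputFormat='ogg_vorbis',
                                       VoiceId=voice_id, Text=text)
    # AudioStream is a file-like object; read() returns the raw audio bytes
    audio_bytes = response['AudioStream'].read()
    with open(path, 'wb') as audio_file:
        audio_file.write(audio_bytes)
    return len(audio_bytes)
```

With the client from hello_polly.py, a call such as save_speech(polly, 'Hello, I am Polly!', 'hello.ogg') should leave an OGG file that any audio player can open.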

Advanced talking

Just like HTML enriches plain text with bold and italics, paragraphs, and images, SSML — Speech Synthesis Markup Language — introduces similar tags to create more engaging voice output by using different voices, changing tempo, pitch, volume, and so on.

To use SSML, simply wrap the text in <speak>…</speak> tags and add the TextType='ssml' parameter when calling synthesize_speech().

Let’s replace the plain text in hello_polly.py with a simple SSML text and save it as ssml_simple.py. Here are only the changed lines; the rest of the program remains the same.

# SSML-enhanced audio
ssml_text = '''
<speak>
Let me tell you a secret.

Amazon Alexa is my sister!
</speak>
'''

response = polly.synthesize_speech(OutputFormat='ogg_vorbis', VoiceId='Emma',
           TextType='ssml', Text=ssml_text)

The complete list of available SSML tags is documented on Amazon’s SSML Tags Supported by Amazon Polly page.
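When the text comes from outside your program, characters such as & or < would break the markup. The sketch below builds an SSML string from plain sentences, escaping them first and separating them with short pauses. The helper name build_ssml() and its defaults are hypothetical; the break and prosody tags are among those listed on the page mentioned above.

```python
# ssml_builder.py - sketch: build an SSML string from plain sentences
# (helper name and defaults are illustrative)

from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500, rate='medium'):
    """Wrap sentences in <speak>, separated by pauses, at the given rate."""
    pause = '<break time="{}ms"/>'.format(pause_ms)
    # escape() replaces &, < and > so plain text cannot break the markup
    body = pause.join(escape(sentence) for sentence in sentences)
    return '<speak><prosody rate="{}">{}</prosody></speak>'.format(rate, body)


if __name__ == '__main__':
    print(build_ssml(['Let me tell you a secret.',
                      'Tom & Jerry are my friends!'], rate='slow'))
```

The result can be passed as Text together with TextType='ssml', exactly like ssml_text in ssml_simple.py.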

Listen to the output here: ssml_simple.ogg

I hear voices…

In the programs above, we only used the voices of Brian and Emma. Polly, however, knows many more voices that speak different languages with different accents: from English, German and French, through Japanese and Chinese, to the somewhat unexpected Icelandic and Romanian. That is 19 languages in all, several of them with multiple accents, for example British, American, Australian and Indian English.

Listing the available voices is another simple call to the Polly API: polly.describe_voices().

Once we receive the list of voices, we can get each voice to introduce itself. With a little bit of SSML we’ll make sure that the name is said in its native language, but the rest of the sentence is in English. Sometimes with a different accent! With SSML and different voices, we can create a truly multi-cultural experience.

#!/usr/bin/env python3

# describe_voices.py - describe and play all AWS Polly voices
# Author Michael Ludvig

import boto3
from audio_helper import play_audio_stream

polly = boto3.client('polly')

response = polly.describe_voices()  # Optional param: LanguageCode='en-US'
voices = response['Voices']
print("AWS Polly currently supports {} voices".format(len(voices)))

for voice in voices:
    text = "Hi, my name is {Name} and I speak {LanguageName}".format(**voice)
    print(text)

    ssml_text = '''
        <speak>
        Hi, my name is
        <lang xml:lang="{LanguageCode}">{Name}</lang>
        and I speak {LanguageName}.
        </speak>
    '''.format(**voice)
    response = polly.synthesize_speech(OutputFormat='ogg_vorbis',
        LanguageCode='en-GB', VoiceId=voice['Id'],
        TextType='ssml', Text=ssml_text)
    play_audio_stream(response['AudioStream'])

Listen to the output here: describe_voices.ogg
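The list of voices is an ordinary list of dictionaries, so it is easy to reshape. As a minimal sketch, the hypothetical helper below groups voice names by language; the sample data is hand-written in the shape of response['Voices'] and is not real API output.

```python
# voices_by_language.py - sketch: group a describe_voices() result by language
# (helper name is illustrative)

from collections import defaultdict


def voices_by_language(voices):
    """Map each LanguageName to a sorted list of voice names."""
    grouped = defaultdict(list)
    for voice in voices:
        grouped[voice['LanguageName']].append(voice['Name'])
    return {language: sorted(names) for language, names in grouped.items()}


if __name__ == '__main__':
    # Hand-written sample in the shape of response['Voices']
    sample = [
        {'Gender': 'Male', 'Id': 'Russell', 'LanguageCode': 'en-AU',
         'LanguageName': 'Australian English', 'Name': 'Russell'},
        {'Gender': 'Female', 'Id': 'Emma', 'LanguageCode': 'en-GB',
         'LanguageName': 'British English', 'Name': 'Emma'},
        {'Gender': 'Male', 'Id': 'Brian', 'LanguageCode': 'en-GB',
         'LanguageName': 'British English', 'Name': 'Brian'},
    ]
    for language, names in sorted(voices_by_language(sample).items()):
        print('{}: {}'.format(language, ', '.join(names)))
```

In describe_voices.py you would pass the voices variable instead of the sample.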

Now what?

Now is the time to add voice to your projects. How about changing your Raspberry Pi based alarm clock from the boring "beep beep beep" to a personalised "Wake up Michael! Wake up, it’s 10AM already!!"? Or how about upgrading your Twitter display to a Twitter reader? Or getting your door camera to welcome your visitors by name? Of course, that would also need some face recognition. But don’t worry, we will get to that in one of the future articles.
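As a starting point for the alarm clock idea, here is a small sketch that builds the sentence to be spoken from the current time. The helper name wake_up_text() is made up for this example; the resulting string would simply be passed as Text to synthesize_speech().

```python
# wake_up_text.py - sketch: build the alarm-clock sentence for Polly to speak
# (helper name is illustrative)

import datetime


def wake_up_text(name, now):
    """Return a wake-up message that mentions the current hour."""
    hour_12 = now.hour % 12 or 12             # hours 0 and 12 both read as 12
    suffix = 'AM' if now.hour < 12 else 'PM'
    return "Wake up {name}! Wake up, it's {hour}{suffix} already!".format(
        name=name, hour=hour_12, suffix=suffix)


if __name__ == '__main__':
    print(wake_up_text('Michael', datetime.datetime.now()))
```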

And by the way you can download all the code from this article from my AWS Polly GitHub repository and start playing now. Literally.

See you next time!

License and Attribution

This article is based on a work licensed under a Creative Commons Attribution 4.0 International License by the author, Michael Ludvig. The article as it originally appeared is: How to teach your projects to talk with AWS Polly
