Text to speech with Amazon Polly

Uthpala Pitawela
4 min read · Mar 8, 2020

Amazon Polly is an AWS service that converts text into lifelike speech. Polly currently supports a wide variety of languages, with both female and male voices. It can be integrated into applications such as newsreaders, games, eLearning platforms, and accessibility tools for visually impaired people. Some of its benefits are listed below.

  • Natural sounding voices
  • Store and re-distribute speech
  • Real-time streaming
  • Customize and control speech output
  • Low cost

Amazon Polly components

1. Input text

The input text can be provided as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML, the speech can be controlled with respect to pronunciation, pitch, speech rate, and so on.

2. Available voices

Amazon Polly offers a variety of voices across different languages, including female and male voices. A voice must be specified along with the input text in order to produce the audio stream.

3. Output format

Polly provides synthesized speech in multiple formats. Web and mobile applications can request the speech in MP3 or Ogg Vorbis format, while IoT devices and telephony solutions can request it in PCM format.
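The three components map onto the arguments of a single synthesis request. The following is a minimal sketch in Python, assuming the boto3 SDK's synthesize_speech parameters (Text, TextType, VoiceId, OutputFormat). The helper name build_request is hypothetical; it only assembles the arguments and does not contact AWS.

```python
# Hypothetical helper: assembles the keyword arguments for a Polly
# synthesis request without contacting AWS.
AUDIO_FORMATS = {"mp3", "ogg_vorbis", "pcm"}


def build_request(text, voice_id, output_format="mp3"):
    """Return the arguments for one synthesis request as a dict."""
    if output_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    # Input that starts with a <speak> tag is sent as SSML, anything else as plain text.
    text_type = "ssml" if text.lstrip().startswith("<speak") else "text"
    return {
        "Text": text,
        "TextType": text_type,
        "VoiceId": voice_id,
        "OutputFormat": output_format,
    }


request = build_request("Hello from Amazon Polly", "Joanna")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("polly").synthesize_speech(**request)
```

Keeping the request as a plain dict makes it easy to inspect or log before any audio is generated.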

Voices in Amazon Polly

Polly provides a variety of voices across multiple languages. It also provides a special type of voice called a bilingual voice. A bilingual voice is fluent in more than one language, so it can speak words and phrases from both. At the time of writing, there is only one bilingual voice, Aditi, which speaks both Indian English and Hindi.

Different voices speak at different speeds. We can also deliberately change the speed of a voice with the SSML <prosody> tag.
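As an illustration, the speech rate can be set by wrapping text in a <prosody> tag before sending it to Polly. This is a small sketch; the helper name with_rate is hypothetical, and the rate value "slow" is one of the named rates the tag is assumed to accept.

```python
from xml.sax.saxutils import escape, quoteattr


def with_rate(text, rate):
    """Wrap plain text in a <prosody> tag that sets the speech rate."""
    # escape() protects characters such as & and < that would break the SSML.
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


ssml = f"<speak>{with_rate('This sentence is spoken slowly.', 'slow')}</speak>"
```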

SSML

SSML provides additional control over the text that needs to be converted into speech. SSML provides the following options.

  • Long pauses within the text
  • Change the pitch and speech rate
  • Emphasizing specific words and phrases
  • Phonetic pronunciation
  • Breathing sounds
  • Whispering
  • Newscaster speaking style

All SSML-enhanced text must be enclosed in a <speak> tag.

I. Long pause with text

The <break> tag adds a pause. Its length can be set with either of two attributes, strength or time. The strength attribute accepts none, x-weak, weak, medium, strong, and x-strong; the default is medium.

<break strength="medium"/>

The time attribute can be given in seconds or milliseconds.

<break time="3s"/> or <break time="100ms"/>
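The two ways of specifying a pause can be sketched as a small Python helper that emits a well-formed <break> tag. The function name pause is hypothetical; the strength values are the ones listed above, and the time format (digits followed by s or ms) follows the examples in this section.

```python
import re

STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def pause(strength=None, time=None):
    """Build a <break> tag from either a strength or a time, not both."""
    if strength is not None and time is not None:
        raise ValueError("give either strength or time, not both")
    if time is not None:
        # Accept values such as "3s" or "100ms".
        if not re.fullmatch(r"\d+(ms|s)", time):
            raise ValueError(f"invalid time: {time}")
        return f'<break time="{time}"/>'
    strength = strength or "medium"  # medium is the default strength
    if strength not in STRENGTHS:
        raise ValueError(f"invalid strength: {strength}")
    return f'<break strength="{strength}"/>'


ssml = f"<speak>Wait for it{pause(time='3s')}there it is.</speak>"
```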

II. Emphasizing words.

Emphasis changes the rate and volume of the speech. With strong emphasis, the text is spoken louder and slower. Emphasis is controlled by the level attribute, which has three values: strong, moderate, and reduced.

<emphasis level="strong">text</emphasis>

III. Breathing sounds

Breathing sounds make the speech more natural and lifelike. The <amazon:breath> and <amazon:auto-breaths> tags provide breathing effects. There are three breathing modes.

a. Manual mode: Manually set the location, volume and length of the breath sound within the text.

b. Automated mode: Amazon Polly adds the breath within the text.

c. Mixed mode: Mix of the above two modes
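The three modes can be illustrated by building the SSML in Python. This is a sketch under stated assumptions: the helper names breath and auto_breaths are hypothetical, and the attribute names (duration, volume, frequency) and their values are assumed from the Polly SSML reference, so check them against the current documentation before relying on them.

```python
def breath(duration="medium", volume="medium"):
    """Manual mode: one breath sound placed exactly where this tag appears."""
    return f'<amazon:breath duration="{duration}" volume="{volume}"/>'


def auto_breaths(inner_ssml, frequency="medium", volume="medium", duration="medium"):
    """Automated mode: Polly decides where breaths go inside the wrapped SSML."""
    return (
        f'<amazon:auto-breaths volume="{volume}" frequency="{frequency}" '
        f'duration="{duration}">{inner_ssml}</amazon:auto-breaths>'
    )


# a. Manual mode: a single breath at a chosen position.
manual = f"<speak>I stopped. {breath(duration='long')} Then I went on.</speak>"

# b. Automated mode: Polly inserts the breaths.
automated = f"<speak>{auto_breaths('This is a long passage read aloud.')}</speak>"

# c. Mixed mode: automatic breaths plus one manual breath inside them.
mixed = (
    "<speak>"
    + auto_breaths("I took a deep breath " + breath(duration="long") + " and kept walking.")
    + "</speak>"
)
```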

IV. Whispering

With the <amazon:effect name="whispered"> tag, text is whispered rather than spoken normally. The whispering speed can be controlled by nesting a <prosody> tag inside the whispering tag.

<speak>
Normal speech is like this <amazon:effect name="whispered">
<prosody rate="-10%">But the whispering sound is like this
</prosody></amazon:effect>
</speak>

V. Newscaster style

At the time of writing, the newscaster style is available only for the Matthew and Joanna voices, which support it in American English (en-US) with the neural engine. The newscaster tag can be added as follows.

<amazon:domain name="news">text</amazon:domain>

VI. Conversational speaking style

The conversational speaking style is also available only for the Matthew and Joanna voices. This option gives the speech a lifelike conversational effect.

<amazon:domain name="conversational">text</amazon:domain>
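Because both speaking styles depend on the neural voices, a request that uses them also has to select the neural engine. The sketch below assumes boto3's Engine parameter for synthesize_speech; the helper name styled_request is hypothetical, and the two style names are the ones given in this article, whose availability per voice may have changed since it was written.

```python
from xml.sax.saxutils import escape

STYLES = ("news", "conversational")  # style names as given in this article


def styled_request(text, style, voice_id="Joanna"):
    """Assemble request arguments for a speaking style on a neural voice."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    ssml = (
        f'<speak><amazon:domain name="{style}">'
        f"{escape(text)}</amazon:domain></speak>"
    )
    return {
        "Engine": "neural",  # the styles are tied to the neural engine
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }


news = styled_request("Markets closed higher today.", "news", voice_id="Matthew")
# With boto3 and credentials: boto3.client("polly").synthesize_speech(**news)
```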

Lexicons

Lexicons let you customize the pronunciation of words. A lexicon is stored in a particular AWS region and can be used only within that region. For instance, if the text contains W3C but we want it read as World Wide Web Consortium, we can apply a lexicon entry like this one.

<lexeme>
  <grapheme>W3C</grapheme>
  <alias>World Wide Web Consortium</alias>
</lexeme>

Here, alias is the text to be read in place of W3C, so whenever Polly comes across W3C it reads World Wide Web Consortium. At most five lexicons can be applied to a single speech request.
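A lexeme on its own is only a fragment; Polly expects it inside a complete Pronunciation Lexicon Specification (PLS) document. The following sketch builds such a document in Python. The helper name build_lexicon and the lexicon name w3c are hypothetical, and the put_lexicon and LexiconNames usage in the comments assumes the boto3 SDK.

```python
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
PLS_XSD = "http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS document mapping each grapheme to its spoken alias."""
    lexemes = "".join(
        f"  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        f"  </lexeme>\n"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}"\n'
        '    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n'
        f'    xsi:schemaLocation="{PLS_NS} {PLS_XSD}"\n'
        f'    alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}</lexicon>\n"
    )


lexicon_xml = build_lexicon({"W3C": "World Wide Web Consortium"})
# With boto3 and credentials, the lexicon could be stored and then applied:
#   polly = boto3.client("polly")
#   polly.put_lexicon(Name="w3c", Content=lexicon_xml)
#   polly.synthesize_speech(Text="W3C", VoiceId="Joanna",
#                           OutputFormat="mp3", LexiconNames=["w3c"])
```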

Let's log in to the AWS Management Console and experience lifelike speech with Polly. Go to the console and search for the Amazon Polly service.

In the Polly console, the input can be given as plain text or SSML, so you can try out the options explained in this article. When you click the Listen to speech button, you will hear the speech converted from the given text. The converted speech can also be downloaded as an MP3 file.

Amazon Polly can also be used through the AWS SDKs and the AWS CLI.

AWS SDKs: used when integrating Polly into existing applications.

AWS CLI: used to access Polly from the command line without writing any application code.
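As a rough illustration of the SDK route, the sketch below saves synthesized speech to a file with Python. It assumes the boto3 client's synthesize_speech call and the AudioStream field of its response. The function name synthesize_to_file is hypothetical, and FakePolly is an offline stand-in so the example runs without AWS credentials.

```python
import io
import os
import tempfile


def synthesize_to_file(client, text, path, voice_id="Joanna"):
    """Request MP3 speech for text and write the audio bytes to path."""
    response = client.synthesize_speech(
        Text=text, VoiceId=voice_id, OutputFormat="mp3"
    )
    audio = response["AudioStream"].read()
    with open(path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)


class FakePolly:
    """Offline stand-in (hypothetical) that mimics the response shape."""

    def synthesize_speech(self, **kwargs):
        self.last_request = kwargs
        return {"AudioStream": io.BytesIO(b"fake-mp3-bytes")}


fake = FakePolly()
out_path = os.path.join(tempfile.mkdtemp(), "hello.mp3")
written = synthesize_to_file(fake, "Hello from Amazon Polly", out_path)
# Against the real service, pass a boto3 client instead of the stand-in:
#   synthesize_to_file(boto3.client("polly"), "Hello", "hello.mp3")
```

Passing the client in as an argument keeps the function easy to test offline and leaves region and credential setup to the caller.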
