Conversational Actions were deprecated on June 13, 2023. For more information, see Conversational Actions sunset.

Speech Synthesis Markup Language (SSML) reference (Beta)

Page Summary

The Actions on Google platform supports several SSML Beta features, including <phoneme>, <say-as interpret-as="duration">, <voice>, <lang>, and Timepoints.
The <phoneme> tag allows for custom pronunciation of words using IPA or X-SAMPA phonetic alphabets.
The <say-as interpret-as="duration"> tag enables specifying durations, which are then read out correctly.
The <voice> tag allows switching between different voices or specifying voice attributes like language and gender within a single request.
The <lang> tag can be used to include text in multiple languages, although the quality of the result may vary depending on the language combination.

The Actions on Google platform supports a number of SSML Beta features in addition to the Actions on Google standard SSML elements.

Summary of supported Beta SSML features:

<phoneme>: Customize the pronunciation of specific words.
<say-as interpret-as="duration">: Specify durations.
<voice>: Switch between voices in the same request.
<lang>: Use multiple languages in the same request.
Timepoints: Use the <mark> tag to return the timepoint of a specified point in your transcript.

`<phoneme>`

You can use the <phoneme> tag to produce custom pronunciations of words inline. Actions on Google accepts the IPA and X-SAMPA phonetic alphabets. See the phonemes page for a list of supported languages and phonemes.

Each application of the <phoneme> tag directs the pronunciation of a single word:

  <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
  <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>
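
The X-SAMPA example above wraps the ph value in single quotes because X-SAMPA marks primary stress with a double quote. When tags are generated programmatically, that quoting is easy to get wrong. The following is a minimal Python sketch, not part of the Actions on Google API: the helper name `phoneme` and the constant `SUPPORTED_ALPHABETS` are assumptions made for illustration, and only the standard library is used.

```python
from xml.sax.saxutils import escape, quoteattr

# The two alphabets named in this reference.
SUPPORTED_ALPHABETS = ("ipa", "x-sampa")

def phoneme(word, ph, alphabet="ipa"):
    """Return a <phoneme> element for a single word (hypothetical helper)."""
    if alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    # quoteattr switches to single quotes when the value contains a double
    # quote, so an X-SAMPA primary stress mark survives inside the attribute.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} "
        f"ph={quoteattr(ph)}>{escape(word)}</phoneme>"
    )

print(phoneme("manitoba", "ˌmænɪˈtoʊbə"))
print(phoneme("mahogany", 'm@"hA:g@%ni:', alphabet="x-sampa"))
```

Each call still covers exactly one word, matching the one-word-per-tag rule stated above.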

Stress markers

There are up to three levels of stress that can be placed in a transcription:

Primary stress: Denoted with ˈ in IPA and " in X-SAMPA.
Secondary stress: Denoted with ˌ in IPA and % in X-SAMPA.
Unstressed: Not denoted with a symbol (in either notation).

Some languages might have fewer than three levels, or might not denote stress placement at all. See the phonemes page for the stress levels available in your language. Stress markers are placed at the start of each stressed syllable. For example, in US English:

Example word	IPA	X-SAMPA
water	`ˈwɑːtɚ`	`` "wA:t@` ``
underwater	`ˌʌndɚˈwɑːtɚ`	`` %Vnd@`"wA:t@` ``
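
To show how the two notations line up, here is a small Python sketch that converts an IPA transcription to X-SAMPA. It is an illustration under stated assumptions: the table `IPA_TO_XSAMPA` and the function `ipa_to_xsampa` are hypothetical names, and the table covers only the symbols that appear on this page, not the full inventory on the phonemes page.

```python
# Partial IPA -> X-SAMPA table: only the symbols used on this page.
IPA_TO_XSAMPA = {
    "ˈ": '"',    # primary stress
    "ˌ": "%",    # secondary stress
    "ː": ":",    # length mark
    "ɑ": "A", "æ": "{", "ʌ": "V", "ɪ": "I", "ɛ": "E",
    "ə": "@", "ɚ": "@`", "ʊ": "U", "ɹ": "r\\", "ɾ": "4",
}

def ipa_to_xsampa(ipa):
    """Convert an IPA string to X-SAMPA using the partial table above."""
    out = []
    for ch in ipa:
        if ch in IPA_TO_XSAMPA:
            out.append(IPA_TO_XSAMPA[ch])
        elif ch.isascii():
            # Plain lowercase letters and "." are written the same way in both.
            out.append(ch)
        else:
            # Fail loudly instead of passing through an unmapped IPA symbol.
            raise ValueError(f"no mapping for {ch!r}")
    return "".join(out)

print(ipa_to_xsampa("ˈwɑːtɚ"))
print(ipa_to_xsampa("ˌmænɪˈtoʊbə"))
```

The stress markers keep their position at the start of the stressed syllable; only the symbol changes.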

Broad vs narrow transcriptions

As a general rule, keep your transcriptions broad and phonemic. For example, in US English, transcribe intervocalic t as t rather than as a tap:

Example word	IPA	X-SAMPA
butter	`ˈbʌtɚ` instead of `ˈbʌɾɚ`	`` "bVt@` `` instead of `` "bV4@` ``

There are some instances where using the phonemic representation makes your TTS results sound unnatural (for example, if the sequence of phonemes is anatomically difficult to pronounce).

One example of this is voicing assimilation for s in English. In this case the assimilation should be reflected in the transcription:

Example word	IPA	X-SAMPA
cats	`ˈkæts`	`"k{ts`
dogs	`ˈdɑːgz` instead of `ˈdɑːgs`	`"dA:gz` instead of `"dA:gs`

Reduction

Every syllable must contain one (and only one) vowel. This means that you should avoid syllabic consonants and instead transcribe them with a reduced vowel. For example:

Example word	IPA	X-SAMPA
kitten	`ˈkɪtən` instead of `ˈkɪtn`	`"kIt@n` instead of `"kItn`
kettle	`ˈkɛtəl` instead of `ˈkɛtl`	`"kEt@l` instead of `"kEtl`

Syllabification

You can optionally specify syllable boundaries by using a period (`.`). Each syllable must contain one (and only one) vowel. For example:

Example word	IPA	X-SAMPA
readability	`ˌɹiː.də.ˈbɪ.lə.tiː`	`%r\i:.d@."bI.l@.ti:`
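
The one-vowel-per-syllable rule can be checked before a transcription is sent. The Python sketch below is a rough check, not an official validator: `VOWELS` is an assumed, deliberately small US English vowel set, `bad_syllables` is a hypothetical helper, and a run of adjacent vowel symbols (such as a diphthong) is counted as a single vowel.

```python
import re

# Assumption: a small US English vowel inventory; extend it for other languages.
VOWELS = "aeiouɑæʌɪɛəɚʊɔ"
# A run of adjacent vowel symbols (for example a diphthong) counts once.
NUCLEUS = re.compile(f"[{VOWELS}]+")

def bad_syllables(ipa):
    """Return the period-delimited syllables that do not have exactly one vowel."""
    # Stress markers are not part of the syllable content, so drop them first.
    syllables = ipa.replace("ˈ", "").replace("ˌ", "").split(".")
    return [s for s in syllables if len(NUCLEUS.findall(s)) != 1]

print(bad_syllables("ˌɹiː.də.ˈbɪ.lə.tiː"))  # every syllable passes
print(bad_syllables("ˈkɪ.tn"))               # syllabic consonant is flagged
```

A flagged syllable such as `tn` is the case the Reduction section describes: add a reduced vowel (`tən`) rather than relying on a syllabic consonant.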

Durations

The Actions on Google platform supports <say-as interpret-as="duration"> to read durations correctly. For example, the following is verbalized as "five hours and thirty minutes":

<say-as interpret-as="duration" format="h:m">5:30</say-as>

The format string supports the following values:

Abbreviation	Value
h	hours
m	minutes
s	seconds
ms	milliseconds
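
A duration tag can be assembled from the abbreviations in the table. The Python sketch below uses a hypothetical helper, `duration_ssml`. Only the `h:m` form is shown in this reference; joining other unit combinations with colons, and leaving values unpadded, are assumptions that should be verified against actual synthesis output.

```python
# Format abbreviations from the table above, largest unit first.
UNITS = ("h", "m", "s", "ms")

def duration_ssml(**parts):
    """Build a duration <say-as> tag from keyword arguments such as h=5, m=30."""
    unknown = set(parts) - set(UNITS)
    if unknown or not parts:
        raise ValueError(f"expected one or more of {UNITS}, got {sorted(parts)}")
    ordered = [u for u in UNITS if u in parts]
    fmt = ":".join(ordered)                               # e.g. "h:m"
    text = ":".join(str(int(parts[u])) for u in ordered)  # e.g. "5:30"
    return f'<say-as interpret-as="duration" format="{fmt}">{text}</say-as>'

print(duration_ssml(h=5, m=30))
```

With `h=5, m=30` the helper reproduces the example shown earlier in this section.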

`<voice>`

The <voice> tag allows you to use more than one voice in a single SSML request. In the following example, the default voice is an English male voice. All words will be synthesized in this voice except for "qu'est-ce qui t'amène ici", which will be verbalized in French using a female voice instead of the default language (English) and gender (male).

<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui
t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>

Alternatively, you can use a <voice> tag to specify an individual voice (the voice name on the supported voices and languages page) rather than specifying a language and/or gender:

<speak>The dog is friendly <voice name="fr-CA-Wavenet-B">mais le chat est
mignon</voice><break time="250ms"/> said a pet shop
owner</speak>

When you use the <voice> tag, Actions on Google expects to receive either a name (the name of the voice you want to use) or a combination of the following attributes. All three attributes are optional but you must provide at least one if you don't provide a name.

gender: One of male, female or neutral.
variant: Used as a tiebreaker in cases where there are multiple possibilities of which voice to use based on your configuration.
language: Your desired language. Only one language can be specified in a given <voice> tag. Specify your language in BCP-47 format. You can find the BCP-47 code for your language in the language code column on the supported voices and languages page.
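
The attribute rules above can be enforced when tags are generated. The following Python sketch is illustrative only: `voice` is a hypothetical helper, and treating name as mutually exclusive with the other attributes is this sketch's reading of "either a name or a combination", not a documented constraint.

```python
from xml.sax.saxutils import escape, quoteattr

GENDERS = ("male", "female", "neutral")

def voice(text, name=None, gender=None, variant=None, language=None):
    """Wrap text in a <voice> tag, checking the attribute rules first."""
    selectors = {"gender": gender, "variant": variant, "language": language}
    selectors = {k: v for k, v in selectors.items() if v is not None}
    if name is not None and selectors:
        # Assumption: name and the selector attributes are not combined.
        raise ValueError("pass either name or gender/variant/language, not both")
    if name is None and not selectors:
        raise ValueError("pass a name or one or more of gender, variant, language")
    if gender is not None and gender not in GENDERS:
        raise ValueError(f"gender must be one of {GENDERS}")
    attrs = {"name": name} if name is not None else selectors
    rendered = " ".join(f"{k}={quoteattr(str(v))}" for k, v in attrs.items())
    return f"<voice {rendered}>{escape(text)}</voice>"

print(voice("qu'est-ce qui t'amène ici", language="fr-FR", gender="female"))
print(voice("bonjour", name="fr-CA-Wavenet-B"))
```

The helper does not check that a language code or voice name actually exists; those values still need to come from the supported voices and languages page.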

You can also control the relative priority of the gender, variant, and language attributes using two additional attributes: required and ordering.

required: If an attribute is designated as required and is not configured properly, the request fails.
ordering: Any attributes listed in ordering are treated as preferred rather than required. Preferred attributes are considered on a best effort basis, in the order in which they are listed. If any preferred attribute is configured incorrectly, Actions on Google might still return a valid voice, but with the incorrect configuration dropped.

Examples of configurations using required and ordering:

<speak>And there it was <voice language="en-GB" gender="male" required="gender"
ordering="gender language">a flying bird </voice>roaring in the skies for the
first time.</speak>

<speak>Today is supposed to be <voice language="en-GB" gender="female"
ordering="language gender">Sunday Funday.</voice></speak>
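
Because a misconfigured required entry makes the request fail, a quick local check can catch obvious mistakes. The Python sketch below is a hypothetical lint, not platform behavior: `check_voice_priorities` is an invented name, and it assumes that required, like ordering in the examples above, takes a space-separated list of attribute names.

```python
import xml.etree.ElementTree as ET

# The three attributes whose priority can be controlled.
SELECTORS = {"gender", "variant", "language"}

def check_voice_priorities(ssml):
    """Return a list of problems found in the <voice> tags of an SSML string."""
    problems = []
    for v in ET.fromstring(ssml).iter("voice"):
        for attr in ("required", "ordering"):
            for name in v.get(attr, "").split():
                if name not in SELECTORS:
                    problems.append(f"{attr} lists unknown attribute {name!r}")
                elif v.get(name) is None:
                    problems.append(f"{attr} lists {name!r}, but the tag does not set it")
    return problems

ok = ('<speak>And there it was <voice language="en-GB" gender="male" '
      'required="gender" ordering="gender language">a flying bird </voice>'
      'roaring in the skies for the first time.</speak>')
print(check_voice_priorities(ok))
```

An empty list means only that the attribute names are consistent; it says nothing about whether a matching voice exists.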

`<lang>`

You can use <lang> to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. The xml:lang string must contain the target language in BCP-47 format (this value is listed as "language code" in the supported voices table). In the following example, "chat" will be verbalized in French instead of the default language (English):

<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang></speak>

The Actions on Google platform supports the <lang> tag on a best effort basis. Not all language combinations produce the same quality results when specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle, or that is perceived as negative. Known issues:

Japanese with Kanji characters is not supported by the <lang> tag. The input is transliterated and read as Chinese characters.
Semitic languages such as Arabic and Hebrew, as well as Persian, are not supported by the <lang> tag and will result in silence. If you want to use any of these languages, we recommend using the <voice> tag to switch to a voice that speaks your desired language (if available).
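
The recommendation for Arabic, Hebrew, and Persian can be applied automatically when SSML is generated. The Python sketch below is one possible approach under stated assumptions: `foreign` and `LANG_TAG_UNSUPPORTED` are hypothetical names, and the check looks only at the primary BCP-47 subtag (`ar`, `he`, `fa`). It does not address the Japanese Kanji issue.

```python
from xml.sax.saxutils import escape, quoteattr

# Primary BCP-47 subtags for Arabic, Hebrew, and Persian.
LANG_TAG_UNSUPPORTED = {"ar", "he", "fa"}

def foreign(text, language):
    """Wrap text in <lang>, or in <voice> for languages <lang> cannot speak."""
    primary = language.split("-")[0].lower()
    if primary in LANG_TAG_UNSUPPORTED:
        # <lang> would produce silence here, so switch voices instead.
        return f"<voice language={quoteattr(language)}>{escape(text)}</voice>"
    return f"<lang xml:lang={quoteattr(language)}>{escape(text)}</lang>"

print(foreign("chat", "fr-FR"))
print(foreign("سلام", "fa-IR"))
```

Whether a voice is available for the fallback language still has to be confirmed on the supported voices and languages page.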