How to get the best Replica Voice performance

Here are some tips to help you get the best performance from Replica Voices. Please note that no technique is guaranteed to work on every voice or every time you generate speech, so it may take a few tries to get the result you want. As this technology is still in its infancy, we kindly ask that you be patient and experiment with the text you input until you get the results you desire.

Choosing the right voice for the job

Each Replica Voice is intended for a specific kind of performance.

There are currently three categories:

  1. Voice-over: Advertising, video, podcasts
  2. Character: Gaming, animation, fiction podcasts
  3. Narration: Audiobooks, apps, video, podcasts

We therefore recommend checking that the voice you choose suits the script for your project.

In the two examples below you'll notice that Ava doesn't do as good a job as Lily in performing a line that was written for a fictional character.

Oh, look at the time. I better go. I'd love to sit around all day and chat but I got things to do and places to be.

Ava: Narration voice

Lily: Character voice

Notice how Lily's delivery is more natural. Lily's style of speaking is more casual and even has a touch of shyness to it, whereas Ava's speaking style is very direct and informative, which is what she was designed for.

On occasion, you will notice that some Replica Voices can perform two kinds of scripts.

Ava: Informative script

Ava: Character script (with reverb to enhance delivery)

While Ava's voice is human-like, her speaking style allows her to double as an AI character 😉

It's worth spending time experimenting with each voice as you might find a similar result.

Fixing pronunciation

Because English is not a phonetic language, it can be hard for a Replica Voice to tell apart two words that are spelled the same but pronounced differently. These words are called 'homographs'.

For example, consider the phrase: 'The dove dove down'.

The first dove represents the bird, and the second dove represents the past tense of the word 'dive'.

As humans, we know how to pronounce the first and second 'dove', whereas a Replica Voice does not, because it has not learned the intricacies of language; it only mimics the sound.

The remedy for this is to spell the word phonetically. For example: 'The dove dhove down'.

We are working to improve our pronunciation model, and are confident we'll be able to solve this in the near future.

Until then, we recommend spelling words phonetically until you get the desired pronunciation.
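If you prepare scripts in bulk, the respelling can be applied with a small helper before you paste the text in. The following is only a minimal sketch in Python: the function name `respell_occurrence` is hypothetical and is not part of any Replica product or API. It replaces just one chosen occurrence of a word, so the other occurrence keeps its normal spelling.

```python
import re


def respell_occurrence(script: str, word: str, respelling: str, occurrence: int) -> str:
    """Replace only the nth whole-word occurrence (1-based) of `word`.

    Hypothetical helper for illustration; it only edits text and does
    not call any speech service.
    """
    # \b keeps us from matching inside longer words such as 'dovetail'.
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    matches = list(pattern.finditer(script))
    if not 1 <= occurrence <= len(matches):
        raise ValueError(
            f"'{word}' occurs {len(matches)} time(s); cannot replace occurrence {occurrence}"
        )
    target = matches[occurrence - 1]
    return script[: target.start()] + respelling + script[target.end():]


# Respell the second 'dove' (the past tense of 'dive').
print(respell_occurrence("The dove dove down", "dove", "dhove", 2))
# prints: The dove dhove down
```

The respelling itself ('dhove' here) still has to be found by ear, so expect to try a few spellings before the voice pronounces the word the way you want.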

What does punctuation do?

Using punctuation, you can change the way a line is delivered and potentially change the entire feeling of the delivery.

Here are some examples using the Replica Voice, Deckard.

'My name is Deckard.'

Deckard by default has a dark and mysterious voice.

'My name, is Deckard.'

Adding a comma in the middle of this statement changes the delivery only slightly, but Deckard now sounds more serious, like someone you don't want to mess with.

'My name, is, Deckard.'

Adding another comma after the word 'is' makes Deckard sound slightly confused, like he's been hit over the head and is trying to recall who he is 😅

Too much punctuation can also make the delivery sound more robotic.

It may seem simple, but punctuation can make a significant difference in the way a Replica Voice delivers a line.
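To try several punctuation placements quickly, you could generate the candidate lines up front and audition each one. The sketch below is an illustration only: `comma_variants` is a hypothetical helper, not a Replica feature, and it simply inserts a single comma after each word in turn.

```python
from typing import List


def comma_variants(line: str) -> List[str]:
    """Return the original line plus one variant per possible comma position.

    Hypothetical helper for illustration. A comma is added after each word
    except the last, skipping words that already end in punctuation.
    """
    words = line.split()
    variants = [line]
    for i in range(len(words) - 1):
        if words[i].endswith((",", ".", "!", "?", ";", ":")):
            continue  # already punctuated, leave it alone
        candidate = words[:i] + [words[i] + ","] + words[i + 1:]
        variants.append(" ".join(candidate))
    return variants


for variant in comma_variants("My name is Deckard."):
    print(variant)
# prints:
# My name is Deckard.
# My, name is Deckard.
# My name, is Deckard.
# My name is, Deckard.
```

Only one comma is added per variant on purpose, since too much punctuation tends to make the delivery sound robotic.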

Single line vs multiline

While the technology is still improving, we have noticed that some Replica Voices deliver text differently depending on how long a line is and whether the text is entered as a single line or split across multiple lines.

Here's an example of a single line vs a multiline using the Replica Voice, Stone.

Today, we stand for all human-kind. Today we march into the gates of hell, and destroy the Sin Heart for good.

Today, we stand for all human-kind.
Today we march into the gates of hell and destroy the Sin Heart for good.

Notice how putting the second sentence onto a second line has a slight effect? Additionally, Stone pronounces the word 'human-kind' properly. We have noticed that this technique sometimes has a very obvious effect and sometimes barely any at all.

It's always worth experimenting with multiple lines as it can fix some things like the flow of how a line is delivered and the pronunciation of words.
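If you want to produce the multiline version of a longer script without editing it by hand, a short script can do the splitting. This is a minimal sketch, assuming sentences end with '.', '!' or '?' followed by whitespace; `one_sentence_per_line` is a hypothetical name and the splitting rule is deliberately simple (it will also split after abbreviations such as 'Dr.').

```python
import re


def one_sentence_per_line(script: str) -> str:
    """Put each sentence of `script` on its own line.

    Hypothetical helper for illustration. It splits on whitespace that
    follows '.', '!' or '?', which is a rough approximation of a
    sentence boundary.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    return "\n".join(s for s in sentences if s)


single_line = (
    "Today, we stand for all human-kind. "
    "Today we march into the gates of hell and destroy the Sin Heart for good."
)
print(one_sentence_per_line(single_line))
# prints:
# Today, we stand for all human-kind.
# Today we march into the gates of hell and destroy the Sin Heart for good.
```

Generating both the single-line and multiline versions lets you compare them side by side and keep whichever the voice delivers better.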

Experimenting with the script

Sometimes, a Replica Voice just won't deliver the script in a way that you like. We've noticed that by modifying the script slightly you can get a better sounding result.

Here's an example using the Replica Voice, Stone.

Today, we stand for all human-kind. Today we march into the gates of hell, and destroy the Sin Heart for good.

Today, we stand for all human-kind. Today we march into the gates of hell. Today we destroy the sin heart, for good.

Notice how breaking the last sentence into two sentences and slightly modifying its wording gives a better result? While this does not work consistently, the technique is worth trying if you're not able to generate speech to your liking.

Using styles

You can use styles to modify the way a Replica Voice sounds. However, as with all the techniques listed on this page, the styles are experimental and the technology is still young, so the effect of a style is not consistent across voices or sentences.

Some voices react differently to a style compared to others and we found that some of the styles were not usable on some voices at all. This is why you'll see that some voices do not currently have even one style and some have multiple.

For the following three examples, we'll use the Replica Voice 'Yasmina' and the following script:

I might be young Indigo, but I'm not dumb. If you gave me a chance I could surprise you.

No style applied.

Upset style applied.

This style has a decent effect on the voice and sentence. Yasmina does indeed sound upset.

Fearful style applied.

However, while this style is labelled Fearful, it actually enhances the emotion of the content Yasmina is speaking and makes her sound even more upset than the Upset style does.

We've noticed that sometimes a style won't have the desired effect but will instead enhance the emotion of the content being spoken, even if that emotion is not the same as the style. Other times, it will unexpectedly change the delivery of the line altogether. This is something you simply have to experiment with.

The styles tend to work best (though not always) when they are in line with the emotion of the content, as illustrated in this next example.

Quick, get to the escape pods. The ship is going down. We have to leave now!

No style applied.

Fearful style applied.

In this example, you'll notice that the Fearful style worked quite well in line with the emotion of the content being spoken.

We're working to add SSML support in upcoming releases. With this you'll be able to add pauses between words, and modify the pitch and speed of delivery with finer precision.

Try another 'take'

If you're not entirely happy with the way a Replica Voice delivers a line, you can try rendering it again by creating another 'take'.

Though we have not yet built this feature into the product, it is still possible to create another take.

Here are two takes of the same line:

Take one

Take two

Notice how in take two the Replica Voice places a little more emphasis on the word 'threaten'?

This is a byproduct of the technology and it can sometimes be achieved by generating the same line in two separate 'speech blocks' as illustrated below.

The same text in two different 'Speech blocks' can sometimes cause the AI to perform the same script differently.

However, the results are not always consistent. Some Replica Voices may say the same line differently each time but the difference may be quite subtle. Other times, it may be obvious.

We are currently working on more advanced speech controls to allow you to create custom takes and not leave the delivery up to chance.

Until we build this functionality, you can use this technique to generate new takes of the same line.

What's coming next?

  • SSML support: the ability to control pitch and speed, and to add pauses in the middle of sentences.
  • AI Voice model improvements: this will improve the sound of the speech, bringing it one step closer to sounding uncannily human.
  • Takes: the ability to create and save multiple takes.