27 June 2023

AI and the voiceover revolution

We’ve now well and truly entered the era of artificial intelligence, with new tools popping up every day. It’s a divisive topic – while many software companies are rushing to include AI in their products, others are decrying the technology as a danger to human society. The truth? We’re still way too early into AI adoption to see where the dust will settle, but there have already been a number of fascinating applications that promise to revolutionize the content we consume and the ways we live our lives.

While many are focused on how AI will change the landscape for written and visual media, it’s also set to drastically alter audio content, as technology continues to improve and synthesized voices become more commonplace.

Many of us will balk at the idea of these artificial voices creeping into the media that we all know and love, but thanks to continued advancements, things are starting to change. Indeed, we are seeing AI voices rolled out in increasingly interesting ways, promising to make us all feel a little more comfortable with them.

Spotify, perhaps the world’s most iconic music streaming platform, introduced its new AI DJ back in February, made possible by its 2022 acquisition of the so-called ‘dynamic AI voice platform’ Sonantic. The DJ is designed to curate playlists to the user’s particular tastes, providing background information on the tracks and artists as they listen.

The tool has had mixed responses so far, but that’s somewhat to be expected of consumer-facing AI integrations such as this. Indeed, while some label it a ‘soulless’ approximation of a human DJ, others praise it for delivering the ‘real radio experience’, without the annoying ads that drove people away from standard airwaves in the first place.


AI for when the time is right

Alpha Studios, the division of localization company Alpha CRC responsible for all audiovisual projects, has also begun delving into the power of AI, and how it can best benefit clients.

“AI presents a number of exciting possibilities in the world of audio production,” says Alpha Studios director Neale Laxton. “We typically work with clients requiring voices in a number of different languages. Traditionally, this would require us to find a different voice actor for each locale. With AI voiceover technology, that’s beginning to change. With fewer people needing to be involved, we’re seeing reduced costs to clients as well as faster turnaround times.”

Of course, AI isn’t a one-size-fits-all solution to audio content. “There will be times when AI-produced voices are appropriate, and times when it’s better to invest in a human voice. We tend to advise clients that AI voiceover can be great if you’re looking for 99% accuracy. But for businesses in the medical sector, for example, where accuracy is of the utmost importance, it would be better to invest in human voices,” added Laxton.

So what kinds of content are AI voices most suitable for? Explainer videos are a prime candidate. These are short, informational videos used to introduce topics to students, employees, or customers. In the wake of the pandemic, a growing number of businesses and educational institutions have been adopting eLearning materials to disseminate information among their audiences. Explainer videos play a key role in this, as they are typically viewed as the most engaging way to introduce new content.

Of course, many would express concern about whether an AI voice can be truly engaging. When we think about the classic synthesized voice, it’s hard to imagine anything more off-putting than those dull, robotic tones droning on at you. Luckily, advances in AI are changing that too. Leong Wai Kit, Chew Yuin-Y, Balqis Zulkifi and Kho Suet Nie conducted a study, published in the SEARCH Journal of Media and Communication Research, which looked at how university students perceived AI-generated voice in explainer videos.

The group highlighted that high quality audio is one of the most important aspects of an explainer video for promoting understanding and message retention. However, they acknowledged that many explainer videos are created on a tight budget that doesn’t always allow for professional human voice actors. Could AI voiceover then be a suitable substitute, with significant advances over standard text-to-speech software? Would it be understandable, and, crucially, would it be pleasant?

By asking a group of students to assess a series of explainer videos according to their comprehensiveness, pleasantness, naturalness and human likeness, the researchers were able to establish that AI-generated voices showed significant improvements over traditional text-to-speech software, and didn’t fall far behind human voiceover in most key areas.

Indeed, while the human voiceover recorded a ‘pleasantness’ rating of 4.45 out of 5, the AI-generated voice scored 3.97. The ‘comprehensiveness’ difference between the two was even smaller, with human comprehensiveness at 4.55, while the AI-generated voice scored 4.39. AI-generated voice was also considered significantly more humanlike than text-to-speech voices, scoring 4.03 and 2.77 respectively. The takeaway here? AI-generated voices are a much more suitable alternative to human voices than traditional text-to-speech ever was, and could help reduce costs and increase output for creators in a variety of fields.


Life on the cutting edge

Of course, synthesized voices aren’t only for explainer videos. Some are taking on much more experimental roles, such as that of conversation facilitator for groups of older adults, in order to combat loneliness and enhance their social lives. ‘Bono’ fuses AI with traditional text-to-speech software in order to control the amount each member of a group contributes to a given conversation.

In a study published in the International Journal of Social Robotics, Katie Seaborn, Takuya Sekiguchi, Seiki Tokunaga, Norihisa P. Miyake and Mihoko Otake-Matsuura reported that Bono was well received by the groups of older adults who took part in their research. The study mainly focused on the importance of conversation facilitators such as Bono having a physical presence, as opposed to the reception of the voice itself, but there were some pertinent findings that suggest fully AI-generated voices could further enhance the experience and improve quality of life for otherwise isolated individuals.

During the research, groups of older adults were invited to share photos on a specific theme, which they were then asked to discuss together. After the conversation, the participants were invited to share their thoughts on how the digital conversation facilitator fared, with the voice itself being one of the key areas of discussion. In fact, the study found that the voice was ultimately more important than the physical presence of the robot.

Most (67%) somewhat liked Bono’s text-to-speech generated voice, although 22% commented that it was mechanical, and 19% felt that it was monotonous. It’s in the requested improvements that we can see signs of where a fully AI-generated voice could offer a better user experience, with participants requesting a greater variety of responses and phrases.

The participants in this study were Japanese, and felt that more ‘aizuchi’ (interjections common in Japanese conversation which signal that the listener is engaged in what the speaker is saying), laughter and jokes would help improve the experience. To put it simply, it appears that they wanted the robot’s speech to become more humanlike. As the earlier research from Kit et al. showed, this could potentially be achieved by switching the speech generation technology from text-to-speech software to a fully AI-powered platform.


Further challenges ahead

It’s clear that we’re on the precipice of an AI revolution, but it’s important that we acknowledge where there are still limits on what it can do. While there is plenty of English audio content available online for AI creators to train their engines, other languages may not fare so well.

This could mean that multilingual content produced solely with AI could show a clear quality gap between the English version and those produced in other, less widely spoken languages. In this case, it might be best to consider a mixed approach, in which AI voices are used for locales where there is sufficient training material to allow the AI to perform well, while human voices are used for others.

The danger here is that, as English content becomes quicker and easier to produce, other languages might be increasingly regarded as an afterthought, widening a discrepancy that already exists within online and media communities. It’s an issue that doesn’t have a clear answer at the moment, but, as AI continues to prove its value and adoption becomes increasingly widespread, it’s one that will be well worth considering.