As voice user interfaces (VUIs) become ubiquitous and speech synthesis technology matures, it is now possible to synthesise AI voices that resemble our friends and relatives, whom we collectively call ‘kin’, and to use them on VUIs. However, how to design such interfaces, and how the familiarity of kin voices affects user perceptions, remains under-explored.
Check out the video overview:
Role and Results
I led this research:
- Created the concept
- Developed the interface and custom API
- Conducted user studies
- Wrote the paper and published it at CSCW 2021 (journal article)
Assistance in development: Tamil Selvan Gunasekaran
KinVoice System Implementation
As an example application, we implemented a prototype, KinVoice, on Amazon Echo Dot devices that enables users to set and receive reminders. It issues the reminders in AI-generated voices based on the voices of family members and friends.
When the user sets a reminder, KinVoice collects the reminder message, day, and time from the user. It then updates the Alexa Developer Console server, which keeps track of the reminder and issues it. The reminder data is also posted to a custom-made Django API hosted on a Google Cloud server. Using the Real-Time Voice Cloning tool and a speech recording sample of the user’s kin, the API generates the reminder message as an audio file in the kin voice and stores the file in an Amazon S3 bucket. When the reminder is issued, the Echo Dot announces that there is a reminder and asks the user whether to play the message. Finally, KinVoice plays the reminder audio file from the S3 bucket when the user asks. The message can be played at any time after the reminder is issued, until the next reminder is issued.
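The server-side part of this flow can be sketched roughly as follows. This is a minimal illustration and not the actual KinVoice code: the `Reminder` fields, the `audio_key` naming scheme, and the `handle_reminder` function are hypothetical, the synthesiser is a stand-in for the Real-Time Voice Cloning tool, and a plain dictionary stands in for the S3 bucket.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Reminder:
    """Reminder data posted to the API (field names are assumptions)."""
    user_id: str
    kin_id: str   # hypothetical identifier for whose voice to use
    message: str
    day: str      # e.g. "2021-03-05"
    time: str     # e.g. "09:30"


def audio_key(reminder: Reminder) -> str:
    """Build a storage key for the audio file (naming scheme is an assumption)."""
    stamp = f"{reminder.day}T{reminder.time}".replace(":", "-")
    return f"reminders/{reminder.user_id}/{reminder.kin_id}/{stamp}.wav"


def handle_reminder(
    reminder: Reminder,
    synthesize: Callable[[str, str], bytes],
    store: Dict[str, bytes],
) -> str:
    """Generate the kin-voice audio and store it; return the storage key."""
    audio = synthesize(reminder.message, reminder.kin_id)
    key = audio_key(reminder)
    store[key] = audio  # a real deployment would upload to an S3 bucket here
    return key


def fake_synthesize(message: str, kin_id: str) -> bytes:
    """Placeholder for voice cloning: returns tagged text instead of audio."""
    return f"[{kin_id}] {message}".encode("utf-8")


# Usage: an in-memory dict plays the role of the bucket.
bucket: Dict[str, bytes] = {}
reminder = Reminder("user-1", "mum", "Take your medicine", "2021-03-05", "09:30")
key = handle_reminder(reminder, fake_synthesize, bucket)
```

Passing the synthesiser and the store in as arguments keeps the sketch runnable without any cloud credentials, and it mirrors the separation described above between generating the audio and storing it for later playback.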
- The voices of friends and family promoted the feeling of connection (co-presence), social presence and telepresence.
- The voices were persuasive, credible, and charismatic.
- The voices were likeable, safe, and eerie (drew attention to the interface).
Voice Cloning Demo
Today, we can easily clone voices with just 5 seconds of audio. Try out voice cloning for yourself using this Google Colab project.