Red Dwarf-style talking head’s emotions ‘easier to read’ than our own (Wired UK)

Computer scientists have created a talking head with the emotional range to compete with Red Dwarf’s Holly, with volunteers able to identify its mood more accurately than a human counterpart.

“Some of our team are hardcore Red Dwarf fans, so having a talking head as funny as Holly would be a clear win,” Bjorn Stenger of Toshiba Research Europe told Wired.co.uk, claiming it’s the “most expressive visual text-to-speech system” created to date. However, it needs to be provided with the witty repartee and is not an autonomous pal — users can type a text or email as normal, add in the emotion, and “Zoe” will relay it, appropriate facial expressions intact.

“Right now it could be used to read out messages in an expressive way,” says Stenger. “It might also be used as a reader for audio books, for example. We’re currently talking to organisations for autistic and deaf children, where this kind of technology could potentially be used for teaching kids to ‘read emotions’ and to lip read.” Autistic children are often shown flashcards with “happy” or “sad” expressions, but a real, expressive face that morphs and changes as it talks could be of genuine benefit.

The University of Cambridge’s department of engineering has been working on the project with Toshiba’s Cambridge Research Lab for several months, using the Japanese tech firm’s Cluster Adaptive Training software to produce expressive speech in the disembodied head, and face tracking to achieve a realistic effect.

British actress Zoe Lister supplied the speech and visuals, spending a few weeks in the Cambridge studio recording 7,000 sentences read out from newspaper clippings and even the phone directory. Algorithms were then used to create data points around the vocal and visual data — by modelling the voice and face, text commands can use the data points to recreate an emotion and sentence.

During the recordings, Lister expressed six base emotions — happy, sad, tender, angry, afraid and neutral — that could later be combined to create nuanced speech patterns: “We obtained very expressive versions of emotions from the data, and can then combine expressions with varying degrees of strength to create novel expressions.” For instance, by speeding the language up and combining anger and fear the head sounds panicked.
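To make the blending idea concrete, here is a minimal sketch of how base emotions might be combined with varying strengths to yield a new expression, as the article describes for “panicked”. This is purely illustrative: the emotion names come from the article, but the three-parameter “expression vectors”, their values, and the weighted-average blending are assumptions, not the actual Toshiba/Cambridge method.

```python
# Toy "expression vectors" for the six base emotions Lister recorded.
# The three parameters per emotion are invented for illustration only.
BASE_EMOTIONS = {
    "happy":   [0.9, 0.1, 0.0],
    "sad":     [0.1, 0.8, 0.2],
    "tender":  [0.5, 0.4, 0.1],
    "angry":   [0.7, 0.0, 0.9],
    "afraid":  [0.2, 0.6, 0.8],
    "neutral": [0.3, 0.3, 0.3],
}

def blend(weights):
    """Combine base-emotion vectors by normalised weighted average.

    weights: dict mapping emotion name -> strength; strengths need not
    sum to 1, since they are normalised here.
    """
    total = sum(weights.values())
    mixed = [0.0, 0.0, 0.0]
    for name, strength in weights.items():
        vec = BASE_EMOTIONS[name]
        for i in range(3):
            mixed[i] += (strength / total) * vec[i]
    return mixed

# The article's example: "panicked" as an equal mix of anger and fear.
panicked = blend({"angry": 0.5, "afraid": 0.5})
```

Varying the strengths (say, 0.8 anger to 0.2 fear) would shift the blended expression along the continuum between the two base emotions, which is one plausible reading of “varying degrees of strength”.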

“It still took us a few months to create the model as we had to process the speech data as well as refine the face model,” says Stenger. “The models we train from this data are general enough to generate new text with new expressions. But one of the difficulties we had was tracking the face accurately, as well as modelling the dynamic range of emotions.”

The system doesn’t yet follow voice commands like Siri, but Stenger believes it’s possible to eventually train the system to infer emotion from the words it’s instructed to say. It will also be possible to generate a system whereby smartphone users upload their facial and vocal data to personalise it — doing this with speech, says Stenger, is already within reach, but personalised faces will be more tricky.

“It took us days to create Zoe, because we had to start from scratch and teach the system to understand language and expression,” said Roberto Cipolla of Cambridge’s department of engineering. “Now that it already understands those things, it shouldn’t be too hard to transfer the same blueprint to a different voice and face.”

Right now, though, it’s only at the research stage and no heavyweight manufacturers have been involved in the collaboration. When they do get involved, though, they’ll only be dealing with a program that runs on tens of megabytes, making it compact enough for smartphones and tablets.

Of course, to get it into mass production the team would have to make sure the nuances appear genuine and not, as with most robots manufactured today, just horribly unnerving. It’s speculated that mirroring real human expressiveness is the way to breach this barrier, but as we have seen with the humanoid Face and its Hybrid Engine for Facial Expressions Synthesis, lifelike expressiveness can still look, well, creepy.

Interestingly, when a group of volunteers was asked to identify the emotion being spoken by Zoe online across ten sentences, 77 percent answered accurately, while just 73 percent identified Lister’s emotions when she spoke the same sentences on film.

Stenger hypothesises that this is because of the extreme “stylisation of the expression in the synthesis”. But it bodes well for further improved Zoe models.

“Obviously people are very tuned to facial expressions and it’s still tricky to cross the ‘uncanny valley’ — motion that looks mechanical or unreal is often perceived more negatively than a clearly artificial one (like a cute talking animal character).

“It might also be a nightmare to talk to a poorly implemented virtual dialogue system. As soon as people see a real face they probably tend to have higher expectations in terms of an intelligent response as well.”

Watching the demo video, the speech is still stilted and not totally natural, but the facial expressions are smooth and sync well. The Cambridge lab claims it’s a norm we will come to expect, which makes sense given the ubiquity of things like voice recognition, facial recognition and Siri — we expect more from our devices, and this is a step in the right direction.

“Present day human-computer interaction still revolves around typing at a keyboard or moving and pointing with a mouse,” Cipolla said in a statement. “For a lot of people, that makes computers difficult and frustrating to use. In the future, we will be able to open up computing to far more people if they can speak and gesture to machines in a more natural way. That is why we created Zoe — a more expressive, emotionally responsive face that human beings can actually have a conversation with. This technology could be the start of a whole new generation of interfaces which make interacting with a computer much more like talking to another human being.”

It makes sense, then, that Zoe lists one of her possible uses as “carer” — while we’re waiting for a Robot & Frank-style companion to be built, a disembodied head providing comfort, or simply someone to shoot the breeze with, would be interesting.
