Begonya Ferrer: “I’ve listened to myself and there are phrases that could easily pass for the real thing”. This is how artificial voices are arriving in the world of voiceover | Technology
The actress and voiceover artist Begonya Ferrer let some friends listen to audio messages she had received from a voiceover company. It was her voice, but it wasn’t really her: “I listened to myself and there are plenty of phrases that could pass for real,” she says. “I played it for some friends and they told me the same thing.” Ferrer had listened to her recorded voice many times, but it was the first time she heard herself “synthesized”. To an untrained ear, the difference is barely noticeable.
As in other fields of artificial intelligence, dozens of companies are working to improve the artificial reproduction of the human voice. Their advances are remarkable, though not yet perfect. The gap between English and other languages is also notable. The technology is already used for voices that don’t require sophisticated tones, accents or emotions, such as answering machines or internet videos and games, but the rest will come. “They still have to polish some sounds and intonations, or the pauses at commas. But it makes you wonder whether it’s bread for today and hunger for tomorrow,” says Ferrer, using the Spanish idiom for a short-term gain that turns into a long-term loss.
The company that hired her to synthesize her voice is Voces en la Red. “Until 2018, synthetic voices were bad. They have been evolving since then, especially with Amazon, Microsoft and Google, and between 2020 and 2022 they took a leap. Still, there is work to be done; we are now where we were when the first iPhone came out,” says Javier de Alfonso, founder of Voces en la Red. The advantage of having a synthesized voice is, obviously, that a human is no longer needed to record each new video or each change to an answering machine. Soon the machine will even be able to “read” what a text generator produces, with hardly any latency. That is, it will speak, improvising on any subject, almost as in a natural telephone conversation.
Voces en la Red collaborates with Resemble, a Canadian start-up, to improve its Spanish catalog and be able to market it. All these advances are nearly a reality in English, but in Spanish a lot of editing and retouching is still needed. Asked by this newspaper, Resemble explains it this way: “Most of the work on modeling languages with artificial intelligence is specific to English. Our main focus now is to improve the naturalness and prosody of Spanish. Also, because of the kind of clients we have, we often find that Spanglish is common. Language switching is a key area of research for us,” they say.
For this improvement of the machine, the work of professionals such as Begonya Ferrer is essential. She explains that she receives more and more requests for this kind of aseptic recording, without really knowing who or what they are for. “I work for a lot of people,” says Ferrer. “I do more and more projects to train robots. They don’t give the voice actors much information. I work online with people from all over the world. I have even been asked for quite a few projects from China. They make you read fragments of audiobooks, and if phonemes are missing they send you more texts. The technical conditions are very specific, very different from advertising: a very dry, clean sound that does not go above a certain number of decibels,” she adds.
Sometimes they make her repeat entire sentences as many times as there are words in the sentence, stressing a different word in each reading. It is easy to imagine that the hours of recordings Amazon, Google and Microsoft have access to are on another order of magnitude. Microsoft, for example, already offers services of this kind to pre-selected customers: “Customers must upload the training data of their chosen speaker along with an audio file of the speaker giving their verbal consent. Custom Neural Voice training starts with approximately 30 minutes of voice data (about 300 recorded sentences), and the data size we recommend is approximately 2 to 3 hours, or 2,000 recorded utterances,” Microsoft sources explained in response to questions from this newspaper.
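Once a neural voice exists, using it takes only a few lines of code. The sketch below is a minimal illustration with Microsoft’s Azure Speech SDK for Python, using a stock Spanish neural voice; the subscription key, region and output filename are placeholders, and a Custom Neural Voice trained on an actor’s recordings, like the service described above, would be addressed by its own deployed voice name and endpoint rather than the stock name shown here.

```python
# Minimal sketch: synthesizing speech with the Azure Speech SDK (Python).
# The key, region, voice name and filename are placeholders for illustration;
# a Custom Neural Voice would use its own deployed name and endpoint.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
speech_config.speech_synthesis_voice_name = "es-ES-ElviraNeural"  # stock Spanish neural voice

# Write the synthesized audio to a WAV file instead of the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="salida.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("Hola, esta frase la lee una voz sintética.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Audio written to salida.wav")
```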
From there, companies like Warner Bros., the BBC and Duolingo are using it for some of their services. These kinds of solutions are already available. De Alfonso, from Voces en la Red, believes it will soon be used for more complex but equally plain texts, such as a news item or a radio bulletin. But greater feats, such as “reading an erotic story”, are still some way off.
Years away from professional dubbing
The part of the sector more focused on dubbing sees this progress as a real threat, but one that is still years away. “We have seen spectacular demos, but real applications are still a long way off,” says Ángel Martín, director of the dubbing company Eva Localisation. “There are no tools yet where you can feed in your original series and get it back adapted to another language.”
That said, there is room for other types of applications, according to Martín: “There are millions of hours on networks like YouTube or TikTok with personal content where the rights are less significant, or that do not require as much precision because what they want is to reach as wide an audience as possible,” he says, referring to products that, due to their lower demands, can already use these kinds of voices.
Does any of this make voice actors nervous? Not for the moment. “The industry is not prepared, even if these tools become available. That does not mean we are not all interested in seeing how it evolves,” adds Martín. That artificial intelligence will eventually be able to fit the most suitable words in another language to an actor’s lip movements seems inevitable. But for now it is not imminent.
The sector is also going through a boom, a time of “fat cows”, says Alex Mohamed, technical and security director of the Deluxe studios. “There is a huge amount of work and little time to see what happens. Also, no example has surfaced yet for anyone to worry about. It will happen as the years go by. It’s likely,” he adds.
There is also a legal debate pending, probably more complex than with other artificial intelligence products: “Voices are subject to rights. What will happen when a person dies? What will happen if I take the voice of someone who has just passed away, change the pitch slightly and use it?” says Mohamed. Combining several human timbres yields an original voice over which no one holds rights. It is one thing to clone the voice of a specific actress, who consents and can claim her rights, and another to use her timbre in a cocktail that produces something new.