Thanks to a new text-to-speech system developed by Microsoft and MIT, a massive collection of free ebooks just added nearly 5,000 audiobooks to its digital shelves.
The first ever ebook: In 1971, Michael S. Hart was granted virtually unlimited access to the University of Illinois’s computer system, which was one of the first 15 nodes in ARPAnet, the network that would give birth to the internet.
Anticipating that it would one day be possible to widely disseminate information via computers, Hart decided to use some of his time typing out the Declaration of Independence so that anyone who wanted a copy in the future could have it.
That was the world’s first ebook, and it marked the beginning of Project Gutenberg, a nonprofit, volunteer-driven effort to digitize and distribute books for free. Today, more than 70,000 ebooks, mostly works in the public domain, are available on the project’s website.
The challenge: As part of its mission to make literature available to as many people as possible, Project Gutenberg eventually began adding audiobooks to its collection, but creating those files was more challenging than typing out the Declaration of Independence or scanning pages of physical books.
“There’s a lot of demand for audiobooks, but we discovered we weren’t that good at making them,” said Greg Newby, director and CEO of the Project Gutenberg Literary Archive Foundation. “Creating high-quality recordings was beyond the capacity of our volunteer-driven team.”
“When we learned about this neural text-to-speech technology, the possibilities were obvious,” said Newby. “It creates audiobooks en masse, reducing days of volunteer labor to just 30 seconds per book.”
How it works: Text-to-speech isn’t exactly new, but Project Gutenberg’s ebooks aren’t written in a standard format, and existing systems would have trouble identifying the parts of them that didn’t need to be narrated, such as page numbers or tables of contents.
Typical text-to-speech narration also tends to sound robotic, which makes for a poor listening experience.
The AI researchers started by focusing only on ebooks saved as HTML. They built a tool that groups ebooks with similar structures together, then developed a system that converts the books in each group into a standardized structure.
This upfront work made it easier to extract the text in an ebook that should be narrated, while ignoring the rest.
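As a rough illustration of that extraction step, here is a minimal Python sketch. It is not the researchers' tool: the class names `pagenum` and `toc` are hypothetical stand-ins for whatever markers a given group of ebooks actually uses, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical class names for content that should not be narrated.
# Real ebooks mark page numbers and tables of contents in many different ways.
SKIP_CLASSES = {"pagenum", "toc"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class NarrationExtractor(HTMLParser):
    """Collects text, skipping everything nested inside a SKIP_CLASSES element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an element that should be ignored
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Start skipping at a marked element, and track nesting while skipping.
        if self.skip_depth or classes & SKIP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def narratable_text(html):
    """Return only the text that a narrator would read aloud."""
    parser = NarrationExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `narratable_text` on a page whose table of contents sits in a `toc` block and whose page numbers sit in `pagenum` spans would return only the running prose. The hard part described above is that real ebooks do not share such markers, which is why grouping books by structure comes first.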
Say it with feeling! To ensure the audio wouldn’t sound robotic, the researchers applied an “automatic speaker and emotion inference system.” This software could look at the context of the text to predict how the narration should be delivered, adjusting elements like tone and pacing.
It could also automatically determine when different characters were speaking and use unique voices for their dialogue.
“The different voices combined with the emotive pacing made a much more compelling audiobook than you would’ve gotten from any previous solution,” said Newby.
Looking ahead: At the Interspeech 2023 conference in August, the researchers demonstrated how their text-to-speech tech could be used to create a new version of one of Project Gutenberg’s audiobooks in a person’s own voice from just a 5-second audio sample.
According to an MIT news release, they now plan to “explore whether this technology can help create more inclusive audiobooks that foster a more personal connection between the listeners and their favorite works.”
In the meantime, Microsoft says it plans to keep working with Project Gutenberg to make sure it can apply the technology to other files in the future, helping the project’s audiobook collection catch up with its ebook library.
The big picture: Microsoft’s advanced text-to-speech technology may be a blessing to Project Gutenberg and to people searching for free audiobooks, but to voice actors, who make a living talking into microphones, it’s potentially an existential threat.
This issue is a major reason for the ongoing strike by the American actors’ union, SAG-AFTRA, as actors look for ways to ensure they won’t be put out of work by AI systems like the one helping Project Gutenberg, especially systems that use clones of their own voices.
“Let’s be clear — Pandora’s box is open,” actor Zeke Alton told ComicCon attendees in July. “If you’re going to replicate me or any other performer, we should consent to that, and then we should be compensated for the use of what makes us money.”