I Tested 3 Text-to-Speech AI Models to See Which Is Best - Hear My Results

I experimented with 3 leading text-to-speech AI models - here's what I found
Elyse Betters Picaro / ZDNET

ZDNET's key takeaways

  • There are now several AI tools available that can generate humanlike speech.
  • Some AI voices can now whisper, laugh, and perform other expressive feats.
  • TTS tools vary in terms of their level of realism and their intended audiences.

Synthetic voices generated by artificial intelligence are, for better or worse, becoming commonplace. Meanwhile, the number of companies developing this technology is increasing rapidly.

Recent innovations in AI, such as the transformer architecture (which forms the backbone of many generative AI tools, including large language models), generative adversarial networks (GANs), and diffusion models, have led to the emergence of AI systems that can convert text prompts into natural-sounding artificial speech. There is now a wide variety of these text-to-speech (TTS) systems available, each with its particular benefits and shortcomings.

To gain a clearer sense of which are the most advanced, I tested three of the most popular free TTS tools currently on the market.

ElevenLabs

ElevenLabs is widely considered an industry leader in voice realism, and I found this to be a fairly accurate assessment in my own experiments with the company's TTS tool. But that realism feels more closely aligned with the sound of a trained voice actor or professional podcaster than it does with average human speech -- it's almost a little too polished. In that sense, however, it tends to be the preferred choice for many businesses and professionals looking for reliable automated narration. It also supports more than 20 languages, further expanding the platform's reach and appeal.

The company also released a new text-to-speech model called v3 as a research preview last month. It supports more than 70 languages, and users can spice up their AI-generated speech with audio tags that cause it to laugh, sigh, or speak in a whisper, to name just a few examples.

Also: ElevenLabs' new AI voice assistant can automate your favorite tasks -- and you can try it for free

You can sign up for a free account with ElevenLabs, and you'll automatically receive 10,000 free credits. Select the "Text to Speech" option under "Playground" in the left-hand menu, and you'll be redirected to a page where you can enter a custom prompt you'd like the AI system to narrate, choose from a range of custom voices, and adjust parameters like speed and stability. Prompts are limited to 5,000 characters, and each character in each iteration of a voice generation uses a single credit.
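
To make that credit math concrete, here is a rough sketch of how far the free allowance might stretch. The function names are hypothetical and are not part of any ElevenLabs API. It also assumes that every character, including spaces and punctuation, costs one credit, which the pricing description above does not spell out.

```python
FREE_CREDITS = 10_000      # free credits granted at sign-up, per the article
MAX_PROMPT_CHARS = 5_000   # per-prompt character limit, per the article


def credits_needed(prompt: str, generations: int = 1) -> int:
    """Estimate credits used: one per character, charged again on each generation.

    Assumption: len() matches how the service counts characters.
    """
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    return len(prompt) * generations


def generations_affordable(prompt: str, credits: int = FREE_CREDITS) -> int:
    """How many times the same prompt could be generated with the given credits."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    return credits // len(prompt)
```

Under these assumptions, a maximum-length 5,000-character prompt could be generated only twice before the free credits run out, while a 300-character prompt could be regenerated 33 times.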

Hume AI

Hume AI's TTS model is another contender for the most realistic voice-generating tool. The company has positioned its proprietary Empathic Voice Interface (EVI) as an AI system that can capture and simulate the subtleties of human speech, imbuing it with a deeper layer of believability. Like ElevenLabs, Hume offers a wide set of premade AI voice characters, each with its own expressive quirks. You can also create custom voices by describing them in natural-language prompts.

To test it out, I did my best to describe the voice of Samwise Gamgee from "The Lord of the Rings," as portrayed in the films by Sean Astin. My prompt: "Gentle but brave hobbit, with a working-class, West Country British -- possibly with a hint of Welsh -- accent. He should sound frightened but resolved to complete his mission."

Also: This new text-to-speech AI model understands what it's saying - how to try it for free

After I prompted it to say a famous line from the film, "If I take one more step, it'll be the furthest away from home I've ever been," it produced three samples, varying in tone and emphasis. All of them were impressive; to my ear, they contained a degree of realism and emotional depth that isn't replicable by its competitors. They didn't sound much like Astin's Sam, but that was undoubtedly a reflection of the admittedly imperfect description I used as a prompt.

You can also pepper in pauses by adding "[pause]" to your prompt, or add slangy infusions like "y'all" to enhance the believability of your custom voices.

Descript

If you're looking for an AI voice-generating tool that offers a range of editing features, Descript is the one to choose.

The company's TTS model generates audio files in a waveform format, which you can edit just as you would in Adobe Audition or a similar platform. You can choose from a library of premade AI voices or submit a short recording of your own voice, and the system will clone it for you.

I tested the voice-cloning feature by asking the system to read a short prompt: "Summers in New York City are getting brutal, and I need to put in more high-quality air conditioning." (Which is true.) The first time around, the AI-generated version of my voice definitely sounded like me, but there was also a mechanical quality that detracted from the realism.

I decided to give it another try and re-record my voice, this time taking off my Bluetooth headphones and reading the script more slowly and deliberately. The results this time were much more realistic -- a more convincing simulation of my voice, in my opinion, than a similar voice-cloning feature offered by Hume.

Also: I spoke with an AI version of myself, thanks to Hume's free tool - how to try it

You can also adjust each piece of AI-generated audio by directly editing your written prompt. It wasn't perfect, of course; my close friends and family members would probably be able to spot the difference, but it would likely fool my more distant acquaintances. I can easily imagine using the tool to narrate my own articles or for some similar use case.

For podcasters and other content creators looking to quickly polish their audio recordings, Descript also offers an AI feature that identifies and eliminates filler words, unnecessary pauses, "umms" and "uhhs," and other unwanted bits of audio.

ZDNET's advice

It's important to bear in mind that these are just three of a huge number of TTS models currently available, and that each person will have their own preferences based on their professional role, tech savviness, budget, and so on. Before you choose a platform and run with it, spend a few minutes playing with different options to see which user interfaces feel most intuitive and which ones offer features that align most closely with your creative goals. Also remember that services vary in how they use your data.

Also: Text-to-speech with emotion - this new AI model does everything but shed a tear

Regardless of which platform you end up using, keep your eye on the speed at which this technology continues to evolve. Very soon, we'll likely be living in a world filled with AI voices -- and some of them could sound just like your own.
