Show HN: Affordable text-to-speech for long-form content (audiowaveai.com)
55 points by yagudaev on May 26, 2024 | 50 comments
Hi HN, I’m Michael, creator of AudiowaveAI. I started this project out of frustration when I couldn't find an audiobook version of Make by Pieter Levels. The available text-to-speech options were either too robotic, overly complex, or simply too costly.

It works really well for non-fiction long-form content (i.e. hours of audio).

It’s early days for AudiowaveAI, and I’m looking for feedback to improve the product. Try it out and share your thoughts: [AudiowaveAI](https://audiowaveai.com). Thanks!




This is really cool! One quick note about your marketing copy, though: > Audio for humans, not robots

There are plenty of blind folks who use traditional text to speech for navigating our devices. We prefer the robot text at ridiculously high speeds. We're humans too.

I would love the option to switch to a more natural voice for more literary text (or even a fan fic) so I'll definitely be checking this out


> I would love the option to switch to a more natural voice for more literary text (or even a fan fic) so I'll definitely be checking this out

I'm curious if it would be possible to do some kind of analysis to determine the number of individual characters in the text who are speaking, and then assign an appropriate voice to each of the characters. So if you had something like descriptive language interspersed with a conversation between two characters, that you'd have three voices (a narrator, Character A, and Character B) that are consistent across the text.

For more complex writing with many characters, you'd probably need a wide library of possible voices, and the analysis piece would need to be spot-on, since it would be very confusing to have one character's lines spoken by the wrong voice.

Regarding fanfics, many authors give (or withhold) permissions around creating derivative versions of their work via avenues like ficbinding. Before using a tool like this to create an audio version of their writing, I'd suggest reaching out to a fic's author to see if they'd be okay with that. For personal-only use, though, and especially if it's in context of accessibility for visually-impaired folks, I imagine that many of them would probably be okay with it.
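The voice-per-character idea above can be sketched in a few lines once the speaker of each line is known. This is only an illustration: the speaker attribution (the hard analysis step) is assumed to be done already, and the voice names below are hypothetical placeholders rather than voices from any real TTS engine.

```python
from itertools import cycle

# Hypothetical voice identifiers; substitute whatever your TTS engine offers.
NARRATOR_VOICE = "narrator"
CHARACTER_VOICES = ["voice_a", "voice_b", "voice_c"]


def assign_voices(lines, narrator="NARRATOR"):
    """Give each speaker one voice, chosen in order of first appearance.

    `lines` is a list of (speaker, text) pairs. Working out who speaks
    each line is assumed to have happened upstream and is not shown here.
    """
    pool = cycle(CHARACTER_VOICES)  # voices repeat if characters outnumber them
    voices = {narrator: NARRATOR_VOICE}
    for speaker, _ in lines:
        if speaker not in voices:
            voices[speaker] = next(pool)
    # The same speaker always maps to the same voice across the whole text.
    return [(voices[speaker], text) for speaker, text in lines]
```

Because the pool cycles, a text with more characters than voices would reuse voices, which is exactly the confusion the comment warns about, so a real tool would want a larger library.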


This is what the best narrators do. My favourite example is Andy Serkis narrating the Hobbit.


Thank you so much; you've given me a lot to think about.

I'll admit I don't listen to a lot of fiction content, I prefer to watch it.

I do plan on adding additional voices and one brand of fiction I do like listening to is stories for adults. I used it on the Calm app to fall asleep before, and it is great for calming the mind.

For the time being the product is best for non-fiction, but I'll keep an eye on opportunities to make it work better for fiction.

Thank you also for opening my mind on how copy is perceived by blind users. It will be an honour to help them more. Empowering others through tech is why a lot of us work so hard. This is a great reminder.

Any golden classics I should try for fiction?


I've been using Piper for this. The quality is (in my subjective opinion) as good as the TTS built into macOS, it's open source, and it's so fast that you can run it in real time on a Raspberry Pi. On a real computer I can generate a whole audiobook in about 20 minutes.

What I do is split the book up into sentences, generate speech for each sentence, and at the same time turn that sentence into subtitles. Then I combine the two and stitch them all together into an mp4 container with an audio and a subtitle track using ffmpeg. mpv (and I think VLC) can display subtitles synced to audio playback even when there is no video track.
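The subtitle half of that pipeline can be sketched as follows. This is not the commenter's actual script: it assumes the duration of each generated sentence clip is already known (for example, read from the audio files), uses a deliberately naive sentence splitter, and writes cues in SubRip (SRT) format. All function names are illustrative.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(sentences, durations):
    """Build SRT text where each cue starts when the previous clip ends."""
    cues, start = [], 0.0
    for index, (sentence, duration) in enumerate(zip(sentences, durations), start=1):
        end = start + duration
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{sentence}\n")
        start = end
    return "\n".join(cues)
```

Speech generation and the final ffmpeg muxing step are left out; the point is only that per-sentence durations are enough to keep the text in sync with the audio.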


That's genius! Was it a lot of work to set up?


Super cool! A lot of what you are describing I want to do in the future too.

The issue I personally found with traditional TTS is the lack of emotional range and lack of thoughtful pauses. ML models are better at this and at picking up on small cues that are hard to program into a TTS otherwise.

I love that Safari on the iPhone has built-in TTS now and was excited to use it. It actually didn't work on Make by Pieter Levels after I bought it, so I went to explore other options. After I started listening to AI-generated TTS, I just couldn't go back. It's like 270p vs 2160p (4K).


Is it possible to switch back and forth between the written text and the audio, like Amazon's Whispersync? I prefer reading with my eyes when I can (especially on my ereader, so with pagination instead of scrolling), but I would love to be able to flip narration on when I need to set the book down to do something like wash my dishes, then pick the book back up when I'm done.

I've been looking for something that would let me synchronize Librivox recordings with Project Gutenberg epub files, but as much as I love the Librivox volunteers for their contributions, a lot of the recordings are such low audio quality that they're not fun to listen to. This would be a big step up, and there's no copyright worries for this use case because the works are in the public domain!


I've used Storyteller to create an epub book with Media Overlay but not sure it works in all ebook readers. It worked in Calibre.

https://smoores.gitlab.io/storyteller/docs/what-is-this/


My ereader is an Android tablet under the hood. I don't know of any apps that can do this on Android, but I can go hunting!


Storyteller itself does have an app, just requires you to be hosting the service: https://smoores.gitlab.io/storyteller/docs/reading-your-book...

It also says that BookFusion can read the files it produces: https://smoores.gitlab.io/storyteller/docs/reading-your-book...


I use the app Moon+ Reader that can do this. It uses the built-in text-to-speech engine so if someone makes another engine with more natural speech, it can plug in seamlessly.


I’m making such a tool. You can both read and listen, although the reader functionality is very simple right now.

App is Listenly.io


Good for you!

So similar to my app. But I'm not a real programmer, so of course yours is more refined.

I almost launched the same exact online business.

Here's my version (my github version is a bit less refined than my local code):

https://github.com/sm18lr88/OpenAI_TTS_GUI


Super cool! Thanks for sharing that. Just clean up the UI a little to make it more attractive and hide certain options.

I also experimented with a desktop app and tried to run open-source models locally.

Being a "real programmer" actually hurts you, I had a lot of things to unlearn to just ship fast and keep iterating. I was too stuck looking for the "best practices" or for it to be "just right" (code words for perfectionism).

So keep iterating and writing many projects. This is project 12 in 16 weeks. I've been doing this challenge of 52 startups in 52 weeks. It's been tremendously helpful. (More about it: http://52shipped.com)


Thanks for sharing that site!

I just converted my app from monolithic to modular, and switched from PySimpleGUI to Qt6.

I think the second biggest factor that kept me from launching the online business was reading up on all the big online marketplaces banning AI audiobooks, at least for now. So I just use it for myself.

Some of my initial versions even had a button to make the output compliant with part of the ACX requirements using complex ffmpeg commands.

I've also found that using the iZotope or Adobe Audition (best) algorithm to stretch the audio by 8-15% makes the listening experience better for difficult material, since OpenAI's slower speed settings don't sound good. So I tend to do tts-1-hd and a 12% stretch with Adobe Audition.
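For readers without Audition, a rough stand-in for that time-stretch is ffmpeg's `atempo` audio filter, which changes tempo without shifting pitch. To be clear, this swaps in a different algorithm from the one the commenter uses, and the quality may differ. The sketch below only converts a stretch percentage into the tempo factor and builds the command; the file names are placeholders.

```python
def atempo_for_stretch(stretch_percent):
    """Tempo factor that makes audio `stretch_percent` percent longer.

    A 12% stretch means duration * 1.12, so tempo = 1 / 1.12.
    """
    return 1.0 / (1.0 + stretch_percent / 100.0)


def ffmpeg_stretch_command(src, dst, stretch_percent):
    """Argument list for ffmpeg using its `atempo` audio filter."""
    tempo = atempo_for_stretch(stretch_percent)
    return ["ffmpeg", "-i", src, "-filter:a", f"atempo={tempo:.4f}", dst]


if __name__ == "__main__":
    # Print the command rather than running it; the paths are hypothetical.
    print(" ".join(ffmpeg_stretch_command("chapter01.mp3", "chapter01_slow.mp3", 12)))
```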


Interesting, I read epubs on Android using aiTTS as TTS engine using Google cloud voices.

What I would really like is an option to download the whole book as mp3 for offline playback, and different voices for each character.


Moon+ Reader has offline playback although the voices aren't as good, but on the bright side, if someone makes a local AI text-to-speech engine, then that can plug into the app and it'll work fully offline.


Nice, well you can download each audio file you convert to your device at any time.

I found a single file didn't make sense as you lose your place in it easily.

Instead splitting each chapter into a file seemed to do well.

But you really do want a listening app. Otherwise it gets harder to share and listen to on your phone.

So far I created a listening app as a PWA for the time being.


I use an audiobook app (Voice audiobook player on Android) which holds my place in line for even large single-file books.


Curious what the technical implementation looks like. What kind of TTS are you using? How do you scale it? What are the costs involved?


Using OpenAI TTS, it costs ~$10 for ~10hrs. So at $15, the markup is 50% (a margin of about 33%).
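The arithmetic behind those figures is worth a quick check, since margin and markup are easy to mix up. A minimal sketch, using only the ~$10 cost and $15 price quoted above (the function names are illustrative):

```python
def gross_margin(price, cost):
    """Profit as a fraction of the selling price."""
    return (price - cost) / price


def markup(price, cost):
    """Profit as a fraction of the cost."""
    return (price - cost) / cost


if __name__ == "__main__":
    price, cost = 15.0, 10.0  # dollars per ~10 hours of audio, from the comment
    print(f"margin: {gross_margin(price, cost):.0%}, markup: {markup(price, cost):.0%}")
```

The $5 difference is half of the cost but only a third of the price, and that is before hosting and payment fees.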

The costs will go down in the future, I hope, and there are promising open-source projects coming up. Their quality is still pretty subpar and I have a comparison table I'll share soon on HN.

Otherwise it is Next.js on Vercel with Postgres DB.

Hope it helps


I have a use case for a niche audience:

The videogame Final Fantasy XIV has a lot of text. A LOT of text.

Someone has made a plugin to pipe text to external tts services, or a websocket. You talk to characters in game and hear the dialog read by the tts.

https://github.com/karashiiro/TextToTalk

For whatever reason, Amazon Polly only exposes middling-quality voices to the plugin. And I'd rather not have an active AWS account for just this use case.

ElevenLabs is supported by the plugin, but their service isn't really about TTS and I'd have to pay for the $220/yr tier to unlock further "pay as you go (per character)" with a budget of 100,000 characters per month. A bit steep for using it only in this one game.

If someone could help plumb AudiowaveAI to this plugin, I'd gladly turn off AWS for this!


It’s OpenAI TTS, replacing the API endpoint should be pretty easy

It’s similar quality to 11labs but 10x cheaper
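For anyone wiring a client to OpenAI's TTS directly, as suggested above, the request is small. The sketch below builds (but does not send) a request to OpenAI's `/v1/audio/speech` endpoint using only the standard library. The `base_url` parameter is an assumption added here to make the endpoint swappable, and how a given plugin would expose such a setting is not covered.

```python
import json
import urllib.request

OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, api_key, model="tts-1", voice="alloy",
                         base_url=OPENAI_SPEECH_URL):
    """Build a POST request for the speech endpoint without sending it."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would return the audio bytes. Long texts have to be split into several requests, because the endpoint caps the input length per call.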


Thanks so much! Yeah, a few people asked for an API to make it easy too. Added it to my list of TODOs :)


I've been looking for something like this. Thank you.

A couple of questions:

How do I delete projects?

I must have tapped three times after submitting a Wikipedia article and it created three projects that apparently cannot be deleted.

How do I delete my account?

And for $15 I get credits. How many credits do I get for $15? Is each credit a word translated? 1 credit == 1 word translated to audio?


Hi Andrew, these are fantastic questions. Let me answer them one at a time:

> How do I delete projects?

Tap the three dots on the side of the project and you can delete it.

> I must have tapped three times after submitting a Wikipedia article and it created three projects that apparently cannot be deleted.

> How do I delete my account?

Just email me at support@audiowaveai.com from that email address and I'll delete it for you. It's still an MVP, so there's no functionality for that yet.

> And for $15 I get credits. How many credits do I get for $15? Is each credit a word translated? 1 credit == 1 word translated to audio?

1 credit = 1 character. You are right, I need to be clearer about it. $15 would give you about 10hrs of audio or 100 articles (~5-6mins each). ElevenLabs will cost you $99 for the same audio.
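To make the credit math concrete, here is a rough estimator. It combines the "1 credit = 1 character" rule and the "about 10hrs" figure above with the 600,000 character limit mentioned elsewhere in this thread. The resulting characters-per-minute rate is only an average implied by those numbers, not a measured speaking rate, and counting spaces as characters is an assumption.

```python
CREDITS_PER_PLAN = 600_000  # characters; the limit quoted elsewhere in the thread
HOURS_PER_PLAN = 10         # "about 10hrs of audio" for $15

# Average rate implied by the two figures above (characters per minute).
CHARS_PER_MINUTE = CREDITS_PER_PLAN / (HOURS_PER_PLAN * 60)


def credits_needed(text):
    """Credits used by `text`: one per character (spaces assumed to count)."""
    return len(text)


def estimated_minutes(text):
    """Approximate listening time for `text` in minutes."""
    return len(text) / CHARS_PER_MINUTE
```

At that rate a 5-6 minute article is roughly 5,000-6,000 characters, which is consistent with "100 articles" for the same plan.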


I looked into using Google cloud TTS or Azure for this but it was too expensive.

How have you got the costs so low? Also the GCP voices don't have as natural intonation. How did you do that?

I really didn't think there would be a market for this either.


It’s OpenAI TTS


Correct, it's OpenAI TTS.

Costs still need to come down and they will in the future, especially with OSS models improving daily.

Will share a comparison table I made in the near future (just need to clean it up).


So I work on a similar project in my spare time, but have just settled for Azure's Text to Speech service.

What I'm actually interested in is your pricing model. Why do you have constraints on characters AND articles, versus just characters? Does doing the conversion cost a static amount that you don't want someone making 10000 requests a month? Or is the article count and hours of audio just an estimation of the 600,000 character limit?

If it's just an estimate of real usage of the actual 600,000 character limit, then I'd try and word it differently, otherwise I feel like I'm going to be heavily constrained by the platform.


Yeah great point, I will simplify the pricing and just say "Listen up to 10hrs of audio" instead.

It is a lot more clear and avoids any misunderstandings. Just need to make the changes to the app to do that.

Thanks for the great feedback


Hey. I have published a non-fiction book, and i would like to publish the audiobook on Amazon (Audible, etc). Do you know if the output is accepted by them? What format should I provide my book to AudiowaveAI to receive a good audio? Does it understand chapter titles, quotations, etc?


Hi, AudiowaveAI does split chapters, using markdown. It converts other formats to markdown and uses that.

So you can either copy and paste a markdown file there or upload it. PDF and epub would work too, but they are a bit more finicky.

I'm happy to help; there are a few authors who contacted me, and I'm releasing an HD quality for authors this week, too (less static noise).

Ping me at michael@audiowaveai.com
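As an illustration of chapter splitting on markdown, here is a minimal sketch. It is not AudiowaveAI's actual implementation, which the thread does not describe. It assumes chapters are marked by level-1 headings (`# Title`) and does not handle headings inside fenced code blocks.

```python
import re


def split_chapters(markdown_text):
    """Split markdown into (title, body) pairs at level-1 headings."""
    chapters, title, body = [], None, []

    def flush():
        # Keep a chapter if it has a title or any non-blank body text.
        if title is not None or any(line.strip() for line in body):
            chapters.append((title or "Untitled", "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = re.match(r"^#\s+(.*)$", line)  # "# Title" only, not "## Sub"
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chapters
```

Each pair can then be converted to its own audio file, which matches the one-file-per-chapter approach described earlier in the thread.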


Hey, I can help with that. I’m building Listenly.io, and one author has already made an audiobook with it.

Works best with .epub


I looked into this problem a while back and haven’t looked at it since.

The base AI model sounded like Whisper AI from Meta. Did you train the voice yourself, or is it one of the defaults?

I am always curious as to what copyright issues products like this run into. Also, what's the stack like to build something like this?


Isn't Whisper the speech-to-text model by OpenAI? Which model did you mean?


Yeah, that's correct. I meant this one: https://voicebox.metademolab.com/


Yeah I tried a bunch of them and OpenAI's TTS was by far the best.

Outside of that, it's a standard tech stack: Next.js, Postgres, TailwindCSS.

It is still early days for ML TTS, and it will be exciting to see the compute requirements drop and for it to run on the device. OSS models have some promise, but they're still not there from a quality perspective.


I reach first for Mac's built-in "say" program. It's not perfect, but good enough for most use cases. Free, simple, CLI driven. Potentially more private than a cloud service, and it works when you have 0 connectivity.
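For reference, a small wrapper around `say` might look like the sketch below. It uses the tool's `-f` (input file), `-o` (output file), `-v` (voice) and `-r` (rate in words per minute) options. The wrapper functions and file names are illustrative, and it only runs on macOS.

```python
import subprocess
import sys


def say_command(text_path, audio_path, voice=None, words_per_minute=None):
    """Argument list for macOS `say`: read a text file, write an audio file."""
    command = ["say", "-f", text_path, "-o", audio_path]
    if voice:
        command += ["-v", voice]
    if words_per_minute:
        command += ["-r", str(words_per_minute)]
    return command


def render(text_path, audio_path, **options):
    """Run `say`; refuses to run anywhere other than macOS."""
    if sys.platform != "darwin":
        raise RuntimeError("`say` is a macOS command")
    subprocess.run(say_command(text_path, audio_path, **options), check=True)
```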


Totally fair, and I was really excited about the built-in version in iOS Safari. It didn't work for the book I bought, and then after using AI-based TTS, I just couldn't go back.

I would love to be able to run the models on the device, and it will come in the future. The OSS models are not quite there from a quality standpoint yet. But they will surely get there.


Is this a custom trained TTS model or is it an implementation of something like StyleTTSv2?


Just OpenAI TTS for now. I tried a bunch of others and, sadly, they were nowhere near as good yet.

There is some promise from Myshell models on huggingface and I hope to see them keep evolving.


This is great. Simple and slick makes it easy to use. I'm impressed by the URL importing feature. Is that a GPT wrapper behind the scenes or do you use another library?


Love this tool; please include the pricing on the front page before one has to sign up. Thanks!


Thanks so much :). The pricing is towards the bottom of the page. Should I just add a link in the header/footer that scrolls to it, to make it more visible?


You put "medicore content" in place of (I assume) "mediocre content".


Thank you so much! Fixed it now :)


This looks like a really nice product.


Thanks a lot Nayam :). Let me know how it goes when you get a chance to try it. I've added a chat box to make it easier to leave feedback now.



