Show HN: I generated 70k audiobooks with OpenAI Text-to-Speech (listenly.io)
140 points by evan_ry 41 days ago | 109 comments
Hey HN. I’m Ivan, hacker from Ukraine.

For about a year, I was working on Listenly — an app to listen to text content with OpenAI's natural-sounding text-to-speech model.

At some point, I realized that it would be cool to take all the public domain e-books and create audio versions of them. So I did it... kind of.

It would cost an immense amount of money to generate all the audio right away (OpenAI TTS costs approximately $0.84/hour of audio; 11labs, for comparison, is 10 times more expensive). So, I took a more gradual approach.

I took all the metadata from the Project Gutenberg catalog (it's about 70GB of dirty XML), cleaned it, put it into my database, and created a browsable catalog. When the first user visits a book page on Listenly, I download the full text of the book, save it in my cloud storage, and calculate the price for audio generation based on the book's length. Then, if the user decides to purchase it, we generate the audio.
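
A rough sketch of the pricing step described above, assuming the price is derived from the length of the text. The $0.84/hour figure comes from the post; the characters-per-hour narration rate and the function name are assumptions for illustration, not Listenly's actual code.

```python
# Hypothetical price estimate for generating a book's audio.
API_COST_PER_HOUR = 0.84   # USD per hour of audio, figure quoted in the post
CHARS_PER_HOUR = 56_000    # assumed narration rate; tune against real output


def estimate_generation_cost(text: str) -> tuple[float, float]:
    """Return (estimated_hours, estimated_cost_usd) for a book's full text."""
    hours = len(text) / CHARS_PER_HOUR
    return hours, hours * API_COST_PER_HOUR


# Stand-in for the downloaded full text of a book.
book_text = "x" * 560_000
hours, cost = estimate_generation_cost(book_text)
print(f"{hours:.1f} h of audio, about ${cost:.2f}")
```

Because the estimate only needs the character count, it can be computed once when the text is first downloaded and stored next to the book record.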

I know it’s not perfect.

I've burned out a couple of times already while doing it.

But still, I need to show it to the world. And I’ll be glad to hear your feedback.

Peace.




If you're interested in further text to speech missions, I just got Piper (open-source text-to-speech engine) running happily in a Docker container on my Mac. Effectively "free", high quality, fast-generating text-to-speech.

Check out their voice samples: https://rhasspy.github.io/piper-samples/ (or make your own).

Happy to help you set it up locally...

https://github.com/rhasspy/piper


Piper is great, but nowhere near the quality of the OpenAI TTS voices.

I'd find Piper too jarring for audio books because of the big quality difference, but I actually prefer it for things like AI assistants as I don't necessarily "want" AI assistants to sound perfectly human, and prefer the stylistic choice of having them sound more computer-generated.


I don't think it's high quality, tbh

Much less enjoyable than with OpenAI TTS.


Ah, nice! I've been doing something similar to convert web novels --> epub --> mp3/m4b --> sorta a graphic novel --> sorta a video / slide show

Here is pride and prejudice and up the thread you can see another web novel example:

https://twitter.com/HarrisonJackson/status/18109373574214537...

ElevenLabs has so many great voice models but is super expensive. I want to experiment with some oss voice models and even train my own but not sure on a great starting point with that. Play.ht has some good voices, too.

Seeing some of the results here with the openai tts I will probably switch at least the narrator to use one of these to save some money.


This is very cool!

I think you should try OpenAI's voices for characters too. They're really good at catching the emotions. They even can scream! https://x.com/ivryb/status/1780210661189992877


Very cool, and nice work on this! I used to record wikipedia's articles in audio format to help those who had trouble reading, so I'm a huge fan of anything that makes public domain work more accessible.

As a rabid audiobook consumer, I do have a couple of suggestions.

An easy one - currently you only use the Onyx voice from OpenAI. I'd recommend that at the very least you match the gender of the voice to the gender of the author. I find this is pretty common with published audiobooks, and I find it helps bring out the tone of the author more.

A harder one - most great audiobook narrators change their voice depending on the character speaking. If you really wanted to go in depth here, parsing the text by character and matching them to a voice would go a long way in making these more listenable. It would be fairly straightforward (albeit more expensive) to parse these books with an LLM and ask it to add inline markdown for the right voice options for each speaking character.
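
As a minimal sketch of the second half of that idea, assume the LLM has already annotated the text with a made-up inline tag such as `[voice:nova]`. Both the tag syntax and the function name below are hypothetical. Splitting the annotated text into per-voice segments could then look like this:

```python
import re

# Hypothetical tag format: "[voice:NAME]" switches the voice for the text that follows.
VOICE_TAG = re.compile(r"\[voice:(\w+)\]")


def split_by_voice(annotated: str, default_voice: str = "onyx") -> list[tuple[str, str]]:
    """Split annotated text into (voice, text) segments, in reading order."""
    segments = []
    voice = default_voice
    pos = 0
    for match in VOICE_TAG.finditer(annotated):
        # Text before this tag still belongs to the previous voice.
        chunk = annotated[pos:match.start()].strip()
        if chunk:
            segments.append((voice, chunk))
        voice = match.group(1)
        pos = match.end()
    tail = annotated[pos:].strip()
    if tail:
        segments.append((voice, tail))
    return segments
```

Each segment could then be sent to the TTS engine with its own voice and the resulting audio joined in order.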


I wonder if we are ripe for the following:

Given a great narration in one language, have a model annotate the tone and emotion of the narrator for each sentence, and re-apply these emotions to the voice synthesis for a target language, on the translated version.

Narration/recitation is such an orthogonal axis to the story and literary style, and an integral part of the experience.


> I'd recommend that at the very least you match the gender of the voice to the gender of the author.

I prefer the voice to match the protagonist. Or better yet an audio play with the narrator voice plus a voice matched to each speaker.

This is the kind of bikeshedding that AI text-to-voice can make moot. We can all have it our own way. That's an argument for generating the voice just in time rather than as a batch. But as long as such tools aren't ubiquitous this batch is a great public service.


> an audio play with the narrator voice plus a voice matched to each speaker

Oh, please don't! I find this extremely disorienting! When I am listening to an audiobook, I am not listening to the voice. I am transported, envisioning another world, and changing voices often breaks the immersion by forcing me to re-calibrate to their cadence, tenor, accent, etc.

> We can all have it our own way.

Ah, well, yes. Of course. Nevermind, then!


I've been postponing the development of the voice selector for like 3 months already. Something more important is always popping up xD


Totally fair. Solo dev is hard, and those priority choices are always a challenge. Remember that you have more context than anyone else suggesting things here - I'm sure that 3mo delay is for a reason. Great work so far!


Numerous issues, what voice for George Eliot?

J.K. Rowling's books, as mentioned, are famously well read by Stephen Fry and Jim Dale, both male voices for an initialed (but female) author.

But what then for J.K. Rowling's Cormoran Strike series under the pseudonym Robert Galbraith?

Female or male voice for modern crime fiction male detective novels?


One of the other comments mentioned matching voice to protagonist, which would fit your example for Harry Potter.

There are likely many factors that go into selecting the right voice here - my main point is the same voice shouldn't be used for all books. It's likely a simple heuristic is better than "male voice for all", though no approach will be perfect without the opinion of the author, which isn't available unfortunately.


Indeed, voice matching to book "by some magic metric" is the way to go.

My comment wasn't entirely clear, it was the "voice matched to author gender" part that prompted a response.

In the great scheme of things, any voice reading aloud is an advance for people that require or like to hear books read out; improvements can come on a per-book basis.

The end goal is likely a mix of bespoke readings by gifted voice readers (Fry) and guided "selectable AI voice" readings that can do clear and correct pronunciation and pacing with the voice of Jamie Erl Jonas (totally not James Earl Jones), Skarlat Johnson, or that Chipmunk character.


Disagree, the narrator for Harry Potter has always been a male in my head. I don’t think people read the book in the tone of the author, more in the tone of the main protagonist.


The most common versions of the audiobooks are narrated by Jim Dale and Stephen Fry.

> I don’t think people read the book in the tone of the author

People certainly do in some instances, but an interesting thing about generated content going forward is folks will likely have the ability to choose on demand.


So the model is - “first person pays, rest of community gets that audio for free”, have I understood that right?

Cos if so - cool, that’s a lovely model. And you should make more of it. There’s a definite feel good factor associated with this. You could probably also charge a bit more - $5 for a thing I get alone vs $10 for a thing that I get but everyone else gets for free too seems a no brainer incentive to me.

FWIW I find Omnivore[0] to be really compellingly realistic TTS. I don’t know what they use but it’s pretty great imo.

[0] https://omnivore.app/


Or, the first person underwrites the initial generation, and then gets some credits as subsequent people pay a small amount.


Another way to do this would be to crowd fund. Instead of 1 person paying 50$ per book (just a random number, I dunno how much it costs) 10 people can pay 5$ each. 11th person onwards can get it for free.

You could also get some credits from these companies in return for advertising “this book is sponsored by blah company”


I answered this here: https://news.ycombinator.com/item?id=40963194

I like the idea of letting people donate the audio they purchased to the community.

Although I'm scared that I'll have no money.


LibriVox does this with people who volunteer to read everything. It’s an extremely time consuming endeavor for the readers but they ultimately choose to do so unpaid. Why not create a system that covers the cost of the TTS model at cost as to avoid profiting off of what exists in the public domain? That’s the spirit of things being in the public domain.


Agreed, I should do this.

Although I'm still concerned about the costs of maintaining all these MP3s.


yeah, LibriVox is good and simple- no logins required: https://librivox.org/


I checked out the omnivore TTS.

They're using a previous generation of TTS models, which most of the reader apps are using. They're reasonable, cheap, but sound noticeably worse than OpenAI's or 11Labs. I don't like them.


Did you know that Microsoft did basically the same thing for free last year?

https://marhamilresearch4.blob.core.windows.net/gutenberg-pu...


Their TTS model is worse than OpenAI's though


Both will seem dull going forward. TTS will feel more natural every passing year so the current spending for any TTS model will seem kind of wasteful after 1-2 years.


Holy sh*t, nope, didn't know it


I don't think MS did all of the books. For the ones they did do, you could just link your visitors to the audio on archive.org.


How much listening have you done to the results? How do you feel about the results? Just interested because I've listened to quite a few AI readings (sometimes without knowing ahead of time) and I'm still sort of processing my reactions.


I personally actively listen to shorter-form content. I have already listened to around 30 of Paul Graham's essays and the "Shape Up" mini-book from Basecamp (both are free on listenly.io/public-library).

Jason Cohen's blog posts and TechCrunch "Startup Weekly" newsletter are also great to listen to.

In terms of books, I'm not as active. But actually, I like Churchill's books much better with this AI narration than any that I've found on Audible. It looks like they're trying to narrate Churchill's books as if Churchill would, and it's not a good thing.

I think it's already very good in terms of sound quality. If not for fiction, then for professional literature, it's just great.

Some people have already purchased some books, finding them through Google (it is indexing all the pages right now, but it is taking some time, as there are 100,000+ pages for all the books, authors, and subjects).


I'm curious about this as well, I listened to a few of the samples and they seem to be of ok quality. But having experimented with this myself a bit, things can be fine until it chokes and you get very weird emphasis, emotion etc.

Side note, this almost feels like something 2012 Google would have done, a la their scanning of the Library of Congress. Something to show off their text-to-speech.


I created a similar project for the book Madame Bovary, but in French using the ElevenLabs API.

A sample of the first chapter is available here:

https://fairpublishing.org/index.php/ebooks/sample-audiobook...

The voice quality and pronunciation are excellent. However, the system struggles with acting, so the tone and emotional expression are often wrong during dialogues. Additionally, I have to fragment the text into short paragraphs, making it challenging to set appropriate break durations, resulting in an unnatural rhythm.

Despite the technical quality and my appreciation for the reading voice, I won't continue in this direction.

ElevenLabs is quite expensive, but it would be worth it if the final result were good enough for listeners to purchase the audiobook.

I don't know if using OpenAI's API in English would yield better results. However, OpenAI's performance in non-English languages is not satisfactory.


Bark is better in expressing the right emotions, but the voice quality and hallucinations are bad.

Maybe generating a bunch of runs and then asking the users to vote could get us the best narrated book overall.


In general, it is not great for fiction right now and needs a lot of improvement. But for history/philosophy/science books it's great.

And yeah, OpenAI's model is bad for non-English languages. At least, for now...


I definitely support your goal: take all the public domain e-books and create audio versions for them. I think the "on-demand" approach is kinda brilliant. Once a book is requested, how long does it take to generate the audio file? Does it happen in one shot?

I sadly found an AI audio project I don't support: This person was instead summarizing popular books into 10 minutes of audio. Basically trying to SEO better than the author and I know the authors aren't compensated. That just left me feeling sad. (I know book summaries for busy people have been a thing for a while, but this just all felt so opportunistic.)

As I search podcasts these days, I'm finding more and more of these low-effort, "doesn't take more than a few minutes to set up, why not" type AI-generated spam cannons. Been hard for a while but it's about to get REALLY hard to separate the wheat from the chaff.


Right now, I'm splitting all the text into 4,000-character chunks (OpenAI TTS limitation), and converting them into audio "on-demand".
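
A minimal sketch of that chunking step, assuming chunks should break at sentence boundaries where possible so the audio does not cut off mid-sentence. The 4,000-character figure is the one quoted above; the function name and the sentence-splitting regex are illustrative, not the actual Listenly code.

```python
import re

MAX_CHARS = 4000  # per-request limit quoted in the comment above


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The regex will mis-split on abbreviations such as "Mr." or "e.g.", which only affects where a chunk ends, not whether it fits the limit.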

When it's like 1-2 minutes before the end of the current chunk — I'm starting to generate the next one, for a seamless transition.

One chunk is taking about 30-40 seconds to generate (OpenAI API is 20-30s, Azure OpenAI API is ~40s).
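
The prefetch rule can be sketched as a simple check, assuming the player reports its position within the current chunk. The 90-second lead is an assumed value inside the 1-2 minute window mentioned above; it has to stay above the 30-40 second generation time so the next chunk is ready before playback reaches it.

```python
PREFETCH_LEAD_SECONDS = 90.0   # assumed; within the 1-2 minute window
WORST_CASE_GENERATION = 40.0   # seconds, upper figure quoted above


def should_prefetch(position: float, duration: float,
                    lead: float = PREFETCH_LEAD_SECONDS) -> bool:
    """True once the remaining playback time drops to the lead time or less."""
    return duration - position <= lead


# The lead must exceed the generation time, or playback will stall.
assert PREFETCH_LEAD_SECONDS > WORST_CASE_GENERATION
```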

I was planning to convert the whole book (just by queuing and parallelizing the requests) and concatenate it into a single MP3 (or an MP3 for each chapter), but it's not ready yet.
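
One way the planned whole-book conversion could be sketched, assuming a `synthesize(text) -> bytes` function that wraps the TTS call (passed in as a parameter here, so a fake can stand in for it). `ThreadPoolExecutor.map` returns results in input order, which keeps the chunks in sequence. Joining raw MP3 bytes is a simplification: many players cope with it, but a production version would likely remux with a tool such as ffmpeg.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def synthesize_book(chunks: list[str],
                    synthesize: Callable[[str], bytes],
                    max_workers: int = 4) -> bytes:
    """Generate audio for all chunks in parallel and join them in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input chunks,
        # even if later requests finish first.
        parts = list(pool.map(synthesize, chunks))
    return b"".join(parts)
```

Keeping `max_workers` small is a guess at staying under API rate limits; the right value depends on the account's quota.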


I like to watch short movie recaps on youtube instead of the whole things.

I also read summaries of books for research purposes or for dull school homework.

They both have a place before or after ai.


Yep, I just listened to a 30min Dune 1 recap on Youtube because I read it years ago but wanted to finally start the second book.

What's the problem?


I think it would be fair IF writers also paid royalties to authors of books in the same genre/subjects they have read.


I guess you didn't hear about Librivox? Which allows anyone to provide voiceovers for Project Gutenberg books. Much better than AI generated voice in my experience.


I don't think that they contradict.

Maybe AI-generated books should also be a part of Librivox.

I tried to listen to some, but the quality of narration was bad.


This is one of the worst use cases for AI. You have no way to verify the quality of the output. Many of these texts are going to have pronunciations that will be difficult for today’s TTS systems. Plus, many of these are already available from good voice actors, many of them free, and they do the proper service to these texts.

It seems like you did a lot of good technical work, but I find this project entirely useless and a waste of resources.


For some of the books, it's true; for some, it's not.

I'm really enjoying listening to nonfiction – history, philosophy, biographies.


I think it's sad that lying in a title is now accepted marketing practice.


Why do you think it's a lie?

There are 70,000 audiobooks in the catalog, and people can listen to them. If audio is generated on-demand in the background, it does not make them "not-audiobooks", and it does not make my post a lie. "If it looks like a duck..."

It's just a technical implementation detail. And I'm not hiding it; I'm describing it in the post


It makes them "not generated" which means the "generated" part is a lie. Generally I've found Ukrainian software developers to be dishonest, but now I'm wondering if it's a cultural difference in understanding of what words mean.


The title says you generated 70k audiobooks.

You didn't.

That's a lie.

Most people don't like to be lied to.


Aside from the lie about generating 70k audiobooks, it also gets so much free advertising because people think this is a hobby "crowd sourced" project when the author is just selling tech services and subscriptions.


It is a hobby project, it is crowd-sourced, and I am selling tech services (not subscriptions).

And I don't see any lie there.

https://news.ycombinator.com/item?id=40964863

https://news.ycombinator.com/item?id=40963194


This is a silly thing to quibble over. The author made it plain what they were doing in the post. The title is a succinct way to describe what they’re doing.


Nah the author lied in the title to grab attention. That's different than being "succinct".



Exactly how I perceive it


'Tis the way of the world, unfortunately.


We can choose to collectively change the way of the world. And in this case the title is lying, the website seems to be lying regarding cached generation still being the same price, and I would flag it as false advertising.


Come on, what's your problem, man?

I explained what I did in detail. I'm open in the comment section and explained my reasoning regarding the pricing. I've made practically no money off of this project so far.

There is an option to cache, but there is also an option to crowd-source, which makes the price for the first person smaller.

Moreover, if you try to buy an 'hour plan' for $15 and listen to any PG book, you will not be billed for the converted chunk, so the caching works as you'd expect.

Flagging feels so extremely unfair.


You seem to understand English well enough that you know that the title, as stated, is a lie. It's simply not true regardless of whatever rationalizations you can come up with. You barely making money off it doesn't make it true. Users being able to buy a $15/month subscription to listen to 10 hours of audio doesn't make it true.


1. There are 70,000 audiobooks in the catalog, and people can listen to them. If audio is generated on-demand in the background, it does not make them "not-audiobooks", and it does not make my post a lie.

It's just a technical implementation detail. And I'm not hiding it; I'm describing it in the post. I cannot describe the implementation detail in the short title.

It's just that you decided to believe that it's a lie, saying it very confidently, and taking down the post that was received generally very positively.

2. It's not a subscription, it's one-time purchase of hours.


It's not all on you. I can see that culturally in marketing this has become acceptable for many people. I just think it is unfair for people who want to be honest in their titles or thumbnails etc. But that's not really the standard. I was just complaining about something, not about you only.


I really didn't intend to be dishonest.

I just wrote a catchy title (which can be a bit misleading, but not dramatically, as all the audiobooks I'm mentioning are really accessible to people; I developed all the infrastructure needed for that), and tried to clarify everything in the post itself.



Oh come on be a little more generous. It’s hard getting buzz around an indie project.


How about "I imagined generating 70k audiobooks, and you can subscribe to my website for $50/month"?


You seem to be entirely focused on what the creator physically did and not what the user receives. They would not be available for users to listen to as audiobooks, if the creator only imagined generating the audiobooks

How about "just-in-time generation" of 70k audiobooks


It wouldn't fit into HN's characters limit


Thank you for the support!


I love how much better listening to books with AI has become.

Have you done any attempts at multiple narrators telling a story?

Microsoft's Azure has a great tool for doing this but it's time consuming as you have to take all the text & match it to the narrator by hand. Open AI's last big demo kind of showed using voice chat to change narrator voices on the fly.

I think it would be awesome if you could submit a book, have a simple tool parse through & find all the speakers. Then let you sample how each one sounds with a brief description of what the person is like. Basically you get to have each voice do an audition & you pick your favorites. Then it goes through page by page generating audio based on the voices selected.

I'm not suggesting this feature for the app. I'm just throwing out this idea as one I've been thinking about. There have been a lot of books I've wanted to listen to but don't have time to sit down & read.


Yeah, I think it should be possible technically. Put all the chapters through LLM and ask it to add markup for different characters/voices.

Right now, my paid users are listening mostly to non-fiction, so it seems like they don't need it.

But this whole Project Gutenberg saga is kinda diluting everything, and I need to think which users/market to focus on.

Will see :)


Great project.

Pricing: maybe try a mobile app with monthly subscription? Something for recurring revenue.

Features: can you generate at 1.5x speed? Might be more natural than the playback speed up options and be a nice differentiator.
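
For reference, OpenAI's speech endpoint accepts a `speed` parameter (documented range 0.25 to 4.0), so generating at 1.5x would mostly be a matter of the request body. The sketch below only builds that body; the helper name and default values are illustrative, and whether generated 1.5x audio actually sounds better than sped-up playback is untested here.

```python
def build_speech_request(text: str, voice: str = "onyx",
                         speed: float = 1.5, model: str = "tts-1") -> dict:
    """Build the JSON body for a POST to /v1/audio/speech (no network call)."""
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
        "speed": speed,
    }
```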


+1 on this. Even if the subscription isn't ideal for most users, you will get more active users and a better feedback loop with it.


Really interesting take!


I really thought that 1.5x playback speed would be the same as 1.5x generation speed. Wow. Looks like I was wrong.

Regarding the subscription — I thought that no subscription was actually a competitive advantage, but now so many people are telling me to do it, that I'm really not sure anymore.


Did you try 1.5x generation speed and hear an improvement? Mine was totally just a guess. If it is better and you can beat traditional audio books on quality that would be cool. A lot of people listen at higher speeds.

Re:subscription - maybe try both and see? Assuming most people listen to the same books, your generation costs should plummet pretty quickly.


Honestly? The quality of the output is as expected. I wondered how it would manage something like Shakespeare, which depends so heavily on iambic pentameter; instead the AI does what it usually does, which is drone on at a slightly too fast speed, with no natural pauses and no delivery. Honestly, as with most things, you would be better off paying for a human performance than relying on this.

I wish the OP well, and the project is nicely designed. But AI simply isn't there for this yet, not without a lot of individual hand holding and extra work.


You should try listening to some non-fiction, such as history, philosophy, biographies, etc.

It's already great for that purpose.


They are better, but they still sound slightly unnatural to me, the pauses are in the wrong places, or not long enough. It takes me out of focusing on the actual words


I think there may be an issue with data collection. I tried listening to some of pg's articles but they were cut off right in the beginning, see e.g. 005 Lisp for Web Applications.


In this particular case, it's just that the blog post is basically just a link: https://paulgraham.com/lwba.html


I did some spot checks and the cadence and intonation of their speech feels so natural. The sentences flow. It's the best I've ever heard. Thanks for doing this.


Are there any books among Project Gutenberg books that haven't already been performed as an audiobook? Assuming that all of the popular books in Project Gutenberg have an audiobook available to purchase read by a human which is probably better quality or at least more likely to be better quality, why would I want to pay money for this instead? I don't see the value proposition here.


I see why you'd think a human-read one would be better, but in my experience that's not the case. It's not that easy to read out loud and actually sound good.

I've spent a fair amount of time listening to free audiobooks (https://archive.org/details/librivoxaudio) including many that are out of copyright like these, as opposed to modern but in the public domain.

After listening to a few minutes of "Frankenstein" on his site, I would say that these OpenAI generated voices sound better than almost all of the human-read ones on LibriVox, both in audio and performance quality -- these are voices that are designed to sound good, and they succeed at that.


You're right about the popular books, but the long-tail of not-so-popular ones doesn't have a human audio version, and probably will never have.

Plus, sometimes available human narrations are so bad that you really would like to listen to an AI one (I've experienced it with Churchill's audiobooks on Audible).

I don't know if it will work. It felt like it should work, at least for pSEO.

I got my first two audiobook purchases two weeks after I submitted the sitemap to Google. It was some romantic novels. But now it's flatlined again.

Will see...



Seems like some kind of bug on the LemonSqueezy side. It is enabled in the store settings, but I also cannot see it. Will open a ticket.


Does anybody have a recommendation for apps/scripts (mac/windows/ios/android) that will allow me to generate an audiobook from an epub (or txt) using my own OpenAI API key?


Once generated, (I.e. a user pays for the audio to be generated) does it become available to the public? If so, very cool!


If it works that way it's a rather nice setup. Would love to have an answer from the developer.


Right now, it's not working like that.

I was thinking about it.

On the one hand, I want to make money. On the other hand, I understand that making everything available for free would be much more aligned with the Project Gutenberg philosophy.

I left my job, living on the savings, and in the last year listenly made only $400 ~= $35 MRR. Although I was not doing much marketing.

I'm dreaming of it making $1k, $3k, $5k MRR.

Right now, I set the price to be 50% of the API cost, so I would make a profit starting from the 3rd same book purchase.
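
The break-even arithmetic behind that, as a small sketch. It assumes the audio is generated once, cached, and resold, and it ignores storage and payment fees; the function name and the $10 figure are illustrative.

```python
def profit_after(purchases: int, api_cost: float, price_ratio: float = 0.5) -> float:
    """Net profit after N purchases of one book generated once at api_cost."""
    revenue = purchases * price_ratio * api_cost
    return revenue - api_cost


# With a hypothetical $10 generation cost and a $5 price:
# n=1 -> -5.0, n=2 -> 0.0 (break-even), n=3 -> 5.0 (first profit)
for n in (1, 2, 3):
    print(n, profit_after(n, api_cost=10.0))
```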

But maybe I should make it a fully social project, get some donations, and treat it as a "lead magnet" to monetize something else. I'm open to your suggestions!


I do AI consulting and I did some audio related projects where I basically resold ElevenLabs + quality control. EL is much better than OpenAI imho.

Monetizing is good but there is no value proposition in the product.

The chances I'll get something I'd like to listen to are low because:

- AI errors
- AI lack of emotion
- You picked a voice I've heard in thousands of automatically generated YouTube videos and that I came to hate.

There is no chance I'd buy this, I'd rather buy an audiobook made by a human.

Now, people may not understand that - but then they'll be disappointed, bother you for a refund (chargebacks are 15$ a pop if you don't) or just speak badly about the project. Repeating sales potential is pretty bad imho.

I hope I don't come across as rude.

If you are really set on this idea I'd recommend generating 1 book, making it perfect until it reads like it should, and then selling it on as many platforms as you can (Amazon mainly I guess). Maybe use a custom cloned voice so it will sound unique and consistent across all books. You don't need a website but you have one so you might as well use it for marketing and maybe to gauge interest for the next book to process.

An audiobook is a good product in itself.


Sound advice.


Is there any open source text to speech library that's starting to be half close or decent for something like this?


xTTS is not open source but you can download it and use it for some things - and it's the nicest sounding one.

Bark has potential but the voice quality is pretty off.

The tortoise fork which improves the model and restores cloning (the author of tortoise decided it was too dangerous and crippled the project) is ok with some voices but it takes a lot of tries.

Voicebox from Meta is pretty good, comparable quality to ElevenLabs, but it's research-only for now.

Pretty sad overall.


Feature request: I'd love to be able to choose among different voices.


Why only a Google account to log in? Also, why only a dark theme when so many users have difficulty reading on dark backgrounds?


Sorry for the inconvenience :(

Additional auth providers and UI theming were not a priority, and frankly, this is the first time I have received such a request.

But you're right, I definitely will do it.


Cool but it seems you are not transparent about regenerating books?

The best books should already exist in audio and you can already show examples of the quality.

Has no one used this yet? Do you not store the generated result?

I mean it's fine to make money but you state it differently.

Nonetheless I like the project, I'm impressed with the examples and I also like the approach


is your code on github somewhere?


Just out of interest, what exactly do you want to see in the code? It's just a wrapper over OpenAI API, see https://platform.openai.com/docs/guides/text-to-speech + Google OAuth2 authorization + lemonsqueezy as payment platform.


No, it's closed-source for now. You think I should open-source it?


I would love to contribute and run a version on my server. Also you need a search engine. Your list is too long to click through categories. The cool thing is you have monetized it -- virtually no open source projects have monetization built in.

If you don't want to open source it, send me an email: anthony@chovy.com -- i'd like to collaborate with you privately if I can run my own instance.


Interesting. We can meet and talk!

I'm really not sure about fully open-sourcing it. It's generally good for developer-focused products, but for Listenly... I just can't see the benefits. But I might be very wrong.


There probably aren't a lot of benefits unless you want to showcase your skills to employers.

Anyway, hit me up on https://t.me/chovy2 or https://fightclub.profullstack.com -- I'd like to help you out.


Nice job Ivan.

I expect your costs to go down over time, which is nice.


Also, I find it quite unethical that you're charging for public domain books. It's frankly gross, in my opinion.


Well, someone has to pay for API calls :D

I was thinking about launching a Kickstarter campaign and making the whole library free for everyone. But I need more feedback. I don't know if it's viable.


If the audio generation is paid for, by the first listener, it will be available to everybody for free? No?


You're more than welcome to pay the API cost for it yourself.


Wait until you find out who else charges for public domain books.



