Show HN: Podcast API (listennotes.com)
223 points by wenbin 13 days ago | 69 comments

> API Terms of Use

> Applications using the Listen API must not pre-fetch, cache, index, or store any content on the server side. Note that the id and the pub_date (e.g., latest_pub_date_ms, pub_date_ms...) of a podcast or an episode are exempt from the caching restriction.

Is that.. common? I've never knowingly come across anything like that before, seems weird to me. Sort of makes sense, in a 'you must not try to avoid needing to pay us more because we want more money' sort of way, but.. really? Also, almost entirely (basically, except OSS) undetectable, surely.

[Edit: failure to read my own quote correctly, thanks xd1936] --- And if you really take it seriously - 'must not [...] store any content' - it really limits what you could even use it for, not being able to store the `id` even for a later reference. I don't think that's what's intended, but it seems to be what's written. ---

(Just so I don't sound like a grumpy old git (I'm not old, at least!) - I really really really like the docs page https://www.listennotes.com/api/docs/ only thing I'd suggest perhaps is embedding the OpenAPI 'HTML' contents below the other options, rather than it being a link to follow. Awesome though.)

Map tiling APIs do this, like Mapbox and Google. Else you could circumvent all but their lowest-tier subscription plans with a brain-dead caching proxy and a large disk which is what they want to avoid.

Amazon's API famously does this as well (or used to, it's been a while) by requiring any prices you show to be no more out of date than N minutes forcing you to basically request on demand every time you need to show it. They'd rather you just send the traffic their way for people to see the price.
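A rough sketch of what that kind of freshness rule means in practice, assuming you record the time each price was fetched. The 15-minute limit and the `is_fresh` helper are made up for illustration; the actual "N minutes" isn't stated here.

```python
import time
from typing import Optional

# Stand-in for "N minutes"; the real limit is an assumption, not a known value.
MAX_AGE_SECONDS = 15 * 60


def is_fresh(fetched_at: float, now: Optional[float] = None,
             max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if a value fetched at `fetched_at` (epoch seconds) may still be shown."""
    if now is None:
        now = time.time()
    return (now - fetched_at) <= max_age
```

Anything that fails the check has to be requested again before it is displayed, which is how the rule pushes you toward on-demand requests.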

Heh, yeah. I think my reaction's still similar though - why shouldn't I be allowed to do that?

The alternative of course is to charge more per tile, or have a base 'access fee' + small incremental charge. Pay per usage doesn't work best for everything, IMO.

(And I'd likely still want to come back occasionally to check it hasn't changed, even if I cached every tile forever. (Which I probably wouldn't, if the hit rate was really low, like it was a one-off, and I'm being cheap about my API usage why wouldn't I also be cheap about my disk usage.))

> why shouldn't I be allowed to do that?

Short answer: Because that's the contract.

Companies that provide data for offline use will have a separate licensing model, usually with subscriptions for updates or perhaps a finite license term. MaxMind's GeoIP database is a popular example.

That's not really an answer though, that was the starting point.

And this isn't a one-off dataset, we're discussing an API pricing model - there will be new podcasts, existing podcasts' metadata will change; people using this API will want to make repeated calls, they just might also reasonably want to cache results.

If this were my service, I just wouldn't do pay-per-API-call, or at least not only. Of course, the free tier presents more of a problem then, but I'd probably just restrict it more making it less attractive, and have a lower entry point than the $100pcm that's a flat-fee for some but not all extra features, showing images at all (and not in free), for example.

As it is, I reckon loads of users cache results - not maliciously, just because they haven't read that they're not supposed to - and that OP has no idea (because how would they).

Pay-per-use is just the simplest and most straightforward and possibly fairest way to couple the value your API gives someone with the amount they pay in return.

Or, from the eyes of the user, they get full access to the API yet don't have to pay much if their project gets no traction.

The downside is that users can lie, but it's mainly just low-end users who would lie. Pay-per-user licenses are similar: a startup or a hackathon is most likely to share the license between a few people while larger companies are going to be honest because (1) they can afford it and (2) they don't want trouble at scale.

So you can ignore most abuse.

The problem with other payment structures for ListenNotes is that it's a relatively small database. You can clone the whole thing trivially. It doesn't even mirror/host the audio feeds. Its only value is that it put in the work of structuring and normalizing the metadata.

If you built a business on top of ListenNotes, you'd save more and more money as you grow bigger and bigger if you were simply cloning the whole thing with your own crawler. So the more value you would get from ListenNotes, the less you're actually paying them. Or ListenNotes would have to price their per-call fee so high that they could somehow capture a fair price for that value yet shut out smaller users.

Turns out "courtesy agreements" generally do work at scale as larger companies become less and less likely to lie just like they become less and less likely to pirate Photoshop.

> have a lower entry point than the $100pcm that's a flat-fee for some but not all extra features, showing images at all (and not in free), for example.

The downside of this is that now you limit what people can build on cheaper tiers. In fact maybe they can't even build their compelling product without whatever content you're paywalling behind tiers they can't afford on day 1, while the goal is to let someone build anything they want on day 1 so that they are a large end-user on day 1000.

After all, the ideal isn't that you scale value with your customer's income but rather you scale in price as they convert value into income. It, of course, is all just trade-offs.

Depending on the use case, possibly a whole lot of disks.

Right, I would assume that even just the tiles for the biggest cities alone would still be way more than most would want to store. On the other hand, let's assume on the client-side, can you not even cache a tile a user just saw 10s ago but went off screen? Or is it assumed the browser will cache that tile?

> On the other hand, let's assume on the client-side, can you not even cache a tile a user just saw 10s ago but went off screen? Or is it assumed the browser will cache that tile?

I don't know the map tile terms, but the quoted limitation for this service specifies server-side caching.

I’ve noticed similar recently with many paid book search apis out there and was also grossed out.

You’re not paying for a data source at all, you’re paying for an expensive embedded application.

I don’t see how it’s remotely reasonable. The person managing this api has stricter protections on this data (though they’re not even his podcasts) than we have on our personal data.

You're not paying for the data, you're paying for the service.

This is common. Companies that provide the data for offline use tend to have a separate licensing and subscription fee structure. Companies that provide the API tend to forbid offline caching/storage of the data.

The service, though, is 'convenient access to the data [which is already out there]'. And once I've used it, I don't need it 100/sec just because that's how frequently people are using my downstream service to do something with some popular 'trending' podcast; I'm perfectly happy (and it would be a good practice to be!) caching it for some period, until I need the service again to conveniently see if the data that's already out there has changed.

> The service, though, is 'convenient access to the data [which is already out there]'.

The service is whatever is described in the contract you agree to when you purchase it.

If you don't like the terms of the contract, you can always try to negotiate an alternate agreement. Or you can choose not to purchase the service.

The seller isn't obligated to provide their services on your terms, just as you're not obligated to purchase the seller's services on their terms if you don't agree to them.

A single snapshot of an ever-changing database is the culmination of potentially years of research and payroll and system development that API consumers precisely didn't and don't want to do, that's what gives the dataset and thus API value.

The price that captures that value would have to be much higher in the model where you only need to access the database at some interval (let's say weekly), and that's not necessarily any more palatable.

I commend the service provider for aggregating the data and making a business - hope that person is able to make a living from it.

It’s an interesting service that I would be very interested in using in providing a service of my own. And I’d be more than happy to pay for it, but those terms are a non-starter, at least for me.

The year is 2040. There’s no running water. Grocery stores mandate that all purchased liquids must be consumed prior to leaving the premises.

The year is 2050. For some reason that nobody can remember, everybody lives in "stores."

The year is 2060. “Stores” begin synthetically seeding human life in closed environments in accordance with growth hacking best practices. Product market fit declared a solved problem.

Thing is, you provide an API so that people don't use some kind of data harvester. If your API is terrible, people resort to harvesting.

> Note that the id and the pub_date of a podcast or an episode are exempt from the caching restriction

> it really limits what you could even use it for, not being able to store the `id` even for a later reference
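To make the exemption concrete, here is a minimal sketch, assuming API records are plain dicts: drop everything except the exempt fields before storing anything server-side. `strip_to_exempt_fields` is a hypothetical helper, not part of the Listen API, and the field set only covers the names quoted in the terms.

```python
# Field names quoted in the terms of use. The terms end the list with "...",
# so there may be other exempt fields; treat this set as an assumption.
EXEMPT_FIELDS = {"id", "pub_date_ms", "latest_pub_date_ms"}


def strip_to_exempt_fields(record: dict) -> dict:
    """Return a copy of an API record holding only fields exempt from the caching rule."""
    return {key: value for key, value in record.items() if key in EXEMPT_FIELDS}


# Made-up episode record for illustration.
episode = {
    "id": "abc123",
    "title": "Some episode",
    "pub_date_ms": 1600000000000,
    "audio": "https://example.com/episode.mp3",
}
storable = strip_to_exempt_fields(episode)
```

Here `storable` keeps only the `id` and `pub_date_ms`, which is enough to reference the episode later and re-fetch the rest.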

:facepalm: - thanks, I'll (keep it but) edit my comment to reflect that correction.

At least for the actual audio, I understand that podcasters get grumpy when people cache that server-side, because they depend on server logs to get viewership numbers for advertisers, so if a popular client downloads the audio once and distributes it to all their customers, they can't make money off any of those customers.

Podcasts also often target advertisements geographically (based on IP address, I guess?). Being able to serve to each listener is part of their value proposition to advertisers.

I worked on a food tracking PWA, and getting it to be useful offline was horrible. We’d have to hit the API at least once a day to grab commonly used foods and refresh our temporary cache. The data did not change at all... eggs don’t suddenly have a different calorie count the next time you eat them lol

A database of all of the world's foods though could easily be larger than I'd like a calorie counting app on my phone to be though, for example. So it's not necessarily silly - network can be cheaper than disk.

IIRC, Mapbox has similar terms for both their map tiles and their geo lookup results.

I was tinkering a bit recently in an effort to build a simple system that finds 'related' podcasts and see if I can see the network effect play out over time. I did this by building a graph of people (hosts/guests) and episodes and started folding in tags/topics. None of this is in my wheelhouse, and I found:

- It takes a lot of work to curate a substantial collection of podcasts. There are lists all over the place but it's hard to know what's really in there.

- I attempted to use SpaCy and/or NLTK to do some 'Named Entity Recognition' in order to extract topics/people/organizations from episode titles and descriptions. This was surprisingly brittle. The string 'Sean Carroll', for example, wasn't detected as a person by either framework (IIRC). It also seems quite brittle to punctuation and other context (e.g. beginning or end of a sentence). This was using the default models shipped with both. I started off with just the English models but expanded as there were lots of names being skipped silently. That helped less than I had hoped.

- I have yet to find a good UI for exploring a graph. I used Neo4j and the built in 'browser' is not intended for that purpose. Gephi has good capability for filtering and analytics, but it takes quite a bit of getting used to and the graph itself isn't amenable to dynamic navigation.

That's all. Bookmarking this as it would really help.
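For anyone curious what the people/episode graph idea looks like at its smallest, here is a sketch using plain dicts rather than Neo4j. The sample data and the `related_podcasts` function are made up for illustration and are not the setup described above; it just scores two podcasts as related by the number of people who have appeared on both.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (podcast, episode, people appearing in the episode).
EPISODES = [
    ("Podcast A", "ep1", {"Alice", "Bob"}),
    ("Podcast B", "ep7", {"Bob"}),
    ("Podcast C", "ep3", {"Carol"}),
    ("Podcast B", "ep8", {"Alice"}),
]


def related_podcasts(episodes):
    """Count, for each pair of podcasts, how many people have appeared on both."""
    podcasts_by_person = defaultdict(set)
    for podcast, _episode, people in episodes:
        for person in people:
            podcasts_by_person[person].add(podcast)

    shared = defaultdict(int)
    for podcasts in podcasts_by_person.values():
        # Each person links every pair of podcasts they appeared on.
        for a, b in combinations(sorted(podcasts), 2):
            shared[(a, b)] += 1
    return dict(shared)


links = related_podcasts(EPISODES)
```

With this data, Podcast A and Podcast B share two people, and Podcast C is linked to nothing.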

Many people use our Listen Later playlists to curate podcasts / episodes by topics.

Here are some examples: https://www.listennotes.com/podcast-playlists/

Each playlist has an RSS feed, so you can subscribe to the playlist in any podcast app (except for Spotify or the like).

Big fan of Listennotes in its entirety, but this feature is a real gem!

Yes this looks great, thanks!!!

I’ve been thinking of doing the same for my graph visualization newsletter source/target [0].

I’d love to connect if you’re interested in collaborating! sourcetarget@cjlm.ca

[0] https://sourcetarget.email

Love seeing development in the podcast space. One specific problem I've been wanting solved for a long while is difficulty with sharing podcasts with friends across podcast apps. If you're not using the same podcast app as your friend, it's always a pain to manually search and find the podcast in your own app. I'd love a universal podcast url, something like `podcast://<podcast_url>` that individual podcast apps can understand, which links you to the podcast within your desired app, similar to the "default browser" behaviour on mobile and desktop. Has anyone come across something like this?

Podcasts are just RSS feeds. Nothing is stopping an app from registering itself as a handler for the RSS MIME type, at least on desktop/Android (I don't know how iOS works here). I doubt most users have an RSS reader installed at this stage, so most wouldn't even risk having the feed shown as a list of links to audio files by the wrong app.

I mean podcasts are an extension of RSS if I recall correctly... I don't see why this wouldn't be possible.

This is a big problem for iOS. My spouse uses the default Podcasts app. I use Overcast. Anytime she sends me something to listen to, iOS tries to open it in Podcasts. When I send something from Overcast it gets sent as an Overcast URL.

In the distant past (2009 or so), podcast://… would open in iTunes or similar, being equivalent to http://…. So you’d have something like podcast://example.com/podcast.xml. I haven’t the foggiest idea whether this still works, or how HTTPS might have been integrated or not.
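If the old scheme really was just a stand-in for http://, the mapping an app would need is tiny. A minimal sketch, with the caveat that `podcast_url_to_feed_url` is a hypothetical helper and the optional `https` handling is a guess, not documented behaviour:

```python
def podcast_url_to_feed_url(url: str, scheme: str = "http") -> str:
    """Rewrite podcast://host/path to <scheme>://host/path; leave other URLs unchanged."""
    prefix = "podcast://"
    if url.lower().startswith(prefix):
        return f"{scheme}://{url[len(prefix):]}"
    return url
```

So `podcast_url_to_feed_url("podcast://example.com/podcast.xml")` gives back the plain http:// feed URL, which any app that understands RSS could then fetch.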

Is this service cooperating with or in competition to the Podcast Index?


I was wondering the same thing. This one is a paid $$$ API.

It's in competition, it seems to me.

I'd like to see a crowdsourced "Genius for Podcasts".

Most podcast producers are terrible about correctly adding metadata: Chapters, images, episode notes, descriptions, etc.

Let the superfans upload custom metadata to be displayed alongside the episode as it's playing in your podcast player.

I'm building something very similar to this!

> Trusted by 2,007 companies and developers.

Haven't seen this before, an actual figure rather than 'these big names' (and you have no idea if it's just some small team somewhere for some toy test/demo, or a significant piece of the whole organisation's puzzle).

I'm (just idly) curious what number you waited for (assuming you did) before making that public. Because, and obviously it'll vary a bit for different people, there's going to be some number below which it has negative impact, not just (probably some other, with a 'meh' range between, number) above which it has the positive impact that is its raison d'être.

When I see usage figures touted in the form "Trusted by X companies" I assume X is the total number of signups they've had.

Yes I assume that's the same. What I mean is '0' obviously looks bad, '2007' as it was when I commented sounds good (to me anyway).

If you knew you wanted to have that copy on day zero, you probably wouldn't launch with it, because it doesn't look good, so I just wonder at what point people think it starts to be positive, or at least not negative.

Although, they do actually list 51 customers with links to their websites. It's a pretty compelling "social proof" take, IMO.

Super happy to see a BoostVC "cockroach" still at it 2.5 years later! Keep up the good work. I use ListenNotes myself to discover and play podcasts.

If you ever get tired of managing an ElasticSearch cluster, have a look at Typesense: https://github.com/typesense/typesense

It's an open source alternative to Algolia to build instant search with typo tolerance.

I recently built this demo to show Typesense in action on a 32M songs dataset: https://songs-search.typesense.org/

Should return most results in less than 50ms.

This makes me wonder how Spotify continues to have dreadful search.

Suggestion for pivot: add a podcast-playing web application (basically podcast subscriptions; you already have most of the rest in place), and charge a more reasonable amount for that plus unlimited regular search. The pro subscription is way too expensive for me.

Edit: I didn't notice that this is about a new API service

Happy customer here. I’ve seen a few people suggest a free service but from our testing this is far more comprehensive and with better search quality.

We use the service in conjunction with iframely to load podcast episodes that can be listened to with ease.

Great product, customer service and documentation.

Thanks from team Paiger <3

So like Podcastindex.org, but not free?

Cool project! I wonder if your price points are randomly picked or if 100 is the sweet spot, which would be odd.

Do you plan to add some speech-to-text magic, so one can search for the actual podcast content? That would be a killer feature for me :)

Yeah, modern open source speech recognition like Vosk can cost around 2 cents per hour (70 times less than Google STT's $1.4/hour) and should be just enough for search.

@wenbin do you need any help with it?

Great work Wenbin! What's the hardest part about maintaining this API?

Thanks for asking!

The hardest part is to make small incremental improvements over a long period of time :)

Like most software projects, this API is never a finished product. It's always work-in-progress.

Small incremental improvements are not glamorous, typically not newsworthy to share to the public.

Some examples of small incremental improvements:

1. Improve API docs. I heard that many API-focused startups have a dedicated team to maintain their API doc page.

2. Dealing with edge cases. As more apps/websites use our API, we'll see some edge cases that we would never have anticipated. The fix could be as simple as adding a data field to the response with a two-line code change, or as involved as changing the search index, which requires re-indexing the whole thing over a few days. There could also be some strange edge cases with billing, e.g. what if a user subscribes to the paid plan, then unsubscribes, then subscribes again, then does something strange, then unsubscribes...

3. Customer support. This involves adding FAQ (tweaking the texts) and preparing email templates to answer frequently asked questions from users.

4. Doing things to keep the service robust & performant, e.g., adding new alerts via Datadog/Pagerduty so we can know what goes wrong in time. We also need a mechanism to detect when a particular app sends tons of requests (e.g., sends requests in an infinite loop) in a short amount of time, and we should be able to do something about it (e.g., suspend the account).
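The request-flood case in item 4 can be sketched with a sliding window per app. This is only an illustration of the general technique, not how Listen Notes does it; `BurstDetector`, its parameters, and the in-memory storage are all assumptions.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag an app that makes more than `limit` requests within `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._timestamps = defaultdict(deque)  # app_id -> recent request times

    def record(self, app_id: str, now: float) -> bool:
        """Record one request at time `now`; return True if the app is over the limit."""
        stamps = self._timestamps[app_id]
        stamps.append(now)
        # Drop requests that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        return len(stamps) > self.limit


detector = BurstDetector(limit=3, window=1.0)
flags = [detector.record("app-1", t) for t in (0.0, 0.1, 0.2, 0.3)]
```

The fourth request inside the same second trips the limit; what happens next (alerting, suspending the account) is a separate decision.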

Are the docs custom or are you using a third party product? Doesn’t look like Swagger UI or Slate.

It's built from scratch, which was easier than customizing from some open source projects back then (early 2019).

But the doc is codified in openapi format: https://www.listennotes.com/api/docs/#openapi

So you can feed the openapi spec into other doc viewers, e.g., Postman, or redoc https://listen-api.listennotes.com/api/v2/openapi.html

As an avid podcast creator and consumer, I’d love to take a full look at this, but I kept getting the full captcha experience. You know what I mean, “select the squares with sidewalks” :(

OP, are you aware of this?

@wenbin what do you think about adding the ability to comment at specific portions of a podcast (in the player obviously), similar to SoundCloud (assuming there are no patent issues)?

It's not a replacement for this by any means, but in case anyone would find a reasonably up to date list of around 600,000 podcasts useful, here you go: https://gofile.io/d/MjYVy7 - No episodes, just the name, creator, and feed URLs for further crawling on your own.

Great work.

I first heard about Listen Notes when you were interviewed on the Django chat podcast.

Here’s that episode if people want to learn more about the tech behind the site. https://lnns.co/Td9vzk47qQ3

I wish there was a question in the FAQ asking:

"Why would I use this over the iTunes API?"

Great question! Will add.

For the iTunes API: 1. You can't search episodes. 2. You can't get many podcast search results. 3. Their terms of use may not allow you to do what you want to do.

The iTunes api gives you almost no useful data. And it's rate limited to 20req/min, with no alternative.

If that is what the official Podcasts app returns, it’s bad search results.

We used ListenNotes a while back in a web based podcast player and have only good things to say about the API. It's reasonably priced, much easier to deal with than Apple's API and email support is speedy!

This is great, but I wish it were clearer that we had to apply with a linkedin URL before actually getting to test the thing out.

When trying the Postman Web View I get:

Profile cannot be found This public profile may have been disabled or deleted

Already contacted Postman customer support :)
