Hacker News
RSS Box – RSS for websites that do not support RSS (rssbox.herokuapp.com)
353 points by mrzool 10 months ago | hide | past | favorite | 104 comments

> FYI: If you enable JavaScript then you be able to access additional options in the dropdown menus. The website should still be somewhat usable, but recent versions of Firefox will try to download the RSS feeds.

This is how you communicate with people with JavaScript disabled! Kudos. Most sites either present you with a blank page with no information, a blank page asking to enable JavaScript (even when the content is just text), or silently break some features.

That kind of respectful messaging alone is making me want to take a closer look. Though it should be “then you will be able”.

> Important: Please do not overload this service. Do not make more requests than you need.

What if we don’t have control over the frequency of requests (e.g. using a service like Feedly)? Do those happen often enough that we’d need to host the app ourselves?

> What if we don’t have control over the frequency of requests (e.g. using a service like Feedly)? Do those happen often enough that we’d need to host the app ourselves?

I know the wording is a bit vague, and I know that most services don't let you customize this. I added it after I suddenly started receiving tons of traffic caused by, I suspect, a single user. This person was purposefully fetching feeds multiple times a minute.

Anyhow, if you aren't actively trying to abuse the service, you should be good. Some RSS readers have "boost" features to fetch feeds more frequently (often a paid feature).

Once I am able to add some good caching, I may be able to remove that notice. But right now the service is somewhat overloaded, which is why some of the services (Twitter and Instagram in particular) may give you errors at the moment.

Feed aggregation services tend to minimize the frequency of requests especially for unpopular feeds since more requests = higher load for them as well. The incentives on the publishing and consuming sides align. Many services only offer to increase crawling frequencies for premium users, and even then only for a limited number of feeds. Not to mention they only need to crawl once for however many subscribers.

It’s really people who don’t use aggregation services and set their clients to update very frequently (say every minute) that pose a problem.

An aggregator most likely would prioritize feeds by popularity and then run most of their crawlers 24/7. Telling some users that a popular feed has new entries and not telling the free tier sounds complicated. Unless the free tier is getting a read replica of the database, in which case it’s throttled only by how far behind the read replicas get.

Prioritizing lower subscribership feeds if a smaller number of premium customers add them makes some sense though.

Of course popular feeds are crawled constantly, and no, there’s no “telling some users that a popular feed has new entries and not telling the free tier”. But you bet services don’t want to crawl those one-subscriber feeds (e.g. some websites have personal feeds for paid subscribers) constantly for free. And RSS Box feeds apparently tend to fall in the latter category.

> What if we don’t have control over the frequency of requests (e.g. using a service like Feedly)?

I believe most RSS reader-clients and aggregator backends are programmed to respect HTTP cache-control headers; so as long as the developer of this service sets those headers appropriately for their endpoints, there shouldn't be a problem.

The warning is likely more for people's custom scripting using curl(1) et al, where there isn't an HTTP cache in the code path.
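For illustration, a sketch of the server side of that (function names are mine, not rssbox's): emit a TTL via Cache-Control plus an ETag validator, and answer matching conditional GETs with 304 so cache-respecting readers cost almost nothing.

```python
import hashlib

def feed_cache_headers(body, max_age=900):
    """Headers that let well-behaved readers back off: a TTL via
    Cache-Control and a content-hash ETag for conditional GETs."""
    etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
    return {"Cache-Control": "public, max-age=%d" % max_age, "ETag": etag}

def not_modified(if_none_match, current_etag):
    """True when a conditional GET can be answered with 304 and no body."""
    return if_none_match == current_etag

headers = feed_cache_headers(b"<rss>...</rss>")
```

A reader that stores the ETag and sends it back as If-None-Match then only pays for a full fetch when the feed actually changed.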

Yes, this is far better than most.

But as a matter of fact, almost all of that functionality could be done without JavaScript! You can make the dropdown work by changing it to use a <details> element (the best approach for both practical and accessibility reasons), or :focus-within, or any number of hacks (I’m partial to the checkbox hack which has served me well for many years, except for minor accessibility concerns), and then replace the <a> elements with appropriate submit buttons that will set the necessary query string parameters.

For example, the button for “exclude retweets” would be:

  <button type=submit name=include_rts value=0>Exclude retweets</button>

The only bit that I can’t think of a good way to handle is when you’re setting more than one query string parameter, like the Twitter “Exclude retweets and replies” button which wants to add `include_rts=0&exclude_replies=1` to the query string. I can think of ways to make it work with two clicks (harming accessibility in the process), but none with only one click. My best hope was putting formaction="?include_rts=0&exclude_replies=1" on the submit button, but a GET form will scrub any query string from its action, so that doesn’t work.

And download links like the Ustream one would use formaction="/ustream/download" on their submit button.
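For concreteness, the approach described above might look roughly like this (the form action and field names are illustrative, not rssbox's actual markup):

```html
<form method="get" action="/twitter/feed">
  <details>
    <summary>Feed options</summary>
    <!-- each option is a submit button that sets its own query parameter -->
    <button type="submit" name="include_rts" value="0">Exclude retweets</button>
    <button type="submit" name="exclude_replies" value="1">Exclude replies</button>
  </details>
</form>
```

The <details> element gives you the open/close dropdown behavior with no JavaScript and no CSS hacks, and each button contributes its name/value pair to the GET query string when clicked.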

I will look into this. Thanks!

> What if we don’t have control over the frequency of requests (e.g. using a service like Feedly)? Do those happen often enough that we’d need to host the app ourselves?

I run a similar service where I rate limit. I've whitelisted the IPs of the centralised feed readers like Feedly (at least the ones I've been able to identify), but the rate limit for non-whitelisted IPs is generous enough that it's really only there to catch the problematic scripts which crop up from time to time.

> Though it should be “then you will be able”.

Fixed. Thanks! :)

I tried to tackle this issue in a more general way, so I wrote RSS Proxy [0], which analyzes the DOM structure and derives feed candidates from it. Feel free to try the demo [1].

[0] https://github.com/damoeb/rss-proxy/ [1] https://rssproxy.migor.org/

This is amazing. I've been looking for this exact solution. Thanks for your work!

I've been looking for something like this for ages... Thank you!

You can actually get RSS feeds for YouTube: emacs users like myself who consume YouTube via elfeed have been doing it like so:

More info here: https://joshrollinswrites.com/help-desk-head-desk/20200611/

It really helps to break away from the addictive properties of YouTube's "Up Next" algorithm
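For reference, YouTube serves Atom feeds at well-known URLs for both channels and playlists; a tiny helper (naming is mine) to build them:

```python
YT_FEED = "https://www.youtube.com/feeds/videos.xml"

def channel_feed(channel_id):
    """Atom feed of a channel's uploads (channel IDs start with 'UC')."""
    return f"{YT_FEED}?channel_id={channel_id}"

def playlist_feed(playlist_id):
    """Atom feed of a playlist; per the thread, only the most recent
    ~15 entries are returned."""
    return f"{YT_FEED}?playlist_id={playlist_id}"
```

The resulting URL can be dropped straight into elfeed or any other reader.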

Wow... I wrote my own WebSub receiver and put it on an always-on server to get around the fact that I assumed YouTube doesn't have RSS from the fact that I couldn't find this information anywhere.

It's crazy how rare it's become to see an RSS link on websites that actually have a feed if you add /feed or /feed.xml to the URL.

I have to wonder how much more dead RSS would be if WordPress (like, 90% of blogs/news sites) didn't create a /feed by default.

Check this out: it puts an RSS/Atom subscribe button back in the URL bar. https://addons.mozilla.org/en-US/firefox/addon/awesome-rss/

It helps discover RSS feeds for sites.

That is great, thanks. Got back into RSS lately, and doing the ol' "look for RSS link in footer and then guess at /feed" rigmarole every time gets old fast.

The screenshot of the RSS icon in the URL bar brings back a memory I don't know if I had. Didn't Firefox use to do this by default, or am I misremembering?

You can get an XML dump of all your subscribed channels which contains these links: https://www.youtube.com/subscription_manager (scroll to the bottom)
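That subscription export is OPML, so pulling out every channel's feed URL is a few lines (a sketch, assuming the standard xmlUrl attribute):

```python
import xml.etree.ElementTree as ET

def feed_urls(opml_text):
    """Collect the feed URL of every outline entry in an OPML export."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]

# A minimal stand-in for the real export's structure:
sample = """<opml version="1.1"><body>
  <outline text="subs">
    <outline text="Some Channel" type="rss"
             xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UC123"/>
  </outline>
</body></opml>"""
```

The same function works on OPML exports from most feed readers, since xmlUrl is the conventional attribute name.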

The playlist feed will only show the first 15 videos.

I haven't used the playlist one, I just knew it existed

Does anyone remember Yahoo Pipes? That was awesome, I mostly used it (around 2008-2010) to make RSS for sites where it wasn't available https://en.wikipedia.org/wiki/Yahoo!_Pipes

If you dig around, you can find some alternatives to YP


I run one of those, https://pipes.digital. It supports scraping sites that don't have RSS, creating a feed via XPath or CSS selectors. Of the sites this cool project supports, it has integrated support for YouTube and Twitter. But I think it would be a great extension to embed rssbox and offer blocks for all those services.

Off topic but since we are talking about RSS feeds: Is there a web service to replay feeds?

The use case is old blog archives one wants to (re)read sequentially from the start, but not in binge mode. So maybe one post per day or week. I'm thinking about the old posts of Aaron Swartz or Steve Yegge.

Just adding the feed to a feed reader is often not sufficient because the feed only contains the last 20 entries or so.

I would pay for this! It's common for blogs to go through a life cycle, and usually when you discover them they're already relatively inactive. You can go through archives, but there isn't good software support for making sure you see everything.

This would also be useful for going through historical incidents - e.g. replaying the top 10 politics blogs, day by day, during momentous events. It's simple enough to just treat it as an offset; the display would clearly say the original date of publication. This is a lot better than simply adding old RSS feeds since it comes in at the same rate it would have happened in the moment.
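The offset approach is tiny in code: shift each original publication date by the gap between the first post and the moment you subscribed, and only surface entries whose shifted date has passed (a sketch with made-up field shapes):

```python
from datetime import datetime

def replay_visible(entries, subscribed_at, now):
    """entries: list of (title, published) tuples, oldest first.
    An entry becomes visible once the time elapsed since subscribing
    exceeds its original distance from the first post."""
    if not entries:
        return []
    start = entries[0][1]
    return [(title, published)  # display still shows the original date
            for title, published in entries
            if subscribed_at + (published - start) <= now]

entries = [
    ("post 1", datetime(2006, 1, 1)),
    ("post 2", datetime(2006, 1, 8)),
    ("post 3", datetime(2006, 3, 1)),
]
visible = replay_visible(entries, datetime(2024, 1, 1), datetime(2024, 1, 10))
```

Nine days after subscribing, only the posts originally published within the first nine days of the blog have "replayed", exactly the pacing described above.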

Something similar could be set up for newspapers. Imagine receiving all space-related stories from Life, NYT, Guardian, Spiegel, for the time period from say 1965-1970.

The current best version of this requires a huge amount of research into old newspapers, and also reading books and then manually connecting each book's timeline. If instead you could in parallel consume multiple sources, the correlation would be natural.

Even better would be adding in later commentary about those events.

Example: take "History of the Decline and Fall of the Roman Empire", and annotate each page with contemporary thought from each era. So every page would have a section on what the scholarly response was when it came out, then 50 years later, then 100, each adding in new methods of investigation and validation/testing of the claims as archaeology, linguistics, carbon dating, anthropology, etc. developed in the background.

There are some services like Feedly that cache every entry some user fetched, so you can go backward even if you subscribed just now. But unless someone was already using a feed, I doubt it will work, as it implies the service must crawl the internet for RSS feeds and subscribe to them all the time in order to keep history (RSS on its own can't do that).

Sounds like internet archive for RSS.

I once made a Python commandline utility for this, which does still work if you can install it with the right (now long-outdated) Python version: https://pypi.org/project/dripfeed/

I don't intend to maintain it further, but all the source code is available and it's not terribly complicated (all the hard stuff is done by other Python libraries).

Google Reader used to store every entry in the feed ever, and it was my primary reason for using it. I was subscribed to tons of webcomics, and this allowed me to easily keep track of where I was. The only limitations to this were the rate they checked feeds at (I never read about any complaints about this, but also wasn't looking and wouldn't have been interested at the time), that they only had history starting when the first person added a feed, and that a large number of blogs and comics would only put notifications in the feed, without any actual content.

Well, if the feed doesn’t contain older entries, it’s impossible without something like the Wayback Machine (and I doubt they store RSS feeds).

Some feeds are just paginated. Following "next" links would work. Feed readers don't do this, though, as they assume people are usually interested in the new posts.
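Following those "next" links (RFC 5005-style paged feeds) is straightforward; a sketch that walks a feed's pagination given any fetch function:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def next_link(atom_text):
    """Return the href of the feed's rel="next" link, or None."""
    root = ET.fromstring(atom_text)
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None

def walk(url, fetch, limit=100):
    """Follow rel="next" links, yielding each page (fetch: url -> xml str).
    The limit guards against pathological or circular pagination."""
    while url and limit > 0:
        page = fetch(url)
        yield page
        url, limit = next_link(page), limit - 1

# Two toy pages: the first links to the second, the second ends the chain.
pages = {
    "p1": '<feed xmlns="http://www.w3.org/2005/Atom">'
          '<link rel="next" href="p2"/></feed>',
    "p2": '<feed xmlns="http://www.w3.org/2005/Atom"/>',
}
history = list(walk("p1", pages.get))
```

In real use, fetch would be an HTTP GET; the walk stops as soon as a page has no rel="next" link.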

Perhaps the tool behind https://rewind.website ?

There's an open source service that does something similar, does anyone remember its name?

edit: Both rssbox in the OP, and RSS-bridge[1] are open source. I was thinking of the latter. There's also RSSHub[2].

[1] https://github.com/RSS-Bridge/rss-bridge

[2] https://github.com/DIYgod/RSSHub

There's https://easylist.to/easylist/easylist.txt for universal content-blocking rules, and youtube-dl for universal video extraction methods.

While building a feed reader of my own, I had a recent idea for a project for universal content crawling rules: how is the content hierarchy organized on each site and how do you extract it from each content page. A single community project that any other project could use to crawl websites for their content.

Looks like rss-bridge comes close to that.

To help extract article content, you might be interested in this collection I help maintain: https://github.com/fivefilters/ftr-site-config/

It's used, in addition to an automatic article extractor, in Full-Text RSS: http://ftr.fivefilters.org

The service https://feed43.com will enable you to build an RSS feed out of pretty much anything with a URL. I use it to build RSS out of sha256 release files, vendor client download release pages, changelogs, etc.

Similar service: https://politepol.com/en/

I would say this is a simpler service overall and not an even competitor: it appears to use HTML elements as keys and create entries based on that, but it stops there. I would not consider it on par with Feed43 based on the few samples I tried; it lacks the in-depth parsing required to handle the expected result formatting.

Out of curiosity, what is your software stack?

I am not affiliated so not really sure, just a satisfied end user for a number of years.

I tried a whole lot of solutions before discovering this service; it's the only one I've found flexible enough to handle these random tech endpoint formats. It's basically like RSS-Bridge (mentioned in other comments) with a visual regex parser built in, to avoid having to write actual code like RSS-Bridge requires. $0.02 YMMV :)

Has anyone ever done something like this for Facebook?

I know it'd have to be subjective per user (security ACLs ⇒ different accounts seeing differing subsets of other accounts' posts); but I'd be fine with just getting my own account's subjective view, by logging into such a service using Facebook OAuth (or, if that isn't enough, then I'd be fine with handing over my Facebook creds themselves, ala XAuth, provided the service is a FOSS one I'm running a copy of myself in e.g. an ownCloud instance.)

I also know that it'd likely require heavyweight scraping using e.g. Puppeteer, to fool Facebook into thinking it's real traffic. But that's not really that much of an impediment, as long as you don't need to scale it to more than a dozen-or-so scrapes per second. (Which you'd automatically be safe from if it was a host-it-yourself solution, since there'd only be one concurrent user of your instance.)

Anyone done this?

RSSBox used to have Facebook support (but only for public pages, no personal content), but when Facebook started cordoning off their API two years ago, I had to turn it off since I was unable to get my application approved. The code is still there, but I am doubtful it would work even if you manage to get an API key that works. I think the best option may be to scrape the web content now, unfortunately.

I have assumed for a while the only way to convert FB -> RSS would be to scrape the home page, but from what I recall the HTML & DOM is all kinds of messed up - intentionally obfuscated to prevent adblocking. From a quick look just now it does seem like it would be a nightmare to try to parse it as-is - and I would guess FB changes a lot of the output regularly anyway to defeat adblockers, making efforts to keep up pretty challenging.

It almost sounds like a problem best solved with OCR, rather than scraping per se. Build a simple model to recognize “posts” from screenshots, and output the rectangular viewport regions of their inner content; then build some GIS-like layered 2D interval tree of all the DOM regions, such that you could ask Puppeteer et al. to filter for every DOM node with visibility overlap with that viewport region; extract every single Unicode grapheme cluster within those nodes separately, annotated with its viewport XY position; and finally, use the same kind of model that PDF readers use to let you highlight “text” (i.e. arbitrary bags of absolute-positioned graphemes), to “un-render” the DOM nodes’ bags of positioned graphemes back into a stream of space/line/paragraph-segmented text.

I wrote about trying to do this for Facebook page events as an example. Code sample included:


A somewhat unconventional UI for following content, which incidentally works with RSS Box, is Fraidycat[0]. It groups recent posts under “individuals” with a visualization of how much recent activity there is in a given feed, and lets you choose a “follow intensity”, which works in a nice and transparent way.

[0] https://news.ycombinator.com/item?id=22545878

Hey thanks for this lovely pitch, goblin89. <3 all my goblin friends out there

I learned of Fraidycat in an RSS related HN comment yesterday and have been trying it out. Love some of the homemade quirkiness that I forgot software used to have -- the video is great too.

Only question I have: do you really have to assign every GitHub issue to yourself, the sole developer? Something about it cracks me up: https://github.com/kickscondor/fraidycat/issues

> "Thanks for the bug report. Fortunately for you, our best man is on the job!"

> kickscondor has assigned the issue to kickscondor

Anyways, just playing. Great product and great shepherding of the GitHub project.

Oh I feel such a sense of progress just assigning bugs to myself. When I get around to writing a blog post about it, I certainly hope you will be there to upvote it, hombre f.

Related (maybe) but tangential: does anyone know of a good web-to-text converter? Back in the day you used to just use Lynx; is that still the way, or has it been surpassed?

I personally use postlight/mercury-parser[0] to convert articles to Markdown files, plus a small script to add extracted metadata (like author, featured image, original link, and the date it was scraped) to the top of each Markdown file, and put those Markdown files into Hugo for a DIY Pocket alternative.

You can use the --format flag to pick between Markdown/text/HTML output, so it should serve your purpose.

If anyone here has looked for a reader view on Chrome, odds are you've probably stumbled upon Mercury Reader[1]. This is what powers it.

[0] https://github.com/postlight/mercury-parser/

[1] https://chrome.google.com/webstore/detail/mercury-reader/okn...

I’ve been using this programmatically: https://www.npmjs.com/package/html-to-text. Not sure if it can be run from the command line, but it’d be easy enough to write a wrapper.

Also I reckon pandoc is worth a try.

There is w3m, which also supports CSS and images, if your terminal is cooperative.

Any of the popular text-mode browsers will do it, afaik (links, lynx, w3m), since it's already pretty much part of their functionality. Calibre might be able to do it as well, including exporting to many other formats.

Pandoc can convert HTML to Markdown and other text formats:


Check out https://txtify.it

There is also RSSHub, which supports more sites, but many of them are Chinese websites.


SoundCloud actually does have RSS feeds. I'm not sure how they're exposed to users (I forgot how I got this URL), but they exist:


I worry about this being swarmed by traffic and hugged to death. Since it's popular on HN, I imagine the particular Heroku instance is overwhelmed. I was surprised that it worked when I used it. I guess I'm gonna have to pony up and donate then...

You are correct in that it is somewhat starved of resources. The instance that I host is running on a free Heroku dyno (512 MB RAM). I do not have a good caching solution currently, which is why Twitter and Instagram are almost always returning errors now. I suspect a single person is responsible for most of the issues (see GitHub issue #38). It's actually amazing how well it runs considering how much traffic is thrown at it.

At some point I hope to get enough time to implement a caching solution, which should hopefully resolve most of these issues.

Looks like you can self-host; there is a GitHub repo.

In the same vein: RSS-Bridge


(you can find multiple instances on the web)

Nice one. As much as I approve of services, I'd rather self-host. Tired of people pulling the rug out from under my feet.

You can self-host this one as well. They mention it on the page and link to the GitHub repo: https://github.com/stefansundin/rssbox

I don't know if anyone will particularly care, but both Substack and MailChimp newsletters have RSS feeds, in case you prefer those over mail. With Substack, you merely append "feed/" to the end.

With Mailchimp, well, you look for a "view in browser" or "share this issue with friends" link in the newsletter. On the archive page it takes you to, an RSS link is in the right-hand corner.

RSS Box is great - perfect for the kind of sites where scraping from the HTML is problematic because the HTML changes so much.

I work on a somewhat similar project called Feed Creator which can be used for less popular pages where you can select elements for the feed using CSS selectors: https://createfeed.fivefilters.org

I wrote a similar tool (but not as polished) that requires you to write custom plugins. This works well if you have websites that are hard to scrape in an automated way. Maybe it's useful to someone else: https://github.com/dewey/feedbridge

Is it necessary to write a plugin for each website?

You could in theory combine some, but it was just a very specific use case I built it for. Just a fun project, and on GitHub in case anyone else has a similar niche problem.

My own problem with services like this is that I don't want to tell their owners what blogs I read. It's a privacy concern.

I'd feel much more comfortable using a standalone tool that I could run on my own laptop (ideally one that didn't require running a web server or even a web browser).

To practice SwiftUI, I was building an RSS/feed reader but immediately realized anything you're going to display needs to be web-rendered. Or rather, to avoid a web browser (like WKWebView), the product is so neutered that it's not all that compelling.

Even stripping everything out but plaintext with an HTML parser to put it in a text view, I realized I could wrap the links with native Cocoa labels that act as hyperlinks. And then do the same with images. Hmm, what about tables and stuff? Soon I realized, why would I even want this? It's annoying to visit the origin site when the RSS reader can just render it, and it kinda defeats the purpose.

For many years I've used and continue to use an RSS reader named newsboat[1][2] (which is an actively developed fork of the venerable newsbeuter, which I used for years before that).

It's a feature-rich RSS reader that runs completely in the terminal, presenting text-only views of each RSS feed.

The links open in the browser of your choice (which for me is a text-only version of emacs-w3m, which I also run exclusively in the terminal).

However, some RSS items can be read in their entirety within the RSS reader[3], and do not require opening any links. This is my preferred method of consuming RSS.

[1] - https://newsboat.org/

[2] - https://github.com/newsboat/newsboat

[3] - i.e. those RSS items for which the author has chosen to make their entire article/post available over RSS, instead of merely posting a teaser and requiring browsing to their website to read the rest

It's open source, just roll your own insurance. https://github.com/stefansundin/rssbox

I wanted to check out the service in the OP’s post, and realized I didn’t have an iOS based RSS reader app on my tablet. In line with your privacy concerns, I wanted an app that didn’t require creating an account. I couldn’t find one in the first several I loaded. Any tips?

None of the first few apps that show up for me in the iOS store need accounts. I've used Reeder in the past, but now I use NetNewsWire, which is free and available on GitHub.


Is it normal in iOS-land to need to register to use apps?

All I want is something that lets me get an RSS feed of Instagram accounts I follow, by giving nothing except the URL or username. I've tried 4 separate services that all work at first, then -- a week, a month, an indefinite time in the future -- stop working and never resume again.

https://github.com/RSS-Bridge/rss-bridge provides this and works for me. It sometimes throws an error but recovers after that.

Is there any way to turn a twitter feed (i.e., multiple users) into a single RSS feed?

If you use NewsBlur, it has an inbuilt twitter client and allows subscribing to twitter lists, which could accomplish this.

I've got something similar: it sends Twitter feed to e-mail: https://github.com/dottedmag/twema

Not at the moment, unfortunately.

How would you guys use it? What're the use cases for this website?

If you want to subscribe to a website / blog that doesn't offer an RSS feed.

https://feed43.com/ does it for almost any website provided you fiddle with a bit of "code"

I selfhost this: https://github.com/RSS-Bridge/rss-bridge

I highly recommend it.

If you want to process an RSS feed programmatically, you have to run code to poll the feed and keep track of items already processed. This isn't hard to write, but it's often not core to your app's logic.
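That glue code is small but easy to get subtly wrong; a minimal sketch of the poll-and-dedup loop (fetch_items stands in for your feed fetcher/parser):

```python
def poll_once(fetch_items, seen_ids, on_new):
    """fetch_items() -> list of (guid, item) pairs; invokes on_new
    for each unseen item and records its guid so it runs only once."""
    for guid, item in fetch_items():
        if guid not in seen_ids:
            seen_ids.add(guid)
            on_new(item)

seen, got = set(), []
feed = lambda: [("a", "first post"), ("b", "second post")]
poll_once(feed, seen, got.append)
poll_once(feed, seen, got.append)  # second poll finds nothing new
```

In a real deployment, seen_ids needs to be persisted between runs (a file or small database), which is exactly the kind of incidental state a hosted event source handles for you.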

You probably just want to run code on each new item in the feed.

Pipedream lets you treat an RSS feed as an event source. Pipedream runs the code to poll the feed, emitting new items as the feed produces them.

RSS for Hackers - https://rss.pipedream.com

"There was a problem talking to Instagram. Please try again in a moment." blocked by Instagram?

Instagram used to have an open API, but that is closed down now. The app is currently using some private-ish endpoints, but they are ratelimited. I need to add caching. More people have started using my app recently, and I have not had time to add caching yet.

I got the same error message using my own user for Twitter

This is really cool. Twitter didn't work, though.

For twitter I use the perl backend scripts (https://github.com/ciderpunx/twitrssme/tree/master/fcgi) from http://twitrss.me/ by itself. It's pretty easy to scrape twitter users/searches and generate RSS feeds on disk for my native reader.

# in a bash script called by cron every handful of hours there are many, many lines like this:

    perl twitter_user_to_rss.pl gnuradio > ~/limbo/www/rss/gnuradio.xml

    perl twitter_search_to_rss_wtf.pl "rtlsdr" > ~/limbo/www/rss/rtlsdr.xml

This is a good alternate front-end to Twitter that also provides RSS feeds: https://github.com/zedeus/nitter

Sorry, recently the ratelimit has been exhausted almost all of the time. I need to add caching to solve this problem but have not gotten to it yet.


Serious question: does anybody use RSS nowadays?

Yes! Unfortunately it is no longer mainstream since the monetization opportunities are fewer. But a lot of tech-savvy people prefer it for many reasons:

1. You get a personalized view of what you have and have not read.
2. You can scan over a lot of posts very quickly, and pick out what you want to read.
3. You can aggregate a lot of different websites in a single place. No need to visit each website individually.
4. Increased privacy.
5. Less tracking.
6. Increased control.
7. Fewer ads.

Probably more reasons, but these are the primary reasons why I still prefer RSS.

Yes (though I personally have only ever used it as a notification mechanism: I just click through to the page, and don't read the content in the reader. In fact the RSS reader I built for myself doesn't support any other mode). I follow over 100 different feeds through this, mostly on different sites (it's mostly webcomics, some news, artists, and youtube channels). It would be basically impossible to do this with any other tech: at best my feeds would be fragmented across multiple services. Some would not be possible to integrate at all. Services like this allow some of the sites I want to follow to fit into the system.

Found this story via RSS, and once again I find myself surprised that there are nerds out there that read HN but do so through the Olde Timey expedient of going to the home page.

I'd say 95% of my content discovery comes from RSS; the few sites that don't tend to be high volume sites - like news - where I get value from visiting the home page to see how editors have prioritised stories.

Yep, heavily.

Nope! That's why someone made that service, because neither they nor anyone else needs it.
