The Perils of RSS (kn100.me)
56 points by kn100 on March 10, 2022 | hide | past | favorite | 37 comments



Try enabling Simple List Extensions[1] on your RSS feed and see if that deletes the old items. Basically this tells the feed reader that your feed isn't a typical "stream of posts", it's a finite list. So it should be treated differently.

Newsgator used to handle that by deleting any posts not in your feed at every retrieval. So maybe Feedly would too.

If you think that guessing at the internal implementation of Feedly's code based on what a company that's been dead for years used to do sounds like a long shot, then bravo. But hey, it's worth a try, right?

It's been a minute, but I think the gist of it is adding

  xmlns:cf="https://www.microsoft.com/schemas/rss/core/2005"
to your <rss> root element, and

  <cf:treatAs>list</cf:treatAs>
under the <channel>. But see the link for examples. Good luck! [1] - https://docs.microsoft.com/en-us/previous-versions/windows/d...


I'm going to try this, as well as other suggestions here, across a range of cloud-ish RSS readers and have a shoot-out! It's going to take me a little while since it seems Feedly, at least, does not crawl very often, but I think a little experimentation here would make for an interesting follow-up.


The real lesson to me seems to be summarized by the last two lines in the post: “One mistake can permanently wreck your RSS feed for users of cloudy platforms you [or your readers] have no control over.”

The issue isn’t RSS, it’s using something like Feedly.


Well, you also wreck the RSS feed of all of your direct RSS consumers too. You would have to get them all to individually delete and resubscribe, which, unless you have all of your content in your feed, also deletes historical content that they may want.

It's just that at least you can fix it for future direct consumers by fixing your feed.


You can't choose what software your subscribers use.


One mitigation is to start serving the feed at a new URL. This should fix things for new subscribers. The old subscribers can optionally be encouraged to change to the new URL manually via a specially crafted post in the old feed.
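That "specially crafted post" could be an ordinary item in the old feed whose only job is to point at the new URL. A sketch (all names and URLs here are hypothetical):

```xml
<item>
  <title>This feed has moved</title>
  <link>https://example.com/feed-v2.xml</link>
  <guid isPermaLink="false">feed-moved-notice-2022</guid>
  <description>This feed is no longer updated. Please resubscribe
    at https://example.com/feed-v2.xml to keep getting posts.</description>
</item>
```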


Unfortunately, some blogs I've seen only expose the last n posts through RSS. In that case, the retaining behavior is very useful. However, there could be some middle ground with a 'reset' hint.


IF, and that's a big if, all your feeds inject full articles into the RSS feed, then RSS2Email is awesome: as long as you have a good MUA, retaining, and eventually sharing, searching, etc. in your historically collected feeds is hyper effective and simple. Unfortunately, the starting "if" is almost never met in the present world...


A feed is not a blockchain; feed publishers are free to do anything with their feeds. So feed readers must be able to protect themselves from serving deleted items in perpetuity. I run a server-side feed collector in roastidio.us for airss.roastidio.us users; my algorithm is very simple:

I keep only the top N items in the latest version of the feed; everything else is dropped. For a fast-moving feed, if users don't read soon enough, some items will be missed. This is still better (in terms of usability) than bombarding a user just coming back from a long vacation with a pile of items.
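A rough sketch of that keep-top-N rule (illustrative only, not roastidio.us's actual code), where feed state is just a newest-first list of item GUIDs:

```python
def sync(stored, latest, n=3):
    """Replace stored state with the top n GUIDs of the latest fetch.

    Returns (new_state, added, dropped): the new window, items newly
    seen, and items that fell out of the window (deleted upstream or
    simply rotated away -- this algorithm can't tell the difference).
    """
    top = latest[:n]
    top_set, stored_set = set(top), set(stored)
    added = [g for g in top if g not in stored_set]
    dropped = [g for g in stored if g not in top_set]
    return top, added, dropped

state, added, dropped = sync(["e", "d", "c"], ["g", "f", "e", "d", "c"])
print(state, added, dropped)  # ['g', 'f', 'e'] ['g', 'f'] ['d', 'c']
```

The trade-off discussed downthread falls directly out of the slice: anything past `latest[:n]` is gone, whether the user read it or not.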


Different conversation for a different topic.

> For a fast moving feed, if the users don't read soon enough some items will be missed.

This may work well for you, but it definitely isn't what I prefer. For example, I often subscribe to recording feeds for conferences or other events. These feeds don't post for most of the year, but then might have hundreds of entries. I definitely don't want my feed reader to drop all but the last 10 (YouTube, for example, keeps far fewer than the latest 100 videos in the feed). I want it to keep them all so that I can work through the list and watch the ones I am interested in.

So while your solution might work in some cases, this definitely isn't what I would want personally, and in my opinion what Feedly did here was reasonable as a general behavior, even if it isn't optimal for this particular case. I'd rather have it handle the happy, common case well than hurt it for the rare case when someone makes a mistake, especially when the downsides of the mistake aren't that high.


> I want it to keep them all so that I can work through the list and watch the ones I am interested in.

Your use case can be better served by visiting the website directly, after you learn from the feed that there are updates. There is no guarantee that a feed aggregator won't miss anything anyway; this is the nature of poll-based delivery.


Thanks for explaining my use case to me.

- How do I remember where I am when I have watched half of the videos?

- What if two events happen around the same time? Now I need to check both instead of having them in the same place.

It turns out that leaving items in my feed reader and dismissing them as I watch or decide not to watch works incredibly well.


You are using the feed to overcome the usability problem of a poorly designed website. I have seen my fair share of poorly designed websites so I can't argue with that :)

My feed reader's intended use case is following a large number of slow-moving feeds like personal blogs, which are usually not bursty, and where it's not super important if a few items fall through the cracks.


You're just picking from different trade-offs, not doing something better nor worse (nor right nor wrong).

A major use-case of RSS readers is to have a local, eternal copy of a producer's content to read at your own leisure. I have content in my RSS reader that I cherish from blogs that aren't even online anymore. That I don't need to visit poorly designed websites is only a tertiary perk—I don't need to visit the origin, ever.

There's not much to argue here, though. It's just a different offering. For example, I use https://sumi.news/ (created by some HNer) and certainly don't use it to read the entire history of every news website. Nothing wrong with that.


> You're just picking from different trade-offs

Of course. I am not Feedly, and I can't afford to keep everything for every feed my users have ever subscribed to in my database. There are feeds that keep thousands of items, and there are feeds with broken date fields, so I cannot prune by date reliably.


Maybe, but I don't expect most websites to track which videos I have watched and I don't want to need to create an account or somehow sync my watch list across devices. So I just use my feed reader for that and it works excellently.

> My feed reader's intended use case is to following a large number of slow moving feeds like personal blogs

This is also the majority of my feeds. While I agree that a few items falling through the cracks isn't a big deal, I would still prefer that they don't.

But I think the moral of the story is that there is a huge variety of ways that people use RSS feeds. They are a generic system for distributing updates, and it is hard to make quick fixes that work well for all of the use cases that people have.


> I keep only the top N items in the latest version of the feed; everything else are dropped. For a fast moving feed, if the users don't read soon enough some items will be missed.

I completely dislike this behavior. I saw this on Ars Technica some years ago, where it was said that only the most recent 20 items (?) would be in the feed. Ars can be a fast-changing site, and since full-text feeds are a paid privilege, I thought it was a disservice to paying users to miss out on something just because they were busy with something else for a day or two.


I don't have the time to read everything. I'd rather miss something from the busy news site than miss something from independent bloggers.


The problem is that RSS doesn't really have a concept of "deleted". In general, feeds just contain the newest items. Items not in the feed are generally considered old, not deleted.

You can play with heuristics like "this feed is usually sorted by published date, and these entries I saw previously should still be here but are missing, so maybe they were deleted", but not every feed even has published or updated dates, and they aren't always accurate.

This gets even more complicated with WebSub, which allows hubs to drop entries they have already seen from what they push. So now you need to skip deletion detection on those requests, and if you turn down your poll rate for feeds with WebSub (which Feedly does, IIRC) you might not fetch the full feed, and so not notice that items are missing, for weeks.

The one thing that would actually be reliable here is probably Atom's pagination and archiving extensions (RFC 5005), which you can follow to get a view of all of the items in the feed and prune the missing ones. However, you probably don't want to do this on every fetch (otherwise it kind of defeats the point of pagination). Maybe you could combine this with other heuristics to come up with a decent system. But I'm not even sure these have ordering guarantees, so you might have to fetch a potentially unlimited number of pages just to check whether one item was deleted.
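For reference, RFC 5005 archived feeds chain pages together with archive links, roughly like this (URLs are placeholders):

```xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <!-- Subscription document: a reader can walk prev-archive links
       backwards, page by page, to reconstruct the full history -->
  <link rel="self" href="https://example.com/feed.xml"/>
  <link rel="prev-archive" href="https://example.com/archive/2022-02.xml"/>
  ...
</feed>
```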

If we wanted to solve this problem, the best option would probably be to add a tombstone element to the feed to explicitly remove items.

TL;DR: as far as I am aware, neither RSS nor Atom has a workable concept of deletion. Much like sending someone 100 email newsletters unrelated to their subscription will piss them off with no recourse, try not to send them 100 feed updates that they don't want.

Edit: It looks like there are tombstone proposals for both RSS and Atom: https://oleb.net/blog/2015/11/rss-feed-item-deletions/. Maybe the author could have tried this, but I don't know if anyone supports them.
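The Atom proposal became RFC 6721, which adds an `at:deleted-entry` element. In a feed it looks roughly like this (IDs and dates are placeholders):

```xml
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:at="http://purl.org/atompub/tombstones/1.0">
  <title>Example Feed</title>
  <updated>2022-03-10T12:00:00Z</updated>
  <!-- Tells tombstone-aware readers that this entry was deleted -->
  <at:deleted-entry ref="https://example.com/posts/oops"
                    when="2022-03-10T12:00:00Z"/>
</feed>
```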


Seems to me like this is squarely Feedly's responsibility, and everybody should be impressing upon them the need for the RSS reader to behave sensibly.

Btw, I wonder if there are any protocol level suggestions/guarantees about this kind of a thing (edits/deletes/revocations).


In contrast, I see this squarely as the author's responsibility. In general with RSS, if you publish an entry, it is published. Feedly did the right thing and displayed the published entries.

The unsub+resub is a nice hack to clear old entries, but I think in general it is nice that in Feedly you can subscribe to a feed and get history beyond what is currently in the feed.

To me, the author is asking Feedly to make the happy, common case worse to make the rare case where the publisher makes a mistake slightly better. I think the only thing that Feedly could have done better is manually remove the items when the author reached out to customer support. But if I were in that position, I don't know if I would have complied either. If they don't have the infra for this, it is a risky manual database operation. Plus, feed entries are arguably their subscribers' data; I would think twice before deleting something out from under my subscribers, even if I thought that most of them would appreciate it.

For edits you just edit the entry. Most readers won't actually notify the user again but many will update the content.

For deletions there are some proposals for tombstoning. I went into detail here: https://news.ycombinator.com/item?id=30625712


I don’t see why Feedly has append-only semantics instead of simply mirroring the feed on the website. In general, as both a publisher and a reader, I would want Feedly’s behavior to be simply pass-through to the extent possible.

What common case does Feedly’s default behavior prioritize?


Feedly appears to cache one feed for all subscribers. I'm speculating, but would it prioritize consistency above accuracy? Fewer complaints from people who are confused because that blog post they read yesterday is gone today.


> The unsub+resub is a nice hack to clear old entries but I think in general it is nice in Feedly that you can subscribe to a feed and get history beyond what is currently on the feed

I get what you're saying, but for something like Feedly, they're an aggregator, not a publisher, and therefore they should, where at all possible, conform to the publisher's intentions.

If I have an RSS feed of my blog, and somehow my pet toucan manages to publish something extolling the virtues of Froot Loop National Socialism, I think deleting that from my feed should percolate out before I get pogromed by Tony the Tiger.


I don't think Feedly is necessarily doing anything wrong. Publisher beware kinda comes with the territory like making sure you don't accidentally send too many emails. It's kind of a quirk of centralization that you would have the recourse of "sorry, can I unsend that?" with the pubsub model to begin with.

But I agree that it would certainly be nice if they offered a way to delete content. For example, an interesting attack on a website would be to silently add spam to their RSS feed knowing it will be stored in RSS aggregators forever. That doesn't seem great.


I wonder how much data you could encode in an RSS feed and therefore retrieve from services like Feedly?

I remember reading a while ago about movie piracy sites that would store chunks of encrypted video inside images and host these images on services like Google Drive, and then client-side download these 'images' in order to reassemble a film. I wonder if a similar thing would be possible with an RSS feed? Feedly seems more than happy to store the entire content of every post I've made!
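A toy sketch of that idea: base64-chunk arbitrary bytes into item descriptions and reassemble them on the reader side (purely illustrative; no claim that any aggregator actually tolerates this):

```python
import base64

def encode_to_items(data: bytes, chunk_size: int = 16):
    # Split the base64 text into item-sized pieces; the guid records
    # each chunk's offset so items can be reordered on decode.
    b64 = base64.b64encode(data).decode("ascii")
    return [
        {"guid": f"chunk-{i}", "description": b64[i:i + chunk_size]}
        for i in range(0, len(b64), chunk_size)
    ]

def decode_from_items(items) -> bytes:
    # Sort by the offset embedded in the guid, then reassemble.
    ordered = sorted(items, key=lambda it: int(it["guid"].split("-")[1]))
    return base64.b64decode("".join(it["description"] for it in ordered))

items = encode_to_items(b"hello, feedly")
assert decode_from_items(items) == b"hello, feedly"
```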


Exactly this. The pros and cons of open protocols. Does RSS solve this in any way? If so, the blogger or Feedly has not implemented or used that part correctly. If not, we've identified a missing piece in the protocol, and getting that added to every RSS feed/aggregator is pretty much impossible (though a popular one like Feedly should be able to adapt).


There are at least a couple of ways of indicating that an entry should be deleted in RSS and Atom, but they don't seem to be well-supported.

This blog post mentions a couple: https://oleb.net/blog/2015/11/rss-feed-item-deletions/

I vaguely recall that Microsoft also had some proposed mechanism, but I can no longer find any trace of it. I doubt anything supports it.


My perils with rss:

1. Too many formats. It's been 20 years since anything changed in RSS formats; we should know by now what we need. Also, RDF is really ugly. We should make one final RSS format and switch to it.

2. There are no rules for GUIDs: no maximum length. I've seen GUIDs several kB long in the wild. Integer GUIDs would be perfect (but a UUID or some hash is fine too, just agree on some maximum length ffs).

3. No history standard. If the site owner decides to show you 20 items and you go on vacation, when you return there will be 20 new items and the rest will be lost. I'm thinking something like ?limit=200 or, even better, ?since=5257 (some last-seen GUID).

4. Sites not putting entire article in, only summary.

5. Sites putting page navigation ui in articles.

6. Pages lacking RSS feed entirely.


Oh man, the perils are endless. Back when I worked on an RSS aggregator I wrote a list. https://docs.google.com/document/d/1cvq67iQpk2C7ufOsefsfKnGC...

A few comments on your list:

2. People got suuuuuper confused about this. I agree several kB is too long, though I could see a case for allowing a (reasonable length) URL, as that's a natural ID in some environments.

3. Yeah, basically if there's a fast rate of publishing relative to feed length you have issues. Another way to handle this would be to say the feed must have the last X days of content, and interested readers have to check back at least that often. Not a problem for always-on aggregator sites, though it could be an issue for occasionally-connected clients (eg, smartphone standalone app)

4. To be fair that's a feature in a lot of cases. Some sites have to monetize content somehow or other.

5. Or share buttons...

6. IKR?! For all my complaints I still wish RSS was used more!


Personally:

- lack of the entire article in the feed; Miniflux and some others try to download it if needed, but that does not work flawlessly;

- lack of RSS/atom support at all for few but not so few websites;

- the long push against RSS by the Big & Powerful, which has left too many people simply unaware of its existence, to the point that for many sites that still have RSS, reading the page source is the sole way to discover the feed URL.

The rest is far less dangerous IMVHO :-)


> 4. Sites not putting entire article in, only summary.

I prefer the summary and a link.


> We should make one final rss format and switch to it.

This was the motivation for creating the last several versions of RSS and Atom, if I remember correctly! Sometimes I wonder if the RSS situation was the original inspiration for https://xkcd.com/927/.

Anyway, RSS parsing is a solved problem in most major languages. Just look for a prominent open source library that handles all the major format variations, and call it a day.

It's not like real-world RSS feeds actually comply with any of the standards, anyway. Most parsers need fallbacks and bits of autodetection. ("Is this escaped HTML or plain text or what?")
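As a stdlib-only sketch of the kind of fallback involved (a real library like feedparser handles far more cases): XML parsing already decodes entities, so "escaped HTML" in a description arrives as literal tags, which a heuristic can then detect and strip:

```python
import re
import xml.etree.ElementTree as ET

# A made-up two-item feed: one escaped-HTML description, one plain text.
FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
  <item><title>Post</title>
    <description>&lt;p&gt;Escaped &lt;b&gt;HTML&lt;/b&gt;&lt;/p&gt;</description>
  </item>
  <item><title>Plain</title>
    <description>Just plain text</description>
  </item>
</channel></rss>"""

def to_plain_text(desc: str) -> str:
    # Heuristic: if the decoded description contains tags, strip them;
    # otherwise treat it as plain text as-is.
    if re.search(r"</?\w+[^>]*>", desc):
        return re.sub(r"<[^>]+>", "", desc).strip()
    return desc

root = ET.fromstring(FEED)
texts = [to_plain_text(i.findtext("description") or "")
         for i in root.iter("item")]
print(texts)  # ['Escaped HTML', 'Just plain text']
```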

(Full disclosure: I worked on a number of RSS-related projects back in the day and worked with at least one person involved in one of the standardization attempts.)


> however the cloudy nature of Feedly removes an important avenue of control from the user. With RSS reader software, it is possible to remove and re-add a feed, thereby grabbing a brand new copy of it. When a content creator like myself makes an error like this, users can fairly easily do this after the mistake had been corrected, mark everything below their most recently read post as read, and get on with their life.

I don’t agree with the approach suggested here with RSS reader software. Removing a feed and adding it again means losing out on older articles that are no longer present in the feed, losing “starred” status of articles (in my understanding this is also a way to save an article permanently even if the feed publisher doesn’t include it anymore in the feed), and losing the unread status of some old articles that one wanted to get to later (including as recent as a week or two in the definition of “old”).

I wouldn’t want Feedly or someone else to remove articles after those have been received by my reader software. In this case the reason may be good and acceptable, but it breaks a contract that users of Feedly (not the publishers of the feeds) expect from such a service. Soon it could devolve into censorship requests. Feedly or any such service is better off, and would spend less time and effort, by avoiding messing with the feeds. “What you publish is what we serve the users” should be their guideline.

The peril I see is publishers publishing feeds without adequate testing.


Or, don't fiddle with web code without testing it in staging as you'll annoy your users.


I suppose you're right, but as I explained in the post, I am not some big company, I'm just a person who occasionally likes to post stuff on the internet. I'm a huge fan of various blogs, and want to be a part of that space. It is so nice to browse a proper hobbyist blog that isn't loaded up with attempts to sell me things, and is just someone who is genuinely passionate about whatever it is they're blogging about.

I personally am not a user of RSS, however I received multiple requests to add a feed, so I did. I don't want to set up a staging environment for my blog, this is a hobbyist thing I do in my limited spare time.

The whole reason I wrote this was to hopefully prevent people like me from making this same mistake. I wasn't particularly familiar with RSS and its idiosyncrasies, and paid the price for it. By providing my content in a format I myself have very little interest in without fully understanding the ramifications, my content has become available on a third party service that I have no control over, and when readers discover that my feed doesn't look right, they come to me to complain about it. It might just be a fundamental disagreement, but I do wonder if tools like Feedly should provide a little more control either to publishers, or to users.

I'm just rambling at this point, but hopefully this illustrates my thinking behind the post better.


tl;dr: RSS cannot delete entries. This is fixed in Atom: <https://datatracker.ietf.org/doc/html/rfc6721>. If you care about Web feeds, use Atom, not the laughably inferior RSS.

https://hn.algolia.com/?type=comment&query=atom+rss+chrismor...



