Show HN: Full Text, Full Archive RSS Feeds for Any Blog (dogesec.com)
152 points by panoramas4good 3 months ago | 45 comments



As a publisher who publishes a full-text RSS feed at a time when not a lot of publishers do, I must say: The publisher should have a say in this.

This is not to say that this is a good idea or a bad one, but I think you will, long-term, have better luck if people don’t feel their content is being siphoned.

A great case-in-point is what my friends at 404 Media did: https://www.404media.co/why-404-media-needs-your-email-addre...

They saw that a lot of their content was just getting scraped by random AI sites, so they put up a regwall to try to limit that as much as possible. But readers wanted access to full-text RSS feeds, so they went out of their way to create a full-text RSS offering for subscribers with a degree of security so it couldn’t be siphoned.

I do not think this tool was created in bad faith, and I hope that my comment is not seen as being in bad faith, but: You will find better relationships with the writers whose work you share if you ask rather than just take. They may have reasons for not having full RSS feeds that you may not be aware of. For example, I don’t want my content distributed in audio format, because I want to leave that option open for myself.

People should have a say in how their content is distributed. I worry what happens when you take those choices away from publishers.


This.

I love these projects, but often they can have negative side effects.


I've never implemented it, but it should be possible to check whether the content still lives at the URL where it was originally found before serving any kind of archived copy (preferably with contact info for the unwilling author). Using it for a search index should be fine, of course.
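
A minimal sketch of that liveness check (Python, standard library only), assuming a plain HTTP HEAD request is a good-enough probe:

  # Sketch: only serve the archived copy if the original URL no longer answers.
  import urllib.request
  import urllib.error

  def original_still_alive(url: str, timeout: float = 10.0) -> bool:
      req = urllib.request.Request(url, method="HEAD")
      try:
          with urllib.request.urlopen(req, timeout=timeout) as resp:
              return resp.status < 400
      except urllib.error.HTTPError as e:
          return e.code < 400          # 4xx/5xx -> treat as gone
      except urllib.error.URLError:
          return False                 # DNS failure, connection refused, etc.

  # serve the archived copy only when not original_still_alive(url)

In a real archive you'd probably also want to handle soft-404s and parked domains, but that's the basic idea.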


I disagree. If you put your content out in the open for everyone to read, it is totally valid to scrape that content. Otherwise, put it behind a paywall. If I can access it for free with a browser, then you should be fine with me consuming your content with the tool of my choice, so I can search it or use it however I see fit. Why not?

Getting consumed by AI scrapers will be inevitable in the long run, I think.


Just because I make the information available in a convenient way doesn't mean I expect it to be harvested. That you make that leap is 100% troubling and makes me not want to have you as a reader, because you don't respect my work.

You are describing the “give an inch, take a mile” concept neatly.

I think your mindset will just lead to a lot of people who otherwise would not want to regwall their content to do so. And if I ever do so, I will include a link to your post so they know who to blame.


I feel like the two massive unspoken caveats are:

1. Downloading and polling that doesn't resemble a cyberattack.

2. Not reproducing their content in a way that could compete with theirs or tarnish their identity... and there's a lot of open ongoing debate about how that principle relates to different ways of using LLMs.


So I can take all the words written here by you and use them to pretend to be you elsewhere online, right?


As a one-off thing you personally do, yeah that’s probably okay. Turning that into a product that you then offer to others is where the line is drawn, in my opinion.


I think this is a fair line. I don't want to mess with the tinkerers of the world, and to be clear I'm not even entirely opposed to this. I just think we do not put enough stock into discussing potentially damaging actions with creators.

Which is why so many writers and artists are upset at OpenAI and Anthropic right now.


> generally the RSS and ATOM feeds for any blog, are limited in two ways;

> 1. [limited history of posts]

> 2. [partial content]

To address limitation #1 in some cases, the author could perhaps rely on sitemaps [1], a feature present on many sites (like RSS feeds) that lists all the published pages.

[1] https://www.sitemaps.org/
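
A rough sketch of that sitemap approach (Python, standard library only), assuming the sitemap lives at the conventional /sitemap.xml and is a plain urlset rather than a sitemap index:

  # Sketch: list every page a site declares in its sitemap.
  import urllib.request
  import xml.etree.ElementTree as ET

  NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

  def sitemap_urls(site: str) -> list[str]:
      with urllib.request.urlopen(site.rstrip("/") + "/sitemap.xml") as resp:
          tree = ET.parse(resp)
      return [loc.text for loc in tree.findall(".//sm:loc", NS)]

  print(sitemap_urls("https://example.com"))  # example.com is a placeholder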


Similar goal, different approach. I wrote an RSS reader that captures link metadata from various RSS sources. The metadata are exported every day. I have different repositories for bookmarks, for daily links, and for 'known domains'.

Written in Django.

I can always go back and parse the saved data. If a web page is not available, I fall back to the Internet Archive.

- https://github.com/rumca-js/Django-link-archive - RSS reader / web scraper

- https://github.com/rumca-js/RSS-Link-Database - bookmarks I found interesting

- https://github.com/rumca-js/RSS-Link-Database-2024 - every day storage

- https://github.com/rumca-js/Internet-Places-Database - internet domains found on the internet

After creating a Python package for web communication that replaces requests for me (and sometimes uses Selenium), I also wrote a CLI interface to read RSS sources from the command line: https://github.com/rumca-js/yafr


It's so clever to just pull from the Wayback Machine rather than scrape the site itself. Never even thought of that.
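
For anyone curious how that lookup works: the Wayback Machine exposes a CDX API that lists captured snapshots for a URL. A hedged sketch (Python, standard library only; the query URL and parameters are per my reading of the CDX docs):

  # Sketch: list Wayback Machine captures for a URL via the CDX API.
  import json
  import urllib.parse
  import urllib.request

  def wayback_snapshots(url: str, limit: int = 10) -> list[str]:
      query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
      with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
          rows = json.load(resp)
      # First row is the column header; each data row includes a timestamp
      # and the original URL, which together form a replay URL.
      return [f"https://web.archive.org/web/{ts}/{orig}"
              for _, ts, orig, *_ in rows[1:]]

  print(wayback_snapshots("example.com/blog"))  # placeholder target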


Before building an app that depends on the Wayback Machine (or other Archive infrastructure) it's good to keep in mind this post from their blog: <https://blog.archive.org/2023/05/29/let-us-serve-you-but-don...>

One of my favorite tricks when coming across a blog with a longtail of past posts is to verify that it's hosted on WordPress and then to ingest the archives into my feedreader.

Once you have the WordPress feed URL, you can slurp it all in by appending `?paged=n` (or `&paged=n`) for the nth page of the feed. (This is a little tedious in Thunderbird; up till now I've generated a list of URLs and dragged and dropped each one into the subscribe-to-feed dialog. The whole process is amenable to scripting by bookmarklet, though—gesture at a blog with the appropriate metadata, and then get a file that's one big RSS/Atom container with every blog post.)
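
A sketch of scripting that slurp (Python, standard library only); it assumes the blog really is WordPress and stops at the first page that errors out, which is commonly a 404 past the last page:

  # Sketch: walk a WordPress feed back through its history via ?paged=n / &paged=n.
  import urllib.error
  import urllib.request

  def fetch_all_feed_pages(feed_url: str, max_pages: int = 500) -> list[bytes]:
      sep = "&" if "?" in feed_url else "?"
      pages = []
      for n in range(1, max_pages + 1):
          try:
              with urllib.request.urlopen(f"{feed_url}{sep}paged={n}") as resp:
                  pages.append(resp.read())
          except urllib.error.HTTPError:
              break
      return pages

  pages = fetch_all_feed_pages("https://example.com/feed/")  # placeholder feed URL

Each page is a complete RSS/Atom document, so building the "one big container" means merging the item/entry elements from every page.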


wait, so if Tumblr is migrating 500M blogs to WordPress [1], does this mean we'll essentially have easy access to all Tumblr blogs' history?

[1] https://arstechnica.com/gadgets/2024/08/tumblr-migrates-more...


I used it to recover some lost content from my blog a few years ago; it was fantastic: https://simonwillison.net/2017/Oct/8/missing-content/


This reminds me of something I wrote in early 2000. At that time RSS was less than a year old and, if I'm honest, I wasn't aware of it at all. I wrote a short PHP script to get the HTML of each site in a list, do a diff against the most recent snapshot, and generate a web page with a table containing all the changes. I could set per-site thresholds for change value to cope with small dynamic content like dates, and exclude certain larger sections of content via regexp. I probably still have the code in my backups from the dot-com boom job I had at the time.


Looks like a nice tool for extending existing RSS sources. As for the sites that don't have RSS support in the first place, there is also RSSHub [1]. Sadly, you can't use both for the same source: history4feed's trick with the Wayback Machine wouldn't work with the RSSHub feed.

[1] https://rsshub.app/


Awesome, I once developed a project called https://rerss.xyz, aimed at creating an RSS feed that reorders historical blog posts, but it was hindered by the two issues mentioned in the article.


The mystical creature, the URL, is a link to a resource that doesn't have to be static; it's only the URL that is static, e.g. the content might change. So you might want to have the program revisit the resource once in a while to see if there are updates.
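
One cheap way to do that revisiting is a conditional GET: remember the ETag from the last fetch and let the server answer 304 Not Modified when nothing changed. A sketch, assuming the server actually sends an ETag header:

  # Sketch: re-check a URL without re-downloading it when it hasn't changed.
  import urllib.error
  import urllib.request

  def fetch_if_changed(url: str, etag: str | None = None):
      headers = {"If-None-Match": etag} if etag else {}
      req = urllib.request.Request(url, headers=headers)
      try:
          with urllib.request.urlopen(req) as resp:
              return resp.read(), resp.headers.get("ETag")   # changed, or first fetch
      except urllib.error.HTTPError as e:
          if e.code == 304:
              return None, etag                              # unchanged
          raise

  body, etag = fetch_if_changed("https://example.com/feed.xml")  # placeholder URL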


A really original idea I saw one time: someone was writing a technical book in the first post of their blog, and new posts talked about the work they had done and linked to that part of the book. At times the posts had almost nothing besides the link; sometimes they talked about the technicalities and considerations of the writing, but at times they just talked about everyday life, why it was a good or a bad day to write.

When the book was done the blog was replaced by a link where one could buy the printed version.


I wrote a similar tool [1], although it's designed to let you gradually catch up on a backlog rather than write a full feed all at once. Right now it only works on Blogger and WordPress blogs, so I'll need to learn from their trick of pulling from Internet Archive.

[1] https://github.com/steadmon/blog-replay


I had a similar idea to replay blogs. It'll pull from WordPress or Internet Archive and give you a replay link to add to your feed reader.

https://refeed.to


Someone somewhere is still running a gopher server.


Having a fixed path for a search api is a great idea.


> RSS and ATOM feeds are problematic for two reasons; 1) lack of history, 2) contain limited post content.

None of those are problems with RSS or Atom¹ feeds. There’s no technical limitation to having the full history and full post content in the feeds. Many feeds behave that way due to a choice by the author or as the default behaviour of the blogging platform. Both choices have their reasons: saving bandwidth² and driving traffic to the site³.

Which is not to say what you just made doesn’t have value. It does, and kudos for making it. But twice at the top of your post you’re making it sound as if those are problems inherent to the format when they’re not. They’re not even problems for most people in most situations; you just bumped into a very specific use case.

¹ It’s not an acronym, it shouldn’t be all uppercase.

² Many feed readers misbehave and download the whole thing instead of checking ETags.

³ To show ads or something else.


Also, Atom feeds support pagination: https://www.rfc-editor.org/rfc/rfc5005#section-3
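
For feeds that implement it, collecting the full history is just a matter of following the rel="next" links. A sketch (Python, standard library only):

  # Sketch: collect every entry from a paged Atom feed (RFC 5005) by
  # following rel="next" links until there are none left.
  import urllib.request
  import xml.etree.ElementTree as ET

  ATOM = "{http://www.w3.org/2005/Atom}"

  def all_entries(feed_url: str) -> list[ET.Element]:
      entries = []
      while feed_url:
          with urllib.request.urlopen(feed_url) as resp:
              root = ET.parse(resp).getroot()
          entries.extend(root.findall(ATOM + "entry"))
          feed_url = next((l.get("href") for l in root.findall(ATOM + "link")
                           if l.get("rel") == "next"), None)
      return entries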


Also, there’s an existing, moderately well supported format for JSON feeds: https://www.jsonfeed.org


I have the full history in my blog feed.


Does no one find it ironic that one of the complaints about RSS feeds is that they don't give you the full content, forcing you to visit the site, while trying to access the poster's website through reader view gives you a warning that you have to visit the site directly to get the full content?


The future of RSS is "git clone".

RSS was invented in 1999, 6 years before git!

Now that we have git, we should just be "git cloning" the blogs we like, rather than subscribing to RSS feeds.

I still have RSS feeds on all my blogs for back-compat, but git clone is way better.


What problems does that solve? Reading blogs over git clone sounds like re-inventing the wheel. Are there even any tools that do that?

If anything were to replace RSS (and Atom) I'd personally hope for h-feed [1] since it's DRYer. But realistically it's going to be hard to eclipse RSS, there's far too much adoption and it is mostly sufficient.

[1] https://indieweb.org/h-feed


> What problems does that solve?

A million?

Having your own local copy of your favorite authors' collections is the absolute way to go. So much faster, searchable, transformable, resistant to censorship, et cetera.


> What problems does that solve? Reading blogs over git clone sounds like re-inventing the wheel.

Can’t say anything about blogs, but the kernel folks actively use mailing list archives over Git[1,2] (also over NNTP and of course mail is also delivered as mail).

[1] https://public-inbox.org/README.html

[2] https://lore.kernel.org/


I'm not the GP commenter, but I'm supposing there would be some way of announcing the git repo where you can find the source -- similar to the `<link...>` tag used for RSS, you could have a

  <link rel="alternate" type="application/x-git" title="my blog as a git repo" href="..." />
...and tooling could take care of all the things you like in an RSS reader. I could see this working really well for static site generators like VitePress or Jekyll or what have you. Going beyond what's in the source is kind of project-specific, though; maybe I'm only interested in a summary of commits/PRs.

Anyway, there isn't an official IANA-defined type for a git repo (application/x-git is my closest guess until one becomes official), but my point is it isn't too far beyond what auto-discovery of RSS already does.

I think the GP's comment is from the point of view of making it easy to retrieve the contents of the blog archive, easier than the hoops mentioned (bulk archive retrieval and generating WordPress page sequences, etc.) as well as solving the problem in TFA (partial feeds, partial blog contents in the feed).
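
Discovery could look something like this sketch (Python, standard library only); note that the application/x-git type is just the placeholder guessed above, not a registered media type:

  # Sketch: auto-discover a blog's git repo from a hypothetical
  # <link rel="alternate" type="application/x-git" ...> tag and clone it.
  import subprocess
  import urllib.request
  from html.parser import HTMLParser

  class GitLinkFinder(HTMLParser):
      repo = None
      def handle_starttag(self, tag, attrs):
          a = dict(attrs)
          if tag == "link" and a.get("rel") == "alternate" \
                  and a.get("type") == "application/x-git":
              self.repo = a.get("href")

  def discover_and_clone(blog_url: str) -> None:
      finder = GitLinkFinder()
      with urllib.request.urlopen(blog_url) as resp:
          finder.feed(resp.read().decode("utf-8", errors="replace"))
      if finder.repo:
          subprocess.run(["git", "clone", finder.repo], check=True)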


> <link rel="alternate" type="application/x-git" title="my blog as a git repo" href="..." />

This is a _great_ idea. Let's make this happen.

Edit: okay this is live now in Scroll and across PLDB, my blog, and other sites. Would love if someone could post this link to HackerNews: https://scroll.pub/blog/gitOverRss.html


I'd post the link, and I agree it's a cool idea, but the post looks like a pretty shallow rehashing of the thread.


I like it, I'm adding this <link> to my sites now, too


Awesome! Any chance you could add some info about who you are to your HN profile? Would love to read your stuff. Clearly a mind full of good ideas!


It's not what you're aiming for with this comment, but I bet git would actually make a pretty good storage tool/format for archival of mostly static sites.

Horribly simple hack: use `wget` with the `--mirror` option, and commit the result to a git repository. Repeat with a `cron` job to keep an archive with change history.
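
A sketch of that hack as a small Python script you could run from cron (the site URL and repo path are placeholders, and the repo is assumed to already be initialised):

  # Sketch: mirror a site with wget and commit the result, keeping history in git.
  import datetime
  import subprocess

  SITE = "https://example.com"   # placeholder: site to archive
  REPO = "site-archive"          # placeholder: an existing git repository

  # wget can exit non-zero on partial fetch errors, so don't treat that as fatal.
  subprocess.run(["wget", "--mirror", "--no-parent", "-P", REPO, SITE])

  subprocess.run(["git", "-C", REPO, "add", "-A"], check=True)
  # git commit exits non-zero when nothing changed; that's fine for a cron job.
  subprocess.run(["git", "-C", REPO, "commit", "-m",
                  f"snapshot {datetime.date.today()}"])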


I assume this is what the Wayback Machine uses?


Of course not. They have their own crawler (Heritrix, an open source Java crawler) and archive in WARC format. It's serious archiving; they want to preserve response codes, HTTP headers, etc.


You clone what? A WordPress database?


> You clone what? A WordPress database?

You clone static site generated websites.

Scroll is designed for this, but there's no reason other SSGs can't copy our patterns.

Here's a free command line working client you can try [beta]: https://wws.scroll.pub/readme.html

Instead of favoriting feeds, you favorite repos. Then you type "wws fetch" to update all your local repos.

It fetches the branch that contains the built artifacts along with the source, so you have ready to read HTML and clean source code for any transformations or analysis you want to do.

---

I love WordPress, but the WordPress/PHP/MySQL stack is a drag. At some point I expect they will move the WordPress brand, community, and frontend to be powered by a static site generator.

To be quite honest, I suspect they'll probably want to use Scroll as their new backend.


And if the blog's repo is private or, gasp, it's not versioned with git?


Then it's not worth reading.




