Show HN: Feed Creator 2.0 – Generate RSS feeds from web page elements (fivefilters.org)
24 points by k1m 28 days ago | 9 comments

Cool project!

I've recently created something similar for personal use. There are many websites (mainly webshops) I want to be notified about changes on, but they don't have RSS feeds, subscriptions or APIs that you can use.

I set up a cron job that runs daily, scrapes websites according to some XPaths, and saves the results to a DB. If any new elements have appeared, an email is sent out. The biggest challenge is handling false positives: being able to distinguish between a new element and, e.g., a previously seen element with an updated title, description, etc. For websites that directly expose what seem to be unique server-side identifiers in their HTML, using those as a primary key seems to work well. If that's not available, the href of the HTML element seems to be fairly static.
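For illustration, here is a minimal sketch of that "seen before?" check. It assumes the scraper has already turned each element into a dict; the `id`, `href` and `title` keys, the `seen` table and the function names are all made up for this example and are not anyone's actual code.

```python
import sqlite3


def item_key(item):
    """Prefer a server-side id exposed in the HTML; fall back to the href."""
    return item.get("id") or item["href"]


def find_new_items(conn, site, items):
    """Record scraped items and return only those not seen before.

    An item whose title changed but whose key is unchanged is NOT
    reported as new, which is what avoids the false positive.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "site TEXT, key TEXT, title TEXT, PRIMARY KEY (site, key))"
    )
    new_items = []
    for item in items:
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (site, key, title) VALUES (?, ?, ?)",
            (site, item_key(item), item.get("title")),
        )
        # rowcount is 1 when the row was inserted, 0 when the key already existed
        if cur.rowcount == 1:
            new_items.append(item)
    conn.commit()
    return new_items
```

The daily job would call `find_new_items` once per site and send the email only if the returned list is non-empty.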

Do you have any thoughts on the issue of false positives and unique identifiers?

Thanks! I haven't given the issue of unique identifiers too much thought because in most cases I assume the item URL is less likely to change than the text and will serve as the unique identifier for the RSS reader. It's possible to create feeds without item URLs in Feed Creator, so in those cases maybe letting users select an identifier to be the guid element in the feed would be helpful.
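As a rough sketch of what that could look like (this is not how Feed Creator is implemented, and `build_item` and its parameters are hypothetical): RSS 2.0 lets a `guid` carry `isPermaLink="false"` when it is an opaque identifier rather than a URL. The hash fallback below is an assumption for items with neither a URL nor a chosen identifier, and it would change whenever the title is edited.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_item(title, link=None, identifier=None):
    """Build an RSS <item> whose <guid> is the link if there is one,
    otherwise a user-selected identifier, otherwise a hash of the title."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    if link:
        ET.SubElement(item, "link").text = link
    guid = ET.SubElement(item, "guid")
    if link:
        guid.set("isPermaLink", "true")
        guid.text = link
    else:
        # Not a URL, so tell feed readers not to treat it as one
        guid.set("isPermaLink", "false")
        guid.text = identifier or hashlib.sha256(title.encode("utf-8")).hexdigest()
    return item


print(ET.tostring(build_item("Lamp", identifier="sku-42"), encoding="unicode"))
```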

Generally though, I'm hoping users understand that feeds produced in this way could be a little more brittle than if the site offered its own feed.

One difference with your approach is that you have the data from previous fetches in your database. With Feed Creator everything related to producing the feed (source URL, selectors, filters, etc.) is embedded in the feed URL to avoid having to record data on the server. So each request is treated as if it's the first one - the server doesn't know if an item in the feed is new or old. If we referred to feed data from previous fetches, maybe we could let users introduce a delay before having a new item added to the feed. This might help in cases where a typo is spotted and corrected by the publisher minutes after publication. Can't think of a much better way of avoiding false positives at the moment though.
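To make the "everything is in the feed URL" idea concrete, here is a small sketch of packing a feed definition into a query string and reading it back on each request. Only the `url` parameter appears in the example links in this thread; the `item_selector` name, the base URL and both function names are placeholders invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_feed_url(base, source_url, item_selector):
    """Pack the whole feed definition into the query string."""
    params = {"url": source_url, "item_selector": item_selector}
    return base + "?" + urlencode(params)


def read_feed_config(feed_url):
    """Recover the feed definition from the URL alone, with no stored state."""
    query = parse_qs(urlsplit(feed_url).query)
    return {name: values[0] for name, values in query.items()}


feed_url = build_feed_url(
    "https://feeds.example/index.php",  # placeholder base URL
    "https://example.org/news?page=1",
    "div.article",
)
print(feed_url)
print(read_feed_config(feed_url))
```

Because the server rebuilds the configuration from the URL every time, it has nothing to compare against, which is why it can't tell new items from old ones.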

Happy to get feedback and answer any questions about this here.

Here are two feeds we made earlier to give you an idea of what Feed Creator is supposed to do. The links below will pre-fill the form with the parameters you'd enter and produce a preview of the RSS:

* Chomsky.info articles: https://createfeed.fivefilters.org/index.php?url=https%3A%2F...

* Latest articles by John Pilger: https://createfeed.fivefilters.org/index.php?url=http%3A%2F%...

Is this service self-hostable?

Yes. If you have access to a server with PHP, you should be able to run it yourself. We have a simple PHP file you can download to test the compatibility of your server.[0]

We sell the self-hosted version[1] and have a blog post with some instructions if you want to run it on a VPS[2].

[0] Zip file with PHP file inside: https://createfeed.fivefilters.org/fc_compatibility_test.zip

[1] https://www.fivefilters.org/pricing/

[2] https://blog.fivefilters.org/2020/06/04/feed-creator-2.html

I’d be worried about copyright issues.

In many cases, the client doesn’t know or care about content copyright and just crawls.

(P.S. I used to develop a blogging platform and would find RSS links to seed content.)

This seems like a halfway point toward a full web scraping solution. Adding support for automation integrations with Zapier or IFTTT could help close the loop. Nice project!

Thank you! :) I don't have much experience with Zapier, but I assume the RSS feed this produces can be plugged into both Zapier and IFTTT if they have RSS support. Or maybe you had something else in mind?

You're right. RSS could be enough for automation.
