
Any developers who'd like to contribute to improving how article content is extracted from web pages should check out Mozilla's Readability repository: https://github.com/mozilla/readability

I'm currently trying to bring the PHP port up to speed here: https://github.com/fivefilters/readability.php

We use an older version as part of our article extraction for Push to Kindle: https://www.fivefilters.org/push-to-kindle/




I believe the Instant View[1] crowdsourcing model, where people write templates for each website, could give these parsers a big boost (hope they open source it soon). It's just impossible to make these extensions work for every single website with a few simple heuristics.

Check the codebases of some popular parsers:

Firefox (already mentioned): https://github.com/mozilla/readability/blob/master/Readabili...

Google Chrome: https://github.com/chromium/dom-distiller

Mercury parser: https://github.com/postlight/mercury-parser

[1] https://instantview.telegram.org/


Thanks for mentioning Instant View, I hadn't come across that. We actually maintain something similar here: https://github.com/fivefilters/ftr-site-config

We use these in our own tools and also get contributions from others, including Wallabag users: https://github.com/wallabag/wallabag

Before it was sold, Instapaper used to have something similar: a public database of its site-specific extraction templates. We used that as the starting point for our repository.


FYI, your API pricing page doesn't seem to load: https://rapidapi.com/user/fivefilters


Thanks to both of you for expanding my 'readable web' toolbox.

What do you fall back to if the rule is not present or doesn't work?


In our case, we try to match using the XPath selectors that we have for the site. If we don't have any, or they fail to match anything for the title, author, or body, we then go to Readability and let it do its thing to try and extract whatever we're missing.
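The "site rules first, generic extractor second" flow described above can be sketched roughly as follows. This is only an illustration: `SITE_RULES`, `generic_extract`, and `extract` are hypothetical names, `generic_extract` is a crude stand-in for Readability rather than its real API, and the standard-library XML parser used here only copes with well-formed markup (real pages would need an HTML5 parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical site rules: field name -> XPath expressions to try in order.
SITE_RULES = {
    "title": [".//h1[@class='headline']"],
    "author": [".//span[@class='byline']"],
    "body": [".//div[@class='article-body']"],
}
FIELDS = ("title", "author", "body")


def generic_extract(root):
    """Crude stand-in for a generic extractor such as Readability (not its real API)."""
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return {
        "title": (root.findtext(".//title") or "").strip(),
        "author": "",
        "body": "\n".join(t for t in paragraphs if t),
    }


def extract(root, rules):
    """Use site-specific selectors first, then fill any missing field generically."""
    result = {}
    for field in FIELDS:
        for xpath in rules.get(field, []):
            node = root.find(xpath)
            text = "".join(node.itertext()).strip() if node is not None else ""
            if text:
                result[field] = text
                break
    missing = [f for f in FIELDS if f not in result]
    if missing:
        # Only fall back for the fields the site rules failed to produce.
        fallback = generic_extract(root)
        for field in missing:
            result[field] = fallback[field]
    return result


# Well-formed sample page: the title rule matches, the author and body rules do not.
PAGE = """<html><head><title>Fallback title</title></head><body>
<h1 class="headline">Site title</h1>
<div class="story"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

article = extract(ET.fromstring(PAGE), SITE_RULES)
print(article)
```

In this sample the title comes from the site rule, while the author and body come from the fallback, since no selector matched them.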


Makes sense. What do 'prune' and 'tidy' instruct the parser to do?


Prune instructs the parser to remove any elements within the extracted article block that look superfluous. This can result in false positives, so we tend to disable it when we've gone to the trouble of creating site-specific extraction rules.

Tidy determines whether the source HTML should be cleaned up first with HTML Tidy - https://github.com/htacg/tidy-html5. If you're parsing the source HTML with an HTML5 parser, as we are now, it shouldn't be necessary any more (I think we actually ignore it now). We used it more in the past, when we relied on libxml parsing, which often trips up on modern HTML.
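As a rough illustration of how such flags might sit alongside the selectors, here is a minimal sketch of reading a site config. It assumes a plain-text `key: value` format with `yes`/`no` flags; the sample rules, the `parse_site_config` helper, and the default values are all assumptions for the example, not the actual ftr-site-config parser.

```python
def parse_site_config(text):
    """Parse `key: value` lines into a config dict (assumed format)."""
    # Defaults here are an assumption made for this sketch.
    config = {"title": [], "author": [], "body": [], "prune": True, "tidy": True}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Split on the first colon only, so colons inside XPath values survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in ("prune", "tidy"):
            config[key] = value.lower() == "yes"
        elif key in ("title", "author", "body"):
            config[key].append(value)
    return config


# Hypothetical rules for an imaginary site.
SAMPLE = """
# example.com
title: //h1[@class='headline']
body: //div[@class='article-body']
prune: no
tidy: no
"""

config = parse_site_config(SAMPLE)
print(config)
```

With hand-written selectors like these, turning `prune` off matches the practice described above of not letting the pruning heuristics second-guess a site-specific rule.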


Another cool crowdsourced thing I discovered recently is SponsorBlock [1], an extension that automatically skips sponsored content in YouTube videos. Users contribute timings to a database that everyone else uses. It works remarkably well: any recent video with more than about 50,000 views is pretty much guaranteed to have timings submitted.

[1] https://sponsor.ajay.app/


Is it possible to utilize the database of Instant View per-site parsers in a web browser or extension's reader mode?


ArchiveBox is a tool that downloads web pages and saves them in various different formats: warc, pdf, rendered png, plain text. I wonder what it uses for plain text extraction and if the readability repo would be useful for that purpose.

Edit: Oh neat it does actually. https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#...

> Archive method SAVE_READABILITY

> Extract article text, summary, and byline using Mozilla's Readability library. Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into readability.


It's pretty amazing that it already does.

ArchiveBox and the other stuff from the "DIY no-credentials don't-care-about-the-rules" web archiving community, like ArchiveTeam, continue to astound me with their quality and "professionalism" (as a credentialed professional in the field of digital library stuff, I can say they are often outdoing the actual credentialed professional community).


Aww thank you!


My suspicion is that there are an increasing number of publishers who are intentionally severing compatibility with Readability.

Washington Post, I'm looking at you mofos. Chief reason I'll seek out any alternative news site for archival. It's been this way for about a year, if not more.


O/T, but thanks for Push to Kindle. I found the browser version so useful I bought the iPhone app, both to use and to support you. It brought a whole new field of usefulness to my Kindle.


Thank you! That's really nice to hear. (Appreciate the support too.)


I wrote a Swift port of it last year for my app, but it deviates from Readability a fair bit as I tailored it to my needs. I've considered cleaning it up and open sourcing it regardless. I know there is an Objective-C port floating around.


Don't clean it up, just put it up there and let others do it! I'd be very interested in it, so please reach out when you do.


Just reading through that code, it seems like Readability is only intended to work on English-language websites? It checks for nodes with class names matching /and|article|body|column|content|main|shadow/ and uses a minimum length of 140 characters for matching nodes that are reader-able. Seems a bit lazy for a company whose stated mission is "to ensure the Internet is a global public resource, open and accessible to all".


For what it's worth, even when working on Dutch websites my class names will usually be in English, and I think that's common across the industry. After all, the programming languages we use to write them are already in English, so Dutch class names would stick out like a sore thumb.
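To make the point concrete, here is a minimal sketch of the kind of check being discussed, using the pattern and the 140-character threshold exactly as quoted above. It is a simplification for illustration, not Readability's actual scoring code, and `looks_readable` is a hypothetical helper. Note that the pattern is applied to the class attribute, not to the article text.

```python
import re

# Pattern and threshold as quoted in the thread; the library's real values may differ.
POSITIVE_CLASS = re.compile(r"and|article|body|column|content|main|shadow", re.I)
MIN_TEXT_LENGTH = 140


def looks_readable(class_name, text):
    """True if the class name matches the pattern and the text is long enough."""
    has_hint = bool(POSITIVE_CLASS.search(class_name))
    return has_hint and len(text.strip()) >= MIN_TEXT_LENGTH


# Dutch body text, with the class names a Dutch developer might still write in English.
dutch_text = "Dit is een voorbeeldzin over het nieuws van vandaag. " * 3

print(looks_readable("article-content", dutch_text))  # English class, Dutch text
print(looks_readable("sidebar", dutch_text))          # class gives no hint
print(looks_readable("article-content", "Kort."))     # text too short
```

Under these assumptions the Dutch article passes as long as its markup uses English class names, while a page with class names in another language would get no help from this particular check.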


Another robust solution is Tranquility Reader, which is available as an extension and has better accuracy than Readability at the expense of speed.

https://github.com/ushnisha/tranquility-reader-webextensions


Yes! I use this add-on for Firefox all the time. Usually I click on my "Tranquility!" button the moment the cookie notification pops up, no cookies needed ;) (good for articles via HN, and it can also circumvent some paywalls)



