Any developers who'd like to contribute to improving how article content is extr...

up6w6 · on Aug 25, 2021

I believe the Instant View[1] crowdsourcing model, where people write templates for each website, could boost these parsers a lot (hope they open source it soon). It's just impossible to make these extensions work for every single website with some simple heuristics.

Check the codebase of some popular parsers:

Firefox (already mentioned): https://github.com/mozilla/readability/blob/master/Readabili...

Google Chrome: https://github.com/chromium/dom-distiller

Mercury parser: https://github.com/postlight/mercury-parser

[1] https://instantview.telegram.org/

k1m · on Aug 25, 2021

Thanks for mentioning Instant View, I hadn't come across that. We actually maintain something similar here: https://github.com/fivefilters/ftr-site-config

We use these in our own tools and also get contributions from others, including Wallabag users: https://github.com/wallabag/wallabag

Before it was sold, Instapaper used to have something similar: a public database of its site-specific extraction templates. We used that as the starting point for our repository.

benzible · on Aug 25, 2021

FYI your API pricing page doesn't seem to load https://rapidapi.com/user/fivefilters

freediver · on Aug 25, 2021

Thanks to both of you for expanding my 'readable web' toolbox.

What do you fall back to if the rule is not present or doesn't work?

k1m · on Aug 25, 2021

In our case, we try to match using the XPath selectors that we have for the site. If we don't have any, or they fail to match anything for the title, author, or body, we then go to Readability and let it do its thing to try and extract whatever we're missing.
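
A minimal sketch of that fallback order, not the actual FiveFilters code. It assumes well-formed markup (the standard library's ElementTree is not an HTML5 parser and supports only a subset of XPath), and `SITE_RULES`, `PAGE`, and `generic_fallback` are hypothetical names; the fallback is a crude stand-in for a Readability-style extractor.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: field name -> XPath expressions to try in order.
SITE_RULES = {
    "title": [".//h1[@class='headline']"],
    "author": [".//span[@class='byline']"],
    "body": [".//div[@id='story']"],
}

# Hypothetical page: it has no matching <h1> and no byline.
PAGE = """<html><head><title>Fallback title</title></head>
<body><div id="story"><p>Rule-matched body text.</p></div></body></html>"""


def text_of(element):
    """Collapse all text inside an element into a single-spaced string."""
    return " ".join("".join(element.itertext()).split())


def match_rules(root, xpaths):
    """Return the text of the first XPath that matches something non-empty."""
    for xpath in xpaths:
        node = root.find(xpath)
        if node is not None and text_of(node):
            return text_of(node)
    return None


def generic_fallback(root, field):
    """Stand-in for a generic heuristic extractor (assumption, not Readability)."""
    if field == "title":
        node = root.find(".//title")
        return text_of(node) if node is not None else None
    if field == "body":
        # Crude heuristic: pick the <div> containing the most text.
        best = max(root.findall(".//div"), key=lambda d: len(text_of(d)), default=None)
        return text_of(best) if best is not None else None
    return None  # no generic guess for the author in this sketch


def extract(html, rules):
    """Try site rules per field; fall back only for the fields still missing."""
    root = ET.fromstring(html)
    result = {}
    for field in ("title", "author", "body"):
        value = match_rules(root, rules.get(field, []))
        if value is None:
            value = generic_fallback(root, field)
        result[field] = value
    return result
```

Calling `extract(PAGE, SITE_RULES)` takes the body from the site rule, the title from the fallback, and leaves the author empty, since neither step finds one.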

freediver · on Aug 25, 2021

Makes sense. What do 'prune' and 'tidy' instruct the parser to do?

k1m · on Aug 25, 2021

Prune instructs the parser to remove any elements within the extracted article block that look superfluous. This can result in false positives, so we tend to disable it when we've gone to the trouble of creating site-specific extraction rules.

Tidy determines if the source HTML should be cleaned up first with HTML Tidy - https://github.com/htacg/tidy-html5. If you're parsing the source HTML with an HTML 5 parser, as we are now, it shouldn't be necessary any more (I think we actually ignore it now). We used it more before when we relied on libxml parsing, which often trips up on modern HTML.
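
For illustration, here is a small sketch of reading such flags from a plain-text rule file. It assumes a simple `key: value` line format with `yes`/`no` switches; the sample rules, the function name `parse_site_config`, and the defaults (both flags on) are assumptions made for this example, not a description of the real parser.

```python
# Hypothetical rule file for an imaginary site.
SAMPLE_CONFIG = """
# hypothetical rules for example.com
title: //h1[@class='headline']
body: //div[@id='story']
prune: no
tidy: no
"""


def parse_site_config(text):
    """Parse 'key: value' lines; selector keys accumulate, flags become booleans."""
    # Assumed defaults: both flags enabled unless the file says otherwise.
    config = {"title": [], "author": [], "body": [], "prune": True, "tidy": True}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so colons inside an XPath survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        key, value = key.strip(), value.strip()
        if key in ("prune", "tidy"):
            config[key] = value.lower() == "yes"
        elif key in ("title", "author", "body"):
            config[key].append(value)
    return config
```

With `SAMPLE_CONFIG`, both `prune` and `tidy` come back as `False`, which mirrors the choice of switching pruning off once hand-written selectors exist.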

bspammer · on Aug 25, 2021

Another cool crowdsourced thing I discovered recently is SponsorBlock [1], which is an extension to automatically skip sponsored content in YouTube videos. Users contribute timings to the database that everyone else uses. It works remarkably well: any recent video with more than about 50,000 views is pretty much guaranteed to have timings submitted.

[1] https://sponsor.ajay.app/
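
To make the idea of shared timings concrete, here is a tiny sketch of how a player could use them. The `(start, end)` tuples in seconds and the function name `next_position` are assumptions for illustration only; this is not SponsorBlock's actual data format or API.

```python
def next_position(position, segments):
    """Return where playback should continue from `position` (in seconds).

    `segments` is a list of (start, end) pairs marking content to skip.
    """
    for start, end in sorted(segments):
        if start <= position < end:
            # Jump to the end of the segment; later segments are still checked,
            # so back-to-back or overlapping segments are skipped in one go.
            position = end
    return position


# Hypothetical crowd-submitted timings for one video.
SEGMENTS = [(30.0, 75.5), (75.5, 90.0), (600.0, 640.0)]
```

A position outside every segment is returned unchanged, while a position inside one jumps to the end of that segment (and past any segment that follows immediately).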

nyanpasu64 · on Aug 25, 2021

Is it possible to utilize the database of Instant View per-site parsers in a web browser or extension's reader mode?

infogulch · on Aug 25, 2021

ArchiveBox is a tool that downloads web pages and saves them in various different formats: warc, pdf, rendered png, plain text. I wonder what it uses for plain text extraction and if the readability repo would be useful for that purpose.

Edit: Oh neat, it does actually: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#...

> Archive method SAVE_READABILITY

> Extract article text, summary, and byline using Mozilla's Readability library. Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into readability.

jrochkind1 · on Aug 25, 2021

It's pretty amazing that it already does that.

ArchiveBox and the other stuff from the "DIY no-credentials don't-care-about-the-rules" web archiving community, like ArchiveTeam, continues to astound me with its quality and "professionalism" (as a credentialed professional in the field of digital library stuff, I find they are often outdoing the actual credentialed professional community).

nikisweeting · on Aug 26, 2021

Aww thank you!

dredmorbius · on Aug 25, 2021

My suspicion is that there are an increasing number of publishers who are intentionally severing compatibility with Readability.

Washington Post, I'm looking at you mofos. Chief reason I'll seek out any alternative news site for archival. It's been this way for about a year, if not more.

mft_ · on Aug 25, 2021

O/T, but thanks for Push To Kindle. I found the browser version so useful I bought the iPhone app, both to use and also to support you. Brought a whole new field of usefulness to my Kindle.

k1m · on Aug 25, 2021

Thank you! That's really nice to hear. (Appreciate the support too.)

jamil7 · on Aug 25, 2021

I wrote a Swift port of it last year for my app, but it deviates from Readability a fair bit as I tailored it to my needs. I've considered cleaning it up and open sourcing it regardless. I know there is an Objective-C port floating around.

freediver · on Aug 25, 2021

Don't clean it up, just put it up there and let others do it! I'd be very interested in it, please reach out when you do.

mdoms · on Aug 25, 2021

Just reading through that code, it seems like Readability is only intended to work on English-language websites? Like it checks for nodes with class names matching /and|article|body|column|content|main|shadow/ and uses a minimum length of 140 characters for matching nodes that are reader-able. Seems a bit lazy for a company whose stated mission is "to ensure the Internet is a global public resource, open and accessible to all".
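
A simplified illustration of the two checks described above, not Readability's actual logic. The pattern and the 140-character threshold are the ones quoted in the comment; the function name `looks_readerable`, the case-insensitive flag, and the way the two checks are combined are assumptions. Note that the pattern is applied to class or id strings, while the length check applies to the visible text.

```python
import re

# Pattern quoted in the comment; the IGNORECASE flag is an assumption.
CANDIDATE_CLASS = re.compile(r"and|article|body|column|content|main|shadow", re.IGNORECASE)
MIN_CONTENT_LENGTH = 140  # threshold quoted in the comment


def looks_readerable(class_and_id, text):
    """True if the class/id string matches the pattern and the text is long enough."""
    name_matches = bool(CANDIDATE_CLASS.search(class_and_id))
    long_enough = len(text.strip()) >= MIN_CONTENT_LENGTH
    return name_matches and long_enough


# Dutch sample text, 159 characters after stripping.
DUTCH_TEXT = "Dit is een zin. " * 10
```

In this sketch, `looks_readerable("article-content", DUTCH_TEXT)` passes, while the same text under a Dutch class name such as `artikel-inhoud` does not, so the outcome depends on how the markup is named rather than on the language of the text itself.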

Vinnl · on Aug 26, 2021

For what it's worth, even when working on Dutch websites my class names will usually be in English, and I think that's common across the industry. After all, the programming languages we use to write them are already in English, so Dutch class names would stick out like a sore thumb.

freediver · on Aug 25, 2021

Another robust solution is Tranquility Reader, which exists as an extension and has better accuracy than Readability at the expense of speed.

https://github.com/ushnisha/tranquility-reader-webextensions

Tarsul · on Aug 25, 2021

Yes! I use this add-on for Firefox all the time. Usually I click on my "Tranquility!" button the moment the cookie notification pops up, no cookies needed ;) (good for articles via HN, also possible to circumvent some paywalls)