
A New Instapaper Parser - ingve
http://blog.instapaper.com/post/137288701461
======
dang
I wish someone would make a good open-source library for this complicated and
annoying but valuable task.

We did some work in 2014 on an archive of all stories posted to HN, with the
goals of (a) having a lightweight, readable version of everything quickly
available and (b) doing analytics on the content. But we got bogged down on
getting the actual content programmatically across the full spectrum of cases.
This is one of those problems where not merely one devil is in the details but
a whole legion of them, and not the glamorous kind. Getting it right would
have sucked up all our resources, and the APIs out there (e.g. Readability)
came with problems too, so we dropped the project.

But for a programmer who enjoys the snake-pit-of-corner-cases type of
challenge, this would make a fine project, one with real public-service
potential. We can't work on it ourselves, but we'd consider funding it.

~~~
fortes
I used to work at Flipboard, and we invested a lot in this issue. It is not
easy, and requires constant (constant!) maintenance.

Getting to 80% quality isn't hard. 90% is tricky. 95% incredibly costly.

~~~
rkho
Completely agree. A friend and I tried to do something like this as a fun
project at a hackathon. Getting to 80% wasn't difficult, just a lot of parsing
the DOM for articles. Dealing with things like adverts, photo captions,
comments, and other text that shouldn't be in the actual article was the real
pain -- especially when we wanted to detect paragraph/subheader breaks, since
we wanted to parse articles and run them through text-to-speech.
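
To give a rough idea of what that "80%" pass can look like, here is a minimal sketch using only Python's standard `html.parser`. It is not what we built at the hackathon, and certainly not Instapaper's parser. The tag lists, the `BlockExtractor` class, the `extract_blocks` helper and the `min_words` threshold are all assumptions made up for illustration. It assumes reasonably well-formed markup and does nothing about comment sections or ads that are marked only by class names, which is where the real pain starts.

```python
from html.parser import HTMLParser

# Elements whose contents are dropped entirely (an illustrative guess).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "figcaption", "form"}
# Elements treated as paragraph or subheader blocks (also a guess).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4"}


class BlockExtractor(HTMLParser):
    """Collects (tag, text) pairs for paragraph and heading blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (tag, text) pairs, in document order
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._current = None  # tag of the block currently being collected
        self._parts = []      # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._current = tag
            self._parts = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Collapse runs of whitespace before storing the block.
            text = " ".join("".join(self._parts).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._current and self._skip_depth == 0:
            self._parts.append(data)


def extract_blocks(html, min_words=5):
    """Return headings plus any paragraph with at least min_words words."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return [
        (tag, text)
        for tag, text in parser.blocks
        if tag != "p" or len(text.split()) >= min_words
    ]
```

Keeping the tag next to each block is what preserves the paragraph/subheader breaks, so a text-to-speech step could pause or change tone on headings.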

------
dchuk
I realize there are business reasons for not sharing the full details of the
parser (or open sourcing it entirely), but it would still be interesting to
hear more details about the actual components/architecture/techniques being
used.

I would assume this is built on some sort of headless browser implementation
but who knows, maybe not. Hopefully Instapaper does a followup with more
technical details.

------
alpb
A very simple thing that I've reported to Instapaper multiple times over the
past year is fixing the GitHub support. I often save a lot of pages from
GitHub to Instapaper (such as
[https://github.com/docker/docker/blob/master/docs/userguide/...](https://github.com/docker/docker/blob/master/docs/userguide/storagedriver/zfs-driver.md))
and it only saves the repository root link, so I usually end up losing that
link entirely, which is pretty annoying.

I'm not sure how this article is supposed to make a paying user happy: it
doesn't show any measurable metrics (not that any user should actually have to
care about those; it should just work). I still wonder how hard it is to add a
simple "if github then" check.

~~~
bambax
The GitHub page, when copied and pasted into
[http://markitdown.medusis.com](http://markitdown.medusis.com), renders okay.

MarkItDown is a "toy" implementation of a rich-text-to-Markdown converter
that I wrote 3 years ago and that still enjoys significant usage. Maybe it
could be used as a start for a more full-featured parser.

~~~
alpb
I am not talking about a content parsing problem. Because of some stupid
canonical URL header GitHub serves, Instapaper keeps truncating the saved URL
path to the repo level, and they haven't fixed it in forever.
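
If the canonical URL really is the cause, the kind of check I have in mind could look something like the sketch below. It is only a guess at a workaround, since I have no idea how Instapaper actually picks the URL it stores. `choose_saved_url` and its parameters are hypothetical names, and the URLs in the usage note are made up. The rule: when the canonical URL is on the same host and its path is a strict prefix of the requested path, treat it as a truncation and keep what the user saved.

```python
from urllib.parse import urlsplit


def choose_saved_url(requested, canonical):
    """Pick the URL to store for a saved page.

    Prefers the canonical URL, except when it looks like a truncated
    ancestor of the requested URL (same host, strictly shorter path
    made of the same leading segments).
    """
    if not canonical:
        return requested

    req, can = urlsplit(requested), urlsplit(canonical)
    if req.netloc.lower() != can.netloc.lower():
        # Different host: trust the canonical URL.
        return canonical

    req_parts = [p for p in req.path.split("/") if p]
    can_parts = [p for p in can.path.split("/") if p]
    is_ancestor = (
        len(can_parts) < len(req_parts)
        and req_parts[: len(can_parts)] == can_parts
    )
    return requested if is_ancestor else canonical
```

With a requested URL of `https://github.com/example/repo/blob/master/docs/guide.md` and a canonical URL of `https://github.com/example/repo`, this keeps the full file URL. Because the rule is not GitHub-specific, it avoids a hard-coded "if github then" branch, at the cost of ignoring canonical URLs on any site that legitimately points a deep page at its parent.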

------
rjknight
Obligatory link to the nearly-identically-named project:
[https://github.com/Engelberg/instaparse](https://github.com/Engelberg/instaparse)

