
Show HN: Get clean readable content from any webpage - bndr
https://github.com/bndr/node-read
======
Theodores
Thanks for posting that, it is exactly what I need.

I want to scrape a diverse selection of web pages for 'the gist' of the main
content purely to put into a solr search engine. At the moment I am doing
surprisingly well just going on the meta descriptions - in essence meta
descriptions should be what I am after - but meaningfully scraped content is
the ideal and what I need if no meta description is available.
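
A minimal Python sketch of the meta-description-first approach described above, falling back to scraped text when no description exists. It uses only the standard library. The names `GistParser` and `gist`, and the longest-paragraph fallback, are illustrative assumptions and not how node-read picks content.

```python
from html.parser import HTMLParser


class GistParser(HTMLParser):
    """Collect the meta description and paragraph text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            content = (attrs.get("content") or "").strip()
            if content:
                self.description = content
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def gist(html, max_chars=300):
    """Return the meta description, or a crude content fallback."""
    parser = GistParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description
    # Assumption: the longest paragraph stands in for the "main content".
    if parser.paragraphs:
        return max(parser.paragraphs, key=len)[:max_chars]
    return ""
```

The returned string could then be sent to the search index as the document summary field; a real extractor would score candidate blocks more carefully than this fallback does.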

I am sure you have lots of grand plans for this, however, one thing that I
wish I could do is search through my web history, at a content level.

For instance, earlier today I found some code was no longer working properly.
I could remember bits of the tutorial I needed first time around but finding
it again was a chore (on a narrow subject, every Google search result is
purple/visited, so you can't immediately identify the link).

If I could have done some 'search my previous browsing history' then I could
have got there a lot quicker. In this 'imaginary search' only the pages that
are kept open and scrolled through get 'internally archived'. There could also
be snapshots of the pages in question in the results as an aid to memory.

Another thing I would like to see done with this type of tool is a sensible
spell check. For decades we have had spell checks in text entry areas but
never on a finished web page. A little widget to show the words with problem
spellings with wavy underlining that would work on any web page would save no
end of woes for people that have to proof read things online. If it also
highlighted sentences too long for good copy that would be very useful too.
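
The problem-spelling and long-sentence checks could be sketched roughly as follows in Python, run over text that an extractor has already produced. `review_text`, the `known_words` set, and the 25-word threshold are all hypothetical; a real spell check would need a proper dictionary.

```python
import re


def review_text(text, known_words, max_words=25):
    """Return (unknown_words, long_sentences) for a block of extracted text."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    long_sentences = [s for s in sentences if len(s.split()) > max_words]

    # Any word missing from the supplied word set counts as a possible misspelling.
    words = re.findall(r"[A-Za-z']+", text)
    unknown = sorted({w.lower() for w in words if w.lower() not in known_words})
    return unknown, long_sentences
```

A page widget would then underline the returned words and sentences in place; that display step is not shown here.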

Can these sorts of applications be built on what you have here?

P.S. Funny that you wrote this 186 days ago:

[https://news.ycombinator.com/item?id=6581317](https://news.ycombinator.com/item?id=6581317)

Times have changed...

~~~
flippyhead
I'm curious, do you use Google History? If so, does it not meet your needs?

------
crashandburn4
Very simple request for the developer, you couldn't put up
screenshots/examples of text that has been extracted along with a (cached)
link so that people can see how it looks without downloading and testing it?

------
willlma
Good to see an OSS version. How does it handle pagination? From my experience
(Pocket, Readability, Readable (now Evernote Clearly), Boilerpipe) that's the
big differentiator. But frankly, with the increased use of media queries, I
find myself using this view less and less, and the only feature I miss is
pagination.

The optimal reading environment on a responsive website is usually the tablet
view. Narrow enough to make jumping lines easy, narrow enough to prevent big
ads on the side columns. I'd love to see a tool that could force media queries
on bigger displays and simply center the text. You'd still get a feel for what
the web designer was going for, but it's much cleaner. On non-responsive
sites, it could use the same metrics as this tool to determine which elements
are article elements and simply center them and remove the rest, while
maintaining the fonts, colors... The medium is the message and all that jazz.

------
snipek
Send an email message to read@snipek.com with a URL in the subject line. Then
you'll get a readable email of that web page. You can read it offline on a
mobile mail client. That's how I read long web articles these days.

~~~
aleksi
There is also a similar service:
[http://www.ukeeper.com](http://www.ukeeper.com)

~~~
snipek
Good to know. Thanks!

------
kiliankoe
Thank you for this. I just started using node-readability for a project and
I'll gladly give this a spin. However, the second URL [0] I tried passing it
unfortunately returned an empty body. node-readability is able to parse the
contents of tagesschau.de seemingly without problems.

[0] [http://www.tagesschau.de/ausland/krise-in-der-ostukraine112....](http://www.tagesschau.de/ausland/krise-in-der-ostukraine112.html)

~~~
bndr
Thanks! It may be an encoding problem, I'll take a look at it today.
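
For illustration only, since the cause has not been confirmed: one common way to handle a page that is not UTF-8 is to honor the charset declared in the Content-Type header before decoding. The Python sketch below assumes that approach; `decode_body` is a hypothetical helper, not part of node-read.

```python
import re


def decode_body(raw, content_type=""):
    """Decode response bytes, preferring the charset named in Content-Type."""
    match = re.search(r'charset="?([\w-]+)', content_type, re.IGNORECASE)
    candidates = ([match.group(1)] if match else []) + ["utf-8"]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    # Last resort: latin-1 maps every byte to a character, so it cannot fail.
    return raw.decode("latin-1")
```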

------
rpedela
How does this compare to Apache Tika?

------
kephra
I did not read the source, but have a question:

How do you deal with web pages where the content is dynamically created using
javascript/ajax? Do you execute this JS, or just ignore this growing
antipattern?

~~~
grimtrigger
Are dynamically created websites actually common with text-heavy web _sites_?
I've only seen them on interaction-heavy web _apps_.

~~~
dredmorbius
Blogger (dynamic templates) and Business Insider are two that come to mind as
suspects. Both render blank pages in Chrome w/ JS disabled.

------
TD-Linux
This reminds me of Firefox Mobile's reader mode - something I find invaluable.

Unfortunately they don't seem to have a desktop equivalent, yet.

~~~
atopal
Reader mode for desktop is coming (no specific version targeted though). Here
is the bug:
[https://bugzilla.mozilla.org/show_bug.cgi?id=558882](https://bugzilla.mozilla.org/show_bug.cgi?id=558882)

And here are the mock-ups:
[http://people.mozilla.org/~mmaslaney/readermode/index.html](http://people.mozilla.org/~mmaslaney/readermode/index.html)

------
bgnm2000
would love to see this as a bookmarklet (that hits the user's npm) for demo
purposes

~~~
bndr
I'm not sure I understand what you mean by "that hits the user's npm".

~~~
jzig
He means a button he can press in the browser that does the work for him.

------
earwolf
Pagination. Comments. Profit.

