
Overview of Text Extraction Algorithms - Anon84
http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/
======
ajays
This page is just a thin wrapper, with a link to the _actual_ overview:
<http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/>

And here's his list of resources:
<http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/>

~~~
bl4k
I wonder why somebody on HN read that article and decided to submit it rather
than the original source.

~~~
itsnotvalid
The link has already been changed to point to the original article.

For reference, this post originally linked to
<http://www.readwriteweb.com/hack/2011/03/text-extraction.php>

------
grayrest
Readability has a LOT of hand-tuned heuristics for figuring out the most
likely content of the page, but the primary indicator of whether a tag with
text in it is part of an article or not is the number of commas in the tag.
It's my favorite thing about the algorithm because it's a dumb idea that
works. The comma rule gets the extraction correct on about 70% of the web;
the rest of the heuristics are mostly there to cover the screwy ways people
structure their articles.
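
In rough form, the core of the comma idea looks something like this (a toy
sketch of the heuristic, not Readability's actual code; it assumes
BeautifulSoup for parsing):

    # Toy sketch of the comma heuristic -- not Readability's real scoring,
    # which layers many more hand-tuned rules on top of this.
    from bs4 import BeautifulSoup

    def extract_by_commas(html, min_commas=3):
        soup = BeautifulSoup(html, "html.parser")
        scored = []
        for tag in soup.find_all(["p", "div"]):
            text = tag.get_text(" ", strip=True)
            score = text.count(",")   # more commas -> more likely prose
            if score >= min_commas:
                scored.append((score, text))
        scored.sort(reverse=True)     # best candidates first
        return [text for _, text in scored]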

~~~
tha-dude
I've been dabbling in content scraping, and what bugs me is that with all the
AJAX trickery going on, merely analyzing the XHTML source doesn't get you
very far in many cases. Executing the page (JS, DOM and all) via browser
automation is an option, but of course quite expensive. A headless browser is
what's needed!
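
For example, one way to get the rendered DOM instead of the raw source is
Selenium driving headless Chrome (just one option among many; this sketch
assumes chromedriver is installed and uses a placeholder URL):

    # Fetch the page after the JS has run, via Selenium + headless Chrome.
    # Much slower than a plain HTTP fetch, which is exactly the cost problem.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("http://example.com/some-ajax-heavy-page")
        rendered_html = driver.page_source   # DOM after scripts executed
    finally:
        driver.quit()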

~~~
pkandathil
Yeah, I think that is the challenge. A good way to get around the AJAX
problem is to see if a site has an RSS feed and use that to extract content.
I wish sites had a URL for bots built in so you didn't have to do all this
fancy stuff to extract the content.
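
For example, with the feedparser library (one option; it assumes the site
exposes a feed, and the feed URL below is a placeholder):

    # Pull titles/summaries straight from a site's RSS feed, sidestepping
    # any AJAX on the HTML pages. Note that many feeds only carry a teaser,
    # not the full article text.
    import feedparser

    feed = feedparser.parse("http://example.com/feed.rss")
    for entry in feed.entries:
        print(entry.get("title", ""), entry.get("link", ""))
        summary = entry.get("summary", "")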

~~~
buss
Many of the big sites will feed you non-ajax content if you're the googlebot.
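
For instance (a sketch with the requests library; whether a given site
actually serves a pre-rendered version this way varies site by site):

    # Request a page while identifying as Googlebot; some big sites respond
    # with non-AJAX, pre-rendered HTML. The URL is a placeholder.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                      "+http://www.google.com/bot.html)"
    }
    html = requests.get("http://example.com/article", headers=headers).text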

------
garply
I studied a fair amount of NLP (a true passion of mine) at school and after I
graduated I spent several months working on tech which did this (and other
things). That was intended to be a startup, but sadly, at the time, my
business sense sucked and I couldn't decide on a good product to fit the tech
to (the fact that I was developing tech before I had a strong sense of my
product is already telling).

I have since started a completely different (and profitable) company, and the
code has just been bit-rotting. I'm not sure what I should do with it. Keep it
around in case I ever decide to do a business model like some of these
companies (I probably don't have time for that)? Open source it (time-
consuming to clean the code and what do I stand to gain from that)? I guess I
could use the open-sourced stuff to help me find contracts for freelancing,
but I just don't see a lot of NLP remote work being offered.

Still, I hate seeing the code rot...

~~~
bl4k
dump it on github - someone will come around and clean it up for you

what is it written in?

~~~
garply
Mostly Python. Some C. Some R to play around with datasets and prototype
algorithms.

------
hollerith
When I started using the internet in 1992, Usenet (which, BTW, was almost
always referred to as netnews or just plain "news" before Time magazine,
etc., used their influence as explainers of the internet to the general
public to change the name) was the social heart of the internet the way the
web is now, and you did not need _algorithms_ to _extract_ the text from
Usenet because the posts were all dead-simple plain text files.

------
rb2k_
I've always wanted to build a simple service that classifies websites:
something where you dump in the HTML/URL and it returns something like
"agriculture"/"government"/"retail"/"education".

I already have a set of a few thousand classifications at hand. What would
probably be a good algorithm to run it through? I assume I'd use something
like webstemmer/boilerpipe/... to extract just the main text first.

What I'm a bit uncertain about is what to do after that. My guess would be
that I isolate the nouns/adjectives with the highest frequency and do a
clustering with my already-categorized dataset as training data.

Does that make sense? If yes: any recommendations or alternatives for
libraries (preferably Ruby) or just the algorithms themselves (k-means, SVM,
neural network...)?
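
Concretely, I'm picturing something along these lines (a rough
Python/scikit-learn sketch just to make the idea concrete; I'd look for Ruby
equivalents, and the two training examples are placeholders for the real
dataset):

    # Rough sketch: TF-IDF features over the extracted main text, fed into a
    # linear SVM trained on the already-categorized pages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    training_texts = ["tractor harvest soil crop yield",       # placeholders
                      "tax filing government agency form"]
    training_labels = ["agriculture", "government"]

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
    clf.fit(training_texts, training_labels)
    print(clf.predict(["new page text about crop rotation and soil"]))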

~~~
flavy
Hi, we are launching a service exactly like the one you're referring to. You
can try the API already. Check out the docs at
<https://sites.google.com/site/thinkersrus/products-1/scienceapialpha> and
the demo at <http://www.thinkersr.us/demo.html>

~~~
rb2k_
Oh nice, it looks interesting and worked pretty well in my few tests. Sadly,
with the number of requests I'd be doing (10,000+ just for training), I guess
I should build something of my own.

------
PaulHoule
Unless I'm missing something, all the methods he's talking about involve
looking at web pages in isolation or, alternatively, across the set of all
web pages.

To do "template drop out", it would seem productive to look longitudinally
across pages on a single site, or in a subdirectory. For instance, almost all
pages in Hacker News have the same chrome. Methods used for DNA clustering
(such as Hidden Markov Models) could quickly find 'conserved' and
'unconserved' areas of documents.

This touches semantic technology because it links the ability to find nameless
statistical patterns with meaningful semantic identifiers, such as domain
names.
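
Even a crude longitudinal pass illustrates the point -- e.g., a toy sketch
(nothing like a real HMM) that drops any text block repeated across most
pages of the same site:

    # Toy "template drop out": any text block that appears on most pages of
    # a site is treated as chrome and removed. A real system would use
    # something smarter (e.g. HMMs over the DOM), but the idea is the same.
    from collections import Counter
    from bs4 import BeautifulSoup

    def strip_site_template(pages_html, threshold=0.8):
        per_page_blocks = []
        for html in pages_html:
            soup = BeautifulSoup(html, "html.parser")
            blocks = {t.get_text(" ", strip=True)
                      for t in soup.find_all(["p", "div", "li", "td"])
                      if t.get_text(strip=True)}
            per_page_blocks.append(blocks)

        counts = Counter(b for blocks in per_page_blocks for b in blocks)
        n = len(per_page_blocks)
        chrome = {b for b, c in counts.items() if c / n >= threshold}

        # keep only the blocks that are not shared site-wide
        return [[b for b in blocks if b not in chrome]
                for blocks in per_page_blocks]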

~~~
andraz
Methods that do clustering on similar web pages are mostly too CPU-intensive
for processing larger sets (we're talking millions of web pages). They are
also harder to scale from a data-locality perspective: you need to figure out
which pages to put together and then get the data together.

Looking at pages in isolation is much more horizontally scalable. You can take
a look at Webstemmer
(<http://www.unixuser.org/~euske/python/webstemmer/index.html>) for a method
exploiting similarities.

~~~
PaulHoule
Great reply. However, I think something is only worth doing if it's
impossible.

Your argument is like Chomsky's argument about the poverty of the stimulus,
just in reverse. There are heuristics that let us radically prune the N^2
possible relationships between things into a much smaller set that will let us
do things that would otherwise be unscalable.

~~~
andraz
Let me know if you know of this approach being used in production anywhere,
processing millions of web pages. I would be very interested to know how they
overcome the difficulties!

I can imagine the cost/benefit of the approach is favorable for the largest
search engines, like Google and Bing, which are trying to squeeze the last
few percentage points of precision out of their results. For everybody else,
the engineering and scaling difficulties are probably too big. I'd love to be
proven wrong.

~~~
PaulHoule
Google and Bing are doing billions of web pages, not millions. I process
millions of web pages myself with 3 computers -- millions aren't a lot these
days, although I'm not currently using clustering methods.

Rather, I'm selective with my inputs so that I start with unscrambled eggs;
that lets me improve precision not by "a few percentage points" but by
cutting the false-positive rate by an order of magnitude.

My use of ML so far has been modest, limited to solving a few straightforward
problems. Personally, I think search is boring (at web scale it's too big a
game for small players, plus search as we know it probably can't get much
better because the queries are not precise -- better performance will require
changing the game), but I've been forced to put effort into it because end
users expect it.

------
TorKlingberg
I have been thinking there should be a way in HTML to mark the main part of a
web page, as opposed to the header, navigation, footer, etc. It could be used
when printing, by screen readers, search engines, and Readability. I don't
know if the W3C would approve such a tag, or if site owners would bother to
use it, though.

~~~
semanticist
HTML5 has the 'article' tag. There are also header, footer, and other
semantic markup tags.

I've spent the last month writing scrapers for newspaper sites. No one uses
any of these things.
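
When a site does use it, extraction becomes trivial -- roughly (a sketch with
BeautifulSoup):

    # If a page actually uses the HTML5 <article> element, the body text
    # falls right out. For the newspaper sites I scrape it's almost never
    # there, so this mostly returns None.
    from bs4 import BeautifulSoup

    def article_text(html):
        soup = BeautifulSoup(html, "html.parser")
        article = soup.find("article")
        return article.get_text(" ", strip=True) if article else None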

~~~
buss
Are the scrapers hand-tuned to each website? If not, how did you do it?

~~~
semanticist
We need to extract site-specific metadata, and I'm no expert at NLP, so it's
mostly one scraper per site.

