The Easy Way to Extract Useful Text from Arbitrary HTML (ai-depot.com)
33 points by danw on Aug 6, 2008 | 11 comments



from TFA: The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

   1. Parse the HTML code and keep track of the number of bytes processed.
   2. Store the text output on a per-line, or per-paragraph basis.
   3. Associate with each text line the number of bytes of HTML required to describe it.
   4. Compute the text density of each line as the ratio of text length to HTML bytes.
   5. Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning — not to mention that it’s easier to implement!

for those who are too scared to follow the link...
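
For the curious, here's a minimal sketch of steps 1-4 plus the fixed-threshold variant of step 5, in Python with only the standard library. The block-tag set and the 0.5 threshold are my own guesses, not the article's numbers:

    from html.parser import HTMLParser

    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "tr", "td"}

    class DensityExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.lines = []        # (text, markup_bytes) per output line
            self.text = ""
            self.markup = 0
            self.skip = False      # inside <script>/<style>

        def _flush(self):
            if self.text.strip():
                self.lines.append((self.text.strip(), self.markup))
            self.text, self.markup = "", 0

        def handle_starttag(self, tag, attrs):
            if tag in BLOCK:
                self._flush()
            if tag in ("script", "style"):
                self.skip = True
            self.markup += len(self.get_starttag_text() or "")

        def handle_endtag(self, tag):
            if tag in ("script", "style"):
                self.skip = False
            if tag in BLOCK:
                self._flush()
            self.markup += len(tag) + 3    # rough byte count for "</tag>"

        def handle_data(self, data):
            if self.skip:
                self.markup += len(data)   # script/style bodies count as markup
            else:
                self.text += data

        def close(self):
            super().close()
            self._flush()

    def extract(html, threshold=0.5):
        parser = DensityExtractor()
        parser.feed(html)
        parser.close()
        # density = text bytes / (text bytes + markup bytes) for each line
        return [t for t, m in parser.lines if len(t) / (len(t) + m) > threshold]

The article trains a neural network instead of hard-coding the threshold; the idea is the same, only the decision rule in the last line changes.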


Anyone else getting an error page about this site being reported to be an "attack website"?


Only Google has it shit-listed, so Firefox flags it too by extension.

Opera 9.5 and IE 7 (both with phishing detection as well) don't flag it. Google's malware detection engine has been known to be off plenty of times before; it cost a startup I know a month's income because some bad sites were linking to them, before Google apologized and undid the blacklisting.


Of course it's an "attack website." It'll strip away your HTML, and then what will you have?



I think it's blacklisted by Google, but it appears to be perfectly safe.


Most attack sites do. Appear to be perfectly safe, that is.

For anyone who wants to read the article, this works. (I put in URL_HERE as a placeholder because HN truncates the URL):

lynx --dump URL_HERE > output.txt

Ironic that to read the article, I used a technique that solves the very problem the article is about. In a much easier way, I might add!


Maybe that's the trick... it's a meta-article about how to _really_ get text from HTML: by figuring out you can just dump from lynx, you no longer have to read it!


That doesn't solve the actual problem of extracting the article content itself from a page.


The patterns could be derived in a more formalized way by applying Shannon's principles of information entropy across the document, its lines of text, word patterns, n-grams, etc. Bayesian inference could then be applied for probabilistic pattern matching (instead of the neural network).

Not that this approach is any easier; it's just perhaps more robust when applied to other problem sets.
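
For concreteness, a rough sketch of the entropy half of that idea. Treating each line's word distribution as the unit of analysis is my choice here, not something from the article:

    import math
    from collections import Counter

    def shannon_entropy(tokens):
        # Shannon entropy (bits) of the empirical token distribution.
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def score_lines(text):
        # Yield (entropy, line); nav bars and repeated labels tend to score low.
        for line in text.splitlines():
            words = line.lower().split()
            if words:
                yield shannon_entropy(words), line

Low entropy alone also flags short prose lines, which is exactly where combining signals via Bayesian inference (rather than a single threshold) would earn its keep.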

Nothing new, of course: according to their website, Autonomy (Europe's second-largest software company) uses these techniques as the basis of its core technology for analyzing text, audio, video, etc.


It's an interesting idea and probably useful in some situations. It would be much more useful, though, if the parser kept structural information about where in the HTML tree a particular text fragment was found. Lines could still serve as the unit to which statistical analysis is applied (although that seems error-prone), but knowing more about the structure would enable further processing down the line.
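
Something along these lines, perhaps. This is a standard-library sketch of the suggestion, with an intentionally incomplete void-tag list, so an assumption rather than a drop-in parser:

    from html.parser import HTMLParser

    VOID = {"br", "hr", "img", "input", "link", "meta"}   # never pushed on the stack

    class PathTracker(HTMLParser):
        # Records, for each text fragment, the chain of open tags above it.
        def __init__(self):
            super().__init__()
            self.stack = []
            self.fragments = []    # (path, text) pairs

        def handle_starttag(self, tag, attrs):
            if tag not in VOID:
                self.stack.append(tag)

        def handle_endtag(self, tag):
            if tag in self.stack:
                # pop back to the matching open tag; tolerates unclosed children
                while self.stack.pop() != tag:
                    pass

        def handle_data(self, data):
            if data.strip():
                self.fragments.append(("/".join(self.stack), data.strip()))

Feeding a page through it yields pairs like ("html/body/div/p", "Some text"), so the per-line statistics could be grouped by subtree instead of treating the page as a flat list.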



