
The Easy Way to Extract Useful Text from Arbitrary HTML - danw
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
======
yago
from TFA: The concept is rather simple: use information about the density of
text vs. HTML code to work out if a line of text is worth outputting. (This
isn’t a novel idea, but it works!) The basic process works as follows:

    
    
       1. Parse the HTML code and keep track of the number of bytes processed.
       2. Store the text output on a per-line, or per-paragraph basis.
       3. Associate with each text line the number of bytes of HTML required to describe it.
       4. Compute the text density of each line by calculating the ratio of text to bytes.
       5. Then decide if the line is part of the content by using a neural network.
    

You can get pretty good results just by checking if the line’s density is
above a fixed threshold (or the average), but the system makes fewer mistakes
if you use machine learning — not to mention that it’s easier to implement!
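The threshold variant of the steps above can be sketched in a few lines of Python. This is not the article's code, just a minimal illustration using the stdlib `html.parser`; the set of block-level tags used to split lines and the 0.5 density cutoff are arbitrary choices:

```python
from html.parser import HTMLParser

class DensityExtractor(HTMLParser):
    """Keep lines whose ratio of text bytes to total HTML bytes is high."""

    def __init__(self, threshold=0.5):
        super().__init__()
        self.threshold = threshold
        self.lines = []        # (text, text_bytes, html_bytes) per line
        self._text = []
        self._text_bytes = 0
        self._html_bytes = 0

    def _flush(self):
        # Close out the current line and reset the byte counters.
        text = "".join(self._text).strip()
        if text:
            self.lines.append((text, self._text_bytes, self._html_bytes))
        self._text, self._text_bytes, self._html_bytes = [], 0, 0

    def handle_starttag(self, tag, attrs):
        # Block-level tags start a new "line" of output.
        if tag in ("p", "div", "br", "li", "tr"):
            self._flush()
        self._html_bytes += len(self.get_starttag_text() or "")

    def handle_endtag(self, tag):
        self._html_bytes += len(tag) + 3   # bytes for "</tag>"

    def handle_data(self, data):
        # Text counts toward both the text and the total byte tally.
        self._text.append(data)
        self._text_bytes += len(data)
        self._html_bytes += len(data)

    def extract(self):
        self._flush()
        return [text for text, tb, hb in self.lines
                if hb and tb / hb >= self.threshold]

html = ('<html><body><div class="nav"><a href="/">home</a></div>'
        '<p>This paragraph is mostly plain text with very little markup.</p>'
        '</body></html>')
p = DensityExtractor(threshold=0.5)
p.feed(html)
print(p.extract())
```

The nav link is drowned in markup (density well under 0.5), so only the paragraph survives. The article's point is that a learned classifier beats any such fixed cutoff.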

for those who are too scared to follow the link...

------
twism
Anyone else getting an error page about this site being reported to be an
"attack website"?

~~~
danw
I think it's blacklisted by Google, but it appears to be perfectly safe.

~~~
natch
Most attack sites do. Appear to be perfectly safe, that is.

For anyone who wants to read the article, this works. (I put in URL_HERE as a
placeholder because HN truncates the URL):

lynx --dump URL_HERE > output.txt

Ironic that to read the article, I used a technique that solves the problem
the article is talking about solving. In a much easier way, I might add!

~~~
amackera
Maybe that's the trick... It's a meta-article about how to _really_ get text
from HTML. By figuring out that you can just dump from lynx, you no longer
have to read it!

------
jonmc12
The patterns could be derived in a more formal way by applying Shannon's
principles of information entropy across the document, lines of text, word
patterns, N-grams, etc. Then Bayesian inference could be applied for
probabilistic pattern matching (vs. the neural network).

Not that this approach is any easier; it's just perhaps more robust when
applied to other problem sets.
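As a toy illustration of the entropy idea (not from the comment, and only a sketch): per-line Shannon entropy over the character distribution can be computed directly; whether it actually separates markup from prose is something you'd have to measure on real pages.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits per character of the character distribution in s."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Entropy is 0 for a uniform string and maximal when all characters differ.
for line in ('<div class="nav"><a href="/">home</a></div>',
             'Plain prose tends to sit in a fairly narrow entropy band.'):
    print(f"{shannon_entropy(line):.2f}  {line}")
```

A fuller treatment would condition on word or N-gram distributions rather than raw characters, as the comment suggests.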

Nothing new, of course: according to their website, Autonomy (Europe's
2nd-largest software company) uses these techniques as the basis of its core
technology for analyzing text, audio, video, etc.

------
fauigerzigerk
It's an interesting idea and probably useful in some situations. It would be
much more useful, though, if the parser kept structural information about
where in the HTML tree a particular text fragment was found. Lines could still
serve as the unit to which statistical analysis is applied (although that
seems error-prone), but knowing more about the structure would enable further
processing down the line.
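Keeping that structural information is cheap: a sketch (my own, assuming the stdlib `html.parser`) that tags each text fragment with the path of enclosing tags, which downstream stages could then use alongside the density score:

```python
from html.parser import HTMLParser

class PathAwareExtractor(HTMLParser):
    """Record, for each text fragment, the path of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags
        self.fragments = []    # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching tag, tolerating unclosed tags.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(("/".join(self.stack), text))

p = PathAwareExtractor()
p.feed("<html><body><div id='nav'><a href='/'>home</a></div>"
       "<p>Some content.</p></body></html>")
print(p.fragments)
```

With paths attached, a later pass could, say, aggregate density scores per subtree instead of per line.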

