

Evaluating Text Extraction Algorithms - tomazk
http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

======
sigil
This is excellent! I read your posts a while back that inventoried the various
libraries and algorithms for text extraction. Ever since, I've been wishing
someone would actually measure and compare performance. Thanks so much.

 _Readability’s poor performance came as a surprise, as did the varying
results of its two ports. Relatively low precision and high recall indicate
that readability tends to include large portions of useless text in its
output._

Subjectively, I've thought this for a while -- decruft (the Python port of
Readability) has not given us particularly good results. (And it's insanely
slow.) But I share your misgiving that the port may not be faithful to the
original used by Readability the service.
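
For anyone skimming, here is a rough sketch of why low precision plus high
recall points to extra junk in the output. This is not the article's
evaluation code. It assumes a simple bag-of-words comparison with whitespace
tokenization, and the function name and sample strings are made up for
illustration.

```python
from collections import Counter
from typing import Tuple


def precision_recall(extracted: str, reference: str) -> Tuple[float, float]:
    """Token-level precision and recall of extracted text vs. a reference.

    Assumption: tokens are lowercase whitespace-separated words, compared
    as multisets (word order is ignored).
    """
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())

    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    ext_total = sum(ext.values())
    ref_total = sum(ref.values())

    # Precision: share of the extracted tokens that belong to the article.
    precision = overlap / ext_total if ext_total else 0.0
    # Recall: share of the article's tokens that were extracted.
    recall = overlap / ref_total if ref_total else 0.0
    return precision, recall


# Hypothetical example: the whole article is kept, plus navigation cruft.
reference = "the quick brown fox"
extracted = "home about login the quick brown fox share this"

p, r = precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.44 recall=1.00
```

Nothing from the article is missing, so recall is perfect, but more than half
of the output is boilerplate, so precision drops. That is the pattern the
quote describes.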

Definitely looking forward to your evaluation code.

