Evaluating Text Extraction Algorithms

sigil · on June 10, 2011

This is excellent! I read your posts a while back that inventoried the various libraries and algorithms for text extraction. Ever since, I've been wishing someone would actually measure and compare performance. Thanks so much.

Readability’s poor performance came as a surprise, as did the varying results of its two ports. Relatively low precision and high recall indicate that readability tends to include large portions of useless text in its output.
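To make that reading of the numbers concrete, here is a minimal sketch of token-level precision and recall. It is not the article's evaluation code: the function name `token_precision_recall`, the whitespace tokenization, and the sample strings are all assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple


def token_precision_recall(extracted: str, reference: str) -> Tuple[float, float]:
    """Bag-of-words precision and recall of an extraction against a reference.

    Hypothetical helper: tokens are split on whitespace, which is only one
    of several ways an evaluation could define a token.
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection: tokens that appear in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Made-up example: the extractor kept the whole article (high recall)
# but also kept navigation and sharing cruft (low precision).
reference = "the quick brown fox"
extracted = "home about contact the quick brown fox share this"
p, r = token_precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case every reference token is recovered, so recall is perfect, while more than half of the extracted tokens are cruft, which is the pattern described above.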

Subjectively, I've thought this for a while -- decruft (the Python port of readability) has not given us particularly good results. (And it's insanely slow.) But I share your misgivings that the port may not be faithful to the original used by the readability service.

Definitely looking forward to your evaluation code.