Evaluating Text Extraction Algorithms (tomazkovacic.com)
23 points by tomazk on June 9, 2011 | 1 comment



This is excellent! I read your posts a while back that inventoried the various libraries and algorithms for text extraction. Ever since, I've been wishing someone would actually measure and compare performance. Thanks so much.

Readability’s poor performance came as a surprise, as did the varying results of its two ports. Relatively low precision and high recall indicate that readability tends to include large portions of useless text in its output.

Subjectively, I've thought this for a while -- decruft (the Python port of readability) has not given us particularly good results. (And it's insanely slow.) But I share your misgivings that the port may not be faithful to the original used by the Readability service.

Definitely looking forward to your evaluation code.
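
To make the "low precision, high recall" point in the quoted passage concrete, here is a minimal sketch of how the two numbers behave when an extractor returns the whole article plus leftover page chrome. It assumes a simple bag-of-words token overlap against a hand-cleaned reference text; the function name `precision_recall` and the sample strings are made up for illustration, and the article's actual evaluation may well use a different metric.

```python
from collections import Counter
from typing import Tuple


def precision_recall(extracted: str, reference: str) -> Tuple[float, float]:
    """Token-level precision and recall of extracted text against a reference.

    Assumption: whitespace tokenization and multiset (bag-of-words) overlap.
    This is an illustrative metric, not necessarily the one used in the article.
    """
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Hypothetical data: the extractor keeps all of the article text (high recall)
# but also keeps navigation and sharing links (low precision).
reference = "the quick brown fox jumps over the lazy dog"
noisy = reference + " home about contact login subscribe share this post"

p, r = precision_recall(noisy, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.53 recall=1.00
```

Every reference token is recovered, so recall is 1.0, but only 9 of the 17 extracted tokens belong to the article, which is the pattern the quoted passage describes.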



