
I built a summariser for classifier4j (http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/...).

It's 5 years old now, but the summaries it generates are competitive in quality with most things out there (e.g. the MS Word summarizer). Unfortunately I don't have an online demo working at the moment (like I said, it's 5 years old).

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords etc

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.

Like most things, it's surprising how well a simple algorithm like that works.
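
For anyone who wants to play with the idea, here is a rough Python sketch of those four steps. It is not the classifier4j code, just my reading of the description above. Assumptions: the function names (summarise, split_sentences, words) are made up for illustration, the stopword list is a tiny placeholder, stemming is skipped entirely, "maximum length" is taken to mean a number of sentences, and the selected sentences are emitted in their original order.

```python
import re
from collections import Counter

# Tiny placeholder stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "on", "to", "in", "are"}

def split_sentences(text):
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text):
    # Lowercased word tokens with stopwords removed. No stemming in this sketch.
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]

def summarise(text, max_sentences=2):
    # Step 1: crude HTML removal (stopwords are dropped inside words()).
    text = re.sub(r"<[^>]+>", " ", text)
    # Step 3: split on sentence boundaries.
    sentences = split_sentences(text)
    sentence_words = [set(words(s)) for s in sentences]
    # Step 2: unique words, most frequent first.
    ranked = [w for w, _ in Counter(words(text)).most_common()]
    # Step 4: for each word in turn, take the first sentence that mentions it,
    # stopping once the requested number of sentences has been collected.
    chosen = set()
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        for i, found in enumerate(sentence_words):
            if word in found:
                chosen.add(i)
                break
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling summarise(some_text, max_sentences=3) returns up to three sentences joined by spaces. A real version would want proper stemming and a better sentence splitter, but this is enough to see why the approach tends to pick sentences that introduce the main topics.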

There are ports for C#, and from Googling just now, it looks like someone has done a Python port too.
