Perhaps a minor issue, but a bit of a turn-off for me: you have to include "http://" in your URL. Otherwise you get a snarky comment like "You'll need to provide me with a valid url, matey", which is even more aggravating given that the address was correct. I suggest you allow people to omit it - it's more to type, and I'm guessing most people are used to just bunging an address into their browser without protocol qualification.
Interesting, I've been working on a similar idea but different implementation. Out of curiosity, how do you select the summary content for a URL? That can be a problem.
For example, I tested tldr.it on a recent HN submission, Programmers: How to Make the Systems Guys Love You (http://news.ycombinator.com/item?id=1800839), and it seems to select the first paragraph or so of the article (depending on whether you select the short, medium, or long tldr).
However, so many authors these days start out with a long, rambling preamble, instead of the venerable Who/What/When/Where/(Why|How), that the first few paragraphs don't necessarily provide a real tldr-type summary.
Which is the case with that particular article - the first few paragraphs don't help me decide whether to read it or not, and the entire article is quite long, with some good nuggets further down in it.
I extract it using a pretty clever algorithm, then run it through a few things to summarize it. The extraction is nowhere near perfect; it performs best on the major news sites thus far. I didn't really have time to polish it as much as I'd like, but it seems to work especially well on FOXNews and NYTimes (and Blogspot articles).
It's 5 years old now, but the summaries it generates are competitive quality-wise with most things out there (e.g., the MS Word summarizer). Unfortunately I don't have an online demo working atm (like I said - 5 years old).
My algorithm is something I made up, and from memory it works like this:
1) Remove HTML, stem, remove stopwords etc
2) Sort unique words by popularity in the text
3) Split the original text on sentence boundaries.
4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.
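The four steps above can be sketched in Python. This is a minimal illustration under my own assumptions: no stemming, a toy stopword list, and a naive regex for sentence boundaries - a real version would use a proper stemmer and stopword list.

```python
import re
from collections import Counter

# Toy stopword list for illustration; a real version would use a full one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was", "are", "this"}

def summarize(text, max_sentences=3):
    # 1) Normalize: lowercase, keep letter runs, drop stopwords.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]

    # 2) Sort unique words by popularity in the text, most frequent first.
    ranked = [w for w, _ in Counter(words).most_common()]

    # 3) Split the original text on (approximate) sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # 4) For each popular word in turn, include the first sentence that
    #    mentions it, until the summary reaches the requested length.
    chosen = []
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        for sent in sentences:
            if word in sent.lower() and sent not in chosen:
                chosen.append(sent)
                break

    # Emit the chosen sentences in their original document order.
    chosen.sort(key=sentences.index)
    return " ".join(chosen)
```

The sentence-reordering step at the end isn't stated explicitly above, but without it the summary reads in popularity order rather than document order.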
Like most things, it's surprising how well a simple algorithm like that works.
There are ports to C#, and from a quick Google just now, apparently someone has done a Python port too.
Do you want to talk about the summarisation algorithm at all? I wrote a little blog post about a trivial extractive summarisation system a while ago ( http://honnibal.wordpress.com/ ), and there's a long literature on summarisation in NLP. A lot of the techniques are a bit too computationally costly and complicated to be practical, though. Meanwhile, abstractive summarisation still hasn't properly gotten off the ground.
In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.
In my testing, that removed ~90% of navigation text. (I note you are only looking in <p> tags; I had a flag for that too, but found it was unnecessary most of the time.)
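A minimal sketch of that "ignore everything before the first &lt;p&gt;" flag, using Python's stdlib HTMLParser. The class name and the sample HTML are mine; only the technique is from the description above.

```python
from html.parser import HTMLParser

class FirstPExtractor(HTMLParser):
    """Collects page text, but ignores everything before the first <p> tag."""

    def __init__(self):
        super().__init__()
        self.started = False   # flips to True at the first <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.started = True

    def handle_data(self, data):
        # Navigation text typically precedes the first <p>, so it's dropped.
        if self.started and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

html = "<div id='nav'>Home | About | Contact</div><p>Real article text.</p>"
parser = FirstPExtractor()
parser.feed(html)
```

Here the nav-bar text before the first paragraph never reaches the summarizer, which is the ~90% win described above.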
Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
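For illustration, the kind of regex sentence splitter I mean - good enough on clean prose, though abbreviations like "Dr." will still fool it:

```python
import re

def split_sentences(text):
    # Split after ., !, or ? when followed by whitespace and a capital letter.
    # Known weakness: abbreviations ("Dr. Smith") produce false splits.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
```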
Well unfortunately I'm not doing much on my end at this point. I do a few small things and then let libots do most of the work. One of my last iteration items was to put some of my own summarization work into it and let libots be less of a player, but obviously with 48 hours, I had to prioritize.
I invested most of my cleverness in actually getting the content out of the page since that's really where the money is for an MVP for this; no content == no summary. :)
I'm using an algorithm very similar to what they do with a few clever additions of my own. I started out with something almost identical, but they had a few twists that made it even better, which I then in turn improved on (and HTML5-ified :)).
Why don't you open-source your algorithm so more folks can work on it with you? I've been futzing with Readability JS converted to PHP (but could port it to Ruby or Python), and it would be great to collab and share test files, etc.
I'd be interested in working on this project --- it's a problem I've come across quite a bit. There's even an academic contest for it, called CLEANEVAL, although the way they set up the problem was arguably not quite right.
I have a half-API in there, but I certainly want to build it out a little more. I planned on having it done by the time the competition was over but then I decided to do crap like "sleep" and "eat". :P
It looks like you might be using libots -> libots.sourceforge.net
I'm watching this thread with interest because I've been doing some work on building a good summarization engine for a few months now, and it is getting pretty close to ready for prime-time -> summarity.com
Are there any alternatives that actually let humans do this? Perhaps a site that employs some game mechanics? First to summarize, +10 points. Correction, +5 points. Request to summarize (bounties), +n points.
I'm all for adding game mechanics to enhance user participation or to build in a 'hook,' but it seems like making an RSS feed into something interactive that requires caretaking would be contrary to the whole point of an RSS feed in the first place.
This same idea is the last one I added to my project-ideas document a week ago; time to delete it :). Jeremy, I guess you've already noticed this, but tldr.com could expire in November if it's not renewed.
What about accepting complete urls as parameter? So that you just need to put tldr.it/ in front of an [URL] to obtain a summary?
Did you show this to reddit yet?
One thing I was thinking about was how to monetize something like this; sadly, the only options seem to be ads or Amazon/etc. links related to the content...
And yes, monetization is definitely on my mind. I have a few options. The first is, as you mentioned, Amazon stuff for the arbitrary URLs. Then there's injecting ads into the summarized RSS feeds. There's also adding users and letting them have enhanced functionality for $x a month (i.e., update their RSS feeds on a regular basis, cutting free users back to once a day, make the bookmarklet a premium feature, and so on).
I really think this could be a good product; just not sure if I'll have the time to really develop it.
>What about accepting complete urls as parameter? So that you just need to put tldr.it/ in front of an [URL] to obtain a summary?
I agree with drtse4 on that point. Not as many people know about bookmarklets as we like to believe; at least that's what I found out among my friends. And frankly, this time last year I didn't know what they were either.
Secondly, the domain is quite short; telling people to add tldr.it in front of any URL to get a summary is simple and could have better word-of-mouth advantages than, say, teaching them to use a bookmarklet or to come back to the homepage and paste it into the text box.
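Server-side, the prefix scheme could look something like this sketch. The prefix idea is drtse4's suggestion; the function name and scheme-defaulting behavior (which would also fix the "must type http://" complaint above) are my assumptions.

```python
from urllib.parse import unquote

def target_from_path(path):
    """Given a request path like '/http://example.com/story',
    recover the URL to summarize; default the scheme if it's missing."""
    target = unquote(path.lstrip("/"))
    if not target:
        return None
    if not target.startswith(("http://", "https://")):
        target = "http://" + target
    return target
```

So tldr.it/news.ycombinator.com and tldr.it/http://news.ycombinator.com would both resolve to the same target.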
Excellent! I'm glad my best-effort extractor seems to work OK.
I have a special case extractor for things like the NYTimes (where their markup makes it really easy to pull things out) and 1-2 other sites. I want to add special cases for more of the popular sites on the 'net. But seems like not a ton of sites really need it unless their markup sucks (I'm looking at you BBC and CNN).
Cool project! I was working on something similar recently, and found http://greatsummary.com/ to be far superior to OTS. It won't scale very well, since you have to do an RPC every time, but maybe you can convince them to give you their code.
I have a lot of semantic analysis and NLP stuff I'd like to do with this project. OTS is largely a crutch to get the project out the door until I can get something more scalable in place (if I ever get the chance!). :)
Hi everybody, Topicmarks is indeed slower in processing, but that's because it does far deeper semantic processing behind the scenes than our peers tldr.it, goodsummary.com, summarity.com and others.
We think this results in a much more relevant summary - but we'd love to hear about how our peers compare to us for your particular summarization needs.
Also, we're starting to open up our API for select developers if you're interested in integrating summarization, fact extraction etc into your own application. Tweet @topicmarks if you want to be involved.
Sincere question from someone who is working on his first projects now: why did you publish without enough infrastructure to support even the relatively meager "HN effect" (see also: Digg effect, Reddit effect)? Not criticizing your decision whatsoever, just trying to see the thought process of someone who is further along than me.