

Tldr.it - Summarizer for RSS feeds and other web pages (built in 48 hours) - jeremymcanally
http://tldr.it

======
retube
Perhaps a minor issue, but a bit of a turn off for me: you have to include
"http://" in your URL. Otherwise you get some snarky comment like "You'll
need to provide me with a valid url, matey", which is even more aggravating
given that the address was a correct one. I suggest you allow people to omit
this - it's more to type and I'm guessing most people are used to just bunging
an address into their browser without protocol qualification.

if (!url.startsWith("http://")) { url = "http://" + url; }


~~~
jeremymcanally
Ah yes. The joys of cargo-culting a URL validation regex. I can't fix it now
(per contest rules), but I'll be sure to fix it ASAP.

------
SkyMarshal
Interesting, I've been working on a similar idea but different implementation.
Out of curiosity, how do you select the summary content for a URL? That can be
a problem.

For example, I tested tldr.it on a recent HN submission, _Programmers: How to
Make the Systems Guys Love You_
(<http://news.ycombinator.com/item?id=1800839>), and it seems to select the
first paragraph or so of the article (depending on whether you select the
short, medium, or long tldr).

However, so many authors these days start out with a long, rambling preamble,
instead of the venerable Who/What/When/Where/(Why|How), that the first few
paragraphs don't necessarily provide a real tldr-type summary.

Which is the case with that particular article - the first few paragraphs
don't help me decide whether to read it or not, and the entire article is
quite long, with some good nuggets further down in it.

~~~
jeremymcanally
I extract it using a pretty clever algorithm then run it through a few things
to summarize it. The extraction is nowhere near perfect; it performs best on
the major news sites thus far. I didn't really have time to polish it as much
as I'd like, but it seems to work well especially on FOXNews and NYTimes (and
Blogspot articles).

~~~
bravura
Do you mind summarizing a bit how you generate snippets?

We were discussing this on MetaOptimize recently
([http://metaoptimize.com/qa/questions/2815/how-are-search-
eng...](http://metaoptimize.com/qa/questions/2815/how-are-search-engine-
snippets-generated)), but I'm curious to hear about alternate approaches.

~~~
nl
I built a summariser for classifier4j
([http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/...](http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/newbuild/core/src/java/net/sf/classifier4J/summariser/SimpleSummariser.java?revision=1.6&view=markup)).

It's 5 years old now, but the summaries it generates are competitive quality-
wise with most things out there (e.g., the MS Word summarizer). Unfortunately I
don't have an online demo working atm (like I said - 5 years old).

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords etc

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until
the summary is the maximum length requested.

Like most things, it's surprising how well a simple algorithm like that works.
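From memory too, the four steps read roughly like this in Python (a sketch of the idea, not the actual classifier4j code; stemming and a real stopword list are omitted, and the word check is a plain substring match to keep it short):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "is", "are", "it", "that", "for", "with"}

def summarize(text, max_sentences=3):
    # 1) Normalize and drop stopwords (stemming omitted for brevity)
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    # 2) Rank unique words by how often they occur in the text
    ranked = [w for w, _ in Counter(words).most_common()]
    # 3) Split the original text on sentence boundaries
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # 4) Take the first sentence mentioning each next-most-popular word,
    #    until the summary reaches the requested length
    chosen = []
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        for s in sentences:
            if word in s.lower() and s not in chosen:
                chosen.append(s)
                break
    # Emit the picked sentences in their original order
    chosen.sort(key=sentences.index)
    return " ".join(chosen)
```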

There are ports for C#, and from a quick Google just now, apparently someone
has done a Python port too.

------
syllogism
Do you want to talk about the summarisation algorithm at all? I wrote a little
blog post about a trivial extractive summarisation system a while ago (
<http://honnibal.wordpress.com/> ), and there's a long literature on
summarisation in NLP. A lot of the techniques are a bit too computationally
costly and complicated to be practical, though. Meanwhile abstractive summarisation
still hasn't properly gotten off the ground.

~~~
jeremymcanally
Well unfortunately I'm not doing much on my end at this point. I do a few
small things and then let libots do most of the work. One of my last iteration
items was to put some of my own summarization work into it and let libots be
less of a player, but obviously with 48 hours, I had to prioritize.

I invested most of my cleverness in actually getting the content out of the
page since that's really where the money is for an MVP for this; no content ==
no summary. :)

~~~
syllogism
Yeah, that problem is a real pain. As I mentioned in my post it's the bit I'm
not happy with. I wonder how the readability tool does it; that seems to do a
very good job.

It seems that OTS uses a word frequency strategy, so the algorithm is similar
or identical to the one I demoed. Interesting.

~~~
riffer
Their JS is out there if you grab it from the Bookmarklet. As in, it is not
minified.

I have gone through it carefully, and it is clever.

OTS is definitely word freq based.

~~~
jeremymcanally
I'm using an algorithm very similar to what they do with a few clever
additions of my own. I started out with something almost identical, but they
had a few twists that made it even better, which I then in turn improved on
(and HTML5-ified :)).

~~~
dotBen
Why don't you open source your algorithm so more folks can work on it with
you? I've been futzing with Readability's JS converted to PHP (but could port
it to Ruby or Python) and it would be great to collab and share test files, etc.

~~~
jeremymcanally
Sure I might consider that at some point! Yet another OSS project for me to
maintain though... :P

~~~
syllogism
I'd be interested in working on this project --- it's a problem I've come
across quite a bit. There's even an academic contest for it, called CLEANEVAL,
although the way they set up the problem was arguably not quite right.

------
philcrissman
Overwhelmed with traffic?

Would really like to see this. :-) If it has an API I'd love to tie something
like this to <http://tldrd.com> to auto-generate summaries of various texts.

~~~
jeremymcanally
I have a half-API in there, but I certainly want to build it out a little
more. I planned on having it done by the time the competition was over but
then I decided to do crap like "sleep" and "eat". :P

~~~
riffer
It looks like you might be using libots -> libots.sourceforge.net

I'm watching this thread with interest because I've been doing some work on
building a good summarization engine for a few months now, and it is getting
pretty close to ready for prime-time -> summarity.com

------
buddydvd
Are there any alternatives that actually let humans do this? Perhaps a site
that employs some game mechanics? First to summarize, +10 points. Correction,
+5 points. Request to summarize (bounties), +n points.

~~~
mattdeboard
I'm all for adding game mechanics to enhance user participation or to build in
a 'hook,' but it seems like making an RSS feed something interactive that
demands caretaking would be contrary to the whole point of an RSS feed in the
first place.

~~~
buddydvd
Ah, I meant let humans summarize articles instead of relying on libots.

------
drtse4
This same idea is the last one I added a week ago to my project ideas
document; time to delete it :). Jeremy, I guess you've already noticed this,
but tldr.com could expire in November if it's not renewed.

What about accepting complete URLs as a parameter? So that you just need to
put tldr.it/ in front of a URL to obtain a summary?

Did you show this to reddit yet?

One thing i was thinking about was how to monetize something like this, sadly
the only option seems to be ads or amazon/etc related to the content...

~~~
jeremymcanally
You can do http://tldr.it/summarize/?summary[url]=(your url). That's how the
bookmarklet works. :)
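In code, building that call looks something like this (just illustrative; the summary[url] parameter name is taken from the line above, and nothing else about the endpoint is guaranteed):

```python
from urllib.parse import urlencode

def summarize_url(target):
    """Build the tldr.it summarize URL for a target page.

    Based on the endpoint shape mentioned above; treat it as
    illustrative rather than a documented API.
    """
    base = "http://tldr.it/summarize/"
    # urlencode percent-escapes the brackets and the target URL
    return base + "?" + urlencode({"summary[url]": target})
```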

And yes monetization is definitely on my mind. I have a few options. The first
is as you mentioned Amazon stuff for the arbitrary URL's. Then there's also
injecting ads into the summarized RSS feeds. Then there's also adding users
and letting them have enhanced functionality for $x a month (i.e., update
their RSS feeds on a regular basis, cutting free users back to once a day,
make the bookmarklet a premium feature, and so on).

I really think this could be a good product; just not sure if I'll have the
time to really develop it.

~~~
hackerbob
>What about accepting complete urls as parameter? So that you just need to put
tldr.it/ in front of an [URL] to obtain a summary?

I agree with drtse4 on that point. Not as many people know about bookmarklets
as we like to believe; at least that's what I found out among my friends. And
frankly, this time last year I didn't know what they were either.

Secondly, the domain is quite short; telling people to add tldr.it in front of
any URL to get a summary is simple and could have better word-of-mouth
advantages than, say, teaching them to use a bookmarklet or to come back to
the homepage and paste the URL into the text box.

------
abraham
Nicely done. I tried it on one of my blog posts and the results are spot on.
<http://tldr.it/summaries/247>

~~~
jeremymcanally
Excellent! I'm glad my best-effort extractor seems to work OK.

I have a special case extractor for things like the NYTimes (where their
markup makes it really easy to pull things out) and 1-2 other sites. I want to
add special cases for more of the popular sites on the 'net. But it seems like
not a ton of sites really need it unless their markup sucks (I'm looking at
you, BBC and CNN).

------
jerome666
Interesting. A while ago I was using a similar tool: <http://topicmarks.com>

In short, you upload a text and it extracts facts, a summary, keywords, an
index and so on, with the most important info.

In most cases it's pretty cool, but sometimes it was slow in processing.

~~~
RSieb
Hi everybody, Topicmarks is indeed slower in processing, but that's because it
does far deeper semantic processing behind the scenes than our peers tldr.it,
goodsummary.com, summarity.com and others.

We think this results in a much more relevant summary - but we'd love to hear
about how our peers compare to us for your particular summarization needs.

Also, we're starting to open up our API for select developers if you're
interested in integrating summarization, fact extraction etc into your own
application. Tweet @topicmarks if you want to be involved.

Roland (CEO Topicmarks)

------
abhijitr
Cool project! I was working on something similar recently, and found
<http://greatsummary.com/> to be far superior to OTS. It won't scale very
well, since you have to do an RPC every time, but maybe you can convince them
to give you their code.

~~~
jeremymcanally
I have a lot of semantic analysis and NLP stuff I'd like to do with this
project. OTS is largely a crutch to get the project out the door until I can
get something more scalable in place (if I ever get the chance!). :)

------
jeremymcanally
Sorry about it being slow. Working on getting more resources on the box.

~~~
mattdeboard
Sincere question from someone who is working on his first projects now: why
did you publish without infrastructure enough to support even the relatively
meager "HN effect" (see also: Digg effect, Reddit effect)? Not criticizing you
for your decision whatsoever, just trying to see the thought process from
someone who is further along than me.

~~~
jeremymcanally
It's one of the constraints of the competition unfortunately. They give us a
"stock" Linode box that we can't upgrade (didn't know this ahead of time or I
would've spent time on caching etc.).

------
cing
You're probably aware, but I'm getting "You'll need to be giving me a valid
URL, matey." using the URLs provided.

~~~
jeremymcanally
If you're on a non-Webkit browser, it'll "work." They're really just meant as
examples/placeholders that you can follow rather than things you can just
click through.

------
revorad
I tried to summarise 3-4 stories from the HN homepage, but none of them
worked.

------
jpwagner
change the url to see what others have summarized.

odd content...

------
andre
techcrunched to death

