Tldr.it - Summarizer for RSS feeds and other web pages (built in 48 hours) (tldr.it)
85 points by jeremymcanally 2589 days ago | 43 comments

Perhaps a minor issue, but a bit of a turn-off for me: you have to include "http://" in your URL. Otherwise you get some snarky comment like "You'll need to provide me with a valid url, matey", which is even more aggravating given that the address was a correct one. I suggest you allow people to omit it: it's more to type, and I'm guessing most people are used to just bunging an address into their browser without protocol qualification.

if (!url.startsWith("http://")) { url = "http://" + url; }
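A slightly fuller version of that idea, as a rough Python sketch rather than anything tldr.it actually runs. The function name `normalize_url` and the choice to accept only http and https are assumptions made for illustration:

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def normalize_url(raw: str, default_scheme: str = "http") -> str:
    """Prepend a scheme when the user typed a bare address.

    Hypothetical helper: the name and the http/https-only rule are
    assumptions, not the site's real validation logic.
    """
    candidate = raw.strip()
    if not _SCHEME.match(candidate):
        candidate = f"{default_scheme}://{candidate}"
    parts = urlsplit(candidate)
    # Reject anything that still has no host or uses another scheme.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a usable web address: {raw!r}")
    return candidate
```

Checking for any scheme first, instead of only "http://", avoids mangling addresses that already start with "https://".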

Ah yes. The joys of cargo-culting a URL validation regex. I can't fix it now (per contest rules), but I'll be sure to fix that ASAP.

Interesting, I've been working on a similar idea but different implementation. Out of curiosity, how do you select the summary content for a URL? That can be a problem.

For example, I tested tldr.it on a recent HN submission, Programmers: How to Make the Systems Guys Love You (http://news.ycombinator.com/item?id=1800839), and it seems to select the first paragraph or so of the article (depending on whether you select the short, medium, or long tldr).

However, so many authors these days start out with a long, rambling preamble, instead of the venerable Who/What/When/Where/(Why|How), that the first few paragraphs don't necessarily provide a real tldr-type summary.

Which is the case with that particular article - the first few paragraphs don't help me decide whether to read it or not, and the entire article is quite long, with some good nuggets further down in it.

I extract it using a pretty clever algorithm then run it through a few things to summarize it. The extraction is nowhere near perfect; it performs best on the major news sites thus far. I didn't really have time to polish it as much as I'd like, but it seems to work well especially on FOXNews and NYTimes (and Blogspot articles).

Do you mind summarizing a bit how you generate snippets?

We were discussing this on MetaOptimize recently (http://metaoptimize.com/qa/questions/2815/how-are-search-eng...), but I'm curious to hear about alternate approaches.

I built a summariser for classifier4j (http://classifier4j.cvs.sourceforge.net/viewvc/classifier4j/...).

It's 5 years old now, but the summaries it generates are competitive quality-wise with most things out there (eg, the MS Word summarizer). Unfortunately I don't have an online demo working atm (like I said - 5 years old)

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords etc

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.

Like most things, it's surprising how well a simple algorithm like that works.
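The four steps above can be sketched roughly as follows in Python. This is an illustration of the description, not the classifier4j code: stemming is left out, the stopword list is a tiny hand-picked assumption, sentence splitting is a naive regex, and the length limit is counted in sentences.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not classifier4j's list).
STOPWORDS = frozenset(
    "a an and are as at be but by for from has have in is it its "
    "of on or that the this to was were with".split()
)


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (script/style text is not filtered)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def content_words(text):
    # Step 1 (partial): lowercase and drop stopwords; no stemming here.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(html, max_sentences=3):
    text = strip_html(html)
    # Step 3: naive split on sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    # Step 2: unique words, most frequent first.
    ranked = [word for word, _ in Counter(content_words(text)).most_common()]
    # Step 4: for each word in turn, keep the first sentence mentioning it.
    chosen = set()
    for word in ranked:
        if len(chosen) >= max_sentences:
            break
        for index, sentence in enumerate(sentences):
            if word in content_words(sentence):
                chosen.add(index)
                break
    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

For example, `summarize("<p>Cats sleep a lot. Dogs bark at cats. Birds sing. Cats chase birds.</p>", 2)` keeps the first sentence mentioning "cats" and the first mentioning "birds".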

There are ports for C#, and from a quick Google just now, apparently someone has done a Python port too.

The first couple of articles I tried suffered from complete content extraction failure.

The third one seemed to work pretty well, but there were lots of spurious newlines in the output which made it really hard to read.

Nice idea but needs another 48 hours of polish :-)

PS I tried both those sfgate and Bell Systems pages through viewtext.org, and that got the content for both just fine.

Maybe you could pipe requests through their API.

Cool, thanks for posting, will definitely keep tabs. Good luck with it. Will post mine when I've got a working site.

Do you want to talk about the summarisation algorithm at all? I wrote a little blog post about a trivial extractive summarisation system a while ago ( http://honnibal.wordpress.com/ ), and there's a long literature on summarisation in NLP. A lot of the techniques are a bit computationally costly and complicated to be practical, though. Meanwhile abstractive summarisation still hasn't properly gotten off the ground.

(I wrote the classifier4j summariser, as outlined here: http://news.ycombinator.com/item?id=1803020)

In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.

In my testing, that removed ~90% of navigation text (although I note you are only looking in <p> tags. I had a flag for that too, but found it was unnecessary most of the time).
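A rough Python sketch of those two flags, for anyone who wants to try the idea. It uses the standard library's `html.parser`; the class name, function name, and `only_in_p` parameter are made up for illustration and are not from classifier4j:

```python
from html.parser import HTMLParser


class AfterFirstParagraph(HTMLParser):
    """Ignore all text until the first <p>; optionally keep only text inside <p>."""

    def __init__(self, only_in_p=False):
        super().__init__()
        self.only_in_p = only_in_p  # the second, usually unnecessary flag
        self.seen_p = False         # flips once the first <p> is found
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen_p = True
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if not self.seen_p:
            return  # still in the header/navigation region
        if self.only_in_p and not self.in_p:
            return
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html, only_in_p=False):
    parser = AfterFirstParagraph(only_in_p=only_in_p)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

One caveat: pages that omit the closing `</p>` tag will leave `in_p` set, so the stricter flag then behaves like the looser one.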

Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
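To give a feel for how far a regex gets you, here is a minimal Python sketch. The pattern and the abbreviation list are assumptions chosen for the example, not what classifier4j uses:

```python
import re

# Small illustrative list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "vs."}

# Split after ., ! or ? when followed by whitespace and a capital or quote.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    sentences = []
    carry = ""
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        # If the piece ends in a known abbreviation, glue it to the next one.
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences
```

It will still trip over initials, decimals followed by capitals, and unlisted abbreviations, which is about what "weren't terrible" suggests.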

Well unfortunately I'm not doing much on my end at this point. I do a few small things and then let libots do most of the work. One of my last iteration items was to put some of my own summarization work into it and let libots be less of a player, but obviously with 48 hours, I had to prioritize.

I invested most of my cleverness in actually getting the content out of the page since that's really where the money is for an MVP for this; no content == no summary. :)

Yeah, that problem is a real pain. As I mentioned in my post it's the bit I'm not happy with. I wonder how the readability tool does it; that seems to do a very good job.

It seems that OTS uses a word frequency strategy, so the algorithm is similar or identical to the one I demoed. Interesting.

Their JS is out there if you grab it from the Bookmarklet. As in, it is not minified.

I have gone through it carefully, and it is clever.

OTS is definitely word freq based.

I'm using an algorithm very similar to what they do with a few clever additions of my own. I started out with something almost identical, but they had a few twists that made it even better, which I then in turn improved on (and HTML5-ified :)).

Why don't you open source your algorithm so more folks can work on it with you? I've been futzing with Readability JS converted to PHP (but could port to Ruby, Python) and it would be great to collab and share test files, etc.

Sure I might consider that at some point! Yet another OSS project for me to maintain though... :P

I'd be interested in working on this project --- it's a problem I've come across quite a bit. There's even an academic contest for it, called CLEANEVAL, although the way they set up the problem was arguably not quite right.

Let me know if you want some help, this is an area I'm interested in.

Overwhelmed with traffic?

Would really like to see this. :-) If it has an API I'd love to tie something like this to http://tldrd.com to auto-generate summaries of various texts.

I have a half-API in there, but I certainly want to build it out a little more. I planned on having it done by the time the competition was over but then I decided to do crap like "sleep" and "eat". :P

It looks like you might be using libots -> libots.sourceforge.net

I'm watching this thread with interest because I've been doing some work on building a good summarization engine for a few months now, and it is getting pretty close to ready for prime-time -> summarity.com

Are there any alternatives that actually let humans do this? Perhaps a site that employs some game mechanics? First to summarize, +10 points. Correction, +5 points. Request to summarize (bounties), +n points.

I'm all for adding game mechanics to enhance user participation or to build in a 'hook,' but it seems like making an RSS feed something interactive that requires caretaking would be contrary to the whole point of an RSS feed in the first place.

Ah, I meant let humans summarize articles instead of relying on libots.

This same idea is the last one I added a week ago to my project ideas document, time to delete it :). Jeremy, I guess you've already noticed this, but tldr.com could expire in November if it's not renewed.

What about accepting complete urls as parameter? So that you just need to put tldr.it/ in front of an [URL] to obtain a summary?

Did you show this to reddit yet?

One thing I was thinking about was how to monetize something like this. Sadly, the only option seems to be ads or Amazon/etc. links related to the content...

You can do http://tldr.it/summarize/?summary[url]=(your url). That's how the bookmarklet works. :)
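For scripting against that endpoint, the target address needs percent-encoding when it goes into the query string. A small Python sketch, assuming only the URL pattern quoted above; both function names are hypothetical, and the second one illustrates the "put tldr.it/ in front" suggestion, which the site may not support:

```python
from urllib.parse import urlencode


def summarize_link(target, base="http://tldr.it/summarize/"):
    """Build the bookmarklet-style link for a target page (hypothetical helper)."""
    return f"{base}?{urlencode({'summary[url]': target})}"


def target_from_prefix_path(path):
    """Recover the target from a path like '/http://example.com/x'.

    Illustrates the proposed prefix scheme only; not a documented route.
    """
    target = path.lstrip("/")
    if not target:
        raise ValueError("no target address in path")
    return target
```

Usage: `summarize_link("http://example.com/a?b=1")` encodes both the brackets in the parameter name and the reserved characters in the target, so the inner query string survives the round trip.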

And yes, monetization is definitely on my mind. I have a few options. The first is, as you mentioned, Amazon stuff for the arbitrary URLs. Then there's also injecting ads into the summarized RSS feeds. Then there's also adding users and letting them have enhanced functionality for $x a month (i.e., update their RSS feeds on a regular basis, cutting free users back to once a day, make the bookmarklet a premium feature, and so on).

I really think this could be a good product; just not sure if I'll have the time to really develop it.

>What about accepting complete urls as parameter? So that you just need to put tldr.it/ in front of an [URL] to obtain a summary?

I agree with drtse4 on that point. Not as many people know about bookmarklets as we tend to believe, at least that's what I found among my friends. And frankly, this time last year I didn't know what they were either.

Secondly, the domain is quite short; telling people to add tldr.it in front of any URL to get a summary is simple and could spread by word of mouth better than, say, teaching them to use a bookmarklet or to come back to the homepage and paste the address into the text box.

Nicely done. I tried it on one of my blog posts and the results are spot on. http://tldr.it/summaries/247

Excellent! I'm glad my best-effort extractor seems to work OK.

I have a special case extractor for things like the NYTimes (where their markup makes it really easy to pull things out) and 1-2 other sites. I want to add special cases for more of the popular sites on the 'net. But seems like not a ton of sites really need it unless their markup sucks (I'm looking at you BBC and CNN).

Cool project! I was working on something similar recently, and found http://greatsummary.com/ to be far superior to OTS. It won't scale very well, since you have to do an RPC every time, but maybe you can convince them to give you their code.

I have a lot of semantic analysis and NLP stuff I'd like to do with this project. OTS is largely a crutch to get the project out the door until I can get something more scalable in place (if I ever get the chance!). :)

Interesting. A while ago I was using a similar tool: http://topicmarks.com

In short, you upload a text and it extracts facts, a summary, keywords, an index and so on, with the most important info.

In most cases it was pretty cool, but sometimes it was slow in processing.

Hi everybody, Topicmarks is indeed slower in processing, but that's because it does far deeper semantic processing behind the scenes than our peers tldr.it, goodsummary.com, summarity.com and others.

We think this results in a much more relevant summary - but we'd love to hear about how our peers compare to us for your particular summarization needs.

Also, we're starting to open up our API for select developers if you're interested in integrating summarization, fact extraction etc into your own application. Tweet @topicmarks if you want to be involved.

Roland (CEO Topicmarks)

Sorry about it being slow. Working on getting more resources on the box.

Sincere question from someone who is working on his first projects now: why did you publish without infrastructure enough to support even the relatively meager "HN effect" (see also: Digg effect, Reddit effect)? Not criticizing you for your decision whatsoever, just trying to see the thought process from someone who is further along than me.

It's one of the constraints of the competition unfortunately. They give us a "stock" Linode box that we can't upgrade (didn't know this ahead of time or I would've spent time on caching etc.).

You're probably aware, but I'm getting "You'll need to be giving me a valid URL, matey." using the URLs provided.

If you're on a non-Webkit browser, it'll "work." They're really just meant as examples/placeholders that you can follow rather than things you can just click through.

Same here

I tried to summarise 3-4 stories from the HN homepage, but none of them worked.

Change the URL to see what others have summarized.

odd content...

techcrunched to death
