
Show HN: TL;DRizer - an algorithmic summarizer webapp/api in java (weekend hack) - mohaps
http://tldrzr.herokuapp.com
======
HunterV
Lorem Ipsum:

From:

Etiam tincidunt dolor at est sagittis a rhoncus turpis egestas. Integer
elementum erat nec nisi molestie eu tempus magna feugiat. Mauris eu ligula et
ligula vulputate tempor. Etiam vel lectus et mi vulputate rutrum. Cras libero
ipsum, rhoncus at accumsan id, adipiscing iaculis turpis. Cras vel metus nec
enim consectetur aliquet vel nec nunc. Proin at mauris purus. Nullam nulla
dui, interdum nec pharetra sit amet, vulputate a lectus. Nunc vulputate
pellentesque purus at euismod. Nam in justo quis ante porttitor pellentesque.
Quisque quis purus a magna scelerisque egestas quis id sapien. Ut non felis
sit amet ipsum sodales placerat. Proin nibh massa, sollicitudin et posuere a,
placerat convallis magna. Duis lacinia mauris sit amet ante pharetra sed
bibendum lorem euismod.

To:

Mauris eu ligula et ligula vulputate tempor. Etiam vel lectus et mi vulputate
rutrum. Cras libero ipsum, rhoncus at accumsan id, adipiscing iaculis turpis.
Nullam nulla dui, interdum nec pharetra sit amet, vulputate a lectus. Duis
lacinia mauris sit amet ante pharetra sed bibendum lorem euismod.

So much faster to read; I never had the time to read through all those design
mockups!

~~~
mohaps
well, it ain't called TL;DRizer for nothing! :P

~~~
HunterV
Seriously though, props!

~~~
mohaps
thanks

------
MojoJolo
Hello. I recently finished my thesis for my MS CS degree. My thesis is about
automatic summarization; it went through research and a defense, and I think
its results are good enough. It uses a statistical approach and machine
learning. My main issue with it is not the summarization part but the text
extraction part: I can't seem to extract the article from a web page well
enough. I'm using boilerpipe (<https://code.google.com/p/boilerpipe/>) for it.
It can do most tricks, but it's not good enough for me. May I ask how you
extract the main article from the page?

Here's a preview of mine
([http://www.textteaser.com/ui/article?link=http%3A%2F%2Fwww.p...](http://www.textteaser.com/ui/article?link=http%3A%2F%2Fwww.philstar.com%2Fheadlines%2F2013%2F03%2F16%2F920206%2Funemployment-rate-unchanged)).
Go to its home page to read more news. It caters to Philippine news and will
soon enter its alpha stage. I'm planning to either open up the API or
open-source it. HN, which is better? The API is ready; registration is the
only thing it lacks.

You can try the API here:
[http://api.textteaser.com/api/?url=http://www.theverge.com/2...](http://api.textteaser.com/api/?url=http://www.theverge.com/2013/4/9/4178156/european-tech-startups-report)

Just replace the url parameter with the URL of what you want to summarize.
Some URLs are not tested yet, and may produce errors. :)

~~~
midko
Hi. I study CS with an inclination towards ML, but I don't know anything about
the topic of automated summarization. I'm curious: since you've taken an ML
approach, did you still need to rely on NLP, and if so, was this very
problematic? Also, do you perhaps know an article or a paper that could serve
as a good starting point/overview of what approaches to summarization there
are and what the current difficulties are? Thanks

~~~
MojoJolo
Hi, I recommend the following research:
<http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf>
<http://www.aclweb.org/anthology-new/W/W03/W03-1204.pdf>

I still have other papers, but those can be a good starting point.

In my thesis, NLP is done via a statistical approach. It learns from its
previous summaries, so it does have some learning built in. I don't see any
problems combining NLP with machine learning. Can you elaborate on this?

~~~
midko
Thanks, these look like exactly what I was looking for.

Re NLP, I meant to say NLP from a sentiment analysis perspective (not sure if
that's the right way to put it). So my question was whether you had to extract
the meaning of one or a few related sentences, or only process the text in a
statistical manner (which you answered).

------
logn
Generated 5 Sentence Summary for [http://www.businessinsider.com/why-marissa-mayer-bought-a-30...](http://www.businessinsider.com/why-marissa-mayer-bought-a-30m-startup-2013-4)

Back in March, Yahoo bought a startup called Summly for $30 million. Before
Yahoo shut it down, Summly was a news aggregation app for smartphones.
According to Summly's own Web site, the technology behind the app was "built"
by an organization called "SRI International," not by the startup's employees.
And indeed, inside Yahoo, Summly is called "Yahoo's Siri." A source close to
Yahoo says that CEO Marissa Mayer believes summarization technology is "going
to be huge for Yahoo" as it builds "personalized news feeds" into mobile
versions of its "core experiences," including Yahoo Finance and Yahoo Sports.
The job of implementing this technology at Yahoo will not be given to anyone
from Summly, including its young CEO.

--

Edit: Adding this...

Generated 3 Sentence Summary of Gettysburg Address

Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all
men are created equal. We have come to dedicate a portion of that field, as a
final resting place for those who here gave their lives that that nation might
live. The brave men, living and dead, who struggled here, have consecrated it,
far above our poor power to add or detract.

~~~
sherjilozair
This misses the most important line in the article: Acquiring Summly seems to
have been an almost incidental side effect of a deal Yahoo made with SRI for a
piece of "summarization technology".

~~~
mohaps
the algo is still kinda "dumb". It basically grabs the top N keywords (the
most frequent non-stopwords), stems them, and goes through the sentences
looking for which of them contain the keywords, up to the max summary length.
I'll keep whittling at it over nights/weekends to see if I can make it more
"semantically aware"

edit: some work is needed on the tokenization too. Currently, I don't
preserve non-period punctuation.
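For concreteness, a minimal self-contained sketch of that keyword-frequency
approach might look like the following. The class name, stopword list, and toy
suffix-stripping "stemmer" are all illustrative stand-ins, not the actual
TL;DRzr code:

```java
import java.util.*;
import java.util.stream.Collectors;

public class NaiveSummarizer {
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "is", "are", "of", "to", "and", "in", "it",
            "that", "this", "for", "on", "with", "as", "was", "at", "from"));

    // Crude stand-in for a real stemmer such as Porter's.
    private static String stem(String word) {
        return word.replaceAll("(ing|ed|es|s)$", "");
    }

    public static String summarize(String text, int maxSentences, int topKeywords) {
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // 1. Count stemmed, non-stopword term frequencies across the document.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("[^a-z]+"))
                if (!w.isEmpty() && !STOPWORDS.contains(w))
                    freq.merge(stem(w), 1, Integer::sum);

        // 2. Keep the top-N most frequent stems as the "keywords".
        Set<String> keywords = freq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(topKeywords)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

        // 3. Score each sentence by how many keyword hits it contains.
        int[] scores = new int[sentences.length];
        for (int i = 0; i < sentences.length; i++)
            for (String w : sentences[i].toLowerCase().split("[^a-z]+"))
                if (keywords.contains(stem(w)))
                    scores[i]++;

        // 4. Take the best sentences (up to the max summary length),
        //    emitting them in original document order.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < sentences.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> scores[y] - scores[x]);
        Set<Integer> chosen = new TreeSet<>(Arrays.asList(order)
                .subList(0, Math.min(maxSentences, sentences.length)));

        StringBuilder out = new StringBuilder();
        for (int i : chosen) {
            if (out.length() > 0) out.append(' ');
            out.append(sentences[i]);
        }
        return out.toString();
    }
}
```

A real implementation would at least swap in a proper stemmer and a smarter
sentence splitter than the naive regexes used here.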

~~~
romain_g
You might want to look into topic modeling to extract the most meaningful
topics (sets of co-occurring words). It could greatly improve your results
compared to the keyword approach. Interesting initiative, keep us posted!

------
pokoleo
I really like this.

Here's an example of one of PG's essays run through the algorithm:
[<http://paulgraham.com/startupideas.html>]

The most important thing to understand about paths out of the initial idea is
the meta-fact that these are hard to see. Empirically, the way to have good
startup ideas is to become the sort of person who has them. If you know a lot
about programming and you start learning about some other field, you'll
probably see problems that software could solve. Some of the most valuable new
ideas take root first among people in their teens and early twenties. So if
you're a young founder (under 23 say), are there things you and your friends
would like to do that current technology won't let you. But there may still be
money to be made from something like journalism. Similarly, since the most
successful startups generally ride some wave bigger than themselves, it could
be a good trick to look for waves and ask how one could benefit from them.

If you ran examples of PG's essays through this, people would see the
immediate benefit.

~~~
Charlesmigli
I don't think any algorithm can perform such a summarizing task. If you're
looking for summaries of PG's essays, see
<http://tldr.io/discover/paulgraham.com>.

------
mohaps
haha! :) Yeah, this was kinda fueled by the news of the Summly acquisition and
too many Red Bulls drunk during the drive from LA to SFO after WonderCon.

------
shakeel_mohamed
YES. As a college student, this is amazing for those long readings for classes
one isn't interested in. I wanted to build sort of the reverse of this at one
point (take a question/prompt as input, generate a response).

Are you planning on open sourcing this?

~~~
mohaps
yeah, I plan on open sourcing this. waiting on some technicalities.

~~~
shakeel_mohamed
Great, please post about it or contact me when you do

------
micheleg
Sorta cool. The Yahoo purchase of Summly is not much more than a PR play; the
technology wasn't/isn't there. And while this "weekend hack" is neat, the
quality of the summaries isn't close to that of the TLDR plug-in
(<http://www.tldrstuff.com/#desktop>). Not only does the Stremor plug-in get
"what is important with the article", it is simply on all of my browsers and
works FAST. Fun discussion though, and props to little Nick.

------
cpio
I got an IOException while trying to summarize [http://matt-welsh.blogspot.com/2013/04/running-software-team...](http://matt-welsh.blogspot.com/2013/04/running-software-team-at-google.html)
and a different one for <http://googleblog.blogspot.com>. I guess you should
put more effort into your HTML parser. Try Apache Tika, perhaps.

~~~
mohaps
try this url: <http://matt-welsh.blogspot.com/feeds/posts/default?alt=rss> it
works. same feed url pattern will work for google blog too
<http://googleblog.blogspot.com/feeds/posts/default?alt=rss>

------
mohaps
Now the url can be any page; I try to extract the article text using
boilerpipe. :) Also added a simple GET endpoint for linking. Try this summary
of PG's "Writing and Speaking" essay:
[http://tldrzr.herokuapp.com/tldr/?feed_url=http://www.paulgr...](http://tldrzr.herokuapp.com/tldr/?feed_url=http://www.paulgraham.com/speak.html)

------
rodrigoavie
So, when is Yahoo! buying it? How much is the deal?

~~~
Trezoid
Yahoo will buy something that USES this tech, and outsources the actual
building as well...

------
dpcx
It seems to not properly handle embedded HTML; using my feed
(<http://www.dp.cx/blog/rss.xml>), look at the story titled "The Difficulty of
Parsing the Web" and notice the <select /> box that is rendered.

------
drakaal
Not bad for a weekend, but <http://www.tldrstuff.com> does a much better job.
Especially in cases where a sentence shouldn't break on every ".", like where
J. R. R. Tolkien is concerned.

And the TLDR Plugin works with HTML and on all Western languages.
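The abbreviation case mentioned above is a classic sentence-tokenization
pitfall. A naive period split shreds initials; even a small lookbehind
heuristic (still far from a real sentence detector like OpenNLP's) does
noticeably better:

```java
public class SentenceSplitDemo {
    public static void main(String[] args) {
        String text = "J. R. R. Tolkien wrote it. It is long.";

        // Naive split on ". " shreds the initials into fragments.
        String[] naive = text.split("\\.\\s+");

        // Heuristic: don't split right after a single capital letter,
        // which usually marks an initial rather than a sentence end.
        String[] better = text.split("(?<![A-Z])\\.\\s+");

        System.out.println(naive.length);   // 5 fragments
        System.out.println(better.length);  // 2 sentences
    }
}
```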

------
Charlesmigli
As a cofounder of <http://tldr.io>, this confirms our vision that for now (and
for many years) only people can perform a task as hard as summarizing.

~~~
MojoJolo
I agree with "for now" but not with "for many years". Right now, most or all
automatic summarizers do extraction, which is just lifting sentences from the
original article itself. That is different from the human notion of a summary,
which is abstraction: taking the most important parts of the article and
paraphrasing them for easy reading.

Right now, abstraction or paraphrasing is hard for a computer to do, but I
think (and hope) it will be possible in a few years' time. There are various
open source and academic tools that can do some pretty good NLP; I'm looking
into Apache OpenNLP and WordNet. I'm hoping for two or three years.

BTW, I have an app similar to your tldr.io. Check my HN comment
(<https://news.ycombinator.com/item?id=5523770>) for more info about it. ;)

~~~
drakaal
Changing the sentences adds bias. Maintaining the author's intent is
important.

Generating news highlights from lots of sources might be cool as
computer-generated content. But rewriting an author's story in new words is
not adding value; it is just ripping them off.

~~~
MojoJolo
Thanks, those are pretty good insights. Bias hadn't crossed my mind. So you're
saying multi-document "summarization" may be the next step for consumer
automatic summarization? There is a lot of research on multi-document
summarization; I will look into it.

------
peter_l_downs
Awesome! Looks super similar to an old sideproject of mine,
www.bookshrink.com. The algorithm's different -- yours is aimed more towards
summaries, while mine was aimed at sentence importance.

~~~
mohaps
yeah, I'm working on adding more summarizer algorithms. I've been thinking
along the lines of weighing up rhetorical questions, weighing down exclamation
marks (cheap sarcasm detection), etc.

~~~
peter_l_downs
I experimented with giving bonuses to proper nouns and verbs, as well as
giving a slight advantage to shorter sentences.
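Taken together, the heuristics floated in this subthread might layer onto a
raw keyword score roughly like this. All weights and names are made up for
illustration, not tuned values from either project:

```java
public class SentenceScorer {
    public static double score(String sentence, int keywordHits) {
        double score = keywordHits;
        String[] words = sentence.split("\\s+");

        // Weigh down exclamation marks (cheap sarcasm/hype detection).
        if (sentence.contains("!")) score *= 0.5;
        // Weigh up rhetorical questions slightly.
        if (sentence.trim().endsWith("?")) score *= 1.2;
        // Small bonus per capitalized mid-sentence word, as a crude
        // proper-noun proxy (a real system would POS-tag instead).
        for (int i = 1; i < words.length; i++)
            if (!words[i].isEmpty() && Character.isUpperCase(words[i].charAt(0)))
                score += 0.25;
        // Slight advantage to shorter sentences.
        score += 10.0 / (words.length + 10.0);
        return score;
    }
}
```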

------
mohaps
Updated the app with some goodies: links to summaries, the ability to
summarize all types of urls (not just feed urls), and a "spiffy" new logo :)
Also did some css fixes etc.

------
mohaps
TL;DRzr is now open source! <https://news.ycombinator.com/item?id=5535827>

------
bambax
Tried the rss feed from my blog and got an NPE:

<http://blog.medusis.com/rss>

Does it expect a specific format?

~~~
mohaps
try now. the blog.medusis.com/rss link works now, thanks for the feedback.
Since this grabs the page text (when no rss text is found), a lot of junk like
copyright notices shows up in the summary; will have to add some logic to
scrub those. It also behaves horribly with code snippets.
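One cheap first pass at scrubbing that kind of junk could be a marker
blacklist over the extracted sentences. Everything here (class name, marker
list) is an illustrative sketch, not the actual app code:

```java
import java.util.*;
import java.util.stream.Collectors;

public class JunkScrubber {
    // Phrases that usually mark boilerplate rather than article content.
    private static final List<String> JUNK_MARKERS = Arrays.asList(
            "copyright", "all rights reserved", "terms of service",
            "privacy policy", "subscribe to");

    // Drop any sentence containing one of the junk markers.
    public static List<String> scrub(List<String> sentences) {
        return sentences.stream()
                .filter(s -> {
                    String lower = s.toLowerCase();
                    return JUNK_MARKERS.stream().noneMatch(lower::contains);
                })
                .collect(Collectors.toList());
    }
}
```

A blacklist is brittle; boilerpipe-style classifiers that use text density and
link density are the more robust route.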

~~~
bambax
Excellent, thanks, it does work now.

So what does it do exactly? It seems to extract some sentences more or less at
random from the text...?

~~~
mohaps
the algo is listed at the bottom of the home page. Will be opensourcing this
code soon.

------
devopstom
Now to sell it to Yahoo for $30 Million!

------
keeran
Also see <http://tldr.it> (a RailsRumble 2010 entry)

~~~
mohaps
nice :) much better UI. As you can tell, I really suck at HTML/JS coding.

~~~
caublestone
I like your UI. If you change anything, keep this page flow and don't put much
else on the page. Check out medium.com for readable font inspiration.

Awesome work!

------
jbrooksuk
Very cool! I can't wait for it to be open sourced.

~~~
mohaps
trying to figure out (short of creating a new repo from current code) how to
mirror the heroku git repo for this on github

~~~
jbrooksuk
Can't you add another remote for it? One for GitHub and one for Heroku, then
just push to both when you want to update.

~~~
mohaps
yeah, that's what I ended up doing. :) first time using git. These old bones
have been ground down badly by CVS/Subversion :P Here's the announcement:
<https://news.ycombinator.com/item?id=5535827>

------
cmccabe
tl;dr

