

Show HN: My news summarization side project - samsnelling
http://summary.io/technology

======
samsnelling
Site owner here. A few thoughts:

\- HN has been pretty hostile towards Summly, claiming it is "trivial" or
"basic" to create an app that summarizes the news. I've always wanted to try,
and all of the negativity actually motivated me to think, "Why can't I make
something similar?"

\- This has been an incredible learning experience for me.

\- Feedback is more than welcome.

~~~
tensor
I think the arguments against Summly have more to do with the fact that they
_didn't_ have any tech or in-house expertise. They licensed tech from SRI and
even had another group build the app itself.

That aside, summarization is a relatively old field of study (in CS), and
there is a ton of good information to read and even many free libraries. My
suggestion would be to try out an unsupervised learning algorithm such as LDA.
You still need training data, but you don't need to label it with categories.
The downside is that you will have no control over what it learns. Still,
classic examples of LDA involve classifying news sources.
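
For a concrete starting point, here is a minimal LDA sketch with scikit-learn;
the documents and hyperparameters are purely illustrative, not something you
would run in production:

    # Minimal LDA topic-discovery sketch (scikit-learn).
    # Documents and hyperparameters are illustrative only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "Apple unveils new iPhone with a faster chip",
        "Senate debates the new budget bill",
        "Google releases Android update for phones",
        "Lawmakers vote on spending legislation",
    ]

    # Bag-of-words counts; LDA works on raw term counts, not TF-IDF.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-document topic mixture

    # Print the top words for each discovered topic.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = weights.argsort()[-5:][::-1]
        print("topic", k, [terms[i] for i in top])

Note that you never tell it which topic is "tech" and which is "politics";
that is exactly the lack of control mentioned above.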

For standard linear classification, understanding Bayes is important, but for
actual implementation look at something like liblinear and use logistic
regression with regularization. The difference between Bayes and LR is that
Bayes optimizes learning the underlying probability distribution, while LR
directly optimizes "getting the classification right", i.e. the expected
classification loss. Regularization controls for overfitting, and there can be
a _big_ difference between the type you use (L1 vs L2) and the settings. Don't
make the mistake of treating it as a minor tweak.
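
To make the L1 vs L2 point concrete, here is a hedged sketch using
scikit-learn's wrapper around liblinear (synthetic data; the C value is
illustrative, not tuned):

    # L1 vs L2 regularized logistic regression via liblinear.
    # Synthetic data; C (inverse regularization strength) is illustrative.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=5, random_state=0)

    for penalty in ("l1", "l2"):
        clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
        clf.fit(X, y)
        print(penalty, "nonzero coefficients:", (clf.coef_ != 0).sum())

L1 typically drives many coefficients to exactly zero (a sparse model), while
L2 shrinks them toward zero but keeps nearly all of them nonzero -- hence the
big practical difference.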

~~~
jmduke
I've read a lot of literature on classification, but nothing about
summarization. Do you have any recommended literature/libraries?

~~~
samsnelling
jmduke, I started with the Wikipedia article on automatic summarization
(<http://en.wikipedia.org/wiki/Automatic_summarization>), specifically the
section on unsupervised keyphrase extraction. In terms of libraries, there
aren't many small packages out there outside of monolithic ones like the
Stanford NLP package (<http://nlp.stanford.edu/software/lex-parser.shtml>).
When I get back to the house I would be glad to share my bookmarks with you if
you are interested.
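
In the meantime, here is a toy take on the unsupervised keyphrase idea --
nothing like a full treatment, it just scores stopword-free unigrams and
bigrams by frequency, and the stopword list is a stand-in:

    # Toy unsupervised keyphrase extraction: score stopword-free
    # unigrams and bigrams by frequency. Purely illustrative.
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

    def keyphrases(text, top_n=5):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter()
        for i, w in enumerate(words):
            if w in STOPWORDS:
                continue
            counts[w] += 1
            # Count adjacent non-stopword pairs as bigram candidates,
            # weighted higher so multiword phrases can win.
            if i + 1 < len(words) and words[i + 1] not in STOPWORDS:
                counts[w + " " + words[i + 1]] += 2
        return [phrase for phrase, _ in counts.most_common(top_n)]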

~~~
jmduke
That'd be great! My email is in my profile if you'd prefer that method.

------
adelevie
Isn't it always going to be more efficient for the producer of said text to
produce the summary him/herself? Couldn't resources be better spent trying to
influence the production process of various news outlets to provide summaries?

In the legal community (and I'm sure many others), there is tremendous benefit
to writing concise introductory and concluding paragraphs, as well as tables
of contents that act as excellent skeletons for much longer documents. In
policy-land, the one-pager is king...but I digress.

I guess I'm just kind of lost as to how or when the cost of _accurately_
summarizing text by computer is cheaper than basically "asking" the author to
provide one. Will the quality of a computer-generated summary ever be >= the
quality of an author-generated summary?

~~~
samsnelling
Well I guess it depends on how you look at it. My goal with this version of my
project is not to produce a summary that includes everything.

When I approached this project I looked at it with this problem: I currently
read 10-15 news sites. I spend too much time reading the news. How do I get to
the stories that really matter to me?

Producing text-extraction summaries solves this problem well, in my opinion.

 _> Isn't it always going to be more efficient for the producer of said text
to produce the summary him/herself? Couldn't resources be better spent trying
to influence the production process of various news outlets to provide
summaries?_

The quality of summaries would be SO much better if news outlets did this
themselves. Again, I think it would be really cool to be able to have the
influence to change the production process.

I really appreciate your insight here!

~~~
adelevie
Interesting. Of course, I'm not trying to discourage this at all. I really
dig this stuff. Just writing my thoughts on another, complementary approach. I
guess this makes sense from the perspective of a hacker who wants to build a
cool thing for his/her own use.

The questions of efficiency really come into play when you see Yahoo! spending
$30 million for Summly. For that, you could hire 60 people to work for
$50,000/yr for one year. I wonder how 60 happily-employed English majors might
stack up to something produced by Summly et al.

~~~
Samuel_Michon
_“For [$30 million], you could hire 60 people to work for $50,000/yr for one
year.”_

$30 million / $50,000 = 600

You wouldn’t be able to hire 600 people, but you’d definitely get more than 60
English majors. Even after taxes, insurance, benefits, HR, management,
accounting, rent, equipment, travel expenses, etc, you could probably afford
at least 200 English majors at $50,000 a year.

~~~
adelevie
Yea, I was off by a pesky decimal place.

------
bradknowles
Speaking only for myself, I want a few different things out of a service like
this.

For one, I want an article fingerprinting technology. One that can tell that
multiple different sites are talking about the same original post, and not
really saying anything that is materially different. Maybe they all just
cut-and-paste (which something like Churnalism would hopefully address), but I
also want to catch the sites that add a little unique content to an article,
but not enough to make a real difference. Link analysis would have to be
factored into this, based on the full expanded URLs -- sometimes there are new
articles that come out with additional information on a topic that has
previously been discussed, and I wouldn't want to miss those.

Second, once you have the fingerprint for each article from each site, you
need a fuzzy way to compare them for uniqueness -- I want to do a "sort -u" on
all news articles, based on the fingerprint.
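
Something like word-shingle fingerprints compared with Jaccard similarity is
roughly what I have in mind -- a toy sketch, where the shingle size and the
0.8 threshold are guesses rather than tuned values:

    # Toy near-duplicate detection: word 4-gram shingles + Jaccard.
    import re

    def shingles(text, k=4):
        words = re.findall(r"\w+", text.lower())
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def near_duplicates(articles, threshold=0.8):
        # A fuzzy "sort -u": yield index pairs whose shingle sets
        # overlap heavily enough to call the articles duplicates.
        prints = [shingles(text) for text in articles]
        for i in range(len(prints)):
            for j in range(i + 1, len(prints)):
                if jaccard(prints[i], prints[j]) >= threshold:
                    yield i, j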

Third, I need a way to tweak the scores and settings, so that articles from a
high-quality site like Ars Technica get rated higher than those from a
lower-quality site. Of course, a certain amount of automation can be used here
to generate default scores and settings, but I may have a different idea of
exactly what scores and settings I want to use as compared to someone else.

I do like the idea of taking input from sites like HN as an additional
variable for the positive or negative weighting of a news story (or a
particular news site), if the article in question is one that has recently
been discussed there.

Of course, you also need the concept of pluggable modules, so that when the
next new thing comes out (like Churnalism), it can be quickly and easily added
to the mix.

I don't suppose this sounds remotely familiar to anyone? I've got a bunch of
feeds that I watch, but there's a lot of duplication and I would dearly love
to be able to filter out that chaff while still allowing through the
occasional unique article from those sites that usually just jump on the
bandwagon long after the horses have escaped the barn.

Thanks!

------
DanBC
I'm interested in how you decide which sources to use, which subjects to
summarise, and how each story is summarised.

Please don't take this the wrong way, but it's a list of sites that I dislike,
so I wouldn't use this service. I can, however, see the value, and I would use
it if it covered sites that I was more interested in.

I guess for a tech crowd I'm a bit confused about this and RSS: Why don't
people just use a better RSS reader?

But for non-technical people who need to keep up with a few different websites
this could be great. Once you get v1 sorted you could think about adding some
kind of voting for v2. "Useful [y][n]" "important [y][n]" etc.

Good luck with it if you do decide to do any more with it.

~~~
samsnelling
I completely agree with you. Basically, for v1 I just took my personal RSS
feeds and used those. Technology has about 15 sources, Business has about 7,
Top has about 5.

Currently, the summaries are categorized on a feed-by-feed basis (e.g.
everything from The Verge is technology), but I've been messing around with a
Bayes classifier. I just need some training data.
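
For the curious, a generic naive Bayes text classifier looks something like
this (scikit-learn; the labeled headlines are toy stand-ins for real training
data):

    # Generic naive Bayes text classifier (scikit-learn). The labeled
    # headlines below are toy stand-ins for real training data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "New GPU benchmarks leaked ahead of launch",
        "Startup raises Series A to expand sales team",
        "Browser update patches security vulnerabilities",
        "Quarterly earnings beat analyst expectations",
    ]
    train_labels = ["technology", "business", "technology", "business"]

    clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                        MultinomialNB())
    clf.fit(train_texts, train_labels)

    print(clf.predict(["Chipmaker announces record revenue"]))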

Each story is summarized the following way:

1) Get the link from the feed.

2) If the link doesn't already exist, scrape the content and image.

3) Break the content into sentences (custom NLP based on regex).

4) Tokenize sentences into words.

5) Porter-stem all words.

6) Run a heavily customized LexRank.

7) Return the best sentences: no more than 15% of the article or 3 sentences,
whichever is less.
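
Here is a stripped-down sketch of steps 3 through 7 -- plain LexRank-style
scoring over a cosine-similarity graph, with none of my custom tweaks
reproduced:

    # Simplified extractive summarizer: regex sentence split, tokenize,
    # Porter-stem, build a cosine-similarity graph, score sentences with
    # power iteration, keep min(3, 15%) of them.
    import math
    import re
    from collections import Counter
    from nltk.stem.porter import PorterStemmer

    stem = PorterStemmer().stem

    def split_sentences(text):
        # Step 3: naive regex sentence splitter.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                if s.strip()]

    def vectorize(sentence):
        # Steps 4-5: tokenize into words, Porter-stem each token.
        return Counter(stem(w)
                       for w in re.findall(r"[a-z]+", sentence.lower()))

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def summarize(text):
        sents = split_sentences(text)
        if not sents:
            return []
        vecs = [vectorize(s) for s in sents]
        n = len(sents)
        sim = [[cosine(vecs[i], vecs[j]) for j in range(n)]
               for i in range(n)]

        # Step 6: power iteration on the row-normalized similarity graph.
        scores = [1.0 / n] * n
        for _ in range(20):
            scores = [sum(sim[j][i] / (sum(sim[j]) or 1.0) * scores[j]
                          for j in range(n))
                      for i in range(n)]

        # Step 7: at most 15% of the sentences or 3, whichever is less.
        k = max(1, min(3, int(n * 0.15)))
        best = sorted(range(n), key=lambda i: -scores[i])[:k]
        return [sents[i] for i in sorted(best)]

Real LexRank also thresholds the similarity graph and adds a damping factor;
both are skipped here for brevity.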

Right now, it's just a firehose of data. There's a lot I can do with it, but
I'm exploring where to go: summarize the news? Product reviews? Public domain
books? Try to hook up with an RSS reader company? Build a Chrome extension?

I really appreciate the feedback!

~~~
PaulHoule
I love the user interface.

My feeling about the algorithm is that it works really well on some stories
and poorly on others. For instance, how can you extract 3 salient points from
"10 Ways To Spam The Web With A Top Ten List"?

Anyway, a key to advances in practical A.I. is being able to change the
problem definition to something that is doable AND serves a need. Competitions
like Kaggle and TREC attract smart and hard-working competitors but make a
real advance only once every couple of years.

If you want to beat the odds, rather than summarizing anything that comes down
the pipe, you can throw out any articles that don't summarize well. If you
could get rid of 50% of the strikeouts it would look much better, and if you
got rid of 80% it's going to be better than a committee of Mechanical Turks.

Shoot me a line if you want some help making this work.

~~~
samsnelling
Paul, I think we just emailed each other.

Again, you are exactly right. Some of my problems, I know, can be improved by
using a better scraping algorithm. Another idea: if the article is not at
least X in length, don't try to summarize it.

I look forward to talking to you more about it!

------
bayan09
<http://news.thetechblock.com> does an outstanding job with relevant tech news
in my opinion. It's hand curated, though. There's also no summary. UI is
great.

~~~
samsnelling
Bayan09, I will look into adding it tonight to what I scrape. I will comment
and let you know how it goes! Thanks!

------
jjsz
Looks good, but tldr.io exists. Feedly now has what comes down to a rating
based on how many times an article was saved. It just needs a tldr.io layer
over it.

~~~
samsnelling
I love tldr.io and use it frequently. It really amazes me. Of course, I think
this approach might be more scalable. I agree that adding a rating layer would
be a great addition. Thanks!

------
nathanb
Needs an RSS feed

~~~
samsnelling
Thanks for the suggestion! That could actually be added really easily. :)

------
radiusq
Good job. Hopefully someone has at least $30mm for you :)

~~~
samsnelling
Thanks for the kind words! To be honest I am just looking to beef up my
portfolio before I go job hunting next year (yikes!).

:)

