Show HN: My news summarization side project (summary.io)
30 points by samsnelling on April 24, 2013 | 40 comments



Site owner here. A few thoughts:

- HN has been pretty hostile towards Summly and how "trivial" or "basic" it is to create an app that summarizes the news. I've always wanted to try, and all of the negativity actually motivated me to think, "Why can't I make something similar?"
- This has been an incredible learning experience for me.
- Feedback is more than welcome.


I think the arguments against Summly have more to do with the fact that they didn't have any tech or in-house expertise. They licensed tech from SRI and even had another group build the app itself.

That aside, summarization is a relatively old field of study (in CS) and there is tons of good information to read and even many free libraries. My suggestion would be to try out an unsupervised learning algorithm such as LDA. You still need training data, but you don't need to label it with categories. The downside is that you will have no control over what it learns. Still, classic examples of LDA involve classifying news sources.
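For anyone who wants to see what that looks like concretely, here is a minimal sketch of LDA topic discovery, assuming a recent scikit-learn; the example documents are purely illustrative:

```python
# Minimal LDA sketch (assumes a recent scikit-learn); docs are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks fell as the central bank raised interest rates",
    "the startup raised a new funding round from investors",
    "the team won the championship game in overtime",
]

# LDA works on raw term counts (bag of words), not tf-idf.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Learn 2 latent topics; no labels needed, but you don't control what they mean.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixture

# Print the top words that define each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {top}")
```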

For standard linear classification, understanding Bayes is important, but for actual implementation look at something like liblinear and use logistic regression with regularization. The difference between Bayes and LR is that Bayes optimizes learning the underlying probability distribution while LR optimizes "getting the classification right", otherwise known as optimizing the expectation. Regularization controls for overfitting, and there can be a big difference between the type you use (L1 vs L2) and the settings. Don't make the mistake of treating it as a minor tweak.
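A minimal sketch of that liblinear route, assuming scikit-learn (whose LogisticRegression wraps liblinear); the texts, labels, and C values are illustrative placeholders, and C would need tuning on real data:

```python
# Logistic regression with L2 vs L1 regularization (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["apple unveils new phone", "senate passes budget bill",
         "chip maker beats earnings estimates", "election results announced"]
labels = ["tech", "politics", "tech", "politics"]

# L2 (ridge-style) regularization; C is the inverse regularization strength.
clf_l2 = make_pipeline(TfidfVectorizer(),
                       LogisticRegression(penalty="l2", C=1.0, solver="liblinear"))
clf_l2.fit(texts, labels)

# L1 drives many weights to exactly zero (implicit feature selection),
# which can behave very differently on sparse text features.
clf_l1 = make_pipeline(TfidfVectorizer(),
                       LogisticRegression(penalty="l1", C=1.0, solver="liblinear"))
clf_l1.fit(texts, labels)

print(clf_l2.predict(["new tablet announced at developer conference"]))
```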


To make the choice of L1 vs L2 a little more flexible, there's also Elastic net regularization [0], which lets you combine both L1 and L2.
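A small sketch of that, assuming a recent scikit-learn; SGDClassifier with a logistic loss accepts an elastic net penalty, and the alpha/l1_ratio values below are placeholders rather than recommendations:

```python
# Elastic net regularized logistic regression via SGD (assumes scikit-learn >= 1.1
# for the "log_loss" name; older versions call it "log").
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

texts = ["markets rally on strong earnings", "new web framework released on github"]
labels = ["business", "tech"]

clf = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="log_loss",       # logistic regression objective
                  penalty="elasticnet",  # blend of L1 and L2
                  alpha=1e-4,            # overall regularization strength
                  l1_ratio=0.15),        # 0 = pure L2, 1 = pure L1
)
clf.fit(texts, labels)
print(clf.predict(["quarterly profits beat expectations"]))
```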

Also agree with LDA, I've recently been using it for some unsupervised text classification and the topics it generates are impressively sane.

[0] http://en.wikipedia.org/wiki/Elastic_net_regularization


>> Don't make the mistake of treating it as a minor tweak.

Yes. Exactly. With how rapidly I have been developing this (2 weeks in the making), I think I could pivot it relatively easily. Really, however, classification is just an added layer on top of what I already have.

Thank you for your comment. I will look into the stuff you mentioned! This is all relatively new to me!


I've read a lot of literature on classification, but nothing about summarization. Do you have any recommended literature/libraries?


jmduke, I started with the Wikipedia article on automatic summarization (http://en.wikipedia.org/wiki/Automatic_summarization), specifically the unsupervised keyphrase extraction. In terms of libraries, there isn't a huge number of small packages out there outside of the monolithic Stanford NLP tools (http://nlp.stanford.edu/software/lex-parser.shtml) and such. When I get back to the house I would be glad to share my bookmarks with you if you are interested.
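For what it's worth, the crudest unsupervised keyphrase idea from that article is just frequency counting over stopword-filtered words; here is a toy sketch (the stopword list and article text are purely illustrative):

```python
# Naive unsupervised keyphrase extraction: most frequent non-stopword terms.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "for",
             "on", "with", "it", "by", "each"}

def keyphrases(text, k=5):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]

article = ("The summarization system extracts the sentences that best represent "
           "the article, scoring each sentence by the keyphrases the sentence contains.")
print(keyphrases(article))  # top candidate keywords by frequency
```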


That'd be great! My email is in my profile if you'd prefer that method.


Oh, come on now. You can be honest here... you want $30M like Summly & Wavii :)

And maybe you should get it. The app looks very impressive. How does your program decide which snippets are the most important? Did you build that technology or license/use somebody else's code? B/c from what I understand, both Summly and Wavii were primarily acquired for their proprietary Natural Language Processing (NLP) tech. If your app is capable of NLP -- and you built it from scratch -- you likely have something of real value, and you're probably in demand as an NLP engineer too.

I do wonder, however, if those recent acquisitions had anything to do with a recent court ruling concerning "Fair Use"/ summarization/ news aggregators. You might wanna take a look & make sure your aggregation of others' content is acceptable + truly protected under "Fair Use" principles. Please note, I am not an attorney.

See here for the original link: https://news.ycombinator.com/item?id=5519622


Hyperberry,

You are on point with that comment! Actually all of the tech is my code (outside of scraping the image and a few packages via npm). To be honest, I worked pretty hard on some of the NLP for sentence parsing. With that said however, I don't think I am in that high of demand as a developer (yet).

In terms of fair use... I am scared shitless. Right now I am not overly worried because I am not a huge site. The way the algorithm is set up is to take 15% or 3 sentences at max, whichever is less. This is something I will have to approach extremely carefully as I move forward.
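Roughly, that cap amounts to the following (a tiny Python sketch; whether the 15% is rounded up or down is my assumption):

```python
# Summary length cap: at most 15% of the article's sentences, never more than 3.
import math

def summary_length(num_sentences: int) -> int:
    return min(3, math.ceil(0.15 * num_sentences))

print(summary_length(40))  # 3 (15% would allow 6, so the hard cap of 3 wins)
print(summary_length(10))  # 2 (15% of 10, rounded up)
```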

In terms of this project's future: I doubt it will get acquired. But I really, really hope it will help me get a job once I add it to my portfolio (I graduate in a year!).

Thanks for your reply!


Do you plan on making this an app?


Personally my experience is really on the web side of things. I am however looking at the option of doing an HTML5 wrapper around it so it can be published as an application in the stores. Should be an interesting path!


I looked at Corona and PhoneGap. Both seem like quick ways to get an app out.


I think this is a cool project to undertake but yes, I would group myself among those who think Summly was trivial. Not in a technical sense, but in a do-people-need-it-sense? That is, the best news sites are already going to have relevant headlines and good meta-descriptions. What else could even the best extraction algorithm pull out -- from the text -- that would aid in my decision to read the article? In this problem space, more is not better, because the point is to keep things as concise as possible. I argue that this is largely already done by the content editors. And for sites that do it poorly, I would suspect that their articles will have equally poor extractions.

This is not to say that there aren't valuable things that can be added to an article's overview...but I think direct quote/passages or summaries thereof are the least interesting things. I'd rather see something like, how many popular users on Twitter (i.e. not spambots) have tweeted this post, and what are the best comments that are not merely tweets of the title? Or, maybe just pull out the best comments from the stories if Disqus has been enabled. These are the kind of things that would be useful auxiliary information before actually reading the article.


Danso, you just opened up an entire new wave of thinking in my head with this line:

>> I'd rather see something like, how many popular users on Twitter (i.e. not spambots) have tweeted this post, and what are the best comments that are not merely tweets of the title? Or, maybe just pull out the best comments from the stories if Disqus has been enabled.

:)

Do people need it? No, probably not. But I think for me it helps get to the news articles I really want to read. If I read an interesting sentence or two that was extracted, I feel like I am more likely to go on and read the article.

>> And for sites that do it poorly, I would suspect that their articles will have equally poor extractions.

You could not be more right. I have been experimenting with an "Entertainment" section. You would not believe how poor some of the extractions are, purely based on the writing style.

I don't think that this solution is perfect, but the wheels are turning inside my head on where I should take this.


Well, I'm glad you approve, because as a former content writer, I resent anyone who thinks that their algorithm can provide more value than me ;)

No really though, I think that because human writing (and just as importantly, HTML layout) is so diverse, even a very good algorithm risks extracting something very banal or redundant. Cross-referencing an article with social data, however, is very low-hanging fruit and at the same time, pretty useful.

For example, take a post like the currently popular "Introduction to Position Based Fluids" (http://physxinfo.com/news/11109/introduction-to-position-bas...)

An extractor is immediately going to have a problem here because most of the meat is in the video...there is some descriptive text, but it doesn't add much more than the title and the meta description ("Position Based Fluids is a way of simulating liquids using Position Based Dynamics (PBD) approach")

However, the top HN comments are insightful...for example:

"The only uncanny valley effect here is that the water doesn't wet anything - everything is perfect Teflon. So it's hard to judge whether the movement of the water is perfectly realistic, because I keep looking at the transitions. If I don't look at surfaces, it seems very convincing."

And since the mechanism is the same, why not extract the top comment from Reddit?

"Wow. Not much else to say but wow."

OK, not that great in this case...but the retrieval work is trivial. More importantly, both of these sites have built-in metadata telling you the value of each comment...not just upvotes, but how much discussion it generated...there are plenty of useful ways to slice this data numerically... And the same pattern would apply to Twitter/Facebook...pull in the comments/reactions from users with high social cred.

In other words, why resort to machine-summarization when you have perfectly fine humans to do so, and who are actually (in some cases) adding new insights?

And there's nothing stopping you from applying text-mining to the social data/commentary you've collected...for example, showing only comments that directly refer to the main keywords/proper nouns that are in the article (this will keep your algorithm from collecting comments like: WOW, AWESOME!! that get upvotes merely for being positive).

Anyway, this seems like an easier route with a lot of value...maybe it's so obvious and technically easy that it's not sexy...but I think it would be very useful.
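For what it's worth, here's a rough sketch of that retrieve-and-filter idea against the public Algolia HN Search API; the field names follow that API as I understand it, and the URL and keywords are illustrative:

```python
# Fetch HN comments for a story and keep only those mentioning article keywords.
import requests

def hn_comments_about(url, keywords):
    # Find the HN submission for this URL via the Algolia HN Search API.
    search = requests.get("https://hn.algolia.com/api/v1/search",
                          params={"query": url, "tags": "story"}).json()
    if not search["hits"]:
        return []
    story_id = search["hits"][0]["objectID"]

    # Fetch the full comment tree for that story.
    item = requests.get(f"https://hn.algolia.com/api/v1/items/{story_id}").json()

    relevant = []

    def walk(node):
        text = (node.get("text") or "").lower()
        if any(kw.lower() in text for kw in keywords):
            relevant.append(text)
        for child in node.get("children", []):
            walk(child)

    walk(item)
    return relevant

print(hn_comments_about("http://physxinfo.com/news/11109/", ["water", "fluid"]))
```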


>> I argue that this is largely already done by the content editors.

Certainly for professional media outlets, and therefore true of the mass market most summarizers are trying to hit. But just take most of the best stuff that reaches HN: personal blogs, GitHub repos, documentation. Really hard but potentially rewarding stuff to summarize. I currently do this by hand in the email newsletter space but am certainly interested in any advancements that'll save me time one day :-)


Peter,

So funny that you mentioned GitHub docs, as I am working on something custom for that (my sentence parsing works really badly with code... who would've thought!). You are exactly right about the potential of things that can be summarized. Currently Summary.io only shows sites with images, but the summaries are still being generated! Appreciate the comment. If you have any other ideas on what I could summarize, let me know!


Isn't it always going to be more efficient for the producer of said text to produce the summary him/herself? Couldn't resources be better spent trying to influence the production process of various news outlets to provide summaries?

In the legal community (and I'm sure many others), there is tremendous benefit to writing concise introductory and concluding paragraphs, as well as tables of contents that act as excellent skeletons for much longer documents. In policy-land, the one-pager is king...but I digress.

I guess I'm just kind of lost as to how or when the cost of accurately summarizing text by computer is cheaper than basically "asking" the author to provide one. Will the quality of a computer-generated summary ever be >= the quality of an author-generated summary?


Well I guess it depends on how you look at it. My goal with this version of my project is not to produce a summary that includes everything.

When I approached this project I looked at it with this problem: I currently read 10-15 news sites. I spend too much time reading the news. How do I get to the stories that really matter to me?

Producing text-extraction summaries solves this problem well, in my opinion.

>> Isn't it always going to be more efficient for the producer of said text to produce the summary him/herself? Couldn't resources be better spent trying to influence the production process of various news outlets to provide summaries?

The quality of summaries would be SO much better if news outlets did this themselves. Again, I think it would be really cool to be able to have the influence to change the production process.

I really appreciate your insight here!


“I currently read 10-15 news sites. I spend too much time reading the news. How do I get to the stories that really matter to me?”

I don’t know about you, but for me, this problem is solved by using RSS (Reeder) and receiving email digests from Percolate, The Brief, and Hackernewsletter.

http://www.percolate.com

http://www.thebrief.io

http://www.hackernewsletter.com

Bonus: http://www.thefeature.net/


I tried to do email digests. I really did. I am signed up for a few of those. My problem is they get lost in my inbox. Personally, I like to check into news sites several times a day. The Feature is really interesting to me, however. I personally don't use Tumblr, but the inherent "repost" feature about it seems really attractive. Maybe I should look into automatic posting to Tumblr. Thanks for the feedback! I've never heard of The Brief!


“My problem is [email digests] get lost in my inbox”

I have a Smart Mailbox set up in Mail.app that all those digests land in, that way they’re not mixed in with all the email that I need to act on.

“The Feature is really interesting to me, however. I personally don't use Tumblr, but the inherent "repost" feature about it seems really attractive. Maybe I should look into automatic posting to Tumblr.”

The Feature is a selection of articles that people save to read later using Instapaper. The Tumblr connection is Marco Arment, who worked at Tumblr and created Instapaper and The Feature.

You can set up Instapaper in such a way that whenever you mark something ‘Read later’ (from the web or an RSS client like Reeder), it’s automatically added to your tumblelog. You can then peruse your tumblelog whenever you feel like it.


Interesting. Of course, I'm not trying to discourage this at all. I really dig this stuff. Just writing my thoughts on another, complementary approach. I guess this makes sense from the perspective of a hacker who wants to build a cool thing for his/her own use.

The questions of efficiency really come into play when you see Yahoo! spending $30 million for Summly. For that, you could hire 60 people to work for $50,000/yr for one year. I wonder how 60 happily-employed English majors might stack up to something produced by Summly et al.


“For [$30 million], you could hire 60 people to work for $50,000/yr for one year.”

$30 million / $50,000 = 600

You wouldn’t be able to hire 600 people, but you’d definitely get more than 60 English majors. Even after taxes, insurance, benefits, HR, management, accounting, rent, equipment, travel expenses, etc, you could probably afford at least 200 English majors at $50,000 a year.


Yea, I was off by a pesky decimal place.


I know you aren't. In fact, you taking time out of your day to post here with your opinion about my site really does quite the opposite of discouraging me.

Summly getting bought for that much really confuses me as well.


I bet you could set up a mechanical turk job to summarise news for a lot less than 60 people @ $50k a year and get similar results.


I'm interested in how you decide what sources to use, and then what subject to summarise, and then how each story is summarised.

Please don't take this the wrong way but: it's a list of sites that I dislike so I wouldn't use this service. I can, however, see the value, and I would use it if it was sites that I was more interested in.

I guess for a tech crowd I'm a bit confused about this and RSS: Why don't people just use a better RSS reader?

But for non-technical people who need to keep up with a few different websites this could be great. Once you get v1 sorted you could think about adding some kind of voting for v2. "Useful [y][n]" "important [y][n]" etc.

Good luck with it if you do decide to do any more with it.


I completely agree with you. Basically for v1 I just took my personal rss feeds and used those. Technology has about 15 sources, Business has about 7, Top has about 5.

Currently, the summaries are categorized on a feed by feed basis (eg everything from the Verge is technology), but I've been messing around with a Bayes classifier. I just need some training data.
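A minimal sketch of that Bayes-classifier layer, assuming scikit-learn; the feed titles and section labels here are placeholders standing in for real training data:

```python
# Naive Bayes section classifier over headline text (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = ["new graphics card benchmarks leaked",
          "fed signals rate cut later this year",
          "open source browser gets major update",
          "retail earnings miss analyst estimates"]
sections = ["technology", "business", "technology", "business"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(titles, sections)

print(clf.predict(["browser update improves graphics performance"]))  # ['technology']
```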

Each story is summarized the following way:

1) Get the link from the feed.
2) If the link isn't already stored, scrape the content and image.
3) Break the content into sentences (custom NLP based on regex).
4) Tokenize the sentences into words.
5) Porter-stem all words.
6) Run a heavily customized LexRank.
7) Return the best sentences -- no more than 15% or 3 sentences, whichever is less.
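A rough sketch of that pipeline shape in Python (the actual implementation is custom Node code; the sentence splitter below is a naive regex and the LexRank step is simplified to plain degree centrality rather than the full power-iteration version):

```python
# Extractive summarizer: split, tokenize, Porter-stem, score by sentence similarity.
import re
import numpy as np
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stem_tokens(sentence):
    return [stemmer.stem(w) for w in re.findall(r"[a-zA-Z']+", sentence.lower())]

def summarize(text, ratio=0.15, max_sentences=3):
    sentences = split_sentences(text)
    bags = [Counter(stem_tokens(s)) for s in sentences]

    # Cosine similarity between every pair of sentences (the LexRank graph).
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = set(bags[i]) & set(bags[j])
            num = sum(bags[i][w] * bags[j][w] for w in shared)
            den = (np.sqrt(sum(c * c for c in bags[i].values())) *
                   np.sqrt(sum(c * c for c in bags[j].values())))
            sim[i, j] = num / den if den else 0.0

    # Score each sentence by total similarity to the rest (degree centrality).
    scores = sim.sum(axis=1)
    k = min(max_sentences, max(1, int(ratio * n)))
    best = sorted(np.argsort(scores)[-k:])  # keep original sentence order
    return " ".join(sentences[i] for i in best)
```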

Right now, it's just a firehose of data. There's a lot I can do with it, but I'm exploring where to go: summarize the news? Product reviews? Public domain books? Try and hook up with an RSS reader company? Build a Chrome extension?

I really appreciate the feedback!


I love the user interface.

My feeling about the algorithm is that it works really well on some stories and poorly on others. For instance, how can you extract 3 salient points from "10 Tens To Spam The Web With A Top Ten List"?

Anyway, a key to advances in practical A.I. is being able to change the problem definition to something that is doable AND serves a need. Competitions like Kaggle and TREC attract smart and hard-working competitors but make a real advance only once every couple of years.

If you want to beat the odds, rather than summarizing anything that comes down the pipe, you can throw out any articles that don't summarize well. If you could get rid of 50% of the strikeouts it would look much better, and if you got rid of 80% it's going to be better than a committee of mechanical turks.

Shoot me a line if you want some help making this work.


Paul, I think we just emailed each other.

Again, you are exactly right. Some of my problems I know can be improved with a better scraping algorithm. Another idea would be: if the article is not at least X length, don't try to summarize it.

I look forward to talking to you more about it!


Speaking only for myself, I want a few different things out of a service like this.

For one, I want an article fingerprinting technology. One that can tell that multiple different sites are talking about the same original post, and not really saying anything that is materially different. Maybe they all just cut-n-paste (which something like Churnalism would hopefully address), but I also want to catch the sites that add a little unique content to an article, but not enough to make a real difference. Link analysis would have to be factored into this, based on the full expanded URLs -- Sometimes there are new articles that come out with additional information on a topic that has previously been discussed, and I wouldn't want to miss those.

Second, once you have the fingerprint for each article from each site, you need a fuzzy way to compare them for uniqueness -- I want to do a "sort -u" on all news articles, based on the fingerprint.
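One rough sketch of that fingerprint-and-fuzzy-compare step, using word shingles and Jaccard similarity (MinHash would be the scalable version of the same idea); the threshold is illustrative:

```python
# Near-duplicate detection: shingle fingerprints + Jaccard, a fuzzy `sort -u`.
import re

def shingles(text, k=5):
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dedupe(articles, threshold=0.6):
    """Keep one representative per cluster of near-identical articles."""
    kept, fingerprints = [], []
    for text in articles:
        fp = shingles(text)
        if all(jaccard(fp, seen) < threshold for seen in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```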

Third, I need a way to tweak the scores and settings, so that articles from a high-quality site like Ars Technica get rated higher than those from a lower-quality site. Of course, a certain amount of automation can be used here to generate default scores and settings, but I may have a different idea of exactly what scores and settings I want to use as compared to someone else.

I do like the idea of taking input from sites like HN as an additional variable for the positive or negative weighting of a news story (or a particular news site), if the article in question is one that has recently been discussed there.

Of course, you also need the concept of pluggable modules, so that when the next new thing comes out (like Churnalism), it can be quickly and easily added to the mix.

I don't suppose this sounds remotely familiar to anyone? I've got a bunch of feeds that I watch, but there's a lot of duplication and I would dearly love to be able to filter out that chaff while still allowing through the occasional unique article from those sites that usually just jump on the bandwagon long after the horses have escaped the barn.

Thanks!


http://news.thetechblock.com does an outstanding job with relevant tech news in my opinion. It's hand curated, though. There's also no summary. UI is great.


Bayan09, I will look into adding it tonight to what I scrape. I will comment and let you know how it goes! Thanks!


Looks good, but tldr.io exists. Feedly now has what comes down to a rating based on how many times an article was saved. It just needs a tldr.io layer over it.


I love tldr.io and use it frequently. Really amazes me. Of course, I think this approach might be more scalable. I agree that adding a rating layer would be a great addition. Thanks!


Needs an RSS feed


Thanks for the suggestion! That could actually be added really easily. :)
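A small sketch of what that could look like, using only the Python standard library to emit bare-bones RSS 2.0; it assumes the summaries are already available as simple dicts with title/url/summary keys:

```python
# Build a minimal RSS 2.0 feed from already-generated summaries.
import xml.etree.ElementTree as ET

def build_rss(summaries):
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Summary.io"
    ET.SubElement(channel, "link").text = "http://summary.io"
    ET.SubElement(channel, "description").text = "Extracted news summaries"

    for s in summaries:  # e.g. {"title": ..., "url": ..., "summary": ...}
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = s["title"]
        ET.SubElement(item, "link").text = s["url"]
        ET.SubElement(item, "description").text = s["summary"]

    return ET.tostring(rss, encoding="unicode")
```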


Good job. Hopefully someone has at least $30mm for you :)


Thanks for the kind words! To be honest I am just looking to beef up my portfolio before I go job hunting next year (yikes!).

:)



