
HNSummaries.com - algorithmically summarized HN articles to your inbox - dy
http://hnsummaries.com
======
dy
Would love to get people's feedback - I built this for myself over the weekend
as a way to accelerate and limit my reading of HN (inspired in part by the
Stanford NLP course).

The NLP is pretty basic and summarizes to a fixed ratio of the original
article's length, so you do get some longer listings.

Big thanks to Wayne Larsen of hckrnews.com for providing me with some insight
into tracking top stories and letting me use his ranking data. Also, I recommend
<http://www.hackernewsletter.com/> for a human-curated version.

~~~
peter_l_downs
Is the code online? If not, any chance you would consider putting it up? I'm
into NLP (I wrote <http://bookshrink.com>) and would love to see how you did
this!

~~~
dy
Hi peter_l_downs - the summarizer is based on Open Text Summarizer
(<http://libots.sourceforge.net/>) which works very similarly to your page at
bookshrink (TF-IDF sentence scoring). I made some minor edits that accommodate
article structure.

Bookshrink has some pretty amusing summaries... it reminded me of a meme from
a while back where people would paste books into Microsoft Word and have
AutoSummarize boil them down to 6 words :)

------
Dn_Ab
Here is a simple recipe that does something similar and works decently well as
a starting point:

--------------------------------------------------------

Count how many times each word appears in the document, storing the counts in
a dictionary or map structure.

Also make sure you track the total word count.

document |> splitBySpace |> if dictionary has word then +1 else 1;
totalwords++

Then split the document into sentences.

Okay, now for each sentence:

==========================================

score = 0.

split sentence by space and

for each word: score += -(dictionary[word]/totalwords) * log(dictionary[word]/totalwords)

dictionaryScore.Add(sentence, score)

==========================================

So now each sentence has a score. You can sort by best score (and lose
document order). Or, if you want to trim with a limit in (0, 1) based on
score:

find bestScore and keep each sentence where limit < sentenceScore / bestScore.

As I said, this is only a starting point and is susceptible to lists of random
words (guess why); there are many ways to make it better. Here is a portion of
code I dug up from a while ago:

    
    
      // helpers defined elsewhere in my codebase: splitstr, filterStop,
      // mapAdd, mapGet, curryfst, flip, log2, splitSentenceRegEx
      let inline sumMap m = m |> Map.fold (curryfst (+)) 0.

      // word counts (with stop words filtered out) plus the total count
      let inline internal countsAndSum n doc =
        let counts = splitstr [|" "|] doc |> filterStop n |> Array.fold mapAdd Map.empty
        counts, sumMap counts

      // per-word entropy contribution: -p * log2 p, where p is the word's frequency
      let ent m sum k =
        let p = (mapGet m k 0.)/sum
        if p = 0. then 0. else -p * log2 p

      // score each sentence by the summed entropy of its words
      let eScore doc =
        let counts, sum = countsAndSum 0 doc
        splitSentenceRegEx doc
        |> Array.map (fun str -> str, splitstr [|" "|] str |> Array.fold (flip ((+) << (ent counts sum))) 0.)
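
And here is a self-contained sketch of the same idea that you can run without
the rest of my codebase -- the helper names are mine, and stop word filtering
is left out for brevity:

    open System
    open System.Text.RegularExpressions

    let log2 x = Math.Log(x, 2.0)

    let splitWords (s: string) =
        s.Split([|' '|], StringSplitOptions.RemoveEmptyEntries)

    // bump a word's count in the map
    let mapAdd m w =
        match Map.tryFind w m with
        | Some n -> Map.add w (n + 1.0) m
        | None -> Map.add w 1.0 m

    // word counts plus the total word count
    let countsAndSum (doc: string) =
        let counts = splitWords doc |> Array.fold mapAdd Map.empty
        counts, counts |> Map.fold (fun acc _ v -> acc + v) 0.0

    // per-word entropy contribution: -p * log2 p
    let ent counts sum word =
        match Map.tryFind word counts with
        | Some c ->
            let p = c / sum
            -p * log2 p
        | None -> 0.0

    // score each sentence by the summed entropy of its words
    let scoreSentences (doc: string) =
        let counts, sum = countsAndSum doc
        Regex.Split(doc, @"(?<=[.!?])\s+")
        |> Array.map (fun s -> s, splitWords s |> Array.sumBy (ent counts sum))

    // keep sentences scoring above `limit` (0-1) of the best, in reading order
    let summarize limit doc =
        let scored = scoreSentences doc
        let best = scored |> Array.map snd |> Array.max
        scored |> Array.filter (fun (_, s) -> limit < s / best) |> Array.map fst

For example, summarize 0.5 doc keeps every sentence that scores more than half
as well as the best one, in original order.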

~~~
nl
This is the approach used in the Python NLTK. That algorithm was adapted from
a Java library called Classifier4J that I wrote in the early 2000s[1].

I'd never seen that technique before, but (like a lot of algorithms) it is
quite obvious once you've seen it.

Edit: actually, it's slightly different from my technique, because I just used
linear scoring (i.e., counting popular words). I'm not sure which technique
would work best.
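
For comparison, a rough sketch of what I mean by linear scoring -- these names
are illustrative, not the actual Classifier4J code:

    // A sentence's score is just the summed document-frequency of its words,
    // so sentences full of popular words win.
    let splitWords (s: string) =
        s.Split([|' '|], System.StringSplitOptions.RemoveEmptyEntries)

    let linearScore (counts: Map<string, float>) (sentence: string) =
        splitWords sentence
        |> Array.sumBy (fun w -> defaultArg (Map.tryFind w counts) 0.0)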

[1] https://groups.google.com/d/topic/nltk-dev/qV9e5TsCBHg/discussion

~~~
Dn_Ab
Hey, I hadn't seen the technique above either. But I've certainly heard of
your work. Unfortunately I am not able to share in the bounty that is NLTK.

Anyways, it is really hard to judge these things (statistical recommenders)
since the metric is inherently subjective and there really is no right or
wrong answer. But the way I like to defend it is: if you are going to skim
anyway, you should at least use a statistically based approach. Better than
just jumping about randomly.

These days I'm more interested in abstractive summarization without cheating
(no templates).

~~~
nl
_Unfortunately I am not able to share in the bounty that is NLTK_

Why is that?

~~~
Dn_Ab
I am much stronger in F# than Python and have already invested time in
building a decent codebase in it. I also personally think better in statically
typed functional languages.

I did not write the SO post, which is also based on just word frequencies. I
have found, at least in terms of picking the most relevant words with respect
to the topic, that the method I wrote, inspired by ideas from entropy, gives
what I deem to be better results. It's robust against stop words and commonly
repeated words that are not part of the topic. The summaries themselves,
though, I cannot say are better or worse.

------
marknutter
I personally don't want algorithmically summarized content; I want content
manually summarized by knowledgeable HN users. That's half the reason I click
into the comments 99% of the time before clicking through to the linked
article. I want interesting insight along with a good summary of the main
points being communicated. There's just no way automatically generated
summaries can compete with that.

~~~
dreeves
I'd love to see the best of both worlds. I too love it when someone takes the
time to summarize an article -- a great community service. I'd love to
establish a convention for doing so (I vote for prepending "Summary:" -- I
find "tl;dr" irksome).

Then HNsummaries.com could fetch those when available instead of or in
addition to the auto-summaries.
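
Detecting that convention could be as simple as this sketch (assuming you
already have a story's comments as strings):

    // Prefer a human-written "Summary:" comment when one exists.
    let findHumanSummary (comments: string list) =
        comments
        |> List.tryFind (fun c -> c.TrimStart().StartsWith("Summary:"))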

~~~
marknutter
Funny, I created exactly that a few years ago, got some traction and a story
on readwriteweb.com, and then let the app stagnate and die. Maybe I should
pursue it again :)

------
ankimal
Just got my first newsletter. Looking good for an initial release. Some
feedback:

Would love to get an index of headlines at the top of the email, with anchors
to the actual stories below.

Would love to see shorter summaries and maybe some of the top comments for
each story (summarized, if possible).

~~~
dy
Thanks for the feedback. I'll add the list of headlines, and I'm thinking
about summaries of the comments... those get harder to understand because
comments can be very context-specific.

I'll take a crack at it and maybe add it as an option.

------
petercooper
Bear in mind that comments here are self-selecting for people who like HN's
comments section ;-) But I know plenty of people, and speak to people on
Twitter, who _deliberately_ avoid these comments pages due to a perception
(fair or not) of "drama" and whatnot. For those folks, an email like this
could be just the ticket. For me though, I'm staying here ;-)

------
moconnor
Thanks for sharing this, I'm curious to see how well it works out over time.
It'd be nice to be able to choose the compression level.

Quality feels at least as good as an open source summarizer I played around
with a while back; good work!

~~~
dy
Thanks, Mark, for the comments and feedback! I appreciate it.

------
Timothee
One thing I commend you for is asking when I want to receive the email. It's
surprising that barely any mailing list or newsletter lets you pick that…

~~~
kiwidrew
This.

I spend most of my time in Asia-Pacific timezones, so most of my automated
emails arrive at awkward times. I'm glad that this one won't be staring at me
from my inbox first thing in the morning -- helping me to produce first,
consume second.

------
jilebedev
Great execution, but I'm uncertain of the idea. My personal perspective: I
read Wikipedia for information -- I read HN for critical insight. It's not
always present, but the signal/noise ratio is higher than on other websites. I
don't want a summary of information - I want critical thought.

------
eaurouge
Why only 20 stories? I usually scan the first three pages once a day, a
snapshot of the top 90 articles. Only about 10% are relevant, so I'd rather
have more summaries to sift through to find the ~10 relevant articles for the
day.

------
dreeves
I actually thought algorithmic summaries would be worse than useless, but they
seem surprisingly good. Here's the one for Caine's Arcade:

"9 year old Caine sets up an arcade in his father’s used car parts store in
East L.A., using only cardboard boxes his dad had lying around and a ton of
ingenuity. Watch his dreams come true when this filmmaker sets up a flash mob
to come and play. Just watching this may make you a better person. $82,000 has
already been raised for Caine’s scholarship fund! little behind on the
bandwagon, but...film just had me in tears."

~~~
dy
This is extractive summarization: it selects key sentences and phrases from
the text rather than generating any new phrasing, so occasionally it will seem
brilliant, and other times...

~~~
mjn
In particular, this approach works best with journalism-style writing.
Journalists typically write in a style with fairly short sentences that stand
alone, and paragraphs of only 1-3 sentences. They even pay deliberate
attention to quotability, for either pull quotes or the chance of being quoted
elsewhere, so everything is well suited to pulling sentences out. It tends not
to work as well when applied to other styles of writing.

For more general text, the first problem that comes up is that out of context
sentences with pronouns that point nowhere end up being unintelligible. The
second sentence above only worked because the "he" was completely unambiguous
in this summary.

~~~
petercooper
I've done some work in this area (specifically in developer-related news) and
you're right. The tricky ones are where you end up with links to GitHub repos
or project pages that assume visitors know what they're looking for. Automated
summaries then become less than useless :-( My dream solution? Developers
learn to write nice summaries on their pages ;-)

~~~
mgkimsal
Might be worthwhile to automatically fetch and parse text from some well-known
URLs (GitHub, for example) to grab content from there to use as an adjunct.
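
A minimal sketch of that, using the repo description field from the public
GitHub API (endpoint and field names to the best of my knowledge; error
handling omitted):

    open System.Net.Http
    open System.Text.Json

    // Fetch a repo's description from the GitHub API to use as an adjunct
    // to the page text. GitHub requires a User-Agent header.
    let fetchRepoDescription (owner: string) (repo: string) =
        use client = new HttpClient()
        client.DefaultRequestHeaders.UserAgent.ParseAdd("hn-summary-sketch")
        let url = sprintf "https://api.github.com/repos/%s/%s" owner repo
        let json = client.GetStringAsync(url).Result
        use doc = JsonDocument.Parse(json)
        doc.RootElement.GetProperty("description").GetString()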

------
DanielBMarkham
I plan on adding this to my <http://newspaper23.com> site. It's just way on
the back burner.

Ideally I think you would do it client-side, so readers could adjust the
shrinkage to the amount of time they have to peruse. I was also thinking about
a scenario where you could browse at, say, 100 words and then dive deep if you
found anything that interests you. A more interactive approach. You might want
to consider this.
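
For instance, a rough sketch of adjustable shrinkage, assuming sentences have
already been scored as in the recipe upthread:

    // Keep the top `ratio` fraction of sentences by score, but return them
    // in original document order.
    let shrink (ratio: float) (scored: (string * float)[]) =
        let keep = max 1 (int (ratio * float scored.Length))
        scored
        |> Array.indexed
        |> Array.sortByDescending (fun (_, (_, score)) -> score)
        |> Array.truncate keep
        |> Array.sortBy fst
        |> Array.map (fun (_, (sentence, _)) -> sentence)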

But I really like the idea. Would love to hear how the project goes!

------
sabalaba
I got my first email, here's some feedback.

You should make sure that the summaries don't scale linearly with the size of
the content -- just because an article is 10x as long doesn't mean I want the
summary to be 10x longer. Maybe scale logarithmically, as sketched below?
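
A minimal sketch of that kind of scaling -- the constants here are arbitrary,
just for illustration:

    // Summary length grows with the log of article length instead of
    // linearly; 3 is the floor, 2.0 controls the growth rate.
    let summarySentenceCount (articleSentences: int) =
        max 3 (int (round (2.0 * log (float articleSentences))))

With those constants, a 20-sentence article gets about 6 summary sentences and
a 200-sentence one about 11.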

I didn't find any of the summaries to be high quality or any better than I
could get from briefly skimming HN myself.

I've unsubscribed.

------
chrishan
I am taking an alternative approach to making sense of HN stories for Chinese
readers. As a regular HN reader, I manually summarize the topics of the top
stories and translate them into Chinese. The motivation is to lower the
barrier to sharing startup/tech news. Link - <http://geektell.com/>

------
mistermann
Really like it!

One small suggestion...could you make the "76 comments" under the title
clickable through to the HN comments?

One other option (maybe as a user preference): include some noteworthy
excerpts from the HN comments in the email as well?

------
sabalaba
Feature Request:

It would be great to get a weekly or monthly summary.

Nice work.

------
gootik
Why email? I'd like to see the summaries on a web page too.

------
SeoxyS
How about giving writers the respect they deserve and not algorithmically
rewriting their work? Has our attention span really gotten so short that we
cannot read articles of substance any longer?

~~~
dreeves
Well, it's like having an abstract for a paper. Which is a good point --
ideally the authors themselves would provide the summary. Still, you certainly
need summaries!

I'd say the only time summaries could be a bad thing is for fiction, where you
don't want to give things away.

For non-fiction, giving things away is the whole point. :)

