
The Berkeley Document Summarizer: Learning-Based, Single-Document Summarization - fitzwatermellow
https://github.com/gregdurrett/berkeley-doc-summarizer
======
orthoganol
Anyone else work with NLP but unlikely to try this tool simply because it's in
Java?

Do any Python NLP engineers integrate Java tools in their setup, e.g. via
Jython? How does that work for you?
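
For what it's worth, a common alternative to Jython is just shelling out to
the Java tool as a subprocess and capturing its output. A minimal sketch in
Python; the jar name and its command-line interface here are hypothetical,
not this project's actual interface:

```python
import subprocess

def build_command(jar_path, input_path, heap="2g"):
    # Assemble the java invocation; the jar and its CLI are hypothetical.
    return ["java", f"-Xmx{heap}", "-jar", jar_path, input_path]

def summarize_file(jar_path, input_path):
    # Run the (hypothetical) summarizer jar and return its stdout as text.
    result = subprocess.run(build_command(jar_path, input_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The obvious trade-off versus Jython is process-startup and serialization
overhead on every call, which is why long-running Java services (or sticking
entirely to Java, as described below) are also common.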

~~~
maga
I work with NLP and don't plan to try this tool, though not because of Java:
I'm just not interested in a summarizer right now.

Our current production back-end for NLP is written in Java, and most of the
third-party libraries it uses are also written in Java. It was initially
written in Python, but at some point we realized that most of the libraries we
use are in Java and Python was just moving data between them. The choice of
these Java libraries wasn't driven by any love for Java either; at the time
they were simply a fair bit more advanced, in both feature set and
performance, than their Python counterparts. One example is the Stanford
Parser and the CoreNLP toolkit: up until Parsey McParseface, the Stanford
Parser was the most accurate parser, and CoreNLP had more features (that
interested us) than Python's NLTK.

~~~
thpalmear
In the end, language and libraries don't matter as much as the actual
relevance of the algorithmic methods. This is where the real innovation and
invention occur: in the ability to mimic human cognition as closely as
possible. This is also why side-by-side comparisons are the ultimate litmus
test, and why there is now a separation happening between companies that
regurgitate voyeuristic ideas (GeoCities, Friendster, MySpace, Facebook-AOL,
Snapchat, Twitter-Majordomo) and companies that truly invent and innovate
hard-to-duplicate initial algorithmic solutions (Google, SENS, Buck Inst.,
Human Longevity, SpaceX). It's like fake AI versus a path toward some
semblance of real AI.

------
z3t4
It would be cool if projects like these showed some examples of what they are
capable of, i.e. what results I should expect.

~~~
fnl
From the linked page:

See
[http://www.eecs.berkeley.edu/~gdurrett/](http://www.eecs.berkeley.edu/~gdurrett/)
for papers and BibTeX.

------
tinco
I wonder how it compares to the Reddit summarizer bot, which is absolutely
amazing. It is frequently the most upvoted comment on long news articles.

~~~
zuzun
Summarizing news articles the way the bot does isn't really that difficult.
You face the task of picking 5 sentences out of 20. Since journalists write in
a very compact style, the result will almost always look decent. You just
follow a few simple rules like:

    
      - Length of sentence without stopwords
      - Distribution of words across the article
      - Words shared with the title
      - Position in the article
      - Position in the paragraph
      - Does the sentence address the subject in third person? (avoid)
      - Does the sentence contain direct speech? (avoid)
    

Stuff like this. It's a bit of a Mechanical Turk, really.
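
A minimal sketch of a few of those rules in Python; the stopword list,
weights, and penalties here are made up for illustration, not the bot's
actual scoring:

```python
import re

# Tiny stopword list for illustration; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was",
             "it", "for", "that", "with", "as", "by", "at", "this"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def score_sentence(sentence, index, total, title_words):
    words = [w for w in tokenize(sentence) if w not in STOPWORDS]
    score = 0.0
    score += min(len(words), 15) / 15.0        # content length, capped
    score += len(set(words) & title_words)     # words shared with the title
    score += 1.0 - index / max(total - 1, 1)   # earlier sentences preferred
    if '"' in sentence:                        # direct speech: avoid
        score -= 2.0
    return score

def summarize(title, article, n=2):
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    title_words = set(tokenize(title)) - STOPWORDS
    scored = [(score_sentence(s, i, len(sentences), title_words), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n]
    # Emit the chosen sentences in their original order.
    return " ".join(s for _, i, s in sorted(top, key=lambda t: t[1]))
```

On a typical inverted-pyramid news article, the lead sentence scores highly
on both position and title overlap, which is largely why such simple
heuristics look so good.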

------
Animats
How does it compare with the summarizer in Microsoft Word (removed around
2010)?

~~~
f_allwein
I sometimes tell people about this (e.g. students who need to shorten their
papers), and nobody believes it was a thing. Of course, the results were
pretty poor back then. I imagine there are better tools by now. Side-by-side
comparison, anyone?

------
KasianFranks
I personally like context-controllable summarization
[http://www.lexcognition.com/summarai/](http://www.lexcognition.com/summarai/)

------
fowlerpower
So how does this compare to the stuff Google was doing with document
summarization? Is the content unique, meaning does it summarize using
brand-new words?

It's unclear, but it still seems really promising.

~~~
thpalmear
Google has been too busy trying to lift methods from Berkeley Lab and pass
them off as its own, in particular Tomas Mikolov:
[https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...](https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/12349/word2vec-is-based-on-an-approach-from-lawrence-berkeley-national-lab)

~~~
wodenokoto
You need a better argument than "somebody else was working on word vectors
before Google."

No shit.

~~~
thpalmear
That's not what was said. It's about how the feature attributes in the
vectors are constructed, scored, and ranked, in addition to the calculations
used to score vectors for similarity.
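
For context, "scoring vectors for similarity" in the word-embedding setting
usually means something like cosine similarity. A minimal illustrative
sketch, not any particular system's implementation:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The dispute above is about how the vectors' features are built and weighted
upstream of a calculation like this, not about the similarity measure itself.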

------
technologia
I like this approach; it's interesting to see coherence factoring into the
summarization, among other things.

------
amelius
Could HN run this automatically on every post? That would be cool :)

------
faitswulff
Does anyone have an example of a summary done with this library?

~~~
technologia
If you look at the paper, it has one annotated:
[http://www.cs.utexas.edu/~gdurrett/papers/durrett-berg-klein...](http://www.cs.utexas.edu/~gdurrett/papers/durrett-berg-klein-acl2016.pdf)

~~~
nl
I don't think it does? Figure 4 is an example of a manually written summary in
the dataset.

There are some examples of how the sentence compression works, but no complete
automatic summaries that I can see.

------
thpalmear
Here's some competition:
[http://sumve.com/slack/sumbot/sumbot11.html](http://sumve.com/slack/sumbot/sumbot11.html)

------
huula
Hey, this is the guy whose dissertation talk I went to several months ago; it
was on the same topic!

