

Automatic Summarization in Medium - MojoJolo
https://medium.com/medium-ideas/59b1747c1d36

======
jmduke
Not that I don't think the technology behind this is both awesome and
completely applicable -- because it is, and it is -- but I hope Medium doesn't
take this approach.

The biggest part of Medium's value proposition, to me, is manual curation.
While the included summarizations are technically accurate, they're stilted,
like they're being spat out by an algorithm (because they are!): the exact
opposite of the feeling you want when a site is gently telling you that it's
editing and proofing and adding all of these human flourishes.

~~~
ozziegooen
Those summaries were quite fantastic. I agree that perhaps it is unnecessary
to use them for the selected entries, but I feel like they must be useful
somewhere else.

Perhaps it would make the most sense to have an rss reader or similar with the
summarizer built-in. Or a page with links to medium articles (plus others)
with summaries.

~~~
MojoJolo
Hi, I built Readborg ([http://readborg.com/](http://readborg.com/)) to
showcase what the algorithm (TextTeaser) can do. It is a news reader for
Philippine news.

~~~
ics
I searched for TextTeaser and only found a boilerplate login site and some
unrelated results. Have you posted, or do you plan to post, the code anywhere?
The results look very good, and though I'm sure you can think of some great
uses for it (incl. ways to monetize it), it would be _fantastic_ to have a
good summarizer for personal notes and the like. I suppose if Readborg were a
full-blown RSS reader (it doesn't appear to be) you could pump all your
writing through an RSS feed, but that's a sticky way of doing things.

Edit: Saw the comment below asking if we could see your thesis; I would be
very interested in that too, especially if you don't plan on sharing all of
your code.

------
rglullis
Lazy Translation of the Portuguese one:

    
    
      - There is no list/bill of demands on the table that
      can be negotiated.
      - This is great for some of the effects the protests will 
      bring to government leaders. (this one really didn't make much sense)
      - In less than a week a movement was formed that led 
      hundreds of thousands to the streets, without any kind of 
      leadership, warning or prediction.
      - There is an explanation for all that.
      - Despite all the anarchy, there is logic behind all this 
      that we are experiencing.

------
guiambros
This is quite impressive. It also worked fairly well for that article in
Portuguese (it's just missing the closing thought, but the original article is
vague anyway; there are 3-4 paragraphs trying to conclude, but no single
final thought).

I'd love to see this plugged into tldr.io, so articles on HN could be
automatically extracted + summarized -- and later improved by real humans, as
needed.

Like a Circa News app, but for the web at large.

------
NKCSS
I feel, just like movie trailers, that these 'previews' ruin the articles by
giving away too much.

Why would you read the article if you've just read the cliff notes? Previews
should entice, not give away too much.

Movie trailers look awesome because they show the best the film has to offer,
making the full-length feature look pale in comparison, which is why I try to
avoid them at all costs.

------
whiddershins
I think that technology is amazing. It was a really cool way to demonstrate
it, because I had already read some of those posts, so I got a sense of how
accurate the summaries were. I would love to know more. Great job.

------
grad_ml
Interesting! Nice work. I have also worked on this problem (not exactly the
same, though!), so I want to ask a few things. The only description you have
given is of the features you have used. These features are very well
established, and have been in use almost since the inception of this field
(the 60's and so on). I would suggest using advanced features like the HITS
score. What are the baseline techniques you have compared against? Some
recent work, like Shen et al. (Automatic document summarization using CRF),
has used CRF-based methods. Is your method based on bag-of-words, or does it
have Markovian structure? Also, how do you decide how many sentences to
select? Please explain a little about the sentence ranking technique, and also
a little about the evaluation techniques. Without these explanations it is
very difficult to make any constructive comment. You may also want to talk
about the training process (if supervised) and scaling issues! We can also
talk offline if you wish :) . -Rahul

------
pseut
Well-written English is remarkably well structured[1]: the first or second
paragraph usually states the main point and the rest fills in details; the
first paragraph or two in a section provides an overview and the rest of the
section provides support, etc. News articles are even more structured: the
first two paragraphs tell the whole story in summary[2].

For writing as short as articles on Medium, it might be useful to compare the
algorithm to a naive version that pulls the first two sentences from each
paragraph.

[1] See, e.g., [http://www.amazon.com/Style-The-Basics-Clarity-Grace/dp/0321112520](http://www.amazon.com/Style-The-Basics-Clarity-Grace/dp/0321112520)

[2] [http://en.wikipedia.org/wiki/Inverted_pyramid](http://en.wikipedia.org/wiki/Inverted_pyramid)
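The naive baseline suggested above takes only a few lines. Here's a minimal
sketch (the regex sentence splitter is a deliberately crude assumption; a real
comparison would use a proper tokenizer):

```python
import re

def naive_summary(text, sentences_per_paragraph=2):
    """Summarize by taking the first N sentences of each paragraph.

    Exploits the 'inverted pyramid' structure of news-style writing:
    the opening sentences of each paragraph usually carry its point.
    """
    summary = []
    for paragraph in text.split("\n\n"):
        # Crude sentence splitter: break on ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        summary.extend(s for s in sentences[:sentences_per_paragraph] if s)
    return " ".join(summary)

article = ("The mayor resigned today. The decision shocked the council. "
           "Aides said it was long planned.\n\n"
           "Reaction was swift. Opponents celebrated. Allies despaired.")
print(naive_summary(article))
# → The mayor resigned today. The decision shocked the council. Reaction was swift. Opponents celebrated.
```

If an algorithm can't clearly beat this on news-style text, the extra
machinery isn't buying much.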

------
nl
If anyone wants to play, I'm the author of Classifier4J, which is a very old
Java text classification tool that also includes a text summarization engine.
I believe the algorithm in c4j has been ported to Python and is available in
NLTK ([https://groups.google.com/forum/m/#!topic/nltk-dev/qV9e5TsCBHg](https://groups.google.com/forum/m/#!topic/nltk-dev/qV9e5TsCBHg)).

I did a bit of testing of the java version, and it was pretty competitive with
commercially available summarizers at the time.

------
louischatriot
Always interesting to see new people approach the summary problem, but I find
these summaries have the defects common to automatic keyphrase-extraction
summaries: they feel very artificial, and are usually not accurate. The
summary of "four steps to Google" is a good example.

I hope this kind of technology sees the light of day, but I'm very skeptical
about it working on general-purpose content and not just structured content
such as news, as it does today.

~~~
ivan444
As the author of TextTeaser noted, there are two approaches to automatic
summarization: abstraction and extraction.

Abstraction combines huge portions of two young research fields -- NLP & NLG
(Natural Language Processing & Generation). NLG is even harder than NLP, and
less researched. Without a good NLG algorithm for presenting the summary, you
can't have more human-sounding summaries.

Extraction simply takes sentences (or portions of them), ranks them, and
presents the few best results.

Two years ago, I attended a PhD presentation on text summarization. There I
figured out that you can build a fair summarization algorithm in a few hours.
Here is a prototype:
[https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d...](https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d91236ad77e8a5d5f5ce864/prototype.py?at=default)
It's dirty prototype code; it took me only about 10 hours of work to prepare
the dataset, think through the algorithm, write the program, and tune it.
(This works only for the Croatian language; if you want another language,
you'll need a list of function words for that language --
[http://en.wikipedia.org/wiki/Function_word](http://en.wikipedia.org/wiki/Function_word)
). There is also a Java version of the text summarizer (somewhere in the
repository) and a simple tool to get clean, article-only text from any page
containing longer texts (it isn't tuned well; I didn't spend more than an
hour of work on it, so I don't expect it to work well).

The algorithm is simple: (1) break the text into sentences, (2) extract
features, (3) compute the feature scores and sum them, (4) present the ranked
sentences (and, later, choose the few best).

Features used: normalized word count; sentence type (declarative,
interrogative, exclamatory); an order score (give the first sentence a boost,
since the first sentence is usually the most important one); the ratio of
function words to all words (function words are words without semantic
content; there is a fwords.txt in the repository containing ~700 Croatian
function words); and the normalized sum of the three minimum TF-IDF scores
(treating each sentence as a document).

I don't know the state of the code (it's more than a year old), but anyone is
free to use it for anything they like.
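The steps and features above can be sketched in a few lines of Python. This is
not ivan444's prototype, just a rough re-implementation of the listed
features (minus sentence type), with a tiny English function-word list
standing in for the ~700-word Croatian one and all weights chosen arbitrarily:

```python
import math
import re
from collections import Counter

# Tiny English stand-in for the ~700-word Croatian function-word list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
                  "that", "on", "for", "as", "by", "with", "are", "then"}

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(text, top_n=2):
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]

    # Document frequency, treating each sentence as a "document" for TF-IDF.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n_docs = max(len(sentences), 1)

    scored = []
    for i, (sent, toks) in enumerate(zip(sentences, tokenized)):
        if not toks:
            continue
        tf = Counter(toks)
        tfidf = {w: (tf[w] / len(toks)) * math.log(n_docs / df[w]) for w in tf}
        # Normalized sum of the three minimum TF-IDF scores.
        low3 = sorted(tfidf.values())[:3]
        tfidf_score = sum(low3) / len(low3)

        length_score = min(len(toks) / 20.0, 1.0)       # normalized word count
        order_score = 1.0 if i == 0 else 1.0 / (i + 1)  # boost the first sentence
        fword_ratio = sum(t in FUNCTION_WORDS for t in toks) / len(toks)

        score = length_score + order_score + (1.0 - fword_ratio) + tfidf_score
        scored.append((score, i, sent))

    best = sorted(scored, reverse=True)[:top_n]
    return [s for _, _, s in sorted(best, key=lambda t: t[1])]  # original order
```

Summing unweighted scores like this is the simplest possible combination; a
real system would learn or at least hand-tune the feature weights.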

~~~
Saad_M
As a PhD graduate in NLG, I wouldn't say NLG is a "young" research field. For
example, the oldest NLG book I have is Eduard Hovy's PhD work on the PAULINE
system ("Generating Natural Language Under Pragmatic Constraints"), which was
published back in 1988. The seminal reference book for NLG ("Building Natural
Language Generation Systems") was published back in 2000. What's made NLG
more interesting recently is that the computing environment has changed
considerably. We have considerably larger pools of time-series data than were
available in the past, and we now also have a standardised data-to-text
pipeline architecture for creating NLG applications for such data.

Nevertheless, I do agree there are still considerable challenges in
text-to-text generation, which involves combining NLP and NLG to abstract,
interpret, and then summarise unstructured free text.

------
zerop
Is text summarization really mature now? It can be very handy; I don't always
have time to read entire news articles. Found one more text summarizer, not
sure how good it is:
[http://pravin.paratey.com/nlp/summarization](http://pravin.paratey.com/nlp/summarization)

------
egonschiele
Is this algorithm a closed-source / patent pending type situation? If not, how
does it work?

~~~
MojoJolo
I'm using 4 features of the article: title, sentence length, sentence
position, and modified keyword frequency. The first three features are
standard ones you can see in most automatic summarization research. Modified
keyword frequency considers not just the frequency, but also the distance
between keywords. :) There. That's just a brief explanation of how TextTeaser
works.
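TextTeaser's exact formulas aren't given here, but a scorer built from those
four features might look roughly like the sketch below. The "ideal length"
constant, the keyword-distance damping, and the equal weighting are all my
own guesses for illustration, not TextTeaser's actual implementation:

```python
import re
from collections import Counter

IDEAL_LENGTH = 20  # assumed "ideal" sentence length in words

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def length_score(tokens):
    # Peaks at the ideal length and falls off on either side.
    return 1.0 - abs(IDEAL_LENGTH - len(tokens)) / IDEAL_LENGTH

def position_score(index, total):
    # Earlier sentences score higher.
    return 1.0 - index / total

def title_score(tokens, title_tokens):
    # Fraction of title words that appear in the sentence.
    if not title_tokens:
        return 0.0
    return len(set(tokens) & set(title_tokens)) / len(title_tokens)

def density_score(tokens, keywords):
    # "Modified keyword frequency": pairs of keywords that occur close
    # together contribute more than isolated ones (the 1/gap^2 damping
    # is invented here for illustration).
    positions = [i for i, t in enumerate(tokens) if t in keywords]
    if len(positions) < 2:
        return float(bool(positions))
    total = sum(1.0 / (b - a) ** 2 for a, b in zip(positions, positions[1:]))
    return total / (len(positions) - 1)

def summarize(title, text, top_n=2):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    title_tokens = tokenize(title)
    # Naive keyword pick: the 10 most frequent words (a real implementation
    # would filter stopwords first).
    keywords = {w for w, _ in Counter(tokenize(text)).most_common(10)}
    scored = []
    for i, sent in enumerate(sentences):
        toks = tokenize(sent)
        if not toks:
            continue
        score = (title_score(toks, title_tokens)
                 + length_score(toks)
                 + position_score(i, len(sentences))
                 + density_score(toks, keywords))
        scored.append((score, i, sent))
    best = sorted(scored, reverse=True)[:top_n]
    return [s for _, _, s in sorted(best, key=lambda t: t[1])]
```

The interesting part is `density_score`: a sentence where keywords cluster
together outranks one where the same keywords are scattered, which is one way
to read "considers not just the frequency, but also the distance".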

~~~
mlla
You mention in the article that you developed the algorithm as a part of
writing your MSc thesis. Is your thesis available on the Internet (as in pdf
or in any other format)?

~~~
MojoJolo
Will ask my adviser about it and upload the pdf later. :)

~~~
timrogers
FWIW I'd love to see this too. We have a semi-regular paper-reading club at
GoCardless (YC S11) and this could be super interesting.

