
A Gentle Introduction to Text Summarization in Machine Learning - ReDeiPirati
https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/
======
nathancahill
A large part of the effort in text summarization is in the quality of the
stemmer. If you're working with English, you're golden: there are several
high-quality stemmers available. However, if you're working in a language that
doesn't have a stemmer yet, writing one is a colossal task.
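To see why writing one from scratch is so hard, here's a deliberately naive suffix-stripping stemmer in plain Python (the suffix list and length cutoff are arbitrary choices for illustration, not from any real stemmer):

```python
# A deliberately naive suffix-stripper, to illustrate why real stemmers
# (Porter, Snowball) take so much work: bare suffix rules mishandle
# doubled consonants, dropped final "e"s, and irregular forms.
def naive_stem(word: str) -> str:
    for suffix in ("ization", "ational", "ing", "ed", "ly", "es", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("summarization"))  # "summar" -- happens to look right
print(naive_stem("running"))        # "runn" -- doubled consonant kept, wrong
print(naive_stem("caring"))         # "car" -- lost the final "e" of "care"
```

Porter-style stemmers add dozens of extra rules just to handle cases like the last two, and all of that has to be redone per language.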

~~~
sansnomme
What about pictorial languages like Mandarin and Japanese?

~~~
jjtheblunt
Japanese isn't pictorial; it uses two alphabets, one for native words and
one for foreign words, and then substitutes in Chinese characters as shortcuts
when widely known.

~~~
rococode
I think Chinese is really not a pictorial language either; using your apt
description, it's like if every character was a "shortcut". Harder to learn,
but still fairly standard in conveying meaning as far as modern languages go.
Some characters are still visually similar to what they represent, but at this
point they're mostly a bit of a stretch.

A true pictorial language would convey most meaning through the symbols
themselves, and I don't think any modern languages fulfill that definition.
Maybe sign languages are the closest thing we have to pictorial language, in
terms of the way some things are expressed symbolically?

------
schemathings
Interesting topic and well written. I kicked the tires a few times with
newspaper3k
[https://github.com/codelucas/newspaper/](https://github.com/codelucas/newspaper/)
It has nlp and summary methods that work pretty well; I think I'll peek under
the hood to see how it's being done there. Curious to see if your method is an
improvement. If so, hey, they're both in Python!

------
i_call_solo
Why waste time say lot word when few word do trick?

~~~
m-i-l
"Brevity takes effort": "If I had more time, I would have written a shorter
letter."

------
heyyyouu
This is great stuff. One question -- what about factoring the role of the word
in the sentence into the weighting, along with the frequency? In general,
subjects, verbs, and direct objects are going to be MUCH more important than,
say, adverbs and articles. Is that not done because it's harder to automate
than plain frequency?
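One way to sketch that idea: keep the frequency scores, but scale each word's count by a weight for its grammatical role. The role weights and the tiny word-to-role lookup below are made-up stand-ins for illustration; a real pipeline would use an actual POS tagger (e.g. NLTK's pos_tag) and tuned weights.

```python
from collections import Counter

# Hypothetical role weights: content roles boosted, function words zeroed.
ROLE_WEIGHTS = {"noun": 2.0, "verb": 1.5, "adverb": 0.5, "article": 0.0}

# Mini-lexicon standing in for a real POS tagger (illustration only).
POS = {"the": "article", "dog": "noun", "cat": "noun",
       "chased": "verb", "ran": "verb", "quickly": "adverb"}

def weighted_scores(words):
    """Word frequency scaled by grammatical role (unknown roles get 1.0)."""
    freq = Counter(words)
    return {w: freq[w] * ROLE_WEIGHTS.get(POS.get(w, ""), 1.0) for w in freq}

tokens = "the dog quickly chased the cat and the cat ran".split()
scores = weighted_scores(tokens)
# "the" is the most frequent token but scores 0.0; "cat" wins at 4.0.
```

So role weighting is automatable, but it does add a tagging step (and a tagger's errors) on top of simple counting.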

------
ColanR
I wish there were also tools like this that did paraphrasing. The article
mentions the possibility, using deep learning, but doesn't go into any
details; and I haven't seen anything that can summarize below the sentence
level anywhere else either.

~~~
heyyyouu
Can I ask what you mean by summarize below the sentence level? Thanks!

~~~
ColanR
See the subsection titled "Abstraction-based summarization" in the article.
Basically, instead of copying important sentences verbatim (and spending all
your analysis on choosing those sentences), rewrite entire paragraphs into new
and unique sentences. No more plagiarism, because nothing is a direct copy.

~~~
heyyyouu
That's interesting. One thing about plagiarism is that derivative works are
considered infringing, even if the direct language is no longer used.
[https://www.copyright.gov/circs/circ14.pdf](https://www.copyright.gov/circs/circ14.pdf)
A court would have to rule, of course, but I wonder if this would actually get
around that. That said, as US case law currently stands, summaries up to a
certain length are completely legal -- you can't stop someone from summarizing
what exists; it's like sharing a fact -- but you can't create a derivative
work. So as long as the resulting output was short enough, you shouldn't have
to worry about plagiarism, since summary is covered; if it's longer, derivative
work could be an issue. Anyway, thank you -- very interesting!

~~~
ColanR
> So as long as the resulting output was short enough you shouldn't have to
> worry anyhow about plagiarism

Ya, that's what I was trying to get at when I mentioned summarizing paragraphs
into sentences. I think we're on the same page there. :)

~~~
heyyyouu
Ah, gotcha! :-)

------
nshm
Abstractive summarization isn't working yet (like text generation in
chatbots); extractive is OK, and better done with an RNN/BERT/some other
neural-network-based approach. That's all you need to know.

------
Koffiepoeder
This is very similar to how TLDR-bot on reddit summarizes reddit posts or
linked articles: [https://smmry.com/about](https://smmry.com/about)

The results can be surprisingly good, even for such a basic algorithm.
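The basic idea smmry describes -- rank each sentence by the popularity of the words it contains, then return the top-ranked sentences in their original order -- fits in a few lines. A minimal sketch (tokenization and sentence splitting kept deliberately crude, no stop-word handling):

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Rank sentences by summed word frequency; return top n in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Indices of the n highest-scoring sentences, emitted in document order.
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = "Cats sleep. Cats chase mice and cats purr. Dogs bark."
print(summarize(text, 1))  # "Cats chase mice and cats purr."
```

Real systems layer on stop-word removal, stemming, and sentence-length normalization, but this is the core of the frequency-based extractive approach.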

------
seeker_
This is a very basic introduction to the topic. I don't understand why it is
getting so much traction here. Am I missing something?

~~~
tracer4201
Might be basic for you, but I know very little about the field. I found it
very informative. If you don’t have something constructive, why post at all?

------
f055
This article is brilliant, thank you!

------
swiley
A gentle introduction to text summarization using a python NLP library.

------
amelius
Am I the only one who hates reading grammatically incorrect pieces of text?

For me this technology is still in the "unusable" phase, and urgently needs
more work.

