
TextTeaser – An automatic summarization algorithm - MojoJolo
https://github.com/MojoJolo/textteaser
======
nichodges
Really interesting. I tried it with an article I wrote for Wired - both to see
how it handled lengthy content with multiple points, and also how it handled
my 'loose' writing style.

Really surprised with both the quality and succinctness of the result:
[http://www.textteaser.com/s/t1bNud](http://www.textteaser.com/s/t1bNud)

Well done.

(Also to the project owner - copying the link is borked in Firefox, I had to
type it out manually)

~~~
keithpeter
Yes, tried it on a couple of my own pages. Seems to work ok.

Who is going to be first to couple this library with an RSS feed reader and
mailer so that I can get auto-generated summaries of recently written articles
sent to my blackberry?

~~~
btbuildem
bitofnews.com already did it

Personally, I find the "email me news" scheme obnoxious. I get enough emails
as it is. Would prefer to see a portal that shows summaries of all the news
and lets the user explore.

~~~
keithpeter
bitofnews.com is interesting but I want to be able to specify which sites to
summarise.

------
MojoJolo
If you guys want to try out TextTeaser, you can check out the website
([http://www.textteaser.com/](http://www.textteaser.com/)). Or try the API via
Mashape
([https://www.mashape.com/mojojolo/textteaser](https://www.mashape.com/mojojolo/textteaser)).

~~~
yelnatz
Are you using LSA?

~~~
sinzone
You mean do you support SLA?

~~~
amatsukawa
No, I believe he actually meant LSA = latent semantic analysis, which is an
algorithm used to extract topics.

I am also curious about how the NLP/ML parts are implemented, as it's claimed
by the README on github. Briefly scanned the code but didn't really spot it.

~~~
MojoJolo
It's more of statistical NLP and a bit of machine learning. The algorithm can
be found here:
[https://github.com/MojoJolo/textteaser/blob/master/src/main/...](https://github.com/MojoJolo/textteaser/blob/master/src/main/scala/com/textteaser/summarizer/Summarizer.scala)

------
cantrevealname
You can cut and paste some text here to try it out:

[http://www.textteaser.com](http://www.textteaser.com)

If you leave the "Title" field empty and click "Summarize", nothing happens --
which I thought was very confusing. You have to fill in something for the
Title.

~~~
MojoJolo
I require the title because I need it for the algorithm.

~~~
pests
That explains the many tests I just did with random copy and pasted articles.
I just typed gibberish into the title. I mean, not all texts need to have a
title.

~~~
MojoJolo
Maybe in the future I can improve it so that it doesn't require the title. It
may produce good results for other types of text, but right now, TextTeaser is
meant to be used for news articles.

~~~
6ren
Since the title (headline) of a news article usually summarizes it, TextTeaser
arguably is less an article summarizer than a headline expander...

 _EDIT_ it would be nice, from a UX POV, to request the title if it's missing,
rather than silently deleting the story... also, you might emphasize the
importance of it (because it doesn't seem important at all). Perhaps just
labeling it as "headline" or "subject" instead of the generic "title" would
help.

------
BjoernKW
This sure looks interesting. What are the theoretical foundations of this? As
for SBS I found this paper:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.6530&rep=rep1&type=pdf)
. I couldn't find anything relevant on DBS, though.

At a cursory glance the algorithm seems like a variation of Luhn's abstract
algorithm.

~~~
MojoJolo
You are spot on about the paper that I referenced. As you can see in section
4.3 on page 3, the paper mentions two algorithms for sentence selection:
Summation-Based Selection and Density-Based Selection, which are SBS and DBS
respectively.
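
For readers who have not opened the paper, the two ideas can be sketched roughly in Python. This is only an illustration: the function names, the `keyword_scores` dictionary, and the normalisation constants are assumptions, and it is not a port of TextTeaser's Scala code.

```python
def sbs(words, keyword_scores):
    """Summation-based selection: average keyword score per word."""
    if not words:
        return 0.0
    total = sum(keyword_scores.get(w.lower(), 0.0) for w in words)
    return total / len(words)


def dbs(words, keyword_scores):
    """Density-based selection: reward keywords that sit close together."""
    # Positions and scores of the keywords that occur in this sentence.
    hits = [(i, keyword_scores[w.lower()])
            for i, w in enumerate(words) if w.lower() in keyword_scores]
    if len(hits) < 2:
        return 0.0
    total = 0.0
    # Each pair of consecutive keywords contributes more the closer they are.
    for (i1, s1), (i2, s2) in zip(hits, hits[1:]):
        total += (s1 * s2) / ((i2 - i1) ** 2)
    k = len(hits)
    return total / (k * (k + 1))  # this normalisation is an assumption


# Hypothetical keyword scores and sentence, for illustration only.
keywords = {"summarization": 3.0, "algorithm": 2.0}
sentence = "TextTeaser is a summarization algorithm".split()
sbs_score = sbs(sentence, keywords)
dbs_score = dbs(sentence, keywords)
```

SBS favours sentences that contain many high-scoring keywords relative to their length, while DBS favours sentences in which those keywords appear close to each other.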

~~~
agibsonccc
What made you pick this representation in particular? I'm kind of curious what
different kinds of algorithms you might have looked at.

Summarizing only blog posts seems a bit limiting to me. (Btw, I'm not trying
to be negative, congrats on your success! TextTeaser looks great!)

I implemented a custom version (mainly changed the scoring scheme to include
TF/IDF of words for initialized scoring) of TextRank and loved it.

The main thing I liked about it was how general it was. Sentences are nodes,
and the edges between them are weighted by how many words they share. Then you
basically use PageRank to rank the sentences according to the graph
representation.

[1]
[http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf](http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf)
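
A minimal Python sketch of that idea, assuming plain word overlap as the edge weight (not the TF/IDF variant mentioned above) and a fixed number of power-iteration steps. The helper names and the toy sentences are made up for illustration, and using sets of unique words is a simplification.

```python
import math
import re


def tokenize(sentence):
    """Lowercased set of word tokens (unique words only, a simplification)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalised by the log of the sentence lengths."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_n=2):
    """Return the top_n highest-ranked sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]


example = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "The cat sat on the rug.",
    "Quantum physics is hard.",
]
example_scores = textrank(example)
```

Nothing here depends on the text being news or a blog post, which is the generality being described: a sentence ranks highly simply because it overlaps with many other sentences.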

~~~
MojoJolo
Hi, I focus on blog posts because I don't want it to be broad. This is
because TextTeaser is my research for my graduate studies, and a broader
research topic would be harder to accomplish. But that doesn't mean it can't
be used for other types of text. It can still be used. It's just optimized for
news.

I'm a little bit familiar with TextRank because I stumbled upon it while I was
doing my research. I also read about several other algorithms but forgot what
they are called.

~~~
agibsonccc
Ahh very cool! Thank you for the insight. I could see where that would be
applicable then. Using comments as features is a very neat concept.

News is the most broadly applicable use for this so leveraging that isn't a
bad thing. There's always a trade off of broad applicability vs overfitting
for a particular case to get better results.

Thanks for the insight! Again great work.

------
nemo1618
Hey guys, I made a userscript at HackMIT last weekend that adds article
summaries to the HN front page. It doesn't use the TextTeaser API (for the
time being, at least) but the summaries seem to come out about the same
anyway.

Check it out here:
[https://github.com/lukechampine/ADHN](https://github.com/lukechampine/ADHN)

------
PLejeck
From a quick test, it seems to treat almost every bit of content on a page
equally, even elements which are clearly smaller and next to an image.

Might I recommend taking CSS styles into account? Large text is usually
headlines, <strong> text is usually important, and darker greys generally
suggest a side comment. Would be much easier if everybody used <aside> and
<h1> but even in 2013 that's too high an expectation.

~~~
MojoJolo
You are right, I'm not taking HTML tags into account. That is because I
extract the text beforehand using Python Goose. In that sense, only the text is
fed into the algorithm, without any HTML tags.

~~~
nubela
Try
[https://github.com/visualrevenue/reporter](https://github.com/visualrevenue/reporter)
:) I'm looking at your service now and it is really massively awesome. Can I
ask, if you are considering monetizing it, or going the venture-path (boo)? I
ask this because I'm curious on the viability of using your service/library on
a long-term project.

~~~
ismaelc
He's monetizing it as an API here
[https://www.mashape.com/mojojolo/textteaser](https://www.mashape.com/mojojolo/textteaser)

------
hnriot
if you paste this thread into the demo you get not very encouraging results. I
haven't looked at the code but I suspect they find the sentences with most
(cosine) similarity to the title and bias towards early sentences.

results:

\- Hacker Newsnew | threads | comments | ask | jobs | submit hnriot (1618) |
logout upvote TextTeaser – An automatic - summarization algorithm (github.com)
\- If you leave the "Title" field empty and click "Summarize", nothing happens
-- which I thought was very confusing. \- reply upvote downvote MojoJolo 1
hour ago | link I require the title because I need it for the algorithm. \-
not a criticism of textteaser (which was behind this excellent project
[https://news.ycombinator.com/item?id=6498625](https://news.ycombinator.com/item?id=6498625)),
\- reply upvote downvote wikiburner 3 minutes ago | link Is this a well known
text summarization tool?
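
The heuristic guessed at above (cosine similarity to the title plus a bias toward early sentences) can be sketched in a few lines of Python. This illustrates the guess only, not TextTeaser's actual scoring; `rank_sentences`, `position_weight`, and the bag-of-words tokenizer are hypothetical names and choices.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word-count vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(title, sentences, position_weight=0.3):
    """Order sentences by title similarity plus a bonus for appearing early."""
    title_vec = bag_of_words(title)
    n = len(sentences)
    scored = []
    for i, sentence in enumerate(sentences):
        title_score = cosine(title_vec, bag_of_words(sentence))
        position_score = 1.0 - i / n  # earlier sentences score higher
        scored.append((title_score + position_weight * position_score, sentence))
    return [s for _, s in sorted(scored, key=lambda p: p[0], reverse=True)]


# Toy input, made up for illustration.
page = [
    "Hacker News new threads comments.",
    "TextTeaser is a summarization algorithm.",
    "Reply upvote downvote.",
]
ranking = rank_sentences("TextTeaser summarization algorithm", page)
```

Under a scheme like this, a navigation line that happens to repeat the title and sits at the top of the page would score well, which would be consistent with the first bullet in the results above.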

~~~
MojoJolo
Hahaha. It sucks. The algo is not meant for this kind of website. Try out
news articles! :)

~~~
draugadrotten
[http://www.textteaser.com/s/T4PQ1s](http://www.textteaser.com/s/T4PQ1s)

Not sure it captures the essence of the source article's argument. The fourth
bullet makes no sense at all. I can't see it being useful at this stage.

Can you provide us with a list of articles that it manages to summarize
properly?

It's great to see the project on github though. I look forward to seeing it
improved over time. Thanks for sharing.

------
drakaal
But it doesn't do well with sentence disambiguation. And the summaries aren't
particularly good.

This isn't even on par with Summly which was pretty hacked together.

[https://www.mashape.com/stremor](https://www.mashape.com/stremor)

Creates MUCH better summaries and comes with all the stuff to separate Content
from the web template.

If you contact Stremor there is also a version that scores every sentence for
importance on a scale of 0-100 and maintains HTML so that you can return
summaries of any length and still have images and other styling maintained.

( [http://www.tldrstuff.com](http://www.tldrstuff.com) has several ways you
can play with the tech )

------
andrewcooke
[https://github.com/MojoJolo/textteaser/blob/master/src/main/...](https://github.com/MojoJolo/textteaser/blob/master/src/main/scala/com/textteaser/summarizer/Summarizer.scala)

you post non-idiomatic(?) scala in a comment to explain what you are doing, i
think? not a criticism of textteaser (which was behind this excellent project
[https://news.ycombinator.com/item?id=6498625](https://news.ycombinator.com/item?id=6498625)),
but seems to raise questions about the language...

~~~
MojoJolo
You are right. I think Scala is a good language and handles functional
programming well. But the code is so abstracted that even I might not get
what it is doing. I just placed the comment there as a reminder for me, and
also so everyone else can easily get what that piece of code is doing.

~~~
fedesilva
Curiously, I find it easier to read the Scala code than the commented pseudo
code. I've been experiencing this a lot lately. It seems I am losing my
ability to reason about code that loops explicitly.

One minor nitpick that can be of help when dealing with tuples: A partial
function ( {case xxx => yyy} ) is a Function1 so you can use it with map and
filter. This way you can deconstruct tuples into names and avoid using _1, _2,
etc. { case (name, value) => blah }

[https://github.com/MojoJolo/textteaser/blob/master/src/main/...](https://github.com/MojoJolo/textteaser/blob/master/src/main/scala/com/textteaser/summarizer/Summarizer.scala#L79)
could be made more readable by giving names to the tuple elements.

Thanks for publishing this code. It yields impressive results.

------
srin
I've been interested in how it works since I first saw it! Can't wait for the
documentation. Though I think I'm going to learn scala just to read through
this. Thanks for putting it up!

~~~
wikiburner
Is this a well known text summarization tool? I hadn't heard of it before this
post.

~~~
MojoJolo
Hi, I don't want to say it's well known. But it has been on HN a few times.

[https://news.ycombinator.com/item?id=6498625](https://news.ycombinator.com/item?id=6498625)

[https://news.ycombinator.com/item?id=6049873](https://news.ycombinator.com/item?id=6049873)

In TC:

[http://techcrunch.com/2013/10/06/textteaser-lets-developers-...](http://techcrunch.com/2013/10/06/textteaser-lets-developers-integrate-text-summarization-into-their-apps-and-sites/)

~~~
wikiburner
Yep, pretty well known!

Anyway, thanks for open sourcing - really cool.

------
ape4
In most news articles the first paragraph is already a summary.

------
ytadesse
Jolo, this is great! What is the implication for your API now? I notice that
it's still available on Mashape and you're still charging a fee for it.

~~~
MojoJolo
Hi! I will still retain the API in Mashape. That is for the developers that do
not want the hassle to deploy it in their own servers. On the other hand, the
open source code is for devs to check out the algo, hopefully improve and
contribute to TextTeaser. If they want to use it and deploy it on their own,
they are free to do so. :)

Think MongoHQ for MongoDB.

~~~
ytadesse
Great! You're a good man.

------
cheshire137
Really wish there was a way I could test the API without giving my CC info to
Mashape. Even for the Freemium plan, I can't do a single request without
giving payment info. Thus, I'm skipping this API, despite how cool it looks.

 _Edit:_ the main TextTeaser web site is down right now, which is why I went
straight to the API to test.

~~~
ismaelc
Hey, you can contact mojojolo in Mashape through the Contact Now button at the
bottom of this page
[https://www.mashape.com/mojojolo/textteaser#!pricing](https://www.mashape.com/mojojolo/textteaser#!pricing)

He can set up a limited free private API for you to test. Let me know if you
have questions about this process - chris@mashape.com

------
natch
Cool.

What is the structure of the sent.model file inside the corpusEN.bin zip
archive?

It's a strikingly small file for something called corpus. Say I have a larger
corpus, or a corpus in a different language, how would I go about building one
of these sent.model files with more data?

~~~
MojoJolo
The corpusEN.bin file is the training data provided by OpenNLP which I used to
split sentences
([http://opennlp.sourceforge.net/models-1.5/](http://opennlp.sourceforge.net/models-1.5/)).
It's not the training data used for summarization.

------
9diov
Can you provide a bit more details about your approach? Are you using machine
learning or just simple scoring based on some heuristics? From the look of the
source code it seems to be the latter to me.

~~~
MojoJolo
It's mostly statistical (simple scoring). But as you can see in these lines of
code:
[https://github.com/MojoJolo/textteaser/blob/master/src/main/...](https://github.com/MojoJolo/textteaser/blob/master/src/main/scala/com/textteaser/summarizer/Summarizer.scala#L24-L37)
I keep track of the keywords used by the blog and category before. Through it,
TextTeaser employs a little bit of machine learning to improve the quality of
the results.
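
One way to picture "keeping track of keywords by blog and category" is the Python sketch below. It is an assumption-laden illustration, not the linked Scala code: the `KeywordHistory` class, its methods, and the 1.5 weighting are all hypothetical.

```python
from collections import Counter, defaultdict


class KeywordHistory:
    """Remembers how often each keyword appeared for a blog and a category."""

    def __init__(self):
        self.by_blog = defaultdict(Counter)
        self.by_category = defaultdict(Counter)

    def record(self, blog, category, keywords):
        """Store the keywords of an article that has just been summarized."""
        self.by_blog[blog].update(keywords)
        self.by_category[category].update(keywords)

    def score(self, blog, category, word, article_counts):
        """Combine frequency in this article with frequency seen in the past."""
        article_total = sum(article_counts.values()) or 1
        blog_total = sum(self.by_blog[blog].values()) or 1
        category_total = sum(self.by_category[category].values()) or 1
        article_score = article_counts.get(word, 0) / article_total
        blog_score = self.by_blog[blog][word] / blog_total
        category_score = self.by_category[category][word] / category_total
        # Weighting the current article more heavily is an assumption.
        return 1.5 * article_score + blog_score + category_score


# Toy data, made up for illustration.
history = KeywordHistory()
history.record("blogA", "tech", ["scala", "nlp", "scala"])
article = Counter({"scala": 1, "python": 1})
seen_before = history.score("blogA", "tech", "scala", article)
never_seen = history.score("blogA", "tech", "python", article)
```

A word that the same blog or category has used often before gets a higher score than an equally frequent word with no history, so the scoring shifts as more articles are processed.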

------
milkmanjr
Very cool. What made you open source it?

Also I remember it being a bit pricier to use the API. What made you lower
the price? I'm tempted to hook this up to my app right now.

------
unknownian
There's a certain euphoria I get when I see a different color on GitHub than
the normal ruby, python, shell, and JS.

~~~
MojoJolo
Just checked it in Github. It's 100% Scala. :)

------
iamtechaddict
Why did you use Scala to build this when Python could have been a good alternative?

~~~
SkyMarshal
Why use Python when Scala is a good alternative?

~~~
iamtechaddict
because NLTK is a very strong toolkit for natural language processing and i
haven't found anything comparable in scala.

~~~
SkyMarshal
You can see from the source he's already using Apache OpenNLP. Scala is 100%
interoperable with Java libs, so you have the entire Java ecosystem available,
not just Scala code.

~~~
iamtechaddict
ya i saw that I'm not familiar with OpenNLP. lemme have a look it might solve
my problem, I'm also starting a nlp project using scala :)

------
nwq
How would I go about using this directly from Python, os.system calls?

~~~
level09
you can either rewrite it in python, or use unirest/requests to summarize text
using the API.

------
dangerlibrary
Why would "Philippine" be hard-coded in as a stop word?

~~~
MojoJolo
Didn't manage to remove it.

I created a news reader for Philippine news
([http://www.readborg.com/](http://www.readborg.com/)) using TextTeaser. The
word Philippine appeared most of the time, so I decided to make it a stop
word. I forgot to remove it from the stop words.

------
jgalt212
works good enough for me. I'll give you $15MM for it.

-Marissa

~~~
MojoJolo
This made me laugh. And $15m is big enough for me.

