TextTeaser – An automatic summarization algorithm (github.com)
177 points by MojoJolo 186 days ago | comments


cantrevealname 186 days ago | link

You can cut and paste some text here to try it out:

http://www.textteaser.com

If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing. You have to fill in something for the Title.

-----

MojoJolo 186 days ago | link

I require the title because I need it for the algorithm.

-----

keithpeter 185 days ago | link

Looks like the algorithm is giving weight to h1 and h2 tags in the page markup having just tried it on some of my pages. Is that true or am I imagining it?

If so, I'll have to provide more literal subheadings!

-----

MojoJolo 185 days ago | link

Nope. :) You are imagining it.

-----

keithpeter 184 days ago | link

Fair enough. I must have used more relevant subheadings than I thought!

-----

pests 186 days ago | link

That explains the many tests I just did with random copy and pasted articles. I just typed gibberish into the title. I mean, not all texts need to have a title.

-----

MojoJolo 186 days ago | link

Maybe in the future I can improve it so that it does not require the title. It may produce good results for other types of text, but right now TextTeaser is meant to be used for news articles.

-----

6ren 185 days ago | link

Since the title (headline) of a news article usually summarizes it, TextTeaser arguably is less an article summarizer than a headline expander...

EDIT: it would be nice, from a UX POV, to request the title if it's missing, rather than silently deleting the story... also, you might emphasize the importance of it (because it doesn't seem important at all). Perhaps just labeling it as "headline" or "subject" instead of the generic "title" would help.

-----

diminish 185 days ago | link

Congratulations on open sourcing the library. Do you think it could generate a title as a one-sentence summary?

-----

nichodges 186 days ago | link

Really interesting. I tried it with an article I wrote for Wired - both to see how it handled lengthy content with multiple points, and also how it handled my 'loose' writing style.

Really surprised with both the quality and succinctness of the result: http://www.textteaser.com/s/t1bNud

Well done.

(Also to the project owner - copying the link is borked in Firefox, I had to type it out manually)

-----

keithpeter 185 days ago | link

Yes, tried it on a couple of my own pages. Seems to work ok.

Who is going to be first to couple this library with an RSS feed reader and mailer so that I can get auto-generated summaries of recently written articles sent to my blackberry?

-----

btbuildem 185 days ago | link

bitofnews.com already did it

Personally, I find the "email me news" scheme obnoxious. I get enough emails as it is. I would prefer to see a portal that shows summaries of all the news and lets the user explore.

-----

keithpeter 185 days ago | link

bitofnews.com is interesting but I want to be able to specify which sites to summarise.

-----

hboon 185 days ago | link

I browsed the article and wanted to look at what else you have written but didn't find any references in the URL: http://www.wired.co.uk/news/archive/2013-08/22/filtering-the.... Not even your name. Is that normal for Wired?

-----

MojoJolo 186 days ago | link

Thanks!

But I did not get what you mean by "copying the link is borked in Firefox". What link are you talking about? :)

-----

nichodges 186 days ago | link

Once I was at the screen showing me the summary, in the right hand 'Share' column - I can't copy the text in the link, image, or embed fields. (FF4.0 on OSX 10.8.5)

-----

mcpherrinm 184 days ago | link

I really hope you mean Firefox 24, not 4.

-----

MojoJolo 186 days ago | link

Thanks. Will check it out. :) Never tried it on Firefox.

-----

drakaal 185 days ago | link

A comparison of summaries of http://www.wired.co.uk/news/archive/2013-08/22/filtering-the...

Stremor's TLDR: Pointless reeling off the numbers. The challenge each of us now faces is a brand new one. Our filters were once the media, our friends, and our families. We were aware of, and understood how our filters operated. EdgeRank isn't something that Facebook users understand. Just 20 tweets out of thousands. We need better filters.

Text Teaser: How do we create a balanced diet of content with so much junk being thrown at us? Now, large media organisations create mountains of content, then track our reading habits and online behaviour in order to build a profile of us. A set of favourite news groups; a list of RSS feeds; a well-curated bookmarks folder; these are all filters we once built ourselves. The less we understand our filters, the more we will come to accept that the world they present us with is true. The more control we have over our filters, the more we can understand what we're not seeing.

Text Teaser goes over 350 characters, which is the established "people can't sue you for stealing it" limit... So also weigh that when deciding which you like better.

-----

nemo1618 185 days ago | link

Hey guys, I made a userscript at HackMIT last weekend that adds article summaries to the HN front page. It doesn't use the TextTeaser API (for the time being, at least) but the summaries seem to come out about the same anyway.

Check it out here: https://github.com/lukechampine/ADHN

-----

MojoJolo 186 days ago | link

If you guys want to try out TextTeaser, you can check out the website (http://www.textteaser.com/). Or try the API via Mashape (https://www.mashape.com/mojojolo/textteaser).

-----

yelnatz 185 days ago | link

Are you using LSA?

-----

MojoJolo 185 days ago | link

What do you mean by LSA? Are you referring to this: http://en.wikipedia.org/wiki/Latent_semantic_analysis

If it is, I'm not using it. :)

-----

sinzone 185 days ago | link

You mean do you support SLA?

-----

amatsukawa 185 days ago | link

No, I believe he actually meant LSA = latent semantic analysis, which is an algorithm used to extract topics.

I am also curious about how the NLP/ML parts are implemented, as it's claimed by the README on github. Briefly scanned the code but didn't really spot it.

-----

MojoJolo 185 days ago | link

It's more of statistical NLP and a bit of machine learning. The algorithm can be found here: https://github.com/MojoJolo/textteaser/blob/master/src/main/...

-----

BjoernKW 185 days ago | link

This sure looks interesting. What are the theoretical foundations of this? As for SBS I found this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222... . I couldn't find anything relevant on DBS, though.

At a cursory glance the algorithm seems like a variation of Luhn's abstract algorithm.

-----

MojoJolo 185 days ago | link

You are spot on about the paper that I referenced. As you can see in section 4.3 on page 3, the paper mentions two algorithms for sentence selection: Summation-Based Selection and Density-Based Selection, which are SBS and DBS respectively.
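
For readers who have not opened the paper, the two selection schemes could be sketched roughly like this. It is a minimal illustration, not TextTeaser's actual code: the function names, the `keyword_scores` mapping, and the exact normalization in `dbs` are assumptions based on one reading of the paper's description.

```python
def sbs(words, keyword_scores):
    """Summation-based: average keyword score over all words in the sentence."""
    if not words:
        return 0.0
    return sum(keyword_scores.get(w, 0.0) for w in words) / len(words)


def dbs(words, keyword_scores):
    """Density-based: reward keywords that sit close to each other."""
    # Positions and scores of the words that are keywords.
    hits = [(i, keyword_scores[w]) for i, w in enumerate(words) if w in keyword_scores]
    k = len(hits)
    if k < 2:
        return 0.0
    total = 0.0
    # Each pair of neighbouring keywords contributes more when the gap is small.
    for (pos_a, score_a), (pos_b, score_b) in zip(hits, hits[1:]):
        total += (score_a * score_b) / (pos_b - pos_a) ** 2
    # Assumed normalization by the number of keywords found.
    return total / (k * (k + 1))
```

With the same two keywords, a sentence where they are adjacent scores higher under `dbs` than one where they are separated by filler words, while `sbs` only looks at the average.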

-----

agibsonccc 185 days ago | link

What made you pick this representation in particular? I'm kind of curious what different kinds of algorithms you might have looked at.

Summarizing only blog posts seems a bit limiting to me. (Btw, I'm not trying to be negative, congrats on your success! TextTeaser looks great!)

I implemented a custom version (mainly changed the scoring scheme to include TF/IDF of words for initialized scoring) of TextRank and loved it.

The main thing I liked about it was how general it was. Sentences are nodes and the edges between them are weighted by the words they share. Then you basically use PageRank to rank the sentences according to the graph representation.

[1] http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf
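
A rough sketch of the sentence-ranking idea, for anyone who wants to see it in code. It assumes the sentence-extraction setup in which sentences are graph nodes and edge weights come from word overlap; the function names, the damping factor of 0.85, and the fixed iteration count are illustrative choices, not taken from the custom version described above.

```python
import math


def similarity(a, b):
    """Word overlap between two tokenized sentences, normalized by length."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences have a single word
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence using a weighted PageRank iteration."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores
```

Sentences that share words with many other sentences accumulate score, while a sentence with no overlap stays at the baseline of `1 - damping`.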

-----

MojoJolo 185 days ago | link

Hi, I focus on blog posts because I don't want it to be too broad. TextTeaser is my research for my graduate studies, and broader research is harder to accomplish. But that doesn't mean it can't be used for other types of text. It can still be used. It's just optimized for news.

I'm a little bit familiar with TextRank because I stumbled upon it when I was doing my research. I also read about several other algorithms but forgot what they are called.

-----

agibsonccc 185 days ago | link

Ahh very cool! Thank you for the insight. I could see where that would be applicable then. Using comments as features is a very neat concept.

News is the most broadly applicable use for this so leveraging that isn't a bad thing. There's always a trade off of broad applicability vs overfitting for a particular case to get better results.

Thanks for the insight! Again great work.

-----

aswanson 185 days ago | link

Any suggestions on papers for Luhn's abstract algorithm? I hadn't heard of it before.

-----

BjoernKW 185 days ago | link

There you go:

https://text-analysis.googlecode.com/files/luhn58.pdf

http://dl.acm.org/citation.cfm?id=1662360
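
As a quick orientation before reading the paper, Luhn's sentence scoring can be sketched as below. This is a simplified illustration: the set of significant words is assumed to be computed beforehand (for example from word frequencies), and `max_gap=4` is an assumed value for how many insignificant words may separate two significant ones within a cluster.

```python
def luhn_score(words, significant, max_gap=4):
    """Score a tokenized sentence by its densest cluster of significant words."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current cluster
        else:
            # Close the cluster: (significant words)^2 / cluster span.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1  # begin a new cluster
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))
```

The highest-scoring sentences are then taken as the abstract.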

-----

drakaal 185 days ago | link

This was also the foundation for Summly. The problem is that it is flawed. It doesn't take into account emotion or emphasis.

What is the most important sentence in this:

Drakaal is a poopy head. He often posts to hackernews and calls people an idiot. When this happens I get mad.

The premise is "Drakaal is a poopy head", so that is the most important, but it doesn't inform the user. What he does is the most telling of the sentences, but with "He" as the first word you can't actually make sense of the content without the prior sentence. The last sentence is the least important for understanding the content.

It is important to know that sentence number 2 is the most informative, but that to figure out what it means requires Sentence 1.

When computing the results of a summary you have to weigh sentence dependencies, density of information, amount of emotion expressed and number of characters available to you.

And Keywords aren't enough, you need noun entities and the ability to tell the relationships of words so that you know "Cars, Trucks, and Automobiles" are all the same concept in many contexts.

-----

hnriot 186 days ago | link

if you paste this thread into the demo you get not very encouraging results. I haven't looked at the code but I suspect they find the sentences with most (cosine) similarity to the title and bias towards early sentences.

results:

- Hacker Newsnew | threads | comments | ask | jobs | submit hnriot (1618) | logout upvote TextTeaser – An automatic
- summarization algorithm (github.com)
- If you leave the "Title" field empty and click "Summarize", nothing happens -- which I thought was very confusing.
- reply upvote downvote MojoJolo 1 hour ago | link I require the title because I need it for the algorithm.
- not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625),
- reply upvote downvote wikiburner 3 minutes ago | link Is this a well known text summarization tool?
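
The guess above (similarity to the title plus a bias towards early sentences) might look something like the following. This is only a sketch of that hypothesis, not TextTeaser's scoring: the function names and the `position_weight` parameter are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def rank_sentences(title, sentences, position_weight=0.5):
    """Order sentence indexes by title similarity plus a bonus for early position."""
    n = len(sentences)
    scored = []
    for i, sent in enumerate(sentences):
        position_bonus = (n - i) / n  # 1.0 for the first sentence, decreasing after
        scored.append((cosine(title, sent) + position_weight * position_bonus, i))
    return [i for _, i in sorted(scored, reverse=True)]
```

A scheme like this would explain why pasted page chrome near the top of the input, or any line echoing the title, ends up in the summary.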

-----

MojoJolo 186 days ago | link

Hahaha. It sucks. The algo is not meant for this kind of website. Try out news articles! :)

-----

draugadrotten 185 days ago | link

http://www.textteaser.com/s/T4PQ1s

Not sure it captures the essence of the source article's argument. The fourth bullet makes no sense at all. I can't see it being useful at this stage.

Can you provide us with a list of articles that it manages to summarize properly?

It's great to see the project on github though. I look forward to seeing it improved over time. Thanks for sharing.

-----

PLejeck 185 days ago | link

From a quick test, it seems to treat almost every bit of content on a page equally, even elements which are clearly smaller and next to an image.

Might I recommend taking CSS styles into account? Large text is usually headlines, <strong> text is usually important, and darker greys generally suggest a side comment. Would be much easier if everybody used <aside> and <h1> but even in 2013 that's too high an expectation.

-----

MojoJolo 185 days ago | link

You are right, I'm not taking HTML tags into account. That is because I extract the text beforehand using Python Goose. In that sense, only the text is fed into the algorithm, without any HTML tags.

-----

nubela 185 days ago | link

Try https://github.com/visualrevenue/reporter :) I'm looking at your service now and it is really massively awesome. Can I ask, if you are considering monetizing it, or going the venture-path (boo)? I ask this because I'm curious on the viability of using your service/library on a long-term project.

-----

ismaelc 185 days ago | link

He's monetizing it as an API here https://www.mashape.com/mojojolo/textteaser

-----

andrewcooke 186 days ago | link

https://github.com/MojoJolo/textteaser/blob/master/src/main/...

you post non-idiomatic(?) scala in a comment to explain what you are doing, i think? not a criticism of textteaser (which was behind this excellent project https://news.ycombinator.com/item?id=6498625), but seems to raise questions about the language...

-----

shoo 185 days ago | link

that patch of code could be restructured a bit to make it more readable, here are a couple of small suggestions that jump out at me:

1. perhaps `.reduceLeft(_ + _)` could be replaced with the use of a `sum` function or method (assuming one exists in scala?)

2. if the `topKeywords` collection returned a default value with a `.score` of 0 when queried with a key it doesn't contain, the headOption getOrElse null match null would not be necessary.

e.g. in python it might look something like this:

    from collections import defaultdict, namedtuple

    # Only `score` is shown; the other keyword fields are elided.
    Keyword = namedtuple('Keyword', ['score'])

    # Unknown words map to a default keyword with a score of 0.
    top_keywords = defaultdict(lambda: Keyword(score=0))

    def sbs(words):
        if words:
            return (1.0 / len(words)) * sum(top_keywords[w].score for w in words)
        else:
            return 0.0

(apologies for making superficial comments about the code. the algorithm itself certainly seems interesting)

-----

MojoJolo 186 days ago | link

You are right. I think Scala is a good language and handles functional programming well. But the code is so abstracted that even I might not get what it is doing. I just placed the comment as a reminder for me, and also so that everyone else can easily get what that piece of code is doing.

-----

fedesilva 185 days ago | link

Curiously, I find it easier to read the Scala code than the commented pseudocode. I've been experiencing this a lot lately. It seems I am losing my ability to reason about code that loops explicitly.

One minor nitpick that can be of help when dealing with tuples: A partial function ( {case xxx => yyy} ) is a Function1 so you can use it with map and filter. This way you can deconstruct tuples into names and avoid using _1, _2, etc. { case (name, value) => blah }

https://github.com/MojoJolo/textteaser/blob/master/src/main/... could be made more readable by giving names to the tuple elements.

Thanks for publishing this code. It yields impressive results.

-----

ape4 186 days ago | link

In most news articles the first paragraph is already a summary.

-----

cheshire137 185 days ago | link

Really wish there was a way I could test the API without giving my CC info to Mashape. Even for the Freemium plan, I can't do a single request without giving payment info. Thus, I'm skipping this API, despite how cool it looks.

Edit: the main TextTeaser web site is down right now, which is why I went straight to the API to test.

-----

ismaelc 185 days ago | link

Hey, you can contact mojojolo in Mashape through the Contact Now button at the bottom of this page https://www.mashape.com/mojojolo/textteaser#!pricing

He can set up a limited free private API for you to test. Let me know if you have questions about this process - chris@mashape.com

-----

ytadesse 186 days ago | link

Jolo, this is great! What is the implication for your API now? I notice that it's still available on Mashape and you're still charging a fee for it.

-----

MojoJolo 186 days ago | link

Hi! I will still retain the API in Mashape. That is for the developers who do not want the hassle of deploying it on their own servers. On the other hand, the open source code is for devs to check out the algo, and hopefully improve and contribute to TextTeaser. If they want to use it and deploy it on their own, they are free to do so. :)

Think MongoHQ for MongoDB.

-----

ytadesse 186 days ago | link

Great! You're a good man.

-----

natch 185 days ago | link

Cool.

What is the structure of the sent.model file inside the corpusEN.bin zip archive?

It's a strikingly small file for something called corpus. Say I have a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?

-----

MojoJolo 185 days ago | link

The corpusEN.bin file is the training data provided by OpenNLP which I used to split sentences (http://opennlp.sourceforge.net/models-1.5/). It's not the training data used for summarization.

-----

srin 186 days ago | link

I've been interested in how it works since I first saw it! Can't wait for the documentation. Though I think I'm going to learn scala just to read through this. Thanks for putting it up!

-----

wikiburner 186 days ago | link

Is this a well known text summarization tool? I hadn't heard of it before this post.

-----

MojoJolo 186 days ago | link

Hi, I don't want to say it's well known. But it has made it onto HN a few times.

https://news.ycombinator.com/item?id=6498625

https://news.ycombinator.com/item?id=6049873

In TC:

http://techcrunch.com/2013/10/06/textteaser-lets-developers-...

-----

wikiburner 186 days ago | link

Yep, pretty well known!

Anyway, thanks for open sourcing - really cool.

-----

dangerlibrary 186 days ago | link

Why would "Philippine" be hard-coded in as a stop word?

-----

MojoJolo 186 days ago | link

Didn't manage to remove it.

I created a news reader for Philippine news (http://www.readborg.com/) using TextTeaser. The word Philippine appears most of the time and I decided to make it as a stop word. Forgot to remove it in the stop words.
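
To illustrate why a domain-specific stop word can help, here is a small sketch of keyword counting with a stop list. The stop list and the `top_keywords` helper are hypothetical and much shorter than anything a real summarizer would use.

```python
from collections import Counter

# Illustrative stop list only; "philippine" mirrors the domain-specific entry
# discussed above and is not meant as a general-purpose stop word.
STOP_WORDS = {"the", "a", "of", "in", "and", "philippine"}


def top_keywords(words, limit=10, stop_words=STOP_WORDS):
    """Return the most frequent non-stop words with their counts."""
    counts = Counter(w.lower() for w in words if w.lower() not in stop_words)
    return counts.most_common(limit)
```

In a corpus made up entirely of Philippine news, the word would otherwise rank as a top keyword in nearly every article and tell you nothing about any one of them. Outside that corpus it is a meaningful word, which is why it stands out in a general-purpose stop list.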

-----

unknownian 186 days ago | link

There's a certain euphoria I get when I see a different color on GitHub than the normal ruby, python, shell, and JS.

-----

MojoJolo 186 days ago | link

Just checked it in Github. It's 100% Scala. :)

-----

9diov 186 days ago | link

Can you provide a bit more details about your approach? Are you using machine learning or just simple scoring based on some heuristics? From the look of the source code it seems to be the latter to me.

-----

MojoJolo 186 days ago | link

It's mostly statistical (simple scoring). But as you can see in these lines of code: https://github.com/MojoJolo/textteaser/blob/master/src/main/... I keep track of the keywords previously used by the blog and category. Through it, TextTeaser employs a little bit of machine learning to improve the quality of the results.

-----

milkmanjr 183 days ago | link

Very cool. What made you open source it?

Also, I remember the API being a bit pricier to use. What made you lower the price? I'm tempted to hook this up to my app right now.

-----

jgalt212 185 days ago | link

works good enough for me. I'll give you $15MM for it.

-Marissa

-----

MojoJolo 185 days ago | link

This made me laugh. And $15m is big enough for me.

-----

iamtechaddict 186 days ago | link

Why did you use Scala to build it when Python could have been a good alternative?

-----

MojoJolo 186 days ago | link

I'm seeing good things about Python and NLTK. But back when I developed the core algo of TextTeaser (a few months ago), I still didn't know Python.

Right now, the TextTeaser website is coded using Python and Flask.

-----

SkyMarshal 186 days ago | link

Why use Python when Scala is a good alternative?

-----

iamtechaddict 185 days ago | link

Because NLTK is a very strong toolkit for natural language processing and I haven't found anything comparable in Scala.

-----

rspeer 185 days ago | link

NLTK's strength is the clarity and flexibility of its code, for when you're experimenting with various processes and representations to find out what works.

If you have a single NLP model that already works, you wouldn't gain anything from rewriting it using NLTK. It would probably just get slower, because you're adding abstractions that you've already shown you don't need.

I say this as a fan of and (once) contributor to NLTK.

-----

SkyMarshal 185 days ago | link

You can see from the source he's already using Apache OpenNLP. Scala is 100% interoperable with Java libs, so you have the entire Java ecosystem available, not just Scala code.

-----

iamtechaddict 185 days ago | link

ya i saw that I'm not familiar with OpenNLP. lemme have a look it might solve my problem, I'm also starting a nlp project using scala :)

-----

kkthnxbye 186 days ago | link

It shouldn't matter what language the OP decided to use, as long as it allows him/her to do whatever he/she set out to do in the first place.

-----

tel 186 days ago | link

Why use Python instead of Scala?

-----

drakaal 185 days ago | link

But it doesn't do well with sentence disambiguation. And the summaries aren't particularly good.

This isn't even on par with Summly, which was pretty hacked together.

https://www.mashape.com/stremor

Creates MUCH better summaries and comes with all the stuff to separate content from the web template.

If you contact Stremor there is also a version that scores every sentence for importance on a scale of 0-100 and maintains HTML so that you can return summaries of any length and still have images and other styling maintained.

( http://www.tldrstuff.com has several ways you can play with the tech )

-----

nwq 185 days ago | link

How would I go about using this directly from Python, os.system calls?

-----

level09 185 days ago | link

you can either rewrite it in python, or use unirest/requests to summarize text using the API.

-----



