Hacker News
Show HN: An easy-to-use Text Analysis API – NLP and Machine Learning (aylien.com)
153 points by parsabg on Feb 18, 2014 | 69 comments

Cool stuff! It's nice to see platforms like this that abstract away good algorithms, so that developers can focus on thinking up interesting applications. Open source libs are even better, but pragmatically speaking, I think these types of platforms probably move faster and get better results.

One major competitor (well known to anyone who's looked into this stuff) is Alchemy [1]. I tried a New York Times link [2] on Aylien and Alchemy, and Alchemy performed much better -- in fact, Aylien didn't even successfully find the article body. I'm sure you guys will be iterating on improving the algorithms, but just wanted to flag that as a potential turnoff for anyone comparing your website demo with Alchemy.

Best of luck!

[1] http://www.alchemyapi.com/products/demo/

[2] http://www.nytimes.com/2014/02/18/world/middleeast/bombings-...

thanks for the feedback! as you may know, NYT articles are behind a paywall and fetching them can be problematic. so I believe Alchemy uses the NYT API to fetch articles, which is something we'll look into in future.

FWIW, I tried a different url [1] on both services and I thought the Aylien results were better in this instance.

[1] http://www.football365.com/john-nicholson/9170607/He-Who-Pan....

I've seen this quite a few times (NLP web APIs), and my opinion is that this kind of stuff tends not to scale: to be useful, such web APIs have to be able to process entire articles in a fraction of a second. Although I am not sure (because of the HN storm the API is down), it does not seem this tool will live up to those expectations, either. In the end, my choice has always been to include/wrap an off-the-shelf tool in your own pipeline rather than relying on an external service that might be too slow for end-users and mass mining alike...

What tools would you suggest for doing this? Or even what algorithms to implement for doing this sort of work?

This is a much better Noun Phrase / Entity extractor.


We don't rely on CoreNLP, or NLTK, we have our own sentence disambiguation, and our own part of speech tools. So we are a lot faster.

Our other APIs let you piece together a lot of cool NLP projects with very little code.

These sorts of things are typically better offered as libraries, particularly as the training is usually specific to a corpus, or a particular context.

It would be nice to offer a library with a bootstrapped training set.

Not to mention the sensitivity of the data, its sheer volume, or the effort involved in customizing it for a particular algorithm or input - only for the service to shut down and take your data with it.

Machine Learning as a Service seems Hella Neat, tho.

Sorry, don't understand your last sentence.

It seems to contradict the paragraph before -- ML as a service seems a terrible idea for the reasons you just listed (among others). What's "Hella Neat" about that?

The problem mostly stems from the vast risk you take on from making a large investment in an unstable/unproven platform vendor.

Servers are relatively fungible, given ops automation; it's painful but not the end of the world if you have to migrate away.

But the technology is still relatively immature in that building your own ML service in house - and having it scale, etc - is still a big pain.

I would immensely prefer it if we first brought ML libraries up to a higher level of maturity - as simple as apt-get install and adding `includes ActiveLearning::Bayes` to your models.

But if a client came to me tomorrow and said "there's this great Amazon API that we're thinking of using" I wouldn't consider that insane on first principles.

Unfortunately the web site is still analyzing the example Techcrunch link (it's been 3 min already).

Is something broken? Maybe you could cache some recurring analyses.

I contacted them about this using the live chat on the site. Their servers are melting down but it sounds like they're on it spinning up new instances etc.

sorry, our servers are melting :-) spawning new machines.

You could cache the results of the examples that are on the right ;]

good idea!
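
For what it's worth, a minimal sketch of what caching the demo examples could look like. Everything here is an assumption for illustration: `slow_analyze` is a hypothetical stand-in for whatever the real analysis call is, and the in-process `functools.lru_cache` is just the simplest cache to show, not a claim about how the site is built.

```python
from functools import lru_cache

# Counts how often the expensive path actually runs (for illustration only).
CALLS = {"count": 0}

def slow_analyze(url: str) -> dict:
    """Hypothetical stand-in for the expensive analysis of a URL."""
    CALLS["count"] += 1
    return {"url": url, "entities": [], "summary": ""}

@lru_cache(maxsize=128)
def cached_analyze(url: str) -> dict:
    """Return the stored result for a URL seen before; otherwise compute it once."""
    return slow_analyze(url)
```

Repeated clicks on the same example link would then hit the cache instead of the servers. Note that the cached dict is shared between callers, so it should be treated as read-only.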

Thanks for the update. I'll check it later then!

Hey guys! Congrats, NLP is a huge problem that needs as many minds working on it as possible.

Just tried a few links:



Am I missing something here? It seems like it's just parsing text; I'm not seeing any context (keywords, categories, summaries).

edit: It's giving fantastic results when pasting the raw text! :)

Are you guys using DBpedia? It's giving very similar results to a system I was working on in the past: http://www.zachvanness.com/nanobird_relevancy_engine.pdf

thanks for the feedback. can't reproduce the first issue, what happens when you click on Analyze? do you mind sending us a screenshot?

we do use DBPedia in our Concept Extraction. please have a look at the docs: http://aylien.com/text-api-doc

You're welcome!

Sure thing (when running it on the URLs, I don't get any keywords): http://i.cubeupload.com/zubo4G.png

thanks, keywords are under "Entities".

What do you use for the extraction of entities (if you don't mind saying)? I entered "The Cat in the Hat" is a good book. It didn't recognize any entities. Are you using an ontology for named entity resolution, or just extracting NPs?

a combination of different techniques (NPs, statistical models, dictionary based matching) are used in our EE endpoint.

Another player in this space, from Oxford, UK: http://apidemo.theysay.io/

It does really poorly analyzing a Wiktionary entry like http://en.wiktionary.org/wiki/run or with a Wikipedia article like http://en.wikipedia.org/wiki/Big_O_notation

I was playing around with it and seem to have killed it by pasting the text from this WP article (http://pastebin.com/AtCU7E8H) in and hitting analyze. It's been spinning for a while.

Edit: I see from another response that the server room is on meltdown, I'll wait for a bit.

Maybe somebody will find my pet project useful and relevant: https://github.com/crypto5/wikivector . It uses machine learning with Wikipedia data as the training set, supports 10 languages, and is completely open source.

Do you publish accuracy figures? Any information about what domains your training data is from?

> Do you publish accuracy figures?

we'd love to, but unfortunately some of our main competitors have restrictive terms in their ToS (e.g. http://www.alchemyapi.com/company/terms.html) that prevent us from doing so. we will publish what we can though.

> Any information about what domains your training data is from?

they're mostly trained on general news and social media content (with lots of manual and automated cleanup). drop us an email if you need more details: hello@aylien.com

I don't care what alchemy scores --- I care what _you_ score.

Why can't you just run any of the standard NLP evaluations?

I'm curious - how does a competitor's ToS prevent your company from doing anything?

The competitors don't allow you to benchmark their services, so while you can benchmark your own product you can't compare it to others. For example, from the Alchemy API:


Also this: "publish or perform any benchmark or performance tests or analysis relating to the Service or the use thereof without express authorization from AlchemyAPI;"

Suppose I am evaluating their service, before I decide to buy. I would be breaking these ToS, I guess.

There are more and more text analysis APIs; would you mind comparing your feature set to something like Textrazor (http://www.textrazor.com) or Open Calais?

What is special about your project?

I would also like a comparison. I used Open Calais two years ago for a project, and would definitely use it again if needed.

Edit: A quick glance at the API also shows that there doesn't appear to be much in the way of machine learning. Does this build models for you or is it just to dissect text?

Super nice!

This is a very interesting area... Good to see something new apart from Alchemy and Open Calais!

thanks for the feedback. there's a lot of room for improvement in this space.

"There was a time when men could roam free on earth, free from concrete and tarmac. Now it's all gone to shit."

Classification: arts, culture and entertainment - architecture. (WTF?)

Polarity: positive. (Nope)

Polarity confidence: 0.9994709276706056. (Well...)

Looks pretty rough to me.

Why does that classification elicit a WTF? That seems like a reasonable classification, given how little context the algorithm has about the snippet. It's entirely plausible for that quote to be from a book about how "concrete and tarmac" have impacted modern architecture. There's not really any other hints about what it could be about.

There's no excuse for the polarity though. "Gone to shit" should be a pretty good indicator about the sentiment.

thanks for the feedback. considering the fact that it's still a v1, it surely can be rough in some areas. anyway, here are some thoughts:

- classification is trained on longer texts (mainly news articles) so it won't perform well on shorter texts

- polarity: yeah that's bad - the sentiment is slightly ambiguous so maybe a lower confidence would've justified things
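
To make the confidence point concrete, here is a toy lexicon-based polarity sketch. It is not how Aylien (or any real service) computes sentiment; the word lists are tiny hand-made assumptions and the scoring rule is invented for illustration. It only shows how mixed cues ("free" vs. "shit") could reasonably lead to a low confidence rather than 0.999.

```python
import re

# Tiny hand-made lexicons, purely illustrative (not from any real service).
POSITIVE = {"free", "good", "great", "love"}
NEGATIVE = {"shit", "bad", "terrible", "awful"}

def polarity(text: str) -> tuple[str, float]:
    """Return (label, confidence) from simple counts of cue words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    # No cue words, or an exact tie: nothing to base a decision on.
    if total == 0 or pos == neg:
        return "neutral", 0.0
    label = "positive" if pos > neg else "negative"
    # Confidence is the margin between the two counts, as a share of all cues.
    return label, abs(pos - neg) / total

QUOTE = ("There was a time when men could roam free on earth, "
         "free from concrete and tarmac. Now it's all gone to shit.")
```

On `QUOTE`, "free" is counted twice and "shit" once, so even this naive counter leans positive, but only with a confidence of about 0.33. A real system would need phrase-level cues such as "gone to shit" to get the label itself right.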

Isn't it supposed to analyze something of article length, or at least paragraph length? Seems unfair to slam it for messing up a single sentence.

A bunch of TA libraries (Stemmers, Wordbreakers, etc) ship "free" with Windows that support a ton of different languages. I wish MS would open up the API a bit more.

Clearly broken. Says news.ycombinator.com sentiment is "Positive". All jokes aside, really cool; love the accessibility of the demo.

I posted a couple of paragraphs from a financial blog and the tool interpreted SEC to mean Southeastern Conference.

MVP - Most Valuable Player - it even suggested I use this as a hashtag, although I meant Minimum Viable Product
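
As a rough illustration of why SEC and MVP go wrong, here is a sketch of picking an acronym expansion by overlap with the surrounding words. The sense inventory and cue words below are made-up assumptions, and this is a deliberately naive technique, not what any of the services mentioned here actually do.

```python
import re

# Hypothetical sense inventory: each expansion lists a few context cue words.
SENSES = {
    "SEC": {
        "Securities and Exchange Commission":
            {"filing", "stock", "investors", "regulator", "shares"},
        "Southeastern Conference":
            {"football", "college", "season", "coach", "team"},
    },
    "MVP": {
        "Minimum Viable Product":
            {"product", "startup", "launch", "customers", "feature"},
        "Most Valuable Player":
            {"player", "season", "award", "game", "team"},
    },
}

def expand(acronym: str, context: str) -> str:
    """Pick the expansion whose cue words overlap most with the context."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    senses = SENSES[acronym]
    # On a tie, max() keeps the first sense listed in the inventory.
    return max(senses, key=lambda sense: len(senses[sense] & words))
```

For example, `expand("SEC", "The SEC reviewed the filing after investors sold shares")` picks the financial regulator, because three finance cue words appear in the context and no sports ones do. A tool trained mostly on general news and social media may simply have seen the sports sense more often.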

Should I have not tried it with a 3000 word essay I wrote? It has been beachballing for the last 5 minutes or so.

I probably shouldn't have tried a 12,000 word short story...

How is this superior to Alchemy?


in at least two main areas:

- corpuses: we update our indexes frequently + use higher quality / handpicked corpuses.

- features: our API provides Summarization and Hashtag Suggestion.

and future plans, obviously. hope that helps.

The plural is "corpora".

I looked this up recently and corpuses is also OK, though corpora is by far the most common usage.

I tried bbc.com and nothing shows up. Is it supposed to work on top-level links and summarize?

not really, it works best on homogeneous pieces of content.

OK. Summarizing top-level content (parse headlines and generate a nugget summary) would be a very useful feature, if you can do it.

I can't get it to work, can someone tell me what it's supposed to do?

thanks for the feedback folks. FWIW, here's the documentation (/ NLP crash course!): http://aylien.com/text-api-doc

Annnnnnd that's my thesis sorted. Part of it anyway.

One of the most stunning things I've seen. Good job.

HN - the ultimate DDOS machine

Nah. You can survive #1 on HN; #1 on reddit is another story.

The upside to that is that the online help asked me if there was anything it could do; I responded "The site seems slow." and I got a perfectly appropriate answer.

welp, servers are starting to melt :-)

pretty cool - what languages does your API support?

you mean programming or human languages?

I'd be interested to know which human languages you support

ATM all endpoints except Language Detection (which supports 76 languages) only support English. 8 new languages are on the roadmap.

Thanks! Hope French is on this roadmap!

sell it to a bank $$$

This is incredible!

