This looks great. I'm building http://dokket.aws.af.cm. It's a database filled with documents from the Federal Communications Commission. From day 1, I've been looking for smart ways to make use of the thousands of documents of unstructured text. The customization you offer seems to be the killer feature for me.
I'll write a Ruby API wrapper if you give me an agreed-upon amount of usage once you settle on pricing.
Feel free to email me (HN name @ gmail) if you're interested or just want to follow up for customer development purposes.
I am also building an application that could use this service, so I tested TextRazor; the results were not good on the two random articles I ran through it.
I really want this to work out, because my application needs this kind of technology to be reliable and accurate. I just processed the article at the link below, and the word "Tesla" was not captured as one of the topics.
http://www.teslamotors.com/blog/most-peculiar-test-drive-fol...
I just tried it again and it worked this time. That is interesting: I don't know if they tweaked it a bit or not :), but I am feeling a bit better now, because my application really needs a reliable, accurate service like this.
No tweaking, promise :) It's possible you hit an inconsistent server first time around, I'll have a dig on our end. Let me know if you have any other problems - toby@textrazor.com.
Hey steeve, we thought there were a few things missing in the competition. We've built a bunch of extra functionality, such as more extensive relation and dependency parsing and contextual entailment generation, and we use all of that to build much more accurate entity and topic recognition, an area where we think the others can be greatly improved on.
We also expose all these results to a Prolog interpreter on our backend and allow you to add custom logic to mash up and extend all of our results, as well as provide a much easier integration experience.
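To give a flavour, here's a rough Python sketch of what a call with custom rules attached might look like (field and extractor names are simplified and illustrative here, not the authoritative parameter list - see the docs for that):

    import requests

    # Illustrative sketch only: parameter names are simplified,
    # the API docs have the real, full list.
    CUSTOM_RULES = """
    % Custom Prolog predicates go here; they run on the backend
    % against the parse, entity and topic results for each document.
    """

    resp = requests.post(
        "https://api.textrazor.com/",
        data={
            "apiKey": "YOUR_API_KEY",
            "extractors": "entities,topics",
            "rules": CUSTOM_RULES,
            "text": "Tesla's most peculiar test drive follow-up...",
        },
    )
    resp.raise_for_status()
    for entity in resp.json()["response"].get("entities", []):
        print(entity.get("entityId"), entity.get("confidenceScore"))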
Totally agree with you on the pricing front, we're still finalising the details there. We're aiming to be fully transparent with both the technical and business side of things.
You can add billing/pricing tiers to your API now using Mashape http://www.mashape.com/ (Disclaimer: I work for Mashape. Let me know if you need help!)
The Stanford NLP tools are very good, and also GPLed, which works for a lot of projects. If the GPL doesn't work for you, the Apache OpenNLP project is also good.
BTW, it is not just about having software packages to use: obtaining and preparing training data is a ton of work. That said, the Stanford NLP and OpenNLP tools come out of the box with trained models for tagging, named entity recognition, etc. For lots of uses, these pre-trained models will work well for you.
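For example, running the pre-trained Stanford NER model through NLTK's wrapper is only a few lines; the paths below are assumptions about where you unpacked the Stanford NER download, so adjust to taste:

    from nltk.tag import StanfordNERTagger

    # Paths assume a local unpacked copy of the Stanford NER
    # distribution; point them at wherever you actually put it.
    tagger = StanfordNERTagger(
        "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",
        "stanford-ner/stanford-ner.jar",
    )
    tokens = "Bill Clinton spoke at the Federal Communications Commission".split()
    print(tagger.tag(tokens))
    # Roughly: [('Bill', 'PERSON'), ('Clinton', 'PERSON'), ('spoke', 'O'), ...]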
The Stanford parser is great, but it isn't really the same thing. The Stanford entity recogniser is limited to the standard types (people, places, companies), whereas we identify and disambiguate into a far richer ontology derived from Wikipedia, and can recognize topic abstractions that aren't explicitly mentioned.
Also we found the Stanford tools (and the other open source NLP tools) were difficult to integrate into "production" apps for various reasons. One big one was performance - we aim to run the full parsing and extraction pipeline on an average news story in a few hundred milliseconds, which can be an order of magnitude faster than the others.
I have been using the free tier (50K API calls per day) of Open Calais for years and have also used it in code examples in three books I have written.
One thing that Open Calais does that I really like is that they attempt to assign a single URI that uniquely identifies each recognized named entity. This is useful because, for example, when it recognizes President Bill Clinton, you get a reference to one unique URI, even if his name or title differs across the processed texts.
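That canonical URI is what makes cross-document work easy. A tiny hypothetical Python sketch of the idea (extract_entities and the URIs are made-up stand-ins for a real Calais client):

    from collections import defaultdict

    # Hypothetical stand-in for a real Calais client: assume each
    # mention comes back paired with a canonical URI (URIs made up).
    def extract_entities(doc_text):
        canned = {
            "doc A": [("President Bill Clinton", "uri:person/bill-clinton")],
            "doc B": [("Mr. Clinton", "uri:person/bill-clinton")],
        }
        return canned[doc_text]

    mentions = defaultdict(set)
    for doc in ["doc A", "doc B"]:
        for surface_form, uri in extract_entities(doc):
            mentions[uri].add(surface_form)

    # Different surface forms collapse onto one URI, so counting and
    # joining entities across documents just works.
    for uri, forms in sorted(mentions.items()):
        print(uri, sorted(forms))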
Thomson Reuters bought ClearForest several years ago, thus acquiring Calais. If you are interested in text mining and haven't experimented with Open Calais, please put it on your TODO list.
Try downloading the distribution of code and data and running it locally. Java stuff, Maven based, easy to run. Use the example code listed on the installation web page to see how to set it up.
Really impressive results. I put in some music reviews, and it did an excellent job of identifying artists, genres, labels, etc. From an API perspective, there's not much out there that competes with this, and nothing with a modern API.
That's exactly what I need to start working again on my algo trader! Seems to be working well with a sample of financial news extracts. Will definitely look into it further, thanks!
This is very impressive stuff. I ran a news article through the demo, and the entity recognition was excellent. Waiting for them to reveal more details on pricing.
Wow, this seems incredible, signed up immediately. My mind is already spinning with all of the cool apps I could make with this. How hard would it be to allow this functionality offline for paid users? You could have some sort of packaged library which phones home to count requests used, but does the processing offline to take network latency out of the equation. Not sure if that's feasible, but it would be great.
Being a big Prolog fan, I think this looks like an awesome product. This sort of textual analysis will become more important as time goes on. As the interest in search technologies grows, I think intelligent search (contextual queries, query answering, clustering, recommendations, meta-data extraction etc.) will start to appear in more end-user products. One question... whereabouts in the UK are you based?
Question: What are people actually doing with technology like this right now? (i.e. who are the people who see this and think.. yay, I'll sign up now!)
It's certainly not going to be ideal for your typical CRUD app. Think about all of the information that is locked inside unstructured text (MS Word docs and PDFs come to mind), and then imagine being able to scan through thousands of documents, find the named entities, and start connecting them together in queries.
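A hypothetical Python sketch of that idea (extract_entities is a placeholder for whichever NER service you end up using):

    from collections import defaultdict
    from itertools import combinations

    # Placeholder NER: swap in a real service (TextRazor, Calais, Stanford...).
    def extract_entities(text):
        known = ["FCC", "Verizon", "AT&T"]
        return [name for name in known if name in text]

    docs = {
        "filing-1.txt": "Verizon asked the FCC to reconsider...",
        "filing-2.txt": "AT&T and Verizon both filed comments...",
    }

    # Index: entity -> documents mentioning it, plus co-occurrence edges.
    doc_index = defaultdict(set)
    edges = defaultdict(int)
    for doc_id, text in docs.items():
        entities = extract_entities(text)
        for name in entities:
            doc_index[name].add(doc_id)
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1

    print(dict(doc_index))   # which filings mention which entities
    print(dict(edges))       # which entities appear together, and how often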
Obvious uses would be any kind of CMS. Investigative journalism is another.
I've recently started exploring the Legal Informatics field. The problems in it are huge and typically involve adding some structure to lots and lots of unstructured text.
Also, Peter, if you do end up reading this, great work on the stuff you do :) Big fan here!
Extracting important keywords from thousands of pages, for example. On our website we let users enter quite a lot of content, and being able to extract keywords and find patterns between users could be key for naming categories and such.
Great work! But I missed the "sentiment analysis" flavour that used to be so popular with the NLP bunch some years ago... In that vein, I did something similar:
Hi man, I tried my website (http://www.ngajakjalan.com), but the analyzer probably reads my inlined JS and doesn't see the "done" version (I use AngularJS). Maybe you could use something like PhantomJS to extract website content "as seen by a human"?
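Something like this might do it, assuming a Python stack and that Selenium's PhantomJS driver fits your setup (the phantomjs binary needs to be on the PATH):

    from selenium import webdriver

    # Fetches the DOM after client-side JS has run,
    # rather than the raw page source.
    driver = webdriver.PhantomJS()
    try:
        driver.get("http://www.ngajakjalan.com")
        rendered_html = driver.page_source  # what a human's browser would see
        print(rendered_html[:500])
    finally:
        driver.quit()
    # Heavily async AngularJS views may still need an explicit wait
    # before reading page_source.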
I don't know if you're just being slammed right now, but I posted a request for pricing and haven't gotten anything. It looks like a nice API to integrate into our system, but it really needs clearer pricing.
Awesome stuff. It would be great to use something like this for suggesting relevant tags of content. Perhaps a WordPress or Drupal plugin to get the ball rolling.
Best of luck!