Actually, this is a brilliant idea. However, I am not sure if I will use this service for my organization. Why? It's got something to do with (everything to do with!) my data and my trust.
Most likely, you're just converting my questions into queries and running them over my data. Your homepage isn't clear on whether you're exposing just an API or returning results from our data yourselves. But if you offered just an API, we'd still have to process the converted queries and build our own interfaces to display the results. So I'm going to assume you take care of the data processing as well.
But wait: you need access to my data to perform those queries. Let's say I have a million users. That means you could potentially log every single query AND the result of the query into YOUR databases.
This means, if I ask you "What percentage of my users pay the most and are from the United States?" and you run a query over my one million users to find that out, then you have data in your hands that my competitors or third-party advertisers will circle you like sharks for. Meanwhile, all I can do at that point is hope that you won't sell my data to them, which puts me and my data in a pretty vulnerable position.
Not to say this is a bad idea. It's a brilliant idea, but I'm not sure how you're going to earn that trust.
Yup, this is a big issue, and obviously our lives would be easier if everyone was happy for us to just gobble up their data. As I've mentioned elsewhere, we want to enable a scenario where we know your schema and ontology, and can therefore generate the query. That allows us to query on your network, but as you say, there's still a huge trust issue.
I'd love to hear your thoughts on the minimum desirable interaction here - is an API that just translates natural language to, say, SQL useful enough for you to pay for?
> is an API that just translates natural language to, say, SQL useful enough for you to pay for?
Definitely!
If you get the natural language to SQL right, it's orders of magnitude easier for me as a developer to parse some JSON and display the results in an interface than to learn the natural language processing part myself and then try to implement and integrate it!
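Roughly what I mean, as a minimal sketch in Java (the endpoint URL and the response shape here are invented for illustration, not anyone's actual API):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NlToSqlClient {
    // Hypothetical endpoint and response shape, purely for illustration.
    private static final String ENDPOINT = "https://api.example.com/v1/translate";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String body = "{\"question\": \"Which customers bought the most last month?\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Assume a payload like {"sql": "SELECT ..."}; real code would parse
        // it with a JSON library (Jackson, Gson) rather than by hand.
        System.out.println("Response: " + response.body());

        // The trust-friendly part: you run the returned SQL against your own
        // database, so your data never has to leave your network.
    }
}
```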
Yes, an API or library for text-to-SQL is well worth paying for. An enterprise library in Java at something like $2k per install for internal use and $10k for use on a web server would be fairly standard. Bear in mind that a lot of enterprises use ORMs, which you may need to interoperate with.
The tricky bit is going to be your enterprise sales channel; you need someone with the right connections.
Yeah, that's a decent option. We'll already have the problem of supporting software in the customer's environment, so it's not really a huge leap to imagine a fully local version or an appliance.
It might be a completely worthless idea, but I would consider offering a "USB Stick" version rather than an appliance. It is much easier to replace, costs close to nothing, yet it can still be implemented as a self-contained blackbox with no customer serviceable parts.
Total speculation but Java is mentioned and it's very common in Javaland to sell enterprises highly priced components to use on their own codebases. It's not the sexiest way technology like this could be used but it could certainly form the base of a lucrative business. (Indeed, they could do that to make the business worthwhile while still doing something more sexy and modern as a third party service.)
We are indeed a JVM shop. Doing something resembling consultancy is a fallback position for us. We don't want to reject big contracts, obviously, but in many ways the more technically challenging problem we'd like to solve is making an NLP product that's portable between different apps. We think we're getting there for certain sizes and classes of app, but we'll see where the market takes us when we have a saleable product.
Absolutely sell/license the portable component. Doesn't have to be a consultancy thing (although it might have to be in the beginning). Either as a library or as a standalone server like Solr. 99.99% of data in the world that this is relevant for (sales/revenue numbers, finance etc) is locked away and you are never, ever, going to get access to it on servers you control.
Yeah, this is the model that I had in mind. There's a giant enterprise (and government!) market for this sort of stuff who rarely venture out into places as cool as HN ;-) (They can be tricky to sell to though given the time involved and even finding them in the first place.)
1) They often have sprawling, disparate data sources, with schemas designed by deranged people in the 80s or 90s. The cleaner (and admittedly smaller) your data, the easier you can integrate, and to be honest, there's probably a hard limit at some point beyond which we couldn't provide value without doing bespoke customisation.
2) There are some fearsome competitors at the higher level of the market, who offer wider business intelligence suites.
Are there any well-known purveyors? I've been thinking of developing components to sell recently and was looking for some examples of what's already on the market to compare and contrast.
The big assumption here is that the data you'll be querying is, in fact, decently structured. I like the idea, but I've come across too many structures that aren't organized to be queryable at all, because they've been constructed by people without any understanding of relational concepts, or indeed data integrity or normalization.
"If the 3rd character in "customer ID" is 7 or 8, and the start date is after 2007, then they are a "premium" customer, and the maximum order amount doesn't apply, if they're shipping an order to Michigan, Texas or Florida".
"If customer ID is greater than 80000 and the date is after November 2009, then the real customer number should be reduced by 10000, because we had a problem and needed to reuse customer IDs". (meaning, an invoice for customer 12000 on November 2004 is related to a different customer than customer 12000 for an invoice on November 2010).
"If the employee's start date is >2005, then check table 1 for employee data if their last name starts with A-M, and table 2 if their last name starts with N-Z, otherwise check table 0 (legacy) Oh, and in table 4, if the employee ID starts with L, that means "legacy", so use table 0 to find their information, but remove the L".
These are situations I've run into in the last few years, and I'm sure many of you have similar WTFs in your experience. If someone has their data in good, solid, structured formats/tables, natural language syntax might be fun/easy/exciting, but those people can also be served by things like Crystal Reports, some books, and a few hours of learning. The companies that most desperately need NLP->SQL probably also have the worst data.
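To make the second rule concrete, here's a sketch of the cleanup view an integrator would have to write before any NL layer could query that data sensibly (all table and column names are invented):

```java
public class LegacyCustomerMapping {
    // Hypothetical schema. Encodes the rule: for invoices dated after
    // November 2009 with a stored customer ID above 80000, the real customer
    // number is the stored one minus 10000, because IDs were reused.
    // (The exact cutoff date is a guess at "after November 2009".)
    static final String CANONICAL_INVOICES_VIEW = """
        CREATE VIEW canonical_invoices AS
        SELECT invoice_id,
               invoice_date,
               CASE
                   WHEN customer_id > 80000
                        AND invoice_date > DATE '2009-11-30'
                   THEN customer_id - 10000
                   ELSE customer_id
               END AS real_customer_id
        FROM invoices;
        """;
}
```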
Yeah, there's certainly a level of schema insanity beyond which we won't be able to offer a lot of value. We can still in those cases consume data that looks like subject-predicate-object, but the onus would be on integrators to supply that, and then you have security and timeliness issues.
Even people with good data, and the expertise to query it, will often have end users who want the data. The cycle of 'call IT department, ask for data, wait for data' or 'email SaaS provider, request report, wait for report' can be short-circuited in these cases, and I believe that's of value.
I'll certainly publish some anonymous stats about data sources and people's desired integration method (API/SDK/appliance etc), as it's likely to be interesting to other businesses offering a service to integrate in people's environment.
Love the idea of being able to provide NLP to our users in a very low-effort way.
However, we wouldn't be able to disclose even anonymised data, let alone have something communicating with the outside world that was munging our real data. Just from a security attack-vector standpoint, effectively allowing any query would be a deal breaker.
The problem is I can't see much happening in the way of tuning; we would be the clients from asshole land:
"Oh yeah, when I make a query, I don't get good results back."
"OK, let's have a look. What's the query like?"
"Can't tell you that."
"OK, what's the data like?"
"Can't tell you that."
"What can you tell us?"
"System sucks."
But obviously, if someone else is providing data for tuning the NLP stuff to actually work on our data, and we can run the output as an AST, putting it anywhere we want as we would the output from our DSL, I could imagine the business case for paying a few cents per user, per month.
Yup, people are making the privacy/security aspect pretty clear, thanks for confirming that. There are also issues about partitioning of data from one data source that we need to address - only being able to query data on your own user_id etc.
When you talk about tuning, are you saying you'd be unlikely to have time to train the system after initial setup? We're making an iterative model that'll allow you to add new concepts, new sentence structures etc as you go, and we've thought a few times it would be good to expose a log of queries (especially failed ones), and also allow end users to say 'this is wrong/nonsense' whenever they get results.
To be honest, I'd rather the security of the query wasn't handled by your thing. My data source shouldn't allow user Bob to ever be able to see data that isn't intended for him.
Time to train wouldn't be the problem; it would be a case of letting you guys near the data. The lawyers would have kittens. Having nice tuning tools would be a good idea, as it allows us to do it ourselves.
Would the thing knock out an AST, or would it be SQL only? As it stands, one of the benefits of our own DSL (built with Irony) is that we can implement the AST in T-SQL or just C# code against POCOs.
Well, internally we're just passing around bits of lisp before building the query. Part of the aim in gathering data through our survey is to get a feel for what data sources people want, so it's likely we'll have more than just SQL. Given that, I don't see why we couldn't just expose a homebrew, abstract structure as well, if not the internal representation.
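To illustrate, here's a toy sketch of what such an abstract structure might look like (all names invented; this is not our actual internal representation):

```java
// Toy query AST: invented names, purely illustrative.
interface QueryNode {}

record Select(QueryNode source, QueryNode filter) implements QueryNode {}
record Table(String name) implements QueryNode {}
record Compare(String column, String op, Object value) implements QueryNode {}
record And(QueryNode left, QueryNode right) implements QueryNode {}

class SqlRenderer {
    // One possible consumer renders the tree to SQL; an integrator could
    // just as easily walk the same tree and build C# expressions or a DSL.
    static String render(QueryNode node) {
        if (node instanceof Select s) {
            return "SELECT * FROM " + render(s.source())
                    + " WHERE " + render(s.filter());
        } else if (node instanceof Table t) {
            return t.name();
        } else if (node instanceof Compare c) {
            return c.column() + " " + c.op() + " " + c.value();
        } else if (node instanceof And a) {
            return "(" + render(a.left()) + " AND " + render(a.right()) + ")";
        }
        throw new IllegalArgumentException("Unknown node: " + node);
    }

    public static void main(String[] args) {
        // "customers in Texas who spent more than 100"
        QueryNode q = new Select(new Table("customers"),
                new And(new Compare("state", "=", "'TX'"),
                        new Compare("total_spend", ">", 100)));
        System.out.println(render(q));
        // -> SELECT * FROM customers WHERE (state = 'TX' AND total_spend > 100)
    }
}
```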
Also, in my distant C# days I was very impressed with Irony, glad it's still around.
I think your three examples at the bottom (Data -> SQL -> Delver query) should all use the same example. In the first table it's some kind of timesheet, in your SQL you're making a list of people and their number of purchases, and in the third the Delver query is asking a number of unrelated questions.
I realize you might be aiming this application at people who can't actually read the SQL, but you're also saying "You're drowning in tedious reports", implying they do use SQL and that this is meant to replace it.
So make the examples consistent and give people some insight into how this works and what actual SQL query a question generates.
You're right, the examples don't really tell a story. Thanks for the feedback.
We need to work to clarify the message a little, because we're in the situation where we're simultaneously targeting IT departments and developers who would benefit from removing the burden of ad-hoc reports, but also their end-users who want the data. I'll give that some thought.
Pretty cool. I became interested in this topic when I read the paper on Natural Language Query System for RDF Repositories[1], which mapped NL queries to a PIM ontology using SPARQL, but alas, I never explored it further.
It's nice to know the technology is becoming available to the average programmer like myself ;)
You can certainly go a long way with pretty naive models that take advantage of RDF[1][2], because you can just match words in a query to those in the ontology. Our model is a little more complicated, as we need to support everything from simple things like 'not' to negate part of a query, to more complex stuff like 'more recently', and compositional queries (what products have sold more than products that are red, tricky stuff like that).
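As a rough illustration of that naive approach, here's token-to-ontology matching in a few lines (the lexicon is made up for the example; a real RDF setup would derive it from the ontology's own labels):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class NaiveOntologyMatcher {
    // Made-up word-to-concept lexicon, purely illustrative.
    static final Map<String, String> LEXICON = Map.of(
            "customers", "ex:Customer",
            "orders", "ex:Order",
            "spent", "ex:totalSpend",
            "country", "ex:country");

    public static void main(String[] args) {
        String query = "customers by country who spent the most";
        List<String> concepts = Arrays.stream(query.toLowerCase().split("\\s+"))
                .filter(LEXICON::containsKey)
                .map(LEXICON::get)
                .toList();
        // Words with no ontology match ("by", "who", "the", "most") are just
        // dropped, which is exactly why a naive model can't handle negation
        // ("not") or comparatives ("more recently").
        System.out.println(concepts); // [ex:Customer, ex:country, ex:totalSpend]
    }
}
```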
As useful as this Facebook Graph Search-inspired service may sound, I doubt its usefulness. I just prefer to have an expressive non-natural language to specify my needs (especially when retrieving data). This could be useful for mainstream users querying for 'drunk female friends nearby', but again, that's already solved by FB.
Yup, if everyone who wants to get at the data already has the skills to do that, we've got nothing to offer. But many organisations have knowledge workers or end users that won't learn SQL, and don't have people available to regularly run ad-hoc reports, or the time to constantly expand the reporting available on their intranet/admin suite.
So these guys just do queries right? Not full on dialogues that Siri supports, but rather the trivia questions that Siri can resolve using Wolfram Alpha (via Wolfram's own hand-constructed NLP tech)?
Not that this is bad, but it is pretty far away from a general natural language interface where you basically have a conversation with your app.
We are read-only, this is absolutely correct, hopefully the site's not misleading in that respect. We would absolutely _love_ long term to allow commands as well as queries, and become something akin to the Siri third-party API everyone wants (although we're not tackling speech anytime soon).
That said, what you call trivia questions, we call decision support, reporting, and other grown up things. :P
The title of this post was just a bit misleading. I didn't mean to use 'trivia' as a pejorative; it's just what I call the whole "use data X to answer question Y" domain of NLP, which is fairly distinct from dialogue-processing systems. You seem to be basically in competition with Wolfram Alpha, except you focus on custom structured data sources, though they seem to do something here also [1].
Yup, Wolfram are a scary competitor, EasyAsk too. In my head, we're targeting the lower-end of the market - potentially smaller data models, no consultancy face-time. We'll see how realistic that turns out to be.
Is there something along these lines but open source? I have been dreaming of having natural language querying combined with Freebase or similar, with computational endpoints in addition to pure data ones, to build a kind of open source Wolfram Alpha.
The University of Sheffield (Delver's home town!) maintains gate.ac.uk which offers a full-stack semantic text extraction engine. That's unlikely to get you to the stage we're aiming at though.
Yeah, my question was just a tangent. Thanks for the info, and all the best for your project! Is Delver based on GATE, then, or is it a completely separate project? Love to see this kind of application of academic research.
EasyAsk do some cool things, and offer a much wider business intelligence suite as well. I like to think we're targeting smaller enterprises than they are. Essentially, we want to be a self-serve app, and we can't learn and grow the way we want to if we have to send consultants to work with you on your integration (something which EasyAsk will happily charge you six figures to do).
Also, good luck with your app; with Siri and Facebook Graph Search, it's really time this technology hit the mass market. I look forward to sharing more of our apps!
Looks interesting. It would be nice to have a live demo database where you can try out different natural language searches and see the results.
Also interested to know the model: is this a SaaS? If so, perhaps some pricing info would be useful, or how it interfaces with your data, security considerations, etc.
Thanks for the feedback. As mentioned elsewhere, we're very keen to open up with something like a Magento plugin, or something targeting a similar product, just to showcase the querying front-end and mapping tech with a known schema.
Business model-wise, we'll be offering per-seat licensing for internal apps, and likely traffic-based pricing for more public apps. We are a SaaS app - the intelligent bits happen in the cloud. The default option is to integrate one of our agents to gather data, or publish data to us via an SDK (or REST API). Because it's a deal breaker for many people, we're working hard to enable scenarios where your data never leaves your network but we know your schemas and ontology, so the NLP bit happens on our side while the querying happens privately. Obviously this can have an effect on our ability to do entity recognition, so it's an interesting problem.
Mappings aren't entirely monolithic things in our architecture. From the integrator's point of view there's a guided process of connecting to a data source, enriching the physical schema with some ontological data, mapping verb frames to those ontological concepts, etc. We're making this as simple as possible - you need to speak English (English only at the moment, apologies), know your schema, and know the concepts it represents.
We're really hoping to produce a demo integration soon - I've mentioned Magento elsewhere, or something like WordPress, etc. I'll publicise it when it's available.
That should be coming soon - our first product may well be a bespoke implementation for something like Magento as a demo, but the big game is letting anyone with structured data integrate. I'll be absolutely clear - some people's data is going to be difficult to work intelligently with, and in those cases we'll shift much more of the burden to the integrator. The sweet spot is obviously people with fairly clean data models - if you've got a standard Ruby on Rails app using Active Record, for example, your data is pretty friendly.
I think it's going to take work from an integrator in all cases... How is it going to translate "developers free on Friday that know Javascript" into the three-way table join that you are asking for unless you first define "know" and "free" in terms of your database tables?
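For concreteness, here's roughly the SQL that question implies, against a schema I've invented; the point is that "free" and "know" only mean anything once someone has mapped them onto these tables:

```java
public class ExampleMapping {
    // Invented schema: "know JavaScript" maps to the skills join,
    // "free on Friday" to the availability join.
    static final String FREE_JS_DEVS = """
        SELECT e.name
        FROM employees e
        JOIN employee_skills s
          ON s.employee_id = e.id AND s.skill = 'JavaScript'
        JOIN availability a
          ON a.employee_id = e.id
         AND a.day = 'Friday'
         AND a.status = 'free'
        WHERE e.role = 'developer';
        """;
}
```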
Apologies if I implied we just magically work things out; there is indeed a setup process where you help the app learn your ontology and how it maps to your physical infrastructure. This is, however, an iterative process, and you're likely to be able to get off the ground in hours, not days. Some people's data will be beyond us, I freely admit - in those cases you can do the work and supply us with something more or less resembling RDF.
I only object to what appeared to be magic. Now that I know there will be a setup it seems practical and useful. This is a great idea for exploring data, I hope you are wildly successful.