I've done some previous digging into natural language SQL queries -- there's a good amount of research around this. But the error rate is always too high for useful production deployment, and the systems I've looked at never handled ambiguity well. The target user for this is someone who knows nothing about SQL, so ambiguity is guaranteed.
Most importantly, you can't have the system confidently answer incorrectly.
I just tested this query to evaluate joins: "What is the name of the series with the highest rating?" The answer: 5.8 :/
A typical user will be even less clear than this, asking the question as: "What's the top tv series?"
The ideal system should be able to 1) confirm with the user that they'd actually like to evaluate tv series based on rating, and 2) guess, without user confirmation, that what I really want is the 'name' of the tv series.
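To make that concrete, here's a minimal sketch in Python of the confirm-then-default flow, assuming made-up schema metadata (the table and column names are just for illustration):

    # Hypothetical handling of "What's the top tv series?"
    SCHEMA = {"series": {"display_column": "name",
                         "rankable_columns": ["rating", "num_episodes"]}}

    def answer_top_question(table, ask_user):
        cols = SCHEMA[table]["rankable_columns"]
        # Point 1: more than one plausible metric, so confirm with the user.
        metric = cols[0] if len(cols) == 1 else ask_user(
            f"Rank {table} by which of these? {cols}")
        # Point 2: safe to guess the display column from schema metadata.
        display = SCHEMA[table]["display_column"]
        return f"SELECT {display} FROM {table} ORDER BY {metric} DESC LIMIT 1"

    # answer_top_question("series", ask_user=input)
    # -> e.g. "SELECT name FROM series ORDER BY rating DESC LIMIT 1"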
It's a hard problem but massively valuable if it can be solved.
You're right! I package public trained models for natural language queries (https://github.com/paulfitz/mlsql/), and the models are bad at saying they don't know. I'm optimistic though. There's significant year-on-year improvement (driven by real progress in NLP), and the training datasets are getting more interesting. There are now conversational datasets (e.g. https://yale-lily.github.io/cosql) where the model is trained to ask follow-up questions, and an explicit goal is "system responses to clarify ambiguous questions, verify returned results, and notify users of unanswerable or unrelated questions". That could be a big win.
Great job compiling the datasets! For the Yale group, I like the goal of their research. If the system can only answer 70% of questions, but it's pretty sure it knows the answer when it claims it does, that can still be very useful in many domains. And the conversational approach is a good one for clarifying ambiguity. Will continue to monitor their progress. Thanks!
I worked for Cleargraph, a startup which built a natural language query layer on top of RDBMSes and which we sold to Tableau. We reached the same conclusions as you: that such a system must properly handle ambiguous queries, and users need to explicitly understand the results they're viewing.
The way our system worked, broadly, was with a computational linguistics approach rather than an NLP one. We used an augmented bottom-up parsing algorithm to assemble a user's query into an AST, then translated that AST into both SQL (though that changed post-acquisition) and back into a canonical English-like query.
For ambiguous inputs, the parser would produce multiple candidate ASTs which we could offer to the user in the form of search results, for them to select the expression they preferred (solving your point 1).
The parser was constructed to use semantic information about the database schema to infer fields in the case of underspecification (handling your point 2). For example, we might understand about the `Series` table that `name` was the "display column", and should be the default field by which record-level results were displayed. Semantic information could, to various degrees, be inferred - always with the option to be manually overridden.
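Not our actual code, but a toy sketch of both ideas with invented names: each candidate interpretation carries its own semantics, renders back into canonical English for the user to pick from, and fills in the display column from schema metadata.

    SCHEMA = {
        "Series": {
            "display_column": "name",          # record-level results shown by name
            "measures": {"rating": "rating", "popularity": "viewer count"},
        },
    }

    def candidates(table, ranking_hint=None):
        """Return (canonical_english, sql) pairs for an underspecified 'top X' query."""
        info = SCHEMA[table]
        out = []
        for col, label in info["measures"].items():
            if ranking_hint and ranking_hint not in (col, label):
                continue
            english = f"{table} with the highest {label}, shown by {info['display_column']}"
            sql = (f"SELECT {info['display_column']} FROM {table} "
                   f"ORDER BY {col} DESC LIMIT 1")
            out.append((english, sql))
        return out

    # candidates("Series") -> two candidates ("highest rating", "highest viewer
    # count"), offered to the user like search results.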
It seemed clear to us that a schema-agnostic approach to SQL generation would be untenable for a real product. Annotation of the schema over time (starting with basic information like field names, synonyms, and types, then adding semantic information like currency or unit labels) is key to letting a query generator produce sensible results for underspecified queries. For this reason alone, I'm skeptical of the quality of results that something like Photon can produce. Every schema differs in syntax and semantics. Perhaps a big enough corpus of schemas can allow for a generalized model, but the more you can constrain the domain in building such a system, the better the results. (Look at Statmuse [0]: pretty impressive query capability over sports facts.)
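To sketch what that annotation layer might look like as data (the field names here are invented, not from any real product):

    # Starts with names, synonyms, and types; picks up semantic labels over time.
    ANNOTATIONS = {
        "orders.total": {
            "type": "decimal",
            "synonyms": ["revenue", "sales", "amount"],
            "unit": {"kind": "currency", "code": "USD"},
        },
        "orders.created_at": {
            "type": "timestamp",
            "synonyms": ["order date", "placed on"],
        },
        "customers.name": {
            "type": "text",
            "role": "display_column",   # default field for record-level output
        },
    }

    def resolve(term):
        """Map a user's word onto an annotated column, if any synonym matches."""
        for col, meta in ANNOTATIONS.items():
            if term == col.split(".")[1] or term in meta.get("synonyms", []):
                return col
        return None   # unknown term -> a chance to ask, rather than guess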
Statmuse works a lot better than I expected. It needs to be able to answer which player is the hottest/cutest/most handsome though. Huge gap that needs to be corrected, tout de suite.
We paid a lot of attention to this exact problem at Delver; talking to potential users, it was a big concern, especially for reporting/BI. We'd never give results until we had a single unambiguous parse of a question, but this required a core that was quite old-fashioned NLP, with machine learning only really being used around the edges. So ultimately we'd always generate a series of parse trees with embedded semantics that composed together, but the parser could look at things like external ontologies or word embeddings to try to work out the exact analogy in the domain for something the user said (and confirm with them that the new language was what they meant).
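To give a flavour of that "work out the analogy, then confirm" step, here's a toy sketch (the vectors and terms are stand-ins, not what we actually used; in practice they came from pre-trained embeddings or an external ontology):

    import math

    VECTORS = {
        "staff":    [0.9, 0.1, 0.0],
        "employee": [0.85, 0.2, 0.0],
        "invoice":  [0.0, 0.1, 0.95],
    }
    DOMAIN_TERMS = ["employee", "invoice"]   # things that exist in this schema

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    def nearest_domain_term(user_word):
        """Find the closest in-domain analogy for an out-of-vocabulary user word."""
        scored = [(cosine(VECTORS[user_word], VECTORS[t]), t) for t in DOMAIN_TERMS]
        score, term = max(scored)
        return term if score > 0.8 else None   # below threshold: ask, don't assume

    # nearest_domain_term("staff") -> "employee"; the UI would then confirm:
    # "By 'staff', did you mean 'employee'?"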
It was reasonably cool, but I was a terrible CEO and slowly killed it.
I tried "What are the names of all the workers" but it didn't understand. So it doesn't seem to take slightly-vague input. (The schema has a workers table with fields first name and lastname)
When I have to be precise, I'm better off using SQL directly.
IMHO, this type of transformation is best done via traditional symbolic methods that also provide explanations with support, alternate outputs based on context/constraints, etc. It's ok for a symbolic transformation system to have the help of connectionist black-boxes that act as oracles, but the entire system should not be a black box.
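Roughly the shape I mean, as a sketch (the "oracle" here is just a placeholder for whatever learned component you trust with that one narrow decision):

    # Symbolic core: every step is a rule application we can explain back to the
    # user. The black box is only consulted for a narrow sub-problem (matching a
    # user's word to a column), never asked to emit the whole query.

    def neural_column_oracle(word, columns):
        """Placeholder for a learned matcher; returns (column, confidence)."""
        return (columns[0], 0.92)

    def build_query(table, user_word, columns):
        steps = []
        col, conf = neural_column_oracle(user_word, columns)
        steps.append(f"matched '{user_word}' to column '{col}' (confidence {conf:.2f})")
        if conf < 0.7:
            return None, steps + ["confidence too low: asked the user instead"]
        sql = f"SELECT {col} FROM {table}"
        steps.append(f"applied rule: simple projection over '{table}'")
        return sql, steps   # SQL plus a human-readable account of how we got it

    # build_query("workers", "names", ["name", "id"])
    # -> ("SELECT name FROM workers", [explanation of each step])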
Does it work for browsing by dates? A real user may ask for a date range in various ways, e.g. give me the previous month, a specific date, a quarter, or the last few weeks. The demo didn't have dates in the data, I believe.
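For what it's worth, the relative-date part is mechanical once the semantics are pinned down; a sketch with just the stdlib (the phrase list is obviously nowhere near complete):

    from datetime import date, timedelta

    def date_range(phrase, today=None):
        """Map a relative phrase to an inclusive (start, end) date range."""
        today = today or date.today()
        if phrase == "previous month":
            end = today.replace(day=1) - timedelta(days=1)       # last day of prev month
            return end.replace(day=1), end
        if phrase == "last quarter":
            q_start_month = 3 * ((today.month - 1) // 3) + 1      # current quarter start
            end = today.replace(month=q_start_month, day=1) - timedelta(days=1)
            start_month = 3 * ((end.month - 1) // 3) + 1
            return end.replace(month=start_month, day=1), end
        if phrase == "last 2 weeks":
            return today - timedelta(weeks=2), today
        raise ValueError(f"don't understand {phrase!r}")

    # date_range("previous month", today=date(2020, 6, 15)) -> (2020-05-01, 2020-05-31)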
And that's critical! Being able to snag a timespan of data, with a point in time for records, is the majority of what business users want.
How this works usually depends on your data's provenance. Databases get built with lots of different and variously compatible definitions of "correct", sometimes because RDBMS date and time types are inconsistent, but more often because of decisions made during development. Not all of those definitions have a clean path to a UTC offset, for example, if you're comparing data from different time zones.
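A trivial illustration of why there's no one clean answer (Python here, but the same issue exists in any stack):

    from datetime import datetime, timezone, timedelta

    # Two tables populated by different apps: one stored wall-clock times with no
    # zone, the other stored an explicit offset. The values look comparable but aren't.
    naive = datetime(2020, 6, 1, 9, 0)                                # "9am"... where?
    aware = datetime(2020, 6, 1, 9, 0, tzinfo=timezone(timedelta(hours=-5)))

    # Comparing them directly raises "TypeError: can't compare offset-naive and
    # offset-aware datetimes". Any query engine has to pick an interpretation for
    # the naive one, and whichever it picks will be wrong for somebody.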
If someone said they'd made a one size fits all query engine that will talk to any arbitrary database from a range of RDBMS and handle date/time consistently and reliably, I'd be skeptical. Maybe if it's operated by someone who knows what they're doing, but that's not exactly the target audience for natural language SQL queries.
I can't seem to get it to correctly answer any query which references more than a single table.
For instance, "which city has the most farms" yields the somewhat nonsensical "SELECT city.City_ID FROM city GROUP BY city.City_ID ORDER BY COUNT(*) DESC LIMIT 1"
It seems fine for basic queries like "how many total horses are there", but that's basically it.
A system like this needs to be able to flag when a query can’t be answered, rather than spit some arbitrary query out instead. That’s by no means an easy feat, but bad data authoritatively delivered is worse than no data.
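Even a crude abstention rule helps, e.g. only answering when one parse clearly wins (a sketch, not how this particular demo works):

    def answer_or_abstain(scored_parses, min_score=0.8, min_margin=0.2):
        """scored_parses: list of (score, sql) pairs, sorted best-first."""
        if not scored_parses:
            return "Sorry, I can't answer that."
        best_score, best_sql = scored_parses[0]
        runner_up = scored_parses[1][0] if len(scored_parses) > 1 else 0.0
        # Abstain when the best parse is weak or barely beats the alternatives.
        if best_score < min_score or best_score - runner_up < min_margin:
            return "Sorry, I'm not confident enough to answer that."
        return best_sql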