Most importantly, you can't have the system confidently answer incorrectly.
I just tested this query to evaluate joins: "What is the name of the series with the highest rating?" The answer: 5.8 :/
A typical user will be even less clear than this, asking the question as: "What's the top tv series?"
The ideal system should 1) confirm with the user that they'd actually like to rank TV series by rating, and 2) guess, without user confirmation, that I really want the 'name' of the series.
It's a hard problem but massively valuable if it can be solved.
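To make the ambiguity concrete, here's a minimal sqlite sketch (table and column names are made up) showing the literal reading of "highest rating" next to the one the user actually wants:

```python
import sqlite3

# Hypothetical schema, just to illustrate the two readings of the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (name TEXT, rating REAL)")
conn.executemany("INSERT INTO series VALUES (?, ?)",
                 [("Breaking Bad", 9.5), ("The Wire", 9.3), ("Firefly", 5.8)])

# Literal reading: "highest rating" -> the rating itself, a bare number.
print(conn.execute("SELECT MAX(rating) FROM series").fetchone()[0])   # 9.5

# Intended reading: the *name* of the top-rated series.
print(conn.execute(
    "SELECT name FROM series ORDER BY rating DESC LIMIT 1").fetchone()[0])
```

Both are defensible translations of the English, which is exactly why answering one confidently when the user meant the other is so corrosive to trust.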
The way our system worked, broadly, was with a computational linguistics approach rather than an NLP one. We used an augmented bottom-up parsing algorithm to assemble a user's query into an AST, then translated that AST both into SQL (though that changed post-acquisition) and back into a canonical English-like query.
For ambiguous inputs, the parser would produce multiple candidate ASTs which we could offer to the user in the form of search results, for them to select the expression they preferred (solving your point 1).
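A toy sketch of that idea (not the actual parser, and all schema names are made up): an ambiguous word like "top" resolves against several rankable columns, yielding one candidate AST per reading for the user to pick from:

```python
# Hypothetical rankable columns inferred from a schema.
RANKABLE_COLUMNS = ["rating", "votes", "premiere_date"]

def candidate_asts(query: str):
    """Yield one small candidate AST per way of resolving the ambiguity."""
    if "top" in query.lower():
        for column in RANKABLE_COLUMNS:
            yield {"select": "name",        # assumed display column
                   "order_by": column,
                   "direction": "DESC",
                   "limit": 1}

# Each dict would be rendered back into a canonical English-like query
# and shown to the user as a search result.
for ast in candidate_asts("What's the top tv series?"):
    print(ast)
```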
The parser was constructed to use semantic information about the database schema to infer fields in the case of underspecification (handling your point 2). For example, we might understand about the `Series` table that `name` was the "display column", and should be the default field by which record-level results were displayed. Semantic information could, to various degrees, be inferred - always with the option to be manually overridden.
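A minimal sketch of the display-column idea, with hypothetical names (the real representation was surely richer):

```python
# Hand-annotated (or inferred, then overridable) schema metadata.
SCHEMA = {
    "Series": {
        "columns": ["Series_ID", "name", "rating", "premiere_date"],
        "display_column": "name",   # default field for record-level results
    },
}

def to_sql(table, order_by):
    """Fill in the unspecified SELECT field from the schema annotation."""
    display = SCHEMA[table]["display_column"]
    return (f"SELECT {display} FROM {table} "
            f"ORDER BY {order_by} DESC LIMIT 1")

print(to_sql("Series", "rating"))
# SELECT name FROM Series ORDER BY rating DESC LIMIT 1
```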
It seemed clear to us that a schema-agnostic approach to SQL generation would be untenable for a real product. Annotation of the schema over time (starting with basic information like field names, synonyms, and types, then adding semantic information like currency or unit labels) is key to letting a query generator produce sensible results for underspecified queries. For this reason alone, I'm skeptical of the quality of results that something like Photon can produce. Every schema differs in syntax and semantics. Perhaps a big enough corpus of schemas can allow for a generalized model. The more you can constrain the domain in building such a system, the better the results. (Look at Statmuse: pretty impressive query capability over sports facts.)
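Hypothetically, the accumulated annotation might look something like this, with a synonym lookup on top (all field names and synonyms invented for illustration):

```python
# Layered, hand-editable schema metadata: types first, then synonyms,
# then semantic labels like units.
ANNOTATIONS = {
    "Series.name":    {"type": "text",    "synonyms": ["title", "show"]},
    "Series.rating":  {"type": "numeric", "synonyms": ["score"]},
    "Series.revenue": {"type": "numeric", "unit": "USD",
                       "synonyms": ["earnings", "box office"]},
}

def resolve(term):
    """Map a user's word to a schema field via the synonym lists."""
    for field, meta in ANNOTATIONS.items():
        if term == field.split(".")[1] or term in meta.get("synonyms", []):
            return field
    return None

print(resolve("score"))   # Series.rating
```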
It was reasonably cool, but I was a terrible CEO and slowly killed it.
When I have to be precise, I'm better off using SQL directly.
How this works usually depends on your data's provenance. Databases get built with lots of different and variously compatible definitions of "correct": sometimes because RDBMS date and time types are inconsistent, but more often because of decisions made during development. Not all of those definitions have a clean path to a UTC offset, which matters if, for example, you're comparing data from different time zones.
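A tiny Python illustration of why there's no automatic path: once a timestamp's offset has been dropped, no engine can compare it to offset-aware data without someone supplying the missing provenance.

```python
from datetime import datetime, timezone, timedelta

# Two "timestamps" as they might arrive from two databases: one naive
# (local wall-clock time, offset never recorded), one with an explicit offset.
naive = datetime(2023, 6, 1, 9, 0)                                # which zone?
aware = datetime(2023, 6, 1, 9, 0, tzinfo=timezone(timedelta(hours=-5)))

# There is no correct way to order them without knowing where the naive
# value came from; Python refuses outright rather than guess.
try:
    naive < aware
except TypeError as e:
    print(e)
```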
If someone said they'd made a one-size-fits-all query engine that would talk to any arbitrary database from a range of RDBMSs and handle dates and times consistently and reliably, I'd be skeptical. Maybe if it's operated by someone who knows what they're doing, but that's not exactly the target audience for natural language SQL queries.
For instance, "which city has the most farms" yields the somewhat nonsensical "SELECT city.City_ID FROM city GROUP BY city.City_ID ORDER BY COUNT(*) DESC LIMIT 1"
It seems fine for basic queries like "how many total horses are there", but that's basically it.
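For contrast, here's roughly what a correct query for the farms question would look like; a sqlite sketch assuming (this is a guess at the benchmark's schema) a `farm` table with a `City_ID` foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (City_ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE farm (Farm_ID INTEGER, City_ID INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(1, "Springfield"), (2, "Shelbyville")])
conn.executemany("INSERT INTO farm VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2)])

# Count over the *farm* rows and join back for the city's name. The
# generated query grouped the city table by its own primary key, so every
# group had exactly one row and the ordering was meaningless.
row = conn.execute("""
    SELECT city.Name, COUNT(*) AS n
    FROM farm JOIN city ON farm.City_ID = city.City_ID
    GROUP BY city.City_ID
    ORDER BY n DESC LIMIT 1
""").fetchone()
print(row)   # ('Springfield', 2)
```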