
Talk to Your Data: One Model, Any Relational Database - atrudeau
https://blog.einstein.ai/talk-to-your-data-one-model-any-database/
======
DevX101
I've done some previous digging into natural language SQL queries -- there's a
good amount of research around this. But the error rate is always too high for
useful production deployment and the systems I've looked at never handled
ambiguity well. The target user for this is someone who knows nothing about
SQL so ambiguity is guaranteed.

Most importantly, you can't have the system confidently answer incorrectly.

I just tested this query to evaluate joins: "What is the name of the series
with the highest rating?" The answer: 5.8 :/

A typical user will be even less clear than this, asking the question as:
"What's the top tv series?"

The ideal system should be able to 1) confirm with the user that they'd
actually like to evaluate tv series based on rating, and 2) guess without user
confirmation that I really want the 'name' of the tv series.

It's a hard problem but massively valuable if it can be solved.

~~~
kornish
I worked for Cleargraph, a startup which built a natural language query layer
on top of RDBMSes and which we sold to Tableau. We reached the same
conclusions as you: that such a system must properly handle ambiguous queries,
and users need to explicitly understand the results they're viewing.

The way our system worked, broadly, was with a computational linguistics
approach rather than an NLP one. We used an augmented bottom-up parsing
algorithm to assemble a user's query into an AST, then translated that AST
into both SQL (though that changed post-acquisition) and back into a canonical
English-like query.
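A minimal sketch of that dual rendering, in Python. The node type `TopN`, its fields, and the table/column names are all invented for illustration, not Cleargraph's actual design: the point is one AST object that can be printed both as executable SQL and as a canonical English paraphrase the user can sanity-check.

```python
from dataclasses import dataclass

@dataclass
class TopN:
    """One AST node: 'top-n rows of a table, ranked by one column'."""
    table: str     # e.g. "series"
    rank_by: str   # column used for ordering, e.g. "rating"
    display: str   # column shown to the user, e.g. "name"
    n: int = 1

    def to_sql(self):
        return (f"SELECT {self.display} FROM {self.table} "
                f"ORDER BY {self.rank_by} DESC LIMIT {self.n}")

    def to_english(self):
        return (f"the {self.display} of the top {self.n} "
                f"{self.table} ranked by {self.rank_by}")

q = TopN(table="series", rank_by="rating", display="name")
print(q.to_sql())      # SELECT name FROM series ORDER BY rating DESC LIMIT 1
print(q.to_english())  # the name of the top 1 series ranked by rating
```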

For ambiguous inputs, the parser would produce multiple candidate ASTs which
we could offer to the user in the form of search results, for them to select
the expression they preferred (solving your point 1).
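A toy illustration of that candidate generation (the column metadata here is invented): for an underspecified query like "top tv series", enumerate one interpretation per plausible ranking column and present them as selectable results.

```python
# Invented column metadata for a toy schema.
NUMERIC_COLUMNS = {"series": ["rating", "num_episodes", "year"]}

def candidates(table):
    # One canonical-English candidate per plausible ranking column; the
    # user picks the interpretation they actually meant.
    return [f"top {table} by {col}" for col in NUMERIC_COLUMNS[table]]

print(candidates("series"))
# ['top series by rating', 'top series by num_episodes', 'top series by year']
```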

The parser was constructed to use semantic information about the database
schema to infer fields in the case of underspecification (handling your point
2). For example, we might understand about the `Series` table that `name` was
the "display column", and should be the default field by which record-level
results were displayed. Semantic information could, to various degrees, be
inferred - always with the option to be manually overridden.
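As a hypothetical sketch of that kind of annotation (the schema dict, the hint list, and the fallback rules are all invented): infer a display column from column name and type, with any manually set value taking precedence.

```python
# Invented schema metadata: per-table column types plus an optional
# manually assigned display column.
SCHEMA = {
    "series": {
        "columns": {"series_id": "int", "name": "text", "rating": "float"},
        "display_column": None,  # slot for a manual override
    }
}

LABEL_HINTS = ("name", "title", "label")

def display_column(table):
    meta = SCHEMA[table]
    if meta["display_column"]:                # manual override wins
        return meta["display_column"]
    for col, typ in meta["columns"].items():
        if typ == "text" and any(h in col for h in LABEL_HINTS):
            return col                        # inferred from name + type
    return next(iter(meta["columns"]))        # fallback: first column

print(display_column("series"))  # name
```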

It seemed clear to us that a schema-agnostic approach to SQL generation would
be untenable for a real product. Annotation of the schema over time (starting
with basic information like field names, synonyms, and types, then adding
semantic information like currency or unit labels) is key to letting a query
generator produce sensible results to underspecified queries. For this reason
alone, I'm skeptical of the quality of results that something like Photon can
produce. Every schema differs in syntax and semantics. Perhaps a big enough
corpus of schemas can allow for a generalized model, but the more you can
constrain the domain in building such a system, the better the results. (Look
at Statmuse [0]: pretty impressive query capability over sports facts.)

[0]: [https://www.statmuse.com/](https://www.statmuse.com/)

~~~
cableshaft
Statmuse works a lot better than I expected. It needs to be able to answer
which player is the hottest/cutest/most handsome though. Huge gap that needs
to be corrected, tout de suite.

------
mushufasa
I tried "What are the names of all the workers" but it didn't understand, so
it doesn't seem to handle even slightly vague input. (The schema has a workers
table with the fields first name and lastname.)

When I have to be precise, I'm better off using SQL directly.

------
thelazydogsback
IMHO, this type of transformation is best done via traditional symbolic
methods that also provide explanations with support, alternate outputs based
on context/constraints, etc. It's ok for a symbolic transformation system to
have the _help_ of connectionist black-boxes that act as oracles, but the
entire _system_ should not be a black box.

------
lmeyerov
This has been one of the more compelling demo areas for GPT-3 as well (ex:
[https://www.youtube.com/watch?v=WlMHYEFt2uA](https://www.youtube.com/watch?v=WlMHYEFt2uA))
-- are there any open AI efforts here?

------
zerop
Does it work for browsing by dates? A real user may ask for a date range in
various ways, e.g. "give me the previous month", a specific date, a quarter, a
few weeks, etc. The demo didn't have dates in the data, I believe.

~~~
nabilhat
And that's critical! Being able to grab a timespan of data, with a point in
time for each record, is the majority of what business users want.

How this works usually depends on your data's provenance. Databases get built
with lots of different and variously compatible definitions of "correct",
sometimes because RDBMS date and time types are inconsistent, but usually
during development. Not all of those definitions have a clean path to a UTC
offset, for example when you're comparing data from different time zones.
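A small Python illustration of why there's no clean path: a naive timestamp and an offset-aware one can't even be ordered, and the "same" wall-clock time means a different instant once an offset is attached. (The dates and offsets here are arbitrary examples.)

```python
from datetime import datetime, timezone, timedelta

# Same wall-clock time recorded by two systems with different conventions.
naive = datetime(2021, 3, 1, 9, 0)                        # offset unknown
eastern = datetime(2021, 3, 1, 9, 0,
                   tzinfo=timezone(timedelta(hours=-5)))  # UTC-5

# Python refuses to order naive against aware; many databases will instead
# quietly assume an offset, which is worse.
try:
    naive < eastern
except TypeError:
    print("can't compare naive and aware datetimes")

# Only after normalizing to UTC is the comparison meaningful.
print(eastern.astimezone(timezone.utc))  # 2021-03-01 14:00:00+00:00
```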

If someone said they'd made a one size fits all query engine that will talk to
any arbitrary database from a range of RDBMS and handle date/time consistently
and reliably, I'd be skeptical. Maybe if it's operated by someone who knows
what they're doing, but that's not exactly the target audience for natural
language SQL queries.

------
ordx
Looks like it expects the user to know column names. Column names like
`total_price`, `fname`, `user_lname`, etc. would make question answering a bit
awkward.
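A hypothetical sketch of one mitigation: a synonym layer mapping natural phrases onto the awkward physical names, so the user never has to type `user_lname`. (The mapping itself is invented.)

```python
# Invented synonym map from natural phrases to physical column names.
SYNONYMS = {
    "total": "total_price",
    "price": "total_price",
    "first name": "fname",
    "last name": "user_lname",
}

def resolve(phrase):
    # Normalize and look up; None means "ask the user to clarify".
    return SYNONYMS.get(phrase.lower().strip())

print(resolve("Last Name"))  # user_lname
print(resolve("surname"))    # None -> needs clarification or fuzzy matching
```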

------
homarp
demo: [https://naturalsql.com/](https://naturalsql.com/)

~~~
antimatter15
I can't seem to get it to correctly answer any query which references more
than a single table.

For instance, "which city has the most farms" yields the somewhat nonsensical
"SELECT city.City_ID FROM city GROUP BY city.City_ID ORDER BY COUNT(*) DESC
LIMIT 1"

It seems fine for basic queries like "how many total horses are there", but
that's basically it.

~~~
tgb
I couldn't get it to work well either, but note that farms don't have a city,
so your question is impossible to answer.

~~~
jon-wood
A system like this needs to be able to flag when a query can’t be answered,
rather than spit out some arbitrary query. That’s by no means an easy feat,
but bad data authoritatively delivered is worse than no data.
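One cheap version of that flagging, sketched in Python with an invented mini-schema: validate that every table and column an interpretation needs actually exists (including the join column), and refuse rather than guess.

```python
# Invented mini-schema for the demo data. Note there is no farm -> city
# link, so "which city has the most farms" is unanswerable here.
SCHEMA = {
    "farm": ["farm_id", "total_horses"],
    "city": ["city_id", "population"],
}

def answerable(needed):
    """needed: dict of table -> list of columns an interpretation requires."""
    return all(
        table in SCHEMA and set(cols) <= set(SCHEMA[table])
        for table, cols in needed.items()
    )

print(answerable({"farm": ["total_horses"]}))  # True: "how many total horses"
print(answerable({"farm": ["city_id"]}))       # False: the join column is missing
```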

