
Show HN: Translate English to SQL - paulfitz
https://github.com/paulfitz/mlsql
======
BrentOzar
Ah, this brings back fun memories. Over a decade ago, Microsoft SQL Server had
English Query to do the same thing. Developers had to define all the words &
relationships that would be used, and keep them up to date over time.
Microsoft discontinued it around 2005.

I remember briefly working with it and realizing, “No, this is never going to
work because users have a tough time spelling things accurately.”

~~~
johnsonjo
If misspelling is one of the main problems than it would be interesting to
have suggestions based on how similar a word was to another, so that you could
have something similar to the rust compiler’s “<undefined variable name> does
not exist, did you mean <defined variable name>”. I think it uses Levenshtein
distance, and maybe some other algorithms/heuristics to pick it’s best guess.

------
Myrmornis
I am not criticizing this project -- obviously there's a desire for software
to understand natural language questions and certainly makes sense to target
SQL.

However, it immediately reminds me of the main drawback of SQL: it was
designed with the aim that it would be easy for non-programmers to use, at a
point in history when that seemed to make sense to the designers. Hence SQL
itself has a natural languagey feel, e.g. a keyword like "group by" has an
embedded space despite "group" and "by" having no meaning individually. As
many others have observed less superficially, SQL lacks composability. It
feels less like a programming language and more like a natural language
interface to a relational database. It's certainly great, and has proven the
test of time, etc etc, but I agree that we can do better, e.g. the following
article's discussion, if not their proposed solution.

[https://edgedb.com/blog/we-can-do-better-than-
sql/](https://edgedb.com/blog/we-can-do-better-than-sql/)

------
drinane
The reverse is way more interesting to me.... do a seemingly bunch of random
aggregations and joins and send me an email of all the English versions and
let me look at the results of those if I care to. The issue with these types
of ideas is always that only "clean" or heavily analyzed data can generate
this level of automation... and this type of tool is really only useful for
when you are looking at virgin data or data that is crap.

------
mynegation
This is a wrapper for a pretrained model. Probably a better github link would
be [https://github.com/naver/sqlova/](https://github.com/naver/sqlova/)

------
kowdermeister
Since this is using NN under the hood, I would add "INSUFFICIENT DATA FOR
MEANINGFUL ANSWER" for zero rows returned :)

How does it handle real life tables such as
"employee_stock_option_pool_not_vested_IDS" for example?

------
mitchtbaum
Dhruv Baldawa, a friend of mine, also has an interest in this and a good
write-up: [https://www.dhruvb.com/blog/posts/specialized-syntax-for-
qui...](https://www.dhruvb.com/blog/posts/specialized-syntax-for-quick-sql-
query-building/)

------
amolo
Nice. Very interesting. But 4 gigs just to get I running. That's probably more
than the DB I'm trying to query.

~~~
z3t4
There's nothing wrong with being resource efficient, but relatively SQL
queries used to take two minutes, and the result printed on paper, and the
whole machine cost two million.

~~~
sfoley
Yeah and sending a message to someone across the sea used to take months and
probably wouldn’t even get there half the time, that doesn’t meant it’s not
ridiculous that I can’t run two instances of slack without my computer
melting.

~~~
z3t4
the VR world where everything is HR enough that you can go to work inside the
simulation. Like in various Sci-fi movies.

------
c8g
does it only work with a question that returns a single answer or it can be
used to build something like
[https://www.thoughtspot.com/solutions/healthcare-life-
scienc...](https://www.thoughtspot.com/solutions/healthcare-life-sciences-
analytics#product-slider) ?

------
o10449366
Very interesting. I did a similar project for my machine learning class in
grad school. I'm curious what kind of architecture/model framework you used
for this.

~~~
____Sash---701_
Looks like PyTorch

~~~
____Sash---701_
Or Panther, who knows

~~~
newaccoutnas
PyTorch accorging to
[https://github.com/paulfitz/sqlova#requirements](https://github.com/paulfitz/sqlova#requirements)

Also CoreNLP

------
viach
Why do people fear programming that much, so that even a simple SQL query is
tried to be abstracted?

Same thing with visual "no code" programming. Is that really simpler to
endlessly move around colorful blocks than writing a few lines in, say,
Python?

~~~
regular_person
Yes? If you work with it every day you forget how obtuse and opaque
programming can be. Most normal people barely understand basic consumer
electronics.

~~~
TazeTSchnitzel
Programming is like many things not an immensely difficult task, but it is one
that requires knowledge and experience, so a true novice confronted with a SQL
manual would not be able to quickly write the right code. Someone who's done
programming before would have a much easier time.

------
michaelmior
It's great to see a nice open source example of work in this area. It would be
helpful if there were also some examples of where the model fails either by
not being able to produce a query or producing an incorrect query.

------
rezeroed
I might be too much of a control freak to give up the explicit precision of
SQL.

~~~
Myrmornis
It's not intended to replace your code :) It's intended to turn normal
peoples' words into code.

------
ohadron
I feel that SQL queries that people will need help with are not the SQL 101
examples shown here.

Looking at the queries I write as a product manager, I wouldn't have the
slightest clue how to translate them to English. How do you LEFT JOIN on a
relation and fetch NTiles on a calculated value from three different tables?
Surely this will require some English-to-schema dictionary that someone who
knows SQL will have to maintain.

~~~
gatherhunterer
Humans think with language. Anything that can be understood can be written in
language. The README acknowledges the need for further research and
development but it's not hard to imagine how a more complex query could be
written in English.

"How many bridges are there that are in the same state as any bridge designed
by O.H. Amman and are shorter than the number of kilometers between the Earth
and the Moon divided by the number of people living in the state and have a
maximum load that is at least as great as the combined weight of all currently
living Asian elephants?"

When someone walks over and asks, "What does this query do?" Do you just tell
them that it cannot possibly be put into English? Of course not, knowing the
answer and knowing how to say the answer in at least one human language are
the same thing.

------
LunaSea
The example is probably wrong or not working since the English questions uses
"throgs neck" in lower case instead of the captialized "Throgs Neck" version.

Those are two different strings in SQL and shouldn't match exactly.

This already shows one difficulty of translating a natural language to a SQL
query.

~~~
laurent123456
Aren't the queries case insensitive by default?

~~~
zainhoda
It depends on the collation method used:

[https://kendsnyder.com/utf8_bin-vs-
utf8_general_ci/](https://kendsnyder.com/utf8_bin-vs-utf8_general_ci/)

utf8_bin: compare strings by the binary value of each character in the string

utf8_general_ci: compare strings using general language rules and using case-
insensitive comparisons

utf8_general_cs: compare strings using general language rules and using case-
sensitive comparisons

------
nothrabannosir
Looks fascinating. I tried the docker example but I'm getting a 405 Method Not
Allowed when I visit localhost:5050/ in the browser :/

~~~
herohamp
Check the examples, you meant to upload the CSV and questions using curl

------
monkeydust
From a product managers pov this looks very interesting - is there a demo
environment available somewhere to play with this?

~~~
mritchie712
Just curious, what tool do you use to write SQL today and how much time do you
spend doing it? Would you be interested in this to reduce the time you spend
writing SQL?

~~~
sheeshkebab
I think anyone who uses sql would find this tool (and a lot of similar
abstractions) too imprecise, verbose, and error prone.

What's the longest bridge in mariland - or was it merryland (or maybe
state_code=MD), is it in bridges table or bridge, or maybe bridge_specs joined
on bridges by some id...I'll just write my sql in my favorite sql autocomplete
tool.

~~~
mritchie712
I agree, but I was thinking it could give people that don't know much SQL a
starting point. They could ask a natural language question > get SQL > tweak
SQL. So in your example, they might get something back like:

    
    
        SELECT MAX(length) FROM bridges WHERE state = 'merryland'
    

And they could recognize the error of merryland vs. MD.

------
exdsq
Really interesting. Where can I learn more about building something similar
myself?

~~~
FiberBundle
You can e.g. do this using semantic parsing, I remember some older papers but
cannot find them right now. For a more recent paper see [1] (haven't read it
though). You can also use Deep Learning for this. There are some interesting
approaches using Reinforcement Learning [2]

[1]
[https://www.aclweb.org/anthology/P18-1034](https://www.aclweb.org/anthology/P18-1034)
[2]
[https://arxiv.org/pdf/1709.00103.pdf](https://arxiv.org/pdf/1709.00103.pdf)

------
duxup
This might not be useful directly but this is a super cool project.

------
quizotic
Does anyone remember Larry Harris and "Natural" back in the 1980s?

