
Queryparser, an Open Source Tool for Parsing and Analyzing SQL - manigandham
https://eng.uber.com/queryparser/
======
jmiserez
> _To achieve this, our Data Warehouse team was tasked with identifying every
> foreign-key relationship between every table in the data warehouse to
> backfill all the ID columns with corresponding UUIDs.

Given the decentralized ownership of our tables, this was not a simple
endeavor. The most promising solution was to crowdsource the information by
scraping all the SQL queries submitted to the warehouse and observing which
columns were joined together._

Am I missing something, or does scraping SQL queries to effectively get an ER
diagram sound a bit strange? If you have SQL and a relational database, don’t
you already have a formal model of the relations encoded somewhere? E.g. in
the table definitions, annotations in your code, or at least as a Word
document?

The tool sounds great, but isn’t it a failure having to write it in the first-
place?

~~~
bastawhiz
(former Uber engineer)

Data lived in _many_ databases across the company. Vertica, hive, schemaless,
Cassandra, etc. The value in a column might refer to an id in an entirely
different system. When you have (tens of?) thousands of tables owned by
hundreds of teams, documentation is neither homogeneous nor centralized.

~~~
fusiongyro
So, why does it make sense in this situation ("you can all do what you please
with your data") to mandate something technical ("but starting today you all
must use UUIDs from now on") across all groups? What's the value of this
change if using UUIDs is now literally the only thing they all have in common?

~~~
jhoechtl
Process mining to identify yet unknown relationships. This is more sort of a
big data "what relates to each-other yet we don't know it" analysis

------
d33
It's a pity that the thing has to get reimplemented all over so many times.
For example, you'd think you could reuse PostgreSQL's parser, but its code is
autogenerated and riddled with side effects:

[https://github.com/postgres/postgres/blob/master/src/backend...](https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y)

We've a lot of got languages and usually their parsers are tightly coupled,
basically forcing people to reimplement them if you want a flexible parser.
I'd say it would be great if we had one parsing library for each language,
ideally used at its core. I think that would greatly simplify implementing
linters, static analysers and maybe even tools that generate code. Also,
perhaps this would enable even more cooler hacks?

~~~
lfittl
Just in the off-chance this is useful to someone: If you find yourself needing
the Postgres parser elsewhere, e.g. in some custom tooling to analyze SQL
queries, I extracted it into a library some time ago:

[https://github.com/lfittl/libpg_query](https://github.com/lfittl/libpg_query)

Your comment still applies though, as a lot of the parser output is very
specific to the chosen Postgres version (10 at the moment) and can change
unpredictably.

~~~
arnon
There's also HSSQLPPP for Postgres, and it's written in Haskell

[https://github.com/JakeWheat/hssqlppp](https://github.com/JakeWheat/hssqlppp)

~~~
gilmi
Or simple-sql-parser by the same author which is just a parser.

[https://github.com/JakeWheat/simple-sql-
parser](https://github.com/JakeWheat/simple-sql-parser)

------
i_s
If you are using SQL Server, you may be interested to hear this already exists
for .NET:

[https://msdn.microsoft.com/en-
us/library/microsoft.sqlserver...](https://msdn.microsoft.com/en-
us/library/microsoft.sqlserver.transactsql.scriptdom.aspx)

It can both parse and generate SQL scripts. I've been leaning on it heavily
for some of our non-trival query APIs.

~~~
dllthomas
Scanning the docs, it's unclear whether that's just the parsing or also
analysis.

~~~
i_s
It is 'just' the parsing, but the analysis is fairly easy. Here is an example
of:

1\. Taking a database

2\. Parsing every view and stored procedure

3\. Tracking how tables are joined

4\. Creating a graph visualization of the tables and their relationships:

[https://github.com/isaksky/FsSqlDom/blob/master/examples/FsS...](https://github.com/isaksky/FsSqlDom/blob/master/examples/FsSqlDomGallery/FsSqlDomGallery/GraphTableRelationships.fs)

(Not perfect, but does a pretty good job on databases I have tested.)

~~~
intrasight
I've long thought that a graph visual based query client would be really neat

~~~
weavie
Didn't ms access have something like this?

~~~
intrasight
Yes - "something like". I'm envisioning more interact with the graph rather
than the graph only showing relationships.

------
chx
Is it just me who cringes heavily reading this? A single Haskell enthusiast
decides to code up something in Haskell despite it doesn't fit the infra and
most people don't know it. And they do this while Python already has SQL
parsers. WAT.

~~~
daveFNbuck
Python doesn't already have parsers for all SQL variants. There are several
important projects I can't do in my current job because I don't have Hive and
Impala parsers. It would be ideal to have them in Python, but a binary that
could output a digestible representation would be huge for me.

~~~
dllthomas
This project can output JSON for Hive queries, FWIW

~~~
daveFNbuck
Yeah, that's why I mentioned that a binary that can do that would be huge for
me.

~~~
dllthomas
You're saying this is huge? Or you're missing that this can be wrapped into a
binary in a handful of lines? :D

~~~
daveFNbuck
Yes, I'm saying this would be huge for someone who only needed Hive. This
would be huge for me if it could also do Impala.

~~~
dllthomas
Got it! I'd be happy to serve as a resource for anyone taking a swing at
adding an Impala front end (or any other, for that matter). I'm unlikely to
get to it myself any time soon.

~~~
daveFNbuck
There's something like 5k lines of code for the Hive dialect. How much of that
is auto-generated? Is there any way to take advantage of the similarities
between Hive and Impala to reduce the work needed?

~~~
dllthomas
It's not auto-generated, but it's not difficult. There's three approaches
likely to be reasonable. First, you could make a queryparser-impala package
starting with a copy of queryparser-hive, make the changes reflecting the
differences you're aware of. Second, you could start from a description of the
grammar and build it fresh, using queryparser-hive only as additional
reference. Third, if Hive and Impala are close enough (I'm not sufficiently
familiar with Impala to say) you could extend queryparser-hive to also work
with Impala. In any case, a large corpus of Impala queries would be hugely
valuable.

IIRC, Heli took the first approach when going from Vertica to Hive; Matt took
the second when adding Presto.

~~~
daveFNbuck
Thanks, I'll keep this in mind for when I have time to start a new project.

------
fuzzieozzie
A more general/robust implementation of this functionality (and a whole lot
more) has been developed by
[http://www.compilerworks.com](http://www.compilerworks.com)

Covers many SQL dialects:

Proprietary: Teradata, Netezza, Vertica, Oracle PL/SQL

Open Source:Hive, Presto, Postgre, MySQL, Derby, Splice Machine

Cloud: RedShift, Snowflake, BigQuery

------
tabtab
It's time for a new relational query language(s) to compete with SQL. We need
something that's more API-based rather than using custom key-words in a COBOL-
esque way. With the API-ish approach, Vendors (and DBA's) may still add
proprietary functions/methods, but the syntax itself could be consistent
across vendors. It's roughly comparable to using XML to define a domain-
specific sub-language: a standard XML parser can be used to parse it even
though specific tags or attributes are added by the vendor/domain. The syntax
charts for SQL are pasta-city.

My favorite SQL-replacement candidate is SMEQL (formerly TQL). One of its
features is using tables, real or virtual, to describe the attributes of
schemas and RDBMS settings rather than language text. This also makes it
easier to "query" commands, giving one meta abilities/abstraction.

~~~
manigandham
SQL is fine, most of the quirky way of stating things is because it's
declarative, which actually makes it very powerful. The biggest problem is
that the standards move so slowly that many of the database implementers came
up with their own dialects.

The latest official SQL:2016 is actually pretty great and it would be better
if all the various databases converged back to that, however it's probably too
late for that now, let alone a whole new language...

~~~
willtim
> SQL is fine, most of the quirky way of stating things is because it's
> declarative

No, it's quirky because of its semantics.

To quote Erik Meijer "SQL is rife with noncompositional features. For example,
the semantics of NULL is a big mess: why does adding the number 13 to a NULL
value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL),
returns 13?"

SQL also has no means of abstraction, i.e. defining parameterizable queries.
This is one reason that LINQ exists.

~~~
manigandham
Sure SQL can be awkward, but in your example the _+_ operator is not the same
as the _SUM()_ function, so it's not surprising that they wouldn't work the
same either. SUM() will probably cast the null into a 0 first in most
databases while the + operator has no proper guidance on what to do with a
null value.

Nullability is required to represent empty values and every database has
ISNULL() and IFNULL/COALESCE() and both of those read fine to me but I can see
why they're strange.

I'm not sure what you mean by abstraction though since CTEs, joins, window
functions, table values, UDFs, views, and more can all be used to compose
logic and the language does support parameters, cursors, if/else/case
statements, ... what's missing here?

~~~
willtim
> The + operator is not the same as the SUM() function, so it's not surprising
> that they wouldn't work the same

No, it's very surprising that they are different. In mathematics and most
programming languages, summation is defined in terms of the (+) operator.

> Nullability is required to represent empty values

I have no problem with (optional) nullability.

> I'm not sure what you mean by abstraction

There is no way to abstract queries over tables, joins, etc. Views allow me to
re-use a particular concrete query, but I cannot _parameterize_ a query.

~~~
manigandham
> In mathematics and most programming languages, summation is defined in terms
> of the (+) operator.

(+) is not SUM. All languages treat these basic operators as fluid logic that
depends on surrounding context and data types. You can't add a null or string
to a number in most languages either, and dynamic languages like Javascript
will usually create equally strange results. Likewise (+) means concat if
you're dealing with 2 strings. SUM() is an explicit function that clearly says
that you're adding numbers, and within that function context making NULL = 0
is a reasonable coercion as they are mathematically equivalent.

> but I cannot parameterize a query

What is the parameterization issue exactly? Instead of a view, you use a
stored procedure or function which accept parameters. Even normal queries can
use parameters which you can then provide them separately, either as variables
or even the output of other functions. You can even use functions to write
dynamic SQL and then execute that if you want.

Views are usually meant as static projections (which are an abstraction) but
you can still join them on some other table which can act as a control, or use
if/else/case statements within the view.

~~~
willtim
> (+) is not SUM.

As I said, people expect a mathematical summation to be defined in terms of
addition. Otherwise one has non-compositional semantics and broken equational
reasoning. Perhaps you care not for these properties, but many people do.

> All languages treat these basic operators as fluid logic that depends on
> surrounding context and data types.... Likewise (+) means concat if you're
> dealing with 2 strings

Syntax overloading is not relevant here. String concat is just sharing syntax,
it is a different function. The issue is that the treatment of null is not
consistent. It's possible to add null and still have all algebraic laws hold,
so null itself is not the problem.

> within that function context making NULL = 0 is a reasonable coercion

IMHO no coercion is reasonable, especially one that is unexpected. They should
have called your "SUM" something else and provided one defined in terms of
(+).

> What is the parameterization issue exactly? Instead of a view, you use a
> stored procedure

Stored procedures are not SQL, they are a feature of various imperative
languages with SQL as a subset. Typically I am not allowed to define a stored
procedure when I query your database.The context was deficiencies in the
standard SQL query language.

~~~
manigandham
A: "I want to add 2 numbers" B: "Ok, what's the first number?" A: "1" B: "Ok,
what's the second number?" A: "I dont know..." B: "Then I dont know the
answer..."

NULL is the absence of a value. It has no meaning in algebra. You cannot do
anything mathematically with a NULL value so it always results in a NULL
output regardless of the operator. If you do use the numeric functions, then
it will attempt to convert to a number (0) first. Where is the inconsistency?

It seems _you_ want the database to automatically assume 0 for NULL when they
are distinctly separate meanings and it would be far more confusing if that
were so.

~~~
willtim
> NULL is the absence of a value. It has no meaning in algebra.

Null is a value in its own right. If null is added to e.g. the type of
numbers, then it had better be given meaning (i.e. semantics).

> You cannot do anything mathematically with a NULL value so it always results
> in a NULL output

Returning null if any input is null seems reasonable, but again _this is not
true with SQL SUM_ : SUM [1, null] = 1.

> Where is the inconsistency

You seriously cannot see an inconsistency? Are you trolling?

> It seems you want the database to automatically assume 0 for NULL

Where on earth did you get that idea? You are defending the behaviour of SQL
SUM, not me! I've already told you, I don't want coercions.

------
cube00
I really wish I could get over my irrational fear that using an UUID will one
day result in a collision and the associated data integrity issue when an
update silently overwrites an unrelated record, where as the old sequential
sequence never has this risk.

~~~
dansingerman
Do you worry about the irrational fear that one day all the oxygen molecules
in the room will gather in an upper corner, and you'll suffocate?

~~~
eklavya
Having witnessed a uuid collision in only millions records, I can assure you
it's not a laughing matter.

~~~
dansingerman
That is likely due to a fault in the UUID generator, rather than the concept
of using UUIDs.

[http://www.h2database.com/html/advanced.html#uuid](http://www.h2database.com/html/advanced.html#uuid)

~~~
eklavya
Well than the java implementation of UUID is flawed!

~~~
dansingerman
You say 'only millions records'. Even if you had 2^36 records (that's over 68
billion) the chances of a collision should be 0.0000000000000004

That really is vanishingly small.

So, rationally, I'd guess it is more likely there is a problem with the
implementation of UUID you are using, or just perhaps there is some other
cause of the collision...like a bug?

~~~
WorldMaker
Also possibly a problem with the _version_ of UUID used. A v1 UUID is somewhat
likely to collide if machines share MAC addresses and/or have problematic
clocks.

The likelihood of a v4 collision is tied to the strength of the platform's
(crypto) RNG.

------
dllthomas
Also of note,
[https://github.com/uber/queryparser/blob/master/FUTURE.md](https://github.com/uber/queryparser/blob/master/FUTURE.md)
got cut from earlier drafts.

------
sixdimensional
This is an awesome utility and great that it was open sourced. With even more
SQL dialects, this can get even more interesting as one could try to
understand usage across heterogeneous databases.

I'm going to go all the way with this thought, as it is something I've been
wanting to do for a long time.

I've been wanting to build an open source query proxy server. The idea is
that, it would sit in front of any database that accepts any type of query
expression (first use case, SQL, of course). You would register database
drivers with the platform and then clients wouldn't need them anymore.

The proxy would accept queries destined for targets via a service call.
Traffic to the service can be secured and encrypted. It would be able to be
deployed in a cluster, so there are multiple endpoints to hit to load balance
traffic.

It would parse all queries, recording the metadata about the queries and
destination servers / tables / objects using something like Queryparser, so
you'd have the benefit of all this metadata, but now you'd have it across data
sources too.

This kind of data can also help power a data catalog, which could be a web app
on top of the metadata / query repository, to help users find and/or annotate
the data flowing around their organization (e.g. data democracy).

It could also have caching for result sets, additional security measures
(RBAC), or even be linked up with a processing layer for additional processing
(like Apache Spark).

There are commercial solutions for this, but I want to see an open source one.
Especially because, I'd like to see all dialects and query type expressions
supported, and I think only an open source project could get this kind of
support across industry players.

One way I have thought about this type of query proxy is, it's a bit like a
GraphQL back-end web service endpoint but with SQL instead of GraphQL as the
interface. Using the native SQL/query expressions of the destination platform
would simplify the work, since one would not need to create a generic query
expression abstraction. We just need a way to extract the relevant information
from the dialect/query that we are receiving.

We can let data platforms have different query semantics/dialects but still
get metadata. For example, there are ways to map concepts for MongoDb to
relational ones (collections are like tables, fields are like columns, etc.)
so even a MongoDb query expression could go through the query proxy, and have
metadata recorded in a consistent format, as long as you could express it to
the query proxy's web service interface.

Maybe I'm just daydreaming.. but I would love to work on that and let it be
open source.

~~~
contingencies
IIRC 10+ years ago MySQL already had one of these and it was Lua based, the
primary intent was for transparent sharding.

~~~
rzzzt
You might be thinking of this project: [https://github.com/mysql/mysql-
proxy](https://github.com/mysql/mysql-proxy)

~~~
contingencies
Yep.

------
batbomb
For those who don't know, the Presto parser is easy to use and analyze if you
are programming in Java/JVM. It's close to the grammar and the IR is really
good.

------
maslam
Someone's trying to be GDPR compliant :D

------
tkyjonathan
You know.. any Data Architect would have done this sort of data governance
task for you. You wanted to know which tables where join'd on and which group
by's were the most popular? welcome to a concept called query digest from >11
years ago. Sorry, but all I am hearing is 'how can we burn money on a pet
haskell project using a team of devs that will forever need to maintain it'.
Have the right expert do the right job..

------
iamleppert
How does it handle tables that aren't joined anywhere?

