>To achieve this, our Data Warehouse team was tasked with identifying every foreign-key relationship between every table in the data warehouse to backfill all the ID columns with corresponding UUIDs.
Given the decentralized ownership of our tables, this was not a simple endeavor. The most promising solution was to crowdsource the information by scraping all the SQL queries submitted to the warehouse and observing which columns were joined together.
Am I missing something, or does scraping SQL queries to effectively get an ER diagram sound a bit strange? If you have SQL and a relational database, don’t you already have a formal model of the relations encoded somewhere? E.g. in the table definitions, annotations in your code, or at least as a Word document?
The tool sounds great, but isn't it a failure having to write it in the first place?
Data lived in many databases across the company: Vertica, Hive, Schemaless, Cassandra, etc. The value in a column might refer to an id in an entirely different system. When you have (tens of?) thousands of tables owned by hundreds of teams, documentation is neither homogeneous nor centralized.
So, why does it make sense in this situation ("you can all do what you please with your data") to mandate something technical ("but starting today you all must use UUIDs from now on") across all groups? What's the value of this change if using UUIDs is now literally the only thing they all have in common?
The change to UUIDs was part of a larger effort to make sure everything that generated entities could do it without synchronization across datacenters. (And sure, there are other approaches, I'm just adding some context :) )
Often, formalizing constraints as part of the table schema is not done because of the performance penalties of the checks required to enforce the constraints. I.e. every insert to the database would also have to check that the foreign key relation existed. Some databases have options to turn off the checking, but a lot of people have gotten in the habit of just not doing it. Furthermore, if you have data in multiple places (redis/cassandra/mysql/hive/etc) this just becomes impossible even if you wanted to.
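To illustrate, a minimal Postgres-flavored sketch (table names invented):

    CREATE TABLE users  (id BIGINT PRIMARY KEY);
    CREATE TABLE orders (id BIGINT PRIMARY KEY, user_id BIGINT);

    -- Formalizing the relationship: every insert/update on orders now
    -- pays an extra lookup against users to enforce the constraint.
    ALTER TABLE orders
        ADD CONSTRAINT orders_user_fk
        FOREIGN KEY (user_id) REFERENCES users (id)
        NOT VALID;  -- Postgres-specific: skip checking pre-existing rows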
It's a pity that the same thing has to get reimplemented so many times over. For example, you'd think you could reuse PostgreSQL's parser, but its code is autogenerated and riddled with side effects:
We've got a lot of languages, and usually their parsers are tightly coupled to their implementations, basically forcing people to reimplement them if they want a flexible parser. I'd say it would be great if we had one parsing library for each language, ideally the one used at its core. I think that would greatly simplify implementing linters, static analysers, and maybe even tools that generate code. Also, perhaps this would enable even cooler hacks?
Just in the off-chance this is useful to someone: If you find yourself needing the Postgres parser elsewhere, e.g. in some custom tooling to analyze SQL queries, I extracted it into a library some time ago:
Your comment still applies though, as a lot of the parser output is very specific to the chosen Postgres version (10 at the moment) and can change unpredictably.
I am in the middle of a project similar to what Uber is doing and it leans very heavily on this library. Thank you so much for this! I would not have been able to pull it off without libpg_query.
I agree, I wish every vendor would publish an official open source parser for their SQL dialects, or that the community would create a repository of standard parsers or grammars for them all.
When I looked at Calcite, the branding was very focused on query optimization, which was sufficiently unrelated to what I was doing that I didn't dig in. Had I, it might've been the starting point instead of my personal project, although at the time it didn't have Vertica support (I think that was ultimately added by a different group at Uber).
Is it just me who cringes heavily reading this? A single Haskell enthusiast decides to code up something in Haskell even though it doesn't fit the infra and most people don't know it. And they do this while Python already has SQL parsers. WAT.
Python doesn't already have parsers for all SQL variants. There are several important projects I can't do in my current job because I don't have Hive and Impala parsers. It would be ideal to have them in Python, but a binary that could output a digestible representation would be huge for me.
Got it! I'd be happy to serve as a resource for anyone taking a swing at adding an Impala front end (or any other, for that matter). I'm unlikely to get to it myself any time soon.
There's something like 5k lines of code for the Hive dialect. How much of that is auto-generated? Is there any way to take advantage of the similarities between Hive and Impala to reduce the work needed?
It's not auto-generated, but it's not difficult. There are three approaches likely to be reasonable. First, you could make a queryparser-impala package starting with a copy of queryparser-hive, and make the changes reflecting the differences you're aware of. Second, you could start from a description of the grammar and build it fresh, using queryparser-hive only as additional reference. Third, if Hive and Impala are close enough (I'm not sufficiently familiar with Impala to say) you could extend queryparser-hive to also work with Impala. In any case, a large corpus of Impala queries would be hugely valuable.
IIRC, Heli took the first approach when going from Vertica to Hive; Matt took the second when adding Presto.
Not cringing, but I think I understand your point...
However, I feel like with modern infrastructures, projects that don't really "fit" can be packaged into a container and used virtually anywhere. (Oversimplification acknowledged!)
Containers aren't the problem. Even without containers, it's not that hard to automate the build/deployment infrastructure. The problem is when "Joe" is the only person who knows how to do anything with X. "Joe" leaves for a high flying company that uses X. You have new requirements for the thing "Joe" was working on. "Anybody fancy working on an X project?" <crickets> "I know. Why don't we rewrite this in Y. I hear it's the new X anyway.", "Great idea!"
Don't get me wrong, I'm a believer in using the right tool for the job. It's just that there's more to it than just making sure it's easy to install.
It's time for a new relational query language (or languages) to compete with SQL. We need something that's more API-based rather than using custom keywords in a COBOL-esque way. With the API-ish approach, vendors (and DBAs) may still add proprietary functions/methods, but the syntax itself could be consistent across vendors. It's roughly comparable to using XML to define a domain-specific sub-language: a standard XML parser can be used to parse it even though specific tags or attributes are added by the vendor/domain. The syntax charts for SQL are pasta-city.
My favorite SQL-replacement candidate is SMEQL (formerly TQL). One of its features is using tables, real or virtual, to describe the attributes of schemas and RDBMS settings rather than language text. This also makes it easier to "query" commands, giving one meta abilities/abstraction.
SQL is fine, most of the quirky way of stating things is because it's declarative, which actually makes it very powerful. The biggest problem is that the standards move so slowly that many of the database implementers came up with their own dialects.
The latest official SQL:2016 is actually pretty great and it would be better if all the various databases converged back to that, however it's probably too late for that now, let alone a whole new language...
> SQL is fine, most of the quirky way of stating things is because it's declarative
No, it's quirky because of its semantics.
To quote Erik Meijer "SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?"
SQL also has no means of abstraction, i.e. defining parameterizable queries. This is one reason that LINQ exists.
> For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?
Because aggregate functions explicitly ignore NULLs. All aggregate functions do that, including `AVG()` and `COUNT(column)` (though note `COUNT(*)` counts rows, NULLs included).
Like, I understand the complaint, but aggregate functions don't work like operators. Operators work on data in the same row or result set. Aggregates work on data in the same column, even if it's a column created by the query. If you understand the relational model, then semantically they're not the same operation even though they might both involve addition.
This complaint is like saying, "When I take an int and add 2 or multiply by 2, I get a new int. Why does adding 2 to a string give me the same string with a 2 at the end, but multiplying a string by 2 give me the same string twice?" The answer is that you have to understand the semantics of operator overloading. Operator overloading isn't bad, but it does mean that you have to understand the underlying data type and operator models. You don't get to cheat just because you know what the operator does in one area.
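To make it concrete, a quick sketch (toy table, Postgres-style syntax):

    CREATE TABLE t (x INT);
    INSERT INTO t VALUES (13), (NULL);

    SELECT 13 + NULL;       -- NULL: operators propagate NULL
    SELECT SUM(x)   FROM t; -- 13: the aggregate skips the NULL input
    SELECT COUNT(*) FROM t; -- 2: counts rows, NULLs and all
    SELECT COUNT(x) FROM t; -- 1: counts non-NULL values of x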
> SQL also has no means of abstraction, i.e. defining parameterizable queries.
That's hardly a unique failing of a language. JavaScript, HTML, CSS, Python, shell, PHP, ASP, etc. All of them are vulnerable to code injection because there's essentially no way for a program to validate itself when it's interpretive and allowed to be dynamic.
With SQL, people decided that validating the dynamically generated code is the job of your data store provider, since that's done client side (i.e., on the web server) and only the client knows enough to correctly build the query. Since you essentially already need a data store provider anyway, even if that provider is ODBC or one that you roll yourself, it's really not a significant problem to write parameterized queries if you have any idea what you're doing.
Even if you do logically separate data and code, you still have to have a way to define it as such in whatever language your program is written in. In other words, a fictional SQL language which completely separates data from code would behave identically in practice to current SQL and current parameterized queries. We're not really gaining anything here.
The real problem with SQL is that the relational model is particularly poor with serializing hierarchical structures, and objects are inherently hierarchical.
Sure SQL can be awkward, but in your example the + operator is not the same as the SUM() function, so it's not surprising that they wouldn't work the same either. SUM() will probably cast the null into a 0 first in most databases while the + operator has no proper guidance on what to do with a null value.
Nullability is required to represent empty values, and every database has ISNULL() and IFNULL/COALESCE(). Both of those read fine to me, but I can see why they're strange.
I'm not sure what you mean by abstraction though since CTEs, joins, window functions, table values, UDFs, views, and more can all be used to compose logic and the language does support parameters, cursors, if/else/case statements, ... what's missing here?
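For instance, composing a window function through a CTE (hypothetical orders table):

    WITH ranked AS (
        SELECT customer, total,
               RANK() OVER (ORDER BY total DESC) AS r
        FROM orders
    )
    SELECT customer, total FROM ranked WHERE r <= 3;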
> In mathematics and most programming languages, summation is defined in terms of the (+) operator.
(+) is not SUM. All languages treat these basic operators as fluid logic that depends on surrounding context and data types. You can't add a null or string to a number in most languages either, and dynamic languages like Javascript will usually create equally strange results. Likewise (+) means concat if you're dealing with 2 strings. SUM() is an explicit function that clearly says that you're adding numbers, and within that function context making NULL = 0 is a reasonable coercion as they are mathematically equivalent.
> but I cannot parameterize a query
What is the parameterization issue exactly? Instead of a view, you use a stored procedure or function which accepts parameters. Even normal queries can use parameters, which you can then provide separately, either as variables or even the output of other functions. You can even use functions to write dynamic SQL and then execute that if you want.
Views are usually meant as static projections (which are an abstraction) but you can still join them on some other table which can act as a control, or use if/else/case statements within the view.
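For the non-procedural route, here's a sketch using Postgres-style PREPARE (table and column names invented):

    PREPARE recent_orders (DATE) AS
        SELECT id, total FROM orders WHERE created_at >= $1;

    EXECUTE recent_orders('2018-01-01');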
As I said, people expect a mathematical summation to be defined in terms of addition. Otherwise one has non-compositional semantics and broken equational reasoning. Perhaps you care not for these properties, but many people do.
> All languages treat these basic operators as fluid logic that depends on surrounding context and data types.... Likewise (+) means concat if you're dealing with 2 strings
Syntax overloading is not relevant here. String concat is just sharing syntax, it is a different function. The issue is that the treatment of null is not consistent. It's possible to add null and still have all algebraic laws hold, so null itself is not the problem.
> within that function context making NULL = 0 is a reasonable coercion
IMHO no coercion is reasonable, especially one that is unexpected. They should have called your "SUM" something else and provided one defined in terms of (+).
> What is the parameterization issue exactly? Instead of a view, you use a stored procedure
Stored procedures are not SQL, they are a feature of various imperative languages with SQL as a subset. Typically I am not allowed to define a stored procedure when I query your database. The context was deficiencies in the standard SQL query language.
A: "I want to add 2 numbers"
B: "Ok, what's the first number?"
A: "1"
B: "Ok, what's the second number?"
A: "I dont know..."
B: "Then I dont know the answer..."
NULL is the absence of a value. It has no meaning in algebra. You cannot do anything mathematically with a NULL value so it always results in a NULL output regardless of the operator. If you do use the numeric functions, then it will attempt to convert to a number (0) first. Where is the inconsistency?
It seems you want the database to automatically assume 0 for NULL when they are distinctly separate meanings and it would be far more confusing if that were so.
IMO, thinking of it as coercion is wrong. Rather, many aggregate functions on many databases filter nulls first. For sum there's no difference, but for others (e.g. average) there is. I make no particular claim as to whether that's the right choice.
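The difference shows up with a toy table t(x) holding 10 and NULL:

    SELECT SUM(x) FROM t;              -- 10 under either interpretation
    SELECT AVG(x) FROM t;              -- 10: the NULL row is filtered out
    SELECT AVG(COALESCE(x, 0)) FROM t; -- 5: what coercion to 0 would give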
It's the wrong choice, since they defined x+null=null and ignoring missing/undefined/error values (which null often represents) is dangerous in many domains. Users could have always filtered nulls manually, which would have made this dangerous behaviour explicit.
I see your argument and it's a strong one, I merely haven't considered the issue deeply enough to have much confidence that there aren't other strong arguments the other way. I don't object to your making the claim, I merely wouldn't make it myself :)
That said, I think the fact that there's a hard question here stems from SQL's conflation of "I don't know this" with "this is known to be absent". If NULL always represented the latter, implicitly omitting them from aggregates would be clearly correct.
> There is no way to abstract queries over tables, joins, etc. Views allow me to re-use a particular concrete query, but I cannot parameterize a query.
Not the person you replied to, but to me, you can "parameterize a query" by abstracting it away into a function or stored proc (SQL Server). Can you explain what you mean by it?
The context was the standard declarative SQL query language. I can't define stored procedures in my SQL query to your database, so I have to use something like LINQ to compose larger queries from smaller ones.
The API/functional-versus-keyword issue is not about declarative versus imperative. Those are mostly independent issues. Therefore, I'm not following you. Perhaps there's a miscommunication.
SQL already has UDFs and even custom types. Those can be namespaced and customized with built-in implementations by all databases, which they all already do to some extent (e.g. system data queries), so it seems the only real issue is the fact that the SQL dialects are different.
If the SQL dialect was the same across all databases, then what would be the advantage of an "API" approach as you say?
I personally feel like there's still room to build something that's more terse, yet equally as powerful as SQL.
get name, age -> (foos.id X bars.foo_id) |> get first(name) |> head
as opposed to
select name from (select foos.name, foos.created_at from foos inner join bars on foos.id = bars.foo_id) A order by created_at ASC limit 1;
More regular syntax that looks closer to conventional programming would be nice. At the very least it would probably be easier to write ORM-style libraries for, as the grammars would be closer together and more regular.
That looks like going back to imperative/functional instead of the declarative syntax of SQL...
Have you used C# before? I think the LINQ query syntax could be a good evolution for some use cases, but it again goes back to decades of dialects already being defined. Seems anything new would have to be implemented as some kind of proxy that then maps back to SQL or uses low-level db access, but then why not just use an ORM or query builder in the first place and query from inside a language?
Part of the point of SQL is that the database may choose to satisfy your query in a very different manner than you might expect given the procedural reading. Your example query is a nice illustration; a modern relational database may move your project/select steps earlier in the processing as long as it doesn't change the declarative meaning. Doing this kind of rearranging with a pipeline-looking query language is going to surprise people because you're giving the majority of people who think procedurally a false affordance. This is the fundamental "impedance mismatch": the ORM is trying to put a low-level wrapper around a higher-level idea. OO languages are still row-by-row procedural systems where relational databases are set-oriented.
If the target language is Haskell or Ruby or another "sufficiently DSL-capable language" it will be possible to make an internal DSL that encapsulates your query. However, in that case I think you'll surprise your users when you either have non-obvious limits to how you can intermix query with in-language code, or you'll surprise them with how badly your database performs compared to Postgres. You can see a little of both in the story of RethinkDB. If you are not using an internal DSL, you'll be stuck in the same situation as SQL where you are embedding your query language in the code somehow.
Relational databases are not just storage technology with a complicated query language. They are also integration technology. SQL's surface syntax may be unfortunate, but I'm increasingly doubtful that there is going to be a serious general-purpose database competitor that manages to hit the same notes only better. The main contender from the 90s was OODBs; they managed to have both shitty performance and lacked the ecosystem that makes SQL so useful as an integration technology: your application may be the main actor, but there are usually a large number of supporting cast members like cron jobs, reports, little scripts, ad-hoc queries, backup and restore etc, and having to do all of that with an OODB is very painful.
There are now and will continue to be niches where relational databases suffer, the prominent one today being distributed data storage. But the problem you think is a problem really isn't a problem. Industrially speaking, "I don't like SQL" is a non-problem. For people who hate it, there are ORMs that make it possible to ignore it most of the time for better or worse (mostly worse, but whatever). Syntactically, the main problem with it is, as another commenter stated, different dialects that behave slightly differently. This turns into a major engineering annoyance, but one that is mostly swallowed by your library dependencies, who have by now already basically taken care of it.
The benefit of using a modern relational database (I'm talking about Postgres but it applies to SQL Server and probably others as well) is they already have hundreds or thousands of man-years of effort put into optimizing them. I really thought RethinkDB had a brilliant design and it was going to be the next generation database. But it performed worse than Postgres and that makes it basically a non-starter. This is part of why niche databases are not based on SQL: if you handle a niche well, I will tolerate your oddities, but if you want to build a new general-purpose database today, in 2018, you can't just have a better design. Your better design has to come with better performance and the ecosystem, the tool support, maintainability, etc., or it's a non-starter. Databases are one of the hardest technical markets to break into. For most companies, the data is actually the most important thing.
Nicely put. I fully agree the power of SQL with Postgres is phenomenal. I would add that Postgres table-returning functions are incredibly useful for getting around SQL annoyances such as verbosity and composability.
Because they automatically LATERAL join, they almost feel magical, enabling complex combinations without the confusing deep nesting of subqueries, or the easier-to-follow but long-winded and slower CTEs.
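A rough sketch of what I mean (function and table names invented):

    CREATE FUNCTION top_orders(uid BIGINT, n INT)
    RETURNS TABLE (order_id BIGINT, order_total NUMERIC)
    LANGUAGE sql STABLE AS $$
        SELECT id, total
        FROM orders
        WHERE user_id = uid
        ORDER BY total DESC
        LIMIT n
    $$;

    -- Called in FROM, Postgres adds the LATERAL join implicitly:
    SELECT u.name, o.order_id, o.order_total
    FROM users u, top_orders(u.id, 3) AS o;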
> It's time for a new relational query language(s) to compete with SQL.
There are several, some of which are arguably abstractly better than SQL. But given how well established SQL is and the strength of the ecosystem around it, you need a big advantage to make it worth changing.
Agreed. But we need competitors to seed the next generation. I agree any decent SQL competitor is not going to unseat SQL in the short term in the general industry, but experimental and specialized projects would use them and gradually improve them. This approach has worked in the programming-language space, which has had hundreds of active competitors, but RDBMS-land is sparse.
Somebody pointed out that one can create user-defined functions in most SQL dialects, but that's generally for the value level, not the table level. It's still too dependent on per-case keywords for table-level features. I don't believe this is an inherent limit of declarative languages, as the SMEQL drafts suggest.
I really wish I could get over my irrational fear that using a UUID will one day result in a collision and the associated data integrity issue when an update silently overwrites an unrelated record, whereas the old sequential sequence never has this risk.
You say 'only millions of records'. Even if you had 2^36 records (that's over 68 billion), the chances of a collision should be about 0.0000000000000004.
That really is vanishingly small.
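For the curious, that figure is just the birthday approximation, assuming v4 UUIDs with 122 uniformly random bits:

    p ≈ n(n−1)/2 / 2^122 ≈ (2^36)² / 2^123 = 2^-51 ≈ 4.4 × 10^-16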
So, rationally, I'd guess it is more likely there is a problem with the implementation of UUID you are using, or just perhaps there is some other cause of the collision...like a bug?
Also possibly a problem with the version of UUID used. A v1 UUID is somewhat likely to collide if machines share MAC addresses and/or have problematic clocks.
The likelihood of a v4 collision is tied to the strength of the platform's (crypto) RNG.
If the UUIDs are actually being picked uniformly, the fear is entirely irrational, I agree. However, fear that the distribution isn't actually uniform in real implementations is much less crazy.
On the flipside, I have a very rational distrust of sequential sequences. Over the years I've seen far too many bad SQL database designs abuse sequential sequences for things that should not have a sequence, and too many bad SQL applications that create sequence collisions inadvertently through off-by-one bugs and forced inserts.
In several of those cases, because the sequence remained consistent/in order with no obvious holes, collisions went undetected for days/weeks because people just assumed that because they could count off 1, 2, 3, 4, 5 (not realizing they were missing that a second 2 had overwritten the first and a bad outer join was bringing in the wrong 5, for instance) that everything was working correctly. At least with UUIDs there's no accidental assumptions about data integrity from meaningless tangents like "looks like a proper number line" from the human brain's default pattern matching toolkit.
A company I used to work for had a nasty, hard-to-diagnose outage the other week - bad enough to publicly announce and grovel (I keep an eye on them). My belief, without access to any current technical detail, is that they slavishly use sequence keys everywhere - so no real foreign keys. This in turn has caused them to massively over-rely on Elasticsearch, which with a bit of growth caused their queue infrastructure to get swamped, and thus a big performance problem.
I realised the writing was on the wall there for me when I mentioned how useful proper (compound) foreign keys were and my comment was treated with derision. Of course, now I work somewhere with a more properly designed database, though I have to cope with Oracle instead.
It is incredibly unlikely, but you're right, especially when there might be poor UUID generators in use. We use 64-bit ints for all ids, which are reserved as batches from a global id service. Any datastore with atomic increments can serve this, so I'm surprised it's not used more often.
I am inexperienced, so please correct me if I am wrong - but conceptually this way doesn't seem much more effective or different than UUIDs if the purpose is a non-consecutive unique identifier. Also, could you please explain what you mean by "...which are resereved as batches from a global id service"?
Does it matter if it's consecutive? If you want a 0% chance of collisions and/or to use plain numbers, then they will be increasing in some way. ID generation scheme:
Key "counter" => value "123"
Server N starts, asks for 100 ids by incrementing counter, gets the new number 223, now it has numbers from 124->223 to use for ids. Server N+1 starts, does the same thing, and you can keep incrementing the counter to get the next batch of numbers.
Anything that supports atomic increments works. Design for whatever availability you need. Add in some random increments if you don't want perfectly sequential numbers. 64-bit numbers pretty much guarantee you'll never run out of space, while being much nicer to handle than UUIDs (which are 128-bit numbers but usually stored as bytes or strings).
If you want completely random IDs then UUID is the way to do it, but if it's just for load balancing or partitioning then you can always hash the numbers to get some random strings which work well enough.
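A minimal sketch of the batch-reservation scheme, with Postgres as the counter store (schema invented for illustration):

    CREATE TABLE id_counter (name TEXT PRIMARY KEY, value BIGINT NOT NULL);
    INSERT INTO id_counter VALUES ('orders', 123);

    -- Each server atomically claims a batch of 100; it can then hand out
    -- ids new_value-99 .. new_value locally with no further coordination.
    UPDATE id_counter
       SET value = value + 100
     WHERE name = 'orders'
    RETURNING value;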
There is also a pattern referred to as "high/low keys"[0], which works quite well so long as the architecture guarantees access to the master when a new high key is needed.
Of course, this can be mitigated by allocating small batches of high keys as well if connectivity to the master is unreliable. However, having a 32 bit low range implies the need for batch high key allocation is negligible for all but the most pathological situations.
Yes, we reserve the actual number range, which is simpler and avoids errors. I haven't come across a standard name for this scheme though. We reserve millions of numbers at a time and it works perfectly, and the batch size can easily be adjusted to account for usage factor and potential downtime buffering.
This is an awesome utility and great that it was open sourced. With even more SQL dialects, this can get even more interesting as one could try to understand usage across heterogeneous databases.
I'm going to go all the way with this thought, as it is something I've been wanting to do for a long time.
I've been wanting to build an open source query proxy server. The idea is that it would sit in front of any database, accepting any type of query expression (first use case: SQL, of course). You would register database drivers with the platform and then clients wouldn't need them anymore.
The proxy would accept queries destined for targets via a service call. Traffic to the service can be secured and encrypted. It would be able to be deployed in a cluster, so there are multiple endpoints to hit to load balance traffic.
It would parse all queries, recording the metadata about the queries and destination servers / tables / objects using something like Queryparser, so you'd have the benefit of all this metadata, but now you'd have it across data sources too.
This kind of data can also help power a data catalog, which could be a web app on top of the metadata / query repository, to help users find and/or annotate the data flowing around their organization (e.g. data democracy).
It could also have caching for result sets, additional security measures (RBAC), or even be linked up with a processing layer for additional processing (like Apache Spark).
There are commercial solutions for this, but I want to see an open source one. Especially because I'd like to see all dialects and query-type expressions supported, and I think only an open source project could get this kind of support across industry players.
One way I think about this type of query proxy: it's a bit like a GraphQL back-end web service endpoint, but with SQL instead of GraphQL as the interface. Using the native SQL/query expressions of the destination platform would simplify the work, since one would not need to create a generic query expression abstraction. We just need a way to extract the relevant information from the dialect/query that we are receiving.
We can let data platforms have different query semantics/dialects but still get metadata. For example, there are ways to map MongoDB concepts to relational ones (collections are like tables, fields are like columns, etc.), so even a MongoDB query expression could go through the query proxy and have metadata recorded in a consistent format, as long as you could express it to the query proxy's web service interface.
Maybe I'm just daydreaming... but I would love to work on that and let it be open source.
I really wanted to see it used for type checking (of domain-relevant types, not just the sql representational types), but I left before that got anywhere.
For those who don't know, the Presto parser is easy to use and analyze if you are programming in Java/JVM. It's close to the grammar and the IR is really good.
You know.. any Data Architect would have done this sort of data governance task for you.
You wanted to know which tables were joined on and which group-bys were the most popular? Welcome to a concept called query digest, from >11 years ago.
Sorry, but all I am hearing is 'how can we burn money on a pet Haskell project using a team of devs that will forever need to maintain it'.
Have the right expert do the right job..