Given the decentralized ownership of our tables, this was not a simple endeavor. The most promising solution was to crowdsource the information by scraping all the SQL queries submitted to the warehouse and observing which columns were joined together.
Am I missing something, or does scraping SQL queries to effectively get an ER diagram sound a bit strange? If you have SQL and a relational database, don’t you already have a formal model of the relations encoded somewhere? E.g. in the table definitions, annotations in your code, or at least as a Word document?
The tool sounds great, but isn't it a failure having to write it in the first place?
Data lived in many databases across the company: Vertica, Hive, Schemaless, Cassandra, etc. The value in a column might refer to an id in an entirely different system. When you have (tens of?) thousands of tables owned by hundreds of teams, documentation is neither homogeneous nor centralized.
Also I re-read only now that this is for a DWH, in which case it makes even more sense.
It's very common for data governance tools to do this, e.g. Informatica MDM, Cloudera Navigator.
> don’t you already have a formal model of the relations encoded somewhere
Good companies will have this. But not all. And they definitely don't have it spanning across all of their data stores.
We've got a lot of languages, and usually their parsers are tightly coupled to their compilers, basically forcing people to reimplement them if they want a flexible parser. I'd say it would be great if we had one parsing library for each language, ideally the one used at its core. I think that would greatly simplify implementing linters, static analysers, and maybe even tools that generate code. Also, perhaps this would enable even cooler hacks?
Your comment still applies though, as a lot of the parser output is very specific to the chosen Postgres version (10 at the moment) and can change unpredictably.
We are almost there with ANTLR grammars...
And unfortunately, disambiguating references doesn't work the same way everywhere - for instance, tables on Hive will shadow a CTE.
Interestingly, Uber does use Calcite themselves for AthenaX: https://github.com/uber/AthenaX/
It is pretty good though and used by plenty of other Apache projects like Apex, Drill, Flink, Hive, Storm, etc: https://calcite.apache.org/docs/powered_by.html
It can both parse and generate SQL scripts. I've been leaning on it heavily for some of our non-trival query APIs.
1. Taking a database
2. Parsing every view and stored procedure
3. Tracking how tables are joined
4. Creating a graph visualization of the tables and their relationships:
(Not perfect, but does a pretty good job on databases I have tested.)
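A very crude sketch of step 3, using regex matching rather than a real SQL parser (anything production-grade would use a proper grammar, as Queryparser does; aliases, USING clauses, and subqueries all defeat a regex):

```python
import re
from collections import Counter

# Matches "JOIN <table> [alias] ON <t1>.<col> = <t2>.<col>".
# Far too crude for real SQL -- illustration only.
JOIN_RE = re.compile(
    r"JOIN\s+\w+(?:\s+\w+)?\s+ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)

def join_edges(queries):
    """Count (table.column, table.column) pairs joined across a query log."""
    edges = Counter()
    for sql in queries:
        for t1, c1, t2, c2 in JOIN_RE.findall(sql):
            # Normalize so "a JOIN b" and "b JOIN a" count as one edge.
            edge = tuple(sorted([f"{t1}.{c1}", f"{t2}.{c2}"]))
            edges[edge] += 1
    return edges

queries = [
    "SELECT * FROM foos JOIN bars ON foos.id = bars.foo_id",
    "SELECT * FROM bars JOIN foos ON bars.foo_id = foos.id",
]
print(join_edges(queries))
```

The resulting edge counts are exactly what you'd feed into a graph layout for step 4.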
IIRC, Heli took the first approach when going from Vertica to Hive; Matt took the second when adding Presto.
However, I feel like with modern infrastructures, projects don't really need to "fit": they can be packaged into a container and used virtually anywhere. (Oversimplification acknowledged!)
Don't get me wrong, I'm a believer in using the right tool for the job. It's just that there's more to it than just making sure it's easy to install.
* In this case, furthering the use of Haskell in industry
Covers many SQL dialects:
Proprietary: Teradata, Netezza, Vertica, Oracle PL/SQL
Open Source: Hive, Presto, Postgres, MySQL, Derby, Splice Machine
Cloud: RedShift, Snowflake, BigQuery
My favorite SQL-replacement candidate is SMEQL (formerly TQL). One of its features is using tables, real or virtual, to describe the attributes of schemas and RDBMS settings rather than language text. This also makes it easier to "query" commands, giving one meta abilities/abstraction.
The latest official SQL:2016 is actually pretty great and it would be better if all the various databases converged back to that, however it's probably too late for that now, let alone a whole new language...
No, it's quirky because of its semantics.
To quote Erik Meijer "SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?"
SQL also has no means of abstraction, i.e. defining parameterizable queries. This is one reason that LINQ exists.
Because aggregate functions explicitly ignore NULLs. All aggregate functions do that, including `AVG()` and `COUNT(column)` (though `COUNT(*)` counts rows regardless of NULLs).
Like, I understand the complaint, but aggregate functions don't work like operators. Operators work on data in the same row or result set. Aggregates work on data in the same column, even if it's a column created by the query. If you understand the relational model, then semantically they're not the same operation even though they might both involve addition.
This complaint is like saying, "When I take an int and add 2 or multiply by 2 I get a new int. Why, when I add 2 to a string, does it give me the same string with a 2 at the end, but when I multiply a string by 2 it gives me the same string twice?" The answer is that you have to understand the semantics of operator overloading. Operator overloading isn't bad, but it does mean that you have to understand the underlying data type and operator models. You don't get to cheat just because you know what the operator does in one area.
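The row-wise vs. column-wise distinction is easy to check directly; here's a quick demonstration using SQLite (the same behavior holds in Postgres and the SQL standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(13,), (None,)])

# Row-wise operator: NULL propagates, so 13 + NULL is NULL.
print(conn.execute("SELECT 13 + NULL").fetchone()[0])          # None

# Column-wise aggregate: NULLs are skipped, so the sum is 13.
print(conn.execute("SELECT SUM(x) FROM t").fetchone()[0])      # 13

# COUNT(*) counts rows; COUNT(x) skips the NULL row.
print(conn.execute("SELECT COUNT(*), COUNT(x) FROM t").fetchone())  # (2, 1)
```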
> SQL also has no means of abstraction, i.e. defining parameterizable queries.
With SQL, people decided that validating the dynamically generated code is the job of your data store provider, since that's done client side (i.e., on the web server) and only the client knows enough to correctly build the query. Since you essentially need a data store provider anyway, even if that provider is ODBC or one that you roll yourself, it's really not a significant problem to write parameterized queries if you have any idea what you're doing.
Even if you do logically separate data and code, you still have to have a way to define it as such in whatever language your program is written in. In other words, a fictional SQL language which completely separates data from code would behave identically in practice to current SQL and current parameterized queries. We're not really gaining anything here.
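In practice that separation looks like this: the client library ships the query text and the data to the engine separately. A minimal sketch with SQLite's `?` placeholders (placeholder syntax varies by driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice'), ('bob')")

# The query text is code; the tuple is data. The driver keeps them
# separate all the way to the engine, so no escaping is needed and
# the hostile string below is just a value that matches nothing.
name = "alice'; DROP TABLE users; --"
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (name,)
).fetchall()
print(rows)  # [] -- and the users table is untouched
```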
The real problem with SQL is that the relational model is particularly poor with serializing hierarchical structures, and objects are inherently hierarchical.
Nullability is required to represent empty values and every database has ISNULL() and IFNULL/COALESCE() and both of those read fine to me but I can see why they're strange.
I'm not sure what you mean by abstraction though since CTEs, joins, window functions, table values, UDFs, views, and more can all be used to compose logic and the language does support parameters, cursors, if/else/case statements, ... what's missing here?
No, it's very surprising that they are different. In mathematics and most programming languages, summation is defined in terms of the (+) operator.
> Nullability is required to represent empty values
I have no problem with (optional) nullability.
> I'm not sure what you mean by abstraction
There is no way to abstract queries over tables, joins, etc. Views allow me to re-use a particular concrete query, but I cannot parameterize a query.
> but I cannot parameterize a query
What is the parameterization issue exactly? Instead of a view, you use a stored procedure or function which accepts parameters. Even normal queries can use parameters which you can then provide separately, either as variables or even as the output of other functions. You can even use functions to write dynamic SQL and then execute it if you want.
Views are usually meant as static projections (which are an abstraction) but you can still join them on some other table which can act as a control, or use if/else/case statements within the view.
As I said, people expect a mathematical summation to be defined in terms of addition. Otherwise one has non-compositional semantics and broken equational reasoning. Perhaps you care not for these properties, but many people do.
> All languages treat these basic operators as fluid logic that depends on surrounding context and data types.... Likewise (+) means concat if you're dealing with 2 strings
Syntax overloading is not relevant here. String concat is just sharing syntax, it is a different function. The issue is that the treatment of null is not consistent. It's possible to add null and still have all algebraic laws hold, so null itself is not the problem.
> within that function context making NULL = 0 is a reasonable coercion
IMHO no coercion is reasonable, especially one that is unexpected. They should have called your "SUM" something else and provided one defined in terms of (+).
> What is the parameterization issue exactly? Instead of a view, you use a stored procedure
Stored procedures are not SQL; they are a feature of various imperative languages with SQL as a subset. Typically I am not allowed to define a stored procedure when I query your database. The context was deficiencies in the standard SQL query language.
NULL is the absence of a value. It has no meaning in algebra. You cannot do anything mathematically with a NULL value so it always results in a NULL output regardless of the operator. If you do use the numeric functions, then it will attempt to convert to a number (0) first. Where is the inconsistency?
It seems you want the database to automatically assume 0 for NULL when they are distinctly separate meanings and it would be far more confusing if that were so.
Null is a value in its own right. If null is added to e.g. the type of numbers, then it had better be given meaning (i.e. semantics).
> You cannot do anything mathematically with a NULL value so it always results in a NULL output
Returning null if any input is null seems reasonable, but again this is not true with SQL SUM: SUM [1, null] = 1.
> Where is the inconsistency
You seriously cannot see an inconsistency? Are you trolling?
> It seems you want the database to automatically assume 0 for NULL
Where on earth did you get that idea? You are defending the behaviour of SQL SUM, not me! I've already told you, I don't want coercions.
postgres=# SELECT AVG(x) FROM (SELECT 1 AS x UNION SELECT NULL) t;
That said, I think the fact that there's a hard question here stems from SQL's conflation of "I don't know this" with "this is known to be absent". If NULL always represented the latter, implicitly omitting them from aggregates would be clearly correct.
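The query above is easy to reproduce; in SQLite (like Postgres) the aggregate skips the NULL, so AVG returns 1.0 rather than 0.5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# AVG over the values {1, NULL}: the NULL is omitted, not treated as 0.
avg = conn.execute(
    "SELECT AVG(x) FROM (SELECT 1 AS x UNION SELECT NULL) t"
).fetchone()[0]
print(avg)  # 1.0
```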
> There is no way to abstract queries over tables, joins, etc. Views allow me to re-use a particular concrete query, but I cannot parameterize a query.
Not the person you replied to, but to me, you can "parameterize a query" by abstracting it away into a function or stored proc (SQL Server). Can you explain what you mean by it?
If the SQL dialect was the same across all databases, then what would be the advantage of an "API" approach as you say?
get name, age -> (foos.id X bars.foo_id) |> get first(name) |> head
as opposed to
select name from (select foos.name, foos.created_at from foos inner join bars on foos.id = bars.foo_id) A order by created_at asc limit 1;
more regular syntax that looks closer to conventional programming would be nice. At the very least it would probably be easier to write ORM style libraries for as the grammar would be closer together and more regular
Have you used C# before? I think the LINQ query syntax could be a good evolution for some uses cases, but it again goes back to decades of dialects already being defined. Seems anything new would have to be implemented as some kind of proxy that then maps back to SQL or uses low level db access, but then why not just use an ORM or query builder in the first place and query from inside a language?
If the target language is Haskell or Ruby or another "sufficiently DSL-capable language" it will be possible to make an internal DSL that encapsulates your query. However, in that case I think you'll surprise your users when you either have non-obvious limits to how you can intermix query with in-language code, or you'll surprise them with how badly your database performs compared to Postgres. You can see a little of both in the story of RethinkDB. If you are not using an internal DSL, you'll be stuck in the same situation as SQL where you are embedding your query language in the code somehow.
Relational databases are not just storage technology with a complicated query language. They are also integration technology. SQL's surface syntax may be unfortunate, but I'm increasingly doubtful that there is going to be a serious general-purpose database competitor that manages to hit the same notes only better. The main contender from the 90s was OODBs; they managed to have both shitty performance and lacked the ecosystem that makes SQL so useful as an integration technology: your application may be the main actor, but there are usually a large number of supporting cast members like cron jobs, reports, little scripts, ad-hoc queries, backup and restore etc, and having to do all of that with an OODB is very painful.
There are now and will continue to be niches where relational databases suffer, the prominent one today being distributed data storage. But the problem you think is a problem really isn't a problem. Industrially speaking, "I don't like SQL" is a non-problem. For people who hate it, there are ORMs that make it possible to ignore it most of the time for better or worse (mostly worse, but whatever). Syntactically, the main problem with it is, as another commenter stated, different dialects that behave slightly differently. This turns into a major engineering annoyance, but one that is mostly swallowed by your library dependencies, who have by now already basically taken care of it.
The benefit of using a modern relational database (I'm talking about Postgres but it applies to SQL Server and probably others as well) is they already have hundreds or thousands of man-years of effort put into optimizing them. I really thought RethinkDB had a brilliant design and it was going to be the next generation database. But it performed worse than Postgres and that makes it basically a non-starter. This is part of why niche databases are not based on SQL: if you handle a niche well, I will tolerate your oddities, but if you want to build a new general-purpose database today, in 2018, you can't just have a better design. Your better design has to come with better performance and the ecosystem, the tool support, maintainability, etc., or it's a non-starter. Databases are one of the hardest technical markets to break into. For most companies, the data is actually the most important thing.
There are several, some of which are arguably abstractly better than SQL. But given how well established SQL is and the strength of the ecosystem around it, you need a big advantage to make it worth changing.
Somebody pointed out that one can create user-defined functions in most SQL dialects, but that's generally at the value level, not the table level. It's still too dependent on per-case keywords for table-level features. I don't believe this is an inherent limit of declarative languages, as the SMEQL drafts suggest.
That really is vanishingly small.
So, rationally, I'd guess it is more likely there is a problem with the implementation of UUID you are using, or just perhaps there is some other cause of the collision...like a bug?
The likelihood of a v4 collision is tied to the strength of the platform's (crypto) RNG.
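Assuming the RNG is sound, the scale can be checked with the standard birthday bound over the 122 random bits of a v4 UUID (a sketch; the approximation p ≈ 1 − e^(−n²/2N) is the usual one):

```python
import math

RANDOM_BITS = 122          # UUIDv4: 128 bits minus 6 fixed version/variant bits
SPACE = 2 ** RANDOM_BITS

def collision_probability(n):
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * space))."""
    return 1 - math.exp(-n * n / (2 * SPACE))

# Even after generating a trillion v4 UUIDs, the probability of any
# collision is on the order of 1e-13.
print(collision_probability(10**12))
```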
In several of those cases, because the sequence remained consistent and in order with no obvious holes, collisions went undetected for days or weeks. People just assumed that because they could count off 1, 2, 3, 4, 5 (not realizing that a second 2 had overwritten the first, and a bad outer join was bringing in the wrong 5, for instance) everything was working correctly. At least with UUIDs there are no accidental assumptions about data integrity from meaningless signals like "looks like a proper number line" out of the human brain's default pattern matching toolkit.
I realised the writing was on the wall there for me when I mentioned how useful proper (compound) foreign keys were and my comment was treated with derision. Of course, now I work somewhere with a more properly designed database, and wind up having to cope with Oracle instead.
Key "counter" => value "123"
Server N starts, asks for 100 ids by incrementing counter, gets the new number 223, now it has numbers from 124->223 to use for ids. Server N+1 starts, does the same thing, and you can keep incrementing the counter to get the next batch of numbers.
Anything that supports atomic increments works. Design for whatever availability you need. Add in some random increments if you don't want perfectly sequential numbers. 64-bit numbers pretty much guarantee you'll never run out of space, while being much nicer to handle than UUIDs (which are 128-bit numbers but usually stored as bytes or strings).
If you want completely random IDs then UUID is the way to do it, but if it's just for load balancing or partitioning then you can always hash the numbers to get some random strings which work well enough.
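A minimal in-memory sketch of the batch scheme described above (the lock stands in for whatever atomic-increment store you actually use, e.g. Redis INCRBY or a database sequence):

```python
import threading

class Counter:
    """Stands in for any store that supports atomic increments."""
    def __init__(self, start=123):
        self._value = start
        self._lock = threading.Lock()

    def increment(self, by):
        with self._lock:
            self._value += by
            return self._value

class IdAllocator:
    """Each server reserves a whole batch of ids with one round trip."""
    def __init__(self, counter, batch_size=100):
        self.counter = counter
        self.batch_size = batch_size
        self._next = self._limit = 0

    def next_id(self):
        if self._next >= self._limit:
            # One call to the counter reserves batch_size fresh ids.
            self._limit = self.counter.increment(self.batch_size) + 1
            self._next = self._limit - self.batch_size
        self._next += 1
        return self._next - 1

counter = Counter(start=123)
server_n = IdAllocator(counter)    # reserves 124..223 on first use
server_n1 = IdAllocator(counter)   # reserves 224..323 on first use
print(server_n.next_id(), server_n1.next_id())  # 124 224
```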
Of course, this can be mitigated by allocating small batches of high keys as well if connectivity to the master is unreliable. However, with a 32-bit low range, the need for batch high-key allocation is negligible for all but the most pathological situations.
0 - https://stackoverflow.com/questions/282099/whats-the-hi-lo-a...
EDIT: Re-read your original description and realized you described a related, but different, approach of directly reserving key ranges.
I'm going to go all the way with this thought, as it is something I've been wanting to do for a long time.
I've been wanting to build an open source query proxy server. The idea is that it would sit in front of any database, accepting any type of query expression (first use case: SQL, of course). You would register database drivers with the platform, and then clients wouldn't need them anymore.
The proxy would accept queries destined for targets via a service call. Traffic to the service can be secured and encrypted. It would be able to be deployed in a cluster, so there are multiple endpoints to hit to load balance traffic.
It would parse all queries, recording the metadata about the queries and destination servers / tables / objects using something like Queryparser, so you'd have the benefit of all this metadata, but now you'd have it across data sources too.
This kind of data can also help power a data catalog, which could be a web app on top of the metadata / query repository, to help users find and/or annotate the data flowing around their organization (e.g. data democracy).
It could also have caching for result sets, additional security measures (RBAC), or even be linked up with a processing layer for additional processing (like Apache Spark).
There are commercial solutions for this, but I want to see an open source one. Especially because, I'd like to see all dialects and query type expressions supported, and I think only an open source project could get this kind of support across industry players.
One way I have thought about this type of query proxy is, it's a bit like a GraphQL back-end web service endpoint but with SQL instead of GraphQL as the interface. Using the native SQL/query expressions of the destination platform would simplify the work, since one would not need to create a generic query expression abstraction. We just need a way to extract the relevant information from the dialect/query that we are receiving.
We can let data platforms have different query semantics/dialects but still get metadata. For example, there are ways to map concepts for MongoDb to relational ones (collections are like tables, fields are like columns, etc.) so even a MongoDb query expression could go through the query proxy, and have metadata recorded in a consistent format, as long as you could express it to the query proxy's web service interface.
Maybe I'm just daydreaming... but I would love to work on that and let it be open source.
Take a look at https://calcite.apache.org/avatica/ -- it does some of what you're mentioning.