> why isn't it a standard feature of SQL to just compare two tables?
Not enough people have complained about needing it (it doesn't hit the desired ROI for a PM to prioritize the feature request).
I first came across SQLite's sqldiff years ago - super useful. It's a perfect example of the cottage industry that pops up to fill the gaps that huge software vendors like Microsoft leave open. I used to work at a company valued at hundreds of millions of dollars that made nothing but such gap-filling products.
SQL isn't any specific product for a PM to greenlight work on or not, it's a programming language (the L in SQL =) that a structured (the S in SQL =) data application can support as either "the" or "one of the" programming interfaces to query (the Q in SQL =) that data.
So it's really a question of why SQL, the language, doesn't come with a table comparison command so that you can write statements akin to "SELECT * FROM (COMPARE table1 table2) AS diff WHERE ..."
Wouldn't the process be that some PM wants a feature to be added and sends it off to the representative on the appropriate committee, who then pushes for it to be added?
I mean, it's not the SQL language itself which sits on the committee specifying new features; it would be representatives from Oracle, Microsoft, IBM and similar... Right?
If some PM wants to add a debugger to C, do they write a language extension proposal or do they focus on putting their requirements in a separate tool?
This is roughly the same problem. The feature exists, but it's rare enough that the only people that need it are programmers/DBAs that need to deep-dive into a specific issue. Regular people/applications will never need this feature, so why should it be baked in the core language? The specialists already have specialist tooling, so it makes much more sense to implement this as a toolkit feature than a language feature.
Author here! This was mostly intended for entertainment, and partly to demonstrate the idea of "relations as vectors" and "queries as polynomials". But I guess I'll indulge myself a bit more:
> Just use sqldiff
sqldiff is sensitive to ordering, e.g., it'll say the relation [1, 2] is different from [2, 1] (I consider them to be the same because they are the same multiset). You'd need to sort with ORDER BY first, but that also requires listing all attributes explicitly like the GROUP BY solution (ORDER BY * doesn't work).
> What about CHECKSUM
It's also sensitive to ordering, and I was told different tables can have the same CHECKSUM (hash collisions?).
> Are the tables the same if they only differ by schema?
I'd say no. Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)
> There are better ways to compare tables
Absolutely, my recursive query runs in time O(N^N) so I'm sure you can be a little better than that.
> Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)
It's actually still a useful definition! (Assuming we're talking about all deterministic SQL queries and can define precisely what we mean by that!)
It's a useful definition because it includes all possible aggregations as well, including very onerous ones like ridiculous STRING_AGG stuff. Those almost certainly wouldn't be candidates for reasonable queries to solve your original problem, but they are useful for benchmarking whether a proposed solution is accurate.
tl;dr: SQL hates you and it's your fault it hates you.
to elaborate: I'm rapidly appreciating that SQL is a "bondage and discipline" language where it is impossible for SQL to fail at any given task; it can only be failed by its users/developers.
edit: further, it occurs to me also that SQL hates you because in a sane language you'd be able to write up some generic method that says "compare every element of these collections to each other" that was reusable for all possible collections. Now try defining a scalar user-defined function that takes two arbitrary resultsets as parameters. Not only can I not do that, I'm a bad person for wanting it because it violates Relational Theory.
It doesn't violate relational theory at all -- it merely violates the query compiler's requirements about what must be known at the time of query compilation.
You can absolutely write a query that does what you want (you need a stored procedure that generates ad-hoc functions based on the involved tables' signatures), but stored SQL objects must have defined inputs and outputs. What you're asking for is a meta-SQL function that can translate to different query plans based on its inputs, and that's not allowed because its performance cannot be predetermined.
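For what it's worth, here's a minimal sketch of that idea (PostgreSQL-flavoured; the function name is made up, it assumes both tables share the first table's column list, and the EXCEPT-based body ignores duplicate rows, as discussed elsewhere in the thread):

CREATE OR REPLACE FUNCTION tables_differ(t1 regclass, t2 regclass)
RETURNS boolean LANGUAGE plpgsql AS $$
DECLARE
  cols text;
  n    bigint;
BEGIN
  -- Build the column list from the catalog at call time.
  SELECT string_agg(quote_ident(attname), ', ' ORDER BY attnum)
    INTO cols
    FROM pg_attribute
   WHERE attrelid = t1 AND attnum > 0 AND NOT attisdropped;

  -- Symmetric difference of the two tables; any surviving row means they differ.
  EXECUTE format(
    'SELECT count(*) FROM (
       (SELECT %1$s FROM %2$s EXCEPT SELECT %1$s FROM %3$s)
       UNION ALL
       (SELECT %1$s FROM %3$s EXCEPT SELECT %1$s FROM %2$s)) d',
    cols, t1, t2)
  INTO n;

  RETURN n > 0;
END;
$$;

Called as SELECT tables_differ('t1', 't2'); it looks up the column list and builds the comparison at execution time, which is exactly the "meta" step a statically-typed function can't do.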
I have this old language aware diff tool I wrote that already supports CSV format, so what I would do is:
1) Check the record length of the two tables given your RDBMS API.
2) Check the column headers and data types of the tables are the same given your RDBMS API.
3) Only then would I export all triggers associated with the given tables using the RDBMS API and then compare that code using an external SQL language aware diff tool that excludes from comparison white space, comments, and other artifacts of your choosing.
4) Then finally if all else is the same I would export each table as CSV format and then compare that output in a CSV language aware diff tool.
Well yes, it makes sense that one could just solve this problem however they choose, whether that be through the programming language of their choice or pen and paper. The whole point of the article is to purposefully restrict the use of external tools and see if it can be done wholly within the native SQL environment.
I don't think most SQL flavours support the MINUS operator, imho.
Bing Chat says:
> The MINUS operator is not supported in all SQL databases. It can be used in databases like MySQL and Oracle. For databases like SQL Server, PostgreSQL, and SQLite, use the EXCEPT operator to perform this type of query
Does this work with the bag/multiset distinction that the author uses? Like, if table1 has two copies of some row and table2 has a single copy of that row, won't this query return that they're the same? But they're not: table1 has two copies of the same row, whereas table2 just has one.
I found that a weird edge case for the original author to fixate on. In mathematics or academia, sure, but in "real" SQL tables that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.
Exactly. If you have two absolutely duplicate rows in a table, you're going to have problems with a lot of your queries. It's usually an indication that you are lacking a unique constraint or have a bug somewhere.
It's exactly something I need to support, because the only reason I am ever diffing two tables is to check that my new and improved sql query worked as expected against a known correct table. I don't want my new code to accidentally duplicate on some silly join and then not pick up on it in automated testing.
I have had it happen in real data from real systems - but it’s not good — “true duplicates” are a sign that something is missing from your schema or something has gone wrong earlier in the pipeline.
(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)
Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.
I handle this by having a guid field for a primary key on such tables where there isn't a naturally unique index in the shape of the data. So something is unique, and you can delete or ignore other rows relative to that. (Just don't make your guid PK clustered; I use create-date or log-date for that.)
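Roughly this shape, as a sketch (SQL Server flavour; the table and column names are just for illustration):

CREATE TABLE app_log (
    log_id   uniqueidentifier NOT NULL DEFAULT NEWID(),
    log_date datetime2        NOT NULL DEFAULT SYSUTCDATETIME(),
    message  nvarchar(max)    NULL,
    -- Something is unique, but it doesn't dictate the physical order...
    CONSTRAINT pk_app_log PRIMARY KEY NONCLUSTERED (log_id)
);

-- ...the clustered index stays on the log date instead.
CREATE CLUSTERED INDEX ix_app_log_date ON app_log (log_date);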
Your log tables don't have timestamps or line numbers on them?
More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be good options.
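The second option, as a tiny sketch (hypothetical table and columns):

-- Roll exact duplicates up into a quantity column
SELECT a, b, c, COUNT(*) AS qty
FROM line_items
GROUP BY a, b, c;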
A primary key plus a quantity field might not be a perfect solution from a performance point of view. And timestamps do not guarantee uniqueness.
Look, I'm not trying to win the argument. In most cases you're definitely right; my point is that sometimes you have to work with a working legacy codebase/system, and sometimes that system has some unique features.
The duplicate row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE like the original article suggests (which is not supported in all databases and is scarier for junior developers)... You can accomplish the same thing and handle that dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:
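(The exact query isn't reproduced here, so what follows is my own sketch of the shape being described, assuming both tables have columns a, b, c:)

SELECT a, b, c,
       SUM(cnt_a) AS rows_in_a,
       SUM(cnt_b) AS rows_in_b
FROM (
    SELECT a, b, c, COUNT(*) AS cnt_a, 0 AS cnt_b
    FROM table_a GROUP BY a, b, c
    UNION ALL
    SELECT a, b, c, 0 AS cnt_a, COUNT(*) AS cnt_b
    FROM table_b GROUP BY a, b, c
) counted
GROUP BY a, b, c
HAVING SUM(cnt_a) <> SUM(cnt_b)  -- keep only value sets whose counts disagree
ORDER BY a, b, c;                -- sorted, to make differences easier to read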
It will catch both if you have 1 row for a set of values in one table and 0 in another... or vice-versa... or 1 row for a set of values in one table and 2+ (dupes) in another.
I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.
Pseudocode summary explanation: Create (via group by) a rowcount for every single set of column values you give it from table A, create that same rowcount for every single set of column values from table B, then compare if those rowcounts match for all rows, and lets you know if they don't (sorted to make it easier to read when you do have differences). A nice fast set-based operation.
In theory, yes, however the vast majority of tables will have some form of unique ID in each record... so in practice, there’s usually no difference. But if it must work for all tables...
Log analytics or warehouse tables often have no simple useful key for this sort of comparison.
Also in a more general case you might be comparing tables that may contain the same data but have been constructed from different sources. Or perhaps a distributed dataset became disconnected and may have seen updates in both partitions, and you have brought them together to compare to try decide which to keep or if it is worth trying to merge. In those and other circumstances there may be a key but if it is a surrogate key it will be meaningless for comparing data from two sets of updates, so you would have to disregard it and compare on the other data (which might not include useful candidate keys).
It happens. I’m currently working on a project where the CRM tool I need to access for data, actually does not have a unique id in its db. I have no idea if I will be able to successfully complete the project yet.
Is there any chance that the rows actually do have a unique id, but it's not being displayed without some magic incantation?
Asking because I've seen that before in some software, where it tries to "keep things simple" by default. But that behaviour can be toggled off so it shows the full schema (and data) for those with the need. :)
That's not all that unusual when something gets implemented, as people tend to take the easy approach for things that meet the desired goal.
It just sounds like the spec they were writing to wasn't very clear, or it was just a checkbox list of features provided to them by marketing. So "let's get this list done, then ship it". ;)
Honestly, if the resultsets are small enough, I just dump them to JSON and diff the files. But it has to be fully deterministically sorted for that (in a sane world "order by *" would be valid ANSI SQL).
Thank you for mentioning Data Diff! Founder of Datafold here.
We built Data Diff to solve a variety of problems that we encountered as data engineers: (A) Testing SQL code changes by diffing the output of the production/dev versions of a SQL query. (B) Validating that data is consistent when replicating data between databases.
Data Diff has two algorithms implemented for diffing in the same database and across databases.
The former is based on JOIN, and the latter utilizes checksumming with binary search, which has minimal network IO and database workload overhead.
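To give a rough feel for the cross-database approach (a simplified, Postgres-flavoured illustration of range checksumming in general, with a made-up table and an integer key; not the production implementation):

-- One level of the idea: run this on both sides and compare the results.
SELECT id / 1000000           AS bucket,
       COUNT(*)               AS row_count,
       SUM(hashtext(t::text)) AS bucket_checksum
FROM some_table t
GROUP BY id / 1000000
ORDER BY bucket;

-- Buckets whose (row_count, bucket_checksum) agree are skipped; mismatching
-- buckets are split into narrower id ranges and checked again, so only the
-- differing rows ever need to cross the network.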
SELECT count(q.*)
FROM (SELECT a, b
      FROM table_a a
      NATURAL FULL OUTER JOIN table_b b
      WHERE a IS NOT DISTINCT FROM NULL
         OR b IS NOT DISTINCT FROM NULL) q;
This looks for rows in `a` that are not in `b` and vice-versa and produces a count of those.
The key for this in SQL is `NATURAL FULL OUTER JOIN` (and row values).
Are tables the same if the schema differs in names only but the content is the same? Are tables different if one table has triggers defined on it where the other one doesn't? Same for various constraints? What if tables have infinite reference loops? Or if different foreign keys point to the "equal" rows? What if types in the schema don't match but values do (if interpreted as some third type)? Are currently active transactions taken into account?
I find the idea of comparing two SQL tables weird / pointless in general. Maybe OK, if implemented with some very restrictive and well-defined semantics, but I wouldn't rely on a third-party tool to do something like this.
CHECKSUM and CHECKSUM_AGG do not work for all data items; at least for SQL Server, "noncomparable data types" are a show stopper for CHECKSUM and NULLs are for CHECKSUM_AGG.
15 years ago we just used Red Gate's SQL Compare. I think it's still SQL Server only, but it really seems like this should have been solved for most mature databases by now.
I was sad to not see them move their tooling to multi-database. I had a couple calls with them ~5-6 years ago to talk about our needs and how they would be solved by a Postgres version of their tools (like SQL Compare), but apparently it never went anywhere.
Having been a Red Gate customer in 2005, I decided to take a look at their current offerings. Based on https://www.red-gate.com/products/ it seems like they have started to look at other databases. The newest products seem to have support for at least 3-4 databases. They are maybe caught between rewriting a legacy product and building out the features in their successor products.
I guess this is an interesting question, but it seems incredibly contrary to the whole point of SQL. Two tables should never be the same; if they were, why would you have two tables? I can understand why one might want to check if two datasets were the same, which is easy to do with sort and diff, but two tables? They shouldn't even have the same fields.
It could be the case that the two tables represent two independent instantiations of a single or very similar concept. One might be a list of chefs and their DOBs, and another a list of robots and their manufacture date, and for whatever reason you want to see if both of them are identical. There's no reason to unify both into a single table if your data model doesn't presuppose any correlation between chefs and robots, but someone might want to see if all their chefs are robots and if all their robots work as chefs. Obviously it's a contrived example, but my point is that there exist cases where having two tables with the same columns makes sense, if they store unrelated data. That the data is unrelated doesn't mean that the tables cannot contain identical data, it just means that they don't necessarily contain identical data.
Migrating data or related processes and running a couple of things in parallel to make sure nothing was lost in translation… some sort of data replication hacking, maybe.
Sometimes you want them to be equal, sometimes you want to know they aren’t… but it’s not out of the ordinary at all to want to check
I don’t think this article is expressing any opinions on normalization
I'm surprised FULL OUTER JOINs weren't discussed. Then I realized SQLite doesn't support them. Perhaps it should have been titled:
"How to Check 2 SQLite Tables Are the Same"
I think SQLite's great, but "fully featured SQL engine" is not one of its strengths. More like "perfectly adequate SQL engine" in an astoundingly compact operating envelope.
Yeah - I was going to say the same thing. Hashing would be the way to go if you're going for runtime and efficiency. You select the elements in an ordered manner, though, so that makes it n*log(n), and then produce a hash value for each table (or each row in the table) -- I'm trying to think if there is a better way of doing it in less time, but I'm not seeing it yet. You could do this in O(n) or even constant time, but it adds a bit of complexity to the solution, I think.
If you hash each row in isolation and XOR the hashes of all rows, you don’t need the ordering. I also think XOR-ing doesn’t lose you ‘resolving power’ for this problem, compared to hashing the hashes or hashing everything in one go.
XOR is not a great choice here. Consider that 2 copies of a row give the same result as 0 (or 4, 6, etc). And even without multiple copies of rows, you can force any hash you'd like by observing what happens when you insert more random rows and finding a subcollection that flips exactly the bits you want.
What you probably want to look at is homomorphic hashing. This is usually implemented by hashing each row to an element of an appropriate abelian group and then using the group operation to combine them.
With suitable choice of group, this hash can have cryptographic strength. Some interesting choices here are lattices (LtHash), elliptic curves (ECMH), multiplicative groups (MuHash).
It depends on whether you are doing something security critical with the result.
Maybe you have a trusted table hash but only a user-supplied version of the table. Before you use that data for security sensitive queries, you should verify it hasn't been modified.
Basically, if you ever have to contend with a malicious adversary, things are, as usual, more interesting. If not, addition is likely fine (though 2^k copies of a row now leave the k lowest bits unchanged).
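For the non-adversarial case, a minimal sketch of that kind of order-independent fingerprint (Postgres-flavoured, made-up table name; md5 is just a convenient stand-in for a per-row hash):

-- Hash each row, keep 60 bits of it, and sum; row order no longer matters.
SELECT SUM(('x' || left(md5(t::text), 15))::bit(60)::bigint) AS fingerprint
FROM some_table t;

Run it once per table and compare the two fingerprints.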
In the q language, it really is t1~t2 (match) for a full match, or t1=t2 for item-wise. When you have a fully coherent programming language, it makes sense that this and many other things just work. E.g. in q there are no CTEs; you just assign and use variables as you would in most languages.
Hmm, CONCAT all fields in each row into a string, hash that string, use that as a derived table where you order the rows definitively, and concat all rows from that derived table into a string and hash that string. Repeat for the second table. Compare hashes. You can do it in a single query with two derived tables.
You could do it without the inner hashing, but that seems more likely to exhaust memory if your rows are big enough, since you're basically hashing an entire table at that point and that's a large string in memory.
Yeah, I was just thinking of that, and how you'd probably want to concat with some level of structure to eliminate this. e.g. by adding a delimiter between fields and doing a find and replace on that delimiter (carefully, e.g. CSV rules with quotes and escapes). Whatever it takes to definitively differentiate foo/barfoo from foobar/foo. Maybe just prefixing every field with its length is sufficient.
I'm sure there's some relevant papers on the simplest way to achieve this I can and should look up. Hopefully they don't summarize as it being a much harder problem to do right. ;)
Instead of CONCAT you could use something like JSON serialization, which preserves structure (and avoids reinventing the wheel on quoting, etc.) for the rows; it may not be maximally efficient but is probably the path of least resistance.
Yeah, as long as every RDBMS has a JSON serialization method now (do they? I'm not up on the SQL standards or where each RDBMS is in its implementation of them). Otherwise, if we're shipping our own serialization as a procedure built from core SQL functions, something simpler might be sufficient: replace backslash with double backslash, replace pipe with backslash pipe, concatenate with pipes.
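If JSON is available, something along these lines (a Postgres-flavoured sketch with a made-up table name; sorting by the row hashes themselves gives a definitive order without naming every column):

SELECT md5(string_agg(row_hash, '' ORDER BY row_hash)) AS table_hash
FROM (
    SELECT md5(to_jsonb(t)::text) AS row_hash  -- JSON handles quoting/escaping per row
    FROM some_table t
) per_row;

Run it once against each table and compare the two hashes.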
Comparing for exactness is one challenge, but just a start - identifying what has changed is a bigger challenge. And that's where the world of file/database comparators come in. A somewhat niche but easy-to-use example of one with lots of options that others have sought to mimic in open source is SAS's PROC COMPARE, which can compare SAS datasets or other vendor's database tables (which SAS can effectively treat as if they're SAS datasets).
SAS COMPARE Procedure
Example 1: Producing a Complete Report of the Differences
DataComPy (open-source python software developed by Capital One)
DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS’s PROC COMPARE for Pandas DataFrames with some more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it prints out some stats, and lets you tweak how accurate matches have to be).
from io import StringIO
import pandas as pd
import datacompy

# Two small illustrative frames (not the ones from the DataComPy docs) so the
# snippet actually runs:
df1 = pd.read_csv(StringIO("acct_id,dollar_amt\n1,10.00\n2,20.00\n"))
df2 = pd.read_csv(StringIO("acct_id,dollar_amt\n1,10.00\n2,20.05\n"))

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0,               # Optional, defaults to 0
    rel_tol=0,               # Optional, defaults to 0
    df1_name='Original',     # Optional, defaults to 'df1'
    df2_name='New'           # Optional, defaults to 'df2'
)
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())
I’ve worked on tools for this a bunch of times and one particular instance was my favorite. Checksums were used for equivalence - but the real fun was in displaying the differences in a meaningful way. We’d show:
- which columns were only in the left, or only in the right
- and then show which rows are only in the left, or only on the right
- and then for rows that are in both but have some cell differences show a set that shows those differences. (Hard to explain how this was done… it was concise but rich with details.)
This kind of “thorough” comparison was very useful for understanding the differences all at once.
In any normal, halfway well-designed application, no row should be duplicated within a table, thus a comparison with EXCEPT is a valid solution. And it's often used in practice.
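i.e., something along these lines (empty result = same, assuming identical column lists and no duplicate rows):

(SELECT * FROM t1 EXCEPT SELECT * FROM t2)
UNION ALL
(SELECT * FROM t2 EXCEPT SELECT * FROM t1);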
I've worked with huge tables doing ETL and always used the MINUS clause for the purpose of comparing tables. I was using Teradata though; not sure if it's available in all DB engines.
Slightly related: I recently needed to compare several large databases for schema differences. The output from `pg_dump` isn't very diff-friendly at all, so I built this[1]. It outputs schemas in JSON or text, and ensures a stable sort ordering.
It also optionally masks index and constraint names, which might be auto-generated and not relevant to the comparison.
The most modern way is to tokenize the database, or the two databases in case the tables belong to different databases, then you can pretrain a large language model. Fine tune it with the two tables, then apply prompt engineering to ask the ai model in natural language. No need to learn SQL!
I'm just sure there's some important relational theory reason why you're a bad person for wanting to do this and it's not a flaw in the language at all that "check the equivalence of two result-sets" is so difficult.
Yes there is! A table should have a primary key, which ensures no two rows in a table are exactly equal. So the simple solution with ‘a except b’ should work.
Lots of otherwise simple things get complicated with duplicates, e.g. try to change or delete one row out of a set of duplicates.
A table with duplicates is not even a relation, since relations are defined as sets. SQL is in a weird place because it technically allows duplicates, but many operations are not able to distinguish between duplicates.
1. Write an SQL query to retrieve entities. Join the multi-row relational tables to the primary entity using a JSON aggregation function. This ensures all related data remains on a single row (see the sketch after this list).
2. Export the query results to a .csv file(s).
3. Utilize Go along with the encoding/csv package to process each CSV row. Construct an in-memory index mapping each entity ID to its byte offset within the CSV file.
4. Traverse the CSV again, using the index to quickly locate and read corresponding lines into in-memory structures. Convert the aggregated JSON columns to standard arrays or objects.
5. After comparing individual CSV rows, save the outcomes to a map. This map associates each entityID with metadata about columns that don't match.
6. Convert the mismatch map into JSON format for further processing or segmentation.
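A sketch of what step 1's query can look like (Postgres-flavoured, with a made-up customers/orders schema):

SELECT c.customer_id,
       c.name,
       jsonb_agg(jsonb_build_object('order_id', o.order_id, 'total', o.total)
                 ORDER BY o.order_id) AS orders  -- child rows rolled into one column
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
ORDER BY c.customer_id;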
For full comparisons of MSSQL tables, I often use Visual Studio which has a Data Comparison tool that shows identical rows and differences between source and target.
Not sure I understand why implementing EXCEPT ALL is hard. You just do +1 for every row in the left side and -1 for every row in the right, and then consolidate. Am I missing something?
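Spelling the +1/-1 idea out as a query (a Postgres-flavoured sketch assuming columns a and b; generate_series re-expands the surviving multiplicities), t1 EXCEPT ALL t2 becomes:

SELECT s.a, s.b
FROM (
    SELECT a, b, SUM(delta) AS remaining
    FROM (SELECT a, b, +1 AS delta FROM t1
          UNION ALL
          SELECT a, b, -1 AS delta FROM t2) tagged
    GROUP BY a, b
    HAVING SUM(delta) > 0
) s
CROSS JOIN generate_series(1, s.remaining);  -- one output row per surviving copy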
Consider owning an ETL system where users can create periodic jobs, like calculating a "daily active user" rollup table from a bunch of event source tables. Now consider being asked to switch the query execution engine from A to B, perhaps because engine B is much more cost-effective for your system's query patterns. Ex: You are tasked with switching from MapReduce to Tez, from Presto to Spark, etc. Before full migration, you can reduce risk by double-writing jobs, with engine A writing to the standard output location, and engine B writing to some other migration location. This should lead to two identical SQL tables, which you should then verify, because sometimes these execution engines have bugs hiding in the corners :)
Because it’s a (worse) variant of one of the first queries the article gives:
SELECT *, COUNT(*)
FROM t1
GROUP BY x, y, z, ... -- all attributes of t1
EXCEPT
SELECT *, COUNT(*)
FROM t2
GROUP BY x, y, z, ... -- all attributes of t2
https://www.sqlite.org/sqldiff.html
By default it compares whole databases, but it can be told to only compare a specific table in each:
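Something like this (the table and database file names here are placeholders):

sqldiff --table sometable first.db second.db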