How to check if two SQL tables are the same (github.com/remysucre)
272 points by remywang on July 28, 2023 | 120 comments



On the more practical side, SQLite has the "sqldiff" utility:

https://www.sqlite.org/sqldiff.html

By default it compares whole databases, but it can be told to only compare a specific table in each:

    sqldiff -t mytable database1.sqlite database2.sqlite


Indeed.

> why isn't it a standard feature of SQL to just compare two tables?

Not enough people have complained about needing it (it doesn't hit the desired ROI for a PM to prioritize the feature request).

I first came across SQLite's sqldiff years ago - super useful. It's a perfect example of the industries that pop up to fill the gaps that huge software vendors like Microsoft leave open. I used to work at a company valued at hundreds of millions of dollars that made nothing but such gap-filling products.


SQL isn't any specific product for a PM to greenlight work on or not, it's a programming language (the L in SQL =) that a structured (the S in SQL =) data application can support as either "the" or "one of the" programming interfaces to query (the Q in SQL =) that data.

So it's really a question of why SQL, the language, doesn't come with a table comparison command so that you can write statements akin to "SELECT * FROM (COMPARE table1 table2) AS diff WHERE ..."


Wouldn't the process be that some PM wants a feature to be added and sends it off to the representative on the appropriate committee, who then pushes for it to be added?

I mean, it's not the SQL language itself which sits on the committee specifying new features; it would be representatives from Oracle, Microsoft, IBM and similar... Right?


If some PM wants to add a debugger to C, do they write a language extension proposal or do they focus on putting their requirements in a separate tool?

This is roughly the same problem. The feature exists, but it's rare enough that the only people who need it are programmers/DBAs who need to deep-dive into a specific issue. Regular people/applications will never need this feature, so why should it be baked into the core language? The specialists already have specialist tooling, so it makes much more sense to implement this as a toolkit feature than a language feature.


Isn't that what MS SQL Server's tablediff does? https://learn.microsoft.com/en-us/sql/tools/tablediff-utilit...


Author here! This was mostly intended for entertainment, and partly to demonstrate the idea of "relations as vectors" and "queries as polynomials". But I guess I'll indulge myself a bit more:

> Just use sqldiff

sqldiff is sensitive to ordering, e.g., it'll say the relation [1, 2] is different from [2, 1] (I consider them to be the same because they are the same multiset). You'd need to sort with ORDER BY first, but that also requires listing all attributes explicitly like the GROUP BY solution (ORDER BY * doesn't work).

> What about CHECKSUM

It's also sensitive to ordering, and I was told different tables can have the same CHECKSUM (hash collisions?).

> Are the tables the same if they only differ by schema?

I'd say no. Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)

> There are better ways to compare tables

Absolutely, my recursive query runs in time O(N^N), so I'm sure you can do a little better than that.


> Perhaps a better definition of "the same" is that all SQL queries (using "textbook" SQL features) return the same result over the tables, if you just replace t1 with t2 in the query. Wait, but how do you know if the results are the same... :)

It's actually still a useful definition! (Assuming we're talking about all deterministic SQL queries and can define precisely what we mean by that!)

It's a useful definition because it includes all possible aggregations as well, including very onerous ones like ridiculous STRING_AGG stuff. Those almost certainly wouldn't be candidates for reasonable queries to solve your original problem, but they are useful for benchmarking whether a proposed solution is accurate.


Not sure if Go code is your kind of thing. But if it is, my colleague wrote a SQLite "diff" capability for our online SQLite hosting service:

https://github.com/sqlitebrowser/dbhub.io/blob/5c9e1ab1cfe0f...

The code there can also output a "merge" object from the differences, in order to merge the differences from one database object into another.


When you say "bag semantics" are you just referring to Set Theory? Is there a formal distinction between bags and sets?


With bag (multiset) semantics, {1, 2, 2} is different from {1, 2}. With set semantics, they're the same.
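A quick way to see the difference in SQL itself (a minimal sketch, PostgreSQL-flavored; EXCEPT uses set semantics, EXCEPT ALL uses bag semantics):

  SELECT x FROM (VALUES (1), (2), (2)) AS t(x)
  EXCEPT
  SELECT x FROM (VALUES (1), (2)) AS s(x);      -- 0 rows: "the same" as sets

  SELECT x FROM (VALUES (1), (2), (2)) AS t(x)
  EXCEPT ALL
  SELECT x FROM (VALUES (1), (2)) AS s(x);      -- 1 row (the extra 2): different as bags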


tl;dr: SQL hates you and it's your fault it hates you.

to elaborate: I'm rapidly appreciating that SQL is a "bondage and discipline" language where it is impossible for SQL to fail at any given task, it can only be failed by its users/developers.

edit: further, it occurs to me also that SQL hates you because in a sane language you'd be able to write up some generic method that says "compare every element of these collections to each other" that was reusable for all possible collections. But try defining a scalar user-defined function that takes two arbitrary resultsets as parameters: not only can I not do that, I'm a bad person for even wanting it because it violates Relational Theory.


> because it violates Relational Theory

It doesn't violate relational theory at all -- it merely violates the query compiler's requirements about what must be known at the time of query compilation.

You can absolutely write a query that does what you want (you need a stored procedure that generates ad-hoc functions based on the involved tables' signatures), but stored SQL objects must have defined inputs and outputs. What you're asking for is a meta-SQL function that can translate to different query plans based on its inputs, and that's not allowed because its performance cannot be predetermined.


C has had this problem solved with preprocessor macros for 50 years.


You should definitely check out dbt :)

Some links in my comment: https://news.ycombinator.com/item?id=36911937


I have this old language aware diff tool I wrote that already supports CSV format, so what I would do is:

1) Check the record length of the two tables using your RDBMS API.

2) Check that the column headers and data types of the tables are the same, using your RDBMS API.

3) Only then would I export all triggers associated with the given tables using the RDBMS API and compare that code in an external SQL-language-aware diff tool that excludes white space, comments, and other artifacts of your choosing from the comparison.

4) Finally, if all else is the same, I would export each table as CSV and compare that output in a CSV-aware diff tool.

This can all be done with shell scripts.


Well yes, it makes sense that one could just solve this problem however they choose, whether that be through the programming language of their choice or pen and paper. The whole point of the article is to purposefully restrict the use of external tools and see if it can be done wholly within the native SQL environment.


I've always used this. Any rows returned means there are differences:

  select * from (
    (
      select *
      from table1
      minus
      select *
      from table2
    )
    union all
    (
      select *
      from table2
      minus
      select *
      from table1
    )
  )


I don't think most SQL flavours support the MINUS operator.

Bing Chat says: > The MINUS operator is not supported in all SQL databases. It can be used in databases like MySQL and Oracle. For databases like SQL Server, PostgreSQL, and SQLite, use the EXCEPT operator to perform this type of query
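For those engines, the query above can be written with EXCEPT instead (a sketch, PostgreSQL-flavored; as before, any rows returned mean the tables differ, at least as sets):

  SELECT * FROM (
    (SELECT * FROM table1 EXCEPT SELECT * FROM table2)
    UNION ALL
    (SELECT * FROM table2 EXCEPT SELECT * FROM table1)
  ) AS d;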


Does this work with the bag/multiset distinction that the author uses? Like, if table1 has two copies of some row and table2 has a single copy of that row, won't this query report that they're the same? But they're not: table1 has two copies of the same row, whereas table2 just has one.


I found that a weird edge case for the original author to fixate on. In mathematics or academia, sure, but in "real" SQL tables that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.


Exactly. If you have two absolutely duplicate rows in a table, you're going to have problems with a lot of your queries. It's usually an indication that you are lacking a unique constraint or have a bug somewhere.


It's exactly something I need to support, because the only reason I am ever diffing two tables is to check that my new and improved SQL query worked as expected against a known correct table. I don't want my new code to accidentally duplicate rows on some silly join and then have automated testing fail to pick up on it.


> duplicate rows are not something you need to support

I can imagine that you want to have duplicate rows in a logging table. If some event happens twice, you definitely want to log it twice.


I have had it happen in real data from real systems - but it’s not good — “true duplicates” are a sign that something is missing from your schema or something has gone wrong earlier in the pipeline.

Trying to delete duplicates (but leave 1 behind) is tricky in itself. I recorded notes on it one time here — using “row_number()” to act as the iniquitie, https://til.secretgeek.net/sql_server/delete_duplicate_rows....
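A sketch of that technique (SQL Server flavored; the table and column names here are hypothetical):

  WITH numbered AS (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY col_a, col_b, col_c
                                ORDER BY (SELECT NULL)) AS rn
      FROM my_table
  )
  DELETE FROM numbered
  WHERE rn > 1;   -- keeps exactly one row from each duplicate group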


(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)


Logs usually have a timestamp that would differentiate the two events.


Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.

I handle this by having a guid field for a primary key on such tables where there isn't a naturally unique index in the shape of the data. So something is unique, and you can delete or ignore other rows relative to that. (Just don't make your guid PK clustered; I use create-date or log-date for that.)
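Roughly this shape, as a sketch (SQL Server flavored; names are hypothetical) - the GUID primary key is kept nonclustered and the clustered index goes on the log date instead:

  CREATE TABLE app_log (
      log_id    uniqueidentifier NOT NULL DEFAULT NEWID(),
      logged_at datetime2        NOT NULL DEFAULT SYSUTCDATETIME(),
      message   nvarchar(max)    NOT NULL,
      CONSTRAINT pk_app_log PRIMARY KEY NONCLUSTERED (log_id)
  );
  CREATE CLUSTERED INDEX ix_app_log_logged_at ON app_log (logged_at);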


Your log tables don't have timestamps or line numbers on them?

More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be good options.


A primary key plus a quantity field might not be a perfect solution from a performance point of view, and timestamps do not guarantee uniqueness.

Look, I'm not trying to win the argument. In most cases you're definitely right; my point is that sometimes you have to work with a working legacy system, and sometimes that system has some unique features.


It's still not something you "want to have", it's something you might be forced to deal with because someone screwed up the design.

And ensuring you have a real primary key should only be good for performance, in the realm of SQL databases.


I spent time researching this a while back.

The duplicate row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE like the original article suggests (which is not supported in all databases and is scarier for junior developers)... You can accomplish the same thing, and handle the dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:

https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...

It will catch both if you have 1 row for a set of values in one table and 0 in another... or vice-versa... or 1 row for a set of values in one table and 2+ (dupes) in another.

I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.

Pseudocode summary explanation: Create (via group by) a rowcount for every single set of column values you give it from table A, create that same rowcount for every single set of column values from table B, then compare if those rowcounts match for all rows, and lets you know if they don't (sorted to make it easier to read when you do have differences). A nice fast set-based operation.
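A sketch of that shape (column names hypothetical; the linked query is the worked-out version):

  SELECT col_a, col_b, col_c,
         SUM(in_t1) AS t1_count,
         SUM(in_t2) AS t2_count
  FROM (
      SELECT col_a, col_b, col_c, 1 AS in_t1, 0 AS in_t2 FROM t1
      UNION ALL
      SELECT col_a, col_b, col_c, 0 AS in_t1, 1 AS in_t2 FROM t2
  ) u
  GROUP BY col_a, col_b, col_c
  HAVING SUM(in_t1) <> SUM(in_t2)
  ORDER BY col_a, col_b, col_c;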


This is symmetric difference, but still has the problem that it's a set operation whereas in general a table is a bag (multiset).


In theory, yes, however the vast majority of tables will have some form of unique ID in each record... so in practice, there’s usually no difference. But if it must work for all tables...


Realistically which production DB tables don't have a unique id? Genuine question, never used one in my life.


Log analytics or warehouse tables often have no simple useful key for this sort of comparison.

Also, in a more general case you might be comparing tables that may contain the same data but have been constructed from different sources. Or perhaps a distributed dataset became disconnected and may have seen updates in both partitions, and you have brought them together to compare, to try to decide which to keep or whether it is worth trying to merge. In those and other circumstances there may be a key, but if it is a surrogate key it will be meaningless for comparing data from two sets of updates, so you would have to disregard it and compare on the other data (which might not include useful candidate keys).


Also, database tables where unique key constraints aren't enforced. Programming and operational mistakes happen. :-)

https://stackoverflow.com/questions/62735776/what-is-the-poi...


It happens. I'm currently working on a project where the CRM tool I need to access for data actually does not have a unique id in its db. I have no idea if I will be able to successfully complete the project yet.


Is there any chance that the rows actually do have a unique id, but it's not being displayed without some magic incantation?

Asking because I've seen that before in some software, where it tries to "keep things simple" by default. But that behaviour can be toggled off so it shows the full schema (and data) for those with the need. :)


Sadly, no.

The manufacturer is just really incompetent.

I was told their reason when asked was „it was easier (for us)“.


> it was easier (for us)

That's not all that unusual when something gets implemented, as people tend to take the easy approach for things that meet the desired goal.

It just sounds like the spec they were writing to wasn't very clear, or it was just a checkbox list of features provided to them by marketing. So "let's get this list done then ship it". ;)


The question is whether it was actually easier.

Even a couple minutes of extra debugging takes longer than learning how to add a synthetic primary key.


For example, tables that store huge amounts of logs or sensor data, where IDs are not very useful and just increase space usage and decrease performance.


PostTags in the published Stack Overflow schema - https://data.stackexchange.com/stackoverflow/query/edit/1772...

It happens a lot when people are implementing something quick and often happens in linking tables.


Don’t things like BigQuery always allow duplicates?


Good point. I have not used it before, but it looks like you have to add a unique ID if you want one.


If the issue happens a lot, there is also: https://github.com/datafold/data-diff

That is a nice tool for doing it cross-database as well.

I think it's based on a checksum method.


Honestly, if the resultsets are small enough, I just dump them to JSON and diff the files. But the output has to be fully deterministically sorted for that (in a sane world "order by *" would be valid ANSI SQL).


Thank you for mentioning Data Diff! Founder of Datafold here. We built Data Diff to solve a variety of problems that we encountered as data engineers: (A) Testing SQL code changes by diffing the output of production/dev versions of SQL query. (B) Validating that data is consistent when replicating data between databases.

Data Diff has two algorithms implemented for diffing in the same database and across databases. The former is based on JOIN, and the latter utilizes checksumming with binary search, which has minimal network IO and database workload overhead.


Using row values in PG:

  SELECT count(q.*)
  FROM (SELECT a, b FROM table_a a
        NATURAL FULL OUTER JOIN table_b b
        WHERE a IS NOT DISTINCT FROM NULL
           OR b IS NOT DISTINCT FROM NULL) q;
This looks for rows in `a` that are not in `b` and vice-versa and produces a count of those.

The key for this in SQL is `NATURAL FULL OUTER JOIN` (and row values).


Are tables the same if the schema differs in names only but the content is the same? Are tables different if one table has triggers defined on it where the other one doesn't? Same for various constraints? What if tables have infinite reference loops? Or if different foreign keys point to the "equal" rows? What if types in the schema don't match but values do (if interpreted as some third type)? Are currently active transactions taken into account?

I find the idea of comparing two SQL tables weird / pointless in general. Maybe OK, if implemented with some very restrictive and well-defined semantics, but I wouldn't rely on a third-party tool to do something like this.


""Fine," you say, "just generate the query and get back to work". Problem is, I don't feel like working today..."

And after that, it naturally escalates quickly :)

Never thought I'd ever see SQL code golf, but here we are.


Quickly Compare Data in Two Tables Using CHECKSUM and CHECKSUM_AGG

https://sqlundercover.com/2018/12/18/quickly-compare-data-in...


CHECKSUM and CHECKSUM_AGG do not work for all data items; at least for SQL Server, "noncomparable data types" for CHECKSUM and NULLs for CHECKSUM_AGG are showstoppers.

Also, CHECKSUM/CHECKSUM_AGG do not seem to be SQL standard functions; see https://www.postgresql.org/docs/current/features.html and https://en.wikipedia.org/wiki/SQL:2023#New_features.


I remember casting some types to varbinary and then using fn_repl_hash_binary() to generate the hashes.

And by “remember” I mean I wrote it down here so that I wouldn’t have to - https://til.secretgeek.net/sql_server/bulk_comparison_with_h...


15 years ago we just used Red Gate's SQL Compare. I think it's still SQL Server only, but it really seems like this should have been solved for most mature databases by now.


I was sad to not see them move their tooling to multi-database. I had a couple calls with them ~5-6 years ago to talk about our needs and how they would be solved by a Postgres version of their tools (like SQL Compare), but apparently it never went anywhere.


Having been a Red Gate customer in 2005, I decided to take a look at their current offerings. Based on https://www.red-gate.com/products/ it seems like they have started to look at other databases. The newest products seem to have support for at least 3-4 databases. They are maybe caught between rewriting a legacy product and building out the features in their successor products.


The fact that Red Gate is still a necessary QOL add-on for a product that costs that much should be shameful.


I guess this is an interesting question, but it seems incredibly contrary to the whole point of SQL. Two tables should never be the same; if they were, why would you have two tables? I can understand why one might want to check if two datasets were the same, which is easy to do with sort and diff, but two tables? They shouldn't even have the same fields.


It could be the case that the two tables represent two independent instantiations of a single or very similar concept. One might be a list of chefs and their DOBs, and another a list of robots and their manufacture date, and for whatever reason you want to see if both of them are identical. There's no reason to unify both into a single table if your data model doesn't presuppose any correlation between chefs and robots, but someone might want to see if all their chefs are robots and if all their robots work as chefs. Obviously it's a contrived example, but my point is that there exist cases where having two tables with the same columns makes sense, if they store unrelated data. That the data is unrelated doesn't mean that the tables cannot contain identical data, it just means that they don't necessarily contain identical data.


Migrating data or related processes and running a couple of things in parallel to make sure nothing was lost in translation… some sort of data replication hacking, maybe.

Sometimes you want them to be equal, sometimes you want to know they aren’t… but it’s not out of the ordinary at all to want to check

I don’t think this article is expressing any opinions on normalization


I'm surprised FULL OUTER JOINs weren't discussed. Then I realized SQLite doesn't support them. Perhaps it should have been titled:

"How to Check 2 SQLite Tables Are the Same"

I think SQLite's great, but "fully featured SQL engine" is not one of its virtues. More like "perfectly adequate SQL engine" in an astoundingly compact operating envelope.


FULL OUTER JOINs were added in SQLite3 3.39.0. I've not checked if that includes NATURAL FULL OUTER JOIN.

https://sqlite.org/releaselog/3_39_0.html


If you have a total ordering and can get away with pulling the entire contents, just hash it. You don't have to contend with any of the sets stuff.

I suppose there are some other edge cases, like if you're storing floats, but that's not so different no matter what technique you wind up with.


Yeah - I was going to say the same thing. Hashing would be the way to go if you're going for runtime and efficiency. You select the elements in an ordered manner, though, so that makes it n*log(n), and then produce a hash value for each table (or each row in the table). I'm trying to think if there is a better way of doing it in less time right now, but I'm not seeing it yet. You could do this in O(n) or even constant time, but it adds a bit of complexity to the solution I think.


If you hash each row in isolation and XOR the hashes of all rows, you don’t need the ordering. I also think XOR-ing doesn’t lose you ‘resolving power’ for this problem, compared to hashing the hashes or hashing everything in one go.


XOR is not a great choice here. Consider that 2 copies of a row give the same result as 0 (or 4, 6, etc). And even without multiple copies of rows, you can force any hash you'd like by observing what happens when you insert more random rows and finding a subcollection that flips exactly the bits you want.

What you probably want to look at is homomorphic hashing. This is usually implemented by hashing each row to an element of an appropriate abelian group and then using the group operation to combine them.

With suitable choice of group, this hash can have cryptographic strength. Some interesting choices here are lattices (LtHash), elliptic curves (ECMH), multiplicative groups (MuHash).


> XOR is not a great choice here. Consider that 2 copies of a row give the same result as 0 (or 4, 6, etc).

That indeed is a major flaw. You have to use another commutative operation that doesn’t destroy entropy. Addition seems a good choice to me.

> And even without multiple copies of rows, you can force any hash you'd like

I don’t see how that matters for this problem.


It depends on whether you are doing something security critical with the result.

Maybe you have a trusted table hash but only a user-supplied version of the table. Before you use that data for security sensitive queries, you should verify it hasn't been modified.

Basically, if you ever have to contend with a malicious adversary, things are more interesting as usual. If not, addition is likely fine (though 2^k copies of a row now leave the k lowest bits unchanged).


Ahhh nice - yup that's one solution and really elegant as well. Great suggestion :)


In the q language, it really is t1~t2 for a full match or t1=t2 for item-wise comparison. When you have a fully coherent programming language, it makes sense that this and many other things just work. E.g. in q there are no CTEs; you just assign and use variables as you would in most languages.

https://www.timestored.com/jq/online/?qcode=t1%3A(%5B%5D%20s...


Hmm, CONCAT all fields in each row into a string, hash that string, use that as a derived table where you order the rows definitively, then concat all rows from that derived table into a string and hash that string. Repeat for the second table. Compare hashes. You can do it in a single query with two derived tables.

You could do it without the inner hashing, but that seems more likely to exhaust memory if your rows are big enough, since you're basically hashing an entire table at that point and that's a large string in memory.


  Description | Notes
  ------------+-----------
  Fruit       | Flies Fast
  Fruit Flies | Fast


Note that if you have non-fixed-length fields, this can in principle produce false positives.


Yeah, I was just thinking of that, and how you'd probably want to concat with some level of structure to eliminate this. e.g. by adding a delimiter between fields and doing a find and replace on that delimiter (carefully, e.g. CSV rules with quotes and escapes). Whatever it takes to definitively differentiate foo/barfoo from foobar/foo. Maybe just prefixing every field with its length is sufficient.

I'm sure there are some relevant papers on the simplest way to achieve this that I can and should look up. Hopefully they don't conclude that it's a much harder problem to do right. ;)


> e.g. by adding a delimiter between fields and doing a find and replace on that delimiter (carefully, e.g. CSV rules with quotes and escapes)

Once you set up "CSV rules" then you won't need to find and replace anything, CSV is perfectly sufficient by itself.


Instead of CONCAT you could use something like JSON serialization, which preserves structure (avoiding reinventing the wheel on quoting, etc.) for the rows; it may not be maximally efficient, but it's probably the path of least resistance.


Yeah, as long as every RDBMS has a JSON serialization method now (do they? I'm not up on SQL standards or where each RDBMS is in its implementation of them). Otherwise, if we're shipping our own serialization as a procedure built from core SQL functions, something simpler might be sufficient: replace backslash with double backslash, replace pipe with backslash-pipe, concatenate with pipes.
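In PostgreSQL, for example, the JSON route can look roughly like this (a sketch, hypothetical table t1): serialize each row to JSON so field boundaries are unambiguous, hash each row, then hash the sorted row hashes so the physical row order doesn't matter (and duplicate rows still count twice):

  SELECT md5(string_agg(row_hash, '' ORDER BY row_hash)) AS table_hash
  FROM (SELECT md5(to_jsonb(t1)::text) AS row_hash FROM t1) AS r;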


Comparing for exactness is one challenge, but just a start - identifying what has changed is a bigger challenge. And that's where the world of file/database comparators comes in. A somewhat niche but easy-to-use example of one with lots of options that others have sought to mimic in open source is SAS's PROC COMPARE, which can compare SAS datasets or other vendors' database tables (which SAS can effectively treat as if they're SAS datasets).

SAS COMPARE Procedure Example 1: Producing a Complete Report of the Differences

  proc compare base=proclib.one compare=proclib.two printall;
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/...

_____

DataComPy (open-source python software developed by Capital One)

DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS’s PROC COMPARE for Pandas DataFrames with some more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it prints out some stats, and lets you tweak how accurate matches have to be).

  from io import StringIO
  import pandas as pd
  import datacompy
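  # df1 and df2 are assumed to be two pandas DataFrames built elsewhere,
  # e.g. pd.read_csv over CSV text wrapped in StringIO (hence the import above)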

  compare = datacompy.Compare(
      df1,
      df2,
      join_columns='acct_id',  #You can also specify a list of columns
      abs_tol=0, #Optional, defaults to 0
      rel_tol=0, #Optional, defaults to 0
      df1_name='Original', #Optional, defaults to 'df1'
      df2_name='New' #Optional, defaults to 'df2'
      )
  compare.matches(ignore_extra_columns=False)
  # False

  # This method prints out a human-readable report summarizing and sampling differences
  print(compare.report())
https://capitalone.github.io/datacompy/


I think since NULL <> NULL you can never be sure.


But (select null intersect select null) still gives you a 1-row result set, so there are ways to be sure.


But violating the reflexive property of equality is a good thing because math!


In case you haven't tried dbt (www.getdbt.com / "Data Build Tool") - there's a whole package ecosystem that solves for things like this. The one that came to mind is "dbt-audit-helper": https://github.com/dbt-labs/dbt-audit-helper#compare_relatio...

It's kind of like PyPI/DockerHub for SQL. Lots of cool stuff in there... here's a link to the package hub: https://hub.getdbt.com/


I’ve worked on tools for this a bunch of times and one particular instance was my favorite. Checksums were used for equivalence - but the real fun was in displaying the differences in a meaningful way. We’d show:

- which columns were only in the left, or only in the right

- and then show which rows are only in the left, or only on the right

- and then for rows that are in both but have some cell differences, show a set highlighting those differences. (Hard to explain how this was done… it was concise but rich with details.)

This kind of “thorough” comparison was very useful for understanding the differences all at once.


In any normal, halfway well-designed application, no row should be duplicated within a table, so a comparison with EXCEPT is a valid solution. And it's often used in practice.


> In any normal/half-way good designed application, no row should be duplicated within a table...

Sure, but there are plenty of poorly designed databases out there!


I've worked with huge tables doing ETL and always used the MINUS clause for the purpose of comparing tables. I was using Teradata though, so I'm not sure if it's available in all DB engines.


Oracle has a MINUS clause. Saved me many times.


Slightly related: I recently needed to compare several large databases for schema differences. The output from `pg_dump` isn't very diff-friendly at all, so I built this[1]. It outputs schemas in JSON or text, and ensures a stable sort ordering.

It also optionally masks index and constraint names, which might be auto-generated and not relevant to the comparison.

1. https://github.com/orf/diffable-sql


Why not a hashed Merkle tree approach? [0]

[0] : https://github.com/ameensol/merkle-tree-solidity/blob/master...


The most modern way is to tokenize the database, or the two databases in case the tables belong to different databases, then you can pretrain a large language model. Fine tune it with the two tables, then apply prompt engineering to ask the ai model in natural language. No need to learn SQL!


I'm just sure there's some important relational theory reason why you're a bad person for wanting to do this and it's not a flaw in the language at all that "check the equivalence of two result-sets" is so difficult.


Yes there is! A table should have a primary key, which ensures no two rows in a table are exactly equal. So the simple solution with ‘a except b’ should work.

Lots of otherwise simple things get complicated with duplicates, e.g. try to change or delete one row out of a set of duplicates.

A table with duplicates is not even a relation, since relations are defined as sets. SQL is in a weird place because it technically allows duplicates, but many operations are not able to distinguish between duplicates.


My guess is that t1 = t2 seems cheap but is actually expensive, so you might accidentally write very slow queries if it were built into the language.


Uh, that ship has sailed. Sooooo sailed. It is incredibly trivially easy to write unsargable queries in SQL.


Ah yes, the Apple support approach: “Why would you want to do that?”



Quite honestly, if it's < 1M rows I'd just sqldump the two tables and use diff.


1. Write an SQL query to retrieve entities. Join the multi-row relational tables to the primary entity using a JSON aggregation function. This ensures all related data remains on a single row.

2) Export the query results to .csv file(s).

3. Utilize Go along with the encoding/csv package to process each CSV row. Construct an in-memory index mapping each entity ID to its byte offset within the CSV file.

4. Traverse the CSV again, using the index to quickly locate and read corresponding lines into in-memory structures. Convert the aggregated JSON columns to standard arrays or objects.

5. After comparing individual CSV rows, save the outcomes to a map. This map associates each entityID with metadata about columns that don't match.

6. Convert the mismatch map into JSON format for further processing or segmentation.


Normally EXCEPT or MINUS does a good enough job.

For full comparisons of MSSQL tables, I often use Visual Studio, which has a Data Comparison tool that shows identical rows and differences between source and target.


Not sure I understand why implementing EXCEPT ALL is hard. You just do +1 for every row in the left side and -1 for every row in the right, and then consolidate. Am I missing something?
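A sketch of that consolidation (hypothetical tables t1 and t2 with a single column x; a positive net count means the row survives EXCEPT ALL with that multiplicity):

  SELECT x, SUM(cnt) AS surplus
  FROM (
      SELECT x,  1 AS cnt FROM t1
      UNION ALL
      SELECT x, -1 AS cnt FROM t2
  ) d
  GROUP BY x
  HAVING SUM(cnt) > 0;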


What I expected:

A neat pithy SQL trick

What I got:

A lesson in the Dark Arts from a wizard


Server/client data sync example problem: https://sirupsen.com/napkin/problem-9


How do you end up with two identical SQL tables in the first place?


Consider owning an ETL system where users can create periodic jobs, like calculating a "daily active user" rollup table from a bunch of event source tables. Now consider being asked to switch the query execution engine from A to B, perhaps because engine B is much more cost-effective for your system's query patterns. Ex: You are tasked with switching from MapReduce to Tez, from Presto to Spark, etc. Before full migration, you can reduce risk by double-writing jobs, with engine A writing to the standard output location, and engine B writing to some other migration location. This should lead to two identical SQL tables, which you should then verify, because sometimes these execution engines have bugs hiding in the corners :)


I immediately considered it for comparison of backups.


The SQL in the article is more like brain gymnastics.

It is interesting, but not something you should use. It scales horribly with the number of columns.


Good point


or a host migration.


Regression testing.


grading homework


In kdb+/q, it is only 1 character ~


Navicat is very useful for this


remywang you're a genius


That would be -1NF normalization


[flagged]


Did you test it?


I don't know why you're being downvoted. This is interesting. Did you test it yet?


Posting answers from GPT-4 etc on their own isn't interesting - they become interesting if you do the extra work to verify them first.


Because it’s a (worse) variant of one of the first queries the article gives:

  SELECT *, COUNT(*) 
    FROM t1
   GROUP BY x, y, z, ... -- all attributes of t1

  EXCEPT

  SELECT *, COUNT(*) 
    FROM t2
   GROUP BY x, y, z, ... -- all attributes of t2
and we aren’t discussing ChatGPT here.



