
Ask HN: Data diff tool for tabular data? - polm23
Is there a tool you'd recommend to diff tabular data / TSV files?

Every day I have a snapshot of a database stored as TSV. Given two days of snapshots, and given that every item in the TSV has a unique ID, I'd like to get a list of which lines are new, changed, unchanged, and deleted between the two days. These lists will then be used to update an external service with an awkward database-like API.

I can kind of do this with normal diff, but perhaps there's something better suited for the task? It reminds me of data joins in d3, but I'm not working with a UI or JavaScript.
======
dexen
For diffs on whole rows, you could use Linux CLI tools, in particular
join(1), which is very similar to SQL's [LEFT/RIGHT] JOIN clause. Another
option is the combination of sort(1) and uniq(1); with the proper choice of
options, and, in some cases, by repeating the file with old or new data, you
can get:

    
    
      cat old.csv old.csv new.csv | sort | uniq -u # only the added/changed records
      cat old.csv new.csv new.csv | sort | uniq -u # only the old/deleted records
      cat old.csv new.csv | sort | uniq -d # only the unchanged records

~~~
tudelo
Also I think sort has a -u flag.

~~~
SubMachineGhost
If you use sort -u you don't have to repeat files (:

------
maltelau
Using the hammer already in my hand, I would write a small R script to do the
joins.

Example script:

    
    
      library(tidyverse)
      old = read_tsv("path/to/old/db.tsv")
      new = read_tsv("path/to/new/db.tsv")
      deleted   = anti_join(old, new, by = "id")
      added     = anti_join(new, old, by = "id")
      unchanged = semi_join(new, old)
      changed   = semi_join(new, old, by = "id") %>%
                    anti_join(unchanged)

------
Petefine
Beyond Compare has a tabular data mode, where you set the key columns and it
presents the differences in a nice interface. Works well for me.
[http://www.scootersoftware.com](http://www.scootersoftware.com)

------
camel_Snake
Not quite sure if it fits your needs exactly, but I've been pleasantly
surprised with visidata's [0] capabilities when munging through csvs.

You can totally join two 'sheets' by primary key and save and replay that
workflow; I think you may have to write a Python function to iterate over the
sets of columns to look for differences. Not sure if there's a cleaner way.

[0] [https://github.com/saulpw/visidata](https://github.com/saulpw/visidata)

~~~
rabidrat
There is a --diff option that works if the rows and columns are both exactly
aligned. But this post makes me think about adding functionality to match by
row key and column name, possibly via the "diff" join (current diff join is
more like an "xor"). If you have ideas in this regard I'd love to hear them!

------
BerislavLopac
Python and pandas: [https://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d](https://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d)
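
The approach in that answer can be sketched with merge's indicator flag; a minimal example (the column names "id" and "value" are made up for illustration):

```python
import pandas as pd

# Toy snapshots; "id" is the unique key, "value" stands in for the rest of the row.
old = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
new = pd.DataFrame({"id": [2, 3, 4], "value": ["b", "CHANGED", "d"]})

# Outer-join on the key; the indicator column records which side each row came from.
merged = old.merge(new, on="id", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

deleted   = merged[merged["_merge"] == "left_only"]    # id only in the old snapshot
added     = merged[merged["_merge"] == "right_only"]   # id only in the new snapshot
both      = merged[merged["_merge"] == "both"]
changed   = both[both["value_old"] != both["value_new"]]
unchanged = both[both["value_old"] == both["value_new"]]
```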

------
bjoerns
I quite like daff (have used it for diffing Excel sheets):

[https://github.com/paulfitz/daff](https://github.com/paulfitz/daff)

------
bruth
My team and I had the exact same use case. Here is what we came up with:
[https://github.com/chop-dbhi/diff-table#diff-table](https://github.com/chop-dbhi/diff-table#diff-table)

Supports delimited files and Postgres tables. It outputs a large JSON-based
summary of record diffs, or it emits an event stream as changes are found (and
thus does not hold the summary in memory).

Prebuilt binaries: [https://github.com/chop-dbhi/diff-table/releases](https://github.com/chop-dbhi/diff-table/releases)

_Edit_: added link to releases and description.

------
KrispMonk
I helped make a tool called DataCompareR, which does exactly this, in R. It
gives you diffs for column names, variable types, rows and a few other things.
I hope you find it useful!

[https://github.com/capitalone/datacomparer](https://github.com/capitalone/datacomparer)

------
chrisweekly
Take a look at `lnav` [1]. It's ideal for chaining w/ other cli tools, has an
embedded sqlite engine, and makes it trivial to parse and query any kind of
reasonably structured data.

[1] [https://lnav.org](https://lnav.org)

~~~
polm23
I've used lnav for log navigation before, but how is it relevant here? Checked
the features page but nothing jumped out at me...

------
kbr2000
You could import your daily snapshot in a new SQLite DB everyday, then use the
sqldiff utility
[[https://www.sqlite.org/sqldiff.html](https://www.sqlite.org/sqldiff.html)].

Another option that might fit would be a sort(1) pass on your TSV files
(unless you already ORDER BY on the primary key), then using Awk to detect
differences and immediately print out a list of API commands to update your
external DB. Awk conveniently handles tabular data; just set the field
separator (FS) to \t in this case (by default it splits on all whitespace).

------
zo1
Personally, I would do it in the database. Most established database products
have some sort of "import" or "load" tool. Some even have "upsert"-style
(update-or-insert) ingestion mechanisms, depending on your requirements.

So, keep a table of the latest "data" you have. Then import into a separate
table. After that, your requirements are solvable with a few SQL joins. The
upside is that you also have a local copy of the data in a table, which is
always a plus when diagnosing/supporting your deployed product.

The data sets you then need:

    
    
        Deleted: Orig_table left outer join Import_table, with a null filter on Import_table.
        New: Import_table left outer join Orig_table, with a null filter on Orig_table.
        Changed: inner join with a not-equal check on all columns, joined by OR.
    

And each update to the external API also updates your local version of the
data. I.e. Orig_table.

I know it sounds a bit verbose and maybe not an "all in one" solution that
you're looking for. But sometimes, you just need to solve the problem.
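
Those three joins can be sketched directly against SQLite (in memory here; the table names, the "value" column, and "id" as the key are assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orig_table (id INTEGER PRIMARY KEY, value TEXT)")
con.execute("CREATE TABLE import_table (id INTEGER PRIMARY KEY, value TEXT)")
con.executemany("INSERT INTO orig_table VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
con.executemany("INSERT INTO import_table VALUES (?, ?)",
                [(2, "b"), (3, "x"), (4, "d")])

# Deleted: in orig_table but not in import_table.
deleted = con.execute("""
    SELECT o.id FROM orig_table o
    LEFT JOIN import_table i ON i.id = o.id
    WHERE i.id IS NULL""").fetchall()

# New: in import_table but not in orig_table.
new = con.execute("""
    SELECT i.id FROM import_table i
    LEFT JOIN orig_table o ON o.id = i.id
    WHERE o.id IS NULL""").fetchall()

# Changed: same id, any non-key column differs (one <> test per column, ORed).
changed = con.execute("""
    SELECT i.id FROM import_table i
    JOIN orig_table o ON o.id = i.id
    WHERE o.value <> i.value""").fetchall()
```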

------
tyingq
git diff --word-diff-regex="[^\t]" one.tsv two.tsv

For example:

    
    
      $ printf '1\t2\t3\n' > one.tsv
      $ printf '1\tx\t3\n' > two.tsv
      $ git diff --word-diff-regex="[^\t]" one.tsv two.tsv
      diff --git a/one.tsv b/two.tsv
      index d43017e..e2d5856 100644
      --- a/one.tsv
      +++ b/two.tsv
      @@ -1 +1 @@
      1       [-2-]{+x+}      3

------
xapata
Python. Symmetric difference.

    
    
        >>> set(open('a.tsv')) ^ set(open('b.tsv'))
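
One caveat: a symmetric difference reports a changed row twice (once per version) and can't separate additions from edits. Keying rows by the unique ID gives all four lists; a small sketch, assuming the ID is the first TSV column (the file names and contents below are made up):

```python
import csv

# Write two tiny snapshots so the sketch is self-contained.
with open("old.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows([["1", "a"], ["2", "b"], ["3", "c"]])
with open("new.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows([["2", "b"], ["3", "x"], ["4", "d"]])

def load(path):
    # Key each row by its unique ID, assumed to be the first column.
    with open(path, newline="") as f:
        return {row[0]: row for row in csv.reader(f, delimiter="\t")}

old, new = load("old.tsv"), load("new.tsv")

deleted   = old.keys() - new.keys()
added     = new.keys() - old.keys()
common    = old.keys() & new.keys()
changed   = {k for k in common if old[k] != new[k]}
unchanged = common - changed
```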

------
SubMachineGhost
[https://github.com/MouslihAbdelhakim/Quick](https://github.com/MouslihAbdelhakim/Quick)

I made Quick to solve your exact problem! We use it to test our batch
processes that either generate flat files or write data into database tables.

Please contact me if you need help with the tool; I know that the docs are not
as complete as I would like them to be.

Here is the wiki:
[https://github.com/MouslihAbdelhakim/Quick/wiki](https://github.com/MouslihAbdelhakim/Quick/wiki)

Here is a repo of some examples: [https://github.com/MouslihAbdelhakim/Quick-examples](https://github.com/MouslihAbdelhakim/Quick-examples)

------
zygotic12
Got Excel? [https://superuser.com/questions/420635/how-do-i-join-two-worksheets-in-excel-as-i-would-in-sql](https://superuser.com/questions/420635/how-do-i-join-two-worksheets-in-excel-as-i-would-in-sql)

~~~
miemo
What's that?

~~~
timrichard
It's like Google Sheets, with some weird retro desktop theme as default /s

~~~
zygotic12
Arf!

------
nartz
Are you not allowed to do this in the database itself? That would be much
simpler.

------
aboodman
[https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)

Import two snapshots of data like this:

    
    
      csv-import --dest-type="map:0" dump0.csv /noms/mydb::data
      csv-import --dest-type="map:0" dump1.csv /noms/mydb::data
    
    

Show difference between the two snapshots like so:

    
    
      noms log --max-lines=-1 /noms/mydb::data
    

See also the `noms diff` subcommand.

------
clagio
Hi! I'm working on [https://stuckd.com](https://stuckd.com), which compares
folders, so it's not exactly what you need, but I could add this as a new
feature; I think it should be quite easy to implement. I'm still at an early
stage, but if you don't mind being a beta tester, would you be interested in
trying it?

------
GuB-42
Beyond Compare has a mode for tabular data.

It is not free but I find it much better than the usual visual diff free tools
like kdiff3 or meld.

------
trengrj
If you store your data in a slowly changing dimension (SCD) format, you will
be able to create a table where you can query the data as it was at any point
in time.

The easiest way to do this is to use an ETL tool, as hand-coding this can be a
bit hairy. You can use an open-source ETL tool like Talend, as it has built-in
Postgres and MySQL SCD modules.

------
Maximal
I use xlcompare for comparing Excel files. It creates very good diff reports of
Excel features (including formatting, formulas, and macros). I often use it for
simple CSV compares. Be warned, it does crash occasionally.

[https://www.xlcompare.com](https://www.xlcompare.com)

------
emmelaich
Augeas might be helpful:

[http://augeas.net/docs/references/lenses/files/csv-aug.html](http://augeas.net/docs/references/lenses/files/csv-aug.html)

But it will require a bit of coding.

------
rmbeard
[https://github.com/paulfitz/daff](https://github.com/paulfitz/daff) plus
version control

------
billrobertson42
Crazy talk here, but do you have control of the database? Can you get history
added to it?

------
jcobnl
Maybe rsync could also work well for you?

