Hacker News
Ask HN: Data diff tool for tabular data?
68 points by polm23 5 months ago | 32 comments
Is there a tool you'd recommend to diff tabular data / TSV files?

Every day I have a snapshot of a database stored as TSV. Given two days of snapshots and that every item in the TSV has a unique ID, I'd like to get a list of which lines are new, changed, unchanged, and deleted between the two days. These lists will then be used to update an external service with an awkward database-like API.

I can kind of do this with normal diff but perhaps there's something better suited for the task? It reminds me of data joins in d3 but I'm not working with a UI or Javascript.

For diffing whole rows, you could use the standard Linux CLI tools, in particular join(1), which is very similar to SQL's [LEFT/RIGHT] JOIN clause. Another option is the combination of sort(1) and uniq(1); with the proper choice of options, and, in some cases, by repeating the file with the old or new data, you can get:

  cat old.csv old.csv new.csv | sort | uniq -u # only the added/changed records
  cat old.csv new.csv new.csv | sort | uniq -u # only the old/deleted records
  cat old.csv new.csv | sort | uniq -d # only the unchanged records

Also I think sort has a -u flag.

If you use sort -u you don't have to repeat files (:

Using the hammer already in my hand, I would write a small R script to do the joins.

Example script:

  library(readr)  # read_tsv
  library(dplyr)  # the *_join verbs and %>%

  old = read_tsv("path/to/old/db.tsv")
  new = read_tsv("path/to/new/db.tsv")

  deleted   = anti_join(old, new, by = "id")
  added     = anti_join(new, old, by = "id")
  unchanged = semi_join(new, old)
  changed   = semi_join(new, old, by = "id") %>% anti_join(old)

Beyond Compare has a tabular data mode, where you set the key columns and it presents the differences in a nice interface. Works well for me. http://www.scootersoftware.com

Not quite sure if it fits your needs exactly, but I've been pleasantly surprised with visidata's [0] capabilities when munging through csvs.

You can totally join two 'sheets' by primary key, and save and replay that workflow. I think you may have to write a Python function to iterate over the sets of columns to look for differences; not sure if there's a cleaner way.

[0] https://github.com/saulpw/visidata

There is a --diff option that works if the rows and columns are both exactly aligned. But this post makes me think about adding functionality to match by row key and column name, possibly via the "diff" join (current diff join is more like an "xor"). If you have ideas in this regard I'd love to hear them!

My team and I had the exact same use case. Here is what we came up with: https://github.com/chop-dbhi/diff-table#diff-table

It supports delimited files and Postgres tables. It outputs either a large JSON-based summary of record diffs, or it emits an event stream as changes are found (and thus does not hold the summary in memory).

Prebuilt binaries: https://github.com/chop-dbhi/diff-table/releases

Edit: added link to releases and a description

I quite like daff (have used it for diffing Excel sheets):


I helped make a tool called DataCompareR, which does exactly this, in R. It gives you diffs for column names, variable types, rows and a few other things. I hope you find it useful!


Take a look at `lnav` [1]. It's ideal for chaining w/ other cli tools, has an embedded sqlite engine, and makes it trivial to parse and query any kind of reasonably structured data.


I've used lnav for log navigation before, but how is it relevant here? Checked the features page but nothing jumped out at me...

You could import your daily snapshot in a new SQLite DB everyday, then use the sqldiff utility [https://www.sqlite.org/sqldiff.html].

Another option that might fit would be a sort(1) pass over your TSV files (unless you already ORDER BY the primary key), then using Awk to detect differences and immediately print out a list of API commands to update your external DB. Awk conveniently handles tabular data; just set the field separator (FS) to \t in this case (by default it splits on any whitespace).
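For illustration, here is a rough sketch of that single pass in Python rather than awk (the file contents are inlined, the first tab-separated field is assumed to be the unique ID, and the insert/update/delete "commands" are placeholders for whatever the external API actually expects):

```python
import csv
import io

# Hypothetical two-day snapshots; real code would open the TSV files instead.
OLD = "1\talice\n2\tbob\n3\tcarol\n"
NEW = "1\talice\n2\tbobby\n4\tdave\n"

def load(text):
    # Map the first (ID) column to the full parsed row.
    return {row[0]: row for row in csv.reader(io.StringIO(text), delimiter="\t")}

old, new = load(OLD), load(NEW)

commands = []
for key, row in new.items():
    if key not in old:
        commands.append(("insert", row))       # ID not seen yesterday
    elif old[key] != row:
        commands.append(("update", row))       # same ID, different contents
for key in old:
    if key not in new:
        commands.append(("delete", old[key]))  # ID disappeared
```

Unchanged rows are simply skipped, since they need no API call.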

Personally, I would do it in the database. Most established database products have some sort of "import" or "load" tool. Some even have "upsert" (update-or-insert) types of ingestion mechanisms, depending on your requirements.

So, keep a table of the latest "data" you have. Then import into a separate table. After that, your requirements are solvable with a few SQL joins. The upside is that you also have a local copy of the data in a table, which is always a plus when diagnosing/supporting your deployed product.

The data sets you then need:

    Deleted: Orig_table left outer join Import_Table with a null filter on Import_table.
    New: Import_Table left outer join Orig_table with a null filter on Orig_table.
    Changed: Inner join with not-equal checks on all columns, combined with OR.
And each update to the external API also updates your local version of the data. I.e. Orig_table.

I know it sounds a bit verbose and maybe not an "all in one" solution that you're looking for. But sometimes, you just need to solve the problem.
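As a sketch of those three joins, here they are against an in-memory SQLite database (the table names, columns, and sample rows are all made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE orig    (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE import_ (id INTEGER PRIMARY KEY, name TEXT);
  INSERT INTO orig    VALUES (1,'alice'), (2,'bob'),   (3,'carol');
  INSERT INTO import_ VALUES (1,'alice'), (2,'bobby'), (4,'dave');
""")

# Deleted: in orig but not in the import.
deleted = con.execute("""
  SELECT o.id FROM orig o
  LEFT OUTER JOIN import_ i ON i.id = o.id
  WHERE i.id IS NULL""").fetchall()

# New: in the import but not in orig.
new_rows = con.execute("""
  SELECT i.id FROM import_ i
  LEFT OUTER JOIN orig o ON o.id = i.id
  WHERE o.id IS NULL""").fetchall()

# Changed: same ID, at least one differing column (OR-ed per column).
changed = con.execute("""
  SELECT i.id FROM import_ i
  JOIN orig o ON o.id = i.id
  WHERE o.name <> i.name""").fetchall()
```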

git diff --word-diff-regex="[^\t]" one.tsv two.tsv

For example:

  $ printf '1\t2\t3\n' > one.tsv
  $ printf '1\tx\t3\n' > two.tsv
  $ git diff --word-diff-regex="[^\t]" one.tsv two.tsv
  diff --git a/one.tsv b/two.tsv
  index d43017e..e2d5856 100644
  --- a/one.tsv
  +++ b/two.tsv
  @@ -1 +1 @@
  1       [-2-]{+x+}      3

Python. Symmetric difference.

    >>> set(open('a.tsv')) ^ set(open('b.tsv'))
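The symmetric difference only tells you which lines differ, not why. Keying on the ID column (assumed here to be the first tab-separated field) splits the result into the four buckets the question asks for:

```python
# Minimal sketch, assuming the first tab-separated field is the unique ID.
def classify(old_lines, new_lines):
    old = {line.split("\t", 1)[0]: line for line in old_lines}
    new = {line.split("\t", 1)[0]: line for line in new_lines}
    return {
        "new":       [new[k] for k in new.keys() - old.keys()],
        "deleted":   [old[k] for k in old.keys() - new.keys()],
        "changed":   [new[k] for k in new.keys() & old.keys() if new[k] != old[k]],
        "unchanged": [new[k] for k in new.keys() & old.keys() if new[k] == old[k]],
    }

# In practice: classify(open('a.tsv'), open('b.tsv'))
result = classify(["1\ta\n", "2\tb\n"], ["1\ta\n", "2\tB\n", "3\tc\n"])
```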


I made Quick to solve your exact problem! We use it to test our batch processes that either generate flat files or write data into database tables.

Please contact me if you need help with the tool; I know the docs are not as complete as I would like them to be.

Here is the wiki: https://github.com/MouslihAbdelhakim/Quick/wiki

Here is a repo of some examples: https://github.com/MouslihAbdelhakim/Quick-examples

What's that?

It's like Google Sheets, with some weird retro desktop theme as default /s


Are you not allowed to do this in the database itself? That would be much simpler.


With Noms, import two snapshots of data like this:

  csv-import --dest-type="map:0" dump0.csv /noms/mydb::data
  csv-import --dest-type="map:0" dump1.csv /noms/mydb::data

Show difference between the two snapshots like so:

  noms log --max-lines=-1 /noms/mydb::data

See also the `noms diff` subcommand.

Hi! I'm working on https://stuckd.com, which compares folders, so it's not exactly what you need, but I could add this as a new feature; I think it should be quite easy to implement. I'm still at an early stage, but if you don't mind being a beta tester, would you be interested in trying it?

Beyond Compare has a mode for tabular data.

It is not free, but I find it much better than the usual free visual diff tools like KDiff3 or Meld.

If you store your data in a slowly changing dimension (SCD) format, you will be able to create a table where you can query the data as it was at any point in time.

The easiest way to do this is with an ETL tool, as hand-coding it can be a bit hairy. You can use an open-source ETL tool like Talend, which has built-in Postgres and MySQL SCD modules.

I use xlcompare for comparing Excel. It creates very good diff reports of Excel features (including formatting, formulas, and macros). I often use it for simple CSV compares. Be warned: it does crash occasionally.


Augeas might be helpful.


But it will require a bit of coding.

https://github.com/paulfitz/daff plus version control

Crazy talk here, but do you have control of the database? Can you get history added to it?

Maybe rsync could also work well for you?
