
Faster and simpler with the command line: deep-comparing JSON files with jq - wwarnerandrew
https://genius.engineering/faster-and-simpler-with-the-command-line-deep-comparing-two-5gb-json-files-3x-faster-by-ditching-the-code/
======
pjungwir
I really enjoyed this article, and I think it shows how successfully jq fits
into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLUI
tools that feel as "classical" as jq. It has helped me do one-off tasks like
this several times, but I've really only scratched the surface. Often with
newer tools I'm reluctant to invest in going deeper into really learning the
features, but with jq I have a lot of confidence that it would pay off for
years to come. I don't see any books yet, but this is about the phase where
I'd normally buy an O'Reilly volume (hint hint).

Btw I'm surprised you needed -M, since I thought jq would suppress colors if
it saw it wasn't writing to a tty.

~~~
jeroenjanssens
Chapter 5 of Data Science at the Command Line (O'Reilly, 2014) mentions `jq`
briefly:
[https://www.datascienceatthecommandline.com/chapter-5-scrubb...](https://www.datascienceatthecommandline.com/chapter-5-scrubbing-
data.html#working-with-xmlhtml-and-json)

------
yonatank
Just a heads up to anyone using jq - I've previously spent a couple of hours
debugging a problem because jq uses float64 to store integers (which might
lead to rounding-errors/overflows). For example:

    
    
      echo 1152921504606846976 | jq                                                            
      1152921504606847000

~~~
otterley
This is an artifact of JavaScript, which even as of ES6 uses IEEE 754 double-
precision floats for all numeric values. jq likely uses the same
implementation internally for compatibility reasons and to avoid surprises of
a different kind.

See [https://www.ecma-international.org/ecma-262/6.0/#sec-
ecmascr...](https://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-
language-types-number-type)

~~~
BrendanEich
BigInt in top browsers now, not under a flag. Just sayin'!

[https://brendaneich.com/wp-
content/uploads/2017/12/dotJS-201...](https://brendaneich.com/wp-
content/uploads/2017/12/dotJS-2017.pdf#page=21&zoom=auto,-177,1061) et seq.

------
oftenwrong
Another option is to use a tool like 'gron' to convert the JSON into a shell-
friendly line-oriented format. This makes the rest straightforward.

[https://github.com/tomnomnom/gron](https://github.com/tomnomnom/gron)

~~~
cryptonector
jq can do something like that too! Check out its --stream option.

------
totally
That's cool, you made a thing that verifies the two export jobs you wrote have
the same data even though they have different output.

I can't help wondering, if you control the code that generates the JSON, why
not output in a conservative, consistent format? I'm sure there are pros/cons,
but this work would allow something like `diff` to work, and then you don't
have to maintain a separate utility.

~~~
wwarnerandrew
good question! the analysis that I was doing was really a one-off for
switching between these processes. We have unit tests and sanity checks to
ensure consistency going forward, but as a final check before flipping the
switch we wanted to be as confident as possible that we hadn't introduced any
regressions across the full data-set.

The new export process is much more reliable and a _lot_ faster, but as a side
effect of doing things in a different way it generated the export file in a
different format. Given that the order of objects in an export file and the
order of keys/etc in the JSON objects didn't matter for anything except
comparing the two processes, I figured it was simpler to put the normalization
logic in the one-off tool vs baking it into our export process. But certainly
if we were maintaining both exports in an ongoing fashion and validating them
against each other, it would make a lot more sense to spend time making sure
they generated objects and keys in the same order.

~~~
ryanisnan
I'm surprised the layout of the JSON doesn't arrange music by artist, any
reasoning why that wasn't done?

------
jillesvangurp
If you have json line formatted stuff (or csv) and an aws account, you can do
some nice things with Athena and SQL. We have a few simple backoffice tools
that I've implemented around simple sql queries on data dumped from various
systems that we have in json format. Awesome, if you want to do some quick
selects, joins, etc.

If you are going to process this amount of data, don't load it all into memory
and process line by line. Also do that concurrently if you have more than one
CPU core available. I've done this with ruby, python, Java, and misc shell
tools like jq. Use what you are comfortable with and what gets results
quickly.

One neat trick with jq is to use it to convert json objects to csv and to then
pipe that into csvkit for some quick and dirty sql querying. Generally gets
tedious beyond a few hundred MB. I recommend switching to Athena or something
similar if that becomes a regular thing for you.

~~~
wwarnerandrew
OP here— good point! We actually use Athena to query these exports in S3 to
debug data drift of specific export objects over time. It's quite a useful
tool, I was able to go knowing basically nothing about Athena to querying
gzipped newline-delimited JSON files in S3 using SQL in about an hour.

------
jb3689
It's worth mentioning that there are much faster JSON parsing libraries than
the default in Ruby stdlib. I still don't think Ruby is the best choice for
doing raw JSON parsing. Last time I had to care about JSON speed we were
transforming billions of events and the Ruby JSON lib was becoming a
bottleneck

~~~
baldeagle
What do you think is the best way to do deep JSON comparisons? We work with
2GB JSONs all day, and it is super annoying how long they take to process.

~~~
vidarh
Not parse them into a tree, to start with.

Use a streaming JSON parser, and compare them token by token unless/until they
diverge, at which point you take whatever actual suitable to identify the
delta.

Parsing it into a tree may be necessary if you want to do more complex
comparisons (such as sorting child objects etc.), but even then depending on
your need you may well be better off storing offsets into the file depending
on your requirements.

[https://github.com/lloyd/yajl](https://github.com/lloyd/yajl) is an example
of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON
is simple enough to write one specifically to handle two streams.

~~~
proyb2
I believe this comparison benchmark could be useful for you and you can expand
further with more tests. Although I got downvoted for sharing a link.

[https://github.com/kostya/benchmarks/blob/master/README.md](https://github.com/kostya/benchmarks/blob/master/README.md)

~~~
vidarh
That still parses into a tree.

------
wgjordan
> My first thought was to write a ruby script to parse and compare the two
> exports, but after spending a little time coding something up I had a
> program that was starting to get fairly complicated, didn't work correctly,
> and was too slow—my first cut took well over an hour. Then I thought: is
> this one of those situations where a simple series of shell commands can
> replace a complex purpose-built script?

Key takeaway: next time, start with the second thought first and save yourself
well over an hour!

------
g-harel
I had a similar problem diffing large API responses a few months ago and
implemented an automation friendly JSON schema tool. It's a great way to make
a summary of the data, especially when looking for forgotten fields for
example.

[https://github.com/g-harel/ence](https://github.com/g-harel/ence)

------
specialist
I love jq.

I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv,
psql, etc. Fast, efficient, iterative, inspectable, portable.

Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins
goo, misc mayfly tech stacks. So my "use the most simple tool that works"
advocacy has never gained traction.

------
ComputerGuru
As part of an automated Jira upgrade script (well, Makefile) we needed to
export changes to the listener port/scheme configuration which is
unfortunately stored “in the code” so to speak (in WEB-INF/web.xml) which
Atlassian doesn’t deign to whitespace normally (indentation is all over the
place, as is formatting, character encoding, and more) —- and they mangle it
differently somehow with each point release. So the Makefile calls xmllint to
normalize formatting and whitespace of both the untouched source files from
the old and new release as well as the locally modified (deployed)
configuration, then calls diff/patch accordingly (in a three-way).

------
mattbillenstein
Note: use python3!

[https://gist.github.com/mattbillenstein/34cf2907390102ffbabd...](https://gist.github.com/mattbillenstein/34cf2907390102ffbabd982a3662b204)

------
zoom6628
I have found jq immensely useful to process ugly large responses from REST
APIs in enterprise systems. It's like an awk for JSON... And I've been awk fan
for 30 years for any text processing.

------
evancordell
This is somewhat related to a hack I threw together recently:
[https://github.com/ecordell/jf](https://github.com/ecordell/jf)

It attempts to address a similar problem (comparing json or subsets of json),
but I wanted the structure of what was being compared to be more readable
(compared to jq), so I went with graphql syntax. Doubt it would do great on
larger datasets though.

------
cphoover
I'm not sure why you are comparing the data to the old export instead of
against a source of truth... for example what is in the upstream data source.
Also why not verify using unit tests? Who is to say that the original export
is valid and not the second export.

~~~
kiallmacinnes
In theory, I agree! I hope the new codebase has a set of tests to validate
just that.

But, in practice, you have a downstream consumer of this data format (Apple in
this case..).. Validating the old and new formats are functionally identical
is just as important as validating the new format matches the upstream source
of truth :)

------
gorbachev
I'm using jsonassert [1], a Java based JSON unit testing library, for
something very similar.

Not sure how it'd handle comparing 5GB files though.

1\. [http://jsonassert.skyscreamer.org/](http://jsonassert.skyscreamer.org/)

------
proyb2
I'm curious how well could Crystal language handles that huge amount of JSON
since most of the Ruby code could be ported over to Crystal.

It has a JSON pull parser to minimize memory usage which is useful for memory
constraint environment but at the expense of less performant. If that could be
split up with fork Crystal processes, I believe it's feasible.

~~~
vidarh
There are stream parsers for JSON for Ruby too, including bindings for C
libraries like YAJL - using the default JSON parser is an awful choice for
doing comparisons like that given the massive overhead of the amount of
objects it'll be creating for no good reason.

~~~
proyb2
Agree, I believe the benchmark shown Yaji and jq in this repo is useful for
you

[https://github.com/kostya/benchmarks/blob/master/README.md](https://github.com/kostya/benchmarks/blob/master/README.md)

------
forgot-my-pw
If speed is essential, why not use protobuf/flatbuffer or one of their
variants?

------
gerbilly
I wanted to like jq, but honestly, I can't figure out it's crazy syntax.

~~~
Rotten194
it's a bit weird, but it's perfect for writing quick shell one liners with
once you get used to it. no regular scripting language can match it for that

~~~
saagarjha
awk, sed?

~~~
pathseeker
For extracting a value out of json where the keys can arbitrarily re-order and
you have nested maps containing the same key names? No thanks.

~~~
saagarjha
I was providing them as examples of things "perfect for writing quick shell
one liners", not good JSON parsers.

------
nrclark
+1 for a great tool.

------
CyberMew
I wonder if a C/C++ programm would perform better?

~~~
JulianMorrison
jq is a C program.

In theory a truly specific program could work better. In practise, the broad
scope of jq allows you to discover the operations you need and respond to
changes in requirements without being locked into custom code, and any given
programmer probably couldn't do the same job better.

~~~
cryptonector
jq is a C program, yes, but jq programs are interpreted. Because jq is a
dynamically-typed language, it wouldn't be easy to compile it to object code
that would run too much faster than the byte-interpreted version (though it
would still run faster).

As you say, jq's power is that it is an expressive language, and it's much
much easier to write jq programs that work than it is to write C/C++ programs
as needed that do the same or similar work.

~~~
JulianMorrison
The interpreted language is just the setup phase for a pipeline of compiled-in
data transformations.

~~~
cryptonector
I don't understand this statement. Keep in mind I'm a jq maintainer.

------
GordonS
> What’s the best way to compare these two 5GB files?

A much simpler way to do this is simply to hash the files, for example using
sha256sum, which AFAIK ships with just about every Linux distro. Then just
compare the hashes.

~~~
hannasanarion
Having the same content is not the same as being identical verbatim.

