Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.
Even when reading the article I thought about it :)
echo 1152921504606846976 | jq
Here's the epic jwz rant on it:
with Brendan Eich chiming in:
>JSON is agnostic about the semantics of numbers ... JSON instead offers only the representation of numbers that humans use: a sequence of digits.
So... anything is valid, per the spec.
> echo 1152921504606846976 | python -c 'import sys, json; print(json.load(sys.stdin))'
It's less "interoperable" but not strictly invalid, by my read.
Therefore, you can't assume that if you pass some JSON through an arbitrary pipeline of JSON-manipulating tools, written in various languages, that your integer values will be passed through losslessly.
Therefore, you just shouldn't use JSON integer values when you know that the values can potentially be large. This is why e.g. Ethereum's JSON-RPC API uses hex-escaped strings (e.g. "0x0") for representing its "quantity" type.
Still interesting to know it's not just a jq quirk.
I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.
The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.
If you are going to process this amount of data, don't load it all into memory and process line by line. Also do that concurrently if you have more than one CPU core available. I've done this with ruby, python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.
One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.
I am on the edge of my seat now.
Would you mind listing which libraries are much (say, an order of magnitude) faster?
Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever actual suitable to identify the delta.
Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then depending on your need you may well be better off storing offsets into the file depending on your requirements.
https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough to write one specifically to handle two streams.
It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.
(2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ).
Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.
The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.
That's an important caveat. Python's C JSON parser library is super-fast, but if you want to use the data for anything but a simple equality check afterwards, it'll be slow as molasses.
Or you'll write a C extension for it...
Key takeaway: next time, start with the second thought first and save yourself well over an hour!
I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv, psql, etc. Fast, efficient, iterative, inspectable, portable.
Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins goo, misc mayfly tech stacks. So my "use the most simple tool that works" advocacy has never gained traction.
It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.
But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)
Not sure how it'd handle comparing 5GB files though.
It has a JSON pull parser to minimize memory usage which is useful for memory constraint environment but at the expense of less performant. If that could be split up with fork Crystal processes, I believe it's feasible.
Newline JSON is a fine interchange format for this, and the only advantage I can see for CSV is you can load it into a database in one command. Which begs the question as to why use a database at all for a simple one-off diff, when there are much more lightweight alternatives (a shell command).
So now you are converting your JSON to CSV to load it into a database to run a bunch of database diffs over it to then compare them in some way. Wouldn't that lead to the question "how did we get this huge problem and is there already a solution"?
Seems like you are the one over complicating things.
And I have to say, choosing CSV and then using a database for this task reeks of inexperience. KISS.
Please review https://news.ycombinator.com/newsguidelines.html and avoid turning nasty in arguments on Hacker News.
That seems unnecessarily condescending. JSON can also mean Janky Serialized Object Notation, but that's not the common case.
> I guess your experience with character separated value files is very limited.
In practice, using something other than a comma is a good solution for some problems, but not others (eg transfer corruption or you know, the OP's use case).
> a heavyweight solution like JSON.
I've literally never heard that phrase, nor does it make much sense. At best it's 2 more characters for wrapping braces with existing quoted data/numbers and at worst you have to make up a new non-interchangeable format as you run into exceptions from the diff, which can affect past encodings. Sounds more involved than using JSON. shrug
Your tone is oddly superior in your reply, which is really at odds with the technical content of your messages.
> if record fields are consistent
This is all very confused. The issue is that the JSON fields where not consistent compared to the baseline. So now what? You suggest instead of investing time making them consistent, you should just switch format entirely and then make them consistent? Or are you suggesting that somehow a line of CSV is easier to compare than a line of JSON? Or I should now shove a bunch of non-printing ascii characters in my message and that's now better?
I mean, HDF is super-general and stuff, but it looks like SQLite would solve all the trouble with CSVs.
I'm still leaning toward "a tarball of CSVs", though:
1. it's very easy to allow different devs to write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python if it has to use an API where the only available API-client library impl is in Python) to scrape some particular dimension out of an external source. You can write out CSV data in pretty much any language—even a bash script! That's because, even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf(2) with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)
2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" on the output of a bunch of Extract jobs (i.e. merge-sorting + deduping to produce one CSV dataset) by just feeding all the input files to sort(1) with the -u and -n switches passed. `sort -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly-well-optimized at chewing through on-disk files in parallel. sort(1) is to (local on-disk) CSV data as LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.
3. Once you've got two sorted+deduped CSV "snapshot" files, you can create a differential snapshot from them just by calling:
comm -1 -3 "$old_csv" "$new_csv" > diff.csv
cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)
But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)
I think this guy did the right thing for what sounded like essentially a one-off job to test this new export tool. Why would you go to all the trouble to use a SQL database for a one-off thing that can be done using text processing or worst-case writing a small script?
In theory a truly specific program could work better. In practise, the broad scope of jq allows you to discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.
As you say, jq's power is that it is an expressive language, and it's much much easier to write jq programs that work than it is to write C/C++ programs as needed that do the same or similar work.
A much simpler way to do this is simply to hash the files, for example using sha256sum, which AFAIK ships with just about every Linux distro. Then just compare the hashes.