Parsing JSON at the CLI: A Practical Introduction to jq and more (makes.software)
294 points by sequoia on Dec 21, 2020 | 102 comments



I've been frequently derided for demoing it, but I should write something about how to do similar tricks with PowerShell, as I have found it easier to learn than jq and an underrated tool for dealing with JSON.

The first example:

  ConvertFrom-Json $USERX | ConvertTo-Json | Set-Clipboard
Getting properties by name:

  ConvertFrom-Json $ORDER | select order*
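And filtering, roughly analogous to jq's `select` (a sketch, assuming $USERX holds the user JSON with an orders array as in the article):

  ConvertFrom-Json $USERX | Select-Object -ExpandProperty orders | Where-Object qty -gt 10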


I haven't used pwsh in a few years, but does `ConvertTo-Json` still have that terrible `-Depth` parameter that you have to remember to set so that it doesn't mangle your JSON? A quick web search indicates that it does.

Edit: Yes, it does.

    '{ "a": { "b": { "c": { "d": { "e": 5 } } } } }' | ConvertFrom-Json | ConvertTo-Json

    {
      "a": {
        "b": {
          "c": "@{d=}"
        }
      }
    }
So you have to remember to set `-Depth 100` for every invocation of `ConvertTo-Json`. You also can't set it higher because it has a hard-coded limit of 100, so hopefully you never deal with JSON that has more nesting than that.
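For the record, the non-mangling invocation is just (same input as above):

    '{ "a": { "b": { "c": { "d": { "e": 5 } } } } }' | ConvertFrom-Json | ConvertTo-Json -Depth 100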

Yeesh. I like PS, but this one cmdlet's design has always baffled me.


That API is being reworked now, at least: https://github.com/PowerShell/PowerShell/pull/11198


I was wondering about that, because the docs say it uses Json.NET, which, while an excellent library, is being replaced with System.Text.Json in the BCL/FCL as far as I know.


Nu shell is also really handy for these types of format conversions (and querying):

  > echo '{ "a": { "b": { "c": { "d": { "e": 5 } } } } }' | from json | to toml
  [a.b.c.d]
  e = 5


Yep. Love powershell, disliked this decision, but now understand it: objects returned from native powershell commands have a loooooooooooooooooooot of nesting. Try:

  gci | select -first 1 | ConvertTo-Json -Depth 2
  gci | select -first 1 | ConvertTo-Json -Depth 3
  gci | select -first 1 | ConvertTo-Json -Depth 4
one at a time.


In the words of Raymond Chen, adding a `-Depth` parameter is using a global solution to a local problem. If certain types should not be serialized beyond a certain depth, the config should be per type, not on the whole serialization.

Or leave it as it is and expect the user to map the complex values into hashtables. E.g. in your example, make the user pipe the objects through `%{ @{ Name = $_.Name; Mode = $_.Mode; } }` first.
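For instance, a sketch based on your gci example:

  gci | select -first 1 | %{ @{ Name = $_.Name; Mode = $_.Mode } } | ConvertTo-Json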

In any case, I don't know about other PS users, but all my command-lines that ended with ConvertTo-Json either started with ConvertFrom-Json (transforming an existing JSON file), or started with HashTables (building a JSON file from scratch) or a mix of the two. Therefore I always wanted everything to be serialized, and the `-Depth` parameter was always a nuisance.


Thankfully it looks like -Depth is going to default to 64 when the cmdlet is ported to System.Text.Json, per voxic11 downthread: https://github.com/PowerShell/PowerShell/pull/11198


True, but isn't it a good idea to have an "upper limit" global value too in case of badly behaved types?


Why? Both JSON strings and their deserialized values grow only linearly in memory as you nest stuff (unless you're doing something dumb).

By that logic they should also add -MaxArrayLength to make the parser stop parsing long JSON arrays, -MaxProperties to make it only parse object properties up to a limit, -MaxStringLength to make it only parse strings up to a certain length...

It's completely pointless.


I suppose it is only a concern during serialization, not deserialization.


Yes, a limit per type. Not for the whole serialization.


But what if a type which you need to serialize wrongly doesn't set the limit?

EDIT: I see your point now, thank you for explaining


Again, I'm not saying there shouldn't be a default limit. I'm saying the limit should be per type, not for the whole serialization.

It can be something as trivial as "PSHashTable is allowed to serialize without limit. Every other type is limited to N levels, unless the user sets `$SERIALIZATION_LIMIT[System.Type]` to some other value." Or make the `-Depth` parameter a `Dictionary<System.Type, int>`.


It turns an object hierarchy into flat text. References in the object hierarchy can be circular, and the Depth parameter limits the infinite loop a circular reference would cause.


Detecting circular structures (sometimes just called graphs) is a solved problem. The Lisp reader and printer even serialize them correctly: #1=(42 . #1#) makes a pair of two items. The first item is a 42 and the second is the pair itself.

It sounds like whoever implemented this just didn't know what they were doing.


This can't happen. JSON does not have references.


I believe he’s referring to an arbitrary object being converted to JSON.

From-JSON wouldn’t have this concern, but to-JSON would


Yes. My mistake.


>var obj = {}; obj.foo = obj; JSON.stringify(obj);

>Uncaught TypeError: cyclic object value

If browser JS engines can detect circular references, the PS serializer can too.


it CAN, but it DOESN'T.


Most of PowerShell is underrated when it comes to pipelines in shell.

I've completely rid myself of GNU core utils with just pwsh.


If only it weren't dog slow and didn't chew up memory when working with large files. Large files often force me to go back to cygwin.


Idk that's just not something I do with the shell. At work I process a lot of large files, but typically those require hand written code to use off heap or file/db backed collections (I work in the JVM).


One difference in usage of PWSH I can't wrap my head around is when I want to do something akin to:

    grep -rl $pattern | $othercommand
So far I've got this (deliberately avoiding aliases for this example) in the above grep command's stead:

    Get-ChildItem -Recurse -Path * | Select-String $pattern
But I just cannot fathom how to break from the table format to pass the lines to $othercommand.

How does one pass a pwsh list to a single command? Using %{}/ForEach-Object {} wouldn't work because it would invoke $othercommand for each line.


Since PowerShell works on objects rather than text, you might just want to use grep. Anyway, this should reproduce the output of grep in PowerShell:

  Get-ChildItem -Recurse -File | Select-String "$pattern" -List -SimpleMatch -CaseSensitive | Select-Object -ExpandProperty path
You will want to search only in files so this doesn't output some InputStream lines if there is a match in the actual path. If $pattern is an actual regex you will want to drop -SimpleMatch. In case you want proper PowerShell objects you might want to pipe this into Get-Item.
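And if $othercommand is a native executable, you can pipe those paths straight into it, since PowerShell renders each object as one line of text when piping into a native command. A sketch:

  Get-ChildItem -Recurse -File | Select-String "$pattern" -List | Select-Object -ExpandProperty Path | & $othercommand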


Everything in powershell is an object, but sometimes the output format doesn't make it clear. Select-String basically returns a match object with some nested results, but the output formatting makes it awkward to see. Pipe it to

  | format-list *
to investigate it, and you'll see a bunch of members. In this case if what you want is just the matched string, you want to expand the `Matches` property of the object, then expand the value property of each of those. In this case the term 'expand' is key: if you just use `select` it'll return you the property, name and all. Use `-expand` to return the plain array. So something like:

  Get-ChildItem -Recurse -Path * | Select-String $pattern | select -expand matches | select -expand value
which you can then use how you like. An alternate form is

  (Get-ChildItem -Recurse -Path * | Select-String $pattern).Matches.Value
because the . notation for member access works on all items in an array (Matches in this case).


I don't think I entirely understand the question, is $othercommand doing something with all of the lines together?


Yes, for example other command may be (and often is) fzf followed by yet another command to exec on the chosen file.


why? are you forced to work on windows?


PowerShell is supported on Linux and macOS as well [1]

I completely agree with the parent that PowerShell blows the existing shell options out of the water. Having pipelines with objects rather than just text makes everything so much easier. Instead of spending an hour futzing with awk or regex or applications like jq to parse values from command results, you can just access what you want directly and get on with your work.

1. https://docs.microsoft.com/en-us/powershell/scripting/instal...


I'm glad you're productive in your CLI tool usage but just because it takes you an hour futzing with basic *nix tools doesn't mean it's the same for everyone else. Powershell is extremely cumbersome for me due to the sheer verbosity of each command and the apparent MO of "let's come up with a new name for everything" and "sensible defaults? what's that?"


PowerShell is, in my opinion, quite discoverable and usable with a few commands to find your footing. I don't memorize a lot, because it's so well-structured and intuitive for the most part that once you get the basic structure it's easy to hack around with the docs.

Three commands that are useful to memorize, though, (particularly if you're having trouble remembering names) are Get-Command, Get-Alias (also with -Definition), and Get-Member. Get-Command gets you info about a command, like if it's an alias or not, and the path if it's a unix command. Get-Alias shows you all the active aliases, and Get-Alias -Definition shows you the active aliases for a given command. Get-Member shows you all the members of an object which can help you with Select-Object and Where-Object and so on.
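For example, the kind of thing you'd run:

  Get-Command ls                        # alias, cmdlet, or native binary?
  Get-Alias -Definition Get-ChildItem   # which aliases point at Get-ChildItem
  Get-ChildItem | Get-Member            # what the emitted objects actually expose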


You don't have to type the full commands. Most installs come with a full list of aliases: https://pastebin.com/eHsTUaRQ

For instance on my system:

   Get-ChildItem -> gci or ls


   Select-String -> grep


With tab completion you do not need aliases, but you do need to name your commands in a sane way; putting the less meaningful words first like PowerShell does is really cumbersome for tabbing. Examples: Where-, Get-, Remove-, Export-, From-, Select-, Invoke-

Nice list of aliases though.


Menu completion is even better if you haven't tried it!


FWIW I'm talking about the CLI; I have little experience with menu completion, but how does menu completion solve the Get-<TAB> problem? It will give too many hits, and parsing them will take too long. Menu completion in IntelliJ and its ilk is more about discoverability than fast typing, so you can take your time parsing the output.


I just don't want to have to type that much physically, let alone the verbose commands and slightly odd syntax you often have to memorize. I know some love it, but I'm much happier with just using bash everywhere, even though msys/git bash on Windows has its own quirks; WSL ftw though.


Yes, awk and jq “just work” for me: whether it’s because I’ve trained myself through constant usage or whatever, I’ve yet to understand the supposed benefits of a richer IPC mechanism than byte streams.


No, but I'm a gamer at my core and I develop on the JVM so I have no need to interface with the kernel. Windows just works for me.


Iirc powershell is cross platform.


You definitely should! Is it alright if I copy this comment onto my blog underneath this post in the "comments" section?

If you do write such a post make sure to let me know so I can link it from mine! You could write a response "here's how I'd do all the stuff in that post in Powershell with no extra tools" that would be very cool :)


I'll try to remember!


I needed this tutorial 6 months ago (and 6 months before that, and 6 months before that). :D Highly recommend looking at and maybe including "gron"[1] in this as a very nice complement to jq. It fills in some use cases in a very straightforward way that are pretty cumbersome in jq, such as finding a field deeply nested in an optional parent.

[1] https://github.com/tomnomnom/gron


Agreed, gron is a fantastic tool.

Earlier today I was looking at some deeply nested structures which had leaf nodes with a field named "fileURL", and values that were mostly "https://" but some were "http://". I needed to see how many, etc.

cat file.json | gron | grep fileURL | grep http | grep -v https

... and presto, I had only four such nodes.

Would've been a ton more work to get there with just jq.


Based on your description, it sounds like:

    <file.json jq '.. | .fileURL? | select(startswith("http://"))' -r
... would've done the job?

Or, if you can't remember `startswith`:

    <file.json jq '.. | .fileURL?' -r | grep '^http://'


Nice and terse :-)

Right, the ".." part was hard to remember because it's something like:

  {
    "entries": [
      {
        "fields": {
          "fileURL": {
            "en-US": "https://..."
          },
          ...
        },
        ...
      },
      ...
    ]
  }
and as you can see the field fileURL is actually an object with another field en-US (with a hyphen) so the jq becomes something like this:

<file.json jq -r '.entries[] | .fields | select(.fileURL) | .fileURL["en-US"] | select(startswith("http://"))'

And later I had to do the same for other fields that ended in URL (websiteURL, etc.)

Anyway, gron made it simpler to get a quick summary of what I needed because it represents every node in the tree as a separate line that has the path in it, which is perfect for grep.

I still use jq more than gron :-)


>Right, the ".." part was hard to remember because it's something like:

You misunderstand. The command I wrote is meant to be used as I wrote it. `..` is not a placeholder for you to replace.

https://stedolan.github.io/jq/manual/#RecursiveDescent%3a%2e...

>and as you can see the field fileURL is actually an object with another field en-US (with a hyphen) so the jq becomes something like this:

Sure, so then it's:

    <file.json jq '.. | select(.fileURL?).fileURL["en-US"] | select(startswith("http://"))' -r


Oh, I had no idea ".." was an operator. That's a pretty cool feature, combined with the "?" to match optional objects.

Thanks for explaining!


Not a jq expert, but I do a) get to explore random huge files as JSON sometimes, and b) may care when the structure (but not the values) changes.

So a utility[] that spits out all the jq paths for a JSON file helps in both cases.

[] https://github.com/TomConlin/json_to_paths


oh wow cool I didn't know about gron!! Is it alright if I add this comment to the bottom of my post?


gron is great! And because the 80% use case of "gron -> grep -> gron -u to turn it back into JSON" is so common, I use these two convenience wrappers that make it easier. I called them grop because it's gron + grep:

https://github.com/busfahrer/grop


Of course, although I wouldn't be surprised if a jq expert comes along and gives an easy way to deal with deeply nested children of an optional parent, making my comment moot. :D


Ha, I never let that stop me! You can do some of the stuff in my post other ways, like using the `--jsonpath` switch in kubectl, but 1. I prefer learning jq rather than the idiosyncratic syntax/flags of each tool & 2. Let 1000 flowers bloom. I figure it's OK if there's more than one way to do something!


If you do not have jq installed, or if you prefer SQL for querying/handling your data, you can rely on SQLite. For example:

  sqlite3 <<< "select json_extract(value, '$.Plan.Plans[0].Output') from json_each(readfile('explain.json'))"
Reference: https://www.sqlite.org/json1.html


I love the idea of "json lines", which is implicit here, in jq. You can convert a giant JSON array of objects to "json lines" and back with jq. A lot of tools can produce them, and then you can mix and match jq, plain unix grep, head etc., because each line of your stream is a complete parsable JSON value, guaranteed.

back in a day I've loaded a lot of data to postgresql and elastic search after preprocessing it with very simple but powerful chain of CSV parsers (I've used CSVfix), jq, sort, grep, etc.
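For anyone wondering what that conversion looks like, a minimal sketch (file names are placeholders):

  jq -c '.[]' big-array.json > records.jsonl    # one compact JSON object per line
  jq -s '.' records.jsonl > big-array.json      # slurp the lines back into one array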


This concept has a name: "ndjson", for "newline-delimited JSON". http://ndjson.org/

Some tools from the Node.js universe. `npm install $etc`. https://www.npmjs.com/package/ndjson-cli


I use https://www.npmjs.com/package/concatjson to get a Stream interface in node.


json-lines is as much of a name for the concept as ndjson is

https://jsonlines.org https://github.com/ndjson/ndjson.github.io/issues/1


ndjson has more than a few times saved the scalability concerns of my technical designs.


Thanks, codesnik -- good tip!

My favorite tool in this area is `lnav` (https://lnav.org), a scriptable / chainable mini-ETL with embedded sqlite.

Oh and tiny correction (intended as helpful not nit-picking), the idiom is "back in the day" not "back in a day". :)


It sure would be cool if all the common Unix tools had JSON lines / ndjson output mode.

It could also be done as a library of adapters, e.g. something like

    j ls
Would run “ls”, parse the output, and re-output it as JSON lines.


Not sure if it counts as a plug, but this does exactly that: https://github.com/kellyjonbrazil/jc
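Roughly like this (a sketch; I'm assuming jc's ls parser exposes `filename` and a numeric `size`):

  ls -la /usr/bin | jc --ls | jq -r '.[] | select(.size > 100000).filename'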


That's awesome thank you!


Thanks!

I found the most value in some of the additional tricks shown in the article (like the “pretty print JSON stored in the clipboard” trick) over the introduction/tutorial aspect; there are hundreds of these “look, I figured out how to do a handful of interesting things in jq so let me share with you all” tutorials floating around and most of the basics are redundant among those by now.

If jq’s documentation had some basic tutorials it would probably be a better authoritative source for this information.

Until then, as said, thanks for preparing and sharing this!


the jq manual has tons of examples https://stedolan.github.io/jq/manual/


I recently wrote a short example based gist for a specific health care use case (evaluating FHIR data) using jq.

https://gist.github.com/joshgel/12082d23a75feaab5d405db31981...


> But, this prints all 317 resource types on a new line, not super helpful.

> So, let's group by and count:

> cat 1396-Ledner.json | jq '.entry[].resource.resourceType' | sort | uniq -c | sort -nr

> This gives me a list that looks like this:

This is so funny! We came up with some of the exact same combinations of tools (jq + sort, uniq, wc etc.). I mean, it makes sense so I shouldn't be surprised!


And this post didn't mention custom functions, which is where things get more interesting :-) https://acalustra.com/jq-functions-modules-and-dotfiles-tips...
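For the curious, the gist of the trick is a sketch like this, relying on jq picking up extra definitions from ~/.jq (field and file names here are just placeholders):

  # ~/.jq
  def count_by(f): group_by(f) | map({value: (.[0] | f), count: length});

  # then from any shell
  jq '[.items[].type] | count_by(.)' some.json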


At that point why not write a Python script?


jq is a much, much more expressive language than Python at applying similar operations to the list of inputs (with fanout!).
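"Fanout" here being jq filters that emit several outputs per input, e.g. generators and the comma operator:

  echo '[1,2,3]' | jq '.[] | (. * 2), (. * .)'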


How about combining the best of both worlds: https://github.com/mwilliamson/jq.py


Yeah, jq has some of the benefits that APL users like to talk about.


I'm not disagreeing with you, but in this case it's all stored in custom dotfiles, as functions, which is easier to use IMHO than a custom Python script.

But I use jq a lot, so it's super easy for me ;-)


Shameless plug, but I created jello[0] to give you the ease of parsing json in bash, but using python syntax[1]

[0]https://github.com/kellyjonbrazil/jello

[1]https://blog.kellybrazil.com/2020/03/25/jello-the-jq-alterna...
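A quick taste, reusing the $USERX example from elsewhere in the thread (a sketch; the parsed JSON is exposed to your expression as `_`):

  echo $USERX | jello '[o for o in _["orders"] if o["qty"] > 10]'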


> We could select for this using jq, but promql already lets us filter by metric names so we'll do that instead

I think it would have been much better to do this via jq instead, for the sake of demonstration. This is, after all, an article on jq, not on promql...


Good call, I will update the post.

It would look like this, for posterity:

jq '.data.result[].metric | select(.app == "toodle-app").task_name'

The reason I was putting the querying into promql where possible is that it is (I assume) more efficient to do that filtering on the server & only return matching values rather than returning "everything" & filtering locally. But your point stands, it's tangential here. Thanks for the feedback this will definitely improve the post!

Edit: updated!


I think it makes sense to point out that it could be done better using just promql, but it's useful to show how to do it in just jq for those using other tools!


I've been using jq for several years but still can't write anything more complex than a simple XPath-style lookup without going to the manual and googling.

Babashka[1] with Cheshire[2] aliased as json solves this for me; the code is longer but there's no need for googling anymore:

  USERX='{"name":"duchess","city":"Toronto","orders":[{"id":"x","qty":10},{"id":"y","qty":15}]}'

  echo $USERX | jq '.orders[]|select(.qty>10)'
  {
    "id": "y",
    "qty": 15
  }

  echo $USERX | bb -i -o '(-> *input* first (json/decode true) (->> :orders (filter #(> (:qty %) 10))))'
  {:id y, :qty 15}

[1]: https://github.com/borkdude/babashka [2]: https://github.com/dakrone/cheshire


+1 for jq. It is so useful for mucking about with API responses. I used it to write a bash script that acts as a NAGIOS plugin for spotting changes in our Digital Ocean server and DNS configs - curl to pull from the API, jq to strip out the bits we don't care for, and sed to put actual newlines in instead of \n. We had Cisco as a customer, and infrastructure change-monitoring was one of their infosec team's requirements... that was part of our solution.
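A rough sketch of the shape of that pipeline (endpoint and fields from memory; the token variable and file names are placeholders):

  curl -s -H "Authorization: Bearer $DO_TOKEN" "https://api.digitalocean.com/v2/droplets" \
    | jq -S '[.droplets[] | {name, ip: .networks.v4[0].ip_address}]' > current.json
  diff expected.json current.json || echo "WARNING: infrastructure changed"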


Suggestions for a more memory-efficient JSON processor for resource-constrained computers? All the ones I have tried were no better than using standard line-oriented UNIX utilities.

Example:

189MB JSON

    curl https://raw.githubusercontent.com/zemirco/sf-city-lots-json/master/citylots.json | jq
while that is running watch the system memory using vmstat, top, etc.


https://nigeltao.github.io/blog/2020/jsonptr.html

describes a JSON query tool that can use an order of magnitude less memory than jq. Its query language (JSON Pointer) is much simpler, though.
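For comparison, a JSON Pointer is just a slash-separated path, so the two query styles look like this (illustration only, not jsonptr's exact CLI):

  /a/b/0/c     # JSON Pointer (RFC 6901)
  .a.b[0].c    # the equivalent jq path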



simdjson is indeed good stuff, but its focus is CPU efficiency, not memory efficiency.

For example, it requires the entire input to be in memory. It does not support streaming input.


For streaming applications, I've often found this approach useful: http://ndjson.org/

That's assuming your structure is list-large, rather than tree-large. Adjacency lists can often bridge the gap.


If you want to extract some data from JSON and wrangle it into another form (filter, count, dedupe etc), but _don't_ want to use a CLI, you might want to look at our drag and drop software for Windows and Mac: https://www.easydatatransform.com


I use jq for processing XML files. Recently I used https://github.com/fiatjaf/jq-web to support some of the conversions in JS. It was wonderful.


JSON is so slow and inefficient. All that marshaling and unmarshaling is crazy. Binary formats will rise again.

This comment says it best I think, "On the other hand, if you only need 53 bits of your 64 bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, [JSON parsing] is your funeral." - Source https://rachelbythebay.com/w/2019/07/21/reliability/
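Concretely, the 53-bit issue the quote refers to: 2^53 + 1 is not representable as a double, so a JavaScript-style parser silently rounds it.

  node -e 'console.log(JSON.parse("9007199254740993"))'
  9007199254740992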


REST is so inefficient. All those bloated text fields.

HTTP is so inefficient. All those newline-delimited case-varying strings.

Logs are so inefficient. Error codes are so much shorter and precise.

Computers are here to make humans' lives easier. But it's not easier if we're forced to use tools that communicate in obscure ways we can't understand and later need translators for. If a human has to debug it, either it has to be translated into a human-language-like form, or it can just stay that way to begin with and save us the trouble of having to build translators into anything that outputs, extracts, transforms, or loads data.

JSON is actually a wonderful general-purpose data format. It's somewhat simple (compared to the alternatives), it's grokable, it's ubiquitous, and you can `cat` it and not destroy your terminal. But of course it shouldn't be used for everything. Actually one of the only things I find distasteful about JSON is the lack of a version indicator to allow extending it. But maybe this was its saving grace.

If "blowing CPU on JSON marshaling" is your big problem, boy, your business must be thriving.


I'm struggling to remember who, but I just recently saw a tech talk about microservices at scale by a fairly large company that had started adopting more efficient request/response formats because they discovered that in their tens of thousands of cores, some high single-digit percentage of all CPU cycles were spent marshaling and unmarshaling JSON.


HTTP/2 is binary and so is the systemd log.


> Binary formats will rise again.

ASN.1 FTW!!!!!

:)

Probably the most popular modern alternatives:

* https://en.wikipedia.org/wiki/CBOR (RFC 8949)

* https://en.wikipedia.org/wiki/Protocol_Buffers

* https://en.wikipedia.org/wiki/MessagePack


Actually some parsers support bigint/bigdecimal. I've used Jackson to parse JSON numbers into those data types. It's just that not all JSON parsers are created equal.


jq + bash + (some other CLI tools) make a very powerful toolset for simple websites that need to act on external data. I've been generating static websites from cronjobs using this sort of workflow for a while now and it works really well. Mainly dashboards and such. I have one that doesn't use jq, but instead uses sqlite3 to query a database and generate the static site content.
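A sketch of the shape of that workflow (paths, endpoint, and field names are placeholders):

  # crontab entry: rebuild the dashboard every 15 minutes
  */15 * * * * /usr/local/bin/build-dashboard.sh

  # build-dashboard.sh: fetch, template with jq string interpolation, write static HTML
  curl -s https://example.com/api/status.json \
    | jq -r '.services[] | "<tr><td>\(.name)</td><td>\(.status)</td></tr>"' \
    > /var/www/html/rows.html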


jq is great, but I doubt I'll ever be able to remember its syntax...


Like half my CLI tools, to be honest.

I usually mash together a 10-step pipeline of jq, grep (-v), vim, xargs until I find the data I need. If I need a reliable processing pipeline, I can afford to google 10min for how to use jq in more detail!


Shameless plug, but that was one of the reasons I created jello[0], which allows you to use pure python list/dict/comprehension, etc. syntax. Also, here is a blog post[1] explaining in more detail.

[0]https://github.com/kellyjonbrazil/jello

[1]https://blog.kellybrazil.com/2020/03/25/jello-the-jq-alterna...


Same here. I always end up googling things I know it can do because I've done them before. I just can't remember the syntax.


The KATE text editor mentions using this as well, but I've had so much trouble getting it to work. It has its own console in the editor where you can use JavaScript to modify things if you want. Something I still need to figure out, but it's just more complicated than I think it needs to be, for some reason.




Added it.



