I like JMESPath, but it has some serious limitations which prevent it from being as general purpose as jq.
JMESPath limitations:
- No simple if/else, but it is possible using a hack, documented below.
- The to_number function doesn't support boolean values, but it is possible using a hack, documented below.
- Can't reference parents when doing iteration. Why? All options for iteration, [*] and map, use the iterated item as the context for any expression. There's no opportunity to get any other values in. It may be possible for a fixed set of lengths, with something akin to the following (except there is no syntax for switch or if statements):
switch (length):
case 1: [expression[0]]
case 2: [expression[0], expression[1]]
case 3: [expression[0], expression[1], expression[2]]
...
- Key name can't come from an expression. Why? The ABNF for constructing key-value pairs is given as: keyval-expr = identifier ":" expression. The key is an identifier, which gives no possibility for making it an expression. No functions modify keys in such a way as to allow using an expression as a key.
- No basic math operations, add, multiply, divide, mod, etc. Why? Nobody added those operators/functions.
- There's a join, but no split.
- No array indexing based on expression. Why? Indexing is done based on a number or a slice expression, which also doesn't support expressions. Here's the ABNF:
- No ability to group_by an expression.
- No ability to get the index of an element in a list.
Hacks:
Convert true/false to number:
If/else:
Option 1)
Option 2)
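For illustration, a rough sketch of workarounds along these lines, using the Python jmespath library and JMESPath's && / || operators; the exact expressions are guesses at such hacks, not necessarily the ones the original comment had in mind:

import jmespath

data = {"flag": True}

# Boolean to number: && returns its right side when the left is truthy,
# || returns its right side when the left is falsy, so true/false can be
# mapped onto the JSON literals `1`/`0` without to_number().
print(jmespath.search("flag && `1` || `0`", data))                # -> 1

# Poor man's if/else: works as long as the "then" value is never falsy.
print(jmespath.search("flag && 'enabled' || 'disabled'", data))   # -> 'enabled'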
I love JMESPath. I first discovered it when using the AWS CLI `--query` option[0].
I then realized that using it in my code would make things much more declarative and easy to grok than a bunch of maps, filters etc. Here's a real example which I think illustrates it[1]. It has libraries for lots of languages with a clear specification/compliance test[2].
The cherry on the top is the interactive query on the website. You can tweak any of the examples (both queries and data) and get results instantly. Extremely useful for playing around, building queries to work with JSON data (webhooks, API responses etc)
I have only used jq, but I noticed that even the AWS CLI docs [0] (already mentioned in the comments a couple of times) suggest using jq for "more advanced features that may not be possible with --query".
jq has a more polished CLI and can do 'everything'.
jmespath is more limited, but is well specified and easier to fit in your head. It's also much more appropriate for use as a library from other projects, since it has clean implementations in many languages. It's also an advantage in this case that it can't do everything, since you can more realistically accept untrusted user input.
jmespath's default (Go) CLI isn't as fully featured as jq, unfortunately.
Apparently JMESPath is a better JSONPath/dot-notation alternative for querying JSON, while jq and jl are full-featured fully functional programming languages that happen to use JSON as their underlying data type.
I had never heard of jl before, but I've recently compiled an "awesome" collection of jq tools, libraries and use-cases at https://github.com/fiatjaf/awesome-jq that may be worth checking out.
I've been using jq quite a bit with the AWS CLI inside various bash scripts. It allows manipulation and filtering that the native API doesn't, which makes it pretty straightforward to get certain values - such as specific tags on the current instance (converted to key-value form, or to turn into bash variables), a list of other systems in the current auto scaling group, etc.
I've also started using it just to get colorized/formatted output from curl or a local json file.
The reinvention of XML in JSON is almost complete - JMESPath vs XPath, JSON Schema vs XML Schema etc. If you need semi-structured data to that level, consider using XML instead - you can validate it, there are plenty of tools, it's very stable and mature etc
Isn't that what we want? I would think that solving the same problems for a less verbose format is a good thing. And looking at the examples and the library support (http://jmespath.org/libraries.html) I would say this solves a real problem. I can already think of several use cases for my own system.
The fact that a solution is treading familiar ground does not invalidate its utility.
It's not very much less verbose, really. You still need a key and a value. The only thing you lose, really, is the end tag. For a complex text document in say TEI or Docbook, I don't see how this is much of an advantage.
No, you also lose the attribute system and thus the ability to extend existing elements without changing their structure.
In addition, arrays must be indicated clearly in JSON, while you can consider the children of an XML node to always form a list (be it empty or containing one element).
So while the two formats are very similar, there are still some differences that mean they aren't interchangeable for every use case.
Try implementing an xml parser sometime. The encoding and entity parsing alone is so complicated it’s brain melting. Compare that to the JSON spec and then ask why the world has decided JSON is the better format for casual object graph encoding. Not sure I could make the same argument about YAML though.
The reason I prefer JSON over XML is the latter's ambiguity. It's completely obvious how to convert a native object into a JSON object, for example. The translation to XML brings all sorts of questions about when to use tags vs. attributes of tags. It's cognitive load I'd rather spend on solving the problem at hand, not serialization.
If there had been an opinionated SGML syntax which mapped directly and unambiguously to common language primitives and back, I'm sure it would have been more popular than JSON.
XML Stylesheet Transformations, the language to describe transformations of XML documents into other XML documents. Isn't jq or JMESPath an example of such a language for transformations?
What's the difference between query and transformation then? Isn't a transformation a (vector-valued) function f of input x, with n dimensions of x and m dimensions of f, where f_i = f_i(x_1, ..., x_n)?
XPath can be used in Xslt, but not the other way round. In Xslt you can create new elements, while in XPath you can't. Seems like a pretty big difference to me. XPath isn't Turing complete either, while Xslt is.
Working on that, was thinking of naming it JST.. building it as part of a high level programming platform we are building.. will be open sourced once complete.
The problem with XML is that almost every tool has bad usability: basic tasks frequently dump you into a thicket of cryptic standards docs, and the libraries in most common languages have unnatural API choices (i.e. they follow the libxml2 C interfaces) and leave a lot of basic usability on the floor.
As a simple example, there's no technical reason why XPath couldn't allow you to either use namespaces as written in the document[1] or ignore them when appropriate[2]. Nobody cared about usability and it's the number one reason I've heard why developers with jobs to do fled to JSON / YAML as quickly as possible. A similar story arises with XPath 2 — libxml2 never got support so that standard effectively doesn't exist for most projects, but there's no way to shift the resources away from developing further revisions which will also not be widely used towards much cheaper basic investment in shared infrastructure.
1. e.g. if the doc is <foo:bar>, it should never require you to write {http://path/to/foo}bar to match that element.
2. Ever try to target docs in multiple versions of a standard?
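To make the namespace complaint concrete, here is what it looks like with Python's standard ElementTree on a hypothetical <foo:bar> document (the wildcard form needs Python 3.8+):

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root xmlns:foo="http://path/to/foo"><foo:bar>hi</foo:bar></root>'
)

# The prefix written in the document isn't enough on its own: the full URI
# has to be repeated, either inline in Clark notation...
print(doc.find('{http://path/to/foo}bar').text)                  # -> hi

# ...or via an explicit prefix-to-URI map passed to every query.
print(doc.find('foo:bar', {'foo': 'http://path/to/foo'}).text)   # -> hi

# "bar in whatever namespace" at least became possible in Python 3.8.
print(doc.find('{*}bar').text)                                   # -> hi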
Last time I checked, it was unclear with XML which validation I should use: DTD, Schema, or other solutions. Each has a syntax/structure and a first-page explanation so cryptic that I don't even understand where to begin.
I don't like js/json at all, but for JSON (and without much JS knowledge) I can roll out simple validation in less time than is needed to understand these schema formats. If my structure is dynamic, omg it will be hard to explain it to a declarative validator. If it contains value-level type logic (graphs, references), I bet a second phase of validation by hand is inevitable anyway.
A second thing: full-blown XML is alien to the native structures of simple languages and cannot be serialized like jsonlib.encode(foo). You create nodes, set attributes, build trees, all that mess. It feels like using a 17th century official mail ceremony to send ")))" to your buddy.
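To make that contrast concrete, a minimal sketch in Python (jsonlib above standing in for any JSON library, here the stdlib json module; names are illustrative):

import json
import xml.etree.ElementTree as ET

foo = {"name": "bob", "tags": ["a", "b"]}

# JSON: one call, the structure maps directly onto dicts/lists/strings.
json_text = json.dumps(foo)

# XML: decide tag vs. attribute, build nodes by hand, then serialize.
root = ET.Element("person", name=foo["name"])
tags = ET.SubElement(root, "tags")
for t in foo["tags"]:
    ET.SubElement(tags, "tag").text = t
xml_text = ET.tostring(root, encoding="unicode")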
> and cannot be serialized like jsonlib.encode(foo)
Huh? Could have fooled me. In Cocoa, for example, serializing to JSON and serializing to an XML property list (or binary property list) is effectively the same code.
Yes, but the property list is a subset of XML and almost a Cocoa-internal format. It doesn't even connect keys to values, probably losing all the xpath/xslt/etc abilities. No custom schema either?
Hmm... since XML is a meta-language describing a family of markup languages[1], every actual XML-based language will be a "subset".
And of course that is why some of the XML tools are so heavy: they deal with the full meta-format. The "nice" thing about JSON is that it doesn't have this indirection step, it's just one concrete language, and that's why it is simple. But again, it's trivially easy to define an actual concrete markup language using XML that is just as simple.
I agree with you that the choice of <key>theKey</theKey><string>value</string> rather than <theKey>value</theKey> was regrettable, had I designed the format I would have chosen differently. I think they wanted one DTD to describe this format, which wouldn't have been possible without keeping the meta-level indirection.
I created XML-based archivers for Cocoa that don't have this problem[2]. Again, this wasn't hard, and the API is NSKeyedArchiver compatible, so [MPWXMLArchiver archivedDataWithRootObject:someObject]; gets you a nice XML representation.
You can still use XPath/Xslt in that scenario, it's just more annoying. Of course it wouldn't be hard to write an Xslt to rewrite the XML so that the keys and values are better connected, so you can query it more easily using XPath.
People become just furious at the existence of closing tags (especially when they never write XML by hand). Because the XML people wanted to be able to spot unclosed elements before they reached EOF, and because they thought it would be nice to be able to report what was actually unclosed, we're going to have to just accept a world in which programmers will use absolutely anything before using XML.
see also: hating pascal's 'end' instead of '}' or claims that python is readable because it doesn't even have '}'
Is that a bad thing? For cases where the document is at least partially written by hand it makes sense to replace XML with something more friendly, such as JSON or YAML.
Wasn't the basic idea of JSON to be small and unstructured to be sent between applications? Shoehorning everything into JSON that is already solved by XML seems illogical, as GP already said. It reeks of NIH.
People shouldn't use JSON for anything hand-written. YAML or even the older INI format are way better suited for configuration files for one reason alone: they allow comments.
To me, JSON can be viewed as slightly enhanced S-expressions, enhanced in a particular way. It would sound strange hearing "shoehorning everything into S-expressions in Lisp that is already solved by structural Fortran syntax seems illogical".
The NIH claim is probably valid when there are no noticeable differences. In my opinion, XML differs a lot from JSON, particularly when one needs to write snippets by hand. JSON also seems more logical/laconic (once upon a time M-expressions were supposed to come to the stage, but S-expressions turned out to be good enough).
Exactly. I'll stick with XML, XPath, and XSLT - thank you very much. It's standards-based, natively implemented, and super-fast. If a web service sends me JSON, the first thing I do is serialize it to XML.
Hm. One reason why I prefer JSON to XML (and I actually like XML) is that JSON is simpler. The fact that XML has schema definitions out of the box, and that these schema definitions can reference other definitions, leads to more complex parsers that can contain more bugs and vulnerabilities.
To me this sounds almost like "S-expressions still don't support comments".
Why do you need comments in a structure of nested
- unique key mappings
- arrays
- strings
- numbers
- booleans
- nulls
? You can include comments, just like any string, into it: just reserve a key with a unique name, if you want. From a JSON parser/transformer point of view, a "comment" as a concept isn't a data-structure piece, it's rather an "intent" piece.
Something seems odd about wanting to put comments inside the JSON instead of before it. Then people will want a way to read the comments programmatically, then they'll want annotations, etc.
The "reinvention" is not complete and will never be necessary. The difference is that XPath is necessary to query XML because it's a botched horribly overcomplicated, designed-by-committee markup language. Except for tools like jq no such language is actually required for JSON because it maps on to language structures that always exist.
Neither JSON Schema nor XML Schema is particularly popular - and for good reason. Let's say you want to create a schema that limits the field "country" to ISO 3166-1 country codes - either you:
* Keep that schema file updated by hand every time something like Sudan breaking in two happens (no).
* Write a program that generates the schema (seriously... no)
* Do schema validation in code where it belongs - pulling in relevant validation data from canonical sources, rather than some markup language invented by people who didn't have the imagination to consider a really common use case.
There's a lot of benefit to being able to state what keys may be specified in a certain location, though. Look at DSLs like Cloudformation, for instance. Having schema validation could make static analysis of this kind of code much easier to handle. E.g.: Fn::Sub may be used inside of Fn::Join, but the reverse is not true, regardless of the types "returned" by each. It's certainly possible to validate via the api, but being able to do it in my editor will make finding errors much faster.
To your other point, however, dynamic code generation is becoming much more common. AWS generates a huge amount of its code from JSON definitions across multiple languages to keep its SDKs up to date. I could see schema validation being valuable in this domain as well.
> * Keep that schema file updated by hand every time something like Sudan breaking in two happens (no).
There is a lot of use for libraries dealing with time and dates. When you want to cover all cases, at some point you get to the situation when you have to allow variable number of seconds in a minute - not always 60, but sometimes 59 or 61, or may be even different numbers. And you don't know in advance - for arbitrary long future - which minutes will have which number of seconds.
So, for your timekeeping system to maintain precision, you have to allow external updates for when a minute will be considered non-60 seconds.
And those cases could happen more often than changing a list of valid country codes.
The point isn't to avoid it. Of course it's inevitable - that was my point! The point is to use code to validate instead of some markup so that the programmer can use their judgment about how it should be delegated.
I wrote some example code below that shows how you can validate with a list of countries in such a way that no code changes will be required when the list changes.
JSON Schema, at least, can refer to a URI for the definition of something, and that URI can refer to only a specific section of the JSON document to which it points.
The point I was making was that you shouldn't use a "special" language for validation at all - you should just use a library in a regular language to do it.
Anyway, code:
yaml_text:
John: Yemen
James: South Sudan
python code:
from strictyaml import load, MapPattern, Str, Enum
import pycountry

# Validate a mapping of person -> country, where the set of allowed country
# names comes straight from pycountry rather than a hand-maintained schema.
result = load(
    yaml_text,
    MapPattern(
        Str(),
        Enum([country.name for country in pycountry.countries]),
    ),
)
full disclosure: I wrote the validation library ^^
The idea behind XML schema, DTD, etc. is to pick a simple language to express schemas in, so that implementations in different languages have a decent chance of being compatible with each other.
Python isn’t a good choice there, as it is too flexible. For example, that code could have gotten the list of allowed country names from a file, database, or URL.
⇒ If I have to send such JSON to you, I would almost have to write my program in Python, and even then, it could be hard for me to replicate your setup.
>that code could have gotten the list of allowed country names from a file, database, or URL.
That is exactly the point. You should be able to do that, because the canonical list of data could easily come from any of those sources, and it should be up to the programmer's discretion how to fetch it.
The point of validation is to prevent invalid data from slipping through a net at minimum cost and that's how you do that.
Suden, Sudaan and South Sudan were all invalid countries in 2010 and that YAML was invalid. In 2012, Suden and Sudaan were invalid but South Sudan was not so that YAML was valid.
In the above example you have to make no code changes in order to account for that - just update pycountry every so often.
With XML schemas and DTDs, either you don't validate country at all (letting Suden and Sudaan through the net), or you rewrite and redistribute the schema by hand every time some dependency like the list of countries changes.
>If I have to send such json to you, I almost would have to write my program in python
Only if I choose to validate that data using a shared schema. Frankly, I've dealt with XML a lot and the number of times I've been handed a shared schema of any kind is very low. People just don't seem to use them. If they define an API in XML for instance they tend to just send examples and give a written explanation (e.g. insert valid country name here).
I don't see much value in making a schema more inherently "shareable" especially not if it means it has to be re-released every month.
The benefit of a query language is that it can be described declaratively (i.e. in a non-executable text file, perhaps within JSON itself), and then programs written in any language can execute its query logic using a standard interpreter written in that specific programming language.
So you get reusability of queries across the stack, in all languages that implement a parser against the spec. Your example only provides re-usability in JavaScript, and requires evaluating code at run-time so may not be suitable for queries based on user-submitted data in multi-tenant environments.
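A minimal sketch of that idea in Python with the jmespath library (the rule file contents and key names here are invented): the query lives as plain data, and any runtime with a JMESPath implementation can execute it.

import json
import jmespath

# The "rule" is pure data; it could equally well be read and executed by a
# Go or JavaScript service using its own JMESPath library.
rule_json = """{"description": "active admin emails",
                "query": "users[?active && role == 'admin'].email"}"""
rule = json.loads(rule_json)

data = {"users": [
    {"email": "a@example.com", "role": "admin", "active": True},
    {"email": "b@example.com", "role": "dev",   "active": True},
]}

print(jmespath.search(rule["query"], data))  # -> ['a@example.com']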
I really appreciate this comment. I was trying to figure out why I wouldn't use native data types and functions, but this makes it clear.
In your opinion, where would someone be storing the json such that they'd benefit from a tool like this? The only time I use json outside of pulling it from an API (where I can convert it to a native object) is probably storing it in postgres, where I've already got json querying tools.
- Infrastructure configuration stored in JSON. Query could reference other JSON files, or the JSON file itself (loops would need to be considered).
- Declarative reactive programming, e.g. platforms like IFTTT. You might want to take certain actions based on data in a JSON post. The IFTTT GUI would create JSON config files that its server side parsers can safely use without eval'ing code to decide which action to take.
- Adding conditional logic to jsonschema form generation. Recently I've built a questionnaire renderer in react that renders forms based on jsonschema. The user creates forms with a GUI, which compiles them to JSON, and then the renderer knows how to render. Conditional logic (e.g. question B is required if question A === true) can be quite limited when constrained to pure JSON. Something like this could help with that.
The nice thing about declarative syntax is you can build a GUI to generate it, so users never use the JSON itself, but you can store it in a database, safely execute rules based on it, etc. without requiring programming from the user.
That said, there are usually better ways to accomplish this, for example in pure JSON. Mongo syntax achieved this, with declarative operators like $or, $sum, etc., but it can be quite cumbersome.
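As a tiny illustration of the jsonschema-form case above (the field names and the required_when key are invented), the conditional rule can be stored as a JMESPath string next to the form definition and evaluated against the answers collected so far:

import jmespath

# Hypothetical form definition produced by a GUI: question B is required
# only when question A was answered with true.
question_b = {"id": "b", "required_when": "answers.a"}

answers = {"a": True}

is_required = bool(jmespath.search(question_b["required_when"],
                                   {"answers": answers}))
print(is_required)  # -> True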
I'm guessing the point is to be a common type of query language that can be used from any number of other languages: http://jmespath.org/libraries.html
In at least some languages, if these queries can be used without deserializing into a native data structure and then serializing back into a string, this could be a major win.
Your example however will not support dynamic or user-configurable paths without eval(). Alternatively, instead of eval you could run expressions through a JS parser, but it'll be more code than your example. The library we're discussing also defines a grammar for the query language.
If we're talking about JS, it would seem to me to be trivial to accomplish, by simply decomposing paths and using bracket notation to access nested props, just like lodash _.get does.
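A rough sketch of that approach, translated to Python for consistency with the other snippets here (lodash's _.get does the equivalent in JS); note it only handles static dotted paths, which is exactly the limitation the grammar/parser point above addresses:

from functools import reduce

def get_path(obj, path, default=None):
    """Resolve a dotted path like 'a.b.0.c' against nested dicts/lists."""
    def step(current, key):
        if isinstance(current, list):
            return current[int(key)]
        return current[key]
    try:
        return reduce(step, path.split("."), obj)
    except (KeyError, IndexError, TypeError, ValueError):
        return default

print(get_path({"a": {"b": [{"c": 42}]}}, "a.b.0.c"))  # -> 42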
Does this have any mathematical foundation like the relational algebra for SQL? Or more generally, does a mathematical framework exist to treat this or similar constructs and that goes beyond what relational algebra provides and that, for example, also handles aggregate functions?
The reason I am asking is that I am currently trying to build a tool to analyze a kind of time series data, think log file entries, in order to look for anomalies and visualize them. I could of course just build all the transformations I am interested in in an ad hoc fashion, but it would be nice to have a mathematical framework in order to start out with a small set of basic operations and then compose those, while having some guarantees about the expressiveness of the basic operations and ideally also a rigorous foundation for transforming them, for example for performance optimizations.
But so far I was unable to find something that seems fitting; everything I am aware of is either too limited, like relational algebra, or way too general, like general functions. It feels like what I am looking for should exist but I am unable to find it.
I never tried it, but I expect the performance to not be good enough; it already takes several minutes with code specifically written to perform the calculations I am interested in. And because I don't know what exactly I am looking for, I need more or less interactive speed so that I can try out many different ways to look at the data. But maybe I could use [materialized] views to convey enough information to the query planner about how to efficiently carry out the calculations, or maybe I am even underestimating how good query planners are. I just have the gut feeling that performing a lot of aggregation will make a database perform a lot of unnecessary work. But maybe I should and will try loading the data into SQL Server and see what happens.
The other thing is that SQL seems not the best fit to me. Say you just want to know how many events occurred in each hour of the last three months; that is straightforward grouping and counting at first, but already rounding the timestamps to an hour is not as obvious as it should be. And if there was no event in a specific hour, your result will just have no row for that hour instead of a row saying there were zero events in that hour. This in turn will cause more trouble if you want to build a histogram showing in how many hours there were, say, 0 to 9, 10 to 19, 20 to 29, and so on, events. Certainly still doable with SQL, but we are already entering the territory where writing a single query will take most people several hours to get the desired result.
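For what it's worth, the zero-filling part is only a few lines when done outside the database; a rough Python sketch (the timestamps are invented example data):

from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(timestamps, start, end):
    """Count events per hour, emitting 0 for hours with no events."""
    def floor_hour(ts):
        return ts.replace(minute=0, second=0, microsecond=0)
    counts = Counter(floor_hour(ts) for ts in timestamps)
    hour = floor_hour(start)
    while hour <= end:
        yield hour, counts.get(hour, 0)
        hour += timedelta(hours=1)

events = [datetime(2018, 1, 1, 10, 5), datetime(2018, 1, 1, 10, 40),
          datetime(2018, 1, 1, 13, 2)]
for hour, n in hourly_counts(events, datetime(2018, 1, 1, 9),
                             datetime(2018, 1, 1, 13)):
    print(hour, n)   # 11:00 and 12:00 show up with a count of 0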
I also couldn't easily tell how to express calculating the 99th percentile of the event size for every day of the week and hour of the day. I am pretty sure it is possible but I guess it would also be pretty unreadable unless you put in quite a bit of effort to create utility functions instead of hacking together one huge SQL statement. Then again I don't really know much about the more recent SQL features for partitioning and aggregating, maybe I should have a closer look at that first.
Right now it is just an effort to develop a tool to diagnose and hopefully thereafter fix random performance problems we are experiencing with one of our applications in production. Despite having a small team dedicated to investigating the problems, monitoring every click and function call with Dynatrace, having had a Microsoft SQL Server expert look into it, and getting the system audited by one of the big consulting companies, the problem has persisted for years and nobody really has any clue about what is going wrong.
The performance is never really great; it is [one of] the central applications of the company and depends on the interaction with a sizable chunk of the system landscape developed over decades, and therefore it is prone to be affected by incidents in a lot of systems, but most of the time it is good enough. But once every couple of weeks or months something goes badly wrong and requests (it's a web application) start taking several seconds or even minutes to complete. Minutes later everything is back to normal.
But I digress. If I manage to come up with a reusable and somewhat general tool to analyze data similar to what I am looking at, I would consider releasing it. It could either be a somewhat general data analysis and visualization tool, think R, or it could be more specifically tailored towards looking for anomalies in data sets like the one I am investigating. But as of now I am struggling to come up with a general framework to express the analyses I am performing, and therefore all I have is a rather ad hoc collection of transformations that extract and visualize aspects of the data that could lead to new insights into what is going on.
But right now it is really driven by our specific issue, I notice something in one view of the data and then come up with a new transformation to look at it in more detail or from a different angle. It is nothing that could easily be reused by anyone else and so for the moment it seems most likely that this will never become public or maybe only in the form of a blog article explaining what kind of information might be useful to look at and how to derive it from logs that look rather uninteresting at first glance.
A worthwhile alternative to this approach (a JSON-specific query language) is a language for converting JSON structures to newline-delimited records. Then, standard shell tools can be used to query and join: https://github.com/micha/json-table
For those interested in arbitrarily transforming JSON objects (for example, in a communications pipeline) I’d recommend JSONata. It’s quite useful and we’re well along in a Golang port with $function extensibility. http://jsonata.org
Agreed. I like JSONata a lot, even though it's the dark horse among JSON traversal languages. I've had a good experience parsing semi-unstable JSON with it.
This looks interesting - but doesn't MongoDB basically achieve the same effect? I kind of prefer MongoDB because you query JSON with JSON - but I'm open to changing my mind :)
If you're using MongoDB already, then sure use MongoDB's query tools. But if you are just working with raw JSON from a potential variety of sources, or in a streaming context, then you need something more in-place and general-purpose, which this appears to be.
My tiny lib with very similar functionality: [1].
The query syntax is slightly different though. Also I decided to re-use JS for evaluation of sub-expressions instead of implementing my own full-fledged parser.
I love libraries like this, which is small enough to be read in one sitting. I can scan through and get a general understanding of everything that it does.
The "evaluation of sub-expressions" made me curious. This line:
I'm sad that JSONSelect (https://github.com/lloyd/JSONSelect) never caught on. It uses CSS selectors to query JSON, which has the nice side effect that learning to use it improves your CSS as well!
I remember when XML started down this road too. "We aren't going to be SGML, just a lightweight markup." Then came XPath, XSLT, etc. And now we have XML today, the modern-day SGML.
the second I looked at the example on the homepage and saw this:
sort(@)
I'm like nope! What is this "@" symbol? Why can't that be "name"? I'm already passing judgment that this library will be a nightmare to use, which isn't good.
Now I know I can read the docs and eventually figure out what I can pass to the sort expression and what it all means. However, this is an issue I come across more and more with new libraries in programming... show simple examples, not "smart" or complicated ones. I shouldn't have to read through docs to try to decipher an introductory example. There is a reason every programming language starts with "Hello World".
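For reference, `@` is JMESPath's "current node": in an expression like the one on the homepage, an array is piped into sort, so sort(@) means "sort whatever was piped in". With the Python library and some made-up data:

import jmespath

# `@` refers to the value currently being evaluated, here the piped-in list
# of names produced by the filter on the left of the pipe.
print(jmespath.search("locations[?state == 'WA'].name | sort(@)",
                      {"locations": [
                          {"name": "Seattle",  "state": "WA"},
                          {"name": "Bellevue", "state": "WA"},
                          {"name": "Portland", "state": "OR"},
                      ]}))
# -> ['Bellevue', 'Seattle']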
May also be interested in the `jq` CLI, which on first glance appears to use a similar but not identical query language. https://stedolan.github.io/jq/