It's significantly harder to write a compliant and performant parser for XML, even if one might be able to make it as fast. But after parsing, the application developer still has to transform the XML data into something usable in their program. This is much easier with JSON, since its information model (map/list/scalar) matches the information model of most modern programming languages.
XML is a poor choice for most serialization since it has the wrong information model. It's designed for extensible document markup and most application data packages aren’t documents that require markup.
I'd argue the opposite, simply because XML has many well-defined schema formats and XPath. If you end up having multiple versions of data formats, or are dealing with data from an unknown source, XML is far superior and allows for easier interchange of data.
The correct way to load an XML document, IMO, is as follows (a rough sketch of both workflows follows below):
* Open it
* Run it through the schema to validate its structure
* Manipulate/access it via XPath, with the ability to assume correct structure as defined by the schema.
Compare to JSON, where the process is:
* Open it
* Attempt to access it, even though its structure may deviate from the expected
* Die or try to roll back work when an assumption about the data structure isn't met.
Thus, most JSON code either trusts the input data implicitly, or is a mess of "is this where it should be?" checks - an independently re-developed schema equivalent, every time. I am aware of attempts to create JSON schema languages - unfortunately these are neither in common use nor well-defined enough to be standards, and they're more verbose and less expressive than XML's RELAX NG compact syntax.
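A minimal sketch of the two workflows, assuming Python with lxml on the XML side (the file names, schema and element structure here are invented for illustration):

    import json
    from lxml import etree

    # XML path: validate against a schema up front, then query with XPath.
    schema = etree.XMLSchema(etree.parse("orders.xsd"))  # hypothetical schema file
    doc = etree.parse("orders.xml")                      # hypothetical document
    schema.assertValid(doc)                              # fails fast with a clear error
    skus = doc.xpath("/orders/order/item/@sku")          # safe to assume this structure now

    # JSON path: no agreed schema step, so structural checks live in application code.
    with open("orders.json") as f:
        data = json.load(f)
    try:
        skus = [item["sku"] for order in data["orders"] for item in order["items"]]
    except (KeyError, TypeError):
        pass  # die, roll back, or log - every consumer re-invents this check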
JSON is fundamentally a dump of a data structure, not a document format. It isn't designed for long-term data storage, or for inter-compatibility, despite being used in those ways. Thus, it's great if you control both ends and your data structures don't change, as is the case with most web programming. It frequently falls down outside that realm.
Now, the above ignores runtime performance concerns. I'd argue (again) that outside of web apps, the load/store into a data structure is done rarely, or can be done in the background or in parallel if there's a lot of data concerned. Also, we're probably talking about less than an order of magnitude in the vast majority of cases.
The thing is it's almost impossible to express all the constraints on your input in schema. You certainly can't encode "userId must match a user that exists in the database" as part of a schema. So with XML you end up having two validation processes - the schema and then the part of your code that checks that the input datum really is valid. And then when the time comes to add another check you realize that code is a much more expressive way to write your constraints, and it's always possible for input to be rejected at the code stage, so you might as well just add your check there. So the schema level ends up being a useless extra complication.
I don't think you get what schemas are for. Schemas don't validate the content - they validate the structure and variable types. It doesn't magically make the data valid. It lets you make assumptions about how that data is structured.
The advantage of a schema is that one schema can be used in multiple languages. Someone's Lua based embedded system can run the same schema as a Go program on a 32 CPU box.
You can give someone a well written schema, tell them "make your data like this, and run the validator when you open and save your XML". In this way, it's a free, one line (in most languages) structure sanity check.
>I don't think you get what schemas are for. Schemas don't validate the content - they validate the structure and variable types. It doesn't magically make the data valid. It lets you make assumptions about how that data is structured.
In a typed language the very first stage of deserialization, the "turn this from a string into objects in my language" step, will do that for you "for free".
In an untyped language I guess it could be helpful, but again: it's almost always easier to write these sort of checks in a general purpose programming language than in XML schema.
>The advantage of a schema is that one schema can be used in multiple languages. Someone's Lua based embedded system can run the same schema as a Go program on a 32 CPU box.
If that's what you want there are more robust and performant approaches, e.g. thrift.
>You can give someone a well written schema, tell them "make your data like this, and run the validator when you open and save your XML". In this way, it's a free, one line (in most languages) structure sanity check.
Right, but that doesn't tell them what they think it means. It doesn't make the request valid.
If you have untrusted input, XML has its own issues [1].
> JSON is fundamentally a dump of a data structure, not a document format.
Exactly! IMHO, it's unfortunate that mark-up documents can be used to represent other data structures. It isn't what they're meant for, and there are (or "should be" if you're not fond of JSON) better tools for human-readable serialization.
Indeed, YAML is a more powerful serialization format than JSON.
The original name (Yet Another Markup Language) threw me off, so I never seriously considered using it. Even its Wikipedia article is in the "Markup languages" category.
JSON Schema (http://json-schema.org/) seems quite competent. Perhaps the "problem" with JSON Schema/Path is that JSON is simple enough that you can load it into native structures and then do validation in the programming language you're most comfortable with.
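For what it's worth, a minimal structural check with a Python JSON Schema validator looks roughly like this (using the third-party jsonschema package; the schema itself is invented):

    import jsonschema

    # Hypothetical schema: an object that must carry an integer "userId".
    schema = {
        "type": "object",
        "properties": {"userId": {"type": "integer"}},
        "required": ["userId"],
    }

    jsonschema.validate({"userId": 42}, schema)    # passes silently
    jsonschema.validate({"userId": "42"}, schema)  # raises ValidationError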
It took years and lots of committee meetings to get a W3C-blessed XML schema language; and then, key people didn't agree with how it should be done: James Clark's alternative was TREX and MURATA Makoto's was RELAX.
XML schemas only solve the most obvious validation failures – even if the elements are present as expected, you still need the exact same process to handle invalid data. While XML has XPath, JSONPath exists and the downside to XPath is that many environments are stuck with 1.0 and XML libraries tend to be poorly designed (e.g. namespaces are simply unnecessarily painful in most XPath implementations).
Every significant project I've worked on has the same cautious loading process for XML as JSON, with some extra checking at the early XML load stage because XML is harder to work with and thus fewer people produce valid documents (forget schema validations, errors with simple character encoding, well-formedness or namespace declarations are surprisingly common). In practice, I tend to end up with a forgiving parser, a collection of selectors and full validation on the results, which works equally well with either format.
The "producing valid documents" issue is solved by running the output through the schema before, or immediately after it's saved, and throwing an error if the code has generated an invalid document.
Again, this approach does not work on projects where you can't immediately reject invalid documents. In many cases, this is unacceptable from a business standpoint and so you're forced to attempt to salvage minor conformance problems.
I don't know about you, but I don't find it difficult to just put sanity checks when reading data in case it isn't as expected. And hey, you need to validate user data anyway, schema or no.
I like JSON too, because it's lean and mean. But with a schema you can build a connection from system A to system B without having to deal with the underlying XML. However, Protocol Buffers would be a better choice then. :p
This is, and always has been, an absolutely terrible way of writing off XML. Do we judge every technology today based upon their purported original purpose? "The computer was designed to solve linear equations, therefore it is ill suited for any other purpose". Noisy nonsense.
JSON only maps to the information model of JavaScript, and has no closer ties to any other language than XML does.
JSON maps to common structures in most languages. A JSON Array is comparable to a JS Array, a C/C++ array, a C++ list, a Python list, a PHP array, a C# List, etc. A JSON Object is comparable to a JS Object, a C++ map, a Python dict, a PHP array or object, a C# Dictionary, etc.
By contrast, XML does not directly map to common data structures.
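A quick illustration of that mapping in Python, using only the standard library (the sample data is invented):

    import json
    from xml.etree import ElementTree

    # JSON deserializes straight into the language's own dict/list/scalar types.
    data = json.loads('{"name": "Ada", "tags": ["math", "computing"]}')
    print(type(data), type(data["tags"]))  # <class 'dict'> <class 'list'>

    # The equivalent XML comes back as a tree of Element nodes that still has
    # to be walked and converted into dicts/lists by hand.
    root = ElementTree.fromstring(
        "<person><name>Ada</name><tag>math</tag><tag>computing</tag></person>")
    data = {"name": root.findtext("name"),
            "tags": [t.text for t in root.findall("tag")]}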
XML has lots and lots of complexity, needed so that you can read the "underlying text" (by stripping elements) and process notation extensions. For example, you have attributes for out-of-band information, and this requires additional escaping and content normalization rules. Comparison has multiple definitions because white-space is significant, or not, depending on context. XML specification is ten times longer; it has dozens of sharp edges you have to memorize in order to use effectively.
If you believe that XML's information model maps onto "any other language" just as well as JSON, please show me native mapping/object support in XML and, conversely which programming languages (besides XSLT) have mixed content and attributes.
If you believe that XML's information model maps onto "any other language" just as well as JSON, please show me native mapping/object support in XML and, conversely which programming languages (besides XSLT) have mixed content and attributes.
You can serialize to/from XML in virtually any modern language. Not sure if you're seriously asking this, or what the limiter "native" is supposed to mean (beyond that JSON is JavaScript. But it isn't C#, and it isn't Python, and it isn't...).
How they use attributes or elements is context specific. Every feature and pattern doesn't apply to every use.
XML specification is ten times longer; it has dozens of sharp edges you have to memorize in order to use effectively.
No, you actually don't. You create an XSD of a simple layout (trivial, monkey work). That is what you serialize from/to. The universe of possibilities in XML is, again, utterly and absolutely irrelevant.
Seriously, JSON doesn't even have a DATE type. That is so fundamentally broken I don't even know where to start (Microsoft has their own mystical blend of date when they use JSON, for instance, incompatible with anything else).
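To make that concrete (Python; the "/Date(ms)/" string below is the legacy ASP.NET convention, and ISO 8601 strings are the usual workaround - neither is defined by the JSON spec itself):

    import json
    from datetime import datetime, timezone

    # json.dumps({"created": datetime.now(timezone.utc)}) raises TypeError:
    # there is no date type in JSON, so every ecosystem picks its own convention.
    iso_style = json.dumps(
        {"created": datetime(2013, 1, 9, tzinfo=timezone.utc).isoformat()})
    # -> {"created": "2013-01-09T00:00:00+00:00"}

    ms_style = '{"created": "\\/Date(1357689600000)\\/"}'  # legacy ASP.NET style, same instant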
JSON is used because Javascript, Python and Ruby (among others) understand it without needing external libraries, and because it's fairly readable for humans.
Performance has nothing to do with it. If you want performance and have some control on your stack, you're better off with something more specialized.
I don't think it's about built-in support. I believe it's about how JSON maps to common language constructs such as lists and hash tables. XML on the other hand needs to map to a complex tree structure that requires another level of abstraction.
They needed external libraries to parse it, but the information is easily represented in native data structures, is it not? That's what people are referring to.
JSON is used because Javascript, Python and Ruby (among others) understand it without needing external libraries
I don't think this is the point either: AFAIK browsers (where most of the JS is evaluated), Python and Ruby also have XML parsers in the "standard" lib (maybe more than one).
What really matters is that JavaScript, Python and Ruby programmers understand JSON without crying over the "<" and ">" present in the XML.
No, it's that standard XML libraries are over-engineered for the purposes they are used for. First, you have to pick between SAX and DOM. If you pick SAX for a typical web-app, you've screwed up. Even if you pick DOM, there's a big impedance mismatch between XML's data model and the lists or hashes that programming languages use.
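A rough sense of the difference using Python's standard library (the trivial document here is made up):

    import xml.sax
    from xml.dom import minidom

    # DOM: the whole document materialized as a tree; convenient, memory-hungry.
    doc = minidom.parseString("<post><title>Hi</title></post>")
    title = doc.getElementsByTagName("title")[0].firstChild.data

    # SAX: streaming callbacks; cheap on memory, but you carry the state yourself.
    class TitleHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.in_title, self.title = False, ""
        def startElement(self, name, attrs):
            self.in_title = (name == "title")
        def characters(self, content):
            if self.in_title:
                self.title += content
        def endElement(self, name):
            if name == "title":
                self.in_title = False

    handler = TitleHandler()
    xml.sax.parseString(b"<post><title>Hi</title></post>", handler)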
I agree there's an annoying mismatch between how XML treats data and how almost any programming language does. Without an XML Schema saying otherwise (and, yuck), XML supports arbitrarily interleaved sequences of mixed elements, which can't be mapped cleanly to data structures in common programming languages without extra meta-data being supplied somehow.
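For example, a fragment like the following is perfectly legal XML in the absence of a schema constraining it (invented to match the discussion below):

    <container>
      <a>1</a>
      <a>2</a>
      <b>x</b>
      <a>3</a>
    </container>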
You can automatically map the first sequence of a's to an "aList" collection in your programming language, but then what is the automatic mapping of the one or more a's that follow the b element? Should it be a list or a single item? Or is it semantics-preserving to just combine it with the previous collection? If it should be mapped to a collection, what would that collection's name be in the target language, so that it doesn't clash with the previous collection's name? Etc. Formats like Json avoid this problem by only supporting sequences as explicitly named collections, like most programming languages do within objects or structures.
IOW with XML you can't really do a proper mapping without an accompanying schema or other metadata, which is not true for the Json data.
This doesn't look even remotely different from the behaviour of "almost any programming language".
If you have a set of people, do you separate them out into a "manList" and a "womanList", and then panic about having to create a "secondManList" because some men came in after the women? If order is only important within gender, then you can happily just add the men onto the end of the original manList. If it is important across all people, then you have to have a "personList".
Just like you might have a "personList" that can contain "male people" and "female people", you can automatically map the whole sequence of "a elements" and "b elements" to an "elementList".
In the case of mixed content (text nodes and element nodes together), you would simply map them all to a "nodeList".
I think you're making my point, with your inclusions of "if"'s ("If order is only important within gender") that a machine would not and generally cannot know, without understanding the semantics of the data to begin with. That's a disadvantage because that's precisely the sort of "extra metadata" that is needed to complete the XML mapping in the general case that I was talking about - it requires human intervention and therefore a custom one-off mapping, or else at least something like an XML schema [edit: more like JAXB annotations]. The Json doesn't require either of these.
Also, do you really expect an automated process to do the abstraction from "female" list and "male" list to person list? That's more than just metadata that you're assuming to be in context (which was my point), but actual AI! This is very different from programming languages, where collections of things which occur within objects and structs are given names which reflect their intended meanings, like with Json.
But it's still no different to any other language. An XML element has two collection members, one representing the attributes, and another representing the child nodes.
How is this different from a Java object which contains two list members called "attributes" and "children"?
How is this different from an S-Expression containing two lists?
How is this different from a json object like {"attributes":[...], "children":[...]}? Bear in mind that there is no requirement for JSON lists to be homogeneous. {"things":[1,true,"hello",3,{"addressee":"world"},[{"greeting":"Hola"},7],false]} is a perfectly valid JSON object. You don't have to define it as {"numbers":[1,3], "strings":["hello"] ...}.
It is a pretty common behaviour that if you want a homogeneous list of things that differ, then you make abstractions until the differences disappear, e.g. in an OO situation, you go up the inheritance tree until you are at the lowest common base class. In an XML situation, that common base class is "Node".
Even in a strongly typed language that requires homogeneity in lists, the only thing you know about the list members is that they can be cast to the same type, not that the members are of that type and no other, and certainly not that they all have the same name. Consider a C++ array of CFruit objects. It may have a member that is of class CBanana, one that is of class CApple, and another of class COrange. If you want to do something COrange-specific with the oranges, then you have to perform dynamic_cast<COrange> on any member of that list you suspect of being an orange. The same is true in a duck-typing situation.
The reason for my male & female example is that you would normally simply have a list of people. Of course, if your model contains no base that is common to both men and women, then you can't expect the machine to work that out, but if Man and Woman both inherit from Person, or if there is no Man or Woman, just Person with a member that specifies a gender, then it's trivial.
"If you pick SAX for a typical web-app, you've screwed up."
Way too glib. There's lots of good reasons you may choose to go with that, and there's lots of reasons why you may not. DOM-based approaches pay a lot of resources for their functionality. If you don't use all of that functionality, you may lose over a SAX-based approach. If you're resource rich, hey, great, go with DOM unconditionally, but not all web apps are in that situation.
It's one advantage the XML ecosystem has over the JSON ecosystem; while JSON can be parsed in a streaming manner, you can generally count on a good XML event-based library for your environment, whereas streaming JSON parsing libraries seem more unusual. If you need it, you can count on it in XML.
Generally, SAX is only there because you have massive documents to process, or very constrained resources. Of course, you're right to say SAX has a place, and I did really oversimplify things. If you're using Ruby or Django though, and not processing large documents, I doubt switching to SAX would be a low-hanging fruit.
That's just my opinion, but even if JSON parsing performance were a few percent worse than XML parsing, I would still try to use JSON whenever it makes logical sense.
(web services - JSON; configuration files - JSON for simpler files, XML for more complicated ones AND only if JSON couldn't handle it).
Also, I might be missing something but it appears that this benchmark doesn't take into consideration libraries used server-side to serialize and deserialize JSON / XML? I'd say that depending on those, there can be some big differences.
Not to mention the fact that a bigger impact on how "snappy" an application feels lies in the overall application design, rather than in choosing whether we use JSON / XML / YAML / anything else to transmit the data.
JSON is great for data interchange. I think better than XML because its simpler structure encourages simpler interchanges than XML.
For me, a disadvantage in JSON for configuration files is the lack of comments. My current project uses JSON for config and I sorely miss 1) comments and 2) unquoted config var names. I'd rather use YAML or another format with a config loader for whichever language I'm using.
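A tiny illustration of the pain point, assuming the third-party PyYAML package (the config keys are invented):

    import json
    import yaml  # PyYAML

    yaml_cfg = ("# retry settings for the upstream API\n"
                "retries: 3  # comments survive in YAML\n"
                "timeout_seconds: 30\n")
    print(yaml.safe_load(yaml_cfg))  # {'retries': 3, 'timeout_seconds': 30}

    # The JSON equivalent has nowhere to put those comments:
    json_cfg = '{"retries": 3, "timeout_seconds": 30}'
    print(json.loads(json_cfg))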
Warning! Highly unscientific, non realistic test ahead!
Running on a baremetal, idle server..
python -m timeit -r 10 -n 100000 -s 'import json' 'j="""[{"t":"Hello"}]"""; json.loads(j)[0]["t"]'
100000 loops, best of 10: 9.07 usec per loop
python -m timeit -r 10 -n 100000 -s 'from xml.dom import minidom' 'xml = "<t>Hello</t>"; minidom.parseString(xml).getElementsByTagName("t")'
100000 loops, best of 10: 74.9 usec per loop
python -m timeit -r 10 -n 100000 -s 'from xml.dom import minidom' 'xml = "<t>Hello</t>"; minidom.parseString(xml).firstChild.firstChild.wholeText'
100000 loops, best of 10: 76.6 usec per loop
Given that the "study" comes from a "markup conference" that has XML all over its website and given the fact that XML documents (uncompressed, in memory) are much larger and that XML is a much much more complex standard, i seriously doubt the conclusion that XML and JSON are "almost the same" in terms of speed and memory usage.
Or simply put: How is it technically even feasible that an XML document that, by its nature, is much more complex (and thus i imagine a XML parser to be much more complex) can be parsed as fast as JSON with more or less the same memory footprint?
Wouldn't that only suggest that the JSON parser that is tested is just not as optimized as the XML parser?
I can imagine that a web browser's XML parser is much more optimized and mature than its JSON parser, given that it's a browser that mainly needs to parse HTML/XML?
Does it mean I should switch to using XML instead of JSON in command-line tools?
Or to put it another way:
The headline "XML Can Give the Same Performance as JSON" is probably true for browsers which have had years in improving the XML parser. But i don't think this can be a general conclusion.
Given that the "study" comes from a "markup conference" that has XML all over its website
So? The overwhelming majority of JSON advocacy comes from developers who happen to know JavaScript, and thus JSON appeals to them. Virtually everyone speaks from the position of self-interest.
Yes, a bit of a cheap shot, that ... and the guy works at a company that makes MarkLogic Server, which spits out XML ... but it also spits out JSON (all at very large scale, I may add), so choose your poison.
Until someone replies with the same experimental rigor to refute the findings, I think the observations this paper makes stand ... that's how peer-reviewed journals work.
The paper is not saying 'XML is better' or even 'XML is faster' ... it's just addressing the perception that XML is slow in certain scenarios, which has become a default myth.
JSON has been accepted as a datatype in the markup conferences of the world; it's great for data transfer. XML is a compromise on many different levels but tends to be good for mixed content and documents. I think we've all moved on.
"Use HTTP Compression which most often is the single most important factor in total performance."
How does this advice stack up against the recent security issues with compressed HTTP traffic? Is this article's recommendation at the same place in the transmission stack where this would cause trouble?
"How does this advice stack up against the recent security issues with compressed HTTP traffic?"
It doesn't. The recent issues with HTTPS suggest there may be a fundamental tension there between performance and security. In fact the recent issues don't particularly care "where" in the HTTPS connection the compression occurs, it just has to be inside of it. It won't matter whether you use standard HTTP compression or roll your own (which on the TCP socket won't look all that much different anyhow, you'll just be giving up browser support for automatic decompression).
So his two JSON test cases are eval and jQuery. He does not use JSON.parse.
So this shows that with a lot of hand waving around cases no-one likely cares about, XML is almost as performant as JSON, even if the code is way, way uglier.
In an actual real project where you're probably passing lots of fiddly objects around we have no results.
I'm going to say something bold here: JSON people often don't know what schema-based interchange formats are, and why they are fast - such as Protocol Buffers, or Facebook/Apache Thrift.
Speed: use Thrift or Protocol Buffers.
Ease of implementation: JSON.
XML has the best of both worlds, and therefore it is most of the time not suited to either. However, the fact that it supports schemas (XML Schema), document transformations (XSLT/XQuery) and query mechanisms (XPath/XQuery) makes it a format very well suited for big enterprises, where specification is important.
This is why clunky protocols such as SOAP are built on top of XML. It is future-proof (extensibility is more difficult in JSON). It is schema-based (parsing, validation and language binding are easier).
JSON, for example, doesn't support references between nodes. You can build that in, but it's not standard.
JSON is the easy-peasy solution, the quick win, the fast-enough one, and therefore the winner. However, XML is the big beast that has it all, and is therefore complex. But that doesn't mean it sucks.
It really all depends on the use case and your target end-points. JSON is great when the data is transmitted to and from web browsers because of JavaScript. Nearly every major framework has a really good XML library. Readability of the payload is usually not a concern, at least my experience, because only the software has to deal with it and not a human. Also in my experience the size difference between a JSON payload and an XML payload is not a big deal (except perhaps across slow network links where every byte counts, but there are more succinct and faster binary serialization formats available).
One thing I haven't seen mentioned in this is the size difference between XML and JSON. At the end of the day that is the main reason I prefer JSON, it is much more compact when storing it and sending it across the wire. Anyone who has ever worked in a system where the canonical data representation was XML will know the horror of seeing a nice 500k file balloon to over 10 MEG just because it is now in XML...
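A crude, made-up illustration of the overhead in Python (the record and the XML shape are invented; real-world ratios vary with structure and compression):

    import json

    record = {"id": 42, "name": "Ada Lovelace", "email": "ada@example.com"}

    as_json = json.dumps(record)
    as_xml = ("<record><id>42</id><name>Ada Lovelace</name>"
              "<email>ada@example.com</email></record>")

    print(len(as_json), len(as_xml))  # the XML rendering is noticeably longer for the same data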
Back in uni there was a contest for an efficient XML parser from some company elsewhere in Europe. The problem was efficient parsing of XML files around 10 GiB and more – apparently the European standard data format for bank transactions (or the log of them, I don't remember precisely) is XML and for the Central Bank such large files were not uncommon.
Makes me really wonder why they went with that format.
The difference in speed of XML and JSON is not going to be a bottleneck to most applications. Of course, the small difference in parsing speed I'm sure matters a lot to some people.
The major advantage of JSON is a readable syntax. XML tends to be overly verbose, and therefore not as easy to read.
JSON is more compact too, which matters when sending 1000s of objects across the wire.
If you care about performance, you would use neither format.
Developers choose json over xml because using it and dealing with it is more lightweight than dealing with the very baroque xml. The fact that the typical json payload is 1/2 the size of xml is just a bonus.
This XML vs JSON debate is getting tiresome, but as someone focused more on getting things done than on optimizing processes, here are my rules of thumb:
For data transmitters: lean towards JSON unless your data structures are deeply nested.
For data receivers: be prepared to handle both (even in cases where JSON clearly makes more sense), as large orgs tend to lean towards XML. When an API offers both serialization options, choose the better one so that, through log analysis, you can nudge the producer towards the optimal solution.