I’m just going to argue the exact opposite of the article: xml and json are both structured data formats useful for tree-like data structures, such as objects.
Whether that was the intended purpose when xml was designed is irrelevant. It’s what xml is used for in almost every case.
The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s. Json really hasn’t been an alternative until quite recently (10 years ago?).
In fact reading the article carefully I fail to see the author argue why xml shouldn’t be used as a data format either.
The intended purpose is relevant because it tells us the conditions under which something is likely to work well.
XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then. (It also doesn't make the people who overused it somehow bad; I think it was a reasonable and necessary mistake.) It certainly doesn't make new ones correct now, given that JSON's been around 16+ years. [1]
I think the author here is a little too strong in his criticism; I think XML is great for things that are meant to be long-lived and self-documenting. That is, things that are used like documents. But if I'm passing short-lived globs of structured data back and forth, as with an API, I think JSON's a much better fit, as are Protobufs for more tightly coupled code.
> The intended purpose is relevant because it tells us the conditions under which something is likely to work well.
Not really. Plenty of things suck at their original intended purpose and remain in use because they are very good for some other purpose. (Viagra is an example well-known to popular culture, but hardly unique.)
> XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then.
XML is perhaps not abstractly ideal for many of the purposes it has been used for, but in many cases it was superior to other alternatives for practical reasons, particularly the tooling ecosystem. (JSON is the new XML, and virtually the same thing can be said for JSON in many of its current uses, though it does clean up XML's two biggest warts, the element/attribute distinction and verbosity. Even among human-readable formats, YAML addresses the latter better than JSON while being easier to read, not to mention all the binary options when readability isn't a concern.)
Yes, really. I agree there are exceptions, which is why I said "likely". But by and large, fitness for purpose correlates with design intent (which generally involves a significant period of iterative use, further driving fitness for purpose).
Sure, but the enterprise sphere is mostly what Moore called late majority and laggards. The biggest driver of use is not effectiveness, but perceived safety. A good proof of that is your choice of word here: crazy. It's not that JSON would have been somehow technically wrong; it was just socially wrong.
I'd argue that for data storage purposes, you'd like to have a low metadata to information ratio. In the examples the author gives this seems to be the main problem, with way more characters being used for markup than for content.
Compare that to JSON or TOML, which are more human-friendly and waste fewer bytes on structure to convey the same information. When used for data storage, two XML files of the same schema describing two completely different objects are likely to share a large amount of content, which is wasteful and gets in the way.
For storage of structured data (and probably even for loosely coupled RPC) you want a format that is efficient and schema-oblivious. The bad choice 15 years ago was XML; the bad choice today is JSON (the parsing overhead is not negligible) or ProtoBufs (not schema-oblivious). Various binary formats with a JSON-like object model seem like the way to go (my choice is CBOR).
And then there is the EU-wide absurdity of WhateverAdES, which invariably leads to onion-like layers of XML in ASN.1 encoded as base64 in XML wrapped in a CMS DER-encoded message...
I beg to differ. For a start, XML compresses well and besides, storage is monster cheap these days. XML is a better storage format because it documents what the data is (a title, a reference etc) as well as the data itself.
XML does compress well as text or over the wire, but the parse trees can be quite large in memory and processing cost. At least in Perl I've had enough scripts crash out due to this overhead when implementing the common/naive solution using off-the-shelf modules. You can get around this by choosing between DOM and SAX, but I consider that a symptom of the problem: you choose XML to solve a problem and now you have another problem to solve.
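For what it's worth, the streaming workaround isn't much code; a rough Python sketch (the tag name and handler are made up):

import xml.etree.ElementTree as ET

# Stream the file instead of building the whole tree in memory:
# handle each <record> element as it closes, then discard it.
for event, elem in ET.iterparse("big.xml", events=("end",)):
    if elem.tag == "record":
        process(elem)   # hypothetical per-record handler
        elem.clear()    # free the subtree so memory stays roughly flat

Of course, needing to know this trick at all rather proves the point about XML handing you a second problem.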
I had the same problem with npm, I think, and JSON, because npm could not simply load the huge JSON file into memory. A huge anything can crash a naively written tool used to handle smaller instances.
That's true but I think XML has the edge there. It has so many features like defining new types which you wouldn't normally see in JSON. One parser we used had a ten to one ratio - 50MB of XML meant 500MB of RAM usage when using a DOM parser. And that's taking into account the textual representation of XML is already >50% bloat with the closing tags etc.
Well, XML is a markup language (and is really good at being that) while JSON is not. Sure, XML can be used as a poor man's data storage, as a base for a DSL, etc., but almost always there are better choices.
What are the better choices? And what were the better choices on the major platforms 10 years ago, such that we wouldn't now see every app using xml config files/DSLs/storage?
I use csv when applicable. I use protobufs when applicable. But for the typical use case I choose xml when it's some config/dsl/dataset that needs to be human-editable (support comments, for example), has a more complex structure than csv supports, and preferably doesn't need an external library or a custom parser. Json, Csv, Toml, S-Expressions, and protobufs all fail one or more of these requirements. I'm sure there are others but none that don't have at least one drawback I don't want.
A poor man's data storage is exactly what I want!
> preferably not need an external library or a custom parser. Json, Csv, Toml, S-Expressions, protobufs all fail one of more of these requirements.
And XML doesn't? Quite a few (not all, but quite a few nonetheless) programming languages include zero support for reading or writing XML-formatted data without using an external library or custom parser. This includes nearly all languages that predate XML, and quite a few languages that postdate it. Even when a language does have built-in (or at least in the standard library) support for XML, it's almost always a royal pain to use, especially once namespaces and schemas are involved.
Once upon a time, though, the answer was (and in a lot of places still is) INI:
- It's human-editable and supports comments
- It supports more complex structure than CSV
- Some languages have built-in support for it, and the Windows and GLib APIs support it, too (well, something similar enough to be compatible, in the latter case)
INI falls flat when you need to express deeper levels of nesting than keys and sections, though.
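For example, with nothing but the Python standard library (a sketch; section and key names invented):

import configparser

cp = configparser.ConfigParser()
cp.read_string("""
; comments are allowed
[database]
host = localhost
port = 5432
""")
print(cp["database"]["port"])  # -> '5432' (every value comes back as a string)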
There's also YAML, which meets all your criteria about as well as XML does (at least on average; your specific language/platform might favor one or the other).
Right. Xml is to .NET what ini was for MFC. It’s what the platform “makes” you use. The same is true for json on js of course.
On a platform that has almost no support out of the box (e.g python) the choice is open. But on a platform that has a couple of formats built in, picking a format outside that platform is a pretty big step. The return needs to be substantial for a .net developer to use yaml via an external library over xml.
My reasoning in this thread has always started from the perspective that xml comes built in and almost no other format does. This is the case for e.g java and .net but not for python or C for example. But the prevalence of xml comes from java/.net so if we are to ask why, then we should consider that.
It seems like the "type: External" should line up with "metric" and "target" but no, it needs to line up with the word "external" - not the dash, but the word after a space after the dash. Using YAML frequently reminds me of the quote "Be open minded, but not so open minded that your brains fall out".
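For anyone who hasn't hit it: keys inside a list item have to line up with the first key after the dash, not with the dash itself. Roughly (field names loosely follow the Kubernetes HPA spec, so treat this as a sketch):

metrics:
  - type: External          # the first key sets the column
    external:               # must start in the same column as "type"
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: 30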
Schemas are awesome for so many cases. Even though JSON Schema is a giant evolving clusterfuck, I still use it to be able to enforce some consistency.
And, being honest, JSON schema is better than, say, GPB or Avro schema at enforcing field relationships, e.g., "if typeId is 7, then partnerId cannot be null"
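For the curious, with draft-07 conditionals that constraint comes out roughly like this (property names taken from the example above):

{
  "type": "object",
  "properties": {
    "typeId":    { "type": "integer" },
    "partnerId": { "type": ["string", "null"] }
  },
  "if":   { "properties": { "typeId": { "const": 7 } }, "required": ["typeId"] },
  "then": { "properties": { "partnerId": { "type": "string" } }, "required": ["partnerId"] }
}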
You aren't responding to the comment here, you are just reasserting the article's position. I'd argue there is not another format that is obviously better for every data storage or exchange use case, or that surpasses all of XML's benefits while minimizing all of its downsides. I don't want to look at XML, but I do understand why it is being used.
Abuse of XML killed it as a format. JSON is absolutely shit for semantic markup, and yet developers today routinely use it for documents because "XML is bad". They contrive ridiculous schemes for adding metadata and type information. They use it to generate HTML even when HTML takes less space. Finally, we regressed from XHTML to HTML5. Bye-bye namespaces and parsing consistency.
> They use it to generate HTML even when HTML takes less space.
The fact that MobileDoc exists makes me physically ill. Something that can be expressed with one line containing a paragraph element and an italic tag is over a dozen lines of JSON spam.
Right, but, if given a choice of what to use, between XML and JSON, I'll pick JSON every time.
XML is a complete mess. Have you SEEN its spec?
You can put JSON's spec on a single page. XML's spec, not so much. Hell, most XML parsers don't support the full spec, and the ones which do have historically been riddled with security holes.
JSON over XML was simplicity over a crazy spec built by a bunch of companies all wanting to shove their own craziness into it.
The XML spec is longer than one page, granted, but it's about three times shorter than the YAML spec. And the XML spec describes not only the XML syntax, but also a basic form of validation (DTD), which includes references. Basic XML has only five special symbols (<, >, ', ", and &) and can be parsed in linear time. (Namespaces complicate things somewhat.)
That's because json's spec isn't complete. It is predicated on the language interpreting it to be able to just eval the structure and work with the data [1]
For a non human-edited data storage or exchange that’s fine. Json is worse for human editable data though. Xml might not be the best alternative there but it beats json for things like small configs.
It’s not as simple as saying “everywhere xml is used, json would be a better choice”.
Your data storage format shouldn't be human readable. The data transferred over the wire shouldn't be human readable. It should be a binary serialized data format, probably encrypted, definitely compressed.
Yes, storage is cheap. Bandwidth is not. Also, you really don't want a human that intercepts your data to be able to read it. Additionally, your data structure only makes sense in the context of your domain, which usually has been modeled in your program(s) that work in that domain, and thus it will be better if you deserialize it within tools that understand that domain.
If you feel the need for a general purpose deserialization protocol, there are several available - Avro/Protobuf, etc.
Binary encoded data can often be decoded without consuming the entire document. SAX-parser-like reading requires at least reading an opening and a closing tag before the data is useful.
String serialization is a wasteful endeavor. It makes life easy for devs because it takes one less step to read the data in a text editor or log message, but quite often requires hacks to model things like recursive or self-referential data structures, and wastes space by repeating property names constantly for every item within the serialized structure. It's predicated on four of the fallacies of distributed computing, namely - bandwidth is infinite, the network is secure, transport cost is zero, and latency is zero. It is a solution looking for a problem, and because we are lazy, we don't build tools that would make binary serialized formats just as easy to use as json/yaml/xml.
Markup is a mix of scalar and structured data (it's a structure discovered or associated with a scalar) and thus it contains everything it needs to express structured data alone: just remove the scalar. E.g.
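(A sketch of an invoice; element and attribute names are just illustrative):

<invoice id="1042" date="2019-06-01" customer="ACME Corp">
  <item sku="A-100" qty="2" price="19.99"/>
  <item sku="B-205" qty="1" price="5.50"/>
</invoice>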
Is this really a poor man's choice? And compared to JSON?! I can see at least the following advantages here:
1. Each element has explicit type name (invoice, item). JSON is "typeless", which simply means the type information travels out of band. And with XML namespaces these type names can be made globally unique, but still stay human readable.
2. Each element is self-contained, the code that produces the <item> doesn't need to know if there was an item before or after it so that it should add a separator. (The dangling comma problem in JSON.)
3. The attribute names are not just arbitrary strings as in JSON, there are strict rules of what can be in the name. They're much more suited for structured data than JSON, where you can name an attribute "foo.bar" and some JSON readers that accept a JSON "path" won't be able to find it.
4. It has less visual noise than JSON because the attribute names don't have quotes around them and you don't need to separate elements with a special symbol. Despite the common belief, well-written XML is more readable than JSON.
And we haven't even touched things like validation + extended types, references, and transformation of data.
Yet every time I've stumbled upon XML in the past decade or so, it's been used as a data format because it's easy to manage and supported by every platform/tool out there. But sure, let's switch over to JSON or use a SQL database because we can't deal with the fact that XML might be well suited for something it wasn't originally designed for.
It doesn’t answer the question, but I do wonder if XML would be an improvement in devops, compared to the current obsession with YAML. For everything except the part where you write it.
Make an xml stylesheet and your kubernetes cluster is instantly documented.
Maybe if you are using Notepad. Any decent text editor will provide things like auto indentation, completion, auto end-tags, structured editing, and schema validation. For example, Emacs comes with nXML mode:
That assumes that you edit XML all day long. This is not always the case.
I am writing non-XML code most of the day, and I do not have structured editing / auto-braces enabled. So when I need to edit that one XML config, I'll open it in my regular editor, which will provide at most syntax highlighting, and edit it as needed with a bit of swearing. And next time, I'll promise myself I'd choose a different config format which doesn't need special editors.
Which "same thing"? Not setting up editor and complex environment for the things I am only going to edit once or twice? Yes.
In general, when you see something inefficient, you can either fix it to make it better, or ignore and come up with random workarounds.
In my opinion, a config file which cannot be edited by hand, and which needs a special editor with a non-trivial learning curve, is an inefficiency. I can either ignore it and set up the specialized tools; or I can fix it by ripping out XML and replacing it with something more human-editable, like TOML or YAML. In large teams, it is almost always better to fix it -- sure, I will spend a few hours getting rid of XML, but this will pay itself off in the long term, as no one else will have to bother with special setup anymore.
(This obviously only applies to systems where XML is a minor part, like a single configuration file. If your system has a huge amount of XML, you'd better learn the right tools.)
I don't understand, what is there to set up? With the Emacs mode I gave as an example, you just open an XML file and everything is there. Any decent text editor will have XML support.
xmllint --noout <file> will check the file and report any issues with XML syntax in a very detailed way with line numbers for you to see.
I myself don't even use syntax highlighting and normally work in vim and although I do make errors in XML sometimes, I find that I make at least as many syntactic errors in Python or C code that I have to weed out before I can proceed. But I never heard anyone complaining about Python or C being too strict :)
They all are, but xml isn’t the worst. (Json and S-expressions are worse, for example).
Not even formats designed for human consumption such as yaml are very good. The good ones for editing (toml, csv, ini) fall short when it comes to complex structure instead. There is no silver bullet.
> The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s.
The example seems pretty weird. If it's just data there shouldn't be any quotes at all. The only purpose of quoting is to prevent evaluation, so I guess the idea must be that `log` is a function call to be evaluated, but then the example isn't even the same thing as the XML version (and there are still nested quotes inside the already quoted record).
You'll want XML as soon as it exceeds one screenful :) And this XML is, well, "misused"; it should be something like this:
<log>
  <record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!">
    <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30" />
  </record>
</log>
I also don't object to squeezing the exception attributes into the record with some prefix that makes the names unique, like this:
<record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!" exc-message="java.lang.Exception" exc-class="logtest" exc-method="main" exc-line="30" />
Or, if it's possible to normalize these data, factor out the exception and other attributes and build an index of them so that they can be referenced by an ID:
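Something like this, say (a sketch of what I mean):

<exceptions>
  <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30"/>
</exceptions>
<record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!" exception="123"/>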
Frankly, I’d rather some form of compressed binary logs, but sure, either works if it includes a stack trace and I need it at the time! Text logs compress well anyway, they just grow larger if/when you process them later... I’m just happy to see objects as log entries instead of plaintext strings. ;)
The xml, definitely. Without lisp-specific tooling that colours/balances parens etc., writing 5 consecutive closing parens (and not 4 or 6) is a headache. Also I probably couldn't choose it on its own merits - I choose what the non-programmers who edit the file can use (and they sure don't have any other editor than notepad).
Then again: I’d usually only ever choose between formats with support already on the platform/standard library, if it was just some config or small data file. If the data is core to the product then of course it might be reasonable to include a new library or even write a parser. I’m talking java and .net now mainly.
> what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s
I think you mean s-expressions instead of lisp, and if you do, I'm with you. They are really neat and underused. Now, if you want to carry a whole language runtime for the structured data format, you'll probably end up with the "which lisp" kind of problem, in which each app ends up using something different and then the structured data files are no longer exactly portable (until you devise a portable standard yadayadayada)
I have to admit, once you've seen S-expressions, XML looks insanely bloated and verbose.
The advantage XML would have in that situation is that, because it's so verbose, if you get a malformed XML, you can eyeball-parse it and often figure out how to hand-edit it to make it valid. If it is valid, you can also see exactly how the schema you were sent differs from the schema you expected. An S-expression, having less redundancy, also is potentially more brittle.
If the malformation is systematic (like someone printf'd or typed in bad xmls), then the same is true for any human-readable format, since you know the data and what it should look like. If not (like a randomly broken packet), then you're solving the problem at the wrong level. By-hand error-correcting verbosity is hardly a selling point of the application level protocol.
> By-hand error-correcting verbosity is hardly a selling point of the application level protocol.
It's a selling point while you're trying to figure out how to get it working. Once you have it working, it's not - but by then, you have it working, so why change it?
> Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
> ASN.1 is a joint standard of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and ISO/IEC, originally defined in 1984...
I think ASN.1 was designed for a world with (a) waterfall development and (b) static deployments - ie no over the air updates. Under the circumstances, messing up is simply not an option - hence defining the standard so clearly and for so many use cases.
Today, of course, we treat the entirety of our deployed infrastructure as 'merely' a platform to write code. And not only are experimentation and failure OK, they're positively encouraged. Velocity became important.
You can meaningfully decode any DER encoded ASN.1 structure and serialize it back without any knowledge of the schema. Somewhat surprisingly you cannot do that with all instances of XML documents.
The first thing wrong is that you have to serialize and deserialize it. Operationally, that's inconvenient, and it shows that they're optimizing for network bandwidth. But these days, squeezing the most out of each bit is, in most cases, not a defensible design decision.
Then, once you deserialize it, it's still a printable version of ASN.1. Sure, it's unambiguous, rigidly defined, and standardized. It's still gouge-your-eyeballs-out horrible to try to do anything with.
Say you get an XML message over the wire with a bit flipped. If you look at it, you have a good chance to be able to figure out what went wrong, edit one character, and you can now process it. If you get an ASN.1 message in the same condition, it's pretty much game over (though there may be special tools that could save you).
Say you get an XML, and you don't know the schema. You look at it, and you can see what's going on. You get an ASN.1 where you don't know the schema, and you can be totally sunk. (If I recall correctly, in ASN.1, you can have schemas that are private, that is, not specified in the standard.)
XML (and JSON) have the advantage of names, which makes it slightly easier when it comes to querying (and indeed building indexes) over lots of data. I'd be amazed if there wasn't tons of work on this for S-expressions, but I can imagine it's slightly clunky.
What do you mean by names here? S-expressions have symbols, which serve the exact same purpose as what I think you might mean: an interned string value which can be cheaply used more than once.
Not saying you can't specify some schema here, but there's nothing native to S-expressions that makes it quite as transparent and simple to specify a path into a data structure.
All you need on .NET/Java in 2007 then is a lisp parser/interpreter. I know they are easy to write, but they aren't as easy to write as something you don't have to write at all.
Let's not forget, many of these configuration files and data formats are one-off hacks that were meant to be replaced by a real format, a real parser, a real DSL etc.
The reason the xml config/dsl/format stuck is because it worked. And it was cheap and easy.
S-expressions have been around since the 1960s. McCarthy was already proposing S-expressions for what XML would become, in 1975. Stop whining and finding excuses, and start using s-expressions.
I'm not trying to pick the best format for the job, I'm usually trying to pick the least bad one that's available IN the platform/standard library I'm using.
So, what is XML good for? If it's not good for data as everyone says (and I'm not inclined to argue), but it is good for documents, what kind of documents are we referring to? A defined metadata on a text document? A template used with data to generate something else? Is a configuration file a document or data? Where would I want to use XML that something like JSON, a text document, or some combination thereof wouldn't be better?
I'm not being facetious, this is an honest question. Where are the "right" places to use XML?
Tim Bray, co-editor of the XML spec, writing in 2006 on the topic:
> Use JSON: Seems easy to me; if you want to serialize a data structure that’s not too text-heavy and all you want is for the receiver to get the same data structure with minimal effort, and you trust the other end to get the i18n right, JSON is hunky-dory.
> Use XML: If you want to provide general-purpose data that the receiver might want to do unforeseen weird and crazy things with, or if you want to be really paranoid and picky about i18n, or if what you’re sending is more like a document than a struct, or if the order of the data matters, or if the data is potentially long-lived (as in, more than seconds) XML is the way to go.
There are points of disagreement between me and the author, although I wouldn't get too passionate about them.
Super-short version, reading over it again, is that XML is very good at what it does, but it really ought to be seen as a relatively specialized data format. It's really good at certain tasks, best-of-breed for a couple of them, and degrades rapidly as you get away from that. JSON is a fairly cheap & fast general-purpose format that's OK at a lot of things, isn't necessarily great at much, but as you get into more specialized use cases, also tends to degrade. Being a general-purpose format, perhaps arguably it degrades more "slowly", but it does degrade.
Properly understood, IMHO, their use cases don't overlap much if at all, and the combination of them may cover a lot of space, but are still far, far from the only serialization formats you'll ever need.
Just the sort of thing you would think of as 'documents'--the texts of books, manuscripts, and the like, where structure may be somewhat arbitrary. For instance, I work with a few different text corpuses--one of which is an actual dictionary, with entries, definitions, usage examples, etymological information, and bibliographic references. Another is a collection of poetry manuscripts, with annotations for line breaks and editorial emendations, both from the author and other editors (i.e, places in the manuscript with crossouts, interlineal notes, marginal notes, etc).
I mean, in theory, you could do this in JSON or some other data structure. But you would go insane and be shooting yourself in the head before long.
> you could do this in JSON or some other data structure
I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Very simple book</title>
  <chapter xml:id="chapter_1">
    <title>Chapter 1</title>
    <para>Hello world!</para>
    <img src="hello.jpg"/>
    <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
  </chapter>
  <chapter xml:id="chapter_2">
    <title>Chapter 2</title>
    <para>Hello again, world!</para>
  </chapter>
</book>
Would you make each <chapter> into an object? But you have 2 <para> children in there with an <img> in between. And one <para> has an additional <emphasis> in the content. I can't think of a good JSON schema equivalent to this.
[
  "this text needs more ",
  {"type": "emphasis",
   "children": ["emotion"]},
  "!"
]
hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.
if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
which is probably the only way to properly deal with markup and especially commented sections that can span over paragraph starts/ends - neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can do that, especially if humans still want to reasonably be able to view or edit it...
(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
That is what essentially every WYSIWYG text processor does. And also the reason why getting sane HTML out of text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other or contain completely unnecessary formatting information.
But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.
Certainly. I wasn't suggesting that JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.
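If it helps, a fuller (still unpleasant) sketch of the whole sample book under that kind of scheme, with titles pulled into attributes as suggested above (node names are mine):

{"type": "book", "id": "simple_book", "title": "Very simple book", "children": [
  {"type": "chapter", "id": "chapter_1", "title": "Chapter 1", "children": [
    {"type": "para", "children": ["Hello world!"]},
    {"type": "img", "src": "hello.jpg"},
    {"type": "para", "children": [
      "I hope that your day is proceeding ",
      {"type": "emphasis", "children": ["splendidly"]},
      "!"]}
  ]},
  {"type": "chapter", "id": "chapter_2", "title": "Chapter 2", "children": [
    {"type": "para", "children": ["Hello again, world!"]}
  ]}
]}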
I absolutely agree. Where XML shines is when you could take just the text content - strip out all the markup elements and attributes, document, comments, etc - and have a text document that still makes some sort of sense.
This AFAICT was actually why SVG has a few bizarre choices - such as putting all the drawing commands into attributes. A browser that doesn't understand an embedded SVG document in its HTML would be left with just the text contents.
I've been getting along fine using JSON for pretty much everything. That being said XML has some very sophisticated features like rigorous schema definition, a query language, a formal include syntax, comments (that's a big one), it's a lot easier to do multi-line content and in fact you can mix normal text and structured data.
The include syntax doesn't get enough love. It's crazy that JSON doesn't support it.
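For reference, XInclude is about this much ceremony (file names invented):

<config xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="database.xml"/>
  <xi:include href="logging.xml"/>
</config>

An XInclude-aware processor splices the referenced documents in before your application ever sees them.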
The issue with all the XML sophistication is that essentially the only environment where all of that really works is when you use XML as a markup language for technical publishing (ie. DocBook, DITA, ...), and in fact as a less convenient but more modern and cool SGML replacement.
Random applications that read XML just aren't going to implement fully validating parsers, because it is a lot of completely unnecessary work. Also, the pattern of storing everything in attributes mentioned in the article mostly comes from the fact that working with CDATA nodes in XML is a major PITA wrt. whitespace handling and coalescing adjacent nodes.
Think of semi-structured documents, where you have a list of pre-defined sections. I've seen it in use by insurance companies for case reports, and in real estate for appraisals. And of course we've all seen it work well in the form of HTML. There's some structure to all of these examples, but mostly for annotating text sections. You need some flexibility built into the schema to add fields as needed, but you're not dealing with various map / list / primitive data types as a matter of routine. Just making this one up, but if LaTeX wasn't already the standard, I'd also use it if I was digitizing the content of academic papers, for instance. You have a header with metadata, abstract, the body, citations. There's some structure, a need to add some metadata, perhaps flexibly over time, but mostly it's just a document.
What is the exact relationship? The history as I remember it is:
1. SGML (Standard Generalised Markup Language) came first.
2. HTML was a specialisation of SGML; it took off because of the web, and is probably the only reason for SGML to become famous.
3. XML was then invented as a generalisation of HTML, perhaps by people who had never heard of SGML.
And I seem to remember DocBook is an SGML thing, it was invented between steps 2 and 3.
That's completely wrong. XML is specified as a subset of SGML (it says so in the preamble even) by folks who were also involved in specifying SGML. Moreover, these same folks (the "Extended Review Board" at W3C) also amended SGML to align with the XML profile of SGML in ISO 8879 Annex K aka the "WebSGML Adaptations".
DocBook is originally an SGML DocType and most of the DocBook formatters are written in DSSSL. Large amounts of documentation for open source software (and large amounts of O’Reilly books) is still SGML DocBook.
When you have much more than one line of these "a=b", that sugar helps. XML has hierarchy, you can group related values into elements. XML has comments. XML can be typed, the standard even defines formats for numbers, date and time. There're good libraries to serialize/deserialize objects to XML, from pretty much all OO languages out there. I use them a lot, and I rarely expect users to edit XML, I give them GUI to change the settings they need, updating the config.
Well, the problem with INI files for configuration is that config files (legitimately!) need to be able to represent repetition, nesting, schemas and comments, which there was never any standardization for. While XML seems like overkill for something as mundane as a config file, the standard does at least cover all of the cases you need.
If your config is that complex, you might be better served by JSON - or the full JS or, say, Lua, for that matter. Because what you are talking about looks more like (interpreted) code than a config.
It's under- and mis-used, but XSD helps to validate config files and provide some guidance on the structure. I know JSON has a schema in draft, is it used much?
An example of XML config we used at my workplace was a processing pipeline with various modules and options/parameters encoded for each phase (some optional) of the pipeline. So in a sense it was configuration that resulted in executing code modules, not so much your standard options.
XML is absolutely excellent for markup. There are no competitors here.
Markup consists of two things: a scalar (a string usually, but can be a binary sequence) and associated structured data: smaller scalars, records with fields, and lists (there's no "etc." here, that's all).
The structured data are either discovered in the scalar by parsing or added to it by marking it up. Parsing applies to binary data and artificial languages (although there are parsers for natural languages as well), marking up to structures that cannot be parsed out, but can be added manually, usually during authoring, but also during after-the-fact indexing.
XML stores both the original scalar and the structure together in a single piece. There's extensive tooling for processing the result.
Practical examples:
1. Parse a C file and do something else with it than compiling. E.g. you want to publish it, index with cross-references, transform maybe: XML shines here (you'll normally want to add XSLT to it).
2. Author text and do something with it. If it's Markdown, apply a minimal parsing and save the resulting AST in XML. Same for reST and any other format out there: just get it into XML as soon as you can and process the XML from that point. Whatever you want to produce (XML, man pages, PDFs), XML toolchain will help you to get there.
3. Mark up existing text. E.g. you have a collection of letters and want to index all references to people. XML would be a very good choice here too. (I'd say that marking up and indexing all existing texts of the humanity would be a very important project. There's already a lot of effort to publish them, and marking up and indexing is what naturally comes next.)
4. I'd venture to say that even binary formats would benefit from conversion to XML and back because of what's possible with XML toolchain (I'm thinking mostly about transformation, but indexing would also be good.) E.g. read a collection of MP3 files, parse them out into what they have (ID3 tags of different versions at the beginning or end, APE tags, other such tags, and MPEG frames), and then do what you want: index by anything, clean up, add extra information that cannot be expressed in tags (classification for classical music or argentine tango, for example) and so on.
PS: Since XML can store structures alongside a scalar, it can also store structures alone: just drop the scalar. It's a very good format for structured data, absolutely not as bad as it's usually painted. Much better than JSON, actually. But you have to prepare it well.
PPS: Scalars and structured data are, of course, the natural parlance of all other programming languages out there, so everything XML does you can do without XML. But it also means that XML is not as foreign as it appears. There is some friction between getting data out of XML and putting it back, but it's about same as with SQL.
That's a very useful and sobering view actually. Unfortunately for the XML format it wasn't designed to prevent its own abuse. But anyway, XML is for documents sounds like a good and acceptable paradigm.
That being said, one subtle and important (and often overlooked) difference between XML and, say, JSON is that you can stream XML while parsing it at the application level, whereas JSON cannot be streamed by the application due to the arbitrary ordering of keys. (Of course lower-level parsers use streaming anyway, but that's not the point.)
In fact you not only can but you should parse XML while streaming it. This is another common abuse: wherever you look you see some high level function that loads an entire parsed XML structure into memory at once. But once you start asking yourself where the file may be coming from you realize that your system may be open to denial of service attacks. E.g. is your system ready to receive a 16GB XML file?
There is no reason you can’t do a SAX-style parser for JSON. A quick bit of searchengineering will find dozens, eg. https://github.com/dgraham/json-stream
You can in principle, but because the order is not guaranteed, you may find yourself accumulating things in memory. It depends on the task of course, but neither standard parsers nor generators are required to send adjacent JSON entities in a specific order.
I remember someone describing a trick where instead of sending a JSON array they'd just send a stream of JSON objects, one per line, so that the receiving end could parse the data in a streaming fashion. But that's not JSON anymore.
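(That trick is usually called JSON Lines or NDJSON, and the consuming side is trivial; a Python sketch with a made-up handler:)

import json, sys

# Newline-delimited JSON: each line is a complete document,
# so the reader never holds more than one record in memory.
for line in sys.stdin:
    record = json.loads(line)
    handle(record)  # hypothetical per-record handler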
XML is actually a wonderful format for data and especially extensible configuration if you combine it with XML Schema and CSS selectors...
- XML schemas give you a ready-to-use format to describe, restrict and document available configuration settings. The unique keys help, and libxml2 gives you ready-to-use validation, even if you may need to 'translate' its error messages before showing them to end users
- XML schemas also support other annotations, so you can further generalise your configuration readers by recording the necessary bindings in the XML schema itself, allowing you to use it e.g. to define application user interfaces.
- Almost any text editor can do basic syntax validation preventing most typographic errors, and even better if they can read the schema
- XML schemas are extensible using <import>s, but namespaces still enforce some separation. You can define explicit points where plugins extend your configuration format using <any>
- Human editable - closing tags are noisy but more readable than }],{}] when non-programmers may have to edit these files just to add a few extra text fields to a UI.
- Better datatype support, eg datetimes, by using XML schemas. JSON's type support is too limited
- Support for comments!
- And once you've verified the schema... CSS selectors and DOM APIs to actually process the XML documents.
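A minimal sketch of the schema side (element name and types invented), showing restriction, documentation and a richer datatype in one place:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="retryDelay">
    <xs:annotation>
      <xs:documentation>Delay between retries, as an ISO-8601 duration.</xs:documentation>
    </xs:annotation>
    <xs:simpleType>
      <xs:restriction base="xs:duration"/>
    </xs:simpleType>
  </xs:element>
</xs:schema>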
YAML fixed quite a few things, but there are still no datetimes or, as far as I know, standardised approaches to defining schemas. And I've lost count of how many attempts exist to add schema information or namespacing to JSON...
But for markup... we may be better off to just use markdown inside CDATA blocks
> XML schemas give you a ready-to-use format to describe, restrict and document available configuration settings.
As someone that likes and uses s-expressions, I never thought I would find myself defending XML, but here we are in 2019, no one understands basic parsing theory anymore, and file formats have "evolved" to hot garbage like YAML and TOML.
> YAML and TOML were optimized for human reading and writing, like markdown.
That was the rationalization, in reality I don't know what actual writing use case they were optimized for (Notepad?). XML in a syntax-aware editor with the help of automatic schema validation makes it far easier to write than YAML or TOML.
I don't need a special editor to write out YAML (if it didn't have tab dependence) or TOML quickly. It's also a lot faster for a human to read vs XML. JSON is fairly readable, but a bit of a pain to write out compared to ini files, since you have to quote all strings.
If you're going into complicated things like schemas or other complicated structures, then you probably shouldn't be using YAML or TOML. I would mostly use them for config or other simple things.
At this point if you are not interacting with 20 year old java software, that was created when JSON didn't exist and XML was king, you should be using TOML for simple config, JSON for most things and heavyweight XML, protobuf or csv for the specialized cases. And while we are at it, markdown for simple documentation.
Even gradle decided to use groovy scripts as their config language because it's far more human readable and usable.
> At this point if you are not interacting with 20 year old java software, that was created when JSON didn't exist and XML was king, you should be using TOML for simple config, JSON for most things and heavyweight XML, protobuf or csv for the specialized cases.
But we can't always budget a nice configuration application.
And once we've just put simple XML file there for configuration and we're past the prototyping phase... "well this actually works good enough, let it be".
You likely need data model classes for the config anyway, along with support of serialization and deserialization (XML or not is not important). Use a stock property grid control, pass the root object of the config, and you’ll get a GUI that does the job much better than ASCII files.
I agree. The author and other commenters here have not proposed an alternative which provides two features I consider critical to a generic data format: schema specification and validation; and easy parsing support in many ecosystems. JSON and Yaml don't seem to have mature, widely adopted schema specification and validation. And I'm not aware of anything outside of XML, JSON and Yaml which have such wide parsing support in so many different ecosystems.
JsonSchema is pretty good -- it is not as widely used as XSD, but it has libraries for most languages.
And of course JSON is the easiest thing to parse in scripting languages, like Python and Ruby -- the entire API is one line, and then you have a native structure you can work with.
> JsonSchema is pretty good -- it is not as widely used as XSD, but it has libraries for most languages.
It's okay now that the newest revisions support more realistic use cases, but ironically I find it impossible to write as JSON... I write them as YAML which my validator supports natively :)
It's still not as nice as RELAX-NG's compact schema format though IMO :)
> But for markup... we may be better off to just use markdown inside CDATA blocks
No we aren't. markdown itself is literally specified as a shortform of HTML [1], and can be translated into canonical angle-bracket syntax using SGML short references (though not completely, eg. markdown reference links require unlimited forward lookup). This gives a canonical representation of markdown in SGML/XML even if you don't use SGML.
If you mean "users" as in "users of the XML format", then it's the fault of developers as 'users' of their applications don't have any choice in the matter.
> "But if the people who made the strange decision to use XML as a data format [...] they might realise that what they're doing is unsuited to it and unergonomic
The author's point is that XML should not be a data format.
Is there logic to the authors assertion past “that isn’t what XML was intended for”? It is a pretty nice data format if you want wide compatibility and schema integration.
XDF looks cool! Some of the design choices look pretty inefficient and arbitrary for a data format though, have you thought about rebranding it as a markup language? =] /s
The point being, I think the author is arguing you should use the right tool for the job, and XML not being designed for arbitrary data structures makes it not the right tool.
Just recently people have shown you can build a raytracing engine in SQL, but if someone was arguing we should call it SQLCycles and ship it in Blender, I'd definitely have a few objections!
This doesn't make any sense. W3C doesn't restrict/define use cases of XML; it defines the structure and semantics thereof. Using it as if it were an XDF document is perfectly ok, just like you can use XML data for your DASH stream etc... It's a structure on top of XML structure.
The author’s point is moot. XML is a markup language intended to convey semantics. By its nature it is a data markup language, because the intent is that a particular type of information lives in specific tags.
For humans, this is documents with certain meanings attached. For computers, it’s documents with certain meanings attached.
It’s all data. XML is a data markup language. It’s just that humans call it “semantics” in a “document.”
It parses the same regardless of the order. With the right way, is <item><key>Name</key><value>John</value></item> the same as <item><value>John</value><key>Name</key></item> ?
One thing this misses in the "dictionary" example is that tools (like xpath) push you towards "key in attribute" selection. One of the most common operations we do with dictionaries is lookup by known key, and storing the key in attributes makes it much easier.
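A sketch of what I mean (names invented):

<settings>
  <entry key="timeout">30</entry>
  <entry key="retries">5</entry>
</settings>

With the key in an attribute, the lookup is a one-liner: /settings/entry[@key='timeout']. With <key>/<value> child elements you end up writing something like /settings/entry[key='timeout']/value instead.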
"Key in attribute" is the correct way to do it, it's just that his examples are absolutely terrible and make no sense at all. A completely unstructured list of key-value pairs is overkill for any structured data format.
I had the impression mobile was some new try at frontend tech, but somehow iOS and Android threw a whole bunch of outdated stuff at me.
I mean, before the iPhone I designed an XML-based ETL config system and tried to avoid all the common XML errors; then I started doing a mobile app 10 years later and it's like all that knowledge was forgotten...
I remember when Neverwinter Nights 2 came out touting that all of its data files were in XML for ease of modifying in user extensions. So I had a look and was it XML? Was it fuck, it was like a GeoCities novice's idea of how to code HTML - absolutely unparseable in any way unless you're the idiot who decided what the codebase needed was another shitty homegrown parser.
And yes, plists suck and make your XPath selectors ugly, although you could write a function to abstract them out.
I know what you’re trying to do there - you’re trying to “future-proof” your schema by allowing introduction of arbitrary new elements. Which means that there’s no standard way to guard against somebody omitting a required field (like “name”) or adding a new field like “creditCardNumber” - other than to document your acceptable key values in a non-standard format and add defensive code that a validating parser would have given you. You’re better off taking as much advantage of the format as you can.
XML is/can be much more than a markup language and yes, it can be used very badly but this is usually by inexperienced 'data wranglers' who don't understand the difference between data attributes and data proper.
While XML can seem cumbersome (compared to JSON say) it is a very good 'data transport' tool when used correctly with a sensible schema (XSD).
For example, we use XML as a 'vendor neutral' data format to export/import CAD geometry and associated data for town utilities such as buildings, pipes, roads etc. All this data has to be validated against the schema to ensure its correctness.
Using a schema like this enables the city council to import this XML into the GIS system to be used for asset management, financial planning etc.
A good schema can be key to sharing XML effectively between departments/applications, and being a markup language this data can also be viewed independently using XSLT.
The larger your XML file is, the more accurately you're using it. Fewer " and more <>. I made these mistakes, using XML like I was writing an HTML doc.
It is difficult for me to see what the real issue is with the examples given. It seems to be more an aesthetic preference of the author than a technical argument. People can use formats for whatever they want. :P
If you told me that the transmission and parsing rate is too slow for their application, that's a real dig at it.
I think it's a bit like trying to explain why a Python REPL isn't a substitute for a calculator? Like it can of course do what you want, and you can't "see" what's wrong if you just take what you see literally (you'll obviously get the same answers regardless of what tool you use), but it's just... not meant for that.
For those wondering what you do with XML as a document markup language, see the XML document that is the specification for XML. I had to look at the page source to determine it really is an XML document. Looks like an HTML document.
I've got a stylesheet I wrote that turns HTML or XHTML into a fully indented and highlighted representation of itself which I was quite proud of :) I used to spend a lot of time writing XSLT and XQuery lol.
The idea, for those not familiar, is that once a work of art is published (a novel, a poem, a song, a painting), it speaks for itself, and authorial intent no longer matters.
That is, meaning and purpose are in the eye of the beholder/consumer. And there is no right or wrong way to "interpret" art. If someone finds meaning that the author did not intend, it is just as valid as a deeply hidden but intentional allegory they intentionally placed in when they were writing.
The relevance to software is it applies to APIs, specifications, standards and formats.
There is no such thing as users using your software or specification "wrong" - if they insist on doing so, the meaning has evolved. Evolve with it or die.
That's a little extreme but you raise a good point. I think a talented spec designer anticipates how their work might be interpreted / used / abused, and, like an adroit villain, nudges their audience toward tenets their grand scheme seeks to achieve.
XML wasn't an original invention; it is specified as a proper SGML subset. From the XML spec:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document.
Now I totally agree that SGML and XML aren't for service payloads and config files. The sole purpose of markup languages is representing structured text. And arguably, SGML fills this role much more adequately than XML today as it can represent (via the SHORTREF mechanism) custom Wiki syntaxes such as markdown and others, and in contrast to XML, can deal with the largest corpus of markup out there eg. can parse HTML with all its minimization features such a omitted tags, enumerated and unquoted attributes, etc. See [1] for a practical introduction (disclaimer: link to a tutorial I held last month at ACM DocEng).
You control whether an element requires start- and end-element tags in your element declaration via "O" (letter O as in "omissible") in the respective tag omission indicator position:
<!ELEMENT e - -(f,g,h) -- no tag omission -->
<!ELEMENT f O - (#PCDATA) -- start-tag omission -->
<!ELEMENT g - O (#PCDATA) -- end-tag omission -->
<!ELEMENT h O O (#PCDATA) -- both start- and end-tag omission allowed -->
turns out a well-specified format that has a lot of parsers available is useful for more than just a markup language. xml is great at data formatting, a little more verbose than alternatives but also a lot more feature-rich
The format wasn't well-specified and didn't have a lot of parsers in 1996 though. The parsers came after the decision was made by a lot of people to use a markup language inappropriately for structured data.
> a simple test for determining if an XML schema is well designed: remove all tags and attributes from it ... If what you have left over does not make sense ... you shouldn't be using XML at all.
Magento 2 (acquired by Adobe for $1.68bn) uses XML to render its layouts. Here's some fun XML for the checkout page:
I once had to write a data layer in xml, in-situ, with a lifespan of up to hours as more data was appended to it. An invalid xml document that you couldn't load in many xml apis for 99.9% of its lifespan.
I begged and pleaded with the lead architect to use an sqlite db for the elements of the data until the transaction was complete and then merely produce the xml file at the end, but no.
This is not a dictionary, it’s a record. Unfortunately this fairly fundamental distinction has been thoroughly muddied by certain languages that want to use associative arrays for everything.
I'm not the one doing that. People using XML as something else than a document markup language are. Which is what the article author is complaining about. But if you really want to do that, then at least do it properly. Record fields have predefined names, just like XML elements have predefined names. Dictionary keys can be arbitrary, like XML text nodes or attribute values but emphatically not element or attribute names, unless your "XML" is actually just tag soup.
It would work (kind of); most XML parsers/generators would take care of escaping and unescaping quotes; but there's no way in the XML spec to escape characters in tag names.
The main issue I see with that is that if it's a true dictionary, then those elements will constantly be different, which is weird.
Now, if we're just encoding a dictionary that's an already an encoding of an object, then yeah, let's just encode the object directly like you are above.
Which works if you either expect a dictionary with a predefined set of fields (which isn't really a dictionary then), or parse your xml in a way that handles arbitrary tags. For a generic dictionary the shown approach is still the way to go, if you really want to use XML for that.
I can't find much to corroborate this article's take. RDF is a stark counter-example - a standard from the W3C. It has endorsement from Tim Bray, one of XML's co-authors.