I’m just going to argue the exact opposite of the article: xml and json are both structured data formats useful for tree-like data structures, such as objects.
Whether that was the intended purpose when xml was designed is irrelevant. It’s what xml is used for in almost every case.
The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s. Json really hasn’t been an alternative until quite recently (10 years ago?).
In fact reading the article carefully I fail to see the author argue why xml shouldn’t be used as a data format either.
The intended purpose is relevant because it tells us the conditions under which something is likely to work well.
XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then. (It also doesn't make the people who overused it somehow bad; I think it was a reasonable and necessary mistake.) It certainly doesn't make new ones correct now, given that JSON's been around 16+ years. [1]
I think the author here is a little too strong in his criticism; I think XML is great for things that are meant to be long-lived and self-documenting. That is, things that are used like documents. But if I'm passing short-lived globs of structured data back and forth, as with an API, I think JSON's a much better fit, as are Protobufs for more tightly coupled code.
> The intended purpose is relevant because it tells us the conditions under which something is likely to work well.
Not really. Plenty of things suck at their original intended purpose and remain in use because they are very good for some other purpose. (Viagra is an example well-known to popular culture, but hardly unique.)
> XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then.
XML is perhaps not abstractly ideal for many of the purposes it has been used for, but in many cases it was superior to other alternatives for practical reasons, particularly the tooling ecosystem. (JSON is the new XML, and virtually the same thing can be said for JSON in many of its current uses, though it does clean up XML's two biggest warts, the element/attribute distinction and verbosity. Even among human-readable formats, YAML addresses the latter better than JSON while being easier to read, not to mention all the binary options when readability isn't a concern.)
Yes, really. I agree there are exceptions, which is why I said "likely". But by and large, fitness for purpose correlates with design intent (which generally involves a significant period of iterative use, further driving fitness for purpose).
Sure, but the enterprise sphere is mostly what Moore called late majority and laggards. The biggest driver of use is not effectiveness, but perceived safety. A good proof of that is your choice of word here: crazy. It's not that JSON would have been somehow technically wrong; it was just socially wrong.
I'd argue that for data storage purposes, you'd like to have a low metadata to information ratio. In the examples the author gives this seems to be the main problem, with way more characters being used for markup than for content.
Compare that to JSON or TOML, which are more human-friendly and waste fewer bytes on structure to convey the same information. When used for data storage, two XML files of the same schema describing two completely different objects are likely to share a large amount of content, which is wasteful and gets in the way.
For storage of structured data (and probably even for loosely coupled RPC) you want a format that is efficient and schema-oblivious. The bad choice 15 years ago was XML; the bad choice today is JSON (the parsing overhead is not negligible) or ProtoBufs (not schema-oblivious). Various binary formats with a JSON-like object model seem like the way to go (my choice is CBOR).
And then there is the EU-wide absurdity of WhateverAdES, which invariably leads to onion-like layers of XML in ASN.1 encoded as base64 in XML wrapped in a CMS DER-encoded message...
I beg to differ. For a start, XML compresses well and besides, storage is monster cheap these days. XML is a better storage format because it documents what the data is (a title, a reference etc) as well as the data itself.
XML does compress well as text or over the wire, but the parse trees can be quite large in memory and processing cost. At least in Perl I've had enough scripts crash out due to this overhead when implementing the common/naive solution using off-the-shelf modules. You can get around this by choosing between DOM and SAX, but I consider that a symptom of the problem: you choose XML to solve a problem and now you have another problem to solve.
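For what it's worth, the streaming workaround isn't much code; a rough Python sketch (the tag name and handler are made up):

import xml.etree.ElementTree as ET

# Stream the file instead of building the whole tree in memory:
# handle each <record> element as it closes, then discard it.
for event, elem in ET.iterparse("big.xml", events=("end",)):
    if elem.tag == "record":
        process(elem)   # hypothetical per-record handler
        elem.clear()    # free the subtree so memory stays roughly flat

Of course, needing to know this trick at all rather proves the point about XML handing you a second problem.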
I had the same problem with npm, I think, and JSON, because npm could not simply load the huge JSON file into memory. A huge anything can crash a naively written tool used to handle smaller instances.
That's true but I think XML has the edge there. It has so many features like defining new types which you wouldn't normally see in JSON. One parser we used had a ten to one ratio - 50MB of XML meant 500MB of RAM usage when using a DOM parser. And that's taking into account the textual representation of XML is already >50% bloat with the closing tags etc.
Well, XML is a markup language (and is really good at being that) while JSON is not. Sure, XML can be used as a poor man's data storage, as a base for a DSL, etc., but almost always there are better choices.
What are the better choices? And what were the better choices on the major platforms 10 years ago, such that we wouldn't now see every app using xml config files/DSLs/storage?
I use csv when applicable. I use protobufs when applicable. But for the typical use case I choose xml when it's some config/dsl/dataset that needs to be human-editable (support comments, for example), has a more complex structure than csv supports, and preferably doesn't need an external library or a custom parser. Json, Csv, Toml, S-Expressions, and protobufs all fail one or more of these requirements. I'm sure there are others but none that don't have at least one drawback I don't want.
A poor man's data storage is exactly what I want!
> preferably not need an external library or a custom parser. Json, Csv, Toml, S-Expressions, protobufs all fail one of more of these requirements.
And XML doesn't? Quite a few (not all, but quite a few nonetheless) programming languages include zero support for reading or writing XML-formatted data without using an external library or custom parser. This includes nearly all languages that predate XML, and quite a few languages that postdate it. Even when a language does have built-in (or at least in the standard library) support for XML, it's almost always a royal pain to use, especially once namespaces and schemas are involved.
Once upon a time, though, the answer was (and in a lot of places still is) INI:
- It's human-editable and supports comments
- It supports more complex structure than CSV
- Some languages have built-in support for it, and the Windows and GLib APIs support it, too (well, something similar enough to be compatible, in the latter case)
INI falls flat when you need to express deeper levels of nesting than keys and sections, though.
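For example, with nothing but the Python standard library (a sketch; section and key names invented):

import configparser

cp = configparser.ConfigParser()
cp.read_string("""
; comments are allowed
[database]
host = localhost
port = 5432
""")
print(cp["database"]["port"])  # -> '5432' (every value comes back as a string)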
There's also YAML, which meets all your criteria about as well as XML does (at least on average; your specific language/platform might favor one or the other).
Right. Xml is to .NET what ini was for MFC. It’s what the platform “makes” you use. The same is true for json on js of course.
On a platform that has almost no support out of the box (e.g python) the choice is open. But on a platform that has a couple of formats built in, picking a format outside that platform is a pretty big step. The return needs to be substantial for a .net developer to use yaml via an external library over xml.
My reasoning in this thread has always started from the perspective that xml comes built in and almost no other format does. This is the case for e.g java and .net but not for python or C for example. But the prevalence of xml comes from java/.net so if we are to ask why, then we should consider that.
It seems like the "type: External" should line up with "metric" and "target" but no, it needs to line up with the word "external" - not the dash, but the word after a space after the dash. Using YAML frequently reminds me of the quote "Be open minded, but not so open minded that your brains fall out".
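For anyone who hasn't hit it: keys inside a list item have to line up with the first key after the dash, not with the dash itself. Roughly (field names loosely follow the Kubernetes HPA spec, so treat this as a sketch):

metrics:
  - type: External          # the first key sets the column
    external:               # must start in the same column as "type"
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: 30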
Schemas are awesome for so many cases. Even though JSON Schema is a giant evolving clusterfuck, I still use it to be able to enforce some consistency.
And, being honest, JSON schema is better than, say, GPB or Avro schema at enforcing field relationships, e.g., "if typeId is 7, then partnerId cannot be null"
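For the curious, with draft-07 conditionals that constraint comes out roughly like this (property names taken from the example above):

{
  "type": "object",
  "properties": {
    "typeId":    { "type": "integer" },
    "partnerId": { "type": ["string", "null"] }
  },
  "if":   { "properties": { "typeId": { "const": 7 } }, "required": ["typeId"] },
  "then": { "properties": { "partnerId": { "type": "string" } }, "required": ["partnerId"] }
}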
You aren't responding to the comment here, you are just reasserting the article's position. I'd argue there is not another format that is obviously better for every data storage or exchange use case, or that surpasses all of XML's benefits while minimizing all of its downsides. I don't want to look at XML, but I do understand why it is being used.
Abuse of XML killed it as a format. JSON is absolutely shit for semantic markup, and yet developers today routinely use it for documents because "XML is bad". They contrive ridiculous schemes for adding metadata and type information. They use it to generate HTML even when HTML takes less space. Finally, we regressed from XHTML to HTML5. Bye-bye namespaces and parsing consistency.
> They use it to generate HTML even when HTML takes less space.
The fact that MobileDoc exists makes me physically ill. Something that can be expressed with one line containing a paragraph element and an italic tag is over a dozen lines of JSON spam.
Right, but, if given a choice of what to use, between XML and JSON, I'll pick JSON every time.
XML is a complete mess. Have you SEEN its spec?
You can put JSON's spec on a single page. XML's spec, not so much. Hell, most XML parsers don't support the full spec, and the ones which do have historically been riddled with security holes.
JSON over XML was simplicity over a crazy spec built by a bunch of companies all wanting to shove their own craziness into it.
The XML spec is longer than one page, granted, but it's about three times shorter than the YAML spec. And the XML spec describes not only the XML syntax, but also a basic form of validation (DTD), which includes references. Basic XML has only five special symbols (<, >, ', ", and &) and can be parsed in linear time. (Namespaces complicate things somewhat.)
That's because json's spec isn't complete. It is predicated on the language interpreting it to be able to just eval the structure and work with the data [1]
For a non human-edited data storage or exchange that’s fine. Json is worse for human editable data though. Xml might not be the best alternative there but it beats json for things like small configs.
It’s not as simple as saying “everywhere xml is used, json would be a better choice”.
Your data storage format shouldn't be human readable. The data transferred over the wire shouldn't be human readable. It should be a binary serialized data format, probably encrypted, definitely compressed.
Yes, storage is cheap. Bandwidth is not. Also, you really don't want a human that intercepts your data to be able to read it. Additionally, your data structure only makes sense in the context of your domain, which usually has been modeled in your program(s) that work in that domain, and thus it will be better if you deserialize it within tools that understand that domain.
If you feel the need for a general purpose deserialization protocol, there are several available - Avro/Protobuf, etc.
Binary encoded data can often be decoded without consuming the entire document. SAX-parser-like reading requires at least reading an opening and a closing tag before the data is useful.
String serialization is a wasteful endeavor. It makes life easy for devs because it takes one less step to read the data in a text editor or log message, but quite often requires hacks to model things like recursive or self-referential data structures, and wastes space by repeating property names constantly for every item within the serialized structure. It's predicated on four of the fallacies of distributed computing, namely - bandwidth is infinite, the network is secure, transport cost is zero, and latency is zero. It is a solution looking for a problem, and because we are lazy, we don't build tools that would make binary serialized formats just as easy to use as json/yaml/xml.
Markup is a mix of scalar and structured data (it's a structure discovered or associated with a scalar) and thus it contains everything it needs to express structured data alone: just remove the scalar. E.g.
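(A sketch of an invoice; element and attribute names are just illustrative):

<invoice id="1042" date="2019-06-01" customer="ACME Corp">
  <item sku="A-100" qty="2" price="19.99"/>
  <item sku="B-205" qty="1" price="5.50"/>
</invoice>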
Is this really a poor man's choice? And compared to JSON?! I can see at least the following advantages here:
1. Each element has explicit type name (invoice, item). JSON is "typeless", which simply means the type information travels out of band. And with XML namespaces these type names can be made globally unique, but still stay human readable.
2. Each element is self-contained, the code that produces the <item> doesn't need to know if there was an item before or after it so that it should add a separator. (The dangling comma problem in JSON.)
3. The attribute names are not just arbitrary strings as in JSON, there are strict rules of what can be in the name. They're much more suited for structured data than JSON, where you can name an attribute "foo.bar" and some JSON readers that accept a JSON "path" won't be able to find it.
4. It has less visual noise than JSON because the attribute names don't have quotes around them and you don't need to separate elements with a special symbol. Despite the common belief, well-written XML is more readable than JSON.
And we haven't even touched things like validation + extended types, references, and transformation of data.
Yet every time I've stumbled upon XML in the past decade or so, it's been used as a data format because it's easy to manage and supported by every platform/tool out there. But sure, let's switch over to JSON or use a SQL database because we can't deal with the fact that XML might be well suited for something it wasn't originally designed for.
It doesn’t answer the question, but I do wonder if XML would be an improvement in devops, compared to the current obsession with YAML. For everything except the part where you write it.
Make an xml stylesheet and your kubernetes cluster is instantly documented.
Maybe if you are using Notepad. Any decent text editor will provide things like auto indentation, completion, auto end-tags, structured editing, and schema validation. For example, Emacs comes with nXML mode:
That assumes that you edit XML all day long. This is not always the case.
I am writing non-XML code most of the day, and I do not have structured editing / auto-braces enabled. So when I need to edit that one XML config, I'll open it in my regular editor, which will provide at most syntax highlighting, and edit it as needed with a bit of swearing. And next time, I'll promise myself I'd choose a different config format which doesn't need special editors.
Which "same thing"? Not setting up editor and complex environment for the things I am only going to edit once or twice? Yes.
In general, when you see something inefficient, you can either fix it to make it better, or ignore and come up with random workarounds.
In my opinion, a config file which cannot be edited by hand, and which needs a special editor with a non-trivial learning curve, is an inefficiency. I can either ignore it and set up the specialized tools; or I can fix it by ripping out XML and replacing it with something more human-editable, like TOML or YAML. In large teams, it is almost always better to fix it -- sure, I will spend a few hours getting rid of XML, but this will pay itself off in the long term, as no one else will have to bother with special setup anymore.
(This obviously only applies to systems where XML is a minor part, like a single configuration file. If your system has a huge amount of XML, you'd better learn the right tools.)
I don't understand, what is there to set up? With the Emacs mode I gave as an example, you just open an XML file and everything is there. Any decent text editor will have XML support.
xmllint --noout <file> will check the file and report any issues with XML syntax in a very detailed way with line numbers for you to see.
I myself don't even use syntax highlighting and normally work in vim and although I do make errors in XML sometimes, I find that I make at least as many syntactic errors in Python or C code that I have to weed out before I can proceed. But I never heard anyone complaining about Python or C being too strict :)
They all are, but xml isn’t the worst. (Json and S-expressions are worse, for example).
Not even formats designed for human consumption such as yaml are very good. The good ones for editing (toml, csv, ini) fall short when it comes to complex structure instead. There is no silver bullet.
> The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s.
The example seems pretty weird. If it's just data there shouldn't be any quotes at all. The only purpose of quoting is to prevent evaluation, so I guess the idea must be that `log` is a function call to be evaluated, but then the example isn't even the same thing as the XML version (and there are still nested quotes inside the already quoted record).
You'll want XML as soon as it exceeds one screenful :) And this XML is, well, "misused"; it should be something like this:
<log>
  <record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!">
    <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30" />
  </record>
</log>
I also don't object to squeezing the exception attributes into the record with some prefix that makes the names unique, like this:
<record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!" exc-message="java.lang.Exception" exc-class="logtest" exc-method="main" exc-line="30" />
Or, if it's possible to normalize these data, factor out the exception and other attributes and build an index of them so that they can be referenced by an ID:
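Something like this, say (a sketch of what I mean):

<exceptions>
  <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30"/>
</exceptions>
<record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!" exception="123"/>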
Frankly, I’d rather some form of compressed binary logs, but sure, either works if it includes a stack trace and I need it at the time! Text logs compress well anyway, they just grow larger if/when you process them later... I’m just happy to see objects as log entries instead of plaintext strings. ;)
The xml, definitely. Without lisp-specific tooling that colours/balances parens etc., writing 5 consecutive closing parens (and not 4 or 6) is a headache. Also I probably couldn't choose it on its own merits - I choose what the non-programmers who edit the file can use (and they sure don't have any other editor than notepad).
Then again: I’d usually only ever choose between formats with support already on the platform/standard library, if it was just some config or small data file. If the data is core to the product then of course it might be reasonable to include a new library or even write a parser. I’m talking java and .net now mainly.
> what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s
I think you mean s-expressions instead of lisp, and if you do, I'm with you. They are really neat and underused. Now, if you want to carry a whole language runtime for the structured data format, you'll probably end up with the "which lisp" kind of problem, in which each app ends up using something different and then the structured data files are no longer exactly portable (until you devise a portable standard yadayadayada)
I have to admit, once you've seen S-expressions, XML looks insanely bloated and verbose.
The advantage XML would have in that situation is that, because it's so verbose, if you get a malformed XML, you can eyeball-parse it and often figure out how to hand-edit it to make it valid. If it is valid, you can also see exactly how the schema you were sent differs from the schema you expected. An S-expression, having less redundancy, also is potentially more brittle.
If the malformation is systematic (like someone printf'd or typed in bad xmls), then the same is true for any human-readable format, since you know the data and what it should look like. If not (like a randomly broken packet), then you're solving the problem at the wrong level. By-hand error-correcting verbosity is hardly a selling point of the application level protocol.
> By-hand error-correcting verbosity is hardly a selling point of the application level protocol.
It's a selling point while you're trying to figure out how to get it working. Once you have it working, it's not - but by then, you have it working, so why change it?
> Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
> ASN.1 is a joint standard of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and ISO/IEC, originally defined in 1984...
I think ASN.1 was designed for a world with (a) waterfall development and (b) static deployments - ie no over the air updates. Under the circumstances, messing up is simply not an option - hence defining the standard so clearly and for so many use cases.
Today, of course, we treat the entirety of our deployed infrastructure as 'merely' a platform to write code. And not only are experimentation and failure OK, they're positively encouraged. Velocity became important.
You can meaningfully decode any DER encoded ASN.1 structure and serialize it back without any knowledge of the schema. Somewhat surprisingly you cannot do that with all instances of XML documents.
The first thing wrong is that you have to serialize and deserialize it. Operationally, that's inconvenient, and it shows that they're optimizing for network bandwidth. But these days, squeezing the most out of each bit is, in most cases, not a defensible design decision.
Then, once you deserialize it, it's still a printable version of ASN.1. Sure, it's unambiguous, rigidly defined, and standardized. It's still gouge-your-eyeballs-out horrible to try to do anything with.
Say you get an XML message over the wire with a bit flipped. If you look at it, you have a good chance to be able to figure out what went wrong, edit one character, and you can now process it. If you get an ASN.1 message in the same condition, it's pretty much game over (though there may be special tools that could save you).
Say you get an XML, and you don't know the schema. You look at it, and you can see what's going on. You get an ASN.1 where you don't know the schema, and you can be totally sunk. (If I recall correctly, in ASN.1, you can have schemas that are private, that is, not specified in the standard.)
XML (and JSON) have the advantage of names, which makes it slightly easier when it comes to querying (and indeed building indexes) over lots of data. I'd be amazed if there wasn't tons of work on this for S-expressions, but I can imagine it's slightly clunky.
What do you mean by names here? S-expressions have symbols, which serve the exact same purpose as what I think you might mean: an interned string value which can be cheaply used more than once.
Not saying you can't specify some schema here, but there's nothing native to S-expressions that makes it quite as transparent and simple to specify a path into a data structure.
All you need on .NET/Java in 2007 then is a lisp parser/interpreter. I know they are easy to write, but they aren't as easy to write as something you don't have to write at all.
Let's not forget, many of these configuration files and data formats are one-off hacks that were meant to be replaced by a real format, a real parser, a real DSL etc.
The reason the xml config/dsl/format stuck is because it worked. And it was cheap and easy.
S-expressions have been around since the 1960s. McCarthy was already proposing S-expressions for what XML would become, in 1975. Stop whining and finding excuses, and start using s-expressions.
I'm not trying to pick the best format for the job, I'm usually trying to pick the least bad one that's available IN the platform/standard library I'm using.
So, what is XML good for? If it's not good for data as everyone says (and I'm not inclined to argue), but it is good for documents, what kind of documents are we referring to? A defined metadata on a text document? A template used with data to generate something else? Is a configuration file a document or data? Where would I want to use XML that something like JSON, a text document, or some combination thereof wouldn't be better?
I'm not being facetious, this is an honest question. Where are the "right" places to use XML?
Tim Bray, co-editor of the XML spec, writing in 2006 on the topic:
> Use JSON: Seems easy to me; if you want to serialize a data structure that’s not too text-heavy and all you want is for the receiver to get the same data structure with minimal effort, and you trust the other end to get the i18n right, JSON is hunky-dory.
> Use XML: If you want to provide general-purpose data that the receiver might want to do unforeseen weird and crazy things with, or if you want to be really paranoid and picky about i18n, or if what you’re sending is more like a document than a struct, or if the order of the data matters, or if the data is potentially long-lived (as in, more than seconds) XML is the way to go.
There are points of disagreement between me and the author, although I wouldn't get too passionate about them.
Super-short version, reading over it again, is that XML is very good at what it does, but it really ought to be seen as a relatively specialized data format. It's really good at certain tasks, best-of-breed for a couple of them, and degrades rapidly as you get away from that. JSON is a fairly cheap & fast general-purpose format that's OK at a lot of things, isn't necessarily great at much, but as you get into more specialized use cases, also tends to degrade. Being a general-purpose format, perhaps arguably it degrades more "slowly", but it does degrade.
Properly understood, IMHO, their use cases don't overlap much if at all, and the combination of them may cover a lot of space, but are still far, far from the only serialization formats you'll ever need.
Just the sort of thing you would think of as 'documents'--the texts of books, manuscripts, and the like, where structure may be somewhat arbitrary. For instance, I work with a few different text corpuses--one of which is an actual dictionary, with entries, definitions, usage examples, etymological information, and bibliographic references. Another is a collection of poetry manuscripts, with annotations for line breaks and editorial emendations, both from the author and other editors (i.e, places in the manuscript with crossouts, interlineal notes, marginal notes, etc).
I mean, in theory, you could do this in JSON or some other data structure. But you would go insane and be shooting yourself in the head before long.
> you could do this in JSON or some other data structure
I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Very simple book</title>
  <chapter xml:id="chapter_1">
    <title>Chapter 1</title>
    <para>Hello world!</para>
    <img src="hello.jpg"/>
    <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
  </chapter>
  <chapter xml:id="chapter_2">
    <title>Chapter 2</title>
    <para>Hello again, world!</para>
  </chapter>
</book>
Would you make each <chapter> into an object? But you have 2 <para> children in there with an <img> in between. And one <para> has an additional <emphasis> in the content. I can't think of a good JSON schema equivalent to this.
[
  "this text needs more ",
  {"type": "emphasis",
   "children": ["emotion"]},
  "!"
]
hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.
if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
which is probably the only way to properly deal with markup and especially commented sections that can span over paragraph starts/ends - neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can do that, especially if humans still want to reasonably be able to view or edit it...
(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
That is what essentially every WYSIWYG text processor does. And also the reason why getting sane HTML out of text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other or contain completely unnecessary formatting information.
But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.
Certainly. I wasn't suggesting that JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.
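If it helps, a fuller (still unpleasant) sketch of the whole sample book under that kind of scheme, with titles pulled into attributes as suggested above (node names are mine):

{"type": "book", "id": "simple_book", "title": "Very simple book", "children": [
  {"type": "chapter", "id": "chapter_1", "title": "Chapter 1", "children": [
    {"type": "para", "children": ["Hello world!"]},
    {"type": "img", "src": "hello.jpg"},
    {"type": "para", "children": [
      "I hope that your day is proceeding ",
      {"type": "emphasis", "children": ["splendidly"]},
      "!"]}
  ]},
  {"type": "chapter", "id": "chapter_2", "title": "Chapter 2", "children": [
    {"type": "para", "children": ["Hello again, world!"]}
  ]}
]}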
I absolutely agree. Where XML shines is when you could take just the text content - strip out all the markup elements and attributes, document, comments, etc - and have a text document that still makes some sort of sense.
This AFAICT was actually why SVG has a few bizarre choices - such as putting all the drawing commands into attributes. A browser that doesn't understand an embedded SVG document in its HTML would be left with just the text contents.
I've been getting along fine using JSON for pretty much everything. That being said XML has some very sophisticated features like rigorous schema definition, a query language, a formal include syntax, comments (that's a big one), it's a lot easier to do multi-line content and in fact you can mix normal text and structured data.
The include syntax doesn't get enough love. It's crazy that JSON doesn't support it.
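For reference, XInclude is about this much ceremony (file names invented):

<config xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="database.xml"/>
  <xi:include href="logging.xml"/>
</config>

An XInclude-aware processor splices the referenced documents in before your application ever sees them.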
The issue with all the XML sophistication is that essentially the only environment where all of that really works is when you use XML as a markup language for technical publishing (ie. DocBook, DITA, ...), and in fact as a less convenient but more modern and cool SGML replacement.
Random applications that read XML just aren't going to implement fully validating parsers, because it is a lot of completely unnecessary work. Also, the pattern of storing everything in attributes mentioned in the article mostly comes from the fact that working with CDATA nodes in XML is a major PITA wrt. whitespace handling and coalescing adjacent nodes.
Think of semi-structured documents, where you have a list of pre-defined sections. I've seen it in use by insurance companies for case reports, and in real estate for appraisals. And of course we've all seen it work well in the form of HTML. There's some structure to all of these examples, but mostly for annotating text sections. You need some flexibility built into the schema to add fields as needed, but you're not dealing with various map / list / primitive data types as a matter of routine. Just making this one up, but if LaTeX wasn't already the standard, I'd also use it if I was digitizing the content of academic papers, for instance. You have a header with metadata, abstract, the body, citations. There's some structure, a need to add some metadata, perhaps flexibly over time, but mostly it's just a document.
What is the exact relationship? The history as I remember it is:
1. SGML (Standard Generalised Markup Language) came first.
2. HTML was a specialisation of SGML; it took off because of the web, and is probably the only reason for SGML to become famous.
3. XML was then invented as a generalisation of HTML, perhaps by people who had never heard of SGML.
And I seem to remember DocBook is an SGML thing, it was invented between steps 2 and 3.
That's completely wrong. XML is specified as a subset of SGML (it says so in the preamble even) by folks who were also involved in specifying SGML. Moreover, these same folks (the "Extended Review Board" at W3C) also amended SGML to align with the XML profile of SGML in ISO 8879 Annex K aka the "WebSGML Adaptations".
DocBook is originally an SGML DocType and most of the DocBook formatters are written in DSSSL. Large amounts of documentation for open source software (and large amounts of O’Reilly books) is still SGML DocBook.
When you have much more than one line of these "a=b", that sugar helps. XML has hierarchy, you can group related values into elements. XML has comments. XML can be typed, the standard even defines formats for numbers, date and time. There're good libraries to serialize/deserialize objects to XML, from pretty much all OO languages out there. I use them a lot, and I rarely expect users to edit XML, I give them GUI to change the settings they need, updating the config.
Well, the problem with INI files for configuration is that config files (legitimately!) need to be able to represent repetition, nesting, schemas and comments, which there was never any standardization for. While XML seems like overkill for something as mundane as a config file, the standard does at least cover all of the cases you need.
If your config is that complex, you might be better served by JSON - or the full JS or, say, Lua, for that matter. Because what you are talking about looks more like (interpreted) code than a config.
It's under- and mis-used, but XSD helps to validate config files and provide some guidance on the structure. I know JSON has a schema in draft, is it used much?
An example of XML config we used at my workplace was a processing pipeline with various modules and options/parameters encoded for each phase (some optional) of the pipeline. So in a sense it was configuration that resulted in executing code modules, not so much your standard options.
XML is absolutely excellent for markup. There are no competitors here.
Markup consists of two things: a scalar (a string usually, but can be a binary sequence) and associated structured data: smaller scalars, records with fields, and lists (there's no "etc." here, that's all).
The structured data are either discovered in the scalar by parsing or added to it by marking it up. Parsing applies to binary data and artificial languages (although there are parsers for natural languages as well), marking up to structures that cannot be parsed out, but can be added manually, usually during authoring, but also during after-the-fact indexing.
XML stores both the original scalar and the structure together in a single piece. There's extensive tooling for processing the result.
Practical examples:
1. Parse a C file and do something else with it than compiling. E.g. you want to publish it, index with cross-references, transform maybe: XML shines here (you'll normally want to add XSLT to it).
2. Author text and do something with it. If it's Markdown, apply a minimal parsing and save the resulting AST in XML. Same for reST and any other format out there: just get it into XML as soon as you can and process the XML from that point. Whatever you want to produce (XML, man pages, PDFs), XML toolchain will help you to get there.
3. Mark up existing text. E.g. you have a collection of letters and want to index all references to people. XML would be a very good choice here too. (I'd say that marking up and indexing all existing texts of the humanity would be a very important project. There's already a lot of effort to publish them, and marking up and indexing is what naturally comes next.)
4. I'd venture to say that even binary formats would benefit from conversion to XML and back because of what's possible with XML toolchain (I'm thinking mostly about transformation, but indexing would also be good.) E.g. read a collection of MP3 files, parse them out into what they have (ID3 tags of different versions at the beginning or end, APE tags, other such tags, and MPEG frames), and then do what you want: index by anything, clean up, add extra information that cannot be expressed in tags (classification for classical music or argentine tango, for example) and so on.
PS: Since XML can store structures alongside a scalar, it can also store structures alone: just drop the scalar. It's a very good format for structured data, absolutely not as bad as it's usually painted. Much better than JSON, actually. But you have to prepare it well.
PPS: Scalars and structured data are, of course, the natural parlance of all other programming languages out there, so everything XML does you can do without XML. But it also means that XML is not as foreign as it appears. There is some friction between getting data out of XML and putting it back, but it's about same as with SQL.
That's a very useful and sobering view actually. Unfortunately for the XML format it wasn't designed to prevent its own abuse. But anyway, XML is for documents sounds like a good and acceptable paradigm.
That being said, one subtle and important (and often overlooked) difference between XML and, say, JSON is that you can stream XML while parsing it at the application level, whereas JSON cannot be streamed by the application due to the arbitrary ordering of keys. (Of course lower-level parsers use streaming anyway, but that's not the point.)
In fact you not only can but you should parse XML while streaming it. This is another common abuse: wherever you look you see some high level function that loads an entire parsed XML structure into memory at once. But once you start asking yourself where the file may be coming from you realize that your system may be open to denial of service attacks. E.g. is your system ready to receive a 16GB XML file?
There is no reason you can’t do a SAX-style parser for JSON. A quick bit of searchengineering will find dozens, eg. https://github.com/dgraham/json-stream
You can in principle, but because the order is not guaranteed, you may find yourself accumulating things in memory. It depends on the task of course, but neither standard parsers nor generators are required to send adjacent JSON entities in a specific order.
I remember someone describing a trick where instead of sending a JSON array they'd just send a stream of JSON objects, one per line, so that the receiving end could parse the data in a streaming fashion. But that's not JSON anymore.
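(That trick is usually called JSON Lines or NDJSON, and the consuming side is trivial; a Python sketch with a made-up handler:)

import json, sys

# Newline-delimited JSON: each line is a complete document,
# so the reader never holds more than one record in memory.
for line in sys.stdin:
    record = json.loads(line)
    handle(record)  # hypothetical per-record handler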
XML is actually a wonderful format for data and especially extensible configuration if you combine it with XML Schema and CSS selectors...
- XML schemas give you a ready-to-use format to describe, restrict and document available configuration settings. The unique keys help, and libxml2 gives you ready-to-use validation, even if you may need to 'translate' its error messages before showing them to end users
- XML schemas also support other annotations, so you can further generalise your configuration readers by recording the necessary bindings in the XML schema itself, allowing you to use it e.g. to define application user interfaces.
- Almost any text editor can do basic syntax validation preventing most typographic errors, and even better if they can read the schema
- XML schemas are extensible using <import>s, but namespaces still enforce some separation. You can define explicit points where plugins extend your configuration format using <any>
- Human editable - closing tags are noisy but more readable than }],{}] when non-programmers may have to edit these files just to add a few extra text fields to a UI.
- Better datatype support, eg datetimes, by using XML schemas. JSON's type support is too limited
- Support for comments!
- And once you've verified the schema... CSS selectors and DOM APIs to actually process the XML documents.
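A minimal sketch of the schema side (element name and types invented), showing restriction, documentation and a richer datatype in one place:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="retryDelay">
    <xs:annotation>
      <xs:documentation>Delay between retries, as an ISO-8601 duration.</xs:documentation>
    </xs:annotation>
    <xs:simpleType>
      <xs:restriction base="xs:duration"/>
    </xs:simpleType>
  </xs:element>
</xs:schema>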
YAML fixed quite a few things, but there are still no datetimes or, as far as I know, standardised approaches to defining schemas. And I've lost count of how many attempts exist to add schema information or namespacing to JSON...
But for markup... we may be better off to just use markdown inside CDATA blocks
> XML schemas give you a ready-to-use format to describe, restrict and document available configuration settings.
As someone that likes and uses s-expressions, I never thought I would find myself defending XML, but here we are in 2019, no one understands basic parsing theory anymore, and file formats have "evolved" to hot garbage like YAML and TOML.
> YAML and TOML were optimized for human reading and writing, like markdown.
That was the rationalization, in reality I don't know what actual writing use case they were optimized for (Notepad?). XML in a syntax-aware editor with the help of automatic schema validation makes it far easier to write than YAML or TOML.
I don't need a special editor to write out YAML (if it didn't have tab dependence) or TOML quickly. It's also a lot faster for a human to read vs XML. JSON is fairly readable, but a bit of a pain to write out compared to ini files, since you have to quote all strings.
If you're going into complicated things like schemas or other complicated structures, then you probably shouldn't be using YAML or TOML. I would mostly use them for config or other simple things.
At this point if you are not interacting with 20 year old java software, that was created when JSON didn't exist and XML was king, you should be using TOML for simple config, JSON for most things and heavyweight XML, protobuf or csv for the specialized cases. And while we are at it, markdown for simple documentation.
Even gradle decided to use groovy scripts as their config language because it's far more human readable and usable.
> At this point if you are not interacting with 20 year old java software, that was created when JSON didn't exist and XML was king, you should be using TOML for simple config, JSON for most things and heavyweight XML, protobuf or csv for the specialized cases.
But we can't always budget a nice configuration application.
And once we've just put simple XML file there for configuration and we're past the prototyping phase... "well this actually works good enough, let it be".
You likely need data model classes for the config anyway, along with support of serialization and deserialization (XML or not is not important). Use a stock property grid control, pass the root object of the config, and you’ll get a GUI that does the job much better than ASCII files.
I agree. The author and other commenters here have not proposed an alternative which provides two features I consider critical to a generic data format: schema specification and validation; and easy parsing support in many ecosystems. JSON and Yaml don't seem to have mature, widely adopted schema specification and validation. And I'm not aware of anything outside of XML, JSON and Yaml which have such wide parsing support in so many different ecosystems.
JsonSchema is pretty good -- it is not as widely used as XSD, but it has libraries for most languages.
And of course JSON is the easiest thing to parse in scripting languages, like Python and Ruby -- the entire API is one line, and then you have a native structure you can work with.
> JsonSchema is pretty good -- it is not as widely used as XSD, but it has libraries for most languages.
It's okay now that the newest revisions support more realistic use cases, but ironically I find it impossible to write as JSON... I write them as YAML which my validator supports natively :)
It's still not as nice as RELAX-NG's compact schema format though IMO :)
> But for markup... we may be better off to just use markdown inside CDATA blocks
No we aren't. markdown itself is literally specified as a shortform of HTML [1], and can be translated into canonical angle-bracket syntax using SGML short references (though not completely, eg. markdown reference links require unlimited forward lookup). This gives a canonical representation of markdown in SGML/XML even if you don't use SGML.
If you mean "users" as in "users of the XML format", then it's the fault of developers as 'users' of their applications don't have any choice in the matter.
> "But if the people who made the strange decision to use XML as a data format [...] they might realise that what they're doing is unsuited to it and unergonomic
The author's point is that XML should not be a data format.
Is there logic to the authors assertion past “that isn’t what XML was intended for”? It is a pretty nice data format if you want wide compatibility and schema integration.
XDF looks cool! Some of the design choices look pretty inefficient and arbitrary for a data format though, have you thought about rebranding it as a markup language? =] /s
The point being, I think the author is arguing you should use the right tool for the job, and XML not being designed for arbitrary data structures makes it not the right tool.
Just recently people have shown you can build a raytracing engine in SQL, but if someone was arguing we should call it SQLCycles and ship it in Blender, I'd definitely have a few objections!
This doesn't make any sense. W3C doesn't restrict/define use cases of XML; it defines the structure and semantics thereof. Using it as if it were an XDF document is perfectly ok, just like you can use XML data for your DASH stream etc... It's a structure on top of XML structure.
The author’s point is moot. XML is a markup language intended to convey semantics. By its nature it is a data markup language, because the intent is that a particular type of information lives in specific tags.
For humans, this is documents with certain meanings attached. For computers, it’s documents with certain meanings attached.
It’s all data. XML is a data markup language. It’s just that humans call it “semantics” in a “document.”
It parses the same regardless of the order. With the right way, is <item><key>Name</key><value>John</value></item> the same as <item><value>John</value><key>Name</key></item> ?
One thing this misses in the "dictionary" example is that tools (like xpath) push you towards "key in attribute" selection. One of the most common operations we do with dictionaries is lookup by known key, and storing the key in attributes makes it much easier.
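A sketch of what I mean (names invented):

<settings>
  <entry key="timeout">30</entry>
  <entry key="retries">5</entry>
</settings>

With the key in an attribute, the lookup is a one-liner: /settings/entry[@key='timeout']. With <key>/<value> child elements you end up writing something like /settings/entry[key='timeout']/value instead.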
"Key in attribute" is the correct way to do it, it's just that his examples are absolutely terrible and make no sense at all. A completely unstructured list of key-value pairs is overkill for any structured data format.
I had the impression mobile was some new try at frontend tech, but somehow iOS and Android threw a whole bunch of outdated stuff at me.
I mean, before the iPhone I designed an XML-based ETL config system and tried to avoid all the common XML errors; then I started doing a mobile app 10 years later and it's like all that knowledge was forgotten...
I remember when Neverwinter Nights 2 came out touting that all of its data files were in XML for ease of modifying in user extensions. So I had a look and was it XML? Was it fuck, it was like a GeoCities novice's idea of how to code HTML - absolutely unparseable in any way unless you're the idiot who decided what the codebase needed was another shitty homegrown parser.
And yes, plists suck and make your XPath selectors ugly, although you could write a function to abstract them out.
I know what you’re trying to do there - you’re trying to “future-proof” your schema by allowing introduction of arbitrary new elements. Which means that there’s no standard way to guard against somebody omitting a required field (like “name”) or adding a new field like “creditCardNumber” - other than to document your acceptable key values in a non-standard format and add defensive code that a validating parser would have given you. You’re better off taking as much advantage of the format as you can.
XML is/can be much more than a markup language and yes, it can be used very badly but this is usually by inexperienced 'data wranglers' who don't understand the difference between data attributes and data proper.
While XML can seem cumbersome (compared to JSON say) it is a very good 'data transport' tool when used correctly with a sensible schema (XSD).
For example, we use XML as a 'vendor neutral' data format to export/import CAD geometry and associated data for town utilities such as buildings, pipes, roads etc. All this data has to be validated against the schema to ensure its correctness.
Using a schema like this enables the city council to import this XML into the GIS system to be used for asset management, financial planning etc.
A good schema can be key to sharing XML effectively between departments/applications, and being a markup language this data can also be viewed independently using XSLT.
The larger your XML file is, the more accurately you're using it. Fewer " and more <>. I made these mistakes, using XML like I was writing an HTML doc.
It is difficult for me to see what the real issue is with the examples given. It seems to be more an aesthetic preference of the author than a technical argument. People can use formats for whatever they want. :P
If you told me that the transmission and parsing rate is too slow for their application, that's a real dig at it.
I think it's a bit like trying to explain why a Python REPL isn't a substitute for a calculator? Like it can of course do what you want, and you can't "see" what's wrong if you just take what you see literally (you'll obviously get the same answers regardless of what tool you use), but it's just... not meant for that.
For those wondering what you do with XML as a document markup language, see the XML document that is the specification for XML. I had to look at the page source to determine it really is an XML document. Looks like an HTML document.
I've got a stylesheet I wrote that turns HTML or XHTML into a fully indented and highlighted representation of itself which I was quite proud of :) I used to spend a lot of time writing XSLT and XQuery lol.
The idea, for those not familiar, is that once a work of art is published (a novel, a poem, a song, a painting), it speaks for itself, and authorial intent no longer matters.
That is, meaning and purpose are in the eye of the beholder/consumer. And there is no right or wrong way to "interpret" art. If someone finds meaning that the author did not intend, it is just as valid as a deeply hidden but intentional allegory they intentionally placed in when they were writing.
The relevance to software is it applies to APIs, specifications, standards and formats.
There is no such thing as users using your software or specification "wrong" - if they insist on doing so, the meaning has evolved. Evolve with it or die.
That's a little extreme but you raise a good point. I think a talented spec designer anticipates how their work might be interpreted / used / abused, and, like an adroit villain, nudges their audience toward tenets their grand scheme seeks to achieve.
XML wasn't an original invention; it is specified as a proper SGML subset. From the XML spec:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document.
Now I totally agree that SGML and XML aren't for service payloads and config files. The sole purpose of markup languages is representing structured text. And arguably, SGML fills this role much more adequately than XML today as it can represent (via the SHORTREF mechanism) custom Wiki syntaxes such as markdown and others, and in contrast to XML, can deal with the largest corpus of markup out there eg. can parse HTML with all its minimization features such a omitted tags, enumerated and unquoted attributes, etc. See [1] for a practical introduction (disclaimer: link to a tutorial I held last month at ACM DocEng).
You control whether an element requires start- and end-element tags in your element declaration via "O" (letter O as in "omissible") in the respective tag omission indicator position:
<!ELEMENT e - -(f,g,h) -- no tag omission -->
<!ELEMENT f O - (#PCDATA) -- start-tag omission -->
<!ELEMENT g - O (#PCDATA) -- end-tag omission -->
<!ELEMENT h O O (#PCDATA) -- both start- and end-tag omission allowed -->
turns out a well-specified format that has a lot of parsers available is useful for more than just a markup language. xml is great at data formatting, a little more verbose than alternatives but also a lot more feature-rich
The format wasn't well-specified and didn't have a lot of parsers in 1996 though. The parsers came after the decision was made by a lot of people to use a markup language inappropriately for structured data.
> a simple test for determining if an XML schema is well designed: remove all tags and attributes from it ... If what you have left over does not make sense ... you shouldn't be using XML at all.
Magento 2 (acquired by Adobe for $1.68bn) uses XML to render its layouts. Here's some fun XML for the checkout page:
I once had to write a data layer in xml, in-situ, with a lifespan of up to hours as more data was appended to it. An invalid xml document that you couldn't load in many xml apis for 99.9% of its lifespan.
I begged and pleaded with the lead architect to use an sqlite db for the elements of the data until the transaction was complete and then merely produce the xml file at the end, but no.
This is not a dictionary, it’s a record. Unfortunately this fairly fundamental distinction has been thoroughly muddied by certain languages that want to use associative arrays for everything.
I'm not the one doing that. People using XML as something else than a document markup language are. Which is what the article author is complaining about. But if you really want to do that, then at least do it properly. Record fields have predefined names, just like XML elements have predefined names. Dictionary keys can be arbitrary, like XML text nodes or attribute values but emphatically not element or attribute names, unless your "XML" is actually just tag soup.
It would work (kind of); most XML parsers/generators would take care of escaping and unescaping quotes; but there's no way in the XML spec to escape characters in tag names.
The main issue I see with that is that if it's a true dictionary, then those elements will constantly be different, which is weird.
Now, if we're just encoding a dictionary that's an already an encoding of an object, then yeah, let's just encode the object directly like you are above.
Which works if you either expect a dictionary with a predefined set of fields (which isn't really a dictionary then), or parse your xml in a way that handles arbitrary tags. For a generic dictionary the shown approach is still the way to go, if you really want to use XML for that.
I can't find much to corroborate this article's take. RDF is a stark counter-example - a standard from the W3C. It has endorsement from Tim Bray, one of XML's co-authors.