

XML Sucks - gnosis
http://javaoldschool.blogspot.com/2009/01/xml-sucks.html

======
dasht
Hmm. I'm disappointed to see the article getting voted up but I suppose that
fairly reflects a certain amount of widespread frustration with XML.

As a standard, XML is actually pretty simple and, in my opinion, well thought
out. Alas, freely available high quality documentation to take someone from 0
knowledge to a good grasp appears to be lacking. Many tools which claim to
process XML are poorly written.

The author's list of complaints and some responses:

* "XML is very bad at handling binary data." That is mostly false. There is no obviously winner (widely adopted) standard for binary format XML although there will be some day. At that time, XML will be the best choice for new binary file formats and, I predict, processors for many legacy binary formats will gain DOM-based APIs. The surface syntax of textual XML requires that binary data be encoded into a safe character set. That's actually important for data interchange.

* "XML is incredibly verbose." Yes, yes it is. The primary sources of apparently needless verbosity are first, the need to mention the element name in closing end tags ("</foo>") and, second, the need to always wrap element names and closing tags (and attributes) in "<...>" pairs. The parent language (SGML) had mechanisms to relax those constraints and yield a more compact syntax. These were dropped in XML (for now) to make it easier to write parsers by getting the general case working first. Perhaps over time we will reintroduce a more flexible syntax or perhaps we'll discover we don't need it much.

* "DTDs use different syntax." That is a complaint about DTDs, not XML. Some schema languages use XML syntax and, as I recall, there is an XML syntax for DTDs. This complaint is mis-directed.

* "Confuses meta-data with content. Sometimes XML attributes contain values other times the value is in XML text." This statement is inaccurate and betrays the authors poor understanding of XML (see above about the poor state of XML documentation). Sometimes data is placed in element attributes and other times data is placed in XML _sub-elements_ (not necessarily "text").

An XML datum is an inductively defined finite tree structure. Each node may
contain text or further, shallower trees, and each node is labeled with a
type tag - the element name. The type of a node is parametric - it takes
parameters, and those parameters are the element attributes. Generally
speaking, if you have data that tells you something about the contents of an
element, or about how to interpret the element, it goes in attributes.
Otherwise, it goes into sub-elements.
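
A tiny illustration of that rule of thumb (the element and attribute names
here are invented), using Python's ElementTree:

    import xml.etree.ElementTree as ET

    # "currency" and "precision" say how to interpret the element, so they
    # live in attributes; the data itself lives in sub-elements.
    record = ET.fromstring(
        '<price currency="USD" precision="2">'
        '<amount>19.99</amount>'
        '<note>introductory offer</note>'
        '</price>'
    )
    print(record.get("currency"))        # USD
    print(record.find("amount").text)    # 19.99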

A _good_ criticism of XML would be that currently, attribute values may not be
arbitrary XML datums - but there is nothing in the definition of XML that
prevents fixing this in the future.

* "Unclear use of whitespace. Should a carriage-return after a tag be ignored? Should tabs be ignored?" This statement betrays the author's poor understanding of XML (see above about the poor state of XML documentation). In fact, XML specifications state two things that lead to the author's confusion, and some XML-based tools do confusing third things. (1) XML specifies a precise and simple canonicalization of whitespace. (2) XML specifies how to handle whitespace in tools which are manipulating XML source code rather than manipulating the canonical form. (3) Some tools do random, ad-hoc things - these are bad tools.

* "XML tools/libraries are very bad. The standard Java libraries have come a long way, but DOM parsing is still a chore." Many tools and libraries are very bad, some our wonderfully good. Because the author has a poor understanding of XML (see above about the poor state of XML documentation) he is not able to quickly tell the good tools from the bad and so the whole field looks to him a mess. I sympathize with the author quite a bit but if he understood XML better and stuck to good tools, his only complaint would be that there aren't enough good tools yet (though there are quite a few and you can get quite far with just what's already there).

The author complains: "XML has so many problems that an endless series of
patches have been pasted on it. Each patch, however, has made XML worse:
CDATA, namespaces, etc."

I am fairly certain that it is historically inaccurate to describe namespaces
or CDATA as "patches pasted on".

Namespaces are, in fact, one of the great strengths of XML. The real problem
is that too many programmers making use of XML don't understand them and
therefore fail to use them or fail to use them properly (see above about the
poor state of XML documentation). XML namespaces create a cooperative way to
allocate element and attribute names in a distributed and decentralized way
without introducing any central authorities not already present in the
allocation of URLs. In this way, XML element names are vastly superior to lisp
atom (symbol) names for tagging tree nodes and attribute names. Namespaces are
one of the more important advances that XML represents.
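
A small sketch of that decentralization (the namespace URIs below are
placeholders): two vocabularies can mix in one document because every name is
really qualified by a URI that its owner controls.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        '<order xmlns="http://example.com/shop" '
        '       xmlns:ship="http://example.org/shipping">'
        '<ship:carrier>slowpost</ship:carrier>'
        '<total>42</total>'
        '</order>'
    )
    # ElementTree exposes the fully qualified names in Clark notation.
    print(doc.find("{http://example.org/shipping}carrier").text)  # slowpost
    print(doc.find("{http://example.com/shop}total").text)        # 42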

I have less of a defense for CDATA but not no defense at all. On one level it
is a handy addition to XML surface syntax. At the DOM level, it helps preserve
the ability to reasonably "pretty-print" an XML datum. It is a bit ad hoc but
it causes very little harm and is often enough quite convenient.
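
The convenience is easy to show: inside a CDATA section you can paste
markup-heavy text without escaping it, and a conforming parser simply reports
it as character data (Python's ElementTree shown):

    import xml.etree.ElementTree as ET

    snippet = ET.fromstring(
        "<example><![CDATA[if (a < b && !done) { emit('<ok/>'); }]]></example>"
    )
    # No &lt; or &amp; escaping was needed in the source document.
    print(snippet.text)   # if (a < b && !done) { emit('<ok/>'); }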

The author opines: "It was clear to me when XML first came on the scene that
it was a bad specification for data interchange. How it came to be so widely
used is a mystery to me. The most plausible explanation is that XML became a
means for Microsoft, Sun, Oracle, et al to attempt to wield control over the
industry. Each company used XML as a basis for creating semi-proprietary
protocols that they could control."

XML has been adopted by far more players than Microsoft, Sun, and Oracle and
is used by some quite excellent free software products. So many very talented
programmers and teams have chosen XML that the author should consider applying
Occam's razor and the scientific method here. A simpler explanation than his
conspiracy theory is that the author's judgement that XML is a bad format is
simply mistaken. (See above about the poor state of XML documentation.)

Finally, the author concludes: "The industry appears to be moving away from
XML [....]" From where I sit, he is profoundly mistaken. Javascript toolkit
authors and similar have often moved to JSON and similar, for temporary
convenience, but in many other domains XML continues to make slow and steady
advances, receiving huge amounts of investments.

I really, really do not want anything I've said above to be taken as a
criticism of the author's person or skills. I don't mean to say that he is
professionally wrong to shy away from XML in his own work. (See above about
the poor state of XML documentation (and many tools).) His report is
interesting as a reflection of frustrations that people in some domains are
feeling. It's just mostly wrong on mostly every point (but for understandable
reasons).

~~~
tentonova
_"XML is very bad at handling binary data." That is mostly false._

No, it is not. Encode a reasonably sized JPEG (say, 500k-1MB) in an XML
document. Try measuring how quickly you can parse that XML document on a
lower-end device.

Now try encoding a batch of thumbnails (say, 30k each) in an XML document. Now
try seeing how fast you can parse 500 of them on a lower-end device.

 _There is no obvious winner (widely adopted) standard for a binary XML format
yet, although there will be one some day._

"Some day" doesn't do us any good today.

~~~
nicpottier
XML wasn't really ever made to be high performance though. It was made to be
easy to read for humans and relatively easy to generate. At those two things
it is awesome. If you are sticking binary data in XML and passing it over the
wire to someone you are probably 'doing it wrong'. :)

There are obviously much faster formats for that type of thing, but they have
their own drawbacks, which are significant. Protocol Buffers are crazy fast
but aren't self-describing or human readable. JSON is really just a different
representation of XML on many levels and certainly doesn't deal with binary
data any more elegantly. (base64 here we come!)

If you have binary data and plenty of CPU on both ends, then gzipping XML
actually gets you pretty close to the native size and can be a good solution.
You keep the human readability, and the compression can be done at the
transport layer if it is HTTP.
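
A rough way to see how much of the tag overhead gzip claws back (the numbers
will vary; this just generates a deliberately repetitive document and
compresses it with Python's standard library):

    import gzip
    import xml.etree.ElementTree as ET

    # Build a deliberately verbose, repetitive document.
    root = ET.Element("readings")
    for i in range(1000):
        r = ET.SubElement(root, "reading", {"sensor": str(i % 8)})
        ET.SubElement(r, "value").text = str(i * 3 % 100)

    raw = ET.tostring(root)
    packed = gzip.compress(raw)
    print(len(raw), len(packed))   # the repeated tags compress very well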

If you don't have much CPU and want easy parsing, it is very easy to write a
binary format for XML. I've done this for products that are sending down data
to cell phones (including images). The native format is XML, because it is
easy to author in templates and easy to debug, but with an extra parameter it
gets thrown into a denser and very fast to parse format for the other end to
deal with. This costs a bit of CPU on the server, but that generally isn't the
problem.

Anyways, I would agree that the main author's points are pretty ridiculous.
They either point to him not really understanding XML or never really
considering the pros and cons of the alternatives. XML is great for what it
is, and it has its drawbacks, but as a universal transport it is pretty rad.

~~~
tentonova
_There are obviously much faster formats for that type of thing, but they
have their own drawbacks, which are significant. Protocol Buffers are crazy
fast but aren't self-describing or human readable._

Why does that matter? The fact that they aren't self-describing more or less
requires that you use a published message definition, which solves one of the
major problems of XML schemas -- nobody actually uses them, and many that do
use them incorrectly.

 _XML is great for what it is, and it has its drawbacks, but as a universal
transport it is pretty rad._

Having wasted far too much of my life on parsing XML documents, I simply can
not agree.

XML as a universal transport just sucks compared to the modern alternatives.

~~~
nicpottier
Are you pitting XML against Protocol Buffers? Because they are completely
different things. Being self-describing is a big deal: it makes interchange
far, far easier and more obvious. So does being human readable; debugging a
web API in a browser or via curl is a serious plus.

Having just spent the past two days reverse engineering an API that used
protocol buffers, I can tell you I would have much rather done it against XML.
:)

As for parsing, I guess I don't get what the big deal is. I've long since
given up on SAX parsers and occasionally use pull parsers, but for the
majority of applications using a DOM parser gives you a very OO-friendly
approach to turning that XML into your objects, and it is dead simple. Since
the generated Protocol Buffer objects are immutable and can't be used as base
classes, I suspect the amount of code to turn a 'message' into your domain
objects and vice versa is very similar between protocol buffers and a DOM
impl. At least that has been my experience; neither is hard.
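
For what it's worth, the mapping layer I have in mind is only a handful of
lines either way. A sketch (element names invented, Python's ElementTree
standing in for a DOM):

    import xml.etree.ElementTree as ET
    from dataclasses import dataclass

    @dataclass
    class Person:
        name: str
        age: int

    def person_from_xml(el):
        # One small adapter per message type; type conversion is the only
        # real work.
        return Person(name=el.findtext("name"), age=int(el.findtext("age")))

    doc = ET.fromstring("<person><name>Ada</name><age>36</age></person>")
    print(person_from_xml(doc))   # Person(name='Ada', age=36)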

~~~
tentonova
_Having just spent the past two days reverse engineering an API that used
protocol buffers, I can tell you I would have much rather done it against
XML._

Why didn't you have the descriptor files? Optimizing for ease of external
reverse engineering seems rather counter to the authoring organization's
actual goals.

If you want third parties to be able to interoperate, give them the
descriptors you use.

 _As for parsing, I guess I don't get what the big deal is._

Writing lots of code to read and parse data types as text, especially when no
schema is available, is time consuming.

Protobuf is necessarily validated. The input types are known-correct,
conversion to the domain model is _direct_ and straight-forward, and it's
exceptionally easy to generate code or otherwise automatically deserialize to
native representations.

~~~
nicpottier
I didn't have the descriptor files because it wasn't an officially supported
API. And that happens, lots! Oftentimes people aren't interested in nailing
down their API enough to 'officially' publish an interface. XML makes it easy
for you to use it anyway.

Parsing XML doesn't require lots and lots of code, at least not for me. As I
said, the amount of code to parse a Proto-built class into my own data
structure is almost identical to me doing it from a DOM tree, save for some
type conversion.

The other plus for XML is that protocol buffers involve firing up extra tools
and incorporating them into your project. How often do you just want
something really simple, say the current temperature in a zip code? That's a
few lines of code to pull out via XML, without the trouble of pulling down
the protocol buffer definition, building the classes for your language, etc.
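
Something like this is what I mean; the feed URL and element names are made
up for illustration, but the shape of the code is typical:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical XML weather feed; the URL and element names are invented.
    url = "http://weather.example.com/current?zip=98101"
    with urllib.request.urlopen(url) as resp:
        doc = ET.parse(resp).getroot()

    print(doc.findtext("temperature"))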

------
prodigal_erik
There's one case where I like XML: interoperating with someone who, starting
with a clean sheet, would have concocted something worse than XML (EDI is a
popular example).

That said, I really don't see the value of whitespace and comments in our RPC
traffic. I wish the industry were mature enough to successfully use ASN.1 for
this stuff.

~~~
ct4ul4u
There must be some campaign of secrecy around ASN.1. When I mention it to
otherwise well-informed developers, they have never heard of it. It doesn't
help that the really high performance marshalling libraries aren't open
source.

~~~
eugenejen
There was a real-time blog search engine, pubsub.com. Its internal messages
were encoded in ASN.1. Bob Wyman, CTO of pubsub.com, was part of the ASN.1
committees in the '80s.

How can no one have heard of ASN.1? Andy Tanenbaum's Computer Networks
mentions it. SSL uses part of it.

~~~
bobwyman
Eugene, I wasn't actually a formal member of the standards committee;
however, since my products at DEC (ALL-IN-1 mail, etc.) were some of the
earliest to use ASN.1, I was very involved in the standardization process.
ASN.1 is _everywhere_ whether or not people realize it. Just about all
telecommunications systems (cell phones, etc.) use it internally. As you say,
SSL uses it, since public key certificates are encoded in a form of ASN.1;
the IETF SNMP protocol also relies on ASN.1. Other examples would include the
X.400 and X.500 standards and LDAP, which is a simplified version of X.500.
You may be interested to know that internally at Google, we use the "Protocol
Buffer" format instead of XML for all internal processing. That format is,
like ASN.1, a "tag/length/value" binary encoding. For more information see:
<http://code.google.com/apis/protocolbuffers/>

~~~
eugenejen
Bob, I know that, and Facebook uses Thrift, which is now an Apache Incubator
project. I may need to deal with some large data sets in the Cassandra
database, which communicates with external programs in Thrift-format
messages.

As a side note, I hope Google will finish real time search api soon and open
the firehose :-)

------
nailer
> Very bad at handling binary data

Agreed. This is common to most text formats, though. It is why most XML apps
use URL relationships to refer to binaries.
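
That is, instead of embedding the bytes, the document usually just points at
them (a sketch; the names and URL are invented):

    import xml.etree.ElementTree as ET

    # Refer to the binary by URL rather than embedding it.
    att = ET.Element("attachment", {
        "href": "http://example.com/media/photo.jpg",
        "type": "image/jpeg",
    })
    print(ET.tostring(att, encoding="unicode"))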

> Incredibly verbose

There's a great deal of confusion between XML the data structure and XML the
common method of serializing that data structure, probably because they have
the same name.

Primarily, we deal with XML the data structure. If we create something from
scratch, or once we de-serialize what was sent to us, we're appending,
inserting, iterating, etc. over element trees. The data structure is very
simple; how it looks when you serialize it is up to you. Like the article's
author, I think tag soup looks horrible, so I serialize into a Python-style
indented-block format when printing (a sketch of that is below). I might get
round to doing an input format as well.
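
Roughly what that pretty-printer does (a simplified sketch; a real one would
handle attributes and mixed content more carefully):

    import xml.etree.ElementTree as ET

    def show(el, depth=0):
        # Print an element tree as Python-style indented blocks instead of
        # tag soup.
        pad = "  " * depth
        attrs = " ".join("%s=%s" % kv for kv in el.attrib.items())
        text = (el.text or "").strip()
        line = pad + el.tag
        if attrs:
            line += " " + attrs
        if text:
            line += ": " + text
        print(line)
        for child in el:
            show(child, depth + 1)

    show(ET.fromstring('<book lang="en"><title>Dune</title>'
                       '<year>1965</year></book>'))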

Of course, it would be nice if the default serialization format sucked less.

Extending your point about tools, most people don't really have good XML
editors. There's no reason I shouldn't be able to fire up 'vix' and delete,
insert, etc. elements and attributes. In the Unix world in particular the
tools are still based around presentation rather than structure.

------
drhowarddrfine
I think the author of that article sucks. He got himself in trouble with me by
saying his first complaint was XML doesn't handle binary data. wtf? Must be
the amateur hour tonight.

