What's frustrating with XML? (stackoverflow.com)
28 points by julien on Aug 21, 2010 | 25 comments


Closed as subjective and argumentative. Darn, I wanted to pick a fight there. ;)

Many of the really crappy aspects of XML have been thoroughly abstracted away by libraries. I'm of the opinion that you can only truly despise XML if you've tried to write a parser for it yourself.

Examples of the issues: It requires arbitrary lookahead & backtracking. There is no canonical document encoding. (Support arbitrary character encoding for content: great idea! Support arbitrary encoding for the metadata/XML itself: The opposite of a great idea.) Entity references: need I say more?

There's a reason this sort of thing keeps coming up: http://voices.washingtonpost.com/securityfix/2009/08/researc...

In an ideal world, XML would be (a) simple and (b) unambiguous; it fails on both counts.

All that said, there's a huge benefit in the whole world arriving at a somewhat standard way of doing things, and a lot of that benefit remains even if the standard itself really sucks.


XML isn't bad in and of itself. It's just a powerful, neutral format. The problem with XML is that it allows so much abuse.

I had to interface with a government system at some point. They shipped us a whole schema of custom elements such as "IsShipment" (for example) which extended bool to allow extra options. All this was clearly documented in the schema comments "You can use 'true', 'false', 'sortof', and 'mostly'." So we go to validate the data.... and it doesn't validate. Against their schema. Because they didn't extend bool, they just said they did in the comments. When we got back in touch with them, we realized that they had no clue that the stuff not in comments mattered. As far as they knew, XML was just text, and they had clearly specified how it was to be interpreted (in English).
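
For illustration, here's a minimal sketch (Python with lxml; the schema and element names are hypothetical, loosely based on the "IsShipment" anecdote above) of why only the declared type matters to a validator, not the comments:

    from lxml import etree

    # What the comments *claimed*: "you can use 'true', 'false', 'sortof',
    # and 'mostly'".  For that to validate, the schema has to say so
    # explicitly, e.g. as an enumeration -- prose in a comment doesn't count.
    schema_doc = etree.XML(b"""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="IsShipment">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="true"/>
            <xs:enumeration value="false"/>
            <xs:enumeration value="sortof"/>
            <xs:enumeration value="mostly"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:schema>
    """)
    schema = etree.XMLSchema(schema_doc)

    doc = etree.XML(b"<IsShipment>sortof</IsShipment>")
    print(schema.validate(doc))   # True, because the enumeration is declared.
    # Had the element simply been declared as type="xs:boolean", 'sortof'
    # would fail validation no matter what the comments promise.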


> XML isn't bad in and of itself.

I'm arguing that it is. Symptoms of badness: version 1.0 of the basic standard is on its 5th edition over ~8 years. Said standard is stupendously huge. Major security/crash & other bugs in pretty much every parser (https://www.cert.fi/en/reports/2009/vulnerability2009085.htm...) as of less than a year ago. XHTML's abandonment. In short, XML stinks and there's plenty of evidence out there that it does.

Please note that doesn't mean I would wish XML away. Widespread adoption, in and of itself, is a killer feature.


You make me wonder if the hate is partly due to XML being used for integration, and integration being difficult, problematic, and vulnerable to all kinds of communication problems.

And whether JSON tends to be used where integration isn't very challenging (e.g. you control both ends and the data is fairly regular and simple, or one end is entirely determined by the other); that is, where schemas aren't needed.


Could you explain the arbitrary lookahead and backtracking, please? I thought UPA (unique particle attribution, or "deterministic" for DTDs) avoided the need for backtracking... Or is it to do with entity references? I've found it simple to write XML parsers, but that's for a common (de facto?) subset, not the full spec.


Note that I was talking about parsing a stream of text into a DOM tree or SAX events, not parsing a DOM tree or SAX events into some second-order structure.

Lookahead: Your parser will need to read the entirety of a tag (from the '<' to the '>' or '/>') to know whether it is a start tag or a complete tag. I think my use of the word backtracking was a wee bit sloppy, although the above situation may cause it in some types of LL(*) parsers. http://www.antlr.org/wiki/display/ANTLR3/3.+LL%28*%29+Parsin...

(Another parsing gotcha: XML isn't even context free, since the start and end tags must match and, for a general parser, there's an infinite number of possible tags.)
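
To make the lookahead point concrete, here's a toy sketch (not a real parser; the names are made up) showing that a tag can't be classified until everything up to its closing '>' has been buffered:

    def classify_tag(stream):
        """Read one tag from an iterator of characters positioned on '<'."""
        buf = []
        for ch in stream:
            buf.append(ch)
            if ch == '>':            # can't decide anything before this point
                break
        tag = ''.join(buf)
        if tag.startswith('</'):
            return 'end', tag
        if tag.endswith('/>'):
            return 'empty', tag      # <foo/>: a start and end tag in one
        return 'start', tag          # <foo>: an end tag must follow later

    print(classify_tag(iter('<item id="1">')))    # ('start', '<item id="1">')
    print(classify_tag(iter('<item id="1"/>')))   # ('empty', '<item id="1"/>')

(The toy ignores '>' inside attribute values, comments, and CDATA, which is exactly the sort of thing that makes a real parser hairy.)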


OK, thanks, I see what you mean.


> I'm of the opinion that you can only truly despise XML if you've tried to write a parser for it yourself.

Or had to work with any of the utterly terrible XML dialects out there, which represent about 90% of them, first and foremost XSLT and XSD. Or had to deal with the various bugs in parsers, or with most people's (and most software's) complete and utter inability to deal correctly with XML namespaces.


Strictly speaking, XML could be extremely well done and still have horrible horrible things done with it higher up in the layer cake. ;)

There's no surprise, though, that a shoddy foundation might encourage crappy houses.


Namespaces are both the genius and stupidity of XML.

The idea of being able to stitch different vocabularies together is genius.

That said, I haven't seen a single XML toolchain that doesn't have some bug or, ahem, irregularity, in how namespaces are handled. XLinq comes pretty close to being correct though.


I agree that the concept of namespaces is brilliant, but they're very difficult to read in practice. The rules around the different ways of specifying namespaces also seem unnecessarily complicated; I always need to review the XML schema spec and spend a bit of time on it.

I feel they should be as simple as (eg) Java packages; but the problem has different parameters in Java: there's less chance of collisions in Java, because you're just writing one module; whereas one XML document can merge different sources - it's as if arbitrary Java modules, written by unknown people, were combined into the one file.


The biggest problem is that a namespace can be declared anywhere, so technically it's impossible to use a streaming reader unless you stream through the document twice. Also, XLink and the other ways of dynamically composing XML are too complicated.


I don't really get why, since you can't use a namespace before it's defined and namespaces are scoped. The only reason why you'd have to stream through twice is if you wanted a list of all namespaces in the document at the start of your parsing, and why would you want that (let alone care about it)?
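
A minimal sketch of that point (stdlib ElementTree; the URIs are made up): a single streaming pass is enough, because each element arrives with its name already resolved against whatever declarations are in scope at that point.

    import io
    import xml.etree.ElementTree as ET

    doc = b"""<root xmlns="http://example.com/a">
      <child xmlns:b="http://example.com/b">
        <b:item/>
      </child>
    </root>"""

    for event, elem in ET.iterparse(io.BytesIO(doc), events=("start",)):
        print(elem.tag)
    # {http://example.com/a}root
    # {http://example.com/a}child
    # {http://example.com/b}item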


... which reveals the real problem with namespaces, which is that "nobody" actually understands them. There really isn't that much to them, in my opinion, but every time I encounter them in the wild, they're never implemented correctly, with the variance ranging from really, really wrong to just sort of off. XMPP is the closest to correct, but they still screwed up in that a user's connection to an XMPP server is under one namespace, and a component is done under another, yet an <iq> packet with no namespace qualification is supposed to be treated as the same packet in both, despite being two different packets. (I would accept simply acknowledging that they are the same somewhere, but I've never found it.)

Some people use namespaces in what appears to be a decorative manner. Some people mandate that the prefix be a certain thing: <x:tag xmlns:x="thing"> works while <y:tag xmlns:y="thing"> doesn't, proving they're doing it wrong under the hood. Some things get it right with xmlns:*, but don't understand what the bare xmlns itself means, so they only trigger namespace logic if there's a colon in the tag. I've seen cases where the namespace is declared, then used out of scope like it's a global declaration or something. I'm still waiting to encounter the standard or software that actually uses them correctly. And despite this listing of wrong answers, it really isn't that complicated...
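
A minimal illustration with stdlib ElementTree, using the two spellings from the comment above: a namespace-aware parser reduces both to the same (namespace URI, local name), so any code that keys off the prefix is keying off something that isn't supposed to survive parsing.

    import xml.etree.ElementTree as ET

    a = ET.fromstring('<x:tag xmlns:x="thing"/>')
    b = ET.fromstring('<y:tag xmlns:y="thing"/>')
    print(a.tag)           # {thing}tag
    print(b.tag)           # {thing}tag
    print(a.tag == b.tag)  # True: the prefix is gone after parsing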


> There really isn't that much to them, in my opinion, but every time I encounter them in the wild, they're never implemented correctly, with the variance ranging from really, really wrong to just sort of off.

My biggest issue with namespaces is the dichotomy between namespace URLs and namespace prefixes, and most people not understanding that prefixes are actually aliases for the URLs. For that reason, I quite like Clark's notation for namespaces (which is used by ElementTree and LXML), it makes the relation between element and namespace much clearer. Shame you can't use it in XML documents or XPath queries.
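
For what it's worth, the usual workaround (sketched here with lxml; the URI is made up) is Clark notation in find() but a prefix-to-URI mapping for XPath, where the prefix is arbitrary and local to the query:

    from lxml import etree

    doc = etree.XML(b'<r xmlns="http://example.com/ns"><a>hi</a></r>')

    # Clark notation works in find()...
    print(doc.find("{http://example.com/ns}a").text)   # hi
    # ...but XPath needs a prefix->URI map passed alongside the expression.
    print(doc.xpath("//n:a/text()",
                    namespaces={"n": "http://example.com/ns"}))   # ['hi']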

Also, that the default namespace only applies to elements, not attributes. I kind-of understand why they did that, but it's still very annoying.

> Some people use the namespaces in what appears to be a decorative manner. Some people mandate that the prefix be a certain thing <x:tag xmlns:x="thing"> works while <y:tag xmlns:y="thing"> doesn't

Oh yeah. Isn't it Maven or something that does that? Or did? I know I encountered it once or twice. I was using ElementTree 1.2 at the time (the one that went into the Python stdlib... and is still there), which doesn't keep track of XML namespace aliases (or defaults, for that matter, and doesn't let you set them short of hacking through the private and undocumented namespace map), so everything comes out as `ns0:foo`, `ns1:bar`, ... Perfectly valid, and then you have a broken tool which doesn't actually understand namespaces (even though its example documents say you need a namespace spec) and wants an element called `foo:bar` and not "the element `bar` in the namespace http://foo.com".
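
(For what it's worth, later ElementTree releases -- 1.3, the one in the Python 2.7/3.x stdlib -- grew a public hook for exactly this, so you can serialize with whatever prefix a prefix-sensitive tool insists on. The URI and prefix below are made up.)

    import xml.etree.ElementTree as ET

    ET.register_namespace("foo", "http://foo.com/ns")
    elem = ET.Element("{http://foo.com/ns}bar")
    print(ET.tostring(elem))   # b'<foo:bar xmlns:foo="http://foo.com/ns" />'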

> Some things get it right with xmlns:*, but don't understand what the bare xmlns itself means, so they only trigger namespace logic if there's a colon in the tag.

When that happens, somebody ought to get shot. Default namespaces are one of the most basic parts of namespaces (and it's not that hard to parse, though production might be a different issue).

> I'm still waiting to encounter the standard or software that actually uses them correctly.

libxml2 tended to work quite well in my experience (mostly through lxml), though I don't doubt I just missed its bugs.

edit: damn it, is there no way to escape those damn asterisks in yc?


Well, to be fair, the best parsers do handle it correctly. Frequently the binding logic in Perl or Python or whatever will then proceed to get it wrong! And if you've got anything more complicated than a straight C binding and it actually tries to do stuff for you, you can just forget about it working correctly.

When I say I'm still waiting to see the software that does it correctly, I mean like end-user-level software, not the parsers.


It appears I was incorrect. Thank you. As I thought about it I couldn't recreate the issue I recalled having -- it may have resulted from using a buggy xml parser and/or a misunderstanding of the specification. But I should clarify, I am a huge proponent of namespaces -- I think they're great, I'd just prefer them to be declared at the start of the document.


> But I should clarify, I am a huge proponent of namespaces -- I think they're great

They're nice, but generally misimplemented, misunderstood and misused.

> I'd just prefer them to be declared at the start of the document.

I'm not sure why. It's nicer when you have to read the document yourself, for sure (but then again, XML is rarely actually nice to read), but when processing it mechanically it shouldn't be relevant: just use each node as a (ns, name) pair, where ns is the namespace's URI (not its alias) and name is the localname. In that case, what does early declaration bring to the equation?


I've been thinking about a way to represent RDF namespaces in JSON... I've seen RDF-in-JSON proposals that I don't like, because they involve whole URLs as keys, and I'm afraid that wouldn't work well in every JSON stack that's out there.


There is a type-system (and graph) encoding for JSON called JSYNC (http://jsync.org) which is compatible with YAML's model.


It convinced a generation of framework designers not to bother designing a decent concrete syntax for their domain-specific languages. When the framework is small and/or its developers probably couldn't hire a good language person anyway, this might be for the best, but it's a shame when a huge, well-funded and otherwise fairly well designed beast like WPF/Silverlight/XAML is trapped behind tasteless syntax.

On a deeper level, the element/attribute distinction unnecessarily mucks up the data model, but I'm not sure how big a problem that is in practice.


The problem with JSON is that it isn't very useful by itself: you always have to encode and decode it, and then analyze the structure to do something with the data. It works better as part of a protocol than as a native data format.

If you convert your JSON data to XML (assuming it is structured in a way that makes it lossless) you have a whole lot more useful tools at your disposal.
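
A minimal sketch of that claim (Python with lxml; made-up data, and only handling dicts, lists, and strings): push the JSON into elements and generic XML tooling such as XPath becomes available.

    import json
    from lxml import etree

    def to_xml(name, value):
        elem = etree.Element(name)
        if isinstance(value, dict):
            for k, v in value.items():
                elem.append(to_xml(k, v))
        elif isinstance(value, list):
            for v in value:
                elem.append(to_xml("item", v))
        else:
            elem.text = str(value)
        return elem

    data = json.loads('{"order": {"id": "42", "lines": [{"sku": "a"}, {"sku": "b"}]}}')
    root = to_xml("root", data)
    print(root.xpath("//sku/text()"))   # ['a', 'b']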


People do seem to like abusing attributes... A surprising number of systems seem to rely on embedding entire XML documents in attributes within other documents.


"We're an XML shop": who says that?


The most horrible XML encoding I've encountered recently was the GNOME/KDE XDG desktop menu system.

Rather than encoding which menu item is in which pseudo-folder, it encodes every change made to the menu system as a diff and expects applications to piece these together -- and the libraries that parse this monstrosity all huffily say "this code is NOT stable...".

http://standards.freedesktop.org/desktop-entry-spec/latest/

Of course, this isn't so much XML's fault as the fault of the folks who kludged together this monstrosity. It does show, however, how XML is more or less a tool for knitting together two or more generally poorly-specified encodings. The good news is that these might be somewhat better inside XML than running about wild; the bad news is that it lets them continue to exist at all. See Microsoft's Office XML "standard".



