A brief XML rant (atl.me)
129 points by alsothings on March 4, 2013 | 73 comments



> do not use template languages to generate XML.

Small correction: do not use text template languages (Jinja, Mustache, ERB — which seems to be the one used here, considering `%= display_date %>` — raw PHP, Smarty, FreeMarker, what have you) to generate XML. There are templating languages whose primary use case is to generate markup (including XML)[0] and (unless they're broken to uselessness) they should guarantee the output is valid XML.
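
A quick sketch of the difference in Python (Genshi here, as an illustration rather than a reference): the template itself must parse as XML, so a stray unclosed tag fails when the template is built instead of silently producing broken output, and substituted content is escaped on render.

    from genshi.template import MarkupTemplate

    # The template source has to be well-formed XML; an unclosed
    # <title> here would raise a parse error immediately.
    tmpl = MarkupTemplate(
        '<entry xmlns:py="http://genshi.edgewall.org/">'
        '<title py:content="title"/>'
        '</entry>'
    )
    # Substituted content is escaped by the serializer:
    print(tmpl.generate(title='Fish & chips').render('xml'))
    # <entry><title>Fish &amp; chips</title></entry>

A text templating engine has no such parse step: `<title><%= title %></entry>` renders just as happily as the correct version.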

> Schema-design-wise, the content:encoded and excerpt:encoded element names are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML.

They seem to be using WordPress's WXR import/export format, hence the wp-namespaced elements. The "content" and "excerpt" namespace garbage comes straight from there, according to http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended...

> <content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.

> <excerpt:encoded> This is an unknown element. This is a summary or description of the post, often used by RSS/Atom feeds.

Considering the cottage industry of WordPress interaction, it was probably a good move to shoot for interop (it should allow Posterous exports to be directly imported into WordPress?). Not sure they succeeded, though.

[0] Genshi, for instance: http://genshi.edgewall.org/


   There are templating languages whose primary use case is to generate markup 
   (including XML)[0] and (unless they're broken to uselessness) they should 
   guarantee the output is valid XML.
Since they are using Rails, they should be using Builder for this: http://api.rubyonrails.org/classes/ActionView/Base.html#labe... https://github.com/jimweirich/builder


> Since they are using Rails, they should be using Builder for this

Indeed. It's really odd that they munged together an XML export in ERB when Builder exists. Does it have some sort of breaking issue with namespaces or something that could explain the choice?


Builder can be a bit of a pain if you want to do things that are... let us say "questionable" (e.g. output a tag with inner content which is _not_ XML-escaped).


There's no such thing in this case, though: there's a single layer of tags with escaped content inside. (The example document uses CDATA, but as others have noted, automated generation is not a good use case for CDATA.)


Agreed; just saying in general. Also, I would guess the reason they used ERB here is simply familiarity, not any type of reasoned decision.


RABL does a pretty decent job at generating XML too.


If you need layouts, RABL falls apart with Ruby 1.9: https://github.com/nesquena/rabl/wiki/Using-Layouts (there's a note at the end of the "Using Rabl" section).


While I don't recommend generating XML with Jinja2, it's actually not too bad at it. It will escape properly for you automatically, and unlike many other solutions in Python it actually supports streaming.

</biased response>


  Error on line 2: Closing tag for non-existent opening tag "biased"

  Error on line 2: Closing tags cannot have attributes


> It will escape properly for you automatically, and unlike many other solutions in Python it actually supports streaming.

True and true, but it does not guarantee the output XML will be valid: as far as Jinja's concerned it's all just text, is it not? Genshi also supports streaming (using `serialize`), will also properly escape everything, and — using the default xml serializer — ensures the output is valid XML.

(Edit: I want to note that I wasn't trying to put down Jinja; it's just the first text-based template language I thought of when trying to write down a list. It's a fine templating language, just not one for generating XML.)
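
To illustrate both points, a rough sketch (note that autoescaping is off by default in Jinja and must be opted into):

    from jinja2 import Environment

    env = Environment(autoescape=True)
    tmpl = env.from_string('<title>{{ t }}</title>')
    print(tmpl.render(t='Fish & chips'))  # <title>Fish &amp; chips</title>

    # Streaming: chunks come out lazily rather than as one big string,
    # but nothing checks that the chunks add up to well-formed XML.
    for chunk in env.from_string('<post>{{ body }}</post>').stream(body='hi'):
        print(chunk, end='')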


Agreed on your stipulation on Genshi and the like.

And thanks for the further reverse engineering of the likely intent of the export. I wouldn't disagree with most of the WP-centric design choices. But attempting to run the output through a real XML parser might've been a good idea as well. (And I note there's a fair bit of complaint on the WP forums about the difficulty of using the data for import.)


Just generated an export from Posterous. It's not just namespaces: the XML files contain HTML entities (&nbsp;, for example) that XML doesn't define. What a mess.


One thing that bugs me about this is the use of CDATA. CDATA sections are just about OK in hand-crafted XML, but in machine-generated XML they are absolutely pointless, and usually hint that the coder doesn't know what they're doing.

For example, the author thinks that the content inside the CDATA is escaped, but it isn't necessarily: in this case they're including chunks of HTML which may contain more CDATA sections, and of course CDATA sections don't nest (you need to terminate and restart the section). I've also seen examples where the enclosing document's encoding and the encoding of the CDATA section were incompatible.
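
A concrete sketch of the nesting problem (plain Python string munging, just to show the shape of it):

    # CDATA sections don't nest: the first "]]>" terminates the section,
    # wherever it came from. Wrapping HTML that itself contains CDATA
    # therefore produces a malformed document:
    body = 'foo <![CDATA[bar]]> baz'
    broken = '<content><![CDATA[' + body + ']]></content>'

    # The usual workaround splits the terminator across two sections:
    safe = body.replace(']]>', ']]]]><![CDATA[>')
    ok = '<content><![CDATA[' + safe + ']]></content>'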

The worst thing is specs with CDATA sections in examples. Junior devs bend over backwards to use things like XSL's disable-output-escaping to get a character-for-character match in test results, and then wonder why their code breaks in production.


Outside of a few special cases (such as wanting to make embedded content in XML human-editable), CDATA should be treated as a big warning flag that the author of the code that generated the XML doesn't really understand what they are doing.

There's always the issue that one day ']]>' will somehow sneak in and everything will break.

The key is using a tool to generate the XML that will transparently handle things like escaping correctly instead of using templating tools designed for text or HTML output.
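
For instance (Python's standard library here, but any real XML serializer behaves the same way):

    import xml.etree.ElementTree as ET

    item = ET.Element('item')
    ET.SubElement(item, 'title').text = 'Fish & <chips>'  # no manual escaping
    print(ET.tostring(item, encoding='unicode'))
    # <item><title>Fish &amp; &lt;chips&gt;</title></item>

No CDATA, no escaping bugs, no stray ']]>' problem: the serializer deals with all of it.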


I'm not sure making "XML human editable" should really be considered a special case.


To be fair to the Posterous team, they are doing a good job of fixing the bugs in the export as they are reported.

Hopefully they will get all of them fixed before the final shutdown.

If you want an easy way to get your Posterous export file cleaned up and into a more valid XML file, feel free to use the Import from Posterous option over at WordPress.com: http://en.support.wordpress.com/import/import-from-posterous...

We've spent some time on writing code which cleans up the XML file so that it can be imported into WordPress successfully.

You can then export a clean WXR file and import it elsewhere much more easily: http://en.support.wordpress.com/export/


I don't see any problem with this XML that can't be easily overcome.

The comment about GMT-offsetting the date is particularly petty, assuming the blog in question isn't about ephemerides. By and large, blog posts have dates. If you desperately need an hour offset from GMT, one might suggest that's your edge case, because, by and large, it doesn't matter.

Count me among those who would argue that the omission of a schema is a blessing.

I've wasted whole f*cking days of my life wrangling with so-called "non-amateur" XML. Invariably this was over-bloated XML with schemas that did nothing to help the discoverability or processing of the data. Plain and simple: XML is over-spec'd, and many data publishers, aided by their inflexible toolsets, pushed their XML beyond reason.

Be careful what you wish for.

I would take this XML, map it, iterate it, done! End of story. I don't think there's much to complain about here.


> Count me among those who would argue that the omission of a schema is a blessing.

TFA didn't ask for a schema; TFA asked for namespace declarations, because they're kind of necessary to parse namespaces with a namespace-aware XML parser. That has no relation to a schema. He only mentioned it in passing because `content:encoded` and `excerpt:encoded` make very little sense... schema-wise (not "in an XML-Schema document").

> I would take this XML

You can't "take this XML" because it's not XML. Once you know it's not XML you can "take this tag soup", shove it into a tagsoup library (maybe with some encoding-guessing beforehand) and hope things come out about right at the other end — with no insurance that this is the case, you're deep in GIGO land at this point — but you can't "take it and map it"


As someone who has used BeautifulSoup very happily without considering its etymology... is "tag soup" an actual term or just a very apt description you're using?

[edit: A quick search, which I should have conducted instead of posting this, shows this has at least been used before, and enough not to be deleted by Wikipedia editors for lack of notability. That's pretty funny.]


Hi, TFA here.

Masklinn is right: I didn't ask for a schema. I didn't ask for anything. I somehow expected well-formed XML in a directory full of .xml files. That's 90% of what the rant is about.

I wasted a handful of fscking years of my life editing a significant international standard that used a peculiar dialect of W3C XML Schema. I know from schema over-design. I'm just talking about understanding the bare basics of XML and seeing that 'excerpt:encoded' might not convey what you think it means when set next to 'content:encoded'.

And it indeed took less time to hack together a solution to extract the information I needed (yay sed!) than it did to write this quick rant. That's not the point. The point is that the hacks and workarounds should have been unnecessary. It passes the savings from one careless dev on as a cost to the countless others who have to deal with the data downstream.


He has a few valid complaints (by "a few" I mean one), but this is really not that bad compared to a lot of the XML floating around. No reason to be shocked.

"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."

I don't get this comment. I have never seen an XML parser that would refuse to parse XML without a namespace.

Am I missing something? Or is that just mindless hyperbole?


> Am I missing something? Or is that just mindless hyperbole?

Note that the document uses namespaces but does not declare them. In Python, both ElementTree and lxml will blow up when they encounter the first undeclared prefix (dc, from dc:creator).
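
For example (the exact error message may vary by version):

    import xml.etree.ElementTree as ET

    try:
        ET.fromstring('<item><dc:creator>bob</dc:creator></item>')
    except ET.ParseError as e:
        print(e)  # e.g. "unbound prefix: line 1, column 6"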


Ah, you're right, I did miss something =)

Still nothing to be "shocked" about, though.


TFA wasn't "shocked" (I suspect he was being slightly hyperbolic) at the invalidity-through-broken-namespacing alone; broken templating also had a hand in it. Simply exporting a post and proof-reading the output would have been sufficient to catch the latter.

Then again, you just have to put the output through any XML parser (it's not hard to find) to realize the document is completely broken, but...


You can't process XML that uses namespaces without a namespace declaration. A namespace prefix is just a shorthand for the namespace itself.

prefix:name-of-element doesn't mean anything by itself; you need to know what 'prefix' stands for.
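
That's easy to see with a namespace-aware parser: once the prefix is declared, it disappears entirely, replaced by the namespace it stood for (Python shown; other parsers do the equivalent):

    import xml.etree.ElementTree as ET

    doc = ('<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
           '<dc:creator>bob</dc:creator></item>')
    root = ET.fromstring(doc)
    print(root[0].tag)  # {http://purl.org/dc/elements/1.1/}creator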

As it is, this XML is not parsable; it's not well-formed and therefore it shouldn't even be called XML; it's just text with random tags thrown in.

It is, indeed, quite shocking.


Maybe it's just me, and it's probably wrong, but more than once I've pre-processed XML data by replacing every "namespace:tag" with "namespace-tag" so that I could easily parse the XML without having to care about namespaces. I've never been convinced that this feature has much use anyway.


You never understood what they were actually for, then. The namespace will have a schema, and the schema can be used to validate the elements of that namespace. Not used in a "just import the data!" scenario, sure, but a lot of people who use XML do care about that kind of validation.


> The namespace will have a schema, and the schema can be used to validate the elements of that namespace.

That's one option. But a schema (or DTD) is not mandatory, and not all schemas can easily be linked (few tools handle RELAX NG or Schematron for namespace specs).

The core purpose of XML namespaces, and that of any namespace really, is better modularity and composability by preventing name collisions. This is useful when manipulating XML via XML (e.g. XSL, Genshi, ...) or when using multiple XML dialects in the same file (either because they're orthogonal or because they complement one another), for instance. You could do it by explicit prefixing à la C or Objective-C, but it tends to get dreary, requires everybody's cooperation and generally looks bad (not that XML namespaces look overly sexy).
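
A small sketch of the collision-avoidance point: two vocabularies can each define a `title` element in the same document without stepping on one another, because the names expand to different namespaces.

    import xml.etree.ElementTree as ET

    doc = '''<feed xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:media="http://search.yahoo.com/mrss/">
      <dc:title>Metadata title</dc:title>
      <media:title>Media title</media:title>
    </feed>'''
    for el in ET.fromstring(doc):
        print(el.tag)
    # {http://purl.org/dc/elements/1.1/}title
    # {http://search.yahoo.com/mrss/}title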


Technically you "can", if you manage to find a non-namespace-aware XML parser: it'll parse `prefix:name` as the element name `prefix:name`.

> As it is, this XML is not parsable

It's parsable with a non-namespace-aware XML parser (ignoring tagsoup parsers, as we're pretending this is supposed to be an XML document).


That's true; but where would you find such a beast? AFAIK you can't switch namespace-awareness off in modern parsers, so you'd have to find a very (very) old version...


> AFAIK you can't switch namespace-awareness off in modern parsers

Apparently, the JDK's javax.xml.parsers.*Factory can return namespace-unaware parsers and — even stranger — do so by default:

http://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/D...

> Specifies that the parser produced by this code will provide support for XML namespaces. By default the value of this is set to false

http://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/S...

> Specifies that the parser produced by this code will provide support for XML namespaces. By default the value of this is set to false.

Whether they qualify as "modern parsers" can be debated and I didn't test their behavior, but there you are.


"Worse things have happened" is a very tempting and unfortunate dismissal... I think we all do it, but when something is broken, is it really that important what else has been broken that may have been worse?

Yes, shit happens, and it's never going to stop happening. Not in the face of all the misguided idealism in the world. But is being punched in the face OK because the puncher didn't use brass knuckles? If he did, is it still OK, because people have been shot in the face, and that's a lot worse than being punched?

Anyway, I thought the same about namespacing until that was addressed in a more constructive reply. So thanks for asking that question. :)


> Get off my lawn, you kids.

Isn't that what they were doing?


Upped for giving me a chuckle in the midst of some very heated XML discussion :)


There's nothing wrong with invalid XML - why is everyone complaining? C compilers should similarly take a stab in the dark about what the programmer meant if they encounter invalid syntax as well. And those linking errors always annoy me - it should just pick the closest matching symbol if the specified one can't be found.


This isn't really criticism of XML, though. You can do a good job of screwing up in any language or format.


It was not intended as a criticism of XML at all. XML is a perfectly cromulent standard. It is a criticism of amateurish use of XML.


Which is everywhere (XHTML, anyone?)


The "X" is for "extreme", right?


There are very few websites where XHTML is served with the proper MIME type. The MIME type is what triggers XML parsing mode in browsers, so in most cases XHTML is treated exactly as it deserves: as just a tag soup.


Yeah, but I've been using Haml for quite a while to generate markup. XML is horribly inefficient by comparison and prone to mistakes.


XML is so complex and obtuse that one can hardly blame the practitioners for misusing it.


XML is obtuse only when you try to read definitions of XML dialects written in XML dialects themselves; even then, it's understandable, though it is a "high-art" discipline of schema-world semantics.


I agree that <tag attribute="value">data</tag> is simple, but XML is, unfortunately, much more than that. And this article complains about that "much more" part.


I'll agree with your agreement; the human legibility of XML data often leads the novice programmer into making bad assumptions about the simplicity of implementing XML.

While it is possible to produce "well-formed" XML easily enough, validating against a schema is another matter. In this particular article's case, the well-formedness isn't even there.


For one-off transport XML, it's not much more than that: proper escaping, a declaration with the character set, and not using features you don't know how to use. The first two are solved by using a proper library, the third by common sense.


"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."

I would argue that any self-respecting XML parser should parse it just fine and shouldn't demand that the namespaces be declared at all.

"...invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML"

I don't think you understand the base concept of XML much. It is meant to be a generic container to hold whatever you want. XML in and of itself doesn't enforce node naming. Sure if you are talking about the official spec it does, but people pretty much globally use whatever node names they want. Don't have a cow.

"I haven’t been able to determine the intended encoding of the files"

Well maybe you should look into a parser that just parses as is without attempting to use some specific encoding.

Check out XML::Bare on CPAN for Perl. It will parse pretty much anything you throw at it, in any encoding. It leaves it up to you, the user, to decide what to do with the data after parsing.


> I would argue that any self-respecting XML parser should parse it just fine and shouldn't demand that the namespaces be declared at all.

The XML Namespaces specification unambiguously requires that a namespace be declared:

> The namespace prefix, unless it is xml or xmlns, MUST have been declared in a namespace declaration attribute in either the start-tag of the element where the prefix is used or in an ancestor element (i.e., an element in whose content the prefixed markup occurs).

A self-respecting XML parser would follow the spec. A namespace-aware XML parser must fault on undeclared namespaces.

Most XML parsers are namespace-aware.

> I don't think you understand the base concept of XML much.

Pot, meet kettle.

> XML in and of itself doesn't enforce node naming. Sure if you are talking about the official spec it does

Don't you feel like you're contradicting yourself a bit there?

> Well maybe you should look into a parser that just parses as is without attempting to use some specific encoding.

So he should look into parsers which do not parse XML and have no issue mangling the content? What are they going to do, assume the encoding is ASCII-compatible anyway and go to town? How wonderfully Anglo-centric.

> Check out XML::Bare on CPAN for Perl.

XML::Bare is an XML parser in the same sense that xhtml interpreted as text/html is an XML document: not in any way, shape or form. And if that's what you're shooting for, don't pretend to suggest an XML parser and suggest a recovering "soup" parser instead, something like html5lib or BeautifulSoup.

But herein lies the issue: I expect Posterous advertised their export as XML files, not as "encoding-deficient tag soup" (which it apparently is). I'm sure TFA would have had no expectations if he'd been told he was getting garbage, and would have relied on tagsoup-parsing and encoding-guessing (using whatever libraries are available in his language of choice).

As it stands, he did have the pretty basic and undemanding expectation that he could shove supposedly-XML files into an XML parser and get data.


You seem to know a lot about the XML specification. More than your parent and certainly more than me. That's great, and following specifications is good and all, but citing the spec as requiring that "a namespace-aware XML parser must fault on undeclared namespaces" does not give me any sense for why I would want it to. Put another way - what does the namespace declaration and halting error due to its omission accomplish for me?

Failing to define a content type is obviously dumb, but I can't seem to get riled up about leaving off namespace declarations.


> That's great, and following specifications is good and all, but citing the spec as requiring that "a namespace-aware XML parser must fault on undeclared namespaces" does not give me any sense for why I would want it to.

That's not really relevant though. XML is bondage and discipline.

> Put another way - what does the namespace declaration and halting error due to its omission accomplish for me?

Depends. You could think "nothing" which is basically the same mindset as using a tagsoup-parser for "xml" documents: as long as you can get stuff out of it do you really care?

The other way to think about it, and the way espoused by most XML specs (and really most serialization specs at all) is that if something goes wrong, what guarantee is it anything is right? If a namespace prefix is undefined, is it because nobody cares, because there's a typo or because the unsafe transport flipped some bytes? The parser can't know, and as is generally done in the XML world the spec says to just stop and not risk fucking things up (as it does when nesting is invalid, attributes are unquoted or decoding blows up).

What that accomplishes is the assurance that the document was correct as far as an XML parser is concerned, I guess?

If you don't care, you're free to use a tagsoup parser in the first place, after all.

> I can't seem to get riled up about leaving off namespace declarations.

I see it more from a "canary in a coal mine" point of view: namespace declarations not being there hints that either they're using a non-namespace-aware serializer (unlikely) or they're not using an XML serializer at all, and here dragons lurk. In this case it's confirmed that they're pretty certainly using ERB text templates to generate their XML, and that means the document could be all kinds of fucked up with improper escaping, invalid characters and the like.

Meaning maybe the export can't be trusted to have exported my data without fucking it up.


A central idea of XML was that - in reaction to the mess that was HTML - any tool that calls itself XML MUST barf loudly on anything that is not XML, so that you could never have a situation where one tool is happily calling something XML and another tool barfs on it. (Because, given the choice, humans will regularly mess it up but not notice unless their tool tells them so, and what works for one tool won't for another.)

The requirement of strictness everywhere is a precondition for interoperability between diverse toolsets.


"... what does the namespace declaration and halting error due to its omission accomplish for me?"

It encourages generation of proper XML.

If consumers accept invalid XML, guessing at what it's supposed to mean, then producers will become sloppier over time, since there's no penalty for failing to follow the specification. Eventually producers will be so sloppy that consumers will no longer be able to make meaningful guesses.


So it seems that we prefer XML which is easy to read. I have seen those files way too often. Like: <xml><item><key>1</key><value>Something</value></item><item....></xml>

Then you have to combine whatever keys and values are in the item tags. I've found these very annoying files to handle, especially when the key is X3 and the value is 83d: you have to look up every combination in some kind of mapping, because none of them tells you anything directly. At least it's easy to create files that fulfill the schema, because the complexity is pushed out of the XML level. Often these files are created by "upgrading" CSV to XML: let's just make the key the column number and put whatever is in that column into the value tag. Yes, attributes could be used, but often aren't.

Then you have to know that if key X contains value Y, then you also need to look for key Z, and hopefully it contains value N, or whatever.


I'd just like to take a moment to mention Nested Comments.

Oh, if I had £1 for every time I'd had to sift through lines and lines of code because I can't just comment out an element. I just can't comprehend why they'd need to reserve -- inside a comment.


> I just can't comprehend why they'd need to reserve -- inside a comment.

It's because the feature was inherited from SGML, first for commenting in element declarations (e.g. <!ELEMENT -- this is an element>) and then generalized to the whole document: in SGML, the grammar for a comment is

    comment declaration =
        MDO ("<!"), (comment, ( s | comment )* )?, MDC (">")
    comment =
        COM ("--"), SGML character*, COM ("--")
HTML — as an SGML application — theoretically inherited this feature (most UAs don't really implement it correctly, so it's not exactly safe to use sequences of dashes inside a comment: browsers may or may not toggle commenting). See http://www.howtocreate.co.uk/SGMLComments.html for a more extensive explanation, especially in relation to browsers (SGML-compliant comment handling used to be part of early Acid2, before being removed because it was a stupid idea).

Meanwhile XML took half of it, threw the rest away, and called it a day.
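
The half it kept is why you can't nest comments: `--` remains reserved, and any conforming XML parser will reject it inside a comment (Python shown, but this is per spec):

    import xml.etree.ElementTree as ET

    ET.fromstring('<a><!-- fine --></a>')          # parses
    ET.fromstring('<a><!-- not -- fine --></a>')   # raises ParseError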


It's ironic how many problems (large and irritating enough to justify blog posts or public spats) could have been avoided if someone had bothered to test beforehand.

If someone had done a trial export, they would have immediately seen the missing dates.


Who uses XML in 2013 anyways?


Anyone who wants to interoperate with software not written in 2013?


Shame on them.


What would you recommend to replace XML that handles arbitrary trees, namespaces, attributes, and tools that are built on this, e.g. XSLT?

I don't think XML is amazing, but it still has its place.


Put your torches out, it's just a joke :)


Skechers (http://www.skechers.com/). Go View Source on that.


In 2013 XML is widely used. What alternatives would you suggest?


Probably they used regexes to parse it. :)


http://stackoverflow.com/a/1732454/827263 for those who haven't seen it.


Yes, it's crap, but it would take a few minutes to clean this up with a couple of sed scripts to turn ns:tag into ns_tag or something to make it parseable.

Or you could prepend some fake namespace declarations.
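
A rough sketch of the sed approach in Python, for the record (the filename is made up, and this assumes prefixes only show up in tag names, which tag soup gives you no guarantee of):

    import re

    soup = open('posterous_export.xml').read()
    # Turn <ns:tag> / </ns:tag> into <ns_tag> / </ns_tag>:
    fixed = re.sub(r'<(/?)(\w+):(\w+)', r'<\1\2_\3', soup)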


Am I the only one bothered by the extremely oversized XML snippets? Or is it just me?

Chrome 25.0.1364.97 m


It's a pretty new redesign. I use Chrome myself, but shoot me a screenshot? hello at article_domain


It would take half a day's work to get that into any format you desire, so I don't think it fails its purpose.


No. Once you screw up encoding, the information is generally gone. It's not just a matter of munging; it's often a matter of having to grovel over the entire file, by hand, correcting things.

Programmers seem to love to think that encoding errors are a joke, but they aren't. The data is gone. That's a big deal. Why are you even writing a program in the first place if it's just going to output unrecoverable gibberish? So you can throw the onus on the user to figure it out?

And that's to say nothing of trying to recover the date.


It drives me bonkers. Use UTF-8. Use other encodings only when talking to systems that require it, and use those other encodings only when actually reading or writing the data. Translate to UTF-8 at the earliest opportunity, and translate from UTF-8 at the last possible moment, and only if you must.

This isn't the 90s. This stuff is basically solved now, except people can't be bothered to use the solution.
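
The "UTF-8 sandwich", roughly (`clean_up` stands in for whatever processing you actually do, and the source encoding is whatever the producer used):

    # Decode at the boundary, work on text inside, encode on the way out.
    with open('export.xml', encoding='utf-8') as f:
        text = f.read()
    text = clean_up(text)  # hypothetical processing step
    with open('clean.xml', 'w', encoding='utf-8') as f:
        f.write(text)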


Let's say you could have earned $100 per hour instead of writing your own "parser". Then suddenly 0.5 days is $400.



