

A brief XML rant - alsothings
http://atl.me/2013/posterous-xml

======
masklinn
> do not use template languages to generate XML.

Small correction: do not use _text_ template languages (Jinja, Mustache, ERB
— which seems to be the one used here considering `%= display_date %>`, raw
PHP, Smarty, FreeMarker, what have you) to generate XML. There are templating
languages whose _primary_ use case is to generate markup (including XML)[0]
and (unless they're broken to uselessness) they should guarantee the output is
valid XML.
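
As a minimal sketch of the difference (in Ruby, since that appears to be what Posterous used; Builder is the markup-oriented library suggested downthread, and the strings are invented for illustration):

    require "erb"
    require "builder"

    title = "Fish & Chips <3"

    # Text templating: nothing stops the '&' and '<' from reaching the output raw.
    puts ERB.new("<title><%= title %></title>").result(binding)
    # => <title>Fish & Chips <3</title>   (not well-formed XML)

    # Markup-oriented generation: the library escapes character data itself.
    xml = Builder::XmlMarkup.new
    xml.title(title)
    puts xml.target!
    # => <title>Fish &amp; Chips &lt;3</title>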

> Schema-design-wise, the content:encoded and excerpt:encoded element names
> are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and
> invented their own ad hoc analogous namespace prefix, rather than
> understanding the role of elements in XML.

They seem to be using Wordpress's WXR import/export format, hence the wp-
namespaced elements. The "content" and "excerpt" namespace garbage comes
straight from there according to [http://ipggi.wordpress.com/2011/03/16/the-
wordpress-extended...](http://ipggi.wordpress.com/2011/03/16/the-wordpress-
extended-rss-wxr-exportimport-xml-document-format-decoded-and-explained/)

> <content:encoded> Is the replacement for the restrictive Rss <description>
> element. Enclosed within a character data enclosure is the complete
> WordPress formatted blog post, HTML tags and all.

> <excerpt:encoded> This is an unknown element. This is a summary or description
> of the post, often used by RSS/Atom feeds.

Considering the cottage industry of wordpress interaction, it was probably a
good move to shoot for interop (should allow posterous exports to be directly
imported into wordpress?). Not sure they succeeded though.

[0] genshi for instance <http://genshi.edgewall.org/>

~~~
timdorr

       There are templating languages whose primary use case is to generate markup 
       (including XML)[0] and (unless they're broken to uselessness) they should 
       guarantee the output is valid XML.
    

Since they are using Rails, they should be using Builder for this:
[http://api.rubyonrails.org/classes/ActionView/Base.html#labe...](http://api.rubyonrails.org/classes/ActionView/Base.html#label-
Builder) <https://github.com/jimweirich/builder>
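
For illustration, a rough sketch of what generating that kind of document with Builder could look like (the element names and namespace URIs follow the WXR convention discussed above; this is not Posterous's actual code):

    require "builder"

    xml = Builder::XmlMarkup.new(indent: 2)
    xml.instruct! :xml, version: "1.0", encoding: "UTF-8"
    xml.rss "version" => "2.0",
            "xmlns:content" => "http://purl.org/rss/1.0/modules/content/",
            "xmlns:excerpt" => "http://wordpress.org/export/1.2/excerpt/" do
      xml.channel do
        xml.item do
          xml.title "Ampersands & angle brackets get escaped for you"
          xml.tag! "content:encoded", "<p>Post body, HTML tags and all</p>"
          xml.tag! "excerpt:encoded", "A short summary"
        end
      end
    end
    puts xml.target!

The namespaces get declared on the root, and the escaping is handled by the library rather than the template.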

~~~
masklinn
> Since they are using Rails, they should be using Builder for this

Indeed. It's really odd that they munged together an XML export in ERB when
Builder exists. Does it have some sort of breaking issue with namespaces or
something which could explain the choice?

~~~
mnarayan01
Builder can be a bit of a pain if you want to do things that are...let us say
"questionable" (e.g. output a tag with inner content which is _not_ XML
escaped).
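
If I remember Builder's API correctly (treat this as a sketch, not gospel), that escape hatch is appending raw strings with <<, which bypasses escaping entirely:

    require "builder"

    xml = Builder::XmlMarkup.new
    xml.tag!("content:encoded") do
      # << appends to the output verbatim, with no escaping; it's entirely on
      # you to make sure the string is well-formed in context.
      xml << "<p>pre-rendered HTML, passed through as-is</p>"
    end
    puts xml.target!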

~~~
masklinn
There's no such thing in this case though; there's a single layer of tags with
escaped content inside (the example document uses CDATA, but as others have
noted, automated generation is _not_ a good use case for CDATA).

~~~
mnarayan01
Agreed; just saying in general. Also, I would guess the reason they used ERB
here is simply familiarity, not any type of reasoned decision.

------
bazzargh
One thing that bugs me about this is the use of CDATA. CDATA sections are
just-about ok in hand-crafted xml, but in machine generated xml, they are
absolutely pointless, and usually hint that the coder doesn't know what
they're doing.

For example, the author thinks that the content inside the CDATA is escaped,
but in fact it isn't necessarily - e.g. in this case they're including chunks
of HTML which may contain _more_ CDATA sections, and of course they don't nest
(you need to terminate and restart the CDATA section). I've also seen examples
where the enclosing encoding and the encoding of the CDATA section were
incompatible.
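
A tiny hypothetical helper makes the failure mode concrete: the only way to keep "]]>" inside a CDATA section is to split the section around it, which naive emitters never do.

    # Hypothetical helper: "]]>" can't appear literally inside CDATA, so it has
    # to be split into "]]]]><![CDATA[>" (close the section, then reopen it).
    def cdata_wrap(text)
      "<![CDATA[" + text.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
    end

    html = "<script>//<![CDATA[ alert(1) //]]></script>"
    puts cdata_wrap(html)
    # Without the gsub, the inner "]]>" would terminate the section early and
    # dump the rest of the HTML into the document as raw markup.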

The worst thing is specs with CDATA sections in examples. Junior devs bend
over backwards to use things like xsl's disable-output-escaping to get a
character-for-character match in test results, and then wonder why their code
breaks in production.

~~~
gav
Outside of a few special cases (such as wanting to make embedded content in
XML human editable) CDATA should be treated as a big warning flag that the
author of the code that generated the XML doesn't really understand what they
are doing.

There's always the issue that one day ']]>' will somehow sneak in and
everything will break.

The key is using a tool to generate the XML that will transparently handle
things like escaping correctly instead of using templating tools designed for
text or HTML output.

~~~
mnarayan01
I'm not sure making "XML human editable" should really be considered a special
case.

------
westi
To be fair to the Posterous Team they are doing a good job of fixing the bugs
in the export as they are reported to them.

Hopefully they will get all of them fixed before the final close down.

If you want an easy way to get your Posterous Export file cleaned up and into
a more valid XML file, then feel free to use the Import from Posterous option
over at WordPress.com - [http://en.support.wordpress.com/import/import-from-
posterous...](http://en.support.wordpress.com/import/import-from-posterous/)

We've spent some time on writing code which cleans up the XML file so that it
can be imported into WordPress successfully.

You can then export a clean WXR file and import it elsewhere much more easily -
<http://en.support.wordpress.com/export/>

------
stblack
I don't see any problem with this XML that can't be easily overcome.

The comment about GMT-offsetting the date is particularly pithy, assuming the
blog in question isn't about ephemerides. By and large, blog posts have dates.
If you desperately need an hour offset from GMT, one might suggest this is
your edge case because, by and large, it doesn't matter.

Count me among those who would argue that the omission of a schema is a
blessing.

I've wasted whole f*cking days of my life wrangling with so-called "non-
amateur" XML. Invariably this was over-bloated XML with schemas that did
nothing to help the discoverability and the processing of the data. Plain and
simple, XML is over-spec'd and many data publishers, aided by their inflexible
toolsets, pushed their XML beyond reason.

Be careful what you wish for.

I would take this XML, map it, iterate it, done! End of story. I don't think
there's much to complain about here.

~~~
masklinn
> Count me among those who would argue that the omission of a schema is a
> blessing.

TFA didn't ask for a schema, TFA asked for namespace declarations. Because
they're kind-of necessary to parse namespaces with a namespace-aware XML
parser. That's got 0 relation with a Schema. He only mentioned it in passing
because `content:encoded` and `excerpt:encoded` make very little sense...
schema-wise (not "in an XML-Schema document").

> I would take this XML

You can't "take this XML" because it's not XML. Once you know it's not XML you
can "take this tag soup", shove it into a tagsoup library (maybe with some
encoding-guessing beforehand) and hope things come out about right at the
other end — with no assurance that this is the case, you're deep in GIGO land
at this point — but you can't "take it and map it".

~~~
obviouslygreen
As someone who has used BeautifulSoup very happily without considering its
etymology... is "tag soup" an actual term or just a very apt description
you're using?

[edit: A quick search, which I should have conducted instead of posting this,
shows this has at least been used before, and enough not to be deleted by
Wikipedia editors for lack of notability. That's pretty funny.]

------
gizzlon
He has a few valid complaints (by a few I mean one), but this is really not
that bad compared to a lot of the XML floating around. No reason to be
_shocked_.

 _"There are no namespace declarations. No self-respecting XML parser will
have anything to do with this XML data."_

I don't get this comment. I have never seen an XML parser that would refuse to
parse XML without a namespace.

Am I missing something? Or is that just mindless hyperbole?

~~~
bambax
You can't process XML that uses namespaces without a namespace declaration. A
namespace prefix is just a shorthand for the namespace itself.

prefix:name-of-element doesn't mean anything by itself, you need to know what
'prefix' stands for.
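
A sketch of what the declaration buys you (Nokogiri here purely for illustration): once the prefix is declared it's just a local alias, and queries bind their own alias to the URI.

    require "nokogiri"

    fixed = <<~XML
      <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <channel><item>
          <content:encoded>&lt;p&gt;Hello&lt;/p&gt;</content:encoded>
        </item></channel>
      </rss>
    XML

    doc = Nokogiri::XML(fixed)
    p doc.errors.empty?   # => true, once the prefix is actually declared

    # The query binds its own alias ("c") to the URI; the document's choice of
    # prefix is irrelevant once the declaration exists.
    node = doc.at_xpath("//c:encoded",
                        "c" => "http://purl.org/rss/1.0/modules/content/")
    puts node.text        # => <p>Hello</p>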

As it is, this XML is not parsable; it's not well-formed and therefore it
shouldn't even be called XML; it's just text with random tags thrown in.

It is, indeed, quite shocking.

~~~
laurent123456
Maybe it's just me, and it's probably wrong, but more than once I've pre-
processed XML data by replacing every "namespace:tag" with "namespace-tag" so
that I could easily parse the XML without having to care about namespaces. I've
never been convinced that this feature has much use anyway.

~~~
nullymcnull
You never understood what they were actually for, then. The namespace will
have a schema, and the schema can be used to validate the elements of that
namespace. Not used in a "just import the data!" scenario, sure, but a lot of
people who use XML do care about that kind of validation.

~~~
masklinn
> The namespace will have a schema, and the schema can be used to validate the
> elements of that namespace.

That's one option. But a Schema (or DTD) is not mandatory, and not all schemas
can easily be linked (few tools handle RELAX NG or Schematron for namespace
specification).

The core purpose of XML namespaces, and that of any namespace really, is
better modularity and composability by preventing name collisions. This
is useful when manipulating XML via XML (e.g. XSL, Genshi, ...) or when using
multiple XML dialects in the same file (either because they're orthogonal or
because they complement one another) for instance. You could do it by explicit
prefixing à la C or Objective-C, but it tends to get dreary, requires
everybody's cooperation and generally looks bad (not that XML namespaces look
overly sexy).
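
A contrived sketch of the collision point (the vocabularies and URIs are invented for illustration):

    require "nokogiri"

    # Two made-up vocabularies both define an <item> element; the namespace URI,
    # not the prefix, is what keeps them apart in a single document.
    mixed = <<~XML
      <order xmlns:inv="urn:example:inventory" xmlns:ship="urn:example:shipping">
        <inv:item sku="123"/>
        <ship:item tracking="abc"/>
      </order>
    XML

    doc = Nokogiri::XML(mixed)
    p doc.xpath("//i:item", "i" => "urn:example:inventory").size   # => 1
    p doc.xpath("//s:item", "s" => "urn:example:shipping").size    # => 1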

------
fpgeek
> Get off my lawn, you kids.

Isn't that what they were doing?

~~~
dylangs1030
Upped for giving me a chuckle in the midst of some _very_ heated XML
discussion :)

------
peterkelly
There's nothing wrong with invalid XML - why is everyone complaining? C
compilers should similarly take a stab in the dark about what the programmer
meant if they encounter invalid syntax as well. And those linking errors
always annoy me - it should just pick the closest matching symbol if the
specified one can't be found.

------
paulnechifor
This isn't really criticism of XML, though. You can do a good job of screwing
up in any language or format.

~~~
adamtaro
It was not intended as a criticism of XML at all. XML is a perfectly cromulent
standard. It is a criticism of amateurish use of XML.

~~~
egeozcan
Which is everywhere (xhtml, anyone?)

~~~
sageikosa
The "X" is for "extreme", right?

------
nanoscopic
"There are no namespace declarations. No self-respecting XML parser will have
anything to do with this XML data."

I would argue that any self-respecting XML parser should parse it just fine
and shouldn't demand that the namespaces be defined at all.

"...invented their own ad hoc analogous namespace prefix, rather than
understanding the role of elements in XML"

I don't think you understand the base concept of XML much. It is meant to be a
generic container to hold whatever you want. XML in and of itself doesn't
enforce node naming. Sure if you are talking about the official spec it does,
but people pretty much globally use whatever node names they want. Don't have
a cow.

"I haven’t been able to determine the intended encoding of the files"

Well maybe you should look into a parser that just parses as is without
attempting to use some specific encoding.

Check out XML::Bare on CPAN for Perl. It will parse pretty much anything you
throw at it, in any encoding. It leaves it up to you, the user, to decide what
to do with the data after parsing.

~~~
masklinn
> I would argue that any self-respecting XML parser should parse it just fine
> and shouldn't demand that the namespaces be defined at all.

The XML Namespaces specification unambiguously _requires_ that a namespace be
declared:

> The namespace prefix, unless it is xml or xmlns, MUST have been declared in
> a namespace declaration attribute in either the start-tag of the element
> where the prefix is used or in an ancestor element (i.e., an element in
> whose content the prefixed markup occurs).

A self-respecting XML parser would follow the spec. A namespace-aware XML
parser _must_ fault on undeclared namespaces.

Most XML parsers are namespace-aware.

> I don't think you understand the base concept of XML much.

Pot, meet kettle.

> XML in and of itself doesn't enforce node naming. Sure if you are talking
> about the official spec it does

Don't you feel like you're contradicting yourself a bit there?

> Well maybe you should look into a parser that just parses as is without
> attempting to use some specific encoding.

So he should look into parsers which do not parse XML and have no issue
mangling the content? What are they going to do, assume the encoding is ASCII-
compatible anyway and go to town? How wonderfully Anglo-centric.

> Check out XML::Bare on cpan for perl.

XML::Bare is an XML parser in the same sense that XHTML interpreted as
text/html is an XML document: not in any way, shape or form. And if that's
what you're shooting for, don't pretend to suggest an XML parser; suggest a
recovering "soup" parser instead, something like html5lib or BeautifulSoup.

But herein remains the issue: I expect Posterous advertised their export as
XML files, not as "encoding-deficient tag soup" (which it apparently is). I'm
sure TFA would have had no expectations if he'd been told he got garbage in,
and would have relied on tagsoup-parsing and encoding-guessing (using whatever
libraries for doing so are available in his language of choice).

As it stands, he did have the pretty basic and undemanding expectation that he
could shove supposedly-XML files into an XML parser and get data.

~~~
sanderjd
You seem to know a lot about the XML specification. More than your parent and
certainly more than me. That's great, and following specifications is good and
all, but citing the spec as requiring that "a namespace-aware XML parser
_must_ fault on undeclared namespaces" does not give me any sense for why I
would _want_ it to. Put another way - what does the namespace declaration and
halting error due to its omission accomplish for me?

Failing to define a content type is obviously dumb, but I can't seem to get
riled up about leaving off namespace declarations.

~~~
masklinn
> That's great, and following specifications is good and all, but citing the
> spec as requiring that "a namespace-aware XML parser must fault on
> undeclared namespaces" does not give me any sense for why I would want it
> to.

That's not really relevant though. XML is bondage and discipline.

> Put another way - what does the namespace declaration and halting error due
> to its omission accomplish for me?

Depends. You could think "nothing" which is basically the same mindset as
using a tagsoup-parser for "xml" documents: as long as you can get stuff out
of it do you really care?

The other way to think about it, and the way espoused by most XML specs (and
really most serialization specs at all) is that if something goes wrong, what
guarantee is there that _anything_ is right? If a namespace prefix is undefined, is it
because nobody cares, because there's a typo or because the unsafe transport
flipped some bytes? The parser can't know, and as is generally done in the XML
world the spec says to just stop and not risk fucking things up (as it does
when nesting is invalid, attributes are unquoted or decoding blows up).

What that accomplishes is the assurance that the document was correct as far
as an XML parser is concerned, I guess?

If you don't care, you're free to use a tagsoup parser in the first place,
after all.

> I can't seem to get riled up about leaving off namespace declarations.

I see it more from a "canary in a coal mine" point of view: namespace
declarations not being there hints that either they're using a non-namespace-
aware serializer (unlikely) or they're not using an XML serializer at all, and
here dragons lurk. In this case it's confirmed that they're pretty certainly
using ERB text templates to generate their XML, and that means the document
could be all kinds of fucked up with improper escaping, invalid characters and
the like.

Meaning maybe the export can't be trusted to have exported my data without
fucking it up.

------
Sami_Lehtinen
So it seems that we prefer XML which is easy to read. I have seen those files
way too often. Like:
<xml><item><key>1</key><value>Something</value></item><item....></xml>

Then you have to combine whatever keys and values are in the item tags. I found
these to be very annoying files to handle. Especially when the key is X3 and
the value is 83d, you have to look up every combination in some kind of
mapping, because none of those tells you anything directly. At least it's easy
to create files that fulfill the schema, because the complexity is pushed out
of the XML level. Often these files are created by "upgrading" CSV to XML:
let's just call the column number the key and then put whatever is in that
column into the value tag. Yes, attributes could be used, but often aren't.
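
Something like this (the codes and mapping are invented for illustration) is what reading those files ends up looking like:

    require "nokogiri"

    # Out-of-band mapping you need just to make sense of the element contents.
    KEY_MAP = { "X3" => "status", "K7" => "customer_id" }   # invented codes

    doc = Nokogiri::XML("<xml><item><key>X3</key><value>83d</value></item></xml>")
    record = doc.xpath("//item").each_with_object({}) do |item, acc|
      code = item.at_xpath("key").text
      acc[KEY_MAP.fetch(code, code)] = item.at_xpath("value").text
    end
    p record   # => {"status"=>"83d"}, and "83d" still needs yet another lookup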

Then you have to know that if key X contains value Y then you also need to
look for key Z and hopefully it contains value N or whatever.

------
TheAnimus
I'd just like to take a moment to mention Nested Comments.

Oh, if I had £1 for every time I'd had to sift through lines and lines of
code because I can't just comment out an element. I just can't comprehend why
they'd need to reserve -- inside a comment.
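
A sketch of the annoyance (behaviour described as far as I know for libxml2-based parsers):

    require "nokogiri"

    # Trying to comment out an element that already contains a comment: the
    # inner "--" is illegal inside a comment, so the parser flags the document.
    bad = "<root><!-- <item><!-- old note --></item> --></root>"

    doc = Nokogiri::XML(bad)
    p doc.errors.empty?   # => false: the inner "--" trips the parser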

~~~
masklinn
> I just can't comprehend why they'd need to reserve -- inside a comment.

It's because the feature was inherited from SGML, first for commenting in
element declarations (e.g. <!ELEMENT -- this is an element>) and then
generalized to the whole document: in SGML, the grammar for a comment is

    
    
        comment declaration =
            MDO ("<!"), (comment, ( s | comment )* )?, MDC (">")
        comment =
            COM ("--"), SGML character*, COM ("--")
    

HTML — as an SGML application — theoretically inherited this feature (most UAs
don't really implement it correctly, so it's not exactly safe to use sequences
of dashes inside a comment; browsers may or may not toggle commenting). See
<http://www.howtocreate.co.uk/SGMLComments.html> for a more extensive
explanation, especially in relation to browsers (SGML-compliant comment
handling used to be part of early Acid2, before being removed because it was a
stupid idea).

Meanwhile XML took half of it, threw the rest away, and called it a day.

------
niggler
It's ironic how many problems (large and irritating enough to justify blog
posts or public spats) could have been avoided if someone had bothered to test
beforehand.

If someone had done a trial export, they would have immediately seen the missing dates.

------
kaoD
Who uses XML in 2013 anyways?

~~~
duaneb
What would you recommend to replace XML that handles arbitrary trees,
namespaces, attributes, and tools that are built on this, e.g. XSLT?

I don't think XML is amazing, but it still has its place.

~~~
kaoD
Put your torches out, it's just a joke :)

------
daGrevis
Probably they used regexes to parse it. :)

~~~
_kst_
<http://stackoverflow.com/a/1732454/827263> for those who haven't seen it.

------
icedchai
Yes, it's crap, but it would take a few minutes to clean this up with a couple
of sed scripts to turn ns:tag into ns_tag or something to make it parseable.

Or you could prepend some fake namespace declarations.
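
In that spirit, a throwaway patch might look something like this (the filename, the <rss> root, and the fake URIs are all assumptions; the prefixes are the ones seen in the export):

    # Throwaway cleanup: bolt fake namespace declarations onto the root element
    # so a namespace-aware parser stops complaining. Assumes a WXR-style <rss> root.
    raw = File.read("posterous-export.xml")

    patched = raw.sub(/<rss\b/,
      '<rss xmlns:content="urn:x-fake:content" ' \
      'xmlns:excerpt="urn:x-fake:excerpt" xmlns:wp="urn:x-fake:wp"')

    File.write("posterous-export-patched.xml", patched)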

------
LoneWolf
Am I the only one bothered by the extremely oversized xml snippets? Or is it
just me?

Chrome 25.0.1364.97 m

~~~
adamtaro
It's a pretty new redesign. I use Chrome myself, but shoot me a screenshot?
hello _at_ article_domain

------
tlarkworthy
It would take 0.5 days' work to get that into any format you desire, so I don't
think it fails its purpose.

~~~
jerf
No. Once you screw up encoding, the information is generally _gone_. It's not
just a matter of munging, it's often a matter of having to grovel over the
entire file, by hand, correcting things.

Programmers seem to love to think that encoding errors are a joke, but they
aren't. The data is _gone_. That's a big deal. Why are you even writing a
program in the first place if it's just going to output unrecoverable
gibberish? So you can throw the onus on the user to figure it out?

And that's to say nothing of trying to recover the date.

~~~
mikeash
It drives me bonkers. Use UTF-8. Use other encodings _only_ when talking to
systems that require it, and use those other encodings _only_ when actually
reading or writing the data. Translate to UTF-8 at the earliest opportunity,
and translate from UTF-8 at the last possible moment, and only if you must.

This isn't the 90s. This stuff is basically solved now, except people can't be
bothered to use the solution.
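
A sketch of that rule in Ruby (the filenames and the legacy encoding are placeholders):

    # Decode at the boundary, work in UTF-8 everywhere inside, and re-encode
    # only at the very edge if some consumer truly requires it.
    legacy = File.read("legacy-export.txt", encoding: "ISO-8859-1")
    text   = legacy.encode("UTF-8")

    # ... all processing happens on `text`, which is known-good UTF-8 ...

    File.write("for-legacy-system.txt", text.encode("ISO-8859-1"))   # only if you must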

