Small correction: do not use text template languages (Jinja, Mustache, ERB — which seems to be the one used here, judging by `%= display_date %>` — raw PHP, Smarty, FreeMarker, what have you) to generate XML. There are templating languages whose primary use case is to generate markup (including XML) and (unless they're broken to the point of uselessness) they should guarantee the output is valid XML.
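To make the failure mode concrete: a text template just concatenates strings, so markup-significant characters pass through unescaped. A minimal Python sketch (the title value is made up, and `%`-formatting stands in for ERB-style interpolation):

```python
import xml.etree.ElementTree as ET

# Text templating: plain string substitution, no escaping whatsoever.
title = 'Tom & Jerry'  # hypothetical post title containing a bare "&"
doc = '<item><title>%s</title></item>' % title

try:
    ET.fromstring(doc)
    well_formed = True
except ET.ParseError:
    well_formed = False

assert not well_formed  # the bare "&" makes the output not well-formed
```

A markup-aware template engine would have emitted `&amp;` here without being asked.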
> Schema-design-wise, the content:encoded and excerpt:encoded element names are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML.
They seem to be using WordPress's WXR import/export format, hence the wp-namespaced elements. The "content" and "excerpt" namespace garbage comes straight from there, according to http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended...
> <content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.
> <excerpt:encoded> This is an unknown element. This is a summary or description of the post, often used by RSS/Atom feeds.
Considering the cottage industry of WordPress interaction, it was probably a good move to shoot for interop (it should allow Posterous exports to be directly imported into WordPress?). Not sure they succeeded, though.
Genshi, for instance: http://genshi.edgewall.org/
> There are templating languages whose primary use case is to generate markup (including XML) and (unless they're broken to uselessness) they should guarantee the output is valid XML.
Indeed. It's really odd that they munged together an XML export in ERB when Builder exists. Does it have some sort of breaking issue with namespaces or something that could explain the choice?
Error on line 2: Closing tag for non-existent opening tag "biased"
Error on line 2: Closing tags cannot have attributes
True and true, but it does not guarantee the output XML will be valid: as far as Jinja's concerned it's all just text, is it not? Genshi also supports streaming (using `serialize`), properly escapes everything and — using the default xml serializer — ensures the output is valid XML.
(edit: I want to note that I wasn't trying to put down Jinja; it's just the first text-based template language I thought of when trying to write down a list. It's a fine templating language, just not one to generate XML with.)
And thanks for the further reverse engineering of the likely intent of the export. I wouldn't disagree with most of the WP-centric design choices. But attempting to run through a real XML parser might've been a good choice as well. (And I note there's a fair bit of complaint on the WP forums about the difficulty of using the data for import.)
For example, the author thinks that the content inside the CDATA is escaped, but in fact it isn't necessarily: e.g. in this case they're including chunks of HTML which may contain more CDATA sections, and of course those don't nest (you need to terminate and restart the CDATA section). I've also seen examples where the enclosing encoding and the encoding of the CDATA section were incompatible.
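The standard workaround when content may itself contain `]]>` is to split the CDATA section at that point. A minimal sketch in Python (`cdata_wrap` is a hypothetical helper, not anything Posterous or WordPress actually uses):

```python
import xml.etree.ElementTree as ET

def cdata_wrap(text):
    # CDATA sections don't nest: terminate the section just after "]]",
    # then restart a new one before ">" so no literal "]]>" appears inside.
    return '<![CDATA[' + text.replace(']]>', ']]]]><![CDATA[>') + ']]>'

html_chunk = 'a post with <![CDATA[its own]]> section'
doc = '<post>' + cdata_wrap(html_chunk) + '</post>'

# The parser sees two adjacent CDATA sections and concatenates their text.
assert ET.fromstring(doc).text == html_chunk
```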
The worst thing is specs with CDATA sections in their examples. Junior devs bend over backwards to use things like XSLT's disable-output-escaping to get a character-for-character match in test results, and then wonder why their code breaks in production.
There's always the issue that one day ']]>' will somehow sneak in and everything will break.
The key is using a tool to generate the XML that will transparently handle things like escaping correctly instead of using templating tools designed for text or HTML output.
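For instance, Python's stdlib ElementTree serializer handles this transparently (the element names here are made up):

```python
import xml.etree.ElementTree as ET

item = ET.Element('item')
desc = ET.SubElement(item, 'description')
desc.text = 'raw html & a stray ]]> sequence'

out = ET.tostring(item, encoding='unicode')
# "&" and ">" come out as "&amp;" and "&gt;"; no CDATA gymnastics needed,
# and the round-trip recovers the original text exactly.
assert ET.fromstring(out).find('description').text == desc.text
```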
Hopefully they will get all of them fixed before the final shutdown.
If you want an easy way to get your Posterous export file cleaned up and into a more valid XML file, feel free to use the Import from Posterous option over at WordPress.com - http://en.support.wordpress.com/import/import-from-posterous...
We've spent some time on writing code which cleans up the XML file so that it can be imported into WordPress successfully.
You can then export a clean WXR file and import it elsewhere much more easily - http://en.support.wordpress.com/export/
The comment about GMT-offsetting the date is particularly pithy, assuming the blog in question isn't about ephemerides. By and large, blog posts have dates. If you desperately need an hour offset from GMT, one might suggest yours is the edge case because, by and large, it doesn't matter.
Count me among those who would argue that the omission of a schema is a blessing.
I've wasted whole f*cking days of my life wrangling with so-called "non-amateur" XML. Invariably this was over-bloated XML with schemas that did nothing to help the discoverability and the processing of the data. Plain and simple, XML is over-spec'd and many data publishers, aided by their inflexible toolsets, pushed their XML beyond reason.
Be careful what you wish for.
I would take this XML, map it, iterate it, done! End of story. I don't think there's much to complain about here.
TFA didn't ask for a schema; TFA asked for namespace declarations, because they're kind of necessary to parse namespaced documents with a namespace-aware XML parser. That's got zero relation to a schema. He only mentioned it in passing because `content:encoded` and `excerpt:encoded` make very little sense... schema-wise (not "in an XML-Schema document").
> I would take this XML
You can't "take this XML" because it's not XML. Once you know it's not XML you can "take this tag soup", shove it into a tagsoup library (maybe with some encoding-guessing beforehand) and hope things come out about right at the other end — with no assurance that this is the case; you're deep in GIGO land at this point — but you can't "take it and map it".
[edit: A quick search, which I should have conducted instead of posting this, shows this has at least been used before, and enough not to be deleted by Wikipedia editors for lack of notability. That's pretty funny.]
Masklinn is right: I didn't ask for a schema. I didn't ask for anything. I somehow expected well-formed XML in a directory full of .xml files. That's 90% of what the rant is about.
I wasted a handful of fscking years of my life editing a significant international standard that used a peculiar dialect of W3C XML Schema. I know from schema over-design. I'm just talking about understanding the bare basics of XML and seeing that 'excerpt:encoded' might not convey what you think it means when set next to 'content:encoded'.
And it indeed took less time to hack together a solution to extract the information I needed (yay sed!) than it did to write this quick rant. That's not the point. The point is that the hacks and work-arounds should have been unnecessary. It's passing the savings from having one careless dev on as a cost to the countless others who have to deal with the data downstream.
"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."
I don't get this comment. I have never seen an XML parser that would refuse to parse XML without a namespace.
Am I missing something, or is that just mindless hyperbole?
Note that the document uses namespaces but does not declare them. In Python, both ElementTree and lxml will blow up while parsing when they encounter the first undeclared prefix (`dc`, from `dc:creator`).
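A minimal reproduction with ElementTree (the element names mimic the export's; the rest is made up):

```python
import xml.etree.ElementTree as ET

# "dc" is used but never declared with an xmlns:dc="..." attribute,
# so a namespace-aware parser must reject the document.
doc = '<channel><dc:creator>bob</dc:creator></channel>'

try:
    ET.fromstring(doc)
    error = None
except ET.ParseError as e:
    error = str(e)

assert 'unbound prefix' in error
```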
Still nothing to be "shocked" about, though.
Then again, you just have to put the output through any XML parser (it's not hard to find) to realize the document is completely broken, but...
prefix:name-of-element doesn't mean anything by itself, you need to know what 'prefix' stands for.
As it is, this XML is not parsable; it's not well-formed and therefore it shouldn't even be called XML; it's just text with random tags thrown in.
It is, indeed, quite shocking.
That's one option. But a schema (or DTD) is not mandatory, and not all schemas can easily be linked (few tools handle RELAX NG or Schematron for namespaced specs).
The core purpose of XML namespaces, and that of any namespace really, is better modularity and composability by preventing name collisions. This is useful when manipulating XML via XML (e.g. XSL, Genshi, ...) or when using multiple XML dialects in the same file (either because they're orthogonal or because they complement one another), for instance. You could do it by explicit prefixing à la C or Objective-C, but it tends to get dreary, requires everybody's cooperation and generally looks bad (not that XML namespaces look overly sexy).
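A small illustration of the collision-avoidance point, using one real namespace URI (Atom's) and one made-up one:

```python
import xml.etree.ElementTree as ET

# Two dialects both define a "title" element; the prefixes (resolved
# through their declared URIs) keep the two meanings apart.
doc = '''<entry xmlns:atom="http://www.w3.org/2005/Atom"
                xmlns:book="urn:example:books">
  <atom:title>Feed entry</atom:title>
  <book:title>War and Peace</book:title>
</entry>'''

root = ET.fromstring(doc)
assert root.find('{http://www.w3.org/2005/Atom}title').text == 'Feed entry'
assert root.find('{urn:example:books}title').text == 'War and Peace'
```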
> As it is, this XML is not parsable
It's parsable with a non-namespace-aware XML parser (ignoring tagsoup parsers, as we're pretending this is supposed to be an XML document).
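For example, Python's raw expat binding is namespace-unaware unless you pass a `namespace_separator`, and it happily parses the undeclared prefixes as plain element names that happen to contain colons:

```python
from xml.parsers.expat import ParserCreate

seen = []
p = ParserCreate()  # no namespace_separator: no namespace processing
p.StartElementHandler = lambda name, attrs: seen.append(name)

# The same document a namespace-aware parser rejects goes through fine;
# "dc:creator" is just a name with a colon in it.
p.Parse('<channel><dc:creator>bob</dc:creator></channel>', True)
assert seen == ['channel', 'dc:creator']
```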
Apparently, the JDK's javax.xml.parsers.*Factory can return namespace-unaware parsers and — even stranger — do so by default:
> Specifies that the parser produced by this code will provide support for XML namespaces. By default the value of this is set to false
Whether they qualify as "modern parsers" can be debated and I didn't test their behavior, but there you are.
Yes, shit happens, and it's never going to stop happening. Not in the face of all the misguided idealism in the world. But is being punched in the face OK because the puncher didn't use brass knuckles? If he did, is it still OK, because people have been shot in the face, and that's a lot worse than being punched?
Anyway, I thought the same about namespacing until that was addressed in a more constructive reply. So thanks for asking that question. :)
Isn't that what they were doing?
While it is possible to implement "well-formed" XML easily enough, validating against schema is another matter. In this particular article's case, the "well-formedness" isn't even there.
I would argue that any self-respecting XML parser should parse it just fine and shouldn't demand that the namespaces be declared at all.
"...invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML"
I don't think you understand the base concept of XML much. It is meant to be a generic container to hold whatever you want. XML in and of itself doesn't enforce node naming. Sure, if you are talking about the official spec it does, but people pretty much globally use whatever node names they want. Don't have a cow.
"I haven’t been able to determine the intended encoding of the files"
Well maybe you should look into a parser that just parses as is without attempting to use some specific encoding.
Check out XML::Bare on CPAN for Perl. It will parse pretty much anything you throw at it, in any encoding. It leaves it up to you, the user, to decide what to do with the data after parsing.
The XML Namespaces specification unambiguously requires that a namespace be declared:
> The namespace prefix, unless it is xml or xmlns, MUST have been declared in a namespace declaration attribute in either the start-tag of the element where the prefix is used or in an ancestor element (i.e., an element in whose content the prefixed markup occurs).
A self-respecting XML parser would follow the spec. A namespace-aware XML parser must fault on undeclared namespaces.
Most XML parsers are namespace-aware.
> I don't think you understand the base concept of XML much.
Pot, meet kettle.
> XML in and of itself doesn't enforce node naming. Sure if you are talking about the official spec it does
Don't you feel like you're contradicting yourself a bit there?
> Well maybe you should look into a parser that just parses as is without attempting to use some specific encoding.
So he should look into parsers which do not parse XML and have no issue mangling the content? What are they going to do, assume the encoding is ascii-compatible anyway and go to town? How wonderfully anglo-centric.
> Check out XML::Bare on cpan for perl.
XML::Bare is an XML parser in the same sense that xhtml interpreted as text/html is an XML document: not in any way, shape or form. And if that's what you're shooting for, don't pretend to suggest an XML parser and suggest a recovering "soup" parser instead, something like html5lib or BeautifulSoup.
But herein remains the issue: I expect Posterous advertised their export as XML files, not as "encoding-deficient tag soup" (which it apparently is). I'm sure TFA would have had no expectations if he'd been told he got garbage in, and would have relied on tagsoup-parsing and encoding-guessing (using whatever libraries for doing so are available in his language of choice).
As it stands, he did have the pretty basic and undemanding expectation that he could shove supposedly-XML files into an XML parser and get data.
Failing to define a content type is obviously dumb, but I can't seem to get riled up about leaving off namespace declarations.
That's not really relevant though. XML is bondage and discipline.
> Put another way - what does the namespace declaration and halting error due to its omission accomplish for me?
Depends. You could think "nothing" which is basically the same mindset as using a tagsoup-parser for "xml" documents: as long as you can get stuff out of it do you really care?
The other way to think about it, and the way espoused by most XML specs (and really most serialization specs at all) is that if something goes wrong, what guarantee is it anything is right? If a namespace prefix is undefined, is it because nobody cares, because there's a typo or because the unsafe transport flipped some bytes? The parser can't know, and as is generally done in the XML world the spec says to just stop and not risk fucking things up (as it does when nesting is invalid, attributes are unquoted or decoding blows up).
What that accomplishes is the assurance that the document was correct as far as an XML parser is concerned, I guess?
If you don't care, you're free to use a tagsoup parser in the first place, after all.
> I can't seem to get riled up about leaving off namespace declarations.
I see it more from a "canary in a coal mine" point of view: namespace declarations not being there hints that either they're using a non-namespace-aware serializer (unlikely) or they're not using an XML serializer at all, and here dragons lurk. In this case it's confirmed that they're pretty certainly using ERB text templates to generate their XML, and that means the document could be all kinds of fucked up with improper escaping, invalid characters and the like.
Meaning maybe the export can't be trusted to have exported my data without fucking it up.
The requirement of strictness everywhere is a precondition for interoperability between diverse toolsets.
It encourages generation of proper XML.
If consumers accept invalid XML, guessing at what it's supposed to mean, then producers will become sloppier over time, since there's no penalty for failing to follow the specification. Eventually producers will be so sloppy that consumers will no longer be able to make meaningful guesses.
Then you have to combine whatever keys and values are in the item tags. I found these files very annoying to handle. Especially when the key is X3 and the value is 83d: you have to look up every combination in some kind of mapping, because none of them tells you anything directly. At least it's easy to create files that fulfill the schema, because the complexity is pushed out of the XML level. Often these files are created by "upgrading" CSV to XML: let's just use the column number as the key and then put whatever is in that column into the value tag. Yes, attributes could be used, but often aren't.
Then you have to know that if key X contains value Y, then you also need to look for key Z and hope it contains value N, or whatever.
Oh, if I had £1 for every time I've had to sift through lines and lines of code because I can't just comment out an element. I just can't comprehend why they'd need to reserve -- inside a comment.
It's because the feature was inherited from SGML, first for comments inside markup declarations (e.g. <!ELEMENT -- this is an element>) and then generalized to the whole document: in SGML, the grammar for a comment is
    comment declaration = MDO ("<!"), (comment, (s | comment)*)?, MDC (">")
    comment             = COM ("--"), SGML character*, COM ("--")
Meanwhile XML took half of it, threw the rest away, and called it a day.
If someone had done a trial export, he would immediately have seen the missing dates.
I don't think XML is amazing, but it still has its place.
Or you could prepend some fake namespace declarations.
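Sketch of that approach: inject declarations for the prefixes the export uses into the root start-tag (with made-up URIs), then parse normally. This assumes the rest of the document is well-formed, which in Posterous's case is apparently not guaranteed:

```python
import re
import xml.etree.ElementTree as ET

broken = '<rss version="2.0"><channel><dc:creator>bob</dc:creator></channel></rss>'

# Fake declarations for the prefixes WXR-style exports use; the URIs
# only need to be consistent, not correct.
decls = ' '.join('xmlns:%s="urn:fake:%s"' % (p, p)
                 for p in ('dc', 'wp', 'content', 'excerpt'))
patched = re.sub(r'<(\w+)', r'<\1 ' + decls, broken, count=1)

creator = ET.fromstring(patched).find('.//{urn:fake:dc}creator')
assert creator.text == 'bob'
```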
Chrome 25.0.1364.97 m
Programmers seem to love to think that encoding errors are a joke, but they aren't. The data is gone. That's a big deal. Why are you even writing a program in the first place if it's just going to output unrecoverable gibberish? So you can throw the onus on the user to figure it out?
And that's to say nothing of trying to recover the date.
This isn't the 90s. This stuff is basically solved now, except people can't be bothered to use the solution.