Can you give details on this often non-well-formed XML? I've never seen an XML parser that handles invalid XML, except where people wrote their own "parser" and thought a simple regex was enough.
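
(For anyone who hasn't seen this behavior: a conforming XML parser must treat any well-formedness error as fatal. A quick illustration with Python's stdlib, expat-based parser -- my own example, not anything from the thread:)

    import xml.etree.ElementTree as ET

    # One unclosed tag is enough; the XML spec forbids the parser
    # from recovering the way an HTML parser would.
    ET.fromstring("<p>unclosed")
    # raises xml.etree.ElementTree.ParseError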

Validation is another issue, and I don't think you'll find anyone saying that the myriad XML add-ons are simple or easy :).

The mixing of HTTP and HTML also seems like a bit of a strange hack to me. And let's not even start on well-formed HTTP; I'd be surprised to find many real-world clients or servers actually following the inane HTTP spec. Just like mail clients don't always handle comments in email addresses, even though the mail RFCs explicitly allow them (e.g. jdoe(a comment)@example.com).




Well, the classic example is XML plus the rules about character encoding. Suppose I send you an XHTML document, and I'm a good little XML citizen: in my XML prolog I declare that I've encoded the document as UTF-8. And let's say I'm also taking advantage of this -- there are some characters in this document that aren't in ASCII.

So I send it to you over HTTP, and whatever you're using on the other end -- web browser, scraper, whatever -- parses my XML and is happy. Right?

Well, that depends:

* If I sent that document to you over HTTP, with a Content-Type header of "application/xhtml+xml; charset=utf-8", then it's well-formed.

* If I sent it as "text/html; charset=utf-8", then it's well-formed.

* If I sent it as "text/xml; charset=utf-8", then it's well-formed.

* If I sent it as "application/xhtml+xml", then it's well-formed.

* If I sent it as "text/xml", then FATAL ERROR: it's not well-formed.

* If I sent it as "text/html", then FATAL ERROR: it's not well-formed.

Or, at least, that's how it's supposed to work when you take into account the relevant RFCs. This is the example I mentioned in my original comment, and as far back as 2004 the tools weren't paying attention to this:

http://www.xml.com/pub/a/2004/07/21/dive.html
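
If it helps, here's a rough sketch of those rules in Python (my own illustration, not code from the article; the function name is made up):

    def effective_charset(content_type, xml_decl_encoding):
        # Which encoding is a conforming client required to use?
        # Rules per RFC 3023 (XML media types) and RFC 2616 (HTTP).
        mime, _, params = content_type.partition(";")
        mime = mime.strip().lower()

        charset = None
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                charset = value.strip().strip('"').lower()

        if charset:
            # An explicit charset parameter always wins, even over
            # the document's own <?xml encoding="..."?> declaration.
            return charset
        if mime == "application/xhtml+xml":
            # application/* without a charset: fall back to the XML
            # encoding declaration, so the document stays well-formed.
            return xml_decl_encoding
        if mime == "text/xml":
            # RFC 3023: text/xml without a charset defaults to
            # us-ascii; the XML declaration is ignored, and any
            # non-ASCII byte is a fatal error.
            return "us-ascii"
        if mime == "text/html":
            # Not an XML type at all; HTTP's default for text/* is
            # iso-8859-1, which again overrides the declaration.
            return "iso-8859-1"
        return None

    # effective_charset("text/xml; charset=utf-8", "utf-8") -> "utf-8"
    # effective_charset("text/xml", "utf-8") -> "us-ascii" (fatal)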

These are the kinds of scary corners you can get into with an "every error is a fatal error" model: ignorance, apathy, or a simple desire to make things work as expected ends up overriding the spec, and leaves you dependent on what are actually bugs in the system. And if one of those bugs ever gets fixed, instead of something just not looking quite right, suddenly everyone using your data is spewing fatal errors and wondering why.

Meanwhile, look at things like Evan Goer's "XHTML 100":

http://www.goer.org/Journal/2003/04/the_xhtml_100.html

He took a sample of 119 sites that claimed to be XHTML and found that only one managed to pass even a small set of simple tests.



