
How to Avoid Being Called a Bozo When Producing XML (2005) - stesch
https://hsivonen.fi/producing-xml/
======
bane
My "favorite" XML formats are the one that are just some kind of weird meta-
format and don't really use any of the XML features:

    
    
       <format>
          <record id="1">
             <field name="id" value="1"/>
             <field name="name" value="abc">blah blah</field>
             <field name="attribute">this is the attribute value</field>
             <field name="end_of_record" value="True"/>
          </record>
          <record id="2">
          ...
          </record>
       </format>
    

And yes, these types of abominations are everywhere.

The only way to avoid being called a Bozo when producing XML is to either

a) ensure that humans never had to see this craziness

b) don't use XML

XML as a config file format, in particular, is probably one of the worst ideas
in computing.

~~~
mwfunk
I don't necessarily disagree, except for the last point. I've rarely (never?)
encountered XML used as a config file format where users were expected or
encouraged to edit that config file directly vs. using other tools or APIs to
touch the file.

In those cases, I would rather have XML config files than undocumented binary
blobs as config files. When I see an XML config file, I feel a little relief
that it's not a binary blob rather than disappointment that it's not freeform
text, because I assume that freeform text must've been off the table for
whatever reason (which, depending on what the config file is for, can be a
totally rational and reasonable thing to do).

I don't work in specialties where XML has a ton of visibility though- maybe
there are lots of projects out there that I don't use in which people are
required to hand-edit XML config files, as opposed to "it's in XML, so you
_could_ edit it directly, but really no one should be modifying the file with
a text editor unless the preferred indirect mechanism isn't an option in some
specific case".

~~~
SomeCallMeTim
>In those cases, I would rather have XML config files than undocumented binary
blobs as config files.

False dichotomy.

Better than XML _and_ binary blobs:

* JSON (assuming everyone knows what this is)

* YAML [0]

* Lua tables (if you're already using Lua as a scripting language; Lua started out as a configuration language after all)

* INF format [1] (not my favorite, but pretty easy to parse and _much_ better for humans to read than XML)

* Any of the above compressed with a gzip compatible compression (if size matters, though it rarely does these days)

Even Protocol Buffers [2] are better than XML, though at that point it becomes
a "documented binary blob". But as long as the spec is shared, the format can
easily be read by just about any programming language.

[0] [https://en.wikipedia.org/wiki/YAML](https://en.wikipedia.org/wiki/YAML)

[1]
[https://en.wikipedia.org/wiki/INF_file](https://en.wikipedia.org/wiki/INF_file)

[2]
[https://en.wikipedia.org/wiki/Protocol_Buffers](https://en.wikipedia.org/wiki/Protocol_Buffers)

~~~
flukus
> JSON (assuming everyone knows what this is)

The new .NET uses JSON, and it's awful. No comments allowed, and it gets
pretty unreadable when you have nested configuration elements.

~~~
ivl
I seriously think the lack of comments is a deal breaker for JSON config files
for me. At least with what I'm doing now. I find myself changing configs a
ton, and I love being able to simply change which blocks are commented to get
what I want, without having to dig anywhere.

~~~
SomeCallMeTim
I agree...and I found a Gulp plugin that lets me pre-strip comments from my
JSON files as part of the build process.

So I use JSON-with-comments, but the app only sees the stripped files.
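The stripping step itself is simple enough to sketch. Here's a minimal Python version of the idea (not the actual Gulp plugin; a production stripper has more edge cases to worry about):

```python
import json
import re

def strip_json_comments(text):
    # Match string literals first so their contents are preserved;
    # only // line comments and /* */ block comments outside strings
    # are removed.
    pattern = r'("(?:\\.|[^"\\])*")|(//[^\n]*|/\*.*?\*/)'
    return re.sub(pattern, lambda m: m.group(1) or "", text, flags=re.DOTALL)

source = """
{
    // active block
    "port": 8080,
    "debug": false  /* flip for local runs */
}
"""

config = json.loads(strip_json_comments(source))
print(config["port"])  # 8080
```

A URL like `"http://x"` inside a string survives, because the string alternative wins before the comment alternative can match the `//`.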

------
jroseattle
> Don’t print

> Use an isolated serializer

Some old reference material (XML isn't as common as JSON anymore), but still
worthwhile learning: don't output data formats directly. Directly = echo,
print, printf, println... whatever your language offers. I see this happen a
lot with my junior engineers, and I have this same conversation with them.

Prefer to use data serializers that encapsulate all the syntactical rules that
go along with XML, CSV, JSON, YAML, etc. Let the serializers do the grunt work
of writing output in correct format.

Some serializers aren't always ideal - correctness and speed can be an issue.
Nonetheless, prefer to use those mechanisms over writing your own output.
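To make the point concrete, a sketch in Python's stdlib (the data here is invented for the example): the hand-built string breaks as soon as a value contains quotes or angle brackets, while the serializer escapes everything for us.

```python
import json
import xml.etree.ElementTree as ET

user = {"name": 'Alice "Al" O\'Brien', "note": "loves <xml> & json"}

# Hand-rolled output: the embedded quotes, angle brackets, and
# ampersand silently produce ill-formed XML.
broken = '<user name="%s"><note>%s</note></user>' % (user["name"], user["note"])

# Serializer output: escaping is handled for us.
root = ET.Element("user", name=user["name"])
ET.SubElement(root, "note").text = user["note"]
well_formed = ET.tostring(root, encoding="unicode")

print(json.dumps(user))  # json.dumps does the same grunt work for JSON
print(well_formed)
```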

~~~
nrser
i think a major problem is that XML kinda looks and feels like HTML (and there
was the whole XHTML thing to further confuse), and outputting HTML
programmatically (vs string / print / template based) has mostly been frowned
on as overweight and cumbersome.

you come from web dev doing HTML like that and you see XML and think "hey,
that looks the same, i'll do it in the same way".

XML is a programmatic data exchange format like JSON or YAML, which most
people would never think of outputting as templates or printed text, but it
looks and feels like HTML, which most people deal with first and where that's
the standard approach.

~~~
ams6110
>YAML, which most people would never think of outputting as templates

Don't tell the Ansible folks!

~~~
geerlingguy
Ansible uses Jinja2 to output templates in whatever format is preferred by the
thing being configured. I haven't personally seen Ansible used to output
YAML... But people will do anything :-P

Ansible does use YAML as a configuration language though—something for which
it's perfectly suited.

~~~
spdionis
Well, some frameworks use yaml for config files and you might use ansible to
write those.

That said the templating is usually trivial, just maybe write some string
values.

~~~
cytzol
I’ve done it. It’s painful enough that it teaches you “don’t do this!”. For
example, you need to escape `{{ item }}` as `{{ "{{" }} item {{ "}}" }}`!

------
zubat
I use XML for a combination of features that I consider very important but are
also perceived as "overkill": A source syntax that has already handled text
escaping and encoding, lets me add some abstract structure, and lets me encode
the text in a way that lets me nest different parsing modes for various kinds
of structured data.

The first two are easy enough to get with your pick of JSON or S-Expressions.
For a lot of things even CSV is enough, although CSV has the downside of being
so simple that people opt to write an incorrect toolchain for it themselves
instead of adding a dependency.

But it's the last feature that really produces the complexity. Once you get
into "I want the inner structure to contain a different and unambiguous
semantic meaning from the outer structure" you have a pretty substantial
engineering problem. Less structured approaches like JSON or S-Expr's drop the
problem on the floor by declaring one universal semantic, making the
programmer deal with adding anything else on top. XML's compromises to achieve
a more detailed representation of data involve the angle bracket tax, schema
languages, etc.

If you want a guarantee that a rich data source can be processed correctly
through an n-tier architecture that emits various radically different outputs,
these compromises become compelling. I'm a big fan of DocBook, for example,
and its canonical toolchain is an XSLT style sheet: The workflow I end up with
is initial writing in a light syntax of choice, compile to DocBook XML, add
additional formatting and styling in the XML, and then emit the final document
in whatever forms needed - HTML, PDF, etc. It's extremely flexible, and you
wouldn't get the same quality of result with a less extensive treatment.

For ordinary data serialization problems and one-offs, it is considerably less
interesting.

------
cptskippy
XML is well regarded in the enterprise, and languages like Java, C#, and
VB.NET handle it spectacularly as an exchange format.

I think its bad reputation comes from anyone not using an enterprise language,
because the support just isn't there.

I recall working with a partner who we were doing an identity federation with.
Our system was using WS-Trust which is a SOAP/XML protocol. It wasn't ideal
but everyone seemed to support it ok. These guys were cutting edge though and
used Ruby on Rails.

The lack of support for the protocol wasn't a huge deal; it just means you
have to craft the XML for your SOAP calls yourself. But at the time, RoR
didn't have SOAP or XML libraries. They had to write everything from the
ground up. It sucked for me and I was just fielding rudimentary questions, I
can't imagine how painful it must have been for them.

~~~
wtbob
> I think it's bad reputation comes from anyone not using an enterprise
> language because the support just isn't there.

On the contrary, I think that XML's bad reputation comes from the fact that it
is <adverbial-particle modifies="#123">so</adverbial-particle> <adverb
id="123">incredibly</adverb> <adjective>verbose</adjective>.

Also, the whole child/attribute dichotomy is a huge, _huge_ mistake. I've been
recently dealing with the XDG Menu Specification, and it contains a
child/attribute design failure, one which would have been far less likely in a
less-arcane format.

XML is not bad at making markup languages (and indeed, in those languages
attributes make sense); it is poor at making data-transfer languages.

JSON has become popular because a lot of bad programmers saw nothing wrong
with calling eval on untrusted input (before JSON.parse was available). It's
still more verbose than a data transfer format should be, and people default
to using unordered hashes instead of ordered key-value pairs, so it's not
ideal.

The best human-readable data transfer format is probably canonical
S-expressions; the best binary format would probably be ASN.1, were it not so
incredibly arcane. As it is, maybe protobufs are a good binary compromise?

~~~
FunnyLookinHat
> JSON has become popular because a lot of bad programmers saw nothing wrong
> with calling eval on untrusted input (before JSON.parse was available).

Disagree. JSON became popular because it was extremely easy to implement (both
for marshaling and consuming), and because it was extremely lightweight.

I think you could also make the argument that JSON was conceptually easier for
programmers to wrap their minds around. You could just pretty-print it and
quickly get an idea for the object's format, attributes, etc.

~~~
ams6110
XML _could_ be fairly lightweight also. It was all the enterprisey-standard
formats that were hideous.

E.g.

    
    
        {"name":"John","age":42}
    

vs.

    
    
        <person name="John" age="42" />

~~~
rimantas
Now do the nested objects in both. One line does not show much.

~~~
Mikhail_Edoshin

        <person id="123" name="John" age="42" sec:checksum="...">
          <family-member type="spouse" ref="456""/>
          <family-member type="child" ref="789" />
          <fin:credit-rating score="A"
              last-change="2016-02-04T12:34:56Z" />
          <уфмс:статус значение="42" />
        </person>
    

Here we can describe `person/@id` as element ID and `family-member/@ref` as a
reference to an ID so our XML tools can link these together.

Also note three more items from different namespaces: `@sec:checksum` could be
some kind of technical information about the record, and `fin:credit-rating`
is added by the financial module. Its `@last-change` is defined as a datetime,
so when we read it with other XML tools we'll get it as a datetime type.

The next one is a tag in Russian that describes something related to Russia;
XML can use the whole of Unicode in tag and attribute names.

Also, XML names are globally unique by design so there's no clash between all
the different pieces and the tools can easily be configured to ignore parts
they don't understand or work as a glue between different areas.

We can still efficiently validate the syntax of the whole piece, or parts of
it, as we see fit.
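The prefixes are only local aliases for namespace URIs, which the snippet above leaves out. A sketch in Python of how a namespace-aware parser addresses the qualified names (the URIs and values here are made up for the example):

```python
import xml.etree.ElementTree as ET

doc = """
<person xmlns:sec="http://example.com/security"
        xmlns:fin="http://example.com/finance"
        id="123" name="John" sec:checksum="abc123">
  <fin:credit-rating score="A"/>
</person>
"""

root = ET.fromstring(doc)

# ElementTree expands prefixed names to {namespace-URI}local-name,
# so tools match on the globally unique URI, not the local prefix.
SEC = "{http://example.com/security}"
FIN = "{http://example.com/finance}"

print(root.attrib[SEC + "checksum"])                  # abc123
print(root.find(FIN + "credit-rating").get("score"))  # A
```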

------
chiph
Back in the early days of XML, Internet Explorer would insert "+" characters
to fold nested sections of XML. And was the default program to open .xml
files. Guess what showed up in the documents I got from an integration
partner?

~~~
kyllo
I once got an XML file from an integration partner where the whole thing was
XML escaped (all the tags looked like &lt;node&gt;value&lt;/node&gt;) because
they had embedded it within an outer "envelope" XML file. They saw nothing
wrong with this and argued when I questioned it. I wonder how they were
planning to express escape sequences within the inner XML document that was
already escaped...

~~~
nitrogen
It's ugly of course, but a parser should have no problem with &amp;amp; or
&amp;lt;. It can go arbitrarily deep.
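Right; each layer of wrapping just means one more parse. A quick Python sketch of unwrapping such a doubly-encoded payload:

```python
import xml.etree.ElementTree as ET

# An "envelope" whose payload is a whole XML document, XML-escaped.
envelope = "<envelope>&lt;node&gt;value &amp;amp; more&lt;/node&gt;</envelope>"

outer = ET.fromstring(envelope)
# Parsing the envelope unescapes one level, leaving XML text...
inner = ET.fromstring(outer.text)
# ...which parses again, unescaping the second level.

print(inner.tag)   # node
print(inner.text)  # value & more
```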

------
Nuffinatall
Compared to the problems when dealing with 'delimited text', XML is great.

Also it's flexible where you can specify properties as attributes or child
nodes, depending on wildcard specifications.

So I have dealt with lots of edge-case XML situations, but the solutions are
always straightforward. Also, it helps to have a client library vs. trying to
parse out raw XML, which means programming and scripting sometimes rely on
personal tool development. XML handles scope creep well.

~~~
DougN7
Handling scope creep is my favorite feature. With XML, it's easy to
deserialize even if an expected element is not there, or if there is an extra
one you're not expecting, at least that's been my experience. I haven't done
much JSON but I'm not sure how that would work with it.

~~~
QuercusMax
Pretty much any "real" serialization format should handle that situation fine.
Protobuf, JSON, YAML, Thrift, heck, even Java serialization can handle that,
provided you set a serialVersionUID.
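In JSON, for instance, the tolerance lives in how you read the parsed object; a Python sketch (field names and defaults invented for the example):

```python
import json

def load_config(text):
    data = json.loads(text)
    # Extra keys are simply ignored; a missing key falls back to a default.
    return {"host": data["host"], "port": data.get("port", 5432)}

newer = '{"host": "db1", "port": 6000, "replica": "db2"}'  # extra field
older = '{"host": "db1"}'                                  # missing field

print(load_config(newer))  # {'host': 'db1', 'port': 6000}
print(load_config(older))  # {'host': 'db1', 'port': 5432}
```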

------
616c
On the Cognicast there was an excellent tangent (all of them were good) in
episode 106, where Michael Nygard bemoans with the fellow Cognitect Craig
that, despite all the hate from the JSON generation, the failed promise of XML
was the ability (again, that is part, not the whole) to have separated data
and presentation with schemas, so you would not have to redesign endpoints all
the damn time.

[http://blog.cognitect.com/cognicast/106](http://blog.cognitect.com/cognicast/106)

This is just one view, and I am sure I will be mercilessly downvoted, as this
is a gross simplification of that point, but it was one of many gems in that
episode. I might finally review XSLT, as this once again affirms what other
devs told me when they said not to write off XML: in its complexity is
something interesting.

~~~
ams6110
I loved XML and XSLT. And Internet Explorer, for all its faults, had great
support for XSLT in the browser from version 5. It was quite easy to build
"rich" single-page apps that got XML data from the server and built various
user presentations by updating the DOM with XSLT.

------
_greim_
This article in some ways describes the delta from HTML development to XML
development. In the early/mid 2000s, XML was cargo-culted through the tech
world on a massive scale; typically being adopted by web developers who
proceeded to apply the same habits and tools for XML as they'd been using for
HTML. Which of course resulted in many of the issues mentioned.

~~~
MichaelGG
There's a popular piece of "newer" software that decided that XML rules were
too difficult. So they URL encode all values. It also uses print style
formatting for XML tag names, so if you manage to get a name value that has,
say, a : in it, you'll get invalid tags. This is the default setup, in 2016,
for a system that handles a lot of real-world telephone calls.

Even just a few years ago I've worked with companies that wrote their own "XML
parser". They explained it was pretty easy but they had to "special case" for
broken output in the real world. An example of this output? "<tag />".

HTML would have been far better off if it had the strictness of XML. Remove
end tag names so you can't have invalid nesting. If browsers had refused to
parse invalid docs from the start, invalid docs would not have been produced.
(And like XML, they could provide decent error messages, so the difficulty
would not be significantly raised.)

------
beagle3
I used to hate doing XML in Python - ElementTree was the nicest of them 10
years ago, but it still hurt.

But last year, I discovered xmltodict[0] and since then, I don't really care -
it makes doing XML (both reading and writing) no more cumbersome than using
dicts, while still supporting stuff like namespaces, CDATA and friends.

I still think XML is a horrible, misguided idea - from inception, but even
more so in how it is used in practice - but I no longer feel any pain
interfacing with it.

[0]
[https://github.com/martinblech/xmltodict](https://github.com/martinblech/xmltodict)

~~~
Mikhail_Edoshin
Python has a very good lxml module for advanced XML processing. You can define
your own classes for XML elements, so when you read an XML file you get your
own classes for the underlying elements. They're somewhat limited - you can
easily define methods, but the data is locked to what's in the XML. You can
also define your own XPath functions and XSLT extensions. It comes in very
handy sometimes.

The API is still rather awkward though.

------
legulere
There's really no reason to use UTF-16 except compatibility with older
software (which is usually broken when handling surrogate pairs anyway). It's
an atavism from the days when all Unicode codepoints fit into 16 bits.

~~~
yuhong
I think that one boils down to the fact that back in 1990, ISO 10646 wanted
32-bit characters but had no software folks on its committee, while the
Unicode people were basically software folks but thought that 16 bits was
enough (this dates back to the original Unicode proposal from 1988). UTF-8 was
only created in 1992, after the software folks rejected the original DIS 10646
in mid-1991.

------
Agentlien
This reminds me of an interesting experience I had with XML at a previous job
a few years ago.

We had bought a product from another company which was to be integrated into
our own main product. Theirs was horribly ugly, looking like a cross between a
90's website and an infomercial, predominately in vivid shades of pink and
purple. And it was really buggy. I soon noticed that all the content (many
hundred pages with text, video and interactive content) was specified in a
giant XML file and that the application itself simply interpreted this file
and presented it to the user. We quickly decided that the best course of
action was for me to reverse-engineer this XML file and write our own code to
generate an integrated version of it, presented in a visual style more in line
with the rest of our own product. This meant we could also solve some of their
bugs on the way.

I still feel this was the only reasonable option and it did work out within
our given time frame. However, I will never forget the horrors I saw in that
one file. A few gems included:

\- The file was most certainly handwritten, with lots of tag mismatches and
spelling errors in tag names.

\- One of the main sections was missing in their own standalone version
because of a syntax error which caused their program to skip over the entire
main branch of the syntax tree in which it occurred.

\- Exercises where you had to order a list of items were defined as dragging
items into hit boxes on a static bitmap image of the numbers 1-10 on a purple
background. The same image was used regardless of how many items had to be
ordered. The hit boxes didn't align with those numbers at all and often
overlapped. In their implementation, items were stuck right where you dropped
them, rather than snapping to a fixed position next to the right number.

\- We wrote a few tools to identify images and videos which were either
present on disk but never referenced, or vice versa. This was often a case of
spelling errors, slight variations in word choice, or files placed in the
wrong folder. In these cases, their original program would bail out and skip
that page.

\- Indices of chapters were written as plain text rather than inferred. They
did not match how things were laid out in the XML and where it happened to
align it was sooner or later broken by sections which were commented out or
failed to parse.

There were many more issues, but these give some insight into the exciting
challenge of getting their data to work in a consistent and logical manner.
After the XML file had been thoroughly massaged into submission and
uniformity, of course.

~~~
jessaustin
Please edit your post to eliminate the fixed-text:

\- It will be easier to read.

\- Reading won't require a lot of fiddly trackpadding.

\- Maybe it would be nice if HN's simple markup system could handle the case
in which the author wants a list of indented items, but it doesn't, and fixed-
text is a poor substitute for that.

[EDIT:] _Thanks!_

~~~
Agentlien
There. I agree, it looked horrible.

------
rwmj
This is by no means totally bulletproof, but these C macros around libxml2 let
us write nested well-formed XML expressions as code:

Example usage:
[https://github.com/libguestfs/libguestfs/blob/master/src/lau...](https://github.com/libguestfs/libguestfs/blob/master/src/launch-libvirt.c#L1186-L1208)

Macro definitions:
[https://github.com/libguestfs/libguestfs/blob/master/src/lau...](https://github.com/libguestfs/libguestfs/blob/master/src/launch-libvirt.c#L880)

~~~
oceanswave
Totally, and we took this a step further and created a Subversion repository
where XML documents describe classes. Each method is either inline, or is
described by an XML element of a particular namespace that links to a
Subversion id and revision. ;)

~~~
witty_username
Note: I believe this is a reference to [http://thedailywtf.com/articles/the-inner-json-effect](http://thedailywtf.com/articles/the-inner-json-effect)

------
maze-le
Some XML dialects become very confusing when features are added as an
afterthought without consideration of syntax and semantics. Microsoft's
Wordprocessing XML, for example, has caveats like w:permStart:

    
    
        <w:permStart w:id="0" w:edGrp="editor"/>
        (...)
        <w:permEnd w:id="0"/>
    

permStart and permEnd define regions where special permissions are required to
edit a document. It is encoded in a completely anti-XML syntax, where
different tags (sharing a common ID) represent the start and end of a region.

~~~
Mikhail_Edoshin
Microsoft Wordprocessing XML is very quirky :) I think they use these markers
because different areas can overlap and thus you cannot express this with a
tree-like structure.

------
coding123
There are many flavors of XML and JSON out there now. I think for many
developers JSON started to "look good" when the number of standards that
started stacking up against XML (and XML-ish/SGML-ish/HTML-ish based formats)
started to make people go insane. In the healthcare world we typically had to
deal with a never ending set of "format standards" that kept integrating
themselves together. I guess originally that may have been the beauty of
XML... we started with XML RPC, moving on to SOAP 1.0, SOAP 1.1 introduced new
ways to send headers. At some point, however, it just went crazy... around
when the enterprise-level people got their hands on things and started porting
all of their non-standard whack-job features into XML.

WS-Addressing - OK, seems simple, but now your SOAP stack has to support async
processing. WS-Trust - OK, let's add a simple feature that lets you put "some
tokens" in the request and response for security, auditing, non-repudiation -
good ideas, sure. WS-Eventing - let's add enterprise queuing to XML and SOAP,
require stacks to support it, and let the users of the stack figure out a way
to connect it to the queues.

Anyway the list goes on, and you can read about it here:
[https://en.wikipedia.org/wiki/List_of_web_service_specificat...](https://en.wikipedia.org/wiki/List_of_web_service_specifications)

Suffice it to say, XML died because developers now had to learn all of these
and how they worked, because each tiny industry body would adopt 1% of each,
requiring implementors to learn 99% of all of them. It basically just made
JSON attractive - a reset, if you will.

XML won't go away. HTML will continue forever (it crosses a developer-designer
"human line" that makes it kinda permanent). Developers adapt to future
technologies a lot faster than designers and others dabbling in HTML.

Now all this being said, you can see the list of standards piling up against
JSON. There's really no critical mass ready replacement though, so JSON will
be safe for quite a while longer. JSON will only be replaced in various
"areas" like YAML for config, binary JSON-compatible representations for wire
and/or storage.

I'm not biased against XML for data transfer, but if someone asked me to
create a SOAP 1.1 service with WS-Trust, SAML tokens, etc., I'd argue for a
more industry-accepted REST service with OAuth tokens, simply because the
former would be like introducing the Hummer all over again in an age where
Teslas are everywhere - everyone would hate us.

------
Marazan
XML is a perfectly fine format that was (ab)used dreadfully by many, many
people to such an extent that many people only have examples of completely
dreadful XML as their reference.

So many XML-as-interpreted-programming-language monstrosities out there. (I
know, because I wrote one: I had the perfect problem domain for Lisp but
didn't have the environment to use Lisp; I did, however, have a database XML
field to store 'data' in, so I did XML-as-S-expression with a SAX-based
interpreter - it was surprisingly nice.)

------
erlehmann_
Discussions about XML and JSON often remind me of this comment on HN:
[https://news.ycombinator.com/item?id=5702868](https://news.ycombinator.com/item?id=5702868)

Partial quote:

> XML can certainly be shorter than JSON and often is, and repeated tags are
> the best showcase for it:
    
    
         <user id="abc">
            <phoneNo type="home">123456789</phoneNo>
            <phoneNo type="work">321654987</phoneNo>
        </user>
    

> This turns into this beautiful JSON:
    
    
    {
      "users": [
        {
          "id": "abc",
          "phoneNos": [
            { "type": "home", "value": "123456789" },
            { "type": "work", "value": "321654987" }
          ]
        }
      ]
    }

~~~
lmm
Not a fair comparison since the JSON case includes the outer list as well. And
whenever I've seen the equivalent of this in a real-world XML format it would
use a <phoneNos> tag to group the phone numbers together.

~~~
erlehmann_
You probably have not looked too closely at real-world XML.

• Many XHTML and SVG elements can occur without dedicated wrapper elements.

• In Atom feeds, <author>, <category>, <contributor>, <link> elements can
occur multiple times without a dedicated wrapper element.

• In XSPF playlists, <link>, <meta>, <extension>, <location>, <identifier>
elements can occur multiple times without a dedicated wrapper element.

------
stesch
Had to post this old article because I encountered some bozo code again.
Reading more about some CMS and planning on using it for my blogs when I saw
the code of the RSS feed. It was written by the lead developer of the CMS and
used text templates.

~~~
oceanswave
The way your comment comes across is a bit irritating. Not understanding the
underlying codebase and classifying it based on an attenuated knowledge of the
topic promotes one to 'bozo' status more quickly than not. Many systems use
text-template-based feeds; examples are Shopify, Salesforce, WordPress, and
more. Are these systems fundamentally broken purely because of this approach?
Probably not. In your case, are the text templates escaping their values when
outputting? Are they validating for correct XML once generated? Have more
questions than pre-defined answers.

~~~
stesch
You mention typical PHP projects written by people who think they know better
than the likes of Tim Bray.

PHP, the language that made short tags a configuration option because they
wanted to mix program code with XML.

PHP, the language with a lot of different escape functions because they didn't
get it right the first time.

~~~
oceanswave
I also mentioned projects written in Ruby and Java, but that's ok. VB.Net also
has XML Literals. Ha ha.

~~~
MichaelGG
VB's XML literals are just shortcuts for creating the corresponding classes
though, right? That's quite a bit different.

~~~
oceanswave
Was being a bit sardonic in my comment due to where the discussion went, but
yeah, XML Literals in VB.Net create XDocument instances and are just like
string literals except:

* Enclosing quotes aren't required

* They're assumed to be multi-line, so line continuation characters aren't required

* They're validated for being well-formed XML by the compiler (and at design time, if in VS)

* They can have embedded expressions

------
oceanswave
The author of this post is a bozo; doing (or not doing) any of the suggested
things does not guarantee well-formed XML. Disregarding whole sections of the
XML spec and prescribing a certain way to generate XML are more harmful than
not. Can text templates generate well-formed XML? Absolutely. Can tools
generate non-well-formed XML? Absolutely.

~~~
wbkang
> Making mistakes with them is extremely easy and taking all cases into
> account is hard.

He states why right there. He doesn't say anywhere whether templates can or
cannot generate well-formed xml.

~~~
hsivonen
(I'm the author of the article.)

Today, it's clear that text/html has won over application/xhtml+xml and JSON
has won over XML for most (non-enterprise) non-document uses. But back around
2003..2009, there was no shortage of people who advocated in favor of XML and
got it wrong when writing it by hand or when generating it with text-based
templates.

Philip Taylor (not to be confused with Philip TAYLOR) was one of the regulars
on the #whatwg IRC channel around 2007..2009. He had a hobby of trying to get
XML advocates' systems to produce ill-formed output. He pretty much succeeded
every time. IIRC, he even found a bug in Validator.nu's XML output, even
though Validator.nu practices what I preach in the article.

The easy way was to supply user input that contained U+FFFE and watch the
output blow up with the Yellow Screen of Death when U+FFFE was echoed as-is.
Unless you have a templating system designed with the warts of XML in mind,
this will happen. (A proper XML serializer has to scrub the characters that
aren't allowed in XML, as seen in
[https://hg.mozilla.org/projects/htmlparser/file/dd08dec8acb7...](https://hg.mozilla.org/projects/htmlparser/file/dd08dec8acb7/src/nu/validator/htmlparser/sax/XmlSerializer.java#l419)
.)
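The scrub itself is only a few lines; a minimal Python sketch of the idea, keeping just the characters the XML 1.0 character range allows:

```python
def scrub_xml_chars(text):
    # XML 1.0 allows tab, LF, CR, and the Unicode ranges below;
    # everything else (other control chars, U+FFFE/U+FFFF) is dropped.
    def allowed(ch):
        cp = ord(ch)
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    return "".join(ch for ch in text if allowed(ch))

print(scrub_xml_chars("ok\ufffe\x07fine"))  # okfine
```

(Dropping is one policy; a serializer could also replace the offenders with U+FFFD instead.)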

He even found a bug in Tim Bray's code that was written to make the point that
it's possible to generate XML correctly...
([https://lists.w3.org/Archives/Public/www-archive/2009Mar/006...](https://lists.w3.org/Archives/Public/www-archive/2009Mar/0060.html))

------
Cozumel
The sheer number of sites that produce badly formed RSS feeds is staggering.
The whole point of a feed is to make your content accessible to everyone, a
bit like meta tags. Why have one if you're not going to at least implement it
properly?
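What makes it worse is that checking well-formedness is cheap; a one-function sketch in Python of the kind of check a publisher could run before shipping a feed:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    # A feed that fails this basic check will break strict consumers.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<rss><channel><title>ok</title></channel></rss>"))
print(is_well_formed("<rss><channel><title>AT&T</title></channel></rss>"))
```

The second feed fails because of the bare `&` - exactly the kind of mistake templated feeds ship all the time.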

~~~
nialo
I recently wrote a first pass at an RSS feed parser for podcasts, but couldn't
find examples of interestingly malformed podcast feeds to test against. Do you
have examples of sites with badly formed RSS feeds?

------
0x0
I had no idea there's such a beast called "XML 1.1". That sounds fun!

------
foota
This reminds me of when I was just starting out as a programmer. I was doing
contract work and needed to write a PHP JSON endpoint. I had no idea what I
was doing and hardcoded it all with print statements. Yikes.
------
benbristow
Why would anyone choose to use XML over JSON, other than for RSS?

~~~
cptskippy
For one, JSON didn't exist 15 years ago.

For another, JSON didn't have validation or schemas 5 years ago.

~~~
einhverfr
Even there, the schemas and validation are very lightweight compared to what
XML can do.

As I usually say, JSON is for relatively free-form, dynamically typed
languages, but if one side uses a statically typed language, XML is probably
the better choice.

~~~
cptskippy
Hmm... I always tend to use XML for B2B communication or mine-to-theirs type
RPCs. I use JSON primarily for client-to-server communication with our own
applications, or for public APIs.

------
jordache
xaml for UI.. FML

------
bibinou
(2005)

~~~
dang
Yes. Added.

------
qwertyuiop924
Some of this (avoiding pretty printing, mainly) is just dealing with XML's
insanity. The rest is pretty solid advice, but fairly obvious for the most
part. Then again, I've already done a lot of Scheme programming, have common
sense, and read Steve Yegge's _The Emacs Problem_, so I already looked at XML
as a tree structure, and crawling a pre-existing tree to turn it into XML is
just the most natural way to deal with XML in Lisps.

