
Erik Naggum on Attributes in SGML, XML, Lisp (2001) - networked
http://www.schnada.de/grapt/eriknaggum-enamel.html
======
kjhughes
I've been immersed in the markup world since the SGML days, and even I prefer
s-expressions to SGML/XML.

But here's the thing: _XML is all about agreement_. The value is not found at
the syntactic level, where needless irregularities exist. The value is that
more than a critical mass of agreement formed, and this agreement allowed the
development of tools and standards useful across industries. Moreover, the
low-level agreement on XML has enabled higher level agreement on vocabularies
and grammars in many disparate sectors.

Sure, if you had a clean slate you could improve many parts of the XML
standards. Consider, for example, XML Schema (XSD). The irregularities and
complexity there are similarly annoying. I'd much rather write RELAX schemas.
However, whole sectors have selected XML Schema and developed vocabularies and
grammars whose value isn't found at the syntactical level; it's found in the
agreement reflected in the decisions over vocabulary and grammar made by
disparate groups striving to develop a common exchange mechanism.

Quibbles over syntax miss XML's real merit: XML enables _agreement_ in
developing vocabularies and grammars.

Apropos: Here's my Stack Overflow answer to the question of when to use XML
attributes vs elements, written in XML for fun:
[http://stackoverflow.com/a/23133755/290085](http://stackoverflow.com/a/23133755/290085)

~~~
tel
I don't mind the syntax and suppose I agree that consensus syntax is the right
syntax.

The problem is beneath that: XML's _semantics_ are tremendously large while
also being a complete mess. The upshot is that the difficulty of being a
compliant adopter---of matching such a large, nebulous, incompletely defined
set of inconsistencies---is tremendous and instead there are leagues of
partial, massively inconsistent consumers and producers who hammer out
successful communication via a list of patches.

To some degree this situation was inevitable, but XML exacerbates it and
cements the pattern.

~~~
kjhughes
Semantics has two senses that could apply here.

In one sense, XML itself is semantically neutral. The association between
element/attribute _names_ and real-world _objects_ is merely suggestive.

In another sense (that of semantic analysis found in compiler design),
validating XML parsers could be said to be enforcing semantics expressed in
schemas. I suppose it is in this sense that your objection applies. In the
case of XML Schemas or DTDs, I would have to agree. You might find RELAX NG to
be less of a semantic mess given its basis in finite tree automata.

~~~
tel
A majority of my concerns were resting on XML Schemas and the like—or even OWL
(!)—but even recognizing what you mean about XML being semantically neutral
it's still... semantically dangerous.

The issue at hand, where attributes and elements can serve very similar purposes,
is a great example. In a technical sense XML does not suggest anything about
the relative meaning of these two bits, but with near universal acclaim
they're used as "metadata" and "child data" whatever the speaker might believe
those to mean.

In spiritually embracing TIMTOWTDI, XML has created a playing field where
semantics are by default ambiguous and have to fight their way out of that
disadvantage.

In some sense, XML-without-semantics has too much structure and this creates a
minefield for laying meaning atop it.

------
thomasfl
I get terrible deja vu moments when I read Naggum's postings on the net. I
knew him as a fellow student, long before I heard about his rants on the net
and he actually had a very soft voice and spoke very calmly. Later I had the
opportunity to work with him professionally as well. He was a pleasure to have
in meetings away from keyboard, but group discussions by e-mail could be very
unpleasant.

~~~
wglb
Tim Bray said pretty much the same thing:
[https://www.tbray.org/ongoing/When/200x/2009/06/20/Erik-
Nagg...](https://www.tbray.org/ongoing/When/200x/2009/06/20/Erik-Naggum)

------
adwf
This reminds me of one of my biggest facepalm-ing moments when I first started
using Lisp. I needed to serialise some data to disk, it was all in a list form
- (list (a b) (c d)) kinda thing. So I thought to use XML format to store it
on disk as that's what I had experience with in other languages.

I was about halfway through writing the (write-to-xml) function when I
realised I could just write the entire lisp expression to the file instead...

Not only was it smaller, it had the huge advantage of not needing to parse it
back when I wanted to read from disk, I just called the inbuilt (read)
function. Thankfully I only wasted an afternoon!

That's what really ingrained the whole "data is code, code is data" idea into
me. I even went on to write the structure function to each file as well, so
that no matter what I did to the data structure down the line, I would still
have the actual code needed to understand it. Right next to the data itself.
Not only was my code self-documenting, my data was as well ;)
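
For readers without a Lisp at hand, here is a rough Python analogue of that round trip. It is only a sketch, not adwf's code: the names `save_literal` and `load_literal` are made up for illustration, and `ast.literal_eval` stands in for Lisp's `read` in the sense that it parses a literal without executing arbitrary code.

```python
import ast
import os
import tempfile

# Nested list standing in for the Lisp data (list (a b) (c d)).
data = [["a", "b"], ["c", "d"]]

def save_literal(value, path):
    """Write the value's literal representation to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(repr(value))

def load_literal(path):
    """Parse the literal back; literal_eval accepts literals only, not code."""
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    save_literal(data, path)
    restored = load_literal(path)

print(restored == data)  # True
```

No separate serialisation format or hand-written parser is involved; the language's own literal syntax does both jobs.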

~~~
wodenokoto
Why does everybody say it's a lisp thing to save an expression to disc, or say
"code is data in lisp" kinds of things?

Every language I've ever used has variable assignment and you can write those
to a document and execute it.

------
chriswarbo
Some other comparisons of XML and s-expressions:

[http://okmij.org/ftp/Scheme/xml.html](http://okmij.org/ftp/Scheme/xml.html)

[http://homepages.inf.ed.ac.uk/wadler/papers/xml-
essence/xml-...](http://homepages.inf.ed.ac.uk/wadler/papers/xml-essence/xml-
essence.pdf)

A choice quote from the latter:

> So the essence of XML is this: the problem it solves is not hard, and it
> does not solve the problem well.

------
jerf
That's actually why I _like_ XML... the two-dimensional distinction between
"metadata" and "content" can be really helpful... _if and only if_ you need
it. If you need it, then you are dealing with a fundamentally complicated
format, and you will actually experience _more_ pain trying to jam it into a
single-dimensional format like JSON. By contrast, if the relevant data does
_not_ need the two dimensions, which is true of most things, XML is a terrible
choice; left adrift without a meaningful way to distinguish between what
should be "attribute" and what should be "content", both in theory and in
practice all the choices turn out to be terrible.

And, again, if you _really_ need it, the available third dimension provided by
XML namespacing is a _life-saver_ when you need it, and incredibly redundant
when you don't.

I often dislike the phrase "Use the right tool for the job", because it is
vacuous when not tied to some sort of suggestion on _how_ you should determine
what the right tool is, so: Match the dimensionality of your data to the
dimensionality of the serialization format. The vast bulk of stuff you will
encounter day-to-day is one dimensional and fits in JSON or a friend just
fine. Use that. A nontrivial quantity of things fits into XML, mostly
delimited text; use XML then. Every once in a while you need modularity in
your XML format... do not reinvent XML namespaces, use them. (This is mostly
for BIG standards, like XMPP or XHTML.) Use neither more nor less than what
you need.
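
To make the "delimited text" case concrete, here is a small Python sketch of what happens when marked-up prose is pushed into JSON. The node shape (`tag`, `attrs`, `children`) and the sample sentence are assumptions chosen for illustration, not any standard encoding.

```python
import json
import xml.etree.ElementTree as ET

def to_json_nodes(elem):
    """Encode an element as a dict, keeping text and child elements in order."""
    node = {"tag": elem.tag, "attrs": dict(elem.attrib), "children": []}
    if elem.text:
        node["children"].append(elem.text)
    for child in elem:
        node["children"].append(to_json_nodes(child))
        if child.tail:
            # Text that follows a child element belongs to the parent's content.
            node["children"].append(child.tail)
    return node

marked_up = '<p>Use <em class="warn">neither more nor less</em> than you need.</p>'
as_json = to_json_nodes(ET.fromstring(marked_up))
print(json.dumps(as_json, indent=2))
```

The XML is one readable line; the JSON version needs an explicit wrapper object for every element just to keep the attribute and the interleaved text apart.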

------
lisper
I have never understood Erik's enduring appeal, particularly in light of the
vehemence with which he could be technically wrong.

    
    
        <foo bar="zot">quux</foo>
        
          should be read as
        
        (foo (bar "zot") "quux")
        
          and most definitely _NOT_ as ((:foo :bar "zot") "quux")
    

Erik has this exactly backwards. His rendering is ambiguous: it does not
distinguish <foo bar="zot"/> from <foo><bar>zot</bar></foo> (as he himself
implicitly acknowledges later in the post). Also, rendering unqualified tags
as keywords makes it easier to deal with namespaces.

IMHO the Right Way to render <foo bar="zot">quux</foo> is ((:foo :bar "zot")
"quux") with the further qualification that if there are no tags then ((:foo)
"quux") can be simplfied to (:foo "quux"). This preserves a 1-1 correspondence
between XML and sexprs, which is IMHO a requirement (in fact, the only
requirement) for a reasonable mapping. If you have a 1-1 correspondence then
you can freely convert back and forth between the two. If you don't, then you
can't.
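
The difference between the two renderings can be checked mechanically. The following Python sketch uses nested lists as stand-ins for s-expressions; the function names `naggum_form` and `head_form` are hypothetical, and the code ignores namespaces and text that follows a child element.

```python
import xml.etree.ElementTree as ET

def naggum_form(elem):
    """(foo (bar "zot") "quux"): attributes and child elements share one shape."""
    out = [elem.tag]
    out += [[name, value] for name, value in elem.attrib.items()]
    if elem.text and elem.text.strip():
        out.append(elem.text.strip())
    out += [naggum_form(child) for child in elem]
    return out

def head_form(elem):
    """((:foo :bar "zot") "quux"): attributes live in the head list."""
    head = [":" + elem.tag]
    for name, value in elem.attrib.items():
        head += [":" + name, value]
    out = [head]
    if elem.text and elem.text.strip():
        out.append(elem.text.strip())
    out += [head_form(child) for child in elem]
    return out

as_attribute = ET.fromstring('<foo bar="zot"/>')
as_child = ET.fromstring('<foo><bar>zot</bar></foo>')

print(naggum_form(as_attribute) == naggum_form(as_child))  # True
print(head_form(as_attribute) == head_form(as_child))      # False
```

Under the first mapping both documents become `["foo", ["bar", "zot"]]`, so the original XML cannot be recovered; under the second they stay distinct.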

~~~
brandonbloom
You've misunderstood his point with that example. He knows it's ambiguous, but
he's telling you that it's an undesirable distinction to make.

~~~
lisper
OK, but then he's wrong about that. <input type=button/> does not mean the
same thing as <input><type>button</type></input>.

No matter how you slice it, Erik is wrong here. His translation is lossy, and
a lossy translation is bad because it can't be applied bidirectionally.

~~~
brandonbloom
Again, you're missing the point. He's saying that those two things _shouldn't
be different_. He's clearly well aware that they are different.

His very first sentence is:

    
    
        As long as we think aloud in alternative syntaxes, I actually prefer to
        break the _incredibly_ stupid syntactic-only separation of elements and
        attribute values.
    

His entire post is a hypothetical in which you 1) need a structured markup
language and 2) are willing to break compatibility with XML.

~~~
lisper
No, I understand what he is saying. My claim is that he is wrong. They
_should_ be different. There's a significant and useful difference between
data and metadata, and hence there's a significant and useful difference
between tags and attributes, and if you throw out that distinction you lose
useful functionality, and specifically you lose the ability to distinguish
between, e.g.:

<input type=button/>

and

<input><type>button</type></input>

Tags compose in different ways than attributes, that difference reflects a
real aspect of the structure of the problem, and hence it's useful.

~~~
brandonbloom
But it's not a difference between metadata and data. It's a difference between
structural representations. Either or both representations can be used to
encode data or metadata, but that separation comes from the interpretation of
the markup.

~~~
lisper
Yes. But the two structures at issue are:

((foo baz bar) bing)

and

(foo (baz bar) bing)

The only difference is in the location of a single paren. But the latter
representation (Erik's) loses information whereas the former does not. All
else being equal, and absent a compelling argument why the lost information
has no value, it is better not to lose information.

~~~
brandonbloom
1) There's no information to be lost. He's not talking about encoding XML,
he's talking about general design criteria for a cleanroom markup language
design. I don't understand why this point is hard to get.

2) The idea that it's "better not to lose information" is totally nonsensical
because the value of information is contextual. For example, the ascii
encoding of this message usefully omits the subtleties of my awful handwriting,
although you're right: it would be better to preserve the information present
in the deep pencil strokes, indicating my frustration with this thread.

~~~
lisper
> He's not talking about encoding XML

Yes he is:

"I have come to _loathe_ the half-assed hybrid that some XML-in-Lisp tools use
and produce"

> The idea that it's "better not to lose information" is totally nonsensical

You are knocking down a straw man. I didn't say that it is better not to lose
information, full stop. I said "All else being equal, and absent a compelling
argument why the lost information has no value" it is better not to lose
information. In the case of your handwriting, all else is not equal.

------
breck
The LISP code there doesn't make things clearer IMO but I agree completely
with his point that the attribute and content rules add huge unneeded
complexity to XML.

Really enjoy these types of articles on markup languages. If there are any
other markup language nerds out there I've been chipping away at a simple one
([https://github.com/breck7/space](https://github.com/breck7/space)), and
would love feedback. And if you have one as well, I always love looking at
different ideas.

------
cm3
I wish we had an S-Expressions based alternative to Docbook for writing
manuals and books. You can argue that DSSSL should have been preferred over
XSL and friends but people seem to like verbosity even if the alternative
(S-Expressions) is as precise and unambiguous.

~~~
brudgers
Matthew Flatt's Scribble might be in that ballpark, I think.

[http://docs.racket-lang.org/scribble/index.html?q=dot](http://docs.racket-
lang.org/scribble/index.html?q=dot)

------
otabdeveloper1
Someone's been confusing syntax and internal representation again.

Simple: parse

    
    
      <foo bar="zot">quux</foo>
    

as (JSON-like)

    
    
      { 
        "foo": {
           "bar" : "zot",
           "" : "quux"
         }
      }
    

Problem solved, XML is now logical again.
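
A minimal Python sketch of that mapping, assuming attributes become keys and the element's text goes under the empty-string key; the helper name `to_dict` is made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

def to_dict(elem):
    """Map attributes to keys and the element's own text to the "" key."""
    body = dict(elem.attrib)
    if elem.text and elem.text.strip():
        body[""] = elem.text.strip()
    for child in elem:
        # Children are merged by tag name, so repeated siblings (or a child
        # named like an attribute) overwrite each other in this sketch.
        body.update(to_dict(child))
    return {elem.tag: body}

doc = ET.fromstring('<foo bar="zot">quux</foo>')
result = to_dict(doc)
print(json.dumps(result, indent=2))
```

The single example round-trips cleanly, but as the comment in the loop notes, repeated sibling elements and mixed content would need extra conventions on top of this.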

~~~
networked
There is an existing alternative syntax for XML data [1] that represents it as
a list of key-value pairs where the keys are XPath-like paths. In this syntax

    
    
      <foo bar="zot">quux</foo>
    

would be represented as

    
    
      /foo/@bar=zot
      /foo=quux
    

I think data serialization formats of this particular kind are underused. They
are dead simple to parse while still being reasonably human-friendly (more so
than XML, I'd say). Get rid of the attributes and introduce the convention
that nodes with children named "0", "1", ... , "$N" are arrays and you can
represent JSON.

[1] [http://www.ofb.net/~egnor/xml2/](http://www.ofb.net/~egnor/xml2/)
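
To show how little code such a format needs, here is a Python sketch that produces the path-style lines above. It is not the xml2 tool itself: the function name `flatten` is hypothetical, and escaping, repeated siblings and empty elements are not handled.

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Yield 'path=value' lines: attributes as /path/@name, text as /path."""
    path = f"{prefix}/{elem.tag}"
    for name, value in elem.attrib.items():
        yield f"{path}/@{name}={value}"
    if elem.text and elem.text.strip():
        yield f"{path}={elem.text.strip()}"
    for child in elem:
        yield from flatten(child, path)

lines = list(flatten(ET.fromstring('<foo bar="zot">quux</foo>')))
print("\n".join(lines))
# /foo/@bar=zot
# /foo=quux
```

Because every line carries its full path, the output can be filtered with ordinary line-oriented tools before being turned back into a tree.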

~~~
smrtinsert
I like this, I had an instinct about making a file format similar to this,
except pushing the path onto a stack.

<foo bar="zot"><baz>quux</baz></foo>

    
    
        [foo]
        @bar="zot"
        
        [baz]
        "quux"
        
        []
        []
    

I figured you would save a little in the serialization. In practice I rarely
need such a thing, and just serialize a structure to whatever I'm working with
(s expressions, or in java xstream etc since those xml files are read
somewhere...)
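
One possible reading of that stack-based format, sketched in Python: `[name]` pushes an element, `[]` pops it, attributes are written as `@name="value"` and text as a quoted string. The function name `stack_lines` is an assumption, and the blank separator lines from the example above are omitted.

```python
import xml.etree.ElementTree as ET

def stack_lines(elem):
    """Emit push/pop style lines for an element and its descendants."""
    yield f"[{elem.tag}]"                 # push the element onto the path
    for name, value in elem.attrib.items():
        yield f'@{name}="{value}"'
    if elem.text and elem.text.strip():
        yield f'"{elem.text.strip()}"'
    for child in elem:
        yield from stack_lines(child)
    yield "[]"                            # pop back to the parent

doc = ET.fromstring('<foo bar="zot"><baz>quux</baz></foo>')
out = list(stack_lines(doc))
print("\n".join(out))
```

Each element name appears once instead of being repeated on every line, which is where the saving over a full-path format would come from.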

------
yarrel
Where did he copy and misunderstand this from?

