

How NPR's new website stores HTML markup separately from content in the database. - smharris65
http://blog.programmableweb.com/2009/11/11/content-portability-building-an-api-is-not-enough/

======
madair
I'm really hesitating to comment. It's way too easy for us types to miss
really good ideas due to our preconceptions. But it's hard for me to look past
the fact that this looks like a really bad abstraction layered on top of the
structured document abstraction that was designed specifically to suit NPR's
purpose.

For example, they show the <em> tag being stored by position in the database
separately from the content itself, seemingly unaware of how significant it is
that <em> is a _semantic_ tag. It means _emphasis_. If the iPod doesn't like <em>
for emphasis, well then that's why it's a _semantic_ tag. You can remove
semantic tags or replace them with alternative markup or binary formats as
needed because they have meaning designed for parsing and formatting.

It's all well and good to hate on XML for inappropriate uses. But, critical
point here, documents are _what XML is for!_ And there is a shocking wealth of
libraries in just about every language available today that will handle
document markup just fine.

It seems to me to be a quintessential reinvention of the wheel, and the XML
hating has just gone too far at this point.

~~~
pradocchia
It's a lot cheaper to insert your markup by position than to parse the XML on
every read. Of course, caching would mitigate much of that, but still I admire
the design.
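
A minimal sketch of what inserting markup by position could look like,
assuming the tags are kept as non-overlapping, non-empty (start, end, tag)
character offsets next to the plain text. The function name and the span
layout are illustrative and not taken from NPR's schema.

```python
def render_html(text, spans):
    """Re-insert tags into plain text from stored character offsets.

    spans: list of (start, end, tag) tuples, assumed non-overlapping.
    """
    inserts = []
    for start, end, tag in spans:
        inserts.append((start, 1, f"<{tag}>"))   # opening tag
        inserts.append((end, 0, f"</{tag}>"))    # closing tag
    # Sort by offset; at the same offset a closing tag (0) goes before
    # an opening tag (1), so adjacent spans do not interleave.
    inserts.sort(key=lambda item: (item[0], item[1]))

    pieces = []
    pos = 0
    for offset, _, markup in inserts:
        pieces.append(text[pos:offset])
        pieces.append(markup)
        pos = offset
    pieces.append(text[pos:])
    return "".join(pieces)


text = "This is really important."
spans = [(8, 14, "em")]  # "really" occupies characters 8..13
print(render_html(text, spans))  # This is <em>really</em> important.
```

No parsing happens on read: a target that cannot show emphasis would simply
serve `text` and skip the spans.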

~~~
patio11
You wouldn't even have to parse the XML -- just anoint one form the canonical
internal representation and then have filters which turn the internal
representation into other formats.

For example, if you store all documents in a restricted version of HTML, then
you can generate a text-only version by replacing all <em> and </em> with
an asterisk, etc. This is much cheaper than actually parsing XML (no need to build up
a data structure representation of the document, just assume it is mostly
right already and apply textual transformations).
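
A rough sketch of such a filter, assuming the canonical form is a restricted
HTML in which <em> never carries attributes. The name `to_plaintext` is made
up for illustration.

```python
import re

# Matches only the bare opening and closing em tags of the restricted HTML.
EM_TAG = re.compile(r"</?em>")


def to_plaintext(html):
    """Turn <em>...</em> into *...* with a plain textual substitution."""
    return EM_TAG.sub("*", html)


print(to_plaintext("This is <em>really</em> important."))
# This is *really* important.
```

Nothing here builds a tree of the document; it only works because the stored
markup is trusted to be well-formed already.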

Then, once you have the plaintext version, cache it. Journalism sites are
going to have an _absurd_ number of reads to writes. (One guy writes the
article, one guy edits it, 50,000 people see it within a minute of it being
posted.) Caching is cheap and easy. Caching is robust against the article
changing, too -- you purge the cache (or potentially just let it expire) and
bam, your new article is mostly right, whereas tracking the offsets of <em>s
intuitively seems to have a lot of possible failure modes. (For example,
the first time one of your reporters tries to add a bit of local color with
Japanese text in an article, I'll bet dollars to donuts that every tag after
the Japanese breaks and the surrounding content gets corrupted.)
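
One way that kind of breakage could happen, sketched under the assumption
that offsets are recorded in characters but applied to UTF-8 bytes somewhere
in the pipeline. This is a hypothetical failure mode, not something known
about NPR's system.

```python
text = "東京 is really big."
start = text.index("really")      # offset counted in characters
end = start + len("really")

# Applied to characters, the offsets land where intended.
by_chars = text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]

# Applied to the UTF-8 bytes, every tag after the Japanese text is shifted,
# because each of the two kanji takes three bytes instead of one.
raw = text.encode("utf-8")
by_bytes = (raw[:start] + b"<em>" + raw[start:end] + b"</em>" + raw[end:]).decode(
    "utf-8", errors="replace"
)

print(by_chars)  # 東京 is <em>really</em> big.
print(by_bytes)  # 東京<em> is re</em>ally big.
```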

------
pkulak
That just seems really fragile to me. Yank one space off the front of any
article and the entire thing is jacked. I'm just not convinced that it's all
that terrible to parse the markup on the way out if you need it transformed,
while enforcing strict rules on what (valid) markup can go in.
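
A small sketch of that fragility, assuming a single stored (start, end) pair
for one <em> and an edit that does not update it. The helper `apply_em` is
hypothetical.

```python
def apply_em(text, start, end):
    """Wrap text[start:end] in an em tag using stored character offsets."""
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


original = " This is really important."
start, end = 9, 15                 # offsets recorded against the original

edited = original.lstrip()         # one leading space removed, offsets untouched

print(apply_em(original, start, end))  #  This is <em>really</em> important.
print(apply_em(edited, start, end))    # This is r<em>eally </em>important.
```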

~~~
pradocchia
Either way, you would have to parse it into something equivalent to what they
are storing in the database so you could handle the various target formats
individually.

As long as you tightly control writes to the database, I don't see the
problem. And, you don't have the overhead of parsing on every read. For an
environment where reads will be a few orders of magnitude greater than writes,
it does make some sense.

~~~
akamaka
It seems incredible that they can't sacrifice some speed to parse on every
read, and yet are also unwilling to sacrifice some disk space to cache several
fully parsed versions.

Considering how quickly CPU and disk performance change, this type of
tightrope-balanced optimization seems crazy. Or maybe they're way smarter than
me.

------
wglb
This is clearly working for them. What is interesting about this is that it
casts a vote on the SGML idea that content and presentation are nicely
separated and identified by the tags. But I have always been faintly concerned
about whether this really will work.

As an example, madair notes in an earlier comment that _em_ is intended to be a
_semantic_ tag. But do the semantics of this particular tag ever mean anything
outside of "show this text as emphasized"?

I am wondering if whatever semantic markup we do is still all about
presentation, and takes a fairly arbitrary abstraction of what the text is to
_mean_.

~~~
madair
It's a twist, but if we say that an iPod _cannot_ emphasize text, then we can
instruct the system to ignore <em> tags. Because the semantics of <em> are
clearly defined, we can make the judgment call, with confidence, to ignore the
tag completely under the circumstances. So we still make use of the tag
itself, by instructing the system to swallow it.

~~~
wglb
See, this is what I find a bit unclear. The concept is that presentation is
distinct from content. But if we say that the _semantics_ of the em tag are to
affect the presentation, they seem not to be separated anymore, right?

~~~
madair
At a certain point the semantics of the content get translated into an output
format. In a browser it's italic. In an iPod it's ignored.

The semantic tags are the means, but output presentation is still the end, and
there's certainly no reason to expect that all devices can take the original
semantic markup and use CSS or some other mechanism to present it. And because
of the semantic meaning we can make the em disappear with confidence.
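
A rough sketch of that translation step, assuming one rule per output target
for what <em> turns into. The target names and the `EM_RULES` table are
illustrative only.

```python
# What the semantic <em> becomes on each (hypothetical) output target.
EM_RULES = {
    "browser": ("<i>", "</i>"),  # rendered as italic
    "text": ("*", "*"),          # plain-text convention for emphasis
    "ipod": ("", ""),            # cannot emphasize, so the tag is swallowed
}


def render(doc, target):
    """Translate the semantic em tag into the target's presentation."""
    open_markup, close_markup = EM_RULES[target]
    return doc.replace("<em>", open_markup).replace("</em>", close_markup)


doc = "This is <em>really</em> important."
print(render(doc, "browser"))  # This is <i>really</i> important.
print(render(doc, "ipod"))     # This is really important.
```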

