Just the sort of thing you would think of as 'documents'--the texts of books, manuscripts, and the like, where structure may be somewhat arbitrary. For instance, I work with a few different text corpuses--one of which is an actual dictionary, with entries, definitions, usage examples, etymological information, and bibliographic references. Another is a collection of poetry manuscripts, with annotations for line breaks and editorial emendations, both from the author and other editors (i.e., places in the manuscript with crossouts, interlinear notes, marginal notes, etc.).
I mean, in theory, you could do this in JSON or some other data structure. But you would go insane and be shooting yourself in the head before long.
> you could do this in JSON or some other data structure
I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>Very simple book</title>
<chapter xml:id="chapter_1">
<title>Chapter 1</title>
<para>Hello world!</para>
<img src="hello.jpg"/>
<para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
</chapter>
<chapter xml:id="chapter_2">
<title>Chapter 2</title>
<para>Hello again, world!</para>
</chapter>
</book>
Would you make each <chapter> into an object? But you have 2 <para> children in there with an <img> in between. And one <para> has an additional <emphasis> in the content. I can't think of a good JSON schema equivalent to this.
[
  'this text needs more',
  {'type': 'emphasis',
   'children': ['emotion']},
  '!'
]
hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.
if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
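For what it's worth, here is a minimal sketch of producing that nested-list shape mechanically, assuming Python's standard `xml.etree.ElementTree`. The `to_node` helper and the `type`/`attrs`/`children` key names are made up for illustration and are not any standard mapping. The point is that text and child elements end up interleaved in one ordered list, which is what mixed content needs.

```python
import json
import xml.etree.ElementTree as ET


def to_node(elem):
    """Convert an element to a dict whose 'children' list interleaves
    text runs and child nodes in document order (hypothetical shape)."""
    children = []
    if elem.text:
        children.append(elem.text)          # text before the first child
    for child in elem:
        children.append(to_node(child))     # the child element itself
        if child.tail:
            children.append(child.tail)     # text following that child
    node = {"type": elem.tag, "children": children}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    return node


para = ET.fromstring(
    "<para>I hope that your day is proceeding "
    "<emphasis>splendidly</emphasis>!</para>"
)
tree = to_node(para)
print(json.dumps(tree, indent=2))
```

Because the ordering lives in JSON arrays rather than object keys, it survives a round trip through any conforming JSON parser. It is still far more verbose than the XML it came from.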
which is probably the only way to properly deal with markup and especially commented sections that can span over paragraph start/ends - neither JSON nor XML seems to have a proper answer for such annotations and I wonder if there's any standard format that can do that, especially if humans still want to be able to reasonably view or edit it...
(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
That is what essentially every WYSIWYG text processor does. It is also the reason why getting sane HTML out of a text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other, or contain completely unnecessary formatting information.
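A rough sketch of the separately-indexed-spans idea, in Python. The `Span` class, the `crossing` helper, and the offsets are all invented for illustration and do not reflect how any particular word processor stores its data. The annotations sit beside the plain text and point into it by character offset, so a comment can straddle a paragraph boundary, which a single element tree cannot express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset into the text
    end: int     # exclusive character offset into the text
    label: str


def crossing(a, b):
    """True if two spans partially overlap: they share characters,
    but neither one fully contains the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


text = "Hello world! Hello again, world!"
spans = [
    Span(0, 12, "paragraph"),
    Span(13, 32, "paragraph"),
    Span(6, 18, "comment"),    # starts in one paragraph, ends in the next
]

comment = spans[2]
print(repr(text[comment.start:comment.end]))
print([crossing(comment, s) for s in spans[:2]])
```

Any pair of spans for which `crossing` is true cannot be turned into properly nested tags without splitting one of them, which is part of why HTML exported from such a model tends to be messy.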
But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.
Certainly. I wasn't suggesting that the JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.
I absolutely agree. Where XML shines is when you could take just the text content - strip out all the markup elements and attributes, document, comments, etc. - and have a text document that still makes some sort of sense.
This AFAICT was actually why SVG has a few bizarre choices - such as putting all the drawing commands into attributes. A browser that doesn't understand an embedded SVG document in its HTML would be left with just the text contents.
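The "strip the markup and the text still makes sense" property is easy to check, at least as a sketch. This assumes Python's standard `xml.etree.ElementTree` and a cut-down, namespace-free fragment based on the sample book above.

```python
import xml.etree.ElementTree as ET

doc = (
    "<chapter>"
    "<title>Chapter 1</title>"
    "<para>I hope that your day is proceeding "
    "<emphasis>splendidly</emphasis>!</para>"
    "</chapter>"
)

root = ET.fromstring(doc)
# itertext() yields every text run in document order, ignoring tags
# and attributes; join per top-level child so the blocks stay separate.
plain = " ".join("".join(child.itertext()) for child in root)
print(plain)
```

Anything stored in attributes simply disappears from this view, which is consistent with the SVG point above.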