Just the sort of thing you would think of as 'documents'--the texts of books, ma...

jolmg · on Oct 29, 2019

> you could do this in JSON or some other data structure

I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?

  <?xml version="1.0" encoding="UTF-8"?>
  <book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>Very simple book</title>
    <chapter xml:id="chapter_1">
      <title>Chapter 1</title>
      <para>Hello world!</para>
      <img src="hello.jpg"/>
      <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
    </chapter>
    <chapter xml:id="chapter_2">
      <title>Chapter 2</title>
      <para>Hello again, world!</para>
    </chapter>
  </book>

Would you make each <chapter> into an object? But you have 2 <para> children in there with an <img> in between. And one <para> has an additional <emphasis> in the content. I can't think of a good JSON schema equivalent to this.

[1] https://en.wikipedia.org/wiki/DocBook#Sample_document

uryga · on Oct 29, 2019

you could always do an S-expression-esque DSL in JSON ;)

  ['book', {'id': '...'},
    ['title', {}, ...],
    ['chapter', {'id': 0},
      ['title', {}, 'Chapter 1'],
      ...
    ],
    ['chapter', {'id': 1},
      ...
    ],
  ]

more realistically, you could just represent it with the AST of that XML, i.e.

  {
    'type': 'book',
    'attrs': {'id': ...},
    'children': [
      {
        'type': 'title',
        'children': ['Very simple book']
      },
      {
        'type': 'chapter',
        ...
      },
      {
        'type': 'chapter',
        ...
      },
      ...
    ]
  }

so you could do that emphasis bit as

  [
    'this text needs more',
    {'type': 'emphasis', 
     'children': ['emotion']},
    '!'
  ]

hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.

if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
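a minimal sketch of that AST idea in Python, assuming the standard library's xml.etree.ElementTree as the parser. the names to_ast and local_name are made up for illustration, and dropping whitespace-only text nodes is a simplification, not something the format requires:

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree spells namespaced names as "{uri}local"; keep only the local part.
    return tag.rsplit('}', 1)[-1]


def to_ast(elem):
    """Convert an element into a {'type', 'attrs', 'children'} dict (hypothetical shape)."""
    children = []
    # Text before the first child element.
    if elem.text and elem.text.strip():
        children.append(elem.text)
    for child in elem:
        children.append(to_ast(child))
        # Text that follows the child but still belongs to this element.
        if child.tail and child.tail.strip():
            children.append(child.tail)
    node = {'type': local_name(elem.tag)}
    if elem.attrib:
        node['attrs'] = {local_name(k): v for k, v in elem.attrib.items()}
    if children:
        node['children'] = children
    return node


para = ET.fromstring(
    '<para>I hope that your day is proceeding '
    '<emphasis>splendidly</emphasis>!</para>'
)
ast = to_ast(para)
print(json.dumps(ast, indent=2))
```

the mixed content survives because strings and child nodes share one ordered list, which is the same trick as the emphasis example above.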

unilynx · on Oct 29, 2019

See https://developers.google.com/docs/api/samples/output-json for what Google Docs does - basically separating markup from the text by using indices.

which is probably the only way to properly deal with markup and especially commented sections that can span paragraph starts/ends - neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can do that, especially if humans still want to reasonably be able to view or edit it...

(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
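
A rough sketch of the index-based idea in Python, to show why it handles an annotation that crosses a paragraph boundary. This is not the Google Docs or OOXML format; the Span class, the kind names, and the covering helper are all assumptions made up for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset
    kind: str   # hypothetical label: "para", "emphasis", "comment", ...


# The text is stored once, with no markup in it.
text = "Hello world!\nI hope that your day is proceeding splendidly!"

newline = text.index("\n")
emph = text.index("splendidly")

spans = [
    Span(0, newline, "para"),
    Span(newline + 1, len(text), "para"),
    Span(emph, emph + len("splendidly"), "emphasis"),
    # A comment that starts in the first paragraph and ends in the second.
    Span(text.index("world"), text.index("hope") + len("hope"), "comment"),
]


def covering(spans, offset):
    """Return the kinds of every span that contains the given offset."""
    return sorted(s.kind for s in spans if s.start <= offset < s.end)


comment = spans[3]
print(repr(text[comment.start:comment.end]))
print(covering(spans, emph))
```

The comment span overlaps both paragraph spans without being nested in either, which is exactly what a tree of elements cannot express directly.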

dfox · on Oct 30, 2019

That is what essentially every WYSIWYG text processor does. And also the reason why getting sane HTML out of a text processor is somewhat non-trivial, as the separately indexed spans can very well overlap, contradict each other or contain completely unnecessary formatting information.
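
To illustrate the overlap problem, here is a small Python sketch, assuming spans are plain (start, end, tag) tuples. The helpers to_runs and to_html are invented for this example and do not come from any real word processor. Overlapping spans have to be cut at every boundary before they can be nested as tags:

```python
from html import escape


def to_runs(text, spans):
    """Split text at every span boundary; return (substring, active tags) runs."""
    cuts = sorted({0, len(text), *(p for s, e, _ in spans for p in (s, e))})
    runs = []
    for a, b in zip(cuts, cuts[1:]):
        # A tag is active on a run if its span fully covers the run.
        tags = sorted({t for s, e, t in spans if s <= a and b <= e})
        runs.append((text[a:b], tags))
    return runs


def to_html(text, spans):
    """Wrap each run in its active tags, so the output is always well nested."""
    out = []
    for chunk, tags in to_runs(text, spans):
        piece = escape(chunk)
        for t in tags:
            piece = f"<{t}>{piece}</{t}>"
        out.append(piece)
    return "".join(out)


text = "bold both italic"
# "b" covers "bold both" and "i" covers "both italic": the two spans overlap.
spans = [(0, 9, "b"), (5, 16, "i")]
html = to_html(text, spans)
print(html)
# prints: <b>bold </b><i><b>both</b></i><i> italic</i>
```

The output is well-formed but fragmented: two spans became four tag pairs, and merging adjacent runs back into something a human would have written takes extra work.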

bradstewart · on Oct 29, 2019

Potential option:

  {
    "id": "simple_book",
    "title": "Very simple book",
    "chapters": [
      {
        "id": "chapter_1",
        "content": [
          { "type": "title", "value": "Chapter 1" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "Hello world!" }
            ]
          },
          { "type": "img", "src": "hello.jpg" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "I hope that your day is proceeding " },
              { "type": "emphasis", "value": "splendidly" },
              { "type": "text", "value": "!" }
            ]
          }
        ]
      }
    ]
  }

Finnucane · on Oct 29, 2019

But as pointed out in the article, JSON isn't necessarily going to guarantee the correct order of your nested bits. Your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are for instance creating a marked up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.

bradstewart · on Oct 29, 2019

Certainly. I wasn't suggesting that JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.

dwaite · on Oct 30, 2019

I absolutely agree. Where XML shines is when you could take just the text content - strip out all the markup elements and attributes, document, comments, etc. - and have a text document that still makes some sort of sense.

This AFAICT was actually why SVG has a few bizarre choices - such as putting all the drawing commands into attributes. A browser that doesn't understand an embedded SVG document in its HTML would be left with just the text contents.
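
The "strip the markup and see what is left" test is easy to try in Python with the standard library's xml.etree.ElementTree. This is a sketch using a cut-down version of the DocBook sample from upthread (the image element is omitted); collapsing whitespace with split() is my own choice for readability:

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Very simple book</title>
  <chapter>
    <title>Chapter 1</title>
    <para>Hello world!</para>
    <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
  </chapter>
</book>"""

root = ET.fromstring(doc)

# itertext() yields every text node in document order, ignoring tags and attributes.
raw = "".join(root.itertext())

# Collapse the indentation and newlines that were only there for layout.
text_only = " ".join(raw.split())
print(text_only)
# prints: Very simple book Chapter 1 Hello world! I hope that your day is proceeding splendidly!
```

Running the same test on any of the JSON encodings in this thread would need a format-specific walker, since a generic JSON parser has no notion of which strings are content and which are structure.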