Portable Text – JSON rich text specification for content editing platforms (github.com)
123 points by thefln 4 days ago | 80 comments





XML is superior to JSON for text documents. Far superior.

If you want to build a JSON format for text, then start by considering how to map XML to JSON. Many attempts at that exist, most of them terrible because the author didn't understand the ordering requirements of child nodes (in XML they are ordered as they appear, but in JSON that's only true of arrays). The only reasonable way to map XML to JSON is to use arrays for all child nodes of any node, using objects at most for things like node attributes and PIs.
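To illustrate the array-based mapping I mean (a rough sketch in the spirit of JsonML, not any particular standard):

  <p id="x">Hello <b>world</b>!</p>

becomes

  ["p", {"id": "x"},
    "Hello ",
    ["b", "world"],
    "!"
  ]

A naive object mapping like {"p": {"b": "world", ...}} can't say whether the <b> comes before or after the surrounding text, because object keys are unordered.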

I've done a fair bit of XSLT and jq programming. I understand the appeal of JSON here: jq is much pithier than XSLT/XPath! But first you have to get a bunch of things right that XML already got right.
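For a flavor of the difference, here's a jq one-liner over a hypothetical marks-based JSON document (the field names are made up for illustration):

  # collect the text of every span marked "strong"
  jq '[.. | objects | select(.marks? // [] | index("strong")) | .text]' doc.json

The equivalent XSLT needs a stylesheet and template boilerplate around it before you get to the actual logic.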

EDIT: Once you have a very complex document schema, its mapping onto JSON will complicate any jq programs to the point where JSON's DSL advantage over XML disappears. XML simply is the right tool for documents.


It is the circle of life: XML is a fine format for structured text, but some people decided it should be used for everything - tabular data, key-value sets, programming code, what have you. Then somebody figures out an alternative (JSON) is much simpler and more concise for some particular use cases. And then some people decide that JSON should be used for everything. It won't be long before someone proposes JHTML - HTML in JSON notation. And of course JCSS.

We programmers love simplicity. The selling point of JSON was that it was really simple compared to the pile of specs which XML had become. But after a while it turned out people actually needed the features the XML standards provided: validation, querying, etc. And a similar set of specs grew around JSON.

My guess for the next revolution: some people figure out it is really annoying to have to quote keys in JSON and that JSON does not express expressions or DSLs very well. So s-expressions come into fashion. And we get sexpr-schema, sexpr-query, and at some point someone will propose sexprs as a standard for structured hypertext.


> XML is superior to JSON for text documents. Far superior.

Why do you think so? Is it because most structured texts are trees?

If so, I understand the appeal. Because hey, HTML is perfect, isn't it? Except it's really not. While HTML is great for rendering, it sucks at WYSIWYG editing.

Example:

    <p>This is <span style="text-decoration:underline;">a <b><i>formatted</i></b> text content</span></p>
Now good luck finding/changing the formatting information for the 9th character ;)

The fundamental point is - rendering is easy with trees, nodes and children. Editing, though, is easier when working with indices than when traversing trees.

https://prosemirror.net/docs/guide/#doc.structure

http://marijnhaverbeke.nl/blog/prosemirror.html#the-document

These write-ups are interesting reads on the topic.


I said nothing about HTML though, did I.

JSON and XML are both difficult to edit by hand, so don't do that. JSON, arguably, is harder to edit by hand because there's no CDATA, so you have to know how to escape characters, especially newline.
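For example, the same two-line string with a quote in it has to be escaped by hand in JSON:

  { "text": "He said \"hi\"\nand left." }

versus XML, where a CDATA section carries it verbatim:

  <text><![CDATA[He said "hi"
  and left.]]></text>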

Both are tree-like structures. Why would one be better than the other? Well, XML has a long history of use and development for text documents, whereas JSON does not, so right there, that argues against JSON.

Does XML suck for WYSIWYG editing? I don't see why it would suck more than, say, TeX. LyX - a WYSIWYM editor for LaTeX - exists, and it can output XHTML using a LyX-specific schema. I'm sure LyX could easily have been built using an XML format from the get-go. You've not made any arguments as to why XML sucks for WYSIWYG editing -- the example you pasted surely is meaningless in a WYSIWYG editing context, since that kinda implies having... a WYSIWYG (GUI) editor. And surely JSON will take more ceremony to express 'bold and italic' than XML, and not much less ceremony than XML to express 'use this specific font' (where the font's name alone is lengthy enough).

All you've done is expressed a preference for encoding details that no one really cares about, and you may not have stopped to really learn and understand XML, so your preference is leading you into making a number of mistakes the most important of which is NIH reinvention. Each NIH reinvention adds a fair amount of cognitive load for the rest of us, and adds development and maintenance burdens that never go away. Please don't do this. Innovation is very important, of course, but don't blindly reinvent to innovate.


I agree. I may have misinterpreted the original comment. Yes, in that way XML is indeed a better way to put it.

I'd still stand by my opinion on not making formatting information separate tags that blow the tree one level deeper. Preserving formatting options as index-based properties would make editing APIs simpler and more convenient.
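A minimal sketch of the shape I mean (hypothetical, loosely in the spirit of how Draft.js keeps inline styles as ranges rather than nested nodes):

  const para = {
    text: "a formatted text content",
    ranges: [
      { offset: 2, length: 9, marks: ["bold", "italic"] }
    ]
  };

  // toggling a style on the 9th character is array manipulation,
  // not splitting and re-nesting tree nodes:
  para.ranges.push({ offset: 8, length: 1, marks: ["underline"] });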

Again, this is nothing in regard to the original comment. Just putting it out there.


XML per se is great for storing tree-like structures. However, it's not actually XML that's bad, but contenteditable, which is what most editors were using until recently.

In contenteditable, the visual information is not one-to-one transferable to a single HTML document, which, in turn, means it is not transferable to a single XML representation (and in all likelihood, the HTML is what is used as the representation, not a transformed XML document).

With a JSON representation, the contenteditable method is no longer the path of least resistance. Building out a proper renderer (whether in canvas or DOM) means you end up removing most of the contenteditable issues.


"contenteditable" is an HTML5 (!) attribute defined by the XML hating HTML5 group. Right?

JSON is harder to edit manually, honestly. Don't do it. Use an editor that supports editing the format.

Nobody is talking about editing the JSON directly. They are talking about programming a wysiwyg editor to the JSON data model.

I would say XML vs JSON makes little difference to the complexity of an application (e.g., an editor) using it.

Manipulating JSON (really, just plain native language objects) is way more straightforward than manipulating the XML DOM. The purpose of formats like this is to make inexpressible the kinds of documents that result in difficult-to-untangle bugs and frustrating UI behaviour. See this article about how Medium's editor was designed for some essential background on the motivation for this JSON format:

https://medium.engineering/why-contenteditable-is-terrible-1...


Until recently? What are most editors using now?

> While HTML is great for rendering, it sucks at WYSIWYG editing.

Nobody says you have to operate on it in that format. You read it into internal structures and output those internal structures to HTML. I don't expect an editor to maintain a 1-to-1 relationship with the storage format while operating in memory.


Perhaps I don't understand you, but HTML is not XML. There was an attempt to make HTML a form of XML a while ago, but people mostly forgot about it for no good reason.

Sorry, it came off that way. Yes, HTML is not XML. I meant any XML-ish representation of rich-text contents. Which means every formatting block makes the tree grow one level deeper.

If you decided you don't want that, you could come up with an XML document format that doesn't allow nested formatting blocks. So your point seems awfully moot.

In fact, even with HTML you could remove all the nesting from formatting blocks on load and either put them back on save or leave it flat (assuming you use CSS and not semantic style tags).


>> XML is superior to JSON for text documents. Far superior.

> Why do you think so? Is it because most structured texts are Trees?

XML allows for mixed content. If I am reading this proposal right, to change

    <span>Some <b>formatted</b> content.</span>
into

    <span>Some <b>forma<i>t<i/>ted</b> content.</span>
you would essentially produce the equivalent of <span>Some <b>forma<b/><b><i>t<i/><b>ted</b> content. </span>

as the "text" field cannot contain childrens.


which is what happens at https://react-prosemirror-example.now.sh/

wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww ->

{ "type": "doc", "content": [ { "type": "paragraph", "content": [ { "type": "text", "text": "wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww" }, { "type": "hard_break" } ] } ] }

and with bold in the middle

{ "type": "doc", "content": [ { "type": "paragraph", "content": [ { "type": "text", "text": "wwwwwwwwwwwwww" }, { "type": "text", "marks": [ { "type": "strong" } ], "text": "w" }, { "type": "text", "text": "wwwwwwwwwwwwwwwww" }, { "type": "hard_break" } ] } ] }

This is not to say that it is a bad format; essentially XML does the same at the API level, and maybe being explicit about it is good, but it sure feels a bit different (again, I do not want to express a better or worse).

Overall I am happy from the simple fact that it does not have the overused "Markup Language" in the name.


Your example shows why XML parsing is so strict.

If I add your example into my XQuery processor, I get: "Different start and end tag: <i>...</b>."

Also, HTML != XML.

An HTML parser may forgive that error, an XML parser clearly does not.

Also, with JsonML there is a lossless mapping from XML to (array-based) JSON (and vice versa), which makes the project linked here totally redundant.


Because not all markup is structural.

For JSON querying, check out JSONata (http://jsonata.org/ - inspired by the location path semantics of XPath 3.1). It was created by Andrew Coleman (part of the W3C XML Query WG).
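A quick sketch with the jsonata npm package (the document shape below is invented; note that recent versions of evaluate() return a Promise):

  const jsonata = require("jsonata");

  const doc = {
    content: [
      { type: "text", text: "hello" },
      { type: "hard_break" }
    ]
  };

  // an XPath-like location path with a predicate, in JSONata syntax
  const expr = jsonata('content[type="text"].text');
  expr.evaluate(doc).then(result => console.log(result)); // "hello"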

I seem to have reinvented parts of that in Python by total happenstance with my latest little project[0]. I'll have to check out the docs some more to see what ideas I can steal.

[0] https://github.com/eponymous/python3-property_tree


This is really good. Right off the top of my head I can come up with interesting possibilities based on this standard.

#1) All major rich text libraries (ProseMirror, Quill, Draft, Slate, Trix & more) out there now have their own capabilities and their own JSON representation. But exporting content across these editors is a pain. You either have to use HTML or Markdown, both of which have formatting loss.

#2) This means when you develop an app around document creation, you are essentially locked in to either the library's JSON data model, or you end up writing migration code and parsers if you want to switch.

#3) What if word processors had an option to export their content into such a JSON representation, such that parsing or rendering this content in browsers wouldn't involve working with HTML?

Challenges:

What if view information like line endings and pagination is part of the model? Coming up with a JSON standard for such cases might be very difficult or impractical.


I think this sounds good. Of course it only really becomes good if different projects start adopting it. Hopefully some of the libraries you listed are taking notice.

If only we had an open specification for formatted documents already. It could be based on text, and ideally there would be lots of different implementations, so it isn't hostage to a de facto master implementation. We could call it OpenDocument or something like that.

Now, a serious question: does this have an advantage over the existing standards, apart from "is JSON-based"? It looks like the company that made this decided to open source it, which is neat, but not very news-worthy.


What's the case for this as opposed to - say - strictly a whitelisted subset of HTML?

There's already good tooling around XML-like dialects and the route from representation to rendering is much simpler.


It's true that this is essentially a markup language (in the original sense of markup: Text annotated with style and semantic information) in JSON form. But it wouldn't be sufficient to just pick a subset of HTML. You would need to extend it with metadata and custom element types. Since HTML elements, attributes and content are all just untyped strings, you will need to invent your own syntax to express other types. You're halfway into XML land at this point.

There's a million reasons why XML is a bad fit, but one stands out: JSON is a first-class citizen on the web, and XML isn't. Once you parse this with the browser's built-in JSON parser, you're good to go; you have a JS object you can plug into an editor, or manipulate and display in all sorts of ways. If this were a custom HTML subset, or XML, the path would be much longer. JSON has many issues, of course, but ease of use isn't one of them.
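A small illustration of that typing gap (made-up element):

  <item count="3" price="9.99" available="true"/>

All three attributes reach the consumer as strings unless both sides agree on a schema. The JSON equivalent carries the types natively:

  { "count": 3, "price": 9.99, "available": true }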


> Once you parse this with the browser's built-in JSON parser, you're good to go; you have a JS object you can plug into an editor, or manipulate and display in all sorts of ways. If this were a custom HTML subset, or XML, the path would be much longer.

Browsers parse XML too, y'know? It was, at one point, even more "first class" than JSON - why do you think browsers had an "XMLHttpRequest" API, instead of just an "HTTPRequest" API?

> You're halfway into XML land at this point.

You mean... XHTML?

The ability to combine HTML tags (retaining their semantic meaning), with tags from your own, other XML namespaces (which you can define the semantic meaning of) was 99% of the point of XHTML.

People thought XHTML was supposed to "replace HTML" or "be the next version of HTML", and so hated it (because they'd have to fix all their existing HTML-authoring tools to make them generate valid XML.)

But no, XHTML was never supposed to entirely supplant HTML; XHTML was just supposed to be a syntax for including HTML-markup-ed text as part of the content of a larger XML document.

And XHTML is still a valid thing to use today. WHATWG is even developing XHTML5 to mirror HTML5.

Really, XHTML should not be considered "some past version of HTML, a bad diversionary experiment like New Coke that we've since gotten back on track from", like most people see it. Rather, XHTML was and is a separate tool, useful in particular domains for acting as a drop-in sub-schema to larger XML document formats that need to represent text markup within them. XHTML provides not only the DTD, but all the semantics for how those elements should translate to rendered results—while still allowing you to do whatever you want, XML-wise, inside and outside and around those elements.


XML is also a first-class citizen on the web. All major browsers have a built-in XML parser to access and manipulate XML. All major server-side programming languages have XML parsing and manipulation libraries.

XML is a far better format for this; it is designed specifically for mixing text content with markup. This format is incredibly hard to read and debug. Once you've parsed this JSON, you're still left with a complex stringy structure that you then have to parse again anyway. The value of being in JSON is incredibly low.


XML is not a native data structure to the web, and JSON arguably is. Browsers may have XML parsers built in, but XML requires APIs to work with in any meaningful way. XML's data model is complex. Where JSON allows you to build the entire data structure using pure JavaScript data types (objects, arrays, numbers, etc.), you can only build XML data reasonably via complex interactions of createElement(), appendChild() and so on. Compared to the thinness and transparency of JSON, the XML DOM has huge overhead. There's huge value in your document model also being your in-memory model.

I'm not sure why you think this particular JSON schema is a "complex stringy structure". XML is a complex stringy structure! There's conceptually and semantically zero difference between <node style="normal">text</node> and {style: "normal", text: "text"}, other than the fact that the former requires an API to read and manipulate, and the latter doesn't, from the point of view of JavaScript code. XML has a specific formalized data model of elements, attributes and so on, but this is just a graph representation: you can losslessly encode an XML DOM in JSON, and vice versa.

The argument in favour of XML should revolve around the ecosystem of tools (XPath, XML Schema, namespaces, etc.) that allow you to work with it, but so far the comments here haven't conclusively highlighted these features as benefits. They also seem to ignore that with the exception of legacy enterprise software, the web world has largely (not entirely, but largely) moved away from XML. There's a reason we're using JSON over REST instead of SOAP. XML isn't the lingua franca of the web that W3C seemed to assume it would be during the early 2000s; JavaScript/JSON is.


> Browsers may have XML parsers built in, but XML requires APIs to work with in any meaningful way.

But this format will require parsers to take the resulting objects, arrays, and numbers and reconstruct something very close to XML's data model anyway.

> XML's data model is complex.

So is this document format, with its special "_type" keys and specially named nodes. XML is simpler, since attributes and children are a native part of the format and the API.

> There's huge value in your document model also being your in-memory model.

Are you suggesting this exact model be the in-memory model? Because that would be pretty non-optimal. You'd likely want your in-memory model to be more powerful than a tree of dictionaries, arrays, and strings. Therefore parsing and storing an XML document on load/save is not a big deal. In fact, given the existence of an API that already handles attributes and nested nodes, it might even be easier.

> XML has a specific formalized data model of elements, attributes and so on, but this is just a graph representation: You can lossly encode an XML DOM in JSON, and vice versa.

Exactly. So why use a format that is obviously terrible for mixed text and markup? Any reasonably sized document in JSON will be impossible to follow, yet the XML would be comparatively straightforward.

> There's a reason we're using JSON over REST instead of SOAP. XML isn't the lingua franca of the web that W3C seemed to assume it would be during the early 2000s; JavaScript/JSON is.

Basically, SOAP is an over-engineered solution to web RPC. It started out as XML-RPC, which was far simpler and far more constrained, like JSON. You could easily encode SOAP in JSON if you wanted. You're blaming the underlying technology for when it's used inappropriately, but this is an example of JSON being used inappropriately! It's an over-engineered solution to a problem solved more simply by a different technology. "XML bad / JSON good" doesn't tell the whole story.


> Are you suggesting this exact model be the in-memory model? Because that would be pretty non-optimal. You'd likely want your in-memory model to be more powerful than a tree of dictionaries, arrays, and strings.

I'm pretty confident that this is actually the intended goal, particularly for web-based editors.

Most "modern" web-based rich text editors work with an internal schema similar to this one, even when most of them support html/xml import/export too. In today's web ecosystem, it is hard to do much better than "javascript structure that gets diff-rendered to the DOM by a patching algorithm on updates".

I think this is just a case of someone saying "hey, wouldn't it be great if all these editors used the same schema instead of very similar ones with minor differences?"


> Are you suggesting this exact model be the in-memory model?

The data model, yes, though not the visual model. Sanity [1], which uses this specification, uses it for its rich text content.

Sanity is a document store with a collaborative, real-time editing UI similar to Airtable, but self-hosted and open source. When you're editing some text in the rich-text editor (which uses ProseMirror internally, last I checked), the editor is working directly on this data model, and produces patches against this data model that get synced with other clients. Similar to OT, except it uses a simpler git-like approach to rebase patches and converge the state.

There's another, less obvious reason it's JSON. In Sanity's document store, every document is a structured object that is read and written as JSON. Inside each document, Portable Text fields are also stored as JSON, not as a string or a binary blob. This content is also indexed and queryable. For example, let's say you're creating a wiki app; each wiki page is a document, with a title, body, and so on. The body is a Portable Text field. With this, you can find all wiki pages that link to another wiki page by running a query such as *[references("richard-feynman")]. You can do things like extract images, get the first paragraph, etc., all using the query language.

All technically possible with XML, of course, though I personally wouldn't want to go there.

(Disclosure: I work on Sanity's content store tech, but I don't work on Sanity itself.)

[1] https://www.sanity.io


> Browsers may have XML parsers built in, but XML requires APIs to work with in any meaningful way.

That's fine. APIs are good. The problem with XML is that the APIs never got any love from the end-user perspective. They are horrible to use, (seemingly) overcomplicated, and the ones I've used are inconsistent.

I've seen one or two nice Python wrappers (I forget the names; it's been a long time since I worked with XML) that smoothed out the developer experience (DX), but that's it.

For me, this is the reason as a programmer I prefer to work with JSON, even though as a data author and maintainer I prefer XML at every level.


And as if dealing with JSON requires no APIs. Perhaps because I use libjq, I'm spoiled? But no, I can't imagine starting from scratch and not building an API for dealing with JSON.

> but XML requires APIs to work with in any meaningful way.

And JSON doesn't?! Sure it does. E.g., I regularly use the jv API in libjq. I wouldn't want to use JSON in any other way than via an API, honestly.

You could also argue that XSLT/XPath are part of the XML API / ecosystem, much the way I think of JSONPath / jq as part of the JSON ecosystem.

And yes, the world has moved away from XML... for non-document purposes. XML was a poor fit for encoding non-document data. SOAP's use of XML to encode non-document data sure sucked, though it died of other things too (e.g., REST becoming more popular and being more appropriate in some ways). That's why we have things like protocol buffers, flat buffers, etc. Some even remember XDR, NDR, ASN.1 and its many encoding rules. Some remember those and are curious enough to have noticed that protocol buffers are DER reinvented.

JSON is much more appropriate for non-document data than XML, and unlike XDR/NDR/<various>buffers/DER/BER/CER/XER/PER/OER you don't have to agree on a schema a priori, which makes JSON very useful in REST APIs. But you could still define REST services in relation to schemas and then use OER or flat buffers to get very efficient encodings.

At the end of the day it's all the same data no matter how you encode it, provided the choice of encoding doesn't force you to give up some metadata or force you to express it in a convoluted way.

The point isn't that you can't do with JSON what you can with XML, or that you can't do with XML what you can with <various>buffers. The point is that XML has decades of evolution baked in already. You should not reinvent a document format without being very explicit about what's wrong with XML, what's missing that can't be added, and what you're willing to throw out, otherwise you'll find yourself replicating that evolution.


JSON doesn't need an API because it's a subset of JavaScript:

  const foo = {
    bar: n,
    children: users.map(({name, id}) => ({name, id})),
  };
  foo.x = 42;
DOM:

  const foo = document.createElement("root");
  foo.setAttribute("bar", `${n}`);
  for (const {name, id} of users) {
    const child = document.createElement("child");
    child.setAttribute("name", name);
    child.setAttribute("id", id);
    foo.appendChild(child);
  }
  foo.setAttribute("x", `${42}`);
Querying with jq isn't an "API".

JSON and XML are both encodings of two different formal data models. The difference is that JSON is native to JS. XML requires an API to build a safe and valid document. There are simplifying wrappers around the DOM, but they'd be yet another API on top of an API.

Also note how XML is string-based. In order to get any type safety at all you have to use XML Schema, which only works with a data consumer that also knows XML Schema. So given <foo x="1"/>, there's no way to tell arbitrary consumers that x is, in fact, a number.
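For the <foo x="1"/> example, the out-of-band declaration would look something like this (an XML Schema fragment; a sketch only):

  <xs:element name="foo">
    <xs:complexType>
      <xs:attribute name="x" type="xs:integer"/>
    </xs:complexType>
  </xs:element>

And it only helps consumers that actually validate against the schema; everyone else still receives x as the string "1".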

There are undoubtedly toolchains that do higher-level manipulation of XML data so that you can serialize and deserialize rich, type-safe data, but the fact remains that you need such a toolchain, in every single language that is to deal with this data.


I'll add to the chorus if I may. XML is much better for text documents, and very much is a first-class citizen on the web.

More than anything, XML is trivially extensible, but anything you do with JSON for documents will require that you reinvent just about everything that XML already has (e.g., namespaces, schemas), and as with all reinventions, you'll probably do it badly.

I've said elsewhere here that jq is so much pithier than XSLT, and that's true. But once you have a very complex schema to represent in JSON what would have been a simpler schema in XML, jq will no longer yield short, simple, pithy programs. The advantage of simplicity that JSON seems to have here over XML is imaginary and ephemeral -- I wouldn't count on it.


Sorry, but this is nuts. Not only is JSON not capable of capturing essential concepts of text documents such as text macros/entities and content models, it's also not a "first-class citizen" compared to HTML, which can be styled and edited (via contenteditable) in browsers out of the box.

I think you're missing the point of what this specification is about. This has nothing to do with macros or entities.

> (in the original sense of markup: Text annotated with style and semantic information)

JSON is not annotated text, it is structured data that contains strings. It is a minor difference if you only use tools to manipulate it though.


JSON is a really bad match for this. XML or sexp-based approaches (like EDN) would be a much better fit.

XML just leaves way too much space for quirks.

Why is this not XML? It doesn't make any sense at all. They are trying to force a schema on JSON, when XML would be a hundred times easier (and probably come with better tooling).

I'm really struggling to understand that as well. Can't wait until this comes full circle and someone decides to write a "simplified" version that allows you to write some variant of <foo>bar</foo> that gets compiled into a JSON string. Bonus points for the huge waste of bytes in the JSON representation!

Marking up text using a markup language? It'll never catch on! (Actually, it probably won't, because XML is "uncool" now.)

Can I ask how this compares with ProseMirror's format for rich text strings? What's the reason not to use ProseMirror's JSON structure? Can one convert between Portable Text and ProseMirror JSON?

Are there similarities between Portable Text and how rich text is represented in e.g. ProseMirror and SlateJS and DraftJS and other rich text editors?


Can I quickly prototype my own custom format for an XML document? For example, I know that under the hood Word docs are basically XML-formatted docs.

Is there a way that I can build my own custom XML tags and prototype my own custom XML document format? Something like:

<CustomGraphXMLtag>somevaluehere</CustomGraphXMLtag> <somenewtag2>some other value</somenewtag2> ........

Can I do this with Apache OpenOffice SDK or any other way?


For anyone who wants to see what a similar, JSON-based, node-generating implementation looks like, this is a cool demo showing the structure of a ProseMirror WYSIWYG example: https://react-prosemirror-example.now.sh/

Thanks. That really helps. I fail to see how this is superior to XML, except that tools/DSLs like jq are so much easier to use than XSLT.

EDIT: But still, no please. This isn't even remotely close to on the same level as XML is, and we don't need XML to be reinvented.


Here's some of the rationale behind the specification: https://www.sanity.io/blog/why-structured-text-is-awesome-an...

I wonder how queryable this is in document databases... (e.g. I want to use ArangoDB to make something like Lotus Notes with a document classifier baked in)

Also, another thing that has bugged me about "rich text" since the Word 95 days is that something seems to be rotten in the data model of most implementations.

The main symptom is that applying styles to text just doesn't behave rationally. I want to select a certain region but I wind up selecting something slightly different. The mouse pointer moves in the wrong direction, etc.

Related to this, it seems most WYSIWYG editors for HTML are buggy for the same reason. Dreamweaver used to work 10 years ago, but now it can't get out of its own way on my overpowered laptop, and Adobe Support is clueless about it.

Will this help?


I mean, if you have jq... But yeah, it won't be easy because the schema is necessarily bloated with stuff and highly recursive. You'll end up preferring XPath, IMO. And XSLT.

The past, present and future of (portable) text is text. No need for wrapping your text in JavaScript Object Notation (JSON) :-) really. Keep it simple. See https://github.com/mundimark/why-text or https://github.com/officetxt/awesome-txt and others :-).

This category of standards for structured text looks simple and sufficient, but it breaks once you bring in more complicated formatting options like font-size, background colour, etc. Which is why we need rich-text standards that aren't opinionated about what formatting to allow.

I'm not entirely sure what you're arguing here re unopinionated, but SGML (on which HTML is based) has had automatic recognition of Wiki text delimiters, producing XML-style canonical markup from e.g. markdown and other Wiki syntaxes, since before 1986. Kind of the best of both worlds for both easy editing and rich, sophisticated structuring. It's also odd that we're still discussing these things after all these years; so long, in fact, that we've forgotten how those before us tackled these issues.

Do you think the markdown-it token stream (https://github.com/markdown-it/markdown-it/blob/master/docs/...) may serve the same purpose?

Why this?

  {
   "_key": "a-key",
   "_type": "markType"
  }
And not this?

  {
   "a-key": "markType"
  }
Why this?

  {
   "style": "h1",
   "children": []
  }
And not this?

  "h1": {
   "children": []
  }

But what if you have to do this: "h1": { "children": [] }, "h1": { "children": [] }? That can't be done, because JSON objects can't have duplicate keys.
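A quick sketch of why, and of how the array-of-typed-objects shape sidesteps it:

  // duplicate keys are silently collapsed by JSON parsers; last one wins:
  JSON.parse('{"h1": {"children": []}, "h1": {"children": []}}');
  // => { h1: { children: [] } } - one of the two headings is lost

  // an array of objects carrying their own type keeps both, in order:
  [
    { "style": "h1", "children": [] },
    { "style": "h1", "children": [] }
  ]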

Maybe parsers would have a better way of parsing that into a sensible datatype. For example, look at Gson, where you map JSON to a Java class; that way you can have a node that you can query for whether it is an H1 or not. That's just off the top of my head; maybe I'm wrong.

Even after arguing in favor of XHTML here in another comment, I'd like to make a separate "modest proposal":

What if we had a limited kind of markup as part of Unicode, such that all Unicode text could be "marked up", entirely-portably, such that as long as you aren't rewriting the Unicode representation, you're preserving the markup?

Unicode already handles many kinds of markup. We just don't tend to think of these features of Unicode as "markup"—even though many of them have one-to-one equivalents in markup languages!

Most obviously, you've got newlines, the plaintext equivalent of the HTML <br> tag. If Unicode was "just" about representing text, newlines (and horizontal/vertical tabs, non-breaking spaces, zero-width non-joiners, etc.) wouldn't be part of it. These aren't text; they're instructions to a text layout engine, for where to put the text. They're markup, just as much as an HTML <center> tag is!

Less obviously—but more flagrantly—Unicode has the BIDI LTR/RTL instruction codepoints, and support for storing and representing a https://en.wikipedia.org/wiki/Ruby_character gloss of a given character-sequence embedded alongside said text.

And because of this support, Unicode font rendering engines already have everything they need to be able to render a given "meta-instruction" codepoint that makes a given "run" of characters bold, or italic, or small-caps, or strikethrough; or to indicate that a given run of codepoints should be considered a list item, or even a semantic "paragraph" or "heading."

It's my understanding that the maintainers of the Unicode standard would balk at standardizing this kind of markup within Unicode. But, if all this is possible at the Unicode-parsing level, then why not just create a separate working group who will maintain such markup codepoints as a Unicode Private-Use Area, and then get font-rendering-engine vendors to support said markup codepoints?

---

Side-note: the result would be a lot like creating Unicode-codepoint equivalents for ANSI escape sequences, wouldn't it? In fact, I would imagine that pretty much every common ANSI "Select Graphic Rendition" escape sequence would make sense as a Unicode markup codepoint, save for maybe being translated to prefix-length or per-grapheme-cluster-paired forms rather than being modal switches.

And oddly, I think that such SGR-equivalent codepoints would be within the spirit of Unicode. Unicode wants to allow for the digitization and standards-based preservation of all historical text. Well, how are you supposed to preserve the text-art sent by an 80s BBS, without being able to represent (at least) fg/bg color changes? Unicode already decided to standardize the DOS box-drawing characters for exactly this same reason!


This is very similar to what you describe, except it's used to specify character variants (but one could reasonably define "bold", "italic" etc as variants...)

http://unicode.org/faq/vs.html


This won't catch on for the same reason no one uses U+001F or U+001E for their tabular data: they're not on the keyboard. Any extra encoding characters would need new keyboards, or maybe touch screens that are good enough to code with.

It would be pretty neat though.


It won't catch on for other reasons too, such as that existing software won't understand Unicode tags and will either not preserve them or break their forms so that you'll have the equivalent of </b>stuff here<b>other stuff here.

Unicode tag codepoints simply cannot be reliable.

Nor can Unicode tag codepoints provide sufficient functionality to replace XML (or HTML, or ...). Even the subset they can replace/displace would require revamping XML/XSLT processors in order to make XSLT able to see them -- that's a terrible imposition on the rest of the world.

Some of us hate UTF-16, but there it is and we can't get rid of it.

Others hate UTF-8, and.. the same thing applies.

XML exists, it's widely used, it's not going away no matter how much anyone dislikes it. Moving parts of HTML or XML into Unicode will not work for the reasons given above. Please dedicate your energy and ideas to innovating in other ways than this.


> will either not preserve them or break their forms so that you'll have the equivalent of </b>stuff here<b>other stuff here

You're assuming a representation that encodes such markup as ["start of bold" codepoint], ["end of bold" codepoint]. I agree that this would be bad, for the same reason that the Unicode "language" tags were bad.

There are other representation options, though.

You can instead do exactly what Unicode's Ruby-character support does, and just make a markup annotation last for exactly one grapheme-cluster. A run of bold characters, then, is a run of grapheme-clusters where each grapheme-cluster has been individually bold-annotated. There's no real way to "break" that. It's inefficient, of course, but the point of it isn't to be a universally-optimal format, but a markup-format-of-last-resort when there's no higher-level markup layer in the document format to hold styling information.

To be clear: there's demand for a "markup-format-of-last-resort." In fact, people are already abusing other Unicode features to create one, and are quite happy with the resulting "styled" Unicode documents, which are—horrifyingly—proliferating across the internet and breaking semantic machine-readability for any future archival uses of that text.

If you can't guess, I'm talking about people replacing the Unicode codepoints of their text with entirely-different Unicode codepoints, that just happen to already look like the "styled text" version of the text they want. For example, you've got 𝓽𝓱𝓲𝓼 𝓽𝓮𝔁𝓽, ʇɥᴉs ʇǝxʇ, t̶h̶i̶s̶ ̶t̶e̶x̶t̶, etc. These are the outputs of "text generators" (online IMEs, essentially) that take your Unicode plaintext input text, and output Unicode plaintext output text that is "styled."

Now picture a Unicode feature to obviate such "text generators"—where, instead of two codepoints ["𝓽"]["𝓱"] that are both 1. impossible to type on a keyboard and 2. not actually the same characters semantically as ["t"]["h"], you've instead got four codepoints: [next-with-italics]["t"][next-with-italics]["h"]. You typed ["t"]["h"], and then selected them both and hit the "I" button in your editor, and then copied the result, and that's what's now in the copy buffer.

Nothing would break—text editors already understand how to keep Unicode combining diacritics together with their base codepoint just fine, and this is just another use of the same logic.

Similarly, the NFD representation of [next-with-italics]["t"][next-with-italics]["h"] would just be... ["t"]["h"]. Stick your four-codepoint sequence in a DB fulltext field, and it'll index it no problem.

> Nor can Unicode tag codepoints provide sufficient functionality to replace XML

Oh, I would never suggest that. Structural markup has no place in Unicode. I just think that there is a thing that is a subset of "rich text" but a superset of "Unicode plaintext" and maps to exactly the sort of text-styles that a publishing editor laying out InDesign flows, would let their contributing writers use in the RTF documents data-bound to those flows; or which a blogging engine like Twitter would let their users use to style their tweets.

Like I said, this doesn't belong in the Unicode standard. Such a format can be considered an additional format layer between Unicode plaintext and a document format. It just so happens, though, that you can take advantage of the Unicode-rendering capabilities of most rich-text layout engines, to support such markup. Same as how the VT100 spec took advantage of the logic that already existed to respond to newlines and such, and introduced inline escape sequences—nominally "a higher layer", but in practice a thing that is conflated with the layer below it, such that any terminal program that claims to output "plaintext" might actually be outputting plaintext-with-ANSI-escapes without mentioning it.


Attaching style to each glyph is problematic too, since now they can get spread around unintentionally, and we'll need new regexp code, and much else. No, the semantics of tagging inside Unicode are uniformly bad. I understand that people abuse Unicode already to get some such effects, but it's mostly a novelty, and I would not seriously use any software making use of these tricks. A standard way to mark up style in Unicode would certainly be better than Unicode abuse, but it would still come with a bunch of difficult associated issues -- are they really better than whatever issues you have with XML? and are they better for all of us??

BTW, games with BS (backspace) and overstrike for bold, and underscore for underline, still work in Unix terminals... That might be the best argument for having some stylistic composing codepoints in Unicode... After all, ASCII was designed in part to be usable with BS for expressing diacritical marks (accents, umlauts, cedillas, ...), making it a variable-length internationalized encoding! But still, the less we complicate Unicode, the better.


I think you misread my original post—I don't have any problems with XML, and in a sibling post (https://news.ycombinator.com/item?id=18611384), outlined exactly why XHTML is best-suited to this use-case. Meanwhile, I called this Unicode-markup idea a "modest proposal".

If you can use XML, by all means, please do use it. If you're designing a document format, and there should be markup in your document, just use XML for your document format and then XHTML for the markup parts. It's what ePub does. (ePub is one of my favourite formats: it's just a slapping-together of a bunch of well-known technologies. It's the "buy rather than build" of document formats.)

I do think, though—separately from what your options are as a document-format designer—that there's a point in having a Unicode-level markup standard, and getting the font-rendering engines on board with it. And that point is democratization of styling control: forcing styling into places where the format specifier had no intention, way back whenever they specified the format, to add styling; and where the software stack has maybe ossified since, such that nobody's willing to introduce a new version that specifies a new format with styling (or where such an effort was tried but has long fizzled out.)

Want to add styling to email? No, not HTML email, pine(1) ain't ever gonna render that for you. I mean, all email. Email in an app in the GUI; email on the CLI in your terminal emulator. Email even in pine(1), without modifying pine(1), merely updating GTK and Cairo and then restarting gnome-terminal. Unicode markup would work entirely opaquely to the email client's intent to "have styling" or not. At the layer the email client is operating at, it's Just Text™, just like emoji are now Just Text™.

As far as I can tell, encoding the styling at the Unicode-codepoint layer, and then parsing it out at the {OS graphics toolkit, terminal emulator} text-rendering layer, is the only way to make that work.

Or, for another example. How about we insert some bold text into a TEXT-typed column of a Postgres DB row? Yes, the text is supposed to be bold. No, we can't change the type of the field to XML; that would break everything that currently queries that field. I want to just put bold text in one row, one time. Yes, sure, add an INSERT trigger to ensure that nobody can insert text with any other styles. (But if you're so worried about that, why don't you have that trigger already to ensure that nobody's inserting a newline, huh?)

A cute trick: make temp-file filenames not "foo~" but rather "foo"-with-a-strikeout-through-it. Yes, that'd be the file's actual on-disk name. You're saying you want your filesystem to use XML for the filename field?

(I mean, these are ridiculous, yes. The only actual use-case is obviously allowing people to use bold on HN.)


> I'm talking about something that you'd use in exactly the situations where people currently use "text generators" to output those abominations,

The situations, AFAICT, are often to defeat text filters.


That use-case does certainly exist (it's what you see used on e.g. escort ads on websites where there shouldn't be escort ads), and obviously, such people wouldn't want to switch to using more machine-comprehensible styling.

While this might represent a large percentage of the "pre-baked styling" Unicode text in the wild by volume of text, there aren't that many people who have created all this text (most of the corpus is spam repostings of the same messages with randomly-varied styling.)

Rather, I think the majority of the authors of "pre-baked styling" Unicode text are simply people posting to websites like Twitter or Facebook, or a Disqus comments section, where styling is either not allowed, or doesn't have all the options the author wants.

Or, to put that another way: the authors are mostly teenagers who care more about the aesthetics of their text than its comprehensibility.

And that's surely the market that the Unicode consortium cares the most about—why else would we have so many emoji? ;)


You don't type the BIDI instructions on your keyboard; they're inserted automatically in response to script changes. You don't type Unicode non-breaking spaces with your keyboard—you ask your editor to insert one with an editor-specific command, or you use textual markup that compiles to a Unicode non-breaking space. Etc.

Most of the Unicode markup features that already exist—other than newlines—form a backing layer: a standard interchange format for text, when it's not in the internal DOM+ropes representation that most text editors use. It's not a working representation, it's just a representation that text is rendered out to.

(In fact, for a lot of these existing markup codepoints, when there's one sitting around "raw" in a text field—rather than having been parsed into a text-editor DOM node—you can't even position the cursor and then delete it. Such codepoints just sort of "stick" to the grapheme-clusters/edit-field boundaries on one or both sides of them, such that deleting the grapheme-cluster will maybe "garbage collect" the associated markup, or maybe leave it there forever. The editor doesn't expect that character to be in its text, breaking the layout. It expects to have parsed it out, and then to put it back in place when it saves the document. The markup codepoints exist to be inline instructions for rendering a run of "baked" text; they don't exist to be a convenient data structure for text-editors to do editing operations against.)

So, the idea I'm proposing here, is not that you'd just "be able to type a <b> codepoint"; but rather that:

1. plaintext editors like Notepad and TextEdit—and HTML's plain <textarea> controls—would now have buttons for simple text styling (because now they can!)

2. even OS plaintext edit controls would allow styling of their contents, through an OS-universal styling palette (like macOS's "Fonts panel") that would apply to the cursor/selection of the active window (like it does for macOS rich-text edit controls.) It would essentially be part of the IME, like macOS's minimized-form "Character Viewer" palette is, or like dead-key entry is on X11.

3. WYSIWYG editors for rich-text formats like RTF and Markdown would move to evolved, distinguished storage formats (e.g. ".rtfu" = "text/rtf+umark") where the choice of format has the meaning of canonicalizing all styling that is possible using Unicode markup codepoints, as Unicode markup codepoints.

4. The original formats these new formats evolved from ("text/rtf", "text/markdown", etc.) would now be defined specifically as formats that have no knowledge of the meaning of Unicode markup codepoints, such that if a file containing Unicode markup codepoints were to be "Open As"ed in the old format, the codepoints would render as �s. (But if you just open the document without an "Open As", the editor would heuristically determine, from the fact that the file contains markup codepoints, that the file is actually "text/rtf+umark", and would insist that you "Save As" the file with the new extension.)

5. Each OS's neutral "rich text" drag-object/paste-object format would switch from "text/rtf" to "text/rtf+umark". All text-editing controls in OS apps would be required to both produce this on drag/copy (even if it's not their storage format), and accept+parse this on drop/paste (even if it's not their storage format.)

6. Each OS's neutral "plaintext" drag-object/paste-object format would become "text/plain+umark" - i.e. plaintext that preserves styling. Thus, copying from a rich-text edit control to a plaintext edit control would get you resulting text with editor-specific styling stripped, but retaining the Unicode markup codepoints.

7. If the OS supports arbitrary-Unicode filenames, it should probably accept Unicode filenames with markup codepoints in them, and should probably render that markup when displaying the filename. (Consider: at least in Linux, filenames can contain newlines, and the OS renders those.)


Some of this has been attempted, and deprecated. E.g., there used to be language tags in Unicode corresponding to Internet language tags.

That's XML?

This does not even support nested lists. Why purposely come up with an overcomplicated document model if you aren’t even going to support basic features?

How is this different from ODF or even OOXML?

I think this article lends an interesting perspective on the XML vs. JSON thing. https://twobithistory.org/2017/09/21/the-rise-and-rise-of-js...

Try this JSON Parser to validate JSON data. https://jsonformatter.org/json-parser

It looks a lot like HTML 1.0.

Kind of like an AST but for markup?

Not quite. AST but for rich text.

For G-d's sake, people, just use Markdown everywhere but in cases where something else is really necessary!


