Sorry in advance, as someone who has already designed yet another JSON alternative. Many things about this format are just wrong (as of the first edition):
- (EDIT: Mistakenly read SHOULD as MUST, ignore this item please) The requirement for the mandatory BOM is unacceptable for most non-Windows users, as a BOM is by definition invisible.
- An arbitrary name is equally questionable; even XML doesn't do that. Do you accept tabs, for example?
- There is absolutely no way to indicate the data type. The document accidentally mixes the wire format and serialization protocol; never a good sign.
- Do you accept 3,000 or 3,,000 or 30,00 or 3,0,0,0 or ,,,3000 or 3000,,, or 3,e,3 or 0.000,3e7 or ,,,?
- How many digits in each group are recommended for the encoder?
- What on earth is "IEEE 64 bit double precision floating point numbers" in the context of textual format?
- Do you require a specific rounding or not in the decimal-to-binary conversion?
- Is `\u{d800}` accepted or not? (JSON famously has this issue.)
- If you have to escape the base64 padding, maybe you should just drop the padding?
- Graph is generally a bad thing to encode at this level, because many applications do not even expect graphs and that can lead to DoS attacks.
- No clear definition of the data model. The document starts with objects, arrays and scalars (untyped? stringy? I dunno), then only reveals that objects can be typed and shared and specific types of scalars should be written in certain ways much later. Define the data model first and describe possible encodings of that model instead.
- Not allowing anything besides space and tab is okay, but that doesn't exactly speed up processing.
- While technically a matter of choice, it uses a lot of unmatched angle brackets and that's just... ugly.
- In fact, I don't really see why other grouping characters went unused in the first place.
- No canonical representation.
Maybe you should review tons of other alternatives in this space (I recall at least 20--30 of them, probably many more) before designing your own.
- The flexibility in name is to support the plethora of programming languages.
- Commas are for readability as per English.
- Three digits are recommended for the encoder.
- Textual data is decoded into binary information in software so setting expectations, e.g. supporting ∞, facilitates interoperability by reducing impedance mismatches.
- If the implementation uses ɪᴇᴇᴇ doubles the rounding shall be as expected.
- Unicode surrogates are disallowed.
- We are following the Base64 standard exactly.
- It is essential for data to be structured as a graph; it simply occurs. Serialization in most data formats has the required support for graphs awkwardly layered on top of the encoding.
- The C# implementation defines the data model; essentially objects, array and scalars with multiple parents for nodes. The Formats section is to, as stated, facilitate interoperability.
- Limiting whitespace to spaces and tabs does speed processing. The ᴜᴛꜰ-8 bytes can be left in the input buffer rather than copied. All significant bytes are ᴀꜱᴄɪɪ.
- There are no unmatched angle brackets (that was a typo).
> The flexibility in name is to support the plethora of programming languages.
As far as I'm aware, there is no language that allows a tab in its identifiers. I'm aware of some (uncommon) languages that allow a space in identifiers though.
After all, the goal of a language is firstly to set what's valid or not, and secondly to give a meaning to what is valid. The valid set should be maximally different from the invalid set, but allowing virtually invalid characters in names works against this goal. Consider the approach taken by XML 1.1 (not 1.0) if you need an idea.
> Commas are for readability as per English.
A lot of English-speaking countries actually have `.` and `,` swapped: 3,141.592 vs. 3.141,592 for example. Due to this ambiguity, a comma as a grouping separator is heavily discouraged. You can instead use an underscore `_` (very common in programming languages) or a space (more preferred in human texts) without such concerns.
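To make the ambiguity concrete, here is a small C# sketch (cultures and values picked purely for illustration) of the same number grouped under two locales, plus the underscore separator C# itself uses in literals:

using System;
using System.Globalization;

// The same value, grouped per locale: the comma and the dot swap roles.
double x = 3141.592;
Console.WriteLine(x.ToString("#,##0.000", CultureInfo.GetCultureInfo("en-US"))); // 3,141.592
Console.WriteLine(x.ToString("#,##0.000", CultureInfo.GetCultureInfo("de-DE"))); // 3.141,592

// C# literals sidestep the issue entirely with an underscore separator.
int population = 3_000_000;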
> Three digits are recommended for the encoder.
Also, some English-speaking countries use different group sizes (notably India). I'm personally fine with three-digit groupings despite that, but that should be clearly specified at the very least.
> Textual data is decoded into binary information in software so setting expectations, e.g. supporting ∞, facilitates interoperability by reducing impedance mismatches.
This procedure should have been made explicit. In fact, the single worst thing you can do in a serialization format is an unclear definition of the data model.
> If the implementation uses ɪᴇᴇᴇ doubles the rounding shall be as expected.
There is nothing like "as expected" in IEEE 754. The current rounding mode is a part of the execution state, so leaving it unspecified risks a non-deterministic interpretation even in the same execution. Either you should specify some rounding mode (most likely round-to-even), or you should state that encoders should pick a long enough decimal representation to avoid any such issue.
> We are following the Base64 standard exactly.
Which base64 standard in [1]? The padding in base64 is vestigial anyway and serves no practical need by now, so there is no strong reason to use a particular standard with required padding. It is much more important to decide what to do with incorrectly padded bits.
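As a concrete sketch of the padding question (standard .NET Base64 calls, nothing Xenon-specific): padding can be stripped on encode and restored on decode, but Convert.FromBase64String rejects unpadded input, so a spec has to say which behavior it expects:

using System;
using System.Text;

// Strip Base64 padding on encode; restore it before decoding, because
// Convert.FromBase64String requires correctly padded input.
string padded   = Convert.ToBase64String(Encoding.UTF8.GetBytes("hi"));    // "aGk="
string unpadded = padded.TrimEnd('=');                                     // "aGk"
string restored = unpadded.PadRight((unpadded.Length + 3) / 4 * 4, '=');   // "aGk="
byte[] bytes    = Convert.FromBase64String(restored);                      // decodes fine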
> It is essential for data to be structured as a graph; it simply occurs. Serialization in most data formats has the required support for graphs awkwardly layered on top of the encoding.
I don't question that graph structures often occur naturally and that existing schemes are often awkward, but I think that's more a lack of co-developed standards. I have outlined my rationale for layering in other comments.
> The C# implementation defines the data model; essentially objects, array and scalars with multiple parents for nodes. The Formats section is to, as stated, facilitate interoperability.
The data model should be abstract enough to be truly interoperable. JSON has suffered a lot from having no defined data model to this day, even when there was a sort-of-reference implementation by Crockford. It is not too hard to define a data model in prose rather than code.
> Limiting whitespace to spaces and tabs does speed processing. The ᴜᴛꜰ-8 bytes can be left in the input buffer rather than copied. All significant bytes are ᴀꜱᴄɪɪ.
That may have been true in the past, but it's no longer true with SIMD-based parsers. Also, the very existence of escape sequences prevents truly non-destructive zero-copy (aka in-situ) parsing anyway. With such sequences, zero-copy/in-situ parsing has to be destructive to be performant, and that can preclude some use cases. Allowing additional space characters is much easier than that.
> There are no unmatched angle brackets (that was a typo).
I meant to refer to `<<Name>...<<$>`. Multiple grouping characters can allow for simpler syntaxes.
I evaluated xᴍʟ 1.0 and 1.1's restrictions but consider the alternative, accepting anything but the empty string, to be simpler.
>A lot of English-speaking countries actually have `.` and `,` swapped:
Which?
>...should state that encoders should pick a long enough decimal representation to avoid any such issue.
.net for example has round-trip encoding to achieve this.
> Which base64 standard in [1]?
As linked in the document, we refer to ʀꜰᴄ 4648 for Base64 using the simplest version which states that “Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise”.
> I have outlined my rationale for layering in other comments.
>> Graph is generally a bad thing to encode at this level
The ᴅᴏꜱ attack you are concerned about is not applicable. Layering graphs on top of the encoding in a separate standard is not practical. As stated serialization requires a graph structure.
>It is not too hard to define a data model in prose rather than code.
I have done that.
>That may have been true in the past, but it's no longer true with SIMD-based parsers. Also, the very existence of escape sequences prevents truly non-destructive zero-copy (aka in-situ) parsing anyway. With such sequences, zero-copy/in-situ parsing has to be destructive to be performant, and that can preclude some use cases. Allowing additional space characters is much easier than that.
SIMD-based parsers would almost certainly run faster with simplified byte handling rules. Practically, when say skipping whitespace, searching for two specific bytes is faster than the alternatives. Most strings do not have escape sequences so do not need to be copied; a scan for '\' is fast.
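For concreteness, a sketch of the scans both sides are describing, using standard span helpers available in recent .NET versions; the input literals are just examples:

using System;

// Skip leading whitespace (two bytes to exclude) and check a scalar body for
// escapes, each as a single call over a UTF-8 byte span; the span is not copied.
ReadOnlySpan<byte> input = "  \t<key=value>"u8;
int start = input.IndexOfAnyExcept((byte)' ', (byte)'\t');   // first significant byte

ReadOnlySpan<byte> body = "plain text without escapes"u8;
bool needsUnescaping = body.IndexOf((byte)'\\') >= 0;        // only copy when a '\' is present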
> `<<Name>...<<$>`
This is to be terse. An array type in an xᴍʟ-like language is a leap forward. Additional significant characters would overcomplicate the standard.
Oh, I just realized that I carelessly put "English-speaking" there. (AFAIK the exact reversal does exist in English, but is much rarer and not entirely domestic.) But that doesn't really justify the use of comma in numbers.
> .net for example has round trip encoding to achieve this.
That was never guaranteed to my knowledge, and there also seems to be a difference between .NET Framework and .NET Core/Runtime versions according to some searches. Enough standardized languages have no strong requirement either (for example, see [1] for ECMAScript), so this should be clearly specified to be truly portable.
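For reference, the .NET formatting options in question, as I understand them (behavior differs between .NET Framework and modern .NET; treat the comments as my reading of the docs, not part of any spec):

using System;
using System.Globalization;

// On modern .NET the default ToString() already yields the shortest string
// that round-trips; on .NET Framework the docs steer you to "G17" because
// "R" is documented as not always round-tripping for double.
double x = 0.1 + 0.2;
string s   = x.ToString(CultureInfo.InvariantCulture);          // "0.30000000000000004" on modern .NET
string g17 = x.ToString("G17", CultureInfo.InvariantCulture);
bool roundTrips = double.Parse(g17, CultureInfo.InvariantCulture) == x;  // true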
> As linked in the document, we refer to ʀꜰᴄ 4648 for Base64 using the simplest version [...].
You are free to rephrase any variant of the base64 standard into simple statements. (Ideally the standard itself should also be cited, though.) I was asking why that particular variant was used.
> Layering graphs on top of the encoding in a separate standard is not practical. As stated serialization requires a graph structure.
The separateness here is not binary. The Unicode standard (ISO/IEC 10646) for example has multiple standard annexes [2] that are synchronized with the core standard but published and developed separately. For all practical purposes they form a single unified standard, but you can (and almost certainly should) ignore some irrelevant annexes. I'm arguing for a similar structure, to be clear; do you think that's also a no-go?
The only explicit thing in your data model is the following definition:
// Read as: "document" is either "object", "array" or "scalar".
document = object | array | scalar
// Read as: "object" is a distinct type "Object" with "name" and zero or more "field"s. (And so on)
object = Object(name, field*)
array = Array(name, item*)
scalar = Scalar(name, value)
This is not what's implied by the following sections. To begin with, you have no explicit definition of `item` or `field`. Yes, it is kind of obvious that both should be `document`, and it can also somehow be inferred that a `scalar` in the `item` place should be a nameless object, but not only are such relations not explicit, we now have another class of objects not mentioned in the first place! The exact nature of `value` is also very unclear.
In my understanding, your data model is more like this:
document = object | array | scalar
// `...?` for zero or one copy of the preceding data.
optional-name = Name(name)?
optional-ref-id = RefId(ref-id)?
optional-type = Type(type)?
name = string // Can't be empty
ref-id = string // Globally unique
type = string // Interpretation up to clients
object = Object(optional-name, optional-ref-id, optional-type, field*)
field = ObjectField(name, optional-ref-id, optional-type, field*)
| scalar
array = Array(optional-name, item*)
item = object
| ScalarItem(optional-ref-id, field*) // Implicitly converted to object
| ReferenceItem(reference-id)
scalar = Scalar(name, value)
| Reference(name, ref-id)
value = string // Additional interpretation rules may exist after parsing
This abstraction clearly demonstrates many important things that come up in implementations. For example, every `field` has to provide a `name` because all possible branches (`ObjectField`, `Scalar` or `Reference`) have one---at least in my understanding. I shouldn't have to figure out such a data model myself, to be honest; it's your job to provide one.
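To illustrate the same reading in code (every type name below is invented by me for illustration and is not part of the Xenon document), the grammar above maps onto C# records roughly like this:

using System.Collections.Generic;

// My reading of the inferred data model, not the author's definition.
interface IDocument { }   // document = object | array | scalar
interface IField { }      // field    = object-field | scalar
interface IItem { }       // item     = object | scalar-item | reference-item

record XenonObject(string? Name, string? RefId, string? Type, IReadOnlyList<IField> Fields) : IDocument, IItem;
record ObjectField(string Name, string? RefId, string? Type, IReadOnlyList<IField> Fields) : IField;
record XenonArray(string? Name, IReadOnlyList<IItem> Items) : IDocument;
record ScalarItem(string? RefId, IReadOnlyList<IField> Fields) : IItem;   // implicitly an unnamed object
record ReferenceItem(string RefId) : IItem;
record Scalar(string Name, string Value) : IDocument, IField;
record Reference(string Name, string RefId) : IDocument, IField;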
Okay, I see you demand a very explicit question: why did you use a comma instead of other characters for grouping?
>> [separate standards documents] do you think that's also a no-go?
> Yes.
I'd like to hear why then, given that this approach seems to be very successful for Unicode and many others.
> Data model here:
Thank you for the grammar. The document will benefit massively from such an explicit description; I completely missed the unnamed scalar, for example (because it has no syntax at all). The grammar doesn't have to be reproduced as-is; for example, the following is my attempt to rephrase it in prose:
There are three kinds of entities in Xenon: objects, arrays and scalars. All entities may optionally have a name, a type, an ID or any combination of them. Objects can contain named entities, or fields, where field names should be distinct within the same object. Arrays can contain either named or unnamed entities. Scalars can contain zero or more characters. Names and types are non-empty free-form strings with no meanings assigned to them, but this document gives a guideline for common types to ensure interoperability. IDs should be distinct within the same document, but otherwise they are only used to encode a graph structure and have no other meaning. Objects and arrays can contain a reference, which can stand in for any entity with the given ID; a reference in an object needs its own name, which can differ from the referent's.
Please consider a similar clarification early in the document.
But English is not the largest language in terms of the number of speakers. I speak Korean, for example, and three-digit grouping is unnatural in Korean, which groups by myriads instead. I guess Xenon is designed to be uncomfortable for the vast majority of people, including me, then?
> Having special names, like $type, and hoping no language uses them is fragile.
I was not proposing special names after all. Xenon clearly has a much better extension point, namely the type string. You can have just one reserved character, which is very unlikely to appear verbatim, and use it for the extension. And it is even unclear whether JSON's use of `$` for such special names was fragile after all, given that such complaints seem uncommon. (I complained about that in the past, but others seemed to be fine with it.)
In addition, I don't think you have surveyed enough languages to conclude so anyway. For example, can you list all the reasonably popular languages that allow `$` in identifiers? While a single counterexample settles whether to allow `$`, maybe you have missed other identifier characters in other languages! (Angle brackets are also common in stringified types, after all.) So I expected you to have at least looked at them before reaching your conclusion.
> The requirement for the mandatory BOM is unacceptable for most non-Windows users
I think the requirement for Unicode is bad in general, whether it uses BOM or not.
> What on earth is "IEEE 64 bit double precision floating point numbers" in the context of textual format?
I assume it means that numbers are expected to be IEEE 64-bit floating point numbers represented in the decimal format.
> If you have to escape the base64 padding, maybe you should just drop the padding?
I agree.
> While technically a matter of choice, it uses a lot of unmatched angle brackets and that's just... ugly.
I agree with that too.
> No canonical representation.
I think that a binary format would be better as a canonical representation anyways. However, the lack of indicating the data type would seem to make it difficult to know how to convert it unless you already know the schema.
> Maybe you should review tons of other alternatives in this space
Just a few days ago, and still working on it today, I had been making something called TER. It might also be worth a look, and I wrote another comment relating to this.
Xenon does have a reference type to other nodes, which is something that other similar formats don't have, though.
> I think the requirement for Unicode is bad in general, whether it uses BOM or not.
While there exist legitimate complaints about Unicode, I believe there is no other feasible encoding than UTF-8 for textual formats now.
> I assume it means that numbers are expected to be IEEE 64-bit floating point numbers represented in the decimal format.
I too believe so (hence the next item), but that just doesn't make sense if you ponder it. IEEE 754 doesn't define any textual format while it does have binary decimal formats. The correct wording should have been that numeric scalars follow a specific grammar to be interpreted as an IEEE 754 binary64 number in the data model.
> I think that a binary format would be better as a canonical representation anyways. However, the lack of indicating the data type would seem to make it difficult to know how to convert it unless you already know the schema.
That's fair. But we have already observed textual formats being... "abused" for cryptographic purposes (e.g. JWT), so it wouldn't hurt either to have a canonical representation or to explain why no canonical representation is defined.
> Xenon does have a reference type to other nodes, which is something that other similar formats don't have, though.
There are several formats that do try to support native graph types, including YAML and Concise [1]. So that is hardly new. I think Concise actually tried very hard to make it work! But it became quite a bit more complex as a result.
> I believe there is no other feasible encoding than UTF-8 for textual formats now.
I disagree, although I do not believe that this disagreement should have to affect some file formats, since some file formats should not need to care about the character encoding, except perhaps specific details, e.g. that ASCII bytes have ASCII meaning and non-ASCII bytes have non-ASCII meaning (which is true of UTF-8 and of some others).
> IEEE 754 doesn't define any textual format while it does have binary decimal formats. The correct wording should have been that numeric scalars follow a specific grammar to be interpreted as an IEEE 754 binary64 number in the data model.
OK. Although it seems clear enough, the document should probably specify how this works explicitly anyway, like you mention.
> There are several formats that do try to support native graph types, including YAML and Concise [1]. So that is hardly new.
I had seen Concise as well, and I have criticisms of that too. I did forget about it though as I was writing my message.
I did have an idea to add a reference type too (you can see my other comment about TER and ASN.1X), although I would have it not carry the same meaning as the data it references (therefore meaning that it cannot result in cyclic graphs, even if it references itself). I think I should not have remote references, though. Concise encoding seems to have it carry the same meaning as the data it references, and I think it is useful to not do this (you could use compression if you want to deal with a lot of duplicated data).
> I disagree, although I do not believe that this disagreement should have to affect some file formats, [...]
Ah, if you just want to make the UTF-8 requirement less strict for simpler decoders, that can be actually okay. My belief is more about legacy encodings like Shift_JIS.
I never seriously tried to fit any graph structure into a serialization format, so I don't yet have a very concrete opinion beyond what was already said. (I'm not even sure Concise had a good approach for that...)
"May", not "should", so some data don't have a graph structure. It really boils down to the ratio between non-graph data and graph data. I have some reason to believe that the former far outnumbers the latter, but maybe you have some concrete evidence against that?
You misunderstand. Given that data MAY be a graph structure, the markup MUST support that. A tree with pointers to the parent nodes has a graph structure, a common enough occurrence.
Data MAY also include Egyptian hieroglyphs not encoded in Unicode. Does it mean that the markup MUST support them as well? I bet not. No format can natively support every possible data structure; serialization formats only have to support enough of them that the remaining structures can be isomorphically mapped onto them.
> a common enough occurrence
Do you have a concrete figure for "common enough"? Otherwise we are only talking through anecdotal evidence (not necessarily bad, though!). At least in my experience, graph structures do occur from time to time, but almost all of them can be easily rewritten into a non-graph form because they are typically DAGs and obviously ranked (i.e. their preferred topological order partitions the nodes into ranks). The general graph that requires something like references was very rare.
The point is one can apply the xᴇɴᴏɴ library to arbitrary data and generate representative markup. For “Egyptian hieroglyphs not encoded in Unicode” 𓀠 one can write plug-ins.
Thanks for writing this up! Sounds like you really dig deeply in to this.
Not entirely sure what you mean by canonical representation (I've heard this in the context of JSON-LD before, though). Can you explain what you mean here?
Where do you see the problem with graphs and DoS? A reference is just a pointer. You just have to be careful when writing recursive code. I actually like the idea of explicitly defining how a reference / association is made, because otherwise people have to re-invent ID and association concepts and there's no shared understanding. In JSON Schema, you cannot properly express an association or graph structure, and people start using overloaded and not well-defined concepts like `$ref`, which is a separate standard.
Here the canonical representation refers to one single definite and unambiguous encoding for given data. This requirement is very common in cryptographic applications and also commonly demanded when deterministic processing is desired. Technically the "canonical" and "deterministic" encodings can differ (e.g. ASN.1 CER vs. DER), but there is not much value in having two distinct encodings.
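A short illustration of why cryptographic uses care (standard .NET hashing; the markup strings reuse the `<person><Height=1.67><$>` example the author gives elsewhere in the thread, with insignificant whitespace added by me):

using System;
using System.Security.Cryptography;
using System.Text;

// Two encodings of the same logical data hash differently, which is exactly
// what a canonical representation exists to prevent.
byte[] a = SHA256.HashData(Encoding.UTF8.GetBytes("<person><Height=1.67><$>"));
byte[] b = SHA256.HashData(Encoding.UTF8.GetBytes("<person>\r\n  <Height=1.67>\r\n<$>"));
Console.WriteLine(Convert.ToHexString(a) == Convert.ToHexString(b)); // False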
On graphs: as you've said, recursive code has to be careful, but recursion itself is not very frequent in normal applications, and applications are more susceptible to attacks if recursion is built into the serialization format. The XML billion laughs attack is a famous demonstration of this issue. It is still worthwhile to have parallel standards specifying how recursive structures should be encoded in the basic format, and possibly to tweak the basic format to better accommodate such standards, but I believe such needs can be met without making the basic format bigger.
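One common mitigation, sketched here with a made-up node shape (this is not the Xenon library's model or API): give the reference resolver an explicit expansion budget and a cycle check, so a hostile document cannot blow up materialization:

using System;
using System.Collections.Generic;

// Hypothetical node: either a reference to another id or a list of children.
record Node(string? RefersTo, List<Node> Children);

static class GraphGuard
{
    public static void Materialize(Node node, IReadOnlyDictionary<string, Node> byId,
                                   HashSet<string> onPath, ref int budget)
    {
        if (--budget < 0) throw new InvalidOperationException("expansion budget exceeded");
        if (node.RefersTo is string target)
        {
            if (!onPath.Add(target)) throw new InvalidOperationException("reference cycle");
            Materialize(byId[target], byId, onPath, ref budget);
            onPath.Remove(target);
            return;
        }
        foreach (var child in node.Children)
            Materialize(child, byId, onPath, ref budget);
    }
}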
Is canonicalization not irrelevant? Any format with comments is not canonical, so xᴍʟ is not; ᴊꜱᴏɴ has escape options for string characters so it is not either.
Almost no serialization format is canonical by default---AFAIK bencode was the sole example that mandates canonicalization. Instead, a canonical subset of the format is usually defined, which would of course exclude comments, unless comments themselves are considered semantic as in XML. Yes, even XML has a canonical subset [1]!
I feel like I've seen countless attempts at this, and this design doesn't seem any better than usual. There's not much in here to justify the design decisions, nor any indication that a parser implementation exists or is even being actively worked on. For something meant to be "Efficient to write by hand", the insistence on tag-like syntax but with "tags" that consist entirely of a single punctuation character, or which have an extra open bracket, seems error-prone. "Leading and trailing newlines and spacing are intuitively removed from scalars as is indenting." is not adequate guidance for something that claims it "can be implemented to be blazingly fast or using a mode-less tokenizer/parser" (and "mode-less" seems like a strong claim for something that appears intended to support arbitrarily nested objects). This isn't what I would call "terse" given the need for closing tags (which also, according to feedback I've received myself before, counts against human-writability), and it's not at all clear why "the [top-level?] document is named" is advantageous.
Both C# and JavaScript implementations exist.
The terse “single punctuation character”, e.g. <$>, makes it efficient to write, more efficient than JSON.
There is no “extra open bracket”?
There is an overview of how the indenting operates.
Mode-less means the tokenizer does not require modes, unlike those required for XML. The parser has state.
Having the top-level document named allows one to see what the document is, e.g. <Example-Document> or <Person>.
The “readable indented text” refers to https://xenondata.org/#scalars where multiple lines of text can be indented and extracted as expected. One can argue that JSON is not readable due to requiring escaping (\n) for multiple line strings.
This is not enough of an improvement over JSON to justify choosing another format, given that the new format is (A) not recognized and (B) uses even more special characters. (By "not recognized" I actually mean there are no implementations given for any languages, much less accepted standard implementations.)
Funnily enough, I've been working on solving the same problem concurrently. Though in my _very_ biased opinion, I think CONL is easier to read and write:
Agreed that this is much better than the OP. That said, my general opinion is that whitespace-only indentation should be avoided, especially in a serialization format, due to the inherent ambiguity of whitespace characters and the resulting human mistakes. When I designed CSON [1] I strived to make it as readable as possible without indentation for that reason.
Nice – I like your verbatim syntax for multiline strings!
I went with indentation because a very common use-case in a configuration file is commenting out lines. Even with CSON-like comma rules, you still need to balance your {} and []s. Indentation balances itself most* of the time.
Indentation is still desirable for most human tasks, indeed! But you can have indentation and grouping at once, each complementing the other. As you've noticed, CSON's verbatim syntax was intentionally designed so that it remains valid without any indentation, but your instinct really wants to align those lines anyway. (A similar approach can be seen in Zig verbatim strings, which seem to have been designed independently of CSON and make me much more confident about this choice.)
If this is not parody, try to be humbler and not claim to be better than the most popular solutions.
It's unlikely, and even if you are right, we won't switch standards; the best you can hope for is individual adoption, and you can achieve that by offering a tradeoff, improving in one area.
Ok that's good but capitalize TERSE instead of IS. That way you acknowledge that terseness is one of many parameters of a string data structure instead of focusing on our disbelief of its terseness.
A basic scalar pair in ᴊꜱᴏɴ is "key":"value", or "key":value, so a 4 or 6 character overhead. xᴇɴᴏɴ is <key=value>, 3 characters. In my tests xᴇɴᴏɴ IS more terse.
But no one said it wasn't. And also no one particularly cares, you can have your crown for king of terseness, the criticism isn't that.
You are being criticized for quality, and your response is to fixate on one parameter being better. It's as if you were overweighting that parameter or ignoring the other million parameters, which is what makes your proposal so out of place.
Suppose that you come to the world economic forum, and you propose a new coin you call the GeneThomas coin. When people criticize or laugh about your idea, you respond that "The coin IS greener! It really is and I can prove it". Man, you are just showing that you don't understand anything about what makes a currency good or popular.
Just find your niche in green coins, but don't try to compete with the US dollar or bitcoin cause you'll lose
There are <> characters, which are awkward on a good day. Yeah, we use them in HTML but the editor helps with that. And there are a bunch of special characters and "objects" that require interpretation. JSON has what? Quotes, square brackets, colons, and commas, used the way they are used in most programming languages, and thus familiar to most of us. Outside of terseness, which is overrated, what real advantages does XENON provide?
It seems a bit like a toss-up to me ... Taking examples from your landing page I can see some where there is less typing in JSON and some where there is more. For example, the book one - I counted the equivalent JSON and they were 128 non-whitespace characters vs. 127 non-whitespace characters.
Practically, at least for my editor, I had to type less for JSON because every paired character automatically inserts its closing equivalent. This is likely to work across a wide range of editors, from the browser console to whatever fringe text input field. Once you account for paired characters, JSON wins at 114 chars. Of course if you had editor support for XENON it could also automatically insert some of the control characters, at the very least all of the > and the <$>, which brings XENON to 117 chars typed.
Anyway, I think you probably would've gotten a less strong response from people here if you had less absolute statements on the site ("better alternative", "the best way") - certainly there are going to be use-cases where this excels, but I can also see how the average simple REST API would be very unlikely to benefit from this, and the extra features may in fact expose it to more risk (e.g. the graph support, look at YAML and the CVEs it has caused over the years)
I have read this document, and I don't really like this alternative, much. However, one consideration should be that one format is not necessarily suitable for everything; there will be differences by data model and other stuff. Different formats have different advantages and disadvantages, both in general and for specific applications.
I had recently been working on something too, called TER. It is a pure ASCII file (although some implementations might support other character sets, mine doesn't, although non-ASCII characters can still be represented using escape sequences). It can be converted to DER (a program can also be written to convert the other direction (BER to TER), but this is not done yet).
It is ASN.1X which is a variant of ASN.1; a few types are removed (OID-IRI, and a few others) and some new types are added (BCD string, TRON string, PC string, key/value list, OBJECT IDENTIFIER RELATIVE TO, and a few others), and also a few types are deprecated (such as UTCTime), although they are intended to not conflict so that it is possible to implement both in the same program if you want to do. (ASN.1X also removes or restricts some other features of ASN.1, such as that optional fields and fields with a default value are not allowed if it would require to look ahead to determine the absence or presence of the field.)
Xenon has a type for references to other nodes. I had considered (before seeing the document about Xenon) adding such a type into ASN.1X, although I had not done so yet. (My idea is to use a relative format (I have the idea how this will be encoded in DER, although I am not sure about TER; for TER it might require multiple passes, and currently the implementation uses only a single pass), that if some structure is taken out and moved to another file, and it only references within that structure, the reference will remain intact, and it cannot interfere.)
TER also uses the same comment syntax as Xenon (although so does PostScript, and probably others too).
Perhaps another thing to be noted is PostScript notation; a subset of the PostScript notation could be used for arrays and key/value lists and some such things like that (and is something I sometimes use).
Unless I'm missing something, a major downside of this is that scalar types are not explicit, and that is very, very, very bad IMO. It will lead to numerous cases of poor interoperability as scalars are interpreted differently.
That is, in JSON, at least I know "true" represents the string "true", and true represents the boolean value true, and "1.25" is a string and 1.25 is a number. If anything, JSON would be greatly improved by having a specified datetime type, and more guaranteed semantics around numbers (e.g. integers vs. decimals vs floats for example).
XML obviously suffers the same issue, and I think it is much worse for not having types (though highlights the difference between structured "documents" vs. object serialization formats).
"Documents must be utf-8 and should have a byte order mark."
No. If you're using UTF-8 (which is a good choice), the use of a BOM should be discouraged. Given that the format specification says documents MUST be UTF-8, there is no need to enable detection of UTF-8 content with the UTF-8 BOM. And, of course, the original purpose of the BOM (detecting big- or little-endian encoding) is unnecessary in UTF-8.
The Unicode standard, section 2.6, says, "Use of a BOM is neither required nor recommended for UTF-8".
While it is allowed, if you're making a new spec for a new data format, you shouldn't recommend the use of a BOM in UTF-8.
> Xenon is the best way to represent information: terse
The best would be to remove all the unneeded, unergonomic, shift-requiring <characters> from here; you already have the = symbol, whitespace, and $, which separate all you need.
So delineate them when they contain whitespace, for example "quotes serve exactly that purpose". There is no point in delineation when there is nothing to delineate; it's just needless verbosity in a language that aims to be "terse".
What exactly is good about <>s that you think it's the good in the bad XML?
There is widespread agreement that XML is generally bad for data serialization, while HTML is not much questioned as such. You can't carelessly adopt angle brackets without asking why such a difference exists in the first place [1].
If you figured that out yourself, you should have also realized that angle brackets are generally considered bad for the data serialization. Cherry-picking my words without answering my whole point is not a good move.
I should have been clearer. jq is a programming language, but it's also a data language in that a) valid JSON is a valid jq program, b) you can construct JSON data trivially with jq.
I know people are hating on it, because who likes a new thing? Especially something that looks kinda like xml.
But I personally really liked the structure that xml forced, I just found it to be way too verbose. I find json and yaml and their dependencies on tabs to be confusing, and the specs are often abused in ways that make them nearly as hard for a human to parse as xml.
This feels like a good middle ground, I hope it receives more support
Pkl belongs to a related but different category of "configuration" languages, while we are talking about a pure serialization format. A serialization format can be used as a configuration language if designed carefully, but they are distinct.
The main advantages of XML (or any standard) is adoption and a wide ecosystem. Unfortunately that beats any "better" standards by a wide margin.
One thing that is also really important is the ability to define a schema and be able to validate. See XML Schema, JSON Schema. This is a really tricky problem to get right. Especially if you try to do both with the same model (describing your data model and describing how it's validated) at the same time.
Once you have the schema, IDEs like VSCode offer code-intelligence and real-time validation, which is very nice.
I conjecture that with programming languages that do not crash, schemas used for validation are relatively unimportant. xᴍʟ schemas suffer from the fact that one cannot specify that an element may only appear once.
What advantages does XML have over what? Whatever you’re saying, it’s not as self-evident as you think it is. Respectfully, I think this standard needs to be reframed as version 0.0.1
This is needlessly rude and dismissive, consider that the author of this spec is in the comments responding to people. If he had presented this at a conference would you come up to him after and say this to his face?
Try to have a little more decorum when giving feedback
The point is that he's a stranger and not a close friend, and we're speaking in public. I understand you're trying to say that there's value in honest feedback. But that's generally when it's real feedback that is actionable or at least constructive; this appears to be an attempt at exaggerated humor without anything of value being imparted. You should consider the fact that he's a real human with real feelings, and it's simply rude and disrespectful to talk to anyone this way.
I doubt OP would ever speak in a business environment this way, and if he did those around him would certainly consider it to be harassment and deeply unprofessional.
Sorry, but I would never use this format for either manual or programmatic use.
* I've tried to read the data this format describes without reading its documentation and I just failed: the format is amazingly counter-intuitive. I never had readability or comprehension issues with XML/HTML, JSON or even YAML (which I think is overly complicated) when I saw them for the first time.
* Terse does not mean cryptic. The basic notation is just weird: why would it need an unbalanced less-than symbol to open an array? Why `<&>` for delimiting elements? Why `<<$>` but not `<$>>`, at least just to be more readable by humans and look balanced? The syntax gets even weirder for arrays containing objects: indents (okay to some extent), `<>` and `<&>` (`{` and `}`?).
* Auto-removing whitespace may hurt. If the format offers this, would it also offer a heredoc-style text like `cat <<EOF` in Bash so that the formatting could be preserved as is? `xml:space` and JSON string literals were designed exactly for this. (upd: I just saw a new symbol: `|`... Well, okay, but another special character now.)
* Native support for arrays. I mentioned a few above. `<<Faults$$>` and `<<$$>` -- guess what these two mean if you see them for the first time? You would never guess. It's an empty array and an empty element; you've just failed.
* Graphs... Another weird syntax comes into the room: `#id;` but `@id` (no semicolon?). Okay, these seem to be first-class ids and refs, not necessarily designed for graphs (I'm not sure the `#ID;` and `@` would play perfectly with any non-empty names.) But what makes graphs first-class citizens here, and why? Graphs can be expressed, I believe, in any data/markup format/language and then processed by a particular application if graphs are needed. By the way, arrays and objects are not necessarily trees from the semantic point of view. More graph processing issues were mentioned in other comments on this topic. What about first-class support for sets? I'm kidding.
* Comments. Another symbol to come: `%`. To be honest, I can't recall any instance where I've seen the percent sign used elsewhere for this purpose. What if comments started with the well-known `#`, at least with a space right after it so that it wouldn't be considered a "graph id" (or, don't get me wrong, with another `<`/`$` sequence)?
* Just got to the Escaping section and now I see how the characters are escaped. Perhaps this is okay.
* Scalars. Crazy number formatting and locale issues are waiting. The never-on-keyboard infinity symbol would be great for APL, but why not just Inf(inity)? Whatever the scalar value is, there is no need to cover all existing primitive scalars -- just let them be processed by the application, since all scalars are semantically text. Other crazy things: what makes UUIDs so special for this format? Why is Base64 so special that it has native support (would it support Base16 for human-readable message digests, or Base58 to remove visually lookalike Base64 characters)?
* CR/LF? I can understand its semantic purpose, but why not LF to make it even more "blazingly" fast? Say good-bye to UNIX users.
* The cognitive load for the markup syntax absolutely does not make it efficient in typing. Believe me, it does not.
What I would do is probably enhance the widely used formats: say, make JSON, which I find almost perfect from the syntax point of view, not require quotes for object property names when the names don't contain special characters like `:`, just as in JavaScript. And perhaps make an XML "v2" that moves away from SGML, loosening its syntax to get rid of closing tags via shorter notation, adding first-class array support, and fixing syntax issues, especially for CDATA and comments that can't contain `--`. You may blame me, but I love XML the most: it just has the richest set of standardized, amazingly well-designed extensions to operate on XML with, regardless of the heavy XML syntax.
P.S. What does it look like when the document it marks up is minified (e.g., no whitespace)?
> Having both begin and terminate arrays start with << is more consistent.
It hides context for humans. I am a human and I love to see what opens and what closes a context. Why would `<` open an array when `[` is an astonishingly widespread practice? Why would `<<` close it just because you think it is more consistent? What if open/close balance is also consistency, especially for nested arrays?
Also, just think how many keystrokes you'd save if you used `]` instead of [Shift]+`,` [Shift]+`,` [Shift]+`4` [Shift]+`.`, given you declare it as readable text.
> Using `{` and `}` would lead to more special characters.
Agree. Too many now.
> It is simpler to support graphs in the markup. The fact is that the data being serialized may be structured in a graph.
I can't understand why you call it native graph support. The only thing it does is declare an identified element and references to that element. I can't see how different that is compared to XML or JSON, which semantically "have graph support" just because they too can declare something considered ids and references to the identified element.
> # matches the usage in ᴄꜱꜱ and ʜᴛᴍʟ, relating to an id/page location.
No. The # symbol is overloaded: it may be a comment start, especially in line-oriented and human-readable text formats or scripts; CSS uses it for IDs; HTML has nothing to do with it, since browsers only use # as part of a URL to reference a particular identified element for navigation purposes (it's called an anchor in URL syntax; formerly web browsers used <a name="anchor"> to navigate to a part of the page; as of now, in the HTML5 world, any `id` attribute is considered an anchor, which I find a design flaw since ids are meant for identification, hence any id in the document is exposed for navigation purposes, whereas <a name="anchor"> is semantically something for navigation).
> Having a space after the # differentiate between and id and comment would be a mistake.
Of course it would be, from the current perspective, if the id declaration is `#`. I don't know what `#<NON_WHITESPACE_CHAR>` would do if it's legal.
> The Formats section is to facilitate interoperability between implementations, e.g. if you are encoding a ɢᴜɪᴅ [easy to say] then format it this way.
I agree that it may look better for consistency purposes, but what interoperability is all that about? Why would formatting even affect it? From the consuming application's point of view, a value must be handled according to its context, as defined by its purpose and semantic type. If my element/attribute is formally declared as a GUID, then why would I care that much whether it's conventionally formatted? Would it still be a GUID if I encoded it using Base64? The dashes in GUIDs are for humans only and they are optional, and the application knows it's a GUID and can process it even leniently if it can. The same goes for ISBN/ISSN for books and magazines, card numbers, phone numbers, etc. -- none of them require dashes or spaces or parentheses to be processed.
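For instance (standard .NET Guid calls, nothing Xenon-specific; the value is an arbitrary example), the same GUID parses and prints in several equally valid shapes, so the dashes carry no information:

using System;

// Guid.Parse accepts both the dashed ("D") and undashed ("N") forms.
Guid id = Guid.Parse("aa512e8ecf97445eac10cb5a5ea3ef63");
Console.WriteLine(id.ToString("D")); // aa512e8e-cf97-445e-ac10-cb5a5ea3ef63
Console.WriteLine(id.ToString("N")); // aa512e8ecf97445eac10cb5a5ea3ef63
Console.WriteLine(Convert.ToBase64String(id.ToByteArray())); // shorter, still the same value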
This is why "Real numbers *should be stored* with commas for readability." is just hilarious. Why should? May I use underscores or dots or spaces to group digits (seriously, why a comma)? Can I group digits after the decimal point? If I need integers, why are they also limited to 32 bits and 64 bits? How would I represent an arbitrary-precision integer or non-integer number (say, I want Pi to 197 digits after the 3)? If ∞ is allowed, but no mention of +Inf and -Inf, can 4.2957×10^24 be used instead of 4.2957e24? May I just have a simple `D+(\.D+)?` for everything I need for true interoperability?
I agree consistent formatting is really beautiful, but it must never be the key to process data.
> It is more terse than ᴊꜱᴏɴ.
Sorry, it's not.
> Good
Could you please provide an example of minified (a single line, no new lines) array of timestamps from your page?
> I can't see how different that is compared to XML or JSON, which semantically "have graph support" just because they too can declare something considered ids and references to the identified element.
When serialising data with ᴊꜱᴏɴ one has to use special field names such as $id; hoping the programming language does not. It DOES have native graph support that xᴍʟ and ᴊꜱᴏɴ do not.
> # [..] it may be a comment start
No.
> but what interoperability is all that about?
Interoperability between implementations. If you were using Xᴇɴᴏɴ to communicate between two different languages, say the C# and a Python implementation, agreeing on what an integer IS is helpful. Both Xᴇɴᴏɴ libraries can provide support for encoding say ɢᴜɪᴅs. You have missed the point. A user is always free to encode data as arbitrary strings.
> commas [...] readability." is just hilarious. Why should?
Commas make numbers faster to interpret, something `ls` is missing. As I stated on another branch, English is the global lingua franca, so commas every three digits is the standard.
> ∞ is allowed, but no mention of +Inf and -Inf, can 4.2957×10^24
∞ is +Inf. 4.2957×10^24 is not the xᴇɴᴏɴ standard.
> When serialising data with ᴊꜱᴏɴ one has to use special field names such as $id; hoping the programming language does not.
Unless the serialization/deserialization tool supports property name overriding, which is trivial.
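For example, with System.Text.Json (Json.NET has an equivalent attribute), a property can be mapped onto a wire name like `$id` that is not a legal identifier; the class here is a made-up illustration:

using System.Text.Json;
using System.Text.Json.Serialization;

class Payload
{
    [JsonPropertyName("$id")]          // maps the C# property onto the "$id" field name
    public string Id { get; set; } = "";

    public string Name { get; set; } = "";
}

// JsonSerializer.Serialize(new Payload { Id = "1", Name = "root" })
//   -> {"$id":"1","Name":"root"}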
> It DOES have native graph support that xᴍʟ and ᴊꜱᴏɴ do not.
Again, how is this different from `xml:id` that is referenced from other XML document nodes and what makes it "native graph support"?
> Both Xᴇɴᴏɴ libraries can provide support for encoding say ɢᴜɪᴅs.
> Better than ᴊꜱᴏɴ which does not do timestamps.
Better?
There is just no need. For what? These two can be controlled by optional schemas that may be extensible, like the types validated in XML Schema or Relax NG. Schemas do not dictate format, and you don't need your format to be a schema. I still can't see what makes timestamps (and GUIDs) so special that they get dedicated sections in your document.
I tend to think JSON also has a design flaw in providing first-class support for booleans and numbers via the literals it took from JavaScript, because the latter needs more complex syntax as a programming language. Ridiculously, XML seems to be perfect in this case by unifying scalar values: whatever scalar it encodes, the text representation can encode it in any efficient format regardless of whether it is a boolean, a number (integer, "real", complex, whatever special), a "human-text" string, a timestamp or whatever else; HTML attribute values, unlike XML, don't even need to be quoted in some trivial cases and may even be omitted for boolean attributes. The application simply parses/decodes its data and manages how the data is deserialized. That's all it needs.
I would probably be happy if, say, there were a format as simple/minimalistic as possible, not even requiring delimiters or quoted strings unless they are ambiguous. Say, `[foo 'bar baz' foo\ bar Zm9vYmFyCg== 2.415e10 ∞ +Inf -∞ -Infinity \[qux\] +1\ 123\ 456789 978012345678 {k1 v1 k2 v2} aa512e8ecf97445eac10cb5a5ea3ef63 c8a0ebbd 2026-09-24T16:45:22.5383742 P3Y6M4DT12H30M5S]` or similar, maybe with node metadata and comment support. The above dumb format covers arrays/lists/sets, the string `foo`, the space-containing `bar baz` and `foo bar` strings in human and Base64 encoding, the `2.415e10` number from your document and all four infinity notations, a single string `[qux]` and not a nested array with a single element, a phone number (with space-delimited country code, region code and local number), an ISBN, a simple map/object made of two pairs, a GUID, a CRC32 checksum, an ISO-8601 zoned date/time, and an ISO-8601 duration. What more scalar types could it be extended with? Since there are no types for scalars, this "format" does not dictate types or preferred scalar formats, letting the application decide how to interpret these on its own.
> Commas make numbers faster to interpret, something `ls` is missing. As I stated on another branch, English is the global lingua franca, so commas every three digits is the standard.
For whom? Humans? Why would a data encoding obey regional number or date/time notation conventions at all? English, but US, UK, Canada, or any other English-speaking country? You've been told that in that thread too; besides, spaces or underscores are even more readable in monospace fonts. You don't need it.
Funnily enough, your format saves on key/value pair syntax, appealing to the 4 vs. 6 character overhead (okay, cool), but your array element delimiter `<&>`, which is amazingly bad for keyboard ergonomics, loses to JSON's simple and regular `,` syntax (3 vs. 1 overhead). Isn't that blind or crazy?
Based on special syntax. You're about to introduce node attributes.
> They are common in data.
I use tables every day. May I have "first-class support", but for tabular data, which is very common as well? Three or four times I expected you to eventually explain what makes the graph support and how it differs from declaring ids and refs in other formats you think are worse than yours. No answer.
> Separate attributes and sub elements is a mistake. One should be able to guess an ᴀᴘɪ.
For the first, I kind of agree that attributes and subnodes should be unified in favor of subnodes (which was sacrificed in markups like HTML for the sake of sane brevity). However, attributes, which your ids are, may serve as metadata for nodes of any kind. For the second, API for what? Document generating/parsing API? Validation API? Serialization/deserialization API? The enveloped application's API? I guess the latter, for whatever reason, dictated by your "standard". In any case documentation, schemas, data validators and autocomplete are my best friends; no need to "guess".
> That is laborours! A Xᴇɴᴏɴ library provides AsGuid, AsDateTime etc.. and serialization directly to/from those types.
What you're mentioning is called serialization and deserialization, and these two can easily be implemented once for "basic" types and extended at the application level for any kind of data, because the application decides what to do with data on its own, not the format the data is enveloped in. Serialization and deserialization don't exist from the format's perspective, which only defines the syntax in which data is marked up in a document. So why would it care about formatting at all?
> Yes. Humans have to read markup.
Format should not care too much.
> I repeat! READABILITY.
No yelling, please. Regional formats are defined by countries, not languages, just by definition, even if English is the lingua franca as you said elsewhere. Separate digits with underscores or spaces.
I'm very happy your "standard" neither recommends color highlighting for, say, numbers, nor, even worse, has special syntax for readability highlighting. Highlighting increases readability greatly as well, you know.
> No, quite the opposite.
6:4 but 1:3 is a great syntax win. Okay.
The lack of any solid counterarguments from your side, and your blindness to the obvious design flaws of your so-called format "standard", only show how you have mixed up all the concepts into a mess of crazy markup syntax plus scalar formatting rules that should only be handled by applications during serialization and deserialization, regardless of what the markup format "standard" recommends.
Good luck with your "standard", rightly criticized and rejected by others; but better to just bury it than spend your life on it for nothing. Sincerely.
Xᴇɴᴏɴ has first-class arrays also, so tabular data could be stored as such.
> explain what makes the graph support and how it differs from declaring ids and refs in other formats you think are worse than yours. No answer.
It is built in!
> So why would it care about formatting at all?
FOR INTEROPERABILITY! That is, different implementations of xᴇɴᴏɴ agree on what a ɢᴜɪᴅ or date looks like! Fʏɪ, with a good implementation of xᴇɴᴏɴ you just point the library at your data, sometimes augmented with some attributes, and you get cleanly formatted markup.
>>One should be able to guess an ᴀᴘɪ.
> For the second, API for what?
Say you are using an ᴀᴘɪ for information about a person and there is information about their height; in xᴇɴᴏɴ one knows there shall be a scalar called “Height”, while in xᴍʟ it may be an attribute or a sub-element.
>> Yes. Humans have to read markup.
> Format should not care too much.
We are using text formats because they are READable to humans.
> Separate digits with underscores or spaces.
That is not standard anywhere.
> [...] color highlighting
Only the application knows if a scalar is a number or a string.
There are no obvious design flaws. Take xᴍʟ, add an array type and xᴇɴᴏɴ results.
We must be talking at cross purposes re formatting. [phew...] An application has an object called Person, and a field called Height with a type of double. C♯: `Person fred = new Person { Height = 1.67 }; string xenon = XenonStart.Serialize("person", fred);` results in the string "<person><Height=1.67><$>". A xᴇɴᴏɴ implementation in another language, say JavaScript, can take that xᴇɴᴏɴ string and decode it into an object with a field called Height whose value can be decoded with .AsNumber into 1.67, because there is a standard for encoding an ɪᴇᴇᴇ 64-bit number/.net double/JavaScript number.
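Spelled out as code (the `XenonStart.Serialize` call and the resulting string are as the author states above; the `Person` class is the one implied by the description):

using System;

// The C# side of the round trip as described above.
Person fred = new Person { Height = 1.67 };
string xenon = XenonStart.Serialize("person", fred);
Console.WriteLine(xenon); // "<person><Height=1.67><$>" per the author

public class Person
{
    public double Height;
}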
> * Native support for arrays. I mentioned a few above. `<<Faults$$>>` and `<<$$>>` -- guess what these two mean if you see them for the first time? You would never guess. It's an empty array and an empty element; you've just failed.
<< means it relates to starting an array, $>> means it is the end, and $$ means something else — an empty array!
The xᴍʟ alternative is a bodge:
public class PurchaseOrder
{
    public Item[] ItemsOrders;
}

public class Item
{
    public string ItemID;
    public decimal ItemPrice;
}
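For reference, a minimal sketch of how .NET's XmlSerializer treats those classes by default (my understanding of its defaults; output trimmed to the relevant shape): the array member gets a wrapper element, which is the layering being called a bodge here.

using System;
using System.Xml.Serialization;

// Produces roughly:
//   <PurchaseOrder>
//     <ItemsOrders>
//       <Item><ItemID>A1</ItemID><ItemPrice>9.99</ItemPrice></Item>
//     </ItemsOrders>
//   </PurchaseOrder>
var serializer = new XmlSerializer(typeof(PurchaseOrder));
serializer.Serialize(Console.Out, new PurchaseOrder
{
    ItemsOrders = new[] { new Item { ItemID = "A1", ItemPrice = 9.99m } }
});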
> Comments. Another symbol to come: `%`. To be honest, I can't recall any instance where I've seen the percent sign used elsewhere for this purpose.
PostScript is one programming language that uses a percentage sign for comments. TeX and METAFONT also use a percentage sign for comments. There are others, too.
I shall elaborate on the first paragraph of the web page:
• Terse.
Xᴇɴᴏɴ is as terse as ᴊꜱᴏɴ using 3 characters per scalar value <key=value> rather than ”key”:value, (4) or ”key”:”value”, (6). Xᴇɴᴏɴ is significantly more terse than xᴍʟ <key>value</key> (5+len(key)) around 10.
• Readable multiple line indented text.
ᴊꜱᴏɴ does not support multi-line text, forcing one to escape newlines as ‘\n‘. xᴍʟ looks messy with multi-line text as the text is copied verbatim; indenting is therefore from the left of the page. Xᴇɴᴏɴ allows one to indent text more than the enclosing markup.
• Native support for arrays.
Awkward in xᴍʟ.
• Native support for a graph structure, elements may have multiple parents.
Both are missing from ᴊꜱᴏɴ and xᴍʟ, where special fields are layered on top of the markup.
• Native support for types used in serialization.
Also missing from ᴊꜱᴏɴ and xᴍʟ.
• Unambiguous choice of data structure.
No attributes.
• Efficient to write by hand.
See Terse. Also supports comments, unlike ᴊꜱᴏɴ.
• Can be implemented to be blazingly fast or using a mode-less tokenizer.
Design decisions took performance and grammar simplicity into account.
• The xenon document is named.
Useful. One can see that the document is a <Person> or <Example-Document>.