> Time zone identifier instead of just a fixed offset (which is ambiguous for future events)
Time zone identifiers are not future-proof. As a concrete example, imagine a fictitious country X whose largest city (not necessarily the capital) is Foo, with a single time zone throughout the country. A natural time zone identifier would be something like `Region/Foo`. If part of X switches to another time zone and Bar happens to be the largest city in that region, a new time zone identifier `Region/Bar` would be assigned. You can't automatically determine which uses of `Region/Foo` should be converted to `Region/Bar`. It is up to you to pick a strategy in such cases, but time zone identifiers themselves do not solve the ambiguity problem.
Aside from this false sense of security, time zone identifiers were never meant for general interchange, and there are several understandable but problematic assignments. Baking them into a data structure format doesn't sound like a good idea.
That's a valid scenario where time zone identifiers don't work well, but what's the alternative? A fixed offset is guaranteed to be wrong across a DST switch, and that happens a lot more often than time zones splintering. I guess GPS coordinates would be more accurate, but then you have privacy concerns.
Fun fact: time zones are named after cities for robustness:
> Country names are not used in this scheme, primarily because they would not be robust, owing to frequent political and boundary changes. The names of large cities tend to be more permanent. Usually the most populous city in a region is chosen to represent the entire time zone, although another city may be selected if it is more widely known, and another location, including a location other than a city, may be used if it results in a less ambiguous name.
You do need time zone identifiers or something like them, but they are not something you can blindly buy into. I mean, of course it would be great to see a semi-automated solution for this (for example, one that can automatically determine that Region/Foo was split into Region/Foo and Region/Bar, but nothing else)! But the time zone database doesn't provide anything like that, and built-in support for time zone identifiers can hugely mislead users.
It seems like these problems with defining/describing timezones (i.e., temporal reference systems if you will) are similar to those of defining spatial reference systems. Spatial reference systems are fairly rigorously defined/described using both standard (e.g., WKT) and non-standard formats (PROJ strings, EPSG numbers). Is there something similar for temporal reference systems?
Time zones based upon anthropocentric concepts like countries have issues simply due to the realities of political change. Entire countries and civilizations come and go. Even the concept of the “common era” is a convention of convenience, still based around a religious figure frankly, and a sort of retcon given that others in the same period were likely measuring time in what amounts to another coordinate system.
A self-defining and self-describing set of measurements would help here, but it obviously bloats formats, similar to how XML schemas become overly verbose and pedantic without adding clarity for the target audience.
So I agree that baking in a time zone is likely a bad idea, but I have doubts we can do much better by trying to use another, more absolute time coordinate system independent of human civilization's conventions.
The complicated number encoding scheme you mentioned is a hexfloat: C has them too.
Hexfloat can be really useful when you need precise/exact floating-point constants for numerical methods. Without them, you end up having to do more-complicated hacks to preserve exact constant values when code gets compiled, or you have to live with compilers (sometimes) subtly altering constants.
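For instance, Python exposes the same notation through `float.hex()`/`float.fromhex()` (with a dot where the format above uses a comma as decimal separator), which makes the exactness easy to demonstrate:

```python
# Hexfloats spell out the exact bits of a float, so a constant survives
# the round trip through source code without any decimal rounding.
x = float.fromhex("0xa.3fb8p+42")   # what the format writes as 0xa,3fb8p+42
assert x == 0xA3FB8 * 2**26         # exactly representable, no surprises

# Every float has an exact hex form, and the round trip is lossless:
pi = 3.141592653589793
assert pi.hex() == "0x1.921fb54442d18p+1"
assert float.fromhex(pi.hex()) == pi
```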
NUL is a tricky beast even under ideal conditions. In the first draft I'd simply forbidden it. Simple, decisive, and of course people didn't like that. Then I allowed it, and suddenly the security gates were flung open. Some platforms simply cannot support NULs in strings, so it becomes platform dependent (bad from a security perspective). So my next angle of attack was to make it configurable with a default of deny. Better from a security perspective, but now it's complicated. TBH there are nights where I think I should just go back to no NULs at all. But I really need more discussion with stakeholders to figure out what problems this solves and how important it really is...
Line breaks are another sticky issue. Non-technical Windows users will inevitably produce documents that use CRLF, so if the decoder rejects the file, that's a bad user experience. What's the best trade-off here? I'm not really sure.
The number encoding thing is being discussed in other comments, so I'll leave it be. I allowed the comma because that's how a huge chunk of the world writes the decimal separator. Once again, this is about the experience of non-technical users. I'm REEEEEEEALLY on the fence with this one.
Entity references are dangerous, yes, but also powerful. The point of them in the format is to solve the recursive reference problem, because you just can't do that otherwise, and these structures do exist in the world. It's another case of an imperfect solution for an imperfect world. Bear in mind I absolutely do NOT want this format to become some Turing complete language. This is just the minimal feature set I could think of to represent real world data.
Arrays-vs-list is another one of those compromises. Encoding an actual array of fixed types into a list would be slow and bulky, leading people to just encode them as a chunk of bytes like they currently do in JSON and other formats. As an imperfect solution to an imperfect world, I want to at least let people preserve the semantic meaning of what they're sending since they're going to use array encodings regardless of what the format supports.
Hmm... For NUL, it's common to have an escape sequence (yes, even in binary data) and use it to encode the problematic characters. It's for the best if you escape enough of the data that someone can dump your file into a terminal without anything getting compromised along the way (the terminal merely failing to render is OK).
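A minimal sketch of the byte-stuffing idea, with an arbitrary escape byte and XOR constant (nothing here comes from the format's spec):

```python
ESCAPE = 0x5C                 # hypothetical escape byte ('\')
UNSAFE = {0x00, ESCAPE}       # NUL, plus the escape byte itself

def escape_bytes(data: bytes) -> bytes:
    """Replace each unsafe byte with ESCAPE followed by (byte XOR 0x20)."""
    out = bytearray()
    for b in data:
        if b in UNSAFE:
            out += bytes([ESCAPE, b ^ 0x20])
        else:
            out.append(b)
    return bytes(out)

def unescape_bytes(data: bytes) -> bytes:
    out = bytearray()
    it = iter(data)
    for b in it:
        out.append(next(it) ^ 0x20 if b == ESCAPE else b)
    return bytes(out)

assert unescape_bytes(escape_bytes(b"a\x00b\\c")) == b"a\x00b\\c"
assert 0x00 not in escape_bytes(b"a\x00b\\c")   # no raw NULs on the wire
```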
Personally, I disagree with how your format handles all those other issues too (except for the numbers), but well, if you think you are correct, go try it. If it works, it works, and my disagreement may easily be misguided. Anyway, I disagree because:
As for line breaks, the internet has a way of trying to "fix" them and completely breaking the line information of the original document. That would be OK if the format weren't whitespace-dependent, but it is, so changing the line breaks corrupts the data. Anyway, that is becoming a lesser problem with time, so maybe for a new format it's fine.
Entity references in formats that are not focused on them are surprising. That means a lot of software will break once it receives one, and tradition says it will do so in a way that compromises computer security. I would either change the format so that references are almost always used, or remove them. If an application needs references, it can always tag the entities with an id and put the references there by itself.
The same applies to arrays, to a lesser degree. They will be surprising, but they are also easier to handle. And they are much less necessary, since lists can always replace them. I'm really not sure whether they are a net negative or positive.
Thanks for your reply! It's a really good proposal, and I'll definitely consider it next time I'm serializing something.
> NUL is a tricky beast
One alternative I found interesting is Modified UTF-8 (https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8), where NUL is encoded as 0xC0 0x80. It's already in use in Java, apparently. Disclaimer: I'd never heard of it before, and I don't know how well supported it is.
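The trade-off is easy to demonstrate: a standard decoder must treat C0 80 as an invalid overlong sequence, so Modified UTF-8 data isn't interchangeable with plain UTF-8:

```python
# Standard UTF-8 encodes U+0000 as a literal zero byte:
assert "a\x00b".encode("utf-8") == b"a\x00b"

# Modified UTF-8 writes U+0000 as C0 80 instead, so the encoded string
# contains no zero byte. The catch: standard decoders reject C0 80 as
# an invalid overlong sequence.
try:
    b"a\xc0\x80b".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected by a standard decoder:", e)
```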
> Line breaks and commas
I'm from a country that uses the comma as decimal separator, and I still prefer dots when programming. I dread ambiguous numbers like "1,001". I'm confident this is true for almost all technical people. And I really don't see non-technical users editing this kind of file.
If they can be trusted to read it, and to modify it without introducing syntax errors, they can be trusted to use dots and an editor that shows LF as linebreaks (i.e. anything but `notepad`).
My choice would be dots only for decimal separator, accept LF and CR LF on reading, but prefer LF when writing.
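Sketched out (Python just for illustration, and the function names are made up; the normalization itself is trivial):

```python
def read_text(raw: bytes) -> str:
    """Accept both LF and CRLF on input."""
    return raw.decode("utf-8").replace("\r\n", "\n")

def write_text(text: str) -> bytes:
    """Always emit LF, regardless of platform."""
    return text.replace("\r\n", "\n").encode("utf-8")

assert read_text(b"a\r\nb\nc") == "a\nb\nc"
```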
Of all data types that could have native encodings (colors, IP addresses, lat/lon coordinates, enums, tags, markdown, hashes, file permissions, OIDs, DOIs, ISBNs, etc), I think arbitrary object graphs bring too many downsides for a serialization format.
You don't want reader CVE's because of Billion Laughs, and you don't want to force every programmer writing a simple traversal algorithm to correctly handle cycles.
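For a sense of what that burden looks like, here's a rough Python sketch (the `walk` helper and node shapes are invented for the example):

```python
def walk(node, visit, seen=None):
    # `seen` holds ids of containers already visited, so a document that
    # references itself terminates instead of recursing forever.
    if seen is None:
        seen = set()
    if isinstance(node, (dict, list)):
        if id(node) in seen:
            return                      # cycle: cut it here
        seen.add(id(node))
        visit(node)
        for child in (node.values() if isinstance(node, dict) else node):
            walk(child, visit, seen)
    else:
        visit(node)

doc = {"name": "root"}
doc["self"] = doc                       # a cycle a naive walker never escapes
walk(doc, lambda n: None)               # terminates
```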
>> accept LF and CR LF on reading, but prefer LF when writing.
> That's the current behavior.
I'd suggest changing the SHOULD to MUST here, and removing the "foreign or unknown system" part:
> but encoders SHOULD output LF when the destination is a foreign or unknown system.
It's OK if a CRLF sneaks in because a user edited a file manually, but encoders should be more predictable.
>> and you don't want to force every programmer writing a simple traversal algorithm to correctly handle cycles.
> Ugh... I really really REALLY want you to be wrong on this :(
I have some good news then.
I just checked, and most of my JSON traversals are for things that you already take care of, like binary arrays and handling cycles (huh, talk about irony).
And the Billion Laughs problem exists mostly because XML entities are more like macros, expanded in place. As long as the reader doesn't try to convert the document to JSON or naively print the object graph, it should be OK.
I think it might be ok to keep references.
And again, cheers for the encoding specification! It's really cool, and I hope it catches on.
Re. time zones, I'm of the increasing conviction that they should be avoided whenever possible, and it usually is.
I think of a time zone as drawing a large box on a map, for any given instant in time. Future events either happen 'only in time' or they happen at a time and place. A time zone isn't a great representation of a place! I'll be much happier storing UTC and coordinates, then turning that into a timezone for display when I have to.
A time zone is an approximate (and weird) spatial coordinate system which causes nonlinearities in the representation of time, and unless you want both of those properties it comes with baggage.
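A rough sketch of the store-UTC, convert-late approach, using Python's `zoneinfo` (Python 3.9+; the instant and zones are arbitrary examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store the instant in UTC; pick a zone only at display time,
# when you know where the viewer actually is.
stored = datetime(2025, 3, 9, 18, 30, tzinfo=timezone.utc)

for tz in ("America/New_York", "Europe/Berlin"):
    local = stored.astimezone(ZoneInfo(tz))
    print(tz, local.strftime("%Y-%m-%d %H:%M %Z"))
```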
> A time zone is an approximate (and weird) spatial coordinate system which causes nonlinearities in the representation of time, and unless you want both of those properties it comes with baggage.
That approximate and weird system is how humans tend to think about future events, though. You're right that it's often not necessary, but for many use cases (e.g. when a TV show will first be broadcast in the future) it is the most robust model we have.
Storing a set of coordinates instead of the timezone, though, would be more useful and future-proof. The general case is a range covering the timezone's bounds, but full accuracy means you can store much more specific data.
Exactly. There is one case where time zones are both absolutely necessary and work as expected with no exceptions, and that case is displaying local time.
My assertion is that a very strong argument should be won before they are used for any other purpose.
I arrived an hour late for a Zoom meeting a few weeks ago. A meeting at which I was presenting. I checked and double-checked the calendar event to make sure I arrived at the right time. Daylight saving time hadn't changed for me recently, nor had it changed for the organizers (who advertised the event in GMT).
The problem was that the calendar advertising the recurring event was set to PST, and daylight saving had changed in California (PST -> PDT, or the other way around). The result was that the calendar event shifted by one hour of local time for me and every other attendee who subscribed to it.
This situation sucks, and it's really confusing for everyone. But I can't think of a way around the whole issue:
- If recurring calendar events weren't set to a timezone (and thus just worked off GMT), then all recurring calendar events would drift forwards or backwards whenever a daylight saving change happens
- If recurring calendar events are set to some local time zone, then things like this happen - which really confuses everyone involved.
I mean, sure - in an ideal world we'd get rid of time zones. But until then, it seems like we're stuck with this problem.
(Though at least daylight saving time is slowly being phased out in some countries.)
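For what it's worth, the dilemma is easy to reproduce with Python's `zoneinfo` (dates chosen around the March 2025 US switch):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

la = ZoneInfo("America/Los_Angeles")

# "Every Monday at 10:00 Pacific" -- the zone-anchored reading.
before = datetime(2025, 3, 3, 10, 0, tzinfo=la)   # before the US DST switch
after = datetime(2025, 3, 10, 10, 0, tzinfo=la)   # after it

print(before.strftime("%z"))  # -0800 (PST)
print(after.strftime("%z"))   # -0700 (PDT)
# The UTC instant of "10:00 Pacific" moved by an hour, so every attendee
# outside the zone sees the meeting shift on their local clock. Anchoring
# to UTC instead keeps the instant fixed but makes the Pacific wall-clock
# time drift: the exact dilemma described above.
```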
A great illustration of why we should shun them whenever we can.
Calendar apps are exactly where the full and monstrous complexity of time zones emerges, and I don't envy anyone who works on one.
My point is that the problem doesn't have to happen in the other direction, and this unforced error is made frequently. There's a huge class of recurring events where drifting by an hour on the local clock won't make a difference, but where either skipping an occurrence or doing it twice is bad.
What you're describing strikes me as a bug in the absolute sense: nothing should be displaying PST during a time frame when that time zone isn't in use.
Again, don't sign up to have these problems if you can possibly avoid it.
The solution is to make the calendar aware of all the time zones relevant to an event (e.g. the organizer’s time and the presenter’s time), so it is able to display all the times and warn when timezone changes might be relevant. (Each attendee may have a different time too, but those times do not need to be recorded in the event but can be handled locally on each attendee’s devices.)
There is the extra problem that the standard explanatory names for timezones that appear in the CLDR are very confusing; e.g. British time is referred to as GMT even though that is wrong for the more-than-half of the year when summer time is in effect.
Sure, time zones are real† and therefore developers need to model them correctly.
†arbitrary, yes, but real
I'm saying something stronger than it often isn't necessary, I'm saying that storing time zones as part of a time is almost always the wrong thing to do.
For example: if your show is broadcast first in Eastern then in Central time, that's one thing. What if it shows in Canada the next day? All of a sudden you really wish you had modeled space separately from time.
Even when some event is time-zone-gated, this will correspond to one exact moment in time, and for e.g. server provisioning that instant is what matters. Including a time zone can only lead to missing that instant, it can never help you find it.
One thing I try to hammer home is that the time zone portion of a time is absolutely NOT to be used for any other purpose than to determine the time value itself. The moment you need that kind of data for other purposes, you should be recording a separate field in addition to the time field.
I appreciate your detailed write-up, which I found reasonable (at least for the date & time parts; I haven't read the others), but most users would be clueless about this separation and will write something that seemingly works, only to find it broken later. I also have a concern about the implementation complexity [1], which, combined with clueless users, can have a bigger effect.
Yes, the time data is complicated, but then again time is by nature hellishly complicated and difficult to get right. If I reduce scope, it just maintains the status quo we have now, which is everyone doing everything in TZ offsets (that's ISO 8601's fault), and doing it very, very wrong the moment they step out of the tiny confines of event timestamps.
I know this sounds like an "if everyone would just ..." kind of excuse, but poor time implementations really are a big problem in the industry, and I'm hoping to at least provide the tools for knowledgeable people to do it right.
If I understand your example correctly, it is about using time zones as a proxy for "regions where a TV show might be broadcast", which is clearly inadequate but because time zones are coarse and unrelated to the actual data of interest, not because broadcasting times are represented with timezones. (There might or might not be a second layer of error: 13:00 central time is the same time as 14:00 eastern time, 1 time zone to the east, and pretending they are different is a horrible abuse)
Things I like:

- Versioning.
- Time zone identifier instead of just a fixed offset (which is ambiguous for future events).
- Native encoding of binary values.
- Graph notation with support for labels.
- Comments!
- Trying to escape lookalike characters, even though I think that's a lost cause.
Things I'm not so keen about:
- NUL character in strings being platform and settings dependent.
- Line break not being forced to a consistent value.
- The most complicated number encoding scheme I've ever seen (e.g. 0xa,3fb8p+42).
- Entity references are a footgun for anyone writing a depth-first or breadth-first algorithm.
- Arrays-vs-list feels like it doesn't belong in encoding formats.