The shape of data (scattered-thoughts.net)
73 points by luu on March 30, 2022 | 5 comments



Clojure really is probably one of the best languages for application programming out there, and pairs beautifully with Java when you need to go some levels lower.


This wishlist sounds like rose.ai

"Wishlist

Data model:

A small set of primitives eg writing an inspector gui eg searching for references to some id But still able to represent types and invariants Able to reify changes as data eg for undo log eg for real-time collaboration

All data has some name/path/location by which it can be referred to eg no hidden state in closures eg no hidden closures in the event loop queues Avoid depending on pointers for identity

Data notation:

A textual representation which is easy to read/write Used consistently everywhere - one standard way of picturing data Self-describing - doesn't require out-of-band type/schema

Uses layering to add capabilities while mimicking familiar notation Uses shorthands and exploits context to reduce redundant information eg clojure namespace aliases eg unison names

Code:

The notation for code is a superset of the notation for data eg can print data and copy-paste into code / repl

Can choose the mapping between tags in data notation and types in code Code can be represented as data with low mental distance

The codebase is also data - can trivially analyze whole thing including dependencies without having to execute side effects

Maybe, if possible, reify the execution of code as data

Crucially, the data model and the data notation need to be co-designed, because it's so easy to make choices in the data model that prevent creating a good data notation later."


Honestly, I don't see the issue with JSON. It is capturing user-generated content. It's not that '43' is logged as a string instead of an int - it is that '43' is the raw data in quotes. To me, that is the same spirit as using "read" instead of "eval" as mentioned elsewhere. Yes, the read-print loop fails for JSON - but JSON only has this failing when you are working with code-generated values. At the end of the day, a user typed the 4 and 3 keys on their keyboard and that was captured. To say it is an int or a str or whatever brings back the need to understand memory representations.

For example, when parsing JSON with Python, you can apply the same principles you would to Python objects. That is, assume the item is in the format you know it should be (or test it first to be safe).

So even though the JSON is {"43": ["bob", "alice"]}, you can do an int() cast if you need to do something with that data that requires it to have a type. Otherwise it is represented as it was typed.
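A minimal sketch of that approach in Python - the key stays a string exactly as the user typed it, and the cast happens only at the point where the consumer actually needs a number:

    import json

    # The key arrives as the raw characters '4' and '3', uninterpreted.
    doc = json.loads('{"43": ["bob", "alice"]}')

    for key, names in doc.items():
        # Interpret the raw text as an int only where a numeric
        # operation actually requires it.
        next_key = int(key) + 1
        print(next_key, names)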

I do agree with the article overall though!


So, JSON has non-string values in other positions (as elements of arrays, or values in an object). Wouldn't your argument also lead to the conclusion that we don't need numbers at all, since we could get by with

{ "foo": "42", "bar": ["1", "2", "3"] }

There's also the issue of values with multiple equivalent string representations. I want 42.1 to equal 42.10 and 42.100. I also want {"foo":1,"bar":2} to equal {"bar":2,"foo":1} but with just strings you don't get that:

{ "{\"foo\":1,\"bar\":2}": 1, "{\"bar\":2,\"foo\":1}": 1, "42.1": 2, "42.10": 2, "42.100": 2 }

should have 2 keys but has 5.
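You can see this concretely in Python. The canonical() helper below is hypothetical - just one way of normalizing each key by re-parsing it as JSON and re-serializing it deterministically:

    import json

    raw = ('{"{\\"foo\\":1,\\"bar\\":2}": 1, '
           '"{\\"bar\\":2,\\"foo\\":1}": 1, '
           '"42.1": 2, "42.10": 2, "42.100": 2}')

    obj = json.loads(raw)
    print(len(obj))  # 5 -- every distinct spelling is a distinct string key

    def canonical(key):
        # Hypothetical normalization: parse the key as JSON, then re-serialize
        # with sorted object keys; floats collapse "42.10" -> "42.1".
        return json.dumps(json.loads(key), sort_keys=True, separators=(",", ":"))

    print(len({canonical(k) for k in obj}))  # 2 -- equivalent values collapse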


> { "foo": "42", "bar": ["1", "2", "3"] }

Good point, we could also expect {"foo": "42", "bar": "[1,2,3]"}. JSON does assume that the values have types (like a list) and that is inconsistent.

As for equivalent representations, I do not think what you want is universally applicable. 42.1 does not equal 42.10 until you apply the rule that what you are working with are numbers.

In text related to government regulations, for example, 42.10 could come nine items after 42.1, and you might expect to see 42.1(a) and 42.10(a) as other items in the same set or in related value sets.
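A small illustration of that ambiguity (the section labels here are made up for the example):

    # Hypothetical regulation section labels, where "42.10" follows "42.9"
    sections = ["42.1", "42.2", "42.9", "42.10"]

    print("42.1" == "42.10")                # False -- the right answer for labels
    print(float("42.1") == float("42.10"))  # True  -- the right answer for measurements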

Any way you cut it, the real problem seems to be that when data entry happens, a certain amount of context is assumed, and those assumptions vary enough that they need to be handled differently when the data is consumed. Which makes sense.



