
Show HN: Dit – A new kind of container file for standardizing data - IsaiahShiner
https://github.com/isaiahshiner/dit-cli
======
IsaiahShiner
Dit is ready!

About 4 months ago, we posted just a website to HN:
[https://news.ycombinator.com/item?id=21263839](https://news.ycombinator.com/item?id=21263839)

Now we're launching the MVP and hope you'll join us in trying to make data
much better!

For a primer on what this is all really about, I wrote quite a bit on this
exact topic:
[https://github.com/isaiahshiner/dit-cli/blob/master/docs/wha...](https://github.com/isaiahshiner/dit-cli/blob/master/docs/whats-the-point.md)

~~~
ssivark
Interesting! You might be interested in an old HN discussion between Alan Kay
and Rich Hickey (JGI). It sounds to me like dit is basically an attempt at
solving the problem that Alan Kay was getting at when he conceptualized
“object oriented programming” as affecting behavior (even extracting
information) from an object by passing messages to it — which would be a
critical bottleneck when trying to build an “inter-galactic network” in the
spirit of Licklider’s vision.

~~~
IsaiahShiner
Ah, this?
[https://news.ycombinator.com/item?id=11945722](https://news.ycombinator.com/item?id=11945722)

Super interesting! It's like, data, without an interpreter, has no meaning. We
rely super heavily on humans as general purpose interpreters, and when we
design computers, we make sure as much of the meaning as possible is implicit.

Rich was saying that the interpreter is secondary, which is only true if data
serves a single purpose. Alan is getting at the idea that data could serve
every purpose if only the interpreter weren't implicit: if the concept of
"data" referred to both the value and the meaning, the language and how to
interpret it.

But, from a reddit comment: _An interpreter doesn't magically give meaning to
data, since the interpreter has to translate to something, and that something
has to be written by a program that understands both the from and to sides of
the conversation._

The point of dit is to literally be that universal language. Yes, you do need
both sides to explicitly lay out what they mean, both to convey and understand
the meaning. The difference is that the number of connections is unlimited.
You need a human for that critical first step, but then the machine can speak
"dit" language to any other machine that speaks dit. Rich couldn't conceive of
the idea that dit could be real. I think it can. I think we can make
Licklider's vision real: data and programs that can move around without
respect to the original developers' intentions because they all speak dit.

Thank you for sharing, super interesting!

------
jmiskovic
Intriguing idea. I just breezed through the README, so sorry if some of these
are already answered.

You say JavaScript and Python as if each were a single identity. Which version of
language? Which interpreter? How do you make sure that both sides of
communication run the same logic?

The data source wants to communicate data to a receiver. Is the code transmitted
together with data? What is benefit of running this code on receiving side
instead of sending side?

Apart from validation, the dit file can include custom interpretation of data.
Do you serialize and store validation and query code on disk? Is this good
idea? As you build more features in the system, you end up with old data that
cannot be queried as well as new data, even though the raw data is the same.
For example dates - you might end up with a system where old dates cannot be
converted to various date formats while newer can.

~~~
IsaiahShiner
> Which version of language? Which interpreter?

Ah, very good question. Currently, it's just whatever version is in
/usr/bin/$LANGUAGE. You can see in /home/$USER/.dit-languages how the language
configurations are defined. You can add another configuration just by copying
the Python configuration, changing the name and path to Python2.7, and then
doing this:
    
    
        validator Python2.7 {{ //Older python code }}
    

In the long run, this should be made a lot simpler I think, which leads me to
your next question:

> How do you make sure that both sides of communication run the same logic?

This will probably require some way to send language configurations along with
a dit. If you send someone a dit which requires Java code, but the client
doesn't even have Java, the dit should be able to recover from this somehow,
by automatically installing Java, creating the configuration, etc. I'm not
exactly sure how this should work, but it's one of those clearly solvable
problems that I'm trying to leave be until I have a good example of the real
problem.
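
A minimal Python sketch of the first half of that recovery step, assuming a hypothetical list of required interpreter names; this is not dit's actual behavior. It only uses the standard `shutil.which` to find out which interpreters are missing, so a client could report or install them before running any validators.

```python
import shutil

def missing_interpreters(required):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]

# Hypothetical requirement list for a dit that embeds Python and Java code.
required = ["python3", "java"]
missing = missing_interpreters(required)
if missing:
    print("Cannot validate yet, missing:", ", ".join(missing))
else:
    print("All required interpreters are available.")
```

What happens after detection (installing, writing a configuration entry) is left open here, as it is in the comment above.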

> Is the code transmitted together with data? What is benefit of running this
> code on receiving side instead of sending side?

Yes, the code is always tied directly to the data. Currently, source code must
be sent, but in the future, it could pull from a link, or even run the
validation offsite, via an API (important in case you want closed source
classes).

In theory, there is no reason for the receiver to run the validation
(assuming the receiver trusts the sender to validate). But one purpose of dit
is to prevent a message from ever being separated from its meaning. If a
client receives some data in a dit format they have never seen, they can learn
how it works by looking at the dit. The validation code can even be used as
ad-hoc unit tests when trying to process the data automatically. And also
remember that querying (and converting, mutating, anything else that might be
added) also requires the dit code.

> Do you serialize and store validation and query code on disk? Is this good
> idea?

Currently, the modified code is regenerated every single time. You can see the
code in /tmp/dit/. This is slower, but it does mean that the code and
serialized data are always as new as possible. Caching could be an option, if
done carefully.
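
One way caching could be "done carefully" is to key the cache on a hash of the source that would otherwise be regenerated, so a changed dit never reuses stale output. The Python sketch below is an assumption, not how dit works: `generate` stands in for whatever builds the code, and the cache directory is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_generate(source, generate, cache_dir):
    """Return generate(source), reusing a result cached under the content hash."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    result = generate(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(result, encoding="utf-8")
    return result

# Stand-in generator that records how often it is actually invoked.
calls = []
def fake_generate(source):
    calls.append(source)
    return source.upper()

cache_dir = Path(tempfile.mkdtemp())
first = cached_generate("validator code", fake_generate, cache_dir)
second = cached_generate("validator code", fake_generate, cache_dir)    # cache hit
third = cached_generate("validator code v2", fake_generate, cache_dir)  # new hash, regenerated
```

Because the key is derived from the content itself, editing the embedded code automatically invalidates the old entry.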

> ...old data that cannot be queried as well as new data. ...old dates cannot
> be converted to various date formats while newer can.

I'm actually not sure what you mean here, I can't picture how this could
happen. Could you explain?

~~~
jmiskovic
Let's say side A is a logger, with each log entry being timestamped. We want
to be helpful so we provide a way to convert data into an MM-DD-YYYY string.
These logs are transmitted to side B and stored to disk.

A few months later we decide to add another converter, one that gives out
YYYY-MM-DD format. New logs can interpret themselves better than old ones, and
side B needs to be careful what it can ask of a dit file. If there are multiple
log sources, then depending on the source we have a more or less powerful
interpretation of the _same_ data type, until they are all updated.

To me the code should be stored and managed separately from data. Side B
should be responsible for parsing raw data from A. The interpretation
shouldn't be bundled with data because the same data can be interpreted better
in the future. The cleanest way I've seen it done is Clojure's spec library
for describing and validating data.
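
The separation can be sketched in Python, with hypothetical names throughout. Here the log stores only a raw Unix timestamp and side B owns the formatters, so a format added later applies to old and new entries alike.

```python
from datetime import datetime, timezone

# Raw log entries as side A might send them: data only, no bundled converters.
old_entry = {"message": "started", "timestamp": 1577836800}  # 2020-01-01 00:00:00 UTC
new_entry = {"message": "stopped", "timestamp": 1580515200}  # 2020-02-01 00:00:00 UTC

# Formatters live on side B and can grow without touching stored logs.
FORMATS = {
    "MM-DD-YYYY": "%m-%d-%Y",
    "YYYY-MM-DD": "%Y-%m-%d",  # the converter added "a few months later"
}

def format_timestamp(entry, style):
    """Render an entry's raw timestamp in the requested style (UTC)."""
    moment = datetime.fromtimestamp(entry["timestamp"], tz=timezone.utc)
    return moment.strftime(FORMATS[style])

print(format_timestamp(old_entry, "YYYY-MM-DD"))  # the old entry gets the new format too
```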

------
anentropic
The project makes a bunch of big vague claims, like:

> Someday, there will be no file extensions: every file will be a dit.

reading:
[https://github.com/isaiahshiner/dit-cli/blob/master/docs/wha...](https://github.com/isaiahshiner/dit-cli/blob/master/docs/whats-the-point.md)
...seems to imply you want to solve the problem where:

    
    
      "Name": "John Doe"
      
      vs.
      
      "Name": {
          "FirstName": "John",
          "LastName": "Doe"
      }
    

represents the same data?

But for me it's completely unclear if or how Dit solves this.

[https://www.ditabase.io/faq.html](https://www.ditabase.io/faq.html) says:

> Dit is not a format, it is a container. It's trying to be a bridge between
> other formats, and that means it relies on them heavily. You can't write
> data in dit format, only wrap data in a dit.

So it sounds like a schema language, an alternative to JSON-Schema, Avro etc?

But then it also bundles in a bunch of other things...

Apparently contradicting the above, there are 'assigners' and a sort of query
language which seem to create and return records in this format which you
"can't write data in":

    
    
      // Assign an existing object
      myName.givenName = name;
    
      // Use the 'n' assigner inside a list
      myName.middleNames = [n('Leopold'), n('Koser')];
    
      // Assign the string directly
      myName.familyName.value = 'Shiner';
    
      dit query name.dit '@@print(myName)'
      "Hi! My name is Isaiah Leopold Koser Shiner"
    

And then there are the validators and printers, as chunks of embedded python
or javascript code.

But it's still hard to see how any of this solves the "John Doe" problem?

If dit is meant to be a bridge between silos of incompatible data maybe it
would be useful to show some examples of that use case.

~~~
IsaiahShiner
Hmm, I can see how I'm being confusing. This is partially trying not to give
information overload with poorly written explanations, especially for features
which I haven't added yet. It's also because dit is really trying to solve
everything with Perfect Data all at once, and I need to go back and be clear
about when I'm talking about which concepts. To solve Perfect Data, we need a
single system that can do:

\- _Validation:_ This is for when you ask for a name and get "4jZw3ef\n". This
already exists in database constraints, APIs, and, as you said, in things like
JSON-Schema and Avro. The key difference is that the 'schema languages' are
not languages. They don't have full flow control, libraries, classes, not even
simple string functions. Dit does, by proxy of allowing for embedded
languages, which let you do whatever you want.

\- _Schemas:_ Obviously other schema languages already do this. This is
covered by dit classes. To solve the "FirstName", "LastName" issue, you just
make sure you're using a Name schema like the one I demonstrated, in which
Name contains a FamilyName/GivenName. But then everyone is using different
classes for the same data, which you can solve by...

\- _Global Online Schemas:_ The only one really doing this is Schema.org.
Everyone can agree that [https://schema.org/Person](https://schema.org/Person)
is a reasonable official schema for a person. However, they have no
validation, and changes are decided by a consortium rather than made freely.
DitaBase.io would have a sort of copy of GitHub, in which each class is its
own page, and doesn't explicitly belong to anyone. Using meritocracy,
democracy, and compromise, the community would agree on which exact schema of
Person is best. If someone disagrees, they can make their own style and try to
explain why it's better, while offering easy conversion between their style
and the old style.

\- _Format Support/Universality:_ This is what I was getting at when I said
"you can't write data in dit". Dit cannot afford to be locked into one
technology platform, the way JSON-Schema is locked to JSON, and the Semantic
Web is locked to HTML. How do I write down data when you ask for it outside of
a dit? In .json, .csv, .xml? Why not .yaml, .toml, .xlsx or .gsheet? There's
no way to serialize raw data as dit, only take data written in a .dit and print
it some other way. The print functions and query system should allow you to
request data in any format you want, including more than one. And ultimately,
dit could never provide all of the features people could want.
[https://ipld.io/](https://ipld.io/) is basically doing a JSON-Schema style of
thing, specifically for the block chain. Dit isn't going to replicate their
custom block chain features, so instead dit should _support_ IPLD, by allowing
data to be read and printed in their format, and converted every which other
way you can think of. I mentioned DitaBase having its own object model, but
the truth is this would almost certainly reference Schema.org, at least by
extension.
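
To make the list above concrete, here is a rough Python sketch rather than dit syntax. The `Name` class, its validation rule, and the two printers are all hypothetical; the point is only that one schema can accept both shapes of the "John Doe" record, validate with ordinary code, and print in more than one format.

```python
import json
from dataclasses import dataclass

@dataclass
class Name:
    given_name: str
    family_name: str

    def __post_init__(self):
        # Validation written in a full language: any string logic is allowed.
        for part in (self.given_name, self.family_name):
            if not part or not part.replace("-", "").isalpha():
                raise ValueError(f"not a plausible name part: {part!r}")

    @classmethod
    def from_record(cls, record):
        """Accept either {"Name": "John Doe"} or the nested FirstName/LastName form."""
        value = record["Name"]
        if isinstance(value, str):
            given, _, family = value.partition(" ")
            return cls(given, family)
        return cls(value["FirstName"], value["LastName"])

    def to_json(self):
        return json.dumps({"givenName": self.given_name, "familyName": self.family_name})

    def to_csv(self):
        return f"{self.given_name},{self.family_name}"

a = Name.from_record({"Name": "John Doe"})
b = Name.from_record({"Name": {"FirstName": "John", "LastName": "Doe"}})
print(a == b)      # both shapes produce the same Name
print(a.to_csv())
```

The naive split on the first space is deliberately simple; real names would need a richer rule, which is exactly the kind of thing a shared, community-maintained class could settle once.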

Perhaps the thing I was least clear on overall is that dit is still an
evolving concept. I know where dit is now, and I know where dit wants to go,
but I'm not really sure how to get there. Maybe assigners are a terrible way
to solve that specific problem. That's fine, we'll throw them out and make
something better. All I know is that Perfect Data needs to be solved, is
solvable, and dit has a much better chance of solving it than any other
existing solution.

------
matlin
This might actually come in handy for the project I'm building! Currently
working on a personal database to store data from external APIs like Gmail,
Spotify, etc. I was going to use RXJS but the schema requirements seemed like
it would get in the way of rapid integration. If there was an easy way to just
drop in a JSONSchema like how DefinitelyTyped does with Typescript it would be
super helpful. This is definitely a step in the right direction.

Link to the project for context: [https://github.com/aspen-cloud/aspen-
cli](https://github.com/aspen-cloud/aspen-cli)

