
Show HN: WSL, a clean text format for relational data - jstimpfle
http://jstimpfle.de/projects/wsl/main.html
======
fiatjaf
I like structured data that people (and software) can understand.

Better yet if we could find a way for (non-tech) people to be able to write
structured data in a clean way. My attempt, so far:
[https://github.com/fiatjaf/lsd](https://github.com/fiatjaf/lsd)

~~~
jstimpfle
That syntax is too loose for my taste. WSL's approach is different; it tries
to be as strict as possible. Preferably, there should be exactly one
representation for any given value.

To also be syntactically efficient, it has a per-domain lexical syntax. This
would not be possible without a schema.

------
robochat
Looks interesting. It reminds me a bit of the ARFF format. I can't tell whether
WSL allows comments, though, which I think would be useful. Also, insisting that
no null entries are possible will probably lead to incompatible versions of
the format when people are unable to avoid nulls.

One thing that could be nice would be a field to indicate the physical unit of
each data type.

~~~
jstimpfle
Thanks. From a cursory glance it looks like ARFF is more like CSV than WSL.
Still far away from the relational model: Only a single table, fixed set of
available data types, no distinction between domains and representation. Seems
more like a language-independent struct description language (must be like
protocol buffers, though I haven't used that either).

Re: comments: they are possible in the schema. For the relational data,
comments as second-class citizens don't really make sense IMO, since
associated data is typically stored in a rather scattered way in the database
(this is a disadvantage compared to hierarchical representations).

The much better approach is to store comments as first-class citizens, like

    
    
      % DOMAIN Comment String
      % TABLE Person PersonID PersonName Comment
      Person michael [Michael Jordan] [nicknamed "Air Jordan"]
    

or, if you want to allow multiple comments or comments are used only
sparingly, use a separate PersonComment table.

Re: NULL values: these can relatively easily be modelled by making an
auxiliary table as described on the webpage. This leads to better
normalization. The drawback is that in this way conceptually associated data
is logically separated, and thus harder to edit and housekeep (that could be
remedied by relation editors that can edit "views").
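
To make the auxiliary-table idea a bit more concrete, here is a minimal sketch in plain Python. It does not use the WSL library, and the table names, column names, and the second person are made up for illustration. The optional attribute lives in its own table, and the absence of a row stands for "no value"; a small join then rebuilds the kind of "view" a relation editor might show.

```python
# Hypothetical tables (names are illustrative, not taken from the WSL docs).
# Person(PersonID, PersonName): every column is mandatory.
person = {
    ("michael", "Michael Jordan"),
    ("jane", "Jane Doe"),
}
# PersonNickname(PersonID, Nickname): a row exists only if a nickname is known.
person_nickname = {
    ("michael", "Air Jordan"),
}

def nickname_view(person, person_nickname):
    """Left-join both tables; None in the result marks a missing auxiliary row."""
    nicknames = dict(person_nickname)
    rows = [(pid, name, nicknames.get(pid)) for pid, name in person]
    return sorted(rows, key=lambda row: row[0])

view = nickname_view(person, person_nickname)
```

Note that None only shows up in the derived view; the stored tables themselves stay free of missing values.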

Another possibility is to model missing values in each datatype separately as
needed. But that's probably a bad idea since the database wouldn't be able to
discern such sentinel values from "present" values.

> One thing that could be nice would be a field to indicate the physical unit
> of each data type.

I'm not sure what you mean by "physical unit". Maybe things like "varchar(3)"
in SQL? That would be easily feasible with domain parameterization. Something
like

    
    
      % DOMAIN CountryCode String length=3
    

Presently the WSL spec demands that "domain parsers" ("String" in the above
line) always return domains of the same internal representation though.
Parameterization should only add "value constraints". Depending on
interpretation length=3 and length=4 might mean distinct internal
representations, so that might be a conflicting idea.

~~~
robochat
Thanks for your answers. By physical unit, I meant metres, seconds,
nanoseconds, kilograms. Since I'm more in the physical sciences, units are
important to me. Keeping track of physical units is a kind of provenance. I
see now, though, that WSL is more like a textual relational database than
an improved CSV.

I still stand by my NULL comment, though. I totally understand why you want
things to be as you've described (along with the emphasis on not having too
many columns per table), but the problem will be other people and how they will
inevitably use the format. Null is always problematic, though, and a source of
arguments and bugs.

Have you ever seen recutils? It's a similar concept although I have no
experience of it, at a glance I prefer WSL.

~~~
jstimpfle
I would say it's more a relational database than _only_ an improved CSV, but
you can absolutely use it as the latter. While it was meant to model whole databases, I
don't think there are any disadvantages if you only have one table. You can
even easily make the library parse lines without the table prefix (which is
unnecessary if there is only one table). However that use case is more trivial
and might not justify depending on a library.

Physical units are absolutely in scope. Thanks, in fact; I hadn't thought of
that. If we can find a reasonable default implementation and syntax, I might
add them to the built-ins. (But you can also add them yourself as a user of
the Python API.)

Depending on your taste, you could let them have a unit suffix.

    
    
      % DOMAIN kilograms Kilograms precision=3 suffix
      % DOMAIN seconds Seconds precision=3 suffix
      % DOMAIN metres Metres precision=2 suffix
      % TABLE Example kilograms seconds metres
      Example 1.0kg 5.042s 4.42m
    

Or

    
    
      % DOMAIN laptime Time infixunits
      % TABLE Laptime laptime
      Laptime 4h05m03s
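
As a rough illustration of what decoders for such unit-carrying tokens could look like, here is a small Python sketch. It is not the WSL library's API: the function names are hypothetical, the suffix is assumed to be a non-empty string, and the lap-time pattern assumes exactly the `4h05m03s` shape shown above (two-digit minutes and seconds).

```python
import re
from decimal import Decimal

def parse_suffixed(token, suffix):
    """Decode a token such as '5.042s' into a Decimal, checking the unit suffix."""
    if not token.endswith(suffix):
        raise ValueError(f"expected suffix {suffix!r} in {token!r}")
    return Decimal(token[: -len(suffix)])

# Assumed lexical form: hours, then two-digit minutes and seconds.
_LAPTIME = re.compile(r"(\d+)h(\d{2})m(\d{2})s")

def parse_laptime(token):
    """Decode a token such as '4h05m03s' into a whole number of seconds."""
    match = _LAPTIME.fullmatch(token)
    if match is None:
        raise ValueError(f"not a lap time: {token!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds

weight = parse_suffixed("1.0kg", "kg")
lap_seconds = parse_laptime("4h05m03s")
```

Using Decimal rather than float keeps the written precision intact, which fits the goal of having one representation per value.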
    

> Have you ever seen recutils? It's a similar concept although I have no
> experience of it, at a glance I prefer WSL.

Yes, in fact I gave it a closer look. The scope is different; as the name says
it's more _record_ or even hierarchy oriented:

    
    
      - Records are written on paragraphs instead of lines; each member on its own line
      - Multi-valued members
      - Field names are explicit in each record.
    

It doesn't really try to be a clean interpretation of the relational model.
While it is also a query language and has some sort of joins, the
multi-valued member notation combined with joins seems to quickly lead to cases
where it's not clear how to interpret the resulting data. Also, due to the
more complex multi-line data format, it's much better suited for consumption
by humans than machines.

WSL instead opts for unnamed records on single lines to be very easily
consumable by machines as well, and is designed to enable very easy to use
reader and writer APIs.
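
To show how little machinery a one-record-per-line format needs, here is a minimal Python sketch of a row splitter. It is not the WSL library and does not follow the full spec: it assumes a token is either a bracketed string without nested brackets or escapes, or a run of non-whitespace characters, and the function name is made up.

```python
import re

# Assumed token syntax: "[...]" without nested brackets, or a bare word.
_TOKEN = re.compile(r"\[([^\[\]]*)\]|(\S+)")

def split_row(line):
    """Split one data line into (table name, list of column values)."""
    tokens = []
    for match in _TOKEN.finditer(line):
        bracketed, bare = match.groups()
        tokens.append(bracketed if bracketed is not None else bare)
    if not tokens:
        raise ValueError("empty line")
    table, *values = tokens
    return table, values

table, values = split_row('Person michael [Michael Jordan] [nicknamed "Air Jordan"]')
```

A real reader would pick the lexer for each column from the schema's domain declarations instead of guessing from the token shape, as this sketch does.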

Note that the Python library is rather slow; on my old machine, the 220KB
example database needs about 200ms to parse. With a C implementation, I reckon
there could be a 10x speedup. However, if you care very much about speed and do
not always read in the data completely, sqlite3 or one of the big iron
databases is a better fit anyway. WSL trades speed for semantics and plain
text representation.

