
Comma-Separated Tree - mbostock
https://observablehq.com/@mbostock/comma-separated-tree
======
andrewla
I've often found it's better to be more verbose in situations like this, but
providing the "tree" entries as rows.

    
    
      name,value,color
      World
       Asia
        China,1409517397
        India,1339180127
        Indonesia,263991379
    
    

becomes

    
    
      region,subregion,country,value,color
      World,Asia,China,1409517397
      World,Asia,India,1339180127
      World,Asia,Indonesia,263991379
    

This still compresses very nicely, and allows data to be output in any order.
The tree form is more of a presentation thing, which should be the last step
of a data processing pipeline. As it is, the tree nodes themselves have no
names or context, so it makes it harder to consume. Is the second-level entry
a region? Easy to tell from this example, but if you're consuming a large data
set you're forced to first give names to these columns.

~~~
mbostock
Your suggestion requires a fixed-depth hierarchy (such as three levels in your
example), and then it requires a lot of redundancy. Here, I want to eliminate
the redundancy not primarily for performance reasons, but to make it easier to
edit the data by hand, for example to cut-and-paste some lines to move them
around within the tree without having to change the leading values.

To get a sense of what I mean, see this editor which lets you interactively
construct a treemap:
[https://beta.observablehq.com/@mbostock/treemap-o-matic](https://beta.observablehq.com/@mbostock/treemap-o-matic)

~~~
siddboots
Something that I've used in the past: store the hierarchical data as a
breadcrumb-like string in an ordinary column. This means the parsing is just
ordinary csv parsing, followed by a relatively trivial extra step to parse the
breadcrumb columns. You also retain the ability to describe arbitrary depth
hierarchies:

    
    
        country,value
        World/Asia/China, 1409517397
        World/Asia/India,  1339180127
        World/Asia/Indonesia, 263991379
    

An added benefit of this approach is that you can have hierarchies in multiple
columns. For example:

    
    
        datetime, source_account, destination_account, amount
        2016-07-01, Assets/ProjectFunding, Assets/Capital/Delivery, 134.2
        2017-07-01, Assets/ProjectFunding, Assets/Capital/Delivery, 72.2
        2016-07-01, Assets/ProjectFunding, Assets/Capital/Other, 5.0
        2016-07-01, Assets/ProjectFunding, Expenses/Development, 4.7
        2017-07-01, Assets/ProjectFunding, Expenses/Development, 1.6
        2017-07-01, Assets/Cash, Expenses/OPEX, 0.96
        2018-07-01, Assets/Cash, Expenses/OPEX, 1.62
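
For illustration, the "relatively trivial extra step" might look something like the Python sketch below. It assumes the first sample above, and `build_tree` with its parameters is a hypothetical helper invented for this example, not part of any library.

```python
import csv
import io


def build_tree(rows, path_column, value_column, sep="/"):
    """Nest rows into dicts keyed by breadcrumb segment; leaves hold the value."""
    root = {}
    for row in rows:
        parts = row[path_column].strip().split(sep)
        node = root
        # Walk (or create) every intermediate level of the breadcrumb.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = int(row[value_column])
    return root


data = """country,value
World/Asia/China, 1409517397
World/Asia/India,  1339180127
World/Asia/Indonesia, 263991379
"""

# skipinitialspace drops the padding after each comma in the sample data.
reader = csv.DictReader(io.StringIO(data), skipinitialspace=True)
tree = build_tree(reader, "country", "value")
print(tree)
```

The CSV layer stays completely ordinary; only the one column needs the extra split, and the depth of each breadcrumb can vary from row to row.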

~~~
kqr
That has the meta-problem of only allowing two different levels of
hierarchies. For one more, you'd have to introduce a new separator.

I feel like the fixpoint of this function is going back to only one separator,
or something, but I am not seeing the reasoning clearly.

------
kyberias
A well-known format in phylogenetic tree software circles is the Newick
format:

[https://en.wikipedia.org/wiki/Newick_format](https://en.wikipedia.org/wiki/Newick_format)

------
breck
I would call "Comma-Separated Trees" a subtype of Tree Notation
([https://github.com/breck7/jtree](https://github.com/breck7/jtree)). The
first time this style of notation appeared was in Egil Möller's SRFI from 2005
([https://srfi.schemers.org/srfi-49/srfi-49.html](https://srfi.schemers.org/srfi-49/srfi-49.html)).

I really enjoy this style of format. It makes reading and writing structured
data a breeze, and can handle any format with no escaping save indentation.
You can embed CSVs, TSVs, PSVs, et cetera, right in the tree, like in these
Comma-Separated Trees. You can write a grammar file to ensure strong type
checking. Finally, you can easily do conversions "fromCsv/toCsv,
fromJson/toJson, fromXml/toXml, fromSql/toSql" et cetera...

Here's another example showing an embedded PSV (and also same type later
expressed as a tree) and some source code embedded showing how you don't need
escaping.

    
    
      cobol
       books
        title|year|author|id|rating|ratings|reviews
        Cobol Programming|1983|M.K. Roy|4944251|4.11|9|1
        Structured Cobol Programming|1979|Nancy B. Stern|9030220|4.33|15|0
       fileType text
       year 1960
      lua
       fileType text
       books
        book
         title Programming in Lua
         year 2001
         author Roberto Ierusalimschy
         id 1321894
         rating 3.97
       example
        function factorial(n)
         local x = 1
         for i = 2, n do
           x = x * i
         end
         return x
        end

------
dreamcompiler
No disrespect to mbostock, but I'd have used S-expressions, which can still be
indented for readability without being brittle to indentation mistakes.

~~~
mbostock
Use whatever you like. There’s not one data format that will be perfect in all
situations. While S-expressions are more explicit, they also make it more
likely that you’ll get a syntax error because of mismatched parentheses. The
purpose of this format is to make syntax errors almost impossible (at the
expense of possible ambiguity), to favor interactive editing as in the
Treemap-o-Matic example I’ve linked.

~~~
kgwxd
Significant whitespace is syntax. Errors will happen, a lot, and the parser
will have no chance to catch it.

~~~
mbostock
The parser has well-defined behavior for malformed whitespace, and it’s not to
throw a syntax error: if you advance the indentation by more than one space on
a following line, the extra spaces are ignored.

~~~
munk-a
Wait so...

    
    
        a
        ..b
        ..c
    
    

Would be interpreted as (a (b (c)))

~~~
mbostock
That was the initial behavior, but I’ve changed it to ignore missing
intermediate parents. Your example would be (a (b c)). See the “Handling
ambiguity” section I added to the notebook. The nice property of this design
is that you can use whatever indentation style you prefer (tabs, 2x spaces, 1x
space, etc.) and it will do what you expect as long as you are consistent.
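
A rough Python sketch of that kind of lenient parsing, for anyone who wants to play with the idea. This is not the notebook's implementation; `parse_tree` is a hypothetical name, and it assumes depth is judged purely by the count of leading whitespace characters relative to the lines still open above.

```python
def parse_tree(text):
    """Parse indentation-nested lines into (label, children) tuples."""
    root = ("", [])
    stack = [(-1, root)]  # (indent width, node) pairs, innermost last
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label = line.lstrip()
        indent = len(line) - len(label)
        # Close every open node that is not shallower than this line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = (label.rstrip(), [])
        stack[-1][1][1].append(node)  # attach to the nearest shallower node
        stack.append((indent, node))
    return root[1]


# Two-space indentation with no one-space level in between:
print(parse_tree("a\n  b\n  c"))
```

Because only relative indentation matters, tabs, one space, or two spaces all give the same tree as long as the file is consistent, and an over-indented line simply becomes a child of the line above it. Mixing tabs and spaces in one file is where this sketch would start guessing.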

~~~
munk-a
Oh neato.

I still think there's something to be said for the philosophy of developing
tools that expect a strict format and fail if that format is violated. I've
worked in a variety of settings and with a variety of tools, and as time goes
by I am finding that tools that are strict about their expectations provide a
more maintainable product over the long haul. If I were writing a project using
this style I'd prefer to receive a parse failure in the case of ambiguity,
rather than carrying on and hoping it is correct.

If data correction needs to happen (i.e. one file that should be single-space
indented is triple space indented) I'd prefer to explicitly pre-process the
data rather than have a tool that can handle it gracefully.
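
To make the strict alternative concrete, here is a minimal Python sketch. It assumes exactly one space per level, and `parse_tree_strict` is a hypothetical name made up for this example. Any line that skips a level, or that uses a tab to indent, is rejected with the line number.

```python
def parse_tree_strict(text):
    """Parse one-space-per-level lines into (label, children) tuples, or raise."""
    root = ("", [])
    open_nodes = [root]  # open_nodes[d] is the parent for a line at depth d
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        label = line.lstrip(" ")
        depth = len(line) - len(label)
        if label[0] == "\t":
            raise ValueError(f"line {number}: tab in indentation")
        if depth > len(open_nodes) - 1:
            raise ValueError(f"line {number}: indentation skips a level")
        node = (label.rstrip(), [])
        open_nodes[depth][1].append(node)
        del open_nodes[depth + 1:]  # deeper levels are closed by this line
        open_nodes.append(node)
    return root[1]


print(parse_tree_strict("a\n b\n c"))
```

A triple-space-indented file fails on its first indented line, which is the cue to run an explicit normalization step before parsing.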

------
RcouF1uZ4gsC
I tend to be wary of semantic white space. It is so easy for some tool or
function to screw up your white space and you have a huge change in meaning
that may or may not be visually apparent. I tend to prefer having characters
like braces or parentheses (s expressions) denote the tree level and white
space purely as non-semantic pretty printing.

~~~
pavon
I can see both sides of the semantic whitespace arguments, but CSV
intentionally uses commas as delimiter to avoid the problems caused by using
white-space as a delimiter. I'd rather use something that required delimiters
(json, s-expr), or one that didn't (YAML, indented TSV) over an odd mix of the
two.

~~~
daveFNbuck
CSV uses commas to avoid the problems of using whitespace for column
delimiters, but it still uses whitespace for row delimiters. It's already an
odd mix of the two.

------
wglb
So if you look carefully at the ASCII code, there are some interesting
characters down in the low area. Like RS, GS, FS. If we had used those in CSV,
there wouldn't be all this nonsense about having to add quotes or not.

Similarly, with leading spaces to delimit fields--what could possibly go
wrong?

It's ASCII, folks.
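
A small Python sketch of the separator idea, using US (0x1F) between fields and RS (0x1E) between records. The `dumps` and `loads` names are hypothetical, and the sketch assumes no field ever contains either control character, which is the one thing this scheme cannot express without escaping.

```python
US = "\x1f"  # ASCII unit separator: between fields
RS = "\x1e"  # ASCII record separator: between records


def dumps(rows):
    """Join fields with US and records with RS; no quoting is needed."""
    return RS.join(US.join(fields) for fields in rows)


def loads(text):
    """Split on RS, then US; the inverse of dumps for separator-free fields."""
    return [record.split(US) for record in text.split(RS)] if text else []


# Commas, quotes, and newlines inside fields need no special treatment.
rows = [["name", "note"], ["Smith, John", 'said "hi"\nthen left']]
encoded = dumps(rows)
decoded = loads(encoded)
print(decoded == rows)
```

The catch is visible as soon as you open `encoded` in a text editor: the separators have no standard printed form.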

~~~
ISO-morphism
Totally unsubstantiated, but my opinion on why we use commas instead of the
built-in record/field separator characters is simple: the special ASCII
characters don't have a canonical printed representation and they don't have
keys on a keyboard to make them easy to type. Therefore they're neither human
readable nor human writable, and the human interface is really what matters
with a plain text serialization format.

Edit: spelling

~~~
zapzupnz
Indeed. Otherwise we could just use a binary representation.

------
joelthelion
Not sure what this brings over JSON.

~~~
zimpenfish
From a comment by the author,

> to make it easier to edit the data by hand, for example to cut-and-paste
> some lines to move them around within the tree without having to change the
> leading values

I wouldn't even consider trying something like that with JSON.

------
chmod775
How do you escape spaces in fields that start with spaces?

I think this should be specced out early to avoid falling into the trap good
old plain CSV fell into, which now has multiple ways programs escape
commas.

------
lolive
The mantra of the Real World (aka non-technical persons): CSV is an acceptable
serialization format for whatever data structure you deal with.

(and the corollary: any data can be authored or visualized in Excel).

------
buchanae
Love the work you're doing on observablehq mbostock! I am really impressed
with the volume and variety of content you create. Especially love the
generative art stuff!

------
harrisreynolds
I think that ObservableHQ is a really cool idea. BUT. Every time I get to the
end of a notebook I am left scratching my head with how it ends.

This example is no exception. Here is how my mind interprets it...

Makes sense.

Makes sense.

Makes sense.

Makes sense.

WTF?

~~~
mbostock
Heh, thanks for the feedback! What you’re seeing at the end here is the
implementation of the parser itself.

You might not be interested in how the parser is implemented; more likely you
only care about the design and usage of the proposed data format. Which is
fine! But we designed Observable to share one interface for both authors and
readers under the view-source philosophy that made the early web so great: all
the source is there, accessible, if you do want to dive in and understand it.

But we could do a better job of making the segue from narrative to internal
implementation less jarring. I’ve edited this post to make that more explicit
with an “appendix” header for the implementation. And we’ve been thinking
about ways to formalize this convention, so that the code is still accessible
with a click or two if you want it, but doesn’t distract from a normal read.

------
faaaaargh
Why not use multiple commas, instead of spaces? The first level branches would
require 2 commas, instead of one space, else it's less error prone.

------
pimlottc
I love that neither the data format nor the _article itself_ ever actually
identifies what the numbers represent.

(It appears to be population)

------
fbn79
Have you heard of YAML?

~~~
pgt
Have you heard of S-expressions? Whitespace optional:

    
    
      (World
        (Asia
          (China 1409517397)
          (India 1339180127)
          (Indonesia 263991379)))
    

With an editor that supports Paredit or Parinfer, it's impossible to create an
invalid tree, because you manipulate the structure of the tree (nodes and
leaves) instead of manipulating error-prone text.

~~~
mbostock
Parinfer is very cool! If you’re willing to use a custom editor for trees, you
can clearly offer better usability than a standard text editor. But having a
usable data format for standard text editors is also nice.

