Comma-Separated Tree (observablehq.com)
153 points by mbostock on Dec 19, 2018 | hide | past | favorite | 61 comments

I've often found it's better to be more verbose in situations like this, by providing the "tree" entries as rows.



This still compresses very nicely, and allows data to be output in any order. The tree form is more of a presentation thing, which should be the last step of a data processing pipeline. As it is, the tree nodes themselves have no names or context, so it makes it harder to consume. Is the second-level entry a region? Easy to tell from this example, but if you're consuming a large data set you're forced to first give names to these columns.

Your suggestion requires a fixed-depth hierarchy (such as three levels in your example), and then it requires a lot of redundancy. Here, I want to eliminate the redundancy not primarily for performance reasons, but to make it easier to edit the data by hand, for example to cut-and-paste some lines to move them around within the tree without having to change the leading values.

To get a sense of what I mean, see this editor which lets you interactively construct a treemap: https://beta.observablehq.com/@mbostock/treemap-o-matic

Something that I've used in the past: store the hierarchical data as a breadcrumb-like string in an ordinary column. This means the parsing is just ordinary csv parsing, followed by a relatively trivial extra step to parse the breadcrumb columns. You also retain the ability to describe arbitrary depth hierarchies:

    World/Asia/China, 1409517397
    World/Asia/India, 1339180127
    World/Asia/Indonesia, 263991379
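The "trivial extra step" amounts to splitting on the separator and walking down a nested dict. A minimal Python sketch (illustrative only; the row data above is assumed already parsed out of the CSV):

```python
def insert(tree, path, value):
    """Insert one breadcrumb row like ('World/Asia/China', 1409517397)
    into a nested dict: internal nodes are dicts, leaves hold values."""
    *parents, leaf = path.split("/")
    for part in parents:
        tree = tree.setdefault(part, {})  # create intermediate nodes on demand
    tree[leaf] = value

rows = [
    ("World/Asia/China", 1409517397),
    ("World/Asia/India", 1339180127),
    ("World/Asia/Indonesia", 263991379),
]
tree = {}
for path, value in rows:
    insert(tree, path, value)
# tree is now {"World": {"Asia": {"China": ..., "India": ..., "Indonesia": ...}}}
```

Because depth comes from the path string rather than the file layout, rows can arrive in any order and at any depth.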
An added benefit of this approach is that you can have hierarchies in multiple columns. For example:

    datetime, source_account, destination_account, amount
    2016-07-01, Assets/ProjectFunding, Assets/Capital/Delivery, 134.2
    2017-07-01, Assets/ProjectFunding, Assets/Capital/Delivery, 72.2
    2016-07-01, Assets/ProjectFunding, Assets/Capital/Other, 5.0
    2016-07-01, Assets/ProjectFunding, Expenses/Development, 4.7
    2017-07-01, Assets/ProjectFunding, Expenses/Development, 1.6
    2017-07-01, Assets/Cash, Expenses/OPEX, 0.96
    2018-07-01, Assets/Cash, Expenses/OPEX, 1.62

That has the meta-problem of only allowing two different levels of hierarchies. For one more, you'd have to introduce a new separator.

I feel like the fixpoint of this function is going back to only one separator, or something, but I am not seeing the reasoning clearly.

I'm not arguing against the CST format, but I don't think redundancy is necessarily bad or hard to work with.

It can be good because each line is now an independent idea and doesn't require the indentation context; the items can be sorted or moved easily.

As for hard to work with... a lot of editors have multi-cursor support these days, which makes editing these things in bulk pretty straightforward for smaller sizes, and then there's always sed and awk. ;)

Given all that, CST seems like it might be cool for building some quick trees, as in the UIs you present.

I'd probably keep to rows for longer-term storage though.

CSV is just a serialisation format like JSON or XML. CST appears to put the emphasis on readability (not a bad thing!).

I've taken to using XSV to pretty-print CSV data in bug reports or feature requests for coworkers who give me CSV or spreadsheets to work with. The data doesn't need to go any further, it's already been processed, it just needs to be clearer to understand than raw CSV/JSON/XML:


(link has examples of pretty-printed CSV)

I like dealing with data that's line-denormalized like this since it's a lot easier to manipulate with simple things like grep and awk.

If you do too, and you find yourself dealing with JSON a lot, take a look at gron [0], it takes JSON and denormalizes all the structure into a line-oriented equivalent.

    % echo '{"root": {"left": [1, 1.4, 2, 2.88], "right": ["a", "b"]}}' | jq .
    {
      "root": {
        "left": [
          1,
          1.4,
          2,
          2.88
        ],
        "right": [
          "a",
          "b"
        ]
      }
    }
    % echo '{"root": {"left": [1, 1.4, 2, 2.88], "right": ["a", "b"]}}' | gron
    json = {};
    json.root = {};
    json.root.left = [];
    json.root.left[0] = 1;
    json.root.left[1] = 1.4;
    json.root.left[2] = 2;
    json.root.left[3] = 2.88;
    json.root.right = [];
    json.root.right[0] = "a";
    json.root.right[1] = "b";
[0] https://github.com/tomnomnom/gron
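The flattening itself is a simple recursion over the JSON value. A tiny gron-style sketch in Python (illustrative only, not the real tool; among other things, real gron quotes keys that aren't valid identifiers):

```python
import json

def gron(value, path="json"):
    """Emit one assignment line per JSON node so the output greps cleanly."""
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path} = {{}};")
        for k, v in value.items():
            lines.extend(gron(v, f"{path}.{k}"))  # naive: assumes dot-safe keys
    elif isinstance(value, list):
        lines.append(f"{path} = [];")
        for i, v in enumerate(value):
            lines.extend(gron(v, f"{path}[{i}]"))
    else:
        lines.append(f"{path} = {json.dumps(value)};")
    return lines

doc = json.loads('{"root": {"left": [1, 1.4, 2, 2.88], "right": ["a", "b"]}}')
print("\n".join(gron(doc)))
```

Each line carries its full path, which is what makes grep/awk pipelines work without tracking nesting state.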

AHHA! My brother in arms! I also have a gron-like ML called 'KVIN'[0]. It'd be like this:

    root.left[0] = 1
        .left[1] = 1.4
        .left[2] = 2
    root.left[3] = 2.88
       ..right[0] = "a"
        .right[1] = "b"
[0] https://github.com/jaroslov/kvin

Eurgh, you've just put another blemish on this world's karma with that sad affair of manipulating structured data as an indiscriminate mess of letters. If keys in your input contain dots, what will grep and awk do to help you out?

Meanwhile, the jq tool could do your greppings and awkings while preserving and perusing the structure of the document, notwithstanding its idiosyncrasies.

(A possible problem with JQ, though, is that it may be difficult to query a structure which varies, i.e. use entries on a nesting level that isn't known in advance.)

> If keys in your input contain dots, what will grep and awk do to help you out?

Without knowing gron, I presume the dots are escaped/quoted properly, and you would do the same in your grep/awk usage.

What if your keys contain equal signs and curly braces; how will JSON help you out?

> Meanwhile, the jq tool could do your greppings and awkings while preserving and perusing the structure of the document, notwithstanding its idiosyncrasies.

The structure of the data is preserved (I don't know what you mean with "notwithstanding its idiosyncrasies"). This is just a different, albeit redundant, but entirely equivalent representation of the same document. Despite their redundancy, denormalized representations do have advantages.

> I presume the dots are escaped/quoted properly, and you would do the same in your grep/awk usage.

Oh yeah, that's certainly much easier than having a tool's parser deal with the syntax, and just working with individual keys and values instead. So much so that in fifteen years of programming I haven't seen anyone remember to include escape-sequences in their regexps. But I guess for true lovers of regexps it's just a joy to keep escaping the escape sequences that they put in, and then escaping the result once over in the coding language of choice.

> What if your keys contain equal signs and curly braces, how will json help you out?

Care to clarify why that would be a problem?

So far I've only used this format for tiny datasets smaller than 10k root entries (around 1M nodes). And even then, before I use Python/R/etc to analyze things, I transform it into a normalized table form exactly like you've shown above.

I think the tree form is great for the smaller datasets that need to be edited or reviewed manually, not bulk machine generated data. For one project I have thousands of files, each with data stored in a tree, all stored in a git repo with many editors, sort of like a data wiki.

It also works for the "presentation thing", as you say. You can use a table-backed storage system but present to the user a tree, and when they edit a node you propagate the change to the appropriate row.

Finally, some editing plugins can give you good highlighting and more context if you create a grammar file.

Source: have been tinkering with this stuff for a couple years.

A well-known format in phylogenetic tree software circles is the Newick format:


I would call "Comma-Separated Trees" a subtype of Tree Notation (https://github.com/breck7/jtree). The first time this style of notation appeared was in Egil Möller's SRFI from 2005 (https://srfi.schemers.org/srfi-49/srfi-49.html).

I really enjoy this style of format. It makes reading and writing structured data a breeze, and can handle any format with no escaping save indentation. You can embed CSVs, TSVs, PSVs, et cetera, right in the tree, like in these Comma-Separated Trees. You can write a grammar file to ensure strong type checking. Finally, you can easily do conversions "fromCsv/toCsv, fromJson/toJson, fromXml/toXml, fromSql/toSql" et cetera...

Here's another example showing an embedded PSV (and also same type later expressed as a tree) and some source code embedded showing how you don't need escaping.

    Cobol Programming|1983|M.K. Roy|4944251|4.11|9|1
    Structured Cobol Programming|1979|Nancy B. Stern|9030220|4.33|15|0
   fileType text
   year 1960
   fileType text
     title Programming in Lua
     year 2001
     author Roberto Ierusalimschy
     id 1321894
     rating 3.97
    function factorial(n)
     local x = 1
     for i = 2, n do
       x = x * i
     end
     return x
    end

No disrespect to mbostock, but I'd have used S-expressions, which can still be indented for readability without being brittle to indentation mistakes.

Use whatever you like. There’s not one data format that will be perfect in all situations. While S-expressions are more explicit, they also make it more likely that you’ll get a syntax error because of mismatched parentheses. The purpose of this format is to make syntax errors almost impossible (at the expense of possible ambiguity), to favor interactive editing as in the Treemap-o-Matic example I’ve linked.

Significant whitespace is syntax. Errors will happen, a lot, and the parser will have no chance to catch it.

The parser has well-defined behavior for malformed whitespace, and it’s not to throw a syntax error: if you advance the indentation by more than one space on a following line, the extra spaces are ignored.

Wait so...


Would be interpreted as (a (b (c)))

That was the initial behavior, but I’ve changed it to ignore missing intermediate parents. Your example would be (a (b c)). See the “Handling ambiguity” section I added to the notebook. The nice property of this design is that you can use whatever indentation style you prefer (tabs, 2x spaces, 1x space, etc.) and it will do what you expect as long as you are consistent.
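A minimal sketch of that tolerant rule in Python (illustrative only, not the notebook's actual parser): a line indented anywhere deeper than the previous one becomes its child, however many extra spaces it uses, so tabs, 2x spaces, and 1x space all behave the same as long as they're consistent.

```python
def parse(text):
    """Parse a comma-separated tree into a forest of dict nodes."""
    root = {"name": "", "children": []}
    stack = [(-1, root)]  # (indent, node) pairs from root to current line
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        name, _, value = line.strip().partition(",")
        node = {"name": name, "children": []}
        if value:
            node["value"] = int(value)
        while indent <= stack[-1][0]:  # climb back to the nearest shallower line
            stack.pop()
        stack[-1][1]["children"].append(node)  # any deeper line becomes a child
        stack.append((indent, node))
    return root["children"]

# China/India are indented four spaces under a one-space-indented Asia;
# the extra depth is ignored, so they still parse as Asia's direct children.
forest = parse("World\n Asia\n    China,1409517397\n    India,1339180127")
```

Note there is no branch that raises: every input is some tree, which is the property that makes live editing pleasant.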

Oh neato.

I still think there's something to the philosophy of developing tools that expect a strict format and fail if that format is violated. I've worked in a variety of settings and with a variety of tools, and as time goes by I am finding that tools that are strict about expectations yield a more maintainable product over the long haul. If I were writing a project using this style I'd prefer to receive a parse failure in the case of ambiguity, rather than carrying on and hoping it is correct.

If data correction needs to happen (e.g. a file that should be single-space indented is triple-space indented) I'd prefer to explicitly pre-process the data rather than have a tool that handles it gracefully.

How is it supposed to know you put an item at the wrong level? That's not syntax, now that I think about it; just an example of how syntax can help visualize.

Agreed, if you're using a non-paren-matching text editor. I've been exclusively using paren-matching and paren-formatting editors so long I sometimes forget that non-assistive editors exist.

Just like a good tool reduces parenthetical mistakes, so goes for this.

Also, like S-expressions, even without a good tool, mistakes drop once you use it enough.

I tend to be wary of semantic white space. It is so easy for some tool or function to screw up your white space and you have a huge change in meaning that may or may not be visually apparent. I tend to prefer having characters like braces or parentheses (s expressions) denote the tree level and white space purely as non-semantic pretty printing.

I can see both sides of the semantic whitespace arguments, but CSV intentionally uses commas as delimiters to avoid the problems caused by using whitespace as a delimiter. I'd rather use something that required delimiters (JSON, s-expr), or one that didn't (YAML, indented TSV), over an odd mix of the two.

CSV uses commas to avoid the problems of using whitespace for column delimiters, but it still uses whitespace for row delimiters. It's already an odd mix of the two.

What I think was a mistake in the past was using 2 or 4 spaces (or tabs) for semantic indentation. I think it should be 1 space, like Comma-Separated Trees has above, and have the editor indicate the indentation level. In my experience this eliminates a common source of tool-caused indentation bugs.

Is this not usually how it works? Other than Makefiles, every language with significant whitespace I can think of just treats “more than the previous line” as an indent.

The optionality is the problem. As a tool writer you have to handle a 10x jump in complexity.

So if you look carefully at the ASCII code, there are some interesting characters down in the low area. Like RS, GS, FS. If we had used those in CSV, there wouldn't be all this nonsense about having to add quotes or not.

Similarly, with leading spaces to delimit fields--what could possibly go wrong?

It's ASCII, folks.
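A quick Python sketch of the idea (illustrative only): using ASCII's record separator (RS, 0x1E) and unit separator (US, 0x1F) in place of newlines and commas means commas, quotes, and even embedded newlines round-trip with no quoting rules at all, since those control bytes never appear in ordinary text.

```python
RS, US = "\x1e", "\x1f"  # ASCII record separator and unit separator

def dump(rows):
    """Serialize rows of string fields with control-character delimiters."""
    return RS.join(US.join(fields) for fields in rows)

def load(blob):
    """Inverse of dump: split records on RS, fields on US."""
    return [record.split(US) for record in blob.split(RS)]

rows = [["China", "1,409,517,397"], ['say "hi"', "a\nb"]]
assert load(dump(rows)) == rows  # commas, quotes, newlines survive untouched
```

The catch, as the reply below notes, is that these bytes are neither printable nor typeable, which is exactly why CSV won.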

Totally unsubstantiated, but my opinion on why we use commas instead of the built-in record/field separator characters is simple: the special ASCII characters don't have a canonical printed representation, and they don't have keys on a keyboard to make them easy to type. Therefore they're neither human-readable nor human-writable, and the human interface is really what matters with a plain-text serialization format.

Edit: spelling

Indeed. Otherwise we could just use a binary representation.

Not sure what this brings over JSON.

From a comment by the author,

> to make it easier to edit the data by hand, for example to cut-and-paste some lines to move them around within the tree without having to change the leading values

I wouldn't even consider trying something like that with JSON.

How do you escape fields that start with spaces?

I think this should be specced out early to avoid falling into the trap good old plain CSV fell into, which now has multiple ways programs escape commas.

The mantra of the Real World (aka non-technical persons): CSV is an acceptable serialization format for whatever data structure you deal with.

(and the corollary: any data can be authored or visualized in Excel).

Love the work you're doing on observablehq mbostock! I am really impressed with the volume and variety of content you create. Especially love the generative art stuff!

I think that ObservableHQ is a really cool idea. BUT. Every time I get to the end of a notebook I am left scratching my head with how it ends.

This example is no exception. Here is how my mind interprets it...

Makes sense.

Makes sense.

Makes sense.

Makes sense.


Heh, thanks for the feedback! What you’re seeing at the end here is the implementation of the parser itself.

You might not be interested in how the parser is implemented; more likely you only care about the design and usage of the proposed data format. Which is fine! But we designed Observable to share one interface for both authors and readers under the view-source philosophy that made the early web so great: all the source is there, accessible, if you do want to dive in and understand it.

But we could do a better job of making the segue from narrative to internal implementation less jarring. I’ve edited this post to make that more explicit with an “appendix” header for the implementation. And we’ve been thinking about ways to formalize this convention, so that the code is still accessible with a click or two if you want it, but doesn’t distract from a normal read.

Why not use multiple commas instead of spaces? A first-level branch would require two commas instead of one space, but it seems less error-prone.

I love that neither the data format nor the article itself ever actually identifies what the numbers represent.

(It appears to be population)

Have you heard of YAML?

"Please don't post shallow dismissals, especially of other people's work."


Yes, but... if this is a serious proposal for this format to be adopted, then shouldn't it address "why is this better than competitors X, Y, Z?"

One can always ask politely.

Have you heard of S-expressions? Whitespace optional:

      (World
        (Asia
          (China 1409517397)
          (India 1339180127)
          (Indonesia 263991379)))
With an editor that supports Paredit or Parinfer, it's impossible to create an invalid tree, because you manipulate the structure of the tree (nodes and leaves) instead of manipulating error-prone text.

Parinfer is very cool! If you’re willing to use a custom editor for trees, you can clearly offer better usability than a standard text editor. But having a usable data format for standard text editors is also nice.

I'll see your snark and raise you two, throwing Emacs in for good measure. ;)

Have you heard of org-export-json? [1]

Have you heard of json.el? [2]

[1] https://github.com/mattduck/org-toggl-py/blob/master/org-exp...

[2] http://tess.oconnor.cx/2006/03/json.el

Yes. YAML would be more verbose and require more precise (error-prone) syntax. See this thread: https://twitter.com/mbostock/status/1075135178014482432

You can write YAML much more concisely though:

      foo: 1
      bar: 1
Syntax details might be off from the top of my head, but the example in the tweet is a bit overblown.

The thread I linked describes the limitations of the approach you suggest, in particular, the ability to associate metadata not just with leaf nodes but also internal nodes, and not requiring the names of children to be unique. Of course you can express hierarchies in YAML, but the result ends up being more verbose and more prone to syntax errors; the point of comma-separated trees is to be easier to read and edit.

I don't think comma separated trees are easier to read... in fact, I think they're quite a bit harder.

I will say it might be easier to write. Do string quoting, quote escaping, and newlines embedded in a quoted string work?

Your snarky question is so far drawing a lot of snarky responses.

However, the implied question here is absolutely valid: Why write your own flat-file format parser from scratch, when there's a multitude of battle-tested third-party libraries already in existence?

I mean, if you're writing something very specialized or proprietary, that's one thing. But if you're simply proposing a generalized tree shape that looks like YAML, then why not indeed use YAML? I know it's tricky to write a YAML parser... but so what, why would you? There are already a ton of great ones.

Two good reasons to not use YAML for this interactive example:

1. intermediate steps in YAML are often invalid documents (breaks the update on every key press)

2. you have to be trained to write YAML. It is quite easy to make a syntax mistake with YAML if you aren’t used to writing it. Not good for an interactive experience

And arbitrary spaces with meaningful indentation aren't easy to make a mistake with? My first thought: what about escaping newlines?

It's a superset of CSV; what's the problem?

Have you heard of XML? Not my first choice, though:


Have you heard of JSON? [{"a": [{"b": "c"}]}]

YAML was designed as a superset of JSON with meaningful whitespace as part of the design, and optional string quotes. It's actually a REALLY good option for something like this.

That said, the statement itself came off really snarky.

There's also Mark Notation (http://marknotation.org/), which is also a tree notation. Less verbose than XML and JSON.

      {China '1409517397'}
      {India '1339180127'}
      {Indonesia '263991379'}
