Hacker News

Spreadsheets would be nicer if they had a coherent dataflow model.

What do you mean by "coherent dataflow model"? (I'm working on these problems - hence all the questions.)

This gives a column of constant values which are tweakable by tweaking just the first element. That's what a scalar variable looks like in a spreadsheet.

Why not just put the value in a cell and reference that cell absolutely?

With some visual tinkering you might even have a sort of "iframe" [...] but it should float

It seems that the general solution here would be to (a) make it much easier to decompose a problem across multiple sheets (in your example, scalars and reductions could go in a different sheet), and (b) allow sheets to "float" if you want them to, rather than always being in a different tab that you're forced to switch to. Does that make sense?

balance[i] = change[i] + balance[i - 1]

It's interesting that you single out this kind of recurrent calculation, where a later value of a column depends on an earlier value. It doesn't get mentioned very often. But it's fundamental to what spreadsheets do computationally, and it's the reason why parallelizing them is a lot harder than it at first appears.
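A minimal Python sketch of why this particular recurrence is nonetheless tractable: because addition is associative, the column is just a prefix sum, and prefix sums admit a parallel scan (the sequential form below is the naive loop; the function names are my own):

```python
from itertools import accumulate

def balances(changes, opening=0.0):
    """Sequential form of balance[i] = change[i] + balance[i-1]."""
    out = []
    prev = opening
    for c in changes:
        prev = c + prev
        out.append(prev)
    return out

def balances_scan(changes, opening=0.0):
    # The same column as a prefix sum. Because + is associative, a
    # parallel scan can compute this in O(log n) depth, which is what
    # makes this recurrence (unlike an arbitrary one) parallelizable.
    return list(accumulate(changes, initial=opening))[1:]

print(balances([100, -30, 20]))       # [100, 70, 90]
print(balances_scan([100, -30, 20]))  # [100, 70, 90]
```

An arbitrary recurrence (one whose combining step is not associative) would not admit this trick, which is part of why general spreadsheet dependency chains resist parallelization.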

(1) You're right, I should have explained that more. I wanted to contrast this with something like "spreadsheets today have a datum-flow model" but a nice phrasing didn't come to me and I posted without remembering to insert the missing sentence.

What I mean is that spreadsheets are (right now) fundamentally based on the idea of a grid of cells which are individually meaningless and can contain anything, any individual datum, and datums may refer to each other through arbitrary operations. This grid view might be a good way to present datums to users, but it requires a style convention if what you write is to be readable; it encourages styles which obscure your ability to actually see what a sheet does.

It's not just that you can't see how the data flows, although that's part of it -- it's that the data is allowed to flow in ways that you could never easily visualize in the first place. Imagine that we simply draw the "depends on" relation by drawing a little curvy arrow from A to B if B depends on A. The Excel equivalent of "spaghetti code" could then literally look like spaghetti on the spreadsheet -- it would have neither head nor tail.
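For what it's worth, that "depends on" relation is exactly what a spreadsheet engine walks when it recalculates: it topologically sorts the cells and evaluates them in dependency order. A minimal sketch using Python's graphlib (the cell names are made up for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each cell maps to the cells it
# depends on. Nothing constrains this graph to follow the grid layout,
# which is why "spaghetti" is possible.
deps = {
    "B1": {"A1"},
    "C1": {"A1", "B1"},
    "A2": {"C1"},
}

# A spreadsheet recalculates in a topological order of this graph.
# Any acyclic graph is allowed; a cycle is a circular-reference error.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The point stands: the engine accepts any acyclic tangle at all, so there is no guarantee that drawing the arrows on the grid produces anything readable.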

This could be solved with a nice model for how data, not individual datums, are allowed to flow through the application. Calculating a velocity might be as simple as writing "(x - last(x))/(t - last(t))", if x and t accepted vector subtractions and last(q)[i] == q[i - 1].
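Here is one sketch of what those vector semantics might look like in Python. Everything here is hypothetical: `last` is the proposed shift operator, and the elementwise arithmetic is written out by hand (a real implementation would presumably overload the operators so that "(x - last(x))/(t - last(t))" could be written literally):

```python
def last(q):
    """Shift a column down one row: last(q)[i] == q[i-1].
    The first element has no predecessor, so it is dropped here
    (a real spreadsheet model would need a convention for row 0)."""
    return q[:-1]

def velocity(x, t):
    # (x - last(x)) / (t - last(t)), written out elementwise
    return [(xi - xp) / (ti - tp)
            for xi, xp, ti, tp in zip(x[1:], last(x), t[1:], last(t))]

print(velocity([0.0, 1.0, 4.0], [0.0, 1.0, 2.0]))  # [1.0, 3.0]
```

The appeal is that the formula is written once, over whole columns, instead of being copied cell by cell down the sheet.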

(2) I'm not entirely sure what you think the referring code is doing, if not putting the values in cells and referencing those cells. The reason why I can't be "absolute" about it is that in Calc (and Excel, the last time I used it), to extend a computation over a vector, you highlight the computation and then drag a resizing corner to extend it into an area parallel to the input vectors -- or else you use a right-click "Fill" tool.

I used to think that these tools were broken, but I think I can now appreciate that, because their model is so easily grasped, it's not really broken just because it's hard to say, "No! I wanted this parameter fixed."

(3) That sounds suspicious. mean(v) should be associated with the column v in a clear way.

Interesting. It seems our views are similar on some aspects of this and divergent in others. I agree with you that the cell-based referencing system that treats every individual datum independently is too low-level and can lead to arbitrary spaghetti. (Boy do I know that spaghetti. I have been doing battle with that spaghetti. Or more precisely, with the computational model that you correctly describe as allowing for spaghetti.) There is clearly a need to express operations at a higher level: on vectors of data, let's say, rather than arbitrarily cell-by-cell.

On the other hand, the grid layout and cell referencing system are the essence of the spreadsheet. If you mess with it very much, you forfeit the familiarity of spreadsheet users and run the risk of forfeiting what makes spreadsheets so accessible and popular in the first place.

So I see the challenge less as "how to come up with a more structured UI that maps to a more coherent dataflow model" (where by "coherent" we mean literally, "let related data cohere together") and more as "how to generalize existing spreadsheet language into something that allows for coherence and exploits it where possible". Although spreadsheet language allows spaghetti, most spreadsheets do lay out their data and calcs in a coherent fashion. It's just that the system fails to recognize it.

Once one achieves "coherent dataflow", there might be a case for adding some features into spreadsheets at the application level that allow for declaring and visualizing more structure. But I think it's a mistake to start with that - or at least risky, because spreadsheet users like spreadsheets and there is no guarantee they will like anything else. (There's a reason why it's always programmers who come up with ideas for "fixing" the spreadsheet UI. You gave yourself away with the Haskell reference :)) The approach I'm advocating is risky too, though, since it's not obvious that there is any coherent generalization to be had out of the spreadsheet data model.

The reason why I can't be "absolute" about it is because in Calc (and Excel the last time I used it), to extend a computation over a vector, you highlight the computation and then [...] resize it into an area parallel to the input vectors

Yes, but in your example the computation that you extended this way had an identical number in every cell. (That is, by extending it, you were copying a formula guaranteed to evaluate to the same number in every cell of the column.) If that's correct, why copy that number at all? Why not just keep it in one cell and have everyone reference it from there? Is it that you wanted it to be visually nearby in every row? I'm probably just missing something about your example.

That sounds suspicious. mean(v) should be associated with the column v in a clear way.

But the solution is to make it very easy for you to lay out your data and calculations in the way you find clear. It is certainly not to force every user to lay out their data and calculations that way.

Just to answer your question with regards to (2):

Suppose I open up LibreOffice Calc, and my first row gets the values "c", "x", "y" in that order, as labels for three columns. The second row gets the values 2, -2, 0. The third row gets the odd formulas:

    =A2, =B2 + 0.02, =B3 * B3 + A3 * B3 - 1
Now I highlight C3 and drag it up to C2 (the placeholder 0 in C2, which shouldn't stay a literal 0, is now computed by the formula). I also highlight A3 through C3 and drag it down until my x's range from -2 to +2; this happens at row 202.

The key thing is, this "dragging autofill" has quickly managed to make all of the y computations dependent on the same c, whose authoritative value is stored at cell A2. (I can also change what x's I look at by tweaking the cell B2.)

So I highlight the x and y columns and click the Chart button to make a scatterplot, lines only -- no point markers. Then I need to kill the auto-adjustment of the y axis because it will confuse me, so I set it to go from y = -5 to +10.

Now I can just start modifying this parameter c in cell A2, and see how the graph changes. I might notice for example that the vertex of the parabola hits a maximum when c = 0. That's an interesting feature; it suggests that the vertex of the parabola describes its own parabola as you vary c. Et cetera.

The only reason for doing it this way is because that is the easiest way I know of to get the computation right for 200 data cells. You're right, I could hand-write 200 different computations to all point to A2. It would take a long time and I would hate my life. I could also write in the value "2" and whenever I want to change it, drag across 200 rows. But then I would never get to see how this thing changes. (If you've never seen it, see Bret Victor's "Inventing on Principle" talk for a discussion of the power of having a direct connection with your artistic creations.)
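For comparison, the whole walkthrough above amounts to plotting y = x^2 + c*x - 1 for x in [-2, +2]. A Python sketch of the same computation (my own function names, not anything from Calc), with the single authoritative c made explicit, playing the role of cell A2:

```python
def column_y(c, x0=-2.0, step=0.02, n=201):
    """The x and y columns of the sheet: rows 2..202, x from -2 to +2."""
    xs = [x0 + step * i for i in range(n)]
    # Every y references the same c, as every C-column formula
    # ultimately references cell A2 in the spreadsheet version.
    ys = [x * x + c * x - 1 for x in xs]
    return xs, ys

# The vertex of y = x^2 + c*x - 1 sits at x = -c/2, y = -c*c/4 - 1,
# so its height is maximized at c = 0, matching the observation above;
# as c varies, the vertex itself traces the parabola y = -x^2 - 1.
for c in (-2, 0, 2):
    xs, ys = column_y(c)
    print(c, min(ys))
```

The contrast is the point: in the sketch, c is a parameter by construction, whereas in the sheet its "parameterness" exists only because the fill tool happened to propagate references back to A2.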

Maybe by "coherent dataflow model" he means some separation between what's an input to any given operation, so you can see what data is going where and how instead of it being scattered all over the place. That's what I'd like, anyway.

What it looks like to me is more like a program than a table, but with really good list/table entry and flow arrangement tools. That may just be because I'm a programmer.

Can you share anything more about what you're working on? These are interesting problems to me, too.

There's an aspect of that, too. I'm especially influenced by the circuit diagrams in G, the language which you program in when using LabVIEW. Basically most of the system, except for loops, looks like one big circuit. (Since loops are ugly because they're nonfunctional, I have wasted some time trying to think of what the Haskell version of G would look like, how you would embed monads in the circuit elements, and so forth.)

Can you point to an example of a nontrivial program in G that does something people would normally do in a spreadsheet?

The best reference I can show for the general programming model is a blog post, not about G, called "Why Concatenative Programming Matters":


which shows you how graphical diagrams can make sense out of concatenative programs.

If you really want an example, then I will give you this with some caution:


The caution is that LabVIEW views a data source as a sort of continuous input stream, so that it is manipulating individual values at any given time. I would rather view vectors as more fundamental in a spreadsheet language.

I'd be happy to discuss it offline if you want to email me (address in profile).
