
Data first, not code first - et1337
http://etodd.io/2015/09/28/one-weird-trick-better-code/
======
mpweiher
"Show me your flowchart and conceal your tables, and I shall continue to be
mystified. Show me your tables, and I won't usually need your flowchart; it'll
be obvious." \-- Fred Brooks, The Mythical Man Month (1975)

via "Objects have not failed" \-- Guy Steele,
[http://www.dreamsongs.com/ObjectsHaveNotFailedNarr.html](http://www.dreamsongs.com/ObjectsHaveNotFailedNarr.html)

~~~
zurn
Guy Steele's idea sounds contentious, though: that OO encourages "data-first"
design because the code is encapsulated:

    
      "Smart data structures and dumb code works a lot better than
      the other way around."
    
      This is especially true for object-oriented languages,
      where data structures can be smart by virtue of the fact
      that they can encapsulate the relevant snippets of "dumb
      code." Big classes with little methods -- that's the way to go!
    

Or maybe he is just encouraging OO programmers to think more in this vein?

~~~
Chris_Newton
The tricky part is that smartness of data structures is context-sensitive.

One of the most common design errors in OO systems seems to be building
systems that beautifully encapsulate a single object’s state… and then finding
that the access patterns you actually need involve multiple objects but it’s
impossible to write efficient implementations of those algorithms because the
underlying data points are all isolated within separate objects and often not
stored in a cache-friendly way either.

Another common design problem seems to be sticking with a single
representation of important data even though it’s not a good structure for all
of the required access patterns. I’m surprised by how often it does make sense
to invest a bit of run time converting even moderately large volumes of data
into some alternative or augmented structure, if doing so then sets up a more
efficient algorithm for the expensive part of whatever you need to do.
However, again it can be difficult to employ such techniques if all your data
is hidden away within generic containers of single objects and if the primary
tools you have to build your algorithms are generic algorithms operating over
those containers and methods on each object that operate on their own data in
isolation.

The more programming experience I gain, the less frequently I seem to find
single objects the appropriate granularity for data hiding.
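The second point above, investing run time to convert data into an augmented structure that sets up a cheaper algorithm, can be sketched as a toy C++ example. The record type and function names here are invented purely for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy record type, invented for illustration.
struct Employee {
    std::string name;
    int department;
};

// Spend O(n log n) once building a sorted index of pointers into the
// original container (which must outlive and not reallocate under the index).
std::vector<const Employee*> buildDepartmentIndex(const std::vector<Employee>& staff) {
    std::vector<const Employee*> index;
    index.reserve(staff.size());
    for (const auto& e : staff) index.push_back(&e);
    std::sort(index.begin(), index.end(),
              [](const Employee* a, const Employee* b) {
                  return a->department < b->department;
              });
    return index;
}

// Each "everyone in department d" query is now a pair of binary searches
// instead of a full scan over isolated objects.
std::vector<const Employee*> inDepartment(const std::vector<const Employee*>& index,
                                          int dept) {
    auto lo = std::lower_bound(index.begin(), index.end(), dept,
                               [](const Employee* e, int d) { return e->department < d; });
    auto hi = std::upper_bound(index.begin(), index.end(), dept,
                               [](int d, const Employee* e) { return d < e->department; });
    return std::vector<const Employee*>(lo, hi);
}
```

The index is deliberately redundant with the source data; it exists to serve one access pattern and can be rebuilt whenever the source changes.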

~~~
ak39
Well said.

The exercise of comparmentalising and creating atomic islands of objects that
dutifully encapsulate data becomes difficult during reassembly simply because
we recreate the need for declarative style of accessing data in an imperative
(OO) world. It's ye old object-relational impedance mismatch.

A (relational) data model is a single unit. It has to be seen this way.
Creating imperative sub-structures (like encapsulating data into objects)
breaks this paradigm with serious consequences when attempting to rejig the
object-covered data into an on-demand style architecture. The whole model
(database?) must be seen as a single design construct and all operations
against the entire model must be sensitive to this notion - even if we access
one table at a time. Yes, at specific times we may be interested in the
contents of a single table or a few tables joined together declaratively for a
particular use case, but the entire data model is a single atomic structure
"at rest".

When paradigmatic lines like this are drawn, I side with the world-view that
getting the data model "right" first is the way to go.

Fred Brooks and Linus Torvalds speak from experience in the trenches.

------
jeffdavis
Or, start from the user experience.

Both are good places to start, and both should be given serious consideration
early in the project.

Code is usually the worst place to start. If you do need code early on, I
think it's fine to write it as quickly as possible, even ignoring edge cases
and simplifying algorithms.

If you have good data structure design (and understand the invariants) and you
understand the problem space well from a functional standpoint (including
functional edge cases), you can get the code into shape later. The bugs you
encounter will be mostly the trivial kind.

But if you misunderstand the user experience, or the data structures are a
mess, then it's difficult to recover.

That's one of the reasons SQL databases are so great: they help you get the
data design up and going quickly, and will enforce a lot of your invariants
(through schema normalization, PK/FKs, and CHECK constraints). If anything
goes wrong, rollback returns you to a known-good state.

I don't know anything equivalent at the user experience level, though.

~~~
nostrademons
This is one reason why prototyping is often so necessary. When you start from
the user experience, you usually end up working your way back from the front-
end: first you visualize what the user should see, then you mock up screens to
give them that, then you figure out what they should be able to interact with
and how, then you write code to make that happen, then you define data
structures that make the code possible. When you start from data, you identify
what data the user will need to manipulate, then you identify relationships
between that data, then you write code to manipulate it, then you slap a UI in
front, one which usually mimics the relationships you identified a priori.

There's a bit of an impedance mismatch here. Start from the UX, and you end up
with a bunch of ad-hoc data structures that are very difficult to rationalize
and inefficient to access. Start from the data, and you end up with a UI that
mimics the data you were given and not how the user thinks about achieving
their task.

The solution is to write a quick & dirty prototype that nails the UX, but
covers _only_ the happy path that's core to the user experience. Then
take careful note of the data structures that you ended up with, and throw
away the prototype. Then you start with a carefully planned data architecture
that captures everything you learned in the prototyping phase, but eliminates
redundancies and awkward access paths that you wrote in the quick & dirty
prototype.

~~~
applecore
The better solution is to start by designing the API, and to leave the design
of the data model (to support the API) and myriad of user interfaces (as
clients of the API) for later.

~~~
nostrademons
You don't know what should be in the API until you've had a couple clients
built on top of it. Users will surprise you; oftentimes that little UI detail
(the one that's really hard to fit into your API) is what will make them buy.

~~~
Chris_Newton
A similar argument also applies in the opposite direction. That is, it’s all
very well beginning with a beautiful, idealised API, but if there is no
efficient way to implement that interface then you’ve backed yourself into a
corner before you even start.

~~~
XorNot
Also, UIs which don't have to support actual data manipulation tend to
underestimate the edge cases of that manipulation - which I tend to think is
why you never get too far from the shell in Unix-likes.

~~~
nostrademons
And why so many database-backed webapps look like a frontend over a database,
and virtually all mobile messaging apps are frontends over TCP/IP, and web
directories before Google were just webpages with a whole lot of links.

It turns out it can be pretty profitable to break with the machine model of
the underlying data and instead present things the way your users want to see
them, even if it makes the code necessary look complicated and gross. The
danger with building the data model or API first is that you'll build the
easy, obvious API, which is the same easy, obvious data model & API that all
your competitors build. Start with the user instead and you can end up with
some really powerful differentiators, at the cost of it being a bitch to
transform into working, sane computer code.

~~~
Chris_Newton
_It turns out it can be pretty profitable to break with the machine model of
the underlying data and instead present things the way your users want to see
them, even if it makes the code necessary look complicated and gross._

It certainly can, though I think you’re still figuring out a data model first
on those projects. It’s just that you’re starting with how the information is
organised and manipulated from the user’s point of view, then figuring out an
internal representation to support it afterwards, rather than the other way
around.

------
marktangotango
Ah yes, takes me back to the time when 'Data Analyst' was an actual
profession, and organizations planned their workloads in order to place an
appropriate order for 'enough' IBM hardware and DASD. Back then, during the
first wave of business automation via computer, the processes being automated
were so well defined that this level of understanding was achievable; when you
have rooms full of clerks calculating monthly payroll, you KNOW what the data
is.

In so many domains today, you don't know, and the business is forgiving enough
(or amateur enough?) that THEY don't even fully understand their processes.

~~~
emersonrsantos
Everyone has bashed COBOL since the dawn of microcomputing, but the definition
of a "Data Division" section, where all data is specified before literally any
line of actual code, is something very interesting.

This language design is largely what made the Y2K transition so successful,
and today Fortune 500 companies are still stuck with their legacy systems,
which were written by people "with a clue" back then.

But do not program in COBOL if you can avoid it.

~~~
riskneural
Cobol is interesting, but horrifically impractical in practice. Cobol is my
bugbear language... I just couldn't solve problems in it. It somehow separates
what should go together and puts together what is unrelated.

------
al2o3cr
Code samples or GTFO. No seriously: an article that brings up messy real-world
OO code and then handwaves it away with "just use FUNCTIONS, yo!" and some
box-and-arrow diagrams is useless and borderline unhelpful. Something like:

[http://prog21.dadgum.com/37.html](http://prog21.dadgum.com/37.html)

is, IMO, more valuable in the discussion. It points out the tradeoffs
explicitly: pure functions mean (in this case) immutable data, which means
more-complex data structures - which means the OPPOSITE of the micro-
optimization stuff this article closes with.

It's also worth noting that if you're fussing about the cache efficiency of
code you haven't written yet, STOP. JUST STOP. Write the code and MEASURE IT.

~~~
agentultra
There are some things worth doing the napkin math for before even bothering
to write the code. That includes cache efficiency. A design mistake is rather
difficult to tear out when you only realize the design was wrong after you've
written a few thousand lines of heavily depended-on code.

Since it's mostly game engine programmers presently driving this design
philosophy, imagine a scene in a game. There are some number of entities in
the scene. Is that number infinite? No. Does each of those entities exist in
its own bubble? No. You want to process all of the entities that need updates all
at once. One transformation. You also know that from frame to frame there may
be 0 - N entities in the scene but you don't want to be constantly allocating,
deallocating, and resizing data structures. So you sketch out a reasonable
number of entities to support in a scene. You know the data each entity needs
to track to participate in the scene and you figure the best way to pack all
of that into a struct of arrays so that processing that update is fast and
doesn't waste a byte of cache. You haven't even written a line of code yet.
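The scene sketched above might look roughly like this in C++ (the capacity, struct name, and fields are invented for illustration; a real engine would pick its own budget and hot fields):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed capacity chosen up front, so there is no allocating, deallocating,
// or resizing from frame to frame.
constexpr std::size_t kMaxEntities = 1024;

// Struct of arrays: each field is packed contiguously, so the per-frame
// update walks cache lines sequentially instead of hopping between objects.
struct Scene {
    std::size_t count = 0;  // live entities this frame, 0..kMaxEntities
    std::array<float, kMaxEntities> posX{};
    std::array<float, kMaxEntities> posY{};
    std::array<float, kMaxEntities> velX{};
    std::array<float, kMaxEntities> velY{};
};

// One transformation over all entities: no per-object dispatch,
// no allocation, just a linear pass through hot data.
void integrate(Scene& s, float dt) {
    for (std::size_t i = 0; i < s.count; ++i) {
        s.posX[i] += s.velX[i] * dt;
        s.posY[i] += s.velY[i] * dt;
    }
}
```

If profiling later shows that, say, only the physics members are hot, they can be split into their own arrays away from the cold data, exactly the give-and-take described below.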

There is a give and take. You have to test out your assumptions and not be
afraid to throw away bad code and start again. Once you start down a path you
may find that the statistically significant number of accesses for entity data
happens for the members related to the physics update. You might choose a
different structure to pack that data into apart from the cold data that
doesn't change much.

Much of data-oriented design is like this: save the modelling for
documentation and diagrams and things that humans understand. Write the code
for the machine since the machine is the platform. You lose mechanical
sympathy when you start designing data structures based on intuition.

It doesn't cost you anything to think about these things. It's just
engineering.

------
current_call
_Input -> Process -> Output_

Somewhere a Haskell programmer is nodding.

~~~
_RPM
I believe computing has 1 concept needed to be understood by programmers:
Input & Output. That's all programming is.

~~~
dragonwriter
> I believe computing has 1 concept needed to be understood by programmers:
> Input & Output.

That's two concepts. So, the two concepts that need to be understood by
programmers are Input and Output. And State.

~~~
philipov
State is just emergent behavior caused by feeding your output back into your
input.
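That claim can be sketched as a pure step function whose output becomes part of the next call's input; the running "state" lives nowhere except in that feedback loop. A minimal illustrative example (names invented, not from the thread):

```cpp
#include <vector>

// A pure transformation: (previous output, new input) -> new output.
int step(int previousOutput, int input) { return previousOutput + input; }

// "State" emerges only from feeding each output back in alongside the
// next input; step() itself holds nothing between calls.
int run(const std::vector<int>& inputs) {
    int fedBack = 0;  // the output wire looped around to the input
    for (int x : inputs) fedBack = step(fedBack, x);
    return fedBack;
}
```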

~~~
inopinatus
Programmers therefore need to understand time.

~~~
socceroos
However, time is shackled to space. Therefore, all programmers need to
understand space.

~~~
inopinatus
That's interesting; we don't generally consider relativistic effects on
computation.

There could be an interesting synthesis between Minkowski spacetime, lambda
calculus, and Lamport clocks that will either a) show that the fundamental
unifying element in the universe is _functions_ (and really annoy at least one
side in the disagreements over the Black Hole Information Paradox), or b) make
for a terrific Greg Egan novel.

------
reedlaw
Does the conclusion boil down to "use functional programming"? The author
doesn't specify the way data or functions are encapsulated in the final
figure. But my gut reaction is that this describes FP better than OO or any
other paradigm.

~~~
eru
Check out functional relational programming
([http://shaffner.us/cs/papers/tarpit.pdf](http://shaffner.us/cs/papers/tarpit.pdf)).

~~~
nicklaf
I admire the combination of theoretical audacity and sweeping pragmatism taken
by the authors of that paper.

They mention Oz a few times. I am a big fan of "Concepts, Techniques, and Models
of Computer Programming" by Van Roy and Haridi, which shows the reader how
different computing paradigms can be constructed from a kernel language in Oz.
That makes the text something of a kindred spirit to the paper, insofar as it
tries to aid the student in grasping many different paradigms of computation
(functional, object-oriented, concurrent, and relational are covered).

~~~
eru
The Haskell-using folks at Standard Chartered Bank have been implementing some
of the "Out of the Tarpit" ideas for a few years now. I can confirm that
functional relational programming works really well in practice.

------
chubot
It's interesting that he's advocating this on games.

I've been annoyed at most OOP codebases, but conceded that games and GUIs were
the canonical applications of OOP. Perhaps you could say that highly stateful
apps running on a single machine can use state encapsulated in objects.

In contrast, for web UIs and their "big data" back ends, I believe you also
want to use a data-oriented approach, rather than a code-oriented approach (or
at least this is the style that my code has converged upon). You don't need to
wrap everything up in objects all the time when you're really just passing
strings through the network and doing light processing. And when you're not
even using a type-safe language to begin with.

But he's advocating the same style on games, just with structs and functions
rather than buffers and processes. This makes me think OOP is more of a
mistake, although I will admit that I use objects as modules now (a longer
conversation, but the idea is to instantiate almost all your objects at
startup, and not allocate during normal program flow).

I've long thought there was an interesting historical oddity in that we came
up with languages in the '90s and designed them for GUIs, but we're using them
in the 2010s for web and server programming. Then you get web apps that are a
mess of misleading objects, when really the architecture is stateless (or
stored in a database, which is not programmable using your language).

~~~
chipsy
I think our main error was in thinking that objects could be pervasive, in the
same way that today one encounters "test-driven" culture.

They work reasonably well at a lower level of granularity - make a fairly
complex data model using components under the hood, and then publish the
resulting api as a more streamlined mutable object.

But the diminishing returns set in fast once you start needing each piece of
data to be fungible and adaptable to new requirements, and then wrap it up in
an object. The relational database has been standardized for some time in large
part because it guides you towards coherent data automatically - systems that
resist that model are more likely to experience a catastrophic failure, and
"ORM" systems are likewise brittle.

------
rebootthesystem
When I took my first APL class the prof drilled into our heads to focus on
data representation before writing any code. Over the years and across a dozen
languages or so this has proven to perhaps have been the most valuable thing I
learned in CS. I lost count of how many problems I've seen go from difficult
to manageable and even simple to solve by devoting more effort towards finding
the best way to represent the data or the problem than simply taking things as
presented and writing code right away.

------
wsh91
Hey there! First off, beautifully written. Thank you for a lot of food for
thought.

Second--I can't help but think of the notion of mechanical sympathy.[1] Very
useful.

Finally, I want to put on [2] but my wit is eluding me.

[1]
[http://martinfowler.com/articles/lmax.html](http://martinfowler.com/articles/lmax.html)

[2]
[https://www.cs.cmu.edu/~crary/819-f09/Backus78.pdf](https://www.cs.cmu.edu/~crary/819-f09/Backus78.pdf)

------
ampersandy
I feel the author has chosen poor examples for illustrating 'bad code'. Does
he really think so little of the id software team that they aren't aware of
these things? There's no mention of the fact that in the real world, all the
patterns and _weird tricks_ there are won't save you from compromising on a
perfect design to get stuff done & shipped.

~~~
draw_down
Good people make mistakes too. It's important for all of us to accept
criticism of our ideas and code without mistaking that for personal criticism.

------
agentultra
This idea of data-oriented design has been influencing much of how I write
code and approach problems these days. In dynamic GC'd languages it's all too
easy to trick yourself into believing that all problems are unbounded.
However, most problems are not. And doing the math ahead of time allows you
to write
the minimal amount of code to do the transformation required.

It's really cool.

------
dyarosla
I think the author has some great points to begin with, but the finale is a
little... lacklustre.

It's all well and good to say we can separate process from objects, but in
reality it's not uncommon to require instantiating new objects/components
dynamically or changing structure in ways that don't map so easily to the
input->process->output diagram shown.

I also think that maintaining code and building up testable modules was one of
the niceties of OOP, ahead of all the inheritance spaghetti that started
happening in many poorly-OOPed projects.

Kudos though to the author for showcasing his first few points by commenting
on Doom's code- you don't often see that.

------
ilurk
The author had some pretty good points there. But I have to wonder if these
are somewhat cherry picked.

I mean, where is the other side of the coin? Where do the advocated
philosophies break?

Has any big project been built using that approach? With no problems at all?

~~~
Narishma
A lot of AAA games are built that way.

~~~
ilurk
Have any of them released their source code?

Not a snarky comment, just wondering if it would be possible to peek at the
code.

~~~
Narishma
None that I know of. The Ogre3d renderer has moved from an OOP to a
data-oriented design in their newer versions, so maybe you can look at that.

------
mjcohen
Two relevant blasts from my past:

"Algorithms + Data Structures = Programs" by Wirth

The Jackson Design Methodology, which essentially said that the structure of
the data will produce the structure of the program.

------
javajosh
I've heard this as "debug data, not code", and that's an adage that has served
me very, very well.

~~~
mathgenius
Yeah, "smart data-structures, dumb algorithms." Mathematicians know this too:
if you get the notation right, then the calculation is simple.

------
truncate
Couldn't agree more. Whatever program we write, it is manipulating and
transforming data. If our code revolves around data, it couldn't make more
sense to formulate the data before actually starting on the code that is
supposed to work on that data.

------
zkhalique
I used to think I could communicate the architecture (including code) of an
entire website using just the tables, and that a good enough
developer+designer could turn the schema into a website.

~~~
jdbernard
You imply that you no longer believe this. What changed?

------
112233
I so wish C++ would let me code like this without getting in the way with its
OO-ishness. Recent gripes:

\- got a few bit flags per struct, and need to know if there is at least one
instance with each flag set (a counter per flag). I cannot bit-pack the flags
without duplicating the flag getter/setter/destructor code.

\- no way to declare static-linkage functions as friends (the linkage is
forced to external), so I need to expose types that are implementation details
in a public header.

It would be nice if C++ could get some RAD-slanted design work, instead of
trying to get all RUSTy.
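One workaround for the bit-flag gripe, sketched under the assumption that templating the accessors over the bit index is acceptable (the names and the fixed flag count are invented for illustration, not from the comment):

```cpp
#include <cstdint>

// One global counter per flag bit, so "is there at least one instance
// with flag B set?" is an O(1) query. (C++17 inline variable.)
inline int g_flagCount[8] = {};

struct Widget {
    std::uint8_t flags = 0;  // all flags bit-packed into one byte

    Widget() = default;
    Widget(const Widget&) = delete;             // copying would desync the counters
    Widget& operator=(const Widget&) = delete;

    // A single template pair replaces a hand-written getter/setter per flag.
    template <int Bit>
    bool get() const { return (flags & (1u << Bit)) != 0; }

    template <int Bit>
    void set(bool on) {
        if (on == get<Bit>()) return;           // no change, no count update
        flags ^= static_cast<std::uint8_t>(1u << Bit);
        g_flagCount[Bit] += on ? 1 : -1;        // keep the per-flag census current
    }

    ~Widget() {                                 // the one destructor releases all counts
        for (int b = 0; b < 8; ++b)
            if (flags & (1u << b)) --g_flagCount[b];
    }
};

// Any live instance anywhere with flag Bit set?
template <int Bit>
bool anyFlagged() { return g_flagCount[Bit] > 0; }
```

This avoids duplicating accessor code per flag, though it does nothing for the second gripe about static-linkage friends.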

~~~
112233
(expanding on the above while using non-touchscreen keyboard)

The problem that I continually encounter writing data-first in C++ is the
inability to group data members that are related to a single implementation
detail but are scattered across multiple classes, and to indicate in a
compiler-parsable way which functions have the responsibility to manage these
data fields. For example, consider maintaining M-to-N relation links between
classes A and B, without putting the links in a parent class, and grouping all
implementation details of these links in a single file. Or even maintaining a
reference counter without using inheritance or the other heavy-OO techniques
mentioned in the article. Even worse if your data is only a bitfield.

C++ assumes a single implementation detail to be a single class, contained in
a single allocation. This can be subverted using "public", "friend", and CRTP,
but it feels like hammering nails with pliers.

Newer standards go all-out on supporting really powerful data structures in
the library, and on writing incredible templates, but there is nothing for
low-level C-style custom data handling.

------
otis_inf
E. Yourdon was right after all! I knew it!

------
vinceguidry
> The boss comes in and says "Hey, change of plans. The player is now a car."

Stupid decision-making is why I will never work in games. With regular coding,
you can iterate towards the domain, and so long as you keep your wits about
you and never make the same mistake twice, you'll never be subjected to such a
demand.

In the arts, everything's got to hew to some asshole's vision. The people
building it can never really know what's going on in his head and the asshole
never understands how much work it is to change things to fit the evolving
vision. Killer features on games are rarely evident from day one, game shops
seem to always wind up looking like sausage factories.

~~~
dragonwriter
> Stupid decision-making is why I will never work in games. With regular
> coding, you can iterate towards the domain and so long as you keep your wits
> about you and never make the same mistake twice, never be subjected to such
> a demand.

In the ideal environment, perhaps; all too often in real environments, radical
shifts in requirements not discoverable by advance consideration of the domain
can happen outside of games (sometimes, because there are radical shifts in
the domain -- if your system is automating implementation of corporate or
government policy, whims of policymakers can change things as radically as
"the player is now a car").

~~~
vinceguidry
I work for a marketing company, I'm well aware of shifting whims. You can
build flexibility into your infrastructure if you really need it. You come up
with a model for how things are likely to change in the future and work
modularity into your codebase.

This approach doesn't really work for games. Everything is so interdependent
that modularity allowing for all the possible ways the system could change
would slow it down to a crawl. There's just no way to avoid expensive, painful
rewrites.

~~~
yoklov
This is somewhat true, but you structure the code in a way that makes
rewriting any one part easy. You keep your modules fairly large, your
abstractions thin, and the code very literal. It's much easier to do a total
rewrite on code like this than on code that has been abstracted out a great
deal.

It's very quick to write code like this, and the maintenance cost of this
isn't as large as you might think (honestly, after you get used to it, you
often end up preferring it. It's very, uh, blunt, for lack of a better term).

Edit: Worth noting that these rewrites aren't always due to a change in
creative direction; they're often for technical reasons.

