Data-Oriented Design (2018) (dataorienteddesign.com)
262 points by chrispsn 3 months ago | 37 comments

I think what's missing is something like "Data Model Patterns: A Metadata Map" by David C. Hay

It's like C. Alexander's "Pattern Language" but for data models.

> ...I was modeling the structure, the language, of a company, not just the structure of a database. How does the organization understand itself and how can I represent that so that we can discuss the information requirements?

> Thanks to this approach, I was able to go into a company in an industry about which I had little or no previous knowledge and, very quickly, to understand the underlying nature and issues of the organization—often better than most of the people who worked there. Part of that has been thanks to the types of questions data modeling forces me to ask and answer. More than that, I quickly discovered common patterns that apply to all industries.

> It soon became clear to me that what was important in doing my work efficiently was not conventions about syntax (notation) but rather conventions about semantics (meaning). ... I had discovered that nearly all commercial and governmental organizations, in nearly all industries, shared certain semantic structures, and understanding those structures made it very easy to understand quickly the semantics that were unique to each.

> The one industry that has not been properly addressed in this regard, however, is our own: information technology. ...


I've pushed this at every company/startup I've worked at for years now and nobody was interested. You can basically just extract the subset of models that cover your domain and you're good to go. Or you can reinvent those wheels over again, and probably miss stuff that is already in Hay's (meta-)models.

Two (more) things I'd like to point out:

Prolog clauses are the same logical relations as in the Relational Model of DBs. ( Cf. Datalog https://en.wikipedia.org/wiki/Datalog )
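To make that correspondence concrete, here is a minimal sketch in plain Python (not Prolog, and not any real Datalog engine). Facts are rows in a set of tuples, and a recursive rule is evaluated by joining repeatedly until nothing new turns up. The `parent` facts and the `ancestors` helper are made up for illustration.

```python
# Facts are tuples in a relation, just like rows in a table.
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}


def ancestors(parent_facts):
    """Naive bottom-up evaluation of the two Datalog-style rules:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    """
    ancestor = set(parent_facts)  # first rule: every parent is an ancestor
    while True:
        # Second rule: join parent(X, Y) with ancestor(Y, Z) on the shared Y.
        derived = {
            (x, z)
            for (x, y1) in parent_facts
            for (y2, z) in ancestor
            if y1 == y2
        }
        if derived <= ancestor:  # fixpoint reached, no new facts
            return ancestor
        ancestor |= derived


result = ancestors(parent)
```

The join inside the loop is the same operation a relational engine performs for `parent JOIN ancestor ON parent.child = ancestor.parent`; the loop is what recursion adds on top.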

The next big thing IMO is categorical databases. Go get your mind blown by CQL: https://www.categoricaldata.net/ It's "an open-source implementation of the data migration sketch from" the book "Seven Sketches in Compositionality: An Invitation to Applied Category Theory" which went by yesterday here: https://news.ycombinator.com/item?id=20376325

> The open-source Categorical Query Language (CQL) and integrated development environment (IDE) performs data-related tasks — such as querying, combining, migrating, and evolving databases — using category theory, ...

In reading both your comments, I kept thinking about Smalltalk's Browser and Method Finder functionalities.

In Hay's book, he talks about modeling not just Object Classes, but how you can go about modeling ObjectClass classes (the metadata). In Smalltalk, this metadata is automatically available via the Browser.

In reviewing the CQL Tutorial, the Typesides' "java_functions" specification section made me think of the Method Finder (wherein you could provide an input value and the expected output value and methods that satisfied the transformation would be shown). I'm not sure if in CQL you'd be able to search for a function that satisfied criteria, but that may be beyond the scope of what that system is trying to provide.
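For anyone who hasn't seen the Method Finder, the idea can be approximated in a few lines of Python. This is only a rough sketch of the concept, not how Smalltalk implements it, and `find_methods` is a made-up helper: it tries every public method on a value and keeps the ones that produce the expected output.

```python
def find_methods(receiver, expected, *args):
    """Return names of public methods where receiver.method(*args) == expected."""
    matches = []
    for name in dir(receiver):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        method = getattr(receiver, name)
        if not callable(method):
            continue
        try:
            if method(*args) == expected:
                matches.append(name)
        except Exception:
            pass  # wrong arity or argument types: not a candidate
    return matches


# Which string methods turn "abc" into "ABC"?
matches = find_methods("abc", "ABC")
```

This is only safe to try on immutable values like strings, since it really does call every method it finds.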

In any case, interesting threads of thought to be followed and explored.


Apparently Conal Elliott[1] has reached out to the CQL folks so if they get together there should be some interesting and useful stuff emerging there. I told Wisnesky they should make a self-hosting CQL compiler in CQL. ;-D Maybe modeled on Arcfide's thing[2]. Categorical APL...

[1] http://conal.net/papers/compiling-to-categories/

[2] https://news.ycombinator.com/item?id=13797797

When designing data structures that hold my program's state, my structures tend to become more coupled & brittle because I use them in multiple contexts and do a poor job of separating them. I want each context to get the data in the perfect shape for that context, even if the contexts share data.

It makes me wonder what happens if instead of creating a tree of data I put my data into a Datom flat data structure then used syntax like Datalog to conform data into shape for each new context
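A rough sketch of that idea in plain Python, assuming nothing about Datomic's actual API: the state is one flat list of (entity, attribute, value) facts, and each context derives its own shape with a small query function. The attribute names and the `pull` and `team_roster` helpers are invented for the example.

```python
from collections import defaultdict

# One flat collection of (entity, attribute, value) facts.
datoms = [
    (1, "user/name", "Ada"),
    (1, "user/email", "ada@example.com"),
    (1, "user/team", 10),
    (2, "user/name", "Grace"),
    (2, "user/team", 10),
    (10, "team/name", "Compilers"),
]


def pull(datoms, entity, attrs):
    """Shape one entity into a dict holding only the attributes a context needs."""
    return {a: v for (e, a, v) in datoms if e == entity and a in attrs}


def team_roster(datoms):
    """A different shape over the same facts: team name -> sorted member names."""
    names = {e: v for (e, a, v) in datoms if a == "user/name"}
    teams = {e: v for (e, a, v) in datoms if a == "team/name"}
    roster = defaultdict(list)
    for (e, a, v) in datoms:
        if a == "user/team":
            roster[teams[v]].append(names[e])
    return {team: sorted(members) for team, members in roster.items()}


login_view = pull(datoms, 1, {"user/name", "user/email"})
roster_view = team_roster(datoms)
```

Neither view owns the data, so adding a third context means adding a third function rather than reshaping a shared tree.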

Unfortunately, I still haven't installed Datomic free to try this out because I feel dirty if it doesn't come from brew

With things like EAV and Datalog, you still end up coupled to the names and kinds of facts that you store and the relationship between entities in your structure. It turns out that there can be many ways to store facts about an entity, and model the relationships between them, which may be convenient or not depending on the context. But it does help in that it can much more easily evolve with the kinds and shapes of data that you need to store in your app as your app changes and grows.

More abstract constructs like lenses can help with this as well. By building lenses for reading that can transform to a domain context structure from your global program state, you can keep them relatively decoupled and ensure all of the glue lives only in those lenses.

I'm undecided on the utility of leveraging lenses for mutation as well; on the one hand, mutation allows one to operate strictly in the local context, but on the other hand, it requires your lenses to be truly bidirectional (which I think is harder in practice than in theory), and counts on either synchronous sync of local state and global state _OR_ eventual consistency with the rest of the app.

Going a more CQRS-esque way like Redux/re-frame allows you to not rely on the bidirectionality of your lenses and also ensures consistency in that local state is only ever driven by changes to the global state.
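A minimal sketch of that split, in Python rather than Redux or re-frame, with all names (`AppState`, `apply_action`, `open_titles`, `progress`) made up for illustration: global state changes only through one function that takes actions, and each local context gets a read-only "lens" that derives the shape it wants.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AppState:
    todos: Tuple[Tuple[str, bool], ...] = ()  # (title, done) pairs


def apply_action(state, action):
    """The only place global state changes (the command side)."""
    kind, payload = action
    if kind == "add":
        return replace(state, todos=state.todos + ((payload, False),))
    if kind == "complete":
        return replace(
            state,
            todos=tuple((t, d or t == payload) for t, d in state.todos),
        )
    return state  # unknown actions leave the state untouched


def open_titles(state):
    """Read-only lens: global state -> the list one view needs."""
    return [t for t, d in state.todos if not d]


def progress(state):
    """Another read-only lens over the same state: (done, total)."""
    done = sum(1 for _, d in state.todos if d)
    return done, len(state.todos)


state = AppState()
for action in [("add", "write"), ("add", "test"), ("complete", "write")]:
    state = apply_action(state, action)
```

Because the lenses only read, none of them has to be bidirectional; writes always go back through `apply_action`.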

Check out datascript then as the datalog query experience should be the same.

Can you give a specific example?

As someone who works in data warehousing and business intelligence, I always feel the pain of developers thinking of data as an afterthought. The biggest issues are always that not all data is persisted and not all changes are tracked in the database. This means you can never do all the reports the business wants, and it is always a mission to explain to the business that you cannot report off data that doesn't exist or isn't properly stored (referential integrity). My dream has always been that one day developers will think of the data first :-).

(Not to nitpick, but the thread is from 2013; that thread said the site is from 2013, but the page itself says that it's 2018...)

In 2016 there was a discussion of a 2013 book. Now in 2019 there is a discussion of the 2018 edition of that book. I think we're sorted now!

(I originally said "thread from 2013" above but that was wrong and I edited it.)

Makes perfect sense, and words are hard. Thanks :)

I've skimmed over the book and I think it definitely needs more code examples and before and afters, because whatever little code I saw is unconvincing.

In fact, when developing I don't want to think too much about mere data, I want to see the algorithms and data structures and even the overall program logic. Having to worry about each individual struct member, and whether those are used properly and kept in sync is a pain and is throwing away the brilliant idea of encapsulation.

The chapter on "managers" reads like a caricature of poor OOP practices, I don't quite understand how it's become a best practice in data-oriented design :)

All in all, this seems like a clumsy way of designing things.

There's plenty of good talks and literature on the subject and why it is applicable to game development, but also more generally.

This talk by Mike Acton (Formerly Insomniac now Unity): https://www.youtube.com/watch?v=rX0ItVEVjHc

I recommend this talk all the time, since it is the one that got me convinced to look into DoD seriously.

Also: More Mike Acton (now at Unity): https://www.youtube.com/watch?v=p65Yt20pw0g

Stoyan Nikolov “OOP Is Dead, Long Live Data-oriented Design”: https://www.youtube.com/watch?v=yy8jQgmhbAU

Overwatch Gameplay Architecture and Netcode (More specifically about ECS): https://www.youtube.com/watch?v=W3aieHjyNvw

The main argument is that you work with data so you should care about data. All a program does is take some input, do something with it, and give you an output. If you structure your program around this mindset, a lot of problems become much simpler. Concurrency is one thing that becomes much simpler to reason about once you understand how your data is managed and moved around. Another realisation is that most of the time you have multiples of something and very rarely a single thing. So instead of having a lot of small objects of the same type, each doing some work within its own deep call stack, you run through an array of them and do the work you need to do.

I disagree that encapsulation is a brilliant idea _in general_, because it promotes hiding data, and hiding data is not inherently good. There are obviously cases where keeping something internal is critical, but since all your state is simply tables of data, your program is just choosing to represent that data in some way, which can make it easier to localize a bug early.

There's obviously pros and cons, but I don't think you should discount the possibility of it being a good idea just because it questions ideas that seem standard.

I got this book recently in print form. It's partially from a gamedev perspective, but interestingly a lot of the book will be familiar to people who work with databases, and thus to a lot of the webdev world. A lot of the ideas come from how we organize data in databases and applying that to data in your code.

This is exactly the wrong design pattern to follow. Translation from code to business model has always been problematic. Reducing that translation by modeling code after a business domain (not a business object) is the best way to reduce complexity and enable a longer life for a system.

Can you elaborate?

I would argue that a business domain is defined by its data and how that data is transformed and displayed based on user input.

So simply put: 1. Data goes into system 2. Data is displayed to user 3. User interacts with data (button, CLI etc.) 4. Data is transformed based on interaction

That is ALL a program ever is. So you can model a program by just specifying the transformations that happen on each user interaction.

This can be optimized heavily if each interaction requires a large set of data to be transformed, e.g. through data oriented design you can work on big sets of data based on previous interactions and transformations and only work on data you need to work on.

It's not a design pattern. It's just a way of thinking about programs as what they are. A design pattern is something you put on top of data because you want a human understanding of business logic.

> So simply put: 1. Data goes into system 2. Data is displayed to user 3. User interacts with data (button, CLI etc.) 4. Data is transformed based on interaction

> That is ALL a program ever is.

Well, no, that's a fairly typical pattern for an interactive program, sure, but plenty of programs involve transformations that are not in response to user interaction, and may not even include user interaction or display to user at all.

Sure, you can cut out the user interaction, but in general it is something goes in and something comes out. I'd argue that's just a simplification and the point still stands. 1. Data goes in 2. Data is transformed 3. Data is output/displayed

Also, I clearly don't know how to format lists on Hackernews...

Nobody does; there's no list support: https://news.ycombinator.com/formatdoc

I refer to this a lot in my discussions about application architecture, but Domain-Driven Design is the core of my belief system. It's not a silver bullet as there are often times when it's unnecessary, but for any complex system, it's a very important tool.

So as with any architecture, you start with "it depends".

Is the system a simple CRUD application with minimal integrations to other systems? In this case, just use your best judgment and build it as simply as possible. A data-driven approach is economical and perfectly acceptable for this kind of system.

Does the system require integrations in and/or out with other systems? In this case you'll want to understand those integrations clearly before proceeding because they will have an impact on how you develop the system. If the integrated systems are unreliable or work in unexpected ways, you may need an anti-corruption layer to buffer your new system from the older systems.

Now. If the new system is complex with many integrations, you'll want to move away from data-driven and into domain-driven design. You'll want to explore bounded contexts, the relationships between them, translation layers, aggregate roots, value objects, and event messaging. All of this is detailed in Eric Evans' book and further discussed in other books like Vaughn Vernon's "Implementing Domain-Driven Design".

I'd also highly recommend learning Event Storming as a workshop to clearly understand a system and identify strategic opportunities.

But the overriding concern I have for data-driven design is that data, tables, and objects don't always align with a bounded context in a one-to-one fashion. You may (will) likely have several models for the same data. For instance, "user" seems to be a singular object/domain, but it can have many contexts (employee, external, internal, vendor, sales, support, manager, etc). Each of these contexts carries different meaning and therefore different models. Our past object-oriented philosophy was to build do-everything objects with interfaces to manage complexity.

In Domain-Driven Design, you would actually create separate implementations for each model (an employee service would be different from a vendor service even though the underlying data may have similarities).
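As a rough sketch of what "separate models over similar data" can look like (Python, with every field and class name invented for illustration, not taken from Evans or Vernon): the same underlying record is translated into a different, narrower model for each context, and each context's behavior lives with its own model.

```python
from dataclasses import dataclass

# One shared underlying record; the fields are hypothetical.
person_row = {"id": 7, "name": "Sam", "salary": 5200, "manager_id": 3, "open_tickets": 2}


@dataclass(frozen=True)
class Employee:
    """The HR context's idea of this person."""
    id: int
    name: str
    salary: int
    manager_id: int

    def annual_cost(self):
        return self.salary * 12


@dataclass(frozen=True)
class SupportContact:
    """The support context's idea of the same person."""
    id: int
    name: str
    open_tickets: int

    def needs_follow_up(self):
        return self.open_tickets > 0


def to_employee(row):
    return Employee(row["id"], row["name"], row["salary"], row["manager_id"])


def to_support_contact(row):
    return SupportContact(row["id"], row["name"], row["open_tickets"])


employee = to_employee(person_row)
contact = to_support_contact(person_row)
```

The two translation functions play the role of the translation layer between contexts; neither model knows about the other's fields.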

And before we go down the relational database avenue, you have to realize that relational databases are a product and solution, not an architecture. It is convenient for reporting and aggregation, but it actually is an anti-pattern for building transactional systems. Transactional systems are often better suited to key/value, NoSQL data stores. (not to say always, but often)

Lastly, a business domain is defined by its data and behavior. You cannot separate the two and I'd argue behavior takes precedence when designing an architecture.

The Pure Function Pipeline Data Flow is based on the philosophy of Taoism and the Great Unification Theory. For the first time in the computer field, it realizes the unification of hardware engineering and software engineering in a single logical model. It extends `Lisp language-level code and data unification` to `system engineering-level software and hardware unification`. Both in the appearance of the code and in its runtime mechanism, it is highly consistent with an integrated circuit system. It has also been widely unified with other disciplines (such as management, large industrial assembly lines, water conservancy projects, power engineering, etc.). It's also very simple and clear, and the support for concurrency, parallelism, and distribution is simple and natural.

There are only five basic components:

1. Pipeline (pure function)

2. Branch

3. Reflow (feedback, whirlpool, recursion)

4. Shunt (concurrent, parallel)

5. Confluence.

The whole system consists of five basic components. It perfectly achieves unity and simplicity. It must be the ultimate programming methodology.

This method has been applied to a pure Clojure project on the order of 100,000 lines of code, which demonstrates its practicality.

[The Pure Function Pipeline Data Flow](https://github.com/linpengcheng/PurefunctionPipelineDataflow)

Although I do agree data flow programming can be useful sometimes, it has been pointed out that data oriented design is not about data flow: https://sites.google.com/site/macton/home/onwhydodisntamodel...

And taking from the other side of the view, even when you consider high performance data flow programming, there're great people pointing out going functional might not be a very good idea: https://www.freelists.org/post/luajit/Ramblings-on-languages...

I attempted to read the first section expecting that this is something that reduces the complexity of software design, and I'm completely lost here. They don't really explain what data-oriented design is in a single summed-up paragraph. So I turned to Wikipedia and I found this:

    In computing, data-oriented design is a program optimization approach motivated by efficient usage of the CPU cache, used in video game development.

One explanation is: In the search for higher performance and greater flexibility, game developers are converging on restructuring their data from hierarchies of objects into something that resembles in-memory databases.

This is based on a design philosophy that gets strict about structuring code into input-transform-output even though C++ encourages "modify parts of an object in place". Also, it gets strict about cache utilization.

And so, you get the "Entity-Component-System" (which is a 3-noun term like model-view-controller) approach. The Components are conceptually the columns of the database. The Entities are the rows. Each entity only has a subset of columns, so it's actually a primary key across many tables. And "Systems" are data transformation functions to be applied to some subset of components.

So, instead of defining cars, trees and soldiers in your game as a tree of individual objects that participate in an inheritance hierarchy with a virtual Update() method, you define each actor as a subset of available components that update en-masse during the data transforms applied to arrays of raw structs.
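A toy sketch of that layout in Python, just to show the shape: the component names and `movement_system` are made up, and a real engine would use packed arrays of structs rather than dicts. Each component is a table keyed by entity id, an entity is nothing but the id, and a system is a function over the entities that have the components it needs.

```python
# Each component is a "table" keyed by entity id; an entity is just the id.
position = {1: [0.0, 0.0], 2: [5.0, 5.0], 3: [9.0, 1.0]}
velocity = {1: [1.0, 0.0], 3: [0.0, -2.0]}  # entity 2 (say, a tree) never moves
health = {1: 100, 2: 40}                    # entity 3 has no health component


def movement_system(position, velocity, dt):
    """Update every entity that has both a position and a velocity."""
    # The key intersection acts like a join on the entity id.
    for entity in position.keys() & velocity.keys():
        p, v = position[entity], velocity[entity]
        p[0] += v[0] * dt
        p[1] += v[1] * dt


movement_system(position, velocity, dt=0.5)
```

There is no `Car` or `Tree` class anywhere: what an entity "is" falls out of which tables contain its id.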

Often the dominant factor can end up being memory latency to the CPU. Basically memory access is cripplingly slow compared with the CPU. To combat this modern CPUs have a hierarchy of caches where memory access gets progressively quicker as it gets closer to the CPU. The caches also get smaller as you get closer to the CPU. The aim of Data Oriented Design is to present computation in a way that will be as cache efficient as possible. This is super important in soft-realtime systems like games where we are squeezing more into increasingly small frame budgets. For example at 144Hz (modern gaming monitors and the Valve Index) you have just under 7ms (1000/144 ≈ 6.94ms) to do everything to simulate and send render instructions for each frame.

The net result is that it’s often significantly better to use an algorithm that packs its data contiguously (to allow easy pre-fetching) and with only the information required (to get the most elements in each cache line) than one with better time complexity.

Then the DB like model falls out of that as a way to represent domain objects (like game entities) as contiguous collections of the data that represents them broken up into the units that are actually used together.

This DB model got semi-divorced from the origin and you end up with the Entity-Component-System architecture that you’d be forgiven for thinking everyone in game development was using. But in actuality they are not, even if the game engine itself has been built to exploit the concepts of Data Oriented Design.

This is also why you can often find a huge performance improvement moving data into typed arrays in JS. So it isn’t just a concern for people working in systems languages.
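To show what "packed contiguously, with only the information required" means in a language with typed arrays, here is a small Python sketch using the standard `array` module; the particle fields and `advance` are made up. It illustrates the layout only: in CPython, interpreter overhead dominates, so don't expect this snippet by itself to demonstrate the cache effect.

```python
from array import array

# Array-of-structs style: one object per particle, hot and cold fields mixed.
particles_aos = [
    {"x": float(i), "y": 0.0, "mass": 1.0, "name": f"p{i}"} for i in range(4)
]

# Struct-of-arrays style: one packed, typed array of doubles per field.
xs = array("d", (float(i) for i in range(4)))
vx = array("d", [0.5] * 4)


def advance(xs, vx, dt):
    """Touch only the two fields the update needs, each stored contiguously."""
    for i in range(len(xs)):
        xs[i] += vx[i] * dt


advance(xs, vx, 2.0)
```

In the second layout the update never drags `mass` or `name` along with it, which is the "units that are actually used together" point from above.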

If you are interested in tools for automating Data-Oriented design, check out GeneXus. You design what the user needs from the data, and it automagically builds the normalized tables and the CRUDs in various languages, and deploys them.

This beats OOP and FP a million times.

I'm a fan of "table oriented programming". But what's lacking is good reference implementations for others to study. When I explain it with text and short examples, most others go "Huh? Why not just use code?"

Plus, existing development tools are not well-suited for table oriented programming. One would have to build such tools as well for the benefits to show.

A reference example could be a CRUD framework or a gaming framework for a Trek-ish style universe (not too much supernatural like Star Wars). Making either is a non-trivial process. Maybe when I retire I'll get around to making such...

To make it practical, we may also need somebody to implement "Dynamic Relational" because existing RDBMS are too stiff for certain things, like representing different kinds of UI widgets. Having a dedicated table for each "kind" of widget is overbearing. With "static" RDBMS, one either has to use attribute tables (AKA, "EAV" tables) or dedicated per-widget tables. That's not ideal.
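For readers who haven't met the attribute-table workaround, here is a small sketch using Python's built-in `sqlite3` module. The `widget_attr` table and `load_widget` helper are invented for illustration, and this shows plain EAV on a "static" RDBMS, not Dynamic Relational.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widget_attr ("
    " widget_id INTEGER, attr TEXT, value TEXT,"
    " PRIMARY KEY (widget_id, attr))"
)
conn.executemany(
    "INSERT INTO widget_attr VALUES (?, ?, ?)",
    [
        (1, "kind", "button"), (1, "label", "OK"),
        (2, "kind", "slider"), (2, "min", "0"), (2, "max", "100"),
    ],
)


def load_widget(conn, widget_id):
    """Reassemble one widget from its attribute rows."""
    rows = conn.execute(
        "SELECT attr, value FROM widget_attr WHERE widget_id = ?", (widget_id,)
    )
    return dict(rows.fetchall())


slider = load_widget(conn, 2)
```

Each kind of widget can carry its own attributes without a dedicated table, but every value comes back as text and the database can't enforce that a slider has a `min`, which is part of why this isn't ideal.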

I'm currently helping re-architect large subsystems inside our multi-million-line C++ codebase (large commercial desktop app) using data-oriented design. It's working quite well so far, but we're only a few months into the project. Will have to wait another year to do a proper post-mortem.

"static" RDBMS support JSON columns which can store arbitrary shaped data.

That's part way there, but I see at least two problems: 1st you treat "regular" columns different than dynamic columns, and have to change your SQL if you switch which "type" it is, and second, it's hard to index a blob of text well. Dynamic relational wouldn't have these issues (if done right). Marking a column to be permanent is like adding a constraint rather than changing container "type".

In pure functional languages, designing around data is essentially the way to get things done. It's one of the reasons why, in combination with other factors, the average dev has to unfreeze so much of their worldview in order to understand how to actually write real programs in these languages.

Data-Oriented Design and Entity Systems store data in a SQL- or NoSQL-like way. A table is an array of structs. It's sorta like MUMPS, or a language with built-in Redis instead of variables, or Lisp with an embedded Prolog.

The problem is that functional doesn't by itself provide any discipline or consistency to the data structures. I prefer the discipline of RDBMS or RDBMS-like systems rather than willy-nilly structures. (Dynamic Relational is perhaps a form of "RDBMS-like".) However, tying/binding RDBMS to application code is still a grey art needing more R&D.
