
Some of the best advice I ever got for writing composable, high-performance code was “work on structs of arrays, not arrays of structs”. I hear many echoes of that advice in this text. Turns out that entity-component architectures work well in line-of-business applications too, not just games.

Alas, many developers in enterprise are rusted onto a record-keeping CRUD model and struggle to think in columns rather than rows. The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything. Rich new opportunities then arise from cross-pollinating component data. Some may question why it is structurally permissible that, say, a network interface can have a birthday, or why an invoice has an IPv6 address, why my cat is in the DHCP pool, whilst limegreen is deleted and $5 on Tuesdays. This of course is half the fun.
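To make the "published table" idea concrete, here is a minimal sketch in Python with SQLite (the table and column names are my own invention, not from the book):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    -- Instead of a boolean 'published' column on posts, membership in
    -- this table is what it means for an entity to be published.
    CREATE TABLE published (entity_id INTEGER PRIMARY KEY);
""")
conn.execute("INSERT INTO posts VALUES (1, 'Hello'), (2, 'Draft')")
conn.execute("INSERT INTO published VALUES (1)")  # publish entity 1

# Published posts come back via a join rather than a WHERE on a flag.
rows = conn.execute("""
    SELECT p.id, p.title
    FROM posts p JOIN published ON published.entity_id = p.id
""").fetchall()
print(rows)  # [(1, 'Hello')]
```

The polymorphism is that `published` holds bare entity ids, so any kind of entity — a post, an invoice, a network interface — can be published without its own table growing a flag column.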

I don’t accept that it’s wholly incompatible with OO, though, a thesis you’ll see dotted around the place. I’ve even taken this approach with Ruby using Active Record for persistence; not normally a domain where the words “high performance” are bandied about. That worked particularly because Ruby’s object system, being more Smalltalk-ish than C++/Java-ish, strongly favours composition over inheritance.




> I don’t accept that it’s wholly incompatible with OO

It's not incompatible with the mechanics of OO, but it does require that programmers change how they approach problems. For instance, a common way to write code in an OO language is to focus solely on the thing you want to think about (a user, a blog post, a money transaction, what have you) and to implement it in isolation of everything else, to hide all of its data, and then to think about what methods need to be exposed to be useful to other parts of the system. The idea of encapsulation is quite strong.

In DOD, it is more common for data related to different domains to be accessible and let the subsystems pick and choose what they need to do their work. Nothing about Java or Ruby would prevent this, but programmers definitely have mental barriers.


> high-performance code was “work on structs of arrays, not arrays of structs”

Wikipedia's article "AoS and SoA" (array of structures vs. structure of arrays) explains the trade-off between performance (SoA) and intuitiveness/language support (AoS): https://en.wikipedia.org/wiki/AoS_and_SoA

They also get into software support for SoA, including data frames as implemented by R, Python's Pandas package, and Julia's DataFrames.jl package, which lets you access an SoA like an AoS.


I’d push back on this by noting that the Pandas and Polars of the world could be a lot better if they supported generic types. The fact that the DataFrame is not connected to your structs/types is a big problem, because everyone everywhere has to write at least two functions, and probably more like 14, to enable basic CRUD operations on structs of arrays correctly.

For example, you need a StructOfArraysToArrayOfStructs function, an ArrayOfStructsToStructOfArrays function, and StructToRow and RowToStruct at least. Making sure those work for all the various kinds of thing in a programming language like Python is no easy feat. Not having those functions built in means everyone has to make sure they work on their own.
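For illustration, here is roughly what a hand-rolled pair of those conversion helpers looks like in Python over dataclasses (the function and type names are mine, and this glosses over empty inputs and nested types):

```python
from dataclasses import astuple, dataclass, fields

@dataclass
class Point:
    x: float
    y: float

def aos_to_soa(structs):
    """Array-of-structs -> struct-of-arrays, as a dict of columns."""
    names = [f.name for f in fields(structs[0])]  # assumes a non-empty list
    columns = zip(*(astuple(s) for s in structs))
    return {name: list(col) for name, col in zip(names, columns)}

def soa_to_aos(cls, soa):
    """Struct-of-arrays -> array-of-structs."""
    names = [f.name for f in fields(cls)]
    return [cls(**dict(zip(names, row)))
            for row in zip(*(soa[n] for n in names))]

pts = [Point(1.0, 2.0), Point(3.0, 4.0)]
soa = aos_to_soa(pts)  # {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
assert soa_to_aos(Point, soa) == pts
```

Even this toy version hints at the problem: it silently assumes flat, uniform, non-empty records, and every team re-derives those assumptions on their own.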


I'd also mention that Clojure has 'tech.ml.dataset', a very performant equivalent to dataframes.

Maybe I'm wrong, but AoS vs SoA seems like an old C false dichotomy that has already been effectively resolved.

If you need to choose between the two then the answer is probably neither - use an appropriate table data structure.

There is an extra layer of abstraction and software, but I don't really see any downsides. I'd love to hear some arguments against it.


The downside is losing the conceptual understanding of why one or the other organization is the fastest for the data in question. From the POV of a general table data structure how does it decide which kind of memory organization is the most performant?

Looking at the examples referenced it looks like sugar over SoA so things look comfortably like normal but SoA isn't necessarily the most performant layout in memory depending on how things are being accessed.


Isn't SoA basically always more performant though? You're either traversing individual "columns" and leveraging cache locality, or you're not and then you can traverse multiple columns in parallel to rebuild the structs on the fly.

I could only see this degrading or being suboptimal if you have a very small struct that still fits entirely in cache (like an array of 3D coordinates or something).


If you’re always accessing all the members then no. SoA shines when you want to access a different small subset of a struct’s members in different contexts. There’s also the option of building temporary arrays of just the data you need and then spinning through them, which can be faster for some things, for example rendering pipelines. Even more so if you cache the results and can use them next frame, thanks to temporal locality. There’s no silver bullet; you’ve got to sit down, look at your access patterns, and profile.


SoA is optimal only if you are iterating sequentially over many elements while touching a subset of the columns.

Really the right layout is access dependent.


In the data side of the world, "structs of arrays" translates to column-oriented storage, i.e. Snowflake and OLAP. "Arrays of structs" translates to relational databases with their page/row-based storage.

FWIW, I am a big fan of Snowflake and think it will eat everyone else's lunch. I also find it amusing that Snowflake "supports" foreign keys but doesn't enforce them. In other words, Snowflake is as "nosql" as I care to go.


> "supports" foreign keys but don't enforce it

As in, you can have a sort of 'link' to another table that not only might be null, but might be a value which doesn't exist over there?

What does the 'support' add over just having a column in which you tell yourself (but not your DBMS) you're storing values which correspond to keys in another table?

(You can take it further - s/keys/values in a column/ - many to one relationships without an intermediary table! Amazing! ...more 'NoSQL' than I care to go I think, for almost anything.)


https://docs.snowflake.com/en/sql-reference/constraints-over...

Note Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.


I don't understand what it means (and that page doesn't explain) for a 'constraint' not to be enforced. Do you somehow get some sort of warning, but not an error - i.e. the insert/update is allowed to work and everything proceeds as normal, but it's there to check on and correct if you want?


Or an in-memory "data frame" like in R, Python's Pandas, and Polars.


As I see it, there are two types of DOD, one that you've mentioned and where you “work on structs of arrays, not arrays of structs”.

The other one means just giving up on encapsulation: separate the data from the methods that work on it, think about the whole app in terms of how the data flows through it, and model it so everything is easy to understand and change. For added correctness, you can use immutable data structures and pure functions.


Isn't this the very definition of functional programming?


No. Functional programming is programming with functions as values. It is not incompatible with data hiding (for example through modules).


"Programming with functions as values" is merely one feature of writing your programs functionally. It does not define the paradigm.


This is exactly right. Good summary.


I've tried to push entity-component-systems (ECS) for non-game applications. A financial company in London took that advice to manage the complexity of their system since it was such a good fit.

For those who are curious, here's a very brief introduction to ECS: https://dev.to/ovid/the-unknown-design-pattern-1l64
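For a feel of the shape of it, here is a toy ECS sketch in Python (no particular library; entities are just ids, components live in per-type tables, and systems are plain functions):

```python
# Component tables: entity id -> component data.
position = {}  # eid -> (x, y)
velocity = {}  # eid -> (dx, dy)

def movement_system():
    # Only entities that have BOTH components participate.
    for eid in position.keys() & velocity.keys():
        x, y = position[eid]
        dx, dy = velocity[eid]
        position[eid] = (x + dx, y + dy)

position[1], velocity[1] = (0, 0), (1, 1)  # a moving entity
position[2] = (5, 5)                       # static scenery: no velocity
movement_system()
print(position)  # {1: (1, 1), 2: (5, 5)}
```

Adding a new kind of thing to the world is just a new combination of components; no class hierarchy has to change.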


Would you mind elaborating a bit on your system? I tried something similar for an algo trading system, but in the end I didn't have enough entities to justify the approach.


I think there's a widespread misconception about the reason ECS exists and what problems it solves. It's a design for a general-purpose game engine where it solves the problem of having some conception of the game world and the things that make it up without knowing in advance what those things will be. This is a common issue with loads of different approaches in game engines.

ECS, contrary to popular imagination, doesn't have to be implemented with data-oriented principles, and there are lots of implementations, particularly in dynamic languages, that use the same design without the performance boost you might get if you did follow them. Once you go down the data-oriented rabbit hole you quickly discover it also significantly complicates the original design.

If you have a specific thing in mind then use the data-oriented principles on that and skip building something general purpose. This works for games as well as business applications. It'll be much simpler and easier to make performant.


"I don’t accept that it’s wholly incompatible with OO, though, a thesis you’ll see dotted around the place."

Agreed. Arrays of Structs is not the part that is really incompatible.

The part that clashes a bit with traditional OOP is ECS, where data and code are meant to be kept separate. But of course I'm talking about very traditional OOP. It is entirely possible to use an OOP language + classes to implement ECS. It's just not gonna be "traditional".

EDIT: To quote your other reply: "records don’t have to identify strongly with a class hierarchy".


Not sure what you all mean by OOP to be honest. Is this a case of OOP actually meaning "good programming", then morphing into the most fashionable (and hopefully best) practices of the day?

It wouldn't be the first time… https://loup-vaillant.fr/articles/deaths-of-oop


Well, I made the case to emphasize the word traditional. By "traditional" I mean OOP that follows the popular practices and pedagogy.

Using ECS means totally eschewing things like encapsulation and inheritance. You stop grouping data and methods together. You basically have "free records" and "free procedures" (or "free methods" if the language requires classes, Kingdom of Nouns style). ECS is a procedural/functional paradigm.

You can argue that this is not OOP anymore, and guess what: I'll probably support you on that! :)


Working with JavaScript has caused me to ask this question: "When should I create a Class, and when a Function?"

This is not a trivial question because class-instances are basically collections of Functions, and Functions can return class-instances. So a trivial answer would be: "It does not matter, you can do everything with both or either one".

No, you shouldn't do everything with both, you should do some specific types of things with classes and other types of things with functions. But what?

I've come to this Rule-of-Thumb: Use classes only to represent and encapsulate data, use functions to process and transform class-instances.

In practice this means that in most cases classes should not have methods which take arguments. Exceptions: You can have setters. You can have a new() method which creates a new instance with data that differs from the data of the recipient. But it should be all about data-creation and extraction.

Classes are still truly useful over plain records, because a class abstracts over what data is stored, what names it is accessed by, and whether a method returns a stored field or a calculated one.

The idea that "everything is an instance of a Class", as in Smalltalk, is a bit flawed, or at least too trivial a piece of advice. It is of course also true that (even in JavaScript) Functions are objects. And in Smalltalk you can use BlockClosures, which really are "functions".

This division of design makes it easier for me to think about the structure of the whole application. I no longer have classes for everything. Instead I have classes which represent and provide access to and creation of data, and a set of functions which do the processing of those class-instances.

Why is this good? Because classes are complicated, and easily become more so: they can have many methods, both static and instance methods, and methods can be local or inherited - you can even use 'super' to add to the confusion. So if you can find a way to force your classes to be as simple as possible, do that. Don't make them do complicated things, since they are complicated to start with.

Dividing the design into two parts, classes + functions, divides the complexity of the whole app into two parts, each containing about half of the complexity of the whole. The complexity is now data-complexity PLUS function-complexity, whereas with classes-for-everything the app complexity would be more like data-complexity TIMES function-complexity (I conjecture).

Keep data (= classes) and functions separate. Divide and conquer complexity.
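Translated into Python for concreteness (the names are made up), the split looks like this:

```python
from dataclasses import dataclass

# Classes only represent, encapsulate, and give access to data...
@dataclass(frozen=True)
class Invoice:
    net: float
    vat_rate: float

    @property
    def gross(self):
        # A calculated value, accessed the same way as a stored one.
        return self.net * (1 + self.vat_rate)

# ...free functions do the processing, returning new instances.
def apply_discount(inv: Invoice, pct: float) -> Invoice:
    return Invoice(net=inv.net * (1 - pct), vat_rate=inv.vat_rate)

inv = apply_discount(Invoice(net=100.0, vat_rate=0.25), 0.1)
print(inv.net, inv.gross)  # 90.0 112.5
```

The class never reaches outside itself; all the cross-cutting logic lives in free functions, so the dependency only runs one way.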


Oh, yes. I totally agree with that philosophy.

The part about classes only acting on themselves resonates heavily with me. To me it's the same with components in frontend apps: the more independent they are, the better.

I'm probably never going to be copy-pasting my classes and components into other apps, but a good class/component should be copy-paste-able without needing many dependencies. This avoids the "banana gorilla jungle problem", as named by Joe Armstrong.

"Free Functions" also help with keeping those classes even more self-contained and decoupled: you can use Function types for callbacks and things like that, instead of having to couple the class to some other class or interface. The only thing the class has to worry about is that the function conforms to the type.

-

"Keep data (= classes) and functions separate. Divide and conquer complexity."

Yep, 100% agree. Amen to that.


Right. Functions and data cannot be fully decoupled, because functions must know what properties of the data they need to access. So functions by necessity depend on data. BUT data, if it really is "data", should NOT depend on functions. So we have reduced what would be a bi-directional dependency problem into a one-directional one.

In a typical OOP design there would be no clear rules as to who can depend on whom and who cannot.


> you may start wanting to use such approaches to data for everything

Please, don't. Arrays of structs is more natural for a reason; SoA is not universally faster and has some performance problems of its own. If you really care about performance, AoSoA may well be your best bet. Again, please don't use it for everything.

> a dhcp pool full of cats is half the fun

Replacing the weird code that the last guy wrote for his own amusement is not fun.


> The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally.

It doesn't come naturally because now you need a JOIN in your SQL just to fetch what was before a column. Or two queries instead of one.

Not to mention having closely related data spread in different tables increases cognitive load.

You just added a layer of indirection for what gain precisely?


Reduced column bloat, easy reusability, easy reporting and data extraction, ease of bulk transformation (especially stream processing), but all this is essentially repeating the book, which I recommend as an excellent read.

Contrary to the claim above, there is no additional layer of indirection simply from rotating one’s record-keeping perspective by ninety degrees. I have to reject outright the suggestion that joins are unnatural in a relational database. The opposite is true: there’s nothing more natural to a relational database than a join, and that’s even despite SQL barely paying lip service to Codd’s relational algebra. Heck, using a join instead of a predicate may even be faster, not least because decomposing entity data into recomposable per-component tables means less data to scan. And if it’s instead of an indexed query, well, an index is a relation too, so in terms of schema complexity that’s a wash. As for cognitive burden, how about the nails-down-a-blackboard dissonance of dealing with tables that grow ever further rightwards with each new product manager.

None of this is to downplay the sheer productive rapidity of letting your template generator spit out a bog-standard CRUD app, it’s more to undermine and challenge assumptions about what deserves to be the conventional default, and to suggest that we can safely and reasonably change perspective to allow a little programmer happiness in.


> Reduced column bloat, easy reusability, easy reporting and data extraction, ease of bulk transformation (especially stream processing)

Respectfully disagree. For one, reporting is easier when you don't have to JOIN more tables, not harder. Same for data extraction. And how can bulk transformation be harder just because a column is where it would intuitively reside and not in another table who knows where?

> I have to reject outright the suggestion that joins are unnatural in a relational database.

I'm afraid you misunderstood. It's unnatural to me, the developer, not to the database, to spread closely related columns across different tables.

And "tables that grow indefinitely" is doing a lot of heavy lifting for those arguments. Most tables don't grow indefinitely.

Like I said, smells like premature optimization. I optimize for humans first.


I think they're arguing that you get a performance gain. Personally, the systems I deal with wouldn't gain from such an optimisation, as there is too little data. So I'll stick to the simplest approach.


I agree with you. Splitting db tables can be a net positive in very specific cases but it smells like premature optimization.


This architecture completely ruins your write performance. There's a reason that running two databases, one OLTP and one OLAP, is the norm.


Very insightful. I never thought about it like this, but it's common practice in embedded systems to approach problems in this way. One module may provide a statically-sized pool of like objects, and other modules extend behavior in relation to those objects by carrying buffers of pointers and/or indexes into that pool. It might feel inefficient due to the storage of so many pointers/indexes, but you're optimizing the size of the largest thing: the object pool itself.


Wouldn’t structs of arrays affect cache locality adversely if, say, you’re reading and processing the data in sequential order? You’d have to jump between memory addresses that are pretty far apart to read a single record / set of fields, and potentially trigger cache misses, since they’re likely spread out across different memory pages and you end up filling your L1/L2/L3 cache?


It depends on the access pattern. If you are performing an operation where you need to access every field of each object, then yes, you'd be better off with an array of structs. But at least in games, it seems more common that you'd need to e.g. get just the x-y coords of every monster, in which case struct-of-arrays is better.
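A toy version of that access pattern in Python (plain lists stand in for contiguous arrays; with real numeric arrays the layout benefit is literal):

```python
# Array of structs: every field of every monster interleaved.
monsters_aos = [
    {"x": 1.0, "y": 2.0, "hp": 30, "name": "slime"},
    {"x": 4.0, "y": 5.0, "hp": 80, "name": "ogre"},
]

# Struct of arrays: one column per field; index i is monster i.
monsters_soa = {
    "x": [1.0, 4.0],
    "y": [2.0, 5.0],
    "hp": [30, 80],
    "name": ["slime", "ogre"],
}

# Grabbing every monster's coordinates touches only the x and y columns,
# instead of striding past hp/name data it never reads.
coords = list(zip(monsters_soa["x"], monsters_soa["y"]))
print(coords)  # [(1.0, 2.0), (4.0, 5.0)]
```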


>The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything.

Wow thank you for this, it's a heuristic I didn't think about but is really powerful.

I need to think through some of the implications, because I think there are some risks to this approach - namely co-mingling production and non-production data in the same infra. That means there are data at rest and in transit following the same data pathways but with different production criticality. It puts a lot of risk on the filter working perfectly, rather than the data not even being available on the same infra.

Just spitballing:

Is the ostensible "prod_live_bool" flag manually set? If not, then a bug in any automation would certainly cause a nasty data-exposure issue.

Are you doing column level security tokens? Wouldn't that need some kind of intra-table RBAC? In other words, it seems like if you want any RBAC within your table, then you have to bring it with you every time from the beginning because you have no idea what level of data sensitivity you'll have eventually, and refactoring an increasingly expanding table to inherit RBAC later ---- omg I can't even think of the amount of work that would be. O(n^2) level of manual work??

Does it lead to a canonical "live or not" lookup table/dict at scale?

I think the data security risks would prevent me from using this design pattern for critical applications in MOST cases, but I do love some of the patterns here and will be exploring this in the future for sure!


I interpreted `published` along the lines of blog posts, in which an article can be either a draft (visible only to the author) or published (visible to the world). This seems different from dev/prod separation, where having separate databases altogether makes sense.

I understood the column-first approach more as an alternative to putting all columns for an entity in one table, especially when rows often don't populate every column. From that perspective, what's being described is a strong separation of concerns; applying this to dev/prod would be a weakening of this separation, and so probably not what is desired.


I don't think you're talking about the same thing as the parent comment.


> I don’t accept that it’s wholly incompatible with OO

I don't think it is, not even in Java. If you use its primitive arrays and value types, you can easily wrap them in a data wrapper class with functional interfaces, which can then be used quite elegantly in the rest of your code in an OO way - you just don't box all the records into objects.


Agreed. Honestly, I think the hardest part here is programmer mindset - when you suggest that identity is fluid and records don’t have to correspond to a class hierarchy, some folks get a panicked look going on, like you just claimed to eat their pets.


Yep. Funny story: this was actually coding-interview homework, a question involving on the order of millions of data rows from a file. The exercise had a 10s timing requirement, so I decided to do it data-centric from the start. The interviewers found my style 'non-idiomatic', but probably never thought about why my version finished in a fraction of a second.


Did you pass the interview?


I did and got the offer. In the end it didn't work out however (better offer elsewhere).


> Alas, many developers in enterprise are rusted onto a record-keeping CRUD model and struggle to think in columns rather than rows. The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything

I don't really understand what's polymorphic about this, or even beneficial. It seems like every time I've had a boolean column in a long-standing application, it eventually needed to turn into something else.



