Databases 102: Row versus Column Orientation (2014) (ibmsystemsmag.com)
48 points by brudgers on July 2, 2015 | 2 comments


While the row-oriented and column-oriented storage models laid out in the article are classic database organization patterns, modern database engines often use neither as their primary storage model. Each still has its use cases (pure OLTP for row-oriented, cold storage for column-oriented), but many workloads these days have a high insert rate concurrent with light analytic queries, and that workload is served poorly by either storage model.
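
To make the distinction concrete, here is a toy Python sketch (not from the article; the records and field names are invented) of the same three records stored both ways:

    # The same three records stored row-oriented (one tuple per record)
    # versus column-oriented (one array per field). All values are made up.
    rows = [
        (1, "alice", 34),
        (2, "bob",   41),
        (3, "carol", 29),
    ]

    columns = {
        "id":   [1, 2, 3],
        "name": ["alice", "bob", "carol"],
        "age":  [34, 41, 29],
    }

    # OLTP-style access: fetch one whole record -- cheap in the row layout.
    print(rows[1])

    # Analytic-style access: aggregate one field over all records --
    # cheap in the column layout, since only one array is touched.
    print(sum(columns["age"]) / len(columns["age"]))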

Designing a general-purpose database engine today, I would probably use some variant of a vectorized row storage model. Rows are stored within a single page but are decomposed within that page into per-column vectors; the organization can be quite complex, and many queries need little or no row reconstruction. Page sizes are usually larger than in OLTP engines (64-256 KB versus 4-16 KB), which suits modern disk storage.
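
A minimal Python sketch of that idea, with the caveat that the Page class, its methods, and the example values are all invented for illustration rather than taken from any particular engine:

    # A page owns a batch of rows, but inside the page each column is kept
    # as its own vector (PAX-style layout). Names and sizes are assumptions.
    class Page:
        def __init__(self, column_names):
            self.vectors = {name: [] for name in column_names}

        def append(self, row):
            # An insert still writes one logical row, so insert cost stays
            # close to a row store's.
            for name, value in zip(self.vectors, row):
                self.vectors[name].append(value)

        def scan_column(self, name):
            # Light analytics touch only the vectors they need, as in a
            # column store, with no row reconstruction.
            return self.vectors[name]

    page = Page(["id", "name", "age"])
    page.append((1, "alice", 34))
    page.append((2, "bob", 41))
    print(sum(page.scan_column("age")))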

Vectorized row storage, designed and implemented competently, often approaches the insert performance of a row store and the light-analytics performance of a column store. Column-oriented storage is often more space-efficient for cold data, but depending on the use case it may not be significantly faster for queries.

Classic row-oriented and column-oriented designs are at opposite extremes of a set of tradeoffs. Many newer database engines sit at neither extreme while enjoying most of the benefits of both.


This is a terrible article.

While arguably a single infographic would have sufficed, as far as text goes https://en.wikipedia.org/wiki/Column-oriented_DBMS is far clearer.

People who sit down at your table at the beginning of a database project and want to make a binary decision about row versus column orientation as part of an off-the-shelf database selection process have simply drunk too much Kool-Aid. Why?

Firstly, performance should generally be optimized after the fact (with real, measured numbers on real systems), not prematurely in etch-a-sketch land. Commercial solution providers are usually happy to bring their systems in to do battle with the enemy for a lucrative contract; just make sure you give open source a fair seat at the table as well.

Secondly, the real driver of performance is the frequency and type of queries (both reads and writes), which are almost always unpredictable at the architecture (read: 'immaculate conception') stage, well ahead of real use, and which therefore must be continuously re-evaluated as the system evolves.

Perhaps instead, a key factor in modern database engine selection is the degree of parallelism and consistency required. See https://en.wikipedia.org/wiki/Consistency_model

A good general-purpose strategy for 90% of cases is to start with ACID and strict consistency on a general-purpose RDBMS, then relax your constraints over time based on measurements of real-world performance as your use case becomes more clearly defined. You can usually swap an RDBMS out for a NoSQL solution fairly easily through code generation, since SQL is by definition very well defined. For the remaining 5-10% of cases, don't assume that building on a distributed database from scratch is a good idea unless you know you really need it; if your API is clean, you can always retrofit parallelism.

A second tip: sharding your database (splitting one fat RDBMS into multiple databases) is often a great workaround early in a project, if it is workable in your consistency model, to avert the well-known failings of RDBMS overcentralization (performance problems along the lines of "I can only buy a server so fat", downtime when full backups, upgrades, or new replication models are required, etc.). It can also have the side benefit of protecting against architectural lock-in to any one particular DB engine (i.e. immobility toward NoSQL solutions if that becomes required later).
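
For illustration only, a hedged Python sketch of the key-to-shard routing such a split implies; the connection strings and hash choice are placeholders, not taken from the comment or tied to any particular engine:

    import zlib

    # Each shard is an independent database instance; strings are placeholders.
    SHARDS = [
        "postgresql://db0.internal/app",
        "postgresql://db1.internal/app",
        "postgresql://db2.internal/app",
    ]

    def shard_for(key: str) -> str:
        # Stable hash so the same logical key always routes to the same shard.
        return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

    print(shard_for("customer:42"))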



