
Data denormalization is broken - endswapper
https://hackernoon.com/data-denormalization-is-broken-7b697352f405#.cxyg11a3r
======
dasil003
I'm usually the guy arguing against microservices, but I think this is a good
example of why they shine. It's much easier to manage the updating of these
synthesized numbers if you have a discrete service that limits the things you
can do at a higher level. So instead of worrying about who is updating the
data store directly, you push it up to a higher service-level interface where
you have a finite list of operations that are supported, and the scope is
small enough that one architect can understand the system in its entirety and
make smart decisions about how to compute/store/cache these synthesized data.

That said, if you have fewer than 10 engineers you shouldn't even be thinking
about this. Way too many young engineers get hung up on scalability when they
don't even have anything on the horizon that will require scaling. Couple that
with a general ignorance of SQL (because it's old and boring) and a failure to
realize how much performance you can eke out of a well-tuned RDBMS with a
proper schema and queries. A SQL DB is an incredible hedge against an uncertain
future: keep your data tidy and you can keep pivoting until you find something
you actually _have_ to scale.

~~~
derefr
Speaking of RDBMSes, you can achieve this "limited API" by just creating an
RDBMS user with limited privileges (can query a subset of tables; can make
changes through a fixed set of stored procedures.)
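
For instance, a minimal sketch in PostgreSQL (the role, table, and procedure
names here are made up for illustration):

    -- Application role that can read two tables and write only through one
    -- stored procedure (assumed to exist); everything else is off limits.
    create role chat_app login password 'secret';
    revoke all on all tables in schema public from chat_app;
    grant select on rooms, room_users to chat_app;
    grant execute on function post_message(bigint, bigint, text) to chat_app;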

Personally, I never understood why everyone thought it was necessary to create
HTTP API layers in front of every database, rather than just standardizing a
public-Internet SQL wire-protocol between different RDBMSes and giving them
all the ability to enforce per-user constraints on things like maximum query
execution cost.

~~~
s_kilk
> I never understood why everyone thought it was necessary to create HTTP API
> layers in front of every database

Mostly because you usually need to do some kind of logic that's either
difficult or impossible to represent in SQL views or functions. For example,
how would your SQL view access something like redis? or any kind of external
cache? or an HTTP service? This is the kind of stuff that's pretty common in
HTTP Controller functions, and would be a nightmare to move into the database.

(Note: I'm generally in favor of using SQL views/functions for data-facing
logic, just, y'know, not a silver bullet.)

~~~
derefr
> For example, how would your SQL view access something like redis? or any
> kind of external cache? or an HTTP service?

Smug comeback:
[https://wiki.postgresql.org/wiki/Foreign_data_wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
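
That page lists wrappers for Redis, generic web services and more. For flavor,
a rough sketch with postgres_fdw (server, host and schema names invented):

    -- Expose tables from another Postgres instance as local foreign tables.
    create extension postgres_fdw;
    create server analytics_db foreign data wrapper postgres_fdw
        options (host 'analytics.internal', dbname 'analytics');
    create user mapping for current_user server analytics_db
        options (user 'readonly', password 'secret');
    create schema analytics;
    import foreign schema public from server analytics_db into analytics;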

But your point still stands; no _other_ RDBMS has those, so it can't really be
an assumption.

Still, there is a mismatch of paradigms here. The people who built the RDBMSes
assumed that everyone would _import_ data into them from wherever else. _You_,
the user (or you, the application developer) shouldn't need to query a
third-party web service when doing your database query; something _else_ was
supposed to do that query in expectation of your need, and shove the results
_into_ the database, so you could just treat that data as another table of
useful information to JOIN on.

People who front databases with web APIs tend to treat the database itself as
a dumb store—one among many—that's supposed to quickly return "raw" data from
single tables so that the _web service_ can chew it together with other data
to compute answers to questions. But, by doing so, they're in effect
_creating_ their own bespoke Database Management System: the web service
becomes the thing that does efficient OnLine Analytical Processing of manifold
backing datasets. (Or, at least, _attempts_ to.)

A DBMS's job is to ingest data and process queries to spit out answers. It's
not your job; it's not your app's job; it's the DBMS's job. The DBMS has
decades of optimization for that use-case; and in fact, that's the meaning of
the word "database management system", when defined in opposition to a "flat-
file database" like BDB or LevelDB. The fact that there's a _daemon_ sitting
there between you and your databases allows the daemon to _index_ and _join_
and _expose views_ and _run checks_ and _enforce rules_ and on and on and on.
It allows the daemon to do _asynchronous, streaming imports_ of data-sets from
remote sources, so that its data is always up-to-date. It allows the daemon to
be there for things like _crawler agents_ to dump their results into, rather
than dumping them to something like an S3 bucket where _you_ later have to be
the one to _import_ them in. In other words, being a daemon allows the DBMS to
be the service that you just come to and _ask questions_ —arbitrarily-complex
analytical queries—and get answers.

But that only works if the database management system itself is given access
methods to all the _databases_ that contain the data to answer your question.
And, in modern practice, we've essentially forgotten that that is a thing that
we ever wanted to do. Who would teach a _database management system_ anything?
Database connector libraries don't belong in DBMSes; they belong scattered
throughout our toppling edifice of _RESTful API microservices_!

(And if you argue that we still need those RESTful APIs for things like
browser-clients wanting to do queries using AJAX: there's no reason DBMSes—not
just document stores but full-fledged OLAP DBMSes—can't speak HTTP as well.
CouchDB is a great but strangely-unique example. Adapter gateways also exist,
like PostgREST.)

------
nostrademons
Interestingly, this article reflects a really common misconception about
database normalization:

 _The relational algebra that normal forms are expressed in has no concept of
aggregate functions._

The example the article gives uses fields like numUnreadRooms or
lastSeenTimestamp. These are derivable from the contents of other tables with
SQL aggregates like COUNT() or MAX(), but the relational algebra itself would
treat them as facts of the relation itself, dependent upon the primary key.
And they'd be updated by the application itself.

In other words, _having numUnreadRooms or lastSeenTimestamp in your schema
does not make it denormalized_.

I was under a similar misconception for the first 5-6 years of my career, and
then when I realized that it's perfectly fine to store these aggregates
inline, my database queries suddenly started performing acceptably.

As for how to update them - in non-broken databases, the solution is usually
_triggers_. A trigger is a stored procedure that runs when a database operation
makes some condition true, and it executes entirely in the DB with full
transactional awareness. So you could listen for inserts and deletes
and then update the count on them, and if your insert failed or the
transaction was rolled back, your trigger would never be executed. This keeps
data integrity in the database, but still gives you efficient queries and an
application interface that's consistent.
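
A rough illustration in PostgreSQL (the messages/room_users tables and the
unread_count column are hypothetical stand-ins for the article's schema):

    -- Keep a per-user unread counter in sync as messages are inserted.
    create function bump_unread_count() returns trigger as $$
    begin
      update room_users
         set unread_count = unread_count + 1
       where room_id = new.room_id
         and user_id <> new.sender_id;
      return new;
    end;
    $$ language plpgsql;

    create trigger messages_bump_unread
    after insert on messages
    for each row execute procedure bump_unread_count();

If the surrounding transaction rolls back, the counter update rolls back with
it.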

~~~
catnaroek
Materialized views are a superior, or at least less error-prone, alternative
to triggers, since they don't depend on the programmer's ability to get state
transitions right (whenever you manually update aggregates).
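
A sketch of what that looks like in PostgreSQL, against the same hypothetical
schema as the article's example:

    create materialized view unread_room_counts as
    select ru.user_id, count(distinct ru.room_id) as num_unread_rooms
      from room_users ru
      join messages m on m.room_id = ru.room_id
     where m.sent_at > ru.last_seen_at
     group by ru.user_id;

    -- Postgres doesn't maintain these incrementally; you re-run the query:
    refresh materialized view unread_room_counts;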

~~~
nostrademons
What are the performance characteristics of materialized views? My
understanding is that they require running & storing the full query. For a
SELECT COUNT(*) FROM TableName this is trivial, but if it's SELECT COUNT(*)
FROM BigTableName WHERE ComplexWhereCondition it can get very expensive.

For the simple count/max aggregates on a single relation that the blog post is
talking about, the trigger logic seems straightforward - you just need a
trigger on a single table for insert/delete/truncate (plus update, for max or
if there are conditions). If there's more than one relation involved or if the
condition is complex, then it can get pretty complicated.

~~~
catnaroek
> My understanding is that they require running & storing the full query.

This is right. Having a single materialized view containing nested queries is
not a good idea. But you can factor out the nested queries into their own
materialized views. The propagation of updates between materialized views can
achieve an effect similar to that of complex triggers.
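
Something like the following, say (hypothetical schema again; in PostgreSQL
you'd still have to refresh the views in dependency order):

    -- A per-room view feeding a per-user view.
    create materialized view room_last_message as
    select room_id, max(sent_at) as last_message_at
      from messages
     group by room_id;

    create materialized view user_unread_rooms as
    select ru.user_id, count(*) as num_unread_rooms
      from room_users ru
      join room_last_message rlm on rlm.room_id = ru.room_id
     where rlm.last_message_at > ru.last_seen_at
     group by ru.user_id;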

I should've been more careful with my original comment, though. I'm not saying
triggers are worthless. But, whenever a materialized view can do the job, it's
usually preferable.

------
aristus
TLDR: cache invalidation is hard, even if you don't call it cache
invalidation.

More seriously there are plenty of ways to make this less painful, eg, adding
a "lastMessageTime" to the Room table, which would reduce the work to a simple
intersection of RoomUser and Room. Facebook uses a more complicated system
involving "graph" relations and a global ID space, but it's more or less the
same thing.
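
Concretely, the lastMessageTime variant boils down to roughly this (column
names assumed, since the article doesn't pin down the schema):

    -- Unread-room count as a simple join, no scan of the Message table needed.
    select count(*)
      from RoomUser ru
      join Room r on r.id = ru.room_id
     where ru.user_id = :user
       and r.lastMessageTime > ru.lastSeenTimestamp;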

Also, "normalized" is not a binary thing. There are several levels from 1NF
(flat file, fully denorm) on through 3NF and BCNF (what the SQL books usually
call "normalized") on up to 4th, 5th, and 6th normal forms, some of which are
academic party tricks but sometimes see use in production.

~~~
aristus
To add to this: every serious production system tends toward a slurry of
2NF-3NF-BCNF models, mostly for performance reasons. The case of
lastMessageTime is a simpler one: it might drift out of date, but generally
you get false positives and never false negatives. Note that people complain
about Facebook message jewels showing up when there's nothing new but seldom
complain about _not_ getting notified.

Another curious example is columnar databases. They are basically a 4NF
storage layout with a veneer of 3NF syntax for ease of querying.

------
combatentropy
I don't think, as the article claims, data denormalization is broken. I think
most programmers' knowledge of SQL is scrawny --- unexpectedly so for a
language that's 30 years old and underpins most applications. SQL was
obviously meant to be simple, almost declarative, and resembling plain-English
sentences. And yet it's the first thing that developers abstract away with an
Object Relational Mapping, much to their loss.

The article's example is one-to-one chat with millions of users. It says that
counting unread messages is too slow unless you denormalize it.

First of all, Rooms isn't a core table. If each Room has just two people,
as the article says, then all you need is one table, messages:

    
    
        a  |   b   |        sent         |      body      |        read
      -----+-------+---------------------+----------------+---------------------
       ed  | tom   | 09/29/2016 14:22:10 | Hi             | 09/30/2016 02:02:15
       tom | ed    | 09/30/2016 02:03:15 | What up?       | 09/30/2016 02:04:03
       sue | sally | 09/30/2016 08:19:20 | Catch ya later |
    

So the count of unread rooms is:

    
    
      select count(1) from (select distinct a from messages where b = :user and read is null) t;
    

If the columns are indexed, I would think PostgreSQL could handle such queries
in a split second, even with millions of rows.
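
For example, one way to index it is a partial index on the recipient (column
names match the table above):

    -- Covers the "unread messages sent to :user" lookup directly.
    create index messages_unread_idx on messages (b, a) where read is null;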

At Facebook's scale, where you have hundreds of millions of users, you will
have hundreds of billions of rows, maybe trillions. Okay, so it might start to
get slow at this point. But let's stop for a second and remember that 99.999%
of web applications will never reach Facebook's scale.

In Facebook's case, I read a long time ago that they make a database for each
user. If so, the aforementioned table and query would still work --- in fact
without the need for the b column.

~~~
Tenhundfeld
I agree with most of what you said, but one thing bothers me:

> And yet [SQL]'s the first thing that developers abstract away with an Object
> Relational Mapping, much to their loss.

I see that sentiment repeated often, but I really don't understand it. A
decent ORM does two things:

1) It makes writing simple queries better – more maintainable, easier to
compose, easier to secure against injection, smoother to write without
constantly translating between your domain language and SQL.

2) It gets out of your way when you need to take the reins. If it is difficult
to make your ORM run raw SQL queries and map the results back into domain
models and primitives like array, hash, whatevs – it's time to find a better
ORM.

For me and many decent devs I've worked with, an ORM is not about hiding SQL.
It's about automating the simple, routine stuff and letting me focus on the
important, complicated parts – including highly-optimized SQL queries, at
times.

~~~
barrkel
If someone's primary perspective of the database is through the lens of an OO
graph, they will write code that performs abysmally. Graph navigation from row
to row is antithetical to all the power relational algebra gives you. That's
the biggest risk: viewing the DB primarily as an object store.

For the app I work on, the DB is only viewed as an object store for very small
sets of data - configuration, specifically. It's impossible to map it to
objects in any realistically performant way for our user's data, not least
because that would require the data round-tripping through the application.
It's much, much more efficient to execute my multi-million row updates and
inserts on the database server directly rather than drag the data down to the
app, tweak it slightly, then persist it back.

~~~
icebraining
On the other hand, isn't that just a sign of a weak ORM? For example, in the
platform I use, when I "fetch" some records by ID, it doesn't actually issue a
SELECT; it just collects the ID and returns a proxy object. Only when I access
an attribute of that object, it'll issue a SELECT for all the IDs in question.

One could extend that for more complex operations - instead of immediately
executing the operations, it'd collect the operations and convert them into
efficient queries. In a way, this is already done in "fluent interfaces" like
LINQ, but with a more flexible language (Lisp?), I'm sure one could use
standard control flow instead of a custom layer.

~~~
GrinningFool
_> For example, in the platform I use, when I "fetch" some records by ID, it
doesn't actually issue a SELECT; it just collects the ID and returns a proxy
object. Only when I access an attribute of that object, it'll issue a SELECT
for all the IDs in question._

In my opinion this is the least desirable behavior. It abstracts the data
layer away to the point where you don't think about what's happening under the
hood when you type foo = Klazz.new("id"); foo.bar; foo.baz;

Was that one, two, or three separate queries? Did a query run at all? Does it
matter? After all, you have an object to work on, right in your language of
choice.

Sure the ORM may be smart about it and do the right thing - I have no doubt
that most decent ones do. But that misses the point - as the person using
this interface you get divorced from the direct awareness that this data has
an origin, and there is a cost to access it. After all, you're just
instantiating an object.

Patterns that use interfaces that make it so simple to do a thing that you
don't need awareness that you're doing it tend to degrade quickly after the
original engineers who _did_ have that understanding leave a project.

~~~
icebraining
That's a fair point; on the other hand, aren't SQL queries themselves divorced
from the actual data layer? Knowing whether the ORM issued one or two queries
doesn't tell you if you're fetching data from disk or memory, whether it's
making an index scan or full table scan, whether the join fits in working
memory or will hit the disk, etc.

Isn't knowing what queries are issued giving you a false sense of knowledge
about how the data is actually fetched?

And much the same way you can request a query plan, the ORM could provide the
list of queries issued for a given piece of code.

------
makmanalp
So I'm doing a lot of academic reading around databases, and people have been
thinking about this sort of thing for a while. Here's one approach:

> In this paper, we study the complementary approach, addressing index
> maintenance as part of query processing using continuous physical
> reorganization, i.e., cracking the database into manageable pieces. The
> motivation is that by automatically organizing data the way users request
> it, we can achieve fast access and the much desired self-organized behavior.

> The cracking approach is based on the hypothesis that index maintenance
> should be a byproduct of query processing, not of updates. Each query is
> interpreted not only as a request for a particular result set, but also as
> an advice to crack the physical database store into smaller pieces.

[http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf)

The references are also pretty good.

edit:

Here's one more specifically about (de)normalization:

[http://stratos.seas.harvard.edu/publications/main-memory-adaptive-denormalization](http://stratos.seas.harvard.edu/publications/main-memory-adaptive-denormalization)

But it's much newer and lighter on info.

~~~
ComodoHacker
Interesting read, thank you.

Where can I track recent interesting papers in databases?

~~~
makmanalp
Unfortunately I'm as clueless on this as you - I'm mostly going on the papers
that my professor suggests (CS165 and CS265 at Harvard; the syllabus is
public). Though VLDB and CIDR seem to be the two conferences that publish the
most famous papers. Eventually you start learning the names of the authors,
etc.

------
hosh
Martin Kleppmann was presenting architectures that solve this problem:

[http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/](http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/)

[http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/](http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)

He does not just identify secondary indices and materialized views as possible
candidates, but also other applications such as caching and search indexing as
within the same class of problem.

Here, rather than trying to figure out a tradeoff between normalized and
denormalized views, Samza and Kafka are used as the heart of what the above
article was talking about as "generalized denormalization engine". It's done
by capturing data change as a stream to incrementally build whatever
denormalized view you want.

The downside is that this is still fairly clunky if you have the expectation
that this should be built into the database. Kleppmann's ideas break it out
into a whole infrastructure to support that. On the other hand, if you do have
a full-time ops team, it sounds like it can work very well at scale.

------
TurningCanadian
I skipped around a bit in the article, but didn't see them consider having
database updates simply trigger a cache invalidation (i.e. set it null) of the
denormalized data instead of doing a full recomputation every time the data
changes. That way you only pay the price when you need it, and it still allows
for caching. You can prewarm the cache with a cron while still allowing for
on-demand calculation if the cache is cold.
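
One way to picture it is a trigger that nulls out a denormalized column instead
of recomputing it (hypothetical names; NULL here means "stale, recompute on
read"):

    create function invalidate_unread_count() returns trigger as $$
    begin
      -- Cheap on write: just mark the cached value stale.
      update room_users set unread_count = null where room_id = new.room_id;
      return new;
    end;
    $$ language plpgsql;

    create trigger messages_invalidate_unread
    after insert on messages
    for each row execute procedure invalidate_unread_count();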

~~~
repsilat
I think TFA was more about "when to invalidate/recompute" than "how to
invalidate/update" \-- that you need to take care to enumerate all of the
times the denormalized value may get out of sync with the rest of the data.

(This is a reasonable idea though, depending on access patterns. Can increase
latency, can decrease computation.)

------
josephg
I've been thinking about this a lot recently, and I'm convinced we need a
Storm[1]-like architecture for databases. Something which would allow views /
denormalization / indexing to run in separate processes. Preferably in the
same language the app developers speak. And at the same time be part of the
database transaction logic - triggered idempotently _by the database_ so you
always see a consistent view of the data no matter how you update & query it.
(You can do this efficiently with MVCC if you proxy queries through the
indexer / denormalizer / view logic / whatever you want to call it).

I've scattered some more thoughts about this here, but it's a big topic:
[https://josephg.com/blog/composing-databases/](https://josephg.com/blog/composing-databases/)

[1] [http://storm.apache.org/](http://storm.apache.org/)

~~~
jerf
"triggered idempotently by the database so you always see a consistent view of
the data no matter how you update & query it"

As I sit here trying to build the thing in my head that would do that, I see
some major performance issues arising. Specifying how you want the data
denormalized isn't that big a deal. Taking a frozen normalized database and
denormalizing it according to that spec is a somewhat bigger deal; the
performance implications are non-obvious because the denormalization can
involve arbitrary amounts of data in arbitrary combinations in arbitrary
orders, even _before_ we let the denormalization specification be Turing
complete. Mathematically it all works but if in practice my denormalization
pass is randomly accessing huge swathes of the database every time I insert a
row that's not going to work very well.

If we're trying to avoid slow queries when the user asks for data, and we're
trying to denormalize in advance, we're basically trying to have our already-
presumably-pretty-busy database pre-emptively _guess_ all possible slow
queries it _might_ run in the future, and run them now.

And then run them again when a small change comes in, having possibly never
used the first set of denormalized, expensive queries I just ran.

That seems... likely to generate performance problems, to understate my case
by a lot. Even accounting for possible efficiencies the denormalizer might be
able to take advantage of; there's still the bound of what has to be written
out.

OK, so that doesn't really work. But it doesn't need to; the database could do
this more lazily. But, doing it fully lazily is just the same as the case
where you use the normalized data, so we can't do it fully lazily. Somehow the
database needs to figure out what it should and should not precompute. Well,
that doesn't sound like _that_ big a deal, right? No, it's a nightmare to even
begin to specify what that might be, in the face of full Turing chaos. I can
almost see how one might try to apply some basic "learning" algorithms to the
task of pre-computing things partially, but it's getting pretty complicated in
the backend there. There's an awful lot of symbolic computation there where I
really want things to just be computing for performance.

Furthermore, if one insists on consistency in the denormalized fields... yeow,
that's complicated. Again, it's not hard to specify the math but making it
happen would be crazy insane. Oh, and I don't mean _distributed_
consistency... I mean even consistency on a single busy host is going to be a
challenge. A feasible challenge, certainly, but a challenge that shouldn't be
ignored.

It seems to me there might be a use case for this to be a bullet-point feature
for a database, for developers to very, very carefully use on very high-use
fields. But it wouldn't be very hard for a real-world denormalization engine
to become a performance nightmare. Or to end up taking more "index tuning" to
be usable than the code we already have.

~~~
josephg
When an update happens the trick is going to be figuring out an update
mechanism for the derived data without needing to run more queries. You're
probably right - in the general case that's really hard to do automatically.
But lots of aggregate values can be calculated based on an update without
reference to the original set. `SUM()` and `MEAN()` are easy. `MAX()` and
`MEDIAN()` are hard. In the case of the notification count query from the blog
post, you can do that efficiently with 3 map reduces:

1. Given a list of `(message, room, timestamp)` triples, generate `(room,
max(timestamp))` tuples. In general max is expensive to maintain incrementally,
but in this case the message collection is immutable, so the rule to update
max(timestamp) on an insert doesn't need any extra processing.

2. Given `(room, message_timestamp, read_timestamp)` generate `(room,
message_timestamp > read_timestamp)`. Easy.

3. Then a reduce to calculate the sum.
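
For reference, here's roughly what those three steps compute, written as a
single query over an assumed schema:

    select count(*)                                   -- step 3: the sum
      from (select room_id, max(sent_at) as last_msg  -- step 1: per-room max
              from messages
             group by room_id) m
      join room_users ru on ru.room_id = m.room_id
     where ru.user_id = :user
       and m.last_msg > ru.read_timestamp;            -- step 2: compare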

One of the things I really like about the multi-process architecture is the
diversity of solutions we should have access to here. Right now your data
lives in one vendor's system. It's really hard to use a map-reduce job if your
primary data store doesn't support it. And map reduces might be too
complicated for simple workloads - so with small data sets you might want
something simpler. With this architecture different vendors could provide
different indexing systems and you could swap them out as your requirements /
data size changes.

Finally as for consistency, I've got a bunch of ideas of different ways to
make it work. It's not so bad because it's not a generic graph or consensus
problem. It's just a DAG. You can give each primary store a unique name and
version using vector clocks. (If each primary store is itself distributed
that's fine - just use whatever versioning system the primary store uses as the
value for that database in the vector clock.) Then when a query comes in that
needs to hit both an index and a primary store, pick the most recent database
version and query asking for the result set at that version. You'll need some
MVCC to make it work - but that's a solvable problem. If the named version is
too new for the index, the query stalls until the index sees the corresponding
update from the primary store. If the version is too old (and it's expired out
of the MVCC working set) then we punt back to whatever is running the query
and it retries. The whole system is much simpler than a distributed consensus
problem because each node in the DAG can be thought of as authoritative, even
if they're individually implemented as a small redundant set.

------
akavi
If I've understood correctly, it feels like what the author wants is a DSL for
event-driven FRP[0]. If so, I'm 100% on board. It's something I've wanted for
a long time.

I've found it easy enough to implement a naive version. The tricky part,
unsurprisingly, is making it performant.

[0]
[https://en.wikipedia.org/wiki/Functional_reactive_programmin...](https://en.wikipedia.org/wiki/Functional_reactive_programming)

~~~
Liron
Author here. Yep, I basically want my database to be a giant Excel spreadsheet
that includes formula cells (a.k.a. denormalized fields).

------
briHass
To some degree, this looks like premature optimization. That query (on the
normalized tables), with the proper indexes and decent hardware, should be
plenty fast, even with millions of rows. I would potentially start my
optimization by considering if messages could be 'archived' to other tables or
considering how long I could cache the query results vs. how often the client
would poll.

I've seen small-medium sized systems go way overboard with complicated
mechanisms for handling performance problems that they don't, and may never,
have.

~~~
sixdimensional
So I agree, premature optimization is often unnecessary.

But let's be crazy for a minute. How much CPU/RAM/disk IO does that query on
normalized tables end up using? How much electricity does that query use?

Maybe from a pure efficiency perspective, it takes less energy to query the
pre-computed value, than to constantly be recalculating it on every query.

~~~
quickben
Craziness comes later. Usually a few years after several developers have added
non-normalized tables, and it has all grown organically with multiple apps
depending on the data.

When it comes to RDBMS schema, you want it optimized and enforced from day one,
or the price paid later will be significant.

------
jkkramer
One technique that can be useful is imposing domain-specific restrictions,
then passing responsibility down the chain.

For example, most people only have a handful of recent conversations; can you
get away with ignoring distant-past conversations when it comes to read/unread
marking? If so, you could send recent normalized data to the frontend and let
it derive the unread state client-side. React has some nice tools for
efficiently deriving state from server-sent data.

This isn't always the best solution, but it's something to consider.

~~~
extrapickles
Unless the amount of data you need to transmit is large, I always prefer that
approach as it enables other UI features without any additional network trips
by the client. Also, it lets one use the same mechanism for multiple uses (in
this case, get latest messages), so it adds less complexity to the app.

In the article's example application, you can transmit the top X oldest
messages after the last-seen timestamp for each room, then have the client
derive the unreadness of each room. You can then use that same info to do
things like preview the oldest unread message for each room and hide latency
when opening a room, as you already have a screen's worth of messages in hand.
One doesn't need to fetch more recent messages, as UIs generally don't have
room to display large numbers and should switch to a "lots" indicator.
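
One way to fetch that in a single round trip is a lateral join (PostgreSQL;
table and column names are assumptions, :x and :user are placeholders):

    -- Up to :x oldest unread messages for each of the user's rooms.
    select ru.room_id, unread.sent_at, unread.body
      from room_users ru
      cross join lateral (
            select m.sent_at, m.body
              from messages m
             where m.room_id = ru.room_id
               and m.sent_at > ru.last_seen_at
             order by m.sent_at
             limit :x) unread
     where ru.user_id = :user;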

~~~
Roboprog
Sure, denormalize working storage in the app, fight to keep the "system of
record" / "operational data store" clean.

------
drcode
Smart people already have developed general solutions to this problem- This
problem is the main reason "graph queries" were independently invented by both
Facebook and Netflix.

See here for an exact solution to the problem posed in OP, using the OmNext
graph query language:
[https://github.com/omcljs/om/wiki/Components,-Identity-&-Nor...](https://github.com/omcljs/om/wiki/Components,-Identity-&-Normalization)

~~~
carterehsmith
"Smart people already have developed general solutions to this problem"

Yes, they did, 50 years ago :) The article behind the link refers to
several-decades-old concepts of identity and normalization. That was not
invented by Facebook or Netflix.

------
Roboprog
META:

When did we all start saying "performant", instead of "fast [enough]"? Was it
about 5 years ago??? _Maybe_ 10?

English, like SQL, is passé.

~~~
rplst8
I hate the word performant. People may argue that "fast" is a simpleton of a
word; however, "performant" is ambiguous as to what characteristic is
performing well.

------
swlkr
I can see why the normalized approach is difficult to make performant, but it
seems much simpler.

~~~
jacques_chester
Broadly: Normalised data is more performant on _write_ , because for each
piece of state, there will be one and _only_ one place to store it.

It's less performant on _read_ because you are constructing logical deductions
out of a lattice of tiny pebbles of truth. It takes time to find, filter, sort
and project those.

Denormalisation "works" by skipping to a read-optimised schema for those
applications where reads greatly outnumber writes. Which is very many.

As an aside, "denormalised" doesn't have to mean "unstructured chaos". Do As
Thou Wilt isn't necessarily the law of denormalising. There are well-studied,
widely-used denormalisation techniques under the heading of "dimensional
modelling".

It's just that most of these discussions are dominated by web folk, not data
warehouse folk.

The shoutout to Event Sourcing is promising, given the interaction with the
CQRS pattern. In short: read schema and write schema have different demand and
performance characteristics. _So split them up_.

~~~
LgWoodenBadger
I don't know that you can always conclude that normalized data is more
performant for writes.

Take for example the case where your normalized write consists of 5 insert
statements into 5 tables, instead of 1 insert into 1 table that you'd do with
a denormalized write.

~~~
jacques_chester
As always, consult your doctor, accountant, lawyer and profiler.

------
callesgg
What we need to solve this is a cell aggregation dependency function in the
database.

Where we define a column and a query that will be used to define the values in
the cells of the column. The database would use its planner to figure out what
the dependencies are and update the cell every time one of the dependencies
changes.

That dependency tracking is very complex but very necessary. If you don't have
it you might as well just make a normal view.

~~~
Liron
Author here. Yep, I think you're getting to the heart of what characterizes a
"denormalization engine".

------
ThrustVectoring
What are the tradeoffs of making the data denormalization happen in database
triggers? Fundamentally, you want to update certain fields whenever certain
other fields get changed, and that's exactly what database triggers do. I'm
not enough of a database wizard to know if there's good reasons why this isn't
more popular, though.

~~~
snuxoll
Too many people argue "code doesn't belong in the database", but with everyone
circling around to FRP it honestly makes me think maybe stored procedures and
database triggers are going to make a comeback. I never really bought into the
whole "but you can't version stored procedures!" bit, because that's precisely
what schemas and the search path in PostgreSQL are for (triggers are another
matter entirely, but I avoid those whenever possible - though they're exactly
what I would use for a task like this).

~~~
buckbova
Triggers are synchronous and not much different from the application updating
this value when the underlying data is updated, except for the guarantee that
whenever the data is updated the trigger will fire.

Postgres triggers are row by row, so depending on your use cases they can be
quite slow.

Ideally I'd cache all these values and invalidate the global cache through the
application. This would require one set of services or each set of services
managing their own cache. Anytime a room is updated, boot and reload the cache
for that room.

Let the db stay normalized.

------
catnaroek
There's a lot of confusion in this article. The author doesn't seem to be able
to distinguish the logical view of data (in which we distinguish between
primitive and derived data) from its physical representation (in which derived
data may be physically stored redundantly for efficiency reasons).

------
tkyjonathan
ProxySQL has a mirroring feature that can act like a trigger on a separate
thread:
[https://github.com/sysown/proxysql/blob/master/doc/mirroring.md#rewrite-mirror-query-only](https://github.com/sysown/proxysql/blob/master/doc/mirroring.md#rewrite-mirror-query-only)
Add a cronjob to refresh results if you are worried about skews over time.
Problem solved.

------
orasis
I identify with the problems this article eloquently describes. I think
GraphQL could be a first piece of the solution.

~~~
Liron
I'm a fan of GraphQL, but that's like saying "object-oriented programming is
the solution to people leaving variables on the heap after they go out of
scope". No, the solution is managed memory. OOP is a higher level of
abstraction that happens to work better with managed memory.

See the analogy?

------
ComodoHacker
One way to solve this particular problem of "number of unread rooms" is not to
persist it in the database at all. Send push notifications of new messages and
have the client app compute it client-side.

------
bryanlarsen
Wasn't this essentially what CouchDB did?

------
Maarius
Can someone explain why you wouldn't use a "read_at" flag on the message
entity in this example?

~~~
mosdave
which User read the Message?

------
thwee
Yup, that's the job!

------
marknadal
I very much enjoyed reading this article, it explains a lot of the pain and
calls out a lot of the current broken designs.

While the conclusions are correct, I feel like they could be simplified. If
you try to use the vocabulary of the broken system you will get complicated
phrases like "denormalization engines" (which I very much hope does not catch
on despite agreeing with it).

Instead, if you look at correct systems - like Elm, Rx, Datomic.com, or
[https://github.com/amark/gun](https://github.com/amark/gun) (my own,
obviously I am biased) - you get less complicated ideas:

- Event Driven.
- Data Streams.
- Data Flow.
- Data Pipeline.
- Reactive Programming.

All of these, despite "FRP" itself being jargony, solve the majority of the
article's complaints. Largely because the solution he comes to at the end is
the same thing as FRP.

Don't we wish we had all listened to Alan Kay now?

