Ask HN: When is denormalization preferred over normalization?
21 points by iceporter 10 months ago | 16 comments
So I'm a solo developer and I have a problem with my small product (a SaaS with 2 clients).

This app handles my client's main business transactions, like selling their services. I save the transaction details into a record: who the client is, how much it costs, and which service it is.

Then my client also wants the app to have an accounting feature.

The problem is that I find it hard to design the system.

My current approach is that every transaction record has a child record (the accounting journal postings).

And I copy the values and details needed for the accounting task from the parent object to the child object every time the parent object is saved (a Rails callback).

Is this bad practice? Should I just reference the values instead of copying them?

But if I reference the values instead of copying them, I think I will run into query complexity problems.




It is generally fine to denormalize the storage of immutable data points. I mean, if you have something that can never change or cease to exist, it's OK to store it in many places by value, for performance or convenience reasons.

If the value can be changed or deleted, it's a whole different story. Denormalization of a data point is isomorphic to caching it (you store it nearby to avoid getting it from far away). And caching is a notoriously difficult problem in computer science. It is very easy to screw it up and make your data internally inconsistent. It is arguably a worse problem to have than slow or complex queries, especially because you might not even know that you have it.


Never thought of denormalization as caching. Makes sense. Thanks for the perspective


Some people use MongoDB or Redis for caching a query result.


Context: I'm a data architect, and I work on HUGE datasets.

The answer to your question is "it depends". In this case, it depends on what you are trying to do with the data. First, here's a little back-and-forth of why you would use the different approaches, with a final thought at the end.

If the intent is to log data as a snapshot in time, and storage is cheap, then denormalize the data. Querying may be a bit slower, but depending on your software, it may be financially beneficial.

If the intent is to always show the most up-to-date information, then normalize the data. Depending on the complexity of your setup, normalization may be the simplest solution anyway.

Usually, denormalization takes place as a logging function, and normalization is used for indexed data.

If you will have a lot of real-time reads, use a normalized relational database.

If you will have a lot of writes, but relatively few reads, then go denormalized logging.

In my day-to-day job, we use two databases, one normalized and one denormalized, and the choice of which to use depends on what we are doing. Of course, our data is at massive scale, so it may not apply to you.

If this is a small project, just use a normalized database. Don't overengineer it.


The copied data will one day get out of sync. And you will have a difficult and urgent issue to solve.

Query complexity, on the other hand, requires knowledge: first to make the query work, and again to make it fast. But it won't be urgent, and having made it work you will be a better developer.

Of course, sometimes we need to copy / cache etc. but avoid it if you can.


Yeah, you are right. I think I will try to use a view and a union.

This will make me a better developer!


Create a view in the database that takes the transactions and returns the data in the format that you want.

It will probably have a bunch of JOINs on client.id = transaction.client_id, etc., but it's the easiest way to get started.

When the view stops running fast enough, switch to a scheduled materialized view.
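
A rough sketch of what that could look like in PostgreSQL, with made-up table and column names (transactions, clients, amount, etc.):

    -- Plain view: always reflects the latest transaction data.
    CREATE VIEW accounting_entries AS
    SELECT t.id         AS transaction_id,
           c.name       AS client_name,
           t.service    AS service,
           t.amount     AS amount,
           t.created_at AS booked_at
    FROM transactions t
    JOIN clients c ON c.id = t.client_id;

    -- If the view gets too slow, switch to a materialized view and
    -- refresh it on a schedule (cron, pg_cron, a Rails job, etc.).
    CREATE MATERIALIZED VIEW accounting_entries_snapshot AS
    SELECT * FROM accounting_entries;

    REFRESH MATERIALIZED VIEW accounting_entries_snapshot;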


I think I will use this approach

But I think I need to use a union to make a single relation that combines the accounting journal tables and the view of the transaction tables.
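
A sketch of that union, assuming both sides can be projected onto the same set of columns (all names here are hypothetical):

    -- Journal rows and transaction-derived rows projected onto the
    -- same columns, then stacked. UNION ALL keeps duplicates and
    -- avoids the dedup sort; use UNION only if the sources overlap.
    CREATE VIEW combined_journal AS
    SELECT j.account, j.debit, j.credit, j.posted_at, 'journal' AS source
    FROM journal_entries j
    UNION ALL
    SELECT v.account, v.debit, v.credit, v.posted_at, 'transaction' AS source
    FROM transaction_postings_view v;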


The general guidance is that you segment your application domain into two categories - Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

The OLAP data is populated from the OLTP data using queries (snapshot tables, materialized views, etc. could be the implementation).

You then add/refresh data in the OLAP tables at a set frequency (e.g. daily, weekly, bi-weekly, monthly, quarterly, yearly).

The OLTP system has up-to-date realtime transactions. The OLAP data has snapshots as of a particular date. The OLAP data may be denormalized while the OLTP data is highly normalized. This makes the OLAP data optimized for reads while the OLTP data is optimized for writes.
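
As a rough PostgreSQL illustration of that split, with invented table names (transactions, clients):

    -- OLTP side: normalized tables written on every sale.
    -- OLAP side: a denormalized snapshot, rebuilt on a schedule.
    CREATE MATERIALIZED VIEW monthly_revenue AS
    SELECT c.name                            AS client_name,
           date_trunc('month', t.created_at) AS month,
           sum(t.amount)                     AS revenue
    FROM transactions t
    JOIN clients c ON c.id = t.client_id
    GROUP BY c.name, date_trunc('month', t.created_at);

    -- Run at whatever frequency fits (daily, weekly, monthly, ...):
    REFRESH MATERIALIZED VIEW monthly_revenue;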


NetSuite uses the “child” table model, though it is mostly normalized:

transactionline for the “human” details,

transactionaccountingline for the GL.

Either of these tables can be joined to the transaction on transaction.id, and the sequence numbers of the two correspond to each other.

But like I said, it is normalized pretty much to 3rd normal form. You have to join to get the “description”. There is duplication in that quantity and rate are in transactionline while the extended amount is in transactionaccountingline.

There is a good reason behind breaking the normalization rules: separation of functions. transactionline is for printing invoices, orders, etc., and transactionaccountingline is for querying the ledger.

Not saying this is the best example, but it is from an ERP system that is successful in its own segment.
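
Roughly, the join has this shape (the real NetSuite column names differ; this is only an approximation to show the idea):

    SELECT t.id,
           tl.quantity,
           tl.rate,
           tal.amount   -- the extended amount lives on the GL side
    FROM transaction t
    JOIN transactionline tl
      ON tl.transaction = t.id
    JOIN transactionaccountingline tal
      ON tal.transaction = t.id
     AND tal.transactionline = tl.id;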


I'm really interested in this; pardon my English.

Does NetSuite copy values from the transaction into transactionline or transactionaccountingline?


It computes quantity x price from the transactionline and stores that in transactionaccountingline. So, for example, if you edit a transaction, the code accepting the POST has to update two records per line, not just one.

Unless, of course, the extension field in transactionaccountingline is seeded by a SQL function that does that automatically, which might be the preferred approach for your implementation. The downside of SQL functions, though, is that they tend to be invisible to others maintaining the code.
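
For the OP's own schema, that "SQL function" idea could be a trigger. Here is a PostgreSQL sketch with made-up table and column names (transaction_lines, transaction_accounting_lines), not NetSuite's actual mechanism:

    -- A trigger recomputes the derived amount whenever quantity or
    -- rate changes, so application code never updates it by hand.
    CREATE FUNCTION sync_accounting_amount() RETURNS trigger AS $$
    BEGIN
      UPDATE transaction_accounting_lines
         SET amount = NEW.quantity * NEW.rate
       WHERE transaction_line_id = NEW.id;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER sync_amount_on_line_change
    AFTER INSERT OR UPDATE OF quantity, rate ON transaction_lines
    FOR EACH ROW EXECUTE FUNCTION sync_accounting_amount();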


I don't know the specifics of accounting, but in general you want to save a log of immutable events that you then translate into a read model. In practice you have a transaction record with some productId and maybe the price at that point in time; you can simply reduce all those transactions into a (materialized) view with a GROUP BY. To write performant complex queries, learn about WITH.
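
A small illustration of that reduce-plus-GROUP BY using WITH (a common table expression), with assumed table and column names (sale_events, price_at_sale):

    -- Immutable sale events reduced into a read model.
    -- WITH names the intermediate step for readability.
    WITH priced AS (
      SELECT product_id,
             quantity,
             price_at_sale,
             quantity * price_at_sale AS line_total
      FROM sale_events
    )
    SELECT product_id,
           count(*)        AS sales,
           sum(line_total) AS revenue
    FROM priced
    GROUP BY product_id;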


I've found, for the accounting system I made, that it brought far too many problems: the CEO demanded wider tables and fewer joins to simplify things so he could use them. It ended up making the simple complex rather than the complicated simple.


It sounds like you might be mixing concepts.

Accounting records can duplicate information without needing links to the original. Sometimes this is better: you want a snapshot of what things were at the transaction time.


Client should be its own table, and transaction its own table (with a foreign key to the client ID). You can ask ChatGPT to help you out.
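
In schema terms, that baseline could look something like this (column names are illustrative):

    CREATE TABLE clients (
      id   bigserial PRIMARY KEY,
      name text NOT NULL
    );

    CREATE TABLE transactions (
      id         bigserial PRIMARY KEY,
      client_id  bigint NOT NULL REFERENCES clients (id),
      service    text NOT NULL,
      amount     numeric NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
    );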




