Ask HN: When is denormalization preferred over normalization?
21 points by iceporter 10 months ago | 16 comments
So I'm a solo developer and I have a problem with my small product (a SaaS with 2 clients).

This app handles my client's main business transactions, like selling their services. I save the transaction details into a record: who the client is, how much it costs, and which service it is.

Then my client also wants the app to have an accounting feature.

The problem is that I find it hard to design the system.

My current approach is that every transaction record has a child record (the accounting journal postings).

And I copy the values and details needed for the accounting task from the parent object to the child object every time the parent object is saved (a Rails callback).

Is this bad practice? Should I just reference the values instead of copying them?

But if I reference the values instead of copying them, I think I will run into query complexity problems.




It is generally fine to denormalize the storage of immutable data points. I mean, if you have something that can never change or cease to exist, it's OK to store it in many places by value, for performance or convenience reasons.

If the value can be changed or deleted, it's a whole different story. Denormalization of a data point is isomorphic to caching it (you store it nearby to avoid getting it from far away). And caching is a notoriously difficult problem in computer science. It is very easy to screw it up and make your data internally inconsistent. It is arguably a worse problem to have than slow or complex queries, especially because you might not even know that you have it.


Never thought of denormalization as caching. Makes sense. Thanks for the perspective


Some people use MongoDB or Redis for caching a query result.


Context: I'm a data architect, and I work on HUGE datasets.

The answer to your question is "it depends". In this case, it depends on what you are trying to do with the data. First, here's a little back-and-forth of why you would use the different approaches, with a final thought at the end.

If the intent is to log data as a snapshot in time, and storage is cheap, then denormalize the data. Querying may be a bit slower, but depending on your software, it may be financially beneficial.

If the intent is to always show the most up-to-date information, then normalize the data. Depending on the complexity of your setup, normalization may be the simplest solution anyway.

Usually, denormalization takes place as a logging function, and normalization is used for indexed data.

If you will have a lot of real-time reads, use a normalized relational database.

If you will have a lot of writes, but relatively few reads, then go denormalized logging.

In my day-to-day job, we use two databases, one normalized and one denormalized, and the choice of which to use depends on what we are doing. Of course, our data is at massive scale, so it may not apply to you.

If this is a small project, just use a normalized database. Don't overengineer it.


The copied data will one day get out of sync. And you will have a difficult and urgent issue to solve.

Query complexity, on the other hand, requires knowledge: first to make the query work, and again to make it fast. But it won't be urgent, and having made it work you will be a better developer.

Of course, sometimes we need to copy / cache etc. but avoid it if you can.


Yeah, you are right. I think I will try to use a view and a union.

This will make me a better developer!


Create a view in the database that takes the transactions and returns the data in the format that you want.

It will probably have a bunch of JOINs on client.id = transaction.client_id, etc., but it's the easiest way to get started.

When the view stops running fast enough, switch to a scheduled materialized view.
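
A rough sketch of what that could look like in PostgreSQL, with made-up table and column names (transactions, clients, amount, etc.):

    -- Plain view: always reflects the latest transaction data.
    CREATE VIEW accounting_entries AS
    SELECT t.id         AS transaction_id,
           c.name       AS client_name,
           t.service    AS service,
           t.amount     AS amount,
           t.created_at AS booked_at
    FROM transactions t
    JOIN clients c ON c.id = t.client_id;

    -- If the view gets too slow, switch to a materialized view and
    -- refresh it on a schedule (cron, pg_cron, a Rails job, etc.).
    CREATE MATERIALIZED VIEW accounting_entries_snapshot AS
    SELECT * FROM accounting_entries;

    REFRESH MATERIALIZED VIEW accounting_entries_snapshot;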


I think I will use this approach

But I think I need to use a union to make a single relation that combines the accounting journal tables and the view of the transaction tables.
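
A sketch of that union, assuming both sides can be projected onto the same set of columns (all names here are hypothetical):

    -- Journal rows and transaction-derived rows projected onto the
    -- same columns, then stacked. UNION ALL keeps duplicates and
    -- avoids the dedup sort; use UNION only if the sources overlap.
    CREATE VIEW combined_journal AS
    SELECT j.account, j.debit, j.credit, j.posted_at, 'journal' AS source
    FROM journal_entries j
    UNION ALL
    SELECT v.account, v.debit, v.credit, v.posted_at, 'transaction' AS source
    FROM transaction_postings_view v;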


The general guidance is that you segment your application domain into two categories - Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

The OLAP data is populated from the OLTP data using queries (snapshot tables, materialized views, etc. could be the implementation).

You then add/refresh data in the OLAP tables at a set frequency (e.g. daily, weekly, bi-weekly, monthly, quarterly, yearly).

The OLTP system has up-to-date realtime transactions. The OLAP data has snapshots as of a particular date. The OLAP data may be denormalized while the OLTP data is highly normalized. This makes the OLAP data optimized for reads while the OLTP data is optimized for writes.
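
As a rough PostgreSQL illustration of that split, with invented table names (transactions, clients):

    -- OLTP side: normalized tables written on every sale.
    -- OLAP side: a denormalized snapshot, rebuilt on a schedule.
    CREATE MATERIALIZED VIEW monthly_revenue AS
    SELECT c.name                            AS client_name,
           date_trunc('month', t.created_at) AS month,
           sum(t.amount)                     AS revenue
    FROM transactions t
    JOIN clients c ON c.id = t.client_id
    GROUP BY c.name, date_trunc('month', t.created_at);

    -- Run at whatever frequency fits (daily, weekly, monthly, ...):
    REFRESH MATERIALIZED VIEW monthly_revenue;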


NetSuite uses the “child” table model, though it is mostly normalized:

transactionline for the “human” details,

transactionaccountingline for the GL.

Either of these tables can be joined to the transaction on transaction.id, and the sequence numbers of the two correspond to each other.

But like I said, it is normalized pretty much to 3rd normal form. You have to join to get the “description”. There is duplication in that quantity and rate are in transactionline while the extended amount is in transactionaccountingline.

There is a good reason behind breaking the normalization rules: separation of functions. transactionline is for printing invoices, orders, etc., and transactionaccountingline is for querying the ledger.

Not saying this is the best example, but it is from an ERP system that is successful in its own segment.
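
Roughly, the join has this shape (the real NetSuite column names differ; this is only an approximation to show the idea):

    SELECT t.id,
           tl.quantity,
           tl.rate,
           tal.amount   -- the extended amount lives on the GL side
    FROM transaction t
    JOIN transactionline tl
      ON tl.transaction = t.id
    JOIN transactionaccountingline tal
      ON tal.transaction = t.id
     AND tal.transactionline = tl.id;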


I'm really interested in this; pardon my English.

Does NetSuite copy values from the transaction into transactionline or transactionaccountingline?


It computes quantity x price from the transactionline and stores that in transactionaccountingline. So, for example, if you edit a transaction, the code accepting the POST has to update two records per line, not just one.

Unless, of course, the extension field in transactionaccountingline is seeded by a SQL function that does that automatically, which might be the preferred approach for your implementation. The downside of SQL functions, though, is that they tend to be invisible to others maintaining the code.
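
For the OP's own schema, that "SQL function" idea could be a trigger. Here is a PostgreSQL sketch with made-up table and column names (transaction_lines, transaction_accounting_lines), not NetSuite's actual mechanism:

    -- A trigger recomputes the derived amount whenever quantity or
    -- rate changes, so application code never updates it by hand.
    CREATE FUNCTION sync_accounting_amount() RETURNS trigger AS $$
    BEGIN
      UPDATE transaction_accounting_lines
         SET amount = NEW.quantity * NEW.rate
       WHERE transaction_line_id = NEW.id;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER sync_amount_on_line_change
    AFTER INSERT OR UPDATE OF quantity, rate ON transaction_lines
    FOR EACH ROW EXECUTE FUNCTION sync_accounting_amount();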


I don't know the specifics of accounting, but in general you want to save a log of immutable events that you then translate into a read model. In practice you have a transaction record with some productId and maybe the price at that point in time; you can simply reduce all those transactions into a (materialized) view with a GROUP BY. To write performant complex queries, learn about WITH.
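
A small illustration of that reduce-plus-GROUP BY using WITH (a common table expression), with assumed table and column names (sale_events, price_at_sale):

    -- Immutable sale events reduced into a read model.
    -- WITH names the intermediate step for readability.
    WITH priced AS (
      SELECT product_id,
             quantity,
             price_at_sale,
             quantity * price_at_sale AS line_total
      FROM sale_events
    )
    SELECT product_id,
           count(*)        AS sales,
           sum(line_total) AS revenue
    FROM priced
    GROUP BY product_id;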


I've found, for the accounting system I made, that it brought far too many problems: the CEO demanded wider tables and fewer joins to simplify things so he could use them. It ended up making the simple complex rather than the complicated simple.


It sounds like you might be mixing concepts.

Accounting records can duplicate information without needing links to the original. Sometimes this is better: you want a snapshot of what things were at the transaction time.


Client should be its own table, and transaction its own table (with a foreign key to the client ID). You can ask ChatGPT to help you out.
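
In schema terms, that baseline could look something like this (column names are illustrative):

    CREATE TABLE clients (
      id   bigserial PRIMARY KEY,
      name text NOT NULL
    );

    CREATE TABLE transactions (
      id         bigserial PRIMARY KEY,
      client_id  bigint NOT NULL REFERENCES clients (id),
      service    text NOT NULL,
      amount     numeric NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
    );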




