
> Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In Its Right Place"

This is one piece of folk wisdom that is untrue. There are significant speed disadvantages to storing large blobs of data the database doesn't understand. Serialisation time makes returning large JSON/XML objects expensive when you only need a small part. Overwriting a whole object to increment a counter is an unnecessary source of IO. Duplicating JSON keys in every record bloats the size of your working set, making it more difficult to fit into memory (or the fast part of your SAN).
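
To make the IO point concrete, here is a minimal sketch of the two update patterns, using Python's sqlite3 and an invented "events" schema purely for illustration:

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events_blob (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.execute("CREATE TABLE events_norm (id INTEGER PRIMARY KEY, views INTEGER)")
    conn.execute("INSERT INTO events_blob VALUES (1, ?)",
                 (json.dumps({"title": "hello", "views": 0, "tags": ["a", "b"]}),))
    conn.execute("INSERT INTO events_norm VALUES (1, 0)")

    # Denormalised blob: read the whole document, deserialise, mutate,
    # reserialise, write the whole document back.
    row = conn.execute("SELECT doc FROM events_blob WHERE id = 1").fetchone()
    doc = json.loads(row[0])
    doc["views"] += 1
    conn.execute("UPDATE events_blob SET doc = ? WHERE id = 1", (json.dumps(doc),))

    # Normalised column: a single in-place, atomic increment.
    conn.execute("UPDATE events_norm SET views = views + 1 WHERE id = 1")

The blob version round-trips the whole document through the application just to bump one integer; the normalised version is one statement and touches only the column it changes.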

99% of denormalisation out there is unnecessary and has inferior performance. The best route to performance with row-store SQL databases (any database?) is twofold: 1) get an expert on your database to help you write your code, and 2) get an expert on your hardware to help you choose wisely. Denormalisation is typically a way to take a performance problem and make it worse while introducing data corruption and race conditions.
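
And a sketch of the race-condition half of that claim, again with an invented schema: two writers do the read-modify-write dance on the same blob, and the second write silently discards the first increment.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events_blob (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.execute("INSERT INTO events_blob VALUES (1, ?)", (json.dumps({"views": 0}),))

    def read_doc():
        row = conn.execute("SELECT doc FROM events_blob WHERE id = 1").fetchone()
        return json.loads(row[0])

    # Two writers both read views = 0 ...
    a, b = read_doc(), read_doc()
    a["views"] += 1
    b["views"] += 1
    # ... and both write back, so one increment is lost.
    conn.execute("UPDATE events_blob SET doc = ? WHERE id = 1", (json.dumps(a),))
    conn.execute("UPDATE events_blob SET doc = ? WHERE id = 1", (json.dumps(b),))
    print(read_doc()["views"])  # 1, not 2

The normalised increment ("UPDATE ... SET views = views + 1") has no such window.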




the reason for putting everything in json in one column is because alter table on a large database can take days. the only sql database i'm familiar with that doesn't have this problem is tokudb.

> Serialisation time makes returning large JSON/XML objects expensive when you only need a small part.

the expensive part of reads is finding the row on disk. once you've found the row the cost of reading part of the row vs the whole row is negligible. amount of data sent over the network doesn't matter either in 99% of cases. these days network bandwidth is orders of magnitude greater than random IO bandwidth on a spinning disk and still greater than random IO bandwidth on an SSD, assuming you're using 10GbE.
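
rough, assumed ballpark figures behind that comparison - these are not measurements, just the usual back-of-envelope numbers:

    # assumed ballpark figures, not measurements
    NETWORK_10GBE = 10e9 / 8        # ~1.25 GB/s of line rate
    PAGE = 4096                     # bytes per random read

    HDD_RANDOM_IOPS = 150           # assumed 7200rpm spinning disk
    SSD_RANDOM_IOPS = 50_000        # assumed SATA-era SSD

    hdd_random_bw = HDD_RANDOM_IOPS * PAGE   # ~0.6 MB/s
    ssd_random_bw = SSD_RANDOM_IOPS * PAGE   # ~200 MB/s

    print(f"10GbE:      {NETWORK_10GBE / 1e6:9.1f} MB/s")
    print(f"hdd random: {hdd_random_bw / 1e6:9.1f} MB/s")
    print(f"ssd random: {ssd_random_bw / 1e6:9.1f} MB/s")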

> Overwriting a whole object to increment a counter is an unnecessary source of IO.

there is no way to write 4 bytes to a hard disk. disk io is performed in multiples of 4096 bytes, so it doesn't matter whether you just update the counter or update the whole blob. only incrementing the counter may allow you to write less data to the write-ahead log, so you may save some IO there, but most databases put the whole row in the log anyway so it doesn't matter.
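
a tiny sketch of that page-granularity arithmetic, assuming page-aligned writes and a blob that fits in one page:

    import math

    PAGE = 4096

    def pages_dirtied(write_bytes):
        # simplification: assume the write is page-aligned and contiguous
        return max(1, math.ceil(write_bytes / PAGE))

    print(pages_dirtied(4))     # counter only      -> 1 page
    print(pages_dirtied(900))   # whole ~1 KB blob  -> 1 page, same disk cost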

> Duplicating JSON keys in every record bloats the size of your working set, making it more difficult to fit into memory (or the fast part of your SAN).

this is definitely true, it's better to use an extensible but compact format like protobuf or thrift if you have a significant amount of data. or you could use a better database and not have to worry about the cost of adding columns.
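
rough illustration of the key-duplication overhead - struct stands in for protobuf/thrift here just to keep the sketch dependency-free, and the record is made up:

    import json
    import struct

    record = {"user_id": 12345, "views": 42, "score": 3.5}

    as_json = json.dumps(record).encode()
    as_packed = struct.pack("<QQd", record["user_id"], record["views"], record["score"])

    print(len(as_json))    # ~45 bytes, field names repeated in every record
    print(len(as_packed))  # 24 bytes, field names live in the schema instead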


> alter table on a large database can take days

There's a lot to say about migrations in these various implementations, but in short: every variant of alter table is an "online" operation (meaning it doesn't hold an exclusive lock) in InnoDB as of MySQL 5.6, and it was possible with various well-supported third-party tools before that. For Postgres: most migrations can be done online, and those that do hold a lock are typically constant time.
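
For illustration, the kind of statements being referred to (table and column names invented). MySQL 5.6+'s ALGORITHM/LOCK clauses make the intent explicit and make the statement fail fast rather than silently block:

    # Example DDL only - "orders"/"notes" are placeholder names.

    # MySQL 5.6+: online ADD COLUMN; refuses to run if it would need to block writers.
    mysql_online_add_column = """
        ALTER TABLE orders
            ADD COLUMN notes TEXT,
            ALGORITHM=INPLACE, LOCK=NONE;
    """

    # Postgres: adding a nullable column with no default is a constant-time
    # catalog change; it takes a lock, but only very briefly.
    postgres_add_column = """
        ALTER TABLE orders
            ADD COLUMN notes TEXT;
    """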

Admittedly, migration has been a big problem in the past, but that hasn't been true for years now.

> the expensive part of reads is finding the row on disk

Hopefully most SQL queries in an optimised setup are not finding a row on disk! The difference between reading the whole row from on-disk heap storage and reading the interesting part of it from an index in memory is in fact considerable: 1000x or more - and obviously far worse if you have to scan through any of the heap storage.
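
A quick way to see that difference is a covering index - sketched here with sqlite3 and an invented table, but the same idea applies to Postgres and MySQL:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE articles (
        id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER, body TEXT)""")
    conn.execute("CREATE INDEX articles_user_views ON articles (user_id, views)")

    # The counter can be answered from the index alone, without touching the
    # (potentially wide) row in the heap/table.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT views FROM articles WHERE user_id = ?", (1,)
    ).fetchall()
    print(plan)  # should report a search USING COVERING INDEX articles_user_views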

> amount of data sent over the network doesn't matter either in 99% of cases

It actually matters hugely in the case of JSON blob storage because it all has to be doubly deserialised on the other end - first from the db's wire protocol and then from JSON. There are many apps out there for which JSON serialisation is a bottleneck (for the CPU, not the network) - that's why there are so many "fast JSON" libraries.
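
A sketch of what "doubly deserialised" looks like in application code - this assumes psycopg2, a placeholder connection string, and a made-up table where the JSON is stored as plain text:

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")     # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT doc FROM events WHERE id = %s", (1,))
    raw_text = cur.fetchone()[0]              # step 1: wire protocol -> Python str
    doc = json.loads(raw_text)                # step 2: JSON text -> dict, on your CPU
    print(doc.get("views"))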

Good point - you could mitigate this by using something quicker. I haven't seen anywhere do that, though - the ability to read the database with other tools is normally useful.

> most databases put the whole row in the log anyway so it doesn't matter

re: this whole topic - I'm not proposing replacing a JSON blob with a table in which every JSON field is reified into a column. I'm comparing to a normalised design with narrow tables (most tables have fewer than 5 columns). The other points about serialisation still apply.
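
For concreteness, "narrow tables" here means something like the following - invented names, sqlite3 only so the sketch is self-contained:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles      (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        CREATE TABLE article_stats (article_id INTEGER PRIMARY KEY, views INTEGER, likes INTEGER);
        CREATE TABLE article_tags  (article_id INTEGER, tag TEXT,
                                    PRIMARY KEY (article_id, tag));
    """)

Each table stays small and self-describing, counters live in their own row, and no JSON keys are repeated per record.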



