Hacker News
Make your database tables smaller (jezenthomas.com)
84 points by yakshaving_jgt on Dec 30, 2022 | hide | past | favorite | 83 comments


This article uses “normalize” to mean “splitting into multiple tables”. But splitting into multiple tables is only normalization in the cases where it reduces data redundancy. If the resulting two tables are 1:1 and you don't save any data, that is not normalization, that is just busywork.


I entirely agree. This advice is plain wrong:

> Does your users table have a few columns prefixed with oidc for single sign-on authentication? Those should probably be in their own table.

If your users have one set of oidc data each, leave it in the users table. There's no such thing as a 1:1 relationship.


What if it can be 0 or 1 set each?

So you null several columns, perhaps constrain them so they can only be null (resp. not null) together, ... Or just have a single nullable FK to a table where the data's not nullable.
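That together-null constraint can be written directly as a CHECK. A minimal sketch in SQLite via Python (the table and `oidc_*` column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical users table: the oidc_* columns must be
# null together or non-null together.
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        oidc_issuer TEXT,
        oidc_subject TEXT,
        CHECK ((oidc_issuer IS NULL) = (oidc_subject IS NULL))
    )
""")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")  # ok: both null
conn.execute(
    "INSERT INTO users (email, oidc_issuer, oidc_subject) "
    "VALUES ('b@example.com', 'https://idp.example', 'sub-123')")   # ok: both set
try:
    conn.execute(
        "INSERT INTO users (email, oidc_issuer) "
        "VALUES ('c@example.com', 'https://idp.example')")          # half-set: rejected
except sqlite3.IntegrityError:
    print("rejected")
```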


If multiple columns can only be null/non-null together it indicates a dependency which should be normalized.


Yes, but I was replying to 'no such thing as 1:1 relationships, just put it in the same table'.


But you are talking about 1:0..1 which is different from plain 1:1 relationships.


As much as I dislike nullable fields, I personally think that the alternative to having them (making any nullable field a separate table instead) can make working with the data too cumbersome.

> Or just have a single nullable fk to a table where the data's not nullable.

Why not the other table having a non-nullable pk/fk to the main table?


Yes or that, didn't mean to preclude that, just that 'somehow linked' table is better than bunging it all in main one IMO.


There are several reasons to use tables with a 1:1 relationship, e.g. as described in my comment in this thread: avoiding table contention, maintainability, and scalability.


sorry, what do you mean by "no such thing"?

https://docs.djangoproject.com/en/4.1/topics/db/examples/one...

yes, orms have a ton of problems, but it's clearly very much a thing.


Even with no ORM involved, 1-1 relations help keep things conceptually separate. This makes e.g spinning off parts of your application as microservices much easier, if the need ever arises.


How do you enforce the 1:1 relationship at the database level?


With a unique constraint/index
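i.e. a unique FK column caps the relationship at one child row per parent (1:0..1). A sketch in SQLite via Python, with made-up table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
# UNIQUE on user_id means each user can have at most one
# oidc_credentials row.
conn.execute("""
    CREATE TABLE oidc_credentials (
        user_id INTEGER NOT NULL UNIQUE REFERENCES users (id),
        issuer  TEXT NOT NULL,
        subject TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO users (id) VALUES (1)")
conn.execute("INSERT INTO oidc_credentials VALUES (1, 'https://idp.example', 'sub-1')")
try:
    # A second row for the same user violates the unique constraint.
    conn.execute("INSERT INTO oidc_credentials VALUES (1, 'https://idp.example', 'sub-2')")
except sqlite3.IntegrityError:
    print("second row rejected")
```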


Yes, I mentioned the approach elsewhere. Now you have two database things (the FK and the constraint) describing a single relationship.

I don't think it's very elegant and people often forget the unique constraint.

One table is a much better solution.


Both tables have an ID primary key, one is also a FOREIGN KEY to the other.


1:1 mapping means that if one table has a row, so does the other.

How does this guarantee that the ID is in both tables?


> How does this guarantee that the ID is in both tables?

Create both tables, with the same PK the second with a deferrable, initially deferred foreign key to the first.

Alter the first to add a deferrable, initially deferred foreign key to the second. [Edited to correct swapped table refs in this ¶]

Mission accomplished.
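SQLite happens to support the same deferrable foreign keys, so the scheme above can be sketched there via Python (table names are made up; in Postgres you'd add the second FK with ALTER TABLE as described):

```python
import sqlite3

# Manual transaction control so BEGIN/COMMIT behave predictably.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")

# Each table's primary key is also a deferred foreign key to the
# other, so a row can only exist in both tables or in neither.
conn.execute("""
    CREATE TABLE a (
        id INTEGER PRIMARY KEY
            REFERENCES b (id) DEFERRABLE INITIALLY DEFERRED
    )
""")
conn.execute("""
    CREATE TABLE b (
        id INTEGER PRIMARY KEY
            REFERENCES a (id) DEFERRABLE INITIALLY DEFERRED
    )
""")

# Inside one transaction the checks wait until COMMIT:
conn.execute("BEGIN")
conn.execute("INSERT INTO a VALUES (1)")
conn.execute("INSERT INTO b VALUES (1)")
conn.execute("COMMIT")  # both FKs satisfied, commit succeeds

# A lone row in either table cannot be committed:
conn.execute("BEGIN")
conn.execute("INSERT INTO a VALUES (2)")  # no matching row in b
try:
    conn.execute("COMMIT")
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    print("lone row rejected at commit")
```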


Something like this (adjust the primary key name convention and schema as needed):

  create procedure mandatory_ (_table_name "text", _mandatory_table_name "text") language plpgsql as $$
        declare
                _trigger_name "text";
        begin
                _trigger_name = _table_name || '_mandatory_' || _mandatory_table_name || '_trigger';
                execute
                        'create function "' || _trigger_name || '_" () returns trigger language plpgsql as $' || '$ '
                                'begin '
                                        'if not exists '
                                                '( select 1 '
                                                        'from public."' || _mandatory_table_name || '" _mandatory '
                                                        'where _mandatory ."' || _table_name || '_id" = new ."id" ) '
                                        'then '
                                                'raise '
                                                        '''non-existent relation between "%" and "%" violates mandatory participation constraint'' '
                                                        ', ' || quote_literal (_table_name) ||
                                                        ', ' || quote_literal (_mandatory_table_name) || '; '
                                                'end if;'
                                        'return new; '
                                        'end; '
                                '$' || '$ ';
                execute
                        'create constraint trigger "' || _trigger_name || '" '
                                'after insert or update on "' || _table_name || '" '
                                'deferrable initially deferred '
                                'for each row execute function "' || _table_name || '_mandatory_' || _mandatory_table_name || '_trigger_" () ';
                end;
        $$;


Doesn't this require support for deferrable constraints? Even if they are supported on the transaction level, once you have these kinds of dependency loops, they tend to impact lots and lots of things—restoring from backups, application-level replication, and so on.


That's a 1:N relationship. There's nothing to prevent you from adding more related records.

At the database level there's only one kind of foreign key.

Yeah, you can try and prevent it from happening in code or throw some unique constraints at it, but it's unnecessary.


If the foreign key column is also that table’s primary key (or just has a unique constraint), then that’s not 1:N, it’s 1:1 or 1:0.


Yeah, I noted elsewhere among the replies it's possible to use constraints to prevent extra rows in the related table, but it's inelegant and I've seen plenty of cases where it's forgotten and ends up being buggy.

I'd still advise people to follow the rules of normalisation unless there are very strong reasons to deviate.


If you could, it would become impossible to insert any data into them.


This Django example is more of a "[one or zero] to [one or zero]" relationship, that's likely what the grandparent post means. I doubt any of the mainstream relational databases have a way to enforce exact one-to-one correspondence between two tables. You could use triggers that execute at the end of transactions or something like that - but that's not part of the relational model.


This is related to what I meant, but specifically I was referring to the inability of databases to prevent a 1:1 becoming a 1:2 (or 1:N).

Yes, you can hack it with a unique constraint, but it's not very elegant.


That’s pretty much exactly what unique constraints are for. Why do you consider that a hack?


Mainly because you end up with something that's untidy. Your elegant table that contains all data identifiable by a single PK is now spread across two tables and relies on a (often forgotten) constraint in addition to a FK.

I'd rather follow the rules of normalisation wherever possible.


I still fail to see how your single table approach is more normalized. Consider for example an e-commerce order that may have either (1) exactly one shipping address or (2) no shipping address (for things like digital, downloadable products). You’re saying that you’d rather store the address data on the order table using a bunch of nullable fields? That bloats the order table and introduces a ton of possible consistency problems. A much more normalized model, IMO, would be to separate the address fields into their own table and use the unique FK approach to ensure that no order has more than one related address. Precisely which rules of normalization is that breaking? And how would they be resolved by putting everything into a single table?
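The order/address model described above can be sketched like this in SQLite via Python (column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)")
# UNIQUE order_id: at most one shipping address per order;
# digital orders simply have no row here.
conn.execute("""
    CREATE TABLE shipping_addresses (
        order_id INTEGER NOT NULL UNIQUE REFERENCES orders (id),
        street   TEXT NOT NULL,
        city     TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 999)")    # digital order, no address
conn.execute("INSERT INTO orders VALUES (2, 4999)")   # physical order
conn.execute("INSERT INTO shipping_addresses VALUES (2, '1 Main St', 'Springfield')")
# A LEFT JOIN recovers the single-table view when needed:
rows = conn.execute("""
    SELECT o.id, a.city
    FROM orders o LEFT JOIN shipping_addresses a ON a.order_id = o.id
    ORDER BY o.id
""").fetchall()
print(rows)  # [(1, None), (2, 'Springfield')]
```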


I mean at the database level. Your relationships are foreign keys and they're 1:N.

Yes, you can try and artificially limit them to 1:1 using constraints but it's a hack.

The correct way to represent a 1:1 relationship in a database is with one table.


> But normalization only means splitting into multiple tables in the case where it reduces data redundancy.

Normalization means splitting into multiple tables where it avoids anomalies, redundancy (because it allows inconsistency) is one source of anomaly, but not the only one. 4NF -> 5NF normalizations, for instance, do not (at least, as a generality) remove redundancies. (Neither does DKNF -> 6NF, but I’m not convinced that actually reduces anomalies.)


I forgot about 6NF which I consider non-canonical fan-fiction. As far as I understand it is only useful in the context of data-warehouses.

But the other normal forms reduce redundancy which is what makes inconsistencies possible.


I agree, and there's something to be said for not having a table with 70 or 80 columns.

Honestly it cuts both ways, I regularly see databases with enough tables AND enough columns to make them nearly incomprehensible.


> Joins are cheap. Normalise earlier

I am sure I have heard strong opinions like this, but in practice joins are never cheap. Tables with billions of rows crossed with millions of rows just to find a single row of data is not something I would call cheap. More often than not it is better to avoid joining large tables if you can live with duplicate data. Also, one trick I have found better is to archive the data in your tables which is not accessed frequently, reducing the size of tables by keeping only the data which is needed.


aaand here we go again.

DB guy with 25+ years experience. Summary: it depends.

> joins are never cheap

it depends. On table size, indexes/table size vs how expensive the alternative is. Always!

> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap

indexes

> more often than not it is better to avoid joining large tables if you can live with duplicate data

1E9 x 1E6 = 1E15 (at worst anyway). A join via an index will save you colossal amounts of IO (though as ever, it depends).

Problem here isn't this mostly clueless advice (discarding/archiving unnecessary data is the only good idea here, and it's not used as often as it should be). Problem is strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood. Denormalising is a useful tool that IME rarely gains you more than it loses you, but this 'advice' is just going to lead people down the wrong alley, and I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN.

(edited to fix maths error)


There's also the possibility of filtering each source table first, then doing an inner join. Which can VASTLY cut down on computation. I assume GP assumed doing an outer join first, then filtering.

But those are details for the database engine to handle. And, as you said, indexes


FYI for others, such filtering is called predicate pushdown (I believe also called predicate hoisting sometimes). Example (and this is trivial but for illustration)

   select * from (select * from tbl) as subqry where subqry.col = 25
would be rewritten by any halfway decent optimiser to

   select * from (select * from tbl where tbl.col = 25) 
(and FTR the outermost select * would be stripped off as well).

Good DB optimisers do a whole load of that and much more.


Yeah, had to get quite well acquainted with query execution plans and the like a few years ago (And forgot most of it by now) because of diagnosing a SLOW query.

Joining onto either table a or table b is something that REALLY trips optimizers up.


Wow, this comment comes across as being incredibly arrogant while providing zero value. nOOb lol


I thought I was being informative. I can't give hard&fast rules because (drumroll)... it depends. So I have tradeoffs to consider, and indexes got mentioned.

How else could I have posted better? Honest question.


Because you didn’t actually refute anything the GP said, and gave bad advice, all while being incredibly negative and arrogant.

> this mostly clueless advice

> strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood

> I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN

You continue to just say it depends without giving any actual scenarios. You make it sound like magic, but it’s not: “under x and y, do z except when u” is better than “it depends, I’m sick of all these noobs”.

Also, your main points are against denormalization and avoiding large table joins which are 100% rational arguments under certain workloads.


I refuted what he said by pointing out that 1E9 x 1E6 = 1E15. A billion row table denormalised with a million row table = 1000 trillion row table. How big's your disk array? How are you going to ensure correctness on update?

His was stupid advice and it should not have been given.

> You continue to just say it depends without giving any actual scenarios

it depends. Use your common sense and then use a stopwatch, is a good start. There are entire shelves of books on this, I won't repeat them.

> You make it sound like magic, but it’s not:

absolutely true!

> “under x and y, do z except when u” is better than

it's a multidimensional problem, incl. memory size, disk size, the optimiser, sizes of particular tables joined, where the hotspot is, cost of updates of non-normalised tables, etc. I can't give general advice from here.

> Also, your main points are against denormalization and avoiding large table joins which are 100% rational arguments under certain workloads.

I said "Denormalising is a useful tool that IME rarely gains you more than it loses you,"

I don't accept your criticism.


That’s not what denormalize means, how long have you been doing this again?


True, you normalise/denormalise data not tables as such; tables pop out of a normalisation process and denormalisation collapses them together. Perhaps if I'm still wrong you could put me right. And don't just point at the wiki article on it, please be specific.

To your question, probably longer than you but I've always more to learn.


Joins are cheaper than the alternatives.

If you have billions of rows you need to search through for a single row, it will be costly unless you have the appropriate indexes. Storing data in JSON fields or similar is only going to make it slower.

If you are only selecting a single row, a join will be instantaneous. If you are joining many rows on either side it has a cost, but so has denormalized data since you just need to scan through that many more rows. Sure in certain specific cases denormalization can be a valid optimization, but in the general case it will just make queries slower.

Saying joins are cheap or expensive only makes sense when comparing to the alternative.
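For the single-row case, the query plan makes the point. A sketch in SQLite via Python (made-up tables): with an index on the join column, fetching one user's orders is a couple of index searches, not a scan of either table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                ((i, f"user{i}") for i in range(10_000)))
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                ((i, i % 10_000, i) for i in range(10_000)))
cur.execute("CREATE INDEX orders_user_id ON orders (user_id)")

# The plan shows index SEARCHes rather than full-table SCANs:
plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
    WHERE u.id = 42
""").fetchall()
for row in plan:
    print(row[-1])
```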


> If you are only selecting a single row, a join will be instantaneous.

Exactly. It's not like the join materializes explicitly in memory.

This has all been known for decades since the beginning of relational databases. That's why taking a DB class is valuable, or at least reading a good DB fundamentals book.


Would you care to mention such a book? In the unlikely event that I ever encounter an individual that would need it.


Database System Concepts by Silberschatz, Korth and Sudarshan is great.

I used the 5th edition, but have heard the 6th edition apparently regressed. There's now apparently a 7th edition that I can't comment on.


Came here to say the same :D

But to steel man this a little. Joins can be cheap if you understand how they work and what the patterns of use are. I've worked on systems where that billions of rows to millions scenario is a thing. I recall working on a particularly nasty legacy query underpinning a report that took most of a day to finish that I managed to get down to tens of seconds just by tweaking a bunch of things to join right.


Joins are cheap when joining on indexed columns, but the trade off is extra time maintaining those indexes when writing. As always, it depends on the use case.


Depends a lot on the size of the tables, but that is true for anything to do with databases. It's probably hard to give one-size-fits-all advice about database management, since you're ultimately balancing conflicting interests (query performance, maintainability, disk space).

With a sufficiently small database, even non-indexed joins may appear fast.

If your database has billions of rows or more, then even indexed joins will need to be used judiciously, especially if you have a lot of indexes. The indexes will probably also become very large on disk (possibly hundreds of GB), and they'll also degrade performance since more than likely the system will struggle to keep them in RAM.


Joins can be cheap if you keep half an eye on the query optimiser. Just make sure that the filters only grab what is necessary from your huge tables before the join part starts and a join can be blazingly fast on most modern databases.


It really depends where you put the cost in the end: do you pay the cost on the query side or the manually managing data integrity side?

Even if you don't denormalize, there's plenty of optimizations for joins: e.g. a bitmap join index can optimize a millions x billions join pretty well!


> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap,

Um ..yes? It's cheap if you got your indices right. A hash table doesn't really care if there are 20 or 20B entries.


Normalize your DATA.

Materialize views for high-performance querying.


No, joins on single row to single row with indices are cheap - which is what the article is saying.

You can also use views to keep the joins in place so you don’t need to keep joining.
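For example (a sketch in SQLite via Python; table names are made up), a view bakes the join in once and callers query it like a table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE oidc (
        user_id INTEGER UNIQUE REFERENCES users (id),
        subject TEXT NOT NULL
    )
""")
# The view hides the join; queries against it read like a single table.
conn.execute("""
    CREATE VIEW users_full AS
    SELECT u.id, u.name, o.subject
    FROM users u LEFT JOIN oidc o ON o.user_id = u.id
""")
conn.execute("INSERT INTO users VALUES (1, 'ada')")
conn.execute("INSERT INTO oidc VALUES (1, 'sub-1')")
row = conn.execute("SELECT name, subject FROM users_full WHERE id = 1").fetchone()
print(row)  # ('ada', 'sub-1')
```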


This advice seems a bit too simplistic, and I think misguided at least for the specific example of null values. I only know the details here for Postgres, so I'm not sure if this is valid for other databases.

In Postgres each null value in a column only occupies one bit (with some more complexity due to padding), so unless you have a very, very large number of columns that are mostly empty the space used by null values is negligible.

But one part of the idea I agree with, and that is that it can be useful to keep in mind how wide your tables are. If you're not doing index-only queries the database has to fetch all or most of that data (again, more complex in Postgres for TOAST columns). So if you have a wide table with non-TOAST columns, simple looking queries that only fetch one column might have to do a lot more IO compared to a less wide version of such a table.

I would not necessarily split table just as an optimization, but it helps a bit to think about how the tables will be used when designing the schema. There might be data in the same table that is used or updated in very different patterns, and that might be better handled by separate tables.


> unless you have a very, very large number of columns that are mostly empty the space used is negligible.

If you read the article as the author recommending all of this to save space then I recommend you to read it again.


I haven't read the article (yet), but this is a problem that I've had to deal with several times now recently. The most recent causing me to work 12+ hour days over the holidays.

In this case, there's a table that an ETL job loads data into several times a day. The problem is the original dev(s) thought it would be a good idea to add status columns to this table that the application can write to when it has finished processing the data ("InProgress", "Finalising", "Complete" etc). What does this mean now? If the load process takes longer than a minute or so, user interaction is blocked and ultimately deadlocked. This is extremely poor database (and system) design. The application never needs to update the "loaded" columns. So all modifiable columns should be moved to a separate table and updates either need to be serialised or written at the appropriate isolation levels. Now try to get any of these changes through during the Xmas period...

This is somewhat related to a refactor of another system I performed last year. The system had one Auditing microservice that wrote to a massive table with some keys, identifiers and metadata. Ignoring the fact they needed to write an auditing microservice to begin with, the problem was every time a new part of the system needed auditing the table needed to be modified. And every time you did a query, 70% of the columns were null. Again, this is very poor design. I deleted most of the columns except the identifiers and keys, separated the metadata into their own tables with a foreign key to the Audit table, and left-joined everything. Same result, but far less of a hassle in terms of maintenance and scalability.

[EDIT] I should probably end with this: Make your tables as big or small as they need to be considering your use cases.


There are always two sides to this: joins are expensive, or joins are cheap.

In reality both are true. If you’re doing complex business logic on low cardinality tables, joins win every time. If you’re doing simple retrievals on high cardinality tables, they’re not.

The trick for scaling is figuring out which your use case is and converting between the two as necessary. Maybe one database for editing and one for publishing/serving.

There are rare cases that are high data complexity and high cardinality. Normally these can actually be pushed into one of the other formats, but when they can’t, sharding can often be a good option.


> If you’re doing complex business logic on low cardinality tables, joins win every time. If you’re doing simple retrievals on high cardinality tables, they’re not

Interesting, can’t figure out why this is so (not a databases guy). Could you please explain or push me in the right direction?


"Cardinality" means two different things when it comes to databases: cardinality in modelling vs cardinality in data.

Cardinality in modeling (relationship cardinality) refers to your relationship type i.e. 1:1, 1:M, M:M

Cardinality w.r.t. data refers to how many unique values a particular column can have. e.g. a Primary Key has high cardinality because every row has a unique value. A boolean flag has low cardinality because it only has two values.

TBH, I'm not 100% sure which one the OP is talking about here because you need to factor in number of joins and indexes etc, but from the perspective on relationship cardinality, you can think of some of the common differences in modelling for OLTP ("real time" databases) and OLAP (data warehouse etc).

The former would generally have smaller tables and more joins, while the latter would usually have relatively massive tables which inherently could imply fewer joins, but this is often not the case. It just depends on your system and use-case.


A bit late, but what I meant was that:

With complex data but not too much of it, going all in on relational modelling makes development much easier, more reliable, and databases are perfectly capable of handling the necessary joins.

With lots of data that isn’t too complex, denormalising a little and not using as many joins, or using a DBMS that doesn’t have joins and denormalising a lot (E.g. Redis, Cassandra, etc) will work well.

Most problems are one or the other, or can be refactored into one or the other.

At a previous job, we went hard on the relational aspect, and it worked well almost all the time. Inventory management, order management, user accounts, internal processes, etc. The main place it broke down was one table (out of ~450) of user content that due to the way the product worked, represented about 50% of our data. Working with that table became very hard. The right move would have been then to remove it and use a more appropriate DBMS, abandoning the heavily relational design for just that part.


“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

- Donald Knuth "The Art of Computer Programming"


First, this is about data access and not programming.

Second, that quote is from a time when compiler optimizations did not exist, and the programmer was supposed to use all kinds of clever tricks to speed up code (what today you get for free with -O3). That kind of optimization is the context of the quote, and it's hardly ever appropriate to just throw into some discussion about optimization.


... or maybe stop running SELECT * everywhere.

collecting all of that data and sending it over the wire to an application where it will be deserialized when you only need 1-2 fields for the lifetime of the connection is just waste.


That part of the article referred to the programmer visually inspecting the contents of a table.


my response was to the title of the article, and likely a better solution than what they are proposing.


The word "smaller" can take different meanings, depending on the context. This is implied by the content of the article.

Your proposed solution is perfectly reasonable, but it is orthogonal to the topic in the article.


With the proviso that this all depends on the shape of your data, I think a single JSON document field is usually the right choice for anything that smells of content management. Relational database schemas become very complex very quickly when you strive for full normalization, and that complexity is simply not worth worrying about until you have more data than RAM in a machine.


I don't have any experience with this, but isn't content management exactly the type of use case in which the total data will be more than the RAM?


The large stuff is usually binary files stored outside the database. The database itself tends to only include text and metadata.


I find that in a lot of cases it makes sense to have subsidiary 1-1 relationships for auxiliary data. The goal, usually, is to wean the application off of using that data directly. Especially for hot tables in non-column-store databases it can make a big difference - keeping all the data in a table at a similar level of importance and access frequency helps the application cache, DB memory, and disk cache stay relevant.

It may or may not be worth going the final step and deleting those tables, but keeping the number of non-null columns in a table small is a big help.

For Postgres, at least, null columns have a pretty negligible impact, so even very sparse tables are fine, as long as the data is null-or-useful. That is a recent change in my understanding - you no longer need high density for high performance. The worst case is storing large volumes of non-null not-very-useful data directly in high frequency use tables. Examples I've seen are things like base64 encoded columns holding profile images in the main `user` table.
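That last case is exactly where a 1:1 side table pays off. A sketch in SQLite via Python (table names are made up): the hot table stays narrow, and the bulky rarely-read payload moves out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Hot table stays narrow, so frequent reads pull small rows.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
# The bulky, rarely-read payload lives in a 1:1 side table.
conn.execute("""
    CREATE TABLE user_avatars (
        user_id INTEGER PRIMARY KEY REFERENCES users (id),
        image   BLOB NOT NULL
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO user_avatars VALUES (1, ?)", (b"\x89PNG...",))
# The common query never touches the blob:
email = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
print(email)  # a@example.com
```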


I quite agree with the author; I can see how the code can contain some logic that is more explanatory than simply expanding the data.

I have the impression that often when people design a system, the data saved in the database is never meant to be explored directly, although I don't think that assumption corresponds to the experience of most.


> find columns that are near enough totally unused. This is a strong signal that the feature which relied on the column is not valuable, and the column along with the feature can and should be deleted.

The actual article's title should have been "Delete Unused Code."

There's also a partial rant about what ORMs do to schema in there. Second article?


Often "unused" is tricky too, although customers may not 'actually' use it, sales uses it as a selling point.

Sometimes it is easier for sales to check the box, by saying yes we have that feature, than go into a detailed explanation of why it is not necessary in the particular product.


> Often "unused" is tricky too, although customers may not 'actually' use it, sales uses it as a selling point.

I think there is a semi-truism in enterprise software too, which is that 90% of customers will only use 10% of extended functionality, but they all use a different 10%.

Let's say you have a shopping basket software and only 10% of customers are using your 'expiry-date-tracking' flag (lets say it allows you to track quantities of expiry dates), you can either argue that this is useless to 90% of customers or that it's absolutely critical to 10% of your customers.

You don't sell enterprise software by convincing a supermarket that they don't need expiry date tracking - if you want to operate in that sector, you have to sell it by understanding and matching what the customers needs.


For me the message from this is treat your data like a prince and make your code more complicated if need be to simplify and purify the data structure.

I am sure I have heard strong opinions for the other way around.

Probably like anything it is a tradeoff requiring engineering.

That said I also tend to prefer storing that data in a more normalised format.


This is probably a preference thing (and scale is likely an important factor) and can never be definitively answered, but I find that denormalising data such as SSO IDs into a third table comes at the cost of slightly more complex code and more boilerplate if you've got a caching model. I used to denormalise to this level but have now gone back to putting it all into a fairly wide user table, as a short string field is also very cheap.


Curious, do you mostly denormalize only immutable data? Otherwise how do you handle EC well?


Heh, I actually went the opposite way, as it made my database schemas more difficult. If it's a one-to-one with just one piece of data, make it an extra column instead (sometimes I've also done this for many-to-ones if there is only some small but negligible data duplication going on). One reason I do use one-to-ones is when the column is heavy on updates, because on update PostgreSQL rewrites the entire row.
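That hot-column split can be sketched like this (in SQLite via Python here just to show the shape; the whole-row-rewrite motivation is Postgres-specific, and the table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
# The update-heavy counter gets its own 1:1 table, so bumping it
# never rewrites the wide posts row.
conn.execute("""
    CREATE TABLE post_stats (
        post_id INTEGER PRIMARY KEY REFERENCES posts (id),
        view_count INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("INSERT INTO posts VALUES (1, 'Hello', '... a long body ...')")
conn.execute("INSERT INTO post_stats (post_id) VALUES (1)")
conn.execute("UPDATE post_stats SET view_count = view_count + 1 WHERE post_id = 1")
count = conn.execute("SELECT view_count FROM post_stats WHERE post_id = 1").fetchone()[0]
print(count)  # 1
```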


I agree with the idea that code can contain logic that is more explanatory than simply expanding data, but I have the impression that the data stored in a database is often not meant to be explored directly, although this may not be the case for most people.


Binary is fast: https://featurebase.com


Avoid outer joins.



