- Why do you need a user_id in todo_items if you already associate lists to users in todo_lists table?
- "If users have the same list ID..." Why would they?
For the purpose of the article, adding a user ID column to the items table primarily helps with distributing the data, since all of a user's lists and items can be placed on the same node based on the value in the user ID column, which enables transactions and efficient joins.
Distribution aside, it can be a good idea to separate user data by explicitly adding a user ID column for performance and data integrity reasons. It's difficult/inefficient to guarantee inside the database that item IDs aren't reused (e.g. due to bugs in the application, serial overflow). For row-level security , it's also required that all tables have a user column.
Always adding a filter or join by user ID to your queries also allows for more efficient query execution. You often access many or all items of a user at once, in which case it's more efficient for the database to use an index on user ID than doing multiple lookups using an index on list ID alone (this applies generally for multi-tenant apps).
Finally, you might have to scale out one day :).
> Another way to run the query on Citus is to always join on the distribution key, such that the filter can be derived from the join, which gives an equivalent query plan...
This article is a bit opaque and could be explained a bit more IMO. It requires a bit of background intuition about what is going on in it's current form.
But from my perspective, this Schema appears to be using USER_ID as a distribution key. As in, USER_ID specifies nodes/shards where corresponding data exists.
I don't use Citus, but I work on a project that rolls our own sharding for PostgreSQL in a similar way. We use ACCOUNT_ID to distribute data to a specific shard(s). Therefore all of our tables in a shard must have an ACCOUNT_ID column in order for our shard lookup/routing to work. In my data model the control flow works somewhat like:
1. Pass ACCOUNT_ID as part of Query.
2. Routing middleware looks up shards corresponding to that ACCOUNT_ID.
3. Connections are made to appropriate shard(s) and query is run. Other shards are ignored since they should not have any data for that ACCOUNT_ID.
How do you pass the ACCOUNT_ID? Do you parse the SQL query to know where the ACCOUNT_ID is used? Or do you pass it out of band?
We use an ORM/abstraction layer so we can actually get the ACCOUNT_ID as a property of our in memory object before the query is formatted. In this way we can inject the proper connection for the formatted query to be run on before/during handoff to the ORM.
Longer answer is you could do this same injection any number of ways...it is just convenient for our existing stack to do it this way. In some places we actually do it differently in order to facilitate direct queries (without the ORM layer).
I think the author wanted to keep the example brief and omitted several columns from the todo_lists table. In practice, you may have a creation time, last edit time, share options, and other fields for each list -- the todo_lists table.
You wouldn't want to duplicate these values in the todo_items table. By having a user_id on both tables, you can query the todo_lists table on user_id, and join the two tables if needed.
On the "if users have the same list id ..." question, I don't think they need to. If you're for example using UUID or sequence ids, then they wouldn't clash. If the application however had its own logic to generate list ids, then they might.
Again, these are just my educated guesses on the author's intent. Hopefully, they answer your questions.
The primary key of the todo_lists table is "PRIMARY KEY (user_id, list_id)".
So the foreign key in todo_items should be (user_id, list_id), and it is.
"If users have the same list ID..." Why would they?
User 1 has a list 1, and User 2 has a list 1. Just as if User 1 had a list 'grocery', and User 2 had a list 'grocery'.
As a plus in with DBMS that allow you to set the physical ordering on disk (MS SQL Server's clustered index for example) you might actually improve performance by distributing rights to multiple pages and improved locality of data for per-user aggregations. It depends though.
For point 2, sometimes these things happen. Through mistakes, design or just bad design. Best to protect yourself against cross-user data leaks.
(I mean, I guess you could imagine a system that followed the foreign key to todo_lists when you insert a todo_item to figure out which node it belongs in, but that seems tricky and/or slow ...)