PostgreSQL Schema Design (graphile.org)
187 points by leandot on March 29, 2020 | 41 comments



One thing that trips me up about Postgres is schemas. What is the conceptual difference between a "schema" and a "database"?


In PostgreSQL, the hierarchy is: host > cluster > database > schema > object.

By host, I mean a server. A host can have many database clusters. Usually it has just one. To have more than one, you would have to run more than one PostgreSQL instance, each listening on a separate port. The default port is 5432. I don't recommend more than one cluster per host; I just mention it as a possibility.

A cluster can have many databases.

A database can have many schemas.

A schema can have many objects. By object, the most familiar is a table, but there are other kinds of objects: views, functions, custom types, sequences. An object cannot exist directly in the database. It must be part of a schema. The default schema is called "public". Traditionally in other databases, there was a schema for each user. So if jdoe logs in, his default schema is also called jdoe. In fact in other databases this is the only schema a user can have. You cannot make more schemas and name them whatever you wish.
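
For example, a quick sketch (the schema, table, and column names are just illustrative):

  CREATE SCHEMA schema1;
  CREATE SCHEMA schema2;
  CREATE TABLE schema1.table1 (col integer);
  CREATE TABLE schema2.table2 (col integer);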

The advantage of a schema over a database is that you can make a query that uses objects in different schemas.

  select *
  from schema1.table1
      join schema2.table2 on table1.col = table2.col
If the tables were in different databases, then you could not combine them as easily. I think you would have to resort to Foreign Data Wrappers.

I have gotten a long way, with many applications over many years, with one host, one cluster, one database, and many schemas.


> I don't recommend more than one cluster per host; I just mention it as a possibility.

I actually use this as a cheap data backup mechanism. I have a primary database cluster on SSD, and two replicas on two separate harddrives. All running as three postgresql clusters on the same machine.

The harddrives intentionally run on different filesystems, so that a filesystem bug will not eat all my DB data. (in case someone wants to suggest RAID ;))


This seems complicated and would still fail if something happens to the host hardware itself right? What about projects like wal-e/wal-g that continuously archive to cloud storage?

https://github.com/wal-e/wal-e

https://github.com/wal-g/wal-g


Looks like an interesting option too.

But it's not really complicated. Making a replica is just a few shell commands and two new systemd service files. It's probably simpler than setting up wal-e, especially once you factor in the setup and maintenance costs of those cloud accounts, the need to keep up with regular payments for the services, and the fact that recovery isn't as simple as switching to an already configured and up-to-date replica.


It would only fail if the host hardware failed in a way that damaged all three discs, which seems unlikely. It's for data backup, not uptime redundancy.



A power surge caused by a lightning strike would kill your master as well as your backup. Besides that, do you also keep all your WAL logs so you can roll back an accidentally dropped database?


A power surge here where I am may kill the power supply. It's low probability that it would kill everything.

Anyway, I don't fear lightning/storms/power surges where I am as much as I do the inevitable failures of the storage devices, or kernel bugs.

This is just a part of layered protection I have. Offline backups are nice, and I have them too, but they are out of sync all the time by definition. So unless absolutely needed, having a real-time synchronized replica is much preferred.


You got some good answers already, but one thing to note is that a Postgres connection is always to a single database. So you can query across multiple schemas from the same connection, but not multiple databases.


Indeed, this means that it is logically impossible to connect to postgres and create a database, at least through a normal connection. But you can create a schema with the CREATE SCHEMA statement.


You can create databases via SQL [1]. It's just that after you run the statement, you're still connected to the old database and you need to disconnect in order to connect to the new one.
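
A minimal sketch (the database name is just an example):

  CREATE DATABASE foo;
  -- you are still connected to the old database here;
  -- in psql, you would now reconnect with:  \c foo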

[1]: https://www.postgresql.org/docs/current/sql-createdatabase.h...


Wow, I had no idea. Thank you so much!

(A few lost internet points from the downvotes are well worth it for that gem of knowledge.)


In general, "schema" refers to the design of database elements -- table names, tables' columns and their names and types, function names and prototypes, etc.

In some RDBMSes "schema" also refers to a namespace that qualifies type, table, view, materialized view, and function names -- this qualifier is optional, so you don't always see it, but all objects' fully qualified names include the schema name.

This overloading of the word can cause confusion, naturally, but once you understand it it's easy enough to keep it straight.

"Let me show you my schema" -> generic sense of the word.

"Utilities live in the 'util' schema, while the business logic lives in the 'public' schema" -> the second sense of the word given above.


And the confusion/overloading of the term is even worse than this - in Oracle the concept of "schema" overlaps with "user". In other words, the owner of the object and the namespace of the object overlap.

It makes sense to treat the owner of an object as a kind of namespace the object belongs to, but coupling the user to the namespace takes some getting used to in systems that do that.

The ANSI SQL standard (I think part 11) describes the formal definition of schema.


> In the SQL standard, the notion of objects in the same schema being owned by different users does not exist. Moreover, some implementations do not allow you to create schemas that have a different name than their owner. In fact, the concepts of schema and user are nearly equivalent in a database system that implements only the basic schema support specified in the standard.

--- https://www.postgresql.org/docs/current/ddl-schemas.html


PostgreSQL kinda mimics the Oracle way, though it only goes so far. The default definition of PostgreSQL's search_path is:

"$user", public

So if a schema was named the same as the user, you'd automatically have objects available without qualification.
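
For instance, a hedged sketch assuming a role named jdoe already exists (table name is illustrative):

  -- with the default search_path of "$user", public
  CREATE SCHEMA AUTHORIZATION jdoe;     -- creates a schema named jdoe, owned by jdoe
  CREATE TABLE jdoe.orders (id bigint);

  -- when jdoe runs this, PostgreSQL looks in schema jdoe first, then public
  SELECT * FROM orders;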

I spent a fair amount of time working with Oracle, and it forces the paradigm of a schema meaning a user much more than PostgreSQL does... though with some effort you can make it work somewhat like the logical namespacing capability that PostgreSQL schemas are used for.


There are many good answers here already. As long as you understand that in PostgreSQL 1) you connect to a single database; 2) you can (normally) only query within that single database; and 3) multiple schemas can exist in a database and you may query across them, you understand the functionality well enough.

So I have a few practices related to PostgreSQL schemas that I like to employ when I'm designing a PostgreSQL database. I tend to work on enterprise business systems with larger database structures than many here, I think, so what I typically do isn't for everyone, but maybe it'll help you better understand where this concept fits.

First, most database objects, like tables and functions, must live in some schema in PostgreSQL; by default that schema is 'public'. I actually avoid using the public schema in favor of schemas I create. The reason is that some extensions and such will also define objects in the public schema, and I don't want to confuse stuff from third parties with stuff I manage. By always creating at least one clearly dedicated schema for the objects I create, I know what I'm managing versus what just got thrown into the dumping ground. I do this on databases of all sizes.
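
A minimal sketch of that habit (the schema and table names are just examples):

  CREATE SCHEMA app;   -- everything I create and manage lives here
  CREATE TABLE app.customer (id bigint PRIMARY KEY, name text);
  -- extensions and other third-party objects can keep landing in public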

If the database is sufficiently complex, I may create different schemas for different "modules" that I define in the software. It helps me to understand, in the database, where the boundaries are. For example, I work with an off-the-shelf ERP system. When I create extensions to this system, I will create a new database schema to hold the various tables and database functions required.

I may use PostgreSQL schemas to logically delineate different security concerns; I'll usually do this in conjunction with different authorization roles that have schema-level permissions. I do this more often when there's a need to define a database function "API" to the database. I'll put the data into a data schema, but then I'll create, say, two additional schemas: one to hold private/internal database functions and another to hold the "public" facing API. Applications and integrations that should always go through the function-driven API then connect with a database role that only has access to the functions defined in that schema. I don't mean to suggest that this is a "sufficient" security mechanism on its own, but it is part of a broader defense-in-depth strategy.
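
Roughly, with illustrative schema and role names:

  CREATE SCHEMA data;     -- tables live here
  CREATE SCHEMA private;  -- internal/helper functions
  CREATE SCHEMA api;      -- the "public" facing function API

  CREATE ROLE app_client NOLOGIN;
  GRANT USAGE ON SCHEMA api TO app_client;
  GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA api TO app_client;
  -- app_client gets no USAGE on data or private, so it can only reach what
  -- the api functions expose; those functions would typically be
  -- SECURITY DEFINER (or owned by a role with access to data) so callers
  -- don't need direct table grants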

Anyway, some ideas to go along with the definitions.


Speaking of security and schemas, I just learned about a vulnerability having to do with Postgres's public schema (CVE-2018-1058). It's possible only if you have untrusted users who can create objects, like tables and functions --- which is rare. As an example, someone could make a function in the public schema called "lower" which does something different than the function by the same name in pg_catalog (which lowercases strings). When other users call

  select lower(col) from table
they would be calling public.lower instead of pg_catalog.lower. This is just one example. They could do this for any commonly used function in pg_catalog.
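
A rough sketch of what such a shadowing function could look like; note that for it to win over pg_catalog.lower(text), the argument type needs to be a closer match for the column (e.g. varchar), since pg_catalog is searched first but an exact type match beats one requiring an implicit cast:

  CREATE FUNCTION public.lower(varchar) RETURNS text AS $$
    SELECT 'something malicious could run here'::text;
  $$ LANGUAGE sql;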

The solution is one of:

   drop schema public;
   revoke create on schema public from public;
   alter role all set search_path = "$user";
That last one you could do in postgresql.conf instead.

--- https://www.postgresql.org/docs/current/ddl-schemas.html#DDL...


Yep!

This is another reason why I don't favor actually using "public". Aside from issues like this, it's in the default search_path, and typically you want to leave it there. So when I do create other schemas, I also don't add them to the search_path, globally or otherwise. Overall, I find the search_path setting spooky and, while it can save some typing, that saving ain't worth it... even setting it locally or for a session or a transaction... just no.


A schema is basically a namespace within a database.


Siblings have answered the question, but I'll add that the terminology comes from the SQL standard; I don't believe it's Postgres-specific.


Schemas in databases are similar to what home directories are in operating systems.

In your phrasing, a database (really a database INSTANCE) is similar to a whole server itself, and carries with it overhead such as memory allocation, etc.

When you put multiple schemas in a single instance, the resources allocated to the database instance can be shared, whereas if you have every app in its own instance, you can’t share things like working memory between them because they are in separate processes.


But there's a middle layer between the database instance and the schema, which is also called a "database" (great naming /s). The "foo" in postgres://u:p@host/foo . What's that? So you have database host / instance, database "...", and schema (not to be confused with the schema being the column and view etc definition...!).


In the PostgreSQL world, the "database instance" is called a "cluster". A cluster contains multiple databases (and the definitions of roles and memberships and which roles can access which databases); a database contains multiple schemas; and a schema is where your tables/views/functions and other database objects live.


When you have a database engineer refactoring large tables, that's when this shines.

Use CREATE TABLE ... AS SELECT for cheap copying from one schema to another.
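
A hedged sketch (schema and table names are illustrative; note that CREATE TABLE AS copies data, not indexes or constraints):

  -- copy a large table into a scratch schema to refactor it there
  CREATE TABLE scratch.orders AS
    SELECT * FROM public.orders;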


It is perfectly fine to have a one-to-one mapping between schema and database, thereby treating them as interchangeable. So far as I know, by convention the schema will be the “database name” and the database name would just be “public”.


You have it backwards. The database is named "foo" and the default schema in PG is called "public". PG uses a schema search path. When you select from table you're really selecting from public.table. Cross-schema queries look like SELECT public.client.id, billing.ledger.amount FROM public.client JOIN billing.ledger ON ....


Sorry, yes. Thanks for the correction!


>Warning: Be very careful with logging, while we encrypt our passwords here it may be possible that in a query or server log the password will be recorded in plain text!

I was glad to see this noted, but surprised this is the extent of the advice and that sending passwords to the database in plaintext is still recommended. Are there more fleshed out best practices around avoiding logging pitfalls of doing this?


Hash the password in your application. It's probably not compatible with that PostGraphile thing; it's more general advice.


And if you’re using a language that supports it, pass passwords around in a custom data type/wrapper around string that does not leak its contents when your platform equivalent of .toString() is called (and override the JSON or slog conversion interface).


Another method, which has worked for me, is to never store passwords in the database at all but in a separate authenticator like Google Auth or AWS Cognito. Then the authenticated username is passed in via signed JWT. The database can perform authorization, but authentication (and authentication method) are kept far away from the data in your database.


PostgreSQL has built-in crypt and gen_salt functions (from the pgcrypto extension) which may be good enough for your security requirements.
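
A minimal sketch, assuming a hypothetical account table with a password_hash column:

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  -- store a bcrypt hash rather than the plain-text password
  INSERT INTO account (email, password_hash)
  VALUES ('jdoe@example.com', crypt('s3cret', gen_salt('bf')));

  -- verify: hashing the supplied password with the stored hash as salt
  -- must reproduce the stored hash
  SELECT (password_hash = crypt('s3cret', password_hash)) AS password_ok
  FROM account
  WHERE email = 'jdoe@example.com';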


Is the PG built-in role system flexible enough to do more sophisticated forms of RBAC (groups, shares)?

Is there a performance hit?

Have always liked the idea of graphile & wanted to try PG built-in roles, but wasn't sure whether the feature set was robust / whether the DB communicates properly about errors.


Yes, but you need to model it in the database. You don't even need to use the built-in role system; you can use a user id that you pass into your database session.

Basically, create three tables: user, user_role, role_permission. Each user can have one or more role, and each role has one or more permissions. Permissions could be things like "view_admin_panel" or even granular like "view_project_with_id_5".

Then, you can create a row level security policy that does the right lookup in these tables. I've not run this in a production system yet, but did successfully build out a proof of concept that worked. Performance seemed reasonable.
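
A rough proof-of-concept sketch along those lines (all names are illustrative; "user" is a reserved word, hence app_user; it assumes the application sets a custom app.user_id setting per session or transaction):

  CREATE TABLE app_user (id bigint PRIMARY KEY, name text);
  CREATE TABLE user_role (user_id bigint REFERENCES app_user (id), role_name text);
  CREATE TABLE role_permission (role_name text, permission text);

  CREATE TABLE project (id bigint PRIMARY KEY, name text);
  ALTER TABLE project ENABLE ROW LEVEL SECURITY;

  -- the application runs e.g.  SET LOCAL app.user_id = '42';  after connecting
  CREATE POLICY project_read ON project FOR SELECT USING (
    EXISTS (
      SELECT 1
      FROM user_role ur
      JOIN role_permission rp ON rp.role_name = ur.role_name
      WHERE ur.user_id = current_setting('app.user_id')::bigint
        AND rp.permission = 'view_project_with_id_' || project.id
    )
  );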


Yep, I made permissions tables per-entity too, with simple `CRUD` access a bit like RWX (it was an API). It's definitely fast enough. I was handling ~5000 concurrent users with queries returning in the ms range. To be fair, I skipped RBAC checks if the entity was marked as 'public'.


I've wanted to try this too. I think speed should be fine. Roles are saved in system tables.

Traditionally the application handles authentication and authorization. Then it connects to the database as a generic user account, with access to everything. It lets the signed-in user do only the things that he should, through carefully crafted queries.

If instead you registered each end user as a database user, then that makes certain things easier (and perhaps other things harder). For one thing, the current user is available as a variable, current_user. You could create database views that use that variable, instead of having to feed it into each query as a parameter (a small convenience, I admit). You could make the current_user the default value for columns like created_by. You could update columns like last_edited_by purely through triggers. In general, you could write more of your logic in pure SQL, instead of a tight coupling of SQL and your application.

If you're not used to doing it this way, it feels dangerous, and rightly so. But don't think it's actually harder than securing it in application code, just less familiar. A new user in PostgreSQL has no rights, only what is granted through commands.

For ease of maintenance, you can gather users into groups. They are both called roles. A role that has a login is generally a user. You can add a role to another role, though, and then the second role acts like a group. Then you can grant and revoke rights for the group, instead of having to issue commands to change the rights one user at a time. There is even a function to help check role membership, pg_has_role.
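
A few of those pieces in one hedged sketch (table and role names are illustrative):

  -- current_user as a default for audit columns
  CREATE TABLE document (
    id         bigint PRIMARY KEY,
    body       text,
    created_by text NOT NULL DEFAULT current_user
  );

  -- a "group" role with no login, and a user who is a member of it
  CREATE ROLE editors NOLOGIN;
  CREATE ROLE jdoe LOGIN PASSWORD 'changeme';
  GRANT editors TO jdoe;

  -- rights are granted to the group, not to each user
  GRANT SELECT, INSERT, UPDATE ON document TO editors;

  -- checking membership
  SELECT pg_has_role('jdoe', 'editors', 'member');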


You're right that it's probably not much harder, but the current Postgres role is tied to the session, so you'll either be limited in the number of concurrent users (Postgres does not do great with high connection counts) or be juggling SET ROLE around your connection pooling (which sounds dicey, plus it's just a lot of extra statements).
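
If you do go the SET ROLE route with a pooler, the juggling per transaction looks something like this (role and table names are illustrative):

  BEGIN;
  SET LOCAL ROLE jdoe;       -- only lasts until COMMIT/ROLLBACK
  SELECT * FROM document;    -- runs with jdoe's privileges, current_user = 'jdoe'
  COMMIT;                    -- the pooled connection reverts to the pooler's role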


Misleading title. This is a GraphQL thing.


“While we will discuss how you can use the schema we create with PostGraphile, this article should be useful for anyone designing a Postgres schema.”



