Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: What's your preferred way of ensuring complex database integrity?
73 points by Hex08 on May 14, 2019 | hide | past | favorite | 56 comments
What I mean really is: when working with relational databases, and you have complex constraints to ensure (not trivial things like notnull but something like: "two tables away" uniqueness, etc..), do you like filling it with tricks using PKs/FKs/UNIQUEs to ensure integrity or do you prefer writing some code that will get executed as you application boots up, that ensures that all the more complex requirements are met?

Isn't that the primary purpose of transactions?

I mean, every time you write to the database, you open a transaction, do you stuff and close it. It either works or it doesn't. If it doesn't, then it rolls back. Isn't that enough?

I guess I am saying that for complex integrity checks, it should be the responsibility of the application's data layer.

On the other hand what is meant by "complex"? Is it "complex" because of numerous, ever-changing applications that touch the database or is it just one application and a database with a lot of tables?

In the case of wild, numerous applications, I suppose that database-side integrity controls are needed. If you really have ONE application touching the database, it might be better to have the application police integrity. Perhaps there's more than one answer?

For what it’s worth, this is one of the rationales behind microservices: if the concept of an “allowed transaction” in your org is not “all things that can be done to a database” then everything should be using a shared API rather than having direct access.

In my experience, that works great as long as your schema never changes and your software never has bugs.

Here's an example of a "complex" constraint. You have a restaurant reservation system, and reservations are stored as time intervals. Each restaurant has a capacity, and you need to ensure that a booking does not make it so that the restaurant exceeds capacity at any point of time.

One way to resolve it with triggers is to have a stored procedure fetch all the records that intersect the new booking's interval, and procedurally count the number of concurrent bookings at each record's start and end times; the constraint would be violated if and only if the count at any point plus the pax of the new booking exceeds capacity.

Imho that's the only way. Anything else can end up in invalid state.

Yea I agree with you - if your database only ever has one concurrent user (one app and one person using that app at a time) then it doesn't matter that much how you implement it, either in the DB or in the app. The benefit of doing it in the DB is that multiple apps or people using the same data concurrently can be easier to manage depending on the situation. But if it's a single user app? Whatever you prefer is probably fine.

> do you like filling it with tricks using PKs/FKs/UNIQUEs

I am not sure that I would call these "tricks"

As mentioned by the other comment, it's standard to use database triggers to ensure consistency.

There are two major problems with dealing with them in the application side

1. You can no longer guarantee consistency.

2. In the case where you need to do some things procedurally so that your application can no longer make a yes/no query to determine if a constraint is violated but need to make use of intermediate results, then this means that you will need to transfer intermediate results over the wire, which is a performance penalty over just doing it in the database server.

edit: of course, when you need to ensure consistency with some data source outside the database, there's no choice but to do it application-side.

>As mentioned by the other comment, it's standard to use database triggers to ensure consistency.


Triggers are almost always the wrong level to handle consistency. It flies in the face of 'fail fast' if you need to actually handle problems of consistency and if you need 'last line' protection, why not bake it into the structure instead?

Foregoing triggers in favor of validating data in the application layer can be dangerous. The problem is that you need to re-implement data validation on every path to the database. If you've got a single application server, and it's the only client that ever writes to the database, this might work. But as soon as you add another service with write capability, you need to re-implement the validation logic. What if you forget?

If you put the validation in the database, any application can leverage it. If you want to fail fast, you can still do validation in the application layer while relying on the DB layer as the ultimate validator.

It’s generally not good for services to share write privileges to a database for this very reason.

Also you can end up with services communicating between an implicit database api rather than a well established rest or rpc (or whatever) api.

But as soon as you add another service with write capability, you need to re-implement the validation logic.

Then Don’t Do That (tm)

It’s hardly ever a good idea to have multiple apps writing to the same set of tables directly.

Why not? How is it any different than two web requests causing a multi-threaded application server to send two simultaneous writes to the database?

If you use triggers for validation, you can rely on your database's MVCC to resolve any conflicts between triggers. Note that this is significantly more performant and robust than the application trying to resolve, or even notice, those conflicts.

Don’t do that, was in response to:

But as soon as you add another service with write capability, you need to re-implement the validation logic. What if you forget?

In context of application validation logic, the logic is only living in one place.

Triggers are basically “spooky action at a distance”.

It’s also much easier to promote, version, upgrade, and rollback application changes.

There are consistency guarantees that it's possible that you can't ensure with just application logic.

Suppose you are running a booking system for ridesharing. When you want to lock-in a passenger's seat, it's possible that consistency can be violated when you do it at the application level: both A and B request a seat after a query returns that there is a seat available, then simultaneously lock-in their reservation after seeing stale data.

It's also naive to think that validation in the application is sufficient. For, say, a webapp, there should be three places where that occurs: at the client where you can provide the most meaningful feedback, at the service to catch most consistency errors and guard against malicious actors while you can provide meaningful feedback, and at the database level for the most perennial and slow-moving constraints so that concurrency and your own bugs don't fuck up your source-of-truth.

edit: come to think if it, it's possible to do it without triggers as long as you can still express the constraint as a query: append a boolean field to each table with a TRUE check constraint, and use a table query mirroring the constraint to populate that field on every mutation.

Suppose you are running a booking system for ridesharing. When you want to lock-in a passenger's seat, it's possible that consistency can be violated when you do it at the application level: both A and B request a seat after a query returns that there is a seat available, then simultaneously lock-in their reservation after seeing stale data.

This is a solved problem.

  function reserveSeat (customerid, seatid)
      updatedRows  = update seats set customerid = {customerid} where id= {seatid} and customerid is null 

      return updatedRows == 1

Yes pretend scarfaceScript takes care of sql injection vulnerabilities and “lock” makes sure that only one thread can enter the block at any given time.

This is only possible when you have a row for each available seat, though. Consider the case where you don't want to represent each seat in your application.

It's hardly a good idea if you have a database that allows invalid state.

Otherwise it's a fantastic idea.

An “invalid state” can mean a lot of things - including business rules that can’t be expressed by simple relational, constraints rules.

But what happens when you need to change something about the database. Isn’t it a lot easier to have all of the code in one place?

So the usual retort is to put all of the business rules in stores procedures.

Then you have an unholy mess of a database with triggers and stored procedures that are harder to modify, harder to unit test, harder to version and just an unmaintainable mess.

I’ve never heard a single developer say that they love maintaining a system with 100s of large stores procedures, triggers, etc.

Yea the 'correct' approach is often to to both:

- You check for consistency issues on the client-side UI level, and give the user instant feedback when something is not right.

- You also create a db trigger to enforce the rule in the database itself. This way you, as a developer, can write your code with the assumption that the data is gonna be consistent when it comes from the database, even if something went wrong on the client UI level at some point.

Generally my rule of thumb is to keep the integrity checking as close to the data as possible. Usually that means integrating it into the schema/database (optionally with triggers), as long as that has proper tooling (versioning specifically).

With more complex checks it might be necessary to build a periodically runnable check mechanism. At startup might be a good way to do it, or some sort of cronjob. I can imagine that in larger systems there might even be separate services to check data integrity across other services.

I’ve never heard someone say as the database grows in complexity - “I’m so glad we decided to use triggers everywhere”.

I haven't actually used triggers myself, and I do feel like they're too far away from the data to be visible/easy to understand.

I try to encode all constraints in the database using constraints and triggers. This way your data is always consistent regardless of which app/language is interacting with it.

Sometimes I add the same or similar constraints in the application layer because it leads to better error/validation messages.

IMO having data integrity enforced in the database instead of application code is similar to having a type system enforcing code integrity at compile time instead of at runtime. It's about catching errors as early and as close to the root as possible.

If you're using SQL Server, you can use indexed (materialized) views to enforce uniqueness across multiple tables.

There's also a neat little trick with indexed views and cross join[1] that you can use for more complex constraints. I've never used it for anything more than reference tables, so I can't comment on its ability to scale.

[1] https://spaghettidba.com/2011/08/03/enforcing-complex-constr...

If data needs to agree PKs and FKs are a way to go. I avoid UNIQUES unless absolutely necessary since they are not structural.

Reference tables for things that don't change or change very seldom are a valuable tool if data across other tables must agree. Check constraints otherwise.

But that's just the bones -- the last line of defense. Having a framework that can push such constraints through the application is very helpful so you are not waiting for SQL errors to enforce integrity.

It is easy to go overboard, though, and it is worth thinking about how important data integrity and agreement actually is before writing it in stone, so to speak.

UNIQUES are definitely "structural".

In fact, your the foreign side of the FK needs to be covered by a unique constraint (including PK, which is simply a UNIQUE with slight differences).

can you say more about "reference tables" ?


I've also heard them called 'Verification Tables'. They are basically check constraints but stored in a table for reference. If you are using a check constraint in more than one place and it goes beyond a simple 'Y', 'N' it's probably worth thinking about using a reference table instead.

These tables will generally not be changed very often and can be considered 'read only' for the most part. At their most pure, an application can count on certain values to exist in a reference table and might even verify this at startup or populate them with application properties. They could be application level properties if not for the referential integrity needs.

It is often useful, however, to have some reference tables that are only 'mostly read only'. Changeable but largely static.

As a trivial example, a table like state_abbreviations. It is a finite list of data that is unlikely to change.

Data that comes from the government is a pretty good candidate for this kind of thing. It changes, but not often, and you don't want to give users the option to put whatever they want. e.g. Federal School Codes.

Other good candidates are structural aspects of your application. Say, a list of modules that users can navigate to to perform a certain job function. Perhaps you want to give them the ability to compose security roles around those and then apply those roles to users. Also, say, you want to use the same names for menus or reference them elsewhere in a consistent way, a reference table might be the way to go.

I think he means the following, illustrated by example: instead of storing group_name in the users table for each user, you only reference a pk: group_id, and have a separate table groups with the actual name. So the actual value "admin" will only be there once, defending against typos etc.

So basically a normalised database

I much prefer when the database enforces integrity by limiting access and preventing corruption from occurring.

In Microsoft SQL Server, you can often achieve the desired result with check constraints, primary and foreign keys, indexed views, and unique indexes. When more logic is required, I usually restrict write access from the application at the table or column level and force modifications to occur through a stored procedure that enforces integrity.

It is a correctness vs performance trade-off. I typically use foreign keys and the type system as much as possible and back-off when needed for performance or certain cases of flexibility.

Unfortunately unlike compiled typed languages the database foreign keys are just straight up runtime overhead.

Using types correctly such as storing numbers and dates and UUID's in the actual database type rather than strings is an obvious one that typically improves space and performance, there is normally not much trade off here. Varchar with specific length I go back and forth on, it has saved me off on and catching something not validated for length properly in the app, but does get in the way when needed to expand length which happens quite often.

For coded values I like a code table per code type with foreign keys rather than say EAV, this has saved me many times in data quality and allows extra attributes to be added to specific types. It matches more closely to the type system in the client too which will have a specific type for a specific coded value. It also allows reflection like qualities where you can query the schema and follow the types. The downside is in a large app you can have many hundreds or more of code tables and its more work to cross cut features, but the benefits outweigh the downsides for me. this also goes for more full blown "entities" which are more obvious.

I don't normally use more complicated constraint or triggers, the tend to get in the way things like ETL loads and many complex rules end up working out much better in the app code. I am torn about it, but the procedural languages in the DB is usually less expressive with less reuse and can't do the other things needed such as alerting messaging etc.

Bottom line though don't be dogmatic, use the database as you can, test performance make the trade-offs. Sometimes constraints and triggers are better due to their locality to the data, sometimes leaving off foreign keys is better due to the overhead. Sometimes EAV is better for flexibility (or JSON).

In an ideal world we would run the same code in any tier(client, app server, database) for this stuff choosing the best location to run it based on the needs (or run it in multiple tiers, client for user experience, database for transactional correctness). You can almost get there with javascript an PL/V8 now days, and sort of with .Net and Sql Server, but its really not there yet.

I'm moving more and more stuff into the database. Let me clarify my job:

- I use Postgres. This is important. Postgres has long been one of the most stable databases. It has multiversion concurrency control and even lets you make changes to tables within transactions, which makes such changes easier. It has many features, like check constraints, triggers, procedural languages, and JSON, which makes it easy to put move more stuff to your database.

- I have shell access to Linux virtual machines. I don't have layers of frameworks. I use the psql command-line tool and can log in as the superuser, postgres, when I need to. So my use of the database is not inhibited in any way.

- I write internal business apps. They can have complex business rules, but it is not web scale --- though I would be comfortable serving hundreds of requests per second for tens of thousands of users with my set-up, which is just a single virtual machine. I'm being conservative, and it could probably handle 100 times that.

I began 14 years ago as most web programmers do, keeping their database as a dumb data store and writing everything in their middle layer. But I have found maintenance easier, and the overall codebase much, much smaller, by keeping my database in the know as much as possible.

- Are the values in a column supposed to be from a limited set of choices? Tell it! Put those choices in a table, and link the column to that table with a foreign key.

- Is a number supposed to be within a certain range? Is the date in a column supposed to be after a date in another column? Tell it, such as with Check constraints.

- Is the data supposed to be sorted or summed or transformed before printed on the page? Tell it! Put the gob of SQL into a view, so that all your middle layer does is "select * from view".

I generally keep the more complex stuff in the application layer, where it's often easier to capture and change. If all the logic goes into the DB, it can be a real pain to update and maintain (especially if you're running a cluster of sharded databases, say).

Some might suggest that you take a performance hit by not using stored procs etc, but in reality it's never been a problem for my apps (the largest one has over 150k active users - https://usebx.com).

Trying to think through the alternative presented:

If the application is the only way to access the data, it would appear consistent until it wasn't. Assuming two application processes, process one would not immediately see the mistake it made, but process two would see it on startup. So, an audit table might be a good idea to see which process (or user) made which change, including the original erroneous entry, and any change needed to remedy the failed constraint. If the first process is still active, the second process could warn the first process, but that leaves a lot to chance. In comparison, once in place, relational database constraints are always "on" and enforced globally. Enforcement on a per transaction basis might provide a consistent slowdown, but still be preferable to a long startup time for checking the integrity of a large dataset with potentially many errors. In order to handle increasing startup times given boot time integrity checking, you may have to partition the data into more recently and less recently accessed data. In that case, you may find inconsistencies in reporting where the application only touched an earlier set of data once, and didn't enforce integrity checking. On the other hand, you could be presented with an opportunity to remedy the data before you can receive the report. If the entries that are permitted change over time, you would need some kind of date ranges on those values, so that the person receiving the report doesn't enter historically inaccurate data while fixing their own report.

I've got some somewhat delicate relationships in the database (Postgres) for an application that I am working on.

I've opted to build an "API" within the database consisting of views for read-access and stored procedures that can validate relationships upon insert/update. It's pretty clean so far as it allows the data needed for integrity checks to remain in the database.

This hasn't made it to production yet, so take it with a grain of salt :)

It mostly depends on what type of table it is, I am a Db2 DBA z/OS. From experience, PK is must for every table either composite or single column. FK not recommended for 24/7 high volume or transaction table, recommended to handle it via code for ease of maintenance.

Column should be NOT NULL WITH DEFAULT, nulls introduces unnecessary additional where clause checks.

Make your complex database error resistant. Write documents for what it means when references are missing. Do Garbage Collection rather than ON_DELETE triggers. Do everything that SQL doesn't want you to do.

SQL requires defining integrity in advance, and if your database gets corrupted, there is little hope of recover. Cleaning up your data is nearly impossible because your schemas and logic are tightly coupled.

Writing garbage collection scripts is easy if you make a practice of it, and makes you think about your schema in the ways that are important and that the ease of triggers hides from you. You can monitor and metricize it in ways that you can't with triggers.

I’ve yet to encounter DB level integrity problems that aren’t solved by a well designed, normalized schema combined with proper use of transactions, PKs, FKs, unique and not null constraints, etc.

These problems probably exist, but they’re very rare. Much more rare than people THINKING they have a problem like this, hacking around it with poorly implemented application code, and ending up with data that has integrity issues, when it could have been solved well using the DB properly.

We ran a large e-commerce company main site solely on a single (with HA) SQL Server database. (We’ve since added other DBs, and always had other reporting and analysis systems, but we were doing over $1BB/yr on a single transactional server.)

There’s no “room” there for foreign key constraints nor pervasive use of transactions. We used limited transactions in payments area only (and eventually even retired the last of those) and no foreign keys.

Reasonably careful coding, doing a lot of work in stored procedures, and “fix it if it breaks” was the order of the day. I won’t claim it was academically pure, but it sure did work based on the scoreboard (cash register).

Curious to know what you mean by “no room.” As in the performance impact of FKs and extensive use of transactions was too high? In my experience (not insane scale, but around 1 million daily active users hitting DBs with TBs of data), the performance overhead of FKs and transactions is fairly minimal. Ad-hoc attempts to bring the same functionality almost always end up in a loss of referential integrity, corrupt/wrong data, lots of bug squashing, etc. Why waste all that dev time and introduce all those bugs when you can use battle tested approaches like FKs and transactions that can guarantee correctness, at only a small performance hit?

I handle those at both DB and application layer. Integrity is ensured by the database, proper error messages and logging is handled at the application layer.

Database triggers is a common tool to use in such cases

Provided your database system has triggers. Cloud Spanner, for example, does not.

If I'm in an environment where I'm using an RDBMS, then yeah, I mostly try to rely on the built in support for referential integrity that is supplied by the database. If there's a rule that's too complex to express that way, I would probably just write code (in Java, Groovy, Python, whatever I'm working with) to verify things.

It's hard to answer without having enough context. Indexes, Transactions, and Triggers are three common choices, respectively. But don't forget complex data integrity requirements are also often a sign for bad design.

You use functions and stored procedures to provide a controlled API over the table. All actions will go through those and that will have your logic to enforce the constraints.

right - except you can do this outside the database also, in code. But there is something about the rigor/pain-level of writing the in-DB functions that tends to make them change less. It is equally possible with external code, but the mental modal has to create expectations for your environment.

I try to keep code out of my data where possible. No triggers, no constraints (excepting maybe PRIMARY KEY and NOT NULL). Despite otherwise being a nutter for type safety, I'll even favour TEXT over enums.

The main reason is change. As I evolve the code base, I might want to do something today which is principled, but seemed unprincipled yesterday. Database constraints have tended to hit me the hardest at the worst times too, e.g. being unable to make an emergency fix in prod. Or prod can't come up because the flyway script (which succeeded in test/stage) got snagged on a constraint when it ran on prod data.

No triggers, no constraints (excepting maybe PRIMARY KEY and NOT NULL)

Could you not then simply use a nosql solution?

Or you can have that logic at a back-end application level (assuming all clients are going through this one back-end to interact with the database).

I use both.

I keep as much logic out of the database as possible besides PKs/FKs etc.

I keep all of my logic in the application and only one “service” can write to related tables.

I also think that you should always write software as domain specific “microservices”. Not necessarily out of process, separately released microservices, but the Domain Driven Design concept of “services” that can just as easily be part of a monolithic project with different modules or if necessary shared between teams via versioned modules.

It is so much easier to deploy, rollback, and version code than databases.

Integrity checks during startup will slow down in time. Also, does it mean you have to restart the app often? Sounds bad to me.

Complex constraints sounds like an app/domain responsibility, much easier to test, version, troubleshoot, evolve.

When a solution cannot quickly and easily be tested, it impacts overall agility (small iterations become small waterfalls). Having complex constraints in database, could mean slower release cycles, processes (schema changes reviews, etc), more migrations.

It's called database normalization.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact