
Ask HN: What's your preferred way of ensuring complex database integrity? - Hex08
What I mean really is: when working with relational databases and you have complex constraints to ensure (not trivial things like NOT NULL, but something like "two tables away" uniqueness, etc.), do you like filling it with tricks using PKs/FKs/UNIQUEs to ensure integrity, or do you prefer writing some code that gets executed as your application boots up and ensures that all the more complex requirements are met?
======
crispyambulance
Isn't that the primary purpose of transactions?

I mean, every time you write to the database, you open a transaction, do your
stuff, and close it. It either works or it doesn't. If it doesn't, then it
rolls back. Isn't that enough?

I guess I am saying that complex integrity checks should be the
responsibility of the application's data layer.

On the other hand, what is meant by "complex"? Is it "complex" because of
numerous, ever-changing applications that touch the database, or is it just
one application and a database with a lot of tables?

In the case of wild, numerous applications, I suppose that database-side
integrity controls are needed. If you really have ONE application touching the
database, it might be better to have the application police integrity. Perhaps
there's more than one answer?

~~~
jhanschoo
Here's an example of a "complex" constraint. You have a restaurant reservation
system, and reservations are stored as time intervals. Each restaurant has a
capacity, and you need to ensure that a booking does not cause the restaurant
to exceed capacity at any point in time.

One way to resolve it with triggers is to have a stored procedure fetch all
the records that intersect the new booking's interval and procedurally sum the
pax of the concurrent bookings at each record's start and end times; the
constraint is violated if and only if the sum at some point, plus the pax of
the new booking, exceeds capacity.
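
A sketch of such a trigger in Postgres, with an assumed schema of
restaurants(id, capacity) and reservations(restaurant_id, pax, start_at,
end_at). Note that without serializable isolation (or locking the restaurant
row), two concurrent inserts could each pass the check:

    CREATE FUNCTION check_capacity() RETURNS trigger AS $$
    DECLARE
      cap   integer;
      worst integer;
    BEGIN
      SELECT capacity INTO cap FROM restaurants WHERE id = NEW.restaurant_id;

      -- Peak occupancy over the new interval occurs at its own start time
      -- or at the start of some overlapping reservation.
      SELECT COALESCE(MAX(total), 0) INTO worst
      FROM (SELECT NEW.start_at AS t
            UNION
            SELECT start_at FROM reservations
            WHERE restaurant_id = NEW.restaurant_id
              AND start_at > NEW.start_at AND start_at < NEW.end_at) pts,
      LATERAL (SELECT SUM(pax) AS total
               FROM reservations
               WHERE restaurant_id = NEW.restaurant_id
                 AND start_at <= pts.t AND end_at > pts.t) s;

      IF worst + NEW.pax > cap THEN
        RAISE EXCEPTION 'capacity of restaurant % exceeded', NEW.restaurant_id;
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER reservations_capacity
      BEFORE INSERT OR UPDATE ON reservations
      FOR EACH ROW EXECUTE PROCEDURE check_capacity();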

~~~
zepolen
Imho that's the only way. Anything else can end up in an invalid state.

------
tdurden
> do you like filling it with tricks using PKs/FKs/UNIQUEs

I am not sure that I would call these "tricks"

------
jhanschoo
As mentioned by the other comment, it's standard to use database triggers to
ensure consistency.

There are two major problems with handling them on the application side:

1. You can no longer guarantee consistency.

2. When a constraint can't be decided with a single yes/no query and the
application has to work procedurally with intermediate results, those
intermediate results must be transferred over the wire, which is a performance
penalty compared to doing everything in the database server.

edit: of course, when you need to ensure consistency with some data source
outside the database, there's no choice but to do it application-side.

~~~
clavalle
>As mentioned by the other comment, it's standard to use database triggers to
ensure consistency.

 _shudder_

Triggers are almost always the wrong level at which to handle consistency.
They fly in the face of 'fail fast' if you actually need to handle consistency
problems, and if you need 'last line' protection, why not bake it into the
structure instead?

~~~
chatmasta
Forgoing triggers in favor of validating data in the application layer can be
dangerous. The problem is that you need to re-implement data validation on
every path to the database. If you've got a single application server, and
it's the only client that ever writes to the database, this might work. But as
soon as you add another service with write capability, you need to
re-implement the validation logic. What if you forget?

If you put the validation in the database, any application can leverage it. If
you want to fail fast, you can still do validation in the application layer
while relying on the DB layer as the ultimate validator.

~~~
scarface74
_But as soon as you add another service with write capability, you need to
re-implement the validation logic._

Then Don’t Do That (tm)

It’s hardly ever a good idea to have multiple apps writing to the same set of
tables directly.

~~~
chatmasta
Why not? How is it any different than two web requests causing a
multi-threaded application server to send two simultaneous writes to the
database?

If you use triggers for validation, you can rely on your database's MVCC to
resolve any conflicts between triggers. Note that this is significantly more
performant and robust than the application trying to resolve, or even notice,
those conflicts.

~~~
scarface74
“Don’t do that” was in response to:

 _But as soon as you add another service with write capability, you need to
re-implement the validation logic. What if you forget?_

In the context of application validation logic, the logic lives in only one
place.

Triggers are basically “spooky action at a distance”.

It’s also much easier to promote, version, upgrade, and rollback application
changes.

~~~
jhanschoo
There are consistency guarantees that you can't necessarily ensure with
application logic alone.

Suppose you are running a booking system for ridesharing. When you want to
lock in a passenger's seat, it's possible for consistency to be violated when
you do it at the application level: both A and B request a seat after a query
returns that there is a seat available, then simultaneously lock in their
reservations after seeing stale data.

It's also naive to think that validation in the application is sufficient.
For, say, a webapp, it should occur in three places: at the client, where you
can provide the most meaningful feedback; at the service, to catch most
consistency errors and guard against malicious actors while still providing
meaningful feedback; and at the database level, for the most perennial and
slow-moving constraints, so that concurrency and your own bugs don't fuck up
your source of truth.

edit: come to think of it, it's possible to do this without triggers as long
as you can still express the constraint as a query: append a boolean field to
each table with a CHECK constraint that it is TRUE, and populate that field on
every mutation using a query that mirrors the constraint.
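
A sketch in Postgres, using a simpler assumed rule ("an owner may have at
most three projects"); this is still racy under concurrent inserts unless
the transactions are serializable:

    -- The CHECK can only see the row itself, but the value stored in the
    -- flag comes from a query mirroring the cross-row constraint.
    ALTER TABLE projects
      ADD COLUMN under_quota boolean NOT NULL CHECK (under_quota);

    -- A violating insert computes FALSE here, and the CHECK rejects it:
    INSERT INTO projects (owner_id, name, under_quota)
    VALUES (7, 'side-project',
            (SELECT COUNT(*) < 3 FROM projects WHERE owner_id = 7));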

~~~
scarface74
_Suppose you are running a booking system for ridesharing. When you want to
lock in a passenger's seat, it's possible for consistency to be violated when
you do it at the application level: both A and B request a seat after a query
returns that there is a seat available, then simultaneously lock in their
reservations after seeing stale data._

This is a solved problem.

    
    
    function reserveSeat(customerid, seatid)
    {
        lock
        {
            updatedRows = update seats
                          set customerid = {customerid}
                          where id = {seatid} and customerid is null

            return updatedRows == 1
        }
    }
    
    

Yes, pretend scarfaceScript takes care of SQL injection vulnerabilities and
that “lock” makes sure only one thread can enter the block at any given time.

~~~
jhanschoo
This is only possible when you have a row for each available seat, though.
Consider the case where you don't want to represent each seat in your
application.

------
TotempaaltJ
Generally my rule of thumb is to keep the integrity checking as close to the
data as possible. Usually that means integrating it into the schema/database
(optionally with triggers), as long as that has proper tooling (versioning
specifically).

With more complex checks it might be necessary to build a periodically
runnable check mechanism. At startup might be a good way to do it, or some
sort of cronjob. I can imagine that in larger systems there might even be
separate services to check data integrity across other services.

~~~
scarface74
I’ve never heard someone say, as the database grows in complexity, “I’m so
glad we decided to use triggers everywhere”.

~~~
TotempaaltJ
I haven't actually used triggers myself, and I do feel like they're too far
away from the data to be visible/easy to understand.

------
vbsteven
I try to encode all constraints in the database using constraints and
triggers. This way your data is always consistent regardless of which
app/language is interacting with it.

Sometimes I add the same or similar constraints in the application layer
because it leads to better error/validation messages.

IMO having data integrity enforced in the database instead of application code
is similar to having a type system enforcing code integrity at compile time
instead of at runtime. It's about catching errors as early and as close to the
root as possible.

------
lazulicurio
If you're using SQL Server, you can use indexed (materialized) views to
enforce uniqueness across multiple tables.

There's also a neat little trick with indexed views and cross join[1] that you
can use for more complex constraints. I've never used it for anything more
than reference tables, so I can't comment on its ability to scale.

[1] [https://spaghettidba.com/2011/08/03/enforcing-complex-constraints-with-indexed-views/](https://spaghettidba.com/2011/08/03/enforcing-complex-constraints-with-indexed-views/)
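
Roughly, the trick in [1] looks like this (T-SQL; table and column names
assumed here). The view produces one row per violation of the rule, and a
cross join with a two-row helper table turns any violation into a duplicate
that a unique index rejects:

    -- Helper table that always contains exactly two rows.
    CREATE TABLE dbo.two_rows (n int NOT NULL PRIMARY KEY);
    INSERT INTO dbo.two_rows VALUES (1), (2);
    GO
    -- Rule: no email may appear in both customers and suppliers.
    CREATE VIEW dbo.chk_email_in_one_table
    WITH SCHEMABINDING AS
    SELECT c.email
    FROM dbo.customers AS c
    INNER JOIN dbo.suppliers AS s ON s.email = c.email
    CROSS JOIN dbo.two_rows AS t;
    GO
    -- Any violating email yields two identical rows, so writes that
    -- break the rule fail to maintain this index:
    CREATE UNIQUE CLUSTERED INDEX ux_chk_email_in_one_table
    ON dbo.chk_email_in_one_table (email);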

------
clavalle
If data needs to agree, PKs and FKs are the way to go. I avoid UNIQUEs unless
absolutely necessary, since they are not structural.

Reference tables for things that don't change or change very seldom are a
valuable tool if data across other tables must agree. Check constraints
otherwise.

But that's just the bones -- the last line of defense. Having a framework that
can push such constraints through the application is very helpful so you are
not waiting for SQL errors to enforce integrity.

It is easy to go overboard, though, and it is worth thinking about how
important data integrity and agreement actually is before writing it in stone,
so to speak.

~~~
mistrial9
can you say more about "reference tables" ?

~~~
clavalle
Sure.

I've also heard them called 'verification tables'. They are basically check
constraints, but stored in a table for reference. If you are using a check
constraint in more than one place and it goes beyond a simple 'Y'/'N', it's
probably worth thinking about using a reference table instead.

These tables will generally not be changed very often and can be considered
'read only' for the most part. At their most pure, an application can count on
certain values to exist in a reference table and might even verify this at
startup or populate them with application properties. They could be
application level properties if not for the referential integrity needs.

It is often useful, however, to have some reference tables that are only
'mostly read only'. Changeable but largely static.

As a trivial example, a table like state_abbreviations. It is a finite list of
data that is unlikely to change.
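
A minimal sketch (column names illustrative):

    CREATE TABLE state_abbreviations (
      abbrev char(2) PRIMARY KEY,
      name   text    NOT NULL UNIQUE
    );
    INSERT INTO state_abbreviations VALUES ('CA', 'California'), ('TX', 'Texas');

    -- Referencing tables now get the check for free:
    CREATE TABLE mailing_addresses (
      id     serial  PRIMARY KEY,
      street text    NOT NULL,
      state  char(2) NOT NULL REFERENCES state_abbreviations
    );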

Data that comes from the government is a pretty good candidate for this kind
of thing. It changes, but not often, and you don't want to give users the
option to put whatever they want. e.g. Federal School Codes.

Other good candidates are structural aspects of your application. Say, a list
of modules that users can open to perform a certain job function. Perhaps you
want to give them the ability to compose security roles around those and then
apply those roles to users. Also if, say, you want to use the same names for
menus or reference them elsewhere in a consistent way, a reference table might
be the way to go.

------
kogir
I much prefer when the database enforces integrity by limiting access and
preventing corruption from occurring.

In Microsoft SQL Server, you can often achieve the desired result with check
constraints, primary and foreign keys, indexed views, and unique indexes. When
more logic is required, I usually restrict write access from the application
at the table or column level and force modifications to occur through a stored
procedure that enforces integrity.
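
As a rough T-SQL sketch (object names assumed; ownership chaining lets the
procedure write to a table the role itself cannot):

    -- Take direct writes away from the application role...
    DENY INSERT, UPDATE, DELETE ON dbo.order_lines TO app_role;
    GO
    -- ...and funnel modifications through a procedure that checks the rules.
    CREATE PROCEDURE dbo.add_order_line
      @order_id int, @sku varchar(32), @qty int
    AS
    BEGIN
      IF @qty <= 0 THROW 50001, 'quantity must be positive', 1;
      INSERT INTO dbo.order_lines (order_id, sku, qty)
      VALUES (@order_id, @sku, @qty);
    END;
    GO
    GRANT EXECUTE ON dbo.add_order_line TO app_role;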

------
SigmundA
It is a correctness vs. performance trade-off. I typically use foreign keys
and the type system as much as possible, and back off when needed for
performance or certain cases of flexibility.

Unfortunately, unlike in compiled typed languages, foreign keys in the
database are just straight-up runtime overhead.

Using types correctly, such as storing numbers, dates, and UUIDs in the actual
database type rather than in strings, is an obvious one that typically
improves space and performance; there is normally not much trade-off here.
Varchar with a specific length I go back and forth on: it has saved me now and
then by catching something not properly validated for length in the app, but
it gets in the way when a length needs to expand, which happens quite often.

For coded values I like a code table per code type, with foreign keys, rather
than, say, EAV; this has saved me many times in data quality and allows extra
attributes to be added to specific types. It also matches more closely the
type system in the client, which will have a specific type for a specific
coded value, and it allows reflection-like qualities where you can query the
schema and follow the types. The downside is that in a large app you can have
many hundreds or more code tables and it's more work to cross-cut features,
but the benefits outweigh the downsides for me. This also goes for more
full-blown "entities", which are more obvious.

I don't normally use more complicated constraints or triggers; they tend to
get in the way of things like ETL loads, and many complex rules end up working
out much better in app code. I am torn about it, but the procedural languages
in the DB are usually less expressive, offer less reuse, and can't do the
other things needed, such as alerting, messaging, etc.

Bottom line, though: don't be dogmatic. Use the database as you can, test
performance, and make the trade-offs. Sometimes constraints and triggers are
better due to their locality to the data; sometimes leaving off foreign keys
is better due to the overhead. Sometimes EAV is better for flexibility (or
JSON).

In an ideal world we would run the same code in any tier (client, app server,
database) for this stuff, choosing the best location to run it based on the
needs (or running it in multiple tiers: client for user experience, database
for transactional correctness). You can almost get there with JavaScript and
PL/V8 nowadays, and sort of with .NET and SQL Server, but it's really not
there yet.

------
combatentropy
I'm moving more and more stuff into the database. Let me clarify my job:

- I use Postgres. This is important. Postgres has long been one of the most
stable databases. It has multiversion concurrency control and even lets you
make changes to tables within transactions, which makes such changes easier.
It has many features, like check constraints, triggers, procedural languages,
and JSON, which make it easy to move more stuff into your database.

- I have shell access to Linux virtual machines. I don't have layers of
frameworks. I use the psql command-line tool and can log in as the superuser,
postgres, when I need to. So my use of the database is not inhibited in any
way.

- I write internal business apps. They can have complex business rules, but
it is not web scale --- though I would be comfortable serving hundreds of
requests per second for tens of thousands of users with my set-up, which is
just a single virtual machine. I'm being conservative, and it could probably
handle 100 times that.

I began 14 years ago as most web programmers do, keeping the database as a
dumb data store and writing everything in the middle layer. But I have found
maintenance easier, and the overall codebase much, much smaller, by keeping my
database in the know as much as possible.

- Are the values in a column supposed to be from a limited set of choices?
Tell it! Put those choices in a table, and link the column to that table with
a foreign key.

- Is a number supposed to be within a certain range? Is the date in a column
supposed to be after a date in another column? Tell it, such as with check
constraints.

- Is the data supposed to be sorted or summed or transformed before being
printed on the page? Tell it! Put the gob of SQL into a view, so that all your
middle layer does is "select * from view" (see the sketches below).
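
For instance (Postgres; all names illustrative):

    -- Limited set of choices: a table plus a foreign key.
    CREATE TABLE statuses (status text PRIMARY KEY);
    INSERT INTO statuses VALUES ('draft'), ('published'), ('archived');

    CREATE TABLE events (
      id        serial  PRIMARY KEY,
      status    text    NOT NULL REFERENCES statuses,
      attendees integer NOT NULL CHECK (attendees BETWEEN 1 AND 500),
      starts_on date    NOT NULL,
      ends_on   date    NOT NULL,
      CHECK (ends_on >= starts_on)
    );

    -- Sorting and summing live in a view; the middle layer just selects.
    CREATE VIEW event_summary AS
    SELECT status, count(*) AS events, sum(attendees) AS attendees
    FROM events
    GROUP BY status
    ORDER BY status;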

------
osrec
I generally keep the more complex stuff in the application layer, where it's
often easier to capture and change. If all the logic goes into the DB, it can
be a real pain to update and maintain (especially if you're running a cluster
of sharded databases, say).

Some might suggest that you take a performance hit by not using stored procs
etc, but in reality it's never been a problem for my apps (the largest one has
over 150k active users - [https://usebx.com](https://usebx.com)).

------
man2525
Trying to think through the alternative presented:

If the application is the only way to access the data, it would appear
consistent until it wasn't. Assuming two application processes, process one
would not immediately see the mistake it made, but process two would see it on
startup. So an audit table might be a good idea, to see which process (or
user) made which change, including the original erroneous entry and any change
needed to remedy the failed constraint. If the first process is still active,
the second process could warn the first process, but that leaves a lot to
chance.

In comparison, once in place, relational database constraints are always "on"
and enforced globally. Enforcement on a per-transaction basis might impose a
consistent slowdown, but still be preferable to a long startup time for
checking the integrity of a large dataset with potentially many errors.

To handle increasing startup times under boot-time integrity checking, you may
have to partition the data into more recently and less recently accessed data.
In that case, you may find inconsistencies in reporting where the application
only touched an earlier set of data once and didn't enforce integrity
checking. On the other hand, you could be presented with an opportunity to
remedy the data before you receive the report. If the entries that are
permitted change over time, you would need some kind of date ranges on those
values, so that the person receiving the report doesn't enter historically
inaccurate data while fixing their own report.

------
joshmarlow
I've got some somewhat delicate relationships in the database (Postgres) for
an application that I am working on.

I've opted to build an "API" within the database, consisting of views for
read access and stored procedures that validate relationships on
insert/update. It's pretty clean so far, as it allows the data needed for
integrity checks to remain in the database.

This hasn't made it to production yet, so take it with a grain of salt :)

------
bobbydreamer
It mostly depends on what type of table it is. I am a Db2 DBA on z/OS. From
experience, a PK is a must for every table, either composite or single-column.
FKs are not recommended for 24/7 high-volume or transaction tables; it's
recommended to handle that via code for ease of maintenance.

Columns should be NOT NULL WITH DEFAULT; nulls introduce unnecessary
additional WHERE-clause checks.

------
GauntletWizard
Make your complex database error resistant. Write documents for what it means
when references are missing. Do garbage collection rather than ON DELETE
triggers. Do everything that SQL doesn't want you to do.

SQL requires defining integrity in advance, and if your database gets
corrupted, there is little hope of recovery. Cleaning up your data is nearly
impossible because your schemas and logic are tightly coupled.

Writing garbage collection scripts is easy if you make a practice of it, and
makes you think about your schema in the ways that are important and that the
ease of triggers hides from you. You can monitor and metricize it in ways that
you can't with triggers.
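
A sketch of such a sweep (Postgres; names assumed), which also gives you a
number to graph:

    -- Reap comments whose parent post is gone, instead of cascading the
    -- delete, and report how many orphans were collected.
    WITH reaped AS (
      DELETE FROM comments c
      WHERE NOT EXISTS (SELECT 1 FROM posts p WHERE p.id = c.post_id)
      RETURNING c.id
    )
    SELECT count(*) AS orphans_collected FROM reaped;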

------
yashap
I’ve yet to encounter DB level integrity problems that aren’t solved by a well
designed, normalized schema combined with proper use of transactions, PKs,
FKs, unique and not null constraints, etc.

These problems probably exist, but they’re very rare. Much more rare than
people THINKING they have a problem like this, hacking around it with poorly
implemented application code, and ending up with data that has integrity
issues, when it could have been solved well using the DB properly.

~~~
sokoloff
We ran a large e-commerce company's main site solely on a single (with HA) SQL
Server database. (We’ve since added other DBs, and always had other reporting
and analysis systems, but we were doing over $1BB/yr on a single transactional
server.)

There’s no “room” there for foreign key constraints or pervasive use of
transactions. We used limited transactions in the payments area only (and
eventually even retired the last of those) and no foreign keys.

Reasonably careful coding, doing a lot of work in stored procedures, and “fix
it if it breaks” was the order of the day. I won’t claim it was academically
pure, but it sure did work based on the scoreboard (cash register).

~~~
yashap
Curious to know what you mean by “no room.” As in the performance impact of
FKs and extensive use of transactions was too high? In my experience (not
insane scale, but around 1 million daily active users hitting DBs with TBs of
data), the performance overhead of FKs and transactions is fairly minimal.
Ad-hoc attempts to bring the same functionality almost always end up in a loss
of referential integrity, corrupt/wrong data, lots of bug squashing, etc. Why
waste all that dev time and introduce all those bugs when you can use battle
tested approaches like FKs and transactions that can guarantee correctness, at
only a small performance hit?

------
0x70dd
I handle those at both the DB and the application layer. Integrity is ensured
by the database; proper error messages and logging are handled at the
application layer.

------
6nf
Database triggers are a common tool in such cases.

~~~
Insanity
Provided your database system has triggers. Cloud Spanner, for example, does
not.

------
mindcrime
If I'm in an environment where I'm using an RDBMS, then yeah, I mostly try to
rely on the built in support for referential integrity that is supplied by the
database. If there's a rule that's too complex to express that way, I would
probably just write code (in Java, Groovy, Python, whatever I'm working with)
to verify things.

------
stunt
It's hard to answer without enough context. Indexes, transactions, and
triggers are three common choices. But don't forget that complex data
integrity requirements are often also a sign of bad design.

------
segmondy
You use functions and stored procedures to provide a controlled API over the
tables. All actions go through those, and they contain the logic that enforces
the constraints.
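
A sketch in Postgres (names assumed), using the seat-reservation example from
upthread:

    -- Applications get EXECUTE on the function but no direct writes.
    REVOKE INSERT, UPDATE, DELETE ON seats FROM app_user;

    CREATE FUNCTION reserve_seat(p_customer int, p_seat int)
    RETURNS boolean
    LANGUAGE plpgsql SECURITY DEFINER AS $$
    BEGIN
      UPDATE seats SET customer_id = p_customer
      WHERE id = p_seat AND customer_id IS NULL;
      RETURN FOUND;  -- true iff the seat was free and is now taken
    END;
    $$;

    GRANT EXECUTE ON FUNCTION reserve_seat(int, int) TO app_user;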

~~~
mistrial9
right - except you can do this _outside_ the database also, in code. But
there is something about the rigor/pain level of writing the in-DB functions
that tends to make them change less. It is equally possible with external
code, but the mental model has to create expectations for your environment.

------
mrkeen
I try to keep code out of my data where possible. No triggers, no constraints
(excepting maybe PRIMARY KEY and NOT NULL). Despite otherwise being a nutter
for type safety, I'll even favour TEXT over enums.

The main reason is change. As I evolve the code base, I might want to do
something today which is principled, but seemed unprincipled yesterday.
Database constraints have tended to hit me the hardest at the worst times too,
e.g. being unable to make an emergency fix in prod. Or prod can't come up
because the Flyway script (which succeeded in test/stage) got snagged on a
constraint when it ran on prod data.

~~~
le-mark
_No triggers, no constraints (excepting maybe PRIMARY KEY and NOT NULL)_

Could you not then simply use a nosql solution?

~~~
Dirlewanger
Or you can have that logic at a back-end application level (assuming all
clients are going through this one back-end to interact with the database).

------
scarface74
I keep as much logic out of the database as possible besides PKs/FKs etc.

I keep all of my logic in the application and only one “service” can write to
related tables.

I also think that you should always write software as domain-specific
“microservices”. Not necessarily out-of-process, separately released
microservices, but the Domain Driven Design concept of “services” that can
just as easily be part of a monolithic project with different modules or if
necessary shared between teams via versioned modules.

It is so much easier to deploy, rollback, and version code than databases.

------
rooam-dev
Integrity checks during startup will slow down over time. Also, does it mean
you have to restart the app often? Sounds bad to me.

Complex constraints sound like an app/domain responsibility: much easier to
test, version, troubleshoot, and evolve.

When a solution cannot quickly and easily be tested, it impacts overall
agility (small iterations become small waterfalls). Having complex constraints
in the database could mean slower release cycles and processes (schema change
reviews, etc.) and more migrations.

------
JHoye
It's called database normalization.

------
dillonmckay
triggers.

