
Databases always have a schema (2018) - ash
https://utcc.utoronto.ca/~cks/space/blog/programming/DatabasesAlwaysSchemas
======
kerng
At a high level, computer science has names for what the author is
describing:

1) Schema on write

The schema is enforced by the storage layer when data is persisted; optimized
for defined use cases and fast. ACID. Typical RDBMS systems.

2) Schema on read

This is what big data query processing and data lakes are about. Data is not
interpreted or given meaning until it is read from the store and put into
context.
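
The contrast can be sketched in a few lines, using Python's stdlib `sqlite3` and `json` as stand-ins for an RDBMS and a data lake (the table and payload names here are made up for illustration):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Schema on write: the table definition rejects malformed rows at insert time.
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Schema on read: persist opaque payloads; meaning is assigned only at query time.
con.execute("CREATE TABLE events (payload TEXT)")
con.execute("INSERT INTO events VALUES (?)", ('{"kind": "login", "user": 7}',))
con.execute("INSERT INTO events VALUES (?)", ("not json at all",))  # stored happily

def read_events(con):
    # Interpretation happens here, at read time, not at write time.
    for (payload,) in con.execute("SELECT payload FROM events"):
        try:
            yield json.loads(payload)
        except json.JSONDecodeError:
            continue  # the reader decides what malformed data means

events = list(read_events(con))  # only the parseable record survives
```

Note that the schema-on-read store accepted the garbage row without complaint; the cost of that flexibility is paid by every reader.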

Just thought that might add value to the discussion.

~~~
BoiledCabbage
Thanks, I almost posted the same. One of the things that makes our field so
great is the ability to learn independently, be productive and accomplish cool
things without needing formal study. But it's also one of its problems:
people can be great engineers but aren't necessarily familiar with long-
discussed concepts or ideas. Even CS education frequently won't cover them
because they aren't practical nor the main focus of theory.

90 percent of us on here have an intuitive understanding of these concepts,
but not having the formal language, definitions and knowledge causes less
efficient communication and the frequent reinvent-the-wheel syndrome.

Maybe the answer is software engineering becoming a more popular major
separate from CS, or maybe it's introducing more classes on it. Or maybe it's
a separate major or learning track on data modeling, programming languages,
applying them to problem solving, validation, and data communication.

As a random example, half the debates on static vs dynamic typing are muddied
by explicit vs more implicit (a la Haskell) typing.

Heck most of us probably couldn't really give an authoritative definition of a
schema, but we know it when we see it.

For example: for something to be a schema, does it have to be in a separate
document, or can it live in code? Does it have to be explicit or can it be
implicit? Does it have to be statically checkable or do run-time semantic
checks count? Is there a complexity bound on validating against a schema
(syntactic vs semantic checking may take very different run times)? Can a
schema be defined as "whatever data structure and content is needed to make
this program X over here run"? Why? Why not?

Not having rigorous definitions here means conversations in this space
usually end up as people reusing the same word with different interpretations
/ meanings and disagreeing without realizing why.

I don't know how we do better. And I absolutely acknowledge I suffer from
doing it just as much as anyone else.

------
fmjrey
I would definitely prefer an approach where the schema is explicit and
defined in a minimum number of places. RDBMS/SQL does let you define schemas,
but an important aspect of how SQL handles them, one that links back to their
inflexibility, is that they're defined at the level of a table, meaning a
(possibly ordered) set of attributes.

A more flexible approach, as taken by Datomic, is to define schemas at the
level of attributes, meaning an entity (table) does not have a schema, only
attributes do. This requires attribute names to live within namespaces, which
can certainly be seen as equivalent to table names, but they don't have to
be: a namespace could also refer to an entire domain/subdomain/org/whatever.

Defining schema at the level of attributes provides a greater level of
flexibility because clients only declare/specify/depend on a partial schema
that lists only the attributes they need. It also helps avoid the "place
oriented programming" approach of result sets, whereby attributes must be in
a certain order, and if one does not have a value, you have to introduce NULL
values.
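
A rough illustration of the idea in Python. This is not Datomic's actual API, just a hypothetical attribute-level validator; the namespaced attribute names are invented:

```python
# Each namespaced attribute carries its own type; entities are just maps
# and need no table-wide definition.
SCHEMA = {
    "person/name":  str,
    "person/email": str,
    "order/total":  float,
}

def validate(entity: dict) -> None:
    # Only the attributes actually present are checked; nothing requires an
    # entity to carry every "person/*" attribute, so no NULL placeholders.
    for attr, value in entity.items():
        expected = SCHEMA.get(attr)
        if expected is None:
            raise KeyError(f"unknown attribute {attr!r}")
        if not isinstance(value, expected):
            raise TypeError(f"{attr!r} expects {expected.__name__}")

validate({"person/name": "Ada"})                      # partial entity: fine
validate({"person/name": "Ada", "order/total": 9.5})  # mixed namespaces: fine
```

The point of the sketch: the schema constrains each attribute wherever it appears, rather than dictating the full shape of any entity.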

~~~
tabtab
Some domains or projects require or work better with "soft" schemas, at least
initially. I always wanted "dynamic relational" where columns are optional and
"create on write" (except for "ID", which is needed for row uniqueness
guarantee). If you query for a column that doesn't exist, instead of an error,
you get a blank for that column (or null, depending on design decisions or
settings).

But, one can gradually add rules as the requirements solidify. You can later
add a rule that all rows of the Employee table must have a non-blank
"last_name" column. You can have a rule/constraint that requires it for new
rows, and perhaps have a "cleaning mode" to check existing rows. With enough
rules/constraints, it acts pretty much like the current crop of RDBMS.
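
A toy sketch of what that might look like (a hypothetical in-memory table, not a real product; the `required` set plays the role of gradually added rules):

```python
# "Dynamic relational" in miniature: columns are created on write, absent
# columns read back as blanks, and constraints can be tightened later.
class DynamicTable:
    def __init__(self):
        self.rows = []
        self.required = set()  # rules added as requirements solidify

    def insert(self, **fields):
        missing = self.required - fields.keys()
        if missing:
            raise ValueError(f"missing required column(s): {sorted(missing)}")
        self.rows.append({"ID": len(self.rows) + 1, **fields})

    def select(self, column):
        # Querying a column that doesn't exist yields blanks, not an error.
        return [row.get(column, "") for row in self.rows]

emp = DynamicTable()
emp.insert(name="Ada")                        # no last_name column yet: fine
emp.required.add("last_name")                 # later, tighten the rule
emp.insert(name="Grace", last_name="Hopper")  # new rows must comply
```

Existing rows written before the rule was added still read back with blanks, which is where the "cleaning mode" idea would come in.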

Dynamic relational could still use SQL, with some minor modifications to
clarify intended compare types. (SQL has warts, but nothing better has arrived
yet that's sufficiently better to justify tossing an established standard.)

It's not SQL versus NoSQL; it should be strongly-typed SQL versus "soft" SQL.
That way one _doesn't_ have to toss all their SQL/RDBMS knowledge to get a
dynamic RDBMS. The current dichotomy of choice is _unnecessarily_ stark: it's
either an RDBMS with SQL, or something very foreign. The NoSQL movement was
driven by hardware & scaling needs, not query language needs.

If anyone is interested in starting up a dynamic relational company, I'd be
glad to supply you with a draft spec for free.

~~~
scriptkiddy
You may have heard of it already, but if you haven't, I would recommend taking
a look at RethinkDb:
[https://www.rethinkdb.com/docs/architecture/](https://www.rethinkdb.com/docs/architecture/)

It is a NoSQL Db, but it has a lot of the features you're talking about. It
doesn't have any form of "schema", but it does allow for SQL-style relational
data. It definitely supports your "dynamic relational" data idea as stated
above. The only issue is that any data constraints need to be addressed
outside of the Db itself.

~~~
tabtab
Interesting, but the "outside" constraint processor issue could be a
show-stopper. The "gradual tightening" of the DB is a key selling point of
dynamic relational. Parts of it could be borrowed to make DR. And it's not
clear how one queries it with SQL (or something very close); a translator
layer would probably be needed.

------
virtualwhys
I don't necessarily buy the two schema argument brought up in the article.

If you generate your model/domain based on the underlying database schema then
you effectively have a single view of the world, one that will ideally blow up
at compile time (if your language is statically typed) when the database
schema changes.

It's not perfect; if someone changes the schema in production then you're
SOL. But at least it brings some sanity to the table vs. NoSQL, where you
simply have no idea what the state of the world is until you run your
program.

F# Type Providers are probably the gold standard wrt binding schema to
application code, but any code generation library will provide a similar
benefit.

~~~
_bxg1
Correction: you have a single _authoritative_ view of the world, plus a
secondary _cached_ view of the world in a different language which probably
doesn't perfectly map to the original one. Many people choose this approach,
but it isn't without its own issues.

~~~
nathan_long
> you have a single authoritative view of the world, plus a secondary cached
> view of the world in a different language

Yes. OTOH, with NoSQL you have _no authoritative_ schema; your code may allow
for multiple different schemas, and your records may have multiple different
schemas, and they could agree or disagree to any extent.

~~~
_bxg1
You can still have an authoritative schema, it just doesn't live in the DB
itself. It would have to be defined in the code, generally.

~~~
nathan_long
I would argue that it's not authoritative. Nothing guarantees that the records
in the db match that schema; the closest you get to a guarantee is your
diligence to 1) run a job mutating all records every time you change your in-
code schema and 2) ensure that nothing but your latest application code can
write to the db.

OTOH, `\d users` in PostgreSQL is authoritative; there cannot be a row that
does not conform to the fields, types, and constraints listed there.

~~~
jt2190
> I would argue that it's not authoritative.

I think the word you're looking for is "enforced". Where there are multiple
schemas, one can certainly be the "authority", even if it's not enforced
across all data.

~~~
nathan_long
Whatever terminology you use, every place you loop through records doing
stuff with them, you'll either have to decide what to do with oddly-shaped
records or implicitly decide to let exceptions occur at that point. Whereas
if PostgreSQL tells you that `user` has an `email character varying(255) NOT
NULL` column, you can be sure that every `user` does. The only place you need
a conditional related to that is when trying to insert an invalid record.
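
A minimal demonstration of that asymmetry, with Python's stdlib `sqlite3` standing in for PostgreSQL (same principle, different DBMS):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (email TEXT NOT NULL)")
con.execute("INSERT INTO users VALUES ('a@example.com')")

# The invalid write is rejected at the single point of entry...
try:
    con.execute("INSERT INTO users VALUES (NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# ...so every reader can loop over rows without a per-row None check.
emails = [email for (email,) in con.execute("SELECT email FROM users")]
```

The conditional lives in exactly one place (the insert path); every read path gets to assume the invariant.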

------
jfengel
The distinction between "database schema" and "data type" is the same
impedance mismatch problem we've had for decades. You end up making an
arbitrary transformation somewhere between your operational code and the
persistence layer, at which time you have two different formats to represent
your data. Any time you represent something twice, at least one of them is
wrong, sooner or later.

The "schema" is the persistence-layer expression of the contracts that
programmers use to keep separate parts of the program consistent (even when
they're divided up among developers, or with yourself over time). The
persistence layer may use that information to provide additional services that
it can't provide on a blob, like indexing. It's "natural" to do that for
performance reasons: you want to lay stuff out on the disk, and limit network
traffic. But in terms of code maintenance, that's optimization, and the root
of all evil.

Few programming environments really take seriously the idea of eliminating
that distinction. What I want is a language that pushes up notions like
transactions and persistence right into the language itself, and let the
compiler/VM take care of whether a piece of data is persistent or not.
Ideally, I'd love a programming environment that could just be shut off at any
moment and restore itself transparently, including unwinding state that had
become out of date. I gather that Inferno took that approach, but I haven't
had a chance to use it.

The relevant bit of that is that it would eliminate the arbitrary distinction
between "database schema" and the rest of your program. You'd use the same
data specification for both, without a translator (or rather, with the
translator completely hidden in the operating environment).

~~~
weberc2
Hard to understand what you're proposing... A language with a type system that
brings its own serialization format (so you don't have to deal with the fact
that SQL and friends have no good way to represent polymorphic data types such
as enums or interfaces)? Or a language with a runtime that uses disk as main
memory (this seems to be the implication from 'program can be terminated and
resumed transparently')?

~~~
j88439h84
Postgres has enums.

~~~
weberc2
Postgres has a feature called "enums", but they are not polymorphic--all
values for a given Postgres enum must have the same type.

------
jayd16
Now, this is just my opinion but I hate ORMs for several reasons. Highest on
the list is that I fundamentally disagree with their reason for existence.
They're designed to help manage the code schema vs db schema "problem" stated
in the article.

The database schema is for the integrity and performance of data storage. The
code schema is for the integrity of the business logic and api. Forcing a
single schema to solve both problems is a huge and, IMO, unnecessary pain.
ORMs attempt to be all things to all use cases and you often end up with a
lot of hoop jumping.

They're not totally disjoint, of course. It's the same data, so it will
naturally be in a similar format, and how code and db interact will often
inform how the schemas are designed. The takeaway is: use the best schema for
the current use case and solve the mapping along the intersections.

My sense is that this is where a lot of the passion for NoSQL came from,
although it's a bit misplaced.

~~~
overgard
I agree, although I think the biggest problem with ORMs is they're super hard
to debug and reason about. You get all kinds of awful SQL out of them that
you might not expect.

~~~
Daishiman
This is a tired point and in a decade of using ORMs this hasn't been the case,
ever.

~~~
jayd16
Why do you think the comment is so common if it was never the case?

~~~
Izkata
Missing knowledge / lack of documentation. Django, for example, has a couple
of ways to see the SQL, but the most useful one (generating it from a
specific QuerySet) I can't find in the documentation; I only know about it
from an answer on StackOverflow.

------
aboutruby
In my current case I have highly unstructured data stored as a jsonb in
postgres, with some basic fields for indexing. So the database doesn't have a
schema, and making many tables for each variation of the data would take an
enormous amount of time.

~~~
borplk
> So the database doesn't have a schema

What you described _IS_ the schema of the database.

Schema doesn't imply rigidness.

A schema can precisely define a wild lack of structure.

The reason it's important to have it is because it keeps you in the control
seat.

~~~
ben509
Another way of looking at it is that you have a schema whether you like it or
not.

If I write some code to sum up account balances, it's going to expect a series
of account objects, and they need to have some `.balance` field.

And those fields need to be decimals, because addition will choke if they're
not.

Your code implicitly defines a schema, and it's utterly rigid because it will
crash if the data structure is incorrect.

Or worse.

Imagine if some of the "balances" are actually some irrelevant field that
happens to be a number, or some are missing because some of the JSON
misspelled the field name and you skipped those records. Now the result you're
processing is silently wrong, and who knows how far the corrupt data gets
before someone catches it?

All schemas are enforced rigidly because that's how computers work. The
difference is whether you want the DBMS to nag you while you're coding, or
whether you'd rather be woken in the middle of the night by an angry customer
or boss.
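
The balance example above, made concrete (a sketch; the field names come from the comment, and the misspelling is deliberate):

```python
from decimal import Decimal

def total_balance(accounts):
    # This loop *is* a schema: each record must be a mapping carrying a
    # "balance" field convertible to Decimal, whether anyone wrote that down
    # or not.
    return sum(Decimal(acct["balance"]) for acct in accounts)

good = [{"balance": "10.50"}, {"balance": "4.25"}]
total = total_balance(good)

bad = [{"balance": "10.50"}, {"blance": "4.25"}]  # misspelled field name
try:
    total_balance(bad)
    crashed = False
except KeyError:
    crashed = True  # the implicit schema enforced itself, rigidly
```

A variant that silently skipped missing fields would be the worse outcome described above: no crash, just a wrong total.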

~~~
threeseed
For many use cases we don't know the schema up front.

So it's irrelevant whether it exists or not because we don't know it.

~~~
ben509
That's a great point, but I think the real issue there is that typical SQL
DBMSs make it very hard to adjust the schema iteratively as you code. After
all, if you're writing in a language like Java, you don't know what your
classes will look like to start with, but coders manage that by adjusting and
refactoring them as they work.
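
For what it's worth, SQL does allow some adjust-as-you-go workflow; the pain is more in coordination than in the mechanics. A small sketch with SQLite (the table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT)")
con.execute("INSERT INTO users VALUES ('ada')")

# Adjusting the schema as you code, analogous to refactoring a class:
# add the column, backfill existing rows, then start relying on it.
con.execute("ALTER TABLE users ADD COLUMN email TEXT")
con.execute("UPDATE users SET email = name || '@example.com' WHERE email IS NULL")

rows = con.execute("SELECT name, email FROM users").fetchall()
```

The hard part in practice is doing this against a live database with other readers and writers, which is what migration tooling exists to manage.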

------
kamikaz1k
Pretty interesting that the author was upset someone said SQL = Schema and
NoSQL = No Schema. It's usually pretty clear that the "no schema" part is in
reference to having no certainty about the schema of the data. Or I guess one
could model it as a spectrum of certainty.

Actually now mulling over it some more, I see their point. It's just that
their consideration of schema is more holistic than typical. I find people
tend to use "schema" to refer to what's in the database, while in code people
refer to it as a "model", and that a schema is just an implementation detail
of a model.

------
_bxg1
This is a good take. I've been offended by this aspect of NoSQL for some time
now, always using something like Mongoose to add schemas on top of it, but I'm
starting to think that what I was really offended by was the lack of types
inherent to plain JavaScript. It now occurs to me that a type system could be
used to define the canonical form of a given DB schema.

I spent a year and a half at an enterprise Java shop, and there was a ton of
work there put into maintaining and syncing the Java types and the SQL schema,
even though that Java application was the only thing that would ever use that
database.

I wonder what it would look like to have a schema definition system that was
equally understood by the DB and the code, so you didn't have to pick one or
the other or both.
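
One hypothetical shape such a system could take, sketched in Python: derive the DDL from an in-code type, so the DB and the application share one definition. (This is an invented illustration, not an existing library; F# Type Providers go the opposite direction, generating types from the live schema.)

```python
import sqlite3
from dataclasses import dataclass, fields

# Map in-code types to SQL column types (a deliberately tiny subset).
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}

@dataclass
class User:
    id: int
    email: str

def ddl_for(cls) -> str:
    # The dataclass is the single source of truth; the DDL is derived.
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]} NOT NULL" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__.lower()} ({cols})"

ddl = ddl_for(User)
sqlite3.connect(":memory:").execute(ddl)  # the DB accepts the generated schema
```

With something like this, the "ton of work syncing the Java types and the SQL schema" collapses to maintaining one definition, at the cost of only expressing what both sides can represent.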

~~~
erik_seaberg
> that Java application was the only thing that would ever use that database

It's a mistake to make this assumption unless you've taken extreme measures to
lock down and audit all access. It's pretty common for the entire dev team to
have read access, and someone has inevitably built some ad hoc thing taking
advantage of that.

~~~
_bxg1
Perhaps I should've said "the only _piece of infrastructure_ that would ever
use that database". DB browsing software and even ad-hoc scripts don't need a
perfect schema; you can usually just grok it by visual inspection.

~~~
erik_seaberg
That's a problem when we break something ad hoc the founder wrote to reconcile
the books once per month. Data attracts surprising and important uses.

------
najarvg
Very interesting. FWIW Martin Fowler, Pramod Sadalage et al have spoken about
the implicit schema for NoSQL dbs from a design-pattern perspective for quite
a while now. Here is an early 2013 piece from him which is very informative:
[https://martinfowler.com/articles/schemaless/](https://martinfowler.com/articles/schemaless/)

------
eadmund
He's correct that there's always a version of the schema in one's code; he's
also correct that it's nice to have a version of the schema in SQL, where
certain properties can be checked & verified.

Wouldn't it be nice to have a schema DSL which enabled one to declare the
schema _in code_ , and have it checked at compile time? Sounds to me like
that'd be the best of both worlds.

And client programs could just import that package as a library, so everyone
would be on the same page …

~~~
foobarbecue
Isn't this what, for example, Django models are? Or even js simpleschema?

~~~
smt88
As far as I know, those only check at runtime (or when generating code), not
at compile time.

~~~
foobarbecue
What do you mean by compile time, when we're talking Django and nodejs?

~~~
smt88
Yeah, that's part of my point. JS and Python can't even do what we're talking
about here.

------
taude
In the context of this conversation, I assume we're talking about a single
data store for all of an app's data needs. However, I'd argue that in more
modern (newer) applications one would design with a couple of different data
stores: NoSQL for data that truly can't be defined by a schema, and a
traditional RDBMS when both ACID and schemas are needed. And then there are
several other types of data stores in between.

------
nathan_long
> your stored data always has a schema, unless your code neither reads nor
> writes it as anything except an opaque blob.

This is well-put. I'd go further: if there's no schema, you don't have data at
all. A row with labelled email and name fields is data; a string of free-form
input in which the user describes him/herself is not data. You might possibly
be able to extract data from it, but if you do, you'll be building a set of
labelled fields.

> You can tell that your code's implicit schema exists even with an SQL
> database, even if your code auto-reads the SQL schemas, by asking what
> happens if the DBAs decide to rename a bunch of tables and a bunch of fields
> in those tables, maybe dropping some and adding others

Maybe a quibble, but I'd say your code has _partial knowledge_ of the
relational db's schema, not that it _duplicates_ the schema, just as it may
have knowledge of the file system structure, the domain names of various
servers, etc, and depend on that knowledge being accurate, without duplicating
them entirely.

> In some ways NoSQL is more honest than SQL, because it tells you straight up
> that it's entirely up to your code to have and maintain a schema.

With a relational db it's _not_ entirely up to your code to have and maintain
a schema. Eg, your code can assume that every "user" record has an "email"
field and that it's NOT NULL and always a string and always unique, assuming
the db is set up to guarantee that. The part of your code that reads records
can be sure of that, even if the part of your code that writes records is lax
about checking (and therefore blows up a lot). With a NoSQL database, it's
possible that some records have blank emails, or numeric emails, or don't have
that field at all, or have that field named something different. Those things
will happen unless you're very careful to ensure in your code that they
don't, and unless you're careful never to let anything write to the database
except your code.

I think a differentiating question is "what is it that guarantees all your
records of type X have the same schema?" If you use a SQL database, the answer
can be "the db". With NoSQL it might be "a combination of app logic,
background jobs and manual intervention for weird cases" or "nothing".

In cases where you want to store JSON blobs, an RDBMS like PostgreSQL lets you
have a JSON column. This is useful (eg) for cases where you need to capture
some input now, and might get around to parsing it out into real data later -
eg "we got these records from the legacy system and don't yet know if /
whether we can import them".
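
A sketch of that pattern, with SQLite standing in for Postgres (a `TEXT` column plays the role of `jsonb`, and the legacy field names are invented):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
# A typed "envelope" around an opaque payload: the indexed columns have a
# hard schema now; the blob can be parsed into real columns later.
con.execute("""CREATE TABLE legacy_import (
    id     INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    raw    TEXT NOT NULL        -- Postgres would use a jsonb column here
)""")

record = {"Cust_Nm": "Ada Lovelace", "Bal": "17.20"}
con.execute("INSERT INTO legacy_import (source, raw) VALUES (?, ?)",
            ("mainframe", json.dumps(record)))

# Later, once the legacy format is understood, pull real data out of the blob:
(raw,) = con.execute(
    "SELECT raw FROM legacy_import WHERE source = 'mainframe'").fetchone()
name = json.loads(raw)["Cust_Nm"]
```

The envelope columns (`id`, `source`) stay queryable and constrained the whole time, even while `raw` remains unparsed.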

Finally, although I wouldn't advocate moving complex application logic to the
database, there are some validations that can only be reliably done by the
database itself or by leaning on the database. Specifically, any validation
that relies on the _current contents of the database_ , such as "don't allow
duplicate user names" (unique constraint) or "don't allow creating a comment
for a post that was deleted" (foreign keys) or "don't allow overlapping
reservations" (PostgreSQL exclusion constraint). Application code can do a
read, check, and insert, but two threads may have a race condition and insert
conflicting data. Application code can ask the db to lock while it does this,
but that's leaning on the db. Or the database schema itself can guarantee this
in a safe and performant way. But only if the database has a real schema.
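
For example, the unique-constraint case (SQLite shown via Python's stdlib `sqlite3`; Postgres behaves the same way for this sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT UNIQUE)")
con.execute("INSERT INTO users VALUES ('ada')")

# An application-level read-then-insert check leaves a race window between
# the SELECT and the INSERT; the database's unique constraint does not.
try:
    con.execute("INSERT INTO users VALUES ('ada')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

However many concurrent writers race, the constraint holds, because the check and the insert are one atomic operation inside the database.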

------
ianamartin
This blog sort of makes the definition of schema so broad that it ceases to
have any real meaning. And it's a stupid argument to have anyway: whether or
not something qualifies as being called a schema. The point is how it
functions, not what you want to call it.

I like to think of model code and sql schema as two branches of government in
your data democracy. Where your code is roughly congress and your SQL schema
is the supreme court. In a well-designed system, your code can do a lot of
different things without running afoul of the courts. Code has quite a lot of
freedom to operate on existing data and store what it needs for the future.
But when there's a conflict, your SQL schema pops up and says, "No. You can't
do that. That's in direct conflict with the stated goals of what this data is
supposed to mean. Either don't do that, or go back and explicitly change what
you mean."

Regardless of whether you want to call it schema or not, NoSQL is an extremely
weak court of final appeal when it comes to deciding whether the code is
obeying the intention of the data design. That is sometimes a problem and
sometimes not.

In general, if you are dealing with data, you need type checking somewhere.
Regardless of your opinion about programming languages, if you care about data
integrity at all, it has to be checked for type correctness. Strong SQL
schemas are far more important when the access layer is via dynamically typed
or weakly typed languages. You can feel a lot better about a weaker schema if
your language is statically typed. Sure, it's weaker and violates a certain
separation of powers and flips the power dynamic to the code being final
arbiter, but it's not always completely awful.

On the other hand, if your application code is written in javascript and your
database has no enforced schematic constraints, your data is already corrupt.
You just don't know it yet.

While theoretically we can account for weaker schemata in NoSQL data stores,
the reality is that every useful dataset will eventually be accessed and
written to by a client you didn't originally plan for. And that client isn't
going to be aware of or play by the rules you intended to enforce in your
code. So once again, your data is, in reality, already corrupt if you
don't/can't enforce schema in your data store.

This isn't always a killer problem. There are a lot of cases where your data
isn't all that important and this is an okay trade off. But pretending that
application code is schema because you want to call it schema and it sort of
does some of the same things so therefore schema is an energy field created by
all living code that surrounds us, penetrates us, and binds our data together
is just kind of bonkers.

