
The Next 50 Years of Databases (2015) - strikelaserclaw
http://www.cs.cmu.edu/~pavlo/blog/2015/09/the-next-50-years-of-databases.html
======
janpot
> ...but humans will never actually write SQL. They will instead ask questions
> about data in a natural language.

Wasn't that the promise of SQL in the first place? I don't believe it. Why
would people (only) want to access structured data with something as ambiguous
as natural language?

~~~
The_rationalist
It's easy to produce unambiguous sentences in English.

~~~
transpostmeta
Then why aren't we programming in English? Why aren't we expressing
mathematical expressions in English?

~~~
BOBOTWINSTON
Exactly.

My initial thought is that highly specific human language is actually harder
to write than most code, and always more verbose. Well-written philosophy
comes to mind.

I'd argue that this is part of the reason ORMs gained so much prominence:
they solve the problem SQL introduces by trying to emulate natural language.
Most developers will gladly risk sacrificing some specificity of the query in
exchange for reduced verbosity. Obviously, if the dev knows their ORM/SQL
well, there is no lost precision.

------
barbecue_sauce
Pavlo's CMU database courses are some of the best educational content on
YouTube, and they are updated regularly with revised course content. Great way
to get up to speed with the technical underpinnings of modern databases, as
well as emerging areas of research.

(Big time commitment though, FYI.)

~~~
muedzi
If I may ask, which career paths would benefit from taking such a course?

------
gwbas1c
> There will be a tighter coupling of programming frameworks and DBMSs such
> that all database interactions will be transparent (and optimal). Likewise,
> SQL (or some dialect of it) will remain the de facto language for
> interacting with a DBMS, but humans will never actually write SQL.

I've seen two disasters where someone mistakenly thought they could use an ORM
to make a database look like a giant, in-memory data structure.

To make a long story short, in both cases, working with a trivial amount of
data in an embedded database was painfully slow.

In general, I think we'll see application programming languages expose more
semantics that streamline data access. The problem is we keep thinking the way
to solve the impedance mismatch is to make the database look like traditional
data structures, when the real solution requires languages with the high
degree of flexibility that a SQL database normally has.

For example: A table T might have columns Id, A, B, and C. We can "select Id,
B from T". But, if I make a class to map to table T, I probably have fields
Id, A, B, and C. Then, if I want to make a little utility function that just
looks at Id and B, it's probably taking a full T object, with all of the
fields loaded.

In such a case, the programming language needs to evolve. My function doesn't
need a full T object, just an object of a class that we, semantically, know is
a "T", but that also has the fields I care about. Then, we need enough
automation within the compiler to know that, if I add a call to my new utility
method, it has to change some code from "select Id, A from T" to "select Id,
A, B from T", and leave other code unchanged.
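
A sketch in plain SQL of the widening that compiler would perform (T and its
columns are the hypothetical table from this example):

    -- Hypothetical table from the example above.
    CREATE TABLE T (Id int PRIMARY KEY, A int, B int, C int);

    -- Before the utility call is added, this code path needs only Id and A:
    SELECT Id, A FROM T;

    -- After the compiler sees that the utility also reads B, it widens the
    -- projection at this call site only, leaving other code unchanged:
    SELECT Id, A, B FROM T;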

~~~
takeda
I agree with you that the problem is more on the language side, but I also
believe that it isn't as bad as it seems with the right tooling.

For example, I recently discovered that I can connect PyCharm to a database.
Interestingly, after I did just that, the IDE downloaded the database schema
and suddenly I got highlighting, autocomplete, and refactoring (as well as
migrations).

Suddenly I felt like I didn't actually need an ORM.

C# also seems to have a thing called LINQ. I'm not as familiar with it, but
my understanding is that it is a language within a language that can
represent SQL statements. I'm guessing this might be what you're talking
about. Ultimately I think the solution is that:

- there's a way to efficiently express what data we need (ORMs have the
issue you mentioned)

- an IDE can understand this syntax, so it can help with it the same as with
the rest of the code.

~~~
gwbas1c
LINQ lets you query and filter collections. You're thinking of LINQ to SQL.

~~~
cwbrandsma
I don't think that is used much anymore. Most shops I've been at have moved
to Entity Framework. It uses a very similar syntax, though.

~~~
gwbas1c
The few times I looked at LINQ to SQL, it didn't meet my requirements.

The first time it had the "update" problem, meaning, it appeared that I had to
always load an object into memory before updating it. (2008)

The second time, it wasn't available in Mono. (2011)

Granted, a lot has changed since then.

~~~
takeda
I see. I never used it myself; from the outside it looked promising. It's sad
that it didn't deliver.

------
thedudeabides5
"The role of humans as database administrators will cease to exist. These
future systems will be too complex for a human to reason about. DBMSs will
finally be completely autonomous and self-healing"

Dunno, seems like as long as there is crappy data that humans need to clean,
enterprise and financial firms will continue to use XL (Excel) as a critical
part of their data infrastructure.

And as long as XL reigns supreme in finance and consulting, it seems a bit
far-fetched to talk about infinitely scalable, sentient and 'self-healing'
DBMSs...

Let's focus on getting the data out of XL, then work on the genie in the
bottle.

~~~
Aloha
Unless you can give users something simple enough to program in themselves,
you'll have to pry Excel out of their cold, dead hands.

~~~
ClumsyPilot
At which point you have built XL :))

------
nitwit005
> but humans will never actually write SQL. They will instead ask questions
> about data in a natural language.

Someone at my previous job trialed directly interpreting users' text queries.
They gave up after seeing what people typed in.

People have no concept of what the computer needs to do its job, so you get
terse gibberish. At a minimum you'd need the computer to talk back and get
them to clarify, but I rather suspect people would hate that. It'd feel like
you spent 20 minutes arguing with a computer to run a report.

------
iamEAP
As an end-user of databases, I hope that in 50 years' time there will be no
meaningful distinction between OLAP and OLTP DBs.

We spend so much time and energy copying data around into specific data stores
to solve specific problems / answer specific questions / enable specific
features... It's messy, complex, and adds a ton of overhead.

It could simplify a lot of technology if an operational database could also
handle non-trivial analytical workloads.

~~~
pintxo
Isn't this the promise made by SAP HANA?

~~~
mathh
Oracle Active Data Guard also (queryable replica), with column store.

------
refset
> temporality will become important as well because it matters how information
> changes over time. Current systems are not able to account for this because
> of the large overhead of storing extracted information

This quote happens to be in the context of video frames but I think temporal
indexing in general is widely under-utilised, both for providing consistent
queries (i.e. the database-as-a-value) and for integration of late-arriving
data (i.e. bitemporality). It seems particularly relevant when considering how
best to synchronise information across the "omnipresent Internet of Things"
(not to mention inter-planetary devices, which also get a mention in the
post!).

~~~
DaiPlusPlus
Unfortunately, Temporal Tables in the latest ISO SQL spec and in MS SQL
Server have issues, at least in their current implementations.

First off, there's no history of schema changes. While you can easily go
from a NOT NULL column to a NULL one, you can't go from NULL to NOT NULL.
This is a deal-breaker when using TT for LoB data, where schema changes
happen somewhat regularly. TT should have been designed with schema
versioning from the start.

The second main issue is the continued lack of tooling and ORM support for
TT. While EF6 is compatible with TT (it will safely ignore the SYSTEM_TIME
period columns), it doesn't let you directly query the history table.

Third - any UPDATE statement, even when it doesn't actually change any data,
causes a new history row to be added, including a full copy of any
nvarchar(max) values. Copy-on-write does not seem to be used. That's a huge
waste of space.

Finally, you cannot exclude columns from temporal tracking - so if you have
frequently-updated columns containing inconsequential data (e.g. display sort-
order columns) then you’ll end up with history table spam.
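
For reference, a minimal T-SQL sketch of a system-versioned table and a
point-in-time query, which is where the issues above show up (table and
column names are made up):

    -- Hypothetical system-versioned (temporal) table.
    CREATE TABLE dbo.Product
    (
        Id        int            NOT NULL PRIMARY KEY CLUSTERED,
        Name      nvarchar(max)  NULL,
        SortOrder int            NOT NULL,  -- churns often; can't be excluded
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

    -- Even a no-op UPDATE copies the full row (Name included) into history:
    UPDATE dbo.Product SET SortOrder = SortOrder WHERE Id = 42;

    -- Point-in-time query against the history:
    SELECT Id, Name
    FROM dbo.Product
    FOR SYSTEM_TIME AS OF '2019-06-01T00:00:00'
    WHERE Id = 42;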

I don't know why the SQL Server team invests massive amounts of engineering
effort into features like these when other massive priorities exist, like
modernising T-SQL's syntax to be less verbose (parameterised object
identifiers and optional predicate clauses, please!) or adding support for
column array-types, which would greatly improve the platform.

~~~
jiggawatts
Linux support for SQL Server is my canary in the coal mine.

It makes zero sense.

Meanwhile, the Windows version, which is used for 99.999% of installs, still
does not support Vista-era APIs like elliptic-curve certificates because of
legacy code nobody has touched in 10+ years.

There's a crazy amount of low-hanging fruit in that product that just isn't
being addressed.

~~~
aljarry
> Linux support for SQL Server is my canary in the coal mine. It makes zero
> sense.

There's one strategic reason: letting developers use it in containers.

------
anonu
I'm continuously impressed that new database technologies keep coming forth,
advancing productivity and pushing the bounds of what databases should be
doing.
I like to think a lot about Kdb, partly because I've used it extensively.
Kdb, aside from its core database functionality, can also be an app server.
There's a natural concept of being able to open a listening socket or
connect to other listeners. Very quickly you can build distributed database
systems with just a few lines of code. It's very powerful... I think the
"way of thinking" in Kdb has yet to permeate other database technologies.

~~~
kthielen
That way of thinking has definitely proven useful in the domain where it
started (at Morgan Stanley). We’ve also been looking at how to map it more
closely to C/C++ and native hardware, kind of a structurally typed Haskell
variant with small header-only views from C/C++ (for native access to files,
network IPC, shared mem structures, ...):

[https://github.com/Morgan-Stanley/hobbes](https://github.com/Morgan-Stanley/hobbes)

------
ibatindev
>Lastly, I will be dead in 50 years.

And I'll be writing a number 8 to fill the Age box :(

------
ukj
>DBMSs will finally be completely autonomous and self-healing

Hahahahahahahaha - such optimism.

Autonomous, self-healing, horizontally-scalable stateless systems are hard-
but-doable.

Autonomous, self-healing, horizontally-scalable _stateful_ systems are the
stuff nightmares are made of, if your application layer doesn't relax its
expectations re: ACID properties of the system.

~~~
carlineng
This is the focus of Andy Pavlo's research:
[https://www.cs.cmu.edu/~pavlo/blog/2018/04/what-is-a-self-driving-database-management-system.html](https://www.cs.cmu.edu/~pavlo/blog/2018/04/what-is-a-self-driving-database-management-system.html)

Essentially, there is a long history of developing tools to make databases
more "autonomous". Pavlo's research centers around accurately capturing the
state of the database, and training a system to learn the impact of tuning one
of the many available knobs.

~~~
ukj
Tuning != Scaling

Tuning == getting the most out of your hardware

Scaling == my perfectly tuned node, with the lowest-latency/mostest IOPS IO
subsystem, and highest number of CPUs, and the mostest amount of RAM can't
handle my workloads. What now?

Re-architect your monolith and re-consider your transactional boundaries -
that's what.

------
amsvie
I see a landscape of SaaS offerings, and connectors/data integrators between
them, that basically works in a plug-and-play manner inside an auditable
analytics environment. Databases are whatever happens in the back of these
data governance environments. We load everything into a large managed data
lake, and connectors are set up automatically.

I also see the scenario of radical data ownership not being addressed, which
may be a black swan event. Open-source competitors / legislation may enforce
the use of e.g. data pods as digital identity storage with managed
distribution of access rights. It's worth a thought what Tim Berners-Lee's
Solid / Inrupt would mean for the future of data storage systems. In this
scenario, the transmission of personal data from pods needs to be optimised
in a secure way.

------
Layvier
Nothing about graph databases? I thought they were on the rise.

~~~
evangow
I was surprised that it didn't mention graph databases either, which is
pertinent to the conversation around the evolution of SQL in the article,
because querying graph databases requires re-thinking (or evolving) the
query language. I'm excited by the work Neo4j and AgensGraph are doing. I
haven't played around with AgensGraph, but I hope it takes off since it
builds on Postgres.

------
FillardMillmore
I certainly think that NoSQL databases are only going to grow more popular in
the future for more complex and specialized use cases. I don't see traditional
RDBMS systems going away any time soon, but I don't know if they'll maintain
dominance in perpetuity.

> The relational model will still dominant for most applications...Likewise,
> SQL (or some dialect of it) will remain the de facto language for
> interacting with a DBMS

The author does believe RDBMS systems will continue to dominate for the next
50 years. I have no particular reason to cast serious doubt on that, but it
will certainly be interesting to see what role and prominence NoSQL
databases have in that future.

~~~
takeda
I believe that's what NewSQL is trying to tackle.

~~~
threeseed
NoSQL covers a lot of categories of database, e.g. graph, wide-column,
columnar, document.

And so not everything is relational, which is what SQL is for.

~~~
Izkata
"NewSQL", not NoSQL, is apparently taking relational databases and adding on
the features that drive people to NoSQL.

(This is also the first time I've heard that term, but it seems to be
several years old, and that's roughly what I got from a few quick searches.)

~~~
takeda
Google Spanner, CockroachDB, etc. are in that category.

------
jasoneckert
Interesting points - I'm skeptical about whether DBMSs will ever be too
complex for DB admins. Complexity will increase, but our level of abstraction
will increase to match, as with any technology. We may still find ourselves
fixing DBMS problems in 50 years, but those problems will likely be on the
same level of abstraction we have at that time.

~~~
devnulloverflow
> Complexity will increase, but our level of abstraction will increase to
> match, as with any technology.

But this is no panacea. The complexity will still be there, and ops folks will
have to deal with it through extra layers of indirection and obscurity (which
are other ways to spell "abstraction").

That might be the best compromise, but it is still a compromise.

------
hliyan
The distinction between database (disk) and memory being an artifact of the
memory hierarchy, I wonder what will happen to the concept of the database
itself, should technological advances erase the speed difference between disk
and memory. Will we simply be 'saving' data into special collection variables
in memory?

~~~
pjc50
I suspect that due to Moore's law issues we will abandon the concept of
synchronous memory access altogether and make the cache hierarchy more
visible. Already a cache miss costs hundreds of cycles, and algorithmic
complexity analysis that assumes memory access is O(1) can lead to poor
real-world results compared to cache-aware algorithms.

------
innagadadavida
One future area he missed predicting is how we could handle ML data. For
example, features could be stored and accessed more naturally as DB rows.
Many tasks, such as regression and classification, could be handled by the
DB system as an extension to SQL, as sketched below.
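
A sketch of what that could look like, borrowing BigQuery ML-style syntax
(the model, tables, and columns here are invented):

    -- Train a classifier directly over a table.
    CREATE MODEL churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned'])
    AS SELECT age, tenure_months, plan_type, churned FROM customers;

    -- Score new rows with the trained model.
    SELECT user_id, predicted_churned
    FROM ML.PREDICT(MODEL churn_model, TABLE new_customers);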

------
mbrodersen
> ...but humans will never actually write SQL. They will instead ask questions
> about data in a natural language.

Nope they won't. Natural languages are not precise enough for this.

------
sys_64738
We might have a sequel to SQL.

------
crazygringo
> _but developers will no longer need to explicitly worry about data model
> their application uses. There will be a tighter coupling of programming
> frameworks and DBMSs such that all database interactions will be transparent
> (and optimal)._

Sorry, but this feels hilariously wrong.

Databases that work performantly are _all_ about making key architectural
decisions about which information is indexed and/or denormalized and/or
preprocessed, how, and ensuring that queries are written only in forms that
take advantage of that design.

It's the difference between queries taking milliseconds or hours on
reasonably-sized datasets of, say, 10GB+.

Because it's shockingly easy to write a query that, otherwise, will not only
read every row of your database, but be required to read every row once _for
every row_, because that's literally the only way to calculate the result if
the right data isn't suitably indexed or preprocessed. (E.g. "find the number
of users who have ever ordered the same item (any item) more than once" on a
table indexed only by order ID.)
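
That example in SQL, against a hypothetical orders table: without a
composite index on (user_id, item_id), the grouping below forces a full scan
plus a giant sort or hash; with it, the groups stream out in index order.

    -- Hypothetical schema: orders(order_id PK, user_id, item_id, ...),
    -- indexed only by order_id.
    SELECT COUNT(DISTINCT user_id)
    FROM (
        SELECT user_id
        FROM orders
        GROUP BY user_id, item_id
        HAVING COUNT(*) > 1   -- same item ordered more than once
    ) AS repeat_buyers;

    -- The architectural decision that makes it fast:
    CREATE INDEX ix_orders_user_item ON orders (user_id, item_id);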

I don't see how that can ever ever be made "transparent", and it certainly
doesn't have anything to do with tighter coupling with a programming
framework.

~~~
foota
I don't think it's anywhere near reality, but there are actually a number of
ways this could work.

If you prepackage your queries, or do some sort of profile-guided
optimization, the database could learn what it needs to index to answer your
queries efficiently. There's already some precedent for this, in how
Facebook, for instance, packages their GraphQL queries ahead of time to
avoid parsing them at runtime.

Alternatively, you could maybe imagine models where storage is sharded along
with the code for running it, so you, say, declaratively specify the data
you need (counts of this grouped by that, etc.), and then at runtime the
data is pre-aggregated per server hosting a certain shard.
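
A sketch of the first idea in SQL terms (the query shape and index name are
invented): the system observes the prepackaged query and derives the index
itself.

    -- Observed, prepackaged query shape (parameter bound at runtime):
    SELECT user_id, COUNT(*)
    FROM orders
    WHERE item_id = ?
    GROUP BY user_id;

    -- Index a profile-guided optimizer might create on its own:
    CREATE INDEX ix_orders_item_user ON orders (item_id, user_id);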

~~~
yazaddaruvala
Rather than specifying a schema+indices or an Elasticsearch-like mapping,
preprocessing queries to derive the mapping will definitely be the next step.

However, I doubt we will ever derive mappings from application code or
(retroactive) profile guided optimization. The biggest problem will become
deployments.

We would need to couple application code deployments to deployments of
databases even more than we do today (which is a giant problem).

When a new index or new data types are needed, we would need to "prime" the
datastore in order to get optimal queries from the start.

Even if this application-database deployment issue is fixed through
automation, it's likely we wouldn't want them to be coupled. Database index
creation or schema changes can take hours or longer. Application code needs
to be very nimble. In the days of CI we don't want Feature A's database
migration to block the deployment of Feature B or BugFix C.

~~~
foota
This is fair. Although I do think the collocated data + serving thing would be
pretty neat.

------
doodpants
Since this was posted in 2015, the title should be changed to one of:

- The Next 50 Years of Databases (2015)

- The Next 46 Years of Databases

:-)

~~~
dang
Heh. The former.

------
justaj
Mods: Could you please add 2015 to the title?

~~~
dang
At your service.

------
devC0de
There won't be a humanity that needs databases in 50 years.

~~~
ravoori
Care to elaborate?

~~~
james_s_tayler
You know... just the other day I was downvoted on HN for claiming that there
is this perception out there that, despite humanity being around for
hundreds of thousands of years in some form, it's going to just up and
disappear in the next couple of hundred.

I'm not wrong. It's a thing.

------
Upvoter33
he's right about the last point, probably.

