Hacker News new | past | comments | ask | show | jobs | submit login
The Next 50 Years of Databases (2015) (cmu.edu)
327 points by strikelaserclaw 70 days ago | hide | past | web | favorite | 139 comments

> ...but humans will never actually write SQL. They will instead ask questions about data in a natural language.

Wasn't that the promise of SQL in the first place? I don't believe it, why would people (only) want to access structured data with something as ambiguous as natural language?

"Remember that code is really the language in which we ultimately express the requirements. We may create languages that are closer to the requirements. We may create tools that help us parse and assemble those requirements into formal structures. But we will never eliminate necessary precision—so there will always be code", Robert C. Martin, Clean Code.

Right. Most people simply are not in the habit of expressing themselves with enough precision, and might even be missing some essential training.

Whenever precision is required, we end up with methodologies like technical writing, or specialized language like legalese, to provide the necessary precision.

If we didn't have "programming languages", we'd end up with some sort of "computerese" analogous to legalese, and it would still take years to become proficient in using it, and there would still be either people or AI who would translate natural language into "computerese" (but we'd likely prefer human oversight over the AI anyway, given that a human that needs an AI to convert their language into something more precise is likely not capable of judging whether the output is precisely what they meant).

> Most people simply are not in the habit of expressing themselves with enough precision, and might even be missing some essential training.

I agree with this, I don't think we'll be programming in natural languages until the computer is intelligent enough to pick out unintended ambiguity and ask clarifying questions. What I'm unsure about is whether or not the computer will be that intelligent in 50 years. Whenever it is that intelligent, I expect us to transition quickly.

A problem is: How many clarifying questions do we want the computer to ask? How many much judgement do we want the computer to exercise?

There are going to be many tasks for which we don't want any independent judgement to be exercised by the machine. And for those tasks, we're going to still want a highly structured language.

And if the computer needs to ask the person a bunch of clarifying questions because they're providing ambiguous instructions, do we really trust that person's judgement about handling the unanticipated edge cases?

That's been a large part of my job at times - working with product stakeholders who give a high level requirements overview and then trying to figure out all the edge cases they didn't specify up front, so that my team could properly design and estimate the work to build the features.

So: yes, most companies already don't require people designing things to be able to fully spec every edge case in advance, without any dialogue.

The question is: can your job be automated in 50 years? Maybe it can. But the computer has a long way to go. I don't think we have enough evidence to give any estimate of when (if) that will happen.

Thats different. You're the one designing the system -- Exercising judgement about which mechanical rules to implement.

Ah, the good ol' Star Trek TNG computer. Even in the show people were constantly annoyed with it.

Exactly! I wish I could communicate with my wife in code, she seems extremely able to find ambiguous natural language expressions that tip me the wrong way with respect to her intent. I still love her regardless.

I think we want a precise computer language, but with a code completion/suggestion good enough to make transitioning from natural language trivial.

Agree. Most of program languages are context free. Human language is mostly context dependent. The auto completion and auto suggestion are the tools to close the gaps of the user experience.

Moreover, human communication is continuous and conversational. Programming is not. Most of existing code editors are not designed to have a conversation. Jupyter notebook is close but not there yet. I bet with a conversational agent style code editor plus a good auto suggestion and auto completion feature, we don't need to invent technology to use natural language to communicate with a machine. We just need a well designed formal language with precise and concise syntax.

Some people write news conforming to Google translation's performance just to make sure the story can be auto translated to many other languages. Most of these stories have pretty normalized vocabulary. This is how much human can adapt to the new world.

One day when the machine intelligence surpasses human intelligence, we will all speak in Python or whatever the most popular among machines.

There's the opposite problem too! If you over-specify, you may be denying the machine a chance to optimise.

I think that's what he's getting at using the word requirements in this quote - as in, you're not specifying the process, only the outcome. By definition it's not possible to over-specify the outcome, it just is what it is, and there is no theoretical reason that should limit the ability of the machine to optimise getting to that outcome.

In a sense, that's one definition of language power - to what extent can you specify that what without having to specify the how, assuming that the machine can execute the how much more optimally than you can most or all of the time.

> By definition it's not possible to over-specify the outcome, it just is what it is, and there is no theoretical reason that should limit the ability of the machine to optimise getting to that outcome.

I'm not sure what you mean by the outcome, but you can absolutely over-specify the output you want, and it's a relatively common problem. For example, you could say 'I want a system that prints lines with a saffron-based pigment', when all you wanted was an intense yellow-ish color, and don't care about the exact pigment being used. Or, you could say 'I want a program that let's a user choose red or blue or yellow and then paints a line in that color', when you actually mean the user can choose any color.

I've seen many problems of over-specification when talking high-level details with less technical people. They tend to give details to help explain what they mean, without caring about the specific details; or they use metonimy expecting everyone understands what they mean.

Ah yes you're right, I completely missed that and that's true. I was thinking about it more from the point of view of not mixing specifying output with implicit instructions on how to reach it (for example, don't say you want to order an array and then pick the last number, just say you want to pick the greatest number from the array) - but that's only assuming you actually are precise about what outcome you want and don't ask for anything irrelevant, which of course people obviously will do from time to time.

One question might be, in a natural language "conversation" with the database system, could one iteratively refine the results to the needed precision, and then perhaps bookmark the query. It's still an open question if people end up with enough precision, but I think that's an easier and different standard than equivocating a single SQL query to a single natural language query.

The question to be asked is whether your coding language contains non-essential information.

For example, in the case of SQL, the GROUP BY clause is nearly always noise.

So much so that SQL database will tell you exactly what things you need to add to the GROUP BY to make it valid SQL. Those things are extraneous information that the database could (and already did) know.

(IMO, in an aggregated query, SQL ought to implicitly add all non-aggregated fields to the GROUP BY, the user may explicitly add fields to the GROUP BY if desired.)

I just recently learned that the ordering of fields in your GROUP BY clause has meaningful consequences for the output and can differ from what's in your top SELECT. Try it with something like SQL Server's `FOR JSON AUTO` and you'll immediately see why `GROUP BY` demands the explicit order from you.

This is the case for the older equivalent syntax for XML also. Though it isn't quite SQL Server automatically doing anything, i.e. automatically extended the GROUP BY clause as needed. It is simply a short-hand for "use your standard heuristic to guess my schema instead of me giving it to you" and you explicitly tell it you want that heuristic to be followed, it isn't just deciding to because the other things you've told it are ambiguous. Also note that it has not always been version safe: I had trouble between SQL 2005 and 2008 with the way some FOR XML AUTO queries were interpreted having changed.

Maybe a similar syntax, "GROUP BY AUTO", could be useful, but I suspect it would quickly become a code smell - some people would slap "AUTO" in when they get an error in much the same way they slap "DISTINCT" in to fix an accidental cartesian product without actually thinking about the problem.

As irritating as having to list many columns in the GROUP cause of a statement with a wide output, I like that it has to be explicit: if I have the column lists wrong I've made an error and I don't want SQL Server to guess how to fix that error. Perhaps instead of missing a column in the grouping list I've instead messed up in the projection clause. And if the statement is getting inconvenient enough that the grouping clause is a significant irritation, perhaps it needs to be refactored more generally.

One further thing to note is that some DBs do automagically decide what to do when columns are neither in an aggregate or the grouping list: mySQL for instance will just return an arbitrary value from those that exist in the current group for such columns (IIRC the first value hit as the data is read). This leads to situations where the query seems to work fine until something changes in either the data balance or the available indexes (or some change is made to the query itself) that makes the processing happen a different way around so a different arbitrary value starts to get selected for the same group in the same query.

>This leads to situations where the query seems to work fine until something changes in either the data balance or the available indexes (or some change is made to the query itself) that makes the processing happen a different way around so a different arbitrary value starts to get selected for the same group in the same query.

That's a great point

FOR JSON AUTO is a non-standard SQL extension.

And as the name suggests its somewhat magic/implicit in its behavior.

No ANSI SQL result is altered by changing the order of GROUP BY.

> IMO, in an aggregated query, SQL ought to implicitly add all non-aggregated fields to the GROUP BY

I have literally never, ever wanted this. Why in the world would you group by any field not appearing in an aggregate?

Also, the group by can specify cubes and rollups, more over you may not want to group by any field, but rather an expression.

You virtually always want this.

   SELECT count(*), favorite_color, favorite_song
   FROM person
   GROUP BY ...
The "..." is of course favorite_color, favorite_song.

Whether you want this or not, ANSI SQL requires that you group by at least these.

You could group by more more fields, such as favorite_pet, though that would be somewhat odd since you are did not include it in the result set.

Ah, you meant in the select, not the table. I guess I'm just much more in favour of being explicit, but I can understand the usefulness of what you said now that I fully understand it.

This would definitely lead to me being very confused when things I'm selecting that I thought were aggregates were actually being implicitly grouped, or when one thing I'm selecting that I didn't realize is an aggregation results in grouping by all the other fields. There is no way for the query engine to know which direction I made a mistake in, so it just tells me something is wrong and let's me figure out how to fix it. Failing fast is much better than trying to guess at what I'm trying to do.

To summarize: "Writing things twice allows you to know where you went wrong and correct it."

That's true.

I still prefer DRY, but I can appreciate the WET perspective.

Your summary is wrong. When you write "select field" you are saying which field to return from the query. When you write "group by field", you are saying that you want the query to group by that field when generating query results. You are not writing anything twice, you are writing two different things.

When I write "select field" I am saying which field I want to return from the query and which field I want to group by. There's no difference. And the DBMS enforces that there is no difference. *

* Except for the special case when it's non-aggregated and I want duplicate rows in the result set. SQL allowing duplicate rows is a departure from the underlying relational paradigm.

The reason it needs to be explicit is to prevent mistakes, which is why it is a good thing that most databases try to infer which columns are functionally dependent on the GROUP BY columns so that you can skip them for the GROUP BY clause.

Wasn't that the promise of SQL in the first place?

Lots of companies and organizations have made that promise. Even legendary game company Infocom way back in 1985. So far, none have delivered.


I think the reason its so hard is when we ask something we are missing out huge bits of information about exactly what we want. Another human knows from experience and context what the person actually wants. When I say I want a webstore for selling keyboards A web dev already has a pretty good idea what I want because they know what a web store does and how it functions and they know how keyboards work so they will know to add search filters for things like switch types even though none of this information was in the original request.

Explaining exactly what we want and covering all of the edge cases is just as hard as using a regular programming language. Until we can have full back and conversation with computers that have a deep understanding of what people want from a program and can ask the user questions like "But what should happen when x happens?".

Using natural language for stuff like that would be just like writing legalese, except being even more specific about everything.

That actually sounds way worse than just using queries.

How would you write automated tests for it? How would you keep those tests from being really flakey?

I remember the natural language interface in Q&A by Symantec worked pretty well. I was using it in the very early 1990s.


The fact that it didn't respond to commands as well as Zork or A Mind Forever Voyaging didn't help the case. Infocom had a pretty solid engine for "natural language" input. Too bad they couldn't adapt it well.

I think even before we get there, one of the big issues with SQL is how we write optimized queries. It's basically involves memorizing a bunch of hacks and hoping that those hacks compile into something that is performant.

Before we use these 5th generation languages to write natural language queries we need an API that is sufficiently high level and possible to logically optimize without resorting to hacks.

Exactly. I always say SQL servers are kernel of beautiful relational theory wrapped in about 4 feet of duck-tape and bubble gum.

The SQL server presents itself as a black box, saying "just say what you want and I'll figure out how to get it" and then completely fails to deliver on this and you have to open up the black box and find that its insides are dark and scary and full of jagged sharp ends.

All layers of abstraction are leaky when you care about performance sufficiently.

To an extent. Some layers are significantly more leaky than others. Rust vs. SQL for example.

Was also the time of 4GL and how everything could be reduced to a form based language front-end to a database https://en.wikipedia.org/wiki/Fourth-generation_programming_...

and other wondrous promises of overselling dreams and delivering something that many have now forgotten ever existed. Let's be fair, C is still alive and kicking despite the promises of that 4GL era.

I had the same thought. Human language is ambiguous precisely because there are so many synonyms, expressions and nuances which are open to interpretation. Sometimes in negotiations, you can say one thing which has two different intents/meanings and you just watch the other person's reaction to decide which path to take the conversation. Sometimes the counterparty needs to know what you're thinking in order to know what you really mean. It wouldn't make sense to ask a machine to guess what the author is thinking when it's faced with ambiguous instructions. It would make debugging impossible.

Agreed, this sounds like the pontificating of someone who has not studied language and culture. As soon as the machine can understand any "natural language" say to it, the language is dead. Before you teach your database to interpret upper class out of sight "White" English, try teaching it "Ebonics."

I guess my point is, there's nothing "natural" about language.

I think the idea is more that querying will be "higher level", and still just as performant as writing the raw SQL. So instead of specifying a lot of JOINs and such, you just ask for the data you need and it delivers it optimally. GraphQL seems like the first iteration of this, but in its current form you do lose performance.

Not "natural language" per se, but some database frontends support building queries step-by-step via a GUI wizard, generally known as "Query by example" (QBE). Even LibreOffice Base can do this.

No. SQL was invented (in system-R) in order to have a decelerative way to query a relational database [tables and relations] (as a front end to the query optimizer).

The closest I've seen anyone come to this "dream" scenario is PowerBI. It's very visual and drag & drop, with practically no coding or query language skills required. Yet, it can automatically generate insanely complex queries that even a seasoned developer would struggle to write manually.

PowerBI is an interesting case. As long as you do “run of the mill” business analytics (Star or snowflake tables, “simple” aggregations) it is like you say. But once you need something different, you’re in for some serious rethinking and coding.

I have this article printed out in a drawer for these cases: http://mdxdax.blogspot.com/2011/03/logic-behind-magic-of-dax...

With DAX, Microsoft has extended the PivotTable model (instead of the relational model), and they deserve full respect for trying something new in this space.

Is it a simpler model? I kinda like it, but don’t think it is simpler.

Complex SQL takes a lot of time to write, and some of us don’t have the money to pay someone.

Hyper Anna

It's easy to produce unambiguous sentences in english.

Then why aren't we programming in English? Why aren't we expressing mathematical expressions in English?


My initial thought is highly specific human language is actually harder to write than most code, and always more verbose. Well written philosophy comes to mind.

I'd argue that this is part of the reason ORMs gained so much prominence. They solve the problem that SQL introduces by trying to emulate natural language. Most developers will gladly risk sacrificing some specificity of the query in exchange for reducing the verbosity of a SQL query. Obviously if the dev knows their ORM/SQL well, there is no lost precision.

I have to assume GP is being sarcastic

Because nobody has been able to implement a computable implementation of English. It's an orthogonal issue to ambiguity.

I'm pretty sure controlled attempto English or lojban (zero ambiguity natural languages) have no computable implementations either.

tldr (or boring)? skip to the last pragraph.

"the product of pi and the square of r" is no less clear than the mathematical transcription of it.

3245 is perhaps English, but there's nothing wrong with "three thousand, two hundred and forty-five".

Some more examples taken from Wikipedia:

The sum of zero and zero. Five less than the product of eight and some value. Seven times the square of some value plus four times the aforesaid value minus ten.

I mean I could go on; these are either not ambiguous, or they're ambiguous only because of a lack of convention.

We don't do it because our attention span doesn't last long enough. We need to keep looking back at what we've already read to really understand what we're reading. Mathematical formulae give us the ability to do that quickly, since the graphical, multidimensional concise layout quickly allows us to find the place we're looking for.

About 25 years ago a mathematician I knew who didn't have much time for computers said to me: "In fairness there is one thing you can say for programming: It forces you to be precise"

A good way to do it is to use a "structured" subset of the "language" for "querying" purposes.

My go-to (or goto, if you will) quote:

We are not here concerned with so-called computer 'languages', which resemble human languages (English, Sanskrit, Malayalam, Mandarin Chinese, Twi or Shoshone etc.) in some ways but are forever totally unlike human languages in that they do not grow out of the unconscious but directly out of consciousness. Computer language rules ('grammar') are stated first and thereafter used. The 'rules' of grammar in natural human languages are used first and can be abstracted from usage and stated explicitly in words only with difficulty and never completely.

--Walter Ong [Orality And Literacy]

There's at least two different ways to read this comment.

Like "select name from users where age > 15"?

You need to take an introductory natural language processing course.

I'm an NLP scientist trying to implement a software that take english as input and parse it into a queryable data structure that has zero semantic loss. I'm aware more than anybody on this thread, of the challenges of coreference resolution, word sense disambiguation, semantic parsing, etc... I didn't say that usual language is free of ambiguity. I just said humans are perfectly able to output clear, unambiguous English in a systematic manner. I just needs a little more mental energy.

Unambiguous? Maybe. Specific enough to execute as programs? not nearly so much. I mean, humans regularly misinterpret verbal commands, and we're effectively the best thing at understanding and language that exists in nature. Half of the disagreements on HN result from misinterpretation of very specific, rather unambiguous statements.

> Half of the disagreements on HN result from misinterpretation of very specific, rather unambiguous statements.

Disagree, it's almost certainly more or less than half; it's incredibly unlikely that these statements would account for exactly half of disagreements. Furthermore the statements in question are rather specific and very unambiguous, not the other way around.

> Furthermore the statements in question are rather specific and very unambiguous, not the other way around.

Or rather, the ambiguity is around how specific the statements are.

Anyone who thinks otherwise should be sentenced to death!

Pavlo's CMU database courses are some of the best educational content on YouTube, and they are updated regularly with revised course content. Great way to get up to speed with the technical underpinnings of modern databases, as well as emerging areas of research.

(Big time commitment though, FYI.)

I was going to comment the same thing. Even if you never plan on writing a database, the lectures are a great overview of what goes into building reliable distributed systems.

Can confirm. I'm currently halfway through the first one. It has given me a lot more clarity on what is going on under the hood. Also makes it much easier to understand how to compare one database to another!

If I may ask, which career paths would benefit from taking such a course?

> There will be a tighter coupling of programming frameworks and DBMSs such that all database interactions will be transparent (and optimal). Likewise, SQL (or some dialect of it) will remain the de facto language for interacting with a DBMS, but humans will never actually write SQL.

I've seen two disasters where someone mistakenly thought they could use an ORM to make a database look like a giant, in-memory data structure.

To make a long story short, in both cases, working with a trivial amount of data in an embedded database was painfully slow.

In general, I think we'll see application programming languages expose more semantics that streamline data access. The problem is we keep thinking the way to solve the impedance mismatch is to make the database look like traditional data structures; when the real solution requires languages that have the high degree of flexibility that a SQL database normally has.

For example: A table T might have columns Id, A, B, and C. We can "select Id, B from T". But, if I make a class to map to table T, I probably have fields Id, A, B, and C. Then, if I want to make a little utility function that just looks at Id and B, it's probably taking a full T object, with all of the fields loaded.

In such a case, the programming language needs to evolve. My function doesn't need a full T object, just an object of a class that we, semantically, know is a "T", but also has the fields that I care about. Then, we need enough automation within the compiler to know that, if I add a call to my new utility method, it has to change some code to "select Id, A from T" to "select Id, A, B from T", and leave other code unchanged.

Agree with you that the problem is more on language side, I also believe that the problem isn't as bad as it seems with the right tooling.

For example I recently discovered that I can configure PyCharm to a database. Interestingly, after I did just that, the IDE downloaded the database schema and suddenly I got highlighting, autocomplete and refactoring available (well as migrations).

Suddenly I felt like I didn't actually need an ORM.

C# seems to also have thing called LINQ I'm not as familiar with it but my understanding is that it is a language within language that can represent SQL statements. I'm guessing this might be what you're talking. Ultimately I think the solution is that:

- there's a way to efficiently express what data we need (ORM has the issue you mentioned)

- an IDE can understand this syntax so it can help with it the same as with rest of the code.

LINQ lets you query and filter collections. You're thinking of LINQ to SQL.

I don’t think that is used much anymore. Most shops I’ve been on moved to Entity Framework. Uses a very similar syntax tho.

The few times I looked at LINQ to SQL, it didn't meet my requirements.

The first time it had the "update" problem, meaning, it appeared that I had to always load an object into memory before updating it. (2008)

The second time, it wasn't available in Mono. (2011)

Granted, a lot changed since

I see, I never used it myself, from the outside it looked promising it is sad that it didn't deliver.

"The role of humans as database administrators will cease to exist. These future systems will be too complex for a human to reason about. DBMSs will finally be completely autonomous and self-healing"

Dunno, seems like as long as there is crappy data that humans need to clean, enterprise and financial firms will continue to use XL as critical part of their data infrastructure.

And as long as XL reigns supreme in finance and consulting, seems a bit far fetched to talk about infinitely scalable, sentient and 'self-healing' DBMSs...

Let's focus on getting the data out of XL, then work on the genie in the bottle.

Unless you can give users something simple enough to program in themselves, you'll have to pry excel out of their cold, dead hands.

At which point you have built XL :))

Pavlo's research on "autonomous databases" is focused less on issues of data quality, and more around configuration and tuning of database systems [1]. For most databases, there are a staggering number of knobs to tune and configure for a particular workload. DBAs typically work on this type of tuning, whereas data quality issues are often problems at the application layer or above.

[1] https://www.cs.cmu.edu/~pavlo/blog/2018/04/what-is-a-self-dr...

In my personal experience most of the tuning of a database happens in tandem with tuning the application. E.g. you find a set of problematic queries and then go back to the application to see if they should be fixed or if the database is incorrectly tuned. Quite often it is a mix of both, and just automating most of the tuning only just changes the role of the DBA, not does away with it.

> but humans will never actually write SQL. They will instead ask questions about data in a natural language.

Someone trialed an idea with directly interpreting user text queries at my previous job. They gave up after seeing what people typed in.

People have no concept of what the computer needs to do it's job, so you get terse gibberish. At a minimum you'd need the computer to talk back and get them to clarify, but I rather suspect people would hate that. It'd feel like you spent 20 minutes arguing with a computer to run a report.

As an end-user of databases, I hope in 50 years time, there will be no meaningful distinction between OLAP and OLTP DBs.

We spend so much time and energy copying data around into specific data stores to solve specific problems / answer specific questions / enable specific features... It's messy, complex, and adds a ton of overhead.

It could simplify a lot of technology if an operational database could also handle non-trivial analytical workloads.

This would be amazing, but it's an open question in my mind if it's possible to have a system that works sufficiently well for both OLTP and OLAP use cases at large scale. HTAP systems today are still very raw, and it intuitively feels like we'll be bumping into the Pareto frontier, where optimizing for analytics use cases can only come at the expense of transactional performance.

You’re not going to run your analytical queries off your production DB once you have more than a couple users writing ad-hoc queries. And you probably don’t won’t to replicate your salesforce/ads/clicks into your prod db. So you’re going to create a replica. So why does it matter if the replica is a different type of database than prod?

Do any current databases support replicas with heterogeneous indexes? It seems like that would greatly reduce this problem.

Google Cloud Spanner kind of does what you want.

Isn't this the promise made by SAP HANA?

Oracle Active Data Guard also (queriable replica), with column store.

> temporality will become important as well because it matters how information changes over time. Current systems are not able to account for this because of the large overhead of storing extracted information

This quote happens to be in the context of video frames but I think temporal indexing in general is widely under-utilised, both for providing consistent queries (i.e. the database-as-a-value) and for integration of late-arriving data (i.e. bitemporality). It seems particularly relevant when considering how best to synchronise information across the "omnipresent Internet of Things" (not to mention inter-planetary devices, which also get a mention in the post!).

Unfortunately, Temporal Tables in the latest ISO SQL spec and MS SQL Server have issues. At least their current implementation.

First-off, there’s no history of schema changes. While you can easily go from a NOT NULL to a NULL column, you can’t go from NULL to NOT NULL. This is a deal-breaker when using TT for LoB data when schema changes happen somewhat regularly. TT should have been designed with schema versioning from the start.

The second main issue is the still lack of tooling and ORM support for TT. While EF6 is compatible with TT (it will safely ignore the SYSTEM TIME columns) it doesn’t let you directly query the history table.

Third - any UPDATE statement, even when it doesn’t actually change any data, causes a new history row to be added, including a full copy of any mvarchar(max) values. Copy-on-write does not seem to be used. That’s a huge waste of space.

Finally, you cannot exclude columns from temporal tracking - so if you have frequently-updated columns containing inconsequential data (e.g. display sort-order columns) then you’ll end up with history table spam.

I don’t know why the SQL Server team invests massive amount of engineering effort into features like these when other massive priorities exist - like modernising T-SQL’s syntax to be less verbose (parameterised object identifiers and optional predicate clauses, please!) or adding support for column array-types, which would greatly improve the platform.

I also don't know why duplicating timestamps for each row to store the transaction times is needed for a kind of history table.

Other data models with schema after write semantics (for instance trees or graphs) don't have this problem of having to store schema changes.

As I'm working myself on a temporal storage system in my spare time (as well as during my student times at the University of Konstanz), I'd say that you should index each revision instead, storing the timestamp of the transaction commit in a RevisionRootPage for instance, which is the root and main entry to deserialize a revision. Unchanged pages should be shared amongst the different revisions.

Furthermore due to fast random access of flash drives and even parallel I/O with PCIe SSDs you can store page-fragments instead of whole pages. The SSD might fetch too much information due to misalignments, but I'd say that access to persistent storage also gets more fine granular in the future (byte level granularity instead of block-level or the blocks are smaller-sized -- I think even 512 bytes is not that unusual nowadays).

Not only storing the incremental or differential changes from a database page, but changes which fall out of a sliding window doesn't make intermediate full dumps of pages necessary, such that write and read peaks don't occur and a predictable number of page fragments can be read to reconstruct a page in-memory, preferably fetching the fragments if possible in parallel.

Datomic solves all three of these problems an the syntax+array complaints more people should be stealing ideas from it or using it

Linux support for SQL Server is my canary in the coal mine.

It makes zero sense.

Meanwhile, the Windows version, which is used for 99.999% of installs still does not support Vista-era APIs like elliptic curve certificates because of legacy code nobody has touched in 10+ years.

There's a crazy amount of low-hanging fruit in that product that just isn't being addressed.

> Linux support for SQL Server is my canary in the coal mine. > It makes zero sense.

There's one strategic reason: using it by developers in containers.

I'm continuously impressed that new database technologies come forth, pushing productivity forward and pushing the bounds of what databases should be doing.

I like to think a lot about Kdb, partly because I've used it extensively. Kdb, aside from it's database core functionality, can also be an app server. There's a natural concept of being able to open a listen socket or connect to other listenrs. Very quickly you can build distributed database systems with just a few lines of code. It's very powerful... I think the "way of thinking" in Kdb has yet to permeate into other database technologies.

That way of thinking has definitely proven useful in the domain where it started (at Morgan Stanley). We’ve also been looking at how to map it more closely to C/C++ and native hardware, kind of a structurally typed Haskell variant with small header-only views from C/C++ (for native access to files, network IPC, shared mem structures, ...):


>Lastly, I will be dead in 50 years.

And I'll be writing a number 8 to fill the Age box :(

>DBMSs will finally be completely autonomous and self-healing

Hahahahahahahaha - such optimism.

Autonomous, self-healing, horizontally-scalable stateless systems are hard-but-doable.

Autonomous, self-healing, horizontally-scalable stateful systems are the stuff nightmares are made of, if your application layer doesn't relax its expectations re: ACID properties of the system.

I think compromises at the application layer are the trick here. Instead of trying to make every possible aspect of the business system ACID compliant, perhaps only small aspects of it actually need this level of transactional isolation.

Alternatively, perhaps you have a business model which is amenable to document-based storage. If you can reduce your business modeling to:

Serializable GUID-keyed objects - Any arbitrary business type instance that can be turned into JSON bytes.

GUID-keyed transactions - Document the specific object GUID(s) to be locked for a transaction.

You can very easily entertain a resilient, clustered solution. Going wide on GUID keys is trivial compared to alternative approaches. If you need to support any degree of SQL grammar or row-based storage, you are in for a completely different dimension of pain. I would never subject myself to this. Instead, I would reach for SQLite or Postgres.

This is the focus of Andy Pavlo's research: https://www.cs.cmu.edu/~pavlo/blog/2018/04/what-is-a-self-dr...

Essentially, there is a long history of developing tools to make databases more "autonomous". Pavlo's research centers around accurately capturing the state of the database, and training a system to learn the impact of tuning one of the many available knobs.

Tuning != Scaling

Tuning == getting the most out of your hardware

Scaling == my perfectly tuned node, with the lowest-latency/mostest IOPS IO subsystem, and highest number of CPUs, and the mostest amount of RAM can't handle my workloads. What now?

Re-architect your monolith and re-consider your transactional boundaries - that's what.

I see a landscape of SaaS offers and connectors/data integrators between them that, inside an auditable analytics environment, is basically working in a plug-and-play manner. Databases are whatever happens in the back of these data governance environments. We load everything into a large mangaged data lake, and connectors are set up automatically.

I also see the scenario of radical data ownership not addressed, which may be a black swan event. Open-source competitors / legislation may enforce the use of e.g. data pods, as a digital identity storage with managed access right distribution. It's worth a though, what Tim Berner Lee's solid / inrupt would mean for the future of data storage systems. In this scenario, the transmission of personal data from pods needs to be optimised in a secure way.

Nothing about graph databases ? I thought it was on the rise

I was surprised that it didn't mention graph databases either, which is pertinent to the conversation around the evolution of SQL in the article, because querying graph databases requires re-thinking (or evolving) the query language. I'm excited by the work Neo4J and AgensGraph is doing. I haven't played around with AgensGraph, but I hope it takes off since it builds on Postgres

Exactly my thoughts. I haven't got a chance to dive into Neo4J yet, but this looks like a very exciting field that'll really improve visualization and data analysis.

I also believe this is the first kind of ACID NoSql database, which goes along Andy Pavlo's "All programs will execute using strongly consistent ACID transactions".

Definitely going to switch over from PostGreSQL at some point.

The Use Cases for GraphDB's are surprisingly small.

I certainly think that NoSQL databases are only going to grow more popular in the future for more complex and specialized use cases. I don't see traditional RDBMS systems going away any time soon, but I don't know if they'll maintain dominance in perpetuity.

> The relational model will still dominant for most applications...Likewise, SQL (or some dialect of it) will remain the de facto language for interacting with a DBMS

The author does believe RDBMS systems will continue to dominate for the next 50 years. I have no particular reason to cast serious doubt about that, but it will certainly be interesting to see what the role and prominence of NoSQL databases is in that future.

I believe that's what the NewSQL is trying to tackle.

NoSQL means a lot of categories of database e.g. graph, wide, columnar, document.

And so not everything is relational which is what SQL is for.

"NewSQL", not NoSQL, is apparently taking relational databases and adding on the features that drive people to NoSQL.

(This is also the first time I've heard that term, but it seems to be several years old and that's roughly what I got from a few quick searches)

Google Spanner, CockroachDB etc are in that category.

The NoSQL databases will be query-able with SQL and start supporting ACID transactions, and then they won't be NoSQL anymore.

All of them? Why would a NoSQL database like Neo4j (Graph database) forfeit its declarative ASCII-like query language for SQL? What would the benefit be?

Details like that aren't important because the right answer to every single data problem is "SQL" /s

Even if the data problem is graph, document, time series, wide table, EAVT etc.

Interesting points - I'm skeptical about whether DBMSs will ever be too complex for DB admins. Complexity will increase, but our level of abstraction will increase to match, as with any technology. We may still find ourselves fixing DBMS problems in 50 years, but those problems will likely be on the same level of abstraction we have at that time.

> Complexity will increase, but our level of abstraction will increase to match, as with any technology.

But this is no panacea. The complexity will still be there, and ops folks will have to deal with it through extra layers of indirection and obscurity (which are other ways to spell "abstraction").

That might be the best compromise, but it is still a compromise.

The distinction between database (disk) and memory being an artifact of the memory hierarchy, I wonder what will happen to the concept of the database itself, should technological advances erase the speed difference between disk and memory. Will we simply be 'saving' data into special collection variables in memory?

I suspect that due to Moore's law issues we will abandon the concept of synchronous memory access altogether and make the cache hierarchy more visible. Already a cache miss is thousands of cycles, and algorithm complexity analysis that assumes memory access is O(1) can lead to poor real-world results compared to cache aware algorithms.

One future area he missed predicting is how we could handle ML data. For example, features could be more naturally stored and accessed as a DB row. Many tasks such as regression and classification could be handled by the DB system as an extension to SQL.

> ...but humans will never actually write SQL. They will instead ask questions about data in a natural language.

Nope they won't. Natural languages are not precise enough for this.

We might have a sequel to SQL.

> but developers will no longer need to explicitly worry about data model their application uses. There will be a tighter coupling of programming frameworks and DBMSs such that all database interactions will be transparent (and optimal).

Sorry, but this feels hilariously wrong.

Databases that work performantly are all about making key architectural decisions about which information is indexed and/or denormalized and/or preprocessed, how, and ensuring that queries are written only in forms that take advantage of that design.

It's the difference between queries taking milliseconds or hours on reasonably-sized datasets of, say, 10GB+.

Because it's shockingly easy to write a query that, otherwise, will not only read every row of your database, but be required to read every row once for every row, because that's literally the only way to calculate the result if the right data isn't suitably indexed or preprocessed. (E.g. "find the number of users who have ever ordered the same item (any item) more than once" on a table indexed only by order ID.)

I don't see how that can ever ever be made "transparent", and it certainly doesn't have anything to do with tighter coupling with a programming framework.

The data model question is - at best - a very poor choice of words. Implicitly assuming a data model is a bad idea as the data model is one of the major choices every program has to contend with. So implicitly assuming it isn't really a desirable goal. It is like types where explicitly having information on types in key places in a program is always desirable and the only question is how pervasive it should be.

I also liked "They will instead ask questions about data in a natural language". Humans can't answer questions about data in natural language - when data questions come up they have to be couched in formal terms and statistics. I look forward to the day an AI can answer "what is the best decision we can make with this data?" but it runs into a problem we already have in spades. If the asker of the question isn't thinking in formal terms I question whether they are going to be able to consistently pose meaningful questions and interpret the answers. Applied Math is not easy to implement in industry because a company's ability to perform is limited by its management's understanding.

It is easy to nitpick. I'm also glad we have people thinking about the big picture of how data gets handled.

I don't think it's anyway near reality, but there's actually a number of ways this could work.

If you prepackage your queries, or do some sort of profiling guided optimization, the database could learn what it needs to index to answer your queries efficiently. There's already some precedence for this, with how Facebook for instance packages their graphql queries ahead of time to avoid parsing them at runtime.

Alternatively, you could maybe imagine models where storage is sharded along with the code for running it, so you say declaratively specify the data you need (count for this grouped by that, etc) and then at runtime the data is pre-aggregated per server hosting a certain shard.

Rather than specifying a schema+indices or an Elasticsearch like mapping, preprocessing queries to derive the mapping will definitely be the next step.

However, I doubt we will ever derive mappings from application code or (retroactive) profile guided optimization. The biggest problem will become deployments.

We would need to couple application code deployments to deployments of databases even more than we do today (which is a giant problem).

When a new index is needed or new data types we would need to “prime” the datastore in order to get optimal queries from the start.

Even if this application-database deployment issue is fixed through automation it’s likely we wouldn’t want it to be coupled. Database index creation or schema change can take hours or longer. Application code needs to be very nimble. In the days of CI we don’t want Feature A’s database migration to block the deployment of Feature B or BugFix C.

This is fair. Although I do think the collocated data + serving thing would be pretty neat.

Since this was posted in 2015, the title should be changed to one of:

- The Next 50 Years of Databases (2015)

- The Next 46 Years of Databases


Heh. The former.

Mods: Could you please add 2015 to the title?

At your service.

There won't be a humanity who has the need for databases in 50 years.

Maybe not, but please don't post unsubstantive comments to Hacker News.

Care to elaborate?

you know... just the other day I was downvoted on HN for making the claim that there is this perception out there that despite humanity being around hundreds of thousands of years in some form that some people believe it's going to just up and disappear in the next couple of hundred.

I'm not wrong. It's a thing.

he's right about the last point, probably.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact