Hacker News

This exact debate took place in the early 70s. There were three major database models: relational, network, and hierarchical. Network and hierarchical had quite a bit of success, technically and as businesses. IMS was (is?) an IBM product based on the hierarchical model.

Network databases, which seem quite similar to graph databases, were standardized (https://en.wikipedia.org/wiki/CODASYL).

Both the hierarchical and network models had low-level query languages, in which you were navigating through the hierarchical or network structures.

Then the relational model was proposed in 1970, in Codd's famous paper. The genius of it was in proposing a mathematical model that was conceptually simple, conceivably practical, and supported a high-level querying approach. (Actually two of them, relational algebra and relational calculus.) He left the little matter of implementation as an exercise for the reader, and so began many years of research into data structures and algorithms, query processing, query optimization, and transaction processing to make the whole thing practical. And when these systems started showing practical promise (early 80s?), the network model withered away quickly.

Ignoring the fact that relational databases and SQL are permanently entrenched, an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming.

But another factor is that all of the hard database research and implementation problems have been solved in the context of relational database systems. You want to spring your new database technology on the world, because of its unique secret sauce? It isn't going anywhere until it has a high-level query language (including SQL support), query optimization, internationalization, ACID transactions, blob types, backup and recovery, replication, integration with all the major programming languages, scales with memory and CPUs, ...

(Source: cofounder of two startups creating databases with secret sauces.)




This kind of history is so important, and even this fairly recent history seems kind of hard to access. I am curious how you learned it? (Books? Articles? Finding people involved to interview? usenet posts?)

I learned almost none of it in my formal computer science education in the late 90s -- maybe because at that time it was recent enough that people in the middle-aged prime of their careers still remembered it, so it didn't seem necessary to teach it to newcomers as "history".

I wonder how much post-1950 history is taught in current undergrad CS programs, like if a "databases" course will include a summary of any of the material you summarize.


Lived it. I remember taking my first course on databases, in 1977, and feeling like I was hopelessly and permanently behind because it had been seven years since Codd's paper.

I taught an intro database course recently, and only touched on some of that history. It wasn't the main point, and it was an undergrad course.

But you point out a real problem. There is so much reinvention by people who simply don't know what was done previously.

* I have seen research I did in the 70s/80s get redone in the last 20 years, with no awareness that the problem had already been solved.

* I saw a horrifying takedown at an academic conference (late 80s?). Someone on the original System R research team was pretty furious at a young researcher for being completely oblivious to the fact that he had repeated some System R work. (He blamed the conference reviewers as much as the author.)


My late professor used to lament quite a lot about work being redone. Not a week went by without one or two papers/submissions/talks/works being mentioned with an immediate follow-up like “this is the same as this paper from the 70s/80s/90s” or “this is the same as this paper by that author”. I found it quite sad, both because of the apparent duplicated work and because this whole catalogue of computer science material went unused, aired only in tirades rather than shared in some constructive way.

Although I can very well imagine the frustration, as I have seen the same thing with libraries being reinvented, I think there is value in revisiting some problems in a different context. Maybe the problems are just not that big of a deal anymore in newer languages/ecosystems, even though they are fundamentally the same problems as in the 70s.


You would think the CS discipline, of all disciplines, would be able to expose its history in an easily searchable format.


Cobbler's children tend to go without shoes, maybe?

I suspect some of it is just that the joy of solving (or trying to solve) the problem means the unsexy work of digging through hundreds or thousands of old papers gets forgotten.

And looking at things with a fresh view (which isn’t going to happen after looking at all those old papers) does have value.


I find the freshest view is typically in the foundational papers (not necessarily in computer science, but generally). Always go back and look.


Please allow me a snarky comment: Coddler's children.


Such a format would not matter, because most programmers are anti-research, anti-math, and anti anything other than shipping Agile software.


Programmers are not computer scientists any more than civil engineers are material researchers.

Software "engineering" isn't. It's all about processes and very little about standardization of process and verification of outcomes.


Found one in the wild.


Ironically, if only there were a database that would allow querying for this type of thing... What's the state of the art on X? Where are we on Y? Has the problem in paper Z been solved?


I think in a lot of cases, it's partly a question of where the material was (on paper, not necessarily in libraries, possibly at obscure conferences). That's the case for various things I'd like to refer to, and the system I used for physics research was never written up anyway. In this context, an example is information on Logica's RAPPORT, allegedly the first commercial, portable RDB, which had some significance in UK research support.


https://wiki.c2.com/?ShouldersOfGiants

Databases are interesting because they were at the forefront of computing when it was invented, and they still are today. Computer science's sorting algorithms and trees (and tries) go directly to how good (for all the different definitions of good) a database is.

MongoDB threw out a ton of history in order to invent a different wheel, and it's faster for interesting reasons, but it's also for a different era than single monolith computing (aka Oracle DB). That doesn't mean MongoDB is appropriate for all situations, but that a fundamental precept of computing has changed. Used to be, there was room for maybe five computers in the whole world. Today, my computer has multiple virtual computers inside of it and they are treated as cattle, not pets. The work being redone is because (sometimes) the work has been invalidated by newer experiments.

We're going to have to "redo" a lot of work once dark matter is found as well.


MongoDB is a classic example of the innovator's dilemma. In its first incarnation it had no ACID transactions, no joins, no Geo-Spatial indexes, no inverted indexes, no SQL query language etc.

What it did have (which no relational database had at the time) was JSON as a coin of the realm and a distributed database model. Run the clock forward and MongoDB now competes on equal footing with relational databases for all those features but it still has the key competitive advantages of JSON as a native format and a distributed architecture.
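
As a toy sketch of what "JSON as a coin of the realm" buys you (plain Python dicts standing in for documents, with invented field names -- not MongoDB's actual API): an aggregate can be stored whole, where the relational shape splits it across tables and reassembles it with a join.

```python
# One order with its line items embedded in a single document.
order_doc = {
    "order_id": 1,
    "customer": "Acme",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# The same data normalized into two "tables" (lists of flat rows),
# as a relational schema would store it.
orders = [{"order_id": 1, "customer": "Acme"}]
order_items = [
    {"order_id": 1, "sku": "A-1", "qty": 2},
    {"order_id": 1, "sku": "B-7", "qty": 1},
]

# Document model: one lookup fetches the whole aggregate.
skus_doc = [item["sku"] for item in order_doc["items"]]

# Relational model: reassembling the aggregate requires a join
# (here simulated as a filter on the foreign key).
skus_rel = [i["sku"] for i in order_items
            if i["order_id"] == orders[0]["order_id"]]

assert skus_doc == skus_rel
```

The trade-off cuts both ways, of course: the embedded form is cheap to fetch as a unit, while the normalized form is cheap to query across orders.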


Of course there is value in revisiting a problem. But you should know that you're doing it, and be able to say how your work relates to the earlier work.

(Not denying that reimplementing things can be fun ...)


I think some of it is pure exposure. Developing a new interest in a topic can be inspirational to build things and without adequate access to prior art or knowing the right search terms can leave you lost and ignorant.


I wonder if this is where technological evolution will stagnate: most people researching already solved and proven topics while nothing new happens. I saw some science YouTuber say development is still accelerating; I wonder how long until the next breakthrough!


I think this problem comes down to two core issues: discoverability and terminology.

You're going to be lucky if a paper from the 70s or 80s is available in a searchable database at all. That means someone bothered to scan it in, and OCR it since then. Even for the few papers that are searchable, they are old enough that they probably won't catch anyone's eye unless they are desperate.

Of course then there's also the problem of knowing what to search for. Programmers love to invent, reinvent, and re-reinvent terminology. It's only gotten worse with every other developer running a blog trying to explain complex ideas in simple terms.

The entire field of ML is a perfect example of this. I remember talking to my father about all sorts of new developments in ML back in the early 2010s, and I was quite surprised when he told me that he learned a lot of the things I was talking about back in the 80s just named a bit differently.

In most cases it ends up being a question of how much time you can put into any given problem. If I spend two weeks to find a paper that would have taken me a week to reinvent, then am I really ahead? If the knowledge wasn't important enough to make it into textbooks/classes/common knowledge, then attempting to find it is akin to searching for a particular needle in a pile of needles.


I have never come across a popular CS paper that was not available on the web, for what it’s worth. Maybe some of the lesser known papers are lost, but all of the important ones, such as Codd’s writing, are very easily accessible with simple search engine searches.


The important and popular ones are absolutely available, but those are usually important because they have entered the realm of "common knowledge," at least in a particular sub-field. These are going to be at the top of the list when it comes to digitizing useful historic records. It's fairly easy to OCR a PDF, so as long as someone with some time decided "hey, this might be useful" then you'll probably be able to find it.

If you're doing databases then you've almost certainly been exposed to Codd's work, if not through his papers and books, then at least through textbooks and lectures. There are countless blogs, lecture series, and presentations that will happily direct you there.

The challenge is that there's also a mountain of work that never really got much popularity for whatever reason. Say a paper was ahead of its time, or was released with bad timing, or simply kept the most interesting parts until the end where few people might have noticed. It's these sorts of gems that are hard to find. It's hard to even know how many of them there are, because they are by definition not popular enough for most people to know about them.


>But you point out a real problem. There is so much reinvention by people who simply don't know what was done previously.

I've often wondered if and where the boundaries of human knowledge will run into this very problem. Mostly we rely on the concept of reductionism and hope that it will generalize and scale, but it rarely does. We often have different theories at different scales for the different things we want to model and explain with science in general. If it's found that these ideas don't connect, and we have to forever add more and more generalized nuggets of knowledge, models, etc. for specific cases, even if they are reductionist in nature, we may get to a point where there's so much information that the act of discovering prior art is more work than simply redoing the experiment against the ultimate judge (reality) and seeing if it flies.

I used to work with a handful of bioinformaticists in the era of the affordable sequencing boom, and there was a lot of debate about what to do with all the sequence data. Do they archive it and pay the cost to manage, index, and make it searchable, or does it make more sense to simply pay the couple thousand to rerun the experiment and assure yourself that no process flaws were introduced in a prior paper, that you eventually agree with the prior work, and so on? There's almost a little bit of built-in reproducibility in this unintended act of duplicating prior work.

Software is especially bad because it's such a relatively new domain and is rarely done rigorously, so people just run through cycles over and over again, assuming something people did before was wrong or that the environment changed and that the approach was just inappropriate for its time period. Plus, people in this industry seem to be driven to do innovative new work, so they have no interest in looking backwards even if they should.


I wrote a little on this subject, of research that is ignored or forgotten. You might find it of interest.

https://blog.eutopian.io/the-next-big-thing-go-back-to-the-f...


You can read this article by Michael Stonebraker about the evolution of database systems: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape.... It's the first article in the Red Book[1].

[1] http://www.redbook.io/


We left the exact same comment :)


History of science, as a scholarly discipline, is unfortunately rather decoupled from science (or engineering) itself. Practitioners of a field are rarely as interested in the field's history as they probably should be. History of CS in particular has not attracted much academic interest even among science historians, possibly due to its relative youth and technical nature.


It's a good thing that the two are separate. You really want good scientific advances to be put in textbooks, and the rest should be left for history books. You really don't want the scientific debate to get muddled in pointless historical fights like in heterodox economics, where people are still debating what mArX rEaLlY mEaNt centuries later. Scientists should read textbooks that reflect the current state of the art, not 40-year-old papers that will necessarily be flawed.


That only works if you assume a linear model of scientific progress. Those who don't know history are doomed to repeat it, and this is extremely evident in the field of software engineering (granted, SE is not science and not even a good example of an engineering discipline) where old wisdoms are forgotten and then reinvented every five years.


There is quite a lot about DBMS history in the Red Book and references therein: http://www.redbook.io/index.html

For the history of data model proposals in particular, there is this paper: https://scholar.google.com/scholar?cluster=73661829057771494...

So the info is definitely out there if you are interested in that sort of thing.


The Association for Computing Machinery[1] (usually "ACM") seems to be nearly invisible now. It was the forum for advancing the state of the art in those days.

Its archives are primary sources for this history. The database debate was carried out mostly in ACM SIGMOD, the "Special Interest Group for Management of Data".

1. https://www.acm.org/


I was taught exactly this history in my college DB course in Slovenia. I even had to know the years when each model was proposed and by whom.


Some references to the state of the data world in the 70s are found in Codd's initial paper itself: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf.

I strongly recommend everyone pick up basic set theory notation so that you can read seminal papers like these. There is so much amazing information in them, yet they’re treated as if they’re incomprehensible. You can understand the relational model with undergraduate-level discrete math and set theory knowledge.
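
As a rough sketch of what that notation looks like (standard relational-algebra conventions, not quoted from Codd's paper; the Employee relation and its attributes are invented for illustration):

```latex
% A relation R over domains D_1, ..., D_n is a set of n-tuples:
R \subseteq D_1 \times D_2 \times \cdots \times D_n

% A query is an algebraic expression over relations, e.g. project the
% name attribute of the tuples whose dept attribute equals 'eng':
\pi_{\mathrm{name}}\bigl(\sigma_{\mathrm{dept} = \text{'eng'}}(\mathit{Employee})\bigr)

% which corresponds to:  SELECT name FROM Employee WHERE dept = 'eng'
```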

These are the papers that formed our industry. Let’s not forget them.


> This kind of history is so important, and even this fairly recent history seems kind of hard to access. I am curious how you learned it?

This sort of touches upon the general inadequacy of CS education in general.

The way I learned all this history is by digging it up myself, that was in early 2000s - old journals, papers, listening to people who witnessed all this themselves.

That's the cornerstone skill that should be taught - research and the understanding that no single course can give you much in actual knowledge, only skill to acquire it.


> I wonder how much post-1950 history is taught in current undergrad CS programs, like if a "databases" course will include a summary of any of the material you summarize.

My databases course (circa 2014 or so) covered the gist of this; we didn't spend much time talking about database systems before it, but we read the Codd paper and discussed how various companies began working on implementations over the next couple of decades.



One downside of both hierarchical and network (graph) models that is not mentioned either in the post or parent comment is, "access path dependence". Network model allows multiple paths but basically just forestalls the issue.

As to the sibling comment's question about how to keep up on this stuff, the preceding info was in the "tar pit" paper that has been discussed a lot here on HN. So I'd say, you're already doing it. Continuing education is part of any profession, and this is probably a better way than some.


> One downside of both hierarchical and network (graph) models that is not mentioned either in the post or parent comment is, "access path dependence". Network model allows multiple paths but basically just forestalls the issue.

Yes. I think that this access path independence is an essential part of what makes the relational query approach (of which SQL is one example) high-level. And once you have such a language, query optimization becomes essential.
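
To make access path independence concrete, here's a hedged sketch using SQLite from Python (table, column, and index names are invented for the example): the query text never changes, while the optimizer's chosen access path does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, name TEXT)")
conn.executemany("INSERT INTO employees (dept, name) VALUES (?, ?)",
                 [("eng", "Ada"), ("eng", "Edgar"), ("sales", "Grace")])

query = "SELECT name FROM employees WHERE dept = ?"

# Without an index, the optimizer has only one access path: a full scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("eng",)).fetchall()

# Adding an index changes the access path, but not the query.
conn.execute("CREATE INDEX idx_dept ON employees(dept)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("eng",)).fetchall()

names = [row[0] for row in conn.execute(query, ("eng",))]
print(names)               # same results either way
print(plan_before[0][-1])  # e.g. a SCAN of employees
print(plan_after[0][-1])   # e.g. a SEARCH using idx_dept
```

A navigational (hierarchical/network) program, by contrast, would encode the access path itself, and would have to be rewritten to exploit the new index.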


> the "tar pit" paper that has been discussed a lot here on HN

Did you mean Moseley and Marks' "Out of the Tar Pit" http://curtclifton.net/papers/MoseleyMarks06a.pdf?


Yes, thanks for providing the link.


Yes


> Ignoring the fact that relational databases and SQL are permanently entrenched, an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming.

You seem to be ignoring the success of the highly scalable managed NoSQL databases that have been all the rage when building high-TPS services. What is your opinion on those?


They work for a very specific use case. And if you need to expand that use case, or address others, good luck. Relational database systems are complicated for good reasons. Schemas, which do cause problems, are there for good reasons, and the alternative problems obtained by going without schemas are just not worth it, nearly all the time.

I think a lot of the attraction of NoSQL databases is that they are seductive for people who don't know about the problems inherent in working with shared, persistent, long-lived data (i.e. years). A NoSQL system lets such people get started easily. They start down the path, and then they run into dragons, and muggers, and hostile aliens. This is a really good description of what I mean: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never....


The NoSQL crowd often switches jobs before they run into these problems. 90% of NoSQL use is just resume driven development. Clearly some valuable cases exist but most startups are not among those.


You'd think that the recent increase in popularity of type systems would prompt some "aren't type systems schemas?" moments for some of these cases.


Haha! I love reading comments like these - check out TypeDB (vaticle.com/typedb)

disclaimer: work there


Not OP, but my guess is NoSQL wins where data isn't very relational or where traditional RDBMS and ACID don't fit. One could even argue that file systems and key-value stores are NoSQL solutions that predate RDBMS. Yet a lot of businesses just need something reliable, roughly relational, and with a community of talent.

All that said, the world of successful solutions is not mutually exclusive or limited to only one winner.


In my experience (having worked with both for a decade or so), NoSQL is valuable when the desired performance characteristics and consistency needs diverge wildly between ‘relations’/entities. You run across similar issues at file system and database boundaries.

Have a lot of data you need stored where there is limited direct relation between items, and attempting to be transactionally consistent across the boundary between them and the rest of the system is going to be very expensive? Then putting that part of the data in a NoSQL db and the rest in an ACID-compliant database is probably a good idea.

Similarly, putting 50GB vm images into a nosql db is probably pretty silly when it should probably be on a proper filesystem, or at most copied out and written back periodically.

And putting a bunch of data that is tightly coupled and needs to be transactionally updated (to ensure your dataset doesn’t turn to gibberish) into a NoSQL backend is going to be a nightmare.


> It isn't going anywhere until it has a high-level query language (including SQL support), query optimization, internationalization, ACID transactions, blob types, backup and recovery, replication, integration with all the major programming languages, scales with memory and CPUs

I would seriously argue that all of these are necessary for a database to be useful. The sad fact is that no graph database in existence truly acts as a graph database without some weird caveat like "you must have schemas".


At some point, the graph is simplified to a tree.

The tree becomes lumber.

The lumber becomes a table.

At that point, the data on the table becomes suitable for human consumption.


For more on this topic, a good read is “What Goes Around Comes Around”, by Michael Stonebraker (creator of Postgres, among other achievements): https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...


To add to this, the debate was won in practice because the data persists, but the set of things you might want to do with it is mutable.

Hierarchical and Network databases assume you can anticipate future uses of your data.


> an alternative database technology cannot succeed unless it also supports a high-level query language. The advantages of such a language are just overwhelming

Neo4j's Cypher query language is beautiful to work with!



