Hacker News new | past | comments | ask | show | jobs | submit login
Foundations of Databases (1995) (inria.fr)
662 points by tosh 28 days ago | hide | past | web | favorite | 52 comments



I'd also highly recommend the redbook (readings in databases) which is by several luminaries in the database world including Peter Bailis, Joe Hellerstein, and Michael Stonebraker. They also updated it in recent years to ensure it was up to date for the latest research on databases as well - http://www.redbook.io/


There is also a "Red Book" in mathematics. It is David Mumford's Algebraic Geometry book.

Are there any other fields with a "Red Book"? I am mildly curious.


IBM has a division which produces technical books based on internal staff, business partner and client contributions. They've been doing this for quite a long time (I participated in one almost 20 years ago). Historically they had red covers and so to this day they are called the Redbooks. http://www.redbooks.ibm.com



A well-known OpenGL book was also the Red Book.

http://www.opengl-redbook.com/


There is also a Red Book for audio CD (Compact Disc Digital Audio): https://en.wikipedia.org/wiki/Compact_Disc_Digital_Audio


The Lord of the Rings contains a Red Book which is basically itself :)

https://en.wikipedia.org/wiki/Red_Book_of_Westmarch

I'll show myself out...


In civil engineering, all construction contracts are usually based on FIDIC's Red Book. This way, everybody is used to a standard contract, from lawyers to judges, in whatever jurisdiction, nationally or internationally...


Mao and Feynman


Political philosophy.


My Ph.D. advisor gave me his copy when I was just getting started with my thesis, almost twenty years ago. I remember spending 6-7 very intense weeks with it. It gave me most of the mathematical/logical background specific to databases I needed to finish what was primarily a DB focused dissertation. I kept coming back to it all through my studies and later.

If you ever need to design a logically sound query processor/optimizer and/or data storage layer, this is a book you should study.

It still sits on my bookshelf, waiting for the student whose interest in databases is deeper than dujourDB, so that I can pass it down to them.


What are you working on now? If in industry, do you actively use what you learned in your PHD? Was it worth it?


Mostly bioinformatics these days. I’m in academia, and have been virtually all my career, so I can’t really speak to whether a Ph.D. is worth it in industry.


Of all the database theory books I have seen, I found the C.J Date book(An Introduction to Database Systems) to be the most elegant one, though the edition I used attempted to use tutorialD than SQL. What I liked is that it brought out the beauty of relational algebra to the reader with great clarity.


Absolutely!

It's still one of my favourite technical books.

I read it when I started looking at RDBMS after having worked in NoSQL and I clearly recall how excited I was to learn that there is a full fledged system that I could utilize which would get rid of 90% of my code.

I know this term is usually not used when it comes to technical books, but I found it to be quite a page turner!!


.. with a caveat of his efforts to remove NULL


That's definitely part of the elegance from my perspective.


I used to think this also. However after diving really deep into OLAP for business reporting over the last 6 months, I find NULL to be the single most powerful idea in SQL. Here’s some use cases for NULL that I find super useful:

COUNT(case when <complex condition> then 1 end) as foo

Count occurrences of complex condition over a row set. Works because COUNT ignores null.

NVL(1/NULLIF(foo, 0), 0)

Protect against division by zero. The expression 1/X is a simple example but this generalizes well. In fact, I’ve come to think that NULL turns any SQL value into a Monadic option type like those used functional languages and Rust.

Note that I’m not saying the above cannot be done without NULL, just that NULL is a really useful tool in my experience.


Done a fair bit of BI and OLAP this last decade, and I still feel NULL cause more issues than it solves.

For the counting case, sum(case when ... then 1 else 0 end) generally works as well, and generally att little to no cost.

Having nullability as a concept in the query language isn't very problematic though, I might even consider it be helpful on average bas long as a view or table can never contain nulls

Nullable columns in tables and queries is the problem. It boils down to how each nullable columns (in)effectively describes a schema equivalent to a table for each nullable columb, and one for the non-nullable, where the relationship between these tables are quite under constrained.

Thus to ensure data quality and correctness you will have to query all nullable columns of interest to make sure that the various combinations of nulls in those columns doesn't interfere with what you are trying to report. You will also have to track that no component of the system suddenly starts reporting nulls in columns where it didn't use to, as that is equivalent to a schema change.

This wouldn't be much of an issue if there were never more than one nullable column per table, or att the very least that only the absolute minimum of columns were nullable. This is however only very rarely the case, usually there are far too many nullable columns. To make reliable and stable reporting one then has to solve the under specification of the source system in the analysis and reporting pipeline for a class of issues that for the most part should be the responsibility of the source systems.

Best use of nullls I know of, is using them to lower cost of table scans in Google BigQuery.


Just be careful not to use * , count does not always ignore NULLS, if you specify a column, it ignores NULLS, if you use * then it counts it (this is on SQL Server btw... might be different on other platforms)

  create table foo(bar int)
  insert foo values(1),(null),(2)

  select count(*), count(bar) from foo

  3 2


It's exactly unlike a monadic option type. It's much more like null is in Java, C# etc. Every type gets a NULL value here, where in ML/Haskell/etc use of Option/Maybe/etc is explicit.


I believe they were well aware.of that but specifically pointing out similarity with Maybe in how NULL values just bubble up just like Nothing.

This is very different from Java, where a chain of operation on null values typically very quickly leads to runtime exceptions.


Also that one purpose for NULL is "unknowable" or "undefined," which means you're jamming some other value there in its place.

I know NULL causes a lot of heartache but it is a useful concept in DB design.


Chapter 3, 'The Structure of the Relational Model', has a helpful background on the origins of table-oriented relational algebra. Unfortunately, it misses the importance of graphs, which afaict, transcend the dependence on tables' bias for normalized topologies of pathologically regular 2D metric spaces, and it even reinforces their relevance by saying, "Intuitively, the data is represented in tables...". Whereas,

> Graphs allow us to build complex, intuitive relationships between data points. Unlike many traditional methods of structuring data, which focus on the format of relationships, graphs focus on the topology of relationships in the dataset.

https://blog.usejournal.com/data-modelling-with-graph-theory...

When it comes to making database concepts more intuitive, I find visually rich representations very helpful, for example using geometric space-encoded logical relationships to aid diagrammatic reasoning.

It goes along well with many other people's efforts to improve mathematical represenations and interfaces through intuitive visualizations, ie:

https://intuitive-math.club/

http://acko.net/

https://www.reddit.com/r/mathgifs/

https://www.reddit.com/r/visualizedmath/

...


If you are curious how databases are implemented, here is a simplistic implementation in clear abstractions https://github.com/komu/ahwen

It's a good idea to familiarise with the code in bottom-up order of layers.

Readme says this: > The implementation takes heavy inspiration from Edward Sciore's SimpleDB, augmented by implementations of various exercises in his textbook Database Design and Implementation.


Nice. It would be interesting to have such an implementation for a distributed database as well.



I see that this is now #3 on the homepage with 112 points but no comments yet. Are people already very familiar with this work? Can some expound on its merits?


I think that a common occurrence on HN, which I'm definitely a perpetrator of, is upvoting things you wish you had a better grasp on. Theory-heavy & math posts seem especially prone to this.


Perhaps people find it interesting but are too shy to leave the first comment. Like being the one to ask a question during a lecture, sometimes it's really hard to put yourself out there.


I'm curious too. As a CS grad student with relatively little database knowledge, would this be a good text to start learning?


I haven't fully read it (yet), but based on a quick scan of a few chapters, it looks like a dense but well-written introduction.

Some additional resources I'd recommend if you're interested:

Designing Data-Intensive Applications is a fantastic place to start if you're interested in the intersection of databases & distributed systems:

https://dataintensive.net/

The Architecture of Open-Source Applications book has a fewer chapters on databases:

https://aosabook.org/en/bdb.html

https://aosabook.org/en/hdfs.html

https://aosabook.org/en/nosql.html

There's some fantastic documentation on Postgres and its internals:

http://www.interdb.jp/pg/index.html

https://momjian.us/main/presentations/internals.html

https://www.postgresql.org/docs/current/internals.html


Thanks so much! I'll definitely take a look into these.


On the preface of the book (omitted from the ebook), it details several learning paths:

The Classic: a basic database theory course covering the classical material would center around Parts B and C. Chapter 10 and parts of Chapter 9 of Part C are somewhat more advanced and could be skipped. If time allows, some of Chapter 12 and a selection of Part F might be covered.

Feast of Query Languages: a course on the theory of query languages would start with a quick review of the basic material on classical languages (Part B), and continue with Parts D and E. If time allows, some material on languages for complex objects and object-oriented databases (Part F) could be covered.

Gourmet Sampling of Database Theory: a course for people with theoretical appetites that emphasizes the specificity of database theory. Logicians wishing to explore the connection between finite-model theory and databases will be interested in Parts C and E. Those interested in descriptive complexity will find Part E closest to their hearts. Researchers in logic programming will prefer Part D, particularly Chapters 12, 13, and 15. People with a background in theoretical artificial intelligence will find Parts D and F of particular interest. Rule-based systems are related to Chapter 14 (see also parts of Chapter 22). Programming language people will be interested in much of the material on query languages, including Chapters 20 and 21 in Part F.

Fast-Food Database Theory: a course for applied database students that is meant to provide informal exposure to some of the basic results of database theory. This would include assorted topics, with an informal presentation emphasizing the examples, results, and intuition provided in the text, rather than the proofs. A possible syllabus would include Part B; parts of Chapters 8, 9, and 11 in Part C; Chapter 12 and parts of Chapter 15 in Part D; and selected chapters of Part F.

Numerous exercises have been included, and they are essentially of three categories. Routine exercises (unmarked) require easy manipulation of the basic concepts and results. Harder exercises are marked with a (*). Another category of exercises is meant to complement the material in the text and often contains results from related research articles. These exercises are usually on the hard side and some may constitute term projects. These exercises are marked with a ( ).


With databases or anything in general, I believe in doing first and then learning the principles.

Play around with a small RDBMs like sqlite. Do a project around it. Look at the source maybe and then learn about the principles. Foundations of databases are based on discrete mathematics ( where relational comes from in RDBMs ) so a background in discrete math helps.

For enterprise level stuff ( clustering, replication, distribution, etc ), it's more practical to learn on the job where you have access to the hardware, servers, etc.


This is a text for understanding the fundamentals of databases but might be too deep if you just want to use an existing database.


This is really interesting. Now it's ranked #20 with more than 500 points, and it now has 36 comments. But none is actually about the link. People are giving advice on database books they read, and arguing about which one is the best, but it looks like nobody read this particular one. Yet it's massively upvoted.


That's not surprising. People often upvote an interesting conversation regardless of the actual article link.

In fact I added this thread to my favorites because the book references look useful.


I think a lot of people use the upvote to bookmark a link. The favorite feature was rolled out much later and I've always used votes for that purpose.


I have seen it previously and it reads really well :-)


I took Serge's class when he was on sabbatical (?) teaching at UC Berkeley back in the days of steam-powered web browsers. Was a great course but it felt like the theory wasn't _quite_ there yet to be the next big thing in databases.


Looks like a great textbook. I'm really looking for a book on relational database implementation itself, but I understand this is a topic so large that a single book probably could never cover it all.


My Tip: Grab an old cheap copy of "Database Systems: The Complete Book" (by Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer Widom).

Here is an updated edition: https://www.amazon.com/dp/129202447X/


Another interesting link on making your own DB: https://cstack.github.io/db_tutorial/


Damn, the PDF download is not working but it specifically says not to post the PDF anywhere else...


Works fine for me.


Looks like it was temporary. Whew!


[flagged]


Obviously more than one HN user has heard of C.J. Date's work on databases.

We banned you last year for egregiously violating the site guidelines, and told you why. When banned users create new accounts and show new signs of abusive behavior, we ban the new account too. The reason is obvious, but I'll explain anyway: when a banned user gives us reason to believe they've had a change of heart and will use HN as intended in the future, we're happy to unban them—but when they give us reason to believe the opposite, we do the opposite. Otherwise bannage would be a bit of a joke on a site when people can just create a new account in a few seconds (and many do).

We detached this comment from https://news.ycombinator.com/item?id=19735388 and marked it off-topic.


Thank you for keeping this place civil - I know it musn't be fun.


"abusive" Care to elaborate on that overstatement, or are your personal histrionics beyond debate?


In that case I was referring to https://news.ycombinator.com/item?id=17922994. When a new account posts like that and we have evidence linking it to a previously banned account, we usually take it as a sign that the user does not want to use HN as intended. In such cases we extend the previous ban to the new account, for what I'd hope would be obvious reasons.


> if not, you'll never see this anyway...

All users can enable showdead.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: