What goes around comes around and around [pdf] (cmu.edu)
122 points by craigkerstiens 2 days ago | 27 comments





This paper is a really good treatment of the space from my perspective.

I think the greatest power in the relational model comes from its ability to directly represent cyclical dependencies without forcing weird workarounds. Many real-world domains are ambiguous about which types should be strict dependents of which others, and that confounds approaches that rely on serialization. As mentioned in the paper, many major providers offer extensions to SQL that let you iterate through the graph implied by these relations with a single logical command.
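
The extension in question is typically the recursive common table expression from SQL:1999. A minimal sketch of walking a self-referencing table (the employees table and manager_id column are invented for illustration):

  WITH RECURSIVE reports AS (
      -- base case: start from one row
      SELECT id, manager_id FROM employees WHERE id = 42
    UNION ALL
      -- recursive step: follow the self-referencing edge one hop at a time
      SELECT e.id, e.manager_id
      FROM employees e
      JOIN reports r ON e.manager_id = r.id
  )
  SELECT * FROM reports;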

> The impact of AI/ML on DBMSs will be significant

I agree with this, but not in the way the authors may have intended: I think the impact will be mostly negative. The energy being spent on black-box query-generator approaches could be better spent elsewhere. You can get extremely close to the right query, but with databases, close often isn't good enough.

> Do not ignore the out-of-box experience.

This is why everyone says to start with SQLite now.


Great article; one correction: ChatGPT does not actually expose its internal embeddings, so the use of embeddings for RAG is optional, even coincidental. You can also use ordinary search like Elasticsearch (a point that's somehow often lost).

Besides, a model's internal embeddings are per-token (~word), whereas the embeddings used for RAG search are per-document (the retrieval unit might be small, like a paragraph or a page, or as large as the whole source document), so the internal ones wouldn't be usable for this purpose anyway.

> One compelling feature of vector DBMSs is that they provide better integration with AI tools (e.g., ChatGPT [16], LangChain [36]) than RDBMSs. These systems natively support transforming a record's data into an embedding upon insertion using these tools and then uses the same transformation to convert a query's input arguments into an embedding to perform the ANN search; other DBMSs require the application to perform these transformations outside of the database.
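
For contrast, the "outside the database" flow the paper describes can still be pretty compact in an RDBMS. A sketch using Postgres with the pgvector extension (table, values, and the 3-dimension size are invented for brevity; the application computes the embeddings itself):

  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE docs (id bigserial PRIMARY KEY, body text, embedding vector(3));
  -- the application computes the embedding and hands it to the database
  INSERT INTO docs (body, embedding) VALUES ('hello world', '[0.1, 0.9, 0.2]');
  -- nearest-neighbor search by L2 distance to a query embedding
  SELECT id, body FROM docs ORDER BY embedding <-> '[0.1, 0.8, 0.3]' LIMIT 5;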


The paper was inspired by a Hacker News comment: https://x.com/andy_pavlo/status/1807799839616614856


In a technology career that started in the early 1990s, one of the constants has been relational databases and SQL. There is no better general-purpose data storage and query architecture, and it's the first (and usually last) thing I consider for almost any new development project that involves storing and retrieving data.

I'd also add: being decent at SQL queries and design/tuning is a massive advantage over programmers who are not. It's like being tasked with bulk editing text, where one person knows regular expressions and the other only knows string manipulation functions.

The number of companies I've worked at where I said, "We can do that in the database with a view or a trigger," only to be met with "Sounds hacky" or "How do you save that to version control?" is staggering, though. Modern programmers seem to be afraid of SQL.

"How do you saved that to Version Control"

I see that objection a lot and it is just weird to me. You save it to version control like you do any other program source code. SQL is just text.

I'm conservative with triggers. They have their place, but can definitely be abused. The problem arises when someone doesn't realize they are there, and updates start having unexpected (to them) side effects. I'll tend to prefer a stored procedure that is explicit about doing the main update and any needed side-effects.
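
To make the contrast concrete, a Postgres-flavored sketch (all names are invented, and log_balance_change is a hypothetical trigger function):

  -- implicit: the side effect hides behind every UPDATE on the table
  CREATE TRIGGER balance_audit_trg
    AFTER UPDATE OF amount ON accounts
    FOR EACH ROW EXECUTE FUNCTION log_balance_change();

  -- explicit: one procedure performs the update and the side effect together
  CREATE PROCEDURE apply_balance_change(acct bigint, delta numeric)
  LANGUAGE plpgsql AS $$
  BEGIN
    UPDATE accounts SET amount = amount + delta WHERE id = acct;
    INSERT INTO balance_audit (account_id, delta, changed_at)
    VALUES (acct, delta, now());
  END;
  $$;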


This paper has a concise, easy-to-understand definition of Google's MapReduce:

> To a first approximation, MR runs a single query:

> SELECT map() FROM crawl_table GROUP BY reduce()

Or you could read the entire Google MapReduce paper.


Isn't the GROUP BY run before the SELECT though, e.g. "SELECT MAX(t) FROM foo GROUP BY t"? I think to do it the way they suggest you'd probably need to create a temp table like

  WITH mapped AS (SELECT map() FROM crawl_table)
  SELECT * FROM mapped GROUP BY reduce()


Yes. MapReduce's model is basically:

  1. Map (key, value) -> (new_key, tmp_value)
  2. Group by new_key
  3. Reduce (new_key, all tmp_values for that key) -> (new_key, new_values)
In that respect, it's not that far from SQL with custom aggregates. I guess the most precise SQL representation would be

  SELECT REDUCE(MAP(t)) FROM foo GROUP BY KEY(MAP(t))
(I've both been on the MapReduce team and worked on an SQL database. I honestly don't think they're that comparable.)
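
Still, to make the analogy concrete: word count, the canonical MapReduce example, rendered as Postgres SQL (a sketch; assumes a docs table with a text column named body):

  -- map: split each document into words; shuffle: GROUP BY; reduce: count(*)
  SELECT word, count(*) AS n
  FROM docs, regexp_split_to_table(lower(body), '\s+') AS word
  GROUP BY word
  ORDER BY n DESC;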

I don't know how I feel about this paper. On the one hand, I agree with the sentiment that the relational data model is the natural end state if you keep adding features to a data system (it perfectly captures my sentiment about vector DBs), and that it's silly not to use SQL out of the gate.

On the other hand, the paper is kind of dismissive about engineering nuance and gets some details blatantly wrong.

- MapReduce is alive and well; it just has a different name now (for Googlers, that name is Flume). I'm pretty confident that your cloud bill, whether you use GCP, AWS, or Azure, is powered by a couple hundred, if not thousand, jobs like this.

- Pretty sure anyone running in production has a hard serving dependency on Redis or Memcache _somewhere_ in their stack, because even if you're not using it directly, I would bet that one of your cloud service providers uses a distributed, shared-nothing KV cache under the hood.

- The vast majority of software is not backed by a truly serializable ACID database implementation.

-- MySQL's default isolation level has internal consistency violations[1], and its DDL is non-transactional (see the sketch after the footnote).

-- The classic transaction example of a "bank transfer" is hilariously mis-representative - ACH is very obviously not implemented using an inter-bank database that supports serializable transactions.

-- A lot of search applications - I would venture to say most - don't need transactional semantics. Do you think Google Search is transactional? Or GitHub code search?

[1]: https://jepsen.io/analyses/mysql-8.0.34
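
(For the record, the knob in question; MySQL/InnoDB defaults to REPEATABLE READ, so you have to opt in per session to get serializability:)

  SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  START TRANSACTION;
  -- statements in here now get truly serializable semantics
  COMMIT;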


> The classic transaction example of a "bank transfer" is hilariously mis-representative - ACH is very obviously not implemented using an inter-bank database that supports serializable transactions.

This is meant more as a pedagogical tool than as a literal representation of how the system works. The intra-bank aspects of ACH absolutely do rely on serializable transactions.
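
The pedagogical version, for reference (the accounts table is invented):

  BEGIN;
  -- both legs commit or neither does; that's the whole lesson
  UPDATE accounts SET balance = balance - 100 WHERE owner = 'alice';
  UPDATE accounts SET balance = balance + 100 WHERE owner = 'bob';
  COMMIT;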


I would find it hard to question CMU papers. They are pretty thorough when it comes to computing, and the tradition goes way back. If I disagreed with something in this paper, it would be a clue to myself that I need to understand the problem domain better.

I also think that, as a school, the philosophy may be to solve the problem thoroughly without regard to speed, because eventually computers will be faster.

I think the paper is just pointing out that anything great is going to migrate to the RM or SQL model, so if you start there, any missing feature will eventually show up. They also point out how many resources go into deploying immature ideas.


The number of times I mis-spelt Stonebraker in my Ph.D. thesis... An absolute pioneer. I'm glad he's still around and active, sadly unlike many of his late contemporaries.

> The number of times I mis-spelt Stonebraker

Same problem here.

But, given how many last names (not only in English, but in many other languages too) derive from trade or profession monikers, I can't help suspecting that at some point earlier in his family's history it was spelled differently. (Except of course if it was first written down in this form. But then I'd guess that was because spelling wasn't all that stabilised a few centuries ago, so what was meant was still what we'd now write as "StonebrEaker".)

"Anyway, it's just a thought." (i.e. idle etymological speculation.)


When I was building an object store years ago, I needed a way to attach metadata tags to each object. The objects themselves could be files like a picture, a document, or some music, and I wanted to allow tags to denote things like the author, the camera, or the music genre.

Most systems use things like extended file attributes or a separate database to store such metadata, but I wanted something different. It needed to be able to attach tags to hundreds of millions of objects and find things that matched certain tags quickly.

I invented a key-value store to hold the metadata and got it working well. When it started to look like a big columnar store with sparsely populated rows, I decided to see if it could handle queries like a relational database. To my surprise, not only did it do that well, it could outperform many of them.

There are data models besides relational that can work extremely well for certain data sets.
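
For anyone curious what that shape looks like in relational terms, a rough sketch (all names invented):

  -- one row per (object, tag) pair; sparsity is free
  CREATE TABLE object_tags (
    object_id bigint NOT NULL,
    tag_key   text   NOT NULL,  -- e.g. 'author', 'camera', 'genre'
    tag_value text   NOT NULL,
    PRIMARY KEY (object_id, tag_key)
  );
  CREATE INDEX tags_by_kv ON object_tags (tag_key, tag_value, object_id);

  -- "find things that match certain tags" = intersect the per-tag matches
  SELECT object_id FROM object_tags WHERE tag_key = 'genre' AND tag_value = 'jazz'
  INTERSECT
  SELECT object_id FROM object_tags WHERE tag_key = 'author' AND tag_value = 'Mingus';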


I'm not going to have hundreds of millions of objects, so I can't speak to that.

But for my one hobby project, I'm using RDF and a triple store. Even with a "small" dataset, you can get an explosion of properties.

I want to be able to add arbitrary properties to arbitrary things and relate them all together. Build the graph organically.

So far, it's working really well. But underneath, it's (likely) just a couple of B+trees doing all of the heavy lifting.


Most vendors use three indexes for triples and four or six for quads. All the indexes are covering, which is to say they triplicate all the data; in other words, the database consists only of indexes.

Ain't that just neat?
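
In SQL terms the layout is roughly this (a sketch):

  CREATE TABLE triples (s text, p text, o text, PRIMARY KEY (s, p, o));
  -- the primary key doubles as the SPO index; two more covering
  -- permutations serve the remaining access patterns
  CREATE INDEX pos_idx ON triples (p, o, s);
  CREATE INDEX osp_idx ON triples (o, s, p);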


I refer to RDF as the "absurd normal form". When my friends and I do DB design, it's almost inevitable that we fall into what becomes the "ThingThing" table, a many-to-many joiner of everything to everything else. (That's when we giggle, leave the room, go to lunch, and come back when we've returned to our senses.)

But for RDF it's exactly what I want. I'm not interested in schemas and such for this work, so it's perfect for my scenario.


The killer feature isn't even fully indexed queries forever; it's the serialization formats.

Need to do a non-trivial merge of complex domain graphs? Have you tried string-concatenating Turtle files?


More specifically, blockchains are designed to avoid double-spending in a low-trust environment. If you're not trying to avoid double-spending, OR you're not in a low-trust environment, you probably don't need a blockchain.

> The ideal use case for blockchain databases is peer-to-peer applications where one cannot trust anybody. There is no centralized authority that controls the ordering of updates to the database. Thus, blockchain implementations use a BFT commit protocol to determine which transaction to apply to the database next.


What an amazing read; here's hoping they'll both be around for the 2044 edition. 101 is not too old to write another research paper, Dr. Stonebraker!

The relational model is to data what Lisp is to code: despite attempts to beat it, nothing really can because all those other models are expressible in terms of it (and, usually, can be made very efficient in practice).

RDBMS and Lisp sit near the tao of their respective domains, which is why I advise people to stick with an RDBMS unless they have a really, really, really good reason not to. Or as Nik Suresh put it, "Just use Postgres. You nerd. You dweeb."


LOL, you just outed yourself as a smug Lisp weenie. Utter confidence in the absence of any evidence. The obvious glaring difference is that RDBMSs utterly dominate the database space, something Lisp doesn't even come close to doing.

One of the things you mentioned does seem unbeatable. The other is an obscure language that powers very little of the actual world.

If you like Lisp, I presume you would prefer Datalog over SQL; it's used in the Clojure-related database Datomic. Datalog is much more elegant and composable than SQL.


