Reddit’s database has two tables (2012) (burke.dev)
206 points by swyx on Aug 10, 2022 | 247 comments



I worked at a startup (now fairly popular in the US) where we had tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things. There were no foreign keys, so making changes was pretty cheap. It was actually pretty ingenious (the two guys who came up with the schema went on to get paid to work on k8s).

It was super handy to simply query that table to debug things, since by merely looking for a user, you’d discover everything. If Mongo had been more mature and scalable back then (2012ish), I wonder if we would have used it.
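A rough sketch of what that debug query looks like (hypothetical column name; the actual layout is a nullable id column per thing type):

    -- Sketch: find every relationship a given user participates in,
    -- which in this design is effectively "everything we know about the user".
    SELECT *
    FROM relationship
    WHERE user_id = 42;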


This is quite similar to the RDF triples model. Everything is a 'thing' including 'relationships' between things. So you just need a thing table and a relation table.

The issue with this is that schema management and rules get pushed to the application layer. You also need to deal with massive tables (in terms of # of rows) for the relationships table, which leads to potential performance issues.


let's call them "nodes" and "edges"


Right? What a groundbreaking idea! /s


other downsides I saw from an implementation of this in postgres:

1. you can't have strongly typed schemas specific to a 'thing' subtype; to store it all together, you end up sticking all the interesting bits in a JSON field

2. any queries of interest require a lot of self joins and performance craters

Please, please, never implement a "graph database" or "party model" or "entity schema" in a normal RDBMS, except perhaps at minuscule scale; the flexibility is absolutely not worth it.


That's why you have duplicated properties on things you want to query on.


That's why in RDF you should really be using a well-modelled ontology.

As for performance issues, using a dedicated Quad store helps there, rather than one built on top of an RDBMS (OpenLink Virtuoso excluded, because somehow they manage to make Quads on top of a distributed SQL engine blazing fast).

It's been about 6 years since I've been in that world, so things might have changed wrt performance.


Sounds like a 'star schema', just without foreign key constraints on the relationship ('fact') table.

I'm not really sure I understand why not having that constraint is helpful - what sort of 'changes' are significantly cheaper? Schema changes on the thing ('dimension') tables? Or do you mean cheaper in the 'reasoning about it' sense, you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table? (And sure something starts to 404 or whatever, but in this application that's fine.)


> you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table?

Pretty much this. You could just delete all the relationships, or just some of them. For example, a user could be related to many things (appointments, multiple vendors, customers, etc). Thus a customer can just delete the relationship between them and a user, but the user still keeps their other relationships with the customer (such as appointments, allowing them to see their appointment history, even though the customer no longer has access to the user's information).


wouldn't you be able to do the same even with FKs in place? What an FK would prevent you from doing is deleting a user before all of their relationships are removed. That should keep you from ending up with inconsistent data inside the tables.


In my experience with regular Postgres databases shared across many teams, inconsistent/broken references are seen as a much less risky thing than, say, cascading a delete. Usually a broken reference is of no consequence. And when you don’t use FKs, people tend to bake that into how they code things without much trouble.


I've oft wished for a system where FKs could be used to automatically mark rows as invalid instead of deleting them, and you could inspect the nature of the broken reference as a property of the row/object. Then broken references can be cleaned up when appropriate, possibly never depending on the application.

This gives the neat property of being able to do data entry in parallel, too - different people may insert records into two tables linked by a FK relationship, it's nice for this to happen in either order. It also allows the user to query for exactly what rows will be affected by a cascaded delete, or to perform the operation in steps.


Wouldn't you effectively get those features by just not using FKs? You'd get your broken-reference marker by `where foreign_id not in (select id from foreign_table)`. You could create that as a view or materialized view.
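A rough sketch of such a view (table and column names are hypothetical); NOT EXISTS avoids the NULL gotcha of NOT IN:

    -- Sketch: surface rows whose reference no longer resolves, without any FK constraint.
    CREATE VIEW broken_order_refs AS
    SELECT o.*
    FROM orders o
    WHERE o.user_id IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM users u WHERE u.id = o.user_id);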


It would be nice to have the schema metadata, at the very least, to be able to analyze a given table's reverse dependencies without resorting to naming conventions.


> tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things

Sounds like objects and associations in the Tao model:

https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...


So basically one table for each type of node, plus one table for all edges. Neat!

How big was the relationships table? How was performance?


It was pretty massive. I’m certain that it is probably sharded by now. Database performance wasn’t our bottleneck; we had NP-hard problems all over the place, which were our primary performance issues. For most customers, they never really had issues, but our larger customers def did. Those problems were amazingly fun to solve.


How do you shard a thing like this? (Apart from “with great difficulty.”) :)


Does this question imply sharding would be more difficult with this sort of table design, compared with other table designs?

Obviously sharding is difficult, but what would make it any more difficult with this table design?

Picking the sharding keys, avoiding hotspots, rebalancing, failover, joining disparate tables on different shards, etc... these are all complicated issues with far reaching implications, but I just don't understand how this table design would complicate sharding any more than another.

I'm not trying to sound condescending and suggest it's a dumb question... genuinely interested.


This would have happened after I left, so I'm just speculating based on experience.

1. You create a function in your application that maps a customer id to a database server. There is only one at first.

2. You set up replication to another server.

3. You update your application's mapping function to map about half of your ids to the new replica. Use something like md5(customer id) % 2. Using md5 or something else that has a good random distribution is important (sha-x DOES NOT have a good random distribution).

4. Your application will now write to the replica; turn off replication, and delete the ids from each server that no longer map to that server.

Adding a new server follows approximately the same lines, except you need to rebalance your keys which can be tricky.
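A sketch of the step 3 mapping written as a Postgres expression (assuming a numeric customer id; in practice this lives in application code):

    -- Sketch: hash the customer id with md5 and map it onto one of two servers.
    -- Only the first 4 hex chars (16 bits) are used so the cast stays a small positive int.
    SELECT customer_id,
           ('x' || left(md5(customer_id::text), 4))::bit(16)::int % 2 AS shard
    FROM customers;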


So I'm trying to learn about system design without having ever encountered large problems from any of my projects. What I'm reading keeps pushing for hash rings for these types of distribution problems.

Someone who does this stuff for real, do orgs actually just do a hash % serverCount? That's what a startup I worked at did 8 years ago. I thought nobody actually did this anymore though given the benefits of hash rings?


It depends on control and automation. If you can easily migrate things between servers, a hash ring makes more sense. If you are a scrappy startup who has too much traffic/data, you are doing everything manually, including rebalancing and babysitting servers.


Can you please clarify why md5 has better random distribution than sha?


It doesn't. I'm curious to hear how withinboredom arrived at that conclusion.


I arrived at this empirically doing some tests at work. In general, Sha and md5 BOTH present randomly distributed numbers. However, most IDs (and the assumption made here) are numeric. For the first billion or so numbers, sha doesn't create a random distribution, while md5 does.

Looking back, I don't remember my experimental setup to verify a random distribution and I recommend checking for your own setup. But that was the conclusion we arrived at.


Here's the fraction of 1s for each bit position output by MD5 and SHA-256 using the first million integers as inputs: https://pastebin.com/aVR3u8Vv


I quickly threw a few into excel with a pareto graph to show distribution. Granted, it is only the first 10,000 integers % 1000: https://imgur.com/a/HypGWP1

To my untrained eye, md5 and sha1 (and not shown, sha256) look remarkably similar to random. If you want truly even distributions, then going with a naive "X % Y" approach looks better.


If you do % 1000 there'll be a bit of bias from the modulus, but not anything meaningful for the use case, and it affects MD5 and SHA both.

No argument that for simple incrementing integers a raw modulus is better, though. Even if hashing is needed, a cryptographic hash function probably isn't (depending on threat model, e.g. whether user inputs are involved).


MD5 and ShaX are available in almost every language. Things that are probably better (like Murmur3) might not be, or may even have different hashes for the same thing (C# vs JS with Murmur3, for example). I had to reimplement Murmur3 in C# to get the hashes to match from JS, it boiled down to the C# version using longs and eventually casting to an int while JS can’t or didn’t do that final cast.


I wasn't thinking of anything exotic, just the built-in functions used for hash tables and random number generators (for integers and other inputs that can be interpreted as seeds, a random number generator is also a hash function, and of course in your case the hash function was a random number generator all along anyway!).


soooo basically there is no bias


Looks like it. Though I think md5 is faster, which is maybe what I’m remembering, now that I think about it. This was years ago, funny how memory gets tainted.


I think that's probably true of lots of masters of crafts, sometimes you have a gut feeling and don't even remember the formative moments that created the conclusion haha


If you worry about random distribution of the bits, wouldn't it be better to do (H(customer_id) % big_prime) % 2, so that the "% 2" does not just read the last bit?


A finite field would ALMOST work here. However, you're making the assumption that the user ids using your application are random. If you have an A/B test running using that same mapping, you'll end up with a server hotter than the other if one arm of the experiment is widely successful.


In practice I would have added a salt, but I did not consider that each independent sharding needs to have its own salt.


Thanks for that answer, it seems quite a specialist job. So many moving parts. Why would you need random distribution and not just customers 1 to 100k for example?


> Why would you need random distribution and not just customers 1 to 100k for example?

To keep the databases from having hotspots. For example, if your average customer churns after 6 months, you'll have your "old customer" database mostly just idling, sucking up $$ that you can't downgrade to a smaller machine because there are still customers on it. Meanwhile, your "new customer" database is on fire, swamped with requests. By using a random distribution, you spread the load out.

If you get fancy with some tooling, you can migrate customers between databases, and have a pool of small databases for "dead customers" and a more powerful pool for "live customers" or even have geographical distribution and use the datacenter closest to them. But you really do need some fancy tooling to pull that off, but it is doable.


> Meanwhile, your "new customer" database is on fire, swamped with requests. By using a random distribution, you spread the load out.

Great answer.


This depends on access patterns to your data.


I was wondering the same thing. But at the same time, it seems like having it all centralized in one place would make it really easy to write tooling to aid in such an endeavor.

For example, suppose it was decided that sharding would be by account ID. It would probably be very achievable to have some monitoring for all newly created relationships to check that they have an account ID specified, and break that down by relationship type (and therefor attributable to a team). Dashboards and metrics would be trivial to create and track progress, which would be essential for any large scale engineering effort.


Yeah it does seem you need to think ahead of what the future needs would be.


One neat thing about Mongo is the ObjectIDs:

We didn't know it at the time, but you can get the date of creation from them. Which means... If you sort by ObjectID, you're actually sorting by creation time. You can create a generic ObjectID using JavaScript and filter by time ranges. I still have a codepen proof of concept I did for doing ranges.

Anyway, the other thing is, and I'm not sure how many people use it this way: you can copy an ObjectID verbatim into other documents, like a foreign key, so you can reference between collections. If you do this, you'll save yourself a lot of headaches. Just because you can shove anything and everything into MongoDB documents, doesn't mean you absolutely should.


But why? Ok, no foreign key checking, so it’s fast, great. Why not include the foreign ID in the local table? What’s the purpose behind a global “relationships” table?


Guessing:

* Most relations are properly modeled as N:N anyways.

* Corollary: no "aargh, need to upgrade this relation to N:N" regrets.

* Corollary: no 1000 relationship tables to name & chase around the DB metadata.

* Low friction to add relations on-the-fly when developing code.

* Low friction to query all relations associated with an object.

* Low friction to update a relation, but keep the ex around.

Neat approach IMHO.


>Most relations are properly modeled as N:N anyways

I'm not sure that's true. Containment is a very common type of relationship.

>Corollary: no 1000 relationship tables to name & chase around the DB metadata.

That depends on whether relationships have associated attributes. If and when they do, you have to turn them into object types and create a table for them. And it's exactly the N:M relationships that tend to have associated information.

That's the conceptual problem with this approach in my view. Quite often, there isn't really a conceptual difference between relationship types and object/thing types. So you end up with a model that allows you to treat some relationships as data but not others.

>Neat approach IMHO.

It's a valid compromise I guess, depending heavily on the specifics of your schema. Ultimately, database optimisation is a very dirty business if you can't find a way to completely separate logical from physical aspects, which turns out to be a very hard problem to solve.


Great points! Indeed, hard to tell without knowing a bit more about their schema specifics.


* low friction till things go wrong and you're not even aware they went wrong because the relational mistake wasn't constrained

* aaargh we have trillion row tables that contain everything, how do we scale this (you don't)

* aaargh our tables indices don't fit into memory, what do we do (you cry)

* no at a glance metrics like table sizes to have an idea of which relations are growing

* oh no we have to have complex app code to re-implement for relationships everything a database gave for free

Reddit used a similar system in Postgresql and hit all of these issues before moving to Cassandra. Even then they still had to deal with constraints of the original model.

Edit: Didn't even notice the article was about Reddit; it pretty much confirms what I said. It took them something like 2 years iirc to migrate to Cassandra and the site was wholly unusable during that time. There is no doubt in my mind that if another Reddit-style site had existed during that time it would have eaten their lunch, same as Reddit ate Digg's lunch (for non-technical reasons).

Furthermore, back then iirc they only really had something like 4 tables/relations: comment, post, user, vote. The 'faster development time' for adding new relations was completely moot; they spent more time developing and dealing with a complex db query wrapper and performance issues (caching in memcache all over the place) than actual features.

The (app+query) code and performance for their nested comments was absolutely hilarious. Something a well designed table would have given them for free in one recursive CTE.
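For reference, the sort of recursive CTE meant here, assuming a conventional comments table with a self-referencing parent_id (not Reddit's actual schema):

    -- Sketch: fetch a post's entire comment tree in one query.
    WITH RECURSIVE thread AS (
        SELECT c.id, c.parent_id, c.body, 0 AS depth
        FROM comments c
        WHERE c.post_id = 123 AND c.parent_id IS NULL
      UNION ALL
        SELECT c.id, c.parent_id, c.body, t.depth + 1
        FROM comments c
        JOIN thread t ON c.parent_id = t.id
    )
    SELECT * FROM thread;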

It wasn't until they moved to Cassandra that they were able to let the db do that job and work on adding the shit ton of features that exist today.


> low friction till things go wrong and you're not even aware they went wrong because the relational mistake wasn't constrained

These are people-problems, and usually a dev made this mistake exactly once after being hired. Writing robust code isn't hard if the culture supports it.

> we have trillion row tables that contain everything, how do we scale this (you don't)

You do, by sharding the table.

> no at a glance metrics like table sizes to have an idea of which relations are growing

select count(*) from relationship where relation = 'whatever'; is pretty simple.

> we have to have complex app code to re-implement for relationships everything a database gave for free

These queries are no different than standard many-to-many type queries. In all, the logic is no different than anything else. You also still have to do the same number of inserts that you have to do for many-to-many, but now the order isn't important. You can even do a quick consistency check before committing the transaction at the framework level.


> These are people-problems, and usually a dev made this mistake exactly once after being hired.

Classifying bugs as "people problems" strikes me as an incredibly hostile work environment.

What happens to the poor developer who makes a particular mistake twice? They get fired?


Have you ever accidentally deleted a production table before? If so, I doubt you’d do it again if you are able to learn from your mistakes. No hostilities required. Hire good people who are smart and capable of learning.


Deleting a production DB is not the only way to mess up your data.

Smart people design systems that alert them when they're about to commit an error because they have 500 other priorities on their mind.


Yeah, we had that.

Even where I work now (which has almost no table constraints at all because there is no way to support them at our scale), we have tooling that prevents you from making dumb mistakes, plus alerting and backups if you manage to do it anyway. Having processes and tooling in place to deal with these kinds of issues is what I meant by "people problem." No constraint will keep a determined person from deleting the wrong thing (in fact, it might be worse if it cascades).

We had someone accidentally fight the "query nanny" once, thinking they were deleting a table in their dev environment... nope, they deleted a production table with hundreds of millions of rows. It took hours to complete the deletion, while all devs worked to delete the feature so we could restore from backup and revert our hacked out code. They still work there.

My point is, these are all "people problems" and no technology will save you from them. They might mitigate the issue, or slow people down to get them to realize what they are about to do ... but that is all you can really hope for. A constraint isn't necessarily required to provide that mitigation, but it helps when you can use them.


Thanks for clarifying, I had misinterpreted your statement.

I can see how at huge scales, DB constraints can be impossible to implement. But they do come (almost) for free in terms of development effort, whereas what you describe probably took a long time to implement.


"made a big mistake twice" does not correlate with "bad person who is not smart or capable of learning"—hope you're not in management


I never said it did. Just that people tend to learn from their mistakes. Are there people who will make the same mistake different ways over and over again? Yep (I’m one of them). It’s not an issue, I learn and get better at spotting the mistake when I review others’ code and others point it out when reviewing my code. That’s why I said “exactly once”, since once you know what it looks like, you can spot it in a review.


> It wasn't until they moved to Cassandra that they were able to let the db do that job and work on adding the shit ton of features that exist today.

But Cassandra is still a NoSQL database and doesn't give you the constraints of something like pg.


Can't help but read this and wonder if it was a strict architectural decision (I think this is what you imply?), or a "move fast and fix it later" side effect that hung around. To me it seems like you're discarding a lot of safety, but perhaps that's overstated in my head.


The relationships were enforced in-code, so the safety existed there instead of by the database. Even where I work now with sharding and such, we don’t have any foreign key constraints yet serve billions of requests. Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.


> Code can provide just as much safety as the database

No, database constraints declaratively specify invariants that have to be valid for the entirety of your data, no matter when it was entered.

Checks and constraints in code only affect data that is currently being modified while that version of the code is live. Unless you run regular "data cleanup" routines on all your data, your code will not be able to fix existing data or even tell you that it needs to be fixed. If you deployed a bug to production that caused data corruption, you deploying a fix for it will not undo the effects of the bug on already created data.

Now, a new database constraint will also not fix previously invalid data, but it will alert you to the problem and allow you to fix the data, after which it will then be able to confirm to you that the data is in correct shape and will be continue to be so in the future, even if you accidentally deploy a bug wherein the code forgets to set a reference properly.

I'm fine with people claiming that, for their specific use case or situation, database constraints are not worth it (e.g. for scalability reasons), but it seems disingenuous to me to claim that code can provide "just as much safety".
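To be concrete about the "alert you to the problem" part: Postgres lets you add the constraint first and check existing data in a separate step, roughly (hypothetical table names):

    -- Sketch: add an FK without checking existing rows, then validate separately.
    -- VALIDATE CONSTRAINT fails if old rows violate it, which is your alert to go fix them.
    ALTER TABLE orders
        ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id)
        REFERENCES users (id) NOT VALID;

    ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;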


> Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.

This statement is absolutely false. Some application code that random developers write will never be a match for constraints in a mature, well tested database.


> Some application code that random developers write will never be a match for constraints in a mature, well tested database.

Well, lucky for you, most places don't hire random developers, only qualified ones. /s

Where I work has billions of database tables sharded all over the place. There simply isn't a way to define a constraint except via code. We seem to get along just fine with code reviews and documentation.


I’ve never seen a database with just one writer. I wouldn’t even bet on all the writers using the same language, much less reusing a single piece of code implementing constraints that never have changed and never will.


Can you tell me what the performance impact is with code logic and extra DB calls versus built-in constraint logic? Because I am fairly convinced that most modern DBs have almost zero impact from relationship constraints. Those are heavily optimised. I am curious about your study and want to know more.


The problem with sharding is that you can't join, as most databases don't support "distributed joins" so relationship constraints don't even make sense. Thus it has to be done in code and often requires multiple queries to various servers to get the full picture. Most of this can be optimized by denormalizing as much as possible for common queries and leaving normalized records for less common queries.

There's a performance impact, for sure, but there's really not much that can be done without switching to some database technology that does sharding for you and can handle distributed joins (rethinkdb comes to mind, but I'm not sure if that is maintained any more). But then you have to rewrite all your queries and logic to handle this new technology and that probably isn't worth it. Even if you were to use something like cockroachdb and postgres, cockroachdb doesn't support some features of postgres (like some triggers and other things), so you'll probably still have to rewrite some things.


> I am fairly convinced that most modern DBs have almost zero impact from relationship constraints

It cannot fundamentally be zero impact, the database needs to guarantee that the referred record actually exists at the time the transaction commits. Postgres takes a lock on all referenced foreign keys when inserting/updating rows, this is definitely not completely free.

Found this out when a plain insert statement caused a transaction deadlock under production load, fun times.


> There were no foreign keys

Then how did the relationship table refer to things in the thing tables?


I bet they meant no foreign key constraints.


OK, but that still doesn't make any sense because:

> we had tables for each thing

I take that to mean "a separate table for each class of thing." So the single "relationship" table had to somehow refer not just to rows in tables but to the tables themselves. AFAIK that is not possible to do directly in SQL. You would have to query the relationship table, extract a table name from the result, and then make a separate query to get the relevant row from that table. And you would have to make that separate query for every relationship. That would be horribly inefficient.


I don't see why you would have to include the table name. I presume the calling code knows what they are looking for.

For example, say we have a table of people and a table of books and a relationship table books-owned-by. If I want to see what's in your library I can select you in the relationship table and join with the books table to get a record set describing your library. Or is this not what the OP was describing?
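Something like the following, assuming a books-owned-by table of (person_id, book_id) pairs:

    -- Sketch: list one person's library via the relationship table.
    SELECT b.*
    FROM books_owned_by r
    JOIN books b ON b.id = r.book_id
    WHERE r.person_id = 42;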


The OP said, "we had tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things". I took that to mean that there was only one relationship table.


Oh I see. Hmm... I wonder what OP meant?


Yep, so a table for each thing with their own schemas. A user table, a company table, etc. Then a single relationship that defines all the relationships between all the objects in a giant many-to-many relationship.

So to query a full relationship, you'd do a join from A -> relationship -> B, probably an inner join. But most of the time, it was usually just a singular join (to get a list of things to show the user, for example) since you'd already be querying in the context of some part of the relationship.


Ah, that makes sense. So something like:

    select whatever from thingy, relation, widget
    where thingy.id = relation.id1
    and widget.id = relation.id2
    and relation.type1 = 'thingy'
    and relation.type2 = 'widget'


I'm a bit confused about how that relationship table would look. Is anybody kind enough to show a concrete example?


This type of setup is semi-common and I've seen it called "polymorphic relationships". The PHP framework Laravel has first-party support for them if you want a real example, although those are generally only polymorphic for one side of the relationship (i.e. building a CRM and having a single "comments" table with the ability to relate Customers, Opportunity, Leads, etc. to each comment without having a unique column for each type).

Generally, the relationship table will look like this:

  - relation1_id
  - relation1_type
  - relation2_id
  - relation2_type
Where *_id is the database ID and *_type is some kind of string constant representing another table. Generally you put an index on "relation#_id, relation#_type", but that's obviously use case dependent
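A rough DDL sketch of that layout (types and index choices are just one reasonable option):

    -- Sketch of the polymorphic relationship table described above.
    CREATE TABLE relationship (
        relation1_id   bigint NOT NULL,
        relation1_type text   NOT NULL,  -- e.g. 'user', 'company'
        relation2_id   bigint NOT NULL,
        relation2_type text   NOT NULL
    );

    CREATE INDEX ON relationship (relation1_id, relation1_type);
    CREATE INDEX ON relationship (relation2_id, relation2_type);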


Like others have said, you can store the table types with the ids.

But here's the neat part. That's often only for safety.

If I want to find all users and their orders, I can join the user table to the relationship table, then join that to the order table. Something like:

    SELECT <COLUMNS GO HERE>
    FROM USER
    INNER JOIN RELATIONSHIP ON USER.Id IN (RELATIONSHIP.LeftId, RELATIONSHIP.RightId)
    INNER JOIN ORDER ON ORDER.Id IN (RELATIONSHIP.LeftId, RELATIONSHIP.RightId)

So you're getting the Users, then filtering to the ones that have relationships, then filtering that to the ones that also have orders.

If you've ever managed an N:N relationship table, it's like that, but more generic.


IIRC, we had a nullable column for each ID type. So a column for user, company, appointment, etc. then we had an enumerated column for the relation (owner, patient, vendor, etc).


Sounds like the classic entity–relationship model

I am assisting a professor with his intro to databases class this semester and that was the first thing taught there

https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_mo...


How far is that from RDF triples and similar ideas? Just curious


I had never heard of this before! But yeah, you could use the relationship table to describe it as a RDF triplet. For example, relationships could be OWNER, or PATIENT, or something and could be seen as a predicate. So USER OWNER of COMPANY, or USER PATIENT of COMPANY, or RECORD BELONGS to COMPANY, or COMPANY VENDOR of COMPANY.


I don't understand this. Wouldn't the relationship rows need to use foreign keys to know which things they apply to?


what's the advantage (besides debugging) to have all relationships in one table, rather than having a separate table for each relationship? seems like it would be strictly worse


Funny. I’ve done exactly this elsewhere.


They were using it like a NoSQL database back in 2010, just one year after MongoDB started. So there were no options.

In 2022, there are so many more mature, reliable, battle-tested NoSQL options that could have solved their problem more elegantly.

But even today, if I were to build Reddit as a startup, I'd start with a Postgres database and go as far as I can. Postgres allows me to use relations when I want to enforce data integrity and JSONB when I just want a key/value store.


> But even today, if I were to build Reddit as a startup, I'd start with a Postgres database and go as far as I can. Postgres allows me to use relations when I want to enforce data integrity and JSONB when I just want a key/value store.

This is accepted wisdom on HN but I don't see it:

- True HA is very difficult. Even just automatic failover takes a lot of manual sysadmin effort

- You can't autoscale, and while scaling is a nice problem to have, having to not just fundamentally rearchitect all your code when you need to shard, but also probably migrate all your data, is probably the worst version of it.

- Even if you're not using transactions, you still pay the price for them. For example your writes won't commit until all your secondary indices have been updated.

- If you're actually using JSONB then driver/mapper/etc. support is very limited. E.g. it's still quite fiddly to have an object where one field is a set of strings / enum values and have a standard ORM store that in a column rather than creating a join table.


If I were to start building Reddit from scratch without knowing if it's going to be successful or not, I'd rather iterate quickly on features than spend time scaling something that will most likely fail before reaching that kind of scale.

>True HA is very difficult. Even just automatic failover takes a lot of manual sysadmin effort

Nearly all Postgres cloud providers such as RDS, CloudSQL, Upcloud, provide multi-region nodes and automatic failover.

Also, true HA databases are very expensive and have their own drawbacks. Not really designed for a Reddit-style startup.

>You can't autoscale, and while scaling is a nice problem to have, having to not just fundamentally rearchitect all your code when you need to shard, but also probably migrate all your data, is probably the worst version of it.

Hence why I said "as far as I can". I can keep upgrading the instance, creating materialized views, caching responses, etc.

Lastly, the most important reason is because I know Postgres and I can get started today. I don't want to learn a highly scalable, new type of database that I might need 5 years down the road.


> Nearly all Postgres cloud providers such as RDS, CloudSQL, Upcloud, provide multi-region nodes and automatic failover.

From memory you have to specify a maintenance window and failover is not instant. And relying on your cloud provider comes with its own drawbacks.

> Also, true HA databases are very expensive and have their own drawbacks. Not really designed for a Reddit-style startup.

Plenty of free HA datastores out there, and if we're assuming it'll be managed by the cloud provider anyway then that's a wash.

> Lastly, the most important reason is because I know Postgres and I can get started today. I don't want to learn a highly scalable, new type of database that I might need 5 years down the road.

That's a good argument for using what you know; by the same token I'd use Cassandra. Postgres is a very complex system with lots of legacy and various edge cases.


Does a startup in Reddit's niche demand HA?


I would think that if you want to create a habit where people check you frequently, avoiding downtime would be important. But IDK.


For something like reddit, who cares about consistency ? Under massive load no user cares about knowing the exact number of votes, all the replies to a comment, etc... It's also perfectly fine to refuse some writes if the load is too high to bear.


Reddit has gone through many periods of regular unplanned downtime as they hit various scale issues over the years. It's certainly something to avoid, but not necessarily something that's catastrophic - like most decisions there are tradeoffs.


you probably would have some CDN in front of it, so some database unavailability would not impact browsing that much.


On a side note, I find it a bit ironic how NoSQL was all the rage back in the last decade or so but in 2022, NoSQL DBs are racing to add SQL querying back to key value stores.

It turns out that SQL is crucial to non-app-developers such as business analysts and data scientists. Trying to set up an ETL to pull data from your MongoDB datastore into a Postgres DB so the analysts could generate reports is such a waste of time and resources. Or worse, the analysts have to request devs to write Javascript code to query the data they need. For this reason alone, I will always start with a DB that supports SQL out of the box.


People didn't leave RDBMS for NoSQL because they wanted to get rid of RDBMS features, they did it because they couldn't scale their applications on single instances anymore. So they had to give up transactions and foreign keys and schemas, etc, because those concepts just didn't exist across instances. Not because they wanted to, but because they had to.

The entire database world has been working ever since trying to get back those RDBMS features into their horizontally scalable solutions, it's just taken a while because the problems are exponentially harder in a network. It's not like people are using eventual consistency and a loss of referential integrity because they prefer it.


>People didn't leave RDBMS for NoSQL because they wanted to get rid of RDBMS features, they did it because they couldn't scale their applications on single instances anymore.

I agree that this should have been the reason for choosing something like MongoDB over MySQL/Postgres.

In reality, I saw a lot of companies switching to it or starting with it because their devs didn't have to learn SQL and could just stay in Javascript/whatever language they were using.


> In reality, I saw a lot of companies switching to it or starting with it because their devs didn't have to learn SQL and could just stay in Javascript/whatever language they were using.

Or because they wanted their data to be schemaless.

Now, a valid reason for wanting data to be schemaless is that you store free form documents, generic cache entries (Redis) or other such things where it really makes no sense to specify the shape of the document/entry, because at any point in time, different entries will have different shapes. Another can be scalability concerns or other reasons for why you consciously choose to have your data fully denormalised.

But invalid reasons, IMHO, include:

- "It's faster." It is, until you get data corruption and have to manually fix data on prod.

- "We don't know the schema yet." That's not "schemaless", because your application will still assume a schema, the schema is now just implicit. If you want to iterate on your schema, you can easily do so using DB migrations; this is actually easier because the DB will help you verify that you migrated your data properly when you changed its schema, as opposed to changing the schema, but still keeping data around that is invalid under the new schema.

- "Not all of our data has a fixed schema." This is now easily solvable in e.g. Postgres with JSONB columns, so you only need to fully specify the data that does have a clearly defined shape.


> Or because they wanted their data to be schemaless.

I once led a team creating a public-facing site based on its own custom forms/wizards. We had half a dozen internal departments wanting dozens of wildly varying forms defined, and decided we didn't want fixed schemas as future field requirements were unknown.

Rather than the corporate C#/SQL Server we went for Node/MongoDB, and we (successfully) ramped up a new system that transformed the power and cost-base of our public sector employer.

We were well chuffed, and it was deemed revolutionary (it also did cloud integration whilst pulling submissions, parsing/verifying, communicating, managing, and pushing into legacy IT systems, so not hyperbole).

And yet, despite the outcome, privately the team regretted the choice from about half-way through. I think for me the clue was when we introduced Mongoose to ... enforce a schema onto our schema-less Mongo collections and help out with the relational querying.

(Edit: I'm not saying schema-less is bad, I'm saying you need to be clear about when it's good.)


Yeah, that's what I was meaning to say. There's a difference between "the schema evolves" and "there is no schema (not even implicitly)". And it can also be "parts of the schema are fixed, others are not".

> I think for me the clue was when we introduced Mongoose to ... enforce a schema onto our schema-less Mongo collections and help out with the relational querying.

I agree that when you're doing things like that you're not really taking advantage of "schemaless", and that's probably a sign that that's not what you really want.


100%. You build something cool and want to get investment or understand how your marketing efforts are going so need to run some analytics. As soon as even the most basic queries require custom code it's obvious you've wasted a load of dev and analyst time.


We seem to be happy using Hevo + BigQuery + Metabase for analytics need. Hevo can pull from many sources.


> So there were no options.

Hbase was a pretty solid NoSQL option by 2010, no?


These days, for a new Web2 type startup, I would use SQLite.

Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Another option worth taking a look at is to use no DB at all. Just writing to files is often the faster way to get going. The filesystem offers so much functionality out of the box. And the ecosystem of tools is marvelous. As far as I know, HN uses no DB and just writes all data to plain text files.

It would actually be pretty interesting to have a look at how HN stores the data:

https://news.ycombinator.com/item?id=32408322


I sometimes see recommendations like this and I just don’t get it. It’s not like setting up a PostgreSQL database is hard at all? And using files? That just sounds like a horrible developer experience and problems waiting to happen. Databases give you so much for free: ACID, transactions, an easy DSL for queries, structure to your data, etc. On top of that every dev knows some SQL.

What am I missing?


So I am also "really people, use PostgreSQL", because you have way less tricks you have to play to get it working compared to SQLite, in serious envs. However, some challenges with Postgres, especially if you have less strict data integrity requirements (reddit-y stuff, for example):

- Postgres backups are not trivial. They aren't hard, but well SQLite is just that one file

- Postgres isn't really built for many-cardinal-DB setups (talking on order of 100s of DBs). What does this mean in practice? If you are setting up a multi-tenant system, you're going to quickly realize that you're paying a decent cost because your data is laid out by insertion order. Combine this with MVCC meaning that no amount of indices will give you a nice COUNT, and suddenly your 1% largest tenants will cause perf issues across the board.

SQLite, meanwhile, one file per customer is not a crazy idea. You'll have to do tricks to "map/reduce" for cross-tenant stuff, of course. But your sharding story is much nicer.

- PSQL upgrades are non-trivial if you don't want downtime. There's logical upgrades, but you gotta be real fucking confident (did you remember to copy over your counters? No? Enjoy your downtime on switchover).

That being said, I think a lot of people see SQLite as this magical thing because of people like fly posting about it, without really grokking that the SQLite stuff that gets posted is either "you actually only have one writer" or "you will report successful writes that won't successfully commit later on". The fundamentals of databases don't stop existing with a bunch of SQLite middleware!

But SQLite at least matches the conceptual model of "I run a program on a file", which is much easier to grok than the client-server based stuff in general. But PSQL has way more straightforward conflict management in general


SQLite backups aren't quite that easy.

Grabbing a copy of the file won't necessarily work: you need at atomic snapshot. You can create those using the backup API or by using VACUUM INTO, but in both cases you'll need enough spare disk space to create a fresh copy of your database.

I'm having great results from Litestream streaming to S3, so that's definitely a good option here.
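For reference, the VACUUM INTO form mentioned above is a one-liner (the target path is just an example):

    -- Sketch: write an atomic snapshot of the live SQLite database to a new file.
    VACUUM INTO '/backups/app-snapshot.db';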


That is a very good point! Obviously a lot of stuff is easier if you just allow yourself downtime, but at one point the file gets big enough to where this matters.


Just use zfs snapshot to backup. Let go of the old ways of dump commands.


"SQLite is just that one file"

How do you get enough time to back up that file when it's in use? Lock the file until you've copied it? zfs?


They actually address this problem.

https://www.sqlite.org/backup.html

> The Online Backup API was created to address these concerns. The online backup API allows the contents of one database to be copied into another database file, replacing any original contents of the target database. The copy operation may be done incrementally, in which case the source database does not need to be locked for the duration of the copy, only for the brief periods of time when it is actually being read from. This allows other database users to continue without excessive delays while a backup of an online database is made.


From the user perspective, then, a SQLite backup doesn't seem materially less complicated than a postgres backup. I understood GP's comment to mean that it's easier to back up SQLite because it's just a file [which you can just copy].


It is just a file that you can copy when nobody is modifying data. In most simple cases you can just shut down the webserver for a minute to do database maintenance. Things that you already know how to do them. You don't have to remember any specific database commands for that. So the real difference is in the entry barrier.


I'm quite confident that if you can snapshot the vm/filesystem (eg VMware snapshot, or zfs snapshot) - you will have a working postgresql backup too...


BTRFS or ZFS sounds just fine to me; you could even create a small virtual disk for it on a host FS (less integrity, and one ought to snapshot the whole system anyway, but hypothetically you could do this just for atomic snapshots as long as the rest of the system is cattle and those snapshots are properly backed up / replicated)


BTRFS recommends disabling CoW for DB files as it destroys performance.


I recommend that all developers should at least once try to write a web app that uses plain files for data storage. I’ve done it, and I very quickly realized why databases exist, and not to take them for granted.


Exactly. I'd written my now popular web app (most popular Turkish social platform to date) as a Delphi CGI using text files as data store in 99 because I wanted to get it running ASAP, and I thought BDE would be quite cumbersome to get running on a remote hosting service. (Its original source that uses text files is at https://github.com/ssg/sozluk-cgi)

I immediately started to have problems as expected, and later migrated to an Access DB. It didn't support concurrent writes, but it was an improvement beyond comprehension. Even "orders of magnitude" doesn't cut it because you get many more luxuries than performance like data-type consistency, ACID compliance, relational integrity, SQL and whatnot.


What made my implementation disastrous turned out to be that there was no enforcement of string encodings, and users started to see all kinds of garbled text.


One of my best decisions was to start out ASCII-only. It helped me manage that mess longer. :)


> write a web app that uses plain files for data storage ... very quickly realized why databases exist, and not to take them for granted

You haven't lived until you've built an in-memory database with background threads persisting to JSON files. Oh, the nightmares.


> What am I missing?

Different requirements, expertise, and priorities. I don't like generic advice too much because there are so many different situations and priorities, but I've used files, SQLite and PostgreSQL according to the priorities:

- One time we needed to store some internal data from an application to compare certain things between batch launches. Yes, we could have translated form the application code to a database, but it was far easier to just dump the structure to a file in JSON, with the file uniquely named for each batch type. We didn't really need transactions, or ACID, or anything like that, and it worked well enough and, importantly, fast enough.

- Another time we had a library that provided some primitives for management of an application, and needed a database to manage instance information. We went for SQLite there, as it was far easier to setup, debug and develop for. Also, far easier to deploy, because for SQLite it's just "create a file when installing this library", while for PostgreSQL it would be far more complicated.

- Another situation, we needed a database to store flow data at high speed, and which needed to be queried by several monitoring/dashboarding tools. There the choice was PostgreSQL, because anything else wouldn't scale.

In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.


Thanks for the reply. All the situations that you describe sound very reasonable. I definitely wasn't trying to say "use PostgreSQL always", I too have used files and SQLite in various situations. For instance, your second example is IMO a canonical example of where SQLite is the right tool and where you make use of its strengths. My comment was more directed at the situations where PostgreSQL seems to be the right tool for the job, but where people still recommend other things.


>In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.

We're responding to this below. The requirement is a Web2 app with "millions in revenue".

>Because it requires no setup and has been used to scale typcial Web2 applications to millions in revenue on a single cheap VM.


> The requirement is a Web2 app with "millions in revenue".

Millions in revenue doesn't really say anything about the performance required.


>Millions in revenue doesn't really say anything about the performance required.

The assumption is that this is a Web2 company which usually means user-generated content. I assumed that there'd be high reads, decent amount of writes.

We're not trying to write a detailed spec here. Just going off a few details and make assumptions.


It's "kick-the-can-down-the-road" engineering. If the server can't hold the database in RAM anymore, they buy more RAM. If the disks are too slow, they buy faster/more disks. If the one server goes down, the site is just down for an hour or two (or a day or two) while they build a new server. It works until it doesn't, and most people are fine with things eventually not working.

This has only been practical within the last 6 years or so. Big servers were just too expensive. We started using cheap commodity low-resource PCs, which are cheaper to buy/scale/replace, but limits what your apps can do. Bare metal went down hard and there wasn't an API to quickly stand up a replacement. NAS/SAN was expensive. Network bandwidth was limited.

The cloud has made everything cheaper and easier and faster and bigger, changing what designs are feasible. You can spin up a new instance with a 6GB database in memory in a virtual machine with network-attached storage and a gigabit pipe for like $20/month. That's crazy.


It's easy to install k8s and curl a helm chart too; doesn't mean you should.

What you're missing is complexity and second order effects.

Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

It's not for everyone. People have varying levels of experience and needs. I'll almost always setup a MySQL db but I'd be lying if I said it never resulted in navel gazing and premature db/application design I didn't need.


>Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

I feel like it's the opposite. Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.


> Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to setup, with regular filesystem backup which is easier to manage and restore.

> Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.

The issue is assuming that "scaling to millions" is a design goal from the start. For a lot of projects, the best-case-scenario does not require it. For another big set of them, they will never get to it. For the few that do, it's possible that the cost of maintaining the more complex solution while it isn't needed plus the cost of adapting over time (because, let's be honest, it's almost impossible to get the scaling solution right from the start) is more than the cost of just implementing the complex solution when it's actually needed.


>Not do that and not have to maintain that? Just a regular AWS instance with SQLite will be far easier to setup, with regular filesystem backup which is easier to manage and restore.

You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

>The issue is assuming that "scaling to millions" is a design goal from the start.

I was responding to the OP who said you can scale to "millions of revenue" on SQLite. This part is true. You could. But it'd be easier to do it on a managed Postgres instance.


> You'd have less maintenance using RDS than you would with an SQLite hosted on a VM.

Considering that the maintenance I need to do for SQLite databases is practically zero...

> I was responding to the OP who said you can scale to "millions of revenue" on SQLite.

Millions of revenue doesn't necessarily need a high performance database. You could have millions on a page that deals with less than one request per second.


>Considering that the maintenance I need to do for SQLite databases is practically zero...

I also do zero maintenance on AWS RDS. Didn't have to set it up either. Just a few clicks, copy and paste connection string, done. No VM to configure. No SSL to configure. No backup to configure.

Two clicks for a read only replica. A few more clicks and I can have multi-node failover. You'd want a failover if you have "millions in revenue" no? How do you plan to set that up with SQLite on a VM?


If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

The question isn't about what you do when you get there; the question is whether you get there, or whether you flounder about deciding between self-managed MySQL, Postgres, RDS, MongoDB, GCP, AWS or Azure.


>If you're at millions of revenue and haven't made the switch to a more robust setup I'd have questions.

Companies with more revenue than "millions" are still running AWS RDS.

https://aws.amazon.com/rds/customers/


I think you misread me or I didn't word it well enough.

I meant switch off SQLite by the time you're running a real business-critical application off your db.


That's a cop-out answer. You're making the claim that simply using the filesystem compared to going with Postgres somehow has less complexity and fewer (or lesser) second order effects, but you don't even indicate how come. So here's my question:

how come?


> simply using the filesystem compared to going with Postgres somehow has less complexity

Postgres is also using the file system, it’s just got some extra layers in-front like networking and authentication.

People often refer to those “extra layers” as complexity.

> how come?

Why is it that more code is more stuff that can go wrong? That’s just how it is.


You are also missing the point and in the same way.

You are at A, you want to go to C. We are debating which B to pick between:

1. a filesystem + implementing library over that to cover the interface between A and C.

2. Postgres (which is the same as above - just not invented here)

Now, both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Nobody uses all of Postgres, but the cost of not using more than 1% of what's available is hardly larger than implementing that 1% yourself for most use cases. And please don't make the mistake of thinking that I'm trying to sell Postgres over any other database product - this is all to do with the worse-is-better idea that so many of you cultivate.


> We are debating which B to pick between:

No we are not.

Postgres isn't a webserver that can handle 1m http GET requests per second; it doesn't distribute data from 30 event collectors to a reporting database without configuration. *Other stuff needs to be written to satisfy the business.* That's life.

> both of you are imagining a world where #2 somehow implies more complexity and also more second order effects than #1 would.

Most studies into project resolution agree that there's zero difference (from a pass/fail perspective) between writing it yourself and trying to reuse third-party applications. Here's one from some quick DDGing: https://standishgroup.com/sample_research_files/CHAOSReport2... - see the resolution of all software products surveyed (page 6).

That means that, if you can, it is still cheaper to build new software and test it than to try to use Postgres and scale it: once software is written, its cost is (effectively) zero, whereas the cost of supporting Postgres is a body -- one I might only be able to use 1% of, one I might need two of so somebody can go on holiday sometime, and one with a real salary that we need to pay.

Now maybe you're working for a company whose software never gets finished. In that case, by all means use Postgres in your application, since you can share the churn on that part with every other person using Postgres in the world, but I'm going kayaking today.


Choosing a DB is not premature optimization; rather, it eliminates a whole host of data-related problems and smooths development.


> Choosing a DB is not premature optimization

That's your opinion.

> it eliminates a whole host of data related problems and smoothens development

I don't find this to be true, but maybe you have different problems than I do.


I built a gallery web app so I could tag images, for my own use.

At first I was writing to a large JSON file. Each time I made a change, I would write the file and close it.

It worked pretty well. I now use sqlite+dataset and the port was trivial.

I guess there are better ways to do it than open(), but using dataset is quite simple.

Maybe creating folders and duplicating data works, but it seems more complex than using sqlite and dataset.


You aren't missing anything.

Cloud SQL postgres on GCP with <50 connections is like ~$8/month and has automated daily backups.

The differences in SQL syntax between SQLite and Postgres, namely around date math (last_ping_at < NOW() - INTERVAL '10 minutes'), make SQLite a non-starter imho... you're going to end up having to rewrite a lot of SQL queries.
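
For example, the same "stale ping" check has to be phrased differently per engine (a rough sketch; the hosts table is made up, and SQLite is assumed to store the timestamp as ISO-8601 text):

  -- Postgres
  SELECT * FROM hosts WHERE last_ping_at < NOW() - INTERVAL '10 minutes';

  -- SQLite has no INTERVAL type; date math goes through datetime()/strftime()
  SELECT * FROM hosts WHERE last_ping_at < datetime('now', '-10 minutes');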


> On top of that every dev knows some SQL.

Where "some" may include "none" ;)


> What am I missing?

only that hn is full, FULL of trolls who give the absolute worst advice possible. it's a sport maybe or it's entirely innocent and related to the Dunning Kruger effect but nothing brings it out more than database threads.


You can get a managed Postgres instance for $15/month these days. Likely cheaper, more secure, faster, better than an SQLite db hosted manually on a VM.

>Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Plenty of setup. How would you secure your VM? SSL configuration? SSL expiration management? Security? How would you scale your VM to millions without downtime? Backups? Restoring backups?

All these problems are solved with managed database hosting.


I recently hopped on the SQLite train myself - I think it's a great default. If your business can possibly use sqlite, you get so much performance for free - performance that translates to a better user experience without the need to build a complicated caching layer.

The argument boils down to how the relative performance costs of databases have changed in the 30 years since MySQL/Postgres were designed - and the surprising observation that a read from a fast SSD is usually faster than a network call to memcached.

[1] https://fly.io/blog/all-in-on-sqlite-litestream/ [2] https://www.usenix.org/publications/loginonline/jurassic-clo...


my brother in Christ, SQLite is not a RDBMS. it runs on the same machine as your application. the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with, but they do not affect SQLite.


> it runs on the same machine as your application.

Better, it runs in the same process.


> the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with

You also aren’t getting that for $15/mo, either.


https://www.digitalocean.com/pricing/managed-databases

$15/month for Postgres managed hosting. I would consider DO as competent professionals.


And are they going to deal with that scary stuff for me for $15/mo? Or are they going to turn it off and on and tell me I need a bigger boat if it’s slow?


What scary stuff are you dealing with?


I love SQLite and use it whenever I can, but if you’re building the next Reddit, you obviously can’t live with the lack of concurrent writes. HN is fine as the write volume is really low, just a couple posts per minute, plus maybe ten times as many votes.


> couple posts per minute

There are hundreds of posts that people are reading at every moment. I don't think it's that low.


You don't have to guess. Just check https://news.ycombinator.com/new and https://news.ycombinator.com/newcomments for everything that's posted here.

Or take a quantitative approach. HN item IDs are sequential. Using the API:

  $ curl 'https://hacker-news.firebaseio.com/v0/maxitem.json'
  32409711
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/32409711.json" | jq .time
  1660125598
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/$(( 32409711 - 14400 )).json" | jq .time
  1660031940
  
so, 14400 items in ~26 hours, fewer than 10 items per minute on average. At the peak you'll see somewhat more than that.


What about one sqlite db per subreddit?


Where would you store the user data? Duplicate it across millions of SQLite databases?


Using files sounds insane. 100% chance you end up with a nasty data race.


Why sqlite and not postgres? every sqlite tutorial or documentation or anything at all that I've seen says not to use this tool for a large database


https://remoteok.com and https://nomadlist.com both use SQLite as the database. 100+ million requests a month and there's been no problems.


For what definition of "large"? SQLite surely has no problem holding gigabytes of data or millions of rows. In many cases it's much faster than most other kinds of databases. So, size is not the problem. Throughput on one machine with one active process on the same VM is not the problem. Having multiple processes on one VM in most cases is no problem.

But if you spread your compute across many machines that access a single database, it can get iffy. If you need the database accessible over the network, there are server extensions which use SQLite as the storage format, but then you're dealing with a server and could probably use any other database without much difference in deployment & maintenance.


This all sounds good until you consider high-availability, which IMO is absolutely essential for any serious business. How do you handle fail-overs when that cheap VM goes down? How do you handle regional replication?

You could cobble something together with rsync, etc, but then you have added a bunch of complexity and built a custom and unproven system. Another option is to use one of the SQLite extensions popping up recently, like Dqlite, but again, this approach is relatively unproven for basing your entire business on.

Or you could simply use an off the shelf DBMS like Postgres or even MySQL. They already solve all of these problems and are as battle-tested as can be.


> high-availability, which IMO is absolutely essential for any serious business

Depends very much on the business. You can have downtime-free deploys on a single node, and as long as you've set up a system to automatically replace a faulty node with a new one (which typically takes < 10min) then a lot of businesses can live with that just fine. It's not like that cheap VM goes down every day, but just in case you can usually schedule a new instance every week or so to reduce the chance of that happening.

> How do you handle regional replication?

For backup purposes, you'd use litestream. Very easy to use with something like systemd or as a docker sidecar.
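
A rough sketch of what that looks like (the paths and bucket name here are made up, not from the comment): Litestream tails the SQLite WAL and continuously replicates it to object storage with one long-running command, which is what you'd wrap in a systemd unit or sidecar:

  $ litestream replicate /var/lib/myapp/app.db s3://my-backup-bucket/app-db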

For performance purposes, if you do need that performance you'd obviously use something else. Depending on the type of service you have, though, you can get very far with a CDN in front of your server.

> Or you could simply use an off the shelf DBMS like Postgres or even MySQL.

If you need it, sure.


TBH, I've used Hypersonic SQL before. People thought it was crazy, but there was no concurrency, and backups were just copying a couple of files. Fixing them if they failed was easy too.

People get too caught up in assumptions without knowing the use case. There are a million use cases where a tool like SQLite would be bad, but also a million where it's likely the easiest and best tool.

Wisdom is knowing the difference.


> These days, for a new Web2 type startup, I would use SQLite.

There's a whole page in the docs explaining why this might be a bad idea for a write-heavy site:

https://www.sqlite.org/howtocorrupt.html

Section 2 is especially interesting. SQLite locks the whole database file - usually one file per database - and that is a major bottleneck.

> Another option worth taking a look at is to use no DB at all.

We tried that in a project I'm currently in as a means of shipping an MVP as soon as possible.

Major headache in terms of performance down the road. We still serve some effectively static stuff as files, but they're cached in memory, otherwise they would slow the whole thing down by an order of magnitude.


If you're doing .Net Core you could also take a look at LiteDB. Use SQL queries in a local (optionally encrypted) file using a single embedded nuget package. Also creates schemas on the fly when you first persist collection types as per traditional NoSQL systems.

I'm using it to create an embeddable out of the box passwordless auth system that has no external dependencies. Works great.

Admittedly I've not tried it under extreme load yet, but I'm happy to migrate it onwards if something I create goes massive.

(Edit: to be clear it is NoSQL, but you can query it using almost standard SQL which is what I do).


Sure, and I would recommend they use C to code the backend; they should not use a CDN either, to save some bucks. And on the front end there is no sense in using JavaScript, as everything can be done in the backend.


Exactly! No separate machine required just for the db. You can also back up the SQLite db asynchronously.


Completely disagree with the suggestion to just write files. The file system APIs are incredibly easy to misuse. A database with good transaction semantics and a schema is worth so much.


Facebook was very similar, I believe. The MySQL schema boiled down to two tables: nodes with an id, a type, and key/value properties; and edges (known internally as assocs) between pairs of ids, also with key/value properties and a type.

A simple schema helps when you just need to write stuff. You can make simple queries against a simple schema, directly.

For the juicier queries what you do is you derive new tables from the data using ETL or map/reduce. The complexity of the underlying schema doesn’t have to reflect the complexity of your more complicated queries. It only needs to be complex enough to let you store data in its base form and make simple queries. Everything else can be derived on top of that, behind the scenes.

Example: nodes are people and edges represent friendships, and then from this you could derive a new table of best_friend(id, bf_id, since).
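
A minimal sketch of that kind of derivation in SQL - the assoc table name, its columns, and the "highest weight wins" criterion are assumptions for illustration, not Facebook's actual schema:

  -- derive best_friend(id, bf_id, since) from a generic edge table
  CREATE TABLE best_friend AS
  SELECT id1 AS id, id2 AS bf_id, created_at AS since
  FROM (
    SELECT id1, id2, created_at,
           ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY weight DESC) AS rn
    FROM assoc
    WHERE type = 'friendship'
  ) ranked
  WHERE rn = 1;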


After 15+ years of being an SRE in finance across firms ranging in size from 2 people to 350K people, I've come to the following conclusions:

1. There are no solutions, only tradeoffs.

Sure. You win in the OP's example with how easy it is to add things, but you lose on the relational aspects.

You could say the same thing in reverse: using an RDBMS makes it much easier to do join-based lookups, but then it's harder to update.

2. At the end of the day, it doesn't matter b/c some other layer will do what your base layer can't

I've seen some really atrocious approaches to storing data in systems that were highly regarded by the end users. How did this happen? Someone went in and wrote an abstraction layer that combined everything needed into an easy to query/access interface. I've seen this happen so many times that whenever people start trying to future proof something I mention this point. You will get the design wrong and/or things will change. That means you either change the base layer or you modify/create an abstraction layer.


Re #1: I agree. But with a pedantic semantic distinction:

There are a lot of objectively wrong ways to do something. But among all the right ways, there are trade offs.


But even if you do something objectively wrong, it can still work out anyway. Look at every Perl script written by non-programmers. Horrifying mess of bullshit, and you try to read the code and think this shouldn't even be parseable, but somehow the script works.

There is no right or wrong, only working and not-working.


If someone has written their own bubble sort to sort a list of numbers and you replace the custom bubble sort with a built-in sort - there really aren't tradeoffs to consider. Your method will be faster, easier to understand, more robust. Their method works, but is strictly inferior.


Actually, speaking as a mediocre programmer who got in to management and now only writes code for fun side projects, there are lots of times when I'd prefer a less-performant function that I understand to a more-performant-but-opaque builtin. What if the sort only handles 100 items but I need to be able to change its internal behavior? I'm a lot better off writing my own slow sort than trying to modify one written by a 10x Rockstar Developer whose code I barely understand.


For the same result, within the same accepted parameters, the 'inferior' method is really identical, with no tradeoffs. By measuring the result rather than the design, you see the true value rather than the theoretical value.

If you only need to move a 60lbs bag of sod across the yard, carrying it yourself, using a wheelbarrow, and using an F-150 Lightning all have exactly the same result and none is inferior. They can all be done in the same time by the same person with no real difference. It's only when the parameters change, like moving 400 bags of sod, that the first two require tradeoffs.


>For the same result

You're assuming that the "result" data structure is only a single boolean, is_task_done. That's not how most things work; GP's bubble sort will pretty damn well show up in locked-up UIs and injured-dog-slow bottlenecks, and that's as observable as the Sun at 9 AM. That would happen pretty fast too - don't delude yourself with "Move Fast And Fix Later" when it comes to quadratic algorithms - 1000 squared is a million, and at 10000 it is a hundred million.

>carrying it yourself, using a wheelbarrow, and using an F-150 Lightning all have exactly the same result and none is inferior

Of course not; the method used would show up in your health and exertion data, your fuel costs, the amount of noise and heat you emit while doing the work, etc... Every single way of observing a boolean is_bag_moved result can also be used to observe a metric measuring those things, and thus determine the method you used.

You're trying to defend an extremely strong claim: if algorithms A and B give the same output in memory, NOTHING else matters. Here's a concrete example; consider the following sequence of algorithms:

  Algorithm 0 : Print("Hello")
  Algorithm 1 : do {Nothing} then Print("Hello")
  Algorithm 2 : 2.times do {Nothing} then Print("Hello")
  ...
  Algorithm N : N.times do {Nothing} then Print("Hello")

This is an infinite sequence of algorithms that all have exactly the same behaviour in a time-and-space-agnostic manner. Would you say every algorithm in this sequence solves the problem of printing "Hello" exactly as well as every other algorithm? If yes... interesting. If no, then your original claim is false as-is; it needs qualifying with what "observable behaviour" means. As this example shows, that's a difficult notion to define exactly; a naive definition of "everything that can be observed in computer memory or on screen" would leave you open to ridiculous conclusions.


This is a thread about reddit and other scaled databases; it's given that the discussion is in 400+ sod bag territory...


When I was a Perl programmer, in our shop we amended the motto "There's more than one way to do it" to "There's more than one way to do it, but most of those ways are wrong."

The Perl community themselves eventually extended the motto to "There's more than one way to do it, but sometimes consistency is not a bad thing either"


You nailed it on the head (and people wonder why there are so many 'bad' laws passed).


In 2004, I started a community college program and had a professor teaching intro to databases. He was retired from industry, had lived through the rise of RDBMSs, and had made his money in that. Teaching was just his hobby. He liked to have fun with the class (120+ of us) by having us do call-and-response things as a group.

Prof: "Why do we normalize?"

Class, in unison: "To make better relations."

Prof: "Why do we denormalize?"

Class, in unison: "Performance."

It took a lot of years of work before the simplicity of the lesson stuck with me. Normalization is good. Performance is good. You trade them off as needed. Reddit wants performance, so they have no strong relations. Your application may not need that, so don't prematurely optimize.


"Performance is good" is I think something everyone could agree with just hearing it for the first time. There's no reason to prefer worse performance if you can have better performance for no additional cost. Is the same true for normalization? Why is just the state of being more normalized inherently better?


Denormalized means it has multiple copies of the same pieces of data. The application is responsible for keeping those copies in sync with each other. This is a messy process that is easy to get wrong now and then. Then your app starts saying things that are absurd. No one cares much if Reddit does that every now and then.


Some levels of normalisation don't allow nulls. This is absurd. Normalisation alone is a theoretical concept which should not be blindly pursued in practice, even without performance requirements.


because a normalized database repeats fewer things and is much easier to understand than a messy one.

oh, and also, the entire point is that neither is "inherently better"... both have pros and cons


I think my point was "performance" in the sense of faster response time is inherently good. All things being equal, if you have a more performant option you'll choose that. The only time you accept worse performance is in pursuit of some other metric.

On the flip side, "normalization" doesn't have that same inherent good-ness. All things being equal (including performance) there isn't any inherent drive toward more normalization, maybe because a faster performing page will have clear impact to the user while a more normalized data structure would be completely transparent to the user?


sure, from the perspective of the final user it doesn't matter much at all.

with my point about normalization being easier to understand I meant to say that it is easier for a developer learning about the schema.

if I recall correctly, I think there's also something about data de-duplication in a normalized database. Without normalization, you could get a lot of repetition, which would actually require you to update the same data in multiple places or find some other way to deal with conflicting data about the same entity (some sort of de-synchronization is more likely without normalized schemas).

in any case, I'll grant you that normalization should never be 'the' goal for a database.


Normalized schemas are less prone to certain kinds of hard-to-solve issues.

Example: When you have the same data in two places, which is authoritative? If they disagree, which one is right?

And querying data reliably is much easier when you've got good relations that can be relied on to be accurate. SQL is almost a masterpiece (though imho it should have been FROM, WHERE, SELECT).


> don't prematurely optimize

Premature optimization is the root of all evil


Careful with repeating mantras without reflection.

You may end up using them as arguments when they're just catchphrases that require context.

See people who used "goto considered harmful" as an absolute to never use goto (C).


“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”


How dare you bring full quotes and context to a discussion :)

Seriously true, though. What's usually missed is the first bit, about "small efficiencies", combined with the last bit, the "critical 3%". He's basically saying don't spend time in the mass of weeds until you know where the problems really are, but conversely take the time to optimise the smaller areas that are performance-critical.

Almost the opposite of the carte-blanche overly-minimal MVPs often thrown together by those of us who forget what the V stands for.


So is premature pessimization.


Tuple table performance, though, tends to degrade quickly when querying separate rows for every little attribute.

More _memory_ and row caching in RAM have allowed for a resurgence in denormalized schemas.

FB's underlying schema is similarly just a few tables. But the entire dataset is memcached.
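
To make the cost of per-attribute rows concrete, here's a hedged sketch (table and column names are illustrative, not anyone's real schema): rebuilding a single entity from a key/value table takes one self-join per attribute, each with its own index lookup:

  SELECT t.id,
         d_name.value  AS name,
         d_email.value AS email,
         d_karma.value AS karma
  FROM thing t
  JOIN data d_name  ON d_name.thing_id  = t.id AND d_name.key  = 'name'
  JOIN data d_email ON d_email.thing_id = t.id AND d_email.key = 'email'
  JOIN data d_karma ON d_karma.thing_id = t.id AND d_karma.key = 'karma'
  WHERE t.type = 'user';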


> Tuple table performance, though, tends to degrade quickly when querying separate rows for every little attribute.

Interesting. Do you know what causes this steep decline in performance?

A larger sequential read is typically not that much slower regardless of what sort of drive you are using. I'd have expected the random access pattern of a normalized table to lose out.



Thanks! Macroexpanded:

Reddit's database has only two tables - https://news.ycombinator.com/item?id=4468265 - Sept 2012 (79 comments)


We’re doing something similar at Codeball.

A single Aurora RDS database (with read replicas), with a single table called “blobs”. The blobs table has three columns: id, type, and json.

It’s scaling really really well, to many many millions of rows, and high traffic volumes (many thousands of writes per second).


Why even use RDS? There is nothing relational about that schema. Just use DynamoDB. It'll be cheaper, you can actually query the JSON, and if you decide to flesh out the documents and go all in on NoSQL then you can do that easily.


It's PostgreSQL, so we can and do query the JSON data, and update deeply nested fields in the JSON natively as well (no need to take a lock, read the entire blob, modify it in the application, save the data, and release the lock).

PostgreSQL supports setting values, appending to arrays, etc etc natively in an ACID compliant way.
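
For instance, something along these lines (a sketch assuming the blobs(id, type, json) table above with json as a jsonb column; the paths and values are made up):

  -- set a nested field in place
  UPDATE blobs
  SET json = jsonb_set(json, '{settings,theme}', '"dark"')
  WHERE id = 42;

  -- append an element to a nested array
  UPDATE blobs
  SET json = jsonb_set(json, '{tags}', (json->'tags') || '"reviewed"'::jsonb)
  WHERE id = 42;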


> so we can and do query the JSON data

Also supported in DynamoDB - it's a JSON-database after all.

> update deepely nested fields in the JSON natively as well

https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

> PostgreSQL supports setting values, appending to arrays, etc etc natively in an ACID compliant way.

https://aws.amazon.com/blogs/aws/new-amazon-dynamodb-transac...

I'm not saying PostgreSQL is a bad choice by any means, or that DynamoDB is suitable for all projects. But good support for JSON is in no way a capability unique to PostgreSQL. At least ideally, a competently built document database should be at least as good as, if not better than, an SQL database emulating a document database.


DynamoDB doesn't let you store JSON natively. You have to translate that JSON into DynamoDB keys and attributes.


> Just use DynamoDB

Or, at that point, just use a CSV and slurp the whole thing into memory.


Here's where I've never been smart enough to use this model... explain to me how you deal with scenarios like these...

1) There's an incident and we are trying to replicate the issue, but our database is just shapeless blobs and we can't track down the data that is pertinent to resolving the issue.

2) Product management wants to better understand our users and what they are doing so that they can design a new feature. But they can't figure out how to query our database or how to make sense of the unstructured data.

3) QA wants to be able to build tests and set up test data sets that match those tests, but rather than an SQL script, they've got to roll all this JSON that doesn't have any constraints.

These are just some examples, but maybe you get the gist of why I don't quite understand how this model scales "outside" the database.


1) I think the biggest difference in how you use KV/NoSQL databases is that your objects tend to be much bigger than a "normal" SQL row. Each object can be a parent entity plus many sub-objects and sub-arrays of information. In a relational model, that data would have to go in other tables that you'd have to join, etc. I believe this model makes things much easier for us, as everything you need to debug a single failure is usually stored in a single object instead of spread out across the entire database.

2) Very true. We're using other tools (like PostHog) for collecting and analysing user behaviour. At a bigger scale, we'd probably ship off event data to something like BigQuery for analysis.

3) We don't have QA, but for testing, objects are created in code just like any other object, and unit tests (aka most of the tests), don't need the database at all. :-)


Why not just use DynamoDB at that point? It would likely be cheaper.


Yea, I agree. Seems like it should have been a NoSQL solution from the start.

Even today, it should be pretty straightforward to port the data over to a NoSQL db.


DynamoDB can be very painful with certain usage patterns or if you need to change/add usage patterns. For startups or MVP apps/features, it's a lot easier and cheaper to write ad-hoc SQL.


Sorry but I do not believe in this magic.

How many queries, in reality, do you need to fetch all the "page" data?


One! PostgreSQL has great support for indexing JSON data.
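
As a sketch (again assuming a blobs table with a jsonb column named json; the key is made up):

  -- GIN index for containment queries, e.g. WHERE json @> '{"owner_id": "123"}'
  CREATE INDEX blobs_json_gin ON blobs USING GIN (json);

  -- expression index for equality lookups on one hot key
  CREATE INDEX blobs_owner_idx ON blobs ((json->>'owner_id'));
  SELECT * FROM blobs WHERE json->>'owner_id' = '123';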


Looks like you built a Hello World website? Sorry if I am wrong.


Sounds like an entity-attribute-value type schema, but where the value is nested data.


This reminds me of when the /r/counting subreddit broke reddit:

https://www.reddit.com/r/counting/comments/ww3vr/i_am_the_be...

Their data model was not designed for very deep comment chains.


I wonder why they need to compute the complete comment tree to begin with


How has this evolved to today? Any growing pains or learning lessons? What stayed good about this as the team and product scaled?


This design is a fault of traditional RDBMS products that have strictly imperative and synchronous DDL commands.

The physical schema ought not be a transactional change, but it is in most products, with locks and everything. This isn't necessary, it's just an accident of history that nobody has gotten around to changing.

For example:

Rearranging the column order (or inserting a column) should just be an instant metadata update. Instead the entire table is rewritten... synchronously.

Adding NOT NULL columns similarly requires the value to be populated for every row at the point in time when the DDL command is run. Instead, the default function should be added to the metadata, and then the change should roll out incrementally over time.

Etc, etc...

Whatever it is that programmers do with simple "objectid-columnname-value" schemas to make changes "instant" is what RDBMS platforms could also do, but don't.


> Adding NOT NULL columns similarly requires the value to be populated for every row at the point in time when the DDL command is run. Instead, the default function should be added to the metadata, and then the change should roll out incrementally over time.

You mean just like MySQL, Oracle, Sybase, and MSSQL do?


"Amazon DynamoDB Deep Dive" https://www.youtube.com/watch?app=desktop&v=HaEPXoXVf2k&list...

why and how of the single table design approach


That's exactly what I thought of when reading the title.


I used to work at a company that had dozens of SQL tables for similar data - if there was a selection box in their app there was a table to represent it. I had a very productive meeting with the young (CSLA-trained) programmers on my way out, saying that we could have a single table for all lookup values. The schema went from a giant, whiteboard-covering diagram to a neat 8.5x11 single sheet diagram, resulting in a lot of reused code and data. Maybe there's more than Things and Data in real life, but there is a pretty good medium between "new table for each thing" and "each thing is just another thing row"
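
A rough sketch of that consolidation, with invented names - one table replaces the per-selection-box tables:

  CREATE TABLE lookup_value (
    id       INT PRIMARY KEY,
    category VARCHAR(50)  NOT NULL,  -- e.g. 'order_status', 'country', 'unit'
    code     VARCHAR(50)  NOT NULL,
    label    VARCHAR(200) NOT NULL,
    UNIQUE (category, code)
  );

  -- every dropdown in the app becomes one parameterized query
  SELECT code, label FROM lookup_value WHERE category = 'order_status';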


Yes, every production system could be built using two databases: one for metadata, one for dynamic data. Or two tables: one for metadata, one for dynamic data. What's strange here?


> What's strange here ?

strange, or completely atypical for 99% of web apps?

Not sure what web apps you've worked on, but the rest of the world doesn't default to anything like this. Even if it does make sense for the use case, even Reddit only evolved to this much later.


Furthermore, you can have several metadata tables that represent different types of graphs of the data tables, using all the relational features.

For example, one can implement role-based ACLs on a tree structure in metadata tables, with one data table. Then SQL can be used to get permissions, and even join on the data table, if that works for you.


Thank you so much for sharing this; I was exhausted from trying to make my data schemas perfect for the long term. I will use this technique for my next projects for sure. Thanks again.


How does this really work if a developer wants to pull the most recent discussions posted in X category, with the most replies, when I only have two tables?


something like this?

  SELECT TOP 50 *
  FROM [things] [post]
  JOIN [things] [category] ON [category].[id] = [post].[parent_id]
  WHERE [post].[type] = 'post'
    AND [category].[type] = 'category'
    AND [category].[slug] = 'animal-pictures'
  ORDER BY ([post].[child_count] / DATEDIFF(hh, [post].[created_date], GETUTCDATE())) DESC;


Makes sense. Rather than have an ever-expanding user table with multiple columns, save attributes as rows. Example: the thing table holds users 1, 2, 3; the data table holds (user 1, attribute "name", value "mike") and (user 1, attribute "address", value "Chicago").
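
A minimal sketch of that shape (column names assumed for illustration, not Reddit's actual DDL; there's deliberately no foreign key, matching the pattern described in the article):

  CREATE TABLE thing (
    id   BIGINT PRIMARY KEY,
    type VARCHAR(50)          -- 'user', 'link', 'comment', ...
  );

  CREATE TABLE data (
    thing_id BIGINT,          -- no FK constraint in this pattern
    key      VARCHAR(100),
    value    TEXT
  );

  -- user 1's attributes live as rows, not columns
  INSERT INTO data (thing_id, key, value) VALUES
    (1, 'name', 'mike'),
    (1, 'address', 'Chicago');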


The Thing model is almost just Third Normal Form, but where table is part of the key.


Inside of you are two tables. One is for things and the other is for data.


One is evil and one is good. Which will win? The one you feed.


Instructions unclear: I now have a million things and no data.


I've often seen a pattern where the database schema devolves into a "classifier" system, although perhaps a combination of EAV (entity attribute value) and OTLT (one true lookup table) would be a more apt descriptor.

Basically, you have tables a bit like these:

  - classifier_groups (e.g. product_classifiers, basket_classifiers)
  - classifiers (e.g. product_types, discount_types, delivery_methods, payment_methods)
  - classifier_values (e.g. regular_product, bid_product, timed_discount, loyalty_discount, postal_delivery, courier_delivery, credit_card, paypal)
  - classifier_value_attributes (e.g. product_name, product_weight, product_size, bid_start_date, bid_start_value, discount_start_date, discount_percentage, ...)
  - classifier_value_attribute_values (e.g. "Some product name", "0.25 kg", "2022.08.10 09:00", 5 EUR)

Though sometimes the approach folds in on itself and the attributes are classifiers/classifier_values themselves, or JSON is used, or more tables in between are used when you need to track where a particular usage is needed (e.g. when you define a template and then have an item that is based on the template but with changed data).

So the end result is that instead of 200 tables you have just a few, but the problem is that you can't really figure out what is happening from foreign keys alone (especially when you try to go the other way - referring from these classifiers to another table - since you clearly can't have a foreign key there and instead do the EAV thing, with columns like "link_type"/"link_table_name" and "referenced_id"). Oftentimes you can't really use the DB on its own, because the enums and other data you need in order to write queries are actually stored in the application code.
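
The "soft reference" bit looks roughly like this (a sketch with invented names), which is why the database can't enforce it with a real foreign key:

  CREATE TABLE classifier_value_attribute_values (
    id              BIGINT PRIMARY KEY,
    attribute_id    BIGINT,        -- points at classifier_value_attributes
    link_table_name VARCHAR(100),  -- e.g. 'products': which table is referenced
    referenced_id   BIGINT,        -- row id in that table; nothing enforces it
    value           TEXT
  );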

Regardless of what you call it, opinions on the approach, or at least its components, seem split, even back in the day:

1) https://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-bi...

2) (PDF) https://novicksoftware.com/wp-content/uploads/2016/09/Entity...

3) https://softwareengineering.stackexchange.com/questions/9312...

I've seen some people swear by it, though personally I really dislike it: any DB schema visualization tool just turns up a black box, and a dozen JOINs across these classifier tables will always be less clear than just referencing tables like products, bid_products, timed_discounts, loyalty_discounts, etc. So personally, I'd much prefer the 200 tables with foreign keys between them wherever necessary, with really dumb model bindings in the app, instead of a bunch of enums, complex service logic, and inevitable orphaned data. And yet, somehow people fear the idea of hundreds of tables but would prefer to shove this functionality into bunches of data instead.

(disclaimer: domain is made up so perhaps not the best example, also a proper term for such a pattern escapes my mind at the moment)


> There are no joins in the database and you must manually enforce consistency

In other words, you've manually developed a relational database on top of a relational database that you're using as a file system.


It's a relational database management system on a database management system. An RDBMSDBMS.


Indeed, I do wonder why the industry still defaults to SQL/relational instead of mature NoSQL options (there ain't just Mongo).


Because schemas let you agree on the shape of the data without having to pull in applications to query things for reporting, analysis, etc.

Additionally, forcing things into a correct schema also protects you from a lot of data inconsistencies out of the box.

Finally, the DBMS is going to be far better at JOINs than your application will be. Nearly every web application needs JOINs at some point. Developers just don't recognize that that's what they are doing, so they don't realize what they are missing out on by doing it at the application layer.


"At some point", but how much? I believe it is up to modeling skills

Anyway - Schema slows development as hell, you need to maintain mapper layer between domain models and db models unless you model your system with anemic model mixed with tech stuff(db), but it is mess

Additionally you deal with either stringly typed sql

Or ORMs that require months of practice to be proficient even in c# which has LINQ


Besides all the benefits of relational DBs, I'll give you my main reason for not starting with a NoSQL database:

Because eventually, you'd want to analyze that data and it's a huge pain to do it with NoSQL. Imagine if you hired a business analyst and she needs to look at the data. Are you going to expect her to learn Javascript to query the data? Are you going to write an ETL pipeline to move data from a NoSQL db to an SQL db?

Note: I know many NoSQL DBs now support SQL queries.


Personally it's a lack of trust.

I've set up databases, I've broken databases, I've restored databases, I've scaled databases, and I've migrated them. Never lost so much as a nibble of data.

My introduction to NoSQL was an article about Mongo losing someone's production data. Not even all of the data - partial data loss, the worst kind of data loss.

I'm just gonna head off that massive headache at day 0 and use an ACID compliant database.

From there it's lack of experience. Same reason I wouldn't decide to write my next thing in a new language unless I was specifically learning that language.


MongoDB supports a variety of ACID transactions in the current day and age:

https://www.mongodb.com/basics/acid-transactions


But as I previously said, NoSQL is not limited to Mongo.

People hate NoSQL because of Mongo when there are other dbs, e.g. RavenDB.

>I'm just gonna head off that massive headache at day 0 and use an ACID compliant database.

It is just ACID vs BASE


I don't hate nosql. I nothing it. I do not have any knowledge or skillset that includes it. I am most definitely naive towards any strengths and capabilities it may have over an SQL database.

I also have no reason to learn it at this time; an SQL database wins uncontested whenever I need to store data.

If I was to start a new project I wouldn't pick that moment to learn a new programming language. Same thing, different part of the stack

Honestly, what it'd take would be the next WordPress or whatever using NoSQL by default. Make it something that needs learning.


Do you have a reference to that article?


Honestly we're talking at least a decade ago or so at this point, sorry.

It was a first impressions thing that stopped any further investigation, not a comment on the current state of NoSQL


Imma guess

Jepsen


Because a relational database give you schemas, constraints and a query language out of the box.



