An Unlikely Database Migration (tailscale.com)
364 points by ifcologne 9 days ago | 178 comments





Interesting choice of technology, but you didn't completely convince me as to why this is better than just using SQLite or PostgreSQL with a lagging replica. (You could probably start with either one and easily migrate to the other one if needed.)

In particular you've designed a very complicated system: Operationally you need an etcd cluster and a tailetc cluster. Code-wise you now have to maintain your own transaction-aware caching layer on top of etcd (https://github.com/tailscale/tailetc/blob/main/tailetc.go). That's quite a brave task considering how many databases fail at Jepsen. Have you tried running Jepsen tests on tailetc yourself? You also mentioned a secondary index system which I assume is built on top of tailetc again? How does that interact with tailetc?

Considering that high-availability was not a requirement and that the main problem with the previous solution was performance ("writes went from nearly a second (sometimes worse!) to milliseconds") it looks like a simple server with SQLite + some indexes could have gotten you quite far.

We don't really get the full overview from a short blog post like this though so maybe it turns out to be a great solution for you. The code quality itself looks great and it seems that you have thought about all of the hard problems.


> and a tailetc cluster

What do you mean by this part? tailetc is a library used by the client of etcd.

Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config. (I previously made LiveJournal and ran its massively sharded HA MySQL setup)


Neat. This is very similar to [0], which is _not_ a cache but rather a complete mirror of an Etcd keyspace. It does Key/Value decoding up front, into a user-defined & validated runtime type, and promises to never mutate an existing instance (instead decoding into a new instance upon revision change).

The typical workflow is to do all of your "reads" out of the keyspace, attempt to apply Etcd transactions, and (if needed) block until your keyspace has caught up such that you read your write -- or someone else's conflicting write.

[0] https://pkg.go.dev/go.gazette.dev/core/keyspace
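The read-your-write workflow described above can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the API of the linked keyspace package or of tailetc: `Mirror`, `Apply`, and `WaitForRevision` are hypothetical names, and the watch loop that would feed `Apply` from etcd is not shown.

```go
package main

import (
	"fmt"
	"sync"
)

// Mirror is a hypothetical in-memory copy of a key/value keyspace.
// revision stands in for the store's monotonically increasing revision.
type Mirror struct {
	mu       sync.Mutex
	cond     *sync.Cond
	revision int64
	data     map[string]string
}

func NewMirror() *Mirror {
	m := &Mirror{data: make(map[string]string)}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// Apply would be called by a watch loop (not shown) for each observed update.
func (m *Mirror) Apply(rev int64, key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if rev <= m.revision {
		return // stale or duplicate event
	}
	m.data[key] = value
	m.revision = rev
	m.cond.Broadcast() // wake anyone waiting for this revision
}

// WaitForRevision blocks until the mirror has caught up to rev.
func (m *Mirror) WaitForRevision(rev int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for m.revision < rev {
		m.cond.Wait()
	}
}

// Get reads from the local mirror only; it never touches the network.
func (m *Mirror) Get(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	return v, ok
}

func main() {
	m := NewMirror()
	// Pretend a transaction committed at revision 7 and the watch
	// delivers it a moment later.
	go m.Apply(7, "node/1", "online")
	m.WaitForRevision(7)
	v, _ := m.Get("node/1")
	fmt.Println(v)
}
```

After a transaction commits, the writer waits for the revision the store reported, so the next local read reflects either its own write or a newer conflicting one.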


Drat! I went looking for people doing something similar when I sat down to design our client, but did not find your package. That's a real pity, I would love to have collaborated on this.

I guess Go package discovery remains an unsolved problem.


> I guess Go package discovery remains an unsolved problem.

Or did you just not really search, like most of us excited to DIY? :-D

Godoc is pretty good, the package shows up for the searches I'd probably do in a similar situation.

https://godoc.org/?q=etcd

https://godoc.org/?q=etcd+watch


Funny, the package in question exists because _I_ thought I could do better and wanted to DIY.

You must not have seen https://godoc.org/?q=tailetc

Well, we literally named and pushed that publicly minutes before today's blog post.

thatsthejoke.gif :P

Whoa, we hadn't seen that! At first glance it indeed appears to be perhaps exactly identical to what we did.

Slightly different trade-offs. This package is emphatically just "for" Etcd, choosing to directly expose MVCC types & concepts from the client.

It also doesn't wrap transactions -- you use the etcd client directly for that.

The Nagle delay it implements helps quite a bit with scaling, though, while keeping the benefits of a tightly packed sorted keyspace. And you can directly access / walk decoded state without copies.


I wish pkg.go.dev had a sign-in and an option to star/watch a package. I do this with GitHub repos I should revisit. Would have been handy for pkg.go.dev :) Yes, I know: nobody wants yet another login.

> What do you mean by this part? tailetc is a library used by the client of etcd.

Oh. Since they have a full cache of the database I thought it was intended to be used as a separate set of servers layered in front of etcd to lessen the read load. But you're actually using it directly? Interesting. What's the impact on memory usage and scalability? Are you not worried that this will not scale over time since all clients need to have all the data?


Well, we have exactly 1 client (our 1 control server process).

So architecturally it's:

3 or 5 etcd (forget what we last deployed) <--> 1 control process <--> every Tailscale client in the world

The "Future" section is about bumping "1 control process" to "N control processes" where N will be like 2 or max 5 perhaps.

The memory overhead isn't bad, as the "database" isn't big. Modern computers have tons of RAM.


You're able to serve all your clients from a single control process? And this would probably work for quite a while? Then I struggle to see why you couldn't just use SQLite. On startup read the full database into memory. Serve reads straight from memory. Writes go to SQLite first and if it succeeds then you update the data in memory. What am I missing here?
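The load-everything, write-through design proposed in this comment can be sketched as follows. It is only an illustration: `Store` is a hypothetical interface standing in for SQLite, and `memStore` is a stand-in backend so the example has no external dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a hypothetical durable backend (SQLite in the comment above).
type Store interface {
	LoadAll() (map[string]string, error)
	Put(key, value string) error
}

// memStore is a stand-in backend that keeps this sketch dependency-free.
type memStore struct {
	rows map[string]string
	fail bool // when true, Put simulates a failed disk write
}

func (s *memStore) LoadAll() (map[string]string, error) {
	out := make(map[string]string, len(s.rows))
	for k, v := range s.rows {
		out[k] = v
	}
	return out, nil
}

func (s *memStore) Put(key, value string) error {
	if s.fail {
		return errors.New("disk write failed")
	}
	s.rows[key] = value
	return nil
}

// Cache serves reads from memory and writes through to the store.
type Cache struct {
	mu    sync.RWMutex
	data  map[string]string
	store Store
}

// Open reads the full database into memory once, at startup.
func Open(store Store) (*Cache, error) {
	data, err := store.LoadAll()
	if err != nil {
		return nil, err
	}
	return &Cache{data: data, store: store}, nil
}

func (c *Cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// Put writes to the durable store first; memory changes only on success.
func (c *Cache) Put(key, value string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := c.store.Put(key, value); err != nil {
		return err
	}
	c.data[key] = value
	return nil
}

func main() {
	c, err := Open(&memStore{rows: map[string]string{"a": "1"}})
	if err != nil {
		panic(err)
	}
	if err := c.Put("b", "2"); err != nil {
		panic(err)
	}
	v, _ := c.Get("b")
	fmt.Println(v)
}
```

Holding the write lock across the store call keeps memory and disk in the same order, at the cost of blocking readers during each write.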

We could use SQLite. (I love SQLite and have written about it before!) The goal is N control processes not for scale, but for more flexibility with deployment, canarying, etc.

That makes sense. Thanks for answering all of my critical questions. Looks like a very nice piece of technology you’re building!

I'm curious what drove the decision to move to an external store (and multinode HA config at that) now compared to using a local Go KV store like Badger or Pebble?

Given that the goals seem to be improving performance over serializing a set of maps to disk as JSON on every change and keeping complexity down for fast and simple testing, a KV library would seem to accomplish both with less effort, without introducing dependence on an external service, and would enable the DB to grow out of memory if needed. Do you envision going to 2+ control processes that soon?

Any consideration given to running the KV store inside the control processes themselves (either by embedding something like an etcd or by integrating a raft library and a KV store to reinvent that wheel) since you are replicating the entire DB into the client anyway?

Meanwhile I'm working with application-sharded PG clusters with in-client caches with coherence maintained through Redis pubsub, so who am I to question the complexity of this setup haha.


Yes, we're going to be moving to 2+ control servers for HA + blue/green reasons pretty soon here.

> Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.

What if you used one of the managed RDBMS services offered by the big cloud providers? BTW, if you don't mind sharing, where are you hosting the control plane?


> What if you used one of the managed RDBMS services offered by the big cloud providers?

We could (and likely would, despite the costs) but that doesn't address our testing requirements.

The control plane is on AWS.

We use 4 or 5 different cloud providers (Tailscale makes that much easier) but the most important bit is on AWS.


Why is testing Postgres/MySQL difficult? You can easily run a server locally (or on CI) and create new databases for test runs, etc.

It's not difficult. We've done it before and have code for it. See the article.

> ...but the most important bit is on AWS.

Curious: Was running DynamoDB with DAX (DynamoDB Accelerator) in front ever in contention? If not, is it due to vendor lock-in (for example, not being able to migrate out) or because tailscale doesn't feel the need to use managed offerings especially for core infrastructure?


> > Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.

> What if you used one of the managed RDBMS services offered by the big cloud providers?

Yeah, AWS RDS "multi-AZ" does a good job of taking care of HA for you. (Google Cloud SQL's HA setup is extremely similar.) But you still get 1-2 minutes of full unavailability when hardware fails.

I haven't operated etcd in production myself, but I assume it does better because it's designed specifically for HA. You can't even run less than three nodes. (The etcd docs talk about election timeouts on the order of 1s, which is encouraging.)

For many use cases, 1-2 minutes of downtime is tolerable. But I can imagine situations where availability is paramount and you're willing to give up scale/performance/features to gain another 9.


If you want a distributed key/value data store, you want to use what's already out there and vetted. It used to be ZooKeeper, but etcd is much simpler, it's what Kubernetes uses, and it has been a big success and proved itself out there in the field. Definitely easier than a full SQL database, which is overkill and much harder to replicate, especially if you want to have a cluster of >= 3. Again, the key is "distributed", and that immediately rules out SQLite.

It's overkill until it's not. We chose etcd initially, but after a while we started wanting to ask questions about our data that weren't necessarily organised in the same way as the key/value hierarchy. That just moved all the processing to the client, and now I just wish we had used a SQL database from the beginning.

Yeah, but for their use case it's just KV and also ability to link directly in go

This is about spot on. I do get the part about testability, but with a simple Key/Value use case like this, BoltDB or Pebble might have fit extremely well into the Native Golang paradigm as a backing store for the in-memory maps while not needing nearly as much custom code.

Plus maybe replacing the sync.Mutex with RWMutexes for optimum read performance in a seldom-write use case.

On the other hand, I feel a bit weird criticizing Brad Fitzpatrick ;-) so there might be other things at play I don't have a clue about...


I was initially baffled by the choice of technology too. Part of it is that etcd is apparently much faster at handling writes, and offers more flexibility with regards to consistency, than I remember. Part of it might be that I don't understand the durability guarantees they're after, the gotchas they can avoid (e.g. transactions), or their overall architecture.

This post illustrates the difference between persistence and a database.

If you are expecting to simply persist one instance of one application's state across different runs and failures, a database can be frustrating.

But if you want to manage your data across different versions of an app, different apps accessing the same data, or concurrent access, then a database will save you a lot of headaches.

The trick is knowing which one you want. Persistence is tempting, so a lot of people fool themselves into going that direction, and it can be pretty painful.

I like to say that rollback is the killer feature of SQL. A single request fails (e.g. unique violation), and the overall app keeps going, handling other requests. Your application code can be pretty bad, and you can still have a good service. That's why PHP was awesome despite being bad -- SQL made it good (except for all the security pitfalls of PHP, which the DB couldn't help with).


I'd say the universal query capability is the killer feature of SQL.

In the OP they spent two weeks designing and implementing transaction-safe indexes -- something that all major SQL RDBMS (and even many NoSQL solutions) have out of the box.


Maybe also part of the success of Rails? Similarly an easy to engineer veneer atop a database.

This comment helped me understand the problem and the solution better (along with a few followup tweets by the tailscale engineers). Thanks.

> (Attempts to avoid this with ORMs usually replace an annoying amount of typing with an annoying amount of magic and loss of efficiency.)

Loss of efficiency? Come on, you were using a file before! :-)

Article makes me glad I'm using Django. Just set up a managed Postgres instance in AWS and be done with it. Sqlite for testing locally. Just works and very little engineering time spent on persistent storage.

Note: I do realize Brad is a very, very good engineer.


Efficiency can be measured in many different ways.

Having no dedicated database server or even database instance, being able to persist data to disk with almost no additional memory required, marginal amount of CPU and no heavy application dependencies can be considered very efficient depending on context.

Of course, if you start doing this on every change, many times a second, then it stops being efficient. But then there are ways to fix it without invoking Oracle or MongoDB or other beast.

When I worked on algorithmic trading framework the persistence was just two pointers in memory pointing to end of persisted and end of published region. Occasionally those pointers would be sent over to a dedicated CPU core that would be actually the only core talking to the operating system, and it would just append that data to a file and publish completion so that the other core can update the pointers.

The application would never read the file (the latency even to SSD is such that it could just as well be on the Moon) and the file was used to be able to retrace trading session and to bring up the application from event log in case it failed mid session.

As the data was nicely placed in order in the file, the entire process of reading that "database" would take no more than 1.5s, after which the application would be ready to synchronize with the trading session again.


>Article makes me glad I'm using Django.

This was my main thought throughout reading it. So many things to consider and difficult issues to solve; it seems they face a self-made database hell. Makes me appreciate the simplicity and stable performance of Django ORM + Postgres.


I am missing a lot of context from this post because this just sounds nonsensical.

First they're conflating storage with transport. SQL databases are a storage and query system. They're intended to be slow, but efficient, like a bodybuilder. You don't ask a bodybuilder to run the 500m dash.

Second, they had a 150MB dataset, and they moved to... a distributed decentralized key-value store? They went from the simplest thing imaginable to the most complicated thing imaginable. I guess SQL is just complex in a direct way, and etcd is complex in an indirect way. But the end results of both are drastically different. And doesn't etcd have a whole lot of functional limitations SQL databases don't? Not to mention its dependence on gRPC makes it a PITA to work with REST APIs. Consul has a much better general-purpose design, imo.

And more of it doesn't make sense. Is this a backend component? Client side, server side? Why was it using JSON if resources mattered (you coulda saved like 20% of that 150MB with something less bloated). Why a single process? Why global locks? Like, I really don't understand the implementation at all. It seems like they threw away a common-sense solution to make a weird toy.


I'd answer questions but I'm not sure where to start.

I think we're pretty well aware of the pros and cons of all the options and between the team members designing this we have pretty good experience with all of them. But it's entirely possible we didn't communicate the design constraints well enough. (More: https://news.ycombinator.com/item?id=25769320)

Our data's tiny. We don't want to do anything to access it. It's nice just having it in memory always.

Architecturally, see https://news.ycombinator.com/item?id=25768146

JSON vs compressed JSON isn't the point: see https://news.ycombinator.com/item?id=25768771 and my reply to it.


You say you want to have a small database just for each control process deployment to be independent. But you need multiple nodes for etcd... So you currently have either a shared database for all control processes, or 3 nodes per control process, or 3 processes per control node, etc. Either way it seems weird.

I get that SQLite wouldn't work, but it also doesn't make sense to have one completely independent database per process. So I imagine you're using a shared database, at which point etcd starts to make more sense. It's just not that widely understood in production as sql databases, and has limitations which you might reach in a few years.


> It's just not that widely understood in production as sql databases, and has limitations which you might reach in a few years.

Reaching limitations in a few years and biting that bullet makes the difference between a successful startup that knows when and where to spend time innovating or a startup that spends all their time optimizing for that 1 million simultaneous requests / sec.


It's not about optimizing for scale, it's about optimizing for velocity. I don't care if I can only get to 1K RPS. I care if my team and product can work quickly. You cannot work quickly later if you slap something together now and later realize, oh shit, we have to stop pushing features for a month so we can completely rebuild the backend and everything that depends on it.

It's the devil you know versus the devil you don't. SQL is a very well understood devil, so your plans around it will be reliable. I would argue that being able to accurately estimate future work is the most valuable business asset.


The post touches upon it, but I didn't really understand the point. Why doesn't synchronous replication in Postgres work for this use case? With synchronous replication you have a primary and secondary. Your queries go to the primary and the secondary is guaranteed to be at least as up to date as the primary. That way if the primary goes down, you can query the secondary instead and not lose any data.

We could've done that. We could've also used DRBD, etc. But then we still have the SQL/ORM/testing latency/dependency problems.

Can you go into more about what these problems are? I've always used databases (about 15 years on Oracle and about 5 years on Postgres) and I'm not sure if I know what problems you are referring to. Maybe I have experienced them, but have thought of them by a different name.

SQL - I'm not sure what the problems are with SQL. But it is like a second language to me so maybe I experienced these problems long ago and have forgotten about them.

ORM - I never use an ORM, so I have no idea what the problems might be.

testing latency - I don't know what this refers to.

dependency - ditto


I think the "database" label is tripping up the conversation here. What's being talked about here, really, is fast & HA coordination over a (relatively) small amount of shared state by multiple actors within a distributed system. This is literally Etcd's raison d'etre, it excels at this use case.

There are many operational differences between Etcd and a traditional RDBMs, but the biggest ones are that broadcasting updates (so that actors may react) is a core operation, and the MVCC log is "exposed" (via ModRevision) so that actors can resolve state disagreements (am I out of date, or are you?).


SQL is fine. We use it for some things. But not writing SQL is easier than writing SQL. Our data is small enough to fit in memory. Having all the data in memory and just accessible is easier than doing SQL + network round trips to get anything.

ORMs: consider yourself lucky. They try to make SQL easy by auto-generating terrible SQL.

Testing latency: we want to run many unit tests very quickly without high start-up cost. Launching MySQL/PostgreSQL docker containers and running tests against Real Databases is slower than we'd like.

Dependencies: Docker and those MySQL or PostgreSQL servers in containers.


Can you put some numbers on how much time is too much? I've never seen anyone go this far to avoid using a database for what sounds like the only "real" reason is to avoid testing latency (a problem which has many other solutions) so I am really confused, but curious to understand!

Running all of our control server tests (including integration tests) right now takes 8 seconds, and we're not even incredibly happy with that. There's no reason it should even be half that.

So it's not really in our patience budget for adding a mysqld or postgres start up (possible docker pull, create its schema, etc).


>right now takes 8 seconds, and we're not even incredibly happy with that

With the amount of explaining and skepticism you're having to deal with in most of the threads here (plenty of reasonable questions, some seem to approach the question with the assumption that your approach is totally wrong) I feel compelled to comment on how nice such a fast feedback loop would be just so it's known that you're not listing these benefits into an ether that doesn't appreciate them.


Yeah, I'm inspired to reduce the cost of my tests to get to a better place as a result of this.

Mind you, I build ML/statistical models, so my integration/e2e tests are definitely not going to get down to 8 seconds.


Wow that's really quick. I can see why that would be very desirable.

It would create a nice flow to get feedback from your test suite that quickly.


Not sure what their requirements are, but I'm using a "spin up an isolated postgres instance per test run" solution and end up with ~3s overhead to do that. (Using https://pypi.org/project/testing.postgresql/)

Edit: 3s for global setup/teardown. Not per test function/suite.


Ah, we don't use Docker or any other container technology. Maybe that is why we aren't seeing the latency issues you are referring to.

Docker itself doesn't add much latency. It just makes getting MySQL and PostgreSQL easier. If anything, it helps with dependencies. The database server startup still isn't great, though.

If you don't use Docker, you can just leave the database server running in the background, which removes the startup latency (you can of course do this with Docker too, but Docker has a tendency to use quite a few resources when left running in the background, which a database server on its own won't).

So then every engineer needs to install & maintain a database on their machines. (Hope they install the right version!)

I mean, that's what my old company did pre-Docker. It works, but it's tedious.


Do the developers not have an internet connection? Why can’t they just hit a database running in the cloud or your own server somewhere?

There is a middle path -- use lxd. This way, the mysqld/postgres process can always be running, while the dev machine remains clean.

(But of course the battery will drain a little faster if it is a laptop)


I mean that's an `apt install postgres` or `brew install postgres` away. Takes about 5 minutes. I guess it could become a pain if you need to work with multiple different versions at once.

Being deep in the cloud world right now, with aws and terraform and kubernetes cli tools, etc, not having to install third party tools on my machine does sound pretty great, but also entirely unrealistic.

Managing local DBs once new versions are out and your server isn't upgraded yet is irritating, but when I'm using a Mac I'd still rather use a native DB than Docker because of the VM overhead, since I've not yet run into a bug caused by something like "my local postgres was a different version than the server was." (Closest I've gotten was imagemagick for mac doing something a bit differently than for linux, about 10 years ago at this point.)


> I've not yet run into a bug caused by something like "my local postgres was a different version than the server was."

Ran into that at a recent place - the code was doing "= NULL" in a bunch of places (before my time) and PG12 treated that differently than PG11 did which broke a bunch of tests.


Would love to know what resources you are speaking of here.

We've definitely done some whack-a-mole with allocations in the engine, and of course there's always things getting changed/added still.


Were you doing a lot of logic in SQL itself? Sounds like not really, but then I'm surprised you'd have so many tests hitting the DB directly, vs most feature logic living above that layer in a way that doesn't need the DB running at all.

Why are your unit tests touching a database? I’m a real stickler about keeping unit tests isolated, because once I/O gets involved, they invariably become much less reliable and as you mention, too slow.

Sorry, I should've just said tests. Our tests overall are a mix of unit tests and integration tests and all sizes in between, depending on what they want to test.

I see. That makes more sense.

That would have been considerably less scalable. etcd has some interesting scaling characteristics. I posted some followup notes on twitter here: https://twitter.com/apenwarr/status/1349453076541927425

How is PostgreSQL (or MySQL) "considerably less scalable" exactly? etcd isn't particularly known for being scalable or performant. I'm sure it's fast enough for your use-case (since you've benchmarked it), but people have been scaling both PostgreSQL and MySQL far beyond what etcd can achieve (usually at the cost of availability of course).

[I work at Tailscale] I only mean scalable for our very specific and weird access patterns, which involves frequently read-iterating through a large section of the keyspace to calculate and distribute network+firewall updates.

Our database has very small amounts of data but a very, very large number of parallel readers. etcd explicitly disclaims any ability to scale to large data sizes, and probably rightly so :)


This is getting confusing. The tweets sound like you are concerned about write scalability, and here it sounds like you are concerned about read scalability?

> So we can get, say, 1000 updates, bundle them all, get it synced in say ~100ms, and then answer all 1000 requests at once, and still only take ~100ms.

I assume the same trick is applicable to RDBMS as well? So you batch the 1000 updates, and do one commit with a single fsync.
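The batching trick quoted above (collect many updates, sync once, then answer every waiting request) is commonly called group commit. A minimal sketch in Go, assuming a hypothetical `Syncer` interface where `Sync` stands in for fsync; `countingSink`, `request`, and `commitBatch` are illustrative names, not code from Tailscale, etcd, or any RDBMS:

```go
package main

import "fmt"

// Syncer is a hypothetical durable sink; Sync stands in for fsync.
type Syncer interface {
	Write(p []byte) error
	Sync() error
}

// countingSink records how many writes and syncs were issued.
type countingSink struct {
	writes, syncs int
}

func (s *countingSink) Write(p []byte) error { s.writes++; return nil }
func (s *countingSink) Sync() error          { s.syncs++; return nil }

// request pairs a payload with a channel used to acknowledge durability.
type request struct {
	payload []byte
	done    chan error // buffered with capacity 1 so acks never block
}

// commitBatch writes every pending payload, syncs once, and then
// acknowledges all requests with the same result.
func commitBatch(sink Syncer, batch []request) {
	var err error
	for _, r := range batch {
		if err = sink.Write(r.payload); err != nil {
			break
		}
	}
	if err == nil {
		err = sink.Sync() // one sync covers the whole batch
	}
	for _, r := range batch {
		r.done <- err
	}
}

func main() {
	sink := &countingSink{}
	batch := make([]request, 1000)
	for i := range batch {
		batch[i] = request{payload: []byte("update"), done: make(chan error, 1)}
	}
	commitBatch(sink, batch)
	for _, r := range batch {
		if err := <-r.done; err != nil {
			panic(err)
		}
	}
	fmt.Println(sink.writes, sink.syncs) // prints "1000 1"
}
```

Each caller still waits for durability, but the cost of the sync is shared across everyone in the batch.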

> Virtually every other database I've used is quite naive about how they flush blocks to disk, which dramatically reduces their effective transactions/sec. It's rare to see one that made all the right choices here.

Can you elaborate on this? Any RDBMS worth its salt should be able to saturate the disk IOPS, i.e. the act of flushing itself wouldn't be the bottleneck.

> Our database has very small amounts of data but a very, very large number of parallel readers.

So the control plane is the sole writer of this database, and there are maybe 100s/1000s of other readers, who each has a watcher on etcd? Who are these readers? If they are different processes on different machines, how did it work when the database was in the json file?

Sorry for the barrage of questions, but I have to ask out of curiosity.


Not OP, but if I could ask further... How much consistency can you tolerate on your reads? From the use case you mention I imagine... quite a lot, but you could risk locking yourself out of networks/systems if you get it wrong?

EDIT: Negation is important


I've always found it hard to reason about relaxing consistency, and I think people underestimate how much complexity they take on by moving away from serializable isolation towards something looser. (Fun fact! Many databases out of the box don't correctly handle the classic transaction example -- read an account balance, subtract an amount from the balance, and then add the amount to another account's balance.)

Usually people design their app with the expectation of strict serializable isolation, relax it because of some production emergency, and then deal with the business consequences of the database doing the wrong thing until the company goes out of business (usually not due to database isolation levels, to be fair).
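To make the read-modify-write hazard above concrete, here is a small deterministic sketch in Go. It does not model any particular database; `Table`, `WriteBlind`, and `WriteIfVersion` are hypothetical names, and the version check is one common way (optimistic concurrency) to detect the conflict, not the only one.

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a hypothetical account row; Version increments on every write.
type Row struct {
	Balance int
	Version int
}

type Table struct {
	mu   sync.Mutex
	rows map[string]Row
}

func (t *Table) Read(id string) Row {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.rows[id]
}

// WriteBlind overwrites the balance without checking for concurrent changes.
func (t *Table) WriteBlind(id string, balance int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	r := t.rows[id]
	t.rows[id] = Row{Balance: balance, Version: r.Version + 1}
}

// WriteIfVersion writes only if the row is unchanged since it was read.
func (t *Table) WriteIfVersion(id string, balance, version int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.rows[id].Version != version {
		return false // someone else wrote in between; caller must retry
	}
	t.rows[id] = Row{Balance: balance, Version: version + 1}
	return true
}

func main() {
	tbl := &Table{rows: map[string]Row{"alice": {Balance: 100}}}

	// Two withdrawals of 10 both read before either writes.
	a, b := tbl.Read("alice"), tbl.Read("alice")
	tbl.WriteBlind("alice", a.Balance-10)
	tbl.WriteBlind("alice", b.Balance-10)
	fmt.Println(tbl.Read("alice").Balance) // 90: one withdrawal was lost

	// Same interleaving with a version check: the second write is rejected.
	a, b = tbl.Read("alice"), tbl.Read("alice")
	ok1 := tbl.WriteIfVersion("alice", a.Balance-10, a.Version)
	ok2 := tbl.WriteIfVersion("alice", b.Balance-10, b.Version)
	fmt.Println(ok1, ok2, tbl.Read("alice").Balance) // true false 80
}
```

Each individual operation here is atomic, yet the blind variant still loses an update, which is the kind of anomaly weaker isolation levels can permit.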


Interesting perspective, thanks!

Not sure whether I agree or disagree, actually...

AFAICT Linearizable is about the best we can expect in reality (at least for a distributed system), but as you point out: Very few people actually check their assumptions... and even fewer actually think about DB transactions correctly in the first place. It's actually really, really hard and people have these rules of thumb in their heads that aren't actually correct.

Which gets me to wondering if we could formalize some of this stuff... (in relevant "code scopes", dgmw!)

EDIT: If there is one thing I am certain about it is the fact that a lot of consistency can be relaxed around human interaction. It's lossy anyway, and people will call you (eventually, depending on anxiety/shyness) if you haven't fulfilled an order. The browser is the first order of that and that's already out of date once you show a page, so... Anyway, that's just to say it's amusing how much people worry about consistency on the front end


This reminds me of this post from the hostifi founder, sharing the code they used for the first 3 years:

https://twitter.com/_rchase_/status/1334619345935355905

It’s just 2 files.

Sometimes it’s better to focus on getting the product working, and handle tech debt later.


I do like putting .json files on disk when it makes sense, as this is a one-liner to serialize both ways in .NET/C#. But, once you hit that wall of wanting to select subsets of data because the total dataset got larger than your CPU cache (or some other step-wise NUMA constraint)... It's time for a little bit more structure. I would have just gone with SQLite to start. If I am not serializing a singleton out to disk, I reach for SQLite by default.

I've seen the same when at one point we decided to just store most data in a JSON blob in the database, since "we will only read and write by ID anyway". Until we didn't, sigh. At least Postgres had JSON primitives for basic querying.

The real problem with that project was of course trying to set up a microservices architecture where it wasn't necessary yet and nobody had the right level of experience and critical thinking to determine where to separate the services.


Storing JSON blobs in the database can be the best option if you are careful with your domain modeling.

I use the same system (a JSON file protected with a mutex) for an internal tool I wrote, and it works great. For us, file size or request count is not a concern, it's serving a couple (internal) users per minute at peak loads, the JSON is about 150 kb after half a year, and old data could easily be deleted/archived if need be.

This tool needs to insert data in the middle of (pretty short) lists, using a pretty complicated algorithm to calculate the position to insert at. If I had used an RDBMS, I'd probably have to implement fractional indexes, or at least change the IDs of all the entries following the newly inserted one, and that would be a lot of code to write. This way, I just copy part of the old slice, insert the new item, copy the other part (which are very easy operations in Go), and then write the whole thing out to JSON.

I kept it simple, stupid, and I'm very happy I went with that decision. Sometimes you don't need a database after all.
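The copy, insert, copy step described in this comment is short in Go. A minimal sketch, where `insertAt` is a hypothetical helper and the position is given directly rather than computed by the commenter's algorithm:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// insertAt returns a new slice with item placed at index i,
// leaving the original slice untouched.
func insertAt(items []string, i int, item string) []string {
	out := make([]string, 0, len(items)+1)
	out = append(out, items[:i]...) // copy the part before the insert point
	out = append(out, item)         // insert the new item
	out = append(out, items[i:]...) // copy the rest
	return out
}

func main() {
	list := []string{"a", "c"}
	list = insertAt(list, 1, "b")

	// Serialize the whole list, as the tool does on every change.
	blob, err := json.Marshal(list)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob)) // prints ["a","b","c"]
}
```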


As long as you're guaranteeing correctness[0], it's hard to disagree with the "simple" approach. As long as you don't over-promise or under-deliver, there's no problem, AFAICS.

[0] Via mutex in your case. Have you thought about durability, though. That one's actually weirdly difficult to guarantee...


> Have you thought about durability, though. That one's actually weirdly difficult to guarantee...

Strictly speaking, it's literally impossible to guarantee[0], so it's more a question of what kinds and degrees of problems are in- versus out-of-scope for being able to recover from.

0: What happens if I smash your hard drive with a hammer? Oh, you have multiple hard drives? That's fine, I have multiple hammers.


What happened to the first hammer :D

I guess it wasn't Durable.

That's good. But a single file could break on power loss. I use SQLite. It's quite easy to use, though not a single line.

Their point about schema migration is completely true though. An SQLite db is extremely stateful, and querying that state in order to apply schema migrations (for both the data schema and the indexes) is bothersome.

> The file reached a peak size of 150MB

is this a typo? 150MB is such a minuscule amount of data that you could do pretty much anything and be OK.


Not a typo. See why we're holding it all in RAM?

But writing out 150MB many times per second isn't super nice when 150MB and the number of times per second are both growing.


I think a lot of people are missing the point that a traditional DB (MySQL/Postgres) is not a good fit for this scenario. This isn't a CRUD application but is instead a distributed control plane with a lot of reads and a small dataset. Joins and complex queries are not needed in this case as the data is simple.

I am also going to go out on a limb and guess that this is all running in Kubernetes. Running etcd there is dead simple compared to even running something like Postgres.

Congrats on a well engineered solution that you can easily test on a dev machine. Running a DB in a docker container isn't difficult but it is just one more dev environment nuance that needs to be maintained.


We don't use Kubernetes (or even Docker) currently.

Hopefully Tailscale gets to the size where Kubernetes is worth it. It's a complex thing to run and understand but in the end I think it is worth it. It has certainly made my day to day life a lot easier and allowed our tiny team to build out a solid platform. It has greatly reduced the amount of time that our developers need to get a service up and running and new features out.

We have a lot of Kubernetes experience on the team. Multiple of us run Kubernetes clusters in our home labs (mine: https://github.com/bradfitz/homelab), and one of us used to be on the Google GKE team as an SRE, and is the author of https://metallb.universe.tf/ (which multiple of us also use).

Us _not_ using Kubernetes isn't because we don't know how to use it. It's because we _do_ know how to use it and when _not_ to use it. :)


Haha, it sounds like you have it covered. Even more so if you were to run on GKE (which I use and adore).

When not to use it is a tough question. If I was ever in charge of a company, kubernetes would be the only way of running my product that I would consider. I am a fan of kubernetes as I use it every day but I have also been on the other side of the fence. I have run production systems on bare metal, VMs, EC2 instances, etc. The operational burden of anything non-kube is too much and takes time away from solving big problems such as stability, scaling, deploy, monitoring and more. The solutions to the problems become standard, boring and consistent.

I say the above as someone that spent over a year migrating an entire platform/product from ECS to GKE. It is not perfect but so many silly day to day interruptions have been eliminated. Retired and broken instances are a thing of the past. Scaling is easy. Stability is easier.

Side effects of the move are that our Ops team is 1/2 the size it was a year ago (attrition/covid), we are running 3 times the number of product stacks for 1/3rd the cost. I should really blog about that one!


Kubernetes support is a top Tailscale request and the community's starting to do it on their own (a bit, in less than ideal ways sometimes) so soon enough here we'll have to end our little Kubernetes vacation and get back into it and make Tailscale support Kubernetes really well.

After reading the comments and the blog post, I think that the requirements boil down to fast persistence to disk, minimum dependencies and fast test runs. Fortunately the data is very small (150MB) and it fits very easily in memory. According to the post the data changes often, so they need to write the data many times a second. But I'm not sure why they need to flush the entire 150MB every time. Why not structure the files/indexes such that we write only the modified data?

I never took a course in databases. At some point I was expected to store some data for a webserver, looked at the BSDDB API, and went straight to MySQL (this was in ~2000). I spent the time to read the manual on how to do CRUD but didn't really look at indices or anything exotic. The webserver just wrote raw SQL queries against an ultra-simple schema, storing lunch orders. It's worked for a good 20 years and only needed minor data updates when the vendor changed and small Python syntax changes to move to Python 3.

At that point I thought "hmm, i guess I know databases" and a few years later, attempted to store some slightly larger, more complicated data in MySQL and query it. My query was basically "join every record in this table against itself, returning only rows that satisfy some filter". It ran incredibly slowly, but it turned out our lab secretary was actually an ex-IBM Database Engineer, and she said "did you try sorting the data first?" One call to strace showed that MySQL was doing a very inefficient full table scan for each row, but by inserting the data in sorted order, the query ran much faster. Uh, OK. I can't repeat the result, so I expect MySQL fixed it at some point. She showed me the sorts of DBs "real professionals" designed: it was a third normal form menu ordering system for an early meal delivery website (wayyyyy ahead of its time. food.com). At that point I realized that there was obviously something I didn't know about databases, in particular that there was an entire schema theory on how to structure knowledge to take advantage of the features that databases have.

My next real experience with databases came when I was hired to help run Google's MySQL databases. Google's Ads DB was implemented as a collection of mysql primaries with many local and remote replicas. It was a beast to run, required many trained engineers, and never used any truly clever techniques, since the database was sharded so nobody could really do any interesting joins.

I gained a ton of appreciation for MySQL's capabilities from that experience but I can't say I really enjoy MySQL as a system. I like PostgreSQL much better; it feels like a grownup database.

What I can say is that all this experience, and some recent work with ORMs, has led me to believe that while the SQL query model is very powerful, and RDBMSs are very powerful, you basically have to fully buy into the mental model and retain some serious engineering talent: folks who understand database index disk structures, multithreading, etc.

For everybody else, a simple single-machine on-disk key-value store with no schema is probably the best thing you can do.


I find the article a bit hard to follow. What were the actual requirements? I probably didn't understand all of this, but was the time spent on thinking about this more valuable than using a KV store?

Yeah, I guess we could've laid that out earlier:

* our data is tiny and fits in RAM

* our data changes often

* we want to eventually get to an HA setup (3-5 etcd nodes now, a handful of backend server instances later)

* we want to be able to do blue/green deploys of our backend control server

* we want tests to run incredibly quickly (our current 8 seconds for all tests is too slow)

* we don't want all engineers to have to install Docker or specific DB versions on their dev machines


I found myself in a similar situation some time ago with MongoDB. In one project my unit tests started slowing me down too much to be productive. In another, I had so little data that running a server alongside it was a waste of resources. I invested a couple of weeks in developing a SQLite type of library[1] for Go that implemented the official Go driver's API with a small wrapper to select between the two. Up until now, it has paid huge dividends in both projects' ongoing simplicity and was totally worth the investment.

[1]: https://github.com/256dpi/lungo


Honestly this feels like engineers that spent too long in FANG and got completely burnt out on dealing with SRE and HA requirements... so they decide to build a setup so prone to 2AM pages even a PHP webshop would frown at it.

Curiously though it's a pattern I've seen twice in the last 12 months; there was that guide on the good bits of AWS that also recommended starting with a single host with everything running on it.

Maybe we should all move that host back under our desks and really be back to basics!


For me the most shocking part was that this company is spending time and money on overengineering solutions when they know that a JSON file got them that far — SQLite would be a perfectly fine improvement to get themselves further!

I had no idea companies of this size had engineers with that much free time on their hands.


Philosophically: I've seen some "bespoke" systems like this that live long enough for a nice off-the-shelf solution to come around that solves the problem much more elegantly and efficiently than the "bespoke" one does. This seems like a normal and dare I say organic path for these kinds of systems to take.

I don't even mind senior devs putting together things like this at the cornerstone of the company provided there are always at any given point in time 2 people that know how it works and can work on it, and sufficient time was spent looking at existing solutions to make that call. It should be made with full expectations that the first paragraph is inevitable.

Specifically, in this case: Without any actual data (# of reads, # of writes, size of writes, size of data, read patterns, consistency requirements) it is not possible to judge whether going custom on such a system was merited or not. I would find it VERY difficult to come to the conclusion that this use case couldn't be solved with very common tooling such as spark and/or nats-streaming. "provided the entire dataset fits in memory" is a very large design liberty when designing such a solution and doesn't scream "scalability" or n+1 node write-consistency to me. I say this acknowledging full well that etcd is an unbelievably well written piece of software with durability and speed beyond its years.

Keeping my eyes open for that post-series-a-post-mortem post.


"""Even with fast NVMe drives and splitting the database into two halves (important data vs. ephemeral data that we could lose on a tmpfs), things got slower and slower. We knew the day would come. The file reached a peak size of 150MB and we were writing it as quickly as the disk I/O would let us. Ain’t that just peachy?"""

Uh, you compressed it first, right? Because CPUs can compress data faster than disk I/O.


Yeah, I think it was zstd.

But the bigger problem was the curve. Doing something O(N*M) where N (file size) and M (writes per second) were both growing was not a winning strategy, compression or not.


Hrmm. Even lz4 level 1 compresses at "only" about 500-1000MB/s on various CPU types, which isn't quite as fast as NVMe devices demand.

yes, that's a very recent change (afaict, the big change was going from SATA interface to NVME).

seems like the reason for not going the MySQL/PSQL/DBMS route is lack of good Go libraries to handle relational databases (ORM/migration/testing)? from the story it seems more like a solution in search of a problem

I thought we adequately addressed all those points in the article? Search for ORM and dockertest.

How do you come to this conclusion from the article? Or is it just an attempt to discredit Go?

I also use JSON files... but one per value! It has its ups and downs. Pro: it's incredibly fast and scales like a monster. Con: it uses a lot of space and inodes, so better use type small with ext4!

The only feature it misses is to compress the data that is not actively in use, that way there is really not much of a downside.

http://root.rupy.se


> "Attempts to avoid this with ORMs usually replace an annoying amount of typing with an annoying amount of magic and loss of efficiency."

People seem to keep using poorly-designed ORMs or are stuck with some strange anti-ORM ideology.

Modern ORMs are fast, efficient, and very productive. If you're working with relational databases then you're using an ORM. It's a question of whether you use something prebuilt or write it yourself (since those objects have to be mapped to the database somehow). 99% of the time, ORMs generate perfectly fine SQL (if not exactly what you'd type anyway) while handling connections, security, mapping, batching, transactions, and other issues inherent in database calls.

The 1% of the time you need something more complex, you can always switch to manual SQL (and ORMs will even let you run that SQL while handling the rest as usual). The overall advantages massively outweigh any negatives, if they even apply to your project.


IMHO, ORMs are a wrong abstraction. Full-fledged objects are a wrong abstraction of the data in more cases than not.

The right tool is a wrapper / DSL over SQL, which allows you to interact with the database in a predictable and efficient way, while not writing placeholder-ridden SQL by hand. Composable, typesafe, readable.

ORMs do fine in small applications without performance requirements. The further you go from that, the less adequate an ORM becomes, and the more you have to sidestep and even fight it, in my experience.


Do you have an example of the right tool? SQL is already a DSL and I can't see how creating another language isn't just adding more overhead without actually getting you to usable objects.

The only reason objects are the "wrong" abstraction is because they don't match relational models exactly. That impedance mismatch is the entire reason for the object-to-relational mapping, otherwise you can use things like document-stores and just serialize your objects directly.


SQLAlchemy gives a nice composable DSL in Python; it has an ORM layer on top of it, too, but it's optional.

JOOQ allows for composable typesafe SQL in Java.


+1 for jOOQ!

Modern ORMs don't let you hand craft sql, shunting it away in some scary extension that you then have to fight with management about using.

An ORM that worked on the principle of "insert query text of any complexity, receive object" as the primary use case, not the "nonstandard and non-idiomatic use case", would be the only way to ease the concerns of DBAs who code, like me.

It's the same pitfall as API clients. Why would I take the time to learn an API like it's an SDK, along with the pains of trying to shunt OpenAPI's libraries into my application without requiring the creation of a Composer build step, further complicating deployment, when I can make 5 methods in the time before lunch to do the bits I need as REST queries, and deployment of my PHP app remains as simple as `git pull production` on the NFS share all the workers read from?

The benefit of compile validated symbols is moot in the days of test driven dev, so the benefits gained from that can still be realized without creating build complexity or making competent engineers re-learn something they already know only re-abstracted in a way that almost always makes it harder for somebody who understands the low level to learn the new way compared to a new dev.


You're not using a modern ORM then if you can't use manual SQL queries. And tests are not a replacement for compilation and type-safety.

Like I said, you're either using an existing ORM or just writing your own every time, and the one you write probably won't be very good, as seen by the numerous security and performance bugs that are constantly found. Also abstractions are useful. All of software development is built with abstractions and they don't suddenly become useless when it comes to databases.

I also don't see what this has to do with API clients or build steps, but there are good and bad examples of those too.


> I also don't see what this has to do with API clients

They are the same thing, a way of taking text based queries for formatted data as std:maps and shunting that away behind classes and types and interfaces.

> Like I said, you're either using an existing ORM or just writing your own everytime

vectors of maps works pretty well.


> Modern ORMs don't let you hand craft sql

What "modern ORM" are you using?

Sure I don't write PHP, but the ORMs I've used in Java/Kotlin, Ruby, JavaScript and lately Rust and Crystal have all been more than fine with me writing raw SQL.


> Modern ORMs are fast, efficient, and very productive.

The author seems to be using Go, which honestly could use work in that area. gorm is the biggest / most popular ORM out there, but it looks like a one-person project, the author seems well worn-out already, and it kinda falls apart when you work with a data model more than one level deep.

Plus broadly speaking, there seems to be a bit of an aversion to using libraries in the Go community.


GORM is nice for beginners because it's pretty easy to get started with, but in my experience it starts to fall apart when you need to do anything more complex or start to scale up. I've had good experiences with SQLBoiler [0] on the other hand. I haven't used it for anything in production, but it's been a breeze to use in a couple of personal projects, and it handles complex SQL queries much better compared to GORM.

0: https://github.com/volatiletech/sqlboiler


> Modern ORMs are fast, efficient, and very productive

Which ones do you have experience with?


This post touches on "innovation tokens". While I agree with the premise of "choose boring technology", it feels like a process smell, particularly for a startup whose goal is to innovate a technology. It feels demotivating as an engineer if management says our team can only innovate an arbitrary N times.

It is a tricky tradeoff for startups. On the one hand, a startup has very limited resources and so has to focus on the business. On the other hand, a startup has to experiment to find the business. I don't think there's an easy answer.

In our case, the control plane data store really should be as boring as possible. It was a real stretch using anything other than MySQL. We tried to lay out the arguments in the post, but the most compelling was that we had lots of unit tests that spun up a DB and shut it down quickly. Maybe a hundred tests whose total execution time was 1.5s. The "boring" options made that surprisingly difficult.

(Tailscaler and blog post co-author)


For PostgreSQL and Go, here's a package to spin up a temp in-mem PostgreSQL: https://github.com/rubenv/pgtest/

From memory, in Postgres, you could also have a copy of your base database and then copy it in for each test, which I seem to recall being fairly fast. It includes the data too.

    CREATE DATABASE test_db TEMPLATE initial_db;
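
From Go, the per-test clone could look roughly like the sketch below. The helper names (`quoteIdent`, `createFromTemplateSQL`, `cloneForTest`) are made up for illustration, and actually running `cloneForTest` assumes an open `*sql.DB` backed by some PostgreSQL driver, which is not shown here.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// quoteIdent double-quotes a PostgreSQL identifier, doubling embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// createFromTemplateSQL builds the statement. Identifiers cannot be bound as
// query parameters, so they are quoted and interpolated instead.
func createFromTemplateSQL(name, template string) string {
	return fmt.Sprintf("CREATE DATABASE %s TEMPLATE %s", quoteIdent(name), quoteIdent(template))
}

// cloneForTest creates a fresh database from the template. The admin handle
// is assumed to be connected to a database other than the template itself.
func cloneForTest(ctx context.Context, admin *sql.DB, name, template string) error {
	_, err := admin.ExecContext(ctx, createFromTemplateSQL(name, template))
	return err
}

func main() {
	// Only the statement is printed here; executing it needs a live server.
	fmt.Println(createFromTemplateSQL("test_db_1", "initial_db"))
}
```

Worth keeping in mind that PostgreSQL refuses to copy a template while other sessions are connected to it, so the template database should sit idle during test runs.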

I'm used to Scala for these things and it seemed fairly easy to do. Docker DB container would be spun up at the start of testing (using a package such as testcontainers-scala). Then the DB would be setup using the migration scripts (using Flyway) so it has a known state. Every test (or set of tests if they're linked) would have a pre-hook which nuked the DB to a clean state. Then the docker container would be shut down at the end of testing. I'm guessing there's even some way to have this run concurrently with multiple DBs (one per thread) but we never did that. Is Java's ecosystem for this type of tooling just that much better?

I wrote about that in the article. Search for "dockertest".

We considered that option but didn't like it. It still wasn't fast enough, and placed onerous (or at least annoying) dependencies on future employees.


> It still wasn't fast enough, and placed onerous (or at least annoying) dependencies on future employees.

Did you configure the Postgres (or MySQL) database to be entirely in memory, e.g. by using a tmpfs Docker volume?

As for being onerous or annoying for new employees, which is worse: having to set up a Docker environment, or using a relatively obscure data store in a way that nobody else does?


Empirically, the former.

We've since hired many employees who just learned about our database today from this blog post but had been happily testing against it on their laptops for months.

3-4 of us know about it, and that's sufficient.


One solution is to use a standard subset of sql so you can use sqlite in unit tests and mysql/postgres in prod. Many languages also have in-memory SQL implementations that are also a more convenient substitute for sqlite.

Of course the benefit of what you did, even if I wouldn't have done it, is that you're _not_ using a different system in dev vs prod.


I would not go down the road of figuring out what subset of SQL various databases understand. You will always be surprised, and you'll be surprised in production because "the thing" you developed against sqlite doesn't work in postgres.

I used to do this and stopped when I noticed that sqlite and postgres treat booleans differently; postgres accepts 't' as true, but SQLite stores a literal 't' in the boolean-typed field. This means you get different results when you read things back out. All in all, not a rabbit hole you want to go down.

Personally, I just create a new database for each test against a postgres server on localhost. The startup time is nearly zero, and the accuracy compared to production is nearly 100%.


Exactly. That's why I wrote: "No Docker, no mocks, testing what we’d actually use in production."

What is the purpose of a startup? Is it to put keywords on your resume, or is it to create a product?

Depends if you have funding or not.

I mean, what do you propose in terms of getting a diverse group of engineers to trade off innovation versus choosing boring technologies? Keeping in mind that how much risk the company is willing to take is not the decision of engineers but the executive team. Innovation tokens convey the level of risk the executive team is willing to take in terms of technologies. The alternative I've often seen is a dictatorial CTO (or VP of Eng) who simply says NO a lot which is a lot more demotivating. A large company may do detailed risk analyses but those are too cumbersome for a startup.

Every decision to increase platform spread should be justified in detail, as the implementation and support overhead is essentially unbounded. Be careful you don't buy a pig in a poke.

Ahh, a classic case of "if you don't understand it, you are bound to reimplement it".

My takeaway from the OP:

> Never underestimate how long your “temporary” hack will stay in production!


Nothing as permanent as a temporary solution.

Tangentially, this article makes me so very glad that our own work projects are all making use of the Django ORM. Database migrations and general usage have been a non-issue for literally years.

> “Yeah, whenever something changes, we grab a lock in our single process and rewrite out the file!”

Is this actually easier than using SQLite?


Keep reading!

Without knowing it, they reinvented the Kubernetes informer which I've proven can scale way past their current scale.

If you liked this I highly recommend "Clean Architecture" by Uncle Bob. He has a great story of how they kept putting off switching to a SQL DB on a project and then never needed to and it was a big success anyway.

I thoroughly enjoyed "Clean Architecture." I engage with all of Uncle Bob's work in an adversarial manner: he is trying to convince me that he is right and I am playing devil's advocate. This helped me gain insight from his work, develop my own thoughts on architecture, and avoid getting too caught up in doing things the "one true way."

I'm always surprised when people get so agitated when I bring up his books (Clean Code being one of my all time favorites). Even if you disagree with some of the specifics, if there's one big takeaway like you said, it's to develop your "thoughts". He has thought hard about all of these things, and will force you to as well.

I imagine it's like if you were trying to be an olympic marathon runner, you'd study things like humidity and shoes and arm motion deeply, and he kind of does that (function naming, nodes per block, comments, test coverage, et cetera). Even if you don't agree with him, as you suggest read it and play "devil's advocate", and you will be forced to think about details that will make you a better software engineer.


I think uncle bob is the only one who makes sense

Looks like a clear case of migration to postgres. A single table with jsonb. You can do indexes on jsonb and a few plpgsql for partial updates, and forget about it for a while.

So what happens when the computer fails in the middle of a long write to disk?

Keep reading!

[flagged]


That's not true. I deal with shady marketing every day, I know from shady marketing, this is not that, and I don't like seeing people unjustly accused. Please stop creating accounts to spread this falsehood.
npiit 4 days ago [flagged]

What is this?

You've been creating multiple accounts to troll HN with weird shit about one company for over a year now. It's bizarre and we've banned you multiple times. Please stop doing this.

We've seen literally zero evidence for anything that you're saying. Actually there's strong evidence against it.


I don't think it's shady marketing. The company is composed of several prominent software engineers (eg Brad Fitzpatrick of memcached fame). I think many folks on HN (myself included) are interested to see what they are working on. Especially as Brad left Google to work on it.

Brad Fitzpatrick aside (he's done a lot more than memcached), David Crawshaw, co-author of TFA, led go-mobile (native golang apps for Android and iOS) and NetStack (userspace TCP/IP implementation used by gVisor and in-anger by Fuchsia) while at Google.
npiit 9 days ago [flagged]

I understand. But given the product features compared to the rest of the industry, Tailscale brings no real value compared to Zero-tier when it comes to meshes, it's not zerotrust like Cloudflare, Twingate, and many others, it claims to be open source while only the client is and it cannot be used without their closed source control plane where most of the features are behind a paywall, it's way more expensive than reputable offerings like Citrix, Cloudflare and others. Their security is very dubious to me (they can in fact inject their own public keys to connect to clients' machines and there is no way but to trust their word that they won't). I mean, what's the innovation compared to the industry in order to get that systemically excessive coverage here?

It's more zerotrust-y than Cloudflare et al since it's entirely P2P, with only the control plane running in the cloud.

Compared to ZeroTier, the Tailscale client has a permissive license, the mesh is fully routed (vs. a L2 network with unencrypted broadcasts), is written in a memory-safe programming language, integrates with company SSO, and uses the Wireguard protocol (i.e. sane, audited crypto instead of a DIY protocol).

npiit 9 days ago [flagged]

zerotrust has nothing to do with p2p, zero-trust is about making sure that this user is authorized to access that application at the resource level not using some decades old segmentation/network level policies. Zerotier also claims to be zerotrust but it's technically not. Cloudflare, Citrix, PulseSecure have zerotrust offerings, but many others sadly just claim to be either by ignorance or dishonesty.

Yes, and implementing that is exactly the point of Tailscale, with the added advantage of not relying on a centralized proxy.
npiit 9 days ago [flagged]

You seem to be confused between zerotrust and encryption. Zerotrust is about authentication/authorization at the application level. Also tailscale is as centralized as Cloudflare et al. What happens when tailscale servers go down? Can 2 peers behind NAT still connect to each other? Can they synchronize each other's public endpoint and public key?

> Tailscale brings no real value compared to Zero-tier

This article has nothing to do with Tailscale the product and everything to do with the team's unconventional approach to engineering. That's what HN is interested in and why the post is being upvoted.

npiit 8 days ago [flagged]

There is nothing unconventional in moving from SQL to key-value distributed database. And if it was any other company that submitted this very same post here we wouldn't be talking here right now as it would have never gotten a single upvote. The posts of this company almost always come with their upvotes right after submission (by others) and the founders were surprisingly replying minutes after submission. This is systematic behavior.

> moving from SQL to key-value distributed database

You didn't read the article, clearly.

> The posts of this company almost always come with their upvotes right after submission

This has already been explained to you in other comments, so I just assume you're being disingenuous now.

Find a new hobby.


>EDIT: Why is the downvoting? the post was give like 9 upvotes in the first 5 minutes. I frequently go to "new" and this is a highly suspicious behaviour.

They are popular people so people submit the link. Once duplicate links are submitted there is an upvote on the first submission. No duplicates.

Source: I was the second upvote.


I upvote literally everything I see from Tailscale, because I am hugely impressed with Avery Pennarun and David Crawshaw. Seeing Fitz is there too takes it to an even higher level. I want them to succeed, and I think they create good HN content. Major kudos.

What is suspicious about the number 36? (The upvote count at this moment.)

Just because you dislike the product (and it's clear you do) does not prevent others from liking it, or at least finding their articles interesting.

npiit 9 days ago [flagged]

The post was given like 9 upvotes in the first 5 minutes. I frequently go to "new" and this is a highly suspicious behavior.

> EDIT: Why is the downvoting?

Because this kind of thing is something you should contact the mods about, not leave comments that nobody can really (dis-)prove


I upvoted because the content was technical, well written and different than what we normally see on HN.

It's OK, I'm part of the audience that doesn't know much about databases; it's fun to read discussions about them once in a while.

I learned a lot about PostgreSQL, Redis, ClickHouse and Elasticsearch here. People's perspectives here are great to learn from; they tell you which to avoid and which to try.


Is tailscale downvoting you as well?

Doesn’t sound like a smart thing to do and sounds more like a js dev/student discovering step by step why sql databases are so popular..

Probably not so, bc tailscale is a decent product, but this post did not change my view in a good way


On the contrary, it sounds like a seasoned group of people who understand their needs and are wary of the very real challenges presented by most existing SQL systems with respect to deployment, testing, and especially fault tolerance. I'm interested in Cockroach myself, but I also acknowledge it's relatively new, and itself a large and complicated body of software, and choosing it represents a risk.

Building your own locking mechanism to dump out a JSON file is exactly how you would do this.

> Through this process we would do major reorganizations of our SQL data model every week, which required an astonishing amount of typing. SQL is widely used, durable, effective, and requires an annoying amount of glue to bring into just about any programming language

and

> So we invested what probably amounts to two or three weeks of engineering time into designing in-memory indexes that are transactionally consistent

Sounds to me like someone has learned a lot on the job. Good for him, but it looks exactly like what I said before.


This jumped out at me: "The obvious next step to take was to move to SQL"

No. Not unless your data is relational. This is a common problem: relational databases have a lot of overhead. They are worth it when dealing with relational data. Not so much with non-relational data.


Maybe "obvious" was the wrong word there. What we meant to say was that would be the typical route people would go, doing the common, well-known, risk-averse thing.

Yes, I agree.

It is the wrong "obvious" thing. Sounds like they did better though


The premise of articles like this annoys me. It reeks of "we are too smart to use databases", "json is good enough for us", when anyone that works with data to any large extent knows that json is just a pain and we only have to deal with it because the front end is enamored with it because "readable" and "javascript".


