Quicksilver: Configuration Distribution at Internet Scale (cloudflare.com)
50 points by migueldemoura 3 months ago | 7 comments

As I was reading about the problems this system needs to solve and the issues with Kyoto Tycoon, I thought that LMDB might be a good foundation for a solution. So I was gratified to find out that Quicksilver indeed uses LMDB. I gather Cloudflare has found that LMDB is indeed as reliable as it's advertised to be.

Looking forward to the eventual open-source release of Quicksilver.

I was working on something similar one year ago for a personal project.

I simply persisted the config in MySQL and synced the data to Redis. Each server then ran a local Redis replica to allow fast reads from OpenResty (nginx + Lua scripts).
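That pattern can be sketched in a few lines. This is a toy illustration of the idea only (the class names are made up, and plain dicts stand in for MySQL and the local Redis replica); in production the reads would come from a Redis replica colocated with OpenResty:

```python
class ConfigStore:
    """Authoritative config store (stands in for MySQL)."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value


class LocalReplica:
    """Read-only copy colocated with the edge server (stands in for Redis)."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)


def sync(store, replicas):
    """Push the full config snapshot out to every edge replica."""
    for replica in replicas:
        replica.data = dict(store.rows)


store = ConfigStore()
edge = LocalReplica()
store.upsert("example.com:tls", "strict")
sync(store, [edge])
print(edge.get("example.com:tls"))  # fast local read, no network hop
```

The key property is that every read is local; the sync step is the only cross-network operation, which is why the MVP stayed simple.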

The project never took off; it was just an MVP. But why would someone pick Kyoto Tycoon instead?

The story of how they picked a storage engine ill-suited for their needs and then used ridiculous, unsustainable workarounds to deal with its shortcomings (including unsafe practices like relying on rebuilding the databases and assuming it's safe to turn off synchronous writes for KV storage) just confirms how shoddy engineering is at Cloudflare. That is peak technical debt.

(It’s one thing to make a wrong choice, it’s another to think you can paper over those mistakes and they’ll go away.)

This is an extremely unconstructive comment. It's easy to be critical with 20/20 hindsight.

More constructively: what would you have picked back in _2011_ when Cloudflare was getting off the ground? Ideally, it needs a memcached-like interface (for easy gets/puts from Lua + NGINX), has to keep operating when disconnected from upstreams (CDN POPs can have unreliable upstream connections), should be cheap/free in terms of CAPEX/licensing, and must be optimized for (extremely) read-heavy workloads. Strong consistency is less useful here.
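The "keep operating when disconnected from upstreams" requirement is essentially a last-known-good cache: serve the most recent successfully fetched value when the upstream is unreachable. A toy sketch of that behavior (hypothetical names; the simulated upstream is just a toggleable dict, not any real API):

```python
class LastKnownGoodCache:
    """Serve the most recently fetched value when the upstream is down."""
    def __init__(self, fetch_upstream):
        self.fetch_upstream = fetch_upstream
        self.cache = {}

    def get(self, key):
        try:
            value = self.fetch_upstream(key)
            self.cache[key] = value  # refresh last-known-good copy
            return value
        except ConnectionError:
            # Upstream unreachable: fall back to the stale but usable copy.
            if key in self.cache:
                return self.cache[key]
            raise


# Simulated upstream that can be toggled offline.
state = {"online": True, "data": {"zone": "v1"}}

def fetch(key):
    if not state["online"]:
        raise ConnectionError("upstream unreachable")
    return state["data"][key]


c = LastKnownGoodCache(fetch)
print(c.get("zone"))       # "v1", fetched from upstream
state["online"] = False
print(c.get("zone"))       # "v1", served from the last-known-good cache
```

Staleness is the trade-off: the POP keeps answering with old config rather than failing, which matches the weak-consistency requirement above.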

"Technical debt" only debt after the fact. Most of the time, it's the result of a series of (likely rational) trade-offs you've made given the current state of your business.

You're right, my comment would have been more valuable if I'd included some alternate technologies they could have used instead, but I believe that the criticisms still stand regardless, because the approach itself is in question. You don't find out once you've reached scale that the underpinnings for your distributed KV platform are not actually free of global locks for writes! That's something you verify and test for yourself before building your empire upon it. If an alternative exists, you use it. If not, you need to build it.

To answer your question:

> what would you have picked back in _2011_ when Cloudflare was getting off the ground?

Not only "would have" but did pick and use (well before 2011) an abstraction around SQLite because our team first evaluated our read vs write requirements and found it to be an adequate option rather than going crazy trying to find a nosql solution worthy of including on our resumes.

The Cloudflare article is somewhat skimpy on the details of their benchmark, so these numbers are not an exact equivalent, but here's something I just threw together: https://github.com/mqudsi/sqlite-readers-writers

P99.9 for reads with two writers is 2ms as compared to their 1215ms, and this is with full ACID compliance and write synchronization.

You don't need to be an expert in creating these systems, you just have to be able to test and validate your architectural decisions before building upon them. It takes only an hour or three to pick a library and write a similar benchmark for any KV store you're interested in (although the exact benchmark would have to be tweaked to match your expected needs).
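For illustration, a minimal version of such a benchmark can be written with nothing but Python's stdlib `sqlite3` in WAL mode: one writer thread updating a key in a loop while the main thread times reads. This is a hedged sketch, not the linked repo's code; the schema, iteration counts, and resulting numbers are arbitrary and would need tuning to match real workloads:

```python
import os
import sqlite3
import statistics
import tempfile
import threading
import time

db = os.path.join(tempfile.mkdtemp(), "kv.db")

def connect():
    conn = sqlite3.connect(db, timeout=5)
    conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
    return conn

# Seed a single key that the writer will hammer and the reader will poll.
conn = connect()
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('key', 'v0')")
conn.commit()

stop = threading.Event()

def writer():
    w = connect()  # each thread gets its own connection
    i = 0
    while not stop.is_set():
        w.execute("UPDATE kv SET v=? WHERE k='key'", (f"v{i}",))
        w.commit()
        i += 1

latencies = []

def reader(iterations=2000):
    r = connect()
    for _ in range(iterations):
        t0 = time.perf_counter()
        r.execute("SELECT v FROM kv WHERE k='key'").fetchone()
        latencies.append(time.perf_counter() - t0)

wt = threading.Thread(target=writer)
wt.start()
reader()
stop.set()
wt.join()

# 999th of 1000 cut points approximates the p99.9 read latency.
p999 = statistics.quantiles(latencies, n=1000)[-1]
print(f"p99.9 read latency: {p999 * 1000:.3f} ms")
```

WAL mode is what makes this interesting: readers see a consistent snapshot and never block on the writer, which is exactly the property the parent is claiming a careful 2011-era evaluation would have surfaced.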

(That said, LMDB is a great choice and I've recommended it here on HN before... except they ended up replicating on top of it a lot of what SQLite provides for free. Their transaction logs are almost like SQLite's WAL, which I used in my benchmark, except going with SQLite would have skipped the entire second half of their Quicksilver solution, since they wouldn't need to manually handle transaction logs, split payloads, and reassemble them to avoid fragmentation.)

No, it just shows a lack of expertise in this particular domain, and they had to rediscover a lot of things because of that [1]. But it's not that different from software engineering anywhere else. This is pretty much what software engineering is: making a lot of assumptions, mostly wrong, trying them out in production, learning what was wrong, redesigning with a lot of new assumptions, and repeating the cycle again and again.

[1] For example, strong eventual consistency approaches can solve pretty much all of their problems with fast, reliable replication, but that requires engineers familiar with CRDT implementations and the research in distributed systems, a pretty time-consuming field.
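For concreteness, the simplest such CRDT is a state-based last-writer-wins register: every write carries a (timestamp, replica_id) tag, and merging keeps the value with the largest tag. Because the merge is commutative, associative, and idempotent, replicas converge no matter the order or number of state exchanges. A toy sketch (not a production design, and not what Quicksilver does):

```python
class LWWRegister:
    """State-based last-writer-wins register. Each write is tagged with
    (timestamp, replica_id); merge keeps the value with the largest tag."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.value = None
        self.tag = (0, replica_id)

    def set(self, value, timestamp):
        self.value, self.tag = value, (timestamp, self.replica_id)

    def merge(self, other):
        # Commutative, associative, idempotent: safe to apply in any
        # order and any number of times, so replicas always converge.
        if other.tag > self.tag:
            self.value, self.tag = other.value, other.tag


a, b = LWWRegister("a"), LWWRegister("b")
a.set("config-v1", timestamp=1)
b.set("config-v2", timestamp=2)
a.merge(b)
b.merge(a)                 # exchange state in either order
print(a.value, b.value)    # both converge to "config-v2"
```

The trade-off the parent alludes to: designs like this get you replication without coordination, but choosing tags, handling clock skew, and bounding metadata is exactly the research-heavy work most teams don't have time for.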

I feel that learning from (bad) past choices to build a better solution is good engineering practice? Sure, the initial DB choice didn't scale, but they learned from it, and they seem happy with how they built their new DB now.
