Hacker News new | past | comments | ask | show | jobs | submit login
Cassandra at Apple: 1000s of Clusters, 300k Nodes, 100 PB (twitter.com/erickramirezau)
292 points by mfiguiere on Oct 8, 2022 | hide | past | favorite | 212 comments



Couple of things as always:

Cassandra works really bad with fat nodes (lots of data on one node), and much much better with a lot of small nodes, and 100PB with 300K nodes confirms this. Scylla scales better vertically, but don't know how much.

Some comments are already comparing this to pgsql/mysql/whatever. Please don't. You can't make the same queries even though the language seems to support it.

Cassandra is good at ingesting data, bad at deleting, really really bad at anything remotely relational. Errors are almost pointless.

I'm going to point at an older comment of mine on cassandra: https://news.ycombinator.com/item?id=20430925#20432564

The takeaway should be: Yes, cassandra/scylla can be really fast and scale a lot. But it is also very probably unusable for your use case. Don't trust what the CQL language says you can do. Don't get me started on how bad the CQL language is, either.


Your statement about data propagation in that linked comment is at least misleading. A write at quorum will always be visible instantly to a read at quorum.


I only wrote that quorum is not a transaction (no transactions in cassandra), and that is not a consensus.

While quorum looks a lot like consensus, it is not since what is returned to the client is the latest timestamp. Different nodes could have and return different data. So Quorum write+local_quorum read might fail even if you are in the datacenter that accepted the write. Quorum is also on the total copies of the data, not a quorum of DC, so in certain (weird) multi-DC setups you could have a quorum in a single DC.

In general though I think that the data consistency options of cassandra (quorum/local/N) are a good idea but underdeveloped.

Anyway my point was: cassandra has too many pitfalls and eliminating those restricts the use case by a lot more than people realize. Plus the naming of all features look designed to trick you into thinking soewthing else


No, you wrote that you have to write and wait several minutes for data to replicate. This is straight-up false.

Yes, if you mix consistency levels incorrectly you will do it wrong and maybe get stale data, but that is a different criticism. I agree that it is easy for unsophisticated users to incorrectly use consistency levels in complex topologies, and I hope we will introduce mechanisms to prevent users making such mistakes in future. But that was not your claim, and in my experience users do understand consistency levels just fine.

There are lots of valid criticisms to point at various use cases with Cassandra, but this was just incorrect.


it’s worse than waiting actually. Because client A might write value « 1 » and after client B write value « 2 ». Then client C can read Value 2 for 5 minute and then cassandra internal read-repair eventually put value 1 on all replica and value 2 is lost forever.


No, that is literally impossible.

If value 2 is newer than value 1, value 1 will never overwrite value 2. If a client reads value 2 at QUORUM then it will always be seen by all future queries.


Client À write value 1 on 2 of 3 node with timestamp=3

then later client B write value 2 on 2 of the 3 node with timestamp=1

then read repair happen and (value=1 timestamp=3) is written to all 3 nodes.

this is only one of many scenarios where this stupid design fail.


Client B’s write will only be successfully read from 1 of the nodes it wrote to, and read-repair only runs on QUORUM reads. So, no, it will not be possible to read value 2 for 5m - it will never be visible to operations at QUORUM.

This would have been a valid criticism of LWW (and there are other more contrived examples), but I think (or hope) this is an explicit trade off made by anyone using Cassandra in eventual consistency mode. There are strategies to prevent this being a problem for workloads where it matters, some discussed elsewhere in the thread.


Since "client B write" was successfully written on 2 of the 3 nodes. Any read request that read only from 2 of the 3 nodes instead of using quorum read, will be able to see what Client B wrote until it magically disappears.

Quorum Read-repair is only one reason for why the value would randomly disappear. Another one is periodic anti Entropy repair!


No, it won’t. Successfully written does not mean what you think it means. If there is “newer” (by timestamp) data on disk or in memtable then it will not be returned to a client, regardless of which order that data arrived. It is unlikely even to be written to disk (except the commit log).

Since at least one of those nodes has the “newer” value, only one node can serve this “older” value


> Different nodes could have and return different data.

Only if they miss some writes, and eventually they will converge. But if you do quorum writes and quorum reads (or local quorum W + local quorum R), this guarantees you'll read from at last one node that received all the writes issued before the read, so you get the converged value immediately, regardless of which node you ask. All nodes will eventually agree on the value, because timestamps are assigned by the coordinator or by the application, not at the replica. A single write will get the same timestamp across all replicas.

Incorrect timestamps can cause a different problem - a write that happened at physical time T2 > T1 might be considered to be older than T1 by the cluster, if it was accepted by the coordinator whose clock was set in the past. Such write might simply not take any effect, as old updates would be considered newer. However, again, the resolution would be consistent on all replicas once they get all updates.


eventually they will converge on a « random » value! that’s why we say it’s write quorum is not consensus it’s offer very weak guarantee


No, they will converge on a value with higher timestamp. And timestamps are controllable by your app, so you can guarantee they are monotonic.


timestamp is coming from the operating system where the client library is running. so unless all your write request are issued by the same machine then

you are guaranteed to face the problem where the client machine clocks are not in sync.

just comparing timestamp is obviously a design from someone that didn’t review the academic literature on distributed transaction and consensus


>> timestamp is coming from the operating system where the client library is running.

The client can set the timestamp to any value of its choice. It does not have to correspond to clock time.

>> so unless all your write request are issued by the same machine then you are guaranteed to face the problem where the client machine clocks are not in sync.

It's not about all writes. It's about ensuring that all writes to a single partition (specifically, a single column within a single partition) are done using a source of monotonic integers to ensure ordering.


You are correct that if you have a singleton server returning monotonic time to clients this would work.

But this Monotonic time server is not part of Cassandra itself and for this reason majority of the people using Cassandra will use the OS time without knowing this is silently corrupting the database.


What happens if you send the 'write' to 3 nodes but only 2 are up?


Nothing bad? That’s the whole point of quorum reads and writes. You write to at least two, and read from at least two (one typically being a digest-only to corroborate the result).

This is all done for you by Cassandra


It depends on the consistency level you set for writing.

If you have 3 nodes, set quorum cl, the write will succeed, because quorum of 3 is at least 2.

If you explicitly require cl of three, and only two nodes answer, write will fail.


"write will fail"...but the data will still be there (you can SELECT it normally) and will be replicated to the node that was down once it is up again.

Unless the node was down too much and could not fully catch up before a set time (DB TTL if I remember correctly), in which case the data might be propagated or not, and old deleted data might come up again and be repropagated by a cluster repair.

So much fun to maintain


I know this is HN and people love to troll, but it makes me sad when I see you using falsehoods to take a steaming dump on a group of engineers that are obsessed with building a database that can scale while maintaining the highest level of correctness possible. All in Open Source... for free! Spend some time on the Cassandra mailing list and you'll walk away feeling much differently. Instead of complaining, participate. Some examples of the project's obsession with quality and correctness:

https://cassandra.apache.org/_/blog/The-Path-to-Green-CI.htm...

https://cassandra.apache.org/_/blog/Finding-Bugs-in-Cassandr...

https://cassandra.apache.org/_/blog/Testing-Apache-Cassandra...

https://cassandra.apache.org/_/blog/Introducing-Apache-Cassa...


sorry to disappoint you but there is a reason Cassandra tried to add LWT using paxos.

it mostly work but it’s extremely slow and still have many bugs.

https://www.datastax.com/blog/lightweight-transactions-cassa...


That blog was posted 9 years ago. Needless to say, a lot of improvement and engineering has happened since then. The limited use cases of LWTs will soon be replaced with general, ACID transactions.

https://thenewstack.io/an-apache-cassandra-breakthrough-acid...


This new « accord » thing look very promising! If this get merged into Cassandra this change everything!

A prototype has been developed that has demonstrated its correctness against Jepsen.io’s Maelstrom tool


LWTs were added for a very simple reason: stronger isolation where the performance trade-off makes sense. Nothing to do with the GP comment.

Until recently they were indeed slow over the WAN, and they remain slow under heavy contention. They are now faster than peer features for WAN reads.

However, the claim that they have many bugs needs to be backed up. I just finished overhauling Paxos in Cassandra and it is now one of the most thoroughly tested distributed consensus implementations around.


LWT is still the only way to guarantee your write don’t random disappear and that conditional update are really conditional.

LWT still have awful performance compared to a write request with same guarantee in any other database.


> and old deleted data might come up again and be repropagated

That pretty much describes iCloud.

Argh! Zombies!

iCloud has terrible syncing. Here’s an example. Synced bookmarks in Safari:

I use three devices, regularly; my laptop (really a desktop, most of the time), my iPad (I’m on it, now), and my iPhone.

On any one of these devices, I may choose to “favorite” a page, and add it to a fairly extensive hierarchy of bookmarks, that I prefer to keep in a Bookmarks Bar.

Each folder can have a lot of bookmarks. Most are ones that I hardly ever need to use, and I generally access them via a search.

I like to keep the “active” ones at the top of the folder. This is especially important for my iPhone, which has an extremely limited screen (it’s an iPhone 13 Mini -Alas, poor Mini. I knew him well).

The damn bookmarks keep changing order. If I drag one to the top of the menu, I do that, because it’s the most important one, and I don’t want to scroll the screen.

The issue is that the order of bookmarks changes, between devices. In fact, I just noticed that a bookmark that I dragged to the top of one of my folders, yesterday, is now back down, several notches.

Don’t get me started on deleting contacts, or syncing Messages.

I assume that this is a symptom of DB dysfunction.


This is a symptom of DB dysfunction only if you are using just one device.

Otherwise it's far more likely to be faulty sync algorithm.


You can only SELECT at QUORUM if you are guaranteed to be able to SELECT it again at QUORUM, ie it must be durable (and the QUORUM read will ensure it if the prior write did not).

If you are selecting at ONE then yes, you can expect stale replies if you contact a different node. That is what the consistency level means; that just one node has seen the write.


At application level, you’ll probably retry, and you’ll probably select with the same cl you’ve written with.

The other behaviour you’re talking about is hinted handoff.

“So much fun to maintain” -> cassandra has very specific use cases. And hosted versions exist.

Disclosure: I work at Aiven, which has an hosted cassandra offer.


And nobody does reads at quorum, they're slow. People do reads at closest, even when they really need quorum.


I don't think that's true. Can't speak for everybody but for the stuff I worked on. Quorum reads at RF=3 are only twice as slow as reading a single node and it's pretty practical, if you care about it, to write and read using quorum writes/reads. It's true there's many applications where you're ok with some time not reading your latest data, and for those you can get somewhere better performance/latency for a given scale by going CL=ONE...


Unless they want "read your writes", otherwise bugs start appearing "why isn't the data i just put showing"?


What is wrong with CQL? The only gotcha I am aware of is UPDATE/INSERT really just meaning UPSERT.


In no particular order:

* LWT (Lightweight Transactions) are nothing like what you expect when you head "transactions". LWT are single row only.

* error handling is asinine. E.g: A quorum insert fails. is the data there or not? Maybe!

* material views. Added, removed, readded, deprecated.

* big limits on secondary indexes which only work inside each single partition

* different format of timestamp for insert and select

* BATCH is atomic, but not isolated.

...In general CQL feels like a hack. In classic SQL you understand early that you can compose queries, subqueries and such. CQL leads to you to believe that this is possible, only for you to find out that it isn't. The docs are full of stuff like "You can do this query, except in this case, unless this other case in which you can again, but not in this other". It's just exception on top of exception. Even the documentation does not list all exceptions (I managed to find a couple in my last work).


> LWTs

Single partition only. Poorly named, sure. Feature-wise though pretty comparable to peer offerings. Fortunately general purpose transactions are coming next year.

> material views. Added, removed, readded, deprecated.

No; added, then marked experimental. Never removed or deprecated, though they may be superseded before long.

> error handling is asinine

Fundamental distributed systems problem that is just more apparent with eventual consistency. Failure does not have a certain outcome.


please tell me more about General purpose transaction! I really hope they work Kyle Kingsbury.


> A quorum insert fails. is the data there or not? Maybe!

How is this point fundamentally different from SQL? In SQL, after you send COMMIT to a remote database and get back a network error, you don't know if your INSERTed data is committed or not. You'll have to reconnect and check for it.


In CQL queries can fail even if your connection is still up and running.

In this case it means that it could not write to 3 nodes, but it is not telling you if it did not write anything or at least once. Writing just once means that the data will be replicated.

So if only 1 of the 3 nodes is up, it will still write there and then return error, even if the coordinator knows that the other nodes are not up.

Queries have a maximum running time (only configurable at the database level if I remember correctly), and if your write exceeds this, it returns error. The replication still goes on and the data is still there.

Network errors are different category, since you might not even know if your command was sent or not. Cassandra kind of sidesteps the network errors though since the client connects to multiple coordinators in the cluster.


They are not a different category. This is a distributed database, network errors and node failures are a fundamental part of its function.

Fundamentally every distributed protocol has a moment where it may have to tell the client it has abandoned the operation, but already has in flight messages to remote replicas that would result in a decision to complete the operation.

Before that operation is answered by the replicas, the operation is in an unknown state. It is fundamental. Some slower approaches to reaching decisions may mask this problem more often, but it is there whether you realise it or not.


I think the concern there is that these different failure modes are all given the same error/exception messaging? The caller could make different decisions based on these different potential outcomes, but only if the caller receives specific errors. It's been a while since I've used Cassandra, so if the error handling has improved along those lines then I apologize.


A timeout is pretty much always an unknown outcome. Cassandra does have dedicated exceptions for failed writes (which should have a certain outcome of not applied) but this is much rarer in practice.

Cassandra does today inform you on timeout how many replicas have been successfully written to, if it was insufficient to reach the requested level of durability, which many years ago it did not. But this is no guarantee the write is durable, as this will represent a minority of nodes - and they may fail, or be partitioned from the remainder of the cluster. So the write remains in an unknown state until successfully read at QUORUM, much like a timeout during COMMIT for any database.


It encourages you to think you can use secondary indexes to make different types of query over the same data like SQL, which you mostly can’t.


And not to mention, for 90% of the use cases out there Postgres or MySql on aurora or whatever will be more than enough


If I know what my access patterns are, DynamoDB is often the better choice.


More like 99,9% of use cases. But yeah.


What you're saying applies more or less no any NoSQL database. Unusable for your use case is a bit of a big statement. There's plenty of cases where you don't need the power of relational databases and relational databases can't/don't typically scale horizontally the way NoSQL can. And really NoSQL can because of its attributes. If all you need is key/value lookups or time series, you're not looking for any joins or random reads that don't agree with the native ordering/partitioning then it's hard to beat NoSQL scalability/ingest-wise.


True, which is why I said that cassandra has its uses.

The problem is that the documentation and the language tries really hard to sell you something else, and I have seen multiple people wasting way too much time on this.


I agree, but I think the part that's not stated is that when picking such a technology, you basically commit to not needing joins or random reads. If you do, as the application evolves, you're really out of luck. Generally, that commitment cannot be made upfront.

Typically, solutions like time-series or document databases are deployed alongside other components in a much larger system. If you are looking for a single database solution, then maybe NoSQL is not for you


why settle for noSQL when we can use newSQL? Cockroachdb, CitusDB, TiDB,Yugabyte …


I don't think those have the same performance characteristics. At the end of the day you can't have your cake and eat it. You can't keep large scale indices that facilitate fast joins without writing them out to disk and you can't keep them in sync with your data without transactionality/atomicity. This boils down to tradeoffs in terms of locking/consensus/scale/indices/data locality. That said if you're going to be emulating this over noSQL anyways then there's an argument for using something that does it for you (assuming it meets your requirements).


This is definitely about tradeoffs. But there is 2 tradeoffs spectrum, 1- Consistent (strict serializability + global transaction) vs Eventual consistency. 2- Range query and Join vs KeyValue only

FoundationDB provide a Consistent databse but By default only a KeyValue API.

(CitusDB,Vitess..) give you Consistency but only within a partition, but also give a nice SQL API.

(Cockroachdb/Yugabyte) provide a sql api with a consistent database.

Cassandra gives you Eventual consistency with the Key-value API.

My experience has been that the performance penalty for using a Consistent database instead of one that will silently corrupt your data is small "less than 10% slower".

As for the cost of using Join and Index, no once is forcing you to create index and send query that use join. Its 100% opt-in.

But still Google Spanner and CockroachDB made it very cheap to do join between parent entity and a child entity table but using interleaved table. And using RocksDB and good SSD, the cost of index maintenance is not as bad as it appear to be.


Cassandra has tunable consistency. It also has some, terribly slow, lightweight transactions. If you have replication, which you need for HA/DR, then the cost of writes is pretty much the same, data has to be written to all 3 replicas. The question is, for the most part, when do you ack the writes. So on a single site Cassandra you can e.g. write 3 replicas, read 2 back, and you have a consistent database (for a given key). That's not sufficient for ACID though.

I'm honestly not familiar at all with CitusDB or Vitess in terms of where they sit in their ability to handle failures, scale out, transactions etc. or where they sit in the CAP story. I'm sure Spanner is paying the tax somewhere but it's just that Google can back it with infinite resources to give you the performance you need.

If there was a database that offered ACID, HA/DR, and scaled horizontally perfectly at a cost of 10% more resources then I doubt anyone would be using stuff like Cassandra, ScyllaDB or HBase... I don't think that exists though but I'll look up those other databases you mentioned.


It will exist next year. Cassandra is getting HA ACID transactions (depending how you define C; we won't be getting foreign key constraints by then) that will be fast. Typically as fast (in latency terms) as a normal eventually consistent operation. Depending on the read/write balance and complexity, they may even result in a reduction in cluster resource utilisation. For many workloads there will be additional costs, but it remains to be discovered exactly how much. I hope it will be on that sort of scale.

>It also has some, terribly slow, lightweight transactions

FWIW, LWTs are much (2x) faster in 4.1 (which has been very slowly making its way out the door for a while now... the project is mostly already more focused on 4.2, so pushing 4.1 out the door is tortuous for procedural reasons)


cool, good to know. There's still the relational vs. no-relational question though for many use cases but having better transactions opens the door for more use cases.


Right with real transaction you could maintain your own index if you really need them.

And to have good performance you should never join entity that have distinct partition key. Otherwise you end up reading too much data over the network.

I’m honestly very happy Cassandra will get real transaction. But i still need to understand the proposed design. Will it be possible to have serializability when doing Range scan query?


> when doing range scan query

Yes, but probably not initially. The design supports them, it’s just not very pressing to implement since most users today use hash partitioners. There will be a lot of avenues in which to improve transactions after they are delivered and it is hard to predict what will get the investment when.

Proper global indexes are likely to follow on the heels of transactions, as they solve most of the problem. Though transactions being one-shot and needing to declare partition-keys upfront means there’s some optimistic concurrency control required. Interactive transactions supporting pessimistic concurrency and without needing to declare the involved keys will also hopefully follow, at which point indexes are very trivial, but the same caveat as above applies.


Yes, Cassandra will still not be suitable for all workloads.


> older comment of mine on cassandra: https://news.ycombinator.com/item?id=20430925#20432564

Wow, that's surprisingly bad

For real, I'm not against java, but I find time and again that those services that depend on the JVM have lots of "weird stuff". Not because Java is bad but Java fans (which you usually are when you start to develop something this big) think a lot of this crap is acceptable.


In my (extensive) experience in infrastructure. When people say the JVM is the problem, the JVM is never the problem. It's usually just a symptom of something else and lazy ops people just want to blame something and throw up their hands. I've never had to "tune a JVM" to make things work.

I'll give you an example in Spark. We had a huge job that was failing and after checking the logs, it was when results were being spooled to disk. More log diving showed a lot of GC on every node. At that point we could have gone down the route of tuning something in the JVM, but more digging found the real culprit. IOstats when the jobs ran showed while reading data, writes were completely blocked and write latency was in the 100s of ms. The spark executor trying to dump data was blocked and the first symptom that things were falling apart was... GC. We changed the scheduler on the nodes and magically everything worked great. The VM in JVM is virtual machine. Same rules for resources apply and if you run out of resources, don't expect the mythical ops faeries to save you.


Can you expand on which scheduler you swapped out? OS? Spark? Something else?


Sure I can answer both. The OS scheduler CFQ is generally bad for high volume disk applications. Used deadline in this case. Amy's Guide is still a treasure trove of info and a solid recommended read: https://tobert.github.io/pages/als-cassandra-21-tuning-guide...

If you are using Spark in a bare metal cluster with spark-submit, the above advice applies. In Kubernetes, never use the default pod scheduler with analytic workloads. Great choices are Volcano(https://volcano.sh/en/) and Yunikorn(https://yunikorn.apache.org/). Also great and evolving projects to support and contribute if you can.


300k Cassandra nodes seems a bit over the top even for a company with as many active devices as Apple.

https://www.theverge.com/2022/1/28/22906071/apple-1-8-billio...

1.8B active devices / 300k nodes = (just) 6k devices per Cassandra node


Worked there, managed a lot of the teams. An agent is considered a node. So you could have X nodes per server if that clears items up. Even agent per X jbod. The teams are very smart on blast radius and sharding. When I left I had exabytes of data under my mgmt and happy to chat if anyone wants to DM. Folks often forget how many users and services Apple has.


I'd absolutely love to chat and get a deeper understanding of that! Didn't see a way to contact you in your profile — what platform would you prefer to be DM'd on?


+1 would love to hear more from GP.


+1


Added, thought I had in about.


What’s the orchestration setup like for clusters of that size?


Apple has a lot more data than just a list of devices.

There is everything from Weather to Siri to Store Purchases etc.

And companies will syndicate data sets to different teams for performance and security reasons ie. lots of duplication.


> Apple has a lot more data than just a list of devices. [...]

Of course. That is not the point here.


Perhaps you'd be better convinced with a service breakdown.

Breaking monoliths into service boundaries yields easier ownership, maintenance, migration, and resilience.

One "tiny" company with a few verticals can be comprised of thousands of microservices, each handling their own dedicated objective. Authentication, reverse proxy, API gateway, SMS, email, customer list, marketing email gateway, CMS for marketers on product X, feature flags, transaction histories, GDPR compliance handling, billing intelligence, various risk models, offline ML risk enrichment, etc. etc. Each will have its own data needs and replication / availability needs.

This Apple number might seem crazy, but I'm not phased by it. I can picture it.


I can also picture it, but not really in the way you're outlining it.

It's a sad and very inefficient picture though. Apple does not need this this much data processing. It's a grotesque amount per device. My most positive plausible interpretation is that maybe they're just wasting insane amounts of energy doing lots and lots doing of stupid analytics, as one tends to do.


> It's a grotesque amount per device

Again. It is not just for device data.

There are backend services which your device interacts with e.g. Maps, Siri, Weather.


Sometimes things have to be built as layered abstractions in order for humans to reason about them at scale.

See also the natural stochastic gradient ascent that produced our crazy complicated metabolic pathways (and all of biology).


I was also surprised by this. 300K nodes for a distributed DB is kind of crazy. I’ve worked with similar systems but they stored much more than 100 PB with 10x less nodes

Apple is using less than one TB per server…

But when you see the 1000s of clusters it starts to make sense. They probably have a Cassandra cluster as their default storage for any use case and each one probably requires at least 3 nodes. They’re keeping the blast radius small of any issue while being super redundant. It probably grew organically instead of any central capacity management


What you describe is best practice for older versions of Cassandra with older versions of the Oracle JVM on spinning disks. And at this time Apple already had a massive amount of Cassandra nodes. Back when 1TB disks were what we had started buying for our servers. Cassandra was designed to run on large numbers of cheap x86 boxes, unlike most other DBs where people had to spend hundreds of thousands or millions of dollars on mainframes and storage arrays to scale their DBs to the size they needed.

Half a TB per node, which during regular compaction can double. And if you went over, your CPU and disk spent so much time on overhead such as JVM garbage collection that your compaction processes backlog, your node goes slower and slower, your disk eventually fills up, and it falls over. Later things got better and you could use bigger nodes if you knew what you were doing and didn't trip over any of the hidden bottlenecks in your workload. Maybe even fixed in the last few versions of Cassandra 3x and 4.0.


What psaux mentioned makes more sense. A node == one Cassandra agent instead of a server.

Past 100k servers you start needing really intense automation just to keep the fleet up with enough spares.

If you’ve got say 10k servers it’s much more manageable

The fun thing is Cassandra was born at FB but they don’t run any Cassandra clusters there anymore. You can use lots of cheap boxes but at some point the failure rate of using soo many boxes ends up killing the savings and the teams.


Yes, you can run multiple nodes on a single physical server. However, then you have the additional headache of ensuring that only one copy of data gets stored on that physical server, or else you can lose your data if that server dies. Similar to having multiple nodes backed by the same storage system, where you need to ensure losing a disk or volume doesn't lose two or more copies of data. Cassandra lets you organize your replicas into 'data centers', and some control inside a DC by allocating nodes to 'racks' (with some major gotchas when resizing, so not recommended!). Translating that into VMs running on physical servers and shared disk is (was?) not documented.


> The fun thing is Cassandra was born at FB but they don’t run any Cassandra clusters there anymore.

Isn't Intragram mostly Cassandra?

https://instagram-engineering.com/open-sourcing-a-10x-reduct...


It wasn’t when I last saw it. Rocksandra ended up being a stepping stone to fbs most common distributed db, zippydb https://engineering.fb.com/2021/08/06/core-data/zippydb/

Zippydb is honestly one of the best parts of fb infra. It let you select levels of consistency vs latency


> Zippydb is honestly one of the best parts of fb infra. It let you select levels of consistency vs latency

How is that different from Cassandra's Tunable consistency model?

https://cassandra.apache.org/doc/4.1/cassandra/architecture/...


Seagate introduced 2TB drives no later than 2010.


Interestingly, using the highest capacity drives at any point in time would work even worse since they spun slower and slower sequential write speed. If you could get them from your preferred vendor, which seemed to be several years after introduction for us!


I'm wondering if this refers to "virtual" nodes which are not coupled with physical nodes.


I believe this is the case, with many smaller nodes per physical host.

I’ve seen this type of design pattern implemented successfully with a variety of extremely large databases.


Yeah, I could see that even outside of the context of the Cassandra virtual nodes primitives.

These could be K8s nodes for example, that don't make full utilization of the underlying VM, which would completely make sense at APPL scale.

For some background to other HN users out there, "virtual" nodes refer to logical nodes in distributed database software where # of virtual nodes >= physical nodes. This means if the size of data passes a certain threshold, physical nodes can redistribute the virtual nodes, reducing the amount of data shuffled across physical nodes (as opposed to a naive hash function that mods a key by a fixed number and requires all nodes to reshuffle data when a new node is added).


Or it tells us something of how much data is being scooped up per device. Certainly when I look through the raw health data collected it’s quite alarming and I’m sure that’s just a drop in the ocean.


Well, Health data can be uploaded to iCloud (CloudKit), but it's End-to-End encrypted so not really a concern.

Unlike other data in iCloud, if you lose your devices you lose your HealthKit data. This is not true for photos or emails, for example - which you keep if you lose your devices.


Why do you think the raw health data is getting sucked off your device? That would be totally off brand for them.

Apple does have a separate opt-in “Research” program to facilitate this kind of thing.


Regardless of their current brand, Apple is the next big advertising giant and no amount of brand purity is going to change this. The data of Apple's users is simply of too high value for Apple to ignore forever.


Makes me think of that first decade (98-08) when Google actually wasn't being evil. Yeah, it's inevitable that Apple will turn to this when they can't grow any more simply by raising the prices of their devices. Perhaps they have reached that point about now with iPhone 14.


Google’s business model is built on monetizing user data. Apple’s was not. Maybe that will change.


> Apple is the next big advertising giant

No it's not.

There is no evidence whatsoever that Apple is doing anything other than facilitating ads through their News, App Store, Maps etc properties.

Their revenue will be an insignificant fraction of Facebook, Google etc.


That still creates the same conflict of interests with regard to targeting data that made Google turn consumer hostile over time.


I’m not suggesting apple is doing nefarious, it was simply an example of a large dataset. I assume they’re encrypting/scrubbing all your PII as that’s their whole motto.

But unless they’re doing some voodoo magic then yes the data is leaving is your device in some form. Hence why I can view every heartbeat (aggregated by the minute) since I put on the original Apple Watch in 2015 despite changing all my devices and only restoring from iCloud. Indeed I expect it’s just part of your iCloud data storage.

Actually I just launched the health app now and I can see the app explicitly asks if you want to allow sharing your data with apple, so if you say ‘yes’ then you’re not only storing but allowing apple to query your data (minus PII).


It's also off-brand for Apple to join PRISM and comply with thousands of annual requests for supposedly-inaccessible iCloud data. Neither of you will ever be proven right until we look inside those servers though, so making any conclusive statements is a mistake. Apple designed Schrodinger's datacenter.


Apple didn’t ‘join’ prism. That’s a standard piece of misinformation.


The operational complexity of managing thousands of clusters must be mind boggling. I've been on two projects managing dozens of storage clusters, the second with more data than this in some individual clusters and adding up almost this many total nodes. There were technical problems with scaling up to 10K nodes per cluster, but the operational issues mostly scaled according to number of clusters. For example, how many alerts per hour/day can you stand? Too many and you're overwhelmed; too few and you miss stuff. Walking that fine line became successively more difficult as clusters were added. Same thing with graphs and dashboards. Also, when your storage and IOPS are siloed this much you have no elasticity, so you're going to be chasing capacity or load problems much more often. On the plus side, this probably means each tenant has their own cluster, so you don't have so many worries about them affecting each other.

The big question I'd have is: how many people (including on client teams) does it take to manage this much sprawl?


> how many people (including on client teams) does it take to manage this much sprawl?

Being remotely familiar something with similar complexity, I say "very few". Automation is not one time, it is ongoing. You start with something small, and you build, and build, and build. I saw a presentation from maybe 5-6 years back talking about Apple Cassandra, and have talked to some engineers too when they used to have DataStax Cassandra summits. Their automation has probably scaled up from those days.

Maybe their infrastructure grows by X% every year (X probably being a double digit number) and their automation increases their productivity / reduces their overhead by X/2%, it allows them to still retain a huge growth trajectory at a smaller increase in people costs. Improve productivity by X% or more (with new automation), and you have hit a home run.

And they should not be asking "by what percentage should I optimize", but simply "where do we spend effort today and how can we reduce it" ... which any business or group should be doing it anyways.

The biggest challenges with such things come when management changes strategy for various reasons because then you lose a year or 2, and then you fall really behind and need to do so much catch up just to catch breath.


One of the hard parts of quantifying that, is you have people who wear many hats. So sure you have Cassandra gurus, and probably a decent number of them. But this is in the league where really hardcore automation kicks in to keep staffing sane, and operations possible. But outside that, how do you count datacenter folks, client folks, networking folks, etc who only spend a little fraction of time on the database parts.

But I think I can say with sufficient knowledge, all things considered still, “way fewer than you might think”.


At this scale you automate away 99.9% of the things you respond to.

We are not even near this scale yet at my place of work, and we are moving towards this methodology of strong automation to orchestrate a cluster. We hire software engineers to operate our database clusters, and the expectation is to somewhat be selfishly motivated to write programs to remediate issues so you don't get paged constantly. We do not expect to grow our headcount proportionally to the number of clusters or nodes we operate.

You must treat your nodes like cattle not pets. If a node fails, automation kicks in and re-bootstraps it. It is not worth figuring out how to nurse it back to health. When you are performing rolling or scale up operations on the cluster you are just invoking automation to do everything for you.


> You must treat your nodes like cattle not pets.

Do not cite the deep magic to me. I was there when it was created. Really. I know all about Randy's "cattle and pets" metaphor, and automated responses, and not nursing individual nodes back to health, and I had already accounted for those in my comment. I did mention that I'd been on a team responsible for that many nodes, right? <checks> Yep, I did.

> If a node fails, automation kicks in and re-bootstraps it.

You think we didn't have such systems at Facebook? Yes, we did. And yet, amazingly, there were various kinds of failure or degradation that they didn't always catch or respond to correctly. These kinds of systems only work for syndromes that have already been seen, understood, and coded for. When you have a hundred thousand nodes or more, some nodes regularly fail in novel ways or get stranded in weird automation states. That requires human time to sort out.

> you are just invoking automation to do everything for you

Maybe that works for teeny-tiny clusters. I hope for your sake that it does. For 10K-node clusters, it's not uncommon for that "rare" condition to occur a few times, requiring a human to prod things along a bit. Yes, the automation does 99% of the work, but 1% can still add up pretty quickly.

> We do not expect

Only time will tell whether your expectation is realistic. Maybe it is, at your scale. It wouldn't have been at the scale I worked at, which seems a lot closer to what the OP is about.


Funny how every single reply missed the part where I mentioned I have worked at this scale. Larger in some ways. I know what that kind of automation looks like, and what it can do for you. My point is that, even with that, managing thousands of smaller clusters is going to involve way more operational work than running fewer larger clusters (based on appropriate technology). Try to address that, not the strawman of someone who might suggest trying to do this without modern automation.


They for sure have in-house software to handle these nodes. The software may be pretty good at what it does considering they know what their own DBs are susceptible to, that might reduce a whole bunch of the human management and the team they need may be not that big but likely highly skilled.


I'm going to guess somewhere between 50 and 400 people. Smaller than 50: this is probably an org where "we maintain the source code and documentation. Running, debugging, and scaling your cluster is YOUR problem". Larger than 400: the org probably fulfills other functions besides "running thousands of production Cassandra clusters."

This is a lot, and is one of the big things that surprised me when I first joined a large organization. But here's an example breakdown in no particular order:

- 10 to 30 people for managing, contributing, and maintaining the open source aspect. Just a handful of engineers could contribute features and fixes to the project, but once the open source project and community gets larger it becomes a full time job. System component maintainers, foundation boards, committees, conferences, etc. add up.

- 10 to 70 people for "operations". As you mentioned, the load here tends to scale with the number of clusters (customers). At the large end this is several teams, with say a team dedicated to (a) fleet-management of the individual machines in a cluster, (b) cluster lifecycle management, and (c) macro level operations above the cluster level. Alerts can't all go to this team, so some of the work is writing alerts that go to customers in a self-serve model.

- 10 - 40 people for "scale projects." At this scale you have 1 - 10 customers that are on the verge of toppling the system over. They've grown and hit various system bottlenecks that need to be addressed. And you'd be lucky if they're all hitting the same bottleneck. With this many customers, it's likely that they've all adopted orthogonal anti-patterns that you need to find a fix for: too many rows, too many columns, bad query optimization, too many schema changes, too many cluster creates and destroys, etc. So you probably have multiple projects ongoing for these.

- 10 - 30 people for "testing infrastructure". Everyone writes unit tests, but once you get to integration and scale testing, you need a team that writes and maintains fixtures to spin up a small cluster for some tests and a large cluster for scale tests (which your "scale projects" teams need, btw). And your customers probably need ways of getting access to small test Cassandra clusters (or mocks of the same) for THEIR integration and scale tests, since Cassandra is just a small part of their system.

- 10 - 30 people for automating resource scaling and handling cost attribution. These may not be one function, but I'm lumping them together. "Operations" might handle some of the resource scale problems, but at some point it's probably worth a team to continually look for ways to manage the multi-dimensional scaling problem that large cluster software systems inevitably create. (Is it better to have few large nodes, or many small nodes?) You need some way of attributing cost back to customer organizations, otherwise you're paying $50M because one engineer on the weather team forgot to tear down a test cluster in one automated test 6 months ago and... You need to make sure that growth projections for customers are defined and tracked so you have enough machines on hand.

- I'll add that it'll be worth adding whole teams for some of the more complex internal bits of this system, even if the actual rate of change in that sub-system is not very high. At this scale organizations need to optimize for stability, not efficiency. You don't want to be in the situation where the only person who understands the FizzBuzz system leaves and now dozens of people/projects are blocked because nobody understands how to safely integrate changes into FizzBuzz.

- Things not covered: security, auditing, internal documentation, machine provisioning, datacenter operations, operating system maintenance, firmware maintenance, new hardware qualification, etc. Maybe there's an entire organization dedicated to each of these, in which case you get it for free. If not, some of your time needs to be spent on these. (Even "free" might have a cost as you need to integrate with those services and update when those services change.)


Spot on, and thank you. My second team was ~40 (might have peaked at ~50) split across four sub-teams, for software that ran at similar scale and was designed and developed to rely heavily on other in-house infra. Maybe half a dozen people on adjacent teams (including customers) who had more than trivial knowledge of our system. Some in our team were almost pure developers, some were almost pure operators, most were at various points in between.

I think the reason you and I (we know each other on Twitter BTW) are so at odds with some of the other commenters is that they haven't maxed out on automation yet and don't realize that's A Thing. Automation is absolutely fantastic and essential for running anything at this scale, but it's no panacea. While it usually helps you get more work done faster, sometimes it causes damage faster. Some of our most memorable incidents involved automation run amok, like suddenly taking down 1000 machines in a cluster for trivial reasons or even false alarms while we were already fighting potential-data-loss or load-storm problems. That, in turn, was largely the result of the teams responsible for that infra only thinking about ephemeral web workers and caches, hardly even trying to understand the concerns of permanent data storage. But I digress.

The point, still, is that when you've maxed out on what automation can do for you, your remaining workload tends to scale by cluster. And having thousands instead of dozens of clusters sounds like a nightmare. There are many ways to scale such systems. Increasing the size of individual clusters sure as hell ain't easy - I joined that team because it seemed like a hard and fun challenge - but ultimately it pays off by avoiding the operational challenge of Too Many Clusters.


I'll see if we can get permission to discuss this publicly.


Exactly.

- There's nothing scarier than "the automation had no rate-limiting or health-checking". Of course, what do we mean by automation? At some point it becomes impractical to slow every change down to a crawl, so some judgement is required. But "slow enough to mount a human response to stop it" is the standard I've applied to critical systems.

- Thankfully I've avoided having to support "real" storage systems. The challenges of "simple" distributed databases storing is enough for me. :-)

On the "pets vs. cattle" metaphor, I think most people fail to grok the second component of that. I don't think there are many successful cattle ranchers that will just "take the cattle out of line and shoot it in the head." The point of the metaphor is: When you have thousands of head of cattle, you need to think about the problems of feeding and caring for cattle operationally, not as a one-off.

Despite what https://xkcd.com/1737 might make one believe, people don't just throw out servers when one part goes bad, or (intentionally) burn down datacenters. What the "hyperscalers" do is decouple the operational challenges of running machines from the operational challenges of running services (or at least try to). Of course this results in a TON of work on both ends.


Just wanted to say thanks for understanding how hard this is.

It's a fun sub-thread to read.

As I mentioned elsewhere, I'll see if I can get permission to talk publicly about the actual numbers.


Interesting, I thought they were trying to switch over to FoundationDB but looks like their Cassandra usage keeps growing.


Apple acquired FoundationDB but from what I have heard it’s not really used much. The FoundationDB founders have left Apple and are working on other things. Cassandra is the main datastore for iCloud data.


Are you sure? The latest SIGMOD paper on FoundationDB says it's in active use. My impression is that iCloud is built on both Cassandra and FDB, and that FDB (via Record Layer in particular) is used extensively.


Well, since we're exchanging unsubstantiated rumors: from what I have heard it's exactly the opposite. FoundationDB is being used for transactional data and works very well. Its active development by Apple would seem to confirm that.


These are different technologies with different use cases. FDB typically is used when there are strict ACID requirements.

Apple uses both


I am really curious. It would make sense to use Cassandra as low logic, high volume simple storage. And then use something more relational / graph and ACID on top of it


We (I'm a co-founder) run the following: - 1B (10^9) rows/sec with 86 servers https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one... - 1PB with 20 servers https://www.scylladb.com/presentations/operating-at-monstrou...

- Comcast moved from 970 Cassandra nodes (and 60 nodes of cache) to 78 Scylla nodes - https://www.scylladb.com/2020/01/15/comcast-sprinting-from-c...

- Palo Alto Networks run 8600 clusters(!) of ScyllaDB https://www.scylladb.com/2022/06/14/how-palo-alto-networks-r...

ScyllaDB excels in throughput and latency, we have also a better compaction algorithm that saves 37% of storage compared to C*. Usually one can replace lots of small nodes with gigantic nodes that have more resources and it allows much better management.

To run 100PB Scylla will need more than 300 nodes, even thousands but definitely not what Apple throw at the problem.


There is no magic behind Scylla, mainly lots of hard work, hundreds of years of engineering, based on the former C* design which is based on Dynamo/Bigtable.

The JVM is part of the problem, not all of it. The main issue is that it hides the hardware and makes tracing harder - instruction level and block level. At Scylla we strive for efficiency, every operation is tagged with a priority class for the CPU and I/O schedulers. Folks are welcome to read the blogs about those topic. Lots of details and hard work


I’d be curious to know what the orchestration platform looks like for running 300,000 nodes and 1,000s of clusters.


I know some of their backend services use a lot of Ansible (for at least the automation / config portion), but I don't know which ones. Apple's devs are pretty mum on the specific stacks they're using, usually.


Over 2PB per cluster, thousands of clusters, but only 100's of PB of data.

What do they use this for? iCloud storage related stuff?


Facebook database guru Mark Callaghan posits that Apple’s Cassandra workload likely relates more to iMessage than iTunes, but whatever the project, it’s massive… and it’s not uncommon.

https://www.techrepublic.com/article/apples-secret-nosql-sau...


I thought Cassandra was bad at storing big files?


Correct, that’s what object stores are for. However, metadata on said files, is probably very handy to have in a database.

I’m quite sure not all this Cassandra capacity is just file/photo metadata storage either.


you can easily chunk big files.


It’s a horrible use of the tech. It’s like storing jpgs in postgres. You can do it, it’s just a really bad idea.


Depends heavily on the use case. I know multiple video related companies that chunked data for replays into 1s and 500ms segments.

Having many keys, means that you can preform thousands of asynchronous requests for small pieces of data and then piece it together on the client side.

Super low latency is just something else to optimize for.


But then you need to push these segments into partitions, and big partitions are really bad, especially for old versions of Cassandra… Although I met customers with partitions of size of 100Gbs…


As another commenter said, 500ms clip is super tiny. Each key is a a separate partition.

So each segment is a separate partition.


chunks will be evenly distributed between many partitions, no need to store chunks in one partition.


it depends, storing files in cassandra and jpgs in postgress allows you to not add another software with bugs and maintenance burden.

If it works - everything is fine.


It may work, until you start to perform maintenance things, like repair, or new node bootstrap, etc. Then it may fail with high probability


100’s of PB seems low for iCloud. iMessage, probably.


On the slide shown, it says "1000s of applications."


They use it for storing iCloud Photos without E2EE while heavily marketing privacy


They were moving towards E2EE when everyone freaked out about the on-device perceptual hashing trade off.


Well… you don’t actually need to make a computing device automatically report its owner to the authorities for a serious crime based on a provably flawed automated process, prior to implementing encryption E2EE for a cloud storage service. That was simply the strategy that Apple chose to pursue. Blaming the users for reacting poorly to this strictly anti-user approach is very backwards.


That wasn't what Apple was proposing though, they were quite clear [1] that it would be a human making the call to notify authorities, and only after a scoring algorithm had passed a threshold set pretty high (expected to be a trillion-to-one against false positives).

[1] https://web.archive.org/web/20211210163051/https:/www.apple....


Yes, which isn’t a massive privacy concern and is of course not also susceptible false positives. Oh, wait a minute, it is…

https://www.nytimes.com/2022/08/21/technology/google-surveil...

Anytime this feature does anything, in every case, it will be acting against the interests of the user. There is no requirement, legal or otherwise, to implement this feature in order to enable E2EE for a cloud storage service. To act as if there is is simple gaslighting.


shrug Every, and I mean every cloud provider is currently scanning every image uploaded to it.

If every single provider is doing that, then I'm going to think there's a good reason for that, and maybe (just maybe) there's something behind it. Some reason for it.

But I'm not here to change your mind - believe what you wish.


The claim that I was replying to was that the reaction to CSAM scanning was the reason Apple didn’t implement additional E2EE for iCloud storage. When in fact CSAM scanning was neither a legal nor technical requirement for expanding E2EE in iCloud.

Your question as to why so many companies gravitate towards the same anti-user patterns is both unrelated and pointless.


Or they could just deploy e2e without turning our devices into things that spy on us. It’s a false dichotomy.


Premise [1]: Anything uploaded to the cloud will be scanned for kiddy porn.

Corollary: There are precisely two places that this data can be scanned

1) On the cloud servers. Anything and everything uploaded can be pushed through a scanner and anything that matches is flagged and sent off to a human to (ugh!) verify before being sent on to 3-letter agencies.

2) Within the privacy of your own device, only on things that are uploaded. Anything that "hits" is flagged and the same process (ugh!) as above is followed.

This is not a false dichotomy. Once you accept that the scanning will happen (and it does), then it either happens at the source or at the destination. Right now everyone offering a cloud service scans at the destination (the cloud servers themselves), and everything is scanned. It is not possible to have e2e if the server can read the data to scan it - this ought to be obvious.

Apple was offering to not do that scanning in their domain, but to trust that the device could do it in your own private domain, and that could have led to no further requirement for the server-side to be able to read the data (to do the scan). Which could have led to a fully end-to-end encrypted service for data, while still helping prevent ugly crimes.

Users chose option (1), that is: scan everything uploaded on the server all the time and deny the ability for end-to-end encryption to occur.

This is why we can't have nice things.

-------------

[1]: This isn't quite a legal requirement, but every cloud service does it because the lawyers won't issue advice to CYA (cover your arse as the service provider) if you don't do it. To mount a successful defence against being sued, you need to make an effort to detect, seems to be the legal opinion.


It's thoughtful, but clearly deputizing your own property against you is not a premise people are comfortable with.


Yup. That was my lament.


Nah, if it is end to end encrypted, it is rightfully opaque to the service owner. Who would sue them, and why?


I’m not a lawyer. My wife is, but she’s my lawyer. Get your own damn lawyer ;)

However, I could see:

- Bad person A is convicted of kiddy porn, as part of a plea deal, he gives up his sources etc - Turns out A has been sending stuff to B via iCloud - B does something nasty to C’s kid and gets caught - C sues Apple (who has money and certainly doesn’t want to be defending this in court) for making no attempt to stop this from happening to C’s kid, or worse assigns culpability due to being the medium of transport.

Would this have merit ? Probably not, but it’s not something Apple want splashed all over the interwebs. The court that matters is public opinion, in this instance, and mega-corp vs parents-of-abused-kid doesn’t play well whatever the merits of the case.

So Apple (and everyone else) scan, in part for self-interest, and also because I’m sure people at Apple/whoever have kids too, and have just as visceral a reaction as other people when confronted with hard evidence that this shit really happens. It’s easy to play the “think of the children is all bollocks” card - it’s harder when there’s a real abused kid that is front and center.


We send letters and parcels all the time within the USA and they are not inspected either. People also do horrible things in cars too, we don't systematically 'inspect' the content of every single car that drives on it past a bridge toll or similar. It feels rather flimsy IMO potential lawsuits without specific laws making this a liability that this would be the reason, like SESTA / FOSTA did.

Once precedent shows that apple or anyone else will just never have that info because they deliver things in the equivalent of opaque letters, legal precedent of previous court cases will make these happen less and less, if at all.

If I were apple, I would rather not have the responsibility of inspecting people's contents if I was a medium of transport, because it prevents an entire duty of care aspect that would pop up. You prevent more lawsuits by being E2EE IMO.

You see this avoidance behavior within medicine with malpractice lawsuits, where doctors would rather patients not speculatively test so they don't create duty of care issues, and where they outsource some kinds of testing to other firms, so the 'duty of care / malpractice' aspect that pops up in case they missed something goes to the firm instead of them. There is a big 'avoid seeing things if you don't have to' energy in a lot of medicine, and it comes from malpractice anxiety.

So there are things forcing apple to go this way IMO, I don't think they would do this by default.


My operating theory is that Apple is being coerced by federal regulators not to release e2e software without backdoors like this.

This is clearly unconstitutional and illustrates that the US and China are pretty similar when it comes to human rights policy.


Yup. The knee-jerk privacy reaction cost us privacy.


I don’t think it’s fair to say we need to accept either options. Yes the crime they are trying to stop is horrific and something must be done, but that doesn’t justify unlimited technological spyware.

And the scope for abuse is so large. People in the UK are getting arrested for retweeting mean memes, it’s pretty easy to imagine Google and Apple added offensive images to their scanning and you get arrested for saving something that goes against the current political agenda.

As well as the case where google locked the account of a parent who had taken photos to send to a medical expert.


I have no interest in relitigating the saga or the recent Google incident. HN did that for months. I was simply agreeing with the irony.

We had no privacy before, it was offered, people freaked out, we have no privacy today.


They should have wheeled out a better marketing spiel than "trust us ;)" then.


What use cases are good for Cassandra vs Postgres or Mongo or Elastic ? I've used the latter 3 a lot, never used Cassandra.


Cassandra is very good when all of these are true:

- You're willing to give up some common SQL-isms for leaderless, multi-dc, high-availability

- You're going to grow to need more than one machine taking writes - if you can fit on one normal commodity machine, just use something designed for one machine. Cassandra sacrifices a lot to scale to ~thousand-machines-per-cluster, so if you can use one machine, don't bother with Cassandra.

- You can model your data in a way that does SELECTs without JOINs, and always uses AT LEAST the partition key in the WHERE clause. Denormalizing and duplicating your data is PROBABLY ok.

- You're willing to run Java (so you understand that you may need to tune the JVM eventually), and you're willing to learn about data modeling before you just start writing code

If all of that is true, Cassandra starts being interesting.


I keep hearing people talk about how hard Cassandra is to run properly and how many people get it wrong etc. Is there anything to this or is it just FUD and people who genuinely don’t know what they are doing?


Our biggest issue with running Cassandra was related to pathological read / write patterns by some tenants on our system causing outsized availability impact due to triggering garbage collection pressure that would cause whole node GC STW pauses and severe tail latency / query degradation.

We have solved these issues in a few ways, mainly:

- working with the relevant product teams to implement appropriate rate limiting or improving data modeling.

- introducing our own query layer, written in Rust that sits in front of Cassandra that uses a form of micro-caching called read coalescing, and also other forms of query throttling/load shedding to reduce work the database must do for hot keys/pathological patterns of access. We expose a GRPC interface from this - and this lets us centralize control of the client driver and tune it appropriately, while also getting to leverage the ever growing open source grpc traffic routing solutions (envoy, etc...)

and ultimately,

- switching to ScyllaDB, a C++ rewrite of Cassandra which is of course void of any garbage collection issues, and features faster overall performance and lower latencies.

Scylla, however, is not without its own set of issues - and somewhat strict hardware requirements[0] thanks to the seastar engine it is built on top of. Their team however has been delightful to work with, and our platform is markedly more stable in current year than it was in years past thanks to the above factors.

Operationally, however, Scylla and Cassandra are quite easy to run, the trickiest part is repairs. Common operations such as cluster expansion, or replacement of node are so common an operation that they are at this point mundane. Be wary however about read/write amplification issues inherent to LSMT databases, choosing the correct compaction strategy and tuning it appropriately can be quite key. Additionally tombstones can be quite bad for performance.

In current day we offer a new more generic solution that sits on top of scylla (it would work with Cassandra too) that provides a simple interface to query KKV based data, without having to worry too much about problems like large partitions, hot keys, or tombstones! With a design like this, the underlying cluster thus far has been issue free and very easy to operate.

[0]: https://discord.com/blog/how-discord-supercharges-network-di...


For starters:

- Use a better a GC like Azul's C4.

- Leverage monitoring to shard off hot spots.

- If possible, get away from CPU- & memory-inefficient JVM and dynamic languages.

- Steer clear from ScyllaDB and its radioactive license. And, there's little customer demand for it. Astra DB for a commercial distro or YugabyteDB is from the folks behind Cassandra.

- Optimize persistence with multiple choices of backends that abstract away the underlying datastore: the eases the pain of going through "divorce" from any particular technical solution. Writes vs. Reads ratios, geographic replication, object size, data structure (K/V, tuplestore, time series, files, etc.)


What’s the deal with scylla? Everytime I google something cassandra related I get only scylla ads claiming it’s much better


It's a more performant equivalent. When we switched to it we reduced the number of nodes required and query latencies were better.


ScyllaDB uses AGPL, a good and valid license


I assume you must have but feel compelled to check - did you experiment with the new low-latency Java garbage collection options? Feels like this is exactly the sort of case they would be catering for.


We did with some. Although I don't recall the exact specifics but there were improvements we were able to make with GC tuning and switching from CMS to G1. We were on old version of Cassandra for our most problematic cluster - and it was our last cluster to migrate to Scylla so we didn't really do much to explore the problem space of JVM tuning when we were already moving away from JVM already.


thanks, good to know!

If you had fully explored using those and found them insufficient then it would have been an interesting case study in them "not working". Supposedly this is a nearly "solved problem" but I am very curious to hear real world use cases and whether it truly works or not.


Just for reference G1 is 15y old.


It’s still the default and actively developed.


We store a few PB of data in Cassandra, have used it for nearly 10 years and in my opinion it's not that hard. Operationally it's way easier to manage than Elastic and most other databases (ex. PostgreSQL, MongoDB) plus there's a ton of documentation available to help you debug/ benchmark your cluster. Note that even though CQL looks similar to SQL it's important to understand the differences but as with any new technology there's a learning curve. I would strongly recommend checking out C* if you need a database with high write throughput and that needs to scale out.


It’s a distributed system and if you have been a DBA for a single system like Oracle or MySQL there is a lot of new competencies to learn. That being said, completely doable and it’s typical to see small teams running massive amounts of Cassandra. At the same conference, Bloomberg talked about their large Cassandra footprint with only 4 people. If you want to run Cassandra in K8s there is the K8ssandra project that automates a lot. It’s a fast growing project as a result. (http://k8ssandra.io) If you want to use Cassandra and not run it, http://astra.datastax.com. One click and a few seconds, you get a completely serverless version of Cassandra that you only pay for what you use. I'm sure we will hear a lot more of these stories at Cassandra Summit in March (http://cassandrasummit.org)


It's not that hard. The harder thing is actually using it properly and not trying to use it like it's a SQL database. Your application needs to be aware of things like tombstones and you have to model your data properly so you can have efficient queries and not run into problems with deletions. The operational side is pretty easy. There's a few things like compactions and repairs that need to be run periodically. The tooling all in all is pretty good. Java/JVM can be annoying sometimes though.


How comes iTunes movies takes minutes to display a list on Apple TV, feels like minutes all the time?


Normally you post a question and someone posts an answer. Here someone has posted the answer and you have posted the question. Are we playing Jeopardy?


Or to just say it - because the system has absurd complexity as proven by the hardware needed to run it.


How comes iTunes movies takes minutes to display a list on Apple TV, feels like minutes all the time?

I just fired up the iTunes Movies app on my AppleTV for the first time so no cache (I only watch my DVD/Blu-Ray rips), and the app started and loaded a full list of movies in a little under 3.5 seconds.

If it takes you minutes, it sounds like PEBKAC.


to be fair, I have a fairly large iTunes purchased back-catalog of tv series and movies - when I go to view the index of them on a gigabit network, it takes more time than I would expect it to. latest appletv.

but, if I hit play, it's instantaneous and I'm surprised at the quality (plus they've upgraded a ton of my purchases to 4k when I bought them at whatever I could)

if I contrast that to local plex, the plex meta-data is instantaneous, but I find it amusing that the appletv videos start more quickly.

plex îs local cache, usually still in memory, apple is building my stuff from legacy iTunes data that I doubt they cache forever for me.

that said, I think I'm ok with the tradeoff for the most part.


Yea, it's amazing how quickly videos stream from most services.

If your Plex videos are taking a while to start, might be an issue with your client or server setup. Are the disk drives sleeping so they have to spin up, are they slow for some other reason, etc. My storage is all on spinning disk on 8+ year old hardware and I don't have any issues, except with certain slow clients.


the slowdown is with 4k movies - all codec (audio/video) match so no transcoding, plex runs in a vm with 4 typically idle cpu's allocated, and disks are cached with ssd's on a synology.

still the slowest of the players I watch. you can use infuse, and often times it is much faster than the plex client against the same server.


Idk. Apple search APIs seem slower than competitors often; my non-tech family members have made similar comments to me about Apple Music and Apple TV+ being slower than Netflix (or Prime) / Spotify.


But the OP specified iTunes Movies, which is not AppleTV+ or Apple Music.


While it doesn't take that long on my device, it probably is related to the fact that the iTunes store is over 20 years old and has the tech debt to prove it.


I'd love info on how much Apple contributed to Cassandra!


I’m sure Scott’s talk went into detail about this, but I can safely say that his team contributes a great deal to Cassandra


Well, when I worked on Cassandra at datastax, they hired away a lot of my coworkers to work for Apple.


Can be replaced with 300 servers with ScyllaDB :-)


Because of no JVM? Or because of its different architecture (different caching and such)?

I would expect it to still require more for high availability but from what I have heard around ScyllaDB it does seem there is a benefit to it over cassandra.



will ScyllaDB shrink data 1k times using some magic?


No.

Scylla has also stalled in support of features that Cassandra offers, which is strange because they came out of the gates like gangbusters. It is not the drop in replacement they advertise.

But I'm not going to tell you "don't use scylladb". They have lots of educational whitepapers for maxing I/O and hardware that apply both to cassandra and scylladb.

The biggest thing with cassandra and scylla is that you need a technical management to understand the headaches, and appreciate the silent benefits. Cassandra places often have frustrated management about the cassandra management problems, but when I ask "has your cluster ever gone down?" and they say no, even across major version upgrades and the like, expansions, etc.

Also keep in mind that cassandra also has a tendency to make your big data bigger with timestamps-per-cell, replicas, etc.


You might need about 1500. I think 64 TB of flash is the standard for a 2U database server these days.


Cost apart, can firestore ( firebase ) can handle this much data ?


Yes.


What happened to FondationDB? It is used elsewhere?


I thought Apple bought a company that did Cassandra like functionality and started using that?


300Gb per node, am I reading this right?

Seems extremely low, I expect at least three orders of magnutide more on a real production system.

Does Cassandra really suffer from performance problems at scale?


hmm you expect to see 300,000GB or 300TB per node? What does that mean?


It means that if you need a whole JVM instance to service a tiny chunk of data that's less than even a terabyte then you're doing something extremely wrong.

Leaves a bad impression of Cassandra, personally.


I'm under the belief that Redis is much faster than Cassandra, am I crazy to think that Apple or any company really should have a transition plan? Why isn't redis used more?


Cassandra solves different problems than Redis is typically used for.


Redis is for when your data fits in memory.


By default redis will drop your data when under memory pressure. It’s a different thing.


Redis is a cache


Redis is not simply a cache. It can be used as one but it supports many other data structures and use cases beyond that.


Redis is not a cache. It has all kinds of data structures.

https://redis.com/redis-enterprise/data-structures/


Unrelated but does anyone else get tired of the fetishization of BIG NUMBER? I don’t care if Facebook has billions of users if it’s a hot pile of garbage. I don’t care if some game has millions of players if it’s bad. When did BIG NUMBER overtake quality and can we go back?


Quantity of users is social proof (more accurately, social evidence) of quality. The claim "we have high quality" is cheap: anyone can make it, and it's subjective enough that it can't be disproven in any absolute sense. But if you lie about having billions of users, you can be called out on that pretty easily; and if you say it honestly, it implies that billions of people like the quality of your product or service enough to use it. Billions of people can be wrong, but saying you have billions of users is still much better evidence of quality than saying you have quality.


> Quantity of users is social proof (more accurately, social evidence) of quality.

It can also be evidence of founder effects (JavaScript), lock in (Windows), network effects (most social media), etc.


Those are all qualities that you’re painting in a negative light.

You may not think Facebook deserves credit for its network effects, but a lot of people would disagree.


So you’re arguing that popularity is a proxy for quality? So astrology is correct? There’s endless examples of popular things that are nonsense.

Of course the opposite is not true. Things are not bad because they are popular. I just don’t think there is any inherent correlation at all. Popularity generally measures memetic power and that’s it. Cold viruses are also popular.


You're conflating "quality" for correctness, which is not what I said.

Yes, astrology being popular means it is high quality entertainment value.

Cold viruses being popular means they are high quality reproducers.

And for Facebook, whose primary value is from network effects, popularity indeed means it is a high quality social network.


It is a bit like music. One could state that all of the popular music is garbage (I might even state that myself some days), but in the end, the music purchasing public have spoken.


BIG NUMBER implies a dedication to the care and feeding. It’s a nudge and a wink for “come work for us”

Number of users advertisement is more for investors, and maybe a broadcast for FOMO.

There is, however, a tendency in our field to look for systems design patterns that handle big N numbers, and apply those to little N platforms, but I believe deep down that this is business and management dysfunction, as systems refactoring is tolerated even less than code refactoring.


> I don’t care if Facebook has billions of users if it’s a hot pile of garbage.

Shockingly, Apache Cassandra was initially developed at Facebook, so at least a hot pile of garbage is good for something. Plus, it'll keep you warm, so that's two things.


[flagged]


Sounds fun, do tell




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: