The trouble with Cassandra as an object storage metadata database (min.io)
84 points by jasim on Feb 24, 2021 | 77 comments



The most basic and trivial way that you're going to get burned by cassandra is that you have to divide your primary key into two parts: partition key columns, and clustering key columns. Partition keys must be in every "where" clause; only clustering keys are optional.

Okay, I'll just not bother with partition keys, right? Except partition keys determine which "partition" your data goes in. So if your only partition key column is "day_of_week", then you have 7 partitions. New problem: Your partitions need to be < 300 MB or cassandra falls over dead.

In fact you'll soon realize that over the lifetime of your table, keeping those partitions under control may end up forcing you to use various date tricks like putting year/month/day in as "artificial" partition keys.

Of course, if you put everything in the partition key instead of the clustering keys, then, as noted above, every partition key column has to appear in every where clause, and you may have trouble querying for batches of data.

Furthermore, when you do put clustering columns in your where clause, they have to be in the order declared; so if your clustering columns are a, b, and c, then your where clause can use (a), (a,b), or (a,b,c); but if it has b, it has to have a, and if it has c, it has to have a and b. This is because storage is hierarchical. (Same rules apply for "order by", btw.) (And no, there's no "group by".)
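
To make that concrete, a quick CQL sketch (hypothetical table and column names, just to illustrate the rules):

    CREATE TABLE events (
        year int, month int, day int,   -- partition key: "artificial" date buckets
        a text, b text, c timestamp,    -- clustering columns, hierarchical in this order
        payload blob,
        PRIMARY KEY ((year, month, day), a, b, c)
    );

    -- OK: full partition key plus a prefix of the clustering columns
    SELECT * FROM events WHERE year = 2021 AND month = 2 AND day = 24 AND a = 'x' AND b = 'y';

    -- Rejected: restricts b without a, so the (a, b, c) hierarchy can't be walked
    SELECT * FROM events WHERE year = 2021 AND month = 2 AND day = 24 AND b = 'y';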

This is when you start realizing: Oh, you mean it's really not like "SQL without joins". No, not even close.


Best summary of the Cassandra data model I've seen:

HashMap<PartitionKey, OrderedSet<ClusterKey>>

So you can look up data by the PartitionKey, then perform range queries on the ClusterKey to filter data belonging to that partition.

If your access patterns look like that, Cassandra could be a great fit.

For anything else, you probably want to consider a different data store.


Ha! That's pretty succinct.

Though HashMap should be just "Map" and the value of the top level map is not a set, it's another map:

Map<PartitionKey, SortedMap<ClusterKey, Record>>


Making CQL SQL-like was the worst possible thing the designers could have done. Newbies want to do SQL joins and filtering with it until they've been burned a few times.

C* is essentially scaled/automated MySQL sharding and blob storage... except you pay a huge cost for any server-side filtering.

The Partition Key would be the equivalent of your MySQL shard id. ClusterKey is your primary key. If you treat everything like PUT $partitionKey, $primaryKey, $data and GET $partitionKey, $primaryKey you get a slow and massively scaling Redis. If you do anything else with it you'll likely start regretting your choices and looking for a replacement database.
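
In CQL that put/get shape is roughly the following (hypothetical names, just to sketch the pattern):

    CREATE TABLE kv (
        shard_id text,   -- partition key: the "MySQL shard id"
        pk       text,   -- clustering key: your "primary key"
        data     blob,
        PRIMARY KEY (shard_id, pk)
    );
    INSERT INTO kv (shard_id, pk, data) VALUES ('shard-7', 'user:42', 0xCAFE);
    SELECT data FROM kv WHERE shard_id = 'shard-7' AND pk = 'user:42';

Anything beyond that single lookup path means server-side filtering, which is where the huge cost comes in.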


What you are expected to do with Cassandra, to achieve your horizontal scalability, is (a) denormalize by each supported query up front, and (b) fan out all of those (often by time) so they never get too big.

So for your "primary" data store for a type of data, you might use the date for the partition key. Then if you do a lot of day-of-week queries, you'd store a second copy grouped by year-month-weekday, and query as many months as you need, and combine them. If you are collecting data for several sites, have a copy that's stored by year-month-weekday-site. And set up your software so it's easy to say "write in these 5 places, keyed off this combination of values." And because it's in all these denormalized places, you can forget updating anything atomically. So you go to append-only mode, and if you have an "update" you record that as a separate entry after the first, and your application's understanding of the data store includes coalescing the events and applying the update. (Maybe if you need it you have a separate process coalescing updates after the fact, but in the background, non-critical-like.)
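
As a sketch of that fan-out (hypothetical table and column names; whether you group the writes in a logged batch or just issue them separately is a throughput trade-off):

    CREATE TABLE readings_by_day (
        year int, month int, day int, site text, ts timestamp, value double,
        PRIMARY KEY ((year, month, day), site, ts)
    );
    CREATE TABLE readings_by_weekday_site (
        year int, month int, weekday int, site text, ts timestamp, value double,
        PRIMARY KEY ((year, month, weekday, site), ts)
    );
    -- every ingest writes each denormalized copy; "updates" are just more appended rows
    BEGIN BATCH
      INSERT INTO readings_by_day (year, month, day, site, ts, value)
        VALUES (2021, 2, 24, 'nyc', '2021-02-24 12:00:00', 1.5);
      INSERT INTO readings_by_weekday_site (year, month, weekday, site, ts, value)
        VALUES (2021, 2, 3, 'nyc', '2021-02-24 12:00:00', 1.5);
    APPLY BATCH;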

And you're right. This is very much not SQL. It's also super obnoxious and support for it in extant frameworks is pretty minimal (they tend to be rather myopically SQL-oriented, ill suited to take advantage of this model). It also uses lots of storage, on the premise that disk is cheap. But what all this accomplishes is crazy horizontal scalability, and when you make any one query, then all the data is right there on disk lined up neatly, so it all gets streamed out in a predictable amount of time. For a few select applications, this consistent scalability is worth the pain.


The issue seems to be a misunderstanding of what Cassandra is. It's an (advanced/nested) key-value database, so of course you need to specify the key to do anything.


The misunderstanding is understandable, since the Apache Cassandra site fails to state what Cassandra is, even directly below the heading "What is Cassandra?"


Well... it is "the right choice when you need scalability and high availability without compromising performance". I'm not sure how much more description you need! :)


That's normal by now. Pretty much all software/service sites would do better by just using Wikipedia's description of themselves as the landing page. At least, better for the user. Probably because Wikipedia actually needs to explain what the thing is, whereas a standard marketing page would get slapped with banners about non-encyclopedic content in no time.


Replying to myself because I can't seem to find an edit button: If you're interested in Cassandra, the free online DataStax training is highly recommended. You'll learn all the gotchas ahead of time instead of after you've gone to production, and you really, really don't want the latter.


> New problem: Your partitions need to be < 300 MB or cassandra falls over dead.

people work around that by storing the data in something like S3 and then keeping the handle in cassandra. Yet another hack on top of another hack.


While part of this is Cassandra's architecture, it may also be a limitation of the designer's way of thinking. Recently we had a hackathon project where we decided to implement the s3 API on top of Scylla (a Cassandra-compatible database) and were able to chunkify binary large objects (BLOBs) of up to 5 TB, and even turn Scylla into a FUSE filesystem.

https://www.scylladb.com/2020/12/15/scylladb-blog-scylladb-d...


Do you fill postgres with 100MB+ blobs?

You'd do the same "hack" there.


Yes, but Postgres doesn't have the problem of falling over if you have more than 100MB in all records, in total.


Pretty sure that's what the article is about.


Noob question here.

> Your partitions need to be < 300 MB or cassandra falls over dead

Why is that? Is that some kind of engineered-in limitation?


There's a default limit of 2 billion cells (rows * cols) but larger partitions also mean larger indexes and more serialization, repair, compaction and memory overhead which doesn't work well with Cassandra's Java/JVM GC requirements.

It's getting better but likely will never reach the performance potential of Scylla which is Cassandra reimplemented in C++ with a much better sharded architecture.


It's not a real requirement, just a performance thing. Each partition has an index, and as it grows the deserialization of it to find the queried start column takes longer. It is lazily loaded, but there's a perf hit. Easier to have a rule of thumb like "100 MB partitions" and have more fixed TPS-per-core etc. expectations.


No, it's just that the storage layer is low priority to improve for this use case, but it'll happen eventually.


> Your partitions need to be < 300 MB or cassandra falls over dead.

Is that a typo? 300MB sounds ridiculously low..


They don't need to be. It's just less performant, so it's easier to just say that so people's expectations on TPS etc. match. I've seen partitions of many GBs run fine.


There's a bad data structure in it, which is why the limit is so low.


A while back we explored the use of Cassandra. We wanted to keep some event-related data there and for it to be relatively fast read-wise in order for us to do all sorts of reporting based on it. So we wrote a lot and wanted to read fast. It seemed like a perfect store for our timestamped events, especially since we didn't even want to use deletes, and it has built-in record deduplication via its primary key. Turns out, it is not that perfect.

Other than what the article described, I can also add:

1. It has a steep learning curve, but you do get to see the advantages while you learn it. But then, everything comes crumbling down.

2. The setup is a pain locally. Then it is a pain to set it up in prod and manage it. The tooling itself feels very unfinished and basic.

3. No querying outside the primary index on AWS Keyspaces if you want it managed. Also, any managed variants are EXPENSIVE. I mean, every database is fast if you only query by the primary index, so why pay extra?

It is just not worth it. For example, we wound up using MongoDB and it turned out to be fast and scalable, has mature tooling, lets us keep tons of event-related metadata, is easy to manage, and doesn't cost a fortune.


I think Cassandra is a better fit for interactive use cases, not for reporting. Also, basically it's super heavy-duty and it should start to shine when you're really serving the entire Internet (on the scale of Reddit, Expedia etc.) and your Cassandra cluster is distributed across DCs across the world.

I haven't really worked in this space for a couple of years so I don't know if the cloud offerings have already completely matched Cassandra's features and robustness.


Dynamo allegedly has global replication now.

Of course, I have been totally unable to determine how they merge rows/partitions without cell timestamps. It's a black box.

I was just on a "Keyspaces" meeting where the sales dude basically described dynamo billing, dynamo provisioning, feature shortfalls obviously due to dynamo, but would not admit it was dynamo.

It was bizarre.


We initially looked at Cassandra. We liked its use case, we liked its scalability. However we also ran into maintenance and setup pains.

We ended up going with ScyllaDB, which is a drop-in replacement for Cassandra. It's written in C++. Much easier in resource demands, and we didn't have to deal with Zookeeper directly.


Cassandra doesn't require Zookeeper


Ah, good to know. Our admins set up Zookeeper with Cassandra, so I had always assumed it was part of the deal.


And while ScyllaDB may have a better thread-per-core model and no GC pauses, it basically has the exact same management challenges as Cassandra.

The parent comment almost seems generated by AI.


Dumb statements all around, but clearly you've never been in a Cassandra environment vs. Scylla. Scylla is far, far more reliable, easier to get up and running, and requires a bit less supervision than Cassandra.


This is a restatement of the Cassandra docs paired with the usual misunderstanding of CAP. An AP system is not a choice of "I'll have availability and partition tolerance, please". It's the choice of availability given a partition. The whole point of CAP is that when partitions occur, there is a forced choice -- this is why it's incorrect to ask for a CA system.

The Minio team are manifestly excellent engineers, but insightless posts that contain subtle misunderstandings of CAP do nothing to showcase that competence.


Do you mean AP is the choice of losing consistency given a partition?


Oh duh... You were pointing out a typo in my comment.

I'm leaving my original reply below because what the hell.

~~~~

Yes, exactly.

Think of it this way: you don't actually have any control over partitioning, therefore partitions are a given. So CAP is expressed as such: given a partition, choose between consistency and availability.

EDIT: I personally find PACELC [0] less confusing, and more nuanced than CAP, but they basically say the same thing.

[0] https://en.wikipedia.org/wiki/PACELC_theorem


That's how it is usually expressed, yes.


So these guys have a storage system on top of Amazon S3 with distributed (timestamp-based) RW mutex locking that uses NTP (github.com/beevik/ntp). That seems to be about it.

https://github.com/minio/minio/blob/master/pkg/bucket/object...

https://github.com/minio/minio/blob/master/pkg/dsync/drwmute...

And this is their test:

https://github.com/minio/minio/blob/master/pkg/dsync/dsync_t...

I just browsed quickly but it is littered with Amazon Simple Storage Service hardcoded bits like this:

https://github.com/minio/minio/blob/master/pkg/bucket/object...

There is not a single document that I can find that discusses the MinIO architecture. I guess "MinIO is a simple wrapper around S3 with a homegrown distributed state tech using NTP and it is 'fast!'" does not make for a sexy doc.

The "trouble with OSS distributed DBs without Jepsen tests" column has probably already been written. There should also be one about "competitor's mature product bashing blogs are HN clickbait".


Meh. About 80% of that article discusses known limitations of Cassandra, which aren't specific to the use case of object store metadata storage. In the little that it actually does specifically talk about that use case, if your object store reflects the Cassandra limitations (flat, infrequent mutability), I don't see why Cassandra would be a bad choice.


If you really want to see and get hints on what you can do with Cassandra, head over to the Netflix tech blogs. They use Cassandra extensively.


really? what do they use Cassandra for?


Back in 2014 when I last used Cassandra, Netflix were using it for some kind of metrics store I believe. They also published the nicer-to-use client back then (Astyanax or something it was called - a nicer API than Hector).

My memory is letting me down here on specifics - in my defence it's been a long time since I answered questions on Stack Overflow about Cassandra. I absolutely loved working with the product even if the sharp edges cut me a few times (ultimately my own fault, not Cassandra's).


Netflix uses it for everything that you'd use Postgres for but:

- they have Cassandra committers (most end up going to Apple iCloud)

- they have mgmt. support to make it work

- they are/were on tokens for years, not vtokens, since their internal tools were token-based

(They started with Oracle Enterprise (and a little MySQL), tried Mongo briefly, then decided to make Cassandra work for them.)

Source: original Netflix Cassandra team member.


I would have hoped it's obvious.

Your metadata database needs to be the fastest and most reliable store out of everything. It can't be eventually consistent without partitioning your datastore. Even then you'll end up partitioning your data neatly into the same failure zone.

Cassandra has basically one usecase: high volume writes, with a few batch reads.

Cassandra is not really optimised for high reads.

Most of the time postgres will do fine.


> Most of the time postgres will do fine.

For something that scales horizontally, I would probably recommend using something like FoundationDB for this use case [1].

A transactional Key-Value store is exactly what you want for this kind of use case.

[1]: https://apple.github.io/foundationdb/


Yes, now you are maintaining 5+ types of services for FoundationDB.

While you can vertically scale a single PostgreSQL server (with failover/HA) to 50+ TB of NVMe and grow until you can throw other engineers at the problem.

Example: wasabi.com uses MySQL for metadata.


I'm curious about your description of the 5+ types of services FDB requires. In my world it requires one.


Heading to the docs again, looks like they were consolidated. I remember you had to run coordinator nodes, data nodes, log storage nodes, cluster controller etc.


> Cassandra has basically one usecase: high volume writes, with a few batch reads.

That's how we use it in BigCompany(tm).

Cassandra holds (almost) all of our data. Most of it comes from batch insertions, but more complex (and lower volume) data comes from various REST APIs we expose and other teams use.

Most reads (60%) done on it are boring batch reads: analytics, reports, deltas, etc.

For more complex reads (30%) we copy relevant rows to Hive tables, then perform queries there without dealing with partition key shenanigans. So, batch with extra steps.

Then anything that requires random reads (10%) is read from ElasticSearch:

1. Coming from REST backend if it's time sensitive.

2. Coming from a batch that reads from Cassandra if not.


What would you use instead of Cassandra for multi-DC replication?


Cloud bigtable.


Or maybe CockroachDB?


> Cassandra has basically one usecase: high volume writes, with a few batch reads.

Yes. It's also a good choice for horizontal scaling, but only if the other DBs won't do.


I am working on SeaweedFS, which supports the S3 API for object storage and can also use Cassandra as the metadata DB. Cassandra has been performing well for most SeaweedFS users.

The article listed many known Cassandra characteristics and cited them as limitations. However, it all depends on the use case. There is no file system that works for all cases, and not all of them need ACID, CA vs. CP, etc. The remaining points are not convincing either; they are about how to design the data structures better.

Actually, SeaweedFS can use many other database/KV stores as the metadata DB. The list includes Redis, Cassandra, HBase, MySql, Postgres, Etcd, ElasticSearch, etc. https://github.com/chrislusf/seaweedfs/wiki/Filer-Stores

I did find one drawback of Cassandra as the metadata store though. One use case is that a customer uploaded a lot of zip files to one folder, /tmp, unzipped them, and then moved them to a final folder. The rate is about 3000 files per second created and then deleted. Being a LSM structure, the tombstones quickly pile up and the directory listing was slow.

The solution was to use Redis for that /tmp folder, and still use Cassandra for the rest of the folders. With Redis B-tree structure, the creation and deletion are cheap.

So it all depends on the use case.


> With Redis B-tree structure, the creation and deletion are cheap.

Pretty sure Redis has no B-tree, just a hashmap. But Redis has to do no compaction at all, since it can directly delete the stuff on disk.

> The rate is about 3000 files per second created and then deleted. Being a LSM structure, the tombstones quickly pile up and the directory listing was slow.

The problem here is also the GC grace period. Since the database has async replication, it needs to keep deleted rows (tombstones) for some days so downed replicas can catch up. So they actually stay on disk/in memory for some days and don't get compacted away, while in a CP system they would be.
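
In Cassandra terms that retention window is the table's gc_grace_seconds (864000 seconds, i.e. 10 days, by default). For a high-churn directory table you could shorten it, at the cost of having to run repairs within that window; a sketch with a hypothetical table name:

    ALTER TABLE filemeta WITH gc_grace_seconds = 3600;  -- keep tombstones for 1 hour instead of 10 days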


Storing metadata together with data will just make it harder and slower to query metadata (since it will reside on HDD in most cases).

You may think "it will be cached in ram, because it's small", yes, but then you'll end up querying many nodes just for metadata queries.

Yes, it's nicer to manage only one system, but in big scenarios it's probably better to separate them.

You can have 50+TB of nvme in 1 server, so your metadata layer probably doesn't need to horizontally scale.

Imagine if you lose some objects (because you lost some replicas). You won't even know WHICH objects you lost, because the metadata is gone together with the data.

Keeping them separate, you can use 5 replicas for metadata to be even safer, compared to the usual 3 replicas.


What kind of 'metadata' are they referring to here? Can you give some concrete examples?


Metadata of the object storage. In the simplest case it's a table of <bucket_id, filename, location_of_object_in_storage, file_size, object_metadata>.
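
A minimal CQL version of that table might look like this (hypothetical names and types):

    CREATE TABLE object_metadata (
        bucket_id       text,
        filename        text,
        location        text,            -- which host/drive/blob actually holds the bytes
        file_size       bigint,
        object_metadata map<text, text>, -- user-supplied headers, content type, etc.
        PRIMARY KEY (bucket_id, filename)
    );

Listing a bucket is then a single-partition read and fetching one object's location is a point lookup, though a huge bucket becomes exactly the oversized-partition problem discussed above.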


One of my favourite tech overkills I've seen in my career is a Cassandra database used to store a couple million records. Yup. Needless to say, it was later converted to PostgreSQL. The guy who did it is still advertising the fact on his online CV.


The one thing about Cassandra: most probably you don't need it, because you do not have the write performance needs and will not have them in the next five years. Scaling early is the death of many startups. [1] Postgres with a time-series database will be sufficient for most needs.

[1] https://www.duetpartners.com/why-is-premature-scaling-still-...


A tangent - I love using MinIO in dev & test as an S3 simulator. Being able to throw away a bucket and start from scratch, and having everything self-contained in my docker-compose command, is a real blessing.

Has anyone ever used min.io for production stuff? What are the pros & cons over vanilla s3?


We use it in anger at Splitgraph [0] to let people upload/download datasets and Postgres table snapshots (instead of storing them directly in S3).

Pros:

* Less platform-dependent. By self-managing it, we can also run deploys to GCP / Azure / Scaleway / other providers without writing a separate adapter for e.g. Azure Blob Storage.

* Python API [1] much more pleasant to use than boto3 (and can speak to normal S3). It doesn't do everything that boto3 does, but it supports everything we need (e.g. pre-signed URLs).

* minio server itself supports a large chunk of S3's functionality (e.g. SELECT API / AssumeRole / bucket versioning)

* Don't pay per request and for egress: this was a big deal since people might want to download large amounts of data from us (or make a bunch of small requests to download/upload a subset of data).

Cons:

* Have to manage own infrastructure. We run it on managed VMs so it's semi-managed, but we still have to provision block storage, set up backup policies etc.

* In a similar vein, scaling and availability all have to be DIY [2]. We haven't run into situations yet where Minio would be the bottleneck, but it might be something to keep in mind.

* Obviously not as seamless: you don't get things like Glacier or integration with other IAM.

[0] https://www.splitgraph.com/

[1] https://github.com/minio/minio-py

[2] https://docs.min.io/docs/distributed-minio-quickstart-guide....


It doesn't work for the lots-of-small-files scenario; there is too much overhead. You can't add +1 server, you need to add whole clusters. If you ask too many questions you get invited to a premium subscription. There is no public big production deployment of 10PB+, and no community of big production users (like there is with Ceph).


> What are the pros & cons over vanilla s3

Mostly that you aren't bound to AWS IMHO. You can run it on-prem, or in a cloud provider with no object storage service.


We opted against Minio, mostly because we had a storage solution that provided an S3 API, but also because managing Minio seemed painful. It's just not clearly versioned, we didn't figure out how you were supposed to rotate certificates, and the documentation isn't great.


My n=1 indirect experience as a consumer of someone else setting it up. The devops team had a lot of trouble keeping Minio up and running. Performance was not great.

The main pro and con is that it is self-hosted.


"I wanted to store something that could scale, but wanted to store and access my data in a way that didn't scale. Also, I don't understand CAP, distributed transactions, or distributed systems. I had a bad time."

Somewhere else in the comments: "Yeah, we went with MongoDB."


I don't quite get what they are referring to when they talk about 'metadata' here. Are they talking mostly about something internal to the database, something to do with the schema, or some kind of additional data used to enrich a particular object?


Minio is an S3 clone, so I assume by metadata they mean details about the object you uploaded. So if the actual data is a video file, the metadata would be the headers, the file size, and what host/drive the file is stored on. Essentially, when you ask for minio.io/my/video/file some service has to transform that to 192.168.0.1/customer/drive/ahe123.blob


Ahhh, I see, thanks.


From an ops perspective, managing Cassandra is a bitch. Just use a managed service unless you have the money to hire a dedicated Cassandra expert. I’m so glad I’m done with ops


Just to provide a counter statement: at $work we definitely do not have the appropriate headcount to even start considering running our own distributed datastores. However, we _do_ use Cassandra. And while many people shared the sentiment of it being too hard to maintain when the first few projects started to incorporate Cassandra, the truth is: there has never been any failure or incident, or even a technical hurdle, related to the usage/maintenance of the (admittedly smallish) Cassandra cluster (for 2 years now).

The costs of operating the cluster are ~5K/month (that's what our service provider charges us for 24/7 ops). I consider this a scam, since the averaged maintenance effort is perhaps ~1 hour per month.

(normally peaks at 40,000 writes/sec, 500 reads/sec, 200GB)


Yes, but 200GB at those read/write rates can reside on a single server, even entirely in memory; it's not a big amount of data.


Also be wary of any company which "also does Cassandra" amongst a suite of products and isn't FAANG scale.

Chances are they absolutely don't have the headcount to properly manage it and were just filling out the portfolio they wanted to offer.


It's frustrating when it doesn't recommend something better. Every DB has tradeoffs; if you choose something over Cassandra you have to give up something else.


What does "Bottom line. Write your metadata atomically with your object. Never separate them." mean from the perspective of picking a system?


If multiple distinct writes to DB must be atomic (as in a transaction), an AP system is likely a poor choice. AP systems are designed for 'eventual consistency'. If EC is a poor fit for 'consistent object data and metadata' in your application, do not pick an AP system.

An AP system may be a reasonable choice for an ODB if your added layer of transactionality is exploiting domain-specific characteristics that allow for fast-path, efficient, distributed transactions on top of a bare object-level AP system (such as Cassandra). Since this is all very niche and technically demanding, it is almost always a better choice for the general team to choose a mature CP database as backing for metadata-rich objects and/or domains that demand transactional support for systems of record.


Can anyone explain to me why someone would want to use Cassandra when not handling internet-scale stuff?

The team I am in uses it, but after asking many times why it was chosen, since it seems a poor fit for our use cases compared to a relational DB, the only justification I was given is that the cluster is easier for our ops guys to maintain.


Contrary to popular belief, programmers aren't very rational people, but are still prone to emotional beliefs when choosing databases.

Another reason is for "resume driven development".


Yes. The problem with "eventual consistency" is that "eventually" never happens about 1 in a million times, and with millions of storage objects to manage you can't have that. So what's the alternative as a metadata store for object storage? A consensus algorithm (Paxos, Raft, Mencius, ...) on top of local key-value stores.



