The trouble with Cassandra as an object storage metadata database (min.io)
84 points by jasim on Feb 24, 2021 | 77 comments



The most basic and trivial way that you're going to get burned by cassandra is that you have to divide your primary key into two parts: partition key columns, and clustering key columns. Partition keys must be in every "where" clause; only clustering keys are optional.

Okay, I'll just not bother with partition keys, right? Except partition keys determine which "partition" your data goes in. So if your only partition key column is "day_of_week", then you have 7 partitions. New problem: Your partitions need to be < 300 MB or cassandra falls over dead.

In fact you'll soon realize that over the lifetime of your table, keeping those partitions under control may end up forcing you to use various date tricks like putting year/month/day in as "artificial" partition keys.

Of course, if you put everything in the partition key instead of the clustering keys, then, as noted above, every partition key column has to appear in every where clause, and you may have trouble querying for batches of data.

Furthermore, when you do put clustering columns in your where clause, they have to be in the order declared; so if your clustering columns are a, b, and c, then your where clause can use (a), (a,b), or (a,b,c); but if it has b, it has to have a, and if it has c, it has to have a and b. This is because storage is hierarchical. (Same rules apply for "order by", btw.) (And no, there's no "group by".)
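
To make that concrete, a quick CQL sketch (hypothetical table and column names, just to illustrate the rules):

    CREATE TABLE events (
        year int, month int, day int,   -- partition key: "artificial" date buckets
        a text, b text, c timestamp,    -- clustering columns, hierarchical in this order
        payload blob,
        PRIMARY KEY ((year, month, day), a, b, c)
    );

    -- OK: full partition key plus a prefix of the clustering columns
    SELECT * FROM events WHERE year = 2021 AND month = 2 AND day = 24 AND a = 'x' AND b = 'y';

    -- Rejected: restricts b without a, so the (a, b, c) hierarchy can't be walked
    SELECT * FROM events WHERE year = 2021 AND month = 2 AND day = 24 AND b = 'y';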

This is when you start realizing: Oh, you mean it's really not like "SQL without joins". No, not even close.


Best summary of the Cassandra data model I've seen:

HashMap<PartitionKey, OrderedSet<ClusterKey>>

So you can look up data by the PartitionKey, then perform range queries on the ClusterKey to filter data belonging to that partition.

If your access patterns look like that, Cassandra could be a great fit.

For anything else, you probably want to consider a different data store.


Ha! That's pretty succinct.

Though HashMap should be just "Map" and the value of the top level map is not a set, it's another map:

Map<PartitionKey, SortedMap<ClusterKey, Record>>


Making CQL SQL-like was the worst possible thing the designers could have done. Newbies want to do SQL joins and filtering with it until they've been burned a few times.

C* is essentially scaled/automated MySQL sharding and blob storage... except you pay a huge cost for any server-side filtering.

The Partition Key would be the equivalent of your MySQL shard id. ClusterKey is your primary key. If you treat everything like PUT $partitionKey, $primaryKey, $data and GET $partitionKey, $primaryKey you get a slow and massively scaling Redis. If you do anything else with it you'll likely start regretting your choices and looking for a replacement database.
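
In CQL that put/get shape is roughly the following (hypothetical names, just to sketch the pattern):

    CREATE TABLE kv (
        shard_id text,   -- partition key: the "MySQL shard id"
        pk       text,   -- clustering key: your "primary key"
        data     blob,
        PRIMARY KEY (shard_id, pk)
    );
    INSERT INTO kv (shard_id, pk, data) VALUES ('shard-7', 'user:42', 0xCAFE);
    SELECT data FROM kv WHERE shard_id = 'shard-7' AND pk = 'user:42';

Anything beyond that single lookup path means server-side filtering, which is where the huge cost comes in.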


What you are expected to do with Cassandra, to achieve your horizontal scalability, is (a) denormalize by each supported query up front, and (b) fan out all of those (often by time) so they never get too big.

So for your "primary" data store for a type of data, you might use the date for the partition key. Then if you do a lot of day-of-week queries, you'd store a second copy grouped by year-month-weekday, and query as many months as you need, and combine them. If you are collecting data for several sites, have a copy that's stored by year-month-weekday-site. And set up your software so it's easy to say "write in these 5 places, keyed off this combination of values." And because it's in all these denormalized places, you can forget updating anything atomically. So you go to append-only mode, and if you have an "update" you record that as a separate entry after the first, and your application's understanding of the data store includes coalescing the events and applying the update. (Maybe if you need it you have a separate process coalescing updates after the fact, but in the background, non-critical-like.)
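
As a sketch of that fan-out (hypothetical table and column names; whether you group the writes in a logged batch or just issue them separately is a throughput trade-off):

    CREATE TABLE readings_by_day (
        year int, month int, day int, site text, ts timestamp, value double,
        PRIMARY KEY ((year, month, day), site, ts)
    );
    CREATE TABLE readings_by_weekday_site (
        year int, month int, weekday int, site text, ts timestamp, value double,
        PRIMARY KEY ((year, month, weekday, site), ts)
    );
    -- every ingest writes each denormalized copy; "updates" are just more appended rows
    BEGIN BATCH
      INSERT INTO readings_by_day (year, month, day, site, ts, value)
        VALUES (2021, 2, 24, 'nyc', '2021-02-24 12:00:00', 1.5);
      INSERT INTO readings_by_weekday_site (year, month, weekday, site, ts, value)
        VALUES (2021, 2, 3, 'nyc', '2021-02-24 12:00:00', 1.5);
    APPLY BATCH;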

And you're right. This is very much not SQL. It's also super obnoxious and support for it in extant frameworks is pretty minimal (they tend to be rather myopically SQL-oriented, ill suited to take advantage of this model). It also uses lots of storage, on the premise that disk is cheap. But what all this accomplishes is crazy horizontal scalability, and when you make any one query, then all the data is right there on disk lined up neatly, so it all gets streamed out in a predictable amount of time. For a few select applications, this consistent scalability is worth the pain.


The issue seems to be a misunderstanding of what Cassandra is. It's an (advanced/nested) key-value database, so of course you need to specify the key to do anything.


The misunderstanding is understandable, since the Apache Cassandra site fails to state what Cassandra is, even directly below the heading "What is Cassandra?"


Well... it is "the right choice when you need scalability and high availability without compromising performance". I'm not sure how much more description you need! :)


That's normal by now. Pretty much all software/service sites would do better by just using Wikipedia's description of themselves as the landing page. At least, better for the user. Probably because Wikipedia actually needs to explain what the thing is, whereas a standard marketing page would get slapped with banners about non-encyclopedic content in no time.


Replying to myself because I can't seem to find an edit button: If you're interested in Cassandra, the free online DataStax training is highly recommended. You'll learn all the gotchas ahead of time instead of after you've gone to production, and you really, really don't want the latter.


> New problem: Your partitions need to be < 300 MB or cassandra falls over dead.

people work around that by storing the data in something like S3 and then keeping the handle in cassandra. Yet another hack on top of another hack.


While part of this is Cassandra's architecture, it may also be a limitation of the designer's way of thinking. Recently we had a hackathon project where we decided to implement the s3 API on top of Scylla (a Cassandra-compatible database) and were able to chunkify binary large objects (BLOBs) of up to 5 TB, and even turn Scylla into a FUSE filesystem.

https://www.scylladb.com/2020/12/15/scylladb-blog-scylladb-d...


Do you fill postgres with 100MB+ blobs?

You'd do the same "hack" there.


Yes, but Postgres doesn't have the problem of falling over if you have more than 100MB in all records, in total.


Pretty sure that's what the article is about.


Noob question here.

> Your partitions need to be < 300 MB or cassandra falls over dead

Why is that? Is that some kind of engineered-in limitation?


There's a default limit of 2 billion cells (rows * cols) but larger partitions also mean larger indexes and more serialization, repair, compaction and memory overhead which doesn't work well with Cassandra's Java/JVM GC requirements.

It's getting better but likely will never reach the performance potential of Scylla which is Cassandra reimplemented in C++ with a much better sharded architecture.


It's not a real requirement, just a performance thing. Each partition has an index, and as it grows the deserialization of it to find the queried start column takes longer. It is lazily loaded, but there's a perf hit. Easier to have a rule of thumb like "100 MB partitions" and have more fixed TPS-per-core etc. expectations.


No, it's just that the storage layer is low priority to improve for this use case, but it'll happen eventually.


> Your partitions need to be < 300 MB or cassandra falls over dead.

Is that a typo? 300MB sounds ridiculously low..


They don't need to be. It's just less performant, so it's easier to just say that so people's expectations on TPS etc. match. I've seen partitions of many GBs run fine.


There's a bad data structure in it, which is why the limit is so low.


A while back we explored the use of Cassandra. We wanted to keep some event-related data there and for it to be relatively fast read-wise in order for us to do all sorts of reporting based on it. So we wrote a lot and wanted to read fast. It seemed like a perfect store for our timestamped events, especially since we didn't even want to use deletes, and it has built-in record deduplication via its primary key. Turns out, it is not that perfect.

Other than what the article described, I can also add:

1. It has a steep learning curve, but you do get to see the advantages while you learn it. But then, everything comes crumbling down.

2. The setup is a pain locally. Then it is a pain to set it up in prod and manage it. The tooling itself feels very unfinished and basic.

3. No querying outside the primary index on AWS Keyspaces if you want it managed. Also, any managed variants are EXPENSIVE. I mean, every database is fast if you only query by the primary index, so why pay extra?

It is just not worth it. For example, we wound up using MongoDB and it turned out to be fast and scalable, has mature tooling, lets us keep tons of event-related metadata, is easy to manage, and doesn't cost a fortune.


I think Cassandra is a better fit for interactive use cases, not for reporting. Also, basically it's super heavy-duty and it should start to shine when you're really serving the entire Internet (on the scale of Reddit, Expedia etc.) and your Cassandra cluster is distributed across DCs across the world.

I haven't really worked in this space for a couple of years so I don't know if the cloud offerings have already completely matched Cassandra's features and robustness.


Dynamo allegedly has global replication now.

Of course, I have been totally unable to determine how they merge rows/partitions without cell timestamps. It's a black box.

I was just on a "Keyspaces" meeting where the sales dude basically described dynamo billing, dynamo provisioning, feature shortfalls obviously due to dynamo, but would not admit it was dynamo.

It was bizarre.


We initially looked at Cassandra. We liked its use case, we liked its scalability. However we also ran into maintenance and setup pains.

We ended up going with ScyllaDB, which is a drop-in replacement for Cassandra. It's written in C++. Much easier in resource demands, and we didn't have to deal with Zookeeper directly.


Cassandra doesn't require Zookeeper


Ah, good to know. Our admins set up Zookeeper with Cassandra, so I had always assumed it was part of the deal.


And while ScyllaDB may have a better thread-per-core model and no GC pauses, it basically has the exact same management challenges as Cassandra.

The parent comment almost seems generated by AI.


Dumb statements all around, but clearly you've never been in a Cassandra environment vs. Scylla. Scylla is far, far more reliable, easier to get up and running, and requires a bit less supervision than Cassandra.


This is a restatement of the Cassandra docs paired with the usual misunderstanding of CAP. An AP system is not a choice of "I'll have availability and partition tolerance, please". It's the choice of availability given a partition. The whole point of CAP is that when partitions occur, there is a forced choice -- this is why it's incorrect to ask for a CA system.

The Minio team are manifestly excellent engineers, but insightless posts that contain subtle misunderstandings of CAP do nothing to showcase that competence.


Do you mean AP is the choice of losing consistency given a partition?


Oh duh... You were pointing out a typo in my comment.

I'm leaving my original reply below because what the hell.

~~~~

Yes, exactly.

Think of it this way: you don't actually have any control over partitioning, therefore partitions are a given. So CAP is expressed as such: given a partition, choose between consistency and availability.

EDIT: I personally find PACELC [0] less confusing, and more nuanced than CAP, but they basically say the same thing.

[0] https://en.wikipedia.org/wiki/PACELC_theorem


That's how it is usually expressed, yes.


So these guys have a storage system on top of Amazon S3 with distributed (timestamp-based) RW mutex locking that uses NTP (github.com/beevik/ntp). That seems to be about it.

https://github.com/minio/minio/blob/master/pkg/bucket/object...

https://github.com/minio/minio/blob/master/pkg/dsync/drwmute...

And this is their test:

https://github.com/minio/minio/blob/master/pkg/dsync/dsync_t...

I just browsed quickly but it is littered with Amazon Simple Storage Service hardcoded bits like this:

https://github.com/minio/minio/blob/master/pkg/bucket/object...

There is not a single document that I can find that discusses the MinIO architecture. I guess "MinIO is a simple wrapper around S3 with a homegrown distributed state tech using NTP and it is 'fast!'" does not make for a sexy doc.

The "trouble with OSS distributed DBs without Jepsen tests" column has probably already been written. There should also be one about "competitor's mature product bashing blogs are HN clickbait".


Meh. About 80% of that article discusses known limitations of Cassandra, which aren't specific to the use case of object store metadata storage. In the little that it actually does specifically talk about that use case, if your object store reflects the Cassandra limitations (flat, infrequent mutability), I don't see why Cassandra would be a bad choice.


If you really want to see and get hints on what you can do with Cassandra, head over to the Netflix tech blogs. They use Cassandra extensively.


really? what do they use Cassandra for?


Back in 2014 when I last used Cassandra, Netflix were using it for some kind of metrics store I believe. They also published the nicer-to-use client back then (Astyanax or something it was called - a nicer API than Hector).

My memory is letting me down here on specifics - in my defence it's been a long time since I answered questions on Stack Overflow about Cassandra. I absolutely loved working with the product even if the sharp edges cut me a few times (ultimately my own fault, not Cassandra's).


Netflix uses it for everything that you'd use Postgres for but:

- they have Cassandra committers (most end up going to Apple iCloud)

- they have mgmt. support to make it work

- they are/were on tokens for years, not vtokens, since their internal tools were token-based

(They started with Oracle Enterprise (and a little MySQL), tried Mongo briefly, then decided to make Cassandra work for them.)

Source: original Netflix Cassandra team member.


I would have hoped it's obvious.

Your metadata database needs to be the fastest and most reliable store out of everything. It can't be eventually consistent without partitioning your datastore. Even then you'll end up partitioning your data neatly into the same failure zone.

Cassandra has basically one usecase: high volume writes, with a few batch reads.

Cassandra is not really optimised for high reads.

Most of the time postgres will do fine.


> Most of the time postgres will do fine.

For something that scales horizontally, I would probably recommend using something like FoundationDB for this use case [1].

A transactional Key-Value store is exactly what you want for this kind of use case.

[1]: https://apple.github.io/foundationdb/


Yes, now you are maintaining 5+ types of services for FoundationDB.

While you can vertically scale a single PostgreSQL server (with failover/HA) to 50+ TB of NVMe and grow until you can throw other engineers at the problem.

Example: wasabi.com uses MySQL for metadata.


I'm curious about your description of the 5+ types of services FDB requires. In my world it requires one.


Heading to the docs again, looks like they were consolidated. I remember you had to run coordinator nodes, data nodes, log storage nodes, cluster controller etc.


> Cassandra has basically one usecase: high volume writes, with a few batch reads.

That's how we use it in BigCompany(tm).

Cassandra holds (almost) all of our data. Most of it comes from batch insertions, but more complex (and lower volume) data comes from various REST APIs we expose and other teams use.

Most reads (60%) done on it are boring batch reads: analytics, reports, deltas, etc.

For more complex reads (30%) we copy relevant rows to Hive tables, then perform queries there without dealing with partition key shenanigans. So, batch with extra steps.

Then anything that requires random reads (10%) is read from ElasticSearch:

1. Coming from REST backend if it's time sensitive.

2. Coming from a batch that reads from Cassandra if not.


What would you use instead of Cassandra for multi-DC replication?


Cloud bigtable.


Or maybe CockroachDB?


> Cassandra has basically one usecase: high volume writes, with a few batch reads.

Yes. It's also a good choice for horizontal scaling, but only if the other DBs won't do.


I am working on SeaweedFS, which supports the S3 API for object storage and can also use Cassandra as the metadata DB. Cassandra has been performing well for most SeaweedFS users.

The article listed many known Cassandra characteristics and cited them as limitations. However, it all depends on the use case. There is no file system that works for all cases, and not all of them need ACID, CA vs. CP, etc. The remaining points are not convincing either; they are about how to design the data structures better.

Actually, SeaweedFS can use many other database/KV stores as the metadata DB. The list includes Redis, Cassandra, HBase, MySql, Postgres, Etcd, ElasticSearch, etc. https://github.com/chrislusf/seaweedfs/wiki/Filer-Stores

I did find one drawback of Cassandra as the metadata store though. One use case is that a customer uploaded a lot of zip files to one folder, /tmp, unzipped them, and then moved them to a final folder. The rate is about 3000 files per second created and then deleted. Being a LSM structure, the tombstones quickly pile up and the directory listing was slow.

The solution was to use Redis for that /tmp folder, and still use Cassandra for the rest of the folders. With Redis B-tree structure, the creation and deletion are cheap.

So it all depends on the use case.


> With Redis B-tree structure, the creation and deletion are cheap.

Pretty sure Redis has no B-tree, just a hashmap. But Redis has to do no compaction at all, since it can directly delete the stuff on disk.

> The rate is about 3000 files per second created and then deleted. Being a LSM structure, the tombstones quickly pile up and the directory listing was slow.

The problem here is also the GC grace period. Since the database has async replication, it needs to keep deleted rows (tombstones) for some days so downed replicas can catch up. So they actually stay on disk/in memory for some days and don't get compacted away, while in a CP system they would be.
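
In Cassandra terms that retention window is the table's gc_grace_seconds (864000 seconds, i.e. 10 days, by default). For a high-churn directory table you could shorten it, at the cost of having to run repairs within that window; a sketch with a hypothetical table name:

    ALTER TABLE filemeta WITH gc_grace_seconds = 3600;  -- keep tombstones for 1 hour instead of 10 days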


Storing metadata together with data will just make it harder and slower to query metadata (since it will reside on HDD in most cases).

You may think "it will be cached in ram, because it's small", yes, but then you'll end up querying many nodes just for metadata queries.

Yes, it's nicer to manage only one system, but in big scenarios it's probably better to separate them.

You can have 50+TB of nvme in 1 server, so your metadata layer probably doesn't need to horizontally scale.

Imagine if you lose some objects (because you lost some replicas). You won't even know WHICH objects you lost, because the metadata is gone together with the data.

Keeping them separate, you can use 5 replicas for metadata to be even safer, compared to the usual 3 replicas.


What kind of 'metadata' are they referring to here? Can you give some concrete examples?


Metadata of the object storage. In the simplest case it's a table of <bucket_id, filename, location_of_object_in_storage, file_size, object_metadata>.
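
A minimal CQL version of that table might look like this (hypothetical names and types):

    CREATE TABLE object_metadata (
        bucket_id       text,
        filename        text,
        location        text,            -- which host/drive/blob actually holds the bytes
        file_size       bigint,
        object_metadata map<text, text>, -- user-supplied headers, content type, etc.
        PRIMARY KEY (bucket_id, filename)
    );

Listing a bucket is then a single-partition read and fetching one object's location is a point lookup, though a huge bucket becomes exactly the oversized-partition problem discussed above.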


One of my favourite tech overkills I've seen in my career is a Cassandra database used to store a couple million records. Yup. Needless to say, it was later converted to PostgreSQL. The guy who did it is still advertising the fact on his online CV.


The one thing about Cassandra: most probably you don't need it, because you do not have the write performance needs and will not have them in the next five years. Scaling early is the death of many startups. [1] Postgres with a time-series database will be sufficient for most needs.

[1] https://www.duetpartners.com/why-is-premature-scaling-still-...


A tangent - I love using MinIO in dev & test as an S3 simulator. Being able to throw away a bucket and start from scratch, and having everything self-contained in my docker-compose command, is a real blessing.

Has anyone ever used min.io for production stuff? What are the pros & cons over vanilla s3?


We use it in anger at Splitgraph [0] to let people upload/download datasets and Postgres table snapshots (instead of storing them directly in S3).

Pros:

* Less platform-dependent. By self-managing it, we can also run deploys to GCP / Azure / Scaleway / other providers without writing a separate adapter for e.g. Azure Blob Storage.

* Python API [1] much more pleasant to use than boto3 (and can speak to normal S3). It doesn't do everything that boto3 does, but it supports everything we need (e.g. pre-signed URLs).

* minio server itself supports a large chunk of S3's functionality (e.g. SELECT API / AssumeRole / bucket versioning)

* Don't pay per request and for egress: this was a big deal since people might want to download large amounts of data from us (or make a bunch of small requests to download/upload a subset of data).

Cons:

* Have to manage own infrastructure. We run it on managed VMs so it's semi-managed, but we still have to provision block storage, set up backup policies etc.

* In a similar vein, scaling and availability all have to be DIY [2]. We haven't run into situations yet where Minio would be the bottleneck, but it might be something to keep in mind.

* Obviously not as seamless: you don't get things like Glacier or integration with other IAM.

[0] https://www.splitgraph.com/

[1] https://github.com/minio/minio-py

[2] https://docs.min.io/docs/distributed-minio-quickstart-guide....


It doesn't work for the lots-of-small-files scenario; there is too much overhead. You can't add +1 server, you need to add whole clusters. If you ask too many questions you get invited to a premium subscription. There is no public big production deployment of 10PB+, and no community of big production users (like there is with Ceph).


> What are the pros & cons over vanilla s3

Mostly that you aren't bound to AWS IMHO. You can run it on-prem, or in a cloud provider with no object storage service.


We opted against Minio, mostly because we had a storage solution that provided an S3 API, but also because managing Minio seemed painful. It's just not clearly versioned, we didn't figure out how you were supposed to rotate certificates, and the documentation isn't great.


My n=1 indirect experience as a consumer of someone else setting it up. The devops team had a lot of trouble keeping Minio up and running. Performance was not great.

The main pro and con is that it is self-hosted.


"I wanted to store something that could scale, but wanted to store and access my data in a way that didn't scale. Also, I don't understand CAP, distributed transactions, or distributed systems. I had a bad time."

Somewhere else in the comments: "Yeah, we went with MongoDB."


I don't quite get what they are referring to when they talk about 'metadata' here. Are they talking mostly about something internal to the database, something to do with the schema, or some kind of additional data used to enrich a particular object?


Minio is an S3 clone, so I assume by metadata they mean details about the object you uploaded. So if the actual data is a video file, the metadata would be the headers, the file size, and what host/drive the file is stored on. Essentially, when you ask for minio.io/my/video/file some service has to transform that to 192.168.0.1/customer/drive/ahe123.blob


Ahhh, I see, thanks.


From an ops perspective, managing Cassandra is a bitch. Just use a managed service unless you have the money to hire a dedicated Cassandra expert. I’m so glad I’m done with ops


Just to provide a counter statement: at $work we definitely do not have the appropriate headcount to even start considering running our own distributed datastores. However, we _do_ use Cassandra. And while many people shared the sentiment of it being too hard to maintain when the first few projects started to incorporate Cassandra, the truth is: there has never been any failure or incident, or even a technical hurdle, related to the usage/maintenance of the (admittedly smallish) Cassandra cluster (for 2 years now).

The costs of operating the cluster are ~5K/month (that's what our service provider charges us for 24/7 ops). I consider this a scam, since the averaged maintenance effort is perhaps ~1 hour per month.

(normally peaks at 40,000 writes/sec, 500 reads/sec, 200GB)


Yes, but 200GB at those read/write rates can reside on a single server, even entirely in memory; it's not a big amount of data.


Also be wary of any company which "also does Cassandra" amongst a suite of products and isn't FAANG scale.

Chances are they absolutely don't have the headcount to properly manage it and were just filling out the portfolio they wanted to offer.


It's frustrating when it doesn't recommend something better. Every DB has tradeoffs; if you choose something over Cassandra you have to give up something else.


What does "Bottom line. Write your metadata atomically with your object. Never separate them." mean from the perspective of picking a system?


If multiple distinct writes to DB must be atomic (as in a transaction), an AP system is likely a poor choice. AP systems are designed for 'eventual consistency'. If EC is a poor fit for 'consistent object data and metadata' in your application, do not pick an AP system.

An AP system may be a reasonable choice for an ODB if your added layer of transactionality is exploiting domain-specific characteristics that allow for fast-path, efficient, distributed transactions on top of a bare object-level AP system (such as Cassandra). Since this is all very niche and technically demanding, it is almost always a better choice for the general team to choose a mature CP database as backing for metadata-rich objects and/or domains that demand transactional support for systems of record.


Can anyone explain to me why someone would want to use Cassandra when not handling internet-scale stuff?

The team I am in uses it, but after asking many times why it was chosen, since it seems a poor fit for our use cases compared to a relational DB, the only justification I was given is that the cluster is easier for our ops guys to maintain.


Contrary to popular belief, programmers aren't very rational people, but are still prone to emotional beliefs when choosing databases.

Another reason is for "resume driven development".


Yes. The problem with "eventual consistency" is that "eventually" never happens about 1 in a million times, and with millions of storage objects to manage you can't have that. So what's the alternative as a metadata store for object storage? A consensus algorithm (Paxos, Raft, Mencius, ...) on top of local key-value stores.



