Hacker News new | comments | ask | show | jobs | submit login
Why Is Storage on Kubernetes So Hard? (softwareengineeringdaily.com)
187 points by kiyanwang 37 days ago | hide | past | web | favorite | 88 comments

I don't really think this is a Kubernetes-specific problem. If you have a million machines, want your database to run on one of them that is selected by some upstream orchestrator, and want the physical SSDs with that data on it to be in the same machine, you're going to have to do some work. But at the same time, you have to realize that you are doing this to get that tiny last bit of performance (most likely that 99.9%-ile latency) and that that last 0.1% is always the most expensive. Sometimes you need it, and I get that, but it's not a problem that everyone has.

Most applications are not so IOPS limited that they depend on the difference between PCIE request latency and going out over the network to get stuff that's actually stored on a nearby rack. And in that case what Kubernetes offers (with the help of cloud providers) is fine. You make a storage class. You make a persistent volume claim. Your pod mounts that. Not all that hard. If the performance isn't good enough, though, then you have to build something yourself.

I am used to a completely different model that we had at Google. You could not get durable storage in your job allocation. Everything went through some other controller that did not give you a block device or even POSIX semantics, and you designed your app around that. If you needed more IOPS you talked to more backends and duplicated more data.

Meanwhile in the public cloud world, you get to have a physical block device with an ext4 filesystem that can magically appear in any of your 5 availability zones, provisioned on the type of disk you specify with a guaranteed number of IOPS. It's honestly pretty good for 90% or even 99% of the things people are using disks for. (In my last production environment I actually ran stuff like InfluxDB against EFS, the fully-managed POSIX filesystem that Amazon provides. It did fine.)

For databases, local storage beats SAN storage massively. I can get better performance and IOPS from an Intel NUC with a decent PCIe SSD than I can get from an AWS RDS instance that costs as much per month as the NUC did to buy. If I optimize a proper rack mount server for database I can get an insane amount and storage and performance compared to even a few months of RDS. Even a pair of 40 Gbps SAN links would not compare.

> local storage beats SAN storage massively

Only if your (storage) network is slow (eg 1/10/25 GbE).

With a decent network (eg modern infiniband, 40+GbE, etc) for the storage, the latency and throughput to the storage shouldn't make a difference.

For example (years ago), I used to set up SSD arrays - SATA at the time, as M.2 wasn't a thing - and have them served over a 20Gbs Infiniband network to hosts in the same data centre.

The access times from an OS perspective to hit that storage over the network were the same as for hitting local disk. But the networked storage was higher bandwidth (multiple SSD's, instead of a single per host).

Worked really well. :)

For reference, the SCST project was the software used (back-in-the-day) for high performance block storage over infiniband:


Looks like it's still an active project too, as there's a new release listed from November 2018.

I'm not sure this is necessarily true. 2 x 40Gbps is bandwidth which is typically not the limiting factor. If it was you can go with higher bandwidth like FC. RDS is a database service, not SAN. Look at SAN boxes from storage companies, you can get things like 60 SSDs in a single box. To match the IOPs of that you'd need a lot of servers with local storage. I think in general dis-aggregating compute and storage is the optimal approach, whether some particular solution is better/price effective is a different question. Having all your drives in a box means they're easier to share, easier to service, easier to replace the servers as well ... lots of wins there.

If you're doing it locally, you'd ideally use RAID with a battery backed unit and your fsync essentially end up going to memory.

Until a battery relearn takes your app down in the middle of the night. I don't miss those.

The servers I buy now have large capacitors which don't fail like batteries, and you can then have the server on a UPS

I’m sure you can have some 12 drives in the same box as your database.

There are a few issues here.

EBS is a compromise, its allows your data to run anywhere in a region, on any machine. That means lots of hops and lots of interconnects.

the latest SAS stuff runs at 12 gigs(most likely 4 lanes per cable, and dual linked too.), _but_ thats dedicated for local traffic. The performance difference between having a SAS disk inside a box, or in the next rack is negligible. A decent SAN that exports over a 56gig connect (inifiniband et al) will be >> than a local pcie in terms of iops and sustained bandwith (at the expense of latency.)

Crucially, in somewhere that runs it's own datacenter, the storage is modelled for a specific workload. In VFX land we had 32 60 disk file servers, each capable of saturating two 10 gig ethernet links. But it'd be appalling for mixed VM hosting.

EBS is a hedged bet, it had a _boatload_ of caching to make any sort of performance. It had a huge amount of QoS to stop selfish loads stealing all the IOPs.

To get the best performance from EBS you have to have large volumes (5Tb+) and large latest generation instances (c5, r5, m5) 4 or 8/9 xlarge ec2 instances. This costs money to run.

This sort of portability is usually handled by the DB routing layer in cloud native ("webscale") database deployments. eg in Cassandra.

of course there are a lot of systems where one MySQL DB instance is needed and sufficient for the forseeable future. so that's a very static resource allocation one VM on a hypervisor with local disk, (and with backups every day), makes a lot of smaller sites very happy, and if they get the local provisioned performance instead of the EBS variant they will be happier for longer.

of course, EBS is a lot more fault tolerant.

> its allows your data to run anywhere in a region

that should read AZ

There's a wide range of performance with storage systems but why are you comparing that to RDS? That's a managed database service that just runs on EC2 instances with EBS volumes, and EBS is designed for affordable and scalable persistence instead of extreme performance.

EBS is not your typical SAN, which usually offer much more throughput, IOPS, and reliability in exchange for more complexity and latency, however you probably won't even notice that latency if your SAN system is close enough and using highend links.

That has almost nothing to do with SAN and everything to do with Amazon's implementation. A nuc will never give you the resiliency or performance of a properly sized SAN.

Furthermore, you'll never get close to saturating a 40gbps link with a database workload, your limitation is iops, not throughput, and you don't have anywhere near enough cpu to max out those links in a loaded server much less a nuc.


The standing joke amongst my co-workers is that ”enterprise” in this context is the tier you reach when you have gotten ripped of enough by third party vendors.

You usual run of the mill enterprise use most of this horsepower to heat air and have large clusters of oracle db’s power even more expensive SAP modules.

There’s a reason no serious iaas provider build infra this way. I have one example, actually an old customer to the vendor that used to employ me, that had probably well over 1k of tennants all with VM’s and DB’s backed by humongous SAN (replicated ofc!).

Not nice when corrupt data was mirrored due to a software bug in said SAN system. This went to a national level, really, due to the nature of many of the customers.

Sure you can build a reasonable architecture using the products and vendors mentioned, but, it doesn’t scale and it really locks you in.

I’ve seen and worked this stuff at numerous enterprises.

This only makes it my experience and one datapoint, so please bear that in mind! =D

they don't build it this way because cloud is cheap and slow. I have a UCS farm with 2000VMs, connected to 3PB of usable all-flash across 2 arrays, replicated active/active to another array, second-hopped to cloud. That's one site. About 100k users across 2 domains - 1k? I said enterprise. A SAN is not scalable but internal disk is? You need to look up what a SAN is.

The reason there is a standing joke with your coworkers is because you don't work on important things that run the world. When 1 minute of downtime costs you over $1mil, your "ripped off" Oracle costs, your SAN costs, etc, are lost in the rounding errors. I can have a SAN have hundreds of arrays, across multiple datacenters, and dynamically grow and shrink my storage needs. It is the definition of scalable. I can have one VM farm vmotion to another VM farm 30 miles away transparently, which will sit on a different array attached to the SAN. What happens when the storage needs of your server outgrow the drives you can shove in there? There are servers and databases a petabyte in size, pushing a million IOPS. They're in charge of money. If there is corruption on the SAN, well it infrequently happens, just like with servers and memory. Twice in my vast experience. You can roll back all writes on the replication software, and usually keep an undo journal a few days long, and something like hourly snapshots. This also protects against cryptoviruses and other types of corruption.

This is why people like you work at small companies, fiddling with your cute little projects, while the world moves forward with you on the sidelines. I've been doing this for 20+ years, and have been at most fortune100 companies, in 80 countries. But yeah, your opinion, while being dismissive, is cute.

The CIO is the guy in charge of getting this stuff at enterprises. It's clear you don't understand the impact of design. I am guessing you are not at an architect level. There's a reason for that, and a reason other people - the ones you put down in your post, are making the big decisions. People like you would cost the company millions of dollars in loss per year.

I said tenants, not users. Tiny little detail that you overlooked there. =) Last company I worked for had a €150.000.000 tech budget and about ~90k employees.

Again, if you read what I said you'll notice that I pointed out that you actually can build decent architectures using enterprisey stuff, it will however never scale horizontally in a resonable way. What happens when you have filled all your shelves in the array box? Ah, you'll need a new $250K box...

Mentioning vmotion and all... yeah so... VSAN et al... brrr... I'll put my trust in the open source world rather than shoddy software from 3rd party vendors trying to appease to the latest fad. Hyperconverge my *ss! =)

KISS is what rules but vendors are busy feeding channels and partners with impossible to penetrate acronyms leaving you in a mess sooner or later.

Focus from these vendors is to keep integration at a minimum, which means API's are usually crappy and achieving a reasonable level of automation is often a chore.

Now there is probably a little percentage of "enterprises" that uses tech in a sane way, but my bet is that the CIO you mention is a blockchain expert, as well as putting AI at the top of strategic actions to "implement" this year.

He probably have commissioned a pre-study from Accenture that outlines these strategic imperatives.

This is of course a little rant, but also true for the 90% of the 90% mentioned.

Don't be defensive and close minded. That's what's leaving most enterprises in the dust. There's just a lot of inertia inherent within certain businesses that will let them continue to burn through cash on pointless tech for yet a while.

again, cute. $250K. very cute. Try $5mil for a box, at least, after a 50% vendor discount. VSAN has nothing to do with SAN, and no one on a SAN uses VSAN. You are again showing your lack of enterprise experience, yet you are strongly trashing what you don't understand. A server sees many arrays, over a SAN. That is called horizontal scaling. You can't do that with internal disk.

Oh, 250 was for the box, no disks. 75% discount. :)

Keep it old-school!

Vsan btw is vmwares attempt to actually achieve _horizontal_ scalability.

I just don’t trust vmware with anything except for that bytecode vm. It used to rock, but time went ahead, and even as they IPO’d I thought they would end up dead in the water eventually. I had just been introduced to the wonderful zones in Solaris and it just made hypervisor vm’s seem silly.

And here we are with cgroups and friends...

Have a look at ceph if you’re interested: http://docs.ceph.com/docs/master/architecture/

Anyway, I’m out.

If your horizon extends to what vendors are prepared to sell you, take that 50 discount and run with it!

One final thought: is that disk not ”internal” to the array? This is what makes blockstorage notoriously difficult to scale horizontally. You’ll need very clever software!


yeah, I know what VSAN is. It's you who does not if you think you run VSAN on top of a SAN-connected cluster. Yes, the disk is internal to the array. A server can see a hundred arrays on the same HBAs. It doesn't care what array the storage comes from. I am positive at this point you know nothing about what a SAN is.

What "box" - the 42U Rack? I don't think so. The rack is always free. You then have DAs connected to the disk, and FAs connected to the SAN, which are on a pair of directors. You literally cannot get those w/o disk.

You don't know what a SAN is, you've never seen an itemized quote for an array. Thanks for your link. It's like sending a hooked on phonics link to an English professor. Cute. Keep it cute! People like you are the reason people like me get paid a lot.

For free! Good one. Cause’ that’s whats really happening... right? With a straight face?

I was certain I had nothing more for you, but this is too much fun!

You clearly don’t grasp the difference between vertical and horizontal scalability which means you have never been subjected to a bunch of scenarios requiring the latter.

Dinosaurs taking the p*ss are the reason many enterprises opt to off-shore and out-source.

yes idiot. I have worked for several vendors and sold this stuff. I have also been on the customer side buying this stuff. You're paying 250k for an empty 42U that you call "box" - you literally are lying.

there are, literally, zero enterprises off-shoring their hard IO hitting datacenters. In fact, having been in 80 different countries for these enterprises, they usually have many datacenters all over the world.

VSAN is for hyperconverged systems only. racks with a thousand 1u nodes that all have disk, connected to a fat ethernet backplane. Things on a SAN are not hyperconverged - they are on a SAN. VSAN is for tier 2 stuff that goes on hyperconverged - like web servers and DMZ things. The fastest processing is done on solid databases like Oracle or UDB on clusters of large servers, connected to a SAN. VSAN is even positioned by sales for tier 2 from all major vendors.

You literally picked up some technical words you heard around the office, googled a few things, and now consider yourself and expert so you give authoritative opinion on here about things you have never worked with. I bet you are deskside support or a code monkey, and have never architected a solution. When someone gives you a budget of 20mil and says you can average 1min of application unavailability per year or it impacts billions in company's bottom line and your whole team gets fired, do let me know. I'm sure your suggestion would be to cluster together a bunch of old dell laptops over wifi.

Hey sunshine!

So, you’re saying vsan for horizontal scalability and you regular san for single point of vertical scalability.

Yup, sounds about right!

I’ve only consumed vsan as a dev, and that turned out... not so good. Scaling performant storage horizontally is a science and no fun to troubleshoot.

Implying you get stuff for free from EMC is just a bad joke, or you have no clue how high the markups are. It does not matter what the invoice tells you, bottom line counts. Last time I was involved, admittedly many years ago, the enclosure actually came with a price tag.

I was there when the company I used to work for was first through the gates in emea to san boot their physical x86 boxes. What a ride and very little gain. Costed a fortune tho!

I’ve worked at a hardware/software vendor (499 of the 500) as an architect within professional services which means I have way to much exposure to the madness going on at the 500s.

I’ve built all flash boxes running zfs as well as xfs for specific scalability needs for customers, usually on-prem clouds but other use cases as well. Commodity hardware. Blazing performance.

If I need proper storage experts I turn to the zfs dev mailing list f ex.

If you could lower your guard for a sec you’ll see that I have no less than two times pointed out that you can build decent infrastructure with precious enterprise gear. You’ll pay through your nose tho, and personally I’d rather hire 5-10 super generalist FTEs rather than a vendor labeled expert and a bunch of hardware.

I’ll leave you with this oldie but goldie: https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-...

I know: “that’s not SAN”, but the point I’m making is the mark-up. If you are working at a place that allows you to play the lone ranger and blow everyone away with three letter vendor acronyms, good for ya’!

Lone rangers are the biggest blocker for most enterprises from a technical perspective, second only to politics.

Leave out the last sentence next time, that attitude overshadows your entire comment and isn't a good fit for HN.

Not a chance buddy. I've been here for a while, and this place has become reddit, with mods from /r/incels. I have two accounts. When I see someone with zero knowledge of subject matter, like OP, authoritatively and dismissively spew complete bs, I reply like I do.

People with a clear agenda spewing fake crap gets upvoted here now. Disagree with a mod - your comments are shadow removed, and possibly your account is shadow banned, with no warning, so the insecure mod can feel better.

I give actual valuable information from 20 years of experience at almost all the fortune100 firms. Snarkness? Yes. Much better and funnier than the guy I replied to.

I know you disagree. That's specifically why I include sentences like the last - to draw attention to the issue. comment karma has become useless here. Stuff is greyed out, stuff is shadow deleted - good things, filtered by idiots. I don't want my content filtered by insecure idiots.

Hey budz, many around here have 20+ years of experience.

I've fought all my career to get "enterprise" folks to understand that spending $100s of mill on hardware vendors and 3rd party software is madness, if what you are doing is actually mission critical.

Money should be spend on acquiring talent and adapting to actual business needs.

The business you support could not care less about your shiny toys, they want to transform, and probably yesteryear!

A digression here, from one of your other posts: if your business applications rely on live migration of vm's, you're doing it wrong. You should take this up with your CIO or even better the CTO, but he's probably busy looking at Gartner magic quadrants and contemplating the dire need of a blockchain.

Keep an open mind, and keep it simple. Best advice ever in this crazy business!

plenty of talent, and for stability and performance all that talent spends millions of dollars on top of the line proven solutions. Applications rely on live migrations for load balancing a farm, and it happens automatically in the background around the clock to rebalance farms. Hardware maintenance is another use case. So is a bucket of water. So is change control - I want to load new HBA firmware? I'll evacuate the ESX to somewhere else first.

no one contemplates blockchain, but for someone CIO level, gartner quadrants present a quick high-level summary of where the industry is going. and $100mil+ for hardware is completely normal to run things that process billions of dollars. The US treasury department is a great example. Mastercard is another.

What you describe is said talent doing it wrong.

But if it floats your boat — awesome!

Take care.

I agree that 90% of people do not need that, but there is a scale between 1 machine and 1 million machines.

It is not just IOPS but latency. Some of my production servers are connected one-to-one by crossover cable, in clusters with as many extra nics as needed (say 3 machines: 2 extra nics per machine) just to shave that little extra overhead of going through a router as yes, it matters in some applications!

Try using fiber then. I was surprised and pleased to learn that my 10G fiber has something like a third or quarter of the latency of my 10G copper in my new house.

You mean use (X)SFP(+), not “use fiber”. The SFP+ copper twinax cables I use in my lab at home have lower latency than the majority of MMF/SMF transceivers on the market.

Yes, thank you for the clarification/nomenclature. My SFP+ on MMF (multi-mode fiber) has significantly lower latency than the copper 10GbaseT I use. I haven't measured the direct-connect copper (coax? twinax?) SFP+ latency, though.

I would be curious if the SFP for 1G using fiber also has lower latency than copper 1000baseT.

A large data center has 20-30k machines. Amazon disclosed that back in 2014. Most likely you want to partition the machine into some chunks, so each application or each database only use some of them. So you don't need to talk to many machines at once.

In general, a distributed database can be really fast. DynamoDB claim to be 3ms (AWS reinvent 2014). The current technology can probably achieve close to 1ms, which is equivalent to memcache. With such performance, you don't really need local storage.

You can get much faster IO if you use many local SSDs. The downside is utilization. It is very rare a single machine has a workload that fully utilize local disk. You end up over provision greatly fleet-wise to get high performance. A managed database over a network is more likely to utilize disk/SSD throughput.

> and that that last 0.1% is always the most expensive

This is one of the things that gets me. If you get the last .1% that can ever be gotten, what do you do for an encore?

When your userbase grows another 5% or 10% or 20%, it won't be enough. You'd be better off trying to figure out ways to give your users something they want that doesn't require most exotic thing that can be procured. It's expensive to begin with, and it's the end of the road. You don't want to be at the end of a road trying to figure out what to do next.

Nobody’s getting “the last .1% that can ever be gotten”, so it’s a moot point. Google and Facebook are still growing. There’s always room for more improvements.

And they are both aggressively using horizontal scalability. What’s your point?

That these kind of micro optimizations are worth it. They reduce bulk and tail latency, and probably make the system more load-bearing. Of course, since now the system depends on a more "lucky" allocation of resources, it has more failure cases, thus a tighter operational envelope.

You keep implying that Facebook and Google are making these micro-optimizations with no support. I don't think you can support them, which means your premise is flawed and your conclusions are wrong.

They have ridiculous numbers of servers. Rather than fiddling to reduce server count, they're improving their ability to scale out. It's cheaper and it's repeatable. Doing crazy things like building custom hardware and power distribution buses (Facebook) to reduce heat in the data centers.

Google does a lot of micro-optimizations.

There are services where latency is important, they launch multiple requests to different replicas but to optimize this when one of the replicas is serving the response it sends a request cancel to the other one.

Horizontal scaling and vertical scaling is both very important.

Both G and FB pours many engineering hours into maximizing per server "ROI". Just think about how much they work on scheduling work/tasks so they can do more with the same number of servers.

Its more of an issue with the OS, *nix is showing its age as a useful OS as we are moving away from a static and simple system (PC's, servers, etc) to more distributed computing.

Kubernetes has a few good storage solutions that are built on well-known technology...

https://rook.io (built on ceph)

https://www.openebs.io (built on Jiva/cStor)

The distributed storage problem is difficult on it's own, this isn't a kubernetes specific issue. IMO Kubernetes is improving on the state of the art by dealing with both the distributed storage problem and dynamic provisioning, and that's why it might seem somewhat flustered.

What isn't shown here is how effortless it feels when you do have a solution like rook or openEBS in place -- we've never seen ergonomics like this before for deploying applications. Also, the CSI (Container Storage Interface) that they're building and refining as they go will be extremely valuable to the community going forward.

BTW, if you want to do things in a static provisioning sense, support for local volumes (and hostPaths) have been around for a very long time -- just use those and handle your storage how you would have handled it before kubernetes existed.

Shameless plug I've also written about this on a fair number of occasions:



I've gone from a hostPath -> Rook (Ceph) -> hostPath (undoing RAID had issues) -> OpenEBS, and now I have easy to spin up, dynamic storage & resiliency on my tiny kubernetes cluster.

All of this being capable as someone who is not a sysadmin by trade, should not be understated. The bar is being lowered -- I learned enough about ceph to be dangerous, enough about openebs to be dangerous, and got resiliency and ease of use/integration from kubernetes.

OpenEBS looked interesting, but it turns out they're spammers. :(

Starred the GitHub repo... and a few minutes later received email spam from them to my personal email (it's in my GitHub profile). :(

Completely lost interest in their project at that point.

Probably best to skip it, as rewarding spammers doesn't lead to good things. :(

I had no idea they did that, just made an issue about it[0].

Also, I wouldn't be so quick to write them off -- their solution is based on Container Attached Storage (CAS), and is the only relatively mature solution so far I've seen (I haven't seen any others that do CAS) that sort of take the Ceph model and turn it inside out -- pods talk to volumes over iSCSI via "controller" pods, and writes are replicated amongst these controller pods (controller pods have anti-affinity to ensure they end up on separate machines).

I have yet to do any performance testing on ceph vs openebs but I can tell you it was easier to wrap my head around than Ceph (though of course ceph is a pretty robust system), and way easier to debug/trace through the system.

[0]: https://github.com/openebs/openebs/issues/2345

Thanks for doing that. I have personal strong aversion to spammers. To the point where their product doesn't matter, I'll just use/help a competitor instead.

Reaching out to the occasional person who stars a project, if there's some strong overlap of stuff then maybe sure. An automatic spam approach though... that's not on. That makes starring projects a "dangerous" thing for end users, as they'd have to be open to emails for every one. :/

And yeah, it did look useful up until this point. Lets see what their response to the GH issue is like. :)

Thank you for the feedback. Feedback like this will help us understand the best practices in community building. We have disabled the email trigger on starring. Thanks again for this open feedback.

No worries Uma. The project looks interesting, and it's even written in Go, so I'll probably take it for a spin in the next few weeks. :)

They just replied saying they disabled the emails.

Here's the root of the problem.

> Static provisioning also goes against the mindset of Kubernetes

Then the mindset of Kubernetes is wrong. Or at least incomplete. Persistent data is an essential part of computing. In some ways it's the most important part. You could swap out every compute element in your system and be running exactly the way you were very quickly. Now try it with your storage elements. Oops, screwed. The data is the identity of your system.

The problem is that storage is not trivially relocatable like compute is, and yet every single orchestration system I've seen seems to assume otherwise. The people who write them develop models to represent the easy case, then come back to the harder one as an afterthought. A car that's really a boat with wheels bolted on isn't going to have great handling, but that's pretty much where we are with storage in Kubernetes.

AFAICT, the problem is have a separate, generic, distributed storage layer that is also performant is a solution very few have (like Google), and the rest of us have tried to hack things together with StatefulSets, GlusterFS, NFS, Persistent Volumes, etc.

The Borg/Omega model kind of assumes you have a Google-like storage tier interconnected with 10GbE links that is automatically replicated and accessible from potentially anywhere. Once you have that, then storage on k8s is "easy".

Unless I totally misread the "colossus" papers, Google does not in fact have an "accessible from anywhere" storage layer. GFS/Colossus is a per-cluster filesystem capable of read and append only. It is very much NOT generic and only supports custom applications.

Presumably that means that on every Google machine there is a directory that is shared for the entire cluster BUT:

1) It is per-cluster, not global (given Google's cluster sizes one imagine's that's not that much of a limitation, but then again, it seems like that is a clear limit)

2) you can only create or append to files (or delete them I guess) (and this is therefore non-posix, and does not support things like mysql or postgres)

3) this is very different from what GlusterFS, NFS, persistent volumes, etc provide. Therefore disks on google cloud are presumably very much not just files on this GFS/colossus thing.

4b) it was single master at least until 2004. Maybe until 2010. Apparently that can work.



> GFS/Colossus is a per-cluster filesystem capable of read and append only. It is very much NOT generic and only supports custom applications.

For the most part, google only has custom applications. Just about everything is written in-house, and takes advantage of things like being able to open a file from the local disk file just as easily as opening one in Colossus or their equivalent of Zookeeper.

Would regional Persistent Disks be acceptable?


Global might mean you're accessing blocks stored in Europe from a container running in Australia. If your workload is doing that, you might want to consider network costs and latency, and then reconsider whether you really want to architect things that way.

You are entirely correct, Colossus isn't POSIX. However, the solution Google has, is since they control all the software, essentially everything they write acts as if Colossus is their storage layer.

This "solves" the persistent storage layer in a way Kubernetes cannot. Inside Google you might deploy your "Spanner" database container which knows how to interface with Colossus and doesn't require any special setup or colocation (below the cluster level). You cannot deploy MySQL on K8s and expect the same.

The tail wagging the dog that is the last 5 years of devops advancement. Constraints imposed on software development to achieve reproducibility.

The most obvious example of this is the need to prematurely scale horizontally and deal with ephemeraliy (even compute).

If you’re ever planing on running at scale (and I assume most startups/projects strive for it), then there’s no point (other than maybe a rough poc) at which sharding is “premature” because later on it most likely will be prohibitively expensive.

Picking sharding specifically is weird given the many problem spaces where it won't work.

Also, whatever unnecessary scaling solution you come up with today for the problems you have today is not likely to be applicable to the problems you have years down the road; features will have changed, apps will be rewritten.

I think people underestimate how much a handful of properly specd servers can achieve. My best experience with "at-scale" was 25+ million users, heavy API usage (and the API hits weren't trivial / hard to cache), and we could handle all the load on 2 beefy servers (but had more for HA and lower latency (they were distributed)).

Quick search and I found a dual EPYC 7281 with 256GB DDR4, 2TB nvme, 2TB ssd, and 12TB HDD for $500/m in LA. Now it obviously depends on what you're doing, but for most startups to max this out, they're either extremely successful or extremely bad software engineers.

> Quick search and I found a dual EPYC 7281 with 256GB DDR4, 2TB nvme, 2TB ssd, and 12TB HDD for $500/m in LA.

And if you're willing to actually get a physical server, Dell happily sells single nodes with 6 terabytes of RAM for a few tens of thousands. Other vendors go much higher.

> I think people underestimate how much a handful of properly specd servers can achieve.


> Quick search

Even AWS (i.e. the very pricey one) has an x1e.2xlarge with 4.5 physical (9 hyperthreaded) Xenon E7 8880 v3 processors, 244GB DDR4, 240GB SSD, 25 Gbps network for $1200/month paid monthly or $700/month paid annually.

And that scales linearly 16x.

You've really got to be something amazing (maybe a ton of video?) to scale past what a single commodity box can handle.

What do you mean by at scale? At what scale?

In the past, I've supported a few million daily active users on a single MySQL database.

Use cases obviously vary, your numbers may not line up.

But I'm pretty sure there are lot of startups that would be fine starting with an architecture that can scale to 2 million users per day.

A few million DAU is not at scale. A few million concurrent active users might be, but really, “at scale” has no meaning other than “we got to the point where shit gets hard”. And that point is different for every app.

I remember watching a big xbox/ps3 game launch crash and burn when 16 of the biggest MySQL servers we could buy couldn’t keep up. That was two jobs ago and embarrassing (and probably expensive). That was “at scale” for us.

Last job we started at about 60 m4 ex2 instanced and were well into the thousands when I left. I suspect they’re approaching 10k instances now. And they’re pre-IPO startup, and I think that was at scale.

Current job measures in the hundreds of thousands of database instances, and I only count one specific database engine. Probably counts as at scale.

Really curious what kind of problem requires 100.000s of DB servers, and what database system is actually capable of such massive horizontal scaling? Are you able to provide any specifics? I imagine that typically a single instance can serve at least 100-1000 concurrent users, meaning you have about 10 to 100.000.000 concurrent users on the system? Since you mentioned gaming I assume it's a massive multiplayer-game like Fortnite?

Gaming was two jobs ago. It’s apache Cassandra, and I work for a large tech company.

  Why is it so hard to hammer
  a nail with a screwdriver?
Because stateful storage isn't the problem the system was developed to solve. The author is conflating what K8s is (a stateless container orchestrator) with what he wants it to be (a full service devops guy).

Now, if you simply must have stateful storage, I've had a pretty good time with PVCs pointing to Plain Jane NFS volumes combined with node tainting and performing local replication of high need data on the pods as needed.

If your pods are scaling up and down or dying so quickly that this seems untenable, you have other fish to fry.

I think the premise of this article is misleading. It's true that creating a storage subsystem in Kubernetes is exceedingly difficult, however using Kubernetes for stateful applications isn't any more difficult than using instances with persistent disk in any cloud provider.

For applications deployed in cloud infrastructure, most folks are using persistent disk simply for the ease of management. Some folks end up going to local disk for scale and performance reasons, at which point they end up having exactly the same problem one would have trying to do the same with Kubernetes.

Microsoft had the same problems when launching their microservice platform (Service Fabric). They relialized most of their problems went away if they gave their stateless service the notion of state and so they did. Most MS services, including Azure itself, now runs of something similar to K8 but that is state full.

Service Fabric powers many Microsoft services today, including Azure SQL Database, Azure Cosmos DB, Cortana, Microsoft Power BI, Microsoft Intune, Azure Event Hubs, Azure IoT Hub, Dynamics 365, Skype for Business, and many core Azure services.


Netapp has Kubernetes-as-a-Service that includes their own persistent volume. As a consultant I had it foisted on me for a project, but it worked pretty well - I never had to worry about storage - it just worked - so I guess I never realized it was "hard"? Definitely something I'm glad I let someone else manage.


I could not figure out cluster storage with docker swarm at least. With containers springing up and going away on multiple computers I could not figure out where the volumes were supposed to “live”

The only thing that at least seemed possible without additional software was nfs mounts but I was thinking it does not help the high availability cause to see the one and only storage node go down. Nor could I see any benefits to a cluster that used the same network for file access as external network requests. Say I wanted to build a horizontally scalable video sharing site I don’t see how I could test the real world performance of it before it went live.

You could use multiple network interfaces, one for storage, the other for the rest.

Not sure if this is in a datacenter or not. In a datacenter, you could use something like 3-PAR to have network attached storage.

But storage is hard. This is one of the advantages of using cloud providers, they have this part figured out for you. AWS's EBS volumes are network-attached storage.

Then there's another layer of abstraction when you are using containers. Ok, so you have this volume accessible from the network now, in whatever form. How do you attach it to running containers? That's what this article is about, mostly. Not with the underlying storage mechanisms.

I am Leary of Amazon because of the potential for sudden ruinous expenses. I looked over digital ocean and they say they are working on kubernetes but I don’t see how block storage is going to make it work. I’m just developing this on my home server for now. I’m just trying to learn the basics. Seems like cluster computing isn’t something that can be roll your own like ordinary docker can be.

Thanks for the lead though.

> I looked over digital ocean and they say they are working on kubernetes

FYI, Digital Ocean Kubernetes opened to everyone on December 11th.

You can get a decent-sized cluster running free for a bit with Kubernetes Engine in Google Cloud, using the initial credits they give you. The cluster will shut off when the credits do expire.

I'm just afraid all of these volumes and resources I'm creating will be 100x easier to 'lose' and have total sprawl than it was in VM world, which already had enormous sprawl issues.

Clustered filesystems like Ceph or GlusterFS might help with the reliability part, but latency and bandwidth still remain a problem.

The other thing I was quite interested in at one point was flocker, a volume manager which is responsible for replicating/migrating data along with the container it is connected to. The company shut down quite a while ago but code is still available: https://github.com/ClusterHQ/flocker

I ended up making this: https://github.com/gluster/piragua to make connecting glusterfs to kubes easier/scalable.

A few observations here

o managing state is hard. Kubernetes makes it look simple because we have been moving the state from apps to databases/messaging queues o storing data durably at scale, with speed and consistently is a CAP problem o distributed block storage is slow, complex and resource hungry o distributed storage with metadata almost always has a metadata speed problem (again CAP) o managing your own high performance storage system is bankruptingly hard

Ceph/rook is almost certainly not the answer. Ceph has been pushed with openstack, which is a horridly mess of complexity. Ceph is slow hard to mange and eats resources. if you are running k8s on real steel, then use a SAN. if you're on AWS/google use the block primitives provided.

Firstly, there is no general storage backend that is a good fit for all workloads. Some things need low latency (either direct attached, or local SAN) some can cope with single instances of small EBS volumes. Some need shared block sorage, some need shared posix.

With AWS, there is no share block storage publicly available. Mapping volumes to random containers is pretty simple. Failing that there is EFS, which despite terrible metadata performance, kicks out a boat load of data. If you app can store all its state in one large file, this might be for you.

Google has the same, although I've not tried their new EFS/NFS service.

While storage generally not a simple problem, Kubernetes' design makes it even harder. Why Kubernetes went this way is not entirely clear to me. One of the reasons could be its owners' focus on cloud infrastructures and tutorial-level developer use cases, where you want to show how to dynamically create and consume a file system backed by GCE persistent disks or EBS. Also getting a secure and usable Kubernetes setup running is currently a lucrative business model.

But as the OP writes, the standard case it should support is secure access to existing storage, and Kubernetes fails there on all levels. Simple things should be simple, hard things should be possible.

In its API design, it hides storage under abstractions like PVC, PV, StorageClass that most users have not seen before in their career (where do they come from btw?) and which do not re-use general system architecture concepts. Worse, these abstractions do not have exact definitions as their semantics can be massaged with access control rules.

Consider the most simple case: mounting an existing file systems and authenticate with a Kubernetes secret. You should be able to do this in 3-4 lines in a pod defnition. Instead you need to create several yaml files, whose content is strongly dependent on how your Kubernetes was set up, so no general tutorial will help you (but others are, see above).

And this is just for getting the basics going. Secure access / user authentication is still unsolved after two years (https://www.quobyte.com/blog/2017/03/17/the-state-of-secure-...), and does not seem to be high on the agenda neither in Kubernetes nor CSI.

There are other basics missing (like file systems do not have necessarily a "size"), but let's not go into detail there.

Storage itself is a complicated problem domain, one that is easily dismissed until you actually have to deal with it in gory detail.

K8s grew organically here, and thus the seams show. It’s getting better every release, particularly being able able to dynamically provision and schedule storage to pods in a “zone aware” manner, which has been tricky.

The deeper issue issue is we are spoiled for choice on storage engines. Using Ceph for random access r/w low latency block storage I wouldn’t wish on my worst enemy, for example. But it’s hard to distinguish hard numbers for comparative purposes.

The old fashioned approach of using a database server in a co-hosted setup with a locally hosted storage on SSD with another couple of servers in active or standby type of replication is the best option for the db side.

The rest of the stateless compute cores can run on kubernetes.

There are solutions like Arrikto's Rok [1] that solve the problem of handling stateful workloads in Kubernetes.

[1] www.arrikto.com

Kind of feels like the author doesn’t know kubernetes? Stateful sets make all this very easy. I only have a user-end knowledge of k8s from GCE and it seems like k8s makes storage very easy.

EDIT K8s makes storage very easy. No “seems.”

> k8s from GCE and it seems like k8s makes storage very easy.

or, using storage in an environment where the real storage is completely managed for you makes storage seem easy..

I saw many comments about stateful workloads. I am not sure it is a necessary issue for cloud environment.

Within a zone or a cluster, the latency is about 1ms, which is faster than most hard disks. The network bandwidth is on par with disk throughput. What we really need is a faster database and a faster object storage that can match the network performance (1ms and 10Gbps), then all workloads can be stateless.

If one uses a VM on GCP, the VM has no local storage besides the small local SSDs. Practically even the VM is stateless besides some cache.

> The network bandwidth is on par with disk throughput

Yes, and most storage you have access to, in cloud environments, is network attached. GCP disks, AWS EBS volumes, etc. All network and outside the hypervisors. You may have some local storage, but that's ephemeral, by design.

However, since we are talking about Kubernetes: not only VMs are ephemeral, but your containers are ephemeral too! And they can move around. So now you (or rather, K8s) need to figure out which worker node has the pod, and which storage is assigned to it, and then attach/detach accordingly.

This is what persistent volumes and persistent volume claims give you. They actually work fine already for StatefulSets.

Now, if you are in a cloud environment you should look into the possibility of using the hosted database offerings. If you can (even at a price premium), that's a great deal of complexity you are going to avoid.

With stateless services, forwarding requests to underlying storage and serialization can dominate resource consumption. After all, some services will do little besides fetch the right content and transform the data somehow.

Addressing this requires caching data in memory while making sure those caches are also disjoint so that you fully utilize your cluster memory. This has driven Google (and others) to make some services semi stateful and build dynamic sharing infrastructure to make this easier [1].

[1] https://ai.google/research/pubs/pub46921

From a database perspective, 1ms to disk is an eternity.

A good disk subsystem had less write latency than that in the early 90’s.

Write to disk has no practical latency because of write buffer, either local file system or remote database. Flush to disk would be slow unless you use SSD.

On the other hand, a single machine has limited reliability. If one wants to have high availability, they needs to dual write to another machine, which also has network latency.

I don’t think that’s true. I recall ~5ms seek times being top of the line.

he said 'subsystem' not 'disk'-

what was the latency to the controller with ram cache?

using seek time as a measure is also somewhat worst case - controllers/filesystems also queue(d) according to drive geometry.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact