How We Knew It Was Time to Leave the Cloud (gitlab.com)
379 points by sytse on Nov 12, 2016 | 262 comments

I appreciate that you didn't sling mud at Azure, but re-reading the commit for the move to Azure [1], there were telltale signs then that it might be bumpy for the storage layer.

For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;). In all seriousness, we (Google, all providers) struggle with deciding what we can strictly promise. Offering you a "guaranteed you can hit 50k IOPS" SLA isn't much comfort if we know that all it takes is a single ToR failure for that to be not true (providers could still offer it, have you ask for a refund if affected, etc. but your experience isn't changed).

All that said, I would encourage you to reconsider. I know you're frustrated, but rolling your own infrastructure just means you have to build systems even better than the providers. On the plus side, when it's your fault, it's your fault (or the hardware vendor, or the colo facility). You've been through a lot already, but I'd suggest you'd be better off returning to AWS or coming to us (Google) [Note: Our PD offering historically allowed up to 10 TiB per disk and is now a full 64 TiB, I'm sorry if the docs were confusing].

Again, I'm not saying this to have you come to us or another cloud provider, but because I honestly believe this would be a huge time sink for GitLab. Instead of focusing on your great product, you'd have to play "Let's order more storage" (honestly managing Ceph has a similar annoyance). I'm sorry you had a bad experience with your provider, but it's not all the same. Feel free to reach out to me or others, if you want to chat further.

Disclosure: I work on Google Cloud.

[1] https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/...

> rolling your own infrastructure just means you have to build systems even better than the providers

As a cloud provider, though, you're trying to provide shared resources to a group of clients. A company rolling their own system doesn't have to share, and they can optimise specifically for their own requirements. You don't get these luxuries, and it's reasonable to expect a customised system to perform better than a general one.

As a counter argument: very few teams at Google run on dedicated machines. Those that do are enormous, both in the scale of their infrastructure and in their team sizes. I'm not saying always go with a cloud provider, I'm reiterating that you'd better be certain you need to.

Interesting, presumably they're very well informed and they obviously feel that Google's cloud offerings are the best way to go.

If I may ask a few questions:

- Are they charged the same as external customers or do they get a 'wholesale' rate?

- As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?

- Do they get any say in the hardware or low-level configuration of the systems they use? (i.e. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with a bunch more RAM, for the week?)

- Do they have the same type of performance constraints as the ones encountered by gitlab?

I feel like most of the reason to use cloud services is when you have little idea what your actual requirements are, and need the ability to scale fast in both directions. Once you're established and have a fairly predictable workload, it makes more sense to move your hosting in-house.

The Google teams that he's referring to probably don't run on Google Cloud Platform, and rather run on Google's internal infrastructure that GCP is built upon. So most of your questions may not apply. However, his points about cloud infrastructure are still valid.

If you're right, then Google teams using internal Google server infrastructure is literally Google rolling their own.

He technically only said they don't run on dedicated machines, not that they run on GCP. My guess would be Google has some sort of internal system that probably uses a bunch of the same software but is technically not GCP.

This is mostly speculation based on having read this http://shop.oreilly.com/product/0636920041528.do

Hopefully someone who actually knows what they're talking about will be along shortly!

> Are they charged the same as external customers or do they get a 'wholesale' rate?

I'd be quite surprised if internal customers are charged a markup. Presumably the whole point in operating an internal service is that you lower the cost as much as possible for your internal customers.

> As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?

From the above book, it seems that the hardware is largely abstracted away so most services aren't really aware of servers. I assume there's some separation between internal and external customers, but at a guess that'd largely be because of the external facing services being forks of existing internal tools that have been untangled from other internal services.

> Do they get any say in the hardware or low-level configuration of the systems they use? (i.e. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with a bunch more RAM, for the week?)

As above, the hardware is largely abstracted away. From memory, teams usually say "we think we need ~x hrs of cpu/day, y Gbps of network,..." then there's some very clever scheduling that goes on to fit all the services on to the available hardware. There's a really good chapter on this in the above book.
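To make the "fit all the services onto the available hardware" idea concrete, here is a minimal sketch of first-fit-decreasing packing. It is not Google's scheduler (the thread doesn't describe its algorithm); the service names, CPU figures and the `pack` helper are all made up for illustration.

```python
# Hypothetical sketch: first-fit-decreasing packing of CPU requests onto
# identical machines. Service names and numbers are invented.

def pack(requests, machine_capacity):
    """Place each request on the first machine that still has room."""
    machines = []  # each machine: {"free": cpus_left, "services": [names]}
    # Largest requests first, so big services are not squeezed out later.
    for name, cpus in sorted(requests.items(), key=lambda kv: -kv[1]):
        if cpus > machine_capacity:
            raise ValueError(f"{name} does not fit on any machine")
        for m in machines:
            if m["free"] >= cpus:
                break
        else:
            # No existing machine had room: bring a new one into the pool.
            m = {"free": machine_capacity, "services": []}
            machines.append(m)
        m["free"] -= cpus
        m["services"].append(name)
    return machines

requests = {"search": 12, "mail": 7, "maps": 6, "ads": 4, "logs": 3}
placement = pack(requests, machine_capacity=16)
for i, m in enumerate(placement):
    print(i, m["services"], "free:", m["free"])
```

A real scheduler would also juggle memory, network, priorities and failure domains; this only shows the packing step for a single resource.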

> Do they have the same type of performance constraints as the ones encountered by gitlab?

Presumably it depends entirely on the software being written.

But some workloads get all the priority while others get zero/idle priority. Not true in public cloud.

Multitenancy is a large part of what makes public cloud providers profitable, but they all understand the need to isolate customer resources as much as possible.

Using a resource abstraction layer such as Mesos can alleviate this downside by consolidating many of your workloads onto a pool of large dedicated machines.

In the end it doesn't really say that they rolled their own on prem solution. For the kind of money they were forking out in the cloud you could just buy a Netapp or Isilon and get something that provides enough consistent storage performance. You don't need a distributed FS for the kinds of numbers they're looking at, using one is just a complicated way of working around the underlying limitations of cloud storage. In your own datacentre getting storage that works is pretty easy.

Most appliances are not geared towards many small random reads. And if you scale they start to be very expensive. And we would love to use an open source solution all our users can reuse.

I'm not an expert in Ceph but I've built many other storage solutions and typically where distributed filesystems fall down in performance is with small files. Even something like an Isilon can get into trouble with those kinds of workloads. The files are too small to be striped across multiple nodes and there's a lot of metadata overhead. Monolithic systems tend to do better with small files but even then you can run into trouble at the protocol (NFS) level with the metadata.

Disk systems do get expensive at scale but the scale that they're usually sold at these days is pretty huge. You talk about going up to a petabyte but that's a fraction of a single rack's worth of disk these days. Not everyone wants to be a filesystem expert and distributed filesystems are jumping in at the deep end.

You are right regarding many small files. Interestingly reading from many small files didn't seem to be so much of a problem with CephFS as it was to keep a large file open while reading and writing to it from thousands of processes (the legacy authorized_keys file).

Clearly CephFS has weak spots, but from what I've seen those are spots that we can work out, rough edges here and there. The good thing is that we are much more aware of these edges.

We are already working on what the next step will be to smooth out these weaknesses so we are not impacted again. And of course to ship this to all our customers, whether they run on CephFS, NFS appliances, local disks or whatever makes sense for them.

FYI. The kernel Ceph client has local fscache cache. I added it to the kernel :) https://lwn.net/Articles/563146/

We started using Ceph because we wanted to be able to grow our storage and compute independently. While it worked well for us, we ended up having much larger latencies as a result of this. So we developed FSCache support.

Even better yet, if your data is inherently shareable (or has some kind of locational affinity) you can end up always serving data with a Ceph backend from the local cache, with the exception of a server going down or the occasional request. I'm guessing it is (repo / account)

On your API machines serving the git content out of the DFS you can set up a local SSD drive for read-only caching. Depending on your workload you can end up significantly reducing the IOPs on the OSDs and also lowering network bandwidth.
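As a rough illustration of why a local read-only cache cuts load on the storage backend, here is a small read-through cache sketch. It is not FSCache (which lives in the kernel); the `ReadThroughCache` class, the dict standing in for the Ceph backend, and the keys are all hypothetical.

```python
# Hypothetical sketch of a read-through, read-only cache. A dict stands in
# for the local SSD and another dict for the remote storage backend.

class ReadThroughCache:
    def __init__(self, fetch):
        self._fetch = fetch        # callable that reads from the backend
        self._local = {}           # stand-in for the local SSD cache
        self.backend_reads = 0     # how often we had to go to the backend

    def read(self, key):
        if key not in self._local:
            # Cache miss: go to the backend once and keep a local copy.
            self.backend_reads += 1
            self._local[key] = self._fetch(key)
        return self._local[key]

backend = {"repo-a/HEAD": b"ref: refs/heads/master\n", "repo-b/HEAD": b"ref: refs/heads/dev\n"}
cache = ReadThroughCache(backend.__getitem__)

for key in ["repo-a/HEAD", "repo-a/HEAD", "repo-b/HEAD", "repo-a/HEAD"]:
    cache.read(key)

print("requests: 4, backend reads:", cache.backend_reads)
```

The sketch never invalidates entries, which is only safe for data that does not change underneath the cache; a real setup has to deal with writes and eviction.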

With the network / IOPs savings we've decided to run our CephFS backed by an Erasure Coded pool. Now we have a lower cost of storage (1.7x vs 3x replication) and better reliability, because with our EC profile we can lose 5 chunks before data loss instead of 2 like before. That's because more than 90% of requests are handled with local data and there's a long tail of old data that is rarely accessed.
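The overhead figures can be sanity-checked with a little arithmetic. The exact EC profile isn't stated here; k=7 data chunks with m=5 coding chunks is an assumption chosen because it reproduces roughly 1.7x overhead and 5 tolerated chunk losses.

```python
# Storage overhead: raw bytes stored per logical byte.
# The k=7, m=5 profile is an assumption, not a figure from the thread.

def replication_overhead(copies):
    return float(copies)

def ec_overhead(k, m):
    """k data chunks plus m coding chunks per stripe."""
    return (k + m) / k

k, m = 7, 5
print("3x replication overhead:", replication_overhead(3))
print("EC %d+%d overhead: %.2f" % (k, m, ec_overhead(k, m)))

# Losses survivable without data loss: copies - 1 for replication, m for EC.
print("replication tolerates", 3 - 1, "lost copies; EC tolerates", m, "lost chunks")
```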

If you're going to give it a try, make sure you're using a recentish kernel such as a late 3.x series or 4+ (as in Ubuntu 16.04). Those have all the CephFS FSCache and upstream FSCache kinks worked out.

Thanks mtanski, this is great data.

We are running a recent kernel, as in Ubuntu 16.04.

The reason I'm framing the caching not so much at the CephFS level is because we are shipping a product, and I don't think that all our customers will be running CephFS on their infra. Therefore we will need to optimize for that use case also, and not only focus on what we do at GitLab.com.

Thanks for sharing! Will surely take a look at this.

I just want to note that the blog post you linked to was never actually published since we decided against pointing fingers when – by and large – the problems weren't their fault and their engineers were quite helpful. See the comment at the bottom of that thread[1] from Sid.

We made mistakes ourselves that were the root of the problems at the time, and based on the update[2] a few months later things definitely improved significantly.

[1]: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/... [2]: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/...

Judging by the posts.

Random thought: Go talk to google and get $100k of free cloud credits.

You run 100 instances of n1-highcpu-2 (2 CPU, 2GB RAM), with a local 375GB SSD. That gives you 37TB of storage and you're fine for the year.

This is an absolutely ridiculous setup but it could work.

1) Servers don't need to have more memory for caching. Local SSD access is fast enough.

2) Servers have 1Gb uplink. That can be saturated easily by a single local SSD.

3) I'm a firm believer that if you put 8-16 disks in a single box, you're gonna hit a hard limit with the network bandwidth and infrastructure (among other things).

4) Do you need IOPS or do you need storage???

5) Can CephFS shard fairly across a hundred systems?

6) Should CephFS run on raw disks, manage replica itself? and can it handle failure?

Bonus question) Why are we thinking about all of this. You should have just used s3?
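The numbers in this setup are easy to check. The sketch below uses the figures from the list above (100 instances, one 375GB local SSD each, 1Gb uplinks); the 3-copy replication factor is an assumption added for illustration.

```python
# Back-of-envelope capacity and bandwidth for the proposed fleet.
instances = 100
ssd_gb = 375          # one local SSD per instance
uplink_gbit = 1       # per-instance network uplink

raw_tb = instances * ssd_gb / 1000          # decimal TB
aggregate_gbit = instances * uplink_gbit    # total uplink across the fleet

replicas = 3                                # assumption: 3 copies of every object
usable_tb = raw_tb / replicas

print("raw capacity (TB):", raw_tb)
print("usable with %d copies (TB):" % replicas, usable_tb)
print("aggregate uplink (Gbit/s):", aggregate_gbit)
```

Note how much the usable figure drops once replication is included, which matters for question 6.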

> Bonus question) Why are we thinking about all of this. You should have just used s3?

NFS and Ceph both support the standard POSIX file access model, notably including random-access reads and writes on a file without replacing the entire file and consistency models that let you reason about directories and their contents usefully. If your goal is to run the reference implementation of git, or something that is compatible with it, on the filesystem, you absolutely need something that supports this.
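A small sketch of the difference, assuming an object store that only offers whole-object get and put (simulated here with a dict; no real S3 calls are made). The POSIX half uses `os.pwrite` and `os.pread`, which are available on Unix-like systems.

```python
import os
import tempfile

# POSIX-style: overwrite 4 bytes in the middle of a file in place.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"0123456789")
    os.pwrite(fd, b"ABCD", 3)       # random-access write at offset 3
    middle = os.pread(fd, 4, 3)     # random-access read of the same range
    whole = os.pread(fd, 10, 0)     # the rest of the file is untouched
finally:
    os.close(fd)
    os.remove(path)

# Object-store-style (simulated): the only write is "replace the whole value",
# so a 4-byte change means reading, patching and re-uploading everything.
store = {"obj": b"0123456789"}
old = store["obj"]
store["obj"] = old[:3] + b"ABCD" + old[7:]

print(middle, whole, store["obj"])
```

Both paths end with the same bytes, but the second one moves the entire object for a 4-byte change and gives no help with concurrent writers.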

You could write something new that speaks the git protocol but is designed to be backed by an eventually-consistent object store instead of a POSIX-compliant file store ... but that seems like an equally big challenge, honestly.

Fun aside. Google wrote a backend to Mercurial so it could be stored on bigtable. They did a talk about it 7 years ago.


And to SVN! https://lwn.net/Articles/194667/

It's not a bad decision; getting POSIX semantics to play well with modern expectations of a highly-consistent and highly-available distributed system is very difficult. But it's a work-intensive decision, and figuring out how to deploy NFS or Ceph at scale might be less work. Especially if Google's SVN/hg developers can work directly with the Bigtable developers on features, and GitLab's developers can't work directly with the S3 team (but can own their own Ceph or NFS stack).

Their Git hosting was also running on that same backend (I was on the team managing those products). But to a previous poster's point, that is a lot of work both to implement and to maintain. If you can just use a distributed filesystem you might save yourself some trouble.

Yes, write a Git backend using a distributed actor system, like Akka, Orleans or Service Fabric, backed by Table Storage, Bigtable or some other distributed hashtable store. Git maps naturally to this structure, using git hashes as table index.
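To show why git objects map naturally onto a hash-indexed table, here is a minimal sketch that computes a blob's object ID the way git does (SHA-1 over a `blob <size>\0` header plus the content) and uses it as the key. The `table` dict is a stand-in for whatever distributed store is used, and `put_blob`/`get_blob` are hypothetical helpers, not any library's API.

```python
import hashlib
import zlib

def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob: SHA-1 of header + content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

table = {}  # stand-in for Table Storage, Bigtable, or similar

def put_blob(content: bytes) -> str:
    oid = git_blob_id(content)
    # Simplification: store only the compressed content under the object ID.
    table[oid] = zlib.compress(content)
    return oid

def get_blob(oid: str) -> bytes:
    return zlib.decompress(table[oid])

oid = put_blob(b"hello\n")
print(oid, get_blob(oid))
```

Objects are immutable and content-addressed, so this part is the easy bit. Refs, packfiles and atomic ref updates are where most of the work in a real backend would go.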

You'll have a really cheap and very scalable solution that uses low level storage mechanisms yet it will also be really performant because when loaded everything will be served from memory.

For sure, and people are trying to do this on the basis of libgit2. But so far nobody has been able to make it work for all of the cases. Even GitHub is using filesystems as far as I know.

One of the teams I work closely with used JGit's extensibility to run git on top of a KVS: https://medium.com/@palantir/stemma-distributed-git-server-7...

Notably this doesn't require a distributed filesystem.

That looks really interesting, thanks for posting. I added it to our issue https://gitlab.com/gitlab-com/infrastructure/issues/727#note...

For Visual Studio Online and TFS, Microsoft is using libgit2 and some custom ASP.NET code for the upload-pack / receive-pack endpoints, stuffing the data into SQL Azure. No filesystem, no `git` executable.

Others have done this with libgit2 storage backends.

Anyone in particular, or anything open source? I could use this for a work project, but the closest I've been able to find is an old project which backed JGit to Cassandra and the Dulwich Swift backend[1].

[1]: https://www.dulwich.io/apidocs/dulwich.contrib.swift.html

We looked at this but it didn't have either the features or the speed we needed. I do think it is very promising.

If you're rolling your own, you can tune these things.

For example, in this case, since they're IOPs-bound first, and probably network-bound second, you can cram 16 SSDs in a box, and then put a dual-port 25Gig Ethernet NIC up to 2 leaf switches at the top of rack, and then build a CLOS out of 100G spine switches, so that there are essentially never any network bottlenecks for Ceph.

This is very similar to DreamHost's DreamCompute cluster design, though they're using 2x 10G to the servers, and 40G up to the spines, since 25G/100G wasn't available when they built the cluster.

We're currently at 70TB and we're planning for a 256TB setup https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N... . So that would be a lot of servers. I don't think SSDs for the file storage are affordable. We'll use SSDs for the journals.

I think that's a false assumption. Especially in 2016.

Unless something changed dramatically at GitHub they're a 100% solid-state fleet since 2012 when they started down the "building their datacenter cages" journey.

Whilst you could say as the 500lb gorilla in the market they can afford that luxury, there are other examples, i.e. DigitalOcean, where SSD is a core part of their product and it is offered inexpensively ($5/mo).

Even if you're looking at 3-way replication of your data, meaning you're buying 768TB of storage, the SSD should only tot up to $200-300k for Intel S3510 (or similar).
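Spelling out that arithmetic, using only the figures quoted in this thread (256TB planned, 3 copies, a $200-300k budget); the per-TB prices are simply what those figures imply, not quoted drive prices.

```python
logical_tb = 256                 # planned capacity mentioned upthread
replicas = 3                     # 3-way replication
raw_tb = logical_tb * replicas   # total SSD capacity to buy

budget_low, budget_high = 200_000, 300_000   # USD, from the estimate above
per_tb_low = budget_low / raw_tb
per_tb_high = budget_high / raw_tb

print("raw TB to buy:", raw_tb)
print("implied price per TB: $%.0f to $%.0f" % (per_tb_low, per_tb_high))
```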

There's a non-trivial amount of additional effort involved in any kind of "multi-layered" storage (c.f. data on 7200rpm, NVMe cache) over and above just making all of your storage moderately performant in IOPS terms. That has cost too.

Sorry about that! I was on my phone, and followed the first thing from the linked post/issue; I didn't realize it wasn't published.

No problem, the thread is pretty long, and it's not necessarily easy to tell without understanding how our blog post process works. :)

An alternative to user5994461's suggestion would be to switch from CephFS to a different distributed filesystem. If you need POSIX and can't go directly to S3 you could try ObjectiveFS[1]. Using local SSD instance store and memory for caching you can get very good performance even for small file workloads[2].

[1]: https://objectivefs.com [2]: https://objectivefs.com/howto/performance-amazon-efs-vs-obje...

Another option would be to move to HDFS. HopsFS, a new distribution, can scale to over 1m ops/sec: http://www.logicalclocks.com/index.php/2016/10/14/hops-smash...

As someone who has run HDFS in production on bare metal, I wonder how much you dislike GitLab...

TL;DR - HDFS is rarely the answer, mainly due to pains usually involving crashed NameNodes.

HopsFS has multiple redundant, stateless NameNodes.

ObjectiveFS looks interesting but I wonder about the latency if you do many small random reads (this is what we're seeing with git).

For small random reads (16 parallel threads) over the Linux kernel source tree, hits in the SSD instance store disk cache have a 95th-percentile latency of 6ms with ~50MB/s throughput. Hits in the memory cache have a 95th-percentile latency of <1ms with ~380MB/s throughput. We have more performance data available at https://objectivefs.com/howto/performance-amazon-efs-vs-obje...

That sounds very good, but I think we prefer Ceph because it is better known. Do you have a comparison between Ceph and ObjectiveFS?

I think what you're seeing in this post has product implications for Google Cloud.

The OP is asking for a guaranteed IOPS/latency SLA, for which they're willing to pay HUGE money by ROLLING THEIR OWN.

Possibly think about high service pricing tiers for systems that require it. This is a standard operations problem that can co-exist within pooled service models.

See my note about SLAs and ToR failures. We probably could promise something for our Local SSD offering (tail latency < 1ms!), but high-performance, guaranteed networked storage is just tricky.

As I said, rolling your own will not give you a guarantee, it will just give you the responsibility for failure. We don't offer the guarantee, because we don't want you to believe it can't fail.

At a certain scale, it's better to just take that responsibility rather than rely on someone to be able to handle that burden. When something fails, the worst thing is being out of control of it and unable to do anything about it.

Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process, and had to explain there was nothing I could do... For five days.

It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.

One of the things I stress when talking to customers is that when one is moving to cloud, it's not just the business model that changes, but the architecture model.

That may sound like cold comfort on the face of it, but it's key to getting the most out of cloud and exceeding the possibilities of an on-prem architecture. Rule #1 is, everything fails. The key advantage of a good cloud provider (and there are many) is not that they can deliver a guarantee against failure (as boulos stated correctly) but that they'll allow you to design for failure. The issue arises when the architecture in the cloud resembles the one that was on-premise. While there are still some advantages, they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.

The key to having a good cloud deployment is effectively utilizing the features that eliminate single points of failure, so that the same storage controller failure that might knock you out on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but brings huge advantages when it comes together.

Disclosure: I work for AWS.

Most ppl just want to deal with one provider though. Meaning there has to be a middle man btw you and the customer.

That is quite honestly an amazing turnaround. Most people aren't in a position to get a replacement delivered from a vendor that quickly, but if you are, awesome. Again though, there's a big gap between simple, local storage (we could probably provide a tighter SLO/SLA on Local SSD for example) and networked storage as a service.

As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.

> Most people aren't in a position to get a replacement delivered from a vendor that quickly

Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.

If you're running a serious business, support contracts are a must.

Physically delivered, no matter customer location? Sign me up ;).

I'm not sure why you're so surprised, this is standard logistics management for warranty replacement services.

The manufacturer specifically asks the customer to keep it informed of where the equipment is physically located, and then prepositions spares at appropriate depots in order to meet the contractual requirement established when the customer paid them (often a large amount of money) for 4-hour Same Day (24x7x365) coverage for that device.

This isn't how hyperscale folks operate, for the same reason Fortune 100s rarely take out anything more than third-party coverage when their employees rent vehicles: it becomes an actuarial decision. You weigh the # of 'Advance Replacement' warranty contracts and the $ involved against buying a % of spares, keeping those in the datacenter, and then RMA'ing the defective component for refund/replace (on a 3-5 week turnaround time).

tl;dr - Operating 100-500 servers you should likely pay Dell and they'll ship you the spare and a technician to install it, operating >500 servers and you should do the sums and make the best decision for your business, operating >5000 servers and you probably want to just 'cold spare' and replace yourself.
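The rule of thumb above can be written down as a tiny function. The thresholds are the ones in the tl;dr; fleets under 100 servers aren't covered by it, so the sketch says so rather than guessing. The `sparing_strategy` name is made up.

```python
def sparing_strategy(servers: int) -> str:
    """Encode the tl;dr thresholds; anything else is left undecided."""
    if servers > 5000:
        return "cold-spare and replace yourself"
    if servers > 500:
        return "do the sums for your business"
    if servers >= 100:
        return "pay the vendor for advance replacement"
    return "not covered by this rule of thumb"

for n in (50, 300, 2000, 10000):
    print(n, "->", sparing_strategy(n))
```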

We had such a contract with Dell when I worked in HPC (at a large university). Since our churn was so high (top 500 cluster) we had common spare parts on site (RAM, drives) but when we needed a new motherboard it was there within 4 hours.

Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.

As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.

As nixgeek says, this is standard -- for a price.

That said I suspect most enterprises pay too much; as in their money might be better spent buying triple mirroring on a JBOD rather than a platinum service contract with a fancy all in one high end RAID-y machine.

Back when I worked in a big corp with an actual DC onsite we had similar contracts. I assume they worked anywhere in the US as we had multiple locations. IIRC they were with HP.

At the scale gitlab is talking about, it is usually better to source from a vendor that doesn't provide ultra-fast turnaround, and just keep a few spare servers and switches, so you can have 0-hour turnaround for hardware failure.

Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.

If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.

I kinda disagree, see my other comment on logistical models, this isn't 5000+ system scale or even 500+ system scale.

There's non-trivial cost involved in simply being staffed to accommodate the model you propose, all of the ODM's have some "sharp edges" and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling and dealing with firmware glitches, etc.

Remember that 500-rack (per buy) scale is table stakes for "those cloud providers", it is their business, whereas GitLab is a software company. Play to your strengths.

In my experience, you'll be dealing with firmware glitches even with mainstream OEMs. You can avoid a lot of the RFP/RFQ and scale issues by going to a ODM/OEM hybrid like Edge-core, or Quanta. Or Penguin Computing or Supermicro. If you already have a relationship with Dell or HP, you probably won't get quite as good pricing, but they're still options.

I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.

The scale gitlab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation make sense. I think 500 racks is a huge overestimate, I think the cross-over point is closer to 5 racks.

I think you're missing the point a bit -- the claim is not that the business model prevents these guarantees, but that it's inherently difficult for any party to do.

> had to explain there was nothing I could do... For five days

The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.

Wikipedia, Stack Overflow, and Github all do just fine hosting their own physical infrastructure.

It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.

Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see if it's cost effective comparatively speaking.

Sure and GitHub also has literally an order of magnitude more infrastructure than is being discussed in these proposals and retains Amazon Web Services and Direct Connect for a bunch of really good reasons.

Transparency from GitLab is excellent but you shouldn't really generalise statements about cloud suitability without the full picture or "Walking a mile in their shoes".

Yep, and we at GitLab have Direct Connect to AWS as a requirement for picking out new colo.

If you guys use AWS, have you taken a look at EBS provisioned IOPS? EBS provisioned IOPS comes to mind here since it allows you to specify the desired IOPS of a volume, and provides an SLA.

> Providers don't provide a minimum IOPS, so they can just drop you.

The reason I ask is because the blog post generalizes a lot about what cloud providers offer or what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like provisioned IOPS with EBS, dedicated instances with EC2, provisioned throughput with DynamoDB, and so on.

EBS goes up to 16TB, we need an order of magnitude more.

You're right that the blog post was generalizing.

BTW We looked into AWS but didn't want to use an AWS only solution because of maximum size, costs, and the reusability of the solution.

Per disk. You can attach multiple disks to a single EC2 node.

So, RAID multiple EBS volumes and you have a larger disk.
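For a sense of scale, this is how many 16TB volumes plain striping would need for the capacities mentioned in this thread (70TB today, 256TB planned). It assumes simple striping with no redundancy overhead, and it does not check how many volumes a single instance can actually attach or drive.

```python
import math

EBS_MAX_TB = 16   # per-volume limit quoted upthread

def volumes_needed(capacity_tb, volume_tb=EBS_MAX_TB):
    """Volumes required if they are simply striped together."""
    return math.ceil(capacity_tb / volume_tb)

print("70TB  ->", volumes_needed(70), "volumes")
print("256TB ->", volumes_needed(256), "volumes")
```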

That's not a strawman. It's just a false statement.

Thanks for the correction.

By point of comparison, we had storage problems that went on for months or years. We had a vendor that was responsive to SLAs, but the problem was bigger than just random hardware failures; it was just fundamentally unsuited to what we were trying to do with it. That's the risk you take when you try to build your own.

In this case, it sounds like the cloud was fundamentally unsuited for what GitLab was trying to do with it. So definitely still a risk!

As someone on the private cloud team for a large internet company, dealing with storage problems is a nightmare that never ends. AWS is a walk in the park by comparison.

> See my note about SLAs and ToR failures. [...] but high-performance, guaranteed networked storage is just tricky

New job title: Networked Infrastructure Actuary

OTOH, solving the tricky parts in scalable ways is exactly the kind of unique selling point that cloud providers should offer.

A ToR failure doesn't have to mean the end if you're willing to wire each server to two ToRs and duplicate traffic streams to both. It's a waste, but it's one way to achieve high reliability if you have customers willing to pay for it.
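A quick way to see what the second ToR buys you, assuming the two switches fail independently (a strong assumption, since shared power or a bad firmware push would break it). The 99.9% per-switch availability is a made-up figure for illustration.

```python
def redundant_availability(single, paths=2):
    """Availability when any one of `paths` independent paths is enough."""
    return 1 - (1 - single) ** paths

tor = 0.999                      # assumed availability of a single ToR
dual = redundant_availability(tor)

hours_per_year = 24 * 365
print("single ToR downtime (h/yr): %.2f" % ((1 - tor) * hours_per_year))
print("dual ToR downtime (h/yr):   %.4f" % ((1 - dual) * hours_per_year))
```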

I'm not sure we want to pay huge money. Right now it looks like our cloud hosting bill for 2 months (about $250k) can pay for the hardware to host 4x as much https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N...

My business partner and I ran a website (rubylane.com) for many years, just the 2 of us, colocated at he.net. With good hardware (we used Supermicro servers bought from acme.com), it's not a big deal. We mirrored every disk drive with hardware raid1, had all the servers connected in a loop via their serial ports, had the consoles redirected to the serial ports, and it was not much of a hassle. When we were first starting, we used cheap hardware and that caused us some pain. The other very useful thing we had setup is virtual IP addresses for all of the services: search engine, database, www server, image server, etc. The few times we ever had trouble with the site or needed to take a machine out of service, we could redirect its services to another machine with fakeip.

Recently I bought an old SuperMicro server on eBay, configured it with 2 6-way AMD Opterons, 3 8-way SATA controllers, and 32GB of memory. With 4TB 5700RPM drives in an 8-way software RAID6, it could do 800MB/sec. I realize it's not small file random I/O, but a blended configuration where you put small files on SSD and large files on spinning disk would probably be pretty sweet.
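The capacity side of that array is simple to work out, since RAID6 spends two drives' worth of space on parity. The throughput line just divides the quoted 800MB/sec across the eight drives as a rough average; it says nothing about small random I/O.

```python
def raid6_usable_tb(drives, drive_tb):
    """RAID6 keeps two drives' worth of parity, so usable = (n - 2) * size."""
    if drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (drives - 2) * drive_tb

usable_tb = raid6_usable_tb(8, 4)   # 8-way array of 4TB drives
avg_mb_s_per_drive = 800 / 8        # quoted array throughput spread over 8 drives

print("usable TB:", usable_tb)
print("average MB/s per drive:", avg_mb_s_per_drive)
```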

My intuition is that learning all the ins-and-outs of AWS, and how to react and handle all kinds of situations, is not that much easier than learning how to react with your own hardware when problems come up. Especially consider that AWS is constantly changing things and it's out of your control, whereas with your own hardware, you get to decide when things change.

If you can colocate physically close, it's a lot easier. Our colocation was in Fremont but our office was in San Fran, so it was a haul if we had to install or upgrade equipment. But even so, there were only 1 or 2 times in 7 years that we needed to spend 2 consecutive days at the colo. One of those was during a major upgrade where it turned out that the (cheap) hardware we bought was faulty.

Thanks. We plan to use a remote hands service to install new servers.

How are you calculating cost to your organization to run your own hardware? With a cloud provider you're benefitting from their own engineering pooled across thousands of customers.

When you run your own hardware you have all the engineering you were already doing plus investing to upkeep and to improve your architecture.

As Google's boulos said, that's where the real costs are.

Indeed with metal you have less flexibility and much higher engineering costs that offset your savings. We think metal will be more affordable as we scale but that is not the reason for doing it. We do it because it is the only way to scale Ceph.

I know at least two 500TB+ clusters running on IaaS and don't think "only way to scale Ceph" is to buy and rack machines.

In an earlier comment you said EBS only goes to 16TB and that is "an order of magnitude less" than your requirement. However, that's per volume; you can attach many volumes in much the same way as servers have many disks.

Scale horizontally, not vertically: add more OSD instances? To each you can attach a number of EBS or PD volumes, each with IOPS characteristics that in aggregate are sufficient to service your workload?
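To make the "many volumes per OSD instance" arithmetic concrete, here is a rough sizing sketch. The 16 TB per-volume cap is the EBS figure quoted above; the workload targets, the per-volume IOPS, and the volumes-per-instance limit are hypothetical placeholders, not provider quotes.

```python
import math

def volumes_needed(target_tb, target_iops, tb_per_volume, iops_per_volume):
    """Number of volumes required to meet both the capacity and the IOPS target."""
    by_capacity = math.ceil(target_tb / tb_per_volume)
    by_iops = math.ceil(target_iops / iops_per_volume)
    return max(by_capacity, by_iops)

def instances_needed(volumes, volumes_per_instance):
    """Number of OSD instances required to attach that many volumes."""
    return math.ceil(volumes / volumes_per_instance)

# Hypothetical workload and limits, for illustration only
vols = volumes_needed(target_tb=200, target_iops=300_000,
                      tb_per_volume=16, iops_per_volume=10_000)
osd_hosts = instances_needed(vols, volumes_per_instance=8)

print(f"{vols} volumes across {osd_hosts} OSD instances")
```

With these made-up numbers the IOPS target, not the capacity target, decides the volume count, which is the point of sizing in aggregate rather than per volume.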

If you want to avoid EBS or PD entirely, is there a reason you can't look at 'i2' or 'd2' instance types?

https://cloud.google.com/compute/docs/disks/performance https://aws.amazon.com/ebs/details/#VolumeTypes

At a fundamental level you're just moving the problem and trading managing metal (which is hard) for I/O guarantees.

"Why is this harder than you might expect?" - you stated elsewhere that you'll have Remote Hands do rack/stack. Providers like Equinix refer to this as "Smart Hands". Everyone who's managed a reasonable-sized environment finds this term highly ironic, as the technician can and will replace the wrong drive, pull the wrong cable, etc.

I've done a non-trivial amount of infrastructure 'stuff' (design, procurement, install, maintenance, migration) for some well-known companies. If you want to Hangout for an hour and pick my brain, gratis, my e-mail is in my profile.

> The OP is asking for guaranteed IOPS/latency SLA for which they're willing to pay HUGE money for by ROLLING THEIR OWN.

There is no guarantee of SLA in any distributed system. The best you can do is measure things and know what you'll get most of the time.

If you want SLA, you can make a single server with 10TB of memory as storage. That's solid choice! :D

Ah, really? We moved off Google Cloud. We just needed one machine with average IOPS to store our files (50-100 GB only), but we were constantly hitting IOPS limits, and at some point file uploads became a nightmare for our users. What's the point of the cloud when we pay $1k+ for it? Renting one server costs ~$50/mo and will cover everything. The only thing needed is some adjustments to the configuration. And thanks to k8s we can roll out servers in a week. I now wonder why we needed to sit on the cloud when everything works 10x slower at 10x the price.

What if your hardware fails?

What if you need to scale up (or scale down again)?

Making servers run as reliably as in cloud datacenters is really hard work and imposes additional cost. I.e. you need not one, but two or three datacenters with 2-3 times the number of servers you actually use. Then you need admin staff who know their work and keep things running smoothly and reliably. You need not one but at least 3 such staff, because they might get sick or go on vacation.

Your cost savings probably come from cutting corners. Depending on the structure of your applications, downtime requirements and recovery plans, that may even be OK. You don't always need a fleet of tanks to deliver a box of milk bottles.

> "Making servers run as reliable as in cloud datacenters is really hard work and imposes additional cost."

Not really. It's easy enough to have dedicated server hardware that runs just as smoothly as cloud hosted hardware. There are many options for doing so. The key is to pay for support.

For example, if you go with Microsoft or Red Hat or Ubuntu servers, you can pay for support contracts to get help from experts in configuring systems. Such options can work out cheaper than cloud hosting (depending on the IT infrastructure requirements), with the added benefit of having hardware more directly under your control.

What happens when you run into trouble with one of the servers? Like it's suddenly stopped responding and you can't SSH?

For my systems built on EC2, I might not need to do anything at all for typical hardware failures, if I've set up EC2 auto recovery. It transparently relaunches my instance on another server. If I'm not using auto recovery, then I might just stop and restart the instance, in which case I'm also migrated to another physical server if the first one wasn't working. It has the same identity, by the way: same IP address, same disk contents, and all that, and all the machine perceived was an OS reboot (or maybe kernel panic followed by reboot, depending on the failure).

For stateless systems I don't even need to bother with that. I'll just configure it to spin up a replacement instance if the desired number of servers I want to have online is not met because one of them failed. When I have 100+ servers in a fleet, this is really convenient. I don't need to keep track of them individually, I just say: "This is the image I'd like to run, and I want to have 100 of them". Hardware failures that would bring down a normal physical machine don't need to involve me at all.
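The "I want 100 of this image" idea boils down to a reconciliation loop. Below is a minimal in-memory sketch of that loop; `launch_instance` is a hypothetical stand-in for a provider call, not the EC2 API, and the fleet is just a list of dicts.

```python
import itertools

_ids = itertools.count(1)

def launch_instance(image):
    """Hypothetical stand-in for a provider's 'launch instance' call."""
    return {"id": next(_ids), "image": image, "healthy": True}

def reconcile(fleet, image, desired):
    """Drop unhealthy instances and launch replacements until `desired` are running."""
    healthy = [inst for inst in fleet if inst["healthy"]]
    while len(healthy) < desired:
        healthy.append(launch_instance(image))
    return healthy

# Start a fleet of 100, simulate two hardware failures, then reconcile again
fleet = reconcile([], image="web-v1", desired=100)
fleet[3]["healthy"] = False
fleet[42]["healthy"] = False
fleet = reconcile(fleet, image="web-v1", desired=100)

print(len(fleet), "instances running")
```

This sketch only scales up to the desired count; a real controller would also terminate surplus instances and run health checks itself.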

The old server hardware that's broken is now in the hands of EC2 to repair or swap out, and my server's back up and running in minutes, possibly with no action on my part.

Is something like that easy to achieve with colocation and support offerings with those vendors? Genuine question - I've never operated in colo or worked with those vendors. I realize that colo facilities can take care of repairing and swapping servers for me, but the benefit I get from the cloud is that I can perform those actions in moments with an API, or they can happen automatically, and I don't need to engage other humans. I can't imagine spending my time coordinating with vendors, getting on the phone, opening support tickets, waiting for people to do things, etc... right now I can do everything at the speed of computing with APIs and automation. It'd be tough to give that up.

> "What happens when you run into trouble with one of the servers? Like it's suddenly stopped responding and you can't SSH?"

You've got plenty of options. Reboot the server (using a local login, as the server is hosted where employees/support staff are physically close to it), restore the server from a backup, etc...

Let's put it like this, how do you think companies managed before cloud servers? Do you think they just had the attitude of 'Oh damn the server has gone down, nobody can do any work today'? I'd put it to you that contingency plans existed long before cloud hosts did.

If you start running EC2 clusters, you're more in the "cloud storage" dominion and not really in the "self hosting" area anymore.

Not everybody needs that and not everybody who does need it also realizes it before a major failure strikes out one of their machines and there is no working failover and their one admin is not reachable and when they finally reach him he has to be flown in, then they need to wait for replacement parts, which all in all delays recovery by several days.

Out of band management gets you remote console access and fallback to a local technician if there's a hardware problem.

Do you need to store data?

If yes, how is it replicated over multiple datacenters?

If yes, does failover work flawlessly?

> "If yes, how is it replicated over multiple datacenters?"

Depends on the database engine used. For example...


>"If yes, does failover work flawlessly?"

Again, depends on the database engine used. To use SQL Server as an example again...


Traditional clustering with SQL Server is a huge pain in the ass. When it works (most of the time) it works well but failures can get hairy quick. There's tradeoffs but if you can make it work AlwaysOn is so much easier to deal with from an ops perspective.

Hardware fails? It will start on another node automagically. Maybe we will need to switch the psql instance. What kind of additional cost? Have you ever hit something really awful with k8s on self-hosted servers? What can go wrong except losing a node?

So, you have redundancy and failover at the application level. You don't need reliable servers then.

Reliability is with respect to some SLA; it could very well be that one would _like_ to timeout more quickly than can be reasonably expected of cloud infrastructure, but well within the capabilities of hardware itself.

> You don't always need a fleet of tanks to deliver a box of milk bottles.

A bottle of milk filled with Gold is worth about $800k.

By extension the 6-pack is $4.8M

If tanks can fit seamlessly in the budget, we shall give tanks some serious thought! :D

If your setup can run on a $50 server, you shouldn't use AWS/GCE ^^

Out of curiosity. What were you running? What was your setup? How much disk IOPS? and disk bandwidth used?

Are you sure that (just example) - https://www.hetzner.de/us/hosting/produkte_rootserver/ex41ss... is a weak server? Running same VM on AWS/GCE will cost 10x more.

OMG A server WITHOUT ECC memory. #fear

Let's not call that a server but a desktop computer please ^^


The point of the cloud is to manage MANY servers. If all you have is a single box, you don't need help to manage it, you don't need AWS/GCE.

If I can buy two desktops instead of one server, then I can make a nice failover configuration instead of using just one server. Also, as far as I know, Google and FB think that ECC is useless, don't they?

Well, just order 50 of them and install k8s.

Failover is useless when you have already increased your chances of corrupting data. Even with ECC there's some degree of corruption; doing without is suicidal.

Google figured quite early in its life that RAM was the one component not to skimp on: http://research.google.com/pubs/pub35162.html

If you have an application where the performance and growth patterns are well known, cloud isn't a choice made for lowest cost. It may be best value for different reasons.

I've done ROI studies for several applications like that, and usually cloud has a higher total cost unless there are specific availability requirements or you don't have a facility that can meet a 99.9% SLA.

The key assumption is that you know a lot about the app and it's in an operating mode. If you're in a hyper growth mode, have fluctuating or seasonal demand, or you have no capital funds, cloud is a no brainer.
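For a sense of what a 99.9 availability target means in practice, this small sketch converts an availability percentage into an allowed-downtime budget. It assumes a 365-day year and that the SLA is measured over the whole year, which may not match how a given facility or provider defines it.

```python
def downtime_budget_hours(availability_pct, period_hours=365 * 24):
    """Hours of downtime allowed per period at the given availability percentage."""
    return period_hours * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_budget_hours(pct):.2f} hours/year")
```

At 99.9% the budget works out to a bit under nine hours a year, which is the bar an in-house facility has to clear in the comparison above.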

For sure, I think you can often save a decent amount of $$ with dedicated hardware. It does have downsides of course...

If your demand is 50% stable 50% fluctuating (say some base load + a big spike at US prime time), I still think you can win with a hybrid cloud... i.e. serve base load from a COLO, and serve spike load from the cloud. That does mean you need to configure at least 2 networks, but not a terrible idea from a DR standpoint anyway (Main DC fail? Push a button and run off the cloud until it is fixed)
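A toy cost model for the base-load-in-colo, spike-in-cloud split might look like the following. Every number here (the load profile and both hourly rates) is invented for illustration; the only thing taken from the comment above is the shape of the demand, a steady base plus a short daily spike.

```python
def all_cloud_cost(hourly_load, cloud_rate):
    """Pay the cloud rate for every server-hour actually used."""
    return sum(hourly_load) * cloud_rate

def hybrid_cost(hourly_load, colo_capacity, colo_rate, cloud_rate):
    """Colo capacity is paid for every hour, used or not; overflow goes to cloud."""
    colo = colo_capacity * colo_rate * len(hourly_load)
    burst_hours = sum(max(0, load - colo_capacity) for load in hourly_load)
    return colo + burst_hours * cloud_rate

# Hypothetical day: 10 servers of base load, doubling to 20 for a 4-hour spike
load = [10] * 20 + [20] * 4

cloud_only = all_cloud_cost(load, cloud_rate=1.0)
hybrid = hybrid_cost(load, colo_capacity=10, colo_rate=0.4, cloud_rate=1.0)

print(f"all cloud: {cloud_only:.0f}, hybrid: {hybrid:.0f} (arbitrary units per day)")
```

The result is only as good as the assumed colo rate, which in practice has to include the ops staff and the second network mentioned above.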

This thinking (hybrid cloud for dynamic scale out) is going into our decision at GitLab and makes a lot of sense for us.

Hardware provides so much headroom that it's not even funny to compare :) A good PCIe board will do 850K read IOPS for 10K (4TB). You can have a sane, understandable setup. In our experience our clients on AWS had more downtime than on dedicated clusters. All major outages of public clouds have been due to control plane issues, because it's too complex. Unless you are a major customer, have fun trying to get things resolved with the cloud provider. At GitLab's scale they would need 4 ops guys and some remote hands for hardware installs. Considering cloud pricing on bandwidth and storage and the cost of the ops team (+/- 24K a month in Ukraine, tops), they will not only get much better performance but will actually save a decent amount of money.

I sense a collaboration opportunity between gcloud and GitLab that might be beneficial: Google helps GitLab get git running against your object storage ;) That would be a win-win for both, I guess!

"""For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;)."""

Actually it really does for all practical purposes.

When it fails its SLA you replace it.

On the storage front, even enterprise grade SSDs with hundreds of thousands of 4k read and write IOPs set up in over-provisioned arrays won't give you a great SLA unless you really lowball it: say, set the SLA at half the benchmarked 99th percentile latency/IOPS and require the hardware to sustain that 99.99% of the time (after benchmarking long enough to tell cache from sustained performance).

That's just storage, now you need to add so many layers on top of it. Like @boulos said, it is then your fault, but your customers still see the issues.

ToR failure? Tried googling it but it wasn't much help.

Top of Rack = the network switch that ties the servers on a rack to each other and provides a gateway to the rest of the datacenter.

I always recommend these:


Surprisingly cost effective given the massive performance and support. Cloud is great for some things but not everything.

Thanks for the shout out! Without being a commercial, this high performance storage/analytics realm is what we focus on. We were demoing units like this: https://scalability.org/images/30GBps.png for years now. Insanely fast, tastes great, less filling. Our NVM versions are pretty awesome as well.


I should point out that we build systems that the OP wants ... they likely don't know about us as we are a small company ...


and we do use gitlab



Always happy to help ...

Thanks for using GitLab and posting in https://gitlab.com/gitlab-com/infrastructure/issues/727#note... This looks really interesting.

I did not realize they were on Azure until I read this comment. GitLab post did not mention it, why did you then?

I'm glad I wasn't the only one to notice! Literally the first thing the Google engineer does is mention Azure, while I had to go back to Gitlab's article all confused as I didn't notice at any point they were running on Azure.

I can't say it's a bad thing to do; I just can't help but notice how Google invests in social media participation. That's why they own the conversation in places like HN and can pull off this stuff.

What I fail to understand is, how does GCE or AWS tackle the issue described in the article? As far as I understand, their problem seems difficult to work around due to the nature of the Cloud (shared).

How would GCE be better than AWS or Azure at this? I would be really interested to know and I'm sure that'll be useful for other HNers with the same worries.

Disclaimer: I work for AWS.

To solve problems just like these, we offer EBS (elastic block storage) with provisioned IOPs guarantees. Essentially, you can get guaranteed IOPs if you need it for I/O intensive applications; up to 30,000 IOPs per EBS volume.

But, PIOPs EBS volumes wouldn't be my first recommendation. It sounds like what they really need is an elastic, scale-out filesystem with NFS semantics. We have Elastic File System, or EFS, which is exactly that. It's a petabyte scale filesystem that is highly available across multiple availability zones, and scales in IOPs and performance as it scales in size.

Their application should also look at leveraging S3 object storage, rather than NFS, because that is a highly distributed, highly available object storage system, that is likely to give better scalability, availability, and performance, than rolling your own Ceph infrastructure.

This is a great answer, thanks illumin8! (and cool nick too :D)

Technical details aside, the "buts", parentheses, and generally weird structure of this reply made it very confusing.

I personally believe that, when a prospect feels wronged or poorly served by a competitor, it's the company's responsibility to reach out to that prospect directly. Ideally 1-on-1, the company should aim to listen carefully to the prospect's concerns and not to immediately start spouting opinions or solutions. From there, an honest dialogue can take place which can form the basis of a sale. And, more importantly, a relationship.

Of course, if the goal is simply to castigate the competition or to ward off uncomfortable questions by implying the dissatisfied prospect is thinking poorly or emotionally, then a public forum post works better, I guess.

Sorry if you found my writing style confusing. I'm not looking to shame anyone. Instead, my hope was to counter the overly broad "It's clear we must not be in the Cloud" with my experience that this may be particular to their provider.

I'm not here trying to sell them on coming to Google. If they're interested, my contact info is in my profile. It's quite likely that the best option for them would be to return to AWS, as they have experience running there and EBS has improved a lot over the years.

So Gitlab is in a bit of a strange position here. Sticking to a traditional filesystem interface (distributed under the hood) seems stupid at first. Surely there are better technical solutions.

However, considering they make money out of private installs of gitlab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment. Necessary for them to keep experience with large installs of gitlab. If one of their customers that run on-prem has performance issues they can't just say: gitlab.com uses a totally different architecture so you're on your own. They need gitlab.com to be as close as possible to the standard product.

Pivotal does the same thing with Pivotal Web Services, their public Cloud Foundry solution. All of their money is made in Pivotal Cloud Foundry (private installs).

From a business perspective, private installs are a way of distributed computing. Pretty clever, and good way of minimizing risk.

It's not just that. The cloud is far, far more expensive than bare metal, and that is under completely optimal financial conditions for the cloud providers (extremely low, even 0%, interest rates for the companies owning them, hence required return is only a few percent)

Dedicated servers and colocation are going to be far cheaper than the cloud, and worse, the savings are directly related to the size of the infrastructure you need.

That, combined with the fact that even the very best of virtualization on shared resources still kills 20-30% of performance.

So there's 3 things you can use the cloud for:

1) your company is totally fucked for IT management, and the cloud beats local because of incompetence (this is very common). And you're unwilling or unable to fix this. Or "your company doesn't focus on IT and never will" in MBA speak.

2) wildly swinging resource loads, where the peak capacity is only needed for 1-5% of time at most.

3) you expect to have vastly increasing resource requirements in a short time, and you're willing to pay a 5x-20x premium over dedi/colo to achieve that

The thing I don't understand is that cloud has both a lower limit (cloud is far more expensive than web hosting or a VPS, which is an extremely common case) and an upper limit: it is far more expensive once you go over a certain capacity (doesn't matter which one, CPU, Network, Disk, ... all are far more expensive in the cloud). Even if you have wildly varying loads there's an upper limit to the resource needs where cloud becomes more expensive.

The thing I don't understand is why so many people are doing this. I ran a mysql-in-the-cloud instance on Amazon for years, with a 300 MB database, serving 10-50 qps as a backend to a website, and a reporting server that ran on-premise. Cost? 105-150 euros per month. We could have easily done that either locally or on a VPS or dedicated server for a tenth of that cost.

Cloud moves a capital cost into an operational cost. This can be a boon or a disaster depending on your situation. You want to run an experiment that may or may not pan out? Off the cloud you'll have spare capacity that you can use but don't really have to pay for. On the cloud, cost controls will mean you can't use extra resources. You can't borrow money from the bank? The cloud (but also dedi providers) can still get you capacity, essentially allowing you to use their bank credit for a huge premium.

I think you're missing some use cases here. Cloud can be really helpful for prototyping systems that you may not even want running in a few months.

Another use case would be where infrastructure costs are minor compared to dev and ops staff costs. If hosting on AWS makes your ops team 2x as productive at a 30% infrastructure markup that can be a steal.
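The trade described here is easy to put into numbers. The sketch below uses invented figures throughout (infrastructure spend, headcount, and cost per engineer); only the "2x as productive" and "30% markup" ratios come from the comment above, with 2x read as "half the ops headcount for the same work".

```python
def monthly_total(infra, ops_headcount, cost_per_engineer):
    """Total monthly cost: infrastructure plus ops staff."""
    return infra + ops_headcount * cost_per_engineer

# Hypothetical baseline: self-hosted infrastructure run by 4 ops engineers
BASE_INFRA = 20_000
COST_PER_ENGINEER = 12_000

self_hosted = monthly_total(BASE_INFRA, ops_headcount=4,
                            cost_per_engineer=COST_PER_ENGINEER)

# Cloud: 30% infrastructure markup, but the same work done by half the team
cloud = monthly_total(BASE_INFRA * 1.3, ops_headcount=2,
                      cost_per_engineer=COST_PER_ENGINEER)

print(f"self-hosted: {self_hosted:,.0f}  cloud: {cloud:,.0f}")
```

When infrastructure is small relative to salaries the markup barely registers; as the infrastructure bill grows past the staff bill, the same 30% starts to dominate.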

I was thinking this too while reading the above comment. For starting up and prototypes, the benefit of a managed host far outweighs any initial performance losses. As companies scale, we see them move to their own hardware (Stackoverflow, Gitlab, etc.)

It's best to design your stuff so it can easily go from hosted solutions (this "cloud" bullshit term people keep using) to something you manage yourself. Docker containers are a great solution to this.

If you set up some ansible or puppet scripts to create a docker cluster (using mesos/marathon, kubernetes, docker swarm, etc.) and built it in a hosted data center, it's not going to take a whole lot of effort to provision real machines and run that same stack on dedicated hardware.

I don't think stack overflow ever used cloud providers for their main compute and storage needs, just periphery stuff like cloudflare.

Look at those cost factors. If you have an ops team (ie. 2+ people) it is a near certainty that cloud is more expensive. And if you absolutely do not need an ops team, VPS is going to beat the cloud by a huge margin.

2 people doing ops work cost 7-8k $ per month. Let's assume each of them is managing at least 5x their own cost in infrastructure spend, ie 35k+ $/month. That easily buys you 20-30 extremely high spec dedicated machines, if necessary all around the world, with unlimited bandwidth. On the cloud it wouldn't buy you 5 high spec machines with zero bandwidth, and zero disk.

Let's compare. Amazon EC2 m3.2xlarge (not a spectacularly high end config I might add, 8vCPU, 30Gig ram, 160G disk, ZERO network) costs $23k per month. So this budget would buy you 2 of those. Using reserved instances you can halve that cost, so up to about 4, maybe 5 machines.

Now compare softlayer dedicated (far from the cheapest provider), most expensive machine they got: $1439/month. Quad cpu (32 cores), 120G ram, 1TB SSD, 500G network included (and more network is about 10% of the price amazon charges for the same). For that budget it gets you 25 of these beasts (in any of 20+ datacenters around the globe). On a low cost provider, like Hetzner, Leaseweb or OVH you can quadruple that. That's how big the difference is.

It used to be the case that Amazon would have more geographic reach than dedicated servers, but that has long since ceased to be true.

There is a spot in the middle where it makes sense, let's say from $100+ to maybe $10k, where cloud does work. And you are right that it lets a smaller team do more. But there are 2 things to keep in mind: a higher base cost, and costs that rise far faster when you expand compared to dedicated or colo. This is not a good trade to make.

Your math is off by 60x for AWS and 100x for GCE.

An m3.2xlarge is $.532/hr or $388/month, not $23k/month [1]. A similar instance on GCE (n1-standard-8) is $204/month with our sustained use discount, and then you need to add 160 GB of PD-SSD at another $27/month (so $231 total) [2].

Disclosure: I work on Google Cloud, but this is just a factual response.

[1] https://aws.amazon.com/ec2/pricing/on-demand/ [2] https://cloud.google.com/compute/pricing

EC2 m3.2xlarge can be had for $153/month as well when purchased as a reserved instance.

EC2 reserved instances offer a substantial discount over on-demand pricing. The all-up-front price for a 3-year reservation for m3.2xlarge would be an amortized monthly rate of $153/month, which is a 61% saving vs. the on-demand price of $388/month, according to the EC2 reserved instances pricing page.
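The figures traded in the last few comments can be checked with a few lines of arithmetic. All prices below are the ones quoted in the thread; the only assumption added is 730 hours per month (8760 hours per year divided by 12).

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12

def monthly(hourly_rate):
    """Convert an hourly on-demand rate to an approximate monthly cost."""
    return hourly_rate * HOURS_PER_MONTH

on_demand = monthly(0.532)   # m3.2xlarge on-demand rate quoted above
claimed = 23_000             # the per-month figure being corrected
gce_total = 204 + 27         # n1-standard-8 plus 160 GB of PD-SSD
reserved = 153               # 3-year all-upfront RI, amortized per month

aws_factor = claimed / on_demand
gce_factor = claimed / gce_total
ri_saving_pct = 100 * (1 - reserved / 388)

print(f"on-demand: ${on_demand:.0f}/month")
print(f"claimed figure is off by {aws_factor:.0f}x (AWS), {gce_factor:.0f}x (GCE)")
print(f"RI saving vs on-demand: {ri_saving_pct:.0f}%")
```

This reproduces the quoted $388/month, a factor of about 59 for AWS (the "60x"), about 100 for GCE, and the 61% reserved-instance saving.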

Granted, using this capacity type requires some confidence in one's need for it over that period of time, since RIs are purchased up front for 1 or 3 year terms. But RIs can also be changed using RI modification, or by using Convertible RIs, and can be resold on a secondary marketplace. As a tradeoff in comparison to GCE's automatic sustained use discount, the EC2 RI discount requires deliberate and up-front action.

For the SaaS business I run, Cronitor, AWS costs have consistently stayed around 10% of total MRR. I think there are a lot of small and medium sized businesses who realize a similar level of economic utility.

Edit: I do see in another comment you concede the value of the cloud for ppl spending under $10k a month.

One use case where cloud shines is managed databases. Having a continuously backed up and regularly patched DB with the promise of flexible scalability is definitely worth it.


1) Your needs are very static, or

2) Your IT department can competently replicate the PaaS experience on its in-house metal (common big tech company strategy)

The cloud is likely to do wonders for velocity, as when you have a new use case, you can "just" spin up a new VM and run the company Puppet on it within an hour or so, vs. wait weeks to months for a purchase order, shipping, installation at the colo facility, etc.

If your IT department is doing Mesos or Kubernetes or something with a decent self-service control panel for developers, then you get the best of both worlds, but you also have to build and maintain that.

A risk is of course that your public environment is so much bigger than your biggest private install, that it doesn't make sense anymore to keep the same technology for both tiers. I think both PWS and gitlab.com suffer from this.

Does Gitlab support some kind of federation between instances?

If so, they could (in theory) split Gitlab.com into a bunch of shards which as a whole match the properties of N% of enterprise users. That'd be a pretty cool way to avoid the different-in-scale problem (although you might still run into novel problems as you're now the Gitlab instance with the most shards..).

this is what Salesforce does - they have publicly visible shards in the url (na11, na14, eu23, etc) and login.salesforce.com redirects you to your shard when it figures out who you are.

For anyone considering this, reconsider. It bit me many times.

In general, over time, load on some shards will increase while others decrease. Migrating a customer from one shard to another will likely cause a short outage for them, and many bugs down the line when they've bookmarked all kinds of things.

You need to preserve the name<->customer association and maintain another key that you can use to split traffic at the LB in case a customer outgrows their shard. But personally I think it looks hokey and should not be something a customer sees anywhere but perhaps a developer tool or sniffer.

Could that not be resolved with a reverse proxy or load balancer trickery? I.e., hide the shard name externally.
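One way to picture the "hide the shard name" idea is a small directory that the proxy consults on each request. This is only a sketch: the class, the hostnames under example.com, and the shard names are hypothetical, and a real deployment would keep the mapping in a replicated store rather than an in-process dict.

```python
class ShardDirectory:
    """Maps a stable customer name to whichever shard currently hosts it."""

    def __init__(self):
        self._shard_of = {}

    def assign(self, customer, shard):
        """Record the initial shard for a customer."""
        self._shard_of[customer] = shard

    def migrate(self, customer, new_shard):
        """Move an existing customer; their public hostname does not change."""
        if customer not in self._shard_of:
            raise KeyError(customer)
        self._shard_of[customer] = new_shard

    def backend_for(self, public_host):
        """Resolve a public host such as 'acme.example.com' to an internal backend."""
        customer = public_host.split(".", 1)[0]
        return f"{self._shard_of[customer]}.internal.example.com"


directory = ShardDirectory()
directory.assign("acme", "na11")
before = directory.backend_for("acme.example.com")

directory.migrate("acme", "na14")
after = directory.backend_for("acme.example.com")

print(before, "->", after)
```

Because the customer only ever sees their own hostname, bookmarks survive a migration; the outage during the data move itself is a separate problem this does not solve.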

> However, considering they make money out of private installs of gitlab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment.

Depends on how they bill.

If they bill on a few private installs while giving unlimited storage & projects for free on the cloud, they're setting up themselves for bankruptcy.

Gotta take care of the accounting. Having an unsustainable pricing is a classic mistake of web companies.

well they can probably include a better storage layer.

at the moment you can just set up multiple mount points, but I guess it would be superior to actually set up a single mount point or object storage. I'm pretty sure that you can put git into something like s3.

s3fs? ;)

A) I think it would be kind of difficult to store git on S3 (the magic would be in the caching / consistency layer to keep it performant)

B) it would make sense for their customers that run on AWS. However, many of them just run on a single VM / physical server.

well I just said s3 since it's the best known, but I guess it would be feasible to run on object storage. I heard that libgit2 can actually swap out its storage layer..

btw here is the alibaba approach: http://de.slideshare.net/MinqiPan/how-we-scaled-git-lab-for-...

Edit: and I heard github uses gitrpc.

Indeed this is a big factor. We want our customers and users to be able to use open source to host their repositories. That is one of the reasons to use Ceph instead of a proprietary storage solution such as Nutanix or NetApp.

Having never really gone to the cloud, thankfully, FastMail can strongly recommend New York Internet if you're looking for a datacentre provider. They've been amazing.


Actually, we don't appear to blog quite enough about how awesome NYI are. They're a major part of our good uptime.

And some stuff about our hardware. I'd strongly recommend hot stuff on RAID1 SSDs and colder stuff on hard disks. The performance difference between rust and SSD is just massive.

We're looking at PCI SSDs for our next hardware upgrades:


(there are two types of SSD on the market. Ones that lose your data, and SSDs from Intel) - we're currently running mostly DC3700 SATA/SAS SSDs.

We customise the layout of our machines very much to match the heat patterns of our data:

https://blog.fastmail.com/2015/12/06/getting-the-most-out-of... https://blog.fastmail.com/2014/12/15/dec-15-putting-the-fast... https://blog.fastmail.com/2014/12/04/standalone-mail-servers...

Coming from OpenStack land where Ceph is used heavily, it's well known that you shouldn't run production Ceph on VMs and that CephFS (which is like an NFS endpoint for Ceph itself) has never been as robust as the underlying Rados Block Device stuff.

They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.

Also -- if you look at the infra update from the linked article, they mention something about 3M updates/hour to a pg table ([1], slide 9) triggering continuous vacuums. This feels like using a db table as a queue, which is not going to be fun at moderate to high loads.

[1] https://about.gitlab.com/2016/09/26/infrastructure-update/

It's not updates, it's just querying. Updates are not that bad; the main issue there was that the CI runner was keeping a lock in the database while going to the filesystem to get the commit, which generated a lot of contention.

Still this is something we need to fix in our CI implementation because, as you say, databases are not good queueing systems.

For sure there is much we can improve to reduce DB updates, but we do use Redis for queues.

> They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

We have been in contact with RedHat and various other Ceph experts ever since we started using it.

> I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.

Users can self host GitLab. Using some complex custom block storage system would complicate this too much, especially since the vast majority of users won't need it.

You're right. We talked to experts and they warned us about running Ceph on VMs and we tried it anyway, shame on us.

You do need either a distributed FS (GitHub made their own with Dgit http://githubengineering.com/introducing-dgit/, we want to try to reuse an existing technology) or to buy a big storage appliance.

Bingo! Seasoned developers and architects with 15-20+ years of experience would very likely question using software stacks like CephFS with warnings on its website about production use! You really want no exotic 3rd party stuff in your design, just plain-Jane components like ext3 and Ethernet switches. Choosing a newer exotic distributed filesystem may really come back to bite you in the future.

It sounds like they didn't design for the cloud and are now experiencing the consequences. The cloud has different tradeoffs and performance characteristics from a datacenter. If you plan for that, it's great. Your software will be antifragile as a result. If you assume the characteristics of a datacenter, you're likely to run into problems.

This got me curious again about the pluggable storage backends in Git (I assume AWS CodeCommit is using something like this). I've looked at Azure's blob storage API in the past and found it incredibly flexible.

Here is an article from a few years ago: http://blog.deveo.com/your-git-repository-in-a-database-plug...

In any case, GitLab is amazing and I can see how it's tempting to believe that GitLab the omnibus package is the core product. However, HOSTED GitLab's core product is GitLab as a SERVICE. That might require designs tailored a bit more for the cloud than simply operating a yoooge fs and calling it a day.

Gitlab is limited by the fact that they need their hosted product to run the same code as their on-prem product in order to avoid forking the codebase.

If on-prem customers can't get AWS/GCE/Azure SuperFastCloudStorage™, then it can't be part of their codebase.

Exactly, we want our users to be able to scale with us using open source technologies. Some are at 20k+ users so they are close to needing something like Ceph themselves.

From looking at the comments here, thinking about what I would do/want, and what others have done to scale git, GitLab is in the minority in wanting to solve this issue with a Ceph cluster.

> Some are at 20k+ users so they are close to needing something like Ceph

Or they will scale it themselves a la Alibaba: http://www.slideshare.net/MinqiPan/how-we-scaled-git-lab-for... . They appear to have written a libgit2 backend for their object store (among other things).

I don't see a good reason why solutions using different storage backends could not make it into the OSS project. Many companies run their own Swift cluster, which is OSS.

If you're using CephFS and everyone else wants to be using other Cloud storage solutions, that would actually put you at a disconnect with your users and leave room for a competitor with the tools and experience to scale out on Cloud storage to come in offering support. I would at least consider all the opinions in this thread and maybe reach out to that Minqi Pan fellow from Alibaba with questions.

I actually really like GitLab and wish we could be using it at my company; this is why I'm spending so much effort on this topic (and scaling git is interesting). Hopefully my opinions are not out of place.

Thanks, we're in touch with Minqi Pan. I think we all agree the Ceph solution is great if we can make it work.

Can you go more into the difference in the tradeoffs and how one should design differently?

A blunt summary would be that everything is unreliable and disposable. You have to design for failure, because most things will almost certainly fail (or just slow down) at some point.

I learned a lot when I was first moving to the cloud from Adrian Cockcroft. He has a ton of material out there from Netflix's move to the cloud. I recommend googling around for his conference talks. (I haven't watched these, but they're probably relevant: http://perfcap.blogspot.com/2012/03/cloud-architecture-tutor...)

I'm trying to understand 'antifragile'. Are you trying to say: 'robust'? If not, what is the difference?

There's a whole book[0] about it. But to summarise (poorly):

'robust' - resilient against shocks, volatility and stressors

'antifragile' - thrives on shocks, volatility and stressors (ie. gets better in response)

Antifragile is a step beyond robust. Examples of antifragility are evolutionary systems and true free market economies (as opposed to our too-big-to-fail version of propped-up, overly interconnected capitalism).

0: https://en.wikipedia.org/wiki/Antifragile

Can you provide any examples of working, real-world antifragile systems which were designed and built by humans and accomplish their purposes? Preferably in terms of software and hardware?

So far you've named "evolutionary systems", which are quite fragile in the real world, and an imaginary thing called a "true free market economy".

Taleb's book often came back to the example of the relatively free market that is a city's restaurants. Any one restaurant is fragile, but the entirety of the restaurant business in a city is antifragile - failure of the worst makes the marketplace better.

Just trying to define it. I'm not advocating for anything.

That said, if you are calling the entire multibillion year phenomenon of life on this planet "fragile" then we are not going to get on well.

Netflix with their chaos monkeys is a great and relevant example.

But their system doesn't thrive on chaos monkey. It's just resilient to it.

In this case, the anti-fragile system is the entire system, including Netflix engineers and the cloud over time. The cloud is stressed, maybe even goes down, but in response, becomes stronger and more reliable because engineers make changes.

Point-in-time human-engineered systems still can't really be anti-fragile, except perhaps in some weird corner cases, but the system as a whole with the humans included, over time, can be.

It should also be pointed out that "anti-fragile" was always intended to be a name for things that already exist, and to provide a word that we can use as a cognitive handle for thinking about these matters, not a "discovery" of a "new system" or something. There are many anti-fragile systems; in fact it's pretty hard to have a long-term successful production project of any kind in software without some anti-fragility in the system. (But I've seen some fragile projects klunk along for quite a while before someone finally came in and did whatever was managerially necessary to get someone to address root causes.)

Ah, true when you include the engineers in the loop I suppose. But then that becomes a vague term for any system where engineers fix problems after some load/failure testing.

When I think of anti fragile systems, truly adaptive algorithms come to mind that learn from a failure. For example, an algorithm that changes the leader in a global leader election system based on the time of day because one geographic region of the network is always busier depending on time of day and latency to the leader impacts performance.

Yes. It is stronger because it is attacked by it.

Isn't that bad because now you're depending on having stressors and shocks to have the best performance?

If you can't control the amount of stressors and shocks, you want a system that is neither antifragile nor fragile, but strictly indifferent to the level of shocks.

Good questions. It is a very interesting subject. If you haven't read the book, I recommend it. I think you will find it interesting if you read it with an open mind.

I don't think antifragile works in the parent comment, robust would be better.

As for a definition, antifragile things get stronger from abuse, like harming muscles during a workout so they grow back stronger. If they were just robust, it would be like machinery (no healing / strengthening).

I suppose I had Netflix in mind...as part of their move to the cloud, they developed chaos monkeys to actively attack their own software to make sure it is resilient to the failures that will inevitably, but less frequently, happen as a result of operating in the cloud.

Stop using the term, you don't understand what it means. Chaos monkey is a form of testing. Testing is not antifragility.

Chaos monkey attacks their production environment. They must make their software stronger/more resilient/less fragile in response. It would be more helpful if you clarified how you think that's different from antifragility.

Antifragility describes a system that becomes more resilient automatically, because it is built such that it must, in response to attack or damage. Such a system doesn't require management. It doesn't require people to actively test, to think about how to make the system better, to implement the fixes.

I think that's an overly narrow definition of the term antifragility and interpretation of the system in this case.

While I can imagine software that is in itself antifragile, I think it's entirely reasonable to include the totality of the people and process that make up an operational system, in which case even your narrow definition applies here.

If you are including the developers in the system you're calling antifragile, then Chaos Monkey is also part of the system. Antifragility refers to a system that benefits from attacks or disorder from outside itself.

You are watering down the word so much that it wouldn't have reason to exist. Is every chair-making process antifragile since chairs get stress tested before being sold?

Chaos Monkey is just a tool to accelerate/simulate the disorder inherent to the cloud. The original point was that the cloud is unstable and hostile, so software designed for it benefits from that disorder. Granted, it's not doing so all on its own, but exhibits the effect nonetheless, and I think Taleb's book is full of similarly impure examples.

There's a world of difference between stress testing before something is sold or released and welcoming ongoing hostility throughout its lifecycle, and this difference is absolutely in line with the concept of antifragility.

They designed for a small operation with limited resources. That might be fair given their budget/funding, which we don't know.

At the moment, the discussion in the GitHub issues looks like people who are buying servers to put and run in their garage ^^

Planning for the cloud doesn't make your software antifragile. antifragile != robust

> when you get into the consistency, accessibility, and partition tolerance (CAP) of CephFS, it will just give away availability in exchange for consistency.

Not their fault, POSIX API cannot fit into eventual consistency model to guarantee availability (i.e. latency). Moving to your own hardware doesn't actually solve the problem, just gives some room to scale vertically for some time. After that the only way to keep global consistency but minimize impact of unavailability is to shard everything, at least this way unavailability will be contained in its own shard without any impact on every other shard.

It's better to avoid POSIX API in the first place.
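
A minimal sketch of the sharding idea, under stated assumptions: repositories are assigned to shards by a stable hash of their path, the shard count of 8 is arbitrary, and the repository names are made up. It only shows that losing one shard touches just the repositories placed on it.

```python
import hashlib


def shard_for(repo_path, num_shards):
    """Map a repository path to a shard index with a stable hash."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_repos(repos, num_shards, down_shards):
    """Return only the repositories that live on an unavailable shard."""
    return [r for r in repos if shard_for(r, num_shards) in down_shards]


repos = [f"group/project-{i}" for i in range(1000)]  # made-up repository names
down = affected_repos(repos, num_shards=8, down_shards={3})
print(f"{len(down)} of {len(repos)} repositories affected by losing one shard of 8")
```

Hashing modulo the shard count makes moving repositories between shards awkward, so a real system would more likely keep an explicit repository-to-shard mapping; the containment property is the same either way.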

This is the correct answer to run github, the SAAS product at scale. Probably not practical for github the on-prem installable software. They may have to bite the bullet and separate out the two systems at some point anyway though.

edit: it's gitlab not github. The point still stands though.

I priced out the hardware that they specced out to be around $1.4 million:

https://gitlab.com/gitlab-com/infrastructure/issues/727#note... The R730xd are probably around $50k after dell discounts - depends a bit on what they ended up configuring with regards to support, exact network configuration, etc

The R830s are about $50k as configured - 1.5TB of RAM is expensive as the R830 only has 48 DIMM slots and they need the relatively expensive 32GB RDIMMs

The R630s should be about 15k each.

The switches say they are 48x 40G QSFP+ which are very expensive (I'd put them at 30k each from Dell)

50k * 20 + 50k * 4 + 15k * 10 + 2 * 30k

1m + 200k + 150k + 60k ~= $1.4m invested
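
The same sum as a quick check, using only the unit prices and counts quoted above and assuming ten R630s (the 150k subtotal implies 15k × 10). The dictionary keys are just labels.

```python
# (unit price in USD, quantity) as estimated in the comment above
line_items = {
    "R730xd": (50_000, 20),
    "R830": (50_000, 4),
    "R630": (15_000, 10),
    "40G switch": (30_000, 2),
}


def total_cost(items):
    """Sum price * quantity over all line items."""
    return sum(price * qty for price, qty in items.values())


total = total_cost(line_items)
print(f"${total:,}")  # $1,410,000
```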

updated with better R830s pricing

From my perspective, consumer Samsung 850 EVO drives, which can be under-provisioned to match the 1.8TB (and get better performance characteristics), would give GitLab cheaper and more reliable storage in terms of IOPS/latency when compared to 10k 1.8TB drives.

Unfortunately I don't believe we'll be able to comment on the accuracy of this since the quotes from companies are private information, but I do love math like this :D

(Community Advocate at GitLab)

I would take a look at Supermicro servers and compare pricing. Worked as cloud engineer at a hosting provider built entirely on Supermicro kit. Didn't see anything there I didn't see from the Dell's and HP's when I worked at Rackspace. And Rackspace sure didn't build their cloud on top of HP's or Dell's.

Fetch a quote from Supermicro and tell your Dell or HP folks that you'll buy that if they don't match the price.

If you're large enough, then you should be talking to other folks including Wiwynn, ZT, Quanta, Celestica, et al.

They won't always get all the way there, depends on volume, but often they'll get pretty close, and it'll probably be a 25%+ additional discount vs. what they'll offer without you twisting their arm using a cheaper manufacturer as leverage.

They have to believe you're serious, obviously, which can mean visibly throwing a couple small orders to another manufacturer to tell your sales team they're not the only game in town.

In quality terms, Supermicro is worse vs. Dell/HP, which can still be fine if you've got enough scale to work through all the little issues with firmware.

We never had issues and I never noticed a quality issue over Supermicro vs Dell's and HP's at Rackspace. There were firmware niggles and updates with both; we were always upgrading server, storage card, and storage controller firmware at the Rack. I'm not privy to studies at scales large enough to tell which is higher quality.

I'm not beholden to either though haha. If you can get Dell to come close to Supermicro prices then that might make sense. When I was last in the infrastructure game a few years ago we also ran into configuration limitations with Dell. At the time we could get more storage and RAM in the form factors we needed from Supermicro.

This is fair and buying at end of quarter or end of year is often the little extra kick needed to get Dell/HP to compete.

One of my main beefs with Supermicro is their tooling around remote patching and configuration of BIOS and other firmware, they charge for it, on top of warranty, and the UX is awful.

With both Dell and HP if you're under a support contract all the updates and tools to apply them are included, and whilst both manufacturers have some "sharp edges" to their tools for managing large fleets, neither is "stab yourself in the eyeballs with a blunt fork" bad like Supermicro.

Thanks for looking into this. I do believe that the linked comment configuration was over the top, we'll have less memory and 7.2k rpm drives. We're working on the configuration in https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N... feel free to leave comments there.

Without knowing more on the specific needs (I followed some of the threads to try to grok it), it would be hard to guess what they really need.

[commercial alert]

My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.

Relevant links via shortener:

Main site: http://scalableinformatics.com (everything below is at that site under the FastPath->Unison tab) http://bit.ly/1vp3hGd

Ceph appliance: http://bit.ly/1qiOYpy

Especially relevant given the numbers I saw on the benchmarking ...

Ceph appliance benchmark whitepaper: http://bit.ly/2fMahfJ

Our EC test was about 2x better than the Dell unit (and the Supermicro unit), and our Librados tests were even more significantly ahead.

Petabyte scale appliances: http://bit.ly/2fuTTAH

We've even got some very nice SSD and NVM units, the latter starting around $1USD/GB.

[end commercial alert]

I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.

Someone suggested underprovisioned 850 EVO. Our strong recommendation is against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals/MDS or whatever you put on there.

Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or better, won't be able to make as effective use of it. Use the raw devices.

Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.

The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.

I am at landman _at_ the company name above with no spaces, and a dot com at that end .

What kind of discounts does Dell typically give larger customers? We get 0 (from HP) at my workplace and I've always wondered.

I'm sure it depends on how much you buy.

Does it vary by components? They seem to charge a lot for drives so I'm guessing those can be heavily discounted.

I helped design a DC for a big data mining / science company and we got just under 20% off the higher margin packages. I felt very good about the number and Dell was great to work with on the purchase side.

Please stop and read this and show it to your boss.

If you are really getting 0% discount you need to private message me and send me $10k so I can tell you this one weird trick.

You simply ask for them.

Just do that. Everyone gets discounts.

Well, it depends on how much stuff you're buying and from whom. If you're buying $800 workstations you won't get as many discounts as when you're buying 400+GB quad CPU 2U servers.

Agreed. But even a single purchase is likely to be below list price.

The last time I was in a position where I oversaw hardware purchasing, you got better deals based on your overall spend. After we had spent 250k we got a new sales rep who had access to offer us lower pricing. The same thing happened once we breached the $1m mark: new sales rep, and pricing went lower. In fact, our first sales rep couldn't even make up a reasonable deal at $80k because he didn't have access to that level of discounting.

I usually send everything back and tell them to do better.

My biggest gripe about Dell pricing was no line items, just a bottom-line price, so I couldn't tell which things I could save money on by sourcing them elsewhere (i.e. larger drives, large amounts of memory).

Subtract about 30%, maybe more. Dell pricing can be aggressive, if you are buying enough.

Pro-tip: always buy around the end of their sales quarter. Have internal approval ready. Just promise you'll sign immediately if they lower the price to XXX. Around the close of the quarter their sales guys are pushed to get deals done, and they are sometimes willing to give even bigger discounts.

I think at Dell the quarters are one month off (end of January, April, etc.)

Indeed, there would be discounts from Dell, but I also expect companies to buy some extra add-ons and subscription/support services that cancel them out that I didn't see in the high level specs.

32GB DDR4 RDIMMs are not that expensive anymore. 8Gbit DDR4 is close to crossover now.

Yeah, you're actually right. I relied on my memory of pricing them out a bit back, and it looks like the prices are much closer to other DIMMs now.

As always, the neat thing about GitLab is how open they are with their process. I enjoyed this read, and followed the trail down to a corresponding ticket, where the staff is discussing their actual plans for moving to bare metal. Very cool.


If you like open processes like that then you might like following the work of Wikimedia's Technical Operations team. [1]

You won't find any organization that's much more open than Wikimedia.

Disclosure: I work for Wikimedia (on the release engineering team)

[1] https://phabricator.wikimedia.org/tag/operations/

They even list the hardware they've spec'ed out:


I'd recommend reading some of the closed issues in that project as well, lots of interesting postmortems on downtime, increased load, etc. :)


Why do they need one gigantic distributed fs? Seems like a design miss to me.

Indeed. If there's one thing I've learned in >10 years of building large, multi-tenant systems, it's that you need the ability to partition as you grow. Partitioning eases growth, reduces blast radius, and limits complexity.

I believe GitHub is or was distributing repos instead of files. AWS CodeCommit likely stores pack files in S3 or otherwise takes advantage of existing AWS systems.

Using a single Ceph FS is an odd choice IMO. You need to have or acquire a ton of expertise to run stuff like that, and it can be finicky. I'm not convinced you can't get this running well enough on AWS though, so I'd be worried the move to bare metal would just cause the team more problems.

Agreed, I think the wrong conclusion was drawn here.

But I get where they're coming from; container orchestrators like Kubernetes are heavily promoting distributed file systems as being the 'cloud-native approach'. But maybe this issue is more relevant to 'CephFS' specifically than to all distributed file systems in general.

It might be, but you have to look at what's important for a product like GitLab. It's in a market where those who'll pay want to run their own special version of the system. So it's naturally partitioned.

Architecting, or even spending mental cycles on day 1 on distribution isn't going to win you as much as focusing on making an awesome product.

This move will probably buy them another year or two, which will hopefully give them enough time to implement some form of partitioning.

GitLab supports using multiple filesystems, and while waiting for their bare-metal solution, they switched to running 8 file systems: https://gitlab.com/gitlab-com/infrastructure/issues/711

Multiple filesystems != multiple storage backends, unfortunately.

Yes, multiple storage backends: 8 file systems, on 8 independent servers, each with a 16TB disk. See https://gitlab.com/gitlab-com/infrastructure/issues/727#note....

No information in this article.

How much data? We don't even know if it's GB/TB/PB/EB. How many files/objects? How many read IOPS are needed? How many write IOPS? What's the current setup on AWS? What's the current cost? What are they hosting? Can it scale horizontally? How do they shard jobs/users? What's running on PostgreSQL? What's running on Ceph? What's running on NFS? How much disk bandwidth is used? How much network bandwidth is used?

How are we supposed to review their architecture if they don't explain anything...

I bet that there is a valid narrative where PostgreSQL and NFS was their doom, but I'd need data to explain that ^^

Unfortunately I don't have deep knowledge of our infrastructure, so I can't answer all of these questions myself.

That said, a decent chunk of this info can be found in the discussions on our Infrastructure issue tracker[1].

The last infrastructure update[2] includes some slide decks that contain more data (albeit it's now a little under 2 months old).

Looking at our internal Grafana instance, it looks like we're using about 1.25 TiB combined on NFS and just under 16 TiB on Ceph. We're working on migrating the data currently hosted on Ceph back to NFS soon[3].

I'll get someone from the infrastructure team to respond with more info.

[1]: https://gitlab.com/gitlab-com/infrastructure/issues [2]: https://about.gitlab.com/2016/09/26/infrastructure-update/ [3]: https://gitlab.com/gitlab-com/infrastructure/issues/711

>>If one of the hosts delays writing to the journal, then the rest of the fleet is waiting for that operation alone, and the whole file system is blocked. When this happens, all of the hosts halt, and you have a locked file system; no one can read or write anything and that basically takes everything down.

I have seen similar issues where a GC pause on one server freezes the entire cluster.

Is this one single monolithic file system? On the service side, can the code be asynchronous with request queues for each shard? This can help free up threads from getting blocked and serve requests for other shards.
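
One way to picture per-shard request queues is the `asyncio` sketch below. The shard names, latencies, and request labels are all made up, and `asyncio.sleep` stands in for storage I/O. Each shard gets its own queue and worker, so a slow shard does not hold up requests destined for a healthy one.

```python
import asyncio


async def shard_worker(name, queue, delay, completed):
    """Serve requests for one shard; `delay` simulates that shard's storage latency."""
    while True:
        request = await queue.get()
        await asyncio.sleep(delay)
        completed.append((name, request))
        queue.task_done()


async def main():
    delays = {"slow-shard": 0.05, "fast-shard": 0.0}  # hypothetical latencies
    queues = {name: asyncio.Queue() for name in delays}
    completed = []
    workers = [
        asyncio.create_task(shard_worker(name, queues[name], delay, completed))
        for name, delay in delays.items()
    ]
    for i in range(3):
        for name in queues:
            queues[name].put_nowait(f"req-{i}")
    # Wait until every queue has been drained, then stop the workers.
    await asyncio.gather(*(q.join() for q in queues.values()))
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return completed


completed = asyncio.run(main())
print(completed[:3])  # the fast shard's requests finish first
```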

I don't know about this. We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se.

I think it says more about the state of open source distributed file systems than the cloud per se. Ceph and Gluster are not the best examples of these, though Lustre is awful too. I'm paid to dig into these at depth and each is like some combination of religion and dumpster fire. Understand that Red Hat (and Intel in the Lustre case) wants support, training and professional services revenue. Outside of their paid domains, it's truly commerce with demons but inside it's not much better.

BeeGFS is the only nominally open source one that I'd think about trusting my data to. And no, I don't work for them, nor am I compensated in any way for recommendations.

Distributed file systems are tough especially if you're putting it together yourself. I'd go for an already built solution every time unless I absolutely could not for whatever the reason.

Thanks for posting. What went wrong with Ceph and when was this? We have the idea it improved a lot in the last year or so. But we'd love to learn from your experience.

Was your experience RE: Ceph/Gluster from Stack Overflow? I'd definitely be interested in hearing more about the specifics of that.

I think he's talking about Discourse.

The advice in this post is, IMO, misguided.

In large systems design, you should always design for a large variation in individual systems' performance. You should be able to meet customers' expectations if any machine drops to 1% performance at any time. Here they are blaming the cloud for the variation, but at big enough sizes they'll see the same on real hardware.

Real hardware gets thermally throttled when heatsinks go bad, has I/O failures which cause dramatic performance drops, CPUs failing and leaving only one core out of 32 operational, or ECC memory controllers that have to correct and reread every byte of memory.

In a large enough system, at any time there will always be a system with a fault like this. Sure you only see it occasionally in your 200 node cluster, but in a 20k machine cluster it'll happen every day.
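
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes faults are independent and that every machine has the same daily chance of being degraded; the 1-in-10,000 figure is invented purely for illustration.

```python
def prob_any_faulty(num_machines, p_fault_per_day):
    """Probability that at least one machine is degraded on a given day,
    assuming independent faults with the same per-machine probability."""
    return 1.0 - (1.0 - p_fault_per_day) ** num_machines


P_FAULT = 1e-4  # made-up per-machine daily fault probability

for n in (200, 20_000):
    chance = prob_any_faulty(n, P_FAULT)
    print(f"{n:>6} machines: {chance:.1%} chance of a degraded node today")
```

Under that assumption the 200 node cluster sees a degraded machine on about 2% of days, while the 20k machine cluster sees one on most days.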

You'll write code to detect the common cases and exclude the machines, but you'll never find all the cases.

The conclusion is that instead you shouldn't try. Your application should handle performance variation, and to make sure it does, you would be advised to deliberately give it variable performance on all the nodes. Run at low CPU or io priority on a shared machine for example.

In the example of a distributed filesystem, all data is stored in many places for redundancy. Overcome the variable performance by selecting the node to read from based on reported load. In a system with a "master" (which is a bad design pattern anyway IMO), instead have 5 masters and use a 3/5 vote. Now your system performance depends on the median performance of those 5.
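
A small sketch of both ideas, with made-up reply times and load figures: with a 3-of-5 vote you only wait for the third-fastest reply, which is the median of the five, whereas waiting for every master means waiting for the slowest. The function and node names are hypothetical.

```python
import statistics


def quorum_latency(latencies_ms, quorum):
    """Time until `quorum` replicas have replied: the quorum-th fastest latency."""
    if not 1 <= quorum <= len(latencies_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    return sorted(latencies_ms)[quorum - 1]


def least_loaded(reported_load):
    """Pick the replica with the lowest self-reported load."""
    return min(reported_load, key=reported_load.get)


latencies = [2, 3, 4, 5, 900]  # made-up reply times in ms; one master is degraded

# A 3-of-5 quorum completes at the median reply time.
assert quorum_latency(latencies, quorum=3) == statistics.median(latencies)

print(quorum_latency(latencies, quorum=3))  # 4
print(quorum_latency(latencies, quorum=5))  # 900
print(least_loaded({"node-a": 0.9, "node-b": 0.2, "node-c": 0.5}))  # node-b
```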

Not to second guess the infrastructure architects at GitLab, but just to bring up a data point they might possibly not be aware of: Joyent virtual hosts have significantly better I/O performance profiles -- see https://www.joyent.com/blog/docker-bake-off-aws-vs-joyent for detail. Don't necessarily write off the cloud -- there are different clouds out there and if I was running something I/O intensive, I'd want to try it on Joyent. That said, nothing beats complete control, if you have the resources to handle that level of responsibility. =)

There is a threshold of performance on the cloud and if you need more, you will have to pay a lot more, be punished with latencies, or leave the cloud.

I've seen this a lot, and for a given workload I can tell when leaving the cloud will be the right choice. But the unspoken part is "can we change our application given the limitations we see in the cloud?" Probably pretty difficult in a DVCS but not impossible.

Sadly, storage isn't a first class service in most clouds (it should be) and so you end up with machines doing storage inefficiently and that costs time, power, and complexity.

> Sadly, storage isn't a first class service in most clouds

What would constitute a "first-class service" for you? (Or what does "storage" mean to you)

I think of both Persistent Disk for block and GCS for object storage as fairly reasonable. I agree that the world of "Give me multiple writer networked block devices, preferably speaking NFS" is still pretty bad.

Usual Disclosure: I work on Google Cloud (and for a long time, specifically Compute Engine).

Storage as a first class service means that storage is provided to the fabric directly. Just as routing is provided to the fabric directly by switches and routers so 'networking' is a first class service. In some networks time is provided by hardware dedicated to that so time is a "first class" service.

"Storage" as a definition is an addressable persistent store of data. But that isn't as useful as one might hope for discussions, so I tend to think of it in terms of the collection of storage components and compute resources that enable those components to be accessed across the "network."

So at Google a GFS cluster provides "storage" but if the compute is also running compute jobs, web server back ends, etc. It isn't the "only" task of the infrastructure and that is the definition of "second" or "not first class". Back in the day Urs would argue that storage takes so few compute cycles that it made no sense to dedicate an index to the serving up of blocks of disk. But that also constrained how much storage per index you could service. And that is why from a TCO perspective "storage as a service" is cheap when you need the CPUs for other things anyway, but it's very expensive when you just want storage. I wrote a white paper for Bart comparing GFS cost per gigabyte to the NetApp cost per gigabyte, and NetApp was way cheaper because it wasn't replicated 9 times on mission critical data, and one index (aka one filer head) could talk to 1,000 drives.

That same sort of effect happens in cloud services where if you want 100TB of storage you end up having to pay for 10 high storage instances, even if your data movement needs could be addressed by a single server instance with say 32GB of memory. The startup DriveScale is targeting this imbalance for things like Hadoop clusters.

Plenty of Xooglers want "just give me direct access to Colossus". But priority-wise, the market seems to want:

1. "Give me a block device I can boot off of, mount, and run anything I want to from a single VM" (PD, which is just built on Colossus)

2. "Give me NFS".

3. "Give me massive I/O" (GCS)

I think we're doing fine-ish in 1 and 3. The main competition is a dense pile of drives in a box for Hadoop, but we lean on GCS for that via our HDFS connector (https://cloud.google.com/hadoop/google-cloud-storage-connect...). It's our recommended setup, the default for Dataproc, and honestly better in many ways than running in a single Colossus cell (you get failover in case of a zonal outage, and by the same token you can have lots of users simultaneously running Hadoop or other processing jobs in different zones).

PS - I'm going to go searching for your whitepaper (I find in arguing with folks that network is the bottleneck for something like a NetApp box, not CPUs).

[Edit: Newlines between lists... always forgetting the newlines]

Of course you, the cloud vendor, are doing fine; the question was when it becomes too expensive for your customer. And my thesis is that because storage is not a first class service, you can't precisely optimize storage spend (or your storage offering) for your customer, and that forces them out of the cloud and into their own managed infrastructure.

You could also improve your operational efficiency but that isn't a priority yet at the big G. I expect over time it will become one and you'll figure it out but in the meantime your customer has to over provision the crap out of their resources to meet their performance needs.

If Bart is still around it was shared with him and the rest of the 'cost of storage' team back in 2009.

To me it makes total sense for something like Gitlab to do their own HW, since it's really their core business. Sure, there's no point for, say, target.com to use their own servers: computers are not really what they do, and cloud helps them keep expensive programmers and sysadmins to a minimum. But Gitlab is a whole different story.

If latency spikes affect the overall performance, it seems more likely that CephFS has a design problem (global FS journal) than that this is a cloud problem.

However, perhaps they shouldn't try to run Ceph in the first place. Azure has rather powerful blob storage (e.g. block, page and append blobs) that allows high performance applications. You could use that directly and it would likely be cheaper and work better than Ceph on bare metal.

Like other commenters suggest, in order to take advantage of cloud infrastructure you need to design with those constraints in mind, rather than trying to shoehorn the familiar technologies.

Bare metal can be better and cheaper, etc. but it requires even more skills and experience and a relatively large scale.

Wouldn't that make Gitlab dependent on Azure-specific services? That's quite a risk to take when you're already not sure if you want to stay with MS hosting services.

How much total storage do you need?

Something to consider: A few years ago I used Rackspace's OnMetal servers https://www.rackspace.com/en-us/cloud/servers/onmetal for a dedicated MySQL 128GB RAM server that would handle 100s of thousands of very active hardcore mobile game users. We were doing thousands of HTTP requests per second and 10s of thousands of queries (and a lot of those were writes) per second, all on one server. The DB server would not skip a beat, CPU/IO was always <20% and all of our queries would run in the 1-5ms range.

I'm not affiliated with Rackspace in any capacity, but my experience with them in the past has been top-notch, esp. when it comes to "dedicated-like" cloud hardware, which is what OnMetal is: you are 100% on one machine, no neighbors. Their prices can be high but the reliability is top-notch, and the description of the hardware is very accurate, much more detailed than AWS for example, and without "fluffy" cloud terms :).

For example: Boot device: 2x 240 GB hot-swappable SSDs configured in a RAID 1 mirror

Storage: 2x 1.6 TB PCIe storage devices (Seagate Nytro XP6302)

Thanks :). Glad to hear you liked the product.

We need about 70TB now and are planning for 256TB.

BTW HPE provides some storage solutions where you can scale it up to 8 petabyte (in a single rack afaik).

I have a really small setup, but I would personally look into a DL580 as VM host, with two of them for redundancy, plus a dual path storage system. In my case I used a 2U MSA2400 (not sure if that is the latest name), since it could continue to scale up and provided dual path too.

I don't have experience running Ceph, so I am not sure what its hardware requirements are.

(Disclaimer: I work at Hewlett Packard Enterprise with servers)

If only there was a way to run git in a decentralized manner. :-)

Then users could host their own repositories themselves and manage their storage.

This kind of setup would scale a lot better.

> The problem with CephFS is that in order to work, it needs to have a really performant underlaying infrastructure because it needs to read and write a lot of things really fast. If one of the hosts delays writing to the journal, then the rest of the fleet is waiting for that operation alone, and the whole file system is blocked.

This could easily have been titled "Why we couldn't use CephFS".
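
A rough way to see why the quoted behavior hurts more as the fleet grows, sketched in Python. This is a toy model, not a description of CephFS internals: it assumes a write completes only when the slowest host has finished, and that each host independently hits a latency spike with probability `p_slow`. All names and numbers here are illustrative assumptions.

```python
import random


def p_any_host_slow(p_slow: float, hosts: int) -> float:
    """Probability that at least one of `hosts` independent hosts is slow."""
    return 1.0 - (1.0 - p_slow) ** hosts


def fleet_write_latency_ms(host_latencies_ms: list[float]) -> float:
    """A write that waits on every host finishes when the slowest host does."""
    return max(host_latencies_ms)


def simulate(hosts: int, p_slow: float, fast_ms: float, slow_ms: float,
             trials: int, seed: int = 0) -> float:
    """Fraction of simulated writes that were held up by at least one slow host."""
    rng = random.Random(seed)
    slow_writes = 0
    for _ in range(trials):
        latencies = [slow_ms if rng.random() < p_slow else fast_ms
                     for _ in range(hosts)]
        if fleet_write_latency_ms(latencies) >= slow_ms:
            slow_writes += 1
    return slow_writes / trials


if __name__ == "__main__":
    # Assumed numbers: 50 hosts, each slow 1% of the time.
    print(p_any_host_slow(0.01, 50))
    print(simulate(hosts=50, p_slow=0.01, fast_ms=1.0, slow_ms=100.0, trials=20_000))
```

Under these assumptions a spike that is rare on any single host becomes common for the file system as a whole, which matches the article's complaint about one delayed journal write stalling everyone.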

I may be wrong, but it seems to me that immutable blobs (git) are a perfect match for any blob storage such as S3, GCS or Azure blob storage. I even think there are some libgit2 backends which do exactly this.

It may then become a problem of latency per blob, in which case you could coalesce blobs into bigger ones as needed (histories don't change much) and optimize for that.

Or move from a pure public cloud provider. I don't know what their spend is, but there are hybrid offers that provide managed premise and public cloud delivered on a consumption model that allow an organization to run a single administration and management interface, and can control what runs together on the premise gear.

The hardware part seems to be completely solved with Backblaze designs:


Therefore the only question left is what to use as the software for this.

I feel like I have to do it and ask the stupidest yet most important question in this thread...

Why didn't you use S3?

I assume for three reasons:

  - small file performance (if they're not storing the git objects as packed)

  - POSIX compliance

  - Their on-prem installations for customers

For POSIX compliance, you can use s3fuse/gcsfuse or put Avere in front, but you're either paying more for it or you're not getting the best performance.

For small file performance though, no object storage system is particularly tuned for that (often the granularity is more like MBs, and at least with GCS you're talking ~100ms per write).
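
To put the ~100ms per write figure in perspective, here is a small Python sketch. It assumes writes are issued one after another and that the cost is dominated by per-request latency rather than bytes transferred; the object counts and pack size are made-up inputs, and both function names are hypothetical.

```python
def serial_write_seconds(num_objects: int, per_write_ms: float) -> float:
    """Total time to write each small object as its own request, one at a time."""
    return num_objects * per_write_ms / 1000.0


def packed_write_seconds(num_objects: int, objects_per_pack: int,
                         per_write_ms: float) -> float:
    """Total time if small objects are first bundled into larger packs."""
    packs = -(-num_objects // objects_per_pack)  # ceiling division
    return packs * per_write_ms / 1000.0


if __name__ == "__main__":
    # Assumed workload: 10,000 loose objects at ~100 ms per write.
    print(serial_write_seconds(10_000, 100))         # one request per object
    print(packed_write_seconds(10_000, 1_000, 100))  # 1,000 objects per pack
```

Parallel requests would shrink the wall-clock time in practice, so treat this as an upper bound for a naive serial writer, not a benchmark.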

Finally, they need to be testing and supporting customers who want to run the whole thing "at home" wherever that is. So they would still need to run Ceph!

Disclosure: I work on Google Cloud.

I wouldn't say no distributed object storage system is particularly tuned for that. Facebook's haystack and systems patterned after it can work very well for tiny files.

Sorry, I meant for public cloud providers. You're right that Haystack would be fine for small files, but IIRC (and skimming the paper again [1]) it stores the entire photo inline in a single Needle. That's fine for photos, but not for people that want to say store a 5 TB single object (and for Facebook, they could just reject photos larger than say 100 MB). Still, good idea!

[1] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...

S3 does not allow random access reads/writes to the files. "Editing" a file in S3 just means deleting the original blob and creating a new one.

Meaning, you can't run git on it, you'd need to make a full copy of the repo each time someone makes a commit.

That isn't true; a git repository isn't just a single file. It is one (or many) pack and object files, and it's possible to have a repository by using each push (which may be multiple commits) to store a new pack file rather than deleting the old one.

The only time you would need to delete/replace is during a GC operation which can be scheduled periodically to compact the representation.

The JGit driver supports pushing to S3 in this way, and I have implemented a custom backend for hosting git repositories in an S3-like service (which in turn is based on how Google hosts git repositories at scale).
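
The idea can be sketched with a toy in-memory store in Python. This is not JGit's implementation or git's real pack format: `ObjectStore`, `PackRepo` and the length-prefixed encoding are all hypothetical stand-ins, chosen only to show that each push adds a new immutable blob and that only GC ever deletes.

```python
import hashlib
import struct


class ObjectStore:
    """Toy stand-in for an S3-like service: whole-blob put/get/delete only."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def delete(self, key: str) -> None:
        del self._blobs[key]

    def keys(self, prefix: str) -> list[str]:
        return sorted(k for k in self._blobs if k.startswith(prefix))


def encode_pack(objects: list[bytes]) -> bytes:
    """Concatenate objects, each prefixed with a 4-byte big-endian length."""
    return b"".join(struct.pack(">I", len(o)) + o for o in objects)


def decode_pack(pack: bytes) -> list[bytes]:
    objects, pos = [], 0
    while pos < len(pack):
        (size,) = struct.unpack_from(">I", pack, pos)
        pos += 4
        objects.append(pack[pos:pos + size])
        pos += size
    return objects


class PackRepo:
    def __init__(self, store: ObjectStore, name: str) -> None:
        self.store = store
        self.prefix = f"{name}/packs/"

    def push(self, objects: list[bytes]) -> str:
        """Store one push as a single new pack; existing packs are untouched."""
        pack = encode_pack(objects)
        # Content-addressed key, so writing the same pack twice is harmless.
        key = self.prefix + hashlib.sha256(pack).hexdigest()
        self.store.put(key, pack)
        return key

    def all_objects(self) -> set[bytes]:
        found: set[bytes] = set()
        for key in self.store.keys(self.prefix):
            found.update(decode_pack(self.store.get(key)))
        return found

    def gc(self) -> None:
        """Compact every pack into one, then delete the superseded packs."""
        old_keys = self.store.keys(self.prefix)
        new_key = self.push(sorted(self.all_objects()))
        for key in old_keys:
            if key != new_key:
                self.store.delete(key)
```

A short usage example: after `repo.push([b"a", b"b"])` and `repo.push([b"c"])` the store holds two packs; `repo.gc()` leaves one pack containing the same three objects.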

Not all users may use Amazon or S3. This means that if we were to build something requiring S3 we'd cut off those users.

Building some kind of generic storage layer with multiple backends (S3/S4/etc) would result in us only maintaining whatever we happen to use for GitLab.com. It could also complicate maintenance too much.

That's a huge red flag to me.

You got an application designed to host a few repos on a single disk for a single company. And you wanna run it to store millions of repos for the entire internet... on a single volume.

I understand that you want to run the same thing as your customers. But you have different needs, and you're going to run into permanent conflicts.

How much storage do you have now per service? How much growth do they experience? Do you even know if CephFS can scale linearly with servers/disks?

You don't solve storage performance problems by moving to S3. They're looking at response times in the hundreds of milliseconds - it's common to see multiple second response times from S3 (at least on AWS), sometimes they can take longer or just return an error.

> They're looking at response times in the hundreds of milliseconds

No they don't. Read their slides, they are running on 52 seconds latency at times ;)

I was referring to where the article compares 100ms latency to 8 years. Typical "enterprise" storage solutions have sub-millisecond response times. I'm assuming the 52 seconds is how long it's taking to flush the entire journal which wouldn't be a single write. In any case, when someone is having a problem with their storage performance I just find it funny that everyone suggests S3 which has terrible performance.

Their problem is not performance. Their problem is that they want high IOPS AND bandwidth AND storage size for a limited budget. There is no free lunch.

S3 is simple and it can ingest the volume. Of course, the performance is not comparable to a local disk.
