For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;). In all seriousness, we (Google, all providers) struggle with deciding what we can strictly promise. Offering you a "guaranteed you can hit 50k IOPS" SLA isn't much comfort if we know that all it takes is a single ToR failure for that to be not true (providers could still offer it, have you ask for a refund if affected, etc. but your experience isn't changed).
All that said, I would encourage you to reconsider. I know you're frustrated, but rolling your own infrastructure just means you have to build systems even better than the providers. On the plus side, when it's your fault, it's your fault (or the hardware vendor, or the colo facility). You've been through a lot already, but I'd suggest you'd be better off returning to AWS or coming to us (Google) [Note: Our PD offering historically allowed up to 10 TiB per disk and is now a full 64 TiB, I'm sorry if the docs were confusing].
Again, I'm not saying this to have you come to us or another cloud provider, but because I honestly believe this would be a huge time sink for GitLab. Instead of focusing on your great product, you'd have to play "Let's order more storage" (honestly managing Ceph has a similar annoyance). I'm sorry you had a bad experience with your provider, but it's not all the same. Feel free to reach out to me or others, if you want to chat further.
Disclosure: I work on Google Cloud.
As a cloud provider, though, you're trying to provide shared resources to a group of clients. A company rolling their own system doesn't have to share, and they can optimise specifically for their own requirements. You don't get these luxuries, and it's reasonable to expect a customised system to perform better than a general one.
If I may ask a few questions:
- Are they charged the same as external customers or do they get a 'wholesale' rate?
- As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?
- Do they get any say in the hardware or low-level configuration of the systems they use? (i.e. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with a bunch more RAM, for the week?)
- Do they have the same type of performance constraints as the ones encountered by gitlab?
I feel like most of the reason to use cloud services is when you have little idea what your actual requirements are, and need the ability to scale fast in both directions. Once you're established and have a fairly predictable workload, it makes more sense to move your hosting in-house.
Hopefully someone who actually knows what they're talking about will be along shortly!
> Are they charged the same as external customers or do they get a 'wholesale' rate?
I'd be quite surprised if internal customers are charged a markup. Presumably the whole point in operating an internal service is that you lower the cost as much as possible for your internal customers.
> As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?
From the above book, it seems that the hardware is largely abstracted away so most services aren't really aware of servers. I assume there's some separation between internal and external customers, but at a guess that'd largely be because of the external facing services being forks of existing internal tools that have been untangled from other internal services.
> Do they get any say in the hardware or low-level configuration of the systems they use? (i.e. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with a bunch more RAM, for the week?)
As above, the hardware is largely abstracted away. From memory, teams usually say "we think we need ~x hrs of cpu/day, y Gbps of network,..." then there's some very clever scheduling that goes on to fit all the services on to the available hardware. There's a really good chapter on this in the above book.
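To make the scheduling idea concrete, here is a toy sketch of packing declared resource requests onto machines with first-fit decreasing. It is an illustration only and says nothing about how the real scheduler works; the `Machine` class, the `schedule` function and the task tuples are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cpu: float                       # free CPU cores (hypothetical unit)
    ram_gb: float                    # free RAM in GB
    tasks: list = field(default_factory=list)


def schedule(tasks, machines):
    """Place (name, cpu, ram_gb) tasks first-fit decreasing; return unplaced names."""
    unplaced = []
    # Largest requests first, so big tasks are not squeezed out by small ones.
    for name, cpu, ram in sorted(tasks, key=lambda t: (t[1], t[2]), reverse=True):
        for m in machines:
            if m.cpu >= cpu and m.ram_gb >= ram:
                m.cpu -= cpu
                m.ram_gb -= ram
                m.tasks.append(name)
                break
        else:
            # No machine had both enough CPU and enough RAM left.
            unplaced.append(name)
    return unplaced
```

With two machines of 4 cores / 16 GB each, a 3-core/12 GB task and two 2-core/4 GB tasks fit, while a further 1-core/8 GB task stays unplaced because no single machine has both the CPU and the RAM free.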
> Do they have the same type of performance constraints as the ones encountered by gitlab?
Presumably it depends entirely on the software being written.
Disk systems do get expensive at scale but the scale that they're usually sold at these days is pretty huge. You talk about going up to a petabyte but that's a fraction of a single rack's worth of disk these days. Not everyone wants to be a filesystem expert and distributed filesystems are jumping in at the deep end.
Clearly CephFS has weak spots, but from what I've seen those are spots that we can work out, rough edges here and there. The good thing is that we are much more aware of these edges.
We are already working on what the next step will be to soften out these weaknesses so we are not impacted again. And of course to ship this to all our customers, either they run on CephFS, NFS appliances, local disks or whatever makes sense for them.
We started using Ceph because we wanted to be able to grow our storage and compute independently. While it worked well for us, we ended up with much larger latencies as a result. So we developed FSCache support.
Even better yet, if your data is inherently shareable (or has some kind of locational affinity) you can end up always serving data with a Ceph backend from the local cache, with the exception of a server going down or the occasional request. I'm guessing it is (repo / account).
On your API machines serving the git content out of the DFS, you can set up a local SSD drive for read-only caching. Depending on your workload you can end up significantly reducing the IOPs on the OSDs and also lowering network bandwidth.
With the network / IOPs savings we've decided to run our CephFS backed by an erasure-coded pool. Now we have a lower cost of storage (1.7x vs 3x replication) and better reliability, because with our EC profile we can lose 5 chunks before data loss instead of 2 like before. That works because more than 90% of requests are handled with local data and there's a long tail of old data that's rarely accessed.
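For anyone wanting to check the 1.7x figure, a quick sketch of the overhead arithmetic. The comment does not state the actual EC profile; k=7 data chunks plus m=5 coding chunks is only an assumption that happens to match both quoted numbers, and the function names are made up for the example.

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data in a k+m erasure-coded pool."""
    return (k + m) / k


def ec_tolerated_losses(m: int) -> int:
    """A k+m code can rebuild data after losing up to m chunks."""
    return m


def replica_tolerated_losses(copies: int) -> int:
    """With n full copies, data survives the loss of n-1 of them."""
    return copies - 1


if __name__ == "__main__":
    k, m = 7, 5  # assumed profile, not taken from the comment
    print(f"EC {k}+{m}: {ec_overhead(k, m):.2f}x raw, survives {ec_tolerated_losses(m)} lost chunks")
    print(f"3x replication: 3.00x raw, survives {replica_tolerated_losses(3)} lost copies")
```

The trade-off is that erasure-coded reads and recoveries touch more OSDs than replicated ones, which is why the local cache absorbing most reads matters here.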
If you're going to give it a try, make sure you're using a recentish kernel such as a late 3.x series (or 4+). That has all the CephFS FSCache and upstream FSCache kinks worked out.
If you're using a relatively recent kernel, such as a late 3.x series or 4+ (as in Ubuntu 16.04).
We are running a recent kernel, as in Ubuntu 16.04.
The reason I'm framing the caching not so much at the CephFS level is because we are shipping a product, and I don't think that all our customers will be running CephFS on their infra. Therefore we will need to optimize for that use case also, and not only focus on what we do at GitLab.com.
Thanks for sharing! Will surely take a look at this.
We made mistakes ourselves that were the root of the problems at the time, and based on the update a few months later things definitely improved significantly.
Random thought: Go talk to google and get $100k of free cloud credits.
You run 100 instances of n1-highcpu-2 (2 CPU, 2GB RAM), each with a local 375GB SSD. That gives you 37TB of storage and you're fine for the year.
This is an absolutely ridiculous setup but it could work.
1) Servers don't need to have more memory for caching. Local SSD access is fast enough.
2) Servers have 1Gb uplink. That can be saturated easily by a single local SSD.
3) I'm a firm believer that if you put 8-16 disks in a single box, you're gonna hit a hard limit with the network bandwidth and infrastructure (among other things).
4) Do you need IOPS or do you need storage???
5) Can CephFS shard fairly across a hundred systems?
6) Should CephFS run on raw disks and manage replicas itself? And can it handle failure?
Bonus question) Why are we thinking about all of this. You should have just used s3?
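The back-of-the-envelope numbers behind this setup can be sketched as follows. The node count, SSD size and 1Gb uplink come from the comment; the replica count is an assumption (the thread leaves it as an open question), and the helper names are invented for the example.

```python
def fleet_raw_tb(nodes: int, ssd_gb: int) -> float:
    """Total raw capacity in TB (decimal units, 1 TB = 1000 GB)."""
    return nodes * ssd_gb / 1000


def usable_tb(raw_tb: float, replicas: int) -> float:
    """Capacity left once every object is stored `replicas` times."""
    return raw_tb / replicas


def uplink_mb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (8 bits per byte)."""
    return gbit_per_s * 1000 / 8


if __name__ == "__main__":
    raw = fleet_raw_tb(nodes=100, ssd_gb=375)
    print(f"raw capacity: {raw} TB")
    print(f"usable at 3 replicas (assumed): {usable_tb(raw, 3)} TB")
    print(f"per-node uplink: {uplink_mb_per_s(1)} MB/s")
```

So the 37TB is raw capacity; if the filesystem keeps several replicas, the usable figure shrinks accordingly, and each node can only push about 125 MB/s over its uplink.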
NFS and Ceph both support the standard POSIX file access model, notably including random-access reads and writes on a file without replacing the entire file and consistency models that let you reason about directories and their contents usefully. If your goal is to run the reference implementation of git, or something that is compatible with it, on the filesystem, you absolutely need something that supports this.
You could write something new that speaks the git protocol but is designed to be backed by an eventually-consistent object store instead of a POSIX-compliant file store ... but that seems like an equally big challenge, honestly.
It's not a bad decision; getting POSIX semantics to play well with modern expectations of a highly-consistent and highly-available distributed system is very difficult. But it's a work-intensive decision, and figuring out how to deploy NFS or Ceph at scale might be less work. Especially if Google's SVN/hg developers can work directly with the Bigtable developers on features, and GitLab's developers can't work directly with the S3 team (but can own their own Ceph or NFS stack).
You'll have a really cheap and very scalable solution that uses low level storage mechanisms yet it will also be really performant because when loaded everything will be served from memory.
Notably this doesn't require a distributed filesystem.
For example, in this case, since they're IOPs-bound first, and probably network-bound second, you can cram 16 SSDs in a box, and then put a dual-port 25Gig Ethernet NIC up to 2 leaf switches at the top of rack, and then build a CLOS out of 100G spine switches, so that there are essentially never any network bottlenecks for Ceph.
This is very similar to DreamHost's DreamCompute cluster design, though they're using 2x 10G to the servers, and 40G up to the spines, since 25G/100G wasn't available when they built the cluster.
Unless something changed dramatically at GitHub they're a 100% solid-state fleet since 2012 when they started down the "building their datacenter cages" journey.
Whilst you could say as the 500lb gorilla in the market they can afford that luxury, there are other examples, i.e. DigitalOcean, where SSD is a core part of their product and it is offered inexpensively ($5/mo).
Even if you're looking at 3-way replication of your data, meaning you're buying 768TB of storage, the SSD should only tot up to $200-300k for Intel S3510 (or similar).
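As a sanity check on those figures, a small sketch of the implied arithmetic. The 768TB, 3-way replication and $200-300k numbers are from the comment; the 256TB usable figure is derived from them, and the function names are hypothetical.

```python
def raw_tb(usable: float, replicas: int) -> float:
    """Raw capacity to buy for a given usable capacity and replica count."""
    return usable * replicas


def usd_per_tb(total_usd: float, tb: float) -> float:
    """Average price per TB for a given total spend."""
    return total_usd / tb


if __name__ == "__main__":
    usable = 768 / 3  # 256 TB usable implied by 768 TB raw at 3 replicas
    for total in (200_000, 300_000):
        print(
            f"${total:,}: ${usd_per_tb(total, raw_tb(usable, 3)):.0f} per raw TB, "
            f"${usd_per_tb(total, usable):.0f} per usable TB"
        )
```

That works out to roughly $260-391 per raw TB, which is the number to compare against current SSD street prices.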
There's a non-trivial amount of additional effort involved in any kind of "multi-layered" storage (c.f. data on 7200rpm, NVMe cache) over and above just making all of your storage moderately performant in IOPS terms. That has cost too.
TL;DR - HDFS is rarely the answer, mainly due to pains usually involving crashed nameservers.
The OP is asking for a guaranteed IOPS/latency SLA, for which they're willing to pay HUGE money by ROLLING THEIR OWN.
Possibly think about high service pricing tiers for systems that require it. This is a standard operations problem that can co-exist within pooled service models.
As I said, rolling your own will not give you a guarantee, it will just give you the responsibility for failure. We don't offer the guarantee, because we don't want you to believe it can't fail.
Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process, and had to explain there was nothing I could do... For five days.
It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.
That may sound like cold comfort on the face of it, but it's key to getting the most out of cloud and exceeding the possibilities of an on-prem architecture. Rule #1 is, everything fails. The key advantage to a good cloud provider (and there are many) is not that they can deliver a guarantee against failure (as boulos stated correctly) but that they'll allow you to design for failure. The issue arises when the architecture in the cloud resembles that which was on-premise. While there are still some advantages, they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.
The key to having a good cloud deployment is effectively utilizing the features that eliminate single points of failure so that the same storage controller failure that might knock you out in your on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but brings huge advantages when it comes together.
Disclosure: I work for AWS.
As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.
Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.
If you're running a serious business, support contracts are a must.
Manufacturer specifically asks customer to keep it informed where the equipment is physically located, and then prepositions spares to appropriate depots in order to meet the contractual requirement established when the customer paid them often a large amount of money for 4-hour Same Day (24x7x365) coverage for that device.
This isn't how hyperscale folks operate, for the same reason Fortune 100s rarely take out anything more than third-party coverage when their employees rent vehicles: it becomes an actuarial decision. You weigh the # of 'Advance Replacement' warranty contracts and the $ involved against buying a % of spares, keeping those in the datacenter, and RMA'ing the defective component for refund/replace (on a 3-5 week turnaround time).
tl;dr - Operating 100-500 servers you should likely pay Dell and they'll ship you the spare and a technician to install it, operating >500 servers and you should do the sums and make the best decision for your business, operating >5000 servers and you probably want to just 'cold spare' and replace yourself.
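"Doing the sums" could look something like the sketch below. Every number you would feed it (contract price per server, server price, spare percentage) is hypothetical, and it deliberately ignores staffing, shipping and RMA turnaround, so treat it as a starting point rather than a decision tool.

```python
import math


def contract_cost(servers: int, contract_usd_per_server: float) -> float:
    """Total cost of putting every server on a vendor support contract."""
    return servers * contract_usd_per_server


def self_spare_cost(servers: int, server_usd: float, spare_percent: int) -> float:
    """Cost of buying cold spares; you cannot buy a fraction of a server."""
    spares = math.ceil(servers * spare_percent / 100)
    return spares * server_usd


def cheaper_option(servers, contract_usd_per_server, server_usd, spare_percent):
    """Return 'contract' or 'self-spare', whichever has the lower hardware cost."""
    contract = contract_cost(servers, contract_usd_per_server)
    spare = self_spare_cost(servers, server_usd, spare_percent)
    return "contract" if contract <= spare else "self-spare"
```

With made-up inputs of $500 per server for the contract, $5,000 per server and 5% spares, a 4-server shop still has to buy one whole spare and the contract wins, while at 1,000 servers the spares cost half as much as the contracts.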
Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.
As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.
That said I suspect most enterprises pay too much; as in their money might be better spent buying triple mirroring on a JBOD rather than a platinum service contract with a fancy all in one high end RAID-y machine.
Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.
If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.
There's non-trivial cost involved in simply being staffed to accommodate the model you propose, all of the ODM's have some "sharp edges" and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling and dealing with firmware glitches, etc.
Remember that 500-rack (per buy) scale is table stakes for "those cloud providers", it is their business, whereas GitLab is a software company. Play to your strengths.
I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.
The scale gitlab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation make sense. I think 500 racks is a huge overestimate, I think the cross-over point is closer to 5 racks.
> had to explain there was nothing I could do... For five days
The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.
It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.
Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see if it's cost effective comparatively speaking.
Transparency from GitLab is excellent but you shouldn't really generalise statements about cloud suitability without the full picture or "Walking a mile in their shoes".
> Providers don't provide a minimum IOPS, so they can just drop you.
The reason I ask is because the blog post generalizes a lot about what cloud providers offer or what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like provisioned IOPS with EBS, dedicated instances with EC2, provisioned throughput with DynamoDB, and so on.
You're right that the blog post was generalizing.
BTW We looked into AWS but didn't want to use an AWS only solution because of maximum size, costs, and the reusability of the solution.
So, RAID multiple EBS volumes and you have a larger disk.
New job title: Networked Infrastructure Actuary
Recently I bought an old SuperMicro server on eBay, configured it with 2 6-way AMD Opterons, 3 8-way SATA controllers, and 32GB of memory. With 4TB 5700RPM drives in an 8-way software RAID6, it could do 800MB/sec. I realize it's not small file random I/O, but a blended configuration where you put small files on SSD and large files on spinning disk would probably be pretty sweet.
My intuition is that learning all the ins-and-outs of AWS, and how to react and handle all kinds of situations, is not that much easier than learning how to react with your own hardware when problems come up. Especially consider that AWS is constantly changing things and it's out of your control, whereas with your own hardware, you get to decide when things change.
If you can colocate physically close, it's a lot easier. Our colocation was in Fremont but our office was in San Fran, so it was a haul if we had to install or upgrade equipment. But even so, there were only 1 or 2 times in 7 years that we needed to spend 2 consecutive days at the colo. One of those was during a major upgrade where it turned out that the (cheap) hardware we bought was faulty.
When you run your own hardware you have all the engineering you were already doing, plus the investment in upkeep and in improving your architecture.
As Google's boulos said, that's where the real costs are.
In an earlier comment you said EBS only goes to 16TB and that is "an order of magnitude less" than your requirement. However, that's per volume; you can attach many volumes in much the same way as servers have many disks.
Scale horizontally not vertically, add more OSD instances? To each you can attach a number of EBS or PD volumes, each with IOPS characteristics that in aggregate are sufficient to service your workload?
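A rough way to size that is to take whichever constraint needs more volumes, capacity or IOPS. In the sketch below only the 16TB per-volume limit comes from the thread; the target capacity, target IOPS and per-volume IOPS are placeholder parameters, and `volumes_needed` is a made-up helper.

```python
import math


def volumes_needed(target_tb: float, target_iops: int,
                   tb_per_volume: float, iops_per_volume: int) -> int:
    """Volumes required to satisfy both the capacity and the IOPS target."""
    for_capacity = math.ceil(target_tb / tb_per_volume)
    for_iops = math.ceil(target_iops / iops_per_volume)
    return max(for_capacity, for_iops)


if __name__ == "__main__":
    # Hypothetical workload: 160 TB and 50k IOPS on 16 TB volumes
    # that each deliver 20k IOPS.
    print(volumes_needed(160, 50_000, 16, 20_000))
```

Keep in mind that striping volumes together without redundancy means losing any one of them loses the whole set, so a layer that replicates on top (such as Ceph OSDs) still matters.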
If you want to avoid EBS or PD entirely, is there a reason you can't look at 'i2' or 'd2' instance types?
At a fundamental level you're just moving the problem and trading managing metal (which is hard) for I/O guarantees.
"Why is this harder than you might expect?" - you stated elsewhere that you'll have Remote Hands do rack/stack. Providers like Equinix refer to this as "Smart Hands". Everyone who's managed a reasonable-sized environment finds this term highly ironic, as the technician can and will replace the wrong drive, pull the wrong cable, etc.
I've done a non-trivial amount of infrastructure 'stuff' (design, procurement, install, maintenance, migration) for some well-known companies. If you want to Hangout for an hour and pick my brain, gratis, my e-mail is in my profile.
There is no guarantee of SLA in any distributed system. The best you can do is measure things and know what you'll get most of the time.
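"Measure things and know what you'll get most of the time" usually means tracking latency percentiles rather than averages. A minimal sketch using the nearest-rank method; the sample latencies are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical I/O latencies in milliseconds, with one slow outlier.
    latencies_ms = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    for p in (50, 90, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

The median looks healthy while the p99 is dominated by the single outlier, which is exactly the tail a hard SLA would have to cover.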
If you want an SLA, you can make a single server with 10TB of memory as storage. That's a solid choice! :D
What if you need to scale up (or scale down again)?
Making servers run as reliably as in cloud datacenters is really hard work and imposes additional cost. For example, you need not one, but two or three datacenters with 2-3 times the amount of servers you actually use. Then you need admin staff who know their work and keep things running smoothly and reliably. You need not one but at least 3 of that staff because they might get sick or go on vacation.
Your cost savings probably come from cutting corners. Depending on the structure of your applications, downtime requirements and recovery plans, that may even be OK. You don't always need a fleet of tanks to deliver a box of milk bottles.
Not really. It's easy enough to have dedicated server hardware that runs just as smoothly as cloud hosted hardware. There are many options for doing so. The key is to pay for support.
For example, if you go with Microsoft or Red Hat or Ubuntu servers, you can pay for support contracts to get help from experts in configuring systems. Such options can work out cheaper than cloud hosting (depending on the IT infrastructure requirements), with the added benefit of having hardware more directly under your control.
For my systems built on EC2, I might not need to do anything at all for typical hardware failures, if I've set up EC2 auto recovery. It transparently relaunches my instance on another server. If I'm not using auto recovery, then I might just stop and restart the instance, in which case I'm also migrated to another physical server if the first one wasn't working. It has the same identity, by the way: same IP address, same disk contents, and all that, and all the machine perceived was an OS reboot (or maybe kernel panic followed by reboot, depending on the failure).
For stateless systems I don't even need to bother with that. I'll just configure it to spin up a replacement instance if the desired number of servers I want to have online is not met because one of them failed. When I have 100+ servers in a fleet, this is really convenient. I don't need to keep track of them individually, I just say: "This is the image I'd like to run, and I want to have 100 of them". Hardware failures that would bring down a normal physical machine don't need to involve me at all.
The old server hardware that's broken is now in the hands of EC2 to repair or swap out, and my server's back up and running in minutes, possibly with no action on my part.
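The "I want to have 100 of them" model boils down to a reconciliation loop: compare the desired count with what is healthy and act on the difference. A toy sketch of one such pass follows; it does not call any real cloud API, and the `reconcile` function and instance ids are hypothetical.

```python
def reconcile(desired: int, instances: dict) -> tuple:
    """One reconciliation pass over a fleet given as {instance_id: is_healthy}.

    Returns (ids to terminate, number of fresh instances to launch).
    """
    unhealthy = [iid for iid, healthy in instances.items() if not healthy]
    healthy_count = len(instances) - len(unhealthy)
    # Never launch a negative number if the fleet is already over target.
    to_launch = max(0, desired - healthy_count)
    return unhealthy, to_launch


if __name__ == "__main__":
    fleet = {"i-a": True, "i-b": False, "i-c": True}
    print(reconcile(3, fleet))
```

A managed autoscaling group runs this kind of pass continuously, which is why a dead host needs no human action in the stateless case.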
Is something like that easy to achieve with colocation and support offerings with those vendors? Genuine question - I've never operated in colo or worked with those vendors. I realize that colo facilities can take care of repairing and swapping servers for me, but the benefit I get from the cloud is that I can perform those actions in moments with an API, or they can happen automatically, and I don't need to engage other humans. I can't imagine spending my time coordinating with vendors, getting on the phone, opening support tickets, waiting for people to do things, etc... right now I can do everything at the speed of computing with APIs and automation. It'd be tough to give that up.
You've got plenty of options. Reboot the server (using a local login, as the server is hosted where employees/support staff are physically close to it), restore the server from a backup, etc...
Let's put it like this, how do you think companies managed before cloud servers? Do you think they just had the attitude of 'Oh damn the server has gone down, nobody can do any work today'? I'd put it to you that contingency plans existed long before cloud hosts did.
Not everybody needs that and not everybody who does need it also realizes it before a major failure strikes out one of their machines and there is no working failover and their one admin is not reachable and when they finally reach him he has to be flown in, then they need to wait for replacement parts, which all in all delays recovery by several days.
If yes, how is it replicated over multiple datacenters?
If yes, does failover work flawlessly?
Depends on the database engine used. For example...
>"If yes, does failover work flawlessly?"
Again, depends on the database engine used. To use SQL Server as an example again...
A bottle of milk filled with Gold is worth about $800k.
By extension the 6-pack is $4.8M
If tanks can fit seamlessly in the budget, we shall give tanks serious thought! :D
Out of curiosity. What were you running? What was your setup? How much disk IOPS? and disk bandwidth used?
Let's not call that a server but a desktop computer please ^^
The point of the cloud is to manage MANY servers. If all you have is a single box, you don't need help to manage it, you don't need AWS/GCE.
Well, just order 50 of them and install k8s.
Google figured quite early in its life that RAM was the one component not to skimp on: http://research.google.com/pubs/pub35162.html
I've done ROI studies for several applications like that, and usually cloud has a higher total cost unless there are specific availability requirements or you don't have a facility that can meet a 99.9 SLA.
The key assumption is that you know a lot about the app and it's in a steady operating mode. If you're in a hyper growth mode, have fluctuating or seasonal demand, or you have no capital funds, cloud is a no brainer.
If your demand is 50% stable 50% fluctuating (say some base load + a big spike at US prime time), I still think you can win with a hybrid cloud... i.e. serve base load from a COLO, and serve spike load from the cloud. That does mean you need to configure at least 2 networks, but not a terrible idea from a DR standpoint anyway (Main DC fail? Push a button and run off the cloud until it is fixed)
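The base-plus-burst split can be sketched in a few lines. The demand series and the colo capacity below are invented numbers, and `split_load` is a hypothetical helper; the point is only to show how little of the day the cloud portion may be needed.

```python
def split_load(hourly_demand, base_capacity):
    """Split per-hour demand into (served from colo, burst to cloud)."""
    colo = [min(d, base_capacity) for d in hourly_demand]
    cloud = [max(0, d - base_capacity) for d in hourly_demand]
    return colo, cloud


if __name__ == "__main__":
    # Hypothetical demand in servers-worth of load, with a prime-time spike.
    demand = [40, 50, 120, 60]
    colo, cloud = split_load(demand, base_capacity=50)
    print("colo: ", colo)
    print("cloud:", cloud)
    print("cloud server-hours:", sum(cloud))
```

Pricing the cloud server-hours against the cost of owning enough colo capacity for the peak is then the comparison that decides whether the hybrid wins.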
Actually it really does for all practical purposes.
When it fails its SLA you replace it.
That's just storage, now you need to add so many layers on top of it. Like @boulos said, it is then your fault, but your customers still see the issues.
Surprisingly cost effective given the massive performance and support. Cloud is great for some things but not everything.
I should point out that we build systems that the OP wants ... they likely don't know about us as we are a small company ...
and we do use gitlab
Always happy to help ...
I can't say it's a bad thing to do; I just can't help but notice how Google invests in social media participation. That's why they own the conversation in places like HN and can pull off this stuff.
What I fail to understand is, how does GCE or AWS tackle the issue described in the article? As far as I understand, their problem seems difficult to work around due to the nature of the Cloud (shared).
How would GCE be better than AWS or Azure at this? I would be really interested to know and I'm sure that'll be useful for other HNers with the same worries.
To solve problems just like these, we offer EBS (Elastic Block Store) with provisioned IOPs guarantees. Essentially, you can get guaranteed IOPs if you need it for I/O intensive applications; up to 30,000 IOPs per EBS volume.
But, PIOPs EBS volumes wouldn't be my first recommendation. It sounds like what they really need is an elastic, scale-out filesystem with NFS semantics. We have Elastic File System, or EFS, which is exactly that. It's a petabyte scale filesystem that is highly available across multiple availability zones, and scales in IOPs and performance as it scales in size.
Their application should also look at leveraging S3 object storage, rather than NFS, because that is a highly distributed, highly available object storage system, that is likely to give better scalability, availability, and performance, than rolling your own Ceph infrastructure.
I personally believe that, when a prospect feels wronged or poorly served by a competitor, it's the company's responsibility to reach out to that prospect directly. Ideally 1-on-1, the company should aim to listen carefully to the prospect's concerns and not to immediately start spouting opinions or solutions. From there, an honest dialogue can take place which can form the basis of a sale. And, more importantly, a relationship.
Of course, if the goal is simply to castigate the competition or to ward off uncomfortable questions by implying the dissatisfied prospect is thinking poorly or emotionally, then a public forum post works better, I guess.
I'm not here trying to sell them on coming to Google. If they're interested, my contact info is in my profile. It's quite likely that the best option for them would be to return to AWS, as they have experience running there and EBS has improved a lot over the years.
However, considering they make money out of private installs of gitlab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment. Necessary for them to keep experience with large installs of gitlab. If one of their customers that run on-prem has performance issues they can't just say: gitlab.com uses a totally different architecture so you're on your own. They need gitlab.com to be as close as possible to the standard product.
Pivotal does the same thing with Pivotal Web Services, their public Cloud Foundry solution. All of their money is made in Pivotal Cloud Foundry (private installs).
From a business perspective, private installs are a way of distributed computing. Pretty clever, and good way of minimizing risk.
Dedicated servers and colocation are going to be far cheaper than the cloud and, worse, the savings are directly related to the size of the infrastructure you need.
That, combined with the fact that even the very best of virtualization on shared resources still kills 20-30% of performance.
So there's 3 things you can use the cloud for:
1) your company is totally fucked for IT management, and the cloud beats local because of incompetence (this is very common). And you're unwilling or unable to fix this. Or "your company doesn't focus on IT and never will" in MBA speak.
2) wildly swinging resource loads, where the peak capacity is only needed for 1-5% of time at most.
3) you expect to have vastly increasing resource requirements in a short time, and you're willing to pay a 5x-20x premium over dedi/colo to achieve that
The thing I don't understand is that cloud has both a lower limit (it is far more expensive than web hosting or a VPS, which covers an extremely common case) and an upper limit: it is far more expensive once you go over a certain capacity (it doesn't matter which one: CPU, network, disk... all are far more expensive in the cloud). Even if you have wildly varying loads, there's an upper limit to the resource needs beyond which cloud becomes more expensive.
The thing I don't understand is why so many people are doing this. I ran a mysql-in-the-cloud instance on Amazon for years, with a 300 MB database, serving 10-50 qps as a backend to a website, and a reporting server that ran on-premise. Cost? 105-150 euros per month. We could have easily done that either locally or on a VPS or dedicated server for a tenth of that cost.
Cloud moves a capital cost into an operational cost. This can be a boon or a disaster depending on your situation. You want to run an experiment that may or may not pan out? Off the cloud you'll have spare capacity that you can use but don't really have to pay for. On the cloud, cost controls will mean you can't use extra resources. You can't borrow money from the bank? The cloud (but also dedi providers) can still get you capacity, essentially allowing you to use their bank credit for a huge premium.
Another use case would be where infrastructure costs are minor compared to dev and ops staff costs. If hosting on AWS makes your ops team 2x as productive at a 30% infrastructure markup that can be a steal.
It's best to design your stuff so it can easily go from hosted solutions (this "cloud" bullshit term people keep using) to something you manage yourself. Docker containers are a great solution to this.
If you set up some Ansible or Puppet scripts to create a Docker cluster (using Mesos/Marathon, Kubernetes, Docker Swarm, etc.) and build it in a hosted data center, it's not going to take a whole lot of effort to provision real machines and run that same stack on dedicated hardware.
2 people doing ops work cost 7-8k $ per month. Let's assume each of them is managing at least 5x their own cost in infrastructure spend, ie 35k+ $/month. That easily buys you 20-30 extremely high spec dedicated machines, if necessary all around the world, with unlimited bandwidth. On the cloud it wouldn't buy you 5 high spec machines with zero bandwidth, and zero disk.
Let's compare. Amazon EC2 m3.2xlarge (not a spectacularly high end config I might add, 8vCPU, 30Gig ram, 160G disk, ZERO network) costs $23k per month. So this budget would buy you 2 of those. Using reserved instances you can halve that cost, so up to about 4, maybe 5 machines.
Now compare softlayer dedicated (far from the cheapest provider), most expensive machine they got: $1439/month. Quad cpu (32 cores), 120G ram, 1Tb SSD, 500G network included (and more network is about 10% of the price amazon charges for the same). For that budget it gets you 25 of these beasts (in any of 20+ datacenters around the globe). On a low cost provider, like Hetzner, Leaseweb or OVH you can quadruple that. That's how big the difference is.
It used to be the case that Amazon would have more geographic reach than dedicated servers, but that has long since ceased to be true.
There is a spot in the middle where it makes sense, let's say from $100+ to maybe $10k where cloud does work. And you are right that it lets a smaller team do more. But there are 2 things to keep in mind: a higher base cost, and a cost that rises far faster when you expand compared to dedicated or colo. This is not a good trade to make.
An m3.2xlarge is $.532/hr or $388/month, not $23k/month. A similar instance on GCE (n1-standard-8) is $204/month with our sustained use discount, and then you need to add 160 GB of PD-SSD at another $27/month (so $231 total).
Disclosure: I work on Google Cloud, but this is just a factual response.
EC2 reserved instances offer a substantial discount over on-demand pricing. The all-up-front price for a 3-year reservation for m3.2xlarge would be an amortized monthly rate of $153/month, which is a 61% saving vs. the on-demand price of $388/month, according to the EC2 reserved instances pricing page.
Granted, using this capacity type requires some confidence in one's need for it over that period of time, since RIs are purchased up front for 1 or 3 year terms. But RIs can also be changed using RI modification, or by using Convertible RIs, and can be resold on a secondary marketplace. As a tradeoff in comparison to GCE's automatic sustained use discount, the EC2 RI discount requires deliberate and up-front action.
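For anyone who wants to check the pricing arithmetic in this exchange, here is a minimal sketch in Python. The 730-hour month (8,760 hours / 12) is my assumption, not something stated above; the hourly and reserved prices are the ones quoted in the comments, and the function names are just illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def monthly_cost(hourly_rate: float) -> float:
    """Convert an hourly on-demand rate into an approximate monthly cost."""
    return hourly_rate * HOURS_PER_MONTH


def saving(reserved_monthly: float, on_demand_monthly: float) -> float:
    """Fraction saved by paying the reserved rate instead of on-demand."""
    return 1 - reserved_monthly / on_demand_monthly


# Figures quoted in the thread: $0.532/hr on-demand, $153/month amortized 3-year RI.
on_demand = monthly_cost(0.532)
ri_saving = saving(153, 388)

print(f"on-demand: ${on_demand:.0f}/month, RI saving: {ri_saving:.0%}")
```

Under that assumption the quoted $388/month and 61% figures are consistent with each other.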
Edit: I do see in another comment you concede the value of the cloud for ppl spending under $10k a month.
1) Your needs are very static, or
2) Your IT department can competently replicate the PaaS experience on its in-house metal (common big tech company strategy)
The cloud is likely to do wonders for velocity, as when you have a new use case, you can "just" spin up a new VM and run the company Puppet on it within an hour or so, vs. wait weeks to months for a purchase order, shipping, installation at the colo facility, etc.
If your IT department is doing Mesos or Kubernetes or something with a decent self-service control panel for developers, then you get the best of both worlds, but you also have to build and maintain that.
If so, they could (in theory) split Gitlab.com into a bunch of shards which as a whole match the properties of N% of enterprise users. That'd be a pretty cool way to avoid the different-in-scale problem (although you might still run into novel problems as you're now the Gitlab instance with the most shards..).
In general, over time, load on some shards will increase while others decrease. Migrating a customer from one shard to another will likely cause a short outage for them, and many bugs down the line when they've bookmarked all kinds of things.
Depends on how they bill.
If they bill on a few private installs while giving unlimited storage & projects for free on the cloud, they're setting up themselves for bankruptcy.
Gotta take care of the accounting. Having an unsustainable pricing is a classic mistake of web companies.
at the moment you can just set up multiple mount points, but I guess it would be superior to actually set up a mount point or object storage. I'm pretty sure that you can put git into something like S3.
A) I think it would be kind of difficult to store git on S3 (the magic would be in the caching / consistency layer to keep it performant)
B) it would make sense for their customers that run on AWS. However, many of them just run on a single VM / physical server.
btw here is the alibaba approach: http://de.slideshare.net/MinqiPan/how-we-scaled-git-lab-for-...
Edit: and I heard github uses gitrpc.
Actually, we don't appear to blog quite enough about how awesome NYI are. They're a major part of our good uptime.
And some stuff about our hardware. I'd strongly recommend hot stuff on RAID1 SSDs and colder stuff on hard disks. The performance difference between rust and SSD is just massive.
We're looking at PCI SSDs for our next hardware upgrades:
(there are two types of SSD on the market. Ones that lose your data, and SSDs from Intel) - we're currently running mostly DC3700 SATA/SAS SSDs.
We customise the layout of our machines very much to match the heat patterns of our data:
They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.
I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.
Still this is something we need to fix in our CI implementation because, as you say, databases are not good queueing systems.
We have been in contact with RedHat and various other Ceph experts ever since we started using it.
> I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.
Users can self host GitLab. Using some complex custom block storage system would complicate this too much, especially since the vast majority of users won't need it.
You do need either a distributed FS (GitHub made their own with Dgit http://githubengineering.com/introducing-dgit/, we want to try to reuse an existing technology) or to buy a big storage appliance.
Here is an article from a few years ago:
In any case, GitLab is amazing and I can see how it's tempting to believe that GitLab the omnibus package is the core product. However, HOSTED GitLab's core product is GitLab as a SERVICE. That might require designs tailored a bit more for the cloud than simply operating a yoooge fs and calling it a day.
If on-prem customers can't get AWS/GCE/Azure SuperFastCloudStorage™, then it can't be part of their codebase.
> Some are at 20k+ users so they are close to needing something like Ceph
Or they will scale it themselves ala Alibaba: http://www.slideshare.net/MinqiPan/how-we-scaled-git-lab-for... . They appear to have written a libgit2 backend for their object store (among other things).
I don't see a good reason why solutions using different storage backends could not make it into the OSS project. Many companies run their own Swift cluster, which is OSS.
If you're using CephFS and everyone else wants to be using other Cloud storage solutions, that would actually put you at a disconnect with your users and leave room for a competitor with the tools and experience to scale out on Cloud storage to come in offering support. I would at least consider all the opinions in this thread and maybe reach out to that Minqi Pan fellow from Alibaba with questions..
I actually really like GitLab and wish we could be using it at my company; this is why I'm spending so much effort on this topic (and scaling git is interesting). Hopefully my opinions are not out of place.
I learned a lot when I was first moving to the cloud from Adrian Cockcroft. He has a ton of material out there from Netflix's move to the cloud. I recommend googling around for his conference talks. (I haven't watched these, but they're probably relevant: http://perfcap.blogspot.com/2012/03/cloud-architecture-tutor...)
'robust' - resilient against shocks, volatility and stressors
'antifragile' - thrives on shocks, volatility and stressors (ie. gets better in response)
Antifragile is a step beyond robust. Examples of antifragility are evolutionary systems and true free market economies (as opposed to our too-big-to-fail version of propped-up, overly interconnected capitalism).
So far you've named "evolutionary systems", which are quite fragile in the real world, and an imaginary thing called a "true free market economy".
That said, if you are calling the entire multibillion year phenomenon of life on this planet "fragile" then we are not going to get on well.
Point-in-time human-engineered systems still can't really be anti-fragile, except perhaps in some weird corner cases, but the system as a whole with the humans included, over time, can be.
It should also be pointed out that "anti-fragile" was always intended to be a name for things that already exist, and to provide a word that we can use as a cognitive handle for thinking about these matters, not a "discovery" of a "new system" or something. There are many anti-fragile systems; in fact it's pretty hard to have a long-term successful production project of any kind in software without some anti-fragility in the system. (But I've seen some fragile projects klunk along for quite a while before someone finally came in and did whatever was managerially necessary to get someone to address root causes.)
When I think of anti fragile systems, truly adaptive algorithms come to mind that learn from a failure. For example, an algorithm that changes the leader in a global leader election system based on the time of day because one geographic region of the network is always busier depending on time of day and latency to the leader impacts performance.
If you can't control the amount of stressors and shocks, you want a system that is neither antifragile nor fragile, but strictly indifferent to the level of shocks.
As for a definition, antifragile things get stronger from abuse, like harming muscles during a workout so they grow back stronger. If they were just robust, it would be like machinery (no healing / strengthening).
While I can imagine software that is in itself antifragile, I think it's entirely reasonable to include the totality of the people and process that make up an operational system, in which case even your narrow definition applies here.
You are watering down the word so much that it wouldn't have reason to exist. Is every chair-making process antifragile since chairs get stress tested before being sold?
There's a world of difference between stress testing before something is sold or released and welcoming ongoing hostility throughout its lifecycle, and this difference is absolutely in line with the concept of antifragility.
At the moment, the discussion in the GitHub issues looks like people who are buying servers to put and run in their garage ^^
Not their fault, POSIX API cannot fit into eventual consistency model to guarantee availability (i.e. latency). Moving to your own hardware doesn't actually solve the problem, just gives some room to scale vertically for some time. After that the only way to keep global consistency but minimize impact of unavailability is to shard everything, at least this way unavailability will be contained in its own shard without any impact on every other shard.
It's better to avoid POSIX API in the first place.
edit: it's gitlab not github. The point still stands though.
The R730xd are probably around $50k after dell discounts - depends a bit on what they ended up configuring with regards to support, exact network configuration, etc
The R830s are about $50k as configured - 1.5 TB of RAM is expensive, as the R830 only has 48 DIMM slots and they need the relatively expensive 32GB RDIMMs
The R630s should be about 15k each:
The switches say they are 48x 40G QSFP+ which are very expensive (I'd put them at 30k each from Dell)
50k * 20 + 50k * 4 + 15k * 10 + 2 * 30k
1m + 200k + 150k + 60k ~= $1.4m invested
updated with better R830s pricing
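The estimate above can be reproduced with a few lines of Python. This is only a sketch of the commenter's back-of-envelope sum: the unit prices and counts are the guesses quoted above, and I am assuming the third term means 10 R630s at 15k each, which is what the 150k subtotal implies.

```python
# (model, estimated unit price in USD, quantity), all taken from the estimate above.
line_items = [
    ("R730xd", 50_000, 20),
    ("R830", 50_000, 4),
    ("R630", 15_000, 10),      # assumption: 10 units, matching the 150k subtotal
    ("40G switch", 30_000, 2),
]


def total_cost(items):
    """Sum price * quantity over all line items."""
    return sum(price * count for _, price, count in items)


total = total_cost(line_items)
print(f"${total:,}")
```

That comes to $1,410,000, in line with the "~= $1.4m" above.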
From my perspective, consumer Samsung 850 EVO drives, which can be under-provisioned to match the 1.8 TB (and get better performance characteristics), would give Gitlab cheaper and more reliable storage in terms of IOPS/latency when compared to 10k RPM 1.8 TB drives.
(Community Advocate at GitLab)
If you're large enough, then you should be talking to other folks including Wiwynn, ZT, Quanta, Celestica, et al.
They won't always get all the way there, depends on volume, but often they'll get pretty close, and it'll probably be a 25%+ additional discount vs. what they'll offer without you twisting their arm using a cheaper manufacturer as leverage.
They have to believe you're serious, obviously, which can mean visibly throwing a couple small orders to another manufacturer to tell your sales team they're not the only game in town.
In quality terms, Supermicro is worse vs. Dell/HP, which can still be fine if you've got enough scale to work through all the little issues with firmware.
I'm not beholden to either though haha. If you can get Dell to come close to Supermicro prices then that might make sense. When I was last in the infrastructure game a few years ago we also ran into configuration limitations with Dell. At the time we could get more storage and RAM in the form factors we needed from Supermicro.
One of my main beefs with Supermicro is their tooling around remote patching and configuration of BIOS and other firmware, they charge for it, on top of warranty, and the UX is awful.
With both Dell and HP if you're under a support contract all the updates and tools to apply them are included, and whilst both manufacturers have some "sharp edges" to their tools for managing large fleets, neither is "stab yourself in the eyeballs with a blunt fork" bad like Supermicro.
My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.
Relevant links via shortener:
Main site: http://scalableinformatics.com
(everything below is at that site under the FastPath->Unison tab)
Ceph appliance: http://bit.ly/1qiOYpy
Especially relevant given the numbers I saw on the benchmarking ...
Ceph appliance benchmark whitepaper: http://bit.ly/2fMahfJ
Our EC test was about 2x better than the Dell unit (and the Supermicro unit), and our Librados tests were even more significantly ahead.
Petabyte scale appliances: http://bit.ly/2fuTTAH
We've even got some very nice SSD and NVM units, the latter starting around $1USD/GB.
[end commercial alert]
I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.
Someone suggested underprovisioned 850 EVO. Our strong recommendation is against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals/MDS or whatever you put on there.
Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or better, won't be able to make as effective use of it. Use the raw devices.
Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.
The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.
I am at landman _at_ the company name above with no spaces, and a dot com at the end.
I'm sure it depends on how much you buy.
Does it vary by components? They seem to charge a lot for drives so I'm guessing those can be heavily discounted.
If you are really getting 0% discount you need to private message me and send me $10k so I can tell you this one weird trick.
You simply ask for them.
Just do that. Everyone gets discounts.
usually i always send everything back and tell them to do better.
my biggest gripe about dell pricing was no line items, just a bottomline price. so i couldn't tell which things i could save money on by sourcing them elsewhere (ie: larger drives, large amounts of memory)
I think at Dell the quarters are one month off (end of January, April etc)
You won't find any organization that's much more open than Wikimedia.
Disclosure: I work for Wikimedia (on the release engineering team)
Using a single Ceph FS is an odd choice IMO. You need to have or acquire a ton of expertise to run stuff like that, and it can be finicky. I'm not convinced you can't get this running well enough on AWS though.. So I'd be worried the move to bare metal would just cause the team more problems.
But I get where they're coming from; container orchestrators like Kubernetes are heavily promoting distributed file systems as being the 'cloud-native approach'. But maybe this issue is more relevant to 'CephFS' specifically than to all distributed file systems in general.
Architecting, or even spending mental cycles on day 1 on distribution isn't going to win you as much as focusing on making an awesome product.
This move will probably buy them another year or two, which will hopefully give them enough time to implement some form of partitioning.
How much data? We don't even know if it's GB/TB/PB/EB? How many files/objects? How many read IOPS are needed? How many write IOPS? What's the current setup on AWS? What's the current cost? What are they hosting? Can it scale horizontally? How do they shard jobs/users? What's running on PostgreSQL? What's running on Ceph? What's running on NFS? How much disk bandwidth is used? How much network bandwidth is used?
How are we supposed to review their architecture if they don't explain anything...
I bet that there is a valid narrative where PostgreSQL and NFS was their doom, but I'd need data to explain that ^^
That said, a decent chunk of this info can be found in the discussions on our Infrastructure issue tracker.
The last infrastructure update includes some slide decks that contain more data (albeit it's now a little under 2 months old).
Looking at our internal Grafana instance, it looks like we're using about 1.25 TiB combined on NFS and just under 16 TiB on Ceph. We're working on migrating the data currently hosted on Ceph back to NFS soon.
I'll get someone from the infrastructure team to respond with more info.
I have seen similar issues where a GC pause on one server freezes the entire cluster.
Is this one single monolithic file system? On the service side, can the code be asynchronous with request queues for each shard? This can help free up threads from getting blocked and serve requests for other shards.
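To illustrate the per-shard queue idea, here is a rough Python sketch using asyncio. It is not how GitLab is built; `shard_worker`, `serve`, and the simulated delays are all hypothetical, and `asyncio.sleep` stands in for a storage call. The point is only that requests for a slow shard wait in that shard's queue instead of holding up requests for healthy shards.

```python
import asyncio


async def shard_worker(queue, delay_s, completed):
    """Serve one shard's requests in order; delay_s simulates storage latency."""
    while True:
        name = await queue.get()
        await asyncio.sleep(delay_s)  # stand-in for the real storage call
        completed.append(name)
        queue.task_done()


async def serve(requests, delays):
    """requests: (shard_index, name) pairs; delays: simulated latency per shard."""
    queues = [asyncio.Queue() for _ in delays]
    completed = []
    workers = [
        asyncio.create_task(shard_worker(q, d, completed))
        for q, d in zip(queues, delays)
    ]
    for shard, name in requests:
        queues[shard].put_nowait(name)
    # Wait until every queue is drained, then stop the workers.
    await asyncio.gather(*(q.join() for q in queues))
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return completed


def run_demo():
    # Shard 0 is slow (50 ms per request), shard 1 is healthy.
    requests = [(0, "a"), (1, "b"), (0, "c"), (1, "d")]
    return asyncio.run(serve(requests, delays=[0.05, 0.0]))
```

In `run_demo()`, "b" and "d" complete before "a" and "c" even though "a" was submitted first, because the slow shard only blocks its own queue.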
At the time (about 2005 iirc) I was read/writing between 500GB and 2TB a day, with heavy calculations across two high performance desktops. Then as I came to scale up AWS was announced, but when I priced it, just what was running already on those desktops came to more for the first month than buying 6 heavy duty dual CPU Dell servers with tons of storage.
So that's exactly how we started.
The only "shock" I got was how much of a b!tch keeping all that cool is, and that took a year or two to get that "just right".
BeeGFS is the only nominally open source one that I'd think about trusting my data to. And no, I don't work for them or am compensated in any way for recommendations.
In large systems design, you should always design for a large variation in individual systems performance. You should be able to meet customers' expectations if any machine drops to 1% performance at any time. Here they are blaming the cloud for the variation, but at big enough sizes they'll see the same on real hardware.
Real hardware gets thermally throttled when heatsinks go bad, has io failures which cause dramatic performance drops, CPUs failing leaving only one core out of 32 operational, or ECC memory controllers that have to correct and reread every byte of memory.
In a large enough system, at any time there will always be a system with a fault like this. Sure you only see it occasionally in your 200 node cluster, but in a 20k machine cluster it'll happen every day.
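A quick way to see why cluster size matters so much: if each machine independently has some small chance of being in a degraded state on a given day, the chance that at least one machine is degraded is 1 - (1 - p)^n. The sketch below uses a made-up per-machine probability of 1 in 10,000 per day; that number and the independence assumption are mine, purely for illustration.

```python
def p_any_fault(n_machines: int, p_per_machine: float) -> float:
    """Probability that at least one of n independent machines is faulty."""
    return 1 - (1 - p_per_machine) ** n_machines


P_DAILY = 1e-4  # hypothetical per-machine daily fault probability

small = p_any_fault(200, P_DAILY)      # about a 2% chance on any given day
large = p_any_fault(20_000, P_DAILY)   # about an 86% chance on any given day

print(f"200 nodes: {small:.1%}, 20k nodes: {large:.1%}")
```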
You'll write code to detect the common cases and exclude the machines, but you'll never find all the cases.
The conclusion is that instead you shouldn't try. Your application should handle performance variation, and to make sure it does, you would be advised to deliberately give it variable performance on all the nodes. Run at low CPU or io priority on a shared machine for example.
In the example of a distributed filesystem, all data is stored in many places for redundancy. Overcome the variable performance by selecting the node to read from based on reported load. In a system with a "master" (which is a bad design pattern anyway IMO), instead have 5 masters and use a 3/5 vote. Now your system performance depends on the median performance of those 5.
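Both ideas in the last paragraph fit in a few lines of Python. This is a sketch under simple assumptions: `pick_replica` and `quorum_latency` are hypothetical helpers, loads are whatever the nodes report, and a quorum operation is taken to finish when the k-th fastest reply arrives. With 5 masters and a 3/5 vote, the k-th fastest reply is the median, so one or two very slow masters do not affect the result.

```python
def pick_replica(reported_load: dict) -> str:
    """Choose the replica that currently reports the lowest load."""
    return min(reported_load, key=reported_load.get)


def quorum_latency(latencies_ms: list, quorum: int) -> float:
    """Latency of a quorum operation: the time of the quorum-th fastest reply."""
    return sorted(latencies_ms)[quorum - 1]


# One of the five masters is badly degraded (900 ms), the rest are healthy.
masters_ms = [5, 7, 900, 6, 8]
vote_time = quorum_latency(masters_ms, quorum=3)   # median of the five: 7 ms
single_master_worst_case = max(masters_ms)         # 900 ms if you depend on that one node

best = pick_replica({"node-a": 0.9, "node-b": 0.2, "node-c": 0.5})  # "node-b"
```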
I've seen this a lot, and for a given workload I can tell when leaving the cloud will be the right choice. But the unspoken part is "can we change our application given the limitations we see in the cloud?" Probably pretty difficult in a DVCS but not impossible.
Sadly, storage isn't a first class service in most clouds (it should be) and so you end up with machines doing storage inefficiently and that costs time, power, and complexity.
What would constitute a "first-class service" for you? (Or what does "storage" mean to you)
I think of both Persistent Disk for block and GCS for object storage as fairly reasonable. I agree that the world of "Give me multiple writer networked block devices, preferably speaking NFS" is still pretty bad.
Usual Disclosure: I work on Google Cloud (and for a long time, specifically Compute Engine).
"Storage" as a definition is an addressable persistent store of data. But that isn't as useful as one might hope for discussions, so I tend to think of it in terms of the collection of storage components and compute resources that enable those components to be accessed across the "network."
So at Google a GFS cluster provides "storage" but if the compute is also running compute jobs, web server back ends, etc., it isn't the "only" task of the infrastructure and that is the definition of "second" or "not first class". Back in the day Urs would argue that storage takes so few compute cycles that it made no sense to dedicate an index to the serving up of blocks of disk. But that also constrained how much storage per index you could service. And that is why from a TCO perspective "storage as a service" is cheap when you need the CPUs for other things anyway, but it's very expensive when you just want storage. I wrote a white paper for Bart comparing GFS cost per gigabyte to the NetApp cost per gigabyte, and NetApp was way cheaper because it wasn't replicated 9 times on mission critical data, and one index (aka one filer head) could talk to 1,000 drives.
That same sort of effect happens in cloud services where if you want 100TB of storage you end up having to pay for 10 high storage instances, even if your data movement needs could be addressed by a single server instance with say 32GB of memory. The startup DriveScale is targeting this imbalance for things like Hadoop clusters.
1. "Give me a block device I can boot off of, mount, and run anything I want to from a single VM" (PD, which is just built on Colossus)
2. "Give me NFS".
3. "Give me massive I/O" (GCS)
I think we're doing fine-ish in 1 and 3. The main competition is a dense pile of drives in a box for Hadoop, but we lean on GCS for that via our HDFS connector (https://cloud.google.com/hadoop/google-cloud-storage-connect...). It's our recommended setup, the default for Dataproc, and honestly better in many ways than running in a single Colossus cell (you get failover in case of a zonal outage, and by the same token you can have lots of users simultaneously running Hadoop or other processing jobs in different zones).
PS - I'm going to go searching for your whitepaper (I find in arguing with folks that network is the bottleneck for something like a NetApp box, not CPUs).
[Edit: Newlines between lists... always forgetting the newlines]
You could also improve your operational efficiency but that isn't a priority yet at the big G. I expect over time it will become one and you'll figure it out but in the meantime your customer has to over provision the crap out of their resources to meet their performance needs.
If Bart is still around it was shared with him and the rest of the 'cost of storage' team back in 2009.
However perhaps they shouldn't try to run Ceph in the first place.
Azure has a rather powerful blob storage (e.g. block, pages and append-only blobs) that allows high performance applications. You could use that directly and it will likely be cheaper and work better than Ceph on bare metal.
Like other commenters suggest, in order to take advantage of cloud infrastructure you need to design with those constraints in mind, rather than trying to shoehorn the familiar technologies.
Bare metal can be better and cheaper, etc. but it requires even more skills and experience and a relatively large scale.
Something to consider:
A few years ago I used Rackspace's OnMetal servers https://www.rackspace.com/en-us/cloud/servers/onmetal for a dedicated MySQL 128GB RAM server that would handle 100s of thousands of very active hardcore mobile game users. We were doing thousands of HTTP requests per seconds and 10s of thousands of queries (and a lot of those were writes) per second, all on one server. The DB server would not skip a beat, CPU/IO was always <20% and all of our queries would run in 1-5ms range.
I'm not affiliated with Rackspace in any capacity, but my experience with them in the past has been top-notch, esp. when it comes to "dedicated-like" cloud hardware, which is what OnMetal is - you are 100% on one machine, no neighbors. Their prices can be high but the reliability is top-notch, and the description of the hardware is very accurate, much more detailed than AWS for example, and without "fluffy" cloud terms :).
Boot device: 2x 240 GB hot-swappable SSDs configured in a RAID 1 mirror
Storage: 2x 1.6 TB PCIe storage devices (Seagate Nytro XP6302)
I have a really small setup, but I would personally look into a DL580 as VM host, with two for redundancy.
And a dual-path storage system; in my case I used a 2U MSA2400 (not sure if that is the latest name),
since it could continue to scale up and provided dual path too.
I don't have experience running CEPH so I am not sure what are the hardware requirements for CEPH.
(Disclaimer: I work at Hewlett Packard Enterprise with servers)
Then users could host their own repositories themselves and manage their storage.
This kind of setup would scale a lot better.
This could easily have been titled "Why we couldn't use CephFS"
It may then become a problem of latency per blob, in which case you could coalesce blobs into bigger ones as needed (histories don't change much) and optimize for that.
Not continuing to link ever-increasing sets of machines to each other in a big glob ... that's much harder to make work in the cloud
Therefore the only question left is what to use as the software for this.
Why didn't you use S3?
- small file performance (if they're not storing the git objects as packed)
- POSIX compliance
- Their on-prem installations for customers
For small file performance though, no object storage system is particularly tuned for that (often the granularity is more like MBs, and at least with GCS you're talking ~100ms per write).
Finally, they need to be testing and supporting customers who want to run the whole thing "at home" wherever that is. So they would still need to run Ceph!
Meaning, you can't run git on it, you'd need to make a full copy of the repo each time someone makes a commit.
The only time you would need to delete/replace is during a GC operation which can be scheduled periodically to compact the representation.
The JGit driver supports pushing to S3 in this way, and I have implemented a custom backend to hosting git repositories in an S3 like service (which in turn is based on how Google hosts git repositories at scale)
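Here is a small Python sketch of why git objects map reasonably well onto an object store: they are content-addressed and immutable, so a new commit only ever adds keys. This is not how JGit or any real backend is implemented; `AppendOnlyStore` is a hypothetical class and an in-memory dict stands in for S3. It only covers loose blobs (no packfiles, refs, or GC). The one real piece is the blob ID, which git computes as the SHA-1 of a `blob <size>\0` header followed by the content.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class AppendOnlyStore:
    """Hypothetical stand-in for an S3-like bucket: keys are never rewritten."""

    def __init__(self):
        self._objects = {}
        self.writes = 0  # number of uploads actually performed

    def put(self, content: bytes) -> str:
        key = git_blob_id(content)
        if key not in self._objects:  # existing objects are never replaced
            self._objects[key] = content
            self.writes += 1
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = AppendOnlyStore()
v1 = store.put(b"first version\n")
v2 = store.put(b"second version\n")   # a new commit adds a new object
store.put(b"first version\n")         # unchanged content costs no new write
```

After these three `put` calls only two uploads have happened, and nothing already stored was modified or deleted.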
Building some kind of generic storage layer with multiple backends (S3/S4/etc) would result in us only maintaining whatever we happen to use for GitLab.com. It could also complicate maintenance too much.
You got an application designed to host a few repos on a single disk for a single company. And you wanna run it to store millions of repos for the entire internet... on a single volume.
I understand that you want to run the same thing as your customers. But you have different needs, so you're going to run into permanent conflicts.
How much storage do you have now per service? How much growth do they experience? Do you even know if CephFS can scale linearly with servers/disks?
No they don't. Read their slides, they are running on 52 seconds latency at times ;)
S3 is simple and it can ingest the volume. Of course, the performance is not comparable to a local disk.