CoreOS, however, is the real game-changer for us. The automatic updates and ease of management are what took us from a mess of 400+ ill-maintained OpenStack instances to a beautiful environment where servers automatically update themselves and everything "just works". We've built automation around our CoreOS bare-metal deployment, our Docker container building (Jenkins + custom Groovy), our monitoring (Datadog-based), and soon, our F5-based hardware load balancing. I'm being completely serious when I say that this software has made it fun to be a sysadmin again. It's disposed of the rote, shitty aspects of running infrastructure and replaced them with exciting engineering projects with high ROI, huge efficiency improvements and more satisfying work for the ops engineering team.
It effectively gives you, the SaaS vendor, the best of both worlds:
- Access to markets you can't serve otherwise.
- Development agility with frequent deployments.
- Scalable ops/support for cloud/private deployments.
Outside the excitement of containers and composability, I'm not sure that the problem has been fundamentally simplified. The challenge is that a fully "operational" system from a customer perspective has to fully integrate with the rest of the network's services. That's primarily all the other elements such as monitoring, intrusion detection, backups and secure remote access; granted, the article starts to address some of these. As a SaaS vendor you need absolute repeatability and a constrained set of 'additional services' to integrate with; otherwise the support matrix is going to grow massively. From a customer perspective I can see the advantages in being able to flexibly move across the boundaries depending on the circumstance.
Symmetry between servers has a value.
The main board, RAID controller, network, and usually the storage are going to be planned out meticulously ahead of time based on the maximum load the server will see during its lifetime. Often, processor and memory choices come down to "If we needed X feature and we didn't have it" or "If we took the server down for 1 hour to upgrade it, how expensive is that?".
I initially pictured someone picking the highest spec option they can afford when setting up a new service, rather than choosing based on actual demand of each node.
> "If we took the server down for 1 hour to upgrade it, how expensive is that?"
Putting my CI / DevOps hat on for a sec: who takes production servers down for upgrades without some level of HA to avoid downtime? ;)
Every service gets run in two environments: test and prod. Both are co-located on the same Kube cluster in different namespaces. We also don't put any datastores in Kube. That stuff still lives in OpenStack for now. Ceph can make it possible but for fast disk I/O, it's tough to beat the local SAS bus.
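For illustration, the test/prod split described here is just two namespaces on the same cluster; a minimal sketch of the manifests (the names `test` and `prod` are assumptions, not the commenter's actual setup):

```yaml
# One Kubernetes cluster, two isolated environments for every service.
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Namespace
metadata:
  name: prod
```

The same application manifests can then be deployed to either environment with `kubectl -n test apply -f app.yaml` or `kubectl -n prod apply -f app.yaml`.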
Personal anecdote: we have everything on-prem using a pretty standard vSphere setup. I've got a couple of PostgreSQL databases that aren't even that heavily used (I mean, they top out around 20-40 tx/sec) backed by a hybrid Tegile array. Randomly throughout the day my IOWAIT starts spiking, because even over 8Gb Fibre Channel our storage latency spikes when everything else on the same storage (well over 200 full VMs) gets busy. Or sometimes I need to run a table scan over a 40GB table for a one-off query and the storage bottlenecks the crap out of us, because only our small active set is cached on the SSDs.
You can run databases on remote storage, people obviously do it, but there's a reason why even on AWS the best practice is to use your instance disks instead of EBS volumes if you care about performance. Noisy neighbors suck.
GCE seems poorly engineered if the profit margins are indeed low (normal, that is).
Interestingly, the "3 cores for 1 user-facing one" would explain the pricing of "preemptible" instances, which can cost as little as 1/3rd of a full instance. And the main difference is that they are fully rescheduled every 24h or more often, and there's no live migration for maintenance...
… I am now wondering if GCE runs fault-tolerant VMs. Because if it does, holy moly
A GCE VM spins up in less than a minute. I've seen them come up in 20 seconds.
CF is meant to provide a platform, not an API. The whole point is to liberate the developer from caring about how the app is brought to life so that they can focus on building the app. For operators the point is that no badly-behaved app can bring down production for everyone, so they can lower the drawbridge and let developers move faster.
I imagine that my counterparts at Red Hat would say the same about OpenShift. It's there to let you go faster.
Disclosure: I work for Pivotal, the majority donor of engineering to Cloud Foundry.
What am I meant to be estimating, exactly?
We would probably build this around Kube's Ingress API. We're still a few backlogged projects away from addressing this internally. For now, we manage VIPs by hand. About 1-3 new VIP requests a week.
The single biggest problem by far was ripping out and replacing the 3rd-party SaaS services that we use/pay for with native code. While building a SaaS it is wonderful to use 3rd parties to save time and complexity. Examples include Stripe (payments), Rollbar (error tracking), Intercom (support/engagement), Iron.io (queue and async tasks), Gauthify (two-factor auth)... However, when you go on-prem, oftentimes the servers don't have a public interface, so your entire application has to work offline.
2nd tip. While it may seem like a good idea to create a separate branch in your repo for on-prem (i.e. an enterprise branch), this is a bad idea. It ends up being much better to just if/else in the master branch all over the place:
if (onPrem) { /* do something */ } else { /* do something else */ }
Also, if you have proper dependency injection for testing and developer sandboxes, that's the right layer to vary your back end. If statements are only needed when setting up the dependency environment.
In PHP we do this with polymorphic classes; in Haskell with type classes and the ReaderT monad.
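A minimal sketch of that dependency-injection shape in Python (the `ErrorTracker`, `RollbarTracker` and `FileTracker` names are hypothetical, for illustration): the SaaS build wires in a hosted backend, the on-prem build wires in a local one, and the application code itself never branches.

```python
from abc import ABC, abstractmethod

class ErrorTracker(ABC):
    """Backend-agnostic error-reporting interface the app codes against."""
    @abstractmethod
    def report(self, message: str) -> None: ...

class RollbarTracker(ErrorTracker):
    """SaaS build: ship errors to a hosted service (stubbed out here)."""
    def report(self, message: str) -> None:
        print(f"would POST to hosted error tracker: {message}")

class FileTracker(ErrorTracker):
    """On-prem build: no outbound network, so record errors locally."""
    def __init__(self) -> None:
        self.entries: list[str] = []
    def report(self, message: str) -> None:
        self.entries.append(message)

def build_tracker(on_prem: bool) -> ErrorTracker:
    # The only if/else lives here, where the dependency environment is set up.
    return FileTracker() if on_prem else RollbarTracker()

tracker = build_tracker(on_prem=True)
tracker.report("db connection refused")
```

The same factory pattern works for payments, queues, and two-factor auth: each 3rd-party dependency gets an interface, and the on-prem build swaps in a self-contained implementation.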
BTW Shoutout to Ev from Gravitational, they are proper Kubernetes experts and we appreciate their help on GitLab, especially making https://github.com/gravitational/console-demo for our k8s integration efforts.
I would define Private Cloud as a non-bare-metal solution on-prem, in a traditional colo setting, where the hardware is owned by the company in question. Hybrid Cloud would be a Private Cloud and a Public Cloud bridged in some way.
Funny how definitions diverge here.
The definition I agree with least is the blog post's (a secluded environment on a cloud provider).
Where I work, we refer to a product offering as "private cloud" that means we will run the software in the customer's own AWS account. This keeps their data separate from other customers' and allows extra security (access restricted to US employees, access restricted via VPN, etc.).
I'm also an ex-cloud engineer (as in building, operating, etc.), so I understand the hosting meaning of "private cloud" to be a cloud under the control of the user (software and hardware), such as a company running an OpenStack install in its own datacenter or colo.
I was originally trying to build our Enterprise solution using Ansible, targeting a few common OSes (Ubuntu, Centos, RHEL); headaches gradually began to pile up, surrounding the "tiny" differences in each of these environments -- I'm VERY happy to offload this work.
It took me a little while to wrap my head around best practices regarding placing our services in Docker containers, but once I was over this conceptual leap I was quite happy.
My first thought was about some advice I would give someone considering this (having been pushed into this situation by a large healthcare client years back).
But once I found it was a sales piece, that thought faded.
That being said:
Would just like to add some coloring to this.
>> However, not every customer that wants on-premise is a government agency with air gapped servers.
This is the bulk of our customer base and also still a very large portion of the market. I deliver software via DVD (neither flash drives nor wifi are allowed).
A few notes from these kinds of customers:
They won't let you just install anything.
Docker is great but doesn't have a lot of enterprise adoption (despite the self-perpetuating hype cycle) outside the companies that already have mature software engineering teams as a core competency.
They are often still running CentOS 6.5 or older.
A lot of these environments still require deb/rpm based installation.
Admins at companies that run on prem installations tend to be very reserved about their tech stack. Docker looks like the wild west to folks like that.
Our core demographic:
We do a lot of hadoop related work. We have a dockerized version of that we deploy for customers.
We have also been forced to go the more traditional ssh based yum/deb route. We have automation for both.
They are right that many "on prem" accounts are now "someone else's AWS account".
We also have to run stuff on Windows Server. Docker won't fly in that kind of environment either. Microsoft still has a large market share in the enterprise and will for a long time.
K8s and co is great where I can use it, but it shouldn't be assumed that it will work everywhere let alone in most places in the wild yet. Hopefully that changes in the coming years.
Again: This is one anecdote. There are different slices of the market.
Still? I didn't know the "dev" in devops had eaten "ops" and standard practice is to start shitting files wildly all over a server. Just because it's in a container/vm doesn't make that any less gross of a practice.
A package (or set of packages) that behaves well within the rest of the distribution, and lets sysadmins use their favorite configuration management to set up /etc/foo/bar, is exactly the level of control I want as a customer, and would want to afford admins if I were a vendor.
Giving customers the option of a prepackaged virtualization/container solution is fine and all, but I'd take a wizard-installer-for-linux over a VM image any day. I deal with installers already, and while less than ideal, it's easy enough to wrestle their output into a .deb and throw it on our internal repo. Then I can apt install it wherever I want.
RHEL. Quite often 5.x still. They wouldn't let us ship CentOS to them. It was brutal
Ah yes! You win. ;-)
Containers are built into Windows Server 2016, and they're powered by Docker. (or Hyper-V for full-VM containers).
Ops wants predictable infrastructure. Customers want predicable applications.
Unless cloud providers offer the ability to run Windows containers directly on the cloud platform, there are no efficiency gains.
Lots of customers want multi-cloud capability: they want to be able, relatively easily, to push their apps to a Cloud Foundry instance that's in a public IaaS or a private IaaS. They want to be able to choose which apps go where, or have the flexibility to keep baseload computing on-prem and spin up extra capacity in a public IaaS when necessary.
It also happens that lots of CIOs have painful lockins to commercial RDBMSes, and they don't want to repeat the experience. They want to avoid being locked into AWS, or Azure, or GCP, or vSphere, or even OpenStack.
CF is designed to achieve all of these. The official GCP writeup for Cloud Foundry literally says "Developers will not be able to differentiate which infrastructure provider their applications are running in..." (can't say I completely agree, GCP's networking is pretty fast, which is I guess what happens when you have your own transoceanic fibre).
If I push an app to PCFDev -- a Cloud Foundry on my laptop -- it will run the same way on a Cloud Foundry running on AWS, GCP, Azure, vSphere, OpenStack and RackHD.
I have moved dockerized apps from public-cloud CF to on-prem CF. It all worked because the apps were stateless to begin with. Pivotal appropriating all the merit for the easy migration of stateless apps is dishonest at best.
Stateless apps are easy to move. They can be moved from CF to Kubernetes to ECS to Heroku.
Now, let's talk about the state, because that's where the real problem and the lock-in might be?
State is hard, state has momentum. We do what everyone else does: we push it out into a separate parallel space. Heroku do that with their PostgreSQL and Kafka services. We do it with services. IBM, Amazon, Google, Microsoft all do it with services.
Cloud Foundry is agnostic to the services. More agnostic if you use BOSH. Less agnostic if you use a non-BOSH service (like RDS).
Heroku isn't. You're married to their PostgreSQL unless you want to build your own or switch to RDS.
Kubernetes has more of an emphasis on attaching and detaching volumes. I can do that with BOSH; in future it'll be an app-level feature too.
(edited to remove unnecessary grump)
In terms of quitting Cloud Foundry entirely, you have two ways out.
The first is that you used buildpacks to stage your apps. In that case you stop pushing sourcecode to CF and replace it with whatever build pipeline makes sense for the replacement platform.
The second is that you decided to push container images instead of sourcecode. In that case you push to the replacement platform instead.
That said, you'll need to recreate the other niceties: orchestration, routing, log collection, service injection and service brokers, self-healing infrastructure and a bunch of other bits and bobs.
All the IaaSes have APIs for fundamental components like compute, storage and network. BOSH provides a relatively uniform layer for them, which is how Cloud Foundry manages to get around to so many platforms.
Is anyone else noticing that bare metal is getting unbelievably cheap? OVH is well known but there are now many vendors also offering crazy cheap bare metal hosting with <1hr setup times, etc. Compared to the popular virtualized cloud vendors it's a steal, especially when it comes to CPU, RAM, and storage. (Bandwidth tends to be about the same.)
If you do that with an IaaS, you're nuts. Your static assets (which typically represent the bulk of bytes served) should be served from a blobstore or a CDN, which are priced dramatically cheaper.
If your mp4s are served from an EC2 instance instead of S3 or CloudFront, for example, you're keeping warm by setting $100 bills on fire.
"Fair Use Policy
Use as much bandwidth as required: With all dedicated servers from serverloft, you receive a traffic flat rate free of charge. Here our Fair Use Policy applies, which requires a fair usage of all our resources as well as a reasonable traffic usage.
Therefore, in case the bandwidth exceeds 50 TB on a 1 Gbit/s server during the current billing month, we will decrease the bandwidth to 100 Mbit/s.
You will be able to monitor you current usage in the customer panel. We will also inform you via e-mail should you approach the 50 TB maximum usage. In this case you can purchase additional bandwidth for $10 per TB in order to extend your traffic usage."
So, $99 for the server and 50 TB a month included. Now, say you want to run your 1 Gbit/s connection at 50% of max: 1 Gbit/s ÷ 8 × 30 × 24 × 60 × 60 s ÷ 2 ≈ 160 TB, i.e. an additional commitment of 110 TB × $10 = $1,100, plus the $99, for a grand total of about $1,200 a month. How much transfer does that get you out of S3? First off, 160 TB out of S3 each month clocks in at $13,227 for just the bandwidth, not including any storage. For the price of 160 TB on dedicated, you get approximately 12 TB out of S3 (again, without storage).
120 TB out from CloudFront? $25,693/month.
Now, one low-end box might not be enough to keep everything going, but notice the price difference here. If you bought four "low-end" boxes you'd get 200 TB of bandwidth included, and still be paying a lot less than for just the bandwidth on S3. It might even be cheaper to rent dedicated boxes set up as caching proxies than to serve your data from S3/CloudFront...
Now, I'm not saying the cloud doesn't have merit, but just pointing out that the price difference is real.
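The arithmetic above can be sketched as a small cost model (the prices are the thread's examples, not current list prices, and the function name is illustrative):

```python
def dedicated_monthly_cost(base_usd=99, included_tb=50, overage_per_tb=10,
                           gbps=1, utilization=0.5, days=30):
    """Monthly transfer (TB) and cost (USD) for a flat-rate dedicated server.

    Converts link speed in Gbit/s to decimal terabytes transferred over a
    month at the given utilization, then adds per-TB overage past the quota.
    """
    seconds = days * 24 * 3600
    tb = gbps / 8 * utilization * seconds / 1000  # Gbit/s -> GB/s, GB -> TB
    overage = max(0, tb - included_tb) * overage_per_tb
    return tb, base_usd + overage

tb, cost = dedicated_monthly_cost()
print(f"{tb:.0f} TB transferred for ${cost:.0f}/month")  # 162 TB for $1219/month
```

At 50% utilization that works out to roughly the 160 TB and ~$1,200/month in the comment, versus five figures for the same egress from S3.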
If all you need or want is hardware, power, and bandwidth, I would say you should definitely go dedicated or colo. If your business runs on the value-added services, then you don't really have that choice.
While our hardware, ops team, and premises costs are a little lower than high-volume EC2 at the same scale, the real win comes at the network. High-quality transit through multiple redundant links, high-capacity routers, and our own BGP are all much lower cost and more flexible than what we can get from EC2. It's not even in the same ballpark!
(Also, we put SSDs into our databases years before they were usable in EC2, so there was a win there, too!)
Finally, we're hiring into our on prem team :-)
CI is basically impossible to accomplish.
It is hilarious discovering a bug after both of us (we and the customer) have approved going live with the last version.
The most exotic technology we have implemented is a CDN to speed up some component's delivery. And... In some cases we did not yet have authorization for that.
Honestly... on-prem is perfect to preserve your job... but man...
We've got a solid CI/CD pipeline on-prem using "old school" tools like VMware.
Amazon, Google or others take care of that, unless regulations require you to be on-prem.
It never really had anything to do with the lack of software or management tools. Not to take away from Kubernetes, but VMware, Cisco, Microsoft and other enterprise vendors have always had this market sealed up with relatively sophisticated capabilities.
Things like vCenter seem far ahead in terms of management, capabilities, feature set and policy, and are fine-tuned for enterprise use. Just take distributed storage as an example, where options like vSAN, Nutanix or ScaleIO do not really have any open-source counterparts. But this is not surprising given the billions of dollars spent in the enterprise.
Perhaps it is an overly cautious interpretation of the law, I don't know.
Having worked at Amazon for 5 years, I saw how to cut costs on networking and hardware and to build only as much as you actually need. My experience after working at Amazon is that most Enterprise IT is full of people who hide behind Best Practices in order to expand their budget.
You can very easily wind up with massive amounts of spending on VMware, EMC and Cisco. Unless you have an executive with a vision of how to contain costs on networking, storage and servers, I wouldn't recommend on-prem; you can easily wind up spending an order of magnitude more than you should be.
Forget about the huge issues in the support organization for a second: the impact on-prem has on your release cycle has consequences that are hard to fully grasp. So much for "continuous development and release" if you have to keep supporting old versions of software for a year.
Build your stack so that you can easily migrate clouds (i.e. don't use all the super high-level AWS APIs). It's a good idea in general, and it should make going on-prem doable enough if you are worried about having that option at all.
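One hedged sketch of what "don't use all the super high-level AWS APIs" can mean in practice: hide object storage behind a small interface of your own, so all cloud-specific code lives in one adapter. The `BlobStore`/`InMemoryStore` names here are illustrative, and the S3 adapter is only described, not implemented:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """The only storage API the application is allowed to see."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """On-prem/test backend: a dict standing in for local disk or Ceph."""
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

# An S3-backed adapter would subclass BlobStore the same way and be the
# single module that imports boto3; migrating clouds (or going on-prem)
# then means swapping one adapter, not touching application code.

store: BlobStore = InMemoryStore()
store.put("reports/2017.csv", b"a,b\n1,2\n")
```

The same pattern applies to queues, email, and secrets: keep the provider-specific surface area to a handful of adapters you could rewrite in a week.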
Now, the story time!
When I was at Rackspace, I was trying to analyze the top reasons our startup customers would stop using some of our SaaS offerings. The most common one, unsurprisingly, was they'd run out of business. But another top one was "they got successful". As they got bigger and more successful (can't mention names) they'd bring more and more in-house, eventually getting to a point that the only products they were interested in were just servers and bandwidth.
So it depends. Some things just don't work well as SaaS, especially for customers with money who can't scale their IT fast enough.
The smallest on-prem customers run on a single server. We have multiple separate web applications, and a couple separate backend services. In AWS, we have each separate web app/backend service on its own set of auto-scaling EC2 instances, and have ELB+haproxy and some scripts to manage everything. Each customer (100s) get their own DB, but the DBs are all hosted across a few actual DB servers.
We use the same codebase for on-prem and AWS, and there's really not much difference in operation: as far as the applications are concerned, on-prem is just the same multi-tenant environment that happens to have a single tenant.
However, I would absolutely go all-cloud if I could. Two major reasons: AWS/cloud services, and database.
The first point is easy: we just can't really use any AWS or other 3rd party SaaS stuff ourselves, other than for our own support infrastructure. The app has to deploy to a self-contained on-prem environment. This is really both a blessing and a curse. On one hand, we have to re-invent some things that could otherwise be off-the-shelf from AWS or other offerings. On the other hand, while switching to another cloud would not be trivial, there's nothing in our core application code that relies on AWS at all.
This also makes our technology stack a bit constrained: we could benefit immensely from a time-series database, but instead we use a RDBMS (MS SQL Server) because, well, we have enough trouble with on-prem administering the database as it is.
The second point -- database -- is the cause of so, so much frustration. Very few people can run databases well. It's the same old story: when you're small, it's no problem, but as you grow the DB performance becomes critical (for reference, while our smaller customers have DBs that are 1-2GB, dozens of our customers have DBs in the 100s of GB, and a couple are over 1TB, and typically there's about 15% inbound data per day, though after processing not that much gets stored). I/O speed is always an issue.. we've had customers trying to run with their I/O on a NAS that's also shared with a dozen other applications ("We spent 100k on our NAS and the sales guy said it'll do everything, so it must be your application that's slow"). I have heard our customers say things like "But I can't make a full database backup, I don't have enough disk space".
There's also the inevitable "what hardware do I need to buy?" question, that even our own sales team is frustrated by. Overspec, and they are upset at cost, or -- probably worse -- are frustrated you made them spend all that money when their machine sits mostly idle. Underspec, and it's now your problem that it doesn't work, because they bought what you told them. It took me months before I convinced even our own team the best we can do is provide a couple examples of hardware setups, and caveat the heck out of it ("with this many devices monitored, with an average of this many parameters, at this frequency, with this many alerts configured, this many concurrent users," etc).
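The "caveat the heck out of it" sizing answer above is essentially arithmetic over those parameters. A sketch of what such an estimator could look like; every figure here (bytes per sample, the example workload) is a hypothetical placeholder, not the vendor's real numbers:

```python
def daily_ingest_gb(devices, params_per_device, samples_per_param_per_day,
                    bytes_per_sample=64):
    """Rough daily inbound data volume (decimal GB) for a monitoring workload."""
    samples = devices * params_per_device * samples_per_param_per_day
    return samples * bytes_per_sample / 1e9

# e.g. 5,000 monitored devices, 20 parameters each, sampled once a minute:
gb = daily_ingest_gb(5000, 20, 24 * 60)
print(f"~{gb:.1f} GB/day inbound")  # ~9.2 GB/day inbound
```

Even a toy model like this makes the caveats concrete: double the sampling frequency or the device count and the customer can see the storage and I/O requirements double with it.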
It can be hard enough to isolate the performance problems under load when you operate the whole stack -- it's incredibly difficult when you can only approximate a customer's setup, and even then, you have to somehow convince them their DB config or I/O is inadequate and then need to spend money to fix it.
> Economies of scale: some customers have the capacity and ability to acquire and operate compute resources at much cheaper rates than cloud providers offer.
AWS is probably spending a billion dollars per year or more on hardware... they get special things that are not available to the public like chips and hard drives. You really can't compete with that.
Even going from EC2 to classical dedicated servers (OVH, Hetzner, etc.) will get you about an order of magnitude in cost reduction.
In terms of why cloud services are becoming popular with their users: they make things cheaper in the short term, and are an easier sell to management than on-premises resources. They're also useful for putting together solutions when you work at a company with overly restrictive local computing resources. However, if you're looking to save money in the medium to long term, and you can find people you trust with server administration, then you're better off on-premises (perhaps keeping only a cloud-hosted load balancer if you want to guard against unexpected spikes in network traffic).
Our software also requires that time be synced. We recently started to actually check whether people's time was synced (using the requisite syscalls), and it quickly became our #1 support issue for a little while. Customer environments are far too uncontrolled to simply be tamed by a system such as Kubernetes.
I think that if you can avoid on-prem software and use XaaS, you should. You probably don't need to run your own datacenters, databases, etc.. because it's unlikely that you're better than GOOG, AMZN, Rackspace, DigitalOcean, and others. It's very unlikely that your application will be running at sufficient economies of scale to benefit from the kind of work Amazon, and others have done to run datacenters efficiently at scale. Not only this, but Google has figured out how to run millions of servers.
Although GOOG / EC2 tend to make hardware available (NVMe, SSDs, etc.) well after the market releases it, compared to enterprise hardware cycles it's lightning fast. We still have customers who run our system on 15K SAS disks and prefer that over SSDs, even though we recommend SSDs. On the other hand, if you control the environment, rather than spending days, if not months, making your application more I/O efficient, you can simply spend a few more cents an hour and get 1,000 more IOPS.
I have a technique (Checmate) to make containers 20-30% faster and expose other significant features. Unfortunately, we have customers who run Docker 1.12 and the latest version of our software, want to use containers, and yet want to use a kernel no newer than a third of a decade old. I literally have spent months making code work on old kernels, at significant performance and morale cost. If we controlled the kernel, this wouldn't be a problem, and it's yet another problem that K8s cannot solve.
Lastly, networking on-prem tends to be an afterthought. IPv6 would make many container networking woes go away nearly immediately. Full bisection bandwidth could allow for location-oblivious scheduling, making the likes of K8s and Mesos significantly simpler. BGP-enabled ToRs give you unparalleled control. Unfortunately, I have yet to see a customer environment with any of these features.
I really hope the world doesn't become more "on-prem".
Going on-premise might have some advantages, but it comes with completely different problems that you didn't even think about. Some of them:
* You thought of all the power redundancy, ideal cooling, humidity, etc... but then your office gets robbed and all your computers are stolen.
* Network wiring... some people are lousy at it: an entire spaghetti mess of cables, a crappier type of cable used somewhere, crosstalk from other equipment, or someone stepped on a cable and damaged it. Which one is it? Good luck finding out.
* Hardware fails, power supplies fail, disks fail, everything can fail.
You can pick these battles, or you can focus on your software based service. I suggest the latter.