I won't speak to the benefits of Azure versus AWS versus Google Cloud, but Azure's sales model is a bit different.
If you're looking at Azure and reading HN, my guess is you're probably a startup, in which case you want to look into Microsoft BizSpark (https://www.microsoft.com/bizspark). This drastically changes Azure's pricing. If you're a larger company, Azure is all about enterprise sales and bundling. If you buy multiple Azure services along with Windows and O365, the Azure part is dirt cheap.
And it requires that your registrar not attempt a malicious attack against you. But at that point, most registrars already have access to a root CA and to your domain - they could easily break your environment anyway.
This seems wrong to me. This was the state of the art ~3 years ago. Now, I feel like all of the machines should already be provisioned with an OS and a basic image, and an orchestration system like CoreOS / Mesos / Docker should specialize them.
IMHO, requiring specific hardware, or the entire machine, should be the exception, not the rule.
Sometimes you have a workload that really isn't a fit for virtualization/containers/whatever the latest Rails hotness is, at all, and you just need to throw a couple of cargo trailers of insanely massively-spec'd servers at the problem. In those cases, your 'old school' server provisioning toolkit had better be on-point.
It's easy to forget just how ridiculously powerful bare iron is these days. Go to Dell.com and see how much RAM you can cram into a U or three or four today in 2015. Or see how many IOPS a modern NetApp or Symmetrix (EMC) can push with 'flashcache' or million-dollar SSDs. It is ridiculous, and while a lot of those platforms are meant for 'building your own private cloud', etc, there's a non-trivial amount of workloads/projects where bare-iron is the best tool for the job.
Containers are just a namespacing tool, though; you're still running on bare metal (well, bare Linux). Docker in particular defaults to AUFS for its layered filesystem, which is slow, but other containerization tools just use a chroot.
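To make the "just a namespacing tool" point concrete: every Linux process already runs inside a set of kernel namespaces, visible under /proc - no Docker required. A minimal sketch (Linux-only; it only reads the current process's namespace IDs):

```python
"""Show that 'containers' are built on plain kernel namespaces:
each entry in /proc/<pid>/ns is a symlink naming the namespace the
process belongs to. A container runtime just creates new ones."""
import os

def namespace_ids(pid: str = "self") -> dict:
    """Return {namespace_name: namespace_id} for a process."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

for name, ident in namespace_ids().items():
    print(f"{name:16s} {ident}")   # e.g. pid  pid:[4026531836]
```

Two processes in the same "container" share these IDs; a containerized process gets fresh ones for pid, net, mnt, and so on, while still running directly on the host kernel.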
A lot of other latency-sensitive applications tend to hit so many adverse performance conditions under virtualization (conditions that can usually be remediated with a lot of blood, sweat, and tears) that it becomes easier to just go bare metal and deal with the overhead of physical infrastructure.
Even if you were going to run CoreOS or Mesos on the machine, you'd still want to manage it booting your specific image, which you can change, rather than trusting the pre-installed Dell version and managing that relationship.
Now there's probably some room for debate on whether these guys' jobs should just be outsourced to Amazon, but GitHub has pretty good uptime and they seem to know what they're doing, so they've probably already won that debate.
Just because you use Containers/VMs for most of your apps doesn't mean that the lower levels don't need attention: installing OSes in the first place, hardware testing (both initially and to identify defects later), ...
And for important fileservers and databases you're going to run on specific hardware for a long time.
For whatever it's worth, SmartDataCenter, Joyent's open source SmartOS-based system for operating a cloud, does exactly this -- and (as explained in an old but still technically valid video) we have found it to be a huge, huge win. And we even made good on the Docker piece when we used SDC as the engine behind Triton -- and have found it all to be every bit as potent as you suggest!
The big issue I'm having with that is that it involves trusting vendors to get network boot right, especially the "loop until DHCP gets a response" part. One of the cheap vendors tries 30 times and then drops to a boot-failed screen after trying the disk.
Also, about 1 network boot in 4,000-5,000 fails. Not sure why.
That's where iLO comes in. iLO is horrible, but you can ssh to it and set all manner of stuff.
When we didn't have PXE, we had a script that told iLO to boot from CD, and that the CD was located at http://something/bootme.iso. iLO would always have network access, and would magically pass the .iso to the server as the device to boot from.
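A sketch of what such a script looks like. The iLO command names below (`vm cdrom insert`, `vm cdrom set boot_once`, `power reset`) are from memory of the iLO 2/3 SSH CLI and may differ across iLO generations, and the host/URL are made up - treat this as an illustration, not a reference:

```python
"""Boot a server from an HTTP-hosted ISO via iLO's SSH CLI (sketch).
dry_run=True only prints the ssh commands instead of executing them."""
import subprocess

def ilo_boot_from_iso(ilo_host: str, iso_url: str, dry_run: bool = True):
    commands = [
        f"vm cdrom insert {iso_url}",   # attach the ISO as virtual media
        "vm cdrom set boot_once",       # boot from it on the next reset only
        "power reset",                  # power-cycle the server
    ]
    for cmd in commands:
        if dry_run:
            print(f"ssh admin@{ilo_host} {cmd}")
        else:
            subprocess.run(["ssh", f"admin@{ilo_host}", cmd], check=True)

# hypothetical iLO hostname and boot server
ilo_boot_from_iso("ilo-rack1-u07.example.com", "http://boot.example.com/bootme.iso")
```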
We buy the cheapest server that meets our needs, and buy it in somewhat larger quantities (often double what was originally envisioned for less than was originally budgeted). Much more efficient.
But it does mean no IPMI. However, I built a small circuit that sits on a power cable and can interrupt it with a relay, driven by a bus plugged into our server, so we can do the reboot thing.
I've been meaning to redo that power cable circuit using wifi as the linking technology, now that we have esp8266 available.
GitHub's physical infrastructure team doesn't dictate what technologies our engineers can run on our hardware. We are interested in providing reliable server resources in an easily consumable manner. If someone wants to provision hardware to run docker containers or similar, that's great!
We may eventually offer higher order infrastructure or platform services internally, but it's not our current focus.
It makes no sense to compare PostgreSQL (single node) with Riak, Cassandra etc (multiple nodes). And distributed generally means distributing your data amongst multiple nodes. I don't expect to see a Call Me Maybe test for Chrome even though by your definition it is distributed.
The reason most cloud services seem to work is economies of scale. Google Compute Engine will only ever represent a small percentage of Google's total server utilization, and everything they've learned serving Google.com enables them to serve GCE better.
Backblaze had their backup business, which enabled them to figure out storage and buy a ton of it. DigitalOcean is really a Linode competitor, not an EC2 / GCE competitor. I wouldn't call it "cloud."
For virtual machines, whether economies of scale matter depends on how you define "cloud": by minimum billing increments (hourly or per-minute), or by service expectation (the provider doesn't work especially hard to keep your machine up on a given server, but automatically brings it back up when the underlying hardware fails). If you define it by minimum time of compute, you need enough spare capacity to handle highly variable demand, plus the resources to absorb the cost of hardware sitting idle.
If you own your own datacenter and the machines are physical, you can turn off the power and not pay for it, so as long as the initial investment is paid off, idle hardware doesn't hurt too much financially. Without your own datacenter, the cost is fixed regardless of whether the servers are in use, which is why we (prgmr.com) do not plan to offer hourly pricing any time in the foreseeable future. Someone with a larger user base can also negotiate better rates, so idle machines don't hurt as much.
DO is currently "cloud" based on pricing, not necessarily on how it provides service globally, as at least some subset of VMs are subject to routine maintenance or downtime. But bringing a server up almost immediately on another machine after a hardware failure is a more tractable problem, and it is a service we eventually intend to offer. Xen has a feature called Remus (http://wiki.xen.org/wiki/Remus) which effectively does continual live migration - that would be pretty cool to implement, though support on Linux is not yet mainlined.
- Until proven otherwise, I assume that the demand for VMs in clouds follows the general "internet load curve" (peak demand = 1.5x avg demand; plateau during the day; peak at 19:00). With normal monthly billing you just see the variable load on the host node, and the host must have spare capacity to handle the peak load. With hourly billing, the peak load will not change; what changes is that your customers spin down VMs at non-peak times.
So basically my point is that hourly vs. monthly billing doesn't change anything demand/load-wise. The only difference is the billing: you basically have to recoup the lost revenue from non-peak times with generally higher hourly pricing.
- If you colocate your (few) servers, you can also shut them down to save energy - provided you have a contract with usage-based power billing.
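The "recoup via higher hourly pricing" point can be put in back-of-envelope numbers. This sketch idealizes the comment's own assumption (peak = 1.5x average demand, evening peak) into a crude 24-hour curve; the specific figures are invented for illustration:

```python
"""Capacity must be sized for peak demand under either billing model;
hourly billing only shrinks the number of billable VM-hours. The
break-even hourly price is therefore higher than the flat monthly rate."""

# hypothetical VM demand per hour: 90 VMs off-peak, 150 VMs from 18:00-21:59
demand = [90] * 18 + [150] * 4 + [90] * 2      # 24 hourly samples
peak = max(demand)                              # capacity you must provision

provisioned_vm_hours = peak * 24                # hardware kept powered/ready
billed_vm_hours = sum(demand)                   # hourly billing: only active VMs pay

# multiplier on the flat per-hour rate needed to earn the same revenue
break_even_multiplier = provisioned_vm_hours / billed_vm_hours
print(f"hourly price must be ~{break_even_multiplier:.2f}x the flat rate")
```

With this curve (average 100, peak 150), the provider has to charge roughly 1.5x the monthly-equivalent hourly rate just to break even on the same hardware.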
1. IPv6 completely removes the header checksum you used to have in IPv4. So now there are just the Ethernet FCS and the TCP checksum. You should use IPv6. If you're not using IPv6, you're only hurting yourself and the rest of the internet.
2. Just transport everything over SSL, please. With AES-NI, the overhead of encrypting data is so tiny that it's easier to just let someone else solve this problem.
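On point 1: the checksum TCP (and UDP) still carry is the Internet ones'-complement checksum of RFC 1071 - what IPv6 dropped is the per-hop IPv4 *header* checksum. A minimal sketch of the computation:

```python
"""RFC 1071 Internet checksum: ones'-complement sum of 16-bit words,
carries folded back in, final result complemented."""

def inet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"              # pad odd-length data with a zero byte
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:               # fold the carry bits back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# worked example from RFC 1071: words 0001 f203 f4f5 f6f7 -> checksum 220d
print(hex(inet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))
```

A receiver verifies by summing the data *including* the transmitted checksum; a correct packet complements to zero. This is the only end-to-end integrity check left above the Ethernet FCS once the IPv4 header checksum is gone.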
SSL still has significant overhead. Netflix did a bunch of work, and they still can only push 10 Gbps out of a box that used to be able to push 40 Gbps (quad-port 10G NIC). 1/4th the throughput seems like a lot of overhead; and I'm a mere mortal who can't put TLS into my kernel.
It's not store-and-forward vs. cut-through; it's whether the switch acts as a layer 3 device or a layer 2 device. As a plain old layer 2 device, it can pass the packet along unmodified. As a layer 3 device, it rewrites the layer 2 headers and decrements the TTL. And as a layer 3 device, it can still cut through.
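The L2/L3 distinction can be sketched in a few lines. This is a toy model with simplified fields, not a real packet parser, and it omits the IPv4 header-checksum patch a real router also performs:

```python
"""L2 bridging forwards the frame untouched; an L3 hop rewrites the
source/destination MACs and decrements the TTL (hop limit)."""
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    ttl: int
    payload: bytes

def l2_forward(frame: Frame) -> Frame:
    return frame                         # bridged: bits pass through unmodified

def l3_forward(frame: Frame, router_mac: str, next_hop_mac: str) -> Frame:
    if frame.ttl <= 1:
        raise ValueError("TTL expired; a router would send ICMP Time Exceeded")
    return replace(frame,
                   src_mac=router_mac,   # new layer 2 header...
                   dst_mac=next_hop_mac,
                   ttl=frame.ttl - 1)    # ...and TTL decremented
```

Neither function needs the payload, which is why an L3 device can still start transmitting (cut through) before the whole packet has arrived: everything it rewrites sits in the headers at the front.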