Disclosure: I work on Google Cloud (and really care about this). The challenge h...

alfalfasprout · on Feb 18, 2020

I mean, this mentality often is wrong. Scaling out actually isn't the right solution for everyone. It works for Google given that primarily web services are offered. It does not work for workloads that heavily rely on the CPU (think financial workloads, ML, HPC/scientific workloads) or have realtime requirements. In fact, for many ETL workloads vertical scaling proves far more efficient.

It's long been the "google way" to try and abstract out compute but it's led to an industry full of people trying to follow in their way and overcomplicating what can be solved on one or two machines.

cm2187 · on Feb 19, 2020

And even when it could make sense, the cost of redeveloping a system, validating its logic, ensuring it is production ready is often prohibitive compare to the cost of a bigger server (and a risk a business may not be willing to take).

erulabs · on Feb 18, 2020

Except, almost without exception, eventually the one or two machines will fall over. Ideally you can engineer your way around this ahead of time - but not always. Fundamentally relying on a few specific things (or people) will always be an existential risk to a big firm. Absolutely agree re: start small - but the problem with “scale out” is a lack of good tooling - not a fundamental philosophical one.

alfalfasprout · on Feb 19, 2020

It is a philosophical one when you design around scaling out at a high rate. You incur significant additional complexity in many cases along with increased overhead.

It's fallacious to think that relying on "n" things is strictly safer than "3" things where n is large. That's not quite true due to the significant complexity increases when dealing with large "n" and accompanying overhead.

For web applications (which I suspect the majority of HN readers work on) then sure, but plenty of realtime or safety critical applications are perfectly ok with three-way redundancy.

Bad_CRC · on Feb 19, 2020

I work a lot with voip systems and it's much much easier to have one big machine that trying to make it work distributed...

PudgePacket · on Feb 19, 2020

I've heard lots of anecdotes from big sites doing fine with a small number of machines.

eg stackoverflow only has 1 active DB server, and 1 backup.

https://stackexchange.com/performance

endymi0n · on Feb 19, 2020

Actually, this makes a lot of sense. Reasoning about a single machine is just way simpler and keeps the full power of a modern transactional database at your fingertips. Backups keep taking longer and disaster recovery isn‘t as speedy anymore, but we‘re running on Postgres internally as well and I‘d scale that server as big as possible (even slightly beyond linear cost growth, which is pretty huge these days) before even thinking about alternatives.

aidenn0 · on Feb 18, 2020

Plenty of services can deal with X hours of downtime when a single machine fails for values of X that are longer than it takes to restore to a new machine from backups.

chucky_z · on Feb 19, 2020

I'd like to add to this and say that a server being down for 6 hours, that if over the life of its uptime (months? years?) saves uncountable number of hours on computations and complexity, is so worth it.

Heck, even a machine like that being down for a week is usually still worth it.

darkwater · on Feb 19, 2020

I do not agree, and it is not my experience.Mind you that I've always worked in small/mid-sized businesses (50-300 employees) and basically every service has someone needing it for their daily work. Sure, they may live without it for some times, but you will make their lives more miserable.

And anyway if you already have all in place to completely rebuild every SPOF machine from scratch in a few hours, go the extra mile and make it an active/passive cluster, even manually switched, and make the downtime a minutes thing.

aidenn0 · on Feb 20, 2020

A small amount of work over a long period of time (i.e. setting up a redundant system) may be worse than losing a large amount of work in a short period of time.

Single machines just don't fail that often. I managed a database server for an internal tool and the machine failed once in about 10 years. It was commodity hardware, so I just restored the backups to a spare workstation and it was back up in less than 2 hours. 15 people used this service and they could get some work done, without it, so there was less than 30 person-hours of productivity lost. If I spent 30 hours getting failover &c. working for this system over a 10 year period, it would have been more hours lost for the company than the failure caused.

lmeyerov · on Feb 19, 2020

Totally. We could replace our GPU stack with who knows how many CPUs to hit the same 20ms SLAs, and we'll just pretend data transfer overhead doesn't exist ;-)

More seriously, we're adding multi-node stuff for isolation and multi-GPU for performance. Both are quite different... and useful!

mrich · on Feb 19, 2020

The solution often is to have a warm standby that can take over immediately. You do not get any distributed overhead that is present in a fully load-balanced system during normal operation, and only pay a small amount in the very exceptional failure case.

nemothekid · on Feb 19, 2020

Google doesn’t have ML workloads (Basically all of search) or real-time (basically all of RTB) requirements?

I agree not everyone can develop like Google, but it’s wrong to say that “it doesn’t work”

mdasen · on Feb 18, 2020

This makes a lot of sense, but it doesn't explain why the pricing isn't consistent. Why is an N1 the same price as an N2, except for sustained-use? Why is an E2 cheaper than an N1/N2D, except for sustained-use?

E2 is just such an amazing idea that feels like it's going to be under-utilized because it isn't cheaper for the sustained-use case. There doesn't seem to be any reason why E2 would be more expensive (to Google) for sustained-use and not for on-demand or committed.

Google Cloud is really nice, but the inconsistent pricing/discounting between the different types seems odd. Like, I'm running something on N1 right now with sustained-use because there's no incentive for me to switch to E2. It feels a bit wasteful since it doesn't get a lot of traffic and would be the perfect VM to steal resources from. However, I'd only get a discount if I did a 1-year commitment. For Google, I'm tying up resources you could put to better use. E2 instances are usually 30% cheaper which would give me a nice incentive to switch to them, but without the sustained-use discount, N2D and N1 instances become the same price. So, I end up tying up hardware that could be used more efficiently.

throwaway2048 · on Feb 19, 2020

Pricing confusion is a cornerstone of big single shop vendors, the more confusing you make pricing, the more chances a customer is going to spend more than they otherwise might.

Also opens avenues for highly paid consultants to dip their beak and promote your products.

milesward · on Feb 18, 2020

We need some kinda shortcut: like, run your app for a few days on an instance, we chew your stackdriver metrics, we make a new shortcut n3-mybestinstance, which picks the right shape/processor family etc for yah.

elithrar · on Feb 18, 2020

As a Googler: take VM rightsizing recommendations - "save $X because you're underutilizing this machine shape" - and extend them to encompass this by including VM-family swaps based on underlying VM metrics? :)

cm2187 · on Feb 19, 2020

I presume the downside of a fragmented offering is that it is easier to run out of stock of a particular type of configuration. Does that happen much on any of the major clouds? Ie if we script the provisioning of a new VM, is it something to watch for or is it a really rare event?

merb · on Feb 19, 2020

well the biggest problem is that commited usage discount is not replaceable with the cheaper e2 choices. btw. when we commited our usage to n1, e2 was not available.