If you pay for Google services, they have an SLA (service level agreement) of 99.9% [1]. If their services are down for more than 43 minutes this month [2], you can request “free days” of usage.
Edit:
Services were down from ~12:55pm to ~1:52pm, which is 57 minutes. Thanks hiby007
If all 6 million G Suite customers, averaging 25 users per account and paying the $20/user fee, requested the three-day credit for this breach of the SLA contract, it'd cost Google about $300 million.
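For anyone who wants to sanity-check those numbers, here's a minimal Python sketch using the figures assumed in the comments above (6 million customers, 25 users per account, $20/user, a 30-day month) rather than any official Google data:

    # Rough back-of-the-envelope check of the SLA figures quoted above.
    # All inputs are the assumed numbers from the comments, not official data.

    def allowed_downtime_minutes(sla: float, days_in_month: int = 30) -> float:
        """Monthly downtime budget for a given SLA fraction, e.g. 0.999."""
        return days_in_month * 24 * 60 * (1 - sla)

    print(f"99.9%  SLA -> {allowed_downtime_minutes(0.999):.1f} min/month")   # ~43.2
    print(f"99.99% SLA -> {allowed_downtime_minutes(0.9999):.2f} min/month")  # ~4.32

    # Credit-cost estimate: 6M customers * 25 users * $20/user is ~$3B of
    # monthly revenue; a 3-day credit is 3/30 of that, i.e. ~$300M.
    customers, users_per_account, fee_per_user = 6_000_000, 25, 20
    monthly_revenue = customers * users_per_account * fee_per_user
    print(f"3-day credit across all accounts: ~${monthly_revenue * 3 / 30 / 1e6:.0f}M")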
There are companies that build crawlers or health-check agents for exactly this purpose, so they know precisely when a service they pay for, or depend on for their business, stopped working and fell out of SLA. I think it's brilliant, and the only way to make any company pay.
I believe you can sometimes get away with a couple of Pingdom checks/jobs.
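An agent like that doesn't have to be elaborate. A minimal sketch, with a placeholder URL and interval, naive single-location polling and no retries or alerting (a real agent would probe from several networks):

    # Minimal availability logger: poll an endpoint and record state changes,
    # so you have your own timestamps when claiming against an SLA.
    import datetime
    import time
    import requests

    URL = "https://mail.google.com/mail/u/0/"   # placeholder endpoint to watch
    INTERVAL = 30                               # seconds between probes

    def is_up(url: str) -> bool:
        try:
            return requests.get(url, timeout=10).status_code < 500
        except requests.RequestException:
            return False

    last_state = None
    while True:
        state = is_up(URL)
        if state != last_state:
            stamp = datetime.datetime.utcnow().isoformat()
            print(f"{stamp}Z  {'UP' if state else 'DOWN'}  {URL}")
            last_state = state
        time.sleep(INTERVAL)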
This topic came up recently on a podcast I was on, where someone described how a large service went down for days and the outage tanked his entire business. He was compensated in hosting credits for the exact amount of downtime of the one service that caused the issue. It took so long to resolve because it took support a while to figure out it was their service, not his site.
So then I jokingly responded with that being like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".
If your SLA agreement is only for precisely calculated credits, that's not really going to help in the grand scheme of things.
IANAL, but I negotiate a lot of enterprise SaaS agreements. When considering the SLA, it is important to remember it is a legal document, not an engineering one. It has engineering impact and is up to engineering to satisfy, but the actual contents of it are better considered when wearing your lawyer hat, not your engineering one.
e.g., What you're referring to is related to the limitation of liability clauses and especially "special" or "consequential" damages -- a category of damages that are not 'direct' damages but secondary. [1]
Accepting _any_ liability for special or consequential damages is always a point of negotiation. As a service provider, you always try to avoid it because it is so hard to estimate the magnitude, and thus judge how much insurance coverage you need.
Related, those paragraphs also contain a limitation of liability clause, often capped at X times annual cost. It doesn't make much sense to sign up a client for $10k per year but accept $10M+ of liability exposure for them.
This is just scratching the surface -- there's tons of color and depth here, nuanced for every company and situation. It's why you employ attorneys!
I've never seen an SLA that compensates for anything more than credit off your bill. I can't imagine a service that pays for loss of productivity, one outage and the whole company could be bankrupt. If your business depends on a cloud service for productivity you should have a backup plan if that service goes down.
I haven't seen one (at least for a SaaS company) that will compensate for loss of productivity/revenue etc., but something like Slack's SLA [0] seems like it's moving in the right direction. They guarantee 99.99% uptime (a max downtime of 4 minutes 22 seconds per month) and give 10x credits for any downtime.
Granted, there probably aren't many businesses losing major revenue because Slack's down for half an hour, but it's nice to at least see them acknowledge that 1 minute down deserves more than 1 minute of refunds!
> I haven't seen one (at least for a SaaS company) that will compensate for loss of productivity/revenue
They won't show up on automated systems aimed at SMEs, but anybody taking out an "enterprise plan" with tailored pricing from a SaaS, will likely ask for tailored SLA conditions too (or rather should ask for them).
It's hard to compensate for profit loss, as you would have to know the customer's profit beforehand and price that risk in. It's almost like insurance!
Seems like you want insurance. As with the hospital bill you'd generally be paying a bunch of extra money for your health insurance plan to not get stuck with the bill.
Not sure that exists for businesses, but I'd expect you'd need to go shopping separately if you want that.
Seems like a good business idea if it doesn't exist.
I think the idea here is that if the payout for an SLA breach is just "don't pay for the time we were down" or (as I've seen in other SLAs) "half off during the time we were down", that's not much of an incentive for the service provider.
They have other incentives, obviously, like if everyone talks about how Google is down then that's bad for future business. But when thinking of SLAs I'm always surprised when they're not more drastic. Like "over 0.1% downtime: free service for a month".
Independent 'a service was down' insurance isn't the same though. It is important for the cost to come out of the provider's pocket, thus giving them a huge financial incentive to not be down. Having that incentive in place is the most important part of an SLA.
Even with insurance, some of the cost will come out of the provider's pockets - as increased premiums at renewal (or even immediately, in some cases). Insurers might also force other onerous conditions on the provider as a prerequisite for continued coverage.
I hear you, but there's going to be a cost for that. For the sake of argument, say Google changes the SLA as you wish and ups the cost of their offering accordingly.
> . . . like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".
It's even slightly worse than that. SLAs generally refund you for the prorated portion of your monthly fee the service was out, so it's more like "here's a gift card for the exact value of the single dish we've determined caused your food poisoning." Hehe.
Completely agree with your analogy, but have you ever seen an SLA that provides any additional liability? I haven't seen them - you are stuck either relying on the hosting service's SLA or DIY.
I cannot share details for the obvious reason, but yes - there are SLAs signed into contract directly behind the scenes which result in automatic payouts if a condition isn't met, and it's not a simple credit.
Enterprise level SLAs are crafted by lawyers in negotiations behind the scenes and are not the same as what you see on random public services. Our customers have them with us, and we have them with our vendors. Contract negotiations take months at the $$$$ level.
That is a fair point. Is this a situation where you are asymmetrically powerful? I have to imagine you would need considerable clout, representing a fair bit of their revenue, in order to dictate terms. When I wrote my comment it was in the vein of a smaller organization.
I am but a technical cog in the machine, my friend; while I know about what goes on in business and contract negotiations, I cannot comment on power dynamics. I would assume it's like any other negotiation - whoever has the greatest leverage has the power. I doubt it's ever fairly balanced.
Not OP, but how do you measure them? Let's say, for example, you can send and receive email, but attaching files does not work. Is the service up or down?
What if the majority of your users can access the service, but one of your BGP peers is not routing properly and some of your users are unable to access?
In answer to your question, they'll accept evidence from your own monitoring system when you claim on the SLA. They pair that up to their own knowledge about how the system was performing, then make the grant.
Google are exceptionally good at this, from my experience. Far better than most other companies, who aim to provide as little detail as possible while getting away with 'providing an SLA'.
In this example: you get free days. Which depending on your business might be worthless if you have suffered more monetary loss due to the downtime than the free days are worth.
Exactly; downtime doesn't cost a cloud service much. At worst it causes reputation damage, with possibly large companies deciding to go for a competitor, losing a contract worth tens or hundreds of millions.
Given the blast radius of this (all regions appear to be impacted) along with the fact that services that don't rely on auth are working as normal, it must be a global authN/Z issue. I do not envy Google engineers right now.
A few years ago I released a bug in production that prevented users from logging into our desktop app. It affected about 1k users before we found out and rolled back the release.
I still remember a very cold feeling in my belly, barely could sleep that night. It is difficult to imagine what the people responsible for this are feeling right now.
> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?
Agreed, and also it's worth noting that we're talking about companies here. Yes, for any individual the amount of money lost is insane, but that's the risk for the company. If one individual can accidentally nearly bankrupt the company, then the company did not have proper risk management in place.
That isn't to say that it wouldn't also affect my sleep quality.
Depending on the company's culture, that calculus can vary. If management start firing subordinates for making mistakes, then what should be done to management if they fail to account for human error, resulting in multimillion-dollar losses?
Welp, as a new grad there, I had brought down one very important database server on a Sunday night (a series of really unfortunate events). Multiple senior DBAs had to be involved to resuscitate it. It started functioning normally just a few hours before market open in HK. If it was any later, it would have been some serious monetary loss. Needless to say, I was sweating bullets. Couldn't eat anything the entire day lol. Took me like 2 days to calm down. And this was after I was fully shielded cuz I was a junior. God knows what would've happened if someone more experienced had done that.
I brought down the order management system at a large bank during the middle of the trading day. The backup kicked in after about a minute but it was not fun on the trading floor.
I'm so glad I'm not the only one feeling deployment anxiety. The project I'm involved in doesn't really have serious money involved, but when there's a regression found only after production deployment my stress levels go up a notch.
When I was working at a pretty big IT provider in the electronic banking sector, we (management and senior devs) made it an unspoken rule that:
- Juniors shall also handle production deployments regularly.
- A senior person is always on call (even if only unofficially / off the clock).
- Junior devs are never blamed for fuckups, irrespective of the damage they caused.
That was the only way to help people develop routine regarding big production deployments.
Same thing -- I used to work at a very large hosting provider. One of our big internal infra management teams wouldn't consider new hires fully "part of the team" until they had caused a significant outage. It was genuinely a rite of passage, as one person put it, "to cause a measurable part of the internet to disappear".
I got to see a lot of people pass through this rite of passage, and it was always fun to watch. Everyone would take it incredibly seriously, some VP would invariably yell at them, but at the end of the day their managers and all their peers were smiling and clapping them on the back.
Yep. It was supposed to be a very small change. I blundered. My team understood that and was super supportive about it all too. But this was after it was all fixed.
During the outage though, no one (obviously) had time for me. This was a very important server. The tension and anxiety on the remediation call was through the roof. Every passing hour someone even more important in the chain of command was joining the call. At that time I thought I was done for...
I work for an extremely famous hospital in the American midwest. We're divided into three sections, one for clinical work, one for research, and one for education. I always tell people that I'm pretty content being in research (which is less sexy than clinical), because if I screw something up, some PI's project takes ten months and one week instead of ten months. In clinical, if you screw something up, somebody dies! I just don't think I could handle that level of stress.
At AWS, I once took down an entire AZ of a public-facing production service (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused/unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).
Jesus. One would think you'd have some safeguards for that. Even Dropbox will give you an alert if you try to nuke over 1,000 files. More reasons to COLOR CODE your work environments, if possible.
Colorblind (red/green) person here - 5% of the male population just don't see color well enough for it to be an important visual cue.
So sure, color-code your environments, but if you find someone about to do something to a red environment that they clearly should only be doing to a green environment, just check if they're seeing what you're seeing before you sack them ;)
My primary customer right now color codes both with "actual color", and with words - ie the RED environment is tagged with red color bars, and also big [black] letters in the [red] bars reading "RED"
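Along the same lines, a minimal sketch of a guard that encodes the environment both in color and in words (ANSI escape codes; DEPLOY_ENV is a hypothetical variable name - adapt to whatever your tooling actually uses):

    # Print a banner that shows the environment in color AND in words, so it
    # still reads correctly for color-blind users or on monochrome terminals,
    # and require the word to be typed before touching production.
    import os
    import sys

    COLORS = {"production": "\033[41m", "staging": "\033[43m", "dev": "\033[42m"}
    RESET = "\033[0m"

    env = os.environ.get("DEPLOY_ENV", "dev").lower()
    print(COLORS.get(env, "") + f" ENVIRONMENT: {env.upper()} " + RESET)

    if env == "production":
        if input("Type 'production' to continue: ").strip() != "production":
            sys.exit("Aborting: environment not confirmed.")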
> At AWS, I once took down an entire AZ of a public-facing production service (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused/unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).
It seems like a design flaw for actions like that to be so easy. E.g.
> Hey, we detected you want to delete an AWS region. Please have an authorized coworker enter their credentials to second your command.
If it was indeed in-production, I'd never in a million years have had access rights to delete the stack. Those are gated just the way one would imagine they should be.
The service stack for the region (and not an entire region itself) looked like prod, but wasn't. It made me feel like shit anyway.
Several years back when I was working at Google I made a mistake that caused some of the special results in the knowledge cards to become unclickable for a small subset of queries for about an hour. As part of the postmortem I had to calculate how many people likely tried to interact with it while it was broken. It was a lot and really made me realize the magnitude of an otherwise seemingly small production failure. My boss didn't give me a hard time, just pointed me toward documentation about how to write the report. And crunching the numbers is what really made me feel the weight of it. It was a good process.
I feel for the engineer who has to calculate the cost of this bug.
This sounds like a good practice and hopefully something they still do. Calculating the exact numbers would definitely help cement the experience and its consequences into your mind.
It's absolutely possible; the worst AWS outage was caused by one engineer running the wrong command [0].
"This past Tuesday morning Pacific Time an Amazon Web Services engineer was debugging an issue with the billing system for the company’s popular cloud storage service S3 and accidentally mistyped a command. What followed was a several hours’ long cloud outage that wreaked havoc across the internet and resulted in hundreds of millions of dollars in losses for AWS customers and others who rely on third-party services hosted by AWS."
The big mistake in the system is that everyone in the world is relying on Google services... These problems would have less impact with a more diverse ecosystem.
Would they have less impact? Or would it have the same impact, just distributed across many more outages?
You can rely on Google outages being very few and far between, and recovering pretty fast. For the benefits you get from such a connected ecosystem, I'm not sure anyone is net positive from using a variety of different tools rather than Google supplying many of them.
I'm not sure I see that as a fair comparison. I think it's best to use the same durations for this, as an entire day changes the level of impact. <1hr outages have a fraction of the impact that an entire day's outage might have.
It's obviously subjective, but even with our entire work leaning on Google - from Gmail, GDrive and Google Docs, through to all our infrastructure being in GCP - today's outage just meant everyone took an hour break. History suggests we won't see another of these for another year, so everyone taking a collective 60-minute break has been minimally impactful vs many smaller, isolated outages spread over the year.
I remember how one of our engineers had his docker daemon connected to production instead of his local one and casually ran docker rm -f $(docker ps -aq).
Same thing happened to me, but with CI, which felt bad enough already.
Engineers shouldn’t deploy to prod directly, but sometimes it’s necessary to SSH into an instance for logs, stack dumps, etc. Source: worked for 2 big to very big tech cos.
For a large or v large tech co you should probably be aggregating logs to a centralised location that doesn't require access to production systems in this way. Stack dumps should also be collected safely off-system if necessary.
Perhaps my industry is a little more security conscious (I don't know which industry you're talking about), but this doesn't seem like good practice.
Let me be clear, I agree it should not be normal to SSH into a prod box. Our logs are centrally aggregated. But it’s one thing to say it’s not normal, but quite another to say engineers shouldn't have access, because I totally disagree with that.
What normally (should) happens in that unusual case is that the engineer is issued a special short-lifetime credential to do what needs to be done. An audit trail is kept of when and to whom the credential was issued, for what purpose, when it was revoked, etc.
It can be efficient, particularly in smaller companies, but that's exactly where this rule should be applied.
In some industries, security and customer requirements will at times mandate that developer workstations have no access to production. Deployments must even be carried out using different accounts than those used to access internal services, for security and auditing purposes.
There are of course good reasons for this; accidents, malicious engineers, overzealous engineers, lost/stolen equipment, risk avoidance, etc.
When you apply this rule, it makes for more process and perhaps slower response times to problems, but accidents or other internal-related issues mentioned above drop to zero.
Given how easy it is to destroy things these days with a single misplaced Kubernetes or Docker command, safeguards need to be put in place.
Let me tell you a little story from my experience.
I built myself a custom keyboard from a numpad kit. I had gotten tired of typing so many docker commands every day and I had the desire to build something, so I turned this little numpad into a full-blown Docker control centre using QMK. A single key press could deploy or destroy entire systems.
One day, something slid off of something else on my desk, onto said keyboard, pressing several of the keys while I happened to have an SSH session to a remote server in focus.
Suffice it to say, that little keyboard has never been seen since. On an unrelated topic, I don't have SSH access to production systems.
This exactly. I have deleted records from a production DB thinking I was executing against my development DB. I've kept separate credentials and revoked dev-machine access to the prod environment ever since.
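A minimal sketch of that kind of safeguard - refusing to run a destructive command when the target doesn't look local. DOCKER_HOST and DATABASE_URL are just example variables, and the "blessed" values are assumptions to adapt, not any particular tool's real interface:

    # Refuse to run destructive commands unless the Docker daemon / database DSN
    # clearly points at a local, non-production target.
    import os
    import subprocess
    import sys

    LOCAL_DOCKER = {"", "unix:///var/run/docker.sock"}
    PROD_HINTS = ("prod", "production")

    def guard() -> None:
        docker_host = os.environ.get("DOCKER_HOST", "")
        db_url = os.environ.get("DATABASE_URL", "")
        if docker_host not in LOCAL_DOCKER or any(h in db_url for h in PROD_HINTS):
            target = docker_host or db_url
            if input(f"Target looks remote/production ({target}). Type YES to continue: ") != "YES":
                sys.exit("Aborted.")

    if __name__ == "__main__":
        guard()
        # e.g. wrap the destructive command from the story above:
        subprocess.run(["docker", "rm", "-f"] + sys.argv[1:], check=False)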
Well, it's something that can happen to anyone, so take it easy. When I made the transition from developer to manager and became responsible for these situations, at first every problem made me feel the way you describe. Eventually what freed me was understanding that how we feel about a fact does not change anything about that fact.
Don't be too hard on yourself; no dev works in a silo. There is usually user acceptance testing and product owner sign-off involved, so they have to wear some of this too.
Nope, especially considering the implications of this, with the amount of people working remotely. Google Meet, Classroom, etc. are down. This is probably literally costing billions every minute just in loss of productivity.
You are assuming that a minute of disruption can not cause more than a minute's loss of productivity. I don't think that assumption is justified.
Consider an exactly one minute outage that affects multiple things I use for work.
First, I may not immediately recognize that the outage is actually with some single service provider. If several things are out I'm probably going to suspect it is something on my end, or maybe with my ISP. I might spend several minutes thoroughly checking that possibility out, before noticing that whatever it was seems to have been resolved.
Second, even if I immediately recognize it for what it is and immediately notice when it ends, it might take me several minutes to get back to where I was. Not everything is designed to automatically and transparently recover from disruptions, so I might have had things in progress when the outage struck that will need manual cleanup and restarting.
A billion is a number with two distinct definitions:
- 1,000,000,000, i.e. one thousand million, or 10^9, as defined on the short scale. This is now the meaning in both British and American English.
- 1,000,000,000,000, i.e. one million million, or 10^12, as defined on the long scale. This is one thousand times larger than the short scale billion, and equivalent to the short scale trillion. This is the historical meaning in English and the current use in many non-English-speaking countries, where billion (10^12) and trillion (10^18) maintain their long scale definitions.
Nevertheless almost everyone uses 1B = 10^9 for technical discussions
Indeed. Also, Google's revenue is about $300K per minute. The value they provide is likely higher than that, but as you said, being able to send an email an hour later than you hoped is fine in most cases. Also, Google Search was fine, and that's their highest-impact product.
I’d guess actual losses to the world economy were more on the order of about $100K per minute, or about 1/3 of Google’s revenue. MAYBE a few hundred thousand per minute, though that seems unlikely with Search being unaffected, and everything else coming back. Certainly a far cry from billions per minute :)
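To make that order-of-magnitude estimate explicit (the annual revenue here is an assumed round number in the same ballpark as the comment's figure, not an official one):

    # Rough per-minute revenue behind the "~$300K per minute" figure above.
    ANNUAL_REVENUE = 160e9                        # assumed ~$160B/year
    minutes_per_year = 365 * 24 * 60              # 525,600
    print(f"~${ANNUAL_REVENUE / minutes_per_year / 1e3:.0f}K per minute")  # ~$304K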
I never understood this type of calculation, as it implies that time is directly converted into money, and I struggle to come up with an example of that. Even the most trivial labor cases, like producing paperclips, don't seem to convert time directly into profit: even if you make 10k units instead of 100k this hour, you don't sell them immediately. They bring revenue to the firm via a long chain of convoluted contracts (both legal and "transactional") which are only loosely coupled to the immediate output.
Nothing operates at minute margins unless it's explicitly priced per minute, like a cloud service. Even if a worker on a conveyor belt can't produce paperclips without constantly looking at a Google Docs sheet, this will be absorbed by the buffers down the line. Only if the worker fails to meet her monthly target because of this might a loss of revenue occur, and in that case the service would have to be down for weeks.
In cases where time converts to money in more complex ways, as in most intellectual work, it is even less obvious that short downtimes cause any measurable harm.
Besides the exaggerated figure, I always find these claims bizarre. Sure, there was some momentary loss, but aggregated over a month this will not even register.
In a previous lifetime I removed an "unused" TLS certificate. It turns out that it was a production cert that was being used to secure a whole state's worth of computers.
In my defence, the cert was not labeled properly, nor was it used properly, and there was no documentation. It took us 2 days to create a new cert, apply it to our software and deliver it to the customer. Those were 2 days I'll never get back. However, when I was finished the process was documented and the cert was labeled, so I guess it's a win.
I am not sure why they are allowing this. Meaning, why aren't services completely isolated? Isn't it obvious that in an intertwined environment these things are bound to happen (a question of "when", not "if")? I understand that in smaller companies, limited in resources (access to good developers, pressure to get the product to market as soon as possible), we have single points of failure all over the place. But "the smartest developers on the planet"? What is this if not short-sighted disregard for risk management theory and practice? I mean, Calendar and YouTube, say, should be completely separate services hosted in different places; their teams should not even talk to each other. Yes, they can use the same software components, frameworks and technologies. Standardization is very welcome. But decentralization should be an imperative.
Edit: again downvotes started! Thanks to everyone “supporting freedom of expression” :)
I've been in that situation before at one of my previous jobs, where some important IT infrastructure went down for the whole company. Nowhere near as big a scale as this, but it was easily one of the most stressful moments of my life.
If this does not improve soon, we're looking at one of the most significant outages in recent internet history, at least from the number of people impacted.
Several others have shared their 'I broke things' experiences, and so I feel compelled to weigh in.
Many years ago, I was directly responsible for causing a substantial percentage of all credit/debit/EBT authorizations from every WalMart store world-wide to time out, and this went on for several days straight.
On the ground, this kind of timeout was basically a long delay at the register. Back then, most authorizations would take four or five seconds. The timeout would add more than 15 seconds to that.
In other words, I gave many tens of millions of people a pretty bad checkout experience.
This stat (authorization time) was and remains something WalMart focuses on quite heavily, in real time and historically, so it was known right away that something was wrong. Yet it took us (Network Engineering) days to figure it out. The root cause summary: I had written a program to scan (in parallel) all of the store networks for network devices. Some of the addresses scanned were broadcast and network addresses, which caused a massive amplification of return traffic that flooded the satellite networks. Info about why it took so long to discover is below.
Back in the 1990s, when this happened, all of the stores were connected to the home office via two way Hughes satellite links. This was a relatively bandwidth limited resource that was managed very carefully for obvious reasons.
I had just started and co-created the Network Management team with one other engineer. Basically prior to my arrival, there had been little systematic management of the network and network devices.
I realized that there was nothing like a robust inventory of either the networks or the routers and hubs (not switches!) that made up those networks.
We did have some notion of store numbers and what network ranges were assigned to them, but that was inaccurate in many cases.
Given that there were tens of thousands of network ranges in question, I wrote a program creatively called 'psychoping' that would ICMP-scan all of those ranges with adjustable parallelism (a sketch of the network/broadcast-address pitfall follows this story).
I ran it against the test store networks, talked it over with the senior engineers, and was cleared for takeoff.
Thing is, I didn't start it right away; some other things came up that I had to deal with. I ended up starting it over a week after the review.
Why didn't this get caught right away? Well, when timeouts started to skyrocket across the network, many engineers started working on the problem. None of the normal, typical problems were applicable. More troubling, none of the existing monitoring programs looked for ICMP at all, which is what I was using exclusively.
So of course they immediately plugged a sniffer into the network and did data captures to see what was actually going on. And nothing unusual showed up, except a lot of drops.
We're talking > 20 years ago, so know that "sniffing" wasn't the trivial thing it is now. Network Engineering had a few extremely expensive Data General hardware sniffers.
And to these expensive sniffers, the traffic I was generating was invisible.
Two things: the program I wrote to generate the traffic had a small bug and was generating very slightly invalid packets. I don't remember the details, but it had something to do with the IP header.
These packets were correct enough to route through all of the relevant networks, but incorrect enough for the Data General sniffer to not see them.
So...there was a lot of 'intense' discussions between Network Engineering and all of the relevant vendors. (Hughes, ACC for the routers, Synoptics and ODS for the hubs)
In the end, a different kind of sniffer was brought in, which was able to see the packets I was generating. I had helpfully put my userid and desk phone number in the packet data, just in case someone needed to track raw packets back to me.
Though the impact was great, and it scared me to death, there were absolutely no negative consequences. WalMart Information Systems was, in the late 1990s, a very healthy organization.
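A minimal sketch of the pitfall from the story above: when enumerating host addresses to probe, the network and broadcast addresses of each range need to be excluded, which Python's ipaddress module does for you. The ranges here are made up for illustration, not the original program:

    # Enumerate host addresses for each range while skipping the network and
    # broadcast addresses - the ones whose replies amplified in the story.
    import ipaddress

    ranges = ["10.10.1.0/24", "10.10.2.0/24"]     # hypothetical store ranges

    for cidr in ranges:
        net = ipaddress.ip_network(cidr)
        hosts = list(net.hosts())                 # excludes .0 and .255 here
        print(f"{cidr}: {len(hosts)} host addresses "
              f"(skipping {net.network_address} and {net.broadcast_address})")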
Makes sense, at work we have an application running on Google Cloud and everything seems to be working. So the outage is probably not at network or infrastructure level.
4:41AM PT, Google services have been restored to my accounts (free & gsuite).
And I have never seen them load so fast before - the Gmail progress bar was barely visible for a fraction of a second, whereas I'm used to seeing it for multiple seconds (2-3 sec) before it loads.
I observe the same anecdotal speedup for other sites... drive, youtube, calendar. I wonder if they are throwing all the hardware they have at their services or I am encountering underutilized servers since it is not fixed for everyone.
It is nice to experience (even if it is short-lived) how snappy Google services could be if they weren't so multi-tenanted.
If this phenomenon is actually real instead of just perception, then I'd guess it comes down to reduced demand of some sort. Some possibilities:
a) users haven't all come back yet
b) Google is throttling how fast users can access services again to prevent further outages
c) to reduce load, apps have features turned off (which might make things directly faster on the user's end or just reduce load on the server side)
It might suggest that the frontend isn't the only issue, at least - and maybe this explains why it's usually so slow, if the frontend can be fast on a fast enough backend. On the other hand, the speed of the "basic HTML" version implies that the frontend can be the issue.
So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?
Of course it will, - at least, it better - but what if it doesn't? And if it does, are you going to take countermeasures in case it happens again or is it just going to be 'back to normal' again?
Everybody uses it, so if, say, Gmail loses all the emails, we will be in a state where the consequences are more bearable and socially normal.
Most people are fine with accepting that whatever future thing will happen to most people will also happen to them. Because then the consequences will also be normal.
If the apocalypse comes, it comes for almost all of us and that's consolation enough.
The way I see it, backups are a strategy to reduce risk of ruin.
For me, backing up to the Cloud is fine, because I find the risk of my home being broken into and everything stolen AND the cloud goes down AND the cloud services are completely unrecoverable is a small enough risk to tolerate.
I don't think it's possible to have permanently indestructible files in existence over a given time period.
Different failure mode. If the cloud goes down, many more people are affected. If your self-hosted thing goes down, only you are affected. If everybody self-hosted, would the overall downtime be lower? Even if it were, would it be worth the effort of self-hosting?
I have the opposite experience, at least with regard to your first two paragraphs. Most of the things that I have backed up on other people's computers over the past 3-4 decades are irretrievably lost. But most of the things that I have taken care to make backups of on personal equipment over the years, are still with me.
Cloud storage is still useful of course, but I prefer to view it as a cache rather than as a dependable backup.
Of course it's viable as a backup. Availability != reliability. My data is still reliably saved in the cloud even if there is an outage for a few hours. The key point is backup, e.g. Dropbox. When you use Google Docs, it becomes a single source of truth and a SPOF.
This depends on the circumstances. If your personal photos are inaccessible then maybe it doesn't matter, but if it's the documentation for a mission-critical bit of infrastructure then a few hours could be very significant. Somebody in that situation probably wouldn't agree with your assertion that "availability != reliability". If I can't access it when I need it then I wouldn't consider it reliable.
Whatever data I have backed up in the cloud is synced across multiple devices that I use. Even if the cloud disappeared altogether, I still have it. The cloud allows me to keep an up to date copy across various devices.
Both Google Drive/Photos and OneDrive have an option to only keep recently used files on your local device, and even periodically suggest they automatically remove local copies of unused files to "free up space".
I highly suggest everyone disable this setting on their own, but also on their (perhaps less technical) friends' and relatives' devices. Otherwise, if anything happens to your account or - less likely - the storage provider or their hardware, your data could very well be gone forever. I can't believe anyone would want that.
You don't need 'the cloud' to do that. Look into Syncthing. It does depend on an outside "discovery server" by default to enable syncing outside of your LAN, but you can run your own.
What's annoying is that synchronisation doesn't work for Google Slides or Google Docs. They are just synchronized as links to the webpage on my computer.
If you use Insync you have the option of converting to DOCX or ODT.
Insync has other issues though, my "sourcetreeconfig" is being downloaded as "sourcetreeconfig.xml".
>So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?
Much less chance of that happening than my local backups getting borked...
Just a few moments ago, I wrote in my company's group chat that this is the time we buy a NAS.
We have a lot of documents not accessible right now in Google Drive.
What worries me the most is email. I basically don't use any other Google services other than Gmail and YouTube, but for email I really don't know of an alternative.
Sure, you can argue "move to Fastmail/Protonmail/Hey/whatever", but those can also go down on you just like Google is down now. And self-hosting email is apparently not a thing due to the complexity and having to forever fight being marked as spam (note: not my personal experience, I never tried self-hosting, just relaying what I read here on HN when the topic comes up).
So, yeah, what do we do about email? I feel like we should have a solution to this by now, but somehow we don't.
I've been happy using Hosted Exchange on Microsoft. I own the domain, so ultimately I can point the DNS at some other provider. Outlook stores the mails locally, so I have a backup. I think the most important thing about email is receiving future emails, not looking at historical ones. In the end you can always ask the recipient to send you a copy of the email conversation - if you don't own the domain, it gets much harder to prove you actually own the email address.
For < $100/year, Microsoft will sell you hosted Exchange (and you can use it with up to 900 domains [0]), 1 TB of storage, 2 installable copies of Office, and Office 365 on-line.
That's _much_ better than trying to host my own email server.
The point is that you shouldn't put all your eggs in one basket. All services go down. If you're worried about someone else handling it when it goes down then host your own [1], otherwise you can use something different for each thing you need. Don't rely on Google for everything.
Yes, I know what the point is. But how do you avoid putting all your eggs in one basket? You can't host your email on more than one "provider" (including self-hosting), and the vast majority of important services that you link your email to (bank, digital identity, taxes and other government services) don't allow you to link more than one; which means that if that one goes down, you don't have one. Sure, I can give my accountant and my lawyer a second email address, hosted on a different provider, but that poses two problems: 1. how are they going to know when one is working and one isn't? Most of the time you don't get a notification if your email didn't go through, it just drops; 2. if you always send all emails to both addresses, now two providers have my data instead of one (excluding, of course, the self-hosted case). You also need to keep in mind that for all things important, one is none and two is one; so according to that you should really have 3 addresses on 3 different providers, which brings us back to the problems above. (And I'm not even mentioning the confusion it would generate if you don't manage to get the same name with every provider: "Wait, was it beshrkayali@gmail.com, or was it alibeshrkay@gmail.com? Or was that Fastmail?")
As I said (literally in the second sentence), I don't rely on Google for everything, as you mention. I don't actually rely on Google for anything other than gmail, and of that I am also unhappy. The point I was trying to make is that there aren't really alternatives, and I was hoping someone might come out with a suggestion about how to overcome that problem.
You shouldn't use you@company.com as your main email; you should have your own domain. That way `something@yourdomain.com` will always be yours, no matter whether you self-host or use a 3rd party. I currently use Fastmail and I've been very happy with them. If they fail or turn evil, I'll switch to something else while keeping the same address. The emails themselves should be downloaded/backed up regularly, kind of like important documents you'd have on your disk.
I'm running my own mail server, and I think anyone who has some experience with Linux should be able to do the same in a day or two. Once it's set up it just works.
You can still use Gmail and fall back to connecting directly to your server if Gmail is down.
Some mails might be flagged as spam if the IP/domain has no reputation, but that quickly passes, at least that's my experience.
I specifically use G Suite so that I don't have to deal with managing a spam filter or dealing with IP reputation issues. I'd be willing to self-host almost anything else.
A lot of domain registrars will host/relay mail for you if you don't want to think about it. Otherwise it's not too hard to host yourself. The sucky part is when it breaks because you can't really just put off fixing it.
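One nice property of owning the domain, as mentioned above, is that mail routing is just a DNS record you control. A quick way to see where a domain's mail currently goes - a sketch using the third-party dnspython package, with a placeholder domain:

    # Look up where mail for a domain is routed. If you own the domain,
    # repointing these MX records is all it takes to move providers.
    import dns.resolver   # pip install dnspython

    domain = "yourdomain.example"   # placeholder
    for record in sorted(dns.resolver.resolve(domain, "MX"),
                         key=lambda r: r.preference):
        print(f"{record.preference:3d}  {record.exchange}")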
I’ve heard multiple founders argue that it’s safe to have downtime because of a cloud outage, because you’re not likely to be the highest importance service that your customers use that also had downtime.
Well yeah; I don't trust myself enough to own & operate my own servers, and I cannot give myself any uptime guarantees - let alone at the scale that a cloud provider can offer me.
"The Cloud" is vague, and if you don't specify what it means then the answer to your question can only be "it depends".
If the question is "anybody still feel like arguing that 'a single provider' is a viable back-up" then it's yes for most cases. A better strategy is of course to use multiple providers. The chances that it never comes back again is much lower.
"Multi-cloud" only works if you stick with the basics. Like disk storage, compute, and a well-supported database. Once you tie in into a cloud's specific offerings.....
I find the Anthos docs and talks so confusing. Half of them say Anthos is for hybrid between on-prem and GCP. The other half say it's for multicloud and hybrid.
Well, since a viable backup strategy requires at least 3 storage locations (e.g. the in-use primary, an on-site or off-site backup, and a secondary off-site backup), "the cloud" is fine as an off-site backup or secondary off-site backup.
I know you’re joking, but I am curious if Waymo’s fleet was affected. Shouldn’t be right? But I’m also surprised every time I fresh login to Gmail and YouTube shows up in the intermediary redirect chain.
Gmail said my account was "temporarily" unavailable... had a moment considering if it wasn't temporary. Good reminder to remove my reliance on gmail especially.
Underrated comment. If the government freezes your bank account, at least you can take them to court. If Google mistakenly disables your account, there does not seem to be any legal recourse. Considering the extent to which people depend on their services, perhaps there should be a more elaborate appeal process.
I agree, but the problem is that "ordinary rule-following" is not clearly defined by Google. Perhaps it was that innocent park frisbee party video with a Metallica song playing in the background, put on by some other party there... and not only your video, but also your YouTube account, and worse, your email account gets blocked. Maybe this scenario is dystopian - the point is that it is a black-box, no-appeal/limited-appeal system when such an event happens.
Google account terminations typically only affect the one service. If your YouTube account gets terminated for ToS violations, your email account will still work perfectly fine.
Thunderbird prompting me to login, and getting "Google does not recognize this email address" as a reply was a nice adrenaline spike, until I checked the status on HN!
This happened to me with a work email. For a moment I thought, "Well, I had a good run at this company.." My next thought was, "Ha, Google must be down - better check HackerNews."
I am still experiencing this problem on all my synced-to-Thunderbird Gmail accounts, so it either hasn't been completely fixed yet, or there's another ongoing issue.
Yes! It takes some work to switch, but it's worth it. Buy your own domain name, and link it to an existing service if you don't want to host yourself. You'll always be able to switch your mail alias when having issues with your email host.
Our entire library of video tutorials disappeared for a while. I was not happy, and the thought of losing our email.. Currently working with the team to make backups of absolutely everything off-site and off-Google. A good wake up call for us.
I really should try to get my stuff moved to my new address. Not self-hosted, but a non-profit that has been around for a while... Finally paid the membership fee.
When I tried to login it said: 'this account does not exist'. So my first thought was some algorithm made a mistake and my account got deleted for no reason.
I already imagined the only solution now was to write a medium post and hope it gets some traction on hackernews and google support steps in.
Thinking to myself I was an idiot for knowing all this and still thinking it wouldn't happen to me.
And even though it turns out to be an outage, it gave me a bad enough feeling to start using a domain name I own for my email.
The advantage of a custom domain name is: As long as you can update DNS records, you can have your mail easily hosted elsewhere.
Obviously not relevant for this kind of outage, but in the scenario outlined by GP - Google randomly kills you off, and there is nothing you can do - this is at least an emergency strategy.
Yep, but losing access to my emails isn't that big of a deal, losing access to my email address is much worse. Especially since it's also my login on a lot of other services.
This topic is so hot it's crashing HN!
Super long server response times, and I get this error quite often: "We're having some trouble serving your request. Sorry!"
Interesting cascade effect of sites going down! I wonder where people will go next to check if HN is down?
Also wondering if this is perhaps the fastest upvoted HN post ever? 8 mins -> ~350 votes, 15 mins -> ~750 votes. I wonder if @dang could chime in with some stats?
Update: looks like it hit 1000 upvotes in ~25 mins!
Update: 1500 in ~40 mins
Update: 2000 in ~1 hour 20 mins (used the HN API for the timestamp)
The recent youtube-dl takedown had a similar number of votes but was still slightly slower, I think: >500 in 30 min and >1000 in 60 min, if I remember correctly.
I, and probably many others here, serve as a status page for friends and family too... My wife thought the internet was broken; she tried tethering to her phone and other things and it still didn't work, then showed me, and when I saw the status code errors I said "it's actually parts of Google that are down".
I love (and am deeply scared by) the dependence on Google and the conflation of it with the entire internet.
I still like my physical switch on the wall to turn the light on and off very much.. for me it's hard to beat. I can even turn it on/off semi-remotely with e.g. a tennis ball.
Right? Home automation is perfectly fine and might even expose a secure, authenticated API to the internet so you can, say, check and adjust the temperature remotely, but it should never ever go down for local use if the internet connectivity or the remote service backend goes down.
Home Assistant rulez with non-cloud sensors. I never buy sensors, cameras or switches that require a connection to an external host. But I'm sure most people will keep doing so even with outages like this.
Google should write an AI application that checks on HN if their systems are working and display the result live on their status page. That would be more reliable than the current system they have in place (which is obviously not working at all).
No -- it needs to be red when it's having the outage for people to have confidence in it as an indicator. The reality is Google have no real incentive to provide an actual external status page that is accurate -- to do so is an admission of not upholding SLAs. These status pages are updated retrospectively. Use a third party one like DownDetector.
Google made everyone get an account trying to force Google+ on to everyone. They don't get to pretend like that didn't happen by not factoring it into their status.
The small print says: The problem with Gmail should be resolved for the vast majority of affected users. We will continue to work towards restoring service for the remaining affected users...
At Google scale, "remaining affected users" probably number in tens of millions. Sucks to be one of them, tho.
But hey, it happens. As a SaaS maintainer, I can sympathise with any SREs involved in handling this, and know that no service can be up 100% of the time.
Someone here on HN said that at AWS, direct CTO approval is required to reflect negative realities. Maybe that's the case everywhere these days.
Pretty funny. I would normally assume the indicators are connected to some test harness that exercises what real users interface with, versus internal APIs or however Google is doing this, with everything still showing green.
Out of curiosity, what response time do you expect on a page like that? And what level of detail? I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.
> what response time do you expect on a page like that?
Faster than a free third-party website’s response time. Google should know they are down and tell people about it before Hacker News, Twitter, etc. Google should be the source of truth for Google service status, not social media.
> And what level of detail?
Enough to not tell people that there are “No issues” with services.
> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.
I'd expect within seconds that Google is alerted of a very large number of issues with their servers and that the status page would be updated (the green light going to red) within seconds. It's now quite some time after the start of the outage and everything is still green on that status page.
If they know it's broken, one of the _many_ engineers/support staff across YouTube, Gmail and the other services known to be down from this bug should be able to update it in the first few minutes. Especially if this isn't a 5-minute fix.
As many comments in this thread have mentioned, the crash seems to happen when logged in; YouTube and other services seem to work when you don't send the session cookie. This suggests something very fundamental related to user accounts/sessions is failing...
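A quick way to check that theory yourself is to compare the same URL with and without your session cookies. The cookie name/value below is a placeholder copied from a browser, not a documented interface:

    # Compare responses for the same URL with and without a session cookie,
    # to see whether only authenticated requests are failing.
    import requests

    URL = "https://www.youtube.com/"
    SESSION_COOKIES = {"SID": "<paste-from-your-browser>"}   # placeholder

    anon = requests.get(URL, timeout=10)
    auth = requests.get(URL, cookies=SESSION_COOKIES, timeout=10)

    print(f"anonymous:    HTTP {anon.status_code}")
    print(f"with cookies: HTTP {auth.status_code}")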
Did Maria Christensen make a mistake when adding "return true;"?
(This is a joke referencing Tom Scott's 2014 parable [1] about the danger of designing a system with a single point of failure. Tom tells the fictional tale of a high-level employee at Google adding "return true;" to the global "handleLogin()" function.)
Don't blame people, blame systems and processes. I assume Google has a blameless process too. If an engineer can bring down huge swaths of Google, that's not a human problem. As an eng org you should heavily invest in a sane test -> deploy -> monitoring process, and reward reliability.
Give people a bonus when things didn't break, not only when there is a superhero who fixes broken things. Otherwise you're rewarding fragile systems that need superheroes.
It's super interesting that all Google services that I've tried are down _except_ for Google Search. What would isolate Search from the rest of Google's products such that it wouldn't be affected by a mass outage like this?
I’d guess Search is pretty well segregated from basically everything else because of how valuable it is - I’m logged into Google on Search and it works fine (unlike everything else)
I would've thought Google remembers what you search while logged in and sells it off. Maybe they do it in a different way - never really thought about that. I'm logged into Google when I search but my profile fails to load at the top right.
They probably do, but it is not essential to the functionality so it can work even if auth fails.
Fingerprinting happens separately because then they can identify you even if you are not using your own computer or are in incognito. Login is just a convenient way for you to tell them who you are.
Just got a text from my kids' school saying GMail is down and to use the school's direct email. Immediate reaction was "Google isn't down you idiots, the problem is on your end", go to check GMail, yep it's down.
Unless I'm misunderstanding something, they are going to have an immense SLA claim issue. 99.9% SLA on Workspace services, so any business paying for Google for Business (now known as Workspace) is going to have a credit claim (assuming the outage is longer than 43m 49s which feels like it will be).
Edit: As I comment it looks like things are coming back! Timing or what...
First the two Gmail accounts I had in Thunderbird logged me out and asked to relogin.
Then Google said my accounts did NOT exist which made me feel very uneasy. Banned? Lost all my data? OMG.
Then I got error "502 That's all we know" after entering my passwords.
Finally I realized it couldn't be just me, so I opened HN and confirmed my suspicions.
It's all up and running now but I was really scared.
I'm really curious what happened because it surely looks like Google has a single point of failure in their authentication mechanism which is just horribly wrong. The company has over two billion users - a situation/configuration like this just shouldn't be even theoretically possible.
Probably just people checking non-Google sites whether it's everyone or just them. 8.8.8.8 at least seems to be working, as does Google DNS for domains hosted on domains.google.com.
I came to HN to see if there was news to calm me down, we just started business day here, and as soon I tried to login some 30 minutes ago, Google claimed my paid business account didn't exist, so I came here to check if there was some mass banning or something like that.
I noticed Gmail and of course the google doc I was working on were unavailable. I didn't come check HN until I said to my wife "Google's down" when she complained her email didn't work and she said "check the news to see if Google's really down"
Who would have thought a lot of folks here also use Google products.
For real, though, Hacker News has been getting slower every second since this thread went up. I won't be shocked if it goes down under the surge of traffic from the Google outage.
YouTube shows a "something's broken" error. Similar errors when trying to access Gmail. Tried on multiple browsers and connections, got the same result. The search engine works fine, as of now.
Services are not restored. Some came up again, some did not. My G Suite business mail is still completely down, while YouTube started working again.
I'm pretty sure there will be some internal conferences at Google after this to make sure infrastructure problems can't propagate across the entire company and world at this rate even in the event of a sysop fatfingering a console...
In my house, it leads to a hilarious echo of multiple devices responding simultaneously with an error message. Having only one device respond seems to be cloud-mediated too.
YouTube Music failed in an interesting way. My music has been playing, but I started hearing ads (I'm on Premium), which prompted me to check what was going on.
edit: added details
edit: redacted my phone number
edit: big mistake to add phone number
edit: I think illic is right, probably not me
edit: removed details
If it was you, why the hell on earth would you bother exposing your personal phone number on the internet and asking google to call you on this post? Like, seriously...
Wouldn't you rather call them directly on the hotline...?
That's a phone number from Bulgaria, and it looks like it should be part of the Vivacom GSM network, so I guess it's his personal mobile phone number or a scam.
> why the hell on earth would you bother exposing your personal phone number on the internet
Why not? It's not like you're gonna receive death threats exactly. I've had my personal phone number on my public website and in the footer of every outgoing email for 15+ years, never had any problems, spam or otherwise.
That reminds me of the time when I plugged a network cable back into an Active Directory domain controller. At the exact same time as the RJ-45 plug would have made the little "click", a door slammed shut and a polished steel tanker truck drove by the window, shining a bright light into my eyes.
I had just plugged a cable into the most important server in the organisation, and I saw a bright flash and heard a bang.
All was well, it was just a coincidence, and a good reminder that sometimes shit happens and it's not us, it's just timing.
This isn't true, but considering every Google outage is a one-in-a-billion Rube Goldberg domino machine, it could be true. Put this comment in the post mortem!
I'd wager a guess that you set up some weird 'expression', coupled with some bug in the IAM service, maybe while deleting some stale resources at the same time?
I'd then assume that once the expression is evaluated, the services end up busy-looping / proxies throwing internal errors and taking out capacity.
Still, you shouldn't be able to cause downtime to more than a few servers, even in the extremely unlikely case that I'm anywhere close.
PS:
- I haven't used Google's IAM, so I'm guessing after a few minutes of reading the docs.
- you are incredibly unlikely to have triggered this at Google's scale.
Just permissions. The "IAM" can be safely dropped. It's exactly what you think it'd be: restrictions and privileges.
"IAM" is basically the name for a specific model of doing it.
Unless something really crazy happened, this user is unlikely to be correct. Accounts are supposed to be firewalled/sandboxed in a way that problems can't spread from your account to someone else's, let alone systemwide.
It's possible (some sweeping script on a powerful connection that smashes just the right things or some exploit to break the sandboxing), just probably not likely - especially unintentionally.
> Last thing you want is a conversation with google's lawyer
Google more or less wants people to find their weaknesses so they can patch and secure them. A person accidentally triggering a global outage is not something that would get Google's lawyers on them, especially not something that only affects his or her GCP project.
But why is HN so slow? Pages take like 30+ seconds to load for me (German vantage point, other sites are fine). Does it timeout on some Google dependency or try to use a Google submarine cable or something? Or is everyone just posting to HN about it?
I happened to use Google Takeout to download everything I had last weekend (~200 GB)... at first I thought my account was banned, and that download was the only solace I had, but this is definitely killing my productivity.
I really hope this is just a temporary issue and no accounts are going to be lost. If that happens, way too many people are going to feel the pain of thinking Google makes backups unnecessary - including me.
And somehow Pokemon Go is down, too. They use GCP for hosting, but I'm not seeing other outages on GCP. Maybe they have deeper integration with Google's Cloud than publicly disclosed?
Discord was sort of down (wouldn't load), but I could log in again a few minutes later and everything was OK. My company's product is hosted on GCP and there are no problems that I can detect without administrative access (which I can't obtain because it's behind a Google login); services are returning correct results.
Google Maps for sure, and maybe some GCP as well (the console is down, and Discord and PlayStation Network are also having issues, which points to some GCP service).
Seems like it's the auth service that is down. Every service works in incognito/private.
Shared drives in Docs/Drive do not work. Probably connected to the auth service?
Gmail service has already been restored for some users, and we expect a resolution for all users in the near future. Please note this time frame is an estimate and may change.
Google is scaring the shit out of us. We were already trying to move away from Google after some past bizarre outages, and now, right before they went down, they outright claimed our business account didn't exist anymore; our business drive and all our IMAP e-mail clients got kicked out.
This is a great and interesting thread, but I believe nobody here has discussed the opposite, positive effects of this kind of global outage. There are funny calculations, for example, about fewer spam-mail opens/clicks and things like that. So think about what a massive outage affecting millions of users and automated processes means in terms of less CPU, RAM, and I/O, and as a consequence less power/cooling, and so less CO2 and pollution... It's huge! And in terms of impact on the Earth's climate, it's like trillions charged in advance that we'll otherwise have to pay in the future, and not only with money... (economic degrowth)
Launched YouTube app on the Roku and was prompted to sign in. Opened browser on PC, entered "activation code" from YouTube app, after entering my Google account username on the login page, it presented me with the reassuring "Couldn't find your Google Account" message.
Tried logging in to Gmail directly with the same effect.
Thanks to the Twitters, I realized that Google hadn't canceled me, specifically (apparently they decided to temporarily cancel everyone!).
FWIW, I typically get directed to their Chicago datacenter(s).
(Note: 7 A.M. on the U.S. east coast on a Monday morning; at least they have impeccable timing!)
When I worked in the automotive industry, the Detroit Big 3 American auto makers had figured out their cost per minute of downtime. They would even charge that to their suppliers and anyone who shut down one of their plants.
It was something like $10,000 per minute since they’re able to roll a complete vehicle off the assembly line about every minute
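Back-of-the-envelope, using only the figures quoted above (about one vehicle per minute, roughly $10,000 per minute); the stoppage length here is purely illustrative:

    # Rough downtime-charge math with the numbers quoted above.
    # The 60-minute stoppage is an illustrative figure, not a real incident.
    cost_per_minute_usd = 10_000
    stoppage_minutes = 60
    print(f"Estimated charge: ${cost_per_minute_usd * stoppage_minutes:,}")  # $600,000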
We use Google Cloud and Firebase at work and our application and servers are still up and running. Can't say the same about the Cloud Admin Console, tho.
Somebody threw a Spanner in the works is my guess. Global-scale database underpinning everything (or maybe the giant that is colossus is having an off-day).
Singalong: She broke the whoooole world, with her change, he broke the whole wide world, with his change, they broke the whole world with their chaaaaange!
I'm sure it's stressful right now. But someday, these engineers will look back and retell the stories about how it happened and the lessons they learned. Hopefully with a laugh.
Gcloud/GKE command line utils are also having problems.
I was deploying some test applications and kubectl started complaining about the gcloud auth helper exiting with non-zero errors. I tried to launch Cloud Shell from the website and nothing happened.
The web application itself, a site that doesn't rely on any external API, is running fine.
Not just Google; Hacker News was also displaying "We're having some trouble serving your request. Sorry!" Does it have something to do with Google's architecture?
Btw, it's funny that you listed all the Google services in exactly the order I found them unavailable: first YouTube, then Gmail, then Docs.
It looks like the auth service is broken. Everything works when logged out. Attempting to log in gives "Account not found".
EDIT: Looks like they logged everyone out while they fix the auth issue, presumably so people can use YouTube and other stuff that doesn't necessarily need login?
Seems like the incident lasted from 04:00 AM PST to 04:30 AM PST, at least for BigQuery (sorry, image timestamps are in JST, since I'm in Japan): https://imgur.com/a/kiHED6v
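For anyone squinting at the screenshot: JST is UTC+9 and Pacific time in winter (PST) is UTC-8, so 21:00 JST lines up with 04:00 PST the same calendar day. A quick sketch with Python's zoneinfo; the date is illustrative, not taken from the screenshot:

    from datetime import datetime
    from zoneinfo import ZoneInfo  # Python 3.9+

    # 21:00 JST converts to 04:00 PST on the same calendar day (UTC+9 vs UTC-8).
    jst = datetime(2020, 12, 14, 21, 0, tzinfo=ZoneInfo("Asia/Tokyo"))
    print(jst.astimezone(ZoneInfo("America/Los_Angeles")))  # 2020-12-14 04:00:00-08:00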
The worst outage I've ever seen with Gmail. I can't log in, getting "Couldn't find your Google account" message. Also, with the already logged in account, I can't browse anything but whatever is in cache.
I'm definitely worried.
Even? Tim Berners-Lee tried since day 0 to make the web decentralized. One of the initial requirements in the 1989 proposal for a global hypertext system included "Non-Centralisation" where "new system must allow existing systems to be linked together without requiring any central control or coordination". See https://www.w3.org/History/1989/proposal.html for the rest.
While it went OK-ish for the internet, we massively screwed it up for the web.
No we didn't. The web is very nicely decentralised, especially with the general malaise and decline of Facebook.
What isn't decentralised is web apps but TBL never intended the web to be used for full blown desktop app replacements in the first place so no surprise it doesn't meet its design goals for that.
Anyone experiencing lost emails? I'm not sure if it's actually me messing things up or because of the Gmail outage, I couldn't find an extremely important email and I've tried archive, trash, deleted, spam, sent boxes..
This affects GCP as well as Google's consumer services. I'm here because I was paged about low traffic for a service. Turns out to just be that I can't access any metrics at all from Google Cloud Monitoring (i.e. Stackdriver).
These rare events help make extremely visible how much of our general basic infrastructure we've farmed out to a small handful of companies, centralizing something that was supposed to avoid the problems of centralization.
I feel like Google is having more outages recently. I worked there for 4 years and even in such a short time you really noticed the shift from engineering focus to business focus. But maybe it is just a coincidence.
I never worked for Google. They wanted to hire me after the first set of interviews, but I took a different opportunity at the time. The interview process was intense, and the interviewers were sharp. I came out even more impressed with Google. This was way back in the day -- early '00s. I would totally be excited to work for that Google. It's just that, well, there are lots of awesome things to do, and I had (what seemed like at the time) a more interesting option. I sometimes had regrets and sometimes not.
Second process was maybe 2-3 years ago. It didn't get to a full onsite, since after a few conversations with the team, it was clear there wasn't a fit. The old arrogance was still there, but without the same sharpness or cleverness. I spoke to a team working on a new product (under NDA) in a field I had a lot of experience in, and:
1) There were computer scientists without a clear understanding of the target market domain.
2) They believed they were the best-and-brightest, and didn't need to consult experts in the field.
3) There was a lot of hype and salesmanship.
I like to be surrounded by smart people, and it felt like I'd be the sharpest guy in that room. I was a better computer scientist, AND better domain expert in the field the product was in than anyone on that team. That's not a position I want to find myself in.
And I have a job which pays a lot less than Google, but I'm surrounded by smart people I learn a heck of a lot from, where I'm working on meaningful things, and having fun. I'm also continuing to build my personal brand.
Now a few years later, the product they were working on never shipped, so if I worked for Google, I would have likely been on a failed project. On the other hand, my mortgage would likely be paid off.
Seems to also affect Google Cloud console to some extent. Dashboard data is not loading at the moment. Not sure what services exactly are affected there. Google Cloud SQL seems to be up and running so far.
YouTube Greece is down and returns an "oops" error, but I can still play back a 2-hour video that I was already watching before it went down. Trying to open any other video or channel returns the same error.
How long did your takeout take? It’s been 3 weeks and mine either says not ready yet, or when it is it will say unable to download and force me to start again...
When I request a full takeout it usually takes 2-4 weeks to generate.
This time around it was missing a .tgz which had error'd. A subsequent full takeout request took a couple of days.
The .tgz files, I find, can only be reliably extracted on Linux... my Gmail export is 20GB, which in the last few exports has been a single file inside the last .tgz.
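For what it's worth, a cross-platform way around the "only extracts reliably on Linux" problem is Python's built-in tarfile module, which handles .tgz archives on any OS. A minimal sketch; the archive filename is made up:

    import tarfile

    # Extract one Takeout piece; "r:gz" handles gzip-compressed tar (.tgz).
    with tarfile.open("takeout-001.tgz", "r:gz") as archive:
        archive.extractall(path="takeout_extracted")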
Google is still the best when it comes to search. But it's also important to know the alternatives. You'd be surprised how many people don't know the Google alternatives. lol
How could all of Google's websites be down this long tho. You would at least think Google, being the company they are, would have some sort of backup servers or backup plan type stuff. Idk
Not a fan or even their user but kudos to Google for admitting the failure. Amazon (at least AWS) customers would have probably been shown a green dashboard and told to review their code.
YouTube is working in India.
But can not log in to any Google Accounts.
GMAIL and Drive not accessible via web.
Drive sync is working though.
App status page still showing green.
Now we'll have a bit of outrage, talking about how dangerous it is to have the whole world impacted by one company's processes, and then forget about it in 2 days.
HN is faster and more accurate than status pages of major service providers themselves. Special thanks to people working at those service providers and posting here!
Waiting for the post-mortem to see it was all because someone forgot to renew an SSL cert that nobody knew was a single point of failure for literally everything.
Yeah, that status page is really frustrating. Didn't anyone consider building it to REALLY check if services were working? not just returning a 200 Status... but actually DOING things?
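A minimal sketch of what a probe that "actually DOES things" could look like, as opposed to trusting a bare 200; the URL and the expected marker below are placeholders, not Google's real endpoints or checks:

    import requests

    # Hypothetical end-to-end probe: a 200 that serves an error page still fails,
    # because we also require a string that only a working service would render.
    def service_really_works(url: str, expected_marker: str, timeout: float = 5.0) -> bool:
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException:
            return False
        return resp.status_code == 200 and expected_marker in resp.text

    print(service_really_works("https://mail.example.com/inbox", "Compose"))

A real probe would go further and log in or perform an action, but even this catches the "green dashboard, broken service" case.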
Internet jargon; when a commenter is overly excited, they tend to overuse exclamation points. In the rush to add many exclamation points, the shift key might not be held down for all of them.
Hey, can someone explain to me how all the Google apps are down? Cause idk a lot, and ik that at least a kid DDoSing these websites is not a possibility, so what kind of attacks are going down here? And how come Google doesn't have like backup servers to run their websites on, or is that not a thing?
seems to be working again, for me. except gmail says "Gmail is temporarily unable to access your Contacts. You may experience issues while this persists."
AFAIK (as someone who does not work at Google), "brain teaser" questions have not been asked at Google for almost a decade.
Every time I see complaints like this I can't help but think the posters have a chip on their shoulder from being rejected by Google or a similarly selective company. It's never a commentary on any fundamental issues with these types of interview questions, and no one ever comes up with a better process that is equally scalable and effective for hiring generalists. In my experience interviewing with dozens of small, medium and large companies, the vast majority of technical roles require these types of questions now. These come off as criticisms against Google specifically for asking hard variants of these questions.
I don't personally have any issue with companies asking questions like these as long as they don't simply look for "a correct and optimal solution coded up perfectly whilst under stress in under 30 minutes", but rather the process of solving the problem.
If we're talking about FAANG, I don't believe that they really care about "the process of solving the problem". If you don't come up with a close-to-optimal solution in about 5 minutes, you're in trouble - there will be competitors who are better prepared and thus can. Solving algorithmic questions is a very trainable skill, after all.
Funnily enough, this response is exactly what the parent seems to be talking about. FAANG originated as a stock term and not as a grouping of companies with similar hiring processes. i.e. Interviews at Netflix are generally domain specific. Likewise, Amazon isn't the same as Apple isn't the same as Google.
> If you don't come up with a close-to-optimal solution in about 5 minutes
Disingenuously hyperbolic. I was only given a single problem to solve in each of my 45 minute "coding rounds" at Google and Facebook. If you approach these interviews like a competitive coding competition you will absolutely not get the job. By that I mean arriving at a non-optimal solution as fast as possible while writing messy code and not explaining your thought process. Not to mention that senior roles involve an increasing number of interviews that focus on design rather than algorithmic questions.
This all seems like nonsense coming from individuals who are (understandably or not) frustrated at an interview process that doesn't align with their strengths.
But their hiring practices are pretty similar (perhaps except Netflix, don't know anything about them).
In one of my interview rounds at google I was under huge time pressure, despite coming up with an optimal solution in seconds. That was an outlier as the interviewer was not aware of this solution, so explaining it took a long time. But still.
The valuable part from the competitive coding background is not the coding, but "inventing" solutions. In quotes because it's mostly a pattern-matching process. And it's a _huge_ advantage.
Personally I'm perfectly happy with the current process, as I took about half a year to become good at it, and the payoff is nice. :) But I'm convinced that's what this process selects for first and foremost - amount of preparation.
Funnily enough, this response is exactly what someone who succeeded under this model would say.
This all seems like nonsense coming from individuals who are (understandably or not) frustrated that others would suggest that their success was only due to an interview process that aligned with their strengths, rather than their obviously superior skill and intellect.
I don't think I would be considered a "success" under this model as I never passed a tough interview and currently work at company you will never hear of. I simply have no desire to justify my skills by talking others down and pretending I'm too good for companies that reject me.
I'm not sure how this is a valid rebuttal to my comment since it addresses none of the point I made. Do you really think the secret sauce to getting into highly competitive jobs is by incoherently speed running through algorithm questions?
I just took your first and last sentences and had some fun with it.
That way of interviewing is good, because it weeds out people who don't prepare and thus might be lazy or just looking for an easy paycheck. But it also tends to confuse good memory of technical and abstract questions as competence, and if you're not careful you'll end up with a lot of copy pasted humans that are really good at inverting binary trees, but suck at everything else. Also people who have already passed it tend to hold on to it as pride, and might fail an otherwise amazing candidate, just because of a failure on a question that has no practical use outside of a textbook. Those are my thoughts on it, and I'd wager the thoughts of many others who are critical of the model.
> Do you really think the secret sauce to getting into highly competitive jobs is by incoherently speed running through algorithm questions?
No, the real secret sauce is having a friend or relative on the inside. If you don't have that it's a gamble most of the time imo.
Yes. Thank you for explaining my comment to me, and you're completely right, but you missed the part where my comment was based only on his first and last sentences, going the other way around. It did seem a bit dismissive and condescending at the time, so I had some fun with it.
And the only reply that to me somewhat seems to understand what's going on, getting lots of downvotes, so it got collapsed and you cannot easily find it:
> They just play a numbers game, and are able to discard perfectly capable candidates with this kind of questions, and still get a bunch of great candidates
This story seems like a bad example of a false negative. There were claims by Google insiders that Max was given a particularly easy interview as a formality. I am by no means a brilliant programmer and also rusty with algorithms, but given the structure and question, I would be surprised if most people couldn't figure out how to invert a binary tree within a matter of minutes.
Max himself later opened up[0] and admitted to being difficult to work with. It is entirely likely that he was rejected based on his personality and not his ability. As someone who contributed to Homebrew many years ago I would not be surprised if this was the case. In their own words: "I am often a dick, I am often difficult, I often don’t know computer science". I am not sure why any company would want to hire someone like that and put the culture of the team in jeopardy.
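For context, this is roughly what the infamous question asks for; a minimal sketch assuming a plain node with left/right children, not the exact interview prompt:

    from typing import Optional

    class Node:
        def __init__(self, value, left=None, right=None):
            self.value = value
            self.left = left
            self.right = right

    def invert(root: Optional[Node]) -> Optional[Node]:
        """Mirror the tree by swapping left and right subtrees recursively."""
        if root is None:
            return None
        root.left, root.right = invert(root.right), invert(root.left)
        return root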
I love your response because I have a PhD in Computer Science. I do LeetCode and CodeWars problems in the mornings for fun (I like coding and it helps me avoid getting rusty), and I have been a hiring manager for over 10 years.
But yeah, other than the "sexiness" of CS concepts it's only because of the "narrative" we want to push.
So you do think LeetCode is useful? What's up with the flex? Congratulations on the PhD and job, I guess? Looks like you could definitely invert a binary tree.
LeetCode is good for some things, but in my experience it is not a good filter for software engineers (it is a good metric for filtering candidates to form a team for the IOI or similar).
This is a response I wrote elsewhere:
I've been growing engineering teams for the last 10 years as a hiring manager in different startups. At some point in our startup we had those kinds of HackerRank questions as filters.
The thing we realized is that those sorts of interviews optimize for hiring a specific type of very junior engineer who has recently graduated, or is graduating, from CS. That is because those are the people who have the time to churn through these types of "puzzle" problems. In particular, there are 3 types of recent graduates from CS or related fields: the ones that don't know crap, the ones that focus on these sorts of problems, and the ones that are "generalists" because they dove into all sorts of subjects during their degree.
I found that the junior people who excel at those sorts of problems have a huge learning curve to climb before they're productive in a "production", real-life environment. The "generalists", on the contrary, do better.
We stopped doing those sorts of algorithm-puzzle interviews after that realization, and we started getting really good engineers with great real-life experience.
Yeah, nothing wrong with not having memorized how to do that - it's certainly not often useful in practice. But if you're given the definitions and a couple minutes to figure it out (which, in an interview, you are), and you actually can't come up with the 10 lines of code to do it, maybe that question is working as intended after all...
> But if you're given the definitions and a couple minutes to figure it out (which, in an interview, you are), and you actually can't come up with the 10 lines of code to do it, maybe that question is working as intended after all...
or maybe the person struggles to work things out in an extremely stressful situation
I speak as somebody who finds interviews Extremely stressful, and if working there regularly means my interview levels of stress, then please do reject me.