A Google Cloud support engineer solves a tough DNS case (cloud.google.com)
771 points by sciurus 13 days ago | 275 comments

This is a fun debugging story, but it's also a great example of why servers should be cattle, not pets. Having trouble with a VM? Blow it up and get a fresh one. Still having trouble? The provisioning steps are codified, so you can walk through them and find the one that causes the issue.

The fleet of machines I owned at Facebook was around 10,000. I still remember the odd JVM crash that prompted me to reimage a machine. I wouldn't have remembered it, except there were a few that month, and it was on the 3rd time reimaging the same machine that I thought, "That's odd... I think I know that machine name." Checked the history, saw the 3 repair jobs I had submitted... the RAM was reseated, and the CPU was eventually guessed to be bad.

Cattle is a threshold, but when the same problem keeps coming up it's time to call the vet. http://rachelbythebay.com/w/ has many good examples of this, some submitted and voted up here.

It is indeed folly to assume that cattle have no identity. There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs. Sometimes a machine just takes an arrow to the knee and it's never the same again. Then there are all the uncontrolled or unrecorded differences between machines: the ones at the tops of the racks, or the ends of the rows, are hotter (or colder); there's some difference between the same model of hard disk made in Hungary compared to the ones made in Mexico; at some date the BIOS vendor made an undocumented firmware revision that changes an obscure energy/performance register in your CPU; you have a machine with a dead CMOS battery that worked normally until it was rebooted.

Cattle is a good philosophy but it takes a huge amount of work to approach perfection.

Once, the head of IT of a company I used to work for was touring the datacenter, passing some new racks filled with blade servers. He stopped and said, "Why are all the fans running full blast on this rack?" The admins checked: the machines were running a test workload at scale that somebody had forgotten about a few weeks before.

Everybody was embarrassed because no monitoring caught it, but the VP of IT did, by walking past the cage.

Admiral Rickover was known for walking into the engineering spaces of nuclear ships and just throwing a valve handle that would force a reactor scram. Not infrequently on a submerged submarine. Just to make sure the team was on their toes.

I'm not a sailor or a nuclear engineer, but that doesn't sound like a great idea. Should the Chaos Monkey approach really be used on nuclear systems?

Aside: was it even legal for him to do that?

If you are responsible for building the industry that designs and builds nuclear submarines that carry nuclear missiles, you had better make sure that those submarines and their crews can handle chaos monkeys.

Also, Rickover was Congress's favorite admiral. They forced the Navy to promote him. I'm pretty sure they made sure that the laws were to his liking.

Rickover was the sort of person who had the technical expertise, gumption, and charisma to get away with this. It's worth reading up on this amazing individual, who made the Navy's nuclear reactors so safe and who led the creation of nuclear reactor expertise within the Navy.

Also, if you're not confident enough in your nuclear reactor to apply chaos monkey techniques, you shouldn't be engineering nuclear reactors.

In general, one would hope that a nuclear system is designed such that the problem can easily be corrected if a single button or lever is accidentally pressed. It would be quite a terrible system if you could e.g. trigger a meltdown with just one action.

A reactor scram is basically just an emergency shutdown. If I were a nuclear engineer, it might give me a heart attack to hear the scram alarms, but I would be plenty happy knowing the scram works.

I walk the server room daily, every morning. I've tended our monitoring system for 15 years now and I don't trust myself to be infallible. I'm also the MD ...

MD = VP for those not in finance

In the UK MD = CEO (when not finance), so without more context you couldn’t really say.

He’s either an MD in finance, which is the equivalent of a VP (roughly) in other kinds of companies, or he’s the CEO.

> It is indeed folly to assume that cattle have no identity.

My dad had a book of almost all the cow names in Norway[1], as of 1988. As a kid I found it rather fun to just flip through it and read some of the names, often wondering how they came up with them.

Since then, however, it seems the tradition of naming cattle has declined[2]: fewer than 30% are named now.

[1]: "Gullhorn og dei andre : kunamn i Noreg" https://urn.nb.no/URN:NBN:no-nb_digibok_2010111708049

[2]: https://www.nrk.no/nordland/kyrne-far-ikke-lenger-navn-1.835...

I only know of small farmers who name their livestock, but that was back in the States.

Compared to US-scale farming, I'd guess most Norwegian farmers are "small".

Indeed. When the 1988 study was done, which resulted in the book of cow names among other things, there were about 360k cows in Norway total, and the average number of cows per farm was ~5.6.

These days the number has risen; IIRC it's around 35, though that's still quite low compared to larger countries, I imagine. I'm pretty sure the variance is quite high, however, with a fair number of farms with just a few cows dragging down the mean.

There was a time when I managed a few clusters of MySQL servers provisioned on OVH bare metal. (Yes, we know: "they're the worst", or at least used to be.) Everything ran well until we upgraded to their largest offering. It turned out those machines were NUMA, so we had to retune the MySQL parameters. After sorting that out they all worked fine, except one machine. Note that we already had a pretty good relationship, so I'd requested the exact same motherboards and BIOS on these machines; somehow one machine always came up different. I couldn't trust OVH to set the CMOS the same, so I carefully checked it against one of the 'normal' machines. The cores reported differently and had different affinity characteristics. I never did find out why; perhaps a CPU microcode/stepping difference that I never investigated. In the end I just worked around the issue (with a different tuning for that one machine), because asking for a replacement resulted in a similar situation.

> There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field

I really hope the video is edited to strongly resemble this scene: https://www.youtube.com/watch?v=N9wsjroVlu8

It was made to reference exactly that, down to the attire.

One of my computers has been aptly named 'THESEUS' due to what was replaced on it. By the time it was repaired to an acceptable level, the only original component remaining was the chassis.

Heh... that reminds me of my favourite troubleshooting story!

We had a customer with regularly failing tape backups. CRC errors, verify pass failures, even failed writes, and so forth.

We replaced the tapes with new ones. Same issues.

We replaced the tape drive with a new one. Still the same problems.

We replaced the internal ribbon cable and the SCSI controller. No luck.

Firmware flashed everything. Didn't help.

New server chassis, wiped the OS and reinstalled everything from scratch. Changed the backup software just in case. The backups still failed!

Literally no part was the same. I went on site to start looking into things like the power cables, the UPS, or vibration issues. Basically, we were getting desperate and grasping at straws.

I was sitting down in an office, casually chatting with the IT guy while we were waiting for 5pm so we could reboot the server. He's leaning back in his office chair, and he casually picks up one of the tape cartridges and throws it up in the air, then catches it before it hits the ground. Just playing. Over and over.

I asked him if he does that a lot.

"Yes, it's fun!" he answered.


That is quite the story! Sounds like a very large amount of resources spent on that case.

What was your company's role? Backup services/devices?

This was general IT consulting back in the early 2000s. The customer was small, they only had three tower servers and only one had a tape drive.

Many would have hit the ground too! I'm twitching...

I was troubleshooting a computer once that would randomly shut off during boot and, one component at a time, I replaced everything on it including the motherboard to no avail.

Finally I took all the parts out of the original computer and put them in a different chassis and it worked! Put them back in the old chassis and back to the old problem.

Eventually I noticed that there was an extra stand-off in the first computer case and it was shorting out the motherboard.

It was literally the chassis causing the problem.

Back in the 90s we had a faulty DELL server that someone decided needed to have its BIOS upgraded. They didn't read the specs and upgraded to a BIOS not supported by the CPU.

Motherboard is bricked. Ring DELL for support. After going through the rigmarole of explaining what had happened and that we had a bricked motherboard, the person on the phone said "Have you tried taking out the CPU and rebooting?"

To avoid further delay in getting a replacement sent (we had 4 hour on-site at the time), we went through the motions. Not surprisingly, the motherboard was substantially bricked without a CPU.

The DELL engineer that came on-site was suitably amused.

My neighbour's computer stopped working after a lightning storm and asked me to take a look at it. It wouldn't boot so I started taking things out of it (hard drive, video card, modem, etc.) and trying again.

Nothing worked. Finally, I removed the processor from the motherboard, looked at it, and reinstalled it. The computer booted right up and never had another problem. Weird.

Put in an extra standoff on my very first PC build, and it wouldn't boot until I found and removed it.

Reminds me of the saying, "I have used this broom for 20 years. I only needed to change the broom head 20 times and the broomstick 10 times."

Trigger's broom scene from Only Fools And Horses. Classic British comedy

There needs to be some level of conformity between instances; they stop being a herd and become more of a zoo if the skew is too large. The workloads running on the instances shouldn't be able to tell which instance type they are running on, or your workloads should be written such that it doesn't matter (but at some point it will). Things that grow and move together wear together, so you will end up with a system that is designed against the empirical contract, not the stated one.

My approach has always been that somewhere in my fleet there is a heat sink that fell off and a CPU running at 400MHz. My last two jobs have started with me sitting down at my desk on day 1 and demonstrating this fact. After concluding that the zoo is unavoidable, the only thing left to do is write the software accordingly.

> After concluding that the zoo is unavoidable, the only thing left to do is write the software accordingly.

What do you do to write software accordingly? Make it detect when it's running on a dud? Have it run as best as it can anyways?

Suicide is a good solution, if some higher-level thing will notice and move the task to another machine. Batch frameworks can kill slow shards, or re-assign their work to faster shards. Clients of online services can direct more traffic to working shards and less or none to slow or broken ones.

"There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs..."

Are you perhaps thinking of the printer execution scene from "Office Space"?


> particular well-known machine was unracked

are you able to comment a bit further on why this machine was well known?

Because of the way engineers habitually run batch jobs with more replicas than there are machines, this one broken computer had crapped up every map/reduce job in that facility for a long time, and it had been sent to repairs many times without benefit. Many people knew instinctively that if their job was stuck it was probably because of the shard on xyz42 (or whatever the node name was).

I still remember the machine name. It starts with an l and ends with a 6. Over the course of a couple of years, pretty much all of its components (CPUs, RAM, drives) were replaced at least once. You could look up its maintenance history and it went on and on. I'm not sure if it was well known across all of engineering; from what I recall, it was in a cluster in Oregon reserved for a specific team. Because it was company property, no matter how doomed, they had to get signoff from upper management, close to Eric Schmidt's level, before they could destroy it.

My recollection, assuming it's the same machine I'm thinking of, is that it wasn't reserved for our team; rather, we left a do-nothing job permanently allocated to it, in order to prevent some poor other sucker from getting their job scheduled on it. (Because we, through painful experience, were well aware the machine had hardware problems; but we had long since given up on convincing the responsible parties to take it out of the pool, since it passed all their internal tests every time we complained. I don't remember how long this situation existed before someone finally took it out back and shot it.)

Could be a different incident and a different machine, though. I'm sure this story happened more than once.

Maybe a different machine? I meant that it was not in one of the general-purpose clusters: the entire pool was dedicated and a random team couldn't request Borg quota in it. For years, though, half of the Oregon datacenter was special for one reason or another.

The infamous machine did go through repairs and part swaps many times, as you could see from its long and troubled hwops history.

The worst machines were the zombies with NICs bad enough to break Stubby RPCs, but still passing heartbeat checks. Or breaking connections only when (re)using specific ports. Fun times!

I wonder if MR could integrate with a fuzzing engine that jumbles random combinations of real inputs into garbage but runnable jobs that cause reproducible crashes above some threshold (eg at least once per day, or if things are bad enough, once per month or something).

Regarding this system: the motherboard was never swapped?

In what way had the jobs failed? Very open-ended question :) but just coming from a hardware-diagnosis standpoint. (I guess the canonical answer is "here's the repair history," but yeah, duh.)

> engineers habitually run batch jobs with more replicas than there are machines

Idly curious, how do I parse this? It sounds like the same jobs are replicated to multiple machines in a sort of asynchronous, eventually-consistent lockstep arrangement?

> There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs.

Eyy what would I search on moma to find this video?

Sounds like they were re-enacting Office Space. If you haven't seen that movie... It's culturally relevant even today. Has a bit of profanity and such though.

Huh, I can't find it.

I can confirm I've read the same thing though, years back.

There used to be a go link for it (same as the machine name), but knowing Google, it might be stale. There's a good chance you can find more on the internal folklore site. If that one is still around, too.

Confirmed: the go link with the machine name works but I don't want to post it on HN to be safe :)

If any Googlers are reading this: just go to go/legends and search for officespace. The first link that pops up has context as to why the video exists.

Oh yeah there it is, hah


Back when we were buying hardware for a big (at the time) intranet, all the servers were bought from the same batch off Sun's production line. I recall our sysadmin saying he really wanted to do the same for the disks, i.e. a case of identical drives.

A super-bad idea, because drives made in the same week in the same facility will all fail at the same moment.

Unlikely; that's not how statistics works in production engineering.

With respect, I just went through a "code red" at a large, well-known cloud storage company caused by synchronized late-life death of hard disks all manufactured in the same batch. That's the second time in my career that I've been through the same phenomenon. Hard disks that are made together wear out together.

I can confirm this. I learned the hard way to buy hard disk drives of the same model but from different batches.

I'm curious how wide the failure window was (timespan, ramp-up/down, etc), relative to how many devices were involved.

And I wonder how well the signal in that ratio might scale down to hundreds or tens of disks.

Shockingly bad production engineering then.

Not once, but twice:


"Hewlett Packard Enterprise (HPE) has once again issued a warning to its customers that some of its Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation unless a critical patch is applied.

Back in November of last year, the company sent out a similar message to its customers after a firmware defect in its SSDs caused them to fail after running for 32,768 hours."

Can you imagine provisioning and deploying a rack or 3 full of shiny new identical drives, all in RAID6 or RAID10, so you couldn't possibly lose any data without multiple drives all failing at once...

(Evidence that the universe can and does invent better idiots...)

Your default assumption only works if every disk fails independently of the others. Which is definitely not true if you buy all the disks from the same batch.

Others have mentioned the problems with this strategy, but getting drives with the same firmware is done routinely to avoid having slightly different behavior in the RAID set.

I don’t think they make hard disks in Hungary.

The infamous IBM Deskstars aka Deathstars were made there.


Also at FB: one day we got a huge spike in measured site-wide cpu usage. After the terror subsided, we found that a single request on a single machine had reported an improbably huge number of cycles (like, a billion years of cpu time). We figured a hardware problem and sent it to repair. A month later the same thing happened to the same machine; it had just been reimaged and sent back into the fleet. There was some problem with the hardware performance counter where it randomly returned zero, but after that we made sure it was removed permanently.

Alternatively, you found a machine from the future :-)

Out of curiosity, do these reproducibly-broken components ever make it into an upstream testing environment?

Yes, in a few cases the hardware would go back to its manufacturer for more investigation. The ones I'm aware of were more subtly bad, and more reproducible than this one though.

Had a similar issue. Spent weeks debugging a process, adding logging and metrics, making graphs to figure out where the performance cliff was in an email-sending service. The results were so weird they were unexplainable.

Someone suggested just nuking it and bringing it back up on a fresh instance. Problem was gone! Everything was running smoothly again.

> Cattle is a threshold, but when the same problem keeps coming up it's time to call the vet.

If the CPU was bad, then that means you kept running the instance on the same node. A quick way to test whether it was really "cattle" would have been to try it on a different node.

Additionally, if the CPU was bad, how was it not affecting other services?

I think the issue with bad CPUs is that they error unreliably.

I’m confused as to why the solution wasn’t to just replace the machine (preferably automatically) the first time it failed.

I'm curious, how many machines does FB have?

The end result was a kernel patch to LKML, so I for one am happy they solved this problem at the source.

Did you read the article?

The customer had set `net.core.rmem_default = 2147483647` on purpose, which exposed a kernel bug. The whole herd would have had the same issue.

I think what he's trying to suggest is that the customer may have been able to isolate the issue faster by walking through the provisioning settings for the machine to identify core changes.

The bug report resulted in a core fix, which is a better result than if the customer had fixed it themselves of course.

Also, it was very nice of Google to follow up and submit the patch to LKML. IMHO this goes beyond the scope of their role. They could have taken a more selfish approach and accepted the bug as "normal" behavior, and advised their customer to not configure the buffer to such an enormous size.

Any decent engineer would smile at the fact that they just found a bug in this type of open source stack and happily submit it. I feel this is more of a side effect of individual behavior than of company policy.

Until you hit something like "bug reports should be submitted here, not there, filled out with the appropriate information and matching triplicate documents, and make sure to CC the grand poobah; also, CLOSED-WONTFIX" enough times, and you just stop submitting reports.

Unfortunately, not every engineer has an employer that would let them do this…

"Professional responsibility", if nothing else.

Was thinking the exact same thought as I finished up the article...

I really want to know if the telemetry service (or whatever the thing was) pushed enough packets to actually warrant the config. Setting something to the max sounds like premature optimization to me. This must be a truly exceptional condition if it actually remained undiscovered since Linux 3.

I can't think of anything that would warrant a 2GB receive buffer. The buffer should be sized so that the receiving program has a reasonable amount of time to drain it before it becomes full. A large skylake VM in GCP can do 32 Gbps (lowercase b), so assuming worst conditions, a 2GB receive buffer would give the receiver 500ms to call recv(), which is a huge amount of time for something that should take microseconds, especially in the context of a client.

Even if there was a specialized server that needed such a large receive buffer, it doesn't make sense to set the system-wide default so high.

That's awfully convenient, and I can't deny having done this, but it's also a great way to never understand what went wrong.

Re-provisioning a failed server to solve the problem and taking a deep dive to find the root cause are not mutually exclusive. Essentially all VM software will allow you to snapshot/backup/clone the VM for later analysis, while also fixing your production environment _now_.

When I have a service that's acting weirdly, ideally I'd like to snapshot the VM, then do whatever I need to do to repair the service urgently.

That might involve making a new VM from scratch, but it might also be twiddling some settings, or other emergency changes.

Afterwards, I want to be able to restore the VM state, probably in a firewalled off environment, so I can debug exactly what was wrong.

Sometimes I'd like to do it to a set of VMs; for example, if there is some DNS weirdness, I might want to snapshot both an application server and a DNS server.

So far, no cloud provider seems to offer functionality to make that easy, which is a bit disappointing.

If a freshly provisioned VM doesn't have the same issue, then they're not using automated configuration management (Puppet, Chef, or similar) for these settings, and then they have a more serious problem, as nothing in their runtime environment is "codified" or predictable.

plenty of issues are not deterministic, even with 100% of everything managed by configuration management software.

But /etc/sysctl.conf is deterministic.

Why would that file necessarily correspond to the actual sysctls in effect?

If you use an automated configuration management system such as Puppet, you don't ever run sysctl manually in a shell. Instead, everything is controlled by the configuration management system.

sysctl is a bit problematic in terms of exhaustiveness. That is, how do you ensure that the kernel only has its original values plus whatever you put in sysctl.conf, and nobody actually ran sysctl manually at some point? But it's possible to do.

I have seen so much random behavior from Puppet runs. It's basically a big fancy wrapper around a bunch of shell commands (much better than the raw shell commands, but subject to all the same bizarre race conditions and so on). We had to wait 30 minutes to use a newly created VM so that Puppet had run three times, at which point it was >0.99 likely to be good. (If it wasn't, it was killed and we retried; 30 minutes was chosen to minimize the expected time. The Puppet config had been migrated from cfengine, was based on a lot of hostname-based regular expressions, and was very dangerous to debug/refactor.)

Puppet can be difficult to get right. Dependencies are _very_ hard to get right, despite the fact that Puppet is virtually designed around the idea of dependencies. I'm a fan of the concept, less a fan of the execution.

Unfortunately, the competition (Salt, Ansible, Chef) aren't really any better here.

These days, I run Kubernetes whenever possible, and keep the base OS light, which makes the configuration management surface extremely small.

After years of pain, I've come to appreciate what was once relayed to me. All configuration management software is broken. They are equally terrible, each in their own merry way. The only thing you get to do is to choose the one that sucks the least for your use-case, and two years down the line hope that you made the right choice.

Which is why I have come to believe that the very concept of host configuration management is broken. We should do it as little as possible, preferably NONE AT ALL. Sure, use something like Ansible to run the image creation steps, and provision the necessary first-boot scripts in place. Only leave the steps in that absolutely can not be done during image pre-bake.

Cycle your hosts without mercy, so that new ones are brought up from fresh pre-baked images, continuously.

And even for the few unavoidable snowflake hosts (eg. those that have to live outside the K8S cluster), follow the same strategy. Make them disposable, so that you can bring up a new one from their own pre-baked images on demand. Try to keep the delta between the snowflake base and your cattle base as small as possible.

Configuring live hosts should be considered an anti-pattern - if you find yourself doing it at all, take a step back and consider how to get rid of the need.

Absolutely. My solution to this is Kubernetes on GKE, and limit the number of non-GKE nodes to the absolutely minimum.

OK, so one reason this file might reflect reality is that some automated system wrote the file and subsequently ran sysctl -p successfully. But there are dozens of reasons why the file and reality could differ. The only source of truth is sysctl(8) or reading the files in /proc/sys, and those are the values that need to propagate to observability systems and decision-making.

The configuration management does whatever you tell it to. In this case, it's your responsibility to ensure that sysctl.conf is exhaustive, and there are ways to ensure that it is. If anyone applies changes on the side, they will be reverted on the next pass. Not saying it's easy, but it's not impossible. Making this exhaustive with /proc is another story.

Unfortunately, this is the only way until kernel folks stop treating random mutability as a good thing. Right now, the kernel has way too many mutation points, and it's not (as far as I know) possible to ask it for a "diff" against the defaults.

Not all issues stem from sysctl.conf

No, but that was the context of my comment.

I can't agree with this. In the case that was debugged, the configuration was probably changed for a valid business reason and was codified. If you have traditional VM infrastructure, your provisioning may consist of hundreds of small configuration changes. Are you really going to step through every single one of them manually? Configuration management tools don't exactly have a `git bisect` equivalent, and even if they did, you'd have to re-image the VM every time, because VMs are stateful.

And even if you could somehow bisect every single configuration change in your configuration management, there's the added complexity of how many configuration changes are actually needed to test if the problem is still present. In this case it would probably be fairly easy because DNS is such a core "feature", but if this is something more application-level, you're really going to be lost.

Once upon a time I got pulled off a project in the middle; the project was given to a new, but experienced, developer to finish. I was talking to a friend who is one of our cloud guys the other day and he tells me "my" app is a bit of a problem child. It seems it crashes regularly but the problem is mitigated by the server restarting, so no one has any urge to fix it. (As a professional, I'm offended.)

Servers as cattle can cover a variety of sins.

While I agree with your premise, it's important to stress that finding solutions by 'blowing up your server' and starting fresh is rarely sustainable.

It's often a useful exercise to dig into the root cause - a lot of times the problem you're seeing is just the tip of the iceberg.

We really need a new analogy - I assure you that ranchers care when they lose a cow.

> servers should be cattle not pets

I'm stealing this.

Hey thanks! I've never heard it but given that I'm old and it (the phrase) was coined in 2011-2012 I'm not surprised.

Awesome! Everyone learns something new every day :)

We just pray we don't learn the "obvious thing everyone who's any good knows" in an interview.

If you're interested there are a couple of great books that dig more into this kind of thing.

The Phoenix Project: https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...

Its related Dev Ops Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliabili...

You don't have any information about where the customer was on the cattle/pet divide, nor if the general advice to "treat servers as cattle" even makes sense in their case. Regardless, the whole point of the exercise was to find the root cause and prevent it from happening in the future. Sometimes you gotta dig in and do the work.

They had a perfectly valid configuration and it uncovered a bug in the Linux kernel. That wouldn't have happened had they ignored the issue and tried again.

Of course, anything necessary to get your production box producing, but a well-engineered server is worth the debugging time.

the cattle-vs-pets thing is not really an excuse to not root cause a persistent issue while you have one.

The LKML message described in the post is here: https://lkml.org/lkml/2019/12/19/482

Something I'd like to add here: the actual fix is

    + if (rmem > (size + (unsigned int)sk->sk_rcvbuf))

In practice, this would have worked too:

    + if (rmem > (unsigned int)(size + sk->sk_rcvbuf))

(The bit pattern of the result remains the same, and it's still converted to unsigned int during the comparison.) However, signed integer overflow is undefined behavior in C while unsigned integer overflow isn't, so the submitted patch is the correct solution.

Those are not safe to treat as equivalent, even if it might work in theory. You should always cast as narrowly as possible, and when you see code doing otherwise, look very carefully for bugs.

If A + (cast)B is a correct form, then (cast)(A + B) is generally an inappropriate form. As you note, it’s possible it will happen to work, but it’s not good form.

Yes I qualified it at the end why the former is the correct solution and not the latter.

I figured that "in reality this would have worked too" is an easy phrase to misinterpret as "this would have worked" (assuming they miss the context a paragraph later), so the reply helps ensure that others do not misread it as I initially did.

Yes! Thank you.

I'm not an expert with AWS or Google Cloud, so I'm interested in knowing:

What "level" of customer or SLA do you have to be to get a certain quantity or guarantee of support and troubleshooting? Or is it that if even a free-tier customer points out something that is fundamentally a problem, it will receive attention by certain solutions engineers?

Are there $ spending, 20 x (c3.4x.large), or I-pay-you-for-certain-uptime/troubleshooting levels that get you certain response levels? Do certain problems get resolved with "well, you just have to live with that behavior, we're not fixing that".

Do you get to call them or chat live? Or is it all via tickets?

At my previous employer, a large network services provider, everyone got the same depth of support, even people that were on the free tier! Granted, the people paying us $ENTERPRISE usually got responses in minutes, whereas free users might be waiting around for a week on more difficult cases, but it was the same set of engineers working on each.

Free customers did report bugs, and we would replicate, triage, and fix them as usual. These tended to be more obscure bugs (more severe ones would usually surface in the paid queues first), but we didn't discard them immediately.

Whether we effectively ignored bugs depended more on the ability of the customer to provide an actionable report. Some users provide exactly what we need up front, but there is a lot of "it doesn't work!" white noise from users who aren't able or willing to put in the work to accurately describe their issue and/or act on feedback from us. There's usually not a whole lot support can do in that scenario if we don't see any obvious issues, but we'd go a bit further to placate paying customers--I fondly remember joining a call between a very technically inept user and their ISP; the user was adamant that either we or the ISP were at fault, even after we guided their network team through taking local packet captures showing unanswered SYNs past their network border.

Here you go:


$250/month/dev is the minimum for phone calls on technical issues; $150k + 4% of GCP spend for 'come running' support.

There're more details here


though they use the old names for the support tiers.

It seems like the blog post talks about a written case report, which the $100 tier has access to, albeit with a 4-hour first response instead of 1 hour. So it is possible that you could get your case escalated to such in-depth debugging with that tier?

I've seen free tier tickets get escalated to product engineering teams for investigations about sub $100 charges.

It's possible, but there are steps to jump through, and the expected response times don't change even if you get escalated to a TSE from a different area.

Disclaimer: I am a TSE...

What does "dev" or "user" mean here? For example, if I run a server serving some HTTP API.

I believe it means "person who wants the ability to contact cloud support".

For small to medium sized businesses, that number is probably 1.

When purchasing support, you should consider that you are really buying an expert who knows Google Cloud very well, but who doesn't have special buttons to click to do things you couldn't do.

If a service you depend on is down, your support agent will be able to tell you it's down, but not speed up the fix.

Cloud support will have more information about performance black holes and limitations that the documents don't describe. They also will be able to advise on "is this design or that design likely better". They generally know how the backends of GCP services work, and their common failure modes, which is pretty hard knowledge to get from the outside.

This sounds a lot like my experience with AWS support.

They really deeply understand AWS, are very responsive to calls/emails, but often have no tools to solve the problem we're having right now.

This even happened with some of their high-end hardware: we did an upgrade for a critical set of instances to some extremely pricey dedicated hosts and ended up in a runaround for over a week due to a hardware issue on their side.

This question brings to mind a recent experience I had with Scaleway support. I pay maybe 25 euro a month to host my k8s based application on their managed k8s offering. I did not pay the extra 2 euro a month for an upgraded support tier. I encountered an issue when deploying istio to the cluster and pinged the Scaleway support chat on a Saturday; within a few minutes they had figured out the bug on their end and given an ETA for a fix, which landed the next business day. Those guys have fantastic support.

Support should be judged on how they perform under load, and how they perform consistently, not how they perform on random events.

We don't really have any way to know from your story whether Scaleway's support was under normal or extra load and delivered an excellent experience, or whether they had a bunch of bored support reps just waiting for something to work on because it was abnormally slow. The latter is nice, at the moment it happens, but doesn't really help you if 2 days later for a different issue you're left in the lurch for days on end because they're busy. The former would be good for maybe indicating that.

That's the whole point of service level guarantees. They provide a lower bound on the support you'll receive, which is often much more important and useful to track.

"When sk_rcvbuf gets close to 2^31, adding the size of the packet can cause an integer overflow. And since it’s an int it becomes a negative number, therefore the condition is true when it should be false (for more, also check out this discussion of signed magnitude representation)."

And this is why you don't generally use signed numbers in systems code, unless you specifically need negative numbers. And why you gradually develop a paranoia about the sizes of numbers.

I'm not sure how using an unsigned number would help, given that when it overflows you're still going to have some code do unexpected stuff anyway.

For one thing, it's a clue you need to step back and think, "What happens when this overflows?" rather than "Oh, it's just a number."

For another, that's why you get paranoid.

(For a third, I strongly recommend something like Frama-C with the Weakest-Precondition module---it's very good at finding issues like these.)

I'm not convinced that it would have been any more obvious to the person who made the error that the variable could overflow if it were unsigned.

It's also much easier, IMO, to accidentally underflow an unsigned integer; it's so much more common to work with 0 than it is to work with +/- 2 billion.

I'm being pedantic, but technically wrapping from 0 to UINT_MAX is still considered overflow. Underflow refers to decimal truncation, e.g. by integer division.

well... at least it takes twice as long to overflow.

Signed or unsigned doesn't matter much; both can overflow.

I actually started putting assertion checks about overflow issues almost everywhere but it requires great discipline. I wonder if there is a better solution available.

If you can allow yourself this kind of performance regression, you can compile in GCC or clang with `-fsanitize=signed-integer-overflow`, this will do runtime checks.

What happens when some dumbledork sets it to 2^32+1?

From the blog post:

> (if you try and set it to 2^31 the kernel returns “INVALID ARGUMENT”).

I would expect the same to happen for 2^31+1.

“...This means that the case will Follow the Sun by default, to provide 24/7 support”

I love the concept of “Follow the Sun” to describe 24/7 support - I don’t think I’ve heard it described that way. I wonder how much we’d have to spend to get that tier of service?

"Follow the Sun" is subtly different from 24/7. 24/7 can mean follow the sun, but it can also mean "we are prepared to page someone at 2 AM and wake them up." Follow the sun means "there is an engineer in China, India, France, Boston, and San Francisco, and at least one of them is always at their desk and ready to take work."

The difference for users can be fairly small, but as someone who used to carry a pager for Amazon, the difference is really huge for the support person.

For Google Cloud, $250 per month per user: https://cloud.google.com/support

Thank you.

I know that this phrase was used way back in the early 90s. I suspect it was used before then as well.

    if (rmem > (size + sk->sk_rcvbuf))
      goto uncharge_drop;
What is rmem in this case? I'm a bit confused as to why it is written that way. This drops the packet right when it overflows the buffer?

It's not very literate, is it? rmem is initially the sk_backlog.rmem_alloc field of struct sock. There is no comment in net/sock.h about what this field might mean. People who modify this function just have to guess. I also appreciate that this function adds |size| to rmem_alloc, tests for limits, then later subtracts |truesize| from rmem_alloc. This happens to seem correct, but it's just asking for someone to accidentally screw up the accounting in a later change. Reading this function only reinforces my view of Linux code quality.

Another interesting thing is how stewardship of this logic and the comment right above it have diverged.


  /* we drop only if the receive buf is full and the receive
   * queue contains some other skb
   */
  rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
  if ((rmem > sk->sk_rcvbuf) && (rmem > size))
    goto uncharge_drop;

  /* we drop only if the receive buf is full and the receive
   * queue contains some other skb
   */
  rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
  if (rmem > (size + sk->sk_rcvbuf))
    goto uncharge_drop;
How does the comment correspond to each block?

Basically, rmem - size was the number of bytes that were consumed before this current packet. On the line before, we atomically added size to rmem and read the resulting value in a single step. Call this "staking our claim" to part of the buffer; the meaning of "uncharge" here is discharging this claim by atomically decrementing the counter before dropping the packet.

Probably this bug would not have happened if this comparison were written as `rmem - size > sk->sk_rcvbuf`?

So it is saying that the simplest sanity check is "if the buffer was already full before we staked our claim we should drop this packet immediately." As the "goto" indicates, there are then a bunch more checks on other circumstances where we should also drop the packet. Due to the quirks of multi-threading it is of course possible that some packets get unnecessarily dropped between when we stake the claim and when we discharge it, which the code just accepts -- the thinking is presumably "yeah if the buffer is full a lot of packets are gonna get dropped and that's just life -- it's much less important that we dropped some extra packets when we were already dropping packets, and much more important that we don't mismanage the buffer's memory when it's nearly full."

A comment suggests that part of the reason for this awkward phrasing is that it is possible for rmem = size, in other words the buffer was empty when we staked our claim--and in this case we don't want to drop this packet even if it would overflow a small max-buffer-size. I think the idea there is "we already have the socket buffer allocated, obviously this thing fits in memory, so let's just handle it if the queue is empty rather than dropping every single packet that is larger than the queue size."

Reminds me of this classic haiku -

It's not DNS.
There's no way it's DNS.
It was DNS.

Is Google Cloud support this good in general, or only for certain tiers or plans?

Whenever engineering support steps in, the problem gets solved fast or the issue gets acknowledged. However, over the last 15 months, ~50% of the time our GCP support cases have ended in frustration with no resolution. We have more success finding issues in the corresponding GitHub repo and opening an issue there.

We have the expensive off-the-shelf support option (I think $450/seat/month) for 1h response.

In most cases, we spend more time going back and forth with support than it would take to figure it out ourselves. I'm talking about issues that span weeks, with tens of hours spent. We end up reiterating the original support case problem (i.e. the support engineer doesn't bother reading the actual problem) whenever the engineer changes.

We've had: P1s where the support engineer told us we'd get an update the following day, only for us to figure out that there was a breaking release on their side that exactly matched our description; and, while investigating a load balancer issue, a support engineer who looked at the LB logs, saw a ton of logs coming from penetration scans (e.g. GET /phpmyadmin), and suggested that the solution was to open up those addresses.

The cynical part of me expects that this case was handled so well because a) the support people found the issue fascinating and fun to work on, and b) the post-mortem would make an excellent blog post.

On the flip side, it's encouraging that they have people somewhere in the support chain who are capable enough to read Linux kernel code and submit fixes upstream.

Author of the article here. I only thought of the possibility of making a blog post after the case was closed and I started telling my colleagues about it, and realized I would have loved to read about this.

The case was indeed fun to work on, but the main reason it had such a fast and happy resolution was that the customer was very responsive and very cooperative.

I cannot talk for every Technical Solution Engineer, but I can tell you that I have no particular interest in simply closing a ticket: I want to go down the rabbit hole and solve technical issues, and I know many of my colleagues feel the same.

I am also far from being the most senior or skilled TSE in Google Cloud Support; I just wrote an article about one of the most interesting cases I've had.

I'm inspired by how much you seem to know about the details of computer networking. Is that required knowledge to become a Google tech support person, or are you just above average in that regard among your peers?

Also, I wonder how you learned all this knowledge (that is, I'm asking for recommendations on a few books/resources for learning), if you don't mind sharing. Thanks in advance!

I don't have deep knowledge of the details of computer networks; there is a team of TSEs who deal with network cases and know more than me. But the whole point of troubleshooting is not knowing what is wrong, but being able to find what is wrong. In order to do that you need a good foundation, which you can build by studying how networks and Linux systems work (someone here posted some titles) and through experience (I have some grey hair myself). But every time you troubleshoot something you end up touching something you don't know, and that's where you learn something new you might use next time. For example, I didn't know about dropwatch; a colleague suggested it to me.

During the interview process at Google we don't expect candidates to be able to get to this level of depth, but we try to hire candidates that could, over time and depending on their skill set, potentially reach a similar level of depth and ability to troubleshoot cases.

Following up on amessina1's post,

I'm one of the TSEs who handle networking cases. True to what was said, I was hired with very little networking background, but plenty of development and hardware experience.

I've since taken the mantle for handling most of the cases dealing with Interconnects and VPNs. I enjoy it too!

Oh, yeah, we're hiring: https://careers.google.com/jobs/results/?company=Google&q=Te...

Thank you! This is encouraging. My background is mostly in Python programming and SQL. But in an alternate universe, I wish I am a network ninja like you guys and I will definitely check out Google TSE jobs when I can look for new jobs (currently hoping to get my green card done). If there's a book or two that is the most useful for you to be an efficient TSE, please feel free to share. Have a restful weekend!

Thank you for writing it. I'd never really had a glimpse of the network debugging process.

The front-line support is the same as anywhere else. But Google Cloud has really really good second and third line support, if the first tier can't figure it out. And in many cases, it'll get escalated directly to the implementing engineers.

In my experience, Google Cloud is better than most organizations about escalating hard issues up to the chain. Admittedly, this happened at a company with substantial spend, and I can't say one way or another whether a smaller player would get the same quality of support.

How much is "substantial" for GCP to get decent support? Tens of millions/y certainly wasn't cutting it...

If you weren't getting substantial support at tens of millions/y, your finance team did something deeply wrong when negotiating the contract.

Wait, what does it have to do with the finance team? We did get their "platinum" tier or whatever.

If you're spending that much money (this isn't GCP specific -- this is any cloud) you should be establishing a 1-3 year min-commit contract, and in practice, this will get negotiated through the CFO. This will get you massive discounts -- 20-30% under list price, in exchange for spending $X million/year over Y years.

It will also get you a dedicated sales rep and sales team, and they will absolutely crack the whip on internal teams to get issues resolved. At those spends, you can almost get an in-house support team of PSOs to bounce problems off of.

Yeah, not talking about the discount. The discount was nice (or so I heard, but if you do a 3y commit you can get that anyway).

> and they will absolutely crack the whip on internal teams to get issues resolved

Not in my experience. Although we did get an ever-rotating rep. I think they changed three of them in like a year or so

Google only gives you good service if they respect you as engineers. We'd say stupid stuff and get the cold shoulder, and then later would find some cool bug with encrypted VPNs dropping packets (with no monitoring in GCP, only our tcpdump from various places) and got some very skilled network engineers looking at the data and making code changes. They still muted us for long periods of time while talking amongst themselves, but did deliver.

I don't really think they care about you as an engineer. I had some "cool" problems which they totally neglected for long periods of time, and the general vibe has been "you need to prove to us this is our fault". I think the real reason is that org incentives aren't set up to make the infra team happy.

Contract-level commits will generally work on top of Committed Use Discounts, not instead of (but obviously this comes down to your own SKU by SKU negotiating).

The only support for basic accounts is for billing

Even on platinum plans they are terrible, in my experience. Endless pinballing of tickets and trying to blame the issue on the customer. We had two-day outages several times at my previous gig because Google support refused to acknowledge the problems.

> they use raw sockets! Raw sockets are different than normal sockets: they bypass iptables

But this bugreport says raw sockets would be filtered by the OUTPUT chain of iptables: https://bugzilla.redhat.com/show_bug.cgi?id=1269914#c4

Is that accurate across distros? It does make sense for some socket types, like device sockets, to not be routed through iptables.

I think that bug report is misleading. Raw sockets do bypass iptables but they still go through ebtables. They hook in at the ebtables NAT OUTPUT chain. See the diagram here https://erlerobotics.gitbooks.io/erle-robotics-introduction-...

Thanks for that great link

> they use raw sockets! Raw sockets are different than normal sockets: they bypass iptables, and they are not buffered!

Can someone elaborate on the above statement from the article? Does this imply that raw sockets have unbounded buffer?

Clearly not, but raw is raw. For the purposes of this article, it's enough to say that raw sockets, being raw, don't traverse the flawed block of code, which is in net/ipv4/udp.c

In short, raw sockets can push bytes into the network card.

The commonly used socket functions recv/send, by contrast, construct the required headers for TCP, UDP and whatnot; they handle encapsulation/buffering/connections/etc., so they're easy for developers to use: just read and write application data.

By nature, raw sockets skip the TCP/UDP layers and a good chunk of the kernel networking code, including the place where the bug was located.

Probably the opposite, they send without waiting for any bytes to build up.

Seems like a good place to mention this: I once was troubleshooting an Outlook issue where email stopped working after some time, seemingly at random. Turned out that Outlook picked up the IP address for the mail server backwards - so instead of WW.XX.YY.ZZ Outlook tried ZZ.YY.XX.WW. Found that by using sysinternals network tools, confirmed with Wireshark. Thunderbird worked, ping to the domain name (of the mail server) worked, Outlook worked on-and-off, ...

As it was MS Windows and I'm a Linux native I didn't really know how to investigate further - I guess I couldn't without Outlook source.

Luckily setting a hosts entry fixed it.

I only found one other post online with the same issue, and they didn't have a solution. Presumably it was something like automated ISP rDNS entries getting parsed... but honestly I don't know.

Still curious ...

Would have loved to have found work investigating such things, as they said in the post, it's fun!

It sounds like maybe your mail server had a misconfigured reverse DNS entry? Those DNS PTR records look like ZZ.YY.XX.WW.in-addr.arpa. I'm aware that SMTP in particular has a dependency on rDNS, but I'm not sure about the details.


That's a pretty good and detailed explanation. Two things: 1) I hope they have a runbook for situations like this (i.e. the support engineer does not have to figure all this out on the fly); 2) the customer should have provided more details and maybe should have thought of the tweaks they made (the classic solution is to compare two instances - one that works, one that does not).

1) You cannot have a runbook for everything, and even if you have one, the best you could have found in this case is that something weird was happening in the VM. The setting had an insanely big value, but it was accepted by the kernel, so you would assume it was a valid one. 2) The customer provided a huge amount of detail, but it is usually very hard to explain what you changed from the base image. Most customers might not be willing to provide the full spec of their running system, as it might contain information they don't want to disclose. It is easier when the issue starts right after a change has been made, but this was not the case.

I'm not saying a runbook will catch everything, but it will give you a chance to solve the problem quicker.

I totally agree, but keep in mind that it is an iterative process: you have a case, you apply the runbook/playbook you have, if they are not enough you use your skill/knowledge to solve the case, then you update the playbooks.

Iteration is my middle name :) The point I was trying to make is that you should be prepared - if only for the 95% of cases that the runbook can solve. You also don't start with a complete runbook - you build it as you operate the service.

You can't runbook these sort of issues... You can ensure you have the knowledge and tools to root cause the problem.

You most definitely can, and you should. There are steps you can take to gather the info, and the linked example shows basic things to try. When you exhaust the runbook is when you start digging.

I agree with the point about information gathering. Good technical teams have guides for end users to collect relevant information. In this case the engineering team may have shared information with the TSEs to help narrow down the root cause.

This also highlights why support agents shouldn't be 100% engaged with customers. They need time to review and amend runbooks, consult with engineering teams, etc.

I don't think that this was a "support engineer". He was a god mode developer!

He forgot to mention that the kernel is Linux. It's almost like Linux has become the standard OS. Linux is the new Windows.

That being said, as you have noticed, Linux is the go-to OS for servers, and the post has a number of Linux-isms.

That is a pretty good thing to have.

I've been supporting AWS environments almost from the beginning, but I can't remember a case where I was asked for, or even considered offering support, a copy of a VM's storage volume. Is this common on Google/Azure/etc.?

It reminds me of my friend's hosting company that failed. They got a big customer and created a VM and the customer asked to fix a problem that involved getting a shell in the VM. Friend does it and the customer is gone next day.

That aside, even though we have had many support cases with AWS, and have the highest support level, they mostly cannot access user data, just the metadata. We had a major problem with RDS once, and they specifically requested to load that snapshot onto an internal instance to reproduce it. It can happen in AWS, but it's not very common, in my experience.

> It reminds me of my friend's hosting company that failed. They got a big customer and created a VM and the customer asked to fix a problem that involved getting a shell in the VM. Friend does it and the customer is gone next day.

The customer asked the support people to access a shell on their VM and then quit because...? Or did the support people access a shell without the customer's express permission?

I think it's because the support people even had the possibility of that access at all.

And I agree; support people shouldn't have that kind of access, with or without consent.

It is a fairly common method to test to see if they have access.

It's not common, but it certainly happens. Sometimes you just can't reproduce a problem without specific data.

I enjoyed reading this, but wouldn't running either "netstat -s" or "ss -s" to show protocol statistics have shown receive buffer errors/receive packet errors? It seems like this basic tool was noticeably absent from the early troubleshooting steps and the other standard troubleshooting tools used.

I understand the importance of the ultimate fix but wouldn't seeing an incrementing error counter for UDP have shortened some of the troubleshooting done to identify and resolve the customer's immediate issue?

The customer had set an extremely large buffer size and nobody thought to mention that? Perhaps the individual reporting the problem was different and was unaware of that unusual change.

I spend a decent amount of time investigating trouble reports, and in my experience it's quite uncommon to even get as much information as was provided in what Google showed. It's also fairly uncommon to get any of these sorts of rare configurations in trouble reports, and it usually takes some probing.

When I was a new engineer working in telco, one of the longest investigations I worked on was when connectivity broke between one of our regional roaming partners and 1/3 of our nodes (I'm summarizing to try and keep the story brief). We called them and asked if they had changed anything, reviewed the configuration and secrets used on the tunnels, etc., and were working with the vendor to go through any problems with the implementation. Saturday morning, and probably 20 hours of investigation later, a new engineer at the regional partner sees there is a work order for changes to the connectivity to our nodes (we were adding some new ones) that was supposed to be executed that week. A typo in the change overwrote the secrets used by an existing tunnel instead of creating a new secret for the new peer. The person we were working with to investigate was the person who implemented that change, and he told us several times nothing had changed. He was also the one who worked with us to read through all the secrets for typos or issues and didn't notice anything. Saturday morning he gets into the office, is shown the work order, and goes, oh yeah, I did that at exactly the time the tunnel went down. One fixed typo'd secret later and everything came right back up.

So just in my experience, I find it quite plausible that the buffer size was not mentioned. And even besides this story, I know I've personally missed connecting causes with potential effects when investigating a problem; it's very easy to dismiss some setting, like the buffer size, as unconnected to DNS behaviours, especially if they are not noticed together or without a strong change management system that helps connect the timelines.

There are many things here that concern me from a system view.

1. They need guaranteed delivery, but chose to use UDP.

2. They jacked up the default rmem buffer to ~2GB, which is insane. It also applies to all sockets, not just UDP, so I wouldn't be surprised if they were also running into issues with memory pressure, especially under load.

3. Support didn't seem to let them know that's a pretty unconventional configuration.

That was an interesting debugging story, and catching a bug like this is always good IMO. But, there is just so much WTF in this setup.

I would have told the customer to reduce the number by 1000 or whatever and closed the case.

To be fair, there are a lot of sysctl settings. To be sure, it's one of the first places I would look for networking weirdness, but it's also often hard to tell what impact those settings have on anything.

Anyone who messes with them had better know what they are doing. 99% of the time they read or heard about the setting somewhere and are blindly following along. I see this a lot with "performance improvements" when they should be looking at other things, like their web server configuration. Why would you tweak those on a low-end WordPress server?!

I used to do work kind of like this stuff on enterprise storage arrays that ran a modified BSD. We didn't lock the system down much, so customers could go and set whatever system guts stuff they wanted. We had a tool on-system that would basically tar up all the system configs and phone home with them when the customer hit a button. You'd better believe that one of my first steps investigating anything was diffing the crap out of any relevant configs against a clean base version.

As another commenter mentioned, this was the result of customers never actually mentioning their weird sysctl tuning in the original issue description. It's not like they're trying to screw you over or anything - there's just an awful lot of config options in an entire system that does anything interesting, and in the case of a big enterprise appliance, it's likely that dozens of people have had admin on it at one point or another.

Google support is basically nonexistent, even for customers spending 5-10K a month on their platform.

Even questions you raise on their Reddit sub go unnoticed.

They don't have a bunch of people actively trying to solve their customers' problems.

That said, the only time my problems were actually listened to was by the BigQuery team. Other than that, I don't think it's possible to get any explanation of a feature from any other product team at Google Cloud.

Google isn't bad with commercial accounts through normal channels. I'd personally put them in middle of the road. Their sales/se type people that I've dealt with are above average in my experience, although in the past there weren't many of them.

If you're 5 people at a dentist office with GSuite, that may be a different story. That's always a problem when small entities buy direct -- that's why VARs exist to provide more handholding for smaller orgs.

I've managed some pretty significant vendor relationships -- if you think they are the worst from a support POV, you're young or exceptionally lucky!

We made an effort to reach out to our regional GCP sales team when we moved to GCP and ended up with an account manager and solutions engineer. They've been very helpful and if we have hard questions, we can ping them. Not a huge spend by any means.

We rarely have to do so, but whenever we did, they went out of their way to figure it out.

YMMV, I guess?

The difference a support plan makes is amazing. I had a long-standing issue where I couldn't get any help; at best I got shunted to a forum where some random other user would invariably tell me I was doing things the wrong way, plus the standard non-answers.

We got put on their best support plan for a few months, for reasons. The difference was insane. I gave them permission/creds to log into the problem box, along with steps to repro things. In a few days they had reverse engineered what was going on in our code, without having our code, figured out where the bottleneck was in the kernel, and gave me detailed steps to build a tweaked kernel that wouldn't have the problem.

We spent orders of magnitude more than that and they still treated us poorly. The silver lining is that at least they don't discriminate based on spend.

Perhaps Reddit isn't a preferred support channel for them?

Do you have a support plan?

Why should you need a support plan for a product you're paying for?

"Ok, you can pay us $X/mo for the service, but if something goes wrong, we won't help you unless you also pay an additional $Y/mo."

It's absolute garbage that this is where the industry is.

Why is it bad to give the option to not pay for support you don't want? If they didn't charge separately for it, that means the cost is distributed to everyone in terms of higher costs for the service itself.

If you think support is always necessary, then just do the math yourself and add in the cost of support for every product, and use that price to determine if it is worth it or not.

Because some businesses want support, and others don't. Forcing the people who don't want support to pay for support seems rude, does it not?

This is consumer thinking, not business thinking. Every high ticket item on the planet comes with the option of a support contract.

Because the product that you're paying for is an extremely efficient automated cloud infrastructure with little manual support, unless you're willing to pay more for it. If you want hand holding there are other less automated infrastructure providers that cost a lot more.

I don't know; lots of tech enterprises will give you the product for free and make it up on "support". Fortunately, a lot of these environments have forums and such, which means you will eventually get help, but not soon enough if your hair is on fire.

Yes, my $100 monthly spend deserves the same support as a company with a $250,000 monthly spend.

Indeed, but once growth stops, they have to increase revenue somehow.

Support plans make sense for products/software that are sold and expected to operate independently (cars, as always, are a good example). Heck, even those products generally come with some sort of free limited-time support, in the form of warranties. A service requiring a separate support plan is... not intuitive.

I was about to say similar. My experiences with Google Support, even on their $500/mo. Gold plan, have been infuriating at best.

Even their product teams are just... Depressing. There was a support ticket open for TEN YEARS for people asking for WebSockets support on Google App Engine.

I often wind up with DevOps responsibilities, and I'd never recommend building more on their platform. I'd help in every conceivable way to minimize money given to Google, in addition to aiding the transition to more reliable providers.

And on a more personal note... My husband recently tried recovering his email. He no longer had the password, the phone number he had it registered with was no longer owned by him (a major security issue, btw), so he tried his recovery email. And even after clicking the link from his recovery email, he was denied access, and sent to the same help page that couldn't be more unhelpful if someone actively tried. And there was no apparent way to contact support from that screen.

I hear tell of Gmail users who could reach Google support, but... We don't put much stock in those stories round these parts.

Moral of the story? Never trust Google with your emails, or other important information. Always make backups if you must continue using them, and forward your email to an email provider you trust (ideally one that you pay for, own, and has a decent support department of any kind).

Amazing. I wish I knew where to go to learn the basics needed to navigate that many layers of knowledge (kernel/OS/network).

These books should help!

Computer Networks and Internets

Internetworking with TCP/IP, Volume III

The Linux Programming Interface

Computer Architecture: A Quantitative Approach

Operating System Concepts

Computer Networking: A Top-Down Approach

Anything by W. Richard Stevens (somewhat out of date, but I haven't seen anything to beat them)

Doesn't look horrible: http://intronetworks.cs.luc.edu/

Really enjoyed this :)

As our systems grow in size and complexity, we will inevitably hit limits (and the resulting problems) that were never approached before. For the future (interstellar space travel, etc.), all of this will need to be recreated at greater scale.

Oh look, another C overflow bug.

A slightly different case, as the Linux kernel uses a nonstandard C dialect (-fwrapv) in which signed overflow is defined to wrap rather than being undefined.

Every so often I run into a bug where I feel like I'm doing everything right, but the system is behaving unreasonably for some reason; then I find the issue and feel stupid. This is one of those.

That brings up another point: should the kernel standardize completely on unsigned scalars? How many legitimate use cases are there for signed scalars in the kernel?

The Linux kernel internally uses the common C convention in which negative numbers are errors (-ENOMEM and so on) while positive numbers (and zero) are successful values. (Some parts of the kernel use a related convention in which values within the "last page" of the address space, that is, -4096 to -1 inclusive, are errors, while other values are valid memory addresses; there are macros to convert between both conventions.)

Using unsigned does not generally fix overflow flaws. It just moves the threshold.

Sure. I was not suggesting that it would eliminate overflows, but it would eliminate one source of them. Also, mostly because there are probably few use cases that warrant signed values.

Using unsigned numbers doesn't really fix anything here, because the Linux kernel defines overflow of signed numbers. In both cases, you have generally surprising behavior when the number gets large enough: changing the type doesn't help; it just hides the issue in one of the cases.

I'd like to hear Linus's thoughts on this particular issue.

I envy people who find things like this fun.

Nice reading. Just ignore the metadata server thing. The guy probably means a resolver.

Oh boy. It was like a thriller. Enjoyed it :)

That is why google only hires the best

lol. google hires vast swathes of the unwashed

Easily caught by SAST. Not even that: a standard compiler warning should catch this.

Great troubleshooting.

TL;DR: int overflow in C

Didn't read the story, but a tough case deserves an upvote!

Long-term solution: use Rust.

Can you expand on your answer? How does Rust resolve this issue? Does it solve it by disallowing implicit type casts, thereby preventing overflow when the number becomes negative?

I think this person is trolling. However, for the purposes of discussion, two things here:

In Rust, overflow panics in debug builds but wraps around in release builds. So it's possible this bug would have been caught in testing; but if it wasn't, it would still have slipped into production.

However, that being said, Rust does not do implicit casting between numeric types. So it's very likely that this code would not have compiled in the first place, though I haven't examined it super closely. At that time, the person would have had to cast it, and so the end result would have been roughly the same.

AFAIK, Rust can panic on overflow even in release builds if you want it to, at a somewhat heavy performance cost (which is why this is not enabled by default in release builds). In this case, it would convert the issue from "some packets are unexpectedly being discarded" into an immediate crash within the kernel.

How heavy is the performance cost really? On x86, wouldn't a JO to a higher address take care of it? It would be a never-taken branch with perfect predictability.


> For undefined behavior checking using precondition checks, slowdown relative to the baseline ranged from −0.5%–191%. In other words, from a tiny accidental speedup to a 3X increase in runtime.

Very interesting, thank you. I question applying this result to kernel performance, though: SPECint has hot arithmetic, and UDP packet handling pretty much does not.

In this case, it'd probably be fine. In general I can see it interfering with loop vectorization and bloating code size, so it might not be low-cost in general.

At what cost to code size?

That would be the thing: two bytes in the hot path. But it wouldn't need to be universally applied to every place the + operator appears. Couldn't it just be wrapped around sensitive computations, or generated for the x > y + z idiom? It would be cheaper than, or at least as cheap as, the machine code generated for the manual overflow check (if (INT_MAX - z > y) ...).

Yes, there is a flag to change the default behavior (overflow-checks in the Cargo profile). I am not sure how many folks actually use it.

If you want flags, gcc has -Wall -Wextra -Werror which seems like it would have caught this bug. (Of course, if you weren't using -Wall -Wextra from the beginning you'll have a lot of catching up to do before you can build with -Werror.)

Yep, there's tons of details here, which is why I think the original poster was simply trolling.

How would that have changed anything here?

Re-write linux in rust? Good luck with that.
