
A Google Cloud support engineer solves a tough DNS case - sciurus
https://cloud.google.com/blog/topics/inside-google-cloud/google-cloud-support-engineer-solves-a-tough-dns-case
======
luhn
This is a fun debugging story, but it is also a great example of why servers should be
cattle not pets. Having trouble with a VM? Blow it up and get a fresh one.
Still having trouble? The provisioning steps are codified, you can walk
through them and find the one that causes the issue.

~~~
storyinmemo
The fleet of machines I owned at Facebook was around 10,000. I still
remember the odd JVM crash that prompted me to reimage a machine. I wouldn't
have remembered it except there were a few that month and it was the 3rd time
I reimaged the same machine that I thought, "That's odd... I think I know that
machine name." Checked history, saw the 3 repair jobs I had submitted... RAM
was reset, CPU was eventually guessed at as bad.

Cattle is a threshold, but when the same problem keeps coming up it's time to
call the vet. [http://rachelbythebay.com/w/](http://rachelbythebay.com/w/) has
many good examples of this, some submitted and voted up here.

~~~
jeffbee
It is indeed folly to assume that cattle have no identity. There's an
internally famous video inside Google in which a particular well-known machine
was unracked, dragged out into a field, and ceremonially smashed to pieces by
some hardware techs. Sometimes a machine just takes an arrow to the knee and
it's never the same again. Then there are all the uncontrolled or unrecorded
differences between machines: the ones at the tops of the racks, or the ends
of the rows, are hotter (or colder); there's some difference between the same
model of hard disk made in Hungary compared to the ones made in Mexico; at
some date the BIOS vendor made an undocumented firmware revision that changes
an obscure energy/performance register in your CPU; you have a machine with a
dead CMOS battery that worked normally until it was rebooted.

Cattle is a good philosophy but it takes a huge amount of work to approach
perfection.

~~~
grawlinson
One of my computers has been aptly named 'THESEUS' due to what was replaced on
it. By the time it was repaired to an acceptable level, the only original
component remaining was the chassis.

~~~
jiggawatts
Heh... that reminds me of my favourite troubleshooting story!

We had a customer with regularly failing tape backups. CRC errors, verify pass
failures, even failed writes, and so forth.

We replaced the tapes with new ones. Same issues.

We replaced the tape drive with a new one. Still the same problems.

We replaced the internal ribbon cable and the SCSI controller. No luck.

Firmware flashed everything. Didn't help.

New server chassis, wiped the OS and reinstalled everything from scratch.
Changed the backup software just in case. The backups _still failed!_

Literally no part was the same. I went on site to start looking into things
like the power cables, the UPS, or vibration issues. Basically, we were getting
desperate and grasping at straws.

I was sitting down in an office, casually chatting with the IT guy while we
were waiting for 5pm so we could reboot the server. He's leaning back in his
office chair, and he casually picks up one of the tape cartridges and throws
it up in the air and then catches it before it hits the ground. Just playing.
Over and over.

I asked him if he does that a lot.

"Yes, it's fun!" he answered.

ಠ_ಠ

~~~
asmosoinio
That is quite the story! Sounds like a very large amount of resources spent on
that case.

What was your company's role? Backup services/devices?

~~~
jiggawatts
This was general IT consulting back in the early 2000s. The customer was
small, they only had three tower servers and only one had a tape drive.

------
floatingatoll
The LKML message described in the post is here:
[https://lkml.org/lkml/2019/12/19/482](https://lkml.org/lkml/2019/12/19/482)

~~~
mav3rick
Something I'd like to add here: the actual fix is

    + if (rmem > (size + (unsigned int)sk->sk_rcvbuf))

However, in reality this would have worked too:

    + if (rmem > (unsigned int)(size + sk->sk_rcvbuf))

(The bit pattern of the result remains the same, and it's still cast to
unsigned int for the comparison.)

However, signed integer overflow is undefined behavior in C and unsigned
integer overflow isn't. Hence, the submitted patch is the correct solution.
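
To make the difference concrete, here is a minimal sketch in C of the check before and after the patch. The function names and the numbers used with them are made up for illustration; only the shape of the comparison comes from the patch. Since standard C leaves signed overflow undefined, the first function emulates wrapping by adding as unsigned and converting back, which wraps on GCC and Clang but is implementation-defined in general.

```c
#include <stdbool.h>

_Static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

/* Check before the patch: the limit is an int. The wrap is emulated by adding
 * as unsigned and converting back to int, so a very large rcvbuf turns the
 * limit negative and even a small rmem compares as "over the limit". */
static bool drop_before_patch(int rmem, int size, int rcvbuf)
{
    int limit = (int)((unsigned int)size + (unsigned int)rcvbuf);
    return rmem > limit;
}

/* Check after the patch: casting one operand makes the whole sum unsigned,
 * and rmem is converted to unsigned for the comparison. The casts below spell
 * out what the implicit conversions do. Assumes rmem and size are >= 0. */
static bool drop_after_patch(int rmem, int size, int rcvbuf)
{
    unsigned int limit = (unsigned int)size + (unsigned int)rcvbuf;
    return (unsigned int)rmem > limit;
}
```

With an illustrative rcvbuf of 2147483547 and a 2304-byte packet arriving on an empty queue (rmem = 2304), the first function reports a drop and the second does not; with an ordinary rcvbuf such as 212992 the two agree.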

~~~
floatingatoll
Those are not safe to treat as equivalent, even if it might work in theory.
You should always cast as narrowly as possible, and when you see code doing
otherwise, look very carefully for bugs.

If A + (cast)B is a correct form, then (cast)(A + B) is generally an
inappropriate form. As you note, it’s possible it will happen to work, but
it’s not good form.

~~~
mav3rick
Yes I qualified it at the end why the former is the correct solution and not
the latter.

~~~
floatingatoll
I figured that "in reality this would have worked too" is an easy phrase to
misinterpret as "this would have worked" (assuming they miss the context a
paragraph later), and so the reply helps ensure that others do not misread it
as I initially did.

~~~
mav3rick
Yes! Thank you.

------
supernova87a
I'm not an expert with AWS or Google Cloud, so I'm interested in knowing:

What "level" of customer or SLA do you have to be to get a certain quantity or
guarantee of support and troubleshooting? Or is it that if even a free-tier
customer points out something that is fundamentally a problem, it will receive
attention by certain solutions engineers?

Are there $ spending, 20 x (c3.4x.large), or
I-pay-you-for-certain-uptime/troubleshooting levels that get you certain
response levels? Do certain problems get resolved with "well, you just have to
live with that behavior, we're not fixing that"?

Do you get to call them or chat live? Or is it all via tickets?

~~~
andyljones
Here you go:

[https://cloud.google.com/support#support-plans](https://cloud.google.com/support#support-plans)

$250/month/dev is the minimum for phone calls on technical issues, $150k + 4%
of GCP spend for 'come running' support.

There're more details here

[https://cloud.google.com/support/docs/procedures#additional_...](https://cloud.google.com/support/docs/procedures#additional_services_for_gold_platinum_production_and_enterprise_support_customers)

though they use the old names for the support tiers.

~~~
ehsankia
It seems like the blog post talks about a written case report, which the $100
tier has access to, albeit with a 4-hour first response instead of 1 hour. So
is it possible that you could get your case escalated to such in-depth
debugging with that tier?

~~~
perfect_wave
I've seen free tier tickets get escalated to product engineering teams for
investigations about sub $100 charges.

------
mcguire
" _When sk_rcvbuf gets close to 2^31, adding the size of the packet can cause
an integer overflow. And since it’s an int it becomes a negative number,
therefore the condition is true when it should be false (for more, also check
out this discussion of signed magnitude representation)._ "

And this is why you don't generally use signed numbers in systems code, unless
you specifically need negative numbers. And why you gradually develop a
paranoia about the sizes of numbers.

~~~
FreeFull
I'm not sure how using an unsigned number would help, given that when it
overflows you're still going to have some code do unexpected stuff anyway.

~~~
mcguire
For one thing, it's a clue you need to step back and think, "What happens when
this overflows?" rather than "Oh, it's just a number."

For another, that's why you get paranoid.

(For a third, I strongly recommend something like Frama-C with the Weakest-
Precondition module---it's very good at finding issues like these.)

~~~
sleepydog
I'm not convinced that it would have been any more obvious to the person who
made the error that the variable could overflow if it were unsigned.

It's also much easier, IMO, to accidentally underflow an unsigned integer;
it's so much more common to work with 0 than it is to work with +/- 2
billion.
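
A small sketch of that near-zero hazard in C, with hypothetical function names: subtracting past zero in unsigned arithmetic is well defined, but it wraps to a huge value instead of going negative.

```c
#include <limits.h>

/* Naive: when used > capacity the result wraps modulo UINT_MAX + 1,
 * so "one too many" shows up as roughly four billion bytes of free space. */
static unsigned int space_left_naive(unsigned int capacity, unsigned int used)
{
    return capacity - used;
}

/* Guarded: compare before subtracting and clamp to zero. */
static unsigned int space_left_clamped(unsigned int capacity, unsigned int used)
{
    return used >= capacity ? 0u : capacity - used;
}
```

Calling `space_left_naive(10u, 11u)` yields `UINT_MAX`, while the clamped version returns 0.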

~~~
karagenit
I'm being pedantic, but technically wrapping from 0 to MAX_INT is still
considered overflow. Underflow refers to decimal truncation e.g. by integer
division.

------
erikig
“...This means that the case will _Follow the Sun_ by default, to provide 24/7
support”

I love the concept of “Follow the Sun” to describe 24/7 support - I don’t
think I’ve heard it described that way. I wonder how much we’d have to spend
to get that tier of service?

~~~
amessina1
For Google Cloud, 250$ per month per user:
[https://cloud.google.com/support](https://cloud.google.com/support)

~~~
erikig
Thank you.

------
jtchang

        if (rmem > (size + sk->sk_rcvbuf))
          goto uncharge_drop;
    

What is rmem in this case? I'm a bit confused as to why it is written that
way. This drops the packet right when it overflows the buffer?

~~~
jeffbee
It's not very literate, is it? rmem is initially the sk_backlog.rmem_alloc
field of struct sock. There is no comment in net/sock.h what this field might
mean. People who modify this function just have to guess. I also appreciate
that this function adds |size| to rmem_alloc, tests for limits, then later it
subtracts |truesize| from rmem_alloc. This happens to seem correct, but it's
just asking for someone to accidentally screw up the accounting in a later
change. Reading this function only reinforces my view of Linux code quality.

~~~
jeffbee
Another interesting thing is how stewardship of this logic and the comment
right above it have diverged.

Original:

    
    
      /* we drop only if the receive buf is full and the receive
       * queue contains some other skb
       */
      rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
      if ((rmem > sk->sk_rcvbuf) && (rmem > size))
       goto uncharge_drop;
    

Later:

    
    
      /* we drop only if the receive buf is full and the receive
       * queue contains some other skb
       */
      rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
      if (rmem > (size + sk->sk_rcvbuf))
       goto uncharge_drop;
    

How does the comment correspond to each block?
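
One way to explore that question is to put both conditions side by side and try a few values. This is only a sketch: the function names and numbers are invented, rmem is taken to be the allocation after the new packet has been added (as in the snippets above), and overflow is ignored.

```c
#include <stdbool.h>

/* Original: drop when the total, including this packet, exceeds rcvbuf AND
 * something else was already queued (rmem > size means prior allocation > 0). */
static bool drop_original(int rmem, int size, int rcvbuf)
{
    return (rmem > rcvbuf) && (rmem > size);
}

/* Later: rmem > size + rcvbuf is the same as (rmem - size) > rcvbuf, i.e.
 * drop only when the allocation BEFORE this packet already exceeded rcvbuf. */
static bool drop_later(int rmem, int size, int rcvbuf)
{
    return rmem > (size + rcvbuf);
}
```

With rcvbuf = 1000, a prior allocation of 900 and a 200-byte packet (rmem = 1100), the original form drops and the later form accepts. So under the later form the queue can overshoot rcvbuf by up to one packet, and the "some other skb" part of the comment is implied rather than tested, since a prior allocation above a non-negative rcvbuf is necessarily non-zero.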

------
rishabhd
Reminds me of this classic haiku -

It's not DNS. There's no way it's DNS. It was DNS.

------
xvector
Is Google Cloud support this good in general, or only for certain tiers or
plans?

~~~
kelnos
The cynical part of me expects that this case was handled so well because a)
the support people found the issue fascinating and fun to work on, b) the
post-mort on it would make an excellent blog post.

On the flip side, it's encouraging that they have people somewhere in the
support chain who are capable enough to read Linux kernel code and submit
fixes upstream.

~~~
amessina1
Author of the article here. I only thought of the possibility of making a blog
post after the case was closed and I started telling my colleagues about it,
and realized I would have loved to read about this.

The case was indeed fun to work with, but the main reason why it had such a
fast and happy resolution was because the customer was very responsive and
very cooperative.

I cannot talk for every Technical Solution Engineer, but I can tell you that I
have no particular interest in simply closing a ticket: I want to go down the
rabbit hole and solve technical issues, and I know many of my colleagues feel
the same.

I am also far from being the most senior or skilled TSE in Google Cloud
Support, I just wrote an article about one of the most interesting cases I had.

~~~
programmertote
I'm inspired by how much you seem to know about the details of computer
networking. Is that knowledge required to become a Google Tech Support
person, or are you just above average among your peers in that respect?

Also, I wonder how you learned all this (that is, I'm asking for
recommendations of a few books/resources for learning), if you don't mind
sharing. Thanks in advance!

~~~
amessina1
I don't have deep knowledge of the details of computer networks; there is a
team of TSEs who deal with network cases and know more than me. But the whole
point of troubleshooting is not _knowing_ what is wrong, but being able to
_find_ what is wrong. In order to do that you need a good foundation, which
you can build by studying how networks and Linux systems work (someone here
posted some titles) and with experience (I have some grey hair myself). But
every time you
troubleshoot something you end up touching something you don't know, and
that's where you learn something new you might use next time. For example I
didn't know about dropwatch, a colleague suggested it to me.

During the interview process at Google we don't expect candidates to be able
to get to this level of depth, but we try to hire candidates that _could_ ,
over time and depending on their skill set, potentially reach a similar level
of depth and ability to troubleshoot cases.

~~~
qmarchi
Following up on amessina1's post,

I'm one of the TSEs who handle networking cases. True to what was said, I was
hired with very little networking background, but plenty of development and
hardware information.

I've since taken the mantle for handling most of the cases dealing with
Interconnects and VPNs. I enjoy it too!

Oh, yeah, we're hiring:
[https://careers.google.com/jobs/results/?company=Google&q=Te...](https://careers.google.com/jobs/results/?company=Google&q=Technical%20Solutions%20Engineer)

~~~
programmertote
Thank you! This is encouraging. My background is mostly in Python programming
and SQL. But in an alternate universe, I wish I were a network ninja like you
guys and I will definitely check out Google TSE jobs when I can look for new
jobs (currently hoping to get my green card done). If there's a book or two
that is the most useful for you to be an efficient TSE, please feel free to
share. Have a restful weekend!

------
ser_tyrion
> they use raw sockets! Raw sockets are different than normal sockets: they
> bypass iptables

But this bug report says raw sockets would be filtered by the OUTPUT chain of
iptables:
[https://bugzilla.redhat.com/show_bug.cgi?id=1269914#c4](https://bugzilla.redhat.com/show_bug.cgi?id=1269914#c4)

Is that accurate across distros? It does make sense for some socket types,
like device sockets, to not be routed through iptables.

~~~
barbegal
I think that bug report is misleading. Raw sockets do bypass iptables but they
still go through ebtables. They hook in at the ebtables NAT OUTPUT chain. See
the diagram here:
[https://erlerobotics.gitbooks.io/erle-robotics-introduction-...](https://erlerobotics.gitbooks.io/erle-robotics-introduction-to-linux-networking/security/introduction_to_iptables.html)

~~~
ser_tyrion
Thanks for that great link

------
itsmemattchung
> they use raw sockets! Raw sockets are different than normal sockets: they
> bypass iptables, and they are not buffered!

Can someone elaborate on the above statement from the article? Does this imply
that raw sockets have an unbounded buffer?

~~~
jeffbee
Clearly not, but raw is raw. For the purposes of this article, it's enough to
say that raw sockets, being raw, don't traverse the flawed block of code,
which is in net/ipv4/udp.c

------
pbhjpbhj
Seems like a good place to mention this: I once was troubleshooting an Outlook
issue where email stopped working after some time, seemingly at random. Turned
out that Outlook picked up the IP address for the mail server backwards - so
instead of WW.XX.YY.ZZ Outlook tried ZZ.YY.XX.WW. Found that by using
sysinternals network tools, confirmed with Wireshark. Thunderbird worked, ping
to the domain name (of the mail server) worked, Outlook worked on-and-off, ...

As it was MS Windows and I'm a Linux native I didn't really know how to
investigate further - I guess I couldn't without Outlook source.

Luckily setting a hosts entry fixed it.

I only found one other post online with the same issue, and they didn't have a
solution. Presumably it was something like ISP automated rDNS entries getting
parsed .. but honestly I don't know.

Still curious ...

Would have loved to have found work investigating such things, as they said in
the post, it's fun!

~~~
maccam94
It sounds like maybe your mail server had a misconfigured reverse DNS entry?
Those DNS PTR records look like ZZ.YY.XX.WW.in-addr.arpa. I'm aware that SMTP
in particular has a dependency on rDNS, but I'm not sure about the details.
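
For reference, a short C sketch of how that PTR owner name is formed from a dotted-quad address. The function name is hypothetical and only IPv4 is handled.

```c
#include <stdio.h>
#include <string.h>

/* Writes the in-addr.arpa name for a dotted-quad IPv4 address into out.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
static int ptr_name(const char *ipv4, char *out, size_t outlen)
{
    unsigned int a, b, c, d;
    char extra;

    if (strlen(ipv4) > 15) /* longest valid form is "255.255.255.255" */
        return -1;
    /* Exactly four fields must match; a fifth match means trailing junk. */
    if (sscanf(ipv4, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;

    /* Octets are reversed: W.X.Y.Z becomes Z.Y.X.W.in-addr.arpa */
    int n = snprintf(out, outlen, "%u.%u.%u.%u.in-addr.arpa", d, c, b, a);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For the documentation address 192.0.2.10 this produces `10.2.0.192.in-addr.arpa`, which is the reversed order described in the parent comment.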

[https://en.wikipedia.org/wiki/Reverse_DNS_lookup#Uses](https://en.wikipedia.org/wiki/Reverse_DNS_lookup#Uses)

------
rantwasp
that’s a pretty good and detailed explanation. 2 things: 1) i hope they have a
runbook for situations like this (ie the support engineer does not have to
figure all this out on the fly) 2) the customer should have provided more
details and maybe should have thought of the tweaks they made (classic
solution is to compare 2 instances - one works one does not)

~~~
amessina1
1) you cannot have a runbook for everything, and even if you had a runbook,
the best you could have found in this case is that something weird was
happening in the VM. The setting had an insanely big value but it was accepted
by the kernel, so you would assume it was a valid one. 2) the customer
provided a huge amount of details, but it is usually very hard to explain what
you changed from the base image. Most customers might not be willing to
provide the full spec of their running system, as it might contain
information they don't want to disclose. It is easier when the issue starts
right after a change has been made, but this was not the case.

~~~
rantwasp
i’m not saying a runbook will catch everything. but it will give you a chance
to solve the problem quicker.

~~~
amessina1
I totally agree, but keep in mind that it is an iterative process: you have a
case, you apply the runbook/playbook you have, if they are not enough you use
your skill/knowledge to solve the case, _then_ you update the playbooks.

~~~
rantwasp
iteration is my middle name :) the point I was trying to make is you should be
prepared - if only for the 95% of the cases that the runbook can solve. you
also don't start with a complete runbook - you build it as you operate the
service.

------
slim
He forgot to mention that the kernel is Linux. It's almost like linux has become
the standard OS. Linux is the new windows.

~~~
saagarjha
That being said, as you have noticed, Linux is the go-to OS for servers, and
the post has a number of Linux-isms.

------
schoolornot
I've been supporting AWS environments almost from the beginning but can't ever
remember a case where I was asked or even considered offering Support a copy
of a VM's storage volume. Is this common on Google/Azure/etc.?

~~~
CSDude
It reminds me of my friend's hosting company that failed. They got a big
customer and created a VM and the customer asked to fix a problem that
involved getting a shell in the VM. Friend does it and the customer is gone
next day.

This aside, even though we have had many support cases with AWS so far, and
have the highest support level, they mostly cannot access user data, just the
metadata. We had a major problem with RDS once, and they specifically
requested to load that snapshot to an internal instance to reproduce. It can
happen in AWS, but it is not very common, in my experience.

~~~
DaiPlusPlus
> It reminds me of my friend's hosting company that failed. They got a big
> customer and created a VM and the customer asked to fix a problem that
> involved getting a shell in the VM. Friend does it and the customer is gone
> next day.

The customer asked the support people to access a shell on their VM and they
then quit because...? - or the support people accessed a shell without the
customer’s express permission?

~~~
kelnos
I think it's because the support people even had the possibility of that
access at all.

And I agree; support people shouldn't have that kind of access, with or
without consent.

------
bogomipz
I enjoyed reading this, but wouldn't running either "netstat -s" or "ss -s"
to show protocol statistics have revealed receive buffer errors or receive
packet errors? It seems like this basic tool was noticeably absent from the
early troubleshooting steps and the other standard troubleshooting tools
used.

I understand the importance of the ultimate fix but wouldn't seeing an
incrementing error counter for UDP have shortened some of the troubleshooting
done to identify and resolve the customer's immediate issue?

------
TwoBit
The customer had set an extremely large buffer size and nobody thought to
mention that? Perhaps the individual reporting the problem was different and
was unaware of that unusual change.

~~~
GABeech
There are many things here that concern me from a system view.

1. They need guaranteed delivery, but chose to use UDP.

2. They jacked up the default rmem buffer to ~2GB, which is insane. It also
applies to all sockets, not just UDP, so I wouldn't be surprised if they were
also running into issues with memory pressure, especially under load.

3. Support didn't seem to let them know that's a pretty unconventional
configuration.

That was an interesting debugging story, and catching a bug like this is
always good IMO. But, there is just so much WTF in this setup.

~~~
bluedino
I would have told the customer to reduce the number by 1000 or whatever and
closed the case.

------
econcon
Google support is basically nonexistent for customers, even those spending
like 5-10K a month on their platform.

Even questions you raise on their Reddit sub go unnoticed.

They don't have a bunch of people actively trying to solve their customers'
problems.

That said, the only time my problems were actually listened to was by the
BigQuery team. Other than this, I don't think it's possible to get any
explanation of a feature from any other product team at Google Cloud.

~~~
Axsuul
Do you have a support plan?

~~~
kelnos
Why should you need a support plan for a product you're paying for?

"Ok, you can pay us $X/mo for the service, but if something goes wrong, we
won't help you unless you also pay an additional $Y/mo."

It's absolute garbage that this is where the industry is.

~~~
cortesoft
Why is it bad to give the option to not pay for support you don't want? If
they didn't charge separately for it, that means the cost is distributed to
everyone in terms of higher costs for the service itself.

If you think support is always necessary, then just do the math yourself and
add in the cost of support for every product, and use that price to determine
if it is worth it or not.

------
hidiegomariani
amazing. I wish I knew where to go learn the basics to navigate that many
layers of knowledge (kernel/os/network)..

~~~
auspex
These books should help!

Computer Networks and Internets

Internetworking with TCP/IP Volume III

The Linux Programming Interface

~~~
mcguire
And...

Computer Architecture: A Quantitative Approach

Operating System Concepts

Computer Networking: A Top-Down Approach

W. Richard Stevens (Somewhat out of date, but I haven't seen anything to beat
them.)

Doesn't look horrible:
[http://intronetworks.cs.luc.edu/](http://intronetworks.cs.luc.edu/)

------
andred14
Really enjoyed this :)

As our systems grow in size and complexity we will inevitably encounter limits
(and resulting problems) that previously were not approached. For the future
(interstellar space travel etc.) all this will need to be recreated for
greater scale

------
ryanmarsh
Oh look another C overflow bug.

~~~
saagarjha
A slightly different case, as the Linux kernel uses a nonstandard (-fwrapv) C
where overflow is defined to wrap rather than be undefined.

------
srnvs123
Every time I run into bugs, and feel like I'm doing everything right, but it's
just behaving unreasonably for some reason, but then find the issue and feel
stupid. This one is one of those

------
yalogin
That brings up another point. Should the kernel standardize on unsigned
scalars completely? How many legitimate use cases are there to use signed
scalars in the kernel?

~~~
jeffbee
Using unsigned does not generally fix overflow flaws. It just moves the
threshold.

~~~
yalogin
Sure. I was not suggesting that it will eliminate overflows but would
eliminate one source of them. Also mostly because there are probably few use
cases that warrant signed values.

~~~
saagarjha
Using unsigned numbers doesn't really fix anything here, because the Linux
kernel defines overflow of signed numbers. In both cases, you have generally
surprising behavior when the number gets large enough: changing the type
doesn't help; it just hides the issue in one of the cases.

------
biohax2015
I envy people who find things like this fun.

------
ninj4fly
Nice reading. Just ignore the metadata server thing. The guy probably means a
resolver.

------
amrx101
Oh boy. It was like a thriller. Enjoyed it :)

------
HugoDaniel
That is why google only hires the best

~~~
jiveturkey
lol. google hires vast swathes of the unwashed

------
jiveturkey
easily caught by SAST. not even that. standard compiler warning should catch
this.

------
xyproto
Great troubleshooting.

TLDR; int overflow in C

------
happppy
#boycottGoogle

------
pistolpeteDK
didn’t read the story - but a tough case deserves an upvote!

------
joshsyn
Long-term solution - Use rust

~~~
hi41
Can you expand on your answer? How does Rust resolve this issue? Does it solve
by disallowing type casts thereby preventing overflow when the number becomes
negative?

~~~
steveklabnik
I think this person is trolling. However, for the purposes of discussion, two
things here:

So, in Rust, overflow panics in debug builds, but does wrap around in release
builds. So, it is possible this bug would have been caught in testing, but if
it wasn't, it still would have slipped into production.

However, that being said, Rust does not do _implicit_ casting between numeric
types. So it's very likely that this code would not have compiled in the first
place, though I haven't examined it super closely. At that time, the person
would have had to cast it, and so the end result would have been roughly the
same.

~~~
cesarb
AFAIK, Rust _can_ panic on overflow even in release builds if you want it to,
at a somewhat heavy performance cost (which is why this is not enabled by
default in release builds). In this case, it would convert the issue from
"some packets are unexpectedly being discarded" into an immediate crash within
the kernel.
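
For comparison, C can get similar report-on-overflow behaviour per operation. A hedged sketch: `__builtin_add_overflow` is a GCC and Clang extension rather than ISO C (C23 adds `ckd_add` in `<stdckdint.h>`), and the function name below is made up for illustration.

```c
#include <stdbool.h>

/* Computes size + rcvbuf into *limit.
 * Returns true when the sum fits in an int, false when it overflowed
 * (in which case *limit holds the wrapped value and should not be trusted). */
static bool receive_limit(int size, int rcvbuf, int *limit)
{
    /* The builtin returns true on overflow, so invert it. */
    return !__builtin_add_overflow(size, rcvbuf, limit);
}
```

The caller then decides what an overflow means, for example rejecting the setting, instead of silently comparing against a wrapped limit.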

~~~
jeffbee
How heavy is the performance cost really? On x86, wouldn't a JO to a higher
address take care of it? It would be a never-taken branch with perfect
predictability.

~~~
steveklabnik
[http://www.cs.utah.edu/~regehr/papers/overflow12.pdf](http://www.cs.utah.edu/~regehr/papers/overflow12.pdf)

> For undefined behavior checking using precondition checks, slowdown relative
> to the baseline ranged from −0.5%–191%. In other words, from a tiny
> accidental speedup to a 3X increase in runtime.

~~~
jeffbee
Very interesting, thank you. I question the application of this result to
kernel performance, though. Specint has hot arithmetic and UDP packet handling
pretty much does not.

------
peterwwillis
Note: You shouldn't use _int_, _unsigned int_, _char_, _short_, _long_.
Use _int16_t_, _uint16_t_, _uint8_t_, etc. (or their _fast_ equivalents, such
as _int_fast16_t_) from _stdint.h_.

The former's sizes change based on platform, CPU, and compiler; the latter are
fixed-width (or flexible, where the _fast_ variants may use a larger size if
it's faster).
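
A small sketch of what that can buy you in C, with a hypothetical function name: with fixed-width types the range is part of the declaration, and widening one operand before the addition sidesteps the overflow altogether.

```c
#include <stdint.h>

/* Both inputs are exactly 32 bits wide wherever int32_t is provided.
 * Converting one operand to int64_t first means the addition is done in
 * 64 bits, so the sum of two 32-bit values cannot overflow. */
static int64_t receive_limit_wide(int32_t size, int32_t rcvbuf)
{
    return (int64_t)size + rcvbuf;
}
```

For example, `receive_limit_wide(2304, INT32_MAX)` gives 2147485951 rather than a wrapped negative number.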

I started brushing up on my C recently and have been collecting these little
nuggets:
[https://gist.github.com/peterwwillis/53cd9d34d8755784e483790...](https://gist.github.com/peterwwillis/53cd9d34d8755784e48379047af9f358)

~~~
kccqzy
Not when you are writing the Linux kernel, when you know exactly which sizes
the integers are.

Or even when you are writing low-level code on a known platform (e.g. an LP64
platform).

~~~
peterwwillis
All the typical sizes depend on the compiler and the flags you provide. If you
provide the wrong flags, the sizes of each type may change, but you won't see
any warnings about it because it's expected behavior, and now you've got
different binaries with different behavior. Or you could use fixed-width types
and if the wrong flags get passed (no C99 support) your code just doesn't
compile.

I just checked the Linux kernel style guide, and they explicitly suggest you
can use fixed-width types from C99 when it makes sense. I get that they want a
balance for their project, but for general programming, it's just safer to be
explicit.
[https://www.kernel.org/doc/html/v5.1/process/coding-style.ht...](https://www.kernel.org/doc/html/v5.1/process/coding-style.html#typedefs)

------
jldugger
> After another spin around the world the case comes back to our team.

So basically, follow the sun doesn't work for hard problems? Can you really
say the people are working on this 24/7 if progress is only made in one time
zone?

~~~
amessina1
Follow the sun has a lot of overhead: imagine having to dump the engineers'
thoughts and current hypothesis and load them in the brain of the next
oncallers. Also, it only really works if the customer is also active 24/7, as
often you might need the customer to perform some action on their systems.
Once the time pressure is off you might get better results dedicating the
engineer who is best suited to work on the case (both from the point of view
of the timezone and skill set) and give them the time needed to troubleshoot
the issue.

------
tobykimmel
The big news here is Google support engineer solves ANY problem. I can never
get through to them. That’s the downside to a fully automated support system,
no humans.

~~~
bauerd
There are lots of support engineers around, you just need to pay Google to use
up their time.

~~~
tobykimmel
That’s not true for all services. Google Voice for example has become
unusable, and this is confirmed by thousands of support forum posts. There’s
no way to reach a human. All the support forum posts are eventually closed
with no resolution.

Google should just shut down Google Voice instead of keeping it around for
free, but broken and with no support.

------
rmac
" I find that 2147481343 seems to do the trick. This number doesn't make any
sense to me. I suggest the customer try this number. The customer replies
back: it works with google.com, but it does not work with other domains."

My extrapolation: I find a potential fix: don't test it, don't understand it,
and send it to customer.

How is this acceptable?

~~~
floatingatoll
Your extrapolation is incorrect. The following description matches the factual
components of what you see as unacceptable, but includes the missing context
from the post that makes the engineer's actions acceptable rather than
unacceptable:

It was sent to the customer to collect experimental results while
investigation continued in parallel. Per the post, production mitigations had
already been put into place, and the customer was knowingly participating in
an investigation of an unknown issue. The customer provided the requested
experimental results.

