Treating servers as cattle works up to a threshold, but when the same problem keeps coming up it's time to call the vet. http://rachelbythebay.com/w/ has many good examples of this, some submitted and voted up here.
Cattle is a good philosophy but it takes a huge amount of work to approach perfection.
Everybody was embarrassed because no monitoring caught it, but the VP of IT did by walking past the cage.
Aside: was it even legal for him to do that?
Also, Rickover was Congress's favorite admiral. They forced the Navy to promote him. I'm pretty sure they made sure that the laws were to his liking.
Also, if you're not confident enough in your nuclear reactor to apply chaos monkey techniques, you shouldn't be engineering nuclear reactors.
He’s either an MD in finance, which is the equivalent of a VP (roughly) in other kinds of companies, or he’s the CEO.
My dad had a book of almost all the cow names in Norway, as of 1988. As a kid I found it rather fun to just flip through it and read some of the names, often wondering how they came up with them.
However, since then the tradition of naming cattle seems to have dropped to less than 30%.
: "Gullhorn og dei andre : kunamn i Noreg" https://urn.nb.no/URN:NBN:no-nb_digibok_2010111708049
These days the number has risen, IIRC to around 35, though that's still quite low compared to larger countries, I imagine. I'm pretty sure the variance is quite high, however, with a fair number of farms with just a few cows dragging down the mean.
I really hope the video is edited to strongly resemble this scene: https://www.youtube.com/watch?v=N9wsjroVlu8
We had a customer with regularly failing tape backups. CRC errors, verify pass failures, even failed writes, and so forth.
We replaced the tapes with new ones. Same issues.
We replaced the tape drive with a new one. Still the same problems.
We replaced the internal ribbon cable and the SCSI controller. No luck.
Firmware flashed everything. Didn't help.
New server chassis, wiped the OS and reinstalled everything from scratch. Changed the backup software just in case. The backups still failed!
Literally no part was the same. I went on site to start looking into things like the power cables, the UPS, or vibration issues. Basically, we were getting desperate and grasping at straws.
I was sitting down in an office, casually chatting with the IT guy while we were waiting for 5pm so we could reboot the server. He's leaning back in his office chair, and he casually picks up one of the tape cartridges, throws it up in the air, and catches it before it hits the ground. Just playing. Over and over.
I asked him if he does that a lot.
"Yes, it's fun!" he answered.
What was your company's role? Backup services/devices?
Finally I took all the parts out of the original computer and put them in a different chassis and it worked! Put them back in the old chassis and back to the old problem.
Eventually I noticed that there was an extra stand-off in the first computer case and it was shorting out the motherboard.
It was literally the chassis causing the problem.
Motherboard is bricked. Ring DELL for support. After going through the rigmarole of explaining what had happened and that we had a bricked motherboard, the person on the phone said "Have you tried taking out the CPU and rebooting?"
To avoid further delay in getting a replacement sent (we had 4 hour on-site at the time), we went through the motions. Not surprisingly, the motherboard was substantially bricked without a CPU.
The DELL engineer that came on-site was suitably amused.
Nothing worked. Finally, I removed the processor from the motherboard, looked at it, and reinstalled it. The computer booted right up and never had another problem. Weird.
What do you do to write software accordingly? Make it detect when it's running on a dud? Have it run as best as it can anyways?
Are you perhaps thinking of the printer execution scene from "Office Space"?
are you able to comment a bit further on why this machine was well known?
Could be a different incident and a different machine, though. I'm sure this story happened more than once.
The infamous machine did go through repairs and part swaps many times, as you could see from its long and troubled hwops history.
The worst machines were the zombies with NICs bad enough to break Stubby RPCs, but still passing heartbeat checks. Or breaking connections only when (re)using specific ports. Fun times!
Regarding this system: the motherboard was never swapped?
> engineers habitually run batch jobs with more replicas than there are machines
Idly curious, how do I parse this? It sounds like the same jobs are replicated to multiple machines as a sort of asynchronous, eventually-consistent lockstep arrangement?
Eyy what would I search on moma to find this video?
I can confirm I've read the same thing though, years back.
If any Googlers are reading this: just go to go/legends and search for officespace. The first link that pops up has context as to why the video exists.
And I wonder how well the signal in that ratio might scale down to hundreds or tens of disks.
"Hewlett Packard Enterprise (HPE) has once again issued a warning to its customers that some of its Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation unless a critical patch is applied.
Back in November of last year, the company sent out a similar message to its customers after a firmware defect in its SSDs caused them to fail after running for 32,768 hours."
Can you imagine provisioning and deploying a rack or 3 full of shiny new identical drives, all in RAID6 or RAID10, so you couldn't possibly lose any data without multiple drives all failing at once...
(Evidence that the universe can and does invent better idiots...)
Someone suggested we just nuke it and bring it back up on a fresh instance. Problem was gone! Everything running smoothly again.
If the CPU was bad, then that means you kept running the instance on the same node. A quick way to test whether it was really "cattle" would have been to try it on a different node.
Additionally, if the CPU was bad, how was it not affecting other services?
The customer had set `net.core.rmem_default = 2147483647` on purpose, which exposed a kernel bug. The whole herd would have had the same issue.
The bug report resulted in an upstream kernel fix, which is a better result than if the customer had fixed it themselves, of course.
Even if there was a specialized server that needed such a large receive buffer, it doesn't make sense to set the system-wide default so high.
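If some specialized daemon really did need a huge receive buffer, the per-socket knob would be the better tool. A minimal sketch (my own illustration, not from the article; the socket and sizes are made up):

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int size = 8 * 1024 * 1024; /* 8 MiB, for just this socket */

    /* Capped at net.core.rmem_max unless the process holds
     * CAP_NET_ADMIN and uses SO_RCVBUFFORCE instead. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    /* The kernel doubles the requested value to cover bookkeeping
     * overhead, so expect ~16 MiB when reading it back. */
    socklen_t len = sizeof(size);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("effective receive buffer: %d bytes\n", size);
    return 0;
}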
That might involve making a new VM from scratch, but it might also mean twiddling some settings, or other emergency changes.
Afterwards, I want to be able to restore the VM state, probably in a firewalled off environment, so I can debug exactly what was wrong.
Sometimes I'd like to do it to a set of VMs - for example, if there is some DNS weirdness, I might want to snapshot both an application server and a DNS server.
So far, no cloud provider seems to offer functionality to make that easy, which is a bit disappointing.
sysctl is a bit problematic in terms of exhaustiveness. That is, how do you ensure that the kernel only has its original values plus whatever you put in sysctl.conf, and nobody actually ran sysctl manually at some point? But it's possible to do.
Unfortunately, the competition (Salt, Ansible, Chef) aren't really any better here.
These days, I run Kubernetes whenever possible, and keep the base OS light, which makes the configuration management surface extremely small.
Which is why I have come to believe that the very concept of host configuration management is broken. We should do it as little as possible, preferably NOT AT ALL. Sure, use something like Ansible to run the image creation steps and provision the necessary first-boot scripts into place, but only leave in the steps that absolutely cannot be done during image pre-bake.
Cycle your hosts without mercy, so that new ones are brought up from fresh pre-baked images, continuously.
And even for the few unavoidable snowflake hosts (eg. those that have to live outside the K8S cluster), follow the same strategy. Make them disposable, so that you can bring up a new one from their own pre-baked images on demand. Try to keep the delta between the snowflake base and your cattle base as small as possible.
Configuring live hosts should be considered an anti-pattern - if you find yourself doing it at all, take a step back and consider how to get rid of the need.
Unfortunately, this is the only way until kernel folks start agreeing that random mutability is a bad thing. Right now, the kernel has way too many mutation points, and it's not (as far as I know) possible to ask it for a "diff" against the defaults.
And even if you could somehow bisect every single change in your configuration management, there's the added complexity of working out which configuration changes are actually needed to reproduce the problem. In this case it would probably be fairly easy, because DNS is such a core "feature", but if it's something more application-level, you're really going to be lost.
Servers as cattle can cover a variety of sins.
It's often a useful exercise to dig into the root cause - a lot of times the problem you're seeing is just the tip of the iceberg.
I'm stealing this.
The Phoenix Project: https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...
Its companion, The DevOps Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliabili...
Of course, anything necessary to get your production box producing, but a well-engineered server is worth the debugging time.
+ if (rmem > (size + (unsigned int)sk->sk_rcvbuf))
However, in reality this would have worked too:
+ if (rmem > (unsigned int)(size + sk->sk_rcvbuf))
(The bit pattern of the result remains the same, and it's still cast to unsigned int for the comparison.)
However, signed integer overflow is undefined behavior in C, while unsigned integer overflow isn't. Hence, the submitted patch is the correct solution.
If A + (cast)B is a correct form, then (cast)(A + B) is generally an inappropriate form. As you note, it’s possible it will happen to work, but it’s not good form.
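To make that concrete, here is a small demo of my own (values chosen to mirror the bug) that -fsanitize=undefined flags on the second form but not on the first:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int size = 4096;          /* like skb->truesize */
    int sk_rcvbuf = INT_MAX;  /* like rmem_default = 2147483647 */
    int rmem = 8192;          /* memory charged so far: tiny */

    /* Submitted patch: 'size' converts to unsigned, the addition is
     * unsigned and well-defined (it doesn't even wrap here); no drop. */
    if (rmem > (size + (unsigned int)sk_rcvbuf))
        puts("drop (patched form)");

    /* Alternative: the signed addition overflows *before* the cast,
     * which is UB, even though the resulting bit pattern (and thus
     * the comparison) usually comes out the same on real hardware. */
    if (rmem > (unsigned int)(size + sk_rcvbuf))
        puts("drop (UB form)");

    return 0;
}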
What "level" of customer or SLA do you have to be to get a certain quantity or guarantee of support and troubleshooting? Or is it that if even a free-tier customer points out something that is fundamentally a problem, it will receive attention by certain solutions engineers?
Are there $ spending, 20 x (c3.4x.large), or I-pay-you-for-certain-uptime/troubleshooting levels that get you certain response levels? Do certain problems get resolved with "well, you just have to live with that behavior, we're not fixing that".
Do you get to call them or chat live? Or is it all via tickets?
Free customers did report bugs, and we would replicate, triage, and fix them as usual. These tended to be more obscure bugs (more severe ones would usually surface in the paid queues first), but we didn't discard them immediately.
Whether we effectively ignored bugs depended more on the ability of the customer to provide an actionable report. Some users provide exactly what we need up front, but there is a lot of "it doesn't work!" white noise from users that aren't able to or aren't willing to put in the work to accurately describe their issue and/or action feedback from us. There's usually not a whole lot support can do in that scenario if we don't see any obvious issues, but we'd go a bit further to placate paying customers--I fondly remember joining a call between some very technically inept user and their ISP who was adamant that either we or the ISP were at fault, after we guided their network team through taking local packet captures showing unanswered SYNs past their network border.
$250/month/dev is the minimum for phone calls on technical issues; $150k + 4% of GCP spend for 'come running' support.
There are more details here, though they use the old names for the support tiers.
Disclaimer: I'm a TSE...
For small to medium sized businesses, that number is probably 1.
If a service you depend on is down, your support agent will be able to tell you it's down, but not speed up the fix.
Cloud support will have more information about performance black holes and limitations that the documents don't describe. They also will be able to advise on "is this design or that design likely better". They generally know how the backends of GCP services work, and their common failure modes, which is pretty hard knowledge to get from the outside.
They really deeply understand AWS, are very responsive to calls/emails, but often have no tools to solve the problem we're having right now.
This even happened with some of their high end hardware, we did an upgrade for a critical set of instances to some extremely pricey dedicated hosts and ended up in a runaround for over a week due to a hardware issue on their side.
We don't really have any way to know from your story whether Scaleway's support was under normal or extra load and still delivered an excellent experience, or whether they had a bunch of bored support reps just waiting for something to work on because it was abnormally slow. The latter is nice in the moment it happens, but doesn't really help you if, two days later, a different issue leaves you in the lurch for days on end because they're busy. The former would actually tell you something about the support you can expect.
That's the whole point of service level guarantees. They provide a lower bound on the support you'll receive, which is often much more important and useful to track.
And this is why you don't generally use signed numbers in systems code, unless you specifically need negative numbers. And why you gradually develop a paranoia about the sizes of numbers.
For another, that's why you get paranoid.
(For a third, I strongly recommend something like Frama-C with the Weakest-Precondition module---it's very good at finding issues like these.)
It's also much easier, IMO, to accidentally underflow an unsigned integer; it's so much more common to work with 0 than it is to work with +/- 2 billion.
I actually started putting assertion checks for overflow almost everywhere, but it requires great discipline. I wonder if there is a better solution available.
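One lower-discipline option in C, at least with GCC (>= 5) or Clang, is the checked-arithmetic builtins, which report overflow instead of silently invoking UB. A hedged sketch:

#include <assert.h>
#include <stdio.h>

static int checked_add(int a, int b)
{
    int sum;
    /* __builtin_add_overflow returns true if a + b overflowed;
     * 'sum' then holds the wrapped value. One assert here replaces
     * manual range checks at every call site. */
    assert(!__builtin_add_overflow(a, b, &sum));
    return sum;
}

int main(void)
{
    printf("%d\n", checked_add(2, 2));  /* fine */
    /* checked_add(INT_MAX, 1) would trip the assert instead of UB */
    return 0;
}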
> (if you try and set it to 2^31 the kernel returns “INVALID ARGUMENT”).
I would expect the same to happen for 2^31+1.
I love the concept of “Follow the Sun” to describe 24/7 support - I don’t think I’ve heard it described that way. I wonder how much we’d have to spend to get that tier of service?
The difference for users can be fairly small, but as someone who used to carry a pager for Amazon, the difference is really huge for the support person.
if (rmem > (size + sk->sk_rcvbuf))
/* we drop only if the receive buf is full and the receive
 * queue contains some other skb
 */
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
if ((rmem > sk->sk_rcvbuf) && (rmem > size))
/* we drop only if the receive buf is full and the receive
 * queue contains some other skb
 */
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
if (rmem > (size + sk->sk_rcvbuf))
Probably this bug would not have happened if this comparison were written as `rmem - size > sk->sk_rcvbuf`?
So it is saying that the simplest sanity check is "if the buffer was already full before we staked our claim we should drop this packet immediately." As the "goto" indicates, there are then a bunch more checks on other circumstances where we should also drop the packet. Due to the quirks of multi-threading it is of course possible that some packets get unnecessarily dropped between when we stake the claim and when we discharge it, which the code just accepts -- the thinking is presumably "yeah if the buffer is full a lot of packets are gonna get dropped and that's just life -- it's much less important that we dropped some extra packets when we were already dropping packets, and much more important that we don't mismanage the buffer's memory when it's nearly full."
A comment suggests that part of the reason for this awkward phrasing is that it's possible to have rmem == size, in other words the buffer was empty when we staked our claim, and in that case we don't want to drop the packet even if it would overflow a small max-buffer-size. I think the idea there is "we already have the socket buffer allocated, obviously this thing fits in memory, so let's just handle it if the queue is empty rather than dropping every single packet that is larger than the queue size."
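As a toy model of that pattern (my own simplification using C11 atomics, not the actual kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int rmem_alloc;  /* memory currently charged to the socket */
static int rcvbuf = 1 << 20;   /* configured receive buffer limit */

static bool try_enqueue(int size)
{
    /* Stake the claim first so concurrent enqueuers see our usage;
     * fetch_add returns the old value, so add size back to match the
     * kernel's atomic_add_return(). */
    int rmem = atomic_fetch_add(&rmem_alloc, size) + size;

    /* Drop only if the buffer is full AND holds some other skb:
     * rmem == size means the queue was empty, so keep the packet.
     * The cast is the patched, overflow-safe form of the check. */
    if (rmem > size + (unsigned int)rcvbuf) {
        atomic_fetch_sub(&rmem_alloc, size);  /* uncharge, then drop */
        return false;
    }
    return true;  /* actual queueing elided */
}

int main(void)
{
    printf("%d\n", try_enqueue(1500));  /* prints 1: packet accepted */
    return 0;
}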
It's not DNS.
There's no way it's DNS.
It was DNS.
We have the expensive off-the-shelf support option (I think $450/seat/month) for 1-hour response.
In most cases, we spend more time going back and forth with support than it would take to figure it out ourselves. I'm talking about issues that span weeks, with tens of hours spent. We end up re-explaining the original support case problem (i.e. the support engineer doesn't bother reading the actual problem) whenever the engineer changes.
P1s where the support engineer told us we'd get an update the following day, only for us to figure out that there had been a breaking release on their side that exactly matched our description.
While investigating a load balancer issue, the support engineer looked at the LB logs, saw a ton of logs coming from penetration scans (e.g. GET /phpmyadmin), and suggested that the solution was to open up those addresses.
On the flip side, it's encouraging that they have people somewhere in the support chain who are capable enough to read Linux kernel code and submit fixes upstream.
The case was indeed fun to work with, but the main reason why it had such a fast and happy resolution was because the customer was very responsive and very cooperative.
I cannot talk for every Technical Solution Engineer, but I can tell you that I have no particular interest in simply closing a ticket: I want to go down the rabbit hole and solve technical issues, and I know many of my colleagues feel the same.
I am also far from being the most senior or skilled TSE in Google Cloud Support; I just wrote an article about one of the most interesting cases I had.
Also, I wonder how you learned all this knowledge (that is, I'm asking for recommendations of a few books/resources for learning), if you don't mind sharing. Thanks in advance!
During the interview process at Google we don't expect candidates to be able to get to this level of depth, but we try to hire candidates that could, over time and depending on their skill set, potentially reach a similar level of depth and ability to troubleshoot cases.
I'm one of the TSEs who handle networking cases. True to what was said, I was hired with very little networking background, but plenty of development and hardware experience.
I've since taken the mantle for handling most of the cases dealing with Interconnects and VPNs. I enjoy it too!
Oh, yeah, we're hiring: https://careers.google.com/jobs/results/?company=Google&q=Te...
In my experience, Google Cloud is better than most organizations about escalating hard issues up the chain. Admittedly, this happened at a company with substantial spend, and I can't say one way or another whether a smaller player would get the same quality of support.
It will also get you a dedicated sales rep and sales team, and they will absolutely crack the whip on internal teams to get issues resolved. At those spends, you can almost get an in-house support team of PSOs to bounce problems off of.
> and they will absolutely crack the whip on internal teams to get issues resolved
Not in my experience. Although we did get an ever-rotating rep. I think they changed three of them in like a year or so
But this bug report says raw sockets would be filtered by the OUTPUT chain of iptables:
Is that accurate across distros? It does make sense for some socket types, like device sockets, to not be routed through iptables.
Can someone elaborate on the above statement from the article? Does this imply that raw sockets have unbounded buffer?
The commonly used socket functions recv/send construct the required headers for TCP, UDP, and whatnot; they handle encapsulation/buffering/connections/etc., so they're easy for developers to use: just read and write application data.
By nature, raw sockets skip the TCP/UDP code and a good chunk of the kernel networking code, including the place where the bug was located.
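A minimal receive-side sketch of what "raw" means in practice (my own illustration, needs root or CAP_NET_RAW):

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    /* Delivers copies of inbound UDP datagrams, IP header included,
     * with no port demultiplexing done for us -- the normal UDP
     * receive path (where this bug lived) is bypassed for this
     * socket's own queue. */
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
    if (fd < 0) {
        perror("socket(AF_INET, SOCK_RAW, IPPROTO_UDP)");
        return 1;
    }

    char buf[65536];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    printf("got %zd bytes, starting with the IP header\n", n);
    return 0;
}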
As it was MS Windows and I'm a Linux native I didn't really know how to investigate further - I guess I couldn't without Outlook source.
Luckily setting a hosts entry fixed it.
I only found one other post online with the same issue, and they didn't have a solution. Presumably it was something like ISP automated rDNS entries getting parsed .. but honestly I don't know.
Still curious ...
Would have loved to have found work investigating such things, as they said in the post, it's fun!
This also highlights why support agents shouldn't be 100% engaged with customers. They need time to review and amend runbooks, consult with engineering teams, etc.
That aside: even though we've had plenty of support cases with AWS, and we have the highest support level, they mostly cannot access user data, just the metadata. We had a major problem with RDS once, and they specifically requested to load that snapshot onto an internal instance to reproduce it. It can happen at AWS, but it's not very common, in my experience.
The customer asked the support people to access a shell on their VM and they then quit because...? - or the support people accessed a shell without the customer’s express permission?
And I agree; support people shouldn't have that kind of access, with or without consent.
I understand the importance of the ultimate fix but wouldn't seeing an incrementing error counter for UDP have shortened some of the troubleshooting done to identify and resolve the customer's immediate issue?
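For what it's worth, that counter does exist: RcvbufErrors, as reported by `netstat -su`. A quick sketch of where it lives (my own, not from the article):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* RcvbufErrors in the "Udp:" rows of /proc/net/snmp increments
     * each time the UDP enqueue path drops a datagram for lack of
     * receive buffer space -- including the drops from this bug. */
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) { perror("/proc/net/snmp"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "Udp:", 4) == 0)
            fputs(line, stdout);  /* field-names row, then values row */

    fclose(f);
    return 0;
}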
When I was a new engineer working in telco, one of the longest investigations I worked was when connectivity broke between one of our regional roaming partners and 1/3 of our nodes (I'm summarizing to try and keep the story brief). We called them and asked if they had changed anything, reviewed the configuration and secrets used on the tunnels, etc., and were working with the vendor to go through any problems with the implementation. Saturday morning, and probably 20 hours of investigation later, a new engineer at the regional partner sees there is a work order for changes to the connectivity to our nodes (we were adding some new ones) that was supposed to be executed that week. A typo in the change overwrote the secrets used by an existing tunnel instead of creating a new secret for the new peer. The person we were working with to investigate was the person who had implemented that change and told us several times nothing changed. He was also the one we worked with to read through all the secrets for typos or issues, and he didn't notice anything. Saturday morning he gets into the office, is shown the work order, and goes: oh yeah, I did that at exactly the time the tunnel went down. One fix of the typo'd secret later and everything comes right back up.
So, just in my experience, I find it quite plausible that the buffer size was not mentioned. And even beyond this story, I know I've personally missed connecting causes with potential effects when investigating a problem. It's very easy to dismiss some setting, like the buffer size, as unconnected to DNS behaviours, especially if the two weren't noticed together and there's no strong change management system to help connect the timelines.
1. They need guaranteed delivery, but chose to use UDP
2. They jacked up the default rmem buffer to ~2GB, which is insane. It also applies to all sockets, not just UDP, so I wouldn't be surprised if they were also running into issues with memory pressure, especially under load
3. Support didn't seem to let them know that's a pretty unconventional configuration
That was an interesting debugging story, and catching a bug like this is always good IMO. But, there is just so much WTF in this setup.
As another commenter mentioned, this was the result of customers never actually mentioning their weird sysctl tuning in the original issue description. It's not like they're trying to screw you over or anything - there's just an awful lot of config options in an entire system that does anything interesting, and in the case of a big enterprise appliance, it's likely that dozens of people have had admin on it at one point or another.
Even questions you raise on their Reddit sub go unnoticed.
They don't have a bunch of people actively trying to solve their customers' problems.
That said, the only time my problems were actually listened to was by the BigQuery team. Other than that, I don't think it's possible to get any explanation of a feature from any other product team at Google Cloud.
If you're 5 people at a dentist office with GSuite, that may be a different story. That's always a problem when small entities buy direct -- that's why VARs exist to provide more handholding for smaller orgs.
I've managed some pretty significant vendor relationships -- if you think they are the worst from a support POV, you're young or exceptionally lucky!
We rarely have to do so, but whenever we did, they went out of their way to figure it out.
YMMV, I guess?
We got put on their best support plan for a few months, for reasons. The difference was insane. I gave them permission/creds to log into the problem box, along with steps to repro things. In a few days they had reverse engineered what was going on in our code, without having our code, figured out where the bottleneck was in the kernel, and gave me detailed steps to build a tweaked kernel that wouldn't have the problem.
"Ok, you can pay us $X/mo for the service, but if something goes wrong, we won't help you unless you also pay an additional $Y/mo."
It's absolute garbage that this is where the industry is.
If you think support is always necessary, then just do the math yourself and add in the cost of support for every product, and use that price to determine if it is worth it or not.
Even their product teams are just... Depressing. There was a support ticket open for TEN YEARS for people asking for WebSockets support on Google App Engine.
I often wind up with DevOps responsibilities, and I'd never recommend building more on Google; I'd help in every conceivable way to minimize money given to them, in addition to aiding the transition to more reliable providers.
And on a more personal note... My husband recently tried recovering his email. He no longer had the password, the phone number he had it registered with was no longer owned by him (a major security issue, btw), so he tried his recovery email. And even after clicking the link from his recovery email, he was denied access, and sent to the same help page that couldn't be more unhelpful if someone actively tried. And there was no apparent way to contact support from that screen.
I hear tell of Gmail users who could reach Google support, but... We don't put much stock in those stories round these parts.
Moral of the story? Never trust Google with your emails, or other important information. Always make backups if you must continue using them, and forward your email to an email provider you trust (ideally one that you pay for, own, and has a decent support department of any kind).
Computer Networks and Internets
Internetworking with TCP/IP, Volume III
The Linux Programming Interface
Computer Architecture: A Quantitative Approach
Operating System Concepts
Computer Networking: A Top-Down Approach
W. Richard Stevens (Somewhat out of date, but I haven't seen anything to beat them.)
Doesn't look horrible: http://intronetworks.cs.luc.edu/
As our systems grow in size and complexity, we will inevitably encounter limits (and resulting problems) that were previously never approached. For the future (interstellar space travel, etc.), all of this will need to be recreated at an even greater scale.
TL;DR: int overflow in C
So, in Rust, overflow panics in debug builds but wraps around in release builds. It's possible this bug would have been caught in testing, but if it wasn't, it still would have slipped into production.
That said, Rust does not do implicit conversion between numeric types, so it's very likely this code would not have compiled in the first place, though I haven't examined it super closely. The author would then have had to add an explicit cast, and the end result would have been roughly the same.
> For undefined behavior checking using precondition checks, slowdown relative to the baseline ranged from −0.5%–191%. In other words, from a tiny accidental speedup to a 3X increase in runtime.