> "Real engineers"? This is a Googleism and their caste system of "real engineers" and "software reliability engineers". The rest of the industry doesn't coddle their precious "real engineers".
There's no such caste system at Google. SREs have the same bar as the SWEs, and plenty of people switch roles from one to the other. SREs are also well-respected and I don't think anyone would consider them as _not_ real engineers.
> Designing a resilient, scalable, frugal, operateable architecture of load balancers, servers, caches, queues, dead letter queues, and all their associated metrics and alarms is absolutely "real engineering" and worth the time and attention of the same engineers that are building the core business value of their company.
You are absolutely right and that's why having people focusing on that is _really_ important. Having worked at AWS previously, I can tell you that when you force your software engineers to do everything and promote them based on _what_ they deliver without caring about the operational load, you end up with crazy oncalls and poorly-designed software. There's a reason why AWS oncalls are infamous whereas Google's are not.
> Having worked at AWS previously, I can tell you that when you force your software engineers to do everything and promote them based on _what_ they deliver without caring about the operational load, you end up with crazy oncalls and poorly-designed software.
I work at AWS. While it's true that promotions can create all sorts of incorrect incentives, the reason why AWS oncall can be brutal on some teams isn't because of promotion-orientated architectures that disregard operational load.
The reason is that the company prioritizes moving fast and shipping stuff to customers. We have a deep backlog of customer asks and we want to ship them quickly. The systems need to be secure, durable, available, and fast - those are non-negotiable. But do they need to be operated automatically? Well, no, that's the easiest thing to compromise - ship quickly, have initially shitty oncall, improve operations behind the scenes.
This is usually why customers are confused or concerned when AWS ships something new and then releases no updates for a year. We don't compromise on security or durability, but we do compromise on engineer happiness during oncall for the first release. So after shipping, the immediate next step is to start improving automation and operations.
This is done with open eyes, and most engineers are happy with it because it lets us ship big things quicker, rather than releasing a new version of chat software every year.
None of this is promotion-orientated. It's working backwards from customers.
I can't find the post, but I base this on rachelbythebay's writing - who got hired as an SRE, and then tried to become an SWE and was told she couldn't just convert.
But fair enough, I don't work at Google, so I'll withdraw the point. Having said that, knowing that AWS engineers do oncall and GCP engineers don't (letting the SREs do it) still makes me think there is some sort of two-tier system.
Before you withdraw your point, I guess I have to say my experience is limited to my org and maybe one or two more. Google is a big company, so I can see that happening. I'd still presume it's not the general attitude towards SREs, though.
re: Oncall, we have two rotations:
- one manned (personed?) by SWEs, 9-5, responsible for dealing with customer issues and mandatory.
- one is a mix of SREs and SWEs, 24/7, responsible for prod issues
I believe SREs also have their own rotation, but that's 9am-9pm because they are always spread across two timezones. Overall, this is muuuuuch better for everyone involved compared to the AWS oncalls. I remember barely sleeping for a week being the norm on one of the teams I worked on. Our cries for another team in a different timezone, similar to Google SREs, were shut down every single time. "Customer obsession" at AWS means delivering stuff as fast as possible and then throwing the engineers under the bus. I still remember the days I had to wake up multiple times a night to run a command manually (literally 5 minutes) because engineers couldn't take the extra 4 months to do it right.
Thanks, but no thanks. I was on a great team at AWS for ~2 years with great engineering culture and little operational load, but unlike Google, that's rare.
It's just the cultural difference between the two companies. I have been told that GCP oncalls are a lot busier than the rest of Google, but it's still nowhere close to the suffering I had at AWS. It's an organizational pain and comes from the mindset AWS has towards software development and their engineers.
AWS's mindset is that software is useless without customers. Google doesn't seem to have the same care - it's a charity for academic software engineers to spend AdSense revenue on abstract high-level computer science problems, not caring about practical applications.
It's why most Google X ideas flame out. It's why there's a new chat app every year.
But - it's a net positive for humanity. Google publishes a lot of papers, and I think genuinely has moved humanity forward in the last few decades.
I don't have a PhD and I don't do L33tcode, so I don't think I'm smart enough to be at Google, but honestly I don't know if I would want to be.
I'm sure you're more than smart enough to be at Google :) It's just practice and luck.
I completely agree that people who prefer to ship fast would be happier at AWS. I've heard all the horror stories about how slow Google is, but I think GCP has a great balance between speed and quality. I'm definitely happier here, but I understand why some people would be happier at AWS.