I enjoyed this piece. One thing I’d add is handling failure. Whether it’s a business process breaking down or a server crashing, eventually every engineer will face failure conditions. How to handle them varies greatly by situation, but the most important factor is the takeaway. Being able to identify and fix patterns that can lead to failure is important, but learning from mistakes and picking your team back up is the most important lesson, because eventually everyone falls. These lessons can only be learned through experience and are a large part of what makes a “senior” engineer.
This all rings true to me. I summarize senior engineer responsibilities as "think holistically": cost of development, cost of maintenance, path to adoption, upfront investment required before you can determine if you'll succeed or fail, etc., all need to be factored in. In many respects, it's the art of solving the right problems (and then solving them well).
If you're curious, my own toolbox (Senior Staff SWE at Google):
* Understand your problem space in all dimensions, including prior art internally and externally, failed attempts in the space, tradeoffs and constraints. This takes a year or two at minimum, with heavy research and detective work, so start with a small seed area and grow from there, and figure out whose opinions to lean on in the meantime.
* Inventory your teams and their projects. Get a feel for the problem each is solving, whether you believe it's possible, and the staffing needed for that. Look for:
  * Succeeding teams. Observe from a distance.
  * Under-valued teams. Help them articulate their true value to leadership. If they disappeared, what would the cost be, including externalized costs elsewhere in the company?
  * Under-staffed teams (usually also under-valued). Teach them to deprecate features and cut problem scope to reach a sustainable subset - the worst thing is growing tech debt with no path out. If leadership doesn't like the cuts, they can help prioritize differently, or invest. But do not allow them to request the impossible or unsustainable.
  * Under-executing teams. Remain skeptical that you fully understand the problem (is it harder than you think?), and look closely for education gaps you can solve in place with mentoring. But it may ultimately be a staffing problem - e.g. the wrong lead for the job - or a team culture problem, and those must be addressed too.
  * Over-staffed teams, or teams solving unimportant problems. Flag it to management - generally either their scope will be grown, or engineers can be pulled off to work on higher-impact projects.
* Do the math. Understand costs and opportunities (which areas are overly expensive), growth rates, usage metrics, and reliability. Develop a gut feel you can use to sanity-check what you see and are told.
* Look ahead - keyword "unsustainable". Where are costs growing too quickly? Is scaling approaching unsupportable limits? Is toil growing linearly with usage? Are users taking features in an unexpected direction? Are architectures breaking down? Work with teams to have a plan, and escalate the risks upwards for visibility.
* Predict the future. Based on all this, what problems should be solved but aren't? What leverage points do you have to change the model, and solve a different problem more successfully? Pitch, sketch, and prototype to sell the importance or opportunity.
* Roll up your sleeves - "architecture astronaut" is a pejorative for a reason. Dive into your struggling projects and work alongside the team, providing an extra pair of hands and educating through demonstration. Get involved in outages and their resolution. Build tools and dashboards. Take responsibility for success in at-risk areas and do whatever is required, whether that's writing code, doing analysis, PMing, or simply making yourself available.
* Be a diplomat. Recognize that all success is a team effort, and your path to impact is through and alongside others.
Thanks for the detailed write-up. I love hearing from folks with more experience than me.
How do you study failed attempts in a space? Survivorship bias can make those hard to find. Do you focus on failed attempts within your company in those cases?