

Principles of Software Engineering, Part 1 - ananthrk
http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html

======
bcantrill
Great stuff, and I love the concrete example of the ZK failure due to error
logging -- a classic cascading failure mode. While it's true that I'm an
inveterate disaster porn addict[1] and would therefore love this regardless, I
think that Nathan's piece serves as a model in that it speaks to learning from
failure rather than gloating about nascent success -- we collectively need
much more of this! I also like that Nathan doesn't romanticize other
engineering domains, as naive software engineers are wont to do; other
engineering domains also struggle with failure -- it's just that their
failures are so much more public (and so much more likely to involve loss of
property and/or life) that they cannot evade collective introspection the way
software engineering so frequently seems to. Very much looking forward to Part
2!

[1] <http://www.infoq.com/presentations/Debugging-Production-Systems>

~~~
m_mueller
I enjoyed your talk, thanks for posting. One thing I'd like to know, though:
given how thoroughly you optimize your debugging skills and environment, it
surprised me that you love JavaScript. Don't get me wrong, obviously it has
some of the best tooling thanks to its ubiquity, but doesn't it bug you that
it tends to fail silently? I feel that there are quite a few error classes
that need to be caught by unit tests in the case of JS, whereas in languages
with more rigid type systems (such as Python) they get caught as an exception
right on the first run. Or is it that this uneasy feeling about
everything you do in JS is what has spawned a culture of more thorough unit
testing, such that at the end you're better off?
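
A minimal sketch of the Python half of that contrast. The `Config` class and the misspelled `timout` attribute are hypothetical, made up for illustration. Reading a misspelled attribute in Python raises `AttributeError` the first time that line executes, whereas the equivalent property read in JavaScript evaluates to `undefined` and execution carries on.

```python
class Config:
    """Hypothetical settings object, used only for this illustration."""

    def __init__(self):
        self.timeout = 30


def read_timeout(cfg):
    # Typo on purpose: "timout" instead of "timeout".
    return cfg.timout


try:
    read_timeout(Config())
    outcome = "no error"
except AttributeError as exc:
    # Python refuses to read an attribute that does not exist.
    outcome = f"AttributeError: {exc}"

print(outcome)
```

Note that the error only surfaces when the faulty line actually runs, so a rarely taken branch still needs a test to exercise it.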

~~~
seanmcdirmid
The things that you need to unit test even when you have static typing
typically overlap tests that will detect type errors as well. The fact that
there is no static typing also puts a bit more fire under your butt to test
things.

In the end, it's a wash.

------
mathattack
I think this quote is magic, "Software engineering is a constant battle
against uncertainty – uncertainty about your specs, uncertainty about your
implementation, uncertainty about your dependencies, and uncertainty about
your inputs."

Engineering is about handling what goes wrong, not what goes right. It's about
handling the errors, changes, misuse, etc. It isn't about the techniques per
se, as much as the mindset of living in an imperfect world.

[Edit: Fixed a typo.]

~~~
confluence
Indeed, which is why I like to think of engineering as a game of hyper-
dimensional whack-a-mole [1].

There is a certain series of things you have to hit in a fairly
hyper-dimensional world, dodging constraints, hurdling uncertainty and taking
risk in your stride as you struggle to make products that work, delight
consumers and make bank.

It's like a complex and exquisite ballet really, with suppliers,
manufacturers, producers and designers all coming together to make
extraordinary products that astonish the world.

Ah, I love engineering.

[1] <https://news.ycombinator.com/item?id=4238984>

> _designing a rocket engine is a massive game of high dimensional parameter
> whack-a-mole, it's very difficult to get a passable configuration without a
> lot of iteration and forwards-backwards passes_

------
baumgartn3r
Super interesting post! I'd have mentioned unit tests as another measure to
tackle uncertainty. Simple, boring unit tests (reminds me of this post[1]).
Maybe he just assumes those will exist when professional engineers write code.
[2]

[1] <http://robertheaton.com/2013/04/01/check-youre-wearing-trousers-first/>

[2] <http://www.amazon.com/Clean-Coder-Conduct-Professional-Programmers/dp/0137081073>

~~~
lorewarden
Unit tests are of course important, but they don't test for higher level
failures like network issues, high latency, increased load, etc. Your
components must be designed to be isolated from incidents as much as possible,
possibly using the techniques implemented in Hystrix [1], an open source
library from Netflix.

[1] <https://github.com/Netflix/Hystrix>
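
One of the isolation techniques Hystrix is known for is the circuit breaker. The sketch below is a generic, simplified version in Python, not Hystrix's API (Hystrix is a Java library); the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker sketch."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The design choice here is that callers get an immediate `CircuitOpenError` while the dependency is considered unhealthy, so a slow or failing service does not tie up every caller at once.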

------
abecedarius
One question this raised (and I don't mean this as a gotcha): why could a
flood to the error-reporting servers take down _all_ of the applications? I
expected the primary fix to be to decouple the work so it could continue with
no error reporting server. (But I'm not familiar with Zookeeper or any of the
other work the author's doing, beyond reading some post on Storm.)
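
A rough sketch of the kind of decoupling asked about here, under the assumption that error reports are optional and may be dropped. `BestEffortReporter` and its `send` callback are hypothetical names; this is not how Storm is implemented.

```python
import queue


class BestEffortReporter:
    """Hypothetical reporter: the worker never waits on the error sink."""

    def __init__(self, max_pending=100):
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def report(self, message):
        """Queue a report without blocking or raising into the caller."""
        try:
            self.pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1  # shed load rather than stall the worker
            return False

    def drain(self, send):
        """Deliver queued reports; a failing sink loses reports, not work."""
        delivered = 0
        while True:
            try:
                message = self.pending.get_nowait()
            except queue.Empty:
                return delivered
            try:
                send(message)
                delivered += 1
            except Exception:
                self.dropped += 1
```

In practice `drain` would run on a background thread, so the application's main work continues whether or not the reporting server is reachable.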

~~~
MichaelSalib
Zookeeper is a distributed coordination service. Think of it as an extremely
robust, reliable datastore for handling small amounts of data. It provides that
robustness by using an expensive synchronization protocol. When you try to
slam it with large volumes of data, Zookeeper falls over. And Storm relies on
Zookeeper for basic functioning, so without a running Zookeeper ensemble, the
associated Storm cluster will die too.

~~~
abecedarius
That makes sense. It's not clear to me though why error logging should belong
to it.

~~~
MichaelSalib
Well, a Storm "program" operates concurrently on many nodes at once. If an
exception is thrown, you may want to log it and the stack trace, but where? If
you write to a local log file, that data will be useless unless you run some
sort of log shipping or log centralization (like with Scribe or Kafka or
syslog-ng). But that's usually a pain in the neck to set up, and you can't run
Storm without already running a Zookeeper cluster, so if you're lazy, you just
log to Zookeeper.

Everything is fine as long as exceptions are infrequent.
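
One way to keep a burst of exceptions from overwhelming a shared store is to cap how many reports get through per time window. The sketch below is a generic fixed-window limiter; `RateLimitedErrorLog`, the `sink` callback, and the limits are assumptions for illustration and do not reflect Storm's or Zookeeper's actual code.

```python
import time


class RateLimitedErrorLog:
    """Hypothetical wrapper that forwards at most N errors per window."""

    def __init__(self, sink, max_per_window=5, window_seconds=10.0,
                 clock=time.monotonic):
        self.sink = sink  # callable that performs the real (expensive) write
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.sent_in_window = 0
        self.suppressed = 0

    def error(self, message):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # Start a fresh window and reset the budget.
            self.window_start = now
            self.sent_in_window = 0
        if self.sent_in_window >= self.max_per_window:
            self.suppressed += 1  # count it, but do not touch the sink
            return False
        self.sent_in_window += 1
        self.sink(message)
        return True
```

With a limiter like this, the write volume hitting the sink stays bounded no matter how frequent the exceptions become, at the cost of losing some individual reports.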

------
ambiate
There is a fine line between industry (cost center vs generating revenue) and
startups when it comes to discussing the term software engineering.

I see a large amount of legacy maintenance in cost center based programming.
Revenue generating industry channels seem to favor the enterprise aspect of
software engineering. Startups attempt to just build, and fix as necessary
(cowboy). Yet, each has their own facet of software engineering.

I am still trying to draw the line between too-enterprisy, too-maintenancy,
and too-cowboy. At my current job, we assume everything is certain. The
uncertainties are not coded for, because everything is internal. This bothers
me to a large extent. I love coding for the uncertain. Giving more control to
the user and automating a whole department is right up my alley. Sadly, it is
hard to convert people. Only the 'RU' in CRUD is in the user's hands most of
the time. It is pure legacy fear.

The part about removing cascading failures needs more emphasis. Remove
portions from your cycle/automation/jobs. What happens? I also agree with the
measure-and-monitor portion. Waiting until the program starts breaking in
production to create analyzers and look at metrics is too late.

Looking forward to the next posts.

------
205guy
This may be semantics, but I think of software engineering as the slightly
larger scope of building real-world solutions with software and hardware.
Civil engineering is not (just) about mixing the right cement and letting it
cure at the right temperature for the right length of time, nor is it strictly
about building a bridge, it's about building a bridge for the right price in
the right amount of time that will last a given number of years, all
parameters which were determined through a careful process and making
decisions with stakeholders, while applying scientific principles (geology,
materials science, etc.) and good people management skills. Oh, and the
successful bridge project leaves behind the documentation of the bridge as
built and a structure to ensure its proper maintenance.

However, I do agree that handling the huge and complex range of inputs, not
only the expected ones, is a great beginning to the process, one that is often
overlooked. The same goes for internal monitoring, to make sure your system is
still functioning as designed.

