
Loss of power grid automatic control due to software update [pdf] - eigenvector
https://www.nerc.com/pa/rrm/ea/Lessons%20Learned%20Document%20Library/LL20200403_Loss_of_AGC_During_Routine_Update.pdf
======
segfaultbuserr
> It was discovered that there was a change made to the alarm text during this
> AGC update that caused the failure. Prior to the update, when the PII value
> exceeded +/-999 MW, the value in the alarm text defaulted to “xxxx,”
> preventing dispatchers from having an accurate indication of the amount of
> PII change. During this deployment, the two MW value fields in the PII alarm
> text were modified from i4 to i5 (4 digit integers to 5 digit integers) to
> allow for an additional digit.

> The original alarm text array had 79 characters which almost hit the max
> character limit of 80, and the change from i4 to i5 resulted in the alarm
> text increasing by 2 characters to 81 characters in length. Writing an 81
> character text string into an 80 character fixed length array resulted in a
> run-time abort of the task.

I bet nobody expected that modifying a string could crash the entire control
system due to a (caught) buffer overflow. Apparently the reason it went
undetected was that the modification was not tested on a demo system first,
since it was believed to be trivial.

> Lesson Learned: No matter how small the change is, validate changes in a
> test environment first. This requires using a thorough test script for the
> changes.
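
The i4/i5 descriptors suggest a Fortran-style format engine, but the failure
mode is easy to re-create in any language with bounds-checked buffers. A
minimal sketch in Rust (the alarm text and field layout are hypothetical;
only the 71 + 2x4 = 79 vs. 71 + 2x5 = 81 arithmetic comes from the report):

    fn main() {
        // Fixed 80-character alarm buffer, per the report.
        let mut alarm_text = [b' '; 80];
        // 71 characters of fixed text plus two MW fields. At i4 widths the
        // record was 71 + 4 + 4 = 79 characters and fit; widening both
        // fields to i5 makes it 71 + 5 + 5 = 81.
        let record = format!("{:<71}{:>5}{:>5}", "PII VALUE EXCEEDED LIMIT", 12345, -9999);
        assert_eq!(record.len(), 81);
        // A bounds-checked copy of an 81-byte record into the 80-byte
        // buffer panics here (the caught overflow that aborted the task)
        // rather than silently corrupting adjacent memory.
        alarm_text[..record.len()].copy_from_slice(record.as_bytes());
    }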

~~~
7thaccount
Generally, SCADA, AGC, and any EMS software updates should go through multiple
lower environments before getting into a production environment. Even critical
patches would spend a little bit of time in a lower environment prior to
production.

~~~
BunsanSpace
If my SCADA work is anything to go by, in 90% of cases it goes on a
demo/staging area for a week and is then deployed.

It's rare for an org to do full testing on new builds (and it's usually only
very large, heavily regulated government orgs that do).

~~~
csense
I would assume controlling GW of power generation would be a very large,
heavily regulated org...

------
boner666
No lesson learned, they're still doing it wrong.

1\. They shouldn't be deploying to one of two redundant systems and then
deploying to the other one an hour later. One system should keep running the
old code for days, if not weeks. That would have prevented the loss of AGC.

2\. They should have rollback procedures in case an update causes problems.
Instead they added an emergency patch on top of the new bad code. This worked
out OK but could have fucked things up even worse.

~~~
BunsanSpace
1\. When doing a system upgrade, a lot of the time the new version won't play
nicely with the old version. A key example is iFIX, which is not backwards
(or forwards!) compatible.

2\. Sometimes adding a quick fix with a support engineer on the line is
quicker than attempting a rollback. You're making a lot of assumptions here,
but likely what happened was a "try to fix this ASAP, and if that fails, roll
back" scenario.

------
paypalcust83
A similar thing happened to a Fortune 500 client of mine when a payment
gateway decided to make a production change to their API without notice and
without deploying to their sandbox first.

Another case is the E911 system used by either Vacaville, CA or Solano County,
which cannot correctly parse GPS information from Verizon Wireless's system,
leading to an inability to locate callers. It is likely still an unresolved
safety issue, and one that may never be fixed.

Here are some important principles in this kind of critical system
engineering:

0\. Waterfall development model

1\. Formal verification

2\. Staging env

3\. "Space Shuttle" control system quorum with deploy new code to one system
of three for a month

4\. Externally-coordinated change control approval, deployment and
notification processes, including subscriber-published specifications

5\. All unexpected software conditions get raised to the relevant stakeholders
for human attention, similar to Taiichi Ohno's "Stop the Line"
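
For item 3, here is a minimal sketch of the quorum idea in Rust (a
hypothetical function, not any real flight or grid control implementation):
three redundant channels compute the same output, only one of them running
newly deployed code, and an output is accepted only on 2-of-3 agreement:

    /// Accept a setpoint only when at least two of three independent
    /// channels agree, so one channel running new code can misbehave
    /// without the system acting on its output.
    fn vote(a: i64, b: i64, c: i64) -> Option<i64> {
        if a == b || a == c {
            Some(a)
        } else if b == c {
            Some(b)
        } else {
            None // no quorum: escalate to a human, as in principle 5
        }
    }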

------
cpr
All I've ever heard about the US grid is that it's held together with spit and
baling wire and is highly insecure--but that's just anecdotal.

(E.g., the SCAD systems that control various aspects are eminently hackable; a
few high-powered rifle rounds fired at a Palo Alto area transformer caused
serious problems for days; etc, etc.)

Does anyone have any good references about overall grid reliability and
security?

~~~
7thaccount
That sounds a bit extreme. Keep in mind that the US is broadly split into the
Eastern and Western Interconnections, where the East is a much denser network
than the West. There is also ERCOT (Texas), which connects to the East and
West only through DC tie substations.

The acronym is SCADA too.

~~~
Scoundreller
Quebec, Canada is like that: its own system, interconnected only through HVDC
links.

------
kwhitefoot
Strong type checking should be able to detect this kind of overflow
statically. Probably not practical in the kinds of software involved though.
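
As an illustration of the idea (a Rust sketch; the 71/5/5 widths are taken
from the incident, but the constants are mine): if the field widths are
visible to the compiler, the 81-into-80 mismatch becomes a build failure
instead of a production abort.

    // Field widths lifted into constants the compiler can check.
    const TEXT_LEN: usize = 71;                         // fixed alarm text
    const MW_FIELD: usize = 5;                          // widened from 4 (i4 -> i5)
    const RECORD_LEN: usize = TEXT_LEN + 2 * MW_FIELD;  // = 81
    const BUFFER_LEN: usize = 80;

    // Fails at compile time, not at 4 a.m. in a control room:
    const _: () = assert!(RECORD_LEN <= BUFFER_LEN, "alarm record overflows buffer");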

------
generatorguy
The footer indicated the issue occurred in the western region. I wonder if
this AGC system is built on a SCADA or control system package offering from
one of the major vendors, or if it is more of a one-off or even totally
homebrew.

Some SCADA software has existed since the early '80s, and the vendors have
supported their customers through upgrades the whole way. As you can imagine,
there is some serious cruft.

~~~
7thaccount
I'm pretty sure all decent-sized utilities and system operators use one of the
major vendors, but there are many customizations for each customer. You're
right that most vendors go back decades, which is a blessing (proven to work
for decades) and a curse (lots of feature bloat and dated design decisions).

~~~
BunsanSpace
It's very rare that anyone goes through a SCADA vendor.

Most SCADA software is done through a system integrator, and that world is the
Wild West.

And the projects that do go through the original software vendor tend to be
massive multi-million-dollar ones that require significant custom work and new
features.

~~~
7thaccount
Yep. I generically refer to that as a vendor as well lol.

------
Aloha
I'm not really surprised, but the human process around this is designed to
catch such failures, and react to them.

Most complex systems are at least somewhat brittle; brittleness can be okay so
long as the human process around the system is designed to deal with it.

