Loss of power grid automatic control due to software update [pdf] (nerc.com)
62 points by eigenvector 43 days ago | 27 comments

> It was discovered that there was a change made to the alarm text during this AGC update that caused the failure. Prior to the update, when the PII value exceeded +/-999 MW, the value in the alarm text defaulted to “xxxx,” preventing dispatchers from having an accurate indication of the amount of PII change. During this deployment, the two MW value fields in the PII alarm text were modified from i4 to i5 (4 digit integers to 5 digit integers) to allow for an additional digit.

> The original alarm text array had 79 characters which almost hit the max character limit of 80, and the change from i4 to i5 resulted in the alarm text increasing by 2 characters to 81 characters in length. Writing an 81 character text string into an 80 character fixed length array resulted in a run-time abort of the task.

I bet nobody expected that modifying a string can crash the entire control system due to a (caught) buffer overflow. Apparently the reason it went undetected was that the modification was not tested on a demo system first, since it was believed to be trivial.
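The arithmetic in the quoted report can be sketched in a few lines. This is only an illustration, not the vendor's code: the alarm wording, `PREFIX`, `render_alarm`, and `write_fixed` are all made up, and the prefix is padded so that the 4-digit layout comes out at exactly 79 characters as described. Widening both MW fields by one digit pushes the text to 81, and a length check in front of the fixed-length write turns the abort into an error that can be handled.

```python
ALARM_LIMIT = 80  # fixed alarm-text length quoted in the report

# Hypothetical wording, padded so the 4-digit layout is exactly 79 characters.
PREFIX = "PII CHANGE ALARM: ".ljust(61)


def render_alarm(old_mw: int, new_mw: int, width: int) -> str:
    """Render two right-aligned MW fields, loosely mimicking i4/i5 formatting."""
    return f"{PREFIX}{old_mw:>{width}d} MW to {new_mw:>{width}d} MW"


def write_fixed(text: str, limit: int = ALARM_LIMIT) -> str:
    """Pad text to the fixed length, or refuse if it would not fit."""
    if len(text) > limit:
        raise ValueError(f"alarm text is {len(text)} chars, limit is {limit}")
    return text.ljust(limit)


print(len(render_alarm(250, -300, width=4)))  # 79
print(len(render_alarm(250, -300, width=5)))  # 81
```

A unit test that renders the widest possible values and passes them through `write_fixed` would flag the change before deployment.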

> Lesson Learned: No matter how small the change is, validate changes in a test environment first. This requires using a thorough test script for the changes.

Generally, SCADA, AGC, and any EMS software updates should go through multiple lower environments before getting into a production environment. Even critical patches would spend a little bit of time in a lower environment prior to production.

Are there lower AGC environments? The document concluded they need to test more, but I imagine there is just one AGC per system operator. BC Hydro, Fortis, BPA, PGE?

AGC just balances power flows so imports/exports over interconnections between power authorities flow as scheduled right?

Yeah it depends on the system operator, but I know some do for sure. In a test environment you might still receive the same input SCADA data, but could have entirely new calculations running. It's just that those calculations aren't sent anywhere.
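The "same inputs, outputs not sent anywhere" setup described above can be sketched as a shadow run. This is a rough illustration under assumptions: `shadow_run`, `production`, and `candidate` are hypothetical names, and comparing the two outputs is one possible check, not something the comment specifies. The candidate's result is only recorded, never dispatched.

```python
from typing import Callable, Iterable, List, Tuple


def shadow_run(
    production: Callable[[float], str],
    candidate: Callable[[float], str],
    inputs: Iterable[float],
) -> List[Tuple[float, str]]:
    """Feed the same readings to both versions and collect findings.

    Only the production result would ever be acted on; the candidate's
    result is compared and then discarded.
    """
    findings: List[Tuple[float, str]] = []
    for reading in inputs:
        expected = production(reading)
        try:
            actual = candidate(reading)
        except Exception as exc:  # a candidate abort is a finding, not an outage
            findings.append((reading, f"candidate aborted: {exc}"))
            continue
        if actual != expected:
            findings.append((reading, f"{actual!r} != {expected!r}"))
    return findings
```

Replaying a reading above 999 MW through something like this is the kind of case that would surface an abort in the new build without touching the live system.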

If my SCADA work is anything to go by, it goes on a demo/staging area for a week and then is deployed, in 90% of cases.

It's rare for an org (and usually only very large, heavily regulated government orgs) to do full testing on new builds.

I would assume controlling GW of power generation would be a very large, heavily regulated org...

Is your stuff for a system operator or single generator?

Software was a mistake.

No lesson learned, they're still doing it wrong.

1. They shouldn't be deploying to one of two redundant systems and then deploying to the other one an hour later. One system should keep running the old code for days, if not weeks. That would have prevented the loss of AGC.

2. They should have rollback procedures in case an update causes problems. Instead they added an emergency patch on top of the new bad code. This worked out OK but could have fucked things up even worse.
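Point 1 above amounts to a soak gate between the two redundant systems. A minimal sketch, assuming a hypothetical `may_update_second_system` check and a 7-day default that is only a stand-in for "days, if not weeks":

```python
from datetime import datetime, timedelta


def may_update_second_system(
    first_updated_at: datetime,
    now: datetime,
    aborts_on_first: int,
    soak: timedelta = timedelta(days=7),  # assumed value, tune per operator
) -> bool:
    """Allow the second system to take the new code only after the first
    has run it for the full soak period without any task aborts."""
    return aborts_on_first == 0 and now - first_updated_at >= soak
```

With a gate like this, an update made an hour earlier would be refused, and the second system would still be running the old code when the first one failed.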

1. When doing a system upgrade, a lot of times the new version won't play nicely with the old version. Key example: iFix, which is not backwards (or forwards!) compatible.

2. Sometimes adding a quick fix with a support engineer on the line is quicker than attempting a rollback. You're making a lot of assumptions here, but likely what happened was a "try to fix this ASAP and, if that fails, roll back" scenario.

A similar thing happened to a Fortune 500 client of mine when a payment gateway decided to make a production change to their API without notice and without deploying to their sandbox first.

Another case is the E911 system used by either Vacaville, CA or Solano County that cannot correctly parse GPS information from Verizon Wireless's system, leading to an inability to find callers. It is likely still an unresolved safety issue that will never be fixed.

Here are some important principles in this kind of critical system engineering:

0. Waterfall development model

1. Formal verification

2. Staging env

3. "Space Shuttle" control system quorum with deploy new code to one system of three for a month

4. Externally-coordinated change control approval, deployment and notification processes, including subscriber-published specifications

5. All unexpected software conditions raise human attention to relevant stakeholders, similar to Taiichi Ohno's "Stop the Line"
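Item 3 in the list above can be sketched as a majority vote across three replicas, one of which runs the new build. This is only a toy illustration: `quorum_output` and the controller callables are hypothetical, and real voting systems compare far more than a single integer.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def quorum_output(
    controllers: Sequence[Callable[[float], int]],
    reading: float,
) -> Optional[int]:
    """Return the value a strict majority of replicas agree on, else None."""
    results = []
    for controller in controllers:
        try:
            results.append(controller(reading))
        except Exception:
            continue  # a crashed replica simply contributes no vote
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    # Majority is counted against all replicas, including crashed ones.
    return value if votes * 2 > len(controllers) else None
```

If the one replica on new code aborts, the two on old code still form a majority and control continues. A `None` result is the case that should page a human, in the spirit of item 5.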

All I've ever heard about the US grid is that it's held together with spit and baling wire and is highly insecure--but that's just anecdotal.

(E.g., the SCAD systems that control various aspects are eminently hackable; a few high-powered rifle rounds fired at a Palo Alto area transformer caused serious problems for days; etc, etc.)

Does anyone have any good references about overall grid reliability and security?

To some extent this is true (and a lot of the software involved is god awful) but most of the US grid has plenty of redundancy so as to be resilient to the various storms, earthquakes, backhoes, and other damaging phenomena found in the US.

This is a good example of why redundancy doesn't protect against single-cause failure. Ugh I sound like the functional safety guy.

The core issue around power grids is that the PLCs in use have little to no security (in many cases they are too old to have any real security).

Further, in SCADA, safety takes precedence over security. If you need to hit the SCRAM button on a nuclear reactor, you can't have a cumbersome authentication procedure.

So NIST has a set of guidelines for security which generally boil down to physical security and DMZs. Anything that needs remote connection has to have its rx pins cut (for older devices).

It's not as bad as advertised, but security is defensibly taking a back seat.

If you're really curious, give NIST.SP.800-82r2 a read.

As far as grid cyber security, check out the CIP Standards. From an electrical engineering side, there are many textbooks on reliability (see contingency analysis, state estimation, voltage stability analysis, protective relaying, and transient stability analysis)

That sounds a bit extreme. Keep in mind that the US is broadly split into the Eastern and Western Interconnect where the East is a much more dense network compared to the West. Also there is ERCOT (Texas) which only connects to the East and West through DC tie substations.

The acronym is SCADA too.

Quebec, Canada is like that: its own system, and only interconnected through HVDC links.

Any important grid data and control signals go on a hardware VPN network run by, I think, AT&T. There's no way to get on that network unless you're authorized.

This is for CAISO utilities; other regions might have less or more security.

A lot of power utilities have their own fibre as they own lots of poles and transmission towers and have lots of skilled staff to design systems, string up cable and make terminations.

Strong type checking should be able to detect this kind of overflow statically. Probably not practical in the kinds of software involved though.

The footer indicated the issue occurred in the western region. I wonder if this AGC system is built on a SCADA or control system package offering from one of the major vendors, or if it is more of a one-off or even totally home brew.

Some SCADA software has existed since the early 80s and they have supported their customers through upgrades the whole way. As you can imagine there is some serious cruft.

I'm pretty sure all decent sized utilities and system operators use one of the major vendors, but there are many customizations for each customer. You're right that most vendors go back decades which is a blessing (proven to work for decades) and a curse (lots of feature bloat and dated design decisions).

It's very rare that anyone goes through a SCADA vendor.

Most SCADA software is done through a system integrator which is the wild west.

And the projects that do go through the original software vendor tend to be massive multi million dollar ones that require significant custom work and new features.

Yep. I generically refer to that as a vendor as well lol.

also some of the ones with a really long lineage are impressively fast and memory efficient since they used to run on such old hardware. Actually dealing with some 80 character length alarm strings right now!

I'm not really surprised, but the human process around this is designed to catch such failures, and react to them.

Most complex systems are at least somewhat brittle. Brittle can be okay so long as the human process around it is designed to deal with that brittleness.
