> The original alarm text array had 79 characters which almost hit the max character limit of 80, and the change from i4 to i5 resulted in the alarm text increasing by 2 characters to 81 characters in length. Writing an 81 character text string into an 80 character fixed length array resulted in a run-time abort of the task.
I bet nobody expected that modifying a string could crash the entire control system via a (caught) buffer overflow. Apparently it went undetected because the modification was never tested on a demo system first; it was believed to be trivial.
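Out of curiosity I sketched the failure mode in C. This is not the vendor's actual code; the message text, the id value, and the "%5d" field width (standing in for the i5 descriptor) are all made up to reproduce the 79 -> 81 character growth described above:

```c
#include <stdio.h>
#include <string.h>

#define ALARM_LEN 80   /* fixed-length alarm text array, as in the report */

int main(void)
{
    char alarm[ALARM_LEN];
    int point = 10234;   /* a 5-digit value is what forced i4 -> i5 */

    /* Format into a scratch buffer first; the dots are padding that
     * pushes the result past 80 characters. */
    char scratch[256];
    int n = snprintf(scratch, sizeof scratch,
        "ALARM %5d: ....................................................................",
        point);

    if (n >= ALARM_LEN) {
        /* The legacy runtime aborted the whole task at the equivalent
         * of this point; truncating (or rejecting) is the safe move. */
        fprintf(stderr, "alarm text is %d chars, max is %d\n",
                n, ALARM_LEN - 1);
        scratch[ALARM_LEN - 1] = '\0';
    }
    memcpy(alarm, scratch, ALARM_LEN);
    puts(alarm);
    return 0;
}
```

Truncating an alarm has its own failure modes, of course, but it beats taking down AGC.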
> Lesson Learned: No matter how small the change is, validate changes in a test environment first. This requires using a thorough test script for the changes.
AGC (Automatic Generation Control) just balances power flows so that imports/exports over interconnections between power authorities flow as scheduled, right?
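My mental model is that it nudges generation to drive the area control error (ACE) to zero. A toy sketch of the tie-line-bias form; every constant and number here is illustrative, not any real balancing authority's settings:

```c
#include <stdio.h>

/* Toy tie-line-bias ACE computation. AGC adjusts generation output
 * to drive this quantity toward zero. */
static double area_control_error(double ni_actual, double ni_sched,
                                 double f_actual, double f_sched,
                                 double bias /* MW per 0.1 Hz, negative */)
{
    /* NERC-style form: ACE = (NIa - NIs) - 10 * B * (Fa - Fs) */
    return (ni_actual - ni_sched) - 10.0 * bias * (f_actual - f_sched);
}

int main(void)
{
    /* Exporting 50 MW over schedule while frequency runs 0.02 Hz high */
    double ace = area_control_error(450.0, 400.0, 60.02, 60.00, -25.0);
    printf("ACE = %.1f MW\n", ace);   /* positive -> back generation down */
    return 0;
}
```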
It's rare for an org to do full testing on new builds; usually only very large, heavily regulated government orgs do.
1. They shouldn't be deploying to one of two redundant systems and then deploying to the other an hour later. One system should keep running the old code for days, if not weeks (see the sketch after this list). That would have prevented the loss of AGC.
2. They should have rollback procedures in case an update causes problems. Instead they added an emergency patch on top of the new bad code. This worked out OK but could have fucked things up even worse.
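A toy sketch of point 1, a gate that refuses to touch the second redundant system until the first has soaked on the new build (the 14-day figure is an arbitrary example, not any actual standard):

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical soak gate: the second redundant system may only be
 * upgraded after the first has run the new build for SOAK_DAYS. */
#define SOAK_DAYS 14

static int ok_to_upgrade_peer(time_t peer_upgraded_at, time_t now)
{
    double days = difftime(now, peer_upgraded_at) / 86400.0;
    return days >= SOAK_DAYS;
}

int main(void)
{
    time_t now = time(NULL);
    time_t an_hour_ago = now - 3600;   /* what actually happened here */

    printf("upgrade the second system? %s\n",
           ok_to_upgrade_peer(an_hour_ago, now) ? "yes" : "no");
    return 0;
}
```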
2. Sometimes adding a quick fix with a support engineer on the line is quicker than attempting a rollback. You're making a lot of assumptions here; what likely happened was a "try to fix this ASAP, and if that fails, roll back" scenario.
Another case is the E911 system used by either Vacaville, CA, or Solano County, which cannot correctly parse GPS information from Verizon Wireless's system, leaving dispatchers unable to locate callers. It is likely still an unresolved safety issue that may never be fixed.
Here are some important principles in this kind of critical system engineering:
0. Waterfall development model
1. Formal verification
2. Staging env
3. "Space Shuttle" control system quorum with deploy new code to one system of three for a month
4. Externally-coordinated change control approval, deployment and notification processes, including subscriber-published specifications
5. All unexpected software conditions are escalated to the attention of relevant stakeholders, similar to Taiichi Ohno's "Stop the Line"
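For item 3, here's a minimal sketch of what a 2-of-3 quorum buys you; the setpoint values are invented:

```c
#include <stdio.h>

/* Hypothetical 2-of-3 voter: the voted output is the median of the
 * three controllers' answers, so two agreeing units always win and a
 * single divergent (e.g., freshly upgraded) unit can't steer alone. */
static double vote3(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

int main(void)
{
    /* Unit 3 runs the new build and diverges; it gets outvoted. */
    double old1 = 60.00, old2 = 60.01, new3 = 59.20;
    printf("voted setpoint: %.2f Hz\n", vote3(old1, old2, new3));
    return 0;
}
```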
(E.g., the SCAD systems that control various aspects are eminently hackable; a few high-powered rifle rounds fired at a Palo Alto area transformer caused serious problems for days; etc, etc.)
Does anyone have any good references about overall grid reliability and security?
Further, in SCADA, safety takes precedence over security. If you need to hit the SCRAM button on a nuclear reactor, you can't have a cumbersome authentication procedure in the way.
So NIST has a set of guidelines for security, which generally boil down to physical security and DMZs. Anything that needs a remote connection has to have its RX pins cut (for older devices), leaving the link physically transmit-only.
It's not as bad as advertised, but security is defensibly taking a back seat.
If you're really curious, give NIST.SP.800-82r2 a read.
(The acronym is SCADA, by the way, not SCAD.)
This is for CAISO utilities; other regions might have more or less security.
Some SCADA software has existed since the early '80s, and the vendors have supported their customers through upgrades the whole way. As you can imagine, there is some serious cruft.
Most SCADA work is done through a system integrator, and that world is the Wild West.
And the projects that do go through the original software vendor tend to be massive, multi-million-dollar ones that require significant custom work and new features.
Most complex systems are at least somewhat brittle, and brittle can be okay so long as the human process around them is designed to deal with that brittleness.