Full tech report of UK 9th August power outage (ofgem.gov.uk)
80 points by trebligdivad 32 days ago | 46 comments

To summarize:

* Multiple contingencies occurred simultaneously (loss of generation from two major generators and loss of distributed generation totalling 1,400 MW) resulting in a drop in system frequency to 49.1 Hz

* Standby generation (frequency response reserve) was deployed, totalling 1,000 MW (sized to cover the largest single generation contingency), and began to arrest the system frequency decline

* Just as system frequency began to recover, a third contingency occurred resulting in the loss of a further 210 MW of generation. This caused system frequency to decline again to 48.8 Hz

* Load shedding kicked in as designed and dropped 5% of load to stabilize the system
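As a rough tally of the sequence above (a sketch only: it treats the 1,400 MW figure as the total initial loss and ignores governor action and frequency-dependent demand), the gap between lost generation and deployed reserve can be worked out like this:

```python
# Figures taken from the summary above, all in MW.
initial_loss = 1400      # first two generators plus distributed generation
reserve_deployed = 1000  # frequency response reserve
third_loss = 210         # further loss as frequency began to recover


def shortfall(losses, reserve):
    """Return generation lost beyond what the reserve covers (never negative)."""
    return max(0, sum(losses) - reserve)


after_first_event = shortfall([initial_loss], reserve_deployed)
after_third_loss = shortfall([initial_loss, third_loss], reserve_deployed)
print(after_first_event, after_third_loss)
```

Roughly speaking, the remaining gap is what the load shedding had to make up.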

The largest loss of generation was from Hornsea offshore wind farm. The wind farm should have ridden through the system disturbance, but instead its control and protection systems rapidly curtailed active power generation in response to an undamped oscillation in the response of its voltage regulator through the disturbance.

Basically, the internal voltage of the Hornsea wind farm collector system dropped due to the voltage regulator oscillations (from 35 kV nominal to 20 kV), while active power generation remained the same. Power = current * voltage, so an overcurrent condition occurred and protection systems operated to prevent overload of the wind turbine generators.
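To put a number on that, here is a minimal sketch assuming a balanced three-phase system at unity power factor. The 100 MW output is a made-up figure; the ratio does not depend on it.

```python
import math


def line_current_amps(power_mw, line_voltage_kv):
    """Three-phase line current for a given active power at unity power factor."""
    return power_mw * 1e6 / (math.sqrt(3) * line_voltage_kv * 1e3)


power_mw = 100.0  # hypothetical output, for illustration only
nominal = line_current_amps(power_mw, 35.0)    # at 35 kV nominal
depressed = line_current_amps(power_mw, 20.0)  # at the depressed 20 kV
ratio = depressed / nominal
print(f"current rises by a factor of {ratio:.2f}")
```

With power held constant, the current scales as 35/20, so it is 75% above nominal at 20 kV.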

Subsynchronous oscillations (SSO), i.e. oscillations at below power frequency (50 Hz), are a known issue in power system controls that can lead to unstable or unexpected consequences during system disturbances. The reduction in system inertia caused by the replacement of large synchronous machines with asynchronous generators as wind and solar replace conventional generators exacerbates the possibility of problematic SSO because there is less damping.

Nowadays, in North America, very specific modelling is done at the design stage to identify the possibility for such behaviour and ensure that if present it is adequately damped. Some system operators, such as ERCOT (Texas), require this for new wind projects. I imagine that this major occurrence will lead to revisions to modelling and grid code testing standards in the UK to protect against future incidents.

All in all, kudos to Ofgem, National Grid and all other participants for producing a thorough, public technical report in just about one month.

Was the fact that the Hornsea site is a wind farm a contributing factor or is that merely a coincidence?

Sort of: what matters is that it was not an AC synchronous generator, like traditional thermal power plants. Wind and solar systems these days have fully digital AC-AC conversion systems which take the variable frequency multi phase output of the wind turbine(s) and turn it into standard three phase.

Turbine-generator systems have nice, simple behaviour in response to frequency drop: they act to maintain the frequency by transferring more energy from the shaft rotation to the generator. In the long run this slows the turbine down or triggers a throttle response, but over the few second period we're talking about the shaft speed is basically constant due to its own inertia.
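The effect of that stored rotational energy can be sketched with the aggregate swing equation. The inertia values below are hypothetical and not taken from the report; the point is only that halving the stored energy doubles the initial rate at which frequency falls.

```python
def initial_rocof_hz_per_s(power_lost_mw, system_inertia_mws, nominal_hz=50.0):
    """Initial rate of change of frequency from the aggregate swing equation.

    df/dt = -dP * f0 / (2 * E_kinetic), with E_kinetic in MW*s.
    """
    return -power_lost_mw * nominal_hz / (2.0 * system_inertia_mws)


# Hypothetical stored kinetic energy, for illustration only.
high_inertia = initial_rocof_hz_per_s(1400, 300_000)
low_inertia = initial_rocof_hz_per_s(1400, 150_000)
print(high_inertia, low_inertia)
```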

The wind farm "saw" the rapid fluctuations in connection voltage, tried to compensate, and instead went into oscillation. This appears to have been a software bug:

> "During the incident, the turbine controllers reacted incorrectly due to an insufficiently damped electrical resonance in the subsynchronous frequency range, so that the local Hornsea voltage dropped and the turbines shut themselves down. Orsted have since updated the control system software for the wind turbines and have observed that the behaviour of the turbines now demonstrates a stable control system that will withstand any future events in line with Grid Code and CUSC requirements"

(Oscillation damping is "control theory 101", but in a complex system like this it's not so easy!)

Good news is that, while more renewables do potentially have this kind of vulnerability, battery systems are the perfect counter. Some are already being deployed for "fast frequency response". Being a DC-AC system, they can deploy power with any frequency and phase angle required to compensate for problems.

> Oscillation damping is "control theory 101", but in a complex system like this it's not so easy

This is a nice understatement.

My first gig (summer after freshman year) I worked with D. Van Ness, who had an inquiry from the Bonneville Power Administration to determine why their frequency was oscillating (yes, the frequency). This oscillation would rapidly get worse until something tripped and the whole network in the Northwest would go down.

He modeled the system with a state vector and interconnect matrix. The matrix was 500x500 and the path to understanding it was to find Eigenvectors and Eigenvalues of this system. If there are any poles to the right of the y-axis (that is, with a positive real part), you have an oscillator. Over time, they changed enough to get it stable.
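A toy version of that eigenvalue test, shrunk from 500x500 to a 2x2 state matrix so it needs only the standard library. The oscillator parameters are made up; a real study would use a numerical linear algebra package on the full matrix.

```python
import cmath


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return ((trace + root) / 2, (trace - root) / 2)


def is_stable(eigs):
    """Stable when every eigenvalue (pole) has a negative real part."""
    return all(e.real < 0 for e in eigs)


def oscillator_matrix(omega, zeta):
    """State matrix entries for x'' + 2*zeta*omega*x' + omega**2 * x = 0."""
    return (0.0, 1.0, -omega * omega, -2.0 * zeta * omega)


# Hypothetical modes: one positively damped, one negatively damped.
damped = eigenvalues_2x2(*oscillator_matrix(omega=2.0, zeta=0.1))
growing = eigenvalues_2x2(*oscillator_matrix(omega=2.0, zeta=-0.1))
print(is_stable(damped), is_stable(growing))
```

The negatively damped case puts the pole pair in the right half plane, which is the growing oscillation described above.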

And you make some good points about the synchronization available if everything is a classic generator, and these other power sources are not.

And this was many years ago, so the power systems of today are likely much harder to model.

You put this nicely:

> They act to maintain the frequency by transferring more energy from the shaft rotation to the generator

Another way to think of this is that in a system with more than one generator, a phase difference anywhere in the hookup causes power to flow in direct proportion to the difference in phase angle. In other words, the slow generator becomes a motor.
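A minimal sketch of that relationship, using the textbook lossless-line formula P = V1*V2/X * sin(delta). The line voltage and reactance are hypothetical. For small angles the sine is close to linear, which is where the "direct proportion" comes from.

```python
import math


def power_transfer_mw(v1_kv, v2_kv, reactance_ohm, angle_deg):
    """Active power from bus 1 to bus 2 over a lossless line (kV^2/ohm = MW)."""
    delta = math.radians(angle_deg)
    return v1_kv * v2_kv / reactance_ohm * math.sin(delta)


# Hypothetical 400 kV line with 50 ohm series reactance.
leading = power_transfer_mw(400, 400, 50, 5)   # machine 1 ahead: it exports
lagging = power_transfer_mw(400, 400, 50, -5)  # machine 1 behind: it imports
print(round(leading), round(lagging))
```

The sign flip for the lagging machine is the "slow generator becomes a motor" case.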

Yes, the SSO issue in the northwest US power system is a textbook case of this, although in that case IIRC the main issue was control interactions between the power system stabilizers* oscillating together and actually exchanging quite a bit of energy over long distances at a low frequency because those very low frequencies were not effectively damped (or in some cases, negatively damped). At the time, the tools available for large scale power system modelling were very rudimentary compared to what we have today.

In general I don't see it as a renewable vs. conventional issue. SSO/SSR/SSCI have been around since the 1960s when PSS started to be deployed in synchronous generator excitation control systems. Rather it reflects the greater complexity involved in modelling high speed digital controls vs. physical, inertial responses that are expressed very effectively by well-known equations. As we layer on more and more controls, we don't only have to model what is going on at power frequency (50 or 60 Hz) but also at harmonic frequencies and sub-synchronous frequencies. Renewable generators just happen to depend much more heavily on complex control systems for power conversion, for mimicking synchronous generator response characteristics and for marrying all the components of a large renewable plant together.

At the same time, we have far more powerful tools for power system simulation today that can effectively mitigate this risk, as long as engineers realize the risk is there.

A good reference explaining SSO as it applies to conventional generators can be found here: http://www.cigre.org.br/archives/pptcigre/07_subsynchronous_...

* Power system stabilizers (PSS) are a part of synchronous generator excitation control that improves dynamic stability by damping generator oscillations against the grid. However, PSS systems can actually cause additional, long-distance oscillations with other PSS systems in the frequency range of 0.1 to 1 Hz. See: https://www.wecc.org/Reliability/Power%20System%20Stabilizer... and http://www.meppi.com/Products/GeneratorExcitationProducts/St...

That is the thing about such a highly dynamic system. "Let's just add this damping right here." Then, we have a system that is very slow to recover.

This is all made more complex by the fact that many of the components of these systems have really non-linear behavior. Like a dam spill that hits a hard boundary.

Without knowing much about the specifics of regulating wind generation, I'd say it adds some complexity, and we don't have the same number of decades of experience of operating it at scale that we do for some other generation techniques.

So I don't see that wind (in the sense of weather patterns) was a contributing factor, but that complexity of regulation probably did contribute.

But then again, as the Little Barford CCGT station showed, it's still perfectly possible to have unexpected failure modes on more conventional generating equipment. (Little Barford entered service in 1996, and is presumably fairly typical of the kind of CCGT stations that were built in large numbers in the UK through the 90s.)

Interesting thanks. A lot of people are using this power outage as an argument against wind power.

Not directly from my reading; but it was a very windy day so it was contributing quite a lot, so when it failed it made a bigger impact than normal. They also talk about sources that provide 'inertia' - and I don't understand if a wind farm counts or not; I don't think it does

"The reduction in system inertia caused by the replacement of large synchronous machines with asynchronous generators"

So would a large fly wheel, or something similar be of value?

Yes I was thinking it's basically a big battery/capacitor.

I wasn't sure whether, for the intended use case (compensating for 1000s of MW of power loss), a battery would necessarily be the best thing.

Very interesting from engineering view and the impact to rail.

Rail Headline:-

    Page 27. "The effects were exacerbated as the fleet was undergoing a software change which meant the train drivers could not recover trains which were operating on the new software." 
My view: Twenty-eight more units would have required a technician to visit if the software roll-out had been completed. Potentially exponentially increasing the disruption to public.

Appendix F – Govia Thameslink Railway (GTR) technical report, Page 47-50. http://www.ofgem.gov.uk/system/files/docs/2019/09/eso_techni...

    Appendix F, page 49: "Therefore, the affected Class 700 and 717 sets did not react according to their design intent in these circumstances."
The ability for the driver to recover was removed as part of the software update. (See, Appendix F, Cause point 8). Operators are a key stakeholder.

Great technical report, would have loved to have had more information on Victoria line. Lessons can be learned from this report.

Yeh, I'm guessing they were having problems with drivers doing a reset for any random thing they couldn't figure out. But shit happens, you really need a way to get out of trouble.

Oof. The report from the steam turbine operator is dicey. Their time stamps are relative, so they don’t have GPS time sync, but they are ms precision so they must have a decent sequence of events record. But they don’t know why the 200MW steam turbine unit tripped, which caused a knock-on trip of 400 MW of gas turbines. Report says steam turbine tripped due to a discrepancy in the speed signals. Will be interesting to see what that is! The turbine speed will be measured either by the number of teeth passing a magnetic pickup or proximity sensor per unit of time (with the unit of time varied so that it is always a whole number of teeth), or by measuring the period of the generator voltage waveform, which is also a nice speed signal, but maybe not if you are trying to ride through a fault in which the voltage has collapsed or is subject to bad harmonics. Maybe harmonics caused an instantaneous overspeed trip in the turbine governor? Protection relays wouldn’t be involved in measuring turbine speed and should be smart enough not to trip on overfrequency in harmonics during a fault.
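For what it's worth, the toothed-wheel measurement can be sketched as below. The wheel size and timings are hypothetical, and this is not a claim about how the Little Barford governor actually works; it only shows how a small timing disagreement between two channels shows up as a speed discrepancy.

```python
def rpm_from_tooth_period(teeth_counted, elapsed_s, teeth_per_rev):
    """Shaft speed from the time taken for a whole number of teeth to pass."""
    revolutions = teeth_counted / teeth_per_rev
    return revolutions / elapsed_s * 60.0


# Hypothetical 60-tooth wheel on a 3000 rpm (50 Hz, two-pole) machine.
nominal = rpm_from_tooth_period(teeth_counted=60, elapsed_s=0.020, teeth_per_rev=60)
# A 1% timing error on one channel looks like a 1% speed discrepancy.
skewed = rpm_from_tooth_period(teeth_counted=60, elapsed_s=0.0198, teeth_per_rev=60)
print(round(nominal), round(skewed))
```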

Yeah I found that part the most concerning. Basically RWE don't have any clue why they tripped even more than a month on from the event, and apparently have to wait for their next scheduled outage to make progress? That's not good.

There also seem to be multiple faults at once that they don't know anything about. Turbine trip? No clue, could be bad sensors, could have been some actual physical problem in the turbine. Overpressure in the condenser? No clue, could be anything. Second generator tripped? Dunno boss.

Implies that they don't have enough sensor coverage and are hoping to literally eyeball something wrong when they next open it up. Also implies they can't shut down their own plant to diagnose apparent faults in anything like a reasonable timeframe? Not good.

Also worth noting - Newcastle Airport were totally fine on their UPS but demanded to be considered a priority customer anyway, and that request was granted? Why? They clearly don't need to be, they have a working UPS!

Honestly I'd be sending this report back for more work if I were the boss guy receiving it. It's not good. Filled with repetition, bad grammar, missing information (why did London Underground shut down, what was this 'internal traction issue') and most seriously it leaves a gaping hole around Little Barford.

> They clearly don't need to be, they have a working UPS!

The impression I get from the data centre outages reported here on HN is that backup generators are about the least reliable thing in IT.

Yeah, but most places don't have UPS at all. If Newcastle Airport can survive a power outage but other places cannot, why should it be prioritised?

Because the airport's UPS+Generator is only, say, 99% effective.

And a 1% chance of losing power to air traffic control and runway lights is worse than a 100% chance of 1000 homes having their dinner spoiled by the cooker turning off.

> backup generators are about the least reliable thing in IT.

Not just in IT, as the Badim Hospital fire (with 11 dead and IIRC over 70 wounded) two days ago shows.

I’m really quite impressed with the ESO, who is getting all the flak. They’d planned for a loss of 1GW; lost 1.9GW and had protected 95% of the network within 5 minutes. I hope everyone else realises that these networks aren’t infallible and they need to plan for very occasional outages.

The smoking gun is the last plot in appendix D. A 2% step change in voltage results in a VAR oscillation of initially +/-100MVAR that takes 2 seconds to die out, with 13 peaks over that two seconds. This was the behavior before the outage. Probably it has been like that since it was connected but nobody bothered to look. This is a problem with the smart grid: so much data is generated but nobody has time to look at any of it. Probably a good application for some kind of ML or AI.
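Those two numbers are enough for a back-of-envelope damping estimate. This is a sketch only: it assumes a lightly damped second-order response and the usual 2% settling rule, neither of which comes from the report.

```python
import math


def oscillation_hz(peaks, duration_s):
    """Average oscillation frequency from a peak count over a time window."""
    return peaks / duration_s


def damping_ratio_estimate(freq_hz, settling_time_s):
    """Rough damping ratio using the 2% settling rule Ts ~= 4 / (zeta * omega_n).

    Assumes light damping, so omega_n is taken equal to the observed frequency.
    """
    omega = 2.0 * math.pi * freq_hz
    return 4.0 / (settling_time_s * omega)


freq = oscillation_hz(peaks=13, duration_s=2.0)
zeta = damping_ratio_estimate(freq, settling_time_s=2.0)
print(f"{freq:.1f} Hz, zeta ~ {zeta:.3f}")
```

Under those assumptions the damping ratio comes out around 0.05, which is very lightly damped for a voltage control loop.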

They say it required a software update to fix it which was applied the next day - probably this was just a change in the gains in the voltage controller rather than an update to the actual program or firmware.

Somebody did a bad job of commissioning the voltage regulators on the wind turbines, and that is what caused a normal transmission line reclose to escalate into such a large loss of generation.

I am still curious as to why the delayed auto-reclose (DAR) time on the transmission line is 20s. I would have thought it would be more like 1-2s tops.

One of the issues in commissioning facilities as large as Hornsea is that if you want to do field validation of things like dynamic voltage regulator response (instead of just steady state performance), actually creating a sufficiently large disturbance to test it may not realistically be possible without impacting grid reliability. Hence the reliance on modelling. The presence of an underdamped SSO probably could have been identified in modelling, but there's no indication that National Grid requires SSO studies in design phase.

Also I'm not sure I agree that the SSO was there all along. The system configuration appears to be several STATCOMs at the HV interconnection substation plus the VAR capabilities of the individual wind turbines. There may be an interaction between these control systems that leads to SSO under certain conditions only while being effectively damped at other times.

As for the reclose time, it may have to do with circuit breaker duty cycles. We don't know what equipment they're using but if it's dated stuff, it's conceivable that it requires that level of delay before it's rated for another interrupting operation.

My dad worked in power generation from the 1960s to the 90s. My recollection is that the 20 second reclose has been common for quite a while, so there are probably a lot of systems out there that assume that's what the fault clearance procedure will be.

Computer controlled artificial loads should exist large enough for any generator...

It should be a device that can be brought in on a truck, hooked up to the plant during commissioning, and the artificial load should be able to replay any conditions during grid incidents in the past, to check all the control systems work as designed.

A 1 Gigawatt artificial load that can work for 1 second could consist of 10 cubic meters of water (on a truck) and a few miles of nichrome wire... You'll also be needing some beefy switching silicon to simulate something more than a simple resistive load, but they exist already in any DC undersea power project.
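A quick sanity check of the water figure, assuming all the energy ends up as heat in the water and using approximate textbook constants:

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate
WATER_DENSITY = 1000.0        # kg/m^3, approximate


def temperature_rise_k(power_w, duration_s, volume_m3):
    """Temperature rise of a water volume absorbing all the energy as heat."""
    energy_j = power_w * duration_s
    mass_kg = volume_m3 * WATER_DENSITY
    return energy_j / (mass_kg * WATER_SPECIFIC_HEAT)


rise = temperature_rise_k(power_w=1e9, duration_s=1.0, volume_m3=10.0)
print(f"{rise:.1f} K")
```

That is a rise of roughly 24 K, so the thermal side of the claim holds up; getting the heat out of the wire and into the water within one second is a separate question.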

So a 2% step change in voltage regulator setpoint for any individual WTG would be non oscillatory but in aggregate, or in interaction with the statcom it oscillates at 6.5 Hz?

When I’m setting the gains for any control loop I tend to prefer choosing lower gains that still meet the performance requirements rather than having high gains closer to the edge of stability. I would not leave a system behind that had 13 oscillations after a step.

All I'm saying is that the frequency-impedance characteristic of the local system changes depending on active power loading, other nearby generators in or out service, nearby reactors or capacitors in service, etc. So it's conceivably possible that the oscillatory behaviour was more effectively damped both in the modelling cases and in whatever field testing they did. Though if you read the grid code compliance testing report for Hornsea, it's not evident they actually did voltage step change field validation.

If this kind of oscillation was present under all conditions, it would be an oversight to not have caught it in the modelling stage. The level of modelling we have to do in some North American regions for a 50 MW wind farm would catch that kind of behaviour, let alone an 800 MW unit.

Indeed, large outages are often when utilities find out that the models don’t quite match reality. One example would be when 1200 MW of generation tripped in the WECC region and they found the effective droop of the system was about 8% instead of 5%, and have since mandated that every generator over 10 MVA is tested every 5 years to ensure the equipment still matches the model.
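To see what 8% versus 5% droop means in practice, here is a sketch of the steady-state droop relation. The responding capacity is a made-up number chosen only to show the ratio; it is not a WECC figure.

```python
def droop_frequency_drop_hz(power_lost_mw, responding_capacity_mw, droop, nominal_hz=60.0):
    """Steady-state frequency drop for a given aggregate governor droop.

    droop is per unit (0.05 means 5%): a 100% load change moves frequency by 5%.
    """
    return droop * nominal_hz * power_lost_mw / responding_capacity_mw


# Hypothetical responding capacity, for illustration only.
capacity_mw = 100_000
modelled = droop_frequency_drop_hz(1200, capacity_mw, droop=0.05)
observed = droop_frequency_drop_hz(1200, capacity_mw, droop=0.08)
print(modelled, observed, observed / modelled)
```

Whatever the capacity, the frequency settles 1.6 times further from nominal than the 5% model predicts.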

Fun bits:

* Software upgrade your wind farm to remove bad responses/oscillations

* Software upgrade on trains removed ability for drivers to reboot them (& they intend to keep it that way?!) so had to send techs with laptops

* Upgrades needed to improve loss-of-mains detection in small generators.

>Software upgrade on trains removed ability for drivers to reboot them (& they intend to keep it that way?!)

From section 5.2.1: "The train manufacturer, Siemens, are developing a patch which will allow the drivers to recover the trains themselves without the need for a reboot or technician to attend site."

sounds like '737 max' engineering again. sigh.

Finally an explanation of what a train driver actually does - they are there to reboot the automation if necessary. But what have they been doing up until now? Train driving is a very well paid job, over twice as much as a bus driver, for a fraction of the work... they don’t even steer it!

Lots of responsibility in driving a train. That's usually going to be dozens, and for some trains regularly hundreds of passengers, or in freight hundreds or thousands of tonnes of freight.

It's psych-profiled, companies which employ drivers will be looking for "compliance" (a psychological tendency to obey rules even if you don't understand why) so that the driver obeys all the safety rules.

It's also a fairly complicated machine, not as complicated as a jet liner but far more complicated to operate than a bus, so that reduces your pool of candidates further, in most cases they'll be looking for someone with some mechanical aptitude to understand how it works.

They need communication skills, the driver needs to work with their signallers, and potentially also company dispatch, and on trains without separate customer service personnel they need to talk directly to passengers.

For example yesterday I was on a train which was delayed by trespassers. The driver will have needed to use "proceed with caution" rules, where they drive the train slowly enough that they can always stop it within the distance they can clearly see, obeying any signals, and then call their signaller back each time a signal cancels that authority, to get a new authority overriding each signal. Then, clear of the problem but much delayed, they needed to handle the fact that their dispatch turned their train into an Express to get it back where it should be, so they need to make announcements to passengers about where passengers should disembark to get a different train that's still going to their destination.

Mainline train drivers make similar money to me (or at least similar to what I made five years ago) but I can't say I feel like they don't earn it. Like me their job is pretty easy when things go right, but not so much when things go wrong. Lots of people couldn't do it, and more wouldn't.

Here is a very informative twitter thread from a train driver explaining what is involved in learning a route https://twitter.com/chris_thedriver/status/11637928873871646...

Also, 10 minute reboot time. And I thought Windows Update was bad!

I've been on trains when they were obviously rebooting due to some sort of fault. It takes forever. What is the train doing during this time, exactly? Does every sub-computer on board boot serially or something?

There are normally a lot of self-tests involved in a reboot.

Every actuator will be energised, de-energised, moved end to end to test limit switches, etc.

It's mostly done serially because some things would be disrupted by other things, and figuring out a dependency tree is tricky.

Quite a lot of stuff is auto-configured in bootup. For example, it might spend 10 seconds trying to ping a debug console to see if it should enter debug mode. Or 30 seconds with IP addresses configured to the 'lab' setup before switching over to the production networking config when the lab settings won't let it connect to anything.
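If the dependencies were written down, getting a valid serial start-up order from them is the easy part. A sketch using Python's standard `graphlib` (3.9+), with entirely hypothetical subsystem names:

```python
from graphlib import TopologicalSorter

# Hypothetical subsystems; each maps to the set it must wait for.
dependencies = {
    "power_supply": set(),
    "network": {"power_supply"},
    "brake_controller": {"power_supply"},
    "door_controller": {"network"},
    "traction": {"brake_controller", "network"},
}


def boot_order(deps):
    """Return one valid serial start-up order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = boot_order(dependencies)
print(order)
```

The hard part on a real train is knowing what goes in that table, including which self-tests physically interfere with each other.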

Good explanation, thanks.

What is the purpose of these embedded generation protections:

Vector Shift Protection (triggered by lightning, led to loss of 150 MW):

As far as I can see, this protection shuts down generation when part of the grid might be disconnected from the rest. Shutting down when islanding hasn't occurred is wrong, and destabilises the grid. Perhaps we should be measuring islanding another way? What about applying Gold-coded frequency modulation to the actual system frequency? A Gold code of length 1 million could be injected on just a few points on the national grid, at a power of just a few kilowatts, and be measurable from anywhere. When islanding occurs, the signal disappears, and embedded generation can switch off?

Rate of change of frequency protection (led to loss of 350MW). What's the purpose of this protection at all? If frequency is changing in a downward direction, the faster it's falling, the more important it is not to disconnect supply.

High positive rate of change of frequency might be a reason to disconnect generation to prevent oscillation (effectively acting as the "D" term in a pid loop), but did this occur?
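For reference, a RoCoF check in its simplest form looks something like the sketch below. The samples and the threshold setting are hypothetical. Note that it acts on the magnitude of df/dt, so a fast fall trips it just as a fast rise would.

```python
def rocof_hz_per_s(samples_hz, sample_interval_s):
    """Average rate of change of frequency across a window of samples."""
    span_s = sample_interval_s * (len(samples_hz) - 1)
    return (samples_hz[-1] - samples_hz[0]) / span_s


def relay_trips(samples_hz, sample_interval_s, threshold_hz_per_s):
    """Trip on the magnitude of df/dt, whichever way frequency is moving."""
    return abs(rocof_hz_per_s(samples_hz, sample_interval_s)) > threshold_hz_per_s


# Hypothetical measurements (one sample every 100 ms) and setting.
falling = [50.00, 49.98, 49.96, 49.94, 49.92, 49.90]
steady = [50.00, 50.00, 49.99, 50.00, 50.00, 49.99]
print(relay_trips(falling, 0.1, 0.125), relay_trips(steady, 0.1, 0.125))
```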

Signals on power lines get filtered out by transformers, line reactors, capacitors, etc. They are also expensive to inject and extract since it requires high voltage connections which have to be of the same quality as the rest of transmission system components. I’ve worked both with power line carrier for transfer trip schemes and ripple plants for controlling hot water loads. More likely your ripple plant is going to have some failure resulting in tripping all your generation when it didn’t need to.

Many important plants and transmission lines are connected by a utility’s own fiber which can be used to transmit the actual state of the system instead of trying to infer it from the power waveforms. This is the best solution but obviously expensive.

Uncontrolled decentralized embedded generation is not meant to ever energize a dead line. If it has a solid state power electronics interface to inject power (a fancy inverter) it is probably operating in a mode where it follows the waveform on the grid to make sure it stays in phase. If the grid waveform is poor quality, full of harmonics due to faults, or phase shifts due to major loads or generation disappearing resulting in instantly changing power flows, the inverter can’t stay in phase. It is probably a delicate balancing act to be able to follow the grid frequency but also affect it by injecting active and reactive power. I’m not an inverter guy so this is mostly speculation from reading data sheets and manuals for grid tie small battery systems.

Not sure about RoCoF tripping generation on falling frequency. There are situations in which injecting power into an island can result in overvoltages that would damage all of the equipment on the island, so it is better to avoid it if it looks like an island might be forming. Just a guess based on experience with an embedded steam turbine.

There could have been a very short increase in frequency for the 80ms or 4-5 cycles when the single phase to ground fault occurred as faults cause machines to accelerate since it is the same as removing the load and replacing with a short circuit. Otherwise the only increase in frequency was when it started to recover by the system operator calling for more generation.

They say that Hornsea and Little Barford would

"not be expected to trip off or de-load in response to a lightning strike. This therefore appears to represent an extremely rare and unexpected event."

Looking at the timeline, both of those events are logged within 1 second of the strike. To me, with even a little bit of experience with systems having complex interacting components, it seems vastly more likely that there is some unknown interaction rather than pure chance. I would imagine the prior probability of either of those two going offline is very low, so the probability of both independently going offline within one second of a potential causal event seems vanishingly small.
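To make "vanishingly small" concrete, here is a rough sketch under stated assumptions: the two units trip independently, and each has a handful of unplanned trips per year. The rate of 5 per year is a guess for illustration, not a figure from the report.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def coincidence_probability(trips_per_year_a, trips_per_year_b, window_s):
    """Chance that two independent units both trip inside one given window.

    Uses the small-probability approximation p ~= rate * window for each unit.
    """
    p_a = trips_per_year_a * window_s / SECONDS_PER_YEAR
    p_b = trips_per_year_b * window_s / SECONDS_PER_YEAR
    return p_a * p_b


# Hypothetical trip rates, for illustration only.
p = coincidence_probability(5, 5, window_s=1.0)
print(f"{p:.1e}")
```

Even with generous trip rates the independent-coincidence probability is many orders of magnitude below anything plausible, which supports a common cause.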

I read it as they tripped because of the lightning strike and transmission line trip, but they shouldn’t have.

Very detailed info in the Appendix pdf, and I wonder if some of that should have been made public - when I see grid layouts like that.

Anyone with the level of engineering knowledge to do something nefarious with that information is quite capable of doing something nefarious without it.

On the other hand, by releasing them publicly, the internal findings of the involved power companies can be scrutinized by academics, independent engineers and members of the public.

Equally, might be some nice deliberate errors. Sort of like a trap street upon a map https://en.wikipedia.org/wiki/Trap_street though in this case, not for copyright protection.

Trying to keep a network of huge towers across the countryside secret is not going to work.

How would you compare grid reliability to reliability of huge networks such as Google's or Facebook's? Scale might be comparable.
