
Croydon Shut Down - lelf
https://www.networkrail.co.uk/running-the-railway/our-regions/southern/disruption-at-victoria-and-london-bridge/
======
kiwijamo
Nice to see a post mortem made public. Also fairly well written for laymen
while still retaining detail. Most rail infrastructure operators from other
countries don’t make this sort of info public (apart from PR type messages
which typically don’t explain the issue at all and are more focused on
deflecting blame etc). Kudos to Network Rail.

~~~
johnnycab
>Kudos to Network Rail.

This kind of jingoistic comment demonstrates how a tactical post with some
technical information can massively skew judgement. According to Network
Rail, the routes in question are not only among the most congested in the
country, but have been mired in under-investment and a lack of will to ease
the bottleneck. Some of the routes connect salubrious stockbroker belts with
the City of London, in addition to Brighton & Hove, East Sussex. Other routes
through East Croydon, which acts as a hub, also link the mainline stations of
London Victoria and London Bridge to (London) Gatwick, the second busiest
airport in the country.

I wonder how many people will be forgiving or cheering for Network Rail and
handing out accolades for their neatly constructed blogs when they miss their
flight or arrive at work late 'due to signalling problems'.

Part 1: https://www.networkrail.co.uk/news/strong-support-for-proposals-to-unblock-the-railway-bottleneck-at-croydon/

Part 2: https://www.networkrail.co.uk/running-the-railway/our-routes/sussex/upgrading-the-brighton-main-line/unblocking-the-croydon-bottleneck/

------
nathell
I don’t mind delays very much, even long ones, as long as information is
readily available about why the delay is happening and how long it is likely
to last. I have had very positive experiences with UK railways in this
regard; this post is a prime example.

Contrast with PKP (Polish rail), where it’s not uncommon for a train to stand
in the middle of nowhere for half an hour, without so much as a single word of
explanation from staff. It’s changing for the better, but slowly.

------
rossmohax
According to [1], just 67% of all UK trains arrived on time (no more than a
minute late), with 2.8% of scheduled trains cancelled altogether.

How does that fare against other EU countries? And what is so different about
Japanese trains that they are just 0.7 minutes late on average?

[1] https://dataportal.orr.gov.uk/media/1630/passenger-performance-2019-20-q2.pdf

~~~
rwmj
I can tell you from experience the trains in the UK are expensive and
miserable.

The common theory is that this stems from the mess made of privatising
British Rail in the 1990s. The government did it in a way which managed to create
monopolies while at the same time splitting out responsibility between
different companies in such a way that millions are drained away on legal
costs while the companies argue with each other about who is responsible for
each problem.

There are many articles analysing the problems you can find online; here's one
that turned up at the top of a search: https://www.theguardian.com/uk-news/2018/jan/07/railways-privatisation-nationalisation

Edit: Better article: https://www.citylab.com/transportation/2012/09/why-britains-railway-privatization-failed/3378/

~~~
jonwinstanley
I commute on British trains every day and have done for years.

They’re not perfect but I am rarely delayed by more than a minute or two.
Agree that a massive issue is the price, my train is close to £30 a day for a
50 minute ride into the city and it goes up by 3-5% every year.

The only proper delays I see are caused by unusual weather, as it’s the UK -
flooding or snow.

~~~
NeedMoreTea
Can I hazard a guess you are doing so in the South East? Try the trains the
rest (most) of us experience.

Northern Rail, Transpennine, West Coast, ScotRail, East Coast (there already)
etc. that have been, or are about to be, handed to the "operator of last
resort", i.e. renationalised, under a Tory government, for having failed badly.

~~~
zimpenfish
I commuted in the South East (specifically South East London, which means
Southeastern trains) and it was a miracle if I was only delayed by 1 minute on
a 12-20 minute journey. Coming down the Charlton line into London Bridge
almost always involved a 5-10 minute delay at the crossover junction just
outside the station (whose name I forget).

~~~
alias_neo
I commute there now. The past weeks have been a nightmare of trains cancelled
2 minutes before they're due while I stand at an open-air station freezing my
nuts off or getting rained on. If we had a little more notice I'd stay home
where it's warm for an extra 15 minutes. The delay you speak of is still
there, and it's now worsened by the fact that the new Thameslink trains are
given priority to pass as we sit and wait.

------
Animats
OK, that's what should happen for an overvoltage. The control systems tripped
offline to protect themselves, all signals went to red, and all trains
stopped.

Electric railways have multiple different power systems - traction power,
signaling power, and utility power, at least. The possibility of a short
between systems means a conservative shutdown-on-overvoltage circuit breaker
setup is appropriate.

~~~
ec109685
The need to manually turn this system back on is suboptimal. A control loop
powered by a battery backup should have been able to reset the system once the
voltage came back down.

~~~
Animats
This is a railroad signaling system. It's designed to fail safe, and that's
not a buzzword. Any broken wire, go to red. Any broken relay, go to red. Any
rail break, go to red. Loss of power, go to red.

If you're getting power surges in signaling power, it's time to shut down, go
to red, and not reset until someone has tested the system.
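
In code, that latching policy looks something like this (a minimal sketch;
every name and behaviour here is illustrative, not Network Rail's actual
design):

    # Illustrative fail-safe latch: any detected fault forces all signals
    # to red, and they stay red until a human resets the latch.
    from enum import Enum

    class Aspect(Enum):
        RED = "red"
        GREEN = "green"

    class FailSafeSignal:
        def __init__(self):
            self.tripped = False  # latches on any fault

        def on_fault(self, reason: str) -> None:
            # Broken wire, broken relay, rail break, loss of power, surge:
            # every failure mode maps to the same safe state.
            self.tripped = True
            print(f"TRIP: {reason} -> all signals to red")

        def aspect(self, route_clear: bool) -> Aspect:
            # A tripped latch overrides everything; green only when healthy
            # AND the route is proven clear.
            if self.tripped or not route_clear:
                return Aspect.RED
            return Aspect.GREEN

        def manual_reset(self, inspection_passed: bool) -> None:
            # Deliberately not automatic: a person must confirm the system
            # was tested before traffic resumes.
            if inspection_passed:
                self.tripped = False

The key property is that there is no code path from "fault" back to "green"
that doesn't go through a human.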

~~~
ec109685
The backup system can be programmed to do the same testing a human would
before bootstrapping the system. Another benefit is that you can regularly
simulate a surge in the system and watch it recover by itself, versus a human
running a checklist once every 10 years.
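
A hedged sketch of what that could look like (all checks, names, and
thresholds are invented for illustration):

    # Hypothetical automated bootstrap: after a trip, run the same checks
    # a technician's checklist would, and only re-energise if all pass.
    import random

    def read_voltage() -> float:
        # Stand-in for a real sensor; illustrative 650 V signalling supply.
        return random.uniform(630.0, 670.0)

    def checklist_passes() -> bool:
        checks = {
            "voltage back in band": 600.0 <= read_voltage() <= 700.0,
            "breakers report healthy": True,    # placeholder status query
            "no persistent earth fault": True,  # placeholder status query
        }
        for name, ok in checks.items():
            print(f"{name}: {'OK' if ok else 'FAIL'}")
        return all(checks.values())

    def attempt_auto_restart(max_tries: int = 3) -> bool:
        for attempt in range(1, max_tries + 1):
            if checklist_passes():
                print(f"re-energising after attempt {attempt}")
                return True
        print("auto-restart gave up; paging a human")
        return False

You could then run attempt_auto_restart() against an injected synthetic surge
on a schedule, so the recovery path is exercised regularly rather than once a
decade.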

------
bloak
The article doesn't mention a date. Presumably the "Wednesday" referred to is
18 Dec 2019.

On a previous occasion when rail equipment failed because of a fluctuation in
supply voltage the electricity company was not at fault: the voltage had
remained within specified limits. I find it a bit suspicious that the article
doesn't mention what the change in voltage was this time, and what variation
the specification allows. The penultimate paragraph is perhaps an implicit
admission that the electricity supply was within its specification.

I wonder if any of this is indirectly related to the official change in supply
voltage, from 240 V to 230 V, in 1995. Nah, probably not; they can't have
equipment that old, can they?

~~~
makomk
It could be related to the official change in supply voltage, though not for
the reason you're thinking. Although Britain's mains supply officially changed
to 230V, in reality it's the same 240V as before and there's a wider tolerance
band for voltages above the nominal 230V than below to make this work. So if
Network Rail installed equipment that expected actual continental-style 230V,
that could cause problems.
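
A quick back-of-envelope check of those bands (the figures are the commonly
cited harmonised limits, UK: 230 V +10%/-6%, continental: 230 V +6%/-10%;
they are not taken from the article, so worth verifying):

    # Why two pieces of nominally "230 V" kit can disagree about what
    # counts as an overvoltage.
    def band(nominal: float, plus: float, minus: float) -> tuple[float, float]:
        return nominal * (1 - minus), nominal * (1 + plus)

    uk_lo, uk_hi = band(230, plus=0.10, minus=0.06)  # 216.2 .. 253.0 V
    eu_lo, eu_hi = band(230, plus=0.06, minus=0.10)  # 207.0 .. 243.8 V

    print(f"UK band: {uk_lo:.1f}-{uk_hi:.1f} V")  # the old 240 V fits easily
    print(f"EU band: {eu_lo:.1f}-{eu_hi:.1f} V")
    # A perfectly legal UK supply at, say, 250 V sits outside the
    # continental band, so strict continental-spec kit sees an overvoltage.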

It wouldn't be the first time that rail companies not paying attention to
actual grid specs caused transport chaos this year. Back in August, a power
failure caused load shedding that dropped a million houses off the grid but
preserved power to the rail network as intended. However, one train company
had bought Siemens trains that couldn't cope and immediately shut down when
they saw a dip in frequency at the level required to cause automatic load
shedding, and required an engineer visit to restart. That one set of trains
effectively rendered all the measures taken to ensure the rail network
retained power almost useless, as they turned into immobile blocks all over
the network.

~~~
avianlyric
> rail companies not paying attention to actual grid specs

I would give the rail companies a little slack. They did specify trains that
should have handled the frequency change. But Siemens didn’t build the trains
to spec.

The shutdown was unexpected and entirely down to Siemens not building their
power supplies to the specs set out by the rail company.

~~~
growse
This isn't accurate. Siemens delivered trains that could tolerate the power
fluctuations _at the time they were specified_.

The problem was that the national grid then changed (loosened) the
specification of what an acceptable fluctuation was. The trains could not
tolerate this new specification.

What we can blame Siemens for is not providing an easier and quicker way to
get the electrical systems back on their feet after a problem, instead relying
on a specialist to physically visit a stranded train. I think they said they'd
fix that one...

~~~
makomk
That's not true - the frequency spec hasn't changed in any way relevant to
what happened. (I think there were some changes to rate-of-change-of-frequency
disconnect specs for generators, and possibly to medium-term frequency
accuracy, but neither of those mattered here.) This kind of near-simultaneous
failure of two large generators, tripping the low frequency demand disconnect,
was always specified to cause a dip in frequency of a level those Siemens
trains couldn't cope with. There was actually a very similar failure back in
2008. The only reason that didn't cause such chaos is because those trains
hadn't been ordered yet.

------
makach
It's not necessarily a failure when you fail.

This post-mortem exudes professionalism, with a clear and concise message!
This is an excellent learning experience for bystanders, travelers, and
employees. Seems to me that this was handled in a good way.

The issue was highly unusual and difficult to anticipate, but now you know.
Good luck mitigating this issue and preparing for the future. You will now be
better as a result of this failure.

------
amluto
One odd omission from the article is why it took an hour to reset everything.
Given that they have redundant power, I would imagine that they could remotely
select which power supply to use or at least to remotely reset the breakers.

Alternatively, a good enough automatic transfer switch should transfer the
power supply rather than tripping if the primary supply goes out of acceptable
parameters.

(This goes to one of basics of high availability: failures, when they happen,
should be short.)
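
A minimal sketch of that transfer-instead-of-trip behaviour (feed names and
the acceptance window are made up for illustration):

    # Transfer the load to a healthy feed when the active one leaves its
    # window; only trip if no feed is acceptable.
    LOW, HIGH = 216.0, 253.0  # illustrative acceptance window, volts

    def in_window(v: float) -> bool:
        return LOW <= v <= HIGH

    def select_feed(feeds: dict[str, float], active: str) -> str | None:
        if in_window(feeds[active]):
            return active                 # stay put while healthy
        for name, volts in feeds.items():
            if name != active and in_window(volts):
                return name               # transfer, don't trip
        return None                       # nothing healthy: trip, call a human

    print(select_feed({"A": 310.0, "B": 238.0}, active="A"))  # -> B
    print(select_feed({"A": 310.0, "B": 305.0}, active="A"))  # -> None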

~~~
avianlyric
Most power systems are built with a mix of clever automated systems that can
self-recover, along with dumb backup systems that can't.

This is done to ensure that if your smart stuff fails, you don’t end up
seriously damaging something.

Unfortunately the dumb stuff pretty much always has to be reset manually.
Because the dumb stuff tripping usually means you've encountered a scenario
that no one anticipated, you really want someone to double-check that nothing
got damaged before the safety tripped.

It’s a little like a database starting up and discovering that part of its
write-ahead log is corrupt. You really want a human to go in there and see
what happened, rather than just ignoring the damage, plowing on, and losing
data.
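
A toy version of that analogy (the record format is invented):

    # On startup, verify every WAL record's checksum and refuse to start
    # on corruption, rather than silently truncating and losing data.
    import zlib

    def wal_intact(records: list[tuple[bytes, int]]) -> bool:
        for i, (payload, stored_crc) in enumerate(records):
            if zlib.crc32(payload) != stored_crc:
                print(f"record {i} corrupt; halting for operator review")
                return False
        return True

    good = [(b"txn-1", zlib.crc32(b"txn-1"))]
    bad = good + [(b"txn-2", 0xDEADBEEF)]  # deliberately wrong checksum

    assert wal_intact(good)
    assert not wal_intact(bad)  # dumb-but-safe: stop, don't plow on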

------
szc
In my opinion, the problem is poor logical thinking in the design of the
supply switchover systems. The article indicates there are multiple power
feeds and power supplies (presumably to provide low-voltage equipment power).

For a single power feed, all of the following result in the loss of output:
(1) no input voltage, (2) input overvoltage (undervoltage too?). However, the
article implies that switching to one of the other redundant power feeds or
power supplies only happens for (1).

If the redundancy system had treated either (1) or (2) as a fault and
followed "find another supply, then shut down this supply", the complete
shutoff would have been avoided.

Also, since the active power supply shut down and a switchover still didn't
happen, there must not be any monitoring of the outputs of the individual
supplies to trigger a switchover. To me, this suggests there is another
single point of failure in the current design.

I guess there is a market for smarter power failover systems.
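
For illustration, here's roughly what I mean, with every name and threshold
invented:

    # Treat (1) no input and (2) out-of-range input alike; bring a
    # replacement online *before* dropping the faulty supply; and monitor
    # each supply's *output* so an internal failure is caught too.
    LOW, HIGH = 216.0, 253.0  # illustrative input window, volts

    class Supply:
        def __init__(self, name: str, input_v: float, output_ok: bool = True):
            self.name, self.input_v, self.output_ok = name, input_v, output_ok
            self.online = False

        def faulty(self) -> bool:
            input_bad = not (LOW <= self.input_v <= HIGH)  # covers (1) and (2)
            output_bad = self.online and not self.output_ok
            return input_bad or output_bad

    def failover(supplies: list[Supply], active: Supply) -> Supply | None:
        if not active.faulty():
            return active
        for s in supplies:
            if s is not active and not s.faulty():
                s.online = True        # bring the replacement up first...
                active.online = False  # ...then shut the faulty one down
                return s
        return None  # nothing healthy: safe shutdown, human required

    a, b = Supply("A", 310.0), Supply("B", 238.0)
    a.online = True
    assert failover([a, b], a) is b  # overvoltage triggers a switchover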

------
Aloha
A 20-second overvoltage period does seem exceedingly long - I'm curious to
find the cause of that.

~~~
ksec
Yes, I thought power surges happen in the milliseconds-to-a-second range,
maybe a few seconds at worst; this is the first time I've ever read of a
power surge lasting double-digit seconds.

Luckily the system shut itself down as intended.

------
vladd
Reading the post-mortem, I got the feeling that it focuses primarily on
providing a timeline of historical events without being future-looking.

I did not find any questions that would trigger a path of improvement for
similar future events:

- Why did the recovery take so long (up until Thursday morning)?

- If the surge lasted for 20 seconds, why weren't the systems operational
again 1 minute after the event?

- Why is an on-site technician needed in order to re-activate the equipment?

- Could we build a system that enables the reactivation remotely?

There might be very well-rooted answers to the questions above (based on how
real-world electrical supplies work), but if the questions aren't asked, we
won't end up exploring better solutions.

~~~
goodcanadian
_Why did the recovery take so long (up until Thursday morning)?_

Systems were up within an hour, but if you throw an hour's delay into a busy
part of the rail system, you get knock-on effects (trains aren't where they
are supposed to be) that will basically last until the end of the service day.

_If the surge lasted for 20 seconds, why weren't the systems operational
again 1 minute after the event?

Why is an on-site technician needed in order to re-activate the equipment?_

Basically, the breakers were flipped by the surge. Resetting such things is
best done manually on site, to make sure no equipment was damaged by the
surge.

_Could we build a system that enables the reactivation remotely?_

Probably, but that strikes me as dangerous.

The article explicitly talks about investigating what can be done to make
their systems more robust to such power events. I think it is very future
looking.

~~~
vidarh
For those unaware, Croydon is basically a massive chokepoint, with trains
coming in from London Bridge and London Victoria to the North and a large
part of the South coast to the South.

If trains start piling up near either Victoria (and Clapham Junction) or
London Bridge, it quickly causes trains to back up all the way to East
Croydon, a 15 minute train journey away.

Once it does, it blocks trains towards the other line, as East Croydon only
has 6 platforms and no tracks bypassing it.

Once that happens, trains can start backing up tens of miles further South,
and will be out of place when things are up and running again.

Victoria is a terminus, and is at times busy enough that trains get 'stacked'
two to a platform on the platforms serving Croydon, so turning trains back
quickly becomes a fun exercise... (London Bridge also has lots of terminating
platforms, but seems to have fewer platform capacity issues after the
overhaul.)

So even if the trains that are in position get moving again pretty quickly,
untangling that and getting platform space for all of those trains in short
order can compound short delays very substantially.

------
jhiouoiurew
> _how we can improve their resilience so we can stop this from happening in
> future_

I'm not so sure about that.

Just a few months ago a temporary power cut in London left some modern Siemens
trains stuck for hours, because they had shut down due to irregular power and
had to be manually rebooted.

Apparently nobody thought to investigate if the signaling system is also
vulnerable to irregular power.

------
herpderperator
> We are investigating whether fitting Uninterruptable Power Supplies to each
> signalling location would have prevented this failure, although it is also
> possible they would have had to shut down to protect themselves too, as
> happened at another location during the same event.

This specific remark sounds rather amateurish. This is part of the reason
UPSes were invented. A Double Conversion Online UPS would have been able to
handle this situation just fine.
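
A greatly simplified model of why the double-conversion topology rides
through input disturbances (the numbers are illustrative):

    # The load is always fed from the inverter off the DC bus, never
    # directly from the mains, so an input surge is absorbed or rejected
    # at the rectifier stage and the battery holds the bus up meanwhile.
    BATTERY_OK = True  # assume a charged battery for this sketch

    def double_conversion(input_v: float) -> float:
        rectifier_ok = 180.0 <= input_v <= 280.0  # rectifier accepts input
        dc_bus_up = rectifier_ok or BATTERY_OK    # else battery carries it
        return 230.0 if dc_bus_up else 0.0        # inverter regenerates AC

    for vin in (230.0, 310.0, 0.0):  # nominal, 20 s surge, full outage
        print(f"in={vin:5.1f} V -> out={double_conversion(vin):5.1f} V")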

~~~
wilhil
I worked for an ISP that was deploying switches nationwide... We got a lot of
space in BT Telephone Exchanges and thought nothing more of it.

I have never seen such dirty and bad power! Literally once a week we had to go
to exchanges to reset fuses that had tripped.

We then started deploying (rather entry-level) UPSes to all locations and the
problem just disappeared completely... We then invested in more high-end UPS
systems, but, yeah - dirty power is a big problem and UPSes help.

~~~
Aloha
I'm surprised the CO gear was not running on -48V DC rather than AC power.

~~~
viraptor
Is DC actually common these days? I haven't been to a physical datacenter in a
decade or so, but I remember almost everything running on AC.

~~~
wilhil
I'm an IT and networking guy... This is one thing that gets on my nerves
completely - most equipment comes as AC or DC, and I'm never sure what to
order, so I usually go for the AC one as it's the safe choice.

------
londons_explore
Rail equipment is all ancient and highly redundant for both uptime and safety.

Yet the actual system reliability and safety don't seem great. Signal
outages happen nearly every day in the UK and delay tens of thousands of
passengers. Safety failures in signalling systems kill, on average, a few
people per year.

How about a modern system, built with JavaScript-du-jour, running over 4G and
WiFi, consumer GPS, "download train-control from the app store", running on
tablets or the train driver's BYOD phone, etc.? How would it work out for
safety/reliability?

I suspect, counterintuitively, it would work out much better. Simple cheap
portable devices mean a bunch of spares can always be around. "Phone battery
dead? No worries, there's a tablet I can control the train from too." No WiFi?
It also works with 4G or Bluetooth. Server down? It has a peer-connection
backup. Hackers broke in? The system has safety logic built into every
device, and uses a consensus system so a few rogue devices can't crash trains.
Power supply broken? We have 24 hours of battery, longer if I turn the
brightness down.
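
Consensus-wise, the safe default could look like this (entirely hypothetical,
of course):

    # A train only acts on a command when a strict quorum of independent
    # controllers agree; any shortfall defaults to the safe action.
    from collections import Counter

    def decide(votes: list[str], quorum: int) -> str:
        command, count = Counter(votes).most_common(1)[0]
        return command if count >= quorum else "BRAKE"

    print(decide(["PROCEED", "PROCEED", "PROCEED"], quorum=2))  # PROCEED
    print(decide(["PROCEED", "BRAKE", "CRASH!"], quorum=2))     # BRAKE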

~~~
bschwindHN
> How about a modern system, built with JavaScript-du-jour

No thank you.

~~~
WrtCdEvrydy
import { train } from 'lodash';

