
How the Singapore Circle Line rogue train was caught with data - sohkamyung
https://blog.data.gov.sg/how-we-caught-the-circle-line-rogue-train-with-data-79405c86ab6a#.qjsiufjb7
======
flashman
This is going straight into my favourite bug-hunting stories, along with the
500-Mile Email:
[https://www.ibiblio.org/harris/500milemail.html](https://www.ibiblio.org/harris/500milemail.html)

~~~
qznc
Have fun: [http://beza1e1.tuxen.de/lore/](http://beza1e1.tuxen.de/lore/)

~~~
erl
Now that is a great collection, thank you!

One thing that I think would make the collection even better is spoiler free
titles to maximize the suspense.

For example, the screen saver story lost much excitement when I knew the cause
from the title.

~~~
qznc
Good point. I changed a few. Thanks for the note.

------
koliber
When I interview developers, I like to ask them about their favorite bug. I
want to hear a story. Most people have a good one. It's usually intermittent,
hard to track down, and takes a long time to solve. It becomes a developer's
nemesis, taking on an identity. The relief, when it gets solved, is huge. Good
developers tend to talk about such bugs with passion.

This story is a beautiful example of such a story. Great read, great bug,
great analysis!

~~~
voidifremoved
One of my personal favourites.

Years ago I worked on a stock control system for a popular clothes retailer.
Where previously store managers had to spend hours on the phone ringing the
depots to see what was in stock and reordering popular lines, now they could
fill in orders and they would be emailed to the depots in machine readable
format, with the order balanced across multiple depots depending on
availability of stock.

After the system had been in operation for a year or so, managers started to
complain that some orders weren’t turning up. After a bit of digging it only
seemed to affect one particular depot, but I couldn’t initially see a common
factor between the orders. It had nothing to do with the size of the mail, or
the originating store. I could tell from the logs the emails were being
formatted and supposedly sent to the depot, yet they never translated into
orders. The system would silently fail after we’d handed the email to the
depot with no obvious reason why.

So I wrote a quick script to collect all the orders for a day from all the
stores and compile a list of all product lines that were present in the orders
that went missing, but not present in the orders that succeeded, just to see
if there was a pattern. This turned up a few dozen items and after a quick
visual scan of the product descriptions one stood out slightly, so I phoned
the depot on a hunch.

The depot had an anti-spam filter on incoming emails that silently dropped
messages with pornographic words in the body or any attachments. The retailer
had just received a new seasonal product line, which was selling well, and the
product description was “bondage tights”.

After getting the depot to whitelist emails from our system, orders for this
product line started working.
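
A minimal sketch of the kind of script described above, assuming each order is available as a list of product-line names. The data layout, the function name and the sample orders (apart from the one product named in the story) are hypothetical, not taken from the original system:

```python
from typing import Iterable, Set


def suspect_product_lines(
    missing_orders: Iterable[Iterable[str]],
    delivered_orders: Iterable[Iterable[str]],
) -> Set[str]:
    """Return product lines that appear only in orders that went missing."""
    in_missing: Set[str] = set()
    for order in missing_orders:
        in_missing.update(order)

    in_delivered: Set[str] = set()
    for order in delivered_orders:
        in_delivered.update(order)

    # Anything that also shows up in a delivered order cannot be the trigger.
    return in_missing - in_delivered


# Made-up sample data for illustration only.
missing = [["bondage tights", "wool socks"], ["bondage tights", "denim jacket"]]
delivered = [["wool socks", "denim jacket"], ["denim jacket", "rain coat"]]

print(sorted(suspect_product_lines(missing, delivered)))  # ['bondage tights']
```

On real data the difference would likely be a few dozen items, as in the story, so the result is a shortlist to scan by eye rather than an answer.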

~~~
neotek
We ran into a similar but much easier to identify problem with PayPal - any
products we sold containing the words "Cuba" or "Havana" in their title or
metadata, regardless of the context, would cause PayPal to reject the entire
payment with a meaningless error and no indication of what had caused the
problem.

PayPal refused to admit to us that this was the cause and insisted they
couldn't shed any further light on the issue despite the fact we could trigger
it 100% of the time.

In the last few weeks it's started randomly choking on the word "Pharaoh", God
knows why, and God knows why it only does so about half the time, but I'm not
going to bother asking PayPal about it.

~~~
traviscj
Fraud detection and compliance systems have a false positive rate,
unfortunately. And a company confirming something like that would be
tantamount to telling how to avoid the filter, which also raises their false
negative rate, which is not in the company's interest, whether the party in
question is a fraudster using it for their own interest or a non-fraudster
that spills the beans on HN.

But yeah, it sucks that their system can't learn to give you an exception
after it has been reviewed once or twice.

------
Terr_
I hope they do a followup about what kind of problem they found in PV46.

I suspect that someone flipped the switch away from "More Magic" :)

[http://catb.org/esr/jargon/html/magic-story.html](http://catb.org/esr/jargon/html/magic-story.html)

~~~
tankenmate
I've been caught out by "ground" not really being ground before; two different
sides of a whole floor computer / comms centre had different potentials. So
most of the power was hooked up to ground on one side of the building and a
small number of non computer / comms gear was hooked up to ground on the other
side (think lights etc that were too difficult to rewire), but the other gear
used wall sockets that were a different colour to make sure people didn't plug
computers or network gear into them. This was all discovered after a temporary
20V spike difference in the different grounds trashed an expensive router. The
electricians suspected that because we used a largish number of microwave
links on the building the copper bars that run down the building to ground
were acting like antennas and picking up the energy. So we defaulted all the
sensitive gear to the "ground" that was least problematic. It was cheaper than
re-doing all the ground circuits in the building with RF shielding. The joys
of old buildings.

~~~
verytrivial
A popular train line out of London has an electrical problem like this with
one of the train models run upon it. Whenever these trains, travelling at
about 100mph, cross from one power section to the next with the front of the
train entering one, and the rear leaving the other, there is some weird
electrical event that causes a safety mechanism to violently retract the rear
pantograph[1] (maybe more than one). This causes an almighty _WHACK_ on the
roof which scares the bejesus out of any visitors not used to impact noise.
I've seen at least one coffee ejected from a cup onto someone's trousers
because of this. The train progresses unharmed.

[1]
[https://en.wikipedia.org/wiki/Pantograph_(transport)](https://en.wikipedia.org/wiki/Pantograph_\(transport\))

~~~
tomarr
Are you sure this isn’t just the pantograph passing through the neutral
section? They have these on long runs where the power is supplied from
different sources (resulting in sync issues). There are circuit breakers at
these points which on some trains make a very loud ‘clunk’, and which stop the
train trying to draw current through. The other type of ‘clunk’ you get on
some trains (although not the ones you are describing) is the pant going
up/down on combined AC/DC lines.

~~~
verytrivial
It could be circuit breakers, yes. They would need to be big devices, though.
It is very loud. Best I can come up with is someone dropping a shopping
trolley on the roof from a storey or two high. I got the pantograph
information from a rather "train-nerd-I-love-my-job!" sounding driver over the
tannoy one evening perhaps two years ago. I may have misheard!

~~~
NamTaf
They are the circuit breakers. Our trains in Brisbane, AU run 25kV (Thameslink
seems to be the same 25kV overhead, with a 750V third rail, but this will be
the overhead section) and the circuit breakers trip during the entry/exit of a
neutral section, causing an almighty bang as they kick out and back in. If
you're sitting in the wrong carriage directly underneath them it's very loud.

For reference, 1kV arcs across about 1cm of air, so 25kV needs a full foot of
space in air (obviously less with a dielectric, but you get the orders of
magnitude).

Source: I design trains for a living, though on the mechanical side

------
a3n
Damn that was exciting.

From the bottom of the linked press release, which summarizes the overall
investigation:

"In particular, I thank the engineers and data scientists from DSTA and
GovTech respectively, without whom we would not have been able to theorise the
possibility of a faulty train, and identify PV46."

A "steely eyed missile man" moment.
[https://en.wikipedia.org/wiki/John_Aaron](https://en.wikipedia.org/wiki/John_Aaron)

------
NamTaf
This is fantastic. I was in Singapore a couple of weeks ago and after
mentioning that I'd narrowly missed out on getting an SMRT job earlier this
year we got talking about the SMRT in general. He commented on this exact
phenomenon of emergency braking for 'no reason', saying that the people of
Singapore were losing faith in SMRT and their ability to keep up the
reliability of the network after a spate of these issues due to
'interference'.

It's truly wonderful to see how they went about identifying the problem. This
is fantastic work and the team deserves major kudos for nailing this.

------
deugtniet
Not trying to be too negative here, but why wasn't a broad association
analysis done on this dataset first? The number of outages correlated with
train activity should have popped up like a sore thumb, no? Hindsight is
always twenty-twenty, but I feel the author just wanted to show off his cool
visualization knowledge, while the same could have been achieved with a much
more boring association table.

EDIT: I am corrected by "Rifu". Activity data of all traincars was not
available in the first instance, so a broad association analysis would
probably have yielded nothing. The type of analysis the authors use seems
warranted. Although I'd still recommend doing broad association analysis on
any dataset, before moving into complicated visualization techniques.

~~~
Rifu
As per TFA, they did say that the only data they had on hand were from the
trains that broke down, so it would have been impossible to even detect the
"rogue train" from a simple analysis of the data they had. Though it also said
SMRT was slowly extracting the train logs from the incidents so I'm sure they
would have gone with that if their preliminary analysis yielded nothing.

~~~
deugtniet
Right I overlooked this. That makes it more meaningful to do the analysis as
in the article. I stand corrected.
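
For illustration, the "boring association table" discussed above could look roughly like the sketch below, assuming per-train activity data were available (which, per the correction, it initially was not). The record layout, the time window, the station names and all sample values are hypothetical; only the train ID PV46 comes from the thread:

```python
from collections import Counter


def trains_near_incidents(incidents, sightings, max_gap_s=120):
    """Count, per train, how many incidents it was close to in time and place.

    incidents: list of (timestamp_s, station) tuples
    sightings: list of (train_id, timestamp_s, station) tuples
    max_gap_s: assumed time window, in seconds, for "close in time"
    """
    counts = Counter()
    for inc_time, inc_station in incidents:
        nearby = {
            train_id
            for train_id, seen_time, seen_station in sightings
            if seen_station == inc_station
            and abs(seen_time - inc_time) <= max_gap_s
        }
        counts.update(nearby)  # each train counts at most once per incident
    return counts


# Made-up sample data for illustration only.
incidents = [(1000, "A"), (2000, "B"), (3000, "C")]
sightings = [
    ("PV46", 990, "A"), ("PV12", 1010, "A"),
    ("PV46", 2050, "B"), ("PV30", 2500, "B"),
    ("PV46", 2960, "C"), ("PV12", 5000, "C"),
]

print(trains_near_incidents(incidents, sightings).most_common(1))  # [('PV46', 3)]
```

A train that turns up near almost every incident would stand out in such a table, but only if the sightings cover all trains and not just the ones that broke down.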

------
grkvlt
I was expecting something more like the 'A Subway Named Mobius' short story
[1] - it's one of my favourite topology related bits of SF - OK, possibly the
_only_ topology related SF?

1\.
[http://www.rioranchomathcamp.com/Topology/SubwayNamedMobius....](http://www.rioranchomathcamp.com/Topology/SubwayNamedMobius.pdf)

------
yedpodtrzitko
That's a really great analysis, almost feels like a detective story .)

OT: I am a huge fan of Python, so whenever I see Python helps to solve a thing
I always say to myself "yay Python!"

------
ChuckMcM
ARGH! What hardware problems on train X can cause train Y to go into emergency
braking mode? Curious minds want to know.

~~~
NamTaf
Wireless signalling transmissions, EM interference, at two guesses.

~~~
crocal
Yep. Singapore Circle Line is a CBTC (Communication Based Train Control).
Trains receive their movement authority through wireless transmission. If due
to interferences the transmission is interrupted for more than some seconds,
then movement authority expires and the on-board systems will bring the train
to a stop through safe measures (typically, application of emergency brakes).
It is quite common to have interference issues due to some geographical
problem (e.g. incorrect installation of access point antenna) but much less
common to have a jammer train. It is also quite impressive that the jamming
would be long and wide enough to exhaust all the redundancies put into such
system.
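
The expiry behaviour described above can be sketched as a simple watchdog timer. This is only an illustration of the idea: the class name, the 3-second timeout and the injected timestamps are assumptions, and a real CBTC on-board system is far more involved:

```python
class MovementAuthorityWatchdog:
    """Tracks when the last valid movement-authority message arrived."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s    # assumed value, not from any real system
        self.last_message_s = None    # no authority received yet

    def on_message(self, now_s):
        """Record the arrival time of a valid wireless message."""
        self.last_message_s = now_s

    def must_brake(self, now_s):
        """Return True once the authority has expired (or was never received)."""
        if self.last_message_s is None:
            return True
        return now_s - self.last_message_s > self.timeout_s


watchdog = MovementAuthorityWatchdog(timeout_s=3.0)
watchdog.on_message(now_s=10.0)
print(watchdog.must_brake(now_s=12.0))  # False: last message was 2 s ago
print(watchdog.must_brake(now_s=13.5))  # True: 3.5 s of silence exceeds the timeout
```

Timestamps are passed in rather than read from a clock so that the logic is easy to test. In this model, interference that blocks messages for longer than the timeout is enough to stop an otherwise healthy train.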

------
Neliquat
But what actually caused the interference? Never been blueballed so hard by an
article.

~~~
nicholasluimy
faulty signalling hardware on the rogue train

[http://www.straitstimes.com/singapore/transport/mystery-of-c...](http://www.straitstimes.com/singapore/transport/mystery-of-circle-lines-signalling-woes-solved-train-with-faulty-signalling)

------
lyonlim
Super cool to see how the team isolated the cause. I travel along the same
route as well and this train caused half the circle line to be shut down
during one morning.

For the next few days after, they shut down the telco signals in the train
tunnels to determine if it caused the signal interference.

------
hobaak
Impressive work by a Singapore government agency. I heard that they are as
competent as private sector.

~~~
vkou
Since moving to the US, my health insurance is handled by the private sector.
Competent is not how I would describe it. (The more accurate term is
'Kafkaesque')

Back in Canada, on the other hand...

~~~
anigbrowl
The incompetence of government is a matter of religious faith among some
people in the US, logic and empirical evidence be damned.

------
UhUhUhUh
The parallel with epidemiology is striking. And I keep on being amazed by the
power of the visualization of data. The problem began to unravel when they
"zoomed in" on a pattern that was not clearly visible initially.

------
andrewvijay
Wow what a digital Sherlock Holmes tale! Beautiful story. Incredibly hard if
the data was not visualized that way. I learnt some good things. Thanks for
sharing here!

------
anigbrowl
Good work, though I can't help thinking that a defective train could just as
easily have been caught through performing regular maintenance. Rolling stock
and other physical things are subject to both manufacturing defects and wear.

I wonder why the rail managers allowed the defects to inconvenience passengers
for so long instead of rethinking their maintenance procedures.

~~~
gridspy
The defective train worked perfectly. It was the other trains that failed due
to the radio emissions the rogue transmitted.

Not obvious.

------
novaleaf
I wonder if some "AI Gen 3.0" machine learning / cognitive machine system
could have caught this, without human intervention?

I just yesterday attended a presentation by an ex-NSA guy (who's of course
pitching his company Adatos, so grain of salt) who claims you could feed the
raw data in and the pattern will be found.

------
jefurii
Reminds me of a 1996 Argentine science fiction film called "Moebius" which
involves a missing train, topology, the Dirty War, and of course Borges.

[https://en.wikipedia.org/wiki/Moebius_%281996_film%29](https://en.wikipedia.org/wiki/Moebius_%281996_film%29)

------
mpweiher
Dunno. About 2 paragraphs in, "oncoming trains" was the first thing that
popped into my head.

------
atmosx
Nice write up.

As a side note, I find alarming the fact that the blog of the _Data Science
Division_ of a government (any gov) is hosted on a third party provider.

These guys look like they know what they're doing though, so I guess it was a
weighted decision.

~~~
dnautics
do you find it alarming that governments (including the US) use official
twitter and facebook accounts? Why should this be any different?

------
spraak
This makes me excited about data science

------
chaostheory
Never heard of Jupyter Notebook until this story. It reminds me of Eve.

~~~
chewxy
You may have heard its previous name being used - iPython notebook. FWIW it
took me a year plus to actually start using the name Jupyter. Wasn't until one
of my students asked me why I kept saying "iPython" when the tab says
"Jupyter"

------
daneyh
code snippets not displaying?

EDIT: Great article and good problem solving skills

~~~
reustle
They're hosted by github, does the main site load for you?

------
johnm1019
TLDR; There was a single train emitting both correct and erroneous signals.
This caused trains in the vicinity to lose data link which triggered an
emergency brake safety feature. The article doesn't make clear what medium the
trains communicate on which was affected (e.g. rail, wireless, powerline).
They have not yet determined what caused the faulty hardware (or software???)
on the faulty train to enter this state.

~~~
4ndr3vv
Think you're looking at this story the wrong way; It's not about the fault but
the bug hunting process.

I propose:

 _TL;DR - They had a strange hard to identify bug, but used a limited data set
and interesting techniques to quickly find the esoteric cause of the problem._

~~~
sweetjesus
imho his TL;DR is better than yours, but your improvements could be made to
his. In my education we were taught that in science and business, the abstracts
should tell the whole point of the story; the "spoiler" should not be saved
till the end even though that makes for more fun; this is to allow a busy
person to get the takeaway quickly and assess if they need to read all the
detail. Whether anybody agrees with that or not, it's been burned into my way
of looking at the world, so even when reading this article, all the way
through I was gritting my teeth thinking "I wish I had some clue as to what
I'm reading about."

so to add in your point, to his TL;DR I would put in "the initial dataset
included only the trains that had suffered the fault, but as the fault was
caused by a functioning train, a more comprehensive dataset was necessary to
find the problem; had it been provided initially, less detective work might
have been necessary"

------
thisisquitecool
Wow!!! Super impressive.

