
I feel like we in the software industry should do more reflection on what happened here. Yes, many of the most evil acts were the result of social, non-technical problems: terrible management, stupid legacy UK laws allowing the Post Office to bring its own prosecutions, and so on.

But the root cause was bugs. So many bugs.

There isn't much the average HN poster can do about the political and justice problems, which are firmly in the realm of the British government. But there are people here who work on databases and app frameworks. What can be learned from the Horizon scandal? Unfortunately there doesn't seem to be much discussion of this. Compare with the airline industry, where failures are aggressively root-caused.

I'll start:

1. Transaction anomalies can end lives. Should popular RDBMS engines really default to weaker-than-serializable isolation (allowing non-repeatable reads, for example)? There's a minimal sketch of opting into serializable isolation after this list.

2. Offline is very hard. A lot of bugs happened due to trying to make Horizon v1 work with flaky or very slow connections, and losing transactional consistency as a result. The SOTA here has barely advanced since the 90s; instead the industry has just given up on this and now accepts that every so often there'll be massive outages that cause parts of the economy to just shut down for a few hours when the SPOF fails. Should there be more focus on how to handle flaky connectivity in mission-critical apps safely?

3. What's the right way to ensure rock-solid accountability around critical databases, given that bugs are inevitable and data corruption must sometimes be manually fixed? A lot of the Horizon problems seemed to involve Fujitsu staff manually logging in to post offices and "fixing" the results of bugs, without realising their fixes created ledger imbalances that the SPMs would then be blamed for. Part of why big enterprises got so excited about blockchains was this notion of an immutable ledger in which business records can't go magically changing around you without anyone knowing how. There are clearly ways to do this, but they're not the default; see the second sketch after this list.

4. IIRC at least some failures were traced back to broken touch screens generating spurious random touches, which could lead to random transactions being entered and confirmed at night when nobody was around. Are modern capacitive touch screens immune to this failure mode? If not, do consoles in embedded applications always reliably engage screen locks?
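
To make point 1 concrete, here's a minimal sketch (Python with psycopg2, against a made-up branch_ledger table - nothing to do with Horizon's actual schema) of opting into SERIALIZABLE isolation and retrying on conflict instead of silently accepting anomalies:

    import psycopg2
    from psycopg2 import errors

    def apply_adjustment(dsn, branch_id, amount, attempts=5):
        # Run the balance update under SERIALIZABLE isolation and retry on
        # conflict, rather than silently accepting a non-repeatable read.
        conn = psycopg2.connect(dsn)
        conn.set_session(isolation_level="SERIALIZABLE")
        for _ in range(attempts):
            try:
                with conn, conn.cursor() as cur:
                    cur.execute("SELECT balance FROM branch_ledger WHERE branch_id = %s",
                                (branch_id,))
                    (balance,) = cur.fetchone()
                    cur.execute("UPDATE branch_ledger SET balance = %s WHERE branch_id = %s",
                                (balance + amount, branch_id))
                return
            except errors.SerializationFailure:
                continue  # a concurrent transaction conflicted; rerun from scratch
        raise RuntimeError("gave up after %d serialization conflicts" % attempts)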
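
And for point 3, a toy illustration of the "tamper-evident ledger without a blockchain" idea: an append-only log where every entry records who acted and hashes the previous entry, so a quiet after-the-fact "fix" breaks the chain and shows up on verification. (All names here are invented; a real system would also sign entries and replicate the log off-site.)

    import hashlib, json, time

    def append_entry(ledger, actor, action, details):
        # Each entry names who acted and chains to the previous entry's hash,
        # so any later edit invalidates everything that follows it.
        prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
        entry = {"ts": time.time(), "actor": actor, "action": action,
                 "details": details, "prev": prev_hash}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        ledger.append(entry)
        return entry

    def verify(ledger):
        # Re-walk the chain; returns the index of the first bad entry, or None.
        prev = "0" * 64
        for i, entry in enumerate(ledger):
            body = {k: v for k, v in entry.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return i
            prev = entry["hash"]
        return None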

I guess there are bazillions more you could come up with.




According to the TV drama, Fujitsu staff could go in and change the live data while the log said it was the subpostmaster making the change. If so, that seems like negligently bad design.


It was an IT project for a government department, designed and delivered in the 90s; negligently bad design is pretty much what I think we would've expected.


The history of UK government IT contracts is indeed a very sad one.

From memory:

UK Covid test and trace system: £37 billion pretty much wasted

UK NHS IT system: £12 billion pretty much wasted

And I'm sure there are lots of others.

And yet we are still giving huge projects to the same companies. Amazingly, Fujitsu's contract for the Horizon system has recently been renewed.


Only a small fraction of the covid test and trace costs were on software, the overwhelming majority was on lab tests and PCR kits.


Fair point. I don't know what percentage was IT.


There is an organizational axis too. The technical capabilities of the GPO went off with BT in 1984. The remaining organisation was anything but a competent customer for IT implementations. Outsourcing has definite limits and potentially catastrophic results - as does the demolition of corporate technical capability.

An empowered technical architecture function could have vetted this system and prevented all of this. But gut it and stamp it under the heel of the CFO and you may as well not bother.


> Outsourcing has definite limits and potentially catastrophic results - as does the demolition of corporate technical capability.

A lot of what seems apparent in this case is that contractual and commercial factors weren't set up correctly - they were set up to deliver predictable prices (loved by public sector clients), but not necessarily to deliver good outcomes.

An example - it appeared much of the rush to ship the point of sale terminals was to get through customer acceptance (and presumably the payment milestone), despite scope creep and quality issues. And the PO was charged to access audit logs, with limits on how much log data could be handled. Presumably this delivered a lower headline price for an accountant negotiating the price down, but it ultimately made a poorer solution that wasn't fit for purpose.

It seems like (from what I've seen of the evidence) nobody inside the PO really had full understanding and ownership of the project; they'd outsourced that, but kept the suppliers tightly commercially managed, creating incentives for shipping poor-quality code rather than spending the time to polish it, as some in the development team had tried to do.

Some interesting evidence here in the inquiry on software development practices and low competency and quality of code - https://postofficeinquiry.dracos.co.uk/phase-2/2022-11-16/#d...


I have "owned" the technical delivery of projects for clients, and I like to think I did a very good job - but it was very uncomfortable because when I insisted on things being done properly it ate into my bonus. Lucky for me I had a great team and this didn't happen so much, but I think that external ownership and accountability for project outcomes is only appropriate when the organisation really doesn't know what it's doing and really has to act. In that case I believe that the best thing to do is to get a third party to do it and separate the delivery organisation / program office from the development organisation / resource management.

Interesting link, thanks.


> The remaining organisation was anything but a competent customer for IT implementations.

This is very much the standard in UK public procurement and has been for many years. It's got a lot worse since Brexit, when most civil servants with any skills or capability to deliver moved on because they didn't want to deliver the 'will of the people' to have their cake and eat it.


Can you evidence that claim? The only major public sector procurement effort I can recall since Brexit was the COVID vaccine programme, in which UK procurement worked much better than the EU-level effort did - to the extent that, at the height of it, the EU was seriously talking about seizing the factories manufacturing vaccines the UK had bought while the EU was still negotiating.

And they also bought far too much. Germany is now required by the EU treaties to buy so much vaccine supply that if it didn't expire it would last them until the 24th century.

After all that, there was an attempt at an investigation but it turned out the whole thing was negotiated in secret and key deals were made by Ursula von der Leyen using deleted SMS messages.


> whole thing was negotiated in secret and key deals were made by Ursula von der Leyen using deleted SMS messages.

This is completely different from the British system, where key deals are made in secret using deleted Whatsapp messages.


Don't worry about DBMS transaction management. This is the wrong level. No bank uses database-level transactions to make sure a balance transfer doesn't erase or double money. They post a durable entry in the transaction log, and then compute balances as roll-ups of those entries. If some weird data glitch or cosmic ray produces the wrong balances, just re-run the tx rollup!
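
Something like this, in toy Python (obviously not any real bank's schema): the stored balance is just a cache, and the durable entry log is the source of truth you can always re-derive it from.

    from decimal import Decimal

    def rollup(entries):
        # The balance is derived purely from the durable entry log; any stored
        # balance column is just a cache that can be recomputed at will.
        return sum((Decimal(e["amount"]) for e in entries), Decimal("0"))

    entries = [
        {"txid": "t1", "amount": "100.00"},   # deposit
        {"txid": "t2", "amount": "-40.00"},   # withdrawal
    ]
    assert rollup(entries) == Decimal("60.00")
    # If a cached balance ever disagrees with this number, the log wins.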


Yes I know, but banking isn't the only place where this sort of stuff can go wrong. See also the payroll discussion above. And Horizon had a similar design I think (message logs that were replayed to catch up with the true state of the ledger).


This approach still leaks, but the breakage will be things like overdraft limits, and those can be handled as business exceptions. And that's why we have transaction size limits. Risk management, all the way down.


> Offline is very hard. A lot of bugs happened due to trying to make Horizon v1 work with flaky or very slow connections, and losing transactional consistency as a result. The SOTA here has barely advanced since the 90s; instead the industry has just given up on this and now accepts that every so often there'll be massive outages that cause parts of the economy to just shut down for a few hours when the SPOF fails. Should there be more focus on how to handle flaky connectivity in mission-critical apps safely?

If there's a network partition you have two options: accept reduced availability and keep your consistency, or keep availability up and accept reduced consistency. Not much else you can do, that's just life.

Obviously there are ways to strengthen consistency - consensus algorithms, two-phase commit, etc. - and ways to relax it in exchange for availability. Depends on your requirements.


In many real-world situations conflicts are rare and it's OK to temporarily lose consistency (especially if you know that it's happened), as long as you can catch up later and resolve the merge. Version control is a practical example that we interact with every day, but there are others.

A lot of the Horizon stuff was very local to the specific post office, hence their initial replication-based design.
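
As a rough sketch of the store-and-forward pattern (hypothetical code, not Horizon's actual design): give each transaction a client-generated id before it ever touches the network and have the server dedupe on that id, so replaying after a flaky connection can't double-apply anything.

    import uuid

    class OfflineQueue:
        # Toy store-and-forward queue: transactions are recorded locally first,
        # then flushed when connectivity returns. Replays are safe as long as
        # the server dedupes on the client-generated id.
        def __init__(self):
            self.pending = []

        def record(self, amount, description):
            tx = {"id": str(uuid.uuid4()), "amount": amount, "desc": description}
            self.pending.append(tx)  # a real terminal would persist this to disk
            return tx

        def flush(self, send):
            # send(tx) must be idempotent server-side (dedupe on tx["id"]).
            still_pending = []
            for tx in self.pending:
                try:
                    send(tx)
                except ConnectionError:
                    still_pending.append(tx)  # keep it and try again later
            self.pending = still_pending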


> But the root cause was bugs. So many bugs.

I disagree; these were not the root cause of this scandal. Bugs happen, and even if these ones fell below expected standards, trying to pin the 'blame' on them might be perceived as deflecting from the real culprits. The scandal here is how the technical issues were handled.


It's a bit like saying planes will just fall apart in the sky, that it's inevitable, and what matters is whether the compensation is handled appropriately. All of it should be improved, but we can't just assume arbitrarily incorrect software will always be covered for by non-technical systems.


Obviously not the same at all and not what happened here.


You can't technical your way out of a social problem. It doesn't matter how many best practices you define if the management aren't going to follow them, and get their mates to award them a CBE for services to misconduct.

> Are modern capacitative touch screens immune to this failure mode?

No, this is basically impossible.




