
I feel like we in the software industry should do more reflection on what happened here. Yes, many of the most evil acts were the result of social, non-technical problems: terrible management, stupid legacy UK laws allowing the Post Office to bring its own prosecutions, and so on.

But the root cause was bugs. So many bugs.

There isn't much the average HN poster can do about the political and justice problems, which are firmly in the realm of the British government. But there are people here who work on databases and app frameworks. What can be learned from the Horizon scandal? Unfortunately there doesn't seem to be much discussion of this. Compare with the airline industry, where failures are aggressively root-caused.

I'll start:

1. Transaction anomalies can end lives. Should popular RDBMS engines really default to weaker-than-serializable isolation (allowing non-repeatable reads, for example)? There's a minimal sketch of opting into serializable isolation after this list.

2. Offline is very hard. A lot of bugs happened due to trying to make Horizon v1 work with flaky or very slow connections, and losing transactional consistency as a result. The SOTA here has barely advanced since the 90s; instead the industry has just given up on this and now accepts that every so often there'll be massive outages that cause parts of the economy to just shut down for a few hours when the SPOF fails. Should there be more focus on how to handle flaky connectivity in mission-critical apps safely?

3. What's the right way to ensure rock-solid accountability around critical databases, given that bugs are inevitable and data corruption must sometimes be manually fixed? A lot of the Horizon problems seemed to involve Fujitsu staff manually logging in to post offices and "fixing" the results of bugs, without realising their fixes created ledger imbalances that the SPMs would then be blamed for. Part of why big enterprises got so excited about blockchains was this notion of an immutable ledger in which business records can't go magically changing around you without anyone knowing how. There are clearly ways to do this, but they're not the default; see the second sketch after this list.

4. IIRC at least some failures were traced back to broken touch screens generating spurious random touches, which could lead to random transactions being entered and confirmed at night when nobody was around. Are modern capacitive touch screens immune to this failure mode? If not, do consoles in embedded applications always reliably engage screen locks?
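
To make point 1 concrete, here's a minimal sketch (Python with psycopg2, against a made-up branch_ledger table - nothing to do with Horizon's actual schema) of opting into SERIALIZABLE isolation and retrying on conflict instead of silently accepting anomalies:

    import psycopg2
    from psycopg2 import errors

    def apply_adjustment(dsn, branch_id, amount, attempts=5):
        # Run the balance update under SERIALIZABLE isolation and retry on
        # conflict, rather than silently accepting a non-repeatable read.
        conn = psycopg2.connect(dsn)
        conn.set_session(isolation_level="SERIALIZABLE")
        for _ in range(attempts):
            try:
                with conn, conn.cursor() as cur:
                    cur.execute("SELECT balance FROM branch_ledger WHERE branch_id = %s",
                                (branch_id,))
                    (balance,) = cur.fetchone()
                    cur.execute("UPDATE branch_ledger SET balance = %s WHERE branch_id = %s",
                                (balance + amount, branch_id))
                return
            except errors.SerializationFailure:
                continue  # a concurrent transaction conflicted; rerun from scratch
        raise RuntimeError("gave up after %d serialization conflicts" % attempts)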
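
And for point 3, a toy illustration of the "tamper-evident ledger without a blockchain" idea: an append-only log where every entry records who acted and hashes the previous entry, so a quiet after-the-fact "fix" breaks the chain and shows up on verification. (All names here are invented; a real system would also sign entries and replicate the log off-site.)

    import hashlib, json, time

    def append_entry(ledger, actor, action, details):
        # Each entry names who acted and chains to the previous entry's hash,
        # so any later edit invalidates everything that follows it.
        prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
        entry = {"ts": time.time(), "actor": actor, "action": action,
                 "details": details, "prev": prev_hash}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        ledger.append(entry)
        return entry

    def verify(ledger):
        # Re-walk the chain; returns the index of the first bad entry, or None.
        prev = "0" * 64
        for i, entry in enumerate(ledger):
            body = {k: v for k, v in entry.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return i
            prev = entry["hash"]
        return None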

I guess there are bazillions more you could come up with.




According to the TV drama, Fujitsu staff could go in and change the live data while the log said it was the subpostmaster making the change. If so, that seems like negligently bad design.


It was an IT project for a government department, designed and delivered in the 90s; negligently bad design is pretty much what I think we would've expected.


The history of UK government IT contracts is indeed a very sad one.

From memory:

UK Covid test and trace system: £37 billion pretty much wasted

UK NHS IT system: £12 billion pretty much wasted

And I'm sure there are lots of others.

And yet we are still giving huge projects to the same companies. Amazingly, Fujitsu's contract for the Horizon system has recently been renewed.


Only a small fraction of the covid test and trace costs were on software, the overwhelming majority was on lab tests and PCR kits.


Fair point. I don't know what percentage was IT.


There is an organizational axis too. The technical capabilities of the GPO went off with BT in 1984. The remaining organisation was anything but a competent customer for IT implementations. Outsourcing has definite limits and potentially catastrophic results - as does the demolition of corporate technical capability.

An empowered technical architecture function could have vetted this system and prevented all of this. But gut it and stamp it under the heel of the CFO and you may as well not bother.


> Outsourcing has definite limits and potentially catastrophic results - as does the demolition of corporate technical capability.

A lot of what seems apparent in this case is that contractual and commercial factors weren't set up correctly - they were set up to deliver predictable prices (loved by public sector clients), but not necessarily to deliver good outcomes.

An example - it appeared much of the rush to ship the point of sale terminals was to get through customer acceptance (and presumably the payment milestone), despite scope creep and quality issues. And the PO was charged to access audit logs, with limits on how much log data could be handled. Presumably this delivered a lower headline price for an accountant negotiating the price down, but it ultimately made a poorer solution that wasn't fit for purpose.

It seems like (from what I've seen of the evidence) nobody inside the PO really had full understanding and ownership of the project; they'd outsourced that, but kept the suppliers tightly commercially managed, creating incentives for shipping poor-quality code rather than spending the time to polish it, as some in the development team had tried to do.

Some interesting evidence here in the inquiry on software development practices and low competency and quality of code - https://postofficeinquiry.dracos.co.uk/phase-2/2022-11-16/#d...


I have "owned" the technical delivery of projects for clients, and I like to think I did a very good job - but it was very uncomfortable because when I insisted on things being done properly it ate into my bonus. Lucky for me I had a great team and this didn't happen so much, but I think that external ownership and accountability for project outcomes is only appropriate when the organisation really doesn't know what it's doing and really has to act. In that case I believe that the best thing to do is to get a third party to do it and separate the delivery organisation / program office from the development organisation / resource management.

Interesting link, thanks.


> The remaining organisation was anything but a competent customer for IT implementations.

This is very much the standard in UK public procurement and has been for many years. It's got a lot worse since Brexit, when most civil servants with any skills or capability to deliver moved on because they didn't want to deliver the 'will of the people' to have their cake and eat it.


Can you evidence that claim? The only major public sector procurement effort I can recall since Brexit was the COVID vaccine programme, in which UK procurement worked much better than the EU-level effort did - to the extent that, at the height of it, the EU was seriously talking about seizing the factories manufacturing vaccines the UK had bought while the EU was still negotiating.

And they also bought far too much. Germany is now required by the EU treaties to buy so much vaccine supply that if it didn't expire it would last them until the 24th century.

After all that, there was an attempt at an investigation but it turned out the whole thing was negotiated in secret and key deals were made by Ursula von der Leyen using deleted SMS messages.


> whole thing was negotiated in secret and key deals were made by Ursula von der Leyen using deleted SMS messages.

This is completely different from the British system, where key deals are made in secret using deleted Whatsapp messages.


Don't worry about DBMS transaction management. This is the wrong level. No bank uses database-level transactions to make sure a balance transfer doesn't erase or double money. They post a durable entry in the transaction log, and then compute balances as roll-ups of those entries. If some weird data glitch or cosmic ray produces the wrong balances, just re-run the tx rollup!
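
Something like this, in toy Python (obviously not any real bank's schema): the stored balance is just a cache, and the durable entry log is the source of truth you can always re-derive it from.

    from decimal import Decimal

    def rollup(entries):
        # The balance is derived purely from the durable entry log; any stored
        # balance column is just a cache that can be recomputed at will.
        return sum((Decimal(e["amount"]) for e in entries), Decimal("0"))

    entries = [
        {"txid": "t1", "amount": "100.00"},   # deposit
        {"txid": "t2", "amount": "-40.00"},   # withdrawal
    ]
    assert rollup(entries) == Decimal("60.00")
    # If a cached balance ever disagrees with this number, the log wins.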


Yes I know, but banking isn't the only place where this sort of stuff can go wrong. See also the payroll discussion above. And Horizon had a similar design I think (message logs that were replayed to catch up with the true state of the ledger).


This approach still leaks, but the breakage will be things like overdraft limits, and those can be handled as business exceptions. And that's why we have transaction size limits. Risk management, all the way down.


> Offline is very hard. A lot of bugs happened due to trying to make Horizon v1 work with flaky or very slow connections, and losing transactional consistency as a result. The SOTA here has barely advanced since the 90s; instead the industry has just given up on this and now accepts that every so often there'll be massive outages that cause parts of the economy to just shut down for a few hours when the SPOF fails. Should there be more focus on how to handle flaky connectivity in mission-critical apps safely?

If there's a network partition you have two options: accept reduced availability and keep your consistency, or keep availability up and accept reduced consistency. Not much else you can do, that's just life.

Obviously there are ways to strengthen consistency - consensus algorithms, two-phase commit, etc. - and ways to relax it in exchange for availability. Depends on your requirements.


In many real-world situations conflicts are rare and it's OK to temporarily lose consistency (especially if you know that it's happened), as long as you can catch up later and resolve the merge. Version control is a practical example that we interact with every day, but there are others.

A lot of the Horizon stuff was very local to the specific post office, hence their initial replication-based design.
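
As a rough sketch of the store-and-forward pattern (hypothetical code, not Horizon's actual design): give each transaction a client-generated id before it ever touches the network and have the server dedupe on that id, so replaying after a flaky connection can't double-apply anything.

    import uuid

    class OfflineQueue:
        # Toy store-and-forward queue: transactions are recorded locally first,
        # then flushed when connectivity returns. Replays are safe as long as
        # the server dedupes on the client-generated id.
        def __init__(self):
            self.pending = []

        def record(self, amount, description):
            tx = {"id": str(uuid.uuid4()), "amount": amount, "desc": description}
            self.pending.append(tx)  # a real terminal would persist this to disk
            return tx

        def flush(self, send):
            # send(tx) must be idempotent server-side (dedupe on tx["id"]).
            still_pending = []
            for tx in self.pending:
                try:
                    send(tx)
                except ConnectionError:
                    still_pending.append(tx)  # keep it and try again later
            self.pending = still_pending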


> But the root cause was bugs. So many bugs.

I disagree; these were not the root cause of this scandal. Bugs happen, and even if these ones fell below expected standards, trying to pin the 'blame' on them might be perceived as deflecting from the real culprits. The scandal here is how the technical issues were handled.


It's a bit like saying planes will just fall apart in the sky, that it's inevitable, and what matters is whether the compensation is handled appropriately. All of it should be improved, but we can't just assume arbitrarily incorrect software will always be covered for by non-technical systems.


Obviously not the same at all and not what happened here.


You can't technical your way out of a social problem. It doesn't matter how many best practices you define if the management aren't going to follow them, and get their mates to award them a CBE for services to misconduct.

> Are modern capacitative touch screens immune to this failure mode?

No, this is basically impossible.




