Hellandizing (1998) (multicians.org)
56 points by redeemed 9 months ago | 9 comments



I wrote something similar in the past.

It was a credit card terminal application. It was supposed to behave correctly regardless of when in the transaction process something happened. That something could be an application crash, a power cycle, etc. Users were frequently impatient with the slower communication methods (we had land line and GPRS modems back then) and would power cycle the terminal if it took a moment too long to do something.

I designed the application so that it always moves transactionally between a series of well-defined states with no observable states in between. Then I put test points between the states to crash the application at random. Many of these points were placed in parts of the code that would be extremely difficult to test any other way (it was extremely unlikely a power cycle would happen naturally at that exact point in time, between those two exact instructions).

The application had the capability to crash at literally any point and simply continue operation once power cycled.
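To illustrate the idea, here is a minimal sketch in C of what such a crash test point could look like; the macro name, the probability knob, and the transaction steps are made up for the example, not taken from the actual terminal code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Probability of crashing at any given test point (per mille), test builds only. */
static int crash_permille = 5;

/* Hypothetical crash-injection point. In a test build each call site gets a
 * small chance of terminating the process abruptly, simulating a power cycle
 * at exactly that instruction boundary. In production it compiles to nothing. */
#ifdef CRASH_TEST_BUILD
#define TEST_POINT(id)                                                \
    do {                                                              \
        if (rand() % 1000 < crash_permille) {                         \
            fprintf(stderr, "crashing at test point %s\n", (id));     \
            abort();   /* no cleanup, like a sudden power loss */     \
        }                                                             \
    } while (0)
#else
#define TEST_POINT(id) ((void)0)
#endif

/* Example: a commit that must end up consistent no matter where it crashes. */
void commit_transaction(void)
{
    TEST_POINT("before-journal-write");
    /* ... append intent record to the journal ... */
    TEST_POINT("after-journal-write");
    /* ... apply the state change ... */
    TEST_POINT("before-journal-truncate");
    /* ... mark the journal record as applied ... */
}
```

Running the test suite many times then exercises recovery from crashes at every one of these boundaries.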


Fascinating. Was it able to do this by constantly writing to persistent storage?


Yes. I wrote an append-only database for the device. I started with the database because I needed transactional storage, but the device used flash chips without wear levelling. This, and the requirement for constant memory usage, basically disqualified any existing database. The device was very limited in memory: only about 1MB was available to the application, of which 600kB was used up by OpenSSL.

The database I implemented did not allocate any memory and used a constant amount of stack, which was important to ensure the application could be verified statically.

So the database worked with two preallocated files. The application would append data to one file; when it was full, it would copy the live entries to the other file and start appending there, and so on.

This might sound wasteful, but in reality write amplification was very low. There was very little data that needed to be copied; most records were created and then promptly deleted once they were reconciled with the server.

All sorts of data were written to the storage. I started by writing just basic transaction information, but then I discovered that I could also log other information (UI state, etc.) to recover state in case of power failure. This was extremely efficient: most UI operations would result in only a single byte written to the flash. That mattered because the flash had quite limited durability and was typically the limiting factor for the longevity of the device.
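Roughly, and only as an illustration of the two-file scheme described above, a sketch in C; the record layout and the flash_* functions are hypothetical stand-ins for whatever the device actually exposed.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flash interface for the sketch; the real device API differed. */
#define SEGMENT_SIZE 4096
extern bool flash_write(int segment, uint32_t offset, const void *buf, uint32_t len);
extern bool flash_read(int segment, uint32_t offset, void *buf, uint32_t len);
extern bool flash_erase(int segment);

/* Fixed-size records so no dynamic allocation is ever needed. */
typedef struct {
    uint8_t  type;        /* e.g. transaction, UI event */
    uint8_t  live;        /* cleared once reconciled/deleted */
    uint16_t id;
    uint8_t  payload[28];
} record_t;

static int      active_segment = 0;  /* which of the two files we append to */
static uint32_t write_offset   = 0;

/* Append one record to the active segment. When the segment fills up, copy
 * only the still-live records into the other segment, erase the old one,
 * and continue appending there. Write amplification stays low because most
 * records are already dead (reconciled) by the time compaction runs. */
bool log_append(const record_t *rec)
{
    if (write_offset + sizeof(*rec) > SEGMENT_SIZE) {
        int next = 1 - active_segment;
        uint32_t out = 0;
        record_t tmp;                         /* constant stack usage */

        if (!flash_erase(next))
            return false;
        for (uint32_t in = 0; in + sizeof(tmp) <= SEGMENT_SIZE; in += sizeof(tmp)) {
            if (!flash_read(active_segment, in, &tmp, sizeof(tmp)))
                return false;
            if (tmp.live) {
                if (!flash_write(next, out, &tmp, sizeof(tmp)))
                    return false;
                out += sizeof(tmp);
            }
        }
        active_segment = next;
        write_offset   = out;
    }
    if (!flash_write(active_segment, write_offset, rec, sizeof(*rec)))
        return false;
    write_offset += sizeof(*rec);
    return true;
}
```

A real implementation would also need a crash-safe way to tell, after power-up, which segment is current and where the last complete record ends (e.g. sequence numbers and per-record checksums), which this sketch leaves out.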

**

Now that I remember, there were other fun tricks I did.

One of them was for OpenSSL. This super underpowered device took no less than 9 seconds to open an SSL connection over GPRS.

That was fine when the device was first designed and we deployed lots of them. Initially, we did not need SSL and we did not need to open any connections to complete the transaction. We only did that later, once a day, to send all of the information and reconcile with the server.

But at some point we had to add the ability to do an online check with the bank, and we were also required to secure all connections with SSL. Making the customer wait 9 seconds on every transaction was definitely not acceptable.

While we could try to keep the connection open, in many cases the device would be deployed in places with poor connectivity, and keeping a connection alive was just an unreasonable expectation.

I saved the company A TON of cash by figuring out that I could gut the OpenSSL library so that it could manually save and restore the cryptographic state of the connection (the symmetric cryptography). I did the same for the system that terminated connections on the server.

The application would connect to the server, skip the entire handshake (not even a single handshake packet was sent) and immediately, speculatively switch to the stored symmetric key.

The server kept a database of the most recent cryptographic state for each known terminal and would try to match the incoming communication against its own stored state. If this worked, it would continue as if nothing had happened. If it did not, it would close the connection, and the terminal would restart the connection with a complete handshake from scratch. But it very rarely had to. As the scheme was very successful, we had to add functionality to forcibly clear the state every night, so that we knew there was at least one fresh handshake every day.

This cut down almost all of the overhead of OpenSSL.
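Stock OpenSSL does not let you skip the handshake entirely the way described above; the closest off-the-shelf analogue is TLS session resumption, where the established session (including its master secret) can be serialized, stored in flash, and restored before the next connect so that only an abbreviated handshake runs. A sketch using standard OpenSSL calls; the store_blob/load_blob persistence hooks are hypothetical placeholders:

```c
#include <openssl/ssl.h>

/* Hypothetical persistence hooks; on the terminal these would write to flash. */
extern int store_blob(const unsigned char *buf, int len);
extern int load_blob(unsigned char *buf, int maxlen);

/* After a successful handshake, serialize the session (which contains the
 * negotiated master secret) so it survives a power cycle. */
int save_session(SSL *ssl)
{
    unsigned char buf[2048];
    SSL_SESSION *sess = SSL_get1_session(ssl);
    if (!sess)
        return -1;
    int len = i2d_SSL_SESSION(sess, NULL);   /* query encoded length first */
    if (len <= 0 || len > (int)sizeof(buf)) {
        SSL_SESSION_free(sess);
        return -1;
    }
    unsigned char *p = buf;
    i2d_SSL_SESSION(sess, &p);               /* DER-encode into buf */
    SSL_SESSION_free(sess);
    return store_blob(buf, len);
}

/* Before connecting again, restore the saved session so the client offers
 * resumption instead of a full handshake. Must run before SSL_connect(). */
int restore_session(SSL *ssl)
{
    unsigned char buf[2048];
    int len = load_blob(buf, sizeof(buf));
    if (len <= 0)
        return -1;
    const unsigned char *p = buf;
    SSL_SESSION *sess = d2i_SSL_SESSION(NULL, &p, len);
    if (!sess)
        return -1;
    int ok = SSL_set_session(ssl, sess);
    SSL_SESSION_free(sess);
    return ok == 1 ? 0 : -1;
}
```

The approach described in the comment went further: both ends kept the raw symmetric record-layer state and resumed it directly, with no handshake packets at all, which is why it required modifying OpenSSL internals on both the terminal and the server.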


Did something like and unlike this a while ago. Payment processing system, multiple microservices written in Java on a cloud platform. Tool runs on a developer machine, uses the cloud tools to SSH into a running container and run jdb. It can add a breakpoint, wait for it to get hit, then do a selected thing - resume immediately, delay a while then resume, throw an exception, hang forever, etc.

The main tool can also manage the platform, so start the app, kill individual containers, etc. And inject payment messages, and wait for payments to be sent out.

So you could write test plans like "add a breakpoint in ValidateAccount::isValidSortCode, inject a payment from customer A, wait for the breakpoint to get hit, inject a payment from customer B, throw a NullPointerException, then check that the payment from customer B gets delivered".

We didn't do systematic testing of failure at every point, as in the article, but it was very nice to be able to automate testing of error cases without having to have special code in the app for it.
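A bare-bones sketch of scripting jdb this way: it feeds commands to a jdb attached to a remote debug port, sets a breakpoint, and resumes after a delay. The host/port is a placeholder, the SSH step and output parsing are omitted, and this is not the actual tool described above.

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Attach jdb to a JVM started with a debug agent; host/port are made up. */
    FILE *jdb = popen("jdb -attach localhost:5005", "w");
    if (!jdb)
        return 1;

    /* Breakpoint named in the comment's example test plan. */
    fprintf(jdb, "stop in ValidateAccount.isValidSortCode\n");
    fflush(jdb);

    /* ...the harness would inject the payment from customer A here... */

    sleep(30);                  /* crude stand-in for "delay a while then resume" */
    fprintf(jdb, "cont\n");     /* resume the suspended thread */
    fprintf(jdb, "exit\n");
    fflush(jdb);
    return pclose(jdb) == 0 ? 0 : 1;
}
```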



A question about how nonstop worked:

If both processors are running the same process in the same state, why won't both processors hit the same error condition at the same time?

I understand there are random hardware faults that can happen, bits can flip, etc., but logic errors should be bug-for-bug the same on both processors.

So, were those random faults so frequent that the redundancy was worth it? Or am I missing something?


You'd probably be interested in their set of slides on techniques for constructing robust software [1]. It talks about this issue, among others. For one thing, the processors could easily be in different states due to resources each has access to, so the same code could behave differently on different processors. Another topic they touch on in the slides is the notion of having multiple implementations of the same program, with compatible inputs and outputs but different implementations written by different teams that did not communicate.

[1] https://www.fastonline.it/sites/default/files/2019-06/Robust... (I couldn't find a version on the stratus website any more, but this appears to be the same as the one I downloaded from their site many years ago).


Thank you!


What does "Reset the test point" mean in this piece?



