On how to make something rare happen reliably

What happens when some rare emergency flares up? In principle, we should follow the "rare emergency" procedure. For a poignant example, when a US Navy Orion P3 spy plane was captured by the Chinese in 2001, the crew should have destroyed all the confidential material on the plane. They tried, but failed.

Damaged US Navy Orion P3 spy plane

In this article, the Navy/NSA report now revealed by the Snowden leaks blames the system for lack of procedures.

The report describes the crew’s haphazard and jury-rigged efforts to destroy equipment without proper tools and the woefully inadequate training they received for dealing with a scenario the Navy should have considered inevitable. Even though several close encounters with Chinese fighter jets had occurred in the region before, procedures for dealing with such a situation were insufficient, and the crew never underwent emergency destruction drills. As a result, they were left scrambling in the heat of the moment to determine what needed to be destroyed and how to do it. Although the crew had about 40 minutes between the moment of collision and the landing in China — plenty of time to jettison or destroy all sensitive material, investigators concluded — there “were no readily available means or standard procedures for timely destruction of computers, electronic media, and hardcopy material.” This deficiency, along with the lack of training, investigators wrote, “was the primary cause of the compromise of classified material.”

This is a pretty good result: the event was clearly so unlikely that the crew could not have been expected both to handle the emergency destruction and to save their own lives. Something had to give.

But would procedures have worked? No. Putting in place procedures requires training and rehearsal. By its nature or rarity, the training would be squeezed and the rehearsal would have been dropped. Procedures would have just obfuscated the situation, and disasters would be interpreted as failures of training or bad users - we'd be going backwards, worse than useless.

It wasn’t the first time cryptologic sources and methods were at risk of compromise. In 1968, North Korea captured the USS Pueblo and acquired a large inventory of highly sensitive intelligence materials from the ship. Since then, crews were supposed to be trained in emergency destruction procedures. But that didn’t happen with the EP-3E crew. Only one member of the crew had ever participated in an in-flight emergency destruction drill.

But we do want to keep our secrets. Or, we want to preserve our state, whatever that means. The answer to the conundrum of infrequent emergencies versus routine solidity is an odd but logical one: if you cannot prepare adequately for rare events, then you have to make the rare events non-rare. You have to make emergencies routine; you have to move a fragile system into anti-fragile space.

And so it is with computer backups, bringing this back to a more comfortable zone. Anecdote time: when I was a young programmer, my job was to port a big, powerful database to new platforms. This was in the day when there were more hardware platforms than ported apps, so every hardware seller wanted apps. Business was good.

But as I was the only one in the country that had the source code, and understood it, I was also the one called out when the databases screwed up. This was the scary part of the job - if I screwed up, people got really messed up. E.g., I got called out to repair the database that handled flight logistics for the air force's cargo planes in routine and emergency actions. If I failed, the logistics was grounded because nobody knew what to do, it was all in the computer.

Each time this happened the cause was the same - the database screwed up, and the backups didn't work.

Why didn't the backups work? Actually they had never worked ... because they'd never been tried. Place after place that had this syndrome, from the boring to the exciting, had the same problem: no testing of backups. So when it came time for me to write my own database, I took on the task of that never happening to me, out of guilt I suppose.

The answer to making the backups completely secure was to make the backups completely routine. Which means, you always recover from backups. Not just occasionally, but always. That in turn means the backups are the database, and you throw away the traditional concept of storing a data set in some ISAM (indexed sequential access method) file and just keeping a backup somewhere else for emergencies.

This was pretty radical at the time (back in 1995-1996 when I built my first accounting engine for client server digital cash) but it worked. The system would start up, read through all the prior transactions in the log, re-run them one after the other, and build the entire state within memory on the fly. Every new transaction would be logged - not the state, just the message coming in. When it came time to stop the database, just kill it.
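The start-up-by-replay idea can be sketched in a few lines. This is only an illustration in Python, not the original engine: the `Ledger` class, the `submit` method, the message fields and the JSON-lines log format are all assumptions made up for the example.

```python
import json
import os
import tempfile


class Ledger:
    """In-memory balances rebuilt from an append-only message log (hypothetical)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.balances = {}
        # Start-up *is* recovery: replay every logged message, in order.
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    line = line.strip()
                    if line:
                        self._apply(json.loads(line))
        self._log = open(log_path, "a", encoding="utf-8")

    def _apply(self, msg):
        """State change shared by replay and live traffic; must be deterministic."""
        if msg["type"] == "deposit":
            acct = msg["account"]
            self.balances[acct] = self.balances.get(acct, 0) + msg["amount"]
            return True
        if msg["type"] == "transfer":
            if self.balances.get(msg["src"], 0) < msg["amount"]:
                return False  # rejected now, and rejected again on every replay
            self.balances[msg["src"]] -= msg["amount"]
            self.balances[msg["dst"]] = self.balances.get(msg["dst"], 0) + msg["amount"]
            return True
        return False  # unknown message types change nothing

    def submit(self, msg):
        # Log the incoming message (not the resulting state) before applying it.
        self._log.write(json.dumps(msg) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        return self._apply(msg)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ledger.log")
    first = Ledger(path)
    first.submit({"type": "deposit", "account": "alice", "amount": 100})
    first.submit({"type": "transfer", "src": "alice", "dst": "bob", "amount": 30})
    # No shutdown step: abandon the object as if the process had been killed.
    second = Ledger(path)
    print(second.balances)
```

Because there is no separate "restore" code path, recovery gets exercised on every single start. The sketch deliberately leaves out things a real system would need, such as coping with a half-written final line after a crash in the middle of a write.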

The Twilight Zone

It worked. It only screwed up the once, a situation which I then characterised as the Twilight Zone, but is now better known as a fork.

Coming back to the spy plane and our desire to protect our secrets, the answer that the report should have given was to adjust the procedures for everything such that destruction of the secrets and recovery from backups were routine operations. The system should be designed to be crash-safe. Now, that doesn't entirely answer the problem because they would also need to crash-safe-recover in the air - the gear often needs to be cold-restarted in flight. More thinking is required here to handle the notion of encrypted logs - I've done this, but how do we handle the crew being captured, with passwords in their heads? So there is still some need for procedures, and K6 raises its ugly head again.

A laptop destroyed by the crew

But I believe the principle to be sound: that which must work must be routine. Emergencies must therefore be built in as routine operations.

Postscript: one of the implications of this was that the state had to be saved as a series of messages, not as state. I've now been informed that this is referred to as the event sourcing pattern, and furthermore, I'm told it is the core mechanism behind Bitshares and Steem. Outstanding! If you're into Computer Science and you want to understand high performance, start with the LMAX architecture.
