Lessons learned from failing as a witness

About a week ago, my witness node crashed abruptly. No errors, no trace. My guess is that the steemd process has been killed by the kernel when the system ran out of ram (32GB setup).

My setup had been a mess. A manually configured server, and a few scripts running on my PC (feed, killswitch). I had no backup node.

To make matters worse, when my node crashed, my kill-switch hadn't been running, and I was AFK. This caused me to miss 90 blocks. I tried re-building the node, but steemd failed me on the first attempt. So I had to re-sync twice, which caused a whole day outage for my witness.

My negligence got me kicked out of the witness pool.

Never give up

At first, I was going to give up, and quit as a witness. But the thought of that alone made me really sad. Being a witness is part of my Steemit identity, and I've decided to turn my failure into an opportunity to create an awesome setup.

The setup

I now have a 4-server cluster, in 2 datacenters (Paris and Amsterdam).
1.) Witness node A, running latest version of steemd (ie 0.19.0.rc1)
2.) Witness node B, running a known stable version of steemd (ie 0.18.2)
3.) conductor node, publishing price feeds and handling failovers.
4.) A public seed node (seed.steemdata.com:2001)

Witness Nodes

The witness nodes are quad-core xeon's, with 32GB ram each, and 2 SSDS.
I am hosting steemd in docker, and its data volume is mounted to second, larger SSD.
The first SSD also has a 16GB swap file.

The node setup is still manual, and fairly arduous, but fortunately it only has to be done once per server.

Management

Managing a witness can be messy, which is why I developed a neat command line app (conductor).

Generating Keys

I will begin by generating signing keys.
I will generate new signing key every time I deploy a new node, for security and double-signing prevention reasons.

conductor key-gen

Here I generate 2 key-pairs. Each witness node gets its own private key. This is very important to get right, because having the same signing key on more than one node could lead to double-signing, which is a quick way to get into serious trouble.

Setting up automated failover

Once both witness nodes are up and running, its time to prepare the failover plan.

If my main witness node A is signing with key SK-A, and my backup witness node B is signing with key SK-B, I will enable my witness with SK-B, and make it failover to SK-B.

conductor kill-switch --second-key <SK-B>

or more concretely (see screenshot above):

conductor kill-switch --second-key STM5kZnvU8R1zqSn6yg6ERiGfffDupxznRureyJrLQfSp6QcKsTUk

Here is what happens in the event of a node failure:
1.) Witness node A fails. Once 10 blocks are missed, kill-switch will change the signing node to backup node B, by assigning its public key (SK-B).
2.) If witness node B fails as well, my witness will be automatically disabled to avoid missing more blocks.

Enabling the witness

Now that the failover is configured, its time to enable the witness.
This is done with conductor enable command, and assigning primary node key to it (SK-A).

conductor enable <SK-A>

or more concretely (see screenshot above):

conductor enable STM5yjjAg4ApWoMHLTHbAss7EFC3LxyiEoAcivaTNKBtpSaD5WtyJ

Running the price feeds

The last component of running a witness is providing an accurate, reliable feed.

I've developed a feed publishing tool into conductor as well:

conductor feed

You can see more options for feeds here.

Thank you

Thank you for supporting my witness. It means a lot to me. Any suggestions for further improvements are welcome.