Me and STEEM (3) -- fixing seed node stuck issue

About seed nodes in the Steem and BitShares networks.

Background

In Steem and BitShares, seed nodes are the entrances to the network. When a new node wants to join, it first connects to one or more known seed nodes. The seed nodes tell it where other nodes are, and the new node then connects to those nodes and thereby joins the P2P network.

A public seed node should always

  • be online, and
  • be listening on promised port(s), and
  • accept incoming connections, and
  • broadcast available peer/node list to connected peers/nodes

(We're only talking about networking here, so we'll ignore blocks, transactions, etc.)

According to the networking protocol (specification/white paper needed), when a seed node gets a new incoming connection, it first responds with a "hello" message. The connecting node then knows the seed node is alive and proceeds with further communication. An offline seed node refuses or drops incoming connections. A stuck seed node accepts incoming connections but responds with nothing or with a bad message.

Here are lists of public seed nodes of the Steem, BitShares and MUSE networks (thanks to @wackou, @steempty and the node providers):

The Story

Recently, we got reports that new nodes were unable to sync to the BitShares network. We then found that all public seed nodes were either offline or stuck (see this forum post). This was quite critical: it meant the public entrances to the network were closed. At least one witness had been unable to produce blocks because it could not reconnect to the network after being disconnected. Some others were unable to sync their own nodes. We even guessed that perhaps someone had succeeded in attacking the network.

Fortunately, I operate one of the public seed nodes, so I restarted it, and the network became open again. I also notified the other seed node operators I knew and asked them to restart their nodes. While we were discussing this, @steempty provided a script to check the status of a seed node, which helped a lot:

[ -n "`echo EOF | nc steem.clawmap.com 2001 -w 10 -q 2`" ] && echo Ok || echo Failed
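The one-liner above only says Ok or Failed. A slightly longer sketch could tell the three states from the Background section apart: offline (connection refused or dropped), stuck (connection accepted but nothing comes back) and ok. This is not @steempty's script, only an illustration. The function names `classify_seed` and `probe_seed` are made up for this post, and it assumes an `nc` that supports `-z` in addition to the `-w` and `-q` flags used above. It also only detects the "responds with nothing" kind of stuck node, not one that answers with a bad message.

```shell
#!/bin/sh
# Hypothetical helper names; this is a sketch, not part of Steem or BitShares.

# classify_seed CONNECT_STATUS BYTES
#   CONNECT_STATUS: 0 if a TCP connection could be opened, non-zero otherwise
#   BYTES:          number of bytes the node sent back after connecting
classify_seed() {
    connect_status=$1
    bytes=$2
    if [ "$connect_status" -ne 0 ]; then
        echo "offline"      # refused or dropped the connection
    elif [ "$bytes" -eq 0 ]; then
        echo "stuck"        # accepted the connection but sent nothing
    else
        echo "ok"           # sent something, presumably the "hello" message
    fi
}

# probe_seed HOST PORT
probe_seed() {
    host=$1
    port=$2

    # Step 1: can a TCP connection be opened at all?
    # Assumption: this nc supports -z (connect without sending data).
    nc -z -w 10 "$host" "$port" >/dev/null 2>&1
    connect_status=$?

    # Step 2: if so, count the bytes the node sends back.
    # Same flags as the one-liner above; tr strips padding some wc versions add.
    bytes=0
    if [ "$connect_status" -eq 0 ]; then
        bytes=$(echo EOF | nc "$host" "$port" -w 10 -q 2 2>/dev/null | wc -c | tr -d ' ')
    fi

    classify_seed "$connect_status" "$bytes"
}
```

With these functions loaded, `probe_seed steem.clawmap.com 2001` would print one of offline, stuck or ok. Counting bytes instead of capturing the reply in a variable avoids trouble with binary data in the response.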

But the good state did not last long. After only one day, I found my node stuck again, and most of the other nodes were still not working or were down again. The good news was that a few nodes were still alive. I restarted my node again and started looking for the cause of the issue. I found some interesting entries in the logs, but they were not clear enough, so I patched my node with additional logging, hoping to get more information.

Another day passed and my node got stuck again, but I found no more useful information in the logs. In the meantime, @wackou built the seed node status pages (the lists above). I restarted my node with more logging.

Yet another day, my node got stuck twice, but there was still no more useful information in my logs. Obviously I hadn't added logging in the right place. @dannotestein aka @blocktrades pointed out that the issue could be caused by an uncaught exception, which I hadn't noticed. I restarted my node with special logging for that.

Some days passed and my node did not get stuck, so I had no chance to see what the exception was. Okay, at least the network was alive. In the meantime, @Thom fixed several of his seed nodes, so the network became much healthier.

One day, my node got stuck again (finally). I caught the exception -- "Transport endpoint is not connected". After a discussion with @dannotestein, we concluded that it's safe to simply ignore that exception (and perhaps other exceptions thrown from the same code block). I then patched my node, restarted it again, and waited for the next exception.

The next day, my node caught the exception again and ignored it, and it has been working fine since. So I guess the patch works. I then submitted the code to GitHub. It's only a one-line change, but it took us about two weeks to get there.

The end. Thanks to everyone who helped. I hope we have a healthy platform forever.
