What happened to the STEEM network on Friday, March 16th, 2018?

image background credit to flickr.com/photos/carlberger/7777348180

For 3 hours and 44 minutes, almost all steem blockchain explorers were down! The websites we know and love like steemit.com, busy.org, steemd.com (among others) are in fact blockchain explorers, and were not available for anyone to use.

The outage started at 09:32:12 (UTC) on March 16, 2018 and ended at 13:16:24 (UTC).

As a witness, these things worry me. But why? Is the STEEM blockchain not decentralised? Of course it is. So what's the problem? Why did most STEEM apps stop working? Well, there is a lot of plumbing between the blockchain and the apps.

So the name of the game becomes: can you spot the single point of failure?

Let's look at what happened and what can be done.

Outage

The outage occurred right between these two blocks.

  • Block 20,721,949 59 transactions in this block, produced at 2018-03-16 09:32:09 (UTC)
  • Block 20,721,950 22 transactions in this block, produced at 2018-03-16 09:32:12 (UTC)

Recovery

The recovery occurred around these blocks. Within a minute, I saw transactions ramp up from a few per block to the normal average.

  • Block 20,726,434 22 transactions in this block, produced at 2018-03-16 13:16:24 (UTC)
  • Block 20,726,439 63 transactions in this block, produced at 2018-03-16 13:16:39 (UTC)
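As a quick sanity check on the numbers above, here is a small Python sketch. It assumes the block numbers and timestamps listed in this post are accurate, and it takes the 3 second block interval from the spacing between the consecutive blocks shown. If the chain itself never stopped, the number of blocks produced during the outage should match the elapsed time divided by that interval.

```python
from datetime import datetime

# Interval taken from the timestamps above: consecutive blocks are 3 seconds apart.
BLOCK_INTERVAL_SECONDS = 3

outage_start = datetime(2018, 3, 16, 9, 32, 12)   # block 20,721,950
outage_end = datetime(2018, 3, 16, 13, 16, 24)    # block 20,726,434

elapsed = outage_end - outage_start
blocks_produced = 20_726_434 - 20_721_950
expected_blocks = int(elapsed.total_seconds()) // BLOCK_INTERVAL_SECONDS

print(elapsed)          # 3:44:12
print(blocks_produced)  # 4484
print(expected_blocks)  # 4484
```

The two counts agree, which is consistent with the blockchain producing a block in every slot while the websites were unreachable.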

What happened?

I was present in the witness channel on steem.chat during the outage, and witnesses were saying that the main API server (api.steemit.com) was down.

Since most STEEM-related websites (condensers) use that API server, they were unable to serve the content of the still-working blockchain.

I was monitoring the output of my witness server and saw almost empty blocks being produced. It was an eerie sight.

How can we prevent this?

Looking at the config file in the code of the steemit dot com website, I can see that the server setting takes a single server argument and not an array (a list of multiple servers).

I think we could implement a function that checks which of the available RPC nodes is up and uses it to feed the condensers.

I've found some lists of RPC nodes: one provided by @jamzed and one provided by @followbtcnews.

It's a start, but even these lists are themselves a single point of failure. We need an automatic discovery system for RPC nodes built right into our STEEM apps (websites included).

Also, not all nodes are born equal: the steemit condensers need a JUSSI-flavored service. According to @yehey, api.steemit.com is the only JUSSI RPC node. See @sneak's comment for more information on JUSSI.

There are many working parts in the STEEM machine. I think I will tackle a component infographic of how it's all connected in a future post.

They kept going

While looking at the blocks that kept being produced, I noticed some activity. Among it were a few posts created with the esteem app, and @good-karma confirmed this: a feature of the app allows the user to choose the server.

I do think this is part of the answer: if we at least let the user pick a server, then we are free to choose one that actually works.

steemkr.com also kept going; presumably they're using an RPC server other than api.steemit.com.

No news is good news, right?

On April 1st, 16 days after the outage I'm reporting on, @steemitblog posted a notice about another Steemit.com outage. Although this one was caused by a third party (AWS), I'm disappointed that nothing came out from them about the March 16, 2018 event.

While technically Steemit.com is still in beta (it says so right in the logo of the site), I do believe that transparency within this community is essential. Let's lay the issues out there, and perhaps interested parties like developers can help. See Utopian.io for ways to get rewarded for contributing to open source software.

I'm a witness

If you like the points I'm raising, do not hesitate to vote for me twice: once for this contribution, and once as a witness.
