From zero to hero
How long does it take to go from 0 to 20,000,000?
Let's check how fast we can resync and replay, because that shows how fast our system can reach the head block.
Of course, that’s not a measure of overall steemd performance. For example, a system that replays faster is not necessarily faster when it comes to serving RPC requests.
The runtime performance of API endpoints is a good topic for another episode.
Video created for Steem Pressure series.
As you may remember from episode #3, I used an entry-level dedicated machine for the purpose of my presentation:
Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz on an Ivy Bridge with 32GB DDR3 1333MHz RAM and 3x 120GB SSD
Here’s the reference config file that we used back then. We will also use it now (with slight modifications depending on our needs):
p2p-endpoint = 0.0.0.0:2001
seed-node = gtg.steem.house:2001
public-api =
enable-plugin = witness
[log.console_appender.stderr]
stream=std_error
[logger.default]
level=info
appenders=stderr
That’s all that is needed to run a seed node or a witness node.
In this presentation, I continue to use v0.19.2 (with some minor patches, basically something that you will get by checking out stable).
In the previous episode, you saw the --resync process:
I was able to sync 20M blocks in slightly less than 9 hours.
Resync of 20M blocks with state file located on disk: 539 minutes
Currently, block_log takes 110GB, so it’s quite a lot of data to process (and this amount continues to grow).
For the purpose of most of our comparisons, we will use the 20M block as a reference point, because it’s nice and round and, unlike 10M, it contains all HARDFORKs.
Once we have our block_log file, whether from a previous resync operation or downloaded from a different instance or public source, we can save the time we would otherwise spend on --resync and perform --replay instead.
What’s the difference?
Resync connects to various p2p nodes in the Steem network and requests blocks from 0 to the head block. It stores them in your block_log file and builds up the current state in the shared memory file. The content of the latter depends on the plugins you use, so if you add, for example, the account history plugin, you will have to reindex (aka replay).
Replay uses the existing block_log to build up the shared memory file up to the highest block stored there, and then it continues with sync, up to the head block.
That’s much faster, but with no network latency to slow things down, the operations on both your block_log and your shared_memory.bin put a very intensive I/O workload on your storage.
By default, both files reside in the blockchain directory just below --data-dir, but the shared memory file can be placed elsewhere if you use the shared-file-dir option, preferably on a fast local storage device (low latency is the key) but also on a ramdisk or tmpfs.
Putting it all in RAM is the fastest solution, and you might be tempted to do so, but that’s not a viable option in the long run. Eventually, you will run out of RAM, and that will happen sooner than you think.
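The choice between the two modes can be sketched in a few lines. This is only an illustration, assuming the default layout described above (block_log inside the blockchain directory below the data directory); the function name and the example path are made up for this sketch and are not part of steemd.

```python
from pathlib import Path


def startup_flag(data_dir: str) -> str:
    """Pick --replay when a non-empty block_log exists, otherwise --resync."""
    # Default location described in this post: <data-dir>/blockchain/block_log
    block_log = Path(data_dir) / "blockchain" / "block_log"
    if block_log.is_file() and block_log.stat().st_size > 0:
        return "--replay"
    return "--resync"


# "witness_node_data_dir" is a placeholder path used only for illustration.
print(startup_flag("witness_node_data_dir"))
```

A file that exists is not necessarily complete or intact, so treat the result as a hint rather than a guarantee.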
Fast local storage
In our setup, we are using Intel 320 Series SSDs
root@pressure:~# hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 750 MB in 3.00 seconds = 249.74 MB/sec
It’s not a speed monster - on the contrary, that’s a pretty old and slow SSD drive.
It was introduced in early 2011.
According to Intel:
Sequential Read (up to) | 270 MB/s |
Sequential Write (up to) | 130 MB/s |
Random Read (8GB Span) (up to) | 38000 IOPS |
Random Read (100% Span) | 38000 IOPS |
Random Write (8GB Span) (up to) | 14000 IOPS |
Random Write (100% Span) | 400 IOPS |
Latency - Read | 75 µs |
Latency - Write | 90 µs |
It looks poor compared to devices available nowadays, but I have three of them, so let’s make use of what we have:
/dev/md3: 273.86GiB raid0 3 devices
root@pressure:~# hdparm -t /dev/md3
/dev/md3:
Timing buffered disk reads: 2442 MB in 3.00 seconds = 813.81 MB/sec
That’s much better.
We can now check how fast we can --replay, but to do so we need to get a block_log from somewhere, remember?
As a witness, you should have plenty of these (I have one in my pocket).
If you don’t, you have a few other options.
Each of them has its pros and cons, and the right choice largely depends on your infrastructure.
Here, we test two of them:
First, download my “always up to date” file
It’s publicly available at: https://gtg.steem.house/get/blockchain/
That took 31m51s and the file is ready to use.
To get to the head block, I will have to replay it and then sync the blocks that were produced in the meantime.
Then, download a highly compressed file
It’s publicly available at https://gtg.steem.house/get/blockchain.xz
That took 11m53s, but the file needs to be decompressed. Unfortunately, due to xz limitations, this can’t be done on the fly (one of the reasons I’m going to abandon it). The good thing is that it was compressed by pixz, so we can now use it with support for parallel decompression.
Decompression took 7m13s, so 19m6s in total.
Was it worth it?
Sure!
We have saved over 10 minutes of time and 60GB of transfer.
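The arithmetic behind that comparison, as a small sketch. The helper names are made up for this example; the durations are the ones measured above.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '31m51s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def fmt(seconds: int) -> str:
    """Format seconds back into the 'XmYs' style used in this post."""
    return f"{seconds // 60}m{seconds % 60}s"


plain = to_seconds("31m51s")                              # uncompressed download
compressed = to_seconds("11m53s") + to_seconds("7m13s")   # download + decompression

print(fmt(compressed))          # 19m6s
print(fmt(plain - compressed))  # 12m45s
```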
That’s fine for our needs, but if I wanted to go to the head block, I would need to sync a few additional days that are missing from the compressed file, so in my case the break-even point is when the compressed file is two days old.
In your case, the result may vary.
But think about a situation in which your transfer speed is 5MB/s or less - you can then save 3 or 4 hours.
That, however, is a completely different story about block_log and state providers.
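To see where the "3 or 4 hours" on a slow link comes from, here is a rough estimate. It rests on several assumptions: the compressed file is taken to be about 50GB (inferred from the 110GB block_log minus the 60GB of saved transfer), 1GB is counted as 1000MB, and decompression is assumed to take the same 7m13s as on the test machine.

```python
def transfer_seconds(size_gb: float, speed_mb_s: float) -> float:
    """Time to move size_gb gigabytes at speed_mb_s MB/s (1 GB = 1000 MB)."""
    return size_gb * 1000 / speed_mb_s


BLOCK_LOG_GB = 110          # size quoted in this post
COMPRESSED_GB = 110 - 60    # assumption: inferred from the 60GB of saved transfer
DECOMPRESS_S = 7 * 60 + 13  # decompression time measured on the test machine


def saving_hours(speed_mb_s: float) -> float:
    """Hours saved by fetching the compressed file instead of the plain one."""
    plain = transfer_seconds(BLOCK_LOG_GB, speed_mb_s)
    packed = transfer_seconds(COMPRESSED_GB, speed_mb_s) + DECOMPRESS_S
    return (plain - packed) / 3600


print(round(saving_hours(5), 2))  # 3.21
print(round(saving_hours(4), 2))  # 4.05
```

On a slower CPU the decompression step takes longer, which shifts the result somewhat.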
You need to get used to handling big files, and that takes time even when you make a local copy.
(In our case, a simple local copy within the same device took 5m49s)
Replay locally
20M blocks completed in 202 minutes.
Resync vs replay of 20M blocks with state file located on disk: 539 vs 202 minutes
Important factors:
Storage latency
Low latency is the key. My rough benchmark:
steem@pressure:~$ time dd if=/dev/zero of=tst.tmp bs=4k count=10k oflag=dsync
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 5.11719 s, 8.2 MB/s
real 0m5.118s
user 0m0.012s
sys 0m0.544s
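The number that matters in that output is the time per synchronous write rather than the MB/s figure. A quick derivation from the values dd reported (nothing here is measured anew):

```python
BLOCK_SIZE = 4 * 1024   # bs=4k
WRITES = 10 * 1024      # count=10k
ELAPSED_S = 5.11719     # elapsed time reported by dd

latency_ms = ELAPSED_S / WRITES * 1000
writes_per_s = WRITES / ELAPSED_S
throughput_mb_s = BLOCK_SIZE * WRITES / ELAPSED_S / 1e6

print(f"{latency_ms:.2f} ms per synced write")  # 0.50 ms per synced write
print(f"{writes_per_s:.0f} synced writes/s")    # 2001 synced writes/s
print(f"{throughput_mb_s:.1f} MB/s")            # 8.2 MB/s
```

So each 4k write that has to reach the device costs about half a millisecond on this array.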
Size and speed of RAM
It’s not only storage that is used intensively. The size and speed of RAM also matter.
As long as your OS has plenty of RAM, it can use it effectively to optimize reads/writes.
When you are low on RAM, expect things to slow down significantly.
CPU
Due to the nature of the blockchain, steemd can’t easily make effective use of extra cores, so the faster the performance of a single core, the better.
Intel(R) Xeon(R) E3-1245 V2 @ 3.40GHz does the trick in our case.
SWAP space
If you have plenty of RAM, you don’t need it, but when you do, make sure that it’s located on a fast storage device. That will be extremely important if you decide to keep your shared memory file on a tmpfs device.
Replay using tmpfs
This is my preferred way of storing the shared memory file. It works pretty much as an oldschool ramdisk with the difference that the swap space is used as backing store in case of low memory situations.
Make sure your tmpfs device is big enough to hold your shared memory file.
I’m using /run/steem for that, so I’m resizing the underlying filesystem:
mount -o remount,size=48G /run
Also, make sure that your shared memory file is big enough to store your state.
Currently, for a low memory node running only the witness plugin, it can be something around 40GB, so setting it to 42G in config.ini will be sufficient for now.
shared-file-size = 42G
shared-file-dir = /run/steem
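Before starting a replay it may be worth checking that the target filesystem is at least as large as the configured shared file. A minimal sketch, assuming that a suffix such as G means a binary multiple (2**30 bytes); how steemd itself interprets the suffix is not verified here, and both helper names are made up for this example.

```python
import shutil

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Turn '42G' into bytes. Assumes binary multiples (1G = 2**30 bytes)."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def fits(directory: str, shared_file_size: str) -> bool:
    """True if the filesystem holding `directory` is at least that large in total."""
    return shutil.disk_usage(directory).total >= parse_size(shared_file_size)


print(parse_size("42G"))  # 45097156608
```

On the machine used here the call would be fits("/run/steem", "42G"). The check looks at total capacity, not free space, so an existing shared memory file in the same directory does not distort it.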
The result?
20M blocks completed in 145 minutes.
Compare that with 202 minutes for a replay with the shared memory file residing on disk, and 539 minutes for a resync with the state on disk.
Comparison of all methods, fastest is replay with state file on tmpfs: 145 minutes
20 M blocks via: | resync | replay |
---|---|---|
with state on disk | 539 min | 202 min |
with state on tmpfs | 415 min | 145 min |
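The same table expressed as throughput and as speedup relative to the slowest case (resync with state on disk). The timings are the ones from the table; only the derived numbers are new.

```python
BLOCKS = 20_000_000

minutes = {
    ("resync", "disk"): 539,
    ("resync", "tmpfs"): 415,
    ("replay", "disk"): 202,
    ("replay", "tmpfs"): 145,
}

baseline = minutes[("resync", "disk")]
for (method, state), mins in minutes.items():
    rate = BLOCKS / (mins * 60)  # average blocks processed per second
    print(f"{method:6} {state:5} {rate:6.0f} blocks/s  {baseline / mins:.2f}x")
```

That works out to roughly 618, 803, 1650 and 2299 blocks per second, or 1.00x, 1.30x, 2.67x and 3.72x respectively.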
Even in such a simple case as a seed/witness node, you can see a significant difference in the time needed to reach the head block and thus have a fully operational node. It gets even more tricky for more complex types of nodes running various resource-hungry plugins. For example, exchanges need to run the account_history plugin to track transactions to/from their accounts. Do they have to run Steem on a more powerful node? Is a VPS a viable way to run an exchange node? Why is it a really good idea to pay attention to track-account-range while configuring a node for an exchange?
Stay tuned for next episodes :-)
If you believe I can be of value to Steem, please vote for me (gtg) as a witness on Steemit's Witnesses List or set (gtg) as a proxy that will vote for witnesses for you.
Your vote does matter!
You can contact me directly on steem.chat, as Gandalf
Steem On