In yesterday’s article I suggested that blockchains need to be able to process transactions at a rate of 10 million per second (offline) in order to sustain 1000 transactions per second in real time while adding only 1 hour per year to the replay time. Today I would like to present a solution that achieves this level of performance for the specific case of transfers.
Let’s assume there are 3000 transfers in a block (blocks are 3 seconds apart). Processing these 3000 transfers at a rate of 10M per second means we have 300µs (0.3 milliseconds) to apply them. Using the current ChainBase architecture, performing these transfers would take 15 milliseconds (assuming 5µs per transfer). We need to accelerate this operation by a factor of 50 to achieve our goal.
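A quick back-of-the-envelope check of those numbers (a minimal C++ sketch; the 5µs per-transfer cost is the estimate above, not a measured figure):

```cpp
#include <cstdio>

int main() {
    constexpr double transfers_per_block = 3000;        // 1000 TPS * 3 s blocks
    constexpr double replay_rate         = 10'000'000;  // target transfers/sec
    constexpr double chainbase_cost_us   = 5;           // estimated cost per transfer

    constexpr double budget_us  = transfers_per_block / replay_rate * 1e6; // 300 us
    constexpr double current_us = transfers_per_block * chainbase_cost_us; // 15,000 us
    std::printf("budget: %.0f us, current: %.0f us, speedup needed: %.0fx\n",
                budget_us, current_us, current_us / budget_us);            // 50x
}
```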
A Single Thread Is Not Viable
The act of transferring from one account to another involves subtracting from one balance and adding to another. In computational terms it doesn’t get much simpler than this, and there is very little room to improve single-threaded performance.
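For concreteness, here is the whole unit of work as a minimal C++ sketch; the dense account IDs and flat balance table are illustrative choices, not Steem’s actual storage layout:

```cpp
#include <cstdint>
#include <vector>

using account_id = uint32_t;

struct transfer { account_id from, to; int64_t amount; };

// A transfer is just two balance updates.
void apply(std::vector<int64_t>& balances, const transfer& t) {
    balances[t.from] -= t.amount;  // subtract from the sender
    balances[t.to]   += t.amount;  // add to the receiver
}
```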
Order of Operations Matters
Let’s assume an initial condition where Alice has 100 SBD and Bob has 0 SBD. A single block can support Alice transferring 100 SBD to Bob and Bob transferring 100 SBD to Eve, so long as Alice’s transfer is applied first.
As things are currently defined, it is not possible to perform these two transfers in parallel because Bob’s transfer is not valid until after Alice’s transfer completes. Furthermore, if they were applied in parallel there would be a race condition on reading and writing Bob’s balance.
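Using the transfer/apply sketch above, the dependency looks like this:

```cpp
int main() {
    // Indices: 0 = Alice, 1 = Bob, 2 = Eve (balances in SBD).
    std::vector<int64_t> balances = {100, 0, 0};

    transfer a_to_b{0, 1, 100};
    transfer b_to_e{1, 2, 100};

    // Applying b_to_e first would overdraw Bob, so the dependency forces
    // sequential evaluation; a naive parallel run would also race on
    // balances[1], which both transfers read and write.
    apply(balances, a_to_b);
    apply(balances, b_to_e);
}
```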
Redefining the Semantics of a Transfer
What if we redefined the requirements for a valid block so that each account may have at most one deposit or one withdrawal? Under this more restrictive definition we can apply all 3000 transfers in parallel. Since each transfer takes only 5µs, a budget of 15µs of sequential work per account lets us relax the rule to allow each account up to 3 deposits or withdrawals per block. With 3-second blocks this means each account has a maximum transaction rate of 1 TPS, but the blockchain as a whole can process 200,000 * THREAD COUNT transfers per second. It is entirely possible to build a workstation with 44 cores, which means this blockchain should be able to process 8.8M transfers per second.
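A minimal sketch of what the validity check and a lock-free parallel replay could look like under the at-most-one-touch rule (names are illustrative; a real implementation would live inside the block validation pipeline):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_set>
#include <vector>

using account_id = uint32_t;
struct transfer { account_id from, to; int64_t amount; };  // as in the earlier sketch

// Reject any block in which an account is touched more than once.
bool block_is_parallel_safe(const std::vector<transfer>& block) {
    std::unordered_set<account_id> touched;
    for (const auto& t : block)
        if (!touched.insert(t.from).second || !touched.insert(t.to).second)
            return false;
    return true;
}

// With all accounts disjoint, threads can apply slices with no locks at all.
void apply_parallel(std::vector<int64_t>& balances,
                    const std::vector<transfer>& block, unsigned n_threads) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (block.size() + n_threads - 1) / n_threads;
    for (unsigned i = 0; i < n_threads; ++i) {
        const std::size_t begin = i * chunk;
        const std::size_t end   = std::min(block.size(), begin + chunk);
        pool.emplace_back([&balances, &block, begin, end] {
            for (std::size_t j = begin; j < end; ++j) {
                balances[block[j].from] -= block[j].amount;  // no other thread
                balances[block[j].to]   += block[j].amount;  // touches these rows
            }
        });
    }
    for (auto& t : pool) t.join();
}
```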
Achieving 100M Transfers per Second
With some small tweaks to memory layout and operation ordering it should be easy to make up the difference required to get to 10M transfers per second. With a bit more optimization designed to run the blockchain on a GPU this could scale to 100M transfers per second.
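The sketches above already assume one such tweak: a flat balance array indexed by dense account ID keeps each transfer to two predictable cache lines, and the same buffer can be copied to a GPU as-is. A hypothetical illustration:

```cpp
#include <cstdint>
#include <vector>

// Illustrative, not Steem's actual layout: balance[account_id] is contiguous,
// involves no pointer chasing, and shards trivially across threads or GPU
// work-groups by index range.
struct balance_table {
    std::vector<int64_t> balance;
};
```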
Real World Throughput
Having a CPU that can process 10M transfers per second means we have a blockchain that can sustain 1000 transactions per second with a growth rate of 100KB/sec, or about 3 TB per year, and a resync rate of about 1 hour per year of history, which requires reading the blockchain from disk at 1GB per second.
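Checking that arithmetic (assuming roughly 100 bytes per transaction, which is what a 100KB/sec growth rate at 1000 TPS implies):

```cpp
#include <cstdio>

int main() {
    constexpr double tps           = 1000;
    constexpr double bytes_per_tx  = 100;                 // implied by 100KB/sec
    constexpr double secs_per_year = 365.0 * 24 * 3600;

    constexpr double growth_Bps  = tps * bytes_per_tx;                 // 100 KB/s
    constexpr double tb_per_year = growth_Bps * secs_per_year / 1e12;  // ~3.2 TB
    constexpr double replay_hrs  = tb_per_year * 1e12 / 1e9 / 3600;    // ~0.9 h @ 1 GB/s
    std::printf("%.0f KB/s, %.1f TB/year, %.1f hours of replay per year\n",
                growth_Bps / 1e3, tb_per_year, replay_hrs);
}
```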
After 10 years of operation at 1000 transactions per second the blockchain would be roughly 30 TB, and downloading 30 TB over a gigabit connection would take days. Keeping track of all blockchain history for all time will be a very expensive undertaking at this rate.
Eliminating the Need to Replay
A large part of our scalability problem is requiring all nodes to replay all transactions to reliably derive the current state. This replay time could be eliminated entirely if the blockchain had a fixed and well-defined “current state”. If the only thing the blockchain were concerned with was balance transfers, then every smartphone owner in the world could have an account within a total database of less than 256GB.
Steem has intentionally kept the structure of the blockchain state “undefined” to give us the greatest flexibility to optimize going forward, especially as we keep adding features. A simple currency blockchain has no need to add an unbounded number of features, which means we can define an optimal data structure for its state. The hash of this data structure could be included in the blockchain from time to time, and new users could simply download this state and start applying new blocks.
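A sketch of what such a state commitment could look like; std::hash stands in here for whatever cryptographic hash the protocol would actually specify:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// New nodes download {block_num, state_hash}, fetch the matching state
// snapshot, verify the hash, and start applying new blocks from there.
struct state_commitment { uint64_t block_num; std::size_t state_hash; };

state_commitment commit_state(const std::vector<int64_t>& balances,
                              uint64_t block_num) {
    // Deterministic serialization: balances in account-ID order, fixed width.
    std::string bytes(reinterpret_cast<const char*>(balances.data()),
                      balances.size() * sizeof(int64_t));
    return {block_num, std::hash<std::string>{}(bytes)};
}
```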
Key to Growth
A blockchain that desires to scale must achieve the following:
- perform a small number of well defined tasks
- operate on a protocol-defined state
- change minimally in function over time
- ensure that all transactions in a block are independent
- minimize the number of sequential steps per block
The best way to achieve these properties while maintaining flexibility is to have a robust cross-chain communication protocol and keep all new or changing features on separate chains. You could have one chain for STEEM, one for SBD, one for STEALTH STEEM, and one for STEALTH SBD. Each of these parallel chains would have the potential to process thousands of transactions per second and have a fixed, well-defined state. This alone gives transfer throughput a 4x scalability boost.
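The cross-chain protocol itself is not specified here, but a deposit on one chain would presumably carry a proof that the matching withdrawal exists in a hash-linked block of the other chain; a purely hypothetical shape:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// All field names are illustrative.
struct cross_chain_deposit {
    std::string              source_chain;  // e.g. "STEEM" or "SBD"
    uint64_t                 source_block;  // block containing the withdrawal
    std::vector<std::string> merkle_proof;  // path proving the withdrawal's inclusion
    uint32_t                 to_account;
    int64_t                  amount;
};
```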
Leveraging Ample Real-Time Capacity
There is a clear gap between the 1000 transactions per second processed in real time and the 10M transactions per second processed during replay. During real-time evaluation of the blockchain the CPU is idle well over 99% of the time. We can use this idle time to perform extra computations that are not necessary during replay.
Among these calculations is the scheduling of operations within a block. The protocol can require that all valid blocks keep operations sorted by account. It can even organize operations by memory access pattern and calculate the best execution strategy. These calculations take extra time while building a block, but can be skipped entirely during replay. When replaying the blockchain we are only concerned with deriving the proper state, not with verifying that every step is authorized and well formed; that is guaranteed by the hash-linking of blocks.
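As a sketch, the producer-side scheduling step could be as simple as sorting by sender account so that replay walks the balance table in ascending memory order (the conflict rule itself is the block_is_parallel_safe() sketch shown earlier):

```cpp
#include <algorithm>
#include <vector>

// Uses the transfer type from the earlier sketches. Paid for once by the
// block producer; free during replay.
void schedule(std::vector<transfer>& block) {
    std::sort(block.begin(), block.end(),
              [](const transfer& a, const transfer& b) { return a.from < b.from; });
}
```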
This can be viewed like a compiler translating a program into assembly: once the assembly is generated, the CPU can process it very quickly. Another analogy is a CPU that reorders instructions to maximize performance. Block producers are responsible for building blocks that order transactions for maximum performance, and any block produced with conflicting memory accesses would be rejected.
Conclusion
A poorly defined problem can demand an entirely sequential solution; however, adding a few small constraints can make the same problem trivially parallel. Block producers can do a little extra work up front to ensure that all blocks can be replayed using parallel operation evaluation. By simplifying the problem and defining the output state format as part of the protocol, we can also eliminate the need to replay altogether.