Flow's Execution Nodes answered by Alexander Hentschel - Chief Protocol Architect

Question from @avcd

Hi, I have a question about Flow’s Execution Nodes. In the technical paper, it says that tasks can be parallelized and processed by small groups.

https://arxiv.org/pdf/1909.05821.pdf
p.13

The Execution Nodes unpack the collections and process individual transactions. Without a large number of Byzantine actors involved, this can be handled directly between the Collector and Execution Nodes without involving Consensus Nodes. Furthermore, tasks can be parallelized and processed by small groups which hold each other accountable for correct results.
Only during Byzantine attacks, malicious actions would be reported to Consensus Nodes to adjudicate slashing.

My understanding is that Flow does not do transaction execution sharding, so I would expect all small groups to do the same processing. Is this correct? Then why do we need multiple execution nodes and small groupings? Is it for redundancy so that the system does not stop?


Answer from AlexH

Hi avcd,

It was an explicit architectural decision not to depend on sharding in the Flow protocol (to achieve throughput scaling). Instead, we heavily utilize pipelining (inspired by modern CPU designs). Transactions flow through multiple processing steps (a small code sketch follows the list):

  • Batching: Collector nodes create transaction batches (aka collections)
  • Block-Formation: Consensus nodes construct blocks from the collections (blocks include only the collection IDs, not the full transactions)
  • Execution: Execution nodes compute the blocks
  • Verification: Verification nodes check the execution results for correctness and approve them
  • Sealing: Consensus nodes commit approved execution results
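
To make the pipeline concrete, here is a minimal, illustrative Go sketch. It is not Flow's actual implementation; all type and stage names are made up for this example, and verification and sealing are collapsed into one step for brevity. Each stage is a goroutine that only does its own kind of work and hands its output to the next stage over a channel:

```go
package main

import "fmt"

// Illustrative types only; the real Flow data structures are richer.
type Transaction string
type Collection struct{ Txs []Transaction }
type Block struct{ CollectionIDs []int } // blocks reference collections by ID, not full transactions
type ExecutionResult struct{ BlockID int }
type Seal struct{ BlockID int }

func main() {
	txs := make(chan Transaction, 8)
	collections := make(chan Collection, 4)
	blocks := make(chan Block, 4)
	results := make(chan ExecutionResult, 4)
	seals := make(chan Seal, 4)

	// Batching: Collector nodes group transactions into collections.
	go func() {
		batch := Collection{}
		for tx := range txs {
			batch.Txs = append(batch.Txs, tx)
			if len(batch.Txs) == 2 {
				collections <- batch
				batch = Collection{}
			}
		}
		if len(batch.Txs) > 0 {
			collections <- batch
		}
		close(collections)
	}()

	// Block formation: Consensus nodes reference collections by ID only.
	go func() {
		id := 0
		for range collections {
			blocks <- Block{CollectionIDs: []int{id}}
			id++
		}
		close(blocks)
	}()

	// Execution: Execution nodes compute the blocks.
	go func() {
		for b := range blocks {
			results <- ExecutionResult{BlockID: b.CollectionIDs[0]}
		}
		close(results)
	}()

	// Verification + sealing (collapsed): approved results are committed.
	go func() {
		for r := range results {
			seals <- Seal{BlockID: r.BlockID}
		}
		close(seals)
	}()

	for i := 0; i < 4; i++ {
		txs <- Transaction(fmt.Sprintf("tx-%d", i))
	}
	close(txs)

	for s := range seals {
		fmt.Println("sealed block", s.BlockID)
	}
}
```

The point of the sketch is that every stage specializes: no stage ever performs another stage's work, which is exactly how the different node roles divide the labor.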

As in any blockchain, the protocol is explicitly designed to tolerate malicious (aka byzantine) participants. The easiest attack an Execution Node could mount is to simply stop processing blocks (a denial-of-service attack) – which might even happen unintentionally due to technical problems.

A malicious Execution Node could also publish wrong results, which don’t get approved. Hence, we need some level of redundancy to keep the probability high that some Execution Nodes behave honestly; only then is the blockchain live (operational).
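
To see why redundancy helps with liveness, here is a toy calculation. The numbers and the independence assumption are purely illustrative (Flow's actual security model reasons about staked fractions, not independent coin flips): if each Execution Node were independently faulty with probability p, at least one honest node remains with probability 1 - p^n, which approaches 1 quickly as n grows.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative only: assume each Execution Node is independently faulty
// with probability p. The chain can stay live as long as at least one
// honest node keeps producing correct results.
func main() {
	p := 0.2 // assumed per-node failure probability (made-up number)
	for _, n := range []int{1, 2, 3, 5} {
		pAllFaulty := math.Pow(p, float64(n))
		fmt.Printf("n=%d execution nodes: P(at least one honest) = %.4f\n", n, 1-pAllFaulty)
	}
}
```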

We just discussed the example of Execution nodes. The same principle applies to all node roles: we need redundancy to guarantee liveness in the presence of byzantine nodes.

An analogy might be the design of spacecraft: critical components (hardware and software) are redundant. Redundant components of the same type execute exactly the same work, and there are measures in place to detect a failing component and safely proceed without it.

As for the statement from the paper you cited:

  • Remember that blocks only contain the collection ID, but not the full transactions. So how do the execution nodes get the transaction texts? They ask the collector nodes directly.
  • So there is a group of collector nodes (aka a collector cluster) who hold the collections, and another group, the execution nodes, who need the transactions. As long as a noteworthy fraction of the collector nodes are honest, the execution nodes should be able to get the transaction texts without any problems. Only when a dominant fraction of the collectors is malicious might an execution node run into problems. In that case, the execution node can escalate the situation to the consensus nodes (raise a slashing challenge), as sketched below.
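
Here is a hypothetical sketch of that escalation path. The interfaces, names, and error handling are invented for illustration and are not Flow's real APIs: the execution node first asks the collector cluster directly, and only falls back to raising a slashing challenge with the consensus nodes if no collector serves the data.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types; the real Flow networking and slashing machinery
// is far more involved.
type Collection struct{ Txs []string }

type CollectorNode interface {
	GetCollection(id string) (Collection, error)
}

type honestCollector struct{}

func (honestCollector) GetCollection(id string) (Collection, error) {
	return Collection{Txs: []string{"tx-a", "tx-b"}}, nil
}

type maliciousCollector struct{}

func (maliciousCollector) GetCollection(id string) (Collection, error) {
	return Collection{}, errors.New("refuses to respond")
}

// fetchCollection asks the collector cluster directly; only if every
// collector fails does the execution node escalate to consensus nodes
// by raising a slashing challenge.
func fetchCollection(id string, cluster []CollectorNode) (Collection, error) {
	for _, c := range cluster {
		if col, err := c.GetCollection(id); err == nil {
			return col, nil // happy path: consensus nodes are never involved
		}
	}
	raiseSlashingChallenge(id)
	return Collection{}, fmt.Errorf("collection %s unavailable, escalated to consensus", id)
}

func raiseSlashingChallenge(id string) {
	fmt.Println("slashing challenge raised for collection", id)
}

func main() {
	cluster := []CollectorNode{maliciousCollector{}, honestCollector{}}
	col, err := fetchCollection("coll-42", cluster)
	fmt.Println(col, err)
}
```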

From this example of collector-execution node interaction, we see that two small subsets of nodes exchange large amounts of data. This also means that consensus nodes don’t have to bother with this transaction data (during happy-path operation), which frees them up to focus their resources on other important tasks (e.g. block formation).

In the end, this is the essence of sharding vs pipelining:

On a high level, there are different types of work, and for each type there are many units of that particular type of work. For sharding, you split work units of the same type between workers. For pipelining, you partition the work units by their type and dedicate one group of workers to each type.

An analogy here could be: you have 6 workers and want them to pick apples, sort out spoiled ones, and package the good apples into bags.

We have 3 different types of work:

  • harvesting,
  • sorting,
  • packaging.

For each type, we have many apples to process. If you want to shard the work, you could tell 3 workers to process the apples from one tree and the other 3 workers to handle the second tree.

Each of the workers then does harvesting, sorting, and packaging. If you are a fan of pipelining, you would instead split your team into 3 groups A, B, and C: team A picks the apples, team B sorts them, and team C packages them.
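
A toy Go sketch of the apple analogy, purely illustrative: `sharded` splits whole trees between worker groups that each do all three kinds of work, while `pipelined` dedicates one team per kind of work and connects the teams with channels.

```go
package main

import "fmt"

// Toy model of the apple analogy: three kinds of work
// (harvest, sort, package). Names and numbers are made up.
type apple struct {
	tree    int
	spoiled bool
}

// Sharding: each worker group handles one tree end-to-end,
// doing all three kinds of work on its own share of the apples.
func sharded(trees [][]apple) int {
	bagged := 0
	for _, tree := range trees { // one worker group per tree
		for _, a := range tree { // harvest, sort, package in one place
			if !a.spoiled {
				bagged++
			}
		}
	}
	return bagged
}

// Pipelining: team A harvests, team B sorts, team C packages;
// each team only ever performs its own kind of work.
func pipelined(trees [][]apple) int {
	harvested := make(chan apple)
	sorted := make(chan apple)
	done := make(chan int)

	go func() { // team A: harvesting
		for _, tree := range trees {
			for _, a := range tree {
				harvested <- a
			}
		}
		close(harvested)
	}()
	go func() { // team B: sorting out spoiled apples
		for a := range harvested {
			if !a.spoiled {
				sorted <- a
			}
		}
		close(sorted)
	}()
	go func() { // team C: packaging
		bagged := 0
		for range sorted {
			bagged++
		}
		done <- bagged
	}()
	return <-done
}

func main() {
	trees := [][]apple{
		{{tree: 1}, {tree: 1, spoiled: true}, {tree: 1}},
		{{tree: 2}, {tree: 2}, {tree: 2, spoiled: true}},
	}
	fmt.Println("sharded:", sharded(trees), "pipelined:", pipelined(trees))
}
```

Both approaches package the same number of apples; the difference is only in how the work is divided, which mirrors the sharding-versus-pipelining distinction above.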