Protocol Version Upgrade Mechanisms Discussion

Context

The Core Protocol BFT team is working on a mechanism to upgrade the Protocol State without requiring a spork. The goal of this post is to briefly describe Flow’s existing version upgrade systems (spork, height-coordinated upgrade) and start a discussion about:

  1. How the protocol state upgrade mechanism should function.
  2. Whether (and if so, how) we can consolidate existing mechanisms (mainly the NodeVersionBeacon smart contract and the StopControl tooling in the Execution Node).

Terminology

Version Directive - A tuple containing (1) a version identifier and (2) either activation view or activation height, originating from a trusted source.

Software Version - The version identifier of a binary distribution of Flow Node software. By convention, it is specified as a semver-ish tag in Git and Docker releases.

Component Version - The version identifier of a specific component within Flow. This document will propose that we exclusively use integer component version identifiers.

Protocol State Machine - The state machine operating on the Protocol State (epochs, identity table, protocol parameters, etc.). LastProtocolState + Block -> NextProtocolState.

Execution State Machine - The state machine operating on the Execution State (smart contracts, user accounts, tokens, NFTs, etc.). LastExecutionState + Transaction -> NextExecutionState.

Safety Threshold - A buffer of block views or heights that must exist between the point at which a Version Directive is processed by the Protocol and the point at which it comes into effect. The threshold is chosen so that it is overwhelmingly likely that a block is finalized within any view/height range of that size. (This is referred to as versionBoundaryFreezePeriod in EN HCU terminology; it is referred to as EpochCommitSafetyThreshold in the protocol parameters for a similar purpose.)
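
To make the terms above concrete, here is a minimal Go sketch of a Version Directive and the Safety Threshold check. All names (`ActivationKind`, `VersionDirective`, `SatisfiesSafetyThreshold`) are hypothetical and chosen for illustration; they are not taken from the flow-go codebase.

```go
package main

import "fmt"

// ActivationKind says whether a directive activates at a view or at a height.
// (Hypothetical type, for illustration only.)
type ActivationKind int

const (
	ByView ActivationKind = iota
	ByHeight
)

// VersionDirective is the tuple from the terminology above:
// a version identifier plus an activation view or activation height.
type VersionDirective struct {
	Version    uint64         // integer Component Version
	Kind       ActivationKind // how to interpret Activation
	Activation uint64         // activation view or activation height
}

// SatisfiesSafetyThreshold reports whether a directive processed at
// `current` (a view or height, matching d.Kind) leaves at least
// `threshold` views/heights before it takes effect.
func SatisfiesSafetyThreshold(d VersionDirective, current, threshold uint64) bool {
	// Compare via subtraction to avoid overflow in current+threshold.
	return d.Activation >= current && d.Activation-current >= threshold
}

func main() {
	d := VersionDirective{Version: 2, Kind: ByView, Activation: 1500}
	fmt.Println(SatisfiesSafetyThreshold(d, 1000, 500)) // true: exactly the threshold
	fmt.Println(SatisfiesSafetyThreshold(d, 1001, 500)) // false: only 499 views of buffer
}
```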

Existing Software Upgrade Mechanisms

Height-Coordinated Upgrade (HCU)

In a Height-Coordinated Upgrade, the Governance committee publishes a Version Directive (tuple of block height and semver software version) by submitting an admin transaction to the NodeVersionBeacon smart contract. This causes a VersionBeacon service event to be emitted. The StopControl component in Execution Nodes ingests this service event and arranges to stop at the block height when the new software version takes effect.

  • Operators manually configure their Execution Node with a new software version
  • Verification Nodes don’t currently automatically stop; rather, we disable --require-result-approvals then wait for them to be manually updated before re-enabling the flag.
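
The stop decision described above can be sketched roughly as follows. This is a simplified illustration, not the actual StopControl implementation: `VersionBoundary`, `StopHeight`, and the rule that a node stops at the first boundary whose version is newer than its own are assumptions made for the example. It also hints at the parsing and comparison work that semver brings along.

```go
package main

import (
	"fmt"
	"sort"
)

// VersionBoundary mirrors an HCU Version Directive: from Height onward,
// nodes are expected to run software version Version ("major.minor.patch").
type VersionBoundary struct {
	Height  uint64
	Version string
}

// parse splits "major.minor.patch" into three integers. Pre-release tags
// and build metadata are deliberately ignored in this sketch.
func parse(v string) ([3]int, error) {
	var p [3]int
	_, err := fmt.Sscanf(v, "%d.%d.%d", &p[0], &p[1], &p[2])
	return p, err
}

// newer reports whether version a is strictly newer than version b.
func newer(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	return false
}

// StopHeight returns the lowest boundary height whose version is newer than
// the running node's version, i.e. the height at which the node should stop.
// The bool result is false if no boundary requires a stop.
func StopHeight(nodeVersion string, boundaries []VersionBoundary) (uint64, bool, error) {
	node, err := parse(nodeVersion)
	if err != nil {
		return 0, false, err
	}
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]VersionBoundary(nil), boundaries...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Height < sorted[j].Height })
	for _, b := range sorted {
		v, err := parse(b.Version)
		if err != nil {
			return 0, false, err
		}
		if newer(v, node) {
			return b.Height, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	boundaries := []VersionBoundary{
		{Height: 0, Version: "0.33.0"},
		{Height: 70000, Version: "0.34.0"},
	}
	h, ok, err := StopHeight("0.33.2", boundaries)
	fmt.Println(h, ok, err) // 70000 true <nil>
}
```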

Challenges

  • Conceptually, an HCU implements a change in Component Version (a breaking change to the Execution State Machine), but it specifies a Software Version.
    • As a side effect, versions are specified as semver, however:
      • Semver is useful to differentiate between different kinds of breaking and non-breaking changes, but here we only care about breaking changes.
      • Semver is relatively complex (compared to, e.g., an integer), and this increases complexity in version upgrade-related components (including the NodeVersionBeacon smart contract).
  • The NodeVersionBeacon smart contract is used to direct changes to the Execution State Machine, but the use of a semver software version and the documentation of the contract and surrounding components imply that it refers to a global protocol version.
  • The Execution Node processes the VersionBeacon service event only when it processes the sealing block. If an EN has not processed the last VersionBeacon-containing block (for example, due to bootstrapping timing), then it could miss a version upgrade and produce incorrect Execution Receipts. (Version Directives are not reliably persisted.)

Spork

In a spork, the entire network is halted for an extended period. Both the execution state and protocol state databases are re-instantiated from a snapshot. Essentially any backward-incompatible change is possible during a spork. It is included here for completeness.

Dynamic Protocol State Upgrade

The Dynamic Protocol State adds the ability for:

  • Protocol State commitments to be included in every block
  • Flexible changes to the Protocol State on a block-by-block basis

The Protocol State Key Value Store is a versioned data structure representing the Protocol State (basically, the thing that is committed to in every block).

High-Level Design

  • Protocol State Component Version is defined as an integer, incremented for breaking changes
  • A particular Software Version may support one or more Component Versions (see #5371)
  • A service event communicates Version Directives to the Protocol State (see #5428 and #411 for WIP implementations).
    • New versions become active at an “activation view” rather than height.
    • Service events must be processed at least SafetyThreshold many views in advance of the activation view, as is required for Execution State version upgrades.
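
As a rough illustration of the design points above, the following Go sketch shows an integer Component Version switching at an activation view, and a binary that supports several Component Versions. The names `Upgrade`, `VersionAtView`, `supported`, and `CanProcess` are hypothetical and not taken from the linked PRs.

```go
package main

import "fmt"

// Upgrade is a pending Protocol State version upgrade (hypothetical type).
type Upgrade struct {
	NewVersion     uint64 // integer Component Version
	ActivationView uint64 // first view at which NewVersion applies
}

// VersionAtView returns the Protocol State version in force at `view`,
// given the current version and an optional pending upgrade.
func VersionAtView(current uint64, pending *Upgrade, view uint64) uint64 {
	if pending != nil && view >= pending.ActivationView {
		return pending.NewVersion
	}
	return current
}

// supported lists the Component Versions this binary can execute.
// One Software Version may support several (assumed values).
var supported = map[uint64]bool{1: true, 2: true}

// CanProcess reports whether this binary supports the version active at `view`.
func CanProcess(current uint64, pending *Upgrade, view uint64) bool {
	return supported[VersionAtView(current, pending, view)]
}

func main() {
	up := &Upgrade{NewVersion: 2, ActivationView: 1500}
	fmt.Println(VersionAtView(1, up, 1499)) // 1
	fmt.Println(VersionAtView(1, up, 1500)) // 2

	next := &Upgrade{NewVersion: 3, ActivationView: 9000}
	fmt.Println(CanProcess(2, next, 9000)) // false: version 3 is not in `supported`
}
```

A binary in this sketch can run across the version 1 to version 2 boundary without a restart, but would have to be replaced before view 9000.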

In general:

  • Component Versions are specified as integers and only increment for breaking changes. There may be several, each specific to one component; there isn’t one global version for the whole network. Component Versions are part of a component’s implementation and defined at compile time (not tagged after the fact).
  • Version Directives are communicated using service events, and pending directives are persisted in the Protocol State, so all nodes have access to them regardless of bootstrapping timing.
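
The persistence point can be sketched as follows. `KVStore` here is a toy stand-in for the Protocol State Key Value Store, and `PendingUpgrade` and `Evolve` are hypothetical names; the Safety Threshold check is omitted for brevity.

```go
package main

import "fmt"

// PendingUpgrade is a Version Directive that has been accepted
// but is not yet active.
type PendingUpgrade struct {
	NewVersion     uint64
	ActivationView uint64
}

// KVStore is a toy stand-in for the Protocol State Key Value Store.
// Because the pending directive lives in the state itself, any node
// holding the state for a block knows about the upcoming upgrade.
type KVStore struct {
	Version uint64
	Pending *PendingUpgrade
}

// Evolve computes the state for a child block at `view`. A directive
// delivered with that block is recorded; a pending upgrade whose
// activation view has been reached is applied.
func (s KVStore) Evolve(view uint64, directive *PendingUpgrade) KVStore {
	next := s // value copy; the Pending pointer is shared but never mutated
	if directive != nil {
		next.Pending = directive
	}
	if next.Pending != nil && view >= next.Pending.ActivationView {
		next.Version = next.Pending.NewVersion
		next.Pending = nil
	}
	return next
}

func main() {
	s := KVStore{Version: 1}
	s = s.Evolve(1000, &PendingUpgrade{NewVersion: 2, ActivationView: 1500})

	// A node bootstrapping from the state at view 1200 never saw the
	// service event, but still finds the pending upgrade in its snapshot.
	snapshot := s.Evolve(1200, nil)
	fmt.Println(snapshot.Version, snapshot.Pending != nil) // 1 true
	fmt.Println(snapshot.Evolve(1500, nil).Version)        // 2
}
```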

Request for Feedback/Discussion

Thank you for taking the time to read! Here are some questions I’m hoping to answer:

  • Are there any reasons to continue using semver versions (or software binary versions) in the version upgrade mechanisms?
  • How can we best consolidate existing HCU mechanisms with the Protocol State-driven upgrade mechanism? Or at least have them co-exist peacefully?
  • Is it indeed preferable to version components individually (as we do in practice with current HCUs, and plan to with Protocol State version upgrade), rather than using one global version (as implied by existing NodeVersionBeacon documentation)?
  • Is it desirable to include a bypass for version requirements as an “escape hatch” flag (see PR comment)?
  • Any disagreements with (or inaccuracies in) the post above.

The Core Protocol sub-working-group discussed this topic with key engineers and stakeholders on April 5, 2024 [summary, notes, video, transcript]


In the dedicated sub-protocol working group meeting from April 5th, we concluded that it would be useful to sketch out a possible update process for two exemplary scenarios.

I have started the following Notion doc, Dynamic-Protocol-State-Mediated Upgrades (work in progress), to describe possible upgrade processes for:

  • [Scenario 1] Change in block Header
  • [Scenario 2] HCU for ENs and VNs (and ANs) for coordination around Cadence version