Abstract

From a client and developer perspective, using a blockchain can be broken into three high-level steps: (1) computation tasks are submitted to the chain in the form of transactions, (2) the blockchain processes the transactions, and (3) the client inspects the resulting state.

For a highly scalable platform such as Flow, which we envision processing tens of thousands to millions of transactions per second, the events emitted during a single block’s computation and the resulting state changes will be beyond the capacity of any single server to process. The Flow network itself shoulders this massive computation and data load via its horizontally scalable, pipelined architecture. However, the approach commonly adopted in the blockchain space, namely running a single computer that replicates the entire state and computation, becomes intractable at Flow’s scale.

In the following, we describe Flow’s long-term vision for how clients and developers of the Flow network can reliably and trustlessly query and replicate the subset of the global state that is relevant to them. We emphasize that the design is Byzantine fault tolerant [BFT] and that clients can, if they so desire, receive correctness proofs for all data they are interested in.

Furthermore, Flow’s design enables smartphone-sized light clients that can operate entirely trustlessly without needing to download or store chain history.

Central concepts and terminology:

Data Egress Node or Edge Node (a collective descriptor for archive, observer, and access nodes). Edge nodes follow the chain (block headers) and maintain a (small) subset of execution state and event data.

Usage pattern that an Edge Node is designed for:
- online most of the time and continuously following new blocks and ingesting the relevant state and event data
- receive data once, verify correctness, store and index it locally
- many operations are executed on the same stored data
- if an edge node is offline for a prolonged period, it needs to catch up, which can take some time; during the catch-up period the edge node does not offer its normal services (system resources are largely required for catching up, and the newest data is not yet available)

Light clients are largely stateless. They do not have to maintain any execution state or event data, though they might store a negligible amount of data locally, such as some block headers and a few accounts’ states. Data that the light client serves to its user is fetched on demand (or streamed on demand if desired by the client).

Usage pattern that a Light client is designed for:
- very sparse data access, most data is only needed once and never again
- can operate on minimal hardware, e.g. browser plugins, embedded hardware controllers (such as ESP32), or smartphones
- prolonged periods of being offline do not substantially impact the light client

Trusted vs. trustless

Shipping correct data to the user / developer is paramount. While the Flow platform cannot universally guarantee data availability, we must unconditionally guarantee data correctness. There are two ways to guarantee data correctness:

(i) trusted: the user / developer trusts that the data source delivers correct data

(ii) trustless: the data source might deliver incorrect data. The user’s / developer’s software (edge node or light client) must recognize the data as incorrect during the ingestion process and drop it.
- Supplemental details on trustless data egress
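
The trustless ingestion step described above (recognize incorrect data and drop it) can be sketched as follows. This text does not specify the proof format, so the sketch assumes a plain binary Merkle inclusion proof checked against a root hash that the client already trusts, for example one taken from a verified block header. All names (`build_tree`, `prove`, `verify`, `ingest`) and the example payloads are hypothetical and only serve as illustration; the leaf and node prefixes are a common domain-separation convention, not a statement about Flow's actual encoding.

```python
import hashlib
from typing import Dict, List, Tuple

# One proof step: (sibling hash, True if the sibling is the LEFT child).
ProofStep = Tuple[bytes, bool]


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(payload: bytes) -> bytes:
    # Prefix 0x00 separates leaf hashes from interior-node hashes.
    return _h(b"\x00" + payload)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_tree(payloads: List[bytes]) -> List[List[bytes]]:
    """Data-source side: return all tree levels, leaves first, root last."""
    level = [leaf_hash(p) for p in payloads]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Simplification for this sketch: pad odd levels by
            # duplicating the last hash.
            level = level + [level[-1]]
            levels[-1] = level
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[ProofStep]:
    """Data-source side: collect the sibling hashes from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(trusted_root: bytes, payload: bytes, proof: List[ProofStep]) -> bool:
    """Client side: recompute the root from the payload and the proof."""
    acc = leaf_hash(payload)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == trusted_root


def ingest(store: Dict[str, bytes], trusted_root: bytes, key: str,
           payload: bytes, proof: List[ProofStep]) -> bool:
    """Store the payload only if its proof checks out; otherwise drop it."""
    if not verify(trusted_root, payload, proof):
        return False
    store[key] = payload
    return True
```

In the trusted mode (i), a client would skip `verify` and store whatever the data source delivers; in the trustless mode (ii), a payload that does not hash up to the trusted root is never written to the local store.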

Long-term goals

Flow is architected to scale beyond 1 million transactions per second. You can read the details in this paper.

A short- to mid-term goal is to support:
- the state exceeding 1 petabyte in size (snapshot at one specific height)
- the network producing gigabytes of event data every few seconds

Trustless operations, including trustless data egress, are a central value proposition of Flow. In all likelihood, some data egress functionality will be technically intractable to implement in a trustless manner, too large an engineering lift relative to its utility, or not economically viable for clients to use because of the disproportionate cost of the proofs.

Nevertheless, we strive to keep the set of data egress functions that are only available via trusted sources as small as possible. The important data egress functionality must be available in a trustless manner. If core data egress functions were available only through trusted or centralized entities, we would substantially weaken the value of Flow as a decentralized, trustless platform.

We also want to support trusted data egress, because omitting proofs substantially reduces hardware requirements and energy consumption. Trust relationships based on legal or economic incentives are very common, and we should allow clients to utilize them to improve efficiency and decrease cost.

We want to enable developers to run their own edge nodes at low cost, allowing them to locally access the subset of state and event data that is relevant to them. Edge nodes must eventually be able to obtain their input data in a trustless manner for all prevalent use cases.
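
As a rough illustration of the edge node usage pattern (ingest only the relevant subset, verify on receipt, then store and index locally so that many queries run against the same stored data), consider the minimal sketch below. It is not an actual Flow API: the `Event` and `EdgeNode` types, the `verifier` hook, and the event type names in the usage note are all assumptions made for this example. The verifier is pluggable so that the same node can run in trusted mode (accept everything) or in trustless mode (check a proof and drop data that fails).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass(frozen=True)
class Event:
    height: int       # block height at which the event was emitted
    event_type: str   # hypothetical type name, e.g. "Deposit"
    payload: bytes


# A verifier returns True if the event is accepted as correct.
Verifier = Callable[[Event], bool]


def trust_source(_event: Event) -> bool:
    """Trusted mode: accept whatever the data source delivers."""
    return True


@dataclass
class EdgeNode:
    subscribed_types: Set[str]
    verifier: Verifier = trust_source
    by_type: Dict[str, List[Event]] = field(
        default_factory=lambda: defaultdict(list))
    dropped: int = 0

    def ingest_block(self, events: Iterable[Event]) -> None:
        """Receive data once: filter, verify, then store and index."""
        for ev in events:
            if ev.event_type not in self.subscribed_types:
                continue  # not part of the subset this node cares about
            if not self.verifier(ev):
                self.dropped += 1  # incorrect data is dropped, never stored
                continue
            self.by_type[ev.event_type].append(ev)

    def query(self, event_type: str) -> List[Event]:
        """Serve repeated queries from the local index."""
        return list(self.by_type.get(event_type, []))
```

A developer interested only in `"Deposit"` events might construct `EdgeNode({"Deposit"}, verifier=check_proof)`, where `check_proof` is a hypothetical function that validates each event against the followed chain of block headers, and then call `ingest_block` once per new block.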