<aside> 🚧

This document is a Work In Progress

</aside>

The bulk of the logic in the existing ingestion engine is used to track and sync collections. Since collection tracking will mostly be handled by the processing pipeline, and we want to make these changes in a backwards-compatible way to reduce change management overhead, we should create a new version of the engine.

The ingestion engine has the following responsibilities:

  1. Load and maintain the Results Forest
  2. Track receipts received from execution nodes over push-receipts
  3. Index block/collection mapping for finalized blocks
  4. Index block/result mapping for sealed blocks
  5. Record transaction timing metrics
  6. Request and track collections from collection nodes (as a backup)
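The six responsibilities above could be mirrored in the new engine's surface area. The sketch below is purely illustrative: `Engine`, `Identifier`, and all method names are hypothetical stand-ins, not actual flow-go types.

```go
package main

import "fmt"

// Identifier stands in for a 32-byte Flow identifier (illustrative only).
type Identifier [32]byte

// Engine mirrors the six responsibilities listed above. All names here are
// hypothetical, not the actual flow-go API.
type Engine interface {
	// 1. Load and maintain the Results Forest.
	LoadForest(latestPersistedSealedHeight uint64) error
	// 2. Track receipts received from execution nodes over push-receipts.
	OnExecutionReceipt(originID Identifier, receipt any)
	// 3. Index block/collection mapping for finalized blocks.
	OnBlockFinalized(blockID Identifier)
	// 4. Index block/result mapping for sealed blocks.
	OnBlockSealed(blockID, resultID Identifier)
	// 5. Record transaction timing metrics.
	RecordTransactionTiming(txID Identifier)
	// 6. Request collections from collection nodes (as a backup).
	RequestCollection(collectionID Identifier) error
}

// noopEngine is a stand-in implementation showing the interface compiles.
type noopEngine struct{}

func (noopEngine) LoadForest(h uint64) error              { return nil }
func (noopEngine) OnExecutionReceipt(o Identifier, r any) {}
func (noopEngine) OnBlockFinalized(b Identifier)          {}
func (noopEngine) OnBlockSealed(b, r Identifier)          {}
func (noopEngine) RecordTransactionTiming(t Identifier)   {}
func (noopEngine) RequestCollection(c Identifier) error   { return nil }

func main() {
	var e Engine = noopEngine{}
	fmt.Println(e.LoadForest(0))
}
```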

Load and maintain the Results Forest

On startup, the ingestion engine loads the LatestPersistedSealedResultHeight, and initializes the forest using the result for this block as the base. It then loads ExecutionResults and ExecutionReceipts for all descendant blocks into the forest. Depending on the state of the node, there may be a small or large number of sealed results, as well as a set of unsealed results.
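The startup sequence could look like the following minimal sketch, assuming a hypothetical in-memory `Forest` type; the real Results Forest and the API around `LatestPersistedSealedResultHeight` will differ.

```go
package main

import "fmt"

// Result is a simplified stand-in for an ExecutionResult (illustrative only).
type Result struct {
	Height uint64
}

// Forest is a hypothetical in-memory stand-in for the Results Forest,
// keyed by block height.
type Forest struct {
	baseHeight uint64
	results    map[uint64][]Result
}

func NewForest(base uint64) *Forest {
	return &Forest{baseHeight: base, results: make(map[uint64][]Result)}
}

// Add inserts a result for a block above the forest's base height.
func (f *Forest) Add(r Result) error {
	if r.Height <= f.baseHeight {
		return fmt.Errorf("result height %d is at or below base height %d", r.Height, f.baseHeight)
	}
	f.results[r.Height] = append(f.results[r.Height], r)
	return nil
}

// initializeForest uses the latest persisted sealed result height as the
// forest's base, then adds results for all descendant blocks.
func initializeForest(latestPersistedSealedResultHeight uint64, descendants []Result) (*Forest, error) {
	forest := NewForest(latestPersistedSealedResultHeight)
	for _, r := range descendants {
		if err := forest.Add(r); err != nil {
			return nil, err
		}
	}
	return forest, nil
}

func main() {
	forest, err := initializeForest(100, []Result{{Height: 101}, {Height: 102}})
	fmt.Println(err, len(forest.results))
}
```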

We can generalize this to 3 cases:

  1. The node is starting after a restart: the node is likely only a few minutes behind the network, so there should be a small number of sealed and unsealed results (hundreds).
  2. The node is starting after an extended downtime: there will only be a few locally sealed results on startup. However, the node’s follower engine will quickly seal blocks as it catches up with the network, and sealing will outpace ingestion. Depending on how long the node was offline, this could result in millions of results getting added to the forest.
  3. The node has been online for a while, and just enabled indexing: there will be a large number of sealed results to add, which could result in millions of results getting added to the forest.

Case 1 is the happy path, and we should load all results into the forest and immediately start ingesting results from the network.

For cases 2 and 3, the loader will quickly fill the forest to its max size, and will need to wait for pipelines to complete or be abandoned before adding more. We should not start ingesting results from the network, since these would compete with the finalized data for space.

In case 2, the forest may not fill until after all of the initially persisted blocks are loaded; that is, it may fill with real-time network data instead. The ingestion engine would then need to fall back to loading results from disk.

In case 3, the loader could stay running until all data is loaded, then switch to a real-time mode.
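The backpressure described for cases 2 and 3 could be sketched with a counting semaphore: the loader blocks once the forest reaches its max size, and resumes when a pipeline completes or is abandoned and frees a slot. A buffered channel serves as the semaphore here; all names are hypothetical and the real implementation would differ.

```go
package main

import "fmt"

// Loader models the bounded-forest backpressure: one token per occupied
// forest slot, capped at the forest's max size (illustrative only).
type Loader struct {
	slots chan struct{}
}

func NewLoader(maxForestSize int) *Loader {
	return &Loader{slots: make(chan struct{}, maxForestSize)}
}

// AddResult blocks when the forest is full, resuming once a slot is freed.
func (l *Loader) AddResult() {
	l.slots <- struct{}{}
}

// ReleaseResult frees a slot when a pipeline completes or is abandoned.
func (l *Loader) ReleaseResult() {
	<-l.slots
}

func main() {
	l := NewLoader(2)
	l.AddResult()
	l.AddResult() // forest is now at max size

	done := make(chan struct{})
	go func() {
		l.AddResult() // blocks until a slot frees
		close(done)
	}()

	l.ReleaseResult() // a pipeline completes, unblocking the loader
	<-done
	fmt.Println("loader resumed after a slot was freed")
}
```

The same structure covers the mode switch in case 3: once the loader drains the backlog from disk, it can stop calling the disk-backed source and begin admitting network results through the same semaphore.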