NoETL Runtime Results — Storage & Access
This document defines how NoETL stores and retrieves task outcomes and step results in a reference-first, event-sourced system.
Aligned with current DSL:
- No `sink` tool kind: storage is just tools that write data and return references.
- No `retry`/`pagination`/`case` blocks: retries + pagination are handled by task policies (`task.spec.policy.rules`) and iteration scope (`iter.*`).
- No `eval:`/`expr:`: use `when` in policies.
- Events and context pass references only (inline only for size-capped preview/extracted fields).
See also:
1) Goals
- Event-sourced correctness: the event log is the source of truth for what happened.
- Efficiency: large payloads MUST NOT bloat the event log.
- Composable access: downstream steps MUST be able to reference:
- latest outcome
- per-attempt outcomes (retry)
- per-page outcomes (pagination streams)
- per-iteration outcomes (loop)
- combined outcomes via manifests (streamable)
- Pluggable storage: store bodies in Postgres, NATS KV, NATS Object Store, or Google Cloud Storage (GCS), and keep only references + extracted fields in events/context.
- Streaming-friendly: represent combined results as manifests, not huge merged arrays.
2) Concepts
2.1 Task outcome vs step result
- Task outcome: the final result of a single task invocation in a step pipeline (e.g., one HTTP call).
- Step result: logical outcome of a step; may be composed of many task outcomes due to:
- retries (multiple attempts)
- pagination (multiple pages)
- loops (multiple iterations)
The current DSL produces step boundaries via step.done / step.failed events.
2.2 Reference-first output rule
A task’s output body is either:
- inline (only if small and within caps), OR
- externalized to a backend and returned as a ResultRef.
The event log stores:
- the outcome envelope (`status`, `error`, `meta`)
- ResultRef + extracted fields + preview
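The reference-first rule above can be sketched as a small decision function. This is illustrative only: `inline_max_bytes` follows the DSL control introduced later, and `externalize` is a hypothetical callback standing in for whatever backend writer is configured.

```python
import json

def package_outcome(body: dict, inline_max_bytes: int, externalize) -> dict:
    """Return an event-safe outcome: inline if small, else a ResultRef."""
    raw = json.dumps(body).encode("utf-8")
    if len(raw) <= inline_max_bytes:
        # Small enough: store the body directly in the event payload.
        return {"inline": body}
    # Too large: write to a backend and keep only the reference.
    ref = externalize(raw)  # hypothetical writer; returns a ResultRef-shaped dict
    return {"result_ref": ref}

# Example: a deliberately tiny cap forces externalization.
outcome = package_outcome(
    {"items": list(range(100))},
    inline_max_bytes=1,
    externalize=lambda raw: {"kind": "result_ref", "meta": {"bytes": len(raw)}},
)
```

Either branch yields a payload safe to embed in an event: the inline body is size-capped, and the externalized body contributes only a reference.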
3) ResultRef
A ResultRef is a lightweight pointer to externally stored data.
3.1 Standard shape
{
"kind": "result_ref",
"ref": "noetl://execution/<eid>/step/<step>/task/<task>/run/<task_run_id>/attempt/<n>",
"store": "nats_kv|nats_object|gcs|postgres",
"scope": "step|execution|workflow|permanent",
"expires_at": "2026-02-01T13:00:00Z",
"meta": {
"content_type": "application/json",
"bytes": 123456,
"sha256": "...",
"compression": "gzip"
},
"extracted": {
"page": 2,
"has_more": true
},
"preview": {
"truncated": true,
"bytes": 2048,
"sample": [{"id": 1}]
}
}
3.2 Logical vs physical addressing
- `ref` is a logical NoETL URI.
- Physical location is derived from `store` + `meta` (bucket/key/object/table range), and/or a control-plane mapping.
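Because the logical URI alternates key and value segments, the correlation keys can be recovered from the `ref` alone. A minimal parser sketch, following the path layout shown in 3.1 (the function itself is illustrative, not part of NoETL's API):

```python
def parse_result_ref(ref: str) -> dict:
    """Recover correlation keys from a logical noetl:// URI."""
    if not ref.startswith("noetl://"):
        raise ValueError("not a noetl ref")
    parts = ref[len("noetl://"):].split("/")
    # Path alternates key/value segments: execution/<eid>/step/<step>/...
    return dict(zip(parts[0::2], parts[1::2]))

keys = parse_result_ref(
    "noetl://execution/e1/step/fetch/task/page/run/r7/attempt/2"
)
```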
4) Storage tiers and recommended usage
4.1 Inline (small)
Store directly in the event payload for small results.
- controlled by `inline_max_bytes`
Use for:
- small status objects
- small aggregates
- extracted fields
4.2 Postgres (queryable)
Use for:
- queryable intermediate tables
- projections / indices
- moderate-sized JSON stored in tables (when useful)
Recommended ResultRef meta:
{ "schema": "...", "table": "...", "pk": "...", "range": "id:100-150" }
4.3 NATS KV (small, fast)
Use for:
- cursors
- session-like state
- small JSON parts (within practical limits)
Recommended ResultRef meta:
{ "bucket": "...", "key": "execution/<eid>/..." }
4.4 NATS Object Store (medium artifacts)
Use for:
- paginated pages
- chunked streaming parts
- medium artifacts
Recommended ResultRef meta:
{ "bucket": "...", "key": ".../page_001.json.gz" }
4.5 Google Cloud Storage (large, durable)
Use for:
- large payloads
- durable datasets
- cross-system distribution
Recommended ResultRef meta:
{ "bucket": "...", "object": ".../payload.json.gz" }
5) DSL controls for result storage
Result storage controls live under task.spec.result.
- fetch_page:
kind: http
method: GET
url: "{{ workload.api_url }}/items"
spec:
result:
inline_max_bytes: 65536
preview_max_bytes: 2048
store:
kind: auto # auto|nats_kv|nats_object|gcs|postgres
scope: execution # step|execution|workflow|permanent
ttl: "1h"
compression: gzip
select: # extracted fields for routing/state
- path: "$.paging.hasMore"
as: has_more
- path: "$.paging.page"
as: page
5.1 Extracted fields
`select` extracts small values into `ResultRef.extracted`.
These are safe to pass in context and to use in routing decisions without resolving the full body.
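A minimal sketch of `select` extraction, matching the DSL example above. It walks dotted paths instead of full JSONPath to stay dependency-free; the real implementation would use a proper JSONPath engine:

```python
def extract_fields(body: dict, select: list) -> dict:
    """Pull small routing fields out of a result body (dotted-path sketch)."""
    extracted = {}
    for rule in select:
        node = body
        # "$.paging.hasMore" -> strip the "$." root marker, then walk keys.
        for key in rule["path"].lstrip("$.").split("."):
            node = node[key]
        extracted[rule["as"]] = node
    return extracted

body = {"paging": {"hasMore": True, "page": 2}, "data": {"items": []}}
select = [
    {"path": "$.paging.hasMore", "as": "has_more"},
    {"path": "$.paging.page", "as": "page"},
]
fields = extract_fields(body, select)
```

Only `fields` travels in context and events; the full `body` is externalized behind a ResultRef.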
6) Indexing (how to address pieces)
To retrieve “pieces” deterministically, every task completion SHOULD record correlation keys:
- `execution_id`
- `step_name`, `step_run_id`
- `task_label`, `task_run_id`
- `attempt` (retry attempt number, starting at 1)
- `iteration` / `iteration_id` (loop)
- `page` (pagination)
This enables queries like:
- final successful attempt for iteration 2 / page 3
- all pages for iteration 7
- latest output for task `transform` in step `fetch_transform_store`
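The first query above ("final successful attempt for iteration 2 / page 3") can be sketched against a result-index-like table. SQLite stands in for Postgres purely to keep the example runnable; the columns follow the correlation keys listed above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE result_index (
        execution_id TEXT, step_name TEXT, task_label TEXT,
        iteration INT, page INT, attempt INT, status TEXT, ref TEXT
    )
""")
rows = [
    ("e1", "fetch", "page", 2, 3, 1, "failed", "noetl://.../attempt/1"),
    ("e1", "fetch", "page", 2, 3, 2, "done",   "noetl://.../attempt/2"),
]
db.executemany("INSERT INTO result_index VALUES (?,?,?,?,?,?,?,?)", rows)

# Final successful attempt for iteration 2 / page 3.
ref = db.execute("""
    SELECT ref FROM result_index
    WHERE execution_id = 'e1' AND iteration = 2 AND page = 3
      AND status = 'done'
    ORDER BY attempt DESC LIMIT 1
""").fetchone()[0]
```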
7) Manifests (aggregation without bloat)
7.1 Why manifests
Do not materialize huge merged arrays in memory/events. Use manifest refs listing part refs.
{
"kind": "manifest",
"strategy": "append",
"merge_path": "$.data.items",
"parts": [
{"ref": "noetl://.../page/1/..."},
{"ref": "noetl://.../page/2/..."}
]
}
7.2 Where manifests live
A manifest itself is stored reference-first:
- inline if tiny
- else NATS Object Store / GCS / Postgres
The step boundary event stores only a reference to the manifest.
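Consuming a manifest can then be a streaming iteration over its parts, never materializing the merged array. A sketch, where `resolve` is a hypothetical callback that fetches one part body by ref:

```python
def iter_manifest_items(manifest: dict, resolve):
    """Yield items from manifest parts one page at a time (sketch)."""
    # merge_path "$.data.items" -> keys to walk inside each part body.
    keys = manifest["merge_path"].lstrip("$.").split(".")
    for part in manifest["parts"]:
        body = resolve(part["ref"])   # fetch a single part, not the whole set
        for key in keys:
            body = body[key]
        yield from body

manifest = {
    "kind": "manifest", "strategy": "append", "merge_path": "$.data.items",
    "parts": [{"ref": "p1"}, {"ref": "p2"}],
}
pages = {
    "p1": {"data": {"items": [1, 2]}},
    "p2": {"data": {"items": [3]}},
}
items = list(iter_manifest_items(manifest, pages.__getitem__))
```

Peak memory stays proportional to one part, regardless of how many pages the stream produced.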
8) How downstream steps access results (reference-only)
Standard rule: downstream steps receive ResultRef + extracted fields, not full payload bodies.
Recommended bindings for a task label `fetch_page`:
- `fetch_page.__ref__` → ResultRef
- `fetch_page.page`, `fetch_page.has_more` → extracted fields
- `fetch_page.__preview__` → optional preview
To read the full body, the playbook MUST explicitly resolve the ref:
- via server API (resolve endpoint), or
- via a tool (`kind: artifact/result`, `action: get`), or
- by scanning the artifact directly (DuckDB, etc.), depending on backend.
9) Implementation strategy (event-sourced & efficient)
9.1 Worker responsibilities
On each task completion:
- compute `outcome` (status/error/meta + raw result)
- apply `task.spec.result`:
  - create preview (optional)
  - extract selected fields (optional)
  - choose store backend (auto/explicit)
  - store full body if needed
- emit `task.attempt.*` + `task.done`/`task.failed` events containing:
  - `outcome` (size-capped)
  - ResultRef (if externalized) and extracted fields
  - correlation keys
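The emitted event body can be sketched as follows. The field layout is illustrative, composed from the pieces named above (outcome envelope, ResultRef, extracted fields, correlation keys):

```python
def build_task_done_event(status: str, result_ref: dict, extracted: dict, keys: dict) -> dict:
    """Assemble a size-capped task completion event (illustrative layout)."""
    return {
        "event": "task.done" if status == "done" else "task.failed",
        "outcome": {"status": status, "error": None,
                    "meta": result_ref.get("meta", {})},
        "result_ref": result_ref,   # body externalized; reference only
        "extracted": extracted,     # small fields safe for routing
        **keys,                     # execution_id, step_name, attempt, ...
    }

event = build_task_done_event(
    "done",
    {"kind": "result_ref", "ref": "noetl://execution/e1/...", "meta": {"bytes": 9}},
    {"has_more": False},
    {"execution_id": "e1", "step_name": "fetch", "attempt": 1},
)
```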
9.2 Server responsibilities
On event ingestion:
- persist events
- upsert projections (rebuildable):
noetl.result_index (recommended)
- execution_id, step_name, task_label
- step_run_id, task_run_id
- iteration, page, attempt
- result_ref (json)
- status, created_at
noetl.step_state (recommended)
- execution_id, step_name
- last_result_ref
- aggregate_result_ref (manifest)
- status
These projections prevent scanning the full event stream for common reads.
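Because projections are rebuildable from the log, a `step_state`-like view is just a fold over completion events. A minimal sketch (last event per step wins; schema simplified from the projection listed above):

```python
def project_step_state(events: list) -> dict:
    """Fold completion events into a per-step latest-state view (sketch)."""
    state = {}
    for ev in events:  # events in log order
        if ev["event"] in ("task.done", "task.failed"):
            key = (ev["execution_id"], ev["step_name"])
            state[key] = {
                "last_result_ref": ev.get("result_ref"),
                "status": "done" if ev["event"] == "task.done" else "failed",
            }
    return state

events = [
    {"event": "task.done", "execution_id": "e1", "step_name": "fetch",
     "result_ref": {"ref": "noetl://.../attempt/1"}},
    {"event": "task.failed", "execution_id": "e1", "step_name": "fetch",
     "result_ref": None},
]
state = project_step_state(events)
```

Common reads then hit this table directly instead of replaying the event stream.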
10) Pagination + retry + loop (Standard mental model)
A single step with loop + pagination + retry yields a lattice of pieces indexed by:
- iteration (outer loop: endpoint/city/hotel)
- page (pagination stream inside that iteration)
- attempt (retry per page)
Each externally stored page becomes a ResultRef part, and the entire stream is represented as a manifest ref.
11) Quantum orchestration note
Quantum tools often return large measurement datasets. Standard pattern:
- tool returns a ResultRef (NATS Object Store or GCS)
- extracted fields store small metadata (job id, backend, shots, cost)
- manifests represent iterative algorithms (VQE/QAOA) or shot batches
This preserves a complete experiment trace keyed by execution_id without embedding large payloads in events.