Architecture

The full Constellation Research Stack architecture is captured in the canonical Notion doc:

This page mirrors the Orion-specific section so the deployed docs site can stand on its own.

Goals

  1. Researcher-friendly entry point. orion train configs/myrun.py works end-to-end with smart defaults.

  2. Expressive, strictly typed configs with aggressive pre-flight validation (Pydantic + Tyro). Configs are versioned, validated, and serialized into the run record; pre-flight catches doomed runs before they spin up a GPU.

  3. Distributed-first. Single-node 8-GPU on Polaris and 100+ GPU multi-node in cloud are the same code path. SkyPilot picks the host; Lightning + DDP/FSDP (Monarch opt-in) handles parallelism.

  4. Resumable from any failure. Checkpoints contain everything needed to bit-exactly resume, including the complete list of data hashes consumed up to that step.

  5. Rich, queryable run artifacts. Every checkpoint is queryable: data, config, benchmark scores, lineage.

  6. Benchmarks as first-class artifacts and as the in-training eval mechanism. Same Benchmark class powers ad hoc evaluation, periodic-during-training callbacks, and final eval. Supports partial subsets for fast in-training metrics.

  7. Native multi-stage training. Pretrain → finetune → finetune is a declarative pipeline of RunConfigs with full lineage tracking.

  8. torch_brain is wrapped, not extended. We use its modules but our own subclasses live in orion.models.*, so iteration speed isn’t gated on upstream PRs.
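Goal 2's pre-flight idea can be sketched without the real Pydantic models. The `TrainingConfig` fields below echo the config example on this page, but `check_run` and `PreflightError` are hypothetical names for illustration, not Orion's actual API.

```python
# Hypothetical sketch of pre-flight validation (goal 2): fail fast on a
# doomed run before any GPU is provisioned. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    batch_size: int
    max_steps: int
    lr: float


class PreflightError(ValueError):
    """Raised when a config would produce a doomed run."""


def check_run(cfg: TrainingConfig) -> None:
    # Cheap structural checks that cost seconds, not GPU-hours.
    if cfg.batch_size <= 0:
        raise PreflightError("batch_size must be positive")
    if cfg.max_steps <= 0:
        raise PreflightError("max_steps must be positive")
    if not 0.0 < cfg.lr < 1.0:
        raise PreflightError(f"suspicious learning rate: {cfg.lr}")
```

The real validation layer would live on the Pydantic models themselves; the point is only that every check runs before any infrastructure is requested.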

Configuration

from orion.config import RunConfig, TrainingConfig, DataConfig, ModelConfig, InfraConfig, BenchCompute
from orion.models import POYO
from ursa import QuerySpec

run = RunConfig(
    name="cognitive_load_v3_baseline",
    data=DataConfig(
        ursa_query=QuerySpec(
            participants=["p042", "p043", "p044"],
            modalities=["eeg", "pupil"],
            derived=["standard_v2026q2.eeg_clean", "standard_v2026q2.pupil_dilation"],
        ),
        train_window=10.0,
        train_split=0.85,
    ),
    model=ModelConfig(cls=POYO, kwargs=dict(dim=512, depth=6, num_latents=64)),
    training=TrainingConfig(
        batch_size=32, max_steps=200_000,
        optimizer="sparse_lamb", lr=3e-4, scheduler="onecycle",
        ckpt_every_n_steps=2_000, precision="bf16-mixed",
        bench_every_n_steps=10_000, bench_suites=["cognitive_load_eval@1"],
        bench_compute=BenchCompute(mode="async", partial_subset=0.1),
    ),
    infra=InfraConfig(accelerator="gpu", nodes=1, gpus_per_node=8),
)

Rich checkpoints

A checkpoint is a directory:

ckpt-step000050000/
├── manifest.json              # version, hash, parent ckpt hash
├── config.json                # full RunConfig snapshot
├── code_version.json          # orion+virgo+ursa versions, git commits, dirty flag, module versions
├── env.lock                   # uv.lock snapshot
├── model/{weights.safetensors,ema.safetensors}
├── optimizer/state.pt
├── scheduler/state.pt
├── grad_scaler/state.pt
├── distributed/rank0.pt ... rankN.pt   # FSDP shards if applicable
├── rng/{cpu.pt,cuda_<i>.pt}
├── dataloader/cursor.pt       # Lance scan position
├── data_hashes/manifest.json  # the list of ALL recording_hashes + time_windows + derived_asset_ids
├── metrics/snapshot.json
├── benchmarks/                # results from periodic bench callbacks
└── lineage.json               # parent run, parent ckpt, training steps consumed
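A resumability check over this layout is a simple existence scan. The `REQUIRED` list mirrors the directory tree above; `missing_pieces` is an illustrative helper, not part of Orion.

```python
# Sketch of a resumability check over the checkpoint layout above.
# REQUIRED mirrors the directory tree; the helper name is illustrative.
from pathlib import Path

REQUIRED = [
    "manifest.json",
    "config.json",
    "model",
    "optimizer",
    "rng",
    "dataloader",
    "data_hashes/manifest.json",
    "lineage.json",
]


def missing_pieces(ckpt_dir: Path) -> list[str]:
    """Return the entries that must exist for a bit-exact resume."""
    return [p for p in REQUIRED if not (ckpt_dir / p).exists()]
```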

The data_hashes/manifest.json is the load-bearing piece: with it, train/test overlap detection is a set intersection on Ursa’s catalog, not a forensic exercise.
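That set-intersection claim can be made concrete. The manifest schema below (an "entries" list with "recording_hash" keys) is an assumption for illustration, not Ursa's actual format.

```python
# Sketch of train/test overlap detection from data_hashes/manifest.json.
# The manifest schema used here is an assumption, not the real Ursa format.
import json


def consumed_hashes(manifest_text: str) -> set[str]:
    """Set of recording hashes a checkpoint has trained on."""
    manifest = json.loads(manifest_text)
    return {entry["recording_hash"] for entry in manifest["entries"]}


def train_test_overlap(manifest_text: str, eval_hashes: set[str]) -> set[str]:
    """Overlap detection is a set intersection on the catalog."""
    return consumed_hashes(manifest_text) & eval_hashes
```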

Benchmarks

A benchmark is the canonical way Orion measures anything. The same Benchmark class powers ad hoc evaluation, periodic-during-training callbacks, and final eval.

@orion.benchmark(name="cognitive_load_eval", version=1)
class CognitiveLoadEvalV1:
    def held_out_data(self) -> QuerySpec: ...
    def metrics(self) -> list[Metric]: ...
    def run(self, model, data, *, partial_subset: float = 1.0) -> BenchmarkResult: ...

A BenchmarkResult is content-addressed by (suite_name, suite_version, checkpoint_hash, dataset_hash, partial_subset, partial_seed) and stored in Ursa’s benchmark_results table.
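One way to derive that content address is to hash the identity tuple; the tuple fields come from this page, but the sha256-over-canonical-JSON derivation is an assumption, not Orion's actual scheme.

```python
# Sketch of content-addressing a BenchmarkResult by its identity tuple.
# The sha256-over-canonical-JSON derivation is an assumption.
import hashlib
import json


def result_key(suite_name: str, suite_version: int, checkpoint_hash: str,
               dataset_hash: str, partial_subset: float,
               partial_seed: int) -> str:
    # Canonical serialization so identical tuples always hash identically.
    payload = json.dumps(
        [suite_name, suite_version, checkpoint_hash,
         dataset_hash, partial_subset, partial_seed],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key includes partial_subset and partial_seed, a fast in-training metric on 10% of the data never collides with a full final eval.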

BenchCompute supports three modes:

  - inline: cheapest, but blocks training.
  - async (default): snapshot the weights to a side SkyPilot job; training continues and the metric arrives later.
  - separate: a dedicated instance per bench suite.

Multi-stage training

from orion.config import RunConfig, TrainingPipeline, FinetuneFrom

pretrain = RunConfig(name="cognitive_load_v3_pretrain", ...)

finetune_p042 = RunConfig(
    name="cognitive_load_v3_finetune_p042",
    init_from=FinetuneFrom(run=pretrain, checkpoint="best", weight_overrides=["readout"]),
    ...,
)

pipeline = TrainingPipeline(
    name="cognitive_load_v3_full",
    stages=[pretrain, finetune_p042],
)

Lineage propagates automatically through the pipeline: each stage records its parent and children, exposes full_history, and carries the superset of all data_hashes consumed across stages.

Phasing

See Linear for issue-level detail.