Orion
Planned: not yet implemented. Orion is design-only today (src/orion/ is a skeleton). Every behavior on this page is the intended design from the architecture sketch (orion-architecture.md, 2026-05-22) and the Research Stack Architecture (Notion) §5; nothing here can be run yet. Implementation is tracked in the Linear Orion project.
Orion is the research / training / benchmarking layer of Constellation’s research stack — the package researchers will actively wield. It sits on top of Ursa (the data catalog) and Virgo (preprocessing), and wraps torch_brain model classes under orion.models.* so iteration speed isn’t gated on upstream PRs.
The design deliberately collapses the surface to two top-level types and one entry point (design-doc §0):
- RunConfig: a Pydantic-validated, Tyro-CLI-overridable description of one training run. Its data field is a VirgoQuery; there is no separate data-config layer.
- VirgoQuery: the data spec, owned by Virgo. Researchers build it with pi.query(...), compose multiple sources with VirgoQuery + VirgoQuery, and Orion materializes it lazily via pi.stream(query).
- orion.train(): the single entry point, polymorphic over RunConfig | list[RunConfig]. A list covers sweeps and multi-stage grouping; there is no TrainingPipeline class and no pipeline CLI verb.
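Because nothing is implemented yet, the following is only a sketch of the shape described above, not Orion's API. It uses stdlib dataclasses as a stand-in for the planned Pydantic models, and every name in it (QuerySketch, RunConfigSketch, train_sketch, the name and max_steps fields, the source strings) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QuerySketch:
    """Hypothetical stand-in for VirgoQuery: an immutable tuple of sources."""
    sources: tuple[str, ...]

    def __add__(self, other: QuerySketch) -> QuerySketch:
        # Composing two queries concatenates their sources.
        return QuerySketch(self.sources + other.sources)


@dataclass(frozen=True)
class RunConfigSketch:
    """Hypothetical stand-in for RunConfig; the data spec is the query itself."""
    name: str
    data: QuerySketch
    max_steps: int = 100


def train_sketch(config: RunConfigSketch | list[RunConfigSketch]) -> list[str]:
    """One entry point that accepts a single config or a list of them."""
    configs = config if isinstance(config, list) else [config]
    # A real trainer would launch each run; here we only report what would run.
    return [
        f"{c.name}: {len(c.data.sources)} source(s), {c.max_steps} steps"
        for c in configs
    ]


# One run over two composed sources.
base = RunConfigSketch("base", QuerySketch(("session_a",)) + QuerySketch(("session_b",)))
# A sweep is just a list of configs derived from the base one.
sweep = [replace(base, name=f"steps_{n}", max_steps=n) for n in (50, 200)]

single_report = train_sketch(base)
sweep_report = train_sketch(sweep)
```

The point of the sketch is the design choice rather than the details: a sweep needs no extra class, only a list passed to the same function.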
What Orion will provide
- Pure-PyTorch Trainer (no Lightning): wraps the model in FSDP2, writes sharded checkpoints via torch.distributed.checkpoint, dispatches a small Callback protocol, and runs under torchrun with elastic restart (same WORLD_SIZE). Parallelism is pluggable behind a ParallelismStrategy protocol.
- VirgoQuery-driven dataloader: resolves a VirgoQuery to dict[recording_id, Data] through Virgo’s ProcessingInterface, then model.tokenize(data) inside the dataset converts each temporaldata.Data into the flat tensor dict the model consumes. torch_brain samplers are re-exported; an Augmentation chain runs on Data before tokenize.
- Typed loss list: TrainingConfig.loss is a list of BrainFrame-style Loss objects (each a Pydantic model + nn.Module), composed at build time rather than hard-coded in the model.
- Rich self-describing checkpoints: bit-exact resume state (model, optimizer, scheduler, RNG, dataloader cursor) plus a data_hashes/manifest.json that answers “what data was this trained on?” and is registered back into Ursa.
- Benchmarks as first-class artifacts: a @orion.benchmark decorator produces content-addressed BenchmarkResult objects stored in Ursa; configurable compute boundaries (inline/async/separate) and partial subsets enable fast in-training metrics.
- One config, two launch targets: a discriminated InfraConfig (LocalInfra/SlurmInfra/SkyPilotInfra) launches the same run on Polaris via Slurm or bursts it to the cloud via SkyPilot.
- Self-hosted ClearML run registry: surfaced through orion.registry.* and orion-mcp. At MVP a stub backend writes JSON locally; richer search/diff/lineage light up later.
- Implicit multi-stage & lineage: init_from=FinetuneFrom(run="upstream_name") plus a RunConfig.pipeline lineage key chains pretrain → finetune runs with full lineage propagation; no orchestration class required.
- Aggressive pre-flight validation: pi.dry_run(query) and config checks run before any GPU is touched, so doomed runs never reach a node.
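Two items in the list above lean on hashing: the data-hash manifest and content-addressed benchmark results. The sketch below illustrates the general technique with the standard library only. It is an assumption about how such keys could be derived, not the planned Orion or Ursa format; the function names, the recording ids, the benchmark name and the metric value are all made up for illustration.

```python
from __future__ import annotations

import hashlib
import json


def content_address(payload: dict) -> str:
    """Return the SHA-256 hex digest of the payload's canonical JSON form."""
    # sort_keys plus fixed separators make the encoding independent of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(recordings: dict[str, bytes]) -> dict[str, str]:
    """Map each recording id to the hash of its bytes (hypothetical layout)."""
    return {rid: hashlib.sha256(blob).hexdigest() for rid, blob in sorted(recordings.items())}


# Illustrative inputs only; these are not real recordings or results.
manifest = build_manifest({"rec_002": b"spikes-b", "rec_001": b"spikes-a"})

result_a = {"benchmark": "decode_velocity", "data": manifest, "metrics": {"r2": 0.5}}
# Same content, different key order: it should map to the same address.
result_b = {"metrics": {"r2": 0.5}, "data": manifest, "benchmark": "decode_velocity"}
# Different content: it should map to a different address.
result_c = {"benchmark": "decode_velocity", "data": manifest, "metrics": {"r2": 0.6}}

key_a = content_address(result_a)
key_b = content_address(result_b)
key_c = content_address(result_c)
```

Under this scheme an identical result stored twice lands on the same key, while any change to the data hashes or metrics yields a new one.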
How a run flows
flowchart LR
RC["RunConfig<br/>(data = VirgoQuery)"] --> PF["preflight<br/>pi.dry_run(query)"]
PF --> L["launcher<br/>(Local / Slurm / SkyPilot)"]
L --> TR["torchrun"]
TR --> FIT["Trainer.fit()"]
DL["make_dataloader<br/>pi.stream(query)"] --> FIT
FIT --> OUT["checkpoints<br/>+ benchmark results"]
OUT --> URSA["Ursa<br/>(register_checkpoint /<br/>register_benchmark_result)"]
URSA -. "raw + processed Data" .-> DL
The flow within the package is one-way: a RunConfig carrying a VirgoQuery is validated by preflight, handed to the launcher, and run under torchrun, where Trainer.fit() consumes batches from a dataloader backed by pi.stream(query). Checkpoints and benchmark results flow back into Ursa’s catalog, which is also where the data came from.
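The ordering in that flow, with validation strictly before launch, can be illustrated with a small stand-alone sketch. None of these names exist in Orion: RunSketch, preflight, launch and PreflightError are hypothetical, and the two checks are placeholders for whatever pi.dry_run(query) and the config checks will actually verify.

```python
from dataclasses import dataclass


class PreflightError(ValueError):
    """Raised when a run is rejected before any launch happens."""


@dataclass
class RunSketch:
    """Hypothetical, minimal description of a run."""
    name: str
    sources: list
    max_steps: int


def preflight(run: RunSketch) -> None:
    # Collect every problem so the user sees them all at once.
    problems = []
    if not run.sources:
        problems.append("query resolves to no sources")
    if run.max_steps <= 0:
        problems.append("max_steps must be positive")
    if problems:
        raise PreflightError(f"{run.name}: " + "; ".join(problems))


def launch(run: RunSketch, log: list) -> None:
    # Validation comes first; a rejected run never reaches the launch step.
    preflight(run)
    # Stands in for launcher -> torchrun -> Trainer.fit().
    log.append(f"launched {run.name}")


log = []
launch(RunSketch("ok_run", ["session_a"], 100), log)

rejection = None
try:
    launch(RunSketch("doomed_run", [], 0), log)
except PreflightError as err:
    rejection = str(err)
```

After running this, log contains only the valid run, and rejection holds both reasons the second run was refused.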
Where this fits
Orion is one of three packages in Constellation’s research stack:
- Ursa: data catalog & storage layer
- Virgo: DAG-based preprocessing & the VirgoQuery language
- Orion (this site): research / training / benchmarking
Cross-cutting concerns are documented once and linked, not duplicated here:
- Observability, timestamps, CI, MCP, cloud-agnostic launch → Research Stack Architecture (Notion) §5.
- Secrets & notifications → constellation-utils docs.
Status & phasing
Planned. Implementation is sequenced in the Linear Orion project (design-doc §18); M2 is split into a true-MVP M2a and a production-grade M2b:
- M1: Foundations (in progress). Repo skeleton, core deps, OrionModel mixin, creds & notifications, orion.status.* namespace.
- M2a: First end-to-end run (true MVP). RunConfig, preflight, single-device Trainer, OrionDataset, basic rich checkpoint + data_hashes/manifest.json, LocalInfra, stub registry. Demo: train POYO 100 steps on r2-test, kill, resume.
- M2b: Production-grade MVP. DDP, multi-rank manifest merge, Augmentation, Slurm-on-Polaris & multi-node SkyPilot, full ClearML, v0.1.0 tag.
- M3: Benchmarking framework.
- M4: Multi-stage training & lineage.
- M5: Production-scale (module version routing, multi-node).
- M6: Polish & onboarding.
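The M2a demo (train, kill, resume) rests on the bit-exact resume property. A toy illustration of that property follows, using only the standard library: the state dict holds a step counter (standing in for the dataloader cursor), a single weight, and the RNG state. This is a sketch of the idea under those assumptions, not the planned checkpoint format, and all names in it are hypothetical.

```python
import random


def fresh_state(seed: int) -> dict:
    """Initial toy training state: cursor, one weight, and the RNG state."""
    return {"step": 0, "weight": 1.0, "rng": random.Random(seed).getstate()}


def run_steps(state: dict, n_steps: int) -> list:
    """Advance the toy loop in place and return the per-step 'losses'."""
    rng = random.Random()
    rng.setstate(state["rng"])
    losses = []
    for _ in range(n_steps):
        # The random draw stands in for shuffling, dropout, and similar noise.
        losses.append(round(state["weight"] * rng.random(), 6))
        state["weight"] *= 0.99
        state["step"] += 1
    # Persist the advanced RNG state so a later resume continues the stream.
    state["rng"] = rng.getstate()
    return losses


# Uninterrupted reference run: 100 steps.
reference = run_steps(fresh_state(0), 100)

# Interrupted run: 40 steps, checkpoint, "kill", then resume for 60 more.
state = fresh_state(0)
first_part = run_steps(state, 40)
checkpoint = dict(state)  # shallow copy suffices: every value is immutable
resumed = run_steps(dict(checkpoint), 60)
```

If the checkpoint omitted the RNG state, the resumed losses would diverge from the reference run; with it, first_part followed by resumed reproduces the reference exactly.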
Contents
- Architecture
- Goals (§0)
- Configuration: orion.config (§3.1–§3.6)
- Trainer: orion.trainer (§4.1–§4.6)
- Data loading: orion.data (§5.1–§5.10)
- Rich checkpoint: orion.checkpoint (§6.1–§6.4)
- Run registry: orion.registry (§7.1–§7.3a)
- Benchmarks: orion.benchmarks (§8.1–§8.5)
- Multi-stage training & lineage: orion.lineage (§9.1–§9.2)
- Module version routing: orion.models (§10.1–§10.2)
- Infrastructure / launcher: orion.launcher (§11.1–§11.4)
- Coupling to Ursa, Virgo & torch_brain (§12.1–§12.5)
- Phasing
- Concepts
- The one-sentence mental model
- RunConfig: the declarative description of a run
- VirgoQuery: data as a composable spec
- Pre-flight: doomed runs never reach a GPU
- The Trainer: a thin, pure-PyTorch loop
- Tokenize-in-dataset, typed losses, and augmentation
- Rich checkpoints and data-hash lineage
- Benchmarks as content-addressed artifacts
- Multi-stage training and lineage
- Module versioning: replaying the past
- Where this connects to the rest of the stack
- Quickstart
- Tutorials
- API Reference
- Top-level entry points (orion)
- orion.config: the RunConfig tree (§3)
- orion.trainer: pure-PyTorch trainer (§4)
- orion.data: VirgoQuery-driven dataloader (§5)
- orion.losses: typed loss list (§4.4a)
- orion.models: model wrappers + version routing (§10)
- orion.benchmarks: benchmark framework (§8)
- orion.registry: run registry (ClearML facade) (§7)
- orion.lineage: multi-stage lineage (§9)
- Other modules
- Cross-cutting surfaces (linked out)