Jacquard Documentation

Welcome to the documentation for Jacquard, a GPU-accelerated RTL logic simulator.

Use the sidebar to navigate between topics, or start with the Getting Started guide — it runs four bundled designs in a few seconds with no synthesis. To prepare and run your own RTL, see the Synthesis Flow; for UVM/cocotb/SVA questions, Testbench Interop.

Documents

Project Scope & Planning

Start here if you're considering a feature contribution or want to understand Jacquard's overall direction.

Project Scope & Guarantees: Top-level contract — what Jacquard is for, what it isn't, licensing and architecture constraints, stability tiers.
Why Jacquard: Honest positioning vs. STA tools and event-driven simulators; what's unique, what isn't, and what output interface would let users extract the value.
Timing Correctness: Scoped requirements for timing accuracy, validation, and the forthcoming timing IR.
Timing Model Extensions: Pre-spike design notes for δ(T) dynamic delay, clock-tree skew, and wire delay at scale. Formalised in ADR 0007.
Post-Phase-0 Roadmap: Sequencing of Phase 1+ work covering structured timing output (ADR 0008) and timing model fidelity (ADR 0007). (OpenTimer integration was originally Phase 1's centrepiece, but ADR 0003 was superseded by the spike; out-of-process OpenSTA is now the sole STA path per ADR 0001.)
Architecture Decision Records: Design decisions and their rationale (numbered, per-decision). See the index for status and how the ADRs relate.
Implementation Plans: Phased implementation plans with entry and exit criteria. See the index for status and reading order.
Spikes: Time-boxed experiments and their outcomes.

Core Documentation

Simulation Architecture: Detailed explanation of Jacquard's internal architecture
- Pipeline stages (NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU)
- Data structures and representations
- VCD input/output format requirements
- Assertion and display support infrastructure
- Performance characteristics
- Known issues and limitations
Timing Simulation: CPU-based timing simulation with Liberty/SDF delays
Timing Violations: GPU-side setup/hold violation detection

Troubleshooting Guides

Troubleshooting VCD: Debugging VCD input issues
- VCD hierarchy requirements
- Signal naming and matching
- Solutions for flat VCD generation
- Diagnostic checklist
- Working examples

Quick Reference

VCD Input Requirements (Critical!)

Jacquard expects VCD signals at absolute top-level (no module hierarchy):

// ✓ Correct testbench
initial begin
    $dumpfile("output.vcd");
    $dumpvars(1, clk, reset, din, dout);  // Depth 1, explicit signals
end

// ✗ Incorrect testbench
initial begin
    $dumpfile("output.vcd");
    $dumpvars(0, testbench);  // Dumps entire hierarchy
end
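As a rough aid for checking a VCD before handing it to Jacquard, the sketch below reports how deeply each `$var` is nested inside `$scope` blocks. It only measures nesting depth in the VCD header; it does not reproduce Jacquard's own signal-matching rules (see Troubleshooting VCD for those). The function name `var_scope_depths` is hypothetical.

```python
def var_scope_depths(vcd_text):
    """Return a list of (signal_name, scope_depth) for each $var in a VCD header.

    Depth 0 means the variable is declared outside any $scope block.
    """
    tokens = vcd_text.split()
    depth = 0
    result = []
    for i, tok in enumerate(tokens):
        if tok == "$scope":
            depth += 1
        elif tok == "$upscope":
            depth -= 1
        elif tok == "$var" and i + 4 < len(tokens):
            # Layout: $var <type> <width> <id-code> <name> [range] $end
            result.append((tokens[i + 4], depth))
        elif tok == "$enddefinitions":
            break  # value changes follow; the header is finished
    return result
```

Signals reported at a larger depth than the rest are the ones buried in module hierarchy, which is the usual symptom of a `$dumpvars(0, testbench)` dump.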

Debug Commands

# Enable debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>

# Verify with CPU simulation
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu

# Check VCD structure
grep '\$scope\|\$var' input.vcd | head -20

Cosim (reactive peripherals)

jacquard cosim runs GPU-resident peripheral models (SPI flash, UART, Wishbone) alongside the design so inputs can react to outputs cycle-by-cycle. It runs on Metal, CUDA, and HIP (plus a CPU fallback).

# Drive the design from a JSON testbench config; write an output VCD
cargo run -r --features metal --bin jacquard -- cosim \
    design.v --config sim_config.json --output-vcd out.vcd

Flag	Purpose
`--config <json>`	Testbench config: clock(s), reset, peripherals (required)
`--output-vcd <path>`	Output VCD (chip outputs + any traced nets)
`--trace-signals <path>`	Surface internal nets in the VCD (Signal Tracing)
`--bus-trace-csv <path>`	Decode on-chip bus transactions (Bus Tracing)
`--jtag-server <port>` / `--jtag-replay <path>`	Interactive / deterministic JTAG debug (JTAG Debug)
`--xprop`	Selective X-propagation for uninitialised state
`--max-clock-edges <n>`	Limit simulation length (1 cycle = 2 edges)
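When scripting cosim runs, it can help to assemble the command line from the flags in the table and to convert a cycle budget into clock edges (1 cycle = 2 edges). The helper below is a minimal sketch; `cosim_argv` and `cycles_to_edges` are hypothetical names, and only flags listed in the table above are used.

```python
def cycles_to_edges(cycles):
    """Convert a cycle count to clock edges (1 cycle = 2 edges)."""
    return 2 * cycles


def cosim_argv(netlist, config, output_vcd=None, max_cycles=None, xprop=False):
    """Build an argument list for `jacquard cosim` from documented flags."""
    argv = ["jacquard", "cosim", netlist, "--config", config]
    if output_vcd is not None:
        argv += ["--output-vcd", output_vcd]
    if max_cycles is not None:
        argv += ["--max-clock-edges", str(cycles_to_edges(max_cycles))]
    if xprop:
        argv.append("--xprop")
    return argv
```

The resulting list can be passed to `subprocess.run` as-is; when working from a clone, replace the leading `jacquard` with the `cargo run … --` prefix.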

Key Statistics

When running Jacquard, look for these diagnostic outputs:

netlist has X pins, Y aig pins, Z and gates        # AIG complexity
current: N endpoints, try M parts                  # Partition count
Built script for B blocks, reg/io state size S     # Final script
WARN (GATESIM_VCDI_MISSING_PI) ...                 # VCD issues!
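For CI logs, these lines can be scraped with a few regular expressions. The sketch below assumes the placeholders (X, Y, Z, N, M, B, S) are printed as plain decimal integers and that the surrounding wording matches the lines above exactly; both are assumptions about the log format, and `scan_log` is a hypothetical helper.

```python
import re

# Patterns mirror the diagnostic lines listed above; numeric format is assumed.
PATTERNS = {
    "aig": re.compile(r"netlist has (\d+) pins, (\d+) aig pins, (\d+) and gates"),
    "partition": re.compile(r"current: (\d+) endpoints, try (\d+) parts"),
    "script": re.compile(r"Built script for (\d+) blocks, reg/io state size (\d+)"),
}


def scan_log(lines):
    """Return (stats, warnings) extracted from Jacquard log lines.

    If a line type appears more than once, the last occurrence wins.
    """
    stats = {}
    warnings = []
    for line in lines:
        if "GATESIM_VCDI_MISSING_PI" in line:
            warnings.append(line.strip())
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                stats[name] = tuple(int(g) for g in match.groups())
    return stats, warnings
```

A non-empty `warnings` list is the cue to revisit the VCD input requirements before trusting the run.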

Investigation Methodology

This documentation was created through systematic investigation of Jacquard's behavior:

Source Code Analysis: Examined src/aig.rs, src/flatten.rs, src/staging.rs
Debug Tracing: Used RUST_LOG=debug to capture internal state
Test Case Development: Created minimal reproducible examples
Comparative Testing: Compared Jacquard vs iverilog outputs
Third-Party Validation: Tested with real-world examples (sva-playground)

Known Issues

Tracked live on GitHub — see the open issues and the priority:high label. The long-standing ones:

VCD hierarchy mismatch — Jacquard expects a flat top-level VCD; most testbenches emit hierarchical ones. Workaround: --input-vcd-scope (see Troubleshooting VCD). Tracking: #142.
Complex FSM simulation — some FSM designs (e.g. safe.v) don't simulate correctly; under investigation. Tracking: #143.
Format-string preservation — Yosys may drop gem_format attributes, so $display messages show placeholders. This is an upstream Yosys limitation; the workaround is to extract format strings from the pre-synthesis JSON.

Contributing

When adding documentation:

Be specific: Include actual commands, file paths, code snippets
Show examples: Both working and non-working cases
Link related docs: Cross-reference other documentation files
Date updates: Update version and date at bottom of documents
Test instructions: Verify all commands actually work

Future Documentation Needs

Dedicated guides not yet written (coverage today is scattered across ADRs and reference docs):

Performance tuning guide (choosing NUM_BLOCKS, --level-split)
SRAM modeling & synthesis (synthesis flow + preload + observability in one place)
Multi-clock domain user guide (config examples; cf. #87 for test coverage)
GPU kernel optimization internals (profiling, backend-specific tuning)

Now covered: custom cell libraries → Adding a New PDK + ADR 0010/0011; VCD scope behaviour → Troubleshooting VCD.

Main README: ../README.md - Project overview and quick start
CLAUDE.md: ../CLAUDE.md - Development guidelines and architecture overview
Test Suite: ../tests/ - Examples and regression tests
Third-Party Tests: ../tests/regression/third_party/ - Real-world examples with attribution

Last Updated: 2026-06-26
Maintained By: gpu-eda community

Jacquard — Project Scope and Guarantees

Status: Draft — under review.

This document states what Jacquard is for, what it is deliberately not for, and the constraints under which it is built. It is the top-level contract that scoped requirements docs (timing-correctness.md, future equivalents) inherit from.

If you are a contributor deciding whether a feature or change fits Jacquard, start here.

Purpose

Jacquard is a GPU-accelerated gate-level simulator for synthesized digital circuits. It exists to make pre-silicon functional verification of synthesized designs substantially faster than CPU-based simulators, by mapping the design onto a GPU's massively parallel execution model.

It is a descendant of NVIDIA Research's GEM project, maintained by Rob Taylor and community contributors.

Jacquard ships with AIGPDK, a simple and-inverter standard cell library aligned to the GPU schema's internal representation. It also accepts Liberty-described cells from open-source PDKs (SKY130) and commercial PDKs (under a private test track; see ADR 0004).

In scope

Gate-level simulation of synthesized Verilog netlists against AIGPDK, SKY130, and similar Liberty-described cell libraries.
Stimulus from static VCD input, from co-simulated GPU-resident peripheral models (SPI flash, UART, and similar), or from combinations of both for SoC-scale functional verification.
Synchronous designs clocked by one or more clocks of known frequency.
GPU-backed simulation across CUDA (NVIDIA), HIP (AMD), and Metal (Apple Silicon).
Back-annotated timing simulation from SDF, including setup/hold violation detection.
Correctness validation against permissive third-party simulators and STA tools.

Non-goals

Jacquard does not aim to cover the following areas. Feature work in these areas that requires cross-cutting changes to the GPU schema or core architecture is declined on scope grounds unless this document is first amended. Contributions that interface Jacquard with external tools covering these areas are welcome where noted.

Mixed-signal or analog simulation. Digital gate-level only. Interfacing to external analog / mixed-signal simulators via cosim hooks is in scope for future contribution.
RTL-level simulation. Input is a synthesized netlist, not behavioural RTL. Synthesis (via Yosys or commercial tools) happens upstream.
Sign-off STA. Jacquard performs functional simulation with setup/hold guardrails. Sign-off timing analysis is delegated to dedicated STA tools; Jacquard validates its results against them but does not aim to replace them.
Incremental or interactive editing of running simulations. Each Jacquard run starts from a complete netlist and stimulus. Contributions that embed Jacquard's engine into interactive or incremental workflows (REPLs, debuggers, IDE integrations) as an external driver are welcome; they are treated as outside this project's direct scope rather than as non-goals.

Constraints

Licensing

All code linked into the Jacquard binary must be under a permissive license (MIT / Apache-2 / BSD-3 or equivalent). GPL tools may be invoked as subprocesses. This keeps Jacquard commercially usable by downstream integrators.

Platforms

Jacquard commits to three GPU backends: CUDA, HIP, and Metal. A change that lands on one backend without a plan for the others is a regression in product surface. Feature-parity timing is negotiable; eventual feature-parity is not.

Design assumptions baked into the architecture

These are structural properties of the current GPU schema:

Sequential logic is currently edge-triggered and synchronous: raw latches and asynchronous (self-timed) sequential logic are not modelled today, because the GPU schema's scheduling assumes synchronous clocking. (Async set/reset on flip-flops is supported; so are clock gating via CKLNQD and latch-based memory mapped to RAM. See the latch note in simulation-architecture.md.) Extending the schema to asynchronous sequential logic is open territory; contributions with a viable approach are welcome.
Circuits fit the boomerang block shape: 8191-signal input/output and 4095 intermediate-pin limits per partition, 64 SRAM output groups. Very wide designs may require manual --level-split tuning.
Numerics are 4-state at partition granularity (X-capable or not), not per-bit.
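The block-shape limits above can be written down as a simple pre-check. This is an illustrative sketch only: it takes the three counts as inputs supplied by the caller (how Jacquard counts them internally is not specified here), and `partition_violations` is a hypothetical name.

```python
# Limits quoted in the list above.
MAX_IO_SIGNALS = 8191
MAX_INTERMEDIATE_PINS = 4095
MAX_SRAM_OUTPUT_GROUPS = 64


def partition_violations(io_signals, intermediate_pins, sram_output_groups):
    """Return a list of human-readable limit violations for one partition."""
    problems = []
    if io_signals > MAX_IO_SIGNALS:
        problems.append(f"input/output signals {io_signals} > {MAX_IO_SIGNALS}")
    if intermediate_pins > MAX_INTERMEDIATE_PINS:
        problems.append(
            f"intermediate pins {intermediate_pins} > {MAX_INTERMEDIATE_PINS}"
        )
    if sram_output_groups > MAX_SRAM_OUTPUT_GROUPS:
        problems.append(
            f"SRAM output groups {sram_output_groups} > {MAX_SRAM_OUTPUT_GROUPS}"
        )
    return problems
```

An empty list means the partition fits; a non-empty one is the situation where manual `--level-split` tuning may be needed.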

Validation

Results are validated against at least one independent third-party tool per format. No single parse path (Jacquard's or otherwise) is its own reference. See timing-correctness.md for the detailed validation contract on timing.

Stability

Treat the following as honest characterisations, not marketing claims:

Stable. Core GPU simulation of AIGPDK designs. NVDLA / Rocket / Gemmini regression path.
Stable with caveats. SKY130 flow; known to work on the MCU SoC reference design, with a history of PDK-specific issues resolved over time.
Evolving. Timing simulation: sim-path arrival tracking and setup/hold violation detection now work across Metal, CUDA, and HIP (ADR 0008), with per-DFF clock-arrival skew folded in (ADR 0007 Pillar B); cosim-path timing output (--timing-report) is not yet wired. Multi-clock scheduling. X-propagation semantics. SDF parser is hand-rolled and has received multiple reactive fixes.
Experimental. GPU-resident peripheral models. HIP-on-NVIDIA path exists primarily to unblock CI.
Planned. Private commercial-PDK test track (see ADR 0004).

Contributors can expect stable-tier behaviour to remain stable across releases. Evolving and experimental tiers may change shape between releases; reasonable migration notes will be provided.

Decision principles

When scope conflicts arise:

Product surface before performance. Jacquard's speed advantage is its value proposition, but optimizations that compromise correctness, validation, or portability are declined.
Permissive-license contributors win. Any change that pushes Jacquard toward GPL-contamination is rejected; subprocess-based integrations with GPL tools are acceptable.
Non-goals hold until amended. Extending into latches, interactive stimulus, or RTL simulation is not a bug fix; it is a scope change and requires this document to be updated first.
Validation is not optional. New parsers, new formats, new simulation modes require an independent third-party reference at least for representative cases.
Honest stability labels. Moving a feature from "evolving" to "stable" requires evidence (regression coverage, oracle-backed validation, absence of known silent-failure modes), not just time in the codebase.

References

README.md — project overview, quick start.
CLAUDE.md — repository conventions and architecture overview for contributors working with AI assistance.
docs/simulation-architecture.md — internal pipeline and data structures.
docs/timing-correctness.md — scoped contract for timing accuracy, validation, and IR requirements.
docs/adr/ — architectural decision records.
docs/plans/ — phased implementation plans.

Last updated: 2026-06-26 (v0.2.x; CUDA/HIP sim-timing + cosim backends shipped; reflects ADR 0007/0008/0013/0017/0018 amendments).

Installation

Jacquard is three tools; install only what your task needs:

Tool	What it's for	Install
`jacquard`	the simulator (`sim` / `cosim`)	Homebrew · `cargo binstall` · prebuilt release · from source
`opensta-to-ir`	SDF → timing-IR (only for the timing / post-PnR path)	ships with `jacquard` (same release / Homebrew formula)
`netlist-graph`	post-synthesis signal-name discovery (companion to the tracing docs)	PyPI (`uvx` / `pip`)

Availability. The prebuilt-binary, Homebrew, and PyPI channels go live with the first tagged release (v0.1.0). Until then, build the simulator from source and run netlist-graph from the repo with uv run. The design behind this layout is ADR 0018.

The simulator (`jacquard` + `opensta-to-ir`)

Homebrew — macOS / Apple Silicon (Metal)

brew install gpu-eda/tap/jacquard      # installs jacquard + opensta-to-ir

The cleanest path on a Mac. Requires an Apple Silicon machine with a Metal GPU. The Homebrew formula is built with --features synth, so behavioral RTL input works out of the box — see Accepted RTL surface.

To try a release candidate before it ships, install from the prerelease tap instead (it tracks the latest -rc tag):

brew install gpu-eda/tap-prerelease/jacquard

cargo binstall — prebuilt binary, no toolchain build

brew install llvm     # runtime dependency — see note below
cargo binstall --git https://github.com/gpu-eda/Jacquard jacquard-sim \
  --disable-strategies compile,quick-install

Installs the jacquard binary (+ timing_analysis) on macOS/Metal. The crate is the jacquard-sim package (the binary it installs is still jacquard). The --git form is required because the crate is not on crates.io (it depends on a vendored fork carrying in-flight patches), so binstall reads the [package.metadata.binstall] pkg-url straight from the repo.

Why jacquard-sim, not jacquard. The package is named jacquard-sim because the crate name jacquard is taken on crates.io by an unrelated project (an AT-Protocol client library), so cargo install jacquard would build that, not this. Naming our package jacquard-sim makes resolution unambiguous.

--disable-strategies compile,quick-install is kept as belt-and-suspenders: it turns a missing prebuilt binary into a clean hard error rather than any source-build fallback. (validate-install.yml uses the same guard.)

Linux is not binstall-able: there are two GPU backends (CUDA, HIP) for one target triple, so the backend can't be auto-selected. Use the release tarball for your backend, a container, or build from source.

Release binaries are built with --features synth and support behavioral RTL input directly.

Runtime dependency — Homebrew LLVM. The prebuilt macOS binary links Homebrew LLVM's libc++ and libomp (the build uses LLVM clang for OpenMP, via the mt-kahypar partitioner), so it needs brew install llvm to run. The Homebrew install handles this automatically (depends_on "llvm"); binstall and the raw tarball do not, so install LLVM first.

Prebuilt release tarball

Download jacquard-<version>-<target>.tar.gz from the releases page, extract, and put jacquard, timing_analysis, and opensta-to-ir on your PATH. The GPU kernel is embedded, but the binary still needs Homebrew LLVM at runtime (brew install llvm) — see the note above.

From source (any backend)

The portable path, and the only one for Linux CUDA / HIP today. Needs the Rust toolchain and the GPU SDK for your backend.

git clone https://github.com/gpu-eda/Jacquard.git
cd Jacquard
git submodule update --init --recursive

cargo build -r --features metal --bin jacquard         # macOS / Apple Silicon
cargo build -r --features cuda  --bin jacquard         # NVIDIA (CUDA toolkit)
cargo build -r --features hip   --bin jacquard         # AMD (ROCm)

The binary lands at target/release/jacquard. See the README's Dependencies table for optional tooling (flatc, mdbook, OpenSTA).

The behavioral RTL on-ramp (jacquard sim design.v …) requires the synth feature (embedded YoWASP Yosys engine). Add it to your build:

cargo build -r --features metal,synth --bin jacquard   # macOS + RTL synthesis
cargo build -r --features cuda,synth  --bin jacquard   # NVIDIA + RTL synthesis
cargo build -r --features hip,synth   --bin jacquard   # AMD + RTL synthesis

A binary built without synth still simulates pre-synthesized gate-level netlists; it gives an actionable error if handed behavioral RTL.

Providing yosys.wasm for RTL synthesis — the synth engine needs the YoWASP Yosys wasm module. Discovery order:

--yosys-wasm <path> flag on sim/cosim (overrides the rest).
JACQUARD_YOSYS_WASM=/path/to/yosys.wasm environment variable.
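Installed yowasp-yosys Python package, found automatically. Pin the version verified for the embedded engine (see the version caveat in Accepted RTL surface):
```
pip install 'yowasp-yosys==0.64.0.0.post1131'
```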
Fetch-from-release is a planned follow-up (not yet implemented).

Prebuilt and Homebrew binaries include the synth feature; the wasm is still found via the flag, env var, or installed Python package until auto-fetch ships.

The signal-analysis companion (`netlist-graph`)

Pure Python — install from PyPI, no GPU or Rust needed:

uvx netlist-graph search design.gv psel     # one-off, no install
pip install netlist-graph                    # or install it

From a Jacquard checkout you can also run it without installing: uv run netlist-graph … (it's a workspace member). See signal tracing for what it's used for.

The timing path (`opensta-to-ir` + a PDK)

For post-PnR timing simulation you also need PDK Liberty files, fetched with volare/ciel (pinned in the root pyproject.toml). opensta-to-ir converts SDF to the Jacquard timing IR (.jtir) that jacquard sim --timing-ir / cosim --timing-ir consume. Pure functional (pre-PnR) runs need none of this — see signal tracing § pre-PnR functional runs.

Verify

jacquard --version
# A quick self-contained cosim (from a Jacquard checkout):
jacquard cosim tests/apb_trace/apb_trace_synth.gv \
    --config tests/apb_trace/sim_config.json \
    --top-module apb_trace --max-clock-edges 200 \
    --bus-trace-csv /tmp/apb.csv

Then head to Getting Started to run bundled designs, Accepted RTL surface to simulate your own behavioral RTL, or Synthesis Flow to prepare a high-QoR gate-level netlist.

Getting Started

This guide gets you from a fresh clone to four working simulations, in increasing order of realism. Every command here runs against designs that ship in the repository — no synthesis, no large downloads, no extra EDA tools. Each one is also a CI job, so it is known-green.

Already have your own RTL? Pass it directly to jacquard sim design.v … — the simulator auto-detects behavioral Verilog / SystemVerilog and synthesizes it transparently (see Accepted RTL surface). For peak GPU performance, or to use a commercial synthesizer (DC) for best mapping quality, see Synthesis Flow. For UVM/cocotb/SVA questions see Testbench Interop.

Prerequisites

A built jacquard binary. Either install it (see Installation) or build from a clone:

git clone https://github.com/gpu-eda/Jacquard.git
cd Jacquard
git submodule update --init --recursive
cargo build -r --features metal --bin jacquard   # macOS / Apple Silicon

The examples below use cargo run -r --features metal --bin jacquard -- … so they work straight from a clone. If you installed the binary (e.g. brew install gpu-eda/tap/jacquard), replace that prefix with just jacquard. On Linux, swap --features metal for --features cuda (NVIDIA) or --features hip (AMD), and replace the trailing 1 (NUM_BLOCKS) with 2× your GPU's SM/CU count.
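The NUM_BLOCKS rule of thumb above (1 on Metal, 2× the SM/CU count on CUDA/HIP) is small enough to capture in a helper when generating commands from a script. This is a sketch; `num_blocks` is a hypothetical name and the compute-unit count must come from your own GPU query.

```python
def num_blocks(backend, compute_units=None):
    """Suggested NUM_BLOCKS: 1 for Metal, 2 x SM/CU count for CUDA or HIP."""
    backend = backend.lower()
    if backend == "metal":
        return 1
    if backend in ("cuda", "hip"):
        if compute_units is None:
            raise ValueError("compute_units (SM/CU count) is required for CUDA/HIP")
        return 2 * compute_units
    raise ValueError(f"unknown backend: {backend}")
```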

Tier 1 — Your first simulation (a flip-flop)

The smallest possible run: a single D flip-flop. jacquard sim reads a gate-level netlist plus a static input waveform (VCD) and writes an output VCD.

cargo run -r --features metal --bin jacquard -- sim \
    tests/timing_test/dff_test_synth.gv \
    tests/timing_test/dff_test.vcd \
    /tmp/dff_out.vcd \
    1

It finishes in well under a second (6 cycles). Confirm the captured q matches the golden waveform:

diff <(grep -E '^[01]!' tests/timing_test/dff_test.vcd) \
     <(grep -E '^[01]!' /tmp/dff_out.vcd) && echo "MATCH"

That MATCH is your "Jacquard runs on this machine" checkpoint. The last argument, 1, is NUM_BLOCKS — always 1 for Metal; on CUDA/HIP set it to 2× your GPU's SM/CU count.
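If process substitution is not available in your shell, the same comparison can be sketched in Python. Like the grep above, it keeps only lines that start with `0` or `1` followed by the identifier code `!`, and it assumes `!` refers to the same signal in both files. The function names are hypothetical.

```python
import re


def value_changes(vcd_text, ident="!"):
    """Return the scalar value-change lines for one VCD identifier code."""
    pattern = re.compile(r"[01]" + re.escape(ident))
    # re.match anchors at the start of the line, like grep's '^'.
    return [line for line in vcd_text.splitlines() if pattern.match(line)]


def waveforms_match(golden_text, output_text, ident="!"):
    """True when both VCDs contain the same sequence of changes for ident."""
    return value_changes(golden_text, ident) == value_changes(output_text, ident)
```

Read the two files with `open(path).read()` and pass the contents in; a `True` result corresponds to the shell check printing MATCH.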

Tier 2 — A reactive testbench (`cosim`)

jacquard sim drives the design from a fixed input waveform. Real testbenches need to react to the design's outputs. That is what jacquard cosim is for: it runs peripheral models (UART, SPI flash, JTAG, bus monitors) as GPU kernels alongside the design, so inputs can depend on outputs cycle-by-cycle.

The dual_uart design is two independent UART transmitters that send "Hi" and "OK". Cosim decodes the serial bits back into bytes — no input VCD required; clock and reset come from the config.

cargo run -r --features metal --bin jacquard -- cosim \
    tests/dual_uart/dual_uart_synth.gv \
    --config tests/dual_uart/sim_config.json \
    --top-module dual_uart_top \
    --max-clock-edges 10000

Verify both channels decoded correctly:

python3 tests/dual_uart/check_pass.py target/test-out/dual_uart_events.json
# console: b'Hi' OK
# debug:   b'OK' OK
# PASS: both UART channels decoded correctly

The peripherals, pin mapping, baud rates, and clock are all declared in tests/dual_uart/sim_config.json. See Cosim execution model and Bus Transaction Tracing for the full peripheral set.

Tier 3 — A real open-source PDK

Tiers 1 and 2 use Jacquard's abstract AIG cell library. Real designs come out of synthesis mapped to a real PDK (process design kit). Jacquard decomposes real standard cells — SKY130 and GF180MCU today — into its AIG representation directly, so a synthesized netlist simulates as-is.

3a. Pre-P&R (post-synthesis) — runs from the repo

A pre-place-and-route netlist is the output of logic synthesis, before layout and before any SDF timing annotation. This is the fastest, most common gate-level verification step. logic_cone is a small SKY130 netlist (real sky130_fd_sc_hd__* cells) that ships in the repo:

cargo run -r --features metal --bin jacquard -- sim \
    tests/timing_test/sky130_timing/logic_cone.v \
    tests/timing_test/sky130_timing/logic_cone.vcd \
    /tmp/logic_cone_out.vcd \
    1 --input-vcd-scope logic_cone

No --sdf flag: this is pure functional simulation of a real-PDK netlist. The circuit is 4 DFFs → nand2/nor2/and2/inv cone → DFF, computing Q = a & !b & !c & !d (registered). Its sibling inv_chain.v (16 SKY130 inverters = identity) is a cleaner "output tracks input, delayed two cycles" check if you want one.
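To sanity-check the output waveform by hand, a small reference model of the described structure can help. The sketch below models only what the text states: four input flops, the function Q = a & !b & !c & !d, and an output flop. The all-zero initial state and the function names are assumptions for illustration; the real netlist's start-up values may differ.

```python
def cone(a, b, c, d):
    """Combinational function of the cone: a & !b & !c & !d (values are 0 or 1)."""
    return a & (1 - b) & (1 - c) & (1 - d)


def simulate_logic_cone(stimulus, initial=0):
    """Return Q after each rising clock edge.

    stimulus: iterable of (a, b, c, d) tuples sampled at successive edges.
    All flops update together from their pre-edge values, so an input
    reaches Q two edges after it is presented.
    """
    in_regs = (initial,) * 4
    trace = []
    for a, b, c, d in stimulus:
        q = cone(*in_regs)      # output flop captures the current cone value
        in_regs = (a, b, c, d)  # input flops capture the new inputs
        trace.append(q)
    return trace
```

Feeding it the input sequence from logic_cone.vcd gives a cycle-level expectation to compare against /tmp/logic_cone_out.vcd.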

Add --sdf logic_cone.sdf --sdf-corner typ to layer Liberty cell delays on top — that moves you from functional to timing-aware simulation, covered in Timing Simulation.

3b. A full chip on GF180MCU

To see Jacquard handle a real, large design on a real PDK end-to-end, the wafer.space chess chip_top (~227,000 cells, GlobalFoundries open-source GF180MCU PDK) is wired up as a smoke-test recipe. The post-P&R netlist is ~200 MB so it is not committed — you regenerate it from the upstream LibreLane flow. Full step-by-step recipe, expected timings, and what it exercises: tests/gf180mcu_chess_chip_top/README.md.

This is the bridge to your own designs: synthesize to SKY130 or GF180MCU, then run jacquard sim exactly as in 3a.

Where to next

You want to…	Go to
Simulate your own behavioral RTL (one command, auto-synthesized)	Accepted RTL surface
Prepare a high-QoR gate-level netlist (DC / Yosys, memory mapping)	Synthesis Flow
Run the large research benchmarks (NVDLA, Rocket, Gemmini)	`benchmarks/README.md`
Add timing (Liberty / SDF / violation checks)	Timing Simulation
Use UVM / cocotb / SVA, or understand interop limits	Testbench Interop
Add support for a new PDK	Adding a New PDK
Understand how it works	Simulation Architecture

Input: netlist language & RTL

What Jacquard reads

Jacquard is a gate-level emulator — it maps a synthesized and-inverter graph onto a virtual manycore GPU processor (see ADR 0014). So the direct input to jacquard sim / cosim is a gate-level Verilog netlist: structural Verilog whose leaf cells are aigpdk, SKY130, or GF180MCU standard cells.

Behavioral RTL is the intended design input — via a synthesis step. You bring your RTL, synthesize it to a gate-level netlist (memory mapping + logic synthesis to aigpdk.lib cells), and Jacquard emulates the result at 5–40×. Synthesis is currently a separate, documented step — see Synthesis Flow — because synthesis quality directly sets Jacquard's performance, so it's a deliberate knob (open-source Yosys works; a commercial tool like DC gives better QoR).

Roadmap: an integrated jacquard build design.v on-ramp (Yosys via YoWASP, no manual synthesis) is planned — see ADR 0021 and #162. Until then, synthesize first, then sim.

Behavioral constructs (always, if/case, reg, parameters, generate, function/task, #delay) are not read by Jacquard directly — the synthesizer elaborates them away. If you feed raw behavioral RTL to sim, it will fail to parse; run it through synthesis first.

Supported netlist syntax

The netlist parser (sverilogparse, consumed by netlistdb) accepts the structural subset a synthesizer emits:

Structure

Multiple module … endmodule per file; // and /* */ comments; (* … *) attributes (parsed and ignored).
Both port-list styles: non-ANSI (module m(a, b); with body input/output decls) and ANSI (module m(input [7:0] a, output b);). A wire/reg net-type keyword in an ANSI header is accepted and ignored.

Declarations

input / output / inout / wire, scalar or bus ([hi:lo]), with comma-separated names (wire a, b, c;).

Cell instantiations

CELL_TYPE inst (.pin(expr), …); — named-port connections only. An empty .pin() (unconnected) is accepted and dropped.

Assigns

assign lhs = rhs;, both sides full wire expressions.

Wire expressions (in assigns, port connections, and .name(expr) hookups)

Form	Example
Identifier (scalar or whole bus)	`w`
Bit-select	`w[3]`
Part-select / slice (`[hi:lo]` or `[lo:hi]`)	`w[7:0]`
Concatenation	`{a, b[3:1], 4'b0101}`
Sized literal (x/z allowed for bin/oct/hex)	`4'b01xz`, `8'hFF`, `10'd42`
Unary NOT	`~w`
Parenthesised grouping	`(w)`

Identifiers — standard ([A-Za-z_][A-Za-z0-9_$]*) and escaped (\name-until-whitespace ).
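The two identifier forms can be expressed as regular expressions. This is a sketch of the rule as stated above, not the parser's actual implementation; the escaped form is modelled as a backslash followed by one or more non-whitespace characters, with the terminating whitespace excluded.

```python
import re

STANDARD = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")
ESCAPED = re.compile(r"\\\S+")  # backslash, then everything up to whitespace


def classify_identifier(text):
    """Return 'standard', 'escaped', or None if the text is neither form."""
    if STANDARD.fullmatch(text):
        return "standard"
    if ESCAPED.fullmatch(text):
        return "escaped"
    return None
```

Escaped identifiers are what synthesizers typically emit for flattened bus bits and hierarchical names, so a netlist containing them is expected to parse.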

Not supported in the netlist

Beyond the behavioral constructs above (which belong in pre-synthesis RTL), the parser does not accept:

Positional port connections — CELL inst(a, b, c); use .pin(expr).
Parameters — parameter / localparam, or #(…) overrides on instances.
Preprocessor — `define / `ifdef (no preprocessor).
Net types beyond input/output/inout/wire — no supply0/1, tri, wand/wor, trireg.
Multi-dimensional / memory arrays — one [hi:lo] range per declaration.
Bare unsized decimal literals — write 1'b1, not 1.
Replication — {4{a}}.
~ inside a concatenation — {~a, b} (use an explicit inverter cell).
Operators — &, |, ?:, arithmetic; and base-10 literals with x/z (8'dx).
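The literal rules (sized literals only; x/z allowed for binary, octal, and hex but not decimal) can be checked with a short validator before a netlist reaches Jacquard. This is a sketch under assumptions: lowercase base letters only, no signed (`'s`) forms, and no underscores in digits, since the text does not say whether the parser accepts those.

```python
import re

# Allowed digit characters per base; x/z only for b, o, h as stated above.
_DIGIT_CLASS = {
    "b": "01xzXZ",
    "o": "0-7xzXZ",
    "h": "0-9a-fA-FxzXZ",
    "d": "0-9",
}
_LITERAL = re.compile(r"(\d+)'([bohd])(\w+)")


def is_supported_literal(text):
    """True for sized literals such as 4'b01xz, 8'hFF, 10'd42."""
    match = _LITERAL.fullmatch(text)
    if match is None:
        return False  # covers bare unsized decimals such as "1"
    _size, base, digits = match.groups()
    return re.fullmatch("[" + _DIGIT_CLASS[base] + "]+", digits) is not None
```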

Design-level constraints (`netlistdb`)

Single top module — auto-detected as the module no other instantiates; pass --top-module <name> if ambiguous. Cyclic instantiation is rejected.
Hierarchy is flattened — multi-level module hierarchy is supported and flattened at load.
Leaf-cell pin directions come from the cell library (aigpdk / SKY130 / GF180MCU); an unknown cell's pins default to Unknown with a warning.
inout on the design boundary is parsed but modelled as Unknown (not fully supported); prefer split input/output where possible.
Edge-triggered flops only — a raw LATCH cell is rejected (async set/reset is fine). Clock gating uses CKLNQD; latch/register-file memory maps to RAM through the memory-synthesis step (see Synthesis Flow).
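The top-module rule (the one module that no other module instantiates, with cyclic instantiation rejected) can be illustrated on a plain dictionary of instantiation edges. This is a sketch of the stated rule, not netlistdb's code; `select_top` and `has_cycle` are hypothetical names.

```python
def select_top(instantiates, top_module=None):
    """Pick the top module from {module: set of instantiated cell/module names}.

    Library leaf cells may appear as children without being keys.
    """
    if top_module is not None:
        if top_module not in instantiates:
            raise ValueError(f"unknown module: {top_module}")
        return top_module
    instantiated = set()
    for children in instantiates.values():
        instantiated |= set(children)
    candidates = sorted(m for m in instantiates if m not in instantiated)
    if len(candidates) != 1:
        raise ValueError(f"ambiguous top, pass --top-module: {candidates}")
    return candidates[0]


def has_cycle(instantiates):
    """Detect cyclic instantiation with a depth-first search."""
    unvisited, in_progress, done = 0, 1, 2
    state = {m: unvisited for m in instantiates}

    def visit(module):
        state[module] = in_progress
        for child in instantiates[module]:
            if child not in state:
                continue  # leaf cell from the library
            if state[child] == in_progress:
                return True
            if state[child] == unvisited and visit(child):
                return True
        state[module] = done
        return False

    return any(state[m] == unvisited and visit(m) for m in instantiates)
```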

SystemVerilog & assertions (SVA)

Immediate assertions are already lowered through synthesis today, as GEM_ASSERT cells (see the assertion handling in src/aigpdk.rs).
Broader SVA (SystemVerilog Assertions) is planned — see Testbench Interop § Roadmap. The concrete near-term slice is the X-barrier $isunknown assertion work: a spec-file-driven !$isunknown check against the X-mask (#106) and lowering an $isunknown SVA subset from RTL to that spec (#107), building on selective X-propagation (ADR 0016).

Accepted Behavioral RTL Surface

jacquard sim and jacquard cosim accept behavioral Verilog / SystemVerilog directly — no separate synthesis step, no external toolchain. This page describes what the on-ramp synthesizes, what it drops, and where the coverage boundary lies.

One-line summary: the accepted surface is whatever YoWASP Yosys (with the yosys-slang read_slang frontend) can synthesize to an aigpdk netlist. Jacquard delegates the elaboration and synthesis; it does not implement a Verilog front-end itself.

Invoking the on-ramp

# Pass behavioral RTL directly — synthesis is transparent and cached:
jacquard sim design.v in.vcd out.vcd NUM_BLOCKS

# Cosim works the same way:
jacquard cosim design.v --config sim_config.json

sim/cosim classify the input file automatically:

Input	Dispatch
Behavioral RTL (structural parse fails)	auto-synthesized via embedded Yosys → simulated
Gate-level netlist, built-in PDK (AIGPDK / SKY130 / GF180MCU)	simulated directly, no synthesis
Gate-level netlist, unrecognized cells	error directing you to `--cell-descriptor <path>`

Override flags:

--rtl — force the synthesis path (useful if detection misclassifies a file).
--netlist — force direct gate-level loading, skip detection.
--emit-synth <path> — also write the intermediate synthesized netlist for inspection or as a fixture.

The simulator always logs its decision:

design.v: behavioral RTL → synthesized [YoWASP Yosys, functional QoR] → <cache>
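The dispatch table and override flags above amount to a small decision function. The sketch below restates them; the behaviour when both `--rtl` and `--netlist` are given is not documented here, so treating it as an error is an assumption, and `dispatch` is a hypothetical name.

```python
def dispatch(structural_parse_ok, cells_recognized, force_rtl=False, force_netlist=False):
    """Return 'synthesize', 'simulate', or 'need-cell-descriptor'."""
    if force_rtl and force_netlist:
        # Assumption: the combination is rejected; the docs do not say.
        raise ValueError("--rtl and --netlist are mutually exclusive")
    if force_rtl:
        return "synthesize"           # --rtl forces the synthesis path
    if force_netlist:
        return "simulate"             # --netlist skips detection
    if not structural_parse_ok:
        return "synthesize"           # behavioral RTL: structural parse fails
    if cells_recognized:
        return "simulate"             # built-in PDK netlist
    return "need-cell-descriptor"     # error pointing at --cell-descriptor <path>
```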

Providing `yosys.wasm`

The synthesis engine requires yosys.wasm (the YoWASP Yosys WebAssembly module). It is located in this order:

--yosys-wasm /path/to/yosys.wasm — CLI flag on sim/cosim (highest priority; overrides the env var and discovery).
JACQUARD_YOSYS_WASM=/path/to/yosys.wasm — environment variable.
Installed yowasp-yosys Python package — discovered automatically via python3 -c "import yowasp_yosys …". Install with:
```
pip install yowasp-yosys          # or: uvx yowasp-yosys
```
Fetch from release — a planned follow-up (ADR 0021 / #162 Phase 4); not yet implemented. Until then, one of the three methods above is required.
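The lookup order above can be summarised as a first-match-wins resolver. This is a sketch of the ordering only: how the installed Python package is actually probed is left to a caller-supplied function, and `resolve_yosys_wasm` and `find_in_python_package` are hypothetical names.

```python
import os


def resolve_yosys_wasm(cli_flag=None, env=None, find_in_python_package=None):
    """Return (path, source) following the documented priority order."""
    env = os.environ if env is None else env
    if cli_flag:
        return cli_flag, "flag"                    # --yosys-wasm wins
    env_path = env.get("JACQUARD_YOSYS_WASM")
    if env_path:
        return env_path, "env"                     # then the environment variable
    if find_in_python_package is not None:
        package_path = find_in_python_package()    # then the yowasp-yosys package
        if package_path:
            return package_path, "python-package"
    raise FileNotFoundError(
        "yosys.wasm not found: pass --yosys-wasm, set JACQUARD_YOSYS_WASM, "
        "or install yowasp-yosys"
    )
```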

Version caveat. Pin to yowasp-yosys==0.64.0.0.post1131 (the version in the project's uv.lock, verified to carry read_slang). Newer wheels ship a wasm built with the WebAssembly exception-handling proposal, which the current embedded wasmtime engine cannot load yet (pip install yowasp-yosys gets the latest and may fail with "exception refs not supported"). Loading newer modules is a tracked Phase-4 follow-up; until then, pip install 'yowasp-yosys==0.64.0.0.post1131'.

Released jacquard binaries are built with --features synth and include the Yosys WASM runtime. Source builds must add --features synth explicitly: cargo build -r --features metal,synth --bin jacquard.

A binary built without --features synth gives an actionable error when handed behavioral RTL, pointing at the synth-enabled build path.

The compiled Yosys module is cached under $XDG_CACHE_HOME/jacquard by content hash — only the first run of a given wasm pays the ~20 s cranelift compile. The synthesized netlist is also cached keyed by (design source + synth script + wasm), so repeat sim runs skip synthesis entirely.
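
The netlist cache key described above can be pictured as a hash over the three inputs. A minimal sketch, assuming SHA-256 and a length-prefixed concatenation. The real hash function, key layout, and file naming are not documented here, and `netlist_cache_key` is a hypothetical name.

```python
import hashlib


def netlist_cache_key(design_src: bytes, synth_script: bytes, wasm: bytes) -> str:
    """Hash (design source + synth script + wasm) into one hex key."""
    h = hashlib.sha256()
    for part in (design_src, synth_script, wasm):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()
```

Any change to the design, the synthesis script, or the wasm module yields a different key, which is why a repeat run with identical inputs can skip synthesis.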

SystemVerilog frontend

The embedded wasm (pinned yowasp-yosys 0.64.0.0.post1131) bundles yosys-slang — a near-complete SystemVerilog-2017 elaborator. Jacquard probes for read_slang at startup and uses it when present; it falls back gracefully to Yosys's built-in read_verilog -sv for older wasm modules.

read_slang coverage is substantially broader than read_verilog -sv: packages, interfaces, structs, enums, and most SV-2017 constructs are handled by the slang elaborator before Yosys sees any RTL. If you observe a "not supported" message, check whether it names slang or read_verilog — the fallback has a narrower surface.

Supported language subset

The following are synthesizable and produce correct simulation output:

Verilog-2005 (synthesizable subset):

Combinational and clocked logic: assign, always @(*), always @(posedge/negedge)
Synchronous flip-flops (positive or negative edge)
Synchronous reset DFFs (if (rst) Q <= 0; else Q <= D;)
Asynchronous reset / set DFFs (async reset is supported; latches are not)
Case and if/else chains
Parameterized modules and generate blocks
Inferred memories (mapped to RAMGEM by memlib_yosys.txt — see below)

SystemVerilog-2017 (via yosys-slang read_slang):

Packages and package imports
Interfaces (including modports)
always_ff, always_comb (always_latch is accepted by the frontend, but a latch it infers is rejected downstream; see Known limits)
logic, enum, struct, union (synthesizable forms)
Parameters with complex types; advanced generate (if, for, case)
Casting and type conversions (synthesizable)

This is not the narrow subset accepted by Yosys's built-in read_verilog -sv; it is the slang elaborator's synthesizable coverage, which tracks the SystemVerilog-2017 standard closely.

Project-specific mappings

The synthesis script applies three project-specific transformations on top of standard Yosys synthesis:

RTL construct	Result	Notes
`$assert`, `$assume`, immediate `assert property`	`GEM_ASSERT` cell	Visible in simulation as assertion failures. `$cover` is silently dropped.
`$display` (→ Yosys `$print`)	`GEM_DISPLAY` cell	The cell is emitted, but the on-ramp does not yet write the display-info JSON companion the CPU side needs to decode format strings — so `$display` text is not surfaced through the on-ramp path yet (tracked as a Phase-4 follow-up).
Inferred synchronous memories	`$__RAMGEM_SYNC_` (RAMGEM)	Mapped via `aigpdk/memlib_yosys.txt`. Asynchronous-read ports generate a warning.

The techmap rules are in aigpdk/gem_formal.v. Pass --strip-assertions (if available on your build) to use chformal -remove and drop GEM_ASSERT cells instead, for a pure logic netlist.

Known limits

Concurrent SVA → synthesizable checker synthesis is partial. Turning concurrent SystemVerilog Assertions (property / sequence bound to assert property with a clock) into gate-level checkers is a Yosys formal-flow capability — independent of slang's ability to parse SVA. This is tracked in issues #106 and #107. Immediate assertions (assert (cond) inside procedural blocks) synthesize correctly via the GEM_ASSERT mapping above.

Testbench-only constructs are dropped. Synthesis discards constructs that have no synthesizable meaning: #delay timing controls, most initial blocks (other than register initialization), $finish, $stop, and similar testbench procedural code. These are silently dropped by Yosys during elaboration, not simulated. If a $display is inside an initial or inside a #delay-driven block, it may be dropped too.

Latches are not supported. Jacquard's GPU emulator core is synchronous only — see docs/synthesis-flow.md for the latch constraint. A latch inferred from RTL causes a synthesis or AIG-load error.

QoR note

YoWASP Yosys produces functional-grade QoR — correct results but not optimized for GPU speed. Synthesis quality is the primary tuner of simulation throughput (the AIG the GPU emulates is only as good as the mapping). For peak performance, synthesize your design with a commercial synthesizer (DC) or a native Yosys install, and point jacquard sim at the resulting gate-level netlist directly. See docs/synthesis-flow.md for the performance path.

Authoritative surface: empirical coverage (Phase 4)

This prose describes the intended surface. Because the accepted RTL surface is whatever the embedded Yosys frontend synthesizes, the authoritative measure is an empirical pass/fail coverage table driven through sv-tests (or a curated subset) — a Phase 4 follow-up tracked in docs/plans/rtl-onramp-sim-integration.md. Until that table exists, hand-claimed feature lists for a delegated frontend are inherently approximate. Report gaps as bugs; they will be tracked in the coverage table when it ships.

Synthesis Flow — Running Your Own Design

New to Jacquard? Start with Getting Started, which runs bundled designs in a few seconds with no synthesis. This page is the deeper path: preparing your own RTL for peak GPU performance using a commercial synthesizer (DC) or a native Yosys install.

Behavioral RTL shortcut: jacquard sim design.v in.vcd out.vcd N accepts behavioral Verilog / SystemVerilog directly — synthesis is automatic and transparent. See Accepted RTL surface. This page covers the performance path where synthesis quality is the primary tuner of simulation throughput.

Caveats: jacquard sim drives the design from a static input waveform (e.g. VCD); for reactive testbenches use jacquard cosim. Storage must be edge-triggered flip-flops — latches (level-sensitive storage) are not supported. Asynchronous set/reset on flip-flops is fine (async reset is not the restriction); what's excluded is latch-based / level-sensitive sequential logic.

Dataset: Some input data (namely, netlists after AIG transformation in Steps 1-2 below, and reference VCDs) is available here.

Step 0. Download the AIG Process Kit

Go to the aigpdk directory, where you can download aigpdk.lib, aigpdk_nomem.lib, aigpdk.v, and memlib_yosys.txt. You will need them later in the flow.

Before continuing, make sure your design's logic storage is edge-triggered flip-flops. A raw latch left in the gate-level logic is rejected (async set/reset on flip-flops is fine; async reset is not the restriction). Two structured latch uses are still supported: clock gates map to the CKLNQD integrated clock-gating cell (replace any RTL clock gates manually with CKLNQD instantiations from aigpdk.v), and latch-based register files / memory are captured by the memory-synthesis step below (memory_libmap → RAM), not left as raw latches. It also helps to know where memory blocks (e.g., caches) are implemented in your design, so you can check later that they are mapped correctly.

Step 1. Memory Synthesis with Yosys

This step makes use of the open-source Yosys synthesizer to recognize and map the memory blocks automatically.

Download and compile the latest version of Yosys. Then run the Yosys shell with the following synthesis script.

# replace this with paths to your RTL code, and add `-I`, `-D`, `-sv` etc when necessary
read_verilog xx.v yy.v top.v

# replace TOP_MODULE with your top module name
hierarchy -check -top TOP_MODULE

# simplify design before mapping
proc;;
opt_expr; opt_dff; opt_clean
memory -nomap

# map the rams
# point -lib path to your downloaded memlib_yosys.txt
memory_libmap -lib path/to/memlib_yosys.txt -logic-cost-rom 100 -logic-cost-ram 100

The memory_libmap command will output a list of RAMs it found and mapped.

If you see $__RAMGEM_SYNC_ (naming inherited from GEM), it means the mapping is successful.
If you see $__RAMGEM_ASYNC_, it means this RAM was found to have an asynchronous read port. Confirm whether that is really the case.
- If the RAM is actually synchronous but was recognized as asynchronous, you may need to patch the RTL. There are several possible causes, for example different read and write clocks.
- If it is indeed asynchronous, check its size. If it is small enough to be synthesized with registers and mux trees (which is very expensive for large RAM banks), you can remove the $__RAMGEM_ASYNC_ block in memlib_yosys.txt and re-run Yosys to force the use of registers.
If you see using FF mapping for memory, it means the memory is recognized, but because it is nonstandard (e.g., special global reset or nontrivial initialization), Jacquard will fall back to registers and mux trees. If the memory is small, this is usually not an issue. Otherwise, you are advised to try other implementations.
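
To triage a long memory_libmap log, it can help to count the three markers described above. A minimal sketch: it only searches for the substrings named in this section and makes no assumption about the rest of the Yosys log format. `triage_memory_log` is a hypothetical helper.

```python
from typing import Dict

# Substrings taken from the text above, mapped to what they indicate.
MARKERS = {
    "$__RAMGEM_SYNC_": "mapped to synchronous RAM",
    "$__RAMGEM_ASYNC_": "asynchronous read port, needs review",
    "using FF mapping for memory": "fell back to registers and mux trees",
}


def triage_memory_log(log_text: str) -> Dict[str, int]:
    """Count how many log lines mention each marker."""
    counts = {marker: 0 for marker in MARKERS}
    for line in log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

A non-zero count for the second or third marker is the cue to go back and inspect that memory by hand.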

After a successful mapping, use the following command to write out the mapped RTL as a single Verilog file.

write_verilog memory_mapped.v

Check the correctness of this step by simulating memory_mapped.v with your reference CPU simulator.

Step 2. Logic Synthesis

This step maps all combinational and sequential logic into a special set of standard cells we defined in aigpdk.lib. The quality of synthesis is directly tied to Jacquard's final performance, so we suggest you use a commercial synthesis tool like DC. You can also use Yosys to complete this if you do not have access to a commercial synthesis tool.

Check the correctness of this step by simulating gatelevel.gv with your reference CPU simulator.

Use Synopsys DC

First, you need to compile aigpdk.lib to aigpdk.db using Library Compiler.

With that, synthesize the memory_mapped.v obtained in Step 1 against aigpdk.db.

Some key commands you may use on top of your existing DC flow:

# change path/to/aigpdk.db to a correct path. same for other commands.
set_app_var link_path path/to/aigpdk.db
set_app_var target_library path/to/aigpdk.db
read_file -format db $target_library

# elaborate TOP_MODULE
# current_design TOP_MODULE

# timing settings like create_clock ... are recommended. Jacquard benefits from timing-driven synthesis.

compile_ultra -no_seq_output_inversion -no_autoungroup
optimize_netlist -area

write -format verilog -hierarchy -out gatelevel.gv

Use Yosys: Example script

# if you exited Yosys after Step 1, read memory_mapped.v back in first.
# read_verilog memory_mapped.v
# hierarchy -check -top TOP_MODULE

# synthesis
synth -flatten
delete t:$print

# change path/to/aigpdk_nomem.lib to a correct path. same for other commands.
dfflibmap -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
techmap
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge

# write out
write_verilog gatelevel.gv

Step 3. Download and Compile Jacquard

Download and install the Rust toolchain. This is as simple as a one-liner in your terminal. We recommend https://rustup.rs.

Clone Jacquard along with its dependencies.

git clone https://github.com/gpu-eda/Jacquard.git
cd Jacquard
git submodule update --init --recursive

Jacquard supports two GPU backends: CUDA (NVIDIA GPUs on Linux) and Metal (Apple Silicon Macs).

All functionality is accessed through the jacquard CLI, which provides sim, cosim, dump-paths, and other subcommands:

# Analyze timing / validate a netlist (no GPU features needed)
cargo run -r --bin jacquard -- dump-paths --help

# Simulation (Metal - macOS)
cargo run -r --features metal --bin jacquard -- sim --help

# Simulation (CUDA - Linux, requires CUDA toolkit)
cargo run -r --features cuda --bin jacquard -- sim --help

Simulate the Design

Jacquard automatically partitions the design at startup using mt-kahypar-sc hypergraph partitioning.

If partitioning fails due to deep circuits (which often shows as trying to partition a circuit with only 0 or 1 endpoints), try adding a --level-split option to force a stage split. For example --level-split 30 or --level-split 20,40.

Metal (macOS)

Use NUM_BLOCKS=1 for Metal.

cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1

CUDA (Linux)

Replace NUM_BLOCKS with twice the number of physical streaming multiprocessors (SMs) of your GPU.

cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS
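
The NUM_BLOCKS rule of thumb is simple arithmetic. A sketch, assuming you already know your GPU's SM count from the vendor's specifications. `num_blocks` is a hypothetical helper, not a Jacquard API.

```python
def num_blocks(backend: str, sm_count: int = 0) -> int:
    """Suggested NUM_BLOCKS: 1 on Metal, 2 x physical SM count on CUDA."""
    if backend == "metal":
        return 1
    if backend == "cuda":
        if sm_count <= 0:
            raise ValueError("CUDA needs the physical SM count")
        return 2 * sm_count
    raise ValueError(f"unknown backend: {backend}")
```

For example, a GPU with 108 SMs gives NUM_BLOCKS = 216.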

VCD Scope Handling

Jacquard automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope. If auto-detection fails or you need to override it, use:

# Metal
cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1 --input-vcd-scope "testbench/dut"

# CUDA
cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS --input-vcd-scope "testbench/dut"

Use slash separators (/) for hierarchical paths, not dots. See troubleshooting-vcd.md for details.

The simulated output port values are stored in output.vcd.

Caveat: Jacquard also reports the actual GPU simulation runtime. You may see a long delay before the GPU starts, because input.vcd must be read and parsed first. For long workloads, consider developing your own pipeline to feed the input waveform into Jacquard's GPU kernels.

Timing-Aware Simulation

Jacquard supports two ways to feed timing data into the simulator:

--timing-ir <path.jtir> — pre-converted Jacquard timing IR. This is the canonical path and requires no external tools at run time. Generate the IR ahead of time with the standalone opensta-to-ir tool (see crates/opensta-to-ir/).
--sdf <path.sdf> --liberty <path.lib> — raw SDF, converted to IR on the fly. This subprocesses OpenSTA, which must be installed on the user's machine.

OpenSTA dependency

When using --sdf, Jacquard locates OpenSTA in this order:

JACQUARD_OPENSTA_BIN environment variable.
<repo-root>/scripts/build-opensta.sh --print-binary (the canonical install path during development; the script builds the version vendored at vendor/opensta/).
sta on PATH.

Jacquard requires OpenSTA 3.1.0 or newer, matching the commit pinned at vendor/opensta/. The pinned version is the only one with end-to-end test coverage; newer OpenSTA versions are accepted with a warning, older versions are a hard error.
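
The version policy above has three outcomes. A small sketch of that classification: `check_opensta` and `parse_version` are hypothetical names, and how Jacquard extracts the version string from the OpenSTA binary is not covered here.

```python
from typing import Tuple

MIN_TESTED = (3, 1, 0)  # pinned and tested OpenSTA version per the text above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.1.0', 'v3.1.0' or '3.1' into a 3-component tuple."""
    parts = tuple(int(p) for p in text.lstrip("v").split("."))
    return parts + (0,) * (3 - len(parts))


def check_opensta(version: str) -> str:
    """Classify a version as 'ok', 'warn' (newer) or 'error' (older)."""
    v = parse_version(version)
    if v < MIN_TESTED:
        return "error"  # older than the minimum: hard error
    if v > MIN_TESTED:
        return "warn"   # newer than tested: accepted with a warning
    return "ok"
```

Comparing tuples rather than strings avoids the usual trap where "3.10.0" sorts before "3.2.0".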

The simplest way to get a known-good OpenSTA is to build the vendored copy from the Jacquard repo:

git submodule update --init --recursive
./scripts/build-opensta.sh

Then either set JACQUARD_OPENSTA_BIN to the path printed by ./scripts/build-opensta.sh --print-binary, or just let Jacquard find it automatically — the build script's output is searched by default.

Error messages

Symptom	Meaning	Fix
`--sdf requires OpenSTA: OpenSTA binary not found.`	OpenSTA isn't installed or isn't on PATH.	Run `./scripts/build-opensta.sh`, set `JACQUARD_OPENSTA_BIN`, or install OpenSTA system-wide.
`OpenSTA at <path> is v2.4.0; Jacquard requires v3.1.0 or newer.`	Installed OpenSTA is too old.	Rebuild from `vendor/opensta/` (which is pinned at 3.1.0) or upgrade your system OpenSTA.
`Detected OpenSTA v3.2.0, newer than the latest tested version v3.1.0.` (warning)	OpenSTA version is newer than what Jacquard's test corpus has been validated against. Simulation proceeds.	Report any timing discrepancies as bugs; we'll bump the tested-version range when CI catches up.
`--sdf requires --liberty <PATH>.`	OpenSTA needs the Liberty library to link the design.	Pass `--liberty <PATH>` alongside `--sdf`.

For licensing context (Jacquard is permissively-licensed, OpenSTA is GPL-3, and Jacquard's runtime subprocess invocation is permitted but bundling is not), see adr/0006-sdf-preprocessing-model.md.

Testbench Interop (UVM, cocotb, SVA)

A common first question: "Can I point my existing UVM / cocotb / SVA testbench at Jacquard?"

Short answer: not today. Jacquard is a gate-level engine, not a drop-in replacement for a SystemVerilog or Python simulator's testbench runtime. This page explains what works now, what is on the roadmap, and the fallback that works for any flow today.

What Jacquard drives today

Mechanism	What it is	Reactive?
`jacquard sim` + input VCD	Replay a recorded input waveform through the netlist	No — inputs are fixed
`jacquard cosim` + peripheral models	UART, SPI flash, JTAG, Wishbone / APB3 monitors run as GPU kernels beside the design	Yes — inputs can depend on outputs cycle-by-cycle

So reactive stimulus is supported — but through Jacquard's own peripheral model architecture and cosim execution model, not through an external testbench framework.

The fallback that works for any flow: record-and-replay

You do not need framework integration to use Jacquard with a UVM, cocotb, or plain-Verilog testbench. The universal path:

Run your existing testbench against any simulator that can dump a VCD (Verilator, Icarus, a commercial simulator, …).
Dump the design's top-level input pins to a VCD.
Replay that VCD through Jacquard with jacquard sim.

This is exactly what jacquard sim already is — a recorded-waveform replay — so it works regardless of how the stimulus was generated. The trade-off is that the replay is open-loop: it reproduces the recorded inputs, so it can't react to divergent design behaviour. For closed-loop reactive stimulus, model the peripheral as a cosim kernel instead.

Roadmap

These are directions under consideration, not commitments or dated milestones:

SVA (SystemVerilog Assertions) — planned. Jacquard already lowers a class of immediate assertions through synthesis (GEM_ASSERT cells; see the assertion handling in aigpdk.rs). Broader SVA support is the next step here.
Running UVM test suites — design settled, unbuilt. The shape is ADR 0022: split the testbench the way emulators have since SCE-MI. Sequences, randomisation and checking stay on the host; the driver's timed half is rewritten as a synthesizable transactor and compiled into the AIG beside the design (possible since the RTL on-ramp shipped), so it wiggles pins at GPU speed. Existing UVM drivers don't port as-is, and that isn't a gap we can close: they drive every cycle, and cosim runs 1024 edges per dispatch with the CPU absent, which is where the speed comes from.
cocotb — needs more work. A naive bridge would marshal Python ↔ GPU every cycle, which would dominate runtime and erase the GPU speedup. Same cliff as above, same fix: exchange transactions, not cycles. Making cocotb performant against Jacquard needs more design thinking, not just a shim.
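
The cost argument in the UVM and cocotb items is about host round trips. A back-of-the-envelope sketch using the 1024 edges per dispatch figure from the text. `host_round_trips` is a hypothetical helper that counts synchronisation points only and ignores transfer size and kernel time; treating the naive bridge as one sync per edge is an assumption.

```python
import math


def host_round_trips(total_edges: int, edges_per_sync: int) -> int:
    """Number of host <-> GPU synchronisation points for a run."""
    return math.ceil(total_edges / edges_per_sync)


edges = 1_000_000
naive = host_round_trips(edges, 1)       # bridge that syncs on every edge
batched = host_round_trips(edges, 1024)  # cosim-style batched dispatch
```

With these numbers the batched run needs 977 round trips instead of 1,000,000, which is the gap a transaction-level interface is meant to preserve.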

If any of these is blocking for you, the record-and-replay path above is the recommended interim approach.

Why Jacquard — positioning and output interface

Status: Honest assessment of where Jacquard fits in an EDA flow alongside dedicated STA tools (OpenTimer/OpenSTA) and event-driven simulators (Verilator, iverilog, CVC). Includes a survey of what timing information Jacquard exposes today and what would let users actually consume it.

This is not a marketing document. The goal is for a contributor or user to read it and decide accurately whether Jacquard helps them — and, if it does, how to extract the answer they need.

TL;DR

Jacquard's unique value is vector-driven timing analysis at GPU scale: answering "did this stimulus violate setup/hold at any DFF, on which cycle, on which signal?" for designs large enough that SDF-annotated event-driven sim is too slow to finish in useful time.

Everything else Jacquard offers is offered, often better, by the standard flow:

For functional sim: Verilator is faster on small designs.
For timing: OpenSTA gives more accurate answers than Jacquard, vector-independent.
For glitch / metastability: event-driven sim with SDF (CVC, iverilog) sees behaviours Jacquard's lockstep kernel structurally cannot.

Jacquard becomes the right tool when (design size × vector length) exceeds what event-driven SDF-annotated sim can handle, and you specifically want vector-driven timing answers.

STA is not optional even with Jacquard. Jacquard does not replace OpenSTA; it complements it. The right framing is "STA proves no bad vectors exist; Jacquard proves your real workload runs cleanly within those bounds." OpenSTA is also a hard runtime dependency for any timing-aware Jacquard flow — the timing IR is produced by opensta-to-ir, which subprocesses OpenSTA. See ADR 0001.

What's actually unique

The intersection where Jacquard wins is narrow but real:

Activity-driven setup/hold sweep at scale. Run a long workload (boot trace, architectural validation, NoC congestion stimulus) on a large design at GPU speed; get a per-cycle violation report. STA can't tell you "this real workload trips violation X at cycle 12,847"; CVC can but won't finish in time on big designs.
Arrival-time distributions for power/activity analysis. Per-signal arrival histograms across millions of cycles → useful for worst-case-power analysis informed by actual switching activity. STA gives you nothing here; CVC could but slowly.
Failure forensics. When a functional test fails, answering "was this a timing issue?" without rerunning under a different simulator. Jacquard's timing-VCD output ties violations to cycle/signal/path — useful when you already have it from the same run.
Fast iteration during timing closure. Change a constraint, resynthesise, re-run a long test — Jacquard's loop time is short enough to make this practical on big designs in a way iverilog+SDF isn't.

What dedicated STA (OpenSTA) gives you that Jacquard doesn't

This list is long and you should know it:

Worst-case path enumeration. STA tells you the top-N critical paths over all possible inputs. Jacquard sees only what your stimulus exercises. If your testbench misses a critical path, Jacquard's "no violations" report is silent on it; OpenSTA would flag it.
True min-delay analysis. OpenSTA does proper min-delay path search. Jacquard's hold check is per-DFF against actual stimulus only.
Per-pair CRPR. OpenSTA applies common-path-pessimism removal as a launch/capture credit on each path. Jacquard consumes per-DFF clock arrival from opensta-to-ir and folds it into setup/hold (see timing-model-extensions.md, Part B Stages 1+2 — landed), but treats the launch reference as 0 — i.e. the per-pair CRPR credit is intentionally not modelled at this stage. Stage 3 in the same doc is the lever if Stage 1+2 pessimism turns out to matter on a real design.
SDC-aware constraint handling. False paths, multi-cycle paths, generated clocks, async groups — OpenSTA reads SDC and respects it. Jacquard doesn't read SDC at the timing layer.
Coverage by construction. STA covers every path by definition. Dynamic sim covers only what's exercised.
Vector-independent confidence. "This design meets timing" is something STA can claim; Jacquard can only claim "this design met timing on these vectors."

What event-driven SDF sim (CVC/iverilog) gives you that Jacquard doesn't

The honest comparison isn't "Jacquard vs. Verilator + OpenTimer." It's "Jacquard vs. iverilog/CVC-with-SDF + OpenTimer." On the timing-sim side specifically:

Glitch propagation. CVC/iverilog with inertial or transport delay see intra-cycle pulses. Jacquard's lockstep cycle-accurate kernel does not.
Per-pin wire delay fidelity. CVC consumes SDF interconnect records per-receiver, per-edge, with rise/fall distinction. Jacquard collapses to per-cell-max (see timing-model-extensions.md, Part C).
Per-DFF setup/hold without per-word collapse pessimism. Jacquard collapses all DFFs in a 32-bit state word to min(setup), min(hold); CVC checks each flop individually.
Async event handling. Real $setup/$hold checks across asynchronous control. Jacquard explicitly assumes synchronous designs.

So today, accuracy-per-vector goes to CVC; throughput goes to Jacquard.

When to choose what

Your situation	Best tool
Small design, just want functional results	Verilator (free, fast, mature)
Small design, need timing certainty	OpenSTA + Verilator (or +CVC for vector-driven)
Large design, functional only	Verilator if it scales, else Jacquard
Large design, vector-driven timing needed	Jacquard + OpenSTA for STA backstop
Glitch / metastability investigation	CVC or iverilog with SDF — Jacquard cannot model these structurally
Asynchronous design / latches	Not Jacquard (synchronous-only) — use CVC/iverilog
Sign-off STA	OpenSTA / commercial — Jacquard is not a sign-off tool

The trajectory

Jacquard's timing fidelity gap with CVC is closeable. The work in timing-model-extensions.md — δ(T), clock-tree skew, per-receiver wire delay — closes much of it while preserving GPU throughput. The further along that path the project goes, the more "Jacquard" looks like "GPU-accelerated SDF-annotated event-driven sim, with the inherent limits the cycle-accurate kernel imposes (no glitches, lockstep cycles)" — i.e. CVC's report quality at Verilator's speed, on designs where neither alone suffices.

Output interface — what Jacquard exposes today

Jacquard's unique value depends on getting the timing information out of a run in a form users can act on. Phase 1 of the post-Phase-0 roadmap (ADR 0008) closed the gap between "data Jacquard has" and "answers users want" for setup/hold violations.

Symbolic stderr violation messages

The kernel writes setup/hold violation events to a per-block event buffer (csrc/kernel_v1.metal:554-576). The host drains the buffer each cycle (src/event_buffer.rs), resolves the state-word index to a hierarchical DFF site name via WordSymbolMap, and emits:

[cycle 12847] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=412]: arrival=2150ps setup=80ps slack=-30ps
[cycle 12847] HOLD VIOLATION at top/cpu/state[bit 3] [word=412]: arrival=12ps hold=20ps slack=-8ps

The bare [word=N] suffix is preserved for grep/tooling compatibility; up to four DFFs per word are named, with +N more truncation beyond that.
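
For quick grep-style tooling, the two sample lines above can be parsed with a regular expression. This sketch is based only on those two samples. Lines that name several DFFs or carry the "+N more" truncation are not shown in this document, so their exact shape is an assumption left untested. `parse_violation` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

VIOLATION_RE = re.compile(
    r"^\[cycle (?P<cycle>\d+)\] (?P<kind>SETUP|HOLD) VIOLATION at (?P<site>.+) "
    r"\[word=(?P<word>\d+)\]: arrival=(?P<arrival>-?\d+)ps "
    r"(?:setup|hold)=(?P<constraint>-?\d+)ps slack=(?P<slack>-?\d+)ps$"
)


def parse_violation(line: str) -> Optional[Dict[str, object]]:
    """Parse one stderr violation line; return None if it does not match."""
    m = VIOLATION_RE.match(line.strip())
    if m is None:
        return None
    out: Dict[str, object] = dict(m.groupdict())
    # Numeric fields are reported in cycles, word indices, or picoseconds.
    for key in ("cycle", "word", "arrival", "constraint", "slack"):
        out[key] = int(str(out[key]))
    return out
```

For anything beyond ad-hoc inspection, the structured JSON report is the better interface, since the stderr text is kept stable only around the [word=N] suffix.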

Structured timing report (`--timing-report <path.json>`)

Schema-versioned JSON document written at end of run. Contents:

Per-cycle violation list (cycle, kind, word, site, arrival, constraint, slack).
Per-word aggregate: violation counts and worst slack (sorted by total violations).
Top-N worst-slack ranking per kind (setup, hold).
Run metadata: design, vector source, timing source, clock period, cycles run, Jacquard version.
Aggregate stats: setup/hold totals, dropped events.

Machine-readable, CI-friendly. Sample at tests/timing_ir/sample_reports/two_violations.json; full schema in src/timing_report.rs (SCHEMA_VERSION = "1.0.0"). Stability contract per ADR 0008: additive-only extensions, breaking changes bump the major.
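
A CI job can gate on the report by checking the schema's major version and then the violation list. A minimal sketch: the key names `schema_version` and `violations` and the shape of each entry are assumptions for illustration, so check them against the sample report and src/timing_report.rs before relying on this. `gate_on_report` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

# Assumed key names; not confirmed by this document.
SCHEMA_KEY = "schema_version"
VIOLATIONS_KEY = "violations"


def gate_on_report(report: Dict[str, Any], supported_major: int = 1) -> List[Dict[str, Any]]:
    """Return the violation list, refusing reports with another major version."""
    major = int(str(report[SCHEMA_KEY]).split(".")[0])
    if major != supported_major:
        raise ValueError(f"unsupported schema major version: {major}")
    return list(report.get(VIOLATIONS_KEY, []))


# Inline stand-in for loading a report file; the content is illustrative.
sample = json.loads(
    '{"schema_version": "1.0.0", "violations": ['
    '{"cycle": 87, "kind": "setup", "slack": -150}]}'
)
failures = gate_on_report(sample)
```

Checking only the major version follows the stability contract: additive minor changes should not break a consumer, while a major bump should stop it.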

Text summary (`--timing-summary`)

One-screen human summary on stdout. Same data as the JSON report, different channel; either or both flags can be set:

=== Jacquard Timing Summary ===
Design:        my_cpu.gv
Vectors:       boot.vcd (1000 cycles)
Clock period:  1000 ps
Timing source: my_cpu.jtir

Violations:
  Setup: 5
  Hold:  2
  Total: 7

Worst slack:
  Setup: -150ps  at top/cpu/regs[7][bit 22] [word=5]  (cycle 87)
  Hold:   -40ps  at top/cpu/state[bit 3] [word=12]  (cycle 91)

Top 2 by violation count (of 2 total words with violations):
  top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
  top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps

Format is for human inspection — explicitly not a stable parseable contract. Tools should use --timing-report JSON.

Timed VCD (`--timed`)

Annotates the output VCD with per-signal arrival times. Largest, most detailed output; suitable for waveform-level inspection.

What you get: per-signal arrival ps at each writeout cycle.
Caveat: the VCD doesn't carry slack relative to the clock edge — you compute it yourself.
Cost: doubles VCD size. Not appropriate for long workloads on large designs.
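
Since the timed VCD carries arrival times but not slack, slack has to be derived from the clock period and the constraint values. A sketch using the conventional definitions, which is an assumption about Jacquard's exact convention. The hold form does agree with the sample stderr line earlier on this page: a 12 ps arrival against a 20 ps hold gives -8 ps.

```python
def setup_slack_ps(clock_period_ps: int, setup_ps: int, arrival_ps: int) -> int:
    """Data must arrive at least setup_ps before the next clock edge."""
    return clock_period_ps - setup_ps - arrival_ps


def hold_slack_ps(hold_ps: int, arrival_ps: int) -> int:
    """Data must stay stable until hold_ps after the current clock edge."""
    return arrival_ps - hold_ps
```

The setup_ps and hold_ps values are not in the VCD either; they have to come from your timing source.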

`SimStats` aggregate counts (in-process)

SimStats { setup_violations, hold_violations, ... } is available to in-process consumers (src/event_buffer.rs). Only counts; full detail flows through the structured report path.

Still on the wishlist

Items captured in ADR 0008's "Optional / later outputs" plus a few caveats on what shipped. Demand-driven; not scheduled.

Closest-to-violation tracking when no violation occurred

The shipped worst_slack ranking is populated only from observed violation events. Surfacing "where am I close to the edge" on a run that passed timing requires GPU-side near-miss instrumentation (emit slack events whenever |slack| falls below a configurable threshold). Useful for proactive signoff regression. Separate workstream — needs a kernel change.

Arrival histogram (`--arrival-histogram <pattern>`)

Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis and "is my actual timing margin healthy" reporting.

STA cross-reference (`--sta-cross-reference <opensta-paths.txt>`)

Read OpenSTA's worst-N critical-path report and produce coverage output: of those paths, which were exercised by the stimulus, at what observed arrival. Closes the loop between vector-driven and static analysis.

Path back-trace from worst-arrival DFF

Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting per-edge contributions. Most expensive item on the wishlist; only useful once symbolic names are in place (which they now are).
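
The back-trace idea can be sketched on a toy graph. Everything here is hypothetical: the `FANIN` structure, the node names, and the delays are made up for illustration and do not reflect Jacquard's internal AIG representation.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical toy graph: node -> [(fanin node, edge delay in ps)].
# Nodes without an entry are primary inputs with arrival 0.
FANIN: Dict[str, List[Tuple[str, int]]] = {
    "n1": [("a", 30), ("b", 50)],
    "n2": [("n1", 40), ("c", 20)],
    "dff_d": [("n2", 25)],
}


@lru_cache(maxsize=None)
def arrival(node: str) -> int:
    """Arrival time as the max over fanins of (fanin arrival + edge delay)."""
    edges = FANIN.get(node, [])
    return max((arrival(src) + d for src, d in edges), default=0)


def back_trace(node: str) -> List[Tuple[str, str, int]]:
    """Walk the max-of-fanin chain from node back to a primary input."""
    path = []
    while FANIN.get(node):
        src, delay = max(FANIN[node], key=lambda e: arrival(e[0]) + e[1])
        path.append((src, node, delay))  # (from, to, contribution in ps)
        node = src
    return path
```

On this toy graph the per-edge contributions along the traced chain sum to the arrival time at dff_d, which is the property a real back-trace report would rely on.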

CUDA / HIP / cosim runtime violation routing

The current Metal sim path routes runtime violations through process_events (which is what feeds the resolver, structured report, and text summary). The CUDA, HIP, and cosim paths don't yet share that plumbing — they detect violations on the GPU but don't drain through process_events. Independent plumbing follow-up; doesn't affect the Metal user experience.

Per-signal activity / transition counts

Listed in ADR 0008 as part of the JSON report's wishlist. Not in v1.0.0 of the schema; will be added (additively) when the GPU kernel emits transition events.

"Corner" and "margin percentage" in the text summary

ADR 0008's summary template includes both. Corner is missing because the metadata struct doesn't carry it through from the IR yet; margin percentage is trivially derivable from slack_ps / clock_period_ps and was omitted to keep the v1 summary terse.

See also

project-scope.md — what Jacquard is for and not for; the formal contract this doc operates inside.
timing-correctness.md — forward-looking validation requirements.
timing-violations.md — current GPU-side violation detection mechanics.
timing-validation.md — how Jacquard's timing output is validated against CVC/iverilog.
timing-model-extensions.md — proposed accuracy improvements (δ(T), clock-tree skew, wire delay).

GEM Simulation Architecture

This document describes GEM's internal simulation architecture based on investigation and testing.

Overview

GEM (GPU-accelerated Emulator-inspired RTL simulation) compiles gate-level netlists into GPU kernels that simulate designs 5-40X faster than CPU-based simulators. It works like an FPGA-based RTL emulator by converting designs into an and-inverter graph (AIG), partitioning it for GPU blocks, and generating optimized GPU code.

Pipeline Stages

Verilog Netlist → NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU Kernel
                     ↓            ↓                    ↓            ↓
                  Parse      Synthesis         Hypergraph     Instruction
                  Netlist    to AIGs          Partitioning    Generation

1. NetlistDB (Input Parsing)

Input: Gate-level Verilog (.gv files) from synthesis tools (Yosys, Design Compiler)

Process:

Parses structural Verilog using sverilogparse crate
Creates flattened netlist database with cells, pins, nets
Identifies primary inputs, outputs, clock signals
Stores connectivity in CSR (Compressed Sparse Row) format
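For readers unfamiliar with CSR, the layout can be sketched as two flat arrays. This is an illustration of the format only; the Csr type and its field names are assumptions and do not mirror the netlistdb crate's actual types:

```rust
/// Minimal CSR container (illustrative names). The pins of net `n` are
/// `pins[offsets[n]..offsets[n + 1]]`.
struct Csr {
    offsets: Vec<usize>,
    pins: Vec<u32>,
}

impl Csr {
    /// Build the two flat arrays from a per-net list of pin ids.
    fn from_lists(lists: &[Vec<u32>]) -> Self {
        let mut offsets = Vec::with_capacity(lists.len() + 1);
        let mut pins = Vec::new();
        offsets.push(0);
        for list in lists {
            pins.extend_from_slice(list);
            offsets.push(pins.len());
        }
        Csr { offsets, pins }
    }

    /// Pins connected to net `net`, as a slice into the flat array.
    fn pins_of(&self, net: usize) -> &[u32] {
        &self.pins[self.offsets[net]..self.offsets[net + 1]]
    }
}
```

The appeal of this layout is that connectivity lives in two contiguous allocations regardless of net count, and a net with no pins costs only one repeated offset.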

Key Limitations:

Only supports synthesized gate-level netlists (not RTL)
No behavioral Verilog constructs (always blocks, if/case statements)
Expects standard cells from supported libraries (AIGPDK)

2. AIG (And-Inverter Graph)

Process: Converts gate-level netlist to AIG representation

Data Structure:

pub enum DriverType {
    AndGate,           // Basic AND gate
    DFF,               // D flip-flop
    ClockGate,         // Clock gating cell
    RAMBlock,          // Memory block
    GemAssert,         // Assertion checking
    GemDisplay,        // Display output
    // ... more types
}

Statistics (example from safe.v):

157 AIG pins: Internal circuit nodes
133 AND gates: Logic operations
16 DFF cells: Sequential elements
2 GEM_ASSERT cells: Assertion nodes
480 total pins: Including I/O

Key Features:

Clock inference from DFF connections
Assertion cell detection (GEM_ASSERT, GEM_DISPLAY)
Endpoint grouping for outputs and registers

3. StagedAIG (Pipeline Staging)

Purpose: Split deep combinational logic into pipeline stages

Process:

Analyzes combinational depth between registers
Splits logic at --level-split thresholds
Creates pipeline stages to fit GPU resource constraints

When Needed:

Designs with very deep combinational paths (>50 levels)
When single-stage partitioning fails resource limits
Use --level-split 30 or --level-split 20,40 to force splits

4. Partitioning (Hypergraph Cut)

Tool: mt-kahypar hypergraph partitioner

Constraints (GPU block resources):

Max 8191 unique inputs per partition
Max 8191 unique outputs per partition
Max 4095 intermediate pins alive per stage
Max 64 SRAM output groups

Process:

Iterative partitioning (runs automatically at simulation start)
Tries 1 partition first, then increases if needed
Merges partitions to minimize inter-partition communication
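The limit check and the "try 1 part first" loop can be sketched as follows. The four limits come from the list above; PartStats, fits_block, and find_part_count are hypothetical names, the splitter is a caller-supplied closure standing in for the mt-kahypar call, and stepping the part count by one is an assumption (the text only says it increases):

```rust
// Per-block resource limits quoted above.
const MAX_INPUTS: usize = 8191;
const MAX_OUTPUTS: usize = 8191;
const MAX_LIVE_PINS: usize = 4095;
const MAX_SRAM_GROUPS: usize = 64;

/// Hypothetical resource summary of one candidate partition.
struct PartStats {
    inputs: usize,
    outputs: usize,
    live_pins: usize,
    sram_groups: usize,
}

/// True when the partition fits inside one GPU block.
fn fits_block(p: &PartStats) -> bool {
    p.inputs <= MAX_INPUTS
        && p.outputs <= MAX_OUTPUTS
        && p.live_pins <= MAX_LIVE_PINS
        && p.sram_groups <= MAX_SRAM_GROUPS
}

/// Try 1 part first, then 2, 3, ... until every part fits.
/// Returns None if no count up to `max_parts` works.
fn find_part_count<F>(max_parts: usize, mut split: F) -> Option<usize>
where
    F: FnMut(usize) -> Vec<PartStats>,
{
    (1..=max_parts).find(|&k| split(k).iter().all(fits_block))
}
```

A design with 20000 unique inputs that divides evenly would settle on three parts under this sketch, since 10000 inputs per part still exceeds the 8191 limit.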

5. FlattenedScript (GPU Instruction Generation)

Process: Generates GPU execution script from partitions

Script Components:

Boomerang stages: Hierarchical 8192→1 reduction structure
State buffer: Packed 32-bit words for all register values
SRAM interface: Memory block read/write operations
Assertion positions: Bit positions for assertion conditions
Display positions: Enable bits and argument positions

Statistics (example):

reg/io state size: 133 bits → 5 words (32-bit)
script size: 30208 instructions
assertion_positions: [(cell_id, bit_pos, msg_id, type)]
display_positions: [(cell_id, enable_pos, format, arg_positions, widths)]

Key Insight: All state is packed into a flat bit array, indexed by position in 32-bit words.
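The position arithmetic is small enough to write out. A minimal sketch, assuming bit 0 is the least-significant bit of each word (the bit order is an assumption; the helper names are illustrative):

```rust
/// Split a flat bit position into (word index, bit index) for 32-bit words.
fn word_and_bit(pos: usize) -> (usize, u32) {
    (pos / 32, (pos % 32) as u32)
}

/// Number of 32-bit words needed to hold `bits` state bits.
fn words_for(bits: usize) -> usize {
    (bits + 31) / 32
}

/// Read one bit from the packed state buffer (LSB-first within a word).
fn read_bit(state: &[u32], pos: usize) -> bool {
    let (word, bit) = word_and_bit(pos);
    (state[word] >> bit) & 1 == 1
}
```

This matches the numbers quoted in this document: 133 state bits need 5 words, and the debug line "pos=4195 (word=131, bit=3)" is 4195 / 32 = 131 with remainder 3.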

6. GPU Kernel Execution

Kernel Types:

kernel_v1.cu / kernel_v1_impl.cuh: CUDA implementation
kernel_v1.metal: Metal (Apple Silicon) implementation

Execution Model:

Each GPU block simulates one partition
Multiple blocks run in parallel
State synchronized between stages
CPU checks assertion/display conditions after GPU completes

VCD Input/Output

Input VCD Requirements

Critical Discovery: GEM expects VCD signals at absolute top-level (no module hierarchy).

Expected Signal Format:

$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end

NOT (with module scope):

$scope module testbench $end
  $scope module dut $end
    $var wire 1 ! clk $end
    ...

Signal Matching:

GEM looks for signals matching synthesized module port names
Uses HierName() (empty hierarchy) for matching

If signals are scoped under modules, GEM reports:

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input

VCD Scope Option:

--input-vcd-scope <scope>: Specify module hierarchy to read from
Current Issue: Even with scope specified, signal matching fails
Workaround: Generate VCD with signals at absolute top level

Output VCD Structure

GEM generates a minimal VCD containing only primary outputs:

$timescale 1 ns $end
$scope module gem_top_module $end
$var wire 1 ! unlocked $end
$upscope $end

Internal states and intermediate signals are not dumped.

Assertion and Display Support

Assertion Infrastructure

Synthesis Flow:

Verilog assert() → Yosys $check cell → techmap gem_formal.v → GEM_ASSERT cell

Runtime:

GEM stores assertion positions in FlattenedScript
CPU checks assertion bits after GPU simulation
Configurable actions: Log, Pause, Terminate

AssertConfig:

pub struct AssertConfig {
    pub on_failure: AssertAction,  // Log, Pause, Terminate
    pub max_failures: Option<u32>,
}

Display Infrastructure

Synthesis Flow:

Verilog $display() → Yosys $print cell → techmap gem_formal.v → GEM_DISPLAY cell

Runtime:

Format strings stored in JSON metadata
CPU checks display enable bits after GPU simulation
Arguments extracted from state buffer positions
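Argument extraction amounts to gathering bits from the packed state words. A minimal sketch, assuming an argument occupies contiguous positions stored LSB-first; the real arg_positions may list bits individually, and extract_bits is a hypothetical helper:

```rust
/// Gather `width` bits (1..=64) starting at flat position `pos`.
/// Bit `pos` becomes bit 0 of the result; fields may straddle words.
fn extract_bits(state: &[u32], pos: usize, width: u32) -> u64 {
    let mut value = 0u64;
    for i in 0..width as usize {
        let p = pos + i;
        let bit = (state[p / 32] >> (p % 32)) & 1;
        value |= (bit as u64) << i;
    }
    value
}
```

An 8-bit argument starting at position 28 takes its low nibble from the top of word 0 and its high nibble from the bottom of word 1.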

Limitation: Format string preservation depends on Yosys synthesis preserving attributes.

Debug Information

Enabling Debug Output

# Metal simulation with debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>

# CPU verification (slower but validates GPU results)
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu

Key Debug Messages

AIG Construction:

Found GEM_ASSERT cell 143 (condition_iv=0, en_iv=0, a_iv=76, clken_iv=2)
Found GEM_DISPLAY cell 24 (enable_iv=2, clken_iv=2, args=32)

Partitioning:

netlist has 480 pins, 157 aig pins, 133 and gates
current: 19 endpoints, try 1 parts
after merging: 1 parts

Flattening:

Built script for 48 blocks, reg/io state size 133, sram size 0, script size 30208
Assertion: cell=144, pos=4195 (word=131, bit=3), msg_id=144, type=None
Display: cell=24, enable_pos=5154 (word=161, bit=2), format='...', args=[...]

VCD Reading:

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present

Performance Characteristics

Speedup vs CPU

Simple designs: 5-10X faster
Complex designs: 10-40X faster
Depends on:
- Number of GPU SMs (streaming multiprocessors)
- Partition granularity
- VCD I/O overhead

Resource Scaling

GPU Block Count: Set NUM_BLOCKS to 2× number of GPU SMs

Apple M4 Pro: 48 blocks (24 SMs × 2)
NVIDIA GPUs: Check SM count with nvidia-smi

Memory Usage:

State buffer: num_blocks × state_size × num_cycles × 4 bytes
Script: script_size × 4 bytes (shared across blocks)
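The two formulas and the block-count rule of thumb translate directly into code. A minimal sketch, assuming state_size is counted in 32-bit words (which is what the × 4 bytes factor suggests); the function names are illustrative:

```rust
/// NUM_BLOCKS rule of thumb: two blocks per SM.
fn blocks_for_sms(sm_count: u64) -> u64 {
    2 * sm_count
}

/// State buffer: num_blocks × state_size × num_cycles × 4 bytes.
fn state_buffer_bytes(num_blocks: u64, state_size_words: u64, num_cycles: u64) -> u64 {
    num_blocks * state_size_words * num_cycles * 4
}

/// Script: script_size × 4 bytes, shared across blocks.
fn script_bytes(script_size: u64) -> u64 {
    script_size * 4
}
```

For the 48-block, 5-word example used elsewhere in this document, 1000 cycles of state come to 960000 bytes and the 30208-instruction script to 120832 bytes.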

Known Issues and Limitations

1. VCD Hierarchy Mismatch

Issue: GEM expects flat VCD signal hierarchy
Impact: Missing input signals cause incorrect simulation results
Workaround: Generate VCD with $dumpvars(1, sig1, sig2, ...) at top level
Status: Under investigation

2. Complex FSM Designs

Issue: Some FSM designs don't simulate correctly even with proper VCD
Example: safe.v (9-state PIN cracker FSM)
Possible Causes:

Synthesis optimization changes FSM encoding
Initial state handling differences
Reset timing issues

Status: Identified through third-party test suite

3. No Latch or Asynchronous Sequential Logic Support

Issue: Jacquard only supports edge-triggered D flip-flops (DFFs) as sequential elements. A raw LATCH cell left in the combinational/sequential logic, latch-based designs (SR latches, transparent latches, master-slave latch pairs), and asynchronous sequential logic (self-timed feedback) are not supported. Asynchronous set/reset on flip-flops is supported (it lowers to an AIG overlay — wire_dff_reset_set_overlay), so "async reset" is not the restriction.

Supported via dedicated paths: two common latch-derived structures are not affected by this limitation —

Clock gating is handled by the CKLNQD integrated clock-gating cell (an ICG is internally a latch + AND, identified as a gated clock rather than rejected; aig.rs:845). Replace RTL clock gates with CKLNQD instantiations.
Latch-based register files / memory (a common area-saving pattern) are recognised by the memory-synthesis step (Yosys memory_libmap) and mapped to $__RAMGEM_SYNC_ SRAM cells that Jacquard simulates (aig.rs:830) — not left as raw latches.

Impact: Designs using latches in the logic will either:

Fail during AIG conversion (unrecognized cell type)
Be silently treated as combinational logic (incorrect simulation)

What this means in practice:

Gate-level netlists must be synthesized to a DFF-only cell library (AIGPDK or SKY130)
CVC's built-in test suite (tests_and_examples/install.test/) uses NAND-latch flip-flops (e.g., dfpsetd.v, sdfia04.v) and cannot be used as Jacquard reference tests
Self-timed designs with internal clock generation (e.g., CVC's das_lfsr benchmark) are also unsupported

What would be needed to support latches:

New DriverType variant: Add Latch(enable, data) to DriverType in aig.rs, representing a level-sensitive storage element
Two-phase evaluation: Latches are transparent when enabled, requiring evaluation within a clock phase rather than only at clock edges. The current cycle-based simulation model (evaluate all combinational logic, then capture DFF outputs) would need to iterate until latch outputs stabilize
AIG conversion: Map latch library cells (e.g., SKY130 dlxtp) to the new Latch driver, identifying enable and data pins
GPU kernel changes: The writeout stage currently uses clken_perm for DFF clock gating. Latches would need a different mechanism: while enable is high, output tracks input continuously rather than capturing on an edge
Timing: Latch timing is more complex — setup/hold is relative to the enable edge, and time borrowing across latch boundaries is a key use case in high-performance designs
Convergence: Combinational loops through transparent latches must be detected and iterated to a fixed point, or flagged as errors

Complexity estimate: Moderate-to-high. The main challenge is the evaluation model change — DFF-only simulation is a clean "capture at edge" model, while latches require iterative evaluation within clock phases.

Status: Not planned. Jacquard targets synthesis flows that produce DFF-only netlists.

4. Format String Preservation

Issue: Yosys synthesis may not preserve gem_format attributes
Impact: Display messages show placeholders instead of actual format strings
Workaround: Extract format strings from pre-synthesis JSON
Status: Tool limitation, not GEM bug

Investigation Methodology

This documentation was created through systematic investigation:

Structure Analysis: Examined source code in src/aig.rs, src/flatten.rs, src/staging.rs
Debug Tracing: Used RUST_LOG=debug to capture internal state
Netlist Inspection: Analyzed synthesized .gv files with grep
VCD Comparison: Compared iverilog vs GEM VCD outputs
Test Case Development: Created minimal reproducible examples
Iterative Debugging: Progressively simplified designs to isolate issues

References

Main codebase: src/ directory
EDA infrastructure: vendor/eda-infra-rs/ submodule (netlistdb, vcd-ng, ulib)
AIGPDK library: aigpdk/ directory
Test cases: tests/ directory
Third-party examples: tests/regression/third_party/

Document Version: 1.0
Last Updated: 2025-01-08
Authors: NVIDIA GEM Team + Claude Code Investigation

Timing Correctness Requirements

Status: Draft — under review.

This document is the contract for Jacquard's timing correctness story. It defines what must be true; ADRs under docs/adr/ define how; plans under docs/plans/ execute them.

Scope

Inherits constraints and non-goals from project-scope.md — in particular the permissive-license requirement for linked code and the synchronous-only design assumption.

Addresses three known weaknesses from the April 2026 architecture review:

Timing-model pessimism from the packed-32 ALU (no single fix; needs refinement path for sign-off use).
Weak correctness observability (CPU reference and GPU kernel share Rust source, so representation bugs pass both sides).
Hand-rolled SDF parser fragility on real post-P&R output from production tools.

Out of scope (tracked elsewhere, addressed later): hard schema limits (no latches, static VCD), three-backend maintenance cost, GPU-side IO model scaling, silent-failure engineering patterns beyond parsers, partitioner non-determinism.

Principles

P1 — OpenSTA is the oracle

When Jacquard's results disagree with OpenSTA, Jacquard is wrong until demonstrated otherwise. In the shipped release, OpenSTA is never invoked from the jacquard runtime binary, and never linked (it is GPL; Jacquard stays permissive). Subprocess invocation from CI, test harnesses, and the standalone opensta-to-ir preprocessing tool is acceptable. During development before first release, a runtime subprocess invocation may exist as a contributor-ergonomics convenience (see ADR 0006) — this is explicitly interim and is removed before release. Divergence is never accepted silently: it is fixed, explicitly justified in-doc, or filed as a bug.

P2 — No single parse path is its own reference

Every format we consume (SDF, Liberty, SPEF, Verilog) is validated against at least one third-party tool's parse of the same file. This is the primary defense against representation bugs that would otherwise affect Jacquard's primary and reference paths simultaneously.

P3 — Fail loud on silent failures

Every parser and pipeline stage emits a success count (cells parsed, arcs matched, annotations applied). The test harness asserts thresholds. Zero-match is never silently acceptable. A pipeline that quietly succeeds with the wrong data is a bug.
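As an illustration of P3, a harness-side check might look like the sketch below. StageCount, check_stage, and the threshold value are hypothetical; the principle only requires that counts are emitted, thresholds asserted, and zero-match rejected:

```rust
/// Hypothetical per-stage tally, e.g. arcs matched out of arcs seen.
struct StageCount {
    name: &'static str,
    matched: usize,
    total: usize,
}

/// Reject zero-match outright, then enforce a minimum match ratio.
fn check_stage(c: &StageCount, min_ratio: f64) -> Result<(), String> {
    if c.matched == 0 {
        return Err(format!("{}: zero matches out of {}", c.name, c.total));
    }
    let ratio = c.matched as f64 / c.total as f64;
    if ratio < min_ratio {
        return Err(format!(
            "{}: matched {}/{} ({:.1}%), below threshold",
            c.name, c.matched, c.total, 100.0 * ratio
        ));
    }
    Ok(())
}
```

The zero-match test comes first so that an empty input (0 of 0) is reported as a failure instead of passing a ratio check vacuously.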

P4 — Multi-corner from day one

Any timing-value representation natively supports multiple PVT corners. Single-corner shortcuts become tech debt the moment commercial flows arrive; the shape is enforced upfront.

Functional requirements

R1 — Timing IR

A canonical intermediate representation (IR) for SDF-equivalent timing annotations exists. It is:

Schema-versioned with compatible evolution rules.
Lossless for the subset of information Jacquard consumes downstream.
Preserves vendor-specific annotations as typed passthrough (never silently dropped).
Multi-corner by default.
Tags each arc with provenance: source tool, source file, and origin category (asserted / computed / defaulted).

Details: docs/adr/0002-timing-ir.md.

R2 — In-process reference STA — deferred

The original requirement was an in-process STA engine that computes per-endpoint arrival/slack from Liberty + SPEF, linked directly (subject to the permissive-license constraint in project-scope.md), cross-checking Jacquard's SDF-derived timing at load time and on demand during sim.

Preferred implementation was OpenTimer; the SKY130 spike (docs/spikes/opentimer-sky130.md) found OpenTimer's input pipeline unfit for OpenROAD-flow outputs, and ADR 0003 was Superseded (commit d002bde). The cross-check role is now performed out of process by OpenSTA via opensta-to-ir (ADR 0001 — sole STA path); the in-process variant is parked until a fit-for-purpose permissive option appears (libreda-sta or in-house walker, both behind a future ADR).

R3 — Oracle-backed CI

The OpenSTA test corpus is vendored (or submoduled) into the repository and used as a regression fixture. Every IR converter runs against the corpus; converter-to-converter diffs must be explained or fixed before merge.

Jacquard's own regression designs are run through OpenSTA (subprocess) and compared against Jacquard's output. Runs nightly or pre-release, not per-PR, due to runtime cost.

For any sim run that reports timing violations, or for any user-requested critical-path report, top-K paths are reported with:

Full per-stage path trace with per-edge delay.
Pessimism delta: the gap between the packed-thread max arrival and the actual arrival along this specific path (exposes the magnitude of the packed-32 ALU's worst-case accounting).
Provenance of each delay (SDF-asserted / Liberty-computed / defaulted).

R5 — Private PDK testing

A private test track for commercial PDKs exists, gated on per-PDK environment variables (e.g. <VENDOR>_PDK_PATH). Tests skip cleanly when PDK files are unavailable; CI runs with PDK access execute them. No PDK-derived artifacts are committed. Details: docs/adr/0004-private-pdk-testing.md.

Non-functional requirements

N1 — Startup parse speed

IR consumption at sim startup is effectively O(1) in IR size (binary, memory-mapped, zero-copy). The expensive SDF→IR conversion is a one-time preprocessing step, not an every-sim cost.

N2 — Reproducibility

Given identical inputs and tool versions, every converter produces byte-identical IR. Tool version bumps may change output but must be explicit (recorded in IR metadata).

N3 — Modularity

Each converter (SDF→IR, Liberty→IR, OpenSTA→IR, reference-STA→IR) is testable in isolation, without the full Jacquard build. Converters are separate crates or binaries. Consumers of IR need not know which converter produced it.

Acceptance criteria — phase 0

Phase 0 (see docs/plans/phase-0-ir-and-oracle.md) is complete when:

Timing IR schema defined, with both a binary encoding and a JSON sidecar for human/CI diffs.
OpenSTA → IR converter implemented as a subprocess-driven tool.
Jacquard's existing SDF parser emits IR alongside its native representation.
CI harness runs both converters on the OpenSTA test corpus and reports structured diffs.
At least one representative Jacquard test design produces matching IR between both paths, within a declared tolerance.
Parser-success assertions (per P3) in place for the SDF and Liberty paths.

Phases 1 and beyond are planned at the start of each phase, not all up front.

Open questions

Items not settled by this document; they resolve in ADRs, spike outcomes, or phase plans:

Exact IR schema format (FlatBuffers / Cap'n Proto / other). Tracked in ADR 0002.
~~Whether OpenTimer handles SKY130 Liberty robustly.~~ Resolved: spike failed Q2 on SKY130; ADR 0003 Superseded.
Whether SPEF gets its own IR or is embedded in the timing IR. Deferred; likely separate.
Whether the IR is Jacquard-local or shared across a broader tooling ecosystem. External decision; answer affects investment level and schema stability requirements.

Supersession

This document supersedes the "±5% arrival tolerance" convention from docs/timing-validation.md once phase 0 ships. Until then, the existing convention remains in effect for designs not yet covered by oracle-backed CI.

References

docs/simulation-architecture.md — current pipeline.
docs/timing-simulation.md — current timing-sim usage.
docs/timing-validation.md — current validation methodology (to be superseded).
docs/adr/ — decisions executed through this document.
docs/plans/ — phased implementation.
docs/spikes/ — time-boxed experiments.

Last updated: 2026-04-23

Timing Simulation in GEM

See also: timing-correctness.md — forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.

This document explains GEM's boomerang evaluation architecture and how timing simulation with per-gate delays can be implemented efficiently on GPU.

Current status (what ships today)

Phase 1 closed 2026-05-02; see plans/post-phase-0-roadmap.md.

Capability	Status
Liberty parsing	✅
SDF back-annotation via `opensta-to-ir`	✅
Per-DFF clock-arrival folding (Pillar B Stages 1+2)	✅
GPU-side setup/hold violation detection	✅ Metal, CUDA, HIP (`sim`); the `cosim` path is a follow-up
Symbolic violation messages (`top/cpu/regs[7][bit 22] [word=42]`)	✅ Metal, CUDA, HIP (`sim`)
`--timing-report <path.json>` structured end-of-run report	✅ Metal, CUDA, HIP (`sim`); not yet wired for `cosim`
`--timing-summary` human-readable text summary	✅ Metal, CUDA, HIP (`sim`); not yet wired for `cosim`
OpenSTA detection + version check	✅
Multi-corner timing IR (`--timing-corner <name>`)	✅
`--sdf-corner` (min/typ/max selection from one SDF)	⚠ One corner at a time
Per-receiver wire delay (Pillar C Tier 1)	❌ Phase 2 (blocked on ADR 0007)

What "HIP" means above. HIP is two quite different things, and for most of this project's life the table only ever meant the first:

HIP-over-CUDA — hip-runtime-nvidia, built with the CUDA toolkit and executed on an NVIDIA GPU (HIP Tests (NVIDIA backend), tesla4-runner). Source-compatible with CUDA and, in practice, CUDA semantics throughout.
HIP on real ROCm — an actual AMD GPU (HIP Tests (ROCm backend)). This had never been built, let alone run, until 2026-07-15, and sim did not run there at all until the non-cooperative fallback landed: the device has no cooperative launch, so the grid-wide barrier sim relies on is unavailable and the host drives one launch per (cycle, stage) instead. See spikes/amd-laptop-backend.md.

Both are now covered by CI and both pass, so the ✅s above are accurate for each. The distinction still matters when reading a green tick: it is what hid #203 — an out-of-bounds read that nvcc silently truncated back into range, so CUDA and HIP-over-CUDA were accidentally correct while ROCm faulted. Arrival/violation arithmetic itself is host-side Rust (#195) and therefore backend-independent.

See timing-violations.md for the full violation-output interface and why-jacquard.md for positioning.

Background: The Simulation Challenge

GEM simulates And-Inverter Graphs (AIGs) where every node is either:

A primary input (value comes from VCD stimulus)
An AND gate with two inputs (possibly inverted)

Traditional simulation evaluates gates in topological order, which is inherently serial. GPUs excel at massive parallelism - thousands of threads doing the same operation on different data. GEM bridges this gap with the boomerang architecture.
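For reference, the serial baseline is only a few lines. A minimal sketch of topological AIG evaluation on the CPU, assuming nodes are already stored in topological order; the Node and Lit types are illustrative and are not Jacquard's own data structures:

```rust
/// A fan-in reference: (node index, inverted?).
type Lit = (usize, bool);

enum Node {
    Input,
    And(Lit, Lit),
}

/// Evaluate nodes in index order. Assumes every AND only references
/// lower-numbered nodes, so one forward pass is sufficient.
fn eval_aig(nodes: &[Node], inputs: &[bool]) -> Vec<bool> {
    let mut values: Vec<bool> = Vec::with_capacity(nodes.len());
    let mut next_input = inputs.iter();
    for node in nodes {
        let v = match node {
            Node::Input => *next_input.next().expect("one value per input node"),
            Node::And((a, inv_a), (b, inv_b)) => {
                (values[*a] ^ *inv_a) & (values[*b] ^ *inv_b)
            }
        };
        values.push(v);
    }
    values
}
```

Each AND depends on values computed earlier in the same pass, which is the serial dependency the boomerang structure is designed to break up.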

Boomerang Evaluation

Core Concept

The boomerang structure is a hierarchical reduction tree that maps an AIG onto GPU threads. It's called "boomerang" because data flows down the tree during reduction, then results are written back out at various levels - like a boomerang going out and returning.

Hierarchy Structure

GEM uses BOOMERANG_NUM_STAGES = 13, meaning the tree has 2^13 = 8192 leaf positions:

Level 0 (inputs):   8192 positions
Level 1:            4096 positions  (8192 / 2)
Level 2:            2048 positions
Level 3:            1024 positions
Level 4:             512 positions
Level 5:             256 positions
Level 6:             128 positions
Level 7:              64 positions
Level 8:              32 positions
Level 9:              16 positions
Level 10:              8 positions
Level 11:              4 positions
Level 12:              2 positions
Level 13 (output):     1 position

Each level halves the number of positions by computing AND gates that combine pairs.

Thread Organization

A GPU block has 256 threads (threadIdx.x = 0..255). Each thread holds a 32-bit word where each bit represents an independent Boolean signal:

Thread 0:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
Thread 1:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
...
Thread 255: [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
            ─────────────────────────────
            Total: 256 × 32 = 8192 signals per level

Thread position refers to threadIdx.x - which of the 256 threads we're addressing. Each thread position processes 32 signals in parallel using SIMD operations.

Memory Layout

__shared__ u32 shared_metadata[256];   // Partition configuration
__shared__ u32 shared_writeouts[256];  // Output staging area
__shared__ u32 shared_state[256];      // Working state (8192 bits)

The shared_state array holds the current level's values during reduction.

The Reduction Process

Phase 1: Level 0 → Level 1 (hier[0])

Only threads 128-255 are active. Each computes 32 AND gates in parallel:

if(threadIdx.x >= 128) {
    u32 hier_input_a = shared_state[threadIdx.x - 128];  // From threads 0-127
    u32 hier_input_b = hier_input;                        // This thread's data

    // 32 AND gates computed simultaneously (one per bit)
    u32 ret = (hier_input_a ^ hier_flag_xora) &
              ((hier_input_b ^ hier_flag_xorb) | hier_flag_orb);

    shared_state[threadIdx.x] = ret;
}

The xora, xorb, and orb flags encode:

xora/xorb: Input inversions (for AND-inverter graph)
orb: Passthrough mode (when output equals input A, skip the AND)
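The effect of the three flags is easiest to see on the CPU. The sketch below restates the kernel expression as a plain Rust function (packed_and is an illustrative name); each bit position is an independent gate with its own flag bits:

```rust
/// 32 AND-inverter gates at once, one per bit of the word.
/// xora / xorb invert the respective input; a set orb bit forces the
/// B side to 1, so that bit of the result is just (a ^ xora).
fn packed_and(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}
```

With all flags zero this is a plain bitwise AND; with orb set to all ones and xora zero, the result is input A unchanged regardless of B.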

Visual representation:

Before:  [T0][T1]...[T127] [T128][T129]...[T255]
              │                  │
              └───────┬──────────┘
                      │
                   AND gates (128 threads × 32 bits = 4096 gates)
                      │
                      ▼
After:   [----unused----] [T128][T129]...[T255]
                          (128 × 32 = 4096 results)

Phase 2: Levels 1-3 (Shared Memory)

for(int hi = 1; hi <= 3; ++hi) {
    int hier_width = 1 << (7 - hi);  // 64, 32, 16
    if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
        u32 hier_input_a = shared_state[threadIdx.x + hier_width];
        u32 hier_input_b = shared_state[threadIdx.x + hier_width * 2];
        u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        shared_state[threadIdx.x] = ret;
    }
    __syncthreads();  // Barrier between levels
}

Each level activates fewer threads:

Level 1: threads 64-127 (64 threads → 2048 gates)
Level 2: threads 32-63 (32 threads → 1024 gates)
Level 3: threads 16-31 (16 threads → 512 gates)
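The index arithmetic for these three levels can be checked with a serial CPU model. This is a sketch that follows the loop shown above with all flags taken as zero (plain AND); reduce_level is an illustrative name, and the model does not reproduce the kernel's barriers:

```rust
/// CPU model of one shared-memory reduction level (hi in 1..=3).
/// Returns the number of gates evaluated at this level.
fn reduce_level(state: &mut [u32; 256], hi: u32) -> usize {
    let width = 1usize << (7 - hi); // 64, 32, 16
    for t in width..2 * width {
        // Reads touch indices >= 2 * width and writes touch indices
        // < 2 * width, so a serial loop sees the same inputs as the
        // parallel kernel does.
        state[t] = state[t + width] & state[t + 2 * width];
    }
    width * 32
}
```

Running it for hi = 1, 2, 3 reproduces the 2048, 1024, and 512 gate counts listed above, with each level's results landing in the index range directly below its inputs.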

Phase 3: Levels 4-7 (Warp Shuffle)

Within a single warp (32 threads), data exchange uses fast shuffle instructions instead of shared memory:

if(threadIdx.x < 32) {
    for(int hi = 4; hi <= 7; ++hi) {
        int hier_width = 1 << (7 - hi);  // 8, 4, 2, 1
        u32 hier_input_a = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width);
        u32 hier_input_b = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width * 2);
        if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
            tmp_cur_hi = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        }
    }
}

No synchronization needed - warp shuffle is implicitly synchronized.

Phase 4: Levels 8-12 (Bit Operations)

The final levels operate on bits within a single u32, computed by thread 0 only:

if(threadIdx.x == 0) {
    // Level 8: 32 → 16 (operates on upper/lower halves)
    u32 r8 = ((v1 << 16) ^ xora) & ((v1 ^ xorb) | orb) & 0xffff0000;

    // Level 9: 16 → 8
    u32 r9 = ((r8 >> 8) ^ xora) & (((r8 >> 16) ^ xorb) | orb) & 0xff00;

    // Level 10: 8 → 4
    u32 r10 = ((r9 >> 4) ^ xora) & (((r9 >> 8) ^ xorb) | orb) & 0xf0;

    // Level 11: 4 → 2
    u32 r11 = ((r10 >> 2) ^ xora) & (((r10 >> 4) ^ xorb) | orb) & 0b1100;

    // Level 12: 2 → 1
    u32 r12 = ((r11 >> 1) ^ xora) & (((r11 >> 2) ^ xorb) | orb) & 0b10;

    tmp_cur_hi = r8 | r9 | r10 | r11 | r12;
}

Write-Outs

Results are captured at various levels (not just the final output) and written to global memory:

if((writeout_hook_i >> 8) == bs_i) {
    shared_writeouts[threadIdx.x] = shared_state[writeout_hook_i & 255];
}

This is the "return" part of the boomerang - results flow back from intermediate levels.

Timing Simulation Approaches

Approach Comparison

Approach	Parallelism	Memory	Accuracy	GPU Fit
Event-driven	Poor (serial queue)	Low	Exact	Bad
Time-wheel	Medium	High	Configurable	Medium
Levelized	Excellent	Low	Conservative	Best
Oblivious	Maximum	Very High	Exact	Wasteful

Recommended: Levelized with Delay Accumulation

This approach piggybacks on the existing boomerang structure with minimal changes.

Data Structure Addition

// Add to shared memory (256 bytes additional)
__shared__ u8 shared_arrival[256];  // One arrival time per thread position

Each thread position stores a single 8-bit arrival time representing the maximum arrival across all 32 bits in that position.

Modified AND Gate Evaluation

// Current (value only):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

// With timing (add ~4 instructions):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

u8 arr_a = shared_arrival[threadIdx.x - offset_a];
u8 arr_b = shared_arrival[threadIdx.x - offset_b];
u8 arr_ret = min(max(arr_a, arr_b) + GATE_DELAY, 255);  // Saturating add
shared_arrival[threadIdx.x] = arr_ret;
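The arrival update is a max followed by a saturating add. A CPU-side sketch of the same arithmetic, using the 8-bit arrival from this proposal; propagate_arrival is an illustrative name and the delay is passed in instead of being a compile-time GATE_DELAY:

```rust
/// Arrival at a gate output: the later of the two input arrivals plus
/// the gate delay, clamped at 255 instead of wrapping around.
fn propagate_arrival(arr_a: u8, arr_b: u8, gate_delay: u8) -> u8 {
    arr_a.max(arr_b).saturating_add(gate_delay)
}
```

Saturation matters for the conservative guarantee: a wrapped value would look like an early arrival, whereas a clamped one still reads as late.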

Complexity Analysis

Same number of kernel launches as zero-delay simulation
O(levels × cycles) - identical to current
~256 bytes additional shared memory per partition
Estimated 10-20% performance overhead

The Approximation Trade-off

What We Track

One arrival time per thread position (256 values) instead of per signal (8192 values).

Implications

If thread position 50 contains signals A, B, C with different true arrivals:

Signal A: 15ps (medium path)
Signal B: 23ps (longest path)
Signal C: 8ps  (shortest path)

We store only: arrival[50] = 23ps (the maximum).
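The cost of storing a single value can be quantified per signal. A minimal sketch, with packed_arrival as an illustrative helper that returns the stored arrival and how far it overstates each signal:

```rust
/// Arrival stored for one thread position (the max over its packed
/// signals) and the per-signal overestimate that results.
fn packed_arrival(true_arrivals_ps: &[u32]) -> (u32, Vec<u32>) {
    let stored = true_arrivals_ps.iter().copied().max().unwrap_or(0);
    let excess: Vec<u32> = true_arrivals_ps.iter().map(|&a| stored - a).collect();
    (stored, excess)
}
```

For the three signals above this gives a stored arrival of 23ps, with A overstated by 8ps, B exact, and C overstated by 15ps.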

Why This Works

Conservative: We might report false violations, but never miss real ones
Correlated signals: Signals at the same thread position are often topologically nearby with similar timing
Endpoint focus: We ultimately only care about arrivals at DFF D inputs

When Full Accuracy is Needed

For bit-accurate timing, you would need:

// 8KB additional shared memory (may exceed limits)
__shared__ u8 shared_arrival[256][32];  // Per-bit arrivals

This is feasible but significantly increases memory pressure and computation.

Implementation Phases

Phase 1: CPU Timing Analysis (Completed)

Liberty parser for delay extraction
Static timing analysis on AIG
CPU reference simulation with delays
Timing violation detection

Phase 2: Hybrid GPU+CPU (Completed)

GPU performs zero-delay value simulation
CPU performs timing analysis on results
Validates infrastructure without kernel changes

Phase 3: GPU Arrival Tracking (Completed)

Added shared_arrival[256] (u16) to Metal and CUDA kernels
Arrivals tracked during boomerang reduction at all hierarchy levels
Per-gate delays injected via script padding slots from SDF data
DFF timing constraint checking at cycle boundaries (setup/hold)
Timing-aware VCD output (--timed flag)
Validated against CVC reference simulator (88ps / 7.1% conservative overestimate)

Phase 4: Full Integration (Partial)

Timing violation events via event buffer (completed)
Per-cycle timing reports (completed)
Integration with output VCD (completed via --timed)
Timing-aware bit packing for reduced approximation error (future)

Conservative Timing Model: Sources of Overestimation

Jacquard's GPU timing is intentionally conservative — it may over-estimate arrival times but will never under-estimate them. This is important for setup violation detection: false positives are safe, false negatives would miss real bugs.

There are three independent sources of conservatism, each adding to the overestimate:

Source 1: max(rise, fall) per cell

The GPU kernel tracks a single u16 arrival per thread position. It cannot distinguish between rising and falling signal transitions because each thread processes 32 packed Boolean signals simultaneously — there's no per-bit transition direction available.

How it works: For each cell, inject_timing_to_script() computes:

// Take the slower of the two transition delays for this pin.
let delay = gate_delays[pin].rise_ps.max(gate_delays[pin].fall_ps);

Impact: For the SKY130 inv_chain test (16 inverters), rise delays average ~10ps larger than fall delays. In a real inverter chain, transitions alternate (rise→fall→rise), so half the cells use the smaller fall delay. Jacquard uses the larger rise delay for all.

Measured: 80ps overestimate on 1235ps (6.5%) for 16 inverters with ~10ps rise/fall asymmetry per cell.
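The arithmetic behind this kind of overestimate can be reproduced with a toy model. The sketch below uses hypothetical uniform delays (60ps rise, 50ps fall) and ignores wire delay, so it illustrates the mechanism only and is not the measured SKY130 data; the function names are illustrative:

```rust
/// Actual chain delay: each inverter flips the edge, so rise and fall
/// delays alternate along the chain.
fn chain_actual_ps(stages: usize, rise_ps: u32, fall_ps: u32, first_output_rises: bool) -> u32 {
    (0..stages)
        .map(|i| {
            let rises = (i % 2 == 0) == first_output_rises;
            if rises { rise_ps } else { fall_ps }
        })
        .sum()
}

/// Conservative chain delay: every stage charged max(rise, fall).
fn chain_conservative_ps(stages: usize, rise_ps: u32, fall_ps: u32) -> u32 {
    stages as u32 * rise_ps.max(fall_ps)
}
```

With 16 stages and a 10ps asymmetry, the conservative figure exceeds the alternating one by 8 × 10ps = 80ps, whichever edge enters the chain.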

Source 2: max wire delay across all input pins

For multi-input cells (AND gates, MUXes), INTERCONNECT delays to different input pins may differ significantly. Jacquard takes the maximum across all input pins:

// wire_delays_per_cell: dest_cellid → max(all input wire delays)
entry.rise_ps = entry.rise_ps.max(ic.delay.rise_ps);
entry.fall_ps = entry.fall_ps.max(ic.delay.fall_ps);

Impact: If an AND gate has input A arriving via a 10ps wire and input B via a 200ps wire, Jacquard assigns 200ps to the cell regardless of which input is on the critical path. An event-driven simulator would correctly propagate the 10ps arrival on input A independently.

When this matters: Designs with highly asymmetric routing (e.g., one input is local, another crosses the chip). Well-routed designs typically have balanced wire delays to multi-input cells.
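
A minimal sketch of the per-cell fold, to make the 10ps vs 200ps example concrete. The struct mirrors the rise_ps/fall_ps fields used above; the function names are hypothetical and the real code accumulates into a map keyed by cell ID rather than taking a slice.

```rust
/// Rise/fall delay pair in picoseconds, mirroring the fields used above.
#[derive(Clone, Copy, Debug, PartialEq, Default)]
struct PackedDelay {
    rise_ps: u16,
    fall_ps: u16,
}

/// Fold all INTERCONNECT delays of one cell into a single value by taking
/// the element-wise maximum over its input pins.
fn max_wire_delay(input_wires: &[PackedDelay]) -> PackedDelay {
    input_wires.iter().fold(PackedDelay::default(), |mut acc, w| {
        acc.rise_ps = acc.rise_ps.max(w.rise_ps);
        acc.fall_ps = acc.fall_ps.max(w.fall_ps);
        acc
    })
}

/// Pessimism added to one input pin: the delay assigned to the cell minus
/// that pin's own wire delay (rise edge only, for brevity).
fn rise_pessimism_ps(assigned: PackedDelay, pin: PackedDelay) -> u16 {
    assigned.rise_ps - pin.rise_ps
}
```

With input A at 10ps and input B at 200ps, the cell is assigned 200ps, so a path through A carries 190ps of pessimism while a path through B carries none.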

Source 3: max arrival across 32 packed signals per thread

Each thread position holds 32 independent Boolean signals. Jacquard tracks one arrival per thread position (the maximum across all 32 signals):

Thread 50: [signal_A: 5ps, signal_B: 23ps, signal_C: 8ps, ...]
Tracked:   arrival[50] = 23ps (max of all 32)

Impact: If signals with very different timing are packed into the same thread, the fastest signals inherit the slowest signal's arrival time.

Mitigation: The bit-packing algorithm can sort signals by estimated timing before assignment (see "Timing-Aware Bit Packing" section). This keeps similar-timing signals together, reducing the max approximation error.
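
The per-thread maximum and the pessimism it introduces for each packed signal can be sketched as follows. This is an illustration of the approximation only, with hypothetical function names; the kernel never sees per-signal arrivals.

```rust
/// Arrival tracked for one thread position: the maximum over its packed signals.
fn tracked_arrival_ps(signal_arrivals_ps: &[u16]) -> u16 {
    signal_arrivals_ps.iter().copied().max().unwrap_or(0)
}

/// Per-signal overestimate introduced by sharing one arrival value.
fn per_signal_pessimism_ps(signal_arrivals_ps: &[u16]) -> Vec<u16> {
    let tracked = tracked_arrival_ps(signal_arrivals_ps);
    signal_arrivals_ps.iter().map(|&a| tracked - a).collect()
}
```

For the three signals shown above (5ps, 23ps, 8ps), the tracked arrival is 23ps and the per-signal overestimates are 18ps, 0ps and 15ps.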

Combined Effect

These sources are additive: each contributes its own overestimate to the total. For the inv_chain test:

Source	Contribution	Notes
max(rise, fall)	+80ps	8 inverters × 10ps asymmetry
max wire delay	+8ps	8 wires × 1ps asymmetry
max per thread	0ps	Only 1 signal per thread in this test
Total overestimate	88ps / 7.1%	vs CVC transition-accurate result

For larger designs with more routing asymmetry and denser bit packing, the combined overestimate could be larger. The bit-packing sort (Source 3) is the most actionable mitigation.

CVC Reference Validation

The inv_chain design (2 DFFs + 16 SKY130 inverters) was validated against CVC (open-src-cvc), an event-driven Verilog simulator with native SDF back-annotation:

CVC:  clk_to_q=350ps  chain=885ps  total=1235ps  (transition-accurate)
Jacquard: clk_to_q=350ps  chain=973ps  total=1323ps  (conservative max)
Difference: 88ps (7.1% overestimate)

Both simulators agree on CLK→Q delay (350ps) because the DFF has a single output transition direction per clock edge. The chain delay differs because CVC tracks actual rise/fall polarity through each inverter.

To run the CVC comparison locally:

bash tests/timing_test/cvc/run_cvc.sh

Requires Docker (builds CVC from source on first run).

Delay Data Encoding

Script Format

The existing boomerang section has padding that can store delay data:

Current format per thread per stage:
  [xora: u32]
  [xorb: u32]
  [orb:  u32]
  [padding: u32]  ← Can store delay here

PackedDelay Structure

#[repr(C)]
pub struct PackedDelay {
    pub rise_ps: u16,  // Rising edge delay in picoseconds
    pub fall_ps: u16,  // Falling edge delay in picoseconds
}

For simplified timing, a single uniform delay constant can be used instead of per-gate delays.
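
One way the two u16 fields could be packed into the spare u32 slot is sketched below. The bit layout (rise in the high half, fall in the low half) and the saturating conversion are assumptions made for illustration; the source does not specify either.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PackedDelay {
    pub rise_ps: u16,
    pub fall_ps: u16,
}

impl PackedDelay {
    /// Pack into one u32. ASSUMED layout: rise in bits 31..16, fall in bits 15..0.
    pub fn to_word(self) -> u32 {
        ((self.rise_ps as u32) << 16) | (self.fall_ps as u32)
    }

    /// Inverse of `to_word`, under the same assumed layout.
    pub fn from_word(word: u32) -> Self {
        PackedDelay {
            rise_ps: (word >> 16) as u16,
            fall_ps: (word & 0xFFFF) as u16,
        }
    }

    /// Convert a picosecond value to u16, saturating at 65535 ps
    /// (an assumed policy for delays that exceed the u16 range).
    pub fn saturate_ps(ps: u64) -> u16 {
        ps.min(u16::MAX as u64) as u16
    }
}
```

Saturating rather than wrapping keeps an out-of-range delay pessimistic instead of silently turning it into a small value.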

Timing Violation Detection

At Each Cycle Boundary

The GPU kernel checks timing constraints per state word (32 signals) after the boomerang evaluation completes. Arrivals and constraints use u16 picosecond values (range 0–65535 ps). Arithmetic is performed in u32 to avoid overflow when summing arrival + setup:

// After boomerang completes, before next cycle
// arrival: u16 max accumulated delay for this 32-signal group
// constraint_word: packed [setup_ps:16][hold_ps:16]
u16 setup_ps = constraint_word >> 16;
u16 hold_ps  = constraint_word & 0xFFFF;

// Setup check: skip when arrival == 0 (no data propagated, e.g. first cycle
// or DFF with constant inputs)
if (arrival > 0 && (u32)arrival + (u32)setup_ps > clock_period_ps) {
    int slack = (int)clock_period_ps - (int)arrival - (int)setup_ps;
    write_event(event_buffer, EVENT_TYPE_SETUP_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)setup_ps);
}

// Hold check: no arrival > 0 guard (hold violations matter even at cycle 0)
if ((u32)arrival < (u32)hold_ps) {
    int slack = (int)arrival - (int)hold_ps;
    write_event(event_buffer, EVENT_TYPE_HOLD_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)hold_ps);
}
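
The same checks can be mirrored on the host, which is handy for unit-testing expected slack values. The sketch below follows the kernel logic shown above; the `Violation` enum and `check_word` function are hypothetical names, not part of Jacquard's API.

```rust
#[derive(Debug, PartialEq)]
enum Violation {
    Setup { slack_ps: i32 },
    Hold { slack_ps: i32 },
}

/// Host-side mirror of the per-word check. `constraint_word` is packed as
/// [setup_ps:16][hold_ps:16], matching the kernel snippet above.
fn check_word(arrival: u16, constraint_word: u32, clock_period_ps: u32) -> Vec<Violation> {
    let setup_ps = (constraint_word >> 16) as u16;
    let hold_ps = (constraint_word & 0xFFFF) as u16;
    let mut found = Vec::new();

    // Setup: skipped when arrival == 0; the sum is widened to u32 first.
    if arrival > 0 && u32::from(arrival) + u32::from(setup_ps) > clock_period_ps {
        let slack_ps = clock_period_ps as i32 - i32::from(arrival) - i32::from(setup_ps);
        found.push(Violation::Setup { slack_ps });
    }

    // Hold: no arrival > 0 guard, so this can fire on cycle 0.
    if arrival < hold_ps {
        let slack_ps = i32::from(arrival) - i32::from(hold_ps);
        found.push(Violation::Hold { slack_ps });
    }
    found
}
```

With a 1000ps clock, an arrival of 900ps against a 200ps setup constraint gives a setup slack of -100ps, and an arrival of 10ps against a 50ps hold constraint gives a hold slack of -40ps.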

Event Buffer Integration

pub enum EventType {
    Stop = 0,
    Finish = 1,
    Display = 2,
    AssertFail = 3,
    SetupViolation = 4,   // Timing events
    HoldViolation = 5,
}

For full details on interpreting violation reports and tracing violations to source signals, see docs/timing-violations.md.

Timing-Aware Bit Packing

The Problem

Each thread position holds 32 signals packed into a u32. When tracking timing with one arrival value per thread position, we approximate all 32 signals as having the same arrival time (the maximum).

This approximation is accurate when signals in the same thread have similar timing. But the default placement algorithm uses first-fit for bit assignment:

// Default: first available slot
for i in 0..hier[selected_level].len() {
    if hier[selected_level][i] == usize::MAX {
        slot_at_level = i;  // First-fit, not timing-aware
        break;
    }
}

This can result in signals with very different timing sharing a thread:

Thread 50 (accidental grouping):
  bit 0: level 5,  ~5ps arrival
  bit 1: level 12, ~12ps arrival  ← 7ps difference!
  bit 2: level 6,  ~6ps arrival

Thread 50 (timing-aware grouping):
  bit 0: level 5, ~5ps arrival
  bit 1: level 5, ~5ps arrival    ← similar timing
  bit 2: level 6, ~6ps arrival

Current Timing Correlation

The placement algorithm already computes logic levels:

// Level = max(level of inputs) + 1
level[node] = max(level[input_a], level[input_b]) + 1;

Logic level correlates with timing (more levels = more gate delays), but signals at the same level can still have different actual delays due to:

Different gate types (AND2_00_0 vs AND2_11_1)
Different wire loads
Path reconvergence
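
As a self-contained illustration of the level recurrence, the sketch below computes levels over a toy AND-inverter graph. The `AndNode` layout and the requirement that nodes arrive in topological order are assumptions for the example, not a description of Jacquard's internal AIG representation.

```rust
/// One AND node of a toy AIG; inputs index earlier nodes (hypothetical layout).
struct AndNode {
    input_a: usize,
    input_b: usize,
}

/// Compute logic levels. Indices 0..num_inputs are primary inputs at level 0;
/// `ands[i]` is node `num_inputs + i` and must reference only earlier nodes.
fn compute_levels(num_inputs: usize, ands: &[AndNode]) -> Vec<u32> {
    let mut level = vec![0u32; num_inputs + ands.len()];
    for (i, node) in ands.iter().enumerate() {
        // Level = max(level of inputs) + 1
        level[num_inputs + i] = level[node.input_a].max(level[node.input_b]) + 1;
    }
    level
}
```

Two nodes can share a level here and still differ in real arrival time, which is why level is only a proxy for timing.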

Solution: Sort by Timing Before Packing

Before assigning bit positions, sort signals by their estimated arrival time:

// Collect nodes at this level
let mut nodes_to_place: Vec<_> = candidates
    .filter(|&n| level[n] == selected_level)
    .collect();

// Sort by arrival time (level as proxy, or actual timing if available)
nodes_to_place.sort_by_key(|&n| arrival_estimate[n]);

// Place in sorted order - similar timing ends up in same thread
for (slot, node) in nodes_to_place.iter().enumerate() {
    place_bit(..., slot, *node);
}
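
To see the effect of sorting in isolation, the following sketch groups signals into fixed-width threads with and without the sort and compares the worst arrival spread. It is a simplified model with hypothetical function names: real placement works per level and per hierarchy slot, and `width` must be non-zero (32 in Jacquard's case).

```rust
/// Group signal indices into threads of `width` bits after sorting by arrival.
fn pack_sorted(arrivals_ps: &[u64], width: usize) -> Vec<Vec<usize>> {
    let mut order: Vec<usize> = (0..arrivals_ps.len()).collect();
    order.sort_by_key(|&i| arrivals_ps[i]);
    order.chunks(width).map(|c| c.to_vec()).collect()
}

/// Group in original order, as a stand-in for first-fit placement.
fn pack_in_order(n: usize, width: usize) -> Vec<Vec<usize>> {
    let order: Vec<usize> = (0..n).collect();
    order.chunks(width).map(|c| c.to_vec()).collect()
}

/// Worst max-minus-min arrival spread over all threads.
fn worst_spread_ps(groups: &[Vec<usize>], arrivals_ps: &[u64]) -> u64 {
    groups
        .iter()
        .map(|g| {
            let max = g.iter().map(|&i| arrivals_ps[i]).max().unwrap_or(0);
            let min = g.iter().map(|&i| arrivals_ps[i]).min().unwrap_or(0);
            max - min
        })
        .max()
        .unwrap_or(0)
}
```

With arrivals of 5, 12, 6 and 12ps and a width of 2, in-order grouping has a worst spread of 7ps, while sorted grouping reduces it to 1ps.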

Alternative Approaches

Approach	Complexity	Effectiveness	When to Use
Sort by timing	Low	Good	Default choice
Timing-aware partitioning	High	Best	Large designs
Post-placement swapping	Medium	Good	Fine-tuning
Timing bands	Low	Moderate	Simple heuristic

Timing Bands

Group signals into arrival time bands:

Band 0: 0-10ps   → Threads 0-63
Band 1: 10-20ps  → Threads 64-127
Band 2: 20-30ps  → Threads 128-191
Band 3: 30+ps    → Threads 192-255
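
The band table above maps directly onto integer arithmetic. In this sketch the constants come from the table, while the half-open boundary handling (exactly 10ps falls in band 1) is an assumption, since the table leaves the boundaries ambiguous.

```rust
const BAND_WIDTH_PS: u64 = 10;
const NUM_BANDS: u64 = 4;
const THREADS_PER_BAND: u64 = 64;

/// Band index for an arrival; everything at or above 30ps lands in band 3.
/// ASSUMPTION: bands are half-open, so exactly 10ps belongs to band 1.
fn band_for(arrival_ps: u64) -> u64 {
    (arrival_ps / BAND_WIDTH_PS).min(NUM_BANDS - 1)
}

/// Thread range reserved for a band, e.g. band 1 -> 64..=127.
fn threads_for(band: u64) -> std::ops::RangeInclusive<u64> {
    let first = band * THREADS_PER_BAND;
    first..=(first + THREADS_PER_BAND - 1)
}
```

Fixed bands are cheap to compute but can leave threads unevenly filled when most signals fall into one band, which is one reason the table rates this approach as only moderately effective.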

Measuring Packing Quality

Diagnostic to measure timing variance per thread:

fn analyze_timing_packing(hier: &Hierarchy, arrivals: &[u64], threshold_ps: u64) {
    for thread in 0..256 {
        let times: Vec<u64> = get_bits_in_thread(hier, thread)
            .map(|b| arrivals[b])
            .collect();

        // max()/min() return Option; skip threads with no placed bits
        let (Some(&max), Some(&min)) = (times.iter().max(), times.iter().min()) else {
            continue;
        };
        let range = max - min;
        let variance = compute_variance(&times);

        if range > threshold_ps {
            warn!("Thread {} has {}ps timing spread (variance {})", thread, range, variance);
        }
    }
}

Impact on Approximation Accuracy

With timing-aware packing:

Reduced false positives: Fewer spurious timing violations from max approximation
Tighter bounds: Per-thread arrival closer to actual signal arrivals
Better critical path identification: Max arrival more accurately reflects true critical path

Performance Expectations

Metric	Zero-Delay	With Timing
Kernel launches	N	N
Shared memory	3KB	3.25KB
Registers	~32	~36
Instructions/gate	~5	~9
Estimated overhead	-	15-25%

The overhead is modest because:

Timing operations are simple (max, add)
Memory access pattern is identical
No additional synchronization needed
Same parallelism structure

References

src/pe.rs - Partition executor and boomerang stage construction
csrc/kernel_v1_impl.cuh - GPU kernel implementation
src/flatten.rs - Script generation with timing data
src/event_buffer.rs - GPU→CPU event communication
src/liberty_parser.rs - Timing library parsing

Timing Violation Detection

See also: timing-correctness.md — forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.

Guide to enabling, reading, and debugging setup/hold timing violations in Jacquard.

Overview

Setup and hold violations occur when data arrives too late (setup) or too early (hold) relative to the clock edge at a flip-flop. Jacquard checks for these violations during GPU simulation by tracking arrival times: the accumulated gate delay from primary inputs or DFF outputs through combinational logic to the next DFF data input.

Approximation model: Jacquard tracks one arrival time per 32-signal group (one GPU thread position). The arrival is the maximum across all 32 signals in the group. This is conservative for setup checks: it may over-report violations but will never miss a real setup violation. See Reducing False Positives for details.

Enabling Timing Checks

Prerequisites

SDF file with back-annotated delays from your place-and-route tool
Gate-level netlist synthesized to aigpdk.lib cells

Step-by-step

Generate SDF from your P&R tool (or use scripts/generate_sdf.py for test designs):
```
# Example: OpenROAD flow output
ls my_build/6_final.sdf
```

Run the simulator with --sdf and a clock period:

Metal (macOS):

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 1 \
    --sdf design.sdf \
    --sdf-corner typ

CUDA (NVIDIA):

cargo run -r --features cuda --bin jacquard -- sim \
    design.gv input.vcd output.vcd 8 \
    --sdf design.sdf \
    --sdf-corner typ \
    --enable-timing \
    --timing-clock-period 1200

cosim (co-simulation):

cargo run -r --features metal --bin jacquard -- cosim \
    design.gv \
    --config testbench.json \
    --sdf design.sdf \
    --sdf-corner typ

CLI Flags Reference

Flag	Binary	Description
`--sdf <path>`	all	Path to SDF file with back-annotated delays
`--sdf-corner <min\|typ\|max>`	all	Which SDF corner to use (default: `typ`)
`--sdf-debug`	all	Print unmatched SDF instances for debugging
`--enable-timing`	`jacquard sim`	Enable timing analysis (arrival + violation checks)
`--timing-clock-period <ps>`	`jacquard sim`	Clock period in picoseconds (default: 1000)
`--timing-report-violations`	`jacquard sim`	Report all violations, not just summary
`--timing-report <path.json>`	`jacquard sim`	Write a structured end-of-run JSON report (schema in `src/timing_report.rs`, ADR 0008).
`--timing-summary`	`jacquard sim`	Print a human-readable text summary at end of run. Independent of `--timing-report`; both can be combined.
`--timing-report-max-violations <N>`	`jacquard sim`	Cap on the per-cycle violations list in `--timing-report`. Default 100k. `0` = unbounded. Totals + worst-slack always reflect every event.
`--liberty <path>`	`jacquard sim`	Liberty library for timing data (optional, falls back to AIGPDK defaults)

Example: inv_chain_pnr Test Case

# Run with SDF timing
cargo run -r --features metal --bin jacquard -- sim \
    tests/timing_test/inv_chain_pnr/6_final.v \
    tests/timing_test/inv_chain_pnr/input.vcd \
    tests/timing_test/inv_chain_pnr/output.vcd 1 \
    --sdf tests/timing_test/inv_chain_pnr/6_final.sdf

Reading Violation Reports

Setup Violation Format

[cycle 42] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=5]: arrival=900ps setup=200ps slack=-100ps

(WS-P1.1.a, 2026-05-02: state-word indices are now resolved to symbolic hierarchical signal names. The bare [word=N] suffix is preserved for grep compatibility. Words packing more than 4 DFFs truncate with a +N more suffix.)

Field	Meaning
cycle	Simulation cycle where the violation occurred
word	State word index — identifies a group of 32 DFF data inputs
arrival	Maximum accumulated gate delay to this word's signals (picoseconds)
setup	DFF setup time constraint from SDF/Liberty (picoseconds)
slack	`clock_period - arrival - setup`. Negative = violation amount

Hold Violation Format

[cycle 11] HOLD VIOLATION at top/cpu/state[bit 3] [word=3]: arrival=10ps hold=50ps slack=-40ps

Field	Meaning
cycle	Simulation cycle where the violation occurred
word	State word index
arrival	Accumulated gate delay to this word's signals (picoseconds)
hold	DFF hold time constraint from SDF/Liberty (picoseconds)
slack	`arrival - hold`. Negative = violation amount

Summary Statistics

At the end of simulation, Jacquard prints totals:

Simulation complete: 1000 cycles, 5 setup violations, 0 hold violations

Text Summary (`--timing-summary`)

A one-screen human summary printed to stdout at end of run. Reuses the same data the JSON report builds (so --timing-report and --timing-summary cost the same; only the output channel differs). Sample output:

=== Jacquard Timing Summary ===
Design:        my_cpu.gv
Vectors:       boot.vcd (1000 cycles)
Clock period:  1000 ps
Timing source: my_cpu.jtir

Violations:
  Setup: 5
  Hold:  2
  Total: 7

Worst slack:
  Setup: -150ps  at top/cpu/regs[7][bit 22] [word=5]  (cycle 87)
  Hold:   -40ps  at top/cpu/state[bit 3] [word=12]  (cycle 91)
  Total negative: setup (TNS)=-620ps  hold (THS)=-80ps

Top 2 by violation count (of 2 total words with violations):
  top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
  top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps

The Worst slack block reports the four standard signoff metrics:

Metric	Meaning
WNS (Worst Negative Slack)	The single most-negative setup slack across the whole run — the worst setup path.
WHS (Worst Hold Slack)	The single most-negative hold slack.
TNS (Total Negative Slack)	Sum of all negative setup slacks (`stats.total_setup_slack_ps` in the JSON). Captures the aggregate severity, not just the worst path.
THS (Total Hold Slack)	Sum of all negative hold slacks (`stats.total_hold_slack_ps`).

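TNS/THS reflect every observed violation regardless of the per-cycle list cap. They can understate severity but never overstate it: if the GPU event buffer overflows, the dropped events would only make the true totals more negative than the reported ones, so check stats.events_dropped before treating the totals as exact.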
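
The relationship between per-event slacks and the four metrics can be sketched as below. The function name is hypothetical, and the per-event slack values in the usage note are invented so that they reproduce the sample totals; only the worst values and the sums appear in the sample output.

```rust
/// Worst and total negative slack over a list of per-event slacks (ps).
/// Non-negative slacks are ignored, so both results are <= 0 when present.
fn worst_and_total_negative_slack(slacks_ps: &[i32]) -> (Option<i32>, i64) {
    let negatives = slacks_ps.iter().copied().filter(|&s| s < 0);
    let worst = negatives.clone().min();
    let total: i64 = negatives.map(i64::from).sum();
    (worst, total)
}
```

Five setup events with slacks of -150, -140, -130, -110 and -90ps give WNS = -150ps and TNS = -620ps; two hold events at -40ps each give WHS = -40ps and THS = -80ps.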

The format is for human inspection — explicitly not a stable parseable contract. Tools that need to script against the data should use --timing-report JSON.

Structured JSON Report (`--timing-report <path.json>`)

For CI integration and downstream tooling, pass --timing-report <path> to get an end-of-run JSON document. The schema is versioned (ADR 0008's stability contract: additive-only extensions, breaking changes bump the major). Sample at tests/timing_ir/sample_reports/two_violations.json; authoritative type definitions in src/timing_report.rs.

Top-level shape:

{
  "schema_version": "1.0.0",
  "metadata": { "design": "...", "cycles_run": 1000, "clock_period_ps": 1000, "...": "..." },
  "stats": { "setup_violations": 5, "hold_violations": 0, "events_dropped": 0 },
  "violations": [
    { "cycle": 42, "kind": "setup", "word_id": 5, "site": "top/cpu/regs[7][bit 22] [word=5]",
      "arrival_ps": 900, "constraint_ps": 200, "slack_ps": -100 }
  ],
  "per_word": [
    { "word_id": 5, "site": "...", "setup_violations": 5, "hold_violations": 0,
      "worst_setup_slack_ps": -100, "worst_hold_slack_ps": null, "worst_arrival_ps": 900 }
  ],
  "worst_slack": {
    "setup": [ /* top-N most-negative slacks across the run */ ],
    "hold":  [ /* same shape */ ]
  }
}

per_word is sorted by total violation count desc, then by word_id. worst_slack.setup / .hold are top-10 by closest-to-violation slack (most negative first). Caveats:

The "even when no violation occurred" half of WS-P1.1.d (per-DFF closest-to-violation tracking when the design never tripped a violation) needs GPU-side near-miss instrumentation and is not in v1.0.0; for now, worst_slack is populated only from actual violation events.
--timing-report is wired on all three sim backends (Metal, CUDA, HIP). The cosim path does not yet route runtime violations through process_events — bringing it in is independent plumbing.
The violations array is capped at 100,000 records by default (~8 MB JSON). Override or disable the cap with --timing-report-max-violations <N> (0 = unbounded). Setup/hold totals, events_dropped, and worst_slack rankings always reflect every observed event; only the per-cycle list is bounded. stats.violations_truncated reports how many records were dropped because the cap was reached.

Tracing Violations to Source Signals

When you see a violation on a specific word, follow this workflow to identify the offending signals and their logic cone.

1. Get the Word Index

From the log: the [word=5] suffix on a violation line gives the state word index, here 5.

2. Map Word to DFF Signals

Each word covers 32 bits of state. The DFFs in that word have data_state_pos / 32 == word_index. To find which DFFs:

Look at the dff_constraints entries in the FlattenedScriptV1:

dff_constraints entries where data_state_pos / 32 == 5
→ cell_id values → netlist cell names

In gpu_sim, violations are logged with word IDs that map directly to the output_map positions. Each word covers bit positions word * 32 through word * 32 + 31.
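
The word-to-DFF lookup amounts to a filter on data_state_pos. The sketch below uses a minimal stand-in struct: only the two field names mentioned above are taken from the source, while the field types and function names are assumptions.

```rust
/// Minimal stand-in for a dff_constraints entry; only the two fields named
/// in the text are modelled, and their types are assumptions.
struct DffConstraint {
    data_state_pos: u32,
    cell_id: usize,
}

/// Cell IDs of all DFFs whose data input lives in `word_index`.
fn cells_in_word(constraints: &[DffConstraint], word_index: u32) -> Vec<usize> {
    constraints
        .iter()
        .filter(|c| c.data_state_pos / 32 == word_index)
        .map(|c| c.cell_id)
        .collect()
}

/// Inclusive bit range covered by a word: word * 32 through word * 32 + 31.
fn word_bit_range(word_index: u32) -> (u32, u32) {
    (word_index * 32, word_index * 32 + 31)
}
```

For word 5 the bit range is 160 through 191, so any entry with data_state_pos in that range belongs to the reported word.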

3. Trace Backwards with netlist_graph

Use the netlist_graph tool to trace the combinational logic cone feeding the DFF. After uv sync --group dev, the netlist-graph console script is on the workspace's uv run path — no cd required:

# Find the DFF data input driver chain
uv run netlist-graph drivers design.v "dff_name.D" -d 10

# Search for DFFs matching a pattern
uv run netlist-graph search design.v "dff_out*"

Discovered signal names can be passed directly into jacquard sim --trace-signals <file> / jacquard cosim --trace-signals <file> (one name per line) to surface them in the output VCD alongside top-level IO.

4. Detailed Timing Analysis with CVC

For per-signal accuracy (no 32-signal approximation), use CVC (open-src-cvc) with SDF back-annotation:

# Run CVC with SDF timing
cvc64 +typdelays tb.v design.v
./cvcsim

CVC provides event-driven simulation with full SDF support (IOPATH + INTERCONNECT delays), allowing you to pinpoint exactly which path is critical.

The Approximation Caveat

Jacquard tracks one arrival time per 32-signal group (one GPU thread position). The tracked value is the maximum arrival across all 32 signals in that thread. This means:

Conservative: If any signal in the group has a long path, the arrival for the entire group reflects that worst case. Violations may be reported for signals that individually meet timing.
Never misses real setup violations: A real setup violation always results in a reported violation (the max is >= any individual signal's arrival).

Reducing False Positives

If a violation is reported but you suspect it's a false positive from the approximation:

Use CVC for per-signal accuracy (see Detailed Timing Analysis with CVC above).
Timing-aware bit packing groups signals with similar arrival times into the same thread, reducing the approximation error. See docs/timing-simulation.md § "Timing-Aware Bit Packing" for details.

Common Scenarios

Setup violations on many words, same cycle: The clock period is likely too tight for the design. The combinational logic depth exceeds what can settle in one clock period. Try increasing the clock period.

Setup violation on a single word: A critical path through one specific logic cone. Use netlist_graph drivers to trace the path and identify the bottleneck.

Hold violation: Rare with the SKY130 process (negative hold times clamp to 0 in the SDF). If seen, the design likely has minimum-delay paths that are too short. Check for direct connections between DFF outputs and nearby DFF inputs with minimal combinational logic.

Violations only on first cycle: The arrival > 0 guard in the GPU kernel skips setup checks when arrival is zero (meaning no data has propagated through combinational logic yet). If you see violations on cycle 0, they are hold violations — setup violations on cycle 0 are suppressed by design.

Timing Validation Methodology

Note: The ±5% tolerance convention described here will be superseded by timing-correctness.md once phase 0 ships (OpenSTA oracle + timing IR + diff harness). Content below remains accurate for current behaviour.

This document describes how Jacquard validates timing simulation accuracy against reference simulators (CVC, Icarus Verilog). It covers test cases, comparison metrics, known simulator differences, and acceptance criteria.

Overview

Jacquard's timing simulation (--timed and --enable-timing flags) must be validated against independent reference simulators to ensure correctness. We use two reference tools:

CVC (open-src-cvc): For post-layout SDF-annotated designs (open-source event-driven simulator with native SDF back-annotation, high confidence)
Icarus Verilog: For structural Verilog and Liberty-based timing (open source, good for pre-layout)

What We Validate

1. Functional Correctness (Primary Validation)

Definition: Both Jacquard and the reference simulators produce identical output values at each clock cycle.

Signals compared: Primary design outputs (e.g., gpio_out[43:0] for MCU SoC), NOT internal timing signals.

Tolerance: Exact match required. If output values differ, timing simulation has a correctness bug.

Tool: compare_outputs.py script compares VCD outputs cycle-by-cycle.

# Example: MCU SoC functional comparison
uv run tests/mcu_soc/cvc/compare_outputs.py \
    jacquard_output.vcd cvc_output.vcd \
    --skip-cycles 5  # Skip first 5 cycles (reset/initialization)

2. Timing Accuracy (Secondary Validation)

Definition: Gate-level delays and arrival times computed by Jacquard match the reference simulator within acceptable margins.

Known differences (simulator-specific semantics):

Metric	Jacquard	CVC	Notes
Q arrival	Full combo path delay (CLK→dff_in.Q + chain + interconnect)	Final gate CLK→Q only	Jacquard is conservative: includes downstream logic
Setup slack	Time from data arrival to DFF setup deadline	Similar	Usually aligned
Hold slack	Time from data change to DFF hold deadline	Similar	Can differ due to arrival rounding

Why the difference? Jacquard's boomerang architecture computes cumulative delays through pipeline stages, resulting in conservative estimates that over-predict delays by including full combinational paths. CVC only reports final DFF gate delays.

Tolerance:

Functional output: Exact match required
Arrival times: ±5% acceptable (due to architectural differences)
Setup/hold margins: ±10ps acceptable (platform-dependent rounding)

Test Cases

inv_chain_pnr (Simple Reference Case)

Location: tests/timing_test/inv_chain_pnr/

Description: Single inverted AND gate followed by a chain of inverters. Provides ground truth for:

Basic gate delay accuracy
Multi-stage combinational path accumulation
Clock-to-Q propagation

Size: 9 cells, ~40ps full path delay

Validation:

# Generate CVC reference
cd tests/timing_test/inv_chain_pnr
cvc64 +typdelays tb_cvc.v inv_chain.v 2>&1 | tee cvc.log
./cvcsim 2>&1 | grep "RESULT:" # Extract: clk_to_q, chain_delay, total_delay

# Generate Jacquard output
cargo run -r --features metal --bin jacquard -- sim \
    6_final.v stimulus.vcd output.vcd 1 \
    --sdf 6_final.sdf \
    --timed

Expected results:

Jacquard Q arrival ≈ CVC total_delay (within ±5%)
Both simulators show monotonic delay increase with each inverter stage

MCU SoC post-layout (Complex Case)

Location: tests/mcu_soc/data/6_final.v + 6_final.sdf

Description: Full MCU SoC design with SKY130 cells, post-P&R netlist with SDF timing.

Size: 2.7k cells, ~19MB netlist, ~18MB SDF

Validation:

# In CI: .github/workflows/ci.yml mcu-soc-metal job

# 1. Strip SDF timing checks (remove malformed TIMINGCHECK directives)
uv run tests/mcu_soc/cvc/strip_sdf_checks.py \
    tests/mcu_soc/data/6_final.sdf \
    tests/mcu_soc/data/6_final_stripped.sdf

# 2. Generate Jacquard timing VCD
cargo run -r --features metal --bin jacquard -- sim \
    tests/mcu_soc/data/6_final.v \
    tests/mcu_soc/stimulus.vcd \
    tests/mcu_soc/jacquard_timed_mcu.vcd 1 \
    --sdf tests/mcu_soc/data/6_final_stripped.sdf \
    --sdf-corner typ \
    --timed \
    --max-clock-edges 10000

# 3. Generate CVC reference
cvc64 +typdelays tests/mcu_soc/cvc/tb_cvc.v \
    tests/mcu_soc/data/6_final.v \
    tests/mcu_soc/cvc/sky130_cells.v \
    2>&1 | tee cvc_compile.log
./cvcsim > cvc_output.vcd 2>&1

# 4. Compare functional outputs
uv run tests/mcu_soc/cvc/compare_outputs.py \
    tests/mcu_soc/jacquard_timed_mcu.vcd \
    tests/mcu_soc/cvc/cvc_output.vcd \
    --skip-cycles 5

Expected results:

Functional output: Exact match (or documented difference with explanation)
Arrival times: Jacquard values ≥ CVC due to conservative path accumulation
CI: Comparison completes without errors

Pre-layout Library Timing

Location: tests/timing_test/sky130_timing/

Description: Synthesized designs with Liberty-only timing (no SDF from P&R). Pre-layout tests validate timing accuracy early, before place-and-route.

Test circuits:

inv_chain: DFF → 16 inverters → DFF
- Expected combo delay: 16 × 28ps = 448ps (from Liberty inv_1 tpd)
- Full path: CLK→Q (310ps) + chain (448ps) = 758ps
logic_cone: 4 input DFFs → nand2/nor2/and2 tree → DFF
- Critical path: 4 gates × 35ps avg = 140ps
- Full path: CLK→Q (310ps) + logic (140ps) = 450ps
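
The expected figures above follow from one formula, which is convenient to keep next to the test as a sanity check. The function name is hypothetical; the delay values are the ones quoted in the list.

```rust
/// Expected register-to-register path delay for a uniform chain:
/// CLK->Q of the launching DFF plus `stages` identical gate delays.
fn expected_path_ps(clk_to_q_ps: u32, stages: u32, gate_delay_ps: u32) -> u32 {
    clk_to_q_ps + stages * gate_delay_ps
}
```

This gives 310 + 16 × 28 = 758ps for inv_chain and 310 + 4 × 35 = 450ps for logic_cone.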

Validation:

# Generate Liberty-only SDF
cd tests/timing_test/sky130_timing
python3 gen_liberty_sdf.py inv_chain.v
python3 gen_liberty_sdf.py logic_cone.v

# Run CVC reference
cvc64 +typdelays tb_inv_chain.v inv_chain.v 2>&1 | tee cvc.log
./cvcsim

# Run Jacquard GPU simulation
cargo run -r --features metal --bin jacquard -- sim \
    inv_chain.v stimulus.vcd jacquard_output.vcd 1 \
    --sdf inv_chain.sdf \
    --timed

# Compare outputs
uv run ../../mcu_soc/cvc/compare_outputs.py jacquard_output.vcd cvc_output.vcd

Expected results:

Functional output: Exact match (logic correctness)
Arrival times: Jacquard ≥ CVC (conservative accumulation expected)
No SDF parsing errors or malformed delays

Purpose: Pre-layout tests catch timing issues early without P&R turnaround (30+ min). Liberty-only delays provide a floor (lower bound); post-layout adds routing parasitics for final validation.

Multi-corner workflow (local Mac dev)

Location: crates/opensta-to-ir/tests/opensta_integration.rs — sky130_multi_corner_emits_per_corner_values.

The test drives opensta-to-ir --liberty NAME=PATH (WS2.4) three times with the genuine sky130 typ/slow/fast Liberty files and asserts the emitted timing IR carries three corners with different setup values (slow > typ > fast). This is the load-bearing check that --timing-corner actually means something — the companion aigpdk_dff_emits_per_corner_timing_values only validates IR shape.

One-time setup on a fresh Mac:

uv sync --group dev                                     # installs volare
uv run volare enable c6d73a35f524070e85faff4a6a9eef49553ebc2b
# Optional override if the PDK lives elsewhere:
export SKY130_LIBERTY_DIR=/path/to/sky130_fd_sc_hd/lib

The pinned volare hash lives in pyproject.toml::[tool.jacquard.pdks.sky130]. The test skips cleanly with an install hint when the PDK isn't found, so it's safe to run on machines without sky130 installed.

Running:

cd crates/opensta-to-ir
cargo test --test opensta_integration sky130_multi_corner -- --nocapture

GF180MCU PDK (in progress — phased per docs/plans/gf180mcu-enablement.md):

uv sync --group dev
uv run volare enable --pdk gf180mcu 559a117b163cef2f920f33f30f6f690aa0b47e4c

Liberty lands under ~/.volare/volare/gf180mcu/versions/<hash>/gf180mcuC/libs.ref/gf180mcu_fd_sc_mcu{7,9}t5v0/liberty/. Both 7-track and 9-track standard-cell libraries are pulled together by volare. Cell models are vendored as submodules at vendor/gf180mcu_fd_sc_mcu7t5v0/ and vendor/gf180mcu_fd_sc_mcu9t5v0/.

CI Integration

MCU SoC Timing Comparison Workflow

The main CI pipeline (.github/workflows/ci.yml) includes:

mcu-soc-metal job: Generates Jacquard timing VCD
- Includes SDF stripping step (strip_sdf_checks.py)
- Produces jacquard_timed_mcu.vcd with arrival time annotations
mcu-soc-cvc job: Generates CVC reference output
- Uses stripped SDF (6_final_nocheck.sdf)
- Produces cvc_output.vcd for comparison
mcu-soc-comparison job: Validates both produce same functional output
- Runs compare_outputs.py
- Reports pass/fail in CI summary
- Skips gracefully if either simulator fails

SDF Stripping

SDF files from post-P&R tools may contain:

Malformed TIMINGCHECK directives (setup/hold specs with syntax errors)
INTERCONNECT entries with escaped port names that parsers reject

Solution: tests/mcu_soc/cvc/strip_sdf_checks.py removes:

TIMINGCHECK blocks (parser errors, not needed for gate-level functional sim)
INTERCONNECT lines (wire delays, optional)
Empty DELAY blocks (after INTERCONNECT removal)
CELL blocks with escaped $ in instance names

Result: Stripped SDF retains ~402k lines of useful IOPATH (gate delay) data, removing ~131k problematic lines.
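
The core of the stripping step is removing balanced s-expressions. The sketch below shows that idea for TIMINGCHECK blocks only, by counting parentheses. It is an illustration in Rust, not a port of strip_sdf_checks.py: it does not handle the INTERCONNECT, empty DELAY or escaped-name cases, and it assumes parentheses never appear inside quoted strings.

```rust
/// Remove every balanced `(TIMINGCHECK ...)` s-expression from SDF text.
/// An unterminated block is dropped through to the end of the input.
fn strip_timingcheck(sdf: &str) -> String {
    const TAG: &str = "(TIMINGCHECK";
    let mut out = String::with_capacity(sdf.len());
    let mut rest = sdf;

    while let Some(start) = rest.find(TAG) {
        out.push_str(&rest[..start]);

        // Walk forward until the parenthesis opened at `start` is closed.
        let mut depth = 0usize;
        let mut end = rest.len();
        for (i, ch) in rest[start..].char_indices() {
            match ch {
                '(' => depth += 1,
                ')' => {
                    depth -= 1;
                    if depth == 0 {
                        end = start + i + 1;
                        break;
                    }
                }
                _ => {}
            }
        }
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}
```

Because IOPATH entries live in DELAY blocks, they pass through untouched, which matches the goal of keeping gate delays while discarding the checks that trip the parser.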

Comparison Tolerances

Pre-Layout (Liberty-Only) Tests

Pre-layout tests compare designs without P&R parasitics, using only library cell timing:

Metric	Tolerance	Notes
Functional output	0 (exact match)	Logic correctness required
Arrival time	±10%	Library delays only; no routing variance
Setup/hold slack	±10ps	Rounding differences expected

Why loose tolerances? Pre-layout designs have no P&R uncertainty. Functional correctness is critical (exact match); timing metrics can vary due to simulator implementation differences.

Post-Layout (SDF-Annotated) Tests

Post-layout designs include routing delays and P&R context:

Metric	Tolerance	Notes
Functional output	0 (exact match)	Required; SDF adds no logic changes
Arrival time	±5% (SDF tool dependent)	Jacquard accumulates conservatively
Setup/hold slack	±20ps	Detailed routing adds variance
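
The two kinds of tolerance in these tables (relative for arrival times, absolute for slack) can be expressed as small predicates. This is a sketch with hypothetical function names and is not the logic of compare_outputs.py, which compares functional outputs only.

```rust
/// True when `measured_ps` is within ±`tolerance_pct` percent of `reference_ps`.
fn within_pct(measured_ps: f64, reference_ps: f64, tolerance_pct: f64) -> bool {
    if reference_ps == 0.0 {
        // A relative tolerance is undefined against zero; require equality.
        return measured_ps == 0.0;
    }
    ((measured_ps - reference_ps) / reference_ps).abs() * 100.0 <= tolerance_pct
}

/// True when two slack values differ by at most `tolerance_ps`.
fn within_ps(measured_ps: i64, reference_ps: i64, tolerance_ps: i64) -> bool {
    (measured_ps - reference_ps).abs() <= tolerance_ps
}
```

For example, a 1040ps arrival against a 1000ps reference passes the ±5% post-layout check, while 1060ps passes only the ±10% pre-layout check.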

Known Issues & Limitations

1. Boomerang Path Accumulation

Issue: Jacquard sums gate delays across all pipeline stages leading to a node, producing cumulative arrival times. CVC only reports final DFF gate delay.

Example: inv_chain test:

CVC reports: CLK→Q = 350ps (final gate only)
Jacquard reports: Q arrival = 1323ps (CLK→dff_in.Q + inverter chain + interconnect)

Why: Boomerang architecture evaluates in hierarchical stages. To know Q arrival, you sum delays from all stages.

Mitigation: Compare Jacquard Q arrival against CVC RESULT: total_delay (if available), or accept that arrival times differ but functional output matches.

2. SDF Parser Robustness

Issue: Post-P&R SDF files from various tools contain edge-case syntax that doesn't parse correctly.

Examples:

Empty delay specs: (IOPATH A B () (0:0:0)) — treated as 0ps
COND-qualified pins: (SETUP (COND x==1 D) (posedge CLK) (180:200:220)) — not yet supported
Malformed TIMINGCHECK: cause parser errors in byte-range 18M+

Current approach: Strip SDF timing checks before use, preserving IOPATH data.

Future: Improve SDF parser to handle more edge cases without stripping.

3. Timing Model Accuracy

Conservative model: Jacquard accumulates delays pessimistically. This is intentional: a design that shows no setup violations under Jacquard's model also meets setup timing under the same SDF delays.

Trade-off: Over-predicting delays means Jacquard may flag timing violations that won't occur in silicon. This is preferable to under-predicting (which would give false confidence).

Debugging Timing Failures

When Functional Output Doesn't Match

Check testbench stimulus: Do both simulators receive the same input?
- Generate stimulus VCD: --stimulus-vcd stimulus_out.vcd
- Compare against reference testbench
Check SDF parsing: Did parser successfully load all timing data?
- Enable debug logging: --sdf-debug
- Look for "unmatched SDF instances" warnings
Check initialization: Are both simulators starting from the same state?
- Compare first 5-10 cycles after reset
- Verify reset logic is synchronized

Compare against Jacquard non-timed version:

# Run without timing VCD
cargo run -r --features metal --bin jacquard -- sim \
    design.v stimulus.vcd functional_output.vcd 1

# Does functional output match CVC?
compare_outputs.py functional_output.vcd cvc_output.vcd

When Arrival Times Seem Wrong

Verify SDF was loaded: Check logs for "Failed to load SDF" warnings
Check corner selection: Confirm --sdf-corner typ|min|max matches reality
Compare against CVC RESULT lines: If available, extract CVC timing measurements
Check for pipeline stage effects: Arrival time = sum of delays across stages

Acceptance Criteria

Test Case	Criterion	Status
inv_chain_pnr	Functional output exact match	✅ Passing
inv_chain_pnr	Arrival time within ±5% of CVC	✅ Passing
MCU SoC	Functional output exact match	✅ Passing (CI)
MCU SoC	SDF parsing completes without panic	✅ Passing (fixed with strip_sdf_checks)
MCU SoC	Timing VCD generates successfully	✅ Passing (fixed)
Pre-layout inv_chain	Library-only SDF generated correctly	✅ Passing
Pre-layout logic_cone	Library-only SDF generated correctly	✅ Passing
Pre-layout timing comparison	Functional output matches CVC	⏳ In progress (CVC testbenches added)
CUDA/HIP timing support	--timed flag on GPU backends	⏳ Not yet implemented
Cosim timing mode	Arrival time readback in cosim	⏳ Not yet implemented

References

Timing Simulation in GEM — Architecture details
Timing Violation Detection — Setup/hold checks
SDF Parser — Parsing implementation
CVC Integration — Reference simulator setup

Timing-model extensions — design notes

Status: Idea / pre-spike. Not scheduled. Captured here so the architecture sketch survives the next session-clear.

Scope: Three related extensions to Jacquard's timing model, all aimed at making setup/hold reporting more honest without abandoning the cycle-accurate boomerang kernel.

Dynamic delay — per-gate δ(T) inspired by the Involution Delay Model (Maier 2021, arXiv:2107.06814). Captures pulse-width-dependent delay degradation that fixed δ∞ misses on near-threshold paths.
Clock-tree skew — per-DFF clock arrival accounting. Today every DFF on a clock is treated as if it captures simultaneously; SDF clock-buffer arcs and clock-net interconnect are silently dropped during AIG construction.
Wire delay at scale — per-receiver interconnect delay applied to the right edge in the AIG, and explicit modelling of inter-partition wires. Today wire delay is collapsed to a max-per-destination-cell scalar — fine for sky130 short routes, increasingly wrong as we move to faster clocks, finer processes, and large many-core/NoC designs.

All three share the same insight: the data the model needs is already in the TimingIR. The work is at the consumer layer (flatten.rs, aig.rs, the kernel arrival math), not the IR or the partitioner.

Background — what the timing pipeline does today

.sdf  ─┬─► opensta-to-ir ──► TimingIR (.jtir, FlatBuffers)
.jtir ─┘                          │
                                  ▼
              flatten.rs::load_timing_from_ir   (per-cell arc → AIG-pin delay)
                                  │
                                  ▼
                       gate_delays: Vec<PackedDelay>     (rise/fall ps per AIG pin)
                       dff_constraints: Vec<DFFConstraint>  (setup/hold ps per DFF)
                                  │
                                  ▼
              flatten.rs::inject_timing_to_script   (bake max ps into u16 script slot)
                                  │
                                  ▼
                       kernel_v1.metal at runtime:
                       per-AND:    new_arr = max(arr_a, arr_b) + gate_delay
                       per-DFF:    check arrival vs setup/hold per word

Reference points:

IR schema: crates/timing-ir/schemas/timing_ir.fbs
IR consumer: src/flatten.rs:1768 (load_timing_from_ir), src/flatten.rs:1686 (inject_timing_to_script)
Setup/hold buffer: src/flatten.rs:1732 (build_timing_constraint_buffer)
GPU arrival math: csrc/kernel_v1.metal:220-255 (AND gates), csrc/kernel_v1.metal:547-580 (setup/hold)

Per-AIG-pin arrival is a single ushort accumulated by max through the boomerang reduction. There is no event scheduling — arrival is a scalar that rides alongside the Boolean evaluation in lockstep with cycle ticks.
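
As a host-side illustration of the per-AND rule in the diagram above, the following minimal sketch propagates a scalar arrival through a two-gate cone. The name `and_arrival` and the saturating add are assumptions made for this example; the actual kernel lives in csrc/kernel_v1.metal and may handle overflow differently.

```rust
/// Arrival time in picoseconds, mirroring the single `ushort` kept per AIG pin.
type ArrivalPs = u16;

/// One AND-gate step: take the later input arrival, then add the gate delay.
/// The saturating add is an assumption of this sketch, not confirmed kernel behaviour.
fn and_arrival(arr_a: ArrivalPs, arr_b: ArrivalPs, gate_delay: ArrivalPs) -> ArrivalPs {
    arr_a.max(arr_b).saturating_add(gate_delay)
}

fn main() {
    // Two-level cone (a AND b) AND c, with illustrative delays in ps.
    let n1 = and_arrival(0, 120, 80);
    let n2 = and_arrival(n1, 50, 95);
    println!("n1 = {n1} ps, n2 = {n2} ps");
}
```

Because arrival is a plain scalar combined by `max`, nothing in this step needs an event queue, which is the property the paragraph above relies on.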

Part A — Dynamic delay (IDM-style δ(T))

What IDM is, briefly

A per-gate dynamic delay model that makes δ a function of T (time since the gate's last output transition). The distinguishing property: input pulses with Δᵢ → 0 have diminishing effect on the output. The model handles pulse-width degradation faithfully and is the only model proven to solve the short-pulse-filtration problem. The paper notes ~80–590% CPU overhead vs. inertial delay on a CPU event-driven simulator.

The architectural wall

True IDM needs event scheduling and intra-cycle pulse observability — neither is available in Jacquard's lockstep cycle-accurate kernel. We cannot model glitch suppression or metastability oscillation traces without either sub-cycle ticks or a different kernel architecture.

What we can do is enrich the per-gate delay used in arrival propagation so setup/hold reporting reflects realistic pulse degradation on marginal paths.

Five hook points

Hook	File	Today	With δ(T)
A Schema	`crates/timing-ir/schemas/timing_ir.fbs`	rise/fall per arc	+ per-cell-type `DynamicDelayParams` (exp-channel params or piecewise-linear LUT)
B IR load	`src/flatten.rs:1768`	one `PackedDelay` per AIG pin	+ parallel `gate_dyn_delays` keyed by originating cell-type via `aigpin_cell_origins`
C Bake	`src/flatten.rs:1686`	one u16 ps per thread slot	static-IDM: bake worst-case δ(T) into same slot. dynamic-IDM: reserve second u32
D Kernel arrival	`csrc/kernel_v1.metal:220-255`	`max(arr_a, arr_b) + gate_delay`	`+ eval_idm(dyn_params, T, edge)` via small LUT
E Setup/hold	`csrc/kernel_v1.metal:547-580`	unchanged math, dumber inputs	unchanged math, smarter inputs

For dynamic-IDM the kernel needs two new persistent buffers:

last_transition_ps[aig_pin] — when the gate's output last switched (absolute ps).
last_value[aig_pin] — to detect transitions across cycles.

Memory cost ~4 bytes per AIG pin per partition. For NVDLA-scale designs (~hundreds of thousands of pins) this is MB-scale — fine.

`eval_idm` on GPU

The paper uses exp/log per gate. On GPU replace with a 16-entry LUT indexed by quantised T. Cheap, branch-free, smooth enough.
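
A rough host-side sketch of the LUT idea follows. The struct `DynDelayLut`, its fields, and the sample values are hypothetical and exist only for illustration; real entries would come from the characterisation step described in the next section.

```rust
const LUT_LEN: usize = 16;

/// Hypothetical per-cell-type table: delay (ps) sampled at quantised T.
struct DynDelayLut {
    /// Width of one quantisation bin in ps; must be non-zero.
    t_step_ps: u32,
    /// delta(T) samples; the last entry stands in for the settled delay.
    delta_ps: [u16; LUT_LEN],
}

impl DynDelayLut {
    /// Quantise T and clamp the index to the last bin, so large T reads the settled value.
    fn eval(&self, t_ps: u32) -> u16 {
        let idx = ((t_ps / self.t_step_ps) as usize).min(LUT_LEN - 1);
        self.delta_ps[idx]
    }
}

fn main() {
    // Illustrative values only, not characterised data.
    let lut = DynDelayLut {
        t_step_ps: 10,
        delta_ps: std::array::from_fn(|i| 40 + 4 * i as u16),
    };
    for t in [0, 25, 10_000] {
        println!("T = {t} ps -> delta = {} ps", lut.eval(t));
    }
}
```

The clamp replaces the exp/log evaluation with one divide, one min, and one load, which is the trade the paragraph above describes.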

Characterisation

The δ(T) parameters have to come from per-cell SPICE characterisation. For sky130 we'd characterise each sky130_fd_sc_hd__*_* cell once, check the result into the repo, and ship it as a sidecar table consumed by the IR builder. This is the expensive one-off — the paper flags characterisation cost as the unsolved part of making IDM "truly competitive."

Staged plan

Stage	What	Touches	Kernel	Effort	Win
1 Static IDM	Bake worst-case δ(T) into existing u16 slot using STA pulse-width estimates	A, B, C	None	1–2 days	Better setup/hold on marginal paths
2 Dynamic δ(T)	Add `last_transition_ps` buffer + LUT eval	All	Lines 220–255	1–2 weeks	Pulse-degradation-aware arrivals end-to-end
3 Sub-cycle ticks	Multiple arrival propagations per logical cycle	Whole kernel	Major	Months	True IDM glitch behaviour. Probably not worth it for Jacquard's positioning.

Stage 1 is a 1–2 day spike with no kernel risk. Stage 2 is the honest implementation. Stage 3 is a different simulator.

What we get / don't get from dynamic δ(T)

Achievable

Per-corner δ(T) propagating through arrival → setup/hold reports that distinguish "just meets timing under δ∞" from "fails under realistic pulse degradation".
Stays inside cycle-accurate boomerang. ~1.5–2× memory growth on arrival data, ~10–20% kernel slowdown (estimate).

Not achievable

Glitch suppression (Δᵢ → 0 → no transition).
Metastable oscillation traces.
Combinational-loop behaviours (loops are forbidden in the AIG anyway).

Why sky130 is the right vehicle

sky130_pdk.rs decomposes vendor functional Verilog into AIG nodes while preserving cell identity through aigpin_cell_origins. We can attach δ(T) at the original sky130 cell granularity even after AIG flattening — that structural property is what makes any of this tractable. Cells from a hand-coded library without origin tracking would be much harder.

Part B — Clock-tree skew

Status: Stages 1 + 2 implemented (2026-05-01). Per-DFF clock arrival is carried through the IR (ClockArrival table) and folded into per-DFF setup/hold via DFFConstraint::effective_setup_hold before the per-word collapse. Producer landed in c403cc8; consumer fold-in in 6767c3e. The narrative below describes the original motivation; the Staged plan at the end of this part records what shipped and what remains (Stage 3, conditional).

Where the information is — and where we drop it

Clocks in Jacquard are walked back from each DFF through buffers/inverters/clock-gates, terminating at an InputClockFlag(pinid, is_negedge) (src/aig.rs:441, :477, :495-560). Recognised cells: INV/BUF/CKLNQD and the sky130 equivalents inv*, clkinv*, buf*, clkbuf*, clkdlybuf*, lpflow_*.

Two consequences:

Clock-tree cells produce no AIG pin. They collapse into a polarity flag on the DFF. Since aigpin_cell_origins only lists cells that produced AIG pins, the timing-IR arcs on those cells (IOPATH records on clkbuf_8, etc.) match no AIG pin in load_timing_from_ir and are silently discarded.
Clock-net interconnect is dropped the same way. interconnect_delays records keyed by net endpoints have no destination cell to attach to, so they fall on the floor.

Net effect: every DFF on a given clock domain is treated as having identical clock arrival, i.e. zero skew. The current setup/hold check is honest about combinational-path delay but blind to clock-tree topology.

For a sky130 MCU SoC at ~25 ns clock period this is fine functionally; for any timing claim near the period boundary it's misleading. Intra-domain clock-tree skew on sky130 is typically O(50–200 ps) — small relative to a 25 ns period, but exactly the order of magnitude that determines whether a path "barely meets" or "barely fails" setup.

Do we have the information?

Yes, in three places, in increasing fidelity:

TimingIR arcs on clock cells (.jtir already contains them; we just don't consume them).
The AIG clock walk in aig.rs:495–560 already iterates the clock-side cells of each DFF in order. It just doesn't accumulate their delays. Adding a dff_clock_origins: Vec<Vec<cellid>> parallel structure costs O(num_dffs × clock_depth) memory — negligible.
OpenSTA can compute per-DFF clock arrival end-to-end. (OpenTimer was the original primary STA candidate per ADR 0003 but the spike Superseded it; ADR 0001 makes OpenSTA the sole STA path, called out of process via opensta-to-ir.) Per-pair common-path-pessimism removal (CRPR) is fundamentally a launch/capture credit, not a per-DFF property — so what shipped is per-DFF capture-side arrival, treating launch as a 0-reference. This is the form in the IR today:
```
table ClockArrival {
    cell_instance: string;     // DFF instance path
    clk_pin: string;           // local pin name
    arrival: [TimingValue];    // per-corner clock arrival ps
    provenance: Provenance;
}
```
Populated by opensta-to-ir's Tcl driver via [all_registers -clock_pins] + [::sta::vertex_worst_arrival_path]. Consumer code never touches the netlist — it just looks up each DFF's clock arrival.

Consumer change (shipped)

DFFConstraint carries the field now:

pub struct DFFConstraint {
    pub setup_ps: u16,
    pub hold_ps: u16,
    pub clock_arrival_ps: i16,   // signed: capture-side arrival, launch ref = 0
    pub data_state_pos: u32,
    pub cell_id: u32,
}

The setup/hold formula for per-pair skew is:

Setup margin = (clock_period + clock_arr_capture - clock_arr_launch) - data_arrival - setup
Hold margin = data_arrival - (clock_arr_capture - clock_arr_launch + hold)
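
To make the two per-pair formulas concrete, here is a small worked sketch. The struct `PathTiming` and the numbers are made up for illustration; only the arithmetic is taken from the formulas above.

```rust
/// All values in picoseconds. Signed, because skew and margins can be negative.
struct PathTiming {
    clock_period: i32,
    clock_arr_launch: i32,
    clock_arr_capture: i32,
    data_arrival: i32,
    setup: i32,
    hold: i32,
}

impl PathTiming {
    /// Setup margin, term for term as in the formula above.
    fn setup_margin(&self) -> i32 {
        (self.clock_period + self.clock_arr_capture - self.clock_arr_launch)
            - self.data_arrival
            - self.setup
    }

    /// Hold margin, term for term as in the formula above.
    fn hold_margin(&self) -> i32 {
        self.data_arrival - (self.clock_arr_capture - self.clock_arr_launch + self.hold)
    }
}

fn main() {
    // Illustrative path on a 25 ns clock with 150 ps of capture-minus-launch skew.
    let p = PathTiming {
        clock_period: 25_000,
        clock_arr_launch: 50,
        clock_arr_capture: 200,
        data_arrival: 24_900,
        setup: 180,
        hold: 60,
    };
    println!("setup margin = {} ps", p.setup_margin());
    println!("hold margin  = {} ps", p.hold_margin());
}
```

With both clock arrivals set to zero the same path has a setup margin of 25 000 - 24 900 - 180 = -80 ps, so in this example the skew term alone decides whether the path is reported as passing.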

Per-launch/per-capture pairing is awkward in the current per-word-collapsed constraint buffer, so the implementation folds the capture-side clock arrival into the per-DFF effective setup/hold before packing, via DFFConstraint::effective_setup_hold:

effective_setup = setup - clock_arrival_capture (clamped to [0, u16::MAX])
effective_hold = hold + clock_arrival_capture (clamped to [0, u16::MAX])

The GPU kernel runs unchanged — the same packed (setup<<16)|hold word it already consumes now carries skew-aware values. Launch arrival is treated as zero (ref) — pessimistic for paths whose launch DFF also has a long clock path, but a clean first cut. Stage 3 below addresses that pessimism if measurement justifies it.
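
The fold-and-pack step can be sketched as a pair of free functions. This is a standalone approximation written for this note, not the body of DFFConstraint::effective_setup_hold; the function names and the use of `i32` intermediates are assumptions.

```rust
/// Fold capture-side clock arrival into setup/hold, clamping to the u16 range.
fn effective_setup_hold(setup_ps: u16, hold_ps: u16, clock_arrival_ps: i16) -> (u16, u16) {
    // Widen to i32 so the subtraction and addition cannot wrap before clamping.
    let clamp = |v: i32| v.clamp(0, u16::MAX as i32) as u16;
    let setup = clamp(setup_ps as i32 - clock_arrival_ps as i32);
    let hold = clamp(hold_ps as i32 + clock_arrival_ps as i32);
    (setup, hold)
}

/// Pack as (setup << 16) | hold, the word layout named in the text.
fn pack(setup: u16, hold: u16) -> u32 {
    ((setup as u32) << 16) | hold as u32
}

fn main() {
    let (s, h) = effective_setup_hold(180, 60, 150);
    println!("effective setup = {s} ps, hold = {h} ps, word = {:#010x}", pack(s, h));
}
```

The clamps matter at both ends: a clock arrival larger than the setup time floors the effective setup at zero, and a negative arrival can floor the effective hold at zero.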

Partitioning question

"could we partition a design effectively to do this somewhat accurately without sacrificing too much?"

Today partitioning (src/repcut.rs) is hypergraph-cut on logic connectivity. DFFs co-located by logic affinity may have very different clock arrivals.

The pessimism cost: build_timing_constraint_buffer collapses all DFFs in a 32-bit state word to min(setup) and min(hold). If a word holds DFFs with clock arrival 50 ps and 200 ps, the per-word effective setup is the worst of both — i.e. we report timing as if every DFF in that word saw the worst skew in the word. That's a 150 ps pessimism for the lucky DFF.

Three options, ranked:

Do nothing. For typical sky130 SoCs at ≥10 ns clock periods, intra-word skew (≤200 ps worst-case) vs. period (10 000+ ps) is ≤2%. Worth-it threshold for the optimisation: when designs run close enough to the period that 2% pessimism flips genuine passes into reported violations. Likely never for sky130. Plausibly relevant for designs running at ≥1 GHz on a more aggressive PDK.
Skew-bucket the DFF constraint packing, not the partitioning. Group DFFs into clock-arrival buckets after partitioning, and emit one constraint word per bucket-within-partition rather than collapsing everything in the word. Increases constraint-buffer size by O(num_buckets) but doesn't disturb the partitioner. Probably the right answer if we ever need to.
Skew-aware partitioning. Add a soft objective to repcut.rs that prefers grouping DFFs by clock arrival. Degrades cut quality (more inter-partition logic edges → more state shuffling). Almost certainly worse than option 2 for the same accuracy gain.

So: yes we have the info, no we probably don't need to repartition, and the constraint-collapsing pessimism is the real lever — either accept it (option 1) or break it bucket-wise (option 2).
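
Option 2 amounts to a grouping pass over per-DFF clock arrivals after partitioning. A minimal sketch, assuming a fixed bucket width and a hypothetical helper `bucket_by_arrival` (neither is part of the current code):

```rust
use std::collections::BTreeMap;

/// Group DFF indices by quantised clock arrival.
/// `bucket_width_ps` must be positive; the choice of width is left open here.
fn bucket_by_arrival(clock_arrival_ps: &[i16], bucket_width_ps: i16) -> BTreeMap<i16, Vec<usize>> {
    let mut buckets: BTreeMap<i16, Vec<usize>> = BTreeMap::new();
    for (dff, &arr) in clock_arrival_ps.iter().enumerate() {
        // div_euclid rounds toward negative infinity, so negative arrivals
        // land in the bucket below zero rather than sharing bucket 0.
        buckets
            .entry(arr.div_euclid(bucket_width_ps))
            .or_default()
            .push(dff);
    }
    buckets
}

fn main() {
    // Four DFFs that would otherwise share one constraint word.
    let arrivals = [50, 210, 60, 240];
    for (bucket, dffs) in bucket_by_arrival(&arrivals, 100) {
        println!("bucket {bucket}: DFFs {dffs:?}");
    }
}
```

In this example the spread inside each bucket is 10 ps and 30 ps, against 190 ps when all four DFFs are collapsed together.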

Staged plan for clock tree

Stage	What	Touches	Kernel	Status
1 Capture clock-tree delay	Add `ClockArrival` IR table; populate from opensta-to-ir	IR schema, `opensta-to-ir/builder` + Tcl	None	Shipped — `c403cc8`
2 Apply to setup/hold	Fold capture-side arrival into `DFFConstraint`; existing kernel check now skew-aware	`src/flatten.rs` `DFFConstraint`, `effective_setup_hold`, `build_timing_constraint_buffer`	None	Shipped — `6767c3e`
3 (conditional) Bucketed packing	Per-bucket constraint words to remove the per-word `min(setup, hold)` collapse pessimism; kernel reads the right bucket per DFF	`src/flatten.rs:1722-1761`, kernel constraint indexing	Minor	Open — land only if measurement shows the per-word collapse materially over-reports violations

Part C — Wire delay at scale

Why this gets more important as designs grow

In sky130 at 25 ns clock periods, wire delay is a small perturbation on gate delay and the lumped model is fine. The picture changes in three regimes:

Faster clocks. Wire delay is a fixed physical quantity (RC-dominated); period shrinks; wire fraction of the budget grows.
Finer processes (e.g. 22nm and below). Gate delays scale down with feature size; wire RC scales unfavourably (resistance per square goes up, capacitance per length stays roughly flat). The classic "reverse scaling" inflection: gates get faster, long wires don't. Typical 22nm: inverter delay 5–15 ps, local short wires 5–20 ps, global routes 50–500 ps, multi-mm wires 1+ ns without repeaters.
Large many-core/NoC SoCs. Inter-tile mesh links can span multiple millimetres; chip-level signals have wire delays comparable to or larger than entire combinational stages.

For a many-small-core NoC at 22nm, wire delay on inter-core links is typically the dominant timing factor. Any model that can't represent it accurately will misreport the critical paths.

What Jacquard does today

The IR side is already in shape. crates/timing-ir/schemas/timing_ir.fbs carries InterconnectDelay { net, from_pin, to_pin, delay[corner] } per receiver, and opensta-to-ir populates it from SDF.

The lossy step is the consumer in src/flatten.rs:1850-1872:

let mut wire_delays_per_cell: HashMap<usize, (u64, u64)> = HashMap::new();
// ... for each InterconnectDelay record:
let entry = wire_delays_per_cell.entry(dest_cellid).or_insert((0, 0));
entry.0 = entry.0.max(d);   // rise
entry.1 = entry.1.max(d);   // fall (same value!)

Three layers of pessimism stacked here:

Keyed by destination cell, not destination pin. A cell with two inputs from very different routes loses per-pin fidelity.
Max across inputs of the same cell. Worst-case incoming wire is applied to every output of the cell.
No rise/fall distinction on wire delay. SDF carries both; we collapse to one number.

Then in arrival propagation (csrc/kernel_v1.metal:220-255):

new_arr = max(arr_a, arr_b) + gate_delay

where gate_delay = intrinsic + max_wire_into_cell. The mathematically correct propagation is:

new_arr = max(arr_a + wire_a, arr_b + wire_b) + intrinsic

These are equivalent only when the input with the worst arrival also has the worst wire. When they don't coincide — common on a NoC node where one input comes from a long mesh hop and another from local logic — the current model over-reports by max_wire − actual_wire_on_critical_input.

For sky130 small designs this gap is in the noise. For 22nm with 10× variation between local and global wire delays, it's the difference between "this path meets timing" and "STA reports a violation that doesn't exist."
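
The gap between the lumped and per-input formulas is easy to check numerically. The sketch below uses made-up delays for a gate with one local input and one long mesh-hop input; the function names are illustrative.

```rust
/// Current model: the worst incoming wire is folded into the gate delay.
fn lumped(arr_a: u32, wire_a: u32, arr_b: u32, wire_b: u32, intrinsic: u32) -> u32 {
    arr_a.max(arr_b) + intrinsic + wire_a.max(wire_b)
}

/// Per-input model: each wire delay is added to its own input before the max.
fn per_input(arr_a: u32, wire_a: u32, arr_b: u32, wire_b: u32, intrinsic: u32) -> u32 {
    (arr_a + wire_a).max(arr_b + wire_b) + intrinsic
}

fn main() {
    // Input a: late local logic on a short wire. Input b: early signal on a long hop.
    let (arr_a, wire_a, arr_b, wire_b, intrinsic) = (900, 10, 300, 400, 50);
    let l = lumped(arr_a, wire_a, arr_b, wire_b, intrinsic);
    let p = per_input(arr_a, wire_a, arr_b, wire_b, intrinsic);
    println!("lumped = {l} ps, per-input = {p} ps, over-report = {} ps", l - p);
}
```

Here the over-report is 390 ps, which is the 400 ps worst wire minus the 10 ps wire on the input that actually sets the arrival. Swapping the two wire delays makes both functions agree.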

Inter-partition wires — the architectural wrinkle

A NoC tile naturally maps to one (or a few) partition(s). The inter-tile links — the long, wire-dominated, timing-critical ones — are precisely the partition-crossing signals. Today wire delay sits on the destination cell's gate_delays slot, evaluated inside the destination partition's boomerang reduction. The wire is a property of the crossing, not the destination cell, and should ideally be modelled at the partition I/O boundary, where src/sim/cosim_metal.rs already shuffles state between partitions.

This is the inverse alignment of the clock-tree case. There partitioning didn't help with skew accounting. Here partitioning is load-bearing: tile-aligned partitions naturally expose the small set of edges that deserve careful wire-delay modelling, and let intra-partition logic stay on the fast lumped path.

Three fidelity tiers

Tier	Model	Where wire delay lives	When it's enough
0 (current)	One scalar per destination cell, max-collapsed	Folded into `gate_delays[output_pin]` of dest cell	sky130 + ≥10 ns periods + small designs
1 Per-receiver	One scalar per `(from_pin, to_pin)` edge in the AIG	Folded into the source AIG pin's gate_delay, with one entry per fanout target	Local wires in faster designs; intra-tile NoC logic
2 Per-edge with inter-partition arcs	Tier 1 + explicit wire delay on partition-crossing signals	Tier 1 + new arrival-bump applied during `cosim_metal.rs` state shuffle	Long routes + many-core/NoC + 22nm-scale processes

Tier 1 is mostly a flatten.rs rewrite. Tier 2 needs cosim_metal.rs extension and a new field in the inter-partition transfer format.

Information availability

Yes, it's there:

InterconnectDelay records exist per receiver. SDF carries them. opensta-to-ir emits them.
Per-input-pin granularity is in the IR (to_pin includes the local pin name). The consumer just discards it via to_pin.rfind('/') to derive dest_inst.
Rise/fall distinction is in the schema (delay: [TimingValue] per corner; rise/fall could be on top via the same pattern as TimingArc). For SDF-back-annotated flows the rise/fall split usually comes from the SDF; we'd need to confirm opensta-to-ir preserves both edges.
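
Keeping the pin part of to_pin is a small change at the split point. The sketch below assumes the `/` separator implied by the rfind('/') call mentioned above; the helper name, the instance paths, and the delay values are hypothetical.

```rust
use std::collections::HashMap;

/// Split "inst/path/PIN" into (instance, pin); a name without '/' has no pin part.
fn split_to_pin(to_pin: &str) -> (&str, Option<&str>) {
    match to_pin.rsplit_once('/') {
        Some((inst, pin)) => (inst, Some(pin)),
        None => (to_pin, None),
    }
}

fn main() {
    // Hypothetical records: (to_pin, rise_ps, fall_ps).
    let records = [("tile0/u_nand/A", 12u64, 15u64), ("tile0/u_nand/B", 410, 380)];

    // Keyed by (instance, pin) with separate rise and fall, instead of one max per cell.
    let mut per_pin: HashMap<(&str, &str), (u64, u64)> = HashMap::new();
    for (to_pin, rise, fall) in records {
        if let (inst, Some(pin)) = split_to_pin(to_pin) {
            let e = per_pin.entry((inst, pin)).or_insert((0, 0));
            e.0 = e.0.max(rise);
            e.1 = e.1.max(fall);
        }
    }
    println!("{:?}", per_pin.get(&("tile0/u_nand", "B")));
}
```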

What's missing today:

Tier-1 plumbing: AIG-pin-level wire delay per fanout. Current gate_delays: Vec<PackedDelay> is keyed by AIG pin (the output side); to do per-input-edge correctly we want delay attached to the edge, not the node. Either add a parallel wire_delays: HashMap<(src_aigpin, dst_aigpin), PackedDelay> or refactor toward an edge-attributed AIG.
Tier-2 plumbing: a "partition-crossing arc" concept in cosim_metal.rs. Currently inter-partition state shuffle moves bits with no associated arrival bump. Adding a per-edge ps adjustment is straightforward in principle; finding the right place in the shuffle pipeline matters.
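
One possible shape for the parallel edge-keyed table mentioned under Tier-1 is sketched below. The `PackedDelay` layout shown here (two u16 fields) and the `WireDelays` wrapper are assumptions for illustration; the real `PackedDelay` in src/flatten.rs may be laid out differently.

```rust
use std::collections::HashMap;

type AigPin = usize;

/// Assumed layout for this sketch: rise and fall delay in ps.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct PackedDelay {
    rise_ps: u16,
    fall_ps: u16,
}

/// Wire delay attached to (source pin, destination pin) edges rather than to nodes.
#[derive(Default)]
struct WireDelays {
    by_edge: HashMap<(AigPin, AigPin), PackedDelay>,
}

impl WireDelays {
    fn set(&mut self, src: AigPin, dst: AigPin, d: PackedDelay) {
        self.by_edge.insert((src, dst), d);
    }

    /// Edges with no record contribute zero wire delay.
    fn get(&self, src: AigPin, dst: AigPin) -> PackedDelay {
        self.by_edge.get(&(src, dst)).copied().unwrap_or_default()
    }
}

fn main() {
    let mut wires = WireDelays::default();
    // One source pin fanning out to a near and a far receiver.
    wires.set(7, 20, PackedDelay { rise_ps: 12, fall_ps: 15 });
    wires.set(7, 31, PackedDelay { rise_ps: 410, fall_ps: 380 });
    println!("{:?} {:?} {:?}", wires.get(7, 20), wires.get(7, 31), wires.get(7, 99));
}
```

A hash map keeps the node-indexed `gate_delays` vector untouched, at the cost of a lookup per edge during flattening; an edge-attributed AIG would avoid the lookup but changes more code.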

IR scale

The IR-size concern bites here. InterconnectDelay is roughly 100–200 bytes per record; a 22nm SoC with 10⁶–10⁷ nets is a .jtir file in the hundreds-of-MB to multi-GB range.
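
The range quoted above follows from a one-line estimate. The sketch assumes one InterconnectDelay record per net, which is a simplification: records are per receiver, so nets with fanout above one push the total higher.

```rust
/// Estimated bytes of InterconnectDelay data for a given record count.
fn ir_size_bytes(records: u64, bytes_per_record: u64) -> u64 {
    records * bytes_per_record
}

fn main() {
    let low = ir_size_bytes(1_000_000, 100);
    let high = ir_size_bytes(10_000_000, 200);
    println!("low  = {} MB", low / 1_000_000);
    println!("high = {} MB", high / 1_000_000);
}
```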

Mitigations:

Streaming load: today TimingIrFile::from_path reads the whole buffer. Could mmap and lazy-decode, since FlatBuffers is offset-based.
Sharding: split IR per partition or per top-level module. Adds a build-time step but bounds memory per process.
Drop intra-cell wires from IR generation: SDF often has microscopic interconnect records that lump into the destination's own pin-cap. Filter these out at the opensta-to-ir builder. Loss is genuinely negligible.

Worth measuring before committing to mitigations — sky130 NVDLA-scale today is fine; the question is what 22nm + N-tile mesh looks like.

Partitioning question — the other direction

For NoC designs partitioning becomes a positive lever (unlike the clock-tree case where it was neutral). Two specific levers:

Tile-aligned partitions. If repcut.rs finds tile-aligned cuts naturally (likely, given typical tile-to-tile connectivity sparsity), inter-partition arcs are a small, well-defined set of NoC links. Worth verifying with a representative design — a partitioning report keyed by signal name pattern (*_link_*, noc_*, configurable) would expose whether the partitioner's logic-affinity score is already aligned with tile boundaries or whether we need to bias it.
NoC-link partitioning hint. Add a soft bias to repcut that prefers cutting nets matching a configured regex. Same partitioning machinery, configurable input. Cost: degrades cut quality if the hint conflicts with logic affinity. Likely worth it for explicitly tile-decomposed designs where the user knows the tile boundaries; not worth it for flat designs.

The point of any of this is to make Tier-2 cheap: if the inter-partition arc set is small, per-edge wire delay on those crossings costs almost nothing.

Crosstalk and OCV

These are upstream concerns. SDF from a crosstalk-aware STA flow already carries pessimistic delays; OCV (on-chip variation) is similarly baked into the chosen corner. Jacquard consumes whatever the IR was generated against. Worth a one-line note in the user-facing docs that the timing report's accuracy is bounded by the SDF/STA flow it was built from — Jacquard does not invent crosstalk pessimism.

Staged plan for wire delay

Stage	What	Touches	Kernel	Effort
1 Per-receiver consumption	Key wire delay by `(src_aigpin, dst_aigpin)` edge; fold into source AIG pin's gate_delay per fanout	`src/flatten.rs:1850-1872`, possibly `src/aig.rs` for fanout tracking	None	3–5 days
2 Rise/fall distinction	Preserve per-edge rise/fall through the consumer; honour both in `PackedDelay` accumulation	`src/flatten.rs:1850-1914`	None	1–2 days
3 Inter-partition arc delay	New per-crossing wire-delay table; arrival bump applied during inter-partition state transfer	`src/sim/cosim_metal.rs` shuffle path; `src/flatten.rs` partition-boundary metadata	Yes (transfer path)	2–3 weeks
4 IR scale plumbing	Streaming/mmap load; opensta-to-ir filtering of microscopic records	`src/sim/timing_ir_loader.rs`, `opensta-to-ir/builder`	None	1 week (gated on measurement)
5 NoC-aware partitioning	Soft bias in repcut for cutting flagged nets; partition report by tile	`src/repcut.rs` and CLI flags	None	1–2 weeks

For a sky130 use case Stage 1+2 likely covers everything you'd notice. For 22nm NoC, Stages 1–3 are the meaningful set; Stage 5 is the optimisation that makes Stage 3 cheap.

What we get / don't get

Achievable

Setup/hold accuracy on long routes that today gets clobbered by max-collapse pessimism.
Honest reporting on NoC inter-tile links — the paths that actually matter for many-core SoC timing closure.
All of the above without changing Jacquard's cycle-accurate kernel architecture.

Not achievable from this work alone

Crosstalk-driven delay uncertainty (handled upstream in STA).
Variation-aware (statistical) timing — would need OCV-corner sweeping or SSTA, neither of which is on the roadmap.
Process variation modelling beyond the corners the SDF/IR was generated against.

Open questions

δ(T) characterisation cost. One-off SPICE per cell-type per corner. Cheaper if we lean on existing ECSM/CCSM data already in vendor Liberty rather than re-running SPICE. Worth investigating before committing to Stage 2.
Whose clock arrival is authoritative? Resolved by Part B Stage 1+2: OpenSTA-computed per-DFF arrival via opensta-to-ir, treating launch as 0-reference. Per-pair CRPR credit is intentionally not modelled at this stage (see Stage 3 in the staged plan above).
Interaction. Does δ(T) on clock-tree buffers matter? Probably not enough to model — clock buffers are sized for fast edges and operate far from their pulse-degradation regime. But the framework should be able to express "ignore δ(T) on clock domain" cleanly.
Validation oracle. CVC and Icarus already serve as functional oracles; for skew-aware and wire-aware reporting OpenSTA's slack report (via opensta-to-ir / direct subprocess) is the ground truth for unit tests. (ADR 0003 originally nominated OpenTimer for this role; superseded by the spike outcome — OpenSTA carries the role end-to-end now.)
IR size at 22nm scale. Open question whether .jtir for a representative many-core NoC fits in available memory under the current eager-load model. Needs measurement before committing to streaming mitigations.
Edge-attributed AIG. Per-receiver wire delay wants delay attached to AIG edges, not nodes. Today the AIG is node-attributed (gate_delays: Vec<PackedDelay> indexed by aigpin). A clean Tier-1 implementation may push toward edge attribution, with downstream effects on the boomerang reduction script layout. Worth a small spike before the main implementation.
Partition-crossing format. Adding per-edge wire delay to cosim_metal.rs inter-partition transfers needs a precise place in the existing pipeline. Currently the shuffle moves Boolean state words without arrival; the natural place is alongside the writeout-arrival path that already exists for setup/hold checking, but the alignment isn't 1:1 because partition crossings happen at logic boundaries, not capture-DFF boundaries.

docs/timing-correctness.md — forward-looking validation contract; this doc extends rather than replaces.
docs/timing-simulation.md — boomerang architecture; the kernel-side context.
docs/timing-validation.md — current ±5% acceptance criteria; would tighten under δ(T).
docs/adr/0002-timing-ir.md — IR design rationale; schema additions here follow the "lossless extension" principle.
docs/adr/0001-opensta-as-oracle.md — STA path; OpenSTA out of process is committed (post-supersedure of ADR 0003).
docs/adr/0003-opentimer-primary-sta.md — Superseded. Original in-process STA proposal; spike Q2 fail moved Jacquard to OpenSTA-only. See docs/spikes/opentimer-sky130.md.

In-Design Signal Tracing (`--trace-signals`)

Overview

By default a Jacquard output VCD contains only top-level IO. --trace-signals <FILE> surfaces user-selected internal nets in that VCD alongside the top-level ports — so you can watch a DFF's Q, a controller state bit, or an SRAM port wire without re-synthesizing or exposing it as a port.

It is available on both jacquard sim and jacquard cosim, and is observe-only: traced nets are read out each tick, never driven.

Each name in the file is resolved against the netlist, registered as a primary output before partitioning (so it gets a state-buffer slot), and emitted on the same path as the top-level IO. It works uniformly for sequential (DFF Q) and combinational nets — anything that has a name in the netlist database.

This is the raw-wire counterpart to bus transaction tracing: --trace-signals gives you per-cycle waveforms of individual nets; bus tracing gives you decoded transaction records. Use this when you want a waveform; use bus tracing when you want READ 0x40 => 0x1.

File format

One hierarchical signal name per line:

# JTAG debug-module state (comments and blank lines are ignored)
chip_core.dm.haltreq_q[0]
chip_core.dm.haltreq_q[1]

# Yosys-internal nets — same syntax works
chip_core.sram_u._00147_

# A whole bus, one bit per line
data0_obs[0]
data0_obs[1]

Blank lines and lines whose first non-whitespace character is # are skipped.
Hierarchy uses . as the separator; a trailing [N] selects a bus bit.
A leading backslash (Verilog escaped-identifier syntax) is stripped.

A real example ships in tests/jtag_minimal/trace_signals.txt.
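
The three format rules can be sketched as a small line parser. This is an approximation for readers who want to generate or lint trace files, not the code in src/sim/trace_signals.rs; `TraceEntry` and `parse_trace_file` are hypothetical names, and splitting off the `[N]` suffix is only one of the interpretations the real resolver may try.

```rust
/// One parsed entry: signal name plus optional bus-bit index.
#[derive(Debug, PartialEq)]
struct TraceEntry {
    name: String,
    bit: Option<u32>,
}

fn parse_trace_file(text: &str) -> Vec<TraceEntry> {
    text.lines()
        .map(str::trim)
        // Skip blank lines and lines whose first non-whitespace character is '#'.
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(|l| {
            // Strip a leading backslash (Verilog escaped-identifier syntax).
            let l = l.strip_prefix('\\').unwrap_or(l);
            // A trailing "[N]" with a decimal N selects a bus bit.
            if let Some(body) = l.strip_suffix(']') {
                if let Some((name, idx)) = body.rsplit_once('[') {
                    if let Ok(bit) = idx.parse::<u32>() {
                        return TraceEntry { name: name.to_string(), bit: Some(bit) };
                    }
                }
            }
            TraceEntry { name: l.to_string(), bit: None }
        })
        .collect()
}

fn main() {
    let text = "# comment\n\nchip_core.dm.haltreq_q[1]\n  chip_core.sram_u._00147_\n\\data0_obs[0]\n";
    for e in parse_trace_file(text) {
        println!("{e:?}");
    }
}
```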

Name resolution

Post-synthesis net names are ambiguous — Yosys may flatten a hierarchy into one escaped identifier (\soc.sram.read_port__data), expand a bus into per-bit scalars (soc.bus__addr[3]), or preserve real structural hierarchy. Rather than guess, the resolver tries multiple candidate interpretations of each name and takes the first that matches the netlist database, so the same syntax works across all three conventions.

Unresolved names warn, they don't abort. A bad name logs a warning and is skipped; the rest of the list still registers. A trailing summary line reports how many signals registered vs. were dropped, so a mistyped list surfaces clearly at startup:
```
--trace-signals: registered 34 signal(s), dropped 2 (file: trace.txt)
```
Names that resolve to a constant (tied 0/1) are skipped — there's nothing to observe at runtime.
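
The first-match and warn-and-continue behaviour can be illustrated with a toy resolver. Everything here is hypothetical: the candidate list (the name as written, then the same name as an escaped identifier) is invented for the example, the netlist is modelled as a plain set of strings, and the warning wording is illustrative. Only the summary format is copied from the line shown above.

```rust
use std::collections::HashSet;

/// Illustrative candidate spellings only; the real resolver's list is not shown here.
fn candidates(name: &str) -> Vec<String> {
    vec![name.to_string(), format!("\\{name}")]
}

/// Return the first candidate spelling that exists in the netlist.
fn resolve(name: &str, netlist: &HashSet<String>) -> Option<String> {
    candidates(name).into_iter().find(|c| netlist.contains(c))
}

/// Register every resolvable name; warn on the rest and keep going.
fn register_all(names: &[&str], netlist: &HashSet<String>, file: &str) -> (Vec<String>, String) {
    let mut registered = Vec::new();
    let mut dropped = 0usize;
    for name in names {
        match resolve(name, netlist) {
            Some(net) => registered.push(net),
            None => {
                eprintln!("warning: {name} not found, skipping");
                dropped += 1;
            }
        }
    }
    let summary = format!(
        "--trace-signals: registered {} signal(s), dropped {} (file: {})",
        registered.len(),
        dropped,
        file
    );
    (registered, summary)
}

fn main() {
    let netlist: HashSet<String> = ["a.b", "\\c.d"].iter().map(|s| s.to_string()).collect();
    let (_, summary) = register_all(&["a.b", "c.d", "typo"], &netlist, "trace.txt");
    println!("{summary}");
}
```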

Where the output lands

Traced nets appear in whichever VCD the run already emits:

Command	Flag	Traced nets appear in
`jacquard sim`	`--trace-signals <FILE>`	the output VCD
`jacquard cosim`	`--trace-signals <FILE>`	the `--output-vcd` output only

They show up as ordinary VCD wires next to the top-level IO, named by the string you put in the trace file.

cosim: traced nets land in --output-vcd only. The --stimulus-vcd carries primary inputs and does not include them, so if you trace a net and look in the stimulus VCD you'll see nothing. --output-vcd does not require timing data — see Pre-PnR functional runs.

Pre-PnR functional runs

--output-vcd is the functional output path too — it does not require --timing-ir/SDF. Run a synthesized (pre-PnR) netlist through cosim with --output-vcd out.vcd and you get chip outputs and traced nets per cycle, with transitions at clock edges (no arrival-time offsets). This is the right mode for functional / 4-state X-pessimism debugging, where there is no timing data to supply yet. Adding --timing-ir later only adds arrival-time offsets to the same VCD.

Top-level inout (bidir) pads

A top-level inout pad is split into two observables in the output VCD: <pad>__out (the value the core drives) and <pad>__oe (the pad's output-enable). The raw <pad> net reads the pad's input side, so on an output-only or undriven cycle it can look flat — watch <pad>__out / <pad>__oe to see what the design is driving. Example: bidir_PAD[12]__out, bidir_PAD[12]__oe. These appear automatically; you don't need to list them in the trace file.

Finding signal names

Use the netlist-graph tool (see the project README) to discover the exact post-synthesis names:

# Search for nets matching a pattern
uv run netlist-graph search <netlist.v> "haltreq"

# Trace what drives / loads a signal (to find nearby observable nets)
uv run netlist-graph drivers <netlist.v> "soc.cpu.state" -d 5
uv run netlist-graph loads   <netlist.v> "soc.cpu.ack"   -d 5

# Emit a ready-to-use trace file
uv run netlist-graph watchlist <netlist.v> out.json signal1 signal2 ...

SRAM observability workflow

The recommended way to observe SRAM port activity is wire-level tracing rather than the env-var-gated JACQUARD_SRAM_DUMP. netlist-graph can discover the port wires and emit a trace file directly:

# 1. Discover SRAM port wire names from the netlist
uv run netlist-graph sram-ports design.v --cell-type SRAM -o sram_trace.txt

# 2. Surface them in the VCD with full per-tick accuracy
jacquard cosim design.v --config sim.json \
    --trace-signals sram_trace.txt --output-vcd out.vcd

# 3. Post-process the VCD to reconstruct bus values
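Step 3 is left open in the workflow above. One possible approach, once the per-bit samples for a tick have been read out of the VCD, is to fold them back into integers. The sketch below assumes traced bits are named `<base>[<index>]`; the helper `assemble_bus` and the `sram0.*` names are hypothetical and not part of Jacquard or netlist-graph.

```python
import re

# Hypothetical helper: rebuild an integer bus value from per-bit samples.
# Assumes traced bits are named "<base>[<index>]"; adjust the pattern if
# your netlist uses a different scalar-expansion style.
BIT_RE = re.compile(r"^(?P<base>.+)\[(?P<idx>\d+)\]$")


def assemble_bus(bit_values, base):
    """Combine {"name[i]": 0/1} samples belonging to `base` into one integer."""
    value = 0
    for name, bit in bit_values.items():
        match = BIT_RE.match(name)
        if match and match.group("base") == base and bit:
            value |= 1 << int(match.group("idx"))
    return value


# One tick's worth of samples (made-up names and values).
sample = {"sram0.addr[0]": 1, "sram0.addr[1]": 0, "sram0.addr[3]": 1,
          "sram0.we": 1}
print(hex(assemble_bus(sample, "sram0.addr")))  # 0x9
```

Bits that are missing from the sample contribute 0, so a partially traced bus still yields a value; whether that is acceptable depends on why the bits are missing.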

Example

tests/jtag_minimal/ uses --trace-signals to surface the debug module's observable outputs (dmactive_obs, haltreq_obs, data0_obs[0..31]) so the test's pass criterion can check that the magic value 0xCAFEBABE lands in data0_obs:

jacquard cosim tests/jtag_minimal/data/top.pnl.v \
    --config tests/jtag_minimal/sim_config.json \
    --trace-signals tests/jtag_minimal/trace_signals.txt \
    --jtag-replay tests/jtag_minimal/data/bitbang.rec \
    --output-vcd out.vcd

Troubleshooting

Symptom	Cause / fix
`not found in netlistdb (tried N candidate(s))`	The name doesn't exist post-synthesis under any candidate spelling. Find the real name with `netlist-graph search`; the net may have been renamed or optimized away.
Signal registered but flat in the VCD	It may resolve to a constant after optimization (the startup log notes constants are skipped), or the cone was stripped. Confirm it's a live net with `netlist-graph drivers`.
Nothing appears in the VCD	For cosim, traced nets only land in `--output-vcd` (not `--stimulus-vcd`) — make sure you passed it. Also check the startup summary reports a non-zero registered count.

Implementation notes

Registration happens at AIG construction, before partitioning, which is why the list must be supplied via the CLI flag (not a runtime env var). The mechanism lives in src/sim/trace_signals.rs; emission piggybacks on emit_extra_observables in src/sim/vcd_io.rs. The same multi-candidate resolver backs bus-trace pin binding (see bus tracing and ADR 0013).

Bus Transaction Tracing (AHB / APB)

Overview

jacquard cosim can decode on-chip bus transactions and emit them in a compact, transaction-level form — one row per transfer, rather than raw per-cycle waveforms. You declare the bus interfaces to watch in sim_config.json; cosim observes their pins on the GPU each tick and runs the protocol decode on the CPU, writing decoded transactions to a CSV file.

This is observe-only: the tracer watches signals the design already drives and never drives anything itself. It adds no measurable simulation overhead when no buses are configured.

Protocol	Status
APB3	Supported
AHB-Lite	Planned (pipelined address/data pairing, burst tracking)
AHB5	Planned (AHB-Lite + security / exclusive signals)

The design rationale lives in ADR 0013; the roadmap is in plans/bus-transaction-tracing.md.

Bus tracing is the structured, protocol-aware counterpart to --trace-signals, which surfaces raw internal nets in the output VCD. Use --trace-signals when you want waveforms of individual wires; use bus tracing when you want decoded READ 0x40 => 0x1 records.

Configuring a bus

Add a bus_traces array to sim_config.json. Each entry names one bus interface:

{
    "netlist_path": "build/soc.gv",
    "clock_gpio": 0,
    "reset_gpio": 1,
    "num_cycles": 100000,
    "clock_period_ps": 40000,

    "bus_traces": [
        {
            "name": "dmi",
            "protocol": "apb3",
            "prefix": "soc.dm.",
            "addr_bits": 9,
            "data_bits": 32
        }
    ]
}

Field	Required	Meaning
`name`	yes	Label for this bus in the CSV `bus` column.
`protocol`	yes	`apb3` (or `ahb-lite` / `ahb5` once supported).
`prefix`	yes	Hierarchical net-name prefix; standard pin names are appended (see below). May be `""` for top-level pins.
`addr_bits`	no (default 32)	Address bus width.
`data_bits`	no (default 32)	Data bus width.
`signals`	no	Per-pin net-name overrides (see Pin resolution).

Pin names

By default each protocol pin is resolved as {prefix}{pin}. For APB3:

Logical pin	Default net	Notes
`psel`	`{prefix}psel`	required
`penable`	`{prefix}penable`	required
`pwrite`	`{prefix}pwrite`	direction
`paddr`	`{prefix}paddr[i]`	`addr_bits` wide
`pwdata`	`{prefix}pwdata[i]`	`data_bits` wide
`prdata`	`{prefix}prdata[i]`	`data_bits` wide
`pready`	`{prefix}pready`	optional — unresolved is treated as always-ready (1)
`pslverr`	`{prefix}pslverr`	optional — unresolved is treated as no-error (0)

So a bus with "prefix": "soc.dm." looks for soc.dm.psel, soc.dm.paddr[0], …, soc.dm.prdata[31].
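To make the role of these pins concrete, here is a minimal sketch of the standard APB3 completion rule (a transfer completes when `psel`, `penable` and `pready` are all high). It is not Jacquard's decoder. It assumes one sample per rising clock edge, each given as a dict of logical pin to integer value, and it applies the documented defaults for an absent `pready` (1) and `pslverr` (0). The function name `decode_apb3` is hypothetical.

```python
# Sketch of the APB3 completion rule, not Jacquard's decoder. Assumes one
# sample per rising clock edge, each a dict of logical pin -> int value.
def decode_apb3(samples):
    transactions = []
    for tick, s in enumerate(samples):
        ready = s.get("pready", 1)          # absent pready: always ready
        if s["psel"] and s["penable"] and ready:
            is_write = bool(s["pwrite"])
            transactions.append({
                "tick": tick,
                "dir": "WR" if is_write else "RD",
                "addr": s["paddr"],
                "data": s["pwdata"] if is_write else s["prdata"],
                "resp": "ERR" if s.get("pslverr", 0) else "OK",  # absent: no error
            })
    return transactions


idle = {"psel": 0, "penable": 0}
wr = {"psel": 1, "pwrite": 1, "paddr": 0x10, "pwdata": 0xCAFEBABE, "prdata": 0}
rd = {"psel": 1, "pwrite": 0, "paddr": 0x10, "pwdata": 0, "prdata": 0xCAFEBABE}
samples = [
    idle,
    dict(wr, penable=0),             # SETUP phase
    dict(wr, penable=1),             # ACCESS phase, completes immediately
    dict(rd, penable=0),             # SETUP phase
    dict(rd, penable=1, pready=0),   # ACCESS phase, slave inserts a wait state
    dict(rd, penable=1, pready=1),   # ACCESS phase, completes
]
txns = decode_apb3(samples)
print([(t["tick"], t["dir"], hex(t["data"])) for t in txns])
# [(2, 'WR', '0xcafebabe'), (5, 'RD', '0xcafebabe')]
```

Sampling once per clock cycle matters here: if the same pins were sampled on both rising and falling edges, this simple rule would report each transfer twice.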

If your design's pins don't follow that convention, remap individual logical pins with signals:

{
    "name": "periph",
    "protocol": "apb3",
    "prefix": "soc.apb.",
    "signals": {
        "psel": "soc.apb_decode.sel_periph",
        "prdata": "soc.apb_mux.readback"
    }
}
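The default naming rule and the `signals` override can be sketched as follows. This only mirrors the `{prefix}{pin}` convention described above; it assumes that an override for a bus pin names the bus base and that its bits are addressed as `base[i]`, which may not match the real resolver. The function `apb3_net_names` is hypothetical.

```python
# Illustrative only: mirrors the documented {prefix}{pin} rule and the
# `signals` override. Assumes a bus override names the bus base and that
# bits are addressed as base[i]; the real resolver may differ.
SCALAR_PINS = ["psel", "penable", "pwrite", "pready", "pslverr"]


def apb3_net_names(prefix, addr_bits=32, data_bits=32, signals=None):
    """Return {logical pin: [candidate net names, LSB first]}."""
    signals = signals or {}
    buses = {"paddr": addr_bits, "pwdata": data_bits, "prdata": data_bits}
    nets = {}
    for pin in SCALAR_PINS:
        nets[pin] = [signals.get(pin, prefix + pin)]
    for pin, width in buses.items():
        base = signals.get(pin, prefix + pin)
        nets[pin] = [f"{base}[{i}]" for i in range(width)]
    return nets


dmi = apb3_net_names("soc.dm.", addr_bits=9)
print(dmi["psel"][0], dmi["paddr"][0], dmi["prdata"][31])
# soc.dm.psel soc.dm.paddr[0] soc.dm.prdata[31]

periph = apb3_net_names("soc.apb.", signals={
    "psel": "soc.apb_decode.sel_periph",
    "prdata": "soc.apb_mux.readback",
})
print(periph["psel"][0], periph["penable"][0], periph["prdata"][0])
# soc.apb_decode.sel_periph soc.apb.penable soc.apb_mux.readback[0]
```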

Running

cargo run -r --features metal --bin jacquard -- cosim \
    build/soc.gv \
    --config sim_config.json \
    --bus-trace-csv bus.csv

At startup each bus logs whether it resolved:

bus-trace `dmi` (APB3): psel/penable resolved, addr 9/9 bits, pready=true pslverr=true

and at the end:

bus-trace: decoded 12 transaction(s) across 1 bus(es)
bus-trace: wrote 12 transaction(s) to bus.csv

CSV output

tick,bus,protocol,dir,addr,data,resp,burst
24,dmi,apb3,WR,0x10,0xCAFEBABE,OK,
30,dmi,apb3,RD,0x10,0xCAFEBABE,OK,

Column	Meaning
`tick`	Cosim edge at which the transfer completed. One clock cycle = 2 edges (rising + falling) for a single-domain design.
`bus`	The configured bus `name`.
`protocol`	`apb3` / `ahb-lite` / `ahb5`.
`dir`	`WR` or `RD`.
`addr`	Transfer address (hex).
`data`	`pwdata` for writes, `prdata` for reads (hex).
`resp`	`OK` or `ERR` (from `pslverr` / `hresp`).
`burst`	AHB burst position `beat/len` (empty for APB).
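For post-processing, the CSV can be read with Python's standard `csv` module. The sketch below inlines the two example rows so that it runs on its own; the `cycle` field is derived under the single-domain assumption of two edges per clock cycle, and the function name `load_transactions` is hypothetical.

```python
import csv
import io

# The two rows from the example above, inlined so the sketch is runnable.
CSV_TEXT = """tick,bus,protocol,dir,addr,data,resp,burst
24,dmi,apb3,WR,0x10,0xCAFEBABE,OK,
30,dmi,apb3,RD,0x10,0xCAFEBABE,OK,
"""


def load_transactions(text):
    """Parse bus-trace CSV text into a list of dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        tick = int(row["tick"])
        rows.append({
            "tick": tick,
            # Two edges per clock cycle holds for single-domain designs only.
            "cycle": tick // 2,
            "bus": row["bus"],
            "dir": row["dir"],
            "addr": int(row["addr"], 16),
            "data": int(row["data"], 16),
            "ok": row["resp"] == "OK",
        })
    return rows


rows = load_transactions(CSV_TEXT)
print(rows[0]["cycle"], rows[0]["dir"], hex(rows[0]["data"]))  # 12 WR 0xcafebabe
```

For a real run, replace `CSV_TEXT` with the contents of the file passed to `--bus-trace-csv`.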

Pin resolution

For the GPU to read a bus pin each tick, that net must (1) exist in the post-synthesis netlist under a resolvable name and (2) survive into the simulation's output state. Two consequences:

Names must survive synthesis. The resolver uses the same multi-candidate matcher as --trace-signals, so Yosys-flattened (\soc.dm.psel), scalar-expanded (soc.dm.paddr[3]), and structurally-hierarchical names all work. But synthesis is free to rename or delete combinational nets. The robust pattern is to make the bus signals registers (their DFF Q outputs keep their names), or to annotate the RTL nets with (* keep *).
Constant-folded bits read as 0 — correctly. If a design only ever drives, say, addresses 0x00 and 0x04, synthesis folds every address bit except paddr[2] to a constant. The startup log then shows e.g. addr 1/8 bits. This is expected: the tracer reconstructs the full value correctly because the dropped bits are genuinely 0.

pready / pslverr are allowed to be absent. A common case is an always-ready slave that ties pready high — it folds to a constant, fails to resolve, and the tracer correctly treats the bus as always-ready.
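The constant-folding point can be illustrated with a small sketch: a value is rebuilt from whichever bits resolved, and unresolved bits contribute 0. This is only a model of the behaviour described above, with a hypothetical helper name; it does not show how the tracer is implemented.

```python
# Illustrative: rebuild a bus value when only some bits resolved.
# `resolved` maps bit index -> sampled value; unresolved bits are the
# constant-folded ones, which the text above says are genuinely 0.
def rebuild_value(resolved, width):
    value = 0
    for bit in range(width):
        value |= (resolved.get(bit, 0) & 1) << bit
    return value


# A design that only drives 0x00 and 0x04 keeps just paddr[2] (addr 1/8 bits).
print(hex(rebuild_value({2: 1}, 8)))  # 0x4
print(hex(rebuild_value({2: 0}, 8)))  # 0x0
```

The same reasoning does not hold for a bit that was renamed rather than folded: that bit also fails to resolve, but its true value is not a constant 0.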

Worked example

tests/apb_trace/ is a self-contained, synthesizable APB3 system used as the CI regression. Its master issues a fixed program — two writes then two reads — to a register-file slave, and check.py asserts the decoded CSV. See tests/apb_trace/README.md.

yosys -s tests/apb_trace/synth.tcl          # (from tests/apb_trace/)
cargo run -r --features metal --bin jacquard -- cosim \
    tests/apb_trace/apb_trace_synth.gv \
    --config tests/apb_trace/sim_config.json \
    --top-module apb_trace \
    --max-clock-edges 200 \
    --bus-trace-csv apb.csv
python3 tests/apb_trace/check.py apb.csv

Troubleshooting

Symptom	Cause / fix
`psel/penable did not resolve … this bus will not capture`	The `prefix` is wrong, or the nets were optimized away. Find the real names with `uv run netlist-graph search <netlist> psel`, then fix `prefix` or add `signals` overrides.
Zero transactions decoded	The `psel`/`penable` gate never asserted. Check that `psel`/`penable` resolve (startup log) and that the bus is actually exercised within `--max-clock-edges`.
Address or data always 0	`paddr`/`pwdata`/`prdata` nets didn't resolve (renamed/folded). Confirm with `netlist-graph search`; mark the RTL nets `(* keep *)` and re-synthesize.
Reads return stale/wrong data	The slave must present `prdata` during the ACCESS phase. Register `prdata` so its value is stable while `psel` and `penable` are both high.

Limitations

APB3 only for now; AHB-Lite / AHB5 and annotated-VCD output are the next phases (see the plan).
Up to 4 buses per run, addresses/data up to 32 bits.
Cosim is now backend-portable (Metal, CUDA, HIP, plus a CPU fallback); bus tracing is wired across the GPU backends (cuda.rs/hip.rs/metal.rs).
The legacy hardcoded Wishbone trace (a separate, SoC-specific path) is unaffected; folding it onto this general mechanism is a planned follow-up.

Cosim Perf Report (`--cosim-perf-json`)

Overview

jacquard cosim reports a per-edge timing breakdown of the co-simulation loop, including ground-truth GPU-execution time from device timestamps — free of the instrumentation overhead that a full GPU trace (Metal System Trace / nsys) imposes on a workload of thousands of tiny dispatches per batch.

The breakdown always prints in the human-readable profiling summary at the end of a run. Passing --cosim-perf-json <PATH> additionally writes a flat JSON record for CI to log and track.

The measurement

The cosim loop batches edges into GPU command buffers (BATCH_SIZE edges each). Per batch, the CPU: builds the command buffer (encode), commits it, then waits for the GPU to finish (spin), then drains ring buffers (VCD/UART). The existing summary times these CPU phases with Instant deltas. This report adds the GPU-side truth:

Metal: MTLCommandBuffer.GPUStartTime/GPUEndTime — host-clock timestamps the driver records for free; GPUEndTime − GPUStartTime is the batch's GPU-execution wall. Read after waitUntilCompleted (the driver posts the timestamps only at full completion).
CUDA / HIP: cudaEvent / hipEvent elapsed time — follow-up (the trait seam exists; the impls land in a later PR validated on the GPU CI runner).
CPU backend: none — gpu_* fields are 0 / gpu_exec_recorded: false.

The seam is one CosimBackend trait method, last_batch_gpu_seconds() -> Option<f64>, called by the orchestration after each batch's wait(). The generic loop owns the CPU timing and report assembly; each backend supplies only its GPU clock.

Human summary

=== Profiling Breakdown ===
  Batch encode + commit             7.6μs/tick   10.3%
  GPU wait (spin)                  66.3μs/tick   89.3%
  ...
  TOTAL (instrumented)             74.2μs/tick  100.0%
  GPU exec (device ts)             65.8μs/tick   88.7%  (GPU-busy share of wall)

GPU exec is a separate measure from the CPU categories — it overlaps the GPU wait (spin) line (the CPU wall spent waiting is the GPU execution). Its % is the GPU-busy share of the instrumented wall.

JSON schema (report-only)

{
  "backend": "metal",
  "edges": 500000,
  "wall_seconds": 40.1,
  "total_us_per_edge": 74.24,
  "gpu_exec_us_per_edge": 65.85,
  "gpu_util_pct": 88.7,
  "gpu_exec_recorded": true,
  "cpu_encode_us_per_edge": 7.56,
  "gpu_wait_spin_us_per_edge": 66.31,
  "drain_us_per_edge": 0.001,
  "output_vcd_us_per_edge": 0.36,
  "batches": 550,
  "mean_batch": 909.1
}

This is a report format, not a stable contract — it may gain fields.
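As a quick consistency check, two of the fields in the example record can be recomputed from the others. The relationships used here (`gpu_util_pct` as GPU-exec time over total time, `mean_batch` as edges over batches) are inferred from the example numbers and the summary text, not taken from a documented definition, so treat them as assumptions.

```python
import json

# The example record from above.
REPORT = json.loads("""
{
  "backend": "metal",
  "edges": 500000,
  "wall_seconds": 40.1,
  "total_us_per_edge": 74.24,
  "gpu_exec_us_per_edge": 65.85,
  "gpu_util_pct": 88.7,
  "gpu_exec_recorded": true,
  "cpu_encode_us_per_edge": 7.56,
  "gpu_wait_spin_us_per_edge": 66.31,
  "drain_us_per_edge": 0.001,
  "output_vcd_us_per_edge": 0.36,
  "batches": 550,
  "mean_batch": 909.1
}
""")


def derived(report):
    """Recompute (gpu_util_pct, mean_batch) from the raw fields."""
    gpu_util = 100.0 * report["gpu_exec_us_per_edge"] / report["total_us_per_edge"]
    mean_batch = report["edges"] / report["batches"]
    return round(gpu_util, 1), round(mean_batch, 1)


print(derived(REPORT))  # (88.7, 909.1)
```

A consumer should check `gpu_exec_recorded` before trusting the `gpu_*` fields, since backends without device timestamps report them as 0.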

CI

The mcu-soc-metal job runs the mcu_soc cosim with --cosim-perf-json and posts the JSON to the GitHub step summary (report-only — no perf gate). This builds a timing history before we commit to a regression threshold: GPU/thermal noise on shared runners needs a measured noise floor first, and a premature gate would flake. Adding a gate (fail if total_us_per_edge regresses beyond a tuned threshold vs a committed baseline) is the natural follow-up once the history exists, alongside the CUDA/HIP last_batch_gpu_seconds impls so every tested architecture reports.

Why not just use a GPU trace?

Metal System Trace / nsys instrument every Metal/CUDA call. For a batch of ~2048 tiny dispatches that overhead is large and systematically distorts the picture — it inflated an early measurement to make the GPU look ~88% idle when device timestamps show it ~88% busy. Command-buffer timestamps cost nothing and are the trustworthy signal for steady-state throughput; reserve full traces for one-off per-dispatch investigation (see docs/gpu-capture.md).

GPU Frame Capture (`.gputrace`)

Overview

jacquard cosim (Metal backend) can emit an Xcode .gputrace document capturing the exact Metal command stream of the GPU simulation — every compute dispatch, buffer binding, and blit for a bounded window of edge batches. This is the detailed counterpart to a system-level xctrace --template "Metal System Trace" recording: the system trace shows when GPU work runs relative to the CPU, while a .gputrace shows what each dispatch does (threadgroup sizes, buffer arguments, per-dispatch timing, dependency DAG, and — after an Xcode replay — shader occupancy and memory-bandwidth counters).

Capture is env-gated and zero-overhead when off: with no JACQUARD_GPU_CAPTURE set, none of the capture code runs.

Quick start

# Capture one steady-state batch (skip the reset/warm-up batch 0).
# METAL_CAPTURE_ENABLED=1 is REQUIRED — Metal refuses programmatic
# .gputrace capture without it.
METAL_CAPTURE_ENABLED=1 \
JACQUARD_GPU_CAPTURE=/tmp/cosim.gputrace \
JACQUARD_GPU_CAPTURE_SKIP=1 \
  cargo run -r --features metal --bin jacquard -- cosim \
    tests/qspi_psram/qspi_psram_dut_synth.gv \
    --config tests/qspi_psram/sim_config.json \
    --top-module qspi_psram_dut \
    --max-clock-edges 200 \
    --output-vcd target/test-out/qspi_psram.vcd

open /tmp/cosim.gputrace   # opens in Xcode's GPU debugger

Environment variables

Variable	Default	Meaning
`JACQUARD_GPU_CAPTURE`	(unset)	Output path for the `.gputrace` bundle. Setting it enables capture.
`JACQUARD_GPU_CAPTURE_SKIP`	`0`	Batches to run before opening the capture window. Skip the reset/warm-up batches to grab a steady-state one.
`JACQUARD_GPU_CAPTURE_BATCHES`	`1`	Number of consecutive batches to capture. Keep small — each batch is thousands of dispatches.
`METAL_CAPTURE_ENABLED`	(unset)	Required by Metal for programmatic `.gputrace` capture. Without it, capture is skipped with a warning and the run continues uncaptured.

How it works

The Metal backend commits one command buffer per edge batch (MetalSimulator::encode_and_commit_gpu_batch). Each batch encodes, for every edge in it, the full per-edge dispatch chain:

state_prep → apply_flash_din → simulate_v1_stage × N → flash_model_step → io_step → VCD blit

Capture brackets an MTLCaptureManager scope (destination GpuTraceDocument, scoped to the simulator's command queue) around the window [SKIP, SKIP + BATCHES):

Start — immediately before the first in-window batch's command buffer is created, so the trace opens on a clean batch boundary.
Stop — after the last in-window batch is committed; that command buffer is waited to completion first so its GPU work is recorded before stopCapture() finalizes the document.

A stale bundle at the target path is removed first (Metal refuses to overwrite). Any failure (missing METAL_CAPTURE_ENABLED, unsupported destination) is logged and the simulation continues normally, uncaptured.
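The window logic can be modelled in a few lines. This is a sketch of the documented behaviour (environment defaults, the half-open window `[SKIP, SKIP + BATCHES)`, start before the first in-window batch and stop after the last), not the Rust implementation; both function names are hypothetical.

```python
import os


def capture_window(env=os.environ):
    """Return (path, range of batch indices to capture), or None if capture is off."""
    path = env.get("JACQUARD_GPU_CAPTURE")
    if not path:
        return None
    skip = int(env.get("JACQUARD_GPU_CAPTURE_SKIP", "0"))
    batches = int(env.get("JACQUARD_GPU_CAPTURE_BATCHES", "1"))
    return path, range(skip, skip + batches)


def capture_events(num_batches, window):
    """List (batch, action) pairs: start before the first in-window batch,
    stop after the last one has been committed and waited on."""
    events = []
    for batch in range(num_batches):
        if batch in window and batch == window.start:
            events.append((batch, "start"))
        if batch in window and batch == window.stop - 1:
            events.append((batch, "stop"))
    return events


env = {"JACQUARD_GPU_CAPTURE": "/tmp/cosim.gputrace",
       "JACQUARD_GPU_CAPTURE_SKIP": "1"}
path, window = capture_window(env)
print(path, list(window), capture_events(4, window))
# /tmp/cosim.gputrace [1] [(1, 'start'), (1, 'stop')]
```

The sketch leaves out the `METAL_CAPTURE_ENABLED` check and the stale-bundle removal, both of which happen before capture starts.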

Choosing a batch to capture

Batch 0 is the reset/warm-up batch and is usually small and unrepresentative. JACQUARD_GPU_CAPTURE_SKIP=1 (or higher) captures a steady-state batch. The batch size is BATCH_SIZE edges (see src/sim/cosim/mod.rs), so a single captured batch already contains the complete per-edge dispatch stream repeated across the batch — enough to profile the kernel without a giant multi-batch trace.

Shader performance counters

The .gputrace records the command stream and per-dispatch timing out of the box. Occupancy, bandwidth, and cache-hit counters require an Xcode replay with shader profiling enabled: open the trace in Xcode, press Replay, and Xcode writes the counter streamData back into the bundle. Only then do the profiler_gpu_counters / profiler_gpu_scheduling analyses have data to read.

Relationship to system-level tracing

For CPU↔GPU timeline correlation (where the sim spends wall-clock, GPU duty cycle, inter-batch CPU gaps) use a Metal System Trace instead:

xctrace record --template "Metal System Trace" --output cosim.trace \
  --launch -- jacquard cosim ...

That .trace answers "is the sim CPU- or GPU-bound?"; the .gputrace here answers "is each GPU dispatch efficient?". They are complementary.

Selective X-Propagation for Jacquard

Summary

This document proposes adding selective unknown-value (X) propagation to Jacquard's gate-level simulator. Rather than uniformly upgrading the entire simulator to four-state logic (2x storage, ~2-3x ALU cost), we use static analysis at compile time to identify which signals can carry X values, and only apply X-aware simulation to the affected partitions. The rest of the design continues to run with the existing fast two-state kernel.

Motivation

Jacquard currently simulates in pure two-state Boolean logic (0/1). All DFFs and SRAMs start at 0. This is fast but has three significant drawbacks:

Undetected initialisation bugs: If a design reads a register before it has been written through a proper reset sequence, the simulator silently returns 0 instead of flagging the value as unknown. Real hardware would produce an arbitrary bit pattern.
SRAM initialisation masking: SRAMs read as all-zeros before being written. A design that depends on SRAM initialisation may appear to work in simulation but fail on silicon.
No RTL/gate-level mismatch detection: RTL simulators (Icarus, VCS, Questa) propagate X values that expose these bugs. When comparing Jacquard's gate-level results against an RTL reference, false mismatches arise because Jacquard resolves unknowns to zero.

Naively upgrading the entire simulator to four-state would roughly halve throughput or worse (double storage per signal, 2-3x ALU work per gate). The key insight is that in a well-designed SoC after reset, typically <5% of signals are genuinely X-capable (uninitialised memories, clock-domain crossings, registers without reset). We should only pay the overhead where it matters.

Background: X in And-Inverter Graphs

Jacquard's core IR is an And-Inverter Graph (AIG). Every gate is an AND with optional input inversions. The boomerang reduction tree computes:

ret = (a ^ xora) & ((b ^ xorb) | orb)

Where xora/xorb encode inversions and orb encodes pass-through (when orb = 0xFFFFFFFF, input b is forced to 1, making the gate a buffer for a).

The AND gate has a favourable property for X propagation:

a	b	a AND b
0	0	0
0	1	0
0	X	0
1	0	0
1	1	1
1	X	X
X	0	0
X	1	X
X	X	X

A known-zero on either input forces the output to known-zero regardless of X on the other input. This means X does not spread as aggressively through AND logic as it would through XOR or MUX logic. AIG-based designs have a natural tendency to "absorb" X values at AND gates with known-zero inputs.

NOT (inversion) simply preserves the X mask: NOT(X) = X, NOT(0) = 1, NOT(1) = 0.
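The truth table and the inversion rule translate directly into a small three-valued reference model. The sketch below uses the string "X" for the unknown value; the names `and3` and `not3` are chosen for illustration only.

```python
X = "X"  # the unknown value


def and3(a, b):
    """Three-valued AND: a known 0 on either input dominates."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def not3(a):
    """Three-valued NOT: X stays X, known values invert."""
    return X if a == X else 1 - a


for a in (0, 1, X):
    for b in (0, 1, X):
        print(a, b, and3(a, b))
```

The nine printed rows match the table above, including the two absorbing cases `0 AND X = 0` and `X AND 0 = 0`.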

Design

Phase 1: Static X-Source Analysis (Compile Time)

At AIG construction time, identify all X sources -- signals whose initial value is unknown:

DFF Q outputs: Every DFF output is an X source at cycle 0. In real hardware, flip-flop power-on state is indeterminate.
SRAM read ports: All 32 data-output pins of each RAMBlock are X sources. Memory contents are undefined until written.
Undriven primary inputs: If a primary input is not driven by the testbench VCD, it should be marked X. (Currently Jacquard warns about missing PI signals; this would upgrade that warning to X propagation.)

These are identified directly from AIG.drivers (variant DFF(_) and SRAM(_)) and from AIG.dffs / AIG.srams.

Phase 2: Forward Cone Computation (Compile Time)

Compute the forward cone of influence from all X sources through the AIG. Since AIG pins are guaranteed to be in topological order, this is a single linear-time forward pass:

x_capable = BitVec::zeros(num_aigpins + 1)

// Mark X sources
for each DFF:
    x_capable.set(dff.q)
for each SRAM:
    for each read data pin:
        x_capable.set(pin)

// Forward propagate (pins are in topological order)
for aigpin in 1..=num_aigpins:
    if let AndGate(a_iv, b_iv) = drivers[aigpin]:
        a = a_iv >> 1   // strip inversion bit
        b = b_iv >> 1
        if x_capable[a] || x_capable[b]:
            x_capable.set(aigpin)

This is O(V + E) on the AIG -- negligible compared to partitioning.

Sequential propagation: X-capability also propagates through DFF feedback loops. If a DFF's D input is X-capable, its Q output remains X-capable (even after the first clock edge, because it may have captured an X value). The analysis iterates until a fixpoint:

loop:
    changed = false
    for each DFF:
        d_pin = dff.d_iv >> 1
        if x_capable[d_pin] && !x_capable[dff.q]:
            x_capable.set(dff.q)
            changed = true
    if !changed:
        break
    // Re-run forward cone from newly-marked DFF Q outputs

In practice this converges in 1-2 iterations because most feedback loops go through DFFs that are already marked.
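The two pseudocode fragments above can be combined into one runnable sketch. It models the AIG as a dict of AND gates keyed by pin, with inputs encoded as `pin << 1 | invert`, and DFFs as `(d_iv, q_pin)` pairs; these data shapes are assumptions for illustration and not the layout in `src/aig.rs`. The `seed_qs` parameter is a hypothetical knob added only so that the fixpoint loop can be exercised: the design above seeds every DFF Q, in which case the loop exits after its first check.

```python
def x_capable_pins(num_pins, and_gates, dffs, sram_read_pins=(), seed_qs=None):
    """Return a list where entry p is True if AIG pin p may carry X.

    and_gates: {pin: (a_iv, b_iv)}, inputs encoded as (pin << 1) | invert.
    dffs: [(d_iv, q_pin)]. Pins are assumed to be in topological order.
    """
    x = [False] * (num_pins + 1)
    # Per the design above every DFF Q is an X source; seed_qs overrides that.
    for q in ([q for _d, q in dffs] if seed_qs is None else seed_qs):
        x[q] = True
    for pin in sram_read_pins:
        x[pin] = True
    while True:
        # Forward cone: one linear pass over the pins.
        for pin in range(1, num_pins + 1):
            gate = and_gates.get(pin)
            if gate is not None and (x[gate[0] >> 1] or x[gate[1] >> 1]):
                x[pin] = True
        # Sequential step: an X-capable D input makes Q X-capable.
        changed = False
        for d_iv, q in dffs:
            if x[d_iv >> 1] and not x[q]:
                x[q] = True
                changed = True
        if not changed:
            return x


# Pin 1: primary input, pin 2: SRAM read pin, pin 3: DFF Q,
# pin 4 = AND(1, 3), pin 5 = AND(2, NOT 1), and the DFF's D input is pin 5.
and_gates = {4: (1 << 1, 3 << 1), 5: (2 << 1, (1 << 1) | 1)}
dffs = [(5 << 1, 3)]
result = x_capable_pins(5, and_gates, dffs, sram_read_pins=[2], seed_qs=[])
print([pin for pin in range(1, 6) if result[pin]])  # [2, 3, 4, 5]
```

With `seed_qs=[]`, the DFF only becomes X-capable on the second round, through the SRAM read pin feeding its D input. Pin 1, the driven primary input, is never marked.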

Phase 3: Partition Classification

After partitioning (mt-kahypar), classify each partition:

X-capable: Contains at least one X-capable aigpin, OR reads input state from an X-capable partition's output. Run with the X-aware kernel variant.
X-free: All signals are provably not-X. Run with the existing fast two-state kernel.

The classification must account for inter-partition communication:

loop:
    changed = false
    for each partition P:
        if P is already X-capable:
            continue
        for each global-read in P's script:
            source_word = identify source partition and state word
            if source partition is X-capable:
                mark P as X-capable
                changed = true
                break
    if !changed:
        break
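A runnable version of this classification loop might look like the following. It assumes each partition is summarised by whether it contains an X-capable pin and by the set of partitions it reads from; that summary, and the function name, are illustrative and not the structures in `src/flatten.rs`.

```python
def classify_partitions(has_x_pin, reads_from):
    """Return per-partition X-capability after inter-partition propagation.

    has_x_pin: list of bools, True if the partition contains an X-capable pin.
    reads_from: list of sets, the source partitions each partition reads.
    """
    x_capable = list(has_x_pin)
    changed = True
    while changed:
        changed = False
        for part, sources in enumerate(reads_from):
            if not x_capable[part] and any(x_capable[s] for s in sources):
                x_capable[part] = True
                changed = True
    return x_capable


# Partition 2 holds an X source; 1 reads from 2, 0 reads from 1, 3 is isolated.
print(classify_partitions([False, False, True, False],
                          [{1}, {2}, set(), set()]))
# [True, True, True, False]
```

In this example the chain takes two rounds to settle, because partition 0 is visited before partition 1 has been marked.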

Phase 4: X-Mask Representation

For X-capable partitions, each signal carries a sideband X mask alongside its value:

v (value bit): The Boolean value. When x = 1, this is a "best guess" (we use 0 by convention, matching Jacquard's current behaviour for backwards-compatible output).
x (X-mask bit): 1 = unknown, 0 = known.

This doubles the per-signal storage within X-capable partitions only.

State Buffer Layout

State buffer (current):
  [word 0] [word 1] ... [word N-1]   ← N u32 words, 32 signals each

State buffer (with X sideband):
  [word 0] [word 1] ... [word N-1]   ← value words (unchanged)
  [word N] [word N+1] ... [word 2N-1] ← X-mask words (new, same layout)

X-free partitions only read/write the value section. X-capable partitions read/write both sections. The sideband occupies the same state-buffer address space with a fixed offset of state_size words.
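The fixed-offset addressing can be sketched as follows. The mapping of signal `s` to bit `s % 32` of word `s // 32` is an assumption made for the example; the text above fixes only the word-level layout.

```python
def read_signal(buf, signal, state_size):
    """Read one signal from a state buffer with an X sideband.

    Value words occupy buf[0:state_size]; the X-mask word for value word w
    sits at buf[state_size + w]. Bit order within a word is an assumption.
    """
    word, bit = divmod(signal, 32)
    value = (buf[word] >> bit) & 1
    unknown = (buf[state_size + word] >> bit) & 1
    return "X" if unknown else value


state_size = 2
buf = [0b101, 0,    # value words
       0b100, 0]    # X-mask words at a fixed offset of state_size
print([read_signal(buf, s, state_size) for s in range(3)])  # [1, 0, 'X']
```

An X-free partition would read only `buf[word]` and never touch the second half of the buffer.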

Phase 5: X-Aware Boomerang Kernel

The existing boomerang gate computation:

ret_v = (a ^ xora) & ((b ^ xorb) | orb)

The X-aware version computes in parallel:

// Effective inputs after inversion and OR-bypass
a_eff   = a_v ^ xora
b_eff   = (b_v ^ xorb) | orb
b_eff_x = b_x & ~orb        // OR-bypass forces bits to known-1

// Value: same as before (X bits treated as 0 in value lane)
ret_v = a_eff & b_eff

// X mask: result is X when both inputs are not-known-zero AND at
// least one input is X
//
//   known_zero_a = ~a_x & ~a_eff  (not X, and effective value is 0)
//   known_zero_b = ~b_eff_x & ~b_eff
//   ret_x = (a_x | b_eff_x) & ~known_zero_a & ~known_zero_b
//
// Expanded:
ret_x = (a_x | b_eff_x) & (a_eff | a_x) & (b_eff | b_eff_x)

Verification of the X-mask formula:

a	b	orb	a_eff (xora=0)	b_eff (xorb=0)	b_eff_x	ret_v	ret_x	Expected
0	0	0	0	0	0	0	0	0
0	X	0	0	0	1	0	0	0 (0&X=0)
1	X	0	1	0	1	0	1	X (1&X=X)
X	0	0	0	0	0	0	0	0 (X&0=0)
X	1	0	0	1	0	0	1	X (X&1=X)
X	X	0	0	0	1	0	1	X (X&X=X)
X	*	1	0	1	0	0	1	X (pass-thru of X a)
0	*	1	0	1	0	0	0	0 (pass-thru of 0 a)
1	*	1	1	1	0	1	0	1 (pass-thru of 1 a)

Checking the pass-through row (a=X, b=*, orb=1): a_x=1, a_eff=0, b_eff=1, b_eff_x=0, so ret_x = (1|0) & (0|1) & (1|0) = 1 & 1 & 1 = 1, as expected.

The X-mask computation as written adds 6 extra bitwise operations per boomerang level (3 OR, 2 AND, 1 AND-NOT), plus loading the X mask values. This is roughly 1.5-1.6x the CPU ALU work per gate, but only within X-capable partitions.
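The X-mask formula can also be checked exhaustively rather than row by row. The sketch below implements the dual-lane computation on 32-bit integers and compares bit 0 against a three-valued reference AND for every combination of inputs, inversions and bypass. The function names are illustrative, and the code is a model of the formula, not the kernel source.

```python
from itertools import product

X = "X"
MASK = 0xFFFFFFFF


def and3(a, b):
    """Reference three-valued AND."""
    if a == 0 or b == 0:
        return 0
    return 1 if (a == 1 and b == 1) else X


def not3(a):
    return X if a == X else 1 - a


def boomerang_xprop(a_v, a_x, b_v, b_x, xora, xorb, orb):
    """One X-aware gate level on 32-bit lanes; returns (ret_v, ret_x)."""
    a_eff = (a_v ^ xora) & MASK
    b_eff = ((b_v ^ xorb) | orb) & MASK
    b_eff_x = b_x & ~orb & MASK          # OR-bypass forces bits to known-1
    ret_v = a_eff & b_eff
    ret_x = (a_x | b_eff_x) & (a_eff | a_x) & (b_eff | b_eff_x)
    return ret_v, ret_x


def encode(val):
    """Map 0/1/X to a (value, xmask) pair; X uses value 0 by convention."""
    return (0, 1) if val == X else (val, 0)


def check_all():
    """Compare bit 0 of the kernel against the reference for every case."""
    for a, b, xora, xorb, orb in product((0, 1, X), (0, 1, X),
                                         (0, 1), (0, 1), (0, 1)):
        want = and3(not3(a) if xora else a,
                    1 if orb else (not3(b) if xorb else b))
        (a_v, a_x), (b_v, b_x) = encode(a), encode(b)
        ret_v, ret_x = boomerang_xprop(a_v, a_x, b_v, b_x, xora, xorb, orb)
        if want == X:
            if ret_x & 1 != 1:
                return False
        elif (ret_x & 1, ret_v & 1) != (0, want):
            return False
    return True


print(check_all())  # True
```

The value lane is compared only where the result is known. Where `ret_x` is set, the value bit is a best guess and can be 1 when an inversion is applied to an X input.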

GPU Resource Impact

Per X-capable partition:

Resource	Current	With X-mask	Delta
Shared state (threadgroup mem)	256 x u32 = 1 KB	512 x u32 = 2 KB	+1 KB
Shared arrival (threadgroup mem)	256 x u16 = 512 B	unchanged	0
Global state buffer	N words	2N words	+N words
ALU per boomerang stage	~3 ops/thread	~7 ops/thread	~2.3x

Metal threadgroup memory limit is 32 KB; current usage is ~4 KB. The extra 1 KB is well within budget.

Phase 6: Boundary Protocol

When signals cross from an X-capable partition to another partition:

X-capable -> X-capable: Both value and X-mask words are communicated through the state buffer. The reading partition loads both.
X-capable -> X-free: At the boundary, we assert that no X values cross. If an X-mask bit is set for a signal read by an X-free partition, this is a simulation error (the design has an X reaching a region we statically proved should be X-free). Report it and optionally halt.
X-free -> X-capable: The reading partition loads the value word normally and treats the X-mask as all-zero for those inputs. No overhead.
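The three boundary cases can be captured in one small function. This is a sketch of the rule as stated, with a hypothetical name and signature; how the real kernels report the error (log, counter, halt) is not specified here.

```python
def boundary_xmask(xmask_word, read_mask, source_x_capable, reader_x_capable):
    """Return the X-mask bits a reader should see for one state word.

    read_mask selects the bits of the word that the reader actually uses.
    Raises if X reaches a partition that was classified as X-free.
    """
    if not source_x_capable:
        return 0                      # X-free source: mask is all-zero
    crossing = xmask_word & read_mask
    if reader_x_capable:
        return crossing               # both lanes are communicated
    if crossing:
        raise RuntimeError(f"X reached an X-free partition: bits {crossing:#010x}")
    return 0


print(bin(boundary_xmask(0b0110, 0b0011, True, True)))    # 0b10
print(boundary_xmask(0b0110, 0b0011, False, True))        # 0
print(boundary_xmask(0b0100, 0b0011, True, False))        # 0
```

The last call succeeds because the only X bit in the word is outside the bits the X-free reader uses.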

Phase 7: Dynamic X Narrowing (Optional Enhancement)

After the design's reset sequence completes, most DFFs will hold known values and the X-mask will be all-zeros across most of the state. The simulator can detect this:

every K cycles (e.g., K = 1000):
    for each X-capable partition:
        if all X-mask words in this partition's state are zero:
            switch partition to fast two-state kernel

This gives the best of both worlds: full X-propagation during initialisation (when it matters most), then automatic fallback to maximum throughput once the design is in steady state.

For designs with a clear reset phase, this means the performance overhead of X-propagation is confined to the first few thousand cycles.
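The periodic scan can be sketched as below. The partition records and the choice of which X-mask words belong to a partition are assumptions for illustration; this is an optional enhancement, so the sketch shows only one way it could work.

```python
def narrow(partitions, xmask_words):
    """Switch partitions whose X-mask words are all zero to the fast kernel.

    partitions: {name: {"x_aware": bool, "words": [X-mask word indices]}}
    Returns the names of partitions that were switched in this scan.
    """
    switched = []
    for name, part in partitions.items():
        if part["x_aware"] and all(xmask_words[w] == 0 for w in part["words"]):
            part["x_aware"] = False
            switched.append(name)
    return switched


K = 1000  # scan interval in cycles, as in the example above
partitions = {
    "p0": {"x_aware": True, "words": [0, 1]},
    "p1": {"x_aware": True, "words": [2]},
    "p2": {"x_aware": False, "words": [3]},
}
xmask_words = [0, 0, 0x10, 0]
for cycle in (999, 1000):
    if cycle % K == 0:
        print(cycle, narrow(partitions, xmask_words))  # 1000 ['p0']
```

A real implementation would also need a way to switch a partition back if an X is later written into its inputs; the text above does not cover that case, and neither does this sketch.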

Expected Performance Impact

Compile Time

The static analysis adds:

X-source identification: O(|DFFs| + |SRAMs|) -- negligible
Forward cone computation: O(|AIG pins|) -- one linear pass
Partition classification: O(|partitions| x |global reads|) -- negligible
Fixpoint iteration: 1-2 rounds of the above

Total: well under 1% of the partitioning time.

Simulation Time

For a typical SoC with reset:

Phase	X-capable partitions	Overhead
Before reset (cycles 0-100)	~30-50% of partitions	~1.5-2x slowdown
During reset (cycles 100-1000)	Shrinking as resets propagate	Decreasing
After reset (steady state)	<5% of partitions (uninitialised SRAM, CDC)	<10% overhead
After dynamic narrowing	~0%	~0% overhead

The overall impact on a full simulation run is estimated at 5-15% throughput reduction compared to current two-state, in exchange for catching initialisation bugs that would otherwise escape to silicon.

Implementation Status

All seven stages below are implemented; dynamic narrowing remains a future enhancement. Enable with --xprop on jacquard sim.

Stage 1: Static X-Source Analysis (`src/aig.rs`)

compute_x_sources() -> Vec<bool> -- identifies DFF Q outputs and SRAM read data ports
compute_x_capable_pins() -> (Vec<bool>, XPropStats) -- forward cone + fixpoint iteration

Stage 2: Partition Classification (`src/flatten.rs`)

FlattenedScriptV1 fields: xprop_enabled, partition_x_capable, xprop_state_offset
effective_state_size() returns 2 * reg_io_state_size when xprop enabled
Metadata words 8 (is_x_capable) and 9 (xmask_state_offset) encode per-partition xprop info
X-propagation is configured at simulation startup; no separate mapping step is needed

Stage 3: X-Aware CPU Reference Kernel (`src/sim/cpu_reference.rs`)

simulate_block_v1_xprop() -- dual-lane value + X-mask computation
sanity_check_cpu_xprop() -- CPU vs GPU comparison for both lanes
Non-X-capable partitions delegate to simulate_block_v1 (zero overhead)

Stage 4: CLI and VCD Integration (`src/bin/jacquard.rs`, `src/sim/vcd_io.rs`)

jacquard sim --xprop runs static analysis, logs X-capable pin/partition stats, and enables X-aware simulation with doubled state buffer
write_output_vcd_xprop() emits Value::X for X-masked primary outputs
expand_states_for_xprop() / split_xprop_states() for state buffer management

Stage 5: GPU Kernels (`csrc/kernel_v1.metal`, `csrc/kernel_v1_impl.cuh`)

Dual-lane X-mask tracking through all kernel phases (global read, boomerang, SRAM, DFF writeout)
is_x_capable branch is uniform per threadgroup (zero warp/SIMD divergence)
Shared memory: shared_state_x[256] + shared_writeouts_x[256] (+2KB, within 32KB limit)
SRAM X-mask shadow via sram_xmask buffer (buffer slot 7 on Metal)

Stage 6: Diagnostics

Compile-time: X-source count, X-capable pin %, X-capable partition %
Runtime: first-cycle-X-free detection, final-cycle X warning
CPU sanity check uses xprop variant when enabled

Stage 7: Benchmarks (`benches/xprop.rs`)

Criterion micro-benchmarks: two_state vs xprop_xfree vs xprop_xcapable
X-free partitions: zero overhead confirmed
X-capable partitions: ~1.5-1.6x CPU overhead (well within 2x budget)

Dynamic Narrowing (Future Enhancement)

Periodic X-mask scan on CPU between GPU batches
Partition kernel hot-swapping from X-aware to fast mode
Statistics reporting: "X cleared after N cycles"

Prior Art

Mixed 2-4 State Simulation with VCS (Chaudhry et al., 1997): Proved that mixed two-state / four-state simulation is viable within a single Verilog simulator, validating the core approach.
A Two-State Methodology for RTL Logic Simulation (1999): Eliminated X entirely using random two-state initialisation, arguing it catches more bugs than X-optimistic RTL simulation. Their technique for handling Z-state boundaries is relevant to our boundary protocol.
Essent (Beamer, 2020-2021): Demonstrated that partitioning a compiled simulator by signal activity yields significant speedups. Their activity-proportional execution is analogous to our dynamic X narrowing -- skip work for partitions where nothing interesting is happening.
Synopsys VCS T-Prop: Taint propagation in VCS uses a parallel sideband bit per signal to track data flow for security verification. Our X-mask is architecturally identical -- a sideband bit that propagates alongside the value using different rules.
Chris Drake, "Improving Verilog Four State Logic" (2024): Argues that Verilog conflates "uninitialised" and "don't care" uses of X, and proposes splitting them. Our approach naturally supports this: the X-mask tracks only uninitialised/unknown values, not synthesis don't-cares (which are already resolved by the synthesis tool before Jacquard sees the netlist).

What This Does NOT Address

Z (high-impedance): Jacquard does not support tri-state buses and this proposal does not add support. Z would require a third state bit and bus resolution logic.
X-optimism in RTL control flow: Since Jacquard operates on gate-level netlists (post-synthesis), there are no if/case statements to be X-optimistic about. Gate-level X propagation through AND/OR truth tables is naturally correct (though potentially X-pessimistic -- see below).
X-pessimism reduction: Gate-level simulation is inherently X-pessimistic in some cases (e.g., a XOR a should be 0 even if a is X, but standard gate-level propagation gives X). Since AIGs decompose XOR into AND/NOT, this pessimism is present. Addressing this would require symbolic analysis or reconvergence detection, which is out of scope.
Strength modelling: No weak/strong drive strength tracking. All known values are "strong."
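The X-pessimism item can be demonstrated with a three-valued model. The sketch builds XOR from AND and NOT, as an AIG would, and evaluates it with both inputs tied to the same unknown signal. The decomposition used is one standard form; the function names are illustrative.

```python
X = "X"


def and3(a, b):
    if a == 0 or b == 0:
        return 0
    return 1 if (a == 1 and b == 1) else X


def not3(a):
    return X if a == X else 1 - a


def xor_via_and(a, b):
    """a XOR b = NOT(NOT(a AND NOT b) AND NOT(NOT a AND b))."""
    left = and3(a, not3(b))
    right = and3(not3(a), b)
    return not3(and3(not3(left), not3(right)))


print([xor_via_and(a, a) for a in (0, 1)])  # [0, 0]
print(xor_via_and(X, X))                    # X
```

For every concrete value of `a` the result is 0, yet the gate-by-gate evaluation returns X. The model has no way to know that both inputs are the same signal, which is why removing this pessimism needs reconvergence detection or symbolic analysis.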

Design Decisions

SRAM X granularity: Conservative whole-SRAM approach -- all reads return X until any write occurs. This is pessimistic but correct and simple. Per-address tracking (requiring 8192 x 32 bits of shadow state per SRAM block) is a potential follow-up for reduced pessimism.
Reset-aware analysis: Skipped for v1 -- all DFF Q outputs start as X regardless of whether they have async reset. Identifying reset nets is fragile (varies by synthesis tool and coding style), and the fixpoint iteration naturally resolves DFFs that become non-X after reset propagates.
VCD X output: Yes -- write_output_vcd_xprop() emits Value::X when the X-mask bit is set for a primary output. This is the primary user-visible benefit. The vcd-ng crate already supports Value::X. Downstream tools that only handle two-state VCD would need updating, but Verilog-standard four-state VCD is widely supported.
Partition-level granularity: X-awareness is applied at partition granularity (entire partition runs X-aware kernel or not). This provides better steady-state performance than per-signal tracking when most partitions are X-free (~95% typical), at the cost of slightly more pessimism at partition boundaries.
Runtime CLI flag: X-propagation is controlled by --xprop on jacquard sim, rather than a compile-time feature flag. No new Cargo dependencies are needed.
State buffer layout: When xprop is enabled, the state buffer doubles. Value words occupy [0 .. reg_io_state_size) and X-mask words occupy [reg_io_state_size .. 2*reg_io_state_size) per cycle. Per-partition metadata word 8 stores the is_x_capable flag, and word 9 stores the xmask_state_offset (equal to reg_io_state_size for X-capable partitions). X-free partitions ignore both fields and run unchanged.
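
The state buffer layout can be sketched as index arithmetic. This is a minimal illustration, assuming cycles are stored back to back with a stride of one full per-cycle block; StateLayout and its methods are hypothetical names, and only reg_io_state_size comes from the text above.

```rust
/// Word indices into the state buffer (hypothetical helper for illustration).
#[derive(Debug, Clone, Copy)]
struct StateLayout {
    reg_io_state_size: usize,
    xprop: bool,
}

impl StateLayout {
    /// The per-cycle block doubles when xprop is enabled.
    fn words_per_cycle(&self) -> usize {
        if self.xprop { 2 * self.reg_io_state_size } else { self.reg_io_state_size }
    }

    /// Value words occupy [0 .. reg_io_state_size) within a cycle's block.
    fn value_index(&self, cycle: usize, word: usize) -> usize {
        assert!(word < self.reg_io_state_size);
        cycle * self.words_per_cycle() + word
    }

    /// X-mask words occupy [reg_io_state_size .. 2*reg_io_state_size);
    /// there is no X-mask region when xprop is disabled.
    fn xmask_index(&self, cycle: usize, word: usize) -> Option<usize> {
        assert!(word < self.reg_io_state_size);
        self.xprop
            .then(|| cycle * self.words_per_cycle() + self.reg_io_state_size + word)
    }
}
```

The offset added in xmask_index plays the role of the xmask_state_offset stored in metadata word 9 for X-capable partitions.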

Debugging X values in cosim (`xsources` + `xroots`)

When you run jacquard cosim --xprop and a signal reads x (unknown), the question is always why — which uninitialised state or undriven input is feeding that X forward? This guide covers the two-command workflow that answers it as a static query, without the trace→guess→re-run loop.

jacquard xsources — enumerate where X originates in the design.
netlist-graph xroots — given a signal, find which of those X-sources reach it (its backward "X-root" frontier).

See ADR 0016 for the X-propagation semantics and issue #98 for the design rationale.

Background: where X comes from

Under --xprop, an X value originates at one of four kinds of X-source and then propagates forward through combinational logic:

Kind	X because…
`unreset-dff`	A DFF Q output with no reset/set pin — undefined at power-up and never forced to a known value.
`reset-dff`	A DFF Q with a reset/set pin — undefined at power-up but defined once reset asserts. Usually not the culprit after reset.
`sram-read`	An SRAM read-data port — undefined until the addressed cell is written.
`undriven-input`	A primary input the testbench leaves undriven (no clock/reset/constant/peripheral drives it) — reads X every tick.

A signal can read X only if at least one X-source lies in its backward logic cone. Whether such a source exists is a pure netlist property, so no simulation is needed to compute it.

Step 1 — enumerate X-sources: `jacquard xsources`

jacquard xsources <netlist> --config <sim_config.json> -o xsources.json

This builds the static AIG (no GPU) and writes a JSON manifest of every X-source. --config is required only to classify undriven-input sources (the driven-set complement); without it, only DFF and SRAM sources are emitted.

{
  "schema_version": "1.0",
  "netlist": "design.v",
  "undriven_inputs_classified": true,
  "x_sources": [
    { "net": "q_unreset",  "kind": "unreset-dff",    "cell": "_29_" },
    { "net": "_10_[1]",    "kind": "unreset-dff",    "cell": "_30_" },
    { "net": "q_reset",    "kind": "reset-dff",      "cell": "_28_" },
    { "net": "unconnected","kind": "undriven-input" }
  ]
}

Notes:

net is jacquard's canonical net name (after assign-merging), which may differ from the raw name in the Verilog. cell is the driving instance and is stable across tools — xroots resolves DFF/SRAM sources by cell, so the naming difference does not matter.
The undriven-input classification reflects the config's declared drivers: clock(s), reset, constant_inputs, constant_ports, and each peripheral's pins (flash/uart/gpio/jtag). An input outside that set is reported as undriven.
A reset-connected DFF is emitted as reset-dff. It is X at power-up but resolves once reset asserts, so xroots traverses through it rather than stopping (see below).

Step 2 — find a signal's X-roots: `netlist-graph xroots`

netlist-graph xroots <netlist> <signal> --xsources xsources.json

xroots walks backward from <signal> through the driver cone — and through DFF data (D) pins, the reverse of the forward X-prop fixpoint — skipping clock/set/reset pins. It stops at each persistent X-source (unreset-dff / sram-read / undriven-input), reporting the nearest frontier:

$ netlist-graph xroots design.v top.cpu.stall --xsources xsources.json
X-source frontier of top.cpu.stall (2 root(s), nearest first):
  [unreset-dff] top.cpu.state[3]   (depth 5)
  [undriven-input] cfg_mode        (depth 8)

Reset DFFs (reset-dff) are traversed through (their data cone is followed), because reset defines them — so the frontier surfaces the real persistent roots behind them, not the reset flop itself.

If no X-source reaches the signal, xroots says so — under --xprop that signal should be X-free.
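
The backward walk described above can be sketched as a breadth-first search. This is an illustration only: the Kind, Net, x_roots and demo_graph names and the tiny graph are hypothetical, and the real tool works on the parsed netlist rather than a hand-built map.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Logic, UnresetDff, ResetDff, SramRead, UndrivenInput }

/// A net and the nets feeding its data path (the D pin for a DFF).
/// Clock/set/reset pins are deliberately left out of `data_fanin`.
struct Net { kind: Kind, data_fanin: Vec<&'static str> }

fn net(kind: Kind, data_fanin: &[&'static str]) -> Net {
    Net { kind, data_fanin: data_fanin.to_vec() }
}

/// Returns the persistent X-sources reaching `signal`, nearest first.
fn x_roots(
    graph: &HashMap<&'static str, Net>,
    signal: &'static str,
) -> Vec<(&'static str, Kind, usize)> {
    let mut roots = Vec::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(signal);
    queue.push_back((signal, 0usize));
    while let Some((name, depth)) = queue.pop_front() {
        if let Some(node) = graph.get(name) {
            match node.kind {
                // Persistent sources end the walk and are reported.
                Kind::UnresetDff | Kind::SramRead | Kind::UndrivenInput => {
                    roots.push((name, node.kind, depth));
                }
                // Logic and reset DFFs are traversed through.
                Kind::Logic | Kind::ResetDff => {
                    for &src in &node.data_fanin {
                        if seen.insert(src) {
                            queue.push_back((src, depth + 1));
                        }
                    }
                }
            }
        }
    }
    roots
}

fn demo_graph() -> HashMap<&'static str, Net> {
    HashMap::from([
        ("stall", net(Kind::Logic, &["n1", "q_reset"])),
        ("n1", net(Kind::Logic, &["q_unreset"])),
        ("q_unreset", net(Kind::UnresetDff, &["n1"])),
        ("q_reset", net(Kind::ResetDff, &["cfg_mode"])),
        ("cfg_mode", net(Kind::UndrivenInput, &[])),
        ("idle", net(Kind::Logic, &["rst_sync"])),
        ("rst_sync", net(Kind::ResetDff, &[])),
    ])
}
```

Breadth-first order yields the nearest-first ordering for free, and the seen set keeps self-holding registers such as q_unreset from looping.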

Without a manifest

xroots also runs without --xsources, classifying X-sources from the netlist alone (DFF Q by reset-pin presence, SRAM reads, and genuinely undriven internal nets). This mode cannot classify undriven-input sources (it does not know the cosim driven set), so when you need those, generate a manifest with jacquard xsources --config and pass it via --xsources.

Step 3 — confirm with a targeted `--xprop` run

--emit-trace writes the frontier as a --trace-signals file, so confirming the X actually originates where xroots says is one command:

netlist-graph xroots design.v top.cpu.stall \
    --xsources xsources.json --emit-trace xroots.txt

jacquard cosim design.v --config sim.json \
    --xprop --trace-signals xroots.txt --output-vcd out.vcd

The traced frontier nets appear in out.vcd; watch them carry x and resolve (or not) over the reset window to confirm the diagnosis.

Worked example

tests/xprop_cosim/ is the X-propagation demo. q_unreset is a self-holding unreset register (stays X); q_reset resolves after reset:

jacquard xsources tests/xprop_cosim/xprop_demo_synth.gv \
    --config tests/xprop_cosim/sim_config.json -o /tmp/xsrc.json

# Why is the counter's next-state X?
netlist-graph xroots tests/xprop_cosim/xprop_demo_synth.gv "_11_[3]" \
    --xsources /tmp/xsrc.json
#   [unreset-dff] unreset_count[3]  (depth 2)
#   [unreset-dff] unreset_count[2]  (depth 3)
#   ...

Limitations

The frontier is a static over-approximation: it reports X-sources whose cone reaches the signal, not whether the X is live on a given cycle (an unreset DFF may resolve once its own data cone settles from known inputs). It tells you what to initialise/drive to make the signal defined.
Name round-tripping is exact for flattened post-synthesis netlists (the target of these tools). DFF/SRAM sources resolve by cell instance, which is robust; undriven-input sources resolve by name.
Dominator analysis (X-sources that every path passes through — guaranteed roots) is a planned follow-up; v1 reports the reachable frontier.

Interactive JTAG Debug (`--jtag-server`)

Overview

jacquard cosim --jtag-server <PORT> opens a live remote_bitbang JTAG socket alongside a running GPU co-simulation. An external debugger — OpenOCD, then gdb on top of it — attaches and inspects the design through its RISC-V Debug Module (DM): halt/resume/step the hart, read/write GPRs, CSRs, PC and memory, and load firmware — exactly as the same firmware would be debugged on real silicon.

It is the interactive sibling of --jtag-replay. Where --jtag-replay <FILE> plays back a recorded remote_bitbang byte stream (deterministic, one-directional), --jtag-server <PORT> opens a live socket and lets the connected client drive the same configured TCK/TMS/TDI/(TRST) pins, stepping the design in lock-step with each debug transaction and answering TDO reads from the design's live output.

The two are mutually exclusive — pick a recorded stream or a live client, not both.

Flag	Byte source	TDO read-back	Use
`--jtag-replay <FILE>`	recorded file	counted, not answered	deterministic regression / firmware load
`--jtag-server <PORT>`	live TCP client	sampled + written to client	interactive debug (OpenOCD → gdb)

Configuration

The JTAG TAP pins are declared once in sim_config.json, shared by both the replay and server paths. TCK/TMS/TDI/TRST are design inputs; TDO is a design output, so the server needs tdo_gpio to answer TDO reads:

{
  "jtag": {
    "tck_gpio": 2,
    "tms_gpio": 3,
    "tdi_gpio": 4,
    "trst_gpio": 5,
    "trst_active_low": true,
    "tdo_gpio": 6
  }
}

trst_gpio should be set for a RISC-V Debug TAP: the DTM resets on negedge trst_n, and the server injects a one-shot power-on TRST pulse at startup so a debugger that never asserts TRST (stock OpenOCD with reset_config none) still resets the DTM — mirroring a real chip's power-on TAP reset. (Tunable/disable via the JACQUARD_JTAG_TRST_PULSE env var; see How it works.) TAPs that genuinely reset via a five-cycle TMS=1 sequence can omit it.
tdo_gpio is optional for --jtag-replay (replay never answers R) but required for useful --jtag-server use: without it, every TDO read returns 0 and the debugger cannot read the design back. cosim warns at startup if it is missing.

The TCK clock domain must also appear in clocks[] so the multi-clock scheduler allocates it a tick slot; the model overrides TCK at each byte transition. See tests/jtag_minimal/sim_config.json for a complete, working example.

Launching the server

jacquard cosim design.pnl.v \
    --config sim_config.json \
    --top-module top \
    --jtag-server 9999 \
    --jtag-hold-cycles 8 \
    --output-vcd debug.vcd \
    --max-clock-edges 100000000

cosim builds the design, binds 127.0.0.1:9999, and blocks waiting for one client before the run starts:

JTAG server `jtag_0`: listening on 127.0.0.1:9999, hold_edges=8 (...);
  waiting for a remote_bitbang client (e.g. OpenOCD)…

Pass --jtag-server 0 to bind an OS-assigned free port instead of a fixed one, then read the actual port from the "listening on 127.0.0.1:<port>" startup line. Use this when several jacquard instances debug concurrently (or in CI) so they never collide; if a fixed port is already taken, the bind fails with an actionable error.
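
The port-0 behaviour is the standard OS mechanism, shown here with the Rust standard library. This is a sketch of the general pattern, not Jacquard's bind code; bind_jtag_port is a hypothetical helper name.

```rust
use std::net::TcpListener;

/// Bind a loopback listener. Passing port 0 asks the OS for any free port;
/// the port actually assigned is read back from the bound socket.
fn bind_jtag_port(port: u16) -> std::io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind(("127.0.0.1", port))?;
    let actual = listener.local_addr()?.port();
    Ok((listener, actual))
}
```

A wrapper script can do the same thing in reverse: launch with port 0, parse the port from the startup line, and hand it to the debugger.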

Once a client connects, the design steps forward one remote_bitbang transaction at a time, paced by the client. By default a single connection is served; when the client disconnects (or sends Q), the session ends and the simulation free-runs to its edge budget.

Add --jtag-reconnect to keep the server alive across disconnects: when the debugger detaches, the server preserves the design state and waits to accept() the next client — so you can restart OpenOCD without re-running the slow cosim setup (each fresh attach gets a clean DTM reset). The run then never stops on its own; Ctrl-C/kill to end it.

Edge budget. --max-clock-edges does not stop the run while a debug client is attached — the session is paced by the client and would otherwise die mid-inspect. The cap is honoured again once the client disconnects (so the post-session free-run is still bounded). You can leave --max-clock-edges at its default for interactive use.

OpenOCD configuration

The boilerplate is the most error-prone part of attaching a debugger, so generate it with the built-in helper instead of hand-writing it:

jacquard jtag-openocd-config --port 9999 --expected-id 0xdeadbeef \
    -o openocd.cfg

--irlen (default 5), --host, --chipname and --dmi-timeout are all configurable; jacquard jtag-openocd-config --help lists them. The emitted config wires up the remote_bitbang driver, a RISC-V Debug TAP, and the target, and ends with init. For reference, the equivalent hand-written config (mirroring tests/jtag_minimal/scripts/openocd.cfg):

adapter driver remote_bitbang
remote_bitbang host localhost
remote_bitbang port 9999
transport select jtag
adapter speed 1

set _CHIPNAME jtag
jtag newtap $_CHIPNAME cpu -irlen 5
set _TARGETNAME $_CHIPNAME.cpu
target create $_TARGETNAME riscv -chain-position $_TARGETNAME

# The JTAG protocol is bit-serial and the GPU server is per-edge while
# stepping — keep the DMI timeout generous.
riscv set_command_timeout_sec 60

Run it against the launched server:

openocd -f openocd.cfg

OpenOCD examines the TAP, brings up the DM (dmactive=1), and exposes a gdb remote on :3333 by default.

Attaching gdb

riscv32-unknown-elf-gdb firmware.elf
(gdb) target remote :3333
(gdb) info registers          # read GPRs / CSRs / PC via the DM
(gdb) x/16xw 0x20000000       # read memory
(gdb) load                    # write firmware through Program Buffer / sysbus
(gdb) break main
(gdb) continue

The design runs on the GPU; gdb sees it through the DM exactly as it would see silicon over a real JTAG probe.

How it works

The cosim main loop is synchronous (step_edge → run_edges → wait → repeat). A live debug session inverts time control: the client is the clock source, so step_edge blocking on a socket read is the correct synchronisation — no async executor, no background thread.

While a client is connected the JTAG model reports is_active() == true, which forces the existing single-edge (batch=1) dispatch — the same fine-grained stepping --jtag-replay already uses. Each cosim edge processes one remote_bitbang step.
On each R (read-TDO) command the model samples the design's live TDO from the output state (output_state[tdo_pos]) and writes the ASCII '0'/'1' back over the socket — the only response the protocol requires.
TCK/TMS/TDI/TRST drive through the usual overrides → BitOps → state_prep path, unchanged from replay.
Power-on TRST pulse. A RISC-V DTM resets on negedge trst_n. On real silicon the power-on reset supplies that edge; in a recorded capture the harness pulses TRST before the debugger runs. But stock OpenOCD with reset_config none never asserts TRST — so without help the DTM would never reset and examine would fail with "dtmcontrol is 0" / "scan chain interrogation: all zeroes". The live server therefore injects one brief TRST pulse at startup (deasserted → asserted → released, timed in cosim edges), reproducing the recorded stream's leading u…r. Replay is unaffected (it drives TRST from the stream), and an explicit client TRST assertion still wins after the pulse. The window is [hold_edges, 5·hold_edges) by default; override or disable it with JACQUARD_JTAG_TRST_PULSE="<from>,<to>" / ="off" (edges) if a design needs different timing.
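
The per-byte handling in the list above follows the remote_bitbang command set as documented by OpenOCD. The decoder below is a minimal sketch of that protocol; BitbangCmd, decode and tdo_reply are hypothetical names and do not mirror the model in src/sim/models/jtag.rs.

```rust
/// One decoded remote_bitbang command byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BitbangCmd {
    /// '0'..'7': drive TCK/TMS/TDI (bit 2 = TCK, bit 1 = TMS, bit 0 = TDI).
    Write { tck: bool, tms: bool, tdi: bool },
    /// 'r','s','t','u': the (trst, srst) pair as signalled by the client.
    Reset { trst: bool, srst: bool },
    /// 'R': the client wants the current TDO level.
    ReadTdo,
    /// 'B' / 'b': blink on / off (carries no JTAG meaning).
    Blink(bool),
    /// 'Q': the client is ending the session.
    Quit,
}

fn decode(byte: u8) -> Option<BitbangCmd> {
    match byte {
        b'0'..=b'7' => {
            let bits = byte - b'0';
            Some(BitbangCmd::Write {
                tck: (bits & 4) != 0,
                tms: (bits & 2) != 0,
                tdi: (bits & 1) != 0,
            })
        }
        b'r' => Some(BitbangCmd::Reset { trst: false, srst: false }),
        b's' => Some(BitbangCmd::Reset { trst: false, srst: true }),
        b't' => Some(BitbangCmd::Reset { trst: true, srst: false }),
        b'u' => Some(BitbangCmd::Reset { trst: true, srst: true }),
        b'R' => Some(BitbangCmd::ReadTdo),
        b'B' => Some(BitbangCmd::Blink(true)),
        b'b' => Some(BitbangCmd::Blink(false)),
        b'Q' => Some(BitbangCmd::Quit),
        _ => None,
    }
}

/// The only byte the server ever sends back: ASCII '0' or '1' for TDO.
fn tdo_reply(tdo: bool) -> u8 {
    if tdo { b'1' } else { b'0' }
}
```

Only ReadTdo produces traffic toward the client, which is why replay can count R commands without answering them while the live server needs tdo_gpio.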

The batched fast path is untouched when no client is attached: a design with a jtag config but no --jtag-server/--jtag-replay flag simply leaves the JTAG inputs floating and runs at full throughput.

This works on any cosim backend (CPU / Metal / CUDA / HIP) — it is the CPU-side model plus batch=1 of the GPU backend, with no per-backend kernel work. See ADR 0017 (Amendment 2026-06-21) for the execution-model rationale and ADR 0013 (Amendment 2026-06-21) for the tdo_gpio config surface.
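
The power-on TRST window described in this section can be sketched as a small parser plus a range check. This is an illustration under assumptions: TrstPulse, parse_trst_pulse and trst_asserted are hypothetical names, the error handling is invented for the example, and only the default window [hold_edges, 5·hold_edges) and the "<from>,<to>" / "off" value forms come from the text.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TrstPulse {
    Off,
    /// TRST is asserted for edges in the half-open range [from, to).
    Window { from: u64, to: u64 },
}

/// `env_value` is the content of JACQUARD_JTAG_TRST_PULSE, if it is set.
fn parse_trst_pulse(env_value: Option<&str>, hold_edges: u64) -> Result<TrstPulse, String> {
    let text = match env_value.map(str::trim) {
        None => return Ok(TrstPulse::Window { from: hold_edges, to: 5 * hold_edges }),
        Some("off") => return Ok(TrstPulse::Off),
        Some(text) => text,
    };
    let (from, to) = text
        .split_once(',')
        .ok_or_else(|| format!("expected \"<from>,<to>\" or \"off\", got {:?}", text))?;
    let from = from.trim().parse::<u64>().map_err(|e| format!("bad <from>: {}", e))?;
    let to = to.trim().parse::<u64>().map_err(|e| format!("bad <to>: {}", e))?;
    if from >= to {
        return Err(format!("empty window: {}..{}", from, to));
    }
    Ok(TrstPulse::Window { from, to })
}

/// Whether the injected pulse holds TRST asserted on a given cosim edge.
fn trst_asserted(pulse: TrstPulse, edge: u64) -> bool {
    match pulse {
        TrstPulse::Off => false,
        TrstPulse::Window { from, to } => (from..to).contains(&edge),
    }
}
```

With --jtag-hold-cycles 8 and no override, this gives a pulse covering edges 8 through 39.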

Caveats and limitations

One client at a time. A single connection is served at once; --jtag-reconnect accepts a new one after the previous one disconnects. Simultaneous clients and multi-TAP chains are future work.
Performance. An attached session loses edge batching for its duration — inherent and acceptable, since interactive debug is slow relative to free-running simulation. Throughput is unaffected when no client is attached.
X-propagation. Under --xprop, the debug read path is two-state in v1 (TDO resolves through the design as 0/1). X-aware debug reads are a planned refinement (J5 in the plan).
Reset interplay. The design's own reset (reset_gpio) and the JTAG TAP reset (driven by the client) proceed concurrently, mirroring real hardware where the debugger drives JTAG while the chip resets.

Validation

Three layers, increasing in fidelity:

A model-level loopback unit test (live_source_loops_back_tdo_over_socket in src/sim/models/jtag.rs) drives the FSM through a real TcpStream and reads back a sampled TDO bit — GPU-free and deterministic, runs in cargo test.
The jtag-minimal-cosim-server CI job streams the same bitbang.rec the --jtag-replay gate uses over the live socket (via tests/jtag_minimal/scripts/bitbang_client.py) and asserts the design reaches the same data0_obs == 0xCAFEBABE. This pins live-vs-replay drive equivalence with zero external tooling — but reaches 0xCAFEBABE via a DMI write, so it does not exercise a correct TDO read.
The jtag-minimal-openocd CI job drives the live server with real OpenOCD (tests/jtag_minimal/scripts/openocd_debug.sh): examine reads IDCODE/DTMCS back over TDO, then it writes and reads back DATA0, asserting IDCODE == 0xdeadbeef and the DMI read-back == 0xcafebabe. This is the only gate that exercises the live TDO read path (and the power-on TRST-pulse fix). Run it locally with any port:
```
jacquard cosim … --jtag-server 9824 &
bash tests/jtag_minimal/scripts/openocd_debug.sh 9824
```
Fake hart. jtag_minimal is a bare DTM+DM with no real CPU, so full RISC-V target examination stops at Failed to read MISA and OpenOCD exits non-zero — expected. DMI access (write/read DATA0) is the contract the gate checks. A design with a real hart examines fully.

References

Issue #124.
Implementation staging: docs/plans/jtag-debug-server.md.
ADR 0017 — Cosim execution model (interactive, externally-paced peripheral models; output_state wiring).
ADR 0013 — Cosim peripheral model architecture (tdo_gpio config surface).
Replay path & fixture: src/sim/models/jtag.rs, tests/jtag_minimal/.

Adding a New PDK for Post-Layout Simulation

This guide documents the process of enabling a new process design kit (PDK) for gate-level simulation in Jacquard. It is based on the SKY130 enablement and captures every integration point.

Overview

Jacquard natively supports AIGPDK (its own synthesis library of AND gates, DFFs, and SRAMs). Supporting a foundry PDK like SKY130 requires teaching the simulator how to interpret the PDK's standard cells: their pin directions, their boolean function, and which ones are sequential.

There are three pathways for enabling new cells; pick based on what you're adding:

Cell-model IR descriptor (recommended — the modern path, ADR 0019). A standard-cell library — pin directions, boolean functions, sequential roles, and timing — is captured in one generated JSON descriptor produced from the library's Liberty by the liberty-to-cellir converter. Jacquard consumes the descriptor at runtime; no per-PDK Rust is required. This is how a brand-new PDK is added (vendor the library + generate a descriptor) and how a proprietary library you cannot vendor is simulated (--cell-descriptor). See "Pathway A" below. This supersedes the legacy Rust workflow (Steps 1–8) for all combinational logic today, with sequential consumption completing across the built-in PDKs in the ADR 0019 C3 series.
Runtime cell library (--cell-library + .cells.toml manifest). For third-party IP, hard macros, foundry memories, and any other cells that don't need new AIG decomposition rules — i.e. cells that act as opaque outputs (RAM macros), filler/cap blocks, or IO pads. See ADR 0010 and docs/plans/declarative-cell-metadata.md for the recipe. No Jacquard PR required — users ship a manifest alongside their netlist. See "Adding third-party IP via runtime manifest" at the end.
Legacy per-PDK Rust (Steps 1–8 below). The original pathway: pin tables, classifiers, decomposition functions, and AIG builder hooks hand-written in Rust. Being retired by the cell-model IR cutover (ADR 0019) — retained here as reference for the machinery the descriptor replaces. Do not add a new PDK this way; use Pathway A.

If you're adding just a memory macro or other behaviourally-opaque IP, skip ahead to "Adding third-party IP via runtime manifest" at the end of this document — it's a 6-line TOML entry, not a Rust PR.

Pathway A: Cell-model IR descriptor (recommended)

A cell-model IR descriptor is a portable, versioned, generated JSON file carrying everything per-cell-type about a library — L1 pin directions, L2 combinational logic (as a pre-decomposed AIG), L3 sequential roles/classification, and L4 timing characterization — from one source: the library's Liberty. Adding a PDK is no longer a Jacquard code change; it is a data generation step. (ADR 0019.)

Generate a descriptor from Liberty

The liberty-to-cellir converter reads a Liberty .lib (Liberty-first; a functional .v is consulted only as a fallback / cross-check) and emits the descriptor:

cargo run --release --manifest-path crates/liberty-to-cellir/Cargo.toml -- \
    path/to/mylib_typ_1p20V_25C.lib \
    --functional-v path/to/mylib_stdcell.v \
    -o mylib.cellir.json

The converter derives the corner (PVT) from the Liberty's operating_conditions/default_operating_conditions, compiles each cell's function into an AIG, extracts ff/latch sequential roles, and (where a .v is present and per-cell) cross-checks logic + timing-arc topology, surfacing any disagreement. It prints a summary (cells=… combinational=… l3_sequential=… l4_timing=… corners=… cross_check_mismatches=…). A monolithic single-.lib-per-corner commercial library and a per-cell split library are both handled.

Selection: how Jacquard picks a descriptor for a netlist

Each descriptor declares the cell-name prefix(es) it covers (D8). Jacquard auto-matches a netlist's cell types against the bundled descriptors by prefix; --cell-descriptor <file.json> is the explicit override (and the path for proprietary libraries). Precedence: --cell-descriptor (explicit file) > --bundled-descriptor <name> (explicit bundled) > auto-match by prefix > the default-fallback descriptor (used when no vendor prefix matches — this is how AIGPDK, whose cells have no common prefix, is selected).
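
The precedence chain above can be written as a short decision function. This is a sketch, not Jacquard's selection code: Selected, Bundled and select_descriptor are hypothetical names, and how a netlist that matches several prefixes is treated is not specified here (the sketch simply takes the first bundled descriptor that matches any cell).

```rust
/// Where the chosen descriptor came from, in precedence order.
#[derive(Debug, PartialEq, Eq)]
enum Selected<'a> {
    ExplicitFile(&'a str),    // --cell-descriptor <file.json>
    ExplicitBundled(&'a str), // --bundled-descriptor <name>
    PrefixMatch(&'a str),     // auto-match by declared prefix
    DefaultFallback,          // no vendor prefix matched
}

/// A bundled descriptor and the cell-name prefixes it declares.
struct Bundled<'a> {
    name: &'a str,
    prefixes: &'a [&'a str],
}

fn select_descriptor<'a>(
    cell_descriptor: Option<&'a str>,
    bundled_descriptor: Option<&'a str>,
    bundled: &[Bundled<'a>],
    cell_types: &[&str],
) -> Selected<'a> {
    if let Some(path) = cell_descriptor {
        return Selected::ExplicitFile(path);
    }
    if let Some(name) = bundled_descriptor {
        return Selected::ExplicitBundled(name);
    }
    for d in bundled {
        let matched = cell_types
            .iter()
            .any(|cell| d.prefixes.iter().any(|p| cell.starts_with(*p)));
        if matched {
            return Selected::PrefixMatch(d.name);
        }
    }
    Selected::DefaultFallback
}
```

Because the explicit flags are checked first, a proprietary descriptor passed with --cell-descriptor always wins, even when the netlist's cells would also match a bundled prefix.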

Worked example — a built-in PDK (IHP SG13G2, zero per-PDK Rust)

IHP's open SG13G2 PDK was added as a built-in with no Rust:

Vendor the library as a sparse/shallow git submodule (only the stdcell Liberty + .v are checked out — ~10 MB, not the multi-GB PDK):

git submodule add https://github.com/IHP-GmbH/IHP-Open-PDK.git vendor/IHP-Open-PDK
git -C vendor/IHP-Open-PDK sparse-checkout set \
    ihp-sg13g2/libs.ref/sg13g2_stdcell/lib \
    ihp-sg13g2/libs.ref/sg13g2_stdcell/verilog

Generate + embed at build time. build.rs runs the converter over the pinned vendored .lib and embeds the descriptor into the binary (descriptors are not checked in — CI regenerates them deterministically, D7). Add the descriptor to bundled_descriptors::ALL with its declared prefix (sg13g2_).
That's it. A SG13G2 netlist auto-selects the descriptor by prefix; combinational cells splice from it with no IHP-specific code. Adding the PDK touched no ihp_*.rs — only a submodule, a build.rs generation line, and a registry entry.

Worked example — a proprietary library you cannot vendor (GF130)

A foundry/NDA library that can never live in vendor/ is simulated entirely at runtime — the Liberty never leaves your machine:

# 1. Generate a descriptor from your own Liberty (offline, on your machine):
cargo run --release --manifest-path crates/liberty-to-cellir/Cargo.toml -- \
    /path/to/pdk-gf130/.../GF013bcd_sc6_1p5_a0_TT_1P50V_25C_max.lib \
    --functional-v /path/to/pdk-gf130/.../GF013bcd_sc6_1p5_a0.v \
    -o gf130.cellir.json

# 2. Point jacquard at it — no Jacquard build, no vendoring:
jacquard sim my_chip.gv stim.vcd out.vcd 1 --cell-descriptor gf130.cellir.json

The generator is the only tool that touches raw foundry files; the released jacquard binary is library-agnostic. (A --corner <name> flag selects among a multi-corner descriptor's corners, defaulting to the descriptor's default_corner.)

Current status & limits

Combinational logic is fully descriptor-driven for every built-in PDK (GF180, SKY130, AIGPDK, IHP) and for proprietary libraries via --cell-descriptor.
Sequential consumption is descriptor-driven for GF180 today and is being wired for the other PDKs in the ADR 0019 C3 series; the descriptor already carries the L3 sequential data for all of them.
A library whose functional .v is a single flat-module file (common for commercial PDKs) skips the optional .v logic cross-check at generation (the descriptor is still emitted; validate via simulation) — improving the flat-.v indexer is tracked follow-up.
RAM/SRAM macros are not covered by the descriptor — they use the runtime manifest (ADR 0011), below.

Prerequisites

You need:

The PDK's Verilog cell library (behavioral or functional models)
A post-synthesis or post-P&R netlist using those cells
The cell naming convention (prefix, drive strength suffix format)

For SKY130, the PDK data lives in vendor/sky130_fd_sc_hd/ as a git submodule.

Legacy pathway: per-PDK Rust (Steps 1–8)

This section documents the original hand-written-Rust workflow, which the cell-model IR (Pathway A, above) is retiring under ADR 0019. It is kept as a reference for the machinery the descriptor replaces and for the still-live paths during the cutover. To add a new PDK, use Pathway A — do not write per-PDK Rust. The RAM/macro manifest section that follows Step 8 remains fully current.

Step 1: Library Detection

Reference: src/sky130.rs -- is_sky130_cell(), detect_library(), detect_library_from_file()

Jacquard scans the netlist to determine which cell library is in use. Each PDK needs a name-matching function:

// src/sky130.rs:535
pub fn is_sky130_cell(name: &str) -> bool {
    name.starts_with("sky130_fd_sc_")
        || name.starts_with("CF_SRAM_")
}

The CellLibrary enum tracks known libraries. detect_library() iterates cell names and returns the detected library (or Mixed if cells from multiple libraries are found -- this is an error).

For a new PDK: Add a variant to CellLibrary, write an is_<pdk>_cell() function, and update detect_library().
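
The detection loop can be sketched as follows. This is an illustration only: the real detect_library() signature is not shown in this guide, and the sketch assumes that any cell not recognised as SKY130 belongs to AIGPDK and that an empty netlist falls back to AIGPDK.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellLibrary {
    AIGPDK,
    SKY130,
    Mixed,
}

fn is_sky130_cell(name: &str) -> bool {
    name.starts_with("sky130_fd_sc_") || name.starts_with("CF_SRAM_")
}

/// Classify every cell name; two different libraries in one netlist is Mixed.
fn detect_library(cell_names: &[&str]) -> CellLibrary {
    let mut found: Option<CellLibrary> = None;
    for name in cell_names {
        // Assumption for this sketch: non-SKY130 cells are AIGPDK cells.
        let lib = if is_sky130_cell(name) {
            CellLibrary::SKY130
        } else {
            CellLibrary::AIGPDK
        };
        match found {
            None => found = Some(lib),
            Some(prev) if prev != lib => return CellLibrary::Mixed,
            Some(_) => {}
        }
    }
    found.unwrap_or(CellLibrary::AIGPDK)
}
```

A new PDK would add its own variant and an is_<pdk>_cell() check to the classification step inside the loop.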

Step 2: Cell Type Extraction

Reference: src/sky130.rs -- extract_cell_type()

PDK cell names follow a convention: <prefix>__<type>_<drive>. The simulator needs to strip the prefix and drive strength to get the base cell type:

sky130_fd_sc_hd__nand2_4  -->  nand2
sky130_fd_sc_hd__dfxtp_1  -->  dfxtp

This function must handle all library variants (hd, hs, ms, ls, lp, hdll, hvl for SKY130) and any custom macros (CF_SRAM_*).

For a new PDK: Write an equivalent extract_cell_type() for the PDK's naming scheme.
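
A minimal sketch of the prefix and drive-strength stripping, assuming the <prefix>__<type>_<drive> convention with a numeric drive suffix. It is not the code in src/sky130.rs, and custom macros such as CF_SRAM_* (which have no double underscore) are deliberately left unhandled here.

```rust
/// "sky130_fd_sc_hd__nand2_4" -> Some("nand2").
/// Returns None when the name has no "__" separator.
fn extract_cell_type(cell: &str) -> Option<&str> {
    // Everything before the first double underscore is the library prefix.
    let (_prefix, rest) = cell.split_once("__")?;
    // Drop a trailing "_<digits>" drive-strength suffix when present.
    match rest.rsplit_once('_') {
        Some((base, drive))
            if !base.is_empty()
                && !drive.is_empty()
                && drive.bytes().all(|b| b.is_ascii_digit()) =>
        {
            Some(base)
        }
        _ => Some(rest),
    }
}
```

Splitting from the right matters for types that contain underscores themselves: only the final numeric segment is treated as the drive strength.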

Step 3: Pin Direction Provider

Reference: src/sky130.rs -- SKY130LeafPins implementing LeafPinProvider

The netlist parser (from eda-infra-rs/netlistdb) needs to know pin directions and widths for every cell type. This is implemented as a trait:

// Signature sketch: parameter types are elided here; see the
// LeafPinProvider trait in netlistdb for the exact definition.
impl LeafPinProvider for SKY130LeafPins {
    fn direction_of(&self, macro_name, pin_name, pin_idx) -> Direction;
    fn width_of(&self, macro_name, pin_name) -> Option<SVerilogRange>;
}

For SKY130, direction_of() is a large match statement covering ~80 cell types with all their pin names. This is tedious but straightforward -- for each cell, list which pins are inputs and which are outputs.

Sources for pin directions:

The PDK's Liberty (.lib) files list pin directions
The PDK's behavioral Verilog models declare input/output ports
LEF files also contain pin direction information

For a new PDK: Implement the trait for all cells that appear in your target netlists. You can start with just the cells used in your design and add others as needed.

Step 4: Cell Classification

Reference: src/sky130_pdk.rs -- is_sequential_cell(), is_tie_cell(), is_multi_output_cell()

Three classification functions control how cells are processed during AIG construction:

Sequential cells (DFFs and latches)

These are handled specially in the AIG builder -- their outputs become state elements rather than combinational logic.

Critical: Use an explicit whitelist, not prefix matching. PDK naming collisions will silently break simulation if you guess wrong (e.g., SKY130's dlygate4sd3 starts with "dl" but is a combinational delay buffer, not a latch).

Derivation method: Grep the PDK's behavioral Verilog models for DFF/latch primitives:

# Replace <pdk> with the library name before running.
for dir in vendor/<pdk>/cells/*/; do
    cell=$(basename "$dir")
    vfile="${dir}<pdk>__${cell}.behavioral.v"
    if [ -f "$vfile" ] && grep -qE 'udp_dff|udp_dlatch' "$vfile"; then
        echo "$cell"
    fi
done

For PDKs that don't use Verilog UDPs, look for always @(posedge blocks or check the Liberty file's ff and latch groups.
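
The difference between a whitelist and prefix matching is easy to show. The list below is a small illustrative subset of SKY130 sequential cell types, not the full set used in src/sky130_pdk.rs, and looks_sequential_by_prefix exists only to demonstrate the failure mode.

```rust
/// Explicit whitelist (illustrative subset; derive the full list from the PDK).
fn is_sequential_cell(cell_type: &str) -> bool {
    matches!(cell_type, "dfxtp" | "dfrtp" | "dfstp" | "dfbbp" | "dlxtp")
}

/// The tempting shortcut: treat "df*" as flops and "dl*" as latches.
/// Do not use this; it misclassifies combinational delay cells.
fn looks_sequential_by_prefix(cell_type: &str) -> bool {
    cell_type.starts_with("df") || cell_type.starts_with("dl")
}
```

For dlygate4sd3 the prefix guess says sequential while the whitelist correctly says combinational, which is exactly the silent failure the warning above describes.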

Tie cells

Cells that produce constant 0 or 1 (e.g., SKY130's conb with HI/LO pins).

Multi-output cells

Cells with more than one output (e.g., half-adder ha with SUM and COUT, full-adder fa). These need special handling because the AIG builder processes one output pin at a time.

Step 5: Behavioral Model Loading

Reference: src/sky130_pdk.rs -- load_pdk_models(), parse_functional_model(), parse_udp()

Jacquard decomposes PDK cells to AIG primitives (AND gates and inversions) by parsing their functional Verilog models. The expected file structure:

vendor/<pdk>/
  cells/
    <cell_type>/
      <pdk>__<cell_type>.functional.v    # Gate-level behavioral model
  models/
    <udp_name>/
      <pdk>__<udp_name>.v               # Verilog UDP definitions

Functional models

These are gate-level Verilog using primitives like and, or, nand, nor, not, xor, xnor, buf. The parser (parse_functional_model()) extracts these into a topologically-ordered list of BehavioralGate structures.

Example (sky130_fd_sc_hd__o21ai.functional.v):

module sky130_fd_sc_hd__o21ai (Y, A1, A2, B1);
    output Y;
    input  A1, A2, B1;
    wire or0_out;
    or  or0  (or0_out, A2, A1);
    nand nand0 (Y, B1, or0_out);
endmodule
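
A minimal sketch of how one primitive instantiation line from such a model could be parsed. ParsedGate and parse_gate_line are hypothetical names; the real parse_functional_model() and the fields of BehavioralGate are not shown in this guide, and this sketch handles only single-line instantiations.

```rust
#[derive(Debug, PartialEq, Eq)]
struct ParsedGate {
    kind: String,     // primitive name, e.g. "or"
    instance: String, // instance name, e.g. "or0" (may be empty)
    output: String,   // first port of a Verilog gate primitive
    inputs: Vec<String>,
}

const PRIMITIVES: [&str; 8] = ["and", "or", "nand", "nor", "not", "xor", "xnor", "buf"];

/// Parses lines like `or  or0  (or0_out, A2, A1);`.
/// Returns None for declarations, module headers and anything else.
fn parse_gate_line(line: &str) -> Option<ParsedGate> {
    let stmt = line.trim().strip_suffix(';')?;
    let (head, rest) = stmt.split_once('(')?;
    let args = rest.trim_end().strip_suffix(')')?;
    let mut words = head.split_whitespace();
    let kind = words.next()?;
    if !PRIMITIVES.contains(&kind) {
        return None;
    }
    let instance = words.next().unwrap_or("");
    let mut pins = args.split(',').map(|p| p.trim().to_string());
    let output = pins.next()?;
    let inputs: Vec<String> = pins.collect();
    if output.is_empty() || inputs.is_empty() || inputs.iter().any(|p| p.is_empty()) {
        return None;
    }
    Some(ParsedGate {
        kind: kind.to_string(),
        instance: instance.to_string(),
        output,
        inputs,
    })
}
```

Applied to the o21ai model above, the two instantiation lines parse into an or gate driving or0_out and a nand gate driving Y; the port and wire declarations are skipped.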

UDP models

Some cells (typically muxes) use Verilog User-Defined Primitives with truth tables. The parser (parse_udp()) converts these to a row-based representation, which is then evaluated as sum-of-products during AIG decomposition.

What's loaded

Only models for cell types actually present in the design are loaded. Sequential cells are skipped (their behavior is hardcoded in the AIG builder). Tie cells are also skipped (constant generation is trivial).

For a new PDK: If the PDK uses the same Verilog gate primitive syntax, the existing parsers should work. If it uses behavioral Verilog (assign statements, always blocks), the parser would need extension.

Step 6: AIG Decomposition

Reference: src/sky130_pdk.rs -- decompose_with_pdk(), decompose_from_behavioral()

The decomposition converts each combinational cell to a set of 2-input AND gates with optional inversions:

Map the cell's input pin names to AIG pin indices via CellInputs
Walk the behavioral model's gate list in topological order
For each gate, build the equivalent AIG sub-graph:
- and/nand -> AND gate (with optional output inversion)
- or/nor -> De Morgan's: OR(a,b) = NOT(AND(NOT a, NOT b))
- xor/xnor -> Three AND gates: XOR(a,b) = NOT(AND(NOT(AND(a, NOT b)), NOT(AND(NOT a, b))))
- buf/not -> Pass-through with optional inversion
- UDP -> Sum-of-products from truth table
Record the output with cell origin (for SDF timing annotation)
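
The gate rules above can be checked with a toy AIG. This is a self-contained sketch, not the builder in src/aig.rs: Lit, Node, Aig and o21ai are hypothetical names, and inversion is modelled as a flag on each edge.

```rust
/// A reference to a node, optionally inverted (inversions are free in an AIG).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Lit { node: usize, inverted: bool }

impl Lit {
    fn invert(self) -> Lit { Lit { node: self.node, inverted: !self.inverted } }
}

enum Node { Input, And(Lit, Lit) }

#[derive(Default)]
struct Aig { nodes: Vec<Node> }

impl Aig {
    fn push(&mut self, n: Node) -> Lit {
        self.nodes.push(n);
        Lit { node: self.nodes.len() - 1, inverted: false }
    }
    fn input(&mut self) -> Lit { self.push(Node::Input) }
    fn and(&mut self, a: Lit, b: Lit) -> Lit { self.push(Node::And(a, b)) }
    /// De Morgan: OR(a,b) = NOT(AND(NOT a, NOT b))
    fn or(&mut self, a: Lit, b: Lit) -> Lit { self.and(a.invert(), b.invert()).invert() }
    /// XOR(a,b) = NOT(AND(NOT(AND(a, NOT b)), NOT(AND(NOT a, b))))
    fn xor(&mut self, a: Lit, b: Lit) -> Lit {
        let left = self.and(a, b.invert());
        let right = self.and(a.invert(), b);
        self.and(left.invert(), right.invert()).invert()
    }
    fn and_count(&self) -> usize {
        self.nodes.iter().filter(|n| matches!(n, Node::And(..))).count()
    }
    /// Nodes are appended in topological order, so one forward pass suffices.
    /// `inputs` supplies one value per Input node, in creation order.
    fn eval(&self, inputs: &[bool], out: Lit) -> bool {
        let mut vals: Vec<bool> = Vec::with_capacity(self.nodes.len());
        let mut next_input = inputs.iter().copied();
        for n in &self.nodes {
            let v = match n {
                Node::Input => next_input.next().expect("one value per input"),
                Node::And(a, b) => (vals[a.node] ^ a.inverted) & (vals[b.node] ^ b.inverted),
            };
            vals.push(v);
        }
        vals[out.node] ^ out.inverted
    }
}

/// o21ai from the functional model: Y = NAND(B1, OR(A2, A1)).
fn o21ai(aig: &mut Aig, a1: Lit, a2: Lit, b1: Lit) -> Lit {
    let or0 = aig.or(a2, a1);
    aig.and(b1, or0).invert()
}
```

Counting nodes confirms the costs: o21ai needs two AND nodes (one for the OR, one for the NAND) and XOR needs three, with every inversion absorbed into an edge flag.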

CellInputs struct

CellInputs has named fields for all possible input pins across all SKY130 cells (A, B, C, D, A_N, B_N, S, S0, S1, CIN, SET_B, RESET_B, etc.). The set_pin() method maps netlist pin names to AIG pin indices.

For a new PDK: If the PDK introduces pin names not in the current struct, add new fields.

Step 7: AIG Builder Integration

Reference: src/aig.rs -- get_sky130_dependencies(), sky130_preprocess(), sky130_postprocess()

The AIG builder processes cells in three phases during topological traversal:

Dependencies (what must be built before this cell)

Tie cells: No dependencies
Sequential cells: Only SET_B and RESET_B pins (the data input D is handled by the DFF mechanism, not combinational decomposition)
Combinational cells: All input pins

Preprocessing (before dependencies are resolved)

Sequential cells: Create a DFF output AIG pin. This establishes the state element before the combinational cone driving it is built.

Postprocessing (after all dependencies are resolved)

Tie cells: Wire HI to constant-1, LO to constant-0
Sequential cells: Apply reset/set logic: Q = AND(OR(Q_state, NOT SET_B), RESET_B) (active-low semantics)
Combinational cells: Call decompose_with_pdk() and wire the resulting AND gates into the AIG

For a new PDK: The three-phase structure is reusable. You need PDK-specific implementations of each phase that handle the new cell types' pin names and reset/set conventions.
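The reset/set expression in the postprocessing phase is small enough to tabulate. This sketch evaluates Q = AND(OR(Q_state, NOT SET_B), RESET_B) for every input combination; `dff_output` and `dff_output_active_high` are illustrative names, and the active-high variant is only an assumption about what another PDK's convention might look like.

```python
from itertools import product

def dff_output(q_state, set_b, reset_b):
    """SKY130-style async controls: SET_B and RESET_B are active-low."""
    return (q_state or not set_b) and reset_b

def dff_output_active_high(q_state, set_, reset):
    """Same structure for a hypothetical PDK with active-high SET and RESET."""
    return (q_state or set_) and not reset

print("Q_state SET_B RESET_B -> Q")
for q_state, set_b, reset_b in product([False, True], repeat=3):
    q = dff_output(q_state, set_b, reset_b)
    print(int(q_state), int(set_b), int(reset_b), "->", int(q))
```

Because RESET_B gates the outer AND, reset wins when both controls are asserted. Feeding active-high signals into the active-low expression holds Q at 0 whenever reset is idle, which is the "stuck DFF" failure mode.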

Step 8: CLI Integration

Reference: src/bin/jacquard.rs

The load_design function detects the library and creates the netlist with the appropriate pin provider:

let lib = detect_library_from_file(&args.netlist_verilog)?;
let netlistdb = match lib {
    CellLibrary::SKY130 => NetlistDB::from_sverilog_file(&paths, &SKY130LeafPins),
    CellLibrary::AIGPDK => NetlistDB::from_sverilog_file(&paths, &AIGPDKLeafPins()),
    CellLibrary::Mixed => panic!("Mixed libraries not supported"),
};

For a new PDK: Add a match arm for the new library.

Testing Strategy

Unit tests

Cell type extraction: Verify prefix/suffix stripping
Pin directions: Spot-check common cells
Behavioral model parsing: Parse each cell type, verify gate count

Decomposition correctness: For each combinational cell, exhaustively test all input combinations against the PDK's truth table:

#[test]
fn test_all_cells_vs_pdk() {
    let pdk = load_test_pdk();
    for (cell_type, model) in &pdk.models {
        // For each input combination:
        //   1. Evaluate behavioral model directly
        //   2. Decompose to AIG and evaluate AIG
        //   3. Assert outputs match
    }
}

This test exists in src/sky130_pdk.rs as test_all_cells_vs_pdk and covers every combinational cell against every input combination.
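The shape of that exhaustive comparison can be sketched outside Rust. Everything below is an assumption made for illustration: `MUX2` is a hand-written gate list for a 2:1 mux (not copied from any PDK file), and `eval_model`, `eval_aig` and `check_cell` are hypothetical helpers. One evaluator uses ordinary boolean operators; the other is restricted to a 2-input AND with optional inversion, as an AIG would be.

```python
from itertools import product

# Hypothetical behavioral model: (gate type, output net, input nets), topologically ordered.
MUX2 = [
    ("not", "s_n", ["S"]),
    ("and", "t0", ["A0", "s_n"]),
    ("and", "t1", ["A1", "S"]),
    ("or", "X", ["t0", "t1"]),
]
MUX2_INPUTS = ["A0", "A1", "S"]

def eval_model(model, values):
    """Evaluate the gate list directly with Python's boolean operators."""
    nets = dict(values)
    for kind, out, ins in model:
        args = [nets[n] for n in ins]
        if kind == "not":
            nets[out] = not args[0]
        elif kind == "and":
            nets[out] = all(args)
        elif kind == "or":
            nets[out] = any(args)
        else:
            raise ValueError(f"unsupported gate {kind}")
    return nets

def and2(a, inv_a, b, inv_b):
    """The only node type an AIG has: 2-input AND with optional input inversion."""
    return (a ^ inv_a) and (b ^ inv_b)

def eval_aig(model, values):
    """Evaluate the same gate list using only and2 plus inversion."""
    nets = dict(values)
    for kind, out, ins in model:
        args = [nets[n] for n in ins]
        if kind == "not":
            nets[out] = not args[0]  # an inverted edge, no node needed
        elif kind == "and":
            acc = args[0]
            for x in args[1:]:
                acc = and2(acc, False, x, False)
            nets[out] = acc
        elif kind == "or":
            acc = args[0]
            for x in args[1:]:
                acc = not and2(acc, True, x, True)  # De Morgan
            nets[out] = acc
        else:
            raise ValueError(f"unsupported gate {kind}")
    return nets

def check_cell(model, inputs, output):
    """Return every input assignment on which the two evaluations disagree."""
    mismatches = []
    for bits in product([False, True], repeat=len(inputs)):
        values = dict(zip(inputs, bits))
        if eval_model(model, values)[output] != eval_aig(model, values)[output]:
            mismatches.append(values)
    return mismatches

print(check_cell(MUX2, MUX2_INPUTS, "X"))
```

An empty list means the decomposition agrees with the model on all 2^n input combinations, which is feasible because standard cells have few inputs.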

Integration tests

Small test circuit: Synthesize a simple design (DFF + some gates) to the new PDK and verify simulation output matches a reference (e.g., iverilog)
Flash boot test: If targeting an SoC, verify the CPU boots and reads from flash (this exercises sequential logic, combinational cones, and IO)

File Checklist

For a complete PDK integration, you need:

File	Purpose
`src/<pdk>.rs`	LeafPinProvider, library detection, cell type extraction
`src/<pdk>_pdk.rs`	Cell classification, model parsing, AIG decomposition
`src/aig.rs`	AIG builder hooks (dependencies, pre/post-process)
`src/sky130.rs`	Update `CellLibrary` enum
`src/bin/jacquard.rs`	CLI match arms for new library
`vendor/<pdk>/`	PDK cell models (git submodule)

Common Pitfalls

Cell name collisions: Do not use prefix matching for cell classification. dlygate4sd3 starts with "dl" but is not a latch. Always derive the exhaustive list from behavioral models.
Active-low vs active-high resets: SKY130 uses active-low RESET_B and SET_B. Other PDKs may use active-high. Get this wrong and every DFF will be stuck.
Multi-output cells: The AIG builder processes one output pin at a time. If a cell has both Q and Q_N outputs (e.g., dfbbp), the second output must be derived from the first (Q_N = NOT Q), not decomposed independently.
Liberty file size: SKY130's liberty files are 12MB+. If your PDK has similarly large files, ensure the parser doesn't run out of memory or time out.
Power/ground pins: Post-layout netlists often include VPWR/VGND pins. Use the unpowered netlist variant (.nl.v not .pnl.v in OpenLane2) or handle power pins as constants in the pin provider.
Hold-time repair buffers: P&R tools insert delay buffers (like dlygate4sd3) that must be treated as combinational. If your PDK's delay cells have names that collide with sequential cell prefixes, the whitelist approach prevents misclassification.
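The name-collision pitfall is easy to reproduce. This sketch assumes the usual SKY130 instance naming (library prefix, double underscore, cell type, drive-strength suffix); `cell_type`, `LATCH_WHITELIST` and the two classifier functions are hypothetical, and the whitelist is an illustrative subset rather than the list Jacquard derives from the behavioral models.

```python
import re

# Assumed naming scheme: <library>__<cell type>_<drive strength>
CELL_NAME = re.compile(r"^(?P<lib>sky130_fd_sc_[a-z]+)__(?P<cell>.+)_(?P<drive>\d+)$")

def cell_type(name):
    """Strip the library prefix and the drive-strength suffix."""
    match = CELL_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised SKY130 cell name: {name}")
    return match.group("cell")

# Illustrative subset only; a real list should be derived from the PDK models.
LATCH_WHITELIST = {"dlxtp", "dlxtn", "dlrtp", "dlrtn"}

def is_latch_by_prefix(cell):
    return cell.startswith("dl")  # the approach the pitfall warns against

def is_latch_by_whitelist(cell):
    return cell in LATCH_WHITELIST

for name in ("sky130_fd_sc_hd__dlxtp_1", "sky130_fd_sc_hd__dlygate4sd3_1"):
    cell = cell_type(name)
    print(cell, is_latch_by_prefix(cell), is_latch_by_whitelist(cell))
```

The prefix test reports the delay buffer as a latch; the whitelist does not.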

Adding third-party IP via runtime manifest

If you're adding a memory macro, IO pad, hard block, or filler library — anything that doesn't need new AIG decomposition rules — the runtime cell-library pathway (ADR 0010) is the right route. No Jacquard PR required. Ship a Verilog blackbox file plus a TOML manifest alongside your design.

Step 1: Provide the cell's Verilog interface

The blackbox just declares the cell's module + port directions. The foundry typically ships this (<library>__blackbox.v). Example for the OCD GF180MCU SRAM:

module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (CLK, CEN, GWEN, WEN, A, D, Q);
  input CLK;
  input CEN;
  input GWEN;
  input [7:0] WEN;
  input [9:0] A;
  input [7:0] D;
  output [7:0] Q;
endmodule

Step 2: Write the TOML manifest

Co-locate <library>.cells.toml next to <library>.v (it autoloads when present) or pass it via --cell-manifest:

schema_version = "1.0"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

Recognised kind values in v1.0: std, dff, latch, clock_gate, ram, filler, endcap, tap, io_pad_input, io_pad_output, io_pad_bidir, delay, multi_output, tie_high, tie_low.
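A manifest can be sanity-checked before it reaches the simulator. The sketch below is not Jacquard's loader: `manifest_kinds` is a hypothetical helper that takes an already-parsed manifest (for example the dict returned by Python 3.11's `tomllib.load`) and checks each cell's kind against the list above.

```python
RECOGNISED_KINDS = frozenset({
    "std", "dff", "latch", "clock_gate", "ram", "filler", "endcap", "tap",
    "io_pad_input", "io_pad_output", "io_pad_bidir", "delay", "multi_output",
    "tie_high", "tie_low",
})

def manifest_kinds(doc):
    """Return {cell name: kind}, raising ValueError on a malformed manifest."""
    if not isinstance(doc.get("schema_version"), str):
        raise ValueError("schema_version must be a string such as \"1.0\"")
    kinds = {}
    for name, cell in doc.get("cells", {}).items():
        kind = cell.get("kind")
        if kind not in RECOGNISED_KINDS:
            raise ValueError(f"{name}: unrecognised kind {kind!r}")
        kinds[name] = kind
    return kinds

# The manifest from Step 2, as a TOML parser would return it.
MANIFEST = {
    "schema_version": "1.0",
    "cells": {"gf180mcu_ocd_ip_sram__sram1024x8m8wm1": {"kind": "ram"}},
}
print(manifest_kinds(MANIFEST))
```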

Step 3: Invoke jacquard with the manifest

jacquard sim my_chip.v stim.vcd out.vcd 1 \
    --cell-library deps/gf180mcu_ocd_ip_sram/cells/gf180mcu_ocd_ip_sram__sram1024x8m8wm1/gf180mcu_ocd_ip_sram__sram1024x8m8wm1__blackbox.v

The --cell-library flag is repeatable for multi-IP designs.

What `kind = "ram"` delivers — opaque vs explicit-port modes

There are two modes depending on whether the manifest includes a [cells.NAME.ram] port-mapping sub-table:

Opaque mode (no ram sub-table, schema v1.0+): the cell's output pins become X-source slots in the AIG. The SRAM's internal memory behaviour is not modelled. Sufficient for designs whose CPU executes from boot ROM / register file and never reads SRAM contents at the timescales Jacquard simulates.

Explicit-port mode (with ram sub-table, schema v1.1+, ADR 0011): outputs are wired to a real AIG-backed RAMBlock, writes populate per-entry storage, reads return what was written. Real memory modelling end-to-end. Use this when the CPU reads its own SRAM (the common case for any design beyond heartbeat verification).

Schema (full example, mirroring the upstream OCD GF180MCU SRAM):

schema_version = "1.1"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock        = { pin = "CLK", edge = "pos" }
chip_enable  = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask   = { pin = "WEN", polarity = "low", granularity = "bit" }
address      = "A"
data_in      = "D"
data_out     = "Q"

Field semantics, defaults, and the multi-port-SRAM/async/wider-than-32-bit out-of-scope items are documented in ADR 0011. Polarity defaults to low; clock edge defaults to pos; mask granularity defaults to bit. All three control pins (chip_enable / write_enable / write_mask) are optional — omit them for sync SRAMs without those signals.
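The defaults and the bit-granular, active-low mask can be sketched as follows. This is an illustration under stated assumptions, not the ADR 0011 implementation: `normalise_ram` and `masked_write` are hypothetical names, the clock entry is assumed to be required, and the mask is assumed to let a data bit through when the mask bit is at its active level.

```python
def normalise_ram(ram):
    """Return a copy of a [cells.NAME.ram] table with the documented defaults filled in."""
    out = dict(ram)
    out["clock"] = {"edge": "pos", **ram["clock"]}
    for key in ("chip_enable", "write_enable", "write_mask"):
        if key in ram:  # optional pins stay absent when the SRAM lacks them
            out[key] = {"polarity": "low", **ram[key]}
    if "write_mask" in out:
        out["write_mask"] = {"granularity": "bit", **out["write_mask"]}
    return out

def masked_write(old, data, mask, width, polarity="low"):
    """Merge data into old, bit by bit, wherever the mask bit is at its active level."""
    full = (1 << width) - 1
    enable = (~mask & full) if polarity == "low" else (mask & full)
    return (old & ~enable & full) | (data & enable)

ram = normalise_ram({
    "depth": 1024, "width": 8,
    "clock": {"pin": "CLK"},
    "write_mask": {"pin": "WEN"},
    "address": "A", "data_in": "D", "data_out": "Q",
})
print(ram["clock"], ram["write_mask"])

# WEN = 0xF0 with active-low polarity enables only the low nibble.
print(hex(masked_write(old=0xFF, data=0x00, mask=0xF0, width=8)))
```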

Preloading SRAM contents at sim start

Once a SRAM is in explicit-port mode, its contents can be preloaded from an ELF file via sim_config.json:

{
  "sram_init": {
    "elf_path": "build/firmware.elf"
  }
}

The ELF's PT_LOAD segments are packed into the SRAM's backing storage before tick 0; the lowest loadable virtual address is taken as SRAM address 0. Single-SRAM designs only — multi-SRAM instance-targeting is a future schema extension (issue #80).
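The address rebasing described above can be sketched without an ELF parser. `pack_segments` is a hypothetical helper that takes (virtual address, payload) pairs standing in for the PT_LOAD segments and assumes a byte-addressed backing store; how Jacquard maps bytes onto wider SRAM words is not covered here.

```python
def pack_segments(segments, sram_bytes):
    """Place each segment at (vaddr - lowest vaddr) in a zero-filled image."""
    segments = [(vaddr, bytes(payload)) for vaddr, payload in segments if payload]
    image = bytearray(sram_bytes)
    if not segments:
        return image
    base = min(vaddr for vaddr, _ in segments)  # becomes SRAM address 0
    for vaddr, payload in segments:
        offset = vaddr - base
        if offset + len(payload) > sram_bytes:
            raise ValueError(f"segment at {vaddr:#x} does not fit in {sram_bytes} bytes")
        image[offset:offset + len(payload)] = payload
    return image

# Hypothetical firmware layout: two loadable segments, listed out of order.
image = pack_segments(
    [(0x2000_0010, b"\xaa\xbb"), (0x2000_0000, b"\x01\x02\x03")],
    sram_bytes=32,
)
print(image.hex())
```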

Other kinds

filler, endcap, tap — physical-only, contribute no logic.
io_pad_input / io_pad_output / io_pad_bidir — pad-level behaviour (parallel to the built-in gf180mcu_ws_io__* family).
dff, latch, clock_gate, delay, multi_output — recognised but the v1.0 schema doesn't yet expose enough port semantics to drive AIG construction for these. Coming in the port-mapping schema (future ADR). For now, declaring these kinds documents intent without changing behaviour.

Troubleshooting VCD Input Issues

This guide helps debug VCD input problems where GEM simulations produce incorrect results or warn about missing signals.

VCD Scope Auto-Detection (Recommended)

NEW: GEM now automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope manually.

How Auto-Detection Works

When you run jacquard sim without specifying --input-vcd-scope, GEM:

Extracts the list of required input ports from your synthesized design
Searches the VCD file for scopes containing all required ports
Tries common DUT scope names first: dut, uut, DUT, UUT, or your module name
Falls back to any scope that contains all required ports
Logs which scope was selected for transparency

Example Output

INFO No VCD scope specified - attempting auto-detection
DEBUG Searching for VCD scope containing 4 input ports
DEBUG Required ports: {"din_valid", "clk", "reset", "din"}
INFO Auto-detected VCD scope: safe_tb/uut (matched common pattern 'uut')
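The search order listed under How Auto-Detection Works can be approximated in a few lines. This is a sketch of the described behaviour, not Jacquard's implementation: `scopes_in_vcd` and `detect_scope` are hypothetical names, and the header below is a hand-written example.

```python
COMMON_DUT_NAMES = ["dut", "uut", "DUT", "UUT"]

def scopes_in_vcd(lines):
    """Map each slash-separated scope path to the signal names declared in it."""
    scopes, path = {}, []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "$scope":
            path.append(tokens[2])
            scopes.setdefault("/".join(path), set())
        elif tokens[0] == "$upscope":
            path.pop()
        elif tokens[0] == "$var" and path:
            scopes["/".join(path)].add(tokens[4])
        elif tokens[0] == "$enddefinitions":
            break
    return scopes

def detect_scope(scopes, required_ports, module_name):
    """Prefer common DUT names, then the module name, then any scope with all ports."""
    required = set(required_ports)
    candidates = [path for path, names in scopes.items() if required <= names]
    for wanted in COMMON_DUT_NAMES + [module_name]:
        for path in candidates:
            if path.rsplit("/", 1)[-1] == wanted:
                return path
    return candidates[0] if candidates else None

HEADER = '''$scope module safe_tb $end
$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
$scope module uut $end
$var wire 1 ! clk $end
$var wire 1 " reset $end
$var wire 4 # din [3:0] $end
$var wire 1 $ din_valid $end
$upscope $end
$upscope $end
$enddefinitions $end
'''.splitlines()

print(detect_scope(scopes_in_vcd(HEADER), {"clk", "reset", "din", "din_valid"}, "safe"))
```

Both safe_tb and safe_tb/uut declare all four ports here; the name preference is what picks the inner scope.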

Manual Override

If auto-detection fails or selects the wrong scope, use --input-vcd-scope to specify manually:

# Slash-separated path to the DUT scope
jacquard sim design.gv input.vcd output.vcd 8 \
    --input-vcd-scope "testbench/dut"

# For nested hierarchies
jacquard sim design.gv input.vcd output.vcd 8 \
    --input-vcd-scope "top_tb/subsystem/my_module"

Note: Use slash separators (/), not dots (.).

Symptom: Missing Primary Input Warnings

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "din", Some(3)) not present in the VCD input

Root Cause

GEM expects VCD signals at absolute top-level with no module hierarchy prefix. The signal names must exactly match the synthesized module's port names.

How to Check

Inspect your VCD file:

grep '\$var' your_input.vcd | head -20

Look for module scopes:

grep '\$scope module' your_input.vcd

Check synthesized module ports:

head -20 your_design_synth.gv

What GEM Expects

Correct - Signals at top level:

$timescale 1ns $end
$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
$var wire 1 % unlocked $end
$enddefinitions $end
$dumpvars
0"
0$
0%
1!
$end
#10
1"
#20
b1100 #
1$

Incorrect - Signals scoped under module:

$scope module testbench $end
  $scope module dut $end
    $var wire 1 ! clk $end
    $var wire 1 " reset $end
    $var wire 4 # din [3:0] $end
    ...
  $upscope $end
$upscope $end

Solution 1: Flat VCD Generation

Create a testbench that dumps signals at absolute top level:

module testbench;

reg clk = 0;
reg reset;
reg [3:0] din;
reg din_valid = 0;
wire unlocked;

// DUT instantiation
your_module dut (
    .clk(clk),
    .reset(reset),
    .din(din),
    .din_valid(din_valid),
    .unlocked(unlocked)
);

always #10 clk = !clk;

initial begin
    // CRITICAL: Dump signals at top level (depth 1)
    // NOT inside module hierarchy!
    $dumpfile("output.vcd");
    $dumpvars(1, clk, reset, din, din_valid, unlocked);

    // Test sequence
    reset = 1;
    #60;
    reset = 0;

    // ... your test stimulus ...

    #200;
    $finish;
end

endmodule

Key Point: $dumpvars(1, signal1, signal2, ...) dumps individual signals at the current scope level, not inside child modules.

Compile and Run

# For synthesis-compatible testbench
iverilog -DSYNTHESIS -o sim your_design.v testbench.v
./sim

# Check VCD structure
grep '\$scope' output.vcd  # Should be minimal or none
grep '\$var' output.vcd | head -10

Solution 2: Post-Process VCD (Advanced)

If you can't change the testbench, post-process the VCD to flatten hierarchy:

#!/usr/bin/env python3
"""Flatten VCD hierarchy to top level"""

import sys

def flatten_vcd(input_vcd, output_vcd):
    with open(input_vcd) as inf, open(output_vcd, 'w') as outf:
        scope_depth = 0

        for line in inf:
            stripped = line.strip()

            # Drop every $scope/$upscope line so the output has no hierarchy
            if stripped.startswith('$scope'):
                scope_depth += 1
                continue
            if stripped.startswith('$upscope'):
                scope_depth -= 1
                continue

            # Keep $var lines of the root scope (depth 1); skip nested ones
            if scope_depth > 1 and stripped.startswith('$var'):
                continue

            outf.write(line)

if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <input.vcd> <output.vcd>")
    flatten_vcd(sys.argv[1], sys.argv[2])

Usage:

python3 flatten_vcd.py hierarchical.vcd flat.vcd

Solution 3: VCD Scope Option (Experimental)

GEM provides --input-vcd-scope to specify which module hierarchy to read:

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    --input-vcd-scope module_name

Known Issue: Currently, signal matching still fails even with correct scope specified. This is under investigation.

Diagnostic Checklist

1. Verify Signal Names Match

Synthesized Module:

grep "^module\|input\|output" design_synth.gv

Output:

module safe(clk, reset, din, din_valid, unlocked);
  input clk;
  input reset;
  input [3:0] din;
  input din_valid;
  output unlocked;

VCD Signals:

grep '\$var.*\(clk\|reset\|din\|unlocked\)' input.vcd

Output should match synthesized port names exactly.

2. Check Signal Bit Widths

Multi-bit signals must have correct indices:

Synthesized: input [3:0] din;

VCD:

$var reg 4 # din [3:0] $end

GEM expects separate indices: din[3], din[2], din[1], din[0]
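To see which per-bit names a `$var` declaration corresponds to, the range can be expanded mechanically. A minimal sketch: `expand_var` is a hypothetical helper, and the per-bit naming follows the din[3] … din[0] form shown above.

```python
import re

VAR_LINE = re.compile(
    r"^\$var\s+\w+\s+(\d+)\s+(\S+)\s+([^\s\[]+)"   # type, width, id code, name
    r"(?:\s*\[(\d+)(?::(\d+))?\])?\s+\$end$"        # optional [msb:lsb] or [bit]
)

def expand_var(line):
    """Return the per-bit signal names declared by one $var line, MSB first."""
    match = VAR_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a $var line: {line!r}")
    width, _code, name, msb, lsb = match.groups()
    width = int(width)
    if msb is None:
        if width == 1:
            return [name]
        return [f"{name}[{i}]" for i in range(width - 1, -1, -1)]
    msb = int(msb)
    lsb = msb if lsb is None else int(lsb)
    step = -1 if msb >= lsb else 1
    return [f"{name}[{i}]" for i in range(msb, lsb + step, step)]

print(expand_var("$var reg 4 # din [3:0] $end"))
print(expand_var("$var reg 1 ! clk $end"))
```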

3. Verify Timestamp Format

GEM expects integer timestamps (not real numbers):

Correct:

#0
#10
#20

Incorrect:

#0.0
#10.5
#20.25
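A quick scan finds offending timestamps before a run. `non_integer_timestamps` is a hypothetical helper; it only inspects lines after `$enddefinitions`, where a leading `#` marks a timestamp.

```python
def non_integer_timestamps(lines):
    """Return (line number, text) for every timestamp that is not a plain integer."""
    bad = []
    in_body = False
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text.startswith("$enddefinitions"):
            in_body = True
            continue
        if in_body and text.startswith("#") and not text[1:].isdigit():
            bad.append((number, text))
    return bad

SAMPLE = ["$enddefinitions $end", "#0", "1!", "#10.5", "0!", "#20"]
print(non_integer_timestamps(SAMPLE))
```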

4. Check Timescale

Ensure VCD timescale matches simulation expectations:

$timescale 1ns $end

$timescale 1ps $end

Clock periods in testbench should use same time unit.

Validation Steps

After fixing VCD issues, validate GEM is reading inputs correctly:

1. Run with CPU Verification

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    --check-with-cpu

This compares GPU results against CPU gate-level simulation. Should print:

[INFO] sanity test passed!

2. Compare Output VCD with Reference

Run same design with iverilog:

iverilog -o reference_sim design.v testbench.v
./reference_sim  # Generates reference.vcd

Compare outputs:

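# Check that the unlocked signal toggles the same way in both files.
# '%' is unlocked's identifier code in the example VCD above; identifier codes
# are assigned per file, so look each one up in that file's $var line first.
grep '^[01]%' gem_output.vcd
grep '^[01]%' reference.vcd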
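Because identifier codes are assigned per file, comparing by signal name is more robust than comparing by code. The sketch below is an illustration, not a Jacquard tool: `signal_changes` is a hypothetical helper that handles scalar signals only and takes the first declaration of the name.

```python
def signal_changes(lines, name):
    """Return [(time, value)] for scalar signal `name`, resolving its id code by name."""
    code, time, changes = None, 0, []
    in_body = False
    for line in lines:
        text = line.strip()
        if not in_body:
            tokens = text.split()
            if code is None and len(tokens) >= 5 and tokens[0] == "$var" and tokens[4] == name:
                code = tokens[3]
            elif text.startswith("$enddefinitions"):
                if code is None:
                    raise KeyError(name)
                in_body = True
            continue
        if text.startswith("#"):
            time = int(text[1:])
        elif text and text[0] in "01xXzZ" and text[1:] == code:
            changes.append((time, text[0].lower()))
    return changes

# Two hand-written VCDs that use different id codes for the same signal.
GPU_VCD = '''$var wire 1 % unlocked $end
$enddefinitions $end
#0
0%
#40
1%
'''.splitlines()

REF_VCD = '''$scope module tb $end
$var wire 1 & unlocked $end
$upscope $end
$enddefinitions $end
#0
0&
#40
1&
'''.splitlines()

print(signal_changes(GPU_VCD, "unlocked") == signal_changes(REF_VCD, "unlocked"))
```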

3. Check Cycle Count

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    2>&1 | grep "total number of cycles"

Should match your testbench's simulation time / clock period.

Common Pitfalls

1. Testbench Inside `ifndef SYNTHESIS

If testbench is only compiled when SYNTHESIS is not defined:

`ifndef SYNTHESIS
module testbench;
  // ...
endmodule
`endif

You must compile without -DSYNTHESIS for VCD generation:

iverilog -o sim design.v testbench.v  # No -DSYNTHESIS!

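But the DUT must see SYNTHESIS defined if it has non-synthesizable constructs. iverilog applies -D to every file in an invocation, and -c expects a command file rather than a source file, so define the macro between the two files instead. A `define stays in effect for the files that follow it on the command line (the helper file name below is arbitrary):

# Testbench first (SYNTHESIS undefined), then the define, then the DUT
echo '`define SYNTHESIS' > define_synthesis.v
iverilog -o sim testbench.v define_synthesis.v design.v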

2. X/Z Values in VCD

GEM may not handle unknown (X) or high-impedance (Z) values correctly:

$dumpvars
x"  # reset = X
bxxxx #  # din = XXXX

Solution: Initialize all inputs in testbench:

initial begin
    reset = 0;  // Don't leave uninitialized
    din = 4'h0;
    din_valid = 0;
end
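To confirm the stimulus is free of unknowns, scan the value changes for x or z. `unknown_values` is a hypothetical helper covering scalar and binary-vector changes only.

```python
def unknown_values(lines):
    """Return (time, text) for each value change that contains x or z."""
    found, time, in_body = [], 0, False
    for line in lines:
        text = line.strip()
        if text.startswith("$enddefinitions"):
            in_body = True
            continue
        if not in_body or not text:
            continue
        if text.startswith("#"):
            time = int(text[1:])
            continue
        if text[0] in "01xXzZ":
            value = text[0]                  # scalar change, e.g. x"
        elif text[0] in "bB":
            value = text[1:].split()[0]      # vector change, e.g. bxxxx #
        else:
            continue                         # $dumpvars, $end, real values
        if any(ch in "xXzZ" for ch in value):
            found.append((time, text))
    return found

SAMPLE = ["$enddefinitions $end", "#0", "$dumpvars", 'x"', "bxxxx #", "0!", "$end", "#10", '1"']
print(unknown_values(SAMPLE))
```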

3. Missing Clock Signal

If VCD doesn't include clock:

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "clk", None) not present

Ensure:

Clock is generated in testbench
Clock is included in $dumpvars
Clock signal name matches synthesized netlist exactly

Example: Working Flat VCD Testbench

// testbench_flat.v - Generates GEM-compatible VCD
module testbench_flat;

// Declare all signals at top level
reg clk = 0;
reg reset = 1;
reg [3:0] din = 4'h0;
reg din_valid = 0;
wire unlocked;

// DUT instantiation
safe dut (
    .clk(clk),
    .reset(reset),
    .din(din),
    .din_valid(din_valid),
    .unlocked(unlocked)
);

// Clock generation
always #10 clk = !clk;  // 20ns period = 50MHz

// Test sequence
initial begin
    // CRITICAL: Dump at top level (depth 1)
    $dumpfile("safe_flat.vcd");
    $dumpvars(1, clk, reset, din, din_valid, unlocked);

    // Reset phase
    reset = 1;
    #60;  // 3 clock cycles
    reset = 0;
    #11;  // Small offset from clock edge

    // Apply test stimulus
    din = 4'hc;
    din_valid = 1;
    #20;

    din = 4'h0;
    #20;

    din = 4'hd;
    #20;

    din = 4'he;
    #20;

    din_valid = 0;
    #40;

    $finish;
end

endmodule

Compile and test:

# Compile (DUT must be SYNTHESIS-compatible)
iverilog -DSYNTHESIS -o sim safe.v testbench_flat.v

# Run simulation
./sim

# Verify VCD structure
echo "=== VCD Scopes ==="
grep '\$scope' safe_flat.vcd

echo -e "\n=== VCD Signals ==="
grep '\$var' safe_flat.vcd

# Should show signals at top level, no nested $scope modules

Still Having Issues?

Enable debug logging:

RUST_LOG=debug,vcd_ng=trace cargo run -r --features metal --bin jacquard -- sim <args> 2>&1 | tee debug.log

Check with minimal test:
- Create simplest possible design (single DFF)
- Generate flat VCD
- Verify GEM can read it correctly
Report issue with:
- Synthesized .gv file
- Input VCD file
- GEM command line
- Error messages or unexpected output

Document Version: 1.0 Last Updated: 2025-01-08 Related: simulation-architecture.md

Development

For working on Jacquard itself. If you only want to run it, start at Installation and Getting Started.

Build

You need the Rust toolchain (2021 edition) and the GPU SDK for the backend you're building. Nothing else.

git submodule update --init --recursive

The GPU backend is a feature, and you pick one per build:

cargo build -r --features metal --bin jacquard   # macOS, Apple Silicon
cargo build -r --features cuda  --bin jacquard   # NVIDIA, CUDA toolkit
cargo build -r --features hip   --bin jacquard   # AMD, ROCm

Two features are worth knowing about:

synth adds the embedded YoWASP Yosys engine, which is what lets sim/cosim take behavioral RTL instead of a gate-level netlist (see Accepted RTL Surface). It costs real compile time — wasmtime and cranelift — so it is opt-in for local builds, and on for release binaries. Combine it with a backend: --features metal,synth.
No GPU feature at all still builds dump-paths, so timing analysis and netlist validation work on a machine with no GPU SDK installed.

Test

cargo test                       # library + integration tests, no GPU needed
cargo bench --bench event_buffer # criterion micro-benchmarks, no GPU needed
cargo bench --bench xprop

For the GPU path, two flags matter more than the test suite:

jacquard sim ... --check-with-cpu        # run a CPU baseline and compare
jacquard sim ... --max-clock-edges 1000  # bound a long run while bisecting
                                         # (edges, not cycles: 2 per cycle)

--check-with-cpu is the one to reach for when a kernel change produces plausible-looking waveforms: it is the difference between "it ran" and "it is right".

Benchmark designs (NVDLA, Rocket, Gemmini) live in benchmarks/dataset/, a submodule. NVDLA is the smallest and the usual first thing to try.

Where things live

Path	What's in it
`src/aig.rs`	the and-inverter graph, and the conversion from NetlistDB into it
`src/staging.rs`	splits the AIG into pipeline stages (`--level-split`)
`src/repcut.rs`	hypergraph partitioning onto GPU blocks, via mt-kahypar
`src/pe.rs`	maps a partition onto one block's resources; the limits below live here
`src/flatten.rs`	emits `FlattenedScriptV1`, the packed instruction stream the kernel runs
`src/aigpdk.rs`	the AIGPDK standard-cell interface (AND, DFF, clock gate, SRAM)
`src/synth.rs`	the embedded Yosys on-ramp (ADR 0021), behind `synth`
`csrc/kernel_v1.metal`	the Metal kernel
`csrc/kernel_v1.cu`, `kernel_v1.hip.cpp`	CUDA and HIP, sharing `kernel_v1_impl.cuh`
`crates/`	`timing-ir`, `opensta-to-ir`, `cell-model-ir`, `liberty-parse`, `liberty-to-cellir`, `cell-decomp`
`vendor/eda-infra-rs`	netlistdb, sverilogparse, vcd-ng, ulib, ucc, clilog
`docs/scripts/`	documentation tooling, kept under `docs/` so editing it isn't a code change

The pipeline reads left to right:

NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU kernel

Simulation Architecture walks each stage. For why it is shaped this way, ADR 0014 explains the AIG choice and ADR 0015 the boomerang execution model.

The constraint that shapes everything

A GPU block is a fixed budget, and src/pe.rs has to fit each partition inside it:

at most 8191 unique inputs and 8191 unique outputs per partition (for SRAMs and DFFs, outputs count every enable and bus pin, and holes mean the effective capacity can be as low as 4095);
at most 4095 intermediate pins alive at any stage;
at most 64 SRAM output groups — that is 8192 / (32 × 4).

When a design doesn't fit you get single endpoint cannot map. The answer is usually --level-split, which forces more stages so each one is smaller:

jacquard sim ... --level-split 30
jacquard sim ... --level-split 20,40

This is the first thing most people hit on a large design, and it is a mapping limit rather than a bug.
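The limits can be written down as a checklist. This is only a sketch of the arithmetic: the real checks live in src/pe.rs and also account for the holes and SRAM/DFF pin counting mentioned above, which `limit_violations` (a hypothetical function) ignores.

```python
MAX_UNIQUE_INPUTS = 8191
MAX_UNIQUE_OUTPUTS = 8191
MAX_LIVE_INTERMEDIATES = 4095
MAX_SRAM_OUTPUT_GROUPS = 8192 // (32 * 4)  # 64

def limit_violations(unique_inputs, unique_outputs, live_per_stage, sram_output_groups):
    """Return a human-readable list of the limits a partition exceeds."""
    problems = []
    if unique_inputs > MAX_UNIQUE_INPUTS:
        problems.append(f"{unique_inputs} unique inputs > {MAX_UNIQUE_INPUTS}")
    if unique_outputs > MAX_UNIQUE_OUTPUTS:
        problems.append(f"{unique_outputs} unique outputs > {MAX_UNIQUE_OUTPUTS}")
    for stage, live in enumerate(live_per_stage):
        if live > MAX_LIVE_INTERMEDIATES:
            problems.append(f"stage {stage}: {live} live pins > {MAX_LIVE_INTERMEDIATES}")
    if sram_output_groups > MAX_SRAM_OUTPUT_GROUPS:
        problems.append(f"{sram_output_groups} SRAM output groups > {MAX_SRAM_OUTPUT_GROUPS}")
    return problems

# A partition whose second stage keeps too many intermediate pins alive.
print(limit_violations(6000, 5000, [3000, 5000, 1200], 10))
```

More stages (a finer --level-split) lower the per-stage live count, which is why that flag is the usual remedy.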

Optional tooling

Each of these is needed by one workflow and nothing else. Editing Rust, C++ or kernel sources needs none of them: the timing-IR bindings are checked in, and OpenSTA is only for the timing-correctness corpus.

Tool	Used for	macOS (Homebrew)	Linux (Debian/Ubuntu)
`flatc`	regenerating timing-IR bindings when editing `crates/timing-ir/schemas/timing_ir.fbs`	`brew install flatbuffers`	`apt install flatbuffers-compiler`
`mdbook`	building the docs locally	`brew install mdbook`	`cargo install mdbook`
OpenSTA	building vendored `vendor/opensta/` for `opensta-to-ir` and the timing-correctness CI corpus	`brew bundle --file vendor/opensta/Brewfile`, then `scripts/build-opensta.sh`	see `vendor/opensta/Dockerfile.ubuntu22.04`, then `scripts/build-opensta.sh`

Python tooling (PDK fetchers, harness utilities) belongs in the workspace's uv dev group rather than an ad-hoc pip install: add it under [dependency-groups] in the root pyproject.toml, run uv sync --group dev, and invoke it with uv run.

Debugging

Jacquard's own tools, each with a page:

Signal Tracing — surface internal nets in the output VCD. netlist-graph finds the names to trace: uv run netlist-graph drivers <netlist> <signal>.
Debugging X Values — find why a signal went x, statically, instead of trace-guess-rerun.
Bus Transaction Tracing — decode on-chip bus transfers rather than reading raw wires.
Timing Violations — GPU-side setup/hold checks.
Cosim Perf Report and GPU Frame Capture — where the time actually goes in a cosim run.

Docs

The docs are an mdBook, sourced from docs/:

mdbook serve   # http://localhost:3000

SUMMARY.md is the table of contents, and a page missing from it isn't rendered at all — which is why docs/scripts/check_doc_links.py validates links against the rendered page set rather than against files on disk. Run it before pushing docs; CI does too.

The published site keeps main's docs at the root and a frozen copy of each release under /vX.Y.Z/, with a picker to move between them. Release notes link to the pinned copy, so a link in an old release keeps meaning what it meant. See Release Process.

Conventions

ADRs record decisions worth understanding later, and are append-only: when reality moves past one, amend it rather than rewriting it. An ADR's status is a claim about the code, so check it against the code.
Plans hold work in progress; deferred work gets a plan and an issue rather than being lost.
Handoffs are working memory, not history. One per active thread, folded into the ADRs or plans and deleted when resolved.
Release Process covers cutting a release, the RC-first flow, and what's automated.

The through-line: a sentence that says how the tool behaves is a verifiable claim. That applies to docs, --help text, and ADR status alike — check it against the code before writing it.

Release Process

Lightweight by design. Jacquard is a single-binary Rust project with vendor/ submodules; releases are git tags + a CHANGELOG entry. No crates.io publication, no pre-built binaries (until/unless that demand surfaces).

When to release

Cut a release when:

A user-visible feature or fix lands that you want consumers to be able to pin against.
Schema or CLI changes happened (--timing-report JSON, CLI flags) and consumers need a stable reference point.
A meaningful chunk of work in docs/plans/ is closed (e.g. a Phase exits all criteria).

There is no fixed cadence.

Versioning

SemVer, starting once the first numbered release ships. Pre-1.0 versions (0.x.0) carry the standard SemVer caveat: minor bumps may include breaking changes; the public contracts (--timing-report schema, IR layout) are documented in their own ADRs and follow stricter rules.

Stable contracts (additive-only, breaking changes require a major bump and a deprecation window):

--timing-report JSON schema — src/timing_report.rs::SCHEMA_VERSION, governed by ADR 0008.
Timing IR FlatBuffers schema — crates/timing-ir/schemas/timing_ir.fbs, governed by ADR 0002.

CLI flags, log message formats, and --timing-summary text output are not stable parseable contracts; consumers that need to script against them should use --timing-report JSON.

Steps

For maintainers cutting a release:

Verify CI is green on main for all three GPU backends (Metal, CUDA, HIP) plus the unit-test, opensta-to-ir, and lint jobs. If any GPU runner is offline, hold the release until it's restored — see .github/workflows/ci.yml. Do not ship a binary the CI hasn't built.
Roll the [Unreleased] section in CHANGELOG.md into a numbered version block. Format follows Keep a Changelog. Update the link references at the bottom of the file. Leave a fresh empty [Unreleased] section at the top.
Bump the Rust crate version to match: python3 scripts/bump_version.py <X.Y.Z>. The three first-party Rust crates (jacquard, opensta-to-ir, timing-ir) ship together in one tarball and carry a single shared version, so this script sets all three at once — never edit their [package].version by hand. Then cargo build to update Cargo.lock. (netlist-graph versions independently — see its own netlist-graph-v* tag flow.) The release workflow re-runs bump_version.py --check <tag> as a verify-guard and aborts before publishing if the tag and the crates disagree.
Commit: chore: release v<X.Y.Z> with the standard Co-developed-by trailer.
Tag: git tag -a v<X.Y.Z> -m "v<X.Y.Z>" then git push --tags.
Create a GitHub release from the tag. Body = the CHANGELOG section for that version. No artefacts attached unless someone has asked for them.
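The tag-versus-crate guard in step 3 amounts to a string comparison. The following is a hypothetical re-implementation for illustration, not the contents of scripts/bump_version.py: `package_version` does a simple line scan for `[package].version` (it does not handle workspace-inherited versions), and `check_tag` reports which manifests disagree with the tag.

```python
import re

SECTION = re.compile(r"^\[+([^\[\]]+)\]+")
VERSION = re.compile(r'^version\s*=\s*"([^"]+)"')

def package_version(cargo_toml):
    """Return [package].version from Cargo.toml text (line scan, not a TOML parser)."""
    section = None
    for line in cargo_toml.splitlines():
        text = line.strip()
        header = SECTION.match(text)
        if header:
            section = header.group(1).strip()
            continue
        if section == "package":
            match = VERSION.match(text)
            if match:
                return match.group(1)
    raise ValueError("no [package].version found")

def check_tag(tag, manifests):
    """Return the names of crates whose version differs from the tag (minus its 'v')."""
    wanted = tag[1:] if tag.startswith("v") else tag
    return [name for name, text in manifests.items() if package_version(text) != wanted]

# Hypothetical manifests: one crate was missed by a hand edit.
MANIFESTS = {
    "jacquard": '[package]\nname = "jacquard"\nversion = "0.3.0"\n\n[dependencies]\nserde = { version = "1" }\n',
    "timing-ir": '[package]\nname = "timing-ir"\nversion = "0.2.9"\n',
}
print(check_tag("v0.3.0", MANIFESTS))
```

A non-empty result is the condition under which a release workflow would want to abort before publishing.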

Homebrew tap (automated)

The Homebrew formula is auto-bumped by release.yml — no manual step (this closes the drift that had left the tap stale at 0.2.3). On every release tag the bump-tap job rewrites packaging/homebrew/jacquard.rb's url/version/sha256 from the just-published tarball + .sha256 and pushes Formula/jacquard.rb to the tap:

final release → gpu-eda/homebrew-tap (brew install gpu-eda/tap/jacquard);
prerelease (RC) → gpu-eda/homebrew-tap-prerelease (brew install gpu-eda/tap-prerelease/jacquard), so RCs are brew install-able for staging without touching the stable channel.

packaging/homebrew/jacquard.rb is the template — edit it only to change the formula's structure (deps, install, test); its version pin is a placeholder the job overwrites. Requires the one-time org setup: secrets.HOMEBREW_TAP_TOKEN (a token with contents: write on both tap repos) and the gpu-eda/homebrew-tap-prerelease repo.
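The routing between the two taps depends only on whether the tag carries a SemVer pre-release suffix. A minimal sketch of that decision, assuming tags of the form vX.Y.Z or vX.Y.Z-rc.N; `is_prerelease` and `tap_for_tag` are hypothetical names and this is not the logic in release.yml itself.

```python
import re

SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")

def is_prerelease(tag):
    """True when the tag has a SemVer pre-release suffix such as -rc.1."""
    match = SEMVER_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag}")
    return match.group(4) is not None

def tap_for_tag(tag):
    """Pick the tap repository a formula bump for this tag would be pushed to."""
    if is_prerelease(tag):
        return "gpu-eda/homebrew-tap-prerelease"
    return "gpu-eda/homebrew-tap"

for tag in ("v0.3.0-rc.1", "v0.3.0"):
    print(tag, "->", tap_for_tag(tag))
```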

Release notes & versioned docs (automated)

Release notes come from the CHANGELOG, and doc links are pinned to the release — both automatic in release.yml:

Notes body = the CHANGELOG section for the tag's version. A prerelease (X.Y.Z-rc.N) has no dated section, so the extractor falls back to [Unreleased] — RCs ship the same curated draft you'll ship at promotion. So: write the notes in [Unreleased].
Lead with a user-facing overview. Before the technical ### Added / ### Changed sections, open with a short "What this means for you" block — a few benefit-framed bullets (what a user can now do, not just what changed). The technical changelog then gives the detail. This becomes the release intro and is the first thing a reader sees.
Doc links are version-pinned. The extractor rewrites `docs/foo.md` references into [docs/foo.md](https://gpu-eda.github.io/Jacquard/<tag>/foo.html) — the mdBook page frozen for this release. So keep CHANGELOG doc references as repo-relative `docs/foo.md` (clean in-repo); the workflow does the URL rewrite.
Versioned docs deploy. The docs-version job publishes the book to gh-pages /<tag>/ on every release tag (keep_files: true), while the main push keeps deploying "latest" to the site root. Version subdirs accumulate side by side; the pinned links above resolve to them.
The version picker (theme/version-picker.js) reads versions.json at the site root and lets a reader move between main's docs and the frozen releases. main is the default and stays at the root, so anyone arriving without a version in the URL reads main HEAD; a pinned build is flagged in the control, since someone who followed a link out of a release note has no other cue that the page is frozen. versions.json is regenerated after each deploy by docs/scripts/refresh_doc_versions.sh, derived from the directories actually published rather than accumulated — so pruning an old version directory also removes it from the picker, and the control can never offer a 404. Release candidates are deliberately excluded: their docs stay published (RC notes link into them) but they would crowd out the releases people want.
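The notes extraction and link pinning described above can be sketched in Python. This is an approximation of the described behaviour, not the extractor in release.yml: `changelog_section` and `pin_doc_links` are hypothetical names, and the changelog text (including its date and compare URL) is made up for the example.

```python
import re

def changelog_section(changelog, version):
    """Return the body under '## [version]', falling back to '## [Unreleased]'."""
    sections, current = {}, None
    for line in changelog.splitlines():
        heading = re.match(r"^## \[([^\]]+)\]", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and not re.match(r"^\[[^\]]+\]:\s", line):
            sections[current].append(line)  # link reference definitions are skipped
    key = version if version in sections else "Unreleased"
    return "\n".join(sections[key]).strip()

def pin_doc_links(text, tag, base="https://gpu-eda.github.io/Jacquard"):
    """Rewrite `docs/foo.md` into a link to the page frozen for this release."""
    def replace(match):
        page = match.group(1)
        return f"[docs/{page}.md]({base}/{tag}/{page}.html)"
    return re.sub(r"`docs/([\w\-/]+)\.md`", replace, text)

CHANGELOG = """# Changelog

## [Unreleased]
### Added
- Draft note, see `docs/foo.md`.

## [0.2.3] - 2026-01-01
### Fixed
- Old fix.

[Unreleased]: https://example.invalid/compare
"""

notes = changelog_section(CHANGELOG, "0.3.0-rc.1")  # no dated section, so falls back
print(pin_doc_links(notes, "v0.3.0-rc.1"))
```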

Staging validation (release candidates)

Optional but recommended before a user-facing release: prove the documented install commands work against a staging artifact before promoting. There is no "test crates registry", so a GitHub prerelease is the staging tier for the binary channels.

Sequence matters — do NOT roll main to the final version before the RC validates. During the RC window main is the candidate: it stays at X.Y.Z-rc.N with the changelog still under [Unreleased]. Only the Promote step (below) rolls [Unreleased] → [X.Y.Z] and bumps to the final X.Y.Z. Rolling the release onto main first leaves main advertising a version that has no release — the version pin and dated changelog claim X.Y.Z is shipped while only the RC tag exists, which also breaks cargo binstall --git (it reads main's version and fetches a non-existent vX.Y.Z tarball). Each new RC is a main commit bumped to the next -rc.N; main == the latest RC tag throughout.

Cut an RC. With main at X.Y.Z-rc.N (changelog still under [Unreleased]), run python3 scripts/bump_version.py <X.Y.Z>-rc.<N>, commit, tag v<X.Y.Z>-rc.<N>, and push the tag. release.yml detects the SemVer pre-release suffix and publishes a GitHub prerelease (never shown as "Latest") with the Metal tarball attached, and bump-tap pushes the formula to gpu-eda/homebrew-tap-prerelease.
Validate. Dispatch the Validate install (staging) workflow (validate-install.yml) with that tag. It runs the real install commands against the prerelease asset on macOS:
- cargo binstall (asset-fetch via the [package.metadata.binstall] override, compile fallback disabled so a missing asset fails hard);
- brew install of an RC formula (the source-of-truth formula repointed at the prerelease tarball + its .sha256, installed from a throwaway local tap).
Promote. A green run means the channels work. Bump to the final <X.Y.Z> (drop the -rc.<N>), commit, tag v<X.Y.Z>, and push: the same flow as a normal release (see Steps above).

The netlist-graph (PyPI) channel validates separately via its own workflow_dispatch → TestPyPI path (see publish-netlist-graph.yml); it versions independently of the Rust crates.

What does NOT need to change at release time

Submodule pins (unless deliberately bumping a vendored dep).
The vendor/opensta/ submodule pin is the version named in crates/opensta-to-ir::MIN_TESTED_OPENSTA_VERSION. If you bump the submodule, also bump the constant and the version-probe test — see WS-RH.1 in docs/plans/post-phase-0-roadmap.md.
LICENSE (unless re-licensing).

Pre-release checklist (one-time, before the first numbered release)

These items are tracked in docs/plans/post-phase-0-roadmap.md § Release hardening; this section is the visible punch-list:

Phase 1 (ADR 0008 required outputs) closed.
WS-RH.1 (OpenSTA detection + version check) shipped.
Metal CI on macos-runner-1 green (re-enabled in commit 12e98df, 2026-05-12).
CUDA CI on nvidia-runner-1 green on main. Currently disabled in .github/workflows/ci.yml (if: ${{ false }}, ~line 268). Re-enable when hardware lands.
HIP CI on the AMD runner green on main. Currently disabled in .github/workflows/ci.yml (if: ${{ false }}, ~line 357). Re-enable when the AMD runner is online.
Prebuilt CUDA/HIP binaries (ADR 0018 Phase 4), when produced, must build with JACQUARD_CUDA_ARCH=all-major so the kernel ships portable SASS for every major arch (sm_50…sm_120 on CUDA ≥ 12.8, Blackwell included) plus PTX for the newest — see the README § CUDA target architecture. Local dev uses JACQUARD_CUDA_ARCH=native instead. The nvidia1.local Blackwell box (sm_120, CUDA 12.8) is a candidate self-hosted CUDA release runner.
Vendored-dep license posture confirmed (gzz2000/eda-infra-rs#2 — sverilogparse AGPL declaration acknowledged as a typo; workspace Apache-2.0 governs).
Cargo.toml::license = "Apache-2.0" set.
NOTICE file enumerating vendored deps + their licenses.
Bump vendor/eda-infra-rs submodule once upstream pushes the sverilogparse Cargo.toml correction; remove the corresponding footnote in NOTICE. Maintainer acknowledged the typo on 2026-05-02 but hasn't pushed the fix as of 2026-05-13. Verify with git -C vendor/eda-infra-rs fetch && git -C vendor/eda-infra-rs log origin/master --oneline.
CUDA / HIP runtime violation routing through process_events — done (commit 24723b5, issue #104). sim_cuda / sim_hip now dispatch the timed-batched path and drain violation events, so --timing-report / --timing-summary / --timed are no longer Metal-only.
Bounded violations array (--timing-report-max-violations, default 100k).
End-to-end --timing-report test on Metal CI. The inv_chain_pnr sim step uses --timing-ir (pre-generated .jtir checked in) + --timing-report + --timing-summary; a follow-up step validates the JSON shape (top-level keys, semver schema version, metadata, stats, arrays).
GF180MCU enablement (Phases 0–6) shipped. See docs/plans/gf180mcu-enablement.md. Phase 7 (wafer.space test-run-1 design integration) deferred pending design availability; not release-blocking.

License posture

Project license is Apache-2.0 (LICENSE). Vendored-dep posture is enumerated in NOTICE. Summary:

vendor/eda-infra-rs/ — Apache-2.0 (workspace). The sverilogparse sub-crate's stale AGPL-3.0-only declaration in Cargo.toml is a typo per upstream maintainer (gzz2000/eda-infra-rs#2); governed by the workspace LICENSE. Submodule pin will be bumped when upstream pushes the correction.
vendor/sky130_fd_sc_hd/ — Apache-2.0.
vendor/opensta/ — GPL-3 (subprocess only per ADR 0001 + ADR 0006 § Amendment; never linked, never bundled).

Cross-references

CHANGELOG.md — release log.
docs/adr/0008-structured-timing-output.md — --timing-report stability contract.
docs/adr/0002-timing-ir.md — IR schema versioning.
docs/adr/0006-sdf-preprocessing-model.md — OpenSTA bundling rules.
docs/project-scope.md — license posture contract.

Handoff discipline

Handoffs in this project are ephemeral working memory, not historical record. They exist to bridge a single session boundary — when you stop working and someone else (Claude or human) picks up — and they are deleted once the work they describe is resolved.

This document defines what a handoff is, what it isn't, when to write one, and exactly what to do when one is resolved.

Why this discipline exists

Decision rationale, technical context, and project state all have natural homes:

ADRs (docs/adr/) capture architectural decisions and their why.
Design docs (docs/timing-model-extensions.md, etc.) capture how things work.
Plan docs (docs/plans/phase-0-ir-and-oracle.md, post-phase-0-roadmap.md) capture what's left and the next workstream slices.

When that content lives in a handoff instead, two things go wrong:

It's not where contributors look. A new contributor reading the README → SUMMARY → ADR chain shouldn't have to dig through a stack of resolved handoff docs to find load-bearing decisions or the current state of a workstream.
It rots out of sync with reality. Handoffs are point-in-time snapshots. A "STATUS: RESOLVED" banner doesn't help when the thing referenced has moved or changed; the canonical doc is what should hold the current truth.

The discipline closes this gap by forcing migration before deletion. Every load-bearing piece of a handoff lands in its proper home (ADR / design doc / plan doc) before the handoff file is removed.

What a handoff IS

A handoff is a single markdown file at docs/handoffs/<topic>-handoff.md, in a dedicated directory separate from the persistent plan docs whose content it eventually feeds. It contains exactly what the next session needs to pick up where you left off:

Goal & next-up — what this session was trying to do, and what the very next concrete action is.
Done this session — commits landed, with one-line summaries.
Open follow-ups — the work that wasn't done, with enough scope detail to start cold.
Critical context — gotchas, surprising findings, environment specifics that aren't obvious from the code or docs yet.
Verification — the command(s) the next session runs to confirm the work is in the state you say it is.

One handoff per active thread of work — not one globally. A thread is a distinct arc someone could pick up cold (the RTL on-ramp, the cell-model-IR migration, a triage sweep). Concurrent threads get concurrent handoffs, each named by topic (<topic>-handoff.md). There's still no chain: a thread holds at most one live handoff, and a resolved one is folded-and-deleted (see below), not archived behind its successor.

Treat the handoff count as a WIP signal. If the number of live handoffs climbs past roughly 2× the number of people actively working, that's a smell, not a badge — more parallel threads than the team can hold context on. Read it as a prompt to resolve, consolidate, or drop threads, not to open more. (Earlier revisions of this doc mandated "exactly one handoff at a time"; that didn't match engineering reality, where several independent arcs are legitimately in flight at once. The per-thread rule plus the WIP heuristic replaces it.)
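The per-thread naming rule and the WIP heuristic can be checked mechanically. The sketch below is illustrative only: `live_handoff_topics` and `wip_smell` are hypothetical names, the 2× factor is the rough threshold from the paragraph above, and the headcount is something you supply by hand.

```python
from pathlib import Path

SUFFIX = "-handoff.md"


def live_handoff_topics(handoff_dir):
    """Return the sorted topics of files named <topic>-handoff.md."""
    handoff_dir = Path(handoff_dir)
    if not handoff_dir.is_dir():
        return []
    return sorted(
        p.name[: -len(SUFFIX)]
        for p in handoff_dir.iterdir()
        if p.is_file() and p.name.endswith(SUFFIX) and p.name != SUFFIX
    )


def wip_smell(topic_count, active_people, factor=2.0):
    """True when live handoffs exceed roughly factor x the active headcount."""
    return topic_count > factor * active_people
```

A possible use is `wip_smell(len(live_handoff_topics("docs/handoffs")), active_people=2)`. A True result is a prompt to resolve or consolidate threads, not a hard failure.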

What a handoff IS NOT

Not a decision log. Decisions go in ADRs. If you find yourself writing "we chose X over Y because Z" in a handoff, that paragraph belongs in an ADR (or an existing ADR's "Consequences" / "Walk-back" section).
Not a design doc. "How clock arrival flows from OpenSTA Tcl through the IR into the GPU constraint buffer" is a design topic; it lives in docs/timing-model-extensions.md Part B, not in a handoff's "Critical context" section.
Not a status dashboard for the project. Workstream status lives in plan docs — phase-0-ir-and-oracle.md for current-phase WS state, post-phase-0-roadmap.md for forward-looking sequencing. A handoff cites those, doesn't reproduce them.
Not a historical record. git log is the historical record. Handoffs that survive past their resolution turn into noise that misleads new contributors.

When to write one

Write a handoff at the end of any session that:

Leaves work in a partial state that someone else might pick up cold.
Captures non-obvious context the next session needs (e.g. "the OpenSTA Tcl find_timing proc rejects -full_update; use ::sta::find_timing_cmd 1 directly").
Documents the next concrete step with enough scope to start without re-discovering it.

If the session ended at a clean stopping point (everything merged, all decisions documented in ADRs/plans, nothing surprising), don't write a handoff. The plan doc already says what's next.

Resolution: fold, then delete

The two-location split is deliberate: handoffs live at docs/handoffs/<topic>-handoff.md while in flight; their content migrates into the persistent docs (docs/adr/, docs/plans/, design docs under docs/) at resolution. The handoff file then gets removed; nothing about the work is lost because everything load-bearing has a permanent home elsewhere.

When a handoff's work is done — whether in the next session or several sessions later — every load-bearing piece of it must be migrated to its proper home before the handoff file is deleted:

If the handoff says...	It belongs in...
"We chose approach X over Y because Z"	The relevant ADR's Decision/Consequences section, or a new ADR if no fit exists
"Future scope for WS-N: do A then B then C"	The plan doc's WS-N section (`phase-0-ir-and-oracle.md` or successor)
"Gotcha: OpenSTA's Tcl X behaves Y"	A code comment near the Tcl call site, or a design doc if the gotcha cuts across files
"Build dep Z is required on Linux"	The build script's apt-suggestion / Brewfile / README install section
"Subsystem A doesn't yet do B"	Plan doc as a new open item, or an ADR-tracked walk-back if it's a deferred design choice
"Run `cargo test --features foo` to verify"	The verification block in the relevant plan doc, or a test-running section in `CLAUDE.md`

After migration, the handoff file is removed in the same commit as the migration:

git rm docs/handoffs/<topic>-handoff.md
git add <files-receiving-the-migrated-content>
git commit -m "$(cat <<'EOF'
docs: resolve <topic> handoff — fold into <where-it-went>

<one-paragraph summary of what was migrated and where>

Co-developed-by: Claude Code v<version> (<model-id>)
EOF
)"

The commit message records what migrated where — that's the audit trail. git log -- docs/handoffs/ then shows the project's handoff history (one add, one delete per handoff) without needing the files themselves to live forever.
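The "same commit" rule can be checked from a commit's file list. The sketch below assumes the input is the text printed by `git show --name-status --format= <sha>` (one tab-separated status and path per line); `fold_and_delete_ok` is a hypothetical helper, not part of the project's tooling, and it ignores rename entries.

```python
def fold_and_delete_ok(name_status):
    """Check one commit's name-status listing against fold-then-delete.

    Passes when every deleted handoff is accompanied by at least one added or
    modified file outside docs/handoffs/ (where the content was folded into).
    """
    deleted_handoffs = []
    receiving_files = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        status, _, path = line.partition("\t")
        in_handoffs = path.startswith("docs/handoffs/")
        if status == "D" and in_handoffs and path.endswith("-handoff.md"):
            deleted_handoffs.append(path)
        elif status in ("A", "M") and not in_handoffs:
            receiving_files.append(path)
    # A commit that deletes no handoff is out of scope for this check.
    return not deleted_handoffs or bool(receiving_files)
```

This only confirms that something else changed alongside the deletion. Whether the load-bearing content actually landed in the right home still needs a human read of the commit.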

Template

When you do need to write one, use this skeleton. Replace placeholders inline; delete sections that don't apply (better to omit a section than fill it with "N/A").

# Handoff — <Topic> (one-line summary of what this session left open)

**Created:** YYYY-MM-DD
**Working tree:** clean | <state if not clean>
**Branch:** main | <branch>

## Goal & next-up

**Goal of this session:** <what you were trying to do, in 1–3 sentences>

**Next session should pick up:** <the very next concrete action, by name. Reference the plan doc section if applicable.>

**Verification command:**
```sh
<commands the next session runs to confirm this handoff's claimed state>
# Expect: <what success looks like>
```

## Done this session

| Commit | Subject | Notes |
|---|---|---|
| `<sha>` | | |

## Open follow-ups (priority-ordered)

1. ()

2. ...

## Critical context

## References

- <predecessor-handoff if any> — predecessor (if relevant)
- <plan doc> — current workstream state
- <ADR> — relevant decision

Resume in a new session with:

```
/resume_handoff docs/handoffs/<topic>-handoff.md
```


Tooling

The create_handoff and resume_handoff skills (from various Claude Code orchestration toolkits) generate and consume handoffs. They're optional — the discipline above is the load-bearing artifact. A handoff written by hand following this template is just as valid.

If you use one of those skills, expect it to default to YAML format under thoughts/shared/handoffs/ with database indexing. That doesn't apply to this project. Override it: produce markdown at docs/handoffs/<topic>-handoff.md and skip the database step. The skill activation is informational; the project's convention takes precedence.

Architecture Decision Records

ADRs capture decisions worth understanding later: the context, the options considered, and the rationale for the choice. They are numbered, append-only, and never silently rewritten. When reality moves past an ADR, record the change in the ADR rather than letting a stale claim mislead — two paths depending on the size of the change:

Full reversal (the decision no longer holds) → supersede the old ADR with a new one and set the old status to Superseded.
Refinement (a claim turned out too blunt, a constraint relaxed, a detail corrected) → add a dated Amendment note at the top of the affected section stating the current understanding, and keep the original decision text in place below it (relegated, not deleted), so the audit trail stays intact. Mark the status Accepted (amended <date>) and note it in the index. ADR 0006 and ADR 0014 are worked examples.

Status legend

Accepted / Approved — current, in effect.
Accepted (partial) — design ratified and partly built; the ADR carries an ## Implementation status section (see below).
Proposed — drafted, not yet ratified.
Superseded — historical, replaced by a later ADR or by a spike outcome; kept for the audit trail.
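The legend can be read as a small grammar over status strings. The sketch below is one possible parser under that reading; `AdrStatus` and `parse_status` are hypothetical names, and any string that does not start with one of the four legend entries is rejected rather than guessed at.

```python
import re
from dataclasses import dataclass
from typing import Optional

_BASES = ("Accepted", "Approved", "Proposed", "Superseded")
_AMENDED = re.compile(r"amended (\d{4}-\d{2}-\d{2})")
_PARTIAL = re.compile(r"Accepted \(partial\b")


@dataclass(frozen=True)
class AdrStatus:
    base: str               # one of _BASES
    partial: bool           # True for "Accepted (partial ...)"
    amended: Optional[str]  # ISO date of the amendment, if one is recorded

    @property
    def in_effect(self):
        """Accepted and Approved both mean 'current, in effect'."""
        return self.base in ("Accepted", "Approved")


def parse_status(text):
    """Classify a Status string from an ADR or from the index table."""
    text = text.strip()
    base = next((b for b in _BASES if text.startswith(b)), None)
    if base is None:
        raise ValueError(f"status not in the legend: {text!r}")
    m = _AMENDED.search(text)
    partial = _PARTIAL.match(text) is not None
    return AdrStatus(base, partial, m.group(1) if m else None)
```

For example, `parse_status("Accepted (partial; amended 2026-06-25)")` yields a status that is in effect, partial, and amended on 2026-06-25.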

Keeping status honest

An ADR's Status is a claim about the codebase, not an aspiration. Before setting or changing it, verify the claim against the implementation — read the code; don't trust the previous status or a feature's "done" framing. The same goes for any present-tense statement inside an ADR ("jitter feeds the setup/hold checker"): it's a verifiable claim, so check it.

Don't bump Proposed → Accepted just because a design merged. Confirm the decision is actually in effect in the code.
When a design is ratified but only partly built, use Accepted (partial) and add an ## Implementation status section splitting implemented (with file references) from deferred (with the specific gap). ADR 0012 is the worked example.
Deferred work gets a home: a plan under docs/plans/ and a tracking issue, cross-linked from the ADR's status section, so the unbuilt half isn't lost.

This extends to user-facing docs and --help text: a sentence telling the reader how the tool behaves is a verifiable claim — check it against the code before writing it.

Index

#	Title	Status
0001	OpenSTA as the timing correctness oracle and sole STA path	Accepted (amended 2026-06-25; scope expanded 2026-05-01)
0002	Timing intermediate representation	Accepted (amended 2026-06-25)
0003	OpenTimer as in-process reference STA	Superseded (2026-05-01) — spike failed; OpenSTA subprocess only
0004	Private PDK testing track	Accepted (amended 2026-06-25)
0005	OpenSTA vendoring and test-corpus strategy	Accepted (amended 2026-06-25)
0006	SDF preprocessing model and interim-to-release cutover	Accepted (amended 2026-05-02)
0007	Timing model fidelity roadmap	Proposed (line refs amended 2026-06-25)
0008	Structured timing output as first-class deliverable	Accepted (amended 2026-06-25)
0009	OpenSTA Verilog reader inputs	Accepted (amended 2026-06-25)
0010	Declarative cell metadata	Accepted (amended 2026-06-25)
0011	RAM port-mapping schema for declarative cell metadata	Accepted (amended 2026-06-25)
0012	Reproducible CDC jitter injection for multi-clock cosim	Accepted (partial; amended 2026-06-25)
0013	Cosim peripheral model architecture	Accepted (amended 2026-06-25)
0014	AIG as simulation intermediate representation	Accepted (amended 2026-06-25)
0015	Boomerang execution model and GPU resource mapping	Accepted
0016	Selective X-propagation	Accepted (amended 2026-06-25)
0017	Cosim execution model	Accepted (amended 2026-06-25)
0018	Distribution and installation model	Accepted (amended 2026-06-25) — Phase 4 & 7 open
0019	Cell-model IR: a complete per-cell-type library descriptor	Proposed
0020	Python engine as a bundled binary wheel (cibuildwheel)	Draft — deferred (PyO3 preferred; see ADR)
0021	Behavioral RTL support via an embedded synthesis front-end (YoWASP)	Proposed
0022	Transaction-based external stimulus (SCE-MI-style pipes)	Proposed

How the ADRs relate

0014 / 0015 document the core simulation pipeline: 0014 explains why the AIG (and-inverter graph) is the simulation IR — its uniform AND-gate structure enables the boomerang reduction tree and eliminates per-cell dispatch in the GPU kernel. 0015 describes the boomerang execution model itself — the 13-level hierarchical reduction tree, the GPU resource limits it imposes (8191 inputs, 8191 outputs, 4095 intermediates, 64 SRAM groups per partition), the hypergraph partitioning that distributes work across GPU blocks, and the packed instruction format (FlattenedScriptV1) consumed by the kernel. Together they document the path from gate-level Verilog to GPU kernel execution that the GEM paper describes.
0001 / 0003 / 0005 / 0006 describe the timing oracle stack: OpenSTA as the ground truth (0001), vendored at a pinned revision with its own corpus reused (0005), driving SDF preprocessing out-of-process (0006). The earlier OpenTimer in-process plan (0003) was retired after the spike (../spikes/opentimer-sky130.md).
0002 is the data contract those tools talk over — a FlatBuffers timing IR (with a JSON text sidecar) consumed by Jacquard, produced by opensta-to-ir.
0004 governs how PDK-specific testing happens for NDA-bound contributors without leaking files into the public repo.
0007 / 0008 are the forward-looking pair: 0008 (Approved) defines the structured timing output Jacquard owes downstream flows; 0007 (Proposed) sketches the model-fidelity work needed to back those outputs at scale (δ(T), clock-tree skew, wire delay). Scheduling for both lives in ../plans/post-phase-0-roadmap.md.
0013 / 0017 cover the cosim runtime: 0013 documents the peripheral model architecture (CPU-side PeripheralModel trait, GPU-side kernel patterns, ring buffers, plural-config convention); 0017 documents the execution model (batch dispatch loop, multi-clock scheduler, edges-vs-cycles semantics).
0016 accepts the selective X-propagation design documented in docs/selective-x-propagation.md. The full seven-phase design lives there; the ADR is a thin acceptance record with a summary of key choices.

Adding a new ADR

Pick the next number (highest existing + 1).
Filename: NNNN-short-kebab-title.md.
Start with # ADR NNNN — <title> and a **Status:** line — set it to match the code, not the intent (see Keeping status honest).
Standard sections: Context, Decision, Consequences. Add Amendment blocks dated when the decision is revisited; do not rewrite accepted history.
Add the row to the table above.
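The numbering and filename steps above can be sketched as follows. `next_adr_number`, `kebab`, and `new_adr_filename` are hypothetical helpers for illustration; the project does not necessarily ship such a script, and the title used in the usage note is made up.

```python
import re

# NNNN-short-kebab-title.md, e.g. 0002-timing-ir.md
_ADR_FILE = re.compile(r"^(\d{4})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")


def next_adr_number(existing_names):
    """Highest existing ADR number + 1; non-ADR filenames are ignored."""
    numbers = []
    for name in existing_names:
        m = _ADR_FILE.match(name)
        if m:
            numbers.append(int(m.group(1)))
    return max(numbers, default=0) + 1


def kebab(title):
    """Lower-case the title and join its alphanumeric words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("title needs at least one alphanumeric word")
    return "-".join(words)


def new_adr_filename(existing_names, title):
    """Build the NNNN-short-kebab-title.md filename for a new ADR."""
    return f"{next_adr_number(existing_names):04d}-{kebab(title)}.md"
```

Given only `0002-timing-ir.md`, `0006-sdf-preprocessing-model.md`, and `0008-structured-timing-output.md` as input, `new_adr_filename(names, "Waveform Export (v2)")` returns `0009-waveform-export-v2.md`. In practice the input would be the full listing of docs/adr/.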

ADR 0001 — OpenSTA as the timing correctness oracle and sole STA path

Status: Accepted (amended 2026-06-25). Scope expanded 2026-05-01 — see Decision §3 below.

Amendment (2026-06-25): Decision §1's claim that "OpenSTA is never invoked from the jacquard runtime binary" is no longer true. opensta-to-ir is a direct dependency (Cargo.toml) and is the shipping SDF path (src/sim/setup.rs load_sdf_via_opensta_to_ir); ADR 0006's amendment ratified invoking it as a subprocess at runtime. The original §1 text is retained below as the record of the initial decision.

Context

Jacquard's current correctness validation for timing relies on its own CPU reference simulator (--check-with-cpu), which shares the Rust source tree, data structures, and parsers with the GPU simulation path. Representation bugs (e.g., hierarchical SDF prefix mismatch, inverter-collapse issues) have passed both paths silently because they affect both.

Historical regressions have been caught only by comparing against genuinely external tools — specifically CVC for functional simulation and, by implication, OpenSTA for timing. No format or tool inside Jacquard is currently treated as authoritative.

OpenSTA is widely deployed in open-source EDA (SKY130, OpenLane2, OpenROAD) and has the largest effective test surface of any open-source STA tool for the Liberty + SDF + Verilog + SPEF stack. It is licensed under GPL-3.0 and also sold commercially.

Jacquard requires permissive licensing for code linked into its binary (see ../project-scope.md).

Decision

OpenSTA is the ground-truth oracle for timing correctness and the sole STA path used by Jacquard.

1. In the shipped release, OpenSTA is never invoked from the jacquard runtime binary, and never linked. Subprocess invocation from CI pipelines, test harnesses, and the standalone opensta-to-ir preprocessing tool (see ADR 0006) is acceptable — GPL's reciprocal requirements do not cross a subprocess boundary ("mere aggregation") and so Jacquard's permissive licensing is preserved. Pre-release, a runtime subprocess invocation may exist as a contributor-ergonomics convenience (per ADR 0006); it is removed before release.
2. All timing, STA, and parser-related code paths are validated against OpenSTA on (a) a vendored subset of OpenSTA's own test corpus, and (b) representative Jacquard test designs.
3. OpenSTA is also Jacquard's only STA path, not just its oracle. ADR 0003 originally proposed an in-process reference STA via OpenTimer to complement this oracle role; the spike (../spikes/opentimer-sky130.md) found OpenTimer's input pipeline unfit for OpenROAD-flow outputs (commit d002bde superseded ADR 0003). The role OpenTimer would have played — providing per-DFF clock arrival, structured timing data for the IR, etc. — now sits with OpenSTA, called out of process via opensta-to-ir. OpenSTA is therefore a required runtime dependency for any timing-aware Jacquard flow, not just for CI validation.
4. Where Jacquard's output disagrees with OpenSTA's output past a declared tolerance, Jacquard is wrong until proven otherwise. Divergence is either fixed, explicitly justified in writing, or filed as a bug.
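The "past a declared tolerance" rule amounts to a keyed comparison. The sketch below is illustrative only: `oracle_divergences` is a hypothetical helper, the dict-of-floats shape and picosecond unit are assumptions, and the real comparison would run over timing IR rather than plain dicts.

```python
def oracle_divergences(jacquard, opensta, tolerance_ps):
    """Return {key: (jacquard_value, opensta_value)} for entries past tolerance.

    A key present on only one side is also reported, with None for the
    missing value, since a dropped entry is a divergence too.
    """
    out = {}
    for key in sorted(set(jacquard) | set(opensta)):
        j = jacquard.get(key)
        o = opensta.get(key)
        if j is None or o is None or abs(j - o) > tolerance_ps:
            out[key] = (j, o)
    return out
```

An empty result corresponds to "oracle-diff clean". Each reported entry then has to be fixed, justified in writing, or filed as a bug, with OpenSTA's value presumed correct.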

Consequences

OpenSTA is a required runtime dependency for timing-aware Jacquard flows (post §3 expansion), not merely a CI/validation dependency. Users running jacquard sim --timing-ir ... need a .jtir produced by opensta-to-ir, which subprocesses OpenSTA. Documented in ../why-jacquard.md.
Subprocess integration preserves Jacquard's permissive licensing (satisfies project-scope.md).
"Oracle-diff clean" becomes a required CI gate for timing-related PRs, run nightly or pre-release (not per-PR — OpenSTA runs on large designs can be slow).
OpenSTA bugs may produce false-positive divergences. The expectation is to file upstream rather than work around silently. A pinned OpenSTA version in CI avoids drift. With OpenSTA now also the only STA path (not just the oracle), upstream regressions land in users' hands too — pinning matters more than before.
A vendored OpenSTA test corpus (or git submodule) is added to the repo as a fixture. Licensing of specific test inputs is verified per file before inclusion.
No second STA tool to maintain. The original ADR 0003 proposal would have given Jacquard a permissive-licensed in-process reference; the spike showed that's not achievable today with OpenTimer. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted.

ADR 0002 — Timing intermediate representation

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The vendor-extension type names in the Decision are stale. The schema uses a VendorSource enum with variants Cadence / Synopsys / Mentor / Other (timing_ir.fbs), not the VendorCadence / VendorSynopsys / VendorOther names written below; the Mentor variant was added and isn't mentioned in the original text.

Context

Jacquard currently parses SDF directly in src/sdf_parser.rs, a hand-rolled parser that has accumulated reactive fixes (empty () delays, (COND …) pin specs, backslash escapes, edge-qualified timing checks, TIMINGCHECK stripping workarounds for OpenLane2 output). Each new production failure has been a one-off patch.

Commercial tool output adds dialect variation (Cadence, Synopsys extensions). Future parser paths (Liberty, SPEF) and future reference tools (OpenSTA, OpenTimer) each carry their own data models. A format-per-consumer coupling structure will continue to spread parser complexity into the simulator.

The project needs:

A stable format we consume, with parser complexity isolated from simulator complexity.
A format that can be diffed between producers (two parsers of the same file must agree).
A format that supports multi-corner PVT values natively — commercial flows require this; single-corner shortcuts become retrofit pain.
Preservation of vendor-specific annotations so information is not silently discarded.
Fast consumption at sim startup (SDF parsing is currently on the critical path).

Decision

Introduce a timing intermediate representation (timing IR) for SDF-equivalent annotation data.

Binary format: FlatBuffers. Zero-copy reads, schema evolution, cross-language (Rust, C++ for OpenTimer adapter, Python for tooling).
Text sidecar: JSON, produced via FlatBuffers' JSON round-trip, for CI diffs and human inspection.
Schema versioning: explicit version field, compatible-evolution rules stated in schema comments. Breaking changes require a major version bump and migration notes.
Multi-corner native: timing values are min / typ / max across a declared set of PVT corners. Single-corner designs are represented as a single-element corner set.
Vendor extension passthrough: typed VendorExtension variants (VendorCadence, VendorSynopsys, VendorOther) carry unrecognised annotations as byte-typed blobs with source labels. Consumers opt in to understanding them; the IR never silently drops them.
Per-arc provenance: each timing arc records source tool, source file, and origin category — asserted (from SDF / input), computed (derived by an STA tool), defaulted (fallback because no better value was available). Provenance is inspectable at consumer side.
Scope boundary: the IR represents timing annotation data only. It is not a netlist representation, not a timing graph, not cell characterization. Attempts to extend it toward those adjacent formats are rejected — they become separate IRs if needed.
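The shape these decisions describe can be sketched with plain dataclasses. This is not the FlatBuffers schema (timing_ir.fbs) or its generated code: all field names other than `cell_instance` and the `VendorSource` variants named in the amendment are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Origin(Enum):
    ASSERTED = "asserted"    # taken from SDF / input
    COMPUTED = "computed"    # derived by an STA tool
    DEFAULTED = "defaulted"  # fallback, no better value available


class VendorSource(Enum):
    CADENCE = "Cadence"
    SYNOPSYS = "Synopsys"
    MENTOR = "Mentor"
    OTHER = "Other"


@dataclass(frozen=True)
class MinTypMax:
    min: float
    typ: float
    max: float


@dataclass(frozen=True)
class Provenance:
    source_tool: str
    source_file: str
    origin: Origin


@dataclass(frozen=True)
class VendorExtension:
    source: VendorSource
    payload: bytes  # unrecognised annotation, carried through untouched


@dataclass(frozen=True)
class TimingArc:
    cell_instance: str
    from_pin: str                   # hypothetical field name
    to_pin: str                     # hypothetical field name
    delays: Tuple[MinTypMax, ...]   # one entry per declared corner, in order
    provenance: Provenance


@dataclass
class TimingIR:
    schema_version: str
    corners: Tuple[str, ...]        # a single-corner design has one element
    arcs: List[TimingArc] = field(default_factory=list)
    vendor_extensions: List[VendorExtension] = field(default_factory=list)

    def add_arc(self, arc):
        """Reject arcs whose delay count does not match the corner set."""
        if len(arc.delays) != len(self.corners):
            raise ValueError("arc must carry one MinTypMax per declared corner")
        self.arcs.append(arc)
```

Keeping the corner set on the container and one `MinTypMax` per corner on each arc is one way to make single-corner designs a special case of the multi-corner layout rather than a separate code path.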

Consequences

A new schema and format to maintain. Scope discipline is load-bearing: if the IR creeps toward being a full STA framework, it becomes duplicate work with OpenSTA/OpenTimer.
Parser complexity moves out of src/sdf_parser.rs (and its future rewrite, per ADR covering #3) into a focused converter crate. Unit-testable in isolation.
A diff-based test corpus becomes natural: multiple converters on the same input must produce equivalent IR. This is the enforcement mechanism for ADR 0001's oracle pattern.
Vendor extensions do not require Jacquard code changes — only converter updates.
Startup parse cost drops: reading binary IR is near-instant. SDF-to-IR conversion becomes a one-time preprocessing step, not repeated per sim.
Adopting FlatBuffers adds a code-generation step to the build, via flatc. Build hygiene (checked-in generated code, pinned flatc version, or a build-script integration) is required.
If the IR is ever shared across other tooling beyond Jacquard, its stability contract tightens. Flagged in open questions on timing-correctness.md; not resolved here.

Amendment 2026-06-23: cell characterization is ADR 0019, not this IR

The scope boundary above ("not cell characterization … they become separate IRs if needed") is now realised by ADR 0019. Two timing artifacts must not be conflated: this IR is per-design instance annotation (TimingArc { cell_instance }, SDF-equivalent for a specific netlist), whereas per-cell-type timing characterization (setup/hold, clock→Q, today's liberty_parser::TimingLibrary) is a library property and lives in the ADR-0019 cell-model IR alongside the cell's logic. The two are orthogonal axes — per-design vs per-library — not two halves of one scope. This ADR's decision is unchanged.

ADR 0003 — OpenTimer as in-process reference STA

Status: Superseded (2026-05-01). Spike (../spikes/opentimer-sky130.md) failed Q2 — OpenTimer's input pipeline cannot handle real OpenROAD-flow .v/.spef for SKY130 designs with bus ports. Fallback is OpenSTA subprocess validation only (ADR 0001); a future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later.

Context

Jacquard needs an in-process reference STA path to:

Validate SDF-derived timing against an independent computation at load time and on demand (requirement R2 in timing-correctness.md).
Provide exact per-edge arrival for top-K critical paths (requirement R4, pessimism-delta reporting).

OpenSTA (ADR 0001) is the ground-truth oracle but runs only as a subprocess — unsuitable for per-run, in-process checking. A linked alternative is needed.

Options surveyed:

OpenTimer (MIT, C++17). Parses .lib / .v / .spef / .sdc directly. Won TAU Timing Analysis Contests (2014 1st, 2015 2nd, 2016 1st); industry "Golden Timer" for benchmark comparisons. Actively maintained (latest push 2025-12-26 as of this writing). Does not parse SDF — timing is computed from Liberty + parasitics.
libreda-sta (Rust, permissive). Young framework, self-described as "basic components." Unknown whether it handles SKY130 Liberty robustly. Higher maturity risk than OpenTimer.
Tatum (MIT, C++). Analysis engine only; does not parse Liberty/SDF/Verilog. Using Tatum would require supplying our own parsers, so it does not solve the problem directly.
In-house Rust walker. Author-shared blind spots with Jacquard's main pipeline reduce the independence benefit.

Decision

Subject to the spike's success, OpenTimer becomes Jacquard's in-process reference STA, integrated via C++ FFI (bindgen or equivalent).

Linked directly; MIT licence satisfies project-scope.md.
Computes timing from .lib + .spef independently of any SDF-derived path. This is an accepted (and arguably preferable) property: the reference path shares no parsing with Jacquard's SDF consumer, so a parse bug on either side is detectable rather than mutually masked.
Emits timing IR (per ADR 0002) so its output is directly diffable against Jacquard's SDF-derived IR.

Spike criteria in ../spikes/opentimer-sky130.md. On spike failure, fallback is to drop the in-process reference entirely and rely on OpenSTA subprocess validation (ADR 0001). This weakens per-PR feedback on timing correctness but is not fatal.

Consequences

C++ FFI dependency; bindgen-generated bindings; build complexity rises modestly.
Direct linking preserves permissive licensing (MIT).
Three-way cross-check becomes the default in CI: Jacquard (SDF path) vs OpenTimer (Liberty+SPEF path) vs OpenSTA (subprocess, full ground truth). Three-way disagreement localises bugs to SDF parse / delay model / tool issue cleanly.
OpenTimer does not parse SDF. To use it in Jacquard's current flow, OpenLane2 (or equivalent) must produce SPEF alongside SDF. This plumbing change is tracked in the phase-0 plan.
OpenTimer's maturity is measured in contest benchmarks, not SKY130 real-flow output. Spike must verify it handles our actual Liberty and SPEF. The spike is structured to fail fast if it does not.
If OpenTimer is dropped post-spike, alternative in-process references (libreda-sta, in-house) can be revisited; this ADR would be superseded rather than amended.

ADR 0004 — Private PDK testing track

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): Phase 0 shipped (2026-05-02), so the "plumbing tracked in the phase-0 plan" rider no longer points at anything in flight. What actually landed is the open-source GF180MCU path (GF180MCU_LIBERTY_DIR); the commercial / NDA PDK track described in the Decision body (the *_PDK_PATH env-gated, CI-runner-restricted flow) is not yet implemented.

Context

Some contributors and operators have access to commercial PDKs (GlobalFoundries, TSMC, and others) under NDA or licensing agreements that prohibit public redistribution of PDK files. Whether a given contributor has access is itself typically under NDA and not publicly known.

Commercial PDK Liberty libraries are substantially richer and quirkier than open-source alternatives — they include cell variants, conditional timing arcs, vendor-specific annotations, and characterization detail not present in SKY130 or AIGPDK. Several parser bugs live only on commercial PDK output.

SKY130-only coverage is insufficient for a sim tool used on commercial flows, and adding commercial PDK files to a public repository is not an option regardless of who operates the project.

The standard industry pattern for testing against proprietary PDKs is environment-gated test suites: tests run when the contributor has licensed access, and skip cleanly when they don't.

Decision

Establish a private PDK test track gated on per-PDK environment variables (e.g. TSMC_PDK_PATH, and similar — one per PDK).

Tests check for the required env var(s) and skip with a clear "PDK not available" message when unset.
When env vars point to a readable PDK directory, tests execute fully.
Only the test harness, expected structural outputs, and IR fixtures (where the PDK vendor licensing permits) are committed.
No PDK-derived artifacts (.lib, .sdf, .spef, characterization data) are committed to the public repository under any circumstances.
CI runners with configured PDK access execute the private track; public PRs from non-licensed contributors see the private tests as skipped, not as failures. Which runners have access is determined by whoever operates CI; this ADR does not name specific organisations.
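The env-var gate described above can be sketched as follows. This is a minimal illustration, not the project's harness: the helper name pdk_gate and its return shape are hypothetical, and TSMC_PDK_PATH is simply the example variable named in this ADR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def pdk_gate(var: str, env: Optional[Mapping[str, str]] = None) -> Tuple[bool, str]:
    """Return (run, detail): the PDK directory when run is True, else a skip reason.

    Hypothetical helper; the real harness may be structured differently.
    """
    env = os.environ if env is None else env
    value = env.get(var, "")
    if not value:
        return False, f"PDK not available: {var} is unset"
    path = Path(value)
    if not (path.is_dir() and os.access(path, os.R_OK)):
        return False, f"PDK not available: {var}={value} is not a readable directory"
    return True, str(path)
```

In a pytest-based suite, a test would call pytest.skip(detail) when run is False, so that non-licensed contributors see a skip rather than a failure.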

The timing IR (ADR 0002) makes this feasible: converter output and diff results can be checked in as fixtures where they contain no PDK-licensed data. Expected behaviour can be asserted in terms of IR structure rather than in terms of specific cell timings that would leak characterization data.

Consequences

Contributors without PDK access cannot locally reproduce PDK-specific bugs. They rely on maintainer CI for validation.
A separate setup doc for licensed contributors is required (not public). It covers env-var configuration, test runner invocation, and PDK-file staging expectations.
Fixture schema must be PDK-agnostic enough that structural assertions don't implicitly leak cell-characterization data. Review process must check new fixtures against this rule before merge.
Bugs found via private PDK testing are, where possible, distilled into minimal public reproducers. The private track is not a place to park unreviewable tests — every private test should ideally surface a public fixture once the bug's essence is extracted.
CI cost rises (licensed runners). Runs are nightly or pre-release rather than per-PR.

ADR 0005 — OpenSTA vendoring and test-corpus strategy

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The corpus contents described below (SKY130 MCU SoC, NVDLA, AIGPDK examples, representative SDFs) are aspirational, not current. As shipped, the primary corpus contains a single entry — aigpdk_dff_chain — with the SKY130/MCU/NVDLA entries marked pending (blocked on a sky130-Liberty CI strategy), and the stress corpus is empty (tests/timing_ir/stress/manifest.toml → entries = [], pending the Phase 1 stress runner). The vendoring + corpus-split architecture is implemented as described.

Context

Under ADR 0001, OpenSTA is the ground-truth oracle for timing correctness, invoked as a subprocess. Phase 0 (../plans/phase-0-ir-and-oracle.md) requires:

A reproducible, pinned OpenSTA reference so CI diffs are comparable run-to-run.
Access to OpenSTA's test inputs for stress testing our OpenSTA-driven converters.
Separately, a primary regression corpus representative of Jacquard's actual use cases.

Two questions were considered jointly: (a) how we pin / reference the OpenSTA codebase, and (b) how we use their test data.

On vendoring source: OpenSTA is licensed GPL-3.0. Copying its source into Jacquard's repository as committed code creates licensing ambiguity for a permissive-licensed project. Git submodules are conventionally treated differently — the parent repository pins a commit reference, does not incorporate the submodule's source into its own commits, and inherits no license obligations from the submodule's presence. This convention is widely relied on in permissive projects that depend on GPL tooling at arm's length.

On test data: OpenSTA's corpus exercises OpenSTA's concerns — Liberty parsing edge cases, SI-aware analysis, timing-check variants specific to its engine. Much of it does not exercise anything Jacquard does, and some of it exercises features Jacquard deliberately does not support. Using it as the primary regression corpus would optimise for the wrong target: our converters would be validated against files OpenSTA cares about, not files Jacquard actually encounters.

Its real value to Jacquard is as a stress / robustness corpus: a large bank of real-world-ish timing files that exercise parser edge cases and dialect variants. A converter that survives their entire corpus is more robust than one validated against a hand-curated subset.

Decision

Vendoring

OpenSTA is vendored as a git submodule at vendor/opensta/.
The submodule is not built from Jacquard's build. Jacquard's subprocess invocations use whatever OpenSTA binary is installed in the developer or CI environment.
The submodule exists for two purposes only: (a) pinning a specific OpenSTA version for CI reproducibility, (b) providing in-tree access to its test corpus without redistribution.
Licensing: by git-submodule convention, the submodule's GPL-3.0 licence does not extend to the parent repository. This is the standard interpretation; contributors redistributing binaries or compiled artefacts should nonetheless verify the interpretation applies to their specific jurisdiction and use.

Test corpus split

Two corpora, two distinct roles:

Primary regression corpus at tests/timing_ir/corpus/.
- Jacquard-specific designs: SKY130 MCU SoC, NVDLA, AIGPDK examples, representative SDFs from the real Jacquard flow.
- Small, curated, committed directly.
- Run on every CI execution.
- Exit criterion: every file converts cleanly and matches golden IR within declared tolerance.
Stress / robustness corpus at tests/timing_ir/stress/ as a manifest file listing paths into vendor/opensta/<test-tree-subdir>/.
- Not committed as duplicated data; the manifest references submodule paths.
- Large, whatever upstream maintains.
- Run nightly or pre-release, not per-PR.
- Exit criterion: no crashes, no hangs, no malformed IR. Numerical agreement with OpenSTA not required — this corpus is for robustness, not correctness.
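One way to picture the stress manifest is as a list of relative paths that are resolved under the submodule at run time, so no test data is duplicated. The sketch below assumes that entry format; the function name and the escape check are illustrative, not the shipped runner.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Submodule location taken from the Vendoring decision above.
SUBMODULE_ROOT = PurePosixPath("vendor/opensta")


def resolve_stress_entries(entries: Iterable[str]) -> List[PurePosixPath]:
    """Map manifest entries to paths under the submodule, rejecting escapes.

    Assumes each entry is a path relative to the submodule root (hypothetical format).
    """
    resolved = []
    for entry in entries:
        rel = PurePosixPath(entry)
        if rel.is_absolute() or ".." in rel.parts:
            raise ValueError(f"manifest entry escapes the submodule: {entry}")
        resolved.append(SUBMODULE_ROOT / rel)
    return resolved
```

An empty manifest (entries = []) resolves to an empty list, which matches the as-shipped state noted in the amendment.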

Copying from stress corpus into primary corpus

If a stress-corpus file exposes a bug, a minimal reproducer may be distilled and added to the primary corpus. When doing so:

Verify the specific file's licence before copying. OpenSTA's overall GPL-3.0 licence does not imply every test input is GPL-3.0 — some test inputs are vendor-derived or public-domain.
Prefer distilling a synthetic minimal reproducer over copying the original file wholesale.

Consequences

CI reproducibility: pinned submodule means we control when OpenSTA version changes land. Bumping the pin is an explicit, reviewable step.
Repository size grows by OpenSTA's submodule size (multi-megabyte) but not by test-data duplication.
Maintenance cadence: periodic submodule pin updates are a known maintenance item. Not frequent, but not zero.
Primary regression corpus stays lean and directly relevant; developers can reproduce corpus-level failures locally without pulling the entire submodule.
Stress-corpus failures are treated as bugs against our converter, never as bugs against OpenSTA's test inputs.
Licensing posture is conventionally defensible; if stronger legal assurance is ever required, the submodule can be replaced by the external-install-only option (drop the submodule, rely purely on whatever OpenSTA is installed) at the cost of losing in-tree test access.

ADR 0006 — SDF preprocessing model and interim-to-release cutover

Status: Accepted 2026-04; amended 2026-05-02 (see § Amendment).

Amendment (2026-05-02)

The original Decision treated subprocess invocation of OpenSTA from the shipped Jacquard runtime as license-incompatible, requiring Phase 3 (native Rust SDF→IR converter) to land before first release. On review of GPL-3 § 5 ("aggregate") and the FSF interpretation of subprocess/IPC boundaries, this restriction is more conservative than necessary. The relevant facts:

The interface is arm's-length: standard EDA interchange formats (Liberty / Verilog / SDF / SPEF / SDC) in, our own IR JSON (ADR 0002) out. No shared data structures, no headers, no linking.
We do not bundle OpenSTA in any Jacquard distribution. The user installs OpenSTA themselves; user-side combination of separately-distributed programs is not "distribution of a combined work" under GPL-3.
The original "no runtime subprocess" rule was effectively a commercial-perception buffer, not a strict licensing requirement.

Revised bright lines (these supersede the original "Shipped release" sub-section):

No linking of GPL code into the Jacquard binary. Unchanged.
No bundling of OpenSTA (or any GPL tool) in Jacquard distribution artefacts (release tarballs, Homebrew formulae, Docker images that ship as Jacquard releases). If a packager wants to bundle, they take on GPL distribution obligations themselves.
Subprocess invocation of user-installed OpenSTA from the shipped runtime is permitted. jacquard sim input.sdf may keep its opensta-to-ir subprocess hook in shipped releases, provided OpenSTA is discovered on PATH rather than bundled.

Phase 3 reclassification. Native Rust SDF→IR converter is no longer release-gating. It remains a goal — for ergonomics (no OpenSTA install required) and for downstream commercial integrators whose legal teams treat any GPL touchpoint as risk — but ships when bandwidth allows, not as a release blocker. Roadmap consequences are tracked in ../plans/post-phase-0-roadmap.md § Phase 3.

Corequisite: OpenSTA detection and version check (release-blocking). Relaxing the no-runtime-subprocess rule is conditional on the shipped runtime giving users a meaningful error when OpenSTA is missing or out-of-date. Today (src/sim/setup.rs:248-264), a missing OpenSTA only emits a warn! and the simulation proceeds with no timing data loaded. That is acceptable during development but would ship as a UX bug. Concretely, before first release we must:

Hard-fail (not warn) when --sdf is requested and OpenSTA cannot be located.
Probe OpenSTA's version on first invocation and fail with a remediation message if it is older than the version pinned in vendor/opensta/ (per ADR 0005).
Warn-but-proceed if the detected version is newer than the latest tested version, naming the version in the warning.
Document the OpenSTA dependency in docs/synthesis-flow.md.

Tracked as WS-RH.1 in ../plans/post-phase-0-roadmap.md § Release hardening.
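The detection and version rules above reduce to a small decision function. The sketch below is an illustration only: it assumes OpenSTA reports a dotted numeric version string, and the function names, messages, and version numbers used in examples are hypothetical.

```python
from typing import Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Parse a dotted numeric version such as '2.6.0' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def opensta_policy(detected: Optional[str], pinned: str, latest_tested: str) -> Tuple[str, str]:
    """Return (action, message), where action is 'ok', 'warn', or 'fail'.

    detected is None when no OpenSTA binary was found on PATH.
    """
    if detected is None:
        return "fail", "OpenSTA not found on PATH; install it or pass pre-converted timing IR"
    found = parse_version(detected)
    if found < parse_version(pinned):
        return "fail", f"OpenSTA {detected} is older than the pinned version {pinned}; please upgrade"
    if found > parse_version(latest_tested):
        # Newer than anything tested: proceed, but name the version in the warning.
        return "warn", f"OpenSTA {detected} is newer than the latest tested version {latest_tested}"
    return "ok", f"OpenSTA {detected}"
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating 2.10.0 as older than 2.9.3.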

Code-comment cleanup follow-up. The INTERIM per ADR 0006 / Pre-release only tags in src/sim/setup.rs (lines ~176, ~228, ~286) and src/bin/jacquard.rs (~187) describe a premise that no longer applies. Folded into WS-RH.1 (../plans/post-phase-0-roadmap.md § Release hardening) rather than spun out as a separate cleanup commit.

The original Context, Decision (Phase 0 + Phase 3), and Walk-back sections below are retained for historical record. Where they conflict with the bright lines above, the bright lines win.

Context

Jacquard's hand-rolled SDF parser (src/sdf_parser.rs) has accumulated reactive maintenance over time — empty () delays, (COND …) pin specs, escape handling, edge-qualified timing checks, TIMINGCHECK-stripping workarounds for OpenLane2 output. Each production failure has been a one-off patch. The timing-correctness review flagged this as issue #3, and a native Rust grammar-based replacement is the Phase 3 deliverable.

Concurrently, ADR 0001 establishes OpenSTA as the timing correctness oracle (subprocess, never linked, GPL), and ADR 0002 introduces a timing IR that decouples parsing from consumption.

Two facts together shape the decision:

No release pressure. Release can happen after Phase 3 lands. We are not forced to keep the hand-rolled parser alive while waiting on Phase 3.
Permissive-license constraint applies to the shipped binary. Subprocess invocation of GPL tooling is acceptable — does not trigger reciprocal obligations — and during pre-release development, even in-runtime subprocess invocation does not violate the constraint because no runtime binary is being distributed.

Given these, maintaining the hand-rolled parser through Phase 0–2 is unnecessary. OpenSTA's mature dialect coverage can substitute, via subprocess, while we build toward a native Rust replacement at our own pace.

Decision

Phase 0

Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. All paths that previously consumed SDF now consume timing IR.
Ship opensta-to-ir as a standalone preprocessing tool that consumes Liberty + Verilog + SDF + SPEF + SDC and emits timing IR. Subprocess-based on OpenSTA. Production-quality: stable CLI, documented exit codes, clear diagnostics.
Canonical runtime path is jacquard sim --timing-ir <path>, consuming pre-converted IR. This path works without OpenSTA on the user's machine — pre-converted IR is sufficient.
Interim ergonomic path: during development (pre-release only), jacquard sim input.sdf subprocesses opensta-to-ir internally to produce IR on the fly. This is a contributor convenience, not a shipping feature. Flag exists in code as pre-release only with a clear comment tying back to this ADR.

Phase 3

Native Rust SDF→IR converter replaces the OpenSTA subprocess call inside jacquard sim input.sdf. Grammar-based (nom / pest), validated against OpenSTA on the corpus per ADR 0001.
Lands before first release.

Shipped release

No OpenSTA invocation from the jacquard runtime binary. The native Rust converter handles SDF inputs directly.
opensta-to-ir remains as an alternative preprocessing tool. Users who want OpenSTA-computed timing may use it; subprocess model preserves permissive licensing.

Walk-back options (if assumptions change)

If OpenSTA dialect coverage proves insufficient during Phase 0 — e.g., a current Jacquard-supported SDF fails to parse — add dialect shims to opensta-to-ir's post-processing. Reinstating the hand-rolled parser is the last resort, not the first.
If the Phase 3 Rust rewrite stalls — ship the first release with preprocessing-only (no jacquard sim input.sdf path), remove the interim subprocess, and land the native converter in a later release. No information lost; users preprocess manually. This is already the post-release shape for opensta-to-ir; it's only the jacquard sim input.sdf convenience that would be deferred.
If OpenSTA becomes unmaintainable or disappears — the submodule pin (ADR 0005) remains authoritative for the integrated version. A forked submodule can maintain any necessary patches.

Consequences

Jacquard's repository stops carrying a hand-rolled SDF parser as a reactive-maintenance target. Bugs in SDF interpretation between Phase 0 and Phase 3 are OpenSTA's problem (upstream) or opensta-to-ir post-processing's problem, not Jacquard's core codebase's problem.
Pre-release ergonomic one-step workflow for contributors is preserved.
Contributors running Jacquard on a new design (no pre-converted IR) must have OpenSTA installed during Phase 0 through Phase 3. For existing primary-corpus designs, pre-converted IR is checked in; no OpenSTA needed.
Release-time check is unambiguous: either the runtime subprocess is replaced by native code, or it is removed entirely. Both outcomes satisfy the permissive-licensing constraint for the shipped binary.
Test corpus regenerable: if OpenSTA updates change IR output, golden files are regenerated deliberately (reviewable diff), not silently.

ADR 0007 — Timing model fidelity roadmap

Status: Proposed. (Line references amended 2026-06-25.)

Amendment (2026-06-25): The roadmap is still Proposed/unbuilt, but several src/ line references below have drifted as the code moved — most notably the wire-delay lumping code, cited as flatten.rs:1850-1872, now lives around flatten.rs:2030-2051 (the old location is unrelated code). Treat the file:line citations as approximate; verify against current flatten.rs / aig.rs before relying on them.

Context

Jacquard's timing model today consumes SDF-equivalent annotations via the timing IR (ADR 0002), produced and validated by OpenSTA called out of process (ADR 0001 — sole STA path; ADR 0003's in-process OpenTimer alternative was Superseded by the spike). The accuracy contract at present is "±5% on arrival times against CVC reference" per timing-validation.md. This is acceptable for sky130-class designs at ≥10 ns clock periods.

Three structural simplifications in the current implementation become accuracy bottlenecks at scale:

Static δ∞ per gate. No pulse-degradation modelling. Glitch behaviour and short-pulse propagation cannot be represented. The Involution Delay Model (Maier 2021, arXiv:2107.06814) demonstrates this is the root cause of inertial-delay's known failure modes, and provides a model that's both faithful and implementable.
Zero clock-tree skew. During AIG construction (src/aig.rs:495-560), clock buffers/inverters/gating cells collapse to a single polarity flag on the DFF. SDF arcs and interconnect on the clock tree are silently dropped. Every DFF on a clock domain is treated as capturing simultaneously.
Per-cell-max wire delay. src/flatten.rs:1850-1872 lumps all interconnect arrivals at a destination cell into a single max value, with no rise/fall distinction. This is adequate for short local routes but incorrect for long routes where wire delay rivals or exceeds gate delay (typical of NoCs at 22 nm and more advanced nodes).

The full design analysis is in docs/timing-model-extensions.md. This ADR captures the decision to commit to closing these three gaps as a roadmap, sets the staged ordering, and constrains how the implementation may evolve.

Decision

Adopt a three-pillar roadmap for closing the fidelity gap with CVC, while preserving Jacquard's GPU-throughput advantage. All three pillars are consumer-side work (src/flatten.rs, src/aig.rs, src/sim/cosim_metal.rs, the kernel arrival math); none requires schema changes inconsistent with ADR 0002, and none requires abandoning the cycle-accurate boomerang kernel architecture.

Pillar A — Dynamic delay (δ(T))

Per-gate dynamic delay parameterised on T (time since last output transition). Three accuracy stages (Stage 1 to Stage 3, as referenced under Sequencing constraint):

Static IDM. Bake worst-case δ(T) into existing per-thread script slot using STA pulse-width estimates. No kernel change.
Dynamic δ(T). Add last_transition_ps and last_value persistent buffers per AIG pin; kernel evaluates δ(T) from a small per-cell LUT during arrival propagation.
Sub-cycle ticks. Multiple arrival propagations per logical cycle, enabling true glitch suppression. Out of scope for this ADR: it would require a different kernel architecture and, if pursued, its own ADR superseding this one.
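To make the "small per-cell LUT" of the Dynamic δ(T) stage concrete, here is a minimal sketch of a table lookup. The table shape (sorted T breakpoints with a delay per breakpoint), the picosecond units, and piecewise-linear interpolation are all assumptions for illustration; the ADR does not fix any of them.

```python
from bisect import bisect_right
from typing import Sequence


def delay_from_lut(t_ps: float, t_points: Sequence[float], d_points: Sequence[float]) -> float:
    """Piecewise-linear delta(T) lookup, clamped at both ends of the table.

    t_points must be strictly increasing; d_points holds the delay at each breakpoint.
    """
    if t_ps <= t_points[0]:
        return d_points[0]
    if t_ps >= t_points[-1]:
        return d_points[-1]  # large T: delay saturates at the last table entry
    hi = bisect_right(t_points, t_ps)
    lo = hi - 1
    frac = (t_ps - t_points[lo]) / (t_points[hi] - t_points[lo])
    return d_points[lo] + frac * (d_points[hi] - d_points[lo])
```

Under this sketch, Stage 1 (Static IDM) corresponds to evaluating the table once offline at an STA-estimated T and baking the result into the script slot, while Stage 2 evaluates it per transition using last_transition_ps.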

Pillar B — Clock-tree skew

Per-DFF clock arrival accounting via TimingIR extension (ClockArrival table) populated by OpenSTA via opensta-to-ir (ADR 0001 — ADR 0003's OpenTimer alternative is Superseded). Per-pair CRPR is intentionally not modelled at this stage; per-DFF capture-side arrival is, treating launch as the 0-reference. Consumed by extending DFFConstraint with a clock_arrival_ps: i16 field, folded into the existing per-word setup/hold check in src/flatten.rs via DFFConstraint::effective_setup_hold. No kernel change for the baseline case; bucketed packing is an option if pessimism becomes material. Stages 1+2 landed: commits c403cc8 (producer) and 6767c3e (consumer).
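As a rough illustration of folding capture-side clock arrival into the setup/hold check, the sketch below uses the textbook convention: a later capture clock relaxes setup and tightens hold by the same amount. That sign convention is an assumption here, not a statement of what DFFConstraint::effective_setup_hold computes; the Python class is a stand-in for the Rust type.

```python
from typing import NamedTuple, Tuple


class DffConstraint(NamedTuple):
    setup_ps: int
    hold_ps: int
    clock_arrival_ps: int  # capture-side clock arrival, launch taken as the 0-reference


def effective_setup_hold(c: DffConstraint) -> Tuple[int, int]:
    """Fold clock arrival into the constraint (assumed textbook convention)."""
    return c.setup_ps - c.clock_arrival_ps, c.hold_ps + c.clock_arrival_ps


def slacks(c: DffConstraint, data_arrival_ps: int, period_ps: int) -> Tuple[int, int]:
    """Return (setup_slack, hold_slack); a negative value is a violation."""
    setup_eff, hold_eff = effective_setup_hold(c)
    return period_ps - setup_eff - data_arrival_ps, data_arrival_ps - hold_eff
```

With clock_arrival_ps set to zero this reduces to the skew-free check, which is why the baseline case needs no kernel change.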

Pillar C — Wire delay at scale

Three fidelity tiers:

Tier 1: Per-receiver consumption. Key wire delay by (src_aigpin, dst_aigpin) edge in the AIG, with rise/fall distinction preserved. Mostly a src/flatten.rs:1850-1872 rewrite. No kernel change.
Tier 2: Inter-partition arc delay. Explicit modelling of wire delay on partition-crossing signals. Touches src/sim/cosim_metal.rs shuffle pipeline. Required for many-core/NoC designs at advanced processes.
Tier 3: NoC-aware partitioning hints. Soft bias in src/repcut.rs favouring cuts on flagged net patterns. Optional optimisation that makes Tier 2 cheap on tile-decomposed designs.
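The difference between the current per-cell-max lumping and Tier 1 per-receiver consumption can be shown with a toy example. The data layout below (string pin names, picosecond rise/fall pairs keyed by edge) is hypothetical and much simpler than the AIG structures in src/flatten.rs.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]      # (source pin, destination cell)
RiseFall = Tuple[int, int]  # (rise_ps, fall_ps)


def lump_per_cell_max(wires: Dict[Edge, RiseFall]) -> Dict[str, int]:
    """Lumping as described in Context: one max per destination, no rise/fall split."""
    lumped: Dict[str, int] = {}
    for (_src, dst), (rise, fall) in wires.items():
        lumped[dst] = max(lumped.get(dst, 0), rise, fall)
    return lumped


def pessimism_per_edge(wires: Dict[Edge, RiseFall]) -> Dict[Edge, RiseFall]:
    """Extra delay each edge is charged under lumping, relative to its own rise/fall."""
    lumped = lump_per_cell_max(wires)
    return {
        edge: (lumped[edge[1]] - rise, lumped[edge[1]] - fall)
        for edge, (rise, fall) in wires.items()
    }
```

When one long route and one short route feed the same cell, the short route inherits the long route's delay under lumping; keying by edge removes that pessimism.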

Sequencing constraint

Pillar B Stage 1+2 is the cheapest accuracy improvement. Originally gated on the (now Superseded) OpenTimer integration; landed early on top of the OpenSTA-out-of-process path instead. See commits c403cc8/6767c3e.
Pillar C Tier 1 is independent of which STA tool feeds the IR and can proceed any time.
Pillar A Stage 1 (Static IDM) is the cheapest δ(T) entry point, gated on per-cell SPICE characterisation effort. Schedule this only after Pillars B and C land — δ(T) compounds on top of correct wire/skew baseline; doing it earlier risks chasing characterisation noise that's actually wire-delay error.
Pillar C Tier 2 lands when a real many-core/NoC use case appears in the test corpus and Tier 1 measurement shows it's needed.
Pillar A Stage 2 (Dynamic δ(T)) is a substantial implementation; schedule only when Stage 1 reports indicate the value is real, and a contributor with the analog-characterisation domain expertise is willing to lead it.
Pillar A Stage 3 (Sub-cycle ticks) is explicitly out of scope of this ADR.

Validation contract

Each pillar lands with regression coverage extending timing-validation.md's ±5% tolerance. Tighter tolerances may apply per pillar (Pillar B should achieve ≤±2% on skew-aware paths with OpenSTA-fed per-DFF arrival as currently implemented; Pillar C Tier 1 should achieve ≤±3% on long-wire paths).
Each pillar must demonstrate no regression on the existing primary corpus before merge.
The IR schema may be extended (additive only) per ADR 0002 to carry pillar-specific data. Extensions require a minor schema bump and a documented consumer-version compatibility note.
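The per-pillar tolerances quoted above amount to a relative-error check against the reference arrival. A minimal sketch follows; the category keys and function names are made up for illustration, while the percentages are those stated in this contract.

```python
# Tolerances from the validation contract; the dictionary keys are illustrative.
TOLERANCES = {
    "baseline": 0.05,                  # timing-validation.md default
    "pillar_b_skew": 0.02,             # skew-aware paths
    "pillar_c_tier1_long_wire": 0.03,  # long-wire paths
}


def relative_error(observed_ps: float, reference_ps: float) -> float:
    """Absolute relative error of an observed arrival against the reference."""
    if reference_ps == 0:
        raise ValueError("reference arrival must be non-zero")
    return abs(observed_ps - reference_ps) / abs(reference_ps)


def within_tolerance(observed_ps: float, reference_ps: float, category: str = "baseline") -> bool:
    return relative_error(observed_ps, reference_ps) <= TOLERANCES[category]
```

For example, an observed arrival of 1040 ps against a 1000 ps reference is a 4% error: it passes the baseline tolerance but fails the tighter Pillar B target.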

Consequences

The "±5%" line in timing-validation.md becomes a per-pillar specification rather than a single number. The doc is updated as each pillar lands.
crates/timing-ir/schemas/timing_ir.fbs accumulates additive extensions for clock arrival and per-cell δ(T) parameters. Schema versioning per ADR 0002 governs.
No changes to the cycle-accurate boomerang kernel architecture. The cost of preserving that architecture is permanent: no glitch propagation, no metastability oscillation, no asynchronous handling. These remain non-goals (per project-scope.md) unless a future ADR explicitly supersedes this position.
Per-cell SPICE characterisation effort is acknowledged as the long-pole risk for Pillar A. If characterisation cost proves prohibitive, Pillar A reduces to "Stage 1 only, using Liberty-derived ECSM/CCSM data as approximation," and the gap with CVC's full IDM fidelity remains open. This is acceptable; Pillar A Stage 2 is not a release-gating commitment.
Jacquard's positioning (why-jacquard.md) becomes coherent: STA-complement-not-replacement, vector-driven timing at GPU scale, fidelity comparable to CVC where the cycle-accurate kernel architecture allows.

Walk-back options

If a pillar's measurement shows the accuracy gain is smaller than expected, descope it. Each pillar's first stage is sized to deliver measurable improvement; if it doesn't, later stages of that pillar are deferred or abandoned.
If the IR schema extensions cause downstream tooling friction, fall back to vendor-extension passthrough (VendorExtension in timing_ir.fbs) until the typed schema stabilises. Already supported.
OpenTimer integration was retired (ADR 0003 Superseded by the spike outcome). Pillar B did not need the documented fallback to manual clock-tree accumulation in src/aig.rs — OpenSTA's per-pin arrival via opensta-to-ir covers the same ground without the per-pair CRPR credit (deferred to Stage 3 if measurement justifies it).

ADR 0008 — Structured timing output as first-class deliverable

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): Two refinements to claims below. (1) The structured outputs (--timing-report <json>, --timing-summary, symbolic violation messages) are no longer Metal-only — CUDA and HIP sim now route setup/hold violations through process_events (commit 24723b5; jacquard.rs sim_cuda/sim_hip). The remaining gap is the cosim path, which does not yet emit --timing-report. (2) Any reference to a --worst-slack-n flag is aspirational — it is not implemented; the report's worst_slack arrays use a fixed top-N. Original decision unchanged below.

Context

Jacquard produces timing information today through three channels: timed VCD (--timed), per-violation clilog::warn! messages on stderr, and an in-process SimStats counter. The why-jacquard.md analysis identifies a gap between the timing data Jacquard has internally and the answers users actually need from a flow:

User question	Today
Did my workload trip any violations?	SimStats counts (in-process API only)
Which DFFs nearly missed timing?	Not extractable without parsing stderr
Show me arrival distribution per signal	Reconstructable from --timed via post-processing only
Which DFF was that violation on?	State-word index + manual lookup
What path caused the worst arrival?	Not available
Run this in CI and fail if any violation	Possible only via stderr grep

The most acute problem: stderr violation messages identify a state-word index, not a signal name. Mapping back to "which DFF, which path" requires manual investigation. On a violating design the message volume can be enormous (one warning per word per cycle per type). The data needed to do better — hierarchical signal names, DFF instance paths, per-DFF arrival distributions — already exists in the netlistdb and event buffer; it is simply not surfaced in usable form.

This ADR is about making Jacquard's timing output useful in a real flow rather than merely produced. The substantive work in ADR 0007 (model fidelity) is wasted if no one can extract the answers.

The full design analysis is in docs/why-jacquard.md, "Output interface" section.

Decision

Treat structured, machine-readable timing output as a first-class shipping deliverable, not an optional improvement. Land the work in priority order, where priority is set by user impact per implementation cost, not by technical interest.

Required outputs

The following are required for Jacquard to be considered usable for vector-driven timing analysis in a real flow. They land before any further fidelity work past ADR 0007 Pillar B Stage 1+2.

Symbolic violation messages. Replace state-word indices with hierarchical signal names in stderr violation output. Mapping data already exists in netlistdb. Cost: contained edit in src/event_buffer.rs:305-338 plus name-resolution helper. Highest UX impact per LoC of any improvement on this list.
--timing-report <path.json>. Structured JSON document at end-of-run containing:
- Per-DFF worst arrival, worst slack, violation count over the run.
- Per-cycle violation list (cycle, signal name, hierarchical path, arrival, constraint, slack).
- Aggregate stats: total violations, distribution buckets, peak arrival per clock domain.
- Per-signal activity summary: transition count, average/max arrival, idle cycles.
- Run metadata: clock period, SDF/IR file, design hash, vector source.
Required for CI integration and any downstream tooling. Schema versioned; additive extension policy mirrors crates/timing-ir.
--timing-summary. Fast text summary, no VCD. Designed for scripts and human inspection of long runs. Contents:
- Vectors run, clock period, corner.
- Setup/hold violation totals.
- Worst-slack DFF (setup and hold) with hierarchical path.
- Peak arrival per writeout vs clock budget, with margin percentage.
Cost: trivial wrapper over (2)'s data.
Per-DFF worst-slack ranking. Top-N DFFs by closest-to-violation slack across the entire run, even when no violation occurred. Surfaces "where am I close to the edge" without requiring a violation to actually trip. Output as part of (2) and (3); also accessible via a dedicated --worst-slack-n N flag for quick inspection.
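The worst-slack ranking and its place in the JSON report can be sketched as below. Apart from worst_slack, which the amendment names, every field name here is hypothetical, as is the choice of picosecond integers; the real schema is defined by the implementation.

```python
import heapq
import json
from typing import Dict, List


def worst_slack_ranking(worst_slack_ps: Dict[str, int], n: int) -> List[dict]:
    """Top-N DFFs by smallest worst-case slack over the run (negative means violated)."""
    ranked = heapq.nsmallest(n, worst_slack_ps.items(), key=lambda kv: (kv[1], kv[0]))
    return [{"dff": path, "worst_slack_ps": slack} for path, slack in ranked]


def build_report(worst_slack_ps: Dict[str, int], clock_period_ps: int, n: int = 3) -> str:
    """Serialise a tiny report fragment; a real report carries far more detail."""
    doc = {
        "schema_version": 1,  # hypothetical versioning field
        "clock_period_ps": clock_period_ps,
        "worst_slack": worst_slack_ranking(worst_slack_ps, n),
    }
    return json.dumps(doc, indent=2, sort_keys=True)
```

Because the ranking is by slack rather than by violation, DFFs that came close to failing appear in the output even on a clean run.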

Optional / later outputs

The following are high-value but lower-priority. They land after the four required items above, in any order driven by user demand.

--arrival-histogram <pattern>. Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis.
--sta-cross-reference <opensta-paths.txt>. Cross-reference OpenSTA's critical-path report against observed worst arrivals. Closes the loop between vector-driven and static analysis. Coverage-style "of the top-N STA paths, which were exercised, and at what observed arrival." (Originally framed against OpenTimer; ADR 0003 was Superseded — OpenSTA is the only STA tool Jacquard interoperates with now.)
Path-back-trace from worst-arrival DFF. Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting the path with per-edge contribution. Most expensive item on this list; only useful once the cheaper items are in place.
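The path-back-trace item is essentially a walk along the arg-max fan-in at each node. The sketch below shows the idea on a toy acyclic graph; the dictionary-based fan-in representation and function names are illustrative, not Jacquard's AIG structures, and the recursion assumes a shallow graph.

```python
from typing import Dict, List, Tuple

# node -> list of (driver, edge_delay_ps); nodes without an entry are primary inputs
Fanin = Dict[str, List[Tuple[str, int]]]


def arrival(node: str, fanin: Fanin, memo: Dict[str, int]) -> int:
    """Max-of-fanin arrival time; primary inputs arrive at 0."""
    if node not in memo:
        memo[node] = max((arrival(d, fanin, memo) + w for d, w in fanin.get(node, [])), default=0)
    return memo[node]


def back_trace(endpoint: str, fanin: Fanin) -> List[Tuple[str, str, int]]:
    """Return the worst path as (driver, sink, edge_delay_ps) hops, source first."""
    memo: Dict[str, int] = {}
    path = []
    node = endpoint
    while fanin.get(node):
        driver, delay = max(fanin[node], key=lambda dw: arrival(dw[0], fanin, memo) + dw[1])
        path.append((driver, node, delay))
        node = driver
    path.reverse()
    return path
```

The per-edge delays along the returned path sum to the endpoint's worst arrival, which gives the "per-edge contribution" breakdown described above.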

Backward compatibility

All new outputs are opt-in via flags. Existing stderr behaviour and --timed semantics are unchanged.
Symbolic violation messages (item 1) do change existing stderr format. This is intentional: the current state-word-index format is not a stable contract and is not consumed by any known automation. Format change documented in changelog at land time.

Output stability contract

The --timing-report JSON is a stable consumer-facing format. Schema versioned. Additive-only extensions per the IR convention; breaking changes require a major version bump and a transition period.
--timing-summary is human-readable and explicitly not stable for parsing. Tools should consume the JSON.
Stderr violation messages remain human-oriented; tools should not parse them.
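A CI consumer that follows this contract reads the JSON, checks the schema's major version, and ignores fields it does not know. The sketch below assumes hypothetical schema_version and total_violations fields and an arbitrary exit-code convention; none of these are confirmed by the shipped report format.

```python
import json

SUPPORTED_MAJOR = 1  # hypothetical: the major schema version this consumer understands


def ci_exit_code(report_json: str) -> int:
    """Return 0 for a clean run, 1 if violations were reported, 2 on an unknown schema major."""
    report = json.loads(report_json)
    major = int(str(report["schema_version"]).split(".")[0])
    if major != SUPPORTED_MAJOR:
        return 2  # breaking change: refuse to interpret rather than guess
    # Unknown extra fields are ignored, so additive extensions do not break this check.
    return 1 if report["total_violations"] > 0 else 0
```

A CI job would pass the contents of the file written by --timing-report to this function and exit with the result, replacing any stderr grep.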

Consequences

Jacquard becomes usable in CI without bespoke stderr parsing. Existing users who scrape stderr will need to migrate to the JSON report; the migration window is the release in which symbolic messages land.
The SimStats in-process API gains a public counterpart: end-of-run JSON. This raises the bar for changes to either — they must agree.
Documentation gains a "Jacquard timing report format" reference page. Sample reports from the corpus designs are checked in to tests/timing_ir/corpus/ alongside golden IR.
The why-jacquard.md positioning becomes truthful: the user-facing claim "vector-driven setup/hold answers at GPU scale" is backed by an interface that delivers them.

Walk-back options

If the JSON schema causes consumer-tooling friction, the format may be extended additively but not narrowed. Existing consumers must continue to work. If a fundamental rethink is required, ship a v2 alongside v1 with a deprecation window.
If symbolic name resolution is too slow at scale (millions of DFFs, very long runs), the resolution step becomes opt-in via flag, with the existing state-word-index format retained as a fast-path default. No evidence yet that this is a problem; treated as a deferred consequence.
If users specifically want the path-back-trace (item 7) before the cheaper items are scheduled, it can be promoted, but only once items 1–4 are in place. Path-back-trace without symbolic names is unusable.

Priority and effort estimate

Item	Effort	Blocks	User impact
1. Symbolic violations	1–2 days	Nothing	High (turns stderr from noise to signal)
2. JSON report	3–5 days	CI integration	High
3. Text summary	1 day (after #2)	Human dashboards	Medium
4. Worst-slack ranking	1–2 days (folds into #2)	"Am I close?"	High
5. Arrival histogram	3–5 days	Power analysis	Medium
6. STA cross-ref	1 week	Vector coverage report	Medium
7. Path-back-trace	2–3 weeks	Forensics	Lower-frequency-but-high-value

Items 1–4 are a single workstream, ~2 weeks total. They constitute the "Jacquard is now usable" bar. Items 5–7 are scheduled per user demand after that.

ADR 0009 — OpenSTA Verilog reader input constraints

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The claim that the filter has "integration test coverage in tests/opensta_integration.rs" is imprecise — the module-filtering tests are unit tests in src/verilog_filter.rs (tests/opensta_integration.rs covers the OpenSTA end-to-end run, not the filter). Decision unchanged.

Context

OpenSTA's read_verilog Tcl command is structural-only: it accepts cell instantiations and bare-net assign statements but rejects RTL operators (~, &, |, ^), bit-selects in assigns, and ranged concatenations. Violations surface as Error: <file> line <N>, syntax error and exit 1. This is a long-standing OpenSTA limitation, not behaviour that a flag can change.
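A quick pre-flight check for the assign restriction described above might look like the sketch below. It is a regex approximation written for illustration, not the logic in src/verilog_filter.rs: it does not strip comments or understand the full Verilog grammar, and it simply flags any assign whose two sides are not both bare net names.

```python
import re
from typing import List

# One assign statement: lazily capture left and right sides up to the semicolon.
_ASSIGN = re.compile(r"\bassign\s+(.+?)\s*=\s*(.+?)\s*;", re.DOTALL)
# A bare net: a simple identifier, or a Verilog escaped identifier (backslash + non-space run).
_BARE_NET = re.compile(r"^(?:[A-Za-z_][A-Za-z0-9_$]*|\\\S+)$")


def non_structural_assigns(verilog: str) -> List[str]:
    """Return assign statements that use anything other than bare nets on either side."""
    offenders = []
    for match in _ASSIGN.finditer(verilog):
        lhs, rhs = match.group(1), match.group(2)
        if not (_BARE_NET.match(lhs) and _BARE_NET.match(rhs)):
            offenders.append(match.group(0))
    return offenders
```

Run against a wrapped netlist, this would flag statements of the assign gpio_oeb[0] = ~( ... ); form before OpenSTA reports them as syntax errors.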

Two patterns make this surprising in practice — both have already caught us once:

Final-stage outputs from the LibreLane/OpenROAD flow are sometimes wrapped. LibreLane itself only ever reads structural netlists (<design>.pnl.v — verified locally on chip_top.pnl.v: zero RTL operators, single module). The wrapping is added by downstream integration tooling — for the SkyWater openframe flow, chipflow's harness wraps the LibreLane output in openframe_project_wrapper to patch active-low OEB pins into the pad ring, producing the assign gpio_oeb[0] = ~( ... ); pattern. The combined file (tests/mcu_soc/data/6_final.v) contains both the readable-by-OpenSTA structural top module and the wrapper's unreadable RTL. The SDF was generated against the inner top, not the wrapper — matching what LibreLane's own STA saw.
Post-synthesis Verilog has the right form but the wrong cells. Pre-P&R synthesis output (e.g. top_synth.v) is fully structural and uses the same module name top as the post-P&R body, so it looks like an acceptable substitute. It is not: the SDF references hundreds of thousands of P&R-inserted cells (clkbuf_regs_* CTS buffers, ANTENNA_* diodes, delaybuf_*, fillers) that simply do not exist in synthesis output. OpenSTA quietly drops SDF entries whose endpoints are not in the loaded design; the resulting IR back-annotates only the surviving subset. Concrete numbers from the MCU SoC fixture: top_synth.v has 31,500 cells; module top inside 6_final.v has 266,746. Feeding top_synth.v would silently drop ~88% of the design's structure.

Past convention (docs/plans/ws3-cosim-sdf-followup.md, pre 2026-05-18) recommended substituting top_synth.v to dodge the wrapper-parse error. The contemporaneous verification log (28162 matched, 2090 unmatched) reported the jtir-to-cosim-netlist match rate, not SDF coverage against the jtir: the number looked like "working" on the surface while the IR was missing most of the design's real timing. That recommendation is retracted in the same change that lands this ADR.

Decision

The "structural-only" constraint is owned by opensta-to-ir, not by the caller. Specifically:

opensta-to-ir filters Verilog inputs at invocation time. For each --verilog file, it extracts the module <--top> … endmodule block before handing files to OpenSTA. Files that do not contain module <--top> (sub-module-only files in hierarchical designs) are passed through unchanged. The wrapper modules that LibreLane + wafer.space integration adds — and any future analogues — are simply not seen by OpenSTA. Implementation in crates/opensta-to-ir/src/verilog_filter.rs; integration test coverage in tests/opensta_integration.rs.
The cell-set match against the SDF is the caller's responsibility. opensta-to-ir cannot determine programmatically whether a given Verilog input is the right design stage for a given SDF. The CI fixture comment in prepare-mcu-soc-jtir captures the rule for sky130 mcu_soc; copy the spirit (use the post-P&R structural body, not synthesis output) when adding new fixtures, but don't copy a per-design extraction recipe — there no longer is one to copy.
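The extraction step in the first decision item can be sketched as a line-anchored scan. This is a minimal illustration, not the code in verilog_filter.rs: the function names are hypothetical, and it assumes `module` and `endmodule` each start a line (after optional indentation) and that the top name is a plain identifier.

```rust
/// Returns the `module <top> ... endmodule` block, or None when the file does
/// not define `top` (the caller would then pass the file through unchanged).
/// Assumption: `module` / `endmodule` are line-anchored, as in generated netlists.
pub fn extract_top_module(source: &str, top: &str) -> Option<String> {
    let mut out = String::new();
    let mut inside = false;
    for line in source.lines() {
        let trimmed = line.trim_start();
        if !inside && declares_module(trimmed, top) {
            inside = true;
        }
        if inside {
            out.push_str(line);
            out.push('\n');
            if trimmed.starts_with("endmodule") {
                return Some(out);
            }
        }
    }
    // Either `top` was never declared or its block was never closed.
    None
}

/// True when `trimmed` starts with `module <top>` followed by a non-identifier character.
fn declares_module(trimmed: &str, top: &str) -> bool {
    let rest = match trimmed.strip_prefix("module") {
        Some(rest) => rest,
        None => return false,
    };
    if !rest.starts_with(char::is_whitespace) {
        return false;
    }
    let name: String = rest
        .trim_start()
        .chars()
        .take_while(|c| c.is_alphanumeric() || *c == '_' || *c == '$')
        .collect();
    name == top
}
```

With a combined file that holds a wrapper module followed by `module top`, only the `top` block is returned, so the wrapper's RTL operators never reach OpenSTA. Escaped identifiers and mid-line module openings are not handled by this sketch.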

Architectural alternative (separate concern): the upstream chipflow harness could preserve LibreLane's pre-wrap <top>.pnl.v alongside its wrapped <top>_final.v output. That would make opensta-to-ir's in-tool extraction a no-op for the common chipflow case, but it would not obviate the filter — third-party LibreLane + wafer.space users (hazard3 and future tapeouts using the vanilla flow) hit the same wrapper pattern. The filter is the right place for the fix because it covers both opensta-to-ir as a CLI and jacquard sim --sdf (which subprocesses opensta-to-ir).

Consequences

End-user runs of jacquard sim --sdf <path> and the standalone opensta-to-ir tool both transparently handle the LibreLane + wafer.space wrapper pattern. No flags, no preprocessing recipe in user-facing docs.
Match-rate metrics in the IR consumer measure jtir coverage against the consuming netlist, not against the source SDF. A high match rate is necessary but not sufficient: before declaring a flow "working", separately confirm that the jtir contains the post-P&R cell population (e.g. by spot-checking for clkbuf_regs_* / ANTENNA_* arcs in the IR JSON sidecar).
The filter assumes module <--top> … endmodule is line-anchored in the Verilog source. Machine-generated post-P&R netlists meet this; hand-rolled Verilog that opens a module mid-line would not. If that ever surfaces, upgrade the filter to use a real Verilog tokenizer (sverilogparse is already a workspace dependency).
This ADR retroactively retracts the top_synth.v recommendation in docs/plans/ws3-cosim-sdf-followup.md; that doc is corrected in the same change.

ADR 0010 — Declarative cell metadata for PDK enablement

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The work this ADR "deferred to a future ADR" (explicit RAM port mapping) was delivered as ADR 0011 — schema v1.1 (src/cell_library.rs accepts "1.0" and "1.1"). The "Deferred" section below should be read as delivered, not open.

Context

PDK enablement today is per-PDK code + vendored Verilog (see src/sky130.rs, src/gf180mcu.rs, src/gf180mcu_pdk.rs, the build.rs pin-table scanner). Adding a new cell family (third-party IP memories, hard macros, foundry-supplied blocks) requires vendoring Verilog into jacquard/vendor/, extending the build.rs scanner, editing prefix matchers (is_<pdk>_cell, extract_cell_type), and adding entries to hand-curated matches!() lists (is_filler_cell, is_io_pad_cell, is_sequential_cell, is_multi_output_cell, …). Each of those lists is data masquerading as code; PR #64 (2026-05-18 power-pin + wired-filler shortcuts for wafer.space) is the most recent example of the pattern.

The acute trigger is gf180mcu_ocd_ip_sram__sram1024x8m8wm1 — Tim Edwards' OCD 3.3V port of the GF180MCU SRAM IP, used in a downstream wafer.space tapeout. The cell is third-party IP (not in Jacquard's vendor/), doesn't match is_gf180mcu_cell's prefix walk (fd_* / ws_* only), has no pin table, and isn't filler-stubbable. Issue #67 captures the discussion.

The same pattern will repeat for every wafer.space tapeout that includes IP outside Jacquard's vendored library — hazard3, future chips. Code-gating each one through a Jacquard PR doesn't scale.

Decision

PDK enablement gains a declarative metadata path alongside the existing built-in classifiers. The decision separates cleanly into two tiers; this ADR commits to Tier 1 + a minimal Tier 2 slice now, and explicitly defers the larger Tier 2 schema (port-mapping semantics) to a future ADR after real adoption data.

Tier 1 — runtime cell library (`--cell-library <PATH>.v`)

sverilogparse (already a workspace dependency) parses user-supplied Verilog files at startup and populates the LeafPinProvider for every module … endmodule block found. Handles input / output / inout. Replaces the build.rs scanner for newly-added cells; existing built-in tables stay as fallback.

Flag is repeatable: --cell-library a.v --cell-library b.v for designs that pull in multiple IP libraries. Files are parsed in order; later files override earlier ones for collisions (with a warning).
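The "later files override earlier ones, with a warning" rule can be sketched as an ordered merge. The types and the function name below are hypothetical stand-ins; the real pin provider is LeafPinProvider, whose API is not shown in this document.

```rust
use std::collections::HashMap;

/// Cell name -> ordered pin names. A stand-in for the real pin provider.
pub type PinTable = HashMap<String, Vec<String>>;

/// Merges per-file cell definitions in command-line order.
/// Returns the merged table plus one warning per overridden cell.
pub fn merge_cell_libraries(files: &[Vec<(&str, Vec<&str>)>]) -> (PinTable, Vec<String>) {
    let mut merged = PinTable::new();
    let mut warnings = Vec::new();
    for (index, cells) in files.iter().enumerate() {
        for (name, pins) in cells {
            let pins: Vec<String> = pins.iter().map(|p| p.to_string()).collect();
            // `insert` returns the previous value when the key already existed.
            if merged.insert(name.to_string(), pins).is_some() {
                warnings.push(format!(
                    "cell `{}` redefined by library #{}; later definition wins",
                    name, index
                ));
            }
        }
    }
    (merged, warnings)
}
```

Passing two libraries that both define the same cell leaves the second definition in the table and yields exactly one warning.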

Tier 2 (minimal slice) — `kind` discriminator in TOML

Each cell library may be accompanied by a TOML manifest declaring the kind of each cell — the same classification today's is_filler_cell / is_sequential_cell / etc. encode in matches!() lists. Manifest path mirrors the library path (foo.v → foo.cells.toml) and is loaded automatically when present; an explicit --cell-manifest <PATH>.toml flag overrides the autoloading behaviour.

schema_version = "1.0"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_fd_io__fillcap_18_h]
kind = "filler"
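The path-mirroring rule (foo.v → foo.cells.toml, with an explicit flag taking precedence) can be sketched with the standard library alone. Both function names are hypothetical, and the existence check is injected as a closure so the sketch stays independent of the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Mirrors the library path: `foo.v` becomes `foo.cells.toml`.
pub fn default_manifest_path(library: &Path) -> PathBuf {
    library.with_extension("cells.toml")
}

/// An explicit `--cell-manifest` path wins; otherwise the mirrored path is
/// used only when it exists. `exists` is injected so callers can pass
/// `|p| p.exists()` in production and a stub in tests.
pub fn resolve_manifest(
    library: &Path,
    explicit: Option<&Path>,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    if let Some(path) = explicit {
        return Some(path.to_path_buf());
    }
    let mirrored = default_manifest_path(library);
    if exists(&mirrored) {
        Some(mirrored)
    } else {
        None
    }
}
```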

Recognised kind values (v1.0): std, dff, latch, clock_gate, ram, filler, endcap, tap, io_pad_input, io_pad_output, io_pad_bidir, delay, multi_output, tie_high, tie_low.

Schema versioning: top-level schema_version is mandatory. v1.x additive rule — new optional keys / new kind values are non-breaking; semantics of existing kind values must not narrow.
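A loader for the `kind` discriminator and the version rule might look like the following sketch. The enum, the function names, and the "major must be 1, minor at most a known maximum" check are assumptions made for illustration; the document only states which kind values exist and that v1.x changes are additive.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CellKind {
    Std, Dff, Latch, ClockGate, Ram, Filler, Endcap, Tap,
    IoPadInput, IoPadOutput, IoPadBidir, Delay, MultiOutput, TieHigh, TieLow,
}

/// Maps a manifest `kind` string to its variant; unknown strings yield None.
pub fn parse_kind(s: &str) -> Option<CellKind> {
    Some(match s {
        "std" => CellKind::Std,
        "dff" => CellKind::Dff,
        "latch" => CellKind::Latch,
        "clock_gate" => CellKind::ClockGate,
        "ram" => CellKind::Ram,
        "filler" => CellKind::Filler,
        "endcap" => CellKind::Endcap,
        "tap" => CellKind::Tap,
        "io_pad_input" => CellKind::IoPadInput,
        "io_pad_output" => CellKind::IoPadOutput,
        "io_pad_bidir" => CellKind::IoPadBidir,
        "delay" => CellKind::Delay,
        "multi_output" => CellKind::MultiOutput,
        "tie_high" => CellKind::TieHigh,
        "tie_low" => CellKind::TieLow,
        _ => return None,
    })
}

/// Splits "major.minor" into numbers.
pub fn parse_schema_version(v: &str) -> Option<(u32, u32)> {
    let (major, minor) = v.split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// Assumed policy: accept major version 1 up to the newest minor the loader knows.
pub fn is_supported(v: &str, max_minor: u32) -> bool {
    matches!(parse_schema_version(v), Some((1, minor)) if minor <= max_minor)
}
```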

`kind = "ram"` semantics in v1.0 (opaque-RAM mode)

aig.rs today has two hardcoded RAM detection paths: celltype == "$__RAMGEM_SYNC_" (line 775, port_r/port_w resolution from Yosys memlib_yosys.txt) and starts_with("CF_SRAM_") (line 1006, .DO output resolution for ChipFlow's single-port convention). Neither matches gf180mcu_ocd_ip_sram_* or arbitrary third-party SRAM IP.

In v1.0, kind = "ram" allocates a RAMBlock slot in opaque mode: the cell's outputs are routed to X-source slots, no port resolution is attempted, no memory behaviour is modelled. This is sufficient for designs whose CPU executes from boot ROM / register file and never reads SRAM contents at the timescales Jacquard simulates (the heartbeat-verification use case driving this work). The existing compute_x_sources test path at src/aig.rs:3247-3273 already validates the X-source convergence shape.

When real memory modelling is required, future schema versions add explicit port mapping ([cells.NAME.ports] sub-tables) — the opaque mode stays as the documented fallback.

Integration ordering

aig.rs cell-type recognition slots the manifest path after the existing recognisers:

1. celltype == "$__RAMGEM_SYNC_"  → RAMBlock with port_r/port_w   (unchanged)
2. starts_with("CF_SRAM_")        → RAMBlock with .DO              (unchanged)
3. PdkVariant::classify(celltype) → built-in classifier dispatch    (unchanged)
4. NEW: manifest.lookup(celltype) → manifest-declared kind dispatch

The new path activates only for cells none of the existing recognisers match AND that have a manifest entry. All existing tests stay green without churn.
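The ordering above amounts to a first-match-wins chain in which the manifest is consulted last. The sketch below uses hypothetical names throughout: `Recognised`, `recognise`, and `demo_builtin` (a stand-in for PdkVariant::classify) are not taken from aig.rs.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Recognised {
    RamGemSync,
    CfSram,
    BuiltIn(&'static str),
    Manifest(String),
    Unknown,
}

/// Stand-in for the built-in classifier: recognises one sky130 prefix only.
pub fn demo_builtin(celltype: &str) -> Option<&'static str> {
    if celltype.starts_with("sky130_fd_sc_hd__") {
        Some("std")
    } else {
        None
    }
}

/// First match wins; the manifest is reached only when steps 1 to 3 all miss.
pub fn recognise(
    celltype: &str,
    builtin: impl Fn(&str) -> Option<&'static str>,
    manifest: &HashMap<String, String>,
) -> Recognised {
    if celltype == "$__RAMGEM_SYNC_" {
        return Recognised::RamGemSync;
    }
    if celltype.starts_with("CF_SRAM_") {
        return Recognised::CfSram;
    }
    if let Some(kind) = builtin(celltype) {
        return Recognised::BuiltIn(kind);
    }
    if let Some(kind) = manifest.get(celltype) {
        return Recognised::Manifest(kind.clone());
    }
    Recognised::Unknown
}
```

Because the manifest lookup comes last, a manifest entry for a cell that a built-in recogniser already claims has no effect, which is what keeps existing tests unchanged.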

Deferred to a future ADR

Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.
Built-in classifier removal. sky130.rs / gf180mcu.rs / gf180mcu_pdk.rs classification tables stay as fallback through the entire migration. Removal happens only after the manifest pathway is the source of truth for at least one PDK in production.
build.rs pin-table scanner removal. Same rule: removed LAST, after manifests cover the built-in PDKs.

Consequences

Third-party IP unblocks without Jacquard PRs. Users ship a <library>.cells.toml alongside their <library>.v; CI flows point --cell-library at both. The driving wafer.space tapeout's chip_top.pnl.v clears gf180mcu_ocd_ip_sram__sram1024x8m8wm1 by shipping a six-line manifest entry.
The "vendor + edit code + extend lists" PR workflow for new IP becomes "ship a manifest, no Jacquard change". docs/adding-a-pdk.md evolves to document the manifest pathway as the primary route.
The opaque-RAM semantics is honest about what v1.0 delivers — no silent partial memory modelling. The contract is "RAMBlock allocated, outputs X-source, no read/write behaviour" until a future schema version adds explicit ports.
Existing built-in PDK code stays load-bearing through the transition. No risk of regression in sky130 / gf180mcu test flows during the migration.

ADR 0011 — RAM port-mapping schema for declarative cell metadata

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The "SRAM preload" consequence is now wired (TestbenchConfig::sram_init, src/sim/cosim/mod.rs), but the shipped path handles only the single-SRAM case (the design must have exactly one SRAM; multi-SRAM designs fail with an explicit error). The virtual-address-overlap matching of segments to multiple SRAM instances described below is not yet built (issue #103). Schema unchanged.

Context

ADR 0010 shipped a minimal Tier 2 slice with one kind discriminator per cell. For kind = "ram" specifically, v1.0 declares the cell-as-opaque: the AIG allocates a RAMBlock slot but routes outputs to X-source slots without resolving read/write port semantics. That's sufficient for "design boots from ROM, never reads SRAM contents" cases but fails the moment a real CPU writes to SRAM and expects to read its data back.

The acute trigger is the JTAG-DM firmware-load path enabled by PR #78: OpenOCD walks a debug-module sequence that culminates in abstract-memory writes into the design's SRAM, then jumps the CPU to that memory. Because the SRAM is opaque (no backing storage, writes go nowhere), the CPU boots to garbage. Issue #80 captures the symptom and notes that wiring SramInitConfig is the smaller sibling problem — pre-loading SRAM contents at tick 0 — but the bigger gap is that kind = "ram" doesn't model writes at all.

ADR 0010 § "Deferred to a future ADR" listed the port-mapping schema explicitly:

Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.

The OCD GF180MCU SRAM (gf180mcu_ocd_ip_sram__sram1024x8m8wm1) — a real third-party IP cell behind the apitronix-semiconductor / hazard3 / future wafer.space tapeout pipelines — gives us the concrete adoption-data input. This ADR fixes the schema against that worked example.

Worked example: the OCD SRAM

The upstream behavioural model (RTimothyEdwards/gf180mcu_ocd_ip_sram) declares:

module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (
    CLK, CEN, GWEN, WEN, A, D, Q
);
  input         CLK;                // posedge clock
  input         CEN;                // chip enable, active-low
  input         GWEN;               // global write enable, active-low
  input  [7:0]  WEN;                // per-bit write mask, active-low
  input  [9:0]  A;                  // address (1024 entries)
  input  [7:0]  D;                  // data in
  output [7:0]  Q;                  // data out
  reg    [7:0]  mem[1023:0];        // backing storage

Read semantics: on posedge CLK, when !CEN && GWEN → Q = mem[A]. Write semantics: on posedge CLK, when !CEN && !GWEN && !(&WEN) → mem[A][i] = D[i] for each i where !WEN[i].
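The read and write rules can be restated as a small host-side reference model. This is a sketch for reasoning about the semantics, not Jacquard's RAMBlock: the struct name is hypothetical, and holding Q on write cycles and on disabled cycles is an assumption, since the text above only defines when Q is updated.

```rust
/// Reference model of the 1024x8 SRAM's clocked behaviour.
pub struct OcdSramModel {
    mem: [u8; 1024],
    q: u8,
}

impl OcdSramModel {
    pub fn new() -> Self {
        Self { mem: [0; 1024], q: 0 }
    }

    /// One posedge of CLK. `cen`, `gwen` and each bit of `wen` are active-low.
    /// Returns the value on Q after the edge.
    pub fn posedge(&mut self, cen: bool, gwen: bool, wen: u8, a: u16, d: u8) -> u8 {
        let addr = usize::from(a & 0x3ff); // 10 address bits
        if !cen {
            if gwen {
                // Read: !CEN && GWEN
                self.q = self.mem[addr];
            } else if wen != 0xff {
                // Write: !CEN && !GWEN && !(&WEN); bit i is written where WEN[i] == 0.
                let write_bits = !wen;
                self.mem[addr] = (self.mem[addr] & wen) | (d & write_bits);
            }
        }
        // Assumption: Q holds its previous value on write and disabled cycles.
        self.q
    }
}
```

A full-word write uses `wen = 0x00`; a write with `wen = 0xF0` updates only the low nibble of the addressed entry.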

The schema needs to capture: per-pin polarity (active-low vs active-high), per-pin role (clock / chip-enable / write-enable / mask / address / data-in / data-out), bus widths (derived from the Verilog declaration; not redeclared), mask granularity (per-bit vs per-byte vs none).

Decision

Extend the <library>.cells.toml schema with an optional ram sub-table on entries declaring kind = "ram". Presence of the sub-table promotes a cell from opaque (v1.0 semantics) to explicit — outputs are properly wired to the AIG-backed RAMBlock, writes populate backing storage, reads return what was written.

Schema (v1.1)

schema_version = "1.1"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock        = { pin = "CLK", edge = "pos" }
chip_enable  = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask   = { pin = "WEN", polarity = "low", granularity = "bit" }
address      = "A"
data_in      = "D"
data_out     = "Q"

Field semantics

depth (required, integer): number of addressable entries. Must satisfy depth ≤ 2^AIGPDK_SRAM_ADDR_WIDTH (8192 today).
width (required, integer 1..=32): bit-width of each entry. Capped at 32 by RAMBlock's fixed-size port arrays.
clock (required, table): pin is the clock input pin name; edge defaults to "pos". "neg" is accepted (matches gf180mcu dffnq-family negedge convention).
chip_enable (optional, table): pin + polarity (default "low"). When the pin's effective level is inactive, the cell neither reads nor writes for that cycle. Omit for sync SRAMs that are always-enabled.
write_enable (optional, table): pin + polarity (default "low"). Gates all writes regardless of mask. The OCD SRAM's GWEN. Omit for SRAMs without a global write-enable.
write_mask (optional, table): per-bit / per-byte write enables. pin is the mask pin name; polarity defaults to "low"; granularity is "bit" (default) or "byte". The mask width must match width (bit) or width / 8 (byte). Omit for SRAMs without per-bit masking — in that case the global write_enable controls the whole word.
address / data_in / data_out (required, string): pin names. Bus widths are read from the Verilog (via sverilogparse) — not re-declared here.
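The numeric constraints in the field list can be checked with a short validator. The names here are hypothetical, the constant 13 is inferred from "8192 today" rather than read from the source, and rejecting byte granularity when width is not a multiple of 8 is an assumption the text does not spell out.

```rust
/// Assumption: 2^13 = 8192 matches the limit quoted for AIGPDK_SRAM_ADDR_WIDTH.
pub const SRAM_ADDR_WIDTH: u32 = 13;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Granularity {
    Bit,
    Byte,
}

/// `mask` carries the granularity and the mask pin's width as read from the Verilog.
pub fn validate_ram(
    depth: u32,
    width: u32,
    mask: Option<(Granularity, u32)>,
) -> Result<(), String> {
    if depth > (1u32 << SRAM_ADDR_WIDTH) {
        return Err(format!("depth {} exceeds {}", depth, 1u32 << SRAM_ADDR_WIDTH));
    }
    if !(1..=32).contains(&width) {
        return Err(format!("width {} outside 1..=32", width));
    }
    if let Some((granularity, mask_width)) = mask {
        let expected = match granularity {
            Granularity::Bit => width,
            Granularity::Byte => {
                // Assumption: byte masks need a whole number of bytes.
                if width % 8 != 0 {
                    return Err(format!("byte mask needs width % 8 == 0, got {}", width));
                }
                width / 8
            }
        };
        if mask_width != expected {
            return Err(format!("mask width {} != expected {}", mask_width, expected));
        }
    }
    Ok(())
}
```

For the OCD SRAM this is `validate_ram(1024, 8, Some((Granularity::Bit, 8)))`, which passes.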

Optional cells (no `ram` block)

Cells declaring kind = "ram" without the ram sub-table fall back to v1.0 opaque mode — outputs route to X-source slots, no backing storage, no port resolution. The contract is unchanged for existing consumers.

Backing storage

Cells with an explicit ram block allocate a RAMBlock with port_r_* and port_w_* arrays populated from resolved pin positions. The simulator's existing GPU-side SRAM machinery handles reads, writes, and per-entry backing memory; no new kernel work is required.

Schema versioning

The top-level schema_version field bumps from "1.0" to "1.1". v1.0 manifests continue to parse — the ram sub-table is purely additive. Loaders that don't recognise the new sub-table (none today; this ADR ships the loader simultaneously) would treat flagged cells as opaque RAMs, which is a graceful degradation.

SRAM preload (sibling work)

TestbenchConfig::sram_init (an existing schema field declared in src/testbench.rs but unwired today — issue #80) becomes load-bearing once explicit-port RAMs have backing storage. The preload path:

Parse ELF segments from sram_init.elf_path.
Match segments to SRAM instances by virtual-address overlap with declared SRAM regions.
Write segment bytes into each matched SRAM's backing memory at tick 0.
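The overlap matching in the second and third steps reduces to interval intersection. The sketch below covers only that arithmetic under stated assumptions: ELF parsing is omitted, `Region`, `overlap`, and `preload` are hypothetical names, and the SRAM is treated as byte-addressed backing memory.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub base: u64,
    pub len: u64,
}

/// Intersection of a segment with an SRAM region.
/// Returns (offset into SRAM, offset into segment, byte count).
pub fn overlap(seg: &Region, sram: &Region) -> Option<(u64, u64, u64)> {
    let start = seg.base.max(sram.base);
    let end = seg
        .base
        .saturating_add(seg.len)
        .min(sram.base.saturating_add(sram.len));
    if start >= end {
        return None;
    }
    Some((start - sram.base, start - seg.base, end - start))
}

/// Copies the overlapping part of `bytes` into `backing`; returns bytes written.
pub fn preload(sram_base: u64, backing: &mut [u8], seg_base: u64, bytes: &[u8]) -> usize {
    let sram = Region { base: sram_base, len: backing.len() as u64 };
    let seg = Region { base: seg_base, len: bytes.len() as u64 };
    match overlap(&seg, &sram) {
        Some((dst, src, n)) => {
            let (dst, src, n) = (dst as usize, src as usize, n as usize);
            backing[dst..dst + n].copy_from_slice(&bytes[src..src + n]);
            n
        }
        None => 0,
    }
}
```

A segment that straddles the start of the SRAM region contributes only its tail; a segment outside the region writes nothing.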

Schema extensions to SramInitConfig (instance targeting, multi-section support) land alongside the implementation but don't require an ADR — purely additive JSON schema work.

Consequences

The OCD GF180MCU SRAM (and any structurally similar third-party IP — 1RW, sync, optional per-bit mask) becomes simulable end-to-end via the manifest pathway. Real CPU writes populate real memory.
The opaque-mode fallback stays load-bearing for cells the consumer hasn't taken the time to schema-map — important so the cell-library pathway doesn't require schema work just to load a cell library.
JTAG-DM-driven firmware load (PR #78 stage 1) becomes end-to-end testable in cosim. Closes the chicken-and-egg loop for designs whose firmware-load mechanism is what cosim is trying to validate.
The schema is opinionated: 1-port (1RW), sync-only, write-mask is bit OR byte (not arbitrary). Multi-port SRAMs (2RW, 1R1W), async SRAMs, and write-mask-with-stripes encodings are explicitly out of scope. Adding them is a future schema version (v1.2+); doesn't break v1.1 manifests.

Out of scope

Multi-port SRAMs. Most foundry IPs in our ecosystem are single-port. Dual-port designs are a meaningful follow-up but not driven by any in-tree fixture today.
Async (non-clocked) SRAMs. Rarely seen in synthesised digital designs on modern PDKs. Not modelled.
Width > 32 bits. Bounded by RAMBlock's array sizes; consumers wider than 32 should split into multiple instances.
Built-in classifier removal. Same rule as ADR 0010 — the $__RAMGEM_SYNC_ and CF_SRAM_* recognisers stay as fallback; manifest-declared RAMs supplement, don't replace.

ADR 0012 — Reproducible CDC jitter injection for multi-clock cosim

Status: Accepted — design accepted and partially implemented. The reproducibility core (§1) and scheduler-domain jitter on the VCD timeline (§2, partial) are built; model-driven jitter (§3), setup/hold integration (§5), the gcd_ps/2 guard, and true coincident-edge perturbation (§4) are not yet. The sections below describe the decided design; see Implementation status for what is built versus deferred. Remaining work is tracked in issue #92 and ../plans/cdc-jitter-completion.md.

Amendment (2026-06-25): The Implementation status table cites cosim_metal.rs for the scheduler and jitter draw — that file reference is stale. The MultiClockScheduler, per-domain PRNG setup, and jitter draw loop all live in src/sim/cosim/mod.rs (the unified multi-backend cosim entrypoint); src/sim/cosim/metal.rs holds only the MetalBackend struct. The implemented/deferred split itself is still accurate.

Context

The multi-clock scheduler (MultiClockScheduler in cosim_metal.rs) pre-computes a fixed LCM-based edge schedule: every clock domain fires at perfectly rational offsets forever. Real hardware doesn't do that — PLL jitter, clock-tree skew, and propagation delay make coincident edges land in unpredictable order. CDC synchronizers are designed to tolerate this, but RTL bugs (missing synchronizers, gray-code errors, handshake protocol violations) only surface when edge alignment varies from the ideal.

The motivating incident was PR #89 / run 26413667030: a scheduler index bug caused sys_clk to fire at TCK's period, making CDC synchronizers between the JTAG and system clock domains marginal. The test passed intermittently because Metal GPU scheduling jitter shifted the effective phase relationship. Once the bug was fixed (commit 5bb07c3), determinism was restored — but the experience highlighted that no deliberate mechanism exists to stress-test CDC paths under controlled timing skew.

Additionally, cosim's model-driven clocks (JtagReplayModel, SpiFlashModel, etc.) override the scheduler's periodic pattern with software-driven edges. These introduce a distinct CDC concern: model-driven clock transitions are phase-locked to the host-side dispatch loop, not to the design's system clock. The same jitter injection infrastructure must cover both scheduler-derived and model-driven clock edges.

The multi-clock plan (docs/plans/multi-clock-and-stimulus-architecture.md) lists "CDC verification mode: jitter injection on coincident edges and random X-injection on detected async-source paths" as a future capability. This ADR formalises the design for the jitter injection half; X-injection is deferred to a follow-up ADR that depends on MC.1 (island partitioner) landing.

Decision

1. Run-parameters file and per-domain seeded PRNG

Simulation runs that use any non-deterministic feature (jitter, future partition randomisation, model-driven timing offsets) are governed by a run-parameters file (--run-params <path>):

{
  "master_seed": 8429173640281
}

From the master seed, a per-domain sub-seed is derived for each clock domain and each model-driven clock (e.g. sub_seed = hash(master_seed, domain_name)). Each domain gets its own independent PRNG stream. This ensures reproducibility even when the number of PRNG draws per domain is path-dependent — a reactive model that fires more or fewer edges based on design output doesn't contaminate another domain's displacement sequence.
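The stream-independence argument can be illustrated with a deliberately simple stand-in. The implementation table later in this ADR names `hash(master_seed, name)` and ChaCha8Rng; the sketch below swaps in FNV-1a and SplitMix64 so it needs only the standard library, so treat both as substitutes rather than what `RunParams::domain_seed` does.

```rust
/// Derives a per-domain sub-seed with FNV-1a (64-bit) over the seed bytes and name.
/// A fixed, explicit hash is used because replay needs the same value on every build.
pub fn domain_seed(master_seed: u64, domain: &str) -> u64 {
    let seed_bytes = master_seed.to_le_bytes();
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in seed_bytes.iter().chain(domain.as_bytes()) {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// SplitMix64: a small deterministic generator standing in for ChaCha8Rng.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```

Each domain owns its generator, so however many values one domain draws, another domain's sequence for the same master seed is unchanged.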

Behaviour:

--run-params <path> supplied, file exists: load parameters from it. The run is a deterministic replay.
--run-params <path> supplied, file does not exist: generate a master seed from system entropy, write the file immediately (before the simulation loop starts), then run. The user gets reproducibility even if the process crashes mid-simulation.
No --run-params flag: generate a master seed, write to a default location (<output_dir>/run_params.json next to the output VCD) before simulation begins. Always persisted — the user can re-run any simulation by passing the written file back.

The master seed is also logged at INFO level and included in the VCD header comment, so even without the file the seed is recoverable from logs.

Rationale: "random testing that can't be replayed isn't testing," but forcing users to pick seeds upfront discourages use. Writing the file before simulation means every run — even a crashed one — is reproducible after the fact. Per-domain streams mean the seed alone is sufficient; no displacement log is needed.

For CI seed sweeps, a wrapper generates N parameter files with sequential seeds and fans out runs. Each failure ships with its parameter file as an artifact — gh run download gives you everything needed to reproduce locally.

2. Per-domain jitter budget

A new jitter_ps field on ClockConfig in sim_config.json declares the maximum edge displacement in picoseconds for that domain:

{
  "clocks": [
    { "gpio": 0, "period_ps": 40000, "name": "sys_clk", "jitter_ps": 200 },
    { "gpio": 2, "period_ps": 160000, "name": "tck", "jitter_ps": 0 }
  ]
}

At each edge, the scheduler draws a signed displacement from a uniform distribution [-jitter_ps, +jitter_ps] and shifts the edge forward or backward within the GCD granularity window. The resulting edge still fires within the same GCD tick (no reordering across ticks), but the effective arrival time recorded in the state buffer (and honoured by setup/hold checkers) shifts. Disabling jitter (jitter_ps: 0) is the default and produces today's ideal-clock behaviour.

Constraint: jitter must not exceed gcd_ps / 2; larger values would re-order edges across GCD ticks and require a fundamentally different scheduling model.
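The budget check and the draw can be sketched as two small functions. Assumptions are flagged in the comments: `gcd_ps` is computed here over the configured periods (the scheduler may use a finer granularity such as half-periods), the function names are hypothetical, and reducing a raw 64-bit value by modulo carries a negligible bias that a production generator would avoid.

```rust
pub fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        gcd(b, a % b)
    }
}

/// Rejects budgets above gcd_ps / 2.
/// Assumption: gcd_ps is the GCD of the listed periods.
pub fn check_jitter_budget(periods_ps: &[u64], jitter_ps: u32) -> Result<(), String> {
    let gcd_ps = periods_ps.iter().copied().fold(0, gcd);
    if u64::from(jitter_ps) > gcd_ps / 2 {
        return Err(format!(
            "jitter_ps {} exceeds gcd_ps / 2 = {}",
            jitter_ps,
            gcd_ps / 2
        ));
    }
    Ok(())
}

/// Maps a raw PRNG output onto the closed range [-jitter_ps, +jitter_ps].
pub fn displacement_ps(raw: u64, jitter_ps: u32) -> i64 {
    let span = 2 * u64::from(jitter_ps) + 1; // number of integer values in the range
    (raw % span) as i64 - i64::from(jitter_ps)
}
```

For the example config (periods 40000 ps and 160000 ps) the GCD is 40000 ps, so any budget up to 20000 ps passes and the configured 200 ps is well inside it.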

3. Model-driven clock jitter

Model-driven clocks (JTAG TCK, SPI SCK, etc.) bypass the scheduler's periodic edges. Their jitter path is different:

A --cdc-model-jitter-ps <N> flag (or per-model jitter_ps in the config) specifies the budget for model-driven transitions.
After patch_model_clock_edges fires the edge, the arrival-time offset recorded in the timing state is displaced by a PRNG-drawn value from the same seeded generator.
This does NOT delay the functional edge (the DFF still samples on the same tick) — it shifts the timing-model arrival so that setup/hold checks against the receiving domain see a different margin each run.

The functional-vs-timing split means jitter injection doesn't change combinational propagation (which would require an event-driven kernel), only the timing oracle's view of when edges "really" arrived. This is consistent with Jacquard's philosophy: functional correctness is cycle-accurate, timing is an overlay.

4. Coincident-edge perturbation

When two domains have edges scheduled at the same GCD tick (coincident edges), their relative order is undefined in real hardware. The jitter mechanism naturally handles this: if domain A's jitter shifts it +100ps and domain B's shifts it -50ps, the timing model sees A after B, which may differ from the next run's draw. This exercises both "A-before-B" and "B-before-A" orderings over a seed sweep without needing explicit permutation logic.
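The A/B example can be made concrete by sorting coincident edges on their displaced arrival times. This is a sketch of the decided design (the status table in this ADR notes that independent per-domain displacement is not yet built), and `timing_order` is a hypothetical helper.

```rust
/// Orders domains that share a GCD tick by their jittered arrival time.
/// `displacements` pairs each domain name with its drawn offset in picoseconds.
pub fn timing_order<'a>(tick_ps: i64, displacements: &[(&'a str, i64)]) -> Vec<(&'a str, i64)> {
    let mut arrivals: Vec<(&str, i64)> = displacements
        .iter()
        .map(|(name, d)| (*name, tick_ps + d))
        .collect();
    // Stable sort: domains with equal arrival times keep their declared order.
    arrivals.sort_by_key(|(_, t)| *t);
    arrivals
}
```

With a tick at 40000 ps, draws of +100 ps for A and -50 ps for B give B at 39950 ps ahead of A at 40100 ps; a different seed can reverse the pair.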

5. Integration with existing infrastructure

Setup/hold checker (timing_report.rs): already receives arrival-time offsets. Jittered arrivals feed directly into the existing violation detection — a jitter-induced setup violation appears in --timing-report output with the jittered arrival annotated.
VCD ring buffer: records the jittered arrival time so waveform viewers show the displaced edge.
X-prop (future): when MC.1 identifies CDC boundaries, X-injection on violated paths can use the same PRNG stream for correlated randomisation.
--check-with-cpu: the CPU baseline does NOT apply jitter (it doesn't model timing at all). Jitter-mode results should not be compared against the CPU baseline. The flag combination --run-params (with jitter enabled) + --check-with-cpu should warn or error.

Implementation status

The design above is accepted in full; the code implements part of it. This section is the source of truth on what is built. Remaining items are tracked in issue #92 / ../plans/cdc-jitter-completion.md.

Implemented:

Part	Where
Run-parameters file, `master_seed`, load/write/`load_or_generate`	`src/sim/run_params.rs`
Per-domain sub-seed `hash(master_seed, name)` + per-domain `ChaCha8Rng`	`RunParams::domain_seed`; `cosim_metal.rs`
`jitter_ps` per `ClockConfig` (default 0)	`src/testbench.rs`
Uniform `[-jitter_ps, +jitter_ps]` draw per domain per tick	`cosim_metal.rs`
Jitter displacement applied to the timing-VCD event timestamp	`cosim_metal.rs` (inside the `--output-vcd` block)
`master_seed` logged at INFO	`cosim_metal.rs`
`--check-with-cpu` + jitter warning	`cosim_metal.rs`

Deferred (issue #92):

Part	ADR §	Gap
Setup/hold integration	§2, §5	Jitter shifts only the VCD base timestamp; it does not feed the per-signal arrival offsets, so it produces no `--timing-report` violations. Also: jitter currently has no effect unless `--output-vcd` is set.
Model-driven clock jitter	§3	No `--cdc-model-jitter-ps` flag or `patch_model_clock_edges` path; only scheduler domains jitter.
True coincident-edge perturbation	§4	A single global displacement (last firing domain wins) is applied to the shared timestamp rather than independent per-domain displacement.
`gcd_ps / 2` constraint	§2	Not validated.
Persist seed unconditionally	§1	Without `--run-params` or `--output-vcd`, the seed is generated but not written.
`master_seed` in VCD header comment	§1, §5	INFO log only.
`--cdc-jitter-seed` CI sweep	Consequences	The replay mechanism is `--run-params`; no dedicated CI sweep step yet.

Consequences

CI can run a small seed sweep (via --run-params) as a lightweight CDC stress test on every PR, catching synchroniser failures that the ideal-clock schedule hides.
Users debugging real silicon CDC failures can replay the exact jitter pattern that triggered the issue.
The design is forward-compatible with X-injection (the PRNG infrastructure and per-domain budgets are reusable).
Model-driven clocks get explicit jitter coverage rather than relying on accidental GPU scheduling delays.
No kernel changes required — jitter is a host-side timing-model overlay on the existing edge schedule.

Deferred

X-injection on CDC paths. Requires MC.1's island partitioner to identify which DFF outputs cross domains. Separate ADR once MC.1 lands.
Frequency sweep / DFS simulation. Changing a clock's period mid-simulation is orthogonal to jitter. Captured in the multi-clock plan as a future axis.
Per-path jitter profiles. Real jitter isn't uniform — PLLs have period jitter (Gaussian), recovered clocks have cycle-to-cycle jitter (bounded), external clocks have frequency offset (deterministic drift). V1 uses uniform; richer distributions can be added later without API changes (the seed + budget interface is distribution-agnostic).

ADR 0013 — Cosim peripheral model architecture

Status: Accepted — the architecture is implemented and in use across multiple peripherals (multi-UART #90, config-driven APB3 bus tracing). The "Target architecture" section below tracks the remaining, optional refactors; the conventions it establishes are already followed.

Amendment (2026-06-25): Two body claims are now stale. (1) "Cosim is Metal-only today" — CUDA (src/sim/cosim/cuda.rs) and HIP (src/sim/cosim/hip.rs) backends now implement the full peripheral stack (gpu_io_step, flash FSM, ring-buffer drain); the "Metal-only" characterisation applied only to the prebuilt distribution (ADR 0018), not the source build. (2) The note that step_edge receives an empty output_state slice was resolved by the interactive JTAG work — it now receives &backend.state()[state_size..] (cosim/mod.rs). The original text below is retained for the record.

Amendment (2026-07-09): Plural QSPI memories may now share one SCK/SIO bus, selected by distinct CS lines (config: same clk_gpio/d0_gpio, distinct csn_gpio). This requires CS-gated MISO injection: a deselected instance (prev_csn high) presents high-Z and must not drive. Previously gpu_apply_flash_din (and the CpuBackend mirror) wrote every instance's d_i to its d_in_pos unconditionally, so on a shared bus the last-iterated (typically deselected) memory clobbered the selected driver. The gate lives in gpu_apply_flash_din (Metal + the shared CUDA/HIP kernel_v1_impl.cuh) and CpuBackend::apply_flash_din_gated; regression: tests/qspi_shared_bus/ (golden + content, all/qspi scopes) plus the flash_din_tests unit tests. Independent-pin plurality (the original Stage B/C design, tests/multi_mem_cosim/) is unaffected — a deselected flash on its own pins was always ignored by the design; the gate only changes the now-shared case.
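The CS gate described in this amendment can be illustrated on a simplified state vector. The field names follow the amendment (prev_csn, d_i, d_in_pos), but the struct, the one-bit data line, and the `bool` state slice are simplifications assumed for this sketch; the real kernels operate on the packed GPU state buffer.

```rust
/// One QSPI memory instance, reduced to a single data-in bit.
pub struct FlashInstance {
    pub prev_csn: bool,  // true = deselected (CS is active-low)
    pub d_i: bool,       // value this instance would drive
    pub d_in_pos: usize, // position of the design's input bit
}

/// Writes each selected instance's drive value; deselected instances present
/// high-Z and therefore leave the shared position untouched.
pub fn apply_flash_din_gated(state: &mut [bool], instances: &[FlashInstance]) {
    for inst in instances {
        if !inst.prev_csn {
            state[inst.d_in_pos] = inst.d_i;
        }
    }
}
```

On a shared bus where the selected memory is iterated first and a deselected one last, the gate keeps the selected driver's value instead of letting the last writer win.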

Context

Jacquard's cosim mode runs reactive peripheral models alongside the GPU-simulated design: SPI flash serves firmware, UART decodes serial output, JTAG replays debug sessions, GPIO drives/observes pins, and Wishbone trace captures bus transactions. The architecture evolved organically; this ADR documents the current design, identifies the abstractions emerging from it, and establishes conventions for extending it.

Architecture

Execution domains

Peripheral work splits across CPU and GPU. The boundary follows a simple rule: models that drive input pins (must react to design output each edge) run on the CPU; models that observe output pins (pure consumers of post-simulation state) or exchange data bidirectionally with the design run on the GPU for zero-copy access to the state buffer.

Some peripherals span both domains. UART has a CPU-side RX driver (feeds bytes into the design's RX input pin) and a GPU-side TX decoder (reads the design's TX output pin).

CPU-side: `PeripheralModel` trait

Defined in src/sim/models/mod.rs:

trait PeripheralModel {
    fn name(&self) -> &str;
    fn driven_positions(&self) -> &[u32];
    fn apply_action(&mut self, action: &QueuedAction);
    fn step_edge(&mut self, output_state, overrides, emitted); // default: just calls contribute_overrides
    fn contribute_overrides(&self, overrides);
    fn is_active(&self) -> bool; // default: false
}

apply_action is how the InputDispatcher feeds queued stimulus commands to models. is_active signals that the model is mid-transmission and needs per-edge granularity (forces batch size to 1). step_edge has a default that just calls contribute_overrides; stateless models (GPIO) only need the latter.

Models are registered into a Vec<Box<dyn PeripheralModel>> at startup. Each batch boundary, the loop calls step_edge on every model; models write their pin drives into a shared ModelOverrides map. These overrides are patched in-place into pre-allocated BitOp arrays (built at startup with placeholder entries for model-driven positions) and applied via the state_prep GPU kernel.
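The registration and per-batch stepping described above can be sketched as follows. This is a reduced stand-in, not the trait from src/sim/models/mod.rs: the ModelOverrides alias (a position-to-level map), the argument types, the Gpio struct, and the step_all and batch_size helpers are all assumptions introduced for illustration.

```rust
use std::collections::HashMap;

/// Assumed stand-in for the shared override map: state position -> driven level.
pub type ModelOverrides = HashMap<u32, bool>;

/// Reduced version of the trait, keeping only the override-related methods.
pub trait PeripheralModel {
    fn name(&self) -> &str;
    fn contribute_overrides(&self, overrides: &mut ModelOverrides);
    /// Default mirrors the documented behaviour: just call contribute_overrides.
    fn step_edge(&mut self, _output_state: &[u32], overrides: &mut ModelOverrides) {
        self.contribute_overrides(overrides);
    }
    /// Default: not mid-transmission.
    fn is_active(&self) -> bool {
        false
    }
}

/// Stateless GPIO model: drives one pin to a fixed level.
pub struct Gpio {
    pub pos: u32,
    pub level: bool,
}

impl PeripheralModel for Gpio {
    fn name(&self) -> &str {
        "gpio"
    }
    fn contribute_overrides(&self, overrides: &mut ModelOverrides) {
        overrides.insert(self.pos, self.level);
    }
}

/// Batch-boundary step: call step_edge on every registered model.
pub fn step_all(models: &mut [Box<dyn PeripheralModel>], output_state: &[u32]) -> ModelOverrides {
    let mut overrides = ModelOverrides::new();
    for model in models.iter_mut() {
        model.step_edge(output_state, &mut overrides);
    }
    overrides
}

/// Any active model forces per-edge granularity (batch size 1).
pub fn batch_size(models: &[Box<dyn PeripheralModel>], requested: usize) -> usize {
    if models.iter().any(|m| m.is_active()) { 1 } else { requested }
}
```

A stateless model such as Gpio implements only contribute_overrides and inherits the default step_edge, which is the split the trait's defaults are designed for.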

Note: step_edge currently receives an empty output_state slice — GPU output state is not read back per-edge for CPU-side models. GPIO and UART RX don't need it; I²C and SPI bus observation will require wiring the output state readback when those models are completed.

The dispatch is peripheral-agnostic: state_prep applies whatever BitOp array it receives. Clock edges, reset, GPIO, UART RX, and JTAG TCK/TMS/TDI are all entries in the same ops buffer.

Registered CPU-side models: GPIO, UART RX, JTAG replay (complete); I²C, SPI (scaffolded, output-state readback not yet wired).

GPU-side: two model patterns

GPU-side models fall into two categories distinguished by their data-flow relationship to the simulation:

Observe-only (post-simulate): The model reads output state after simulation and produces results (decoded bytes, bus traces) into a ring buffer. It never writes to input state. One kernel call per edge, after simulate_v1_stage.

Bidirectional (pre+post simulate): The model both reads the design's outputs and injects data into the design's inputs. This requires two kernel calls per edge — one before simulation (inject response data into input state) and one after (read request signals from output state, advance the model's FSM).

Pattern	When	Current models
Observe-only	Post-simulate	UART TX decoder, Wishbone bus trace
Bidirectional	Pre-simulate (inject) + post-simulate (sample, advance)	SPI Flash

Any memory-mapped peripheral (external SRAM, I²C EEPROM, etc.) would follow the bidirectional pattern.

Per-edge execution order

state_prep (apply clk/gpio/jtag pin drives from CPU-side models)
  → [bidirectional: inject] — e.g. gpu_apply_flash_din
    → simulate_v1_stage ×N (combinational logic evaluation)
  → [bidirectional: sample+advance] — e.g. gpu_flash_model_step
  → [observe-only] — e.g. gpu_io_step (UART TX + Wishbone)

CPU-side PeripheralModel::step_edge runs between GPU batches.

GPU→CPU communication: ring buffers

GPU-side models produce output into fixed-size ring buffers in device memory. The CPU drains these after each GPU batch completes, reading from a local read_head up to the GPU-written write_head. No synchronisation beyond Metal's command buffer completion is needed.

Current ring buffers:

Buffer	Element	Capacity
`UartChannel`	`u8` (decoded bytes)	4096
`WbTraceChannel`	`WbTraceEntry` (20 bytes)	16384
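The drain step can be sketched as below. The Ring type and the drain function are hypothetical, and the sketch assumes a free-running write_head that is reduced modulo the capacity when indexing; the actual head encoding in the device structs is not specified here. Overrun (the producer lapping the consumer) is not handled.

```rust
/// Simplified ring buffer as seen by the CPU after a batch completes.
/// Assumption: write_head counts every element ever written and is
/// reduced modulo the capacity when indexing.
pub struct Ring<T> {
    pub write_head: u32,
    pub data: Vec<T>, // length equals the capacity
}

/// Copy out everything between the CPU-local read_head and the
/// GPU-written write_head, advancing read_head as it goes.
pub fn drain<T: Copy>(ring: &Ring<T>, read_head: &mut u32) -> Vec<T> {
    let cap = ring.data.len() as u32;
    let mut out = Vec::new();
    while *read_head != ring.write_head {
        out.push(ring.data[(*read_head % cap) as usize]);
        *read_head = read_head.wrapping_add(1);
    }
    out
}
```

Under this assumption the modulo indexing stays consistent across u32 wrap-around only when the capacity is a power of two, which holds for both capacities in the table (4096 and 16384).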

Configuration

Peripheral config lives in sim_config.json, deserialized into TestbenchConfig (src/testbench.rs):

Peripheral	Field	Plural?
Clock	`clocks: Option<Vec<ClockConfig>>`	Yes (`effective_clocks()`)
GPIO	`gpios: Vec<GpioConfig>`	Yes
UART	`uart` + `uarts: Vec<UartConfig>`	Yes (`effective_uarts()`, #90)
QSPI memory (flash/PSRAM)	`flash` + `qspi_memory: Vec<QspiMemoryConfig>`	Yes (`effective_qspi_memory()`, #170/#171)
JTAG	`jtag: Option<JtagConfig>`	Not yet
Wishbone	(auto-detected, hardcoded signal names)	N/A (legacy)
Bus trace (AHB/APB)	`bus_traces: Vec<BusTraceConfig>`	Yes (`effective_bus_traces()`)

Current implementation (bespoke kernels)

Today each GPU-side peripheral has its own kernel function:

Kernel	Slots	Pattern
`gpu_apply_flash_din`	states[0], flash_state[1], flash_din_params[2]	Bidirectional: inject
`gpu_flash_model_step`	states[0], flash_state[1], flash_model_params[2], flash_data[3]	Bidirectional: sample+advance
`gpu_io_step`	states[0], uart_state[1], uart_params[2], uart_channel[3], wb_channel[4], wb_params[5], bus_channel[6], bus_params[7]	Observe-only (UART + Wishbone + AHB/APB bus trace)

All run on thread 0 only — the per-tick work is a trivial FSM step. gpu_io_step combines three logically independent observe-only models, gated by n_uarts > 0, has_trace, and n_buses > 0 respectively.

Config-driven bus monitor (AHB/APB)

The Wishbone trace (build_wb_trace_params) hardcodes one SoC's signal names (cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …) directly in source. The AHB/APB bus tracer generalizes it into a config-driven, protocol-aware monitor that is the model for future bus tracing:

Config (BusTraceConfig): name, protocol (apb3 / ahb-lite / ahb5), hierarchical prefix, addr_bits/data_bits, and optional per-pin signals overrides. Pins default to {prefix}{pin}.
Pin binding: protocol pin names (psel, paddr, …) are resolved to output-state positions via resolve_to_state_pos in trace_signals.rs — the same multi-candidate resolver --trace-signals uses, so Yosys-flattened / scalar-expanded / structural naming all work. The pins are registered as observables before partitioning (via DesignArgs::extra_observable_signals) so they get state-buffer slots.
GPU capture / CPU decode split: the kernel is protocol-agnostic — it packs a raw beat (addr, wdata, rdata, ctrl flags) into the ring buffer on the protocol's gating edge (psel & penable & pready for APB), using rising-edge detection so exactly one beat is recorded per completed transfer. The protocol FSM (phase pairing, burst tracking, response decode) lives in plain, unit-testable Rust in src/sim/models/bus_trace.rs. APB3 is stateless (one beat = one transaction); AHB pairing is the Phase-2 extension.
Output: decoded transactions stream to CSV via --bus-trace-csv; annotated-VCD emission is a planned follow-up.
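The one-beat-per-transfer rule for APB can be sketched in a few lines. ApbSample and capture_beats are hypothetical names, and the sketch records only the address; the real kernel packs addr, wdata, rdata and control flags into the ring buffer.

```rust
/// One sampled view of the APB pins (hypothetical struct for illustration).
#[derive(Clone, Copy)]
pub struct ApbSample {
    pub psel: bool,
    pub penable: bool,
    pub pready: bool,
    pub paddr: u32,
}

/// Record a beat only on the rising edge of the gating condition
/// psel & penable & pready, so a gate that stays high across several
/// samples yields exactly one beat.
pub fn capture_beats(samples: &[ApbSample]) -> Vec<u32> {
    let mut prev_gate = false;
    let mut beats = Vec::new();
    for s in samples {
        let gate = s.psel && s.penable && s.pready;
        if gate && !prev_gate {
            beats.push(s.paddr);
        }
        prev_gate = gate;
    }
    beats
}
```

Because APB3 is stateless, each captured beat in this sketch already corresponds to one complete transaction; no CPU-side pairing is needed.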

This is observe-only, so it slots into the existing post-simulate pattern. Migrating the hardcoded WbTrace onto this mechanism (expressing the VexRiscv ibus/dbus as configured buses) is a clean follow-up.

Target architecture

The two patterns (observe-only, bidirectional) and the common conventions (ring buffers, params structs, per-instance config arrays) should be codified so new peripherals follow a template:

Common conventions

Params struct layout: { u32 state_size; u32 n_active; u32 _pad[2]; PerInstanceConfig configs[MAX_N]; } — uniform header, compile-time MAX_N cap.
Ring buffer struct: { u32 write_head; u32 capacity; u32 _pad[2]; T data[CAP]; } — shared across all models producing GPU→CPU output.
Buffer sizing: always MAX_N elements regardless of n_active. Wastes negligible memory for small N.
Guard pattern: for (i = 0; i < n_active && i < MAX_N; i++) replaces the current has_foo != 0 booleans.
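On the Rust side, the params layout and guard pattern could look like the sketch below. MAX_N = 4, the PerInstanceConfig fields, and the active_configs helper are placeholders chosen for illustration; only the header shape (state_size, n_active, two padding words, then a fixed-size array) is taken from the conventions above.

```rust
/// Hypothetical compile-time cap on instances.
pub const MAX_N: usize = 4;

/// Placeholder per-instance config; real fields depend on the peripheral.
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct PerInstanceConfig {
    pub pin_pos: u32,
    pub divisor: u32,
}

/// Uniform header followed by a fixed-size array, always MAX_N long.
#[repr(C)]
pub struct ParamsAll {
    pub state_size: u32,
    pub n_active: u32,
    pub _pad: [u32; 2],
    pub configs: [PerInstanceConfig; MAX_N],
}

// Layout check evaluated at compile time: 16-byte header + MAX_N * 8 bytes.
const _: () = assert!(std::mem::size_of::<ParamsAll>() == 16 + MAX_N * 8);

/// Guard pattern: iterate i < n_active && i < MAX_N, expressed as a slice.
pub fn active_configs(p: &ParamsAll) -> &[PerInstanceConfig] {
    let n = (p.n_active as usize).min(MAX_N);
    &p.configs[..n]
}
```

Clamping n_active against MAX_N means an out-of-range count from config cannot index past the array, which is the purpose of the double bound in the guard.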

Model registration

New GPU-side models declare which pattern they follow:

Observe-only: register a post-simulate kernel. Receives output state (read-only), writes to ring buffer.
Bidirectional: register a pre-simulate kernel (inject into input state) and a post-simulate kernel (read output state, advance FSM).

Today this registration is implicit in cosim_metal.rs's encode_and_commit_gpu_batch. Formalizing it is a future step — the convention is sufficient while the model count is small.

Plural config convention

To support multi-instance peripherals (multiple UARTs, potentially multiple flash chips or RAM banks):

Legacy singular field kept via #[serde(default)].
New plural field alongside (e.g. uarts: Vec<UartConfig>).
effective_<peripheral>() -> Vec<Config> merges both.
Each config struct gains name: Option<String> for labelling.

This mirrors the existing effective_clocks() pattern.
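A minimal sketch of the merge, without serde so that it stays self-contained. The baud field is a placeholder, and placing the legacy singular entry first is an assumption (it matches the "folds into instance 0" wording used for QSPI memory, but is not confirmed for UARTs).

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct UartConfig {
    /// Optional label, as the convention requires.
    pub name: Option<String>,
    /// Placeholder field for illustration.
    pub baud: u32,
}

#[derive(Default)]
pub struct TestbenchConfig {
    /// Legacy singular field.
    pub uart: Option<UartConfig>,
    /// New plural field.
    pub uarts: Vec<UartConfig>,
}

impl TestbenchConfig {
    /// Merge both fields; the legacy entry (if any) comes first (assumed order).
    pub fn effective_uarts(&self) -> Vec<UartConfig> {
        self.uart.iter().chain(self.uarts.iter()).cloned().collect()
    }
}
```

Callers then iterate effective_uarts() only, so old single-UART configs and new multi-UART configs take the same code path.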

Cross-backend considerations

Cosim is Metal-only today. CUDA/HIP paths (kernel_v1_impl.cuh) implement the core simulation kernel but have no gpu_io_step or flash kernels. When CUDA/HIP cosim is added, the same two-pattern taxonomy applies — the kernel implementations will differ but the Rust-side buffer allocation, config resolution, and drain logic can be shared via feature-gated code in cosim_metal.rs (or a future cosim_common.rs).

Phasing

Phase	Scope	Status
1	Multi-UART (#90): first peripheral using plural-config + array-in-kernel conventions	Done
1b	Config-driven bus monitor, APB3 + CSV (GPU-capture/CPU-decode split)	Done
2	Refactor `gpu_io_step` to use common params/ring-buffer layout	Future
2b	AHB-Lite / AHB5 bus tracing + annotated-VCD output; migrate WbTrace onto the general monitor	Future
3	Multi-Flash / external RAM (bidirectional pattern)	Done — plural QSPI memory (`Vec<QspiMemoryConfig>`, N-instance kernels on Metal + CUDA/HIP) #170/#171; writable QSPI PSRAM (RAM mode) #159. See 2026-07-06 amendment.
—	Multi-JTAG	Not needed (TAP daisy-chain suffices)

Plan docs: ../plans/multi-peripheral-cosim.md, ../plans/bus-transaction-tracing.md.

Amendment 2026-06-21: interactive JTAG debug server config surface

The interactive JTAG/DM debug server (--jtag-server, #124) is the live-socket sibling of --jtag-replay. Its config additions follow this ADR's conventions; the execution-model decisions are in ADR 0017's 2026-06-21 amendment. Recorded here:

JtagConfig gains tdo_gpio: Option<usize> (src/testbench.rs). TDO is a design output, so unlike the existing tck/tms/tdi/trst_gpio (inputs, resolved via input_bits) it resolves via the output-bit map. It is the first JTAG pin that the model observes rather than drives; it is an Option so that replay configs without it keep working.
Resolves the "output_state readback not yet wired" note above for the JTAG case. The interactive server is the first CPU-side PeripheralModel that must read a design output (TDO, to answer remote_bitbang R), so it wires the real output slice into step_edge — the same plumbing the scaffolded I²C/SPI models were noted as needing. See ADR 0017's amendment for the execution-model detail.
New CLI flag --jtag-server <PORT>, mutually exclusive with --jtag-replay. Reuses the same jtag peripheral pin mapping and --jtag-hold-cycles semantics; opens a remote_bitbang TCP server and drives the configured pins live from the connected OpenOCD/gdb client. JTAG stays in the "Not yet" plural column (TAP daisy-chain suffices — single instance).

Plan: ../plans/jtag-debug-server.md.

Amendment 2026-07-06: plural QSPI memory + writable QSPI PSRAM (Phase 3 done)

The Flash peripheral went plural and gained a writable RAM mode, completing Phase 3. This resolves the plural-qspi-memory working handoff (folded here).

Plural config (Stage A, #170). flash: Option<FlashConfig> → qspi_memory: Vec<QspiMemoryConfig> with the standard back-compat pattern: the legacy flash key folds into instance 0 via effective_qspi_memory() (mirrors effective_uarts()). FlashConfig is now a type alias of QspiMemoryConfig. Each instance carries its own GPIO map, backing size, and (Stage C) firmware.
N-instance GPU kernels (Stage B, #171). FlashDinParams/FlashModelParams are wrapped in *All { u32 n_flashes; u32 _pad[3]; T[MAX_QSPI_MEMS] } blocks (exactly BusTraceParamsAll, the "target architecture" convention above). FlashModelParams gains a per-instance data_offset; flash_data is one concatenated buffer with each instance's store at its offset (independent backing stores). FlashState is an N-slot array; both kernels loop f < n_flashes instead of the old has_flash guard. Mirrored across Metal (kernel_v1.metal) and the shared CUDA/HIP kernel (kernel_v1_impl.cuh). Gotcha fixed in-flight: flash_set_in_reset must drive every slot's reset line, not just slot 0.
Writable QSPI PSRAM / RAM mode (#159). Opt-in FlashConfig fields (writable, enter_qpi_cmd, quad_write_cmd, read_dummy_cycles, size_bytes; firmware now optional) turn an instance into an APS6404L-class PSRAM: enter-QPI (0x35) latches 4-lane command sampling, quad-write (0x38) stores into the now-writable backing store, quad-read (0xEB) inserts 3 + read_dummy/2 dummy boundaries. Mirrored across the CPU CppSpiFlash oracle, Metal, and CUDA/HIP with #[repr(C)] size asserts in lockstep (FlashState 52B, FlashModelParams 52B); the CUDA/HIP flash_data buffer became a writable UnsafeCell<UVec<u8>>. Unset options ⇒ byte-identical to the read-only flash.
Tests. tests/multi_mem_cosim/ (3 flashes + 2 SRAMs, independent stores, all scope) and tests/qspi_psram/ (write→read round-trip, all + a dedicated qspi scope for the GPU suite). Goldens captured on CpuBackend, byte-identical to Metal/CUDA/HIP (Backend Equivalence CI green). Motivating use case: a C64-subset GF180 tapeout whose main RAM is external QSPI PSRAM.

ADR 0014 — AIG as simulation intermediate representation

Status: Accepted (amended 2026-06-25 — see the latch constraint below)

Context

Jacquard simulates gate-level RTL designs on GPUs by converting technology-mapped netlists into an executable form. The choice of intermediate representation (IR) determines how easily the design maps to GPU hardware, how much the representation compresses, and what classes of optimisation are available at compile time.

Gate-level netlists arrive from synthesis tools (Yosys, Synopsys DC) mapped to a variety of cell libraries: the project's own AIGPDK library, SKY130, or GF180MCU. Each library uses different cell names and pin conventions; the IR must abstract over these while preserving the combinational and sequential semantics exactly.

The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) describes a "virtual Boolean processor" that evaluates combinational logic as a tree of AND-with-invert operations — directly motivating an and-inverter graph.

Decision

1. Uniform AND-gate IR

All combinational logic is represented as an and-inverter graph (AIG). Every node in the combinational cone is one of:

pub enum DriverType {
    AndGate(usize, usize),     // inputs with inversion bits
    InputPort(usize),          // primary input
    InputClockFlag(usize, u8), // clock edge flag
    DFF(usize),                // sequential (D flip-flop output)
    SRAM(usize),               // memory block output
    Tie0,                      // constant zero
}

Only AndGate has combinational fan-in. The two operands carry an inversion bit in their LSB (aigpin_id << 1 | invert), giving the full {AND, NAND, NOR, OR} family with a single node type. Inverters and buffers are absorbed into the inversion bits rather than creating separate nodes, keeping the graph compact.

This uniformity is the key property: because every combinational node is the same (a XOR xa) AND (b XOR xb) operation, the boomerang reduction tree (ADR 0015) can execute them all with a single GPU instruction pattern — no opcode decode, no per-cell dispatch.
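The inversion-bit encoding can be made concrete with a small sketch. The helper names pin_iv and eval_and are illustrative (the text uses pin_iv for the packed value, not necessarily for a function), and values are modelled as one bool per AIG pin rather than bit-parallel words.

```rust
/// Pack an AIG pin id and an inversion flag: aigpin_id << 1 | invert.
pub fn pin_iv(aigpin_id: usize, invert: bool) -> usize {
    (aigpin_id << 1) | invert as usize
}

/// Evaluate one AndGate node: (a XOR xa) AND (b XOR xb), where xa and xb
/// are the LSBs of the packed operands.
pub fn eval_and(values: &[bool], a_iv: usize, b_iv: usize) -> bool {
    let a = values[a_iv >> 1] ^ ((a_iv & 1) == 1);
    let b = values[b_iv >> 1] ^ ((b_iv & 1) == 1);
    a & b
}
```

With both operands inverted the node computes NOR; OR and NAND follow by De Morgan when the consumer of the node sets the inversion bit on its own operand, which is why no separate inverter node is needed.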

2. Conversion path: NetlistDB to AIG

The conversion is implemented in src/aig.rs via AIG::from_netlistdb_impl(). It handles three cell library families, plus user-supplied cells via RuntimeCellLibrary:

Library	Strategy
AIGPDK (native)	Cells are already AND gates, DFFs, SRAMs — direct mapping
SKY130	Load Verilog behavioural models from `vendor/sky130_fd_sc_hd/`, decompose each cell into AND gates via `decompose_with_pdk()`
GF180MCU	Load behavioural models from `vendor/gf180mcu_fd_sc_mcu7t5v0/`, decompose similarly
RuntimeCellLibrary	User-supplied cell metadata (ADR 0010) for cells outside vendored PDKs

The decomposition process for technology-specific cells:

Clock tracing: Identify sequential cells (DFFs, SRAMs), trace clock pins to primary inputs, create InputClockFlag drivers for posedge/negedge detection.
Iterative DFS: Walk the netlist in topological order. For each unvisited output pin, recursively decompose driving cells into AND gates using the PDK behavioural models. An and_gate_cache deduplicates structurally identical sub-expressions.
Multi-output cells: SKY130 cells like full adders with multiple outputs get special handling — shared sub-expressions are computed once and reused via postprocess hooks.
Fanout construction: After all pins are processed, CSR-format fanout arrays are built for efficient traversal.

AIG pins are guaranteed to be in topological order (pin i is defined before any pin that depends on it), which the downstream pipeline relies on for level computation and scheduling.
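The deduplication and the topological-order guarantee fit together as in the sketch below. AigBuilder is a hypothetical type, and normalising the operand order before the cache lookup is an assumption about how structurally identical gates are recognised; the real and_gate_cache may key differently.

```rust
use std::collections::HashMap;

/// Minimal AIG builder: pin 0 is Tie0, pins 1..=num_inputs are inputs,
/// and AND gates receive the next free pin id.
pub struct AigBuilder {
    pub num_pins: usize,
    pub and_gates: Vec<(usize, usize)>,
    and_gate_cache: HashMap<(usize, usize), usize>,
}

impl AigBuilder {
    pub fn new(num_inputs: usize) -> Self {
        AigBuilder {
            num_pins: num_inputs + 1,
            and_gates: Vec::new(),
            and_gate_cache: HashMap::new(),
        }
    }

    /// Return the pin for AND(a_iv, b_iv), reusing an existing gate when
    /// the same operand pair was seen before. Operands are packed as
    /// pin << 1 | invert.
    pub fn add_and(&mut self, a_iv: usize, b_iv: usize) -> usize {
        // AND is commutative, so sort the operands to get one cache key.
        let key = if a_iv <= b_iv { (a_iv, b_iv) } else { (b_iv, a_iv) };
        if let Some(&pin) = self.and_gate_cache.get(&key) {
            return pin;
        }
        let pin = self.num_pins;
        // Operands must already exist, so the new pin id is larger than
        // both: this is what keeps pins in topological order.
        assert!((key.1 >> 1) < pin, "operand refers to an undefined pin");
        self.num_pins += 1;
        self.and_gates.push(key);
        self.and_gate_cache.insert(key, pin);
        pin
    }
}
```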

3. EndpointGroup abstraction

The AIG partitions its outputs into endpoint groups — the units of work that partitions must realise:

pub enum EndpointGroup<'i> {
    PrimaryOutput(usize),           // top-level output pin
    DFF(&'i DFF),                   // D flip-flop: data + clock-enable
    RAMBlock(&'i RAMBlock),         // SRAM: addr, data, enables
    SimControl(&'i SimControlNode), // $stop/$finish
    Display(&'i DisplayNode),       // $display/$write
    StagedIOPin(usize),             // inter-stage boundary (from --level-split)
}

Each variant bundles the signals that must be evaluated together: a DFF needs both its D input and clock-enable; an SRAM needs address, data, and write-enable buses. The for_each_input() method enumerates all AIG pins feeding an endpoint group, which the hypergraph partitioner (RepCut) uses to build connectivity and the partition executor (pe.rs) uses to determine resource requirements.

This grouping is important because the boomerang reduction tree produces results in 32-bit-aligned write-out slots. Endpoint groups that share a write-out slot are co-located in the hierarchy; groups that need different clock-enable conditions (e.g., two DFFs with different clocks driving the same data pin) generate "output duplicates" that consume additional write-out capacity.

4. Why AIG over alternatives

BDDs (Binary Decision Diagrams): BDDs can represent Boolean functions canonically but suffer from exponential blowup for many practical circuits (e.g., multipliers). The canonical form is useful for equivalence checking but unnecessary for simulation, where we just need to evaluate. BDDs also have no natural mapping to the GPU's SIMT execution model.

Truth tables / LUTs: Lookup tables scale exponentially with input count. A 6-input LUT (as in Xilinx FPGAs) covers individual cells efficiently but doesn't compose — cascading LUTs requires separate evaluation steps. AIGs compose naturally: the output of one AND gate feeds the input of the next, forming a tree that maps directly to the boomerang hierarchy.

Technology-mapped netlist (direct execution): Keeping the original cell library would require per-cell-type dispatch in the GPU kernel — a conditional branch per node. GPU SIMT execution penalises warp divergence heavily; a uniform operation eliminates this entirely. The conversion cost (one-time decomposition at compile time) is negligible compared to the simulation runtime.

MIG (Majority-Inverter Graph): MIGs are a more compact representation (3-input majority gates) but the 3-input structure doesn't map as cleanly to binary reduction trees. AIGs are the industry standard for synthesis and verification tools (ABC, AIGER format), making interop straightforward.

The AIG's key advantage is that it reduces the GPU kernel to a single bit-parallel operation repeated across a hierarchical tree — no opcode dispatch, no conditional branching, maximum SIMT utilisation.

Consequences

Enables:

The boomerang reduction tree (ADR 0015) works because every node is the same AND-with-invert operation. A heterogeneous IR would require per-node dispatch and break the hierarchical reduction pattern.
Technology independence: the same GPU kernel and partition executor handle AIGPDK, SKY130, and GF180MCU designs. Adding a new PDK requires only a decomposition module, not kernel changes.
Structural deduplication via and_gate_cache reduces graph size when multiple cells share sub-expressions.
The inversion-bit encoding (pin_iv = aigpin << 1 | invert) eliminates inverter/buffer nodes entirely — these are free in hardware too, so the IR's size correlates better with actual simulation cost than a technology-mapped netlist would.

Constrains:

No latches or async logic.

Amendment (2026-06-25): "No latches" is too blunt. The constraint is specifically a raw LATCH cell left in the logic. Two structured latch-derived uses are supported and not rejected: clock gating via the CKLNQD integrated clock-gating cell (an ICG is internally a latch + AND, identified as a gated clock — aig.rs:845), and latch-based register files / memory mapped to $__RAMGEM_SYNC_ SRAM cells by the memory-synthesis step (aig.rs:830). Asynchronous set/reset on flip-flops is also supported (it lowers to an AIG overlay, wire_dff_reset_set_overlay); "async reset" was never the restriction. What remains unsupported is raw level-sensitive latches in the logic and asynchronous sequential (self-timed) feedback. See docs/simulation-architecture.md § "No Latch or Asynchronous Sequential Logic Support".

Original decision: The AIG assumes clean register boundaries: DFFs capture on clock edges, combinational logic is acyclic between registers. Level-sensitive latches and combinational loops would require iterative evaluation that the current pipeline doesn't support (see docs/simulation-architecture.md § "Known Issues").
Decomposition quality matters. A poor decomposition of a complex cell (e.g., a mux-heavy datapath cell) can produce a deep AND tree that requires more boomerang stages. The SKY130 and GF180MCU decompositions are hand-tuned for the common cells; exotic cells from other PDKs may decompose sub-optimally.
No gate-delay preservation in the AIG itself. The AIG is a functional (Boolean) representation. Timing information from Liberty/SDF is loaded separately and overlaid onto the AIG's pin structure via gate_delays and aigpin_cell_origins. This means the AIG construction can re-order or deduplicate nodes without worrying about timing — but it also means the timing model must reconstruct the mapping from AIG pins back to physical cells.

ADR 0015 — Boomerang execution model and GPU resource mapping

Status: Accepted

Context

Once the design is converted to an AIG (ADR 0014), the combinational logic must be mapped onto GPU hardware for parallel evaluation. GPUs offer massive parallelism but impose rigid constraints: fixed thread counts per block, limited shared memory, and synchronous SIMT execution within a warp/SIMD group.

The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) introduces a "virtual Boolean processor" organised as a boomerang hierarchical reduction tree. This ADR documents how the boomerang maps to GPU hardware, the resource limits it imposes, and the partitioning and instruction-generation pipeline that stays within those limits.

Decision

1. Boomerang reduction tree

A single GPU block (CUDA/HIP) or threadgroup (Metal) executes one partition of the design. Each partition evaluates a subset of the AIG's endpoint groups (DFFs, primary outputs, SRAMs, etc.) by reducing their combinational fan-in cones through a hierarchical binary tree called the boomerang.

The boomerang has BOOMERANG_NUM_STAGES = 13 levels, giving a reduction width of 2^13 = 8192 leaf positions. Each thread in the block handles 32 bits (one u32), so the block uses 8192 / 32 = 256 threads (NUM_THREADS_V1 in flatten.rs).

The 13 hierarchy levels map to three GPU execution mechanisms (shared-memory reduction, warp/SIMD shuffle, and in-register bit operations):

Levels	Width	GPU mechanism
hier[0]	8192 → 4096	256 threads, shared memory reduction (threads 128-255 compute, 0-127 supply inputs)
hier[1–3]	4096 → 512	Shared memory reduction with barrier between levels; only threads in `[hier_width, 2×hier_width)` compute — the rest idle
hier[4–7]	512 → 32	Warp/SIMD shuffle (`__shfl_down_sync` / `simd_shuffle_down`) — no barrier needed
hier[8–12]	32 → 1	Bit-level operations within a single `u32` on thread 0

At each level, every position computes (a XOR xora) AND ((b XOR xorb) OR orb), which is the AND-with-invert operation from the AIG plus a per-bit bypass of the second operand. When orb is all-ones, the second operand is forced to one and the position acts as a pass-through (forwarding input a, unchanged when xora is zero). This single instruction pattern handles AND gates, inversions, and wiring with zero branch divergence.
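A bit-parallel sketch of the per-position operation, on one u32 (32 positions at once). The function name boomerang_op is illustrative. The grouping of OR with the b operand is an assumption: it is the reading under which an all-ones orb forwards input a, as the text states.

```rust
/// One boomerang position, evaluated for 32 bits in parallel.
/// Assumed grouping: (a ^ xora) & ((b ^ xorb) | orb).
pub fn boomerang_op(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}
```

With all flags zero this is a plain AND; with xora and xorb all-ones it is a NOR; with orb all-ones and xora zero it forwards a. Because the flags are per-bit masks, different bits of the same word can play different roles in the same instruction.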

Between boomerang stages (when the AIG is too deep for a single 8192-wide tree), a shuffle permutation redistributes results from shared memory back to thread-local registers. The shuffle is encoded as 16-bit index pairs in the script, allowing arbitrary re-routing of signals between stages.

2. GPU resource limits and partition constraints

The boomerang's fixed geometry imposes hard resource limits on each partition. These are documented in src/pe.rs on the Partition struct:

Resource	Limit	Derivation
Unique inputs	8191	8192 leaf positions minus Tie0. Each input occupies a leaf slot; duplicates consume additional slots. Global-read rounds pack multiple state words into each thread's initial register.
Unique outputs	8191	Write-out slots in the boomerang hierarchy, addressed by stage+position pairs. Outputs include DFF data pins, primary outputs, and SRAM port signals.
Intermediate pins per stage	4095	The hier[1] level has `2^(13-1) = 4096` positions. One position is reserved for Tie0. Intermediates are AIG pins that are produced in one boomerang stage and consumed in the next.
SRAM output groups	64	`8192 / (32 * 4) = 64`. Each SRAM occupies 4 write-out groups (32-bit read-data, address, write-data, write-enable). `BOOMERANG_MAX_WRITEOUTS = 1 << (13 - 5) = 256` total write-out slots, of which SRAMs consume 4 each.

Write-out slots are 32-bit-aligned groups within the hier[1] level. The total write-out capacity is BOOMERANG_MAX_WRITEOUTS = 256. SRAMs and "output duplicates" (same data pin driven by DFFs with different clock enables) consume write-out slots from this budget. A quick_reject() pre-check catches obvious overflows before the expensive full build.

When a partition exceeds these limits, Partition::build_one() returns None and the partitioner must split the endpoint set further.
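The arithmetic behind the table can be checked with a few constants. BOOMERANG_NUM_STAGES, NUM_THREADS_V1 and BOOMERANG_MAX_WRITEOUTS are names from the text; the other constants, the PartitionDemand struct, and exceeds_limits are hypothetical and cover only the four table rows, not the full write-out accounting done by quick_reject() or Partition::build_one().

```rust
pub const BOOMERANG_NUM_STAGES: usize = 13;
/// 2^13 leaf positions.
pub const LEAVES: usize = 1 << BOOMERANG_NUM_STAGES;
/// One u32 (32 bits) per thread.
pub const NUM_THREADS_V1: usize = LEAVES / 32;
pub const BOOMERANG_MAX_WRITEOUTS: usize = 1 << (BOOMERANG_NUM_STAGES - 5);

/// Leaf positions minus the slot reserved for Tie0.
pub const MAX_UNIQUE_INPUTS: usize = LEAVES - 1;
pub const MAX_UNIQUE_OUTPUTS: usize = LEAVES - 1;
/// hier[1] positions minus the slot reserved for Tie0.
pub const MAX_INTERMEDIATES: usize = (1 << (BOOMERANG_NUM_STAGES - 1)) - 1;
/// Each SRAM occupies 4 write-out groups.
pub const MAX_SRAM_GROUPS: usize = BOOMERANG_MAX_WRITEOUTS / 4;

/// Hypothetical summary of what a candidate partition needs.
#[derive(Clone, Copy)]
pub struct PartitionDemand {
    pub inputs: usize,
    pub outputs: usize,
    pub intermediates_per_stage: usize,
    pub srams: usize,
}

/// True when any of the four table limits is exceeded.
pub fn exceeds_limits(d: &PartitionDemand) -> bool {
    d.inputs > MAX_UNIQUE_INPUTS
        || d.outputs > MAX_UNIQUE_OUTPUTS
        || d.intermediates_per_stage > MAX_INTERMEDIATES
        || d.srams > MAX_SRAM_GROUPS
}
```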

3. Hypergraph partitioning with RepCut

The design's endpoint groups are distributed across GPU blocks by RepCut (src/repcut.rs), which constructs a weighted hypergraph and partitions it using mt-kahypar.

Why a hypergraph, not a graph: In a standard graph, an edge connects exactly two vertices. But a single AIG node (an AND gate deep in the combinational cone) may be shared by many endpoint groups — its "edge" in the connectivity structure is a hyperedge spanning all groups that depend on it. Modelling this as pairwise graph edges would lose the information that cutting this one node simultaneously affects all connected groups. Hypergraph partitioning minimises the actual communication cost (shared signals that must be read from global memory by multiple blocks).

Why mt-kahypar: mt-kahypar is a state-of-the-art multilevel hypergraph partitioner with LargeK support (many partitions in one pass) and parallel execution. The implementation uses:

Preset::LargeK — optimised for k >> 2.
epsilon = 0.2 — 20% imbalance tolerance, giving the partitioner flexibility to reduce cut while keeping partitions roughly equal.
Objective::Soed — Sum of External Degrees, which counts how many partition boundaries each hyperedge crosses. This directly correlates with the number of global memory reads each block must perform.
Vertex weights proportional to estimated evaluation cost (accounting for sub-graph size and fanout sharing).
Hyperedge weights equal to the number of AIG nodes with that endpoint reachability pattern.
Hyperedge size cap at 1000 nodes (reservoir-sampled beyond that) to keep partitioning tractable for signals with extreme fanout.

The hypergraph construction itself is the bottleneck for large designs: for each AIG node, RepCut computes a bitset of which endpoint groups it can reach via forward traversal. Nodes with identical reachability sets are clustered into a single hyperedge. This is done in parallel across bitset blocks (REPCUT_BITSET_BLOCK_SIZE = 4096) using Rayon.
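The clustering step can be sketched sequentially as follows. The function names are hypothetical, the reachability bitsets are taken as given, and skipping nodes that reach no endpoint group is an assumption; the real implementation in src/repcut.rs works block-wise and in parallel with Rayon.

```rust
use std::collections::HashMap;

/// Indices of the set bits in a bitset stored as u64 words.
fn set_bits(bits: &[u64]) -> Vec<usize> {
    let mut out = Vec::new();
    for (w, &word) in bits.iter().enumerate() {
        let mut rest = word;
        while rest != 0 {
            out.push(w * 64 + rest.trailing_zeros() as usize);
            rest &= rest - 1; // clear the lowest set bit
        }
    }
    out
}

/// reach[i] is the bitset of endpoint groups reachable from AIG node i.
/// Nodes with identical bitsets collapse into one hyperedge whose weight
/// is the number of such nodes. Returns (member groups, weight), sorted.
pub fn cluster_hyperedges(reach: &[Vec<u64>]) -> Vec<(Vec<usize>, usize)> {
    let mut weights: HashMap<&[u64], usize> = HashMap::new();
    for r in reach {
        *weights.entry(r.as_slice()).or_insert(0) += 1;
    }
    let mut edges: Vec<(Vec<usize>, usize)> = weights
        .into_iter()
        .map(|(bits, w)| (set_bits(bits), w))
        .filter(|(groups, _)| !groups.is_empty()) // assumed: drop unreachable nodes
        .collect();
    edges.sort();
    edges
}
```

Two nodes that both feed endpoint groups 0 and 1 produce a single hyperedge {0, 1} of weight 2 rather than two separate edges, which is the weighting rule listed above.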

4. Greedy merge-back strategy

mt-kahypar produces an initial partition assignment, but the partition count is typically much larger than needed (set to 2x the number of GPU blocks). process_partitions() in pe.rs then aggressively merges partitions:

Bitset-based overlap scoring: For each pair of partitions, compute the union of their AIG node bitsets. The merge cost is |union| - max(|A|, |B|) — lower is better, indicating more shared sub-graph. This is O(num_aigpins/64) per pair instead of full DFS.
Speculative parallel trials: Merge candidates are sorted by overlap cost. Up to parallel_trial_stride merges are attempted in parallel using Rayon, with a cancel-on-success AtomicBool to abort remaining trials once a valid merge is found. The stride doubles on each iteration.
Quality gate: A merged partition is rejected if it would increase the maximum boomerang stage count beyond max_original_nstages + max_stage_degrad. This prevents merges that technically fit in resource limits but would degrade simulation throughput by adding extra boomerang stages.
Blacklisting: Failed merge attempts are blacklisted for that partition to avoid redundant retries. Cancelled (interrupted by a parallel success) trials are not blacklisted — the merge may still be valid in a future iteration.

The result: 2x-4x fewer partitions than the initial hypergraph solution, with each partition fully validated to fit within boomerang resource limits.
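The overlap score from the first step is cheap because it is pure popcount arithmetic. A sketch, with hypothetical function names and each partition's AIG node set given as a slice of u64 words:

```rust
/// Number of set bits in a bitset.
pub fn popcount(bits: &[u64]) -> usize {
    bits.iter().map(|w| w.count_ones() as usize).sum()
}

/// Merge cost |A ∪ B| - max(|A|, |B|); lower means more shared sub-graph.
/// Runs in O(num_aigpins / 64) word operations.
pub fn merge_cost(a: &[u64], b: &[u64]) -> usize {
    assert_eq!(a.len(), b.len(), "bitsets must cover the same pin range");
    let union: usize = a
        .iter()
        .zip(b)
        .map(|(x, y)| (x | y).count_ones() as usize)
        .sum();
    union - popcount(a).max(popcount(b))
}
```

A cost of zero means one partition's node set contains the other's, so merging adds no new nodes to the larger partition; disjoint sets cost the size of the smaller one.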

5. FlattenedScript instruction generation

src/flatten.rs converts partitions into FlattenedScriptV1 — a packed u32 instruction stream consumed directly by the GPU kernel. The script encodes:

Metadata section (256 u32): Per-partition control fields at fixed indices, followed by the write-out hook table:

Index	Field	Purpose
0	`num_stages`	Boomerang stage count
1	`is_last_part`	Flag: last partition in the design
2	`num_ios`	Number of write-out endpoints
3	`io_offset`	Start offset in global state buffer
4	`num_srams`	SRAM block count
5	`sram_offset`	SRAM start offset
6	`num_global_read_rounds`	Input read rounds
7	`num_output_duplicates`	Output duplication count
8	`is_x_capable`	X-propagation flag (ADR 0016)
9	`xmask_state_offset`	X-mask offset (when X-capable)
128..255	write-out hook table	Maps each thread to the boomerang stage+position where it captures its output

This layout is the load-bearing contract between Rust (flatten.rs) and the GPU kernel (kernel_v1.metal, kernel_v1_impl.cuh).

Global-read permutation (2 × NUM_THREADS_V1 per round): Each thread gets an index into the global state buffer and a bitmask. The thread reads one u32 from global memory and extracts the bits indicated by the mask using a pext-like loop. Rounds are packed to maximise throughput (each thread accumulates up to 32 bits across rounds). An index high-bit flag distinguishes previous-cycle state from current-cycle inter-stage intermediates.

That flag is a cross-backend wire format: one encoder (src/flatten.rs sets bit 31 of the word index) and four independent hand-written decoders (src/flatten.rs, src/sim/cpu_reference.rs, csrc/kernel_v1_impl.cuh for CUDA and HIP, and csrc/kernel_v1.metal). Every decoder must clear the flag from the index. Do not instead bias the base pointer by -2^31 to cancel the flag out of the subscript: that forms an out-of-bounds pointer whose behaviour is left to the compiler, and the sign error it invites put every staged read 2^32 words past the buffer. The fault was visible only on ROCm, the one backend that computes the address exactly rather than narrowing it to 32 bits (#203).
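A decoder that follows the rule above (mask the flag out of the index, never bias the pointer) might look like the sketch below. The names `STAGED_FLAG`, `decode_index`, `pext_u32`, and `global_read` are hypothetical. The flag polarity (set means "staged intermediate") and the two separate source slices are assumptions made for illustration.

```rust
/// Bit 31 of the word index is the flag described above.
/// Assumption: set means "current-cycle inter-stage intermediate".
pub const STAGED_FLAG: u32 = 1 << 31;

/// Splits a raw index into (word index with the flag cleared, flag).
pub fn decode_index(raw: u32) -> (usize, bool) {
    let staged = (raw & STAGED_FLAG) != 0;
    ((raw & !STAGED_FLAG) as usize, staged)
}

/// Portable pext-style loop: gathers the bits of `value` selected by `mask`
/// into the low bits of the result, lowest selected bit first.
pub fn pext_u32(value: u32, mut mask: u32) -> u32 {
    let mut out = 0u32;
    let mut k = 0u32;
    while mask != 0 {
        let bit = mask.trailing_zeros();
        out |= ((value >> bit) & 1) << k;
        k += 1;
        mask &= mask - 1; // clear the lowest set bit
    }
    out
}

/// One thread's read for one round: pick the source by flag, then extract.
pub fn global_read(prev_state: &[u32], staged: &[u32], raw_index: u32, mask: u32) -> u32 {
    let (idx, is_staged) = decode_index(raw_index);
    let word = if is_staged { staged[idx] } else { prev_state[idx] };
    pext_u32(word, mask)
}
```

Because the flag is removed before the subscript is formed, the index stays in range on every backend regardless of how wide the address arithmetic is.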
Boomerang sections (per stage, NUM_THREADS_V1 × 20 u32):
- 16 u32 per thread: shuffle permutation (16-bit index pairs selecting source bits from shared memory)
- 4 u32 per thread: AND-gate flags (xora, xorb, orb) plus a padding slot reused for gate-delay injection (u16 picoseconds)
Global write-out: SRAM and output-duplicate permutations, clock-enable conditions, and data-inversion flags for committing results back to the state buffer.

The entire script is uploaded to device memory once and read sequentially by the kernel. Script reads are overlapped with computation via double-buffering (reading the next stage's data while computing the current stage's AND gates).

6. Pipeline staging for deep circuits

When a design's combinational depth exceeds the boomerang's capacity, src/staging.rs splits the AIG into major stages at user-specified level thresholds (--level-split 30 or --level-split 20,40).

Each major stage gets its own StagedAIG with:

primary_inputs: the AIG pins produced by previous stages (or the design's actual primary inputs for the first stage).
primary_output_pins: live AIG pins at the split boundary that must be forwarded to the next stage.
endpoints: the original AIG endpoint groups whose combinational depth falls within this stage.

Major stages execute sequentially on the GPU (the kernel loops over them). Between stages, intermediate values are written to the output state buffer and re-read by the next stage's global-read permutation (indicated by the high-bit flag in the index).

Staging trades latency (more sequential kernel dispatches) for fitting within the 8192-wide boomerang. Without it, designs with >50-level combinational paths would fail partitioning entirely.
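The mapping from a node's combinational level to its major stage can be illustrated with a short sketch. The helper names are hypothetical, and whether a node sitting exactly on a threshold belongs to the earlier or the later stage is an assumption here (it is placed in the later one).

```rust
/// Parses a `--level-split`-style value such as "30" or "20,40"
/// into sorted, de-duplicated thresholds.
pub fn parse_level_split(arg: &str) -> Result<Vec<u32>, std::num::ParseIntError> {
    let mut thresholds = arg
        .split(',')
        .map(|s| s.trim().parse::<u32>())
        .collect::<Result<Vec<_>, _>>()?;
    thresholds.sort_unstable();
    thresholds.dedup();
    Ok(thresholds)
}

/// Major-stage index for a node at `level`: the number of thresholds
/// that are less than or equal to the level. `thresholds` must be sorted.
pub fn major_stage(level: u32, thresholds: &[u32]) -> usize {
    thresholds.partition_point(|&t| t <= level)
}
```

With thresholds 20 and 40, levels 0 to 19 fall in stage 0, levels 20 to 39 in stage 1, and everything deeper in stage 2.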

Consequences

Enables:

Fixed, branch-free GPU kernel. The kernel has no per-node dispatch — every thread executes the same AND-XOR-OR instruction at every boomerang level. This maximises SIMT utilisation across CUDA, HIP, and Metal.
Deterministic shared-memory budget. The 256-thread, 8192-bit boomerang uses a fixed amount of shared memory (threadgroup memory on Metal), independent of the design. No dynamic allocation, no shared-memory pressure variation between blocks.
Scalable partitioning. The hypergraph partitioner + greedy merge naturally adapts to designs from hundreds to millions of gates. Larger designs get more partitions; the GPU kernel is the same.
Technology independence at the kernel level. The GPU kernel knows nothing about AIGPDK, SKY130, or GF180MCU. It executes packed u32 scripts. All cell-library knowledge is absorbed during AIG construction and script generation.

Constrains:

8191-input/output ceiling per partition. Designs with extremely wide buses or highly connected sub-circuits may require aggressive partitioning, which increases inter-partition communication (global memory reads). The --level-split option helps by splitting deep cones into multiple stages, but wide cones remain fundamentally limited by the 8192-slot boomerang.
Write-out slot scarcity for SRAM-heavy designs. Each SRAM consumes 4 write-out slots. With BOOMERANG_MAX_WRITEOUTS = 256, a partition can hold at most 64 SRAMs — and fewer when output duplicates also need slots. Designs with many small memories may need finer partitioning than their gate count alone would suggest.
Fixed thread count. The 256-thread block size is hardcoded (NUM_THREADS_V1). On GPUs where the SM/CU could benefit from larger blocks (e.g., occupancy tuning), there's no flexibility. Changing this would require redesigning the boomerang hierarchy depth and the bit-packing in the script format.
Script size grows with partition depth. Each boomerang stage adds ~20 × 256 = 5120 u32 entries to the script. Very deep partitions (many boomerang stages) produce large scripts that may pressure GPU memory bandwidth for the script reads, though double-buffering mitigates this.
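The two numeric constraints above (write-out slots per SRAM and script words per boomerang stage) reduce to simple arithmetic. The sketch below uses the constants quoted in the text; the function names are hypothetical, and the caller is assumed to know how many write-out slots are already taken by output duplicates or other endpoints.

```rust
pub const NUM_THREADS_V1: usize = 256;
/// 16 shuffle words + 4 flag words per thread per boomerang stage.
pub const U32_PER_THREAD_PER_STAGE: usize = 20;
pub const BOOMERANG_MAX_WRITEOUTS: usize = 256;
pub const WRITEOUT_SLOTS_PER_SRAM: usize = 4;

/// Script words contributed by the boomerang sections alone
/// (metadata, global-read, and write-out sections are not counted).
pub fn boomerang_script_words(num_stages: usize) -> usize {
    num_stages * NUM_THREADS_V1 * U32_PER_THREAD_PER_STAGE
}

/// Upper bound on SRAMs in one partition once `slots_used_elsewhere`
/// write-out slots are already claimed.
pub fn max_srams(slots_used_elsewhere: usize) -> usize {
    BOOMERANG_MAX_WRITEOUTS.saturating_sub(slots_used_elsewhere) / WRITEOUT_SLOTS_PER_SRAM
}
```

For example, a partition that already spends 16 write-out slots elsewhere has room for at most 60 SRAMs rather than 64.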

ADR 0016 — Selective X-propagation

Status: Accepted (2026-05; amended 2026-06-25). Extension to cosim proposed 2026-06-03 — see Amendment.

Amendment (2026-06-25): Stage-count correction. The body says "Stages 1–6 are implemented; Stage 7 (dynamic X narrowing) is a future enhancement." Per docs/selective-x-propagation.md, all seven stages are implemented — Stage 7 is the Criterion benchmarks (benches/xprop.rs), not dynamic narrowing. Dynamic X narrowing (periodic X-mask scan, partition-kernel hot-swap) is a separate, still unbuilt enhancement.

Context

Jacquard's default two-state (0/1) simulation silently resolves uninitialised DFF and SRAM outputs to zero. This masks initialisation bugs that real hardware would expose as unknown (X) values, and creates false mismatches when comparing against four-state RTL simulators.

Naively upgrading the entire simulator to four-state logic would double storage and roughly halve throughput. In a well-designed SoC after reset, typically less than 5% of signals are genuinely X-capable.

Decision

Implement selective X-propagation controlled by the --xprop CLI flag. Static analysis at compile time identifies X-source signals (uninitialised DFFs, SRAM read ports); forward-cone computation classifies each partition as X-capable or X-free. Only X-capable partitions run an X-aware kernel variant; the rest continue with the fast two-state path.

The full seven-phase design, implementation details, and design rationale are in docs/selective-x-propagation.md. Stages 1–6 are implemented; Stage 7 (dynamic X narrowing) is a future enhancement.

Key design choices (summary)

Partition-level granularity — entire partition runs X-aware or not. ~95% of partitions are typically X-free after reset.
Conservative SRAM X — all reads return X until any write. Per-address tracking deferred.
No reset-aware analysis — all DFFs start as X; the fixpoint iteration naturally resolves reset-connected DFFs.
State buffer doubling — X-mask words occupy [reg_io_state_size .. 2*reg_io_state_size) when enabled. X-free partitions ignore the mask entirely.
Runtime flag, not compile-time — --xprop on jacquard sim; no new Cargo features needed.
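The partition-level classification can be pictured as a forward-cone reachability pass followed by a per-partition OR. The sketch below is a simplification under stated assumptions: the graph is a plain fan-out adjacency list, the function names are hypothetical, and the pass is purely structural (it ignores the fixpoint iteration that resolves reset-connected DFFs, so it over-approximates).

```rust
use std::collections::VecDeque;

/// Marks every node in the forward cone of the X-sources.
/// `fanout[n]` lists the nodes that read node `n`.
pub fn x_capable_nodes(fanout: &[Vec<usize>], x_sources: &[usize]) -> Vec<bool> {
    let mut x = vec![false; fanout.len()];
    let mut queue = VecDeque::new();
    for &s in x_sources {
        if !x[s] {
            x[s] = true;
            queue.push_back(s);
        }
    }
    while let Some(n) = queue.pop_front() {
        for &m in &fanout[n] {
            if !x[m] {
                x[m] = true;
                queue.push_back(m);
            }
        }
    }
    x
}

/// A partition is X-capable if any of its nodes is X-capable.
pub fn x_capable_partitions(partition_of: &[usize], num_partitions: usize, x: &[bool]) -> Vec<bool> {
    let mut out = vec![false; num_partitions];
    for (node, &p) in partition_of.iter().enumerate() {
        if x[node] {
            out[p] = true;
        }
    }
    out
}
```

Only the partitions flagged `true` would need the X-aware kernel variant; the others keep the two-state path.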

Consequences

X-capable partitions pay ~2× storage and ALU cost; X-free partitions (the vast majority) pay nothing.
VCD output includes x values when --xprop is enabled, compatible with standard four-state VCD tools.
The --check-with-cpu reference path includes an X-aware CPU kernel for validation.
Benchmarks (benches/xprop.rs) track the overhead.

Amendment 2026-06-03: cosim and IO X-sources

The original decision wired --xprop into the sim (static-input) path only. The reactive cosim path is two-state, so JTAG-replay / peripheral runs silently zero-init uninitialised state (#95). This amendment extends selective X-propagation to cosim.

Two points the original design did not address, because the static sim path never had to:

Undriven input pads are X. In a reactive run, peripheral models drive only some input bits each edge (clock, reset, JTAG/UART pins, configured constants). Every primary-input bit not in that driven set is unconnected and must be X, not 0.
Bidir pad reads. A bi_24t pad's core-read was originally modelled Y = PAD (tristate not modelled); since PAD is an undriven primary input, bidir reads fell out of rule (1) as X — safe (false-X, never false-0) but pessimistic for the OE=1 loopback. The combinationally-correct read Y = OE ? A : external is now modelled as a mux in the AIG (#96, implemented — see the dated subsection below). (An earlier draft of this amendment proposed a per-edge OE→input feedback with one-edge latency — that was wrong; the correct read is combinational.)

So the X-source taxonomy is now three-way: uninitialised DFF, uninitialised SRAM (both as before), and undriven input pads (which subsumes bidir reads under the current Y = PAD model). The first two are sequential power-up X; the third is the reactive IO X-source specific to cosim.

The Metal kernel is already X-capable, so this is host-side reactive plumbing (state-buffer expansion, per-edge X-mask maintenance, and an observe-kernel output-offset fix for the doubled layout). Phasing and risks are in ../plans/cosim-xprop.md.

Seed-template correction (2026-06-03)

Implementing the cosim extension surfaced a latent bug in the shipped sim path too: the power-up X-mask seed (expand_states_for_xprop) was built as "all-X, then clear every input_map position." But input_map contains the DFF-Q combinational-read positions, not just primary input ports — so uninitialised DFFs were read as known 0 and X never originated. --xprop was therefore silently two-state for any sequential design (the gate-level X math was unit-tested, but no end-to-end test asserted X surfacing from an uninitialised DFF).

The seed is now built by vcd_io::xprop_xmask_template: all-known, set X only at genuine X-source positions (uninitialised DFF Q reads + SRAM reads), excluding primary inputs (nets present in input_layout) and constant-pinned DFFs (const_zero_pos = input_layout.len()). The cosim path additionally seeds the output slot's X-mask, since its per-edge state_prep copies output→input before the first simulate. This corrects design choice #3 above ("all DFFs start as X"): the intent was always X at DFF positions; the implementation had inverted it for DFF-feedback reads.
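The corrected construction ("all-known, then set X only at genuine X-sources") can be sketched as below. This is not the body of vcd_io::xprop_xmask_template: the function name, the parameter split, and the use of flat bit positions are assumptions chosen to mirror the prose.

```rust
/// Builds a packed X-mask (1 bit = X) for `state_bits` state positions.
/// Starts all-known and sets X only at DFF-Q and SRAM read positions,
/// skipping primary inputs and constant-pinned DFFs.
pub fn xmask_seed(
    state_bits: usize,
    dff_q_reads: &[usize],
    sram_reads: &[usize],
    primary_inputs: &[usize],
    constant_pinned: &[usize],
) -> Vec<u32> {
    let mut mask = vec![0u32; (state_bits + 31) / 32];
    for &pos in dff_q_reads.iter().chain(sram_reads) {
        let excluded = primary_inputs.contains(&pos) || constant_pinned.contains(&pos);
        if !excluded {
            mask[pos / 32] |= 1u32 << (pos % 32);
        }
    }
    mask
}
```

The earlier, inverted construction started from all-X and cleared every input_map position, which also cleared the DFF-Q read positions; building up from all-known avoids that failure mode by construction.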

Undriven input X-source (cosim, implemented 2026-06-03)

The "undriven input pad → X" rule from this amendment is now implemented for cosim. compute_x_capable_pins(treat_inputs_as_x_sources) (gated by DesignArgs::xprop_undriven_inputs, set only by the cosim path) marks input cones X-capable; vcd_io::xprop_xmask_template_cosim seeds every primary input as X; and the GPU kernels clear the X-mask of each bit they drive each edge — state_prep for the build_edge_ops driven set (clock/reset/constants/model pins) and gpu_apply_flash_din for the SPI MISO bits it writes directly (they bypass state_prep). The complement — genuinely undriven inputs — stays X. sim keeps inputs known (driven from the VCD) and pays no extra X-aware cost. End-to-end guards covering the DFF and undriven-input X-sources, in both sim and cosim, live in tests/xprop_cosim/ (CI, fatal).

Bidir tristate read-back mux (implemented 2026-06-04)

Point #2 above is now implemented (#96). AIG::from_netlistdb's bi_24t branch builds Y = OE ? A : external combinationally in the AIG — OR(AND(OE, A), AND(!OE, PAD)) via the De Morgan idiom already used by wire_dff_reset_set_overlay — instead of the conservative Y = PAD. The external arm is the same undriven PAD primary input (X under rule (1) until a peripheral model drives it); the OE=1 arm reads the core's own drive A, so the loopback is X-exact (known whenever A is) and the two-state read returns A, not the external stim. This removes bidir reads from the "undriven input → X" subsumption for the OE=1 case; they are now exact rather than conservatively-X. in_c/in_s stay Y = PAD. Without both A and OE pins the conservative Y = PAD still stands. Unit test: aig::gf180mcu_chip_top_tests::bi_24t_models_tristate_readback evaluates the full Y truth table. (#107's $isunknown x-assert work can now assert bidir read-backs go definite when OE is asserted.)
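The claim that the OE=1 arm is X-exact can be checked with a small three-valued model of the mux. This sketch uses a scalar Kleene-style value type rather than the kernel's packed value/X-mask words, and all names in it are hypothetical; it only mirrors the OR(AND(OE, A), AND(!OE, PAD)) structure built from AND and NOT.

```rust
/// Three-valued signal used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum V {
    Zero,
    One,
    X,
}

pub fn not3(a: V) -> V {
    match a {
        V::Zero => V::One,
        V::One => V::Zero,
        V::X => V::X,
    }
}

/// AND where a known 0 on either side dominates an X on the other.
pub fn and3(a: V, b: V) -> V {
    match (a, b) {
        (V::Zero, _) | (_, V::Zero) => V::Zero,
        (V::One, V::One) => V::One,
        _ => V::X,
    }
}

/// Y = OE ? A : PAD, with the OR expressed through De Morgan.
pub fn bidir_read(oe: V, a: V, pad: V) -> V {
    let core_arm = and3(oe, a);
    let external_arm = and3(not3(oe), pad);
    not3(and3(not3(core_arm), not3(external_arm)))
}
```

With OE known, the result equals the selected arm exactly, so an X on the undriven PAD does not leak into the read when OE=1.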

ADR 0017 — Cosim execution model

Status: Accepted (amended 2026-06-25).

Amendment (2026-06-25): The original body describes cosim as Metal-only (run_cosim in cosim_metal.rs, cmd_cosim hard-erroring on other backends) and step_edge receiving an empty output_state. Both are now stale: CudaBackend and HipBackend implement CosimBackend (src/sim/cosim/cuda.rs, hip.rs; dispatched from jacquard.rs), and step_edge receives the real output slice. The later in-body amendments (2026-06-07/06-19/06-21) document this architecture; this note flags that the original "Metal-only" wording reflects the state at initial acceptance, not current reality.

Context

The cosim mode runs a GPU-simulated design alongside reactive peripheral models (flash, UART, JTAG, GPIO) that drive and observe design pins each clock edge. The execution model must balance two competing needs: GPU throughput (which favours large batches of edges dispatched as a single command buffer) and peripheral responsiveness (which requires CPU-side model updates between edges).

This ADR documents the batch dispatch loop, the multi-clock scheduler, and the time-domain abstractions that tie them together.

Decision

Batch dispatch loop

The cosim main loop groups consecutive scheduler edges into batches of up to BATCH_SIZE = 1024 edges. Each batch is encoded into a single Metal command buffer and dispatched to the GPU. Between batches, CPU-side peripheral models (PeripheralModel::step_edge) run, ring buffers are drained, and model overrides are compiled into BitOp arrays for the next batch.

Per-edge execution within a batch:

state_prep (apply clk/gpio/jtag pin drives via BitOps)
  → gpu_apply_flash_din (inject flash MISO into input state)
    → simulate_v1_stage ×N (combinational logic evaluation)
  → gpu_flash_model_step (read MOSI, advance flash FSM)
  → gpu_io_step (UART TX decode + Wishbone bus trace)

CPU-side models cannot observe intra-batch state changes — they see the output state only after the batch completes. For peripherals that require per-edge responsiveness (e.g. JTAG replay with tight hold-cycle requirements), the batch is forced to size 1 when any model reports is_active() == true.
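The batch-size policy described above is small enough to sketch directly. The function names are hypothetical, and polling model activity once before each dispatch is an assumption about where the check sits in the loop.

```rust
pub const BATCH_SIZE: usize = 1024;

/// Number of edges in the next dispatch: 1 while any model is active,
/// otherwise up to BATCH_SIZE, never more than what remains.
pub fn next_batch_len(remaining_edges: usize, any_model_active: bool) -> usize {
    if any_model_active {
        remaining_edges.min(1)
    } else {
        remaining_edges.min(BATCH_SIZE)
    }
}

/// Counts dispatches for a run of `total_edges`, asking `is_active_at`
/// (a stand-in for "any model reports is_active()") before each one.
pub fn count_dispatches(total_edges: usize, is_active_at: impl Fn(usize) -> bool) -> usize {
    let (mut edge, mut dispatches) = (0, 0);
    while edge < total_edges {
        edge += next_batch_len(total_edges - edge, is_active_at(edge));
        dispatches += 1;
    }
    dispatches
}
```

A 2048-edge run in which a model is active for ten edges starting at edge 1024 needs twelve dispatches under this policy: one full batch, ten single edges, and one final partial batch.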

Why BATCH_SIZE = 1024

The batch size trades off GPU utilisation against peripheral latency. Smaller batches → more Metal command buffer submissions per second → higher overhead. Larger batches → staler CPU-side model state. 1024 was chosen empirically as a sweet spot:

For peripheral-free simulation: amortises ~1ms of command buffer overhead across 1024 edges ≈ 1µs/edge overhead.
For active peripherals (JTAG, stimulus-driven): the is_active fallback to batch=1 ensures correctness regardless of batch size.
The batch size only affects cosim; the sim command processes the entire VCD in one GPU dispatch.

Pre-allocated schedule buffers

Each scheduler edge has pre-allocated Metal buffers for its StatePrepParams and BitOp array (ScheduleBuffers::edge_buffers). These are allocated once at startup — not per-dispatch — to avoid allocation latency in the hot loop. The schedule repeats with period edges_per_period (= LCM schedule length); edge N reuses buffer N % edges_per_period.

Multi-clock scheduler

The MultiClockScheduler computes a deterministic interleaving of edges across clock domains. Given N clocks with potentially different periods and phase offsets:

Compute gcd_ps = GCD of all half-periods and phase offsets. This is the scheduler tick — the minimum time quantum.
Compute lcm_ps = LCM of all full periods. This is the schedule period — the point at which the edge pattern repeats.
schedule_len = lcm_ps / gcd_ps — number of ticks per period.
For each tick, compute which domains have rising/falling edges based on (tick_ps - phase_offset) % half_period == 0.

The schedule length is capped at 1,000,000 ticks. This prevents degenerate clock ratios (e.g. periods that share no common factor) from producing unbounded schedules. If the cap is hit, an assertion fires with a message suggesting the clocks may not be commensurable at the configured resolution.
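The four steps and the cap can be sketched as follows. This is not the MultiClockScheduler itself: `Clock`, `Schedule`, `build_schedule`, and `edge_at` are hypothetical names, the cap is reported as an `Err` instead of an assertion, periods are assumed even, and the first edge at the phase offset is assumed to be rising.

```rust
pub const MAX_SCHEDULE_LEN: u64 = 1_000_000;

#[derive(Debug, Clone, Copy)]
pub struct Clock {
    pub period_ps: u64,
    pub phase_offset_ps: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Schedule {
    pub gcd_ps: u64,
    pub lcm_ps: u64,
    pub schedule_len: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Edge {
    Rising,
    Falling,
}

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

pub fn build_schedule(clocks: &[Clock]) -> Result<Schedule, String> {
    if clocks.is_empty() || clocks.iter().any(|c| c.period_ps == 0 || c.period_ps % 2 != 0) {
        return Err("need at least one clock with an even, non-zero period".into());
    }
    let (mut gcd_ps, mut lcm_ps) = (0u64, 1u64);
    for c in clocks {
        // Tick = GCD of all half-periods and phase offsets.
        gcd_ps = gcd(gcd(gcd_ps, c.period_ps / 2), c.phase_offset_ps);
        // Schedule period = LCM of all full periods.
        lcm_ps = (lcm_ps / gcd(lcm_ps, c.period_ps))
            .checked_mul(c.period_ps)
            .ok_or_else(|| "LCM overflow".to_string())?;
    }
    let schedule_len = lcm_ps / gcd_ps;
    if schedule_len > MAX_SCHEDULE_LEN {
        return Err(format!("schedule of {schedule_len} ticks exceeds the cap"));
    }
    Ok(Schedule { gcd_ps, lcm_ps, schedule_len })
}

/// Edge of `clock` at scheduler tick `tick`, if any.
pub fn edge_at(clock: &Clock, tick: u64, gcd_ps: u64) -> Option<Edge> {
    let t = tick * gcd_ps;
    let d = t.checked_sub(clock.phase_offset_ps)?;
    let half = clock.period_ps / 2;
    if d % half != 0 {
        return None;
    }
    Some(if (d / half) % 2 == 0 { Edge::Rising } else { Edge::Falling })
}
```

For a 10 ns clock and a 4 ns clock with no phase offsets, this gives a 1000 ps tick and a 20-tick schedule; a single clock always yields a 2-tick schedule.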

Time units: edges vs clock cycles

A scheduler edge is one tick of the scheduler (duration = gcd_ps). A clock cycle is two half-periods of a given domain (= rising + falling edge). The ratio sched_ticks_per_sys_clk_cycle = clock_period_ps / gcd_ps converts between them. Note this ratio is the number of scheduler ticks per sys_clk period, which is 2 only when gcd_ps equals the half-period (single-clock or harmonic multi-clock); non-commensurate periods or phase offsets make it larger.

This distinction is load-bearing for peripheral timing:

UART baud rate dividers count edges, not clock cycles.
Reset duration counts edges.
The --max-clock-edges CLI flag counts edges.

Confusing edges with clock cycles was the root cause of the UART baud rate bug fixed in commit a263e47 — edges_per_period (the LCM schedule length) was used where sched_ticks_per_sys_clk_cycle was needed, doubling the bit time in multi-clock designs.
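The conversion itself is a one-liner, which is part of why the mix-up is easy to make. The sketch below restates the ratio from the text; the signatures are assumptions (the real helper may take different arguments), and the numbers in the note after it are example values only.

```rust
/// Scheduler ticks per sys_clk period: clock_period_ps / gcd_ps.
pub fn sched_ticks_per_sys_clk_cycle(clock_period_ps: u64, gcd_ps: u64) -> u64 {
    assert!(
        gcd_ps > 0 && clock_period_ps % gcd_ps == 0,
        "the scheduler tick must divide the clock period"
    );
    clock_period_ps / gcd_ps
}

/// Converts a user-facing cycle count into scheduler ticks (edges).
pub fn cycles_to_ticks(cycles: u64, clock_period_ps: u64, gcd_ps: u64) -> u64 {
    cycles * sched_ticks_per_sys_clk_cycle(clock_period_ps, gcd_ps)
}
```

Take a 10 ns sys_clk next to a 4 ns second clock: the tick is 1000 ps, so 16 sys_clk cycles are 160 ticks. The LCM schedule length in that configuration is 20, and multiplying by it instead would give 320 ticks, twice the intended duration.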

GPU→CPU ring buffer drain

After each batch completes, the CPU drains three categories of GPU-side state:

Peripheral ring buffers — UART channels and Wishbone trace channel, drained from local read_head to GPU-written write_head. See ADR 0013 for struct conventions.
VCD snapshot buffer — when --stimulus-vcd or --output-vcd is enabled, a separate ring buffer (2 × state_size words per edge) captures per-tick output state on the GPU. The CPU drains it after each batch to write VCD transitions. This mechanism is what enables BATCH_SIZE > 1 even with VCD output — without it, the CPU would need to read output state after every single edge.
CPU reference check — when --check-with-cpu is active, the CPU replays the batch with the reference kernel and compares.

No synchronisation beyond Metal's command buffer completion is needed — all drains happen after waitUntilCompleted.
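A host-side drain from a local read_head up to the GPU-written write_head could look like the sketch below. The layout is an assumption, not the ADR 0013 convention: heads are taken to be free-running u32 counters, the capacity is taken to be a power of two, and entries overwritten by a producer that lapped the consumer are counted as lost.

```rust
/// Drains entries in [read_head, write_head) into `out`.
/// Returns how many entries were overwritten before they could be read.
/// Assumptions: free-running head counters, power-of-two capacity.
pub fn drain_ring<T: Copy>(
    slots: &[T],
    write_head: u32,
    read_head: &mut u32,
    out: &mut Vec<T>,
) -> u32 {
    let cap = slots.len() as u32;
    if cap == 0 {
        return 0;
    }
    let pending = write_head.wrapping_sub(*read_head);
    // If the producer lapped us, skip ahead to the oldest surviving entry.
    let lost = pending.saturating_sub(cap);
    *read_head = read_head.wrapping_add(lost);
    while *read_head != write_head {
        out.push(slots[(*read_head % cap) as usize]);
        *read_head = read_head.wrapping_add(1);
    }
    lost
}
```

Because the drain runs only after the batch has completed, the write head is stable for the duration of the loop and no atomics are needed in this sketch.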

Consequences

The batch dispatch model means CPU-side peripheral models see output state with up to BATCH_SIZE edges of latency. This is acceptable for all current peripherals; models that need tighter coupling set is_active() = true.
The 1M tick schedule cap prevents pathological memory use but rejects exotic clock ratios. A min-heap scheduler (proposed in docs/plans/multi-clock-and-stimulus-architecture.md as MC.2) would remove this limit.
The edges-vs-cycles distinction must be maintained carefully in any code that converts user-facing "cycles" to internal "ticks". The sched_ticks_per_sys_clk_cycle helper exists for this purpose.
Pre-allocated schedule buffers consume O(schedule_len) Metal buffer pairs at startup. Each schedule entry creates two Metal buffer objects (params + ops). For typical single-clock designs this is 2 entries = 4 buffer objects; for complex multi-clock designs it can reach thousands of entries, but each buffer is small (tens of bytes).

Amendment 2026-06-07: backend-portable cosim — target architecture (#105)

The execution model above is Metal-only — run_cosim lives in cosim_metal.rs (gated #[cfg(feature = "metal")]) and cmd_cosim hard-errors on other backends. This amendment records the target architecture for making cosim backend-portable (CPU reference + CUDA/HIP), tracked as #105. It supersedes the incremental 2026-06-05 note (whose "per-edge on every backend" framing the measurements below correct). It describes the steady-state design; the staging to reach it lives in docs/plans/cosim-backend-portability.md. It does not change the batch/scheduler model above — it factors where each part runs and along what seam.

The evidence: measured batch utilisation (2026-06-07)

The cosim loop was instrumented (telemetry in the run summary: single_edge_batches, mean/max batch) to measure how often the batched fast path (batch > 1) runs versus forced single-edge dispatch (force_single_edge = any_model_active, plus diagnostic modes). Per-edge handover is the only mode needing a true per-edge CPU↔GPU round-trip.

Fixture	Edges	Batched (% of edges)	Single-edge commits	Total commits
`dual_uart`	10,000	100%	0	11
`apb_trace`	200	100%	0	2
`xprop_cosim`	40	100%	0	2
`jtag_minimal`	4,000,000	97.4%	102,310	106,117

Designs whose peripherals have GPU-side halves (UART, APB bus-trace, SPI flash — the gpu_io_step / gpu_flash_model_step kernels) run 100% batched: CPU↔GPU handover is at BATCH_SIZE-edge boundaries, not per clock edge. Even jtag_minimal (CPU-side JTAG replay, the most per-edge-heavy fixture) batches 97% of edges — but its 102k single-edge commits are 96% of all submits and dominate its wall-clock. Batching is the dominant path; per-edge is the exception (CPU-side models + diagnostic modes). This drives every decision below.

Layer 1 — backend-agnostic orchestration (the shared `cosim` driver)

Everything that is not GPU-specific moves above the seam and operates on &[u32] state + Vec<BitOp> edge ops: the MultiClockScheduler, build_edge_ops, the batch-size policy (force_single_edge), peripheral coordination, the input dispatcher, reset/constant init, VCD writing, and event/ring-buffer draining. The batch-size decision stays here, unchanged.

Layer 2 — the `CosimBackend` trait (one impl per backend)

Owns the [2 × state_size] design state and runs the design. Crucially it is batch-granular, not single-edge — the measurements show a literal simulate_edge-per-edge trait would collapse Metal's 100%-batched designs to one command-buffer submit per edge (~1000× regression). The trait method is therefore "run N consecutive scheduler edges, applying each edge's ops and snapshotting each output slot to the ring", plus state_prep (output→input copy + apply BitOps + clear driven X-mask) and input_state()/output_state() accessors. MetalSimulator becomes MetalBackend; CpuBackend and Cuda/HipBackend are added.

The backend owns the schedule storage (opaque to the orchestration). The edge ops are a tiny, fixed, repeating set (edges_per_period entries, =2 for single-clock) built once; the orchestration must not hold a parallel copy that the backend re-materialises each dispatch (that would add a per-dispatch copy and, on Metal, regress today's zero-copy unified-memory path). Instead:

init_schedule(edges: Vec<(StatePrepParams, Vec<BitOp>)>) hands the backend-agnostic description to the backend once; the backend materialises its native buffers and retains them. The orchestration keeps only scalars (edges_per_period, gcd_ps).
edge_ops_mut(edge_idx) -> &mut [BitOp] is how reset / model-driven / clock-edge patching mutates ops. Metal returns a slice straight over the shared MTLBuffer — zero-copy, the write is the upload (exactly today's behaviour). CUDA/HIP return a slice over a host mirror and mark the edge dirty; run_edges uploads only dirty edges before launch. Ops change rarely (reset transitions; only while a CPU-side model is active), so this is near zero in practice — and the buffers are KB-scale regardless.

This replaces the earlier "neutral Vec + backend re-materialises" sketch, which risked needless CPU↔GPU traffic and double bookkeeping.
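The host-mirror variant of init_schedule / edge_ops_mut (the CUDA/HIP shape described above) can be sketched with a plain Vec standing in for device memory. Everything here is illustrative: `MirroredSchedule`, `flush_dirty`, `device_ops`, and the `BitOp` fields are hypothetical, and StatePrepParams is omitted for brevity.

```rust
/// Stand-in for the real BitOp; the field layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BitOp {
    pub position: u32,
    pub value: bool,
}

/// Backend-owned schedule storage with a host mirror and dirty flags.
pub struct MirroredSchedule {
    host: Vec<Vec<BitOp>>,
    device: Vec<Vec<BitOp>>, // stands in for device-resident buffers
    dirty: Vec<bool>,
}

impl MirroredSchedule {
    /// Materialises the native buffers once from the neutral description.
    pub fn init_schedule(edges: Vec<Vec<BitOp>>) -> Self {
        let dirty = vec![false; edges.len()];
        Self { device: edges.clone(), host: edges, dirty }
    }

    /// Hands out the host mirror for patching and marks the edge dirty.
    pub fn edge_ops_mut(&mut self, edge_idx: usize) -> &mut [BitOp] {
        self.dirty[edge_idx] = true;
        &mut self.host[edge_idx]
    }

    /// Uploads only the dirty edges; returns how many were uploaded.
    /// A real backend would call this at the start of run_edges.
    pub fn flush_dirty(&mut self) -> usize {
        let mut uploaded = 0;
        for i in 0..self.dirty.len() {
            if self.dirty[i] {
                self.device[i].copy_from_slice(&self.host[i]);
                self.dirty[i] = false;
                uploaded += 1;
            }
        }
        uploaded
    }

    /// What the kernel would see for this edge.
    pub fn device_ops(&self, edge_idx: usize) -> &[BitOp] {
        &self.device[edge_idx]
    }
}
```

A Metal-style backend would skip the mirror entirely and return a slice over the shared buffer from edge_ops_mut, which is why the accessor returns a slice rather than exposing the storage type.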

MetalBackend runs N edges in one command buffer with GPU peripherals inside (today's encode_and_commit_gpu_batch).
CpuBackend runs the per-edge loop via cpu_reference::simulate_block_v1 — the reference/oracle, and the unlock for cosim regression on free Linux CI (today Metal-only). N is effectively 1; throughput is not the point. It also validates the per-edge orchestration path that the CUDA/HIP fallback reuses.
Cuda/HipBackend run the existing simulate kernel, sidestepping the cooperative_groups grid-sync that only the sim command needs (the hardest CUDA feature to port). They ship with their Tier-2 GPU peripherals (Layer 3) so reactive designs batch from the start; the per-edge path is the permanent fallback for CPU-side models (e.g. JTAG replay).

Layer 3 — the `GpuPeripheral` abstraction (3-tier, GPU peripherals primary)

Batching a reactive design requires the peripheral to run inside the batch — i.e. on the GPU — because the peripheral consumes each edge's output to drive the next edge's input. On Metal this is hidden by unified memory; on a discrete GPU, per-edge means a PCIe round-trip every edge (~1–2 µs each way), which over millions of edges is likely slower than the CPU backend. Therefore GPU-side peripherals are architecturally required for the CUDA/HIP perf story — not an optional optimization. The decision (2026-06-07) is to make GPU peripherals the primary path, with a 3-tier model mirroring the CosimBackend seam:

Tier 1 — CPU reference model (PeripheralModel, src/sim/models/*.rs, exists). The semantic ground truth, the cross-backend equivalence oracle, and the fallback for any (backend, peripheral) lacking a GPU kernel. Always present.
Tier 2 — hand-written GPU kernels for core peripherals (now). Because CUDA and HIP already share kernel_v1_impl.cuh, a core peripheral is two implementations, not three: one shared *_impl.cuh (CUDA + HIP) and one .metal. Tractable for the small in-core set (flash, UART, bus-trace) and matching the existing simulate-kernel precedent.
Tier 3 — single-source peripheral compilation (later; the user-extensible peripheral API). Hand-written kernels don't scale to user-defined peripherals; the endgame is a user writing a peripheral once (restricted-Rust subset or a small peripheral-FSM IR) that compiles to CPU + every GPU backend. This domain (peripheral FSMs) is far narrower than the general cross-shader-tool port previously rejected, so an in-house IR is the tractable route.

The GpuPeripheral seam is defined at Tier 2 so Tier 3 slots in without reworking the orchestration.

The peripheral contract — one shape, input and output, every model

The seam above only sketched Tier 2 as encode_step(encoder). This fills in what a peripheral is, so the CPU model (Tier 1), the GPU kernel (Tier 2), and the eventual single source (Tier 3) all express the same contract rather than two parallel ones.

Every peripheral — on either substrate, input-driving or output-observing — is the same shape:

observe some design-output bits → advance an FSM (over persistent state + const params) → drive some design-input bits and/or emit decoded records.

The CPU PeripheralModel trait (src/sim/models/mod.rs:56) is already this contract and already bidirectional:

trait PeripheralModel {
    fn step_edge(&mut self,
        output_state: &[u32],              // OBSERVE design outputs
        overrides: &mut ModelOverrides,    // DRIVE design inputs (position → value)
        emitted: &mut Vec<EmittedEvent>);  // EMIT decoded records
    fn driven_positions(&self) -> &[u32];  // the input bits it may drive
    fn is_active(&self) -> bool;           // forces batch=1 mid-transmission
}

One trait already covers the whole spectrum via optional halves: GPIO is input-only (default step_edge just contributes overrides), UART-TX decode / bus-trace are output-only (empty driven_positions), SPI flash is bidirectional. So a single interface genuinely covers input and output — the doubt about "enough commonality" is unfounded: the commonality is the FSM-over-IO-bits shape, and six models already share it.
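To make the "optional halves" point concrete, here is a toy pair of models written against a simplified copy of the trait. The `ModelOverrides` and `EmittedEvent` definitions, the default `is_active`, and both model types are stand-ins invented for this sketch; the real types in src/sim/models differ.

```rust
/// Simplified stand-ins for the real types (assumptions for this sketch).
#[derive(Default)]
pub struct ModelOverrides {
    pub drives: Vec<(u32, bool)>, // (input position, value)
}

#[derive(Debug, PartialEq, Eq)]
pub struct EmittedEvent(pub String);

pub trait PeripheralModel {
    fn step_edge(
        &mut self,
        output_state: &[u32],
        overrides: &mut ModelOverrides,
        emitted: &mut Vec<EmittedEvent>,
    );
    fn driven_positions(&self) -> &[u32];
    fn is_active(&self) -> bool {
        false
    }
}

/// Input-only model: drives one input bit to a constant every edge.
pub struct ConstGpio {
    pub positions: [u32; 1],
    pub value: bool,
}

impl PeripheralModel for ConstGpio {
    fn step_edge(&mut self, _out: &[u32], overrides: &mut ModelOverrides, _ev: &mut Vec<EmittedEvent>) {
        overrides.drives.push((self.positions[0], self.value));
    }
    fn driven_positions(&self) -> &[u32] {
        &self.positions
    }
}

/// Output-only model: emits a record on each rising edge of one output bit.
pub struct RisingEdgeWatch {
    pub pos: u32,
    pub last: bool,
}

impl PeripheralModel for RisingEdgeWatch {
    fn step_edge(&mut self, out: &[u32], _ov: &mut ModelOverrides, ev: &mut Vec<EmittedEvent>) {
        let now = (out[(self.pos / 32) as usize] >> (self.pos % 32)) & 1 == 1;
        if now && !self.last {
            ev.push(EmittedEvent(format!("rise@{}", self.pos)));
        }
        self.last = now;
    }
    fn driven_positions(&self) -> &[u32] {
        &[] // observes only, drives nothing
    }
}
```

Both models implement the same three methods; each simply leaves the half it does not need empty, which is the shape a bidirectional model such as SPI flash would fill in on both sides.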

The GPU half is not yet unified — it is three bespoke kernels with the same skeleton but no common trait:

Kernel	reads	FSM state	writes	role
`gpu_apply_flash_din`	`states`, `FlashState`	—	`states` (MISO)	input inject
`gpu_flash_model_step`	`states`, `flash_data`	`FlashState`	`FlashState`	output observe + FSM
`gpu_io_step`	`states`	`UartDecoderState`	`UartChannel`/`BusTraceChannel` rings	output decode → ring

Every kernel is kernel(device u32* states, device FsmState* state, constant Params& params, [device Ring* out], [const Data* in]).

Target shape — one logical contract, two substrates:

CPU substrate = today's step_edge (Rust over &[u32]; drives via ModelOverrides).
GPU substrate (GpuPeripheral) = encode_step(encoder, states_buf, fsm_buf, params_buf, ring_buf) running the same FSM over device u32* states.
Consistency anchor = the shared #[repr(C)] FSM-state + params structs. These already exist on both sides but are hand-synced ("must match Metal UartChannel", cosim_metal.rs:178); that duplication is the tax Tier 3 removes by generating the Rust step, the GPU kernel, and the one struct from a single FSM definition.

The decision that makes it consistent: all input drives are (position, value) pairs applied through the one state_prep/ops path. Today this is the inconsistent part — CPU models drive indirectly (overrides → BitOps → state_prep, so drives land clock-edge-aligned), but gpu_apply_flash_din writes states directly. They are the same logical operation done two ways. Normalising flash's direct-write into FSM-produced ops applied by state_prep makes input application uniform across every peripheral and both substrates, and removes flash's special case. (Flash writes directly today only because its MISO depends on the FSM computed that same edge — expressible as ops the FSM emits.)

What deliberately does not unify (substrate detail below the contract): output draining (GPU needs ring buffers because the CPU cannot observe intra-batch state; CPU models emit immediately — same events out contract, different plumbing), and the FSM body itself (Rust vs kernel, single-sourced only by Tier 3's IR). These divergences are expected and bounded.

Phase-1 implication: the CPU UART-TX decoder added in Phase 1 (no CPU equivalent exists today; models/uart.rs only has an RX-line receiver decode) must be written to this contract, with its FSM state mirroring UartDecoderState's fields, so the Phase-2 GpuPeripheral kernel and the Tier-3 single source fold into one definition rather than a third parallel one.

Correctness contract

The CPU PeripheralModel (Tier 1) is ground truth. The cross-backend equivalence harness (#113, today sim-only) extends to cosim: every backend's output VCD must be byte-identical on the same reactive design, and every GPU-peripheral kernel (Tier 2) must match its CPU model. This is the backstop for the whole effort.

Considered alternative (not adopted as primary)

Speculative batching — keep peripherals on the CPU, batch optimistically, and roll back on divergence (the multi-clock plan's MC.4 island run-ahead / MC.5 record-and-replay). It avoids writing GPU kernels but is non-deterministic in throughput and substantially more complex. Rejected as the primary path; retained as the natural fallback for user CPU peripherals that are not GPU-portable. Cross-shader tools (Slang, Ferrox) remain rejected for the design kernel; Tier 3's narrow peripheral-FSM IR is the cross-GPU answer for peripherals.

Relationship to the multi-clock plan

The 102k JTAG single-edge round-trips are exactly the "cosim CPU↔GPU round-trip measured as the bottleneck" trigger for MC.3 (streaming stimulus) and the motivation for MC.4 (per-island multi-rate batching). Both are orthogonal to and larger than this seam (MC.4 needs the MC.1 island partitioner) — the long-term fix for the per-edge tail, not a prerequisite here.

Consequences

The sim cooperative-launch model and the cosim per-edge/batch model remain distinct per backend; this unifies the cosim driver, not the two execution models.
ScheduleBuffers currently stores metal::Buffer pairs and the ops-update helpers write Metal shared memory in place. These move into the backend (built once via init_schedule, mutated via edge_ops_mut) rather than becoming an orchestration-owned Vec the backend re-materialises — which keeps Metal zero-copy and lets CUDA/HIP upload only dirty edges. Converting the in-place *mut BitOp shared-memory mutation to edge_ops_mut is the main refactor friction (and resolves the closure-borrow issue too).
CUDA/HIP cosim ships with Tier-2 GPU peripherals so reactive designs batch from the start (Phase 2 lands the backend + GPU peripherals together, rather than a per-edge-only intermediate that would be unusably slow). The per-edge path remains as the fallback for CPU-side models (e.g. JTAG replay), where the per-edge tail is addressed later by the multi-clock plan's MC.3/MC.4, not by this seam.

Amendment 2026-06-19: no CPU-peripheral CUDA/HIP variant; stage on fixtures

Refines the Layer-2/3 phasing above with an implementation decision (made while building Phase 2). There is exactly one CUDA/HIP cosim backend, and it mirrors MetalBackend: GPU design step + GPU peripherals + variable batching + managed memory (cudaMallocManaged/hipMallocManaged, the closest analog to Metal's unified StorageModeShared).

No CPU-peripheral CUDA/HIP backend. An earlier plan sketched a bring-up "checkpoint 2a" — a CUDA backend running all peripherals on the CPU, per-edge — as a stepping stone. That is dropped: no production backend works that way (Metal never runs peripherals on the CPU), so it would introduce a backend shape that exists nowhere else and obscure the architecture. CpuBackend stays the pure-CPU reference oracle; Metal and CUDA/HIP are the GPU backends with the same shape.
"Per-edge fallback" means batch=1 of the GPU backend, not a CPU-peripheral path. Model-driven-clock designs (JTAG) run the same GPU-peripheral backend at batch=1; only a small output read-back per edge feeds the CPU-side clock model. The Tier-1 CPU PeripheralModel remains the per-peripheral fallback solely for a (backend, peripheral) pair that genuinely lacks a GPU kernel (e.g. a future user-defined Tier-3 peripheral before its kernel exists) — not for the core flash/UART/bus set, which get Tier-2 GPU kernels on CUDA/HIP.
Bisectability comes from staging on fixtures, not from a throwaway backend. Each fixture exercises a different kernel subset, so the single backend is brought up in stages: A (design step only → xprop_cosim, no peripherals), B (gpu_io_step → dual_uart + apb_trace), C (flash kernels → flash/JTAG). Every stage is the real architecture, gated against the committed CPU/Metal goldens on the T4. Staging detail: docs/plans/cosim-phase2-cuda-hip.md.

Amendment 2026-06-21: interactive (externally-paced) peripheral models

The execution model above assumes the peripheral pace is internal: the loop drives time and models react each edge (recorded JTAG replay, queued stimulus, GPU peripherals). The interactive JTAG debug server (--jtag-server, #124) introduces the inverse: an external client (OpenOCD/gdb) drives the pace, one remote_bitbang transaction at a time. This amendment records how that fits the model without new machinery — and the one latent contract gap it forces closed. It does not change the batch loop or scheduler.

An interactive peripheral is a Tier-1 CPU PeripheralModel that blocks on external I/O. Its step_edge blocks reading the next bitbang byte from the socket; because the client is the clock source, this blocking is the synchronisation. is_active() stays true while a client is connected, so the existing force_single_edge gate drives the design at batch=1 for the session's duration — the same mechanism JTAG replay already uses. No async executor and no background thread are needed for a single connection; the blocking accept happens once before the main loop. The batched fast path is unaffected when no client is attached.
This makes the contract's "observe" half real. The Decision section and ADR 0013 both note step_edge is handed an empty output_state today (cosim/mod.rs:2987 passes &[]) because no CPU-side model yet reads a design output. An interactive debugger must answer remote_bitbang R (read-TDO) with the live TDO bit, so output_state must carry the real design-output slice (&backend.state()[state_size..]). This generalises beyond JTAG — it is the same wiring the scaffolded I²C/SPI observation models need — so it is a closure of the standing TODO, not a JTAG special case.
TDO read-back is the contract's "emit" half pointed at an external client. The peripheral contract is observe → advance FSM → drive and/or emit. A live debug server adds one I/O direction: it emits a response to an external socket mid-edge (the R reply) rather than into a ring buffer or VCD. The observe/drive halves (TDO sample, TCK/TMS/TDI drive via the usual overrides → BitOps → state_prep path) are unchanged.
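To make the observe and emit halves concrete, here is a small sketch of decoding OpenOCD remote_bitbang bytes and forming the R reply from a design-output slice. The command bytes follow the remote_bitbang protocol; everything else is an assumption for illustration: the BitbangCmd type, the decode and tdo_reply helpers, and the packing of output_state as u32 words with a known TDO bit index are hypothetical, not Jacquard's actual layout.

```rust
/// One decoded remote_bitbang command byte.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum BitbangCmd {
    /// '0'..='7': drive the three JTAG inputs (value = tck<<2 | tms<<1 | tdi).
    Write { tck: bool, tms: bool, tdi: bool },
    /// 'R': the client wants the current TDO value back.
    ReadTdo,
    /// 'Q': the client is disconnecting.
    Quit,
    /// Blink ('B', 'b') and reset ('r', 's', 't', 'u') bytes, unused in this sketch.
    Ignored,
}

/// Decodes a single protocol byte; unknown bytes yield None.
pub fn decode(byte: u8) -> Option<BitbangCmd> {
    match byte {
        b'0'..=b'7' => {
            let v = byte - b'0';
            Some(BitbangCmd::Write {
                tck: (v & 0b100) != 0,
                tms: (v & 0b010) != 0,
                tdi: (v & 0b001) != 0,
            })
        }
        b'R' => Some(BitbangCmd::ReadTdo),
        b'Q' => Some(BitbangCmd::Quit),
        b'B' | b'b' | b'r' | b's' | b't' | b'u' => Some(BitbangCmd::Ignored),
        _ => None,
    }
}

/// Forms the reply to 'R' from the live design outputs: ASCII '1' or '0'.
/// Assumes (hypothetically) that outputs are packed 32 bits per word.
pub fn tdo_reply(output_state: &[u32], tdo_bit: usize) -> u8 {
    let word = output_state[tdo_bit / 32];
    if (word >> (tdo_bit % 32)) & 1 == 1 {
        b'1'
    } else {
        b'0'
    }
}
```

In a blocking step_edge, a Write would become the TCK/TMS/TDI overrides for the edge and a ReadTdo would write tdo_reply's byte back to the socket, which is why output_state has to carry the real design-output slice rather than an empty one.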

Per-backend: the interactive path is the CPU-side model plus batch=1 of the GPU backend (the "per-edge fallback" of the 2026-06-19 amendment), so it works on any cosim backend once the host-side plumbing lands — no kernel work. Implementation staging: docs/plans/jtag-debug-server.md.
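The batch=1 gating can be illustrated with a short sketch. It assumes a trimmed-down PeripheralModel trait with only is_active(), plus hypothetical names (next_batch_len, JtagServer, client_connected) that do not appear in the source; the real force_single_edge gate may be structured differently.

```rust
/// Hypothetical, trimmed-down view of a CPU-side peripheral model.
pub trait PeripheralModel {
    /// True while the model must see every edge (e.g. a client is connected).
    fn is_active(&self) -> bool;
}

/// Picks the number of edges for the next dispatch: one edge if any model
/// forces single-edge stepping, otherwise up to `max_batch`.
pub fn next_batch_len(
    models: &[&dyn PeripheralModel],
    max_batch: usize,
    edges_left: usize,
) -> usize {
    let force_single_edge = models.iter().any(|m| m.is_active());
    let limit = if force_single_edge { 1 } else { max_batch.max(1) };
    limit.min(edges_left)
}

/// Stand-in for the interactive JTAG server model.
pub struct JtagServer {
    pub client_connected: bool,
}

impl PeripheralModel for JtagServer {
    fn is_active(&self) -> bool {
        self.client_connected
    }
}
```

With no client attached the function returns the full batch, matching the statement that the batched fast path is unaffected; once a client connects, every dispatch shrinks to a single edge for the rest of the session.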

Amendment 2026-06-26: open questions — cosim timing output

Folded in from a now-deleted implementation-plan doc (cosim-timing-support.md) whose objective largely shipped. Open items remaining:

Timed cosim is Metal-only. Arrival-annotated VCD output works on the Metal backend (metal.rs threads arrival_state_offset; the driver writes per-net arrival_ps). The CPU, CUDA, and HIP backends assert !script.timing_arrivals_enabled — they don't yet support timed cosim. Extending arrival tracking to the other CosimBackend impls is open.
Cosim --timing-report (structured JSON) is not wired. The structured report is sim-only; see ADR 0008's 2026-06-25 amendment. The cosim path emits arrival-annotated VCD but not the JSON report.

Cross-references

ADR 0012 — CDC jitter injection (uses the scheduler's edge timestamps as the injection point).
ADR 0013 — Peripheral model architecture (documents GPU-side model patterns and ring buffers).
docs/plans/multi-clock-and-stimulus-architecture.md — design-space doc for the multi-clock scheduler.
docs/plans/cosim-backend-portability.md — implementation plan for backend portability (#105).

ADR 0018 — Distribution and installation model

Status: Accepted (amended 2026-06-25). Phase 4 (CUDA/HIP prebuilt binaries) and Phase 7 (eda-infra-rs upstreaming) remain open, and Phase 6 (container image) stays deferred; see the amendment below.

Amendment (2026-06-25) — built and shipped. The distribution layer this ADR proposed is now operational; the original "no release artifacts; install = clone + submodules + cargo build" premise is historical. What's live:

GitHub Releases (macOS arm64/Metal): from v0.2.1 (first attached, working binary) through v0.2.3. release.yml builds, smoke-tests the relocated tarball, and publishes (prereleases via a draft, for the immutable-releases repo).

cargo binstall --git: [package.metadata.binstall] pkg-url override in Cargo.toml.

Homebrew tap gpu-eda/homebrew-tap: formula at packaging/homebrew/jacquard.rb (v0.2.3); depends_on "llvm".

Staging install-validation (validate-install.yml): a workflow_dispatch gate running the documented cargo binstall + brew install against a published (pre)release. RC tags (vX.Y.Z-rc.N) publish as prereleases and gate promotion.

netlist-graph on PyPI (publish-netlist-graph.yml, OIDC trusted publishing): netlist-graph 0.1.0 published; TestPyPI dry-run job too.

Single shared crate version via scripts/bump_version.py + verify-guard; Cargo.lock tracked; repo URLs corrected to gpu-eda.

Docs: docs/installation.md, docs/release-process.md (incl. the staging-validation gate).
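The RC-tag rule in the staging-validation item above (vX.Y.Z-rc.N publishes as a prerelease) can be sketched as a small classifier. This is illustrative only: ReleaseKind and classify_tag are hypothetical names, and the actual check in release.yml is not shown in this document.

```rust
#[derive(Debug, PartialEq)]
pub enum ReleaseKind {
    /// vX.Y.Z
    Final,
    /// vX.Y.Z-rc.N, published as a prerelease
    ReleaseCandidate(u32),
}

fn is_number(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Classifies a git tag; returns None for anything that is not
/// vX.Y.Z or vX.Y.Z-rc.N with purely numeric components.
pub fn classify_tag(tag: &str) -> Option<ReleaseKind> {
    let rest = tag.strip_prefix('v')?;
    let (version, rc) = match rest.split_once("-rc.") {
        Some((version, n)) => (version, Some(n)),
        None => (rest, None),
    };
    let parts: Vec<&str> = version.split('.').collect();
    if parts.len() != 3 || !parts.iter().all(|p| is_number(p)) {
        return None;
    }
    match rc {
        None => Some(ReleaseKind::Final),
        Some(n) if is_number(n) => n.parse::<u32>().ok().map(ReleaseKind::ReleaseCandidate),
        Some(_) => None,
    }
}
```

A release workflow could use the result to decide whether to publish as a prerelease and hold promotion until the install-validation gate has passed.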

Still open: Phase 4 (CUDA/HIP prebuilt binaries — gated on self-hosted NVIDIA/AMD release runners; Linux stays source-build), Phase 6 (container image — deferred per the original decision), and Phase 7 (eda-infra-rs upstreaming, the path to a crates.io publish — see docs/plans/distribution.md § Phase 7). The original proposal follows unchanged.

TL;DR. In the context of shipping Jacquard to users (and to docs-dogfooding agents), facing the fact that it is a GPU-compiled Rust binary with vendored path-dependency submodules plus two companion tools, we chose a tiered, channel-per-artifact model: Rust binaries via GitHub Releases + cargo-binstall + a Homebrew tap (Metal), and the Python netlist-graph companion via PyPI. We accept that we maintain per-GPU-target release binaries and three artifacts on two release cadences (the two Rust bins share one tag; netlist-graph versions independently), rather than forcing everything through one channel.

Context

"Install Jacquard" is not one artifact. The toolset is three pieces:

Tool	Language	Role	GPU?
`jacquard`	Rust	the simulator (`sim` / `cosim`)	yes — Metal / CUDA / HIP, compiled via the `ucc` build script
`opensta-to-ir`	Rust	SDF → timing-IR preprocessing (timing path)	no (CPU)
`netlist-graph`	Python	post-synthesis signal-name discovery; the companion the tracing docs (`signal-tracing.md`, `bus-tracing.md`) lean on	no

Constraints that rule options out:

jacquard is not crates.io-publishable. Its dependencies are vendored path deps (vendor/eda-infra-rs/* — a fork carrying in-flight patches), and the build needs git submodules, a GPU SDK, and a C++/CUDA/Metal compile. cargo install from source works but is slow and needs the full toolchain — not "easy."
jacquard is not a natural PyPI package. It's a GPU binary; a maturin wheel would mean per-backend wheels (Metal macOS-arm64 only, CUDA huge and driver-coupled, HIP).
netlist-graph is PyPI-ready today — self-contained (networkx + click only), a netlist-graph console script, no workspace path deps.

Today there are no release artifacts: install = clone + submodules + cargo build -r --features <backend>. That blocks easy adoption and makes docs-dogfooding (a fresh agent following the docs) start from a heavy source build.

Decision

Distribute each artifact through the channel that fits it:

Rust binaries (jacquard + opensta-to-ir) →
- GitHub Releases prebuilt binaries, one per GPU target (macos-arm64-metal, linux-x64-cuda, linux-x64-hip).
- cargo-binstall support via [package.metadata.binstall] so cargo binstall jacquard fetches the release asset.
- Homebrew tap (gpu-eda/homebrew-tap) for the macOS/Metal path: brew install gpu-eda/tap/jacquard.
- opensta-to-ir ships alongside jacquard (same release, same formula) — it's a sibling CPU bin in the same cargo workspace.
netlist-graph (Python) → PyPI: uvx netlist-graph / pip install netlist-graph. Versioned independently of the Rust bins.
The simulator never ships via PyPI. PyPI is for the Python companion only.

Versioning: the two Rust bins share the workspace version and a single tag (coordinated); netlist-graph versions independently. This is effectively Jacquard's first numbered release — see release-process.md.

Homebrew scope: the formula installs the Rust bins only; netlist-graph is documented as a separate uvx/pip line rather than a formula dependency, keeping the formula simple and the Python tool independently installable.

Rollout is Metal-first: the macOS-arm64 Metal binary ships now (the self-hosted macOS runner exists). CUDA and HIP release binaries are gated on the NVIDIA / AMD runners being stood up; until then those targets remain source-build.

Install tiers (documented where each is first needed):

pure functional cosim → jacquard only;
signal-name debugging → add netlist-graph;
timing / post-PnR → add opensta-to-ir + a PDK (volare/ciel).

Alternatives considered

PyPI wheel for the simulator (maturin). Rejected: per-backend wheels, CUDA wheel size and driver coupling, Metal macOS-arm64-only. PyPI fits the pure-Python companion, not the GPU binary. (This is the natural-sounding option; it doesn't survive contact with the GPU backends.)
crates.io publish + cargo install. Blocked by the vendored path-dependency fork, and still requires a GPU SDK + a multi-minute compile. Not "easy install."
Container image as the primary channel. Good for reproducible CUDA/Linux, but a poor fit for the macOS/Metal majority path and heavy for a quick jacquard --help. Kept as a possible additional channel (deferred), not the primary.

Consequences

One-line installs on the platforms that matter; docs-dogfood agents can brew install instead of building from source.
Release CI must build a per-target matrix; the CUDA/HIP rows are runner-gated, so the first release is Metal-only and the matrix fills in as runners land.
Three artifacts to release: two Rust bins (coordinated, one tag) and netlist-graph (independent). Coordinated Rust versioning keeps this to two cadences, not three.
New cross-repo surface to own: a homebrew-tap repo and a PyPI project + publish credentials (trusted publisher).
The stale repository = "…/ChipFlow/Jacquard" URLs (both Cargo.tomls) must be corrected to gpu-eda as part of this work.
Prebuilt binaries require a relocatable kernel. The Metal binary currently loads its .metallib from a compile-time build-tree path (env!("METALLIB_PATH")), so a shipped binary must embed or bundle the kernel first. Tracked as Phase 1a in the plan.

Walk-back options

If maintaining per-backend release binaries proves too costly, fall back to container images for CUDA/HIP and keep Releases + Homebrew for Metal only — the channels are independent, so this is a per-target retreat, not a redesign.
If the Homebrew tap is more upkeep than it's worth, drop it and keep cargo-binstall + the release tarballs; the formula is the thinnest layer to remove.

ADR 0019 — Cell-model IR: a complete per-cell-type library descriptor

Status: Proposed — all four design open questions resolved (see below); pending maintainer approval to start implementation (plan C1).

Context

Everything Jacquard needs to know about a standard cell is a property of the cell type and comes from the same Liberty library — its pin directions, its combinational logic, its sequential/classification nature, and its timing characterization:

Per-cell-type fact	What Jacquard needs	Where it lives today
L1 — pin directions	input vs output per pin	build-time baked from vendored submodules (`build.rs` → `GF180MCU_PIN_TABLE`/`SKY130_PIN_TABLE`), hand-coded for AIGPDK; partly user-suppliable via `--cell-library <v>` (ADR 0010 Tier 1)
L2 — combinational logic	the boolean function of each output, to build the AIG	read at runtime from a hardcoded `vendor/…/<cell>.functional.v` and decomposed by `pdk_decomp` (`src/aig.rs:1895`); fully hand-coded for AIGPDK
L3 — sequential & classification	is the cell a DFF/latch/clock-gate/SRAM/tie/filler, and what are its pin roles (clock, D, Q, async set/reset, enable; SRAM ports)?	hand-coded per-PDK Rust (`src/pdk.rs`, `src/gf180mcu_pdk.rs`, `src/sky130_pdk.rs`, pin-name matches in `src/aig.rs:2080-2260`) — data masquerading as code (ADR 0010's framing)
L4 — timing characterization	per-cell-type setup/hold, clock→Q, DFF/SRAM timing	parsed at runtime from a `.lib` into `liberty_parser::TimingLibrary` (`src/liberty_parser.rs`), consumed in `src/aig.rs:2793` / `src/flatten.rs` (incl. as a `liberty_fallback`); already user-suppliable via a `.lib` path

Two different "timing" artifacts — only one is per-cell-type

This is the easy thing to conflate. There are two:

Per-cell-type characterization (L4 above) — setup/hold, clock→Q, delays — a library property, one value per cell type, straight from the .lib.
Per-design annotation — the timing IR (ADR 0002), TimingArc { cell_instance: "<hierarchical instance path>" } — SDF for a specific netlist's instances, produced per-design by SDF/OpenSTA.

Only L4 is a cell-library fact. The timing IR is orthogonal to a cell library — it annotates a design, not a library — and stays its own IR. ADR 0002 drew exactly this line and predicted this ADR:

"The IR represents timing annotation data only. It is not … cell characterization. Attempts to extend it toward those adjacent formats are rejected — they become separate IRs if needed."

Cell characterization is that separate IR. It is not a sibling of the timing IR (same scope, different half); it is a different axis entirely — per-cell-type library facts vs per-instance design annotation.

Where the declarative path got to

ADR 0010 opened a declarative path for some of L1–L3 (--cell-library for L1, a .cells.toml manifest for cell kind), and ADR 0011 added a real port-mapping schema for kind = "ram". But ADR 0010 explicitly deferred the larger schema — sequential pin-roles, full L3 — "to a future ADR after real adoption data", L2 was never addressed (functional models still come from a hardcoded vendor/ path), and L4 lives in a parallel runtime .lib parser. Four facts about one cell type, from one Liberty file, arrive through four different mechanisms.

Two concrete gaps make this acute, both surfaced by issue #130:

Selection. A post-P&R netlist built against the 9-track library (gf180mcu_fd_sc_mcu9t5v0) is simulated against the 7-track functional models, because the path is hardcoded (src/aig.rs:1895). There is no way to point cosim at the library the netlist actually instantiates. (build.rs only assert_eq!s ports, not functional bodies, so "7t == 9t" is unenforced for logic.)
Libraries we cannot vendor. A proprietary/NDA foundry library can never live in jacquard/vendor/. Today L2 and L3 are only satisfiable by baking the library into the binary, so such libraries simply cannot be simulated — there is no all-runtime path.

The project already proved the shape of the fix on the timing side: the timing IR (ADR 0002) is a portable, versioned, generated, diff-able structured form produced from Liberty by a focused converter crate, with the vendored sources demoted to a generation-time input. This ADR applies the same shape to the cell library — but to all of its per-cell-type facts at once, in one descriptor, because they share one source.

Decision

Introduce a cell-model IR: a portable, versioned, generated, JSON-first structured descriptor that carries everything per-cell-type about a library — L1 directions, L2 combinational logic, L3 sequential/classification, and L4 timing characterization — in one file per library. Jacquard core consumes the cell-model IR (and the hand-override manifest from ADR 0010/0011) as its only source of cell semantics: no per-PDK Rust classifiers, no build.rs pin-table generation, no runtime functional.v parsing, and no runtime .lib parsing.

The ADR-0002 timing IR is unchanged and orthogonal — it annotates a specific design's instances; nothing in it is per-library.

The sub-decisions:

D1 — One descriptor for all per-cell-type facts; the timing IR stays orthogonal

All four facts (L1–L4, with classification as part of L3) live in one cell-model IR per library, because they all come from one Liberty file and are all keyed by cell type. This folds in liberty_parser::TimingLibrary (L4): the runtime stops parsing .lib and reads pre-extracted timing from the descriptor. It does not touch the per-design timing IR (ADR 0002), which is a different axis (per-instance design annotation) and keeps its own crate/format. The only cross-reference the two need is the netlist itself (an instance knows its cell type), so there is no shared-schema join to co-design between them.
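A minimal sketch of the two axes and the netlist as their only join, under stated assumptions: CellTiming, Design, and clk_to_q_ps are hypothetical names, the picosecond values used with them are invented, and the rule that an instance annotation takes precedence over the library value is inferred from the liberty_fallback description rather than taken from the real code.

```rust
use std::collections::HashMap;

/// Library fact (cell-model IR side): one entry per cell type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CellTiming {
    pub clk_to_q_ps: u32,
}

/// Design facts: the netlist plus any per-instance annotation.
pub struct Design {
    /// Hierarchical instance path -> cell type (what the netlist records).
    pub cell_type_of: HashMap<String, String>,
    /// Hierarchical instance path -> annotated delay (timing IR side); may be empty.
    pub annotated_clk_to_q_ps: HashMap<String, u32>,
}

/// Resolves clock-to-Q for one instance. The two sources never reference each
/// other: the only link is the instance's cell type from the netlist.
pub fn clk_to_q_ps(
    design: &Design,
    library: &HashMap<String, CellTiming>,
    instance: &str,
) -> Option<u32> {
    if let Some(&ps) = design.annotated_clk_to_q_ps.get(instance) {
        return Some(ps); // per-instance design annotation
    }
    let cell_type = design.cell_type_of.get(instance)?;
    library.get(cell_type.as_str()).map(|t| t.clk_to_q_ps) // per-cell-type library fact
}
```

Neither map needs to know the other's schema, which is the point of keeping the cell-model IR and the timing IR as separate formats.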

D2 — JSON-first

The cell-model IR is machine-generated, not hand-written, so its primary on-disk form is JSON — diff-able, inspectable, no flatc dependency to read. (This differs from the timing IR, which went FlatBuffers-first because timing data is per-instance and large; cell models are per-cell-type and small.) A FlatBuffers encoding is a deferred escape hatch if size or startup cost ever demands it (see D3). Schema is explicitly versioned, same evolution discipline as ADR 0002.

D3 — L2 as a pre-built AIG

Each combinational cell's logic is stored as a pre-decomposed AIG (and-inverter nodes, with input-pin → node and output-pin → node maps), not as boolean expressions or truth tables. Rationale: the runtime splices the cell's AIG directly into the design AIG with no decomposition work — minimal startup time and memory, and it removes pdk_decomp / functional-Verilog parsing from jacquard core entirely. Decomposition moves into the generator (D6). If the aggregate AIG payload grows unwieldy in JSON, encode it with FlatBuffers (the D2 escape hatch).
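Splicing a pre-built cell AIG into a design AIG might look like the sketch below. All types and functions here (Lit, Node, CellAig, splice, eval, nand2) are hypothetical and much simpler than a real AIG, for example there is no structural hashing or constant node; they only illustrate why no decomposition work is left for the runtime.

```rust
/// An edge into the AIG: a node index plus an optional inversion.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Lit {
    pub node: usize,
    pub inverted: bool,
}

impl Lit {
    pub fn pos(node: usize) -> Self { Lit { node, inverted: false } }
    pub fn neg(node: usize) -> Self { Lit { node, inverted: true } }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Node {
    Input,
    And(Lit, Lit),
}

/// A per-cell-type AIG as a descriptor could store it (nodes in topological order).
pub struct CellAig {
    pub nodes: Vec<Node>,
    pub input_pins: Vec<(&'static str, usize)>, // input pin -> node
    pub output_pins: Vec<(&'static str, Lit)>,  // output pin -> literal
}

/// Copies the cell's And nodes into `design`, wiring the cell inputs to the
/// `bound` design literals. Returns the output literals in output_pins order.
pub fn splice(design: &mut Vec<Node>, cell: &CellAig, bound: &[Lit]) -> Vec<Lit> {
    assert_eq!(bound.len(), cell.input_pins.len());
    // map[i] is the design literal that stands for cell node i.
    let mut map: Vec<Option<Lit>> = vec![None; cell.nodes.len()];
    for (&(_, node), &lit) in cell.input_pins.iter().zip(bound) {
        map[node] = Some(lit);
    }
    let remap = |map: &[Option<Lit>], l: Lit| {
        let target = map[l.node].expect("cell AIG must be topologically ordered");
        Lit { node: target.node, inverted: target.inverted ^ l.inverted }
    };
    for (i, node) in cell.nodes.iter().enumerate() {
        if let Node::And(a, b) = *node {
            let new_node = Node::And(remap(&map, a), remap(&map, b));
            design.push(new_node);
            map[i] = Some(Lit::pos(design.len() - 1));
        }
    }
    cell.output_pins.iter().map(|&(_, l)| remap(&map, l)).collect()
}

/// Evaluates one literal of the design for the given primary-input values.
pub fn eval(design: &[Node], input_values: &[(usize, bool)], out: Lit) -> bool {
    let mut val = vec![false; design.len()];
    for &(node, v) in input_values {
        val[node] = v;
    }
    for i in 0..design.len() {
        if let Node::And(a, b) = design[i] {
            let v = (val[a.node] ^ a.inverted) && (val[b.node] ^ b.inverted);
            val[i] = v;
        }
    }
    val[out.node] ^ out.inverted
}

/// Example cell: a two-input NAND is one And node with an inverted output.
pub fn nand2() -> CellAig {
    CellAig {
        nodes: vec![Node::Input, Node::Input, Node::And(Lit::pos(0), Lit::pos(1))],
        input_pins: vec![("A", 0), ("B", 1)],
        output_pins: vec![("Y", Lit::neg(2))],
    }
}
```

Instantiating a cell is then a node copy with index remapping, so functional-Verilog parsing and decomposition stay entirely on the generator side.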

D4 — L3 schema: sequential pin-roles + classification

Sequential cells carry pin-role metadata — clock pin + edge, D / next-state, Q/QN, async set/reset pin + polarity, enable — plus their combinational next-state function as the same pre-built AIG (D3), consumed by the existing DriverType::DFF path (replacing the hardcoded pin-name matches in src/aig.rs:2080-2260). Cell classification (std, dff, latch, clock_gate, ram, tie_high/low, filler, endcap, tap, io_pad_*) is a declared kind. This is the sequential analogue of ADR 0011's RAM port schema, which stands as the worked precedent; RAM keeps its ADR-0011 schema.
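As an illustration of how declared pin roles could replace pin-name matching, the sketch below steps a flip-flop from role metadata alone. It is a simplification under stated assumptions: DffRoles, AsyncRole, ClockEdge, and step_dff are hypothetical names, the enable role is omitted, the next-state value is passed in (it would come from the D3 AIG), and giving reset priority over set is an arbitrary choice here rather than Liberty's clear_preset_var handling.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum ClockEdge {
    Rising,
    Falling,
}

/// An asynchronous control pin and its declared polarity.
pub struct AsyncRole {
    pub pin: &'static str,
    pub active_high: bool,
}

/// Role metadata for one sequential cell type (enable omitted in this sketch).
pub struct DffRoles {
    pub clock: &'static str,
    pub edge: ClockEdge,
    pub async_reset: Option<AsyncRole>,
    pub async_set: Option<AsyncRole>,
}

/// Computes the new Q from pin values looked up by role, never by hardcoded name.
/// `pin` reads the current value of a named pin; `clock_was` is the previous
/// clock level; `next_state` is the already-evaluated next-state function.
pub fn step_dff(
    roles: &DffRoles,
    pin: &dyn Fn(&str) -> bool,
    clock_was: bool,
    next_state: bool,
    q: bool,
) -> bool {
    let asserted =
        |r: &Option<AsyncRole>| r.as_ref().map_or(false, |a| pin(a.pin) == a.active_high);
    if asserted(&roles.async_reset) {
        return false; // assumption: reset wins if both are asserted
    }
    if asserted(&roles.async_set) {
        return true;
    }
    let clock_now = pin(roles.clock);
    let fired = match roles.edge {
        ClockEdge::Rising => !clock_was && clock_now,
        ClockEdge::Falling => clock_was && !clock_now,
    };
    if fired { next_state } else { q }
}
```

A cell with an active-low reset pin is then expressed purely as data (active_high: false), which is the property that lets a new library be described without new Rust.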

D5 — L4 timing characterization, in the same descriptor

The descriptor carries the per-cell-type timing — setup/hold, clock→Q, DFF/SRAM timing — keyed by the same cell type as the logic; this is the data liberty_parser::TimingLibrary holds today, extracted once by the generator instead of parsed from a .lib at every run. At runtime this replaces TimingLibrary::from_file; the per-design timing IR (ADR 0002) still layers a specific design's instance arrivals on top, unchanged.

Multi-corner lives in the one descriptor, keyed by corner — the same shape the timing IR already uses (crates/timing-ir: an ordered corner-name set + a corner_index per value, with min/typ/max as the within-corner derate triple). Corner is the outer key; min/typ/max is the orthogonal inner derate — not a reinterpretation of the triple. File-per-corner is rejected because L1–L3 (directions, function, sequential roles, and especially the D3 AIG — the bulk of the descriptor) are corner-invariant: only L4's numbers vary across ss/tt/ff, so a per-corner file would duplicate the expensive AIG to vary the cheap timing. Corner-keyed L4 dedups the AIG, keeps one descriptor = one library (which D8 selection assumes), and degrades to a one-entry corner map for the common single-corner flow. A corner-overlay-file split is a deferred size escape hatch (parallel to D2), to be decided from a real descriptor size — not built now.

The simulation corner is user-selected, not read from the netlist or SDF. A --corner <name> run flag picks among the descriptor's corner set, defaulting to a descriptor-declared default_corner (typically tt). The netlist is structural and records no corner; synthesis used a setup corner to choose cells but writes it nowhere Jacquard reads. SDF carries PVT header fields (VOLTAGE/PROCESS/TEMPERATURE), but they are the wrong source on both counts: when an SDF is present its delays already encode the corner and feed the orthogonal timing IR (ADR 0002), so L4 corner selection does not apply to annotated instances; and L4 matters precisely in the no-SDF cosim path (today's liberty_fallback), where there is no SDF header to read. There are two different corners: the setup corner that synthesis targeted when choosing cells (a flow fact, recorded nowhere) and the simulation corner the user wants to observe. Only the user knows the latter, so user-passed is the correct source, not a fallback. This also makes today's implicit single-corner choice (whatever .lib was passed) explicit. When an SDF is supplied, its PVT header may be surfaced as an informational cross-check (SDF says ss_125C, you selected tt_025C), never as the selector. This mirrors opensta-to-ir, which already treats corner identity as declared input, not auto-detected.
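The corner-keyed shape and the selection rule can be sketched as below. The type and function names (Triple, CornerValue, TimingTable, select_corner, clk_to_q_at, sdf_cross_check) are hypothetical; only the ideas of an ordered corner-name set, a corner_index per value, a default_corner, and an informational SDF cross-check come from the text.

```rust
/// The within-corner min/typ/max triple, in picoseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Triple {
    pub min: u32,
    pub typ: u32,
    pub max: u32,
}

/// One L4 value tagged with the corner it belongs to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CornerValue {
    pub corner_index: usize,
    pub value_ps: Triple,
}

/// The corner-related part of a descriptor, reduced to one cell type's clock-to-Q.
pub struct TimingTable {
    pub corners: Vec<String>,
    pub default_corner: String,
    pub clk_to_q_by_corner: Vec<CornerValue>,
}

impl TimingTable {
    /// Resolves the user's --corner choice (or the declared default) to an index.
    pub fn select_corner(&self, requested: Option<&str>) -> Result<usize, String> {
        let name = requested.unwrap_or(self.default_corner.as_str());
        self.corners
            .iter()
            .position(|c| c.as_str() == name)
            .ok_or_else(|| format!("unknown corner '{}'; descriptor declares {:?}", name, self.corners))
    }

    /// Looks up the value for an already-selected corner.
    pub fn clk_to_q_at(&self, corner: usize) -> Option<Triple> {
        self.clk_to_q_by_corner
            .iter()
            .find(|v| v.corner_index == corner)
            .map(|v| v.value_ps)
    }
}

/// Informational only: reports a mismatch, never changes the selection.
pub fn sdf_cross_check(selected: &str, sdf_corner: Option<&str>) -> Option<String> {
    match sdf_corner {
        Some(sdf) if sdf != selected => Some(format!("SDF says {}, you selected {}", sdf, selected)),
        _ => None,
    }
}
```

A single-corner library degrades naturally: corners has one entry, default_corner names it, and select_corner(None) returns index 0.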

D6 — A separate generator (converter crate)

The IR is produced by a converter crate mirroring opensta-to-ir: Liberty → cell-model IR — emitting all of L1–L4 from a single pass over the .lib, with functional.v consulted only as a fallback for the L2 combinational cases Liberty under-specifies (see the resolved L2 open question below). It reuses, not reimplements, the existing machinery: pdk_decomp relocates into a shared library the generator links, and the Liberty group-walker (src/liberty_parser.rs, which today reads only the L4 timing groups) is extended to also read function / ff / latch / cell-class. The generator — not jacquard core — owns all Verilog/Liberty parsing and AIG decomposition; it is unit-testable in isolation, off the runtime critical path.

Cross-check when both sources are present. For a library that ships both Liberty and functional.v (the open PDKs), the generator validates the two against each other and surfaces any disagreement as a generation-time diagnostic rather than silently trusting one. Two checks:

Logic (L2/L3). Build the combinational AIG from the Liberty function and from the .v gate/UDP netlist and check the two for logical equivalence; likewise the sequential next-state/reset from Liberty ff/latch against the .v sequential UDP.
Timing (L4). For standard cells, the .v specify block carries timing arc topology but no values (zero-placeholder SDF scaffold), so the check is arc-set agreement, not value comparison: every Liberty timing() group (related_pin + timing_type) should correspond to a .v specify path or timing check and vice versa, and across a multi-corner library every corner .lib should carry the same cell/arc set (only the values differ). For macros/SRAM, whose behavioural .v often embeds real hardcoded timing (access time, setup/hold) that can diverge from the macro's Liberty, the check extends to a value comparison and flags the divergence. A missing, extra, or mis-typed arc, a cross-corner arc-set difference, or a macro value divergence — on either source — is surfaced. Liberty remains the authoritative L4 source either way.
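Both checks can be illustrated with a small sketch. Note the substitution: logical equivalence is done here by exhaustive truth-table enumeration, which is adequate for small standard cells but is not necessarily what the generator will use. The names equivalent, Arc, ArcDiff, and arc_set_diff are hypothetical.

```rust
use std::collections::BTreeSet;

/// Exhaustively compares two boolean functions of `n_inputs` inputs, where bit i
/// of the argument is input i. Returns the first disagreeing input pattern.
pub fn equivalent(
    n_inputs: u32,
    f: impl Fn(u32) -> bool,
    g: impl Fn(u32) -> bool,
) -> Result<(), u32> {
    assert!(n_inputs < 32, "exhaustive check is only meant for small cells");
    for bits in 0..(1u32 << n_inputs) {
        if f(bits) != g(bits) {
            return Err(bits);
        }
    }
    Ok(())
}

/// A timing arc identified by topology only (no values).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Arc {
    pub related_pin: String,
    pub pin: String,
    pub timing_type: String,
}

#[derive(Debug, PartialEq)]
pub struct ArcDiff {
    pub only_in_liberty: Vec<Arc>,
    pub only_in_verilog: Vec<Arc>,
}

/// Arc-set agreement: every arc on one side should exist on the other.
pub fn arc_set_diff(liberty: &BTreeSet<Arc>, verilog: &BTreeSet<Arc>) -> ArcDiff {
    ArcDiff {
        only_in_liberty: liberty.difference(verilog).cloned().collect(),
        only_in_verilog: verilog.difference(liberty).cloned().collect(),
    }
}
```

The same arc_set_diff applies across corners: comparing each corner .lib's arc set against the first one would flag a cross-corner arc-set difference as a generation-time diagnostic.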

This is the same diff discipline ADR 0001/0002 use, applied across the L2 logic sources, the L4 arc topology, and the corner set. It is the structural check that build.rs's port-only assert_eq! never had, and it directly targets the silent-substitution class that #130 names.

D7 — Bundled descriptors regenerated in CI from pinned vendored PDKs

The built-in PDKs — AIGPDK, SKY130, GF180MCU, and IHP SG13G2 (added through the IR; see D7a) — are bundled as generated cell-model-IR files, regenerated in CI from the pinned vendored PDKs rather than checked in as artifacts. CI runs the converter (D6) over the pinned submodules and embeds the resulting descriptors into the binary at build time; the descriptors are not committed blobs. The vendored PDKs therefore become inputs to cell-library generation, not a runtime dependency of jacquard core — core depends on neither the cell submodules nor pdk_decomp at runtime, and a released binary is self-contained.

This is the deliberate contrast with the timing-IR bindings (which are checked in): cell descriptors are larger, fully derived, and have a deterministic generator, so committing them would only add generated blobs and drift risk. Build-time generation replaces today's build.rs pin-table generation, so source builds keep the same vendored-submodule requirement they already have. Diff-ability (D2) is preserved as a capability — CI regenerates deterministically and the generator ships a descriptor diff tool (mirroring timing-ir-diff) — rather than via git history; drift and provenance are enforced by the CI regeneration + the D6 cross-check, not by a committed reference.

D7a — IHP SG13G2 as a new PDK added with zero per-PDK Rust

IHP's open SG13G2 PDK (IHP-Open-PDK) is added as a new built-in target — Jacquard has no existing per-PDK Rust for it — so it is the proof of the headline consequence that adding a PDK is no longer a Jacquard code change: it is onboarded purely by vendoring its submodule and generating a descriptor. It is also the cleanest Liberty-first validation: every SG13G2 combinational cell carries a function string and every sequential cell a complete ff group (with reset polarity), exercising the D6 Liberty-first path end to end, in contrast to GF180's functional.v-primary legacy path.

D8 — Selection by declared prefix + explicit override

Each descriptor declares the cell-name prefix(es) it covers. Jacquard auto-matches a netlist against the bundled descriptors (replacing the hardcoded prefix detection in src/pdk.rs); --cell-descriptor <foo.json> is the explicit override and the path for proprietary libraries. The ADR-0010 .cells.toml becomes the hand-authored override layer over the generated IR — same data model, different provenance.
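A sketch of prefix-based selection with an explicit override follows. DescriptorInfo, Selection, and select_descriptor are hypothetical names, and requiring one descriptor to cover every cell type in the netlist is an assumption made for the example; a real design mixing standard cells with macros may need a looser rule.

```rust
/// The selection-relevant part of a bundled descriptor.
pub struct DescriptorInfo {
    pub name: String,
    pub prefixes: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum Selection {
    /// Path given via --cell-descriptor.
    Override(String),
    /// Name of the auto-matched bundled descriptor.
    Bundled(String),
}

fn covers(d: &DescriptorInfo, cell: &str) -> bool {
    d.prefixes.iter().any(|p| cell.starts_with(p.as_str()))
}

/// The explicit override always wins; otherwise exactly one bundled descriptor
/// must cover every cell type that the netlist instantiates.
pub fn select_descriptor(
    cell_types: &[&str],
    bundled: &[DescriptorInfo],
    override_path: Option<&str>,
) -> Result<Selection, String> {
    if let Some(path) = override_path {
        return Ok(Selection::Override(path.to_string()));
    }
    let mut matches = bundled
        .iter()
        .filter(|d| cell_types.iter().all(|c| covers(*d, *c)));
    match (matches.next(), matches.next()) {
        (Some(d), None) => Ok(Selection::Bundled(d.name.clone())),
        (None, _) => Err("no bundled descriptor covers this netlist; pass --cell-descriptor".to_string()),
        (Some(a), Some(b)) => Err(format!("ambiguous match: {} and {}", a.name, b.name)),
    }
}
```

With separate 7-track and 9-track descriptors that declare distinct prefixes, a 9-track netlist can only match the 9-track descriptor, so the silent substitution described in #130 has no path to occur.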

Consequences

Library-agnostic binary. A proprietary library works with zero Jacquard changes: the user generates a descriptor on their own machine from their Liberty (which never leaves it) and points jacquard at it. The generator is the only tool that touches raw foundry files.
Large core simplification. src/gf180mcu.rs, src/gf180mcu_pdk.rs, src/sky130*.rs special-cases, the PdkVariant enum, the build.rs pin-table generators, and the hardcoded vendor paths all retire in favour of one IR consumer. Adding a PDK stops being a Jacquard PR — and IHP SG13G2 (D7a) is the worked proof: a new built-in PDK onboarded with zero per-PDK Rust, purely by vendoring + generating a descriptor.
#130 dissolves. "Which functional models?" becomes "which descriptor matches this netlist" — derived or --cell-descriptor-overridden — and the 7t/9t silent-substitution risk goes away because each library has its own generated descriptor.
Retires the runtime .lib parse and unifies "bring your own library." L4 folds in, so liberty_parser::TimingLibrary leaves the runtime for the generator. Today you already point at your own .lib for timing while the logic is baked from vendor/; now both come from one descriptor — exactly the proprietary-library goal.
New schema + converter crate to maintain, under the same scope discipline ADR 0002 demands: the cell-model IR is per-cell-type library facts only — not a netlist, not per-design annotation, not a placement/physical model. Creep is rejected.
Verification moves to a diff/round-trip discipline (ADR 0001/0002 pattern): since descriptors are CI-regenerated rather than committed (D7), the gate is that CI regeneration is deterministic plus a logic round-trip — does the IR's AIG reproduce a reference of the cell? — and the D6 cross-check across sources. Together these generalise today's build.rs port-only assert_eq! into an actual logic check, closing the gap #130 names.
Sequential fidelity is the risk surface. Mapping Liberty ff (clear / preset / clear_preset_var) onto Jacquard's DFF + async-reset model is where bugs will hide; it gets dedicated generator tests against the existing GF180/SKY130 behaviour as the oracle.

Open questions

~~L2 source of truth~~ — Resolved (D6): Liberty-first, .v as a bounded fallback. A survey of open and commercial (Liberate-generated) standard-cell libraries found the same shape in all of them: Liberty carries function for every combinational cell and ff/latch groups (clocked_on/next_state/clear/preset, with reset polarity) for every sequential cell, while the behavioural .v models sequential cells as a vendor UDP wrapped in notifier/$setuphold timing-check machinery. So Liberty is the cleaner and sufficient source for L2 function and L3 ff/latch — it states the next-state/reset logic directly, whereas the .v buries it in simulation scaffolding. The generator reads Liberty first and consults .v/UDP only for the L2 combinational cases Liberty under-specifies (cells with no function string, or UDP truth tables whose exact X-semantics you want, e.g. some muxes). This is the converter's target; the existing GF180/SKY130 functional.v path is fine as the first converter input in plan C1.

Two bounded sub-points:
- L4 timing is Liberty-exclusive for standard cells. A stdcell .v specify block carries timing arc topology (which arcs/checks exist) but zero/placeholder values — it is an SDF back-annotation scaffold, redundant with the .lib's own arc set. So the cross-check on stdcell timing is arc-set agreement, not value comparison (see D6). Exception — macros/SRAM. A memory/macro behavioural .v often embeds real, hardcoded timing (access time, setup/hold) that can diverge from the macro's Liberty — here the two sources disagree on values, not just topology. (The GF180 SRAM model is a suspected case, not yet diffed; SRAM timing is already a special path in Jacquard, with a hardcoded conservative read-delay fallback in src/liberty_parser.rs.) The generator must surface that disagreement (D6); L4 still comes from Liberty as the single authoritative source, but the divergence is a real correctness signal worth flagging rather than hiding.
- Violation-response is out of scope. The .v's notifier→UDP-x corruption-on-violation is a simulator policy, not a per-cell-type library fact: the constraint values it needs are already L4 from Liberty, and the response is owned by Jacquard's own timing-violation subsystem (docs/timing-violations.md: report-based, X only under --xprop). It is not extracted into the IR.
~~L4 multi-corner shape~~ — Resolved (D5): all corners in the one descriptor, L4 keyed by corner (mirroring the timing IR's corner-set + corner_index; min/typ/max stays the orthogonal within-corner derate). Simulation corner is user-selected via --corner (default default_corner), not inferred from the netlist or SDF — the setup corner is recorded nowhere Jacquard reads, and SDF delays already encode the corner via the orthogonal timing IR. Corner-overlay-file split deferred as a size escape hatch.
~~Bundled-descriptor provenance~~ — Resolved (D7): regenerated in CI from pinned vendored PDKs and embedded at build time, not checked in as artifacts. Build-time generation replaces the build.rs pin-table step (source builds keep the existing submodule requirement); diff-ability and drift control move to deterministic CI regeneration + the generator's descriptor diff tool + the D6 cross-check.
~~Migration shape~~ — Resolved: per-PDK cutover. The IR consumer runs alongside the per-PDK Rust and each PDK is migrated independently, keeping the suite green throughout, rather than a single switch. IHP SG13G2 (D7a) is added in the same per-PDK fashion — but greenfield (no existing Rust to retire), so it doubles as the zero-per-PDK-Rust proof.
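
The D5 resolution (all corners in one descriptor, L4 keyed by corner, simulation corner chosen by the user) can be illustrated with a small sketch. Every type, field, and function name below is hypothetical apart from the default_corner concept named above; the real descriptor schema is defined by this ADR's decisions, not by this example.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-corner L4 payload; a real descriptor carries far more.
#[derive(Debug, Clone, PartialEq)]
pub struct CornerTiming {
    /// Illustrative field only: a clock-to-Q delay in picoseconds.
    pub clk_to_q_ps: u32,
}

/// Hypothetical shape of one cell descriptor holding every corner.
#[derive(Debug, Clone)]
pub struct CellDescriptor {
    pub default_corner: String,
    pub l4_by_corner: BTreeMap<String, CornerTiming>,
}

/// Pick the L4 block for a run: the user's `--corner` value if one was given,
/// otherwise the descriptor's `default_corner`. Nothing is inferred from the
/// netlist or the SDF.
pub fn select_corner<'a>(
    desc: &'a CellDescriptor,
    cli_corner: Option<&str>,
) -> Result<(&'a str, &'a CornerTiming), String> {
    let name = cli_corner.unwrap_or(desc.default_corner.as_str());
    desc.l4_by_corner
        .get_key_value(name)
        .map(|(k, v)| (k.as_str(), v))
        .ok_or_else(|| format!("unknown corner '{}'", name))
}
```

An unknown corner name is an error here rather than a silent fallback to the default; that choice is an assumption of the sketch, not something the ADR states.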

Relationship to other ADRs

Orthogonal to ADR 0002 — the timing IR is per-design instance annotation; the cell-model IR is per-cell-type library facts (incl. L4 timing characterization). Different axes, not siblings — but the cell-model IR realises 0002's predicted "separate IR for cell characterization", and reuses its versioning + diff discipline.
Extends ADR 0010 — supplies the "future ADR" 0010 deferred for the heavy (L2/L3) schema, folds in the L4 .lib characterization, and recasts the .cells.toml path as a hand-override over generated IR.
Builds on ADR 0011 — RAM keeps its port schema; it is the worked precedent for the D4 sequential schema.
Feeds ADR 0014 — the cell-model IR's L2 payload is AIG (D3), spliced into the design AIG at load.

Staged delivery: docs/plans/cell-model-ir.md. Tracks #130 and #67.

ADR 0020 — Python engine as a bundled binary wheel

Status: Draft — decision deferred (2026-07-01). Not ratified; not scheduled.

Direction note (2026-07-01). This ADR drafts a subprocess-bundled binary wheel (embed the CLI, keep PR #53's subprocess API). On review, a native PyO3 binding is the preferred long-term direction — an in-process engine is a materially better "first-class" Python surface than shelling out to a bundled binary, and if we're going to invest in per-platform wheels (cibuildwheel + dylib repair) either way, doing it once for a PyO3 extension avoids building the subprocess-wheel machinery only to replace it. We are not doing this now; the decision (subprocess-wheel-first vs. straight to PyO3) is deferred and tracked in #161. The subprocess-wheel design below is preserved as the worked alternative and as the analysis of the shared hard part (per-platform wheels + vendoring the libc++/libomp runtime tail), which a PyO3 binding faces too.

Relates to: ADR 0018 (distribution model — this extends it with a Python channel), PR #53 (the subprocess API this would adopt), docs/plans/python-engine-binary-wheel.md (implementation phases + tracking).

Context

Jacquard ships as a Rust CLI (jacquard sim / cosim / map). Batch automation, regression sweeps, and result analysis are naturally Python work, and PR #53 drafted a Python API for exactly that: JacquardConfig, sim() / map(), SimResult / DesignStats, and a run_regression() harness.

PR #53 as drafted is a pure-Python subprocess wrapper: a hatchling noarch wheel with dependencies = [] that locates a separately installed jacquard binary (JACQUARD_BIN env → PATH → shutil.which) and shells out via subprocess.run. That means a user must install two things from two channels — the binary (brew / cargo binstall, per ADR 0018) and the Python package — and keep their versions in step by hand.

We want a first-class Python engine: pip install jacquard yields a working simulator with no separate binary step. That is a distribution decision PR #53 does not make, and it is the subject of this ADR.

Two forces shape it:

The binary is self-contained except for a runtime-library tail. The sim Metal kernel is embedded via include_bytes! (ADR 0018 / v0.2.1), so the relocated binary needs no sidecar .metallib. But it links Homebrew LLVM's libc++ and libomp (via mt-kahypar/OpenMP), so a bare binary fails to launch without LLVM present — the same defect that broke the v0.2.1 install channels and is worked around today by the Homebrew formula's depends_on "llvm". A wheel has no depends_on; those dylibs must travel inside the wheel.
The GPU backend is a per-platform build. metal (macOS/arm64) is the shipping backend; cuda/hip are Linux and heavy. A universal wheel is not possible; wheels are per-platform, and the backend matrix must be phased.

Decision

Ship the Python engine as a binary wheel that embeds the compiled jacquard binary and its runtime libraries, built with cibuildwheel, published to PyPI. Keep PR #53's subprocess API and config/result model unchanged — the wheel changes packaging, not the Python surface.

Concretely:

Adopt PR #53's API (config, runner, result, regression, errors) as the package's Python layer, moved into the uv workspace (python/jacquard/, a 5th workspace member alongside netlist_graph, chipflow_harness, mcu_soc, root).
Embed the binary. The wheel carries the release binary at jacquard/_bin/jacquard (per-platform). runner.find_jacquard_binary() grows a step 0: prefer the packaged _bin/jacquard, then fall back to its existing env → PATH → which chain (so a dev pointing JACQUARD_BIN at a local build still wins, and a source checkout with no embedded binary still works).
Vendor the runtime tail into the wheel. cibuildwheel's repair step (delocate on macOS, auditwheel on Linux) copies libc++/libomp (and any other non-system dylib the binary links) into the wheel and rewrites the install names, so pip install needs no Homebrew LLVM. This is the crux and the main risk; the plan spikes it first.
Phase the platform matrix (docs/plans/...):
- P1 — macOS/arm64 + Metal. The shipping backend; validates the whole embed-plus-delocate story on the platform we already release for.
- P2 — Linux/x86_64 CPU fallback. The cosim CPU backend (ADR 0017) with no GPU runtime, so pip install works in plain CI containers.
- P3 — CUDA / HIP. Gated on ADR 0018 Phase 4 (prebuilt GPU binaries) and on the wheel-size / CUDA-runtime questions below; may ship as separate extras (jacquard[cuda]) or a manylinux variant rather than the default wheel.
Publish via the established OIDC path. Mirror publish-netlist-graph.yml (trusted publishing, no stored token; TestPyPI dry-run on workflow_dispatch, real PyPI on tag). cibuildwheel replaces the single uv build with a per-platform build matrix.
Version with the binary, not independently. The wheel embeds a specific jacquard build, so its version tracks the binary release (extend scripts/bump_version.py), unlike netlist-graph which versions independently. A wheel's embedded binary and its Python API are one artifact.

Why subprocess, not native (PyO3) bindings

A native PyO3/maturin binding would give in-process state access (drive the sim, peek signals without a subprocess) — a deeper "engine." We defer it:

It multiplies the build matrix by the GPU-feature axis inside the extension module, where the runtime-library and feature-gating problems are far harder than embedding an already-built binary.
PR #53's subprocess API is backend-agnostic: it works for metal/cuda/hip unchanged, because the binary owns the backend. A native binding is not: the extension module itself would have to be built per backend.
The subprocess wheel delivers the user-visible win (pip install → working simulator) now; PyO3 is a strictly larger follow-on that a later ADR can take up if in-process access becomes a real requirement.

Consequences

pip install jacquard is self-contained on supported platforms — the headline win. The two-channel version-skew problem goes away.
cibuildwheel + delocate/auditwheel become release-critical infrastructure. A new failure surface (dylib repair) that the plan de-risks with a P1 spike and a post-build "install into a clean venv and run sim" smoke gate, mirroring the existing relocated-tarball user-acceptance gate.
Wheels are large (embedded binary + vendored dylibs; CUDA especially). Acceptable for Metal/CPU; a gating question for P3.
The PyPI name jacquard must be secured (see open questions). Until then, TestPyPI validates the pipeline.
netlist-graph stays a separate, independently-versioned noarch package. This ADR is only about the simulator engine wheel.

Alternatives considered

Ship PR #53 as-is (pure-Python noarch + separate binary). Simplest, no cibuildwheel, and a fine interim — but it is not the first-class, single-command install the goal calls for; it defers the real distribution decision rather than making it. Rejected as the end state; effectively subsumed as pre-P1 (the API lands first, the embed follows).
Native PyO3/maturin bindings. Highest ceiling, deferred — see above.
Document-and-require the LLVM runtime (as the raw tarball does) instead of vendoring dylibs. Pushes the v0.2.1-class failure onto every pip user; rejected — self-containment is the entire point.

Open questions (tracked in the plan)

PyPI name. Is jacquard available/securable? If not, jacquard-eda or gpu-eda-jacquard, aliasing the import name. Blocks nothing until publish.
CUDA/HIP wheel viability. Size and CUDA-runtime bundling may make P3 an extras/manylinux special case rather than a default wheel — decide in P3 with data, and log/document any platform the wheel silently omits.
delocate coverage of libomp. OpenMP runtime vendoring has known sharp edges (duplicate-runtime aborts if a user's other packages also load libomp); the P1 spike must confirm a clean load in a mixed environment.

ADR 0021 — Behavioral RTL support via an embedded synthesis front-end

Status: Proposed

Revised 2026-07-03 (pre-ratification, still Proposed): the entry point moved from a standalone jacquard build command to folding synthesis into sim/cosim — RTL is simulated with one command, no separate build step. The embedded-Yosys/wasmtime engine (Decision §2) is unchanged; only the surface changed. The original build-command shape is preserved in "Alternatives considered" for the audit trail. Rationale below.

Relates to: ADR 0014 (AIG / emulator model — why synthesis is a front-end at all), the Python-engine work (#161, ADR 0020 pending) that hosts this, ADR 0018 (distribution), docs/synthesis-flow.md (the manual flow this wraps), #162 (implementation tracking).

Context

Jacquard is described as a "GPU-accelerated RTL simulator", and behavioral RTL is the intended design input. But it is an emulator (GEM = GPU-Emulator-inspired; ADR 0014): it maps a synthesized and-inverter graph onto a virtual manycore, exactly as an FPGA-based emulator runs a synthesized bitstream, not behavioral source. So the input to jacquard sim / cosim is a gate-level netlist — structural Verilog mapped to aigpdk / SKY130 / GF180MCU cells — and the parser (sverilogparse) is structural-only.

Behavioral RTL reaches that point through synthesis, which today is a manual, external step: the user runs Yosys (memory_libmap → aigpdk.lib logic synthesis) or a commercial tool per docs/synthesis-flow.md. This is a genuine capability — RTL designs run fine — but it is a UX cliff:

Newcomers can't tell what the tool accepts. (Repeated external feedback: "it wasn't clear what your netlist input language support was.")
"Bring RTL, then run this Yosys script, then point sim at the output" is a multi-tool ceremony before the first waveform.

Two forces shape the fix:

Synthesis quality drives Jacquard's speed. synthesis-flow.md is explicit: the AIG the GPU emulates is only as good as the mapping, and a commercial synthesizer (DC) yields better QoR than Yosys. So synthesis is a real quality knob, not pure friction — we must not hide it in a way that silently caps performance.
A synthesizer is embeddable now, cheaply. YoWASP ships Yosys as WebAssembly wheels (yowasp-yosys) — no system Yosys, no C++ vendoring, cross-platform. It is already in this workspace's uv.lock (transitively via amaranth[builtin-yosys] in the mcu_soc design).

Decision

Add an embedded synthesis front-end so behavioral RTL is a first-class input, while keeping the emulator's synthesized-netlist core unchanged:

sim / cosim accept behavioral RTL directly — synthesis is a transparent, cached pre-processor inside those commands, not a separate user step. There is no jacquard build command. On the input path, the command classifies what it was handed and dispatches three ways:

Input	Structural parse (`sverilogparse`)	Cell family	Action
Gate-level, built-in PDK	parses	matches an `is_*_stdcell` family	simulate directly — the embedded descriptor supplies logic + timing (corner via `--corner`, ADR 0019)
Gate-level, unknown PDK	parses	matches nothing	error, actionable: "gate-level netlist with unrecognized cells — pass `--cell-descriptor <path>`" (ADR 0019 D8). Not synthesized — it is already a netlist.
Behavioral RTL	fails (`always`/`if`/`case`/operators are not structural)	n/a	synthesize → aigpdk → simulate, via the embedded Yosys (Decision §2), caching the result

The command prints what it decided (e.g. design.v: behavioral RTL → synthesized [YoWASP Yosys, functional QoR] → <cache>), so synthesis is never silent (honouring force #1 in Context). --rtl / --netlist override the auto-detection; a syntax error in a netlist that falls through to the synth path surfaces both the structural and the Yosys diagnostics. The gate-level artefact is dumpable for inspection/fixtures via --emit-synth <path>. RTL in, waveform out, one command — driving the existing docs/synthesis-flow.md scripts (memlib_yosys.txt → aigpdk.lib).

Yosys as WebAssembly, executed in-process from Rust — not a vendored native build, and not via the Python yowasp-yosys wrapper. YoWASP ships Yosys as a single self-contained yosys.wasm (abc is compiled in-tree into that module and called in-process — WASI has no exec), and the yowasp-runtime that runs it is a thin harness over wasmtime, which has a first-class Rust crate. So the synthesis engine is a Rust component (src/synth.rs) embedded in the jacquard binary that embeds wasmtime, loads the (bundled or fetch-on-first-use) yosys.wasm, preopens the design + aigpdk library files + a temp dir under WASI, and runs the existing synthesis script — caching the compiled module and the synthesized netlist to disk (by content hash) exactly as yowasp-runtime does. It is invoked transparently by sim/cosim on the behavioral-input path (Decision §1), behind the opt-in synth feature (wasmtime+cranelift are heavy to compile). No Python interpreter and no external toolchain: jacquard sim design.v … is RTL-to-waves from the single jacquard binary. This decouples the on-ramp from the Python-engine packaging decision (ADR 0020 / #161) — it needs neither PyO3 nor a subprocess-bundled wheel.
Two synthesis tracks, kept explicit (honouring force #1):
- On-ramp: YoWASP Yosys — easy, functional; the default when sim/cosim are handed behavioral RTL.
- Performance: bring-your-own DC (or native Yosys) → gatelevel.gv → jacquard sim directly (the "gate-level, built-in PDK" row above). Documented as the path to peak GPU speed.
The emulator core does not change. Synthesis is a transparent pre-processor inside sim/cosim that produces the same structural netlist users synthesize by hand today; the AIG/boomerang pipeline (ADR 0014/0015) and the structural sverilogparse input are untouched — the behavioral path simply feeds them a just-synthesized netlist instead of a hand-written one. Behavioral elaboration stays Yosys's job — we do not reimplement an RTL front-end.
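
The three-way dispatch in the input table above reduces to a small decision function. The sketch below is illustrative only: the enum and function names are hypothetical, the two inputs stand in for the result of the sverilogparse attempt and the is_*_stdcell family check, and the --rtl / --netlist overrides are left out because their exact semantics are not spelled out here.

```rust
/// Result of attempting the structural parse (hypothetical stand-in).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuralParse {
    Parses,
    Fails,
}

/// What sim/cosim does with the input, one variant per table row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Gate-level netlist on a built-in PDK: simulate as-is.
    SimulateDirectly,
    /// Gate-level netlist with unrecognized cells: actionable error,
    /// never synthesized because it is already a netlist.
    ErrorUnknownCells,
    /// Behavioral RTL: synthesize to aigpdk, cache, then simulate.
    SynthesizeThenSimulate,
}

/// Classify an input from the two observations in the table.
/// `matches_known_family` is only meaningful when the parse succeeded.
pub fn dispatch(parse: StructuralParse, matches_known_family: bool) -> Action {
    match (parse, matches_known_family) {
        (StructuralParse::Parses, true) => Action::SimulateDirectly,
        (StructuralParse::Parses, false) => Action::ErrorUnknownCells,
        (StructuralParse::Fails, _) => Action::SynthesizeThenSimulate,
    }
}
```

Keeping the classification a pure function of two observations makes the "prints what it decided" line easy to derive from the same value that drives the dispatch.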

Implementation phasing lives in #162.

Phase 2 — RTL-source provenance (roadmap, not this decision's gate)

Landed (2026-07). RTL in → source-annotated waves out. The on-ramp maps via abc_new (abc9/aiger2 XAIGER path) so Yosys (* src *) origins survive std-cell mapping, and the WS-B ingestion (sverilogparse → netlistdb.cell_src → AIG::aigpin_src_locations) surfaces them in xsources + --trace-signals. Two facts below were corrected in flight: (1) the fork wasm builds via the develop-0.64 Makefile recipe (wasi-sdk 27 + yosys-slang whole-archive), NOT CMake/wasi-sdk-33 — the patched yosys fork is Makefile-only and CMake dropped read_slang. (2) The in-process &origins risk flagged below is resolved — origins survive the in-process WASI abc_new round-trip (100% \src on comb + sequential). The provenance wasm is built + released by the gpu-eda/yowasp-yosys fork's own CI; jacquard fetches the pinned release (A2). aigpdk-specific mapping needs read_liberty -lib + hierarchy -purge_lib before abc_new and a single mapping pass. Details: docs/plans/rtl-source-provenance.md.

A patched toolchain carrying the origin-shell \src pass-through — berkeley-abc #487 (vOrigins/&origins) + robtaylor/yosys@src-retention-y-ext — keeps RTL source locations alive through std-cell mapping. Jacquard could then thread \src through netlistdb/AIG so GPU-sim results speak RTL — --trace-signals, timing violations, and X-debugging reporting source lines instead of flattened gate names.

The tractable route is building our own yosys.wasm from source rather than depending on the stock upstream wheel. YoWASP's build (build.sh: CMake + wasi-sdk 33 via wasi-sdk-p1.cmake, zlib/libffi/readline/editline/tcl disabled; recon'd 2026-07-04 — supersedes an earlier "wasi-sdk 27 / Makefile CONFIG=wasi" description) compiles abc in-tree from yosys's own bundled abc/ submodule — verified against the shipped yosys.wasm, which embeds the abc engine (the build path …/yosys-src/abc/src/base/abci/… and the full set of &-prefixed GIA commands are present in the module). So a provenance build is mechanical: fork YoWASP/yosys (now on Codeberg; the old GitHub repo was archived read-only 2026-03-11), repoint yosys-src → robtaylor/yosys@src-retention-y-ext and yosys's bundled abc submodule → robtaylor/abc@origin-tracking-clean (#487), and rerun build.sh. Provenance then ships self-contained in one .wasm — not blocked on upstream YoWASP adopting the patches, only on the patches themselves (abc#487 in review).

Key open risk to spike before committing to WASM provenance: origin-shell validated \src retention through external abc (ABCEXTERNAL, temp .aig round-trips). The WASM build calls abc in-process, a different integration surface — the &origins data must survive the in-process call path, not a temp-file hand-off. This is unproven and is the first Phase-2 task. A large build effort overall; tracked as Phase 2 in #162, out of scope for ratifying the on-ramp.

Consequences

RTL becomes a first-class, single-command input — the onboarding cliff is removed, and "what does it accept?" has a clean answer: your RTL (or a pre-synthesized netlist).
The accepted-RTL surface is defined and documented, not implicit. Because synthesis is delegated to Yosys, the accepted behavioral subset is the embedded YoWASP Yosys frontend — and that frontend includes yosys-slang (read_slang), a near-complete SystemVerilog-2017 elaborator, verified present in the pinned yowasp-yosys 0.64.0.0.post1131 wasm (read_slang + 495 slang symbols). So SystemVerilog language coverage is strong (packages, interfaces, advanced generate, structs/enums, most of SV), not the narrow built-in-read_verilog subset. The remaining bound is concurrent-SVA synthesis — turning SVA into synthesizable checkers is a separate Yosys formal capability, still partial (#106/#107), independent of slang's parsing. On top of the frontend sit the project's techmaps (assertions → GEM_ASSERT, $display → GEM_DISPLAY, memories → RAMGEM), minus dropped testbench-only constructs. This gets a dedicated docs/accepted-rtl.md, whose authoritative form is an empirical coverage table driven through sv-tests (follow-up) rather than a hand-claimed feature list — hand-claims about a delegated frontend would violate the "verify, don't assert" bar.
A QoR ceiling on the easy path. YoWASP Yosys is functional-grade; peak GPU performance still wants DC. The auto-synth path prints its QoR tier and the docs must state this, so the on-ramp isn't mistaken for the performance path.
A Rust wasmtime dependency + a yosys.wasm asset — no new Python dependency. The synth feature is opt-in at compile time (heavy cranelift build), but because behavioral input is now a sim/cosim capability rather than a separate command, released binaries must be built with --features synth — otherwise sim design.v on RTL fails with a build-without-synth message. A pre-synthesized netlist still simulates with the feature off. The ~39 MB wasm is either bundled into the binary or fetched to a cache on first use (open sub-decision, #162). Resolved (2026-07, Phase 2): fetch-on-first-use. locate_yosys_wasm (src/synth.rs) resolves --yosys-wasm → JACQUARD_YOSYS_WASM → a pinned provenance wheel downloaded (sha256-verified, cached) from a gpu-eda/yowasp-yosys release. Provenance forces this: the abc_new origins flow needs the fork wasm, so a bundled/stock wasm won't do (see Phase 2 above).
Decouples the on-ramp from ADR 0020 / #161: because synthesis runs Yosys from Rust via wasmtime, it needs neither the PyO3 binding nor a subprocess-bundled Python wheel. The Python-engine packaging call can be made independently; the RTL on-ramp no longer waits on it.
Timing stays descriptor-aligned, not .lib-coupled. The on-ramp does not reintroduce a --liberty runtime path: for built-in PDKs, logic and timing come from the embedded cell-model descriptor (ADR 0019 D5), corner-selected via --corner. --liberty / --cell-descriptor remain for user/custom PDKs.
Provenance (Phase 2) is a large, dependency-gated follow-on, not promised by this ADR.
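
The wasm resolution order described in the consequences above (--yosys-wasm, then JACQUARD_YOSYS_WASM, then the pinned download) can be sketched as a pure function. This is not the actual locate_yosys_wasm in src/synth.rs, whose signature is not shown here; the names WasmSource and resolve_yosys_wasm are hypothetical, and treating an empty environment value as unset is an assumption of the sketch.

```rust
use std::path::PathBuf;

/// Where the yosys.wasm for this run comes from (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WasmSource {
    /// Path given on the command line via `--yosys-wasm`.
    CliFlag(PathBuf),
    /// Path taken from the `JACQUARD_YOSYS_WASM` environment variable.
    EnvVar(PathBuf),
    /// Neither was given: fetch the pinned, sha256-verified release
    /// into the cache on first use.
    PinnedDownload,
}

/// Resolve the source in priority order. The environment value is passed
/// in (for example from `std::env::var("JACQUARD_YOSYS_WASM").ok()`) so
/// the function stays free of global state.
pub fn resolve_yosys_wasm(cli_flag: Option<PathBuf>, env_value: Option<String>) -> WasmSource {
    if let Some(path) = cli_flag {
        return WasmSource::CliFlag(path);
    }
    match env_value {
        // Assumption: an empty variable is treated the same as an unset one.
        Some(v) if !v.is_empty() => WasmSource::EnvVar(PathBuf::from(v)),
        _ => WasmSource::PinnedDownload,
    }
}
```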

Alternatives considered

Keep synthesis external (status quo), document better. The honest interim and still the performance path — but rejected as the end state: it leaves the onboarding cliff that prompted this.
Vendor/embed native Yosys (C++). Heavy build + distribution burden across macOS/Linux/arch. YoWASP's WASM wheels are the lightweight embed that makes this decision cheap.
Python-hosted front-end via the yowasp-yosys wrapper (the original shape of this ADR). Rejected once inspection showed the wrapper is a ~1 KB shim and yowasp-runtime a ~100-line harness over wasmtime: hosting build in Python would force the user to have Python + pip + the wheel present (reintroducing an install dependency) and couple the on-ramp to the unresolved #161 packaging decision — all to avoid porting a small WASI harness the Rust wasmtime crate supports directly. scripts/local_synth.py (which drives yowasp_yosys.run_yosys for the sky130 flow) remains a useful reference for the synthesis-script content, even though the runtime is now Rust.
A Nix environment instead of (or alongside) YoWASP. origin-shell is itself a Nix flake pinning patched yosys + abc + librelane. A Nix devshell is a viable alternative back-end for jacquard build: native binaries (better QoR and wall-clock than WASM), fully reproducible, and it carries the same patches with far less build engineering than a bespoke WASM toolchain — at the cost of requiring Nix on the user's machine. The two are complementary — YoWASP: zero-install, cross-platform, pip-native, with a WASM speed/QoR ceiling; Nix: native performance + reproducibility for users already in a Nix flow, and the natural vehicle for the Phase-2 patched (\src) toolchain before/without a WASM build. jacquard build should abstract the synthesis back-end so it can dispatch to whichever (YoWASP wheel, Nix devshell, or a plain system Yosys/DC) is present.
A standalone jacquard build <design.v> command producing gatelevel.gv as an explicit user step (the original shape of this ADR; implemented on the first cut of PR #167). Revised away before ratification: it reinstated the very multi-command ceremony ("build, then point sim at the output") the ADR set out to remove, and the artefact boundary it justified turned out unneeded — the committed *_synth.gv test fixtures are each written once by the test that introduces them and never regenerated, so no standing "rebuild the fixtures" workflow depends on a build command; --emit-synth covers one-off fixture authoring. Folding synthesis into sim/cosim (Decision §1) keeps the one-command promise while the same src/synth.rs engine still backs it.
Elaborate behavioral RTL directly inside Jacquard (no synthesis). Rejected: it contradicts the emulator model (ADR 0014) and would reimplement a Verilog front-end Yosys already provides — enormous scope for no architectural gain.

ADR 0022 — Transaction-based external stimulus (SCE-MI-style pipes)

Status: Proposed

Relates to: ADR 0013 (peripheral models, GPU→CPU ring buffers), ADR 0017 (batch dispatch), ADR 0021 (RTL on-ramp — what makes a synthesizable transactor possible), and Testbench Interop.

Context

"Can I point my UVM / cocotb testbench at Jacquard?" is the most common question we get, and interop.md currently answers it as an open one. It isn't. The batching constant decides it.

The number

src/sim/cosim/mod.rs:

/// Batch size for backend dispatch: number of consecutive scheduler edges run
/// in one `run_edges` call (no per-tick CPU interaction within a batch).
const BATCH_SIZE: usize = 1024;

No per-tick CPU interaction within a batch. That clause is where cosim's throughput comes from: 1024 scheduler edges execute with the CPU absent. Device timestamps put the run firmly GPU-compute-bound (the simulate kernel is ~72% of per-edge GPU time; see Cosim Perf Report), which is only true because nothing crosses the boundary in between.

A UVM driver is per-cycle by construction: it reaches through a virtual interface and wiggles pins on every clock. Driving Jacquard that way forces BATCH_SIZE to 1 and a CPU round-trip per edge, which deletes the reason to use a GPU. So this was never an effort question. Any external stimulus that needs to observe or drive every cycle is incompatible with the engine, and no amount of implementation makes it otherwise.

Why not run the testbench on the GPU

Running UVM inside Jacquard is not a long tail of missing features, it is a category error:

No runtime to allocate into. The AIG is elaborated once into a static instruction script (FlattenedScriptV1). UVM needs a heap, new(), inheritance, virtual dispatch, and a factory doing runtime type overrides.
No scheduler. fork/join, mailboxes, events and UVM's phasing all sit on SystemVerilog's stratified event queue. Jacquard deliberately has no event queue — discarding it, and evaluating whole logic cones per edge, is the speed thesis.
No solver. randomize() is a constraint solve per transaction, a search, not circuit evaluation.
No strings or associative arrays for uvm_config_db and reporting.
Only immediate assertions. $check lowers to GEM_ASSERT cells today; temporal SVA (|->, ##[1:$]) needs automata with runtime state.

Building all that is writing a SystemVerilog simulator. And the payoff would be negative: a testbench is sequential and dynamic, so it would run at CPU speed while dragging the GPU into a per-edge handshake. A slow simulator inside a fast one.

The industry settled this

Emulators hit this wall decades ago and standardised the answer: SCE-MI (Accellera, currently v2.4, Nov 2016). Split the testbench in two:

an untimed side, on the host, holding sequences, randomisation, scoreboard and coverage;
a timed side, in the emulator, holding a transactor (BFM) that turns messages into pin activity;
pipes between them: buffered, flow-controlled, message-oriented channels with a configurable depth.

The property that matters for us: a transactor may run ahead of the host, up to the pipe's buffering. That is the same statement as "no per-tick CPU interaction within a batch", arrived at independently by people with the same constraint. Jacquard being emulator-inspired (GEM) is not a coincidence here.

No open-source testbench will drop into this

Worth stating plainly, because it sets expectations: no open-source UVM testbench uses SCE-MI pipes. That isn't an oversight to be fixed, it's structural — SCE-MI exists to talk to Palladium, Veloce and ZeBu, and open-source UVM targets simulators, which have no boundary to amortise. The public material is DVCon papers and the spec, not repositories.

The consequence for us: there is no ready-made UVM testbench to point at Jacquard, and adopting this ADR will not produce one. The transactor is ours to write, per protocol, and the untimed side has to be adapted to speak transactions. Anyone hoping to lift an existing testbench off GitHub should read that as a no.

The open-source world solved it anyway, under another name

FireSim / Golden Gate (MIDAS II) is the same architecture with different vocabulary, and unlike SCE-MI it is readable code rather than a PDF:

SCE-MI	FireSim	here
transactor / BFM	bridge (target-side RTL + host-side software)	synthesizable BFM in the AIG
pipe	token channel	the CPU↔GPU rings
untimed / timed split	host-decoupling	CPU model / GPU batch

Its central idea is worth more than the vocabulary: Golden Gate decomposes the target into a dataflow graph of latency-insensitive models, so the target's notion of time is decoupled from the host's and either side may run ahead or stall without changing the result. That is the general form of what BATCH_SIZE does for us by hand, and it is the reason FireSim can be both FPGA-accelerated and deterministic.

We cannot lift the implementation — FireSim is Chisel/FIRRTL onto FPGAs, we are Verilog/AIG onto GPUs — but it is the closest prior art with source, and the place to look before inventing channel semantics.

Two things recently made this reachable

ADR 0021 shipped (v0.3.0). A transactor is synthesizable RTL, so it now compiles into the same AIG as the DUT through the on-ramp. Before that, every protocol meant a hand-written kernel.
Half the channel exists. ADR 0013 already defines GPU→CPU ring buffers, carrying UART bytes and bus traces out of a batch.

Decision

Adopt the SCE-MI split and pipe semantics as the model for external stimulus. Do not implement the SCE-MI API.

Untimed side (host). Sequences, randomisation, checking. UVM in a real simulator, or cocotb, or plain Rust — the model doesn't care.
Timed side (GPU). A transactor written as synthesizable SystemVerilog, compiled via the on-ramp into the AIG beside the DUT, so it lives inside the batch and wiggles pins at GPU speed.
Between them: pipes. Buffered, flow-controlled, message-oriented, with a depth. Semantics only — the C API, the SCE-MI 1 macro-based message ports, and DMI are all out of scope.

What this requires, concretely:

Direction	Status
GPU→CPU (responses, monitors)	exists — the ADR 0013 ring buffers
CPU→GPU (transactions in)	missing — the one new mechanism

The CPU→GPU pipe must be pre-loaded before dispatch, not streamed. ulib's H2D is synchronous, so uploading per edge reintroduces exactly the stall the batch exists to avoid. Load a batch's worth of messages into device memory, and let a GPU-side feeder pop them as the transactor asserts ready.
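
A host-side model makes the pre-load-then-pop behaviour concrete. The sketch below is not GPU code and not an existing Jacquard API: PreloadedPipe, BatchOutcome, and run_batch are hypothetical names, messages are assumed to be fixed-width u64 values, and the transactor is assumed to assert ready once every cycles_per_transfer edges.

```rust
/// A CPU->GPU pipe whose contents are uploaded once, before dispatch.
#[derive(Debug)]
pub struct PreloadedPipe {
    messages: Vec<u64>, // assumed fixed-width message encoding
    next: usize,        // index of the next message the feeder will pop
}

impl PreloadedPipe {
    /// Load a whole batch's worth of messages up front.
    pub fn preload(messages: Vec<u64>) -> Self {
        PreloadedPipe { messages, next: 0 }
    }

    /// Feeder step, called once per edge: hands over one message only
    /// when the transactor asserts ready and a message is available.
    pub fn pop_if_ready(&mut self, transactor_ready: bool) -> Option<u64> {
        if !transactor_ready {
            return None;
        }
        let msg = self.messages.get(self.next).copied()?;
        self.next += 1;
        Some(msg)
    }

    /// Messages still waiting in the pipe.
    pub fn remaining(&self) -> usize {
        self.messages.len() - self.next
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BatchOutcome {
    pub edges_run: usize,
    pub transactions_popped: usize,
    pub starved: bool,
}

/// Run one batch with no host interaction. Stops early, marked starved,
/// at the first edge where the transactor is ready but the pipe is empty.
pub fn run_batch(pipe: &mut PreloadedPipe, edges: usize, cycles_per_transfer: usize) -> BatchOutcome {
    assert!(cycles_per_transfer > 0, "a transfer must take at least one edge");
    let mut popped = 0;
    for edge in 0..edges {
        let ready = edge % cycles_per_transfer == 0;
        match pipe.pop_if_ready(ready) {
            Some(_msg) => popped += 1,
            None if ready => {
                return BatchOutcome { edges_run: edge, transactions_popped: popped, starved: true };
            }
            None => {}
        }
    }
    BatchOutcome { edges_run: edges, transactions_popped: popped, starved: false }
}
```

In this model a pipe pre-loaded with too few messages ends the batch early, which is the stall the pre-load is meant to prevent.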

The sizing rule

Pipe depth buys batch depth. A transfer of ~4–40 cycles means a 1024-edge batch consumes roughly 25–250 transactions, so the pipe must hold at least that or the batch stalls on an empty queue and we are back to per-edge. This is the design's one hard number, and the metric to judge it by is transactions retired per batch, not features supported.
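
The sizing rule is a ceiling division, shown here as a minimal sketch. The function name is hypothetical, and the arithmetic assumes one scheduler edge per transfer cycle, which is what the figures above imply.

```rust
/// Minimum number of messages the pipe must hold so that a batch of
/// `batch_edges` edges never finds it empty, when each transfer occupies
/// `cycles_per_transfer` edges (assumed: one edge per cycle).
pub fn min_pipe_depth(batch_edges: usize, cycles_per_transfer: usize) -> usize {
    assert!(cycles_per_transfer > 0, "a transfer must take at least one edge");
    // Ceiling division: a transfer that starts inside the batch still
    // needs its message, even if it finishes after the batch ends.
    (batch_edges + cycles_per_transfer - 1) / cycles_per_transfer
}
```

For a 1024-edge batch this gives 26 messages at 40 cycles per transfer and 256 at 4 cycles per transfer, in line with the rough 25 to 250 range quoted above.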

Consequences

Existing UVM drivers do not port. Their timed half is rewritten as a BFM. That is the standard SCE-MI/TBX workflow rather than a Jacquard tax, but it is real work and should be stated plainly to anyone asking.
Suits protocol transactors, not big memories. A BFM has little state. A 16 MB flash backing store does not want to be an AIG memory, so bespoke kernels keep earning their place there.
Gives ADR 0013's target architecture a second answer. Rather than generalising the bespoke-kernel pattern to every peripheral, some peripherals could simply be RTL.
UVM is not free at the end of this. A host adapter (DPI or a socket) is still needed to put a real SystemVerilog simulator on the untimed side.
The failure mode is a starved pipe. If the host cannot keep it full, the batch stalls and the GPU idles — the same cliff, reached from the other side.

Alternatives considered

Per-cycle bridge (naive cocotb/DPI). Rejected: BATCH_SIZE → 1. This is what interop.md means by "a naive bridge would marshal Python ↔ GPU every cycle"; the number above is the reason.
UVM runtime on the GPU. Rejected (see Context): it would amount to an SV simulator on the GPU, and one slower than the thing it replaces.
Record-and-replay only. Works today and stays the recommended interim path, but it is not reactive: the stimulus cannot depend on the design.
Vendoring the SCE-MI spec into docs/. Rejected: it is Accellera's copyrighted work, a checked-in copy has no update path, and we need the split and the pipes, not the DPI binding details. Cite it, don't copy it.

References

SCE-MI (Standard Co-Emulation Modeling Interface), Accellera — standard downloads. v2.4 (Nov 2016) is current at the time of writing.
FireSim / Golden Gate (MIDAS II), UC Berkeley — the same split in open source, with code: target-to-host bridges (transactors), target abstraction & host decoupling (latency-insensitive models and token channels). Read before designing the channel semantics.
Testbench Interop — what Jacquard drives today, and the record-and-replay fallback.
Cosim Perf Report — where the per-edge time goes, and how the GPU-bound claim was measured.

Implementation Plans

Phased implementation plans with entry and exit criteria. Plans live here when the work spans multiple commits and needs an explicit scheduling artefact; once shipped, the plan is kept as a historical record (Status flipped to Implemented) rather than deleted, so the phasing is recoverable later.

For short-lived working memory between sessions, see ../handoff-discipline.md. Those handoff notes live in docs/handoffs/ and are deliberately kept separate from the persistent plans here.

Status legend

Active — currently being worked on or scheduled.
Implemented — shipped; kept as historical record.
Deferred — captured for future work; not currently scheduled.
Exploratory — architectural thinking captured ahead of demand.

Index

Plan	Status
Post-Phase-0 Roadmap	Active — scheduling doc for ADRs 0007 and 0008
GF180MCU PDK enablement	Mostly implemented — Phases 0–6 shipped; Phase 7 deferred
Phase 0: Timing IR and OpenSTA oracle	Implemented — historical record
WS2: `opensta-to-ir`	Implemented — historical record
WS3: delete SDF parser + interim runtime hook	Implemented — historical record (see ADR 0006 Amendment)
WS3 follow-up: re-add cosim `--sdf` via `opensta-to-ir`	Deferred
Multi-clock and stimulus architecture	Exploratory — demand-driven
Cosim backend portability	Active — design captured (#105); see ADR 0017 amendment
Cell-model IR	Proposed — realises ADR 0019 (#130, #67)
RTL on-ramp folded into `sim`/`cosim`	Active — reworks #167, realises ADR 0021 (#162)
RTL-source provenance	Active — design captured; ADR 0021 Phase 2 (#162), gated on A0 (build forked wasm, check `\src` survives)

Reading order for new contributors

If you want to understand how the timing stack got to where it is:

phase-0-ir-and-oracle.md — the umbrella plan, with the five work streams (WS1–WS5).
ws2-opensta-to-ir.md and ws3-delete-sdf-parser.md — the per-work-stream detail for the IR producer and the SDF parser removal.
post-phase-0-roadmap.md — what comes next, sequenced against the ADRs.

Adding a new plan

Filename: short kebab-case (<topic>.md or <ws-or-phase>-<topic>.md).
Start with # Plan — <title> and a **Status:** line.
Where the plan executes a specific ADR or work stream, name them in a **Predecessors:** / **ADRs:** block near the top so the dependency graph is explicit.
Add the row to the table above. When the plan ships, change the status in the file and here in the same commit.

Roadmap — Post-Phase-0 work scheduling

Status: Active. Phase 1 (structured timing output, ADR 0008 — Accepted 2026-06-25) and release hardening (WS-RH.1) shipped; Pillar B Stages 1+2 (ADR 0007) landed early. Phase 2 (Pillar C Tier 1 — per-receiver wire delay) remains gated on ADR 0007 acceptance, which is still Proposed as of 2026-06-26.

This document orders the work captured in those two ADRs alongside the in-flight tail of Phase 0. It is a scheduling doc, not a design doc — design lives in the ADRs and in docs/timing-model-extensions.md / docs/why-jacquard.md.

Where things stand (2026-05-02)

Phase 0 (phase-0-ir-and-oracle.md): WS1–WS5 + WS2.2 + WS2.4 all landed. WS2.4 multi-corner shipped 2026-05-02 across four commits (5822343 consumer, 530bb36 builder, 59fde04 producer, plus the integration test). Open items: sky130-based corpus entries (gated on a CI sky130-Liberty install strategy) and peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.
OpenTimer spike (spikes/opentimer-sky130.md): resolved 2026-05-01 — Superseded. Q1 (Liberty parse) passed cleanly on SKY130; Q2 (arrival computation) failed on the canonical OpenSTA-bundled GCD example after eight input-pipeline workarounds (bus ports, OpenROAD-emitted SPEF, modern TCL, tap cells). Per the spike's decision matrix, ADR 0003 is now Superseded (commit d002bde). OpenSTA out-of-process is committed as Jacquard's sole STA path — opensta-to-ir is the canonical preprocessor; no in-process reference STA is planned. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later, but not on this roadmap.
Pillar B Stages 1+2 (per adr/0007): landed. ClockArrival IR table + opensta-to-ir Tcl emission in commit c403cc8; DFFConstraint.clock_arrival_ps + skew-aware fold-in in build_timing_constraint_buffer in 6767c3e. Closed Pillar B's main accuracy lever ahead of this roadmap's original Phase 2 schedule.
ADR 0006 amended 2026-05-02: subprocess invocation of user-installed OpenSTA from the shipped runtime is now permitted (no linking, no bundling). Phase 3 (native Rust SDF→IR) is no longer release-gating — see § Phase 3 below. New release-hardening workstream WS-RH.1 (OpenSTA detection + version check) is required before first release; see § Release hardening.
ADRs 0007 / 0008: ADR 0008 accepted 2026-05-02; ADR 0007 still pending review.

Phase boundaries

The phase numbering established by Phase 0 and ADR 0006 continues:

Phase	Topic	Trigger
0	Timing IR + OpenSTA preprocessor	In flight, near close
1	Structured timing output (ADR 0008 required items) + Phase 0 carryover	ADR 0008 accepted ✓
2	Timing model fidelity Pillar C Tier 1 + Pillar B Stage 3 if needed (ADR 0007)	Phase 1 lands; ADR 0007 accepted
RH	Release hardening (OpenSTA detection + version check, see § Release hardening)	WS-RH.1 shipped ✓; no other items currently scoped
3	Native Rust SDF→IR parser (ADR 0006)	Deferred indefinitely — no longer release-gating per amended ADR 0006. Picks up when bandwidth allows or commercial demand appears.
4+	Pillar A Stage 1 (static IDM); Pillar C Tier 2; ADR 0008 optional outputs	Demand-driven; not committed

Parked (require new ADR to revive): in-process reference STA (ADR 0003 superseded), Pillar A Stage 2 (dynamic δ(T)), Pillar A Stage 3 (sub-cycle ticks), NoC-aware partitioning hints (Pillar C Tier 3).

Phase 1 — Structured timing output and Phase 0 wrap-up

Entry criteria:

ADR 0008 accepted.
Phase 0 exit criteria met (per phase-0-ir-and-oracle.md).

OpenTimer integration was originally Phase 1's centrepiece (former WS-P1.1) but was retired when the spike Superseded ADR 0003. With OpenSTA-out-of-process as the sole STA path, Phase 1 is now anchored on user-visible output rather than a second STA tool.

Workstreams (parallel where independent):

WS-P1.1 — Structured timing output (ADR 0008 required items)

The four required items from ADR 0008. Single workstream because they share infrastructure.

WS-P1.1.a — Symbolic violation messages. Shipped 2026-05-02 in commit 0432d9a. New WordSymbolMap in src/flatten.rs built once at sim startup; process_events gained an optional resolver closure; sim_metal threads it through. Setup/hold violation messages now name DFFs as top/cpu/regs[7][bit 22] [word=42] instead of bare word 42. CUDA/HIP sim paths don't currently route runtime violations through process_events (separate plumbing gap, not blocked on this format change).
WS-P1.1.b — --timing-report <path.json>. Shipped 2026-05-02 in commit 58a7a04. New src/timing_report.rs module with serde-derived TimingReport (schema_version 1.0.0); process_events takes a ReportingCtx bundling the optional resolver + violation observer (signature back to 5 args); sim_metal builds the report end-to-end. Sample fixture at tests/timing_ir/sample_reports/two_violations.json; schema documented in docs/timing-violations.md. WS-P1.1.d's worst-slack ranking is included (top-10 per kind from violation events). Caveats: closest-to-violation tracking in non-violating runs needs GPU near-miss instrumentation (deferred); violations array is unbounded (opt-in cap is the natural follow-up); CUDA/HIP/cosim paths don't route runtime violations through process_events yet.
WS-P1.1.c — --timing-summary text output. Shipped 2026-05-02 in commit 44e70a0. New TimingReport::format_summary() formatter; --timing-summary CLI flag; TimingReportConfig refactored to support either / both / neither output. Text writes to stdout. Deferred from ADR 0008's wishlist: "corner" (metadata struct doesn't carry it yet) and "margin percentage" (derivable from existing fields). Both are documented in code as known gaps.
WS-P1.1.d — Per-DFF worst-slack ranking. Partially shipped in 58a7a04 alongside WS-P1.1.b: top-10 per kind from observed violation events. Remaining: closest-to-violation tracking when no violation occurred — needs GPU near-miss instrumentation, deferred to a separate workstream.
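The symbolic message format from WS-P1.1.a can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SymbolMap` is a stand-in for the real `WordSymbolMap` (whose shape is not shown in this document), and `describe_word` is a hypothetical helper, not the actual `process_events` plumbing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the word-to-name map built at sim startup.
pub type SymbolMap = HashMap<u32, String>;

/// Describe a violating state word. With a resolver that knows the word,
/// the hierarchical name is shown alongside the index; otherwise the
/// message falls back to the bare word index.
pub fn describe_word(
    word: u32,
    resolver: Option<&dyn Fn(u32) -> Option<String>>,
) -> String {
    match resolver.and_then(|resolve| resolve(word)) {
        Some(name) => format!("{name} [word={word}]"),
        None => format!("word {word}"),
    }
}
```

Keeping the resolver optional mirrors the description above, where the closure is an optional argument and back ends that do not supply one still produce a usable message.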

Total ~2 weeks.

WS-P1.2 — Phase 0 follow-ups (carryover)

Tail of Phase 0 work that didn't gate WS3 completion. Listed for completeness.

~~WS2.4: multi-corner CLI flag in opensta-to-ir.~~ Shipped 2026-05-02 (commits 5822343 / 530bb36 / 59fde04).
WS4: corpus + runner + regen helper + CI hookup shipped 2026-05-02 with the seed entry aigpdk_dff_chain (covers all four IR record types). One follow-up: add sky130-based corpus entries (inv_chain_pnr, mcu_soc subset) once a CI sky130-Liberty install strategy is decided.
Peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.

(WS5 — parser-success assertions on the Liberty parser path and on opensta-to-ir — was already shipped; see phase-0-ir-and-oracle.md § WS5.)

These are not gated by any new ADR; pick them up as bandwidth allows.

Exit criteria for Phase 1:

✅ Symbolic violation messages live; old state-word-index format gone (commit 0432d9a).
✅ --timing-report JSON shipping; sample fixture at tests/timing_ir/sample_reports/two_violations.json (commit 58a7a04).
✅ --timing-summary available (commit 44e70a0).
✅ Worst-slack ranking included in both report and summary (top-10 from violations; non-violating-run tracking still requires GPU near-miss instrumentation, separate workstream).
✅ why-jacquard.md updated; old "Output interface" section now describes the shipped surface, "Still on the wishlist" carries the deferred items.
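As an illustration of the worst-slack ranking named in the exit criteria, the sketch below keeps the worst observed slack per DFF and returns the top `n` for one violation kind. The `Violation` fields and the function name are assumptions for this example; the shipped `TimingReport` schema is documented in docs/timing-violations.md and may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind { Setup, Hold }

/// Hypothetical violation event; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct Violation {
    pub kind: Kind,
    pub dff: String,
    pub slack_ps: i64,
}

/// Worst (most negative) slack per DFF for one kind, worst first, at most `n` rows.
pub fn worst_slack_per_dff(events: &[Violation], kind: Kind, n: usize) -> Vec<(String, i64)> {
    let mut worst: HashMap<&str, i64> = HashMap::new();
    for v in events.iter().filter(|v| v.kind == kind) {
        let entry = worst.entry(v.dff.as_str()).or_insert(v.slack_ps);
        *entry = (*entry).min(v.slack_ps);
    }
    let mut ranked: Vec<(String, i64)> =
        worst.into_iter().map(|(dff, slack)| (dff.to_string(), slack)).collect();
    // Sort by slack, then by name so that ties are ordered deterministically.
    ranked.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}
```

Because the input is the stream of observed violation events, a run with no violations yields an empty ranking, which is the gap the near-miss instrumentation is meant to close.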

Phase 1 closed. Phase 2 entry now blocked only on ADR 0007 acceptance.

Phase 2 — Timing model fidelity

Entry criteria:

Phase 1 exit criteria met.
ADR 0007 accepted.

Pillar B Stages 1 and 2 (per-DFF clock arrival in the IR + setup/hold fold-in) landed early, in commits c403cc8 and 6767c3e — directly on top of the OpenSTA-out-of-process producer rather than the OpenTimer integration originally planned. Phase 2 is therefore anchored on Pillar C Tier 1 (per-receiver wire delay), with Pillar B Stage 3 only if measurement justifies it.

Workstreams (parallel where independent):

WS-P2.1 — Pillar C Tier 1: Per-receiver wire delay (ADR 0007)

Key wire delay per (src_aigpin, dst_aigpin) edge.

WS-P2.1.a — Edge-attributed wire delay. Rewrite of src/flatten.rs:1850-1872 to key wire delay per fanout; fold into source-side gate_delay per fanout target. ~3–5 days.
WS-P2.1.b — Rise/fall preservation. Carry per-edge rise/fall through the consumer; honour both in PackedDelay accumulation. ~1–2 days, after WS-P2.1.a.
WS-P2.1.c — Validation. Long-route corpus addition; tolerance ≤±3% on long-wire paths.

Total ~1 week.
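A sketch of what keying wire delay per (src_aigpin, dst_aigpin) edge could look like, with rise and fall carried separately as in WS-P2.1.b. The types and the function are hypothetical and do not reflect the actual code in src/flatten.rs or the `PackedDelay` layout.

```rust
use std::collections::HashMap;

/// Rise and fall delay in picoseconds, kept separate end to end.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct RiseFall {
    pub rise_ps: u32,
    pub fall_ps: u32,
}

/// Wire delay keyed per (source pin, destination pin) edge, not per net.
pub type WireDelays = HashMap<(u32, u32), RiseFall>;

/// Delay seen by one fanout target: the source gate delay plus that
/// receiver's own wire delay. A missing entry counts as zero wire delay.
pub fn delay_to_receiver(gate: RiseFall, wires: &WireDelays, src: u32, dst: u32) -> RiseFall {
    let wire = wires.get(&(src, dst)).copied().unwrap_or_default();
    RiseFall {
        rise_ps: gate.rise_ps + wire.rise_ps,
        fall_ps: gate.fall_ps + wire.fall_ps,
    }
}
```

With a per-edge key, a short hop and a long route driven by the same gate get different delays, which is the accuracy a single per-net value cannot express.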

WS-P2.2 — Pillar B Stage 3: Bucketed per-DFF constraint packing (conditional)

Stages 1+2 collapsed all DFFs in a 32-bit state word to min(setup), min(hold) after folding the per-DFF clock arrival in. For most current designs the per-word collapse pessimism is small relative to clock period; for designs running close to the period boundary, splitting each word into clock-arrival buckets eliminates the collapse loss without disturbing the partitioner. See Stage 3 in docs/timing-model-extensions.md Part B.

Land only if Stage 1+2 measurement on a representative design shows the per-word collapse materially over-reports violations; otherwise defer indefinitely. Effort if pursued: ~3–5 days, touches src/flatten.rs:1722-1761 and the kernel's constraint indexing.
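To make the collapse-versus-bucket trade-off concrete, here is a rough sketch. It assumes each DFF already carries setup and hold values with the clock arrival folded in, and that the per-word collapse takes the minimum of each, as described above. All names, the bucket width parameter, and the data layout are hypothetical; the real constraint buffer and kernel indexing are not shown in this document.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-DFF record after the Stage 1+2 fold-in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DffConstraint {
    pub setup_ps: i32,
    pub hold_ps: i32,
    pub clock_arrival_ps: i32,
}

/// Stage 1+2 style: one (min setup, min hold) pair for the whole state word.
pub fn collapse_word(dffs: &[DffConstraint]) -> Option<(i32, i32)> {
    let setup = dffs.iter().map(|d| d.setup_ps).min()?;
    let hold = dffs.iter().map(|d| d.hold_ps).min()?;
    Some((setup, hold))
}

/// Stage 3 style: group the word's DFFs into clock-arrival buckets of
/// `bucket_ps` width and collapse each bucket independently.
pub fn collapse_bucketed(dffs: &[DffConstraint], bucket_ps: i32) -> BTreeMap<i32, (i32, i32)> {
    assert!(bucket_ps > 0, "bucket width must be positive");
    let mut buckets: BTreeMap<i32, Vec<DffConstraint>> = BTreeMap::new();
    for d in dffs {
        buckets.entry(d.clock_arrival_ps.div_euclid(bucket_ps)).or_default().push(*d);
    }
    buckets
        .into_iter()
        .filter_map(|(bucket, members)| collapse_word(&members).map(|c| (bucket, c)))
        .collect()
}
```

In this toy model, a DFF with a late clock arrival no longer drags its constraint onto early-arrival DFFs in the same word, at the cost of more constraint entries per word.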

WS-P2.3 — Output adjustments for fidelity work

Small touch-ups to ensure Phase 1 outputs continue to work as model fidelity changes. JSON report fields, summary metrics, etc. Folded into WS-P2.1 / WS-P2.2 PRs as needed.

Exit criteria for Phase 2:

Per-receiver wire delay landed; long-route paths reported within ±3% of CVC.
timing-model-extensions.md Parts B and C marked Implemented with cross-references to landed code (Part B already updated post-Stage-1+2).
timing-validation.md updated with per-pillar tolerances.
No regression on existing corpus.

Phase 3 — Native Rust SDF→IR parser

Deferred indefinitely as of 2026-05-02 per amended ADR 0006. No longer release-gating: shipped Jacquard binaries may subprocess user-installed OpenSTA via opensta-to-ir, provided OpenSTA is not bundled and not linked. The user-facing capability gap is "OpenSTA must be on PATH for jacquard sim input.sdf," surfaced by WS-RH.1 below with a clear error message.

Reasons to revive:

A downstream commercial integrator's legal team rejects subprocess-of-GPL-tool even with no bundling/linking.
OpenSTA dialect coverage gaps appear that are easier to fix in our own parser than via opensta-to-ir post-processing.
Bandwidth opens up and the team wants the zero-runtime-dependency story for its own ergonomics.

Effort estimate (unchanged from the original ADR 0006 framing): grammar-based (nom / pest), validated against OpenSTA on the WS4 corpus per ADR 0001. Probably 2–3 weeks of focused work. Not scheduled.

Release hardening

Pre-first-release work that became necessary when ADR 0006 § Amendment relaxed the no-runtime-subprocess rule. These are blockers for first release, not for any specific Phase.

WS-RH.1 — OpenSTA detection + version check

Status: Shipped 2026-05-02 in commit c9c393b. All scope items below are landed; this entry is preserved as a brief reference. Test coverage: 9 unit tests for the version parser + 6 integration tests for the locator across the missing / too-old / newer-than-tested / unparseable / failing-probe paths.

Why: With the shipped runtime now allowed to subprocess opensta-to-ir, a user invoking jacquard sim input.sdf on a machine without OpenSTA (or with an untested OpenSTA version) must get an actionable error rather than silent timing-data loss. Pre-WS-RH.1, missing OpenSTA only emitted a warn! and the simulation proceeded with no timing information loaded. That was acceptable during development but would have been a UX bug in a shipped release.

Scope:

Promote missing-OpenSTA from warning to hard error when --sdf is provided. Today's silent-fallback behaviour is fine for --liberty-only runs but wrong when SDF was explicitly requested. Error message must name the env var (JACQUARD_OPENSTA_BIN), the PATH lookup, and link to install instructions. ~0.5 day.
Pin a tested OpenSTA version range. Record the version we test against in vendor/opensta/ (already pinned via submodule per ADR 0005) and surface that as a MIN_TESTED_OPENSTA_VERSION / MAX_TESTED_OPENSTA_VERSION constant in crates/opensta-to-ir/src/opensta.rs. Need to choose a version-detection mechanism — OpenSTA's -version flag output format is the obvious target; check whether it's stable across the versions we care about. ~0.5 day.
Version probe at first invocation. On first call to find_opensta() per process, run <binary> -version, parse the version, and:
- If older than min-tested → hard error with remediation message ("rebuild via scripts/build-opensta.sh or upgrade your system OpenSTA").
- If newer than max-tested → warn but proceed ("untested OpenSTA version vN.M; please report any timing discrepancies").
- Cache the result for the rest of the process. ~1 day.
Document the dependency in docs/synthesis-flow.md. Single section: required tooling, install paths, version range, what jacquard sim does and doesn't need OpenSTA for. ~0.5 day.
Test coverage: unit tests for the version-string parser (with sample -version outputs from the pinned version and a synthetic too-old version); an integration test that points JACQUARD_OPENSTA_BIN at a stub script and confirms the error path. ~0.5 day.
Stale-framing cleanup (folded in here per 2026-05-02 decision rather than spun out separately):
- Reword INTERIM per ADR 0006 / Pre-release only markers in source: src/sim/setup.rs:176,228,286, src/bin/jacquard.rs:187, src/sim/cosim_metal.rs:2053, src/testbench.rs:255-257. Replace with "subprocess wrapper per ADR 0006 § Amendment" or similar — these paths are no longer interim.
- Update docs/plans/phase-0-ir-and-oracle.md lines 152, 161, 172 — drop "tagged for pre-release removal" framing; the subprocess wrapper is now the shipping mechanism, not a temporary bridge.
- Audit docs/plans/ws3-delete-sdf-parser.md for the same stale framing and update.
- ~0.5 day total for the cleanup.

Total: ~3.5 days. Single PR, owned by whoever picks up release prep.
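The version-probe policy in the scope list (too old is a hard error, newer than tested is a warning, anything else proceeds) can be sketched as a pure classification step. This assumes the probe output contains a dotted `major.minor[.patch]` token, which is exactly the open question noted below; the enum, function names, and the version bounds used in the usage note are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    Ok,
    NewerThanTested,
    TooOld,
    Unparseable,
}

/// Parse the first whitespace-separated token that starts with a digit as
/// "major.minor[.patch]". The real -version output format is an assumption.
pub fn parse_version(output: &str) -> Option<(u32, u32, u32)> {
    let token = output
        .split_whitespace()
        .find(|t| t.chars().next().map_or(false, |c| c.is_ascii_digit()))?;
    let mut parts = token.split('.').map(|p| p.parse::<u32>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    // A missing patch component is treated as 0; a malformed one fails the parse.
    let patch = parts.next().unwrap_or(Some(0))?;
    Some((major, minor, patch))
}

/// Apply the min-tested / max-tested policy to raw probe output.
pub fn classify(output: &str, min: (u32, u32, u32), max: (u32, u32, u32)) -> VersionCheck {
    match parse_version(output) {
        None => VersionCheck::Unparseable,
        Some(v) if v < min => VersionCheck::TooOld,
        Some(v) if v > max => VersionCheck::NewerThanTested,
        Some(_) => VersionCheck::Ok,
    }
}
```

With illustrative bounds of `(2, 4, 0)` and `(2, 6, 0)`, an output of `2.3.9` classifies as `TooOld` and `2.7.0` as `NewerThanTested`. Keeping classification separate from the subprocess call makes the parser easy to unit test, and the caller can cache the result for the rest of the process.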

Open question: does OpenSTA emit a stable -version string, or do we need to scrape git describe from a build-time-recorded commit? If -version is unreliable, fall back to recording the submodule commit at crates/opensta-to-ir build time and comparing — this is cheaper than version-string sniffing and avoids the "user has a custom build" problem.

Phase 4+ — Demand-driven

Items below land when (a) a real use case appears that demands them, and (b) bandwidth is available. Each gets its own ADR amendment / new ADR before scheduling, since the cost is non-trivial.

Pillar A Stage 1 (static IDM)

Cheapest δ(T) entry point. Lands only after Pillars B and C confirm the wire/skew baseline is correct — characterisation work done before that risks chasing wire-delay error masquerading as δ(T) error.

Effort: 1–2 day spike to validate value, then ~1 week implementation, plus per-cell SPICE characterisation effort (long-pole risk).

Pillar C Tier 2 (inter-partition wire delay)

Required for many-core/NoC designs at advanced processes. Lands when a representative such design appears in the test corpus and Tier 1 measurement shows it is needed.

Effort: ~2–3 weeks, touches src/sim/cosim_metal.rs shuffle pipeline.

ADR 0008 optional outputs

Items 5–7 from ADR 0008: arrival histograms, STA cross-reference, path-back-trace. Demand-driven prioritisation.

Pillar C Tier 3 (NoC-aware partitioning hints)

Optional optimisation that makes Tier 2 cheap on tile-decomposed designs. Lands only if Tier 2 lands and partitioning quality on tile designs proves measurably suboptimal.

Risks and walk-back

Pillar measurement shows smaller-than-expected gain. Each pillar's later stages are deferred or abandoned per ADR 0007's walk-back clause. Pillar B Stage 3 is explicitly conditional on this signal.
JSON report schema design wastes time in bikeshedding. Mitigation: ship v1 quickly, additive-only changes thereafter, breaking changes require explicit ADR-level decision.
OpenSTA upstream regressions. With OpenSTA as the sole STA path, an upstream behaviour change reaches us through opensta-to-ir's output. Mitigation: pin OpenSTA in CI (per ADR 0001) and rely on the regression corpus to surface drift.
CRPR pessimism on tight designs. Stage 1+2 fold-in treats launch=0; a design with very heterogeneous launch arrivals will see pessimism on paths whose launch DFF also has a long clock path. Stage 3 is the lever if this matters; otherwise live with it.

Cross-references

../adr/0007-timing-model-fidelity-roadmap.md — Pillar definitions for Phase 2.
../adr/0008-structured-timing-output.md — Output items for Phase 1.
../adr/0001-opensta-as-oracle.md — OpenSTA out-of-process commitment (post-ADR-0003 supersedure).
../adr/0003-opentimer-primary-sta.md — Superseded. Spike fail outcome documented in ../spikes/opentimer-sky130.md.
../adr/0006-sdf-preprocessing-model.md — Phase 3.
../why-jacquard.md — User-facing positioning that this roadmap delivers.
../timing-model-extensions.md — Technical analysis underlying ADR 0007.
../timing-validation.md — Validation tolerances each phase updates.
phase-0-ir-and-oracle.md — Predecessor roadmap (current Phase 0 status lives there per workstream).
../spikes/opentimer-sky130.md — Spike outcome (Superseded).

Plan — GF180MCU PDK enablement (full sim path)

Status: Phases 0–6 shipped (2026-05-12 / 13). Phase 7 (wafer.space test-run-1 design integration) deferred pending design availability. Subsequent follow-ups also landed (2026-05-14): IO pad behavioural decomposition (__in_c, __in_s, __bi_24t, plus filler classification for the wafer.space gf180mcu_ws_* families) and bidir A/OE observability surfacing as <port>__out / <port>__oe extra primary outputs — see commits aa312b8, c23d583, 207cc80. These extended GF180MCU support from "synthesized-core-only" to "full chip_top including pad ring", validated end-to-end on a 227k-cell wafer.space chess chip_top netlist. This document is now a recap of what landed; the forward-looking deferred items are in § Follow-on cleanup at the bottom.

Predecessors:

SKY130 enablement (reference recipe in docs/adding-a-pdk.md).
Multi-corner Liberty plumbing: WS2.4 + the sky130 multi-corner integration test (crates/opensta-to-ir/tests/opensta_integration.rs), shipped 2026-05-02.

ADRs: None new shipped. docs/adding-a-pdk.md is the canonical integration-points checklist; this plan applied that recipe to GF180MCU with both 7-track (gf180mcu_fd_sc_mcu7t5v0) and 9-track (gf180mcu_fd_sc_mcu9t5v0) standard-cell libraries.

Goal (as shipped)

GF180MCU is now at the same support tier as SKY130:

Timing path — opensta-to-ir accepts GF180MCU Liberty files and emits IR; the multi-corner integration test at crates/opensta-to-ir/tests/opensta_integration.rs::gf180mcu_multi_corner_emits_per_corner_values asserts per-corner setup/hold values differ correctly across tt/ss/ff PVT corners.
Simulation path — jacquard sim runs gate-level GF180MCU netlists on the GPU. Cell-type detection, pin direction tables, sequential/tie/multi-output classification, behavioural model parsing (with UDP support for sequential elements), and AIG decomposition are all wired through AIG::from_netlistdb.
Validation — synthetic DFF+inverter fixture at tests/timing_test/gf180mcu_timing/. Real wafer.space test-run-1 design integration is deferred (Phase 7, gated on design availability).

End state mirrors today's SKY130 support: CellLibrary::GF180MCU detected, decomposed to AIG, simulated on Metal/CUDA/HIP, with a golden-IR corpus entry covering the timing-IR side.

Why now

GF180MCU support was a release prerequisite per session 2026-05-12. The wafer.space ecosystem (https://github.com/wafer-space/gf180mcu) is the near-term commercial demand driver; the upstream google/gf180mcu-pdk is the canonical PDK that the wafer.space variant builds on.

Decisions (frozen 2026-05-12 session)

One enum variant for GF180MCU. CellLibrary::GF180MCU covers both 7t5v0 and 9t5v0 prefixes. Matches the SKY130 precedent (CellLibrary::SKY130 covers seven prefixes).
Both 7t and 9t fully supported. Unlike SKY130 (only hd is decomposed), both GF180MCU standard-cell variants are first-class for cell detection, pin direction, classification, and AIG decomposition. Cell models for 7t and 9t are byte-identical per cell type (verified at build time in build.rs); decomposition reads from the 7t submodule and reuses for 9t.
Two separate submodules for vendoring cell models, mirroring the per-library SKY130 split:
- vendor/gf180mcu_fd_sc_mcu7t5v0/
- vendor/gf180mcu_fd_sc_mcu9t5v0/
Install path: volare pinned hash under [tool.jacquard.pdks.gf180mcu] in pyproject.toml alongside the existing sky130 entry. Variant: gf180mcuC.
Reset polarity: GF180MCU uses active-low resets/sets (pin names RN, SETN) — same AIG formula shape as SKY130's RESET_B/SET_B. The "n" prefix in cell names like dffnq/dffnrnq/icgtn indicates a negative-edge clock (pin CLKN), not reset polarity (resolving Open Q3 from the original plan).

Shipped phases

Phase 0 — Foundations (commit `6ae3e54`)

pyproject.toml: [tool.jacquard.pdks.gf180mcu] with volare_hash = "559a117b163cef2f920f33f30f6f690aa0b47e4c", variant gf180mcuC, separate default_lib_subdir_7t / default_lib_subdir_9t paths.
Vendored submodules at vendor/gf180mcu_fd_sc_mcu7t5v0/ and vendor/gf180mcu_fd_sc_mcu9t5v0/.
Skeleton src/gf180mcu.rs + src/gf180mcu_pdk.rs declared in src/lib.rs.

Phase 1 — Library detection + cell-type extraction (commit `858dd70`)

is_gf180mcu_cell(name) -> bool matching both 7t5v0 and 9t5v0 prefixes.
extract_cell_type(name) strips prefix + drive suffix.
CellLibrary::GF180MCU enum value added; detect_library() / detect_library_from_file() extended; Mixed enforcement upgraded to three known libraries.
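A minimal sketch of the detection and extraction steps listed for Phase 1. The function names come from the plan, but the bodies, the return types, and the exact prefix strings (inferred from cell names such as gf180mcu_fd_sc_mcu7t5v0__dffq_1 quoted elsewhere in this plan) are assumptions, not the shipped implementation.

```rust
/// Library prefixes, including the double underscore that separates the
/// library name from the cell type (assumed from the quoted cell names).
const PREFIXES: [&str; 2] = [
    "gf180mcu_fd_sc_mcu7t5v0__",
    "gf180mcu_fd_sc_mcu9t5v0__",
];

/// True for cells from either the 7-track or the 9-track library.
pub fn is_gf180mcu_cell(name: &str) -> bool {
    PREFIXES.iter().any(|p| name.starts_with(*p))
}

/// Strip the library prefix and a trailing integer drive suffix,
/// e.g. "..._mcu7t5v0__dffq_1" becomes "dffq". Returns `None` for
/// names that do not carry a GF180MCU prefix.
pub fn extract_cell_type(name: &str) -> Option<&str> {
    let rest = PREFIXES.iter().find_map(|p| name.strip_prefix(*p))?;
    match rest.rsplit_once('_') {
        Some((base, drive)) if !drive.is_empty() && drive.bytes().all(|b| b.is_ascii_digit()) => {
            Some(base)
        }
        _ => Some(rest),
    }
}
```

Matching on the full prefix rather than the base type is what keeps same-named cells in the two libraries (nand2_1 in both) from being confused with cells of another PDK.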

Phase 2 — Pin direction provider (commit `e97e2d2`)

GF180MCULeafPins implementing LeafPinProvider.
Generation strategy: build-time via build.rs::generate_gf180mcu_pin_table, which scans vendor/gf180mcu_fd_sc_mcu{7,9}t5v0/cells/, parses .functional.v, cross-asserts 7t/9t pin layouts match, emits $OUT_DIR/gf180mcu_pins.rs. New precedent vs SKY130's hand-rolled match arms (see § Follow-on cleanup item 1).
Round-trip test instantiating every cell.

Phase 3 — Cell classification (commit `6969b90`)

Sequential / tie / filler / delay-cell whitelists in src/gf180mcu_pdk.rs derived from behavioural models.
Unit tests asserting classification across the union of 7t5v0 and 9t5v0 cell catalogues.

Phase 4 — Combinational AIG decomposition

Sequenced as four commits:

92bb665 — Phase 4 recon: confirmed SKY130 behavioural parser is PDK-neutral; identified shared infrastructure that gf180mcu_pdk could reuse.
02da077 — Phase 4 prep: introduced the PDK-neutral src/pdk_decomp.rs re-export module; exposed WireVal, GATE_MARKER, build_chain_gate, build_xor_chain, finalize_decomp_result as pub(crate).
32fb3b9 — Phase 4 (combinational): decompose_combinational for GF180MCU + boolean equivalence test suite vs the vendored PDK models.
d898343 — Phase 4 (aig.rs integration): wired combinational decomposition through AIG::from_netlistdb, end-to-end sim path for combinational GF180MCU netlists.

Phase 4b — Sequential cells (UDPs)

a7c0618 — Phase 4b prep: UDP loader for gf180mcu_pdk (parses UDP_GF018hv5v_mcu_sc7_TT_1P8V_25C_verilog_nonpg_*_FF_UDP and friends from the vendored PDK).
459317e — Phase 4b: AIG hooks for sequential cells (DFFs, latches, scan-DFFs, clock-gating cells icgtp/icgtn). gf180mcu_preprocess pre-creates DFF Q pins; gf180mcu_postprocess applies async set/reset using the active-low RN/SETN convention via the same AIG formula as SKY130. Negative-edge clock cells use CLKN instead of CLK (handled in trace_clock_pin).
3006f59 — Phase 4b boolean-equivalence tests covering DFF, latch, scan-DFF, and clock-gating cells via multi-step truth-table evaluation.

Phase 5 — CLI / pipeline wiring audit (commit `57244d5`)

Audit-only — no per-PDK branch was missing GF180MCU handling. The auto-detection in AIG::from_netlistdb already covers every CLI surface (sim / cosim / dump-paths all route through setup::load_design). Cleanup: stale Phase 4b panic comments in src/sim/setup.rs and src/aig.rs; field doc comments on CLI arguments refreshed to mention GF180MCU alongside AIGPDK / SKY130.

Phase 6 — Validation fixture + multi-corner test

Fixture (commit 4a7ee0e): tests/timing_test/gf180mcu_timing/ mirroring sky130_timing/ 1:1. Synthetic inv_chain.v (DFF + 16-inverter chain + DFF) with gf180mcu_fd_sc_mcu7t5v0__{dffq,inv}_1 cells, Liberty-only SDF generator, CVC testbench, sample stimulus, Makefile, README.
Integration test: gf180mcu_multi_corner_emits_per_corner_values in crates/opensta-to-ir/tests/opensta_integration.rs. Loads three real PVT corners (typ=tt_025C_5v00, slow=ss_125C_4v50, fast=ff_n40C_5v50) at the 5.0 V operating point and asserts per-corner setup TimingValues differ correctly across PVT. Skips gracefully when the volare-installed PDK isn't present (gated on find_gf180mcu_lib_dir() returning Some; matches the sky130 test's skip pattern). $GF180MCU_LIBERTY_DIR overrides the volare default path.
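The corner names used by the integration test follow a regular shape, which the sketch below decodes. It is an illustration only: the real loader is the generic TimingLibrary path, `Corner` and `parse_corner` are hypothetical, and the assumption that the voltage always has two fractional digits is drawn solely from the three names quoted above.

```rust
#[derive(Debug, PartialEq)]
pub struct Corner {
    pub process: String,
    pub temp_c: i32,
    pub millivolts: u32,
}

/// Decode names shaped like "tt_025C_5v00" or "ff_n40C_5v50".
/// A leading 'n' on the temperature marks a negative value.
pub fn parse_corner(name: &str) -> Option<Corner> {
    let mut parts = name.split('_');
    let process = parts.next()?.to_string();
    let temp = parts.next()?.strip_suffix('C')?;
    let volts = parts.next()?;
    if parts.next().is_some() {
        return None; // more fields than the assumed three-part shape
    }
    let temp_c = match temp.strip_prefix('n') {
        Some(magnitude) => -magnitude.parse::<i32>().ok()?,
        None => temp.parse::<i32>().ok()?,
    };
    let (whole, frac) = volts.split_once('v')?;
    if frac.len() != 2 {
        return None; // assumption: exactly two fractional digits
    }
    let millivolts = whole.parse::<u32>().ok()? * 1000 + frac.parse::<u32>().ok()? * 10;
    Some(Corner { process, temp_c, millivolts })
}
```

Under these assumptions the three corners in the test decode to 25 °C at 5.00 V, 125 °C at 4.50 V, and -40 °C at 5.50 V.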

Phase 7 — wafer.space test-run-1 design (deferred)

Gated on design availability. Scope:

Vendor or pull a wafer.space test-run-1 gate-level netlist into the tests/timing_test/ or designs/ tree.
End-to-end pipeline: synth + PnR (or consume post-PnR output), opensta-to-ir, jacquard sim with Metal backend, golden-output VCD comparison.
Promote to a corpus entry once stable.

Test inventory

Counts after Phase 6:

cargo test --lib: 212 passing (up from 166 at plan start).
cargo test --lib gf180mcu: 45 passing (combinational + sequential equivalence + classification + detection + AIG-build).
cargo test -p opensta-to-ir multi_corner: 2 passing (sky130 + gf180mcu), each gated on its respective volare PDK install.

Follow-on cleanup

These are nice-to-have refactors flagged during the GF180MCU work but deliberately out of scope for the enablement effort itself.

Update 2026-05-19: Items 1, 2, and 4 are now subsumed by ADR 0010 — Declarative cell metadata and its companion plan declarative-cell-metadata.md. The manifest pathway converts these from "Rust refactor" projects into "move data out of code as part of the migration to manifest-as-source-of-truth" — happens once, gets all three at once.

~~build.rs pin-table generator for SKY130 too.~~ Subsumed by ADR 0010 § "Deferred to a future ADR — build.rs pin-table scanner removal." Removed LAST in the manifest migration, after manifests cover the built-in PDKs.
~~Physical relocation of shared PDK decomp infrastructure~~ out of sky130_pdk.rs into pdk_decomp.rs. Still relevant for the built-in (Rust-decomp) pathway, since ADR 0010 keeps that path load-bearing for cells with real AIG decomposition rules. Move when a third PDK exercises the surface.
CellLibrary enum location. Currently lives in src/sky130.rs even though it represents all PDKs. Moving to a neutral home (src/pdk.rs or src/lib.rs) is a trivial mechanical refactor. Independent of ADR 0010.
~~IO and PR libraries.~~ Now solved by the ADR 0010 manifest pathway. gf180mcu_fd_io and gf180mcu_fd_pr cells can be declared via kind = "io_pad_*" / kind = "filler" / kind = "tap" etc. in user-supplied manifests — no Jacquard PR needed.
CI install strategy for GF180MCU Liberty. Both the sky130 and gf180mcu multi-corner tests currently skip when the PDK isn't installed locally. CI integration (volare-on-CI or a vendored minimal Liberty subset) is the same blocker that gates the inv_chain_pnr sky130 corpus entry — out of scope for the GF180 enablement effort itself. Unrelated to ADR 0010.

Pitfalls (PDK-specific, for future readers)

Reset polarity — GF180MCU is active-low (RN/SETN); same AIG formula as SKY130's RESET_B/SET_B.
Negative-edge clocks — cells like dffnq/dffnrnq/icgtn use pin name CLKN instead of CLK. The "n" prefix is a clock marker, not a reset-polarity marker.
Power pins — GF180MCU operates at 5V nominal (vs SKY130's 1.8V). Both follow VDD/VSS naming. Corner names follow tt_025C_5v00 shape and parse cleanly through the generic TimingLibrary loader.
Cell pin names differ from SKY130 — inverter is I/ZN (not A/Y); DFF is CLK/D/Q/notifier. The notifier port wires the UDP delay-model wrapper but is unused for logic simulation.
Cell-name collisions between 7t5v0 and 9t5v0 — both have nand2_1 etc. Detection keys on the full prefix, not the base type. Auto-handled by is_gf180mcu_cell.
Drive-strength suffixes — GF180MCU uses integer multipliers (inv_1, inv_2, inv_4, …) matching the SKY130 convention.

Plan — Phase 0: Timing IR and OpenSTA oracle

Status: Implemented — historical record. All five work streams (WS1 schema, WS2 opensta-to-ir producer, WS3 SDF parser deletion + interim runtime hook, WS4 diff harness + CI, WS5 parser-success assertions) shipped through 2026-05-02. All eight exit criteria are met. Ongoing scheduling for timing-model fidelity work has moved to post-phase-0-roadmap.md. The per-WS detail and embedded status markers below are preserved for the implementation record.

Goal

Deliver the minimum viable infrastructure to enforce Jacquard's timing correctness contract:

A stable timing intermediate representation (IR) for SDF-equivalent annotations.
An OpenSTA-driven subprocess converter that produces IR from the same inputs Jacquard consumes.
A converter that produces IR from Jacquard's existing SDF parser output.
A CI diff harness that fails loud on converter disagreement.
Parser-success assertions on the SDF and Liberty paths.

After phase 0, Jacquard's timing pipeline has an enforced external reference. Silent failures (zero-match SDF, mis-scoped hierarchical prefixes, unexpected cell drops) surface as CI failures rather than correctness regressions detected in the field.

Prerequisites

Requirements doc (../timing-correctness.md) accepted.
ADR 0001 (OpenSTA oracle) accepted.
ADR 0002 (timing IR) accepted.
A representative test design committed to the repo with inputs needed for both Jacquard and OpenSTA (.v + .lib + .sdf minimum; .spef if available). Candidate: tests/timing_test/inv_chain_pnr or the MCU SoC subset, whichever is smaller for first-pass iteration.
OpenSTA available on developer machines and CI runners (installation documented).

Work breakdown

WS1 — IR schema

Done. Shipped as the timing-ir crate (508baaf initial, 2432d41 simplification). Schema at crates/timing-ir/schemas/timing_ir.fbs; per-DFF CLOCK_ARRIVAL records added later in c403cc8 (Pillar B Stage 1, beyond original WS1 scope). JSON round-trip verified via crates/timing-ir/tests/.

Produce the FlatBuffers schema (schemas/timing_ir.fbs) and generated Rust bindings.

Fields (minimum viable; extend only with written justification):

SchemaVersion { major, minor, patch }.
Corner { name, process, voltage, temperature }; IR holds a list of corners.
CornerValue { corner_index, min, typ, max } for multi-corner floats.
TimingArc { driver_pin, load_pin, rise_delay: [CornerValue], fall_delay: [CornerValue], condition, provenance }.
InterconnectDelay { net, from_pin, to_pin, delay: [CornerValue], provenance }.
SetupHoldCheck { d_pin, clk_pin, edge, setup: [CornerValue], hold: [CornerValue], condition, provenance }.
Provenance { source_tool, source_file, origin: Asserted | Computed | Defaulted }.
VendorExtension { source_tool, kind: CadenceX | SynopsysY | Other, raw_bytes } — untyped passthrough for unrecognised annotations.
Root table TimingIR { schema_version, corners, cell_instances, timing_arcs, interconnect_delays, setup_hold_checks, vendor_extensions }.
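As a rough illustration of how the per-corner fields fit together, the sketch below mirrors CornerValue and a trimmed TimingArc as plain Rust structs. These are hypothetical stand-ins, not the bindings generated from timing_ir.fbs, and the worst_delay helper is an assumption added only to show a per-corner lookup.

```rust
// Hypothetical plain-Rust mirror of two schema tables. The real types are
// generated from timing_ir.fbs and will differ in shape.
#[derive(Debug, Clone, PartialEq)]
struct CornerValue {
    corner_index: u32,
    min: f64,
    typ: f64,
    max: f64,
}

#[derive(Debug, Clone, PartialEq)]
struct TimingArc {
    driver_pin: String,
    load_pin: String,
    rise_delay: Vec<CornerValue>,
    fall_delay: Vec<CornerValue>,
}

/// `max` value recorded for one corner, if that corner is present.
fn max_at(values: &[CornerValue], corner_index: u32) -> Option<f64> {
    values
        .iter()
        .find(|v| v.corner_index == corner_index)
        .map(|v| v.max)
}

/// Larger of the rise and fall `max` delays for one corner (hypothetical helper).
fn worst_delay(arc: &TimingArc, corner_index: u32) -> Option<f64> {
    let rise = max_at(&arc.rise_delay, corner_index);
    let fall = max_at(&arc.fall_delay, corner_index);
    match (rise, fall) {
        (Some(r), Some(f)) => Some(r.max(f)),
        (r, f) => r.or(f),
    }
}

fn main() {
    let arc = TimingArc {
        driver_pin: "A".to_string(),
        load_pin: "Y".to_string(),
        rise_delay: vec![CornerValue { corner_index: 0, min: 1.0, typ: 2.0, max: 3.0 }],
        fall_delay: vec![CornerValue { corner_index: 0, min: 1.5, typ: 2.5, max: 3.5 }],
    };
    println!("{:?}", worst_delay(&arc, 0));
}
```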

Deliverables:

schemas/timing_ir.fbs checked in.
build.rs integration for code generation (or checked-in generated Rust with a flatc pin).
A tiny timing-ir crate exposing read/write helpers.
JSON round-trip via flatc --json verified in a unit test.

Scope guard: if you find yourself adding fields that represent computed timing graphs, cell electrical characterisation, or netlist structure, stop and re-read ADR 0002.

WS2 — `opensta-to-ir` production tool

Per ADR 0006, opensta-to-ir is a shipped preprocessing tool, not merely a validation helper. Post-release it remains as an alternative preprocessing path for users who want OpenSTA-computed timing.

Detailed design and phased implementation: ws2-opensta-to-ir.md.

Deliverables:

A Tcl script runnable by OpenSTA that loads Liberty + Verilog + SDF + (optionally) SPEF + SDC, then emits a machine-readable dump of timing annotations.
A production-quality standalone Rust binary opensta-to-ir that parses OpenSTA's dump and emits timing IR (binary + JSON sidecar). Stable CLI, documented exit codes, clear diagnostics, man-page-worthy --help.
Invocation wrapper handling OpenSTA subprocess lifecycle, stderr capture, exit-code checking, and error propagation up through opensta-to-ir's own exit code.
Assertion: if OpenSTA reports fewer cells than the expected count, exit non-zero with a clear diagnostic.
Ships as part of Jacquard's release artefacts (binary distributable, documented in user-facing docs).

WS2.4 — Multi-corner CLI flag (shipped 2026-05-02)

Status: Shipped 2026-05-02 across commits 5822343 (consumer + --timing-corner flag), 530bb36 (builder dedupe + per-corner [TimingValue] collection), 59fde04 (Tcl driver per-scene emission + --liberty NAME=PATH syntax), and the integration test aigpdk_dff_emits_per_corner_timing_values. The historical scope notes below are kept for reference but are no longer "open work".

The IR schema (crates/timing-ir/schemas/timing_ir.fbs) supports per-corner TimingValue vectors today, but every record lands in the IR with a single TimingValue keyed at corner_index = 0. Both producer (opensta-to-ir) and consumer (flatten.rs) treat the world as single-corner. Multi-corner support has three pieces:

Producer (Tcl + Rust binary):

crates/opensta-to-ir/tcl/dump_timing.tcl: replace single read_liberty + hardcoded CORNER 0 default tt 1.0 25.0 with OpenSTA's define_corners + per-corner read_liberty -corner $name. The existing arc / setup-hold / wire / clock-arrival walks already key by (cell, …); wrap each in a per-corner loop and call [edge arc_delays $arc -corner $c]. Verify the exact -corner syntax against the locally built OpenSTA before relying on it (similar to the vertex_worst_arrival_path probe done for clock arrival in commit c403cc8).
crates/opensta-to-ir/src/main.rs: rework --liberty PATH to accept --corner NAME=PATH[,V=…,T=…,P=…] repeats. Validate at least one corner.
crates/opensta-to-ir/src/builder.rs: today each ARC / SETUP_HOLD / INTERCONNECT / CLOCK_ARRIVAL line lands as one IR record with one TimingValue. Multi-corner emits multiple lines per (cell, driver, load, corner_index) from Tcl; the builder dedupes them into one IR record carrying a [TimingValue] vector. Mechanical.
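The builder dedupe step can be pictured as a group-by on the arc key. The following is a minimal sketch under simplifying assumptions: ArcLine, TimingValue, and merge_corners are hypothetical names, and each line carries a single max delay rather than the full min/typ/max rise/fall set.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified dump line: one ARC observation for one corner.
#[derive(Debug, Clone, PartialEq)]
struct ArcLine {
    cell: String,
    driver: String,
    load: String,
    corner_index: u32,
    max_delay: f64,
}

// Hypothetical stand-in for one per-corner entry in the IR record.
#[derive(Debug, Clone, PartialEq)]
struct TimingValue {
    corner_index: u32,
    max_delay: f64,
}

type ArcKey = (String, String, String);

/// Group per-corner lines into one entry per (cell, driver, load), each
/// carrying a vector of values sorted by corner index.
fn merge_corners(lines: &[ArcLine]) -> BTreeMap<ArcKey, Vec<TimingValue>> {
    let mut merged: BTreeMap<ArcKey, Vec<TimingValue>> = BTreeMap::new();
    for l in lines {
        merged
            .entry((l.cell.clone(), l.driver.clone(), l.load.clone()))
            .or_default()
            .push(TimingValue { corner_index: l.corner_index, max_delay: l.max_delay });
    }
    for values in merged.values_mut() {
        values.sort_by_key(|v| v.corner_index);
    }
    merged
}

fn main() {
    let line = |corner_index, max_delay| ArcLine {
        cell: "u1".to_string(),
        driver: "A".to_string(),
        load: "Y".to_string(),
        corner_index,
        max_delay,
    };
    // Two lines for the same arc, arriving out of corner order.
    let merged = merge_corners(&[line(1, 0.42), line(0, 0.30)]);
    println!("{:?}", merged);
}
```

A BTreeMap keeps the output order deterministic, which matters when the result is diffed against a checked-in golden file.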

Consumer (jacquard root):

Add --timing-corner <NAME> to SimArgs / CosimArgs in src/bin/jacquard.rs; resolve to an index by walking ir.corners().
Replace flatten.rs::ir_corner0_max(...) (used in ~5 sites) with ir_corner_max(idx). Thread the resolved index through load_timing_from_ir.
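Resolving the corner name to an index is a linear search over the IR's corner list. The sketch below assumes the corner names have already been collected into a slice; resolve_corner and its error text are hypothetical, not the code in src/bin/jacquard.rs.

```rust
/// Resolve a user-supplied corner name to its index in the IR's corner list.
/// `corner_names` stands in for whatever ir.corners() yields (assumption).
fn resolve_corner(corner_names: &[&str], wanted: &str) -> Result<usize, String> {
    corner_names
        .iter()
        .position(|name| *name == wanted)
        .ok_or_else(|| {
            format!(
                "unknown timing corner '{}'; available: {}",
                wanted,
                corner_names.join(", ")
            )
        })
}

fn main() {
    let corners = ["tt_025C_1v80", "ff_125C_1v95"];
    match resolve_corner(&corners, "ff_125C_1v95") {
        Ok(idx) => println!("corner index {}", idx),
        Err(msg) => eprintln!("{}", msg),
    }
}
```

Listing the available names in the error keeps a mistyped --timing-corner value from silently falling back to corner 0.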

Fixture: sky130 ships multi-corner Liberty (tt_025C_1v80, ss_-40C_1v62, ff_125C_1v95) on disk via volare under ~/.volare/... on dev machines that have run the cosim work. Wire two corners against the existing DFF / chain integration tests for a synthetic-but-real fixture; no external decision is needed before starting.

Land in this order: fixture probe (~hour, verifies the OpenSTA Tcl -corner flag works as expected) → producer (Tcl + binary + builder) → consumer (CLI + flatten plumbing) → integration test exercising both corners. The risk concentrates in the first hour; everything after that is mechanical.

WS3 — Remove hand-rolled SDF parser; wire interim runtime hook

Per ADR 0006, Jacquard's hand-rolled SDF parser is deleted in Phase 0 rather than maintained through later phases. The runtime gains a new IR input path; the old SDF input path becomes an interim convenience wrapper over WS2.

Detailed design and phased implementation: ws3-delete-sdf-parser.md.

Deliverables:

Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. Remove all direct consumers.
Add jacquard sim --timing-ir <path> as the canonical post-release timing input. Loads a pre-converted timing IR file, consumes it into the simulator's internal structures.
Retarget the existing --timing-sdf / --enable-timing CLI behaviour: when SDF is provided, jacquard sim subprocesses opensta-to-ir internally to produce IR on the fly, then consumes it. Code site tagged "INTERIM per ADR 0006; removed before first release."
Verify no remaining imports of the deleted module. Verify all existing tests that previously used the hand-rolled parser now pass via the interim hook or via checked-in IR fixtures.
No runtime behaviour regression on Jacquard's timing-related regression suite; any design that currently works must still work after WS3.

WS4 — Diff harness and CI integration

Reframed 2026-05-02; corpus + runner shipped 2026-05-02. The original WS4 was framed as "WS2 vs WS3 IR diff" (OpenSTA-derived against Jacquard's hand-rolled SDF parser-derived). WS3 deleted that parser; the diff has only one side now. Three reframings were considered: Option A (golden-IR regression corpus for opensta-to-ir) was chosen as the Phase 0 closure; Option B (end-to-end behavioural diff cxxrtl/CVC vs Jacquard cosim event traces) belongs in timing-validation.md as a Phase 1+ extension; Option C (cross-tool diff vs a future native Rust SDF→IR parser) is Phase 3 work per ADR 0006.

Deliverables:

A test binary timing-ir-diff that reads two IR files and produces a structured diff (missing arcs, mismatched delays past tolerance, mismatched provenance). Shipped in crates/timing-ir/src/bin/timing-ir-diff.rs.
OpenSTA vendored as a git submodule at vendor/opensta/. Not built from Jacquard's build at runtime; present for CI version pinning, the opensta-to-ir integration tests, and stress-corpus access (see ADR 0005). Shipped.
A primary regression corpus at tests/timing_ir/corpus/: Jacquard-specific designs with checked-in expected.jtir (and an expected.json sidecar via flatc --json for human-readable diffs). Shipped 2026-05-02 with the seed entry aigpdk_dff_chain (a minimal aigpdk DFF + AND with back-annotated wire delay; covers ARC + SETUP_HOLD + CLOCK_ARRIVAL + INTERCONNECT in a self-contained fixture). Sky130 entries (inv_chain_pnr, mcu_soc subset) remain to be added: the inputs exist under tests/timing_test/, but a CI strategy for installing the sky130 Liberty (likely volare) lands with them.
A stress corpus at tests/timing_ir/stress/ — a manifest file listing paths into vendor/opensta/<test-tree-subdir>/. Run nightly or pre-release. Exit criterion: no crashes, no hangs, no malformed IR; numerical agreement with OpenSTA not required. Manifest format specced in tests/timing_ir/stress/README.md; entries pending.
A regression test that, for each design in the primary corpus, runs opensta-to-ir on its inputs and diffs against expected.jtir via timing_ir::diff::diff_irs with the per-design tolerance from manifest.toml. Shipped as crates/opensta-to-ir/tests/corpus.rs::corpus_designs_match_golden_ir. Skips gracefully when OpenSTA isn't built; fails loud with a structured diff when there's a mismatch.
A regenerate-goldens helper for the OpenSTA-pin-bump workflow: bump submodule, run regen, review the diff, commit golden + submodule together. Shipped as scripts/regenerate-corpus-goldens.sh. Iterates tests/timing_ir/corpus/*/manifest.toml, runs opensta-to-ir per entry with the manifest-specified flags, refreshes both expected.jtir and the expected.json sidecar via flatc --json. Accepts entry names as positional args for targeted regen.
A diff-machinery mutation test that perturbs a known-good IR and asserts timing-ir-diff flags it. Shipped in crates/timing-ir/tests/diff.rs: delay_mismatch_past_tolerance_detected, delay_mismatch_within_tolerance_is_clean, arc_only_in_a_detected, arc_only_in_b_detected.

CI hookup landed 2026-05-02. The opensta-to-ir-tests job in .github/workflows/ci.yml builds CUDD (cached), builds OpenSTA via scripts/build-opensta.sh (cached on the submodule SHA), and runs cargo test inside crates/opensta-to-ir — covering the corpus regression test, the CLI tests, and the OpenSTA-driven integration tests on every PR. scripts/build-opensta.sh was extended to honour a CUDD_DIR env var so the CI job can hand it the source-built CUDD location without bypassing the script.

What this catches: OpenSTA upstream regressions, dump-format / Tcl-driver regressions, accidental schema-breaking changes in timing_ir.fbs, builder bugs in opensta-to-ir/src/builder.rs, and the diff machinery itself (via the mutation tests that perturb an IR and assert timing-ir-diff flags the perturbation).

What this doesn't catch: behavioural divergence between Jacquard and a reference simulator. That's timing-validation.md's job (CVC/iverilog event-trace comparison) — the mcu_soc/sky130 90/90 reference match is the current one-design instance, generalisable in Phase 1+.

WS5 — Parser-success assertions

Done. Both halves had already shipped before this section was marked complete.

Deliverables (all live):

Assertions in Jacquard's Liberty parsing code: non-zero cells parsed on non-empty input. Implemented as TimingLibrary::parse (src/liberty_parser.rs:297-309); rejects with a clear diagnostic naming the input byte count and pointing at the explicit override.
Assertions in opensta-to-ir (WS2): non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Implemented as the --min-arcs N CLI flag (default 1) in the binary (crates/opensta-to-ir/src/main.rs:71-77, :112-121); exits with code EXIT_MIN_ARCS_FAILED = 3 (see :17) and a diagnostic naming the produced count, the threshold, and the override flag.
A way to override thresholds for intentionally-empty test inputs: TimingLibrary::parse_unchecked (src/liberty_parser.rs:316) for the Liberty path, --allow-empty-parse flag for the opensta-to-ir path.

Tests covering both halves: liberty_parser::parse_rejects_library_input_with_zero_cells and parse_unchecked_accepts_zero_cell_library; opensta-to-ir::cli::cli_min_arcs_failure_exit_3 (covers both the failure and the --allow-empty-parse override).

(Original-plan assertions for Jacquard's SDF parser are obsolete — WS3 deleted the parser they were to guard.)

Test plan

Tests live in tests/timing_ir/.

Schema round-trip (WS1). Construct a small IR in Rust, serialize to binary, deserialize, assert equality. Same for JSON.
OpenSTA converter unit tests (WS2). For a hand-crafted tiny design, invoke the converter, assert IR contents match expectation.
Jacquard converter unit tests (WS3). Same, on the same tiny design, through Jacquard's parser.
Corpus diff (WS4). For each design in the primary corpus, freshly produced opensta-to-ir output diffs clean against the checked-in golden expected.jtir within per-design tolerance.
Parser-success assertion tests (WS5). Feed empty Liberty, empty SDF, and non-empty-but-no-match Liberty. Each should fail loud with a clear diagnostic, not proceed silently.

Tolerances:

Delay values: ±5% or ±5 ps absolute floor, whichever is larger. Rationale: matches the existing timing-validation.md convention; per-design overrides allowed via manifest.toml.
Missing arcs: zero tolerance. Every arc in the golden IR must appear in the freshly produced one (and vice versa).
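The delay rule reads as "allowed error is the larger of 5% of the golden value and 5 ps". A minimal sketch follows, assuming delays are expressed in picoseconds and the percentage is taken relative to the golden value; within_tolerance is a hypothetical name, not the signature of timing_ir::diff::diff_irs.

```rust
/// True when `fresh_ps` lies within max(rel * |golden_ps|, abs_floor_ps)
/// of `golden_ps`. Units (ps) and the golden-relative basis are assumptions.
fn within_tolerance(golden_ps: f64, fresh_ps: f64, rel: f64, abs_floor_ps: f64) -> bool {
    let allowed = (rel * golden_ps.abs()).max(abs_floor_ps);
    (golden_ps - fresh_ps).abs() <= allowed
}

fn main() {
    // Large delay: the 5% term dominates (allowed error is 50 ps).
    println!("{}", within_tolerance(1000.0, 1040.0, 0.05, 5.0));
    // Small delay: the 5 ps floor dominates (5% of 10 ps is only 0.5 ps).
    println!("{}", within_tolerance(10.0, 14.0, 0.05, 5.0));
}
```

The absolute floor keeps very short arcs from failing on numerical differences that are a large fraction of a tiny delay.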

Exit criteria (all met)

Phase 0 is complete when all of the following hold:

✅ schemas/timing_ir.fbs checked in (crates/timing-ir/schemas/timing_ir.fbs); round-trip unit tests in crates/timing-ir/tests/.
✅ opensta-to-ir binary production-quality with stable CLI, documented exit codes, primary-corpus support. See ws2-opensta-to-ir.md (Implemented).
✅ src/sdf_parser.rs deleted; --timing-ir <path> canonical; --timing-sdf is a subprocess wrapper over opensta-to-ir (per ADR 0006 § Amendment, the shipping mechanism — Phase 3 native Rust parser deferred indefinitely). See ws3-delete-sdf-parser.md (Implemented).
✅ OpenSTA vendored at vendor/opensta/ (ADR 0005).
✅ timing-ir-diff runs in CI on the primary corpus (opensta-to-ir-tests job), passes cleanly, fails loud on regressions. Mutation tests in crates/timing-ir/tests/diff.rs.
✅ Parser-success assertions live on both halves: TimingLibrary::parse and opensta-to-ir --min-arcs. See WS5 above.
✅ No regression observed in Jacquard's timing-related tests after WS3 cutover.
✅ timing-validation.md carries the forward-pointing note (line 3) explicitly stating its ±5% convention will be superseded by timing-correctness.md once Phase 0 ships. Phase 0 has shipped; that supersession is now effective in practice (the corpus tolerance is set per-design via manifest.toml). Removing the in-doc note is a small follow-up if anyone authoring against the page would benefit.

Out of scope (deferred to later phases)

Native Rust SDF→IR converter. The hand-rolled parser is removed in Phase 0 WS3 (per ADR 0006); the native Rust replacement is Phase 3 work, deferred indefinitely per ADR 0006 § Amendment (no longer release-gating). SDF input ships via the opensta-to-ir subprocess wrapper. See post-phase-0-roadmap.md § Phase 3 for revival triggers.
OpenTimer integration. Depends on the spike; tracked in ../spikes/opentimer-sky130.md and its resulting phase-1 plan.
Private PDK test track. Tracked in ADR 0004; plumbing deferred to its own phase.
SPEF IR. Separate from timing-annotation IR per ADR 0002.
Runtime violation reporting improvements (R4 critical-path refinement JSON). Phase 1 or 2.

Risks

Licensing verification on vendored OpenSTA corpus. Per-file check needed before inclusion. May reduce corpus size if restrictive; acceptable.
FlatBuffers build integration friction. If build.rs codegen causes cross-compilation or CI issues, fall back to checked-in generated code with a documented flatc version. Pick one approach and stick to it; flip-flopping is worse than either option.
Tolerance tuning. Initial ±5% may prove too loose (hides bugs) or too tight (false positives from numerical differences). Plan to re-tune after first real-design data arrives.
WS3 cutover risk. Deleting the hand-rolled SDF parser risks regressing designs that depend on behaviour it currently provides. Exit criterion 7 requires a clean regression run before WS3 is considered complete. If coverage gaps emerge, walk-back options per ADR 0006 apply: add dialect shims to opensta-to-ir, or (now that Phase 3 is deferred) keep the hand-rolled parser available behind a feature flag until dialect parity is reached.
OpenSTA dialect coverage. OpenSTA may not accept every SDF dialect Jacquard's hand-rolled parser has been patched to handle. Such cases are tracked as either opensta-to-ir post-processing fixes or upstream OpenSTA contributions. Under no condition is the fix to reinstate the hand-rolled parser unless walk-back per ADR 0006 is formally triggered.

Plan — WS2: `opensta-to-ir`

Status: Implemented — historical record. All five phases (2.1–2.5) plus Pillar B Stage 1 (per-DFF CLOCK_ARRIVAL records) and release hardening (WS-RH.1 OpenSTA version probe) have shipped. The crate lives at crates/opensta-to-ir/. Current scheduling for further timing-model fidelity work is tracked in post-phase-0-roadmap.md.

Phase: 0 (executed WS2 from phase-0-ir-and-oracle.md). Predecessors: WS1 (crates/timing-ir, schema and round-trip — done), ADRs 0001 / 0002 / 0005 / 0006.

Goal

Deliver a production-quality preprocessing tool that consumes a design's timing inputs and emits a timing-ir file suitable for downstream Jacquard consumption. End-to-end:

.lib + .v + .sdf + .spef + .sdc  →  opensta-to-ir  →  design.jtir (+ design.json)

opensta-to-ir is shipped as a release artefact (per ADR 0006) and is also used by Phase 0 WS3's interim jacquard sim --timing-sdf runtime hook.

High-level architecture

Three components, single binary:

┌─────────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────────┐
│  Rust CLI / driver      │     │  Tcl dump script        │     │  Rust IR builder        │
│  (clap, subprocess mgmt)│ →   │  (runs in OpenSTA proc) │ →   │  (parses dump, builds   │
│  Validates inputs       │     │  Emits canonical dump   │     │   FlatBuffers IR)       │
└─────────────────────────┘     └─────────────────────────┘     └─────────────────────────┘
            │                                                                   │
            └──────────────────── one process invocation ───────────────────────┘

The Rust CLI invokes OpenSTA as a subprocess, writes the Tcl driver script to a temp directory, runs sta -f $tmpdir/dump.tcl, captures the dump file, and converts to IR. The Tcl driver lives at crates/opensta-to-ir/tcl/dump_timing.tcl and is embedded in the binary via include_str!() so the binary is self-contained at runtime — no separate Tcl file needs to ship alongside it.

OpenSTA is located via scripts/build-opensta.sh --print-binary first (the canonical install path for the vendored submodule), then by a PATH lookup. Passing --opensta-bin <PATH> overrides both.

Reasons for this shape:

OpenSTA's structured Tcl API (get_timing_edges, get_timing_arcs_from, etc.) gives access to OpenSTA's internalised timing graph directly. Walking it is simpler than parsing OpenSTA's SDF output back through a second-generation parser.
The Tcl script is the only OpenSTA-specific code; the Rust side is format-only and can later be reused with other producers (Phase 3 native Rust parser, future OpenTimer adapter).
Subprocess invocation preserves Jacquard's permissive license posture (ADR 0001).

Tcl dump format

A simple line-oriented record format. Each line is one annotation. Fields are tab-separated. Strings with tabs/newlines are quoted with simple "..." and \t/\n escaping. Header / footer lines mark the document.

# format-version: 1
# generator-tool: opensta-to-ir 0.1.0
# generator-opensta: <opensta version string>
# input-files: <comma-separated list>
CORNER	<index>	<name>	<process>	<voltage>	<temperature>
ARC	<cell_instance>	<driver_pin>	<load_pin>	<corner_index>	<rise_min>	<rise_typ>	<rise_max>	<fall_min>	<fall_typ>	<fall_max>	<condition>	<origin>
INTERCONNECT	<net>	<from_pin>	<to_pin>	<corner_index>	<min>	<typ>	<max>	<origin>
SETUP_HOLD	<cell_instance>	<d_pin>	<clk_pin>	<edge>	<corner_index>	<setup_min>	<setup_typ>	<setup_max>	<hold_min>	<hold_typ>	<hold_max>	<condition>	<origin>
VENDOR_EXT	<source>	<source_tool>	<kind>	<base64_payload>
# end

Why line-oriented (not JSON): Tcl emits this trivially with puts. Rust parses it with a BufReader line-at-a-time, no streaming-JSON parser. Mismatched lines fail loud at the unit level, not after parsing 100MB of nested JSON.
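To make the line-at-a-time claim concrete, here is a minimal sketch of parsing one ARC record by splitting on tabs. ArcRecord and parse_arc are hypothetical names rather than the contents of dump.rs, and the sketch ignores the quoting and escaping rules described above.

```rust
// Hypothetical in-memory form of one ARC line.
#[derive(Debug, PartialEq)]
struct ArcRecord {
    cell_instance: String,
    driver_pin: String,
    load_pin: String,
    corner_index: u32,
    rise: [f64; 3], // min, typ, max
    fall: [f64; 3], // min, typ, max
    condition: String,
    origin: String,
}

/// Parse one tab-separated ARC line; any shape mismatch is an error.
fn parse_arc(line: &str) -> Result<ArcRecord, String> {
    let f: Vec<&str> = line.split('\t').collect();
    if f.len() != 13 || f[0] != "ARC" {
        return Err(format!("malformed ARC line ({} fields): {:?}", f.len(), line));
    }
    let num = |i: usize| {
        f[i].parse::<f64>()
            .map_err(|e| format!("field {}: {}", i, e))
    };
    Ok(ArcRecord {
        cell_instance: f[1].to_string(),
        driver_pin: f[2].to_string(),
        load_pin: f[3].to_string(),
        corner_index: f[4]
            .parse::<u32>()
            .map_err(|e| format!("corner_index: {}", e))?,
        rise: [num(5)?, num(6)?, num(7)?],
        fall: [num(8)?, num(9)?, num(10)?],
        condition: f[11].to_string(),
        origin: f[12].to_string(),
    })
}

fn main() {
    // Empty condition field; origin spelling is an assumption.
    let line = "ARC\tu1\tA\tY\t0\t0.1\t0.2\t0.3\t0.15\t0.25\t0.35\t\tComputed";
    println!("{:?}", parse_arc(line));
}
```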

The format is a private interface between the bundled Tcl script and the bundled Rust binary; both ship together in one release artefact. We reserve the right to change the format at any time, provided both sides are updated together.

Rust binary

opensta-to-ir [OPTIONS] --output <PATH>

Inputs (at least one liberty + one verilog required):
  --liberty <PATH>...           One or more Liberty files (-r overlay supported by OpenSTA).
  --verilog <PATH>...           One or more Verilog netlists.
  --sdf <PATH>                  Optional. Back-annotated delays.
  --spef <PATH>                 Optional. Parasitics; required for SPEF-based delay calc.
  --sdc <PATH>                  Optional. Constraints (clocks, input delays).
  --top <NAME>                  Top-level module name. Required.
  --corner <NAME>...            Corner name(s). Default: "default".

Output:
  --output <PATH>               IR binary output path (.jtir).
  --json <PATH>                 Optional. JSON sidecar via flatc round-trip.

Behaviour:
  --opensta-bin <PATH>          Override the OpenSTA executable path. Default: probe via
                                `scripts/build-opensta.sh --print-binary`, then fall back to PATH.
  --keep-tmp                    Keep the Tcl script and dump file in $TMPDIR for debugging.
  --min-arcs <N>                Fail if fewer than N timing arcs are emitted. Default: 1.
  --allow-empty-parse           Disable the --min-arcs check. For test fixtures only.
  --strict-tcl                  Treat OpenSTA Tcl warnings as errors.
  -v, --verbose                 Echo OpenSTA's stderr to ours. Default: capture and replay only on failure.

Exit codes:
  0    IR produced successfully.
  1    OpenSTA returned an error.
  2    Tcl dump format error or IR-build failure.
  3    Parser-success assertion failed (--min-arcs not met).
  4    Argument validation error.

Internal flow:

Validate args (required files exist, top name non-empty).
Locate OpenSTA binary; verify version is in supported range.
Render Tcl driver script into $TMPDIR (or stdin).
Spawn opensta -f <script>; capture stdout/stderr/exit.
Read dump file from $TMPDIR/<uniqued>.osd (OpenSTA dump).
Parse dump, build IR via timing-ir crate's FlatBuffers builders.
Apply --min-arcs assertion (see WS5 portion below).
Write .jtir (and .json if requested).
Surface any captured warnings on stderr.

Multi-corner handling

OpenSTA's define_corners and set_scene commands drive multi-corner analysis. Our flow:

Caller passes --corner ss_125C_1v08 --corner tt_25C_1v80 --corner ff_-40C_1v98.
Tcl script calls define_corners once with the union, then iterates foreach corner [get_corners] { ... } and emits CORNER + ARC/INTERCONNECT/SETUP_HOLD lines tagged with the corner index.
Single-corner designs use one entry — same code path, no special case.

PVT extraction (process / voltage / temperature) — OpenSTA exposes these via Liberty's operating conditions. Tcl extracts via the corner's pvt object. If unavailable, process="?", voltage=0.0, temperature=0.0.

Vendor extensions

OpenSTA does not expose a single mechanism for arbitrary annotations. For Phase 0 WS2:

We do not produce VENDOR_EXT records.
The IR's vendor-extension passthrough remains a forward-looking feature; a future producer (a commercial-tool-aware adapter) will populate it.

Tcl-side parsing of vendor-specific Liberty simulation blocks or SDF (VENDOR …) constructs is not in scope for Phase 0 WS2.

Parser-success assertion (WS5 portion)

Per phase-0-ir-and-oracle.md WS5: "Assertions in opensta-to-ir: non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Exit non-zero with a clear diagnostic when below threshold."

Implementation:

--min-arcs <N> flag with default 1.
After IR is built, count TimingArc records in the buffer.
If below threshold and --allow-empty-parse was not passed, exit code 3 with message: opensta-to-ir: produced N timing arcs (--min-arcs <M>); use --allow-empty-parse for empty-fixture tests.
Liberty parser-success assertion already lives in Jacquard's TimingLibrary::parse (see commit 5db131e) — opensta-to-ir invokes OpenSTA's own Liberty reader rather than Jacquard's, so it surfaces missing-cell issues via OpenSTA's exit status (not our concern at this layer).
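The threshold check itself is small. The sketch below shows one way it could look, assuming the arc count has already been computed; check_min_arcs is a hypothetical function, while the exit code value and message text follow the description above.

```rust
const EXIT_MIN_ARCS_FAILED: i32 = 3;

/// Ok(()) when the arc count passes or the check is disabled;
/// otherwise Err((exit_code, diagnostic)).
fn check_min_arcs(
    arc_count: usize,
    min_arcs: usize,
    allow_empty_parse: bool,
) -> Result<(), (i32, String)> {
    if allow_empty_parse || arc_count >= min_arcs {
        return Ok(());
    }
    Err((
        EXIT_MIN_ARCS_FAILED,
        format!(
            "opensta-to-ir: produced {} timing arcs (--min-arcs {}); \
             use --allow-empty-parse for empty-fixture tests",
            arc_count, min_arcs
        ),
    ))
}

fn main() {
    // A zero-arc result with the default threshold of 1.
    match check_min_arcs(0, 1, false) {
        Ok(()) => println!("ok"),
        Err((code, msg)) => eprintln!("{} (exit {})", msg, code),
    }
}
```

Returning the code and message instead of exiting inside the check keeps the function unit-testable without spawning the binary.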

Test plan

Fixture progression — minimum-viable to representative

inv_chain_pnr (already in tests/timing_test/): smallest design with real SKY130 cells and SDF. Verify single arc per inverter, correct rise/fall, single corner.
MCU SoC subset: representative of the real Jacquard flow. Verify the count of arcs matches a known baseline; spot-check a handful of arrival times against report_timing output.
Multi-corner synthetic: hand-built tiny design with ss/tt/ff Liberty corners, verify the IR carries 3 corner records and 3 sets of values per arc.

Test types

Unit tests (Rust): dump-format parser tested against synthetic dump strings (no OpenSTA needed).
Integration tests (Rust + OpenSTA): invoke the binary against committed fixtures, diff the resulting IR against golden IR via timing-ir-diff. Each integration test gates itself on scripts/build-opensta.sh --print-binary succeeding — when the OpenSTA binary is unbuilt, tests skip with a clear "run scripts/build-opensta.sh" message rather than failing. CI runs them after building OpenSTA via the script.
Failure-mode tests: missing OpenSTA, malformed Tcl dump, zero-arc input, missing required argument — each surfaces the expected exit code.

CI integration (closes WS4 remaining work)

A new CI job runs opensta-to-ir on each tests/timing_ir/corpus/<name>/inputs/ and diffs against expected.jtir via timing-ir-diff. Fails loud on diff or exit-code regression.
Stress-corpus run is deferred to Phase 1.

Phased implementation

Splitting WS2 into focused PRs keeps reviewability tight. Each phase exits with a runnable end-to-end on its scope:

Phase	Scope	Exit signal	Status
2.1	Single-corner, timing-arc IOPATHs only. CLI scaffolding.	AIGPDK AND2 round-trip clean through `opensta-to-ir` end-to-end.	✅ Shipped (`dc3db4a` scaffold + `3997e06` subprocess plumbing + `50b8600` real Tcl extraction).
2.2	Add interconnect delays (`wire`-role edges, with optional SPEF).	Multi-cell design produces INTERCONNECT records that round-trip.	✅ Shipped (`67210c0`). Test: `chain_with_sdf_emits_interconnect_delay`.
2.3	Add setup/hold checks.	DFF setup/hold round-trips end-to-end.	✅ Shipped (`8343b14`). Test: `aigpdk_dff_emits_setup_hold_records`. Recovery / removal / width checks remain out of scope.
2.4	Multi-corner.	3-corner synthetic fixture produces 3-corner IR.	✅ Shipped (`530bb36` builder + `59fde04` per-corner Tcl emission + `d110174` integration test + `50f4bf5` real-sky130 multi-corner follow-up). Tests: `aigpdk_dff_emits_per_corner_timing_values`, `sky130_multi_corner_emits_per_corner_values`.
2.5	CI corpus integration; golden-IR fixtures for representative designs.	WS4 corpus job in CI; WS2 task complete.	✅ Shipped (`90558bb`). Runner: `cargo test -p opensta-to-ir corpus`.

Beyond original WS2 scope:

Pillar B Stage 1 — per-DFF CLOCK_ARRIVAL records (c403cc8). Adds clock arrival times to the IR so downstream consumers can compute per-DFF setup/hold margins without re-running OpenSTA. Test: dff_with_sdc_clock_emits_clock_arrival. Tracked separately in post-phase-0-roadmap.md.
Release hardening WS-RH.1 — hard-fail on missing or too-old OpenSTA, with version probe and usage diagnostics (c9c393b). Tests: locate_accepts_min_tested_version, locate_flags_newer_than_tested. Tracked in post-phase-0-roadmap.md § Release hardening.

WS3 (delete src/sdf_parser.rs + wire interim runtime hook) was unblocked once Phase 2.3 minimum landed and has also shipped — see ws3-delete-sdf-parser.md.

Open questions — resolution

Resolutions from implementation:

OpenSTA version pinning — Resolved by WS-RH.1 (c9c393b). Binary probes OpenSTA's version_string, accepts a [MIN_TESTED, MAX_TESTED] range, prints a usage diagnostic with the supported range on mismatch.
OpenSTA installation — Resolved. scripts/build-opensta.sh ships with --print-binary for the dependency probe; integration tests skip cleanly when the binary isn't built. Documented in the script's --help and the post-Phase-0 roadmap.
Tcl-script versioning — Resolved. # format-version: 1 header check is enforced in dump.rs; the binary refuses unknown versions with an explicit error.
Conditional arcs (SDF COND) — Partial. The condition field is plumbed end-to-end (dump format → Rust parser → IR builder), but the Tcl emission side does not yet populate it for conditional variants. Defer until a real design surfaces a COND arc that needs distinguishing.
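The version-range probe from the first resolution can be sketched as a parse followed by a tuple comparison. Everything here is illustrative: the bounds are placeholders rather than the project's real MIN_TESTED and MAX_TESTED, the three-part dotted version shape is an assumption, and VersionCheck, parse_version, and classify are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum VersionCheck {
    Supported,
    TooOld,
    NewerThanTested,
}

// Placeholder bounds for illustration only; not the project's real range.
const MIN_TESTED: (u32, u32, u32) = (2, 5, 0);
const MAX_TESTED: (u32, u32, u32) = (2, 7, 0);

/// Parse "major.minor.patch"; any other shape yields None.
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.').map(|p| p.parse::<u32>().ok());
    let version = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None;
    }
    Some(version)
}

/// Tuples compare lexicographically, which matches version ordering.
fn classify(version: (u32, u32, u32)) -> VersionCheck {
    if version < MIN_TESTED {
        VersionCheck::TooOld
    } else if version > MAX_TESTED {
        VersionCheck::NewerThanTested
    } else {
        VersionCheck::Supported
    }
}

fn main() {
    for s in ["2.4.9", "2.6.0", "3.0.0"] {
        println!("{} -> {:?}", s, parse_version(s).map(classify));
    }
}
```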

Still open / deferred:

Long-running designs: streaming dump emission (Tcl flushing line-by-line, Rust incremental read) — defer until profiling on a real SoC shows memory pressure.
Strict Tcl error handling: --strict-tcl flag was specced but not implemented. Current behaviour captures all stderr and replays on failure; no warning-to-error upgrade path. Land if it becomes a real CI hygiene concern.

Risks

OpenSTA's Tcl API is large and not all of it is documented. Some primitives we'll need (e.g., per-corner delay values for a specific arc) may require digging through Sta.cc. Mitigation: budget time, lean on report_path text output as a fallback if the structured API proves opaque for a given query.
OpenSTA may be slow on big designs — the structured walk over millions of arcs is single-threaded. Mitigation: --keep-tmp for profiling, accept slow phase-0 runs, optimise later if it blocks CI.
Format drift between Tcl and Rust — both sides advance together; the format-version line plus version-mismatch fail-loud catches drift. Add a unit test that the Rust parser rejects an unexpected version line.

Non-goals

A general SDF parser. (The whole point: avoid that.)
Wire-level reactivity or feedback to OpenSTA mid-run (this is a one-shot extract).
Comparison against OpenTimer (that's a separate ADR-0003-spike concern).
Replacing OpenSTA's role as oracle in CI — opensta-to-ir is a producer, not a checker.

References

../adr/0001-opensta-as-oracle.md — subprocess model, license posture.
../adr/0002-timing-ir.md — IR contract this tool emits.
../adr/0005-opensta-vendoring-and-corpus.md — vendor/opensta/ submodule.
../adr/0006-sdf-preprocessing-model.md — interim runtime hook + release-time cutover.
phase-0-ir-and-oracle.md — WS2 row in the work breakdown.
crates/timing-ir/schemas/timing_ir.fbs — schema this tool produces.
vendor/opensta/doc/StaApi.txt — OpenSTA Tcl API reference.

Last updated: 2026-04-28 (design); 2026-05-15 (status flip to Implemented).

Plan — WS3: delete SDF parser, wire IR consumer + interim runtime hook

Status: Implemented — kept as historical record. Note: the "interim" / "pre-release-only" framing throughout this document describes the original ADR 0006 model. Per ADR 0006 § Amendment (2026-05-02), the runtime subprocess wrapper is now the shipping mechanism — Phase 3 (native Rust SDF→IR) is no longer release-gating. This document is preserved for the implementation phasing record; for current shipping intent see ADR 0006 § Amendment and post-phase-0-roadmap.md § Phase 3.

Phase: 0 (executes WS3 from phase-0-ir-and-oracle.md). Predecessors: WS2 phases 2.1 + 2.3-minimum (delay arcs + setup/hold checks landed). Sufficient IR coverage for runtime cutover. ADRs: 0002 (IR), 0006 (SDF preprocessing model + interim cutover; amended 2026-05-02).

Goal

Delete src/sdf_parser.rs and migrate src/flatten.rs's timing-loading to consume the timing IR directly. Wire jacquard sim --timing-ir <PATH> as the canonical input path, and (per ADR 0006) keep --timing-sdf <PATH> working pre-release as a contributor-ergonomics convenience that internally subprocesses opensta-to-ir.

End state:

No hand-rolled SDF parsing in the Jacquard codebase.
Runtime SDF input still works (via internal subprocess) until first release.
flatten.rs consumes timing_ir::TimingIR<'_> for arc / setup / hold loading.
All flatten.rs tests that previously hand-built SDF strings are migrated to build IR fixtures via the timing-ir crate's FlatBuffers builders.

Surface analysis

src/sdf_parser.rs (1099 lines) defines SdfFile, SdfDelay, SdfCorner, TimingCheckType, and parses SDF text. Consumers:

src/flatten.rs — load_timing_from_sdf(...) is the only non-test consumer; iterates SdfFile.get_cell(path), uses SdfDelay for wire delays, TimingCheckType::Setup/Hold for check identification. ~200 lines of integration plus 7+ test fixtures that build SDF strings inline.
src/sim/setup.rs — translates --sdf-corner CLI string into SdfCorner and calls SdfFile::parse_file.
src/aig.rs — test imports only.
src/lib.rs — module declaration only.

Architecture changes

New: `src/sim/timing_ir_loader.rs`

Thin module that owns the IR file buffer (so consumers can borrow TimingIR<'_> views from it):

pub struct TimingIrFile {
    buf: Vec<u8>,
}

impl TimingIrFile {
    pub fn from_path(path: &Path) -> Result<Self, ...> { ... }
    pub fn from_bytes(buf: Vec<u8>) -> Result<Self, ...> { ... }
    pub fn view(&self) -> Result<timing_ir::TimingIR<'_>, ...> {
        timing_ir::root_as_timing_ir(&self.buf)
    }
}

The TimingIR view holds a lifetime tied to the buffer. Callers keep the TimingIrFile alive while iterating the view.
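The owner-plus-borrowed-view relationship can be shown without FlatBuffers. In this sketch `IrView` is a stand-in for the generated `timing_ir::TimingIR<'_>` type and `IrFile` for `TimingIrFile`; both names and the `total_len` helper are hypothetical, and the error handling is omitted.

```rust
/// Stand-in for the generated view type: it only borrows the bytes.
struct IrView<'a> {
    bytes: &'a [u8],
}

impl<'a> IrView<'a> {
    fn len(&self) -> usize {
        self.bytes.len()
    }
}

/// Owns the file contents so that views can borrow from it.
struct IrFile {
    buf: Vec<u8>,
}

impl IrFile {
    fn from_bytes(buf: Vec<u8>) -> Self {
        IrFile { buf }
    }

    /// The returned view cannot outlive `self`; the borrow checker enforces it.
    fn view(&self) -> IrView<'_> {
        IrView { bytes: &self.buf }
    }
}

/// Callers hold the file and take a short-lived view while iterating.
fn total_len(file: &IrFile) -> usize {
    let view = file.view(); // borrows file.buf for as long as `view` lives
    view.len()
}
```

Dropping the `IrFile` while a view is still in use is a compile error, which is the property the loader module relies on.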

Modified: `src/flatten.rs`

Replace load_timing_from_sdf with load_timing_from_ir:

pub fn load_timing_from_ir(
    &mut self,
    aig: &AIG,
    netlistdb: &NetlistDB,
    ir: &timing_ir::TimingIR<'_>,
    clock_period_ps: u64,
    liberty_fallback: Option<&TimingLibrary>,
    debug: bool,
) { ... }

Logic translation table:

Old (`SdfFile`)	New (`TimingIR<'_>`)
`sdf.get_cell(path)`	Index `ir.timing_arcs()` / `ir.setup_hold_checks()` by `cell_instance` (build a `HashMap<&str, _>` once).
`cell.iopaths`	Filter timing arcs by `cell_instance == path`.
`cell.timing_checks`	Filter setup/hold checks by `cell_instance == path`.
`SdfDelay { rise, fall, ... }`	`TimingArc.rise_delay()` / `.fall_delay()` (per-corner); take corner 0 max for now.
`TimingCheckType::Setup` / `::Hold`	`SetupHoldCheck.setup()` / `.hold()` per record.
`cell.interconnect_delays`	`ir.interconnect_delays()` — empty until WS2.2 lands; tolerate.

The hierarchy-prefix detection (lines 1793-1820 of current flatten.rs) is independent of source format — same logic applies, just use IR's cell_instance strings instead of SDF's. Keep the heuristic.
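The first row of the translation table (build a per-cell index once, then look cells up by instance path) might be sketched as follows. `ArcRec` is a hypothetical flattened record standing in for the generated FlatBuffers accessors, and only max-corner values are modelled, matching the "corner 0 max" simplification in the table.

```rust
use std::collections::HashMap;

/// Hypothetical flattened arc record; real data would come from `ir.timing_arcs()`.
struct ArcRec<'a> {
    cell_instance: &'a str,
    rise_max_ps: u64,
    fall_max_ps: u64,
}

/// Build the per-cell index once, so later lookups replace `sdf.get_cell(path)`.
fn index_by_cell<'a>(arcs: &'a [ArcRec<'a>]) -> HashMap<&'a str, Vec<&'a ArcRec<'a>>> {
    let mut index: HashMap<&str, Vec<&ArcRec>> = HashMap::new();
    for arc in arcs {
        index.entry(arc.cell_instance).or_default().push(arc);
    }
    index
}

/// Worst-case (max of rise and fall) delay over all arcs of one cell.
fn worst_delay_ps(index: &HashMap<&str, Vec<&ArcRec>>, cell: &str) -> Option<u64> {
    index
        .get(cell)?
        .iter()
        .map(|a| a.rise_max_ps.max(a.fall_max_ps))
        .max()
}
```

Because the keys borrow from the IR buffer, the index costs one pass over the arcs and no string copies.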

Modified: CLI surface (`src/bin/jacquard.rs`, `src/sim/setup.rs`)

Add --timing-ir <PATH> flag that loads IR directly via TimingIrFile::from_path.
Retarget --timing-sdf <PATH> (and the existing --sdf-corner) to: spawn opensta-to-ir as a subprocess, capture its IR output, call load_timing_from_ir. Mark the code site INTERIM per ADR 0006.
The interim hook needs Liberty + Verilog paths to feed opensta-to-ir; the jacquard sim CLI already takes those, so plumb them through.
Keep --sdf-corner for backward compat — the interim wrapper passes it as --corner to opensta-to-ir.
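As an illustration of the interim hook, the subprocess invocation can be assembled with `std::process::Command`. This sketch only builds the command and does not run it; the function name and argument order are assumptions, and the flag set (--liberty, --verilog, --sdf, --top, --output, --corner) follows the opensta-to-ir usage described in these plans rather than a checked CLI reference.

```rust
use std::path::Path;
use std::process::Command;

/// INTERIM per ADR 0006: build (but do not run) the opensta-to-ir invocation.
fn opensta_to_ir_command(
    liberty: &Path,
    verilog: &Path,
    sdf: &Path,
    top: &str,
    corner: Option<&str>,
    output: &Path,
) -> Command {
    let mut cmd = Command::new("opensta-to-ir");
    cmd.arg("--liberty").arg(liberty)
        .arg("--verilog").arg(verilog)
        .arg("--sdf").arg(sdf)
        .arg("--top").arg(top)
        .arg("--output").arg(output);
    // --sdf-corner on the jacquard side is forwarded as --corner.
    if let Some(corner) = corner {
        cmd.arg("--corner").arg(corner);
    }
    cmd
}
```

A caller would then invoke `.output()` on the returned command, check the exit status, and hand the written IR file to `TimingIrFile::from_path`.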

Deletions

src/sdf_parser.rs — entire file.
src/lib.rs — pub mod sdf_parser line.
src/aig.rs — use crate::sdf_parser::{SdfCorner, SdfFile} test imports; rewrite or delete the affected tests.
src/flatten.rs — use crate::sdf_parser::SdfFile; rewrite test fixtures.

Test migration strategy

Test fixtures in flatten.rs currently look like:

let sdf_content = r#"(DELAYFILE ... )"#;
let sdf = SdfFile::parse_str(sdf_content, SdfCorner::Typ).expect("...");
flat.load_timing_from_sdf(&aig, &netlistdb, &sdf, ...);

After cutover:

let ir_buf = build_test_ir(&TestIrSpec {
    arcs: vec![ /* (cell, from, to, rise_max, fall_max) */ ],
    setup_hold: vec![ /* (cell, d, clk, edge, setup, hold) */ ],
});
let ir = root_as_timing_ir(&ir_buf).unwrap();
flat.load_timing_from_ir(&aig, &netlistdb, &ir, ...);

A build_test_ir helper in flatten.rs::tests mirrors build_ir_with_arcs from crates/timing-ir/tests/diff.rs. Single source of truth would be nicer; for now duplicate it (deduplication is a future cleanup).

Phased implementation

Phase	Scope	Exit signal
3.1	Add `src/sim/timing_ir_loader.rs` and `flatten.rs::load_timing_from_ir` (parallel to `_from_sdf`). No CLI surface, no deletions. Unit-test the new function with a small synthetic IR.	New function compiles + passes unit test; existing `_from_sdf` path still works.
3.2	Add `jacquard sim --timing-ir <PATH>` CLI flag wired to `load_timing_from_ir`. End-to-end test: pre-generate IR via `opensta-to-ir`, run `jacquard sim --timing-ir`, compare against the existing `--timing-sdf` baseline.	A representative timing test (e.g., one of the existing `tests/timing_test/`) produces matching VCD output via both paths.
3.3	Retarget `--timing-sdf` to subprocess `opensta-to-ir` internally, then consume IR. Tag the code site `INTERIM per ADR 0006`.	Existing `--timing-sdf` regression tests pass through the new path.
3.4	Delete `src/sdf_parser.rs`. Migrate flatten.rs test fixtures from SDF strings to IR builders. Migrate aig.rs test imports.	All `cargo test --lib` tests pass; `src/sdf_parser.rs` is gone; the only remaining `crate::sdf_parser::` references are in `git log`.

Each phase exits cleanly. Phase 3.4 is the irreversible deletion, so it is gated on phases 3.1-3.3 having green CI on the migration tests.

Open questions

Hierarchy separator: SDF uses "." while OpenSTA's default divider is "/". Our IR's cell_instance strings come from OpenSTA, so they use "/". The flatten.rs hierarchy-prefix detection logic currently uses "."; after cutover, it needs to use "/". Verify by running on a hierarchical design (MCU SoC) before declaring 3.4 ready.
--sdf-corner semantics under IR: today this picks one of Min/Typ/Max from SDF triples. The IR has min/typ/max per TimingValue already; the corner selection becomes "pick which of the three to use" applied per-arc rather than per-file. Document the mapping.
Default-corner consistency: WS2 emits default as the corner name. Pre-existing Jacquard tests may not look at corner names — need to spot-check.
liberty_fallback semantics: today, for cells absent from SDF, we fall back to Liberty-computed delays. Under IR, OpenSTA-computed values are already in the IR's arcs (as Origin::Computed). So liberty_fallback is potentially dead. Decide whether to drop it in 3.4 or keep as safety net.
Multi-corner (post-WS2.4): when WS2.4 lands, the IR will have multiple corners. flatten.rs currently picks one. Define the per-corner selection contract — explicit corner-name CLI flag, or default to a named corner.
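One way to picture the per-arc corner mapping raised in the --sdf-corner question is a small selector over the min/typ/max triple. This is a sketch under assumptions: `Corner`, `Triple`, and the accepted spellings "min", "typ", and "max" are hypothetical and do not describe the shipped CLI contract.

```rust
/// Which member of a min/typ/max triple to use (mirrors the old per-file choice).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Corner {
    Min,
    Typ,
    Max,
}

/// Hypothetical stand-in for the IR's per-value triple, in picoseconds.
#[derive(Clone, Copy)]
struct Triple {
    min: u64,
    typ: u64,
    max: u64,
}

/// Map a CLI string to a corner; the spellings are an assumption.
fn parse_corner(s: &str) -> Option<Corner> {
    match s.to_ascii_lowercase().as_str() {
        "min" => Some(Corner::Min),
        "typ" => Some(Corner::Typ),
        "max" => Some(Corner::Max),
        _ => None,
    }
}

/// Applied per arc rather than per file: pick one member of the triple.
fn select(value: Triple, corner: Corner) -> u64 {
    match corner {
        Corner::Min => value.min,
        Corner::Typ => value.typ,
        Corner::Max => value.max,
    }
}
```

Keeping the selection in one function means a later multi-corner contract only has to change which `Triple` is passed in, not how the member is picked.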

Risks

flatten.rs test churn: 7+ test fixtures need rewrites. Each is a focused mechanical change but the bulk adds up. Mitigation: a build_test_ir helper standardizes the pattern.
Hidden-bug exposure: the existing SDF parser had quirks. The IR parser has different ones (or none). Migration may surface bugs that were latent. Treat any test failure during 3.4 as a real bug, not "just adjust the test."
Hierarchy-separator regression: if not caught in phase 3.2 testing (which tests on a single design), it could land in 3.4 and break a hierarchical design that wasn't previously regression-tested. Mitigation: include a hierarchical design in the 3.2 verification matrix.
Cutover timing: WS3 lands while WS2.2 (interconnects) and WS2.4 (multi-corner) are still pending. flatten.rs's cutover assumes those will land later, so test fixtures must not depend on interconnect delays or multi-corner behaviour if the phases up to and including 3.4 are to pass.

Walk-back

If 3.4 surfaces blocking issues, ADR 0006 already permits deferring deletion: keep src/sdf_parser.rs alive but tagged LEGACY — superseded by IR consumer; remove before first release, and ship preprocessing-only for the interim. The runtime SDF subprocess wrapper covers the contributor ergonomics. The native Rust SDF parser rewrite (Phase 3 in the original phasing) is the durable replacement.

Non-goals

A native Rust SDF parser. (Original ADR 0006 Phase 3; not part of WS3.)
Validating SDF round-trip equivalence between the old parser and OpenSTA. (CI corpus test in WS4/WS2.5 covers this when fixtures exist.)
Refactoring the broader flatten.rs structure beyond what migration requires.

References

../adr/0002-timing-ir.md — IR contract.
../adr/0006-sdf-preprocessing-model.md — interim runtime subprocess + release-time cutover.
phase-0-ir-and-oracle.md — WS3 row.
ws2-opensta-to-ir.md — produces the IR this consumer reads.
crates/opensta-to-ir/ — subprocess target for the interim --timing-sdf hook.
crates/timing-ir/ — IR library + builders for test fixtures.

Last updated: 2026-04-28

Plan — WS3 follow-up: re-add cosim `--sdf` via `opensta-to-ir`

Status: Deferred. Tracked here so future work can pick it up. Predecessor: WS3 phase 3.4 (deletes hand-rolled src/sdf_parser.rs).

Background

Phase 3.4 deleted src/sdf_parser.rs. The jacquard sim subcommand kept SDF input working (Phase 3.3 wired --sdf through setup::load_sdf_via_opensta_to_ir, an internal subprocess wrapper that calls the opensta-to-ir crate to convert SDF→IR). The jacquard cosim subcommand chose Option B of the phase 3.4 handoff: drop --sdf entirely rather than thread --liberty through. As a result, cosim now only accepts pre-converted IR via --timing-ir.

What was removed in 3.4

CosimArgs::sdf, sdf_corner, sdf_debug CLI fields (src/bin/jacquard.rs).
The config.timing.sdf_file / sdf_corner fallback path in src/sim/cosim_metal.rs::run_cosim.
TimingSimConfig::sdf_file and sdf_corner JSON fields (src/testbench.rs).

User-facing migration (current state)

The tests/mcu_soc/ cosim flow that used to load SDF via the testbench config now needs an explicit pre-conversion step.

Feed `6_final.v` directly to `opensta-to-ir`

Retraction (2026-05-18). Earlier versions of this section recommended feeding tests/mcu_soc/data/top_synth.v (post-synthesis, pre-P&R) to opensta-to-ir to dodge a parse error on 6_final.v's chipflow integration wrapper. That was wrong: top_synth.v is missing the ~236K cells P&R inserts (clkbuf_regs_* CTS buffers, ANTENNA_* diodes, delaybuf_*, fillers), so OpenSTA silently drops every SDF entry referencing a P&R-inserted cell and the resulting IR is missing the bulk of the design's timing. The "28162 matched / 2090 unmatched" verification log we celebrated at the time measured jtir records against the cosim-loaded netlist, not SDF coverage against the jtir — high surface match rate, materially incomplete IR. See ADR 0009 (OpenSTA Verilog reader input constraints) for the broader rule.

opensta-to-ir now transparently extracts module <--top> from each input file before invoking OpenSTA (implementation in crates/opensta-to-ir/src/verilog_filter.rs). For the chipflow mcu_soc case this strips the openframe_project_wrapper module automatically; the same handling kicks in for any LibreLane + wafer.space user (hazard3 and future tapeouts) whose final netlist carries an integration wrapper around the structural top.

# Convert SDF → IR once. Pass 6_final.v directly; the wrapper module
# is dropped automatically.
opensta-to-ir \
    --liberty /path/to/sky130_fd_sc_hd__tt_025C_1v80.lib \
    --verilog tests/mcu_soc/data/6_final.v \
    --sdf tests/mcu_soc/data/6_final.sdf \
    --top top \
    --output tests/mcu_soc/data/6_final.jtir

# Run cosim with the pre-converted IR. Cosim loads 6_final.v (the
# wrapper) because that's what carries GPIO ports. The IR consumer's
# hierarchy-prefix detection strips the `top_inst/` prefix from the
# wrapper's cell paths so they match the IR's instance names.
cargo run -r --features metal --bin jacquard -- cosim \
    tests/mcu_soc/data/6_final.v \
    --config tests/mcu_soc/sim_config_sky130.json \
    --top-module openframe_project_wrapper \
    --timing-ir tests/mcu_soc/data/6_final.jtir

tests/mcu_soc/sim_config_sky130.json no longer carries sdf_file / sdf_corner (the fields would be silently ignored if added back; cosim does not consume them).

Events-reference comparison: nuances

tests/mcu_soc/events_reference.json was wired into the sky130 cosim config as part of phase 3.4 verification. End-to-end pipeline result on a 3M-tick run:

67 UART bytes captured; the reference's 155 UART events end at timestamp 4,187,182. All 67 captured payloads match the reference's leading bytes (decoded UART output: ....: nyaa~!\nSoC type: CA7F100F\nFlash ID: CA7CA7FF\nQuad mode). No payload divergence.
15 non-UART entries in the reference (cxxrtl-emitted SPI deselect events with payload: "") are filtered out at parse time by the tolerant deserializer in cosim_metal.rs::run_cosim. Without that filter the comparison panicked on the first SPI entry.

chipflow's `num_steps` and `timestamp` are edge-counted

Retraction. Earlier drafts of this section claimed Jacquard's --max-cycles counts half-cycles. That was a misdiagnosis based on reading MultiClockScheduler::new (which does emit per-edge raw entries) without noticing the pairing layer at src/sim/cosim_metal.rs:2604-2675 that collapses them into one paired buffer per cycle. Today, --max-cycles N correctly counts N full clock cycles: each cosim tick does one fall-edge dispatch plus one rise-edge dispatch and DFFs capture once per tick. Verified via --stimulus-vcd trace (5 ticks → simulated time spans 0–200000 ps for a 40 ns period clock, exactly 5 cycles).

The actual unit difference vs chipflow's cxxrtl harness:

chipflow's num_steps is the count of tick() calls; each tick() increments timestamp twice (once after the negedge dispatch, once after the posedge), so the events_reference.json timestamp field counts clock edges (a full cycle = posedge-to-posedge = 2 edges). The harness:
```
auto tick = [&]() {
    {{interface}}.step(timestamp);
    top.clk.set(false); agent.step(); ++timestamp;  // post-negedge (odd)
    top.clk.set(true);  agent.step(); ++timestamp;  // post-posedge (even)
};
for (int i = 0; i < num_steps; i++) tick();
```
See chipflow-lib/chipflow/common/sim/main.cc.jinja:32-74.
The half-tick timestamp is an intentional design, not a bug: parity tags each event with the clock phase it fired on (useful for verification of async paths).
chipflow's num_steps therefore doubles as an edge budget: 3 M num_steps = 3 M edges = 1.5 M full clock cycles.

To compare a Jacquard cosim run against today's events_reference.json, divide reference timestamps by 2 to convert edges → cycles. Empirical spot-check on mcu_soc/sky130: byte-0 in Jacquard at --max-cycles 200000 arrives at tick 28682; reference timestamp 58290 / 2 = 29145 cycles; ratio 0.984× (simulators agree on simulated time within 2%).
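The conversion and the spot-check arithmetic above can be written down directly. The function names are hypothetical; the numbers in the usage note are the ones quoted in the paragraph.

```rust
/// Convert a chipflow reference timestamp (counted in clock edges) to full cycles.
fn edges_to_cycles(timestamp_edges: u64) -> u64 {
    timestamp_edges / 2
}

/// Ratio of the Jacquard arrival tick to the reference arrival, both in cycles.
fn arrival_ratio(jacquard_tick: u64, reference_timestamp_edges: u64) -> f64 {
    jacquard_tick as f64 / edges_to_cycles(reference_timestamp_edges) as f64
}
```

With the byte-0 figures, `edges_to_cycles(58290)` gives 29145 and `arrival_ratio(28682, 58290)` is about 0.984.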

The earlier "67 of 155 events captured" gap is not a budget issue — chipflow drives input stimulus via design/tests/input.json and reference events 69+ require those driven inputs. The input-stimulus dispatcher was added in commit 4a1a989, and the mcu_soc/sky130 cosim now matches the cut-down chipflow reference 1:1 (90/90 events).

The earlier "Jacquard ~14% slower per byte than cxxrtl" claim relied on a phantom half-cycle correction; it is also retracted. There is no rate gap to explain at this level.

Done: `--max-cycles` renamed to `--max-clock-edges` (commit `46b5c28`)

Cosim's internal granularity moved from full clock cycles to scheduler edges, aligning Jacquard's CLI 1:1 with chipflow's num_steps and unlocking per-edge event timestamping. Section retained for context on the unit conventions captured above.

Option A — restore cosim `--sdf` ergonomics

When this becomes a priority, mirror the jacquard sim surface:

Changes

Add --liberty to CosimArgs (src/bin/jacquard.rs). Plumb it through DesignArgs::liberty (currently hardcoded None in cmd_cosim). Also pass --top-module through if it is not already.
Add --sdf, --sdf-corner, --sdf-debug back to CosimArgs. Make them mutually exclusive with --timing-ir (clap conflicts_with = "timing_ir").
Re-add TimingSimConfig::sdf_file / sdf_corner (optional) — plus a new liberty_file field for the OpenSTA invocation. Update tests/mcu_soc/sim_config_sky130.json to use the new shape.
Restore the cosim config-file fallback: in src/sim/cosim_metal.rs::run_cosim, when timing is not yet enabled and the config provides SDF + Liberty paths, call setup::load_sdf_via_opensta_to_ir. Match the priority order: CLI > config.timing.* > nothing.
Update --output-vcd error message to mention --sdf again.
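The mutual exclusion and the priority order (CLI > config.timing.* > nothing) can be sketched independently of clap. `TimingSource` and `resolve_timing` are hypothetical names, and the (sdf, liberty) tuples are a simplification of the CLI and config structs.

```rust
use std::path::PathBuf;

/// Where timing data will come from after resolving CLI and config inputs.
#[derive(Debug, PartialEq)]
enum TimingSource {
    Ir(PathBuf),
    Sdf { sdf: PathBuf, liberty: PathBuf },
    Untimed,
}

/// CLI inputs win over config inputs; --timing-ir and --sdf are mutually exclusive.
fn resolve_timing(
    cli_ir: Option<PathBuf>,
    cli_sdf: Option<(PathBuf, PathBuf)>,    // (sdf, liberty) from the command line
    config_sdf: Option<(PathBuf, PathBuf)>, // (sdf_file, liberty_file) from the config
) -> Result<TimingSource, String> {
    match (cli_ir, cli_sdf) {
        (Some(_), Some(_)) => Err("--timing-ir conflicts with --sdf".to_string()),
        (Some(ir), None) => Ok(TimingSource::Ir(ir)),
        (None, Some((sdf, liberty))) => Ok(TimingSource::Sdf { sdf, liberty }),
        // Nothing on the CLI: fall back to the config file, then to untimed.
        (None, None) => Ok(match config_sdf {
            Some((sdf, liberty)) => TimingSource::Sdf { sdf, liberty },
            None => TimingSource::Untimed,
        }),
    }
}
```

In the real CLI the conflict would be rejected by clap before this point; the explicit error arm keeps the sketch self-contained.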

Out of scope for Option A

Rebuilding a hand-rolled SDF parser. (See ADR 0006 — the durable replacement is the native Rust SDF→IR converter, tracked separately as Phase 3 in the original phasing.)
Adding cosim-specific corner-selection beyond what jacquard sim already offers. The IR's min/typ/max triple is selected via ir_corner0_max (currently always max); changing that is a separate concern that affects both subcommands.

Verification

After Option A lands:

cargo build --features metal
cargo test --lib
# Manual smoke test of the previous mcu_soc workflow:
cargo run -r --features metal --bin jacquard -- cosim \
    tests/mcu_soc/data/6_final.v \
    --config tests/mcu_soc/sim_config_sky130.json \
    --liberty <path>/sky130.lib \
    --sdf tests/mcu_soc/data/6_final.sdf

This should produce results equivalent to the pre-3.4 hand-rolled-parser path, within the IR's representational bounds (single-value interconnect delays, max corner selection).

Walk-back

If Option A is never picked up before first release, the existing IR-only cosim surface is fine — contributors using SDF can pre-convert via opensta-to-ir and pass the resulting .jtir. The follow-up exists as a contributor-ergonomics improvement, not a correctness gap.

Multi-clock and stimulus architecture — exploratory roadmap

Status: Captured architectural thinking. Most phases here are demand-driven and will only be picked up when a real-world workload requires them. Phases 1 and 2 may be worth scheduling on their own merits in a future release; the rest are written down so the design space is on record when the need appears.

This is a design-space doc, not a scheduling doc. It complements post-phase-0-roadmap.md (which schedules committed work) by capturing the architecture for two related areas — multi-clock-domain support and stimulus generation — that today have working but limited implementations.

Why now

The conversation that produced this doc was about supporting cosim against external testbench environments (UVM, CocoTB) and external clock sources (PHY, audio, DFS). Two observations crystallised the architecture:

Real designs partition into large synchronous islands with thin boundaries. External-clock and DFS scenarios look intractable until you notice that <1 % of nets typically cross domains; the bulk of the design is batchable inside one island.
Stimulus generation and stimulus consumption don't have to share a loop. Today cosim couples the testbench tick-by-tick to the GPU dispatch. Decoupling them — via streaming or full precompute — turns the GPU from a ping-ponging coprocessor into a stream consumer.

Both observations point at architecture changes that compose cleanly with each other, with the existing multi-clock plumbing, and with the existing X-prop / timing-arrival infrastructure.

What exists today

Worth pinning down so the gap is precise:

Multi-clock-domain functional support. MultiClockScheduler in src/sim/cosim_metal.rs:1347 builds a tick-by-tick edge schedule over the LCM of all domains' periods (with GCD granularity). DFFs are tagged by clock domain via clock_pin2aigpins in src/aig.rs:209. Each scheduler tick asserts only the firing domains' posedge/negedge flag bits; the GPU kernel gates DFF write-back on those flags, so non-firing domains' DFFs hold.
LCM constraint. The scheduler asserts schedule_len <= 1_000_000 (cosim_metal.rs:1376). Commensurable periods (PLL-derived) work; truly non-commensurable external clocks (audio, USB-recovered, DFS-mid-flight) hit the cap.
Cosim stimulus. InputDispatcher (src/sim/input_stim.rs) consumes a chipflow-compatible wait/action/stop JSON command list. Peripheral models (src/sim/models/) drain queued actions per edge and emit events. Generation is interleaved with the GPU dispatch loop — every tick (or every few ticks) round-trips through the host.
VCD replay path. jacquard sim already runs from a precomputed input VCD with no host-side reactive logic. This is, in effect, the "Level 1" precomputed-edge mode described below; the gap is between cosim's reactive loop and sim's flat replay, not in the kernel itself.
CDC checking. None today. SDF setup/hold checks exist (src/timing_report.rs) but are not wired through any CDC-specific path.
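To make the LCM constraint concrete, the sketch below computes a schedule length as LCM divided by GCD of the domain periods and applies the 1,000,000 cap. That formula is an assumption about how "LCM of all domains' periods with GCD granularity" translates into a slot count; the real MultiClockScheduler may count edges differently, and the function names are hypothetical.

```rust
/// Cap on schedule length, taken from the assertion quoted above.
const MAX_SCHEDULE_LEN: u64 = 1_000_000;

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Least common multiple, or None on u64 overflow.
fn lcm(a: u64, b: u64) -> Option<u64> {
    (a / gcd(a, b)).checked_mul(b)
}

/// Number of GCD-sized slots in one LCM-long repeating schedule.
fn schedule_len(periods_ps: &[u64]) -> Result<u64, String> {
    if periods_ps.is_empty() || periods_ps.contains(&0) {
        return Err("need at least one non-zero period".to_string());
    }
    let granularity = periods_ps.iter().copied().fold(0, gcd);
    let span = periods_ps
        .iter()
        .copied()
        .try_fold(1u64, lcm)
        .ok_or_else(|| "LCM of periods overflows u64".to_string())?;
    let len = span / granularity;
    if len > MAX_SCHEDULE_LEN {
        Err(format!("schedule length {} exceeds cap {}", len, MAX_SCHEDULE_LEN))
    } else {
        Ok(len)
    }
}
```

Periods of 40,000 ps and 10,000 ps give a length of 4, while 40,000 ps against 20,833 ps (coprime values) gives 833,320,000 and is rejected, which is the non-commensurable case described above.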

Architecture: two orthogonal axes

The work falls cleanly into two independent dimensions.

Axis 1 — Spatial: synchronous islands with thin boundaries

A static analysis pass partitions the AIG into islands: maximal connected sets of gates whose transitive fanin/fanout stays inside one clock domain. Whatever's left is the boundary — combinational gates and DFFs whose data cones cross domains. In real designs the boundary is small, dominated by synchronizers (2FF), async FIFO control, and handshake glue.

Per-island execution lets the GPU:

Skip evaluation of an island whose state hasn't changed.
Batch K consecutive ticks of a fast island into one kernel launch when the slow island has no edges in the window.
Treat the boundary as a small mailbox (slots written by the source island and read by the destination island) rather than a global state vector.

This is essentially functional partitioning for parallel discrete-event simulation, but the GPU dataflow model gets more benefit than a CPU sim because batched dataflow is exactly what a fast island's run-ahead window wants.

Axis 2 — Temporal: stimulus generation decoupled from consumption

The cosim host loop is the throughput floor today. Decoupling has three levels:

Replay — the testbench has already produced a complete input VCD; the GPU just plays it back. Today's jacquard sim is this case.
Streaming buffer — testbench runs in a separate thread feeding a ring buffer of (tick, input_op) tuples. GPU consumes batches. As long as the producer keeps up on average, the GPU never stalls. Works because most ticks have no input change and peripheral state machines run far slower than the kernel.
Record-and-replay with divergence detection — pass 1 runs full cosim and records every input transition; pass 2 replays at line-rate while checksumming outputs against the recorded run. If outputs diverge, abort and fall back. Wins decisively for regression CI where most runs confirm "nothing externally observable changed".

Phase breakdown

Each phase is independently shippable. The phase numbering here is local to this doc and should not be confused with the timing-IR phase numbering in post-phase-0-roadmap.md.

Phase	Topic	Trigger
MC.1	Static island partitioner (analysis only, emits metadata)	Standalone-useful for CDC reporting; could land in a future release without further work
MC.2	Min-heap multi-clock scheduler (replaces LCM precompute)	First non-commensurable external clock or DFS use case lands
MC.3	Streaming stimulus buffer (decouples testbench thread from kernel)	First workload where cosim CPU↔GPU round-trip is measured as the bottleneck
MC.4	Per-island kernel dispatch + multi-rate batching	MC.1+MC.2 in place; first multi-domain workload large enough that whole-AIG eval per tick is wasteful
MC.5	Record-and-replay with divergence detection	Regression CI throughput becomes a release blocker
MC.6+	Speculation staircase, AOT trace compilation, profile-guided kernel specialization	Demand-driven; deferred until measurement shows residual sync overhead after MC.4

MC.3/MC.4 trigger is now measured (2026-06-07). Instrumenting the cosim loop's batch utilisation (see ADR 0017 amendment, Measured batch utilisation) shows GPU-peripheral designs run 100% batched, but jtag_minimal — CPU-side JTAG replay — emits 102,310 single-edge command buffers out of 106,117 (96% of all submits; 2.6% of edges). Those per-edge CPU↔GPU round-trips dominate its wall-clock: this is the MC.3 "round-trip measured as the bottleneck" trigger. The structural fix is MC.4 — the fast sys_clk island runs ahead/batched while only the slow model-driven tck boundary needs per-edge handover — which is why MC.4 depends on the MC.1 island partitioner. Both are orthogonal to the #105 backend-portability seam (which preserves today's batch model unchanged).

MC.1 — Static island partitioner

Walk the AIG; for each gate compute the set of clock domains its transitive fanin/fanout touches. Tag gates as island-internal (fanin and fanout both inside one domain) or boundary (touches more than one domain on either side). Emit per-island gate counts and a list of boundary gates as metadata on the existing FlattenedScript.

What it enables on its own, even with no runtime change:

Diagnostic: "this design has 14 inter-domain combinational paths from audio_clk → core_clk and 2 the other way". Useful for designers reviewing CDC structure.
Data structure that MC.2 / MC.4 / CDC reporting all need.
Sanity-check on the "<1 %" boundary-surface assumption for the workloads that motivate further phases.

Classification policy for derived signals (e.g. a sync-FIFO read pointer in clock_b qualified by an output of a sync chain from clock_a): classify aggressively. Only gates whose direct fanin includes pins from multiple domains are boundary; downstream gates fed by a domain-tagged pre-synchronizer output inherit that domain. This pushes the boundary in as close to the structural CDC crossing as possible and is what makes the "<1 %" claim hold on real designs — a lazy classification that propagated "multi-domain" forward through every downstream cone would yield a boundary surface that swallowed half the design.
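The aggressive policy reduces to a test on a gate's direct fanin tags. In this sketch, domains are plain integers and each fanin pin is assumed to carry a domain tag already; `GateClass` and `classify` are hypothetical names, not the planned aig.rs API.

```rust
/// Result of classifying one gate.
#[derive(Debug, PartialEq, Clone, Copy)]
enum GateClass {
    /// All direct fanin pins belong to this one domain.
    Island(u32),
    /// Direct fanin pins come from more than one domain.
    Boundary,
}

/// Classify a gate from the domain tags of its *direct* fanin pins only.
/// Returns None for a gate with no tagged fanin (for example a constant).
fn classify(direct_fanin_domains: &[u32]) -> Option<GateClass> {
    let (&first, rest) = direct_fanin_domains.split_first()?;
    if rest.iter().all(|&d| d == first) {
        Some(GateClass::Island(first))
    } else {
        Some(GateClass::Boundary)
    }
}
```

A gate fed only by a synchronizer output tagged with clock_b therefore stays inside the clock_b island, even though the synchronizer's own input originated in clock_a.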

Code locations: extends aig.rs (domain analysis on DriverType) and flatten.rs (metadata on FlattenedScriptV1). No kernel changes.

MC.2 — Min-heap multi-clock scheduler

Replace MultiClockScheduler's precomputed Vec<TickEdges> with a min-heap of (next-edge-time, domain) pairs. Pop the next edge, dispatch, push the domain's next edge back. No LCM constraint; non-commensurable periods are free. DFS support falls out: when the DUT writes a clock-control register, the host updates the heap entry's period.
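A minimal sketch of the pop, dispatch, push-back loop using the standard library's BinaryHeap. `EdgeScheduler` and its methods are hypothetical; the sketch tracks one edge per period per domain and leaves out posedge/negedge flags and dispatch.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Min-heap of (next edge time in ps, domain index) pairs.
struct EdgeScheduler {
    heap: BinaryHeap<Reverse<(u64, usize)>>,
    periods_ps: Vec<u64>,
}

impl EdgeScheduler {
    /// Each domain's first edge is scheduled one period after time zero.
    fn new(periods_ps: Vec<u64>) -> Self {
        let heap = periods_ps
            .iter()
            .enumerate()
            .map(|(domain, &period)| Reverse((period, domain)))
            .collect();
        EdgeScheduler { heap, periods_ps }
    }

    /// Pop the earliest edge and push that domain's following edge.
    fn next_edge(&mut self) -> Option<(u64, usize)> {
        let Reverse((time, domain)) = self.heap.pop()?;
        self.heap.push(Reverse((time + self.periods_ps[domain], domain)));
        Some((time, domain))
    }

    /// DFS hook: edges pushed after this call use the new period.
    /// The edge already queued for `domain` keeps its scheduled time.
    fn set_period(&mut self, domain: usize, period_ps: u64) {
        self.periods_ps[domain] = period_ps;
    }
}
```

`Reverse` turns the max-heap into a min-heap, and ties between domains firing at the same time resolve by domain index because the tuple compares lexicographically.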

DFS hook design: explicit, not generic signal-watching. The cosim config declares (control_signal, period_table) pairs; the host polls the named bit each tick (cheap — one bit) and updates the heap. Generic "call-back-on-arbitrary-signal" is rejected as too coupled.

Code locations: MultiClockScheduler::new and build_edge_ops in cosim_metal.rs. Same per-domain flag emission, different scheduling backend.

MC.3 — Streaming stimulus buffer

InputDispatcher becomes a trait; today's FileDispatcher is one implementation. New implementations:

ThreadedDispatcher — runs peripheral models on a separate thread; emits (tick, input_op) into a lock-free SPSC ring buffer; GPU loop consumes batches.
StreamDispatcher — same shape but the producer is a JSON-lines stream over a Unix socket / stdio (this is also the bridge to UVM/CocoTB peer testbenches).

Latency budget: the producer must be at least one tick ahead of the consumer. For transaction-level workloads this is easy (peripheral state machines run orders of magnitude slower than the GPU). For sub-cycle reactive loops it isn't, and those workloads stay on the synchronous path.
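The producer/consumer split can be prototyped with a bounded standard-library channel. Note the substitution: `std::sync::mpsc::sync_channel` stands in for the lock-free SPSC ring buffer named above, and `run_streamed`, `StimOp`, and the numeric op codes are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// One stimulus event: (tick, input_op). `input_op` is an opaque code here.
type StimOp = (u64, u32);

/// Produce `ops` on a separate thread and consume them in batches of at most `batch`.
fn run_streamed(ops: Vec<StimOp>, capacity: usize, batch: usize) -> Vec<Vec<StimOp>> {
    let (tx, rx) = sync_channel::<StimOp>(capacity);
    let producer = thread::spawn(move || {
        for op in ops {
            // A send error means the consumer has gone away; stop producing.
            if tx.send(op).is_err() {
                break;
            }
        }
    });

    let mut batches = Vec::new();
    // Block for the first op of a batch, then drain whatever is already queued.
    while let Ok(first) = rx.recv() {
        let mut current = vec![first];
        while current.len() < batch {
            match rx.try_recv() {
                Ok(op) => current.push(op),
                Err(_) => break,
            }
        }
        batches.push(current);
    }
    producer.join().expect("producer thread panicked");
    batches
}
```

Batch sizes depend on thread timing, but ordering is preserved, so the consumer sees the same op sequence as a synchronous loop would.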

Code locations: refactor input_stim.rs around a trait; new module for ring-buffer plumbing; cosim main loop drains a batch per dispatch instead of one tick.

MC.4 — Per-island kernel dispatch + multi-rate batching

Build per-island execution scripts (and one boundary script) from the metadata MC.1 produces. Cosim main loop becomes:

loop {
    let (next_t, domain) = scheduler.peek();
    let lookahead = scheduler.next_other_domain_edge(domain) - now;
    let edges_in_window = lookahead / domain.period;
    dispatch(island_script[domain], edges = edges_in_window);
    dispatch(boundary_script);  // only if boundary signals changed
    advance_clock(now + edges_in_window * domain.period);
}

Boundary mailbox lives in shared state-buffer slots that the source island's script writes and the destination island's script reads. Repcut continues to partition each island's script across GPU blocks independently.

Tight-boundary gates (combinationally fed by both domains) force a sync point on every edge of either side; MC.1's metadata identifies these so the runtime knows when batching can extend.
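The window arithmetic in the loop, together with the tight-boundary rule, can be isolated in a small function. The function name, the picosecond units, and the `tight_boundary` flag are assumptions; in the plan that information would come from MC.1's metadata.

```rust
/// How many edges of the running domain can be batched before the next edge
/// of any other domain. With a tight boundary, batching never extends past
/// a single edge because every edge forces a sync point.
fn edges_in_window(
    now_ps: u64,
    period_ps: u64,
    next_other_edge_ps: u64,
    tight_boundary: bool,
) -> u64 {
    assert!(period_ps > 0, "clock period must be non-zero");
    // Time available before another domain fires (zero if it is already due).
    let lookahead = next_other_edge_ps.saturating_sub(now_ps);
    let fit = lookahead / period_ps;
    if tight_boundary { fit.min(1) } else { fit }
}
```

For a 10 ns island with the next foreign edge 40 ns away, the window holds 4 edges, or 1 if the island feeds a tight-boundary gate.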

MC.5 — Record-and-replay with divergence detection

Add --record-stimulus to cosim that emits a complete tick-by-tick input VCD and a per-tick output checksum. Add --replay-stimulus to sim (or a new mode) that consumes the VCD, runs at line-rate, and verifies the checksum each batch.

Divergence handling is two-tier, not just abort:

Mismatch in watched signals (the existing cosim signals_of_interest set, or a --watch CLI argument) → abort and require re-recording. This is the genuine "the design's externally observable behaviour changed" case — the recording is now stale and replay is unsafe.
Mismatch in unwatched signals → warn-and-continue against the recorded transitions. Internal microarchitectural changes that don't move the observable surface are normal during development; aborting on them defeats the purpose of accelerating regression CI, where most runs exist to confirm "nothing externally observable changed".

The watchset is the user-visible policy lever — it specifies what "externally observable" means for this design. Default to the cosim output signals (the natural CI invariant) plus any user-declared checkpoint signals.
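The two-tier policy can be expressed as a function from mismatched signal names to a verdict. This is a simplified sketch: it assumes mismatches are already attributed to named signals (the plan describes per-batch checksums, so attribution would be an extra step), and `Verdict` and `judge` are hypothetical names.

```rust
use std::collections::HashSet;

/// Outcome of comparing one replay batch against the recording.
#[derive(Debug, PartialEq)]
enum Verdict {
    Match,
    /// Only unwatched signals differ: keep replaying, report them.
    WarnAndContinue(Vec<String>),
    /// At least one watched signal differs: the recording is stale.
    Abort(Vec<String>),
}

fn judge(mismatched: &[&str], watchset: &HashSet<&str>) -> Verdict {
    if mismatched.is_empty() {
        return Verdict::Match;
    }
    let watched: Vec<String> = mismatched
        .iter()
        .copied()
        .filter(|name| watchset.contains(name))
        .map(|name| name.to_string())
        .collect();
    if watched.is_empty() {
        Verdict::WarnAndContinue(mismatched.iter().map(|name| name.to_string()).collect())
    } else {
        Verdict::Abort(watched)
    }
}
```

The watchset is the only policy input, so changing what counts as externally observable does not touch the comparison logic.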

Useful primarily as a regression-CI accelerator. Doesn't help one-off runs.

Cross-test sharing. A single design accumulates many test cases. The natural extension of record-and-replay is to share the design-side specialized kernel across all tests in the suite and vary only the stimulus recording. For a suite of N tests against one design, recording costs N× pass-1 (one per test, on demand or in parallel) but replay costs N× line-rate-kernel-launches sharing the same compiled state-buffer layout. That's a multiplicative win on top of per-test record-and-replay and is the actual leverage point for full-suite CI throughput.

MC.6+ — Deferred sophistication

Documented now so the design space is on record:

Speculation staircase for hot boundaries: value prediction → protocol pattern recognition → control-slice reachable-set enumeration → full case enumeration. Each tier larger and cheaper-to-skip. Add a "case" dimension to the kernel dispatch only if measured sync overhead after MC.4 justifies it.
AOT trace compilation: when stimulus is fully known (replay mode), compile the schedule offline — fold constant inputs into AIG constants, merge no-op ticks, sort transitions by domain. Profile-guided specialization for designs with lots of "configured once at boot" inputs. Composes directly with MC.5: a recording is a complete stimulus trace, so the AOT compiler can fold every input value into the kernel unconditionally. The resulting binary is valid only until either the design or the recording changes, so the lifecycle model is "compile per (design SHA, recording SHA) pair, cache for the test session, invalidate on either source changing". Acceptable cost for a 100×-replay regression run; not for one-off interactive sim.
CDC verification mode: jitter injection on coincident edges and random X-injection on detected async-source paths. Reuses MC.1's boundary metadata and existing X-prop infrastructure. Distinct from static CDC checking (Spyglass, Real Intent), which is explicitly out of scope — that's a different product. The jitter-injection half is designed in ADR 0012 and partly built; remaining work is tracked in issue #92 / cdc-jitter-completion.md. X-injection stays deferred until MC.1 lands.

Out of scope (explicit non-goals)

These come up adjacent and are worth being clear about:

Pin-level VPI / GPI fidelity. Implementing enough VPI for unmodified cocotb / SystemVerilog testbenches. The surface area is enormous and Jacquard would be lying about delta cycles, NBA regions, #delay semantics, and X-propagation behaviour. Use transaction-level peer protocols (the natural extension of input.json over a socket) instead.
Metastability simulation. No RTL simulator does this; CDC verification is structural/formal (Spyglass, JasperGold-CDC, Real Intent) and a separate product.
Structural CDC checking (synchronizer recognition rules, gray-code analysis). Different product. MC.1's boundary metadata enables a light diagnostic but not a verification flow.
DUT-internal #delay. Requires an event-driven kernel; destroys the batched dataflow that gives Jacquard its speedup. Permanently unsupported.
Async resets / latches in DUT. Same reason. Permanently unsupported (already documented in CLAUDE.md).

Implementation triggers

When to revisit and pull which phase off the shelf:

Trigger	Pulls
First user workload with non-commensurable external clocks (audio, USB, DFS)	MC.2
First UVM/cocotb integration request reaches engineering scoping	MC.3
User-visible CDC reporting requested	MC.1
Multi-domain workload measurably bottlenecked on whole-AIG-per-tick eval	MC.1 + MC.4
Regression CI total time exceeds release tolerance	MC.5
Post-MC.4 measurement shows boundary-sync overhead >10 %	MC.6 speculation tier 1 (value prediction)

Why MC.1 and MC.2 may be worth doing standalone

The user observation in the originating discussion was that MC.1 and MC.2 are worth carrying in a future release on their own merits, ahead of any specific workload demand. Rationale:

MC.1 has standalone diagnostic value. A "boundary report" for any multi-clock design — count of cross-domain combinational paths, location of inter-domain DFF samples — is useful to any user reviewing CDC structure, independent of whether the runtime ever uses the partition.
MC.2 lifts a real correctness limit. The current LCM cap silently fails on legitimate designs (any audio-clock SoC, anything with DFS). Replacing precompute with a min-heap is a small, contained change that removes a category of "your design doesn't fit" errors.
Both are foundational for the rest of the architecture. Doing them early means later phases pick up cleanly.

If MC.1 + MC.2 ship in isolation, they don't commit Jacquard to any of the later phases. Each later phase remains demand-driven.
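To make the MC.2 idea concrete, here is a minimal sketch of on-demand edge scheduling with a min-heap, which needs no LCM-sized precomputed table. `EdgeHeap` is a hypothetical name, it models one edge kind per clock, and it assumes each clock's first edge falls one period after time zero; the real scheduler is not shown in this document.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical on-demand edge scheduler: yields (time_ps, clock_index)
/// in time order without precomputing a full LCM-length schedule.
struct EdgeHeap {
    periods_ps: Vec<u64>,
    // `Reverse` turns the std max-heap into a min-heap on (time, clock).
    heap: BinaryHeap<Reverse<(u64, usize)>>,
}

impl EdgeHeap {
    /// Assumption: every clock's first edge is one period after t = 0.
    fn new(periods_ps: Vec<u64>) -> Self {
        let heap: BinaryHeap<Reverse<(u64, usize)>> = periods_ps
            .iter()
            .enumerate()
            .map(|(i, &p)| Reverse((p, i)))
            .collect();
        Self { periods_ps, heap }
    }

    /// Pop the earliest edge and schedule that clock's following edge.
    fn next_edge(&mut self) -> Option<(u64, usize)> {
        let Reverse((t, i)) = self.heap.pop()?;
        self.heap.push(Reverse((t + self.periods_ps[i], i)));
        Some((t, i))
    }
}
```

Memory stays proportional to the number of clocks regardless of how the periods relate, which is the property that removes the "your design doesn't fit" failure mode. Coincident edges come out in clock-index order here; that tie-break is a choice of this sketch.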

References

Current multi-clock infrastructure: src/sim/cosim_metal.rs:855 and following (ClockDomainFlags, MultiClockScheduler).
Per-DFF clock-domain tagging: src/aig.rs:204 (clock_pin2aigpins).
Cosim stimulus protocol: src/sim/input_stim.rs, src/sim/models/mod.rs.
Existing precomputed-edge path (replay): jacquard sim and src/sim/vcd_io.rs.
Adjacent committed roadmap: docs/plans/post-phase-0-roadmap.md.
Synchronous-only constraint and rationale: CLAUDE.md "Key limitation".

Declarative cell metadata — Tier 1 + minimal Tier 2 + port mapping

Status: Implemented — historical record. Tier 1, minimal Tier 2, and the port-mapping schema have all landed. ADRs:

0010: Declarative cell metadata for PDK enablement (Tier 1 + minimal Tier 2)
0011: RAM port-mapping schema (the port-mapping extension originally deferred by ADR 0010)

Issues: #67, #80. Driving designs: the wafer.space chip_top.pnl.v blocked on gf180mcu_ocd_ip_sram__sram1024x8m8wm1, then the JTAG-DM workflow in PR #78 surfacing the need for real RAM backing storage.

Scope (as shipped)

Originally scoped to one slice (Tier 1 + minimal Tier 2 — opaque kind = "ram" with no port resolution). Expanded mid-flight when the JTAG-DM workflow (PR #78) surfaced the need for explicit-port RAMs with real backing storage:

Tier 1: --cell-library + sverilogparse-backed pin tables (landed 2026-05-19 in PR #65/#68).
Tier 2 minimal: kind discriminator in TOML, opaque-RAM mode (landed alongside Tier 1).
Port-mapping schema (ADR 0011, v1.1): [cells.NAME.ram] sub-table for explicit-port RAMs with backing storage. Landed in this PR alongside SramInitConfig ELF preload (closes #80).

Deliverables

--cell-library <PATH> CLI flag on jacquard sim, jacquard cosim. Repeatable. Each path is parsed via sverilogparse at startup; results merged into a runtime LeafPinProvider extension.
<PATH>.cells.toml autoload + --cell-manifest <PATH> override. TOML schema as in ADR 0010 § Tier 2. Required field schema_version = "1.0". Per-cell kind discriminator, v1.0 vocabulary.
New code path in aig.rs: after PdkVariant::classify falls through (no built-in match), consult manifest. For kind = "ram", allocate RAMBlock in opaque mode — outputs routed to X-source slots, no port resolution.
Tests: TOML parsing unit tests; integration test exercising a synthetic kind = "ram" cell through AIG construction + sim (mini fixture, not the full tapeout design).
Doc update — docs/adding-a-pdk.md: new section "Adding third-party IP via manifest", linked from existing per-PDK recipes.

Out of scope (deferred)

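Port-mapping schema ([cells.NAME.ports]). Deferred to a future ADR in the original scope; it has since landed as ADR 0011 with a [cells.NAME.ram] sub-table (see Scope above).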
Other kind values beyond what the tapeout fixture exercises end-to-end (ram, plus filler if cheap parity demo). Adding other kinds is data-only and can land per-need.
Migration of built-in sky130.rs / gf180mcu.rs classifiers to manifest data. Stays in this codebase as the fallback.
build.rs pin-table scanner removal. Stays.

Phasing

Phase	Output
P1	`--cell-library` parsing + `LeafPinProvider` extension + tests. No AIG-construction changes yet — verify pin tables alone.
P2	Manifest TOML parser + `CellManifest` struct + `schema_version` validation. Standalone unit tests.
P3	`aig.rs` integration — manifest threaded through, new fallback path for `kind = "ram"` opaque mode. Add the `compute_x_sources`-style test exercising the new path.
P4	Smoke test against a representative reduced fixture; confirm `jacquard sim` clears `gf180mcu_ocd_ip_sram_*`. The full downstream-tapeout netlist is the real-world target but not in-tree.
P5	Doc update (`adding-a-pdk.md`); update `gf180mcu-enablement.md` § Follow-on cleanup to mark items 1/2/3 superseded by this work.

Each phase is its own commit. No squashing until the spike feedback loop confirms shape.

Open questions to settle in code

Autoload path discovery: spec says foo.v → foo.cells.toml sibling. Does that handle the multi-file library case (a.v + b.v sharing one manifest)? Probably yes — autoload each sibling, merge into the single CellManifest. Explicit --cell-manifest flag still wins for users who want a single consolidated file.
Conflict policy: if a cell name appears both in a built-in classifier AND in a manifest, built-in wins (per ADR 0010 integration ordering). Warn on conflict to surface accidental collisions.
Empty-library noise: parsing a .v file containing only (* blackbox *) modules with no logic should succeed without warnings, since that's the expected shape for IP libraries.
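The first two open questions can be sketched together. This is a hedged illustration, not the shipped code: `sibling_manifest`, `resolve_cell`, and `Resolved` are hypothetical names, the built-in classifier is reduced to a list of names, and the manifest is reduced to a name-to-kind map.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Sibling rule: `foo.v` maps to `foo.cells.toml` in the same directory.
fn sibling_manifest(library: &Path) -> Option<PathBuf> {
    let stem = library.file_stem()?.to_str()?;
    Some(library.with_file_name(format!("{}.cells.toml", stem)))
}

/// Where a cell's classification came from (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Resolved {
    BuiltIn,
    /// Carries the manifest `kind` string, e.g. "ram".
    Manifest(String),
    Unknown,
}

/// Conflict policy: the built-in classifier wins, and a collision with a
/// manifest entry is surfaced as a warning rather than an error.
fn resolve_cell(
    name: &str,
    builtin: &[&str],
    manifest: &HashMap<String, String>,
    warnings: &mut Vec<String>,
) -> Resolved {
    let in_manifest = manifest.get(name);
    if builtin.contains(&name) {
        if in_manifest.is_some() {
            warnings.push(format!(
                "cell `{}` is in both the built-in classifier and a manifest; built-in wins",
                name
            ));
        }
        return Resolved::BuiltIn;
    }
    match in_manifest {
        Some(kind) => Resolved::Manifest(kind.clone()),
        None => Resolved::Unknown,
    }
}
```

For the multi-file case, calling `sibling_manifest` once per `--cell-library` path and merging whatever files exist would give the "autoload each sibling" behaviour described above.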

Not promised

Memory contents simulation for kind = "ram" in v1.0. Documented in ADR 0010 § "kind = ram semantics in v1.0".
Stable opaque-RAM port routing beyond "outputs are X-source slots". The set of outputs is what sverilogparse reports; if a cell's port list changes, the routing follows.

Cosim Peripheral Models

Architecture: ADR 0013.

This plan tracks implementation work for the cosim peripheral model framework. ADR 0013 documents the architecture (two execution domains, observe-only vs bidirectional GPU patterns, ring buffers, plural config convention); this doc tracks the concrete workstreams.

Phase 1: Multi-UART (#90)

First peripheral using the plural-config + array-in-kernel conventions from ADR 0013.

Schema — `src/testbench.rs`

Add name: Option<String> to UartConfig. Add uarts: Vec<UartConfig> to TestbenchConfig. Add effective_uarts() mirroring effective_clocks():

pub fn effective_uarts(&self) -> Vec<UartConfig> {
    let mut out = self.uarts.clone();
    if let Some(ref u) = self.uart {
        out.insert(0, u.clone());
    }
    out
}

Existing "uart": {...} configs work unchanged. New form: "uarts": [{"name": "console", ...}, {"name": "debug", ...}]. Both may coexist; uart is prepended to uarts.
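A self-contained version of the coexistence rule, for illustration. The struct shapes are trimmed stand-ins: `baud` is a hypothetical field representing the real UART settings, and only the `effective_uarts()` body is taken from the plan above.

```rust
#[derive(Debug, Clone, PartialEq)]
struct UartConfig {
    name: Option<String>,
    baud: u32, // hypothetical field, stands in for the real UART settings
}

#[derive(Debug, Default)]
struct TestbenchConfig {
    /// Legacy singular form: "uart": {...}
    uart: Option<UartConfig>,
    /// New plural form: "uarts": [...]
    uarts: Vec<UartConfig>,
}

impl TestbenchConfig {
    /// Plural entries, with the legacy singular entry (if any) prepended.
    fn effective_uarts(&self) -> Vec<UartConfig> {
        let mut out = self.uarts.clone();
        if let Some(ref u) = self.uart {
            out.insert(0, u.clone());
        }
        out
    }
}
```

A config carrying a legacy `uart` plus two named `uarts` entries resolves to three channels, with the legacy entry at index 0, so existing single-UART configs keep their channel position.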

Metal kernel — `csrc/kernel_v1.metal`

MAX_UARTS = 4. Restructure the three UART types:

#define MAX_UARTS 4

struct UartPerChannelConfig {
    u32 tx_out_pos;
    u32 cycles_per_bit;
};

struct UartParams {
    u32 state_size;
    u32 n_uarts;          // replaces has_uart
    u32 _pad[2];
    UartPerChannelConfig channels[MAX_UARTS];
};

UartDecoderState and UartChannel structs unchanged — the device buffers hold [MAX_UARTS] elements. gpu_io_step buffer signature unchanged (same 6 slots); the UART decode block becomes a loop over n_uarts.

Rust runtime — `src/sim/cosim_metal.rs`

Repr structs (~line 130): update UartParams to match kernel. Add UartPerChannelConfig. Keep UartDecoderState and UartChannel unchanged.
Config resolution (~line 2229): iterate effective_uarts().
Buffer allocation (~line 2820): size buffers for MAX_UARTS elements. Init each UartDecoderState with last_tx=1.
RX driver creation (~line 2544): one UartRxDriver per entry, named uart_{name} (fallback uart_{index}).
CPU drain (~line 3990): iterate N channels with per-channel uart_read_head[i]. Label events with UART name.
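As a sketch of the host-side mirror, the kernel structs above translate to `#[repr(C)]` Rust types whose size can be checked at compile time. The field names follow the Metal definitions in this plan; the `rx_driver_name` helper is a hypothetical illustration of the `uart_{name}` / `uart_{index}` naming rule, not the actual function in cosim_metal.rs.

```rust
const MAX_UARTS: usize = 4;

/// Host mirror of the kernel's per-channel config (two u32 fields, 8 bytes).
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
struct UartPerChannelConfig {
    tx_out_pos: u32,
    cycles_per_bit: u32,
}

/// Host mirror of the kernel's UartParams.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
struct UartParams {
    state_size: u32,
    n_uarts: u32, // replaces has_uart
    _pad: [u32; 2],
    channels: [UartPerChannelConfig; MAX_UARTS],
}

// Layout guard: 16 bytes of header plus 8 bytes per channel.
const _: () = assert!(std::mem::size_of::<UartParams>() == 16 + 8 * MAX_UARTS);

/// Hypothetical naming helper: `uart_{name}`, falling back to `uart_{index}`.
fn rx_driver_name(name: Option<&str>, index: usize) -> String {
    match name {
        Some(n) => format!("uart_{}", n),
        None => format!("uart_{}", index),
    }
}
```

A compile-time size assertion like this catches the most common drift between the two definitions, such as a field added on one side only.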

Verification

cargo build --release --features metal compiles.
cargo test --lib passes (add effective_uarts unit tests).
Existing MCU SoC cosim CI passes unchanged (single "uart" config).
Local smoke: temporarily edit tests/jtag_minimal/sim_config.json to use "uarts": [...] syntax, confirm identical results.

Not in scope

Dual-UART test fixture: separate follow-up with a small 2-TX design.
CUDA/HIP: cosim is Metal-only; no kernel changes needed.

Future phases

Phase	Scope	Status
2	Refactor `gpu_io_step` toward common params/ring-buffer layout	Future
3	Multi-Flash / external RAM (bidirectional pattern)	Deferred (no use case)
—	Multi-JTAG	Not needed (TAP daisy-chain suffices)

Plan: Config-driven AHB/APB bus transaction tracing

Goal

Trace AHB5, AHB-Lite, and APB3 bus transactions in cosim, compactly, without baking signal names into source. Output as CSV (machine-readable transaction table) and annotated VCD (transactions as a signal group for waveform viewers). Decode site: GPU capture + CPU protocol FSM (the kernel stays dumb; protocol semantics live in testable Rust).

Order: APB3 first, then AHB-Lite, then AHB5. (APB3 was originally to be validated against the Hazard3 JTAG-DM APB DMI in tests/jtag_minimal/; Phase 1 shipped against tests/apb_trace/ instead, see Status.)

Why this shape

The existing "Wishbone bus trace" (build_wb_trace_params, cosim_metal.rs:1277; gpu_io_step, kernel_v1.metal:1182) proves the mechanism — a GPU observe-only peripheral that packs a compact per-tick entry into a ring buffer only when the bus is active/changed, drained by the CPU — but it is hardcoded to one VexRiscv-style SoC (literal names cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …). We generalize that mechanism into a config-driven, protocol-aware monitor. It is observe-only (we watch design outputs, never drive), so it fits the ADR-0013 GPU observe-only peripheral pattern, and gets the effective_*()-style plural config for free.

Two existing pieces are reused:

Multi-candidate name resolution in src/sim/trace_signals.rs — handles Yosys-flattened / scalar-expanded / structural hierarchical naming. Refactor the candidate generator into a shared helper so the bus tracer binds pins the same way --trace-signals does.
Extra-observables VCD path (emit_extra_observables, vcd_io.rs:635) — the model for emitting synthesized signals into the output VCD.

The hardcoded WbTrace is left intact for now (it has a passing test); migrating it onto the general mechanism is a clean follow-up, not a prerequisite.

Design

1. Config schema — `src/testbench.rs`

#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum BusProtocol { Apb3, AhbLite, Ahb5 }

#[derive(Debug, Clone, Deserialize)]
pub struct BusTraceConfig {
    pub name: String,
    pub protocol: BusProtocol,
    /// Hierarchical prefix; standard protocol pin names are appended.
    pub prefix: String,
    #[serde(default = "default_addr_bits")] pub addr_bits: usize, // 32
    #[serde(default = "default_data_bits")] pub data_bits: usize, // 32
    /// Optional per-pin overrides: logical pin name -> explicit net name,
    /// for designs whose pins don't follow `{prefix}{PIN}`.
    #[serde(default)] pub signals: HashMap<String, String>,
}

Add to TestbenchConfig:

#[serde(default)] pub bus_traces: Vec<BusTraceConfig>,

New feature, so no singular legacy form. (effective_bus_traces() provided for symmetry with effective_uarts(), even though it just returns the Vec.)

2. Protocol pin maps + CPU decoder — new `src/sim/models/bus_trace.rs`

Logical-pin tables per protocol:

APB3: psel penable pwrite pready pslverr paddr[] pwdata[] prdata[]
AHB-Lite: htrans[1:0] haddr[] hwrite hsize[2:0] hburst[2:0] hready hresp hwdata[] hrdata[]
AHB5: AHB-Lite + optional hnonsec hexcl hexokay hmaster[] (resolved if present, ignored if absent)

Default net name {prefix}{pin} (lowercased), overridable via signals. Resolution via the shared multi-candidate resolver (item 4).
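The default-plus-override naming rule can be sketched as follows. `net_name` is a hypothetical helper; it assumes only the pin part is lowercased and that override keys match the logical pin name exactly. The result would still go through the shared multi-candidate resolver, which is not modelled here.

```rust
use std::collections::HashMap;

/// Net name for one logical pin: an explicit override if present,
/// otherwise `{prefix}{pin}` with the pin lowercased (assumption).
fn net_name(prefix: &str, pin: &str, overrides: &HashMap<String, String>) -> String {
    match overrides.get(pin) {
        Some(explicit) => explicit.clone(),
        None => format!("{}{}", prefix, pin.to_lowercase()),
    }
}
```

For example, with prefix `soc.apb_` the logical pin `PSEL` resolves to `soc.apb_psel`, while a `signals` entry for `paddr` replaces the derived name entirely.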

BusTraceDecoder (per bus) consumes raw captured beats and emits:

pub struct BusTransaction {
    pub tick: u64, pub bus: String, pub protocol: BusProtocol,
    pub dir: Dir,            // Read | Write
    pub addr: u64, pub data: u64,
    pub resp: BusResp,       // Ok | Error
    pub burst: Option<BurstInfo>, // beat index / length for AHB
}

APB3 FSM: GPU gates capture on psel & penable & pready (access-phase complete), so each captured beat is a complete transaction. dir = pwrite, data = pwrite ? pwdata : prdata, resp = pslverr.
AHB FSM: GPU gates capture on hready high (pipeline advance) and records htrans, haddr, hwrite, hsize, hburst, hwdata, hrdata, hresp. CPU keeps a 1-deep pending address-phase record and pairs address beat N with the data on beat N+1; tracks burst beat counter from hburst/htrans==SEQ.

Pure-Rust, unit-tested with synthetic beat sequences — no GPU required. This is the testability win of CPU-side decode.
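The AHB pairing rule (address phase on beat N, data on beat N+1) is the less obvious of the two FSMs, so here is a minimal sketch of it. The types are simplified stand-ins for the ones named above: `AhbBeat`, `AhbTransfer`, and `AhbDecoder` are hypothetical, only a subset of the captured fields is modelled, and `hresp`, `hsize`, and burst length are left out. The `HTRANS` encodings (NONSEQ = 0b10, SEQ = 0b11) are the standard AMBA values.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dir { Read, Write }

/// One captured beat; the GPU records these only when `hready` is high.
#[derive(Debug, Clone, Copy)]
struct AhbBeat { tick: u64, htrans: u8, haddr: u32, hwrite: bool, hwdata: u32, hrdata: u32 }

/// One completed transfer, emitted when its data phase is seen.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct AhbTransfer { tick: u64, dir: Dir, addr: u32, data: u32, beat: u32 }

const HTRANS_NONSEQ: u8 = 0b10;
const HTRANS_SEQ: u8 = 0b11;

#[derive(Default)]
struct AhbDecoder {
    /// 1-deep pending address phase: (addr, is_write, beat index in burst).
    pending: Option<(u32, bool, u32)>,
    next_beat: u32,
}

impl AhbDecoder {
    fn push(&mut self, b: AhbBeat) -> Option<AhbTransfer> {
        // Data phase: this beat carries the data for the pending address phase.
        let done = self.pending.take().map(|(addr, is_write, beat)| AhbTransfer {
            tick: b.tick,
            dir: if is_write { Dir::Write } else { Dir::Read },
            addr,
            data: if is_write { b.hwdata } else { b.hrdata },
            beat,
        });
        // Address phase: the same beat may also open the next transfer.
        match b.htrans {
            HTRANS_NONSEQ => {
                self.pending = Some((b.haddr, b.hwrite, 0)); // new burst starts at beat 0
                self.next_beat = 1;
            }
            HTRANS_SEQ => {
                self.pending = Some((b.haddr, b.hwrite, self.next_beat));
                self.next_beat += 1;
            }
            _ => {} // IDLE / BUSY open no transfer
        }
        done
    }
}
```

Because the decoder is a plain struct fed with values, a synthetic beat vector is enough to exercise it, which is the testability argument made above.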

3. GPU capture — `csrc/kernel_v1.metal` + `src/sim/cosim_metal.rs`

Generalize the WbTrace structs into protocol-agnostic capture:

#define MAX_BUS_TRACES 4
#define BUS_TRACE_MAX_ADR_BITS 32
#define BUS_TRACE_MAX_DAT_BITS 32

struct BusTraceParams {           // one per configured bus
    u32 protocol;                 // 0=apb3 1=ahb-lite 2=ahb5
    u32 gate_a_pos, gate_b_pos, gate_c_pos;   // edge-gating bits (psel/penable/pready or hready/htrans)
    u32 dir_pos, resp_pos;
    u32 addr_pos[BUS_TRACE_MAX_ADR_BITS];
    u32 wdata_pos[BUS_TRACE_MAX_DAT_BITS];
    u32 rdata_pos[BUS_TRACE_MAX_DAT_BITS];
    u32 ctrl_pos[8];              // htrans, hsize, hburst, hnonsec, ...
    u32 addr_bits, data_bits;
};
struct BusTraceEntry { u32 tick, flags, ctrl; u32 addr, wdata, rdata; };
struct BusTraceChannel { u32 write_head, capacity, current_tick, n_buses; /* entries follow */ };

The kernel computes the per-protocol gate, and on a gating edge packs one BusTraceEntry (bus id in flags high bits). No FSM, no pairing on GPU.

gpu_io_step currently uses buffer slots 0–5 (UART + WbTrace). Add slots 6–7 for BusTraceParams[] + BusTraceChannel. Metal allows up to 31 buffer bindings per kernel function, well above the 8 needed here, so extend the existing dispatch rather than adding a kernel.

Rust mirrors of the structs in cosim_metal.rs (next to WbTraceParams), build_bus_trace_params() resolving pins for each configured bus, buffer allocation sized MAX_BUS_TRACES, and a per-bus read head in the drain loop (near cosim_metal.rs:4057) feeding each BusTraceDecoder.
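A rough sketch of the CPU-side drain for a shared ring, under several assumptions that the plan does not fix: the bus id sits in the top 8 bits of `flags` (`BUS_ID_SHIFT` is hypothetical), the heads are free-running counters reduced modulo a power-of-two capacity, and overflow detection is omitted. It uses a single read head and demultiplexes by bus id, which is one way to realise the "start shared" option.

```rust
/// Host mirror of the kernel's BusTraceEntry.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct BusTraceEntry { tick: u32, flags: u32, ctrl: u32, addr: u32, wdata: u32, rdata: u32 }

/// Assumption: the bus id occupies the top 8 bits of `flags`.
const BUS_ID_SHIFT: u32 = 24;

fn bus_id(e: &BusTraceEntry) -> usize {
    (e.flags >> BUS_ID_SHIFT) as usize
}

/// Drain entries in [read_head, write_head) from the ring and route each
/// one to its bus's queue. Entries with an out-of-range bus id are dropped.
fn drain(
    ring: &[BusTraceEntry],
    read_head: &mut u32,
    write_head: u32,
    per_bus: &mut [Vec<BusTraceEntry>],
) {
    let capacity = ring.len() as u32;
    if capacity == 0 {
        return; // nothing to read from an empty ring
    }
    while *read_head != write_head {
        let e = ring[(*read_head % capacity) as usize];
        if let Some(queue) = per_bus.get_mut(bus_id(&e)) {
            queue.push(e);
        }
        *read_head = read_head.wrapping_add(1);
    }
}
```

Each per-bus queue would then feed that bus's `BusTraceDecoder`; keeping the routing separate from decoding means the decoder never needs to know how the ring is shared.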

4. Shared signal resolver — refactor `src/sim/trace_signals.rs`

Extract the multi-candidate name → AIG-pin / state-position resolver (currently internal to trace-signal registration) into a reusable helper callable from build_bus_trace_params. Keeps one source of truth for the Yosys/scalar/structural naming conventions.

5. Output

CSV (--bus-trace-csv <PATH>): drain-time, one row per BusTransaction. Header: tick,bus,protocol,dir,addr,data,resp,burst. Trivial — lands in Phase 1.
Annotated VCD: synthesized per-bus VCD vars ({bus}_addr, {bus}_wdata/{bus}_rdata, {bus}_dir, {bus}_resp) that value-change at transaction-complete ticks. This needs a new "virtual signal" emission path in vcd_io.rs: unlike existing extra-observables (raw nets sampled per tick from the state buffer), these are sparse CPU-decoded events the VCD writer must interleave by tick. Bigger plumbing → Phase 3. Dovetails with the wire-bundle-scripting / Surfer direction in project memory.
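For the CSV side, a minimal row formatter might look like the sketch below. Only the header string comes from the plan; the cell formats (hex addresses, `R`/`W`, `ok`/`error`, `beat/len` for bursts, empty when absent) are assumptions, and `protocol` is simplified to a string.

```rust
#[derive(Debug, Clone, Copy)]
enum Dir { Read, Write }

#[derive(Debug, Clone, Copy)]
enum BusResp { Ok, Error }

/// Simplified stand-in for the plan's BusTransaction.
struct BusTransaction {
    tick: u64,
    bus: String,
    protocol: String, // simplified: the plan uses a BusProtocol enum
    dir: Dir,
    addr: u64,
    data: u64,
    resp: BusResp,
    burst: Option<(u32, u32)>, // (beat index, length); hypothetical shape
}

const CSV_HEADER: &str = "tick,bus,protocol,dir,addr,data,resp,burst";

/// Format one transaction as a CSV row matching CSV_HEADER's column order.
fn csv_row(t: &BusTransaction) -> String {
    let dir = match t.dir { Dir::Read => "R", Dir::Write => "W" };
    let resp = match t.resp { BusResp::Ok => "ok", BusResp::Error => "error" };
    let burst = match t.burst {
        Some((beat, len)) => format!("{}/{}", beat, len),
        None => String::new(), // APB3 rows leave the burst column empty
    };
    format!(
        "{},{},{},{},0x{:08x},0x{:08x},{},{}",
        t.tick, t.bus, t.protocol, dir, t.addr, t.data, resp, burst
    )
}
```

Bus names containing commas would need quoting; the sketch assumes they do not occur.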

6. CLI — `src/bin/jacquard.rs`

--bus-trace-csv <PATH> (Phase 1)
bus VCD annotation folded into the output/--output-vcd when bus_traces is configured, or a dedicated --bus-trace-vcd flag (Phase 3)

Status

Phase 1 is complete (APB3 end-to-end + CSV). Validated by tests/apb_trace/ — a dedicated synthesized APB3 design (the Hazard3 JTAG-DM post-PnR netlist drops the APB addr/data nets during flattening, so a names-preserved design was built instead). CI step: Run APB3 bus-trace cosim (ADR 0013). Phases 2–3 remain.

Phasing

Phase 1 — APB3 end-to-end. ✅ Done. Config schema, pin maps, shared resolver, APB3 GPU capture, APB3 CPU decoder, CSV output. Validated on tests/apb_trace/ (synthesized APB3 design). APB3 FSM unit-tested.
Phase 2 — AHB-Lite + AHB5. Pipeline pairing, burst tracking, AHB5 extra signals. Unit-test the AHB FSM. Needs an AHB design to integration-test against (open question — see below).
Phase 3 — Annotated VCD. Virtual-signal emission path in vcd_io.rs.
Follow-up — migrate WbTrace onto the general mechanism (express the VexRiscv ibus/dbus as configured buses), then delete the hardcoded path.

Verification

Unit: APB3 & AHB FSM decoders against synthetic beat vectors (pure Rust, no GPU).
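Integration (Phase 1): cosim the synthesized APB3 design in tests/apb_trace/ with --bus-trace-csv and assert the expected APB3 transactions appear. (The original target, the Hazard3 JTAG-DM DMI register accesses DMCONTROL/DMSTATUS, was dropped because its post-PnR netlist loses the APB addr/data nets during flattening; see Status.)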
Build: cargo build --release --features metal clean; existing cosim tests (single-UART, WbTrace) unaffected since bus_traces defaults empty.

Open questions

AHB integration test design. APB3 validated on the dedicated tests/apb_trace/ design rather than the Hazard3 JTAG-DM (see Status). Phase 2 needs an AHB-Lite/AHB5 design: do we have one, or do we synthesize a small AHB peripheral (like tests/dual_uart/)?
Per-bus ring vs shared ring. One BusTraceChannel with a bus-id field (simpler allocation) vs one ring per bus (no cross-bus contention). Start shared; revisit if a hot multi-bus design overflows.
CUDA/HIP. Cosim is Metal-only today; no kernel changes needed elsewhere now, but the general design should port cleanly when CUDA cosim lands.

ADR impact

This generalizes the cosim peripheral architecture — update ADR-0013 (plural-peripheral configs) to record the config-driven bus-monitor pattern and the GPU-capture/CPU-decode split, once Phase 1 is real.

Plan: complete ADR 0012 CDC jitter injection

Tracks the deferred half of ADR 0012. Issue: #92.

Where it stands

Implemented: the run-parameters file + per-domain seeded PRNG (src/sim/run_params.rs), jitter_ps per ClockConfig, the uniform per-domain draw, and a jitter displacement applied to the timing-VCD event timestamp (cosim_metal.rs, inside the --output-vcd block only). So today jitter perturbs the waveform timeline but nothing else — it does not reach the setup/hold checker, model-driven clocks, or coincident-edge ordering.

The goal of this plan is to make jitter actually stress CDC paths, then extend it to model-driven clocks and tidy the loose ends, so ADR 0012's present-tense design fully matches the code.

Phase 1 — Jitter reaches the timing checker (the core value)

Right now jitter_displacement only adjusts the VCD base_timestamp (cosim_metal.rs:~3928-3948) and is computed inside the timing-VCD emission block, so it has no effect without --output-vcd and never influences violations.

Hoist the per-tick per-domain displacement draw out of the VCD block so it is available whenever jitter_active, independent of --output-vcd.
Apply each domain's displacement to the arrival offsets that setup/hold checking consumes (the arrival_state section), not just the VCD base timestamp — so a jittered edge can move a margin across the setup/hold boundary and surface in --timing-report.
True per-domain perturbation (ADR §4): keep a displacement per firing domain this tick rather than the current single global value (the loop overwrites jitter_displacement with the last domain's draw). Coincident edges from domains A and B then move independently, exercising both orderings over a seed sweep.

Verify: a small two-domain design with a deliberately marginal CDC path; assert that a seed sweep produces both "no violation" and "violation" outcomes, and that a fixed seed reproduces exactly.
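The per-domain draw in Phase 1 can be illustrated with a dependency-free sketch. Everything here is assumed for illustration: SplitMix64 is swapped in as a small deterministic generator (the document does not say which PRNG run_params.rs uses), the draw is taken as uniform over [-jitter_ps, +jitter_ps], and `draw_displacements` is a hypothetical name.

```rust
/// SplitMix64, a small deterministic generator used only to keep this
/// sketch self-contained. Not necessarily the generator the project uses.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// One displacement per firing domain this tick, instead of a single
/// global value that the last domain's draw overwrites.
///
/// Assumption: each draw is uniform over [-jitter_ps[d], +jitter_ps[d]]
/// (modulo bias is ignored for the purposes of this sketch).
fn draw_displacements(
    rngs: &mut [SplitMix64],
    jitter_ps: &[u32],
    firing: &[usize],
) -> Vec<(usize, i64)> {
    firing
        .iter()
        .map(|&d| {
            let span = 2 * jitter_ps[d] as u64 + 1;
            let draw = (rngs[d].next_u64() % span) as i64 - jitter_ps[d] as i64;
            (d, draw)
        })
        .collect()
}
```

Because each domain owns its generator, two coincident edges move independently, and re-running with the same seeds reproduces the same displacements, which is the property the fixed-seed check in the verification step relies on.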

Phase 2 — Model-driven clock jitter (ADR §3)

Model-driven clocks (JtagReplayModel, SPI SCK, …) bypass the scheduler and currently get no jitter.

Add --cdc-model-jitter-ps <N> (and/or per-model jitter_ps in config) → a budget + seeded stream via RunParams::domain_seed(model_name).
After a model fires its edge, displace the timing-model arrival for that transition (not the functional edge — the DFF still samples on the same tick), mirroring the Phase 1 arrival-offset path.

Verify: extend tests/jtag_minimal (model-driven TCK) with a model jitter budget; confirm reproducibility and that TCK→sys_clk CDC margins vary by seed.

Phase 3 — Hygiene / correctness guards

gcd_ps / 2 constraint (ADR §2): at startup, error (or clamp with a loud warning) if any jitter_ps > scheduler.gcd_ps / 2, since larger values would reorder edges across GCD ticks.
Always persist the seed (ADR §1): when neither --run-params nor --output-vcd is given, RunParams::generate() currently does not write the file. Persist to a default path unconditionally so every run is replayable.
master_seed in the VCD header (ADR §1/§5): emit the master seed as a VCD header comment in vcd_io.rs, so the seed is recoverable from an output artifact, not just the INFO log.
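The startup guard from the first item could be as small as the sketch below. It assumes `gcd_ps` is the greatest common divisor of the clock periods in picoseconds, which may not match how the scheduler actually derives it; `check_jitter_budget` is a hypothetical name, and it takes the "error" branch rather than the clamp-with-warning alternative.

```rust
/// Euclid's algorithm; gcd(0, x) == x, so 0 works as a fold seed.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Reject any jitter budget above half the scheduler tick, since a larger
/// displacement could reorder edges across GCD ticks.
///
/// Assumption: the scheduler tick is the GCD of all clock periods.
fn check_jitter_budget(periods_ps: &[u64], jitter_ps: &[u64]) -> Result<(), String> {
    let gcd_ps = periods_ps.iter().copied().fold(0, gcd);
    let limit = gcd_ps / 2;
    for (i, &j) in jitter_ps.iter().enumerate() {
        if j > limit {
            return Err(format!(
                "clock {}: jitter_ps = {} exceeds gcd_ps / 2 = {}",
                i, j, limit
            ));
        }
    }
    Ok(())
}
```

For periods of 10 000 ps and 15 000 ps the tick is 5 000 ps, so budgets up to 2 500 ps pass and 2 501 ps is rejected.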

Phase 4 — CI CDC stress sweep (ADR Consequences)

Once jitter feeds violations (Phase 1), add a lightweight CI step: run the marginal-CDC design across a few sequential seeds, upload each run's run_params.json as an artifact, fail if an unexpected violation appears. Gives every PR a cheap CDC regression.

Out of scope (separate ADRs / plans)

X-injection on CDC paths (needs MC.1 island partitioner — ADR 0012 "Deferred").
Non-uniform jitter distributions (Gaussian period jitter, etc.) — the seed+budget interface is distribution-agnostic, add later.
Frequency sweep / DFS.

Plan: distribution & easy install

Implements ADR 0018. Goal: a fresh user (or a docs-dogfooding agent) can install Jacquard in one line on macOS/Metal, with CUDA/HIP following as runners land.

Artifacts & channels (from ADR 0018)

Artifact	Channel	Blocked on
`jacquard` (Metal)	GitHub Release + `cargo-binstall` + Homebrew tap	— (Phase 1a done — relocatable)
`opensta-to-ir`	same release/formula as `jacquard`	—
`jacquard` (CUDA / HIP)	added to the release matrix	NVIDIA / AMD runners
`netlist-graph`	PyPI	—

Phase 0 — Metadata hygiene (no new infra)

Fix repository URL ChipFlow/Jacquard → gpu-eda/Jacquard in Cargo.toml and crates/opensta-to-ir/Cargo.toml.
Decide the release version (this is effectively 0.1.0, the first numbered release) and confirm CHANGELOG.md exists / is rolled per release-process.md.
Add [package.metadata.binstall] to Cargo.toml describing the release asset name template (e.g. { name }-{ version }-{ target }{ archive-suffix }).

Phase 1a — Relocatable Metal binary (code prerequisite) ✅ Done

Done in d2afde4: the metallib is embedded via include_bytes! + new_library_with_data, verified by running the release binary copied outside target/. The original problem, for the record:

Blocker found while scoping Phase 1. MetalSimulator::new (cosim_metal.rs:337) loads the GPU kernel via new_library_with_file(env!("METALLIB_PATH")) — a compile-time absolute path into the build's target/.../out/ucc_metal/. A prebuilt binary moved to another machine panics with Failed to load metallib: the .metallib isn't bundled and the path doesn't exist there.

A distributable binary must locate its kernel without that build-tree path. Preferred fix: embed the metallib in the binary — include_bytes!(env!("METALLIB_PATH")) + new_library_with_data, so the binary is self-contained (no sidecar file). Alternative: ship the .metallib next to the executable and resolve relative to current_exe(), with the env! path as a dev fallback. Embedding is cleaner for distribution.

(CUDA/HIP likely have an analogous kernel-loading assumption; check when Phase 4 lands.)

This must merge before Phase 1 produces a usable artifact.

Phase 1 — Metal release CI (after 1a)

A release.yml workflow triggered on v* tags:

On the self-hosted macOS runner: cargo build -r --features metal --bin jacquard and cargo build -r --bin opensta-to-ir.
Package both binaries into jacquard-<version>-macos-arm64-metal.tar.gz (+ a .sha256).
gh release create / upload the asset. Body = the CHANGELOG section.
Smoke-test the packaged binary (jacquard --version, a tiny tests/apb_trace cosim) before publishing.

Deliverable: tagging v0.1.0 produces a downloadable Metal binary.

Phase 2 — Homebrew tap

Tap repo provisioned: gpu-eda/homebrew-tap created via brew tap-new (brew test-bot CI + pr-pull bottle publishing). Formula staged at packaging/homebrew/jacquard.rb (brew-style clean apart from homebrew-core-only Sorbet sigils; installs jacquard + opensta-to-ir, tests --version). Remaining:

At the first release: open a PR to the tap adding Formula/jacquard.rb with the released url / version / sha256 (the release emits a .sha256). The tap CI installs + tests it on the PR.
Subsequent releases: brew bump-formula-pr, or a release-CI step that opens the PR automatically (stretch).
Then brew install gpu-eda/tap/jacquard → both bins on PATH.

See packaging/README.md.

Phase 3 — netlist-graph → PyPI

Workflow written: .github/workflows/publish-netlist-graph.yml builds the wheel (uv build --package netlist-graph) and publishes via PyPI trusted publishing (OIDC) on netlist-graph-v* tags; workflow_dispatch is a build-only dry-run. The wheel + sdist build cleanly (verified locally); license = "Apache-2.0" + URLs added to the package metadata. Remaining (needs maintainer action):

Create/own the netlist-graph PyPI project and configure its trusted publisher (owner gpu-eda, repo Jacquard, workflow publish-netlist-graph.yml, environment pypi).
Result: uvx netlist-graph … / pip install netlist-graph work.

Phase 4 — CUDA / HIP release binaries (runner-gated)

Add linux-x64-cuda and linux-x64-hip rows to the release.yml matrix once the NVIDIA / AMD self-hosted runners are registered (tracks the same work re-enabling CUDA/HIP CI).
Same package/upload/smoke-test shape as Metal.

Phase 5 — Install docs (depends on 1–3)

New docs/installation.md (in SUMMARY.md) presenting the tiers: brew install / cargo binstall for the simulator, uvx netlist-graph for signal analysis, opensta-to-ir + PDK for timing. Source-build remains documented as the fallback / contributor path.
Trim the README "Build" section to point at the install page for users while keeping the from-source instructions for contributors.

Phase 6 — Container image (optional, deferred)

ghcr.io/gpu-eda/jacquard:cuda for reproducible Linux/CUDA runs. Deferred per ADR 0018; revisit if CUDA release binaries prove awkward.

Phase 7 — Upstream the `eda-infra-rs` fork (de-fork prerequisite)

The simulator can't be published to crates.io while its core deps are a vendored fork (vendor/eda-infra-rs → ChipFlow/eda-infra-rs) carrying patches not on the registry (the path deps declare a version but resolve to the fork's patched code; ADR 0018). De-forking — getting every fork change upstreamed + released so jacquard can depend on published gzz2000/eda-infra-rs crates — is the long-term unblock. Not required for the binary-distribution channels (those work today); tracked here as the path to crates.io / dependency hygiene. (opensta-to-ir / timing-ir are ours to version + name as we like, so they are not a blocker.)

Audit (2026-06-23): the fork (e4e3db0, branch master) is 13 commits ahead, 1 behind upstream gzz2000/eda-infra-rs. Upstream merged none of ours; it only added the license-string fix (026070c). Status of our 13 commits:

Fork change	Upstream PR	Status
sverilogparse: ANSI-style ports (`e4e3db0`)	gzz2000#3	OPEN — awaiting merge
Apple Metal support (`139a696`)	gzz2000#1	DRAFT — never submitted
sverilogparse: unary NOT `~` (`d7df6e8`)	—	not PR'd
HIP/AMD + HIP-on-NVIDIA backend (~9 commits, `c89d426`..`8b1bc63`)	—	not PR'd (largest gap)
rustfmt + .gitignore (`fb436ed`)	—	housekeeping; fold into a PR

Tasks (priority order):

Pull upstream's license fix (026070c) into the fork + bump the submodule pin → closes the release-process checklist item + the NOTICE footnote.
File the HIP backend PR(s) upstream (the ~9-commit gap).
Mark Metal PR gzz2000#1 ready for review (un-draft).
File the unary-NOT sverilogparse PR.
Once all merged + released: drop the fork, point deps at published gzz2000/eda-infra-rs versions — removes the core "no crates.io" blocker.

Verification

A clean checkout-free install on a second macOS machine (or a fresh shell with no repo): brew install … then run a tests/apb_trace cosim end to end.
cargo binstall jacquard fetches and runs without a Rust toolchain build.
uvx netlist-graph search <netlist> psel works with no repo clone.
Then the docs-dogfood: a Sonnet agent installs via the docs and attempts real tasks, reporting friction (the original motivation).

Decisions (confirmed)

Versioning: coordinated version across the two Rust bins (jacquard + opensta-to-ir), single tag; netlist-graph versioned independently.
Homebrew scope: the formula installs the Rust bins only; the docs carry the uvx netlist-graph line separately. netlist-graph is not a formula dependency.
Tag scheme: v<X.Y.Z> triggers the Rust release; netlist-graph-v<X.Y.Z> triggers the PyPI publish. The two workflows filter on their own tag prefixes so they never collide.
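The non-collision claim can be checked with a small sketch. It assumes each workflow matches a literal prefix followed by a plain `X.Y.Z` triple; `classify_tag`, `is_semver_triple` and the `Release` enum are illustrative names, not part of the release tooling.

```rust
/// Which release workflow a pushed git tag would trigger (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum Release {
    /// `v<X.Y.Z>`: the coordinated Rust release.
    RustBins,
    /// `netlist-graph-v<X.Y.Z>`: the PyPI publish.
    NetlistGraph,
    /// Any other tag triggers nothing.
    Nothing,
}

/// True if `s` is three non-empty decimal components separated by dots.
fn is_semver_triple(s: &str) -> bool {
    let parts: Vec<&str> = s.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()))
}

fn classify_tag(tag: &str) -> Release {
    // The long prefix starts with `n`, so it can never also match the bare
    // `v` prefix: at most one branch applies to any given tag.
    if let Some(rest) = tag.strip_prefix("netlist-graph-v") {
        if is_semver_triple(rest) {
            return Release::NetlistGraph;
        }
    } else if let Some(rest) = tag.strip_prefix('v') {
        if is_semver_triple(rest) {
            return Release::RustBins;
        }
    }
    Release::Nothing
}
```

With these rules, `v1.2.3` selects only the Rust release and `netlist-graph-v0.4.0` selects only the PyPI publish; a malformed tag such as `v1.2` selects neither.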

Plan: selective X-propagation in cosim

Extends ADR 0016 (selective X-propagation, today sim-only) to the reactive cosim path. Issue: #95.

Where it stands

Update 2026-06-03: while extending to cosim we found the seed template was broken for both paths. expand_states_for_xprop cleared the X-mask for all input_map positions, which includes DFF-Q feedback reads, so uninitialised DFFs read as known 0 and X never surfaced even in sim. Fixed via vcd_io::xprop_xmask_template (X at genuine X-sources only) plus output-slot seeding in run_cosim. See the handoff and ADR-0016 amendment.

Done:

Phases 1, 2 and 5, and the seed fix.
Phase 3 (undriven inputs → X): compute_x_capable_pins(treat_inputs_as_x) gated by DesignArgs::xprop_undriven_inputs; xprop_xmask_template_cosim seeds inputs X; state_prep + gpu_apply_flash_din clear the X-mask of each bit they drive.
Phase 6: end-to-end tests/xprop_cosim/ guards in CI (sim + cosim).
Phase 4 (observe-kernel output-offset): correct by construction, because gpu_io_step reads the value half via uart_params.state_size = effective_state_size; guarded by re-running the APB3 + dual-UART cosims under --xprop and asserting identical decoded output.

All phases of #95 are complete; the bidir tristate-mux read Y = OE ? A : external is separately tracked as #96.

--xprop is now wired into both sim and cosim. (Historically it was sim-only and cosim always ran two-state, silently resolving uninitialised DFF/SRAM and undriven inputs to 0 — false agreement against 4-state RTL; that gap is what this plan closed.)

What the work built on (already existed and was reused):

The Metal simulate_v1_stage kernel is already X-capable (sram_xmask buffer, xmask_state_offset, the X-mask read logic in kernel_v1.metal). cosim dispatches the same kernel — no GPU core change.
DesignArgs.xprop already threaded into script.xprop_enabled (via setup); before Phase 1, cmd_cosim simply hardcoded xprop: false.
Host machinery on the sim side: expand_states_for_xprop, the sram_xmask shadow (init 0xFFFF_FFFF), split_xprop_states, write_output_vcd_xprop.

So this is host-side reactive plumbing, not new kernel work.

X-source taxonomy (the semantic to get right)

X originates from four places; cosim must model all four:

Source	X when…	Mechanism
Uninitialised DFF	power-up, before first clocked write	`expand_states_for_xprop` seeds the X-mask half (as in `sim`)
Uninitialised SRAM	before first write to a cell	`sram_xmask` shadow init `0xFFFF_FFFF`, carried across ticks
Undriven input pad	no model / constant / clock / reset drives it	input X-mask = X for every primary-input bit not in the driven set
Bidir pad input side	`OE` deasserted and nothing external drives it	covered by the undriven-input rule (the `PAD` net is an undriven primary input, so the read is X); the combinational `Y = OE ? A : external` mux is deferred to #96

The first two are sequential power-up X (ADR 0016's original scope). The last two are new — input/IO X-sources the sim static-VCD path never had to model, and the reason this is more than "carry the sim machinery over."

Driven set (bits that are known, X-mask cleared): scheduler-driven clock(s) + reset, each peripheral model's driven_positions(), and constant_inputs / constant_ports. The complement of the driven set within the primary inputs is X.
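A minimal sketch of the "complement of the driven set" rule, assuming bit positions index a packed `u32` state and that an X-mask bit of 1 means unknown (consistent with the `0xFFFF_FFFF` all-X SRAM shadow). `seed_input_xmask` and its parameters are hypothetical names, not the simulator's API.

```rust
use std::collections::HashSet;

/// Build an initial input X-mask: 1 = unknown (X), 0 = known.
/// `primary_inputs` and `driven` hold bit positions in the packed state.
fn seed_input_xmask(
    num_words: usize,
    primary_inputs: &[usize],
    driven: &HashSet<usize>,
) -> Vec<u32> {
    let mut xmask = vec![0u32; num_words];
    for &pos in primary_inputs {
        // Only primary inputs outside the driven set start as X.
        if !driven.contains(&pos) {
            xmask[pos / 32] |= 1u32 << (pos % 32);
        }
    }
    xmask
}
```

For primary inputs at positions 0, 1, 2 and 33 with only 0 and 2 driven, bits 1 and 33 come out as X and every other bit stays known.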

Phases

Phase 1 — Flag plumbing

Add xprop: bool to CosimArgs; stop hardcoding xprop: false in cmd_cosim's DesignArgs. (setup already turns it into script.xprop_enabled.)

Phase 2 — X-state in the cosim loop ✅ done (1ba01eb)

Expanded the cosim state buffer via expand_states_for_xprop and allocated the sram_xmask shadow (init 0xFFFF_FFFF), bound at simulate_v1_stage buffer(7) through the dispatch chain.
No write_params change needed: is_x_capable / xmask_state_offset are baked into the script at flatten time and state_size already uses effective_state_size (so the layout scales automatically).

Phase 3 — Input X-mask policy + per-edge maintenance (the novel part)

Init: every primary-input X-mask bit = X, except the driven set (clock/reset/model/constants), which start known. The expand_states_for_xprop template clears all inputs to known, so Phase 3 re-marks the undriven inputs back to X.
Per edge: wherever state_prep / model ModelOverrides drive an input bit, also clear that bit's X-mask (driven ⇒ known). Undriven bits stay X.
Bidir: no special handling in this issue. A bi_24t pad's core-read is modelled Y = PAD (tristate not modelled — aig.rs), and the PAD net is an undriven primary input, so bidir reads fall out of the generic undriven-input rule above as X — conservative and safe (an unmodelled bidir read surfaces as unknown, never a false 0/1). The earlier "per-edge OE→input feedback with one-edge latency" idea was wrong: the correct read is combinational Y = OE ? A : external, which requires modelling the tristate mux in the AIG. That is deferred to #96; until it lands, an OE=1 loopback reads pessimistic-X (safe).
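The per-edge rule (driven ⇒ known, undriven stays X) can be sketched on a flat `[value | xmask]` buffer. This is an illustration under assumptions: `drive_bit` and `read_bit` are hypothetical helpers, `xmask_offset` is a word offset to the X-mask half, and a set X-mask bit means unknown.

```rust
/// Drive one input bit to a known value and clear its X-mask bit.
fn drive_bit(state: &mut [u32], xmask_offset: usize, pos: usize, value: bool) {
    let (word, bit) = (pos / 32, 1u32 << (pos % 32));
    if value {
        state[word] |= bit;
    } else {
        state[word] &= !bit;
    }
    // Driven implies known: clear the matching X-mask bit.
    state[xmask_offset + word] &= !bit;
}

/// Read a bit with X awareness: `None` means X, `Some(v)` means known.
fn read_bit(state: &[u32], xmask_offset: usize, pos: usize) -> Option<bool> {
    let (word, bit) = (pos / 32, 1u32 << (pos % 32));
    if state[xmask_offset + word] & bit != 0 {
        None
    } else {
        Some(state[word] & bit != 0)
    }
}
```

An undriven bidir `PAD` input behaves like the untouched bit here: nothing ever calls `drive_bit` for it, so reads keep returning `None` (X) rather than a false 0 or 1.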

Phase 4 — Observe-kernel offset (DONE — correct by construction)

gpu_io_step's output reads (READ_OUT_BIT → states[state_size + …]) for UART, Wishbone, and bus-trace turned out to be already correct under xprop: state_size here is uart_params.state_size, set to the full effective_state_size, so states[state_size + word] indexes the output slot and word < reg_io_state_size lands in the value half (value is at the front of [value | xmask | …]). The earlier worry that "the offset shifts" was unfounded — no Rust/kernel change was needed.
Guard added so it can't silently regress: CI re-runs the APB3 bus-trace and dual-UART cosims under --xprop and asserts the decoded transactions/bytes are identical to the two-state runs (both designs are fully reset-driven, so phase-3 undriven-input X never reaches the traced signals). If a future layout change made the observe reads hit the xmask half, the decoded values would corrupt and the checkers would fail.
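The indexing argument can be worked through with a toy layout. The sketch assumes a states buffer of `[input slot | output slot]`, each slot `[value | xmask | …]`, and picks `reg_io_state_size = 4` with `effective_state_size = 8` purely for illustration; `classify`, `read_out_index` and `Hit` are hypothetical names.

```rust
/// Which region of the states buffer a word index falls in.
#[derive(Debug, PartialEq, Eq)]
enum Hit {
    InputSlot,
    OutputValue,
    OutputXmaskOrLater,
}

fn classify(index: usize, effective_state_size: usize, reg_io_state_size: usize) -> Hit {
    if index < effective_state_size {
        Hit::InputSlot
    } else if index - effective_state_size < reg_io_state_size {
        // The value half sits at the front of the output slot.
        Hit::OutputValue
    } else {
        Hit::OutputXmaskOrLater
    }
}

/// The observe read described above: states[state_size + word].
fn read_out_index(state_size: usize, word: usize) -> usize {
    state_size + word
}
```

With `state_size` equal to the full effective size (8), reading word 3 hits index 11, inside the output value half. Had `state_size` been the unexpanded 4, the same read would hit index 7, which is still inside the input slot; that is the kind of regression the CI guard is meant to catch.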

Phase 5 — X-aware VCD output

cosim emit path uses a write_output_vcd_xprop-equivalent so traced nets and top-level IO emit x (not 0) where unknown; the bidir __out/__oe split already exists and should reflect X correctly.

Phase 6 — Verification

A small reactive test design with (a) an unreset register and (b) an unconnected input pad — assert the output VCD shows x until each is resolved (clocked write / model drive), 0/1 after. (Bidir read-back correctness is deferred to #96.)
Where feasible, extend the CPU sanity_check_cpu_xprop parity to a cosim scenario.

Risks / open questions

Bidir read-back is pessimistic-X until #96 models the tristate mux (Y = OE ? A : external). Safe (false-X, not false-0); resolve by driving the pad's external side with a model.
Observe-offset regression (Phase 4): the highest-risk interaction; the bus-trace code is days old. Needs an explicit test under --xprop.
Performance: the state buffer doubles again on top of any timing expansion; the VCD ring-buffer snapshot grows. Measure on a real JTAG-replay run.
SRAM xmask carry: confirm the shadow persists correctly across the batched/single-tick dispatch modes the cosim loop uses.

ADR / docs impact

Amend ADR 0016: record the cosim extension and broaden the X-source taxonomy to include undriven input pads + bidir-OE (the original ADR covered only sequential power-up X).
Fold the IO X-source rules into docs/selective-x-propagation.md.
Update the cosim --xprop help + docs/installation.md once shipped.

Cosim Backend Portability

Status: Active — design captured, not yet scheduled. Issue: #105. Architecture: ADR 0017 — Cosim execution model (see Amendment 2026-06-07: backend-portable cosim — target architecture).

cosim is Metal-only today: run_cosim lives in src/sim/cosim_metal.rs (gated #[cfg(feature = "metal")]) and cmd_cosim hard-errors on other backends. This plan is the staging to reach the target architecture in ADR 0017: a backend-agnostic orchestration layer, a batch-granular CosimBackend trait, and a 3-tier GpuPeripheral model — then CPU and CUDA/HIP backends.

Goal & non-goals

Goal: jacquard cosim runs on CPU (reference, no GPU) and on CUDA/HIP with batching, not just Metal — reusing the existing scheduler, peripheral models, and VCD machinery.

Non-goals:

Changing the batch/scheduler execution model of ADR 0017 (untouched; the trait is made batch-granular to preserve it, not change it).
Matching Metal cosim throughput on the CPU path (it's a reference/oracle).
The user-extensible single-source peripheral API (Tier 3) up front — core peripherals get hand-written GPU kernels first (Phase 2); the portable authoring path is Phase 3.

What's already portable (no work needed)

Peripheral protocol models (Tier 1) — src/sim/models/*.rs (gpio, uart, i2c, spi, jtag, bus_trace) are pure CPU Rust implementing PeripheralModel (models/mod.rs:56). The semantic ground truth and the cross-backend equivalence oracle; reusable as-is across all backends.
CPU design step — cpu_reference::simulate_block_v1 is the exact CPU equivalent of one simulate_v1_stage threadgroup. The existing run_cosim --check-with-cpu path (cosim_metal.rs:4282–4504) already runs state_prep + apply_flash_din + simulate_block_v1 per block on the CPU — a working prototype of the Path A backend step.
The scheduler / batch-size policy / VCD / event drain logic is GPU-agnostic in intent; it operates on &[u32] state and Vec<BitOp> ops.

The seam (batch-granular)

/// Design execution + state ownership. One impl per backend. The
/// orchestration layer (scheduler, models, VCD, drains) calls this.
///
/// Batch-granular by construction: measurements (ADR 0017) show Metal runs
/// 100% batched on GPU-peripheral designs, so a single-edge method would
/// regress it ~1000×. The orchestration decides N (`force_single_edge`);
/// the backend runs N edges however it likes.
trait CosimBackend {
    /// Build native schedule storage ONCE from the backend-agnostic
    /// description; the backend retains+owns it (opaque to orchestration).
    fn init_schedule(&mut self, edges: Vec<(StatePrepParams, Vec<BitOp>)>);
    /// Mutable view of one edge's ops (reset/model/clock-edge patching).
    /// Metal: slice over the shared MTLBuffer (zero-copy; write IS upload).
    /// CUDA/HIP: slice over host mirror, marks edge dirty (uploaded lazily).
    fn edge_ops_mut(&mut self, edge_idx: usize) -> &mut [BitOp];
    /// output→input copy + apply that edge's BitOps + clear driven X-mask.
    fn state_prep(&mut self, edge_idx: usize, xmask_state_offset: u32);
    /// Run N consecutive scheduler edges from `start_edge`, snapshotting each
    /// output slot into the ring. Flushes dirty edges first (no-op on Metal).
    /// Metal: one command buffer w/ GPU peripherals inside. CPU/CUDA-fallback:
    /// per-edge loop with CPU peripherals.
    fn run_edges(&mut self, start_edge: usize, n: usize, ring: &mut Ring);
    /// Read-only view of the current output slot (VCD, model step_edge, drains).
    fn output_state(&self) -> &[u32];
    /// Mutable input slot (reset/constant init, flash MISO injection).
    fn input_state_mut(&mut self) -> &mut [u32];
}

/// GPU-side peripheral kernel (Tier 2). Runs inside the backend's batch so
/// reactive designs can batch on a discrete GPU. CPU `PeripheralModel`
/// (Tier 1) is the reference + fallback when no GpuPeripheral exists.
trait GpuPeripheral {
    fn encode_step(&self, /* backend-specific encoder */);
}

MetalSimulator becomes the MetalBackend impl; CpuBackend and Cuda/HipBackend are added. The backend owns the schedule storage — the orchestration hands it the description once via init_schedule and keeps only scalars (edges_per_period, gcd_ps); it does not hold a parallel Vec the backend re-materialises (that would add a per-dispatch copy and regress Metal's zero-copy path). Mutation goes through edge_ops_mut (zero-copy on Metal; host-mirror + dirty-flag + lazy upload on CUDA/HIP). The orchestration owns the batch-size policy and the CPU peripheral models; it delegates the design step, state, and (Phase 2+) GPU peripheral steps to the backend.
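The host-mirror + dirty-flag + lazy-upload path described for CUDA/HIP might look roughly like the sketch below. It is an illustration only: `MirrorSchedule`, the `BitOp` fields, and the `device` vector standing in for device memory are all assumptions, and the "upload" is a plain copy.

```rust
/// Stand-in for the real op type; these fields are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BitOp {
    position: u32,
    value: u32,
}

struct MirrorSchedule {
    host: Vec<Vec<BitOp>>,   // per-edge ops the orchestration patches
    device: Vec<Vec<BitOp>>, // stand-in for device-resident copies
    dirty: Vec<bool>,        // one flag per edge
    uploads: usize,          // counts simulated uploads
}

impl MirrorSchedule {
    /// Built once from the backend-agnostic description.
    fn init_schedule(edges: Vec<Vec<BitOp>>) -> Self {
        let n = edges.len();
        Self { device: edges.clone(), host: edges, dirty: vec![false; n], uploads: 0 }
    }

    /// Hand out the host mirror and remember that this edge changed.
    fn edge_ops_mut(&mut self, edge_idx: usize) -> &mut [BitOp] {
        self.dirty[edge_idx] = true;
        &mut self.host[edge_idx]
    }

    /// Upload only the edges touched since the last flush.
    fn flush_dirty(&mut self) {
        for (i, d) in self.dirty.iter_mut().enumerate() {
            if *d {
                self.device[i].copy_from_slice(&self.host[i]);
                self.uploads += 1;
                *d = false;
            }
        }
    }
}
```

In this model a Metal-style backend would return a slice over the shared buffer from `edge_ops_mut` and make `flush_dirty` a no-op, which is why the same trait method can stay zero-copy there.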

Why batching must be a backend concern

cosim is per-edge dispatch in the reactive sense (inputs depend on outputs), which is why CUDA/HIP avoid the sim command's cooperative single-launch + grid.sync — the hardest CUDA feature to port. But reactive and per-command-buffer are different axes: Metal batches up to BATCH_SIZE reactive edges into one command buffer because its peripherals run on the GPU inside the batch. On a discrete GPU, going per-command-buffer means a PCIe round-trip every edge, which is why batching (hence GPU peripherals) is required for CUDA/HIP perf, not optional — so the CUDA/HIP backend ships with its GPU peripherals (Phase 2), not as a later add-on. See ADR 0017, Layer 3 and Consequences.

Phases

Phase 0 — Extract the seam (Metal-only refactor, zero behaviour change)

Introduce the batch-granular CosimBackend trait and make the schedule backend-neutral; MetalSimulator becomes MetalBackend. The physical module split (cosim/mod.rs orchestration + cosim/metal.rs) is separable — the trait can land in-place in cosim_metal.rs first, split later. Must leave Metal cosim bit-identical.

Move the schedule storage into MetalBackend (built once via init_schedule); the orchestration keeps only edges_per_period + gcd_ps. Currently ScheduleBuffers holds metal::Buffer pairs at the call site.
Convert the ops-update helpers (update_model_driven_in_ops, update_reset_in_ops, patch_model_clock_edges) from in-place *mut BitOp writes over contents() to backend.edge_ops_mut(edge_idx). On Metal this returns a slice over the same shared buffer (zero-copy, identical behaviour). This conversion is the main refactor friction (and resolves the closure-borrow issue) — but it stays zero-copy on Metal.
~~Consolidate the private simulate_block_v1 copies onto cpu_reference.~~ Done — #115.
Move the ~20 ad-hoc simulator.device.new_buffer(...) allocations in run_cosim into MetalBackend.

Entry: #113 (sim cross-backend equivalence) merged. ✅ Exit: Metal cosim CI (dual_uart, apb_trace, JTAG-minimal, xprop_cosim) all green, byte-identical output VCDs vs pre-refactor (harness: re-run the fixtures, cmp against a pre-refactor golden).

Phase 1 — Path A: `CpuBackend` + Linux cosim CI

Implement CpuBackend: state_prep (output→input copy + BitOps + X-mask clear — the loop at cosim_metal.rs:4286–4301), run_edges via cpu_reference::simulate_block_v1 per block (N effectively 1), plain Vec<u32> state.
Peripherals stay on CPU (Tier 1, already are). The GPU flash kernel's FSM is a simple SPI state machine; reuse the existing CppSpiFlash FFI (cosim_metal.rs:2493) rather than writing a third copy.
Wire cmd_cosim to select CpuBackend when --features metal is absent (or via an explicit --backend cpu), removing the hard-error.
Move the cosim regression tests to a Linux CI job — Metal-only today; CPU cosim lets xprop_cosim (cosim mode), dual_uart, and apb_trace run on free ubuntu-latest.

Entry: Phase 0 merged. Exit: cargo run --bin jacquard -- cosim … (no GPU features) runs the existing cosim fixtures on Linux CI and passes their checkers.

Phase 2 — Path B: `Cuda`/`HipBackend` mirroring Metal (GPU peripherals, batched)

Refined 2026-06-19 (see ADR 0017 Amendment 2026-06-19 and the detail doc cosim-phase2-cuda-hip.md): there is one CUDA/HIP cosim backend, mirroring MetalBackend (GPU design step + GPU peripherals + variable batching + managed memory). The earlier "checkpoint 2a = CPU-peripheral CUDA backend" is dropped — no production backend runs peripherals on the CPU, so it would be a confusing one-off. CpuBackend stays the pure-CPU oracle; the "per-edge fallback" is batch=1 of the GPU backend, not a CPU-peripheral path.

Bisectability comes from staging on fixtures (each exercises a different kernel subset), not from a throwaway backend:

Stage A — design step + batched orchestration. CudaBackend/HipBackend over the cosim_state_prep + cosim_simulate_stage kernels + init_schedule / edge_ops_mut (managed buffers, dirty-edge upload) + the batched run_edges/VCD-ring path. Validate against xprop_cosim (logic fixtures, no peripherals).
Stage B — gpu_io_step (UART + bus). Validate dual_uart + apb_trace.
Stage C — flash kernels (gpu_apply_flash_din, gpu_flash_model_step). Validate flash/JTAG fixtures.

GPU peripheral kernels are two impls (shared *_impl.cuh for CUDA+HIP, plus the existing .metal). Define the GpuPeripheral seam in Stage B/C; the Tier-1 CpuBackend is the per-kernel equivalence oracle. Gate GPU-backed cosim CI on tesla4-runner.

Entry: Phase 1 merged (seam + CPU reference proven). Exit: cosim --features cuda on each fixture runs batched (batch > 1), output VCD byte-identical across CPU/Metal/CUDA/HIP; JTAG matches at batch=1.

Phase 3 — (Future) single-source peripherals (Tier 3, user-extensible API)

A user authors a peripheral once (restricted-Rust subset or a small peripheral-FSM IR) that compiles to the CPU model + each GPU backend's kernel — the user-extensible peripheral API. Slots into the GpuPeripheral seam defined in Phase 2 without reworking orchestration. Big effort; demand-driven (first external/user-defined peripheral type).

Testing strategy

Reuse existing fixtures: tests/xprop_cosim/ (cosim mode), tests/dual_uart/, tests/apb_trace/, tests/jtag_minimal/.
Cross-backend equivalence (the correctness backstop): extend the #113 harness (scripts/ci/compare_backend_vcds.py) to reactive designs — run the same cosim on CPU / Metal / CUDA and assert byte-identical output VCDs; the CPU PeripheralModel is ground truth and each Tier-2 GPU kernel must match it.
Bit-identical Phase 0 gate: capture a pre-refactor golden of all fixtures, re-run + cmp after each refactor step.
Linux CI: Phase 1 is the unlock — cosim regression coverage on free ubuntu-latest instead of the single self-hosted Metal runner.
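The byte-identical criterion reduces to finding the first differing offset between two outputs. The sketch below shows that check in Rust for illustration; it is not the `compare_backend_vcds.py` harness, and `first_difference` is a hypothetical helper.

```rust
/// Offset of the first differing byte, or `None` if the inputs are identical.
/// A length mismatch counts as a difference at the end of the shorter input.
fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}
```

Two VCD files could be loaded with `std::fs::read` and passed in as slices; reporting the offset rather than a bare pass/fail makes a divergence easier to localise.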

Risks

Refactor drift (Phase 0): the bit-identical-Metal exit criterion is load-bearing; the equivalence test + golden cmp + existing checkers guard it.
Shared-memory ops mutation (Phase 0): the in-place *mut BitOp writes are the trickiest part to de-Metal; the trait's explicit upload point is the fix.
Flash FSM (Phase 1): the GPU flash kernel and the CppSpiFlash FFI must agree. Prefer reusing CppSpiFlash to avoid a third copy.
Per-edge device read (Phase 2, batch=1 path): running the GPU backend at batch=1 reads device state every edge. Slow by design and used only for designs with CPU-side models (e.g. JTAG); the GPU peripherals (Stages B/C) are what keep the common case batched. Not a correctness risk.
Tier-2 kernel divergence (Phase 2, Stages B/C): two kernel families (CUDA+HIP shared *_impl.cuh, Metal .metal) per peripheral; equivalence tests against the CPU model are the guard, and Tier 3 eventually removes the hand-maintenance.
Phase 2 size: merging the backend bring-up with the GPU-peripheral ports makes Phase 2 large; the Stage A/B/C fixture staging keeps it incrementally verifiable (design step and batched orchestration first, then each peripheral kernel subset).

Sequencing relative to other backend-alignment work

Independent of #104 (CUDA/HIP sim timing) — both bring CUDA/HIP toward Metal parity and are now T4-testable, but touch different code paths.
Complements the sim cross-backend equivalence test (#113) and the proposed single-source simulate_block_v1 macro prelude — those harden the sim compute kernel; this plan adds the cosim driver.
CDC/island batching (multi-clock plan MC.1→MC.4) is the long-term fix for the per-edge tail of CPU-side-model designs (e.g. JTAG) — orthogonal to and larger than this seam; not a prerequisite.

Cosim Phase 1 — `CpuBackend` + Linux cosim CI (implementation plan)

Status: Proposed (implementation plan for Phase 1 of #105). Parent plan: cosim-backend-portability.md (staging). Architecture: ADR 0017 — Amendment 2026-06-07. Base: stacks on Phase 0 (branch cosim-backend-seam-phase0, PR #118 — CosimBackend trait + MetalBackend extracted, Metal bit-identical).

Goal

jacquard cosim runs on a CPU reference backend with no GPU feature, reusing the scheduler, peripheral models, and VCD machinery — and the existing cosim regression fixtures run on free ubuntu-latest CI. Throughput is not a goal (the CPU backend is the oracle); correctness parity with Metal is.

The core problem

After Phase 0 the CosimBackend trait exists, but everything still lives in src/sim/cosim_metal.rs gated #[cfg(feature = "metal")] — the trait, BitOp, run_cosim, the patchers, MetalBackend. run_cosim's setup + loop still touch Metal directly at ~54 .contents() / 20 new_buffer / 19 MTLResourceOptions sites (xprop X-mask seeding, flash-state init, SRAM fill, set_flash_din, --check-with-cpu reads, VCD-ring drain, UART/bus channel drains). For a CpuBackend to compile and run without metal, the orchestration must become backend-agnostic and these sites must move behind the seam.

Site categories (run_cosim body): flash ~38, sram ~22, bus/wb-trace ~21, states ~13, uart ~12, vcd-ring 3.

Chosen interface approach — "fat backend constructor" (ADR Layer 1/2 split)

The backend owns its allocation and initialisation. run_cosim becomes generic over B: CosimBackend and only: builds the backend-agnostic descriptions (schedule ops, init seeds, peripheral configs), calls B::new(...), then drives the per-edge loop through trait methods. Each backend allocates + initialises its own storage inside new. This is the clean Layer-1 (agnostic orchestration) / Layer-2 (backend) split; it keeps the trait surface small at the cost of larger per-backend constructors (the Metal constructor is just today's setup block relocated).
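The shape of the fat-constructor approach can be sketched with a toy trait. Everything here is illustrative: `Backend`, `Description`, `ToyCpuBackend` and `run_toy_cosim` are hypothetical stand-ins for `CosimBackend`, the agnostic descriptions, `CpuBackend` and `run_cosim<B>`, and the "design step" is a placeholder.

```rust
/// Backend-agnostic description built by the orchestration (hypothetical fields).
struct Description {
    state_words: usize,
    edges: usize,
}

trait Backend: Sized {
    /// Fat constructor: allocate and initialise all backend-owned storage.
    fn new(desc: &Description) -> Self;
    fn run_edge(&mut self, edge_idx: usize);
    fn state(&self) -> &[u32];
}

struct ToyCpuBackend {
    state: Vec<u32>,
}

impl Backend for ToyCpuBackend {
    fn new(desc: &Description) -> Self {
        Self { state: vec![0; desc.state_words] }
    }
    fn run_edge(&mut self, edge_idx: usize) {
        // Placeholder for the design step: record how many edges have run.
        self.state[0] = edge_idx as u32 + 1;
    }
    fn state(&self) -> &[u32] {
        &self.state
    }
}

/// Generic driver: builds the backend, then only uses trait methods.
fn run_toy_cosim<B: Backend>(desc: &Description) -> Vec<u32> {
    let mut backend = B::new(desc);
    for edge in 0..desc.edges {
        backend.run_edge(edge);
    }
    backend.state().to_vec()
}
```

The driver never names a concrete buffer type, so adding a second backend means writing another `impl Backend` and nothing else; the cost is that each `new` carries all of that backend's setup.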

Trait additions (beyond Phase 0's `init_schedule`/`edge_ops_mut`/`edge_ops`/`edges_per_period`/`gcd_ps`/`run_edges`/`wait`)

Revised per plan review (2026-06-07). flash_d_i is not a trait method — Metal and CPU update flash d_i at different points in the dispatch cycle (GPU gpu_flash_model_step vs CppSpiFlash::step), so surfacing it through the seam couples ordering. Instead CpuBackend::run_edges calls CppSpiFlash::step() internally and injects the result into the input state; d_i never crosses the seam. flash_set_in_reset stays (called before run_edges).

// Design state (Phase 0 deferral #1). Full [2 × effective_state_size] —
// input slot followed by output slot.
fn state(&self) -> &[u32];
fn state_mut(&mut self) -> &mut [u32];
fn sram(&self) -> &[u32];             // final dump / equivalence compare

// Flash reset line (set before run_edges; both flash FSMs honour it).
fn flash_set_in_reset(&mut self, in_reset: bool);

// Per-edge output snapshot for VCD — replaces run_edges' `metal::Buffer`
// (Phase 0 deferral #2). Slot layout is `[input_state | output_state]`,
// matching today's Metal ring. The orchestration drains agnostic &[u32]
// slots: `for i in 0..batch { let s = backend.vcd_snapshot(i); .. }`.
// CpuBackend (N=1) fills slot 0 before run_edges returns.
fn vcd_snapshot(&self, edge_in_batch: usize) -> &[u32];  // [2 × eff_state_size]

run_edges loses its Option<&metal::Buffer> parameter; whether to capture snapshots becomes backend state set at construction (enable_vcd: bool). The current raw ring.contents() pointer-math drain (cosim_metal.rs ~4065–4198) is refactored to the vcd_snapshot(i) loop in step 1.
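A minimal sketch of the `vcd_snapshot(i)` drain, assuming each slot is `[input_state | output_state]` with both halves `eff_state_size` words long. `SnapshotSource`, `VecRing` and `drain_outputs` are hypothetical names; a real backend would expose this through `CosimBackend`.

```rust
trait SnapshotSource {
    /// One `[input_state | output_state]` slot per edge in the batch.
    fn vcd_snapshot(&self, edge_in_batch: usize) -> &[u32];
}

/// Toy ring: slots stored back to back in a plain Vec.
struct VecRing {
    slot_len: usize,
    data: Vec<u32>,
}

impl SnapshotSource for VecRing {
    fn vcd_snapshot(&self, edge_in_batch: usize) -> &[u32] {
        let start = edge_in_batch * self.slot_len;
        &self.data[start..start + self.slot_len]
    }
}

/// Drain `batch` snapshots and keep the output half of each slot.
fn drain_outputs<B: SnapshotSource>(
    backend: &B,
    batch: usize,
    eff_state_size: usize,
) -> Vec<Vec<u32>> {
    (0..batch)
        .map(|i| backend.vcd_snapshot(i)[eff_state_size..].to_vec())
        .collect()
}
```

The orchestration only sees `&[u32]` slices, so the same loop works whether the slots live in a shared GPU buffer or in a `Vec` filled by a CPU backend with a batch of one.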

Peripheral output decode — the one real design wrinkle

Input-driving peripherals already run CPU-side (PeripheralModel, models/*.rs) via the patchers — backend-agnostic, no change. Output decode differs by backend:

Peripheral	Metal (today)	CpuBackend (Phase 1)
SPI flash	`gpu_flash_model_step` kernel	`CppSpiFlash` FFI (exists, `cosim_metal.rs:16`)
Bus trace (APB)	`gpu_io_step` kernel	`BusTraceDecoder` (already CPU, `models/bus_trace.rs`)
UART TX decode	`gpu_io_step` kernel	new CPU decoder (port the `UartDecoderState` FSM, currently a GPU-side mirror, to a `step(output_state)` CPU fn)
Wishbone trace (legacy `WbTraceParams`)	`gpu_io_step` kernel	stub — debug-only `eprintln!` path; CpuBackend leaves `write_head=0` so the drain is a silent no-op (acknowledged, not ported)

GPIO and JTAG/UART-RX are input-only drivers (models/*.rs, already CPU) — no output decode. The step_edge(&[], ..) placeholder for model output-state (cosim_metal.rs ~3907) stays empty in Phase 1: the Phase-1 fixtures (dual_uart, apb_trace, xprop_cosim) have no model that reads design output, so this is sound; wiring state() into step_edge is deferred with the I²C/SPI-model work.

So CpuBackend's run_edges (N=1, iterating blocks × num_major_stages): state_prep (output→input copy + apply edge BitOps + clear driven X-mask — the loop at cosim_metal.rs ~4286–4301) → cpu_reference::simulate_block_v1 per block/stage → CPU flash/UART/bus decode reading the new output state. The bus + flash decoders exist; only the UART TX decoder needs a CPU port (small: shift-register baud FSM).
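To give a sense of scale for the UART TX port, here is a sketch of a shift-register baud FSM. It is not a port of `UartDecoderState`: it assumes 8N1 framing, LSB first, one line sample per `step` call, and a fixed `cycles_per_bit`; all names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Idle,
    Data,
    Stop,
}

struct UartTxDecoder {
    cycles_per_bit: u32,
    phase: Phase,
    countdown: u32, // steps until the next mid-bit sample
    bit_index: u8,
    shift: u8,
}

impl UartTxDecoder {
    fn new(cycles_per_bit: u32) -> Self {
        assert!(cycles_per_bit >= 1);
        Self { cycles_per_bit, phase: Phase::Idle, countdown: 0, bit_index: 0, shift: 0 }
    }

    /// Feed one sample of the TX line (true = high); returns a completed byte.
    fn step(&mut self, tx: bool) -> Option<u8> {
        if self.phase == Phase::Idle {
            if !tx {
                // Start bit seen: aim for the middle of data bit 0.
                self.phase = Phase::Data;
                self.countdown = self.cycles_per_bit + self.cycles_per_bit / 2;
                self.bit_index = 0;
                self.shift = 0;
            }
            return None;
        }
        self.countdown -= 1;
        if self.countdown > 0 {
            return None;
        }
        self.countdown = self.cycles_per_bit;
        match self.phase {
            Phase::Data => {
                self.shift |= (tx as u8) << self.bit_index; // LSB first
                self.bit_index += 1;
                if self.bit_index == 8 {
                    self.phase = Phase::Stop;
                }
                None
            }
            Phase::Stop => {
                self.phase = Phase::Idle;
                // A low stop bit is a framing error; the byte is dropped.
                if tx { Some(self.shift) } else { None }
            }
            Phase::Idle => unreachable!("idle is handled above"),
        }
    }
}

/// Helper for experiments: line samples for one 8N1 frame.
fn frame_samples(byte: u8, cycles_per_bit: u32) -> Vec<bool> {
    let mut bits = vec![false]; // start bit
    bits.extend((0..8).map(|i| (byte >> i) & 1 == 1));
    bits.push(true); // stop bit
    bits.into_iter()
        .flat_map(|b| std::iter::repeat(b).take(cycles_per_bit as usize))
        .collect()
}
```

In a backend, `step` would be called once per edge with the TX bit extracted from the new output state; the equivalence test against `gpu_io_step` on dual_uart remains the real check that a CPU decoder matches.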

Module split (required, not cosmetic)

src/sim/cosim_metal.rs → src/sim/cosim/:

cosim/mod.rs (NOT gated): CosimBackend trait, BitOp, StatePrepParams, the patchers + ModelDrivenClockState, the agnostic run_cosim<B>, the scheduler/VCD/drain glue, and CpuBackend. Public API (run_cosim, CosimOpts, CosimResult) re-exported here.
cosim/metal.rs (#[cfg(feature = "metal")]): MetalBackend, MetalSimulator, ScheduleBuffers, the encode_*/profile_gpu_kernels methods, create_ops_buffer/create_prep_params_buffer, GPU IO struct definitions.
src/sim/mod.rs: pub mod cosim; (drop the gated cosim_metal).

Sub-commit sequencing (each gated by the bit-identical Metal harness)

De-Metal the trait surface + VCD ring. Add state/state_mut/sram/flash_set_in_reset/vcd_snapshot; move the VCD ring into MetalBackend; drop metal::Buffer from run_edges; refactor the raw ring.contents() drain (~4065–4198) to the vcd_snapshot(i) loop. Route the loop-body diagnostic reads (--check-with-cpu, dff-dump ~4205, trace-signals ~4278, deep-diag ~4634, post_reset_state_snapshot) — ~100 lines of direct states_buffer.contents() — through state()/sram(). Metal bit-identical. Exit assertion: grep metal:: over the run-cosim body + trait shows only the soon-to-move construction block.
Move buffer setup+init into MetalBackend::new, as three checkpoints, each Metal bit-identical:
- 2a — flash (state/din/model/data) buffer alloc + init.
- 2b — UART + WB + bus-trace buffer alloc + init.
- 2c — states/sram/sram-xmask/event/blocks alloc + init (incl. xprop X-mask seeding). run_cosim ends up calling MetalBackend::new(...).
De-Metal the run_cosim body + make it generic. Split into three bit-identical checkpoints (the loop body, not the construction block, is the real work — step 1 deferred the diagnostic-read routing, so it lands here):
- 3a (done, d5a029f) — add state/state_mut/sram to the trait + MetalBackend (sram_len field); route the ~15 read-only states_buffer/sram_data_buffer loop-body reads through state()/sram(). run_cosim stays concrete-typed.
- 3b-i — route the remaining concrete-field reads (groups C+D) off MetalBackend fields, decoded-records seam (ADR 0017 Layer 3):
  - Flash diagnostics (C): FlashModelParams/FlashDinParams reads become agnostic locals (derived from config+gpio_map, as build_flash_buffers does). FlashState reads → flash_d_i() -> u8 (functional, for --check-with-cpu) + flash_debug_snapshot() -> FlashDebug (agnostic struct mirroring the printed fields); the tick-0 raw-bytes/offsetof dump stays Metal-internal behind a debug trait method (not deleted).
  - Peripheral drains (D): drain_uart_tx() -> Vec<(usize,u8)>, drain_bus_beats() -> Vec<RawBeat>, drain_wb_trace_debug() (legacy eprintln, Metal-internal; CpuBackend no-op), uart_decoder_debug(ch). The read cursors (uart_read_heads, wb/bus_trace_read_head) move into MetalBackend; bus_lanes/uart_names/event-dispatch/tick stay in the agnostic orchestration. run_cosim stays concrete-typed. Exit: zero backend.<field>.contents() in the loop body.
- 3b-ii — assemble MetalBackend::new (the deferred step-2 finale: build_flash+build_io+build_state+struct literal+init_schedule), route the pre-loop stimulus deposits through state_mut(), flip to run_cosim<B: CosimBackend> constructing the backend via the fat constructor, update the jacquard.rs call site. Metal bit-identical. Exit: no metal:: / MTLResourceOptions token outside the Metal constructor + impl.
Module split cosim_metal.rs → cosim/{mod,metal}.rs. mod.rs must import zero metal::*; ScheduleBuffers/MetalSimulator/create_ops_buffer/create_prep_params_buffer and the GPU IO structs live entirely in cosim/metal.rs (gated). Audit pub use, src/sim/mod.rs, src/bin/jacquard.rs, docs, CI. Gate: cargo check --lib (no feature) must compile (it currently can't — cosim_metal is fully gated). Metal bit-identical with --features metal.
Implement CpuBackend — Vec<u32> state sized effective_state_size() * 2, Vec<u32> sram, Vec<Vec<BitOp>> schedule. run_edges (blocks × stages) via cpu_reference::simulate_block_v1; flash via CppSpiFlash::step (internal, injects d_i into input state); bus via BusTraceDecoder; new CPU UART decoder. CpuBackend::new asserts !script.timing_arrivals_enabled and !(xprop_enabled && sram_storage_size > 0) (see Risks). --check-with-cpu becomes a no-op-with-warning under CpuBackend (the backend is the reference — never compare it to itself).
Wire cmd_cosim — select CpuBackend when no GPU feature (or explicit --backend cpu); remove the hard-error (jacquard.rs ~1684–1691, the cmd_cosim branch — not the sim hard-error at ~507).
Linux CI job — ubuntu-latest runs xprop_cosim (cosim mode), dual_uart, apb_trace via CpuBackend, asserting their checkers + the cross-backend VCD equivalence (below).
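The two `CpuBackend::new` guards from step 5 can be sketched as a standalone check. This is an illustration under assumptions: the plan calls for asserts, whereas the sketch returns a `Result` so the conditions are easy to exercise; `ScriptCaps` and `check_cpu_backend_supported` are hypothetical names that mirror only the three fields mentioned above.

```rust
/// Hypothetical subset of the script fields the guards inspect.
struct ScriptCaps {
    timing_arrivals_enabled: bool,
    xprop_enabled: bool,
    sram_storage_size: usize,
}

/// Reject configurations the CPU reference path cannot model yet.
fn check_cpu_backend_supported(s: &ScriptCaps) -> Result<(), String> {
    if s.timing_arrivals_enabled {
        // Arrival readback rides the GPU ring, so timed cosim is deferred.
        return Err("timed cosim is not supported on the CPU backend".to_string());
    }
    if s.xprop_enabled && s.sram_storage_size > 0 {
        // The CPU stepper takes no sram_xmask, so SRAM would read as always-known.
        return Err("--xprop with SRAM is not supported on the CPU backend".to_string());
    }
    Ok(())
}
```

Logic-only xprop (no SRAM) passes both checks, which matches the xprop_cosim fixture being usable on the CPU path.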

Testing strategy

Metal bit-identical after steps 1–4: /tmp/claude/cosim_fixtures.sh + shasum -c against the Phase 0 golden; jtag_minimal 4M PASS.
CpuBackend correctness (steps 5–7): extend the #113 cross-backend harness (scripts/ci/compare_backend_vcds.py) to cosim — run the same fixture on CPU vs Metal, assert byte-identical output VCDs. The CPU PeripheralModel is ground truth.
cargo test --lib (no feature) must compile + pass once the trait/CpuBackend are non-gated.

Risks

Step 2 size — relocating ~54 sites is the bulk; split into 2a/2b/2c (above), each Metal bit-identical, so the diff stays reviewable/bisectable.
Module-split compile gate (step 4) — any stray metal:: import in mod.rs breaks cargo check --lib (no feature). Step 1's exit assertion + step 3's token sweep front-load this; step 4 only moves code.
UART CPU decoder divergence — equivalence-test against Metal's gpu_io_step output on dual_uart before relying on it.
simulate_block_v1 X-mask handling — the CPU stepper must replicate the GPU state_prep X-mask clear for --xprop parity (xprop_cosim is the guard — but note it is SRAM-less, so SRAM-xprop is not covered, below).
SRAM xprop unsupported on CpuBackend — cpu_reference::simulate_block_v1 (cpu_reference.rs:17) takes no sram_xmask, so --xprop on a SRAM-containing design would read SRAM as always-known (no X). CpuBackend::new asserts !(xprop_enabled && sram_storage_size > 0) until resolved. Logic-only xprop (the xprop_cosim fixture) is fine.
Timed cosim deferred — arrival readback rides the GPU ring; CpuBackend::new asserts !timing_arrivals_enabled. effective_state_size() (flatten.rs:1582) still sizes the Vec, but the arrival section stays zero.
--check-with-cpu self-comparison — disabled-with-warning under CpuBackend (it would compare simulate_block_v1 to itself). Step 5.
Generic run_cosim<B> monomorphisation — only two backends; negligible.
Stacked-PR rebase — #118 must merge (or this rebases) before Phase 2; the handoff's stacked-PR gotcha applies.

Out of scope (→ Phase 2)

CUDA/HIP backend + Tier-2 GPU peripherals; the GpuPeripheral trait. Phase 1 defines no GPU-peripheral seam — only the CPU reference path + the agnostic orchestration that Phase 2 builds on.

CUDA/HIP Parity for Release — cosim Phase 2 + #104 sim timing

Status: ✅ SHIPPED — merged to main in PR #120 (2026-06-21, rebase). Track 0 (#104 sim timing) + Stages A/B/C (T1.1–T2.2) all done and T4-green. Remaining: T2.3 (optional, below) + the performance follow-up (issue #122 — managed-memory profiling/tuning; the track closed on correctness, performance is untuned).

Goal (maintainer): "close off CUDA and HIP before release" — bring both GPU backends to Metal parity on the two paths that lag today:

#104 — sim setup/hold timing-violation detection (Metal-only today).
#105 Phase 2 — cosim on CUDA/HIP with batching (Metal-only today; a --features cuda build currently falls through to CpuBackend).

Architecture: ADR 0017 (Layer 1/2/3 + peripheral contract); staging in cosim-backend-portability.md (this is the Phase-2 detail doc, sibling to cosim-phase1-cpu-backend.md).

Release bar (decided): full batched cosim (checkpoints 2a and 2b), not a per-edge-only intermediate. Per-edge-only CUDA is an unusably-slow artifact (PCIe round-trip per edge); 2b is what makes it a real backend.

Hard constraint: validation is CI-only on the T4

This work cannot be built or run on the dev machine (Apple Silicon → Metal only). Every CUDA/HIP iteration is a CI round-trip on tesla4-runner. The workflow consequences:

Maximise local confidence before each push: cargo check-equivalent reasoning, mirror the proven Metal path line-for-line, keep diffs reviewable.
Batch changes per push to minimise CI cycles; expect red→green iteration on the T4 for genuinely GPU-dependent bugs (struct ABI mismatch, fence scope, __shfl width).
The cross-backend equivalence harness (compare_backend_vcds.py) and the Phase-1 CPU goldens (tests/*/expected/) are the correctness oracle — CUDA/HIP output must be byte-identical to CPU/Metal.

Track 0 — #104: CUDA/HIP `sim` timing (warm-up, Rust-only)

Why first: smallest, independent, zero kernel work, and it exercises the full CUDA/HIP build + T4 CI loop before the hard cosim work. Confirmed by investigation:

The kernel-side timing logic is already in the shared csrc/kernel_v1_impl.cuh (simulate_block_v1 arrival writeback at :530-532, setup/hold write_event at :546-553).
The timed C launchers already exist: simulate_v1_noninteractive_timed_cuda (kernel_v1.cu:50), ..._timed_hip (kernel_v1.hip.cpp:70) — both call the same kernel as Metal, just passing timing_constraints + event_buffer through instead of nullptr.
ucc::bindgen will auto-surface them as ucci::simulate_v1_noninteractive_timed / ucci_hip::... (suffix stripped) with no build.rs change.
Today sim_cuda/sim_hip accept timing_constraints but call the untimed simulate_v1_noninteractive_simple_scan (jacquard.rs:1196 / :1365) and drop the arg.

The gap is entirely Rust (~120-150 lines):

Un-gate TimingReportConfig (jacquard.rs:13) and its impl: change #[cfg(feature="metal")] to #[cfg(any(feature="metal", feature="cuda", feature="hip"))] (the struct has no Metal-specific fields).
Un-gate report_cfg construction in cmd_sim (jacquard.rs:518-550) for cuda/hip.
sim_cuda: add report_cfg: &TimingReportConfig param; when timing_constraints.is_some() → allocate Box<EventBuffer>, call ucci::simulate_v1_noninteractive_timed(...), then process_events() once post-run feeding ReportBuilder; else keep simple_scan. Add the expand_states_for_arrivals call (mirror sim_metal :804-807) when script.timing_arrivals_enabled.
sim_hip: identical.
Update the two cmd_sim call sites (jacquard.rs:564-580) to pass &report_cfg.

Known limitation (acceptable, pre-existing): the CUDA/HIP sim path is a single bulk cooperative launch, so the EventBuffer is drained once at end-of-run, not per-cycle. MAX_EVENTS=1024 with an overflow flag; no early-exit on $finish from events. Out of scope to change here.
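The drain-once behaviour can be pictured with a small host-side sketch. Only MAX_EVENTS=1024 and the existence of an overflow flag come from the text above; the Event fields, the buffer layout, and the push/drain_once names are hypothetical and chosen purely for illustration.

```rust
/// Capacity quoted in the plan; everything else here is a hypothetical layout.
const MAX_EVENTS: usize = 1024;

#[derive(Clone, Copy, Debug, PartialEq, Default)]
struct Event {
    cycle: u32, // hypothetical field
    kind: u32,  // hypothetical field
}

struct EventBuffer {
    count: usize,
    overflow: bool,
    events: [Event; MAX_EVENTS],
}

impl EventBuffer {
    fn new() -> Self {
        EventBuffer { count: 0, overflow: false, events: [Event::default(); MAX_EVENTS] }
    }

    /// Producer side (the kernel in the real system): once the buffer is
    /// full, further events are dropped and the overflow flag is raised.
    fn push(&mut self, ev: Event) {
        if self.count < MAX_EVENTS {
            self.events[self.count] = ev;
            self.count += 1;
        } else {
            self.overflow = true;
        }
    }

    /// Consumer side: a single end-of-run drain. Returns the stored events
    /// and whether any were lost, then resets the buffer.
    fn drain_once(&mut self) -> (Vec<Event>, bool) {
        let drained = self.events[..self.count].to_vec();
        let overflowed = self.overflow;
        self.count = 0;
        self.overflow = false;
        (drained, overflowed)
    }
}
```

With a single drain at end-of-run, anything past the first 1024 events is only visible as the overflow flag, which is why the limitation is called out.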

Verification: extend the cuda/hip CI jobs to run the existing timing fixture (tests/timing_test/dff_test_synth.gv with constraints) and assert the timing report matches Metal's. Add the timing VCD to backend-equivalence.

Exit: jacquard sim --features cuda … --timing-report produces the same violations/report as Metal on the T4; --timed arrival VCD matches.

Decision (2026-06-19): one backend, no CPU-peripheral intermediate

The originally-planned checkpoint 2a (a CUDA backend with CPU peripherals, per-edge) is dropped. No production backend works that way — Metal always runs peripherals on the GPU, falling to batch=1 (not to CPU peripherals) for model-driven-clock designs (JTAG). A CPU-peripheral CUDA variant would be a bring-up crutch that exists nowhere else and muddies the architecture.

Target: a single CudaBackend/HipBackend that mirrors MetalBackend — GPU design step + GPU peripherals + variable batching + managed memory. CpuBackend remains the pure-CPU reference oracle. Bisectability (2a's only real benefit) is recovered by staging on the fixtures, since each exercises a different kernel subset — no separate backend:

Stage A — CudaBackend with the design step only (cosim_state_prep + cosim_simulate_stage, landed in T1.1) + batched orchestration + managed memory + VCD ring. Validate against xprop_cosim (4 logic fixtures: no flash/UART/bus). Proves the whole pipeline end-to-end with zero peripherals.
Stage B — port gpu_io_step (UART + bus). Validate dual_uart + apb_trace.
Stage C — port the flash kernels (gpu_apply_flash_din, gpu_flash_model_step). Validate the flash/JTAG fixtures.

Every stage is the real architecture, gated against the committed CPU/Metal goldens on the T4.

Track 1 — CUDA/HIP cosim backend (mirrors MetalBackend)

1a. Kernels (shared `kernel_v1_impl.cuh`, two thin launchers)

CUDA/HIP have zero cosim kernels today. Add, mirroring the Metal shader:

simulate_v1_stage — a non-cooperative per-stage __global__ (num_blocks blocks × 256 threads), one major stage per launch; host loops stages, each launch is the grid barrier. The body is the already-shared simulate_block_v1 (impl.cuh:32) — the new global is a thin wrapper indexing blocks_start[stage_i*num_blocks + blockIdx.x], input slot states[0], output slot states[state_size] (current_cycle=0, 2-slot ping-pong). This avoids cooperative launch entirely for cosim (the hard-to-port grid.sync stays sim-only). Mirror the Metal SimParams ABI exactly.
state_prep — port from shader:679 (output→input copy + per-BitOp set + driven-bit X-mask clear at xmask_state_offset). Use __threadfence() (device scope) between copy and bit-ops to match Metal's mem_device barrier.
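As a rough host-side picture of the per-stage launch scheme, the sketch below reproduces the index arithmetic quoted above: the blocks_start index for a (stage, block) pair and the two state slots. The function names are hypothetical, and the real launchers work on device pointers rather than Rust ranges.

```rust
use std::ops::Range;

/// Index into `blocks_start` for block `block_idx` of major stage `stage_i`,
/// following the `stage_i*num_blocks + blockIdx.x` formula in the text.
fn blocks_start_index(stage_i: usize, num_blocks: usize, block_idx: usize) -> usize {
    assert!(block_idx < num_blocks, "block index out of range");
    stage_i * num_blocks + block_idx
}

/// Word ranges of the input slot (states[0..]) and the output slot
/// (states[state_size..]) in the 2-slot ping-pong buffer.
fn slot_ranges(state_size: usize) -> (Range<usize>, Range<usize>) {
    (0..state_size, state_size..2 * state_size)
}

/// Host-side stage loop: one launch per major stage, each covering every
/// block. Returns the `blocks_start` indices touched by each launch.
fn launch_order(num_major_stages: usize, num_blocks: usize) -> Vec<Vec<usize>> {
    (0..num_major_stages)
        .map(|stage_i| {
            (0..num_blocks)
                .map(|b| blocks_start_index(stage_i, num_blocks, b))
                .collect()
        })
        .collect()
}
```

Because each inner vector corresponds to one launch, the boundary between launches plays the role that grid.sync plays in the cooperative sim kernel.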

Add extern "C" _cuda/_hip launchers to kernel_v1.cu/.hip.cpp. Replicate SimParams and StatePrepParams #[repr(C)] structs in a device header with identical field order/padding (see the GPU-struct catalogue captured in the Phase-2 investigation, folded into the appendix below).

1b. The unified-memory abstraction (the central porting decision)

Metal uses StorageModeShared (one physical pointer; CPU write = upload). CUDA/HIP have no arbitrary-host zero-copy. Decision: use managed memory (cudaMallocManaged/hipMallocManaged) for the v1 backend — it is the closest functional analog to Metal's unified buffers and keeps the backend code structurally identical to MetalBackend (CPU casts the pointer directly; the driver migrates pages). Revisit pinned-host + explicit-mirror only if managed memory's page-migration cost dominates (profile on the T4 in 2b). The edge_ops_mut / state / sram / drain accessors in the trait already isolate this — only the backend struct changes.

Buffers to allocate in CudaBackend::new (sizes per the Metal catalogue): states (2×state_size), sram_data, sram_xmask, blocks_start, blocks_data (reuse the ulib UVec device copies where they exist), event_buffer, optional timing_constraints, and the per-edge schedule (StatePrepParams, Vec<BitOp>) pairs.

1c. `CudaBackend`/`HipBackend` Rust impls

New src/sim/cosim/cuda.rs (and hip.rs), #[cfg(feature="cuda")] / #[cfg(feature="hip")], implementing CosimBackend exactly like MetalBackend:

new — allocate managed buffers (1b).
init_schedule — store per-edge (params, ops) in managed buffers.
edge_ops_mut — slice over the managed ops buffer (write = visible to GPU after the next launch in-stream; no explicit flush with managed memory, but add cudaStreamSynchronize discipline at the documented points).
run_edges — batch BATCH_SIZE edges into one cudaStream_t, mirroring Metal's encode_and_commit_gpu_batch. Per edge: cosim_state_prep → gpu_apply_flash_din → cosim_simulate_stage × num_major_stages → gpu_flash_model_step → gpu_io_step → memcpy output→VCD-ring slot. Stream ordering gives sequential execution (no inter-kernel barriers). (Stage A wires only state_prep + simulate; B adds gpu_io_step; C adds the flash kernels.)
state/state_mut/sram — direct managed-pointer slices.
drain_uart_tx/drain_bus_beats/flash_d_i — read the GPU ring buffers (managed-memory cursors), exactly like MetalBackend::drain_* (Stages B/C).
wait — cudaEvent_t recorded at end-of-batch + cudaEventSynchronize (replaces Metal SharedEvent).

1d. cmd_cosim dispatch

jacquard.rs:1784-1789 currently: metal → run_cosim, else → run_cosim_cpu. Add run_cosim_cuda/run_cosim_hip shims (run_cosim_generic::<CudaBackend>) with the same priority order as sim (metal > cuda > hip > cpu).

Verification: add jacquard cosim invocations to the cuda/hip CI jobs, output VCD byte-identical to the CPU golden in tests/*/expected/ and to Metal. Stage A: xprop_cosim (batched, no peripherals). Stage B: dual_uart + apb_trace. Stage C: flash/JTAG fixtures. Model-driven-clock designs (JTAG) run GPU peripherals at batch=1 within the same backend — not a separate path.

Peripheral kernels (Stages B/C)

Port the three GPU peripheral kernels so peripherals run inside the batch (eliminating any per-edge round-trip) — the same model as Metal.

peripheral kernels (shared impl header + two launchers each)

gpu_apply_flash_din (shader:904) — write FlashState.d_i → input state.
gpu_flash_model_step (shader:943) — SPI/QSPI flash FSM, dual-step setup delay; needs the 16 MiB firmware buffer + FlashState persistent struct.
gpu_io_step (shader:1170) — UART-TX decoder (4-state FSM ×4 channels) + APB3 bus-trace beat extraction + legacy WB trace; writes ring buffers (UartChannel, BusTraceChannel).

Each is single-thread (thread 0) work — straightforward ports. Replicate the full GPU-struct ABI (FlashState, FlashDinParams, FlashModelParams, UartParams/UartChannel/UartDecoderState, BusTraceParamsAll/BusTraceChannel, etc.) — exhaustive field layouts captured in the appendix.

As each peripheral kernel lands, wire it into CudaBackend::run_edges (Stage B = gpu_io_step; Stage C = the flash kernels) and switch the corresponding drain_*/flash_d_i accessors to read the GPU ring buffers. Define the GpuPeripheral seam (ADR 0017 Layer 3) here so Phase 3 (Tier-3 single-source) can slot in later. CpuBackend (Tier-1) is the per-kernel equivalence oracle — equivalence-test each GPU kernel against its CPU model.

Verification: cosim --features cuda on each fixture (Stage B/C) runs batched (telemetry batch > 1), output VCD byte-identical to the CPU/Metal goldens on tesla4-runner. Model-driven-clock designs (JTAG) match at batch=1.

CI: the cross-backend cosim equivalence gate

compare_backend_vcds.py is N-way and backend-agnostic (no change needed). Add to the cuda and hip-on-nvidia jobs a jacquard cosim step per fixture, upload the VCDs, and extend backend-equivalence with cosim comparisons (cuda vs hip vs metal vs the committed CPU golden). This closes the current gap: cosim is CI-covered on CPU only; sim equivalence covers GPU only.

Sequencing & checkpoints (each = one reviewable PR-sized push, CI-gated)

#	Stage	Deliverable	CI gate
T0	#104	CUDA/HIP sim timing wired (Rust-only)	✅ timing report == Metal on T4
T1.1	—	`cosim_state_prep` + `cosim_simulate_stage` kernels + launchers compile	✅ `cuda`/`hip` build green
T1.2	A	`CudaBackend`/`HipBackend` (managed mem, batched, design-step only) + cmd_cosim dispatch + cosim CI	✅ `xprop_cosim` cosim VCD == CPU/Metal golden
T2.1	B	`gpu_io_step` ported + wired; GPU UART/bus drains	✅ `dual_uart` + `apb_trace` == golden
T2.2	C	`gpu_apply_flash_din` + `gpu_flash_model_step` ported + wired; GPU flash drain	✅ `mcu_soc` flash cosim == golden (T4)
T2.3	—	`GpuPeripheral` seam + cross-backend cosim equivalence gate	⬜ optional/not started — backend-equivalence (cosim) green

T0–T2.2 all ✅ merged in PR #120 (2026-06-21). Only T2.3 remains (optional; the flash gate already achieves cross-backend equivalence transitively by diffing each backend against the same committed golden). Performance tuning of the v1 managed-memory backend is tracked separately in issue #122.

CUDA and HIP land together at each step (shared *_impl.cuh; HIP = a thin second launcher + mod ucci_hip).

Risks

CI-only validation — the dominant friction. Mitigate by mirroring Metal exactly and batching pushes.
GPU-struct ABI drift — any field-order/padding mismatch between the device header and Rust #[repr(C)] silently corrupts. Add static-size asserts both sides; the equivalence test catches behavioural drift.
Managed-memory perf — page migration may dominate; profile in 2b, fall back to pinned+mirror only if needed (isolated behind the trait).
__shfl_down_sync width / fence scope — simd_shuffle_down → __shfl_down_sync(0xFFFFFFFF,…); mem_device → __threadfence(). Classic per-arch GPU bugs; surface only on the T4.
Two kernel families per peripheral (CUDA+HIP shared impl, Metal .metal) — equivalence tests against the CPU model are the guard; Tier 3 removes the hand-maintenance eventually.

Appendix — authoritative source material

The exhaustive Metal cosim spec (every kernel signature, the per-edge dispatch ordering, all device-buffer layouts, the full GPU #[repr(C)] struct catalogue, and the Metal-specific constructs needing CUDA analogs) was captured during the 2026-06-17 Phase-2 investigation. Primary sources to port from: csrc/kernel_v1.metal (kernels), src/sim/cosim/metal.rs (MetalBackend, run_edges, encode_*, drains), csrc/kernel_v1_impl.cuh (shared simulate_block_v1). The #104 gap analysis (Rust-only, ~120-150 lines) and the build/FFI/CI mechanism (ucc bindgen auto-surfaces _cuda/_hip launchers; no build.rs change to add kernels) are recorded in the same investigation.

Stage B (T2.1) port spec — `gpu_io_step` on CUDA/HIP + batched `run_edges`

Status: Spec for review — not yet implemented. Sibling detail doc to cosim-phase2-cuda-hip.md (the authoritative plan; Stage B = its checkpoint T2.1). Resumed from docs/handoffs/backend-alignment-handoff.md.

Goal: bring CudaBackend/HipBackend from design-step-only (Stage A) to GPU peripherals (UART + bus) running inside a batched launch, byte-identical to the CPU/Metal goldens, validated CI-only on tesla4-runner. Two decisions already taken by the maintainer this session:

GPU peripherals, not a CPU-peripheral crutch (the dropped "2a").
Batch in this stage — rework run_edges off its current per-edge DEVICE.synchronize() into one async launch sequence + a single end-of-batch sync, mirroring Metal's encode_and_commit_gpu_batch.

CI gate: dual_uart + apb_trace cosim VCD/CSV byte-identical to tests/{dual_uart,apb_trace}/expected/ on cuda and hip.

1. Reference map (port from → to)

Concern	Metal / CPU source (read)	CUDA/HIP target (write)
Kernel body	`csrc/kernel_v1.metal:1170` `gpu_io_step`	`csrc/kernel_v1_impl.cuh` new `__global__ gpu_io_step`
GPU structs	`kernel_v1.metal:1041-1163`	device header in `kernel_v1_impl.cuh`
Launcher	(Metal `encode_io_step`)	`kernel_v1.cu` / `.hip.cpp` `gpu_io_step_{cuda,hip}`
Rust `#[repr(C)]` ABI	`metal.rs:113-250`	`cuda.rs`/`hip.rs` (or shared in `mod.rs`)
IO buffer build	`metal.rs:1446` `build_io_buffers`	`CudaBackend::new`
Bus params/positions	`mod.rs:149` `build_bus_trace` (agnostic, already shared)	reuse as-is
Batched encode	`metal.rs:643` `encode_and_commit_gpu_batch`	`CudaBackend::run_edges`
UART drain	`metal.rs:2116` `drain_uart_tx`	`CudaBackend::drain_uart_tx`
Bus drain	`metal.rs:2132` `drain_bus_beats`	`CudaBackend::drain_bus_beats`
CPU equivalence oracle	`mod.rs:3805-4045` (`CpuBackend` FSM)	(test target, no change)

Stage-B fixtures exercise UART + APB3 only. The legacy Wishbone (WbTrace*) path in gpu_io_step is dead for dual_uart/apb_trace (has_trace == 0, n_buses drives APB). Port the WB block for ABI/structural parity with Metal but it is not on the Stage-B critical path (no WB fixture until later).

2. GPU-struct ABI to replicate (device header in `kernel_v1_impl.cuh`)

Exact field order/padding from kernel_v1.metal:1041-1163. Constants: MAX_UARTS, UART_CHANNEL_CAP (already defined for Metal — confirm the values and re-declare in the .cuh), WB_TRACE_MAX_ADR_BITS=30, …_DAT_BITS=32, WB_TRACE_CHANNEL_CAP=16384, MAX_BUS_TRACES=4, BUS_TRACE_MAX_ADR_BITS=32, …_DAT_BITS=32, BUS_TRACE_CHANNEL_CAP=16384, BUS_PROTO_APB3=0.

Structs (CUDA struct, plain — no device/constant qualifiers): UartDecoderState{state,last_tx,start_cycle,bits_received,value,current_cycle}, UartPerChannelConfig{tx_out_pos,cycles_per_bit}, UartParams{state_size,n_uarts,_pad[2],channels[MAX_UARTS]}, UartChannel{write_head,capacity,_pad[2],data[UART_CHANNEL_CAP]}, WbTraceParams/WbTraceEntry/WbTraceChannel, BusTraceParams{protocol,addr_bits,data_bits,sel_pos,enable_pos,ready_pos,write_pos,resp_pos,addr_pos[32],wdata_pos[32],rdata_pos[32]}, BusTraceParamsAll{n_buses,_pad[3],buses[MAX_BUS_TRACES]}, BusTraceEntry{tick,flags,addr,wdata,rdata}, BusTraceChannel{write_head,capacity,current_tick,prev_gate} (+ entries at byte-offset 16).

ABI guard (required): add static_assert(sizeof(X) == …) on the C side and const _: () = assert!(size_of::<X>() == …) on the Rust side for every struct that crosses FFI. ABI drift is the #1 risk (silent corruption); the Rust ABI already exists at metal.rs:113-250 and is the source of truth for sizes.
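A minimal sketch of the Rust half of such a guard follows. The field names come from the struct list above; the u32 field types are an assumption (consistent with the 16-byte BusTraceChannel header, since entries start at byte-offset 16), and BusTraceChannelHeader and bus_entry_offset are hypothetical names used only here.

```rust
use std::mem::{offset_of, size_of};

/// Header part of BusTraceChannel. u32 fields are assumed so that the
/// header occupies 16 bytes, matching "entries at byte-offset 16".
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct BusTraceChannelHeader {
    write_head: u32,
    capacity: u32,
    current_tick: u32,
    prev_gate: u32,
}

/// One captured beat; the five u32 fields are an assumption.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct BusTraceEntry {
    tick: u32,
    flags: u32,
    addr: u32,
    wdata: u32,
    rdata: u32,
}

// Compile-time guards: a layout change turns into a build failure
// instead of silent corruption at run time.
const _: () = assert!(size_of::<BusTraceChannelHeader>() == 16);
const _: () = assert!(size_of::<BusTraceEntry>() == 20);
const _: () = assert!(offset_of!(BusTraceChannelHeader, prev_gate) == 12);

/// Byte offset of entry `i` in a flat channel buffer (header, then entries).
const fn bus_entry_offset(i: usize) -> usize {
    size_of::<BusTraceChannelHeader>() + i * size_of::<BusTraceEntry>()
}
```

The device header would carry the matching static_assert lines, so a mismatch is caught on whichever side is compiled first.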

The Rust #[repr(C)] mirrors at metal.rs:113-250 are #[cfg(feature="metal")]. Action: lift UartParams/UartPerChannelConfig/UartChannel/UartDecoderState/BusTraceParamsAll/BusTraceParams/BusTraceEntry/BusTraceChannel (+ WB structs) into the agnostic parent (mod.rs, non-gated) so all three backends share one ABI definition, then have metal.rs, cuda.rs, hip.rs use super:: them. This removes the triple-maintenance the plan's Risk "two kernel families" warns about. (build_bus_trace/BusTracePositions are already agnostic in mod.rs — this extends that pattern.)

3. Kernel: `gpu_io_step` (shared `kernel_v1_impl.cuh`)

Direct transliteration of kernel_v1.metal:1180-1354. Metal→CUDA mechanical substitutions:

kernel void → __global__ void; [[buffer(n)]] args → plain pointers; tid [[thread_position_in_threadgroup]] → threadIdx.x. Single-thread work: keep the if (tid != 0) return; guard → if (threadIdx.x != 0) return;.
device/constant qualifiers dropped. uchar → unsigned char / u8, u32 already aliased in the .cuh.
The READ_OUT_BIT macro (shader :1186) ports verbatim (reads states[state_size + (pos>>5)] — the output slot; 0xFFFFFFFF ⇒ 0).
No barriers needed — pure thread-0 serial logic, no simd_*/threadfence.
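For cross-checking the port, the READ_OUT_BIT behaviour described above can be written as a plain Rust function. This is a sketch, not the macro itself: the word index and the 0xFFFFFFFF sentinel come from the text, while selecting the bit with pos & 31 is an assumption about the packing, and read_out_bit is a hypothetical name.

```rust
/// Sentinel position meaning "signal not mapped"; such reads yield 0.
const UNMAPPED: u32 = 0xFFFF_FFFF;

/// Read one bit from the output slot of the 2-slot state buffer.
/// `states` holds the input slot followed by the output slot.
fn read_out_bit(states: &[u32], state_size: usize, pos: u32) -> u32 {
    if pos == UNMAPPED {
        return 0;
    }
    // Word index as in the text: state_size + (pos >> 5).
    let word = states[state_size + (pos >> 5) as usize];
    // Assumed bit index within the 32-bit word.
    (word >> (pos & 31)) & 1
}
```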

The UART FSM (4-state: IDLE/START/DATA/STOP), WB trace, and APB3 rising-edge gate logic copy line-for-line. The CPU FSM at mod.rs:3923-4039 is the already-verified Rust twin — cross-check the port against it (same cycles_per_bit/2 midpoints, same value |= tx << bits_received, same gate = sel & en & rdy + rising-edge (prev>>b)&1 == 0).
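To make the midpoint-sampling idea concrete, here is a small standalone decoder in the same 4-state shape. It is an illustrative sketch, not a copy of the shader or of the CpuBackend FSM: the field names follow UartDecoderState, but the exact transition conditions and the frame helper are assumptions.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum UartState { Idle, Start, Data, Stop }

/// Field names follow UartDecoderState; transition rules are assumed.
struct UartDecoder {
    state: UartState,
    last_tx: u32,
    start_cycle: u32,
    bits_received: u32,
    value: u32,
    current_cycle: u32,
}

impl UartDecoder {
    fn new() -> Self {
        // Idle state with last_tx = 1, since the line idles high.
        UartDecoder { state: UartState::Idle, last_tx: 1, start_cycle: 0,
                      bits_received: 0, value: 0, current_cycle: 0 }
    }

    /// Advance one tick with the sampled TX level (0 or 1). Returns a byte
    /// when a frame with a valid stop bit completes.
    fn step(&mut self, tx: u32, cycles_per_bit: u32) -> Option<u8> {
        let elapsed = self.current_cycle.wrapping_sub(self.start_cycle);
        let mut out = None;
        match self.state {
            UartState::Idle => {
                if self.last_tx == 1 && tx == 0 {
                    // Falling edge: possible start bit.
                    self.state = UartState::Start;
                    self.start_cycle = self.current_cycle;
                }
            }
            UartState::Start => {
                if elapsed >= cycles_per_bit / 2 {
                    // Midpoint of the start bit: confirm it is still low.
                    if tx == 0 {
                        self.state = UartState::Data;
                        self.start_cycle = self.current_cycle;
                        self.bits_received = 0;
                        self.value = 0;
                    } else {
                        self.state = UartState::Idle;
                    }
                }
            }
            UartState::Data => {
                if elapsed >= cycles_per_bit {
                    // One bit period after the last midpoint; LSB first.
                    self.start_cycle = self.current_cycle;
                    self.value |= tx << self.bits_received;
                    self.bits_received += 1;
                    if self.bits_received == 8 {
                        self.state = UartState::Stop;
                    }
                }
            }
            UartState::Stop => {
                if elapsed >= cycles_per_bit {
                    if tx == 1 {
                        out = Some(self.value as u8);
                    }
                    self.state = UartState::Idle;
                }
            }
        }
        self.last_tx = tx;
        self.current_cycle = self.current_cycle.wrapping_add(1);
        out
    }
}

/// Hypothetical helper: idle, start bit, 8 data bits (LSB first), stop, idle.
fn frame(byte: u8, cycles_per_bit: u32) -> Vec<u32> {
    let mut levels = vec![1u32, 0];
    levels.extend((0..8).map(|i| ((byte >> i) & 1) as u32));
    levels.push(1);
    levels.push(1);
    levels.into_iter()
        .flat_map(|l| std::iter::repeat(l).take(cycles_per_bit as usize))
        .collect()
}

/// Run the decoder over a whole waveform and collect the decoded bytes.
fn decode(wave: &[u32], cycles_per_bit: u32) -> Vec<u8> {
    let mut dec = UartDecoder::new();
    wave.iter().filter_map(|&tx| dec.step(tx, cycles_per_bit)).collect()
}
```

Feeding frame(b, n) into decode(.., n) is a quick way to sanity-check a port on the host before spending a CI round-trip; the byte-identical fixture gate remains the real oracle.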

4. Launcher: `gpu_io_step_{cuda,hip}` (`kernel_v1.cu` / `.hip.cpp`)

Follow the existing cosim_state_prep_cuda (kernel_v1.cu:84) pattern exactly — extern "C", raw pointer args, <<<1, 256>>> launch, checkCudaErrors. ucc strips _cuda/_hip and appends the Device arg, surfacing ucci::gpu_io_step(...) automatically (no build.rs change). Signature mirrors the kernel: (u32* states, UartDecoderState* uart_state, const UartParams* uart_params, UartChannel* uart_channel, WbTraceChannel* wb_channel, const WbTraceParams* wb_params, BusTraceChannel* bus_channel, const BusTraceParamsAll* bus_params). HIP launcher is the same body in .hip.cpp (shared .cuh kernel).

VCD-ring snapshot launcher (needed for §6 batching): add cosim_snapshot_{cuda,hip} — cudaMemcpyAsync(ring + edge_off*2*state_size, states, 2*state_size*4, cudaMemcpyDeviceToDevice, 0). This is the CUDA analog of Metal's per-edge blit (metal.rs:732-742); it lets each edge's [input|output] slot be retained in a device ring while states is overwritten by the next edge — without it, batching (no per-edge sync/readback) loses all but the final snapshot.

5. Backend wiring — `CudaBackend::new` (IO buffers)

Mirror build_io_buffers (metal.rs:1446), substituting device-resident UVec for metal::Buffer. New backend fields (all UnsafeCell<UVec<…>> or UVec):

uart_state: UVec<UartDecoderState> — MAX_UARTS, init state=0,last_tx=1.
uart_params: UVec<UartParams> — state_size, n_uarts, per-channel tx_out_pos = gpio_map.output_bits[tx_gpio], cycles_per_bit = cpb * sched_ticks_per_sys_clk_cycle (the three args currently prefixed _ in CudaBackend::new — un-prefix _gpio_map, _uart_configs, _sched_ticks_per_sys_clk_cycle).
uart_channel: UVec<UartChannel> — MAX_UARTS, capacity=UART_CHANNEL_CAP, write_head=0.
wb_params/wb_channel (via build_wb_trace_params — also agnostic-lift or cfg-gate; low priority, no Stage-B fixture).
bus_params: UVec<u8>-backed BusTraceParamsAll + bus_channel (header + BUS_TRACE_CHANNEL_CAP entries, byte-sized buffer). Build params from build_bus_trace(aig, netlistdb, script, config.effective_bus_traces()) (mod.rs:149) → pack BusTracePositions into BusTraceParams (the packing loop lives in metal.rs build_bus_trace_params:1126 — extract the positions→BusTraceParamsAll packer into mod.rs so cuda/hip/metal share it, same move already done for the lanes).
Per-channel read cursors uart_read_heads: Vec<u32>, bus_trace_read_head: u32 (host-side, mirror MetalBackend).

new returns (Self, bus_lanes) — bus_lanes now real (from build_bus_trace), replacing the Stage-A empty vec![].

6. Backend wiring — `run_edges` (the batching rework)

Current Stage A (cuda.rs:342): per edge → state_prep, simulate_stage × N, DEVICE.synchronize(), host read-back into VCD ring. The per-edge sync is what makes it not-yet-batched.

Target (mirror encode_and_commit_gpu_batch): enqueue all edges' kernels on the default stream with no intervening sync, one DEVICE.synchronize() at the very end, then read the ring + drain channels once. Per edge in 0..batch:

upload edge ops UVec (retain in a Vec for the whole batch — async launches read it after the call returns; dropping early = UB),
ucci::cosim_state_prep(...),
ucci::cosim_simulate_stage(...) × num_major_stages,
ucci::gpu_io_step(...) (NEW — UART/bus capture into the device rings),
if enable_vcd: ucci::cosim_snapshot(...) → device ring slot edge_offset.

Then one DEVICE.synchronize(). Then: read the VCD ring UVec back to host → Vec<Vec<u32>> for vcd_snapshot; UART/bus channels are managed/UVec so drain_* reads them after the sync.

Notes / invariants:

Flash kernels (gpu_apply_flash_din, gpu_flash_model_step) are Stage C — omit from the dispatch chain here (Metal's encode_*flash* calls are skipped).
wait/vcd_snapshot semantics unchanged from Stage A (token still unused; the single end-of-batch sync replaces per-edge).
Ops-buffer lifetime is the one new correctness hazard vs Stage A (which synced immediately so the UVec could drop). Collect Vec<UVec<u32>>, drop after sync.
The VCD ring UVec is sized batch_capacity * 2 * state_size; grow lazily to the largest batch seen (mirror Metal's ring sizing).
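The ring arithmetic in the notes above can be sketched on the host as follows. A Vec<u32> stands in for the device-resident UVec, and VcdRing with its methods is a hypothetical type; only the slot size (2 * state_size words), the per-edge offset, and the grow-to-largest-batch policy come from the text.

```rust
use std::ops::Range;

/// Host-side stand-in for the device VCD ring: one [input|output]
/// snapshot of 2 * state_size words per edge in the batch.
struct VcdRing {
    state_size: usize,
    data: Vec<u32>,
}

impl VcdRing {
    fn new(state_size: usize) -> Self {
        VcdRing { state_size, data: Vec::new() }
    }

    fn slot_words(&self) -> usize {
        2 * self.state_size
    }

    /// Grow lazily to the largest batch seen; never shrink.
    fn ensure_batch(&mut self, batch: usize) {
        let need = batch * self.slot_words();
        if self.data.len() < need {
            self.data.resize(need, 0);
        }
    }

    /// Word range of the slot for edge `edge_off` within the batch.
    fn slot_range(&self, edge_off: usize) -> Range<usize> {
        let w = self.slot_words();
        edge_off * w..(edge_off + 1) * w
    }

    /// Stand-in for the per-edge device-to-device copy.
    fn snapshot(&mut self, edge_off: usize, states: &[u32]) {
        assert_eq!(states.len(), self.slot_words());
        let r = self.slot_range(edge_off);
        self.data[r].copy_from_slice(states);
    }

    /// After the single end-of-batch sync: one Vec per edge.
    fn read_back(&self, batch: usize) -> Vec<Vec<u32>> {
        (0..batch).map(|e| self.data[self.slot_range(e)].to_vec()).collect()
    }

    /// Ring footprint in bytes.
    fn size_bytes(&self) -> usize {
        self.data.len() * std::mem::size_of::<u32>()
    }
}
```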

7. Backend wiring — drains

drain_uart_tx (cuda.rs:332): replace vec![] with the metal.rs:2116 loop over uart_channel[i].write_head vs uart_read_heads[i], reading data[head % capacity]. UVec → ensure host-visible (post-sync read).
drain_bus_beats (cuda.rs:337): replace vec![] with the metal.rs:2132 loop reading BusTraceEntry at byte-offset 16, building RawBeat (same flag decode: bus_id = flags>>8, write = flags&1, err = (flags>>1)&1).
flash_d_i/flash_debug_snapshot stay Stage-A stubs (flash is Stage C).
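The drain logic described above reduces to a cursor walk plus a flag decode. The sketch below assumes write_head is a monotonically increasing counter that the consumer never lets lap it; drain_ring, BeatFlags and decode_flags are hypothetical names, and the real RawBeat type carries more fields than shown.

```rust
/// Read everything between the host-side read cursor and the GPU-side
/// write head, indexing the ring with `head % capacity` as in the text.
fn drain_ring(data: &[u8], write_head: u32, read_head: &mut u32) -> Vec<u8> {
    let capacity = data.len() as u32;
    assert!(capacity > 0, "ring must have non-zero capacity");
    let mut out = Vec::new();
    while *read_head != write_head {
        out.push(data[(*read_head % capacity) as usize]);
        *read_head = read_head.wrapping_add(1);
    }
    out
}

/// Decoded view of a BusTraceEntry `flags` word.
#[derive(Debug, PartialEq)]
struct BeatFlags {
    bus_id: u32,
    write: bool,
    err: bool,
}

/// Flag decode as stated: bus_id = flags>>8, write = flags&1, err = (flags>>1)&1.
fn decode_flags(flags: u32) -> BeatFlags {
    BeatFlags {
        bus_id: flags >> 8,
        write: (flags & 1) == 1,
        err: ((flags >> 1) & 1) == 1,
    }
}
```

Because the read cursor lives on the host, a second call with an unchanged write head returns nothing, which keeps repeated drains within a run idempotent.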

8. CI

The check script already supports dual_uart + apb_trace under COSIM_SCOPE=all (scripts/ci/cosim_cpu_check.sh:97-110). Stage A restricted the cuda/hip steps to COSIM_SCOPE=logic (ci.yml:580/:734). Stage B change: flip those two steps to COSIM_SCOPE=all (or a new logic+io scope that adds just UART/bus, deferring flash/JTAG to Stage C). The goldens at tests/{dual_uart,apb_trace}/expected/ already exist (Phase-1 CPU goldens, = Metal). No new fixtures, no compare_backend_vcds.py change.

9. Risks / open questions to resolve during implementation

ulib stream/async semantics. The whole batching rework assumes the ucci launchers enqueue async on the default stream and DEVICE.synchronize() is the only barrier (true of the current launchers — they <<<>>> + cudaGetLastError, no inner sync). Verify ulib UVec host→device upload doesn't itself force a sync mid-batch (if &mut ops_uvec triggers a blocking copy each call, the "batch" still serialises on the host — acceptable functionally, but not the perf win; flag if so). This is the one assumption that could force a design change (e.g. a single batched C launcher that loops internally).
VCD ring memory. batch * 2 * state_size * 4 B device-resident. Large designs × large batch could be significant; Metal already pays this. Size to the max batch lazily.
ABI drift — mitigated by §2 static asserts both sides + the byte-identical gate.
UartChannel is 16 + UART_CHANNEL_CAP bytes — large struct in a UVec; confirm UVec<UartChannel> device alloc handles the [u8; CAP] inline array (vs a flat UVec<u8> view). A flat byte-buffer + manual offset (like the bus channel) may be cleaner than UVec<UartChannel>.
CUDA + HIP land together — shared .cuh kernel + ABI; HIP is a second launcher in .hip.cpp + mod ucci_hip in hip.rs. Keep cuda.rs/hip.rs diffs identical (they're 452-line twins today).

10. Checkpoint sequencing (each = one CI round-trip)

Step	Deliverable	Local gate	T4 gate
B0	Lift IO structs → `mod.rs` (agnostic) + ABI static-asserts; Metal still builds bit-identical	`cargo test --features metal` (298), fixtures byte-identical	—
B1	`gpu_io_step` + `cosim_snapshot` kernels + `_cuda`/`_hip` launchers	`cargo check` reasoning only	cuda+hip build green
B2	`CudaBackend`/`HipBackend` IO buffers + batched `run_edges` + drains	—	`dual_uart`+`apb_trace` == golden on cuda+hip
B3	CI: flip `COSIM_SCOPE` to include UART/bus on cuda/hip	—	green

B0 is local-verifiable (Metal) and de-risks the ABI before any T4 round-trip — do it first. B1+B2 batch into one push (kernel is useless without the backend).

Stage C (T2.2) port spec — CUDA/HIP flash kernels + (optional) CpuBackend oracle

Status: ✅ DONE (C1–C4 complete, merged in PR #120, 2026-06-21, T4-green). Plan A was taken (CpuBackend oracle first). Sibling to cosim-phase2-cuda-hip.md (Stage C = checkpoint T2.2) and cosim-phase2-stageB-io-port.md. The C2/C3/C4 sections below are kept as the as-built record. Performance tuning of the managed-memory backend is a separate follow-up (issue #122).

Goal: bring CUDA/HIP cosim to flash parity — the last stage of the CUDA/HIP cosim track. Port gpu_apply_flash_din + gpu_flash_model_step, wire into the batched run_edges, gate against a golden.

Done (all merged in PR #120; main SHAs)

C1: lifted FlashState/FlashDinParams/FlashModelParams from metal.rs to the shared mod.rs GPU-struct region (gated, size_of asserts 48/24/32). Metal bit-identical.
C2 (CpuBackend oracle): wired CppSpiFlash into CpuBackend::run_edges (per-edge apply_flash_din + dual-step flash_model_step); patched spiflash_model.cc p_d_i init 0→0x0F to match the GPU FlashState.d_i. A 10k-edge mcu_soc cosim on the no-GPU CpuBackend is byte-identical to the Metal GPU flash kernel → C++↔shader FSM equivalence proven, CpuBackend is a valid golden source. (Detail: the C2 section below.)
C3 (a071326): ported flash_eval_commit_persistent + gpu_apply_flash_din + gpu_flash_model_step to kernel_v1_impl.cuh + _cuda/_hip launchers; wired into Cuda/HipBackend::run_edges (Metal order); flash buffers built in new via shared build_flash_buffers_dev; stubs replaced with real FlashState reads. FlashState byte offsets derived via std::mem::offset_of! (ABI drift → compile error). T4-green.
C4 (7c54448): flash CI gate — 74 KB CpuBackend golden tests/mcu_soc/expected/mcu_flash.vcd (== Metal, deterministic), self-contained tests/mcu_soc/sim_config_selfcontained.json, COSIM_SCOPE=flash in cosim_cpu_check.sh, wired into Linux + CUDA/HIP T4 jobs. All three flash-gate steps T4-green (byte-identical vs golden).

Scoping findings (the part that changes the plan)

No pure-CPU flash stepper exists. CppSpiFlash (C++ FFI SPI/QSPI model, testbench.rs; step(clk,csn,d_o)->d_i) is instantiated in run_cosim_generic (mod.rs:1965, loads the 16 MiB firmware) but is _-prefixed and never stepped. The --check-with-cpu path injects backend.flash_d_i() (the GPU's value) — it does not run a CPU flash model. So CpuBackend flash is a stub (flash_d_i → 0x0F), and wiring a real CPU flash path is from-scratch work, not "connect the existing model."
CppSpiFlash has no reset API — reset handling (the GPU FlashState.in_reset) has no CPU analog; would need to idle the model (step with csn high) during reset.
C++ CppSpiFlash FSM vs GPU gpu_flash_model_step shader FSM equivalence is unvalidated. They are meant to be the same SPI FSM, but byte-identical d_i sequences across all commands/edge cases were never proven. A shared CPU/GPU golden depends on this.
Fixture: mcu_soc is the only flash cosim fixture, and it is committed + self-contained — tests/mcu_soc/data/6_final.v (19.6 MB netlist) and tests/mcu_soc/software.bin (3.6 KB firmware) are both git-tracked, so it runs on a fresh checkout without the chipflow firmware build. But it is heavy (whole SoC; full boot is 500K edges). A short run (a few k edges) exercises the deterministic boot-time flash command/address/read sequence — enough for a regression golden (flash-pin VCD or decoded flash transactions).

The decision this raises

Option (a) — "wire CppSpiFlash into CpuBackend" (chosen before findings #1–3) — is now known to be a from-scratch CPU flash stepper plus an FSM-equivalence proof, and it is orthogonal to the release goal: CUDA/HIP parity needs the GPU flash kernels gated against a golden, and Metal already produces a correct one.

Two viable plans:

Plan B-first (recommended): GPU kernels gated vs the Metal mcu_soc golden. Do C3 (port the two flash kernels to CUDA/HIP + wire into run_edges) and C4 (short mcu_soc cosim in CI; CUDA/HIP/Metal VCDs compared via the existing backend-equivalence harness, or vs a committed Metal golden). This is the direct release-critical path. The CpuBackend flash oracle becomes an optional follow-up (own effort, with the FSM-equivalence validation built in).
Plan A (as originally chosen): CpuBackend flash oracle first. Implement the CPU flash stepper (C2), prove it byte-identical to Metal on a short mcu_soc run, commit that as the no-GPU golden, then C3/C4 gate CUDA/HIP against it. Completes the Tier-1-oracle story + Linux no-GPU flash coverage, but is larger and carries the equivalence risk on the critical path.

C2 — CpuBackend flash stepper: the exact dual-step convention (chosen Plan A)

This is the make-or-break detail. CppSpiFlash::step(clk, csn, d_o) is a single eval()+commit() returning p_d_i (CXXRTL agent.step() semantics, spiflash_model.cc:165). The GPU gpu_flash_model_step is a direct port of the same spiflash_model.cc, but per call it does a dual-step with delayed CSN (shader :1008-1018), and is called once per edge (2×/tick):

// step 1: delayed csn + delayed d_out (processes the clock edge, samples old data)
flash_eval_commit(clk, prev_csn, prev_d_out)
// step 2: delayed csn + current d_out
d_i = flash_eval_commit(clk, prev_csn, d_out)
// then store for next edge:
prev_csn   = csn      // current OUTPUT csn → next edge's delayed csn
prev_d_out = d_out

where clk/csn/d_out are read from the output state slot (clk_out_pos/csn_out_pos/d_out_pos[4]), and prev_csn/prev_d_out are the previous edge's output values (the setup-delay model). model_prev_csn (the model's internal prev_csn_o edge-detect state) is threaded through both evals; because step 2 re-feeds the same prev_csn, no spurious CSN edge is seen between the two evals — CppSpiFlash's own commit() (prev_csn_o = csn) reproduces this automatically when called as above. Reset branch (shader :974-980): force d_i = 0x0F, set prev_csn=csn, prev_d_out=d_out, do not step the model.
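The dual-step convention, including the reset branch, can be sketched in Rust against a stand-in model. The FlashModel trait, DualStepFlash and RecordingModel are hypothetical names for illustration; the real code would step CppSpiFlash through its FFI wrapper. Initial values (prev_csn high, d_i = 0x0F, prev_d_out = 0) follow the GPU FlashState init described in this document.

```rust
/// Stand-in for the C++ model: one eval+commit per call, returning d_i.
trait FlashModel {
    fn step(&mut self, clk: bool, csn: bool, d_o: u8) -> u8;
}

struct DualStepFlash<M: FlashModel> {
    model: M,
    prev_csn: bool,
    prev_d_out: u8,
    d_i: u8,
}

impl<M: FlashModel> DualStepFlash<M> {
    fn new(model: M) -> Self {
        DualStepFlash { model, prev_csn: true, prev_d_out: 0, d_i: 0x0F }
    }

    /// Called once per edge with values read from the output state slot.
    fn edge(&mut self, clk: bool, csn: bool, d_out: u8, in_reset: bool) -> u8 {
        if in_reset {
            // Reset branch: force d_i and do not step the model.
            self.d_i = 0x0F;
        } else {
            // Step 1: delayed csn + delayed d_out.
            self.model.step(clk, self.prev_csn, self.prev_d_out);
            // Step 2: delayed csn + current d_out; this result is d_i.
            self.d_i = self.model.step(clk, self.prev_csn, d_out);
        }
        // Store for the next edge in both branches.
        self.prev_csn = csn;
        self.prev_d_out = d_out;
        self.d_i
    }
}

/// Test double that records every call and echoes d_o back as d_i.
struct RecordingModel {
    calls: Vec<(bool, bool, u8)>,
}

impl FlashModel for RecordingModel {
    fn step(&mut self, clk: bool, csn: bool, d_o: u8) -> u8 {
        self.calls.push((clk, csn, d_o));
        d_o
    }
}
```

Recording the calls makes the key property easy to inspect: both evals within an edge see the same delayed csn, so the model's own edge detector cannot observe a CSN transition between them.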

CpuBackend wiring (mod.rs): add fields flash: UnsafeCell<CppSpiFlash>, flash_clk/csn_out_pos, flash_d_out_pos[4], flash_d_in_pos[4], flash_xmask_off, flash_d_i, flash_prev_csn, flash_prev_d_out, flash_in_reset. In new, build CppSpiFlash from config.flash (load firmware) + resolve positions (the orchestration already computes these as locals at mod.rs:1812-1860 — mirror it). In run_edges, per edge: apply_flash_din (inject self.flash_d_i into input MISO d_in_pos, clear X-mask) before simulate_block_v1; flash_model_step (the dual-step above) after. flash_d_i() returns self.flash_d_i (drop the 0x0F stub); flash_set_in_reset stores the flag.

Init-state equivalence hazard (confirmed). The GPU inits FlashState (build_flash_buffers): data_width=1, prev_csn=1, model_prev_csn=1, d_i=0x0F, in_reset=1, rest 0. The C++ SpiFlashModel constructs State{data_width=1,…0} (matches) but p_d_i = 0 (spiflash_model.cc:27) vs the GPU's d_i = 0x0F. d_i/p_d_i is a persistent output (only written on negedge_clk during a read command's data phase), so during command/address phases CppSpiFlash would carry MISO=0 while the GPU carries 0x0F. The design doesn't sample MISO outside the read-data phase (functionally harmless), but a raw flash-pin VCD diff would flag it. Fix: change spiflash_model.cc:27 to uint8_t p_d_i = 0x0F; to match the GPU init — safe, CppSpiFlash has no live callers (verify with a repo-wide grep first). The reset branch already forces flash_d_i = 0x0F without stepping, so the pre-reset-release values align once this init is fixed. Also force the CPU reset behaviour to mirror the shader (d_i=0x0F, set prev_csn/prev_d_out, do not step) so CppSpiFlash's internal prev_csn_o stays high through reset (csn idle high in mcu_soc).

Validation (C2 — DONE, byte-identical): reproducible recipe (the /tmp config is ephemeral — regenerate it):

python3 -c "import json,pathlib; c=json.loads(pathlib.Path('tests/mcu_soc/sim_config.json').read_text()); c['flash']['firmware']='tests/mcu_soc/software.bin'; pathlib.Path('/tmp/mcu_sc.json').write_text(json.dumps(c,indent=2))"
# Build per backend inside the loop: both feature sets write target/release/jacquard,
# so building both up front would run the Metal binary twice. No features → CpuBackend.
for b in cpu metal; do f=$([ $b = metal ] && echo '--features metal'); \
  cargo build -r $f --bin jacquard && \
  ./target/release/jacquard cosim tests/mcu_soc/data/6_final.v --config /tmp/mcu_sc.json \
  --top-module top --max-clock-edges 10000 --output-vcd /tmp/mcu_$b.vcd; done
diff /tmp/mcu_cpu.vcd /tmp/mcu_metal.vcd            # byte-identical ⇒ FSMs match

~40s partitioning + ~4s sim per run. Flash is actively read by ~10k edges (cmd 0x03 @ 0x100000). This same comparison is the C3 gate for CUDA/HIP (vs the committed CpuBackend golden) and the C4 CI gate. (If a future port diverges, the C++↔shader bug is in flash_eval_commit_persistent vs eval()/commit().)
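When the two VCDs do differ, a plain `diff` can be noisy on a long run. A minimal Python sketch of a helper that reports only the first differing line is shown below; `first_difference` is a hypothetical name and not part of the repository.

```python
from itertools import zip_longest
from pathlib import Path


def first_difference(path_a, path_b):
    """Return None if the files are byte-identical, else (line_no, a, b)."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    if a == b:
        return None
    pairs = zip_longest(a.splitlines(), b.splitlines())
    for line_no, (line_a, line_b) in enumerate(pairs, start=1):
        if line_a != line_b:
            return line_no, line_a, line_b
    # Same lines but different bytes, for example different line endings.
    return 0, None, None
```

For the recipe above it would be called as `first_difference("/tmp/mcu_cpu.vcd", "/tmp/mcu_metal.vcd")`; a `None` result corresponds to the byte-identical case.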

C3 — CUDA/HIP flash kernels (mirror Stage B mechanics)

Port gpu_apply_flash_din (shader:904 — write FlashState.d_i → input-state MISO bits at d_in_pos, clear their X-mask) and gpu_flash_model_step (shader:943 — read clk/csn/d_out from output state, dual-step the SPI/QSPI FSM, update FlashState) to kernel_v1_impl.cuh + extern "C" _cuda/_hip launchers. Also port the flash_eval_commit_persistent helper (shader:843) — the per-eval primitive gpu_flash_model_step calls twice (it is the shader port of spiflash_model.cc eval()+commit()); it must live in the .cuh too. Needs the 16 MiB firmware UVec<u8> (load from config.flash.firmware like CpuBackend::new / build_flash_buffers's flash_data_buffer) + persistent FlashState (UVec<u8>, the event-buffer FFI pattern; init per build_flash_buffers: data_width=1, prev_csn=1, model_prev_csn=1, d_i=0x0F, in_reset=1). Structs already shared (C1). The launchers cross the struct/firmware buffers as u8* and cast (Stage B convention).
Wire into CudaBackend/HipBackend::run_edges in the Metal order: state_prep → gpu_apply_flash_din → simulate_stage × N → gpu_flash_model_step → gpu_io_step → snapshot. Build the flash buffers in new (mirror MetalBackend::build_flash_buffers); replace the Stage-B flash_d_i / flash_debug_snapshot stubs with GPU FlashState reads; flash_set_in_reset drives FlashState.in_reset. Model-driven-clock (JTAG) runs the peripherals at batch=1 within the same backend.

C4 — CI gate

Add a short mcu_soc (and/or jtag_minimal) flash cosim to the GPU CI jobs; compare CUDA/HIP against the golden (Metal, or CpuBackend under Plan A). Note the runtime cost of the 19.6 MB netlist on the runners; keep edge counts small.

Risks

FSM equivalence (C++ CppSpiFlash ↔ shader gpu_flash_model_step) — only matters under Plan A (shared CPU golden); Plan B sidesteps it (Metal is the GPU reference).
CI-only GPU validation on the T4, as in Stage B.
Fixture weight — mcu_soc is large; a short run must still be deterministic and exercise real flash transactions.

Interactive JTAG/DM debug server (`--jtag-server`) — plan

Status: J1–J4 implemented (interactive --jtag-server lands; J5 single-step/breakpoints deferred). User guide: docs/jtag-debug.md. Tracks #124.

Goal: add an interactive JTAG debug server to jacquard cosim so an external debugger (OpenOCD → gdb) can attach to a running GPU co-simulation and inspect the design through its RISC-V Debug Module: read/halt/resume/step architected state (GPRs, CSRs, PC, memory), exactly as the same firmware would be debugged on real silicon.

--jtag-server <port> is the interactive sibling of the existing --jtag-replay <stream>: instead of replaying a recorded remote_bitbang byte stream, it opens a live remote_bitbang TCP socket and drives the same configured TCK/TMS/TDI/TRST pins from the connected client, stepping the design in lock-step with debug transactions.

Why this is cheap: the infrastructure already exists

cosim already drives a design's DTM/DM via --jtag-replay to load firmware. The same DTM/DM, fed a live bitbang socket instead of a recording, exposes the full RISC-V external-debug interface. The architectural pieces are in place:

JtagReplayModel (src/sim/models/jtag.rs) already parses the entire remote_bitbang byte alphabet (0–7 pin drives, r/s/t/u reset, R TDO read, B/b blink, Q quit) and maps it to TCK/TMS/TDI/TRST drives.
PeripheralModel (src/sim/models/mod.rs:56) is already the right contract: step_edge(output_state, overrides, emitted) is bidirectional, and is_active() already forces batch=1 (single-edge dispatch) while JTAG is live — exactly the fine-grained stepping a debug session needs. The batched fast path is untouched when no client is attached.
cosim already resolves the jtag peripheral's pin mapping from sim_config.json and patches model-driven pins into the per-edge BitOp ops via state_prep.

So the live server is largely "swap the recorded Vec<u8> cursor for a live socket, and answer R from real TDO." No GPU/kernel change, no backend change, no PeripheralModel trait change.
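For readers unfamiliar with the remote_bitbang alphabet, the byte-to-pin mapping can be sketched in Python as below. The bit assignments follow OpenOCD's remote_bitbang documentation (`0`-`7` encode TCK/TMS/TDI, `r`/`s`/`t`/`u` encode TRST/SRST); the function name and the returned tuple shape are hypothetical, and how JtagReplayModel represents these internally is not shown here.

```python
def decode_bitbang(byte):
    """Decode one remote_bitbang byte into an (action, fields) pair."""
    ch = chr(byte)
    if "0" <= ch <= "7":
        v = byte - ord("0")
        # Bit 2 is TCK, bit 1 is TMS, bit 0 is TDI.
        return "drive", {"tck": (v >> 2) & 1, "tms": (v >> 1) & 1, "tdi": v & 1}
    if ch in "rstu":
        v = "rstu".index(ch)
        # r=00, s=01, t=10, u=11 as (trst, srst).
        return "reset", {"trst": (v >> 1) & 1, "srst": v & 1}
    if ch == "R":
        return "read_tdo", {}          # the only byte that requires a reply
    if ch in "Bb":
        return "blink", {"on": ch == "B"}
    if ch == "Q":
        return "quit", {}
    return "unknown", {"byte": byte}


for b in b"46RQ":
    print(chr(b), decode_bitbang(b))
```

Only `R` expects a response, which is why the live server needs the design's TDO output while the replay path could get by with counting reads.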

Seam map (verified, with anchors)

Concern	Location	Today	Needed
Byte alphabet → pins	`src/sim/models/jtag.rs:156-196` (`consume_byte`)	parses `0-7,r,s,t,u,R,B,b,Q`; `R` only counted, `B/b` no-op	answer `R` with live TDO
Byte source	`jtag.rs:74-77` (`bytes: Vec<u8>`, `cursor`)	hardcoded Vec + cursor	abstract over `Replay(Vec)` vs `Live(TcpStream)`
Edge advance	`jtag.rs:229-251` (`step_edge`)	`bytes[cursor]`; one-edge TCK deferral for TMS/TDI settle	`source.next_byte_blocking()`
Observe half	`cosim/mod.rs:2987`	`step_edge` gets `&[]` — TDO not wired	pass `&backend.state()[state_size..]`
Drive path	`cosim/mod.rs` `contribute_overrides`→`overrides`→`patch_model_driven_in_ops` (`1916-1929`)	works for replay	unchanged
`is_active()`→batch=1	`jtag.rs:310-312`, gate at `cosim/mod.rs:3042-3045`	`!finished()`	live model: true while connected
Instantiation	`cosim/mod.rs:2354-2415`	reads file → `JtagReplayModel::new`	branch on `jtag_server`: bind + `accept()`
Config	`src/testbench.rs:361-374` (`JtagConfig`)	`tck/tms/tdi/trst_gpio` (inputs only)	add `tdo_gpio: Option<usize>` (an output)
CLI/opts	`jacquard.rs:362-375`, `:2007-2008`; `CosimOpts` `cosim/mod.rs:49-52`	`jtag_replay`, `jtag_hold_cycles`	add `jtag_server: Option<u16>`
Config example	`tests/jtag_minimal/sim_config.json:15-21`	`jtag{}` + `clocks[]` TCK domain	+ `tdo_gpio`

Design decisions

D1 — Single-threaded blocking socket; no async, no thread. The cosim loop is synchronous (step_edge → run_edges → wait → repeat). A live debug session inverts time control: the client (OpenOCD) drives the pace, so step_edge blocking on a socket read is the correct synchronisation. With is_active()==true forcing batch=1, each edge processes one bitbang step. No executor or background thread is required for a single connection.
D2 — Wire output_state into step_edge (resolve the standing TODO). Both ADR 0017 and ADR 0013 note step_edge is handed an empty output_state "until I²C/SPI observation needs it." The interactive server is the first CPU model that must read a design output (TDO). Replace the &[] at cosim/mod.rs:2987 with the real output slice. This also unblocks the scaffolded I²C/SPI models — a general improvement, not a JTAG special case.
D3 — Abstract the byte source, not the model. Introduce enum JtagSource { Replay { bytes: Vec<u8>, cursor: usize }, Live(TcpStream) } in jtag.rs; the single change point is bytes[cursor] → source.next_byte_blocking(). JtagReplayModel keeps its FSM; only its input feed changes. Replay behaviour is byte-for-byte identical.
D4 — TDO read-back over the socket. On R, sample TDO from output_state[tdo_pos] and write the ASCII '0'/'1' back to the TcpStream — the only response the remote_bitbang protocol requires. tdo_pos is resolved from the new JtagConfig.tdo_gpio via the design's output-bit map after construction.
D5 — --jtag-server <PORT>; mutually exclusive with --jtag-replay. Bind a TcpListener and accept() (blocking) in the instantiation block before the main loop — single connection for v1, stored in the model. Q ends the session; the sim can continue free-running (decision: log and continue, vs exit — see open questions).
D6 — Honour the OpenOCD remote_bitbang contract exactly. Reuse the existing alphabet handling; B/b (blink) stay no-ops, R is the sole response byte, s/u (SRST) map to the existing reset handling. Mirrors OpenOCD's remote_bitbang.c so stock OpenOCD connects unmodified.
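D3 and D4 together amount to a small interface: something that yields the next byte (blocking if live) and can send a TDO bit back. A Python sketch of that shape follows, assuming hypothetical class names `ReplaySource` and `LiveSource` that stand in for the two `JtagSource` variants; the real model is Rust and keeps its own FSM, which `drain` here only gestures at.

```python
import socket


class ReplaySource:
    """Recorded stream: a byte buffer plus a cursor."""

    def __init__(self, data):
        self.data = data
        self.cursor = 0

    def next_byte_blocking(self):
        if self.cursor >= len(self.data):
            return None
        byte = self.data[self.cursor]
        self.cursor += 1
        return byte

    def write_tdo(self, bit):
        pass  # nobody is listening during replay


class LiveSource:
    """Live stream: blocks on the connected socket, one byte at a time."""

    def __init__(self, conn):
        self.conn = conn

    def next_byte_blocking(self):
        chunk = self.conn.recv(1)          # blocks until the client sends
        return chunk[0] if chunk else None

    def write_tdo(self, bit):
        self.conn.sendall(b"1" if bit else b"0")   # ASCII '0'/'1' reply


def drain(source, tdo_bit):
    """Consume bytes until Q or end of stream, answering each R with tdo_bit."""
    seen = []
    while (byte := source.next_byte_blocking()) is not None:
        ch = chr(byte)
        if ch == "Q":
            break
        if ch == "R":
            source.write_tdo(tdo_bit)
        seen.append(ch)
    return seen


print(drain(ReplaySource(b"46RQ"), tdo_bit=1))   # ['4', '6', 'R']
```

Because both sources expose the same two methods, the consumer loop is identical for replay and live use, which is the property the loopback test in the validation strategy depends on.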

Validation strategy

Real OpenOCD+gdb attach is interactive and heavy for CI, so the CI gate is a self-contained loopback test, with manual OpenOCD as a documented recipe:

V1 — loopback integration test (CI). A Rust test acts as the remote_bitbang client over a localhost socket: it feeds the same recorded stream the jtag_minimal fixture already uses (bitbang.rec), services R reads, and asserts the design reaches the same data0_obs == 0xCAFEBABE as the replay path. This exercises the full live socket → TDO read-back → drive path end-to-end with zero external tooling, and pins live-vs-replay equivalence. Runs on the CPU/Metal cosim backends like the existing jtag-minimal-cosim job.
V2 — manual OpenOCD + gdb recipe (docs). A docs/jtag-debug.md guide: jacquard cosim … --jtag-server 9999, an OpenOCD remote_bitbang config, gdb … target remote, info registers / x/ / load. The issue names the cocotb RemoteBitbangServer as the behavioural precedent to match.

Staged checkpoints (each ≈ one reviewable PR)

#	Scope	Gate	Status
J1	Wire `output_state` into `step_edge` (D2); add `JtagConfig.tdo_gpio` + resolve `tdo_pos`; unit-test TDO sampling in the replay model	replay fixtures still byte-identical; TDO-sample unit test green	✅ done
J2	Byte-source abstraction (D3): `JtagSource` enum; replay path refactored onto it	`jtag_minimal` replay unchanged; model unit tests green	✅ done
J3	`--jtag-server` (D5) + live byte source + `R` write-back (D4) + `TcpListener` accept	V1 loopback test: live run == replay golden (`data0_obs==0xCAFEBABE`)	✅ done (model loopback unit test + `jtag-minimal-cosim-server` CI gate; verified locally on Metal)
J4	`docs/jtag-debug.md` (V2 recipe); `--help` text; cross-link ADRs	docs build; manual OpenOCD/gdb smoke (local)	✅ done (guide + `--help`; manual OpenOCD/gdb left to operators)
J5 (later)	Single-step / breakpoints via DM `step`/triggers; X-aware debug under `--xprop`	follows from attach; see open questions	⏳ deferred

CUDA/HIP note: the interactive path is the CPU-side model + batch=1 of the GPU backend (per ADR 0017's "per-edge fallback"), so it works on any cosim backend once J1–J3 land; no per-backend kernel work.

Risks / open questions

TDO sample timing. R must sample TDO at the protocol-correct point (after the TCK edge the client just clocked). The model already defers TCK by one edge (pending_tck) for TMS/TDI settle; confirm the output slot read on R reflects the just-clocked state. The loopback test (V1) catches misalignment via the data0_obs assert.
Performance. Interactive sessions lose edge batching for their duration (inherent and acceptable — debug is slow). The batched fast path is unaffected when no client is attached.
--xprop interaction. The DM debug-load path's X-behaviour was addressed in #102. Initial debug runs two-state; X-aware read under --xprop is a J5 refinement, not the initial ask.
Single vs multi client / session end. v1 is one connection; Q either exits cosim or lets it free-run — pick the gdb-friendly behaviour during J3.
--max-clock-edges semantics. An attached session may outlive the edge budget; decide whether --jtag-server disables/relaxes the cap while a client is connected.

References

Issue #124.
ADR 0017 (cosim execution model) — Amendment 2026-06-21 (interactive, externally-paced peripheral models; output_state wiring).
ADR 0013 (peripheral model architecture) — Amendment 2026-06-21 (tdo_gpio config surface; --jtag-server as the interactive sibling of --jtag-replay).
Existing replay path: src/sim/models/jtag.rs, tests/jtag_minimal/ (bitbang.rec, sim_config.json, the jtag-minimal-cosim CI gate).

Plan — Python engine as a bundled binary wheel

Implementation plan for ADR 0020: turn PR #53's subprocess API into a pip install jacquard self-contained binary wheel built with cibuildwheel.

Deferred (2026-07-01). ADR 0020 is a draft — a native PyO3 binding is the preferred long-term direction, and whether we build this subprocess-wheel path at all is deferred (#161). P0 (adopt PR #53's API into the uv workspace) is worth doing regardless — it's the Python surface either approach exposes. P1–P3 (embed-and-delocate the subprocess binary) are on hold pending the subprocess-vs-PyO3 decision; note the shared hard part (per-platform wheels + libc++/libomp vendoring) carries over to a PyO3 extension either way.

Guiding constraints

The Python surface is PR #53's — config / runner / result / regression / errors. This plan changes packaging and adds a binary; it does not redesign the API.
Reuse, don't reinvent, the release machinery. The binary to embed is the one release.yml already builds (cargo build --release --features metal --bin jacquard, metallib embedded). The publish path is publish-netlist-graph.yml's OIDC trusted-publishing shape. The smoke idea is scripts/ci/user_acceptance_smoke.sh (install relocated → run sim).
Every phase ends green in CI with an install-into-clean-venv smoke test, and any platform the wheel does not cover is logged, not silently dropped.

P0 — Land the API in the workspace (no binary yet)

Adopt PR #53 as a uv workspace member; ship nothing to PyPI yet.

Move #53's python/jacquard/ in as the 5th workspace member; wire it into the root pyproject.toml workspace list next to netlist_graph / chipflow_harness / mcu_soc.
Keep find_jacquard_binary()'s current env → PATH → which chain (no embedded binary yet); it resolves a dev's local target/release/jacquard.
CI: a python-api job — uv sync, ruff, pytest — plus one integration test that builds jacquard and drives sim() against a tiny fixture (guarded like the other Metal jobs; reuses the metal-build artifact from the build/test split).
Exit: #53's tests green in CI; from jacquard import sim works against a locally built binary. #53 can be closed/superseded by this branch.

P1 — macOS/arm64 + Metal binary wheel (the spike + the crux)

The whole ADR rests on "embed the binary + vendor its dylibs into a wheel that launches with no Homebrew LLVM." Prove it here on the platform we already release for.

Embed step. Package build copies the release binary to jacquard/_bin/jacquard. find_jacquard_binary() gains step 0: prefer the packaged _bin/jacquard; existing chain remains the fallback.
cibuildwheel + delocate. Configure cibuildwheel for macos/arm64, with delocate repairing the wheel — copying libc++/libomp in and rewriting install names. Confirm with otool -L that the repaired binary references @loader_path-relative dylibs, not /opt/homebrew.
Clean-environment smoke (the gate). In a runner without Homebrew LLVM on the load path (or with it hidden), pip install the wheel into a fresh venv and run a real sim — the pip-user analogue of the relocated-tarball user-acceptance gate. This is the test that would have caught v0.2.1.
libomp duplicate-runtime check. Load jacquard in a venv that also imports a numpy/scipy stack (which ship their own libomp) and confirm no duplicate-OpenMP abort.
Exit: a macOS/arm64 wheel that pip installs and runs sim on a clean box; TestPyPI upload validated via workflow_dispatch.
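The lookup order after the embed step (packaged binary first, then the existing chain) can be sketched as follows. This is not PR #53's `find_jacquard_binary()`: the function name `resolve_binary`, its parameters, and the `JACQUARD_BIN` environment variable are all assumptions made for illustration.

```python
import os
import shutil
from pathlib import Path


def resolve_binary(package_dir, env=None, path=None):
    """Resolve the jacquard executable: packaged copy, env override, then PATH."""
    env = os.environ if env is None else env

    # Step 0: prefer the binary shipped inside the wheel.
    packaged = Path(package_dir) / "_bin" / "jacquard"
    if packaged.is_file() and os.access(packaged, os.X_OK):
        return packaged

    # Step 1: explicit override (the variable name is hypothetical).
    override = env.get("JACQUARD_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # Step 2: search PATH (or the caller-supplied search path).
    found = shutil.which("jacquard", path=path)
    return Path(found) if found else None
```

Keeping the packaged copy as step 0 means an installed wheel never silently picks up a developer's `target/release/jacquard`, while a source checkout without `_bin/` behaves as before.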

P2 — Linux/x86_64 CPU-fallback wheel

Make pip install jacquard work in a plain Linux CI container (no GPU).

Build the cosim CPU backend (ADR 0017) for manylinux; auditwheel repairs the (smaller, no-GPU) dylib set.
Same clean-venv sim/cosim smoke, in a stock manylinux/ubuntu container.
Document the backend the Linux wheel provides (CPU) vs. what needs P3.
Exit: Linux CPU wheel installs and runs in a bare container.

P3 — CUDA / HIP (gated, likely extras)

Gate: ADR 0018 Phase 4 (prebuilt CUDA/HIP binaries) must exist first — this reuses those binaries, it does not invent GPU CI.
Decide with data: default manylinux wheel vs. jacquard[cuda] extra vs. a separate package, driven by wheel size and CUDA-runtime bundling. Log and document whichever platforms the default wheel omits.
Exit: a documented, working install path for at least one GPU backend on Linux, or a recorded decision to keep GPU on the tarball/binstall channel.

Publish pipeline (built across P1–P3)

New publish-jacquard-wheel.yml, modeled on publish-netlist-graph.yml: cibuildwheel build matrix → artifact → TestPyPI on workflow_dispatch (dry-run) → PyPI via OIDC trusted publishing on a release tag. No stored token; configure the trusted publisher once (mind the case-sensitive gpu-eda/Jacquard repo claim — the same gotcha netlist-graph hit).
Version tracks the binary release via scripts/bump_version.py (the embedded binary and the wheel are one artifact), not netlist-graph's independent line.
Extend docs/installation.md (a pip install jacquard path) and docs/release-process.md (the wheel channel + its staging-validation gate).

Open questions (from the ADR)

PyPI name jacquard — confirm availability; fall back to jacquard-eda with an aliased import name if taken. Blocks only the first real publish.
CUDA wheel size / runtime — P3 decision, with data.
libomp duplicate runtime — must be cleared in the P1 spike (step 4) before committing to delocate as the vendoring mechanism.

Plan: `xroots` — backward X-source frontier query

Issue: #98. Builds on the netlist-graph cone/driver fixes in #101 (PR1), which is why this branch is stacked on fix/netlist-graph-cone-drivers.

Problem

When a signal reads X under jacquard cosim --xprop, finding why is a manual trace→guess→re-run loop: the VCD can only report wires you already chose to trace, so you must half-know the answer to ask the question. But the information is static: an X originates at a known X-source (unreset DFF Q, SRAM read port, undriven primary input) and propagates forward. So "what makes S read X?" = "which X-sources lie in S's backward cone?" — a pure netlist query.

Design decisions (confirmed)

Driven-input set comes from jacquard, not re-derived in Python. The authoritative driven set (clock/reset/constant/peripheral pins + GPIO→port mapping) is computed in Rust during cosim setup. A new jacquard xsources subcommand dumps the X-source set (including the undriven-input complement) as JSON; netlist-graph xroots consumes it. No drift, single source of truth.
Dominators deferred. v1 ships the frontier query + classification + --emit-trace. Dominators (X-sources every path passes through) need a careful formulation over the DFF-feedback cyclic graph and land as a scoped follow-up once the frontier query proves useful.

Part A — `jacquard xsources` (Rust)

A new clap subcommand alongside Sim / Cosim in src/bin/jacquard.rs.

jacquard xsources <netlist> --config <sim_config.json> -o xsources.json

Reuses the existing cosim setup path to build the AIG + NetlistDB and the driven-input set, then:

DFF-Q / SRAM-read X-sources from AIG::compute_x_sources() (src/aig.rs:3277) — already enumerates these as AIG pins.
Undriven primary inputs = primary-input bits not in the cosim driven set (the same complement cosim --xprop treats as X; see cosim-xprop.md X-source taxonomy rows 3–4).
Name resolution: map each X-source AIG pin → hierarchical net name via the existing aigpin → (cell_id, cell_type, output_pin_name) map (src/aig.rs:265) and NetlistDB pin/net names. WordSymbolMap (src/flatten.rs:138) is the precedent for this AIG→name translation.

Output schema (schema_version: "1.0", additive-only):

{
  "schema_version": "1.0",
  "netlist": "design.v",
  "x_sources": [
    {"net": "top.cpu.regs[7]", "kind": "unreset-dff", "cell": "..."},
    {"net": "top.sram.rd_data[3]", "kind": "sram-read", "cell": "..."},
    {"net": "ext_in[2]", "kind": "undriven-input"}
  ]
}

kind ∈ {unreset-dff, sram-read, undriven-input}. The net names are emitted in the same conventions --trace-signals resolves, so they round-trip into the Python tool and back into a confirming --xprop run.
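A consumer of this manifest might look like the Python sketch below. It is a guess at the loader's shape, not the netlist-graph implementation: `load_xsources` is a hypothetical name, and rejecting unknown `kind` values (rather than passing them through) is an assumption. Unknown extra fields are ignored, in keeping with the additive-only schema.

```python
import json

KNOWN_KINDS = {"unreset-dff", "sram-read", "undriven-input"}


def load_xsources(text):
    """Parse an xsources manifest and return a {net: kind} mapping."""
    doc = json.loads(text)
    version = str(doc.get("schema_version", ""))
    if version.split(".")[0] != "1":
        raise ValueError(f"unsupported schema_version: {version!r}")
    by_net = {}
    for entry in doc.get("x_sources", []):
        kind = entry["kind"]
        if kind not in KNOWN_KINDS:
            raise ValueError(f"unknown X-source kind: {kind!r}")
        by_net[entry["net"]] = kind   # extra keys such as "cell" are ignored
    return by_net


manifest = """
{
  "schema_version": "1.0",
  "netlist": "design.v",
  "x_sources": [
    {"net": "top.cpu.regs[7]", "kind": "unreset-dff", "cell": "..."},
    {"net": "top.sram.rd_data[3]", "kind": "sram-read", "cell": "..."},
    {"net": "ext_in[2]", "kind": "undriven-input"}
  ]
}
"""
sources = load_xsources(manifest)
print(sources["ext_in[2]"])   # undriven-input
```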

Note on "unreset": jacquard marks all DFF Qs as X at cycle 0; a DFF with a connected reset resolves once reset asserts. For a static backward query the useful classification is "is this DFF reset-connected?" — emitted as unreset-dff only when no reset/set pin is wired. This is a static over-approximation (documented in the command help).

Part B — `netlist-graph xroots` (Python)

netlist-graph xroots <netlist> <signal> [--xsources xsources.json] [--emit-trace <file>] [-d DEPTH]

Resolve <signal> to a net (reuse resolve_name).
Reverse-reachability from the net through find_drivers, continuing through DFF data pins (the same data-path-through-registers walk as logic_cone(through_regs=True) restricted to _DFF_DATA_PINS — reuses PR1's corrected _is_register). Clock/reset pins are not followed.
X-source set:
- With --xsources: use the manifest's authoritative net set + classification (covers undriven-input, which the netlist alone can't classify).
- Without it: classify natively from the netlist — DFF-Q nets (cells _is_register matches) and SRAM read-port nets — and emit undriven-input as unknown (warn that --xsources is needed for the driven-set complement). Genuinely-undriven internal nets (PR1's [undriven — X-source] leaves) are reported as candidate roots.
Frontier: BFS outward; when a reached net is an X-source, record it and stop expanding past it (it is a root). Non-source nets keep expanding. Frontier = X-sources reachable without passing through another X-source.
Report: frontier X-sources, classified and grouped by kind, nearest first (BFS depth). --emit-trace <file> writes them as a --trace-signals list (one net per line, # kind comments) so a confirming --xprop run is one command.
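The frontier step can be sketched as a breadth-first walk that stops at the first X-source on each path. The sketch below assumes a hypothetical adjacency map `drivers` (net to the nets feeding it, with DFF data pins already followed and clock/reset pins already excluded) in place of `find_drivers`; the trace line format `net  # kind` is likewise an assumption based on the description above.

```python
from collections import deque


def xroots_frontier(drivers, start, x_sources, max_depth=None):
    """Return [(net, depth)] for X-sources reachable without crossing another."""
    frontier = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        net, depth = queue.popleft()
        if net in x_sources:
            frontier[net] = depth      # a root: record it, do not expand past it
            continue
        if max_depth is not None and depth >= max_depth:
            continue
        for src in drivers.get(net, ()):
            if src not in seen:        # the seen set also breaks feedback loops
                seen.add(src)
                queue.append((src, depth + 1))
    return sorted(frontier.items(), key=lambda item: (item[1], item[0]))


def emit_trace_lines(frontier, kinds):
    """Format the frontier as one net per line with a trailing kind comment."""
    return [f"{net}  # {kinds.get(net, 'unknown')}" for net, _depth in frontier]


# S <- a <- q1 <- q2, with a feedback edge a <- S, and S <- b <- in0.
drivers = {"S": ["a", "b"], "a": ["q1", "S"], "b": ["in0"], "q1": ["q2"]}
kinds = {"q1": "unreset-dff", "q2": "unreset-dff", "in0": "undriven-input"}
frontier = xroots_frontier(drivers, "S", set(kinds))
print(frontier)   # [('in0', 2), ('q1', 2)]
```

`q2` is an X-source but sits behind `q1`, so it is not part of the frontier; that is the difference between the frontier and the full set of X-sources in the backward cone.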

Reuse / new code

Reuses: resolve_name, find_drivers, _is_register, _DFF_DATA_PINS, _short_net, the out_driver/is_driven machinery from PR1.
New: xroots graph method (reverse-reachability + frontier intersection), xroots CLI command, manifest loader.

Tests

Rust: xsources unit test on a small netlist+config; assert DFF-Q, SRAM-read, and undriven-input nets appear with correct kind; reset-connected DFFs are not unreset-dff.
Python: synthetic netlist with a known X-source behind two logic levels and a reset-defined path; assert the frontier finds the source, the classification matches the manifest, --emit-trace output round-trips, and the "no manifest" mode warns.
Integration: xsources on tests/timing_test/minimal_build → xroots --emit-trace → feed back to cosim --xprop --trace-signals and confirm the surfaced wires carry X (smoke test, gated on Metal availability).

Sequencing

Part A (jacquard xsources) + Rust tests.
Part B (netlist-graph xroots) + Python tests, consuming Part A's manifest.
Docs: a docs/x-debugging.md user guide (xsources → xroots → confirming --xprop run) + CHANGELOG entries + a one-line pointer in CLAUDE.md's debugging-tools section.

Out of scope (follow-ups)

Dominator analysis (decision 2).
Wire-bundle reconstruction of multi-bit X-source buses (tracked separately).

Cell-model IR — staged delivery plan

Status: Largely delivered. Realises ADR 0019. Tracks #130 and #67.

C1 (foundation + #130) — ✅ landed (#132). #130 is fixed by default: a 9T netlist auto-selects its own 9T descriptor.
C1b (converter generalization) — ✅ landed (#155): corner from Liberty PVT, ff internal-state-var (IQ). Surfaced by running the converter against the proprietary GF130 (GF013BCD) library.
C2 (L3 sequential + L4 timing) — ✅ landed (#132).
C3.1/C3.2 (build-time generation + prefix selection) — ✅ landed (#132).
SKY130 .lib.json reader — ✅ landed (#155): SKY130 ships only .lib.json, so liberty-parse reads it.
C3a (IHP SG13G2, zero-Rust new PDK) — ✅ landed (#155): vendored as a sparse/shallow submodule + a generated descriptor, no ihp_*.rs.
C3.3 (cut over + drop runtime vendor/ dep) — ✅ landed (#160): the runtime binary is self-contained for standard cells (load_pdk_models has no runtime callers), and PdkVariant is deleted (grep -rn PdkVariant src/ = 0).
C4 (proprietary workflow + clear_preset) — 🟡 partial: clear_preset set-dominant field landed (#155); docs/adding-a-pdk.md reframed around the descriptor workflow (#155). The proprietary sequential end-to-end test and the round-trip honesty fix remain (see Remaining follow-ups).

Goal: make Jacquard core consume a single, generated, JSON cell-model IR carrying all per-cell-type facts of a library — L1 directions, L2 combinational AIG, L3 sequential/classification, and L4 timing characterization — so any library, including proprietary ones the authors can't vendor, is selectable at runtime, and the per-PDK Rust, the hardcoded vendor/ paths, and the runtime .lib parse all retire.

Why staged

Each step is independently useful and de-risks the next: step 1 ships the #130 fix and validates the format on real cells before the heavier L3 schema is committed; step 2 is the schema-fidelity work; step 3 is the cleanup that the first two earn.

Staged checkpoints (each ≈ one reviewable PR)

#	Scope	Gate
C1 — foundation + #130	Relocate `pdk_decomp` into a shared lib. Define the cell-model-IR JSON schema for L1 directions + L2 combinational AIG (D3) — the corner of the schema needed to build the AIG. Write the converter crate (Liberty `function`/`functional.v` → IR). Redirect one PDK's (GF180MCU) stdcell logic to consume the IR.	A 9T netlist (e.g. `tests/jtag_minimal`) simulates against its own generated 9T descriptor; result byte-identical to the current 7T-substituted run where they truly agree, and the round-trip logic check passes.
C2 — L3 sequential + L4 timing schema	Add the D4 sequential pin-role schema (clock+edge, D/next-state, Q, async set/reset+polarity, enable) + classification kinds, and the D5 L4 timing block (setup/hold, clock→Q, DFF/SRAM timing). Extend the converter to emit both from Liberty `ff`/`latch` and the timing groups it already parses, with L4 keyed by corner (one descriptor, all corners; mirrors the timing IR). Wire the consumer to replace the hardcoded DFF pin-name matches (`src/aig.rs:2080-2260`) and to read L4 from the IR instead of `TimingLibrary::from_file`, selecting the corner via `--corner` (default `default_corner`).	A design with sequential GF180/SKY130 cells simulates and times from the IR with no per-PDK Rust DFF handling and no runtime `.lib` parse; equivalence vs the current hardcoded + `liberty_parser` path (the oracle).
C3 — bundle + cut over + selection	Regenerate bundled descriptors for all built-in PDKs (AIGPDK/SKY130/GF180) in CI from the pinned vendored submodules and embed at build time — not checked in (D7); build-time generation replaces the `build.rs` pin-table step. Selection by descriptor-declared prefix + `--cell-descriptor` (D8). Drop the runtime `vendor/` cell dependency, `pdk_decomp`, and `liberty_parser::TimingLibrary` from core; retire `gf180mcu.rs`/`sky130.rs` classifiers, `PdkVariant`, and the `build.rs` pin-table generators.	The existing PDK regression suite (incl. timed runs) passes consuming only build-time-generated descriptors; vendored cell submodules are no longer a runtime dependency of jacquard core, and CI regeneration is deterministic.
C3a — IHP SG13G2 (new PDK, zero Rust)	Vendor the IHP-Open-PDK SG13G2 stdcells as a submodule; add it to the bundle by generating a descriptor only — no per-PDK Rust (D7a). Exercises the Liberty-first path cleanly (every SG13G2 cell has `function`; every flop a full `ff` group with reset polarity).	An SG13G2 gate-level design simulates and times purely from its generated descriptor, with no `ihp_*.rs` in core — the worked proof that adding a PDK is no longer a Jacquard code change.
C4 (later)	A documented proprietary-library workflow: user runs the generator on their own Liberty, gets a descriptor, simulates — no Jacquard build. Honesty fix: the round-trip logic check replaces `build.rs`'s port-only `assert_eq!`.	`docs/adding-a-pdk.md` recipe; a synthetic "private" library exercised end-to-end in a test.

Remaining follow-ups (post-cutover)

Discovered during implementation; folded here from the working handoff so they survive its deletion. None reopen the runtime vendor/ stdcell dependency or reintroduce PdkVariant.

Descriptor-drive AIGPDK DFF/DFFSR. AIGPDK's two internal flops are still on a literal-match path. Their descriptor L3 is well-formed (clock=CLK/rising, next_state=D, async S/R active-low, reset-dominant); the conversion is gate-covered (every AIGPDK/IHP fixture exercises them). Do this first — it is bounded and safe.
Descriptor-drive the preserved edfxtp/icgtp/icgtn seq branches. The legacy SET_B/RESET_B/RN/SETN sequential-wiring branches remain because they still serve preserved cells: SKY130 edfxtp (its data-enable folds into next_state via the ff state var IQ, so ir_seq_input_wireable returns false) and GF180 icgtp/icgtn clock gates (emit no L3). No gated design exercises these, so add a small edfxtp/icgtp fixture first, then descriptor-drive them and delete the residual literals + the per-PDK classifiers.
Gate IHP sequential. IHP ff cells are code-supported (previously panicked) but ungated — no flop-bearing SG13G2 fixture exists (needs synthesis infra to produce an IHP gate-level netlist). Add one to lock in IHP sequential.
Flat-module .v cross-check indexer. Commercial libraries (GF130, IHP) ship one flat-module .v, so the D6 logic cross-check reports 0 models indexed and does not run at generation (the descriptor is still emitted; correctness is validated via simulation, and the converter now warns). Teach the indexer flat-module .v so commercial descriptors are logic-cross-checked at generation.
Descriptor-backed LeafPinProvider + delete build.rs pin-table gen. Confirmed feasible (descriptor L1 covers every stdcell's scalar pins; chain: power-pin → filler → descriptor L1 → preserved pad/IP/SRAM tables). Not yet wired; the per-PDK LeafPinProviders + build.rs::generate_pin_table + generated modules remain (inert for stdcell logic, but still the parse-time pin-direction source). This is the last mechanical cleanup.
C4 — proprietary sequential end-to-end test + round-trip honesty fix. GF130 descriptor generation is proven (the C4 proof), but no proprietary sequential simulation test exists; add a synthetic "private" library exercised end-to-end. Separately, the round-trip logic check should replace build.rs's port-only assert_eq! (the plan's C4 honesty fix).
clear_preset for true set-dominant sequential sim. The field is emitted and the consumer honors it, but a proprietary user simulating set-dominant flops still needs the non-GF180 sequential path fully descriptor-driven (items 1–2) to observe the correct behaviour end-to-end.

Risks / open questions

Sequential fidelity (C2). Liberty ff clear/preset/clear_preset_var → Jacquard async-reset DFF is the bug-prone mapping; gate it on equivalence against the current hardcoded behaviour.
L2 source (Liberty-first; resolved in ADR 0019 D6). The converter reads Liberty function/ff/latch first and falls back to functional.v/UDP only where Liberty under-specifies. C1 must therefore exercise both ends: a clean function cell and a cell that needs the .v/UDP fallback (a UDP-modelled mux). Where both sources exist, the converter cross-checks them (logic equivalence; timing arc-set agreement; macro/SRAM timing-value divergence) and surfaces disagreement — the structural check build.rs's port-only assert_eq! never had (#130).
AIG payload size (D2/D3). If the JSON AIG is unwieldy for a full library, switch that payload to the FlatBuffers escape hatch — decide at C1 from the real GF180 descriptor size.
Migration ordering (C3) — resolved: per-PDK cutover. Each PDK migrates independently with the IR consumer running alongside the per-PDK Rust, keeping the suite green; a single switch is riskier. IHP (C3a) is added the same way but greenfield (no Rust to retire).
Bundled-descriptor provenance — resolved: CI-regenerated (D7). Not checked in; embedded at build time from pinned submodules. C3 must wire the generation into the build/CI and prove it deterministic.
Identifier alignment (D1). The shared cell/pin-name fragment must be fixed in C1 before two IRs exist that would diverge.

References

ADR 0019 — Cell-model IR.
ADR 0002 — Timing IR (the pattern + scope boundary this realises), ADR 0010 / ADR 0011 (the declarative path this extends).
Current state: src/aig.rs:1895 (hardcoded 7T path), build.rs (port-only pin gen), src/pdk.rs / src/gf180mcu_pdk.rs (hardcoded classifiers), src/liberty_parser.rs (Liberty group-walker to extend), crates/opensta-to-ir + crates/timing-ir (the sibling pattern).

Plan — RTL on-ramp folded into `sim`/`cosim`

Status: Active — reworks draft PR #167

ADRs: 0021 (Revised 2026-07-03 — synth folds into sim/cosim, no build command), aligns with 0019 (descriptor-supplied logic + timing) and 0018 (wasm distribution).

Predecessors: #167 shipped the src/synth.rs embedded-Yosys engine and a jacquard build subcommand + synth feature (CI green, draft). This plan keeps the engine and removes the standalone command, wiring synthesis into the simulation input path instead.

Tracking: #162.

Goal

Behavioral RTL becomes a first-class one-command input: jacquard sim design.v in.vcd out.vcd N (and the cosim equivalent) detects behavioral RTL, synthesizes it transparently to an aigpdk netlist (cached), and simulates — no separate build step, no Python, no external toolchain. A pre-synthesized netlist continues to simulate unchanged.

Input dispatch (the core behaviour — ADR 0021 §1)

On the netlist-input path of sim/cosim:

Attempt structural parse (NetlistDB::from_sverilog_file).
Parse succeeds → enumerate instantiated cells, test against the built-in stdcell recognizers (the is_*_stdcell helpers introduced by #160 / ADR 0019):
- All recognized → gate-level, built-in PDK → simulate directly. Logic + timing come from the embedded descriptor (--corner); no .lib.
- Any unrecognized → gate-level, unknown PDK → error, listing the unrecognized cell types: "gate-level netlist with unrecognized cells — pass --cell-descriptor <path>" (ADR 0019 D8). Do not synthesize.
Parse fails → treat as behavioral → synthesize (embedded Yosys) → parse the result → simulate. If synthesis also fails, surface both the structural parse error and the Yosys diagnostics (a genuine netlist syntax error must not masquerade as "tried to synthesize your RTL").

Overrides: --rtl forces the synth path; --netlist forces direct-structural (skip detection). Transparency: always print the decision, including the QoR tier on the synth path (e.g. design.v: behavioral RTL → synthesized [YoWASP Yosys, functional QoR] → <cache>).

Detection is a heuristic on a structural-only parser; the explicit flags are the escape hatch. Pure net-alias assigns parse structurally (they occur in gate-level too) — only real behavioral constructs (always, arithmetic/logic assign, if/case) trip the parse and route to synth. Confirm the exact is_*_stdcell API against src/ during Phase 1 (do not assume names).
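The dispatch rules above can be sketched as a pure function over the structural-parse result. This is a minimal illustration, not Jacquard's code: the `Dispatch` enum, `classify`, `synth_failure_message`, and the `is_known_stdcell` predicate are hypothetical names (the real `is_*_stdcell` API is still to be confirmed in Phase 1), and the `--rtl`/`--netlist` overrides are left out.

```rust
use std::collections::BTreeSet;

/// Outcome of the input-dispatch decision (illustrative names only).
#[derive(Debug, PartialEq)]
pub enum Dispatch {
    /// Structural parse succeeded and every cell is a built-in stdcell.
    SimulateDirect,
    /// Structural parse succeeded but some cell types are unrecognized.
    UnknownCells(Vec<String>),
    /// Structural parse failed: treat the input as behavioral RTL.
    Synthesize { parse_error: String },
}

/// `parsed` is Ok(instantiated cell types) or Err(structural parse error).
pub fn classify(
    parsed: Result<Vec<String>, String>,
    is_known_stdcell: impl Fn(&str) -> bool,
) -> Dispatch {
    match parsed {
        Err(parse_error) => Dispatch::Synthesize { parse_error },
        Ok(cell_types) => {
            // BTreeSet de-duplicates and gives a stable, sorted error listing.
            let unknown: BTreeSet<String> = cell_types
                .into_iter()
                .filter(|t| !is_known_stdcell(t.as_str()))
                .collect();
            if unknown.is_empty() {
                Dispatch::SimulateDirect
            } else {
                Dispatch::UnknownCells(unknown.into_iter().collect())
            }
        }
    }
}

/// When synthesis also fails, report both diagnostics so a genuine netlist
/// syntax error is not hidden behind the synthesis failure.
pub fn synth_failure_message(parse_error: &str, yosys_diag: &str) -> String {
    format!("structural parse failed: {parse_error}\nsynthesis also failed: {yosys_diag}")
}
```

Keeping the classifier free of I/O makes the three branches easy to unit-test before it is wired into the shared sim/cosim load path.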

Phases

Phase 1 — Fold synth into `sim`/`cosim`; remove `build`

Entry: #167 branch (feat/rtl-onramp-build), synth feature + src/synth.rs present.

Extract the synthesis invocation from cmd_build into a reusable synth::synthesize(design: &Path, opts) -> Result<PathBuf> returning the cached gate-level .gv path. Cache keyed by content hash of (design sources + synth script + yosys.wasm hash), under $XDG_CACHE_HOME/jacquard alongside the compiled-module cache.
Use read_slang for the SV frontend. The pinned yowasp-yosys 0.64.0.0.post1131 wasm bundles yosys-slang (verified: read_slang + 495 slang symbols in the 39 MB module) — a near-complete SV-2017 elaborator, far beyond built-in read_verilog -sv. Update synth_script to front SV input with read_slang (fall back to read_verilog -sv if read_slang errors or is absent in an older wasm — probe help read_slang once and cache the result, so the on-ramp degrades gracefully on a pre-slang module).
Add the input-dispatch classifier (above) to the shared sim/cosim netlist load path.
Add CLI flags to SimArgs / CosimArgs: --emit-synth <path> (dump the intermediate netlist), --rtl, --netlist.
Delete Build, BuildArgs, cmd_build from src/bin/jacquard.rs.
Graceful message when the binary is built without --features synth and given behavioral input: point at the synth-enabled build / release binary.
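The cache-key rule in the first item (design sources + synth script + yosys.wasm hash) could look roughly like the sketch below. Everything here is an assumption for illustration: `cache_key` and `cache_path` are hypothetical names, and std's `DefaultHasher` stands in for whatever content hash the implementation actually uses (it is not cryptographic and its output is not guaranteed stable across Rust releases).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Derives a cache key from everything that can change the synthesized netlist.
pub fn cache_key(sources: &[&[u8]], synth_script: &str, wasm_hash: &str) -> String {
    let mut h = DefaultHasher::new();
    // Hash the source count, then each source; slices hash their length
    // before their bytes, so moving bytes between files changes the key.
    sources.len().hash(&mut h);
    for src in sources {
        src.hash(&mut h);
    }
    synth_script.hash(&mut h);
    wasm_hash.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Location of the cached gate-level netlist under the cache root
/// (for example the directory named by $XDG_CACHE_HOME).
pub fn cache_path(cache_root: &Path, key: &str) -> PathBuf {
    cache_root.join("jacquard").join(format!("{key}.gv"))
}
```

Including the wasm hash in the key means a change of the pinned yosys.wasm invalidates earlier netlists without any explicit cache-clearing step.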

Exit: jacquard sim counter.v in.vcd out.vcd 1 (binary built with --features synth) runs end-to-end (behavioral → waves); the three #167 validation designs (counter, assert_test, mem_test) simulate from RTL directly; a gate-level fixture still simulates with the feature off; an unknown-cell netlist errors toward --cell-descriptor; --emit-synth writes a parseable .gv; no build subcommand exists.

Phase 2 — CI coverage + distribution (blocking: feature ships in no binary)

Entry: Phase 1 merged to the branch.

Add --features synth to release.yml and user-acceptance.yml so shipped binaries include the on-ramp.
Add a CI job to ci.yml: cargo build --release --features synth + a jacquard sim <behavioral.v> smoke run. Resolve yosys.wasm sourcing in CI (install a pinned yowasp-yosys wheel and discover it, matching local verification; or fetch the release asset once increment 2 lands).
Confirm --features synth compile time is tolerable in CI (cache the cranelift build if needed).

Exit: CI proves sim consumes behavioral RTL; release/UA binaries contain the feature; green on the branch.

Phase 3 — Docs + stale-reference cleanup

Entry: Phases 1–2 green.

Rewrite the RTL flow in docs/getting-started.md, docs/installation.md, and any synthesis-flow.md cross-links to the single-command story.
Write docs/accepted-rtl.md — the accepted behavioral-RTL surface. The honest framing: what sim/cosim accept as behavioral input is exactly what the embedded YoWASP Yosys frontend synthesizes — and that frontend bundles yosys-slang (read_slang, verified in the pinned 0.64 wasm), plus the project's techmaps and minus testbench-only constructs. Document:
- Supported (delegated to Yosys + slang): synthesizable Verilog-2005 and a broad SystemVerilog-2017 surface via slang (packages, interfaces, structs/enums, always_ff/always_comb, parameters, advanced generate, memories) — not the narrow built-in read_verilog -sv subset.
- Project-specific mappings: immediate assertions → GEM_ASSERT (--strip-assertions removes via chformal); $display → GEM_DISPLAY; inferred memories → RAMGEM via memlib_yosys.txt.
- Known limits: concurrent-SVA → checker synthesis is partial (a Yosys formal-flow bound, independent of slang's parsing; #106/#107); testbench-only constructs (#delays, most initial, TB $display) are dropped by synthesis, not simulated.
- State plainly that the authoritative accepted-surface is the empirical coverage table (Phase 4), not this prose — prose is the orientation.
- Cross-link from getting-started.md.
Fix stale jacquard map references (CLAUDE.md, docs) → dump-paths.
docs/plans/cell-model-ir.md status — reconciled with main during rebase: its status is now "Largely delivered" (C3 + the D5 L4-from-descriptor runtime wiring landed on main via de8255f3), so no further edit here.
Update docs/handoffs/adr-0021-behavioral-rtl-handoff.md (or resolve/fold it per handoff-discipline once this ships).

Exit: No doc describes a jacquard build command or the old multi-tool ceremony; map/build stale refs gone.

Phase 4 — Follow-ups (deferred, not gating)

Load exception-handling wasm — newer yowasp-yosys wheels build the wasm with the WebAssembly exception-handling proposal, which our wasmtime engine (Engine::default(), no gc feature) rejects: "exception refs not supported without the exception handling feature." Config::wasm_exceptions(true) exists but is #[cfg(feature = "gc")], so it needs the gc feature and a spike confirming wasm-EH actually runs yosys (not just parses). Until then CI + docs pin yowasp-yosys==0.64.0.0.post1131. Removing that pin is the exit criterion.
Fetch yosys.wasm from GitHub release (increment 2, ADR 0018): publish the pinned wasm as a Jacquard release asset; first behavioral run fetches to cache + sha256-verifies.
--synth-target sky130|gf180 — synthesize to a real PDK (uses #160 descriptors) for timing-accurate on-ramp runs.
Empirical SV/Verilog coverage table — turn the docs/accepted-rtl.md prose surface into a measured pass/fail matrix by running SymbiFlow/sv-tests (or a curated subset) through the embedded YoWASP Yosys frontend and recording which constructs synthesize. Because the accepted surface is the Yosys frontend, this is the only trustworthy way to enumerate it (vs hand-claiming a feature list). Publish the table into docs/accepted-rtl.md; automatable as a CI job so the coverage claim stays current as the pinned yosys.wasm moves.
~~Wire L4-from-descriptor onto the runtime timing path~~ — landed on main (de8255f3, ADR 0019 D5): all runtime timing paths now source L4 from the cell-model-IR descriptor, so --corner on-ramp timing for built-in PDKs is live. No longer a follow-up.
Project manifest (Jacquard.toml) — collapse the positional sim netlist in.vcd out.vcd N arg soup and hold synth-target/top/sources, referencing the existing sim_config.json. Its own ADR when scheduled.

Non-goals

No config/manifest format this pass (Phase 4).
No Phase-2 \src-provenance work (ADR 0021 Phase 2 roadmap, unchanged).
No change to the AIG/boomerang core or sverilogparse (ADR 0021 §4).

Plan — RTL-source provenance (ADR 0021 Phase 2)

2026-07-15 — shipped in v0.3.0. Behavioral RTL → sim → a signal resolves to its RTL line, in a released binary. src/synth.rs maps via abc_new and emits \src; the provenance fork lives at gpu-eda/yowasp-yosys, whose CI builds + validates + releases the wheel, and jacquard fetches that pinned release on first use (no PAT, no artifact expiry). aigpdk mapping needed two fixes the sky130 A0 didn't surface: read_liberty -lib + hierarchy -purge_lib before a single abc_new pass. Verified against third-party RTL (PULP common-cells via core-v-mcu): the emitted netlist carries (* src = "binary_to_gray.sv:15.8" *).

Status: Shipped (v0.3.0) with one deferred item and one known toolchain bug. Per-section ✅ markers below remain authoritative for detail.

WS-A — done. The forked provenance wasm builds and carries origins through the in-process WASI abc_new/aiger2 XAIGER-"y" path (100% \src coverage on comb + seq2). A1 (harden/pin) + A2 (distribute) + the src/synth.rs → abc_new integration all landed (#176).
WS-B — B0–B3 done (sverilogparse capture → netlistdb.cell_src → AIG::aigpin_src_locations, surfaced in xsources and --trace-signals). Deferred: B3's timing-report surface — src locations in --timing-report. That's the one piece of the original plan not shipped.
Known bug in the toolchain this plan adopted: abc_new hard-traps (wasm unreachable) on a design that reaches it with zero primary I/O — bisected; neither the constant clock nor emptiness is the trigger, one port of either direction avoids it, and the failure is inside the ABC9 sub-pass. Whether &origins causes it or it's stock abc9 is open; the A/B is blocked because a stock YoWASP wheel won't load under our wasmtime (built with wasm exception handling, which we don't enable). Tracking: robtaylor/origin-shell#1 upstream, #211 here (whose reporting half is fixed — a trap now names the pass instead of printing 27 wasm frames). Low impact for real designs: a portless module is testbench-shaped.

ADRs: 0021 Phase 2 (the roadmap this realises), 0014 (AIG core the provenance rides through), 0018 (wasm distribution).

Predecessors: ADR 0021 Phase 1 (the on-ramp — sim/cosim synthesize behavioral RTL via embedded YoWASP Yosys). This plan makes the results of that simulation speak RTL source locations instead of flattened gate names.

Tracking: #162 Phase 2.

Goal

Thread \src (RTL source file:line, Yosys's source-provenance attribute) from synthesis all the way to Jacquard's user-facing outputs, so that --trace-signals, timing-violation reports, and X-debugging (xsources/xroots) report RTL source locations rather than post-synthesis flattened gate names. "Why is spiflash.ctrl.\$auto$…$1234 X?" becomes "spiflash.v:88".

Why it's two independent problems

The toolchain must emit \src. Stock YoWASP Yosys drops source provenance through std-cell mapping (abc). Carrying it needs a patched toolchain — the origin-shell \src pass-through: berkeley-abc #487 (vOrigins/&origins) + robtaylor/yosys@src-retention-y-ext.
Jacquard must ingest \src. Even with a perfect provenance netlist, Jacquard throws it away today: sverilogparse discards all attributes (vendor/eda-infra-rs/sverilogparse/src/sverilognom.rs:44 — "we regard attributes as comments"), and netlistdb has no source-location field.

These are independent and should be de-risked in parallel: WS-B (ingestion) can be prototyped end-to-end against a hand-annotated (* src=… *) netlist before the real toolchain (WS-A) exists. Only invest in the heavy wasm build (A1) once both the A0 spike passes and WS-B is proven.

WS-A — Provenance-carrying toolchain

Reframed after reading origin-shell (2026-07-04). The gate is not "does \src survive in-process abc" — origins ride the XAIGER "y" channel (an in-memory AIGER round-trip via the aiger2 reader/writer, keyed by object id), so external-vs-in-process abc is irrelevant to the channel. origin-shell's real lesson: provenance cannot survive the classic abc/BLIF path at all (BLIF has no object identity), so the std-cell flow must move to abc9/abc_new. Data path (origin-shell POC): abc_new: write_xaiger2 → ABC (&read; &origins; &dch -f; &nf) → read_xaiger2.

This means two things in Jacquard's own synth.rs must change, independent of the patched wasm:

src/synth.rs maps with classic abc -liberty (lines 241/244) → must become the abc_new flow (&dch -f; &nf, scratchpad -set abc9.origins_max N) to carry origins.

It writes with write_verilog -noattr (line 247) — -noattr strips \src. Must drop it (attribute-selective if needed).

QoR caveat: abc9 std-cell mapping is "the road less travelled" upstream (YosysHQ/yosys#5679 removed abc9 -liberty; the yosys fork carries the aiger2 std-cell path). Bare &nf is 9–22% worse area than classic abc; &dch -f; &nf recovers to parity. So the flow change carries a QoR validation obligation, not just a wiring change.

A′ — De-risk the `abc_new` flow on the stock wasm first (no patched toolchain)

abc_new/&nf exists in the stock yosys.wasm we already ship (just without the &origins patch). So the flow migration + its QoR risk can be proven before any fork/build work, moving the "road less travelled" risk off the patched-wasm critical path.

In src/synth.rs, switch the two abc -liberty aigpdk_nomem.lib passes to the abc_new &dch -f; &nf flow against aigpdk_nomem.lib; keep -noattr for now (no origins to preserve yet).
Exit: the on-ramp still produces a correct aigpdk netlist via abc_new, and QoR (cell count / area from stat) is at parity with the classic-abc baseline on the counter/assert/mem designs.

A′ spike result (2026-07-05, stock yowasp-yosys 0.64) — the early signal fired. Ran abc_new -script +&dch,-f;&nf -liberty aigpdk_nomem.lib on a small design (scratchpad /tmp/claude/aprime):

Combinational-only (adder) → works (maps to $_NAND_/$_XNOR_/…).

Sequential (a posedge flop) → FAILS: ERROR: Bad connection $auto$ff.cc:337:slice$150/D ~ \d [1] (abc_new marked "experimental").

classic abc baseline on the same design → works, 33 cells (23 AND2 + 4 DFF + 6 INV).

The ff.cc error is precisely the flop-handling that robtaylor/yosys@src-retention-y-ext fixes (origin-shell lists "the ff.cc/memory_map/mem.cc fixes"). So A′ cannot be fully de-risked on the stock wasm — the yosys fork is required even to make abc_new function on sequential logic, not just to carry origins. Consequence: A′ folds into A0 — the first real test is building the forked wasm and running the abc_new flow there (getting flow-correctness + QoR + origins in one shot), since stock can only validate the purely-combinational sub-case.

A0 — GATING: build the forked wasm, check `\src` survives

With A′ proving the abc_new flow works, A0 adds the origins: build the patched wasm and confirm \src rides the XAIGER "y" channel through mapping. The spike is the build (the wasm is the A1 artifact — no throwaway work).

Build inputs — ✅ ALL WIRED + VERIFIED 2026-07-06 (the two repoints below are done; recon was against Codeberg YoWASP/yosys, 2026-07-04). Chain:

robtaylor/yowasp-yosys @ yowasp-yosys-integration   (NEW; mirror of Codeberg, base develop-0.64)
  └─ yosys-src → robtaylor/yosys@src-retention-y-ext  (bcc5698)
       └─ abc  → robtaylor/abc@origin-tracking-clean  (2daf32f2, #487)   ← abc is nested in yosys-src
       └─ yosys-slang-src → povik/yosys-slang          (SV frontend, upstream)

✅ yowasp overlay: created robtaylor/yowasp-yosys (full mirror; default develop = clean mirror, yowasp-yosys-integration carries the repoint), repointed yosys-src → robtaylor/yosys@src-retention-y-ext.
✅ Inside the yosys fork: repointed robtaylor/yosys@src-retention-y-ext's abc submodule YosysHQ/abc → robtaylor/abc@origin-tracking-clean (commit bcc5698). Without this the build would have yosys-side \src retention but stock abc, and provenance would die at mapping.

Build mechanics — ⚠️ CORRECTED 2026-07-06 (supersedes the "CMake + wasi-sdk 33" recon; that was develop HEAD, the WRONG base for our fork). Two hard constraints force the develop-0.64 Makefile recipe, not develop's CMake:

robtaylor/yosys@src-retention-y-ext is Makefile-only (no CMakeLists.txt) — develop's cmake -S yosys-src cannot build the patched fork at all.
develop's CMake migration (61073ed "CI: switch to CMake based builds") dropped yosys-slang — its wasm has no read_slang, the SV frontend the on-ramp depends on. develop also moved to wasi-sdk 33.

So base the overlay's integration branch on develop-0.64: build.sh is the Makefile flow (make -f yosys-src/Makefile CONFIG=wasi, -flto, -Wl,-z,stack-size=8M) + wasi-sdk 27 + flex-2.6.4 built from source + yosys-slang built via cmake and whole-archive-linked (libyosys-slang.a). This is the exact recipe that built the pinned yowasp-yosys 0.64 the on-ramp uses. It hardcodes x86_64-linux wasi-sdk → Linux/Docker/CI only (done: the A0 CI runs ubuntu-latest).

A0 CI (authored 2026-07-06), robtaylor/yowasp-yosys/.github/workflows/provenance-wasm.yml: build (compile → WASI, link slang, smoke read_slang/abc_new, upload wasm+wheel) + provenance-check (origin-shell's abc_new origins flow on comb+seq2 vs sky130, \src coverage; seq2@0% fails). Residual risk CI will surface: develop-0.64's yosys-slang pin 4e53d772 targets yosys 0.64 (Apr); our fork's yosys is ~Jun 2026, so slang may need a newer pin to compile.

Repoint (steps 1–2 above), run build.sh, and run the A′ abc_new synth flow with origins_max set and -noattr dropped on a small multi-module design; measure the \src coverage — the % of mapped cells carrying a src attribute. Crib origin-shell's test/src_coverage.sh harness (loads write_json output, counts 'src' in cell['attributes'] over mapped cells) — reuse it near verbatim.
Go/no-go: high \src coverage on mapped cells → you already hold the provenance wasm (A1 done); proceed to harden (A1) + distribute (A2). Low/zero → the aiger2 "y" channel or abc &origins isn't functioning in the wasm build; diagnose before investing further.
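The coverage metric used for the go/no-go (the share of mapped cells carrying a src attribute) can be sketched as follows. This is not origin-shell's src_coverage.sh harness, which works on write_json output; the `Cell` struct, `src_coverage`, `coverage_percent`, and the `is_mapped` predicate are hypothetical stand-ins for cells already loaded into memory.

```rust
use std::collections::HashMap;

/// One cell from the mapped netlist (illustrative shape, not a real schema).
pub struct Cell {
    pub cell_type: String,
    pub attributes: HashMap<String, String>,
}

/// Returns (mapped cells carrying `src`, mapped cells considered).
pub fn src_coverage(cells: &[Cell], is_mapped: impl Fn(&str) -> bool) -> (usize, usize) {
    // Only cells that survived technology mapping count toward the total.
    let mapped: Vec<&Cell> = cells
        .iter()
        .filter(|c| is_mapped(c.cell_type.as_str()))
        .collect();
    let with_src = mapped
        .iter()
        .filter(|c| c.attributes.contains_key("src"))
        .count();
    (with_src, mapped.len())
}

/// Coverage as a percentage; None when there are no mapped cells at all,
/// so an empty design is not reported as either 0% or 100%.
pub fn coverage_percent(with_src: usize, total: usize) -> Option<f64> {
    if total == 0 {
        None
    } else {
        Some(100.0 * with_src as f64 / total as f64)
    }
}
```

Returning the raw counts alongside the percentage keeps results in the "n/total = p%" form, which is easier to sanity-check than a bare percentage.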

✅ A0 GO (2026-07-06, CI run 28779240456). The provenance wasm builds and carries origins: comb 4/4 = 100%, seq2 88/88 = 100% \src coverage on mapped sky130 cells. Origins survive the in-process WASI abc_new round-trip (including sequential logic) — the ADR's key open risk is resolved. Getting here also required rebasing robtaylor/abc@origin-tracking-clean onto current berkeley-abc/master (it was 97 commits behind, missing the #ifdef __wasm guards around abc's system() calls → wasm-ld: undefined symbol: system). CI: robtaylor/yowasp-yosys/.github/workflows/provenance-wasm.yml.

Fallbacks if it fails: (a) debug the aiger2/&origins path in the wasm build (the channel is XAIGER round-trip, not exec, so the failure is in the reader/writer or the patch, not "in-process abc" per se); (b) Nix native toolchain — origin-shell is already a Nix flake with the validated flow, so a native back-end for jacquard sim --synth-backend nix is the lowest-risk way to ship provenance if the wasm path proves stubborn; (c) native-only provenance, on-ramp wasm stays provenance-free.

A1 — Harden & pin the provenance build

The A0 build, productionized. Thin harness fork, not our own build recipe: YoWASP/yosys is just a build harness, so the overlay's whole delta is the two submodule repoints. That is trivial to rebase when upstream moves, and it beats hand-rolling a WASI/LTO build (where subtle divergence bites hardest). The durable maintenance is keeping the existing yosys/abc patch branches rebased on upstream (unavoidable in any approach, already owned). Per ~/.claude/FORKED_DEPS_WORKFLOW.md, carry each of the three forks (yosys, abc, yowasp overlay) on an -integration branch.

Pinning — track our branches, don't freeze SHAs. These are our actively developed branches (we'll tweak the \src patches, and want to fold in upstream review of abc#487), so pinning them to frozen SHAs would fight normal development. Instead:

Track by branch, not SHA, for the three forks we own (yosys/abc/yowasp overlay) — new patches and upstream-review changes flow through automatically; when a patch merges upstream, retarget the branch to upstream.
Pin only genuinely-third-party build inputs we don't control and need reproducible: wasi-sdk version, flex, and the rest of build.sh's toolchain.
Stamp provenance into each built artifact, not the source: when CI builds a released wasm, record the exact yosys/abc/yowasp SHAs (release notes / asset metadata). That gives release-level reproducibility — any shipped wasm is recreatable — without freezing the dev branches. Develop freely; ship traceably.

Exit strategy: upstream the patches (abc#487 → yosys → a YoWASP build knob) to shrink the forks toward zero.

Exit: a pinned, CI-reproducible wasm that runs the existing synth_script and emits (* src *) on mapped cells; byte-diff vs the stock wasm shows only provenance additions.
Escape hatch (own-recipe): only if A0 shows build.sh is gnarly/undermaintained enough that rebasing the overlay is painful — then reimplement a minimal pinned WASI build in-repo. Default is the thin fork.

A2 — Distribute the provenance wasm

Publish as a Jacquard release asset (shares the Phase-4 fetch-from-release mechanism). Decide: is provenance the default on-ramp wasm, or an opt-in (--yosys-wasm/a provenance flag) larger asset? Recommend opt-in first.

WS-B — Jacquard `\src` ingestion (startable now with a synthetic netlist)

B0 — Capture attributes in `sverilogparse`

Stop discarding (* … *) (sverilognom.rs:44). Parse src (and keep the door open for other attrs) and attach to the cell/wire in the SVerilog AST. vendor/eda-infra-rs is a vendored submodule → fork + integration branch per ~/.claude/FORKED_DEPS_WORKFLOW.md.

Exit: a (* src="f.v:12" *)-annotated netlist round-trips through SVerilog::parse_* with the attribute retrievable per cell.

✅ B0 DONE (2026-07-06). gpu-eda/eda-infra-rs@jacquard-integration a0772f4; SVerilogCell gains src: Option<CompactString>. skip_whitespace_and_comment no longer treats (* ... *) as whitespace; leading_attributes captures/skips at the 3 grammar leading edges (module / port / module-body item). Proven by sverilogparse test_src_attribute (+ tests/attributes.v): parse + round-trip. Caveat: attributes are now only accepted at those leading edges (narrows vs the old "ignored anywhere" — fine for structural netlists).
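For orientation, pulling the src value out of one leading attribute instance can be sketched with plain string handling. This is a deliberately simplified illustration and not the sverilogparse grammar: `leading_src` is a hypothetical helper, it only looks at an attribute at the start of the given text, and it does not handle commas or escaped quotes inside the quoted value.

```rust
/// Extracts the `src` value from a leading `(* ... *)` attribute instance,
/// e.g. `(* src = "f.v:12" *) AND2 u1 (...);` yields `f.v:12`.
pub fn leading_src(text: &str) -> Option<&str> {
    // The attribute must be the first token; anything else is "no attribute".
    let rest = text.trim_start().strip_prefix("(*")?;
    let (body, _after) = rest.split_once("*)")?;
    // Attribute specs are comma-separated `name` or `name = "value"` items.
    body.split(',').find_map(|item| {
        let (name, value) = item.split_once('=')?;
        if name.trim() != "src" {
            return None;
        }
        value.trim().strip_prefix('"')?.strip_suffix('"')
    })
}
```

An item without a value (such as a bare `keep`) is skipped rather than treated as an error, which matches the "keep the door open for other attrs" intent of B0.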

B1 — Carry source location through `netlistdb`

Add an optional per-cell (and where meaningful, per-net) source_loc on NetlistDB, populated from B0. Keep it Option — most cells post-synthesis may have 0 provenance.

Exit: NetlistDB::from_sverilog_file exposes \src for annotated cells; zero overhead / behaviour change when absent.

✅ B1 DONE (2026-07-06). gpu-eda/eda-infra-rs@jacquard-integration 4f7f0ec; NetlistDB.cell_src: Vec<Option<CompactString>> parallel to celltypes/cellnames, populated in insert_cell from SVerilogCell.src through flattening. None for top cell + synthesised inverters + un-annotated cells. Proven by netlistdb/tests/provenance.{rs,v}.

B2 — Preserve provenance through AIG / staging / flatten

Thread a mapping from AIG nodes / endpoint groups back to \src so a sim-visible signal can be traced to a source line. Provenance is lossy by nature (abc merges/splits nodes): design for 0, 1, or many source locations per signal; never assert exactly one.

Exit: given an annotated netlist, a chosen output/endpoint resolves to its source location(s) through the built FlattenedScript.

✅ B2 DONE (2026-07-06, Jacquard 68cc938b). AIG::aigpin_src_locations composes the AIG's existing per-pin cell-origin map (aigpin_cell_origins, built for SDF back-annotation) with netlistdb.cell_src (B1) — no new state threaded through the AIG builder. Returns 0/1/many de-duplicated locations; a primary output resolves directly. The resolver is AIG-level (the AIG is retained alongside the FlattenedScript). Proven by aig::path_mapping_tests + prov_annotated.v.
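The composition described above (per-pin cell origins joined with per-cell source locations, yielding 0, 1, or many de-duplicated results) can be sketched as below. The function name and the slice-based shapes are assumptions for illustration; the real `aigpin_cell_origins` and `cell_src` types may differ, and sorting the output is a choice made here for determinism, not something the plan specifies.

```rust
use std::collections::BTreeSet;

/// Resolves an AIG pin to its de-duplicated RTL source locations.
///
/// * `pin_cell_origins[pin]` lists the netlist cells a pin originated from.
/// * `cell_src[cell]` is that cell's `src` attribute, if it carried one.
pub fn pin_src_locations(
    pin_cell_origins: &[Vec<usize>],
    cell_src: &[Option<String>],
    pin: usize,
) -> Vec<String> {
    // Unknown pin: no provenance rather than a panic.
    let Some(origins) = pin_cell_origins.get(pin) else {
        return Vec::new();
    };
    // Cells without `src` (or out-of-range ids) simply contribute nothing.
    let unique: BTreeSet<&str> = origins
        .iter()
        .filter_map(|&cell| cell_src.get(cell)?.as_deref())
        .collect();
    unique.into_iter().map(str::to_owned).collect()
}
```

Because the result is a list, callers are pushed toward handling the empty and multi-location cases instead of assuming exactly one source line per signal.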

B3 — Surface `\src` in the user-facing outputs

Add source locations to the three consumers, gated on availability (fall back to today's hierarchical gate name when absent):

--trace-signals (docs/signal-tracing.md) — annotate traced signals.
Timing-violation reports (docs/timing-violations.md).
X-debugging xsources / xroots (docs/x-debugging.md) — the highest-value target: report the RTL line of an X-source, not the flattened gate.
Exit: an end-to-end run on a synthetic annotated netlist prints source locations in all three; a design with no provenance is unchanged.

⏳ B3 IN PROGRESS (2026-07-06) — 2 of 3 consumers done.

✅ xsources (ec7b648f): optional src on the XSource record from cell_src[cell_id]; schema 1.0→1.1. (xprop_demo_synth.gv already carried (* src *) → provenance flows through the real pipeline.)

✅ --trace-signals (ca55c786): logs each traced net's RTL src via the B2 resolver at registration.

⏳ timing-violation reports — deferred (largest; governed schema). Path: word_id → cell_id via dff_constraints; add src to DffSiteName → ViolationRecord. A word packs many DFFs → 0/1/many. ADR 0008 permits additive → bump SCHEMA_VERSION 1.2.0→1.3.0 + ADR note.

Sequencing

A′ abc_new on STOCK wasm (flow+QoR) ─► A0 patched wasm + \src coverage ─(go)─► A1 harden/pin ─► A2 distribute ─┐
                                                                                                           ├─► integrate: on-ramp emits + surfaces \src
B0 parser ─► B1 netlistdb ─► B2 AIG ─► B3 outputs ─────────────────────────────────────────────────────────┘   (validated on synthetic netlist first)

Parallel: the whole A-track and B0–B3 have no dependency on each other; B is provable against a hand-written annotated netlist, and A′ against the stock wasm.
A′ first within WS-A: proves the abc_new flow + QoR on the stock wasm before any fork/build, moving the "road less travelled" QoR risk off the patched-build critical path.
Barrier: full end-to-end (RTL → provenance wasm → surfaced source lines) needs A1+A2 and B0–B3.
Kill-switch: if A0 fails, WS-B still delivers value for any externally produced provenance netlist (DC/native Yosys with origins); the on-ramp just doesn't auto-generate it until a fallback toolchain lands.

Risks / open questions

abc9/aiger2 "y" channel in the wasm build — the gating unknown (A0): does the XAIGER origins round-trip function in the WASI-built yosys+abc?
QoR of the abc_new std-cell flow — abc9 std-cell mapping is unofficial upstream; &nf alone is 9–22% worse area, &dch -f; &nf recovers to parity. A′ must confirm parity on Jacquard's designs before committing the flow change.
sverilogparse fork maintenance — vendored submodule; carry the patch on a -integration branch per FORKED_DEPS_WORKFLOW; upstreamable (attribute capture is generally useful).
Does \src survive Yosys flatten (separate from abc)? origin-shell targets the full flow, but confirm in A′/A0, since the on-ramp flattens.
Provenance granularity — post-optimization a gate may map to 0/1/many source lines; the reporting (B3) and IR (B2) must not assume 1:1.
Asset size — a second (provenance) wasm ~doubles the fetched-asset story; A2's opt-in-vs-default call interacts with Phase-4 distribution.

Non-goals

No change to the emulator AIG/boomerang core (ADR 0014/0015).
Not blocked on upstream YoWASP adopting the patches (we build our own wasm).
Full SVA / Verific-grade provenance is out of scope (bounded by the patched open-source toolchain).

Spike — OpenTimer on SKY130 and MCU SoC

Status: Proposed. Not yet executed.

Time box: Half a day. Extend by up to one day if initial signs are positive but the work is hitting specific SKY130 quirks. Abort and fall back if progress is blocked in the first four hours.

Goal

Determine whether OpenTimer (MIT, C++17) can reliably parse and analyse Jacquard's real-flow inputs — SKY130 Liberty and OpenLane2 MCU SoC post-P&R output — well enough to serve as Jacquard's in-process reference STA (per ADR 0003).

The outcome resolves ADR 0003's Pending Spike status to either Accepted or Superseded.

Out of scope for this spike

C++ FFI / bindgen integration work. Pure spike on OpenTimer's standalone behaviour.
Timing-IR integration. Establishing that OpenTimer produces usable arrival/slack output is sufficient; converting it to IR belongs in phase 1.
Performance measurement beyond rough "does it complete in reasonable time."
Commercial-PDK coverage. SKY130 is the spike target; private-track confirmation is later.

Setup

Required artefacts (checked before starting):

OpenTimer clone and local build (MIT licence, standard CMake).
SKY130 Liberty file(s) matching the corner the MCU SoC flow uses. At minimum sky130_fd_sc_hd__tt_025C_1v80.lib.
MCU SoC post-P&R output: synthesised .v, SDC, and (critically) .spef. Check that the current OpenLane2 invocation is configured to produce SPEF; if not, enable it. OpenTimer requires SPEF; it does not consume SDF.
Jacquard's current timing-analysis binary output on the same design for comparison.
OpenSTA installed locally, for three-way comparison.

Success criteria

The spike answers four questions. Each is a pass/fail observation, not a measurement.

Q1 — Does OpenTimer parse SKY130 Liberty without errors?

Pass: clean parse, no warnings that indicate misinterpreted cells.
Partial: parses but warns on specific cells — in particular sky130_fd_sc_hd__dlygate4sd3_* or anything with non-trivial conditional timing. Document which cells and whether their timing is discarded or mishandled.
Fail: parse errors, segfaults, or silently-wrong output on recognised cells.

Q2 — Does OpenTimer compute arrivals on the MCU SoC design?

Feed .lib + .v + .spef + .sdc. Run report_timing -worst 20 or equivalent. Observe:

Pass: produces a full timing report with reasonable-looking arrivals (non-zero, monotonic along paths).
Partial: produces a report but with suspect values (many zeros, missing cells, incomplete paths).
Fail: hangs, crashes, or refuses to analyse.
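
The Pass wording for Q2 ("non-zero, monotonic along paths") can be made mechanical. The following is a minimal sketch, assuming the arrivals of one path have already been extracted into a slice ordered from startpoint to endpoint. The helper name, the nanosecond unit, and the choice to accept equal consecutive arrivals are illustrative assumptions, not part of OpenTimer or Jacquard.

```rust
/// Hypothetical helper: true when a path's arrivals look plausible, meaning
/// the endpoint arrival is positive and values never decrease along the path.
fn arrivals_look_sane(arrivals_ns: &[f64]) -> bool {
    let endpoint_positive = arrivals_ns.last().map_or(false, |&t| t > 0.0);
    let monotonic = arrivals_ns.windows(2).all(|w| w[0] <= w[1]);
    endpoint_positive && monotonic
}

fn main() {
    // Made-up values for illustration, not measured data.
    println!("{}", arrivals_look_sane(&[0.0, 0.4, 1.1, 2.5]));
    println!("{}", arrivals_look_sane(&[0.0, 0.0, 0.0]));
}
```

A report full of zeros or with arrivals that go backwards along a path would land in the Partial bucket under this check.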

Q3 — Does OpenTimer's result agree with OpenSTA?

Run OpenSTA on the same inputs, compare top-20 critical endpoints' arrivals. Declare tolerance: ±5% on arrival time, ±10 ps absolute floor for very short paths.

Pass: all top-20 endpoints within tolerance.
Partial: most within tolerance, a small number of outliers traceable to specific delay-model differences (e.g., CCS vs NLDM).
Fail: systematic disagreement suggesting OpenTimer is computing something meaningfully different. Investigate; if the disagreement is on SKY130 cell interpretation (a PDK handling issue) this is essentially a fail for our purposes.
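
The declared tolerance can be read as "5% of the reference arrival, but never tighter than 10 ps". A minimal sketch of that reading follows; the function name, the picosecond unit, and the choice of OpenSTA as the reference side are assumptions made for illustration.

```rust
/// Hypothetical check: is `candidate_ps` within tolerance of `reference_ps`?
/// The allowed error is 5% of the reference, with a 10 ps absolute floor so
/// very short paths are not held to a sub-picosecond band.
fn within_tolerance(reference_ps: f64, candidate_ps: f64) -> bool {
    let allowed_ps = (0.05 * reference_ps.abs()).max(10.0);
    (candidate_ps - reference_ps).abs() <= allowed_ps
}

fn main() {
    // Made-up arrivals for illustration, not measured data.
    println!("{}", within_tolerance(4000.0, 4150.0)); // 150 ps off, 200 ps allowed
    println!("{}", within_tolerance(100.0, 109.0));   // 9 ps off, floor of 10 ps applies
}
```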

Q4 — Does OpenTimer's result correlate with Jacquard's current timing analysis?

Compare worst-slack and top-K endpoint lists (not exact values — pessimism differences are expected and documented). Observe:

Pass: top-K lists overlap substantially; worst-slack is on a comparable path.
Informational: any systematic discrepancy tells us what the pessimism delta actually looks like in practice. This data informs R4 (critical-path refinement reporting) whether OpenTimer is adopted or not.
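
"Overlap substantially" is left qualitative here. One way to put a number on it is the fraction of one tool's top-K endpoints that also appear in the other's. This is a sketch only: the function name is hypothetical, endpoint names are assumed to be unique within each list, and no threshold is implied.

```rust
use std::collections::HashSet;

/// Hypothetical metric: fraction of endpoints in `a` that also appear in `b`.
/// Returns 0.0 when `a` is empty.
fn top_k_overlap(a: &[&str], b: &[&str]) -> f64 {
    if a.is_empty() {
        return 0.0;
    }
    let b_set: HashSet<&str> = b.iter().copied().collect();
    let shared = a.iter().filter(|e| b_set.contains(*e)).count();
    shared as f64 / a.len() as f64
}

fn main() {
    // Made-up endpoint names for illustration.
    let reference = ["ep_a", "ep_b", "ep_c", "ep_d"];
    let candidate = ["ep_b", "ep_a", "ep_x", "ep_y"];
    println!("{}", top_k_overlap(&reference, &candidate));
}
```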

Decision matrix

Q1	Q2	Q3	Outcome
Pass	Pass	Pass	ADR 0003 → Accepted. Proceed to phase 1 integration.
Pass	Pass	Partial	ADR 0003 → Accepted with documented scope limits. Define where OpenTimer is authoritative vs deferred to OpenSTA.
Pass	Partial	—	ADR 0003 → Accepted provisionally; spike extends to investigate Q2 anomalies.
Partial	—	—	ADR 0003 → Accepted with SKY130 cell workarounds documented, or → Superseded if the workarounds are too invasive.
Fail on any	—	—	ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation. Revisit libreda-sta or in-house walker as alternatives in a follow-up ADR.
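
The matrix can be read as a small decision function. The sketch below is one interpretation, not project code: it assumes the "Fail on any" row takes priority over every other row, models a question that was not run as `None`, and adds a `NotCovered` outcome for combinations the matrix does not list. All type and function names are hypothetical.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Answer {
    Pass,
    Partial,
    Fail,
}

#[derive(PartialEq, Debug)]
enum Outcome {
    Accepted,
    AcceptedWithScopeLimits,
    AcceptedProvisionally,
    WorkaroundsOrSuperseded,
    Superseded,
    NotCovered,
}

/// `None` means the question was not run.
fn decide(q1: Answer, q2: Option<Answer>, q3: Option<Answer>) -> Outcome {
    use Answer::{Fail, Partial, Pass};
    // Assumed priority: any Fail supersedes ADR 0003 regardless of other answers.
    if q1 == Fail || q2 == Some(Fail) || q3 == Some(Fail) {
        return Outcome::Superseded;
    }
    match (q1, q2, q3) {
        (Partial, _, _) => Outcome::WorkaroundsOrSuperseded,
        (Pass, Some(Partial), _) => Outcome::AcceptedProvisionally,
        (Pass, Some(Pass), Some(Pass)) => Outcome::Accepted,
        (Pass, Some(Pass), Some(Partial)) => Outcome::AcceptedWithScopeLimits,
        _ => Outcome::NotCovered,
    }
}

fn main() {
    // Q1 Pass, Q2 Fail, Q3 not run.
    println!("{:?}", decide(Answer::Pass, Some(Answer::Fail), None));
}
```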

Fallback

If the spike fails, Jacquard operates with:

OpenSTA subprocess validation in CI (ADR 0001) as the sole timing-reference mechanism.
No per-PR in-process timing cross-check; feedback timing degrades.
Phase 1 drops OpenTimer integration work and refocuses on tightening OpenSTA-driven CI.

Superseding ADR 0003 is clean — it is currently Pending Spike so no downstream work has accrued to it. Phases 0 and 2 are unaffected.

Progress log

Setup (2026-04-23 → 2026-04-30)

OpenTimer 2.1.0 and OpenSTA 3.1.0 cloned to Jacquard-depends/ and built locally. Build notes in that repo's README.md.
SKY130 Liberty already on disk via volare: ~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/libs.ref/sky130_fd_sc_hd/lib/sky130_fd_sc_hd__tt_025C_1v80.lib.
Spike artefacts kept in this worktree under spike-out/ (gitignored — reproducible from Jacquard-depends/).

Q1 — Liberty parse (2026-04-30) — Pass

Tool	Cells loaded	Wall time	Warnings
OpenTimer 2.1.0	428	0.12 s	1
OpenSTA 3.1.0	428	0.18 s	0

Cell counts agree exactly. OpenSTA parses cleanly. OpenTimer emits one warning:

W celllib.cpp:274] unexpected lut template variable normalized_voltage

The normalized_voltage axis appears in exactly one place in the Liberty: the library-level normalized_driver_waveform("driver_waveform_template") block, which is CCS-driver-waveform data. No per-cell timing arc references it — cell_rise/cell_fall/rise_constraint/fall_constraint all use the NLDM templates del_1_7_7, vio_3_3_1, constraint_3_0_1. So the warning has no impact on arrival/slack computation under NLDM, which is what OpenTimer does anyway.

Operational note: OpenTimer's read_celllib is lazy — the parse only runs when an action like update_timing (or report_*) forces taskflow execution. Issuing dump_celllib immediately after read_celllib reports "celllib not found" because the read hasn't fired yet. Always insert update_timing before any inspection command.

The documented read_celllib -min|-max <file> syntax silently no-ops; bare read_celllib <file> loads the lib as both min and max corners. Filed as a docs/build mismatch in our Jacquard-depends/README.md.

Q2 — Arrival computation on SKY130 (2026-05-01) — Fail

Used OpenSTA's bundled gcd_sky130hd example (a canonical SKY130-HD GCD with .v, .sdc, .spef, .lib) as a fast smoke test before tackling MCU-SoC SPEF generation. If OpenTimer can't handle this, the MCU-SoC effort is wasted.

OpenSTA baseline: clean run, period 5 ns, top arrival 4.82 ns, WNS 0.00, slack 0.09 met. 0.28 s wall, zero warnings.

OpenTimer: could not produce a single timing path. The result was no critical path found, wns = nan, tns = nan — even after working around the following issues, each of which had to be discovered and patched manually:

#	Issue	Workaround tried	Status
1	`read_celllib -min|-max <file>` (the documented syntax) silently no-ops	bare `read_celllib <file>` loads as both corners	works
2	`dump_` after `read_` reports state-not-loaded because the read is lazy	insert `update_timing` before any inspection	works
3	Tap cells in post-P&R Verilog (`sky130_fd_sc_hd__tapvpwrvgnd_*`) trigger 1040 `cell not found in celllib` errors and abort the netlist load	strip tap cell instances from Verilog	works
4	OpenTimer's bundled SDC parser uses pre-TCL-8.5 syntax (`trace variable VAR w CMD`); fails on the system's TCL 8.6 with `bad option "variable"` and produces zero parsed commands — even on OpenTimer's own bundled examples	patch `ot/sdc/sdcparsercore.tcl:144` to `trace add variable sdc_version write __set_v`	works (one-line fix; should be upstreamed)
5	OpenSTA-style SDC with `set period 5 / expr $period * 0.2 / [all_inputs]` parses as zero commands	hand-write a literal SDC with `create_clock -name clk -period 5 [get_ports clk]`	works for trivial constraints; non-trivial SDC remains uncovered
6	SPEF `*PORTS` section (standard SPEF, IEEE 1481, emitted by OpenROAD/OpenLane) is rejected with a parse error pointing at the first port line	strip `*PORTS` block from SPEF before reading	works
7	Verilog bus ports (`input [31:0] req_msg;`) are not bit-blasted by OpenTimer's Verilog parser, but post-P&R SPEF references the bus as bit-indexed nets (`req_msg[0]`, `req_msg[1]`, …). 48 bus-element nets fail to match between netlist and SPEF	none found	blocking
8	After all of the above, two interior pins (`_251_:B`, `_218_:B`) report "not found in rctree" and the timing graph remains disconnected enough that no path can be reported	not investigated further	blocking

Issues 7 and 8 mean that on a SKY130 design with bus ports — i.e. any design that talks to the rest of the world — OpenTimer cannot compute arrivals from a standard OpenROAD .v/.spef pair without inputs being pre-processed by code that doesn't exist.
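
To make the naming mismatch in issue 7 concrete: the netlist declares one bus, while the SPEF refers to each bit by an indexed name. The sketch below only illustrates that expansion. It is not a fix for OpenTimer's Verilog parser and not the missing pre-processing code; the function name is hypothetical.

```rust
/// Hypothetical illustration: expand a bus declared as `[msb:lsb]` into the
/// bit-indexed net names a post-P&R SPEF uses, in ascending bit order.
fn bit_blast(name: &str, msb: u32, lsb: u32) -> Vec<String> {
    let (lo, hi) = if msb >= lsb { (lsb, msb) } else { (msb, lsb) };
    (lo..=hi).map(|i| format!("{name}[{i}]")).collect()
}

fn main() {
    // `input [31:0] req_msg;` in the netlist corresponds to 32 SPEF net names.
    let nets = bit_blast("req_msg", 31, 0);
    println!("{} nets, first {}, last {}", nets.len(), nets[0], nets[31]);
}
```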

The cumulative finding is not "OpenTimer mishandles a few SKY130 cells". It is that OpenTimer's input pipeline (Verilog parser, SPEF parser, bundled SDC parser) is incomplete relative to what real OpenROAD-flow outputs contain, and the gaps fall on hot paths (bus ports, tap cells, modern TCL, OpenROAD-emitted SPEF). The cells themselves parse fine (Q1); it's the surrounding ecosystem that doesn't.

Q3, Q4 — not run

Q3 (cross-check vs OpenSTA) and Q4 (correlation with Jacquard's timing-analysis) both depend on OpenTimer producing arrivals. With Q2 unable to produce a single path, they're moot for this spike.

Decision

ADR 0003 → Superseded. Per the spike's decision matrix ("Fail on any → ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation"), the right move is to retire the in-process-OpenTimer plan and lean on OpenSTA-subprocess validation (ADR 0001) as the sole timing reference. A follow-up ADR should consider libreda-sta or an in-house walker if an in-process reference is still wanted later.

OpenTimer's strengths (in-process C++17, taskflow-based, MIT, fast for the academic benchmarks it ships with) are real, but the input-pipeline gaps are large enough that adopting it would mean owning a non-trivial fork — the opposite of what a "lightweight in-process reference" is supposed to be.

The Liberty parser is genuinely capable (Q1 passed cleanly on the 12 MB SKY130 NLDM lib in 120 ms), so OpenTimer remains an option for future narrow tasks like Liberty introspection, but not as the STA engine.

Setup notes worth keeping

OpenSTA bundles gcd_sky130hd.{v,sdc,spef} and sky130hd_tt.lib.gz — a cleaner SKY130 smoke-test fixture than anything we'd have produced from chipflow in the time we had.
~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/... is the live SKY130 PDK already on this machine (chipflow installs it). No need to fetch it separately.

Deliverable

A short report added to this document as a "Spike outcome" section, summarising:

Which Q1–Q4 answers were observed.
Specific SKY130 cells where OpenTimer misbehaves (if any).
Whether SPEF generation had to be added to the OpenLane2 flow, and what that change was.
Decision: confirm, scope-limit, or supersede ADR 0003.

Spike — reaching AMD laptops

Status: The compute-API question is answered — stay on HIP, do not port to OpenCL (see "Revised conclusion"). ROCm now builds, runs, and passes all 14 cosim goldens byte-identically, guarded by the permanent HIP Tests (ROCm backend) CI job — and on APU-class silicon, which is better news than this doc originally thought (see the correction below).

sim now runs on ROCm too — it never had, because the device has no cooperative launch — via a non-cooperative fallback; see "sim on a device without cooperative launch" below.

What remains open is the last mile: whether the laptop archs (gfx1103/gfx1150/gfx1151) run correctly, which needs silicon we don't have.

Question: what does Jacquard need in order to run on an AMD laptop, and is the answer HIP/ROCm, OpenCL, Vulkan, or something else?

Why this is a question at all

The HIP Tests (NVIDIA backend) job builds hip-runtime-nvidia and runs on tesla4-runner: HIP-over-CUDA, never ROCm. The org's self-hosted AMD runner sat online and idle because nothing targeted it (#198).

Probing it (2026-07-15) showed the box itself is fine — ROCm 7.2.4, hipcc (HIP 7.2.53211), hipconfig --platform = amd, and a trivial HIP kernel compiles and runs correctly on the GPU:

--- native ---
result: 1 2 3 4 5 6 7 8
NATIVE: OK

It reports gfx1030, despite a gfx1036 runner label — and the original version of this spike drew the wrong conclusion from that, which is worth recording because the error ran for a while and pointed the whole investigation the wrong way.

Corrected (2026-07-16, measured on the runner): the label is right and the report is spoofed. The runner is the Raphael iGPU integrated into nvidia1's own Ryzen 5 7600 — lspci says 1002:164e (Raphael), which is gfx1036, and the amd-runner container image bakes in HSA_OVERRIDE_GFX_VERSION=10.3.0:

$ docker inspect amd-runner --format '{{range .Config.Env}}{{println .}}{{end}}'
HSA_OVERRIDE_GFX_VERSION=10.3.0
RUNNER_LABELS=self-hosted,amd,gfx1036,rocm,hip,vulkan,cubecl

$ rocminfo | grep gfx                        →  gfx1030    # spoofed
$ env -u HSA_OVERRIDE_GFX_VERSION rocminfo   →  gfx1036    # the truth

So the claim that "green HIP CI on that runner would tell us nothing about an AMD laptop" was backwards, and wrong in our favour. The runner is an integrated RDNA2 GPU — an APU, on the same support tier as the laptop parts we care about, not a discrete board on ROCm's main matrix. The 14/14 ROCm cosim goldens are therefore already passing on APU-class silicon. That is a far better laptop proxy than a discrete gfx1030 would have been.

Note what that also means: gfx1036 (Raphael) appears nowhere on ROCm's supported lists (see the table below) — and it works anyway, via the override. That is a data point about how binding those lists really are.
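
The relation between the override value and the reported target name is mechanical: `10.3.0` yields `gfx1030`. A minimal sketch of that mapping follows, assuming the usual major/minor/stepping encoding in which the last two fields are single hexadecimal digits. The function name is hypothetical, and this is not ROCm's own parsing code.

```rust
/// Hypothetical helper: convert an HSA_OVERRIDE_GFX_VERSION value such as
/// "10.3.0" into the gfx target name it makes the runtime report.
/// Returns None for anything that is not three numeric fields.
fn gfx_name_from_override(version: &str) -> Option<String> {
    let parts: Vec<u32> = version
        .split('.')
        .map(|p| p.parse::<u32>().ok())
        .collect::<Option<Vec<u32>>>()?;
    match parts.as_slice() {
        // Assumption: minor and stepping each occupy one hex digit.
        [major, minor, step] if *minor < 16 && *step < 16 => {
            Some(format!("gfx{major}{minor:x}{step:x}"))
        }
        _ => None,
    }
}

fn main() {
    println!("{:?}", gfx_name_from_override("10.3.0"));
    println!("{:?}", gfx_name_from_override("11.0.0"));
}
```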

What ROCm actually supports on laptops (as of 2026-07-15)

Prior assumption — "ROCm doesn't do laptop APUs, you need HSA_OVERRIDE_GFX_VERSION" — is out of date. ROCm 7.2.1 added official Ryzen APU support. But it's narrower than "AMD laptops":

The main compute compatibility matrix lists no APU targets at all (gfx908, gfx90a, gfx942, gfx1030, gfx1100, gfx1101, gfx1200, gfx1201, gfx950). APU support lives in a separate tier, Use ROCm on Radeon and Ryzen. The native-Linux support matrix there lists exactly two gfx targets:

Silicon	Target	Parts
Strix Halo	`gfx1151`	Ryzen AI Max+ 395, Max 390, Max 385
Strix Point	`gfx1150`	Ryzen AI 9 HX 375, HX 370, 365
400-series	`gfx1150`/`gfx1151`	Ryzen AI 9 HX 475, HX 470, 465

Not listed: gfx1103 (Phoenix / Hawk Point — Ryzen 7040/8040), gfx1035 (Rembrandt), gfx1036 (Raphael), gfx90c (Cezanne). One search summary claimed gfx1103 was supported; the authoritative page does not list it. Treat as unconfirmed pending the research below.

Support is also version-churny — ROCm#5339 is titled "Confusing rocm support for gfx1151", and reports suggest 6.4.2 had gfx1151 but not gfx1150, with both in the 7.13.0 preview matrix.

The install contract is the real problem

Per the Ryzen Linux install guide, running ROCm on a supported APU needs:

Ubuntu 24.04.4 specifically (24.04.3 "preliminary");
the 6.14-1018 OEM kernel or newer (apt install linux-oem-24.04c);
amdgpu-install -y --usecase=rocm --no-dkms — --no-dkms is mandatory (inbox drivers required); if DKMS lands anyway, autoremove amdgpu-dkms dkms;
BIOS changes: minimum dedicated VRAM (0.5 GB) plus a raised TTM limit (via amd-ttm from the amd-debug-tools PyPI package);
usermod -a -G render,video + reboot;
no in-place upgrades — uninstall before upgrading.

And a trap that presents as "the tool is broken": on gfx1150, GPU detection fails when UMA is "Auto"/Dynamic VRAM and silently falls back to CPU. Fixed VRAM in BIOS works.

So even on supported laptop silicon, ROCm is: two gfx targets, one Ubuntu point release, a specific OEM kernel, and a BIOS change. That is a demanding contract to put in front of someone who just wants to simulate a netlist.

What a non-CUDA backend costs us

Measured, not estimated:

File	Lines	Role
`csrc/kernel_v1_impl.cuh`	1462	the kernel logic
`csrc/kernel_v1.cu`	207	CUDA launch wrapper — `#include`s the impl
`csrc/kernel_v1.hip.cpp`	226	HIP launch wrapper — `#include`s the same impl
`csrc/kernel_v1.metal`	1441	can't share it; full reimplementation

Host side mirrors this: cuda.rs 690, hip.rs 695, metal.rs 2116.

AMD support currently costs ~226 lines because HIP is source-compatible with CUDA. Metal is the honest precedent for a backend that isn't: a whole parallel kernel plus 3× the host code. Any move off HIP moves AMD from the first column to the second.

The blocker: `sim` needs a device-wide barrier

kernel_v1_impl.cuh:623 calls cooperative_groups::this_grid().sync() — a grid-wide barrier inside the kernel, via hipLaunchCooperativeKernel. Neither OpenCL nor Vulkan has a device-wide barrier primitive.

cosim does not have this problem, and says so itself:

Unlike the sim scan above, cosim is reactive (inputs depend on outputs), so the host drives one scheduler edge at a time over a 2-slot [input|output] state. These kernels are NON-cooperative ordinary launches — the host loops major stages and each launch is the grid-wide barrier — so cosim never needs the cooperative grid.sync the scan relies on. They mirror Metal's state_prep and simulate_v1_stage.

So the realistic scope of any portable-compute backend is cosim only, sim stays on CUDA/HIP — unless sim is restructured into N host-driven launches, which is a barrier per sync point and is precisely what the cooperative launch exists to avoid.

What makes it cheaper than feared

Zero templates in the impl; only 26 CUDA qualifiers (__device__/__global__/__shared__/__forceinline__). It's C-like CUDA, so a port is transliteration plus the sync problem, not a fight with a type system.
The cross-backend goldens already exist — CpuBackend == Metal == CUDA == HIP, byte-identical. A new backend gets a correctness oracle on day one. The timestamp maths is host-side Rust (#195), so goldens should match outright.
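
Because the goldens are byte-identical across backends, checking a new backend reduces to a byte comparison. The sketch below shows that comparison in its simplest form, assuming both captures are already in memory. The function name is hypothetical and this is not Jacquard's test harness.

```rust
/// Hypothetical check: index of the first differing byte between a golden
/// capture and a candidate, or None when they are byte-identical.
/// A length difference counts as a mismatch at the shorter length.
fn first_mismatch(golden: &[u8], candidate: &[u8]) -> Option<usize> {
    if golden == candidate {
        return None;
    }
    let differing = golden.iter().zip(candidate).position(|(a, b)| a != b);
    Some(differing.unwrap_or_else(|| golden.len().min(candidate.len())))
}

fn main() {
    // Made-up byte strings for illustration.
    println!("{:?}", first_mismatch(b"cycle 0: 1010", b"cycle 0: 1010"));
    println!("{:?}", first_mismatch(b"cycle 0: 1010", b"cycle 0: 1110"));
}
```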

The recurring cost

Two kernel implementations become three. Every kernel-level change lands three times and must stay byte-identical against the goldens. That tax is forever and is larger than the port. Plus: a Device::OpenCL variant in vendored ulib (the [CPU] [CUDA] [HIP] [Metal] device-ID layout is positional — additive, but a submodule change), 24 CosimBackend methods, and a packaging change (OpenCL compiles kernels at runtime from source/SPIR-V, unlike the build-time ucc compile).

What the local-LLM community has learned — and why most of it isn't our problem

llama.cpp / ollama have driven these parts in anger far longer than any vendor matrix reflects. Their pain on AMD laptops is real and well documented. It is also, almost entirely, rocBLAS pain — which we don't have.

gfx1103 (Radeon 780M), llama.cpp#20839 — three failure modes:

Flash-Attention WMMA kernel: "no device code compatible with HIP arch 1300", tuned for discrete RDNA3;
rocBLAS TensileLibrary missing: "Cannot read TensileLibrary.dat … for GPU arch: gfx1103" — ROCm 6.3.2 ships gfx1100/1101/1102 only;
MMQ kernels: HSA_OVERRIDE_GFX_VERSION=11.0.0 spoofing gives "invalid device function".

The decisive line in that issue: "The problem didn't exist in older llama.cpp versions (~late 2024 vintage) that embedded HIP kernels directly rather than calling rocBLAS externally." Vulkan works there, ~2 s/generation slower than a working ROCm build.

gfx1151 (Strix Halo), llama.cpp#13565 — an officially supported part where HIP is 2.5× slower than Vulkan (pp512: HIP 348 tok/s vs Vulkan 881). Tellingly, compiling for gfx1100 and spoofing HSA_OVERRIDE_GFX_VERSION=11.0.0 reaches ~599 — faster than the native gfx1151 path. Both hit max clock, so it isn't hardware; it's untuned rocBLAS/Tensile kernels for the arch. Still open. See also ROCm#5643 (hipBLASLt falls back on gfx1151 as unsupported) and the community's custom rocBLAS builds for gfx1103 — an entire cottage industry of rebuilding the library per arch.

Why this mostly doesn't bind on us

Checked against our kernel:

No BLAS. Zero references to rocblas/hipblas/cublas/tensile anywhere in csrc/ or src/. Failure modes 2 and 3, and the gfx1151 performance gap, are all rocBLAS/Tensile artefacts.
No matrix intrinsics. No wmma/mfma/matrix_core. Failure mode 1 is a WMMA kernel. We simulate AND gates.
One arch-specific intrinsic, __shfl_down_sync (kernel_v1_impl.cuh:273) — a standard warp shuffle, fine on RDNA.
wave32 is required and satisfied. kernel_v1.hip.cpp hard-rejects warpSize != 32 (CDNA/GCN wave64 unsupported). gfx1103 (RDNA3), gfx1150 and gfx1151 (RDNA3.5) are all wave32 — the laptop parts are exactly the shape we want.

We're the "embedded HIP kernels directly" case that worked on gfx1103.

The actual blocker is one line

ucc::cl_hip() (vendored eda-infra-rs/ucc/src/compile.rs):

// Default AMD targets: RDNA2 + RDNA3.
vec!["gfx1030".to_string(), "gfx1100".to_string()]

We only emit code for gfx1030 and gfx1100 — both discrete. No gfx1103, no gfx1150, no gfx1151. That is why the CI runner works (it reports gfx1030) and why a laptop wouldn't: not because HIP can't, but because we never compiled for it.

There is already an escape hatch — UCC_HIP_TARGETS, comma-separated — in a fork we control. Custom HIP kernels can be compiled for any arch the compiler knows; only rocBLAS needs per-arch prebuilt libraries, and we don't use it.
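
How a comma-separated override could resolve to compiler flags can be sketched as follows. This is an assumption-laden illustration, not the code in `ucc::cl_hip()`: the function names are hypothetical, and the handling of whitespace and of an empty value is a guess. The `--offload-arch=` flag itself is the standard clang HIP option.

```rust
/// Hypothetical resolver: use the comma-separated override when it names at
/// least one target, otherwise fall back to the two defaults quoted above.
fn hip_targets(override_value: Option<&str>) -> Vec<String> {
    let parsed: Vec<String> = override_value
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .map(String::from)
        .collect();
    if parsed.is_empty() {
        vec!["gfx1030".to_string(), "gfx1100".to_string()]
    } else {
        parsed
    }
}

/// One `--offload-arch=` flag per target.
fn offload_flags(targets: &[String]) -> Vec<String> {
    targets.iter().map(|t| format!("--offload-arch={t}")).collect()
}

fn main() {
    let env = std::env::var("UCC_HIP_TARGETS").ok();
    let targets = hip_targets(env.as_deref());
    println!("{}", offload_flags(&targets).join(" "));
}
```

Taking the override as a parameter, rather than reading the environment inside the resolver, keeps the parsing testable without touching process state.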

Revised conclusion

Do not port to OpenCL. The cost is a third 1400-line kernel plus a permanent 3× tax on every kernel change, and sim can't port at all (no device-wide barrier). The premise that motivated it — "ROCm won't reach AMD laptops" — did not survive contact: with two small fixes the existing HIP backend compiles for every laptop arch and passes 13/14 goldens on real ROCm hardware. We were three commits from ROCm, not one backend.

Both blockers were ours, not AMD's, and both were hidden by the same thing: hip-build installs the CUDA Toolkit and targets hip-runtime-nvidia, so the HIP path had only ever been compiled with CUDA headers and CUDA semantics on hand. HIP Tests (NVIDIA backend) was an accurate job name that nobody read literally.

What's left for laptops specifically is unproven but no longer speculative: compiling for gfx1103/gfx1150/gfx1151 works; whether they run correctly needs silicon we don't have.

Result of the compile test (2026-07-15)

Ran it on the AMD runner (ROCm 7.2.4) with UCC_HIP_TARGETS=gfx1030,gfx1100,gfx1103,gfx1150,gfx1151. Two findings, and the second is bigger than the spike's original question.

1. hipcc accepts every laptop target. The compiler was invoked as

clang++ --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1103 \
        --offload-arch=gfx1150 --offload-arch=gfx1151 ...

and raised no objection to any arch. The arch list is not a barrier — consistent with the research: only rocBLAS needs per-arch prebuilt libraries, and we don't use it.

2. We have never compiled against real ROCm at all. The build died here:

csrc/types.hpp:26:10: fatal error: 'math_constants.h' file not found
   26 | #include <math_constants.h>
1 error generated when compiling for gfx1030.

That's vendored ulib/csrc/types.hpp:

#if defined(__NVCC__) || defined(__HIP_DEVICE_COMPILE__)
#include <math_constants.h>

The guard fires for HIP device compilation, but math_constants.h is a CUDA toolkit header. ROCm ships hip/hip_math_constants.h instead. It has never been caught because hip-build installs the CUDA Toolkit ("no GPU needed to compile") and builds hip-runtime-nvidia: the HIP path has only ever been compiled with CUDA headers on the include path.

So the HIP Tests (NVIDIA backend) job name is exact, and nobody read it that literally: the HIP backend is HIP-over-CUDA only. It has never been built, let alone run, on ROCm — on a laptop, a discrete Radeon, or anything else. Note this fails on gfx1030, the arch we nominally support and the one our own runner is. Laptop support was never the first blocker; it's the second.

This is a one-header fix in a fork we control, but it must be fixed before any claim about ROCm — laptop or otherwise — can be tested.

Result of actually fixing it (2026-07-15)

Two blockers, both invisible until something compiled against real ROCm. Both fixed; the third finding is open.

Blocker 1 — CUDA header on the HIP path. 81184fa ("Add HIP (AMD GPU) backend support") widened ulib's guard from #ifdef __NVCC__ to #if defined(__NVCC__) || defined(__HIP_DEVICE_COMPILE__), dragging the CUDA-only <math_constants.h> into HIP device compilation. Upstream is unaffected — its #ifdef __NVCC__ is correct for a CUDA header; the bug is ours, introduced with the HIP patch. Nothing uses the CUDART_* constants that header provides; the macros beneath it want nanf/nan/INFINITY from <cmath>. Fixed by including what's used.

Blocker 2 — the lane mask is CUDA-shaped. With the header fixed, the compile reached our one arch-specific intrinsic and died:

amd_warp_sync_functions.h:297:62: error: static assertion failed due to
requirement 'sizeof(unsigned int) == 8': The mask must be a 64-bit integer.

CUDA's __shfl_*_sync takes a 32-bit mask; HIP-on-AMD takes a 64-bit one (an AMD wave can be 64 lanes) and static-asserts it. Every call site in the shared kernel was a hard compile error on ROCm, and invisible on CUDA and on HIP-over-CUDA — which is all we had ever built. Fixed with a lane_mask_t typedef (64-bit under __HIP_PLATFORM_AMD__, unsigned otherwise). It has to be a type, not a constant: one mask is a ragged tail (0xffffffff >> n) whose value must survive widening. No-op off AMD, and Metal goldens stayed 14/14.
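
The "value must survive widening" point can be shown with plain integers. The sketch below models the two mask widths as Rust aliases; it is not the kernel's `lane_mask_t` definition, and all names here are hypothetical. The takeaway it illustrates is that a mask computed at 32 bits and then zero-extended keeps its value, whereas a blanket 64-bit all-ones constant would describe 64 lanes.

```rust
// Stand-ins for the two widths described above (hypothetical names).
type LaneMaskCuda = u32;
type LaneMaskAmd = u64;

/// Ragged-tail mask computed at 32 bits. `n` must be below 32.
fn tail_mask_32(n: u32) -> LaneMaskCuda {
    0xffff_ffffu32 >> n
}

/// Widening zero-extends, so a computed 32-bit mask keeps its value.
fn widen(mask: LaneMaskCuda) -> LaneMaskAmd {
    u64::from(mask)
}

fn main() {
    let widened = widen(tail_mask_32(4));
    println!("{:#x}", widened);
    // A 64-bit all-ones constant is a different mask from a full wave32 mask.
    println!("{}", widen(tail_mask_32(0)) == u64::MAX);
}
```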

With both fixed: it builds, and it very nearly works

UCC_HIP_TARGETS=gfx1030,gfx1100,gfx1103,gfx1150,gfx1151
→ build: SUCCESS  (all five archs, incl. every laptop target)

And on the AMD runner's real gfx1030 — the first time Jacquard has ever executed on ROCm — the goldens are byte-identical to the CpuBackend/Metal captures:


PASS	xprop, 2state, noreginit, reginit, dual_uart_events, apb_trace, apb_trace_xprop, multi_mem, vcd_axes, qspi_psram (+content), qspi_shared_bus (+content)
FAIL	multi_mem_split

13 of 14 on the first run — 14 of 14 once the one failure was fixed. The cross-backend equivalence the goldens assert (CpuBackend == Metal == CUDA == HIP) had never been tested against ROCm. It holds, and finding where it didn't was worth the whole exercise.

The one failure: a GPU memory fault on the staged path (#203, fixed)

Memory access fault by GPU node-1 on address 0x7501bce00000.
Reason: Page not present or supervisor privilege.

multi_mem_split is multi_mem with --level-split 10. The non-split fixture passed on the same hardware and the same binary; only the staged path faulted.

The cause was ours, in simulate_block_v1's staged-IO read. Bit 31 of a word index flags "read this cycle's inter-stage intermediates from the output slot" (§5 of ADR 0015). The kernel decoded that flag by biasing the base pointer to cancel it out of the subscript, but with a signed 1 << 31, which is INT_MIN. So the pointer moved +2^31 words instead of −2^31, and the subsequent [idx] (bit 31 set, so idx ≥ 2^31) added another 2^31: every staged read landed 2^32 words (16 GiB) past the buffer. Unreachable without --level-split, since staged_io_map is empty when the design is one stage, and unreachable in stage 0, which reads primary inputs and DFFs. Hence: stage 0 survives, stage 1 faults.

Why only ROCm. The instinct that ROCm is a bounds-checking oracle was right, but the mechanism is more interesting than "CUDA tolerates an OOB access". nvcc narrows the address arithmetic to 32 bits, which truncates the overflow back to the intended offset — so CUDA and HIP-over-CUDA were silently correct, not silently wrong. Metal had the same decode written with an unsigned 1u << 31 and was correct outright. ROCm was the only backend that computed the address the way the language says to, and it faulted. The bug was not that ROCm is strict; it was that we relied on a compiler to paper over undefined behaviour, and one of them declined.

The fix decodes by clearing the flag from the index instead — the idiom CpuBackend already used (src/sim/cpu_reference.rs) — in all three kernels. That removes the UB rather than correcting its sign, so no backend's address arithmetic can resurrect the class. multi_mem_split now passes on ROCm byte-identically to the golden, and Metal is re-verified 14/14 locally.
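
The arithmetic behind the fault, the 32-bit narrowing that hid it, and the fix can all be reproduced with integers. The sketch below is a model in Rust, not the kernel source: the exact form of the original expression is an assumption, the function names are hypothetical, and offsets are counted in words.

```rust
const STAGED_FLAG: u32 = 1 << 31;

/// Buggy decode modelled at full 64-bit width: the base is biased by
/// subtracting a signed 32-bit `1 << 31` (which is i32::MIN), then indexed
/// with the raw word, flag bit still set.
fn buggy_word_offset(idx: u32) -> i64 {
    let signed_flag = i32::MIN as i64; // value of a signed 32-bit `1 << 31`
    let base_bias = -signed_flag;      // base - (-2^31) moves the base by +2^31
    base_bias + idx as i64
}

/// The same arithmetic narrowed to 32 bits wraps back to the intended offset.
fn buggy_word_offset_narrowed(idx: u32) -> u32 {
    buggy_word_offset(idx) as u32
}

/// Fixed decode: clear the flag from the index instead of biasing the base.
fn fixed_word_offset(idx: u32) -> u32 {
    idx & !STAGED_FLAG
}

fn main() {
    let idx = STAGED_FLAG | 5; // staged read of word 5
    println!("{}", buggy_word_offset(idx));          // 4294967301 = 2^32 + 5
    println!("{}", buggy_word_offset_narrowed(idx)); // 5
    println!("{}", fixed_word_offset(idx));          // 5
}
```

With 4-byte words, the 2^32-word overshoot is the 16 GiB figure given above.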

Worth noting where it sat: level-split + SRAM is exactly where #186 lived (a staged endpoint index read against the original AIG's accounting). Two staged-index bugs in the same neighbourhood, the second only visible on a runtime that checks. Both were real bugs in Jacquard, surfaced by ROCm, not ROCm quirks.

Next steps

~~Compile-only test.~~ Done — see above. It answered a bigger question than it asked. Fix the types.hpp CUDA-header leak in vendored ulib, then re-run; only then is "does it build for laptop archs" a meaningful question.
~~Run the goldens on our own AMD runner.~~ Done — 14/14. The first run was 13/14; the one failure became #203 and is fixed (a signed 1 << 31 put every staged-IO read 2^32 words out of bounds; see above). Traced with AMD_SERIALIZE_KERNEL=3 to the exact kernel and dispatch — rocgdb hangs, that route is the one that works. The goldens are now guarded permanently by the HIP Tests (ROCm backend) job in ci.yml, which is worth keeping for its own sake: ROCm is the only backend that computes addresses exactly rather than narrowing them to 32 bits, so it is our only standing oracle for OOB/UB in the shared kernel.
Then find real silicon. Compiling is necessary, not sufficient. The cross-backend goldens make the check a byte-diff, not a judgement call. Our runner is APU-class (gfx1036, spoofed to gfx1030), so it stands in better than first thought — but it is RDNA2 Raphael, not RDNA3/3.5 Phoenix or Strix, so it still cannot answer for gfx1103/gfx1150/gfx1151. scripts/amd-laptop-probe.sh is a self-contained volunteer test: it compiles a ~40-line HIP program using exactly Jacquard's kernel surface (wave32 + __shfl_down_sync + __syncthreads, nothing else), runs it, cross-compiles for each laptop arch, and prints a pasteable report. No Jacquard build, no root, nothing installed.
Only if 1 or 2 fails, revisit portable compute — and then Vulkan, not OpenCL: it's what actually works on these parts today per the evidence above, it isn't deprecated, and the runner's vulkan/cubecl labels suggest prior thought. CubeCL being Rust-native likely beats hand-written OpenCL C as a third kernel.
~~Fix the runner label.~~ Void — the label was right all along. It says gfx1036, the hardware is gfx1036, and the gfx1030 report is an HSA_OVERRIDE_GFX_VERSION=10.3.0 spoof baked into the amd-runner image (see "Why this is a question at all"). Nothing to fix. If anything is worth changing it is the override, not the label — though it is what makes an unsupported-tier APU work at all, so leave it be.
~~Ship the cooperative-launch check in scripts/amd-laptop-probe.sh.~~ Done — the probe now reports PROBE_COOP_VERDICT alongside the wave32 / shuffle result. Whether the laptop archs support cooperative launch remains unknown and unknowable to us (we have measured exactly one AMD GPU, and it cannot), but a volunteer's report now answers it. sim works either way; the answer only decides fast path vs fallback.

Still open: whether sim (not just cosim) matters on a laptop; if cosim-only is acceptable the problem shrinks either way.

`sim` on a device without cooperative launch (2026-07-16)

Until this was fixed, sim had never run on ROCm. It died in csrc/kernel_v1.hip.cpp at the hipLaunchCooperativeKernel call with unspecified launch failure, on the simplest design that exists — 1 block, 1 stage, 6 cycles. Cosim was unaffected: it is reactive, so it already drives one ordinary launch per scheduler edge and never needs a device-wide barrier.

The device simply does not support the mechanism, and says so:

prop.cooperativeLaunch            = 0
attr CooperativeLaunch            = 0
plain,       1 block   launch=no error                    sync=no error
cooperative, 1 block   launch=unspecified launch failure  sync=no error
cooperative, 2 blocks  launch=unspecified launch failure  sync=no error

A trivial grid.sync kernel fails at 1 block while a plain launch of the same shape succeeds. It is not our kernel, our launch config, or occupancy.

It is also not the spoof. The obvious suspicion — that cooperativeLaunch=0 is an artefact of running gfx1030 code on gfx1036 silicon — was tested directly, compiling the probe with --offload-arch=gfx1036 and running it with env -u HSA_OVERRIDE_GFX_VERSION. The result is identical to the spoofed run, field for field. The limitation is the hardware's.

Do not generalise this to "RDNA2 can't do cooperative launch." We have measured one integrated 2-CU Raphael APU. A discrete gfx1030 may well report 1. That is what the volunteer probe (next step 5) is for.

This sharpens — but does not overturn — the stay-on-HIP conclusion. The argument against porting leaned partly on "cosim-only is the realistic scope, sim stays on CUDA/HIP", and on APU-class AMD we were briefly cosim-only in fact, which is the exact limitation that argument used to dismiss portable compute.

The fix: a non-cooperative fallback (landed)

Not a port — a host loop over cycles × stages where each launch is the barrier, which is precisely what Metal (no device-wide barrier either) has always done and what cosim already did. csrc/kernel_v1.hip.cpp now queries hipDeviceAttributeCooperativeLaunch and picks:

supported → one hipLaunchCooperativeKernel for the whole run, unchanged;
not supported → num_cycles × num_major_stages ordinary launches of simulate_v1_stage.

The enabling move was generalising cosim's stage kernel into one simulate_v1_stage parameterised by current_cycle, matching Metal's long-standing shape. Cosim passes current_cycle = 0, so its behaviour is unchanged by construction; sim's fallback passes the real cycle index.
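
The choice between the two paths, and the launch count of the fallback, can be modelled on the host side as below. This is a sketch of the control flow only: the type and function names are hypothetical, no GPU API is called, and the loop order (cycles outer, major stages inner) is an assumption based on the description above.

```rust
#[derive(Debug, PartialEq)]
enum LaunchPlan {
    /// One cooperative launch covers the whole run.
    Cooperative,
    /// Ordinary launches, one per (cycle, major stage), in execution order.
    PerStage(Vec<(u32, u32)>),
}

fn plan_launches(cooperative_supported: bool, num_cycles: u32, num_major_stages: u32) -> LaunchPlan {
    if cooperative_supported {
        return LaunchPlan::Cooperative;
    }
    let mut launches = Vec::new();
    for cycle in 0..num_cycles {
        for stage in 0..num_major_stages {
            // Each launch boundary stands in for the grid-wide barrier.
            launches.push((cycle, stage));
        }
    }
    LaunchPlan::PerStage(launches)
}

fn launch_count(plan: &LaunchPlan) -> usize {
    match plan {
        LaunchPlan::Cooperative => 1,
        LaunchPlan::PerStage(launches) => launches.len(),
    }
}

fn main() {
    // 13 cycles x 2 major stages on a device without cooperative launch.
    println!("{}", launch_count(&plan_launches(false, 13, 2)));
    println!("{}", launch_count(&plan_launches(true, 13, 2)));
}
```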

Verified on the runner: sim matches the CpuBackend reference (--check-with-cpu) on a 6-cycle single-stage design and on mcu_soc (--level-split 10, 13 cycles × 2 major stages — i.e. the staged-IO path with both loops non-trivial), and the full gpu_test_suite.sh — sim, X-prop, the timed launcher + report, and all 14 cosim goldens — passes on ROCm.

Cost: a launch per stage rather than one per run. HIP Tests (ROCm backend) is the only CI job that exercises this path, since CUDA and HIP-over-CUDA both support cooperative launch.