Jacquard Documentation

Welcome to the documentation for Jacquard, a GPU-accelerated RTL logic simulator.

Use the sidebar to navigate between topics, or start with the Getting Started guide.

Documents

Project Scope & Planning

Start here if you're considering a feature contribution or want to understand Jacquard's overall direction.

  • Project Scope & Guarantees: Top-level contract — what Jacquard is for, what it isn't, licensing and architecture constraints, stability tiers.
  • Why Jacquard: Honest positioning vs. STA tools and event-driven simulators; what's unique, what isn't, and what output interface would let users extract the value.
  • Timing Correctness: Scoped requirements for timing accuracy, validation, and the forthcoming timing IR.
  • Timing Model Extensions: Pre-spike design notes for δ(T) dynamic delay, clock-tree skew, and wire delay at scale. Formalised in ADR 0007.
  • Post-Phase-0 Roadmap: Sequencing of Phase 1+ work covering structured timing output (ADR 0008) and timing model fidelity (ADR 0007). (OpenTimer integration was originally Phase 1's centrepiece; ADR 0003 was Superseded by the spike — OpenSTA out of process is now the sole STA path per ADR 0001.)
  • Architecture Decision Records: Design decisions and their rationale (numbered, per-decision). See the index for status and how the ADRs relate.
  • Implementation Plans: Phased implementation plans with entry and exit criteria. See the index for status and reading order.
  • Spikes: Time-boxed experiments and their outcomes.

Core Documentation

  • Simulation Architecture: Detailed explanation of Jacquard's internal architecture

    • Pipeline stages (NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU)
    • Data structures and representations
    • VCD input/output format requirements
    • Assertion and display support infrastructure
    • Performance characteristics
    • Known issues and limitations
  • Timing Simulation: CPU-based timing simulation with Liberty/SDF delays

  • Timing Violations: GPU-side setup/hold violation detection

Troubleshooting Guides

  • Troubleshooting VCD: Debugging VCD input issues
    • VCD hierarchy requirements
    • Signal naming and matching
    • Solutions for flat VCD generation
    • Diagnostic checklist
    • Working examples

Quick Reference

VCD Input Requirements (Critical!)

Jacquard expects VCD signals at the absolute top level (no module hierarchy):

// ✓ Correct testbench
initial begin
    $dumpfile("output.vcd");
    $dumpvars(1, clk, reset, din, dout);  // Depth 1, explicit signals
end

// ✗ Incorrect testbench
initial begin
    $dumpfile("output.vcd");
    $dumpvars(0, testbench);  // Dumps entire hierarchy
end

Debug Commands

# Enable debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>

# Verify with CPU simulation
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu

# Check VCD structure
grep '\$scope\|\$var' input.vcd | head -20
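
The grep check above can also be done programmatically. The following is a minimal sketch, not part of Jacquard: the function name `count_vars_by_depth` is hypothetical, and it assumes each `$scope`, `$upscope`, and `$var` declaration sits on its own line (the same assumption the grep command makes).

```rust
/// Counts `$var` declarations in a VCD header, split by where they appear.
/// Returns `(top_level_vars, scoped_vars)`: variables declared outside any
/// `$scope` block versus variables declared inside at least one scope.
/// Assumption: one declaration command per line.
pub fn count_vars_by_depth(vcd_header: &str) -> (usize, usize) {
    let mut depth: usize = 0;
    let mut top_level: usize = 0;
    let mut scoped: usize = 0;
    for line in vcd_header.lines() {
        let line = line.trim_start();
        if line.starts_with("$enddefinitions") {
            // Value changes follow; the declaration section is over.
            break;
        } else if line.starts_with("$scope") {
            depth += 1;
        } else if line.starts_with("$upscope") {
            depth = depth.saturating_sub(1);
        } else if line.starts_with("$var") {
            if depth == 0 {
                top_level += 1;
            } else {
                scoped += 1;
            }
        }
    }
    (top_level, scoped)
}
```

A non-zero second count means at least some signals sit under a module scope, which is the situation the hierarchy warning in this section describes.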

Key Statistics

When running Jacquard, look for these diagnostic outputs:

netlist has X pins, Y aig pins, Z and gates        # AIG complexity
current: N endpoints, try M parts                  # Partition count
Built script for B blocks, reg/io state size S     # Final script
WARN (GATESIM_VCDI_MISSING_PI) ...                 # VCD issues!

Investigation Methodology

This documentation was created through systematic investigation of Jacquard's behavior:

  1. Source Code Analysis: Examined src/aig.rs, src/flatten.rs, src/staging.rs
  2. Debug Tracing: Used RUST_LOG=debug to capture internal state
  3. Test Case Development: Created minimal reproducible examples
  4. Comparative Testing: Compared Jacquard vs iverilog outputs
  5. Third-Party Validation: Tested with real-world examples (sva-playground)

Known Issues Documented

  1. VCD Hierarchy Mismatch (CRITICAL):

    • Jacquard expects flat VCD hierarchy
    • Most testbenches generate hierarchical VCDs
    • See troubleshooting-vcd.md for solutions
  2. Complex FSM Simulation:

    • Some FSM designs don't simulate correctly
    • Under investigation (safe.v example in third_party tests)
    • May be related to synthesis optimization or reset handling
  3. Format String Preservation:

    • Yosys may not preserve format attributes
    • Display messages show placeholders
    • Extract format strings from pre-synthesis JSON as workaround

Contributing

When adding documentation:

  1. Be specific: Include actual commands, file paths, code snippets
  2. Show examples: Both working and non-working cases
  3. Link related docs: Cross-reference other documentation files
  4. Date updates: Update version and date at bottom of documents
  5. Test instructions: Verify all commands actually work

Future Documentation Needs

  • Performance tuning guide (optimal NUM_BLOCKS, --level-split)
  • Memory (SRAM) modeling and synthesis
  • Custom cell library support beyond AIGPDK
  • Multi-clock domain handling
  • VCD scope option detailed behavior
  • GPU kernel optimization internals

Related Resources

  • Main README: ../README.md - Project overview and quick start
  • CLAUDE.md: ../CLAUDE.md - Development guidelines and architecture overview
  • Test Suite: ../tests/ - Examples and regression tests
  • Third-Party Tests: ../tests/regression/third_party/ - Real-world examples with attribution

Last Updated: 2026-02-16
Maintained By: ChipFlow + Community Contributions

Getting Started with Jacquard

Caveats: Jacquard currently supports only non-interactive testbenches, so the input to the circuit must be a static waveform (e.g., a VCD file). Registers and clock gates inside the circuit are allowed, but latches and other asynchronous sequential logic are currently unsupported.

Dataset: Some input data (namely, netlists after the AIG transformation in Steps 1-2 below, and reference VCDs) is available here.

Step 0. Download the AIG Process Kit

Go to the aigpdk directory, where you can download aigpdk.lib, aigpdk_nomem.lib, aigpdk.v, and memlib_yosys.txt. You will need them later in the flow.

Before continuing, make sure your design contains only synchronous logic. If your design has clock gates implemented in RTL, replace them manually with instantiations of the CKLNQD module in aigpdk.v. You should also know where memory blocks (e.g., caches) are implemented in your design, so that you can check later that they are mapped correctly.

Step 1. Memory Synthesis with Yosys

This step makes use of the open-source Yosys synthesizer to recognize and map the memory blocks automatically.

Download and compile the latest version of Yosys. Then run the Yosys shell with the following synthesis script.

# replace this with paths to your RTL code, and add `-I`, `-D`, `-sv` etc when necessary
read_verilog xx.v yy.v top.v

# replace TOP_MODULE with your top module name
hierarchy -check -top TOP_MODULE

# simplify design before mapping
proc;;
opt_expr; opt_dff; opt_clean
memory -nomap

# map the rams
# point -lib path to your downloaded memlib_yosys.txt
memory_libmap -lib path/to/memlib_yosys.txt -logic-cost-rom 100 -logic-cost-ram 100

The memory_libmap command will output a list of RAMs it found and mapped.

  • If you see $__RAMGEM_SYNC_ (naming inherited from GEM), it means the mapping is successful.
  • If you see $__RAMGEM_ASYNC_, the RAM was found to have an asynchronous read port. Confirm whether that is really the case.
    • If the RAM is actually synchronous but was recognized as asynchronous, you may need to patch the RTL. There are several possible reasons it was not recognized as synchronous, for example read and write ports on different clocks.
    • If it is indeed asynchronous, check its size. If it is small enough to be synthesized with registers and mux trees (which is very expensive for large RAM banks), remove the $__RAMGEM_ASYNC_ block from memlib_yosys.txt and re-run Yosys to force the use of registers.
  • If you see using FF mapping for memory, the memory was recognized but is nonstandard (e.g., it has a special global reset or nontrivial initialization), so Jacquard will fall back to registers and mux trees. For a small memory this is usually not an issue. Otherwise, consider a different implementation.

After a successful mapping, use the following command to write out the mapped RTL as a single Verilog file.

write_verilog memory_mapped.v

Check the correctness of this step by simulating memory_mapped.v with your reference CPU simulator.

Step 2. Logic Synthesis

This step maps all combinational and sequential logic into a special set of standard cells we defined in aigpdk.lib. The quality of synthesis is directly tied to Jacquard's final performance, so we suggest you use a commercial synthesis tool like DC. You can also use Yosys to complete this if you do not have access to a commercial synthesis tool.

Check the correctness of this step by simulating gatelevel.gv with your reference CPU simulator.

Use Synopsys DC

First, you need to compile aigpdk.lib to aigpdk.db using Library Compiler.

Then synthesize the memory_mapped.v obtained in Step 1 against aigpdk.db.

Some key commands you may use on top of your existing DC flow:

# change path/to/aigpdk.db to a correct path. same for other commands.
set_app_var link_path path/to/aigpdk.db
set_app_var target_library path/to/aigpdk.db
read_file -format db $target_library

# elaborate TOP_MODULE
# current_design TOP_MODULE

# timing settings like create_clock ... are recommended. Jacquard benefits from timing-driven synthesis.

compile_ultra -no_seq_output_inversion -no_autoungroup
optimize_netlist -area

write -format verilog -hierarchy -out gatelevel.gv

Use Yosys: Example script

# if you exited Yosys after step 1, you can read your memory_mapped.v back in yourself.
# read_verilog memory_mapped.v
# hierarchy -check -top TOP_MODULE

# synthesis
synth -flatten
delete t:$print

# change path/to/aigpdk_nomem.lib to a correct path. same for other commands.
dfflibmap -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
techmap
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge

# write out
write_verilog gatelevel.gv

Step 3. Download and Compile Jacquard

Download and install the Rust toolchain. This is as simple as a one-liner in your terminal. We recommend https://rustup.rs.

Clone Jacquard along with its dependencies.

git clone https://github.com/ChipFlow/Jacquard.git
cd Jacquard
git submodule update --init --recursive

Jacquard supports two GPU backends: CUDA (NVIDIA GPUs on Linux) and Metal (Apple Silicon Macs).

All functionality is accessed through the jacquard CLI, which provides map, sim, and cosim subcommands:

# Mapping (no GPU features needed)
cargo run -r --bin jacquard -- map --help

# Simulation (Metal - macOS)
cargo run -r --features metal --bin jacquard -- sim --help

# Simulation (CUDA - Linux, requires CUDA toolkit)
cargo run -r --features cuda --bin jacquard -- sim --help

Simulate the Design

Jacquard automatically partitions the design at startup using mt-kahypar-sc hypergraph partitioning.

If partitioning fails because the circuit is too deep (this often shows up as an attempt to partition a circuit with only 0 or 1 endpoints), add a --level-split option to force a stage split. For example, use --level-split 30 or --level-split 20,40.

Metal (macOS)

Use NUM_BLOCKS=1 for Metal.

cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1

CUDA (Linux)

Replace NUM_BLOCKS with twice the number of physical streaming multiprocessors (SMs) of your GPU.

cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS

VCD Scope Handling

Jacquard automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope. If auto-detection fails or you need to override it, use:

# Metal
cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1 --input-vcd-scope "testbench/dut"

# CUDA
cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS --input-vcd-scope "testbench/dut"

Use slash separators (/) for hierarchical paths, not dots. See troubleshooting-vcd.md for details.

The simulated output port values are stored in output.vcd.

Caveat: Jacquard also prints the actual GPU simulation runtime. There may be a long delay before the GPU starts, because input.vcd must first be read and parsed. We recommend developing your own pipeline to feed the input waveform into Jacquard's GPU kernels.

Timing-Aware Simulation

Jacquard supports two ways to feed timing data into the simulator:

  1. --timing-ir <path.jtir> — pre-converted Jacquard timing IR. This is the canonical path and requires no external tools at run time. Generate the IR ahead of time with the standalone opensta-to-ir tool (see crates/opensta-to-ir/).
  2. --sdf <path.sdf> --liberty <path.lib> — raw SDF, converted to IR on the fly. This subprocesses OpenSTA, which must be installed on the user's machine.

OpenSTA dependency

When using --sdf, Jacquard locates OpenSTA in this order:

  1. JACQUARD_OPENSTA_BIN environment variable.
  2. <repo-root>/scripts/build-opensta.sh --print-binary (the canonical install path during development; the script builds the version vendored at vendor/opensta/).
  3. sta on PATH.

Jacquard requires OpenSTA 3.1.0 or newer, matching the commit pinned at vendor/opensta/. The pinned version is the only one with end-to-end test coverage; newer OpenSTA versions are accepted with a warning, older versions are a hard error.
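
The version policy above (older is a hard error, the pinned version is tested, newer is accepted with a warning) can be sketched as a three-way comparison. This is an illustrative sketch only: `VersionCheck`, `parse_version`, and `check_opensta_version` are hypothetical names, not Jacquard's actual code, and the sketch assumes plain `major.minor.patch` version strings with an optional leading `v`.

```rust
/// Outcome of comparing a detected OpenSTA version against the pinned one.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    TooOld,        // hard error
    Tested,        // exactly the pinned version
    NewerUntested, // accepted, but a warning is appropriate
}

/// The pinned version stated in the text (3.1.0).
const PINNED: (u32, u32, u32) = (3, 1, 0);

/// Parses "3.1.0" or "v3.1.0" into a (major, minor, patch) tuple.
pub fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().trim_start_matches('v').split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    let patch = parts.next()?.parse::<u32>().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Tuples compare lexicographically, which matches version ordering here.
pub fn check_opensta_version(v: (u32, u32, u32)) -> VersionCheck {
    use std::cmp::Ordering::*;
    match v.cmp(&PINNED) {
        Less => VersionCheck::TooOld,
        Equal => VersionCheck::Tested,
        Greater => VersionCheck::NewerUntested,
    }
}
```

Under this sketch, v2.4.0 maps to `TooOld` and v3.2.0 maps to `NewerUntested`.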

The simplest way to get a known-good OpenSTA is to build the vendored copy from the Jacquard repo:

git submodule update --init --recursive
./scripts/build-opensta.sh

Then either set JACQUARD_OPENSTA_BIN to the path printed by ./scripts/build-opensta.sh --print-binary, or just let Jacquard find it automatically — the build script's output is searched by default.

Error messages

| Symptom | Meaning | Fix |
|---|---|---|
| --sdf requires OpenSTA: OpenSTA binary not found. | OpenSTA isn't installed or isn't on PATH. | Run ./scripts/build-opensta.sh, set JACQUARD_OPENSTA_BIN, or install OpenSTA system-wide. |
| OpenSTA at <path> is v2.4.0; Jacquard requires v3.1.0 or newer. | Installed OpenSTA is too old. | Rebuild from vendor/opensta/ (which is pinned at 3.1.0) or upgrade your system OpenSTA. |
| Detected OpenSTA v3.2.0, newer than the latest tested version v3.1.0. (warning) | OpenSTA version is newer than what Jacquard's test corpus has been validated against. Simulation proceeds. | Report any timing discrepancies as bugs; we'll bump the tested-version range when CI catches up. |
| --sdf requires --liberty <PATH>. | OpenSTA needs the Liberty library to link the design. | Pass --liberty <PATH> alongside --sdf. |

For licensing context (Jacquard is permissively-licensed, OpenSTA is GPL-3, and Jacquard's runtime subprocess invocation is permitted but bundling is not), see adr/0006-sdf-preprocessing-model.md.

Why Jacquard — positioning and output interface

Status: Honest assessment of where Jacquard fits in an EDA flow alongside dedicated STA tools (OpenTimer/OpenSTA) and event-driven simulators (Verilator, iverilog, CVC). Includes a survey of what timing information Jacquard exposes today and what would let users actually consume it.

This is not a marketing document. The goal is for a contributor or user to read it and decide accurately whether Jacquard helps them — and, if it does, how to extract the answer they need.


TL;DR

Jacquard's unique value is vector-driven timing analysis at GPU scale: answering "did this stimulus violate setup/hold at any DFF, on which cycle, on which signal?" for designs large enough that SDF-annotated event-driven sim is too slow to finish in useful time.

Everything else Jacquard offers is offered, often better, by the standard flow:

  • For functional sim: Verilator is faster on small designs.
  • For timing: OpenSTA gives more accurate answers than Jacquard, vector-independent.
  • For glitch / metastability: event-driven sim with SDF (CVC, iverilog) sees behaviours Jacquard's lockstep kernel structurally cannot.

Jacquard becomes the right tool when (design size × vector length) exceeds what event-driven SDF-annotated sim can handle, and you specifically want vector-driven timing answers.

STA is not optional even with Jacquard. Jacquard does not replace OpenSTA; it complements it. The right framing is "STA proves no bad vectors exist; Jacquard proves your real workload runs cleanly within those bounds." OpenSTA is also a hard runtime dependency for any timing-aware Jacquard flow — the timing IR is produced by opensta-to-ir, which subprocesses OpenSTA. See ADR 0001.


What's actually unique

The intersection where Jacquard wins is narrow but real:

  1. Activity-driven setup/hold sweep at scale. Run a long workload (boot trace, architectural validation, NoC congestion stimulus) on a large design at GPU speed; get a per-cycle violation report. STA can't tell you "this real workload trips violation X at cycle 12,847"; CVC can but won't finish in time on big designs.

  2. Arrival-time distributions for power/activity analysis. Per-signal arrival histograms across millions of cycles → useful for worst-case-power analysis informed by actual switching activity. STA gives you nothing here; CVC could but slowly.

  3. Failure forensics. When a functional test fails, answering "was this a timing issue?" without rerunning under a different simulator. Jacquard's timing-VCD output ties violations to cycle/signal/path — useful when you already have it from the same run.

  4. Fast iteration during timing closure. Change a constraint, resynthesise, re-run a long test — Jacquard's loop time is short enough to make this practical on big designs in a way iverilog+SDF isn't.

What dedicated STA (OpenSTA) gives you that Jacquard doesn't

This list is long and you should know it:

  • Worst-case path enumeration. STA tells you the top-N critical paths over all possible inputs. Jacquard sees only what your stimulus exercises. If your testbench misses a critical path, Jacquard's "no violations" report is silent on it; OpenSTA would flag it.
  • True min-delay analysis. OpenSTA does proper min-delay path search. Jacquard's hold check is per-DFF against actual stimulus only.
  • Per-pair CRPR. OpenSTA applies common-path-pessimism removal as a launch/capture credit on each path. Jacquard consumes per-DFF clock arrival from opensta-to-ir and folds it into setup/hold (see timing-model-extensions.md, Part B Stages 1+2 — landed), but treats the launch reference as 0 — i.e. the per-pair CRPR credit is intentionally not modelled at this stage. Stage 3 in the same doc is the lever if Stage 1+2 pessimism turns out to matter on a real design.
  • SDC-aware constraint handling. False paths, multi-cycle paths, generated clocks, async groups — OpenSTA reads SDC and respects it. Jacquard doesn't read SDC at the timing layer.
  • Coverage by construction. STA covers every path by definition. Dynamic sim covers only what's exercised.
  • Vector-independent confidence. "This design meets timing" is something STA can claim; Jacquard can only claim "this design met timing on these vectors."

What event-driven SDF sim (CVC/iverilog) gives you that Jacquard doesn't

The honest comparison isn't "Jacquard vs. Verilator + OpenTimer." It's "Jacquard vs. iverilog/CVC-with-SDF + OpenTimer." On the timing-sim side specifically:

  • Glitch propagation. CVC/iverilog with inertial or transport delay see intra-cycle pulses. Jacquard's lockstep cycle-accurate kernel does not.
  • Per-pin wire delay fidelity. CVC consumes SDF interconnect records per-receiver, per-edge, with rise/fall distinction. Jacquard collapses to per-cell-max (see timing-model-extensions.md, Part C).
  • Per-DFF setup/hold without per-word collapse pessimism. Jacquard collapses all DFFs in a 32-bit state word to min(setup), min(hold); CVC checks each flop individually.
  • Async event handling. Real $setup/$hold checks across asynchronous control. Jacquard explicitly assumes synchronous designs.

So today, accuracy-per-vector goes to CVC; throughput goes to Jacquard.

When to choose what

| Your situation | Best tool |
|---|---|
| Small design, just want functional results | Verilator (free, fast, mature) |
| Small design, need timing certainty | OpenSTA + Verilator (or +CVC for vector-driven) |
| Large design, functional only | Verilator if it scales, else Jacquard |
| Large design, vector-driven timing needed | Jacquard + OpenSTA for STA backstop |
| Glitch / metastability investigation | CVC or iverilog with SDF (Jacquard cannot model these structurally) |
| Asynchronous design / latches | Not Jacquard (synchronous-only); use CVC/iverilog |
| Sign-off STA | OpenSTA / commercial (Jacquard is not a sign-off tool) |

The trajectory

Jacquard's timing fidelity gap with CVC is closeable. The work in timing-model-extensions.md — δ(T), clock-tree skew, per-receiver wire delay — closes much of it while preserving GPU throughput. The further along that path the project goes, the more "Jacquard" looks like "GPU-accelerated SDF-annotated event-driven sim, with the inherent limits the cycle-accurate kernel imposes (no glitches, lockstep cycles)" — i.e. CVC's report quality at Verilator's speed, on designs where neither alone suffices.


Output interface — what Jacquard exposes today

Jacquard's unique value depends on getting the timing information out of a run in a form users can act on. Phase 1 of the post-Phase-0 roadmap (ADR 0008) closed the gap between "data Jacquard has" and "answers users want" for setup/hold violations.

Symbolic stderr violation messages

The kernel writes setup/hold violation events to a per-block event buffer (csrc/kernel_v1.metal:554-576). The host drains the buffer each cycle (src/event_buffer.rs), resolves the state-word index to a hierarchical DFF site name via WordSymbolMap, and emits:

[cycle 12847] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=412]: arrival=2150ps setup=80ps slack=-30ps
[cycle 12847] HOLD VIOLATION at top/cpu/state[bit 3] [word=412]: arrival=12ps hold=20ps slack=-8ps

The bare [word=N] suffix is preserved for grep/tooling compatibility; up to four DFFs per word are named, with +N more truncation beyond that.
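
For quick scripting against stderr, the message format shown above can be split with plain string operations. The sketch below is an assumption-laden illustration: the `Violation` struct and `parse_violation` function are hypothetical, and the parser relies only on the two sample lines above, so it may not cover every variant Jacquard emits. Tools that need a stable contract should prefer the JSON report.

```rust
/// One parsed stderr violation message (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct Violation {
    pub cycle: u64,
    pub kind: String, // "SETUP" or "HOLD", as printed
    pub site: String, // symbolic DFF site name(s), as printed
    pub word: u32,
    pub arrival_ps: i64,
    pub constraint_ps: i64, // the setup= or hold= value
    pub slack_ps: i64,
}

/// Parses a line such as
/// `[cycle 12847] HOLD VIOLATION at top/cpu/state[bit 3] [word=412]: arrival=12ps hold=20ps slack=-8ps`.
/// Returns None when the line does not match that shape.
pub fn parse_violation(line: &str) -> Option<Violation> {
    let rest = line.trim().strip_prefix("[cycle ")?;
    let (cycle, rest) = rest.split_once("] ")?;
    let (kind, rest) = rest.split_once(" VIOLATION at ")?;
    // The site may itself contain '[' and ']', so split on the last " [word=".
    let (site, rest) = rest.rsplit_once(" [word=")?;
    let (word, fields) = rest.split_once("]: ")?;

    let (mut arrival, mut constraint, mut slack) = (None, None, None);
    for field in fields.split_whitespace() {
        let (key, value) = field.split_once('=')?;
        let ps = value.strip_suffix("ps")?.parse::<i64>().ok()?;
        match key {
            "arrival" => arrival = Some(ps),
            "setup" | "hold" => constraint = Some(ps),
            "slack" => slack = Some(ps),
            _ => {} // ignore unknown keys
        }
    }
    Some(Violation {
        cycle: cycle.parse().ok()?,
        kind: kind.to_string(),
        site: site.to_string(),
        word: word.parse().ok()?,
        arrival_ps: arrival?,
        constraint_ps: constraint?,
        slack_ps: slack?,
    })
}
```

For the first sample line above, this yields cycle 12847, kind SETUP, word 412, and a slack of -30 ps.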

Structured timing report (--timing-report <path.json>)

Schema-versioned JSON document written at end of run. Contents:

  • Per-cycle violation list (cycle, kind, word, site, arrival, constraint, slack).
  • Per-word aggregate: violation counts and worst slack (sorted by total violations).
  • Top-N worst-slack ranking per kind (setup, hold).
  • Run metadata: design, vector source, timing source, clock period, cycles run, Jacquard version.
  • Aggregate stats: setup/hold totals, dropped events.

Machine-readable, CI-friendly. Sample at tests/timing_ir/sample_reports/two_violations.json; full schema in src/timing_report.rs (SCHEMA_VERSION = "1.0.0"). Stability contract per ADR 0008: additive-only extensions, breaking changes bump the major.
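
To make the per-word aggregate concrete, here is a minimal sketch of how a per-cycle violation list could be reduced to counts and worst slack per word, sorted by total violations. The types `Event` and `WordAggregate` are hypothetical and do not mirror the real schema in src/timing_report.rs; field names are chosen for illustration only.

```rust
use std::collections::BTreeMap;

/// A single violation event (hypothetical, reduced to the fields needed here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub cycle: u64,
    pub is_setup: bool, // true = setup violation, false = hold violation
    pub word: u32,
    pub slack_ps: i64,
}

/// Per-word summary: violation counts and the most negative slack seen.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WordAggregate {
    pub word: u32,
    pub setup_count: u32,
    pub hold_count: u32,
    pub worst_slack_ps: i64,
}

/// Groups events by state word and sorts by total violations, descending.
/// Ties keep ascending word order because the sort is stable.
pub fn aggregate_by_word(events: &[Event]) -> Vec<WordAggregate> {
    let mut by_word: BTreeMap<u32, WordAggregate> = BTreeMap::new();
    for e in events {
        let agg = by_word.entry(e.word).or_insert(WordAggregate {
            word: e.word,
            setup_count: 0,
            hold_count: 0,
            worst_slack_ps: i64::MAX,
        });
        if e.is_setup {
            agg.setup_count += 1;
        } else {
            agg.hold_count += 1;
        }
        agg.worst_slack_ps = agg.worst_slack_ps.min(e.slack_ps);
    }
    let mut out: Vec<WordAggregate> = by_word.into_values().collect();
    out.sort_by(|a, b| (b.setup_count + b.hold_count).cmp(&(a.setup_count + a.hold_count)));
    out
}
```

A CI job could apply the same reduction to the JSON report's violation list to fail a build when any word exceeds a violation budget.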

Text summary (--timing-summary)

One-screen human summary on stdout. Same data as the JSON report, different channel; either or both flags can be set:

=== Jacquard Timing Summary ===
Design:        my_cpu.gv
Vectors:       boot.vcd (1000 cycles)
Clock period:  1000 ps
Timing source: my_cpu.jtir

Violations:
  Setup: 5
  Hold:  2
  Total: 7

Worst slack:
  Setup: -150ps  at top/cpu/regs[7][bit 22] [word=5]  (cycle 87)
  Hold:   -40ps  at top/cpu/state[bit 3] [word=12]  (cycle 91)

Top 2 by violation count (of 2 total words with violations):
  top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
  top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps

Format is for human inspection — explicitly not a stable parseable contract. Tools should use --timing-report JSON.

Timed VCD (--timed)

Annotates the output VCD with per-signal arrival times. Largest, most detailed output; suitable for waveform-level inspection.

  • What you get: per-signal arrival ps at each writeout cycle.
  • Caveat: the VCD doesn't carry slack relative to the clock edge — you compute it yourself.
  • Cost: doubles VCD size. Not appropriate for long workloads on large designs.

SimStats aggregate counts (in-process)

SimStats { setup_violations, hold_violations, ... } is available to in-process consumers (src/event_buffer.rs). Only counts; full detail flows through the structured report path.

Still on the wishlist

Items captured in ADR 0008's "Optional / later outputs" plus a few caveats on what shipped. Demand-driven; not scheduled.

Closest-to-violation tracking when no violation occurred

The shipped worst_slack ranking is populated only from observed violation events. Surfacing "where am I close to the edge" on a run that passed timing requires GPU-side near-miss instrumentation (emit slack events whenever |slack| falls below a configurable threshold). Useful for proactive signoff regression. Separate workstream — needs a kernel change.

Arrival histogram (--arrival-histogram <pattern>)

Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis and "is my actual timing margin healthy" reporting.

STA cross-reference (--sta-cross-reference <opensta-paths.txt>)

Read OpenSTA's worst-N critical-path report and produce coverage output: of those paths, which were exercised by the stimulus, at what observed arrival. Closes the loop between vector-driven and static analysis.

Path back-trace from worst-arrival DFF

Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting per-edge contributions. Most expensive item on the wishlist; only useful once symbolic names are in place (which they now are).
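
Since this item is not implemented, the following is only a sketch of the idea under simplifying assumptions: a hypothetical `Node` type holding fan-in indices and a recorded arrival time, and an acyclic combinational graph. It is not based on Jacquard's AIG data structures.

```rust
/// A node in a simplified timing graph (hypothetical type).
#[derive(Debug, Clone)]
pub struct Node {
    pub fanins: Vec<usize>, // indices of driving nodes; empty for a source
    pub arrival_ps: u64,    // arrival time recorded at this node
}

/// Walks backward from `start`, always following the fan-in with the
/// largest arrival time. Returns `(node, contribution_ps)` pairs, where the
/// contribution is the arrival difference across that step. The final entry
/// is the source node with its own arrival time.
/// Panics if `start` or a fan-in index is out of range.
pub fn back_trace(nodes: &[Node], start: usize) -> Vec<(usize, u64)> {
    let mut path = Vec::new();
    let mut cur = start;
    // Bounded by the node count so a malformed (cyclic) graph cannot loop forever.
    for _ in 0..nodes.len() {
        let worst = nodes[cur]
            .fanins
            .iter()
            .copied()
            .max_by_key(|&f| nodes[f].arrival_ps);
        match worst {
            Some(prev) => {
                let delta = nodes[cur].arrival_ps.saturating_sub(nodes[prev].arrival_ps);
                path.push((cur, delta));
                cur = prev;
            }
            None => break, // reached a node with no fan-in
        }
    }
    path.push((cur, nodes[cur].arrival_ps));
    path
}
```

On an acyclic graph the contributions along the returned path sum to the arrival time at the starting node, which is a convenient sanity check.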

CUDA / HIP / cosim runtime violation routing

The current Metal sim path routes runtime violations through process_events (which is what feeds the resolver, structured report, and text summary). The CUDA, HIP, and cosim paths don't yet share that plumbing — they detect violations on the GPU but don't drain through process_events. Independent plumbing follow-up; doesn't affect the Metal user experience.

Per-signal activity / transition counts

Listed in ADR 0008 as part of the JSON report's wishlist. Not in v1.0.0 of the schema; will be added (additively) when the GPU kernel emits transition events.

"Corner" and "margin percentage" in the text summary

ADR 0008's summary template includes both. Corner is missing because the metadata struct doesn't carry it through from the IR yet; margin percentage is trivially derivable from slack_ps / clock_period_ps and was omitted to keep the v1 summary terse.
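
The margin derivation mentioned above is a single division. A minimal sketch, with a hypothetical function name and the assumption that margin is expressed as slack over clock period in percent:

```rust
/// Margin as a percentage of the clock period: slack_ps / clock_period_ps * 100.
/// Negative values indicate a violation. Returns None for a zero period.
pub fn margin_percent(slack_ps: i64, clock_period_ps: u64) -> Option<f64> {
    if clock_period_ps == 0 {
        return None;
    }
    // Multiply first so that round values such as -150/1000 stay exact.
    Some(slack_ps as f64 * 100.0 / clock_period_ps as f64)
}
```

With the numbers from the text summary example (worst setup slack of -150 ps at a 1000 ps clock period), this gives a margin of -15%.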


GEM Simulation Architecture

This document describes GEM's internal simulation architecture based on investigation and testing.

Overview

GEM (GPU-accelerated Emulator-inspired RTL simulation) compiles gate-level netlists into GPU kernels that simulate designs 5-40X faster than CPU-based simulators. It works like an FPGA-based RTL emulator by converting designs into an and-inverter graph (AIG), partitioning it for GPU blocks, and generating optimized GPU code.

Pipeline Stages

Verilog Netlist → NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU Kernel
                     ↓            ↓                    ↓            ↓
                  Parse      Synthesis         Hypergraph     Instruction
                  Netlist    to AIGs          Partitioning    Generation

1. NetlistDB (Input Parsing)

Input: Gate-level Verilog (.gv files) from synthesis tools (Yosys, Design Compiler)

Process:

  • Parses structural Verilog using sverilogparse crate
  • Creates flattened netlist database with cells, pins, nets
  • Identifies primary inputs, outputs, clock signals
  • Stores connectivity in CSR (Compressed Sparse Row) format
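
For readers unfamiliar with CSR, the sketch below shows the general technique on a net-to-pins mapping. It is a generic illustration, not the NetlistDB layout: the `Csr` type and its field names are hypothetical.

```rust
/// Compressed Sparse Row storage for a ragged list-of-lists.
/// Row `i` occupies `targets[offsets[i]..offsets[i + 1]]`.
#[derive(Debug, PartialEq, Eq)]
pub struct Csr {
    pub offsets: Vec<usize>, // length = number of rows + 1
    pub targets: Vec<usize>, // all rows concatenated
}

impl Csr {
    /// Builds CSR from per-row adjacency lists (e.g. net index -> pin indices).
    pub fn from_adjacency(adj: &[Vec<usize>]) -> Self {
        let mut offsets = Vec::with_capacity(adj.len() + 1);
        let mut targets = Vec::new();
        offsets.push(0);
        for row in adj {
            targets.extend_from_slice(row);
            offsets.push(targets.len());
        }
        Csr { offsets, targets }
    }

    /// Returns the entries of row `i` as a slice, without allocation.
    pub fn row(&self, i: usize) -> &[usize] {
        &self.targets[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

The appeal of CSR for netlists is that connectivity lives in two flat arrays, so traversals are cache-friendly and there is no per-net heap allocation.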

Key Limitations:

  • Only supports synthesized gate-level netlists (not RTL)
  • No behavioral Verilog constructs (always blocks, if/case statements)
  • Expects standard cells from supported libraries (AIGPDK)

2. AIG (And-Inverter Graph)

Process: Converts gate-level netlist to AIG representation

Data Structure:

pub enum DriverType {
    AndGate,           // Basic AND gate
    DFF,               // D flip-flop
    ClockGate,         // Clock gating cell
    RAMBlock,          // Memory block
    GemAssert,         // Assertion checking
    GemDisplay,        // Display output
    // ... more types
}

Statistics (example from safe.v):

  • 157 AIG pins: Internal circuit nodes
  • 133 AND gates: Logic operations
  • 16 DFF cells: Sequential elements
  • 2 GEM_ASSERT cells: Assertion nodes
  • 480 total pins: Including I/O

Key Features:

  • Clock inference from DFF connections
  • Assertion cell detection (GEM_ASSERT, GEM_DISPLAY)
  • Endpoint grouping for outputs and registers

3. StagedAIG (Pipeline Staging)

Purpose: Split deep combinational logic into pipeline stages

Process:

  • Analyzes combinational depth between registers
  • Splits logic at --level-split thresholds
  • Creates pipeline stages to fit GPU resource constraints

When Needed:

  • Designs with very deep combinational paths (>50 levels)
  • When single-stage partitioning fails resource limits
  • Use --level-split 30 or --level-split 20,40 to force splits
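
As an illustration of how split thresholds could map logic levels to stages, consider the sketch below. The function names are hypothetical, and the boundary rule (a level equal to a threshold starts the next stage) is an assumption; the actual rule lives in src/staging.rs and may differ.

```rust
/// Parses a comma-separated threshold list such as "30" or "20,40".
/// Returns None for malformed input or thresholds that are not strictly increasing.
pub fn parse_level_split(arg: &str) -> Option<Vec<u32>> {
    let thresholds: Vec<u32> = arg
        .split(',')
        .map(|s| s.trim().parse::<u32>().ok())
        .collect::<Option<Vec<u32>>>()?;
    if thresholds.windows(2).all(|w| w[0] < w[1]) {
        Some(thresholds)
    } else {
        None
    }
}

/// Maps a combinational level to a stage index.
/// Assumption: a level at or above a threshold belongs to the following stage,
/// so thresholds [20, 40] give stages 0..20, 20..40, and 40 upward.
pub fn stage_of_level(level: u32, thresholds: &[u32]) -> usize {
    thresholds.iter().filter(|&&t| level >= t).count()
}
```

Under this assumed rule, `--level-split 20,40` produces three stages and `--level-split 30` produces two.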

4. Partitioning (Hypergraph Cut)

Tool: mt-kahypar hypergraph partitioner

Constraints (GPU block resources):

  • Max 8191 unique inputs per partition
  • Max 8191 unique outputs per partition
  • Max 4095 intermediate pins alive per stage
  • Max 64 SRAM output groups
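
The limits above amount to a simple feasibility check on each candidate partition. A minimal sketch follows, using the numbers from the list; the `PartitionStats` struct and `exceeded_limits` function are hypothetical names, not Jacquard's partitioner API.

```rust
pub const MAX_UNIQUE_INPUTS: usize = 8191;
pub const MAX_UNIQUE_OUTPUTS: usize = 8191;
pub const MAX_LIVE_INTERMEDIATES: usize = 4095;
pub const MAX_SRAM_OUTPUT_GROUPS: usize = 64;

/// Resource usage of one candidate partition (hypothetical summary type).
#[derive(Debug, Clone, Copy)]
pub struct PartitionStats {
    pub unique_inputs: usize,
    pub unique_outputs: usize,
    pub live_intermediates: usize, // peak intermediate pins alive in any stage
    pub sram_output_groups: usize,
}

/// Returns the names of every limit the partition exceeds (empty = it fits).
pub fn exceeded_limits(p: &PartitionStats) -> Vec<&'static str> {
    let checks = [
        ("unique inputs", p.unique_inputs, MAX_UNIQUE_INPUTS),
        ("unique outputs", p.unique_outputs, MAX_UNIQUE_OUTPUTS),
        ("live intermediate pins", p.live_intermediates, MAX_LIVE_INTERMEDIATES),
        ("SRAM output groups", p.sram_output_groups, MAX_SRAM_OUTPUT_GROUPS),
    ];
    let mut exceeded = Vec::new();
    for &(name, used, max) in checks.iter() {
        if used > max {
            exceeded.push(name);
        }
    }
    exceeded
}
```

A partition that fails this kind of check is the case where more partitions, or a `--level-split`, would be needed.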

Process:

  • Iterative partitioning (runs automatically at simulation start)
  • Tries 1 partition first, then increases if needed
  • Merges partitions to minimize inter-partition communication

5. FlattenedScript (GPU Instruction Generation)

Process: Generates GPU execution script from partitions

Script Components:

  • Boomerang stages: Hierarchical 8192→1 reduction structure
  • State buffer: Packed 32-bit words for all register values
  • SRAM interface: Memory block read/write operations
  • Assertion positions: Bit positions for assertion conditions
  • Display positions: Enable bits and argument positions

Statistics (example):

reg/io state size: 133 bits → 5 words (32-bit)
script size: 30208 instructions
assertion_positions: [(cell_id, bit_pos, msg_id, type)]
display_positions: [(cell_id, enable_pos, format, arg_positions, widths)]

Key Insight: All state is packed into a flat bit array, indexed by position in 32-bit words.
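
The packing arithmetic is worth spelling out. The sketch below uses hypothetical helper names and assumes bit 0 is the least significant bit of each word, which the text does not state explicitly.

```rust
/// Number of 32-bit words needed to hold `bits` state bits (rounds up).
pub fn words_needed(bits: usize) -> usize {
    (bits + 31) / 32
}

/// Splits a flat bit position into (word index, bit index within the word).
pub fn locate(pos: usize) -> (usize, u32) {
    (pos / 32, (pos % 32) as u32)
}

/// Reads one bit from a packed state buffer.
/// Assumption: bit 0 is the least significant bit of its word.
/// Panics if `pos` lies beyond the buffer.
pub fn read_bit(state: &[u32], pos: usize) -> bool {
    let (word, bit) = locate(pos);
    (state[word] >> bit) & 1 == 1
}
```

This agrees with the figures in this document: 133 bits need 5 words, and the debug log sample later on reports pos=4195 as word=131, bit=3 and enable_pos=5154 as word=161, bit=2.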

6. GPU Kernel Execution

Kernel Types:

  • kernel_v1.cu / kernel_v1_impl.cuh: CUDA implementation
  • kernel_v1.metal: Metal (Apple Silicon) implementation

Execution Model:

  • Each GPU block simulates one partition
  • Multiple blocks run in parallel
  • State synchronized between stages
  • CPU checks assertion/display conditions after GPU completes

VCD Input/Output

Input VCD Requirements

Critical Discovery: GEM expects VCD signals at absolute top-level (no module hierarchy).

Expected Signal Format:

$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end

NOT (with module scope):

$scope module testbench $end
  $scope module dut $end
    $var wire 1 ! clk $end
    ...

Signal Matching:

  • GEM looks for signals matching synthesized module port names
  • Uses HierName() (empty hierarchy) for matching
  • If signals are scoped under modules, GEM reports:
    WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input
    

VCD Scope Option:

  • --input-vcd-scope <scope>: Specify module hierarchy to read from
  • Current Issue: Even with scope specified, signal matching fails
  • Workaround: Generate VCD with signals at absolute top level
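Before running a simulation it can help to check whether an input VCD is already flat. The following is a minimal sketch of such a check, not part of Jacquard: `scoped_var_names` is a hypothetical helper, and it assumes each `$scope`, `$upscope`, and `$var` declaration sits on its own line (the VCD format itself allows freer whitespace).

```rust
/// Return the names of `$var` declarations that sit inside a `$scope`.
/// An empty result means every signal is declared at the top level.
/// ASSUMPTION: one declaration per line.
fn scoped_var_names(vcd_header: &str) -> Vec<String> {
    let mut depth = 0usize;
    let mut scoped = Vec::new();
    for line in vcd_header.lines() {
        let tokens: Vec<&str> = line.split_whitespace().collect();
        match tokens.first().copied() {
            Some("$scope") => depth += 1,
            Some("$upscope") => depth = depth.saturating_sub(1),
            // Layout: $var <type> <width> <id> <name> ... $end
            Some("$var") if depth > 0 && tokens.len() >= 5 => {
                scoped.push(tokens[4].to_string());
            }
            // Declarations end here; value changes follow.
            Some("$enddefinitions") => break,
            _ => {}
        }
    }
    scoped
}
```

Any name returned by this function is a candidate for the GATESIM_VCDI_MISSING_PI warning shown above.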

Output VCD Structure

GEM generates minimal VCD with only primary outputs:

$timescale 1 ns $end
$scope module gem_top_module $end
$var wire 1 ! unlocked $end
$upscope $end

Internal states and intermediate signals are not dumped.

Assertion and Display Support

Assertion Infrastructure

Synthesis Flow:

Verilog assert() → Yosys $check cell → techmap gem_formal.v → GEM_ASSERT cell

Runtime:

  • GEM stores assertion positions in FlattenedScript
  • CPU checks assertion bits after GPU simulation
  • Configurable actions: Log, Pause, Terminate

AssertConfig:

pub struct AssertConfig {
    pub on_failure: AssertAction,  // Log, Pause, Terminate
    pub max_failures: Option<u32>,
}

Display Infrastructure

Synthesis Flow:

Verilog $display() → Yosys $print cell → techmap gem_formal.v → GEM_DISPLAY cell

Runtime:

  • Format strings stored in JSON metadata
  • CPU checks display enable bits after GPU simulation
  • Arguments extracted from state buffer positions

Limitation: Format string preservation depends on Yosys synthesis preserving attributes.

Debug Information

Enabling Debug Output

# Metal simulation with debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>

# CPU verification (slower but validates GPU results)
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu

Key Debug Messages

AIG Construction:

Found GEM_ASSERT cell 143 (condition_iv=0, en_iv=0, a_iv=76, clken_iv=2)
Found GEM_DISPLAY cell 24 (enable_iv=2, clken_iv=2, args=32)

Partitioning:

netlist has 480 pins, 157 aig pins, 133 and gates
current: 19 endpoints, try 1 parts
after merging: 1 parts

Flattening:

Built script for 48 blocks, reg/io state size 133, sram size 0, script size 30208
Assertion: cell=144, pos=4195 (word=131, bit=3), msg_id=144, type=None
Display: cell=24, enable_pos=5154 (word=161, bit=2), format='...', args=[...]

VCD Reading:

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present

Performance Characteristics

Speedup vs CPU

  • Simple designs: 5-10X faster
  • Complex designs: 10-40X faster
  • Depends on:
    • Number of GPU SMs (streaming multiprocessors)
    • Partition granularity
    • VCD I/O overhead

Resource Scaling

GPU Block Count: Set NUM_BLOCKS to 2× number of GPU SMs

  • Apple M4 Pro: 48 blocks (24 SMs × 2)
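  • NVIDIA GPUs: Look up the SM count for your GPU model (nvidia-smi reports the model name; the CUDA deviceQuery sample prints the multiprocessor count)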

Memory Usage:

  • State buffer: num_blocks × state_size × num_cycles × 4 bytes
  • Script: script_size × 4 bytes (shared across blocks)
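The sizing rules above are plain arithmetic, sketched below. The function names are hypothetical, and `state_size_words` is assumed to be measured in 32-bit words, which is what the "× 4 bytes" factor suggests.

```rust
/// Recommended block count: two blocks per streaming multiprocessor.
fn num_blocks(sm_count: u64) -> u64 {
    2 * sm_count
}

/// State buffer size in bytes.
/// ASSUMPTION: `state_size_words` counts 32-bit words.
fn state_buffer_bytes(num_blocks: u64, state_size_words: u64, num_cycles: u64) -> u64 {
    num_blocks * state_size_words * num_cycles * 4
}

/// Script size in bytes (shared across blocks).
fn script_bytes(script_size: u64) -> u64 {
    script_size * 4
}
```

For the example design in this document (48 blocks, 5 state words, script size 30208) and a hypothetical 1000-cycle run, this gives a 960,000-byte state buffer and a 120,832-byte script.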

Known Issues and Limitations

1. VCD Hierarchy Mismatch

Issue: GEM expects flat VCD signal hierarchy

Impact: Missing input signals cause incorrect simulation results

Workaround: Generate VCD with $dumpvars(1, sig1, sig2, ...) at top level

Status: Under investigation

2. Complex FSM Designs

Issue: Some FSM designs don't simulate correctly even with proper VCD

Example: safe.v (9-state PIN cracker FSM)

Possible Causes:

  • Synthesis optimization changes FSM encoding
  • Initial state handling differences
  • Reset timing issues

Status: Identified through third-party test suite

3. No Latch or Asynchronous Sequential Logic Support

Issue: Jacquard only supports edge-triggered D flip-flops (DFFs) as sequential elements. Latch-based designs (SR latches, transparent latches, master-slave latch pairs) and asynchronous sequential logic are not supported.

Impact: Designs using latches will either:

  • Fail during AIG conversion (unrecognized cell type)
  • Be silently treated as combinational logic (incorrect simulation)

What this means in practice:

  • Gate-level netlists must be synthesized to a DFF-only cell library (AIGPDK or SKY130)
  • CVC's built-in test suite (tests_and_examples/install.test/) uses NAND-latch flip-flops (e.g., dfpsetd.v, sdfia04.v) and cannot be used as Jacquard reference tests
  • Self-timed designs with internal clock generation (e.g., CVC's das_lfsr benchmark) are also unsupported

What would be needed to support latches:

  1. New DriverType variant: Add Latch(enable, data) to DriverType in aig.rs, representing a level-sensitive storage element
  2. Two-phase evaluation: Latches are transparent when enabled, requiring evaluation within a clock phase rather than only at clock edges. The current cycle-based simulation model (evaluate all combinational logic, then capture DFF outputs) would need to iterate until latch outputs stabilize
  3. AIG conversion: Map latch library cells (e.g., SKY130 dlxtp) to the new Latch driver, identifying enable and data pins
  4. GPU kernel changes: The writeout stage currently uses clken_perm for DFF clock gating. Latches would need a different mechanism: while enable is high, output tracks input continuously rather than capturing on an edge
  5. Timing: Latch timing is more complex — setup/hold is relative to the enable edge, and time borrowing across latch boundaries is a key use case in high-performance designs
  6. Convergence: Combinational loops through transparent latches must be detected and iterated to a fixed point, or flagged as errors

Complexity estimate: Moderate-to-high. The main challenge is the evaluation model change — DFF-only simulation is a clean "capture at edge" model, while latches require iterative evaluation within clock phases.

Status: Not planned. Jacquard targets synthesis flows that produce DFF-only netlists.

4. Format String Preservation

Issue: Yosys synthesis may not preserve gem_format attributes

Impact: Display messages show placeholders instead of actual format strings

Workaround: Extract format strings from pre-synthesis JSON

Status: Tool limitation, not GEM bug

Investigation Methodology

This documentation was created through systematic investigation:

  1. Structure Analysis: Examined source code in src/aig.rs, src/flatten.rs, src/staging.rs
  2. Debug Tracing: Used RUST_LOG=debug to capture internal state
  3. Netlist Inspection: Analyzed synthesized .gv files with grep
  4. VCD Comparison: Compared iverilog vs GEM VCD outputs
  5. Test Case Development: Created minimal reproducible examples
  6. Iterative Debugging: Progressively simplified designs to isolate issues

References

  • Main codebase: src/ directory
  • EDA infrastructure: vendor/eda-infra-rs/ submodule (netlistdb, vcd-ng, ulib)
  • AIGPDK library: aigpdk/ directory
  • Test cases: tests/ directory
  • Third-party examples: tests/regression/third_party/

Document Version: 1.0
Last Updated: 2025-01-08
Authors: NVIDIA GEM Team + Claude Code Investigation

Timing Simulation in GEM

See also: timing-correctness.md — forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.

This document explains GEM's boomerang evaluation architecture and how timing simulation with per-gate delays can be implemented efficiently on GPU.

Background: The Simulation Challenge

GEM simulates And-Inverter Graphs (AIGs) where every node is either:

  • A primary input (value comes from VCD stimulus)
  • An AND gate with two inputs (possibly inverted)

Traditional simulation evaluates gates in topological order, which is inherently serial. GPUs excel at massive parallelism - thousands of threads doing the same operation on different data. GEM bridges this gap with the boomerang architecture.

Boomerang Evaluation

Core Concept

The boomerang structure is a hierarchical reduction tree that maps an AIG onto GPU threads. It's called "boomerang" because data flows down the tree during reduction, then results are written back out at various levels - like a boomerang going out and returning.

Hierarchy Structure

GEM uses BOOMERANG_NUM_STAGES = 13, meaning the tree has 2^13 = 8192 leaf positions:

Level 0 (inputs):   8192 positions
Level 1:            4096 positions  (8192 / 2)
Level 2:            2048 positions
Level 3:            1024 positions
Level 4:             512 positions
Level 5:             256 positions
Level 6:             128 positions
Level 7:              64 positions
Level 8:              32 positions
Level 9:              16 positions
Level 10:              8 positions
Level 11:              4 positions
Level 12:              2 positions
Level 13 (output):     1 position

Each level halves the number of positions by computing AND gates that combine pairs.
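The level sizes in the table follow directly from the stage count. A small sketch of that relationship is below; the function names are hypothetical, and only `BOOMERANG_NUM_STAGES = 13` comes from the text above.

```rust
const BOOMERANG_NUM_STAGES: u32 = 13;

/// Number of positions at a given level (level 0 is the input level).
fn positions_at_level(level: u32) -> u32 {
    assert!(level <= BOOMERANG_NUM_STAGES);
    1 << (BOOMERANG_NUM_STAGES - level)
}

/// Total AND-gate slots in one full reduction: 4096 + 2048 + ... + 1.
fn total_gate_slots() -> u32 {
    (1..=BOOMERANG_NUM_STAGES).map(positions_at_level).sum()
}
```

Summing levels 1 through 13 gives 8191 gate slots, one fewer than the number of leaf positions, as expected for a full binary reduction tree.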

Thread Organization

A GPU block has 256 threads (threadIdx.x = 0..255). Each thread holds a 32-bit word where each bit represents an independent Boolean signal:

Thread 0:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
Thread 1:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
...
Thread 255: [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
            ─────────────────────────────
            Total: 256 × 32 = 8192 signals per level

Thread position refers to threadIdx.x - which of the 256 threads we're addressing. Each thread position processes 32 signals in parallel using SIMD operations.

Memory Layout

__shared__ u32 shared_metadata[256];   // Partition configuration
__shared__ u32 shared_writeouts[256];  // Output staging area
__shared__ u32 shared_state[256];      // Working state (8192 bits)

The shared_state array holds the current level's values during reduction.

The Reduction Process

Phase 1: Level 0 → Level 1 (hier[0])

Only threads 128-255 are active. Each computes 32 AND gates in parallel:

if(threadIdx.x >= 128) {
    u32 hier_input_a = shared_state[threadIdx.x - 128];  // From threads 0-127
    u32 hier_input_b = hier_input;                        // This thread's data

    // 32 AND gates computed simultaneously (one per bit)
    u32 ret = (hier_input_a ^ hier_flag_xora) &
              ((hier_input_b ^ hier_flag_xorb) | hier_flag_orb);

    shared_state[threadIdx.x] = ret;
}

The xora, xorb, and orb flags encode:

  • xora/xorb: Input inversions (for AND-inverter graph)
  • orb: Passthrough mode (when output equals input A, skip the AND)
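To make the flag semantics concrete, here is a CPU-side sketch of the same bitwise expression. The function name `and_word` is hypothetical; the expression itself is copied from the kernel snippet above.

```rust
/// Evaluate 32 AND gates at once, one per bit lane.
/// `xora` / `xorb` invert the selected lanes of each input.
/// Lanes set in `orb` force the B side to 1, so the output is just (a ^ xora).
fn and_word(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}
```

With all flags zero this is a plain AND: `and_word(0b1100, 0b1010, 0, 0, 0)` is `0b1000`. Setting `orb` to all ones turns every lane into a passthrough of input A, and the flags can differ per lane, so one word can mix plain gates, inverted-input gates, and passthroughs.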

Visual representation:

Before:  [T0][T1]...[T127] [T128][T129]...[T255]
              │                  │
              └───────┬──────────┘
                      │
                   AND gates (128 threads × 32 bits = 4096 gates)
                      │
                      ▼
After:   [----unused----] [T128][T129]...[T255]
                          (128 × 32 = 4096 results)

Phase 2: Levels 1-3 (Shared Memory)

for(int hi = 1; hi <= 3; ++hi) {
    int hier_width = 1 << (7 - hi);  // 64, 32, 16
    if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
        u32 hier_input_a = shared_state[threadIdx.x + hier_width];
        u32 hier_input_b = shared_state[threadIdx.x + hier_width * 2];
        u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        shared_state[threadIdx.x] = ret;
    }
    __syncthreads();  // Barrier between levels
}

Each level activates fewer threads:

  • Level 1: threads 64-127 (64 threads → 2048 gates)
  • Level 2: threads 32-63 (32 threads → 1024 gates)
  • Level 3: threads 16-31 (16 threads → 512 gates)

Phase 3: Levels 4-7 (Warp Shuffle)

Within a single warp (32 threads), data exchange uses fast shuffle instructions instead of shared memory:

if(threadIdx.x < 32) {
    for(int hi = 4; hi <= 7; ++hi) {
        int hier_width = 1 << (7 - hi);  // 8, 4, 2, 1
        u32 hier_input_a = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width);
        u32 hier_input_b = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width * 2);
        if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
            tmp_cur_hi = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        }
    }
}

No synchronization needed - warp shuffle is implicitly synchronized.

Phase 4: Levels 8-12 (Bit Operations)

The final levels operate on bits within a single u32, computed by thread 0 only:

if(threadIdx.x == 0) {
    // Level 8: 32 → 16 (operates on upper/lower halves)
    u32 r8 = ((v1 << 16) ^ xora) & ((v1 ^ xorb) | orb) & 0xffff0000;

    // Level 9: 16 → 8
    u32 r9 = ((r8 >> 8) ^ xora) & (((r8 >> 16) ^ xorb) | orb) & 0xff00;

    // Level 10: 8 → 4
    u32 r10 = ((r9 >> 4) ^ xora) & (((r9 >> 8) ^ xorb) | orb) & 0xf0;

    // Level 11: 4 → 2
    u32 r11 = ((r10 >> 2) ^ xora) & (((r10 >> 4) ^ xorb) | orb) & 0b1100;

    // Level 12: 2 → 1
    u32 r12 = ((r11 >> 1) ^ xora) & (((r11 >> 2) ^ xorb) | orb) & 0b10;

    tmp_cur_hi = r8 | r9 | r10 | r11 | r12;
}

Write-Outs

Results are captured at various levels (not just the final output) and written to global memory:

if((writeout_hook_i >> 8) == bs_i) {
    shared_writeouts[threadIdx.x] = shared_state[writeout_hook_i & 255];
}

This is the "return" part of the boomerang - results flow back from intermediate levels.
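The hook encoding in the snippet above can be read as a packed pair. The sketch below is a CPU-side mirror under the assumption that `bs_i` is the index of the current boomerang stage; `decode_writeout_hook` and `capture` are hypothetical names.

```rust
/// Decode a write-out hook: the bits above bit 7 are compared against the
/// stage index, and the low 8 bits select a source slot in shared_state.
fn decode_writeout_hook(hook: u32) -> (u32, usize) {
    (hook >> 8, (hook & 255) as usize)
}

/// Return the captured value if this hook fires at stage `bs_i`.
fn capture(hook: u32, bs_i: u32, shared_state: &[u32; 256]) -> Option<u32> {
    let (stage, src) = decode_writeout_hook(hook);
    (stage == bs_i).then(|| shared_state[src])
}
```

A hook value of `(3 << 8) | 17` therefore captures `shared_state[17]` during stage 3 and does nothing at any other stage.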

Timing Simulation Approaches

Approach Comparison

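| Approach | Parallelism | Memory | Accuracy | GPU Fit |
|---|---|---|---|---|
| Event-driven | Poor (serial queue) | Low | Exact | Bad |
| Time-wheel | Medium | High | Configurable | Medium |
| Levelized | Excellent | Low | Conservative | Best |
| Oblivious | Maximum | Very High | Exact | Wasteful |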

The levelized approach piggybacks on the existing boomerang structure with minimal changes.

Data Structure Addition

// Add to shared memory (256 bytes additional)
__shared__ u8 shared_arrival[256];  // One arrival time per thread position

Each thread position stores a single 8-bit arrival time representing the maximum arrival across all 32 bits in that position. (This is the initial design sketch; the kernels described under Phase 3 below store u16 arrivals.)

Modified AND Gate Evaluation

// Current (value only):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

// With timing (add ~4 instructions):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

u8 arr_a = shared_arrival[threadIdx.x - offset_a];
u8 arr_b = shared_arrival[threadIdx.x - offset_b];
u8 arr_ret = min(max(arr_a, arr_b) + GATE_DELAY, 255);  // Saturating add
shared_arrival[threadIdx.x] = arr_ret;
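The same arrival update can be expressed on the CPU with Rust's built-in saturating arithmetic. This is a sketch for reference only: `GATE_DELAY` is given a hypothetical value here, and `propagate_arrival` is not a function from the codebase.

```rust
/// Hypothetical uniform gate delay, in the same units as the arrivals.
const GATE_DELAY: u8 = 1;

/// Arrival at an AND gate output: the later input arrival plus the gate
/// delay, clamped at u8::MAX instead of wrapping around.
fn propagate_arrival(arr_a: u8, arr_b: u8) -> u8 {
    arr_a.max(arr_b).saturating_add(GATE_DELAY)
}
```

Clamping matters because a wrapped arrival would look like an early signal and could hide a setup violation.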

Complexity Analysis

  • Same number of kernel launches as zero-delay simulation
  • O(levels × cycles) - identical to current
  • ~256 bytes additional shared memory per partition
  • Estimated 10-20% performance overhead

The Approximation Trade-off

What We Track

One arrival time per thread position (256 values) instead of per signal (8192 values).

Implications

If thread position 50 contains signals A, B, C with different true arrivals:

Signal A: 15ps (medium path)
Signal B: 23ps (longest path)
Signal C: 8ps  (shortest path)

We store only: arrival[50] = 23ps (the maximum).
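The cost of this approximation for each signal is the gap between the stored maximum and its true arrival. A minimal sketch, using hypothetical helper names:

```rust
/// Arrival stored for a thread position: the maximum over its signals.
fn stored_arrival(signal_arrivals_ps: &[u32]) -> u32 {
    signal_arrivals_ps.iter().copied().max().unwrap_or(0)
}

/// Per-signal overestimate introduced by storing only the maximum.
fn overestimates(signal_arrivals_ps: &[u32]) -> Vec<u32> {
    let stored = stored_arrival(signal_arrivals_ps);
    signal_arrivals_ps.iter().map(|&a| stored - a).collect()
}
```

For the three arrivals above (15ps, 23ps, 8ps) the stored value is 23ps, so signal A is overestimated by 8ps, signal B by 0ps, and signal C by 15ps.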

Why This Works

  1. Conservative: We might report false violations, but never miss real ones
  2. Correlated signals: Signals at the same thread position are often topologically nearby with similar timing
  3. Endpoint focus: We ultimately only care about arrivals at DFF D inputs

When Full Accuracy is Needed

For bit-accurate timing, you would need:

// 8KB additional shared memory (may exceed limits)
__shared__ u8 shared_arrival[256][32];  // Per-bit arrivals

This is feasible but significantly increases memory pressure and computation.

Implementation Phases

Phase 1: CPU Timing Analysis (Completed)

  • Liberty parser for delay extraction
  • Static timing analysis on AIG
  • CPU reference simulation with delays
  • Timing violation detection

Phase 2: Hybrid GPU+CPU (Completed)

  • GPU performs zero-delay value simulation
  • CPU performs timing analysis on results
  • Validates infrastructure without kernel changes

Phase 3: GPU Arrival Tracking (Completed)

  • Added shared_arrival[256] (u16) to Metal and CUDA kernels
  • Arrivals tracked during boomerang reduction at all hierarchy levels
  • Per-gate delays injected via script padding slots from SDF data
  • DFF timing constraint checking at cycle boundaries (setup/hold)
  • Timing-aware VCD output (--timed flag)
  • Validated against CVC reference simulator (88ps / 7.1% conservative overestimate)

Phase 4: Full Integration (Partial)

  • Timing violation events via event buffer (completed)
  • Per-cycle timing reports (completed)
  • Integration with output VCD (completed via --timed)
  • Timing-aware bit packing for reduced approximation error (future)

Conservative Timing Model: Sources of Overestimation

Jacquard's GPU timing is intentionally conservative — it may over-estimate arrival times but will never under-estimate them. This is important for setup violation detection: false positives are safe, false negatives would miss real bugs.

There are three independent sources of conservatism, each adding to the overestimate:

Source 1: max(rise, fall) per cell

The GPU kernel tracks a single u16 arrival per thread position. It cannot distinguish between rising and falling signal transitions because each thread processes 32 packed Boolean signals simultaneously — there's no per-bit transition direction available.

How it works: For each cell, inject_timing_to_script() computes:

delay = max(gate_delays[pin].rise_ps, gate_delays[pin].fall_ps)

Impact: For the SKY130 inv_chain test (16 inverters), rise delays average ~10ps larger than fall delays. In a real inverter chain, transitions alternate (rise→fall→rise), so half the cells use the smaller fall delay. Jacquard uses the larger rise delay for all.

Measured: 80ps overestimate on 1235ps (6.5%) for 16 inverters with ~10ps rise/fall asymmetry per cell.
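The 80ps figure can be reproduced with simple arithmetic. The sketch below uses hypothetical uniform per-cell delays (60ps rise, 50ps fall), chosen only to give the ~10ps asymmetry quoted above; the real SKY130 values differ per cell, and the function names are illustrative.

```rust
/// Delay of an inverter chain when output transitions alternate,
/// starting with a rising output at the first inverter.
fn alternating_chain_ps(n: u32, rise_ps: u32, fall_ps: u32) -> u32 {
    (0..n).map(|i| if i % 2 == 0 { rise_ps } else { fall_ps }).sum()
}

/// Delay of the same chain when every cell is charged max(rise, fall).
fn conservative_chain_ps(n: u32, rise_ps: u32, fall_ps: u32) -> u32 {
    n * rise_ps.max(fall_ps)
}
```

For 16 inverters, the conservative model charges the larger delay 16 times while the alternating model charges it only 8 times, so the gap is 8 × 10ps = 80ps regardless of the absolute delay values.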

Source 2: max wire delay across all input pins

For multi-input cells (AND gates, MUXes), INTERCONNECT delays to different input pins may differ significantly. Jacquard takes the maximum across all input pins:

// wire_delays_per_cell: dest_cellid → max(all input wire delays)
entry.rise_ps = entry.rise_ps.max(ic.delay.rise_ps);
entry.fall_ps = entry.fall_ps.max(ic.delay.fall_ps);

Impact: If an AND gate has input A arriving via a 10ps wire and input B via a 200ps wire, Jacquard assigns 200ps to the cell regardless of which input is on the critical path. An event-driven simulator would correctly propagate the 10ps arrival on input A independently.

When this matters: Designs with highly asymmetric routing (e.g., one input is local, another crosses the chip). Well-routed designs typically have balanced wire delays to multi-input cells.

Source 3: max arrival across 32 packed signals per thread

Each thread position holds 32 independent Boolean signals. Jacquard tracks one arrival per thread position (the maximum across all 32 signals):

Thread 50: [signal_A: 5ps, signal_B: 23ps, signal_C: 8ps, ...]
Tracked:   arrival[50] = 23ps (max of all 32)

Impact: If signals with very different timing are packed into the same thread, the fastest signals inherit the slowest signal's arrival time.

Mitigation: The bit-packing algorithm can sort signals by estimated timing before assignment (see "Timing-Aware Bit Packing" section). This keeps similar-timing signals together, reducing the max approximation error.

Combined Effect

The contributions from these sources add together. For the inv_chain test:

| Source | Contribution | Notes |
|---|---|---|
| max(rise, fall) | +80ps | 8 inverters × 10ps asymmetry |
| max wire delay | +8ps | 8 wires × 1ps asymmetry |
| max per thread | 0ps | Only 1 signal per thread in this test |
| Total overestimate | 88ps / 7.1% | vs CVC transition-accurate result |

For larger designs with more routing asymmetry and denser bit packing, the combined overestimate could be larger. The bit-packing sort (Source 3) is the most actionable mitigation.

CVC Reference Validation

The inv_chain design (2 DFFs + 16 SKY130 inverters) was validated against CVC (open-src-cvc), an event-driven Verilog simulator with native SDF back-annotation:

CVC:  clk_to_q=350ps  chain=885ps  total=1235ps  (transition-accurate)
Jacquard: clk_to_q=350ps  chain=973ps  total=1323ps  (conservative max)
Difference: 88ps (7.1% overestimate)

Both simulators agree on CLK→Q delay (350ps) because the DFF has a single output transition direction per clock edge. The chain delay differs because CVC tracks actual rise/fall polarity through each inverter.

To run the CVC comparison locally:

bash tests/timing_test/cvc/run_cvc.sh

Requires Docker (builds CVC from source on first run).

Delay Data Encoding

Script Format

The existing boomerang section has padding that can store delay data:

Current format per thread per stage:
  [xora: u32]
  [xorb: u32]
  [orb:  u32]
  [padding: u32]  ← Can store delay here

PackedDelay Structure

#[repr(C)]
pub struct PackedDelay {
    pub rise_ps: u16,  // Rising edge delay in picoseconds
    pub fall_ps: u16,  // Falling edge delay in picoseconds
}

For simplified timing, a single uniform delay constant can be used instead of per-gate delays.
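One way such a structure could fit the 32-bit padding slot is sketched below. The half-word layout (rise in the low 16 bits, fall in the high 16 bits) is an assumption made for illustration and is not confirmed by the source; the methods `to_u32`, `from_u32`, and `max_ps` are hypothetical.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PackedDelay {
    pub rise_ps: u16,
    pub fall_ps: u16,
}

impl PackedDelay {
    /// Pack into one u32 slot.
    /// ASSUMPTION: rise in the low half, fall in the high half.
    pub fn to_u32(self) -> u32 {
        (self.rise_ps as u32) | ((self.fall_ps as u32) << 16)
    }

    /// Inverse of `to_u32`, under the same layout assumption.
    pub fn from_u32(word: u32) -> Self {
        PackedDelay { rise_ps: word as u16, fall_ps: (word >> 16) as u16 }
    }

    /// Single conservative delay, as in the max(rise, fall) model.
    pub fn max_ps(self) -> u16 {
        self.rise_ps.max(self.fall_ps)
    }
}
```

Both fields fit in one word because each delay is limited to the u16 range of 0 to 65535 ps.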

Timing Violation Detection

At Each Cycle Boundary

The GPU kernel checks timing constraints per state word (32 signals) after the boomerang evaluation completes. Arrivals and constraints use u16 picosecond values (range 0–65535 ps). Arithmetic is performed in u32 to avoid overflow when summing arrival + setup:

// After boomerang completes, before next cycle
// arrival: u16 max accumulated delay for this 32-signal group
// constraint_word: packed [setup_ps:16][hold_ps:16]
u16 setup_ps = constraint_word >> 16;
u16 hold_ps  = constraint_word & 0xFFFF;

// Setup check: skip when arrival == 0 (no data propagated, e.g. first cycle
// or DFF with constant inputs)
if (arrival > 0 && (u32)arrival + (u32)setup_ps > clock_period_ps) {
    int slack = (int)clock_period_ps - (int)arrival - (int)setup_ps;
    write_event(event_buffer, EVENT_TYPE_SETUP_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)setup_ps);
}

// Hold check: no arrival > 0 guard (hold violations matter even at cycle 0)
if ((u32)arrival < (u32)hold_ps) {
    int slack = (int)arrival - (int)hold_ps;
    write_event(event_buffer, EVENT_TYPE_HOLD_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)hold_ps);
}
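For checking reported numbers by hand, the same slack arithmetic can be written as a small CPU-side reference. This is a sketch that follows the kernel snippet above; the function names are hypothetical.

```rust
/// Unpack a constraint word laid out as [setup_ps:16][hold_ps:16].
fn unpack_constraint(constraint_word: u32) -> (u16, u16) {
    ((constraint_word >> 16) as u16, (constraint_word & 0xFFFF) as u16)
}

/// Setup slack in ps; None when the check is skipped (arrival == 0).
/// A negative value is a violation.
fn setup_slack(arrival: u16, setup_ps: u16, clock_period_ps: u32) -> Option<i64> {
    if arrival == 0 {
        return None;
    }
    Some(clock_period_ps as i64 - arrival as i64 - setup_ps as i64)
}

/// Hold slack in ps; evaluated even when arrival == 0.
/// A negative value is a violation.
fn hold_slack(arrival: u16, hold_ps: u16) -> i64 {
    arrival as i64 - hold_ps as i64
}
```

With a 1000ps clock, an arrival of 900ps and a 200ps setup constraint give a slack of -100ps; an arrival of 10ps against a 50ps hold constraint gives -40ps.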

Event Buffer Integration

pub enum EventType {
    Stop = 0,
    Finish = 1,
    Display = 2,
    AssertFail = 3,
    SetupViolation = 4,   // Timing events
    HoldViolation = 5,
}

For full details on interpreting violation reports and tracing violations to source signals, see docs/timing-violations.md.

Timing-Aware Bit Packing

The Problem

Each thread position holds 32 signals packed into a u32. When tracking timing with one arrival value per thread position, we approximate all 32 signals as having the same arrival time (the maximum).

This approximation is accurate when signals in the same thread have similar timing. But the default placement algorithm uses first-fit for bit assignment:

// Default: first available slot
for i in 0..hier[selected_level].len() {
    if hier[selected_level][i] == usize::MAX {
        slot_at_level = i;  // First-fit, not timing-aware
        break;
    }
}

This can result in signals with very different timing sharing a thread:

Thread 50 (accidental grouping):
  bit 0: level 5,  ~5ps arrival
  bit 1: level 12, ~12ps arrival  ← 7ps difference!
  bit 2: level 6,  ~6ps arrival

Thread 50 (timing-aware grouping):
  bit 0: level 5, ~5ps arrival
  bit 1: level 5, ~5ps arrival    ← similar timing
  bit 2: level 6, ~6ps arrival

Current Timing Correlation

The placement algorithm already computes logic levels:

// Level = max(level of inputs) + 1
level[node] = max(level[input_a], level[input_b]) + 1;

Logic level correlates with timing (more levels = more gate delays), but signals at the same level can still have different actual delays due to:

  • Different gate types (AND2_00_0 vs AND2_11_1)
  • Different wire loads
  • Path reconvergence

Solution: Sort by Timing Before Packing

Before assigning bit positions, sort signals by their estimated arrival time:

// Collect nodes at this level
let mut nodes_to_place: Vec<_> = candidates
    .filter(|n| level[n] == selected_level)
    .collect();

// Sort by arrival time (level as proxy, or actual timing if available)
nodes_to_place.sort_by_key(|n| arrival_estimate[n]);

// Place in sorted order - similar timing ends up in same thread
for (slot, node) in nodes_to_place.iter().enumerate() {
    place_bit(..., slot, *node);
}
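The effect of sorting can be demonstrated in isolation. The sketch below is a simplified model, not the placement code: it assumes signals fill threads 32 at a time in list order, and both function names are hypothetical.

```rust
/// Largest (max - min) arrival spread within any 32-signal thread,
/// when signals are packed into threads in the given order.
fn worst_spread(arrivals_ps: &[u32]) -> u32 {
    arrivals_ps
        .chunks(32)
        // chunks() never yields an empty slice, so max/min always exist.
        .map(|t| t.iter().max().unwrap() - t.iter().min().unwrap())
        .max()
        .unwrap_or(0)
}

/// Same measurement after sorting signals by arrival estimate first.
fn worst_spread_sorted(arrivals_ps: &[u32]) -> u32 {
    let mut sorted = arrivals_ps.to_vec();
    sorted.sort_unstable();
    worst_spread(&sorted)
}
```

With 64 signals that alternate between 5ps and 12ps, unsorted packing puts both values in every thread (7ps spread), while sorted packing yields one thread of 5ps signals and one of 12ps signals (0ps spread).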

Alternative Approaches

| Approach | Complexity | Effectiveness | When to Use |
|---|---|---|---|
| Sort by timing | Low | Good | Default choice |
| Timing-aware partitioning | High | Best | Large designs |
| Post-placement swapping | Medium | Good | Fine-tuning |
| Timing bands | Low | Moderate | Simple heuristic |

Timing Bands

Group signals into arrival time bands:

Band 0: 0-10ps   → Threads 0-63
Band 1: 10-20ps  → Threads 64-127
Band 2: 20-30ps  → Threads 128-191
Band 3: 30+ps    → Threads 192-255
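A band lookup along these lines might look like the sketch below. The band boundaries in the listing overlap at 10ps, 20ps, and 30ps, so this sketch assumes half-open bands (exactly 10ps belongs to band 1); the function names are hypothetical.

```rust
use std::ops::RangeInclusive;

/// Band index for an arrival estimate.
/// ASSUMPTION: bands are half-open, and everything from 30ps up is band 3.
fn band(arrival_ps: u32) -> u32 {
    (arrival_ps / 10).min(3)
}

/// Thread positions reserved for a band (64 threads each).
fn band_threads(band: u32) -> RangeInclusive<u32> {
    assert!(band < 4);
    band * 64..=band * 64 + 63
}
```

A signal with a 25ps estimate would land in band 2 and be placed somewhere in threads 128 to 191.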

Measuring Packing Quality

Diagnostic to measure timing variance per thread:

fn analyze_timing_packing(hier: &Hierarchy, arrivals: &[u64]) {
    for thread in 0..256 {
        let times: Vec<u64> = get_bits_in_thread(hier, thread)
            .map(|b| arrivals[b])
            .collect();

        // max()/min() return Option; skip threads with no placed signals
        let (Some(&max), Some(&min)) = (times.iter().max(), times.iter().min()) else {
            continue;
        };
        let range = max - min;
        let variance = compute_variance(&times);

        if range > threshold {
            warn!("Thread {} has {}ps timing spread", thread, range);
        }
    }
}

Impact on Approximation Accuracy

With timing-aware packing:

  • Reduced false positives: Fewer spurious timing violations from max approximation
  • Tighter bounds: Per-thread arrival closer to actual signal arrivals
  • Better critical path identification: Max arrival more accurately reflects true critical path

Performance Expectations

| Metric | Zero-Delay | With Timing |
|---|---|---|
| Kernel launches | N | N |
| Shared memory | 3KB | 3.25KB |
| Registers | ~32 | ~36 |
| Instructions/gate | ~5 | ~9 |
| Estimated overhead | - | 15-25% |

The overhead is modest because:

  1. Timing operations are simple (max, add)
  2. Memory access pattern is identical
  3. No additional synchronization needed
  4. Same parallelism structure
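The shared memory figures in the table can be checked against the array declarations shown in the Memory Layout section. The sketch below does that arithmetic; the function names are hypothetical, and the element size of the arrival array is left as a parameter because this document describes both a u8 sketch and u16 kernels.

```rust
/// Shared memory for the three 256-entry u32 arrays
/// (shared_metadata, shared_writeouts, shared_state).
fn base_shared_bytes() -> usize {
    3 * 256 * std::mem::size_of::<u32>()
}

/// Total shared memory once one arrival value per thread position is added.
fn shared_bytes_with_arrival(arrival_elem_bytes: usize) -> usize {
    base_shared_bytes() + 256 * arrival_elem_bytes
}
```

The base is 3072 bytes (3KB). One-byte arrivals bring the total to 3328 bytes (3.25KB), as in the table; two-byte arrivals would bring it to 3584 bytes (3.5KB).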

References

  • src/pe.rs - Partition executor and boomerang stage construction
  • csrc/kernel_v1_impl.cuh - GPU kernel implementation
  • src/flatten.rs - Script generation with timing data
  • src/event_buffer.rs - GPU→CPU event communication
  • src/liberty_parser.rs - Timing library parsing

Timing Violation Detection

See also: timing-correctness.md — forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.

Guide to enabling, reading, and debugging setup/hold timing violations in GEM.

Overview

Setup and hold violations occur when data arrives too late (setup) or too early (hold) relative to the clock edge at a flip-flop. GEM checks for these violations during GPU simulation by tracking arrival times — the accumulated gate delay from primary inputs or DFF outputs through combinational logic to the next DFF data input.

Approximation model: GEM tracks one arrival time per 32-signal group (one GPU thread position). The arrival is the maximum across all 32 signals in the group. This is conservative: it may over-report violations but will never miss a real one. See Reducing False Positives for details.

Enabling Timing Checks

Prerequisites

  1. SDF file with back-annotated delays from your place-and-route tool
  2. Gate-level netlist synthesized to aigpdk.lib cells

Step-by-step

  1. Generate SDF from your P&R tool (or use scripts/generate_sdf.py for test designs):

    # Example: OpenROAD flow output
    ls my_build/6_final.sdf
    
  2. Run the simulator with --sdf and a clock period:

    Metal (macOS):

    cargo run -r --features metal --bin jacquard -- sim \
        design.gv input.vcd output.vcd 1 \
        --sdf design.sdf \
        --sdf-corner typ
    

    CUDA (NVIDIA):

    cargo run -r --features cuda --bin jacquard -- sim \
        design.gv input.vcd output.vcd 8 \
        --sdf design.sdf \
        --sdf-corner typ \
        --enable-timing \
        --timing-clock-period 1200
    

    cosim (co-simulation):

    cargo run -r --features metal --bin jacquard -- cosim \
        design.gv \
        --config testbench.json \
        --sdf design.sdf \
        --sdf-corner typ
    

CLI Flags Reference

| Flag | Binary | Description |
|---|---|---|
| --sdf <path> | all | Path to SDF file with back-annotated delays |
| --sdf-corner <min\|typ\|max> | all | Which SDF corner to use (default: typ) |
| --sdf-debug | all | Print unmatched SDF instances for debugging |
| --enable-timing | jacquard sim | Enable timing analysis (arrival + violation checks) |
| --timing-clock-period <ps> | jacquard sim | Clock period in picoseconds (default: 1000) |
| --timing-report-violations | jacquard sim | Report all violations, not just summary |
| --timing-report <path.json> | jacquard sim | Write a structured end-of-run JSON report (schema in src/timing_report.rs, ADR 0008). |
| --timing-summary | jacquard sim | Print a human-readable text summary at end of run. Independent of --timing-report; both can be combined. |
| --timing-report-max-violations <N> | jacquard sim | Cap on the per-cycle violations list in --timing-report. Default 100k. 0 = unbounded. Totals + worst-slack always reflect every event. |
| --liberty <path> | jacquard sim | Liberty library for timing data (optional, falls back to AIGPDK defaults) |

Example: inv_chain_pnr Test Case

# Run with SDF timing
cargo run -r --features metal --bin jacquard -- sim \
    tests/timing_test/inv_chain_pnr/6_final.v \
    tests/timing_test/inv_chain_pnr/input.vcd \
    tests/timing_test/inv_chain_pnr/output.vcd 1 \
    --sdf tests/timing_test/inv_chain_pnr/6_final.sdf

Reading Violation Reports

Setup Violation Format

[cycle 42] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=5]: arrival=900ps setup=200ps slack=-100ps

(WS-P1.1.a, 2026-05-02: state-word indices are now resolved to symbolic hierarchical signal names. The bare [word=N] suffix is preserved for grep compatibility. Words packing more than 4 DFFs truncate with a +N more suffix.)

| Field | Meaning |
|---|---|
| cycle | Simulation cycle where the violation occurred |
| word | State word index: identifies a group of 32 DFF data inputs |
| arrival | Maximum accumulated gate delay to this word's signals (picoseconds) |
| setup | DFF setup time constraint from SDF/Liberty (picoseconds) |
| slack | clock_period - arrival - setup. Negative = violation amount |

Hold Violation Format

[cycle 11] HOLD VIOLATION at top/cpu/state[bit 3] [word=3]: arrival=10ps hold=50ps slack=-40ps

| Field | Meaning |
|---|---|
| cycle | Simulation cycle where the violation occurred |
| word | State word index |
| arrival | Accumulated gate delay to this word's signals (picoseconds) |
| hold | DFF hold time constraint from SDF/Liberty (picoseconds) |
| slack | arrival - hold. Negative = violation amount |

Summary Statistics

At the end of simulation, GEM prints totals:

Simulation complete: 1000 cycles, 5 setup violations, 0 hold violations

Text Summary (--timing-summary)

A one-screen human summary printed to stdout at end of run. Reuses the same data the JSON report builds (so --timing-report and --timing-summary cost the same; only the output channel differs). Sample output:

=== Jacquard Timing Summary ===
Design:        my_cpu.gv
Vectors:       boot.vcd (1000 cycles)
Clock period:  1000 ps
Timing source: my_cpu.jtir

Violations:
  Setup: 5
  Hold:  2
  Total: 7

Worst slack:
  Setup: -150ps  at top/cpu/regs[7][bit 22] [word=5]  (cycle 87)
  Hold:   -40ps  at top/cpu/state[bit 3] [word=12]  (cycle 91)

Top 2 by violation count (of 2 total words with violations):
  top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
  top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps

The format is for human inspection — explicitly not a stable parseable contract. Tools that need to script against the data should use --timing-report JSON.

Structured JSON Report (--timing-report <path.json>)

For CI integration and downstream tooling, pass --timing-report <path> to get an end-of-run JSON document. The schema is versioned (ADR 0008's stability contract: additive-only extensions, breaking changes bump the major). Sample at tests/timing_ir/sample_reports/two_violations.json; authoritative type definitions in src/timing_report.rs.

Top-level shape:

{
  "schema_version": "1.0.0",
  "metadata": { "design": "...", "cycles_run": 1000, "clock_period_ps": 1000, "...": "..." },
  "stats": { "setup_violations": 5, "hold_violations": 0, "events_dropped": 0 },
  "violations": [
    { "cycle": 42, "kind": "setup", "word_id": 5, "site": "top/cpu/regs[7][bit 22] [word=5]",
      "arrival_ps": 900, "constraint_ps": 200, "slack_ps": -100 }
  ],
  "per_word": [
    { "word_id": 5, "site": "...", "setup_violations": 5, "hold_violations": 0,
      "worst_setup_slack_ps": -100, "worst_hold_slack_ps": null, "worst_arrival_ps": 900 }
  ],
  "worst_slack": {
    "setup": [ /* top-N most-negative slacks across the run */ ],
    "hold":  [ /* same shape */ ]
  }
}

per_word is sorted by total violation count (descending), then by word_id. worst_slack.setup and worst_slack.hold each list the 10 worst slacks, most negative first. Caveats:

  • The "even when no violation occurred" half of WS-P1.1.d (per-DFF closest-to-violation tracking when the design never tripped a violation) needs GPU-side near-miss instrumentation and is not in v1.0.0; for now, worst_slack is populated only from actual violation events.
  • --timing-report only produces output today on the Metal sim path. The CUDA / HIP / cosim paths do not currently route runtime violations through process_events — bringing them in is independent plumbing.
  • The violations array is capped at 100,000 records by default (~8 MB JSON). Override or disable the cap with --timing-report-max-violations <N> (0 = unbounded). Setup/hold totals, events_dropped, and worst_slack rankings always reflect every observed event; only the per-cycle list is bounded. stats.violations_truncated reports how many records were dropped because the cap was reached.
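
The per_word ordering described before the caveats can be sketched as a two-key sort. The `PerWord` struct here is a trimmed stand-in for illustration (the authoritative types live in src/timing_report.rs), and the ascending word_id tie-break is an assumption, since the text only says "then by word_id".

```rust
/// Trimmed, hypothetical stand-in for one per_word entry.
#[derive(Debug, Clone, PartialEq)]
pub struct PerWord {
    pub word_id: u32,
    pub setup_violations: u64,
    pub hold_violations: u64,
}

/// Sort by total violation count, largest first, then by word_id
/// (ascending tie-break is an assumption).
pub fn sort_per_word(rows: &mut [PerWord]) {
    rows.sort_by(|a, b| {
        let total_a = a.setup_violations + a.hold_violations;
        let total_b = b.setup_violations + b.hold_violations;
        total_b.cmp(&total_a).then(a.word_id.cmp(&b.word_id))
    });
}
```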

Tracing Violations to Source Signals

When you see a violation on a specific word, follow this workflow to identify the offending signals and their logic cone.

1. Get the Word Index

From the log line: the [word=5] tag means state word index 5.

2. Map Word to DFF Signals

Each word covers 32 bits of state. The DFFs in that word have data_state_pos / 32 == word_index. To find which DFFs:

  • Look at the dff_constraints entries in the FlattenedScriptV1:

    dff_constraints entries where data_state_pos / 32 == 5
    → cell_id values → netlist cell names
    
  • In gpu_sim, violations are logged with word IDs that map directly to the output_map positions. Each word covers bit positions word * 32 through word * 32 + 31.
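
As a rough sketch of the word-to-DFF lookup, the filter below selects the entries whose data_state_pos falls in a given word. `DffEntry` is a hypothetical two-field stand-in for the real dff_constraints entries, and both helper names are assumptions.

```rust
/// Hypothetical stand-in carrying only the two fields the lookup needs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DffEntry {
    pub data_state_pos: u32,
    pub cell_id: u32,
}

/// cell_id values of all DFFs whose state bit lives in `word_index`,
/// i.e. data_state_pos / 32 == word_index.
pub fn cells_in_word(entries: &[DffEntry], word_index: u32) -> Vec<u32> {
    entries
        .iter()
        .filter(|e| e.data_state_pos / 32 == word_index)
        .map(|e| e.cell_id)
        .collect()
}

/// Inclusive bit-position range covered by a word: word * 32 ..= word * 32 + 31.
pub fn word_bit_range(word_index: u32) -> (u32, u32) {
    (word_index * 32, word_index * 32 + 31)
}
```

For word 5 the range is bit positions 160 through 191; the resulting cell_id values still need to be mapped to netlist cell names.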

3. Trace Backwards with netlist_graph

Use the netlist_graph tool to trace the combinational logic cone feeding the DFF. After uv sync --group dev, the netlist-graph console script is on the workspace's uv run path — no cd required:

# Find the DFF data input driver chain
uv run netlist-graph drivers design.v "dff_name.D" -d 10

# Search for DFFs matching a pattern
uv run netlist-graph search design.v "dff_out*"

Discovered signal names can be passed directly into jacquard sim --trace-signals <file> / jacquard cosim --trace-signals <file> (one name per line) to surface them in the output VCD alongside top-level IO.

4. Detailed Timing Analysis with CVC

For per-signal accuracy (no 32-signal approximation), use CVC (open-src-cvc) with SDF back-annotation:

# Run CVC with SDF timing
cvc64 +typdelays tb.v design.v
./cvcsim

CVC provides event-driven simulation with full SDF support (IOPATH + INTERCONNECT delays), allowing you to pinpoint exactly which path is critical.

The Approximation Caveat

Jacquard tracks one arrival time per 32-signal group (one GPU thread position). The tracked value is the maximum arrival across all 32 signals in that thread. This means:

  • Conservative: If any signal in the group has a long path, the arrival for the entire group reflects that worst case. Violations may be reported for signals that individually meet timing.
  • Never misses real violations: A real violation always results in a reported violation (the max is >= any individual signal's arrival).
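
A small illustration of why the per-group maximum is conservative for setup checks. This is a CPU-side sketch with assumed names and made-up numbers, not the kernel code.

```rust
/// One tracked arrival per 32-signal group: the maximum over the group.
/// (Hypothetical helper; the real reduction happens in the GPU kernel.)
pub fn group_arrival_ps(arrivals: &[u16; 32]) -> u16 {
    arrivals.iter().copied().max().unwrap_or(0)
}

/// Setup check using slack = clock_period - arrival - setup.
pub fn setup_violated(clock_period_ps: i32, arrival_ps: u16, setup_ps: u16) -> bool {
    clock_period_ps - i32::from(arrival_ps) - i32::from(setup_ps) < 0
}
```

If 31 signals arrive at 300 ps and one arrives at 950 ps, the group arrival is 950 ps. With a 1000 ps period and a 200 ps setup time, the group check fires, even though a 300 ps signal checked on its own would pass.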

Reducing False Positives

If a violation is reported but you suspect it's a false positive from the approximation:

  1. Use CVC for per-signal accuracy (see Detailed Timing Analysis with CVC above).
  2. Timing-aware bit packing groups signals with similar arrival times into the same thread, reducing the approximation error. See docs/timing-simulation.md § "Timing-Aware Bit Packing" for details.

Common Scenarios

Setup violations on many words, same cycle: The clock period is likely too tight for the design. The combinational logic depth exceeds what can settle in one clock period. Try increasing the clock period.

Setup violation on a single word: A critical path through one specific logic cone. Use netlist_graph drivers to trace the path and identify the bottleneck.

Hold violation: Rare with the sky130 process (negative hold times clamp to 0 in the SDF). If seen, the design likely has minimum-delay paths that are too short. Check for direct connections between DFF outputs and nearby DFF inputs with minimal combinational logic.

Violations only on first cycle: The arrival > 0 guard in the GPU kernel skips setup checks when arrival is zero (meaning no data has propagated through combinational logic yet). If you see violations on cycle 0, they are hold violations — setup violations on cycle 0 are suppressed by design.

Timing-model extensions — design notes

Status: Idea / pre-spike. Not scheduled. Captured here so the architecture sketch survives the next session-clear.

Scope: Three related extensions to Jacquard's timing model, all aimed at making setup/hold reporting more honest without abandoning the cycle-accurate boomerang kernel.

  1. Dynamic delay — per-gate δ(T) inspired by the Involution Delay Model (Maier 2021, arXiv:2107.06814). Captures pulse-width-dependent delay degradation that fixed δ∞ misses on near-threshold paths.
  2. Clock-tree skew — per-DFF clock arrival accounting. Today every DFF on a clock is treated as if it captures simultaneously; SDF clock-buffer arcs and clock-net interconnect are silently dropped during AIG construction.
  3. Wire delay at scale — per-receiver interconnect delay applied to the right edge in the AIG, and explicit modelling of inter-partition wires. Today wire delay is collapsed to a max-per-destination-cell scalar — fine for sky130 short routes, increasingly wrong as we move to faster clocks, finer processes, and large many-core/NoC designs.

All three share the same insight: the data the model needs is already in the TimingIR. The work is at the consumer layer (flatten.rs, aig.rs, the kernel arrival math), not the IR or the partitioner.


Background — what the timing pipeline does today

.sdf  ─┬─► opensta-to-ir ──► TimingIR (.jtir, FlatBuffers)
.jtir ─┘                          │
                                  ▼
              flatten.rs::load_timing_from_ir   (per-cell arc → AIG-pin delay)
                                  │
                                  ▼
                       gate_delays: Vec<PackedDelay>     (rise/fall ps per AIG pin)
                       dff_constraints: Vec<DFFConstraint>  (setup/hold ps per DFF)
                                  │
                                  ▼
              flatten.rs::inject_timing_to_script   (bake max ps into u16 script slot)
                                  │
                                  ▼
                       kernel_v1.metal at runtime:
                       per-AND:    new_arr = max(arr_a, arr_b) + gate_delay
                       per-DFF:    check arrival vs setup/hold per word

Reference points:

  • IR schema: crates/timing-ir/schemas/timing_ir.fbs
  • IR consumer: src/flatten.rs:1768 (load_timing_from_ir), src/flatten.rs:1686 (inject_timing_to_script)
  • Setup/hold buffer: src/flatten.rs:1732 (build_timing_constraint_buffer)
  • GPU arrival math: csrc/kernel_v1.metal:220-255 (AND gates), csrc/kernel_v1.metal:547-580 (setup/hold)

Per-AIG-pin arrival is a single ushort accumulated by max through the boomerang reduction. There is no event scheduling — arrival is a scalar that rides alongside the Boolean evaluation in lockstep with cycle ticks.


Part A — Dynamic delay (IDM-style δ(T))

What IDM is, briefly

A per-gate dynamic delay model that makes δ a function of T (time since the gate's last output transition). The distinguishing property: input pulses with Δᵢ → 0 have diminishing effect on the output. The model handles pulse-width degradation faithfully and is the only model proven to solve the short-pulse-filtration problem. The paper notes ~80–590% CPU overhead vs. inertial delay on a CPU event-driven simulator.

The architectural wall

True IDM needs event scheduling and intra-cycle pulse observability — neither is available in Jacquard's lockstep cycle-accurate kernel. We cannot model glitch suppression or metastability oscillation traces without either sub-cycle ticks or a different kernel architecture.

What we can do is enrich the per-gate delay used in arrival propagation so setup/hold reporting reflects realistic pulse degradation on marginal paths.

Five hook points

| Hook | File | Today | With δ(T) |
| --- | --- | --- | --- |
| A Schema | crates/timing-ir/schemas/timing_ir.fbs | rise/fall per arc | + per-cell-type DynamicDelayParams (exp-channel params or piecewise-linear LUT) |
| B IR load | src/flatten.rs:1768 | one PackedDelay per AIG pin | + parallel gate_dyn_delays keyed by originating cell-type via aigpin_cell_origins |
| C Bake | src/flatten.rs:1686 | one u16 ps per thread slot | static-IDM: bake worst-case δ(T) into same slot. dynamic-IDM: reserve second u32 |
| D Kernel arrival | csrc/kernel_v1.metal:220-255 | max(arr_a, arr_b) + gate_delay | + eval_idm(dyn_params, T, edge) via small LUT |
| E Setup/hold | csrc/kernel_v1.metal:547-580 | unchanged math, dumber inputs | unchanged math, smarter inputs |

For dynamic-IDM the kernel needs two new persistent buffers:

  • last_transition_ps[aig_pin] — when the gate's output last switched (absolute ps).
  • last_value[aig_pin] — to detect transitions across cycles.

Memory cost ~4 bytes per AIG pin per partition. For NVDLA-scale designs (~hundreds of thousands of pins) this is MB-scale — fine.

eval_idm on GPU

The paper uses exp/log per gate. On GPU replace with a 16-entry LUT indexed by quantised T. Cheap, branch-free, smooth enough.
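
A CPU-side sketch of what a LUT-based eval_idm could look like. The power-of-two bucket width (`bucket_shift`), the clamping of large T to the last entry, and the function name are all assumptions; nothing here is an existing kernel interface.

```rust
/// Number of LUT entries, per the 16-entry proposal.
pub const LUT_ENTRIES: usize = 16;

/// Look up a delay (ps) for time-since-last-transition `t_ps`.
/// T is quantised by a right shift (assumed bucket width of
/// 2^bucket_shift ps) and clamped so large T reuses the last entry.
pub fn eval_idm_lut(lut: &[u16; LUT_ENTRIES], t_ps: u32, bucket_shift: u32) -> u16 {
    // checked_shr avoids a shift-overflow panic if bucket_shift >= 32.
    let bucket = t_ps.checked_shr(bucket_shift).unwrap_or(0) as usize;
    lut[bucket.min(LUT_ENTRIES - 1)]
}
```

The table contents would come from the per-cell characterisation; the clamp means every T beyond the last bucket maps to the final entry.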

Characterisation

The δ(T) parameters have to come from per-cell SPICE characterisation. For sky130 we'd characterise each sky130_fd_sc_hd__*_* cell once, check the result into the repo, and ship it as a sidecar table consumed by the IR builder. This is the expensive one-off — the paper flags characterisation cost as the unsolved part of making IDM "truly competitive."

Staged plan

| Stage | What | Touches | Kernel | Effort | Win |
| --- | --- | --- | --- | --- | --- |
| 1 Static IDM | Bake worst-case δ(T) into existing u16 slot using STA pulse-width estimates | A, B, C | None | 1–2 days | Better setup/hold on marginal paths |
| 2 Dynamic δ(T) | Add last_transition_ps buffer + LUT eval | All | Lines 220–255 | 1–2 weeks | Pulse-degradation-aware arrivals end-to-end |
| 3 Sub-cycle ticks | Multiple arrival propagations per logical cycle | Whole kernel | Major | Months | True IDM glitch behaviour. Probably not worth it for Jacquard's positioning. |

Stage 1 is a 1–2 day spike with no kernel risk. Stage 2 is the honest implementation. Stage 3 is a different simulator.

What we get / don't get from dynamic δ(T)

Achievable

  • Per-corner δ(T) propagating through arrival → setup/hold reports that distinguish "just meets timing under δ∞" from "fails under realistic pulse degradation".
  • Stays inside cycle-accurate boomerang. ~1.5–2× memory growth on arrival data, ~10–20% kernel slowdown (estimate).

Not achievable

  • Glitch suppression (Δᵢ → 0 → no transition).
  • Metastable oscillation traces.
  • Combinational-loop behaviours (loops are forbidden in the AIG anyway).

Why sky130 is the right vehicle

sky130_pdk.rs decomposes vendor functional Verilog into AIG nodes while preserving cell identity through aigpin_cell_origins. We can attach δ(T) at the original sky130 cell granularity even after AIG flattening — that structural property is what makes any of this tractable. Cells from a hand-coded library without origin tracking would be much harder.


Part B — Clock-tree skew

Status: Stages 1 + 2 implemented (2026-05-01). Per-DFF clock arrival is carried through the IR (ClockArrival table) and folded into per-DFF setup/hold via DFFConstraint::effective_setup_hold before the per-word collapse. Producer landed in c403cc8; consumer fold-in in 6767c3e. The narrative below describes the original motivation; the Staged plan at the end of this part records what shipped and what remains (Stage 3, conditional).

Where the information is — and where we drop it

Clocks in Jacquard are walked back from each DFF through buffers/inverters/clock-gates, terminating at an InputClockFlag(pinid, is_negedge) (src/aig.rs:441, :477, :495-560). Recognised cells: INV/BUF/CKLNQD and the sky130 equivalents inv*, clkinv*, buf*, clkbuf*, clkdlybuf*, lpflow_*.

Two consequences:

  1. Clock-tree cells produce no AIG pin. They collapse into a polarity flag on the DFF. Since aigpin_cell_origins only lists cells that produced AIG pins, the timing-IR arcs on those cells (IOPATH records on clkbuf_8, etc.) match no AIG pin in load_timing_from_ir and are silently discarded.
  2. Clock-net interconnect is dropped the same way. interconnect_delays records keyed by net endpoints have no destination cell to attach to, so they fall on the floor.

Net effect: every DFF on a given clock domain is treated as having identical clock arrival, i.e. perfect skew. The current setup/hold check is honest about combinational-path delay but blind to clock-tree topology.

For a sky130 MCU SoC at ~25 ns clock period this is fine functionally; for any timing claim near the period boundary it's misleading. Intra-domain clock-tree skew on sky130 is typically O(50–200 ps) — small relative to a 25 ns period, but exactly the order of magnitude that determines whether a path "barely meets" or "barely fails" setup.

Do we have the information?

Yes, in three places, in increasing fidelity:

  1. TimingIR arcs on clock cells (.jtir already contains them; we just don't consume them).

  2. The AIG clock walk in aig.rs:495–560 already iterates the clock-side cells of each DFF in order. It just doesn't accumulate their delays. Adding a dff_clock_origins: Vec<Vec<cellid>> parallel structure costs O(num_dffs × clock_depth) memory — negligible.

  3. OpenSTA can compute per-DFF clock arrival end-to-end. (OpenTimer was the original primary STA candidate per ADR 0003 but the spike Superseded it; ADR 0001 makes OpenSTA the sole STA path, called out of process via opensta-to-ir.) Per-pair common-path-pessimism removal (CRPR) is fundamentally a launch/capture credit, not a per-DFF property — so what shipped is per-DFF capture-side arrival, treating launch as a 0-reference. This is the form in the IR today:

    table ClockArrival {
        cell_instance: string;     // DFF instance path
        clk_pin: string;           // local pin name
        arrival: [TimingValue];    // per-corner clock arrival ps
        provenance: Provenance;
    }
    

    Populated by opensta-to-ir's Tcl driver via [all_registers -clock_pins] + [::sta::vertex_worst_arrival_path]. Consumer code never touches the netlist — it just looks up each DFF's clock arrival.

Consumer change (shipped)

DFFConstraint carries the field now:

pub struct DFFConstraint {
    pub setup_ps: u16,
    pub hold_ps: u16,
    pub clock_arrival_ps: i16,   // signed; capture-side arrival, launch ref = 0
    pub data_state_pos: u32,
    pub cell_id: u32,
}

The setup/hold formula for per-pair skew is:

  • Setup margin = (clock_period + clock_arr_capture - clock_arr_launch) - data_arrival - setup
  • Hold margin = data_arrival - (clock_arr_capture - clock_arr_launch + hold)

Per-launch/per-capture pairing is awkward in the current per-word-collapsed constraint buffer, so the implementation folds the capture-side clock arrival into the per-DFF effective setup/hold before packing, via DFFConstraint::effective_setup_hold:

  • effective_setup = setup - clock_arrival_capture (clamped to [0, u16::MAX])
  • effective_hold = hold + clock_arrival_capture (clamped to [0, u16::MAX])

The GPU kernel runs unchanged — the same packed (setup<<16)|hold word it already consumes now carries skew-aware values. Launch arrival is treated as zero (ref) — pessimistic for paths whose launch DFF also has a long clock path, but a clean first cut. Stage 3 below addresses that pessimism if measurement justifies it.
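
The fold-in and packing described above can be sketched as follows. This is a free-function approximation written from the two bullet formulas; the real DFFConstraint::effective_setup_hold signature is not shown here, so the parameter list and the `pack_setup_hold` helper name are assumptions.

```rust
/// Fold capture-side clock arrival into setup/hold, clamping each to [0, u16::MAX].
///   effective_setup = setup - clock_arrival_capture
///   effective_hold  = hold  + clock_arrival_capture
pub fn effective_setup_hold(setup_ps: u16, hold_ps: u16, clock_arrival_ps: i16) -> (u16, u16) {
    // Work in i32 so neither the subtraction nor the addition can overflow.
    let clamp = |v: i32| v.clamp(0, i32::from(u16::MAX)) as u16;
    let setup = clamp(i32::from(setup_ps) - i32::from(clock_arrival_ps));
    let hold = clamp(i32::from(hold_ps) + i32::from(clock_arrival_ps));
    (setup, hold)
}

/// Pack into the (setup << 16) | hold word layout the kernel consumes.
pub fn pack_setup_hold(setup_ps: u16, hold_ps: u16) -> u32 {
    (u32::from(setup_ps) << 16) | u32::from(hold_ps)
}
```

For example, setup = 200 ps, hold = 50 ps and a capture clock arrival of 120 ps give an effective setup of 80 ps and an effective hold of 170 ps. A negative arrival of -120 ps gives 320 ps and a hold clamped to 0.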

Partitioning question

"could we partition a design effectively to do this somewhat accurately without sacrificing too much?"

Today partitioning (src/repcut.rs) is hypergraph-cut on logic connectivity. DFFs co-located by logic affinity may have very different clock arrivals.

The pessimism cost: build_timing_constraint_buffer collapses all DFFs in a 32-bit state word to min(setup) and min(hold). If a word holds DFFs with clock arrival 50 ps and 200 ps, the per-word effective setup is the worst of both — i.e. we report timing as if every DFF in that word saw the worst skew in the word. That's a 150 ps pessimism for the lucky DFF.

Three options, ranked:

  1. Do nothing. For typical sky130 SoCs at ≥10 ns clock periods, intra-word skew (≤200 ps worst-case) vs. period (10 000+ ps) is ≤2%. Worth-it threshold for the optimisation: when designs run close enough to the period that 2% pessimism flips genuine passes into reported violations. Likely never for sky130. Plausibly relevant for designs running at ≥1 GHz on a more aggressive PDK.

  2. Skew-bucket the DFF constraint packing, not the partitioning. Group DFFs into clock-arrival buckets after partitioning, and emit one constraint word per bucket-within-partition rather than collapsing everything in the word. Increases constraint-buffer size by O(num_buckets) but doesn't disturb the partitioner. Probably the right answer if we ever need to.

  3. Skew-aware partitioning. Add a soft objective to repcut.rs that prefers grouping DFFs by clock arrival. Degrades cut quality (more inter-partition logic edges → more state shuffling). Almost certainly worse than option 2 for the same accuracy gain.

So: yes we have the info, no we probably don't need to repartition, and the constraint-collapsing pessimism is the real lever — either accept it (option 1) or break it bucket-wise (option 2).

Staged plan for clock tree

| Stage | What | Touches | Kernel | Status |
| --- | --- | --- | --- | --- |
| 1 Capture clock-tree delay | Add ClockArrival IR table; populate from opensta-to-ir | IR schema, opensta-to-ir/builder + Tcl | None | Shipped (c403cc8) |
| 2 Apply to setup/hold | Fold capture-side arrival into DFFConstraint; existing kernel check now skew-aware | src/flatten.rs DFFConstraint, effective_setup_hold, build_timing_constraint_buffer | None | Shipped (6767c3e) |
| 3 (conditional) Bucketed packing | Per-bucket constraint words to remove the per-word min(setup, hold) collapse pessimism; kernel reads the right bucket per DFF | src/flatten.rs:1722-1761, kernel constraint indexing | Minor | Open; land only if measurement shows the per-word collapse materially over-reports violations |

Part C — Wire delay at scale

Why this gets more important as designs grow

In sky130 at 25 ns clock periods, wire delay is a small perturbation on gate delay and the lumped model is fine. The picture changes in three regimes:

  • Faster clocks. Wire delay is a fixed physical quantity (RC-dominated); period shrinks; wire fraction of the budget grows.
  • Finer processes (e.g. 22nm and below). Gate delays scale down with feature size; wire RC scales unfavourably (resistance per square goes up, capacitance per length stays roughly flat). The classic "reverse scaling" inflection: gates get faster, long wires don't. Typical 22nm: inverter delay 5–15 ps, local short wires 5–20 ps, global routes 50–500 ps, multi-mm wires 1+ ns without repeaters.
  • Large many-core/NoC SoCs. Inter-tile mesh links can span multiple millimetres; chip-level signals have wire delays comparable to or larger than entire combinational stages.

For a many-small-core NoC at 22nm, wire delay on inter-core links is typically the dominant timing factor. Any model that can't represent it accurately will misreport the critical paths.

What Jacquard does today

The IR side is already in shape. crates/timing-ir/schemas/timing_ir.fbs carries InterconnectDelay { net, from_pin, to_pin, delay[corner] } per receiver, and opensta-to-ir populates it from SDF.

The lossy step is the consumer in src/flatten.rs:1850-1872:

let mut wire_delays_per_cell: HashMap<usize, (u64, u64)> = HashMap::new();
// ... for each InterconnectDelay record:
let entry = wire_delays_per_cell.entry(dest_cellid).or_insert((0, 0));
entry.0 = entry.0.max(d);   // rise
entry.1 = entry.1.max(d);   // fall (same value!)

Three layers of pessimism stacked here:

  1. Keyed by destination cell, not destination pin. A cell with two inputs from very different routes loses per-pin fidelity.
  2. Max across inputs of the same cell. Worst-case incoming wire is applied to every output of the cell.
  3. No rise/fall distinction on wire delay. SDF carries both; we collapse to one number.

Then in arrival propagation (csrc/kernel_v1.metal:220-255):

new_arr = max(arr_a, arr_b) + gate_delay

where gate_delay = intrinsic + max_wire_into_cell. The mathematically correct propagation is:

new_arr = max(arr_a + wire_a, arr_b + wire_b) + intrinsic

These are equivalent only when the input with the worst arrival also has the worst wire. When they don't coincide — common on a NoC node where one input comes from a long mesh hop and another from local logic — the current model over-reports by max_wire − actual_wire_on_critical_input.
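
To make the gap concrete, here is a sketch of both propagation rules side by side. The `Input` struct and function names are illustrative assumptions, and the numbers in the note below are made up.

```rust
/// Arrival at a gate input and the wire delay on that input's route (ps).
/// (Hypothetical type for illustration only.)
pub struct Input {
    pub arrival_ps: u32,
    pub wire_ps: u32,
}

/// Current lumped model: max(arr_a, arr_b) + gate_delay,
/// where gate_delay = intrinsic + max_wire_into_cell.
pub fn lumped_arrival(a: &Input, b: &Input, intrinsic_ps: u32) -> u32 {
    let max_wire = a.wire_ps.max(b.wire_ps);
    a.arrival_ps.max(b.arrival_ps) + intrinsic_ps + max_wire
}

/// Per-edge model: max(arr_a + wire_a, arr_b + wire_b) + intrinsic.
pub fn per_edge_arrival(a: &Input, b: &Input, intrinsic_ps: u32) -> u32 {
    (a.arrival_ps + a.wire_ps).max(b.arrival_ps + b.wire_ps) + intrinsic_ps
}
```

Take a late local input (arrival 800 ps, wire 10 ps), an early long-hop input (arrival 300 ps, wire 200 ps), and a 50 ps intrinsic delay. The lumped model reports 1050 ps and the per-edge model 860 ps: an over-report of 190 ps, which is max_wire (200) minus the wire on the critical input (10).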

For sky130 small designs this gap is in the noise. For 22nm with 10× variation between local and global wire delays, it's the difference between "this path meets timing" and "STA reports a violation that doesn't exist."

Inter-partition wires — the architectural wrinkle

A NoC tile naturally maps to one (or a few) partition(s). The inter-tile links — the long, wire-dominated, timing-critical ones — are precisely the partition-crossing signals. Today wire delay sits on the destination cell's gate_delays slot, evaluated inside the destination partition's boomerang reduction. The wire is a property of the crossing, not the destination cell, and should ideally be modelled at the partition I/O boundary, where src/sim/cosim_metal.rs already shuffles state between partitions.

This is the inverse alignment of the clock-tree case. There partitioning didn't help with skew accounting. Here partitioning is load-bearing: tile-aligned partitions naturally expose the small set of edges that deserve careful wire-delay modelling, and let intra-partition logic stay on the fast lumped path.

Three fidelity tiers

| Tier | Model | Where wire delay lives | When it's enough |
| --- | --- | --- | --- |
| 0 (current) | One scalar per destination cell, max-collapsed | Folded into gate_delays[output_pin] of dest cell | sky130 + ≥10 ns periods + small designs |
| 1 Per-receiver | One scalar per (from_pin, to_pin) edge in the AIG | Folded into the source AIG pin's gate_delay, with one entry per fanout target | Local wires in faster designs; intra-tile NoC logic |
| 2 Per-edge with inter-partition arcs | Tier 1 + explicit wire delay on partition-crossing signals | Tier 1 + new arrival-bump applied during cosim_metal.rs state shuffle | Long routes + many-core/NoC + 22nm-scale processes |

Tier 1 is mostly a flatten.rs rewrite. Tier 2 needs cosim_metal.rs extension and a new field in the inter-partition transfer format.

Information availability

Yes, it's there:

  • InterconnectDelay records exist per receiver. SDF carries them. opensta-to-ir emits them.
  • Per-input-pin granularity is in the IR (to_pin includes the local pin name). The consumer just discards it via to_pin.rfind('/') to derive dest_inst.
  • Rise/fall distinction is in the schema (delay: [TimingValue] per corner; rise/fall could be on top via the same pattern as TimingArc). For SDF-back-annotated flows the rise/fall split usually comes from the SDF; we'd need to confirm opensta-to-ir preserves both edges.
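
The per-pin information in to_pin can be kept rather than discarded with a split at the last '/'. A minimal sketch, assuming to_pin is an instance path followed by '/' and the local pin name; the function name is hypothetical.

```rust
/// Split "inst/path/PIN" into ("inst/path", "PIN") at the last '/'.
/// Returns None when there is no '/' to split on.
pub fn split_to_pin(to_pin: &str) -> Option<(&str, &str)> {
    let idx = to_pin.rfind('/')?;
    Some((&to_pin[..idx], &to_pin[idx + 1..]))
}
```

A Tier-1 consumer could key wire delay on the (instance, pin) pair this returns, where today's consumer keeps only the instance half.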

What's missing today:

  • Tier-1 plumbing: AIG-pin-level wire delay per fanout. Current gate_delays: Vec<PackedDelay> is keyed by AIG pin (the output side); to do per-input-edge correctly we want delay attached to the edge, not the node. Either add a parallel wire_delays: HashMap<(src_aigpin, dst_aigpin), PackedDelay> or refactor toward an edge-attributed AIG.
  • Tier-2 plumbing: a "partition-crossing arc" concept in cosim_metal.rs. Currently inter-partition state shuffle moves bits with no associated arrival bump. Adding a per-edge ps adjustment is straightforward in principle; finding the right place in the shuffle pipeline matters.

IR scale

The IR-size concern bites here. InterconnectDelay is roughly 100–200 bytes per record; a 22nm SoC with 10⁶–10⁷ nets is a .jtir file in the hundreds-of-MB to multi-GB range.
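
The size range follows from simple multiplication. A back-of-envelope sketch, assuming one InterconnectDelay record per net (nets with several receivers would produce more records, so this is a lower bound); the helper name is hypothetical.

```rust
/// Rough IR payload size: record count times bytes per record.
pub fn ir_size_bytes(records: u64, bytes_per_record: u64) -> u64 {
    records * bytes_per_record
}
```

At 10⁶ records of 100 bytes this is 100 MB; at 10⁷ records of 200 bytes it is 2 GB.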

Mitigations:

  • Streaming load: today TimingIrFile::from_path reads the whole buffer. Could mmap and lazy-decode, since FlatBuffers is offset-based.
  • Sharding: split IR per partition or per top-level module. Adds a build-time step but bounds memory per process.
  • Drop intra-cell wires from IR generation: SDF often has microscopic interconnect records that lump into the destination's own pin-cap. Filter these out at the opensta-to-ir builder. Loss is genuinely negligible.

Worth measuring before committing to mitigations — sky130 NVDLA-scale today is fine; the question is what 22nm + N-tile mesh looks like.

Partitioning question — the other direction

For NoC designs partitioning becomes a positive lever (unlike the clock-tree case where it was neutral). Two specific levers:

  1. Tile-aligned partitions. If repcut.rs finds tile-aligned cuts naturally (likely, given typical tile-to-tile connectivity sparsity), inter-partition arcs are a small, well-defined set of NoC links. Worth verifying with a representative design — a partitioning report keyed by signal name pattern (*_link_*, noc_*, configurable) would expose whether the partitioner's logic-affinity score is already aligned with tile boundaries or whether we need to bias it.
  2. NoC-link partitioning hint. Add a soft bias to repcut that prefers cutting nets matching a configured regex. Same partitioning machinery, configurable input. Cost: degrades cut quality if the hint conflicts with logic affinity. Likely worth it for explicitly tile-decomposed designs where the user knows the tile boundaries; not worth it for flat designs.

The point of any of this is to make Tier-2 cheap: if the inter-partition arc set is small, per-edge wire delay on those crossings costs almost nothing.

Crosstalk and OCV

These are upstream concerns. SDF from a crosstalk-aware STA flow already carries pessimistic delays; OCV (on-chip variation) is similarly baked into the chosen corner. Jacquard consumes whatever the IR was generated against. Worth a one-line note in the user-facing docs that the timing report's accuracy is bounded by the SDF/STA flow it was built from — Jacquard does not invent crosstalk pessimism.

Staged plan for wire delay

| Stage | What | Touches | Kernel | Effort |
| --- | --- | --- | --- | --- |
| 1 Per-receiver consumption | Key wire delay by (src_aigpin, dst_aigpin) edge; fold into source AIG pin's gate_delay per fanout | src/flatten.rs:1850-1872, possibly src/aig.rs for fanout tracking | None | 3–5 days |
| 2 Rise/fall distinction | Preserve per-edge rise/fall through the consumer; honour both in PackedDelay accumulation | src/flatten.rs:1850-1914 | None | 1–2 days |
| 3 Inter-partition arc delay | New per-crossing wire-delay table; arrival bump applied during inter-partition state transfer | src/sim/cosim_metal.rs shuffle path; src/flatten.rs partition-boundary metadata | Yes (transfer path) | 2–3 weeks |
| 4 IR scale plumbing | Streaming/mmap load; opensta-to-ir filtering of microscopic records | src/sim/timing_ir_loader.rs, opensta-to-ir/builder | None | 1 week (gated on measurement) |
| 5 NoC-aware partitioning | Soft bias in repcut for cutting flagged nets; partition report by tile | src/repcut.rs and CLI flags | None | 1–2 weeks |

For a sky130 use case Stage 1+2 likely covers everything you'd notice. For 22nm NoC, Stages 1–3 are the meaningful set; Stage 5 is the optimisation that makes Stage 3 cheap.

What we get / don't get

Achievable

  • Setup/hold accuracy on long routes that today gets clobbered by max-collapse pessimism.
  • Honest reporting on NoC inter-tile links — the paths that actually matter for many-core SoC timing closure.
  • All of the above without changing Jacquard's cycle-accurate kernel architecture.

Not achievable from this work alone

  • Crosstalk-driven delay uncertainty (handled upstream in STA).
  • Variation-aware (statistical) timing — would need OCV-corner sweeping or SSTA, neither of which is on the roadmap.
  • Process variation modelling beyond the corners the SDF/IR was generated against.

Open questions

  1. δ(T) characterisation cost. One-off SPICE per cell-type per corner. Cheaper if we lean on existing ECSM/CCSM data already in vendor Liberty rather than re-running SPICE. Worth investigating before committing to Stage 2.
  2. Whose clock arrival is authoritative? Resolved by Part B Stage 1+2: OpenSTA-computed per-DFF arrival via opensta-to-ir, treating launch as 0-reference. Per-pair CRPR credit is intentionally not modelled at this stage (see Stage 3 in the staged plan above).
  3. Interaction. Does δ(T) on clock-tree buffers matter? Probably not enough to model — clock buffers are sized for fast edges and operate far from their pulse-degradation regime. But the framework should be able to express "ignore δ(T) on clock domain" cleanly.
  4. Validation oracle. CVC and Icarus already serve as functional oracles; for skew-aware and wire-aware reporting OpenSTA's slack report (via opensta-to-ir / direct subprocess) is the ground truth for unit tests. (ADR 0003 originally nominated OpenTimer for this role; superseded by the spike outcome — OpenSTA carries the role end-to-end now.)
  5. IR size at 22nm scale. Open question whether .jtir for a representative many-core NoC fits in available memory under the current eager-load model. Needs measurement before committing to streaming mitigations.
  6. Edge-attributed AIG. Per-receiver wire delay wants delay attached to AIG edges, not nodes. Today the AIG is node-attributed (gate_delays: Vec<PackedDelay> indexed by aigpin). A clean Tier-1 implementation may push toward edge attribution, with downstream effects on the boomerang reduction script layout. Worth a small spike before the main implementation.
  7. Partition-crossing format. Adding per-edge wire delay to cosim_metal.rs inter-partition transfers needs a precise place in the existing pipeline. Currently the shuffle moves Boolean state words without arrival; the natural place is alongside the writeout-arrival path that already exists for setup/hold checking, but the alignment isn't 1:1 because partition crossings happen at logic boundaries, not capture-DFF boundaries.

Related documents

  • docs/timing-correctness.md: forward-looking validation contract; this doc extends rather than replaces.
  • docs/timing-simulation.md: boomerang architecture; the kernel-side context.
  • docs/timing-validation.md: current ±5% acceptance criteria; would tighten under δ(T).
  • docs/adr/0002-timing-ir.md: IR design rationale; schema additions here follow the "lossless extension" principle.
  • docs/adr/0001-opensta-as-oracle.md: STA path; OpenSTA out of process is committed (after ADR 0003 was superseded).
  • docs/adr/0003-opentimer-primary-sta.md: Superseded. Original in-process STA proposal; spike Q2 fail moved Jacquard to OpenSTA-only. See docs/spikes/opentimer-sky130.md.

In-Design Signal Tracing (--trace-signals)

Overview

By default a Jacquard output VCD contains only top-level IO. --trace-signals <FILE> surfaces user-selected internal nets in that VCD alongside the top-level ports — so you can watch a DFF's Q, a controller state bit, or an SRAM port wire without re-synthesizing or exposing it as a port.

It is available on both jacquard sim and jacquard cosim, and is observe-only: traced nets are read out each tick, never driven.

Each name in the file is resolved against the netlist, registered as a primary output before partitioning (so it gets a state-buffer slot), and emitted on the same path as the top-level IO. It works uniformly for sequential (DFF Q) and combinational nets — anything that has a name in the netlist database.

This is the raw-wire counterpart to bus transaction tracing: --trace-signals gives you per-cycle waveforms of individual nets; bus tracing gives you decoded transaction records. Use this when you want a waveform; use bus tracing when you want READ 0x40 => 0x1.

File format

One hierarchical signal name per line:

# JTAG debug-module state (comments and blank lines are ignored)
chip_core.dm.haltreq_q[0]
chip_core.dm.haltreq_q[1]

# Yosys-internal nets — same syntax works
chip_core.sram_u._00147_

# A whole bus, one bit per line
data0_obs[0]
data0_obs[1]
  • Blank lines and lines whose first non-whitespace character is # are skipped.
  • Hierarchy uses . as the separator; a trailing [N] selects a bus bit.
  • A leading backslash (Verilog escaped-identifier syntax) is stripped.

A real example ships in tests/jtag_minimal/trace_signals.txt.
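The format rules above are simple enough to check a list before handing it to Jacquard. The following is a minimal sketch of those three rules only; the function name parse_trace_list and the (name, bit) return shape are illustrative assumptions, not Jacquard's own parser.

```python
import re

# A trailing [N] selects a bus bit.
_BIT_RE = re.compile(r"^(?P<base>.*)\[(?P<bit>\d+)\]$")

def parse_trace_list(text):
    """Return (name, bit) pairs; bit is None for scalar nets."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are skipped
        if line.startswith("\\"):
            line = line[1:]  # drop the Verilog escaped-identifier backslash
        m = _BIT_RE.match(line)
        if m:
            entries.append((m.group("base"), int(m.group("bit"))))
        else:
            entries.append((line, None))
    return entries

sample = """\
# JTAG debug-module state
chip_core.dm.haltreq_q[0]

chip_core.sram_u._00147_
\\data0_obs[1]
"""
entries = parse_trace_list(sample)
print(entries)
```

A pre-flight check like this only validates syntax; whether a name actually resolves is still decided by the netlist database at startup.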

Name resolution

Post-synthesis net names are ambiguous — Yosys may flatten a hierarchy into one escaped identifier (\soc.sram.read_port__data), expand a bus into per-bit scalars (soc.bus__addr[3]), or preserve real structural hierarchy. Rather than guess, the resolver tries multiple candidate interpretations of each name and takes the first that matches the netlist database, so the same syntax works across all three conventions.

  • Unresolved names warn, they don't abort. A bad name logs a warning and is skipped; the rest of the list still registers. A trailing summary line reports how many signals registered vs. were dropped, so a mistyped list surfaces clearly at startup:

    --trace-signals: registered 34 signal(s), dropped 2 (file: trace.txt)
    
  • Names that resolve to a constant (tied 0/1) are skipped — there's nothing to observe at runtime.

Where the output lands

Traced nets appear in whichever VCD the run already emits:

| Command | Flag | Traced nets appear in |
|---|---|---|
| jacquard sim | --trace-signals <FILE> | the output VCD |
| jacquard cosim | --trace-signals <FILE> | the --output-vcd output only |

They show up as ordinary VCD wires next to the top-level IO, named by the string you put in the trace file.

cosim: traced nets land in --output-vcd only. The --stimulus-vcd carries primary inputs and does not include them, so if you trace a net and look in the stimulus VCD you'll see nothing. --output-vcd does not require timing data — see Pre-PnR functional runs.

Pre-PnR functional runs

--output-vcd is the functional output path too — it does not require --timing-ir/SDF. Run a synthesized (pre-PnR) netlist through cosim with --output-vcd out.vcd and you get chip outputs and traced nets per cycle, with transitions at clock edges (no arrival-time offsets). This is the right mode for functional / 4-state X-pessimism debugging, where there is no timing data to supply yet. Adding --timing-ir later only adds arrival-time offsets to the same VCD.

Top-level inout (bidir) pads

A top-level inout pad is split into two observables in the output VCD: <pad>__out (the value the core drives) and <pad>__oe (the pad's output-enable). The raw <pad> net reads the pad's input side, so on an output-only or undriven cycle it can look flat — watch <pad>__out / <pad>__oe to see what the design is driving. Example: bidir_PAD[12]__out, bidir_PAD[12]__oe. These appear automatically; you don't need to list them in the trace file.

Finding signal names

Use the netlist-graph tool (see the project README) to discover the exact post-synthesis names:

# Search for nets matching a pattern
uv run netlist-graph search <netlist.v> "haltreq"

# Trace what drives / loads a signal (to find nearby observable nets)
uv run netlist-graph drivers <netlist.v> "soc.cpu.state" -d 5
uv run netlist-graph loads   <netlist.v> "soc.cpu.ack"   -d 5

# Emit a ready-to-use trace file
uv run netlist-graph watchlist <netlist.v> out.json signal1 signal2 ...

SRAM observability workflow

The recommended way to observe SRAM port activity is wire-level tracing rather than the env-var-gated JACQUARD_SRAM_DUMP. netlist-graph can discover the port wires and emit a trace file directly:

# 1. Discover SRAM port wire names from the netlist
uv run netlist-graph sram-ports design.v --cell-type SRAM -o sram_trace.txt

# 2. Surface them in the VCD with full per-tick accuracy
jacquard cosim design.v --config sim.json \
    --trace-signals sram_trace.txt --output-vcd out.vcd

# 3. Post-process the VCD to reconstruct bus values
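Step 3 is left to the user. One way to do it is sketched below, assuming each traced bit appears in the VCD as a 1-bit variable whose reference is the trace-file string (for example data0_obs[3]). The function name bus_values is hypothetical, and x/z values are ignored rather than propagated.

```python
import re

def bus_values(vcd_text, bus):
    """Return [(time, value)] for every timestamp at which a bit of `bus` changed."""
    pat = re.compile(re.escape(bus) + r"\[(\d+)\]$")
    bit_of = {}  # VCD identifier code -> bit index
    value, time, dirty = 0, 0, False
    out = []
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "$var" and len(tok) >= 6:
            # $var <type> <width> <id> <reference...> $end
            m = pat.match("".join(tok[4:-1]))
            if m:
                bit_of[tok[3]] = int(m.group(1))
        elif tok[0].startswith("#") and tok[0][1:].isdigit():
            if dirty:  # flush the value assembled at the previous timestamp
                out.append((time, value))
                dirty = False
            time = int(tok[0][1:])
        elif tok[0][0] in "01" and tok[0][1:] in bit_of:
            bit = bit_of[tok[0][1:]]
            value = (value & ~(1 << bit)) | (int(tok[0][0]) << bit)
            dirty = True
    if dirty:
        out.append((time, value))
    return out

SAMPLE_VCD = '''\
$timescale 1ps $end
$var wire 1 ! data0_obs[0] $end
$var wire 1 " data0_obs[1] $end
$var wire 1 $ data0_obs[2] $end
$enddefinitions $end
#0
0!
0"
0$
#10
1!
1$
#20
0!
1"
'''
changes = bus_values(SAMPLE_VCD, "data0_obs")
print([(t, hex(v)) for t, v in changes])
```

Bits that were skipped at registration (for example constants) simply never appear in the VCD, so they read as 0 in the reconstructed value.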

Example

tests/jtag_minimal/ uses --trace-signals to surface the debug module's observable outputs (dmactive_obs, haltreq_obs, data0_obs[0..31]) so the test's pass criterion can check that the magic value 0xCAFEBABE lands in data0_obs:

jacquard cosim tests/jtag_minimal/data/top.pnl.v \
    --config tests/jtag_minimal/sim_config.json \
    --trace-signals tests/jtag_minimal/trace_signals.txt \
    --jtag-replay tests/jtag_minimal/data/bitbang.rec \
    --output-vcd out.vcd

Troubleshooting

| Symptom | Cause / fix |
|---|---|
| not found in netlistdb (tried N candidate(s)) | The name doesn't exist post-synthesis under any candidate spelling. Find the real name with netlist-graph search; the net may have been renamed or optimized away. |
| Signal registered but flat in the VCD | It may resolve to a constant after optimization (the startup log notes constants are skipped), or the cone was stripped. Confirm it's a live net with netlist-graph drivers. |
| Nothing appears in the VCD | Check that the run actually emits a VCD (for cosim this must be --output-vcd; traced nets never appear in the --stimulus-vcd) and that the startup summary line reports a non-zero registered count. |

Implementation notes

Registration happens at AIG construction, before partitioning, which is why the list must be supplied via the CLI flag (not a runtime env var). The mechanism lives in src/sim/trace_signals.rs; emission piggybacks on emit_extra_observables in src/sim/vcd_io.rs. The same multi-candidate resolver backs bus-trace pin binding (see bus tracing and ADR 0013).

Bus Transaction Tracing (AHB / APB)

Overview

jacquard cosim can decode on-chip bus transactions and emit them in a compact, transaction-level form — one row per transfer, rather than raw per-cycle waveforms. You declare the bus interfaces to watch in sim_config.json; cosim observes their pins on the GPU each tick and runs the protocol decode on the CPU, writing decoded transactions to a CSV file.

This is observe-only: the tracer watches signals the design already drives, it never drives anything. It adds no measurable simulation overhead when no buses are configured.
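To make the CPU-side decode concrete, here is a minimal sketch of an APB3 decoder over already-sampled pin values. It follows the standard APB3 rule that a transfer completes when psel, penable and pready are all high. The ApbSample container, the decode_apb3 name, and the one-sample-per-clock-cycle input are assumptions of this sketch, not Jacquard's internal representation.

```python
from dataclasses import dataclass

@dataclass
class ApbSample:
    """Pin values sampled once per clock cycle (hypothetical container)."""
    psel: int
    penable: int
    pwrite: int
    paddr: int
    pwdata: int
    prdata: int
    pready: int = 1   # an absent pready is treated as always-ready
    pslverr: int = 0  # an absent pslverr is treated as no-error

def decode_apb3(samples, bus_name):
    """Return one record per completed transfer from (tick, ApbSample) pairs."""
    rows = []
    for tick, s in samples:
        # A transfer completes in the ACCESS phase once the slave is ready.
        if s.psel and s.penable and s.pready:
            rows.append({
                "tick": tick,
                "bus": bus_name,
                "dir": "WR" if s.pwrite else "RD",
                "addr": s.paddr,
                "data": s.pwdata if s.pwrite else s.prdata,
                "resp": "ERR" if s.pslverr else "OK",
            })
    return rows

trace = [
    (20, ApbSample(1, 0, 1, 0x10, 0xCAFEBABE, 0)),            # SETUP
    (22, ApbSample(1, 1, 1, 0x10, 0xCAFEBABE, 0, pready=0)),  # ACCESS, wait state
    (24, ApbSample(1, 1, 1, 0x10, 0xCAFEBABE, 0)),            # ACCESS, completes
    (28, ApbSample(1, 0, 0, 0x10, 0, 0)),                     # SETUP
    (30, ApbSample(1, 1, 0, 0x10, 0, 0xCAFEBABE)),            # ACCESS, completes
]
rows = decode_apb3(trace, "dmi")
for r in rows:
    print(r["tick"], r["dir"], hex(r["addr"]), hex(r["data"]), r["resp"])
```

The wait-state sample at tick 22 produces no record, which is why a transfer is attributed to the tick at which it completed.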

| Protocol | Status |
|---|---|
| APB3 | Supported |
| AHB-Lite | Planned (pipelined address/data pairing, burst tracking) |
| AHB5 | Planned (AHB-Lite + security / exclusive signals) |

The design rationale lives in ADR 0013; the roadmap is in plans/bus-transaction-tracing.md.

Bus tracing is the structured, protocol-aware counterpart to --trace-signals, which surfaces raw internal nets in the output VCD. Use --trace-signals when you want waveforms of individual wires; use bus tracing when you want decoded READ 0x40 => 0x1 records.

Configuring a bus

Add a bus_traces array to sim_config.json. Each entry names one bus interface:

{
    "netlist_path": "build/soc.gv",
    "clock_gpio": 0,
    "reset_gpio": 1,
    "num_cycles": 100000,
    "clock_period_ps": 40000,

    "bus_traces": [
        {
            "name": "dmi",
            "protocol": "apb3",
            "prefix": "soc.dm.",
            "addr_bits": 9,
            "data_bits": 32
        }
    ]
}

| Field | Required | Meaning |
|---|---|---|
| name | yes | Label for this bus in the CSV bus column. |
| protocol | yes | apb3 (or ahb-lite / ahb5 once supported). |
| prefix | yes | Hierarchical net-name prefix; standard pin names are appended (see below). May be "" for top-level pins. |
| addr_bits | no (default 32) | Address bus width. |
| data_bits | no (default 32) | Data bus width. |
| signals | no | Per-pin net-name overrides (see Pin resolution). |

Pin names

By default each protocol pin is resolved as {prefix}{pin}. For APB3:

| Logical pin | Default net | Notes |
|---|---|---|
| psel | {prefix}psel | required |
| penable | {prefix}penable | required |
| pwrite | {prefix}pwrite | direction |
| paddr | {prefix}paddr[i] | addr_bits wide |
| pwdata | {prefix}pwdata[i] | data_bits wide |
| prdata | {prefix}prdata[i] | data_bits wide |
| pready | {prefix}pready | optional — unresolved is treated as always-ready (1) |
| pslverr | {prefix}pslverr | optional — unresolved is treated as no-error (0) |

So a bus with "prefix": "soc.dm." looks for soc.dm.psel, soc.dm.paddr[0], …, soc.dm.prdata[31].
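The default naming rule can be expanded mechanically, which is handy for checking a prefix against netlist-graph search results. This is a small sketch of the {prefix}{pin} rule from the table; the helper name apb3_default_nets is hypothetical and signals overrides are not modelled.

```python
def apb3_default_nets(prefix, addr_bits=32, data_bits=32):
    """Expand the default {prefix}{pin} net names for one APB3 bus."""
    scalar_pins = ("psel", "penable", "pwrite", "pready", "pslverr")
    nets = {pin: [f"{prefix}{pin}"] for pin in scalar_pins}
    nets["paddr"] = [f"{prefix}paddr[{i}]" for i in range(addr_bits)]
    for pin in ("pwdata", "prdata"):
        nets[pin] = [f"{prefix}{pin}[{i}]" for i in range(data_bits)]
    return nets

nets = apb3_default_nets("soc.dm.", addr_bits=9, data_bits=32)
print(nets["psel"], nets["paddr"][0], nets["prdata"][-1])
```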

If your design's pins don't follow that convention, remap individual logical pins with signals:

{
    "name": "periph",
    "protocol": "apb3",
    "prefix": "soc.apb.",
    "signals": {
        "psel": "soc.apb_decode.sel_periph",
        "prdata": "soc.apb_mux.readback"
    }
}

Running

cargo run -r --features metal --bin jacquard -- cosim \
    build/soc.gv \
    --config sim_config.json \
    --bus-trace-csv bus.csv

At startup each bus logs whether it resolved:

bus-trace `dmi` (APB3): psel/penable resolved, addr 9/9 bits, pready=true pslverr=true

and at the end:

bus-trace: decoded 12 transaction(s) across 1 bus(es)
bus-trace: wrote 12 transaction(s) to bus.csv

CSV output

tick,bus,protocol,dir,addr,data,resp,burst
24,dmi,apb3,WR,0x10,0xCAFEBABE,OK,
30,dmi,apb3,RD,0x10,0xCAFEBABE,OK,

| Column | Meaning |
|---|---|
| tick | Cosim edge at which the transfer completed. One clock cycle = 2 edges (rising + falling) for a single-domain design. |
| bus | The configured bus name. |
| protocol | apb3 / ahb-lite / ahb5. |
| dir | WR or RD. |
| addr | Transfer address (hex). |
| data | pwdata for writes, prdata for reads (hex). |
| resp | OK or ERR (from pslverr / hresp). |
| burst | AHB burst position beat/len (empty for APB). |
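Because the output is plain CSV, it can be consumed with the standard library. The sketch below reads the columns documented above and converts tick to a clock-cycle index using the 2-edges-per-cycle rule for a single-domain design; the load_transactions helper and its edges_per_cycle parameter are illustrative names.

```python
import csv
import io

SAMPLE = """\
tick,bus,protocol,dir,addr,data,resp,burst
24,dmi,apb3,WR,0x10,0xCAFEBABE,OK,
30,dmi,apb3,RD,0x10,0xCAFEBABE,OK,
"""

def load_transactions(fileobj, edges_per_cycle=2):
    """Parse a bus-trace CSV into dicts with integer address and data fields."""
    rows = []
    for rec in csv.DictReader(fileobj):
        rows.append({
            "cycle": int(rec["tick"]) // edges_per_cycle,
            "bus": rec["bus"],
            "dir": rec["dir"],
            "addr": int(rec["addr"], 16),   # int() accepts the 0x prefix in base 16
            "data": int(rec["data"], 16),
            "ok": rec["resp"] == "OK",
        })
    return rows

txns = load_transactions(io.StringIO(SAMPLE))
for t in txns:
    print(t["cycle"], t["dir"], hex(t["addr"]), hex(t["data"]))
```

For a real run, pass an open file handle for bus.csv instead of the in-memory sample.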

Pin resolution

For the GPU to read a bus pin each tick, that net must (1) exist in the post-synthesis netlist under a resolvable name and (2) survive into the simulation's output state. Two consequences:

  • Names must survive synthesis. The resolver uses the same multi-candidate matcher as --trace-signals, so Yosys-flattened (\soc.dm.psel), scalar-expanded (soc.dm.paddr[3]), and structurally-hierarchical names all work. But synthesis is free to rename or delete combinational nets. The robust pattern is to make the bus signals registers (their DFF Q outputs keep their names), or to annotate the RTL nets with (* keep *).

  • Constant-folded bits read as 0 — correctly. If a design only ever drives, say, addresses 0x00 and 0x04, synthesis folds every address bit except paddr[2] to a constant. The startup log then shows e.g. addr 1/8 bits. This is expected: the tracer reconstructs the full value correctly because the dropped bits are genuinely 0.

pready / pslverr are allowed to be absent. A common case is an always-ready slave that ties pready high — it folds to a constant, fails to resolve, and the tracer correctly treats the bus as always-ready.
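The two rules above (unresolved value bits read as 0, an unresolved pready reads as 1) can be illustrated as follows. This is only a sketch of the arithmetic; the helper names and the use of None for an unresolved pin are assumptions.

```python
def assemble(bits_by_index, width):
    """Rebuild a bus value; bit indices missing from the map read as 0."""
    value = 0
    for i in range(width):
        value |= (bits_by_index.get(i, 0) & 1) << i
    return value

def effective_pready(sample):
    """None stands for 'pin did not resolve' and is treated as always-ready."""
    return 1 if sample is None else sample

# Only paddr[2] survived synthesis (the "addr 1/8 bits" case).
print(hex(assemble({2: 1}, 8)), hex(assemble({2: 0}, 8)), effective_pready(None))
```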

Worked example

tests/apb_trace/ is a self-contained, synthesizable APB3 system used as the CI regression. Its master issues a fixed program — two writes then two reads — to a register-file slave, and check.py asserts the decoded CSV. See tests/apb_trace/README.md.

yosys -s tests/apb_trace/synth.tcl          # (from tests/apb_trace/)
cargo run -r --features metal --bin jacquard -- cosim \
    tests/apb_trace/apb_trace_synth.gv \
    --config tests/apb_trace/sim_config.json \
    --top-module apb_trace \
    --max-clock-edges 200 \
    --bus-trace-csv apb.csv
python3 tests/apb_trace/check.py apb.csv

Troubleshooting

| Symptom | Cause / fix |
|---|---|
| psel/penable did not resolve … this bus will not capture | The prefix is wrong, or the nets were optimized away. Find the real names with uv run netlist-graph search <netlist> psel, then fix prefix or add signals overrides. |
| Zero transactions decoded | Gate never asserted. Check that psel/penable resolve (startup log) and that the bus is actually exercised within --max-clock-edges. |
| Address or data always 0 | paddr/pwdata/prdata nets didn't resolve (renamed/folded). Confirm with netlist-graph search; mark the RTL nets (* keep *) and re-synthesize. |
| Reads return stale/wrong data | The slave must present prdata during the ACCESS phase. Register prdata so its value is stable when psel & penable are high. |

Limitations

  • APB3 only for now; AHB-Lite / AHB5 and annotated-VCD output are the next phases (see the plan).
  • Up to 4 buses per run, addresses/data up to 32 bits.
  • Cosim is Metal-only today, so bus tracing is Metal-only.
  • The legacy hardcoded Wishbone trace (a separate, SoC-specific path) is unaffected; folding it onto this general mechanism is a planned follow-up.

Adding a New PDK for Post-Layout Simulation

This guide documents the process of enabling a new process design kit (PDK) for gate-level simulation in Jacquard. It is based on the SKY130 enablement and captures every integration point.

Overview

Jacquard natively supports AIGPDK (its own synthesis library of AND gates, DFFs, and SRAMs). Supporting a foundry PDK like SKY130 requires teaching the simulator how to interpret the PDK's standard cells: their pin directions, their boolean function, and which ones are sequential.

There are now two pathways for enabling new cells; pick based on what you're adding:

  • Built-in PDK enablement (this guide). For full standard-cell libraries — AND gates, DFFs, sequential cells with explicit AIG decomposition rules. Requires Rust code: pin tables, classifiers, decomposition functions, AIG builder hooks.
  • Runtime cell library (--cell-library + .cells.toml manifest). For third-party IP, hard macros, foundry memories, and any other cells that don't need new AIG decomposition rules — i.e. cells that act as opaque outputs (RAM macros), filler/cap blocks, or IO pads. See ADR 0010 and docs/plans/declarative-cell-metadata.md for the recipe. No Jacquard PR required — users ship a manifest alongside their netlist.

This guide covers the built-in pathway, which touches five areas:

  1. Library detection -- recognizing cell names from a netlist
  2. Pin direction provider -- telling the netlist parser which pins are inputs/outputs
  3. Cell classification -- identifying sequential, tie, and multi-output cells
  4. Behavioral decomposition -- converting PDK cells to AIG (AND/NOT) primitives
  5. CLI wiring -- connecting it all together

If you're adding just a memory macro or other behaviourally-opaque IP, skip ahead to "Adding third-party IP via runtime manifest" at the end of this document: that route is a short TOML manifest entry, not a Rust PR.

Prerequisites

You need:

  • The PDK's Verilog cell library (behavioral or functional models)
  • A post-synthesis or post-P&R netlist using those cells
  • The cell naming convention (prefix, drive strength suffix format)

For SKY130, the PDK data lives in vendor/sky130_fd_sc_hd/ as a git submodule.

Step 1: Library Detection

Reference: src/sky130.rs -- is_sky130_cell(), detect_library(), detect_library_from_file()

Jacquard scans the netlist to determine which cell library is in use. Each PDK needs a name-matching function:

// src/sky130.rs:535
pub fn is_sky130_cell(name: &str) -> bool {
    name.starts_with("sky130_fd_sc_")
        || name.starts_with("CF_SRAM_")
}

The CellLibrary enum tracks known libraries. detect_library() iterates cell names and returns the detected library (or Mixed if cells from multiple libraries are found -- this is an error).

For a new PDK: Add a variant to CellLibrary, write an is_<pdk>_cell() function, and update detect_library().

Step 2: Cell Type Extraction

Reference: src/sky130.rs -- extract_cell_type()

PDK cell names follow a convention: <prefix>__<type>_<drive>. The simulator needs to strip the prefix and drive strength to get the base cell type:

sky130_fd_sc_hd__nand2_4  -->  nand2
sky130_fd_sc_hd__dfxtp_1  -->  dfxtp

This function must handle all library variants (hd, hs, ms, ls, lp, hdll, hvl for SKY130) and any custom macros (CF_SRAM_*).
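The stripping rule can be prototyped outside the simulator before porting it to Rust. The following is a sketch of the <prefix>__<type>_<drive> convention only, assuming the drive strength is a purely numeric suffix and that names without a double underscore (such as CF_SRAM_* macros) are returned unchanged. The real extract_cell_type() may treat those cases differently.

```python
def extract_cell_type(cell_name):
    """Strip the library prefix and numeric drive-strength suffix."""
    if "__" not in cell_name:
        return cell_name  # assumption: custom macros pass through unchanged
    base = cell_name.split("__", 1)[1]
    head, sep, tail = base.rpartition("_")
    if sep and tail.isdigit():
        return head       # drop the trailing _<drive>
    return base

for name in ("sky130_fd_sc_hd__nand2_4", "sky130_fd_sc_hd__dfxtp_1",
             "sky130_fd_sc_hd__dlygate4sd3_1"):
    print(name, "-->", extract_cell_type(name))
```

Splitting on the first double underscore, rather than matching a fixed prefix, is what lets one function cover the hd, hs, ms, ls, lp, hdll and hvl variants.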

For a new PDK: Write an equivalent extract_cell_type() for the PDK's naming scheme.

Step 3: Pin Direction Provider

Reference: src/sky130.rs -- SKY130LeafPins implementing LeafPinProvider

The netlist parser (from eda-infra-rs/netlistdb) needs to know pin directions and widths for every cell type. This is implemented as a trait:

// Signature outline only: parameter types and method bodies are omitted here.
impl LeafPinProvider for SKY130LeafPins {
    fn direction_of(&self, macro_name, pin_name, pin_idx) -> Direction;
    fn width_of(&self, macro_name, pin_name) -> Option<SVerilogRange>;
}

For SKY130, direction_of() is a large match statement covering ~80 cell types with all their pin names. This is tedious but straightforward -- for each cell, list which pins are inputs and which are outputs.

Sources for pin directions:

  • The PDK's Liberty (.lib) files list pin directions
  • The PDK's behavioral Verilog models declare input/output ports
  • LEF files also contain pin direction information

For a new PDK: Implement the trait for all cells that appear in your target netlists. You can start with just the cells used in your design and add others as needed.

Step 4: Cell Classification

Reference: src/sky130_pdk.rs -- is_sequential_cell(), is_tie_cell(), is_multi_output_cell()

Three classification functions control how cells are processed during AIG construction:

Sequential cells (DFFs and latches)

These are handled specially in the AIG builder -- their outputs become state elements rather than combinational logic.

Critical: Use an explicit whitelist, not prefix matching. PDK naming collisions will silently break simulation if you guess wrong (e.g., SKY130's dlygate4sd3 starts with "dl" but is a combinational delay buffer, not a latch).

Derivation method: Grep the PDK's behavioral Verilog models for DFF/latch primitives:

for cell in $(ls vendor/<pdk>/cells/); do
    vfile="vendor/<pdk>/cells/$cell/<pdk>__${cell}.behavioral.v"
    if [ -f "$vfile" ] && grep -qE 'udp_dff|udp_dlatch' "$vfile"; then
        echo "$cell"
    fi
done

For PDKs that don't use Verilog UDPs, look for always @(posedge blocks or check the Liberty file's ff and latch groups.

Tie cells

Cells that produce constant 0 or 1 (e.g., SKY130's conb with HI/LO pins).

Multi-output cells

Cells with more than one output (e.g., half-adder ha with SUM and COUT, full-adder fa). These need special handling because the AIG builder processes one output pin at a time.

Step 5: Behavioral Model Loading

Reference: src/sky130_pdk.rs -- load_pdk_models(), parse_functional_model(), parse_udp()

Jacquard decomposes PDK cells to AIG primitives (AND gates and inversions) by parsing their functional Verilog models. The expected file structure:

vendor/<pdk>/
  cells/
    <cell_type>/
      <pdk>__<cell_type>.functional.v    # Gate-level behavioral model
  models/
    <udp_name>/
      <pdk>__<udp_name>.v               # Verilog UDP definitions

Functional models

These are gate-level Verilog using primitives like and, or, nand, nor, not, xor, xnor, buf. The parser (parse_functional_model()) extracts these into a topologically-ordered list of BehavioralGate structures.

Example (sky130_fd_sc_hd__o21ai.functional.v):

module sky130_fd_sc_hd__o21ai (Y, A1, A2, B1);
    output Y;
    input  A1, A2, B1;
    wire or0_out;
    or  or0  (or0_out, A2, A1);
    nand nand0 (Y, B1, or0_out);
endmodule
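To show what a topologically-ordered gate list buys you, here is a small sketch that evaluates the o21ai model above by walking its gates in order. The tuple layout (kind, output, inputs) is a stand-in for BehavioralGate, whose real fields are not shown in this guide.

```python
from itertools import product

GATE_FUNCS = {
    "and":  lambda ins: int(all(ins)),
    "or":   lambda ins: int(any(ins)),
    "nand": lambda ins: 1 - int(all(ins)),
    "nor":  lambda ins: 1 - int(any(ins)),
    "xor":  lambda ins: sum(ins) % 2,
    "xnor": lambda ins: 1 - sum(ins) % 2,
    "buf":  lambda ins: ins[0],
    "not":  lambda ins: 1 - ins[0],
}

# (kind, output net, input nets), in the order the model declares them.
O21AI = [
    ("or",   "or0_out", ["A2", "A1"]),
    ("nand", "Y",       ["B1", "or0_out"]),
]

def evaluate(gates, inputs):
    """Evaluate a topologically ordered gate list; returns all net values."""
    nets = dict(inputs)
    for kind, out, ins in gates:
        nets[out] = GATE_FUNCS[kind]([nets[n] for n in ins])
    return nets

for a1, a2, b1 in product((0, 1), repeat=3):
    y = evaluate(O21AI, {"A1": a1, "A2": a2, "B1": b1})["Y"]
    print(a1, a2, b1, "->", y)
```

Because the list is already in dependency order, a single forward pass is enough and no worklist or recursion is needed.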

UDP models

Some cells (typically muxes) use Verilog User-Defined Primitives with truth tables. The parser (parse_udp()) converts these to a row-based representation, which is then evaluated as sum-of-products during AIG decomposition.
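As an illustration of the sum-of-products step, the sketch below evaluates a 2:1 mux truth table in which each row is an input pattern plus an output value. The row encoding (characters 0, 1 and ? for don't-care) and the MUX2 table are assumptions for this example; real UDP tables can contain further symbols that this sketch does not handle.

```python
from itertools import product

# Inputs are ordered (A0, A1, S); '?' matches either value.
MUX2 = [
    ("0?0", "0"), ("1?0", "1"),
    ("?01", "0"), ("?11", "1"),
    ("00?", "0"), ("11?", "1"),
]

def sop_eval(rows, inputs):
    """OR together the product terms of every row whose output is 1."""
    return int(any(
        out == "1" and all(p == "?" or int(p) == v for p, v in zip(pattern, inputs))
        for pattern, out in rows
    ))

for a0, a1, s in product((0, 1), repeat=3):
    print(a0, a1, s, "->", sop_eval(MUX2, (a0, a1, s)))
```

Each output-1 row becomes one AND term over its non-don't-care inputs, and the terms are ORed; rows with output 0 contribute nothing.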

What's loaded

Only models for cell types actually present in the design are loaded. Sequential cells are skipped (their behavior is hardcoded in the AIG builder). Tie cells are also skipped (constant generation is trivial).

For a new PDK: If the PDK uses the same Verilog gate primitive syntax, the existing parsers should work. If it uses behavioral Verilog (assign statements, always blocks), the parser would need extension.

Step 6: AIG Decomposition

Reference: src/sky130_pdk.rs -- decompose_with_pdk(), decompose_from_behavioral()

The decomposition converts each combinational cell to a set of 2-input AND gates with optional inversions:

  1. Map the cell's input pin names to AIG pin indices via CellInputs
  2. Walk the behavioral model's gate list in topological order
  3. For each gate, build the equivalent AIG sub-graph:
    • and/nand -> AND gate (with optional output inversion)
    • or/nor -> De Morgan's: OR(a,b) = NOT(AND(NOT a, NOT b))
    • xor/xnor -> Three AND gates: XOR(a,b) = NOT(AND(NOT(AND(a, NOT b)), NOT(AND(NOT a, b))))
    • buf/not -> Pass-through with optional inversion
    • UDP -> Sum-of-products from truth table
  4. Record the output with cell origin (for SDF timing annotation)
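The gate rewrites in step 3 can be tried out on a toy and-inverter graph. The class below is a self-contained sketch, not Jacquard's AIG type: it uses an AIGER-style literal encoding (2*node + invert bit, node 0 is constant false), which is an assumption made for brevity.

```python
from itertools import product

class Aig:
    """Toy and-inverter graph. Literal = 2*node + invert; node 0 is constant 0."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.ands = []  # ands[k] = (lit_a, lit_b) defines node num_inputs + 1 + k

    def input(self, i):
        return 2 * (i + 1)

    def and_(self, a, b):
        self.ands.append((a, b))
        return 2 * (self.num_inputs + len(self.ands))

    def or_(self, a, b):
        # De Morgan: OR(a, b) = NOT(AND(NOT a, NOT b))
        return self.and_(a ^ 1, b ^ 1) ^ 1

    def xor_(self, a, b):
        # XOR(a, b) = OR(AND(a, NOT b), AND(NOT a, b))
        return self.or_(self.and_(a, b ^ 1), self.and_(a ^ 1, b))

    def eval(self, lit, inputs):
        vals = [0] + list(inputs)
        for a, b in self.ands:
            vals.append(self._lit(vals, a) & self._lit(vals, b))
        return self._lit(vals, lit)

    @staticmethod
    def _lit(vals, lit):
        return vals[lit >> 1] ^ (lit & 1)

aig = Aig(2)
x = aig.xor_(aig.input(0), aig.input(1))
print("AND nodes used for XOR:", len(aig.ands))
for a, b in product((0, 1), repeat=2):
    print(a, b, "->", aig.eval(x, (a, b)))
```

Inversion is free in this representation (it is just the low bit of a literal), which is why nand, nor and xnor cost no more than their non-inverted forms.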

CellInputs struct

CellInputs has named fields for all possible input pins across all SKY130 cells (A, B, C, D, A_N, B_N, S, S0, S1, CIN, SET_B, RESET_B, etc.). The set_pin() method maps netlist pin names to AIG pin indices.

For a new PDK: If the PDK introduces pin names not in the current struct, add new fields.

Step 7: AIG Builder Integration

Reference: src/aig.rs -- get_sky130_dependencies(), sky130_preprocess(), sky130_postprocess()

The AIG builder processes cells in three phases during topological traversal:

Dependencies (what must be built before this cell)

  • Tie cells: No dependencies
  • Sequential cells: Only SET_B and RESET_B pins (the data input D is handled by the DFF mechanism, not combinational decomposition)
  • Combinational cells: All input pins

Preprocessing (before dependencies are resolved)

  • Sequential cells: Create a DFF output AIG pin. This establishes the state element before the combinational cone driving it is built.

Postprocessing (after all dependencies are resolved)

  • Tie cells: Wire HI to constant-1, LO to constant-0
  • Sequential cells: Apply reset/set logic: Q = AND(OR(Q_state, NOT SET_B), RESET_B) (active-low semantics)
  • Combinational cells: Call decompose_with_pdk() and wire the resulting AND gates into the AIG

For a new PDK: The three-phase structure is reusable. You need PDK-specific implementations of each phase that handle the new cell types' pin names and reset/set conventions.
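The SKY130 set/reset expression is easy to sanity-check in isolation, and doing so makes the polarity question explicit when porting. The first function below restates Q = AND(OR(Q_state, NOT SET_B), RESET_B) directly. The second is a hypothetical active-high variant for a PDK with SET/RESET pins, included only to show what would change.

```python
def dff_output_active_low(q_state, set_b, reset_b):
    """SKY130 convention: SET_B / RESET_B assert when 0; reset wins over set."""
    return (q_state | (1 - set_b)) & reset_b

def dff_output_active_high(q_state, set_, reset):
    """Hypothetical PDK with active-high SET / RESET pins; reset still wins."""
    return (q_state | set_) & (1 - reset)

print(dff_output_active_low(0, set_b=0, reset_b=1))   # set asserted
print(dff_output_active_low(1, set_b=1, reset_b=0))   # reset asserted
print(dff_output_active_low(1, set_b=1, reset_b=1))   # neither: follows Q_state
```

Feeding an active-high reset into the active-low form holds Q at 0 whenever reset is idle, which is the "every DFF stuck" failure mode.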

Step 8: CLI Integration

Reference: src/bin/jacquard.rs

The load_design function detects the library and creates the netlist with the appropriate pin provider:

let lib = detect_library_from_file(&args.netlist_verilog)?;
let netlistdb = match lib {
    CellLibrary::SKY130 => NetlistDB::from_sverilog_file(&paths, &SKY130LeafPins),
    CellLibrary::AIGPDK => NetlistDB::from_sverilog_file(&paths, &AIGPDKLeafPins()),
    CellLibrary::Mixed => panic!("Mixed libraries not supported"),
};

For a new PDK: Add a match arm for the new library.

Testing Strategy

Unit tests

  1. Cell type extraction: Verify prefix/suffix stripping

  2. Pin directions: Spot-check common cells

  3. Behavioral model parsing: Parse each cell type, verify gate count

  4. Decomposition correctness: For each combinational cell, exhaustively test all input combinations against the PDK's truth table:

    #[test]
    fn test_all_cells_vs_pdk() {
        let pdk = load_test_pdk();
        for (cell_type, model) in &pdk.models {
            // For each input combination:
            //   1. Evaluate behavioral model directly
            //   2. Decompose to AIG and evaluate AIG
            //   3. Assert outputs match
        }
    }

    This test exists in src/sky130_pdk.rs as test_all_cells_vs_pdk and covers every combinational cell against every input combination.

Integration tests

  1. Small test circuit: Synthesize a simple design (DFF + some gates) to the new PDK and verify simulation output matches a reference (e.g., iverilog)
  2. Flash boot test: If targeting an SoC, verify the CPU boots and reads from flash (this exercises sequential logic, combinational cones, and IO)

File Checklist

For a complete PDK integration, you need:

| File | Purpose |
|---|---|
| src/<pdk>.rs | LeafPinProvider, library detection, cell type extraction |
| src/<pdk>_pdk.rs | Cell classification, model parsing, AIG decomposition |
| src/aig.rs | AIG builder hooks (dependencies, pre/post-process) |
| src/sky130.rs | Update CellLibrary enum |
| src/bin/jacquard.rs | CLI match arms for new library |
| vendor/<pdk>/ | PDK cell models (git submodule) |

Common Pitfalls

  • Cell name collisions: Do not use prefix matching for cell classification. dlygate4sd3 starts with "dl" but is not a latch. Always derive the exhaustive list from behavioral models.

  • Active-low vs active-high resets: SKY130 uses active-low RESET_B and SET_B. Other PDKs may use active-high. Get this wrong and every DFF will be stuck.

  • Multi-output cells: The AIG builder processes one output pin at a time. If a cell has both Q and Q_N outputs (e.g., dfbbp), the second output must be derived from the first (Q_N = NOT Q), not decomposed independently.

  • Liberty file size: SKY130's liberty files are 12MB+. If your PDK has similarly large files, ensure the parser doesn't OOM or timeout.

  • Power/ground pins: Post-layout netlists often include VPWR/VGND pins. Use the unpowered netlist variant (.nl.v not .pnl.v in OpenLane2) or handle power pins as constants in the pin provider.

  • Hold-time repair buffers: P&R tools insert delay buffers (like dlygate4sd3) that must be treated as combinational. If your PDK's delay cells have names that collide with sequential cell prefixes, the whitelist approach prevents misclassification.

Adding third-party IP via runtime manifest

If you're adding a memory macro, IO pad, hard block, or filler library — anything that doesn't need new AIG decomposition rules — the runtime cell-library pathway (ADR 0010) is the right route. No Jacquard PR required. Ship a Verilog blackbox file plus a TOML manifest alongside your design.

Step 1: Provide the cell's Verilog interface

The blackbox just declares the cell's module + port directions. The foundry typically ships this (<library>__blackbox.v). Example for the OCD GF180MCU SRAM:

module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (CLK, CEN, GWEN, WEN, A, D, Q);
  input CLK;
  input CEN;
  input GWEN;
  input [7:0] WEN;
  input [9:0] A;
  input [7:0] D;
  output [7:0] Q;
endmodule

Step 2: Write the TOML manifest

Co-locate <library>.cells.toml next to <library>.v (it autoloads when present) or pass it via --cell-manifest:

schema_version = "1.0"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

Recognised kind values in v1.0: std, dff, latch, clock_gate, ram, filler, endcap, tap, io_pad_input, io_pad_output, io_pad_bidir, delay, multi_output, tie_high, tie_low.

Step 3: Invoke jacquard with the manifest

jacquard sim my_chip.v stim.vcd out.vcd 1 \
    --cell-library deps/gf180mcu_ocd_ip_sram/cells/gf180mcu_ocd_ip_sram__sram1024x8m8wm1/gf180mcu_ocd_ip_sram__sram1024x8m8wm1__blackbox.v

The --cell-library flag is repeatable for multi-IP designs.

What kind = "ram" delivers — opaque vs explicit-port modes

There are two modes depending on whether the manifest includes a [cells.NAME.ram] port-mapping sub-table:

Opaque mode (no ram sub-table, schema v1.0+): the cell's output pins become X-source slots in the AIG. The SRAM's internal memory behaviour is not modelled. Sufficient for designs whose CPU executes from boot ROM / register file and never reads SRAM contents at the timescales Jacquard simulates.

Explicit-port mode (with ram sub-table, schema v1.1+, ADR 0011): outputs are wired to a real AIG-backed RAMBlock, writes populate per-entry storage, reads return what was written. Real memory modelling end-to-end. Use this when the CPU reads its own SRAM (the common case for any design beyond heartbeat verification).

Schema (full example, mirroring the upstream OCD GF180MCU SRAM):

schema_version = "1.1"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock        = { pin = "CLK", edge = "pos" }
chip_enable  = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask   = { pin = "WEN", polarity = "low", granularity = "bit" }
address      = "A"
data_in      = "D"
data_out     = "Q"

Field semantics, defaults, and the multi-port-SRAM/async/wider-than-32-bit out-of-scope items are documented in ADR 0011. Polarity defaults to low; clock edge defaults to pos; mask granularity defaults to bit. All three control pins (chip_enable / write_enable / write_mask) are optional — omit them for sync SRAMs without those signals.
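To make the polarity and mask fields concrete, here is a behavioural sketch of a single-port synchronous RAM with the pin conventions from the example manifest (all three controls active-low, per-bit write mask). It is an illustration under stated assumptions, not the RAMBlock implementation: in particular, the choices that Q is updated only on read cycles and is held while the chip is disabled are assumptions; ADR 0011 is the authority on the real semantics.

```python
class SyncRam:
    """Single-port sync RAM sketch: CEN/GWEN/WEN all active-low, bit-granular mask."""

    def __init__(self, depth, width):
        self.width_mask = (1 << width) - 1
        self.mem = [0] * depth
        self.q = 0

    def posedge(self, cen, gwen, wen, addr, data):
        if cen:                      # chip_enable polarity "low": 1 means disabled
            return self.q
        if gwen == 0:                # write_enable polarity "low"
            write_bits = ~wen & self.width_mask   # mask bit 0 means "write this bit"
            self.mem[addr] = (self.mem[addr] & ~write_bits) | (data & write_bits)
        else:
            self.q = self.mem[addr]  # assumption: Q updates on read cycles only
        return self.q

ram = SyncRam(depth=1024, width=8)
ram.posedge(cen=0, gwen=0, wen=0x00, addr=3, data=0xAB)   # full write
ram.posedge(cen=0, gwen=0, wen=0xF0, addr=3, data=0x12)   # low nibble only
readback = ram.posedge(cen=0, gwen=1, wen=0xFF, addr=3, data=0)
held = ram.posedge(cen=1, gwen=1, wen=0xFF, addr=0, data=0)
print(hex(readback), hex(held))
```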

Preloading SRAM contents at sim start

Once a SRAM is in explicit-port mode, its contents can be preloaded from an ELF file via sim_config.json:

{
  "sram_init": {
    "elf_path": "build/firmware.elf"
  }
}

The ELF's PT_LOAD segments are packed into the SRAM's backing storage before tick 0; the lowest loadable virtual address is taken as SRAM address 0. Single-SRAM designs only — multi-SRAM instance-targeting is a future schema extension (issue #80).
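The address mapping can be illustrated without an ELF parser by starting from already-extracted (virtual address, bytes) pairs. The sketch below assumes an 8-bit-wide SRAM, so one byte maps to one SRAM entry; packing for wider words is not specified here and is left out. The pack_segments name is hypothetical.

```python
def pack_segments(segments, depth):
    """Place (vaddr, data) segments into a byte-wide SRAM image.

    The lowest segment address becomes SRAM address 0.
    """
    base = min(vaddr for vaddr, _ in segments)
    mem = bytearray(depth)
    for vaddr, data in segments:
        off = vaddr - base
        if off + len(data) > depth:
            raise ValueError(f"segment at {vaddr:#x} does not fit in {depth} entries")
        mem[off:off + len(data)] = data
    return base, mem

segments = [
    (0x8000_0010, b"\x01\x02"),
    (0x8000_0000, b"\xAA"),
]
base, image = pack_segments(segments, depth=1024)
print(hex(base), image[0], image[0x10], image[0x11])
```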

Other kinds

  • filler, endcap, tap — physical-only, contribute no logic.
  • io_pad_input / io_pad_output / io_pad_bidir — pad-level behaviour (parallel to the built-in gf180mcu_ws_io__* family).
  • dff, latch, clock_gate, delay, multi_output — recognised but the v1.0 schema doesn't yet expose enough port semantics to drive AIG construction for these. Coming in the port-mapping schema (future ADR). For now, declaring these kinds documents intent without changing behaviour.

Troubleshooting VCD Input Issues

This guide helps debug VCD input problems where Jacquard simulations produce incorrect results or warn about missing signals.

NEW: Jacquard now automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope manually.

How Auto-Detection Works

When you run jacquard sim without specifying --input-vcd-scope, Jacquard:

  1. Extracts the list of required input ports from your synthesized design
  2. Searches the VCD file for scopes containing all required ports
  3. Tries common DUT scope names first: dut, uut, DUT, UUT, or your module name
  4. Falls back to any scope that contains all required ports
  5. Logs which scope was selected for transparency

Example Output

INFO No VCD scope specified - attempting auto-detection
DEBUG Searching for VCD scope containing 4 input ports
DEBUG Required ports: {"din_valid", "clk", "reset", "din"}
INFO Auto-detected VCD scope: safe_tb/uut (matched common pattern 'uut')
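The selection order described under How Auto-Detection Works can be sketched in a few lines, which is also a quick way to predict which scope a given VCD will yield. This is an approximation written from the steps listed above, not the simulator's code; the function names are hypothetical and only the VCD header is inspected.

```python
COMMON_NAMES = ("dut", "uut", "DUT", "UUT")

def scopes_with_ports(vcd_text):
    """Map each slash-separated scope path to the signal names declared in it."""
    stack, found = [], {}
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[0] == "$scope":
            stack.append(tok[2])
        elif tok[0] == "$upscope":
            stack.pop()
        elif tok[0] == "$var":
            found.setdefault("/".join(stack), set()).add(tok[4])
        elif tok[0] == "$enddefinitions":
            break
    return found

def detect_scope(vcd_text, required, module_name=None):
    """Prefer common DUT names, then fall back to any scope with all ports."""
    candidates = [s for s, names in scopes_with_ports(vcd_text).items()
                  if required <= names]
    preferred = COMMON_NAMES + ((module_name,) if module_name else ())
    for name in preferred:
        for scope in candidates:
            if scope.split("/")[-1] == name:
                return scope
    return candidates[0] if candidates else None

HEADER = '''\
$scope module safe_tb $end
$var wire 1 ! clk $end
$scope module uut $end
$var wire 1 ! clk $end
$var wire 1 " reset $end
$var wire 4 # din [3:0] $end
$var wire 1 $ din_valid $end
$upscope $end
$upscope $end
$enddefinitions $end
'''
scope = detect_scope(HEADER, {"din_valid", "clk", "reset", "din"})
print(scope)
```

The result uses slash separators, matching the form expected by --input-vcd-scope.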

Manual Override

If auto-detection fails or selects the wrong scope, use --input-vcd-scope to specify manually:

# Slash-separated path to the DUT scope
jacquard sim design.gv input.vcd output.vcd 8 \
    --input-vcd-scope "testbench/dut"

# For nested hierarchies
jacquard sim design.gv input.vcd output.vcd 8 \
    --input-vcd-scope "top_tb/subsystem/my_module"

Note: Use slash separators (/), not dots (.).


Symptom: Missing Primary Input Warnings

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "din", Some(3)) not present in the VCD input

Root Cause

Jacquard expects VCD signals at the absolute top level, with no module hierarchy prefix. The signal names must exactly match the synthesized module's port names.

How to Check

  1. Inspect your VCD file:
     grep '\$var' your_input.vcd | head -20
  2. Look for module scopes:
     grep '\$scope module' your_input.vcd
  3. Check synthesized module ports:
     head -20 your_design_synth.gv

What Jacquard Expects

Correct - Signals at top level:

$timescale 1ns $end
$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
$var wire 1 % unlocked $end
$enddefinitions $end
$dumpvars
0"
0$
0%
1!
$end
#10
1"
#20
b1100 #
1$

Incorrect - Signals scoped under module:

$scope module testbench $end
  $scope module dut $end
    $var wire 1 ! clk $end
    $var wire 1 " reset $end
    $var wire 4 # din [3:0] $end
    ...
  $upscope $end
$upscope $end

Solution 1: Flat VCD Generation

Create a testbench that dumps signals at absolute top level:

module testbench;

reg clk = 0;
reg reset;
reg [3:0] din;
reg din_valid = 0;
wire unlocked;

// DUT instantiation
your_module dut (
    .clk(clk),
    .reset(reset),
    .din(din),
    .din_valid(din_valid),
    .unlocked(unlocked)
);

always #10 clk = !clk;

initial begin
    // CRITICAL: Dump signals at top level (depth 1)
    // NOT inside module hierarchy!
    $dumpfile("output.vcd");
    $dumpvars(1, clk, reset, din, din_valid, unlocked);

    // Test sequence
    reset = 1;
    #60;
    reset = 0;

    // ... your test stimulus ...

    #200;
    $finish;
end

endmodule

Key Point: $dumpvars(1, signal1, signal2, ...) dumps individual signals at the current scope level, not inside child modules.

Compile and Run

# For synthesis-compatible testbench
iverilog -DSYNTHESIS -o sim your_design.v testbench.v
./sim

# Check VCD structure
grep '\$scope' output.vcd  # Should be minimal or none
grep '\$var' output.vcd | head -10

Solution 2: Post-Process VCD (Advanced)

If you can't change the testbench, post-process the VCD to flatten hierarchy:

#!/usr/bin/env python3
"""Flatten VCD hierarchy to top level.

Removes all $scope/$upscope lines, keeps $var declarations from the root
scope, and drops $var declarations from nested scopes.
"""

import sys


def flatten_vcd(input_vcd, output_vcd):
    with open(input_vcd) as inf, open(output_vcd, 'w') as outf:
        scope_depth = 0

        for line in inf:
            stripped = line.strip()

            # Track scope depth; the scope markers themselves are not written
            if stripped.startswith('$scope'):
                scope_depth += 1
                continue
            if stripped.startswith('$upscope'):
                scope_depth -= 1
                continue

            # Keep root-scope signals, skip signals inside nested scopes.
            # Value changes for skipped signals are left in the output, so
            # their identifier codes no longer have a $var declaration.
            if scope_depth > 1 and stripped.startswith('$var'):
                continue

            outf.write(line)


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <input.vcd> <output.vcd>")
    flatten_vcd(sys.argv[1], sys.argv[2])

Usage:

python3 flatten_vcd.py hierarchical.vcd flat.vcd

Solution 3: VCD Scope Option (Experimental)

GEM provides --input-vcd-scope to specify which module hierarchy to read:

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    --input-vcd-scope module_name

Known Issue: Currently, signal matching still fails even with correct scope specified. This is under investigation.

Diagnostic Checklist

1. Verify Signal Names Match

Synthesized Module:

grep "^module\|input\|output" design_synth.gv

Output:

module safe(clk, reset, din, din_valid, unlocked);
  input clk;
  input reset;
  input [3:0] din;
  input din_valid;
  output unlocked;

VCD Signals:

grep '\$var.*\(clk\|reset\|din\|unlocked\)' input.vcd

Output should match synthesized port names exactly.
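
The same comparison can be scripted so that missing ports are reported directly. The sketch below is an assumption-laden illustration: `netlist_inputs`, `vcd_signal_names`, and `missing_inputs` are hypothetical helpers, the netlist is assumed to declare one input per `input ...;` line (as in the output above), and each `$var` is assumed to fit on one line.

```python
import re
from typing import Set

# One "input [msb:lsb] name;" declaration per line is assumed.
INPUT_RE = re.compile(r"^\s*input\s+(?:\[[^\]]+\]\s*)?(\w+)\s*;", re.MULTILINE)
# $var <type> <width> <id> <name> ...
VAR_RE = re.compile(r"^\s*\$var\s+\w+\s+\d+\s+\S+\s+(\S+)", re.MULTILINE)


def netlist_inputs(netlist_text: str) -> Set[str]:
    """Names of the input ports declared in a synthesized netlist."""
    return set(INPUT_RE.findall(netlist_text))


def vcd_signal_names(vcd_text: str) -> Set[str]:
    """Names of all signals declared in a VCD header, ignoring scopes."""
    return set(VAR_RE.findall(vcd_text))


def missing_inputs(netlist_text: str, vcd_text: str) -> Set[str]:
    """Input ports of the netlist that have no same-named VCD signal."""
    return netlist_inputs(netlist_text) - vcd_signal_names(vcd_text)


netlist = """\
module safe(clk, reset, din, din_valid, unlocked);
  input clk;
  input reset;
  input [3:0] din;
  input din_valid;
  output unlocked;
"""
vcd = """\
$var reg 1 ! clk $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
"""
print("Missing from VCD:", sorted(missing_inputs(netlist, vcd)))
```

The check is by name only; it does not verify that the matching signals sit in the scope that is actually read.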

2. Check Signal Bit Widths

Multi-bit signals must have correct indices:

Synthesized: input [3:0] din;

VCD:

$var reg 4 # din [3:0] $end

GEM expects separate indices: din[3], din[2], din[1], din[0]
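
To make the per-bit naming concrete, the following sketch expands a VCD range such as `[3:0]` into individual bit names. `expand_bits` is a hypothetical helper written for illustration; how Jacquard represents the bits internally is not shown here.

```python
import re
from typing import List

RANGE_RE = re.compile(r"\[(\d+):(\d+)\]")


def expand_bits(name: str, bit_range: str = "") -> List[str]:
    """Expand "din" with range "[3:0]" into per-bit names, MSB first.

    A signal without a range is returned unchanged as a one-element list.
    """
    match = RANGE_RE.fullmatch(bit_range.strip())
    if match is None:
        return [name]
    msb, lsb = int(match.group(1)), int(match.group(2))
    step = -1 if msb >= lsb else 1
    return [f"{name}[{i}]" for i in range(msb, lsb + step, step)]


print(expand_bits("din", "[3:0]"))
print(expand_bits("reset"))
```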

3. Verify Timestamp Format

GEM expects integer timestamps (not real numbers):

Correct:

#0
#10
#20

Incorrect:

#0.0
#10.5
#20.25
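
A quick way to find offending timestamps is to scan for lines that start with `#` but are not followed by plain digits. This is a minimal sketch with a hypothetical helper name, `non_integer_timestamps`; it treats every line beginning with `#` as a timestamp, which holds for typical simulator output.

```python
import re
from typing import Iterable, List, Tuple

TIMESTAMP_RE = re.compile(r"#[0-9]+")


def non_integer_timestamps(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for timestamps that are not integers."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.strip()
        if text.startswith("#") and not TIMESTAMP_RE.fullmatch(text):
            bad.append((lineno, text))
    return bad


sample = ["#0", "1!", "#10.5", "0!", "#20"]
for lineno, text in non_integer_timestamps(sample):
    print(f"line {lineno}: non-integer timestamp {text}")
```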

4. Check Timescale

Ensure VCD timescale matches simulation expectations:

$timescale 1ns $end

or

$timescale 1ps $end

Clock periods in testbench should use same time unit.

Validation Steps

After fixing VCD issues, validate GEM is reading inputs correctly:

1. Run with CPU Verification

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    --check-with-cpu

This compares GPU results against CPU gate-level simulation. Should print:

[INFO] sanity test passed!

2. Compare Output VCD with Reference

Run same design with iverilog:

iverilog -o reference_sim design.v testbench.v
./reference_sim  # Generates reference.vcd

Compare outputs:

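# Check if the unlocked signal toggles the same in both.
# '%' is the identifier code of unlocked in the example VCD above; each
# file's $var line gives the actual code, which can differ between files.
grep '^[01]%' gem_output.vcd
grep '^[01]%' reference.vcd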
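
Because identifier codes can differ between the two files, comparing by signal name is more robust than grepping for a fixed code. The sketch below uses a hypothetical helper, `signal_changes`, which handles scalar (1-bit) signals only and assumes one declaration or value change per line. It compares exact timestamps, so two runs that use different timescales or offsets will not match.

```python
from typing import Iterable, List, Optional, Tuple


def signal_changes(lines: Iterable[str], name: str) -> List[Tuple[int, str]]:
    """Return (time, value) changes for the first scalar signal called `name`."""
    ident: Optional[str] = None
    time = 0
    changes: List[Tuple[int, str]] = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        first = tokens[0]
        if first == "$var" and len(tokens) >= 5 and tokens[4] == name:
            if ident is None:
                ident = tokens[3]            # identifier code for this signal
        elif first.startswith("#") and first[1:].isdigit():
            time = int(first[1:])            # new simulation time
        elif (ident is not None and len(tokens) == 1
              and first[0] in "01xXzZ" and first[1:] == ident):
            changes.append((time, first[0]))  # scalar change: <value><id>
    return changes


gem = ["$var wire 1 ! unlocked $end", "$enddefinitions $end",
       "#0", "0!", "#130", "1!"]
ref = ["$var wire 1 % unlocked $end", "$var reg 1 ! clk $end",
       "$enddefinitions $end", "#0", "0%", "1!", "#10", "0!", "#130", "1%"]

same = signal_changes(gem, "unlocked") == signal_changes(ref, "unlocked")
print("unlocked matches reference:", same)
```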

3. Check Cycle Count

cargo run -r --features metal --bin jacquard -- sim \
    design.gv input.vcd output.vcd 48 \
    2>&1 | grep "total number of cycles"

Should match your testbench's simulation time / clock period.
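
The expected figure can be estimated from the VCD itself: take the last timestamp and divide by the clock period, both in timescale units. This is a rough sketch with hypothetical helper names; whether the reported cycle count includes a partial or initial cycle is not specified here, so treat the result as approximate.

```python
from typing import Iterable


def last_timestamp(lines: Iterable[str]) -> int:
    """Largest integer timestamp (#<n>) found in the VCD body."""
    last = 0
    for line in lines:
        text = line.strip()
        if text.startswith("#") and text[1:].isdigit():
            last = max(last, int(text[1:]))
    return last


def expected_cycles(lines: Iterable[str], clock_period: int) -> int:
    """Whole clock periods that fit into the simulated time span."""
    return last_timestamp(lines) // clock_period


body = ["#0", "1!", "#10", "0!", "#20", "1!", "#200"]
# A 20-unit period corresponds to "always #10 clk = !clk;".
print("expected cycles:", expected_cycles(body, clock_period=20))
```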

Common Pitfalls

1. Testbench Inside `ifndef SYNTHESIS

If testbench is only compiled when SYNTHESIS is not defined:

`ifndef SYNTHESIS
module testbench;
  // ...
endmodule
`endif

You must compile without -DSYNTHESIS for VCD generation:

iverilog -o sim design.v testbench.v  # No -DSYNTHESIS!

But the DUT must be compiled with -DSYNTHESIS if it has non-synthesizable constructs:

# Separate compilation
iverilog -DSYNTHESIS -c design.v
iverilog -o sim design.v testbench.v

2. X/Z Values in VCD

GEM may not handle unknown (X) or high-impedance (Z) values correctly:

$dumpvars
x"  # reset = X
bxxxx #  # din = XXXX

Solution: Initialize all inputs in testbench:

initial begin
    reset = 0;  // Don't leave uninitialized
    din = 4'h0;
    din_valid = 0;
end
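
To confirm that no X or Z values remain after initializing the inputs, the VCD body can be scanned for them. The sketch below uses a hypothetical helper, `unknown_values`; it inspects only the value part of each change, so identifier codes that happen to contain the letters x or z are not reported.

```python
from typing import Iterable, List, Tuple


def unknown_values(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (time, line) for value changes that contain X or Z."""
    found: List[Tuple[int, str]] = []
    time = 0
    in_body = False
    for line in lines:
        text = line.strip()
        if text.startswith("$enddefinitions"):
            in_body = True                   # value changes start after this
            continue
        if not in_body or not text or text.startswith("$"):
            continue                         # header, blank, $dumpvars, $end
        if text.startswith("#"):
            if text[1:].isdigit():
                time = int(text[1:])
            continue
        if text[0] in "bB":
            value = text.split()[0][1:]      # vector: b<bits> <id>
        elif text[0] in "rR":
            continue                         # real values carry no X/Z
        else:
            value = text[0]                  # scalar: <value><id>
        if any(ch in "xXzZ" for ch in value):
            found.append((time, text))
    return found


sample = ["$enddefinitions $end", "$dumpvars", 'x"', "bxxxx #", "1!",
          "$end", "#10", '0"']
for time, text in unknown_values(sample):
    print(f"t={time}: {text}")
```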

3. Missing Clock Signal

If VCD doesn't include clock:

WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "clk", None) not present

Ensure:

  • Clock is generated in testbench
  • Clock is included in $dumpvars
  • Clock signal name matches synthesized netlist exactly

Example: Working Flat VCD Testbench

// testbench_flat.v - Generates GEM-compatible VCD
module testbench_flat;

// Declare all signals at top level
reg clk = 0;
reg reset = 1;
reg [3:0] din = 4'h0;
reg din_valid = 0;
wire unlocked;

// DUT instantiation
safe dut (
    .clk(clk),
    .reset(reset),
    .din(din),
    .din_valid(din_valid),
    .unlocked(unlocked)
);

// Clock generation
always #10 clk = !clk;  // 20ns period = 50MHz

// Test sequence
initial begin
    // CRITICAL: Dump at top level (depth 1)
    $dumpfile("safe_flat.vcd");
    $dumpvars(1, clk, reset, din, din_valid, unlocked);

    // Reset phase
    reset = 1;
    #60;  // 3 clock cycles
    reset = 0;
    #11;  // Small offset from clock edge

    // Apply test stimulus
    din = 4'hc;
    din_valid = 1;
    #20;

    din = 4'h0;
    #20;

    din = 4'hd;
    #20;

    din = 4'he;
    #20;

    din_valid = 0;
    #40;

    $finish;
end

endmodule

Compile and test:

# Compile (DUT must be SYNTHESIS-compatible)
iverilog -DSYNTHESIS -o sim safe.v testbench_flat.v

# Run simulation
./sim

# Verify VCD structure
echo "=== VCD Scopes ==="
grep '\$scope' safe_flat.vcd

echo -e "\n=== VCD Signals ==="
grep '\$var' safe_flat.vcd

# Should show signals at top level, no nested $scope modules

Still Having Issues?

  1. Enable debug logging:

    RUST_LOG=debug,vcd_ng=trace cargo run -r --features metal --bin jacquard -- sim <args> 2>&1 | tee debug.log
    
  2. Check with minimal test:

    • Create simplest possible design (single DFF)
    • Generate flat VCD
    • Verify GEM can read it correctly
  3. Report issue with:

    • Synthesized .gv file
    • Input VCD file
    • GEM command line
    • Error messages or unexpected output

Document Version: 1.0 Last Updated: 2025-01-08 Related: simulation-architecture.md

Handoff discipline

Handoffs in this project are ephemeral working memory, not historical record. They exist to bridge a single session boundary — when you stop working and someone else (Claude or human) picks up — and they are deleted once the work they describe is resolved.

This document defines what a handoff is, what it isn't, when to write one, and exactly what to do when one is resolved.

Why this discipline exists

Decision rationale, technical context, and project state all have natural homes:

  • ADRs (docs/adr/) capture architectural decisions and their why.
  • Design docs (docs/timing-model-extensions.md, etc.) capture how things work.
  • Plan docs (docs/plans/phase-0-ir-and-oracle.md, post-phase-0-roadmap.md) capture what's left and the next workstream slices.

When that content lives in a handoff instead, two things go wrong:

  1. It's not where contributors look. A new contributor reading the README → SUMMARY → ADR chain shouldn't have to dig through a stack of resolved handoff docs to find load-bearing decisions or the current state of a workstream.
  2. It rots out of sync with reality. Handoffs are point-in-time snapshots. A "STATUS: RESOLVED" banner doesn't help when the thing referenced has moved or changed; the canonical doc is what should hold the current truth.

The discipline closes this gap by forcing migration before deletion. Every load-bearing piece of a handoff lands in its proper home (ADR / design doc / plan doc) before the handoff file is removed.

What a handoff IS

A handoff lives in its own dedicated directory, separate from the persistent plan docs whose content it eventually feeds: a single markdown file at docs/handoffs/<topic>-handoff.md containing exactly what the next session needs to pick up where you left off:

  • Goal & next-up — what this session was trying to do, and what the very next concrete action is.
  • Done this session — commits landed, with one-line summaries.
  • Open follow-ups — the work that wasn't done, with enough scope detail to start cold.
  • Critical context — gotchas, surprising findings, environment specifics that aren't obvious from the code or docs yet.
  • Verification — the command(s) the next session runs to confirm the work is in the state you say it is.

Exactly one handoff exists at a time. There's no chain of resolved predecessors to wade through.
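
The at-most-one rule is easy to check mechanically. The following is a minimal sketch of such a check, not an existing project script: `open_handoffs` and `handoff_violation` are hypothetical names, and the only convention assumed is the docs/handoffs/<topic>-handoff.md path described above.

```python
from pathlib import Path
from typing import List, Optional


def open_handoffs(repo_root: str) -> List[str]:
    """File names of all handoff documents currently in the repository."""
    handoff_dir = Path(repo_root) / "docs" / "handoffs"
    if not handoff_dir.is_dir():
        return []
    return sorted(path.name for path in handoff_dir.glob("*-handoff.md"))


def handoff_violation(repo_root: str) -> Optional[str]:
    """Return an error message if more than one handoff exists, else None."""
    found = open_handoffs(repo_root)
    if len(found) > 1:
        return "expected at most one handoff, found: " + ", ".join(found)
    return None


if __name__ == "__main__":
    message = handoff_violation(".")
    print(message or "handoff check passed")
```

Zero handoffs also passes, which matches the case where a session ended at a clean stopping point.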

What a handoff IS NOT

  • Not a decision log. Decisions go in ADRs. If you find yourself writing "we chose X over Y because Z" in a handoff, that paragraph belongs in an ADR (or an existing ADR's "Consequences" / "Walk-back" section).
  • Not a design doc. "How clock arrival flows from OpenSTA Tcl through the IR into the GPU constraint buffer" is a design topic; it lives in docs/timing-model-extensions.md Part B, not in a handoff's "Critical context" section.
  • Not a status dashboard for the project. Workstream status lives in plan docs — phase-0-ir-and-oracle.md for current-phase WS state, post-phase-0-roadmap.md for forward-looking sequencing. A handoff cites those, doesn't reproduce them.
  • Not a historical record. git log is the historical record. Handoffs that survive past their resolution turn into noise that misleads new contributors.

When to write one

Write a handoff at the end of any session that:

  1. Leaves work in a partial state that someone else might pick up cold.
  2. Captures non-obvious context the next session needs (e.g. "the OpenSTA Tcl find_timing proc rejects -full_update; use ::sta::find_timing_cmd 1 directly").
  3. Documents the next concrete step with enough scope to start without re-discovering it.

If the session ended at a clean stopping point (everything merged, all decisions documented in ADRs/plans, nothing surprising), don't write a handoff. The plan doc already says what's next.

Resolution: fold, then delete

The two-location split is deliberate: handoffs live at docs/handoffs/<topic>-handoff.md while in flight; their content migrates into the persistent docs (docs/adr/, docs/plans/, design docs under docs/) at resolution. The handoff file then gets removed; nothing about the work is lost because everything load-bearing has a permanent home elsewhere.

When a handoff's work is done — whether in the next session or several sessions later — every load-bearing piece of it must be migrated to its proper home before the handoff file is deleted:

| If the handoff says... | It belongs in... |
|------------------------|------------------|
| "We chose approach X over Y because Z" | The relevant ADR's Decision/Consequences section, or a new ADR if no fit exists |
| "Future scope for WS-N: do A then B then C" | The plan doc's WS-N section (phase-0-ir-and-oracle.md or successor) |
| "Gotcha: OpenSTA's Tcl X behaves Y" | A code comment near the Tcl call site, or a design doc if the gotcha cuts across files |
| "Build dep Z is required on Linux" | The build script's apt-suggestion / Brewfile / README install section |
| "Subsystem A doesn't yet do B" | Plan doc as a new open item, or an ADR-tracked walk-back if it's a deferred design choice |
| "Run cargo test --features foo to verify" | The verification block in the relevant plan doc, or a test-running section in CLAUDE.md |

After migration, the handoff file is removed in the same commit as the migration:

git rm docs/handoffs/<topic>-handoff.md
git add <files-receiving-the-migrated-content>
git commit -m "$(cat <<'EOF'
docs: resolve <topic> handoff — fold into <where-it-went>

<one-paragraph summary of what was migrated and where>

Co-developed-by: Claude Code v<version> (<model-id>)
EOF
)"

The commit message records what migrated where — that's the audit trail. git log -- docs/handoffs/ then shows the project's handoff history (one add, one delete per session) without needing the files themselves to live forever.

Template

When you do need to write one, use this skeleton. Replace placeholders inline; delete sections that don't apply (better to omit a section than fill it with "N/A").

# Handoff — <Topic> (one-line summary of what this session left open)

**Created:** YYYY-MM-DD
**Working tree:** clean | <state if not clean>
**Branch:** main | <branch>

## Goal & next-up

**Goal of this session:** <what you were trying to do, in 1–3 sentences>

**Next session should pick up:** <the very next concrete action, by name. Reference the plan doc section if applicable.>

**Verification command:**
```sh
<commands the next session runs to confirm this handoff's claimed state>
# Expect: <what success looks like>
```

## Done this session

| Commit | Subject | Notes |
|--------|---------|-------|
| <sha>  |         |       |

## Open follow-ups (priority-ordered)

1. ()

<Concrete scope. Enough detail to start cold. Link to existing plan/ADR/design-doc sections rather than reproducing them.>

2. ...

## Critical context

<Things the next session needs to know that aren't yet in the code/docs. Be honest about what's truly load-bearing; anything obvious from git log or a quick grep doesn't belong here.>

## References


Resume in a new session with:
```
/resume_handoff docs/handoffs/<topic>-handoff.md
```


Tooling

The create_handoff and resume_handoff skills (from various Claude Code orchestration toolkits) generate and consume handoffs. They're optional; the discipline above is the load-bearing artifact. A handoff written by hand following this template is just as valid.

If you use one of those skills, expect it to default to YAML format under thoughts/shared/handoffs/ with database indexing. That doesn't apply to this project. Override it: produce markdown at docs/handoffs/<topic>-handoff.md and skip the database step. The skill activation is informational; the project's convention takes precedence.

Architecture Decision Records

ADRs capture decisions worth understanding later: the context, the options considered, and the rationale for the choice. They are numbered, append-only, and never silently rewritten — if a decision changes, supersede the old ADR with a new one and update the status.

Status legend

  • Accepted / Approved — current, in effect.
  • Accepted (partial) — design ratified and partly built; the ADR carries an ## Implementation status section (see below).
  • Proposed — drafted, not yet ratified.
  • Superseded — historical, replaced by a later ADR or by a spike outcome; kept for the audit trail.

Keeping status honest

An ADR's Status is a claim about the codebase, not an aspiration. Before setting or changing it, verify the claim against the implementation — read the code; don't trust the previous status or a feature's "done" framing. The same goes for any present-tense statement inside an ADR ("jitter feeds the setup/hold checker"): it's a verifiable claim, so check it.

  • Don't bump Proposed → Accepted just because a design merged. Confirm the decision is actually in effect in the code.
  • When a design is ratified but only partly built, use Accepted (partial) and add an ## Implementation status section splitting implemented (with file references) from deferred (with the specific gap). ADR 0012 is the worked example.
  • Deferred work gets a home: a plan under docs/plans/ and a tracking issue, cross-linked from the ADR's status section, so the unbuilt half isn't lost.

This extends to user-facing docs and --help text: a sentence telling the reader how the tool behaves is a verifiable claim — check it against the code before writing it.

Index

| #    | Title | Status |
|------|-------|--------|
| 0001 | OpenSTA as the timing correctness oracle and sole STA path | Accepted (scope expanded 2026-05-01) |
| 0002 | Timing intermediate representation | Accepted |
| 0003 | OpenTimer as in-process reference STA | Superseded (2026-05-01): spike failed; OpenSTA subprocess only |
| 0004 | Private PDK testing track | Accepted |
| 0005 | OpenSTA vendoring and test-corpus strategy | Accepted |
| 0006 | SDF preprocessing model and interim-to-release cutover | Accepted (amended 2026-05-02) |
| 0007 | Timing model fidelity roadmap | Proposed |
| 0008 | Structured timing output as first-class deliverable | Approved |
| 0009 | OpenSTA Verilog reader inputs | Accepted |
| 0010 | Declarative cell metadata | Accepted |
| 0011 | RAM port-mapping schema for declarative cell metadata | Accepted |
| 0012 | Reproducible CDC jitter injection for multi-clock cosim | Accepted (partial) |
| 0013 | Cosim peripheral model architecture | Accepted |
| 0014 | AIG as simulation intermediate representation | Accepted |
| 0015 | Boomerang execution model and GPU resource mapping | Accepted |
| 0016 | Selective X-propagation | Accepted |
| 0017 | Cosim execution model | Accepted |

How the ADRs relate

  • 0014 / 0015 document the core simulation pipeline: 0014 explains why the AIG (and-inverter graph) is the simulation IR — its uniform AND-gate structure enables the boomerang reduction tree and eliminates per-cell dispatch in the GPU kernel. 0015 describes the boomerang execution model itself — the 13-level hierarchical reduction tree, the GPU resource limits it imposes (8191 inputs, 8191 outputs, 4095 intermediates, 64 SRAM groups per partition), the hypergraph partitioning that distributes work across GPU blocks, and the packed instruction format (FlattenedScriptV1) consumed by the kernel. Together they document the path from gate-level Verilog to GPU kernel execution that the GEM paper describes.

  • 0001 / 0003 / 0005 / 0006 describe the timing oracle stack: OpenSTA as the ground truth (0001), vendored at a pinned revision with its own corpus reused (0005), driving SDF preprocessing out-of-process (0006). The earlier OpenTimer in-process plan (0003) was retired after the spike (../spikes/opentimer-sky130.md).

  • 0002 is the data contract those tools talk over — a JSON timing IR consumed by Jacquard, produced by opensta-to-ir.

  • 0004 governs how PDK-specific testing happens for NDA-bound contributors without leaking files into the public repo.

  • 0007 / 0008 are the forward-looking pair: 0008 (Approved) defines the structured timing output Jacquard owes downstream flows; 0007 (Proposed) sketches the model-fidelity work needed to back those outputs at scale (δ(T), clock-tree skew, wire delay). Scheduling for both lives in ../plans/post-phase-0-roadmap.md.

  • 0013 / 0017 cover the cosim runtime: 0013 documents the peripheral model architecture (CPU-side PeripheralModel trait, GPU-side kernel patterns, ring buffers, plural-config convention); 0017 documents the execution model (batch dispatch loop, multi-clock scheduler, edges-vs-cycles semantics).

  • 0016 accepts the selective X-propagation design documented in docs/selective-x-propagation.md. The full seven-phase design lives there; the ADR is a thin acceptance record with a summary of key choices.
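
The per-partition resource limits quoted for ADR 0015 above lend themselves to a simple bounds check. The sketch below is illustrative only: `PartitionStats` and `limit_violations` are hypothetical names, not types from the Jacquard source, and only the four limit values are taken from the text.

```python
from dataclasses import dataclass
from typing import List

# Per-partition limits as listed for ADR 0015 above.
LIMITS = {
    "inputs": 8191,
    "outputs": 8191,
    "intermediates": 4095,
    "sram_groups": 64,
}


@dataclass
class PartitionStats:
    """Hypothetical summary of one partition's resource usage."""
    inputs: int
    outputs: int
    intermediates: int
    sram_groups: int


def limit_violations(stats: PartitionStats) -> List[str]:
    """Describe every resource that exceeds its per-partition limit."""
    return [
        f"{name}: {getattr(stats, name)} > {limit}"
        for name, limit in LIMITS.items()
        if getattr(stats, name) > limit
    ]


too_big = PartitionStats(inputs=8192, outputs=100, intermediates=5000, sram_groups=1)
for problem in limit_violations(too_big):
    print(problem)
```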

Adding a new ADR

  1. Pick the next number (highest existing + 1).
  2. Filename: NNNN-short-kebab-title.md.
  3. Start with # ADR NNNN — <title> and a **Status:** line — set it to match the code, not the intent (see Keeping status honest).
  4. Standard sections: Context, Decision, Consequences. Add Amendment blocks dated when the decision is revisited; do not rewrite accepted history.
  5. Add the row to the table above.
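
Steps 1 and 2 can be automated with a small helper. This is a sketch under stated assumptions: `next_adr_filename` is a hypothetical function, and the example file names are invented to match the NNNN-short-kebab-title.md pattern rather than copied from the repository.

```python
import re
from typing import Iterable

ADR_FILE_RE = re.compile(r"^([0-9]{4})-[a-z0-9-]+\.md$")


def next_adr_filename(existing: Iterable[str], title: str) -> str:
    """Build NNNN-short-kebab-title.md using the highest existing number + 1."""
    numbers = []
    for name in existing:
        match = ADR_FILE_RE.match(name)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    # Lower-case the title and collapse everything else into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:04d}-{slug}.md"


existing_files = [
    "0016-selective-x-propagation.md",   # example names, assumed
    "0017-cosim-execution-model.md",
    "README.md",
]
print(next_adr_filename(existing_files, "Timing Output: Stage 2"))
```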

ADR 0001 — OpenSTA as the timing correctness oracle and sole STA path

Status: Accepted. Scope expanded 2026-05-01 — see Decision §3 below.

Context

Jacquard's current correctness validation for timing relies on its own CPU reference simulator (--check-with-cpu), which shares the Rust source tree, data structures, and parsers with the GPU simulation path. Representation bugs (e.g., hierarchical SDF prefix mismatch, inverter-collapse issues) have passed both paths silently because they affect both.

Historical regressions have been caught only by comparing against genuinely external tools — specifically CVC for functional simulation and, by implication, OpenSTA for timing. No format or tool inside Jacquard is currently treated as authoritative.

OpenSTA is widely deployed in open-source EDA (SKY130, OpenLane2, OpenROAD) and has the largest effective test surface of any open-source STA tool for the Liberty + SDF + Verilog + SPEF stack. It is licensed under GPL-3.0 and also sold commercially.

Jacquard requires permissive licensing for code linked into its binary (see ../project-scope.md).

Decision

OpenSTA is the ground-truth oracle for timing correctness and the sole STA path used by Jacquard.

  1. In the shipped release, OpenSTA is never invoked from the jacquard runtime binary, and never linked. Subprocess invocation from CI pipelines, test harnesses, and the standalone opensta-to-ir preprocessing tool (see ADR 0006) is acceptable — GPL's reciprocal requirements do not cross a subprocess boundary ("mere aggregation") and so Jacquard's permissive licensing is preserved. Pre-release, a runtime subprocess invocation may exist as a contributor-ergonomics convenience (per ADR 0006); it is removed before release.
  2. All timing, STA, and parser-related code paths are validated against OpenSTA on (a) a vendored subset of OpenSTA's own test corpus, and (b) representative Jacquard test designs.
  3. OpenSTA is also Jacquard's only STA path, not just its oracle. ADR 0003 originally proposed an in-process reference STA via OpenTimer to complement this oracle role; the spike (../spikes/opentimer-sky130.md) found OpenTimer's input pipeline unfit for OpenROAD-flow outputs (commit d002bde superseded ADR 0003). The role OpenTimer would have played — providing per-DFF clock arrival, structured timing data for the IR, etc. — now sits with OpenSTA, called out of process via opensta-to-ir. OpenSTA is therefore a required runtime dependency for any timing-aware Jacquard flow, not just for CI validation.
  4. Where Jacquard's output disagrees with OpenSTA's output past a declared tolerance, Jacquard is wrong until proven otherwise. Divergence is either fixed, explicitly justified in writing, or filed as a bug.
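
Point 4 amounts to a tolerance-gated comparison. The sketch below shows one way such a check could look; it is not the project's oracle-diff tool. `oracle_divergences`, the arc-key strings, and the use of a single absolute tolerance are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Divergence = Tuple[str, Optional[float], float]


def oracle_divergences(
    jacquard: Dict[str, float],
    oracle: Dict[str, float],
    tolerance: float,
) -> List[Divergence]:
    """List arcs where Jacquard is missing or differs from the oracle.

    Each entry is (arc, jacquard_value_or_None, oracle_value). The oracle
    is treated as ground truth, so only its arcs are iterated.
    """
    divergences: List[Divergence] = []
    for arc, expected in sorted(oracle.items()):
        actual = jacquard.get(arc)
        if actual is None or abs(actual - expected) > tolerance:
            divergences.append((arc, actual, expected))
    return divergences


oracle = {"u1/A->u1/Y": 0.120, "u2/A->u2/Y": 0.300, "u3/D->u3/Q": 0.450}
jacquard = {"u1/A->u1/Y": 0.121, "u2/A->u2/Y": 0.350}

for arc, actual, expected in oracle_divergences(jacquard, oracle, tolerance=0.005):
    print(f"{arc}: jacquard={actual} oracle={expected}")
```

Every reported entry then needs one of the three outcomes named above: a fix, a written justification, or a filed bug.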

Consequences

  • OpenSTA is a required runtime dependency for timing-aware Jacquard flows (post §3 expansion), not merely a CI/validation dependency. Users running jacquard sim --timing-ir ... need a .jtir produced by opensta-to-ir, which subprocesses OpenSTA. Documented in ../why-jacquard.md.
  • Subprocess integration preserves Jacquard's permissive licensing (satisfies project-scope.md).
  • "Oracle-diff clean" becomes a required CI gate for timing-related PRs, run nightly or pre-release (not per-PR — OpenSTA runs on large designs can be slow).
  • OpenSTA bugs may produce false-positive divergences. The expectation is to file upstream rather than work around silently. A pinned OpenSTA version in CI avoids drift. With OpenSTA now also the only STA path (not just the oracle), upstream regressions land in users' hands too — pinning matters more than before.
  • A vendored OpenSTA test corpus (or git submodule) is added to the repo as a fixture. Licensing of specific test inputs is verified per file before inclusion.
  • No second STA tool to maintain. The original ADR 0003 proposal would have given Jacquard a permissive-licensed in-process reference; the spike showed that's not achievable today with OpenTimer. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted.

References

  • ../project-scope.md — permissive-license constraint.
  • ../timing-correctness.md — principle P1, requirement R3.
  • ADR 0002 — timing IR (the concrete diff format used for oracle comparison).
  • ADR 0003 — Superseded. OpenTimer in-process reference; spike Q2 fail moved Jacquard to OpenSTA-only. See ../spikes/opentimer-sky130.md for the spike outcome.
  • ../why-jacquard.md — user-facing consequence: OpenSTA as a runtime dependency.

ADR 0002 — Timing intermediate representation

Status: Accepted.

Context

Jacquard currently parses SDF directly in src/sdf_parser.rs, a hand-rolled parser that has accumulated reactive fixes (empty () delays, (COND …) pin specs, backslash escapes, edge-qualified timing checks, TIMINGCHECK stripping workarounds for OpenLane2 output). Each new production failure has been a one-off patch.

Commercial tool output adds dialect variation (Cadence, Synopsys extensions). Future parser paths (Liberty, SPEF) and future reference tools (OpenSTA, OpenTimer) each carry their own data models. A format-per-consumer coupling structure will continue to spread parser complexity into the simulator.

The project needs:

  • A stable format we consume, with parser complexity isolated from simulator complexity.
  • A format that can be diffed between producers (two parsers of the same file must agree).
  • A format that supports multi-corner PVT values natively — commercial flows require this; single-corner shortcuts become retrofit pain.
  • Preservation of vendor-specific annotations so information is not silently discarded.
  • Fast consumption at sim startup (SDF parsing is currently on the critical path).

Decision

Introduce a timing intermediate representation (timing IR) for SDF-equivalent annotation data.

  • Binary format: FlatBuffers. Zero-copy reads, schema evolution, cross-language (Rust, C++ for OpenTimer adapter, Python for tooling).
  • Text sidecar: JSON, produced via FlatBuffers' JSON round-trip, for CI diffs and human inspection.
  • Schema versioning: explicit version field, compatible-evolution rules stated in schema comments. Breaking changes require a major version bump and migration notes.
  • Multi-corner native: timing values are min / typ / max across a declared set of PVT corners. Single-corner designs are represented as a single-element corner set.
  • Vendor extension passthrough: typed VendorExtension variants (VendorCadence, VendorSynopsys, VendorOther) carry unrecognised annotations as byte-typed blobs with source labels. Consumers opt in to understanding them; the IR never silently drops them.
  • Per-arc provenance: each timing arc records source tool, source file, and origin category — asserted (from SDF / input), computed (derived by an STA tool), defaulted (fallback because no better value was available). Provenance is inspectable at consumer side.
  • Scope boundary: the IR represents timing annotation data only. It is not a netlist representation, not a timing graph, not cell characterization. Attempts to extend it toward those adjacent formats are rejected — they become separate IRs if needed.
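
To make the shape of this data easier to picture, here is a plain-Python sketch of the multi-corner and provenance ideas. It is not the FlatBuffers schema: every class and field name below is hypothetical, vendor extensions are omitted, and the corner name is an invented example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Origin(Enum):
    ASSERTED = "asserted"    # taken from SDF / input
    COMPUTED = "computed"    # derived by an STA tool
    DEFAULTED = "defaulted"  # fallback, no better value available


@dataclass(frozen=True)
class MinTypMax:
    min: float
    typ: float
    max: float


@dataclass(frozen=True)
class Provenance:
    source_tool: str
    source_file: str
    origin: Origin


@dataclass
class TimingArc:
    from_pin: str
    to_pin: str
    delay_by_corner: Dict[str, MinTypMax]
    provenance: Provenance


@dataclass
class TimingIR:
    schema_version: str
    corners: List[str]       # a single-corner design has one entry
    arcs: List[TimingArc] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Report arcs that lack a value for any declared corner."""
        problems = []
        for arc in self.arcs:
            for corner in self.corners:
                if corner not in arc.delay_by_corner:
                    problems.append(
                        f"{arc.from_pin}->{arc.to_pin}: missing corner {corner}"
                    )
        return problems


ir = TimingIR(schema_version="1.0", corners=["tt_025C_1v80"])
ir.arcs.append(TimingArc(
    "u1/A", "u1/Y",
    {"tt_025C_1v80": MinTypMax(0.10, 0.12, 0.15)},
    Provenance("opensta-to-ir", "design.sdf", Origin.ASSERTED),
))
print(ir.validate())
```

Keeping the corner set on the container and the values on each arc is one possible layout; the real schema may organise this differently.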

Consequences

  • A new schema and format to maintain. Scope discipline is load-bearing: if the IR creeps toward being a full STA framework, it becomes duplicate work with OpenSTA/OpenTimer.
  • Parser complexity moves out of src/sdf_parser.rs (and its future rewrite, per ADR covering #3) into a focused converter crate. Unit-testable in isolation.
  • A diff-based test corpus becomes natural: multiple converters on the same input must produce equivalent IR. This is the enforcement mechanism for ADR 0001's oracle pattern.
  • Vendor extensions do not require Jacquard code changes — only converter updates.
  • Startup parse cost drops: reading binary IR is near-instant. SDF-to-IR conversion becomes a one-time preprocessing step, not repeated per sim.
  • Adopting FlatBuffers adds a code-generation step to the build, via flatc. Build hygiene (checked-in generated code, pinned flatc version, or a build-script integration) is required.
  • If the IR is ever shared across other tooling beyond Jacquard, its stability contract tightens. Flagged in open questions on timing-correctness.md; not resolved here.

References

  • ../project-scope.md — validation and permissive-license constraints.
  • ../timing-correctness.md — requirement R1, principle P5 (multi-corner).
  • ADR 0001 — OpenSTA oracle (IR is the diff format).
  • ADR 0003 — Superseded. OpenTimer was the proposed in-process reference STA; spike Q2 fail moved Jacquard to OpenSTA-only via opensta-to-ir. See ../spikes/opentimer-sky130.md.
  • ADR 0004 — private PDK testing (IR enables portable fixtures without leaking PDK data).

ADR 0003 — OpenTimer as in-process reference STA

Status: Superseded (2026-05-01). Spike (../spikes/opentimer-sky130.md) failed Q2 — OpenTimer's input pipeline cannot handle real OpenROAD-flow .v/.spef for SKY130 designs with bus ports. Fallback is OpenSTA subprocess validation only (ADR 0001); a future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later.

Context

Jacquard needs an in-process reference STA path to:

  • Validate SDF-derived timing against an independent computation at load time and on demand (requirement R2 in timing-correctness.md).
  • Provide exact per-edge arrival for top-K critical paths (requirement R4, pessimism-delta reporting).

OpenSTA (ADR 0001) is the ground-truth oracle but runs only as a subprocess — unsuitable for per-run, in-process checking. A linked alternative is needed.

Options surveyed:

  • OpenTimer (MIT, C++17). Parses .lib / .v / .spef / .sdc directly. Won TAU Timing Analysis Contests (2014 1st, 2015 2nd, 2016 1st); industry "Golden Timer" for benchmark comparisons. Actively maintained (latest push 2025-12-26 as of this writing). Does not parse SDF — timing is computed from Liberty + parasitics.
  • libreda-sta (Rust, permissive). Young framework, self-described as "basic components." Unknown whether it handles SKY130 Liberty robustly. Less mature than OpenTimer, so a higher maturity risk.
  • Tatum (MIT, C++). Analysis engine only; does not parse Liberty/SDF/Verilog. Using Tatum would require supplying our own parsers, so it does not solve the problem directly.
  • In-house Rust walker. Author-shared blind spots with Jacquard's main pipeline reduce the independence benefit.

Decision

Subject to the spike's success, OpenTimer becomes Jacquard's in-process reference STA, integrated via C++ FFI (bindgen or equivalent).

  • Linked directly; MIT licence satisfies project-scope.md.
  • Computes timing from .lib + .spef independently of any SDF-derived path. This is an accepted (and arguably preferable) property: the reference path shares no parsing with Jacquard's SDF consumer, so a parse bug on either side is detectable rather than mutually masked.
  • Emits timing IR (per ADR 0002) so its output is directly diffable against Jacquard's SDF-derived IR.

Spike criteria in ../spikes/opentimer-sky130.md. On spike failure, fallback is to drop the in-process reference entirely and rely on OpenSTA subprocess validation (ADR 0001). This weakens per-PR feedback on timing correctness but is not fatal.

Consequences

  • C++ FFI dependency; bindgen-generated bindings; build complexity rises modestly.
  • Direct linking preserves permissive licensing (MIT).
  • Three-way cross-check becomes the default in CI: Jacquard (SDF path) vs OpenTimer (Liberty+SPEF path) vs OpenSTA (subprocess, full ground truth). Three-way disagreement localises bugs to SDF parse / delay model / tool issue cleanly.
  • OpenTimer does not parse SDF. To use it in Jacquard's current flow, OpenLane2 (or equivalent) must produce SPEF alongside SDF. This plumbing change is tracked in the phase-0 plan.
  • OpenTimer's maturity is measured in contest benchmarks, not SKY130/GF130 real-flow output. Spike must verify it handles our actual Liberty and SPEF. The spike is structured to fail fast if it does not.
  • If OpenTimer is dropped post-spike, alternative in-process references (libreda-sta, in-house) can be revisited; this ADR would be superseded rather than amended.

References

  • ../project-scope.md — licensing constraint, validation constraint.
  • ../timing-correctness.md — requirement R2, requirement R4.
  • ../spikes/opentimer-sky130.md — spike and success criteria.
  • ADR 0001 — OpenSTA oracle.
  • ADR 0002 — timing IR (OpenTimer emits it).
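
The three-way cross-check listed under Consequences can be pictured as an odd-one-out test on a single arrival time. The sketch below is illustrative only: the function name, the picosecond tolerance parameter, and the string labels are assumptions, and it reports which source disagrees without attempting to attribute the disagreement to a specific cause.

```rust
/// Result of comparing one arrival time across the three sources.
#[derive(Debug, PartialEq)]
pub enum CrossCheck {
    AllAgree,
    /// Exactly one source disagrees with the other two, which agree with each other.
    Outlier(&'static str),
    /// No clean two-against-one split, so the comparison cannot localise the fault.
    NoConsensus,
}

/// True when `a` and `b` differ by at most `tol_ps` picoseconds.
fn close(a: f64, b: f64, tol_ps: f64) -> bool {
    (a - b).abs() <= tol_ps
}

/// Hypothetical helper: all inputs are arrival times in picoseconds.
pub fn cross_check(jacquard_ps: f64, opentimer_ps: f64, opensta_ps: f64, tol_ps: f64) -> CrossCheck {
    let jo = close(jacquard_ps, opentimer_ps, tol_ps);
    let js = close(jacquard_ps, opensta_ps, tol_ps);
    let os = close(opentimer_ps, opensta_ps, tol_ps);
    match (jo, js, os) {
        (true, true, true) => CrossCheck::AllAgree,
        (false, false, true) => CrossCheck::Outlier("jacquard"),
        (false, true, false) => CrossCheck::Outlier("opentimer"),
        (true, false, false) => CrossCheck::Outlier("opensta"),
        // All pairs disagree, or closeness is not transitive at this tolerance.
        _ => CrossCheck::NoConsensus,
    }
}
```

A two-tool comparison can only say that the tools differ; the third source is what lets a disagreement point at one side.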

ADR 0004 — Private PDK testing track

Status: Accepted. Plumbing tracked in the phase-0 plan.

Context

Some contributors and operators have access to commercial PDKs (GlobalFoundries, TSMC, and others) under NDA or licensing agreements that prohibit public redistribution of PDK files. Whether a given contributor has access is itself typically under NDA and not publicly known.

Commercial PDK Liberty libraries are substantially richer and quirkier than open-source alternatives — they include cell variants, conditional timing arcs, vendor-specific annotations, and characterization detail not present in SKY130 or AIGPDK. Several parser bugs live only on commercial PDK output.

SKY130-only coverage is insufficient for a sim tool used on commercial flows, and adding commercial PDK files to a public repository is not an option regardless of who operates the project.

The standard industry pattern for testing against proprietary PDKs is environment-gated test suites: tests run when the contributor has licensed access, and skip cleanly when they don't.

Decision

Establish a private PDK test track gated on per-PDK environment variables (e.g. GF130_PDK_PATH, TSMC_PDK_PATH, and similar — one per PDK).

  • Tests check for the required env var(s) and skip with a clear "PDK not available" message when unset.
  • When env vars point to a readable PDK directory, tests execute fully.
  • Only the test harness, expected structural outputs, and IR fixtures (where the PDK vendor licensing permits) are committed.
  • No PDK-derived artifacts (.lib, .sdf, .spef, characterization data) are committed to the public repository under any circumstances.
  • CI runners with configured PDK access execute the private track; public PRs from non-licensed contributors see the private tests as skipped, not as failures. Which runners have access is determined by whoever operates CI; this ADR does not name specific organisations.
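
A minimal sketch of the environment gate described above, assuming a hypothetical `probe_pdk` helper. The `lookup` closure stands in for `std::env::var(..).ok()` so the gate can be exercised without mutating the process environment. Treating a set-but-invalid path as a skip rather than a failure is an assumption of this sketch, not something the ADR specifies.

```rust
use std::path::{Path, PathBuf};

/// Outcome of probing one per-PDK environment variable.
#[derive(Debug, PartialEq)]
pub enum PdkAccess {
    /// The variable points at an existing directory; run the private tests.
    Available(PathBuf),
    /// Skip the test and report this message.
    Skip(String),
}

/// Hypothetical gate. `lookup` abstracts the environment for testability.
pub fn probe_pdk(var: &str, lookup: impl Fn(&str) -> Option<String>) -> PdkAccess {
    match lookup(var) {
        // `is_dir` checks existence only; real read permission is not verified here.
        Some(p) if !p.is_empty() && Path::new(&p).is_dir() => PdkAccess::Available(PathBuf::from(p)),
        Some(p) if !p.is_empty() => {
            PdkAccess::Skip(format!("PDK not available: {var}={p} is not a directory"))
        }
        _ => PdkAccess::Skip(format!("PDK not available: {var} is unset")),
    }
}
```

A private-track test would call the gate with the real environment, print the skip message, and return early on `Skip`, so that unlicensed contributors see a skipped test rather than a failure.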

The timing IR (ADR 0002) makes this feasible: converter output and diff results can be checked in as fixtures where they contain no PDK-licensed data. Expected behaviour can be asserted in terms of IR structure rather than in terms of specific cell timings that would leak characterization data.

Consequences

  • Contributors without PDK access cannot locally reproduce PDK-specific bugs. They rely on maintainer CI for validation.
  • A separate setup doc for licensed contributors is required (not public). Points at env-var configuration, test runner invocation, and PDK-file staging expectations.
  • Fixture schema must be PDK-agnostic enough that structural assertions don't implicitly leak cell-characterization data. Review process must check new fixtures against this rule before merge.
  • Bugs found via private PDK testing are, where possible, distilled into minimal public reproducers. The private track is not a place to park unreviewable tests — every private test should ideally surface a public fixture once the bug's essence is extracted.
  • CI cost rises (licensed runners). Runs are nightly or pre-release rather than per-PR.

References

  • ../project-scope.md — validation constraint.
  • ../timing-correctness.md — requirement R5.
  • ADR 0002 — timing IR (enables portable fixtures).

ADR 0005 — OpenSTA vendoring and test-corpus strategy

Status: Accepted.

Context

Under ADR 0001, OpenSTA is the ground-truth oracle for timing correctness, invoked as a subprocess. Phase 0 (../plans/phase-0-ir-and-oracle.md) requires:

  • A reproducible, pinned OpenSTA reference so CI diffs are comparable run-to-run.
  • Access to OpenSTA's test inputs for stress testing our OpenSTA-driven converters.
  • Separately, a primary regression corpus representative of Jacquard's actual use cases.

Two questions were considered jointly: (a) how we pin / reference the OpenSTA codebase, and (b) how we use their test data.

On vendoring source: OpenSTA is licensed GPL-3.0. Copying its source into Jacquard's repository as committed code creates licensing ambiguity for a permissive-licensed project. Git submodules are conventionally treated differently — the parent repository pins a commit reference, does not incorporate the submodule's source into its own commits, and inherits no license obligations from the submodule's presence. This convention is widely relied on in permissive projects that depend on GPL tooling at arm's length.

On test data: OpenSTA's corpus exercises OpenSTA's concerns — Liberty parsing edge cases, SI-aware analysis, timing-check variants specific to its engine. Much of it does not exercise anything Jacquard does, and some of it exercises features Jacquard deliberately does not support. Using it as the primary regression corpus would optimise for the wrong target: our converters would be validated against files OpenSTA cares about, not files Jacquard actually encounters.

Its real value to Jacquard is as a stress / robustness corpus: a large bank of real-world-ish timing files that exercise parser edge cases and dialect variants. A converter that survives their entire corpus is more robust than one validated against a hand-curated subset.

Decision

Vendoring

  • OpenSTA is vendored as a git submodule at vendor/opensta/.
  • The submodule is not built from Jacquard's build. Jacquard's subprocess invocations use whatever OpenSTA binary is installed in the developer or CI environment.
  • The submodule exists for two purposes only: (a) pinning a specific OpenSTA version for CI reproducibility, (b) providing in-tree access to its test corpus without redistribution.
  • Licensing: by git-submodule convention, the submodule's GPL-3.0 licence does not extend to the parent repository. This is the standard interpretation; contributors redistributing binaries or compiled artefacts should nonetheless verify the interpretation applies to their specific jurisdiction and use.

Test corpus split

Two corpora, two distinct roles:

  • Primary regression corpus at tests/timing_ir/corpus/.

    • Jacquard-specific designs: SKY130 MCU SoC, NVDLA, AIGPDK examples, representative SDFs from the real Jacquard flow.
    • Small, curated, committed directly.
    • Run on every CI execution.
    • Exit criterion: every file converts cleanly and matches golden IR within declared tolerance.
  • Stress / robustness corpus at tests/timing_ir/stress/ as a manifest file listing paths into vendor/opensta/<test-tree-subdir>/.

    • Not committed as duplicated data; the manifest references submodule paths.
    • Large, whatever upstream maintains.
    • Run nightly or pre-release, not per-PR.
    • Exit criterion: no crashes, no hangs, no malformed IR. Numerical agreement with OpenSTA not required — this corpus is for robustness, not correctness.
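
The stress-corpus manifest could be resolved along the following lines. The manifest format is not specified in this ADR, so the sketch assumes one relative path per line with blank lines and `#` comments ignored; `resolve_manifest` is a hypothetical name.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve manifest entries to paths under the submodule root.
/// Assumed format: one relative path per line; blank lines and `#` comments are ignored.
pub fn resolve_manifest(manifest: &str, submodule_root: &Path) -> Result<Vec<PathBuf>, String> {
    let mut out = Vec::new();
    for (idx, raw) in manifest.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let rel = Path::new(line);
        // Reject absolute paths and `..` so an entry cannot point outside the submodule.
        let escapes = rel.is_absolute() || rel.components().any(|c| matches!(c, Component::ParentDir));
        if escapes {
            return Err(format!("manifest line {}: path escapes submodule: {}", idx + 1, line));
        }
        out.push(submodule_root.join(rel));
    }
    Ok(out)
}
```

Because the manifest stores only paths, bumping the submodule pin changes which files are exercised without duplicating any upstream data in Jacquard's own commits.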

Copying from stress corpus into primary corpus

If a stress-corpus file exposes a bug, a minimal reproducer may be distilled and added to the primary corpus. When doing so:

  • Verify the specific file's licence before copying. OpenSTA's overall GPL-3.0 licence does not imply every test input is GPL-3.0 — some test inputs are vendor-derived or public-domain.
  • Prefer distilling a synthetic minimal reproducer over copying the original file wholesale.

Consequences

  • CI reproducibility: pinned submodule means we control when OpenSTA version changes land. Bumping the pin is an explicit, reviewable step.
  • Repository size grows by OpenSTA's submodule size (multi-megabyte) but not by test-data duplication.
  • Maintenance cadence: periodic submodule pin updates are a known maintenance item. Not frequent, but not zero.
  • Primary regression corpus stays lean and directly relevant; developers can reproduce corpus-level failures locally without pulling the entire submodule.
  • Stress-corpus failures are treated as bugs against our converter, never as bugs against OpenSTA's test inputs.
  • Licensing posture is conventionally defensible; if stronger legal assurance is ever required, the submodule can be replaced by the external-install-only option (drop the submodule, rely purely on whatever OpenSTA is installed) at the cost of losing in-tree test access.

References

  • ../project-scope.md — licensing constraints.
  • ../timing-correctness.md — R3 (oracle-backed CI).
  • ADR 0001 — OpenSTA as oracle (establishes the subprocess model).
  • ADR 0002 — timing IR (the format being stress-tested).
  • ../plans/phase-0-ir-and-oracle.md — Phase 0 WS4 implements this split.

ADR 0006 — SDF preprocessing model and interim-to-release cutover

Status: Accepted 2026-04; amended 2026-05-02 (see § Amendment).

Amendment (2026-05-02)

The original Decision treated subprocess invocation of OpenSTA from the shipped Jacquard runtime as license-incompatible, requiring Phase 3 (native Rust SDF→IR converter) to land before first release. On review of GPL-3 § 5 ("aggregate") and the FSF interpretation of subprocess/IPC boundaries, this restriction is more conservative than necessary. The relevant facts:

  • The interface is arms-length: standard EDA interchange formats (Liberty / Verilog / SDF / SPEF / SDC) in, our own IR JSON (ADR 0002) out. No shared data structures, no headers, no linking.
  • We do not bundle OpenSTA in any Jacquard distribution. The user installs OpenSTA themselves; user-side combination of separately-distributed programs is not "distribution of a combined work" under GPL-3.
  • The original "no runtime subprocess" rule was effectively a commercial-perception buffer, not a strict licensing requirement.

Revised bright lines (these supersede the original "Shipped release" sub-section):

  1. No linking of GPL code into the Jacquard binary. Unchanged.
  2. No bundling of OpenSTA (or any GPL tool) in Jacquard distribution artefacts (release tarballs, Homebrew formulae, Docker images that ship as Jacquard releases). If a packager wants to bundle, they take on GPL distribution obligations themselves.
  3. Subprocess invocation of user-installed OpenSTA from the shipped runtime is permitted. jacquard sim input.sdf may keep its opensta-to-ir subprocess hook in shipped releases, provided OpenSTA is discovered on PATH rather than bundled.
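
Bright line 3 hinges on discovering OpenSTA on PATH rather than shipping it. A minimal sketch of that lookup using only the standard library follows; `find_on_path` is a hypothetical helper, the executable name is supplied by the caller, and the sketch does not check the executable bit or platform-specific file extensions.

```rust
use std::ffi::OsStr;
use std::path::PathBuf;

/// Return the first file named `exe` found in the directories of a PATH-style value.
pub fn find_on_path(exe: &str, path_var: &OsStr) -> Option<PathBuf> {
    std::env::split_paths(path_var)
        .map(|dir| dir.join(exe))
        .find(|candidate| candidate.is_file())
}
```

In real use `path_var` would come from `std::env::var_os("PATH")`; a `None` result is the "OpenSTA cannot be located" case that the corequisite below turns into a hard failure.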

Phase 3 reclassification. Native Rust SDF→IR converter is no longer release-gating. It remains a goal — for ergonomics (no OpenSTA install required) and for downstream commercial integrators whose legal teams treat any GPL touchpoint as risk — but ships when bandwidth allows, not as a release blocker. Roadmap consequences are tracked in ../plans/post-phase-0-roadmap.md § Phase 3.

Corequisite — OpenSTA detection and version check (release-blocking). Relaxing the no-runtime-subprocess rule is conditional on the shipped runtime giving users a meaningful error when OpenSTA is missing or out of date. Today (src/sim/setup.rs:248-264), a missing OpenSTA only emits a warn! and the simulation proceeds with no timing data loaded. That is acceptable during development but would ship as a UX bug. Concretely, before first release we must:

  1. Hard-fail (not warn) when --sdf is requested and OpenSTA cannot be located.
  2. Probe OpenSTA's version on first invocation and fail with a remediation message if it is older than the version pinned in vendor/opensta/ (per ADR 0005).
  3. Warn-but-proceed if the detected version is newer than the latest tested version, naming the version in the warning.
  4. Document the OpenSTA dependency in docs/usage.md.
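
Items 1 to 3 amount to a three-way policy on the detected version. The sketch below shows one way to express it as a pure function. It assumes versions are dotted numeric strings and that the detected version has already been obtained elsewhere; `check_opensta`, `OpenStaCheck`, and the message wording are all hypothetical.

```rust
use std::cmp::Ordering;

#[derive(Debug, PartialEq)]
pub enum OpenStaCheck {
    Proceed,
    /// Proceed, but surface this warning.
    Warn(String),
    /// Abort with this remediation message.
    Fail(String),
}

/// Parse a dotted numeric version such as "1.2.3" (the format is an assumption).
fn parse_version(s: &str) -> Option<Vec<u32>> {
    s.trim().split('.').map(|part| part.parse::<u32>().ok()).collect()
}

/// Compare component-wise, treating missing trailing components as zero.
fn cmp_versions(a: &[u32], b: &[u32]) -> Ordering {
    for i in 0..a.len().max(b.len()) {
        let x = a.get(i).copied().unwrap_or(0);
        let y = b.get(i).copied().unwrap_or(0);
        if x != y {
            return x.cmp(&y);
        }
    }
    Ordering::Equal
}

/// `detected` is None when OpenSTA could not be located at all.
pub fn check_opensta(detected: Option<&str>, pinned: &str, latest_tested: &str) -> OpenStaCheck {
    let raw = match detected {
        Some(v) => v,
        None => return OpenStaCheck::Fail("OpenSTA not found on PATH".to_string()),
    };
    let parsed = (parse_version(raw), parse_version(pinned), parse_version(latest_tested));
    let (found, floor, ceiling) = match parsed {
        (Some(f), Some(lo), Some(hi)) => (f, lo, hi),
        _ => {
            return OpenStaCheck::Fail(format!(
                "unparseable version among {raw:?}, {pinned:?}, {latest_tested:?}"
            ))
        }
    };
    if cmp_versions(&found, &floor) == Ordering::Less {
        OpenStaCheck::Fail(format!("OpenSTA {raw} is older than the pinned {pinned}; please upgrade"))
    } else if cmp_versions(&found, &ceiling) == Ordering::Greater {
        OpenStaCheck::Warn(format!("OpenSTA {raw} is newer than the latest tested {latest_tested}"))
    } else {
        OpenStaCheck::Proceed
    }
}
```

Keeping the policy separate from the subprocess probe means the hard-fail and warn-but-proceed branches can be unit-tested without OpenSTA installed.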

Tracked as WS-RH.1 in ../plans/post-phase-0-roadmap.md § Release hardening.

Code-comment cleanup follow-up. The INTERIM per ADR 0006 / Pre-release only tags in src/sim/setup.rs (lines ~176, ~228, ~286) and src/bin/jacquard.rs (~187) describe a premise that no longer applies. Folded into WS-RH.1 (../plans/post-phase-0-roadmap.md § Release hardening) rather than spun out as a separate cleanup commit.

The original Context, Decision (Phase 0 + Phase 3), and Walk-back sections below are retained for historical record. Where they conflict with the bright lines above, the bright lines win.

Context

Jacquard's hand-rolled SDF parser (src/sdf_parser.rs) has accumulated reactive maintenance over time — empty () delays, (COND …) pin specs, escape handling, edge-qualified timing checks, TIMINGCHECK-stripping workarounds for OpenLane2 output. Each production failure has been a one-off patch. The timing-correctness review flagged this as issue #3, and a native Rust grammar-based replacement is the Phase 3 deliverable.

Concurrently, ADR 0001 establishes OpenSTA as the timing correctness oracle (subprocess, never linked, GPL), and ADR 0002 introduces a timing IR that decouples parsing from consumption.

Two facts together shape the decision:

  1. No release pressure. Release can happen after Phase 3 lands. We are not forced to keep the hand-rolled parser alive while waiting on Phase 3.
  2. Permissive-license constraint applies to the shipped binary. Subprocess invocation of GPL tooling is acceptable — does not trigger reciprocal obligations — and during pre-release development, even in-runtime subprocess invocation does not violate the constraint because no runtime binary is being distributed.

Given these, maintaining the hand-rolled parser through Phase 0–2 is unnecessary. OpenSTA's mature dialect coverage can substitute, via subprocess, while we build toward a native Rust replacement at our own pace.

Decision

Phase 0

  • Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. All paths that previously consumed SDF now consume timing IR.
  • Ship opensta-to-ir as a standalone preprocessing tool that consumes Liberty + Verilog + SDF + SPEF + SDC and emits timing IR. Subprocess-based on OpenSTA. Production-quality: stable CLI, documented exit codes, clear diagnostics.
  • Canonical runtime path is jacquard sim --timing-ir <path>, consuming pre-converted IR. This path works without OpenSTA on the user's machine — pre-converted IR is sufficient.
  • Interim ergonomic path: during development (pre-release only), jacquard sim input.sdf subprocesses opensta-to-ir internally to produce IR on the fly. This is a contributor convenience, not a shipping feature. Flag exists in code as pre-release only with a clear comment tying back to this ADR.

Phase 3

  • Native Rust SDF→IR converter replaces the OpenSTA subprocess call inside jacquard sim input.sdf. Grammar-based (nom / pest), validated against OpenSTA on the corpus per ADR 0001.
  • Lands before first release.

Shipped release

  • No OpenSTA invocation from the jacquard runtime binary. The native Rust converter handles SDF inputs directly.
  • opensta-to-ir remains as an alternative preprocessing tool. Users who want OpenSTA-computed timing may use it; subprocess model preserves permissive licensing.

Walk-back options (if assumptions change)

  • If OpenSTA dialect coverage proves insufficient during Phase 0 — e.g., a current Jacquard-supported SDF fails to parse — add dialect shims to opensta-to-ir's post-processing. Reinstating the hand-rolled parser is the last resort, not the first.
  • If the Phase 3 Rust rewrite stalls — ship the first release with preprocessing-only (no jacquard sim input.sdf path), remove the interim subprocess, and land the native converter in a later release. No information lost; users preprocess manually. This is already the post-release shape for opensta-to-ir; it's only the jacquard sim input.sdf convenience that would be deferred.
  • If OpenSTA becomes unmaintainable or disappears — the submodule pin (ADR 0005) remains authoritative for the integrated version. A forked submodule can maintain any necessary patches.

Consequences

  • Jacquard's repository stops carrying a hand-rolled SDF parser as a reactive-maintenance target. Bugs in SDF interpretation between Phase 0 and Phase 3 are OpenSTA's problem (upstream) or opensta-to-ir post-processing's problem, not Jacquard's core codebase's problem.
  • Pre-release ergonomic one-step workflow for contributors is preserved.
  • Contributors running Jacquard on a new design (no pre-converted IR) must have OpenSTA installed during Phase 0 through Phase 3. For existing primary-corpus designs, pre-converted IR is checked in; no OpenSTA needed.
  • Release-time check is unambiguous: either the runtime subprocess is replaced by native code, or it is removed entirely. Both outcomes satisfy the permissive-licensing constraint for the shipped binary.
  • Test corpus regenerable: if OpenSTA updates change IR output, golden files are regenerated deliberately (reviewable diff), not silently.

References

  • ../project-scope.md — licensing constraint, preprocessing-tools pattern.
  • ../timing-correctness.md — P1 (oracle), R1 (IR).
  • ADR 0001 — OpenSTA as oracle (subprocess model).
  • ADR 0002 — timing IR (format consumed).
  • ADR 0005 — OpenSTA vendoring (submodule for reproducibility + stress corpus).
  • ../plans/phase-0-ir-and-oracle.md — WS2 productionisation, WS3 deletion + interim hook.

ADR 0007 — Timing model fidelity roadmap

Status: Proposed.

Context

Jacquard's timing model today consumes SDF-equivalent annotations via the timing IR (ADR 0002), produced and validated by OpenSTA called out of process (ADR 0001 — sole STA path; ADR 0003's in-process OpenTimer alternative was Superseded by the spike). The accuracy contract at present is "±5% on arrival times against CVC reference" per timing-validation.md. This is acceptable for sky130-class designs at ≥10 ns clock periods.

Three structural simplifications in the current implementation become accuracy bottlenecks at scale:

  1. Static δ∞ per gate. No pulse-degradation modelling. Glitch behaviour and short-pulse propagation cannot be represented. The Involution Delay Model (Maier 2021, arXiv:2107.06814) demonstrates this is the root cause of inertial-delay's known failure modes, and provides a model that's both faithful and implementable.
  2. Zero clock-tree skew. During AIG construction (src/aig.rs:495-560), clock buffers/inverters/gating cells collapse to a single polarity flag on the DFF. SDF arcs and interconnect on the clock tree are silently dropped. Every DFF on a clock domain is treated as capturing simultaneously.
  3. Per-cell-max wire delay. src/flatten.rs:1850-1872 lumps all interconnect arrivals at a destination cell into a single max value, with no rise/fall distinction. Adequate for short local routes; incorrect for long routes where wire delay rivals or exceeds gate delay (typical of NoCs at 22 nm and below).

The full design analysis is in docs/timing-model-extensions.md. This ADR captures the decision to commit to closing these three gaps as a roadmap, sets the staged ordering, and constrains how the implementation may evolve.

Decision

Adopt a three-pillar roadmap for closing the fidelity gap with CVC, while preserving Jacquard's GPU-throughput advantage. All three pillars are consumer-side work (src/flatten.rs, src/aig.rs, src/sim/cosim_metal.rs, the kernel arrival math); none requires schema changes inconsistent with ADR 0002, and none requires abandoning the cycle-accurate boomerang kernel architecture.

Pillar A — Dynamic delay (δ(T))

Per-gate dynamic delay parameterised on T (time since last output transition). Three stages of increasing accuracy:

  • Stage 1: Static IDM. Bake worst-case δ(T) into existing per-thread script slot using STA pulse-width estimates. No kernel change.
  • Stage 2: Dynamic δ(T). Add last_transition_ps and last_value persistent buffers per AIG pin; kernel evaluates δ(T) from a small per-cell LUT during arrival propagation.
  • Stage 3: Sub-cycle ticks. Multiple arrival propagations per logical cycle, enabling true glitch suppression. Out of scope for this ADR. Would require a different kernel architecture; if pursued, requires its own ADR superseding this one.
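
To make the per-cell LUT idea concrete, here is a CPU-side sketch of evaluating δ(T) by linear interpolation. The LUT layout, the integer picosecond units, and the names `DeltaSample` and `delta_of_t` are assumptions for illustration; the actual kernel representation is not defined in this ADR.

```rust
/// One sample of a per-cell delay curve: at elapsed time `t_ps` since the
/// last output transition, the gate delay is `delay_ps`.
#[derive(Clone, Copy, Debug)]
pub struct DeltaSample {
    pub t_ps: i32,
    pub delay_ps: i32,
}

/// Evaluate δ(T) by linear interpolation over `lut`.
/// `lut` must be non-empty and sorted by strictly increasing `t_ps`.
/// Below the first sample the first delay is used; beyond the last sample the
/// curve saturates at the last delay, standing in for the static δ∞.
pub fn delta_of_t(lut: &[DeltaSample], t_ps: i32) -> i32 {
    let first = lut[0];
    let last = lut[lut.len() - 1];
    if t_ps <= first.t_ps {
        return first.delay_ps;
    }
    if t_ps >= last.t_ps {
        return last.delay_ps;
    }
    // Index of the first sample whose t_ps is greater than the query.
    let hi = lut.partition_point(|s| s.t_ps <= t_ps);
    let (a, b) = (lut[hi - 1], lut[hi]);
    // Interpolate in i64 so the product cannot overflow.
    let num = (b.delay_ps - a.delay_ps) as i64 * (t_ps - a.t_ps) as i64;
    a.delay_ps + (num / (b.t_ps - a.t_ps) as i64) as i32
}
```

Under this reading, the static-IDM stage corresponds to evaluating the curve once at an estimated pulse width and baking the result, while the dynamic stage evaluates it per transition from the stored last-transition time.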

Pillar B — Clock-tree skew

Per-DFF clock arrival accounting via TimingIR extension (ClockArrival table) populated by OpenSTA via opensta-to-ir (ADR 0001 — ADR 0003's OpenTimer alternative is Superseded). Per-pair CRPR is intentionally not modelled at this stage; per-DFF capture-side arrival is, treating launch as the 0-reference. Consumed by extending DFFConstraint with a clock_arrival_ps: i16 field, folded into the existing per-word setup/hold check in src/flatten.rs via DFFConstraint::effective_setup_hold. No kernel change for the baseline case; bucketed packing is an option if pessimism becomes material. Stages 1+2 landed: commits c403cc8 (producer) and 6767c3e (consumer).
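
One conventional way to fold a capture-side clock arrival into setup/hold checks is sketched below. This is not the actual `DFFConstraint::effective_setup_hold`; the struct `SkewedConstraint`, its field types other than `clock_arrival_ps: i16`, and the sign conventions are assumptions. With launch as the 0-reference, a capture clock arriving `c` ps late relaxes setup by `c` and tightens hold by `c`.

```rust
/// Hypothetical stand-in for a per-DFF constraint with clock arrival attached.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SkewedConstraint {
    pub setup_ps: i16,
    pub hold_ps: i16,
    /// Capture-side clock arrival relative to the launch edge (0-reference).
    pub clock_arrival_ps: i16,
}

impl SkewedConstraint {
    /// (effective setup, effective hold) in ps, widened to i32 so sums cannot overflow.
    pub fn effective_setup_hold(&self) -> (i32, i32) {
        let c = self.clock_arrival_ps as i32;
        (self.setup_ps as i32 - c, self.hold_ps as i32 + c)
    }

    /// Returns (setup_ok, hold_ok) for a data arrival measured from the launch edge.
    pub fn check(&self, data_arrival_ps: i32, period_ps: i32) -> (bool, bool) {
        let (setup, hold) = self.effective_setup_hold();
        (data_arrival_ps <= period_ps - setup, data_arrival_ps >= hold)
    }
}

/// Worst case over the DFFs packed into one state word: the largest effective
/// setup and the largest effective hold. This is where per-word pessimism arises.
pub fn per_word_worst(dffs: &[SkewedConstraint]) -> Option<(i32, i32)> {
    dffs.iter()
        .map(|d| d.effective_setup_hold())
        .reduce(|a, b| (a.0.max(b.0), a.1.max(b.1)))
}
```

When DFFs with very different clock arrivals share a word, the per-word maximum is pessimistic for most of them, which is the situation bucketed packing would address.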

Pillar C — Wire delay at scale

Three fidelity tiers:

  • Tier 1: Per-receiver consumption. Key wire delay by (src_aigpin, dst_aigpin) edge in the AIG, with rise/fall distinction preserved. Mostly a src/flatten.rs:1850-1872 rewrite. No kernel change.
  • Tier 2: Inter-partition arc delay. Explicit modelling of wire delay on partition-crossing signals. Touches src/sim/cosim_metal.rs shuffle pipeline. Required for many-core/NoC designs at advanced processes.
  • Tier 3: NoC-aware partitioning hints. Soft bias in src/repcut.rs favouring cuts on flagged net patterns. Optional optimisation that makes Tier 2 cheap on tile-decomposed designs.
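
The difference between the current per-cell-max model and Tier 1 per-receiver consumption can be shown with a small lookup keyed by (src_aigpin, dst_aigpin). The types and function names below are illustrative assumptions, not the structures in src/flatten.rs.

```rust
use std::collections::HashMap;

/// Separate rise and fall interconnect delay for one edge, in picoseconds.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct RiseFall {
    pub rise_ps: u32,
    pub fall_ps: u32,
}

/// Interconnect delay keyed by (src_aigpin, dst_aigpin).
pub type WireDelays = HashMap<(u32, u32), RiseFall>;

/// Tier 1 style lookup: the delay of one specific edge, rise and fall kept apart.
pub fn per_receiver(delays: &WireDelays, src: u32, dst: u32) -> Option<RiseFall> {
    delays.get(&(src, dst)).copied()
}

/// Lumped model for comparison: a single max over every edge into `dst`,
/// discarding both the source pin and the rise/fall distinction.
pub fn per_cell_max(delays: &WireDelays, dst: u32) -> u32 {
    delays
        .iter()
        .filter(|(key, _)| key.1 == dst)
        .map(|(_, rf)| rf.rise_ps.max(rf.fall_ps))
        .max()
        .unwrap_or(0)
}
```

With one short and one long route into the same cell, the lumped model charges the long route's delay to both inputs, while the per-edge lookup keeps them distinct.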

Sequencing constraint

  • Pillar B Stage 1+2 is the cheapest accuracy improvement. Originally gated on the (now Superseded) OpenTimer integration; landed early on top of the OpenSTA-out-of-process path instead. See commits c403cc8/6767c3e.
  • Pillar C Tier 1 is independent of which STA tool feeds the IR and can proceed any time.
  • Pillar A Stage 1 (Static IDM) is the cheapest δ(T) entry point, gated on per-cell SPICE characterisation effort. Schedule this only after Pillars B and C land — δ(T) compounds on top of correct wire/skew baseline; doing it earlier risks chasing characterisation noise that's actually wire-delay error.
  • Pillar C Tier 2 lands when a real many-core/NoC use case appears in the test corpus and Tier 1 measurement shows it's needed.
  • Pillar A Stage 2 (Dynamic δ(T)) is a substantial implementation; schedule only when Stage 1 reports indicate the value is real, and a contributor with the analog-characterisation domain expertise is willing to lead it.
  • Pillar A Stage 3 (Sub-cycle ticks) is explicitly out of scope of this ADR.

Validation contract

  • Each pillar lands with regression coverage extending timing-validation.md's ±5% tolerance. Tighter tolerances may apply per pillar: Pillar B should stay within ±2% on skew-aware paths with OpenSTA-fed per-DFF arrival as currently implemented, and Pillar C Tier 1 should stay within ±3% on long-wire paths.
  • Each pillar must demonstrate no regression on the existing primary corpus before merge.
  • The IR schema may be extended (additive only) per ADR 0002 to carry pillar-specific data. Extensions require a minor schema bump and a documented consumer-version compatibility note.
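
A per-pillar tolerance check reduces to a relative-error comparison against the reference arrival. The sketch below assumes arrivals in picoseconds and a tolerance expressed as a fraction (0.05 for ±5%); the function names and the tuple layout are hypothetical.

```rust
/// Relative error of `observed_ps` against `reference_ps`, as a fraction.
pub fn relative_error(observed_ps: f64, reference_ps: f64) -> f64 {
    if reference_ps == 0.0 {
        // No meaningful relative error against a zero reference.
        return if observed_ps == 0.0 { 0.0 } else { f64::INFINITY };
    }
    (observed_ps - reference_ps).abs() / reference_ps.abs()
}

/// Names of the paths whose arrival falls outside `tol`.
/// Each entry is (path name, observed arrival, reference arrival).
pub fn out_of_tolerance<'a>(pairs: &'a [(&'a str, f64, f64)], tol: f64) -> Vec<&'a str> {
    pairs
        .iter()
        .filter(|(_, obs, reference)| relative_error(*obs, *reference) > tol)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Running the same path set at 0.05, 0.03, and 0.02 gives a quick picture of which paths would block a tighter per-pillar tolerance.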

Consequences

  • The "±5%" line in timing-validation.md becomes a per-pillar specification rather than a single number. The doc is updated as each pillar lands.
  • crates/timing-ir/schemas/timing_ir.fbs accumulates additive extensions for clock arrival and per-cell δ(T) parameters. Schema versioning per ADR 0002 governs.
  • No changes to the cycle-accurate boomerang kernel architecture. The cost of preserving that architecture is permanent: no glitch propagation, no metastability oscillation, no asynchronous handling. These remain non-goals (per project-scope.md) unless a future ADR explicitly supersedes this position.
  • Per-cell SPICE characterisation effort is acknowledged as the long-pole risk for Pillar A. If characterisation cost proves prohibitive, Pillar A reduces to "Stage 1 only, using Liberty-derived ECSM/CCSM data as approximation," and the gap with CVC's full IDM fidelity remains open. This is acceptable; Pillar A Stage 2 is not a release-gating commitment.
  • Jacquard's positioning (why-jacquard.md) becomes coherent: STA-complement-not-replacement, vector-driven timing at GPU scale, fidelity comparable to CVC where the cycle-accurate kernel architecture allows.

Walk-back options

  • If a pillar's measurement shows the accuracy gain is smaller than expected, descope it. Each pillar's first stage is sized to deliver measurable improvement; if it doesn't, later stages of that pillar are deferred or abandoned.
  • If the IR schema extensions cause downstream tooling friction, fall back to vendor-extension passthrough (VendorExtension in timing_ir.fbs) until the typed schema stabilises. Already supported.
  • OpenTimer integration was retired (ADR 0003 Superseded by the spike outcome). Pillar B did not need the documented fallback to manual clock-tree accumulation in src/aig.rs — OpenSTA's per-pin arrival via opensta-to-ir covers the same ground without the per-pair CRPR credit (deferred to Stage 3 if measurement justifies it).

References

  • ../timing-model-extensions.md — full technical analysis underlying this ADR.
  • ../why-jacquard.md — positioning context: where this fidelity work fits in the user value story.
  • ../adr/0001-opensta-as-oracle.md, ../adr/0002-timing-ir.md — preceding decisions this ADR builds on.
  • ../adr/0003-opentimer-primary-sta.md — Superseded by the spike outcome (../spikes/opentimer-sky130.md); referenced here for historical context.
  • ../timing-validation.md — validation tolerance contract that each pillar updates.
  • ../project-scope.md — synchronous-only / cycle-accurate constraints that bound what this ADR can pursue.

ADR 0008 — Structured timing output as first-class deliverable

Status: Approved.

Context

Jacquard produces timing information today through three channels: timed VCD (--timed), per-violation clilog::warn! messages on stderr, and an in-process SimStats counter. The why-jacquard.md analysis identifies a gap between the timing data Jacquard has internally and the answers users actually need from a flow:

User question                            | Today
-----------------------------------------|------------------------------------------------------
Did my workload trip any violations?     | SimStats counts (in-process API only)
Which DFFs nearly missed timing?         | Not extractable without parsing stderr
Show me arrival distribution per signal  | Reconstructable from --timed via post-processing only
Which DFF was that violation on?         | State-word index + manual lookup
What path caused the worst arrival?      | Not available
Run this in CI and fail if any violation | Possible only via stderr grep

The most acute problem: stderr violation messages identify a state-word index, not a signal name. Mapping back to "which DFF, which path" requires manual investigation. On a violating design the message volume can be enormous (one warning per word per cycle per type). The data needed to do better — hierarchical signal names, DFF instance paths, per-DFF arrival distributions — already exists in the netlistdb and event buffer; it is simply not surfaced in usable form.
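
As an illustration of the gap, the sketch below collapses raw per-cycle events into one count per named group. Everything here is hypothetical: the event tuple layout, the `names` map from state-word index to the DFF names packed into that word, and the `summarise` function are assumptions, not the event buffer's actual types.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Kind {
    Setup,
    Hold,
}

/// Collapse raw (cycle, state-word index, kind) events into one count per
/// named group, so output volume no longer scales with cycle count.
pub fn summarise(
    events: &[(u64, u32, Kind)],
    names: &BTreeMap<u32, Vec<String>>,
) -> BTreeMap<(String, Kind), u64> {
    let mut out = BTreeMap::new();
    for &(_cycle, word, kind) in events {
        let label = match names.get(&word) {
            Some(dffs) => dffs.join(", "),
            // Keep unresolved words visible rather than dropping them.
            None => format!("<unresolved word {}>", word),
        };
        *out.entry((label, kind)).or_insert(0) += 1;
    }
    out
}
```

The point of the sketch is only that a name lookup plus aggregation turns one warning per word per cycle per type into one line per affected group.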

This ADR is about making Jacquard's timing output useful in a real flow rather than merely produced. The substantive work in ADR 0007 (model fidelity) is wasted if no one can extract the answers.

The full design analysis is in docs/why-jacquard.md, "Output interface" section.

Decision

Treat structured, machine-readable timing output as a first-class shipping deliverable, not an optional improvement. Land the work in priority order, where priority is set by user impact per unit of implementation cost, not by technical interest.

Required outputs

The following are required for Jacquard to be considered usable for vector-driven timing analysis in a real flow. They land before any further fidelity work past ADR 0007 Pillar B Stage 1+2.

  1. Symbolic violation messages. Replace state-word indices with hierarchical signal names in stderr violation output. Mapping data already exists in netlistdb. Cost: contained edit in src/event_buffer.rs:305-338 plus name-resolution helper. Highest UX impact per LoC of any improvement on this list.

  2. --timing-report <path.json>. Structured JSON document at end-of-run containing:

    • Per-DFF worst arrival, worst slack, violation count over the run.
    • Per-cycle violation list (cycle, signal name, hierarchical path, arrival, constraint, slack).
    • Aggregate stats: total violations, distribution buckets, peak arrival per clock domain.
    • Per-signal activity summary: transition count, average/max arrival, idle cycles.
    • Run metadata: clock period, SDF/IR file, design hash, vector source.

    Required for CI integration and any downstream tooling. Schema versioned; additive extension policy mirrors crates/timing-ir.

  3. --timing-summary. Fast text summary, no VCD. Designed for scripts and human inspection of long runs. Contents:

    • Vectors run, clock period, corner.
    • Setup/hold violation totals.
    • Worst-slack DFF (setup and hold) with hierarchical path.
    • Peak arrival per writeout vs clock budget, with margin percentage.

    Cost: trivial wrapper over (2)'s data.

  4. Per-DFF worst-slack ranking. Top-N DFFs by closest-to-violation slack across the entire run, even when no violation occurred. Surfaces "where am I close to the edge" without requiring a violation to actually trip. Output as part of (2) and (3); also accessible via a dedicated --worst-slack-n N flag for quick inspection.
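
The worst-slack ranking in item 4 and the CI use case behind item 2 can be sketched over a per-DFF record. The `DffSlack` fields mirror the per-DFF data listed for the report, but the struct, the function names, and the exit-code convention are assumptions rather than the report schema.

```rust
/// Hypothetical per-DFF record, loosely following the report fields above.
#[derive(Clone, Debug, PartialEq)]
pub struct DffSlack {
    pub path: String,
    pub worst_slack_ps: i32,
    pub violations: u32,
}

/// The `n` DFFs closest to (or furthest past) violation, smallest slack first.
pub fn worst_slack_n(mut dffs: Vec<DffSlack>, n: usize) -> Vec<DffSlack> {
    // Ties break on path so the ranking is deterministic across runs.
    dffs.sort_by(|a, b| {
        a.worst_slack_ps
            .cmp(&b.worst_slack_ps)
            .then_with(|| a.path.cmp(&b.path))
    });
    dffs.truncate(n);
    dffs
}

/// Assumed CI convention: non-zero exit code if any DFF recorded a violation.
pub fn ci_exit_code(dffs: &[DffSlack]) -> i32 {
    if dffs.iter().any(|d| d.violations > 0) {
        1
    } else {
        0
    }
}
```

Ranking by slack rather than filtering on violations is what surfaces near-misses on a run that passed cleanly.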

Optional / later outputs

The following are higher-value-but-lower-priority. They land after the four required items above, in any order driven by user demand.

  5. --arrival-histogram <pattern>. Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis.

  6. --sta-cross-reference <opensta-paths.txt>. Cross-reference OpenSTA's critical-path report against observed worst arrivals. Closes the loop between vector-driven and static analysis in a coverage-style report: of the top-N STA paths, which were exercised, and at what observed arrival. (Originally framed against OpenTimer; ADR 0003 was Superseded — OpenSTA is the only STA tool Jacquard interoperates with now.)

  7. Path-back-trace from worst-arrival DFF. Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting the path with per-edge contribution. Most expensive item on this list; only useful once the cheaper items are in place.
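
The max-of-fanin walk behind the path-back-trace item can be sketched on a plain adjacency map. The graph representation, the `back_trace` name, and the use of recorded per-pin arrivals are assumptions for illustration; the real AIG and event-buffer structures may differ.

```rust
use std::collections::HashMap;

/// For each pin, its fan-in edges as (source pin, edge delay in ps). Pins
/// absent from the map are treated as path sources (e.g. primary inputs).
pub type Fanin = HashMap<u32, Vec<(u32, u32)>>;

/// Walk backward from `end`, at each pin following the fan-in edge whose
/// (source arrival + edge delay) is largest. Returns the path source-first
/// as (pin, delay contributed by the edge into that pin).
pub fn back_trace(fanin: &Fanin, arrival_ps: &HashMap<u32, u32>, end: u32) -> Vec<(u32, u32)> {
    let mut path = Vec::new();
    let mut pin = end;
    // Bound the walk so a malformed (cyclic) graph cannot loop forever.
    for _ in 0..=fanin.len() {
        let worst = fanin.get(&pin).and_then(|edges| {
            edges
                .iter()
                .max_by_key(|(src, d)| arrival_ps.get(src).copied().unwrap_or(0).saturating_add(*d))
        });
        match worst {
            Some(&(src, delay)) => {
                path.push((pin, delay));
                pin = src;
            }
            None => {
                // Reached a source pin: it contributes no incoming edge delay.
                path.push((pin, 0));
                break;
            }
        }
    }
    path.reverse();
    path
}
```

Each step is a local maximum over one pin's fan-in, so the cost is proportional to path length times fan-in.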

Backward compatibility

  • All new outputs are opt-in via flags. Existing stderr behaviour and --timed semantics are unchanged.
  • Symbolic violation messages (item 1) do change existing stderr format. This is intentional: the current state-word-index format is not a stable contract and is not consumed by any known automation. Format change documented in changelog at land time.

Output stability contract

  • The --timing-report JSON is a stable consumer-facing format. Schema versioned. Additive-only extensions per the IR convention; breaking changes require a major version bump and a transition period.
  • --timing-summary is human-readable and explicitly not stable for parsing. Tools should consume the JSON.
  • Stderr violation messages remain human-oriented; tools should not parse them.

Consequences

  • Jacquard becomes usable in CI without bespoke stderr parsing. Existing users who scrape stderr will need to migrate to the JSON report; the migration window is the release in which symbolic messages land.
  • The SimStats in-process API gains a public counterpart: end-of-run JSON. This raises the bar for changes to either — they must agree.
  • Documentation gains a "Jacquard timing report format" reference page. Sample reports from the corpus designs are checked in to tests/timing_ir/corpus/ alongside golden IR.
  • The why-jacquard.md positioning becomes truthful: the user-facing claim "vector-driven setup/hold answers at GPU scale" is backed by an interface that delivers them.

Walk-back options

  • If the JSON schema causes consumer-tooling friction, the format may be extended additively but not narrowed. Existing consumers must continue to work. If a fundamental rethink is required, ship a v2 alongside v1 with a deprecation window.
  • If symbolic name resolution is too slow at scale (millions of DFFs, very long runs), the resolution step becomes opt-in via flag, with the existing state-word-index format retained as a fast-path default. No evidence yet that this is a problem; treated as a deferred consequence.
  • If users specifically want the path-back-trace (item 7) before the cheaper items are scheduled, it can be promoted, but only once items 1–4 are in place. Path-back-trace without symbolic names is unusable.

Priority and effort estimate

| Item | Effort | Blocks | User impact |
|---|---|---|---|
| 1. Symbolic violations | 1–2 days | Nothing | High (turns stderr from noise to signal) |
| 2. JSON report | 3–5 days | CI integration | High |
| 3. Text summary | 1 day (after #2) | Human dashboards | Medium |
| 4. Worst-slack ranking | 1–2 days (folds into #2) | "Am I close?" | High |
| 5. Arrival histogram | 3–5 days | Power analysis | Medium |
| 6. STA cross-ref | 1 week | Vector coverage report | Medium |
| 7. Path-back-trace | 2–3 weeks | Forensics | Lower-frequency-but-high-value |

Items 1–4 are a single workstream, ~2 weeks total. They constitute the "Jacquard is now usable" bar. Items 5–7 are scheduled per user demand after that.

  • ../why-jacquard.md — positioning analysis and full output-interface design.
  • ../timing-violations.md — current violation detection mechanics; updated to describe new outputs once they land.
  • ../timing-validation.md — validation tolerances; will reference the JSON report format for golden comparisons.
  • ../adr/0002-timing-ir.md — IR schema versioning policy that the JSON report mirrors.
  • ../project-scope.md — output stability constraints that apply to any user-facing format.

ADR 0009 — OpenSTA Verilog reader input constraints

Status: Accepted.

Context

OpenSTA's read_verilog Tcl command is structural-only: it accepts cell instantiations and bare-net assign statements but rejects RTL operators (~, &, |, ^), bit-selects in assigns, and ranged concatenations. Violations surface as Error: <file> line <N>, syntax error and exit 1. This is a long-standing OpenSTA limitation, not a flag.

Two patterns make this surprising in practice — both have already caught us once:

  1. Final-stage outputs from the LibreLane/OpenROAD flow are sometimes wrapped. LibreLane itself only ever reads structural netlists (<design>.pnl.v — verified locally on chip_top.pnl.v: zero RTL operators, single module). The wrapping is added by downstream integration tooling — for the SkyWater openframe flow, chipflow's harness wraps the LibreLane output in openframe_project_wrapper to patch active-low OEB pins into the pad ring, producing the assign gpio_oeb[0] = ~( ... ); pattern. The combined file (tests/mcu_soc/data/6_final.v) contains both the readable-by-OpenSTA structural top module and the wrapper's unreadable RTL. The SDF was generated against the inner top, not the wrapper — matching what LibreLane's own STA saw.

  2. Post-synthesis Verilog has the right form but the wrong cells. Pre-P&R synthesis output (e.g. top_synth.v) is fully structural and uses the same module name top as the post-P&R body, so it looks like an acceptable substitute. It is not: the SDF references hundreds of thousands of P&R-inserted cells (clkbuf_regs_* CTS buffers, ANTENNA_* diodes, delaybuf_*, fillers) that simply do not exist in synthesis output. OpenSTA quietly drops SDF entries whose endpoints are not in the loaded design; the resulting IR back-annotates only the surviving subset. Concrete numbers from the MCU SoC fixture: top_synth.v has 31,500 cells; module top inside 6_final.v has 266,746. Feeding top_synth.v would silently drop ~88% of the design's structure.

Past convention (docs/plans/ws3-cosim-sdf-followup.md, pre 2026-05-18) recommended substituting top_synth.v to dodge the wrapper-parse error. The contemporaneous verification log (28162 matched, 2090 unmatched) reported the jtir-to-cosim-netlist match rate, not SDF coverage against the jtir, so the flow looked like it was "working" on the surface while the IR was missing most of the design's real timing. That recommendation is retracted in the same change that lands this ADR.

Decision

The "structural-only" constraint is owned by opensta-to-ir, not by the caller. Specifically:

  1. opensta-to-ir filters Verilog inputs at invocation time. For each --verilog file, it extracts the module <--top> … endmodule block before handing files to OpenSTA. Files that do not contain module <--top> (sub-module-only files in hierarchical designs) are passed through unchanged. The wrapper modules that LibreLane + wafer.space integration adds — and any future analogues — are simply not seen by OpenSTA. Implementation in crates/opensta-to-ir/src/verilog_filter.rs; integration test coverage in tests/opensta_integration.rs.

  2. The cell-set match against the SDF is the caller's responsibility. opensta-to-ir cannot determine programmatically whether a given Verilog input is the right design stage for a given SDF. The CI fixture comment in prepare-mcu-soc-jtir captures the rule for sky130 mcu_soc; copy the spirit (use the post-P&R structural body, not synthesis output) when adding new fixtures, but don't copy a per-design extraction recipe — there no longer is one to copy.
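The line-anchored extraction described in item 1 can be sketched as below. This is a minimal illustration, not the code in crates/opensta-to-ir/src/verilog_filter.rs: the function names extract_top_module and is_module_header are hypothetical, and the sketch assumes that both module and endmodule start a line (after optional whitespace), which is the same assumption the filter is documented to make.

```rust
/// Hypothetical helper: true when `trimmed` (a line with leading whitespace
/// removed) opens `module <top>`, and not e.g. `module top_synth`.
fn is_module_header(trimmed: &str, top: &str) -> bool {
    let rest = match trimmed.strip_prefix("module") {
        Some(rest) => rest,
        None => return false,
    };
    if !rest.starts_with(char::is_whitespace) {
        return false;
    }
    match rest.trim_start().strip_prefix(top) {
        // The name must end here: the next character may not continue an identifier.
        Some(after) => !after.starts_with(|c: char| c.is_alphanumeric() || c == '_' || c == '$'),
        None => false,
    }
}

/// Hypothetical sketch: return only the `module <top> ... endmodule` block.
/// If the file has no complete block for `top`, it is passed through unchanged.
fn extract_top_module(src: &str, top: &str) -> String {
    let mut out = String::new();
    let mut inside = false;
    for line in src.lines() {
        let trimmed = line.trim_start();
        if !inside && is_module_header(trimmed, top) {
            inside = true;
        }
        if inside {
            out.push_str(line);
            out.push('\n');
            if trimmed.starts_with("endmodule") {
                return out;
            }
        }
    }
    src.to_string()
}

fn main() {
    let combined = "module top (a, y);\n  BUF u0 (.A(a), .Y(y));\nendmodule\n\
                    module wrapper (a, oeb);\n  assign oeb = ~a;\nendmodule\n";
    // Only the structural `top` body is printed; the wrapper's RTL assign is dropped.
    print!("{}", extract_top_module(combined, "top"));
}
```

The identifier-boundary check matters for the fixture discussed above, where top and top_synth share a prefix.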

Architectural alternative (separate concern): the upstream chipflow harness could preserve LibreLane's pre-wrap <top>.pnl.v alongside its wrapped <top>_final.v output. That would make opensta-to-ir's in-tool extraction a no-op for the common chipflow case, but it would not obviate the filter — third-party LibreLane + wafer.space users (hazard3 and future tapeouts using the vanilla flow) hit the same wrapper pattern. The filter is the right place for the fix because it covers both opensta-to-ir as a CLI and jacquard sim --sdf (which subprocesses opensta-to-ir).

Consequences

  • End-user runs of jacquard sim --sdf <path> and the standalone opensta-to-ir tool both transparently handle the LibreLane + wafer.space wrapper pattern. No flags, no preprocessing recipe in user-facing docs.
  • Match-rate metrics in the IR consumer measure jtir coverage against the consuming netlist, not against the source SDF. A high match rate is necessary but not sufficient — confirm the jtir contains the post-P&R cell population separately (e.g. by spot checking for clkbuf_regs_* / ANTENNA_* arcs in the IR JSON sidecar) before declaring a flow "working".
  • The filter assumes module <--top> … endmodule is line-anchored in the Verilog source. Machine-generated post-P&R netlists meet this; hand-rolled Verilog that opens a module mid-line would not. If that ever surfaces, upgrade the filter to use a real Verilog tokenizer (sverilogparse is already a workspace dependency).
  • This ADR retroactively retracts the top_synth.v recommendation in docs/plans/ws3-cosim-sdf-followup.md; that doc is corrected in the same change.
  • ADR 0001 — OpenSTA as oracle and sole STA path (the upstream tool whose constraints these are).
  • ADR 0006 — SDF preprocessing model (the surrounding flow that consumes these inputs).
  • docs/plans/ws3-cosim-sdf-followup.md — the prior workaround this ADR corrects.

ADR 0010 — Declarative cell metadata for PDK enablement

Status: Accepted.

Context

PDK enablement today is per-PDK code + vendored Verilog (see src/sky130.rs, src/gf180mcu.rs, src/gf180mcu_pdk.rs, the build.rs pin-table scanner). Adding a new cell family — third-party IP memories, hard macros, foundry-supplied blocks — requires vendoring Verilog into jacquard/vendor/, extending the build.rs scanner, editing prefix matchers (is_<pdk>_cell, extract_cell_type), and adding entries to hand-curated matches!() lists (is_filler_cell, is_io_pad_cell, is_sequential_cell, is_multi_output_cell, …). Each of those last is data masquerading as code; PR #64 (2026-05-18 power-pin + wired-filler shortcuts for wafer.space) is the most recent example of the pattern.

The acute trigger is gf180mcu_ocd_ip_sram__sram1024x8m8wm1 — Tim Edwards' OCD 3.3V port of the GF180MCU SRAM IP, used in a downstream wafer.space tapeout. The cell is third-party IP (not in Jacquard's vendor/), doesn't match is_gf180mcu_cell's prefix walk (fd_* / ws_* only), has no pin table, and isn't filler-stubbable. Issue #67 captures the discussion.

The same pattern will repeat for every wafer.space tapeout that includes IP outside Jacquard's vendored library — hazard3, future chips. Code-gating each one through a Jacquard PR doesn't scale.

Decision

PDK enablement gains a declarative metadata path alongside the existing built-in classifiers. The decision separates cleanly into two tiers; this ADR commits to Tier 1 + a minimal Tier 2 slice now, and explicitly defers the larger Tier 2 schema (port-mapping semantics) to a future ADR after real adoption data.

Tier 1 — runtime cell library (--cell-library <PATH>.v)

sverilogparse (already a workspace dependency) parses user-supplied Verilog files at startup and populates the LeafPinProvider for every module … endmodule block found. Handles input / output / inout. Replaces the build.rs scanner for newly-added cells; existing built-in tables stay as fallback.

Flag is repeatable: --cell-library a.v --cell-library b.v for designs that pull in multiple IP libraries. Files are parsed in order; later files override earlier ones for collisions (with a warning).
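The "later files override earlier ones, with a warning" rule can be illustrated with a small merge over already-parsed libraries. This is a sketch under stated assumptions: merge_libraries and the ParsedLibrary tuple are hypothetical stand-ins for whatever sverilogparse and LeafPinProvider actually exchange, and the warning text is invented for the example.

```rust
use std::collections::HashMap;

/// Hypothetical shape of one parsed library file:
/// its path and the (module name, pin names) pairs found in it.
type ParsedLibrary<'a> = (&'a str, Vec<(&'a str, Vec<&'a str>)>);

/// Merge libraries in command-line order. A later definition replaces an
/// earlier one and records a warning naming the overriding file.
fn merge_libraries(files: &[ParsedLibrary<'_>]) -> (HashMap<String, Vec<String>>, Vec<String>) {
    let mut pins: HashMap<String, Vec<String>> = HashMap::new();
    let mut warnings = Vec::new();
    for (path, modules) in files {
        for (name, module_pins) in modules {
            let new_pins: Vec<String> = module_pins.iter().map(|p| p.to_string()).collect();
            if pins.insert(name.to_string(), new_pins).is_some() {
                warnings.push(format!("{}: overrides earlier definition of {}", path, name));
            }
        }
    }
    (pins, warnings)
}

fn main() {
    let files = vec![
        ("a.v", vec![("sram", vec!["CLK", "Q"]), ("pad", vec!["PAD"])]),
        ("b.v", vec![("sram", vec!["CLK", "CEN", "Q"])]),
    ];
    let (pins, warnings) = merge_libraries(&files);
    println!("sram pins: {:?}", pins["sram"]);
    for w in &warnings {
        eprintln!("warning: {}", w);
    }
}
```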

Tier 2 (minimal slice) — kind discriminator in TOML

Each cell library may be accompanied by a TOML manifest declaring the kind of each cell: the same classification that today's is_filler_cell / is_sequential_cell / etc. encode in matches!() lists. Manifest path mirrors the library path (foo.v → foo.cells.toml) and is loaded automatically when present; an explicit --cell-manifest <PATH>.toml flag overrides the autoloading behaviour.

schema_version = "1.0"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_fd_io__fillcap_18_h]
kind = "filler"

Recognised kind values (v1.0): std, dff, latch, clock_gate, ram, filler, endcap, tap, io_pad_input, io_pad_output, io_pad_bidir, delay, multi_output, tie_high, tie_low.

Schema versioning: top-level schema_version is mandatory. v1.x additive rule — new optional keys / new kind values are non-breaking; semantics of existing kind values must not narrow.
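One way a loader might represent the v1.0 kind values and the "v1.x is additive" rule is sketched below. The CellKind enum and is_supported_schema are hypothetical names, not Jacquard's actual types; the sketch assumes that any 1.<minor> version is accepted and that an unrecognised kind string is reported as an error rather than silently ignored.

```rust
use std::str::FromStr;

/// The `kind` values listed for schema v1.0 (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellKind {
    Std, Dff, Latch, ClockGate, Ram, Filler, Endcap, Tap,
    IoPadInput, IoPadOutput, IoPadBidir, Delay, MultiOutput, TieHigh, TieLow,
}

impl FromStr for CellKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(match s {
            "std" => CellKind::Std,
            "dff" => CellKind::Dff,
            "latch" => CellKind::Latch,
            "clock_gate" => CellKind::ClockGate,
            "ram" => CellKind::Ram,
            "filler" => CellKind::Filler,
            "endcap" => CellKind::Endcap,
            "tap" => CellKind::Tap,
            "io_pad_input" => CellKind::IoPadInput,
            "io_pad_output" => CellKind::IoPadOutput,
            "io_pad_bidir" => CellKind::IoPadBidir,
            "delay" => CellKind::Delay,
            "multi_output" => CellKind::MultiOutput,
            "tie_high" => CellKind::TieHigh,
            "tie_low" => CellKind::TieLow,
            other => return Err(format!("unknown kind \"{}\"", other)),
        })
    }
}

/// Assumed policy: accept any "1.<minor>" version, reject everything else.
fn is_supported_schema(version: &str) -> bool {
    match version.split_once('.') {
        Some(("1", minor)) => minor.parse::<u32>().is_ok(),
        _ => false,
    }
}

fn main() {
    assert!(is_supported_schema("1.0"));
    let kind: CellKind = "ram".parse().expect("known kind");
    println!("{:?}", kind);
}
```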

kind = "ram" semantics in v1.0 (opaque-RAM mode)

aig.rs today has two hardcoded RAM detection paths: celltype == "$__RAMGEM_SYNC_" (line 775, port_r/port_w resolution from Yosys memlib_yosys.txt) and starts_with("CF_SRAM_") (line 1006, .DO output resolution for ChipFlow's single-port convention). Neither matches gf180mcu_ocd_ip_sram_* or arbitrary third-party SRAM IP.

In v1.0, kind = "ram" allocates a RAMBlock slot in opaque mode: the cell's outputs are routed to X-source slots, no port resolution is attempted, no memory behaviour is modelled. This is sufficient for designs whose CPU executes from boot ROM / register file and never reads SRAM contents at the timescales Jacquard simulates (the heartbeat-verification use case driving this work). The existing compute_x_sources test path at src/aig.rs:3247-3273 already validates the X-source convergence shape.

When real memory modelling is required, future schema versions add explicit port mapping ([cells.NAME.ports] sub-tables) — the opaque mode stays as the documented fallback.

Integration ordering

aig.rs cell-type recognition slots the manifest path after the existing recognisers:

1. celltype == "$__RAMGEM_SYNC_"  → RAMBlock with port_r/port_w   (unchanged)
2. starts_with("CF_SRAM_")        → RAMBlock with .DO              (unchanged)
3. PdkVariant::classify(celltype) → built-in classifier dispatch    (unchanged)
4. NEW: manifest.lookup(celltype) → manifest-declared kind dispatch

The new path activates only for cells none of the existing recognisers match AND that have a manifest entry. All existing tests stay green without churn.
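The ordering above amounts to a first-match-wins chain with the manifest lookup last. The sketch below shows that shape only: Recogniser and recognise are hypothetical names, the builtin_matches closure stands in for PdkVariant::classify, and the manifest is reduced to a name-to-kind map.

```rust
use std::collections::HashMap;

/// Which recogniser claimed a cell type (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Recogniser<'m> {
    RamGemSync,
    CfSram,
    BuiltIn,
    Manifest(&'m str),
    Unknown,
}

/// First match wins; the manifest is consulted only when nothing earlier matched.
fn recognise<'m>(
    celltype: &str,
    builtin_matches: impl Fn(&str) -> bool,
    manifest: &'m HashMap<String, String>,
) -> Recogniser<'m> {
    if celltype == "$__RAMGEM_SYNC_" {
        return Recogniser::RamGemSync;
    }
    if celltype.starts_with("CF_SRAM_") {
        return Recogniser::CfSram;
    }
    if builtin_matches(celltype) {
        return Recogniser::BuiltIn;
    }
    match manifest.get(celltype) {
        Some(kind) => Recogniser::Manifest(kind.as_str()),
        None => Recogniser::Unknown,
    }
}

fn main() {
    let mut manifest = HashMap::new();
    manifest.insert("gf180mcu_ocd_ip_sram__sram1024x8m8wm1".to_string(), "ram".to_string());
    // Stand-in predicate for the built-in classifiers (assumption for this example).
    let builtin = |c: &str| c.starts_with("sky130_fd_sc_hd__");
    println!("{:?}", recognise("gf180mcu_ocd_ip_sram__sram1024x8m8wm1", builtin, &manifest));
}
```

Because the manifest sits at the end of the chain, a manifest entry for a cell that a built-in recogniser already claims has no effect, which is what keeps existing tests unchanged.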

Deferred to a future ADR

  • Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.
  • Built-in classifier removal. sky130.rs / gf180mcu.rs / gf180mcu_pdk.rs classification tables stay as fallback through the entire migration. Removal happens only after the manifest pathway is the source of truth for at least one PDK in production.
  • build.rs pin-table scanner removal. Same rule: removed LAST, after manifests cover the built-in PDKs.

Consequences

  • Third-party IP unblocks without Jacquard PRs. Users ship a <library>.cells.toml alongside their <library>.v; CI flows point --cell-library at both. The driving wafer.space tapeout's chip_top.pnl.v clears gf180mcu_ocd_ip_sram__sram1024x8m8wm1 by shipping a six-line manifest entry.
  • The "vendor + edit code + extend lists" PR workflow for new IP becomes "ship a manifest, no Jacquard change". docs/adding-a-pdk.md evolves to document the manifest pathway as the primary route.
  • The opaque-RAM semantics is honest about what v1.0 delivers — no silent partial memory modelling. The contract is "RAMBlock allocated, outputs X-source, no read/write behaviour" until a future schema version adds explicit ports.
  • Existing built-in PDK code stays load-bearing through the transition. No risk of regression in sky130 / gf180mcu test flows during the migration.
  • Issue #67 — design discussion.
  • PR #64 (9281e57) — most recent per-PDK-code-as-data workaround this ADR resolves.
  • ADR 0009 — OpenSTA Verilog reader input constraints (sverilogparse is already in-tree for unrelated reasons; Tier 1 reuses that dep).
  • docs/plans/declarative-cell-metadata.md — implementation phasing.
  • docs/plans/gf180mcu-enablement.md § Follow-on cleanup items 1, 2, 3 — superseded by this ADR.

ADR 0011 — RAM port-mapping schema for declarative cell metadata

Status: Accepted.

Context

ADR 0010 shipped a minimal Tier 2 slice with one kind discriminator per cell. For kind = "ram" specifically, v1.0 declares the cell-as-opaque: the AIG allocates a RAMBlock slot but routes outputs to X-source slots without resolving read/write port semantics. That's sufficient for "design boots from ROM, never reads SRAM contents" cases but fails the moment a real CPU writes to SRAM and expects to read its data back.

The acute trigger is the JTAG-DM firmware-load path enabled by PR #78: OpenOCD walks a debug-module sequence that culminates in abstract-memory writes into the design's SRAM, then jumps the CPU to that memory. Because the SRAM is opaque (no backing storage, writes go nowhere), the CPU boots to garbage. Issue #80 captures the symptom and notes that wiring SramInitConfig is the smaller sibling problem — pre-loading SRAM contents at tick 0 — but the bigger gap is that kind = "ram" doesn't model writes at all.

ADR 0010 § "Deferred to a future ADR" listed the port-mapping schema explicitly:

Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.

The OCD GF180MCU SRAM (gf180mcu_ocd_ip_sram__sram1024x8m8wm1) — a real third-party IP cell behind the apitronix-semiconductor / hazard3 / future wafer.space tapeout pipelines — gives us the concrete adoption-data input. This ADR fixes the schema against that worked example.

Worked example: the OCD SRAM

The upstream behavioural model (RTimothyEdwards/gf180mcu_ocd_ip_sram) declares:

module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (
    CLK, CEN, GWEN, WEN, A, D, Q
);
  input         CLK;                // posedge clock
  input         CEN;                // chip enable, active-low
  input         GWEN;               // global write enable, active-low
  input  [7:0]  WEN;                // per-bit write mask, active-low
  input  [9:0]  A;                  // address (1024 entries)
  input  [7:0]  D;                  // data in
  output [7:0]  Q;                  // data out
  reg    [7:0]  mem[1023:0];        // backing storage

Read semantics: on posedge CLK, when !CEN && GWEN → Q = mem[A]. Write semantics: on posedge CLK, when !CEN && !GWEN && !(&WEN) → mem[A][i] = D[i] for each i where !WEN[i].
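The read and write rules for this cell can be restated as a small software model. This is an illustrative sketch, not Jacquard's RAMBlock or the upstream Verilog: the Sram1024x8 type is hypothetical, and it assumes memory and Q start at zero, whereas the real model would leave them undefined.

```rust
/// Hypothetical behavioural model of the 1024x8 SRAM. All enables are active-low.
struct Sram1024x8 {
    mem: [u8; 1024],
    q: u8,
}

impl Sram1024x8 {
    fn new() -> Self {
        // Assumption: zero-initialised storage, purely for the example.
        Sram1024x8 { mem: [0; 1024], q: 0 }
    }

    /// Apply one rising clock edge with the given pin values.
    fn posedge(&mut self, cen: bool, gwen: bool, wen: u8, a: u16, d: u8) {
        if cen {
            return; // chip disabled: neither read nor write
        }
        let addr = usize::from(a & 0x3ff); // 10-bit address
        if gwen {
            // Read: !CEN && GWEN
            self.q = self.mem[addr];
        } else if wen != 0xff {
            // Write: !CEN && !GWEN && !(&WEN); bit i is written where WEN[i] is low.
            let write_bits = !wen;
            self.mem[addr] = (self.mem[addr] & wen) | (d & write_bits);
        }
    }
}

fn main() {
    let mut sram = Sram1024x8::new();
    sram.posedge(false, false, 0x00, 5, 0xAB); // write all eight bits
    sram.posedge(false, false, 0xF0, 5, 0x5C); // overwrite the low nibble only
    sram.posedge(false, true, 0xFF, 5, 0x00);  // read back
    println!("Q = {:#04x}", sram.q);
}
```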

The schema needs to capture: per-pin polarity (active-low vs active-high), per-pin role (clock / chip-enable / write-enable / mask / address / data-in / data-out), bus widths (derived from the Verilog declaration; not redeclared), mask granularity (per-bit vs per-byte vs none).

Decision

Extend the <library>.cells.toml schema with an optional ram sub-table on entries declaring kind = "ram". Presence of the sub-table promotes a cell from opaque (v1.0 semantics) to explicit — outputs are properly wired to the AIG-backed RAMBlock, writes populate backing storage, reads return what was written.

Schema (v1.1)

schema_version = "1.1"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"

[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock        = { pin = "CLK", edge = "pos" }
chip_enable  = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask   = { pin = "WEN", polarity = "low", granularity = "bit" }
address      = "A"
data_in      = "D"
data_out     = "Q"

Field semantics

  • depth (required, integer): number of addressable entries. Must satisfy depth ≤ 2^AIGPDK_SRAM_ADDR_WIDTH (8192 today).
  • width (required, integer 1..=32): bit-width of each entry. Capped at 32 by RAMBlock's fixed-size port arrays.
  • clock (required, table): pin is the clock input pin name; edge defaults to "pos". "neg" is accepted (matches gf180mcu dffnq-family negedge convention).
  • chip_enable (optional, table): pin + polarity (default "low"). When the pin's effective level is inactive, the cell neither reads nor writes for that cycle. Omit for sync SRAMs that are always-enabled.
  • write_enable (optional, table): pin + polarity (default "low"). Gates all writes regardless of mask. The OCD SRAM's GWEN. Omit for SRAMs without a global write-enable.
  • write_mask (optional, table): per-bit / per-byte write enables. pin is the mask pin name; polarity defaults to "low"; granularity is "bit" (default) or "byte". The mask width must match width (bit) or width / 8 (byte). Omit for SRAMs without per-bit masking — in that case the global write_enable controls the whole word.
  • address / data_in / data_out (required, string): pin names. Bus widths are read from the Verilog (via sverilogparse) — not re-declared here.
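The numeric constraints in the field list could be checked along these lines. The sketch is hypothetical (RamSpec, validate, and SRAM_ADDR_WIDTH are illustrative names), assumes the address width is 13 because the text gives 8192 as today's limit, and additionally rejects depth = 0, which the ADR does not state. For byte granularity it uses integer division and does not decide whether widths that are not a multiple of 8 are legal.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Granularity {
    Bit,
    Byte,
}

/// Hypothetical, reduced view of a `ram` sub-table after pin resolution.
struct RamSpec {
    depth: u32,
    width: u32,
    /// Mask granularity plus the mask pin's bus width as read from the Verilog.
    write_mask: Option<(Granularity, u32)>,
}

/// Assumed value: 2^13 = 8192 matches the limit quoted in the field list.
const SRAM_ADDR_WIDTH: u32 = 13;
const MAX_DEPTH: u32 = 1 << SRAM_ADDR_WIDTH;

fn validate(spec: &RamSpec) -> Result<(), String> {
    if spec.depth == 0 || spec.depth > MAX_DEPTH {
        return Err(format!("depth {} outside 1..={}", spec.depth, MAX_DEPTH));
    }
    if !(1..=32).contains(&spec.width) {
        return Err(format!("width {} outside 1..=32", spec.width));
    }
    if let Some((granularity, mask_width)) = spec.write_mask {
        let expected = match granularity {
            Granularity::Bit => spec.width,
            Granularity::Byte => spec.width / 8,
        };
        if mask_width != expected {
            return Err(format!("mask width {} != expected {}", mask_width, expected));
        }
    }
    Ok(())
}

fn main() {
    // The OCD SRAM from the worked example: 1024 x 8 with a per-bit mask.
    let ocd = RamSpec { depth: 1024, width: 8, write_mask: Some((Granularity::Bit, 8)) };
    println!("{:?}", validate(&ocd));
}
```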

Optional cells (no ram block)

Cells declaring kind = "ram" without the ram sub-table fall back to v1.0 opaque mode — outputs route to X-source slots, no backing storage, no port resolution. The contract is unchanged for existing consumers.

Backing storage

Cells with an explicit ram block allocate a RAMBlock with port_r_* and port_w_* arrays populated from resolved pin positions. The simulator's existing GPU-side SRAM machinery handles reads, writes, and per-entry backing memory; no new kernel work is required.

Schema versioning

The top-level schema_version field bumps from "1.0" to "1.1". v1.0 manifests continue to parse — the ram sub-table is purely additive. Loaders that don't recognise the new sub-table (none today; this ADR ships the loader simultaneously) would treat flagged cells as opaque RAMs, which is a graceful degradation.

SRAM preload (sibling work)

TestbenchConfig::sram_init (an existing schema field declared in src/testbench.rs but unwired today — issue #80) becomes load-bearing once explicit-port RAMs have backing storage. The preload path:

  1. Parse ELF segments from sram_init.elf_path.
  2. Match segments to SRAM instances by virtual-address overlap with declared SRAM regions.
  3. Write segment bytes into each matched SRAM's backing memory at tick 0.
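Steps 2 and 3 reduce to an interval-overlap copy. The sketch below shows that core under simplifying assumptions: ELF parsing is skipped (segments arrive as already-decoded structs), each SRAM is modelled as a byte-wide region with a base address, and the names Segment, SramRegion, and preload are hypothetical rather than part of SramInitConfig.

```rust
/// A loadable segment, as it might look after ELF parsing (hypothetical).
struct Segment {
    vaddr: u64,
    bytes: Vec<u8>,
}

/// One SRAM instance's address window and backing memory (hypothetical).
struct SramRegion {
    base: u64,
    mem: Vec<u8>,
}

/// Copy every byte of every segment that falls inside a region.
/// Returns the number of bytes written.
fn preload(regions: &mut [SramRegion], segments: &[Segment]) -> usize {
    let mut written = 0;
    for seg in segments {
        let seg_end = seg.vaddr + seg.bytes.len() as u64;
        for region in regions.iter_mut() {
            let region_end = region.base + region.mem.len() as u64;
            // Overlap of the half-open intervals [vaddr, seg_end) and [base, region_end).
            let lo = seg.vaddr.max(region.base);
            let hi = seg_end.min(region_end);
            if lo >= hi {
                continue;
            }
            let src = &seg.bytes[(lo - seg.vaddr) as usize..(hi - seg.vaddr) as usize];
            let dst_start = (lo - region.base) as usize;
            region.mem[dst_start..dst_start + src.len()].copy_from_slice(src);
            written += src.len();
        }
    }
    written
}

fn main() {
    let mut regions = vec![
        SramRegion { base: 0x2000_0000, mem: vec![0; 8] },
        SramRegion { base: 0x2000_0008, mem: vec![0; 8] },
    ];
    // A segment that straddles the boundary between the two SRAMs.
    let segments = vec![Segment { vaddr: 0x2000_0006, bytes: vec![1, 2, 3, 4] }];
    let n = preload(&mut regions, &segments);
    println!("{} bytes written", n);
}
```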

Schema extensions to SramInitConfig (instance targeting, multi-section support) land alongside the implementation but don't require an ADR — purely additive JSON schema work.

Consequences

  • The OCD GF180MCU SRAM (and any structurally similar third-party IP — 1RW, sync, optional per-bit mask) becomes simulable end-to-end via the manifest pathway. Real CPU writes populate real memory.
  • The opaque-mode fallback stays load-bearing for cells the consumer hasn't taken the time to schema-map — important so the cell-library pathway doesn't require schema work just to load a cell library.
  • JTAG-DM-driven firmware load (PR #78 stage 1) becomes end-to-end testable in cosim. Closes the chicken-and-egg loop for designs whose firmware-load mechanism is what cosim is trying to validate.
  • The schema is opinionated: 1-port (1RW), sync-only, write-mask is bit OR byte (not arbitrary). Multi-port SRAMs (2RW, 1R1W), async SRAMs, and write-mask-with-stripes encodings are explicitly out of scope. Adding them is a future schema version (v1.2+); doesn't break v1.1 manifests.

Out of scope

  • Multi-port SRAMs. Most foundry IPs in our ecosystem are single-port. Dual-port designs are a meaningful follow-up but not driven by any in-tree fixture today.
  • Async (non-clocked) SRAMs. Rarely seen in synthesised digital designs on modern PDKs. Not modelled.
  • Width > 32 bits. Bounded by RAMBlock's array sizes; consumers wider than 32 should split into multiple instances.
  • Built-in classifier removal. Same rule as ADR 0010 — the $__RAMGEM_SYNC_ and CF_SRAM_* recognisers stay as fallback; manifest-declared RAMs supplement, don't replace.
  • ADR 0010 — Declarative cell metadata (the parent decision deferring this schema).
  • Issue #80 — driving consumer.
  • PR #78 — the JTAG-DM workflow that surfaced the schema need.
  • Upstream OCD SRAM behavioural model: RTimothyEdwards/gf180mcu_ocd_ip_sram.

ADR 0012 — Reproducible CDC jitter injection for multi-clock cosim

Status: Accepted — design accepted and partially implemented. The reproducibility core (§1) and scheduler-domain jitter on the VCD timeline (§2, partial) are built; model-driven jitter (§3), setup/hold integration (§5), the gcd_ps/2 guard, and true coincident-edge perturbation (§4) are not yet. The sections below describe the decided design; see Implementation status for what is built versus deferred. Remaining work is tracked in issue #92 and ../plans/cdc-jitter-completion.md.

Context

The multi-clock scheduler (MultiClockScheduler in cosim_metal.rs) pre-computes a fixed LCM-based edge schedule: every clock domain fires at perfectly rational offsets forever. Real hardware doesn't do that — PLL jitter, clock-tree skew, and propagation delay make coincident edges land in unpredictable order. CDC synchronizers are designed to tolerate this, but RTL bugs (missing synchronizers, gray-code errors, handshake protocol violations) only surface when edge alignment varies from the ideal.

The motivating incident was PR #89 / run 26413667030: a scheduler index bug caused sys_clk to fire at TCK's period, making CDC synchronizers between the JTAG and system clock domains marginal. The test passed intermittently because Metal GPU scheduling jitter shifted the effective phase relationship. Once the bug was fixed (commit 5bb07c3), determinism was restored — but the experience highlighted that no deliberate mechanism exists to stress-test CDC paths under controlled timing skew.

Additionally, cosim's model-driven clocks (JtagReplayModel, SpiFlashModel, etc.) override the scheduler's periodic pattern with software-driven edges. These introduce a distinct CDC concern: model-driven clock transitions are phase-locked to the host-side dispatch loop, not to the design's system clock. The same jitter injection infrastructure must cover both scheduler-derived and model-driven clock edges.

The multi-clock plan (docs/plans/multi-clock-and-stimulus-architecture.md) lists "CDC verification mode: jitter injection on coincident edges and random X-injection on detected async-source paths" as a future capability. This ADR formalises the design for the jitter injection half; X-injection is deferred to a follow-up ADR that depends on MC.1 (island partitioner) landing.

Decision

1. Run-parameters file and per-domain seeded PRNG

Simulation runs that use any non-deterministic feature (jitter, future partition randomisation, model-driven timing offsets) are governed by a run-parameters file (--run-params <path>):

{
  "master_seed": 8429173640281
}

From the master seed, a per-domain sub-seed is derived for each clock domain and each model-driven clock (e.g. sub_seed = hash(master_seed, domain_name)). Each domain gets its own independent PRNG stream. This ensures reproducibility even when the number of PRNG draws per domain is path-dependent — a reactive model that fires more or fewer edges based on design output doesn't contaminate another domain's displacement sequence.
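The per-domain derivation can be sketched with standard-library code only. The hash and generator here are stand-ins chosen for the example: FNV-1a replaces whatever hash RunParams::domain_seed uses, and SplitMix64 replaces the ChaCha8Rng named in the implementation table. Only the structure (one seed, one independent stream per domain name) is taken from the text.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Stand-in for `hash(master_seed, domain_name)`: FNV-1a over the seed bytes
/// followed by the name bytes. Stable across runs and platforms.
fn domain_seed(master_seed: u64, domain_name: &str) -> u64 {
    let seed_bytes = master_seed.to_le_bytes();
    let mut h = FNV_OFFSET;
    for &byte in seed_bytes.iter().chain(domain_name.as_bytes()) {
        h ^= u64::from(byte);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Stand-in PRNG (SplitMix64); each domain owns one instance.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let master_seed = 8_429_173_640_281_u64;
    let mut sys_clk = SplitMix64::new(domain_seed(master_seed, "sys_clk"));
    let mut tck = SplitMix64::new(domain_seed(master_seed, "tck"));
    // Draws on `tck` never advance `sys_clk`, so a reactive model that fires
    // extra TCK edges cannot shift the sys_clk displacement sequence.
    for _ in 0..3 {
        tck.next_u64();
    }
    println!("first sys_clk draw: {:#018x}", sys_clk.next_u64());
}
```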

Behaviour:

  • --run-params <path> supplied, file exists: load parameters from it. The run is a deterministic replay.
  • --run-params <path> supplied, file does not exist: generate a master seed from system entropy, write the file immediately (before the simulation loop starts), then run. The user gets reproducibility even if the process crashes mid-simulation.
  • No --run-params flag: generate a master seed, write to a default location (<output_dir>/run_params.json next to the output VCD) before simulation begins. Always persisted — the user can re-run any simulation by passing the written file back.

The master seed is also logged at INFO level and included in the VCD header comment, so even without the file the seed is recoverable from logs.

Rationale: "random testing that can't be replayed isn't testing," but forcing users to pick seeds upfront discourages use. Writing the file before simulation means every run — even a crashed one — is reproducible after the fact. Per-domain streams mean the seed alone is sufficient; no displacement log is needed.

For CI seed sweeps, a wrapper generates N parameter files with sequential seeds and fans out runs. Each failure ships with its parameter file as an artifact — gh run download gives you everything needed to reproduce locally.

2. Per-domain jitter budget

A new jitter_ps field on ClockConfig in sim_config.json declares the maximum edge displacement in picoseconds for that domain:

{
  "clocks": [
    { "gpio": 0, "period_ps": 40000, "name": "sys_clk", "jitter_ps": 200 },
    { "gpio": 2, "period_ps": 160000, "name": "tck", "jitter_ps": 0 }
  ]
}

At each edge, the scheduler draws a signed displacement from a uniform distribution [-jitter_ps, +jitter_ps] and shifts the edge forward or backward within the GCD granularity window. The resulting edge still fires within the same GCD tick (no reordering across ticks), but the effective arrival time recorded in the state buffer (and honoured by setup/hold checkers) shifts. Disabling jitter (jitter_ps: 0) is the default and produces today's ideal-clock behaviour.

Constraint: jitter must not exceed gcd_ps / 2; larger values would re-order edges across GCD ticks and require a fundamentally different scheduling model.
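The budget check and the per-edge draw can be sketched as follows. Two assumptions are worth flagging: gcd_ps is taken to be the GCD of the configured period_ps values, which this ADR does not spell out, and the draw maps a raw 64-bit PRNG output onto the range with a modulo, which carries a negligible bias for budgets this small. The function names are hypothetical.

```rust
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Check every domain's jitter against gcd_ps / 2.
/// `clocks` holds (period_ps, jitter_ps) pairs. Returns the computed gcd_ps.
fn check_jitter_budget(clocks: &[(u64, u64)]) -> Result<u64, String> {
    // Assumption: gcd_ps is the GCD of all clock periods.
    let gcd_ps = clocks.iter().fold(0, |g, &(period_ps, _)| gcd(g, period_ps));
    for &(_, jitter_ps) in clocks {
        if jitter_ps > gcd_ps / 2 {
            return Err(format!("jitter_ps {} exceeds gcd_ps / 2 = {}", jitter_ps, gcd_ps / 2));
        }
    }
    Ok(gcd_ps)
}

/// Map a raw PRNG output onto the inclusive range [-jitter_ps, +jitter_ps].
fn draw_displacement(raw: u64, jitter_ps: u64) -> i64 {
    let span = 2 * jitter_ps + 1; // number of integer values in the range
    (raw % span) as i64 - jitter_ps as i64
}

fn main() {
    // sys_clk: 40000 ps period, 200 ps budget; tck: 160000 ps period, no jitter.
    let clocks = [(40_000, 200), (160_000, 0)];
    match check_jitter_budget(&clocks) {
        Ok(gcd_ps) => println!("gcd_ps = {}, budgets accepted", gcd_ps),
        Err(e) => eprintln!("rejected: {}", e),
    }
    println!("example displacement: {} ps", draw_displacement(123_456_789, 200));
}
```

With jitter_ps = 0 the span is 1 and every draw is 0, which reproduces the ideal-clock default.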

3. Model-driven clock jitter

Model-driven clocks (JTAG TCK, SPI SCK, etc.) bypass the scheduler's periodic edges. Their jitter path is different:

  • A --cdc-model-jitter-ps <N> flag (or per-model jitter_ps in the config) specifies the budget for model-driven transitions.
  • After patch_model_clock_edges fires the edge, the arrival-time offset recorded in the timing state is displaced by a PRNG-drawn value from the same seeded generator.
  • This does NOT delay the functional edge (the DFF still samples on the same tick) — it shifts the timing-model arrival so that setup/hold checks against the receiving domain see a different margin each run.

The functional-vs-timing split means jitter injection doesn't change combinational propagation (which would require an event-driven kernel), only the timing oracle's view of when edges "really" arrived. This is consistent with Jacquard's philosophy: functional correctness is cycle-accurate, timing is an overlay.

4. Coincident-edge perturbation

When two domains have edges scheduled at the same GCD tick (coincident edges), their relative order is undefined in real hardware. The jitter mechanism naturally handles this: if domain A's jitter shifts it +100ps and domain B's shifts it -50ps, the timing model sees A after B, which may differ from the next run's draw. This exercises both "A-before-B" and "B-before-A" orderings over a seed sweep without needing explicit permutation logic.
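The A/B example above works out as follows when each domain carries its own displacement. This illustrates the decided design rather than any existing code path; arrival_order is a hypothetical helper and the nominal tick time is an arbitrary value chosen for the example.

```rust
/// Given the nominal time of a shared GCD tick and each domain's signed
/// displacement, return (domain, effective arrival) sorted by arrival time.
/// The sort is stable, so equal arrivals keep their input order.
fn arrival_order<'a>(nominal_ps: i64, displacements: &[(&'a str, i64)]) -> Vec<(&'a str, i64)> {
    let mut arrivals: Vec<(&'a str, i64)> = displacements
        .iter()
        .map(|&(name, delta_ps)| (name, nominal_ps + delta_ps))
        .collect();
    arrivals.sort_by_key(|&(_, t)| t);
    arrivals
}

fn main() {
    // Domain A drew +100 ps and domain B drew -50 ps on a tick at 160000 ps.
    let order = arrival_order(160_000, &[("A", 100), ("B", -50)]);
    // The timing model sees B first, then A.
    println!("{:?}", order);
}
```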

5. Integration with existing infrastructure

  • Setup/hold checker (timing_report.rs): already receives arrival-time offsets. Jittered arrivals feed directly into the existing violation detection — a jitter-induced setup violation appears in --timing-report output with the jittered arrival annotated.
  • VCD ring buffer: records the jittered arrival time so waveform viewers show the displaced edge.
  • X-prop (future): when MC.1 identifies CDC boundaries, X-injection on violated paths can use the same PRNG stream for correlated randomisation.
  • --check-with-cpu: the CPU baseline does NOT apply jitter (it doesn't model timing at all). Jitter-mode results should not be compared against the CPU baseline. The flag combination --run-params (with jitter enabled) + --check-with-cpu should warn or error.

Implementation status

The design above is accepted in full; the code implements part of it. This section is the source of truth on what is built. Remaining items are tracked in issue #92 / ../plans/cdc-jitter-completion.md.

Implemented:

| Part | Where |
|---|---|
| Run-parameters file, master_seed, load/write/load_or_generate | src/sim/run_params.rs |
| Per-domain sub-seed hash(master_seed, name) + per-domain ChaCha8Rng | RunParams::domain_seed; cosim_metal.rs |
| jitter_ps per ClockConfig (default 0) | src/testbench.rs |
| Uniform [-jitter_ps, +jitter_ps] draw per domain per tick | cosim_metal.rs |
| Jitter displacement applied to the timing-VCD event timestamp | cosim_metal.rs (inside the --output-vcd block) |
| master_seed logged at INFO | cosim_metal.rs |
| --check-with-cpu + jitter warning | cosim_metal.rs |

Deferred (issue #92):

| Part | ADR § | Gap |
|---|---|---|
| Setup/hold integration | §2, §5 | Jitter shifts only the VCD base timestamp; it does not feed the per-signal arrival offsets, so it produces no --timing-report violations. Also: jitter currently has no effect unless --output-vcd is set. |
| Model-driven clock jitter | §3 | No --cdc-model-jitter-ps flag or patch_model_clock_edges path; only scheduler domains jitter. |
| True coincident-edge perturbation | §4 | A single global displacement (last firing domain wins) is applied to the shared timestamp rather than independent per-domain displacement. |
| gcd_ps / 2 constraint | §2 | Not validated. |
| Persist seed unconditionally | §1 | Without --run-params or --output-vcd, the seed is generated but not written. |
| master_seed in VCD header comment | §1, §5 | INFO log only. |
| --cdc-jitter-seed CI sweep | Consequences | The replay mechanism is --run-params; no dedicated CI sweep step yet. |

Consequences

  • CI can run a small seed sweep (via --run-params) as a lightweight CDC stress test on every PR, catching synchroniser failures that the ideal-clock schedule hides.
  • Users debugging real silicon CDC failures can replay the exact jitter pattern that triggered the issue.
  • The design is forward-compatible with X-injection (the PRNG infrastructure and per-domain budgets are reusable).
  • Model-driven clocks get explicit jitter coverage rather than relying on accidental GPU scheduling delays.
  • No kernel changes required — jitter is a host-side timing-model overlay on the existing edge schedule.

Deferred

  • X-injection on CDC paths. Requires MC.1's island partitioner to identify which DFF outputs cross domains. Separate ADR once MC.1 lands.
  • Frequency sweep / DFS simulation. Changing a clock's period mid-simulation is orthogonal to jitter. Captured in the multi-clock plan as a future axis.
  • Per-path jitter profiles. Real jitter isn't uniform — PLLs have period jitter (Gaussian), recovered clocks have cycle-to-cycle jitter (bounded), external clocks have frequency offset (deterministic drift). V1 uses uniform; richer distributions can be added later without API changes (the seed + budget interface is distribution-agnostic).

ADR 0013 — Cosim peripheral model architecture

Status: Accepted — the architecture is implemented and in use across multiple peripherals (multi-UART #90, config-driven APB3 bus tracing). The "Target architecture" section below tracks the remaining, optional refactors; the conventions it establishes are already followed.

Context

Jacquard's cosim mode runs reactive peripheral models alongside the GPU-simulated design: SPI flash serves firmware, UART decodes serial output, JTAG replays debug sessions, GPIO drives/observes pins, and Wishbone trace captures bus transactions. The architecture evolved organically; this ADR documents the current design, identifies the abstractions emerging from it, and establishes conventions for extending it.

Architecture

Execution domains

Peripheral work splits across CPU and GPU. The boundary follows a simple rule: models that drive input pins (must react to design output each edge) run on the CPU; models that observe output pins (pure consumers of post-simulation state) or exchange data bidirectionally with the design run on the GPU for zero-copy access to the state buffer.

Some peripherals span both domains. UART has a CPU-side RX driver (feeds bytes into the design's RX input pin) and a GPU-side TX decoder (reads the design's TX output pin).

CPU-side: PeripheralModel trait

Defined in src/sim/models/mod.rs:

// Parameter types of step_edge and contribute_overrides are elided in this listing.
trait PeripheralModel {
    fn name(&self) -> &str;
    fn driven_positions(&self) -> &[u32];
    fn apply_action(&mut self, action: &QueuedAction);
    fn step_edge(&mut self, output_state, overrides, emitted); // default: just calls contribute_overrides
    fn contribute_overrides(&self, overrides);
    fn is_active(&self) -> bool; // default: false
}

apply_action is how the InputDispatcher feeds queued stimulus commands to models. is_active signals that the model is mid-transmission and needs per-edge granularity (it forces the batch size to 1). step_edge has a default that just calls contribute_overrides; stateless models (GPIO) only need the latter.

Models are registered into a Vec<Box<dyn PeripheralModel>> at startup. Each batch boundary, the loop calls step_edge on every model; models write their pin drives into a shared ModelOverrides map. These overrides are patched in-place into pre-allocated BitOp arrays (built at startup with placeholder entries for model-driven positions) and applied via the state_prep GPU kernel.

Note: step_edge currently receives an empty output_state slice — GPU output state is not read back per-edge for CPU-side models. GPIO and UART RX don't need it; I²C and SPI bus observation will require wiring the output state readback when those models are completed.
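The relationship between step_edge, contribute_overrides and is_active can be sketched as follows. This is a minimal illustration, not the code in src/sim/models/mod.rs: ModelOverrides is modelled as a hypothetical HashMap from state position to bit value, apply_action and the emitted parameter are omitted, and ConstGpio, batch_boundary and max_batch are names invented for the example.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the shared override map: state position -> driven bit.
type ModelOverrides = HashMap<u32, bool>;

trait PeripheralModel {
    fn name(&self) -> &str;
    fn driven_positions(&self) -> &[u32];
    fn contribute_overrides(&self, overrides: &mut ModelOverrides);
    // Default mirrors the description: just delegate to contribute_overrides.
    fn step_edge(&mut self, _output_state: &[u32], overrides: &mut ModelOverrides) {
        self.contribute_overrides(overrides);
    }
    fn is_active(&self) -> bool {
        false
    }
}

// A stateless GPIO-style model: it only needs contribute_overrides.
struct ConstGpio {
    positions: Vec<u32>,
    level: bool,
}

impl PeripheralModel for ConstGpio {
    fn name(&self) -> &str {
        "gpio"
    }
    fn driven_positions(&self) -> &[u32] {
        &self.positions
    }
    fn contribute_overrides(&self, overrides: &mut ModelOverrides) {
        for &pos in &self.positions {
            overrides.insert(pos, self.level);
        }
    }
}

// One batch boundary: collect every model's pin drives, then pick the next
// batch size (1 if any model needs per-edge granularity).
fn batch_boundary(
    models: &mut [Box<dyn PeripheralModel>],
    max_batch: usize,
) -> (ModelOverrides, usize) {
    let mut overrides = ModelOverrides::new();
    for model in models.iter_mut() {
        // An empty output_state slice, matching the note above.
        model.step_edge(&[], &mut overrides);
    }
    let batch = if models.iter().any(|m| m.is_active()) { 1 } else { max_batch };
    (overrides, batch)
}
```

In this sketch a model that overrides is_active to return true while mid-transmission would pull the next batch down to a single edge, which is the behaviour the trait description implies.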

The dispatch is peripheral-agnostic: state_prep applies whatever BitOp array it receives. Clock edges, reset, GPIO, UART RX, and JTAG TCK/TMS/TDI are all entries in the same ops buffer.

Registered CPU-side models: GPIO, UART RX, JTAG replay (complete); I²C, SPI (scaffolded, output-state readback not yet wired).

GPU-side: two model patterns

GPU-side models fall into two categories distinguished by their data-flow relationship to the simulation:

Observe-only (post-simulate): The model reads output state after simulation and produces results (decoded bytes, bus traces) into a ring buffer. It never writes to input state. One kernel call per edge, after simulate_v1_stage.

Bidirectional (pre+post simulate): The model both reads the design's outputs and injects data into the design's inputs. This requires two kernel calls per edge — one before simulation (inject response data into input state) and one after (read request signals from output state, advance the model's FSM).

Pattern | When | Current models
Observe-only | Post-simulate | UART TX decoder, Wishbone bus trace
Bidirectional | Pre-simulate (inject) + post-simulate (sample, advance) | SPI Flash

Any memory-mapped peripheral (external SRAM, I²C EEPROM, etc.) would follow the bidirectional pattern.

Per-edge execution order

state_prep (apply clk/gpio/jtag pin drives from CPU-side models)
  → [bidirectional: inject] — e.g. gpu_apply_flash_din
    → simulate_v1_stage ×N (combinational logic evaluation)
  → [bidirectional: sample+advance] — e.g. gpu_flash_model_step
  → [observe-only] — e.g. gpu_io_step (UART TX + Wishbone)

CPU-side PeripheralModel::step_edge runs between GPU batches.

GPU→CPU communication: ring buffers

GPU-side models produce output into fixed-size ring buffers in device memory. The CPU drains these after each GPU batch completes, reading from a local read_head up to the GPU-written write_head. No synchronisation beyond Metal's command buffer completion is needed.
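The CPU-side drain can be sketched as below. The sketch assumes write_head is a free-running counter that is reduced modulo the capacity when indexing, and that the GPU never laps the CPU between drains; RingView and Drainer are hypothetical names, and a plain byte slice stands in for the device buffer.

```rust
/// CPU-side snapshot of a GPU-written ring buffer (hypothetical layout).
struct RingView<'a> {
    write_head: u32, // total elements written so far by the GPU
    data: &'a [u8],  // backing storage; capacity is data.len()
}

/// Holds the CPU-local read position between batches.
struct Drainer {
    read_head: u32,
}

impl Drainer {
    /// Copy out everything between read_head and write_head, wrapping the
    /// index at the buffer capacity.
    fn drain(&mut self, ring: &RingView) -> Vec<u8> {
        let cap = ring.data.len() as u32;
        let mut out = Vec::new();
        while self.read_head != ring.write_head {
            out.push(ring.data[(self.read_head % cap) as usize]);
            self.read_head = self.read_head.wrapping_add(1);
        }
        out
    }
}
```

Because the drain only runs after the command buffer has completed, the snapshot of write_head is stable for the duration of the loop, which is why no further synchronisation appears here.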

Current ring buffers:

Buffer | Element | Capacity
UartChannel | u8 (decoded bytes) | 4096
WbTraceChannel | WbTraceEntry (20 bytes) | 16384

Configuration

Peripheral config lives in sim_config.json, deserialized into TestbenchConfig (src/testbench.rs):

Peripheral | Field | Plural?
Clock | clocks: Option<Vec<ClockConfig>> | Yes (effective_clocks())
GPIO | gpios: Vec<GpioConfig> | Yes
UART | uart + uarts: Vec<UartConfig> | Yes (effective_uarts(), #90)
Flash | flash: Option<FlashConfig> | Not yet
JTAG | jtag: Option<JtagConfig> | Not yet
Wishbone | (auto-detected, hardcoded signal names) | N/A (legacy)
Bus trace (AHB/APB) | bus_traces: Vec<BusTraceConfig> | Yes (effective_bus_traces())

Current implementation (bespoke kernels)

Today each GPU-side peripheral has its own kernel function:

Kernel | Slots | Pattern
gpu_apply_flash_din | states[0], flash_state[1], flash_din_params[2] | Bidirectional: inject
gpu_flash_model_step | states[0], flash_state[1], flash_model_params[2], flash_data[3] | Bidirectional: sample+advance
gpu_io_step | states[0], uart_state[1], uart_params[2], uart_channel[3], wb_channel[4], wb_params[5], bus_channel[6], bus_params[7] | Observe-only (UART + Wishbone + AHB/APB bus trace)

All run on thread 0 only — the per-tick work is a trivial FSM step. gpu_io_step combines three logically independent observe-only models, gated by n_uarts > 0, has_trace, and n_buses > 0 respectively.

Config-driven bus monitor (AHB/APB)

The Wishbone trace (build_wb_trace_params) hardcodes one SoC's signal names (cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …) directly in source. The AHB/APB bus tracer generalizes it into a config-driven, protocol-aware monitor that is the model for future bus tracing:

  • Config (BusTraceConfig): name, protocol (apb3 / ahb-lite / ahb5), hierarchical prefix, addr_bits/data_bits, and optional per-pin signals overrides. Pins default to {prefix}{pin}.
  • Pin binding: protocol pin names (psel, paddr, …) are resolved to output-state positions via resolve_to_state_pos in trace_signals.rs — the same multi-candidate resolver --trace-signals uses, so Yosys-flattened / scalar-expanded / structural naming all work. The pins are registered as observables before partitioning (via DesignArgs::extra_observable_signals) so they get state-buffer slots.
  • GPU capture / CPU decode split: the kernel is protocol-agnostic — it packs a raw beat (addr, wdata, rdata, ctrl flags) into the ring buffer on the protocol's gating edge (psel & penable & pready for APB), using rising-edge detection so exactly one beat is recorded per completed transfer. The protocol FSM (phase pairing, burst tracking, response decode) lives in plain, unit-testable Rust in src/sim/models/bus_trace.rs. APB3 is stateless (one beat = one transaction); AHB pairing is the Phase-2 extension.
  • Output: decoded transactions stream to CSV via --bus-trace-csv; annotated-VCD emission is a planned follow-up.

This is observe-only, so it slots into the existing post-simulate pattern. Migrating the hardcoded WbTrace onto this mechanism (expressing the VexRiscv ibus/dbus as configured buses) is a clean follow-up.
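The rising-edge capture rule for APB can be illustrated with a small CPU-side model. This is a sketch of the gating logic only, written in Rust for readability; the real capture runs in the GPU kernel, and ApbCapture, sample and the (addr, data) beat tuple are hypothetical names.

```rust
/// Records one beat per completed APB transfer.
#[derive(Default)]
struct ApbCapture {
    prev_gate: bool,       // value of psel & penable & pready on the previous sample
    beats: Vec<(u32, u32)>, // captured (addr, data) pairs
}

impl ApbCapture {
    /// Sample the bus once per simulated edge.
    fn sample(&mut self, psel: bool, penable: bool, pready: bool, addr: u32, data: u32) {
        let gate = psel && penable && pready;
        // Rising-edge detection: record only on the 0 -> 1 transition of the
        // gate, so a gate that stays high across samples yields one beat.
        if gate && !self.prev_gate {
            self.beats.push((addr, data));
        }
        self.prev_gate = gate;
    }
}
```

Back-to-back APB transfers still produce separate beats, because every transfer starts with a setup phase in which penable is low and the gate therefore drops.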

Target architecture

The two patterns (observe-only, bidirectional) and the common conventions (ring buffers, params structs, per-instance config arrays) should be codified so new peripherals follow a template:

Common conventions

  • Params struct layout: { u32 state_size; u32 n_active; u32 _pad[2]; PerInstanceConfig configs[MAX_N]; } — uniform header, compile-time MAX_N cap.
  • Ring buffer struct: { u32 write_head; u32 capacity; u32 _pad[2]; T data[CAP]; } — shared across all models producing GPU→CPU output.
  • Buffer sizing: always MAX_N elements regardless of n_active. Wastes negligible memory for small N.
  • Guard pattern: for (i = 0; i < n_active && i < MAX_N; i++) replaces the current has_foo != 0 booleans.
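On the Rust side, the conventions above could be mirrored with repr(C) structs along these lines. The field contents of PerInstanceConfig, the value of MAX_N and the helper active_configs are assumptions made for illustration; only the header layouts come from the list above.

```rust
// Hypothetical compile-time cap on instances.
const MAX_N: usize = 4;

// Hypothetical per-instance payload; the real fields depend on the peripheral.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct PerInstanceConfig {
    pin_pos: u32,
    divisor: u32,
}

// Uniform 16-byte header followed by a fixed-size config array.
#[repr(C)]
#[derive(Clone, Copy)]
struct Params {
    state_size: u32,
    n_active: u32,
    _pad: [u32; 2],
    configs: [PerInstanceConfig; MAX_N],
}

// Shared ring buffer layout: 16-byte header followed by CAP elements.
#[repr(C)]
struct RingBuffer<T, const CAP: usize> {
    write_head: u32,
    capacity: u32,
    _pad: [u32; 2],
    data: [T; CAP],
}

// Host-side equivalent of the guard pattern: never look past MAX_N even if
// n_active is corrupt or oversized.
fn active_configs(p: &Params) -> &[PerInstanceConfig] {
    let n = (p.n_active as usize).min(MAX_N);
    &p.configs[..n]
}
```

Keeping the header at a fixed 16 bytes means the array starts at the same offset in every params struct, so the kernel-side and host-side definitions only have to agree on the element type.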

Model registration

New GPU-side models declare which pattern they follow:

  • Observe-only: register a post-simulate kernel. Receives output state (read-only), writes to ring buffer.
  • Bidirectional: register a pre-simulate kernel (inject into input state) and a post-simulate kernel (read output state, advance FSM).

Today this registration is implicit in cosim_metal.rs's encode_and_commit_gpu_batch. Formalizing it is a future step — the convention is sufficient while the model count is small.

Plural config convention

To support multi-instance peripherals (multiple UARTs, potentially multiple flash chips or RAM banks):

  • Legacy singular field kept via #[serde(default)].
  • New plural field alongside (e.g. uarts: Vec<UartConfig>).
  • effective_<peripheral>() -> Vec<Config> merges both.
  • Each config struct gains name: Option<String> for labelling.

This mirrors the existing effective_clocks() pattern.
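A minimal sketch of the merge step, leaving out the serde derives so that it stays self-contained. The baud field and the ordering (legacy singular entry first, then the plural entries) are assumptions for illustration; the text above only states that both sources are merged.

```rust
#[derive(Clone, Debug, PartialEq)]
struct UartConfig {
    name: Option<String>, // optional label, per the convention above
    baud: u32,            // hypothetical field
}

#[derive(Default)]
struct TestbenchConfig {
    uart: Option<UartConfig>, // legacy singular field
    uarts: Vec<UartConfig>,   // new plural field
}

impl TestbenchConfig {
    /// Merge the legacy singular entry (if any) with the plural list.
    fn effective_uarts(&self) -> Vec<UartConfig> {
        self.uart.iter().chain(self.uarts.iter()).cloned().collect()
    }
}
```

With this shape, existing configs that only set uart keep working unchanged, and downstream code only ever iterates over effective_uarts().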

Cross-backend considerations

Cosim is Metal-only today. CUDA/HIP paths (kernel_v1_impl.cuh) implement the core simulation kernel but have no gpu_io_step or flash kernels. When CUDA/HIP cosim is added, the same two-pattern taxonomy applies — the kernel implementations will differ but the Rust-side buffer allocation, config resolution, and drain logic can be shared via feature-gated code in cosim_metal.rs (or a future cosim_common.rs).

Phasing

Phase | Scope | Status
1 | Multi-UART (#90): first peripheral using plural-config + array-in-kernel conventions | Done
1b | Config-driven bus monitor, APB3 + CSV (GPU-capture/CPU-decode split) | Done
2 | Refactor gpu_io_step to use common params/ring-buffer layout | Future
2b | AHB-Lite / AHB5 bus tracing + annotated-VCD output; migrate WbTrace onto the general monitor | Future
3 | Multi-Flash / external RAM (bidirectional pattern) | Deferred (no use case yet)
- | Multi-JTAG | Not needed (TAP daisy-chain suffices)

Plan docs: ../plans/multi-peripheral-cosim.md, ../plans/bus-transaction-tracing.md.

ADR 0014 — AIG as simulation intermediate representation

Status: Accepted

Context

Jacquard simulates gate-level RTL designs on GPUs by converting technology-mapped netlists into an executable form. The choice of intermediate representation (IR) determines how easily the design maps to GPU hardware, how much the representation compresses, and what classes of optimisation are available at compile time.

Gate-level netlists arrive from synthesis tools (Yosys, Synopsys DC) mapped to a variety of cell libraries: the project's own AIGPDK library, SKY130, or GF180MCU. Each library uses different cell names and pin conventions; the IR must abstract over these while preserving the combinational and sequential semantics exactly.

The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) describes a "virtual Boolean processor" that evaluates combinational logic as a tree of AND-with-invert operations — directly motivating an and-inverter graph.

Decision

1. Uniform AND-gate IR

All combinational logic is represented as an and-inverter graph (AIG). Every node in the combinational cone is one of:

pub enum DriverType {
    AndGate(usize, usize),      // inputs with inversion bits
    InputPort(usize),           // primary input
    InputClockFlag(usize, u8),  // clock edge flag
    DFF(usize),                 // sequential (D flip-flop output)
    SRAM(usize),                // memory block output
    Tie0,                       // constant zero
}

Only AndGate has combinational fan-in. The two operands carry an inversion bit in their LSB (aigpin_id << 1 | invert), giving the full {AND, NAND, NOR, OR} family with a single node type. Inverters and buffers are absorbed into the inversion bits rather than creating separate nodes, keeping the graph compact.

This uniformity is the key property: because every combinational node is the same (a XOR xa) AND (b XOR xb) operation, the boomerang reduction tree (ADR 0015) can execute them all with a single GPU instruction pattern — no opcode decode, no per-cell dispatch.
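The inversion-bit encoding can be made concrete with a few helper functions. The encoding aigpin << 1 | invert is taken from the text; the helper names lit, eval_lit and eval_and, and the flat slice of pin values, are hypothetical and exist only for this example.

```rust
/// Pack a pin index and an inversion flag into one operand literal.
fn lit(aigpin: usize, invert: bool) -> usize {
    (aigpin << 1) | invert as usize
}

/// Value of a literal: the referenced pin's value, flipped if the LSB is set.
fn eval_lit(values: &[bool], l: usize) -> bool {
    values[l >> 1] ^ (l & 1 == 1)
}

/// The single combinational operation: AND of two possibly inverted operands.
fn eval_and(values: &[bool], a: usize, b: usize) -> bool {
    eval_lit(values, a) && eval_lit(values, b)
}
```

With pins 1 and 2 holding x and y, eval_and over lit(1, false) and lit(2, false) gives AND, while inverting both operands gives NOR. NAND and OR then follow by setting the inversion bit on whichever literal consumes the gate's output, so no separate inverter node is needed.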

2. Conversion path: NetlistDB to AIG

The conversion is implemented in src/aig.rs via AIG::from_netlistdb_impl(). It handles three cell library families, plus user-supplied cell metadata:

Library | Strategy
AIGPDK (native) | Cells are already AND gates, DFFs, SRAMs: direct mapping
SKY130 | Load Verilog behavioural models from vendor/sky130_fd_sc_hd/, decompose each cell into AND gates via decompose_with_pdk()
GF180MCU | Load behavioural models from vendor/gf180mcu_fd_sc_mcu7t5v0/, decompose similarly
RuntimeCellLibrary | User-supplied cell metadata (ADR 0010) for cells outside vendored PDKs

The decomposition process for technology-specific cells:

  1. Clock tracing: Identify sequential cells (DFFs, SRAMs), trace clock pins to primary inputs, create InputClockFlag drivers for posedge/negedge detection.
  2. Iterative DFS: Walk the netlist in topological order. For each unvisited output pin, recursively decompose driving cells into AND gates using the PDK behavioural models. An and_gate_cache deduplicates structurally identical sub-expressions.
  3. Multi-output cells: SKY130 cells like full adders with multiple outputs get special handling — shared sub-expressions are computed once and reused via postprocess hooks.
  4. Fanout construction: After all pins are processed, CSR-format fanout arrays are built for efficient traversal.

AIG pins are guaranteed to be in topological order (pin i is defined before any pin that depends on it), which the downstream pipeline relies on for level computation and scheduling.
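The interaction between and_gate_cache deduplication and the topological-order guarantee can be sketched with a toy builder. AigBuilder and its methods are hypothetical, and normalising the operand order before the cache lookup is an assumption of this sketch rather than a documented property of src/aig.rs.

```rust
use std::collections::HashMap;

struct AigBuilder {
    num_pins: usize,                                 // pins allocated so far (inputs first)
    gates: Vec<(usize, (usize, usize))>,             // (output pin, operand literals)
    and_gate_cache: HashMap<(usize, usize), usize>,  // operand literals -> existing pin
}

impl AigBuilder {
    fn with_inputs(n: usize) -> Self {
        AigBuilder { num_pins: n, gates: Vec::new(), and_gate_cache: HashMap::new() }
    }

    /// Return the pin computing `a AND b` (operands are `pin << 1 | invert`
    /// literals), reusing an existing gate when the same pair was seen before.
    fn add_and(&mut self, a: usize, b: usize) -> usize {
        let key = if a <= b { (a, b) } else { (b, a) };
        if let Some(&pin) = self.and_gate_cache.get(&key) {
            return pin;
        }
        let pin = self.num_pins;
        self.num_pins += 1;
        self.gates.push((pin, key));
        self.and_gate_cache.insert(key, pin);
        pin
    }

    /// Every gate's operands refer to strictly lower-numbered pins.
    fn is_topological(&self) -> bool {
        self.gates.iter().all(|&(pin, (a, b))| (a >> 1) < pin && (b >> 1) < pin)
    }
}
```

Because a new pin index is only handed out after its operands already exist, append-only construction yields topological order without a separate sorting pass.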

3. EndpointGroup abstraction

The AIG partitions its outputs into endpoint groups — the units of work that partitions must realise:

pub enum EndpointGroup<'i> {
    PrimaryOutput(usize),           // top-level output pin
    DFF(&'i DFF),                   // D flip-flop: data + clock-enable
    RAMBlock(&'i RAMBlock),         // SRAM: addr, data, enables
    SimControl(&'i SimControlNode), // $stop/$finish
    Display(&'i DisplayNode),       // $display/$write
    StagedIOPin(usize),             // inter-stage boundary (from --level-split)
}

Each variant bundles the signals that must be evaluated together: a DFF needs both its D input and clock-enable; an SRAM needs address, data, and write-enable buses. The for_each_input() method enumerates all AIG pins feeding an endpoint group, which the hypergraph partitioner (RepCut) uses to build connectivity and the partition executor (pe.rs) uses to determine resource requirements.

This grouping is important because the boomerang reduction tree produces results in 32-bit-aligned write-out slots. Endpoint groups that share a write-out slot are co-located in the hierarchy; groups that need different clock-enable conditions (e.g., two DFFs with different clocks driving the same data pin) generate "output duplicates" that consume additional write-out capacity.

4. Why AIG over alternatives

BDDs (Binary Decision Diagrams): BDDs can represent Boolean functions canonically but suffer from exponential blowup for many practical circuits (e.g., multipliers). The canonical form is useful for equivalence checking but unnecessary for simulation, where we just need to evaluate. BDDs also have no natural mapping to the GPU's SIMT execution model.

Truth tables / LUTs: Lookup tables scale exponentially with input count. A 6-input LUT (as in Xilinx FPGAs) covers individual cells efficiently but doesn't compose — cascading LUTs requires separate evaluation steps. AIGs compose naturally: the output of one AND gate feeds the input of the next, forming a tree that maps directly to the boomerang hierarchy.

Technology-mapped netlist (direct execution): Keeping the original cell library would require per-cell-type dispatch in the GPU kernel — a conditional branch per node. GPU SIMT execution penalises warp divergence heavily; a uniform operation eliminates this entirely. The conversion cost (one-time decomposition at compile time) is negligible compared to the simulation runtime.

MIG (Majority-Inverter Graph): MIGs are a more compact representation (3-input majority gates) but the 3-input structure doesn't map as cleanly to binary reduction trees. AIGs are the industry standard for synthesis and verification tools (ABC, AIGER format), making interop straightforward.

The AIG's key advantage is that it reduces the GPU kernel to a single bit-parallel operation repeated across a hierarchical tree — no opcode dispatch, no conditional branching, maximum SIMT utilisation.

Consequences

Enables:

  • The boomerang reduction tree (ADR 0015) works because every node is the same AND-with-invert operation. A heterogeneous IR would require per-node dispatch and break the hierarchical reduction pattern.
  • Technology independence: the same GPU kernel and partition executor handle AIGPDK, SKY130, and GF180MCU designs. Adding a new PDK requires only a decomposition module, not kernel changes.
  • Structural deduplication via and_gate_cache reduces graph size when multiple cells share sub-expressions.
  • The inversion-bit encoding (pin_iv = aigpin << 1 | invert) eliminates inverter/buffer nodes entirely — these are free in hardware too, so the IR's size correlates better with actual simulation cost than a technology-mapped netlist would.

Constrains:

  • No latches or async logic. The AIG assumes clean register boundaries: DFFs capture on clock edges, combinational logic is acyclic between registers. Level-sensitive latches and combinational loops would require iterative evaluation that the current pipeline doesn't support (see docs/simulation-architecture.md § "Known Issues").
  • Decomposition quality matters. A poor decomposition of a complex cell (e.g., a mux-heavy datapath cell) can produce a deep AND tree that requires more boomerang stages. The SKY130 and GF180MCU decompositions are hand-tuned for the common cells; exotic cells from other PDKs may decompose sub-optimally.
  • No gate-delay preservation in the AIG itself. The AIG is a functional (Boolean) representation. Timing information from Liberty/SDF is loaded separately and overlaid onto the AIG's pin structure via gate_delays and aigpin_cell_origins. This means the AIG construction can re-order or deduplicate nodes without worrying about timing — but it also means the timing model must reconstruct the mapping from AIG pins back to physical cells.

ADR 0015 — Boomerang execution model and GPU resource mapping

Status: Accepted

Context

Once the design is converted to an AIG (ADR 0014), the combinational logic must be mapped onto GPU hardware for parallel evaluation. GPUs offer massive parallelism but impose rigid constraints: fixed thread counts per block, limited shared memory, and synchronous SIMT execution within a warp/SIMD group.

The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) introduces a "virtual Boolean processor" organised as a boomerang hierarchical reduction tree. This ADR documents how the boomerang maps to GPU hardware, the resource limits it imposes, and the partitioning and instruction-generation pipeline that stays within those limits.

Decision

1. Boomerang reduction tree

A single GPU block (CUDA/HIP) or threadgroup (Metal) executes one partition of the design. Each partition evaluates a subset of the AIG's endpoint groups (DFFs, primary outputs, SRAMs, etc.) by reducing their combinational fan-in cones through a hierarchical binary tree called the boomerang.

The boomerang has BOOMERANG_NUM_STAGES = 13 levels, giving a reduction width of 2^13 = 8192 leaf positions. Each thread in the block handles 32 bits (one u32), so the block uses 8192 / 32 = 256 threads (NUM_THREADS_V1 in flatten.rs).

The 13 hierarchy levels map to three GPU execution tiers:

Levels | Width | GPU mechanism
hier[0] | 8192 → 4096 | 256 threads, shared memory reduction (threads 128-255 compute, 0-127 supply inputs)
hier[1–3] | 4096 → 512 | Shared memory reduction with barrier between levels; only threads in [hier_width, 2×hier_width) compute, the rest idle
hier[4–7] | 512 → 32 | Warp/SIMD shuffle (__shfl_down_sync / simd_shuffle_down); no barrier needed
hier[8–12] | 32 → 1 | Bit-level operations within a single u32 on thread 0

At each level, every position computes (a XOR xora) AND ((b XOR xorb) OR orb), the same AND-with-invert operation from the AIG with an additional OR mask on the second operand. When orb is all-ones, the second operand is forced to 1 and the position acts as a pass-through (forwarding input a unchanged when xora is zero). This single instruction pattern handles AND gates, inversions, and wiring with zero branch divergence.
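A bit-parallel sketch of the per-position operation on one 32-lane word. It reads the expression with the OR mask applied to the second operand, which is the grouping under which an all-ones orb yields the pass-through behaviour described above; treat that grouping, and the function name boomerang_op, as assumptions of this sketch.

```rust
/// One boomerang step over 32 lanes at once.
/// xora / xorb invert the operands per lane; orb forces the second operand
/// to 1 per lane, turning that lane into a pass-through of (a ^ xora).
fn boomerang_op(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}
```

With all flags zero the call is a plain AND; with both XOR masks all-ones it is a NOR; with orb all-ones and xora zero it forwards a regardless of b. Every lane runs the same three bitwise instructions, so the choice between gate, inverter and wire is carried in data rather than in control flow.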

Between boomerang stages (when the AIG is too deep for a single 8192-wide tree), a shuffle permutation redistributes results from shared memory back to thread-local registers. The shuffle is encoded as 16-bit index pairs in the script, allowing arbitrary re-routing of signals between stages.

2. GPU resource limits and partition constraints

The boomerang's fixed geometry imposes hard resource limits on each partition. These are documented in src/pe.rs on the Partition struct:

Resource | Limit | Derivation
Unique inputs | 8191 | 8192 leaf positions minus Tie0. Each input occupies a leaf slot; duplicates consume additional slots. Global-read rounds pack multiple state words into each thread's initial register.
Unique outputs | 8191 | Write-out slots in the boomerang hierarchy, addressed by stage+position pairs. Outputs include DFF data pins, primary outputs, and SRAM port signals.
Intermediate pins per stage | 4095 | The hier[1] level has 2^(13-1) = 4096 positions. One position is reserved for Tie0. Intermediates are AIG pins that are produced in one boomerang stage and consumed in the next.
SRAM output groups | 64 | 8192 / (32 * 4) = 64. Each SRAM occupies 4 write-out groups (32-bit read-data, address, write-data, write-enable). BOOMERANG_MAX_WRITEOUTS = 1 << (13 - 5) = 256 total write-out slots, of which SRAMs consume 4 each.

Write-out slots are 32-bit-aligned groups within the hier[1] level. The total write-out capacity is BOOMERANG_MAX_WRITEOUTS = 256. SRAMs and "output duplicates" (same data pin driven by DFFs with different clock enables) consume write-out slots from this budget. A quick_reject() pre-check catches obvious overflows before the expensive full build.
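The arithmetic behind these limits can be written out as constants. BOOMERANG_NUM_STAGES, NUM_THREADS_V1 and BOOMERANG_MAX_WRITEOUTS are names from the text; LEAF_POSITIONS, MAX_UNIQUE_INPUTS, MAX_INTERMEDIATES, MAX_SRAM_GROUPS and the writeouts_fit helper are hypothetical, and writeouts_fit is only a simplified stand-in for the kind of check quick_reject() performs, not its actual logic.

```rust
const BOOMERANG_NUM_STAGES: u32 = 13;
const LEAF_POSITIONS: u32 = 1 << BOOMERANG_NUM_STAGES; // 8192
const NUM_THREADS_V1: u32 = LEAF_POSITIONS / 32; // one u32 per thread
const MAX_UNIQUE_INPUTS: u32 = LEAF_POSITIONS - 1; // one leaf reserved for Tie0
const MAX_INTERMEDIATES: u32 = (1 << (BOOMERANG_NUM_STAGES - 1)) - 1; // hier[1] minus Tie0
const BOOMERANG_MAX_WRITEOUTS: u32 = 1 << (BOOMERANG_NUM_STAGES - 5); // 32-bit groups
const MAX_SRAM_GROUPS: u32 = BOOMERANG_MAX_WRITEOUTS / 4; // 4 write-out groups per SRAM

/// Simplified budget check: SRAMs take 4 write-out slots each, everything
/// else (including output duplicates) takes `other_slots`.
fn writeouts_fit(num_srams: u32, other_slots: u32) -> bool {
    num_srams * 4 + other_slots <= BOOMERANG_MAX_WRITEOUTS
}
```

The check makes the trade-off visible: a partition holding the maximum of 64 SRAMs has no write-out slots left for anything else.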

When a partition exceeds these limits, Partition::build_one() returns None and the partitioner must split the endpoint set further.

3. Hypergraph partitioning with RepCut

The design's endpoint groups are distributed across GPU blocks by RepCut (src/repcut.rs), which constructs a weighted hypergraph and partitions it using mt-kahypar.

Why a hypergraph, not a graph: In a standard graph, an edge connects exactly two vertices. But a single AIG node (an AND gate deep in the combinational cone) may be shared by many endpoint groups — its "edge" in the connectivity structure is a hyperedge spanning all groups that depend on it. Modelling this as pairwise graph edges would lose the information that cutting this one node simultaneously affects all connected groups. Hypergraph partitioning minimises the actual communication cost (shared signals that must be read from global memory by multiple blocks).

Why mt-kahypar: mt-kahypar is a state-of-the-art multilevel hypergraph partitioner with LargeK support (many partitions in one pass) and parallel execution. The implementation uses:

  • Preset::LargeK — optimised for k >> 2.
  • epsilon = 0.2 — 20% imbalance tolerance, giving the partitioner flexibility to reduce cut while keeping partitions roughly equal.
  • Objective::Soed — Sum of External Degrees, which counts how many partition boundaries each hyperedge crosses. This directly correlates with the number of global memory reads each block must perform.
  • Vertex weights proportional to estimated evaluation cost (accounting for sub-graph size and fanout sharing).
  • Hyperedge weights equal to the number of AIG nodes with that endpoint reachability pattern.
  • Hyperedge size cap at 1000 nodes (reservoir-sampled beyond that) to keep partitioning tractable for signals with extreme fanout.

The hypergraph construction itself is the bottleneck for large designs: for each AIG node, RepCut computes a bitset of which endpoint groups it can reach via forward traversal. Nodes with identical reachability sets are clustered into a single hyperedge. This is done in parallel across bitset blocks (REPCUT_BITSET_BLOCK_SIZE = 4096) using Rayon.
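The clustering step can be sketched on a small scale. Here each node's reachability set is a single u64 (so at most 64 endpoint groups), whereas the real implementation works on blocked bitsets in parallel; Hyperedge and cluster_by_reachability are hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
struct Hyperedge {
    groups: Vec<usize>, // endpoint groups spanned by this hyperedge
    weight: usize,      // number of AIG nodes sharing this reachability set
}

/// reach[i] is the bitset of endpoint groups reachable from AIG node i.
/// Nodes with identical sets collapse into one weighted hyperedge.
fn cluster_by_reachability(reach: &[u64]) -> Vec<Hyperedge> {
    let mut counts: BTreeMap<u64, usize> = BTreeMap::new();
    for &set in reach {
        if set != 0 {
            *counts.entry(set).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .map(|(set, weight)| Hyperedge {
            groups: (0..64usize).filter(|&g| (set >> g) & 1 == 1).collect(),
            weight,
        })
        .collect()
}
```

The hyperedge weight records how many nodes would have to be duplicated or re-read if the groups it spans end up in different partitions, which is the quantity the partitioner is asked to minimise.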

4. Greedy merge-back strategy

mt-kahypar produces an initial partition assignment, but the partition count is typically much larger than needed (set to 2x the number of GPU blocks). process_partitions() in pe.rs then aggressively merges partitions:

  1. Bitset-based overlap scoring: For each pair of partitions, compute the union of their AIG node bitsets. The merge cost is |union| - max(|A|, |B|) — lower is better, indicating more shared sub-graph. This is O(num_aigpins/64) per pair instead of full DFS.

  2. Speculative parallel trials: Merge candidates are sorted by overlap cost. Up to parallel_trial_stride merges are attempted in parallel using Rayon, with a cancel-on-success AtomicBool to abort remaining trials once a valid merge is found. The stride doubles on each iteration.

  3. Quality gate: A merged partition is rejected if it would increase the maximum boomerang stage count beyond max_original_nstages + max_stage_degrad. This prevents merges that technically fit in resource limits but would degrade simulation throughput by adding extra boomerang stages.

  4. Blacklisting: Failed merge attempts are blacklisted for that partition to avoid redundant retries. Cancelled (interrupted by a parallel success) trials are not blacklisted — the merge may still be valid in a future iteration.

The result: 2x-4x fewer partitions than the initial hypergraph solution, with each partition fully validated to fit within boomerang resource limits.
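The overlap score from step 1 reduces to a handful of popcounts. The sketch below assumes each partition's AIG node set is a bitset of equal length stored as u64 words; merge_cost and ranked_candidates are hypothetical names, and the parallel trials, quality gate and blacklist from steps 2 to 4 are not modelled.

```rust
/// Number of set bits across a bitset.
fn popcount(bits: &[u64]) -> u32 {
    bits.iter().map(|w| w.count_ones()).sum()
}

/// |A ∪ B| - max(|A|, |B|): how many nodes the larger partition would gain.
fn merge_cost(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len());
    let union: u32 = a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum();
    union - popcount(a).max(popcount(b))
}

/// All partition pairs as (cost, i, j), cheapest merge first.
fn ranked_candidates(parts: &[Vec<u64>]) -> Vec<(u32, usize, usize)> {
    let mut out = Vec::new();
    for i in 0..parts.len() {
        for j in (i + 1)..parts.len() {
            out.push((merge_cost(&parts[i], &parts[j]), i, j));
        }
    }
    out.sort();
    out
}
```

A cost of zero means the smaller partition's nodes are entirely contained in the larger one, so the merge adds no new logic to evaluate.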

5. FlattenedScript instruction generation

src/flatten.rs converts partitions into FlattenedScriptV1 — a packed u32 instruction stream consumed directly by the GPU kernel. The script encodes:

  1. Metadata section (256 u32): Per-partition control fields at fixed indices, followed by the write-out hook table:

    Index | Field | Purpose
    0 | num_stages | Boomerang stage count
    1 | is_last_part | Flag: last partition in the design
    2 | num_ios | Number of write-out endpoints
    3 | io_offset | Start offset in global state buffer
    4 | num_srams | SRAM block count
    5 | sram_offset | SRAM start offset
    6 | num_global_read_rounds | Input read rounds
    7 | num_output_duplicates | Output duplication count
    8 | is_x_capable | X-propagation flag (ADR 0016)
    9 | xmask_state_offset | X-mask offset (when X-capable)
    128..255 | write-out hook table | Maps each thread to the boomerang stage+position where it captures its output

    This layout is the load-bearing contract between Rust (flatten.rs) and the GPU kernel (kernel_v1.metal, kernel_v1_impl.cuh).

  2. Global-read permutation (2 × NUM_THREADS_V1 per round): Each thread gets an index into the global state buffer and a bitmask. The thread reads one u32 from global memory and extracts the bits indicated by the mask using a pext-like loop. Rounds are packed to maximise throughput (each thread accumulates up to 32 bits across rounds). An index high-bit flag distinguishes previous-cycle state from current-cycle inter-stage intermediates.

  3. Boomerang sections (per stage, NUM_THREADS_V1 × 20 u32):

    • 16 u32 per thread: shuffle permutation (16-bit index pairs selecting source bits from shared memory)
    • 4 u32 per thread: AND-gate flags (xora, xorb, orb) plus a padding slot reused for gate-delay injection (u16 picoseconds)
  4. Global write-out: SRAM and output-duplicate permutations, clock-enable conditions, and data-inversion flags for committing results back to the state buffer.

The entire script is uploaded to device memory once and read sequentially by the kernel. Script reads are overlapped with computation via double-buffering (reading the next stage's data while computing the current stage's AND gates).
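The "pext-like loop" used by the global-read permutation can be sketched as a software parallel-bit-extract: it gathers the bits of a word selected by a mask and packs them contiguously from bit 0. The function name pext_u32 is hypothetical, and this is a scalar illustration rather than the kernel's code.

```rust
/// Software parallel bit extract: collect the bits of `word` at the positions
/// set in `mask` and pack them into the low bits of the result, in order.
fn pext_u32(word: u32, mut mask: u32) -> u32 {
    let mut out = 0u32;
    let mut k = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if word & lowest != 0 {
            out |= 1u32 << k;
        }
        k += 1;
        mask &= mask - 1; // clear that mask bit
    }
    out
}
```

The loop runs once per set mask bit, so a thread that needs only a few bits from a state word does only a few iterations. Results from successive rounds can then be shifted and ORed into the same register until the 32 bits are filled.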

6. Pipeline staging for deep circuits

When a design's combinational depth exceeds the boomerang's capacity, src/staging.rs splits the AIG into major stages at user-specified level thresholds (--level-split 30 or --level-split 20,40).

Each major stage gets its own StagedAIG with:

  • primary_inputs: the AIG pins produced by previous stages (or the design's actual primary inputs for the first stage).
  • primary_output_pins: live AIG pins at the split boundary that must be forwarded to the next stage.
  • endpoints: the original AIG endpoint groups whose combinational depth falls within this stage.

Major stages execute sequentially on the GPU (the kernel loops over them). Between stages, intermediate values are written to the output state buffer and re-read by the next stage's global-read permutation (indicated by the high-bit flag in the index).

Staging trades latency (more sequential kernel dispatches) for fitting within the 8192-wide boomerang. Without it, designs with 50-level combinational paths would fail partitioning entirely.

Consequences

Enables:

  • Fixed, branch-free GPU kernel. The kernel has no per-node dispatch — every thread executes the same AND-XOR-OR instruction at every boomerang level. This maximises SIMT utilisation across CUDA, HIP, and Metal.
  • Deterministic shared-memory budget. The 256-thread, 8192-bit boomerang uses a fixed amount of shared memory (threadgroup memory on Metal), independent of the design. No dynamic allocation, no shared-memory pressure variation between blocks.
  • Scalable partitioning. The hypergraph partitioner + greedy merge naturally adapts to designs from hundreds to millions of gates. Larger designs get more partitions; the GPU kernel is the same.
  • Technology independence at the kernel level. The GPU kernel knows nothing about AIGPDK, SKY130, or GF180MCU. It executes packed u32 scripts. All cell-library knowledge is absorbed during AIG construction and script generation.

Constrains:

  • 8191-input/output ceiling per partition. Designs with extremely wide buses or highly connected sub-circuits may require aggressive partitioning, which increases inter-partition communication (global memory reads). The --level-split option helps by splitting deep cones into multiple stages, but wide cones remain fundamentally limited by the 8192-slot boomerang.
  • Write-out slot scarcity for SRAM-heavy designs. Each SRAM consumes 4 write-out slots. With BOOMERANG_MAX_WRITEOUTS = 256, a partition can hold at most 64 SRAMs — and fewer when output duplicates also need slots. Designs with many small memories may need finer partitioning than their gate count alone would suggest.
  • Fixed thread count. The 256-thread block size is hardcoded (NUM_THREADS_V1). On GPUs where the SM/CU could benefit from larger blocks (e.g., occupancy tuning), there's no flexibility. Changing this would require redesigning the boomerang hierarchy depth and the bit-packing in the script format.
  • Script size grows with partition depth. Each boomerang stage adds ~20 × 256 = 5120 u32 entries to the script. Very deep partitions (many boomerang stages) produce large scripts that may pressure GPU memory bandwidth for the script reads, though double-buffering mitigates this.

ADR 0016 — Selective X-propagation

Status: Accepted

Context

Jacquard's default two-state (0/1) simulation silently resolves uninitialised DFF and SRAM outputs to zero. This masks initialisation bugs that real hardware would expose as unknown (X) values, and creates false mismatches when comparing against four-state RTL simulators.

Naively upgrading the entire simulator to four-state logic would double storage and roughly halve throughput. In a well-designed SoC after reset, typically less than 5% of signals are genuinely X-capable.

Decision

Implement selective X-propagation controlled by the --xprop CLI flag. Static analysis at compile time identifies X-source signals (uninitialised DFFs, SRAM read ports); forward-cone computation classifies each partition as X-capable or X-free. Only X-capable partitions run an X-aware kernel variant; the rest continue with the fast two-state path.

The full seven-stage design, implementation details, and design rationale are in docs/selective-x-propagation.md. Stages 1–6 are implemented; Stage 7 (dynamic X narrowing) is a future enhancement.

Key design choices (summary)

  1. Partition-level granularity — entire partition runs X-aware or not. ~95% of partitions are typically X-free after reset.
  2. Conservative SRAM X — all reads return X until any write. Per-address tracking deferred.
  3. No reset-aware analysis — all DFFs start as X; the fixpoint iteration naturally resolves reset-connected DFFs.
  4. State buffer doubling — X-mask words occupy [reg_io_state_size .. 2*reg_io_state_size) when enabled. X-free partitions ignore the mask entirely.
  5. Runtime flag, not compile-time: --xprop on jacquard sim; no new Cargo features needed.
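
To make choice 4 concrete, here is a minimal sketch of a doubled state buffer and one X-aware AND over 32 packed signals. The encoding (a set mask bit means unknown, and the value bit is held at 0 under X) and the names XWord, and_x, and load_xword are assumptions for illustration; the kernel's actual representation is described in docs/selective-x-propagation.md and may differ.

```rust
/// One 32-signal word in a hypothetical value/X-mask encoding.
/// A set bit in `x` marks that signal as unknown; its `v` bit is kept at 0.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct XWord {
    pub v: u32,
    pub x: u32,
}

/// Four-state AND: a known 0 on either input forces a known 0 output;
/// otherwise any unknown input makes the output unknown.
pub fn and_x(a: XWord, b: XWord) -> XWord {
    let known_zero = (!a.v & !a.x) | (!b.v & !b.x);
    let x = (a.x | b.x) & !known_zero;
    XWord { v: a.v & b.v & !x, x }
}

/// Reads word `i` from a doubled state buffer: value words occupy
/// [0, reg_io_state_size), X-mask words [reg_io_state_size, 2 * reg_io_state_size).
pub fn load_xword(state: &[u32], reg_io_state_size: usize, i: usize) -> XWord {
    XWord {
        v: state[i],
        x: state[reg_io_state_size + i],
    }
}
```

In this encoding an X-free partition can read only state[i] and never touch the upper half, which is consistent with the "ignore the mask entirely" behaviour described above.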

Consequences

  • X-capable partitions pay ~2× storage and ALU cost; X-free partitions (the vast majority) pay nothing.
  • VCD output includes x values when --xprop is enabled, compatible with standard four-state VCD tools.
  • The --check-with-cpu reference path includes an X-aware CPU kernel for validation.
  • Benchmarks (benches/xprop.rs) track the overhead.

ADR 0017 — Cosim execution model

Status: Accepted

Context

The cosim mode runs a GPU-simulated design alongside reactive peripheral models (flash, UART, JTAG, GPIO) that drive and observe design pins each clock edge. The execution model must balance two competing needs: GPU throughput (which favours large batches of edges dispatched as a single command buffer) and peripheral responsiveness (which requires CPU-side model updates between edges).

This ADR documents the batch dispatch loop, the multi-clock scheduler, and the time-domain abstractions that tie them together.

Decision

Batch dispatch loop

The cosim main loop groups consecutive scheduler edges into batches of up to BATCH_SIZE = 1024 edges. Each batch is encoded into a single Metal command buffer and dispatched to the GPU. Between batches, CPU-side peripheral models (PeripheralModel::step_edge) run, ring buffers are drained, and model overrides are compiled into BitOp arrays for the next batch.

Per-edge execution within a batch:

state_prep (apply clk/gpio/jtag pin drives via BitOps)
  → gpu_apply_flash_din (inject flash MISO into input state)
    → simulate_v1_stage ×N (combinational logic evaluation)
  → gpu_flash_model_step (read MOSI, advance flash FSM)
  → gpu_io_step (UART TX decode + Wishbone bus trace)

CPU-side models cannot observe intra-batch state changes — they see the output state only after the batch completes. For peripherals that require per-edge responsiveness (e.g. JTAG replay with tight hold-cycle requirements), the batch is forced to size 1 when any model reports is_active() == true.
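
The batch-size rule can be sketched as follows. This is an illustration only: the trait below is a stand-in that carries just is_active, and StubModel and next_batch_len are hypothetical names, not the cosim loop's real interface.

```rust
/// Batch size quoted in this ADR.
pub const BATCH_SIZE: usize = 1024;

/// Stand-in for the CPU-side model interface; only `is_active` is modelled.
pub trait PeripheralModel {
    fn is_active(&self) -> bool;
}

/// Trivial model used to exercise the rule (hypothetical).
pub struct StubModel(pub bool);

impl PeripheralModel for StubModel {
    fn is_active(&self) -> bool {
        self.0
    }
}

/// Number of edges to encode into the next command buffer:
/// 1 if any model needs per-edge responsiveness, otherwise up to BATCH_SIZE.
pub fn next_batch_len(models: &[Box<dyn PeripheralModel>], edges_remaining: usize) -> usize {
    let limit = if models.iter().any(|m| m.is_active()) {
        1
    } else {
        BATCH_SIZE
    };
    limit.min(edges_remaining)
}
```

With no active model and 5000 edges remaining, the sketch returns 1024; a single active model drops it to 1.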

Why BATCH_SIZE = 1024

The batch size trades off GPU utilisation against peripheral latency. Smaller batches → more Metal command buffer submissions per second → higher overhead. Larger batches → staler CPU-side model state. 1024 was chosen empirically as a sweet spot:

  • For peripheral-free simulation: amortises ~1ms of command buffer overhead across 1024 edges ≈ 1µs/edge overhead.
  • For active peripherals (JTAG, stimulus-driven): the is_active fallback to batch=1 ensures correctness regardless of batch size.
  • The batch size only affects cosim; the sim command processes the entire VCD in one GPU dispatch.

Pre-allocated schedule buffers

Each scheduler edge has pre-allocated Metal buffers for its StatePrepParams and BitOp array (ScheduleBuffers::edge_buffers). These are allocated once at startup — not per-dispatch — to avoid allocation latency in the hot loop. The schedule repeats with period edges_per_period (= LCM schedule length); edge N reuses buffer N % edges_per_period.

Multi-clock scheduler

The MultiClockScheduler computes a deterministic interleaving of edges across clock domains. Given N clocks with potentially different periods and phase offsets:

  1. Compute gcd_ps = GCD of all half-periods and phase offsets. This is the scheduler tick — the minimum time quantum.
  2. Compute lcm_ps = LCM of all full periods. This is the schedule period — the point at which the edge pattern repeats.
  3. schedule_len = lcm_ps / gcd_ps — number of ticks per period.
  4. For each tick, compute which domains have rising/falling edges based on (tick_ps - phase_offset) % half_period == 0.

The schedule length is capped at 1,000,000 ticks. This prevents degenerate clock ratios (e.g. primes) from producing unbounded schedules. If the cap is hit, an assertion fires with a message suggesting the clocks may not be commensurable at the configured resolution.
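
The four steps and the cap can be sketched as below. This is a simplified illustration, not the MultiClockScheduler itself: ClockDomain, build_schedule, and MAX_SCHEDULE_LEN are hypothetical names, periods are assumed to be even so half-periods are whole picoseconds, the cap is reported as an Err rather than an assertion, and rising and falling edges are not distinguished.

```rust
/// One clock domain; field names are illustrative.
pub struct ClockDomain {
    pub period_ps: u64,
    pub phase_offset_ps: u64,
}

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

/// Cap quoted in the text.
pub const MAX_SCHEDULE_LEN: u64 = 1_000_000;

/// Returns (gcd_ps, schedule); schedule[tick] lists the indices of the
/// domains that have an edge (rising or falling) on that tick.
pub fn build_schedule(clocks: &[ClockDomain]) -> Result<(u64, Vec<Vec<usize>>), String> {
    if clocks.is_empty() || clocks.iter().any(|c| c.period_ps < 2 || c.period_ps % 2 != 0) {
        return Err("need at least one clock, each with an even, non-zero period".into());
    }
    // Step 1: scheduler tick = GCD of all half-periods and phase offsets.
    let gcd_ps = clocks
        .iter()
        .fold(0u64, |g, c| gcd(gcd(g, c.period_ps / 2), c.phase_offset_ps));
    // Step 2: schedule period = LCM of all full periods (overflow not handled here).
    let lcm_ps = clocks.iter().fold(1u64, |l, c| lcm(l, c.period_ps));
    // Step 3: ticks per period, checked against the cap before allocating.
    let schedule_len = lcm_ps / gcd_ps;
    if schedule_len > MAX_SCHEDULE_LEN {
        return Err(format!(
            "schedule length {} exceeds the {}-tick cap; clocks may not be commensurable",
            schedule_len, MAX_SCHEDULE_LEN
        ));
    }
    // Step 4: a domain has an edge when (tick_ps - phase_offset) % half_period == 0.
    // Adding `half` before subtracting keeps the unsigned arithmetic from underflowing.
    let schedule: Vec<Vec<usize>> = (0..schedule_len)
        .map(|tick| {
            let tick_ps = tick * gcd_ps;
            clocks
                .iter()
                .enumerate()
                .filter(|(_, c)| {
                    let half = c.period_ps / 2;
                    (tick_ps + half - c.phase_offset_ps % half) % half == 0
                })
                .map(|(i, _)| i)
                .collect::<Vec<usize>>()
        })
        .collect();
    Ok((gcd_ps, schedule))
}
```

For two clocks with periods of 40,000 ps and 80,000 ps and no phase offset, the sketch yields a 20,000 ps tick and a four-tick schedule in which the faster clock has an edge on every tick and the slower one on every second tick.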

Time units: edges vs clock cycles

A scheduler edge is one tick of the scheduler (duration = gcd_ps). A clock cycle is two half-periods of a given domain (= rising + falling edge). The ratio edges_per_sys_clk_cycle = clock_period_ps / gcd_ps converts between them.

This distinction is load-bearing for peripheral timing:

  • UART baud rate dividers count edges, not clock cycles.
  • Reset duration counts edges.
  • The --max-clock-edges CLI flag counts edges.

Confusing edges with clock cycles was the root cause of the UART baud rate bug fixed in commit a263e47: edges_per_period (the LCM schedule length) was used where edges_per_sys_clk_cycle was needed, doubling the bit time in multi-clock designs.
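
A small worked example shows why the two quantities diverge. The helper names and the numbers below are assumptions chosen for illustration (a 40,000 ps system clock, a second 80,000 ps clock, and a divider of 217 system-clock cycles per bit); they are not taken from the fixed code.

```rust
/// Scheduler ticks in one full cycle of a given clock.
pub fn edges_per_clk_cycle(clock_period_ps: u64, gcd_ps: u64) -> u64 {
    clock_period_ps / gcd_ps
}

/// Scheduler ticks in one UART bit, for a hypothetical divider expressed
/// in system-clock cycles per bit.
pub fn edges_per_uart_bit(cycles_per_bit: u64, sys_clk_period_ps: u64, gcd_ps: u64) -> u64 {
    cycles_per_bit * edges_per_clk_cycle(sys_clk_period_ps, gcd_ps)
}
```

With those two clocks, gcd_ps is 20,000 and the LCM schedule is 80,000 / 20,000 = 4 ticks long, while the system clock spans only 40,000 / 20,000 = 2 ticks per cycle. A bit time of 217 cycles is therefore 434 edges; multiplying by the schedule length instead gives 868 edges, exactly twice as long. In a single-clock design both ratios equal 2, which is why the mix-up only shows in multi-clock designs.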

GPU→CPU ring buffer drain

After each batch completes, the CPU drains three categories of GPU-side state:

  1. Peripheral ring buffers — UART channels and Wishbone trace channel, drained from local read_head to GPU-written write_head. See ADR 0013 for struct conventions.
  2. VCD snapshot buffer — when --stimulus-vcd or --output-vcd is enabled, a separate ring buffer (2 × state_size words per edge) captures per-tick output state on the GPU. The CPU drains it after each batch to write VCD transitions. This mechanism is what enables BATCH_SIZE > 1 even with VCD output — without it, the CPU would need to read output state after every single edge.
  3. CPU reference check — when --check-with-cpu is active, the CPU replays the batch with the reference kernel and compares.

No synchronisation beyond Metal's command buffer completion is needed — all drains happen after waitUntilCompleted.
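
The drain in item 1 can be sketched as a single-producer ring with free-running head counters. The Ring struct and its layout are hypothetical (ADR 0013 defines the real struct conventions), and the sketch assumes a power-of-two capacity so that counter wrap-around stays consistent with the modulo indexing.

```rust
/// Hypothetical ring: the GPU advances `write_head`, the CPU owns `read_head`.
/// Both are free-running counters; `slots.len()` is assumed to be a power of two.
pub struct Ring {
    pub slots: Vec<u32>,
    pub write_head: u32,
    pub read_head: u32,
}

impl Ring {
    /// Copies out every entry written since the last drain, oldest first,
    /// and advances `read_head` to `write_head`.
    pub fn drain(&mut self) -> Vec<u32> {
        let cap = self.slots.len() as u32;
        // If the producer lapped the consumer, only the newest `cap` entries survive.
        let pending = self.write_head.wrapping_sub(self.read_head).min(cap);
        let start = self.write_head.wrapping_sub(pending);
        let out: Vec<u32> = (0..pending)
            .map(|i| self.slots[(start.wrapping_add(i) % cap) as usize])
            .collect();
        self.read_head = self.write_head;
        out
    }
}
```

Because the drain runs only after the command buffer has completed, the sketch reads write_head as a plain field with no atomics or fences.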

Consequences

  • The batch dispatch model means CPU-side peripheral models see output state with up to BATCH_SIZE edges of latency. This is acceptable for all current peripherals; models that need tighter coupling set is_active() = true.
  • The 1M tick schedule cap prevents pathological memory use but rejects exotic clock ratios. A min-heap scheduler (proposed in docs/plans/multi-clock-and-stimulus-architecture.md as MC.2) would remove this limit.
  • The edges-vs-cycles distinction must be maintained carefully in any code that converts user-facing "cycles" to internal "ticks". The edges_per_sys_clk_cycle helper exists for this purpose.
  • Pre-allocated schedule buffers consume O(schedule_len) Metal buffer pairs at startup. Each schedule entry creates two Metal buffer objects (params + ops). For typical single-clock designs this is 2 entries = 4 buffer objects; for complex multi-clock designs it can reach thousands of entries, but each buffer is small (tens of bytes).

Cross-references

  • ADR 0012 — CDC jitter injection (uses the scheduler's edge timestamps as the injection point).
  • ADR 0013 — Peripheral model architecture (documents GPU-side model patterns and ring buffers).
  • docs/plans/multi-clock-and-stimulus-architecture.md — design-space doc for the multi-clock scheduler.

Implementation Plans

Phased implementation plans with entry and exit criteria. Plans live here when the work spans multiple commits and needs an explicit scheduling artefact; once shipped, the plan is kept as a historical record (Status flipped to Implemented) rather than deleted, so the phasing is recoverable later.

For short-lived working memory between sessions, see ../handoff-discipline.md; that material lives in docs/handoffs/ and is deliberately kept separate from the persistent plans here.

Status legend

  • Active — currently being worked on or scheduled.
  • Implemented — shipped; kept as historical record.
  • Deferred — captured for future work; not currently scheduled.
  • Exploratory — architectural thinking captured ahead of demand.

Index

Plan | Status
Post-Phase-0 Roadmap | Active: scheduling doc for ADRs 0007 and 0008
GF180MCU PDK enablement | Mostly implemented: Phases 0–6 shipped; Phase 7 deferred
Phase 0: Timing IR and OpenSTA oracle | Implemented: historical record
WS2: opensta-to-ir | Implemented: historical record
WS3: delete SDF parser + interim runtime hook | Implemented: historical record (see ADR 0006 Amendment)
WS3 follow-up: re-add cosim --sdf via opensta-to-ir | Deferred
Multi-clock and stimulus architecture | Exploratory: demand-driven

Reading order for new contributors

If you want to understand how the timing stack got to where it is:

  1. phase-0-ir-and-oracle.md — the umbrella plan, with the five work streams (WS1–WS5).
  2. ws2-opensta-to-ir.md and ws3-delete-sdf-parser.md — the per-work-stream detail for the IR producer and the SDF parser removal.
  3. post-phase-0-roadmap.md — what comes next, sequenced against the ADRs.

Adding a new plan

  1. Filename: short kebab-case (<topic>.md or <ws-or-phase>-<topic>.md).
  2. Start with # Plan — <title> and a **Status:** line.
  3. Where the plan executes a specific ADR or work stream, name them in a **Predecessors:** / **ADRs:** block near the top so the dependency graph is explicit.
  4. Add the row to the table above. When the plan ships, change the status in the file and here in the same commit.

Roadmap — Post-Phase-0 work scheduling

Status: Active. ADR 0008 accepted 2026-05-02. ADR 0007 still pending.

This document orders the work captured in those two ADRs alongside the in-flight tail of Phase 0. It is a scheduling doc, not a design doc — design lives in the ADRs and in docs/timing-model-extensions.md / docs/why-jacquard.md.

Where things stand (2026-05-02)

  • Phase 0 (phase-0-ir-and-oracle.md): WS1–WS5 + WS2.2 + WS2.4 all landed. WS2.4 multi-corner shipped 2026-05-02 across four commits (5822343 consumer, 530bb36 builder, 59fde04 producer, plus the integration test). Open items: sky130-based corpus entries (gated on a CI sky130-Liberty install strategy) and peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.
  • OpenTimer spike (spikes/opentimer-sky130.md): resolved 2026-05-01 (Superseded). Q1 (Liberty parse) passed cleanly on SKY130; Q2 (arrival computation) failed on the canonical OpenSTA-bundled GCD example after eight input-pipeline workarounds (bus ports, OpenROAD-emitted SPEF, modern TCL, tap cells). Per the spike's decision matrix, ADR 0003 is now Superseded (commit d002bde). OpenSTA out-of-process is committed as Jacquard's sole STA path; opensta-to-ir is the canonical preprocessor, and no in-process reference STA is planned. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later, but not on this roadmap.
  • Pillar B Stages 1+2 (per adr/0007): landed. ClockArrival IR table + opensta-to-ir Tcl emission in commit c403cc8; DFFConstraint.clock_arrival_ps + skew-aware fold-in in build_timing_constraint_buffer in 6767c3e. Closed Pillar B's main accuracy lever ahead of this roadmap's original Phase 2 schedule.
  • ADR 0006 amended 2026-05-02: subprocess invocation of user-installed OpenSTA from the shipped runtime is now permitted (no linking, no bundling). Phase 3 (native Rust SDF→IR) is no longer release-gating — see § Phase 3 below. New release-hardening workstream WS-RH.1 (OpenSTA detection + version check) is required before first release; see § Release hardening.
  • ADRs 0007 / 0008: ADR 0008 accepted 2026-05-02; ADR 0007 still pending review.

Phase boundaries

The phase numbering established by Phase 0 and ADR 0006 continues:

Phase | Topic | Trigger
0 | Timing IR + OpenSTA preprocessor | In flight, near close
1 | Structured timing output (ADR 0008 required items) + Phase 0 carryover | ADR 0008 accepted ✓
2 | Timing model fidelity: Pillar C Tier 1 + Pillar B Stage 3 if needed (ADR 0007) | Phase 1 lands; ADR 0007 accepted
RH | Release hardening (OpenSTA detection + version check, see § Release hardening) | WS-RH.1 shipped ✓; no other items currently scoped
3 | Native Rust SDF→IR parser (ADR 0006) | Deferred indefinitely; no longer release-gating per amended ADR 0006. Picks up when bandwidth allows or commercial demand appears.
4+ | Pillar A Stage 1 (static IDM); Pillar C Tier 2; ADR 0008 optional outputs | Demand-driven; not committed

Parked (require new ADR to revive): in-process reference STA (ADR 0003 superseded), Pillar A Stage 2 (dynamic δ(T)), Pillar A Stage 3 (sub-cycle ticks), NoC-aware partitioning hints (Pillar C Tier 3).

Phase 1 — Structured timing output and Phase 0 wrap-up

Entry criteria:

  • ADR 0008 accepted.
  • Phase 0 exit criteria met (per phase-0-ir-and-oracle.md).

OpenTimer integration was originally Phase 1's centrepiece (former WS-P1.1) but was retired when the spike Superseded ADR 0003. With OpenSTA-out-of-process as the sole STA path, Phase 1 is now anchored on user-visible output rather than a second STA tool.

Workstreams (parallel where independent):

WS-P1.1 — Structured timing output (ADR 0008 required items)

The four required items from ADR 0008. Single workstream because they share infrastructure.

  • WS-P1.1.a — Symbolic violation messages. Shipped 2026-05-02 in commit 0432d9a. New WordSymbolMap in src/flatten.rs built once at sim startup; process_events gained an optional resolver closure; sim_metal threads it through. Setup/hold violation messages now name DFFs as top/cpu/regs[7][bit 22] [word=42] instead of bare word 42. CUDA/HIP sim paths don't currently route runtime violations through process_events (separate plumbing gap, not blocked on this format change).
  • WS-P1.1.b — --timing-report <path.json>. Shipped 2026-05-02 in commit 58a7a04. New src/timing_report.rs module with serde-derived TimingReport (schema_version 1.0.0); process_events takes a ReportingCtx bundling the optional resolver + violation observer (signature back to 5 args); sim_metal builds the report end-to-end. Sample fixture at tests/timing_ir/sample_reports/two_violations.json; schema documented in docs/timing-violations.md. WS-P1.1.d's worst-slack ranking is included (top-10 per kind from violation events). Caveats: closest-to-violation tracking in non-violating runs needs GPU near-miss instrumentation (deferred); violations array is unbounded (opt-in cap is the natural follow-up); CUDA/HIP/cosim paths don't route runtime violations through process_events yet.
  • WS-P1.1.c — --timing-summary text output. Shipped 2026-05-02 in commit 44e70a0. New TimingReport::format_summary() formatter; --timing-summary CLI flag; TimingReportConfig refactored to support either / both / neither output. Text writes to stdout. Deferred from ADR 0008's wishlist: "corner" (metadata struct doesn't carry it yet) and "margin percentage" (derivable from existing fields). Both are documented in code as known gaps.
  • WS-P1.1.d — Per-DFF worst-slack ranking. Partially shipped in 58a7a04 alongside WS-P1.1.b: top-10 per kind from observed violation events. Remaining: closest-to-violation tracking when no violation occurred — needs GPU near-miss instrumentation, deferred to a separate workstream.

Total ~2 weeks.

WS-P1.2 — Phase 0 follow-ups (carryover)

Tail of Phase 0 work that didn't gate WS3 completion. Listed for completeness.

  • WS2.4: multi-corner CLI flag in opensta-to-ir. Shipped 2026-05-02 (commits 5822343 / 530bb36 / 59fde04).
  • WS4: corpus + runner + regen helper + CI hookup shipped 2026-05-02 with the seed entry aigpdk_dff_chain (covers all four IR record types). One follow-up: add sky130-based corpus entries (inv_chain_pnr, mcu_soc subset) once a CI sky130-Liberty install strategy is decided.
  • Peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.

(WS5 — parser-success assertions on the Liberty parser path and on opensta-to-ir — was already shipped; see phase-0-ir-and-oracle.md § WS5.)

These are not gated by any new ADR; pick them up as bandwidth allows.

Exit criteria for Phase 1:

  • ✅ Symbolic violation messages live; old state-word-index format gone (commit 0432d9a).
  • ✅ --timing-report JSON shipping; sample fixture at tests/timing_ir/sample_reports/two_violations.json (commit 58a7a04).
  • ✅ --timing-summary available (commit 44e70a0).
  • ✅ Worst-slack ranking included in both report and summary (top-10 from violations; non-violating-run tracking still requires GPU near-miss instrumentation, separate workstream).
  • ✅ why-jacquard.md updated; old "Output interface" section now describes the shipped surface, "Still on the wishlist" carries the deferred items.

Phase 1 closed. Phase 2 entry now blocked only on ADR 0007 acceptance.

Phase 2 — Timing model fidelity

Entry criteria:

  • Phase 1 exit criteria met.
  • ADR 0007 accepted.

Pillar B Stages 1 and 2 (per-DFF clock arrival in the IR + setup/hold fold-in) landed early, in commits c403cc8 and 6767c3e — directly on top of the OpenSTA-out-of-process producer rather than the OpenTimer integration originally planned. Phase 2 is therefore anchored on Pillar C Tier 1 (per-receiver wire delay), with Pillar B Stage 3 only if measurement justifies it.

Workstreams (parallel where independent):

WS-P2.1 — Pillar C Tier 1: Per-receiver wire delay (ADR 0007)

Key wire delay per (src_aigpin, dst_aigpin) edge.

  • WS-P2.1.a — Edge-attributed wire delay. Rewrite of src/flatten.rs:1850-1872 to key wire delay per fanout; fold into source-side gate_delay per fanout target. ~3–5 days.
  • WS-P2.1.b — Rise/fall preservation. Carry per-edge rise/fall through the consumer; honour both in PackedDelay accumulation. ~1–2 days, after WS-P2.1.a.
  • WS-P2.1.c — Validation. Long-route corpus addition; tolerance ≤±3% on long-wire paths.

Total ~1 week.

WS-P2.2 — Pillar B Stage 3: Bucketed per-DFF constraint packing (conditional)

Stages 1+2 collapsed all DFFs in a 32-bit state word to min(setup), min(hold) after folding the per-DFF clock arrival in. For most current designs the per-word collapse pessimism is small relative to clock period; for designs running close to the period boundary, splitting each word into clock-arrival buckets eliminates the collapse loss without disturbing the partitioner. See Stage 3 in docs/timing-model-extensions.md Part B.

Land only if Stage 1+2 measurement on a representative design shows the per-word collapse materially over-reports violations; otherwise defer indefinitely. Effort if pursued: ~3–5 days, touches src/flatten.rs:1722-1761 and the kernel's constraint indexing.

WS-P2.3 — Output adjustments for fidelity work

Small touch-ups to ensure Phase 1 outputs continue to work as model fidelity changes. JSON report fields, summary metrics, etc. Folded into WS-P2.1 / WS-P2.2 PRs as needed.

Exit criteria for Phase 2:

  • Per-receiver wire delay landed; long-route paths reported within ≤±3% of CVC.
  • timing-model-extensions.md Parts B and C marked Implemented with cross-references to landed code (Part B already updated post-Stage-1+2).
  • timing-validation.md updated with per-pillar tolerances.
  • No regression on existing corpus.

Phase 3 — Native Rust SDF→IR parser

Deferred indefinitely as of 2026-05-02 per amended ADR 0006. No longer release-gating: shipped Jacquard binaries may subprocess user-installed OpenSTA via opensta-to-ir, provided OpenSTA is not bundled and not linked. The user-facing capability gap is "OpenSTA must be on PATH for jacquard sim input.sdf," surfaced by WS-RH.1 below with a clear error message.

Reasons to revive:

  • A downstream commercial integrator's legal team rejects subprocess-of-GPL-tool even with no bundling/linking.
  • OpenSTA dialect coverage gaps appear that are easier to fix in our own parser than via opensta-to-ir post-processing.
  • Bandwidth opens up and the team wants the zero-runtime-dependency story for its own ergonomics.

Effort estimate (unchanged from the original ADR 0006 framing): grammar-based (nom / pest), validated against OpenSTA on the WS4 corpus per ADR 0001. Probably 2–3 weeks of focused work. Not scheduled.

Release hardening

Pre-first-release work that became necessary when ADR 0006 § Amendment relaxed the no-runtime-subprocess rule. These are blockers for first release, not for any specific Phase.

WS-RH.1 — OpenSTA detection + version check

Status: Shipped 2026-05-02 in commit c9c393b. All scope items below are landed; this entry is preserved as a brief reference. Test coverage: 9 unit tests for the version parser + 6 integration tests for the locator across the missing / too-old / newer-than-tested / unparseable / failing-probe paths.

Why: With the shipped runtime now allowed to subprocess opensta-to-ir, a user invoking jacquard sim input.sdf on a machine without OpenSTA, or with an untested OpenSTA version, must get an actionable error rather than silent timing-data loss. Pre-WS-RH.1, missing OpenSTA only emitted a warn! and the simulation proceeded with no timing information loaded. That was acceptable during development but would have been a UX bug in a shipped release.

Scope:

  • Promote missing-OpenSTA from warning to hard error when --sdf is provided. Today's silent-fallback behaviour is fine for --liberty-only runs but wrong when SDF was explicitly requested. Error message must name the env var (JACQUARD_OPENSTA_BIN), the PATH lookup, and link to install instructions. ~0.5 day.
  • Pin a tested OpenSTA version range. Record the version we test against in vendor/opensta/ (already pinned via submodule per ADR 0005) and surface that as a MIN_TESTED_OPENSTA_VERSION / MAX_TESTED_OPENSTA_VERSION constant in crates/opensta-to-ir/src/opensta.rs. Need to choose a version-detection mechanism — OpenSTA's -version flag output format is the obvious target; check whether it's stable across the versions we care about. ~0.5 day.
  • Version probe at first invocation. On first call to find_opensta() per process, run <binary> -version, parse the version, and:
    • If older than min-tested → hard error with remediation message ("rebuild via scripts/build-opensta.sh or upgrade your system OpenSTA").
    • If newer than max-tested → warn but proceed ("untested OpenSTA version vN.M; please report any timing discrepancies").
    • Cache the result for the rest of the process. ~1 day.
  • Document the dependency in docs/usage.md. Single section: required tooling, install paths, version range, what jacquard sim does and doesn't need OpenSTA for. ~0.5 day.
  • Test coverage: unit tests for the version-string parser (with sample -version outputs from the pinned version and a synthetic too-old version); an integration test that points JACQUARD_OPENSTA_BIN at a stub script and confirms the error path. ~0.5 day.
  • Stale-framing cleanup (folded in here per 2026-05-02 decision rather than spun out separately):
    • Reword INTERIM per ADR 0006 / Pre-release only markers in source: src/sim/setup.rs:176,228,286, src/bin/jacquard.rs:187, src/sim/cosim_metal.rs:2053, src/testbench.rs:255-257. Replace with "subprocess wrapper per ADR 0006 § Amendment" or similar — these paths are no longer interim.
    • Update docs/plans/phase-0-ir-and-oracle.md lines 152, 161, 172 — drop "tagged for pre-release removal" framing; the subprocess wrapper is now the shipping mechanism, not a temporary bridge.
    • Audit docs/plans/ws3-delete-sdf-parser.md for the same stale framing and update.
    • ~0.5 day total for the cleanup.

Total: ~3.5 days. Single PR, owned by whoever picks up release prep.

Open question: does OpenSTA emit a stable -version string, or do we need to scrape git describe from a build-time-recorded commit? If -version is unreliable, fall back to recording the submodule commit at crates/opensta-to-ir build time and comparing — this is cheaper than version-string sniffing and avoids the "user has a custom build" problem.
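
The version-probe policy in the scope list can be sketched as a pure classification step, separate from running the binary. Everything here is illustrative: the bounds are placeholder values (this document does not give the real MIN/MAX constants), the probe output is assumed to contain a MAJOR.MINOR token, and VersionCheck, parse_version, and classify are hypothetical names.

```rust
/// Outcome of comparing a probed version against the tested range.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    Ok,
    /// Older than min-tested: hard error with a remediation message.
    TooOld,
    /// Newer than max-tested: warn but proceed.
    NewerThanTested,
    /// No recognisable version in the probe output.
    Unparseable,
}

/// Placeholder bounds for illustration only; the real values are not stated here.
pub const MIN_TESTED_OPENSTA_VERSION: (u32, u32) = (2, 4);
pub const MAX_TESTED_OPENSTA_VERSION: (u32, u32) = (2, 6);

/// Parses the first "MAJOR.MINOR" token in the probe output (assumed format).
pub fn parse_version(output: &str) -> Option<(u32, u32)> {
    output.split_whitespace().find_map(|tok| {
        let mut parts = tok.trim_start_matches('v').split('.');
        let major: u32 = parts.next()?.parse().ok()?;
        let minor: u32 = parts.next()?.parse().ok()?;
        Some((major, minor))
    })
}

/// Applies the policy: too old is an error, too new is a warning, in range is fine.
pub fn classify(output: &str) -> VersionCheck {
    match parse_version(output) {
        None => VersionCheck::Unparseable,
        Some(v) if v < MIN_TESTED_OPENSTA_VERSION => VersionCheck::TooOld,
        Some(v) if v > MAX_TESTED_OPENSTA_VERSION => VersionCheck::NewerThanTested,
        Some(_) => VersionCheck::Ok,
    }
}
```

Keeping classification free of process spawning means the parser can be unit-tested against recorded probe outputs, in the spirit of the test-coverage item above. If the open question resolves in favour of comparing submodule commits, only parse_version would need to change.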

Phase 4+ — Demand-driven

Items below land when (a) a real use case appears that demands them, and (b) bandwidth is available. Each gets its own ADR amendment / new ADR before scheduling, since the cost is non-trivial.

Pillar A Stage 1 (static IDM)

Cheapest δ(T) entry point. Lands only after Pillars B and C confirm the wire/skew baseline is correct — characterisation work done before that risks chasing wire-delay error masquerading as δ(T) error.

Effort: 1–2 day spike to validate value, then ~1 week implementation, plus per-cell SPICE characterisation effort (long-pole risk).

Pillar C Tier 2 (inter-partition wire delay)

Required for many-core/NoC designs at advanced processes. Lands when a representative such design appears in the test corpus and Tier 1 measurement shows it is needed.

Effort: ~2–3 weeks, touches src/sim/cosim_metal.rs shuffle pipeline.

ADR 0008 optional outputs

Items 5–7 from ADR 0008: arrival histograms, STA cross-reference, path-back-trace. Demand-driven prioritisation.

Pillar C Tier 3 (NoC-aware partitioning hints)

Optional optimisation that makes Tier 2 cheap on tile-decomposed designs. Lands only if Tier 2 lands and partitioning quality on tile designs proves measurably suboptimal.

Risks and walk-back

  • Pillar measurement shows smaller-than-expected gain. Each pillar's later stages are deferred or abandoned per ADR 0007's walk-back clause. Pillar B Stage 3 is explicitly conditional on this signal.
  • JSON report schema design wastes time in bikeshedding. Mitigation: ship v1 quickly, additive-only changes thereafter, breaking changes require explicit ADR-level decision.
  • OpenSTA upstream regressions. With OpenSTA as the sole STA path, an upstream behaviour change reaches us through opensta-to-ir's output. Mitigation: pin OpenSTA in CI (per ADR 0001) and rely on the regression corpus to surface drift.
  • CRPR pessimism on tight designs. Stage 1+2 fold-in treats launch=0; a design with very heterogeneous launch arrivals will see pessimism on paths whose launch DFF also has a long clock path. Stage 3 is the lever if this matters; otherwise live with it.

Cross-references

  • ../adr/0007-timing-model-fidelity-roadmap.md — Pillar definitions for Phase 2.
  • ../adr/0008-structured-timing-output.md — Output items for Phase 1.
  • ../adr/0001-opensta-as-oracle.md — OpenSTA out-of-process commitment (post-ADR-0003 supersedure).
  • ../adr/0003-opentimer-primary-sta.md (Superseded). Spike fail outcome documented in ../spikes/opentimer-sky130.md.
  • ../adr/0006-sdf-preprocessing-model.md — Phase 3.
  • ../why-jacquard.md — User-facing positioning that this roadmap delivers.
  • ../timing-model-extensions.md — Technical analysis underlying ADR 0007.
  • ../timing-validation.md — Validation tolerances each phase updates.
  • phase-0-ir-and-oracle.md — Predecessor roadmap (current Phase 0 status lives there per workstream).
  • ../spikes/opentimer-sky130.md — Spike outcome (Superseded).

Plan — GF180MCU PDK enablement (full sim path)

Status: Phases 0–6 shipped (2026-05-12 / 13). Phase 7 (wafer.space test-run-1 design integration) deferred pending design availability. Subsequent follow-ups also landed (2026-05-14): IO pad behavioural decomposition (__in_c, __in_s, __bi_24t, plus filler classification for the wafer.space gf180mcu_ws_* families) and bidir A/OE observability surfacing as <port>__out / <port>__oe extra primary outputs — see commits aa312b8, c23d583, 207cc80. These extended GF180MCU support from "synthesized-core-only" to "full chip_top including pad ring", validated end-to-end on a 227k-cell wafer.space chess chip_top netlist. This document is now a recap of what landed; the forward-looking deferred items are in § Follow-on cleanup at the bottom.

Predecessors:

  • SKY130 enablement (reference recipe in docs/adding-a-pdk.md).
  • Multi-corner Liberty plumbing — WS2.4 + the sky130 multi-corner integration test (crates/opensta-to-ir/tests/opensta_integration.rs), shipped 2026-05-12.

ADRs: None new shipped. docs/adding-a-pdk.md is the canonical integration-points checklist; this plan applied that recipe to GF180MCU with both 7-track (gf180mcu_fd_sc_mcu7t5v0) and 9-track (gf180mcu_fd_sc_mcu9t5v0) standard-cell libraries.

Goal (as shipped)

GF180MCU is now at the same support tier as SKY130:

  1. Timing path: opensta-to-ir accepts GF180MCU Liberty files and emits IR; the multi-corner integration test at crates/opensta-to-ir/tests/opensta_integration.rs::gf180mcu_multi_corner_emits_per_corner_values asserts per-corner setup/hold values differ correctly across tt/ss/ff PVT corners.
  2. Simulation path: jacquard sim runs gate-level GF180MCU netlists on the GPU. Cell-type detection, pin direction tables, sequential/tie/multi-output classification, behavioural model parsing (with UDP support for sequential elements), and AIG decomposition are all wired through AIG::from_netlistdb.
  3. Validation — synthetic DFF+inverter fixture at tests/timing_test/gf180mcu_timing/. Real wafer.space test-run-1 design integration is deferred (Phase 7, gated on design availability).

End state mirrors today's SKY130 support: CellLibrary::GF180MCU detected, decomposed to AIG, simulated on Metal/CUDA/HIP, with a golden-IR corpus entry covering the timing-IR side.

Why now

GF180MCU support was a release prerequisite per session 2026-05-12. The wafer.space ecosystem (https://github.com/wafer-space/gf180mcu) is the near-term commercial demand driver; the upstream google/gf180mcu-pdk is the canonical PDK that the wafer.space variant builds on.

Decisions (frozen 2026-05-12 session)

  1. One enum variant for GF180MCU. CellLibrary::GF180MCU covers both 7t5v0 and 9t5v0 prefixes. Matches the SKY130 precedent (CellLibrary::SKY130 covers seven prefixes).
  2. Both 7t and 9t fully supported. Unlike SKY130 (only hd is decomposed), both GF180MCU standard-cell variants are first-class for cell detection, pin direction, classification, and AIG decomposition. Cell models for 7t and 9t are byte-identical per cell type (verified at build time in build.rs); decomposition reads from the 7t submodule and reuses for 9t.
  3. Two separate submodules for vendoring cell models, mirroring the per-library SKY130 split:
    • vendor/gf180mcu_fd_sc_mcu7t5v0/
    • vendor/gf180mcu_fd_sc_mcu9t5v0/
  4. Install path: volare pinned hash under [tool.jacquard.pdks.gf180mcu] in pyproject.toml alongside the existing sky130 entry. Variant: gf180mcuC.
  5. Reset polarity: GF180MCU uses active-low resets/sets (pin names RN, SETN) — same AIG formula shape as SKY130's RESET_B/SET_B. The "n" prefix in cell names like dffnq/dffnrnq/icgtn indicates a negative-edge clock (pin CLKN), not reset polarity (resolving Open Q3 from the original plan).

Shipped phases

Phase 0 — Foundations (commit 6ae3e54)

  • pyproject.toml: [tool.jacquard.pdks.gf180mcu] with volare_hash = "559a117b163cef2f920f33f30f6f690aa0b47e4c", variant gf180mcuC, separate default_lib_subdir_7t / default_lib_subdir_9t paths.
  • Vendored submodules at vendor/gf180mcu_fd_sc_mcu7t5v0/ and vendor/gf180mcu_fd_sc_mcu9t5v0/.
  • Skeleton src/gf180mcu.rs + src/gf180mcu_pdk.rs declared in src/lib.rs.

Phase 1 — Library detection + cell-type extraction (commit 858dd70)

  • is_gf180mcu_cell(name) -> bool matching both 7t5v0 and 9t5v0 prefixes.
  • extract_cell_type(name) strips prefix + drive suffix.
  • CellLibrary::GF180MCU enum value added; detect_library() / detect_library_from_file() extended; Mixed enforcement upgraded to three known libraries.
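The detection and extraction rules above can be sketched as follows. This is a minimal illustration, not the code in src/gf180mcu.rs: the prefix strings are taken from the cell names used in the Phase 6 fixture, while the `Option<&str>` return type and the "trailing `_<digits>` is the drive suffix" rule are assumptions.

```rust
/// Library prefixes for the two GF180MCU standard-cell variants.
const GF180MCU_PREFIXES: [&str; 2] = [
    "gf180mcu_fd_sc_mcu7t5v0__",
    "gf180mcu_fd_sc_mcu9t5v0__",
];

/// True when `name` starts with either the 7t5v0 or the 9t5v0 prefix.
pub fn is_gf180mcu_cell(name: &str) -> bool {
    GF180MCU_PREFIXES.iter().any(|prefix| name.starts_with(*prefix))
}

/// Strip the library prefix and a trailing `_<digits>` drive suffix.
/// Returns `None` for names outside GF180MCU (assumed signature).
pub fn extract_cell_type(name: &str) -> Option<&str> {
    let rest = GF180MCU_PREFIXES
        .iter()
        .find_map(|prefix| name.strip_prefix(*prefix))?;
    match rest.rsplit_once('_') {
        // Only treat the tail as a drive strength when it is purely numeric.
        Some((base, drive))
            if !base.is_empty()
                && !drive.is_empty()
                && drive.bytes().all(|b| b.is_ascii_digit()) =>
        {
            Some(base)
        }
        // Cells without a numeric suffix keep their full base name.
        _ => Some(rest),
    }
}
```

Because detection keys on the full prefix, the 7t and 9t variants of the same base type both map to the same extracted cell type.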

Phase 2 — Pin direction provider (commit e97e2d2)

  • GF180MCULeafPins implementing LeafPinProvider.
  • Generation strategy: build-time via build.rs::generate_gf180mcu_pin_table, which scans vendor/gf180mcu_fd_sc_mcu{7,9}t5v0/cells/, parses .functional.v, cross-asserts 7t/9t pin layouts match, emits $OUT_DIR/gf180mcu_pins.rs. New precedent vs SKY130's hand-rolled match arms (see § Follow-on cleanup item 1).
  • Round-trip test instantiating every cell.

Phase 3 — Cell classification (commit 6969b90)

  • Sequential / tie / filler / delay-cell whitelists in src/gf180mcu_pdk.rs derived from behavioural models.
  • Unit tests asserting classification across the union of 7t5v0 and 9t5v0 cell catalogues.

Phase 4 — Combinational AIG decomposition

Sequenced as four commits:

  • 92bb665 — Phase 4 recon: confirmed SKY130 behavioural parser is PDK-neutral; identified shared infrastructure that gf180mcu_pdk could reuse.
  • 02da077 — Phase 4 prep: introduced the PDK-neutral src/pdk_decomp.rs re-export module; exposed WireVal, GATE_MARKER, build_chain_gate, build_xor_chain, finalize_decomp_result as pub(crate).
  • 32fb3b9 — Phase 4 (combinational): decompose_combinational for GF180MCU + boolean equivalence test suite vs the vendored PDK models.
  • d898343 — Phase 4 (aig.rs integration): wired combinational decomposition through AIG::from_netlistdb, end-to-end sim path for combinational GF180MCU netlists.

Phase 4b — Sequential cells (UDPs)

  • a7c0618 — Phase 4b prep: UDP loader for gf180mcu_pdk (parses UDP_GF018hv5v_mcu_sc7_TT_1P8V_25C_verilog_nonpg_*_FF_UDP and friends from the vendored PDK).
  • 459317e — Phase 4b: AIG hooks for sequential cells (DFFs, latches, scan-DFFs, clock-gating cells icgtp/icgtn). gf180mcu_preprocess pre-creates DFF Q pins; gf180mcu_postprocess applies async set/reset using the active-low RN/SETN convention via the same AIG formula as SKY130. Negative-edge clock cells use CLKN instead of CLK (handled in trace_clock_pin).
  • 3006f59 — Phase 4b boolean-equivalence tests covering DFF, latch, scan-DFF, and clock-gating cells via multi-step truth-table evaluation.

Phase 5 — CLI / pipeline wiring audit (commit 57244d5)

Audit-only — no per-PDK branch was missing GF180MCU handling. The auto-detection in AIG::from_netlistdb already covers every CLI surface (sim / cosim / dump-paths all route through setup::load_design). Cleanup: stale Phase 4b panic comments in src/sim/setup.rs and src/aig.rs; field doc comments on CLI arguments refreshed to mention GF180MCU alongside AIGPDK / SKY130.

Phase 6 — Validation fixture + multi-corner test

  • Fixture (commit 4a7ee0e): tests/timing_test/gf180mcu_timing/ mirroring sky130_timing/ 1:1. Synthetic inv_chain.v (DFF + 16-inverter chain + DFF) with gf180mcu_fd_sc_mcu7t5v0__{dffq,inv}_1 cells, Liberty-only SDF generator, CVC testbench, sample stimulus, Makefile, README.
  • Integration test: gf180mcu_multi_corner_emits_per_corner_values in crates/opensta-to-ir/tests/opensta_integration.rs. Loads three real PVT corners (typ=tt_025C_5v00, slow=ss_125C_4v50, fast=ff_n40C_5v50) at the 5.0 V operating point and asserts per-corner setup TimingValues differ correctly across PVT. Skips gracefully when the volare-installed PDK isn't present (gated on find_gf180mcu_lib_dir() returning Some; matches the sky130 test's skip pattern). $GF180MCU_LIBERTY_DIR overrides the volare default path.
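The three corner names used by the integration test encode process, temperature, and voltage in the name itself. As a rough sketch of how such a name decomposes, the hypothetical helper below splits it into a PVT triple. `Pvt` and `parse_corner_name` are illustrative names, not Jacquard APIs, and reading the `n` prefix as a negative temperature is an assumption based on the `ff_n40C_5v50` name.

```rust
/// Process / voltage / temperature decoded from a corner name (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Pvt {
    pub process: String,
    pub temperature_c: f64,
    pub voltage_v: f64,
}

/// Decode names shaped like `tt_025C_5v00` or `ff_n40C_5v50`.
/// Returns `None` when the name does not follow that three-part shape.
pub fn parse_corner_name(name: &str) -> Option<Pvt> {
    let mut parts = name.split('_');
    let (process, temp, volt) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() || process.is_empty() {
        return None;
    }
    // Temperature: optional `n` (assumed to mean negative), digits, trailing `C`.
    let temp = temp.strip_suffix('C')?;
    let (sign, digits) = match temp.strip_prefix('n') {
        Some(rest) => (-1.0, rest),
        None => (1.0, temp),
    };
    let temperature_c = sign * f64::from(digits.parse::<u32>().ok()?);
    // Voltage: `5v00` reads as 5.00 V, with `v` standing in for the decimal point.
    let (whole, frac) = volt.split_once('v')?;
    let voltage_v = format!("{whole}.{frac}").parse::<f64>().ok()?;
    Some(Pvt {
        process: process.to_string(),
        temperature_c,
        voltage_v,
    })
}
```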

Phase 7 — wafer.space test-run-1 design (deferred)

Gated on design availability. Scope:

  • Vendor or pull a wafer.space test-run-1 gate-level netlist into the tests/timing_test/ or designs/ tree.
  • End-to-end pipeline: synth + PnR (or consume post-PnR output), opensta-to-ir, jacquard sim with Metal backend, golden-output VCD comparison.
  • Promote to a corpus entry once stable.

Test inventory

Counts after Phase 6:

  • cargo test --lib: 212 passing (up from 166 at plan start).
  • cargo test --lib gf180mcu: 45 passing (combinational + sequential equivalence + classification + detection + AIG-build).
  • cargo test -p opensta-to-ir multi_corner: 2 passing (sky130 + gf180mcu), each gated on its respective volare PDK install.
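The PDK gating these tests rely on can be pictured as a small path-resolution helper: honour an environment override if set, otherwise use the volare default, and report `None` when the directory is absent so the test can skip. This is a sketch under assumptions. `resolve_liberty_dir` is a hypothetical stand-in for `find_gf180mcu_lib_dir()`, the volare default path is passed in rather than guessed, and an override that points at a missing directory is assumed not to fall back to the default.

```rust
use std::path::{Path, PathBuf};

/// Pick the Liberty directory for a PDK-gated test (hypothetical helper).
/// `env_override` would come from e.g. `$GF180MCU_LIBERTY_DIR`.
pub fn resolve_liberty_dir(env_override: Option<&str>, volare_default: &Path) -> Option<PathBuf> {
    let candidate = match env_override {
        // A non-empty override always wins over the default location.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => volare_default.to_path_buf(),
    };
    // `None` signals "PDK not installed", which the caller treats as a skip.
    candidate.is_dir().then_some(candidate)
}
```

A test would call this with `std::env::var("GF180MCU_LIBERTY_DIR").ok().as_deref()` and return early when the result is `None`.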

Follow-on cleanup

These are nice-to-have refactors flagged during the GF180MCU work but deliberately out of scope for the enablement effort itself.

Update 2026-05-19: Items 1, 2, and 4 are now subsumed by ADR 0010 — Declarative cell metadata and its companion plan declarative-cell-metadata.md. The manifest pathway converts these from "Rust refactor" projects into "move data out of code as part of the migration to manifest-as-source-of-truth": the migration happens once and covers all three items.

  1. build.rs pin-table generator for SKY130 too. Subsumed by ADR 0010 § "Deferred to a future ADR — build.rs pin-table scanner removal." Removed LAST in the manifest migration, after manifests cover the built-in PDKs.

  2. Physical relocation of shared PDK decomp infrastructure out of sky130_pdk.rs into pdk_decomp.rs. Still relevant for the built-in (Rust-decomp) pathway, since ADR 0010 keeps that path load-bearing for cells with real AIG decomposition rules. Move when a third PDK exercises the surface.

  3. CellLibrary enum location. Currently lives in src/sky130.rs even though it represents all PDKs. Moving to a neutral home (src/pdk.rs or src/lib.rs) is a trivial mechanical refactor. Independent of ADR 0010.

  4. IO and PR libraries. Now solved by the ADR 0010 manifest pathway. gf180mcu_fd_io and gf180mcu_fd_pr cells can be declared via kind = "io_pad_*" / kind = "filler" / kind = "tap" etc. in user-supplied manifests — no Jacquard PR needed.

  5. CI install strategy for GF180MCU Liberty. Both the sky130 and gf180mcu multi-corner tests currently skip when the PDK isn't installed locally. CI integration (volare-on-CI or a vendored minimal Liberty subset) is the same blocker that gates the inv_chain_pnr sky130 corpus entry — out of scope for the GF180 enablement effort itself. Unrelated to ADR 0010.

Pitfalls (PDK-specific, for future readers)

  • Reset polarity — GF180MCU is active-low (RN/SETN); same AIG formula as SKY130's RESET_B/SET_B.
  • Negative-edge clocks — cells like dffnq/dffnrnq/icgtn use pin name CLKN instead of CLK. The "n" prefix is a clock marker, not a reset-polarity marker.
  • Power pins — GF180MCU operates at 5V nominal (vs SKY130's 1.8V). Both follow VDD/VSS naming. Corner names follow tt_025C_5v00 shape and parse cleanly through the generic TimingLibrary loader.
  • Cell pin names differ from SKY130 — inverter is I/ZN (not A/Y); DFF is CLK/D/Q/notifier. The notifier port wires the UDP delay-model wrapper but is unused for logic simulation.
  • Cell-name collisions between 7t5v0 and 9t5v0 — both have nand2_1 etc. Detection keys on the full prefix, not the base type. Auto-handled by is_gf180mcu_cell.
  • Drive-strength suffixes — GF180MCU uses integer multipliers (inv_1, inv_2, inv_4, …) matching the SKY130 convention.

References

  • docs/adding-a-pdk.md — canonical PDK integration recipe.
  • src/sky130.rs, src/sky130_pdk.rs — SKY130 reference implementation.
  • src/gf180mcu.rs, src/gf180mcu_pdk.rs — GF180MCU implementation.
  • crates/opensta-to-ir/tests/opensta_integration.rs::{sky130,gf180mcu}_multi_corner_emits_per_corner_values — timing-side validation.
  • tests/timing_test/{sky130,gf180mcu}_timing/ — synthetic fixtures.
  • pyproject.toml::[tool.jacquard.pdks.{sky130,gf180mcu}] — install pins.
  • Upstream PDK: https://github.com/google/gf180mcu-pdk
  • wafer.space variant: https://github.com/wafer-space/gf180mcu

Plan — Phase 0: Timing IR and OpenSTA oracle

Status: Implemented — historical record. All five work streams (WS1 schema, WS2 opensta-to-ir producer, WS3 SDF parser deletion + interim runtime hook, WS4 diff harness + CI, WS5 parser-success assertions) shipped through 2026-05-02. All eight exit criteria are met. Ongoing scheduling for timing-model fidelity work has moved to post-phase-0-roadmap.md. The per-WS detail and embedded status markers below are preserved for the implementation record.

Goal

Deliver the minimum viable infrastructure to enforce Jacquard's timing correctness contract:

  1. A stable timing intermediate representation (IR) for SDF-equivalent annotations.
  2. An OpenSTA-driven subprocess converter that produces IR from the same inputs Jacquard consumes.
  3. A converter that produces IR from Jacquard's existing SDF parser output.
  4. A CI diff harness that fails loud on converter disagreement.
  5. Parser-success assertions on the SDF and Liberty paths.

After phase 0, Jacquard's timing pipeline has an enforced external reference. Silent failures (zero-match SDF, mis-scoped hierarchical prefixes, unexpected cell drops) surface as CI failures rather than correctness regressions detected in the field.

Prerequisites

  • Requirements doc (../timing-correctness.md) accepted.
  • ADR 0001 (OpenSTA oracle) accepted.
  • ADR 0002 (timing IR) accepted.
  • A representative test design committed to the repo with inputs needed for both Jacquard and OpenSTA (.v + .lib + .sdf minimum; .spef if available). Candidate: tests/timing_test/inv_chain_pnr or the MCU SoC subset, whichever is smaller for first-pass iteration.
  • OpenSTA available on developer machines and CI runners (installation documented).

Work breakdown

WS1 — IR schema

Done. Shipped as the timing-ir crate (508baaf initial, 2432d41 simplification). Schema at crates/timing-ir/schemas/timing_ir.fbs; per-DFF CLOCK_ARRIVAL records added later in c403cc8 (Pillar B Stage 1, beyond original WS1 scope). JSON round-trip verified via crates/timing-ir/tests/.

Produce the FlatBuffers schema (schemas/timing_ir.fbs) and generated Rust bindings.

Fields (minimum viable; extend only with written justification):

  • SchemaVersion { major, minor, patch }.
  • Corner { name, process, voltage, temperature }; IR holds a list of corners.
  • CornerValue { corner_index, min, typ, max } for multi-corner floats.
  • TimingArc { driver_pin, load_pin, rise_delay: [CornerValue], fall_delay: [CornerValue], condition, provenance }.
  • InterconnectDelay { net, from_pin, to_pin, delay: [CornerValue], provenance }.
  • SetupHoldCheck { d_pin, clk_pin, edge, setup: [CornerValue], hold: [CornerValue], condition, provenance }.
  • Provenance { source_tool, source_file, origin: Asserted | Computed | Defaulted }.
  • VendorExtension { source_tool, kind: CadenceX | SynopsysY | Other, raw_bytes } — untyped passthrough for unrecognised annotations.
  • Root table TimingIR { schema_version, corners, cell_instances, timing_arcs, interconnect_delays, setup_hold_checks, vendor_extensions }.

Deliverables:

  • schemas/timing_ir.fbs checked in.
  • build.rs integration for code generation (or checked-in generated Rust with a flatc pin).
  • A tiny timing-ir crate exposing read/write helpers.
  • JSON round-trip via flatc --json verified in a unit test.

Scope guard: if you find yourself adding fields that represent computed timing graphs, cell electrical characterisation, or netlist structure, stop and re-read ADR 0002.

WS2 — opensta-to-ir production tool

Per ADR 0006, opensta-to-ir is a shipped preprocessing tool, not merely a validation helper. Post-release it remains as an alternative preprocessing path for users who want OpenSTA-computed timing.

Detailed design and phased implementation: ws2-opensta-to-ir.md.

Deliverables:

  • A Tcl script runnable by OpenSTA that loads Liberty + Verilog + SDF + (optionally) SPEF + SDC, then emits a machine-readable dump of timing annotations.
  • A production-quality standalone Rust binary opensta-to-ir that parses OpenSTA's dump and emits timing IR (binary + JSON sidecar). Stable CLI, documented exit codes, clear diagnostics, man-page-worthy --help.
  • Invocation wrapper handling OpenSTA subprocess lifecycle, stderr capture, exit-code checking, and error propagation up through opensta-to-ir's own exit code.
  • Assertion: if OpenSTA reports fewer cells than the expected count, exit non-zero with a clear diagnostic.
  • Ships as part of Jacquard's release artefacts (binary distributable, documented in user-facing docs).

WS2.4 — Multi-corner CLI flag (shipped 2026-05-02)

Status: Shipped 2026-05-02 across commits 5822343 (consumer + --timing-corner flag), 530bb36 (builder dedupe + per-corner [TimingValue] collection), 59fde04 (Tcl driver per-scene emission + --liberty NAME=PATH syntax), and the integration test aigpdk_dff_emits_per_corner_timing_values. The historical scope notes below are kept for reference but are no longer "open work".

The IR schema (crates/timing-ir/schemas/timing_ir.fbs) supports per-corner TimingValue vectors today, but every record lands in the IR with a single TimingValue keyed at corner_index = 0. Both producer (opensta-to-ir) and consumer (flatten.rs) treat the world as single-corner. Multi-corner support has three pieces:

Producer (Tcl + Rust binary):

  • crates/opensta-to-ir/tcl/dump_timing.tcl: replace single read_liberty + hardcoded CORNER 0 default tt 1.0 25.0 with OpenSTA's define_corners + per-corner read_liberty -corner $name. The existing arc / setup-hold / wire / clock-arrival walks already key by (cell, …); wrap each in a per-corner loop and call [edge arc_delays $arc -corner $c]. Verify the exact -corner syntax against the locally built OpenSTA before relying on it (similar to the vertex_worst_arrival_path probe done for clock arrival in commit c403cc8).
  • crates/opensta-to-ir/src/main.rs: rework --liberty PATH to accept --corner NAME=PATH[,V=…,T=…,P=…] repeats. Validate at least one corner.
  • crates/opensta-to-ir/src/builder.rs: today each ARC / SETUP_HOLD / INTERCONNECT / CLOCK_ARRIVAL line lands as one IR record with one TimingValue. Multi-corner emits multiple lines per (cell, driver, load, corner_index) from Tcl; the builder dedupes them into one IR record carrying a [TimingValue] vector. Mechanical.
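The builder-side merge described above can be sketched as a group-by on the arc key. The types here are simplified stand-ins (one delay triple per line rather than separate rise and fall values), and rejecting a corner that appears twice for the same arc is an assumption, not documented behaviour of builder.rs.

```rust
use std::collections::BTreeMap;

/// Simplified per-corner value (stand-in for the IR's TimingValue).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TimingValue {
    pub corner_index: u32,
    pub min: f64,
    pub typ: f64,
    pub max: f64,
}

/// One dump line as emitted per (cell, driver, load, corner_index).
#[derive(Debug, Clone, PartialEq)]
pub struct ArcLine {
    pub cell: String,
    pub driver: String,
    pub load: String,
    pub value: TimingValue,
}

/// One IR record carrying a vector of per-corner values.
#[derive(Debug, Clone, PartialEq)]
pub struct ArcRecord {
    pub cell: String,
    pub driver: String,
    pub load: String,
    pub values: Vec<TimingValue>,
}

/// Collapse per-corner lines into one record per (cell, driver, load).
pub fn merge_arc_lines(lines: Vec<ArcLine>) -> Result<Vec<ArcRecord>, String> {
    let mut grouped: BTreeMap<(String, String, String), Vec<TimingValue>> = BTreeMap::new();
    for line in lines {
        let corner = line.value.corner_index;
        let values = grouped
            .entry((line.cell, line.driver, line.load))
            .or_default();
        // Assumed policy: the same corner twice for one arc is an error.
        if values.iter().any(|v| v.corner_index == corner) {
            return Err(format!("corner {corner} appears twice for one arc"));
        }
        values.push(line.value);
    }
    Ok(grouped
        .into_iter()
        .map(|((cell, driver, load), mut values)| {
            // Keep values ordered by corner index for deterministic output.
            values.sort_by_key(|v| v.corner_index);
            ArcRecord { cell, driver, load, values }
        })
        .collect())
}
```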

Consumer (jacquard root):

  • Add --timing-corner <NAME> to SimArgs / CosimArgs in src/bin/jacquard.rs; resolve to an index by walking ir.corners().
  • Replace flatten.rs::ir_corner0_max(...) (used in ~5 sites) with ir_corner_max(idx). Thread the resolved index through load_timing_from_ir.
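On the consumer side, the two steps are a name-to-index lookup and an index-keyed read. The sketch below uses plain structs in place of the FlatBuffers accessors, so `Corner`, `TimingValue`, and both function signatures are illustrative assumptions rather than the shapes in flatten.rs.

```rust
/// Stand-in for an IR corner entry (only the name matters here).
#[derive(Debug, Clone, PartialEq)]
pub struct Corner {
    pub name: String,
}

/// Stand-in for the IR's per-corner value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TimingValue {
    pub corner_index: u32,
    pub min: f64,
    pub typ: f64,
    pub max: f64,
}

/// Resolve a `--timing-corner <NAME>` argument to an index by walking the corner list.
pub fn resolve_corner_index(corners: &[Corner], name: &str) -> Result<u32, String> {
    corners
        .iter()
        .position(|c| c.name == name)
        .map(|i| i as u32)
        .ok_or_else(|| {
            let known: Vec<&str> = corners.iter().map(|c| c.name.as_str()).collect();
            format!("unknown timing corner '{name}'; available: {}", known.join(", "))
        })
}

/// Read the max value for the selected corner, if the record carries one.
pub fn ir_corner_max(values: &[TimingValue], idx: u32) -> Option<f64> {
    values.iter().find(|v| v.corner_index == idx).map(|v| v.max)
}
```

Resolving the name once and threading the index through avoids repeating the string comparison at every call site.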

Fixture: sky130 ships multi-corner Liberty (tt_025C_1v80, ss_-40C_1v62, ff_125C_1v95) on disk via volare under ~/.volare/... on dev machines that have run the cosim work. Wire two corners against the existing DFF / chain integration tests for a synthetic-but-real fixture; no external decision is needed before starting.

Land in this order: fixture probe (~hour, verifies the OpenSTA Tcl -corner flag works as expected) → producer (Tcl + binary + builder) → consumer (CLI + flatten plumbing) → integration test exercising both corners. The risk concentrates in the first hour; everything after that is mechanical.

WS3 — Remove hand-rolled SDF parser; wire interim runtime hook

Per ADR 0006, Jacquard's hand-rolled SDF parser is deleted in Phase 0 rather than maintained through later phases. The runtime gains a new IR input path; the old SDF input path becomes an interim convenience wrapper over WS2.

Detailed design and phased implementation: ws3-delete-sdf-parser.md.

Deliverables:

  • Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. Remove all direct consumers.
  • Add jacquard sim --timing-ir <path> as the canonical post-release timing input. Loads a pre-converted timing IR file, consumes it into the simulator's internal structures.
  • Retarget the existing --timing-sdf / --enable-timing CLI behaviour: when SDF is provided, jacquard sim subprocesses opensta-to-ir internally to produce IR on the fly, then consumes it. Code site tagged "INTERIM per ADR 0006; removed before first release."
  • Verify no remaining imports of the deleted module. Verify all existing tests that previously used the hand-rolled parser now pass via the interim hook or via checked-in IR fixtures.
  • No runtime behaviour regression on Jacquard's timing-related regression suite; any design that currently works must still work after WS3.

WS4 — Diff harness and CI integration

Reframed 2026-05-02; corpus + runner shipped 2026-05-02. The original WS4 was framed as "WS2 vs WS3 IR diff" (OpenSTA-derived against Jacquard's hand-rolled SDF parser-derived). WS3 deleted that parser; the diff has only one side now. Three reframings were considered: Option A (golden-IR regression corpus for opensta-to-ir) was chosen as the Phase 0 closure; Option B (end-to-end behavioural diff cxxrtl/CVC vs Jacquard cosim event traces) belongs in timing-validation.md as a Phase 1+ extension; Option C (cross-tool diff vs a future native Rust SDF→IR parser) is Phase 3 work per ADR 0006.

Deliverables:

  • A test binary timing-ir-diff that reads two IR files and produces a structured diff (missing arcs, mismatched delays past tolerance, mismatched provenance). Shipped in crates/timing-ir/src/bin/timing-ir-diff.rs.
  • OpenSTA vendored as a git submodule at vendor/opensta/. Not built from Jacquard's build at runtime; present for CI version pinning, the opensta-to-ir integration tests, and stress-corpus access (see ADR 0005). Shipped.
  • A primary regression corpus at tests/timing_ir/corpus/ — Jacquard-specific designs with checked-in expected.jtir (and an expected.json sidecar via flatc --json for human-readable diffs). Shipped 2026-05-02 with the seed entry aigpdk_dff_chain (a minimal aigpdk DFF + AND with back-annotated wire delay; covers ARC + SETUP_HOLD + CLOCK_ARRIVAL + INTERCONNECT in a self-contained fixture). Sky130 entries (inv_chain_pnr, mcu_soc subset) remain to be added — the inputs exist under tests/timing_test/, but a CI strategy for installing the sky130 Liberty (likely volare) lands with them.
  • A stress corpus at tests/timing_ir/stress/ — a manifest file listing paths into vendor/opensta/<test-tree-subdir>/. Run nightly or pre-release. Exit criterion: no crashes, no hangs, no malformed IR; numerical agreement with OpenSTA not required. Manifest format specced in tests/timing_ir/stress/README.md; entries pending.
  • A regression test that, for each design in the primary corpus, runs opensta-to-ir on its inputs and diffs against expected.jtir via timing_ir::diff::diff_irs with the per-design tolerance from manifest.toml. Shipped as crates/opensta-to-ir/tests/corpus.rs::corpus_designs_match_golden_ir. Skips gracefully when OpenSTA isn't built; fails loud with a structured diff when there's a mismatch.
  • A regenerate-goldens helper for the OpenSTA-pin-bump workflow: bump submodule, run regen, review the diff, commit golden + submodule together. Shipped as scripts/regenerate-corpus-goldens.sh. Iterates tests/timing_ir/corpus/*/manifest.toml, runs opensta-to-ir per entry with the manifest-specified flags, refreshes both expected.jtir and the expected.json sidecar via flatc --json. Accepts entry names as positional args for targeted regen.
  • A diff-machinery mutation test that perturbs a known-good IR and asserts timing-ir-diff flags it. Shipped in crates/timing-ir/tests/diff.rs: delay_mismatch_past_tolerance_detected, delay_mismatch_within_tolerance_is_clean, arc_only_in_a_detected, arc_only_in_b_detected.
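The structured diff the harness produces (arcs only on one side, delays that disagree past tolerance) can be sketched with two keyed maps. This is an illustration of the idea rather than the `timing_ir::diff` API: `ArcKey`, `IrDiff`, and `diff_arcs` are hypothetical names, each arc is reduced to a single delay, and the tolerance is a plain absolute value in picoseconds.

```rust
use std::collections::BTreeMap;

/// (cell instance, driver pin, load pin), used as the identity of an arc.
pub type ArcKey = (String, String, String);

/// Result of comparing two arc sets (hypothetical shape).
#[derive(Debug, Default, PartialEq)]
pub struct IrDiff {
    pub only_in_a: Vec<ArcKey>,
    pub only_in_b: Vec<ArcKey>,
    /// (key, delay in A, delay in B) for delays that differ past tolerance.
    pub mismatched: Vec<(ArcKey, f64, f64)>,
}

impl IrDiff {
    pub fn is_clean(&self) -> bool {
        self.only_in_a.is_empty() && self.only_in_b.is_empty() && self.mismatched.is_empty()
    }
}

/// Compare two arc maps; missing arcs are reported in both directions.
pub fn diff_arcs(
    a: &BTreeMap<ArcKey, f64>,
    b: &BTreeMap<ArcKey, f64>,
    tolerance_ps: f64,
) -> IrDiff {
    let mut diff = IrDiff::default();
    for (key, &delay_a) in a {
        match b.get(key) {
            None => diff.only_in_a.push(key.clone()),
            Some(&delay_b) if (delay_a - delay_b).abs() > tolerance_ps => {
                diff.mismatched.push((key.clone(), delay_a, delay_b));
            }
            Some(_) => {}
        }
    }
    for key in b.keys() {
        if !a.contains_key(key) {
            diff.only_in_b.push(key.clone());
        }
    }
    diff
}
```

A mutation test in this style perturbs one delay in a copy of a known-good map and asserts that `is_clean()` flips to false.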

CI hookup landed 2026-05-02. The opensta-to-ir-tests job in .github/workflows/ci.yml builds CUDD (cached), builds OpenSTA via scripts/build-opensta.sh (cached on the submodule SHA), and runs cargo test inside crates/opensta-to-ir — covering the corpus regression test, the CLI tests, and the OpenSTA-driven integration tests on every PR. scripts/build-opensta.sh was extended to honour a CUDD_DIR env var so the CI job can hand it the source-built CUDD location without bypassing the script.

What this catches: OpenSTA upstream regressions, dump-format / Tcl-driver regressions, accidental schema-breaking changes in timing_ir.fbs, builder bugs in opensta-to-ir/src/builder.rs, and the diff machinery itself (via the mutation tests that perturb an IR and assert timing-ir-diff flags the perturbation).

What this doesn't catch: behavioural divergence between Jacquard and a reference simulator. That's timing-validation.md's job (CVC/iverilog event-trace comparison) — the mcu_soc/sky130 90/90 reference match is the current one-design instance, generalisable in Phase 1+.

WS5 — Parser-success assertions

Done. Both halves had already shipped before this section was marked as done.

Deliverables (all live):

  • Assertions in Jacquard's Liberty parsing code: non-zero cells parsed on non-empty input. Implemented as TimingLibrary::parse (src/liberty_parser.rs:297-309); rejects with a clear diagnostic naming the input byte count and pointing at the explicit override.
  • Assertions in opensta-to-ir (WS2): non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Implemented as the --min-arcs N CLI flag (default 1) in the binary (crates/opensta-to-ir/src/main.rs:71-77, :112-121); exits with code EXIT_MIN_ARCS_FAILED = 3 (see :17) and a diagnostic naming the produced count, the threshold, and the override flag.
  • A way to override thresholds for intentionally-empty test inputs: TimingLibrary::parse_unchecked (src/liberty_parser.rs:316) for the Liberty path, --allow-empty-parse flag for the opensta-to-ir path.
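The opensta-to-ir half of the assertion reduces to a threshold check with an escape hatch. A minimal sketch follows: the constant name and value come from the text above, while the function name, signature, and diagnostic wording are assumptions.

```rust
/// Exit code used when the parser-success assertion fails.
pub const EXIT_MIN_ARCS_FAILED: i32 = 3;

/// Check the produced arc count against `--min-arcs`.
/// On failure, returns the exit code together with a diagnostic that names
/// the produced count, the threshold, and the override flag.
pub fn check_min_arcs(
    produced: usize,
    min_arcs: usize,
    allow_empty_parse: bool,
) -> Result<(), (i32, String)> {
    if allow_empty_parse || produced >= min_arcs {
        return Ok(());
    }
    Err((
        EXIT_MIN_ARCS_FAILED,
        format!(
            "produced {produced} timing arcs, below --min-arcs {min_arcs}; \
             pass --allow-empty-parse for intentionally empty inputs"
        ),
    ))
}
```

With the default threshold of 1, an SDF that matches nothing in the design fails instead of producing an empty IR.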

Tests covering both halves: liberty_parser::parse_rejects_library_input_with_zero_cells and parse_unchecked_accepts_zero_cell_library; opensta-to-ir::cli::cli_min_arcs_failure_exit_3 (covers both the failure and the --allow-empty-parse override).

(Original-plan assertions for Jacquard's SDF parser are obsolete — WS3 deleted the parser they were to guard.)

Test plan

Tests live in tests/timing_ir/.

  1. Schema round-trip (WS1). Construct a small IR in Rust, serialize to binary, deserialize, assert equality. Same for JSON.
  2. OpenSTA converter unit tests (WS2). For a hand-crafted tiny design, invoke the converter, assert IR contents match expectation.
  3. Jacquard converter unit tests (WS3). Same, on the same tiny design, through Jacquard's parser.
  4. Corpus diff (WS4). For each design in the primary corpus, freshly produced opensta-to-ir output diffs clean against the checked-in golden expected.jtir within per-design tolerance.
  5. Parser-success assertion tests (WS5). Feed empty Liberty, empty SDF, and non-empty-but-no-match Liberty. Each should fail loud with a clear diagnostic, not proceed silently.

Tolerances:

  • Delay values: ±5% or ±5 ps absolute floor, whichever is larger. Rationale: matches the existing timing-validation.md convention; per-design overrides allowed via manifest.toml.
  • Missing arcs: zero tolerance. Every arc in the golden IR must appear in the freshly produced one (and vice versa).
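The delay tolerance rule can be written as a single predicate. This is a sketch that assumes delays are expressed in picoseconds and that the 5% is taken relative to the golden value; the constant and function names are illustrative.

```rust
/// Relative tolerance (5%) and absolute floor (5 ps) from the rule above.
pub const REL_TOL: f64 = 0.05;
pub const ABS_FLOOR_PS: f64 = 5.0;

/// True when `fresh_ps` is within the larger of 5% of the golden value or 5 ps.
pub fn within_tolerance(golden_ps: f64, fresh_ps: f64) -> bool {
    let allowed = (REL_TOL * golden_ps.abs()).max(ABS_FLOOR_PS);
    (golden_ps - fresh_ps).abs() <= allowed
}
```

For a 1000 ps golden delay the relative term dominates (50 ps allowed), so 1040 ps passes and 1060 ps fails. For a 10 ps golden delay the 5 ps floor dominates, so 14 ps passes and 16 ps fails.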

Exit criteria (all met)

Phase 0 is complete when all of the following hold:

  1. schemas/timing_ir.fbs checked in (crates/timing-ir/schemas/timing_ir.fbs); round-trip unit tests in crates/timing-ir/tests/.
  2. opensta-to-ir binary production-quality with stable CLI, documented exit codes, primary-corpus support. See ws2-opensta-to-ir.md (Implemented).
  3. src/sdf_parser.rs deleted; --timing-ir <path> canonical; --timing-sdf is a subprocess wrapper over opensta-to-ir (per ADR 0006 § Amendment, the shipping mechanism — Phase 3 native Rust parser deferred indefinitely). See ws3-delete-sdf-parser.md (Implemented).
  4. ✅ OpenSTA vendored at vendor/opensta/ (ADR 0005).
  5. timing-ir-diff runs in CI on the primary corpus (opensta-to-ir-tests job), passes cleanly, fails loud on regressions. Mutation tests in crates/timing-ir/tests/diff.rs.
  6. ✅ Parser-success assertions live on both halves: TimingLibrary::parse and opensta-to-ir --min-arcs. See WS5 above.
  7. ✅ No regression observed in Jacquard's timing-related tests after WS3 cutover.
  8. timing-validation.md carries the forward-pointing note (line 3) explicitly stating its ±5% convention will be superseded by timing-correctness.md once Phase 0 ships. Phase 0 has shipped; that supersession is now effective in practice (the corpus tolerance is set per-design via manifest.toml). Removing the in-doc note is a small follow-up if anyone authoring against the page would benefit.

Out of scope (deferred to later phases)

  • Native Rust SDF→IR converter. The hand-rolled parser is removed in Phase 0 WS3 (per ADR 0006); the native Rust replacement is Phase 3 work, deferred indefinitely per ADR 0006 § Amendment (no longer release-gating). SDF input ships via the opensta-to-ir subprocess wrapper. See post-phase-0-roadmap.md § Phase 3 for revival triggers.
  • OpenTimer integration. Depends on the spike; tracked in ../spikes/opentimer-sky130.md and its resulting phase-1 plan.
  • Private PDK (GF130) test track. Tracked in ADR 0004; plumbing deferred to its own phase.
  • SPEF IR. Separate from timing-annotation IR per ADR 0002.
  • Runtime violation reporting improvements (R4 critical-path refinement JSON). Phase 1 or 2.

Risks

  • Licensing verification on vendored OpenSTA corpus. Per-file check needed before inclusion. May reduce corpus size if restrictive; acceptable.
  • FlatBuffers build integration friction. If build.rs codegen causes cross-compilation or CI issues, fall back to checked-in generated code with a documented flatc version. Pick one approach and stick to it; flip-flopping is worse than either option.
  • Tolerance tuning. Initial ±5% may prove too loose (hides bugs) or too tight (false positives from numerical differences). Plan to re-tune after first real-design data arrives.
  • WS3 cutover risk. Deleting the hand-rolled SDF parser risks regressing designs that depend on behaviour it currently provides. Exit criterion 7 requires a clean regression run before WS3 is considered complete. If coverage gaps emerge, walk-back options per ADR 0006 apply: add dialect shims to opensta-to-ir, or (now that Phase 3 is deferred) keep the hand-rolled parser available behind a feature flag until dialect parity is reached.
  • OpenSTA dialect coverage. OpenSTA may not accept every SDF dialect Jacquard's hand-rolled parser has been patched to handle. Such cases are tracked as either opensta-to-ir post-processing fixes or upstream OpenSTA contributions. Under no condition is the fix to reinstate the hand-rolled parser unless walk-back per ADR 0006 is formally triggered.

References

  • ../project-scope.md
  • ../timing-correctness.md — acceptance criteria this plan satisfies.
  • ../adr/0001-opensta-as-oracle.md
  • ../adr/0002-timing-ir.md
  • ../spikes/opentimer-sky130.md — runs in parallel; no dependency either way.

Plan — WS2: opensta-to-ir

Status: Implemented — historical record. All five phases (2.1–2.5) plus Pillar B Stage 1 (per-DFF CLOCK_ARRIVAL records) and release hardening (WS-RH.1 OpenSTA version probe) have shipped. The crate lives at crates/opensta-to-ir/. Current scheduling for further timing-model fidelity work is tracked in post-phase-0-roadmap.md.

Phase: 0 (executed WS2 from phase-0-ir-and-oracle.md). Predecessors: WS1 (crates/timing-ir, schema and round-trip — done), ADRs 0001 / 0002 / 0005 / 0006.

Goal

Deliver a production-quality preprocessing tool that consumes a design's timing inputs and emits a timing-ir file suitable for downstream Jacquard consumption. End-to-end:

.lib + .v + .sdf + .spef + .sdc  →  opensta-to-ir  →  design.jtir (+ design.json)

opensta-to-ir is shipped as a release artefact (per ADR 0006) and is also used by Phase 0 WS3's interim jacquard sim --timing-sdf runtime hook.

High-level architecture

Three components, single binary:

┌─────────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────────┐
│  Rust CLI / driver      │     │  Tcl dump script        │     │  Rust IR builder        │
│  (clap, subprocess mgmt)│ →   │  (runs in OpenSTA proc) │ →   │  (parses dump, builds   │
│  Validates inputs       │     │  Emits canonical dump   │     │   FlatBuffers IR)       │
└─────────────────────────┘     └─────────────────────────┘     └─────────────────────────┘
            │                                                                   │
            └──────────────────── one process invocation ───────────────────────┘

The Rust CLI invokes OpenSTA as a subprocess, writes the Tcl driver script to a temp directory, runs sta -f $tmpdir/dump.tcl, captures the dump file, and converts to IR. The Tcl driver lives at crates/opensta-to-ir/tcl/dump_timing.tcl and is embedded in the binary via include_str!() so the binary is self-contained at runtime — no separate Tcl file needs to ship alongside it.

OpenSTA is located by probing scripts/build-opensta.sh --print-binary first (the canonical install path for the vendored submodule), then falling back to a PATH lookup. Passing --opensta-bin <PATH> overrides both.
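The lookup order can be sketched as a chain of optional candidates. This assumes the explicit `--opensta-bin` value takes precedence over both probes, which is how the CLI help describes the flag. `find_on_path` and `choose_opensta` are hypothetical helper names, and the build-script probe is represented only by its already-resolved result.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Search each directory of a PATH-style value for a file named `exe`.
pub fn find_on_path(path_var: &OsStr, exe: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(exe))
        .find(|candidate| candidate.is_file())
}

/// Pick the first available candidate.
/// Assumed precedence: explicit override, then build-script probe, then PATH.
pub fn choose_opensta(
    override_bin: Option<PathBuf>,
    script_probe: Option<PathBuf>,
    path_lookup: Option<PathBuf>,
) -> Option<PathBuf> {
    override_bin.or(script_probe).or(path_lookup)
}
```

A caller would feed `find_on_path(&env::var_os("PATH").unwrap_or_default(), "sta")` in as the last candidate and report an argument error when all three are `None`.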

Reasons for this shape:

  • OpenSTA's structured Tcl API (get_timing_edges, get_timing_arcs_from, etc.) gives access to OpenSTA's internalised timing graph directly. Walking it is simpler than parsing OpenSTA's SDF output back through a second-generation parser.
  • The Tcl script is the only OpenSTA-specific code; the Rust side is format-only and can later be reused with other producers (Phase 3 native Rust parser, future OpenTimer adapter).
  • Subprocess invocation preserves Jacquard's permissive license posture (ADR 0001).

Tcl dump format

A simple line-oriented record format. Each line is one annotation. Fields are tab-separated. Strings with tabs/newlines are quoted with simple "..." and \t/\n escaping. Header / footer lines mark the document.

# format-version: 1
# generator-tool: opensta-to-ir 0.1.0
# generator-opensta: <opensta version string>
# input-files: <comma-separated list>
CORNER	<index>	<name>	<process>	<voltage>	<temperature>
ARC	<cell_instance>	<driver_pin>	<load_pin>	<corner_index>	<rise_min>	<rise_typ>	<rise_max>	<fall_min>	<fall_typ>	<fall_max>	<condition>	<origin>
INTERCONNECT	<net>	<from_pin>	<to_pin>	<corner_index>	<min>	<typ>	<max>	<origin>
SETUP_HOLD	<cell_instance>	<d_pin>	<clk_pin>	<edge>	<corner_index>	<setup_min>	<setup_typ>	<setup_max>	<hold_min>	<hold_typ>	<hold_max>	<condition>	<origin>
VENDOR_EXT	<source>	<source_tool>	<kind>	<base64_payload>
# end

Why line-oriented (not JSON): Tcl emits this trivially with puts. Rust parses it with a BufReader line-at-a-time, no streaming-JSON parser. Mismatched lines fail loud at the unit level, not after parsing 100MB of nested JSON.
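A line-at-a-time reader for this format can be quite small. The sketch below parses only ARC records and skips the other known record types. It also ignores the quoted-string escaping rule for brevity, so it is a simplified illustration rather than the parser in opensta-to-ir; `ArcRecord` and `parse_arcs` are hypothetical names.

```rust
use std::io::BufRead;

/// One ARC line, with rise and fall delays as [min, typ, max].
#[derive(Debug, Clone, PartialEq)]
pub struct ArcRecord {
    pub cell_instance: String,
    pub driver_pin: String,
    pub load_pin: String,
    pub corner_index: u32,
    pub rise: [f64; 3],
    pub fall: [f64; 3],
    pub condition: String,
    pub origin: String,
}

fn parse_arc(fields: &[&str], line_no: usize) -> Result<ArcRecord, String> {
    // Record tag plus the 12 ARC columns from the format listing.
    if fields.len() != 13 {
        return Err(format!("line {line_no}: ARC expects 13 fields, got {}", fields.len()));
    }
    let num = |i: usize| -> Result<f64, String> {
        fields[i]
            .parse::<f64>()
            .map_err(|e| format!("line {line_no}: field {i}: {e}"))
    };
    Ok(ArcRecord {
        cell_instance: fields[1].to_string(),
        driver_pin: fields[2].to_string(),
        load_pin: fields[3].to_string(),
        corner_index: fields[4]
            .parse::<u32>()
            .map_err(|e| format!("line {line_no}: corner_index: {e}"))?,
        rise: [num(5)?, num(6)?, num(7)?],
        fall: [num(8)?, num(9)?, num(10)?],
        condition: fields[11].to_string(),
        origin: fields[12].to_string(),
    })
}

/// Read a dump and collect its ARC records, failing on the first bad line.
pub fn parse_arcs<R: BufRead>(reader: R) -> Result<Vec<ArcRecord>, String> {
    let mut arcs = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        let line_no = idx + 1;
        let line = line.map_err(|e| format!("line {line_no}: {e}"))?;
        // Header and footer lines start with '#'.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        match fields[0] {
            "ARC" => arcs.push(parse_arc(&fields, line_no)?),
            // Known record types that this sketch does not decode.
            "CORNER" | "INTERCONNECT" | "SETUP_HOLD" | "VENDOR_EXT" => {}
            other => return Err(format!("line {line_no}: unknown record type '{other}'")),
        }
    }
    Ok(arcs)
}
```

Each error message carries the line number, so a malformed record is reported where it occurs rather than after the whole dump has been read.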

The format is a private interface between the bundled Tcl script and the bundled Rust binary — both ship together in one release artefact. We reserve the right to change the format any time as long as both sides update.

Rust binary

opensta-to-ir [OPTIONS] --output <PATH>

Inputs (at least one liberty + one verilog required):
  --liberty <PATH>...           One or more Liberty files (-r overlay supported by OpenSTA).
  --verilog <PATH>...           One or more Verilog netlists.
  --sdf <PATH>                  Optional. Back-annotated delays.
  --spef <PATH>                 Optional. Parasitics; required for SPEF-based delay calc.
  --sdc <PATH>                  Optional. Constraints (clocks, input delays).
  --top <NAME>                  Top-level module name. Required.
  --corner <NAME>...            Corner name(s). Default: "default".

Output:
  --output <PATH>               IR binary output path (.jtir).
  --json <PATH>                 Optional. JSON sidecar via flatc round-trip.

Behaviour:
  --opensta-bin <PATH>          Override the OpenSTA executable path. Default: probe via
                                `scripts/build-opensta.sh --print-binary`, then fall back to PATH.
  --keep-tmp                    Keep the Tcl script and dump file in $TMPDIR for debugging.
  --min-arcs <N>                Fail if fewer than N timing arcs are emitted. Default: 1.
  --allow-empty-parse           Disable the --min-arcs check. For test fixtures only.
  --strict-tcl                  Treat OpenSTA Tcl warnings as errors. (Specced but not implemented; see "Still open / deferred".)
  -v, --verbose                 Echo OpenSTA's stderr to ours. Default: capture and replay only on failure.

Exit codes:
  0    IR produced successfully.
  1    OpenSTA returned an error.
  2    Tcl dump format error or IR-build failure.
  3    Parser-success assertion failed (--min-arcs not met).
  4    Argument validation error.

Internal flow:

  1. Validate args (required files exist, top name non-empty).
  2. Locate OpenSTA binary; verify version is in supported range.
  3. Render Tcl driver script into $TMPDIR (or stdin).
  4. Spawn opensta -f <script>; capture stdout/stderr/exit.
  5. Read dump file from $TMPDIR/<uniqued>.osd (OpenSTA dump).
  6. Parse dump, build IR via timing-ir crate's FlatBuffers builders.
  7. Apply --min-arcs assertion (see WS5 portion below).
  8. Write .jtir (and .json if requested).
  9. Surface any captured warnings on stderr.
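
The exit codes listed above can be captured in one place so that the failure-mode tests and the binary agree on them. The following is a sketch only; `ExitKind` and its variant names are hypothetical and not taken from the opensta-to-ir source.

```rust
/// Hypothetical mapping of the documented exit codes to named outcomes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExitKind {
    Success,       // 0: IR produced successfully
    OpenStaError,  // 1: OpenSTA returned an error
    DumpOrIrBuild, // 2: Tcl dump format error or IR-build failure
    MinArcsNotMet, // 3: parser-success assertion failed
    BadArguments,  // 4: argument validation error
}

impl ExitKind {
    /// Numeric process exit code for this outcome.
    fn code(self) -> i32 {
        match self {
            ExitKind::Success => 0,
            ExitKind::OpenStaError => 1,
            ExitKind::DumpOrIrBuild => 2,
            ExitKind::MinArcsNotMet => 3,
            ExitKind::BadArguments => 4,
        }
    }

    /// Inverse mapping, useful when a test inspects a child process status.
    fn from_code(code: i32) -> Option<ExitKind> {
        match code {
            0 => Some(ExitKind::Success),
            1 => Some(ExitKind::OpenStaError),
            2 => Some(ExitKind::DumpOrIrBuild),
            3 => Some(ExitKind::MinArcsNotMet),
            4 => Some(ExitKind::BadArguments),
            _ => None,
        }
    }
}
```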

Multi-corner handling

OpenSTA's define_corners and set_scene commands drive multi-corner analysis. Our flow:

  • Caller passes --corner ss_125C_1v08 --corner tt_25C_1v80 --corner ff_-40C_1v98.
  • Tcl script calls define_corners once with the union, then iterates foreach corner [get_corners] { ... } and emits CORNER + ARC/INTERCONNECT/SETUP_HOLD lines tagged with the corner index.
  • Single-corner designs use one entry — same code path, no special case.

PVT extraction (process / voltage / temperature): OpenSTA exposes these via Liberty's operating conditions, and the Tcl script reads them from the corner's pvt object. If they are unavailable, the record carries the placeholders process="?", voltage=0.0, temperature=0.0.

Vendor extensions

OpenSTA does not expose a single mechanism for arbitrary annotations. For Phase 0 WS2:

  • We do not produce VENDOR_EXT records.
  • The IR's vendor-extension passthrough remains a forward-looking feature; a future producer (a commercial-tool-aware adapter) will populate it.

Tcl-side parsing of vendor-specific Liberty simulation blocks or SDF (VENDOR …) constructs is not in scope for Phase 0 WS2.

Parser-success assertion (WS5 portion)

Per phase-0-ir-and-oracle.md WS5: "Assertions in opensta-to-ir: non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Exit non-zero with a clear diagnostic when below threshold."

Implementation:

  • --min-arcs <N> flag with default 1.
  • After IR is built, count TimingArc records in the buffer.
  • If below threshold and --allow-empty-parse was not passed, exit code 3 with message: opensta-to-ir: produced N timing arcs (--min-arcs <M>); use --allow-empty-parse for empty-fixture tests.
  • Liberty parser-success assertion already lives in Jacquard's TimingLibrary::parse (see commit 5db131e) — opensta-to-ir invokes OpenSTA's own Liberty reader rather than Jacquard's, so it surfaces missing-cell issues via OpenSTA's exit status (not our concern at this layer).
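
The threshold check described in the bullets above is small enough to sketch directly. `check_min_arcs` is a hypothetical helper name; the message text and exit code 3 follow the bullets, and the arc count is assumed to have been obtained from the built IR beforehand.

```rust
/// Returns Ok(()) when the arc count is acceptable, otherwise the exit code
/// and diagnostic to print. `allow_empty_parse` disables the check entirely.
fn check_min_arcs(
    arc_count: usize,
    min_arcs: usize,
    allow_empty_parse: bool,
) -> Result<(), (i32, String)> {
    if allow_empty_parse || arc_count >= min_arcs {
        return Ok(());
    }
    Err((
        3,
        format!(
            "opensta-to-ir: produced {} timing arcs (--min-arcs {}); \
             use --allow-empty-parse for empty-fixture tests.",
            arc_count, min_arcs
        ),
    ))
}
```

Returning the code and message together keeps the decision testable without spawning the binary.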

Test plan

Fixture progression — minimum-viable to representative

  1. inv_chain_pnr (already in tests/timing_test/): smallest design with real SKY130 cells and SDF. Verify single arc per inverter, correct rise/fall, single corner.
  2. MCU SoC subset: representative of the real Jacquard flow. Verify the count of arcs matches a known baseline; spot-check a handful of arrival times against report_timing output.
  3. Multi-corner synthetic: hand-built tiny design with ss/tt/ff Liberty corners, verify the IR carries 3 corner records and 3 sets of values per arc.

Test types

  • Unit tests (Rust): dump-format parser tested against synthetic dump strings (no OpenSTA needed).
  • Integration tests (Rust + OpenSTA): invoke the binary against committed fixtures, diff the resulting IR against golden IR via timing-ir-diff. Each integration test gates itself on scripts/build-opensta.sh --print-binary succeeding — when the OpenSTA binary is unbuilt, tests skip with a clear "run scripts/build-opensta.sh" message rather than failing. CI runs them after building OpenSTA via the script.
  • Failure-mode tests: missing OpenSTA, malformed Tcl dump, zero-arc input, missing required argument — each surfaces the expected exit code.

CI integration (closes WS4 remaining work)

  • A new CI job runs opensta-to-ir on each tests/timing_ir/corpus/<name>/inputs/ and diffs against expected.jtir via timing-ir-diff. Fails loud on diff or exit-code regression.
  • Stress-corpus run is deferred to Phase 1.

Phased implementation

Splitting WS2 into focused PRs keeps reviewability tight. Each phase exits with a runnable end-to-end on its scope:

| Phase | Scope | Exit signal | Status |
| --- | --- | --- | --- |
| 2.1 | Single-corner, timing-arc IOPATHs only. CLI scaffolding. | AIGPDK AND2 round-trip clean through opensta-to-ir end-to-end. | ✅ Shipped (dc3db4a scaffold + 3997e06 subprocess plumbing + 50b8600 real Tcl extraction). |
| 2.2 | Add interconnect delays (wire-role edges, with optional SPEF). | Multi-cell design produces INTERCONNECT records that round-trip. | ✅ Shipped (67210c0). Test: chain_with_sdf_emits_interconnect_delay. |
| 2.3 | Add setup/hold checks. | DFF setup/hold round-trips end-to-end. | ✅ Shipped (8343b14). Test: aigpdk_dff_emits_setup_hold_records. Recovery / removal / width checks remain out of scope. |
| 2.4 | Multi-corner. | 3-corner synthetic fixture produces 3-corner IR. | ✅ Shipped (530bb36 builder + 59fde04 per-corner Tcl emission + d110174 integration test + 50f4bf5 real-sky130 multi-corner follow-up). Tests: aigpdk_dff_emits_per_corner_timing_values, sky130_multi_corner_emits_per_corner_values. |
| 2.5 | CI corpus integration; golden-IR fixtures for representative designs. | WS4 corpus job in CI; WS2 task complete. | ✅ Shipped (90558bb). Runner: cargo test -p opensta-to-ir corpus. |

Beyond original WS2 scope:

  • Pillar B Stage 1 — per-DFF CLOCK_ARRIVAL records (c403cc8). Adds clock arrival times to the IR so downstream consumers can compute per-DFF setup/hold margins without re-running OpenSTA. Test: dff_with_sdc_clock_emits_clock_arrival. Tracked separately in post-phase-0-roadmap.md.
  • Release hardening WS-RH.1 — hard-fail on missing or too-old OpenSTA, with version probe and usage diagnostics (c9c393b). Tests: locate_accepts_min_tested_version, locate_flags_newer_than_tested. Tracked in post-phase-0-roadmap.md § Release hardening.
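
The WS-RH.1 version probe mentioned above boils down to a range check on a parsed version. The sketch below assumes a dotted `major.minor[.patch]` version string and uses made-up bounds; `MIN_TESTED` and `MAX_TESTED` here are placeholders, not the values the binary ships with, and `VersionCheck` is a hypothetical type.

```rust
type Version = (u32, u32, u32);

// Placeholder bounds for illustration only.
const MIN_TESTED: Version = (2, 6, 0);
const MAX_TESTED: Version = (2, 7, 0);

#[derive(Debug, PartialEq)]
enum VersionCheck {
    Supported,
    TooOld,
    NewerThanTested,
}

/// Parse "major.minor" or "major.minor.patch"; a missing patch counts as 0.
fn parse_version(s: &str) -> Option<Version> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    let patch = match parts.next() {
        Some(p) => p.parse::<u32>().ok()?,
        None => 0,
    };
    Some((major, minor, patch))
}

/// Tuples compare lexicographically, which matches version ordering.
fn check_version(v: Version) -> VersionCheck {
    if v < MIN_TESTED {
        VersionCheck::TooOld
    } else if v > MAX_TESTED {
        VersionCheck::NewerThanTested
    } else {
        VersionCheck::Supported
    }
}
```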

WS3 (delete src/sdf_parser.rs + wire interim runtime hook) was unblocked once Phase 2.3 minimum landed and has also shipped — see ws3-delete-sdf-parser.md.

Open questions — resolution

Resolutions from implementation:

  • OpenSTA version pinning: Resolved by WS-RH.1 (c9c393b). Binary probes OpenSTA's version_string, accepts a [MIN_TESTED, MAX_TESTED] range, prints a usage diagnostic with the supported range on mismatch.
  • OpenSTA installation: Resolved. scripts/build-opensta.sh ships with --print-binary for the dependency probe; integration tests skip cleanly when the binary isn't built. Documented in the script's --help and the post-Phase-0 roadmap.
  • Tcl-script versioning: Resolved. # format-version: 1 header check is enforced in dump.rs; the binary refuses unknown versions with an explicit error.
  • Conditional arcs (SDF COND): Partial. The condition field is plumbed end-to-end (dump format → Rust parser → IR builder), but the Tcl emission side does not yet populate it for conditional variants. Defer until a real design surfaces a COND arc that needs distinguishing.
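
For the Tcl-script versioning item, a header check along the following lines would refuse unknown versions. This is a sketch under assumptions: `check_format_version` is a hypothetical name, the error wording is invented, and the real check in dump.rs may be structured differently.

```rust
const SUPPORTED_FORMAT_VERSION: u32 = 1;

/// Validate the first line of a dump, expected to read "# format-version: 1".
fn check_format_version(first_line: &str) -> Result<(), String> {
    let rest = first_line
        .trim()
        .strip_prefix("# format-version:")
        .ok_or_else(|| format!("missing format-version header, got {:?}", first_line))?;
    let version: u32 = rest
        .trim()
        .parse()
        .map_err(|_| format!("unparseable format version {:?}", rest.trim()))?;
    if version != SUPPORTED_FORMAT_VERSION {
        return Err(format!(
            "unsupported dump format version {} (expected {})",
            version, SUPPORTED_FORMAT_VERSION
        ));
    }
    Ok(())
}
```

A unit test that feeds this a `# format-version: 2` line covers the drift case without needing OpenSTA.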

Still open / deferred:

  • Long-running designs: streaming dump emission (Tcl flushing line-by-line, Rust incremental read) — defer until profiling on a real SoC shows memory pressure.
  • Strict Tcl error handling: --strict-tcl flag was specced but not implemented. Current behaviour captures all stderr and replays on failure; no warning-to-error upgrade path. Land if it becomes a real CI hygiene concern.

Risks

  • OpenSTA's Tcl API is large and not all of it is documented. Some primitives we'll need (e.g., per-corner delay values for a specific arc) may require digging through Sta.cc. Mitigation: budget time, lean on report_path text output as a fallback if the structured API proves opaque for a given query.
  • OpenSTA may be slow on big designs — the structured walk over millions of arcs is single-threaded. Mitigation: --keep-tmp for profiling, accept slow phase-0 runs, optimise later if it blocks CI.
  • Format drift between Tcl and Rust — both sides advance together; the format-version line plus version-mismatch fail-loud catches drift. Add a unit test that the Rust parser rejects an unexpected version line.

Non-goals

  • A general SDF parser. (The whole point: avoid that.)
  • Wire-level reactivity or feedback to OpenSTA mid-run (this is a one-shot extract).
  • Comparison against OpenTimer (that's a separate ADR-0003-spike concern).
  • Replacing OpenSTA's role as oracle in CI — opensta-to-ir is a producer, not a checker.

References

  • ../adr/0001-opensta-as-oracle.md: subprocess model, license posture.
  • ../adr/0002-timing-ir.md: IR contract this tool emits.
  • ../adr/0005-opensta-vendoring-and-corpus.md: vendor/opensta/ submodule.
  • ../adr/0006-sdf-preprocessing-model.md: interim runtime hook + release-time cutover.
  • phase-0-ir-and-oracle.md: WS2 row in the work breakdown.
  • crates/timing-ir/schemas/timing_ir.fbs: schema this tool produces.
  • vendor/opensta/doc/StaApi.txt: OpenSTA Tcl API reference.

Last updated: 2026-04-28 (design); 2026-05-15 (status flip to Implemented).

Plan — WS3: delete SDF parser, wire IR consumer + interim runtime hook

Status: Implemented — kept as historical record. Note: the "interim" / "pre-release-only" framing throughout this document describes the original ADR 0006 model. Per ADR 0006 § Amendment (2026-05-02), the runtime subprocess wrapper is now the shipping mechanism — Phase 3 (native Rust SDF→IR) is no longer release-gating. This document is preserved for the implementation phasing record; for current shipping intent see ADR 0006 § Amendment and post-phase-0-roadmap.md § Phase 3.

Phase: 0 (executes WS3 from phase-0-ir-and-oracle.md).
Predecessors: WS2 phases 2.1 + 2.3-minimum (delay arcs + setup/hold checks landed), which give sufficient IR coverage for runtime cutover.
ADRs: 0002 (IR), 0006 (SDF preprocessing model + interim cutover; amended 2026-05-02).

Goal

Delete src/sdf_parser.rs and migrate src/flatten.rs's timing-loading to consume the timing IR directly. Wire jacquard sim --timing-ir <PATH> as the canonical input path, and (per ADR 0006) keep --timing-sdf <PATH> working pre-release as a contributor-ergonomics convenience that internally subprocesses opensta-to-ir.

End state:

  • No hand-rolled SDF parsing in the Jacquard codebase.
  • Runtime SDF input still works (via internal subprocess) until first release.
  • flatten.rs consumes timing_ir::TimingIR<'_> for arc / setup / hold loading.
  • All flatten.rs tests that previously hand-built SDF strings are migrated to build IR fixtures via the timing-ir crate's FlatBuffers builders.

Surface analysis

src/sdf_parser.rs (1099 lines) defines SdfFile, SdfDelay, SdfCorner, TimingCheckType, and parses SDF text. Consumers:

  • src/flatten.rs: load_timing_from_sdf(...) is the only non-test consumer; iterates SdfFile.get_cell(path), uses SdfDelay for wire delays, TimingCheckType::Setup/Hold for check identification. ~200 lines of integration plus 7+ test fixtures that build SDF strings inline.
  • src/sim/setup.rs: translates --sdf-corner CLI string into SdfCorner and calls SdfFile::parse_file.
  • src/aig.rs: test imports only.
  • src/lib.rs: module declaration only.

Architecture changes

New: src/sim/timing_ir_loader.rs

Thin module that owns the IR file buffer (so consumers can borrow TimingIR<'_> views from it):

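use std::path::Path;

/// Owns the raw IR bytes so that TimingIR<'_> views can borrow from them.
pub struct TimingIrFile {
    buf: Vec<u8>,
}

impl TimingIrFile {
    /// Read the IR file at `path` into an owned buffer.
    pub fn from_path(path: &Path) -> Result<Self, ...> { ... }
    /// Wrap an already-loaded buffer.
    pub fn from_bytes(buf: Vec<u8>) -> Result<Self, ...> { ... }
    /// Borrow a zero-copy view; the view cannot outlive `self`.
    pub fn view(&self) -> Result<timing_ir::TimingIR<'_>, ...> {
        timing_ir::root_as_timing_ir(&self.buf)
    }
}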

The TimingIR view holds a lifetime tied to the buffer. Callers keep the TimingIrFile alive while iterating the view.
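
The ownership pattern described here (one type owns the bytes, views borrow from it) can be shown without FlatBuffers. In this sketch `OwnedFile` and `View` are stand-ins invented for illustration; `View` plays the role of timing_ir::TimingIR<'_>, and the empty-buffer check is a placeholder for real verification.

```rust
/// Stand-in for a zero-copy view such as timing_ir::TimingIR<'_>.
struct View<'a> {
    bytes: &'a [u8],
}

impl<'a> View<'a> {
    fn len(&self) -> usize {
        self.bytes.len()
    }
}

/// Owns the buffer; every view borrows from it.
struct OwnedFile {
    buf: Vec<u8>,
}

impl OwnedFile {
    fn from_bytes(buf: Vec<u8>) -> Result<Self, String> {
        if buf.is_empty() {
            return Err("empty buffer".to_string());
        }
        Ok(OwnedFile { buf })
    }

    /// The returned view's lifetime is tied to `&self`, so the borrow checker
    /// rejects any use of the view after the file has been dropped.
    fn view(&self) -> View<'_> {
        View { bytes: &self.buf }
    }
}
```

Callers hold the `OwnedFile` in a local (or a struct field) for as long as they iterate the view.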

Modified: src/flatten.rs

Replace load_timing_from_sdf with load_timing_from_ir:

pub fn load_timing_from_ir(
    &mut self,
    aig: &AIG,
    netlistdb: &NetlistDB,
    ir: &timing_ir::TimingIR<'_>,
    clock_period_ps: u64,
    liberty_fallback: Option<&TimingLibrary>,
    debug: bool,
) { ... }

Logic translation table:

| Old (SdfFile) | New (TimingIR<'_>) |
| --- | --- |
| sdf.get_cell(path) | Index ir.timing_arcs() / ir.setup_hold_checks() by cell_instance (build a HashMap<&str, _> once). |
| cell.iopaths | Filter timing arcs by cell_instance == path. |
| cell.timing_checks | Filter setup/hold checks by cell_instance == path. |
| SdfDelay { rise, fall, ... } | TimingArc.rise_delay() / .fall_delay() (per-corner); take corner 0 max for now. |
| TimingCheckType::Setup / ::Hold | SetupHoldCheck.setup() / .hold() per record. |
| cell.interconnect_delays | ir.interconnect_delays(); empty until WS2.2 lands, so tolerate an empty list. |
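
The "build a HashMap<&str, _> once" step from the first row can be sketched as follows. `ArcRow` is a hypothetical plain struct standing in for the generated TimingArc accessor type, so the sketch compiles without the timing-ir crate.

```rust
use std::collections::HashMap;

/// Stand-in for one timing arc; the real type is a FlatBuffers accessor.
struct ArcRow {
    cell_instance: String,
    rise_max: f64,
    fall_max: f64,
}

/// Group arcs by cell instance in one pass, so that later per-cell lookups
/// replace the old sdf.get_cell(path) call with a map lookup.
fn index_by_cell(arcs: &[ArcRow]) -> HashMap<&str, Vec<&ArcRow>> {
    let mut index: HashMap<&str, Vec<&ArcRow>> = HashMap::new();
    for arc in arcs {
        index.entry(arc.cell_instance.as_str()).or_default().push(arc);
    }
    index
}
```

A lookup for a cell with no arcs returns `None`, which is the point where a Liberty fallback (if kept) would apply.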

The hierarchy-prefix detection (lines 1793-1820 of current flatten.rs) is independent of source format — same logic applies, just use IR's cell_instance strings instead of SDF's. Keep the heuristic.

Modified: CLI surface (src/bin/jacquard.rs, src/sim/setup.rs)

  • Add --timing-ir <PATH> flag that loads IR directly via TimingIrFile::from_path.
  • Retarget --timing-sdf <PATH> (and the existing --sdf-corner) to: spawn opensta-to-ir as a subprocess, capture its IR output, call load_timing_from_ir. Mark the code site INTERIM per ADR 0006.
  • The interim hook needs Liberty + Verilog paths to feed opensta-to-ir; the jacquard sim CLI already takes those, so plumb them through.
  • Keep --sdf-corner for backward compat — the interim wrapper passes it as --corner to opensta-to-ir.

Deletions

  • src/sdf_parser.rs: entire file.
  • src/lib.rs: the pub mod sdf_parser line.
  • src/aig.rs: the use crate::sdf_parser::{SdfCorner, SdfFile} test imports; rewrite or delete the affected tests.
  • src/flatten.rs: the use crate::sdf_parser::SdfFile import; rewrite test fixtures.

Test migration strategy

Test fixtures in flatten.rs currently look like:

let sdf_content = r#"(DELAYFILE ... )"#;
let sdf = SdfFile::parse_str(sdf_content, SdfCorner::Typ).expect("...");
flat.load_timing_from_sdf(&aig, &netlistdb, &sdf, ...);

After cutover:

let ir_buf = build_test_ir(&TestIrSpec {
    arcs: vec![ /* (cell, from, to, rise_max, fall_max) */ ],
    setup_hold: vec![ /* (cell, d, clk, edge, setup, hold) */ ],
});
let ir = root_as_timing_ir(&ir_buf).unwrap();
flat.load_timing_from_ir(&aig, &netlistdb, &ir, ...);

A build_test_ir helper in flatten.rs::tests mirrors build_ir_with_arcs from crates/timing-ir/tests/diff.rs. Single source of truth would be nicer; for now duplicate it (deduplication is a future cleanup).

Phased implementation

| Phase | Scope | Exit signal |
| --- | --- | --- |
| 3.1 | Add src/sim/timing_ir_loader.rs and flatten.rs::load_timing_from_ir (parallel to _from_sdf). No CLI surface, no deletions. Unit-test the new function with a small synthetic IR. | New function compiles + passes unit test; existing _from_sdf path still works. |
| 3.2 | Add jacquard sim --timing-ir <PATH> CLI flag wired to load_timing_from_ir. End-to-end test: pre-generate IR via opensta-to-ir, run jacquard sim --timing-ir, compare against the existing --timing-sdf baseline. | A representative timing test (e.g., one of the existing tests/timing_test/) produces matching VCD output via both paths. |
| 3.3 | Retarget --timing-sdf to subprocess opensta-to-ir internally, then consume IR. Tag the code site INTERIM per ADR 0006. | Existing --timing-sdf regression tests pass through the new path. |
| 3.4 | Delete src/sdf_parser.rs. Migrate flatten.rs test fixtures from SDF strings to IR builders. Migrate aig.rs test imports. | All cargo test --lib tests pass; src/sdf_parser.rs is gone; the only crate::sdf_parser:: reference is git log. |

Each phase exits cleanly. Phase 3.4 is the irreversible deletion — gates on phases 3.1-3.3 having green CI on the migration tests.

Open questions

  • Hierarchy separator: SDF uses ., OpenSTA's default divider is /. Our IR's cell_instance strings come from OpenSTA so use /. The flatten.rs hierarchy-prefix detection logic uses .. After cutover, the logic needs to use /. Verify by running on a hierarchical design (MCU SoC) before declaring 3.4 ready.
  • --sdf-corner semantics under IR: today this picks one of Min/Typ/Max from SDF triples. The IR has min/typ/max per TimingValue already; the corner selection becomes "pick which of the three to use" applied per-arc rather than per-file. Document the mapping.
  • Default-corner consistency: WS2 emits default as the corner name. Pre-existing Jacquard tests may not look at corner names — need to spot-check.
  • liberty_fallback semantics: today, for cells absent from SDF, we fall back to Liberty-computed delays. Under IR, OpenSTA-computed values are already in the IR's arcs (as Origin::Computed). So liberty_fallback is potentially dead. Decide whether to drop it in 3.4 or keep as safety net.
  • Multi-corner (post-WS2.4): when WS2.4 lands, the IR will have multiple corners. flatten.rs currently picks one. Define the per-corner selection contract — explicit corner-name CLI flag, or default to a named corner.
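
One possible reading of the --sdf-corner question above is a per-arc pick from the min/typ/max triple. The sketch below is not a decided contract; `CornerPick`, `TimingTriple`, and both functions are hypothetical names, and the accepted strings ("min", "typ", "max", case-insensitive) are an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CornerPick {
    Min,
    Typ,
    Max,
}

/// Stand-in for the IR's min/typ/max value triple.
#[derive(Debug, Clone, Copy)]
struct TimingTriple {
    min: f64,
    typ: f64,
    max: f64,
}

/// Map the CLI string onto a pick; unknown strings are rejected.
fn parse_corner_pick(s: &str) -> Option<CornerPick> {
    match s.to_ascii_lowercase().as_str() {
        "min" => Some(CornerPick::Min),
        "typ" => Some(CornerPick::Typ),
        "max" => Some(CornerPick::Max),
        _ => None,
    }
}

/// Apply the pick to one value; called per arc rather than per file.
fn select(value: TimingTriple, pick: CornerPick) -> f64 {
    match pick {
        CornerPick::Min => value.min,
        CornerPick::Typ => value.typ,
        CornerPick::Max => value.max,
    }
}
```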

Risks

  • flatten.rs test churn: 7+ test fixtures need rewrites. Each is a focused mechanical change but the bulk adds up. Mitigation: a build_test_ir helper standardizes the pattern.
  • Hidden-bug exposure: the existing SDF parser had quirks. The IR parser has different ones (or none). Migration may surface bugs that were latent. Treat any test failure during 3.4 as a real bug, not "just adjust the test."
  • Hierarchy-separator regression: if not caught in phase 3.2 testing (which tests on a single design), it could land in 3.4 and break a hierarchical design that wasn't previously regression-tested. Mitigation: include a hierarchical design in the 3.2 verification matrix.
  • Cutover timing: WS3 lands while WS2.2 (interconnects) and WS2.4 (multi-corner) are still pending. The flatten.rs cutover assumes those will land later, so test fixtures must not depend on interconnect delays or multi-corner behaviour for any phase up to and including 3.4 to pass.

Walk-back

If 3.4 surfaces blocking issues, ADR 0006 already permits deferring deletion: keep src/sdf_parser.rs alive but tagged LEGACY — superseded by IR consumer; remove before first release, and ship preprocessing-only for the interim. The runtime SDF subprocess wrapper covers the contributor ergonomics. The native Rust SDF parser rewrite (Phase 3 in the original phasing) is the durable replacement.

Non-goals

  • A native Rust SDF parser. (Original ADR 0006 Phase 3; not part of WS3.)
  • Validating SDF round-trip equivalence between the old parser and OpenSTA. (CI corpus test in WS4/WS2.5 covers this when fixtures exist.)
  • Refactoring the broader flatten.rs structure beyond what migration requires.

References

  • ../adr/0002-timing-ir.md — IR contract.
  • ../adr/0006-sdf-preprocessing-model.md — interim runtime subprocess + release-time cutover.
  • phase-0-ir-and-oracle.md — WS3 row.
  • ws2-opensta-to-ir.md — produces the IR this consumer reads.
  • crates/opensta-to-ir/ — subprocess target for the interim --timing-sdf hook.
  • crates/timing-ir/ — IR library + builders for test fixtures.

Last updated: 2026-04-28

Plan — WS3 follow-up: re-add cosim --sdf via opensta-to-ir

Status: Deferred. Tracked here so future work can pick it up. Predecessor: WS3 phase 3.4 (deletes hand-rolled src/sdf_parser.rs).

Background

Phase 3.4 deleted src/sdf_parser.rs. The jacquard sim subcommand kept SDF input working (Phase 3.3 wired --sdf through setup::load_sdf_via_opensta_to_ir, an internal subprocess wrapper that calls the opensta-to-ir crate to convert SDF→IR). The jacquard cosim subcommand chose Option B of the phase 3.4 handoff: drop --sdf entirely rather than thread --liberty through. As a result, cosim now only accepts pre-converted IR via --timing-ir.

What was removed in 3.4

  • CosimArgs::sdf, sdf_corner, sdf_debug CLI fields (src/bin/jacquard.rs).
  • The config.timing.sdf_file / sdf_corner fallback path in src/sim/cosim_metal.rs::run_cosim.
  • TimingSimConfig::sdf_file and sdf_corner JSON fields (src/testbench.rs).

User-facing migration (current state)

The tests/mcu_soc/ cosim flow that used to load SDF via the testbench config now needs an explicit pre-conversion step.

Feed 6_final.v directly to opensta-to-ir

Retraction (2026-05-18). Earlier versions of this section recommended feeding tests/mcu_soc/data/top_synth.v (post-synthesis, pre-P&R) to opensta-to-ir to dodge a parse error on 6_final.v's chipflow integration wrapper. That was wrong: top_synth.v is missing the ~236K cells P&R inserts (clkbuf_regs_* CTS buffers, ANTENNA_* diodes, delaybuf_*, fillers), so OpenSTA silently drops every SDF entry referencing a P&R-inserted cell and the resulting IR is missing the bulk of the design's timing. The "28162 matched / 2090 unmatched" verification log we celebrated at the time measured jtir records against the cosim-loaded netlist, not SDF coverage against the jtir — high surface match rate, materially incomplete IR. See ADR 0009 (OpenSTA Verilog reader input constraints) for the broader rule.

opensta-to-ir now transparently extracts module <--top> from each input file before invoking OpenSTA (implementation in crates/opensta-to-ir/src/verilog_filter.rs). For the chipflow mcu_soc case this strips the openframe_project_wrapper module automatically; the same handling kicks in for any LibreLane + wafer.space user (hazard3 and future tapeouts) whose final netlist carries an integration wrapper around the structural top.

# Convert SDF → IR once. Pass 6_final.v directly; the wrapper module
# is dropped automatically.
opensta-to-ir \
    --liberty /path/to/sky130_fd_sc_hd__tt_025C_1v80.lib \
    --verilog tests/mcu_soc/data/6_final.v \
    --sdf tests/mcu_soc/data/6_final.sdf \
    --top top \
    --output tests/mcu_soc/data/6_final.jtir

# Run cosim with the pre-converted IR. Cosim loads 6_final.v (the
# wrapper) because that's what carries GPIO ports. The IR consumer's
# hierarchy-prefix detection strips the `top_inst/` prefix from the
# wrapper's cell paths so they match the IR's instance names.
cargo run -r --features metal --bin jacquard -- cosim \
    tests/mcu_soc/data/6_final.v \
    --config tests/mcu_soc/sim_config_sky130.json \
    --top-module openframe_project_wrapper \
    --timing-ir tests/mcu_soc/data/6_final.jtir

tests/mcu_soc/sim_config_sky130.json no longer carries sdf_file / sdf_corner (the fields would be silently ignored if added back; cosim does not consume them).

Events-reference comparison: nuances

tests/mcu_soc/events_reference.json was wired into the sky130 cosim config as part of phase 3.4 verification. End-to-end pipeline result on a 3M-tick run:

  • 67 UART bytes captured; the reference's 155 UART events end at timestamp 4,187,182. All 67 captured payloads match the reference's leading bytes (decoded UART output: ....: nyaa~!\nSoC type: CA7F100F\nFlash ID: CA7CA7FF\nQuad mode). No payload divergence.
  • 15 non-UART entries in the reference (cxxrtl-emitted SPI deselect events with payload: "") are filtered out at parse time by the tolerant deserializer in cosim_metal.rs::run_cosim. Without that filter the comparison panicked on the first SPI entry.

chipflow's num_steps and timestamp are edge-counted

Retraction. Earlier drafts of this section claimed Jacquard's --max-cycles counts half-cycles. That was a misdiagnosis based on reading MultiClockScheduler::new (which does emit per-edge raw entries) without noticing the pairing layer at src/sim/cosim_metal.rs:2604-2675 that collapses them into one paired buffer per cycle. Today, --max-cycles N correctly counts N full clock cycles: each cosim tick does one fall-edge dispatch plus one rise-edge dispatch and DFFs capture once per tick. Verified via --stimulus-vcd trace (5 ticks → simulated time spans 0–200000 ps for a 40 ns period clock, exactly 5 cycles).

The actual unit difference vs chipflow's cxxrtl harness:

  • chipflow's num_steps is the count of tick() calls; each tick() bumps ++timestamp twice (once after the negedge dispatch, once after the posedge), so the events_reference.json timestamp field counts clock edges (a full cycle = posedge-to-posedge = 2 edges). The harness:
    auto tick = [&]() {
        {{interface}}.step(timestamp);
        top.clk.set(false); agent.step(); ++timestamp;  // post-negedge (odd)
        top.clk.set(true);  agent.step(); ++timestamp;  // post-posedge (even)
    };
    for (int i = 0; i < num_steps; i++) tick();
    
    See chipflow-lib/chipflow/common/sim/main.cc.jinja:32-74.
  • The half-tick timestamp is an intentional design, not a bug: parity tags each event with the clock phase it fired on (useful for verification of async paths).
  • chipflow's num_steps therefore doubles as an edge budget: 3 M num_steps = 3 M edges = 1.5 M full clock cycles.

To compare a Jacquard cosim run against today's events_reference.json, divide reference timestamps by 2 to convert edges → cycles. Empirical spot-check on mcu_soc/sky130: byte-0 in Jacquard at --max-cycles 200000 arrives at tick 28682; reference timestamp 58290 / 2 = 29145 cycles; ratio 0.984× (simulators agree on simulated time within 2%).
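
The edges-to-cycles conversion and the spot-check ratio above amount to the following arithmetic. The function names are hypothetical; the numbers used in the usage note are the ones quoted in this section.

```rust
/// Convert a chipflow edge-counted timestamp into full clock cycles.
fn edges_to_cycles(edge_timestamp: u64) -> u64 {
    edge_timestamp / 2
}

/// Ratio of a Jacquard tick count to the reference, after converting the
/// reference from edges to cycles. Returns None if the reference converts
/// to zero cycles.
fn tick_ratio(jacquard_tick: u64, reference_edge_timestamp: u64) -> Option<f64> {
    let cycles = edges_to_cycles(reference_edge_timestamp);
    if cycles == 0 {
        return None;
    }
    Some(jacquard_tick as f64 / cycles as f64)
}
```

For the byte-0 spot-check, `edges_to_cycles(58290)` gives 29145 and `tick_ratio(28682, 58290)` gives roughly 0.984.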

The earlier "67 of 155 events captured" gap is not a budget issue — chipflow drives input stimulus via design/tests/input.json and reference events 69+ require those driven inputs. The input-stimulus dispatcher was added in commit 4a1a989, and the mcu_soc/sky130 cosim now matches the cut-down chipflow reference 1:1 (90/90 events).

The earlier "Jacquard ~14% slower per byte than cxxrtl" claim relied on a phantom half-cycle correction; it is also retracted. There is no rate gap to explain at this level.

Done: --max-cycles renamed to --max-clock-edges (commit 46b5c28)

Cosim's internal granularity moved from full clock cycles to scheduler edges, aligning Jacquard's CLI 1:1 with chipflow's num_steps and unlocking per-edge event timestamping. Section retained for context on the unit conventions captured above.

Option A — restore cosim --sdf ergonomics

When this becomes a priority, mirror the jacquard sim surface:

Changes

  1. Add --liberty to CosimArgs (src/bin/jacquard.rs). Plumb it through DesignArgs::liberty (currently hardcoded None in cmd_cosim). Also pass --top-module through if that is not already done.
  2. Add --sdf, --sdf-corner, --sdf-debug back to CosimArgs. Make them mutually exclusive with --timing-ir (clap conflicts_with = "timing_ir").
  3. Re-add TimingSimConfig::sdf_file / sdf_corner (optional) — plus a new liberty_file field for the OpenSTA invocation. Update tests/mcu_soc/sim_config_sky130.json to use the new shape.
  4. Restore the cosim config-file fallback: in src/sim/cosim_metal.rs::run_cosim, when timing is not yet enabled and the config provides SDF + Liberty paths, call setup::load_sdf_via_opensta_to_ir. Match the priority order: CLI > config.timing.* > nothing.
  5. Update --output-vcd error message to mention --sdf again.
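
The priority order in step 4 (CLI > config.timing.* > nothing) could be resolved in one function, sketched below. All type and field names here are hypothetical simplifications of CosimArgs and TimingSimConfig, and the sketch assumes an SDF source is only usable when a Liberty path comes from the same place.

```rust
#[derive(Debug, PartialEq)]
enum TimingSource {
    Ir(String),
    Sdf { sdf: String, liberty: String },
    Untimed,
}

/// Simplified view of the timing-related CLI flags.
#[derive(Default)]
struct CliTiming {
    timing_ir: Option<String>,
    sdf: Option<String>,
    liberty: Option<String>,
}

/// Simplified view of the config file's timing section.
#[derive(Default)]
struct ConfigTiming {
    sdf_file: Option<String>,
    liberty_file: Option<String>,
}

/// CLI flags win over the config file; with neither, timing stays disabled.
fn resolve_timing(cli: &CliTiming, cfg: &ConfigTiming) -> TimingSource {
    if let Some(ir) = &cli.timing_ir {
        return TimingSource::Ir(ir.clone());
    }
    if let (Some(sdf), Some(lib)) = (&cli.sdf, &cli.liberty) {
        return TimingSource::Sdf { sdf: sdf.clone(), liberty: lib.clone() };
    }
    if let (Some(sdf), Some(lib)) = (&cfg.sdf_file, &cfg.liberty_file) {
        return TimingSource::Sdf { sdf: sdf.clone(), liberty: lib.clone() };
    }
    TimingSource::Untimed
}
```

The mutual exclusion between --sdf and --timing-ir from step 2 would still be enforced by clap before this function runs.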

Out of scope for Option A

  • Rebuilding a hand-rolled SDF parser. (See ADR 0006 — the durable replacement is the native Rust SDF→IR converter, tracked separately as Phase 3 in the original phasing.)
  • Adding cosim-specific corner-selection beyond what jacquard sim already offers. The IR's min/typ/max triple is selected via ir_corner0_max (currently always max); changing that is a separate concern that affects both subcommands.

Verification

After Option A lands:

cargo build --features metal
cargo test --lib
# Manual smoke test of the previous mcu_soc workflow:
cargo run -r --features metal --bin jacquard -- cosim \
    tests/mcu_soc/data/6_final.v \
    --config tests/mcu_soc/sim_config_sky130.json \
    --liberty <path>/sky130.lib \
    --sdf tests/mcu_soc/data/6_final.sdf

This should produce results equivalent to the pre-3.4 hand-rolled-parser path, within the IR's representational bounds (single-value interconnect delays, max corner selection).

Walk-back

If Option A is never picked up before first release, the existing IR-only cosim surface is fine — contributors using SDF can pre-convert via opensta-to-ir and pass the resulting .jtir. The follow-up exists as a contributor-ergonomics improvement, not a correctness gap.

Multi-clock and stimulus architecture — exploratory roadmap

Status: Captured architectural thinking. Most phases here are demand-driven and will only be picked up when a real-world workload requires them. Phases 1 and 2 may be worth scheduling on their own merits in a future release; the rest are written down so the design space is on record when the need appears.

This is a design-space doc, not a scheduling doc. It complements post-phase-0-roadmap.md (which schedules committed work) by capturing the architecture for two related areas — multi-clock-domain support and stimulus generation — that today have working but limited implementations.

Why now

The conversation that produced this doc was about supporting cosim against external testbench environments (UVM, CocoTB) and external clock sources (PHY, audio, DFS). Two observations crystallised the architecture:

  1. Real designs partition into large synchronous islands with thin boundaries. External-clock and DFS scenarios look intractable until you notice that <1 % of nets typically cross domains; the bulk of the design is batchable inside one island.
  2. Stimulus generation and stimulus consumption don't have to share a loop. Today cosim couples the testbench tick-by-tick to the GPU dispatch. Decoupling them — via streaming or full precompute — turns the GPU from a ping-ponging coprocessor into a stream consumer.

Both observations point at architecture changes that compose cleanly with each other, with the existing multi-clock plumbing, and with the existing X-prop / timing-arrival infrastructure.

What exists today

Worth pinning down so the gap is precise:

  • Multi-clock-domain functional support. MultiClockScheduler in src/sim/cosim_metal.rs:1347 builds a tick-by-tick edge schedule over the LCM of all domains' periods (with GCD granularity). DFFs are tagged by clock domain via clock_pin2aigpins in src/aig.rs:209. Each scheduler tick asserts only the firing domains' posedge/negedge flag bits; the GPU kernel gates DFF write-back on those flags, so non-firing domains' DFFs hold.
  • LCM constraint. The scheduler asserts schedule_len <= 1_000_000 (cosim_metal.rs:1376). Commensurable periods (PLL-derived) work; truly non-commensurable external clocks (audio, USB-recovered, DFS-mid-flight) hit the cap.
  • Cosim stimulus. InputDispatcher (src/sim/input_stim.rs) consumes a chipflow-compatible wait/action/stop JSON command list. Peripheral models (src/sim/models/) drain queued actions per edge and emit events. Generation is interleaved with the GPU dispatch loop — every tick (or every few ticks) round-trips through the host.
  • VCD replay path. jacquard sim already runs from a precomputed input VCD with no host-side reactive logic. This is, in effect, the "Level 1" precomputed-edge mode described below; the gap is between cosim's reactive loop and sim's flat replay, not in the kernel itself.
  • CDC checking. None today. SDF setup/hold checks exist (src/timing_report.rs) but are not wired through any CDC-specific path.
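
To make the LCM constraint concrete, here is a sketch of how a schedule length could be derived from the domain periods and checked against the cap. It assumes the length is LCM(periods) / GCD(periods) and ignores half-period (negedge) granularity; the real MultiClockScheduler may compute it differently. Only the 1_000_000 cap comes from the text above.

```rust
const MAX_SCHEDULE_LEN: u64 = 1_000_000;

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// LCM with overflow detection.
fn lcm(a: u64, b: u64) -> Option<u64> {
    (a / gcd(a, b)).checked_mul(b)
}

/// Number of schedule slots for the given clock periods (in ps), or an
/// error when the periods are too incommensurable for a precomputed table.
fn schedule_len(periods_ps: &[u64]) -> Result<u64, String> {
    if periods_ps.is_empty() || periods_ps.contains(&0) {
        return Err("need at least one non-zero period".to_string());
    }
    let mut step = 0u64; // running GCD: the tick granularity
    let mut span = 1u64; // running LCM: the repeat interval
    for &p in periods_ps {
        step = gcd(step, p);
        span = lcm(span, p).ok_or_else(|| "LCM overflow".to_string())?;
    }
    let len = span / step;
    if len > MAX_SCHEDULE_LEN {
        return Err(format!("schedule length {} exceeds cap {}", len, MAX_SCHEDULE_LEN));
    }
    Ok(len)
}
```

Periods of 40 000 ps and 60 000 ps give a 6-slot schedule, while 40 000 ps against 40 001 ps blows past the cap, which is the non-commensurable case described above.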

Architecture: two orthogonal axes

The work falls cleanly into two independent dimensions.

Axis 1 — Spatial: synchronous islands with thin boundaries

A static analysis pass partitions the AIG into islands: maximal connected sets of gates whose transitive fanin/fanout stays inside one clock domain. Whatever's left is the boundary — combinational gates and DFFs whose data cones cross domains. In real designs the boundary is small, dominated by synchronizers (2FF), async FIFO control, and handshake glue.

Per-island execution lets the GPU:

  • Skip evaluation of an island whose state hasn't changed.
  • Batch K consecutive ticks of a fast island into one kernel launch when the slow island has no edges in the window.
  • Treat the boundary as a small mailbox (written by the source island, read by the destination island) rather than a global state vector.

This is essentially functional partitioning for parallel discrete-event simulation, but the GPU dataflow model gets more benefit than a CPU sim because batched dataflow is exactly what a fast island's run-ahead window wants.

Axis 2 — Temporal: stimulus generation decoupled from consumption

The cosim host loop is the throughput floor today. Decoupling has three levels:

  • Replay — the testbench has already produced a complete input VCD; the GPU just plays it back. Today's jacquard sim is this case.
  • Streaming buffer — testbench runs in a separate thread feeding a ring buffer of (tick, input_op) tuples. GPU consumes batches. As long as the producer keeps up on average, the GPU never stalls. Works because most ticks have no input change and peripheral state machines run far slower than the kernel.
  • Record-and-replay with divergence detection — pass 1 runs full cosim and records every input transition; pass 2 replays at line-rate while checksumming outputs against the recorded run. If outputs diverge, abort and fall back. Wins decisively for regression CI where most runs confirm "nothing externally observable changed".

Phase breakdown

Each phase is independently shippable. The phase numbering here is local to this doc and should not be confused with the timing-IR phase numbering in post-phase-0-roadmap.md.

| Phase | Topic | Trigger |
| --- | --- | --- |
| MC.1 | Static island partitioner (analysis only, emits metadata) | Standalone-useful for CDC reporting; could land in a future release without further work |
| MC.2 | Min-heap multi-clock scheduler (replaces LCM precompute) | First non-commensurable external clock or DFS use case lands |
| MC.3 | Streaming stimulus buffer (decouples testbench thread from kernel) | First workload where cosim CPU↔GPU round-trip is measured as the bottleneck |
| MC.4 | Per-island kernel dispatch + multi-rate batching | MC.1+MC.2 in place; first multi-domain workload large enough that whole-AIG eval per tick is wasteful |
| MC.5 | Record-and-replay with divergence detection | Regression CI throughput becomes a release blocker |
| MC.6+ | Speculation staircase, AOT trace compilation, profile-guided kernel specialization | Demand-driven; deferred until measurement shows residual sync overhead after MC.4 |

MC.1 — Static island partitioner

Walk the AIG; for each gate compute the set of clock domains its transitive fanin/fanout touches. Tag gates as island-internal (fanin and fanout both inside one domain) or boundary (touches more than one domain on either side). Emit per-island gate counts and a list of boundary gates as metadata on the existing FlattenedScript.

What it enables on its own, even with no runtime change:

  • Diagnostic: "this design has 14 inter-domain combinational paths from audio_clk → core_clk and 2 the other way". Useful for designers reviewing CDC structure.
  • Data structure that MC.2 / MC.4 / CDC reporting all need.
  • Sanity-check on the "<1 %" boundary-surface assumption for the workloads that motivate further phases.

Classification policy for derived signals (e.g. a sync-FIFO read pointer in clock_b qualified by an output of a sync chain from clock_a): classify aggressively. Only gates whose direct fanin includes pins from multiple domains are boundary; downstream gates fed by a domain-tagged pre-synchronizer output inherit that domain. This pushes the boundary in as close to the structural CDC crossing as possible and is what makes the "<1 %" claim hold on real designs — a lazy classification that propagated "multi-domain" forward through every downstream cone would yield a boundary surface that swallowed half the design.
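
The aggressive policy can be sketched on a toy netlist. This is a minimal illustration, not the planned aig.rs pass: the `Node` and `Class` types are hypothetical, and it assumes a DFF output is always tagged with the DFF's own clock domain, while a combinational gate carries the union of its direct fanin tags.

```rust
use std::collections::BTreeSet;

/// Toy netlist node (hypothetical; the real pass would walk the AIG).
#[derive(Clone, Copy)]
enum Node {
    Input { domain: u32 },
    Dff { clock: u32, d: usize },
    /// Fanins that are And nodes must have a lower index than this node.
    And { a: usize, b: usize },
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Class {
    Island(u32),
    Boundary,
}

fn classify(nodes: &[Node]) -> Vec<Class> {
    // Pass 1: domain tags carried by each node's output pin.
    // Inputs and DFF outputs are tagged by their own domain (assumption).
    let mut tags: Vec<BTreeSet<u32>> = vec![BTreeSet::new(); nodes.len()];
    for (i, n) in nodes.iter().enumerate() {
        match *n {
            Node::Input { domain } => { tags[i].insert(domain); }
            Node::Dff { clock, .. } => { tags[i].insert(clock); }
            Node::And { .. } => {}
        }
    }
    // Combinational gates take the union of their direct fanin tags.
    for i in 0..nodes.len() {
        if let Node::And { a, b } = nodes[i] {
            let merged: BTreeSet<u32> = tags[a].union(&tags[b]).copied().collect();
            tags[i] = merged;
        }
    }
    // Pass 2: a node is boundary only if its direct fanin mixes domains.
    nodes
        .iter()
        .enumerate()
        .map(|(i, n)| match *n {
            Node::Input { domain } => Class::Island(domain),
            Node::Dff { clock, d } => {
                if tags[d].len() == 1 && tags[d].contains(&clock) {
                    Class::Island(clock)
                } else {
                    Class::Boundary
                }
            }
            Node::And { .. } => match tags[i].iter().next() {
                Some(&only) if tags[i].len() == 1 => Class::Island(only),
                _ => Class::Boundary,
            },
        })
        .collect()
}
```

In a 2FF synchronizer from domain 0 into domain 1, only the first flop is boundary under this sketch; the second flop and every gate fed by its output fall inside island 1. A lazy policy that propagated tags through DFFs would mark that whole cone as boundary.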

Code locations: extends aig.rs (domain analysis on DriverType) and flatten.rs (metadata on FlattenedScriptV1). No kernel changes.

MC.2 — Min-heap multi-clock scheduler

Replace MultiClockScheduler's precomputed Vec<TickEdges> with a min-heap of (next-edge-time, domain) pairs. Pop the next edge, dispatch, push the domain's next edge back. No LCM constraint; non-commensurable periods are free. DFS support falls out: when the DUT writes a clock-control register, the host updates the heap entry's period.

DFS hook design: explicit, not generic signal-watching. The cosim config declares (control_signal, period_table) pairs; the host polls the named bit each tick (cheap — one bit) and updates the heap. Generic "call-back-on-arbitrary-signal" is rejected as too coupled.

Code locations: MultiClockScheduler::new and build_edge_ops in cosim_metal.rs. Same per-domain flag emission, different scheduling backend.
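
A minimal sketch of the min-heap idea, using `std::collections::BinaryHeap` with `Reverse` to get min-ordering. The `HeapScheduler` type and its methods are hypothetical names for illustration; the sketch tracks one edge per period and pops coincident edges one at a time, whereas the real scheduler emits per-domain posedge/negedge flags per tick.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical scheduler: one heap entry per domain, keyed by next edge time.
struct HeapScheduler {
    periods_ps: Vec<u64>,
    heap: BinaryHeap<Reverse<(u64, usize)>>,
}

impl HeapScheduler {
    /// All domains start with an edge at t = 0 (assumption).
    fn new(periods_ps: Vec<u64>) -> Self {
        let heap: BinaryHeap<Reverse<(u64, usize)>> =
            (0..periods_ps.len()).map(|d| Reverse((0u64, d))).collect();
        Self { periods_ps, heap }
    }

    /// Pop the earliest edge and schedule that domain's next one.
    /// Ties are broken by the lower domain index.
    fn pop_edge(&mut self) -> Option<(u64, usize)> {
        let Reverse((t, d)) = self.heap.pop()?;
        self.heap.push(Reverse((t + self.periods_ps[d], d)));
        Some((t, d))
    }

    /// DFS hook: the new period applies from the next edge that gets scheduled.
    fn set_period(&mut self, domain: usize, period_ps: u64) {
        self.periods_ps[domain] = period_ps;
    }
}
```

Memory use is one entry per domain regardless of how the periods relate, which is why non-commensurable pairs such as 10,000 ps and 81,380 ps cost nothing extra here. Note that in this sketch a period change does not move an edge that is already queued; whether the real hook should reschedule it is a design choice left open.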

MC.3 — Streaming stimulus buffer

InputDispatcher becomes a trait; today's FileDispatcher is one implementation. New implementations:

  • ThreadedDispatcher — runs peripheral models on a separate thread; emits (tick, input_op) into a lock-free SPSC ring buffer; GPU loop consumes batches.
  • StreamDispatcher — same shape but the producer is a JSON-lines stream over a Unix socket / stdio (this is also the bridge to UVM/CocoTB peer testbenches).

Latency budget: the producer must be at least one tick ahead of the consumer. For transaction-level workloads this is easy (peripheral state machines run orders of magnitude slower than the GPU). For sub-cycle reactive loops it isn't, and those workloads stay on the synchronous path.

Code locations: refactor input_stim.rs around a trait; new module for ring-buffer plumbing; cosim main loop drains a batch per dispatch instead of one tick.
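
The producer/consumer split can be sketched with the standard library alone. This is an illustration under stated assumptions: `std::sync::mpsc::sync_channel` stands in for the lock-free SPSC ring buffer, and `InputOp`, `spawn_producer`, and `drain_batch` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stimulus record: (tick, input_op) flattened into one struct.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InputOp {
    tick: u64,
    pin: u32,
    value: bool,
}

/// Runs the "testbench" on its own thread, feeding a bounded queue.
/// The bound gives back-pressure when the producer runs far ahead.
fn spawn_producer(ops: Vec<InputOp>, capacity: usize) -> Receiver<InputOp> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for op in ops {
            if tx.send(op).is_err() {
                break; // consumer went away
            }
        }
    });
    rx
}

/// Collects every op with tick < end_tick. An op that belongs to a later
/// window is parked in `pending` and returned by a later call.
fn drain_batch(
    rx: &Receiver<InputOp>,
    pending: &mut Option<InputOp>,
    end_tick: u64,
) -> Vec<InputOp> {
    let mut batch = Vec::new();
    loop {
        let op = match pending.take() {
            Some(op) => op,
            None => match rx.recv() {
                Ok(op) => op,
                Err(_) => break, // producer finished
            },
        };
        if op.tick >= end_tick {
            *pending = Some(op);
            break;
        }
        batch.push(op);
    }
    batch
}
```

The blocking `recv` is where the latency budget shows up: the consumer only knows a window is complete once it has seen an op from a later window (or the producer has finished), so a producer that falls behind stalls the dispatch loop. A real implementation would likely add an explicit "no more input before tick T" watermark so sparse stimulus does not block the GPU.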

MC.4 — Per-island kernel dispatch + multi-rate batching

Build per-island execution scripts (and one boundary script) from the metadata MC.1 produces. Cosim main loop becomes:

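loop {
    // Time and owning domain of the next pending edge.
    let (now, domain) = scheduler.peek();
    // Distance to the next edge of any other domain: the run-ahead window.
    let lookahead = scheduler.next_other_domain_edge(domain) - now;
    // Always fire at least the pending edge, so the loop makes progress even
    // when the window is shorter than one period.
    let edges_in_window = (lookahead / domain.period).max(1);
    dispatch(island_script[domain], edges = edges_in_window);
    dispatch(boundary_script);  // only if boundary signals changed
    advance_clock(now + edges_in_window * domain.period);
}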

Boundary mailbox lives in shared state-buffer slots that the source island's script writes and the destination island's script reads. Repcut continues to partition each island's script across GPU blocks independently.

Tight-boundary gates (combinationally fed by both domains) force a sync point on every edge of either side; MC.1's metadata identifies these so the runtime knows when batching can extend.
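
The batching arithmetic can be checked in isolation. The helper below is a hypothetical sketch: `edges_in_window` and its parameters are illustrative names, and the clamp to at least one edge is an assumption added so that a window shorter than one period still makes progress.

```rust
/// How many consecutive edges of one domain can be batched into a single
/// dispatch before another domain's edge forces a sync point.
///
/// `has_tight_boundary` models a gate fed combinationally by both domains,
/// which forces a sync on every edge and disables batching.
fn edges_in_window(
    now_ps: u64,
    period_ps: u64,
    next_other_edge_ps: u64,
    has_tight_boundary: bool,
) -> u64 {
    if has_tight_boundary {
        return 1;
    }
    let lookahead = next_other_edge_ps.saturating_sub(now_ps);
    // Floor division is conservative; the clamp guarantees progress when
    // fewer than one full period remains before the other domain's edge.
    (lookahead / period_ps).max(1)
}
```

With a 10 ns fast clock and the slow domain's next edge 100 ns away, ten fast edges fit in one launch. If the slow edge is 95 ns away, this sketch batches nine edges and leaves the edge at 90 ns for the next iteration.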

MC.5 — Record-and-replay with divergence detection

Add --record-stimulus to cosim that emits a complete tick-by-tick input VCD and a per-tick output checksum. Add --replay-stimulus to sim (or a new mode) that consumes the VCD, runs at line-rate, and verifies the checksum each batch.

Divergence handling is two-tier, not just abort:

  • Mismatch in watched signals (the existing cosim signals_of_interest set, or a --watch CLI argument) → abort and require re-recording. This is the genuine "the design's externally observable behaviour changed" case — the recording is now stale and replay is unsafe.
  • Mismatch in unwatched signals → warn-and-continue against the recorded transitions. Internal microarchitectural changes that don't move the observable surface are normal during development; aborting on them defeats the purpose of accelerating regression CI, where most runs exist to confirm "nothing externally observable changed".

The watchset is the user-visible policy lever — it specifies what "externally observable" means for this design. Default to the cosim output signals (the natural CI invariant) plus any user-declared checkpoint signals.
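
The two-tier policy reduces to a small decision function. The sketch below assumes the recording keeps separate checksums for watched and unwatched signals per batch, and uses FNV-1a purely as a stand-in hash; `BatchChecksum`, `checksum`, and `judge` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Match,
    /// Only unwatched (internal) signals moved: keep replaying.
    WarnAndContinue,
    /// A watched signal moved: the recording is stale.
    AbortStaleRecording,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BatchChecksum {
    watched: u64,
    unwatched: u64,
}

/// 64-bit FNV-1a over one byte per signal value (stand-in hash).
fn fnv1a(values: &[bool]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &v in values {
        h ^= v as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

fn checksum(watched: &[bool], unwatched: &[bool]) -> BatchChecksum {
    BatchChecksum {
        watched: fnv1a(watched),
        unwatched: fnv1a(unwatched),
    }
}

/// Watched mismatches take priority over unwatched ones.
fn judge(recorded: BatchChecksum, replayed: BatchChecksum) -> Verdict {
    if recorded.watched != replayed.watched {
        Verdict::AbortStaleRecording
    } else if recorded.unwatched != replayed.unwatched {
        Verdict::WarnAndContinue
    } else {
        Verdict::Match
    }
}
```

Keeping two checksums per batch is what lets the replay distinguish the tiers without storing every signal value; a single combined checksum could only report that something changed.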

Useful primarily as a regression-CI accelerator. Doesn't help one-off runs.

Cross-test sharing. A single design accumulates many test cases. The natural extension of record-and-replay is to share the design-side specialized kernel across all tests in the suite and vary only the stimulus recording. For a suite of N tests against one design, recording costs N× pass-1 (one per test, on demand or in parallel) but replay costs N× line-rate-kernel-launches sharing the same compiled state-buffer layout. That's a multiplicative win on top of per-test record-and-replay and is the actual leverage point for full-suite CI throughput.

MC.6+ — Deferred sophistication

Documented now so the design space is on record:

  • Speculation staircase for hot boundaries: value prediction → protocol pattern recognition → control-slice reachable-set enumeration → full case enumeration. Each tier larger and cheaper-to-skip. Add a "case" dimension to the kernel dispatch only if measured sync overhead after MC.4 justifies it.
  • AOT trace compilation: when stimulus is fully known (replay mode), compile the schedule offline — fold constant inputs into AIG constants, merge no-op ticks, sort transitions by domain. Profile-guided specialization for designs with lots of "configured once at boot" inputs. Composes directly with MC.5: a recording is a complete stimulus trace, so the AOT compiler can fold every input value into the kernel unconditionally. The resulting binary is valid only until either the design or the recording changes, so the lifecycle model is "compile per (design SHA, recording SHA) pair, cache for the test session, invalidate on either source changing". Acceptable cost for a 100×-replay regression run; not for one-off interactive sim.
  • CDC verification mode: jitter injection on coincident edges and random X-injection on detected async-source paths. Reuses MC.1's boundary metadata and existing X-prop infrastructure. Distinct from static CDC checking (Spyglass, Real Intent), which is explicitly out of scope — that's a different product. The jitter-injection half is designed in ADR 0012 and partly built; remaining work is tracked in issue #92 / cdc-jitter-completion.md. X-injection stays deferred until MC.1 lands.

Out of scope (explicit non-goals)

These adjacent topics come up regularly, so it is worth stating clearly that they are excluded:

  • Pin-level VPI / GPI fidelity. Implementing enough VPI for unmodified cocotb / SystemVerilog testbenches. The surface area is enormous and Jacquard would be lying about delta cycles, NBA regions, #delay semantics, and X-propagation behaviour. Use transaction-level peer protocols (the natural extension of input.json over a socket) instead.
  • Metastability simulation. No RTL simulator does this; CDC verification is structural/formal (Spyglass, JasperGold-CDC, Real Intent) and a separate product.
  • Structural CDC checking (synchronizer recognition rules, gray-code analysis). Different product. MC.1's boundary metadata enables a light diagnostic but not a verification flow.
  • DUT-internal #delay. Requires an event-driven kernel; destroys the batched dataflow that gives Jacquard its speedup. Permanently unsupported.
  • Async resets / latches in DUT. Same reason. Permanently unsupported (already documented in CLAUDE.md).

Implementation triggers

When to revisit and pull which phase off the shelf:

| Trigger | Pulls |
| --- | --- |
| First user workload with non-commensurable external clocks (audio, USB, DFS) | MC.2 |
| First UVM/CocoTB integration request reaches engineering scoping | MC.3 |
| User-visible CDC reporting requested | MC.1 |
| Multi-domain workload measurably bottlenecked on whole-AIG-per-tick eval | MC.1 + MC.4 |
| Regression CI total time exceeds release tolerance | MC.5 |
| Post-MC.4 measurement shows boundary-sync overhead >10 % | MC.6 speculation tier 1 (value prediction) |

Why MC.1 and MC.2 may be worth doing standalone

The user observation in the originating discussion was that MC.1 and MC.2 are worth carrying in a future release on their own merits, ahead of any specific workload demand. Rationale:

  • MC.1 has standalone diagnostic value. A "boundary report" for any multi-clock design — count of cross-domain combinational paths, location of inter-domain DFF samples — is useful to any user reviewing CDC structure, independent of whether the runtime ever uses the partition.
  • MC.2 lifts a real correctness limit. The current LCM cap (the schedule_len assertion) trips on legitimate designs (any audio-clock SoC, anything with DFS). Replacing precompute with a min-heap is a small, contained change that removes a category of "your design doesn't fit" errors.
  • Both are foundational for the rest of the architecture. Doing them early means later phases pick up cleanly.

If MC.1 + MC.2 ship in isolation, they don't commit Jacquard to any of the later phases. Each later phase remains demand-driven.

References

  • Current multi-clock infrastructure: src/sim/cosim_metal.rs:855 and following (ClockDomainFlags, MultiClockScheduler).
  • Per-DFF clock-domain tagging: src/aig.rs:204 (clock_pin2aigpins).
  • Cosim stimulus protocol: src/sim/input_stim.rs, src/sim/models/mod.rs.
  • Existing precomputed-edge path (replay): jacquard sim and src/sim/vcd_io.rs.
  • Adjacent committed roadmap: docs/plans/post-phase-0-roadmap.md.
  • Synchronous-only constraint and rationale: CLAUDE.md "Key limitation".

Declarative cell metadata — Tier 1 + minimal Tier 2 + port mapping

Status: Implemented — historical record. Tier 1, minimal Tier 2, and the port-mapping schema have all landed. ADRs:

Scope (as shipped)

Originally scoped to one slice (Tier 1 + minimal Tier 2 — opaque kind = "ram" with no port resolution). Expanded mid-flight when the JTAG-DM workflow (PR #78) surfaced the need for explicit-port RAMs with real backing storage:

  • Tier 1: --cell-library + sverilogparse-backed pin tables (landed 2026-05-19 in PR #65/#68).
  • Tier 2 minimal: kind discriminator in TOML, opaque-RAM mode (landed alongside Tier 1).
  • Port-mapping schema (ADR 0011, v1.1): [cells.NAME.ram] sub-table for explicit-port RAMs with backing storage. Landed in this PR alongside SramInitConfig ELF preload (closes #80).

Deliverables

  1. --cell-library <PATH> CLI flag on jacquard sim, jacquard cosim. Repeatable. Each path is parsed via sverilogparse at startup; results merged into a runtime LeafPinProvider extension.
  2. <PATH>.cells.toml autoload + --cell-manifest <PATH> override. TOML schema as in ADR 0010 § Tier 2. Required field schema_version = "1.0". Per-cell kind discriminator, v1.0 vocabulary.
  3. New code path in aig.rs: after PdkVariant::classify falls through (no built-in match), consult manifest. For kind = "ram", allocate RAMBlock in opaque mode — outputs routed to X-source slots, no port resolution.
  4. Tests: TOML parsing unit tests; integration test exercising a synthetic kind = "ram" cell through AIG construction + sim (mini fixture, not the full tapeout design).
  5. Doc update — docs/adding-a-pdk.md: new section "Adding third-party IP via manifest", linked from existing per-PDK recipes.

Out of scope (deferred)

  • Port-mapping schema ([cells.NAME.ports]). Future ADR.
  • Other kind values beyond what the tapeout fixture exercises end-to-end (ram, plus filler if cheap parity demo). Adding other kinds is data-only and can land per-need.
  • Migration of built-in sky130.rs / gf180mcu.rs classifiers to manifest data. Stays in this codebase as the fallback.
  • build.rs pin-table scanner removal. Stays.

Phasing

| Phase | Output |
| --- | --- |
| P1 | --cell-library parsing + LeafPinProvider extension + tests. No AIG-construction changes yet; verify pin tables alone. |
| P2 | Manifest TOML parser + CellManifest struct + schema_version validation. Standalone unit tests. |
| P3 | aig.rs integration: manifest threaded through, new fallback path for kind = "ram" opaque mode. Add the compute_x_sources-style test exercising the new path. |
| P4 | Smoke test against a representative reduced fixture; confirm jacquard sim clears gf180mcu_ocd_ip_sram_*. The full downstream-tapeout netlist is the real-world target but not in-tree. |
| P5 | Doc update (adding-a-pdk.md); update gf180mcu-enablement.md § Follow-on cleanup to mark items 1/2/3 superseded by this work. |

Each phase is its own commit. No squashing until the spike feedback loop confirms shape.

Open questions to settle in code

  • Autoload path discovery: spec says foo.v → foo.cells.toml sibling. Does that handle the multi-file library case (a.v + b.v sharing one manifest)? Probably yes: autoload each sibling and merge into the single CellManifest. The explicit --cell-manifest flag still wins for users who want a single consolidated file.
  • Conflict policy: if a cell name appears both in a built-in classifier AND in a manifest, built-in wins (per ADR 0010 integration ordering). Warn on conflict to surface accidental collisions.
  • Empty-library noise: parsing a .v file containing only (* blackbox *) modules with no logic should succeed without warnings, since that's the expected shape for IP libraries.
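
The first two questions can be sketched in a few lines. This is an illustration only: `autoload_manifest_path`, `resolve_kind`, and `Source` are hypothetical names, the kind strings are plain `String`s rather than the real manifest types, and the cell names in any example are made up.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Sibling manifest for a library file: `dir/foo.v` -> `dir/foo.cells.toml`.
fn autoload_manifest_path(library: &Path) -> Option<PathBuf> {
    let stem = library.file_stem()?.to_str()?;
    Some(library.with_file_name(format!("{}.cells.toml", stem)))
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    BuiltIn,
    Manifest,
}

/// Built-in classification wins; a manifest entry for the same cell only
/// produces a warning. Returns the resolved kind (if any) and the warning.
fn resolve_kind<'a>(
    cell: &str,
    builtin: &'a HashMap<String, String>,
    manifest: &'a HashMap<String, String>,
) -> (Option<(&'a str, Source)>, Option<String>) {
    match (builtin.get(cell), manifest.get(cell)) {
        (Some(b), Some(m)) => (
            Some((b.as_str(), Source::BuiltIn)),
            Some(format!(
                "cell {cell} is built-in ({b}); manifest kind {m} ignored"
            )),
        ),
        (Some(b), None) => (Some((b.as_str(), Source::BuiltIn)), None),
        (None, Some(m)) => (Some((m.as_str(), Source::Manifest)), None),
        (None, None) => (None, None),
    }
}
```

For the multi-file case, calling `autoload_manifest_path` once per `--cell-library` path and merging the results gives the "autoload each sibling" behaviour described above; how duplicate entries across sibling manifests should merge is not settled by this sketch.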

Not promised

  • Memory contents simulation for kind = "ram" in v1.0. Documented in ADR 0010 § "kind = ram semantics in v1.0".
  • Stable opaque-RAM port routing beyond "outputs are X-source slots". The set of outputs is what sverilogparse reports; if a cell's port list changes, the routing follows.

Cosim Peripheral Models

Architecture: ADR 0013.

This plan tracks implementation work for the cosim peripheral model framework. ADR 0013 documents the architecture (two execution domains, observe-only vs bidirectional GPU patterns, ring buffers, plural config convention); this doc tracks the concrete workstreams.

Phase 1: Multi-UART (#90)

First peripheral using the plural-config + array-in-kernel conventions from ADR 0013.

Schema — src/testbench.rs

Add name: Option<String> to UartConfig. Add uarts: Vec<UartConfig> to TestbenchConfig. Add effective_uarts() mirroring effective_clocks():

pub fn effective_uarts(&self) -> Vec<UartConfig> {
    let mut out = self.uarts.clone();
    if let Some(ref u) = self.uart {
        out.insert(0, u.clone());
    }
    out
}

Existing "uart": {...} configs work unchanged. New form: "uarts": [{"name": "console", ...}, {"name": "debug", ...}]. Both may coexist; uart is prepended to uarts.

Metal kernel — csrc/kernel_v1.metal

MAX_UARTS = 4. Restructure the three UART types:

#define MAX_UARTS 4

struct UartPerChannelConfig {
    u32 tx_out_pos;
    u32 cycles_per_bit;
};

struct UartParams {
    u32 state_size;
    u32 n_uarts;          // replaces has_uart
    u32 _pad[2];
    UartPerChannelConfig channels[MAX_UARTS];
};

UartDecoderState and UartChannel structs unchanged — the device buffers hold [MAX_UARTS] elements. gpu_io_step buffer signature unchanged (same 6 slots); the UART decode block becomes a loop over n_uarts.

Rust runtime — src/sim/cosim_metal.rs

  • Repr structs (~line 130): update UartParams to match kernel. Add UartPerChannelConfig. Keep UartDecoderState and UartChannel unchanged.
  • Config resolution (~line 2229): iterate effective_uarts().
  • Buffer allocation (~line 2820): size buffers for MAX_UARTS elements. Init each UartDecoderState with last_tx=1.
  • RX driver creation (~line 2544): one UartRxDriver per entry, named uart_{name} (fallback uart_{index}).
  • CPU drain (~line 3990): iterate N channels with per-channel uart_read_head[i]. Label events with UART name.
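
A Rust mirror of the restructured kernel structs might look like the sketch below. It assumes the kernel's `u32` is a 4-byte unsigned integer and that the host fills unused channel slots with zeros; `build_uart_params` is a hypothetical helper, not an existing function in cosim_metal.rs.

```rust
use std::mem::{align_of, size_of};

const MAX_UARTS: usize = 4;

/// Host-side mirror of the kernel's per-channel config.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct UartPerChannelConfig {
    tx_out_pos: u32,
    cycles_per_bit: u32,
}

/// Host-side mirror of the kernel's UartParams. Field order and padding
/// must match the Metal struct exactly.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct UartParams {
    state_size: u32,
    n_uarts: u32, // replaces has_uart
    _pad: [u32; 2],
    channels: [UartPerChannelConfig; MAX_UARTS],
}

/// Packs the configured channels into the fixed-size array, rejecting
/// configs that exceed MAX_UARTS.
fn build_uart_params(
    state_size: u32,
    channels: &[UartPerChannelConfig],
) -> Result<UartParams, String> {
    if channels.len() > MAX_UARTS {
        return Err(format!(
            "{} UARTs configured, kernel supports at most {}",
            channels.len(),
            MAX_UARTS
        ));
    }
    let mut slots = [UartPerChannelConfig::default(); MAX_UARTS];
    slots[..channels.len()].copy_from_slice(channels);
    Ok(UartParams {
        state_size,
        n_uarts: channels.len() as u32,
        _pad: [0; 2],
        channels: slots,
    })
}

/// Expected layout: 16 bytes of header plus 4 channels of 8 bytes each.
fn layout_matches_kernel() -> bool {
    size_of::<UartParams>() == 48 && align_of::<UartParams>() == 4
}
```

A layout check like `layout_matches_kernel` is cheap to put in a unit test and catches host/kernel struct drift before it turns into a silent misread on the GPU.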

Verification

  • cargo build --release --features metal compiles.
  • cargo test --lib passes (add effective_uarts unit tests).
  • Existing MCU SoC cosim CI passes unchanged (single "uart" config).
  • Local smoke: temporarily edit tests/jtag_minimal/sim_config.json to use "uarts": [...] syntax, confirm identical results.

Not in scope

  • Dual-UART test fixture: separate follow-up with a small 2-TX design.
  • CUDA/HIP: cosim is Metal-only; no kernel changes needed.

Future phases

| Phase | Scope | Status |
| --- | --- | --- |
| 2 | Refactor gpu_io_step toward common params/ring-buffer layout | Future |
| 3 | Multi-Flash / external RAM (bidirectional pattern) | Deferred (no use case) |
| - | Multi-JTAG | Not needed (TAP daisy-chain suffices) |

Plan: Config-driven AHB/APB bus transaction tracing

Goal

Trace AHB5, AHB-Lite, and APB3 bus transactions in cosim, compactly, without baking signal names into source. Output as CSV (machine-readable transaction table) and annotated VCD (transactions as a signal group for waveform viewers). Decode site: GPU capture + CPU protocol FSM (the kernel stays dumb; protocol semantics live in testable Rust).

Order: APB3 first (validate against the Hazard3 JTAG-DM APB DMI in tests/jtag_minimal/), then AHB-Lite, then AHB5.

Why this shape

The existing "Wishbone bus trace" (build_wb_trace_params, cosim_metal.rs:1277; gpu_io_step, kernel_v1.metal:1182) proves the mechanism — a GPU observe-only peripheral that packs a compact per-tick entry into a ring buffer only when the bus is active/changed, drained by the CPU — but it is hardcoded to one VexRiscv-style SoC (literal names cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …). We generalize that mechanism into a config-driven, protocol-aware monitor. It is observe-only (we watch design outputs, never drive), so it fits the ADR-0013 GPU observe-only peripheral pattern, and gets the effective_*()-style plural config for free.

Two existing pieces are reused:

  • Multi-candidate name resolution in src/sim/trace_signals.rs — handles Yosys-flattened / scalar-expanded / structural hierarchical naming. Refactor the candidate generator into a shared helper so the bus tracer binds pins the same way --trace-signals does.
  • Extra-observables VCD path (emit_extra_observables, vcd_io.rs:635) — the model for emitting synthesized signals into the output VCD.

The hardcoded WbTrace is left intact for now (it has a passing test); migrating it onto the general mechanism is a clean follow-up, not a prerequisite.

Design

1. Config schema — src/testbench.rs

use std::collections::HashMap;

use serde::Deserialize;

#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum BusProtocol { Apb3, AhbLite, Ahb5 }

#[derive(Debug, Clone, Deserialize)]
pub struct BusTraceConfig {
    pub name: String,
    pub protocol: BusProtocol,
    /// Hierarchical prefix; standard protocol pin names are appended.
    pub prefix: String,
    #[serde(default = "default_addr_bits")] pub addr_bits: usize, // 32
    #[serde(default = "default_data_bits")] pub data_bits: usize, // 32
    /// Optional per-pin overrides: logical pin name -> explicit net name,
    /// for designs whose pins don't follow `{prefix}{PIN}`.
    #[serde(default)] pub signals: HashMap<String, String>,
}

Add to TestbenchConfig:

#[serde(default)] pub bus_traces: Vec<BusTraceConfig>,

New feature, so no singular legacy form. (effective_bus_traces() provided for symmetry with effective_uarts(), even though it just returns the Vec.)

2. Protocol pin maps + CPU decoder — new src/sim/models/bus_trace.rs

Logical-pin tables per protocol:

  • APB3: psel penable pwrite pready pslverr paddr[] pwdata[] prdata[]
  • AHB-Lite: htrans[1:0] haddr[] hwrite hsize[2:0] hburst[2:0] hready hresp hwdata[] hrdata[]
  • AHB5: AHB-Lite + optional hnonsec hexcl hexokay hmaster[] (resolved if present, ignored if absent)

Default net name {prefix}{pin} (lowercased), overridable via signals. Resolution via the shared multi-candidate resolver (item 4).
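
The default-name rule with overrides can be sketched as a small pure function. It assumes only the pin name is lowercased (the prefix is used verbatim) and that override keys are the lowercase logical pin names; `net_name` and `APB3_SCALAR_PINS` are hypothetical names, and the example hierarchy paths are made up.

```rust
use std::collections::HashMap;

/// Scalar APB3 pins from the table above (vector pins omitted for brevity).
const APB3_SCALAR_PINS: [&str; 5] = ["psel", "penable", "pwrite", "pready", "pslverr"];

/// Net name for one logical pin: an explicit override if present,
/// otherwise `{prefix}{pin}` with the pin lowercased.
fn net_name(prefix: &str, pin: &str, overrides: &HashMap<String, String>) -> String {
    let key = pin.to_lowercase();
    match overrides.get(&key) {
        Some(explicit) => explicit.clone(),
        None => format!("{}{}", prefix, key),
    }
}

/// Resolve every scalar APB3 pin for one configured bus.
fn apb3_scalar_nets(prefix: &str, overrides: &HashMap<String, String>) -> Vec<String> {
    APB3_SCALAR_PINS
        .iter()
        .map(|pin| net_name(prefix, pin, overrides))
        .collect()
}
```

The names produced here are only candidates; in the design above they would still go through the shared multi-candidate resolver to find the actual AIG pin.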

BusTraceDecoder (per bus) consumes raw captured beats and emits:

pub struct BusTransaction {
    pub tick: u64, pub bus: String, pub protocol: BusProtocol,
    pub dir: Dir,            // Read | Write
    pub addr: u64, pub data: u64,
    pub resp: BusResp,       // Ok | Error
    pub burst: Option<BurstInfo>, // beat index / length for AHB
}
  • APB3 FSM: GPU gates capture on psel & penable & pready (access-phase complete), so each captured beat is a complete transaction. dir = pwrite, data = pwrite ? pwdata : prdata, resp = pslverr.
  • AHB FSM: GPU gates capture on hready high (pipeline advance) and records htrans, haddr, hwrite, hsize, hburst, hwdata, hrdata, hresp. CPU keeps a 1-deep pending address-phase record and pairs address beat N with the data on beat N+1; tracks burst beat counter from hburst/htrans==SEQ.

Pure-Rust, unit-tested with synthetic beat sequences — no GPU required. This is the testability win of CPU-side decode.
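
The AHB address/data pairing can be sketched as a one-slot state machine. This is a simplified illustration, not the planned BusTraceDecoder: `AhbBeat`, `Txn`, and `AhbPairer` are hypothetical types, burst tracking is omitted, and every beat passed in is assumed to have been captured with hready high.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Dir {
    Read,
    Write,
}

/// One GPU-captured beat (hready was high on this tick).
#[derive(Debug, Clone, Copy)]
struct AhbBeat {
    tick: u64,
    htrans: u8, // 0 = IDLE, 1 = BUSY, 2 = NONSEQ, 3 = SEQ
    haddr: u32,
    hwrite: bool,
    hwdata: u32,
    hrdata: u32,
    hresp: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Txn {
    tick: u64,
    dir: Dir,
    addr: u32,
    data: u32,
    error: bool,
}

/// Holds the address phase of the transfer whose data phase is in flight.
#[derive(Default)]
struct AhbPairer {
    pending: Option<(u32, bool)>, // (haddr, hwrite)
}

impl AhbPairer {
    /// Feed one captured beat; returns the transfer it completes, if any.
    fn push(&mut self, beat: AhbBeat) -> Option<Txn> {
        // This beat carries the data phase of the previous address phase.
        let done = self.pending.take().map(|(addr, write)| Txn {
            tick: beat.tick,
            dir: if write { Dir::Write } else { Dir::Read },
            addr,
            data: if write { beat.hwdata } else { beat.hrdata },
            error: beat.hresp,
        });
        // NONSEQ and SEQ both have bit 1 set: a new address phase starts.
        if beat.htrans & 0b10 != 0 {
            self.pending = Some((beat.haddr, beat.hwrite));
        }
        done
    }
}

/// Test helper: a beat with no error response.
fn beat(tick: u64, htrans: u8, haddr: u32, hwrite: bool, hwdata: u32, hrdata: u32) -> AhbBeat {
    AhbBeat { tick, htrans, haddr, hwrite, hwdata, hrdata, hresp: false }
}
```

Because the pairer only sees beats and holds one pending record, it can be driven entirely from synthetic vectors, which is the testability argument made above.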

3. GPU capture — csrc/kernel_v1.metal + src/sim/cosim_metal.rs

Generalize the WbTrace structs into protocol-agnostic capture:

#define MAX_BUS_TRACES 4
#define BUS_TRACE_MAX_ADR_BITS 32
#define BUS_TRACE_MAX_DAT_BITS 32

struct BusTraceParams {           // one per configured bus
    u32 protocol;                 // 0=apb3 1=ahb-lite 2=ahb5
    u32 gate_a_pos, gate_b_pos, gate_c_pos;   // edge-gating bits (psel/penable/pready or hready/htrans)
    u32 dir_pos, resp_pos;
    u32 addr_pos[BUS_TRACE_MAX_ADR_BITS];
    u32 wdata_pos[BUS_TRACE_MAX_DAT_BITS];
    u32 rdata_pos[BUS_TRACE_MAX_DAT_BITS];
    u32 ctrl_pos[8];              // htrans, hsize, hburst, hnonsec, ...
    u32 addr_bits, data_bits;
};
struct BusTraceEntry { u32 tick, flags, ctrl; u32 addr, wdata, rdata; };
struct BusTraceChannel { u32 write_head, capacity, current_tick, n_buses; /* entries follow */ };

The kernel computes the per-protocol gate, and on a gating edge packs one BusTraceEntry (bus id in flags high bits). No FSM, no pairing on GPU.

gpu_io_step currently uses buffer slots 0–5 (UART + WbTrace). Add slots 6–7 for BusTraceParams[] + BusTraceChannel. Metal allows up to 31 buffer arguments per function, so extend the existing dispatch rather than adding a kernel.

Rust mirrors of the structs in cosim_metal.rs (next to WbTraceParams), build_bus_trace_params() resolving pins for each configured bus, buffer allocation sized MAX_BUS_TRACES, and a per-bus read head in the drain loop (near cosim_metal.rs:4057) feeding each BusTraceDecoder.

4. Shared signal resolver — refactor src/sim/trace_signals.rs

Extract the multi-candidate name → AIG-pin / state-position resolver (currently internal to trace-signal registration) into a reusable helper callable from build_bus_trace_params. Keeps one source of truth for the Yosys/scalar/structural naming conventions.

5. Output

  • CSV (--bus-trace-csv <PATH>): drain-time, one row per BusTransaction. Header: tick,bus,protocol,dir,addr,data,resp,burst. Trivial — lands in Phase 1.
  • Annotated VCD: synthesized per-bus VCD vars ({bus}_addr, {bus}_wdata/{bus}_rdata, {bus}_dir, {bus}_resp) that value-change at transaction-complete ticks. This needs a new "virtual signal" emission path in vcd_io.rs: unlike existing extra-observables (raw nets sampled per tick from the state buffer), these are sparse CPU-decoded events the VCD writer must interleave by tick. Bigger plumbing → Phase 3. Dovetails with the wire-bundle-scripting / Surfer direction in project memory.

6. CLI — src/bin/jacquard.rs

  • --bus-trace-csv <PATH> (Phase 1)
  • bus VCD annotation folded into the output/--output-vcd when bus_traces is configured, or a dedicated --bus-trace-vcd flag (Phase 3)

Status

Phase 1 is complete (APB3 end-to-end + CSV). Validated by tests/apb_trace/ — a dedicated synthesized APB3 design (the Hazard3 JTAG-DM post-PnR netlist drops the APB addr/data nets during flattening, so a names-preserved design was built instead). CI step: Run APB3 bus-trace cosim (ADR 0013). Phases 2–3 remain.

Phasing

  1. Phase 1 — APB3 end-to-end. ✅ Done. Config schema, pin maps, shared resolver, APB3 GPU capture, APB3 CPU decoder, CSV output. Validated on tests/apb_trace/ (synthesized APB3 design). APB3 FSM unit-tested.
  2. Phase 2 — AHB-Lite + AHB5. Pipeline pairing, burst tracking, AHB5 extra signals. Unit-test the AHB FSM. Needs an AHB design to integration-test against (open question — see below).
  3. Phase 3 — Annotated VCD. Virtual-signal emission path in vcd_io.rs.
  4. Follow-up — migrate WbTrace onto the general mechanism (express the VexRiscv ibus/dbus as configured buses), then delete the hardcoded path.

Verification

  • Unit: APB3 & AHB FSM decoders against synthetic beat vectors (pure Rust, no GPU).
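  • Integration (Phase 1): planned as a cosim of the Hazard3 JTAG-DM with --bus-trace-csv, asserting that the expected DMI register accesses (DMCONTROL/DMSTATUS) appear. As shipped, Phase 1 was validated on tests/apb_trace/ instead (see Status).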
  • Build: cargo build --release --features metal clean; existing cosim tests (single-UART, WbTrace) unaffected since bus_traces defaults empty.

Open questions

  • AHB integration test design. APB3 was planned against the existing Hazard3 JTAG-DM and ended up validated on the synthesized tests/apb_trace/ design (see Status). Phase 2 needs an AHB-Lite/AHB5 design: do we have one, or do we synthesize a small AHB peripheral (like tests/dual_uart/)?
  • Per-bus ring vs shared ring. One BusTraceChannel with a bus-id field (simpler allocation) vs one ring per bus (no cross-bus contention). Start shared; revisit if a hot multi-bus design overflows.
  • CUDA/HIP. Cosim is Metal-only today; no kernel changes needed elsewhere now, but the general design should port cleanly when CUDA cosim lands.

ADR impact

This generalizes the cosim peripheral architecture — update ADR-0013 (plural-peripheral configs) to record the config-driven bus-monitor pattern and the GPU-capture/CPU-decode split, once Phase 1 is real.

Plan: complete ADR 0012 CDC jitter injection

Tracks the deferred half of ADR 0012. Issue: #92.

Where it stands

Implemented: the run-parameters file + per-domain seeded PRNG (src/sim/run_params.rs), jitter_ps per ClockConfig, the uniform per-domain draw, and a jitter displacement applied to the timing-VCD event timestamp (cosim_metal.rs, inside the --output-vcd block only). So today jitter perturbs the waveform timeline but nothing else — it does not reach the setup/hold checker, model-driven clocks, or coincident-edge ordering.

The goal of this plan is to make jitter actually stress CDC paths, then extend it to model-driven clocks and tidy the loose ends, so ADR 0012's present-tense design fully matches the code.

Phase 1 — Jitter reaches the timing checker (the core value)

Right now jitter_displacement only adjusts the VCD base_timestamp (cosim_metal.rs:~3928-3948) and is computed inside the timing-VCD emission block, so it has no effect without --output-vcd and never influences violations.

  • Hoist the per-tick per-domain displacement draw out of the VCD block so it is available whenever jitter_active, independent of --output-vcd.
  • Apply each domain's displacement to the arrival offsets that setup/hold checking consumes (the arrival_state section), not just the VCD base timestamp — so a jittered edge can move a margin across the setup/hold boundary and surface in --timing-report.
  • True per-domain perturbation (ADR §4): keep a displacement per firing domain this tick rather than the current single global value (the loop overwrites jitter_displacement with the last domain's draw). Coincident edges from domains A and B then move independently, exercising both orderings over a seed sweep.
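
A minimal sketch of per-domain draws that stay independent on coincident edges. Everything here is an assumption for illustration: SplitMix64 stands in for whatever PRNG run_params.rs uses, the seed derivation is made up (it is not `RunParams::domain_seed`), and the draw is assumed uniform over [-jitter_ps, +jitter_ps].

```rust
/// Small seeded PRNG (SplitMix64), used here only as a stand-in.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Per-domain jitter stream with its own seed and budget.
struct DomainJitter {
    rng: SplitMix64,
    jitter_ps: u64,
}

impl DomainJitter {
    /// Hypothetical seed derivation: mix the master seed with the domain index.
    fn new(master_seed: u64, domain_index: u64, jitter_ps: u64) -> Self {
        let seed = master_seed ^ domain_index.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        Self { rng: SplitMix64(seed), jitter_ps }
    }

    /// Displacement in [-jitter_ps, +jitter_ps]. The modulo introduces a
    /// negligible bias for small budgets.
    fn draw_ps(&mut self) -> i64 {
        if self.jitter_ps == 0 {
            return 0;
        }
        let span = 2 * self.jitter_ps + 1;
        (self.rng.next_u64() % span) as i64 - self.jitter_ps as i64
    }
}

/// One displacement per firing domain this tick, instead of a single
/// global value overwritten by the last domain's draw.
fn draw_for_tick(domains: &mut [DomainJitter], firing: &[usize]) -> Vec<(usize, i64)> {
    firing.iter().map(|&d| (d, domains[d].draw_ps())).collect()
}
```

Returning a `(domain, displacement)` pair per firing domain is the point of the sketch: domains that fire on the same tick each keep their own value, and only firing domains advance their stream, so a fixed seed reproduces the same displacements.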

Verify: a small two-domain design with a deliberately marginal CDC path; assert that a seed sweep produces both "no violation" and "violation" outcomes, and that a fixed seed reproduces exactly.

Phase 2 — Model-driven clock jitter (ADR §3)

Model-driven clocks (JtagReplayModel, SPI SCK, …) bypass the scheduler and currently get no jitter.

  • Add --cdc-model-jitter-ps <N> (and/or per-model jitter_ps in config) → a budget + seeded stream via RunParams::domain_seed(model_name).
  • After a model fires its edge, displace the timing-model arrival for that transition (not the functional edge — the DFF still samples on the same tick), mirroring the Phase 1 arrival-offset path.

Verify: extend tests/jtag_minimal (model-driven TCK) with a model jitter budget; confirm reproducibility and that TCK→sys_clk CDC margins vary by seed.

Phase 3 — Hygiene / correctness guards

  • gcd_ps / 2 constraint (ADR §2): at startup, error (or clamp with a loud warning) if any jitter_ps > scheduler.gcd_ps / 2, since larger values would reorder edges across GCD ticks.
  • Always persist the seed (ADR §1): when neither --run-params nor --output-vcd is given, RunParams::generate() currently does not write the file. Persist to a default path unconditionally so every run is replayable.
  • master_seed in the VCD header (ADR §1/§5): emit the master seed as a VCD header comment in vcd_io.rs, so the seed is recoverable from an output artifact, not just the INFO log.
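
The gcd_ps / 2 guard is simple enough to sketch directly. This shows the "error" variant of the two options above; `check_jitter_budget` and the domain names in any example are hypothetical.

```rust
/// Startup check: reject any jitter budget larger than half the scheduler's
/// GCD tick, since a larger displacement could reorder edges across ticks.
fn check_jitter_budget(gcd_ps: u64, budgets: &[(&str, u64)]) -> Result<(), String> {
    let limit = gcd_ps / 2;
    for &(domain, jitter_ps) in budgets {
        if jitter_ps > limit {
            return Err(format!(
                "domain {domain}: jitter_ps {jitter_ps} exceeds gcd_ps / 2 = {limit}"
            ));
        }
    }
    Ok(())
}
```

The clamp-with-warning alternative would replace the early return with `jitter_ps.min(limit)` plus a log line; which of the two the runtime should do is left open in the plan.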

Phase 4 — CI CDC stress sweep (ADR Consequences)

Once jitter feeds violations (Phase 1), add a lightweight CI step: run the marginal-CDC design across a few sequential seeds, upload each run's run_params.json as an artifact, and fail if an unexpected violation appears. This gives every PR a cheap CDC regression check.

Out of scope (separate ADRs / plans)

  • X-injection on CDC paths (needs MC.1 island partitioner — ADR 0012 "Deferred").
  • Non-uniform jitter distributions (Gaussian period jitter, etc.): the seed+budget interface is distribution-agnostic, so these can be added later.
  • Frequency sweep / DFS.

Spike — OpenTimer on SKY130 and MCU SoC

Status: Executed (see Progress log). Q2 failed; ADR 0003 → Superseded (see Decision).

Time box: Half a day. Extend by up to one day if initial signs are positive but the work is hitting specific SKY130 quirks. Abort and fall back if progress is blocked within the first four hours.

Goal

Determine whether OpenTimer (MIT, C++17) can reliably parse and analyse Jacquard's real-flow inputs — SKY130 Liberty and OpenLane2 MCU SoC post-P&R output — well enough to serve as Jacquard's in-process reference STA (per ADR 0003).

The outcome resolves ADR 0003's Pending Spike status to either Accepted or Superseded.

Out of scope for this spike

  • C++ FFI / bindgen integration work. Pure spike on OpenTimer's standalone behaviour.
  • Timing-IR integration. Establishing that OpenTimer produces usable arrival/slack output is sufficient; converting it to IR belongs in phase 1.
  • Performance measurement beyond rough "does it complete in reasonable time."
  • GF130 coverage. SKY130 is the spike target; GF130 private-track confirmation is later.

Setup

Required artefacts (checked before starting):

  1. OpenTimer clone and local build (MIT licence, standard CMake).
  2. SKY130 Liberty file(s) matching the corner the MCU SoC flow uses. At minimum sky130_fd_sc_hd__tt_025C_1v80.lib.
  3. MCU SoC post-P&R output: synthesised .v, SDC, and, critically, .spef. Check that the current OpenLane2 invocation is configured to produce SPEF; if not, enable it. OpenTimer requires SPEF; it does not consume SDF.
  4. Jacquard's current timing-analysis binary output on the same design for comparison.
  5. OpenSTA installed locally, for three-way comparison.

Success criteria

The spike answers four questions. Each is a pass/fail observation, not a measurement.

Q1 — Does OpenTimer parse SKY130 Liberty without errors?

  • Pass: clean parse, no warnings that indicate misinterpreted cells.
  • Partial: parses but warns on specific cells — in particular sky130_fd_sc_hd__dlygate4sd3_* or anything with non-trivial conditional timing. Document which cells and whether their timing is discarded or mishandled.
  • Fail: parse errors, segfaults, or silently-wrong output on recognised cells.

Q2 — Does OpenTimer compute arrivals on the MCU SoC design?

Feed .lib + .v + .spef + .sdc. Run report_timing -worst 20 or equivalent. Observe:

  • Pass: produces a full timing report with reasonable-looking arrivals (non-zero, monotonic along paths).
  • Partial: produces a report but with suspect values (many zeros, missing cells, incomplete paths).
  • Fail: hangs, crashes, or refuses to analyse.

Q3 — Does OpenTimer's result agree with OpenSTA?

Run OpenSTA on the same inputs, compare top-20 critical endpoints' arrivals. Declare tolerance: ±5% on arrival time, ±10 ps absolute floor for very short paths.

  • Pass: all top-20 endpoints within tolerance.
  • Partial: most within tolerance, a small number of outliers traceable to specific delay-model differences (e.g., CCS vs NLDM).
  • Fail: systematic disagreement suggesting OpenTimer is computing something meaningfully different. Investigate; if the disagreement is on SKY130 cell interpretation (a PDK handling issue) this is essentially a fail for our purposes.
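
The tolerance rule can be written down as a small predicate. This sketch assumes arrivals are in picoseconds, that OpenSTA is treated as the reference value, and that the "absolute floor" means the allowed deviation never drops below 10 ps; `within_tolerance` and `outliers` are illustrative names, not existing Jacquard functions.

```rust
/// True when `candidate_ps` is within ±5% of `reference_ps`, with the
/// allowed deviation never smaller than 10 ps (for very short paths).
fn within_tolerance(reference_ps: f64, candidate_ps: f64) -> bool {
    let allowed = (0.05 * reference_ps.abs()).max(10.0);
    (candidate_ps - reference_ps).abs() <= allowed
}

/// Indices of endpoint pairs (reference, candidate) outside tolerance.
fn outliers(pairs: &[(f64, f64)]) -> Vec<usize> {
    pairs
        .iter()
        .enumerate()
        .filter(|(_, (r, c))| !within_tolerance(*r, *c))
        .map(|(i, _)| i)
        .collect()
}
```

Under these assumptions an empty `outliers` result over the top-20 endpoints corresponds to Pass, and a short list traceable to delay-model differences corresponds to Partial.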

Q4 — Does OpenTimer's result correlate with Jacquard's current timing analysis?

Compare worst-slack and top-K endpoint lists (not exact values — pessimism differences are expected and documented). Observe:

  • Pass: top-K lists overlap substantially; worst-slack is on a comparable path.
  • Informational: any systematic discrepancy tells us what the pessimism delta actually looks like in practice. This data informs R4 (critical-path refinement reporting) whether OpenTimer is adopted or not.
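
One way to make "overlap substantially" concrete is the fraction of shared endpoints between the two top-K lists. The function below is a sketch under the assumption that each list holds unique endpoint names ordered worst-first; what fraction counts as "substantial" is left open, as it is in the criteria above.

```rust
use std::collections::HashSet;

/// Fraction of the top-`k` endpoints in `a` that also appear in the
/// top-`k` of `b`. `k` is clamped to the shorter list.
fn top_k_overlap(a: &[&str], b: &[&str], k: usize) -> f64 {
    let k = k.min(a.len()).min(b.len());
    if k == 0 {
        return 0.0;
    }
    let in_a: HashSet<&str> = a[..k].iter().copied().collect();
    let shared = b[..k].iter().filter(|e| in_a.contains(*e)).count();
    shared as f64 / k as f64
}
```

Comparing set membership rather than exact slack values fits the note above that pessimism differences between the tools are expected.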

Decision matrix

| Q1 | Q2 | Q3 | Outcome |
|---|---|---|---|
| Pass | Pass | Pass | ADR 0003 → Accepted. Proceed to phase 1 integration. |
| Pass | Pass | Partial | ADR 0003 → Accepted with documented scope limits. Define where OpenTimer is authoritative vs deferred to OpenSTA. |
| Pass | Partial | | ADR 0003 → Accepted provisionally; spike extends to investigate Q2 anomalies. |
| Partial | | | ADR 0003 → Accepted with SKY130 cell workarounds documented, or → Superseded if the workarounds are too invasive. |
| Fail on any | | | ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation. Revisit libreda-sta or in-house walker as alternatives in a follow-up ADR. |
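
The matrix reads as a small decision function. The sketch below is one interpretation: blank cells are treated as "any non-fail answer, or not yet observed", and combinations the matrix does not list return `None`. The type and variant names are invented for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Answer {
    Pass,
    Partial,
    Fail,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Accepted,
    AcceptedWithScopeLimits,
    AcceptedProvisionally,
    WorkaroundsOrSuperseded,
    Superseded,
}

/// Map Q1..Q3 observations to an ADR 0003 outcome. `None` for q2/q3 means
/// the question was not run; a `None` result means the matrix has no row.
fn decide(q1: Answer, q2: Option<Answer>, q3: Option<Answer>) -> Option<Outcome> {
    use Answer::{Fail, Partial, Pass};
    // "Fail on any" takes precedence over every other row.
    if q1 == Fail || q2 == Some(Fail) || q3 == Some(Fail) {
        return Some(Outcome::Superseded);
    }
    match (q1, q2, q3) {
        (Partial, _, _) => Some(Outcome::WorkaroundsOrSuperseded),
        (Pass, Some(Partial), _) => Some(Outcome::AcceptedProvisionally),
        (Pass, Some(Pass), Some(Partial)) => Some(Outcome::AcceptedWithScopeLimits),
        (Pass, Some(Pass), Some(Pass)) => Some(Outcome::Accepted),
        _ => None,
    }
}
```

For example, `decide(Answer::Pass, Some(Answer::Fail), None)` yields `Some(Outcome::Superseded)`: a single Fail settles the outcome even when later questions are never run.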

Fallback

If the spike fails, Jacquard operates with:

  • OpenSTA subprocess validation in CI (ADR 0001) as the sole timing-reference mechanism.
  • No per-PR in-process timing cross-check; feedback timing degrades.
  • Phase 1 drops OpenTimer integration work and refocuses on tightening OpenSTA-driven CI.

Superseding ADR 0003 is clean — it is currently Pending Spike so no downstream work has accrued to it. Phases 0 and 2 are unaffected.

Progress log

Setup (2026-04-23 → 2026-04-30)

  • OpenTimer 2.1.0 and OpenSTA 3.1.0 cloned to Jacquard-depends/ and built locally. Build notes in that repo's README.md.
  • SKY130 Liberty already on disk via volare: ~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/libs.ref/sky130_fd_sc_hd/lib/sky130_fd_sc_hd__tt_025C_1v80.lib.
  • Spike artefacts kept in this worktree under spike-out/ (gitignored — reproducible from Jacquard-depends/).

Q1 — Liberty parse (2026-04-30) — Pass

| Tool | Cells loaded | Wall time | Warnings |
|---|---|---|---|
| OpenTimer 2.1.0 | 428 | 0.12 s | 1 |
| OpenSTA 3.1.0 | 428 | 0.18 s | 0 |

Cell counts agree exactly. OpenSTA parses cleanly. OpenTimer emits one warning:

W celllib.cpp:274] unexpected lut template variable normalized_voltage

The normalized_voltage axis appears in exactly one place in the Liberty: the library-level normalized_driver_waveform("driver_waveform_template") block, which is CCS-driver-waveform data. No per-cell timing arc references it — cell_rise/cell_fall/rise_constraint/fall_constraint all use the NLDM templates del_1_7_7, vio_3_3_1, constraint_3_0_1. So the warning has no impact on arrival/slack computation under NLDM, which is what OpenTimer does anyway.

Operational note: OpenTimer's read_celllib is lazy — the parse only runs when an action like update_timing (or report_*) forces taskflow execution. Issuing dump_celllib immediately after read_celllib reports "celllib not found" because the read hasn't fired yet. Always insert update_timing before any inspection command.

The documented read_celllib -min|-max <file> syntax silently no-ops; bare read_celllib <file> loads the lib as both min and max corners. Filed as a docs/build mismatch in our Jacquard-depends/README.md.

Q2 — Arrival computation on SKY130 (2026-05-01) — Fail

Used OpenSTA's bundled gcd_sky130hd example (a canonical SKY130-HD GCD with .v, .sdc, .spef, .lib) as a fast smoke test before tackling MCU-SoC SPEF generation. If OpenTimer can't handle this, the MCU-SoC effort is wasted.

OpenSTA baseline: clean run, period 5 ns, top arrival 4.82 ns, WNS 0.00, slack 0.09 met. 0.28 s wall, zero warnings.

OpenTimer: could not produce a single timing path. The result was no critical path found, wns = nan, tns = nan — even after working around the following issues, each of which had to be discovered and patched manually:

| # | Issue | Workaround tried | Status |
|---|---|---|---|
| 1 | read_celllib -min <file> or read_celllib -max <file> (the documented syntax) silently no-ops | bare read_celllib <file> loads as both corners | works |
| 2 | dump_* after read_* reports state-not-loaded because the read is lazy | insert update_timing before any inspection | works |
| 3 | Tap cells in post-P&R Verilog (sky130_fd_sc_hd__tapvpwrvgnd_*) trigger 1040 "cell not found in celllib" errors and abort the netlist load | strip tap cell instances from Verilog | works |
| 4 | OpenTimer's bundled SDC parser uses pre-TCL-8.5 syntax (trace variable VAR w CMD); fails on the system's TCL 8.6 with bad option "variable" and produces zero parsed commands, even on OpenTimer's own bundled examples | patch ot/sdc/sdcparsercore.tcl:144 to trace add variable sdc_version write __set_v | works (one-line fix; should be upstreamed) |
| 5 | OpenSTA-style SDC with set period 5 / expr $period * 0.2 / [all_inputs] parses as zero commands | hand-write a literal SDC with create_clock -name clk -period 5 [get_ports clk] | works for trivial constraints; non-trivial SDC remains uncovered |
| 6 | SPEF *PORTS section (standard SPEF, IEEE 1481, emitted by OpenROAD/OpenLane) is rejected with a parse error pointing at the first port line | strip *PORTS block from SPEF before reading | works |
| 7 | Verilog bus ports (input [31:0] req_msg;) are not bit-blasted by OpenTimer's Verilog parser, but post-P&R SPEF references the bus as bit-indexed nets (req_msg[0], req_msg[1], …). 48 bus-element nets fail to match between netlist and SPEF | none found | blocking |
| 8 | After all of the above, two interior pins (_251_:B, _218_:B) report "not found in rctree" and the timing graph remains disconnected enough that no path can be reported | not investigated further | blocking |
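
For reference, workaround 6 (stripping the *PORTS block) amounts to a line filter along these lines. This is a reconstruction, not the script used during the spike: `strip_ports_section` is a hypothetical name, and the list of keywords taken to end the *PORTS section is an assumption based on the usual SPEF section order.

```rust
/// Drop the `*PORTS` section from SPEF text: skip from the `*PORTS` line
/// up to (not including) the next section keyword. Port entries may
/// themselves start with `*<index>` when a name map is in use, so "next
/// line starting with `*`" is not a usable terminator.
fn strip_ports_section(spef: &str) -> String {
    // Sections assumed to be able to follow *PORTS (an assumption).
    const NEXT_SECTIONS: [&str; 5] =
        ["*PHYSICAL_PORTS", "*DEFINE", "*PDEFINE", "*D_NET", "*R_NET"];
    let mut out = String::with_capacity(spef.len());
    let mut in_ports = false;
    for line in spef.lines() {
        let head = line.trim_start();
        if head.starts_with("*PORTS") {
            in_ports = true;
            continue;
        }
        if in_ports && NEXT_SECTIONS.iter().any(|kw| head.starts_with(*kw)) {
            in_ports = false;
        }
        if !in_ports {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}
```

A filter like this only removes the section the parser rejects; it does nothing for the bus-port mismatch in issue 7, which needs the netlist and SPEF net names reconciled.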

Issues 7 and 8 mean that on a SKY130 design with bus ports — i.e. any design that talks to the rest of the world — OpenTimer cannot compute arrivals from a standard OpenROAD .v/.spef pair without inputs being pre-processed by code that doesn't exist.

The cumulative finding is not "OpenTimer mishandles a few SKY130 cells". It is that OpenTimer's input pipeline (Verilog parser, SPEF parser, bundled SDC parser) is incomplete relative to what real OpenROAD-flow outputs contain, and the gaps fall on hot paths (bus ports, tap cells, modern TCL, OpenROAD-emitted SPEF). The cells themselves parse fine (Q1); it's the surrounding ecosystem that doesn't.

Q3, Q4 — not run

Q3 (cross-check vs OpenSTA) and Q4 (correlation with Jacquard's timing-analysis) both depend on OpenTimer producing arrivals. With Q2 unable to produce a single path, they're moot for this spike.

Decision

ADR 0003 → Superseded. Per the spike's decision matrix ("Fail on any → ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation"), the right move is to retire the in-process-OpenTimer plan and lean on OpenSTA-subprocess validation (ADR 0001) as the sole timing reference. A follow-up ADR should consider libreda-sta or an in-house walker if an in-process reference is still wanted later.

OpenTimer's strengths (in-process C++17, taskflow-based, MIT, fast for the academic benchmarks it ships with) are real, but the input-pipeline gaps are large enough that adopting it would mean owning a non-trivial fork — the opposite of what a "lightweight in-process reference" is supposed to be.

The Liberty parser is genuinely capable (Q1 passed cleanly on the 12 MB SKY130 NLDM lib in 120 ms), so OpenTimer remains an option for future narrow tasks like Liberty introspection, but not as the STA engine.

Setup notes worth keeping

  • OpenSTA bundles gcd_sky130hd.{v,sdc,spef} and sky130hd_tt.lib.gz — a cleaner SKY130 smoke-test fixture than anything we'd have produced from chipflow in the time we had.
  • ~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/... is the live SKY130 PDK already on this machine (chipflow installs it). No need to fetch it separately.

Deliverable

A short report added to this document as a "Spike outcome" section, summarising:

  • Which Q1–Q4 answers were observed.
  • Specific SKY130 cells where OpenTimer misbehaves (if any).
  • Whether SPEF generation had to be added to the OpenLane2 flow, and what that change was.
  • Decision: confirm, scope-limit, or supersede ADR 0003.

Related

  • ADR 0003: OpenTimer as in-process reference STA.
  • ../timing-correctness.md: requirement R2.
  • ../plans/phase-0-ir-and-oracle.md: phase 0 (independent of this spike; runs in parallel).