Jacquard Documentation
Welcome to the documentation for Jacquard, a GPU-accelerated RTL logic simulator.
Use the sidebar to navigate between topics, or start with the Getting Started guide.
Documents
Project Scope & Planning
Start here if you're considering a feature contribution or want to understand Jacquard's overall direction.
- Project Scope & Guarantees: Top-level contract — what Jacquard is for, what it isn't, licensing and architecture constraints, stability tiers.
- Why Jacquard: Honest positioning vs. STA tools and event-driven simulators; what's unique, what isn't, and what output interface would let users extract the value.
- Timing Correctness: Scoped requirements for timing accuracy, validation, and the forthcoming timing IR.
- Timing Model Extensions: Pre-spike design notes for δ(T) dynamic delay, clock-tree skew, and wire delay at scale. Formalised in ADR 0007.
- Post-Phase-0 Roadmap: Sequencing of Phase 1+ work covering structured timing output (ADR 0008) and timing model fidelity (ADR 0007). (OpenTimer integration was originally Phase 1's centrepiece; ADR 0003 was Superseded by the spike — OpenSTA out of process is now the sole STA path per ADR 0001.)
- Architecture Decision Records: Design decisions and their rationale (numbered, per-decision). See the index for status and how the ADRs relate.
- Implementation Plans: Phased implementation plans with entry and exit criteria. See the index for status and reading order.
- Spikes: Time-boxed experiments and their outcomes.
Core Documentation
- Simulation Architecture: Detailed explanation of Jacquard's internal architecture
  - Pipeline stages (NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU)
  - Data structures and representations
  - VCD input/output format requirements
  - Assertion and display support infrastructure
  - Performance characteristics
  - Known issues and limitations
- Timing Simulation: CPU-based timing simulation with Liberty/SDF delays
- Timing Violations: GPU-side setup/hold violation detection
Troubleshooting Guides
- Troubleshooting VCD: Debugging VCD input issues
  - VCD hierarchy requirements
  - Signal naming and matching
  - Solutions for flat VCD generation
  - Diagnostic checklist
  - Working examples
Quick Reference
VCD Input Requirements (Critical!)
Jacquard expects VCD signals at absolute top-level (no module hierarchy):
// ✓ Correct testbench
initial begin
$dumpfile("output.vcd");
$dumpvars(1, clk, reset, din, dout); // Depth 1, explicit signals
end
// ✗ Incorrect testbench
initial begin
$dumpfile("output.vcd");
$dumpvars(0, testbench); // Dumps entire hierarchy
end
Debug Commands
# Enable debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>
# Verify with CPU simulation
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu
# Check VCD structure
grep '\$scope\|\$var' input.vcd | head -20
Key Statistics
When running Jacquard, look for these diagnostic outputs:
netlist has X pins, Y aig pins, Z and gates # AIG complexity
current: N endpoints, try M parts # Partition count
Built script for B blocks, reg/io state size S # Final script
WARN (GATESIM_VCDI_MISSING_PI) ... # VCD issues!
Investigation Methodology
This documentation was created through systematic investigation of Jacquard's behavior:
- Source Code Analysis: Examined src/aig.rs, src/flatten.rs, src/staging.rs
- Debug Tracing: Used RUST_LOG=debug to capture internal state
- Test Case Development: Created minimal reproducible examples
- Comparative Testing: Compared Jacquard vs iverilog outputs
- Third-Party Validation: Tested with real-world examples (sva-playground)
Known Issues Documented
- VCD Hierarchy Mismatch (CRITICAL):
  - Jacquard expects flat VCD hierarchy
  - Most testbenches generate hierarchical VCDs
  - See troubleshooting-vcd.md for solutions
- Complex FSM Simulation:
  - Some FSM designs don't simulate correctly
  - Under investigation (safe.v example in third_party tests)
  - May be related to synthesis optimization or reset handling
- Format String Preservation:
  - Yosys may not preserve format attributes
  - Display messages show placeholders
  - Extract format strings from pre-synthesis JSON as workaround
Contributing
When adding documentation:
- Be specific: Include actual commands, file paths, code snippets
- Show examples: Both working and non-working cases
- Link related docs: Cross-reference other documentation files
- Date updates: Update version and date at bottom of documents
- Test instructions: Verify all commands actually work
Future Documentation Needs
- Performance tuning guide (optimal NUM_BLOCKS, --level-split)
- Memory (SRAM) modeling and synthesis
- Custom cell library support beyond AIGPDK
- Multi-clock domain handling
- VCD scope option detailed behavior
- GPU kernel optimization internals
Related Resources
- Main README: ../README.md - Project overview and quick start
- CLAUDE.md: ../CLAUDE.md - Development guidelines and architecture overview
- Test Suite: ../tests/ - Examples and regression tests
- Third-Party Tests: ../tests/regression/third_party/ - Real-world examples with attribution
Last Updated: 2026-02-16 Maintained By: ChipFlow + Community Contributions
Getting Started with Jacquard
Caveats: Jacquard currently only supports non-interactive testbenches. This means the input to the circuit needs to be a static waveform (e.g., VCD). Registers and clock gates inside the circuit are allowed, but latches and other asynchronous sequential logic are currently unsupported.
Dataset: Some input data (namely, netlists after the AIG transformation in Steps 1-2 below, and reference VCDs) is available here.
Step 0. Download the AIG Process Kit
Go to the aigpdk directory, where you can download aigpdk.lib, aigpdk_nomem.lib, aigpdk.v, and memlib_yosys.txt. You will need them later in the flow.
Before continuing, make sure your design contains only synchronous logic.
If your design has clock gates implemented in your RTL code, you need to replace them manually with instantiations of the CKLNQD module in aigpdk.v.
Also, you should know where memory blocks (e.g., caches) are implemented in your design so you can check later that they are mapped correctly.
Step 1. Memory Synthesis with Yosys
This step makes use of the open-source Yosys synthesizer to recognize and map the memory blocks automatically.
Download and compile the latest version of Yosys. Then run the Yosys shell with the following synthesis script.
# replace this with paths to your RTL code, and add `-I`, `-D`, `-sv` etc when necessary
read_verilog xx.v yy.v top.v
# replace TOP_MODULE with your top module name
hierarchy -check -top TOP_MODULE
# simplify design before mapping
proc;;
opt_expr; opt_dff; opt_clean
memory -nomap
# map the rams
# point -lib path to your downloaded memlib_yosys.txt
memory_libmap -lib path/to/memlib_yosys.txt -logic-cost-rom 100 -logic-cost-ram 100
The memory_libmap command will output a list of RAMs it found and mapped.
- If you see $__RAMGEM_SYNC_ (naming inherited from GEM), it means the mapping is successful.
- If you see $__RAMGEM_ASYNC_, it means this RAM was found to have an asynchronous READ port. You need to confirm whether that is the case.
  - If it is a synchronous one but accidentally recognized as asynchronous, you might need to patch the RTL code to fix it. There might be multiple reasons it cannot be recognized as synchronous, for example, when the read and write clocks are different.
  - If it is indeed asynchronous, check its size. If it is small enough to be synthesized using registers and mux trees (which is very expensive for large RAM banks), you can remove the $__RAMGEM_ASYNC_ block in memlib_yosys.txt and re-run Yosys to force the use of registers.
- If you see "using FF mapping for memory", it means the memory is recognized, but due to it being nonstandard (e.g., special global reset or nontrivial initialization), Jacquard will fall back to registers and mux trees. If the size of the memory is small, this is usually not an issue. Otherwise, you are advised to try other implementations.
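The three outcomes above can be tallied from a saved Yosys log. The following is a minimal sketch, not part of Jacquard: the names `MemMapSummary` and `summarize_memory_libmap` are hypothetical, and it assumes each mapped memory produces one log line containing its marker string, which may not match the exact log layout of your Yosys version.

```rust
/// Counts of the three memory_libmap outcomes described above.
/// (Hypothetical helper type, not part of Jacquard or Yosys.)
#[derive(Debug, Default, PartialEq)]
pub struct MemMapSummary {
    pub sync_rams: usize,
    pub async_rams: usize,
    pub ff_fallbacks: usize,
}

/// Scans a Yosys log line by line for the three marker strings.
/// Assumption: each memory yields one log line carrying its marker.
pub fn summarize_memory_libmap(log: &str) -> MemMapSummary {
    let mut summary = MemMapSummary::default();
    for line in log.lines() {
        if line.contains("$__RAMGEM_SYNC_") {
            summary.sync_rams += 1;
        } else if line.contains("$__RAMGEM_ASYNC_") {
            summary.async_rams += 1;
        } else if line.contains("using FF mapping for memory") {
            summary.ff_fallbacks += 1;
        }
    }
    summary
}
```

A non-zero `async_rams` or `ff_fallbacks` count is a prompt to inspect those memories by hand, as described in the list above.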
After a successful mapping, use the following command to write out the mapped RTL as a single Verilog file.
write_verilog memory_mapped.v
Check the correctness of this step by simulating memory_mapped.v with your reference CPU simulator.
Step 2. Logic Synthesis
This step maps all combinational and sequential logic into a special set of standard cells we defined in aigpdk.lib.
The quality of synthesis is directly tied to Jacquard's final performance, so we suggest you use a commercial synthesis tool like DC. You can also use Yosys to complete this if you do not have access to a commercial synthesis tool.
Check the correctness of this step by simulating gatelevel.gv with your reference CPU simulator.
Use Synopsys DC
First, you need to compile aigpdk.lib to aigpdk.db using Library Compiler.
With that, you synthesize the memory_mapped.v obtained before under aigpdk.db.
Some key commands you may use on top of your existing DC flow:
# change path/to/aigpdk.db to a correct path. same for other commands.
set_app_var link_path path/to/aigpdk.db
set_app_var target_library path/to/aigpdk.db
read_file -format db $target_library
# elaborate TOP_MODULE
# current_design TOP_MODULE
# timing settings like create_clock ... are recommended. Jacquard benefits from timing-driven synthesis.
compile_ultra -no_seq_output_inversion -no_autoungroup
optimize_netlist -area
write -format verilog -hierarchy -out gatelevel.gv
Use Yosys: Example script
# if you exited Yosys after Step 1, you can read back in your memory_mapped.v yourself.
# read_verilog memory_mapped.v
# hierarchy -check -top TOP_MODULE
# synthesis
synth -flatten
delete t:$print
# change path/to/aigpdk_nomem.lib to a correct path. same for other commands.
dfflibmap -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
techmap
abc -liberty path/to/aigpdk_nomem.lib
opt_clean -purge
# write out
write_verilog gatelevel.gv
Step 3. Download and Compile Jacquard
Download and install the Rust toolchain. This is as simple as a one-liner in your terminal. We recommend https://rustup.rs.
Clone Jacquard along with its dependencies.
git clone https://github.com/ChipFlow/Jacquard.git
cd Jacquard
git submodule update --init --recursive
Jacquard supports two GPU backends: CUDA (NVIDIA GPUs on Linux) and Metal (Apple Silicon Macs).
All functionality is accessed through the jacquard CLI, which provides map, sim, and cosim subcommands:
# Mapping (no GPU features needed)
cargo run -r --bin jacquard -- map --help
# Simulation (Metal - macOS)
cargo run -r --features metal --bin jacquard -- sim --help
# Simulation (CUDA - Linux, requires CUDA toolkit)
cargo run -r --features cuda --bin jacquard -- sim --help
Simulate the Design
Jacquard automatically partitions the design at startup using mt-kahypar-sc hypergraph partitioning.
If partitioning fails due to deep circuits (which often shows as trying to partition a circuit with only 0 or 1 endpoints), try adding a --level-split option to force a stage split. For example --level-split 30 or --level-split 20,40.
Metal (macOS)
Use NUM_BLOCKS=1 for Metal.
cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1
CUDA (Linux)
Replace NUM_BLOCKS with twice the number of physical streaming multiprocessors (SMs) of your GPU.
cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS
VCD Scope Handling
Jacquard automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope. If auto-detection fails or you need to override it, use:
# Metal
cargo run -r --features metal --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd 1 --input-vcd-scope "testbench/dut"
# CUDA
cargo run -r --features cuda --bin jacquard -- sim path/to/gatelevel.gv path/to/input.vcd path/to/output.vcd NUM_BLOCKS --input-vcd-scope "testbench/dut"
Use slash separators (/) for hierarchical paths, not dots. See troubleshooting-vcd.md for details.
The simulated output port values are stored in output.vcd.
Caveat: Jacquard also reports the actual GPU simulation runtime. There may be a long delay before the GPU starts, because input.vcd must be read and parsed first. We recommend developing your own pipeline to feed the input waveform into Jacquard's GPU kernels.
Timing-Aware Simulation
Jacquard supports two ways to feed timing data into the simulator:
- --timing-ir <path.jtir>: pre-converted Jacquard timing IR. This is the canonical path and requires no external tools at run time. Generate the IR ahead of time with the standalone opensta-to-ir tool (see crates/opensta-to-ir/).
- --sdf <path.sdf> --liberty <path.lib>: raw SDF, converted to IR on the fly. This subprocesses OpenSTA, which must be installed on the user's machine.
OpenSTA dependency
When using --sdf, Jacquard locates OpenSTA in this order:
- JACQUARD_OPENSTA_BIN environment variable.
- <repo-root>/scripts/build-opensta.sh --print-binary (the canonical install path during development; the script builds the version vendored at vendor/opensta/).
- sta on PATH.
Jacquard requires OpenSTA 3.1.0 or newer, matching the commit pinned at vendor/opensta/. The pinned version is the only one with end-to-end test coverage; newer OpenSTA versions are accepted with a warning, older versions are a hard error.
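The version policy above (older is a hard error, the pinned version is tested, newer is accepted with a warning) can be illustrated with a small comparison. This is a sketch under assumptions, not Jacquard's actual check: `OpenStaStatus`, `parse_version`, and `classify_opensta` are hypothetical names, and the version string is assumed to be a plain `major.minor.patch` triple with an optional leading `v`.

```rust
use std::cmp::Ordering;

/// Outcome of comparing a detected OpenSTA version with the tested one.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub enum OpenStaStatus {
    Tested,        // exactly the pinned version
    NewerUntested, // accepted, but a warning is printed
    TooOld,        // hard error
}

/// The pinned/tested version stated in the text above.
pub const TESTED: (u32, u32, u32) = (3, 1, 0);

/// Parses "3.1.0" or "v3.1.0" into a (major, minor, patch) triple.
pub fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().trim_start_matches('v').split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    let patch: u32 = parts.next()?.parse().ok()?;
    Some((major, minor, patch))
}

/// Tuples compare lexicographically, which matches version ordering here.
pub fn classify_opensta(version: &str) -> Option<OpenStaStatus> {
    let v = parse_version(version)?;
    Some(match v.cmp(&TESTED) {
        Ordering::Less => OpenStaStatus::TooOld,
        Ordering::Equal => OpenStaStatus::Tested,
        Ordering::Greater => OpenStaStatus::NewerUntested,
    })
}
```

With these definitions, `classify_opensta("2.4.0")` yields `TooOld` and `classify_opensta("v3.2.0")` yields `NewerUntested`, mirroring the two version-related rows of the error table.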
The simplest way to get a known-good OpenSTA is to build the vendored copy from the Jacquard repo:
git submodule update --init --recursive
./scripts/build-opensta.sh
Then either set JACQUARD_OPENSTA_BIN to the path printed by ./scripts/build-opensta.sh --print-binary, or just let Jacquard find it automatically — the build script's output is searched by default.
Error messages
| Symptom | Meaning | Fix |
|---|---|---|
| --sdf requires OpenSTA: OpenSTA binary not found. | OpenSTA isn't installed or isn't on PATH. | Run ./scripts/build-opensta.sh, set JACQUARD_OPENSTA_BIN, or install OpenSTA system-wide. |
| OpenSTA at <path> is v2.4.0; Jacquard requires v3.1.0 or newer. | Installed OpenSTA is too old. | Rebuild from vendor/opensta/ (which is pinned at 3.1.0) or upgrade your system OpenSTA. |
| Detected OpenSTA v3.2.0, newer than the latest tested version v3.1.0. (warning) | OpenSTA version is newer than what Jacquard's test corpus has been validated against. Simulation proceeds. | Report any timing discrepancies as bugs; we'll bump the tested-version range when CI catches up. |
| --sdf requires --liberty <PATH>. | OpenSTA needs the Liberty library to link the design. | Pass --liberty <PATH> alongside --sdf. |
For licensing context (Jacquard is permissively-licensed, OpenSTA is GPL-3, and Jacquard's runtime subprocess invocation is permitted but bundling is not), see adr/0006-sdf-preprocessing-model.md.
Why Jacquard — positioning and output interface
Status: Honest assessment of where Jacquard fits in an EDA flow alongside dedicated STA tools (OpenTimer/OpenSTA) and event-driven simulators (Verilator, iverilog, CVC). Includes a survey of what timing information Jacquard exposes today and what would let users actually consume it.
This is not a marketing document. The goal is for a contributor or user to read it and decide accurately whether Jacquard helps them — and, if it does, how to extract the answer they need.
TL;DR
Jacquard's unique value is vector-driven timing analysis at GPU scale: answering "did this stimulus violate setup/hold at any DFF, on which cycle, on which signal?" for designs large enough that SDF-annotated event-driven sim is too slow to finish in useful time.
Everything else Jacquard offers is offered, often better, by the standard flow:
- For functional sim: Verilator is faster on small designs.
- For timing: OpenSTA gives more accurate answers than Jacquard, vector-independent.
- For glitch / metastability: event-driven sim with SDF (CVC, iverilog) sees behaviours Jacquard's lockstep kernel structurally cannot.
Jacquard becomes the right tool when (design size × vector length) exceeds what event-driven SDF-annotated sim can handle, and you specifically want vector-driven timing answers.
STA is not optional even with Jacquard. Jacquard does not replace OpenSTA; it complements it. The right framing is "STA proves no bad vectors exist; Jacquard proves your real workload runs cleanly within those bounds." OpenSTA is also a hard runtime dependency for any timing-aware Jacquard flow — the timing IR is produced by opensta-to-ir, which subprocesses OpenSTA. See ADR 0001.
What's actually unique
The intersection where Jacquard wins is narrow but real:
- Activity-driven setup/hold sweep at scale. Run a long workload (boot trace, architectural validation, NoC congestion stimulus) on a large design at GPU speed; get a per-cycle violation report. STA can't tell you "this real workload trips violation X at cycle 12,847"; CVC can but won't finish in time on big designs.
- Arrival-time distributions for power/activity analysis. Per-signal arrival histograms across millions of cycles → useful for worst-case-power analysis informed by actual switching activity. STA gives you nothing here; CVC could but slowly.
- Failure forensics. When a functional test fails, answering "was this a timing issue?" without rerunning under a different simulator. Jacquard's timing-VCD output ties violations to cycle/signal/path, which is useful when you already have it from the same run.
- Fast iteration during timing closure. Change a constraint, resynthesise, re-run a long test: Jacquard's loop time is short enough to make this practical on big designs in a way iverilog+SDF isn't.
What dedicated STA (OpenSTA) gives you that Jacquard doesn't
This list is long and you should know it:
- Worst-case path enumeration. STA tells you the top-N critical paths over all possible inputs. Jacquard sees only what your stimulus exercises. If your testbench misses a critical path, Jacquard's "no violations" report is silent on it; OpenSTA would flag it.
- True min-delay analysis. OpenSTA does proper min-delay path search. Jacquard's hold check is per-DFF against actual stimulus only.
- Per-pair CRPR. OpenSTA applies common-path-pessimism removal as a launch/capture credit on each path. Jacquard consumes per-DFF clock arrival from opensta-to-ir and folds it into setup/hold (see timing-model-extensions.md, Part B Stages 1+2, which have landed), but treats the launch reference as 0, i.e. the per-pair CRPR credit is intentionally not modelled at this stage. Stage 3 in the same doc is the lever if Stage 1+2 pessimism turns out to matter on a real design.
- SDC-aware constraint handling. False paths, multi-cycle paths, generated clocks, async groups: OpenSTA reads SDC and respects it. Jacquard doesn't read SDC at the timing layer.
- Coverage by construction. STA covers every path by definition. Dynamic sim covers only what's exercised.
- Vector-independent confidence. "This design meets timing" is something STA can claim; Jacquard can only claim "this design met timing on these vectors."
What event-driven SDF sim (CVC/iverilog) gives you that Jacquard doesn't
The honest comparison isn't "Jacquard vs. Verilator + OpenTimer." It's "Jacquard vs. iverilog/CVC-with-SDF + OpenTimer." On the timing-sim side specifically:
- Glitch propagation. CVC/iverilog with inertial or transport delay see intra-cycle pulses. Jacquard's lockstep cycle-accurate kernel does not.
- Per-pin wire delay fidelity. CVC consumes SDF interconnect records per-receiver, per-edge, with rise/fall distinction. Jacquard collapses to per-cell-max (see timing-model-extensions.md, Part C).
- Per-DFF setup/hold without per-word collapse pessimism. Jacquard collapses all DFFs in a 32-bit state word to min(setup), min(hold); CVC checks each flop individually.
- Async event handling. Real $setup/$hold checks across asynchronous control. Jacquard explicitly assumes synchronous designs.
So today, accuracy-per-vector goes to CVC; throughput goes to Jacquard.
When to choose what
| Your situation | Best tool |
|---|---|
| Small design, just want functional results | Verilator (free, fast, mature) |
| Small design, need timing certainty | OpenSTA + Verilator (or +CVC for vector-driven) |
| Large design, functional only | Verilator if it scales, else Jacquard |
| Large design, vector-driven timing needed | Jacquard + OpenSTA for STA backstop |
| Glitch / metastability investigation | CVC or iverilog with SDF — Jacquard cannot model these structurally |
| Asynchronous design / latches | Not Jacquard (synchronous-only) — use CVC/iverilog |
| Sign-off STA | OpenSTA / commercial — Jacquard is not a sign-off tool |
The trajectory
Jacquard's timing fidelity gap with CVC is closeable. The work in timing-model-extensions.md — δ(T), clock-tree skew, per-receiver wire delay — closes much of it while preserving GPU throughput. The further along that path the project goes, the more "Jacquard" looks like "GPU-accelerated SDF-annotated event-driven sim, with the inherent limits the cycle-accurate kernel imposes (no glitches, lockstep cycles)" — i.e. CVC's report quality at Verilator's speed, on designs where neither alone suffices.
Output interface — what Jacquard exposes today
Jacquard's unique value depends on getting the timing information out of a run in a form users can act on. Phase 1 of the post-Phase-0 roadmap (ADR 0008) closed the gap between "data Jacquard has" and "answers users want" for setup/hold violations.
Symbolic stderr violation messages
The kernel writes setup/hold violation events to a per-block event buffer (csrc/kernel_v1.metal:554-576). The host drains the buffer each cycle (src/event_buffer.rs), resolves the state-word index to a hierarchical DFF site name via WordSymbolMap, and emits:
[cycle 12847] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=412]: arrival=2150ps setup=80ps slack=-30ps
[cycle 12847] HOLD VIOLATION at top/cpu/state[bit 3] [word=412]: arrival=12ps hold=20ps slack=-8ps
The bare [word=N] suffix is preserved for grep/tooling compatibility; up to four DFFs per word are named, with +N more truncation beyond that.
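For quick scripting against stderr, the sample lines above can be picked apart with plain string operations. The sketch below is an illustration only: `Violation` and `parse_violation` are hypothetical names, the layout is inferred solely from the two sample lines, and the structured JSON report remains the supported interface for tools.

```rust
/// Fields pulled from one stderr violation line.
/// (Hypothetical struct; not Jacquard's report schema.)
#[derive(Debug, PartialEq)]
pub struct Violation {
    pub cycle: u64,
    pub kind: String, // "SETUP" or "HOLD"
    pub word: u32,
    pub slack_ps: i64,
}

/// Parses a line shaped like the samples above. Site names may contain
/// brackets, so the word index is located from the right via "[word=".
pub fn parse_violation(line: &str) -> Option<Violation> {
    let rest = line.trim().strip_prefix("[cycle ")?;
    let (cycle, rest) = rest.split_once("] ")?;
    let (kind, _) = rest.split_once(" VIOLATION at ")?;
    let word_start = rest.rfind("[word=")? + "[word=".len();
    let word_len = rest[word_start..].find(']')?;
    let word = &rest[word_start..word_start + word_len];
    let slack = rest.rsplit_once("slack=")?.1.trim().strip_suffix("ps")?;
    Some(Violation {
        cycle: cycle.parse().ok()?,
        kind: kind.to_string(),
        word: word.parse().ok()?,
        slack_ps: slack.parse().ok()?,
    })
}
```

Lines that do not match the expected shape return `None`, so the function can be applied to a whole stderr capture and filtered.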
Structured timing report (--timing-report <path.json>)
Schema-versioned JSON document written at end of run. Contents:
- Per-cycle violation list (cycle, kind, word, site, arrival, constraint, slack).
- Per-word aggregate: violation counts and worst slack (sorted by total violations).
- Top-N worst-slack ranking per kind (setup, hold).
- Run metadata: design, vector source, timing source, clock period, cycles run, Jacquard version.
- Aggregate stats: setup/hold totals, dropped events.
Machine-readable, CI-friendly. Sample at tests/timing_ir/sample_reports/two_violations.json; full schema in src/timing_report.rs (SCHEMA_VERSION = "1.0.0"). Stability contract per ADR 0008: additive-only extensions, breaking changes bump the major.
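The per-word aggregate and the top-N worst-slack ranking listed above amount to a group-by and a sort. The sketch below shows one way to compute them from a list of events; the `Event` and `WordAggregate` types and their field names are assumptions for illustration and are not the schema defined in src/timing_report.rs.

```rust
use std::collections::BTreeMap;

/// One violation event (hypothetical shape, not the JSON schema).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub cycle: u64,
    pub word: u32,
    pub slack_ps: i64,
}

/// Violation count and worst (most negative) slack for one state word.
#[derive(Debug, PartialEq)]
pub struct WordAggregate {
    pub word: u32,
    pub violations: usize,
    pub worst_slack_ps: i64,
}

/// Groups events by word, then sorts by total violations, highest first.
pub fn aggregate_by_word(events: &[Event]) -> Vec<WordAggregate> {
    let mut by_word: BTreeMap<u32, (usize, i64)> = BTreeMap::new();
    for e in events {
        let entry = by_word.entry(e.word).or_insert((0, e.slack_ps));
        entry.0 += 1;
        entry.1 = entry.1.min(e.slack_ps);
    }
    let mut out: Vec<WordAggregate> = by_word
        .into_iter()
        .map(|(word, (violations, worst_slack_ps))| WordAggregate {
            word,
            violations,
            worst_slack_ps,
        })
        .collect();
    // Stable sort: words with equal counts stay in ascending word order.
    out.sort_by(|a, b| b.violations.cmp(&a.violations));
    out
}

/// Returns the `n` events with the most negative slack.
pub fn top_n_worst(events: &[Event], n: usize) -> Vec<Event> {
    let mut sorted = events.to_vec();
    sorted.sort_by_key(|e| e.slack_ps);
    sorted.truncate(n);
    sorted
}
```

In a real report the ranking would be computed separately per kind (setup, hold); the sketch omits the kind field to stay short.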
Text summary (--timing-summary)
One-screen human summary on stdout. Same data as the JSON report, different channel; either or both flags can be set:
=== Jacquard Timing Summary ===
Design: my_cpu.gv
Vectors: boot.vcd (1000 cycles)
Clock period: 1000 ps
Timing source: my_cpu.jtir
Violations:
Setup: 5
Hold: 2
Total: 7
Worst slack:
Setup: -150ps at top/cpu/regs[7][bit 22] [word=5] (cycle 87)
Hold: -40ps at top/cpu/state[bit 3] [word=12] (cycle 91)
Top 2 by violation count (of 2 total words with violations):
top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps
Format is for human inspection — explicitly not a stable parseable contract. Tools should use --timing-report JSON.
Timed VCD (--timed)
Annotates the output VCD with per-signal arrival times. Largest, most detailed output; suitable for waveform-level inspection.
- What you get: per-signal arrival ps at each writeout cycle.
- Caveat: the VCD doesn't carry slack relative to the clock edge — you compute it yourself.
- Cost: doubles VCD size. Not appropriate for long workloads on large designs.
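Since the timed VCD carries arrival times only, slack has to be derived by the reader. The following sketch uses the conventional textbook definitions, which are an assumption here rather than a statement of Jacquard's kernel arithmetic: setup slack is measured against the next clock edge, hold slack against the constraint after the current edge, and clock skew is ignored. The hold formula does agree with the sample stderr line shown earlier (arrival=12ps, hold=20ps, slack=-8ps).

```rust
/// Setup slack in ps, assuming data must be stable `setup_ps` before
/// the capturing edge one full clock period later (skew ignored).
pub fn setup_slack_ps(clock_period_ps: i64, arrival_ps: i64, setup_ps: i64) -> i64 {
    clock_period_ps - setup_ps - arrival_ps
}

/// Hold slack in ps, assuming data must not change until `hold_ps`
/// after the clock edge (skew ignored).
pub fn hold_slack_ps(arrival_ps: i64, hold_ps: i64) -> i64 {
    arrival_ps - hold_ps
}
```

A negative result from either function indicates a violation under these assumptions.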
SimStats aggregate counts (in-process)
SimStats { setup_violations, hold_violations, ... } is available to in-process consumers (src/event_buffer.rs). Only counts; full detail flows through the structured report path.
Still on the wishlist
Items captured in ADR 0008's "Optional / later outputs" plus a few caveats on what shipped. Demand-driven; not scheduled.
Closest-to-violation tracking when no violation occurred
The shipped worst_slack ranking is populated only from observed violation events. Surfacing "where am I close to the edge" on a run that passed timing requires GPU-side near-miss instrumentation (emit slack events whenever |slack| falls below a configurable threshold). Useful for proactive signoff regression. Separate workstream — needs a kernel change.
Arrival histogram (--arrival-histogram <pattern>)
Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis and "is my actual timing margin healthy" reporting.
STA cross-reference (--sta-cross-reference <opensta-paths.txt>)
Read OpenSTA's worst-N critical-path report and produce coverage output: of those paths, which were exercised by the stimulus, at what observed arrival. Closes the loop between vector-driven and static analysis.
Path back-trace from worst-arrival DFF
Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting per-edge contributions. Most expensive item on the wishlist; only useful once symbolic names are in place (which they now are).
CUDA / HIP / cosim runtime violation routing
The current Metal sim path routes runtime violations through process_events (which is what feeds the resolver, structured report, and text summary). The CUDA, HIP, and cosim paths don't yet share that plumbing — they detect violations on the GPU but don't drain through process_events. Independent plumbing follow-up; doesn't affect the Metal user experience.
Per-signal activity / transition counts
Listed in ADR 0008 as part of the JSON report's wishlist. Not in v1.0.0 of the schema; will be added (additively) when the GPU kernel emits transition events.
"Corner" and "margin percentage" in the text summary
ADR 0008's summary template includes both. Corner is missing because the metadata struct doesn't carry it through from the IR yet; margin percentage is trivially derivable from slack_ps / clock_period_ps and was omitted to keep the v1 summary terse.
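The derivation mentioned above is a single division. A minimal sketch, with the hypothetical function name `margin_percent`, using the slack and clock period values a consumer would read from the report:

```rust
/// Margin as a percentage of the clock period: slack_ps / clock_period_ps.
/// Negative values indicate a violation.
pub fn margin_percent(slack_ps: i64, clock_period_ps: u64) -> f64 {
    100.0 * slack_ps as f64 / clock_period_ps as f64
}
```

For the worst setup slack in the sample summary (-150 ps at a 1000 ps clock period), this gives a margin of -15 percent.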
Related artefacts
- project-scope.md: what Jacquard is for and not for; the formal contract this doc operates inside.
- timing-correctness.md: forward-looking validation requirements.
- timing-violations.md: current GPU-side violation detection mechanics.
- timing-validation.md: how Jacquard's timing output is validated against CVC/iverilog.
- timing-model-extensions.md: proposed accuracy improvements (δ(T), clock-tree skew, wire delay).
GEM Simulation Architecture
This document describes GEM's internal simulation architecture based on investigation and testing.
Overview
GEM (GPU-accelerated Emulator-inspired RTL simulation) compiles gate-level netlists into GPU kernels that simulate designs 5-40X faster than CPU-based simulators. It works like an FPGA-based RTL emulator by converting designs into an and-inverter graph (AIG), partitioning it for GPU blocks, and generating optimized GPU code.
Pipeline Stages
Verilog Netlist → NetlistDB → AIG → StagedAIG → Partitions → FlattenedScript → GPU Kernel
                      ↓        ↓                    ↓               ↓
                    Parse  Synthesis           Hypergraph      Instruction
                   Netlist  to AIGs           Partitioning     Generation
1. NetlistDB (Input Parsing)
Input: Gate-level Verilog (.gv files) from synthesis tools (Yosys, Design Compiler)
Process:
- Parses structural Verilog using the sverilogparse crate
- Creates flattened netlist database with cells, pins, nets
- Identifies primary inputs, outputs, clock signals
- Stores connectivity in CSR (Compressed Sparse Row) format
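For readers unfamiliar with CSR, the idea is to store all adjacency lists back to back in one array, plus an offsets array marking where each row starts. The sketch below shows the general technique with a hypothetical `Csr` type mapping each net to the pin indices it connects; it is not NetlistDB's actual layout or field naming.

```rust
/// Generic CSR container: row `i` occupies items[offsets[i]..offsets[i + 1]].
/// (Illustrative type; NetlistDB's real fields may differ.)
pub struct Csr {
    pub offsets: Vec<usize>,
    pub items: Vec<usize>,
}

impl Csr {
    /// Builds CSR from per-row adjacency lists, e.g. pins attached to each net.
    pub fn from_rows(rows: &[Vec<usize>]) -> Self {
        let mut offsets = Vec::with_capacity(rows.len() + 1);
        let mut items = Vec::new();
        offsets.push(0);
        for row in rows {
            items.extend_from_slice(row);
            offsets.push(items.len());
        }
        Csr { offsets, items }
    }

    /// Returns the entries of row `i` as a slice, without copying.
    pub fn row(&self, i: usize) -> &[usize] {
        &self.items[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Number of rows stored.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

Compared with a `Vec<Vec<usize>>`, this needs only two allocations regardless of the number of nets and keeps neighbouring rows contiguous in memory.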
Key Limitations:
- Only supports synthesized gate-level netlists (not RTL)
- No behavioral Verilog constructs (always blocks, if/case statements)
- Expects standard cells from supported libraries (AIGPDK)
2. AIG (And-Inverter Graph)
Process: Converts gate-level netlist to AIG representation
Data Structure:
pub enum DriverType {
    AndGate,    // Basic AND gate
    DFF,        // D flip-flop
    ClockGate,  // Clock gating cell
    RAMBlock,   // Memory block
    GemAssert,  // Assertion checking
    GemDisplay, // Display output
    // ... more types
}
Statistics (example from safe.v):
- 157 AIG pins: Internal circuit nodes
- 133 AND gates: Logic operations
- 16 DFF cells: Sequential elements
- 2 GEM_ASSERT cells: Assertion nodes
- 480 total pins: Including I/O
Key Features:
- Clock inference from DFF connections
- Assertion cell detection (GEM_ASSERT, GEM_DISPLAY)
- Endpoint grouping for outputs and registers
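As background on the representation itself: an and-inverter graph expresses all combinational logic with two-input AND nodes and optional inversion on edges. The sketch below evaluates a tiny AIG using the AIGER-style literal encoding (literal = 2 × node + inversion bit), which is an assumption chosen for illustration and not necessarily how src/aig.rs encodes its pins. The `Aig` type and `lit_value` function are hypothetical.

```rust
/// Minimal AIG: node 0 is constant false, nodes 1..=num_inputs are inputs,
/// and AND gate k drives node `1 + num_inputs + k`.
/// A literal is `2 * node + 1` if inverted, `2 * node` otherwise.
pub struct Aig {
    pub num_inputs: usize,
    /// AND gates in topological order, each with two fan-in literals.
    pub ands: Vec<(u32, u32)>,
}

/// Reads a literal from the node value table, applying the inversion bit.
pub fn lit_value(values: &[bool], lit: u32) -> bool {
    values[(lit >> 1) as usize] ^ ((lit & 1) == 1)
}

impl Aig {
    /// Evaluates every node for one input vector; returns all node values.
    pub fn eval(&self, inputs: &[bool]) -> Vec<bool> {
        assert_eq!(inputs.len(), self.num_inputs);
        let mut values = Vec::with_capacity(1 + self.num_inputs + self.ands.len());
        values.push(false); // node 0: constant false
        values.extend_from_slice(inputs);
        for &(a, b) in &self.ands {
            let v = lit_value(&values, a) && lit_value(&values, b);
            values.push(v);
        }
        values
    }
}
```

As a usage example, OR(x, y) is NOT(AND(NOT x, NOT y)): with two inputs (nodes 1 and 2), the single gate `(3, 5)` drives node 3, and the OR output is literal 7 (node 3, inverted).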
3. StagedAIG (Pipeline Staging)
Purpose: Split deep combinational logic into pipeline stages
Process:
- Analyzes combinational depth between registers
- Splits logic at --level-split thresholds
- Creates pipeline stages to fit GPU resource constraints
When Needed:
- Designs with very deep combinational paths (>50 levels)
- When single-stage partitioning fails resource limits
- Use --level-split 30 or --level-split 20,40 to force splits
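To make the threshold idea concrete, the sketch below parses a --level-split value and assigns a logic level to a stage. Both function names are hypothetical, and the boundary rule (a gate exactly at a threshold level starts the next stage) is an assumption; the real staging code in src/staging.rs may draw the line differently.

```rust
/// Parses a `--level-split` value such as "30" or "20,40"
/// into a list of ascending level thresholds.
pub fn parse_level_split(arg: &str) -> Option<Vec<usize>> {
    let mut levels = arg
        .split(',')
        .map(|s| s.trim().parse::<usize>().ok())
        .collect::<Option<Vec<_>>>()?;
    levels.sort_unstable();
    Some(levels)
}

/// Stage index for a gate at combinational depth `level`: the number of
/// thresholds at or below that level.
/// Assumption: a gate exactly at a threshold belongs to the next stage.
pub fn stage_of(level: usize, thresholds: &[usize]) -> usize {
    thresholds.iter().filter(|&&t| level >= t).count()
}
```

Under these assumptions, `--level-split 20,40` yields three stages: levels 0 to 19, 20 to 39, and 40 and above.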
4. Partitioning (Hypergraph Cut)
Tool: mt-kahypar hypergraph partitioner
Constraints (GPU block resources):
- Max 8191 unique inputs per partition
- Max 8191 unique outputs per partition
- Max 4095 intermediate pins alive per stage
- Max 64 SRAM output groups
Process:
- Iterative partitioning (runs automatically at simulation start)
- Tries 1 partition first, then increases if needed
- Merges partitions to minimize inter-partition communication
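The resource limits and the "try 1 partition first, then increase" loop can be sketched as follows. This is an illustration under assumptions: `PartitionStats`, `fits_gpu_block`, and `find_partition_count` are hypothetical names, the partitioner is passed in as a closure standing in for mt-kahypar, and the step size of one partition per retry is a guess, since the text only says the count increases.

```rust
/// Resource usage of one candidate partition (hypothetical struct).
pub struct PartitionStats {
    pub inputs: usize,
    pub outputs: usize,
    pub live_intermediates: usize, // peak intermediate pins alive in a stage
    pub sram_output_groups: usize,
}

/// Checks a partition against the GPU block limits listed above.
pub fn fits_gpu_block(p: &PartitionStats) -> bool {
    p.inputs <= 8191
        && p.outputs <= 8191
        && p.live_intermediates <= 4095
        && p.sram_output_groups <= 64
}

/// Tries 1, 2, 3, ... partitions until every partition fits a GPU block.
/// `partition(n)` is a stand-in for the real hypergraph partitioner.
/// Returns None if no count up to `max_parts` works.
pub fn find_partition_count<F>(max_parts: usize, mut partition: F) -> Option<usize>
where
    F: FnMut(usize) -> Vec<PartitionStats>,
{
    (1..=max_parts).find(|&n| partition(n).iter().all(fits_gpu_block))
}
```

When no count succeeds, the remedy described under StagedAIG applies: force a stage split with --level-split so that each stage is small enough to partition.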
5. FlattenedScript (GPU Instruction Generation)
Process: Generates GPU execution script from partitions
Script Components:
- Boomerang stages: Hierarchical 8192→1 reduction structure
- State buffer: Packed 32-bit words for all register values
- SRAM interface: Memory block read/write operations
- Assertion positions: Bit positions for assertion conditions
- Display positions: Enable bits and argument positions
Statistics (example):
reg/io state size: 133 bits → 5 words (32-bit)
script size: 30208 instructions
assertion_positions: [(cell_id, bit_pos, msg_id, type)]
display_positions: [(cell_id, enable_pos, format, arg_positions, widths)]
Key Insight: All state is packed into a flat bit array, indexed by position in 32-bit words.
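The word/bit arithmetic behind this packing is short enough to show in full. The helper names below are hypothetical, and the bit order within a word (bit 0 is the least significant bit) is an assumption; the numbers, however, match the debug output in this document: 133 bits need 5 words, and position 4195 maps to word 131, bit 3.

```rust
/// Number of 32-bit words needed to hold `bits` state bits.
pub fn words_for_bits(bits: usize) -> usize {
    (bits + 31) / 32
}

/// Splits a flat bit position into (word index, bit index within word).
pub fn word_and_bit(pos: usize) -> (usize, u32) {
    (pos / 32, (pos % 32) as u32)
}

/// Reads one bit from the packed state buffer.
/// Assumption: bit 0 of each word is the least significant bit.
pub fn read_bit(state: &[u32], pos: usize) -> bool {
    let (word, bit) = word_and_bit(pos);
    ((state[word] >> bit) & 1) == 1
}

/// Writes one bit into the packed state buffer (same bit-order assumption).
pub fn write_bit(state: &mut [u32], pos: usize, value: bool) {
    let (word, bit) = word_and_bit(pos);
    if value {
        state[word] |= 1 << bit;
    } else {
        state[word] &= !(1 << bit);
    }
}
```

Assertion and display positions reported in the debug log can be decoded the same way when inspecting a dumped state buffer.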
6. GPU Kernel Execution
Kernel Types:
- kernel_v1.cu / kernel_v1_impl.cuh: CUDA implementation
- kernel_v1.metal: Metal (Apple Silicon) implementation
Execution Model:
- Each GPU block simulates one partition
- Multiple blocks run in parallel
- State synchronized between stages
- CPU checks assertion/display conditions after GPU completes
VCD Input/Output
Input VCD Requirements
Critical Discovery: GEM expects VCD signals at absolute top-level (no module hierarchy).
Expected Signal Format:
$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
NOT (with module scope):
$scope module testbench $end
$scope module dut $end
$var wire 1 ! clk $end
...
Signal Matching:
- GEM looks for signals matching synthesized module port names
- Uses HierName() (empty hierarchy) for matching
- If signals are scoped under modules, GEM reports:
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input
VCD Scope Option:
- --input-vcd-scope <scope>: Specify module hierarchy to read from
- Current Issue: Even with scope specified, signal matching fails
- Workaround: Generate VCD with signals at absolute top level
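Before running a simulation, it can help to check which scope each signal in a VCD header is declared under. The sketch below is a standalone helper, not part of the simulator: `vcd_vars_with_scope` is a hypothetical name, it handles only the `$scope`, `$upscope`, `$var`, and `$enddefinitions` header keywords, and it ignores corner cases such as keywords appearing inside `$comment` blocks.

```rust
/// Returns (scope_path, signal_name) for every `$var` in a VCD header.
/// An empty scope path means the signal sits at the absolute top level.
/// Scope components are joined with '/', the separator the CLI expects.
pub fn vcd_vars_with_scope(vcd: &str) -> Vec<(String, String)> {
    let mut tokens = vcd.split_whitespace();
    let mut scope: Vec<&str> = Vec::new();
    let mut vars = Vec::new();
    while let Some(tok) = tokens.next() {
        match tok {
            "$scope" => {
                // $scope <kind> <name> $end
                let _kind = tokens.next();
                if let Some(name) = tokens.next() {
                    scope.push(name);
                }
            }
            "$upscope" => {
                scope.pop();
            }
            "$var" => {
                // $var <type> <width> <id-code> <name> [range] $end
                // nth(3) skips type, width, id-code and yields the name.
                if let Some(name) = tokens.nth(3) {
                    vars.push((scope.join("/"), name.to_string()));
                }
            }
            "$enddefinitions" => break,
            _ => {}
        }
    }
    vars
}
```

If every returned scope path is empty, the VCD has the flat layout described above; otherwise the common scope path is a candidate value for --input-vcd-scope.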
Output VCD Structure
GEM generates minimal VCD with only primary outputs:
$timescale 1 ns $end
$scope module gem_top_module $end
$var wire 1 ! unlocked $end
$upscope $end
Internal states and intermediate signals are not dumped.
Assertion and Display Support
Assertion Infrastructure
Synthesis Flow:
Verilog assert() → Yosys $check cell → techmap gem_formal.v → GEM_ASSERT cell
Runtime:
- GEM stores assertion positions in FlattenedScript
- CPU checks assertion bits after GPU simulation
- Configurable actions: Log, Pause, Terminate
AssertConfig:
pub struct AssertConfig {
    pub on_failure: AssertAction, // Log, Pause, Terminate
    pub max_failures: Option<u32>,
}
Display Infrastructure
Synthesis Flow:
Verilog $display() → Yosys $print cell → techmap gem_formal.v → GEM_DISPLAY cell
Runtime:
- Format strings stored in JSON metadata
- CPU checks display enable bits after GPU simulation
- Arguments extracted from state buffer positions
Limitation: Format string preservation depends on Yosys synthesis preserving attributes.
Debug Information
Enabling Debug Output
# Metal simulation with debug logging
RUST_LOG=debug cargo run -r --features metal --bin jacquard -- sim <args>
# CPU verification (slower but validates GPU results)
cargo run -r --features metal --bin jacquard -- sim <args> --check-with-cpu
Key Debug Messages
AIG Construction:
Found GEM_ASSERT cell 143 (condition_iv=0, en_iv=0, a_iv=76, clken_iv=2)
Found GEM_DISPLAY cell 24 (enable_iv=2, clken_iv=2, args=32)
Partitioning:
netlist has 480 pins, 157 aig pins, 133 and gates
current: 19 endpoints, try 1 parts
after merging: 1 parts
Flattening:
Built script for 48 blocks, reg/io state size 133, sram size 0, script size 30208
Assertion: cell=144, pos=4195 (word=131, bit=3), msg_id=144, type=None
Display: cell=24, enable_pos=5154 (word=161, bit=2), format='...', args=[...]
VCD Reading:
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present
Performance Characteristics
Speedup vs CPU
- Simple designs: 5-10X faster
- Complex designs: 10-40X faster
- Depends on:
- Number of GPU SMs (streaming multiprocessors)
- Partition granularity
- VCD I/O overhead
Resource Scaling
GPU Block Count: Set NUM_BLOCKS to 2× number of GPU SMs
- Apple M4 Pro: 48 blocks (24 SMs × 2)
- NVIDIA GPUs: Check SM count with nvidia-smi
Memory Usage:
- State buffer: num_blocks × state_size × num_cycles × 4 bytes
- Script: script_size × 4 bytes (shared across blocks)
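As a rough worked example of the two formulas above, the sketch below plugs in the figures quoted elsewhere in this document (48 blocks, 5 state words, a 30208-entry script). The 1000-cycle run length is a made-up value, and treating `state_size` as a count of 32-bit words is an assumption based on the "× 4 bytes" factor.

```rust
/// State buffer size in bytes: num_blocks × state_size × num_cycles × 4.
/// Assumption: `state_words` counts 32-bit words, not bits.
fn state_buffer_bytes(num_blocks: u64, state_words: u64, num_cycles: u64) -> u64 {
    num_blocks * state_words * num_cycles * 4
}

/// Script size in bytes: script_size × 4 (one copy shared across blocks).
fn script_bytes(script_size: u64) -> u64 {
    script_size * 4
}

fn main() {
    // 48 blocks and 5 words come from the examples in this document;
    // 1000 cycles is a hypothetical run length.
    let state = state_buffer_bytes(48, 5, 1000);
    let script = script_bytes(30208);
    println!("state buffer: {} bytes, script: {} bytes", state, script);
}
```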
Known Issues and Limitations
1. VCD Hierarchy Mismatch
Issue: GEM expects flat VCD signal hierarchy
Impact: Missing input signals cause incorrect simulation results
Workaround: Generate VCD with $dumpvars(1, sig1, sig2, ...) at top level
Status: Under investigation
2. Complex FSM Designs
Issue: Some FSM designs don't simulate correctly even with proper VCD
Example: safe.v (9-state PIN cracker FSM)
Possible Causes:
- Synthesis optimization changes FSM encoding
- Initial state handling differences
- Reset timing issues
Status: Identified through third-party test suite
3. No Latch or Asynchronous Sequential Logic Support
Issue: Jacquard only supports edge-triggered D flip-flops (DFFs) as sequential elements. Latch-based designs (SR latches, transparent latches, master-slave latch pairs) and asynchronous sequential logic are not supported.
Impact: Designs using latches will either:
- Fail during AIG conversion (unrecognized cell type)
- Be silently treated as combinational logic (incorrect simulation)
What this means in practice:
- Gate-level netlists must be synthesized to a DFF-only cell library (AIGPDK or SKY130)
- CVC's built-in test suite (tests_and_examples/install.test/) uses NAND-latch flip-flops (e.g., dfpsetd.v, sdfia04.v) and cannot be used as Jacquard reference tests
- Self-timed designs with internal clock generation (e.g., CVC's das_lfsr benchmark) are also unsupported
What would be needed to support latches:
- New DriverType variant: Add Latch(enable, data) to DriverType in aig.rs, representing a level-sensitive storage element
- Two-phase evaluation: Latches are transparent when enabled, requiring evaluation within a clock phase rather than only at clock edges. The current cycle-based simulation model (evaluate all combinational logic, then capture DFF outputs) would need to iterate until latch outputs stabilize
- AIG conversion: Map latch library cells (e.g., SKY130 dlxtp) to the new Latch driver, identifying enable and data pins
- GPU kernel changes: The writeout stage currently uses clken_perm for DFF clock gating. Latches would need a different mechanism: while enable is high, output tracks input continuously rather than capturing on an edge
- Timing: Latch timing is more complex: setup/hold is relative to the enable edge, and time borrowing across latch boundaries is a key use case in high-performance designs
- Convergence: Combinational loops through transparent latches must be detected and iterated to a fixed point, or flagged as errors
Complexity estimate: Moderate-to-high. The main challenge is the evaluation model change — DFF-only simulation is a clean "capture at edge" model, while latches require iterative evaluation within clock phases.
Status: Not planned. Jacquard targets synthesis flows that produce DFF-only netlists.
4. Format String Preservation
Issue: Yosys synthesis may not preserve gem_format attributes
Impact: Display messages show placeholders instead of actual format strings
Workaround: Extract format strings from pre-synthesis JSON
Status: Tool limitation, not GEM bug
Investigation Methodology
This documentation was created through systematic investigation:
- Structure Analysis: Examined source code in src/aig.rs, src/flatten.rs, src/staging.rs
- Debug Tracing: Used RUST_LOG=debug to capture internal state
- Netlist Inspection: Analyzed synthesized .gv files with grep
- VCD Comparison: Compared iverilog vs GEM VCD outputs
- Test Case Development: Created minimal reproducible examples
- Iterative Debugging: Progressively simplified designs to isolate issues
References
- Main codebase: src/ directory
- EDA infrastructure: vendor/eda-infra-rs/ submodule (netlistdb, vcd-ng, ulib)
- AIGPDK library: aigpdk/ directory
- Test cases: tests/ directory
- Third-party examples: tests/regression/third_party/
Document Version: 1.0 Last Updated: 2025-01-08 Authors: NVIDIA GEM Team + Claude Code Investigation
Timing Simulation in GEM
See also: timing-correctness.md, the forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.
This document explains GEM's boomerang evaluation architecture and how timing simulation with per-gate delays can be implemented efficiently on GPU.
Background: The Simulation Challenge
GEM simulates And-Inverter Graphs (AIGs) where every node is either:
- A primary input (value comes from VCD stimulus)
- An AND gate with two inputs (possibly inverted)
Traditional simulation evaluates gates in topological order, which is inherently serial. GPUs excel at massive parallelism - thousands of threads doing the same operation on different data. GEM bridges this gap with the boomerang architecture.
Boomerang Evaluation
Core Concept
The boomerang structure is a hierarchical reduction tree that maps an AIG onto GPU threads. It's called "boomerang" because data flows down the tree during reduction, then results are written back out at various levels - like a boomerang going out and returning.
Hierarchy Structure
GEM uses BOOMERANG_NUM_STAGES = 13, meaning the tree has 2^13 = 8192 leaf positions:
Level 0 (inputs): 8192 positions
Level 1: 4096 positions (8192 / 2)
Level 2: 2048 positions
Level 3: 1024 positions
Level 4: 512 positions
Level 5: 256 positions
Level 6: 128 positions
Level 7: 64 positions
Level 8: 32 positions
Level 9: 16 positions
Level 10: 8 positions
Level 11: 4 positions
Level 12: 2 positions
Level 13 (output): 1 position
Each level halves the number of positions by computing AND gates that combine pairs.
Thread Organization
A GPU block has 256 threads (threadIdx.x = 0..255). Each thread holds a 32-bit word where each bit represents an independent Boolean signal:
Thread 0: [bit0, bit1, bit2, ... bit31] = 32 Boolean signals
Thread 1: [bit0, bit1, bit2, ... bit31] = 32 Boolean signals
...
Thread 255: [bit0, bit1, bit2, ... bit31] = 32 Boolean signals
─────────────────────────────
Total: 256 × 32 = 8192 signals per level
Thread position refers to threadIdx.x - which of the 256 threads we're addressing. Each thread position processes 32 signals in parallel using SIMD operations.
Memory Layout
__shared__ u32 shared_metadata[256]; // Partition configuration
__shared__ u32 shared_writeouts[256]; // Output staging area
__shared__ u32 shared_state[256]; // Working state (8192 bits)
The shared_state array holds the current level's values during reduction.
The Reduction Process
Phase 1: Level 0 → Level 1 (hier[0])
Only threads 128-255 are active. Each computes 32 AND gates in parallel:
if(threadIdx.x >= 128) {
u32 hier_input_a = shared_state[threadIdx.x - 128]; // From threads 0-127
u32 hier_input_b = hier_input; // This thread's data
// 32 AND gates computed simultaneously (one per bit)
u32 ret = (hier_input_a ^ hier_flag_xora) &
((hier_input_b ^ hier_flag_xorb) | hier_flag_orb);
shared_state[threadIdx.x] = ret;
}
The xora, xorb, and orb flags encode:
- xora/xorb: Input inversions (for AND-inverter graph)
- orb: Passthrough mode (when output equals input A, skip the AND)
Visual representation:
Before: [T0][T1]...[T127] [T128][T129]...[T255]
│ │
└───────┬──────────┘
│
AND gates (128 threads × 32 bits = 4096 gates)
│
▼
After: [----unused----] [T128][T129]...[T255]
(128 × 32 = 4096 results)
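The per-bit meaning of the xora, xorb, and orb flags used in Phase 1 can be checked on the CPU with a small sketch. The function name `eval_gates` is hypothetical. The expression is copied from the kernel snippet above, and each of the 32 bit lanes is treated as an independent gate.

```rust
/// Evaluate 32 gates at once, one per bit lane:
///   xora / xorb bits invert input A / B in that lane,
///   an orb bit forces the B side to 1 so the lane passes (A ^ xora) through.
fn eval_gates(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}

fn main() {
    let (a, b) = (0b1101u32, 0b1010u32);
    // Plain AND in every lane.
    println!("{:04b}", eval_gates(a, b, 0, 0, 0));
    // Lane 0 in passthrough mode, lanes 1..3 plain AND.
    println!("{:04b}", eval_gates(a, b, 0, 0, 0b0001));
    // Input A inverted in all lanes, restricted to the low 4 bits for display.
    println!("{:04b}", eval_gates(a, b, u32::MAX, 0, 0) & 0xF);
}
```

Because the flags are ordinary 32-bit masks, a single thread can mix plain AND gates, inverted-input gates, and passthrough lanes in one instruction sequence.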
Phase 2: Levels 1-3 (Shared Memory)
for(int hi = 1; hi <= 3; ++hi) {
int hier_width = 1 << (7 - hi); // 64, 32, 16
if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
u32 hier_input_a = shared_state[threadIdx.x + hier_width];
u32 hier_input_b = shared_state[threadIdx.x + hier_width * 2];
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;
}
__syncthreads(); // Barrier between levels
}
Each level activates fewer threads:
- Level 1: threads 64-127 (64 threads → 2048 gates)
- Level 2: threads 32-63 (32 threads → 1024 gates)
- Level 3: threads 16-31 (16 threads → 512 gates)
Phase 3: Levels 4-7 (Warp Shuffle)
Within a single warp (32 threads), data exchange uses fast shuffle instructions instead of shared memory:
if(threadIdx.x < 32) {
for(int hi = 4; hi <= 7; ++hi) {
int hier_width = 1 << (7 - hi); // 8, 4, 2, 1
u32 hier_input_a = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width);
u32 hier_input_b = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width * 2);
if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
tmp_cur_hi = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
}
}
}
No synchronization needed - warp shuffle is implicitly synchronized.
Phase 4: Levels 8-12 (Bit Operations)
The final levels operate on bits within a single u32, computed by thread 0 only:
if(threadIdx.x == 0) {
// Level 8: 32 → 16 (operates on upper/lower halves)
u32 r8 = ((v1 << 16) ^ xora) & ((v1 ^ xorb) | orb) & 0xffff0000;
// Level 9: 16 → 8
u32 r9 = ((r8 >> 8) ^ xora) & (((r8 >> 16) ^ xorb) | orb) & 0xff00;
// Level 10: 8 → 4
u32 r10 = ((r9 >> 4) ^ xora) & (((r9 >> 8) ^ xorb) | orb) & 0xf0;
// Level 11: 4 → 2
u32 r11 = ((r10 >> 2) ^ xora) & (((r10 >> 4) ^ xorb) | orb) & 0b1100;
// Level 12: 2 → 1
u32 r12 = ((r11 >> 1) ^ xora) & (((r11 >> 2) ^ xorb) | orb) & 0b10;
tmp_cur_hi = r8 | r9 | r10 | r11 | r12;
}
Write-Outs
Results are captured at various levels (not just the final output) and written to global memory:
if((writeout_hook_i >> 8) == bs_i) {
shared_writeouts[threadIdx.x] = shared_state[writeout_hook_i & 255];
}
This is the "return" part of the boomerang - results flow back from intermediate levels.
Timing Simulation Approaches
Approach Comparison
| Approach | Parallelism | Memory | Accuracy | GPU Fit |
|---|---|---|---|---|
| Event-driven | Poor (serial queue) | Low | Exact | Bad |
| Time-wheel | Medium | High | Configurable | Medium |
| Levelized | Excellent | Low | Conservative | Best |
| Oblivious | Maximum | Very High | Exact | Wasteful |
Recommended: Levelized with Delay Accumulation
This approach piggybacks on the existing boomerang structure with minimal changes.
Data Structure Addition
// Add to shared memory (256 bytes additional)
__shared__ u8 shared_arrival[256]; // One arrival time per thread position
Each thread position stores a single 8-bit arrival time representing the maximum arrival across all 32 bits in that position. (The kernels described under Phase 3 below widen this value to u16.)
Modified AND Gate Evaluation
// Current (value only):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;
// With timing (add ~4 instructions):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;
u8 arr_a = shared_arrival[threadIdx.x - offset_a];
u8 arr_b = shared_arrival[threadIdx.x - offset_b];
u8 arr_ret = min(max(arr_a, arr_b) + GATE_DELAY, 255); // Saturating add
shared_arrival[threadIdx.x] = arr_ret;
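The arrival update above is a max followed by a saturating add. A minimal CPU-side sketch, assuming 8-bit arrivals as in the snippet and using hypothetical function names, shows the behaviour at the saturation limit:

```rust
/// Arrival at a gate output: the later of the two input arrivals plus the
/// gate delay, clamped at u8::MAX instead of wrapping around.
fn propagate_arrival(arr_a: u8, arr_b: u8, gate_delay: u8) -> u8 {
    arr_a.max(arr_b).saturating_add(gate_delay)
}

/// Arrival after `levels` gates in series, all with the same delay.
/// (Hypothetical helper used only to illustrate accumulation.)
fn chain_arrival(start: u8, levels: u32, gate_delay: u8) -> u8 {
    (0..levels).fold(start, |arr, _| propagate_arrival(arr, arr, gate_delay))
}

fn main() {
    println!("{}", propagate_arrival(10, 25, 4)); // later input dominates
    println!("{}", chain_arrival(0, 13, 1));      // one unit per level
    println!("{}", chain_arrival(0, 13, 30));     // clamps at 255
}
```

Saturating rather than wrapping matters here: a wrapped value would look like an early arrival, whereas a clamped value stays at the largest representable time.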
Complexity Analysis
- Same number of kernel launches as zero-delay simulation
- O(levels × cycles) - identical to current
- ~256 bytes additional shared memory per partition
- Estimated 10-20% performance overhead
The Approximation Trade-off
What We Track
One arrival time per thread position (256 values) instead of per signal (8192 values).
Implications
If thread position 50 contains signals A, B, C with different true arrivals:
Signal A: 15ps (medium path)
Signal B: 23ps (longest path)
Signal C: 8ps (shortest path)
We store only: arrival[50] = 23ps (the maximum).
Why This Works
- Conservative for setup checks: A maximum arrival may produce false setup violations, but never hides a real one
- Correlated signals: Signals at the same thread position are often topologically nearby with similar timing
- Endpoint focus: We ultimately only care about arrivals at DFF D inputs
When Full Accuracy is Needed
For bit-accurate timing, you would need:
// 8KB additional shared memory (may exceed limits)
__shared__ u8 shared_arrival[256][32]; // Per-bit arrivals
This is feasible but significantly increases memory pressure and computation.
Implementation Phases
Phase 1: CPU Timing Analysis (Completed)
- Liberty parser for delay extraction
- Static timing analysis on AIG
- CPU reference simulation with delays
- Timing violation detection
Phase 2: Hybrid GPU+CPU (Completed)
- GPU performs zero-delay value simulation
- CPU performs timing analysis on results
- Validates infrastructure without kernel changes
Phase 3: GPU Arrival Tracking (Completed)
- Added shared_arrival[256] (u16) to Metal and CUDA kernels
- Arrivals tracked during boomerang reduction at all hierarchy levels
- Per-gate delays injected via script padding slots from SDF data
- DFF timing constraint checking at cycle boundaries (setup/hold)
- Timing-aware VCD output (--timed flag)
- Validated against CVC reference simulator (88ps / 7.1% conservative overestimate)
Phase 4: Full Integration (Partial)
- Timing violation events via event buffer (completed)
- Per-cycle timing reports (completed)
- Integration with output VCD (completed via --timed)
- Timing-aware bit packing for reduced approximation error (future)
Conservative Timing Model: Sources of Overestimation
Jacquard's GPU timing is intentionally conservative — it may over-estimate arrival times but will never under-estimate them. This is important for setup violation detection: false positives are safe, false negatives would miss real bugs.
There are three independent sources of conservatism, each adding to the overestimate:
Source 1: max(rise, fall) per cell
The GPU kernel tracks a single u16 arrival per thread position. It cannot distinguish between rising and falling signal transitions because each thread processes 32 packed Boolean signals simultaneously — there's no per-bit transition direction available.
How it works: For each cell, inject_timing_to_script() computes:
delay = max(gate_delays[pin].rise_ps, gate_delays[pin].fall_ps)
Impact: For the SKY130 inv_chain test (16 inverters), rise delays average ~10ps larger than fall delays. In a real inverter chain, transitions alternate (rise→fall→rise), so half the cells use the smaller fall delay. Jacquard uses the larger rise delay for all.
Measured: 80ps overestimate on 1235ps (6.5%) for 16 inverters with ~10ps rise/fall asymmetry per cell.
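The arithmetic behind this overestimate can be reproduced with a small sketch. The per-cell values (60ps rise, 50ps fall) are hypothetical round numbers chosen to give a 10ps asymmetry. They are not the SKY130 figures, and the type and function names are illustrative only.

```rust
#[derive(Clone, Copy)]
struct Delay {
    rise_ps: u32,
    fall_ps: u32,
}

/// Conservative model: every cell is charged max(rise, fall).
fn conservative_chain(cells: &[Delay]) -> u32 {
    cells.iter().map(|d| d.rise_ps.max(d.fall_ps)).sum()
}

/// Transition-accurate model for an inverter chain: the output polarity
/// alternates from one inverter to the next.
fn alternating_chain(cells: &[Delay], first_output_rises: bool) -> u32 {
    let mut rising = first_output_rises;
    let mut total = 0;
    for d in cells {
        total += if rising { d.rise_ps } else { d.fall_ps };
        rising = !rising;
    }
    total
}

fn main() {
    // Hypothetical delays: 16 identical inverters, rise 10ps slower than fall.
    let cells = [Delay { rise_ps: 60, fall_ps: 50 }; 16];
    let conservative = conservative_chain(&cells);
    let accurate = alternating_chain(&cells, true);
    println!("conservative = {}ps, accurate = {}ps, gap = {}ps",
             conservative, accurate, conservative - accurate);
}
```

With 16 cells, 8 of them take the faster transition in the accurate model, so the gap is 8 × 10ps, which matches the 80ps figure quoted above.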
Source 2: max wire delay across all input pins
For multi-input cells (AND gates, MUXes), INTERCONNECT delays to different input pins may differ significantly. Jacquard takes the maximum across all input pins:
// wire_delays_per_cell: dest_cellid → max(all input wire delays)
entry.rise_ps = entry.rise_ps.max(ic.delay.rise_ps);
entry.fall_ps = entry.fall_ps.max(ic.delay.fall_ps);
Impact: If an AND gate has input A arriving via a 10ps wire and input B via a 200ps wire, Jacquard assigns 200ps to the cell regardless of which input is on the critical path. An event-driven simulator would correctly propagate the 10ps arrival on input A independently.
When this matters: Designs with highly asymmetric routing (e.g., one input is local, another crosses the chip). Well-routed designs typically have balanced wire delays to multi-input cells.
Source 3: max arrival across 32 packed signals per thread
Each thread position holds 32 independent Boolean signals. Jacquard tracks one arrival per thread position (the maximum across all 32 signals):
Thread 50: [signal_A: 5ps, signal_B: 23ps, signal_C: 8ps, ...]
Tracked: arrival[50] = 23ps (max of all 32)
Impact: If signals with very different timing are packed into the same thread, the fastest signals inherit the slowest signal's arrival time.
Mitigation: The bit-packing algorithm can sort signals by estimated timing before assignment (see "Timing-Aware Bit Packing" section). This keeps similar-timing signals together, reducing the max approximation error.
Combined Effect
These sources accumulate: each adds its own overestimate on top of the others. For the inv_chain test:
| Source | Contribution | Notes |
|---|---|---|
| max(rise, fall) | +80ps | 8 inverters × 10ps asymmetry |
| max wire delay | +8ps | 8 wires × 1ps asymmetry |
| max per thread | 0ps | Only 1 signal per thread in this test |
| Total overestimate | 88ps / 7.1% | vs CVC transition-accurate result |
For larger designs with more routing asymmetry and denser bit packing, the combined overestimate could be larger. The bit-packing sort (Source 3) is the most actionable mitigation.
CVC Reference Validation
The inv_chain design (2 DFFs + 16 SKY130 inverters) was validated against CVC (open-src-cvc), an event-driven Verilog simulator with native SDF back-annotation:
CVC: clk_to_q=350ps chain=885ps total=1235ps (transition-accurate)
Jacquard: clk_to_q=350ps chain=973ps total=1323ps (conservative max)
Difference: 88ps (7.1% overestimate)
Both simulators agree on CLK→Q delay (350ps) because the DFF has a single output transition direction per clock edge. The chain delay differs because CVC tracks actual rise/fall polarity through each inverter.
To run the CVC comparison locally:
bash tests/timing_test/cvc/run_cvc.sh
Requires Docker (builds CVC from source on first run).
Delay Data Encoding
Script Format
The existing boomerang section has padding that can store delay data:
Current format per thread per stage:
[xora: u32]
[xorb: u32]
[orb: u32]
[padding: u32] ← Can store delay here
PackedDelay Structure
#[repr(C)]
pub struct PackedDelay {
    pub rise_ps: u16, // Rising edge delay in picoseconds
    pub fall_ps: u16, // Falling edge delay in picoseconds
}
For simplified timing, a single uniform delay constant can be used instead of per-gate delays.
Timing Violation Detection
At Each Cycle Boundary
The GPU kernel checks timing constraints per state word (32 signals) after the boomerang evaluation completes. Arrivals and constraints use u16 picosecond values (range 0–65535 ps). Arithmetic is performed in u32 to avoid overflow when summing arrival + setup:
// After boomerang completes, before next cycle
// arrival: u16 max accumulated delay for this 32-signal group
// constraint_word: packed [setup_ps:16][hold_ps:16]
u16 setup_ps = constraint_word >> 16;
u16 hold_ps = constraint_word & 0xFFFF;
// Setup check: skip when arrival == 0 (no data propagated, e.g. first cycle
// or DFF with constant inputs)
if (arrival > 0 && (u32)arrival + (u32)setup_ps > clock_period_ps) {
int slack = (int)clock_period_ps - (int)arrival - (int)setup_ps;
write_event(event_buffer, EVENT_TYPE_SETUP_VIOLATION,
cycle, io_offset + threadIdx.x,
(u32)slack, (u32)arrival, (u32)setup_ps);
}
// Hold check: no arrival > 0 guard (hold violations matter even at cycle 0)
if ((u32)arrival < (u32)hold_ps) {
int slack = (int)arrival - (int)hold_ps;
write_event(event_buffer, EVENT_TYPE_HOLD_VIOLATION,
cycle, io_offset + threadIdx.x,
(u32)slack, (u32)arrival, (u32)hold_ps);
}
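The same checks can be mirrored on the CPU for unit testing. The sketch below is an illustration, not the project's implementation: the `Violation` enum and both function names are hypothetical, while the packed `[setup_ps:16][hold_ps:16]` layout and the two conditions follow the kernel snippet above.

```rust
#[derive(Debug, PartialEq)]
enum Violation {
    Setup { slack_ps: i32 },
    Hold { slack_ps: i32 },
}

/// Unpack a constraint word laid out as [setup_ps:16][hold_ps:16].
fn unpack_constraint(word: u32) -> (u16, u16) {
    ((word >> 16) as u16, (word & 0xFFFF) as u16)
}

/// Check one 32-signal state word against its setup/hold constraints.
fn check_word(arrival: u16, constraint_word: u32, clock_period_ps: u32) -> Vec<Violation> {
    let (setup_ps, hold_ps) = unpack_constraint(constraint_word);
    let mut out = Vec::new();
    // Setup: skipped when arrival == 0 (no data propagated). Sum in u32
    // so arrival + setup cannot overflow 16 bits.
    if arrival > 0 && u32::from(arrival) + u32::from(setup_ps) > clock_period_ps {
        let slack_ps = clock_period_ps as i32 - i32::from(arrival) - i32::from(setup_ps);
        out.push(Violation::Setup { slack_ps });
    }
    // Hold: no arrival > 0 guard.
    if arrival < hold_ps {
        let slack_ps = i32::from(arrival) - i32::from(hold_ps);
        out.push(Violation::Hold { slack_ps });
    }
    out
}

fn main() {
    let word = (200u32 << 16) | 50; // setup = 200ps, hold = 50ps
    println!("{:?}", check_word(900, word, 1000));
    println!("{:?}", check_word(10, word, 1000));
}
```

The two calls in `main` use the same numbers as the sample setup and hold violation lines in the timing violations guide (slack of -100ps and -40ps).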
Event Buffer Integration
pub enum EventType {
    Stop = 0,
    Finish = 1,
    Display = 2,
    AssertFail = 3,
    SetupViolation = 4, // Timing events
    HoldViolation = 5,
}
For full details on interpreting violation reports and tracing violations to source signals, see docs/timing-violations.md.
Timing-Aware Bit Packing
The Problem
Each thread position holds 32 signals packed into a u32. When tracking timing with one arrival value per thread position, we approximate all 32 signals as having the same arrival time (the maximum).
This approximation is accurate when signals in the same thread have similar timing. But the default placement algorithm uses first-fit for bit assignment:
// Default: first available slot
for i in 0..hier[selected_level].len() {
    if hier[selected_level][i] == usize::MAX {
        slot_at_level = i; // First-fit, not timing-aware
        break;
    }
}
This can result in signals with very different timing sharing a thread:
Thread 50 (accidental grouping):
bit 0: level 5, ~5ps arrival
bit 1: level 12, ~12ps arrival ← 7ps difference!
bit 2: level 6, ~6ps arrival
Thread 50 (timing-aware grouping):
bit 0: level 5, ~5ps arrival
bit 1: level 5, ~5ps arrival ← similar timing
bit 2: level 6, ~6ps arrival
Current Timing Correlation
The placement algorithm already computes logic levels:
// Level = max(level of inputs) + 1
level[node] = max(level[input_a], level[input_b]) + 1;
Logic level correlates with timing (more levels = more gate delays), but signals at the same level can still have different actual delays due to:
- Different gate types (AND2_00_0 vs AND2_11_1)
- Different wire loads
- Path reconvergence
Solution: Sort by Timing Before Packing
Before assigning bit positions, sort signals by their estimated arrival time:
// Collect nodes at this level
let mut nodes_to_place: Vec<_> = candidates
    .filter(|n| level[n] == selected_level)
    .collect();
// Sort by arrival time (level as proxy, or actual timing if available)
nodes_to_place.sort_by_key(|n| arrival_estimate[n]);
// Place in sorted order - similar timing ends up in same thread
for (slot, node) in nodes_to_place.iter().enumerate() {
    place_bit(..., slot, *node);
}
Alternative Approaches
| Approach | Complexity | Effectiveness | When to Use |
|---|---|---|---|
| Sort by timing | Low | Good | Default choice |
| Timing-aware partitioning | High | Best | Large designs |
| Post-placement swapping | Medium | Good | Fine-tuning |
| Timing bands | Low | Moderate | Simple heuristic |
Timing Bands
Group signals into arrival time bands:
Band 0: 0-10ps → Threads 0-63
Band 1: 10-20ps → Threads 64-127
Band 2: 20-30ps → Threads 128-191
Band 3: 30+ps → Threads 192-255
Measuring Packing Quality
Diagnostic to measure timing variance per thread:
fn analyze_timing_packing(hier: &Hierarchy, arrivals: &[u64]) {
    for thread in 0..256 {
        let times: Vec<u64> = get_bits_in_thread(hier, thread)
            .map(|b| arrivals[b])
            .collect();
        // Skip thread positions that hold no signals
        let (Some(&max), Some(&min)) = (times.iter().max(), times.iter().min()) else {
            continue;
        };
        let range = max - min;
        let variance = compute_variance(&times);
        // threshold: acceptable spread in ps, defined by the caller
        if range > threshold {
            warn!("Thread {} has {}ps timing spread (variance {})", thread, range, variance);
        }
    }
}
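To see why sorting before packing helps, the sketch below compares the total per-thread spread for the same arrivals packed in their original order and in sorted order. It is a standalone illustration: the function names are hypothetical, arrivals are plain numbers rather than AIG nodes, and the lane count is a parameter (the kernel packs 32 signals per thread) so that the example stays small.

```rust
/// Sum over thread positions of (max arrival - min arrival) when signals
/// are packed `lanes` per thread in the given order. `lanes` must be > 0.
fn total_spread(arrivals: &[u64], lanes: usize) -> u64 {
    arrivals
        .chunks(lanes)
        .map(|t| t.iter().max().unwrap() - t.iter().min().unwrap())
        .sum()
}

/// Return the arrivals sorted ascending, as a timing-aware packer would
/// order them before assigning bit positions.
fn sorted_by_arrival(arrivals: &[u64]) -> Vec<u64> {
    let mut v = arrivals.to_vec();
    v.sort_unstable();
    v
}

fn main() {
    // Hypothetical arrivals in ps, two lanes per thread for readability.
    let arrivals = [5, 12, 6, 12];
    let first_fit = total_spread(&arrivals, 2);
    let sorted = total_spread(&sorted_by_arrival(&arrivals), 2);
    println!("first-fit spread = {}ps, sorted spread = {}ps", first_fit, sorted);
}
```

A smaller total spread means the per-thread maximum is closer to each signal's own arrival, which is the quantity the diagnostic above is meant to surface.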
Impact on Approximation Accuracy
With timing-aware packing:
- Reduced false positives: Fewer spurious timing violations from max approximation
- Tighter bounds: Per-thread arrival closer to actual signal arrivals
- Better critical path identification: Max arrival more accurately reflects true critical path
Performance Expectations
| Metric | Zero-Delay | With Timing |
|---|---|---|
| Kernel launches | N | N |
| Shared memory | 3KB | 3.25KB |
| Registers | ~32 | ~36 |
| Instructions/gate | ~5 | ~9 |
| Estimated overhead | - | 15-25% |
The overhead is modest because:
- Timing operations are simple (max, add)
- Memory access pattern is identical
- No additional synchronization needed
- Same parallelism structure
References
- src/pe.rs - Partition executor and boomerang stage construction
- csrc/kernel_v1_impl.cuh - GPU kernel implementation
- src/flatten.rs - Script generation with timing data
- src/event_buffer.rs - GPU→CPU event communication
- src/liberty_parser.rs - Timing library parsing
Timing Violation Detection
See also: timing-correctness.md, the forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.
Guide to enabling, reading, and debugging setup/hold timing violations in GEM.
Overview
Setup and hold violations occur when data arrives too late (setup) or too early (hold) relative to the clock edge at a flip-flop. GEM checks for these violations during GPU simulation by tracking arrival times — the accumulated gate delay from primary inputs or DFF outputs through combinational logic to the next DFF data input.
Approximation model: GEM tracks one arrival time per 32-signal group (one GPU thread position). The arrival is the maximum across all 32 signals in the group. This is conservative: it may over-report violations but will never miss a real one. See Reducing False Positives for details.
Enabling Timing Checks
Prerequisites
- SDF file with back-annotated delays from your place-and-route tool
- Gate-level netlist synthesized to aigpdk.lib cells
Step-by-step
- Generate SDF from your P&R tool (or use scripts/generate_sdf.py for test designs):
# Example: OpenROAD flow output
ls my_build/6_final.sdf
- Run the simulator with --sdf and a clock period:
Metal (macOS):
cargo run -r --features metal --bin jacquard -- sim \
design.gv input.vcd output.vcd 1 \
--sdf design.sdf \
--sdf-corner typ
CUDA (NVIDIA):
cargo run -r --features cuda --bin jacquard -- sim \
design.gv input.vcd output.vcd 8 \
--sdf design.sdf \
--sdf-corner typ \
--enable-timing \
--timing-clock-period 1200
cosim (co-simulation):
cargo run -r --features metal --bin jacquard -- cosim \
design.gv \
--config testbench.json \
--sdf design.sdf \
--sdf-corner typ
CLI Flags Reference
| Flag | Binary | Description |
|---|---|---|
| --sdf <path> | all | Path to SDF file with back-annotated delays |
| --sdf-corner <min\|typ\|max> | all | Which SDF corner to use (default: typ) |
| --sdf-debug | all | Print unmatched SDF instances for debugging |
| --enable-timing | jacquard sim | Enable timing analysis (arrival + violation checks) |
| --timing-clock-period <ps> | jacquard sim | Clock period in picoseconds (default: 1000) |
| --timing-report-violations | jacquard sim | Report all violations, not just summary |
| --timing-report <path.json> | jacquard sim | Write a structured end-of-run JSON report (schema in src/timing_report.rs, ADR 0008). |
| --timing-summary | jacquard sim | Print a human-readable text summary at end of run. Independent of --timing-report; both can be combined. |
| --timing-report-max-violations <N> | jacquard sim | Cap on the per-cycle violations list in --timing-report. Default 100k. 0 = unbounded. Totals + worst-slack always reflect every event. |
| --liberty <path> | jacquard sim | Liberty library for timing data (optional, falls back to AIGPDK defaults) |
Example: inv_chain_pnr Test Case
# Run with SDF timing
cargo run -r --features metal --bin jacquard -- sim \
tests/timing_test/inv_chain_pnr/6_final.v \
tests/timing_test/inv_chain_pnr/input.vcd \
tests/timing_test/inv_chain_pnr/output.vcd 1 \
--sdf tests/timing_test/inv_chain_pnr/6_final.sdf
Reading Violation Reports
Setup Violation Format
[cycle 42] SETUP VIOLATION at top/cpu/regs[7][bit 22] [word=5]: arrival=900ps setup=200ps slack=-100ps
(WS-P1.1.a, 2026-05-02: state-word indices are now resolved to symbolic
hierarchical signal names. The bare [word=N] suffix is preserved for
grep compatibility. Words packing more than 4 DFFs truncate with a
+N more suffix.)
| Field | Meaning |
|---|---|
| cycle | Simulation cycle where the violation occurred |
| word | State word index — identifies a group of 32 DFF data inputs |
| arrival | Maximum accumulated gate delay to this word's signals (picoseconds) |
| setup | DFF setup time constraint from SDF/Liberty (picoseconds) |
| slack | clock_period - arrival - setup. Negative = violation amount |
Hold Violation Format
[cycle 11] HOLD VIOLATION at top/cpu/state[bit 3] [word=3]: arrival=10ps hold=50ps slack=-40ps
| Field | Meaning |
|---|---|
| cycle | Simulation cycle where the violation occurred |
| word | State word index |
| arrival | Accumulated gate delay to this word's signals (picoseconds) |
| hold | DFF hold time constraint from SDF/Liberty (picoseconds) |
| slack | arrival - hold. Negative = violation amount |
Summary Statistics
At the end of simulation, GEM prints totals:
Simulation complete: 1000 cycles, 5 setup violations, 0 hold violations
Text Summary (--timing-summary)
A one-screen human summary printed to stdout at end of run. Reuses the
same data the JSON report builds (so --timing-report and
--timing-summary cost the same; only the output channel differs).
Sample output:
=== Jacquard Timing Summary ===
Design: my_cpu.gv
Vectors: boot.vcd (1000 cycles)
Clock period: 1000 ps
Timing source: my_cpu.jtir
Violations:
Setup: 5
Hold: 2
Total: 7
Worst slack:
Setup: -150ps at top/cpu/regs[7][bit 22] [word=5] (cycle 87)
Hold: -40ps at top/cpu/state[bit 3] [word=12] (cycle 91)
Top 2 by violation count (of 2 total words with violations):
top/cpu/regs[7][bit 22] [word=5] (5 violations): worst setup=-150ps hold=- arrival=950ps
top/cpu/state[bit 3] [word=12] (2 violations): worst setup=- hold=-40ps arrival=10ps
The format is for human inspection — explicitly not a stable
parseable contract. Tools that need to script against the data should
use --timing-report JSON.
Structured JSON Report (--timing-report <path.json>)
For CI integration and downstream tooling, pass --timing-report <path>
to get an end-of-run JSON document. The schema is versioned (ADR 0008's
stability contract: additive-only extensions, breaking changes bump
the major). Sample at tests/timing_ir/sample_reports/two_violations.json;
authoritative type definitions in src/timing_report.rs.
Top-level shape:
{
"schema_version": "1.0.0",
"metadata": { "design": "...", "cycles_run": 1000, "clock_period_ps": 1000, "...": "..." },
"stats": { "setup_violations": 5, "hold_violations": 0, "events_dropped": 0 },
"violations": [
{ "cycle": 42, "kind": "setup", "word_id": 5, "site": "top/cpu/regs[7][bit 22] [word=5]",
"arrival_ps": 900, "constraint_ps": 200, "slack_ps": -100 }
],
"per_word": [
{ "word_id": 5, "site": "...", "setup_violations": 5, "hold_violations": 0,
"worst_setup_slack_ps": -100, "worst_hold_slack_ps": null, "worst_arrival_ps": 900 }
],
"worst_slack": {
"setup": [ /* top-N most-negative slacks across the run */ ],
"hold": [ /* same shape */ ]
}
}
per_word is sorted by total violation count desc, then by word_id.
worst_slack.setup / .hold are top-10 by closest-to-violation slack
(most negative first). Caveats:
- The "even when no violation occurred" half of WS-P1.1.d (per-DFF closest-to-violation tracking when the design never tripped a violation) needs GPU-side near-miss instrumentation and is not in v1.0.0; for now, worst_slack is populated only from actual violation events.
- --timing-report only produces output today on the Metal sim path. The CUDA / HIP / cosim paths do not currently route runtime violations through process_events; bringing them in is independent plumbing.
- The violations array is capped at 100,000 records by default (~8 MB JSON). Override or disable the cap with --timing-report-max-violations <N> (0 = unbounded). Setup/hold totals, events_dropped, and worst_slack rankings always reflect every observed event; only the per-cycle list is bounded. stats.violations_truncated reports how many records were dropped because the cap was reached.
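The per_word ordering described above (total violation count descending, then word_id) can be reproduced when post-processing a report. The following is a minimal sketch, assuming the records have already been deserialised into a hypothetical PerWord struct whose field names mirror the JSON keys, and assuming the word_id tie-break is ascending. The authoritative types are in src/timing_report.rs and may differ.

```rust
/// Hypothetical mirror of one `per_word` record; field names follow the JSON keys.
#[derive(Debug, Clone, PartialEq)]
struct PerWord {
    word_id: u32,
    setup_violations: u64,
    hold_violations: u64,
}

/// Sort by total violation count (descending), breaking ties by ascending word_id.
/// The ascending tie-break is an assumption of this sketch.
fn sort_per_word(records: &mut [PerWord]) {
    records.sort_by(|a, b| {
        let total_a = a.setup_violations + a.hold_violations;
        let total_b = b.setup_violations + b.hold_violations;
        total_b.cmp(&total_a).then(a.word_id.cmp(&b.word_id))
    });
}

fn main() {
    let mut records = vec![
        PerWord { word_id: 12, setup_violations: 0, hold_violations: 2 },
        PerWord { word_id: 5, setup_violations: 5, hold_violations: 0 },
        PerWord { word_id: 3, setup_violations: 1, hold_violations: 1 },
    ];
    sort_per_word(&mut records);
    let order: Vec<u32> = records.iter().map(|r| r.word_id).collect();
    println!("{:?}", order); // word 5 first, then the tied words 3 and 12
}
```

A CI script would typically read only the head of this list, since the words with the most violations come first.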
Tracing Violations to Source Signals
When you see a violation on a specific word, follow this workflow to identify the offending signals and their logic cone.
1. Get the Word Index
From the log: [word=5] means state word index 5.
2. Map Word to DFF Signals
Each word covers 32 bits of state. The DFFs in that word have data_state_pos / 32 == word_index. To find which DFFs:
- Look at the dff_constraints entries in the FlattenedScriptV1:
  dff_constraints entries where data_state_pos / 32 == 5 → cell_id values → netlist cell names
- In gpu_sim, violations are logged with word IDs that map directly to the output_map positions. Each word covers bit positions word * 32 through word * 32 + 31.
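The word-to-DFF lookup above amounts to one integer division per constraint. A minimal sketch, assuming a hypothetical DffConstraint struct that carries only the two fields needed here (the real FlattenedScriptV1 layout is not reproduced):

```rust
/// Hypothetical subset of a DFF constraint record: just the fields the lookup needs.
struct DffConstraint {
    data_state_pos: u32,
    cell_id: u32,
}

/// Return the cell_ids of all DFFs whose data bit lives in `word_index`.
fn cells_in_word(constraints: &[DffConstraint], word_index: u32) -> Vec<u32> {
    constraints
        .iter()
        .filter(|c| c.data_state_pos / 32 == word_index)
        .map(|c| c.cell_id)
        .collect()
}

/// Inclusive range of state-bit positions covered by one 32-bit word.
fn word_bit_range(word_index: u32) -> (u32, u32) {
    (word_index * 32, word_index * 32 + 31)
}

fn main() {
    // Illustrative values only; cell ids are made up.
    let constraints = vec![
        DffConstraint { data_state_pos: 160, cell_id: 41 },
        DffConstraint { data_state_pos: 182, cell_id: 57 },
        DffConstraint { data_state_pos: 192, cell_id: 63 },
    ];
    println!("word 5 covers bits {:?}", word_bit_range(5));
    println!("cells in word 5: {:?}", cells_in_word(&constraints, 5));
}
```

The resulting cell ids are what you would then map to netlist cell names.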
3. Trace Backwards with netlist_graph
Use the netlist_graph tool to trace the combinational logic cone feeding the DFF. After uv sync --group dev, the netlist-graph console script is on the workspace's uv run path — no cd required:
# Find the DFF data input driver chain
uv run netlist-graph drivers design.v "dff_name.D" -d 10
# Search for DFFs matching a pattern
uv run netlist-graph search design.v "dff_out*"
Discovered signal names can be passed directly into jacquard sim --trace-signals <file> / jacquard cosim --trace-signals <file> (one name per line) to surface them in the output VCD alongside top-level IO.
4. Detailed Timing Analysis with CVC
For per-signal accuracy (no 32-signal approximation), use CVC (open-src-cvc) with SDF back-annotation:
# Run CVC with SDF timing
cvc64 +typdelays tb.v design.v
./cvcsim
CVC provides event-driven simulation with full SDF support (IOPATH + INTERCONNECT delays), allowing you to pinpoint exactly which path is critical.
The Approximation Caveat
GEM tracks one arrival time per 32-signal group (one GPU thread position). The tracked value is the maximum arrival across all 32 signals in that thread. This means:
- Conservative: If any signal in the group has a long path, the arrival for the entire group reflects that worst case. Violations may be reported for signals that individually meet timing.
- Never misses real violations: A real violation always results in a reported violation (the max is >= any individual signal's arrival).
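Both properties follow from taking a maximum. The sketch below illustrates them with a generic deadline (a hypothetical stand-in for whatever arrival limit the check applies); it is not the kernel code.

```rust
/// Arrival tracked for a signal group: the maximum over its members.
fn group_arrival(arrivals_ps: &[u16]) -> u16 {
    arrivals_ps.iter().copied().max().unwrap_or(0)
}

/// For each signal, return (individually late, reported late via the group value).
fn compare(arrivals_ps: &[u16], deadline_ps: u16) -> Vec<(bool, bool)> {
    let tracked = group_arrival(arrivals_ps);
    arrivals_ps
        .iter()
        .map(|&a| (a > deadline_ps, tracked > deadline_ps))
        .collect()
}

fn main() {
    // Illustrative values: one slow signal shares a group with two fast ones.
    let arrivals = [400u16, 950, 610];
    for (i, (real, reported)) in compare(&arrivals, 800).iter().enumerate() {
        println!("signal {}: real={} reported={}", i, real, reported);
    }
}
```

Signals 0 and 2 meet the deadline on their own but are reported because signal 1 does not; no signal that is really late can go unreported, because the tracked value is never below its own arrival.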
Reducing False Positives
If a violation is reported but you suspect it's a false positive from the approximation:
- Use CVC for per-signal accuracy (see Detailed Timing Analysis with CVC above).
- Timing-aware bit packing groups signals with similar arrival times into the same thread, reducing the approximation error. See docs/timing-simulation.md § "Timing-Aware Bit Packing" for details.
Common Scenarios
Setup violations on many words, same cycle: The clock period is likely too tight for the design. The combinational logic depth exceeds what can settle in one clock period. Try increasing the clock period.
Setup violation on a single word: A critical path through one specific logic cone. Use netlist_graph drivers to trace the path and identify the bottleneck.
Hold violation: Rare on the SKY130 process (negative hold times clamp to 0 in the SDF). If seen, the design likely has minimum-delay paths that are too short. Check for direct connections between DFF outputs and nearby DFF inputs with minimal combinational logic.
Violations only on first cycle: The arrival > 0 guard in the GPU kernel skips setup checks when arrival is zero (meaning no data has propagated through combinational logic yet). If you see violations on cycle 0, they are hold violations — setup violations on cycle 0 are suppressed by design.
Timing-model extensions — design notes
Status: Idea / pre-spike. Not scheduled. Captured here so the architecture sketch survives the next session-clear.
Scope: Three related extensions to Jacquard's timing model, all aimed at making setup/hold reporting more honest without abandoning the cycle-accurate boomerang kernel.
- Dynamic delay — per-gate δ(T) inspired by the Involution Delay Model (Maier 2021, arXiv:2107.06814). Captures pulse-width-dependent delay degradation that fixed δ∞ misses on near-threshold paths.
- Clock-tree skew — per-DFF clock arrival accounting. Today every DFF on a clock is treated as if it captures simultaneously; SDF clock-buffer arcs and clock-net interconnect are silently dropped during AIG construction.
- Wire delay at scale — per-receiver interconnect delay applied to the right edge in the AIG, and explicit modelling of inter-partition wires. Today wire delay is collapsed to a max-per-destination-cell scalar — fine for sky130 short routes, increasingly wrong as we move to faster clocks, finer processes, and large many-core/NoC designs.
All three share the same insight: the data the model needs is already in the TimingIR. The work is at the consumer layer (flatten.rs, aig.rs, the kernel arrival math), not the IR or the partitioner.
Background — what the timing pipeline does today
.sdf ─┬─► opensta-to-ir ──► TimingIR (.jtir, FlatBuffers)
.jtir ─┘ │
▼
flatten.rs::load_timing_from_ir (per-cell arc → AIG-pin delay)
│
▼
gate_delays: Vec<PackedDelay> (rise/fall ps per AIG pin)
dff_constraints: Vec<DFFConstraint> (setup/hold ps per DFF)
│
▼
flatten.rs::inject_timing_to_script (bake max ps into u16 script slot)
│
▼
kernel_v1.metal at runtime:
per-AND: new_arr = max(arr_a, arr_b) + gate_delay
per-DFF: check arrival vs setup/hold per word
Reference points:
- IR schema: crates/timing-ir/schemas/timing_ir.fbs
- IR consumer: src/flatten.rs:1768 (load_timing_from_ir), src/flatten.rs:1686 (inject_timing_to_script)
- Setup/hold buffer: src/flatten.rs:1732 (build_timing_constraint_buffer)
- GPU arrival math: csrc/kernel_v1.metal:220-255 (AND gates), csrc/kernel_v1.metal:547-580 (setup/hold)
Per-AIG-pin arrival is a single ushort accumulated by max through the boomerang reduction. There is no event scheduling — arrival is a scalar that rides alongside the Boolean evaluation in lockstep with cycle ticks.
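The per-AND rule can be modelled on the CPU in a few lines. This is a sketch of the arithmetic only, assuming a topologically ordered node list and using a saturating add so the 16-bit value cannot wrap; whether the kernel saturates is an assumption here, and the AndNode type is hypothetical.

```rust
/// Hypothetical AND node: indices of its two fan-ins plus its delay in ps.
struct AndNode {
    a: usize,
    b: usize,
    gate_delay_ps: u16,
}

/// Apply new_arr = max(arr_a, arr_b) + gate_delay to each node in order.
/// Primary-input arrivals occupy the first slots; each node appends one slot.
fn propagate(input_arrivals: &[u16], nodes: &[AndNode]) -> Vec<u16> {
    let mut arr = input_arrivals.to_vec();
    for n in nodes {
        // Saturating add is an assumption of this sketch, not confirmed kernel behaviour.
        let new_arr = arr[n.a].max(arr[n.b]).saturating_add(n.gate_delay_ps);
        arr.push(new_arr);
    }
    arr
}

fn main() {
    // Three inputs arriving at t = 0, feeding a two-gate chain.
    let nodes = [
        AndNode { a: 0, b: 1, gate_delay_ps: 120 },
        AndNode { a: 3, b: 2, gate_delay_ps: 80 },
    ];
    let arr = propagate(&[0, 0, 0], &nodes);
    println!("arrival at chain output: {} ps", arr[4]);
}
```

Each node contributes one scalar and no events are scheduled, which is the property the paragraph above describes.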
Part A — Dynamic delay (IDM-style δ(T))
What IDM is, briefly
A per-gate dynamic delay model that makes δ a function of T (time since the gate's last output transition). The distinguishing property: input pulses with Δᵢ → 0 have diminishing effect on the output. The model handles pulse-width degradation faithfully and is the only model proven to solve the short-pulse-filtration problem. The paper notes ~80–590% CPU overhead vs. inertial delay on a CPU event-driven simulator.
The architectural wall
True IDM needs event scheduling and intra-cycle pulse observability — neither is available in Jacquard's lockstep cycle-accurate kernel. We cannot model glitch suppression or metastability oscillation traces without either sub-cycle ticks or a different kernel architecture.
What we can do is enrich the per-gate delay used in arrival propagation so setup/hold reporting reflects realistic pulse degradation on marginal paths.
Five hook points
| Hook | File | Today | With δ(T) |
|---|---|---|---|
| A Schema | crates/timing-ir/schemas/timing_ir.fbs | rise/fall per arc | + per-cell-type DynamicDelayParams (exp-channel params or piecewise-linear LUT) |
| B IR load | src/flatten.rs:1768 | one PackedDelay per AIG pin | + parallel gate_dyn_delays keyed by originating cell-type via aigpin_cell_origins |
| C Bake | src/flatten.rs:1686 | one u16 ps per thread slot | static-IDM: bake worst-case δ(T) into same slot. dynamic-IDM: reserve second u32 |
| D Kernel arrival | csrc/kernel_v1.metal:220-255 | max(arr_a, arr_b) + gate_delay | + eval_idm(dyn_params, T, edge) via small LUT |
| E Setup/hold | csrc/kernel_v1.metal:547-580 | unchanged math, dumber inputs | unchanged math, smarter inputs |
For dynamic-IDM the kernel needs two new persistent buffers:
- last_transition_ps[aig_pin]: when the gate's output last switched (absolute ps).
- last_value[aig_pin]: to detect transitions across cycles.
Memory cost ~4 bytes per AIG pin per partition. For NVDLA-scale designs (~hundreds of thousands of pins) this is MB-scale — fine.
eval_idm on GPU
The paper uses exp/log per gate. On GPU replace with a 16-entry LUT indexed by quantised T. Cheap, branch-free, smooth enough.
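A 16-entry lookup of this kind might look like the sketch below. The DynDelayLut type, the quantisation step, and every table value are hypothetical placeholders; real parameters would come from characterisation.

```rust
const LUT_ENTRIES: usize = 16;

/// Hypothetical per-cell-type table: delay in ps for each quantised T bucket.
struct DynDelayLut {
    t_step_ps: u32, // width of one T bucket; must be non-zero
    delay_ps: [u16; LUT_ENTRIES],
}

/// Look up the delay for elapsed time T. T beyond the table clamps to the
/// last entry, which plays the role of the long-T plateau.
fn eval_idm(lut: &DynDelayLut, t_ps: u32) -> u16 {
    let idx = ((t_ps / lut.t_step_ps) as usize).min(LUT_ENTRIES - 1);
    lut.delay_ps[idx]
}

/// Placeholder table rising from 60 ps to 120 ps; not characterised data.
fn example_lut() -> DynDelayLut {
    let mut delay_ps = [0u16; LUT_ENTRIES];
    for (i, d) in delay_ps.iter_mut().enumerate() {
        *d = 60 + 4 * i as u16;
    }
    DynDelayLut { t_step_ps: 25, delay_ps }
}

fn main() {
    let lut = example_lut();
    for t in [0u32, 50, 10_000] {
        println!("T = {} ps -> delay {} ps", t, eval_idm(&lut, t));
    }
}
```

The index clamp replaces a bounds branch, so the lookup costs one divide, one min, and one load per gate.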
Characterisation
The δ(T) parameters have to come from per-cell SPICE characterisation. For sky130 we'd characterise each sky130_fd_sc_hd__*_* cell once, check the result into the repo, and ship it as a sidecar table consumed by the IR builder. This is the expensive one-off — the paper flags characterisation cost as the unsolved part of making IDM "truly competitive."
Staged plan
| Stage | What | Touches | Kernel | Effort | Win |
|---|---|---|---|---|---|
| 1 Static IDM | Bake worst-case δ(T) into existing u16 slot using STA pulse-width estimates | A, B, C | None | 1–2 days | Better setup/hold on marginal paths |
| 2 Dynamic δ(T) | Add last_transition_ps buffer + LUT eval | All | Lines 220–255 | 1–2 weeks | Pulse-degradation-aware arrivals end-to-end |
| 3 Sub-cycle ticks | Multiple arrival propagations per logical cycle | Whole kernel | Major | Months | True IDM glitch behaviour. Probably not worth it for Jacquard's positioning. |
Stage 1 is a 1–2 day spike with no kernel risk. Stage 2 is the honest implementation. Stage 3 is a different simulator.
What we get / don't get from dynamic δ(T)
Achievable
- Per-corner δ(T) propagating through arrival → setup/hold reports that distinguish "just meets timing under δ∞" from "fails under realistic pulse degradation".
- Stays inside cycle-accurate boomerang. ~1.5–2× memory growth on arrival data, ~10–20% kernel slowdown (estimate).
Not achievable
- Glitch suppression (Δᵢ → 0 → no transition).
- Metastable oscillation traces.
- Combinational-loop behaviours (loops are forbidden in the AIG anyway).
Why sky130 is the right vehicle
sky130_pdk.rs decomposes vendor functional Verilog into AIG nodes while preserving cell identity through aigpin_cell_origins. We can attach δ(T) at the original sky130 cell granularity even after AIG flattening — that structural property is what makes any of this tractable. Cells from a hand-coded library without origin tracking would be much harder.
Part B — Clock-tree skew
Status: Stages 1 + 2 implemented (2026-05-01). Per-DFF clock arrival is carried through the IR (ClockArrival table) and folded into per-DFF setup/hold via DFFConstraint::effective_setup_hold before the per-word collapse. Producer landed in c403cc8; consumer fold-in in 6767c3e. The narrative below describes the original motivation; the Staged plan at the end of this part records what shipped and what remains (Stage 3, conditional).
Where the information is — and where we drop it
Clocks in Jacquard are walked back from each DFF through buffers/inverters/clock-gates, terminating at an InputClockFlag(pinid, is_negedge) (src/aig.rs:441, :477, :495-560). Recognised cells: INV/BUF/CKLNQD and the sky130 equivalents inv*, clkinv*, buf*, clkbuf*, clkdlybuf*, lpflow_*.
Two consequences:
- Clock-tree cells produce no AIG pin. They collapse into a polarity flag on the DFF. Since aigpin_cell_origins only lists cells that produced AIG pins, the timing-IR arcs on those cells (IOPATH records on clkbuf_8, etc.) match no AIG pin in load_timing_from_ir and are silently discarded.
- Clock-net interconnect is dropped the same way. interconnect_delays records keyed by net endpoints have no destination cell to attach to, so they fall on the floor.
Net effect: every DFF on a given clock domain is treated as having identical clock arrival, i.e. zero skew. The current setup/hold check is honest about combinational-path delay but blind to clock-tree topology.
For a sky130 MCU SoC at ~25 ns clock period this is fine functionally; for any timing claim near the period boundary it's misleading. Intra-domain clock-tree skew on sky130 is typically O(50–200 ps) — small relative to a 25 ns period, but exactly the order of magnitude that determines whether a path "barely meets" or "barely fails" setup.
Do we have the information?
Yes, in three places, in increasing fidelity:
1. TimingIR arcs on clock cells (.jtir already contains them; we just don't consume them).
2. The AIG clock walk in aig.rs:495–560 already iterates the clock-side cells of each DFF in order. It just doesn't accumulate their delays. Adding a dff_clock_origins: Vec<Vec<cellid>> parallel structure costs O(num_dffs × clock_depth) memory, which is negligible.
3. OpenSTA can compute per-DFF clock arrival end-to-end. (OpenTimer was the original primary STA candidate per ADR 0003 but the spike Superseded it; ADR 0001 makes OpenSTA the sole STA path, called out of process via opensta-to-ir.) Per-pair common-path-pessimism removal (CRPR) is fundamentally a launch/capture credit, not a per-DFF property, so what shipped is per-DFF capture-side arrival, treating launch as a 0-reference. This is the form in the IR today:

   table ClockArrival {
     cell_instance: string;    // DFF instance path
     clk_pin: string;          // local pin name
     arrival: [TimingValue];   // per-corner clock arrival ps
     provenance: Provenance;
   }

   Populated by opensta-to-ir's Tcl driver via [all_registers -clock_pins] + [::sta::vertex_worst_arrival_path]. Consumer code never touches the netlist; it just looks up each DFF's clock arrival.
Consumer change (shipped)
DFFConstraint carries the field now:
pub struct DFFConstraint {
    pub setup_ps: u16,
    pub hold_ps: u16,
    pub clock_arrival_ps: i16, // signed: capture-side arrival, launch ref = 0
    pub data_state_pos: u32,
    pub cell_id: u32,
}
The setup/hold formula for per-pair skew is:
- Setup margin = (clock_period + clock_arr_capture - clock_arr_launch) - data_arrival - setup
- Hold margin = data_arrival - (clock_arr_capture - clock_arr_launch + hold)
Per-launch/per-capture pairing is awkward in the current per-word-collapsed constraint buffer, so the implementation folds the capture-side clock arrival into the per-DFF effective setup/hold before packing, via DFFConstraint::effective_setup_hold:
- effective_setup = setup - clock_arrival_capture (clamped to [0, u16::MAX])
- effective_hold = hold + clock_arrival_capture (clamped to [0, u16::MAX])
The GPU kernel runs unchanged — the same packed (setup<<16)|hold word it already consumes now carries skew-aware values. Launch arrival is treated as zero (ref) — pessimistic for paths whose launch DFF also has a long clock path, but a clean first cut. Stage 3 below addresses that pessimism if measurement justifies it.
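The fold-in is small enough to show in full. This is a sketch written as a free function with an assumed signature; the shipped code is the method DFFConstraint::effective_setup_hold in src/flatten.rs, whose exact form is not reproduced here.

```rust
/// Fold capture-side clock arrival into setup/hold, clamping to the u16 range.
/// Free-function form and signature are assumptions of this sketch.
fn effective_setup_hold(setup_ps: u16, hold_ps: u16, clock_arrival_ps: i16) -> (u16, u16) {
    let clamp = |v: i32| v.clamp(0, i32::from(u16::MAX)) as u16;
    let arr = i32::from(clock_arrival_ps);
    (
        clamp(i32::from(setup_ps) - arr), // later capture clock relaxes setup
        clamp(i32::from(hold_ps) + arr),  // and tightens hold by the same amount
    )
}

/// Per-pair setup margin with launch arrival fixed at 0 (all values in ps).
fn setup_margin(period: i32, capture_arr: i32, data_arrival: i32, setup: i32) -> i32 {
    (period + capture_arr) - data_arrival - setup
}

fn main() {
    let (eff_setup, eff_hold) = effective_setup_hold(200, 50, 120);
    println!("effective setup={} ps hold={} ps", eff_setup, eff_hold);

    // With launch = 0 and no clamping, both routes give the same margin.
    let direct = setup_margin(1000, 120, 900, 200);
    let folded = 1000 - 900 - i32::from(eff_setup);
    println!("margin direct={} ps folded={} ps", direct, folded);
}
```

The clamp matters when the clock arrival exceeds the setup time or is negative enough to push hold below zero; in those cases the folded value no longer matches the per-pair formula exactly.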
Partitioning question
"could we partition a design effectively to do this somewhat accurately without sacrificing too much?"
Today partitioning (src/repcut.rs) is hypergraph-cut on logic connectivity. DFFs co-located by logic affinity may have very different clock arrivals.
The pessimism cost: build_timing_constraint_buffer collapses all DFFs in a 32-bit state word to min(setup) and min(hold). If a word holds DFFs with clock arrival 50 ps and 200 ps, the per-word effective setup is the worst of both — i.e. we report timing as if every DFF in that word saw the worst skew in the word. That's a 150 ps pessimism for the lucky DFF.
Three options, ranked:
1. Do nothing. For typical sky130 SoCs at ≥10 ns clock periods, intra-word skew (≤200 ps worst-case) vs. period (10 000+ ps) is ≤2%. Worth-it threshold for the optimisation: when designs run close enough to the period that 2% pessimism flips genuine passes into reported violations. Likely never for sky130. Plausibly relevant for designs running at ≥1 GHz on a more aggressive PDK.
2. Skew-bucket the DFF constraint packing, not the partitioning. Group DFFs into clock-arrival buckets after partitioning, and emit one constraint word per bucket-within-partition rather than collapsing everything in the word. Increases constraint-buffer size by O(num_buckets) but doesn't disturb the partitioner. Probably the right answer if we ever need to.
3. Skew-aware partitioning. Add a soft objective to repcut.rs that prefers grouping DFFs by clock arrival. Degrades cut quality (more inter-partition logic edges → more state shuffling). Almost certainly worse than option 2 for the same accuracy gain.
So: yes we have the info, no we probably don't need to repartition, and the constraint-collapsing pessimism is the real lever — either accept it (option 1) or break it bucket-wise (option 2).
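Option 2 is essentially a group-by on quantised clock arrival. The sketch below shows only the grouping step; the Dff struct, the bucket width, and the (partition, bucket) key are hypothetical, and nothing like this exists in the codebase yet.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-DFF record for the sketch.
struct Dff {
    cell_id: u32,
    partition: u32,
    clock_arrival_ps: i16,
}

/// Group DFFs by (partition, clock-arrival bucket).
/// `bucket_ps` is an illustrative tuning knob and must be positive.
fn bucket_dffs(dffs: &[Dff], bucket_ps: i16) -> BTreeMap<(u32, i16), Vec<u32>> {
    let mut buckets = BTreeMap::new();
    for d in dffs {
        // div_euclid keeps negative arrivals in the bucket below zero.
        let bucket = d.clock_arrival_ps.div_euclid(bucket_ps);
        buckets
            .entry((d.partition, bucket))
            .or_insert_with(Vec::new)
            .push(d.cell_id);
    }
    buckets
}

fn main() {
    // The 50 ps / 200 ps pair from the text, plus one more DFF near 200 ps.
    let dffs = [
        Dff { cell_id: 1, partition: 0, clock_arrival_ps: 50 },
        Dff { cell_id: 2, partition: 0, clock_arrival_ps: 200 },
        Dff { cell_id: 3, partition: 0, clock_arrival_ps: 230 },
    ];
    for (key, cells) in bucket_dffs(&dffs, 100) {
        println!("partition {} bucket {}: {:?}", key.0, key.1, cells);
    }
}
```

With a 100 ps bucket the 50 ps DFF no longer shares a constraint with the 200 ps ones, so the residual pessimism inside any bucket is bounded by the bucket width.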
Staged plan for clock tree
| Stage | What | Touches | Kernel | Status |
|---|---|---|---|---|
| 1 Capture clock-tree delay | Add ClockArrival IR table; populate from opensta-to-ir | IR schema, opensta-to-ir/builder + Tcl | None | Shipped — c403cc8 |
| 2 Apply to setup/hold | Fold capture-side arrival into DFFConstraint; existing kernel check now skew-aware | src/flatten.rs DFFConstraint, effective_setup_hold, build_timing_constraint_buffer | None | Shipped — 6767c3e |
| 3 (conditional) Bucketed packing | Per-bucket constraint words to remove the per-word min(setup, hold) collapse pessimism; kernel reads the right bucket per DFF | src/flatten.rs:1722-1761, kernel constraint indexing | Minor | Open — land only if measurement shows the per-word collapse materially over-reports violations |
Part C — Wire delay at scale
Why this gets more important as designs grow
In sky130 at 25 ns clock periods, wire delay is a small perturbation on gate delay and the lumped model is fine. The picture changes in three regimes:
- Faster clocks. Wire delay is a fixed physical quantity (RC-dominated); period shrinks; wire fraction of the budget grows.
- Finer processes (e.g. 22nm and below). Gate delays scale down with feature size; wire RC scales unfavourably (resistance per square goes up, capacitance per length stays roughly flat). The classic "reverse scaling" inflection: gates get faster, long wires don't. Typical 22nm: inverter delay 5–15 ps, local short wires 5–20 ps, global routes 50–500 ps, multi-mm wires 1+ ns without repeaters.
- Large many-core/NoC SoCs. Inter-tile mesh links can span multiple millimetres; chip-level signals have wire delays comparable to or larger than entire combinational stages.
For a many-small-core NoC at 22nm, wire delay on inter-core links is typically the dominant timing factor. Any model that can't represent it accurately will misreport the critical paths.
What Jacquard does today
The IR side is already in shape. crates/timing-ir/schemas/timing_ir.fbs carries InterconnectDelay { net, from_pin, to_pin, delay[corner] } per receiver, and opensta-to-ir populates it from SDF.
The lossy step is the consumer in src/flatten.rs:1850-1872:
let mut wire_delays_per_cell: HashMap<usize, (u64, u64)> = HashMap::new();
// ... for each InterconnectDelay record:
let entry = wire_delays_per_cell.entry(dest_cellid).or_insert((0, 0));
entry.0 = entry.0.max(d); // rise
entry.1 = entry.1.max(d); // fall (same value!)
Three layers of pessimism stacked here:
- Keyed by destination cell, not destination pin. A cell with two inputs from very different routes loses per-pin fidelity.
- Max across inputs of the same cell. Worst-case incoming wire is applied to every output of the cell.
- No rise/fall distinction on wire delay. SDF carries both; we collapse to one number.
Then in arrival propagation (csrc/kernel_v1.metal:220-255):
new_arr = max(arr_a, arr_b) + gate_delay
where gate_delay = intrinsic + max_wire_into_cell. The mathematically correct propagation is:
new_arr = max(arr_a + wire_a, arr_b + wire_b) + intrinsic
These are equivalent only when the input with the worst arrival also has the worst wire. When they don't coincide — common on a NoC node where one input comes from a long mesh hop and another from local logic — the current model over-reports by max_wire − actual_wire_on_critical_input.
For sky130 small designs this gap is in the noise. For 22nm with 10× variation between local and global wire delays, it's the difference between "this path meets timing" and "STA reports a violation that doesn't exist."
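The gap between the two formulas is easy to reproduce numerically. A minimal sketch with made-up values for a two-input gate where the late input is local and the early input arrives over a long wire:

```rust
/// One gate input: data arrival at the driver plus its own wire delay (ps).
struct Input {
    arrival_ps: u32,
    wire_ps: u32,
}

/// Current model: the worst incoming wire is folded into the gate delay.
fn lumped(a: &Input, b: &Input, intrinsic_ps: u32) -> u32 {
    a.arrival_ps.max(b.arrival_ps) + intrinsic_ps + a.wire_ps.max(b.wire_ps)
}

/// Per-input model: each wire is added to its own input before the max.
fn per_input(a: &Input, b: &Input, intrinsic_ps: u32) -> u32 {
    (a.arrival_ps + a.wire_ps).max(b.arrival_ps + b.wire_ps) + intrinsic_ps
}

fn main() {
    let local = Input { arrival_ps: 800, wire_ps: 10 };     // critical input, short wire
    let mesh_hop = Input { arrival_ps: 300, wire_ps: 200 }; // early input, long wire
    let lumped_arr = lumped(&local, &mesh_hop, 50);
    let exact_arr = per_input(&local, &mesh_hop, 50);
    println!(
        "lumped={} ps per-input={} ps over-report={} ps",
        lumped_arr,
        exact_arr,
        lumped_arr - exact_arr
    );
}
```

The over-report equals the worst wire minus the wire on the critical input (200 − 10 = 190 ps in this example), matching the expression in the text.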
Inter-partition wires — the architectural wrinkle
A NoC tile naturally maps to one (or a few) partition(s). The inter-tile links — the long, wire-dominated, timing-critical ones — are precisely the partition-crossing signals. Today wire delay sits on the destination cell's gate_delays slot, evaluated inside the destination partition's boomerang reduction. The wire is a property of the crossing, not the destination cell, and should ideally be modelled at the partition I/O boundary, where src/sim/cosim_metal.rs already shuffles state between partitions.
This is the inverse alignment of the clock-tree case. There partitioning didn't help with skew accounting. Here partitioning is load-bearing: tile-aligned partitions naturally expose the small set of edges that deserve careful wire-delay modelling, and let intra-partition logic stay on the fast lumped path.
Three fidelity tiers
| Tier | Model | Where wire delay lives | When it's enough |
|---|---|---|---|
| 0 (current) | One scalar per destination cell, max-collapsed | Folded into gate_delays[output_pin] of dest cell | sky130 + ≥10 ns periods + small designs |
| 1 Per-receiver | One scalar per (from_pin, to_pin) edge in the AIG | Folded into the source AIG pin's gate_delay, with one entry per fanout target | Local wires in faster designs; intra-tile NoC logic |
| 2 Per-edge with inter-partition arcs | Tier 1 + explicit wire delay on partition-crossing signals | Tier 1 + new arrival-bump applied during cosim_metal.rs state shuffle | Long routes + many-core/NoC + 22nm-scale processes |
Tier 1 is mostly a flatten.rs rewrite. Tier 2 needs cosim_metal.rs extension and a new field in the inter-partition transfer format.
Information availability
Yes, it's there:
- InterconnectDelay records exist per receiver. SDF carries them. opensta-to-ir emits them.
- Per-input-pin granularity is in the IR (to_pin includes the local pin name). The consumer just discards it via to_pin.rfind('/') to derive dest_inst.
- Rise/fall distinction is in the schema (delay: [TimingValue] per corner; rise/fall could be on top via the same pattern as TimingArc). For SDF-back-annotated flows the rise/fall split usually comes from the SDF; we'd need to confirm opensta-to-ir preserves both edges.
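Keeping the pin name instead of discarding it is a one-line change to the split. A sketch, assuming to_pin has the form instance-path, slash, pin name (the example path is invented):

```rust
/// Split "inst/path/PIN" at the last '/' into (instance, pin).
/// Returns None when the string contains no '/'.
fn split_to_pin(to_pin: &str) -> Option<(&str, &str)> {
    let idx = to_pin.rfind('/')?;
    Some((&to_pin[..idx], &to_pin[idx + 1..]))
}

fn main() {
    // Hypothetical receiver pin path.
    match split_to_pin("top/cpu/u_alu/_123_/A") {
        Some((inst, pin)) => println!("dest_inst={} pin={}", inst, pin),
        None => println!("no instance separator found"),
    }
}
```

The current consumer keeps only the first half of this pair; a per-receiver model would key on both.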
What's missing today:
- Tier-1 plumbing: AIG-pin-level wire delay per fanout. Current gate_delays: Vec<PackedDelay> is keyed by AIG pin (the output side); to do per-input-edge correctly we want delay attached to the edge, not the node. Either add a parallel wire_delays: HashMap<(src_aigpin, dst_aigpin), PackedDelay> or refactor toward an edge-attributed AIG.
- Tier-2 plumbing: a "partition-crossing arc" concept in cosim_metal.rs. Currently inter-partition state shuffle moves bits with no associated arrival bump. Adding a per-edge ps adjustment is straightforward in principle; finding the right place in the shuffle pipeline matters.
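The parallel-map variant of the Tier-1 plumbing could be sketched as below. WireDelay stands in for PackedDelay (whose real layout is not shown in this document), and record_edge is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Stand-in for PackedDelay: separate rise and fall wire delay in ps.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct WireDelay {
    rise_ps: u64,
    fall_ps: u64,
}

/// (source AIG pin, destination AIG pin)
type Edge = (usize, usize);

/// Record one interconnect delay on its own edge. The max is taken only
/// across duplicate records for the same edge, never across different inputs.
fn record_edge(map: &mut HashMap<Edge, WireDelay>, src: usize, dst: usize, rise_ps: u64, fall_ps: u64) {
    let e = map.entry((src, dst)).or_default();
    e.rise_ps = e.rise_ps.max(rise_ps);
    e.fall_ps = e.fall_ps.max(fall_ps);
}

fn main() {
    let mut wire_delays = HashMap::new();
    // Two inputs of the same destination pin 9: one local, one long route.
    record_edge(&mut wire_delays, 1, 9, 10, 12);
    record_edge(&mut wire_delays, 2, 9, 200, 180);
    println!("{:?}", wire_delays.get(&(1, 9)));
    println!("{:?}", wire_delays.get(&(2, 9)));
}
```

Compared with the per-cell map, the two inputs keep distinct delays and rise/fall stay separate, at the cost of one entry per fanout edge.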
IR scale
The IR-size concern bites here. InterconnectDelay is roughly 100–200 bytes per record; a 22nm SoC with 10⁶–10⁷ nets is a .jtir file in the hundreds-of-MB to multi-GB range.
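The range quoted above is a straightforward product. The sketch below reproduces it under the simplifying assumption of one record per net; since InterconnectDelay is per receiver, multi-fanout nets would push the real figure higher.

```rust
/// Estimated IR payload in bytes for a given record count and record size.
fn ir_bytes(records: u64, bytes_per_record: u64) -> u64 {
    records * bytes_per_record
}

fn main() {
    // Bounds taken from the text: 10^6 to 10^7 nets, 100 to 200 bytes per record.
    let low = ir_bytes(1_000_000, 100);
    let high = ir_bytes(10_000_000, 200);
    println!("low: {} MB, high: {} MB", low / 1_000_000, high / 1_000_000);
}
```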
Mitigations:
- Streaming load: today TimingIrFile::from_path reads the whole buffer. Could mmap and lazy-decode, since FlatBuffers is offset-based.
- Sharding: split IR per partition or per top-level module. Adds a build-time step but bounds memory per process.
- Drop intra-cell wires from IR generation: SDF often has microscopic interconnect records that lump into the destination's own pin-cap. Filter these out at the opensta-to-ir builder. Loss is genuinely negligible.
Worth measuring before committing to mitigations — sky130 NVDLA-scale today is fine; the question is what 22nm + N-tile mesh looks like.
Partitioning question — the other direction
For NoC designs partitioning becomes a positive lever (unlike the clock-tree case where it was neutral). Two specific levers:
- Tile-aligned partitions. If repcut.rs finds tile-aligned cuts naturally (likely, given typical tile-to-tile connectivity sparsity), inter-partition arcs are a small, well-defined set of NoC links. Worth verifying with a representative design: a partitioning report keyed by signal name pattern (*_link_*, noc_*, configurable) would expose whether the partitioner's logic-affinity score is already aligned with tile boundaries or whether we need to bias it.
- NoC-link partitioning hint. Add a soft bias to repcut that prefers cutting nets matching a configured regex. Same partitioning machinery, configurable input. Cost: degrades cut quality if the hint conflicts with logic affinity. Likely worth it for explicitly tile-decomposed designs where the user knows the tile boundaries; not worth it for flat designs.
The point of any of this is to make Tier-2 cheap: if the inter-partition arc set is small, per-edge wire delay on those crossings costs almost nothing.
Crosstalk and OCV
These are upstream concerns. SDF from a crosstalk-aware STA flow already carries pessimistic delays; OCV (on-chip variation) is similarly baked into the chosen corner. Jacquard consumes whatever the IR was generated against. Worth a one-line note in the user-facing docs that the timing report's accuracy is bounded by the SDF/STA flow it was built from — Jacquard does not invent crosstalk pessimism.
Staged plan for wire delay
| Stage | What | Touches | Kernel | Effort |
|---|---|---|---|---|
| 1 Per-receiver consumption | Key wire delay by (src_aigpin, dst_aigpin) edge; fold into source AIG pin's gate_delay per fanout | src/flatten.rs:1850-1872, possibly src/aig.rs for fanout tracking | None | 3–5 days |
| 2 Rise/fall distinction | Preserve per-edge rise/fall through the consumer; honour both in PackedDelay accumulation | src/flatten.rs:1850-1914 | None | 1–2 days |
| 3 Inter-partition arc delay | New per-crossing wire-delay table; arrival bump applied during inter-partition state transfer | src/sim/cosim_metal.rs shuffle path; src/flatten.rs partition-boundary metadata | Yes (transfer path) | 2–3 weeks |
| 4 IR scale plumbing | Streaming/mmap load; opensta-to-ir filtering of microscopic records | src/sim/timing_ir_loader.rs, opensta-to-ir/builder | None | 1 week (gated on measurement) |
| 5 NoC-aware partitioning | Soft bias in repcut for cutting flagged nets; partition report by tile | src/repcut.rs and CLI flags | None | 1–2 weeks |
For a sky130 use case Stage 1+2 likely covers everything you'd notice. For 22nm NoC, Stages 1–3 are the meaningful set; Stage 5 is the optimisation that makes Stage 3 cheap.
What we get / don't get
Achievable
- Setup/hold accuracy on long routes that today gets clobbered by max-collapse pessimism.
- Honest reporting on NoC inter-tile links — the paths that actually matter for many-core SoC timing closure.
- All of the above without changing Jacquard's cycle-accurate kernel architecture.
Not achievable from this work alone
- Crosstalk-driven delay uncertainty (handled upstream in STA).
- Variation-aware (statistical) timing — would need OCV-corner sweeping or SSTA, neither of which is on the roadmap.
- Process variation modelling beyond the corners the SDF/IR was generated against.
Open questions
- δ(T) characterisation cost. One-off SPICE per cell-type per corner. Cheaper if we lean on existing ECSM/CCSM data already in vendor Liberty rather than re-running SPICE. Worth investigating before committing to Stage 2.
- Whose clock arrival is authoritative? Resolved by Pillar B Stage 1+2: OpenSTA-computed per-DFF arrival via opensta-to-ir, treating launch as 0-reference. Per-pair CRPR credit is intentionally not modelled at this stage (see Stage 3 in the staged plan above).
- Interaction. Does δ(T) on clock-tree buffers matter? Probably not enough to model: clock buffers are sized for fast edges and operate far from their pulse-degradation regime. But the framework should be able to express "ignore δ(T) on clock domain" cleanly.
- Validation oracle. CVC and Icarus already serve as functional oracles; for skew-aware and wire-aware reporting OpenSTA's slack report (via opensta-to-ir / direct subprocess) is the ground truth for unit tests. (ADR 0003 originally nominated OpenTimer for this role; superseded by the spike outcome, so OpenSTA carries the role end-to-end now.)
- IR size at 22nm scale. Open question whether .jtir for a representative many-core NoC fits in available memory under the current eager-load model. Needs measurement before committing to streaming mitigations.
- Edge-attributed AIG. Per-receiver wire delay wants delay attached to AIG edges, not nodes. Today the AIG is node-attributed (gate_delays: Vec<PackedDelay> indexed by aigpin). A clean Tier-1 implementation may push toward edge attribution, with downstream effects on the boomerang reduction script layout. Worth a small spike before the main implementation.
- Partition-crossing format. Adding per-edge wire delay to cosim_metal.rs inter-partition transfers needs a precise place in the existing pipeline. Currently the shuffle moves Boolean state words without arrival; the natural place is alongside the writeout-arrival path that already exists for setup/hold checking, but the alignment isn't 1:1 because partition crossings happen at logic boundaries, not capture-DFF boundaries.
Related artefacts
- docs/timing-correctness.md: forward-looking validation contract; this doc extends rather than replaces.
- docs/timing-simulation.md: boomerang architecture; the kernel-side context.
- docs/timing-validation.md: current ±5% acceptance criteria; would tighten under δ(T).
- docs/adr/0002-timing-ir.md: IR design rationale; schema additions here follow the "lossless extension" principle.
- docs/adr/0001-opensta-as-oracle.md: STA path; OpenSTA out of process is committed (post-supersedure of ADR 0003).
- docs/adr/0003-opentimer-primary-sta.md: Superseded. Original in-process STA proposal; spike Q2 fail moved Jacquard to OpenSTA-only. See docs/spikes/opentimer-sky130.md.
In-Design Signal Tracing (--trace-signals)
Overview
By default a Jacquard output VCD contains only top-level IO.
--trace-signals <FILE> surfaces user-selected internal nets in that
VCD alongside the top-level ports — so you can watch a DFF's Q, a
controller state bit, or an SRAM port wire without re-synthesizing or
exposing it as a port.
It is available on both jacquard sim and jacquard cosim, and is
observe-only: traced nets are read out each tick, never driven.
Each name in the file is resolved against the netlist, registered as a primary output before partitioning (so it gets a state-buffer slot), and emitted on the same path as the top-level IO. It works uniformly for sequential (DFF Q) and combinational nets — anything that has a name in the netlist database.
This is the raw-wire counterpart to
bus transaction tracing: --trace-signals gives you
per-cycle waveforms of individual nets; bus tracing gives you decoded
transaction records. Use this when you want a waveform; use bus tracing
when you want READ 0x40 => 0x1.
File format
One hierarchical signal name per line:
# JTAG debug-module state (comments and blank lines are ignored)
chip_core.dm.haltreq_q[0]
chip_core.dm.haltreq_q[1]
# Yosys-internal nets — same syntax works
chip_core.sram_u._00147_
# A whole bus, one bit per line
data0_obs[0]
data0_obs[1]
- Blank lines and lines whose first non-whitespace character is # are skipped.
- Hierarchy uses . as the separator; a trailing [N] selects a bus bit.
- A leading backslash (Verilog escaped-identifier syntax) is stripped.
A real example ships in tests/jtag_minimal/trace_signals.txt.
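The three format rules can be captured in a small parser. This is a sketch of the rules as stated, with hypothetical TraceName and function names; Jacquard's own loader may treat the trailing [N] differently, since the resolver considers more than one interpretation of each name.

```rust
/// One parsed entry: hierarchical path plus optional bus-bit index.
#[derive(Debug, PartialEq)]
struct TraceName {
    path: String,
    bit: Option<u32>,
}

/// Parse one non-comment line: strip a leading backslash, then split a trailing [N].
fn parse_name(line: &str) -> TraceName {
    let name = line.strip_prefix('\\').unwrap_or(line);
    if let Some(stem) = name.strip_suffix(']') {
        if let Some(open) = stem.rfind('[') {
            if let Ok(bit) = stem[open + 1..].parse::<u32>() {
                return TraceName { path: stem[..open].to_string(), bit: Some(bit) };
            }
        }
    }
    TraceName { path: name.to_string(), bit: None }
}

/// Parse a whole trace file, skipping blank lines and lines starting with '#'.
fn parse_trace_file(text: &str) -> Vec<TraceName> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(parse_name)
        .collect()
}

fn main() {
    let text = "# debug-module state\nchip_core.dm.haltreq_q[1]\n\nchip_core.sram_u._00147_\n";
    for entry in parse_trace_file(text) {
        println!("{:?}", entry);
    }
}
```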
Name resolution
Post-synthesis net names are ambiguous — Yosys may flatten a hierarchy
into one escaped identifier (\soc.sram.read_port__data), expand a bus
into per-bit scalars (soc.bus__addr[3]), or preserve real structural
hierarchy. Rather than guess, the resolver tries multiple candidate
interpretations of each name and takes the first that matches the netlist
database, so the same syntax works across all three conventions.
- Unresolved names warn, they don't abort. A bad name logs a warning and is skipped; the rest of the list still registers. A trailing summary line reports how many signals registered vs. were dropped, so a mistyped list surfaces clearly at startup:
  --trace-signals: registered 34 signal(s), dropped 2 (file: trace.txt)
- Names that resolve to a constant (tied 0/1) are skipped: there's nothing to observe at runtime.
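The first-match behaviour can be pictured with the sketch below. It is a toy model under loud assumptions: the netlist is modelled as a set of hypothetical `(hierarchy, leaf, bit)` triples, and the candidate list and its order are invented for illustration; the real candidates live in src/sim/trace_signals.rs.

```python
def candidates(name):
    """Hypothetical candidate spellings for one trace-file name, in trial order."""
    base, bit = name, None
    if name.endswith("]") and "[" in name:
        head, _, idx = name[:-1].rpartition("[")
        if idx.isdigit():
            base, bit = head, int(idx)
    path = base.split(".")
    return [
        (tuple(path[:-1]), path[-1], bit),  # real structural hierarchy
        ((), base, bit),                    # hierarchy flattened into one identifier
        ((), name, None),                   # bus expanded into per-bit scalars
    ]


def register_all(names, netlist):
    """Resolve every name; unresolved names are dropped, never fatal."""
    registered, dropped = {}, []
    for name in names:
        hit = next((c for c in candidates(name) if c in netlist), None)
        if hit is None:
            dropped.append(name)  # warn and continue with the rest of the list
        else:
            registered[name] = hit
    return registered, dropped
```

The two return values correspond to the "registered" and "dropped" counts in the startup summary line.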
Where the output lands
Traced nets appear in whichever VCD the run already emits:
| Command | Flag | Traced nets appear in |
|---|---|---|
| jacquard sim | --trace-signals <FILE> | the output VCD |
| jacquard cosim | --trace-signals <FILE> | the --output-vcd output only |
They show up as ordinary VCD wires next to the top-level IO, named by the string you put in the trace file.
cosim: traced nets land in --output-vcd only. The --stimulus-vcd carries primary inputs and does not include them, so if you trace a net and look in the stimulus VCD you'll see nothing. --output-vcd does not require timing data; see Pre-PnR functional runs.
Pre-PnR functional runs
--output-vcd is the functional output path too — it does not require
--timing-ir/SDF. Run a synthesized (pre-PnR) netlist through cosim
with --output-vcd out.vcd and you get chip outputs and traced nets per
cycle, with transitions at clock edges (no arrival-time offsets). This is
the right mode for functional / 4-state X-pessimism debugging, where
there is no timing data to supply yet. Adding --timing-ir later only
adds arrival-time offsets to the same VCD.
Top-level inout (bidir) pads
A top-level inout pad is split into two observables in the output VCD:
<pad>__out (the value the core drives) and <pad>__oe (the pad's
output-enable). The raw <pad> net reads the pad's input side, so on
an output-only or undriven cycle it can look flat — watch <pad>__out /
<pad>__oe to see what the design is driving. Example:
bidir_PAD[12]__out, bidir_PAD[12]__oe. These appear automatically;
you don't need to list them in the trace file.
Finding signal names
Use the netlist-graph tool (see the project README) to discover the
exact post-synthesis names:
# Search for nets matching a pattern
uv run netlist-graph search <netlist.v> "haltreq"
# Trace what drives / loads a signal (to find nearby observable nets)
uv run netlist-graph drivers <netlist.v> "soc.cpu.state" -d 5
uv run netlist-graph loads <netlist.v> "soc.cpu.ack" -d 5
# Emit a ready-to-use trace file
uv run netlist-graph watchlist <netlist.v> out.json signal1 signal2 ...
SRAM observability workflow
The recommended way to observe SRAM port activity is wire-level tracing
rather than the env-var-gated JACQUARD_SRAM_DUMP. netlist-graph can
discover the port wires and emit a trace file directly:
# 1. Discover SRAM port wire names from the netlist
uv run netlist-graph sram-ports design.v --cell-type SRAM -o sram_trace.txt
# 2. Surface them in the VCD with full per-tick accuracy
jacquard cosim design.v --config sim.json \
--trace-signals sram_trace.txt --output-vcd out.vcd
# 3. Post-process the VCD to reconstruct bus values
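Step 3 is left to the user. As a rough sketch, assuming you have already sampled the per-bit traced nets at one tick into a dict (how you read the VCD is up to your tooling), the bus values can be reassembled like this. The function name `assemble_buses` is hypothetical, and bits missing from the sample are assumed to be 0.

```python
import re
from collections import defaultdict

# Matches names of the form base[N], as written one bit per line in the trace file.
BIT_RE = re.compile(r"^(?P<base>.+)\[(?P<idx>\d+)\]$")


def assemble_buses(bit_values):
    """Combine {'name[i]': 0 or 1} samples into {'name': integer}; scalars pass through."""
    buses = defaultdict(int)
    for name, value in bit_values.items():
        m = BIT_RE.match(name)
        if m is None:
            buses[name] = value  # plain scalar net
        else:
            buses[m["base"]] |= (value & 1) << int(m["idx"])
    return dict(buses)
```

For example, 32 samples named `data0_obs[0]` through `data0_obs[31]` collapse into a single `data0_obs` integer.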
Example
tests/jtag_minimal/ uses --trace-signals to surface the debug
module's observable outputs (dmactive_obs, haltreq_obs,
data0_obs[0..31]) so the test's pass criterion can check that the magic
value 0xCAFEBABE lands in data0_obs:
jacquard cosim tests/jtag_minimal/data/top.pnl.v \
--config tests/jtag_minimal/sim_config.json \
--trace-signals tests/jtag_minimal/trace_signals.txt \
--jtag-replay tests/jtag_minimal/data/bitbang.rec \
--output-vcd out.vcd
Troubleshooting
| Symptom | Cause / fix |
|---|---|
| not found in netlistdb (tried N candidate(s)) | The name doesn't exist post-synthesis under any candidate spelling. Find the real name with netlist-graph search; the net may have been renamed or optimized away. |
| Signal registered but flat in the VCD | It may resolve to a constant after optimization (the startup log notes constants are skipped), or the cone was stripped. Confirm it's a live net with netlist-graph drivers. |
| Nothing appears in the VCD | Check that the run actually emits a VCD (for cosim, traced nets land in --output-vcd only, never in --stimulus-vcd) and that the startup summary line reports a non-zero registered count. |
Implementation notes
Registration happens at AIG construction, before partitioning, which is
why the list must be supplied via the CLI flag (not a runtime env var).
The mechanism lives in src/sim/trace_signals.rs; emission piggybacks on
emit_extra_observables in src/sim/vcd_io.rs. The same multi-candidate
resolver backs bus-trace pin binding (see
bus tracing and ADR 0013).
Bus Transaction Tracing (AHB / APB)
Overview
jacquard cosim can decode on-chip bus transactions and emit them in a
compact, transaction-level form — one row per transfer, rather than raw
per-cycle waveforms. You declare the bus interfaces to watch in
sim_config.json; cosim observes their pins on the GPU each tick and
runs the protocol decode on the CPU, writing decoded transactions to a
CSV file.
This is observe-only: the tracer watches signals the design already drives, it never drives anything. It adds no measurable simulation overhead when no buses are configured.
| Protocol | Status |
|---|---|
| APB3 | Supported |
| AHB-Lite | Planned (pipelined address/data pairing, burst tracking) |
| AHB5 | Planned (AHB-Lite + security / exclusive signals) |
The design rationale lives in
ADR 0013; the roadmap is in
plans/bus-transaction-tracing.md.
Bus tracing is the structured, protocol-aware counterpart to
--trace-signals, which surfaces raw internal
nets in the output VCD. Use --trace-signals when you want waveforms of
individual wires; use bus tracing when you want decoded
READ 0x40 => 0x1 records.
Configuring a bus
Add a bus_traces array to sim_config.json. Each entry names one bus
interface:
{
"netlist_path": "build/soc.gv",
"clock_gpio": 0,
"reset_gpio": 1,
"num_cycles": 100000,
"clock_period_ps": 40000,
"bus_traces": [
{
"name": "dmi",
"protocol": "apb3",
"prefix": "soc.dm.",
"addr_bits": 9,
"data_bits": 32
}
]
}
| Field | Required | Meaning |
|---|---|---|
| name | yes | Label for this bus in the CSV bus column. |
| protocol | yes | apb3 (or ahb-lite / ahb5 once supported). |
| prefix | yes | Hierarchical net-name prefix; standard pin names are appended (see below). May be "" for top-level pins. |
| addr_bits | no (default 32) | Address bus width. |
| data_bits | no (default 32) | Data bus width. |
| signals | no | Per-pin net-name overrides (see Pin resolution). |
Pin names
By default each protocol pin is resolved as {prefix}{pin}. For APB3:
| Logical pin | Default net | Notes |
|---|---|---|
| psel | {prefix}psel | required |
| penable | {prefix}penable | required |
| pwrite | {prefix}pwrite | direction |
| paddr | {prefix}paddr[i] | addr_bits wide |
| pwdata | {prefix}pwdata[i] | data_bits wide |
| prdata | {prefix}prdata[i] | data_bits wide |
| pready | {prefix}pready | optional; unresolved is treated as always-ready (1) |
| pslverr | {prefix}pslverr | optional; unresolved is treated as no-error (0) |
So a bus with "prefix": "soc.dm." looks for soc.dm.psel,
soc.dm.paddr[0], …, soc.dm.prdata[31].
If your design's pins don't follow that convention, remap individual
logical pins with signals:
{
"name": "periph",
"protocol": "apb3",
"prefix": "soc.apb.",
"signals": {
"psel": "soc.apb_decode.sel_periph",
"prdata": "soc.apb_mux.readback"
}
}
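To see which nets a given bus_traces entry will look for, the naming rule can be expanded by hand. The sketch below is illustrative only: `apb3_pin_nets` is a hypothetical helper, and it assumes that an override for a multi-bit pin names the bus base, with bits still indexed as [i].

```python
# Pin widths: None marks a 1-bit pin, a string names the config field holding the width.
APB3_PINS = {
    "psel": None, "penable": None, "pwrite": None,
    "paddr": "addr_bits", "pwdata": "data_bits", "prdata": "data_bits",
    "pready": None, "pslverr": None,
}


def apb3_pin_nets(entry):
    """Expand one bus_traces entry into {logical pin: [net names]}."""
    overrides = entry.get("signals", {})
    nets = {}
    for pin, width_field in APB3_PINS.items():
        # Default is {prefix}{pin}; a signals override replaces the whole name.
        base = overrides.get(pin, entry["prefix"] + pin)
        if width_field is None:
            nets[pin] = [base]
        else:
            width = entry.get(width_field, 32)  # documented default width
            nets[pin] = [f"{base}[{i}]" for i in range(width)]
    return nets
```

Feeding the expanded names to netlist-graph search is a quick way to check them against the netlist before a run.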
Running
cargo run -r --features metal --bin jacquard -- cosim \
build/soc.gv \
--config sim_config.json \
--bus-trace-csv bus.csv
At startup each bus logs whether it resolved:
bus-trace `dmi` (APB3): psel/penable resolved, addr 9/9 bits, pready=true pslverr=true
and at the end:
bus-trace: decoded 12 transaction(s) across 1 bus(es)
bus-trace: wrote 12 transaction(s) to bus.csv
CSV output
tick,bus,protocol,dir,addr,data,resp,burst
24,dmi,apb3,WR,0x10,0xCAFEBABE,OK,
30,dmi,apb3,RD,0x10,0xCAFEBABE,OK,
| Column | Meaning |
|---|---|
| tick | Cosim edge at which the transfer completed. One clock cycle = 2 edges (rising + falling) for a single-domain design. |
| bus | The configured bus name. |
| protocol | apb3 / ahb-lite / ahb5. |
| dir | WR or RD. |
| addr | Transfer address (hex). |
| data | pwdata for writes, prdata for reads (hex). |
| resp | OK or ERR (from pslverr / hresp). |
| burst | AHB burst position beat/len (empty for APB). |
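A minimal sketch of consuming this CSV with the Python standard library follows. The `load_bus_trace` name is hypothetical, and the `cycle` field assumes a single-domain design where edge numbering starts at 0, so that cycle = tick // 2.

```python
import csv
import io


def load_bus_trace(csv_text):
    """Parse bus-trace CSV text into a list of dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tick = int(row["tick"])
        rows.append({
            "tick": tick,
            "cycle": tick // 2,             # assumption: 2 edges per clock cycle
            "bus": row["bus"],
            "dir": row["dir"],
            "addr": int(row["addr"], 16),   # hex with 0x prefix
            "data": int(row["data"], 16),
            "ok": row["resp"] == "OK",
        })
    return rows
```

With the two sample rows above, the write lands in cycle 12 and the read-back in cycle 15.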
Pin resolution
For the GPU to read a bus pin each tick, that net must (1) exist in the post-synthesis netlist under a resolvable name and (2) survive into the simulation's output state. Two consequences:
- Names must survive synthesis. The resolver uses the same multi-candidate matcher as --trace-signals, so Yosys-flattened (\soc.dm.psel), scalar-expanded (soc.dm.paddr[3]), and structurally-hierarchical names all work. But synthesis is free to rename or delete combinational nets. The robust pattern is to make the bus signals registers (their DFF Q outputs keep their names), or to annotate the RTL nets with (* keep *).
- Constant-folded bits read as 0, correctly. If a design only ever drives, say, addresses 0x00 and 0x04, synthesis folds every address bit except paddr[2] to a constant. The startup log then shows e.g. addr 1/8 bits. This is expected: the tracer reconstructs the full value correctly because the dropped bits are genuinely 0.
pready / pslverr are allowed to be absent. A common case is an
always-ready slave that ties pready high — it folds to a constant,
fails to resolve, and the tracer correctly treats the bus as
always-ready.
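The two fallbacks described here (unresolved bits read as 0, absent pready reads as 1) can be written down as a small model. This is a sketch of the documented behaviour, not the tracer's code; both function names are hypothetical.

```python
def reconstruct(resolved_bits, width):
    """Rebuild a bus value from resolved bits only.

    resolved_bits maps bit index -> 0 or 1; indices that did not resolve
    (for example constant-folded bits) are absent and contribute 0.
    """
    value = 0
    for i in range(width):
        value |= (resolved_bits.get(i, 0) & 1) << i
    return value


def transfer_completes(psel, penable, pready=None):
    """An APB3 transfer completes when psel, penable and pready are all high.

    pready=None models an unresolved pin, treated as always-ready (1).
    """
    ready = 1 if pready is None else pready
    return bool(psel and penable and ready)
```

In the 0x00 / 0x04 example, only paddr[2] resolves, and `reconstruct({2: 1}, 8)` still yields 0x04.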
Worked example
tests/apb_trace/ is a self-contained, synthesizable APB3 system used
as the CI regression. Its master issues a fixed program — two writes
then two reads — to a register-file slave, and check.py asserts the
decoded CSV. See tests/apb_trace/README.md.
yosys -s tests/apb_trace/synth.tcl # (from tests/apb_trace/)
cargo run -r --features metal --bin jacquard -- cosim \
tests/apb_trace/apb_trace_synth.gv \
--config tests/apb_trace/sim_config.json \
--top-module apb_trace \
--max-clock-edges 200 \
--bus-trace-csv apb.csv
python3 tests/apb_trace/check.py apb.csv
Troubleshooting
| Symptom | Cause / fix |
|---|---|
| psel/penable did not resolve … this bus will not capture | The prefix is wrong, or the nets were optimized away. Find the real names with uv run netlist-graph search <netlist> psel, then fix prefix or add signals overrides. |
| Zero transactions decoded | Gate never asserted. Check that psel/penable resolve (startup log) and that the bus is actually exercised within --max-clock-edges. |
| Address or data always 0 | paddr/pwdata/prdata nets didn't resolve (renamed/folded). Confirm with netlist-graph search; mark the RTL nets (* keep *) and re-synthesize. |
| Reads return stale/wrong data | The slave must present prdata during the ACCESS phase. Register prdata so its value is stable when psel & penable are high. |
Limitations
- APB3 only for now; AHB-Lite / AHB5 and annotated-VCD output are the next phases (see the plan).
- Up to 4 buses per run, addresses/data up to 32 bits.
- Cosim is Metal-only today, so bus tracing is Metal-only.
- The legacy hardcoded Wishbone trace (a separate, SoC-specific path) is unaffected; folding it onto this general mechanism is a planned follow-up.
Adding a New PDK for Post-Layout Simulation
This guide documents the process of enabling a new process design kit (PDK) for gate-level simulation in Jacquard. It is based on the SKY130 enablement and captures every integration point.
Overview
Jacquard natively supports AIGPDK (its own synthesis library of AND gates, DFFs, and SRAMs). Supporting a foundry PDK like SKY130 requires teaching the simulator how to interpret the PDK's standard cells: their pin directions, their boolean function, and which ones are sequential.
There are now two pathways for enabling new cells; pick based on what you're adding:
- Built-in PDK enablement (this guide). For full standard-cell libraries — AND gates, DFFs, sequential cells with explicit AIG decomposition rules. Requires Rust code: pin tables, classifiers, decomposition functions, AIG builder hooks.
- Runtime cell library (--cell-library + .cells.toml manifest). For third-party IP, hard macros, foundry memories, and any other cells that don't need new AIG decomposition rules, i.e. cells that act as opaque outputs (RAM macros), filler/cap blocks, or IO pads. See ADR 0010 and docs/plans/declarative-cell-metadata.md for the recipe. No Jacquard PR required: users ship a manifest alongside their netlist.
This guide covers the built-in pathway, which touches five areas:
- Library detection -- recognizing cell names from a netlist
- Pin direction provider -- telling the netlist parser which pins are inputs/outputs
- Cell classification -- identifying sequential, tie, and multi-output cells
- Behavioral decomposition -- converting PDK cells to AIG (AND/NOT) primitives
- CLI wiring -- connecting it all together
If you're adding just a memory macro or other behaviourally-opaque IP, skip ahead to "Adding third-party IP via runtime manifest" at the end of this document — it's a 6-line TOML entry, not a Rust PR.
Prerequisites
You need:
- The PDK's Verilog cell library (behavioral or functional models)
- A post-synthesis or post-P&R netlist using those cells
- The cell naming convention (prefix, drive strength suffix format)
For SKY130, the PDK data lives in vendor/sky130_fd_sc_hd/ as a git submodule.
Step 1: Library Detection
Reference: src/sky130.rs -- is_sky130_cell(), detect_library(),
detect_library_from_file()
Jacquard scans the netlist to determine which cell library is in use. Each PDK needs a name-matching function:
// src/sky130.rs:535
pub fn is_sky130_cell(name: &str) -> bool {
    name.starts_with("sky130_fd_sc_") || name.starts_with("CF_SRAM_")
}
The CellLibrary enum tracks known libraries. detect_library() iterates cell
names and returns the detected library (or Mixed if cells from multiple
libraries are found -- this is an error).
For a new PDK: Add a variant to CellLibrary, write an is_<pdk>_cell()
function, and update detect_library().
Step 2: Cell Type Extraction
Reference: src/sky130.rs -- extract_cell_type()
PDK cell names follow a convention: <prefix>__<type>_<drive>. The simulator
needs to strip the prefix and drive strength to get the base cell type:
sky130_fd_sc_hd__nand2_4 --> nand2
sky130_fd_sc_hd__dfxtp_1 --> dfxtp
This function must handle all library variants (hd, hs, ms, ls, lp, hdll, hvl for SKY130) and any custom macros (CF_SRAM_*).
For a new PDK: Write an equivalent extract_cell_type() for the PDK's
naming scheme.
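As a language-neutral illustration of the stripping rule (the real implementation is the Rust extract_cell_type() in src/sky130.rs), a regex-based Python sketch might look like the following. The regex itself is an assumption, and it deliberately ignores the CF_SRAM_* macros.

```python
import re

# <prefix>__<type>_<drive>, with the SKY130 library variants listed above.
SKY130_CELL_RE = re.compile(
    r"^sky130_fd_sc_(?:hd|hs|ms|ls|lp|hdll|hvl)__(?P<type>.+)_(?P<drive>[^_]+)$"
)


def extract_cell_type(name):
    """Return the base cell type, or None if the name is not a SKY130 standard cell."""
    m = SKY130_CELL_RE.match(name)
    return m["type"] if m else None
```

Only the final underscore-separated field is treated as the drive strength, so base types that themselves contain underscores survive intact.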
Step 3: Pin Direction Provider
Reference: src/sky130.rs -- SKY130LeafPins implementing LeafPinProvider
The netlist parser (from eda-infra-rs/netlistdb) needs to know pin
directions and widths for every cell type. This is implemented as a trait:
impl LeafPinProvider for SKY130LeafPins {
    fn direction_of(&self, macro_name, pin_name, pin_idx) -> Direction;
    fn width_of(&self, macro_name, pin_name) -> Option<SVerilogRange>;
}
For SKY130, direction_of() is a large match statement covering ~80 cell types
with all their pin names. This is tedious but straightforward -- for each cell,
list which pins are inputs and which are outputs.
Sources for pin directions:
- The PDK's Liberty (.lib) files list pin directions
- The PDK's behavioral Verilog models declare input/output ports
- LEF files also contain pin direction information
For a new PDK: Implement the trait for all cells that appear in your target netlists. You can start with just the cells used in your design and add others as needed.
Step 4: Cell Classification
Reference: src/sky130_pdk.rs -- is_sequential_cell(), is_tie_cell(),
is_multi_output_cell()
Three classification functions control how cells are processed during AIG construction:
Sequential cells (DFFs and latches)
These are handled specially in the AIG builder -- their outputs become state elements rather than combinational logic.
Critical: Use an explicit whitelist, not prefix matching. PDK naming
collisions will silently break simulation if you guess wrong (e.g., SKY130's
dlygate4sd3 starts with "dl" but is a combinational delay buffer, not a
latch).
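The difference between the two approaches is easy to demonstrate. In the sketch below the whitelist is a partial, illustrative set of SKY130 base cell types, not the list Jacquard ships, and both function names are hypothetical.

```python
# Partial, illustrative whitelist of SKY130 sequential base cell types.
SEQUENTIAL_CELLS = frozenset({"dfxtp", "dfrtp", "dfstp", "dfbbp", "dlxtp"})


def is_sequential_by_prefix(cell_type):
    """Fragile: treats anything starting with 'df' or 'dl' as sequential."""
    return cell_type.startswith(("df", "dl"))


def is_sequential(cell_type):
    """Robust: exact membership in an explicit whitelist."""
    return cell_type in SEQUENTIAL_CELLS
```

The prefix version misclassifies dlygate4sd3 as a latch, while the whitelist version correctly treats it as combinational.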
Derivation method: Grep the PDK's behavioral Verilog models for DFF/latch primitives:
for cell in $(ls vendor/<pdk>/cells/); do
vfile="vendor/<pdk>/cells/$cell/<pdk>__${cell}.behavioral.v"
if [ -f "$vfile" ] && grep -qE 'udp_dff|udp_dlatch' "$vfile"; then
echo "$cell"
fi
done
For PDKs that don't use Verilog UDPs, look for always @(posedge blocks or
check the Liberty file's ff and latch groups.
Tie cells
Cells that produce constant 0 or 1 (e.g., SKY130's conb with HI/LO pins).
Multi-output cells
Cells with more than one output (e.g., half-adder ha with SUM and COUT,
full-adder fa). These need special handling because the AIG builder processes
one output pin at a time.
Step 5: Behavioral Model Loading
Reference: src/sky130_pdk.rs -- load_pdk_models(), parse_functional_model(),
parse_udp()
Jacquard decomposes PDK cells to AIG primitives (AND gates and inversions) by parsing their functional Verilog models. The expected file structure:
vendor/<pdk>/
  cells/
    <cell_type>/
      <pdk>__<cell_type>.functional.v    # Gate-level behavioral model
  models/
    <udp_name>/
      <pdk>__<udp_name>.v                # Verilog UDP definitions
Functional models
These are gate-level Verilog using primitives like and, or, nand, nor,
not, xor, xnor, buf. The parser (parse_functional_model()) extracts
these into a topologically-ordered list of BehavioralGate structures.
Example (sky130_fd_sc_hd__o21ai.functional.v):
module sky130_fd_sc_hd__o21ai (Y, A1, A2, B1);
output Y;
input A1, A2, B1;
wire or0_out;
or or0 (or0_out, A2, A1);
nand nand0 (Y, B1, or0_out);
endmodule
UDP models
Some cells (typically muxes) use Verilog User-Defined Primitives with truth
tables. The parser (parse_udp()) converts these to a row-based representation,
which is then evaluated as sum-of-products during AIG decomposition.
What's loaded
Only models for cell types actually present in the design are loaded. Sequential cells are skipped (their behavior is hardcoded in the AIG builder). Tie cells are also skipped (constant generation is trivial).
For a new PDK: If the PDK uses the same Verilog gate primitive syntax, the
existing parsers should work. If it uses behavioral Verilog (assign statements,
always blocks), the parser would need extension.
Step 6: AIG Decomposition
Reference: src/sky130_pdk.rs -- decompose_with_pdk(),
decompose_from_behavioral()
The decomposition converts each combinational cell to a set of 2-input AND gates with optional inversions:
- Map the cell's input pin names to AIG pin indices via CellInputs
- Walk the behavioral model's gate list in topological order
- For each gate, build the equivalent AIG sub-graph:
  - and/nand -> AND gate (with optional output inversion)
  - or/nor -> De Morgan's: OR(a,b) = NOT(AND(NOT a, NOT b))
  - xor/xnor -> Three AND gates: XOR(a,b) = NOT(AND(NOT(AND(a, NOT b)), NOT(AND(NOT a, b))))
  - buf/not -> Pass-through with optional inversion
  - UDP -> Sum-of-products from truth table
- Record the output with cell origin (for SDF timing annotation)
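The AND/NOT identities used for decomposition can be checked exhaustively in a few lines. This Python sketch models single-bit signals as 0/1 integers and is not the Rust decompose_from_behavioral(); the `o21ai` function mirrors the gate list of the functional model shown in Step 5.

```python
from itertools import product


def AND(a, b):
    return a & b


def NOT(a):
    return a ^ 1


def OR(a, b):
    # De Morgan: OR(a,b) = NOT(AND(NOT a, NOT b))
    return NOT(AND(NOT(a), NOT(b)))


def XOR(a, b):
    return NOT(AND(NOT(AND(a, NOT(b))), NOT(AND(NOT(a), b))))


def o21ai(a1, a2, b1):
    # or or0 (or0_out, A2, A1); nand nand0 (Y, B1, or0_out);
    or0_out = OR(a2, a1)
    return NOT(AND(b1, or0_out))


# Exhaustive check of the two-input identities against Python's own operators.
for a, b in product((0, 1), repeat=2):
    assert OR(a, b) == (a | b)
    assert XOR(a, b) == (a ^ b)
```

The same exhaustive pattern, applied per cell, is what a decomposition-correctness test needs to establish.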
CellInputs struct
CellInputs has named fields for all possible input pins across all SKY130
cells (A, B, C, D, A_N, B_N, S, S0, S1, CIN, SET_B, RESET_B, etc.). The
set_pin() method maps netlist pin names to AIG pin indices.
For a new PDK: If the PDK introduces pin names not in the current struct, add new fields.
Step 7: AIG Builder Integration
Reference: src/aig.rs -- get_sky130_dependencies(), sky130_preprocess(),
sky130_postprocess()
The AIG builder processes cells in three phases during topological traversal:
Dependencies (what must be built before this cell)
- Tie cells: No dependencies
- Sequential cells: Only SET_B and RESET_B pins (the data input D is handled by the DFF mechanism, not combinational decomposition)
- Combinational cells: All input pins
Preprocessing (before dependencies are resolved)
- Sequential cells: Create a DFF output AIG pin. This establishes the state element before the combinational cone driving it is built.
Postprocessing (after all dependencies are resolved)
- Tie cells: Wire HI to constant-1, LO to constant-0
- Sequential cells: Apply reset/set logic: Q = AND(OR(Q_state, NOT SET_B), RESET_B) (active-low semantics)
- Combinational cells: Call decompose_with_pdk() and wire the resulting AND gates into the AIG
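The reset/set expression for sequential cells is worth evaluating over its full truth table, because it shows that reset dominates set when both are asserted. The sketch below models bits as 0/1 integers; `dff_output` is a hypothetical name.

```python
def dff_output(q_state, set_b, reset_b):
    """Q = AND(OR(Q_state, NOT SET_B), RESET_B), with active-low SET_B / RESET_B."""
    return (q_state | (set_b ^ 1)) & reset_b


# RESET_B low forces 0 regardless of the other inputs (reset wins over set).
assert all(dff_output(q, s, 0) == 0 for q in (0, 1) for s in (0, 1))
# SET_B low with RESET_B high forces 1.
assert all(dff_output(q, 0, 1) == 1 for q in (0, 1))
# Both inactive (high): the stored state passes through.
assert all(dff_output(q, 1, 1) == q for q in (0, 1))
```

A PDK with active-high controls would need the inversions moved accordingly.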
For a new PDK: The three-phase structure is reusable. You need PDK-specific implementations of each phase that handle the new cell types' pin names and reset/set conventions.
Step 8: CLI Integration
Reference: src/bin/jacquard.rs
The load_design function detects the library and creates the netlist with the
appropriate pin provider:
let lib = detect_library_from_file(&args.netlist_verilog)?;
let netlistdb = match lib {
    CellLibrary::SKY130 => NetlistDB::from_sverilog_file(&paths, &SKY130LeafPins),
    CellLibrary::AIGPDK => NetlistDB::from_sverilog_file(&paths, &AIGPDKLeafPins()),
    CellLibrary::Mixed => panic!("Mixed libraries not supported"),
};
For a new PDK: Add a match arm for the new library.
Testing Strategy
Unit tests
- Cell type extraction: Verify prefix/suffix stripping
- Pin directions: Spot-check common cells
- Behavioral model parsing: Parse each cell type, verify gate count
- Decomposition correctness: For each combinational cell, exhaustively test all input combinations against the PDK's truth table:
#[test]
fn test_all_cells_vs_pdk() {
    let pdk = load_test_pdk();
    for (cell_type, model) in &pdk.models {
        // For each input combination:
        // 1. Evaluate behavioral model directly
        // 2. Decompose to AIG and evaluate AIG
        // 3. Assert outputs match
    }
}
This test exists in src/sky130_pdk.rs as test_all_cells_vs_pdk and covers every combinational cell against every input combination.
Integration tests
- Small test circuit: Synthesize a simple design (DFF + some gates) to the new PDK and verify simulation output matches a reference (e.g., iverilog)
- Flash boot test: If targeting an SoC, verify the CPU boots and reads from flash (this exercises sequential logic, combinational cones, and IO)
File Checklist
For a complete PDK integration, you need:
| File | Purpose |
|---|---|
| src/<pdk>.rs | LeafPinProvider, library detection, cell type extraction |
| src/<pdk>_pdk.rs | Cell classification, model parsing, AIG decomposition |
| src/aig.rs | AIG builder hooks (dependencies, pre/post-process) |
| src/sky130.rs | Update CellLibrary enum |
| src/bin/jacquard.rs | CLI match arms for new library |
| vendor/<pdk>/ | PDK cell models (git submodule) |
Common Pitfalls
- Cell name collisions: Do not use prefix matching for cell classification. dlygate4sd3 starts with "dl" but is not a latch. Always derive the exhaustive list from behavioral models.
- Active-low vs active-high resets: SKY130 uses active-low RESET_B and SET_B. Other PDKs may use active-high. Get this wrong and every DFF will be stuck.
- Multi-output cells: The AIG builder processes one output pin at a time. If a cell has both Q and Q_N outputs (e.g., dfbbp), the second output must be derived from the first (Q_N = NOT Q), not decomposed independently.
- Liberty file size: SKY130's liberty files are 12MB+. If your PDK has similarly large files, ensure the parser doesn't OOM or timeout.
- Power/ground pins: Post-layout netlists often include VPWR/VGND pins. Use the unpowered netlist variant (.nl.v not .pnl.v in OpenLane2) or handle power pins as constants in the pin provider.
- Hold-time repair buffers: P&R tools insert delay buffers (like dlygate4sd3) that must be treated as combinational. If your PDK's delay cells have names that collide with sequential cell prefixes, the whitelist approach prevents misclassification.
Adding third-party IP via runtime manifest
If you're adding a memory macro, IO pad, hard block, or filler library — anything that doesn't need new AIG decomposition rules — the runtime cell-library pathway (ADR 0010) is the right route. No Jacquard PR required. Ship a Verilog blackbox file plus a TOML manifest alongside your design.
Step 1: Provide the cell's Verilog interface
The blackbox just declares the cell's module + port directions. The
foundry typically ships this (<library>__blackbox.v). Example for the
OCD GF180MCU SRAM:
module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (CLK, CEN, GWEN, WEN, A, D, Q);
input CLK;
input CEN;
input GWEN;
input [7:0] WEN;
input [9:0] A;
input [7:0] D;
output [7:0] Q;
endmodule
Step 2: Write the TOML manifest
Co-locate <library>.cells.toml next to <library>.v (it autoloads
when present) or pass it via --cell-manifest:
schema_version = "1.0"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"
Recognised kind values in v1.0: std, dff, latch, clock_gate,
ram, filler, endcap, tap, io_pad_input, io_pad_output,
io_pad_bidir, delay, multi_output, tie_high, tie_low.
Step 3: Invoke jacquard with the manifest
jacquard sim my_chip.v stim.vcd out.vcd 1 \
--cell-library deps/gf180mcu_ocd_ip_sram/cells/gf180mcu_ocd_ip_sram__sram1024x8m8wm1/gf180mcu_ocd_ip_sram__sram1024x8m8wm1__blackbox.v
The --cell-library flag is repeatable for multi-IP designs.
What kind = "ram" delivers — opaque vs explicit-port modes
There are two modes depending on whether the manifest includes a
[cells.NAME.ram] port-mapping sub-table:
Opaque mode (no ram sub-table, schema v1.0+): the cell's
output pins become X-source slots in the AIG. The SRAM's internal
memory behaviour is not modelled. Sufficient for designs whose
CPU executes from boot ROM / register file and never reads SRAM
contents at the timescales Jacquard simulates.
Explicit-port mode (with ram sub-table, schema v1.1+, ADR
0011): outputs are wired to a real AIG-backed RAMBlock, writes
populate per-entry storage, reads return what was written. Real
memory modelling end-to-end. Use this when the CPU reads its own
SRAM (the common case for any design beyond heartbeat
verification).
Schema (full example, mirroring the upstream OCD GF180MCU SRAM):
schema_version = "1.1"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock = { pin = "CLK", edge = "pos" }
chip_enable = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask = { pin = "WEN", polarity = "low", granularity = "bit" }
address = "A"
data_in = "D"
data_out = "Q"
Field semantics, defaults, and the multi-port-SRAM/async/wider-than-32-bit
out-of-scope items are documented in
ADR 0011. Polarity defaults
to low; clock edge defaults to pos; mask granularity defaults
to bit. All three control pins (chip_enable / write_enable /
write_mask) are optional — omit them for sync SRAMs without
those signals.
Preloading SRAM contents at sim start
Once a SRAM is in explicit-port mode, its contents can be preloaded
from an ELF file via sim_config.json:
{
"sram_init": {
"elf_path": "build/firmware.elf"
}
}
The ELF's PT_LOAD segments are packed into the SRAM's backing storage before tick 0; the lowest loadable virtual address is taken as SRAM address 0. Single-SRAM designs only — multi-SRAM instance-targeting is a future schema extension (issue #80).
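The address rebasing can be sketched without an ELF parser by taking the loadable segments as plain `(vaddr, bytes)` pairs. Everything here is illustrative: `pack_segments` is a hypothetical name, and the byte-per-entry layout assumes an 8-bit-wide SRAM such as the 1024x8 macro above.

```python
def pack_segments(segments, depth):
    """Pack (vaddr, data) segments into a depth-byte image.

    The lowest virtual address among the segments becomes SRAM address 0.
    """
    base = min(vaddr for vaddr, _ in segments)
    image = bytearray(depth)  # unwritten entries stay 0 in this sketch
    for vaddr, data in segments:
        offset = vaddr - base
        if offset + len(data) > depth:
            raise ValueError(f"segment at {vaddr:#x} does not fit in {depth} entries")
        image[offset:offset + len(data)] = data
    return image
```

Segments may arrive in any order; only their addresses relative to the lowest one matter.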
Other kinds
- filler, endcap, tap: physical-only, contribute no logic.
- io_pad_input / io_pad_output / io_pad_bidir: pad-level behaviour (parallel to the built-in gf180mcu_ws_io__* family).
- dff, latch, clock_gate, delay, multi_output: recognised, but the v1.0 schema doesn't yet expose enough port semantics to drive AIG construction for these. Coming in the port-mapping schema (future ADR). For now, declaring these kinds documents intent without changing behaviour.
Troubleshooting VCD Input Issues
This guide helps debug VCD input problems where GEM simulations produce incorrect results or warn about missing signals.
VCD Scope Auto-Detection (Recommended)
NEW: GEM now automatically detects the correct VCD scope containing your design's ports. In most cases, you don't need to specify --input-vcd-scope manually.
How Auto-Detection Works
When you run jacquard sim without specifying --input-vcd-scope, GEM:
- Extracts the list of required input ports from your synthesized design
- Searches the VCD file for scopes containing all required ports
- Tries common DUT scope names first: dut, uut, DUT, UUT, or your module name
- Falls back to any scope that contains all required ports
- Logs which scope was selected for transparency
- Logs which scope was selected for transparency
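The selection order above can be modelled as follows. This is a sketch of the described behaviour, not the tool's code: `pick_scope` is a hypothetical name, and scopes are represented as a dict from slash-separated path to the set of signal names declared in that scope.

```python
COMMON_NAMES = ["dut", "uut", "DUT", "UUT"]


def pick_scope(scopes, required_ports, module_name=None):
    """Return the path of the scope to read inputs from, or None if none qualifies."""
    # Only scopes that contain every required port are eligible.
    complete = [path for path, signals in scopes.items() if required_ports <= signals]
    # Prefer the common DUT instance names, then the module name.
    preferred = COMMON_NAMES + ([module_name] if module_name else [])
    for leaf in preferred:
        for path in complete:
            if path.split("/")[-1] == leaf:
                return path
    # Fall back to any scope that contains all required ports.
    return complete[0] if complete else None
```

If this returns None for your VCD, the manual override described below is the next thing to try.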
Example Output
INFO No VCD scope specified - attempting auto-detection
DEBUG Searching for VCD scope containing 4 input ports
DEBUG Required ports: {"din_valid", "clk", "reset", "din"}
INFO Auto-detected VCD scope: safe_tb/uut (matched common pattern 'uut')
Manual Override
If auto-detection fails or selects the wrong scope, use --input-vcd-scope to specify manually:
# Slash-separated path to the DUT scope
jacquard sim design.gv input.vcd output.vcd 8 \
--input-vcd-scope "testbench/dut"
# For nested hierarchies
jacquard sim design.gv input.vcd output.vcd 8 \
--input-vcd-scope "top_tb/subsystem/my_module"
Note: Use slash separators (/), not dots (.).
Symptom: Missing Primary Input Warnings
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "reset", None) not present in the VCD input
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "din", Some(3)) not present in the VCD input
Root Cause
GEM expects VCD signals at absolute top-level with no module hierarchy prefix. The signal names must exactly match the synthesized module's port names.
How to Check
- Inspect your VCD file:
grep '\$var' your_input.vcd | head -20
- Look for module scopes:
grep '\$scope module' your_input.vcd
- Check synthesized module ports:
head -20 your_design_synth.gv
What GEM Expects
Correct - Signals at top level:
$timescale 1ns $end
$var reg 1 ! clk $end
$var reg 1 " reset $end
$var reg 4 # din [3:0] $end
$var reg 1 $ din_valid $end
$var wire 1 % unlocked $end
$enddefinitions $end
$dumpvars
0"
0$
0%
1!
#10
1"
#20
b1100 #
1$
Incorrect - Signals scoped under module:
$scope module testbench $end
$scope module dut $end
$var wire 1 ! clk $end
$var wire 1 " reset $end
$var wire 4 # din [3:0] $end
...
$upscope $end
$upscope $end
Solution 1: Flat VCD Generation
Create a testbench that dumps signals at absolute top level:
module testbench;
reg clk = 0;
reg reset;
reg [3:0] din;
reg din_valid = 0;
wire unlocked;
// DUT instantiation
your_module dut (
.clk(clk),
.reset(reset),
.din(din),
.din_valid(din_valid),
.unlocked(unlocked)
);
always #10 clk = !clk;
initial begin
// CRITICAL: Dump signals at top level (depth 1)
// NOT inside module hierarchy!
$dumpfile("output.vcd");
$dumpvars(1, clk, reset, din, din_valid, unlocked);
// Test sequence
reset = 1;
#60;
reset = 0;
// ... your test stimulus ...
#200;
$finish;
end
endmodule
Key Point: $dumpvars(1, signal1, signal2, ...) dumps individual signals at the current scope level, not inside child modules.
Compile and Run
# For synthesis-compatible testbench
iverilog -DSYNTHESIS -o sim your_design.v testbench.v
./sim
# Check VCD structure
grep '\$scope' output.vcd # Should be minimal or none
grep '\$var' output.vcd | head -10
Solution 2: Post-Process VCD (Advanced)
If you can't change the testbench, post-process the VCD to flatten hierarchy:
#!/usr/bin/env python3
"""Flatten VCD hierarchy to top level"""
import sys


def flatten_vcd(input_vcd, output_vcd):
    with open(input_vcd) as inf, open(output_vcd, 'w') as outf:
        scope_depth = 0
        for line in inf:
            stripped = line.strip()
            # Track scope depth and drop the scope markers themselves, so the
            # surviving $var lines carry no hierarchy prefix
            if stripped.startswith('$scope'):
                scope_depth += 1
                continue
            if stripped.startswith('$upscope'):
                scope_depth -= 1
                continue
            # Keep signals declared in the root scope (depth 1); skip
            # signals declared inside nested module scopes
            if scope_depth > 1 and stripped.startswith('$var'):
                continue
            # Note: value changes for skipped signals are still copied through
            outf.write(line)


if __name__ == '__main__':
    flatten_vcd(sys.argv[1], sys.argv[2])
Usage:
python3 flatten_vcd.py hierarchical.vcd flat.vcd
Solution 3: VCD Scope Option (Experimental)
GEM provides --input-vcd-scope to specify which module hierarchy to read:
cargo run -r --features metal --bin jacquard -- sim \
design.gv input.vcd output.vcd 48 \
--input-vcd-scope module_name
Known Issue: Currently, signal matching still fails even with correct scope specified. This is under investigation.
Diagnostic Checklist
1. Verify Signal Names Match
Synthesized Module:
grep "^module\|input\|output" design_synth.gv
Output:
module safe(clk, reset, din, din_valid, unlocked);
input clk;
input reset;
input [3:0] din;
input din_valid;
output unlocked;
VCD Signals:
grep '\$var.*\(clk\|reset\|din\|unlocked\)' input.vcd
Output should match synthesized port names exactly.
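The comparison between netlist ports and VCD signals can be automated. This is a rough sketch, assuming one `input`/`output` declaration per line in the netlist and a space between the signal name and its range in `$var` lines; both helper names are hypothetical.

```python
import re


def declared_ports(netlist_text):
    """Collect port names from 'input'/'output' declarations in a netlist."""
    pattern = re.compile(
        r"^\s*(?:input|output)\s+(?:\[[^\]]+\]\s*)?(\w+)\s*;", re.M
    )
    return set(pattern.findall(netlist_text))


def vcd_signal_names(vcd_text):
    """Collect signal names from $var declarations."""
    names = set()
    for line in vcd_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 5 and tokens[0] == "$var":
            names.add(tokens[4])
    return names


netlist = "\n".join([
    "module safe(clk, reset, din, din_valid, unlocked);",
    "input clk;",
    "input reset;",
    "input [3:0] din;",
    "input din_valid;",
    "output unlocked;",
])
vcd = "\n".join([
    "$var reg 1 ! clk $end",
    "$var reg 1 % reset $end",
    "$var reg 4 # din [3:0] $end",
    "$var wire 1 & unlocked $end",
])
missing = declared_ports(netlist) - vcd_signal_names(vcd)
print(sorted(missing))  # ports with no matching $var declaration
```

Here `din_valid` is reported because the sample VCD never declares it.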
2. Check Signal Bit Widths
Multi-bit signals must have correct indices:
Synthesized: input [3:0] din;
VCD:
$var reg 4 # din [3:0] $end
GEM expects separate indices: din[3], din[2], din[1], din[0]
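To make the per-bit naming concrete, here is a small sketch that expands a declared range into individual bit names. `expand_bits` is a hypothetical helper for illustration; it does not reflect how the simulator itself performs the expansion.

```python
import re


def expand_bits(name, bit_range):
    """Expand ('din', '[3:0]') into per-bit names, MSB first as declared."""
    match = re.fullmatch(r"\[(\d+):(\d+)\]", bit_range)
    if match is None:
        raise ValueError(f"unsupported range: {bit_range!r}")
    first, last = int(match.group(1)), int(match.group(2))
    step = -1 if first >= last else 1
    return [f"{name}[{i}]" for i in range(first, last + step, step)]


print(expand_bits("din", "[3:0]"))
```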
3. Verify Timestamp Format
GEM expects integer timestamps (not real numbers):
Correct:
#0
#10
#20
Incorrect:
#0.0
#10.5
#20.25
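A quick way to find offending timestamps is to scan for lines that start with `#` but are not followed by digits only. The sketch below uses a hypothetical helper name and assumes it is run on the value-change section of the file.

```python
def bad_timestamps(vcd_lines):
    """Return (line_number, text) for timestamp lines that are not '#<integer>'."""
    bad = []
    for number, line in enumerate(vcd_lines, start=1):
        text = line.strip()
        if text.startswith("#") and not text[1:].isdigit():
            bad.append((number, text))
    return bad


print(bad_timestamps(["#0", "1!", "#10.5", "#20"]))
```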
4. Check Timescale
Ensure VCD timescale matches simulation expectations:
$timescale 1ns $end
or
$timescale 1ps $end
Clock periods in testbench should use same time unit.
Validation Steps
After fixing VCD issues, validate GEM is reading inputs correctly:
1. Run with CPU Verification
cargo run -r --features metal --bin jacquard -- sim \
design.gv input.vcd output.vcd 48 \
--check-with-cpu
This compares GPU results against CPU gate-level simulation. Should print:
[INFO] sanity test passed!
2. Compare Output VCD with Reference
Run same design with iverilog:
iverilog -o reference_sim design.v testbench.v
./reference_sim # Generates reference.vcd
Compare outputs:
# Check if unlocked signal toggles the same in both
grep '^[01]!' gem_output.vcd
grep '^[01]!' reference.vcd
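The grep comparison assumes `unlocked` has the identifier code `!` in both files, which is not guaranteed. The sketch below looks the code up by signal name instead and returns the (time, value) trace for a 1-bit signal. `signal_trace` is a hypothetical helper; the two simulators may sample at different times, so differences in the traces need interpreting rather than being treated as automatic failures.

```python
def signal_trace(vcd_lines, signal_name):
    """Return [(time, value)] for a 1-bit signal, looked up by name."""
    code = None
    time = 0
    trace = []
    for line in vcd_lines:
        tokens = line.split()
        if not tokens:
            continue
        head = tokens[0]
        if head == "$var" and len(tokens) >= 5 and tokens[4] == signal_name:
            code = tokens[3]  # identifier code assigned by the VCD writer
        elif head.startswith("#") and head[1:].isdigit():
            time = int(head[1:])
        elif code is not None and head[0] in "01xXzZ" and head[1:] == code:
            trace.append((time, head[0]))
    return trace


gem = ["$var wire 1 ! unlocked $end", "#0", "0!", "#40", "1!"]
ref = ["$var wire 1 % unlocked $end", "#0", "0%", "#40", "1%"]
print(signal_trace(gem, "unlocked") == signal_trace(ref, "unlocked"))
```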
3. Check Cycle Count
cargo run -r --features metal --bin jacquard -- sim \
design.gv input.vcd output.vcd 48 \
2>&1 | grep "total number of cycles"
Should match your testbench's simulation time / clock period.
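As a worked example of that arithmetic, take the example flat testbench in this guide: the delays sum to 60 + 11 + 4 × 20 + 40 = 191 time units and the clock toggles every 10 units. The sketch below assumes a cycle means one full clock period; the reported count may differ by one depending on how edges at the boundaries are counted.

```python
def expected_cycles(end_time, half_period):
    """Full clock periods that fit in end_time, given the toggle interval."""
    return end_time // (2 * half_period)


total_time = 60 + 11 + 4 * 20 + 40  # delays in the flat testbench example
print(expected_cycles(total_time, 10))
```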
Common Pitfalls
1. Testbench Inside `ifndef SYNTHESIS
If testbench is only compiled when SYNTHESIS is not defined:
`ifndef SYNTHESIS
module testbench;
// ...
endmodule
`endif
You must compile without -DSYNTHESIS for VCD generation:
iverilog -o sim design.v testbench.v # No -DSYNTHESIS!
But the DUT must be compiled with -DSYNTHESIS if it has non-synthesizable constructs:
# Separate compilation
iverilog -DSYNTHESIS -c design.v
iverilog -o sim design.v testbench.v
2. X/Z Values in VCD
GEM may not handle unknown (X) or high-impedance (Z) values correctly:
$dumpvars
x" # reset = X
bxxxx # # din = XXXX
Solution: Initialize all inputs in testbench:
initial begin
reset = 0; // Don't leave uninitialized
din = 4'h0;
din_valid = 0;
end
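To check whether a VCD still contains unknown values after initialization, something like the following sketch can help. It assumes integer timestamps and the standard scalar (`x!`) and vector (`bxxxx #`) value-change forms; `unknown_values` is a hypothetical helper name.

```python
def unknown_values(vcd_lines):
    """Return (time, line) for value changes that contain x or z."""
    time = 0
    in_dump = False
    hits = []
    for line in vcd_lines:
        text = line.strip()
        if text.startswith("$enddefinitions"):
            in_dump = True
        elif not in_dump or not text or text.startswith("$"):
            continue  # Header lines and keywords such as $dumpvars / $end
        elif text.startswith("#"):
            time = int(text[1:])
        else:
            value = text.split()[0]
            # Scalars: first character is the value. Vectors: 'b' then the bits.
            bits = value[1:] if value[0] in "bB" else value[0]
            if any(ch in "xXzZ" for ch in bits):
                hits.append((time, text))
    return hits


sample = ["$enddefinitions $end", "$dumpvars", 'x"', "bxxxx #", "0!", "$end", "#10", '1"']
print(unknown_values(sample))
```

Only the value part is inspected, so an identifier code that happens to be `x` or `z` does not produce a false hit.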
3. Missing Clock Signal
If VCD doesn't include clock:
WARN (GATESIM_VCDI_MISSING_PI) Primary input port (HierName(), "clk", None) not present
Ensure:
- Clock is generated in testbench
- Clock is included in $dumpvars
- Clock signal name matches synthesized netlist exactly
Example: Working Flat VCD Testbench
// testbench_flat.v - Generates GEM-compatible VCD
module testbench_flat;
// Declare all signals at top level
reg clk = 0;
reg reset = 1;
reg [3:0] din = 4'h0;
reg din_valid = 0;
wire unlocked;
// DUT instantiation
safe dut (
.clk(clk),
.reset(reset),
.din(din),
.din_valid(din_valid),
.unlocked(unlocked)
);
// Clock generation
always #10 clk = !clk; // 20ns period = 50MHz
// Test sequence
initial begin
// CRITICAL: Dump at top level (depth 1)
$dumpfile("safe_flat.vcd");
$dumpvars(1, clk, reset, din, din_valid, unlocked);
// Reset phase
reset = 1;
#60; // 3 clock cycles
reset = 0;
#11; // Small offset from clock edge
// Apply test stimulus
din = 4'hc;
din_valid = 1;
#20;
din = 4'h0;
#20;
din = 4'hd;
#20;
din = 4'he;
#20;
din_valid = 0;
#40;
$finish;
end
endmodule
Compile and test:
# Compile (DUT must be SYNTHESIS-compatible)
iverilog -DSYNTHESIS -o sim safe.v testbench_flat.v
# Run simulation
./sim
# Verify VCD structure
echo "=== VCD Scopes ==="
grep '\$scope' safe_flat.vcd
echo -e "\n=== VCD Signals ==="
grep '\$var' safe_flat.vcd
# Should show signals at top level, no nested $scope modules
Still Having Issues?
- Enable debug logging:
RUST_LOG=debug,vcd_ng=trace cargo run -r --features metal --bin jacquard -- sim <args> 2>&1 | tee debug.log
- Check with minimal test:
  - Create simplest possible design (single DFF)
  - Generate flat VCD
  - Verify GEM can read it correctly
- Report issue with:
  - Synthesized .gv file
  - Input VCD file
  - GEM command line
  - Error messages or unexpected output
Document Version: 1.0 Last Updated: 2025-01-08 Related: simulation-architecture.md
Handoff discipline
Handoffs in this project are ephemeral working memory, not historical record. They exist to bridge a single session boundary — when you stop working and someone else (Claude or human) picks up — and they are deleted once the work they describe is resolved.
This document defines what a handoff is, what it isn't, when to write one, and exactly what to do when one is resolved.
Why this discipline exists
Decision rationale, technical context, and project state all have natural homes:
- ADRs (docs/adr/) capture architectural decisions and their why.
- Design docs (docs/timing-model-extensions.md, etc.) capture how things work.
- Plan docs (docs/plans/phase-0-ir-and-oracle.md, post-phase-0-roadmap.md) capture what's left and the next workstream slices.
When that content lives in a handoff instead, two things go wrong:
- It's not where contributors look. A new contributor reading the README → SUMMARY → ADR chain shouldn't have to dig through a stack of resolved handoff docs to find load-bearing decisions or the current state of a workstream.
- It rots out of sync with reality. Handoffs are point-in-time snapshots. A "STATUS: RESOLVED" banner doesn't help when the thing referenced has moved or changed; the canonical doc is what should hold the current truth.
The discipline closes this gap by forcing migration before deletion. Every load-bearing piece of a handoff lands in its proper home (ADR / design doc / plan doc) before the handoff file is removed.
What a handoff IS
A handoff lives in its own dedicated directory, separate from the persistent plan docs whose content it eventually feeds: a single markdown file at docs/handoffs/<topic>-handoff.md containing exactly what the next session needs to pick up where you left off:
- Goal & next-up — what this session was trying to do, and what the very next concrete action is.
- Done this session — commits landed, with one-line summaries.
- Open follow-ups — the work that wasn't done, with enough scope detail to start cold.
- Critical context — gotchas, surprising findings, environment specifics that aren't obvious from the code or docs yet.
- Verification — the command(s) the next session runs to confirm the work is in the state you say it is.
Exactly one handoff exists at a time. There's no chain of resolved predecessors to wade through.
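The one-handoff rule is easy to check mechanically. The sketch below is a hypothetical helper, not an existing project script; it assumes the `docs/handoffs/<topic>-handoff.md` naming described above and demonstrates the check on a temporary directory.

```python
import tempfile
from pathlib import Path


def in_flight_handoffs(repo_root):
    """List handoff files currently present under docs/handoffs/."""
    return sorted(Path(repo_root, "docs", "handoffs").glob("*-handoff.md"))


def check_single_handoff(repo_root):
    """Return an error message if more than one handoff exists, else None."""
    found = in_flight_handoffs(repo_root)
    if len(found) > 1:
        names = ", ".join(path.name for path in found)
        return f"expected at most one handoff, found {len(found)}: {names}"
    return None


with tempfile.TemporaryDirectory() as root:
    handoffs = Path(root, "docs", "handoffs")
    handoffs.mkdir(parents=True)
    (handoffs / "timing-ir-handoff.md").write_text("# Handoff\n")
    one = check_single_handoff(root)
    (handoffs / "cosim-handoff.md").write_text("# Handoff\n")
    two = check_single_handoff(root)

print(one)
print(two)
```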
What a handoff IS NOT
- Not a decision log. Decisions go in ADRs. If you find yourself writing "we chose X over Y because Z" in a handoff, that paragraph belongs in an ADR (or an existing ADR's "Consequences" / "Walk-back" section).
- Not a design doc. "How clock arrival flows from OpenSTA Tcl through the IR into the GPU constraint buffer" is a design topic; it lives in docs/timing-model-extensions.md Part B, not in a handoff's "Critical context" section.
- Not a status dashboard for the project. Workstream status lives in plan docs: phase-0-ir-and-oracle.md for current-phase WS state, post-phase-0-roadmap.md for forward-looking sequencing. A handoff cites those, doesn't reproduce them.
- Not a historical record. git log is the historical record. Handoffs that survive past their resolution turn into noise that misleads new contributors.
When to write one
Write a handoff at the end of any session that:
- Leaves work in a partial state that someone else might pick up cold.
- Captures non-obvious context the next session needs (e.g. "the OpenSTA Tcl find_timing proc rejects -full_update; use ::sta::find_timing_cmd 1 directly").
- Documents the next concrete step with enough scope to start without re-discovering it.
If the session ended at a clean stopping point (everything merged, all decisions documented in ADRs/plans, nothing surprising), don't write a handoff. The plan doc already says what's next.
Resolution: fold, then delete
The two-location split is deliberate: handoffs live at docs/handoffs/<topic>-handoff.md while in flight; their content migrates into the persistent docs (docs/adr/, docs/plans/, design docs under docs/) at resolution. The handoff file then gets removed; nothing about the work is lost because everything load-bearing has a permanent home elsewhere.
When a handoff's work is done — whether in the next session or several sessions later — every load-bearing piece of it must be migrated to its proper home before the handoff file is deleted:
| If the handoff says... | It belongs in... |
|---|---|
| "We chose approach X over Y because Z" | The relevant ADR's Decision/Consequences section, or a new ADR if no fit exists |
| "Future scope for WS-N: do A then B then C" | The plan doc's WS-N section (phase-0-ir-and-oracle.md or successor) |
| "Gotcha: OpenSTA's Tcl X behaves Y" | A code comment near the Tcl call site, or a design doc if the gotcha cuts across files |
| "Build dep Z is required on Linux" | The build script's apt-suggestion / Brewfile / README install section |
| "Subsystem A doesn't yet do B" | Plan doc as a new open item, or an ADR-tracked walk-back if it's a deferred design choice |
| "Run cargo test --features foo to verify" | The verification block in the relevant plan doc, or a test-running section in CLAUDE.md |
After migration, the handoff file is removed in the same commit as the migration:
git rm docs/handoffs/<topic>-handoff.md
git add <files-receiving-the-migrated-content>
git commit -m "$(cat <<'EOF'
docs: resolve <topic> handoff — fold into <where-it-went>
<one-paragraph summary of what was migrated and where>
Co-developed-by: Claude Code v<version> (<model-id>)
EOF
)"
The commit message records what migrated where — that's the audit trail. git log -- docs/handoffs/ then shows the project's handoff history (one add, one delete per session) without needing the files themselves to live forever.
Template
When you do need to write one, use this skeleton. Replace placeholders inline; delete sections that don't apply (better to omit a section than fill it with "N/A").
# Handoff — <Topic> (one-line summary of what this session left open)
**Created:** YYYY-MM-DD
**Working tree:** clean | <state if not clean>
**Branch:** main | <branch>
## Goal & next-up
**Goal of this session:** <what you were trying to do, in 1–3 sentences>
**Next session should pick up:** <the very next concrete action, by name. Reference the plan doc section if applicable.>
**Verification command:**
```sh
<commands the next session runs to confirm this handoff's claimed state>
# Expect: <what success looks like>
```
## Done this session
| Commit | Subject | Notes |
|---|---|---|
| <sha> | | |
## Open follow-ups (priority-ordered)
1. - (
)
<Concrete scope. Enough detail to start cold. Link to existing plan/ADR/design-doc sections rather than reproducing them.>
2. ...
## Critical context
<Things the next session needs to know that aren't yet in the code/docs. Be honest about what's truly load-bearing: anything obvious from git log or a quick grep doesn't belong here.>
## References
- <predecessor-handoff if any>: predecessor (if relevant)
- <plan doc>: current workstream state
- <ADR>: relevant decision
Resume in a new session with:
```
/resume_handoff docs/handoffs/
```
Tooling
The `create_handoff` and `resume_handoff` skills (from various Claude Code orchestration toolkits) generate and consume handoffs. They're optional — the discipline above is the load-bearing artifact. A handoff written by hand following this template is just as valid.
If you use one of those skills, expect it to default to YAML format under `thoughts/shared/handoffs/` with database indexing. **That doesn't apply to this project.** Override it: produce markdown at `docs/handoffs/<topic>-handoff.md` and skip the database step. The skill activation is informational; the project's convention takes precedence.
Architecture Decision Records
ADRs capture decisions worth understanding later: the context, the options considered, and the rationale for the choice. They are numbered, append-only, and never silently rewritten — if a decision changes, supersede the old ADR with a new one and update the status.
Status legend
- Accepted / Approved: current, in effect.
- Accepted (partial): design ratified and partly built; the ADR carries an ## Implementation status section (see below).
- Proposed: drafted, not yet ratified.
- Superseded: historical, replaced by a later ADR or by a spike outcome; kept for the audit trail.
Keeping status honest
An ADR's Status is a claim about the codebase, not an aspiration. Before setting or changing it, verify the claim against the implementation — read the code; don't trust the previous status or a feature's "done" framing. The same goes for any present-tense statement inside an ADR ("jitter feeds the setup/hold checker"): it's a verifiable claim, so check it.
- Don't bump Proposed → Accepted just because a design merged. Confirm the decision is actually in effect in the code.
- When a design is ratified but only partly built, use Accepted (partial) and add an ## Implementation status section splitting implemented (with file references) from deferred (with the specific gap). ADR 0012 is the worked example.
- Deferred work gets a home: a plan under docs/plans/ and a tracking issue, cross-linked from the ADR's status section, so the unbuilt half isn't lost.
This extends to user-facing docs and --help text: a sentence telling
the reader how the tool behaves is a verifiable claim — check it against
the code before writing it.
Index
| # | Title | Status |
|---|---|---|
| 0001 | OpenSTA as the timing correctness oracle and sole STA path | Accepted (scope expanded 2026-05-01) |
| 0002 | Timing intermediate representation | Accepted |
| 0003 | OpenTimer as in-process reference STA | Superseded (2026-05-01) — spike failed; OpenSTA subprocess only |
| 0004 | Private PDK testing track | Accepted |
| 0005 | OpenSTA vendoring and test-corpus strategy | Accepted |
| 0006 | SDF preprocessing model and interim-to-release cutover | Accepted (amended 2026-05-02) |
| 0007 | Timing model fidelity roadmap | Proposed |
| 0008 | Structured timing output as first-class deliverable | Approved |
| 0009 | OpenSTA Verilog reader inputs | Accepted |
| 0010 | Declarative cell metadata | Accepted |
| 0011 | RAM port-mapping schema for declarative cell metadata | Accepted |
| 0012 | Reproducible CDC jitter injection for multi-clock cosim | Accepted (partial) |
| 0013 | Cosim peripheral model architecture | Accepted |
| 0014 | AIG as simulation intermediate representation | Accepted |
| 0015 | Boomerang execution model and GPU resource mapping | Accepted |
| 0016 | Selective X-propagation | Accepted |
| 0017 | Cosim execution model | Accepted |
How the ADRs relate
- 0014 / 0015 document the core simulation pipeline. 0014 explains why the AIG (and-inverter graph) is the simulation IR: its uniform AND-gate structure enables the boomerang reduction tree and eliminates per-cell dispatch in the GPU kernel. 0015 describes the boomerang execution model itself: the 13-level hierarchical reduction tree, the GPU resource limits it imposes (8191 inputs, 8191 outputs, 4095 intermediates, 64 SRAM groups per partition), the hypergraph partitioning that distributes work across GPU blocks, and the packed instruction format (FlattenedScriptV1) consumed by the kernel. Together they document the path from gate-level Verilog to GPU kernel execution that the GEM paper describes.
- 0001 / 0003 / 0005 / 0006 describe the timing oracle stack: OpenSTA as the ground truth (0001), vendored at a pinned revision with its own corpus reused (0005), driving SDF preprocessing out-of-process (0006). The earlier OpenTimer in-process plan (0003) was retired after the spike (../spikes/opentimer-sky130.md).
- 0002 is the data contract those tools talk over: a JSON timing IR consumed by Jacquard, produced by opensta-to-ir.
- 0004 governs how PDK-specific testing happens for NDA-bound contributors without leaking files into the public repo.
- 0007 / 0008 are the forward-looking pair: 0008 (Approved) defines the structured timing output Jacquard owes downstream flows; 0007 (Proposed) sketches the model-fidelity work needed to back those outputs at scale (δ(T), clock-tree skew, wire delay). Scheduling for both lives in ../plans/post-phase-0-roadmap.md.
- 0013 / 0017 cover the cosim runtime: 0013 documents the peripheral model architecture (CPU-side PeripheralModel trait, GPU-side kernel patterns, ring buffers, plural-config convention); 0017 documents the execution model (batch dispatch loop, multi-clock scheduler, edges-vs-cycles semantics).
- 0016 accepts the selective X-propagation design documented in docs/selective-x-propagation.md. The full seven-phase design lives there; the ADR is a thin acceptance record with a summary of key choices.
Adding a new ADR
- Pick the next number (highest existing + 1).
- Filename: NNNN-short-kebab-title.md.
- Start with # ADR NNNN — <title> and a **Status:** line; set it to match the code, not the intent (see Keeping status honest).
- Standard sections: Context, Decision, Consequences. Add Amendment blocks dated when the decision is revisited; do not rewrite accepted history.
- Add the row to the table above.
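The numbering and filename rules can be sketched as a small helper. The function name and the sample filenames below are hypothetical (the index lists titles, not filenames); only the "highest existing + 1" rule and the NNNN-short-kebab-title.md pattern come from the steps above.

```python
import re


def next_adr_filename(existing, title):
    """Build 'NNNN-short-kebab-title.md' using highest existing number + 1."""
    numbers = [int(name[:4]) for name in existing if re.match(r"\d{4}-", name)]
    next_number = max(numbers, default=0) + 1
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:04d}-{slug}.md"


existing = [
    "0016-selective-x-propagation.md",  # hypothetical filenames
    "0017-cosim-execution-model.md",
    "README.md",
]
print(next_adr_filename(existing, "Timing report schema"))
```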
ADR 0001 — OpenSTA as the timing correctness oracle and sole STA path
Status: Accepted. Scope expanded 2026-05-01 — see Decision §3 below.
Context
Jacquard's current correctness validation for timing relies on its own CPU reference simulator (--check-with-cpu), which shares the Rust source tree, data structures, and parsers with the GPU simulation path. Representation bugs (e.g., hierarchical SDF prefix mismatch, inverter-collapse issues) have passed both paths silently because they affect both.
Historical regressions have been caught only by comparing against genuinely external tools — specifically CVC for functional simulation and, by implication, OpenSTA for timing. No format or tool inside Jacquard is currently treated as authoritative.
OpenSTA is widely deployed in open-source EDA (SKY130, OpenLane2, OpenROAD) and has the largest effective test surface of any open-source STA tool for the Liberty + SDF + Verilog + SPEF stack. It is licensed under GPL-3.0 and also sold commercially.
Jacquard requires permissive licensing for code linked into its binary (see ../project-scope.md).
Decision
OpenSTA is the ground-truth oracle for timing correctness and the sole STA path used by Jacquard.
1. In the shipped release, OpenSTA is never invoked from the jacquard runtime binary, and never linked. Subprocess invocation from CI pipelines, test harnesses, and the standalone opensta-to-ir preprocessing tool (see ADR 0006) is acceptable: GPL's reciprocal requirements do not cross a subprocess boundary ("mere aggregation") and so Jacquard's permissive licensing is preserved. Pre-release, a runtime subprocess invocation may exist as a contributor-ergonomics convenience (per ADR 0006); it is removed before release.
2. All timing, STA, and parser-related code paths are validated against OpenSTA on (a) a vendored subset of OpenSTA's own test corpus, and (b) representative Jacquard test designs.
3. OpenSTA is also Jacquard's only STA path, not just its oracle. ADR 0003 originally proposed an in-process reference STA via OpenTimer to complement this oracle role; the spike (../spikes/opentimer-sky130.md) found OpenTimer's input pipeline unfit for OpenROAD-flow outputs (commit d002bde superseded ADR 0003). The role OpenTimer would have played (providing per-DFF clock arrival, structured timing data for the IR, etc.) now sits with OpenSTA, called out of process via opensta-to-ir. OpenSTA is therefore a required runtime dependency for any timing-aware Jacquard flow, not just for CI validation.
4. Where Jacquard's output disagrees with OpenSTA's output past a declared tolerance, Jacquard is wrong until proven otherwise. Divergence is either fixed, explicitly justified in writing, or filed as a bug.
Consequences
- OpenSTA is a required runtime dependency for timing-aware Jacquard flows (post §3 expansion), not merely a CI/validation dependency. Users running jacquard sim --timing-ir ... need a .jtir produced by opensta-to-ir, which subprocesses OpenSTA. Documented in ../why-jacquard.md.
- Subprocess integration preserves Jacquard's permissive licensing (satisfies project-scope.md).
- "Oracle-diff clean" becomes a required CI gate for timing-related PRs, run nightly or pre-release (not per-PR, because OpenSTA runs on large designs can be slow).
- OpenSTA bugs may produce false-positive divergences. The expectation is to file upstream rather than work around silently. A pinned OpenSTA version in CI avoids drift. With OpenSTA now also the only STA path (not just the oracle), upstream regressions land in users' hands too, so pinning matters more than before.
- A vendored OpenSTA test corpus (or git submodule) is added to the repo as a fixture. Licensing of specific test inputs is verified per file before inclusion.
- No second STA tool to maintain. The original ADR 0003 proposal would have given Jacquard a permissive-licensed in-process reference; the spike showed that's not achievable today with OpenTimer. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted.
Links
- ../project-scope.md: permissive-license constraint.
- ../timing-correctness.md: principle P1, requirement R3.
- ADR 0002: timing IR (the concrete diff format used for oracle comparison).
- ADR 0003: Superseded. OpenTimer in-process reference; spike Q2 fail moved Jacquard to OpenSTA-only. See ../spikes/opentimer-sky130.md for the spike outcome.
- ../why-jacquard.md: user-facing consequence (OpenSTA as a runtime dependency).
ADR 0002 — Timing intermediate representation
Status: Accepted.
Context
Jacquard currently parses SDF directly in src/sdf_parser.rs, a hand-rolled parser that has accumulated reactive fixes (empty () delays, (COND …) pin specs, backslash escapes, edge-qualified timing checks, TIMINGCHECK stripping workarounds for OpenLane2 output). Each new production failure has been a one-off patch.
Commercial tool output adds dialect variation (Cadence, Synopsys extensions). Future parser paths (Liberty, SPEF) and future reference tools (OpenSTA, OpenTimer) each carry their own data models. A format-per-consumer coupling structure will continue to spread parser complexity into the simulator.
The project needs:
- A stable format we consume, with parser complexity isolated from simulator complexity.
- A format that can be diffed between producers (two parsers of the same file must agree).
- A format that supports multi-corner PVT values natively — commercial flows require this; single-corner shortcuts become retrofit pain.
- Preservation of vendor-specific annotations so information is not silently discarded.
- Fast consumption at sim startup (SDF parsing is currently on the critical path).
Decision
Introduce a timing intermediate representation (timing IR) for SDF-equivalent annotation data.
- Binary format: FlatBuffers. Zero-copy reads, schema evolution, cross-language (Rust, C++ for OpenTimer adapter, Python for tooling).
- Text sidecar: JSON, produced via FlatBuffers' JSON round-trip, for CI diffs and human inspection.
- Schema versioning: explicit version field, compatible-evolution rules stated in schema comments. Breaking changes require a major version bump and migration notes.
- Multi-corner native: timing values are min / typ / max across a declared set of PVT corners. Single-corner designs are represented as a single-element corner set.
- Vendor extension passthrough: typed VendorExtension variants (VendorCadence, VendorSynopsys, VendorOther) carry unrecognised annotations as byte-typed blobs with source labels. Consumers opt in to understanding them; the IR never silently drops them.
- Per-arc provenance: each timing arc records source tool, source file, and origin category: asserted (from SDF / input), computed (derived by an STA tool), defaulted (fallback because no better value was available). Provenance is inspectable at consumer side.
- Scope boundary: the IR represents timing annotation data only. It is not a netlist representation, not a timing graph, not cell characterization. Attempts to extend it toward those adjacent formats are rejected; they become separate IRs if needed.
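To picture what one arc record carries under these decisions, here is an illustrative data model. It is only a sketch: the actual schema is defined in FlatBuffers and is not shown in this ADR, so every type and field name below, and the corner name, is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Origin categories named in the per-arc provenance decision."""
    ASSERTED = "asserted"    # from SDF / input
    COMPUTED = "computed"    # derived by an STA tool
    DEFAULTED = "defaulted"  # fallback value


@dataclass(frozen=True)
class Triple:
    min: float
    typ: float
    max: float


@dataclass(frozen=True)
class TimingArc:
    from_pin: str
    to_pin: str
    per_corner: dict  # corner name -> Triple; one entry for single-corner designs
    source_tool: str
    source_file: str
    origin: Origin


arc = TimingArc(
    from_pin="u1/A",
    to_pin="u1/Y",
    per_corner={"tt_025C_1v80": Triple(0.10, 0.12, 0.15)},  # hypothetical corner
    source_tool="opensta-to-ir",
    source_file="design.sdf",
    origin=Origin.ASSERTED,
)
print(arc.origin.value, arc.per_corner["tt_025C_1v80"].typ)
```

A single-corner design is simply an arc whose corner mapping has one entry, which mirrors the "single-element corner set" wording above.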
Consequences
- A new schema and format to maintain. Scope discipline is load-bearing: if the IR creeps toward being a full STA framework, it becomes duplicate work with OpenSTA/OpenTimer.
- Parser complexity moves out of src/sdf_parser.rs (and its future rewrite, per ADR covering #3) into a focused converter crate. Unit-testable in isolation.
- A diff-based test corpus becomes natural: multiple converters on the same input must produce equivalent IR. This is the enforcement mechanism for ADR 0001's oracle pattern.
- Vendor extensions do not require Jacquard code changes — only converter updates.
- Startup parse cost drops: reading binary IR is near-instant. SDF-to-IR conversion becomes a one-time preprocessing step, not repeated per sim.
- Adopting FlatBuffers adds a code-generation step to the build, via flatc. Build hygiene (checked-in generated code, pinned flatc version, or a build-script integration) is required.
- If the IR is ever shared across other tooling beyond Jacquard, its stability contract tightens. Flagged in open questions on timing-correctness.md; not resolved here.
Links
- ../project-scope.md: validation and permissive-license constraints.
- ../timing-correctness.md: requirement R1, principle P5 (multi-corner).
- ADR 0001: OpenSTA oracle (IR is the diff format).
- ADR 0003: Superseded. OpenTimer was the proposed in-process reference STA; spike Q2 fail moved Jacquard to OpenSTA-only via opensta-to-ir. See ../spikes/opentimer-sky130.md.
- ADR 0004: private PDK testing (IR enables portable fixtures without leaking PDK data).
ADR 0003 — OpenTimer as in-process reference STA
Status: Superseded (2026-05-01). Spike (../spikes/opentimer-sky130.md) failed Q2 — OpenTimer's input pipeline cannot handle real OpenROAD-flow .v/.spef for SKY130 designs with bus ports. Fallback is OpenSTA subprocess validation only (ADR 0001); a future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later.
Context
Jacquard needs an in-process reference STA path to:
- Validate SDF-derived timing against an independent computation at load time and on demand (requirement R2 in timing-correctness.md).
- Provide exact per-edge arrival for top-K critical paths (requirement R4, pessimism-delta reporting).
OpenSTA (ADR 0001) is the ground-truth oracle but runs only as a subprocess — unsuitable for per-run, in-process checking. A linked alternative is needed.
Options surveyed:
- OpenTimer (MIT, C++17). Parses .lib/.v/.spef/.sdc directly. Won TAU Timing Analysis Contests (2014 1st, 2015 2nd, 2016 1st); industry "Golden Timer" for benchmark comparisons. Actively maintained (latest push 2025-12-26 as of this writing). Does not parse SDF; timing is computed from Liberty + parasitics.
- libreda-sta (Rust, permissive). Young framework, self-described as "basic components." Unknown whether it handles SKY130 Liberty robustly. Higher maturity risk than OpenTimer.
- Tatum (MIT, C++). Analysis engine only; does not parse Liberty/SDF/Verilog. Using Tatum would require supplying our own parsers, so it does not solve the problem directly.
- In-house Rust walker. Author-shared blind spots with Jacquard's main pipeline reduce the independence benefit.
Decision
Subject to the spike's success, OpenTimer becomes Jacquard's in-process reference STA, integrated via C++ FFI (bindgen or equivalent).
- Linked directly; MIT licence satisfies project-scope.md.
- Computes timing from .lib + .spef independently of any SDF-derived path. This is an accepted (and arguably preferable) property: the reference path shares no parsing with Jacquard's SDF consumer, so a parse bug on either side is detectable rather than mutually masked.
- Emits timing IR (per ADR 0002) so its output is directly diffable against Jacquard's SDF-derived IR.
Spike criteria in ../spikes/opentimer-sky130.md. On spike failure, fallback is to drop the in-process reference entirely and rely on OpenSTA subprocess validation (ADR 0001). This weakens per-PR feedback on timing correctness but is not fatal.
Consequences
- C++ FFI dependency; bindgen-generated bindings; build complexity rises modestly.
- Direct linking preserves permissive licensing (MIT).
- Three-way cross-check becomes the default in CI: Jacquard (SDF path) vs OpenTimer (Liberty+SPEF path) vs OpenSTA (subprocess, full ground truth). Three-way disagreement localises bugs to SDF parse / delay model / tool issue cleanly.
- OpenTimer does not parse SDF. To use it in Jacquard's current flow, OpenLane2 (or equivalent) must produce SPEF alongside SDF. This plumbing change is tracked in the phase-0 plan.
- OpenTimer's maturity is measured in contest benchmarks, not SKY130/GF130 real-flow output. Spike must verify it handles our actual Liberty and SPEF. The spike is structured to fail fast if it does not.
- If OpenTimer is dropped post-spike, alternative in-process references (libreda-sta, in-house) can be revisited; this ADR would be superseded rather than amended.
Links
- ../project-scope.md: licensing constraint, validation constraint.
- ../timing-correctness.md: requirement R2, requirement R4.
- ../spikes/opentimer-sky130.md: spike and success criteria.
- ADR 0001: OpenSTA oracle.
- ADR 0002: timing IR (OpenTimer emits it).
ADR 0004 — Private PDK testing track
Status: Accepted. Plumbing tracked in the phase-0 plan.
Context
Some contributors and operators have access to commercial PDKs (GlobalFoundries, TSMC, and others) under NDA or licensing agreements that prohibit public redistribution of PDK files. Whether a given contributor has access is itself typically under NDA and not publicly known.
Commercial PDK Liberty libraries are substantially richer and quirkier than open-source alternatives — they include cell variants, conditional timing arcs, vendor-specific annotations, and characterization detail not present in SKY130 or AIGPDK. Several parser bugs live only on commercial PDK output.
SKY130-only coverage is insufficient for a sim tool used on commercial flows, and adding commercial PDK files to a public repository is not an option regardless of who operates the project.
The standard industry pattern for testing against proprietary PDKs is environment-gated test suites: tests run when the contributor has licensed access, and skip cleanly when they don't.
Decision
Establish a private PDK test track gated on per-PDK environment variables (e.g. GF130_PDK_PATH, TSMC_PDK_PATH, and similar — one per PDK).
- Tests check for the required env var(s) and skip with a clear "PDK not available" message when unset.
- When env vars point to a readable PDK directory, tests execute fully.
- Only the test harness, expected structural outputs, and IR fixtures (where the PDK vendor licensing permits) are committed.
- No PDK-derived artifacts (.lib, .sdf, .spef, characterization data) are committed to the public repository under any circumstances.
- CI runners with configured PDK access execute the private track; public PRs from non-licensed contributors see the private tests as skipped, not as failures. Which runners have access is determined by whoever operates CI; this ADR does not name specific organisations.
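The environment-gated pattern in the list above can be sketched as follows. This uses Python's unittest purely for illustration; the project's actual test harness may be structured differently, and the class and helper names are hypothetical. Only the GF130_PDK_PATH variable name and the "PDK not available" skip message come from the text.

```python
import os
import unittest


def pdk_path(var_name, environ=os.environ):
    """Return the PDK directory named by var_name, or None if unset or unreadable."""
    path = environ.get(var_name)
    if path and os.path.isdir(path) and os.access(path, os.R_OK):
        return path
    return None


@unittest.skipUnless(
    pdk_path("GF130_PDK_PATH"),
    "PDK not available: GF130_PDK_PATH is unset or not a readable directory",
)
class Gf130StructuralTests(unittest.TestCase):
    def test_pdk_directory_is_listable(self):
        # Placeholder check; real tests would assert on IR structure only.
        self.assertIsInstance(os.listdir(pdk_path("GF130_PDK_PATH")), list)


if __name__ == "__main__":
    unittest.main()
```

Without the variable set, the test class is reported as skipped rather than failed, which matches what contributors without PDK access should see.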
The timing IR (ADR 0002) makes this feasible: converter output and diff results can be checked in as fixtures where they contain no PDK-licensed data. Expected behaviour can be asserted in terms of IR structure rather than in terms of specific cell timings that would leak characterization data.
Consequences
- Contributors without PDK access cannot locally reproduce PDK-specific bugs. They rely on maintainer CI for validation.
- A separate setup doc for licensed contributors is required (not public). It covers env-var configuration, test runner invocation, and PDK-file staging expectations.
- Fixture schema must be PDK-agnostic enough that structural assertions don't implicitly leak cell-characterization data. Review process must check new fixtures against this rule before merge.
- Bugs found via private PDK testing are, where possible, distilled into minimal public reproducers. The private track is not a place to park unreviewable tests — every private test should ideally surface a public fixture once the bug's essence is extracted.
- CI cost rises (licensed runners). Runs are nightly or pre-release rather than per-PR.
Links
- ../project-scope.md — validation constraint.
- ../timing-correctness.md — requirement R5.
- ADR 0002 — timing IR (enables portable fixtures).
ADR 0005 — OpenSTA vendoring and test-corpus strategy
Status: Accepted.
Context
Under ADR 0001, OpenSTA is the ground-truth oracle for timing correctness, invoked as a subprocess. Phase 0 (../plans/phase-0-ir-and-oracle.md) requires:
- A reproducible, pinned OpenSTA reference so CI diffs are comparable run-to-run.
- Access to OpenSTA's test inputs for stress testing our OpenSTA-driven converters.
- Separately, a primary regression corpus representative of Jacquard's actual use cases.
Two questions were considered jointly: (a) how we pin / reference the OpenSTA codebase, and (b) how we use their test data.
On vendoring source: OpenSTA is licensed GPL-3.0. Copying its source into Jacquard's repository as committed code creates licensing ambiguity for a permissive-licensed project. Git submodules are conventionally treated differently — the parent repository pins a commit reference, does not incorporate the submodule's source into its own commits, and inherits no license obligations from the submodule's presence. This convention is widely relied on in permissive projects that depend on GPL tooling at arm's length.
On test data: OpenSTA's corpus exercises OpenSTA's concerns — Liberty parsing edge cases, SI-aware analysis, timing-check variants specific to its engine. Much of it does not exercise anything Jacquard does, and some of it exercises features Jacquard deliberately does not support. Using it as the primary regression corpus would optimise for the wrong target: our converters would be validated against files OpenSTA cares about, not files Jacquard actually encounters.
Its real value to Jacquard is as a stress / robustness corpus: a large bank of real-world-ish timing files that exercise parser edge cases and dialect variants. A converter that survives their entire corpus is more robust than one validated against a hand-curated subset.
Decision
Vendoring
- OpenSTA is vendored as a git submodule at vendor/opensta/.
- The submodule is not built by Jacquard's build. Jacquard's subprocess invocations use whatever OpenSTA binary is installed in the developer or CI environment.
- The submodule exists for two purposes only: (a) pinning a specific OpenSTA version for CI reproducibility, (b) providing in-tree access to its test corpus without redistribution.
- Licensing: by git-submodule convention, the submodule's GPL-3.0 licence does not extend to the parent repository. This is the standard interpretation; contributors redistributing binaries or compiled artefacts should nonetheless verify the interpretation applies to their specific jurisdiction and use.
Test corpus split
Two corpora, two distinct roles:
- Primary regression corpus at tests/timing_ir/corpus/.
  - Jacquard-specific designs: SKY130 MCU SoC, NVDLA, AIGPDK examples, representative SDFs from the real Jacquard flow.
  - Small, curated, committed directly.
  - Run on every CI execution.
  - Exit criterion: every file converts cleanly and matches golden IR within declared tolerance.
- Stress / robustness corpus at tests/timing_ir/stress/ as a manifest file listing paths into vendor/opensta/<test-tree-subdir>/.
  - Not committed as duplicated data; the manifest references submodule paths.
  - Large, whatever upstream maintains.
  - Run nightly or pre-release, not per-PR.
  - Exit criterion: no crashes, no hangs, no malformed IR. Numerical agreement with OpenSTA not required — this corpus is for robustness, not correctness.
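As an illustration of the manifest idea, the sketch below resolves manifest entries against the submodule root. The ADR does not specify the manifest format, so the format here is an assumption: one relative path per line, with blank lines and `#` comments ignored. `resolve_manifest` is a hypothetical name.

```rust
use std::path::{Component, Path, PathBuf};

/// Turn manifest text into paths under the submodule root.
/// Assumed format (not taken from the ADR): one relative path per line;
/// blank lines and lines starting with `#` are ignored.
pub fn resolve_manifest(manifest: &str, submodule_root: &Path) -> Result<Vec<PathBuf>, String> {
    let mut out = Vec::new();
    for (idx, raw) in manifest.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let rel = Path::new(line);
        // Reject entries that could point outside the submodule tree.
        let escapes = rel.is_absolute()
            || rel.components().any(|c| matches!(c, Component::ParentDir));
        if escapes {
            return Err(format!(
                "manifest line {}: path must stay inside the submodule: {line}",
                idx + 1
            ));
        }
        out.push(submodule_root.join(rel));
    }
    Ok(out)
}
```

Because the manifest only names paths, the stress inputs themselves never enter Jacquard's own commits; a missing or stale submodule checkout shows up as unreadable paths at run time rather than as a repository diff.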
Copying from stress corpus into primary corpus
If a stress-corpus file exposes a bug, a minimal reproducer may be distilled and added to the primary corpus. When doing so:
- Verify the specific file's licence before copying. OpenSTA's overall GPL-3.0 licence does not imply every test input is GPL-3.0 — some test inputs are vendor-derived or public-domain.
- Prefer distilling a synthetic minimal reproducer over copying the original file wholesale.
Consequences
- CI reproducibility: pinned submodule means we control when OpenSTA version changes land. Bumping the pin is an explicit, reviewable step.
- Repository size grows by OpenSTA's submodule size (multi-megabyte) but not by test-data duplication.
- Maintenance cadence: periodic submodule pin updates are a known maintenance item. Not frequent, but not zero.
- Primary regression corpus stays lean and directly relevant; developers can reproduce corpus-level failures locally without pulling the entire submodule.
- Stress-corpus failures are treated as bugs against our converter, never as bugs against OpenSTA's test inputs.
- Licensing posture is conventionally defensible; if stronger legal assurance is ever required, the submodule can be replaced by the external-install-only option (drop the submodule, rely purely on whatever OpenSTA is installed) at the cost of losing in-tree test access.
Links
- ../project-scope.md — licensing constraints.
- ../timing-correctness.md — R3 (oracle-backed CI).
- ADR 0001 — OpenSTA as oracle (establishes the subprocess model).
- ADR 0002 — timing IR (the format being stress-tested).
../plans/phase-0-ir-and-oracle.md— Phase 0 WS4 implements this split.
ADR 0006 — SDF preprocessing model and interim-to-release cutover
Status: Accepted 2026-04; amended 2026-05-02 (see § Amendment).
Amendment (2026-05-02)
The original Decision treated subprocess invocation of OpenSTA from the shipped Jacquard runtime as license-incompatible, requiring Phase 3 (native Rust SDF→IR converter) to land before first release. On review of GPL-3 § 5 ("aggregate") and the FSF interpretation of subprocess/IPC boundaries, this restriction is more conservative than necessary. The relevant facts:
- The interface is arms-length: standard EDA interchange formats (Liberty / Verilog / SDF / SPEF / SDC) in, our own IR JSON (ADR 0002) out. No shared data structures, no headers, no linking.
- We do not bundle OpenSTA in any Jacquard distribution. The user installs OpenSTA themselves; user-side combination of separately-distributed programs is not "distribution of a combined work" under GPL-3.
- The original "no runtime subprocess" rule was effectively a commercial-perception buffer, not a strict licensing requirement.
Revised bright lines (these supersede the original "Shipped release" sub-section):
- No linking of GPL code into the Jacquard binary. Unchanged.
- No bundling of OpenSTA (or any GPL tool) in Jacquard distribution artefacts (release tarballs, Homebrew formulae, Docker images that ship as Jacquard releases). If a packager wants to bundle, they take on GPL distribution obligations themselves.
- Subprocess invocation of user-installed OpenSTA from the shipped runtime is permitted. jacquard sim input.sdf may keep its opensta-to-ir subprocess hook in shipped releases, provided OpenSTA is discovered on PATH rather than bundled.
Phase 3 reclassification. Native Rust SDF→IR converter is no longer release-gating. It remains a goal — for ergonomics (no OpenSTA install required) and for downstream commercial integrators whose legal teams treat any GPL touchpoint as risk — but ships when bandwidth allows, not as a release blocker. Roadmap consequences are tracked in ../plans/post-phase-0-roadmap.md § Phase 3.
Corequisite: OpenSTA detection and version check (release-blocking). Relaxing the no-runtime-subprocess rule is conditional on the shipped runtime giving users a meaningful error when OpenSTA is missing or out of date. Today (src/sim/setup.rs:248-264), a missing OpenSTA only emits a warn! and the simulation proceeds with no timing data loaded. That is acceptable during development but would ship as a UX bug. Concretely, before first release we must:
- Hard-fail (not warn) when --sdf is requested and OpenSTA cannot be located.
- Probe OpenSTA's version on first invocation and fail with a remediation message if it is older than the version pinned in vendor/opensta/ (per ADR 0005).
- Warn-but-proceed if the detected version is newer than the latest tested version, naming the version in the warning.
- Document the OpenSTA dependency in docs/usage.md.
Tracked as WS-RH.1 in ../plans/post-phase-0-roadmap.md § Release hardening.
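The three version rules above reduce to a small decision function. The sketch below is illustrative only: `Version`, `OpenStaCheck`, and `check_opensta` are hypothetical names, how the version is probed is left out, and it assumes the submodule pin can be mapped to a dotted numeric version for comparison.

```rust
/// A dotted numeric version, compared component by component
/// (derived ordering on the inner Vec is lexicographic).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub Vec<u32>);

impl Version {
    /// Parse text such as "2.6.0"; returns None if any component is not a number.
    pub fn parse(text: &str) -> Option<Version> {
        let parts: Option<Vec<u32>> =
            text.trim().split('.').map(|p| p.parse::<u32>().ok()).collect();
        parts.map(Version)
    }

    /// Render back to dotted form for messages.
    pub fn render(&self) -> String {
        self.0.iter().map(|n| n.to_string()).collect::<Vec<_>>().join(".")
    }
}

/// What the runtime should do after probing for OpenSTA.
#[derive(Debug, PartialEq)]
pub enum OpenStaCheck {
    Proceed,
    /// Newer than the latest tested version: warn, naming the version, and continue.
    WarnNewer { detected: String },
    /// OpenSTA could not be located: hard failure.
    FailMissing,
    /// Older than the pinned version: hard failure with remediation details.
    FailTooOld { detected: String, pinned: String },
}

pub fn check_opensta(
    detected: Option<&Version>,
    pinned: &Version,
    latest_tested: &Version,
) -> OpenStaCheck {
    match detected {
        None => OpenStaCheck::FailMissing,
        Some(v) if v < pinned => OpenStaCheck::FailTooOld {
            detected: v.render(),
            pinned: pinned.render(),
        },
        Some(v) if v > latest_tested => OpenStaCheck::WarnNewer { detected: v.render() },
        Some(_) => OpenStaCheck::Proceed,
    }
}
```

Keeping the policy as a pure function separate from the probe makes each of the four outcomes testable without an OpenSTA install.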
Code-comment cleanup follow-up. The INTERIM per ADR 0006 / Pre-release only tags in src/sim/setup.rs (lines ~176, ~228, ~286) and src/bin/jacquard.rs (~187) describe a premise that no longer applies. Folded into WS-RH.1 (../plans/post-phase-0-roadmap.md § Release hardening) rather than spun out as a separate cleanup commit.
The original Context, Decision (Phase 0 + Phase 3), and Walk-back sections below are retained for historical record. Where they conflict with the bright lines above, the bright lines win.
Context
Jacquard's hand-rolled SDF parser (src/sdf_parser.rs) has accumulated reactive maintenance over time — empty () delays, (COND …) pin specs, escape handling, edge-qualified timing checks, TIMINGCHECK-stripping workarounds for OpenLane2 output. Each production failure has been a one-off patch. The timing-correctness review flagged this as issue #3, and a native Rust grammar-based replacement is the Phase 3 deliverable.
Concurrently, ADR 0001 establishes OpenSTA as the timing correctness oracle (subprocess, never linked, GPL), and ADR 0002 introduces a timing IR that decouples parsing from consumption.
Two facts together shape the decision:
- No release pressure. Release can happen after Phase 3 lands. We are not forced to keep the hand-rolled parser alive while waiting on Phase 3.
- Permissive-license constraint applies to the shipped binary. Subprocess invocation of GPL tooling is acceptable — does not trigger reciprocal obligations — and during pre-release development, even in-runtime subprocess invocation does not violate the constraint because no runtime binary is being distributed.
Given these, maintaining the hand-rolled parser through Phase 0–2 is unnecessary. OpenSTA's mature dialect coverage can substitute, via subprocess, while we build toward a native Rust replacement at our own pace.
Decision
Phase 0
- Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. All paths that previously consumed SDF now consume timing IR.
- Ship opensta-to-ir as a standalone preprocessing tool that consumes Liberty + Verilog + SDF + SPEF + SDC and emits timing IR. Subprocess-based on OpenSTA. Production-quality: stable CLI, documented exit codes, clear diagnostics.
- Canonical runtime path is jacquard sim --timing-ir <path>, consuming pre-converted IR. This path works without OpenSTA on the user's machine — pre-converted IR is sufficient.
- Interim ergonomic path: during development (pre-release only), jacquard sim input.sdf subprocesses opensta-to-ir internally to produce IR on the fly. This is a contributor convenience, not a shipping feature. Flag exists in code as pre-release only with a clear comment tying back to this ADR.
Phase 3
- Native Rust SDF→IR converter replaces the OpenSTA subprocess call inside jacquard sim input.sdf. Grammar-based (nom / pest), validated against OpenSTA on the corpus per ADR 0001.
- Lands before first release.
Shipped release
- No OpenSTA invocation from the jacquard runtime binary. The native Rust converter handles SDF inputs directly.
- opensta-to-ir remains as an alternative preprocessing tool. Users who want OpenSTA-computed timing may use it; subprocess model preserves permissive licensing.
Walk-back options (if assumptions change)
- If OpenSTA dialect coverage proves insufficient during Phase 0 — e.g., a current Jacquard-supported SDF fails to parse — add dialect shims to opensta-to-ir's post-processing. Reinstating the hand-rolled parser is the last resort, not the first.
- If the Phase 3 Rust rewrite stalls — ship the first release with preprocessing-only (no jacquard sim input.sdf path), remove the interim subprocess, and land the native converter in a later release. No information lost; users preprocess manually. This is already the post-release shape for opensta-to-ir; it's only the jacquard sim input.sdf convenience that would be deferred.
- If OpenSTA becomes unmaintainable or disappears — the submodule pin (ADR 0005) remains authoritative for the integrated version. A forked submodule can maintain any necessary patches.
Consequences
- Jacquard's repository stops carrying a hand-rolled SDF parser as a reactive-maintenance target. Bugs in SDF interpretation between Phase 0 and Phase 3 are OpenSTA's problem (upstream) or opensta-to-ir post-processing's problem, not Jacquard's core codebase's problem.
- Pre-release ergonomic one-step workflow for contributors is preserved.
- Contributors running Jacquard on a new design (no pre-converted IR) must have OpenSTA installed during Phase 0 through Phase 3. For existing primary-corpus designs, pre-converted IR is checked in; no OpenSTA needed.
- Release-time check is unambiguous: either the runtime subprocess is replaced by native code, or it is removed entirely. Both outcomes satisfy the permissive-licensing constraint for the shipped binary.
- Test corpus regenerable: if OpenSTA updates change IR output, golden files are regenerated deliberately (reviewable diff), not silently.
Links
- ../project-scope.md — licensing constraint, preprocessing-tools pattern.
- ../timing-correctness.md — P1 (oracle), R1 (IR).
- ADR 0001 — OpenSTA as oracle (subprocess model).
- ADR 0002 — timing IR (format consumed).
- ADR 0005 — OpenSTA vendoring (submodule for reproducibility + stress corpus).
../plans/phase-0-ir-and-oracle.md— WS2 productionisation, WS3 deletion + interim hook.
ADR 0007 — Timing model fidelity roadmap
Status: Proposed.
Context
Jacquard's timing model today consumes SDF-equivalent annotations via the timing IR (ADR 0002), produced and validated by OpenSTA called out of process (ADR 0001 — sole STA path; ADR 0003's in-process OpenTimer alternative was Superseded by the spike). The accuracy contract at present is "±5% on arrival times against CVC reference" per timing-validation.md. This is acceptable for sky130-class designs at ≥10 ns clock periods.
Three structural simplifications in the current implementation become accuracy bottlenecks at scale:
- Static δ∞ per gate. No pulse-degradation modelling. Glitch behaviour and short-pulse propagation cannot be represented. The Involution Delay Model (Maier 2021, arXiv:2107.06814) demonstrates this is the root cause of inertial-delay's known failure modes, and provides a model that's both faithful and implementable.
- Zero clock-tree skew. During AIG construction (src/aig.rs:495-560), clock buffers/inverters/gating cells collapse to a single polarity flag on the DFF. SDF arcs and interconnect on the clock tree are silently dropped. Every DFF on a clock domain is treated as capturing simultaneously.
- Per-cell-max wire delay. src/flatten.rs:1850-1872 lumps all interconnect arrivals at a destination cell into a single max value, with no rise/fall distinction. Adequate for short local routes; incorrect for long routes where wire delay rivals or exceeds gate delay (typical of NoCs at 22nm and faster).
The full design analysis is in docs/timing-model-extensions.md. This ADR captures the decision to commit to closing these three gaps as a roadmap, sets the staged ordering, and constrains how the implementation may evolve.
Decision
Adopt a three-pillar roadmap for closing the fidelity gap with CVC, while preserving Jacquard's GPU-throughput advantage. All three pillars are consumer-side work (src/flatten.rs, src/aig.rs, src/sim/cosim_metal.rs, the kernel arrival math); none require schema changes inconsistent with ADR 0002 nor abandoning the cycle-accurate boomerang kernel architecture.
Pillar A — Dynamic delay (δ(T))
Per-gate dynamic delay parameterised on T (time since last output transition). Three stages of increasing accuracy:
- Static IDM. Bake worst-case δ(T) into the existing per-thread script slot using STA pulse-width estimates. No kernel change.
- Dynamic δ(T). Add last_transition_ps and last_value persistent buffers per AIG pin; kernel evaluates δ(T) from a small per-cell LUT during arrival propagation.
- Sub-cycle ticks. Multiple arrival propagations per logical cycle, enabling true glitch suppression. Out of scope for this ADR. Would require a different kernel architecture; if pursued, requires its own ADR superseding this one.
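To make the "small per-cell LUT" concrete, the sketch below evaluates δ(T) by linear interpolation over a handful of samples. It is a CPU-side illustration under stated assumptions, not the kernel code: `DeltaSample` and `eval_delta` are hypothetical names, the interpolation scheme is a choice made here, and any sample values are invented rather than characterised.

```rust
/// One sample of a per-cell delay curve: the gate delay (ps) that applies when
/// the previous output transition happened `t_ps` picoseconds earlier.
#[derive(Debug, Clone, Copy)]
pub struct DeltaSample {
    pub t_ps: f32,
    pub delay_ps: f32,
}

/// Evaluate delta(T) by linear interpolation over samples sorted by strictly
/// ascending `t_ps`. Outside the sampled range the nearest end value is used,
/// so a large T returns the last sample, which stands in for the static
/// long-T delay. Returns None for an empty table.
pub fn eval_delta(lut: &[DeltaSample], t_ps: f32) -> Option<f32> {
    let first = lut.first()?;
    let last = lut.last()?;
    if t_ps <= first.t_ps {
        return Some(first.delay_ps);
    }
    if t_ps >= last.t_ps {
        return Some(last.delay_ps);
    }
    for pair in lut.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        if t_ps <= b.t_ps {
            // Fraction of the way from sample a to sample b.
            let frac = (t_ps - a.t_ps) / (b.t_ps - a.t_ps);
            return Some(a.delay_ps + frac * (b.delay_ps - a.delay_ps));
        }
    }
    Some(last.delay_ps)
}
```

In the dynamic stage, T would be the current time minus the pin's last_transition_ps. The static stage corresponds to evaluating the curve once, at a worst-case T taken from STA pulse-width estimates, and baking the result into the script.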
Pillar B — Clock-tree skew
Per-DFF clock arrival accounting via TimingIR extension (ClockArrival table) populated by OpenSTA via opensta-to-ir (ADR 0001 — ADR 0003's OpenTimer alternative is Superseded). Per-pair CRPR is intentionally not modelled at this stage; per-DFF capture-side arrival is, treating launch as the 0-reference. Consumed by extending DFFConstraint with a clock_arrival_ps: i16 field, folded into the existing per-word setup/hold check in src/flatten.rs via DFFConstraint::effective_setup_hold. No kernel change for the baseline case; bucketed packing is an option if pessimism becomes material. Stages 1+2 landed: commits c403cc8 (producer) and 6767c3e (consumer).
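One plausible reading of "folded into the existing setup/hold check" is sketched below. The struct is a hypothetical stand-in for DFFConstraint, and the arithmetic is an assumption made for illustration: with launch as the 0-reference, a capture clock that arrives `clock_arrival_ps` late relaxes setup by that amount and tightens hold by the same amount. The real `effective_setup_hold` may differ.

```rust
/// Hypothetical stand-in for DFFConstraint; only the fields needed here.
#[derive(Debug, Clone, Copy)]
pub struct DffConstraintSketch {
    pub setup_ps: i16,
    pub hold_ps: i16,
    /// Capture-side clock arrival relative to the launch reference (0).
    pub clock_arrival_ps: i16,
}

impl DffConstraintSketch {
    /// (effective_setup, effective_hold) in ps, widened to i32 so the
    /// adjustment cannot overflow the i16 storage type.
    pub fn effective_setup_hold(&self) -> (i32, i32) {
        let c = i32::from(self.clock_arrival_ps);
        (i32::from(self.setup_ps) - c, i32::from(self.hold_ps) + c)
    }

    /// (setup_slack, hold_slack) for one data arrival within one clock period.
    /// Negative slack means a violation. Simplification: the same arrival is
    /// used for both checks.
    pub fn slacks(&self, data_arrival_ps: i32, clock_period_ps: i32) -> (i32, i32) {
        let (setup, hold) = self.effective_setup_hold();
        (clock_period_ps - setup - data_arrival_ps, data_arrival_ps - hold)
    }
}
```

Under this reading, the per-word check itself is unchanged and only the constants it compares against move, which is consistent with the "no kernel change" claim for the baseline case.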
Pillar C — Wire delay at scale
Three fidelity tiers:
- Tier 1: Per-receiver consumption. Key wire delay by (src_aigpin, dst_aigpin) edge in the AIG, with rise/fall distinction preserved. Mostly a src/flatten.rs:1850-1872 rewrite. No kernel change.
- Tier 2: Inter-partition arc delay. Explicit modelling of wire delay on partition-crossing signals. Touches src/sim/cosim_metal.rs shuffle pipeline. Required for many-core/NoC designs at advanced processes.
- Tier 3: NoC-aware partitioning hints. Soft bias in src/repcut.rs favouring cuts on flagged net patterns. Optional optimisation that makes Tier 2 cheap on tile-decomposed designs.
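The difference between the current per-cell-max behaviour and Tier 1 can be shown with a toy lookup table. This is a sketch, not the src/flatten.rs code: the type names, the u32 pin identifiers, and the HashMap representation are all assumptions chosen for readability.

```rust
use std::collections::HashMap;

/// Rise and fall wire delay for one interconnect edge, in picoseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RiseFall {
    pub rise_ps: u32,
    pub fall_ps: u32,
}

/// Edge key: (src_aigpin, dst_aigpin). Pin ids are plain u32 in this sketch.
pub type EdgeKey = (u32, u32);

/// Behaviour as the ADR describes it today: one max value per destination,
/// with rise/fall collapsed. Returns 0 when the destination has no edges.
pub fn per_cell_max(edges: &HashMap<EdgeKey, RiseFall>, dst: u32) -> u32 {
    edges
        .iter()
        .filter(|((_, d), _)| *d == dst)
        .map(|(_, rf)| rf.rise_ps.max(rf.fall_ps))
        .max()
        .unwrap_or(0)
}

/// Tier 1: the delay of one specific edge, with rise/fall preserved.
pub fn per_receiver(edges: &HashMap<EdgeKey, RiseFall>, src: u32, dst: u32) -> Option<RiseFall> {
    edges.get(&(src, dst)).copied()
}
```

With a short 20/25 ps route and a long 300/280 ps route into the same cell, the per-cell-max view charges every input 300 ps, while the per-receiver view charges the short route only its own delay.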
Sequencing constraint
- Pillar B Stage 1+2 is the cheapest accuracy improvement. Originally gated on the (now Superseded) OpenTimer integration; landed early on top of the OpenSTA-out-of-process path instead. See commits c403cc8 / 6767c3e.
- Pillar C Tier 1 is independent of which STA tool feeds the IR and can proceed any time.
- Pillar A Stage 1 (Static IDM) is the cheapest δ(T) entry point, gated on per-cell SPICE characterisation effort. Schedule this only after Pillars B and C land — δ(T) compounds on top of correct wire/skew baseline; doing it earlier risks chasing characterisation noise that's actually wire-delay error.
- Pillar C Tier 2 lands when a real many-core/NoC use case appears in the test corpus and Tier 1 measurement shows it's needed.
- Pillar A Stage 2 (Dynamic δ(T)) is a substantial implementation; schedule only when Stage 1 reports indicate the value is real, and a contributor with the analog-characterisation domain expertise is willing to lead it.
- Pillar A Stage 3 (Sub-cycle ticks) is explicitly out of scope of this ADR.
Validation contract
- Each pillar lands with regression coverage extending timing-validation.md's ±5% tolerance. Tighter tolerances may apply per pillar (Pillar B should achieve ≤±2% on skew-aware paths with OpenSTA-fed per-DFF arrival as currently implemented; Pillar C Tier 1 should achieve ≤±3% on long-wire paths).
- Each pillar must demonstrate no regression on the existing primary corpus before merge.
- The IR schema may be extended (additive only) per ADR 0002 to carry pillar-specific data. Extensions require a minor schema bump and a documented consumer-version compatibility note.
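A per-pillar tolerance check of the kind described above might look like the following. This is a minimal sketch: `within_tolerance` is a hypothetical helper, and treating the tolerance as a relative error against the reference arrival is an assumption about how timing-validation.md defines it.

```rust
/// True when `observed_ps` is within `tolerance_pct` percent of `reference_ps`.
/// Written as a cross-multiplied comparison to avoid dividing by the reference.
pub fn within_tolerance(observed_ps: f64, reference_ps: f64, tolerance_pct: f64) -> bool {
    let diff = (observed_ps - reference_ps).abs();
    diff * 100.0 <= tolerance_pct * reference_ps.abs()
}
```

A regression test would then apply 5.0 for the baseline contract, 2.0 on skew-aware paths once Pillar B applies, and 3.0 on long-wire paths for Pillar C Tier 1.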
Consequences
- The "±5%" line in timing-validation.md becomes a per-pillar specification rather than a single number. The doc is updated as each pillar lands.
- crates/timing-ir/schemas/timing_ir.fbs accumulates additive extensions for clock arrival and per-cell δ(T) parameters. Schema versioning per ADR 0002 governs.
- No changes to the cycle-accurate boomerang kernel architecture. The cost of preserving that architecture is permanent: no glitch propagation, no metastability oscillation, no asynchronous handling. These remain non-goals (per project-scope.md) unless a future ADR explicitly supersedes this position.
- Per-cell SPICE characterisation effort is acknowledged as the long-pole risk for Pillar A. If characterisation cost proves prohibitive, Pillar A reduces to "Stage 1 only, using Liberty-derived ECSM/CCSM data as approximation," and the gap with CVC's full IDM fidelity remains open. This is acceptable; Pillar A Stage 2 is not a release-gating commitment.
- Jacquard's positioning (why-jacquard.md) becomes coherent: STA-complement-not-replacement, vector-driven timing at GPU scale, fidelity comparable to CVC where the cycle-accurate kernel architecture allows.
Walk-back options
- If a pillar's measurement shows the accuracy gain is smaller than expected, descope it. Each pillar's first stage is sized to deliver measurable improvement; if it doesn't, later stages of that pillar are deferred or abandoned.
- If the IR schema extensions cause downstream tooling friction, fall back to vendor-extension passthrough (VendorExtension in timing_ir.fbs) until the typed schema stabilises. Already supported.
- OpenTimer integration was retired (ADR 0003 Superseded by the spike outcome). Pillar B did not need the documented fallback to manual clock-tree accumulation in src/aig.rs — OpenSTA's per-pin arrival via opensta-to-ir covers the same ground without the per-pair CRPR credit (deferred to Stage 3 if measurement justifies it).
Links
- ../timing-model-extensions.md — full technical analysis underlying this ADR.
- ../why-jacquard.md — positioning context: where this fidelity work fits in the user value story.
- ../adr/0001-opensta-as-oracle.md, ../adr/0002-timing-ir.md — preceding decisions this ADR builds on.
- ../adr/0003-opentimer-primary-sta.md — Superseded by the spike outcome (../spikes/opentimer-sky130.md); referenced here for historical context.
- ../timing-validation.md — validation tolerance contract that each pillar updates.
- ../project-scope.md — synchronous-only / cycle-accurate constraints that bound what this ADR can pursue.
ADR 0008 — Structured timing output as first-class deliverable
Status: Approved.
Context
Jacquard produces timing information today through three channels: timed VCD (--timed), per-violation clilog::warn! messages on stderr, and an in-process SimStats counter. The why-jacquard.md analysis identifies a gap between the timing data Jacquard has internally and the answers users actually need from a flow:
| User question | Today |
|---|---|
| Did my workload trip any violations? | SimStats counts (in-process API only) |
| Which DFFs nearly missed timing? | Not extractable without parsing stderr |
| Show me arrival distribution per signal | Reconstructable from --timed via post-processing only |
| Which DFF was that violation on? | State-word index + manual lookup |
| What path caused the worst arrival? | Not available |
| Run this in CI and fail if any violation | Possible only via stderr grep |
The most acute problem: stderr violation messages identify a state-word index, not a signal name. Mapping back to "which DFF, which path" requires manual investigation. On a violating design the message volume can be enormous (one warning per word per cycle per type). The data needed to do better — hierarchical signal names, DFF instance paths, per-DFF arrival distributions — already exists in the netlistdb and event buffer; it is simply not surfaced in usable form.
This ADR is about making Jacquard's timing output useful in a real flow rather than merely produced. The substantive work in ADR 0007 (model fidelity) is wasted if no one can extract the answers.
The full design analysis is in docs/why-jacquard.md, "Output interface" section.
Decision
Treat structured, machine-readable timing output as a first-class shipping deliverable, not an optional improvement. Land the work in priority order, where priority is set by user impact per implementation cost, not by technical interest.
Required outputs
The following are required for Jacquard to be considered usable for vector-driven timing analysis in a real flow. They land before any further fidelity work past ADR 0007 Pillar B Stage 1+2.
1. Symbolic violation messages. Replace state-word indices with hierarchical signal names in stderr violation output. Mapping data already exists in netlistdb. Cost: contained edit in src/event_buffer.rs:305-338 plus name-resolution helper. Highest UX impact per LoC of any improvement on this list.
2. --timing-report <path.json>. Structured JSON document at end-of-run containing:
   - Per-DFF worst arrival, worst slack, violation count over the run.
   - Per-cycle violation list (cycle, signal name, hierarchical path, arrival, constraint, slack).
   - Aggregate stats: total violations, distribution buckets, peak arrival per clock domain.
   - Per-signal activity summary: transition count, average/max arrival, idle cycles.
   - Run metadata: clock period, SDF/IR file, design hash, vector source.

   Required for CI integration and any downstream tooling. Schema versioned; additive extension policy mirrors crates/timing-ir.
3. --timing-summary. Fast text summary, no VCD. Designed for scripts and human inspection of long runs. Contents:
   - Vectors run, clock period, corner.
   - Setup/hold violation totals.
   - Worst-slack DFF (setup and hold) with hierarchical path.
   - Peak arrival per writeout vs clock budget, with margin percentage.

   Cost: trivial wrapper over (2)'s data.
4. Per-DFF worst-slack ranking. Top-N DFFs by closest-to-violation slack across the entire run, even when no violation occurred. Surfaces "where am I close to the edge" without requiring a violation to actually trip. Output as part of (2) and (3); also accessible via a dedicated --worst-slack-n N flag for quick inspection.
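The worst-slack ranking is, at its core, a sort-and-truncate over per-DFF minima. The sketch below shows that step in isolation; `DffSlack` and `worst_slack_top_n` are hypothetical names, and the tie-break on path is a choice made here so that reports stay deterministic.

```rust
/// Worst (smallest) slack observed for one DFF across the whole run.
#[derive(Debug, Clone, PartialEq)]
pub struct DffSlack {
    /// Hierarchical instance path of the DFF.
    pub path: String,
    /// Smallest slack seen, in picoseconds; negative means a violation.
    pub worst_slack_ps: i64,
}

/// Return the `n` DFFs closest to (or past) violation, smallest slack first.
/// Ties are broken by path so the output order is stable between runs.
pub fn worst_slack_top_n(mut dffs: Vec<DffSlack>, n: usize) -> Vec<DffSlack> {
    dffs.sort_by(|a, b| {
        a.worst_slack_ps
            .cmp(&b.worst_slack_ps)
            .then_with(|| a.path.cmp(&b.path))
    });
    dffs.truncate(n);
    dffs
}
```

Because positive slacks are ranked alongside negative ones, a run with zero violations still yields a useful list of the DFFs nearest the edge.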
Optional / later outputs
The following are higher-value-but-lower-priority. They land after the four required items above, in any order driven by user demand.
5. --arrival-histogram <pattern>. Per-signal arrival histogram dump for matched signal patterns, as JSON or CSV. Foundation for activity-based power analysis.
6. --sta-cross-reference <opensta-paths.txt>. Cross-reference OpenSTA's critical-path report against observed worst arrivals. Closes the loop between vector-driven and static analysis. Coverage-style "of the top-N STA paths, which were exercised, and at what observed arrival." (Originally framed against OpenTimer; ADR 0003 was Superseded — OpenSTA is the only STA tool Jacquard interoperates with now.)
7. Path-back-trace from worst-arrival DFF. Given a flagged DFF, walk the max-of-fanin chain backward to the source AIG pin / primary input, emitting the path with per-edge contribution. Most expensive item on this list; only useful once the cheaper items are in place.
Backward compatibility
- All new outputs are opt-in via flags. Existing stderr behaviour and --timed semantics are unchanged.
- Symbolic violation messages (item 1) do change existing stderr format. This is intentional: the current state-word-index format is not a stable contract and is not consumed by any known automation. Format change documented in changelog at land time.
Output stability contract
- The --timing-report JSON is a stable consumer-facing format. Schema versioned. Additive-only extensions per the IR convention; breaking changes require a major version bump and a transition period.
- --timing-summary is human-readable and explicitly not stable for parsing. Tools should consume the JSON.
- Stderr violation messages remain human-oriented; tools should not parse them.
Consequences
- Jacquard becomes usable in CI without bespoke stderr parsing. Existing users who scrape stderr will need to migrate to the JSON report; the migration window is the release in which symbolic messages land.
- The SimStats in-process API gains a public counterpart: end-of-run JSON. This raises the bar for changes to either — they must agree.
- Documentation gains a "Jacquard timing report format" reference page. Sample reports from the corpus designs are checked in to tests/timing_ir/corpus/ alongside golden IR.
- The why-jacquard.md positioning becomes truthful: the user-facing claim "vector-driven setup/hold answers at GPU scale" is backed by an interface that delivers them.
Walk-back options
- If the JSON schema causes consumer-tooling friction, the format may be extended additively but not narrowed. Existing consumers must continue to work. If a fundamental rethink is required, ship a v2 alongside v1 with a deprecation window.
- If symbolic name resolution is too slow at scale (millions of DFFs, very long runs), the resolution step becomes opt-in via flag, with the existing state-word-index format retained as a fast-path default. No evidence yet that this is a problem; treated as a deferred consequence.
- If users specifically want the path-back-trace (item 7) before the cheaper items are scheduled, it can be promoted, but only once items 1–4 are in place. Path-back-trace without symbolic names is unusable.
Priority and effort estimate
| Item | Effort | Blocks | User impact |
|---|---|---|---|
| 1. Symbolic violations | 1–2 days | Nothing | High (turns stderr from noise to signal) |
| 2. JSON report | 3–5 days | CI integration | High |
| 3. Text summary | 1 day (after #2) | Human dashboards | Medium |
| 4. Worst-slack ranking | 1–2 days (folds into #2) | "Am I close?" | High |
| 5. Arrival histogram | 3–5 days | Power analysis | Medium |
| 6. STA cross-ref | 1 week | Vector coverage report | Medium |
| 7. Path-back-trace | 2–3 weeks | Forensics | Lower-frequency-but-high-value |
Items 1–4 are a single workstream, ~2 weeks total. They constitute the "Jacquard is now usable" bar. Items 5–7 are scheduled per user demand after that.
Links
- ../why-jacquard.md — positioning analysis and full output-interface design.
- ../timing-violations.md — current violation detection mechanics; updated to describe new outputs once they land.
- ../timing-validation.md — validation tolerances; will reference the JSON report format for golden comparisons.
- ../adr/0002-timing-ir.md — IR schema versioning policy that the JSON report mirrors.
- ../project-scope.md — output stability constraints that apply to any user-facing format.
ADR 0009 — OpenSTA Verilog reader input constraints
Status: Accepted.
Context
OpenSTA's read_verilog Tcl command is structural-only: it accepts cell
instantiations and bare-net assign statements but rejects RTL
operators (~, &, |, ^), bit-selects in assigns, and ranged
concatenations. Violations surface as Error: <file> line <N>, syntax error
and exit 1. This is a long-standing OpenSTA limitation, not behaviour
that a flag can switch off.
Two patterns make this surprising in practice — both have already caught us once:
- Final-stage outputs from the LibreLane/OpenROAD flow are sometimes wrapped. LibreLane itself only ever reads structural netlists (<design>.pnl.v — verified locally on chip_top.pnl.v: zero RTL operators, single module). The wrapping is added by downstream integration tooling — for the SkyWater openframe flow, chipflow's harness wraps the LibreLane output in openframe_project_wrapper to patch active-low OEB pins into the pad ring, producing the assign gpio_oeb[0] = ~( ... ); pattern. The combined file (tests/mcu_soc/data/6_final.v) contains both the readable-by-OpenSTA structural top module and the wrapper's unreadable RTL. The SDF was generated against the inner top, not the wrapper — matching what LibreLane's own STA saw.
- Post-synthesis Verilog has the right form but the wrong cells. Pre-P&R synthesis output (e.g. top_synth.v) is fully structural and uses the same module name top as the post-P&R body, so it looks like an acceptable substitute. It is not: the SDF references hundreds of thousands of P&R-inserted cells (clkbuf_regs_* CTS buffers, ANTENNA_* diodes, delaybuf_*, fillers) that simply do not exist in synthesis output. OpenSTA quietly drops SDF entries whose endpoints are not in the loaded design; the resulting IR back-annotates only the surviving subset. Concrete numbers from the MCU SoC fixture: top_synth.v has 31,500 cells; module top inside 6_final.v has 266,746. Feeding top_synth.v would silently drop ~88% of the design's structure.
Past convention (docs/plans/ws3-cosim-sdf-followup.md, pre
2026-05-18) recommended substituting top_synth.v to dodge the
wrapper-parse error. The contemporaneous verification log
(28162 matched, 2090 unmatched) reported the jtir-to-cosim-netlist
match rate, not SDF coverage against the jtir, so the flow looked
like it was "working" on the surface while the IR was missing most
of the design's real timing. That recommendation is retracted in the
same change that lands this ADR.
Decision
The "structural-only" constraint is owned by opensta-to-ir,
not by the caller. Specifically:
- opensta-to-ir filters Verilog inputs at invocation time. For each --verilog file, it extracts the module <--top> … endmodule block before handing files to OpenSTA. Files that do not contain module <--top> (sub-module-only files in hierarchical designs) are passed through unchanged. The wrapper modules that LibreLane + wafer.space integration adds — and any future analogues — are simply not seen by OpenSTA. Implementation in crates/opensta-to-ir/src/verilog_filter.rs; integration test coverage in tests/opensta_integration.rs.
- The cell-set match against the SDF is the caller's responsibility. opensta-to-ir cannot determine programmatically whether a given Verilog input is the right design stage for a given SDF. The CI fixture comment in prepare-mcu-soc-jtir captures the rule for sky130 mcu_soc; copy the spirit (use the post-P&R structural body, not synthesis output) when adding new fixtures, but don't copy a per-design extraction recipe — there no longer is one to copy.
Architectural alternative (separate concern): the upstream
chipflow harness could preserve LibreLane's pre-wrap <top>.pnl.v
alongside its wrapped <top>_final.v output. That would make
opensta-to-ir's in-tool extraction a no-op for the common chipflow
case, but it would not obviate the filter — third-party LibreLane +
wafer.space users (hazard3 and future tapeouts using the vanilla
flow) hit the same wrapper pattern. The filter is the right place
for the fix because it covers both opensta-to-ir as a CLI and
jacquard sim --sdf (which subprocesses opensta-to-ir).
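The extraction step described above can be sketched as follows. This is a minimal illustration, not the code in verilog_filter.rs: the function names are hypothetical, and it relies on the same assumption the filter makes, namely that module and endmodule each start their own line.

```rust
/// True if `line` opens `module <top>`. The name must end right after
/// `top`, so `module top_wrapper` does not match `top`.
fn opens_module(line: &str, top: &str) -> bool {
    let mut parts = line.trim_start().splitn(2, char::is_whitespace);
    if parts.next() != Some("module") {
        return false;
    }
    let rest = parts.next().unwrap_or("").trim_start();
    match rest.strip_prefix(top) {
        Some(tail) => !tail.starts_with(|c: char| c.is_alphanumeric() || c == '_' || c == '$'),
        None => false,
    }
}

/// Return the `module <top> ... endmodule` block, or None when the
/// source does not define `top` (or the block is unterminated).
fn extract_top_module(source: &str, top: &str) -> Option<String> {
    let mut kept = Vec::new();
    let mut inside = false;
    for line in source.lines() {
        if !inside && opens_module(line, top) {
            inside = true;
        }
        if inside {
            kept.push(line);
            if line.trim_start().starts_with("endmodule") {
                return Some(kept.join("\n") + "\n");
            }
        }
    }
    None
}

/// Files without `module <top>` pass through unchanged.
fn filter_source(source: &str, top: &str) -> String {
    extract_top_module(source, top).unwrap_or_else(|| source.to_string())
}
```

A real tokenizer would also handle comments, escaped identifiers, and modules opened mid-line; the sketch deliberately does not.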
Consequences
- End-user runs of jacquard sim --sdf <path> and the standalone opensta-to-ir tool both transparently handle the LibreLane + wafer.space wrapper pattern. No flags, no preprocessing recipe in user-facing docs.
- Match-rate metrics in the IR consumer measure jtir coverage against the consuming netlist, not against the source SDF. A high match rate is necessary but not sufficient — confirm the jtir contains the post-P&R cell population separately (e.g. by spot checking for clkbuf_regs_* / ANTENNA_* arcs in the IR JSON sidecar) before declaring a flow "working".
- The filter assumes module <--top> … endmodule is line-anchored in the Verilog source. Machine-generated post-P&R netlists meet this; hand-rolled Verilog that opens a module mid-line would not. If that ever surfaces, upgrade the filter to use a real Verilog tokenizer (sverilogparse is already a workspace dependency).
- This ADR retroactively retracts the top_synth.v recommendation in docs/plans/ws3-cosim-sdf-followup.md; that doc is corrected in the same change.
Links
- ADR 0001 — OpenSTA as oracle and sole STA path (the upstream tool whose constraints these are).
- ADR 0006 — SDF preprocessing model (the surrounding flow that consumes these inputs).
- docs/plans/ws3-cosim-sdf-followup.md — the prior workaround this ADR corrects.
ADR 0010 — Declarative cell metadata for PDK enablement
Status: Accepted.
Context
PDK enablement today is per-PDK code + vendored Verilog (see
src/sky130.rs, src/gf180mcu.rs, src/gf180mcu_pdk.rs, the
build.rs pin-table scanner). Adding a new cell family — third-party
IP memories, hard macros, foundry-supplied blocks — requires
vendoring Verilog into jacquard/vendor/, extending the build.rs
scanner, editing prefix matchers (is_<pdk>_cell,
extract_cell_type), and adding entries to hand-curated matches!()
lists (is_filler_cell, is_io_pad_cell, is_sequential_cell,
is_multi_output_cell, …). Each of those last is data masquerading
as code; PR #64 (2026-05-18 power-pin + wired-filler shortcuts for
wafer.space) is the most recent example of the pattern.
The acute trigger is gf180mcu_ocd_ip_sram__sram1024x8m8wm1 — Tim
Edwards' OCD 3.3V port of the GF180MCU SRAM IP, used in a downstream
wafer.space tapeout. The cell is third-party IP (not in Jacquard's
vendor/), doesn't match is_gf180mcu_cell's prefix walk
(fd_* / ws_* only), has no pin table, and isn't filler-stubbable.
Issue #67 captures
the discussion.
The same pattern will repeat for every wafer.space tapeout that includes IP outside Jacquard's vendored library — hazard3, future chips. Code-gating each one through a Jacquard PR doesn't scale.
Decision
PDK enablement gains a declarative metadata path alongside the existing built-in classifiers. The decision separates cleanly into two tiers; this ADR commits to Tier 1 + a minimal Tier 2 slice now, and explicitly defers the larger Tier 2 schema (port-mapping semantics) to a future ADR after real adoption data.
Tier 1 — runtime cell library (--cell-library <PATH>.v)
sverilogparse (already a workspace dependency) parses user-supplied
Verilog files at startup and populates the LeafPinProvider for
every module … endmodule block found. Handles input / output /
inout. Replaces the build.rs scanner for newly-added cells;
existing built-in tables stay as fallback.
Flag is repeatable: --cell-library a.v --cell-library b.v for
designs that pull in multiple IP libraries. Files are parsed in
order; later files override earlier ones for collisions (with a
warning).
Tier 2 (minimal slice) — kind discriminator in TOML
Each cell library may be accompanied by a TOML manifest declaring
the kind of each cell — the same classification today's
is_filler_cell / is_sequential_cell / etc. encode in
matches!() lists. Manifest path mirrors the library path
(foo.v → foo.cells.toml) and is loaded automatically when
present; an explicit --cell-manifest <PATH>.toml flag overrides
the autoloading behaviour.
schema_version = "1.0"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"
[cells.gf180mcu_fd_io__fillcap_18_h]
kind = "filler"
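The path-mirroring rule (foo.v → foo.cells.toml) can be sketched with the standard library alone. The function names are hypothetical, and treating the explicit flag as a per-library override is an assumption; the ADR does not say how --cell-manifest pairs with multiple --cell-library flags.

```rust
use std::path::{Path, PathBuf};

/// Mirror the library path: `ip/foo.v` -> `ip/foo.cells.toml`.
fn default_manifest_path(library: &Path) -> PathBuf {
    // `with_extension` replaces `.v`; the new extension may contain dots.
    library.with_extension("cells.toml")
}

/// An explicit path wins; otherwise autoload the mirrored path if it exists.
fn resolve_manifest(library: &Path, explicit: Option<&Path>) -> Option<PathBuf> {
    match explicit {
        Some(path) => Some(path.to_path_buf()),
        None => {
            let candidate = default_manifest_path(library);
            if candidate.exists() {
                Some(candidate)
            } else {
                None
            }
        }
    }
}
```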
Recognised kind values (v1.0): std, dff, latch,
clock_gate, ram, filler, endcap, tap, io_pad_input,
io_pad_output, io_pad_bidir, delay, multi_output,
tie_high, tie_low.
Schema versioning: top-level schema_version is mandatory. v1.x
additive rule — new optional keys / new kind values are
non-breaking; semantics of existing kind values must not narrow.
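One way a loader might represent the v1.0 kind values is a plain enum with a string parser. This is a sketch: the type and function names are hypothetical, and accepting any 1.x schema_version is an assumption drawn from the additive rule, not documented loader behaviour. What to do with an unrecognised kind is left to the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CellKind {
    Std, Dff, Latch, ClockGate, Ram, Filler, Endcap, Tap,
    IoPadInput, IoPadOutput, IoPadBidir, Delay, MultiOutput, TieHigh, TieLow,
}

/// Map a manifest `kind` string to the enum; None for unknown values.
fn parse_kind(s: &str) -> Option<CellKind> {
    Some(match s {
        "std" => CellKind::Std,
        "dff" => CellKind::Dff,
        "latch" => CellKind::Latch,
        "clock_gate" => CellKind::ClockGate,
        "ram" => CellKind::Ram,
        "filler" => CellKind::Filler,
        "endcap" => CellKind::Endcap,
        "tap" => CellKind::Tap,
        "io_pad_input" => CellKind::IoPadInput,
        "io_pad_output" => CellKind::IoPadOutput,
        "io_pad_bidir" => CellKind::IoPadBidir,
        "delay" => CellKind::Delay,
        "multi_output" => CellKind::MultiOutput,
        "tie_high" => CellKind::TieHigh,
        "tie_low" => CellKind::TieLow,
        _ => return None,
    })
}

/// Assumed policy: accept "1.<minor>" and reject everything else.
fn schema_is_supported(version: &str) -> bool {
    let mut parts = version.split('.');
    matches!(
        (parts.next(), parts.next().map(|m| m.parse::<u32>()), parts.next()),
        (Some("1"), Some(Ok(_)), None)
    )
}
```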
kind = "ram" semantics in v1.0 (opaque-RAM mode)
aig.rs today has two hardcoded RAM detection paths:
celltype == "$__RAMGEM_SYNC_" (line 775, port_r/port_w resolution
from Yosys memlib_yosys.txt) and starts_with("CF_SRAM_") (line
1006, .DO output resolution for ChipFlow's single-port
convention). Neither matches gf180mcu_ocd_ip_sram_* or arbitrary
third-party SRAM IP.
In v1.0, kind = "ram" allocates a RAMBlock slot in opaque
mode: the cell's outputs are routed to X-source slots, no port
resolution is attempted, no memory behaviour is modelled. This is
sufficient for designs whose CPU executes from boot ROM / register
file and never reads SRAM contents at the timescales Jacquard
simulates (the heartbeat-verification use case driving this work).
The existing compute_x_sources test path at src/aig.rs:3247-3273
already validates the X-source convergence shape.
When real memory modelling is required, future schema versions add
explicit port mapping ([cells.NAME.ports] sub-tables) — the
opaque mode stays as the documented fallback.
Integration ordering
aig.rs cell-type recognition slots the manifest path after the
existing recognisers:
1. celltype == "$__RAMGEM_SYNC_" → RAMBlock with port_r/port_w (unchanged)
2. starts_with("CF_SRAM_") → RAMBlock with .DO (unchanged)
3. PdkVariant::classify(celltype) → built-in classifier dispatch (unchanged)
4. NEW: manifest.lookup(celltype) → manifest-declared kind dispatch
The new path activates only for cells none of the existing recognisers match AND that have a manifest entry. All existing tests stay green without churn.
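The ordering above can be illustrated with a small dispatch function. This is a sketch, not the code in aig.rs: the Recognised enum is hypothetical, builtin_classify is a stand-in for PdkVariant::classify with an illustrative prefix check, and the manifest is reduced to a map from cell type to kind string.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum Recognised {
    RamgemSync,
    CfSram,
    BuiltIn(&'static str),
    Manifest(String),
    Unknown,
}

/// Stand-in for the built-in classifier dispatch (hypothetical logic).
fn builtin_classify(celltype: &str) -> Option<&'static str> {
    if celltype.starts_with("sky130_fd_sc_hd__") {
        Some("sky130")
    } else {
        None
    }
}

/// Try the existing recognisers first; consult the manifest last.
fn recognise(celltype: &str, manifest: &HashMap<String, String>) -> Recognised {
    if celltype == "$__RAMGEM_SYNC_" {
        return Recognised::RamgemSync;
    }
    if celltype.starts_with("CF_SRAM_") {
        return Recognised::CfSram;
    }
    if let Some(pdk) = builtin_classify(celltype) {
        return Recognised::BuiltIn(pdk);
    }
    if let Some(kind) = manifest.get(celltype) {
        return Recognised::Manifest(kind.clone());
    }
    Recognised::Unknown
}
```

Because the manifest is consulted last, a manifest entry for a cell that an existing recogniser already claims has no effect, which is what keeps the existing tests unchanged.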
Deferred to a future ADR
- Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.
- Built-in classifier removal. sky130.rs / gf180mcu.rs / gf180mcu_pdk.rs classification tables stay as fallback through the entire migration. Removal happens only after the manifest pathway is the source of truth for at least one PDK in production.
- build.rs pin-table scanner removal. Same rule: removed LAST, after manifests cover the built-in PDKs.
Consequences
- Third-party IP unblocks without Jacquard PRs. Users ship a <library>.cells.toml alongside their <library>.v; CI flows point --cell-library at both. The driving wafer.space tapeout's chip_top.pnl.v clears gf180mcu_ocd_ip_sram__sram1024x8m8wm1 by shipping a six-line manifest entry.
- The "vendor + edit code + extend lists" PR workflow for new IP becomes "ship a manifest, no Jacquard change". docs/adding-a-pdk.md evolves to document the manifest pathway as the primary route.
- The opaque-RAM semantics is honest about what v1.0 delivers — no silent partial memory modelling. The contract is "RAMBlock allocated, outputs X-source, no read/write behaviour" until a future schema version adds explicit ports.
- Existing built-in PDK code stays load-bearing through the transition. No risk of regression in sky130 / gf180mcu test flows during the migration.
Links
- Issue #67 — design discussion.
- PR #64 (9281e57) — most recent per-PDK-code-as-data workaround this ADR resolves.
- ADR 0009 — OpenSTA Verilog reader input constraints (sverilogparse is already in-tree for unrelated reasons; Tier 1 reuses that dep).
- docs/plans/declarative-cell-metadata.md — implementation phasing.
- docs/plans/gf180mcu-enablement.md § Follow-on cleanup items 1, 2, 3 — superseded by this ADR.
ADR 0011 — RAM port-mapping schema for declarative cell metadata
Status: Accepted.
Context
ADR 0010 shipped a minimal Tier 2
slice with one kind discriminator per cell. For kind = "ram"
specifically, v1.0 declares the cell-as-opaque: the AIG allocates a
RAMBlock slot but routes outputs to X-source slots without
resolving read/write port semantics. That's sufficient for "design
boots from ROM, never reads SRAM contents" cases but fails the
moment a real CPU writes to SRAM and expects to read its data back.
The acute trigger is the JTAG-DM firmware-load path enabled by
PR #78: OpenOCD
walks a debug-module sequence that culminates in abstract-memory
writes into the design's SRAM, then jumps the CPU to that memory.
Because the SRAM is opaque (no backing storage, writes go nowhere),
the CPU boots to garbage. Issue
#80 captures the
symptom and notes that wiring SramInitConfig is the smaller sibling
problem — pre-loading SRAM contents at tick 0 — but the bigger gap
is that kind = "ram" doesn't model writes at all.
ADR 0010 § "Deferred to a future ADR" listed the port-mapping schema explicitly:
Port-mapping schema ([cells.NAME.ports] sub-tables, polarity annotations, bus-width inference, write-enable encoding). This is a small behavioural description language doing more than classification; needs concrete adoption data before its schema is fixed.
The OCD GF180MCU SRAM (gf180mcu_ocd_ip_sram__sram1024x8m8wm1) — a
real third-party IP cell behind the apitronix-semiconductor /
hazard3 / future wafer.space tapeout pipelines — gives us the
concrete adoption-data input. This ADR fixes the schema against
that worked example.
Worked example: the OCD SRAM
The upstream behavioural model (RTimothyEdwards/gf180mcu_ocd_ip_sram) declares:
module gf180mcu_ocd_ip_sram__sram1024x8m8wm1 (
CLK, CEN, GWEN, WEN, A, D, Q
);
input CLK; // posedge clock
input CEN; // chip enable, active-low
input GWEN; // global write enable, active-low
input [7:0] WEN; // per-bit write mask, active-low
input [9:0] A; // address (1024 entries)
input [7:0] D; // data in
output [7:0] Q; // data out
reg [7:0] mem[1023:0]; // backing storage
Read semantics: on posedge CLK, when !CEN && GWEN → Q = mem[A].
Write semantics: on posedge CLK, when !CEN && !GWEN && !(&WEN) →
mem[A][i] = D[i] for each i where !WEN[i].
The schema needs to capture: per-pin polarity (active-low vs active-high), per-pin role (clock / chip-enable / write-enable / mask / address / data-in / data-out), bus widths (derived from the Verilog declaration; not redeclared), mask granularity (per-bit vs per-byte vs none).
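The read and write semantics stated above can be captured in a small behavioural model. This is a sketch for illustration only: the struct name is hypothetical, booleans carry the raw pin levels (true means the pin is high, so an active-low pin is asserted when false), and memory starts at zero here whereas the Verilog model starts with unknown contents.

```rust
struct OcdSram {
    mem: [u8; 1024],
    q: u8,
}

impl OcdSram {
    fn new() -> Self {
        OcdSram { mem: [0; 1024], q: 0 }
    }

    /// One rising edge of CLK. `cen` and `gwen` are raw pin levels
    /// (active-low); bit i of `wen` low enables the write of bit i.
    fn posedge(&mut self, cen: bool, gwen: bool, wen: u8, addr: u16, d: u8) {
        if cen {
            return; // chip disabled: neither read nor write
        }
        let a = (addr & 0x3ff) as usize; // 10-bit address
        if gwen {
            // !CEN && GWEN: read
            self.q = self.mem[a];
        } else if wen != 0xff {
            // !CEN && !GWEN && !(&WEN): masked write
            let write_bits = !wen;
            self.mem[a] = (self.mem[a] & wen) | (d & write_bits);
        }
    }
}
```

For example, writing 0xAB with WEN = 0x00 and then 0x5C with WEN = 0xF0 leaves 0xAC at that address, because only the low nibble is enabled on the second write.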
Decision
Extend the <library>.cells.toml schema with an optional ram
sub-table on entries declaring kind = "ram". Presence of the
sub-table promotes a cell from opaque (v1.0 semantics) to
explicit — outputs are properly wired to the AIG-backed
RAMBlock, writes populate backing storage, reads return what was
written.
Schema (v1.1)
schema_version = "1.1"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1]
kind = "ram"
[cells.gf180mcu_ocd_ip_sram__sram1024x8m8wm1.ram]
depth = 1024
width = 8
clock = { pin = "CLK", edge = "pos" }
chip_enable = { pin = "CEN", polarity = "low" }
write_enable = { pin = "GWEN", polarity = "low" }
write_mask = { pin = "WEN", polarity = "low", granularity = "bit" }
address = "A"
data_in = "D"
data_out = "Q"
Field semantics
- depth (required, integer): number of addressable entries. Must satisfy depth ≤ 2^AIGPDK_SRAM_ADDR_WIDTH (8192 today).
- width (required, integer 1..=32): bit-width of each entry. Capped at 32 by RAMBlock's fixed-size port arrays.
- clock (required, table): pin is the clock input pin name; edge defaults to "pos". "neg" is accepted (matches gf180mcu dffnq-family negedge convention).
- chip_enable (optional, table): pin + polarity (default "low"). When the pin's effective level is inactive, the cell neither reads nor writes for that cycle. Omit for sync SRAMs that are always-enabled.
- write_enable (optional, table): pin + polarity (default "low"). Gates all writes regardless of mask. The OCD SRAM's GWEN. Omit for SRAMs without a global write-enable.
- write_mask (optional, table): per-bit / per-byte write enables. pin is the mask pin name; polarity defaults to "low"; granularity is "bit" (default) or "byte". The mask width must match width (bit) or width / 8 (byte). Omit for SRAMs without per-bit masking — in that case the global write_enable controls the whole word.
- address / data_in / data_out (required, string): pin names. Bus widths are read from the Verilog (via sverilogparse) — not re-declared here.
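The numeric constraints in the field list can be checked with a few comparisons. The sketch below is an assumption-laden illustration: RamSpec and validate are hypothetical names, the address width of 13 is inferred from the "8192 today" figure, and rejecting a byte-granular mask when width is not a multiple of 8 is an assumption the ADR does not state.

```rust
/// Inferred from "8192 today": 2^13 = 8192.
const SRAM_ADDR_WIDTH: u32 = 13;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Granularity {
    Bit,
    Byte,
}

struct RamSpec {
    depth: u32,
    width: u32,
    /// Width of the mask pin as read from the Verilog, plus its granularity.
    mask: Option<(u32, Granularity)>,
}

fn validate(spec: &RamSpec) -> Result<(), String> {
    let max_depth = 1u32 << SRAM_ADDR_WIDTH;
    if spec.depth > max_depth {
        return Err(format!("depth {} exceeds {}", spec.depth, max_depth));
    }
    if !(1..=32).contains(&spec.width) {
        return Err(format!("width {} outside 1..=32", spec.width));
    }
    if let Some((mask_width, granularity)) = spec.mask {
        let expected = match granularity {
            Granularity::Bit => spec.width,
            Granularity::Byte => {
                if spec.width % 8 != 0 {
                    return Err("byte mask needs width divisible by 8".to_string());
                }
                spec.width / 8
            }
        };
        if mask_width != expected {
            return Err(format!("mask width {} != expected {}", mask_width, expected));
        }
    }
    Ok(())
}
```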
Optional cells (no ram block)
Cells declaring kind = "ram" without the ram sub-table fall
back to v1.0 opaque mode — outputs route to X-source slots, no
backing storage, no port resolution. The contract is unchanged for
existing consumers.
Backing storage
Cells with an explicit ram block allocate a RAMBlock with
port_r_* and port_w_* arrays populated from resolved pin
positions. The simulator's existing GPU-side SRAM machinery handles
reads, writes, and per-entry backing memory; no new kernel work is
required.
Schema versioning
The top-level schema_version field bumps from "1.0" to "1.1".
v1.0 manifests continue to parse — the ram sub-table is purely
additive. Loaders that don't recognise the new sub-table (none
today; this ADR ships the loader simultaneously) would treat
flagged cells as opaque RAMs, which is a graceful degradation.
SRAM preload (sibling work)
TestbenchConfig::sram_init (an existing schema field declared in
src/testbench.rs but unwired today — issue
#80) becomes
load-bearing once explicit-port RAMs have backing storage. The
preload path:
- Parse ELF segments from sram_init.elf_path.
- Match segments to SRAM instances by virtual-address overlap with declared SRAM regions.
- Write segment bytes into each matched SRAM's backing memory at tick 0.
Schema extensions to SramInitConfig (instance targeting,
multi-section support) land alongside the implementation but don't
require an ADR — purely additive JSON schema work.
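The overlap-matching step of the preload path can be sketched as an interval intersection followed by a copy. Everything named here is hypothetical: the Segment and SramRegion types, the byte-addressed backing store, and the idea that each SRAM region is described by a base address. ELF parsing itself is out of scope for the sketch.

```rust
struct Segment {
    vaddr: u64,
    bytes: Vec<u8>,
}

struct SramRegion {
    base: u64,
    mem: Vec<u8>,
}

/// Copy the part of `seg` that overlaps `region`; return bytes written.
fn preload(region: &mut SramRegion, seg: &Segment) -> usize {
    let region_end = region.base + region.mem.len() as u64;
    let seg_end = seg.vaddr + seg.bytes.len() as u64;
    let lo = region.base.max(seg.vaddr);
    let hi = region_end.min(seg_end);
    if lo >= hi {
        return 0; // no overlap
    }
    let n = (hi - lo) as usize;
    let dst = (lo - region.base) as usize;
    let src = (lo - seg.vaddr) as usize;
    region.mem[dst..dst + n].copy_from_slice(&seg.bytes[src..src + n]);
    n
}
```

A segment that straddles the start of a region is clipped: only the bytes at or above the region base are written.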
Consequences
- The OCD GF180MCU SRAM (and any structurally similar third-party IP — 1RW, sync, optional per-bit mask) becomes simulable end-to-end via the manifest pathway. Real CPU writes populate real memory.
- The opaque-mode fallback stays load-bearing for cells the consumer hasn't taken the time to schema-map — important so the cell-library pathway doesn't require schema work just to load a cell library.
- JTAG-DM-driven firmware load (PR #78 stage 1) becomes end-to-end testable in cosim. Closes the chicken-and-egg loop for designs whose firmware-load mechanism is what cosim is trying to validate.
- The schema is opinionated: 1-port (1RW), sync-only, write-mask is bit OR byte (not arbitrary). Multi-port SRAMs (2RW, 1R1W), async SRAMs, and write-mask-with-stripes encodings are explicitly out of scope. Adding them is left to a future schema version (v1.2+) and does not break v1.1 manifests.
Out of scope
- Multi-port SRAMs. Most foundry IPs in our ecosystem are single-port. Dual-port designs are a meaningful follow-up but not driven by any in-tree fixture today.
- Async (non-clocked) SRAMs. Hardly seen in synthesised digital designs at modern PDKs. Not modeled.
- Width > 32 bits. Bounded by RAMBlock's array sizes; consumers wider than 32 should split into multiple instances.
- Built-in classifier removal. Same rule as ADR 0010 — the $__RAMGEM_SYNC_ and CF_SRAM_* recognisers stay as fallback; manifest-declared RAMs supplement, don't replace.
Links
- ADR 0010 — Declarative cell metadata (the parent decision deferring this schema).
- Issue #80 — driving consumer.
- PR #78 — the JTAG-DM workflow that surfaced the schema need.
- Upstream OCD SRAM behavioural model: RTimothyEdwards/gf180mcu_ocd_ip_sram.
ADR 0012 — Reproducible CDC jitter injection for multi-clock cosim
Status: Accepted — design accepted and partially implemented. The
reproducibility core (§1) and scheduler-domain jitter on the VCD
timeline (§2, partial) are built; model-driven jitter (§3), setup/hold
integration (§5), the gcd_ps/2 guard, and true coincident-edge
perturbation (§4) are not yet. The sections below describe the decided
design; see Implementation status for what is
built versus deferred. Remaining work is tracked in
issue #92 and
../plans/cdc-jitter-completion.md.
Context
The multi-clock scheduler (MultiClockScheduler in cosim_metal.rs)
pre-computes a fixed LCM-based edge schedule: every clock domain fires
at perfectly rational offsets forever. Real hardware doesn't do that —
PLL jitter, clock-tree skew, and propagation delay make coincident
edges land in unpredictable order. CDC synchronizers are designed to
tolerate this, but RTL bugs (missing synchronizers, gray-code errors,
handshake protocol violations) only surface when edge alignment varies
from the ideal.
The motivating incident was
PR #89 / run 26413667030:
a scheduler index bug caused sys_clk to fire at TCK's period, making
CDC synchronizers between the JTAG and system clock domains marginal.
The test passed intermittently because Metal GPU scheduling jitter
shifted the effective phase relationship. Once the bug was fixed
(commit 5bb07c3), determinism was restored — but the experience
highlighted that no deliberate mechanism exists to stress-test CDC
paths under controlled timing skew.
Additionally, cosim's model-driven clocks (JtagReplayModel,
SpiFlashModel, etc.) override the scheduler's periodic pattern with
software-driven edges. These introduce a distinct CDC concern:
model-driven clock transitions are phase-locked to the host-side
dispatch loop, not to the design's system clock. The same jitter
injection infrastructure must cover both scheduler-derived and
model-driven clock edges.
The multi-clock plan (docs/plans/multi-clock-and-stimulus-architecture.md)
lists "CDC verification mode: jitter injection on coincident edges and
random X-injection on detected async-source paths" as a future
capability. This ADR formalises the design for the jitter injection
half; X-injection is deferred to a follow-up ADR that depends on MC.1
(island partitioner) landing.
Decision
1. Run-parameters file and per-domain seeded PRNG
Simulation runs that use any non-deterministic feature (jitter, future
partition randomisation, model-driven timing offsets) are governed by a
run-parameters file (--run-params <path>):
{
"master_seed": 8429173640281
}
From the master seed, a per-domain sub-seed is derived for each
clock domain and each model-driven clock (e.g.
sub_seed = hash(master_seed, domain_name)). Each domain gets its own
independent PRNG stream. This ensures reproducibility even when the
number of PRNG draws per domain is path-dependent — a reactive model
that fires more or fewer edges based on design output doesn't
contaminate another domain's displacement sequence.
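The per-domain derivation can be illustrated with standard-library code only. The sketch uses FNV-1a as the hash and SplitMix64 as the generator purely as stand-ins so the example has no dependencies; the status table in this ADR says the real code uses ChaCha8Rng, and the actual hash behind RunParams::domain_seed is not specified here.

```rust
/// Stand-in for hash(master_seed, domain_name): FNV-1a over the seed
/// bytes followed by the domain name bytes.
fn domain_seed(master_seed: u64, domain: &str) -> u64 {
    let seed_bytes = master_seed.to_le_bytes();
    let mut h: u64 = 0xcbf29ce484222325;
    for b in seed_bytes.iter().chain(domain.as_bytes()) {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Stand-in PRNG (SplitMix64); each domain owns one instance.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9e3779b97f4a7c15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
        z ^ (z >> 31)
    }
}
```

Because each domain has its own generator state, drawing extra values for one domain (say a reactive tck model that fires more edges) cannot shift the sequence another domain sees.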
Behaviour:
- --run-params <path> supplied, file exists: load parameters from it. The run is a deterministic replay.
- --run-params <path> supplied, file does not exist: generate a master seed from system entropy, write the file immediately (before the simulation loop starts), then run. The user gets reproducibility even if the process crashes mid-simulation.
- No --run-params flag: generate a master seed, write to a default location (<output_dir>/run_params.json next to the output VCD) before simulation begins. Always persisted — the user can re-run any simulation by passing the written file back.
The master seed is also logged at INFO level and included in the VCD header comment, so even without the file the seed is recoverable from logs.
Rationale: "random testing that can't be replayed isn't testing," but forcing users to pick seeds upfront discourages use. Writing the file before simulation means every run — even a crashed one — is reproducible after the fact. Per-domain streams mean the seed alone is sufficient; no displacement log is needed.
For CI seed sweeps, a wrapper generates N parameter files with
sequential seeds and fans out runs. Each failure ships with its
parameter file as an artifact — gh run download gives you
everything needed to reproduce locally.
2. Per-domain jitter budget
A new jitter_ps field on ClockConfig in sim_config.json declares
the maximum edge displacement in picoseconds for that domain:
{
"clocks": [
{ "gpio": 0, "period_ps": 40000, "name": "sys_clk", "jitter_ps": 200 },
{ "gpio": 2, "period_ps": 160000, "name": "tck", "jitter_ps": 0 }
]
}
At each edge, the scheduler draws a signed displacement from a uniform
distribution [-jitter_ps, +jitter_ps] and shifts the edge forward or
backward within the GCD granularity window. The resulting edge still
fires within the same GCD tick (no reordering across ticks), but the
effective arrival time recorded in the state buffer (and honoured by
setup/hold checkers) shifts. Disabling jitter (jitter_ps: 0) is the
default and produces today's ideal-clock behaviour.
Constraint: jitter must not exceed gcd_ps / 2; larger values would
re-order edges across GCD ticks and require a fundamentally different
scheduling model.
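The budget check and the uniform draw can be sketched as below. Two assumptions are worth flagging: gcd_ps is taken to be the GCD of the configured period_ps values (the scheduler may derive its tick differently, for instance from half-periods), and the raw 64-bit value is assumed to come from the domain's own PRNG. The names are hypothetical, and the implementation status section notes that this constraint is not yet validated in the code.

```rust
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

struct Clock {
    name: &'static str,
    period_ps: u64,
    jitter_ps: u64,
}

/// Return gcd_ps, or an error naming the first domain over budget.
fn check_jitter_budgets(clocks: &[Clock]) -> Result<u64, String> {
    let gcd_ps = clocks.iter().fold(0, |g, c| gcd(g, c.period_ps));
    for c in clocks {
        if c.jitter_ps > gcd_ps / 2 {
            return Err(format!(
                "{}: jitter_ps {} exceeds gcd_ps/2 = {}",
                c.name, c.jitter_ps, gcd_ps / 2
            ));
        }
    }
    Ok(gcd_ps)
}

/// Map a raw PRNG draw into [-jitter_ps, +jitter_ps].
/// Plain modulo has a negligible bias for budgets this small.
fn displacement_ps(raw: u64, jitter_ps: u64) -> i64 {
    let span = 2 * jitter_ps + 1;
    (raw % span) as i64 - jitter_ps as i64
}
```

With the sys_clk and tck periods from the example config (40000 and 160000 ps) the GCD is 40000 ps, so any jitter_ps up to 20000 would pass the check.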
3. Model-driven clock jitter
Model-driven clocks (JTAG TCK, SPI SCK, etc.) bypass the scheduler's periodic edges. Their jitter path is different:
- A --cdc-model-jitter-ps <N> flag (or per-model jitter_ps in the config) specifies the budget for model-driven transitions.
- After patch_model_clock_edges fires the edge, the arrival-time offset recorded in the timing state is displaced by a PRNG-drawn value from the same seeded generator.
- This does NOT delay the functional edge (the DFF still samples on the same tick) — it shifts the timing-model arrival so that setup/hold checks against the receiving domain see a different margin each run.
The functional-vs-timing split means jitter injection doesn't change combinational propagation (which would require an event-driven kernel), only the timing oracle's view of when edges "really" arrived. This is consistent with Jacquard's philosophy: functional correctness is cycle-accurate, timing is an overlay.
4. Coincident-edge perturbation
When two domains have edges scheduled at the same GCD tick (coincident edges), their relative order is undefined in real hardware. The jitter mechanism naturally handles this: if domain A's jitter shifts it +100ps and domain B's shifts it -50ps, the timing model sees A after B, which may differ from the next run's draw. This exercises both "A-before-B" and "B-before-A" orderings over a seed sweep without needing explicit permutation logic.
5. Integration with existing infrastructure
- Setup/hold checker (timing_report.rs): already receives arrival-time offsets. Jittered arrivals feed directly into the existing violation detection — a jitter-induced setup violation appears in --timing-report output with the jittered arrival annotated.
- VCD ring buffer: records the jittered arrival time so waveform viewers show the displaced edge.
- X-prop (future): when MC.1 identifies CDC boundaries, X-injection on violated paths can use the same PRNG stream for correlated randomisation.
- --check-with-cpu: the CPU baseline does NOT apply jitter (it doesn't model timing at all). Jitter-mode results should not be compared against the CPU baseline. The flag combination --run-params (with jitter enabled) + --check-with-cpu should warn or error.
Implementation status
The design above is accepted in full; the code implements part of it.
This section is the source of truth on what is built. Remaining items are
tracked in issue #92 /
../plans/cdc-jitter-completion.md.
Implemented:
| Part | Where |
|---|---|
| Run-parameters file, master_seed, load/write/load_or_generate | src/sim/run_params.rs |
| Per-domain sub-seed hash(master_seed, name) + per-domain ChaCha8Rng | RunParams::domain_seed; cosim_metal.rs |
| jitter_ps per ClockConfig (default 0) | src/testbench.rs |
| Uniform [-jitter_ps, +jitter_ps] draw per domain per tick | cosim_metal.rs |
| Jitter displacement applied to the timing-VCD event timestamp | cosim_metal.rs (inside the --output-vcd block) |
| master_seed logged at INFO | cosim_metal.rs |
| --check-with-cpu + jitter warning | cosim_metal.rs |
Deferred (issue #92):
| Part | ADR § | Gap |
|---|---|---|
| Setup/hold integration | §2, §5 | Jitter shifts only the VCD base timestamp; it does not feed the per-signal arrival offsets, so it produces no --timing-report violations. Also: jitter currently has no effect unless --output-vcd is set. |
| Model-driven clock jitter | §3 | No --cdc-model-jitter-ps flag or patch_model_clock_edges path; only scheduler domains jitter. |
| True coincident-edge perturbation | §4 | A single global displacement (last firing domain wins) is applied to the shared timestamp rather than independent per-domain displacement. |
| gcd_ps / 2 constraint | §2 | Not validated. |
| Persist seed unconditionally | §1 | Without --run-params or --output-vcd, the seed is generated but not written. |
| master_seed in VCD header comment | §1, §5 | INFO log only. |
| --cdc-jitter-seed CI sweep | Consequences | The replay mechanism is --run-params; no dedicated CI sweep step yet. |
Consequences
- CI can run a small seed sweep (via --run-params) as a lightweight CDC stress test on every PR, catching synchroniser failures that the ideal-clock schedule hides.
- Users debugging real silicon CDC failures can replay the exact jitter pattern that triggered the issue.
- The design is forward-compatible with X-injection (the PRNG infrastructure and per-domain budgets are reusable).
- Model-driven clocks get explicit jitter coverage rather than relying on accidental GPU scheduling delays.
- No kernel changes required — jitter is a host-side timing-model overlay on the existing edge schedule.
Deferred
- X-injection on CDC paths. Requires MC.1's island partitioner to identify which DFF outputs cross domains. Separate ADR once MC.1 lands.
- Frequency sweep / DFS simulation. Changing a clock's period mid-simulation is orthogonal to jitter. Captured in the multi-clock plan as a future axis.
- Per-path jitter profiles. Real jitter isn't uniform — PLLs have period jitter (Gaussian), recovered clocks have cycle-to-cycle jitter (bounded), external clocks have frequency offset (deterministic drift). V1 uses uniform; richer distributions can be added later without API changes (the seed + budget interface is distribution-agnostic).
ADR 0013 — Cosim peripheral model architecture
Status: Accepted — the architecture is implemented and in use across multiple peripherals (multi-UART #90, config-driven APB3 bus tracing). The "Target architecture" section below tracks the remaining, optional refactors; the conventions it establishes are already followed.
Context
Jacquard's cosim mode runs reactive peripheral models alongside the GPU-simulated design: SPI flash serves firmware, UART decodes serial output, JTAG replays debug sessions, GPIO drives/observes pins, and Wishbone trace captures bus transactions. The architecture evolved organically; this ADR documents the current design, identifies the abstractions emerging from it, and establishes conventions for extending it.
Architecture
Execution domains
Peripheral work splits across CPU and GPU. The boundary follows a simple rule: models that drive input pins (must react to design output each edge) run on the CPU; models that observe output pins (pure consumers of post-simulation state) or exchange data bidirectionally with the design run on the GPU for zero-copy access to the state buffer.
Some peripherals span both domains. UART has a CPU-side RX driver (feeds bytes into the design's RX input pin) and a GPU-side TX decoder (reads the design's TX output pin).
CPU-side: PeripheralModel trait
Defined in src/sim/models/mod.rs:
trait PeripheralModel {
    fn name(&self) -> &str;
    fn driven_positions(&self) -> &[u32];
    fn apply_action(&mut self, action: &QueuedAction);
    fn step_edge(&mut self, output_state, overrides, emitted); // default: just calls contribute_overrides
    fn contribute_overrides(&self, overrides);
    fn is_active(&self) -> bool; // default: false
}
apply_action is how the InputDispatcher feeds queued stimulus
commands to models. is_active signals that the model is mid-
transmission and needs per-edge granularity (forces batch size to 1).
step_edge has a default that just calls contribute_overrides —
stateless models (GPIO) only need the latter.
Models are registered into a Vec<Box<dyn PeripheralModel>> at
startup. Each batch boundary, the loop calls step_edge on every
model; models write their pin drives into a shared ModelOverrides
map. These overrides are patched in-place into pre-allocated BitOp
arrays (built at startup with placeholder entries for model-driven
positions) and applied via the state_prep GPU kernel.
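The default-method pattern and the per-batch loop can be illustrated with a cut-down trait. This is a simplified sketch, not the trait in src/sim/models/mod.rs: it omits driven_positions, apply_action, and the emitted parameter, and the ModelOverrides alias (pin position to driven level) and the Gpio struct are hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the shared override map.
type ModelOverrides = HashMap<u32, bool>;

trait PeripheralModel {
    fn name(&self) -> &str;
    fn contribute_overrides(&self, overrides: &mut ModelOverrides);

    /// Default: stateless models only contribute their overrides.
    fn step_edge(&mut self, _output_state: &[u32], overrides: &mut ModelOverrides) {
        self.contribute_overrides(overrides);
    }

    /// Default: not mid-transmission, so batching is allowed.
    fn is_active(&self) -> bool {
        false
    }
}

/// A stateless model: drives one pin to a fixed level.
struct Gpio {
    position: u32,
    level: bool,
}

impl PeripheralModel for Gpio {
    fn name(&self) -> &str {
        "gpio"
    }
    fn contribute_overrides(&self, overrides: &mut ModelOverrides) {
        overrides.insert(self.position, self.level);
    }
}

/// One batch boundary: step every model, collect pin drives, and
/// report whether any model needs per-edge granularity.
fn run_batch_boundary(models: &mut [Box<dyn PeripheralModel>]) -> (ModelOverrides, bool) {
    let mut overrides = ModelOverrides::new();
    let mut any_active = false;
    for model in models.iter_mut() {
        model.step_edge(&[], &mut overrides);
        any_active |= model.is_active();
    }
    (overrides, any_active)
}
```

A stateful model such as a UART RX driver would override step_edge to advance its bit counter and return true from is_active while a byte is in flight.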
Note: step_edge currently receives an empty output_state
slice — GPU output state is not read back per-edge for CPU-side
models. GPIO and UART RX don't need it; I²C and SPI bus observation
will require wiring the output state readback when those models are
completed.
The dispatch is peripheral-agnostic: state_prep applies whatever
BitOp array it receives. Clock edges, reset, GPIO, UART RX, and
JTAG TCK/TMS/TDI are all entries in the same ops buffer.
Registered CPU-side models: GPIO, UART RX, JTAG replay (complete); I²C, SPI (scaffolded, output-state readback not yet wired).
GPU-side: two model patterns
GPU-side models fall into two categories distinguished by their data-flow relationship to the simulation:
Observe-only (post-simulate): The model reads output state after
simulation and produces results (decoded bytes, bus traces) into a
ring buffer. It never writes to input state. One kernel call per
edge, after simulate_v1_stage.
Bidirectional (pre+post simulate): The model both reads the design's outputs and injects data into the design's inputs. This requires two kernel calls per edge — one before simulation (inject response data into input state) and one after (read request signals from output state, advance the model's FSM).
| Pattern | When | Current models |
|---|---|---|
| Observe-only | Post-simulate | UART TX decoder, Wishbone bus trace |
| Bidirectional | Pre-simulate (inject) + post-simulate (sample, advance) | SPI Flash |
Any memory-mapped peripheral (external SRAM, I²C EEPROM, etc.) would follow the bidirectional pattern.
Per-edge execution order
state_prep (apply clk/gpio/jtag pin drives from CPU-side models)
→ [bidirectional: inject] — e.g. gpu_apply_flash_din
→ simulate_v1_stage ×N (combinational logic evaluation)
→ [bidirectional: sample+advance] — e.g. gpu_flash_model_step
→ [observe-only] — e.g. gpu_io_step (UART TX + Wishbone)
CPU-side PeripheralModel::step_edge runs between GPU batches.
GPU→CPU communication: ring buffers
GPU-side models produce output into fixed-size ring buffers in device
memory. The CPU drains these after each GPU batch completes, reading
from a local read_head up to the GPU-written write_head. No
synchronisation beyond Metal's command buffer completion is needed.
Current ring buffers:
| Buffer | Element | Capacity |
|---|---|---|
| UartChannel | u8 (decoded bytes) | 4096 |
| WbTraceChannel | WbTraceEntry (20 bytes) | 16384 |
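The drain described above can be sketched on the host side as follows. This is a minimal sketch, not the actual cosim code: the Ring struct, the drain function, and the assumption that write_head is a monotonically increasing element count (reduced modulo the capacity on access) are all illustrative.

```rust
/// Host-side view of a GPU-written ring buffer (hypothetical layout).
struct Ring<T> {
    write_head: u32, // total elements written so far (assumed monotonic)
    data: Vec<T>,    // fixed-capacity storage
}

/// Copy out everything between the CPU-local `read_head` and the GPU-written
/// `write_head`, advancing `read_head` as entries are consumed.
fn drain<T: Copy>(ring: &Ring<T>, read_head: &mut u32) -> Vec<T> {
    let cap = ring.data.len() as u32;
    let mut out = Vec::new();
    while *read_head != ring.write_head {
        out.push(ring.data[(*read_head % cap) as usize]);
        *read_head = read_head.wrapping_add(1);
    }
    out
}
```

Both capacities in the table are powers of two, so the modulo indexing stays consistent even if the 32-bit counter wraps. The sketch does not detect overruns, where the GPU writes more than one capacity's worth of entries between two drains.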
Configuration
Peripheral config lives in sim_config.json, deserialized into
TestbenchConfig (src/testbench.rs):
| Peripheral | Field | Plural? |
|---|---|---|
| Clock | clocks: Option<Vec<ClockConfig>> | Yes (effective_clocks()) |
| GPIO | gpios: Vec<GpioConfig> | Yes |
| UART | uart + uarts: Vec<UartConfig> | Yes (effective_uarts(), #90) |
| Flash | flash: Option<FlashConfig> | Not yet |
| JTAG | jtag: Option<JtagConfig> | Not yet |
| Wishbone | (auto-detected, hardcoded signal names) | N/A (legacy) |
| Bus trace (AHB/APB) | bus_traces: Vec<BusTraceConfig> | Yes (effective_bus_traces()) |
Current implementation (bespoke kernels)
Today each GPU-side peripheral has its own kernel function:
| Kernel | Slots | Pattern |
|---|---|---|
| gpu_apply_flash_din | states[0], flash_state[1], flash_din_params[2] | Bidirectional: inject |
| gpu_flash_model_step | states[0], flash_state[1], flash_model_params[2], flash_data[3] | Bidirectional: sample+advance |
| gpu_io_step | states[0], uart_state[1], uart_params[2], uart_channel[3], wb_channel[4], wb_params[5], bus_channel[6], bus_params[7] | Observe-only (UART + Wishbone + AHB/APB bus trace) |
All run on thread 0 only — the per-tick work is a trivial FSM step.
gpu_io_step combines three logically independent observe-only models,
gated by n_uarts > 0, has_trace, and n_buses > 0 respectively.
Config-driven bus monitor (AHB/APB)
The Wishbone trace (build_wb_trace_params) hardcodes one SoC's signal
names (cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …) directly
in source. The AHB/APB bus tracer generalizes it into a config-driven,
protocol-aware monitor that is the model for future bus tracing:
- Config (BusTraceConfig): name, protocol (apb3 / ahb-lite / ahb5), hierarchical prefix, addr_bits/data_bits, and optional per-pin signals overrides. Pins default to {prefix}{pin}.
- Pin binding: protocol pin names (psel, paddr, …) are resolved to output-state positions via resolve_to_state_pos in trace_signals.rs, the same multi-candidate resolver that --trace-signals uses, so Yosys-flattened / scalar-expanded / structural naming all work. The pins are registered as observables before partitioning (via DesignArgs::extra_observable_signals) so they get state-buffer slots.
- GPU capture / CPU decode split: the kernel is protocol-agnostic. It packs a raw beat (addr, wdata, rdata, ctrl flags) into the ring buffer on the protocol's gating edge (psel & penable & pready for APB), using rising-edge detection so exactly one beat is recorded per completed transfer. The protocol FSM (phase pairing, burst tracking, response decode) lives in plain, unit-testable Rust in src/sim/models/bus_trace.rs. APB3 is stateless (one beat = one transaction); AHB pairing is the Phase-2 extension.
- Output: decoded transactions stream to CSV via --bus-trace-csv; annotated-VCD emission is a planned follow-up.
This is observe-only, so it slots into the existing post-simulate pattern. Migrating the hardcoded WbTrace onto this mechanism (expressing the VexRiscv ibus/dbus as configured buses) is a clean follow-up.
Target architecture
The two patterns (observe-only, bidirectional) and the common conventions (ring buffers, params structs, per-instance config arrays) should be codified so new peripherals follow a template:
Common conventions
- Params struct layout: { u32 state_size; u32 n_active; u32 _pad[2]; PerInstanceConfig configs[MAX_N]; } (uniform header, compile-time MAX_N cap).
- Ring buffer struct: { u32 write_head; u32 capacity; u32 _pad[2]; T data[CAP]; } (shared across all models producing GPU→CPU output).
- Buffer sizing: always MAX_N elements regardless of n_active. Wastes negligible memory for small N.
- Guard pattern: for (i = 0; i < n_active && i < MAX_N; i++) replaces the current has_foo != 0 booleans.
Model registration
New GPU-side models declare which pattern they follow:
- Observe-only: register a post-simulate kernel. Receives output state (read-only), writes to ring buffer.
- Bidirectional: register a pre-simulate kernel (inject into input state) and a post-simulate kernel (read output state, advance FSM).
Today this registration is implicit in cosim_metal.rs's
encode_and_commit_gpu_batch. Formalizing it is a future step —
the convention is sufficient while the model count is small.
Plural config convention
To support multi-instance peripherals (multiple UARTs, potentially multiple flash chips or RAM banks):
- Legacy singular field kept via #[serde(default)].
- New plural field alongside (e.g. uarts: Vec<UartConfig>).
- effective_<peripheral>() -> Vec<Config> merges both.
- Each config struct gains name: Option<String> for labelling.
This mirrors the existing effective_clocks() pattern.
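The merge step can be sketched as below. This is a minimal illustration under stated assumptions: the struct fields other than uart, uarts, and name are hypothetical (baud is a placeholder), the legacy field is assumed to be an Option, serde attributes are omitted so the snippet has no dependencies, and placing the legacy entry first is a choice made here, not something the source specifies.

```rust
#[derive(Clone, Debug, PartialEq)]
struct UartConfig {
    name: Option<String>, // optional label, per the convention above
    baud: u32,            // hypothetical payload field
}

#[derive(Default)]
struct TestbenchConfig {
    uart: Option<UartConfig>, // legacy singular field (assumed Option)
    uarts: Vec<UartConfig>,   // new plural field
}

impl TestbenchConfig {
    /// Merge the legacy singular entry, if present, with the plural list.
    fn effective_uarts(&self) -> Vec<UartConfig> {
        self.uart.iter().chain(self.uarts.iter()).cloned().collect()
    }
}
```

Callers then iterate over effective_uarts() only and never need to know which of the two fields a given instance came from.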
Cross-backend considerations
Cosim is Metal-only today. CUDA/HIP paths (kernel_v1_impl.cuh)
implement the core simulation kernel but have no gpu_io_step or
flash kernels. When CUDA/HIP cosim is added, the same two-pattern
taxonomy applies — the kernel implementations will differ but the
Rust-side buffer allocation, config resolution, and drain logic can
be shared via feature-gated code in cosim_metal.rs (or a
future cosim_common.rs).
Phasing
| Phase | Scope | Status |
|---|---|---|
| 1 | Multi-UART (#90): first peripheral using plural-config + array-in-kernel conventions | Done |
| 1b | Config-driven bus monitor, APB3 + CSV (GPU-capture/CPU-decode split) | Done |
| 2 | Refactor gpu_io_step to use common params/ring-buffer layout | Future |
| 2b | AHB-Lite / AHB5 bus tracing + annotated-VCD output; migrate WbTrace onto the general monitor | Future |
| 3 | Multi-Flash / external RAM (bidirectional pattern) | Deferred (no use case yet) |
| — | Multi-JTAG | Not needed (TAP daisy-chain suffices) |
Plan docs: ../plans/multi-peripheral-cosim.md,
../plans/bus-transaction-tracing.md.
ADR 0014 — AIG as simulation intermediate representation
Status: Accepted
Context
Jacquard simulates gate-level RTL designs on GPUs by converting technology-mapped netlists into an executable form. The choice of intermediate representation (IR) determines how easily the design maps to GPU hardware, how much the representation compresses, and what classes of optimisation are available at compile time.
Gate-level netlists arrive from synthesis tools (Yosys, Synopsys DC) mapped to a variety of cell libraries: the project's own AIGPDK library, SKY130, or GF180MCU. Each library uses different cell names and pin conventions; the IR must abstract over these while preserving the combinational and sequential semantics exactly.
The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) describes a "virtual Boolean processor" that evaluates combinational logic as a tree of AND-with-invert operations — directly motivating an and-inverter graph.
Decision
1. Uniform AND-gate IR
All combinational logic is represented as an and-inverter graph (AIG). Every node in the combinational cone is one of:
```rust
pub enum DriverType {
    AndGate(usize, usize),     // inputs with inversion bits
    InputPort(usize),          // primary input
    InputClockFlag(usize, u8), // clock edge flag
    DFF(usize),                // sequential (D flip-flop output)
    SRAM(usize),               // memory block output
    Tie0,                      // constant zero
}
```
Only AndGate has combinational fan-in. The two operands carry an
inversion bit in their LSB (aigpin_id << 1 | invert), giving the
full {AND, NAND, NOR, OR} family with a single node type.
Inverters and buffers are absorbed into the inversion bits rather than
creating separate nodes, keeping the graph compact.
This uniformity is the key property: because every combinational node
is the same (a XOR xa) AND (b XOR xb) operation, the boomerang
reduction tree (ADR 0015) can execute them all with a single GPU
instruction pattern — no opcode decode, no per-cell dispatch.
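A small sketch of the operand encoding may help. The function names and the flat values slice are illustrative assumptions; only the (aigpin_id << 1 | invert) packing and the (a XOR xa) AND (b XOR xb) evaluation come from the text above.

```rust
/// Pack an AIG pin id and an inversion flag into one operand word.
fn pin_iv(aigpin: usize, invert: bool) -> usize {
    (aigpin << 1) | invert as usize
}

/// Evaluate one AND node: (a XOR xa) AND (b XOR xb), where the inversion
/// bits xa and xb are the LSBs of the packed operands.
fn eval_and(values: &[bool], a_iv: usize, b_iv: usize) -> bool {
    let a = values[a_iv >> 1] ^ ((a_iv & 1) == 1);
    let b = values[b_iv >> 1] ^ ((b_iv & 1) == 1);
    a & b
}

/// OR via De Morgan: a | b == !(!a & !b). In an AIG the outer inversion
/// would live in the consumer's operand bit; here the caller applies it.
fn eval_or(values: &[bool], a: usize, b: usize) -> bool {
    !eval_and(values, pin_iv(a, true), pin_iv(b, true))
}
```

No inverter or buffer node appears anywhere: every inversion is a bit in an operand of some AND node.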
2. Conversion path: NetlistDB to AIG
The conversion is implemented in src/aig.rs via
AIG::from_netlistdb_impl(). It handles three vendored cell library
families, plus user-supplied cell metadata:
| Library | Strategy |
|---|---|
| AIGPDK (native) | Cells are already AND gates, DFFs, SRAMs — direct mapping |
| SKY130 | Load Verilog behavioural models from vendor/sky130_fd_sc_hd/, decompose each cell into AND gates via decompose_with_pdk() |
| GF180MCU | Load behavioural models from vendor/gf180mcu_fd_sc_mcu7t5v0/, decompose similarly |
| RuntimeCellLibrary | User-supplied cell metadata (ADR 0010) for cells outside vendored PDKs |
The decomposition process for technology-specific cells:
- Clock tracing: Identify sequential cells (DFFs, SRAMs), trace clock pins to primary inputs, create InputClockFlag drivers for posedge/negedge detection.
- Iterative DFS: Walk the netlist in topological order. For each unvisited output pin, recursively decompose driving cells into AND gates using the PDK behavioural models. An and_gate_cache deduplicates structurally identical sub-expressions.
- Multi-output cells: SKY130 cells like full adders with multiple outputs get special handling: shared sub-expressions are computed once and reused via postprocess hooks.
- Fanout construction: After all pins are processed, CSR-format fanout arrays are built for efficient traversal.
AIG pins are guaranteed to be in topological order (pin i is
defined before any pin that depends on it), which the downstream
pipeline relies on for level computation and scheduling.
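The deduplication and the topological-order guarantee can be illustrated together. This is a hedged sketch of structural hashing in general, not the code in src/aig.rs: AigBuilder, add_and, and the operand-order normalisation are assumptions made for the example; only the and_gate_cache name comes from the text.

```rust
use std::collections::HashMap;

/// Minimal AIG builder with structural hashing (hypothetical type).
struct AigBuilder {
    num_pins: usize,
    and_gate_cache: HashMap<(usize, usize), usize>,
}

impl AigBuilder {
    /// Pins 0..num_inputs are assumed to be primary inputs.
    fn new(num_inputs: usize) -> Self {
        AigBuilder { num_pins: num_inputs, and_gate_cache: HashMap::new() }
    }

    /// Return the pin for `a AND b` (operands packed as pin << 1 | invert),
    /// reusing an existing node when the same pair was seen before.
    fn add_and(&mut self, a_iv: usize, b_iv: usize) -> usize {
        // Normalise operand order so (a, b) and (b, a) share one entry.
        let key = (a_iv.min(b_iv), a_iv.max(b_iv));
        if let Some(&pin) = self.and_gate_cache.get(&key) {
            return pin;
        }
        // New pins are appended after their operands, so ids stay topological.
        let pin = self.num_pins;
        self.num_pins += 1;
        self.and_gate_cache.insert(key, pin);
        pin
    }
}
```

Because a node can only be created after both of its operands exist, appending new pins at the end is enough to keep pin i ahead of everything that depends on it.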
3. EndpointGroup abstraction
The AIG partitions its outputs into endpoint groups — the units of work that partitions must realise:
```rust
pub enum EndpointGroup<'i> {
    PrimaryOutput(usize),           // top-level output pin
    DFF(&'i DFF),                   // D flip-flop: data + clock-enable
    RAMBlock(&'i RAMBlock),         // SRAM: addr, data, enables
    SimControl(&'i SimControlNode), // $stop/$finish
    Display(&'i DisplayNode),       // $display/$write
    StagedIOPin(usize),             // inter-stage boundary (from --level-split)
}
```
Each variant bundles the signals that must be evaluated together: a
DFF needs both its D input and clock-enable; an SRAM needs address,
data, and write-enable buses. The for_each_input() method
enumerates all AIG pins feeding an endpoint group, which the
hypergraph partitioner (RepCut) uses to build connectivity and the
partition executor (pe.rs) uses to determine resource requirements.
This grouping is important because the boomerang reduction tree produces results in 32-bit-aligned write-out slots. Endpoint groups that share a write-out slot are co-located in the hierarchy; groups that need different clock-enable conditions (e.g., two DFFs with different clocks driving the same data pin) generate "output duplicates" that consume additional write-out capacity.
4. Why AIG over alternatives
BDDs (Binary Decision Diagrams): BDDs can represent Boolean functions canonically but suffer from exponential blowup for many practical circuits (e.g., multipliers). The canonical form is useful for equivalence checking but unnecessary for simulation, where we just need to evaluate. BDDs also have no natural mapping to the GPU's SIMT execution model.
Truth tables / LUTs: Lookup tables scale exponentially with input count. A 6-input LUT (as in Xilinx FPGAs) covers individual cells efficiently but doesn't compose — cascading LUTs requires separate evaluation steps. AIGs compose naturally: the output of one AND gate feeds the input of the next, forming a tree that maps directly to the boomerang hierarchy.
Technology-mapped netlist (direct execution): Keeping the original cell library would require per-cell-type dispatch in the GPU kernel — a conditional branch per node. GPU SIMT execution penalises warp divergence heavily; a uniform operation eliminates this entirely. The conversion cost (one-time decomposition at compile time) is negligible compared to the simulation runtime.
MIG (Majority-Inverter Graph): MIGs are a more compact representation (3-input majority gates) but the 3-input structure doesn't map as cleanly to binary reduction trees. AIGs are the industry standard for synthesis and verification tools (ABC, AIGER format), making interop straightforward.
The AIG's key advantage is that it reduces the GPU kernel to a single bit-parallel operation repeated across a hierarchical tree — no opcode dispatch, no conditional branching, maximum SIMT utilisation.
Consequences
Enables:
- The boomerang reduction tree (ADR 0015) works because every node is the same AND-with-invert operation. A heterogeneous IR would require per-node dispatch and break the hierarchical reduction pattern.
- Technology independence: the same GPU kernel and partition executor handle AIGPDK, SKY130, and GF180MCU designs. Adding a new PDK requires only a decomposition module, not kernel changes.
- Structural deduplication via and_gate_cache reduces graph size when multiple cells share sub-expressions.
- The inversion-bit encoding (pin_iv = aigpin << 1 | invert) eliminates inverter/buffer nodes entirely; these are free in hardware too, so the IR's size correlates better with actual simulation cost than a technology-mapped netlist would.
Constrains:
- No latches or async logic. The AIG assumes clean register boundaries: DFFs capture on clock edges, combinational logic is acyclic between registers. Level-sensitive latches and combinational loops would require iterative evaluation that the current pipeline doesn't support (see docs/simulation-architecture.md § "Known Issues").
- Decomposition quality matters. A poor decomposition of a complex cell (e.g., a mux-heavy datapath cell) can produce a deep AND tree that requires more boomerang stages. The SKY130 and GF180MCU decompositions are hand-tuned for the common cells; exotic cells from other PDKs may decompose sub-optimally.
- No gate-delay preservation in the AIG itself. The AIG is a functional (Boolean) representation. Timing information from Liberty/SDF is loaded separately and overlaid onto the AIG's pin structure via gate_delays and aigpin_cell_origins. This means the AIG construction can re-order or deduplicate nodes without worrying about timing, but it also means the timing model must reconstruct the mapping from AIG pins back to physical cells.
ADR 0015 — Boomerang execution model and GPU resource mapping
Status: Accepted
Context
Once the design is converted to an AIG (ADR 0014), the combinational logic must be mapped onto GPU hardware for parallel evaluation. GPUs offer massive parallelism but impose rigid constraints: fixed thread counts per block, limited shared memory, and synchronous SIMT execution within a warp/SIMD group.
The GEM paper (Guo et al., "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation," DAC 2025) introduces a "virtual Boolean processor" organised as a boomerang hierarchical reduction tree. This ADR documents how the boomerang maps to GPU hardware, the resource limits it imposes, and the partitioning and instruction-generation pipeline that stays within those limits.
Decision
1. Boomerang reduction tree
A single GPU block (CUDA/HIP) or threadgroup (Metal) executes one partition of the design. Each partition evaluates a subset of the AIG's endpoint groups (DFFs, primary outputs, SRAMs, etc.) by reducing their combinational fan-in cones through a hierarchical binary tree called the boomerang.
The boomerang has BOOMERANG_NUM_STAGES = 13 levels, giving a
reduction width of 2^13 = 8192 leaf positions. Each thread in the
block handles 32 bits (one u32), so the block uses 8192 / 32 = 256 threads (NUM_THREADS_V1 in flatten.rs).
The 13 hierarchy levels map to three GPU execution tiers:
| Levels | Width | GPU mechanism |
|---|---|---|
| hier[0] | 8192 → 4096 | 256 threads, shared memory reduction (threads 128-255 compute, 0-127 supply inputs) |
| hier[1–3] | 4096 → 512 | Shared memory reduction with barrier between levels; only threads in [hier_width, 2×hier_width) compute — the rest idle |
| hier[4–7] | 512 → 32 | Warp/SIMD shuffle (__shfl_down_sync / simd_shuffle_down) — no barrier needed |
| hier[8–12] | 32 → 1 | Bit-level operations within a single u32 on thread 0 |
At each level, every position computes (a XOR xora) AND ((b XOR xorb) OR orb): the AND-with-invert operation from the AIG, with an OR mask on the b operand. When
orb is all-ones, the b operand is forced to 1 and the position acts as a
pass-through (forwarding input a unchanged when xora is zero). This single
instruction pattern handles AND gates, inversions, and wiring with zero branch divergence.
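The per-position operation can be written out on 32-bit words. This is a sketch under one explicit assumption: the OR with orb is applied to the b operand before the AND, which is the grouping under which an all-ones orb yields the pass-through behaviour described above. The function name is illustrative.

```rust
/// One boomerang position, evaluated on 32 independent signals at once.
/// Assumed grouping: (a ^ xora) & ((b ^ xorb) | orb).
fn boomerang_op(a: u32, b: u32, xora: u32, xorb: u32, orb: u32) -> u32 {
    (a ^ xora) & ((b ^ xorb) | orb)
}
```

With all flags zero this is a plain AND; with orb set to all-ones it forwards a; with both XOR flags set to all-ones it computes NOR. The same three-operation sequence runs in every case, so no lane ever branches.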
Between boomerang stages (when the AIG is too deep for a single 8192-wide tree), a shuffle permutation redistributes results from shared memory back to thread-local registers. The shuffle is encoded as 16-bit index pairs in the script, allowing arbitrary re-routing of signals between stages.
2. GPU resource limits and partition constraints
The boomerang's fixed geometry imposes hard resource limits on each
partition. These are documented in src/pe.rs on the Partition
struct:
| Resource | Limit | Derivation |
|---|---|---|
| Unique inputs | 8191 | 8192 leaf positions minus Tie0. Each input occupies a leaf slot; duplicates consume additional slots. Global-read rounds pack multiple state words into each thread's initial register. |
| Unique outputs | 8191 | Write-out slots in the boomerang hierarchy, addressed by stage+position pairs. Outputs include DFF data pins, primary outputs, and SRAM port signals. |
| Intermediate pins per stage | 4095 | The hier[1] level has 2^(13-1) = 4096 positions. One position is reserved for Tie0. Intermediates are AIG pins that are produced in one boomerang stage and consumed in the next. |
| SRAM output groups | 64 | 8192 / (32 * 4) = 64. Each SRAM occupies 4 write-out groups (32-bit read-data, address, write-data, write-enable). BOOMERANG_MAX_WRITEOUTS = 1 << (13 - 5) = 256 total write-out slots, of which SRAMs consume 4 each. |
Write-out slots are 32-bit-aligned groups within the hier[1] level.
The total write-out capacity is BOOMERANG_MAX_WRITEOUTS = 256.
SRAMs and "output duplicates" (same data pin driven by DFFs with
different clock enables) consume write-out slots from this budget. A
quick_reject() pre-check catches obvious overflows before the
expensive full build.
When a partition exceeds these limits, Partition::build_one()
returns None and the partitioner must split the endpoint set
further.
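The limits in the table follow from a few constants. The sketch below reproduces the arithmetic; BOOMERANG_NUM_STAGES, NUM_THREADS_V1, and BOOMERANG_MAX_WRITEOUTS are names from the text, while the remaining constant names and the fits function are hypothetical and much simpler than the real quick_reject() (output duplicates are not modelled).

```rust
const BOOMERANG_NUM_STAGES: u32 = 13;
const WORD_BITS: u32 = 32;
const LEAVES: u32 = 1 << BOOMERANG_NUM_STAGES; // 8192 leaf positions
const NUM_THREADS_V1: u32 = LEAVES / WORD_BITS; // 256 threads per block
const BOOMERANG_MAX_WRITEOUTS: u32 = 1 << (BOOMERANG_NUM_STAGES - 5); // 256 slots
const MAX_UNIQUE_IO: u32 = LEAVES - 1; // one leaf reserved for Tie0
const MAX_INTERMEDIATES: u32 = (1 << (BOOMERANG_NUM_STAGES - 1)) - 1; // 4095
const WRITEOUTS_PER_SRAM: u32 = 4;
const MAX_SRAMS: u32 = BOOMERANG_MAX_WRITEOUTS / WRITEOUTS_PER_SRAM; // 64

/// Hypothetical cheap pre-check: true when the counts stay within the limits.
fn fits(inputs: u32, outputs: u32, srams: u32) -> bool {
    inputs <= MAX_UNIQUE_IO && outputs <= MAX_UNIQUE_IO && srams <= MAX_SRAMS
}
```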
3. Hypergraph partitioning with RepCut
The design's endpoint groups are distributed across GPU blocks by
RepCut (src/repcut.rs), which constructs a weighted hypergraph
and partitions it using mt-kahypar.
Why a hypergraph, not a graph: In a standard graph, an edge connects exactly two vertices. But a single AIG node (an AND gate deep in the combinational cone) may be shared by many endpoint groups — its "edge" in the connectivity structure is a hyperedge spanning all groups that depend on it. Modelling this as pairwise graph edges would lose the information that cutting this one node simultaneously affects all connected groups. Hypergraph partitioning minimises the actual communication cost (shared signals that must be read from global memory by multiple blocks).
Why mt-kahypar: mt-kahypar is a state-of-the-art multilevel
hypergraph partitioner with LargeK support (many partitions in one
pass) and parallel execution. The implementation uses:
- Preset::LargeK: optimised for k >> 2.
- epsilon = 0.2: 20% imbalance tolerance, giving the partitioner flexibility to reduce cut while keeping partitions roughly equal.
- Objective::Soed: Sum of External Degrees, which counts how many partition boundaries each hyperedge crosses. This directly correlates with the number of global memory reads each block must perform.
- Vertex weights proportional to estimated evaluation cost (accounting for sub-graph size and fanout sharing).
- Hyperedge weights equal to the number of AIG nodes with that endpoint reachability pattern.
- Hyperedge size cap at 1000 nodes (reservoir-sampled beyond that) to keep partitioning tractable for signals with extreme fanout.
The hypergraph construction itself is the bottleneck for large
designs: for each AIG node, RepCut computes a bitset of which
endpoint groups it can reach via forward traversal. Nodes with
identical reachability sets are clustered into a single hyperedge.
This is done in parallel across bitset blocks (REPCUT_BITSET_BLOCK_SIZE = 4096) using Rayon.
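The clustering step can be shown in miniature. This sketch assumes at most 64 endpoint groups so that a reachability set fits in one u64; the real implementation works on blocked bitsets in parallel. The function name is hypothetical.

```rust
use std::collections::HashMap;

/// Group AIG nodes by reachability pattern. `reach[i]` is a bitmask of the
/// endpoint groups node i can reach. The result maps each distinct pattern
/// to its node count: one hyperedge per pattern, weighted by that count.
fn cluster_by_reachability(reach: &[u64]) -> HashMap<u64, u32> {
    let mut hyperedges = HashMap::new();
    for &pattern in reach {
        // Nodes that reach no endpoint group contribute no hyperedge.
        if pattern != 0 {
            *hyperedges.entry(pattern).or_insert(0) += 1;
        }
    }
    hyperedges
}
```

Each key names the set of endpoint groups the hyperedge spans, and each value is the hyperedge weight from the list above.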
4. Greedy merge-back strategy
mt-kahypar produces an initial partition assignment, but the
partition count is typically much larger than needed (set to 2x the
number of GPU blocks). process_partitions() in pe.rs then
aggressively merges partitions:
- Bitset-based overlap scoring: For each pair of partitions, compute the union of their AIG node bitsets. The merge cost is |union| - max(|A|, |B|); lower is better, indicating more shared sub-graph. This is O(num_aigpins/64) per pair instead of full DFS.
- Speculative parallel trials: Merge candidates are sorted by overlap cost. Up to parallel_trial_stride merges are attempted in parallel using Rayon, with a cancel-on-success AtomicBool to abort remaining trials once a valid merge is found. The stride doubles on each iteration.
- Quality gate: A merged partition is rejected if it would increase the maximum boomerang stage count beyond max_original_nstages + max_stage_degrad. This prevents merges that technically fit in resource limits but would degrade simulation throughput by adding extra boomerang stages.
- Blacklisting: Failed merge attempts are blacklisted for that partition to avoid redundant retries. Cancelled (interrupted by a parallel success) trials are not blacklisted, since the merge may still be valid in a future iteration.
The result: 2x-4x fewer partitions than the initial hypergraph solution, with each partition fully validated to fit within boomerang resource limits.
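The overlap score is cheap because it needs only word-wise ORs and popcounts. A minimal sketch, assuming both bitsets are stored as equal-length slices of u64 words (the function names are illustrative):

```rust
/// Number of set bits in a bitset stored as u64 words.
fn popcount(bits: &[u64]) -> u32 {
    bits.iter().map(|w| w.count_ones()).sum()
}

/// Merge cost |A ∪ B| - max(|A|, |B|): the number of AIG nodes the larger
/// partition would gain. Lower means more shared sub-graph.
fn merge_cost(a: &[u64], b: &[u64]) -> u32 {
    let union: u32 = a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum();
    union - popcount(a).max(popcount(b))
}
```

A cost of zero means one partition's node set is contained in the other's, so the merge adds no new nodes to the larger partition.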
5. FlattenedScript instruction generation
src/flatten.rs converts partitions into FlattenedScriptV1 — a
packed u32 instruction stream consumed directly by the GPU kernel.
The script encodes:
- Metadata section (256 u32): Per-partition control fields at fixed indices, followed by the write-out hook table:

  | Index | Field | Purpose |
  |---|---|---|
  | 0 | num_stages | Boomerang stage count |
  | 1 | is_last_part | Flag: last partition in the design |
  | 2 | num_ios | Number of write-out endpoints |
  | 3 | io_offset | Start offset in global state buffer |
  | 4 | num_srams | SRAM block count |
  | 5 | sram_offset | SRAM start offset |
  | 6 | num_global_read_rounds | Input read rounds |
  | 7 | num_output_duplicates | Output duplication count |
  | 8 | is_x_capable | X-propagation flag (ADR 0016) |
  | 9 | xmask_state_offset | X-mask offset (when X-capable) |
  | 128..255 | write-out hook table | Maps each thread to the boomerang stage+position where it captures its output |

  This layout is the load-bearing contract between Rust (flatten.rs) and the GPU kernel (kernel_v1.metal, kernel_v1_impl.cuh).
- Global-read permutation (2 × NUM_THREADS_V1 per round): Each thread gets an index into the global state buffer and a bitmask. The thread reads one u32 from global memory and extracts the bits indicated by the mask using a pext-like loop. Rounds are packed to maximise throughput (each thread accumulates up to 32 bits across rounds). An index high-bit flag distinguishes previous-cycle state from current-cycle inter-stage intermediates.
- Boomerang sections (per stage, NUM_THREADS_V1 × 20 u32):
  - 16 u32 per thread: shuffle permutation (16-bit index pairs selecting source bits from shared memory)
  - 4 u32 per thread: AND-gate flags (xora, xorb, orb) plus a padding slot reused for gate-delay injection (u16 picoseconds)
- Global write-out: SRAM and output-duplicate permutations, clock-enable conditions, and data-inversion flags for committing results back to the state buffer.
The entire script is uploaded to device memory once and read sequentially by the kernel. Script reads are overlapped with computation via double-buffering (reading the next stage's data while computing the current stage's AND gates).
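The pext-like extraction used by the global-read permutation can be sketched in software. This is a generic parallel-bit-extract loop, offered as an illustration of the technique rather than a transcription of the kernel; the function name is an assumption.

```rust
/// Software parallel-bit-extract: gather the bits of `word` selected by
/// `mask` into the low bits of the result, preserving their relative order.
fn pext_u32(word: u32, mut mask: u32) -> u32 {
    let mut out = 0u32;
    let mut k = 0u32; // next free bit position in the output
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if (word & lowest) != 0 {
            out |= 1 << k;
        }
        k += 1;
        mask &= mask - 1; // clear the bit just processed
    }
    out
}
```

The loop runs once per set mask bit, so a thread that picks only a few bits from a state word does proportionally little work.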
6. Pipeline staging for deep circuits
When a design's combinational depth exceeds the boomerang's capacity,
src/staging.rs splits the AIG into major stages at user-specified
level thresholds (--level-split 30 or --level-split 20,40).
Each major stage gets its own StagedAIG with:
- primary_inputs: the AIG pins produced by previous stages (or the design's actual primary inputs for the first stage).
- primary_output_pins: live AIG pins at the split boundary that must be forwarded to the next stage.
- endpoints: the original AIG endpoint groups whose combinational depth falls within this stage.
Major stages execute sequentially on the GPU (the kernel loops over them). Between stages, intermediate values are written to the output state buffer and re-read by the next stage's global-read permutation (indicated by the high-bit flag in the index).
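How level thresholds map pins to major stages can be sketched in one function. The boundary convention used here (a pin whose level equals a threshold belongs to the later stage) is an assumption for illustration, as is the function name; src/staging.rs may draw the boundary differently.

```rust
/// Index of the major stage for a pin at combinational `level`, given the
/// ascending thresholds passed to --level-split (for example [20, 40]).
fn stage_of(level: u32, splits: &[u32]) -> usize {
    // Count how many thresholds the level has reached or passed.
    splits.iter().filter(|&&s| level >= s).count()
}
```

With thresholds 20,40 this yields three major stages: levels below 20, levels 20 to 39, and levels 40 and above.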
Staging trades latency (more sequential kernel dispatches) for fitting within the 8192-wide boomerang. Without it, designs with
50-level combinational paths would fail partitioning entirely.
Consequences
Enables:
- Fixed, branch-free GPU kernel. The kernel has no per-node dispatch — every thread executes the same AND-XOR-OR instruction at every boomerang level. This maximises SIMT utilisation across CUDA, HIP, and Metal.
- Deterministic shared-memory budget. The 256-thread, 8192-bit boomerang uses a fixed amount of shared memory (threadgroup memory on Metal), independent of the design. No dynamic allocation, no shared-memory pressure variation between blocks.
- Scalable partitioning. The hypergraph partitioner + greedy merge naturally adapts to designs from hundreds to millions of gates. Larger designs get more partitions; the GPU kernel is the same.
- Technology independence at the kernel level. The GPU kernel knows nothing about AIGPDK, SKY130, or GF180MCU. It executes packed u32 scripts. All cell-library knowledge is absorbed during AIG construction and script generation.
Constrains:
- 8191-input/output ceiling per partition. Designs with extremely wide buses or highly connected sub-circuits may require aggressive partitioning, which increases inter-partition communication (global memory reads). The --level-split option helps by splitting deep cones into multiple stages, but wide cones remain fundamentally limited by the 8192-slot boomerang.
- Write-out slot scarcity for SRAM-heavy designs. Each SRAM consumes 4 write-out slots. With BOOMERANG_MAX_WRITEOUTS = 256, a partition can hold at most 64 SRAMs, and fewer when output duplicates also need slots. Designs with many small memories may need finer partitioning than their gate count alone would suggest.
- Fixed thread count. The 256-thread block size is hardcoded (NUM_THREADS_V1). On GPUs where the SM/CU could benefit from larger blocks (e.g., occupancy tuning), there's no flexibility. Changing this would require redesigning the boomerang hierarchy depth and the bit-packing in the script format.
- Script size grows with partition depth. Each boomerang stage adds ~20 × 256 = 5120 u32 entries to the script. Very deep partitions (many boomerang stages) produce large scripts that may pressure GPU memory bandwidth for the script reads, though double-buffering mitigates this.
ADR 0016 — Selective X-propagation
Status: Accepted
Context
Jacquard's default two-state (0/1) simulation silently resolves uninitialised DFF and SRAM outputs to zero. This masks initialisation bugs that real hardware would expose as unknown (X) values, and creates false mismatches when comparing against four-state RTL simulators.
Naively upgrading the entire simulator to four-state logic would double storage and roughly halve throughput. In a well-designed SoC after reset, typically less than 5% of signals are genuinely X-capable.
Decision
Implement selective X-propagation controlled by the --xprop
CLI flag. Static analysis at compile time identifies X-source
signals (uninitialised DFFs, SRAM read ports); forward-cone
computation classifies each partition as X-capable or X-free. Only
X-capable partitions run an X-aware kernel variant; the rest
continue with the fast two-state path.
The full seven-stage design, implementation details, and design
rationale are in
docs/selective-x-propagation.md.
Stages 1–6 are implemented; Stage 7 (dynamic X narrowing) is a
future enhancement.
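The forward-cone classification can be illustrated on a toy combinational graph. This is a simplified sketch, not the analysis in the linked document: it makes a single pass over AND nodes in topological order and ignores the fixpoint iteration across DFF boundaries. All names are hypothetical.

```rust
/// Forward-propagate X-capability. `x` initially marks the X-source pins;
/// `and_nodes` lists (output_pin, input_a, input_b) in topological order.
fn x_capable_pins(mut x: Vec<bool>, and_nodes: &[(usize, usize, usize)]) -> Vec<bool> {
    for &(out, a, b) in and_nodes {
        // A node may carry X if either of its inputs may.
        x[out] = x[out] || x[a] || x[b];
    }
    x
}

/// A partition needs the X-aware kernel if any pin it evaluates may be X.
fn partition_is_x_capable(x: &[bool], partition_pins: &[usize]) -> bool {
    partition_pins.iter().any(|&p| x[p])
}
```

The classification is conservative: a pin in the forward cone of an X source is marked even if the logic would mask the X at runtime.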
Key design choices (summary)
- Partition-level granularity: an entire partition runs X-aware or not. ~95% of partitions are typically X-free after reset.
- Conservative SRAM X: all reads return X until any write. Per-address tracking is deferred.
- No reset-aware analysis: all DFFs start as X; the fixpoint iteration naturally resolves reset-connected DFFs.
- State buffer doubling: X-mask words occupy [reg_io_state_size .. 2*reg_io_state_size) when enabled. X-free partitions ignore the mask entirely.
- Runtime flag, not compile-time: --xprop on jacquard sim; no new Cargo features needed.
Consequences
- X-capable partitions pay ~2× storage and ALU cost; X-free partitions (the vast majority) pay nothing.
- VCD output includes x values when --xprop is enabled, compatible with standard four-state VCD tools.
- The --check-with-cpu reference path includes an X-aware CPU kernel for validation.
- Benchmarks (benches/xprop.rs) track the overhead.
ADR 0017 — Cosim execution model
Status: Accepted
Context
The cosim mode runs a GPU-simulated design alongside reactive peripheral models (flash, UART, JTAG, GPIO) that drive and observe design pins each clock edge. The execution model must balance two competing needs: GPU throughput (which favours large batches of edges dispatched as a single command buffer) and peripheral responsiveness (which requires CPU-side model updates between edges).
This ADR documents the batch dispatch loop, the multi-clock scheduler, and the time-domain abstractions that tie them together.
Decision
Batch dispatch loop
The cosim main loop groups consecutive scheduler edges into
batches of up to BATCH_SIZE = 1024 edges. Each batch is
encoded into a single Metal command buffer and dispatched to the
GPU. Between batches, CPU-side peripheral models (PeripheralModel::step_edge) run, ring buffers are drained, and model overrides are
compiled into BitOp arrays for the next batch.
Per-edge execution within a batch:
state_prep (apply clk/gpio/jtag pin drives via BitOps)
→ gpu_apply_flash_din (inject flash MISO into input state)
→ simulate_v1_stage ×N (combinational logic evaluation)
→ gpu_flash_model_step (read MOSI, advance flash FSM)
→ gpu_io_step (UART TX decode + Wishbone bus trace)
CPU-side models cannot observe intra-batch state changes — they see
the output state only after the batch completes. For peripherals
that require per-edge responsiveness (e.g. JTAG replay with tight
hold-cycle requirements), the batch is forced to size 1 when any
model reports is_active() == true.
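The batch-size decision can be sketched as below. The trait here is a stand-in with only the is_active method mentioned above, and next_batch_len and Fixed are hypothetical names; the real loop also has to respect scheduler and buffer details that this sketch leaves out.

```rust
const BATCH_SIZE: usize = 1024;

/// Stand-in for the real trait: only the method this sketch needs.
trait PeripheralModel {
    fn is_active(&self) -> bool;
}

/// Test double whose activity is fixed at construction.
struct Fixed(bool);

impl PeripheralModel for Fixed {
    fn is_active(&self) -> bool {
        self.0
    }
}

/// Number of edges to encode into the next command buffer.
fn next_batch_len(models: &[Box<dyn PeripheralModel>], edges_remaining: usize) -> usize {
    // Any active model forces per-edge dispatch so it can react between edges.
    let limit = if models.iter().any(|m| m.is_active()) { 1 } else { BATCH_SIZE };
    limit.min(edges_remaining)
}
```

The decision is re-evaluated at every batch boundary, so a model that becomes idle lets the loop return to full-size batches.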
Why BATCH_SIZE = 1024
The batch size trades off GPU utilisation against peripheral latency. Smaller batches → more Metal command buffer submissions per second → higher overhead. Larger batches → staler CPU-side model state. 1024 was chosen empirically as a sweet spot:
- For peripheral-free simulation: amortises ~1ms of command buffer overhead across 1024 edges ≈ 1µs/edge overhead.
- For active peripherals (JTAG, stimulus-driven): the is_active fallback to batch=1 ensures correctness regardless of batch size.
- The batch size only affects cosim; the sim command processes the entire VCD in one GPU dispatch.
Pre-allocated schedule buffers
Each scheduler edge has pre-allocated Metal buffers for its
StatePrepParams and BitOp array (ScheduleBuffers::edge_buffers).
These are allocated once at startup — not per-dispatch — to avoid
allocation latency in the hot loop. The schedule repeats with period
edges_per_period (= LCM schedule length); edge N reuses
buffer N % edges_per_period.
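The reuse rule (edge N maps to slot `N % edges_per_period`) can be sketched as follows. The field layout is an assumption: plain vectors stand in for the Metal buffers, and `slot` / `for_edge` are hypothetical method names.

```rust
/// Hypothetical stand-in for one schedule entry's buffer pair.
/// In the real code these would be Metal buffers for `StatePrepParams`
/// and the `BitOp` array; plain vectors keep the sketch portable.
struct EdgeBuffers {
    params: Vec<u8>,
    ops: Vec<u32>,
}

struct ScheduleBuffers {
    edge_buffers: Vec<EdgeBuffers>,
}

impl ScheduleBuffers {
    /// Allocate every buffer pair once, before the hot loop starts.
    fn new(edges_per_period: usize) -> Self {
        assert!(edges_per_period > 0, "schedule must contain at least one edge");
        let edge_buffers = (0..edges_per_period)
            .map(|_| EdgeBuffers { params: Vec::new(), ops: Vec::new() })
            .collect();
        ScheduleBuffers { edge_buffers }
    }

    /// Slot reused by absolute edge number `edge`: `edge % edges_per_period`.
    fn slot(&self, edge: u64) -> usize {
        (edge % self.edge_buffers.len() as u64) as usize
    }

    /// Access the pre-allocated pair for `edge`; no new pair is created here.
    fn for_edge(&mut self, edge: u64) -> &mut EdgeBuffers {
        let slot = self.slot(edge);
        &mut self.edge_buffers[slot]
    }
}
```

Because the edge pattern repeats every `edges_per_period` edges, edges 2 and 6 of a four-entry schedule share the same pair.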
Multi-clock scheduler
The MultiClockScheduler computes a deterministic interleaving of
edges across clock domains. Given N clocks with potentially
different periods and phase offsets:
- Compute `gcd_ps` = GCD of all half-periods and phase offsets. This is the scheduler tick — the minimum time quantum.
- Compute `lcm_ps` = LCM of all full periods. This is the schedule period — the point at which the edge pattern repeats.
- `schedule_len = lcm_ps / gcd_ps` — number of ticks per period.
- For each tick, compute which domains have rising/falling edges based on `(tick_ps - phase_offset) % half_period == 0`.
The schedule length is capped at 1,000,000 ticks. This prevents degenerate clock ratios (e.g. primes) from producing unbounded schedules. If the cap is hit, the assertion fires with a message suggesting the clocks may not be commensurable at the configured resolution.
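A minimal sketch of the schedule computation described above, under stated assumptions: periods are even (so half-periods are exact), the first edge at `phase_offset` is treated as rising, the pattern is treated as periodic before the phase offset, and the cap is reported as an `Err` rather than an assertion. `ClockDomain`, `Edge` and `build_schedule` are illustrative names, not the `MultiClockScheduler` API.

```rust
/// One clock domain; periods are assumed even so the half-period is exact.
#[derive(Clone, Copy)]
struct ClockDomain {
    period_ps: u64,
    phase_offset_ps: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Edge {
    Rising(usize), // index of the clock domain
    Falling(usize),
}

/// Cap on ticks per schedule period, as stated in this ADR.
const MAX_SCHEDULE_LEN: u64 = 1_000_000;

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

/// Returns `(gcd_ps, schedule)`, where `schedule[tick]` lists the edges at that tick.
fn build_schedule(clocks: &[ClockDomain]) -> Result<(u64, Vec<Vec<Edge>>), String> {
    if clocks.is_empty() || clocks.iter().any(|c| c.period_ps < 2) {
        return Err("need at least one clock with period >= 2 ps".to_string());
    }
    // Scheduler tick: GCD of all half-periods and phase offsets.
    let gcd_ps = clocks
        .iter()
        .fold(0, |g, c| gcd(gcd(g, c.period_ps / 2), c.phase_offset_ps));
    // Schedule period: LCM of all full periods.
    let lcm_ps = clocks.iter().fold(1, |l, c| lcm(l, c.period_ps));
    let schedule_len = lcm_ps / gcd_ps;
    if schedule_len > MAX_SCHEDULE_LEN {
        return Err(format!(
            "schedule needs {schedule_len} ticks; clocks may not be commensurable at this resolution"
        ));
    }
    let mut schedule = Vec::with_capacity(schedule_len as usize);
    for tick in 0..schedule_len {
        let tick_ps = (tick * gcd_ps) as i64;
        let mut edges = Vec::new();
        for (i, c) in clocks.iter().enumerate() {
            let half = (c.period_ps / 2) as i64;
            let rel = tick_ps - c.phase_offset_ps as i64;
            if rel.rem_euclid(half) == 0 {
                // Assumption: even half-period index is rising, odd is falling.
                if rel.div_euclid(half).rem_euclid(2) == 0 {
                    edges.push(Edge::Rising(i));
                } else {
                    edges.push(Edge::Falling(i));
                }
            }
        }
        schedule.push(edges);
    }
    Ok((gcd_ps, schedule))
}
```

For a 10 000 ps clock and a 4 000 ps clock with zero phase offsets, the tick is 1 000 ps and the schedule has 20 ticks, most of which carry no edge for either domain.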
Time units: edges vs clock cycles
A scheduler edge is one tick of the scheduler (duration =
gcd_ps). A clock cycle is two half-periods of a given domain
(= rising + falling edge). The ratio edges_per_sys_clk_cycle = clock_period_ps / gcd_ps converts between them.
This distinction is load-bearing for peripheral timing:
- UART baud rate dividers count edges, not clock cycles.
- Reset duration counts edges.
- The `--max-clock-edges` CLI flag counts edges.
Confusing edges with clock cycles was the root cause of the UART
baud rate bug fixed in commit a263e47 — edges_per_period (the
LCM schedule length) was used where edges_per_sys_clk_cycle was
needed, doubling the bit time in multi-clock designs.
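The difference between the two quantities is easy to see numerically. The sketch below uses the ratio given above; `edges_per_period` is written as a free function for comparison, and `sys_clk_cycles_to_edges` is a hypothetical helper name, not an existing function.

```rust
/// Scheduler edges in one cycle of the system clock.
fn edges_per_sys_clk_cycle(clock_period_ps: u64, gcd_ps: u64) -> u64 {
    clock_period_ps / gcd_ps
}

/// Scheduler edges in one full schedule period (the LCM schedule length).
fn edges_per_period(lcm_ps: u64, gcd_ps: u64) -> u64 {
    lcm_ps / gcd_ps
}

/// Hypothetical helper: convert a count of system-clock cycles
/// (for example a UART divider) into scheduler edges.
fn sys_clk_cycles_to_edges(cycles: u64, clock_period_ps: u64, gcd_ps: u64) -> u64 {
    cycles * edges_per_sys_clk_cycle(clock_period_ps, gcd_ps)
}
```

Take a 10 000 ps system clock alongside a 4 000 ps second clock: `gcd_ps` is 1 000 and `lcm_ps` is 20 000, so one system-clock cycle is 10 edges while one schedule period is 20 edges. Scaling by the schedule period instead of the cycle gives exactly twice the intended edge count in this configuration. In a single-clock design both values are 2, which is why the mix-up only shows in multi-clock designs.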
GPU→CPU ring buffer drain
After each batch completes, the CPU drains three categories of GPU-side state:
- Peripheral ring buffers — UART channels and Wishbone trace channel, drained from local `read_head` to GPU-written `write_head`. See ADR 0013 for struct conventions.
- VCD snapshot buffer — when `--stimulus-vcd` or `--output-vcd` is enabled, a separate ring buffer (`2 × state_size` words per edge) captures per-tick output state on the GPU. The CPU drains it after each batch to write VCD transitions. This mechanism is what enables `BATCH_SIZE > 1` even with VCD output — without it, the CPU would need to read output state after every single edge.
- CPU reference check — when `--check-with-cpu` is active, the CPU replays the batch with the reference kernel and compares.
No synchronisation beyond Metal's command buffer completion is
needed — all drains happen after waitUntilCompleted.
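A drain from a local `read_head` up to a producer-written `write_head` might look like the sketch below. The layout is an assumption made for illustration (monotonic 64-bit heads indexed modulo capacity, and an overrun policy that keeps only the newest entries); the real struct conventions are the ones in ADR 0013. `gpu_push` merely simulates the GPU writer.

```rust
/// Hypothetical ring channel; the real struct conventions live in ADR 0013.
struct RingChannel<T> {
    slots: Vec<T>,
    write_head: u64, // advanced by the producer; never wraps in this sketch
    read_head: u64,  // CPU-local cursor
}

impl<T: Copy + Default> RingChannel<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        RingChannel { slots: vec![T::default(); capacity], write_head: 0, read_head: 0 }
    }

    /// Stand-in for the GPU-side writer, used only to exercise the sketch.
    fn gpu_push(&mut self, value: T) {
        let cap = self.slots.len() as u64;
        self.slots[(self.write_head % cap) as usize] = value;
        self.write_head += 1;
    }

    /// Drain everything written since the last drain, oldest first.
    /// Intended to run only after the batch's command buffer has completed.
    fn drain(&mut self) -> Vec<T> {
        let cap = self.slots.len() as u64;
        // If the producer lapped the consumer, only the newest `cap` entries survive.
        if self.write_head - self.read_head > cap {
            self.read_head = self.write_head - cap;
        }
        let mut out = Vec::with_capacity((self.write_head - self.read_head) as usize);
        while self.read_head < self.write_head {
            out.push(self.slots[(self.read_head % cap) as usize]);
            self.read_head += 1;
        }
        out
    }
}
```

Since the drain runs strictly after batch completion, the sketch uses no atomics or fences, matching the note above that command buffer completion is the only synchronisation point.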
Consequences
- The batch dispatch model means CPU-side peripheral models see output state with up to `BATCH_SIZE` edges of latency. This is acceptable for all current peripherals; models that need tighter coupling report `is_active() == true`, which forces the batch size to 1.
- The 1M tick schedule cap prevents pathological memory use but rejects exotic clock ratios. A min-heap scheduler (proposed in `docs/plans/multi-clock-and-stimulus-architecture.md` as MC.2) would remove this limit.
- The edges-vs-cycles distinction must be maintained carefully in any code that converts user-facing "cycles" to internal "ticks". The `edges_per_sys_clk_cycle` helper exists for this purpose.
- Pre-allocated schedule buffers consume `O(schedule_len)` Metal buffer pairs at startup. Each schedule entry creates two Metal buffer objects (params + ops). For typical single-clock designs this is 2 entries = 4 buffer objects; for complex multi-clock designs it can reach thousands of entries, but each buffer is small (tens of bytes).
Cross-references
- ADR 0012 — CDC jitter injection (uses the scheduler's edge timestamps as the injection point).
- ADR 0013 — Peripheral model architecture (documents GPU-side model patterns and ring buffers).
- `docs/plans/multi-clock-and-stimulus-architecture.md` — design-space doc for the multi-clock scheduler.
Implementation Plans
Phased implementation plans with entry and exit criteria. Plans live here when the work spans multiple commits and needs an explicit scheduling artefact; once shipped, the plan is kept as a historical record (Status flipped to Implemented) rather than deleted, so the phasing is recoverable later.
For short-lived working memory between sessions, see
../handoff-discipline.md — that lives
in docs/handoffs/ and is deliberately kept separate from the
persistent plans here.
Status legend
- Active — currently being worked on or scheduled.
- Implemented — shipped; kept as historical record.
- Deferred — captured for future work; not currently scheduled.
- Exploratory — architectural thinking captured ahead of demand.
Index
| Plan | Status |
|---|---|
| Post-Phase-0 Roadmap | Active — scheduling doc for ADRs 0007 and 0008 |
| GF180MCU PDK enablement | Mostly implemented — Phases 0–6 shipped; Phase 7 deferred |
| Phase 0: Timing IR and OpenSTA oracle | Implemented — historical record |
| WS2: `opensta-to-ir` | Implemented — historical record |
| WS3: delete SDF parser + interim runtime hook | Implemented — historical record (see ADR 0006 Amendment) |
| WS3 follow-up: re-add `cosim --sdf` via `opensta-to-ir` | Deferred |
| Multi-clock and stimulus architecture | Exploratory — demand-driven |
Reading order for new contributors
If you want to understand how the timing stack got to where it is:
- `phase-0-ir-and-oracle.md` — the umbrella plan, with the five work streams (WS1–WS5).
- `ws2-opensta-to-ir.md` and `ws3-delete-sdf-parser.md` — the per-work-stream detail for the IR producer and the SDF parser removal.
- `post-phase-0-roadmap.md` — what comes next, sequenced against the ADRs.
Adding a new plan
- Filename: short kebab-case (`<topic>.md` or `<ws-or-phase>-<topic>.md`).
- Start with `# Plan — <title>` and a `**Status:**` line.
- Where the plan executes a specific ADR or work stream, name them in a `**Predecessors:**` / `**ADRs:**` block near the top so the dependency graph is explicit.
- Add the row to the table above. When the plan ships, change the status in the file and here in the same commit.
Roadmap — Post-Phase-0 work scheduling
Status: Active. ADR 0008 accepted 2026-05-02. ADR 0007 still pending.
This document orders the work captured in those two ADRs alongside the in-flight tail of Phase 0. It is a scheduling doc, not a design doc — design lives in the ADRs and in docs/timing-model-extensions.md / docs/why-jacquard.md.
Where things stand (2026-05-02)
- Phase 0 (`phase-0-ir-and-oracle.md`): WS1–WS5 + WS2.2 + WS2.4 all landed. WS2.4 multi-corner shipped 2026-05-02 across four commits (`5822343` consumer, `530bb36` builder, `59fde04` producer, plus the integration test). Open items: sky130-based corpus entries (gated on a CI sky130-Liberty install strategy) and peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.
- OpenTimer spike (`spikes/opentimer-sky130.md`): resolved 2026-05-01 — Superseded. Q1 (Liberty parse) passed cleanly on SKY130; Q2 (arrival computation) failed on the canonical OpenSTA-bundled GCD example after eight input-pipeline workarounds (bus ports, OpenROAD-emitted SPEF, modern TCL, tap cells). Per the spike's decision matrix, ADR 0003 is now Superseded (commit `d002bde`). OpenSTA out-of-process is committed as Jacquard's sole STA path — `opensta-to-ir` is the canonical preprocessor; no in-process reference STA is planned. A future ADR may revisit libreda-sta or an in-house walker if an in-process reference is wanted later, but not on this roadmap.
- Pillar B Stages 1+2 (per `adr/0007`): landed. `ClockArrival` IR table + `opensta-to-ir` Tcl emission in commit `c403cc8`; `DFFConstraint.clock_arrival_ps` + skew-aware fold-in in `build_timing_constraint_buffer` in `6767c3e`. Closed Pillar B's main accuracy lever ahead of this roadmap's original Phase 2 schedule.
- ADR 0006 amended 2026-05-02: subprocess invocation of user-installed OpenSTA from the shipped runtime is now permitted (no linking, no bundling). Phase 3 (native Rust SDF→IR) is no longer release-gating — see § Phase 3 below. New release-hardening workstream WS-RH.1 (OpenSTA detection + version check) is required before first release; see § Release hardening.
- ADRs 0007 / 0008: ADR 0008 accepted 2026-05-02; ADR 0007 still pending review.
Phase boundaries
The phase numbering established by Phase 0 and ADR 0006 continues:
| Phase | Topic | Trigger |
|---|---|---|
| 0 | Timing IR + OpenSTA preprocessor | In flight, near close |
| 1 | Structured timing output (ADR 0008 required items) + Phase 0 carryover | ADR 0008 accepted ✓ |
| 2 | Timing model fidelity Pillar C Tier 1 + Pillar B Stage 3 if needed (ADR 0007) | Phase 1 lands; ADR 0007 accepted |
| RH | Release hardening (OpenSTA detection + version check, see § Release hardening) | WS-RH.1 shipped ✓; no other items currently scoped |
| 3 | Native Rust SDF→IR parser (ADR 0006) | Deferred indefinitely — no longer release-gating per amended ADR 0006. Picks up when bandwidth allows or commercial demand appears. |
| 4+ | Pillar A Stage 1 (static IDM); Pillar C Tier 2; ADR 0008 optional outputs | Demand-driven; not committed |
Parked (require new ADR to revive): in-process reference STA (ADR 0003 superseded), Pillar A Stage 2 (dynamic δ(T)), Pillar A Stage 3 (sub-cycle ticks), NoC-aware partitioning hints (Pillar C Tier 3).
Phase 1 — Structured timing output and Phase 0 wrap-up
Entry criteria:
- ADR 0008 accepted.
- Phase 0 exit criteria met (per
phase-0-ir-and-oracle.md).
OpenTimer integration was originally Phase 1's centrepiece (former WS-P1.1) but was retired when the spike Superseded ADR 0003. With OpenSTA-out-of-process as the sole STA path, Phase 1 is now anchored on user-visible output rather than a second STA tool.
Workstreams (parallel where independent):
WS-P1.1 — Structured timing output (ADR 0008 required items)
The four required items from ADR 0008. Single workstream because they share infrastructure.
- WS-P1.1.a — Symbolic violation messages. Shipped 2026-05-02 in commit `0432d9a`. New `WordSymbolMap` in `src/flatten.rs` built once at sim startup; `process_events` gained an optional resolver closure; `sim_metal` threads it through. Setup/hold violation messages now name DFFs as `top/cpu/regs[7][bit 22] [word=42]` instead of bare `word 42`. CUDA/HIP sim paths don't currently route runtime violations through `process_events` (separate plumbing gap, not blocked on this format change).
- WS-P1.1.b — `--timing-report <path.json>`. Shipped 2026-05-02 in commit `58a7a04`. New `src/timing_report.rs` module with serde-derived `TimingReport` (schema_version 1.0.0); `process_events` takes a `ReportingCtx` bundling the optional resolver + violation observer (signature back to 5 args); `sim_metal` builds the report end-to-end. Sample fixture at `tests/timing_ir/sample_reports/two_violations.json`; schema documented in `docs/timing-violations.md`. WS-P1.1.d's worst-slack ranking is included (top-10 per kind from violation events). Caveats: closest-to-violation tracking in non-violating runs needs GPU near-miss instrumentation (deferred); violations array is unbounded (opt-in cap is the natural follow-up); CUDA/HIP/cosim paths don't route runtime violations through `process_events` yet.
- WS-P1.1.c — `--timing-summary` text output. Shipped 2026-05-02 in commit `44e70a0`. New `TimingReport::format_summary()` formatter; `--timing-summary` CLI flag; `TimingReportConfig` refactored to support either / both / neither output. Text writes to stdout. Deferred from ADR 0008's wishlist: "corner" (metadata struct doesn't carry it yet) and "margin percentage" (derivable from existing fields). Both are documented in code as known gaps.
- WS-P1.1.d — Per-DFF worst-slack ranking. Partially shipped in `58a7a04` alongside WS-P1.1.b: top-10 per kind from observed violation events. Remaining: closest-to-violation tracking when no violation occurred — needs GPU near-miss instrumentation, deferred to a separate workstream.
Total ~2 weeks.
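For orientation, a per-DFF worst-slack ranking of the kind WS-P1.1.d describes could be computed as in the sketch below. The types are illustrative only: `ViolationEvent` and its fields are assumptions and do not reflect the `TimingReport` schema, and slack is assumed to be negative for violations so that "worst" means "smallest".

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ViolationKind {
    Setup,
    Hold,
}

/// Hypothetical event record; not the shipped report schema.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ViolationEvent {
    kind: ViolationKind,
    dff: String,   // symbolic DFF name
    slack_ps: i64, // assumed negative when violated
}

/// Worst slack per DFF for one violation kind, most negative first,
/// truncated to `top_n` entries (the roadmap uses top-10 per kind).
fn worst_slack_ranking(
    events: &[ViolationEvent],
    kind: ViolationKind,
    top_n: usize,
) -> Vec<(String, i64)> {
    let mut worst: HashMap<&str, i64> = HashMap::new();
    for ev in events.iter().filter(|e| e.kind == kind) {
        worst
            .entry(ev.dff.as_str())
            .and_modify(|s| *s = (*s).min(ev.slack_ps))
            .or_insert(ev.slack_ps);
    }
    let mut ranked: Vec<(String, i64)> =
        worst.into_iter().map(|(d, s)| (d.to_string(), s)).collect();
    // Sort by slack, then by name so ties are deterministic.
    ranked.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(top_n);
    ranked
}
```

Because the ranking is built from observed violation events only, a run with no violations yields an empty list, which is the gap the near-miss instrumentation would close.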
WS-P1.2 — Phase 0 follow-ups (carryover)
Tail of Phase 0 work that didn't gate WS3 completion. Listed for completeness.
- WS2.4: multi-corner CLI flag in `opensta-to-ir`. Shipped 2026-05-02 (commits `5822343` / `530bb36` / `59fde04`).
- WS4: corpus + runner + regen helper + CI hookup shipped 2026-05-02 with the seed entry `aigpdk_dff_chain` (covers all four IR record types). One follow-up: add sky130-based corpus entries (inv_chain_pnr, mcu_soc subset) once a CI sky130-Liberty install strategy is decided.
- Peripheral wiring for I²C/SPI when a fuller mcu_soc fixture lands.
(WS5 — parser-success assertions on the Liberty parser path and on opensta-to-ir — was already shipped; see phase-0-ir-and-oracle.md § WS5.)
These are not gated by any new ADR; pick them up as bandwidth allows.
Exit criteria for Phase 1:
- ✅ Symbolic violation messages live; old state-word-index format gone (commit `0432d9a`).
- ✅ `--timing-report` JSON shipping; sample fixture at `tests/timing_ir/sample_reports/two_violations.json` (commit `58a7a04`).
- ✅ `--timing-summary` available (commit `44e70a0`).
- ✅ Worst-slack ranking included in both report and summary (top-10 from violations; non-violating-run tracking still requires GPU near-miss instrumentation, separate workstream).
- ✅ `why-jacquard.md` updated; old "Output interface" section now describes the shipped surface, "Still on the wishlist" carries the deferred items.
Phase 1 closed. Phase 2 entry now blocked only on ADR 0007 acceptance.
Phase 2 — Timing model fidelity
Entry criteria:
- Phase 1 exit criteria met.
- ADR 0007 accepted.
Pillar B Stages 1 and 2 (per-DFF clock arrival in the IR + setup/hold fold-in) landed early, in commits c403cc8 and 6767c3e — directly on top of the OpenSTA-out-of-process producer rather than the OpenTimer integration originally planned. Phase 2 is therefore anchored on Pillar C Tier 1 (per-receiver wire delay), with Pillar B Stage 3 only if measurement justifies it.
Workstreams (parallel where independent):
WS-P2.1 — Pillar C Tier 1: Per-receiver wire delay (ADR 0007)
Key wire delay per (src_aigpin, dst_aigpin) edge.
- WS-P2.1.a — Edge-attributed wire delay. Rewrite of `src/flatten.rs:1850-1872` to key wire delay per fanout; fold into source-side gate_delay per fanout target. ~3–5 days.
- WS-P2.1.b — Rise/fall preservation. Carry per-edge rise/fall through the consumer; honour both in `PackedDelay` accumulation. ~1–2 days, after WS-P2.1.a.
- WS-P2.1.c — Validation. Long-route corpus addition; tolerance ≤±3% on long-wire paths.
Total ~1 week.
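As a rough picture of what "keyed per `(src_aigpin, dst_aigpin)` edge" means, the sketch below stores a rise/fall wire delay per fanout edge and folds it into the source gate delay for one target. Everything here is hypothetical: `AigPin`, `RiseFall`, `WireDelays` and the zero-delay default for unknown edges are assumptions, and the real `PackedDelay` accumulation is not modelled.

```rust
use std::collections::HashMap;

/// Hypothetical pin identifier.
type AigPin = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct RiseFall {
    rise_ps: u32,
    fall_ps: u32,
}

/// Wire delay keyed per fanout edge instead of per source pin.
#[derive(Default)]
struct WireDelays {
    per_edge: HashMap<(AigPin, AigPin), RiseFall>,
}

impl WireDelays {
    fn set(&mut self, src: AigPin, dst: AigPin, delay: RiseFall) {
        self.per_edge.insert((src, dst), delay);
    }

    /// Gate delay with the wire delay to one specific receiver folded in.
    /// Edges without an entry are assumed to add no wire delay.
    fn folded_delay(&self, src: AigPin, dst: AigPin, gate_delay: RiseFall) -> RiseFall {
        let wire = self.per_edge.get(&(src, dst)).copied().unwrap_or_default();
        RiseFall {
            rise_ps: gate_delay.rise_ps + wire.rise_ps,
            fall_ps: gate_delay.fall_ps + wire.fall_ps,
        }
    }
}
```

Two receivers of the same driver can then see different total delays, and rise and fall stay separate through the fold.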
WS-P2.2 — Pillar B Stage 3: Bucketed per-DFF constraint packing (conditional)
Stages 1+2 collapsed all DFFs in a 32-bit state word to min(setup), min(hold) after folding the per-DFF clock arrival in. For most current designs the per-word collapse pessimism is small relative to clock period; for designs running close to the period boundary, splitting each word into clock-arrival buckets eliminates the collapse loss without disturbing the partitioner. See Stage 3 in docs/timing-model-extensions.md Part B.
Land only if Stage 1+2 measurement on a representative design shows the per-word collapse materially over-reports violations; otherwise defer indefinitely. Effort if pursued: ~3–5 days, touches src/flatten.rs:1722-1761 and the kernel's constraint indexing.
WS-P2.3 — Output adjustments for fidelity work
Small touch-ups to ensure Phase 1 outputs continue to work as model fidelity changes. JSON report fields, summary metrics, etc. Folded into WS-P2.1 / WS-P2.2 PRs as needed.
Exit criteria for Phase 2:
- Per-receiver wire delay landed; long-route paths reported within ±3% of CVC.
- `timing-model-extensions.md` Parts B and C marked Implemented with cross-references to landed code (Part B already updated post-Stage-1+2).
- `timing-validation.md` updated with per-pillar tolerances.
- No regression on existing corpus.
Phase 3 — Native Rust SDF→IR parser
Deferred indefinitely as of 2026-05-02 per amended ADR 0006. No longer release-gating: shipped Jacquard binaries may subprocess user-installed OpenSTA via opensta-to-ir, provided OpenSTA is not bundled and not linked. The user-facing capability gap is "OpenSTA must be on PATH for jacquard sim input.sdf," surfaced by WS-RH.1 below with a clear error message.
Reasons to revive:
- A downstream commercial integrator's legal team rejects subprocess-of-GPL-tool even with no bundling/linking.
- OpenSTA dialect coverage gaps appear that are easier to fix in our own parser than via `opensta-to-ir` post-processing.
- Bandwidth opens up and the team wants the zero-runtime-dependency story for its own ergonomics.
Effort estimate (unchanged from the original ADR 0006 framing): grammar-based (nom / pest), validated against OpenSTA on the WS4 corpus per ADR 0001. Probably 2–3 weeks of focused work. Not scheduled.
Release hardening
Pre-first-release work that became necessary when ADR 0006 § Amendment relaxed the no-runtime-subprocess rule. These are blockers for first release, not for any specific Phase.
WS-RH.1 — OpenSTA detection + version check
Status: Shipped 2026-05-02 in commit c9c393b. All scope items below are landed; this entry is preserved as a brief reference. Test coverage: 9 unit tests for the version parser + 6 integration tests for the locator across the missing / too-old / newer-than-tested / unparseable / failing-probe paths.
Why: With the shipped runtime now allowed to subprocess opensta-to-ir, a user invoking jacquard sim input.sdf on a machine without OpenSTA — or with an untested OpenSTA version — must get an actionable error rather than silent timing-data loss. Pre-WS-RH.1, missing OpenSTA only emitted a warn! and the simulation proceeded with no timing information loaded. That was acceptable during development but shipped as a UX bug.
Scope:
- Promote missing-OpenSTA from warning to hard error when `--sdf` is provided. Today's silent-fallback behaviour is fine for `--liberty`-only runs but wrong when SDF was explicitly requested. Error message must name the env var (`JACQUARD_OPENSTA_BIN`), the PATH lookup, and link to install instructions. ~0.5 day.
- Pin a tested OpenSTA version range. Record the version we test against in `vendor/opensta/` (already pinned via submodule per ADR 0005) and surface that as a `MIN_TESTED_OPENSTA_VERSION` / `MAX_TESTED_OPENSTA_VERSION` constant in `crates/opensta-to-ir/src/opensta.rs`. Need to choose a version-detection mechanism — OpenSTA's `-version` flag output format is the obvious target; check whether it's stable across the versions we care about. ~0.5 day.
- Version probe at first invocation. On first call to `find_opensta()` per process, run `<binary> -version`, parse the version, and:
  - If older than min-tested → hard error with remediation message ("rebuild via `scripts/build-opensta.sh` or upgrade your system OpenSTA").
  - If newer than max-tested → warn but proceed ("untested OpenSTA version vN.M; please report any timing discrepancies").
  - Cache the result for the rest of the process. ~1 day.
- Document the dependency in `docs/usage.md`. Single section: required tooling, install paths, version range, what `jacquard sim` does and doesn't need OpenSTA for. ~0.5 day.
- Test coverage: unit tests for the version-string parser (with sample `-version` outputs from the pinned version and a synthetic too-old version); an integration test that points `JACQUARD_OPENSTA_BIN` at a stub script and confirms the error path. ~0.5 day.
- Stale-framing cleanup (folded in here per 2026-05-02 decision rather than spun out separately):
  - Reword `INTERIM per ADR 0006` / `Pre-release only` markers in source: `src/sim/setup.rs:176,228,286`, `src/bin/jacquard.rs:187`, `src/sim/cosim_metal.rs:2053`, `src/testbench.rs:255-257`. Replace with "subprocess wrapper per ADR 0006 § Amendment" or similar — these paths are no longer interim.
  - Update `docs/plans/phase-0-ir-and-oracle.md` lines 152, 161, 172 — drop "tagged for pre-release removal" framing; the subprocess wrapper is now the shipping mechanism, not a temporary bridge.
  - Audit `docs/plans/ws3-delete-sdf-parser.md` for the same stale framing and update.
  - ~0.5 day total for the cleanup.
Total: ~3.5 days. Single PR, owned by whoever picks up release prep.
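The probe-classify-cache flow in the scope list can be sketched as below. Several things are assumptions and should not be read as the shipped code: the version bounds are placeholders, the parser assumes the first whitespace-separated token of the `-version` output is a dotted `major.minor[.patch]` string (the actual output format is the open question in this section), and the probe is injected as a closure so the sketch runs without an OpenSTA binary.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

// Placeholder bounds for illustration only; not the project's real pins.
const MIN_TESTED_OPENSTA_VERSION: Version = Version { major: 2, minor: 4, patch: 0 };
const MAX_TESTED_OPENSTA_VERSION: Version = Version { major: 2, minor: 6, patch: 0 };

#[derive(Debug, Clone, PartialEq, Eq)]
enum VersionCheck {
    Tested(Version),
    NewerThanTested(Version), // warn but proceed
    TooOld(Version),          // hard error
    Unparseable(String),
}

/// Assumes the first token looks like `major.minor` or `major.minor.patch`.
fn parse_version(output: &str) -> Option<Version> {
    let token = output.split_whitespace().next()?;
    let mut nums = token.split('.').map(|p| p.parse::<u32>());
    let major = nums.next()?.ok()?;
    let minor = nums.next()?.ok()?;
    let patch = match nums.next() {
        Some(p) => p.ok()?,
        None => 0,
    };
    Some(Version { major, minor, patch })
}

fn classify(output: &str) -> VersionCheck {
    match parse_version(output) {
        None => VersionCheck::Unparseable(output.trim().to_string()),
        Some(v) if v < MIN_TESTED_OPENSTA_VERSION => VersionCheck::TooOld(v),
        Some(v) if v > MAX_TESTED_OPENSTA_VERSION => VersionCheck::NewerThanTested(v),
        Some(v) => VersionCheck::Tested(v),
    }
}

/// Runs `probe` at most once per process and caches the classification.
fn cached_check(probe: impl FnOnce() -> String) -> &'static VersionCheck {
    static CACHE: OnceLock<VersionCheck> = OnceLock::new();
    CACHE.get_or_init(|| classify(&probe()))
}
```

In a real integration the closure would run `<binary> -version` and return its stdout; later calls then reuse the cached result without spawning another process.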
Open question: does OpenSTA emit a stable -version string, or do we need to scrape git describe from a build-time-recorded commit? If -version is unreliable, fall back to recording the submodule commit at crates/opensta-to-ir build time and comparing — this is cheaper than version-string sniffing and avoids the "user has a custom build" problem.
Phase 4+ — Demand-driven
Items below land when (a) a real use case appears that demands them, and (b) bandwidth is available. Each gets its own ADR amendment / new ADR before scheduling, since the cost is non-trivial.
Pillar A Stage 1 (static IDM)
Cheapest δ(T) entry point. Lands only after Pillars B and C confirm the wire/skew baseline is correct — characterisation work done before that risks chasing wire-delay error masquerading as δ(T) error.
Effort: 1–2 day spike to validate value, then ~1 week implementation, plus per-cell SPICE characterisation effort (long-pole risk).
Pillar C Tier 2 (inter-partition wire delay)
Required for many-core/NoC designs at advanced processes. Lands when a representative such design appears in the test corpus and Tier 1 measurement shows it is needed.
Effort: ~2–3 weeks, touches src/sim/cosim_metal.rs shuffle pipeline.
ADR 0008 optional outputs
Items 5–7 from ADR 0008: arrival histograms, STA cross-reference, path-back-trace. Demand-driven prioritisation.
Pillar C Tier 3 (NoC-aware partitioning hints)
Optional optimisation that makes Tier 2 cheap on tile-decomposed designs. Lands only if Tier 2 lands and partitioning quality on tile designs proves measurably suboptimal.
Risks and walk-back
- Pillar measurement shows smaller-than-expected gain. Each pillar's later stages are deferred or abandoned per ADR 0007's walk-back clause. Pillar B Stage 3 is explicitly conditional on this signal.
- JSON report schema design wastes time in bikeshedding. Mitigation: ship v1 quickly, additive-only changes thereafter, breaking changes require explicit ADR-level decision.
- OpenSTA upstream regressions. With OpenSTA as the sole STA path, an upstream behaviour change reaches us through `opensta-to-ir`'s output. Mitigation: pin OpenSTA in CI (per ADR 0001) and rely on the regression corpus to surface drift.
- CRPR pessimism on tight designs. Stage 1+2 fold-in treats launch=0; a design with very heterogeneous launch arrivals will see pessimism on paths whose launch DFF also has a long clock path. Stage 3 is the lever if this matters; otherwise live with it.
Cross-references
- `../adr/0007-timing-model-fidelity-roadmap.md` — Pillar definitions for Phase 2.
- `../adr/0008-structured-timing-output.md` — Output items for Phase 1.
- `../adr/0001-opensta-as-oracle.md` — OpenSTA out-of-process commitment (post-ADR-0003 supersedure).
- `../adr/0003-opentimer-primary-sta.md` — Superseded. Spike fail outcome documented in `../spikes/opentimer-sky130.md`.
- `../adr/0006-sdf-preprocessing-model.md` — Phase 3.
- `../why-jacquard.md` — User-facing positioning that this roadmap delivers.
- `../timing-model-extensions.md` — Technical analysis underlying ADR 0007.
- `../timing-validation.md` — Validation tolerances each phase updates.
- `phase-0-ir-and-oracle.md` — Predecessor roadmap (current Phase 0 status lives there per workstream).
- `../spikes/opentimer-sky130.md` — Spike outcome (Superseded).
Plan — GF180MCU PDK enablement (full sim path)
Status: Phases 0–6 shipped (2026-05-12 / 13). Phase 7
(wafer.space test-run-1 design integration) deferred pending design
availability. Subsequent follow-ups also landed (2026-05-14):
IO pad behavioural decomposition (__in_c, __in_s, __bi_24t,
plus filler classification for the wafer.space gf180mcu_ws_*
families) and bidir A/OE observability surfacing as
<port>__out / <port>__oe extra primary outputs — see commits
aa312b8, c23d583, 207cc80. These extended GF180MCU support
from "synthesized-core-only" to "full chip_top including pad ring",
validated end-to-end on a 227k-cell wafer.space chess chip_top
netlist. This document is now a recap of what landed; the
forward-looking deferred items are in § Follow-on cleanup at the
bottom.
Predecessors:
- SKY130 enablement (reference recipe in `docs/adding-a-pdk.md`).
- Multi-corner Liberty plumbing — WS2.4 + the sky130 multi-corner integration test (`crates/opensta-to-ir/tests/opensta_integration.rs`), shipped 2026-05-12.
ADRs: None new shipped. docs/adding-a-pdk.md is the canonical
integration-points checklist; this plan applied that recipe to
GF180MCU with both 7-track (gf180mcu_fd_sc_mcu7t5v0) and 9-track
(gf180mcu_fd_sc_mcu9t5v0) standard-cell libraries.
Goal (as shipped)
GF180MCU is now at the same support tier as SKY130:
- Timing path — `opensta-to-ir` accepts GF180MCU Liberty files and emits IR; the multi-corner integration test at `crates/opensta-to-ir/tests/opensta_integration.rs::gf180mcu_multi_corner_emits_per_corner_values` asserts per-corner setup/hold values differ correctly across tt/ss/ff PVT corners.
- Simulation path — `jacquard sim` runs gate-level GF180MCU netlists on the GPU. Cell-type detection, pin direction tables, sequential/tie/multi-output classification, behavioural model parsing (with UDP support for sequential elements), and AIG decomposition are all wired through `AIG::from_netlistdb`.
- Validation — synthetic DFF+inverter fixture at `tests/timing_test/gf180mcu_timing/`. Real wafer.space test-run-1 design integration is deferred (Phase 7, gated on design availability).
End state mirrors today's SKY130 support: CellLibrary::GF180MCU
detected, decomposed to AIG, simulated on Metal/CUDA/HIP, with a
golden-IR corpus entry covering the timing-IR side.
Why now
GF180MCU support was a release prerequisite per session 2026-05-12. The wafer.space ecosystem (https://github.com/wafer-space/gf180mcu) is the near-term commercial demand driver; the upstream google/gf180mcu-pdk is the canonical PDK that the wafer.space variant builds on.
Decisions (frozen 2026-05-12 session)
- One enum variant for GF180MCU. `CellLibrary::GF180MCU` covers both 7t5v0 and 9t5v0 prefixes. Matches the SKY130 precedent (`CellLibrary::SKY130` covers seven prefixes).
- Both 7t and 9t fully supported. Unlike SKY130 (only hd is decomposed), both GF180MCU standard-cell variants are first-class for cell detection, pin direction, classification, and AIG decomposition. Cell models for 7t and 9t are byte-identical per cell type (verified at build time in `build.rs`); decomposition reads from the 7t submodule and reuses for 9t.
- Two separate submodules for vendoring cell models, mirroring the per-library SKY130 split:
  - `vendor/gf180mcu_fd_sc_mcu7t5v0/`
  - `vendor/gf180mcu_fd_sc_mcu9t5v0/`
- Install path: `volare` pinned hash under `[tool.jacquard.pdks.gf180mcu]` in `pyproject.toml` alongside the existing sky130 entry. Variant: `gf180mcuC`.
- Reset polarity: GF180MCU uses active-low resets/sets (pin names `RN`, `SETN`) — same AIG formula shape as SKY130's `RESET_B` / `SET_B`. The "n" prefix in cell names like `dffnq` / `dffnrnq` / `icgtn` indicates a negative-edge clock (pin `CLKN`), not reset polarity (resolving Open Q3 from the original plan).
Shipped phases
Phase 0 — Foundations (commit 6ae3e54)
- `pyproject.toml`: `[tool.jacquard.pdks.gf180mcu]` with `volare_hash = "559a117b163cef2f920f33f30f6f690aa0b47e4c"`, variant `gf180mcuC`, separate `default_lib_subdir_7t` / `default_lib_subdir_9t` paths.
- Vendored submodules at `vendor/gf180mcu_fd_sc_mcu7t5v0/` and `vendor/gf180mcu_fd_sc_mcu9t5v0/`.
- Skeleton `src/gf180mcu.rs` + `src/gf180mcu_pdk.rs` declared in `src/lib.rs`.
Phase 1 — Library detection + cell-type extraction (commit 858dd70)
- `is_gf180mcu_cell(name) -> bool` matching both 7t5v0 and 9t5v0 prefixes.
- `extract_cell_type(name)` strips prefix + drive suffix.
- `CellLibrary::GF180MCU` enum value added; `detect_library()` / `detect_library_from_file()` extended; `Mixed` enforcement upgraded to three known libraries.
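A minimal sketch of the two helpers, based on the cell naming visible in this plan (`gf180mcu_fd_sc_mcu7t5v0__dffq_1`: library prefix, double underscore, cell type, `_<drive>` suffix). The return type of `extract_cell_type` and the digits-only drive-suffix rule are assumptions; the shipped functions may differ.

```rust
/// Library prefixes for the two supported standard-cell variants.
const GF180MCU_PREFIXES: [&str; 2] = [
    "gf180mcu_fd_sc_mcu7t5v0__",
    "gf180mcu_fd_sc_mcu9t5v0__",
];

/// True when `name` starts with either the 7t5v0 or the 9t5v0 prefix.
fn is_gf180mcu_cell(name: &str) -> bool {
    GF180MCU_PREFIXES.iter().any(|p| name.starts_with(*p))
}

/// Strips the library prefix and a trailing `_<digits>` drive suffix.
/// Returns `None` for names that are not GF180MCU cells.
fn extract_cell_type(name: &str) -> Option<&str> {
    let rest = GF180MCU_PREFIXES
        .iter()
        .find_map(|p| name.strip_prefix(*p))?;
    match rest.rsplit_once('_') {
        // Assumption: the drive suffix is all ASCII digits, e.g. `_1`.
        Some((base, drive))
            if !base.is_empty()
                && !drive.is_empty()
                && drive.bytes().all(|b| b.is_ascii_digit()) =>
        {
            Some(base)
        }
        _ => Some(rest),
    }
}
```

Both variants map to the same cell type string, which is what lets a single `CellLibrary::GF180MCU` value cover them.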
Phase 2 — Pin direction provider (commit e97e2d2)
- `GF180MCULeafPins` implementing `LeafPinProvider`.
- Generation strategy: build-time via `build.rs::generate_gf180mcu_pin_table`, which scans `vendor/gf180mcu_fd_sc_mcu{7,9}t5v0/cells/`, parses `.functional.v`, cross-asserts 7t/9t pin layouts match, emits `$OUT_DIR/gf180mcu_pins.rs`. New precedent vs SKY130's hand-rolled match arms (see § Follow-on cleanup item 1).
- Round-trip test instantiating every cell.
Phase 3 — Cell classification (commit 6969b90)
- Sequential / tie / filler / delay-cell whitelists in `src/gf180mcu_pdk.rs` derived from behavioural models.
- Unit tests asserting classification across the union of 7t5v0 and 9t5v0 cell catalogues.
Phase 4 — Combinational AIG decomposition
Sequenced as four commits:
- `92bb665` — Phase 4 recon: confirmed SKY130 behavioural parser is PDK-neutral; identified shared infrastructure that `gf180mcu_pdk` could reuse.
- `02da077` — Phase 4 prep: introduced the PDK-neutral `src/pdk_decomp.rs` re-export module; exposed `WireVal`, `GATE_MARKER`, `build_chain_gate`, `build_xor_chain`, `finalize_decomp_result` as `pub(crate)`.
- `32fb3b9` — Phase 4 (combinational): `decompose_combinational` for GF180MCU + boolean equivalence test suite vs the vendored PDK models.
- `d898343` — Phase 4 (aig.rs integration): wired combinational decomposition through `AIG::from_netlistdb`, end-to-end sim path for combinational GF180MCU netlists.
Phase 4b — Sequential cells (UDPs)
- `a7c0618` — Phase 4b prep: UDP loader for `gf180mcu_pdk` (parses `UDP_GF018hv5v_mcu_sc7_TT_1P8V_25C_verilog_nonpg_*_FF_UDP` and friends from the vendored PDK).
- `459317e` — Phase 4b: AIG hooks for sequential cells (DFFs, latches, scan-DFFs, clock-gating cells `icgtp` / `icgtn`). `gf180mcu_preprocess` pre-creates DFF Q pins; `gf180mcu_postprocess` applies async set/reset using the active-low RN/SETN convention via the same AIG formula as SKY130. Negative-edge clock cells use `CLKN` instead of `CLK` (handled in `trace_clock_pin`).
- `3006f59` — Phase 4b boolean-equivalence tests covering DFF, latch, scan-DFF, and clock-gating cells via multi-step truth-table evaluation.
Phase 5 — CLI / pipeline wiring audit (commit 57244d5)
Audit-only — no per-PDK branch was missing GF180MCU handling. The
auto-detection in AIG::from_netlistdb already covers every CLI
surface (sim / cosim / dump-paths all route through
setup::load_design). Cleanup: stale Phase 4b panic comments in
src/sim/setup.rs and src/aig.rs; field doc comments on CLI
arguments refreshed to mention GF180MCU alongside AIGPDK / SKY130.
Phase 6 — Validation fixture + multi-corner test
- Fixture (commit `4a7ee0e`): `tests/timing_test/gf180mcu_timing/` mirroring `sky130_timing/` 1:1. Synthetic `inv_chain.v` (DFF + 16-inverter chain + DFF) with `gf180mcu_fd_sc_mcu7t5v0__{dffq,inv}_1` cells, Liberty-only SDF generator, CVC testbench, sample stimulus, Makefile, README.
- Integration test: `gf180mcu_multi_corner_emits_per_corner_values` in `crates/opensta-to-ir/tests/opensta_integration.rs`. Loads three real PVT corners (typ=tt_025C_5v00, slow=ss_125C_4v50, fast=ff_n40C_5v50) at the 5.0 V operating point and asserts per-corner setup TimingValues differ correctly across PVT. Skips gracefully when the volare-installed PDK isn't present (gated on `find_gf180mcu_lib_dir()` returning `Some`; matches the sky130 test's skip pattern). `$GF180MCU_LIBERTY_DIR` overrides the volare default path.
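The skip pattern can be illustrated with a lookup that takes the override and the default as parameters, which keeps it testable without touching the process environment. This is a sketch only: `find_gf180mcu_lib_dir_with` is a hypothetical variant of `find_gf180mcu_lib_dir()`, the real default path under the volare install is not shown here, and the choice to return `None` when an override points at a missing directory is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Resolve the Liberty directory: a non-empty override (the value of
/// `$GF180MCU_LIBERTY_DIR`) wins over the volare default. Returns `None`
/// when the chosen directory does not exist, so callers can skip.
fn find_gf180mcu_lib_dir_with(
    env_override: Option<&str>,
    volare_default: &Path,
) -> Option<PathBuf> {
    let candidate = match env_override {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => volare_default.to_path_buf(),
    };
    candidate.is_dir().then_some(candidate)
}

// Typical use inside a test body (illustrative):
//
//     let over = std::env::var("GF180MCU_LIBERTY_DIR").ok();
//     let Some(lib_dir) = find_gf180mcu_lib_dir_with(over.as_deref(), default_dir) else {
//         eprintln!("GF180MCU PDK not installed; skipping");
//         return;
//     };
```

Skipping rather than failing keeps the test suite green on machines without the PDK, at the cost of the CI coverage gap noted under § Follow-on cleanup.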
Phase 7 — wafer.space test-run-1 design (deferred)
Gated on design availability. Scope:
- Vendor or pull a wafer.space test-run-1 gate-level netlist into the `tests/timing_test/` or `designs/` tree.
- End-to-end pipeline: synth + PnR (or consume post-PnR output), opensta-to-ir, jacquard sim with Metal backend, golden-output VCD comparison.
- Promote to a corpus entry once stable.
Test inventory
Counts after Phase 6:
- `cargo test --lib`: 212 passing (up from 166 at plan start).
- `cargo test --lib gf180mcu`: 45 passing (combinational + sequential equivalence + classification + detection + AIG-build).
- `cargo test -p opensta-to-ir multi_corner`: 2 passing (sky130 + gf180mcu), each gated on its respective volare PDK install.
Follow-on cleanup
These are nice-to-have refactors flagged during the GF180MCU work but deliberately out of scope for the enablement effort itself.
Update 2026-05-19: Items 1, 2, and 4 are now subsumed by
ADR 0010 — Declarative cell metadata
and its companion plan declarative-cell-metadata.md. The manifest
pathway converts these from "Rust refactor" projects into "move data
out of code as part of the migration to manifest-as-source-of-truth"
— happens once, gets all three at once.
1. build.rs pin-table generator for SKY130 too. Subsumed by ADR 0010 § "Deferred to a future ADR — build.rs pin-table scanner removal." Removed last in the manifest migration, after manifests cover the built-in PDKs.
2. Physical relocation of shared PDK decomp infrastructure out of sky130_pdk.rs into pdk_decomp.rs. Still relevant for the built-in (Rust-decomp) pathway, since ADR 0010 keeps that path load-bearing for cells with real AIG decomposition rules. Move when a third PDK exercises the surface.
3. CellLibrary enum location. Currently lives in src/sky130.rs even though it represents all PDKs. Moving to a neutral home (src/pdk.rs or src/lib.rs) is a trivial mechanical refactor. Independent of ADR 0010.
4. IO and PR libraries. Now solved by the ADR 0010 manifest pathway. gf180mcu_fd_io and gf180mcu_fd_pr cells can be declared via kind = "io_pad_*" / kind = "filler" / kind = "tap" etc. in user-supplied manifests; no Jacquard PR needed.
5. CI install strategy for GF180MCU Liberty. Both the sky130 and gf180mcu multi-corner tests currently skip when the PDK isn't installed locally. CI integration (volare-on-CI or a vendored minimal Liberty subset) is the same blocker that gates the inv_chain_pnr sky130 corpus entry, and is out of scope for the GF180 enablement effort itself. Unrelated to ADR 0010.
Pitfalls (PDK-specific, for future readers)
- Reset polarity: GF180MCU is active-low (RN / SETN); same AIG formula as SKY130's RESET_B / SET_B.
- Negative-edge clocks: cells like dffnq / dffnrnq / icgtn use pin name CLKN instead of CLK. The "n" prefix is a clock marker, not a reset-polarity marker.
- Power pins: GF180MCU operates at 5 V nominal (vs SKY130's 1.8 V). Both follow VDD/VSS naming. Corner names follow the tt_025C_5v00 shape and parse cleanly through the generic TimingLibrary loader.
- Cell pin names differ from SKY130: inverter is I / ZN (not A / Y); DFF is CLK / D / Q / notifier. The notifier port wires the UDP delay-model wrapper but is unused for logic simulation.
- Cell-name collisions between 7t5v0 and 9t5v0: both have nand2_1 etc. Detection keys on the full prefix, not the base type. Auto-handled by is_gf180mcu_cell.
- Drive-strength suffixes: GF180MCU uses integer multipliers (inv_1, inv_2, inv_4, …) matching the SKY130 convention.
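To make the "key on the full prefix" and "integer drive-strength suffix" pitfalls concrete, here is a minimal sketch of splitting a cell name into library prefix, base type, and drive strength. It is not the body of `is_gf180mcu_cell`; `split_gf180mcu_cell` is a hypothetical helper, and the 9t5v0 prefix is assumed by analogy with the 7t5v0 prefix quoted in the fixture description.

```rust
/// Library prefixes used for detection. The 7t5v0 prefix appears in the
/// fixture description; the 9t5v0 one is assumed to follow the same shape.
const GF180MCU_PREFIXES: [&str; 2] = [
    "gf180mcu_fd_sc_mcu7t5v0__",
    "gf180mcu_fd_sc_mcu9t5v0__",
];

/// Hypothetical helper: split `<prefix><base>_<drive>` into its parts.
/// Returns `None` for names from other libraries or without a numeric
/// drive-strength suffix.
pub fn split_gf180mcu_cell(name: &str) -> Option<(&'static str, &str, u32)> {
    for prefix in GF180MCU_PREFIXES {
        if let Some(rest) = name.strip_prefix(prefix) {
            // The drive strength is the integer after the last underscore.
            let (base, drive) = rest.rsplit_once('_')?;
            let drive = drive.parse::<u32>().ok()?;
            return Some((prefix, base, drive));
        }
    }
    None
}
```

Because the prefix is returned alongside the base type, `nand2_1` from the 7-track and 9-track libraries stays distinguishable even though the base names collide.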
Links
- docs/adding-a-pdk.md: canonical PDK integration recipe.
- src/sky130.rs, src/sky130_pdk.rs: SKY130 reference implementation.
- src/gf180mcu.rs, src/gf180mcu_pdk.rs: GF180MCU implementation.
- crates/opensta-to-ir/tests/opensta_integration.rs::{sky130,gf180mcu}_multi_corner_emits_per_corner_values: timing-side validation.
- tests/timing_test/{sky130,gf180mcu}_timing/: synthetic fixtures.
- pyproject.toml::[tool.jacquard.pdks.{sky130,gf180mcu}]: install pins.
- Upstream PDK: https://github.com/google/gf180mcu-pdk
- wafer.space variant: https://github.com/wafer-space/gf180mcu
Plan — Phase 0: Timing IR and OpenSTA oracle
Status: Implemented — historical record. All five work streams
(WS1 schema, WS2 opensta-to-ir producer, WS3 SDF parser deletion +
interim runtime hook, WS4 diff harness + CI, WS5 parser-success
assertions) shipped through 2026-05-02. All eight exit criteria are
met. Ongoing scheduling for timing-model fidelity work has moved to
post-phase-0-roadmap.md. The per-WS detail and embedded status
markers below are preserved for the implementation record.
Goal
Deliver the minimum viable infrastructure to enforce Jacquard's timing correctness contract:
- A stable timing intermediate representation (IR) for SDF-equivalent annotations.
- An OpenSTA-driven subprocess converter that produces IR from the same inputs Jacquard consumes.
- A converter that produces IR from Jacquard's existing SDF parser output.
- A CI diff harness that fails loud on converter disagreement.
- Parser-success assertions on the SDF and Liberty paths.
After phase 0, Jacquard's timing pipeline has an enforced external reference. Silent failures (zero-match SDF, mis-scoped hierarchical prefixes, unexpected cell drops) surface as CI failures rather than correctness regressions detected in the field.
Prerequisites
- Requirements doc (../timing-correctness.md) accepted.
- ADR 0001 (OpenSTA oracle) accepted.
- ADR 0002 (timing IR) accepted.
- A representative test design committed to the repo with inputs needed for both Jacquard and OpenSTA (.v + .lib + .sdf minimum; .spef if available). Candidate: tests/timing_test/inv_chain_pnr or the MCU SoC subset, whichever is smaller for first-pass iteration.
- OpenSTA available on developer machines and CI runners (installation documented).
Work breakdown
WS1 — IR schema
Done. Shipped as the timing-ir crate (508baaf initial, 2432d41 simplification). Schema at crates/timing-ir/schemas/timing_ir.fbs; per-DFF CLOCK_ARRIVAL records added later in c403cc8 (Pillar B Stage 1, beyond original WS1 scope). JSON round-trip verified via crates/timing-ir/tests/.
Produce the FlatBuffers schema (schemas/timing_ir.fbs) and generated Rust bindings.
Fields (minimum viable; extend only with written justification):
- SchemaVersion { major, minor, patch }.
- Corner { name, process, voltage, temperature }; IR holds a list of corners.
- CornerValue { corner_index, min, typ, max } for multi-corner floats.
- TimingArc { driver_pin, load_pin, rise_delay: [CornerValue], fall_delay: [CornerValue], condition, provenance }.
- InterconnectDelay { net, from_pin, to_pin, delay: [CornerValue], provenance }.
- SetupHoldCheck { d_pin, clk_pin, edge, setup: [CornerValue], hold: [CornerValue], condition, provenance }.
- Provenance { source_tool, source_file, origin: Asserted | Computed | Defaulted }.
- VendorExtension { source_tool, kind: CadenceX | SynopsysY | Other, raw_bytes }: untyped passthrough for unrecognised annotations.
- Root table TimingIR { schema_version, corners, cell_instances, timing_arcs, interconnect_delays, setup_hold_checks, vendor_extensions }.
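For readers who think in Rust types, a subset of the field list above can be mirrored as plain structs. This is only a sketch: the real bindings are generated from timing_ir.fbs by flatc and look different, and the concrete field types (`String`, `f64`, `u32`, `Option<String>`) as well as the `value_for_corner` helper are assumptions made for illustration.

```rust
/// Where a value came from, per the Provenance field list.
#[derive(Debug, Clone, PartialEq)]
pub enum Origin {
    Asserted,
    Computed,
    Defaulted,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Provenance {
    pub source_tool: String,
    pub source_file: String,
    pub origin: Origin,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Corner {
    pub name: String,
    pub process: String,
    pub voltage: f64,
    pub temperature: f64,
}

/// One min/typ/max triple, tagged with the corner it belongs to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CornerValue {
    pub corner_index: u32,
    pub min: f64,
    pub typ: f64,
    pub max: f64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TimingArc {
    pub driver_pin: String,
    pub load_pin: String,
    pub rise_delay: Vec<CornerValue>,
    pub fall_delay: Vec<CornerValue>,
    pub condition: Option<String>,
    pub provenance: Provenance,
}

/// Hypothetical helper: find the triple an arc carries for one corner.
pub fn value_for_corner(values: &[CornerValue], corner_index: u32) -> Option<CornerValue> {
    values.iter().copied().find(|v| v.corner_index == corner_index)
}
```

The point of the shape is that every delay is a vector keyed by corner index, so single-corner and multi-corner designs share one representation.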
Deliverables:
- schemas/timing_ir.fbs checked in.
- build.rs integration for code generation (or checked-in generated Rust with a flatc pin).
- A tiny timing-ir crate exposing read/write helpers.
- JSON round-trip via flatc --json verified in a unit test.
Scope guard: if you find yourself adding fields that represent computed timing graphs, cell electrical characterisation, or netlist structure, stop and re-read ADR 0002.
WS2 — opensta-to-ir production tool
Per ADR 0006, opensta-to-ir is a shipped preprocessing tool, not merely a validation helper. Post-release it remains as an alternative preprocessing path for users who want OpenSTA-computed timing.
Detailed design and phased implementation: ws2-opensta-to-ir.md.
Deliverables:
- A Tcl script runnable by OpenSTA that loads Liberty + Verilog + SDF + (optionally) SPEF + SDC, then emits a machine-readable dump of timing annotations.
- A production-quality standalone Rust binary opensta-to-ir that parses OpenSTA's dump and emits timing IR (binary + JSON sidecar). Stable CLI, documented exit codes, clear diagnostics, man-page-worthy --help.
- Invocation wrapper handling OpenSTA subprocess lifecycle, stderr capture, exit-code checking, and error propagation up through opensta-to-ir's own exit code.
- Assertion: if OpenSTA reports < expected-count cells, exit non-zero with a clear diagnostic.
- Ships as part of Jacquard's release artefacts (binary distributable, documented in user-facing docs).
WS2.4 — Multi-corner CLI flag (shipped 2026-05-02)
Status: Shipped 2026-05-02 across commits 5822343 (consumer + --timing-corner flag), 530bb36 (builder dedupe + per-corner [TimingValue] collection), 59fde04 (Tcl driver per-scene emission + --liberty NAME=PATH syntax), and the integration test aigpdk_dff_emits_per_corner_timing_values. The historical scope notes below are kept for reference but are no longer "open work".
The IR schema (crates/timing-ir/schemas/timing_ir.fbs) supports per-corner TimingValue vectors today, but every record lands in the IR with a single TimingValue keyed at corner_index = 0. Both producer (opensta-to-ir) and consumer (flatten.rs) treat the world as single-corner. Multi-corner support has three pieces:
Producer (Tcl + Rust binary):
- crates/opensta-to-ir/tcl/dump_timing.tcl: replace single read_liberty + hardcoded CORNER 0 default tt 1.0 25.0 with OpenSTA's define_corners + per-corner read_liberty -corner $name. The existing arc / setup-hold / wire / clock-arrival walks already key by (cell, …); wrap each in a per-corner loop and call [edge arc_delays $arc -corner $c]. Verify the exact -corner syntax against the locally built OpenSTA before relying on it (similar to the vertex_worst_arrival_path probe done for clock arrival in commit c403cc8).
- crates/opensta-to-ir/src/main.rs: rework --liberty PATH to accept --corner NAME=PATH[,V=…,T=…,P=…] repeats. Validate at least one corner.
- crates/opensta-to-ir/src/builder.rs: today each ARC / SETUP_HOLD / INTERCONNECT / CLOCK_ARRIVAL line lands as one IR record with one TimingValue. Multi-corner emits multiple lines per (cell, driver, load, corner_index) from Tcl; the builder dedupes them into one IR record carrying a [TimingValue] vector. Mechanical.
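The builder-side dedupe described in the last item can be sketched as a group-by on the arc key. This is a simplified illustration, not the code in builder.rs: the `ArcLine`, `ArcRecord`, and `merge_per_corner` names are hypothetical, each line carries a single delay triple instead of separate rise/fall values, and repeated lines for the same corner are kept as-is.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct TimingValue {
    pub corner_index: u32,
    pub min: f64,
    pub typ: f64,
    pub max: f64,
}

/// One parsed dump line: an arc key plus the value for a single corner.
#[derive(Debug, Clone, PartialEq)]
pub struct ArcLine {
    pub cell: String,
    pub driver: String,
    pub load: String,
    pub value: TimingValue,
}

/// One IR record: the same arc key carrying a vector of per-corner values.
#[derive(Debug, Clone, PartialEq)]
pub struct ArcRecord {
    pub cell: String,
    pub driver: String,
    pub load: String,
    pub values: Vec<TimingValue>,
}

/// Group per-corner lines by (cell, driver, load) and collect their values,
/// ordered by corner index. A BTreeMap keeps the output order deterministic.
pub fn merge_per_corner(lines: Vec<ArcLine>) -> Vec<ArcRecord> {
    let mut grouped: BTreeMap<(String, String, String), Vec<TimingValue>> = BTreeMap::new();
    for line in lines {
        grouped
            .entry((line.cell, line.driver, line.load))
            .or_default()
            .push(line.value);
    }
    grouped
        .into_iter()
        .map(|((cell, driver, load), mut values)| {
            values.sort_by_key(|v| v.corner_index);
            ArcRecord { cell, driver, load, values }
        })
        .collect()
}
```

Deterministic ordering matters here because the output feeds golden-file comparisons; a hash map would reorder records between runs.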
Consumer (jacquard root):
- Add --timing-corner <NAME> to SimArgs / CosimArgs in src/bin/jacquard.rs; resolve to an index by walking ir.corners().
- Replace flatten.rs::ir_corner0_max(...) (used in ~5 sites) with ir_corner_max(idx). Thread the resolved index through load_timing_from_ir.
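Resolving the --timing-corner name to an index is a linear search over the IR's corner list. A minimal sketch, assuming the corner names have already been pulled out of the IR into a slice; `resolve_corner_index` and the error wording are hypothetical, not the code in jacquard.rs.

```rust
/// Hypothetical helper: map a requested corner name to its index in the
/// IR's corner list, or return a diagnostic listing what is available.
pub fn resolve_corner_index(corners: &[&str], requested: &str) -> Result<usize, String> {
    corners
        .iter()
        .position(|name| *name == requested)
        .ok_or_else(|| {
            format!(
                "unknown timing corner '{}'; available: {}",
                requested,
                corners.join(", ")
            )
        })
}
```

Listing the available corners in the error keeps a typo on the command line from turning into a silent fall-back to corner 0.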
Fixture: sky130 ships multi-corner Liberty (tt_025C_1v80, ss_-40C_1v62, ff_125C_1v95) on disk via volare under ~/.volare/... on dev machines that have run the cosim work. Wire two corners against the existing DFF / chain integration tests for a synthetic-but-real fixture; no external decision is needed before starting.
Land in this order: fixture probe (~hour, verifies the OpenSTA Tcl -corner flag works as expected) → producer (Tcl + binary + builder) → consumer (CLI + flatten plumbing) → integration test exercising both corners. The risk concentrates in the first hour; everything after that is mechanical.
WS3 — Remove hand-rolled SDF parser; wire interim runtime hook
Per ADR 0006, Jacquard's hand-rolled SDF parser is deleted in Phase 0 rather than maintained through later phases. The runtime gains a new IR input path; the old SDF input path becomes an interim convenience wrapper over WS2.
Detailed design and phased implementation: ws3-delete-sdf-parser.md.
Deliverables:
- Delete src/sdf_parser.rs and the SDF→Jacquard-internal-types code path. Remove all direct consumers.
- Add jacquard sim --timing-ir <path> as the canonical post-release timing input. Loads a pre-converted timing IR file, consumes it into the simulator's internal structures.
- Retarget the existing --timing-sdf / --enable-timing CLI behaviour: when SDF is provided, jacquard sim subprocesses opensta-to-ir internally to produce IR on the fly, then consumes it. Code site tagged "INTERIM per ADR 0006; removed before first release."
- Verify no remaining imports of the deleted module. Verify all existing tests that previously used the hand-rolled parser now pass via the interim hook or via checked-in IR fixtures.
- No runtime behaviour regression on Jacquard's timing-related regression suite; any design that currently works must still work after WS3.
WS4 — Diff harness and CI integration
Reframed 2026-05-02; corpus + runner shipped 2026-05-02. The original WS4 was framed as "WS2 vs WS3 IR diff" (OpenSTA-derived against Jacquard's hand-rolled SDF parser-derived). WS3 deleted that parser; the diff has only one side now. Three reframings were considered: Option A (golden-IR regression corpus for opensta-to-ir) was chosen as the Phase 0 closure; Option B (end-to-end behavioural diff cxxrtl/CVC vs Jacquard cosim event traces) belongs in timing-validation.md as a Phase 1+ extension; Option C (cross-tool diff vs a future native Rust SDF→IR parser) is Phase 3 work per ADR 0006.
Deliverables:
- A test binary timing-ir-diff that reads two IR files and produces a structured diff (missing arcs, mismatched delays past tolerance, mismatched provenance). Shipped in crates/timing-ir/src/bin/timing-ir-diff.rs.
- OpenSTA vendored as a git submodule at vendor/opensta/. Not built from Jacquard's build at runtime; present for CI version pinning, the opensta-to-ir integration tests, and stress-corpus access (see ADR 0005). Shipped.
- A primary regression corpus at tests/timing_ir/corpus/: Jacquard-specific designs with checked-in expected.jtir (and an expected.json sidecar via flatc --json for human-readable diffs). Shipped 2026-05-02 with the seed entry aigpdk_dff_chain (a minimal aigpdk DFF + AND with back-annotated wire delay; covers ARC + SETUP_HOLD + CLOCK_ARRIVAL + INTERCONNECT in a self-contained fixture). Sky130 entries (inv_chain_pnr, mcu_soc subset) remain to be added: the inputs exist under tests/timing_test/, but a CI strategy for installing the sky130 Liberty (likely volare) lands with them.
- A stress corpus at tests/timing_ir/stress/: a manifest file listing paths into vendor/opensta/<test-tree-subdir>/. Run nightly or pre-release. Exit criterion: no crashes, no hangs, no malformed IR; numerical agreement with OpenSTA not required. Manifest format specced in tests/timing_ir/stress/README.md; entries pending.
- A regression test that, for each design in the primary corpus, runs opensta-to-ir on its inputs and diffs against expected.jtir via timing_ir::diff::diff_irs with the per-design tolerance from manifest.toml. Shipped as crates/opensta-to-ir/tests/corpus.rs::corpus_designs_match_golden_ir. Skips gracefully when OpenSTA isn't built; fails loud with a structured diff when there's a mismatch.
- A regenerate-goldens helper for the OpenSTA-pin-bump workflow: bump submodule, run regen, review the diff, commit golden + submodule together. Shipped as scripts/regenerate-corpus-goldens.sh. Iterates tests/timing_ir/corpus/*/manifest.toml, runs opensta-to-ir per entry with the manifest-specified flags, refreshes both expected.jtir and the expected.json sidecar via flatc --json. Accepts entry names as positional args for targeted regen.
- A diff-machinery mutation test that perturbs a known-good IR and asserts timing-ir-diff flags it. Shipped in crates/timing-ir/tests/diff.rs: delay_mismatch_past_tolerance_detected, delay_mismatch_within_tolerance_is_clean, arc_only_in_a_detected, arc_only_in_b_detected.
CI hookup landed 2026-05-02. The opensta-to-ir-tests job in .github/workflows/ci.yml builds CUDD (cached), builds OpenSTA via scripts/build-opensta.sh (cached on the submodule SHA), and runs cargo test inside crates/opensta-to-ir — covering the corpus regression test, the CLI tests, and the OpenSTA-driven integration tests on every PR. scripts/build-opensta.sh was extended to honour a CUDD_DIR env var so the CI job can hand it the source-built CUDD location without bypassing the script.
What this catches: OpenSTA upstream regressions, dump-format / Tcl-driver regressions, accidental schema-breaking changes in timing_ir.fbs, builder bugs in opensta-to-ir/src/builder.rs, and the diff machinery itself (via the mutation tests that perturb an IR and assert timing-ir-diff flags the perturbation).
What this doesn't catch: behavioural divergence between Jacquard and a reference simulator. That's timing-validation.md's job (CVC/iverilog event-trace comparison) — the mcu_soc/sky130 90/90 reference match is the current one-design instance, generalisable in Phase 1+.
WS5 — Parser-success assertions
Done. Both halves had already shipped before this section was marked as done.
Deliverables (all live):
- Assertions in Jacquard's Liberty parsing code: non-zero cells parsed on non-empty input. Implemented as TimingLibrary::parse (src/liberty_parser.rs:297-309); rejects with a clear diagnostic naming the input byte count and pointing at the explicit override.
- Assertions in opensta-to-ir (WS2): non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Implemented as the --min-arcs N CLI flag (default 1) in the binary (crates/opensta-to-ir/src/main.rs:71-77, :112-121); exits with code EXIT_MIN_ARCS_FAILED = 3 (see :17) and a diagnostic naming the produced count, the threshold, and the override flag.
- A way to override thresholds for intentionally-empty test inputs: TimingLibrary::parse_unchecked (src/liberty_parser.rs:316) for the Liberty path, --allow-empty-parse flag for the opensta-to-ir path.
Tests covering both halves: liberty_parser::parse_rejects_library_input_with_zero_cells and parse_unchecked_accepts_zero_cell_library; opensta-to-ir::cli::cli_min_arcs_failure_exit_3 (covers both the failure and the --allow-empty-parse override).
(Original-plan assertions for Jacquard's SDF parser are obsolete — WS3 deleted the parser they were to guard.)
Test plan
Tests live in tests/timing_ir/.
- Schema round-trip (WS1). Construct a small IR in Rust, serialize to binary, deserialize, assert equality. Same for JSON.
- OpenSTA converter unit tests (WS2). For a hand-crafted tiny design, invoke the converter, assert IR contents match expectation.
- Jacquard converter unit tests (WS3). Same, on the same tiny design, through Jacquard's parser.
- Corpus diff (WS4). For each design in the primary corpus, freshly produced opensta-to-ir output diffs clean against the checked-in golden expected.jtir within per-design tolerance.
- Parser-success assertion tests (WS5). Feed empty Liberty, empty SDF, and non-empty-but-no-match Liberty. Each should fail loud with a clear diagnostic, not proceed silently.
Tolerances:
- Delay values: ±5% or ±5 ps absolute floor, whichever is larger. Rationale: matches the existing timing-validation.md convention; per-design overrides allowed via manifest.toml.
- Missing arcs: zero tolerance. Every arc in the golden IR must appear in the freshly produced one (and vice versa).
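The delay tolerance rule ("±5% or ±5 ps, whichever is larger") reduces to a single comparison against the larger of a relative and an absolute bound. A minimal sketch, assuming delays are expressed in picoseconds and that the relative bound is taken against the golden value; `within_tolerance` is a hypothetical name, not the function used by timing_ir::diff.

```rust
/// Hypothetical helper: true when `fresh_ps` is within tolerance of `golden_ps`.
/// The allowed deviation is the larger of `rel * |golden|` and `abs_floor_ps`,
/// so small delays are not held to an unreasonably tight relative bound.
pub fn within_tolerance(golden_ps: f64, fresh_ps: f64, rel: f64, abs_floor_ps: f64) -> bool {
    let allowed = (golden_ps.abs() * rel).max(abs_floor_ps);
    (fresh_ps - golden_ps).abs() <= allowed
}
```

With the default `rel = 0.05` and `abs_floor_ps = 5.0`, a 1000 ps arc may move by up to 50 ps, while a 20 ps arc may move by up to 5 ps, since its 5% band (1 ps) is below the floor.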
Exit criteria (all met)
Phase 0 is complete when all of the following hold:
1. ✅ schemas/timing_ir.fbs checked in (crates/timing-ir/schemas/timing_ir.fbs); round-trip unit tests in crates/timing-ir/tests/.
2. ✅ opensta-to-ir binary production-quality with stable CLI, documented exit codes, primary-corpus support. See ws2-opensta-to-ir.md (Implemented).
3. ✅ src/sdf_parser.rs deleted; --timing-ir <path> canonical; --timing-sdf is a subprocess wrapper over opensta-to-ir (per ADR 0006 § Amendment, the shipping mechanism; Phase 3 native Rust parser deferred indefinitely). See ws3-delete-sdf-parser.md (Implemented).
4. ✅ OpenSTA vendored at vendor/opensta/ (ADR 0005).
5. ✅ timing-ir-diff runs in CI on the primary corpus (opensta-to-ir-tests job), passes cleanly, fails loud on regressions. Mutation tests in crates/timing-ir/tests/diff.rs.
6. ✅ Parser-success assertions live on both halves: TimingLibrary::parse and opensta-to-ir --min-arcs. See WS5 above.
7. ✅ No regression observed in Jacquard's timing-related tests after WS3 cutover.
8. ✅ timing-validation.md carries the forward-pointing note (line 3) explicitly stating its ±5% convention will be superseded by timing-correctness.md once Phase 0 ships. Phase 0 has shipped; that supersession is now effective in practice (the corpus tolerance is set per-design via manifest.toml). Removing the in-doc note is a small follow-up if anyone authoring against the page would benefit.
Out of scope (deferred to later phases)
- Native Rust SDF→IR converter. The hand-rolled parser is removed in Phase 0 WS3 (per ADR 0006); the native Rust replacement is Phase 3 work, deferred indefinitely per ADR 0006 § Amendment (no longer release-gating). SDF input ships via the opensta-to-ir subprocess wrapper. See post-phase-0-roadmap.md § Phase 3 for revival triggers.
- OpenTimer integration. Depends on the spike; tracked in ../spikes/opentimer-sky130.md and its resulting phase-1 plan.
- Private PDK (GF130) test track. Tracked in ADR 0004; plumbing deferred to its own phase.
- SPEF IR. Separate from timing-annotation IR per ADR 0002.
- Runtime violation reporting improvements (R4 critical-path refinement JSON). Phase 1 or 2.
Risks
- Licensing verification on vendored OpenSTA corpus. Per-file check needed before inclusion. May reduce corpus size if restrictive; acceptable.
- FlatBuffers build integration friction. If build.rs codegen causes cross-compilation or CI issues, fall back to checked-in generated code with a documented flatc version. Pick one approach and stick to it; flip-flopping is worse than either option.
- Tolerance tuning. Initial ±5% may prove too loose (hides bugs) or too tight (false positives from numerical differences). Plan to re-tune after first real-design data arrives.
- WS3 cutover risk. Deleting the hand-rolled SDF parser risks regressing designs that depend on behaviour it currently provides. Exit criterion 7 requires a clean regression run before WS3 is considered complete. If coverage gaps emerge, walk-back options per ADR 0006 apply: add dialect shims to opensta-to-ir, or (now that Phase 3 is deferred) keep the hand-rolled parser available behind a feature flag until dialect parity is reached.
- OpenSTA dialect coverage. OpenSTA may not accept every SDF dialect Jacquard's hand-rolled parser has been patched to handle. Such cases are tracked as either opensta-to-ir post-processing fixes or upstream OpenSTA contributions. Under no condition is the fix to reinstate the hand-rolled parser unless walk-back per ADR 0006 is formally triggered.
Links
- ../project-scope.md
- ../timing-correctness.md: acceptance criteria this plan satisfies.
- ../adr/0001-opensta-as-oracle.md
- ../adr/0002-timing-ir.md
- ../spikes/opentimer-sky130.md: runs in parallel; no dependency either way.
Plan — WS2: opensta-to-ir
Status: Implemented — historical record. All five phases (2.1–2.5)
plus Pillar B Stage 1 (per-DFF CLOCK_ARRIVAL records) and release
hardening (WS-RH.1 OpenSTA version probe) have shipped. The crate
lives at crates/opensta-to-ir/. Current scheduling for further
timing-model fidelity work is tracked in post-phase-0-roadmap.md.
Phase: 0 (executed WS2 from phase-0-ir-and-oracle.md).
Predecessors: WS1 (crates/timing-ir, schema and round-trip — done), ADRs 0001 / 0002 / 0005 / 0006.
Goal
Deliver a production-quality preprocessing tool that consumes a design's timing inputs and emits a timing-ir file suitable for downstream Jacquard consumption. End-to-end:
.lib + .v + .sdf + .spef + .sdc → opensta-to-ir → design.jtir (+ design.json)
opensta-to-ir is shipped as a release artefact (per ADR 0006) and is also used by Phase 0 WS3's interim jacquard sim --timing-sdf runtime hook.
High-level architecture
Three components, single binary:
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ Rust CLI / driver │ │ Tcl dump script │ │ Rust IR builder │
│ (clap, subprocess mgmt)│ → │ (runs in OpenSTA proc) │ → │ (parses dump, builds │
│ Validates inputs │ │ Emits canonical dump │ │ FlatBuffers IR) │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
│ │
└──────────────────── one process invocation ───────────────────────┘
The Rust CLI invokes OpenSTA as a subprocess, writes the Tcl driver script to a temp directory, runs sta -f $tmpdir/dump.tcl, captures the dump file, and converts to IR. The Tcl driver lives at crates/opensta-to-ir/tcl/dump_timing.tcl and is embedded in the binary via include_str!() so the binary is self-contained at runtime — no separate Tcl file needs to ship alongside it.
OpenSTA is located via scripts/build-opensta.sh --print-binary first (the canonical install path for the vendored submodule), then falling back to a PATH lookup, then --opensta-bin <PATH> override.
Reasons for this shape:
- OpenSTA's structured Tcl API (get_timing_edges, get_timing_arcs_from, etc.) gives access to OpenSTA's internalised timing graph directly. Walking it is simpler than parsing OpenSTA's SDF output back through a second-generation parser.
- The Tcl script is the only OpenSTA-specific code; the Rust side is format-only and can later be reused with other producers (Phase 3 native Rust parser, future OpenTimer adapter).
- Subprocess invocation preserves Jacquard's permissive license posture (ADR 0001).
Tcl dump format
A simple line-oriented record format. Each line is one annotation. Fields are tab-separated. Strings with tabs/newlines are quoted with simple "..." and \t/\n escaping. Header / footer lines mark the document.
# format-version: 1
# generator-tool: opensta-to-ir 0.1.0
# generator-opensta: <opensta version string>
# input-files: <comma-separated list>
CORNER <index> <name> <process> <voltage> <temperature>
ARC <cell_instance> <driver_pin> <load_pin> <corner_index> <rise_min> <rise_typ> <rise_max> <fall_min> <fall_typ> <fall_max> <condition> <origin>
INTERCONNECT <net> <from_pin> <to_pin> <corner_index> <min> <typ> <max> <origin>
SETUP_HOLD <cell_instance> <d_pin> <clk_pin> <edge> <corner_index> <setup_min> <setup_typ> <setup_max> <hold_min> <hold_typ> <hold_max> <condition> <origin>
VENDOR_EXT <source> <source_tool> <kind> <base64_payload>
# end
Why line-oriented (not JSON): Tcl emits this trivially with puts. Rust parses it with a BufReader line-at-a-time, no streaming-JSON parser. Mismatched lines fail loud at the unit level, not after parsing 100MB of nested JSON.
The format is a private interface between the bundled Tcl script and the bundled Rust binary — both ship together in one release artefact. We reserve the right to change the format any time as long as both sides update.
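As an illustration of why a line-oriented format is cheap to consume, here is a minimal sketch of parsing one INTERCONNECT record by splitting on tabs. It is not the parser in the crate: `Interconnect` and `parse_interconnect` are hypothetical names, the numeric field types are assumptions, and the "..." quoting and escape handling described above is deliberately left out.

```rust
#[derive(Debug, PartialEq)]
pub struct Interconnect {
    pub net: String,
    pub from_pin: String,
    pub to_pin: String,
    pub corner_index: u32,
    pub min: f64,
    pub typ: f64,
    pub max: f64,
    pub origin: String,
}

/// Parse `INTERCONNECT <net> <from_pin> <to_pin> <corner_index> <min> <typ> <max> <origin>`
/// with tab-separated fields. Any shape mismatch is an error for that line,
/// so a bad record fails immediately and names the offending field.
pub fn parse_interconnect(line: &str) -> Result<Interconnect, String> {
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 9 || fields[0] != "INTERCONNECT" {
        return Err(format!(
            "malformed INTERCONNECT line ({} fields): {:?}",
            fields.len(),
            line
        ));
    }
    let num = |i: usize| -> Result<f64, String> {
        fields[i]
            .parse::<f64>()
            .map_err(|e| format!("field {}: {}", i, e))
    };
    Ok(Interconnect {
        net: fields[1].to_string(),
        from_pin: fields[2].to_string(),
        to_pin: fields[3].to_string(),
        corner_index: fields[4]
            .parse::<u32>()
            .map_err(|e| format!("field 4: {}", e))?,
        min: num(5)?,
        typ: num(6)?,
        max: num(7)?,
        origin: fields[8].to_string(),
    })
}
```

A driver loop would call this per line from a `BufReader`, dispatching on the first field, which keeps memory use flat regardless of dump size.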
Rust binary
opensta-to-ir [OPTIONS] --output <PATH>
Inputs (at least one liberty + one verilog required):
--liberty <PATH>... One or more Liberty files (-r overlay supported by OpenSTA).
--verilog <PATH>... One or more Verilog netlists.
--sdf <PATH> Optional. Back-annotated delays.
--spef <PATH> Optional. Parasitics; required for SPEF-based delay calc.
--sdc <PATH> Optional. Constraints (clocks, input delays).
--top <NAME> Top-level module name. Required.
--corner <NAME>... Corner name(s). Default: "default".
Output:
--output <PATH> IR binary output path (.jtir).
--json <PATH> Optional. JSON sidecar via flatc round-trip.
Behaviour:
--opensta-bin <PATH> Override the OpenSTA executable path. Default: probe via
`scripts/build-opensta.sh --print-binary`, then fall back to PATH.
--keep-tmp Keep the Tcl script and dump file in $TMPDIR for debugging.
--min-arcs <N> Fail if fewer than N timing arcs are emitted. Default: 1.
--allow-empty-parse Disable the --min-arcs check. For test fixtures only.
--strict-tcl Treat OpenSTA Tcl warnings as errors.
-v, --verbose Echo OpenSTA's stderr to ours. Default: capture and replay only on failure.
Exit codes:
0 IR produced successfully.
1 OpenSTA returned an error.
2 Tcl dump format error or IR-build failure.
3 Parser-success assertion failed (--min-arcs not met).
4 Argument validation error.
Internal flow:
1. Validate args (required files exist, top name non-empty).
2. Locate OpenSTA binary; verify version is in supported range.
3. Render Tcl driver script into $TMPDIR (or stdin).
4. Spawn opensta -f <script>; capture stdout/stderr/exit.
5. Read dump file from $TMPDIR/<uniqued>.osd (OpenSTA dump).
6. Parse dump, build IR via timing-ir crate's FlatBuffers builders.
7. Apply --min-arcs assertion (see WS5 portion below).
8. Write .jtir (and .json if requested).
9. Surface any captured warnings on stderr.
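The steps above each fail into one of the documented exit codes. A minimal sketch of that mapping follows; only `EXIT_MIN_ARCS_FAILED` is a name taken from the plan, while the other constant names, the `Failure` enum, and `exit_code` are hypothetical and chosen for illustration.

```rust
pub const EXIT_OK: i32 = 0;
pub const EXIT_OPENSTA_FAILED: i32 = 1; // hypothetical name
pub const EXIT_DUMP_OR_BUILD_FAILED: i32 = 2; // hypothetical name
pub const EXIT_MIN_ARCS_FAILED: i32 = 3; // name used in the plan
pub const EXIT_BAD_ARGS: i32 = 4; // hypothetical name

/// Hypothetical error type: one variant per failure class in the CLI spec.
#[derive(Debug)]
pub enum Failure {
    /// Argument validation failed (missing file, empty top name).
    BadArgs(String),
    /// OpenSTA itself returned an error.
    OpenSta(String),
    /// The dump could not be parsed or the IR could not be built.
    DumpOrBuild(String),
    /// Fewer timing arcs were produced than --min-arcs requires.
    MinArcs { produced: usize, required: usize },
}

/// Map the outcome of the whole flow to the process exit code.
pub fn exit_code(outcome: &Result<(), Failure>) -> i32 {
    match outcome {
        Ok(()) => EXIT_OK,
        Err(Failure::OpenSta(_)) => EXIT_OPENSTA_FAILED,
        Err(Failure::DumpOrBuild(_)) => EXIT_DUMP_OR_BUILD_FAILED,
        Err(Failure::MinArcs { .. }) => EXIT_MIN_ARCS_FAILED,
        Err(Failure::BadArgs(_)) => EXIT_BAD_ARGS,
    }
}
```

Keeping the mapping in one exhaustive `match` means a new failure class cannot be added without deciding its exit code.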
Multi-corner handling
OpenSTA's define_corners and set_scene commands drive multi-corner analysis. Our flow:
- Caller passes --corner ss_125C_1v08 --corner tt_25C_1v80 --corner ff_-40C_1v98.
- Tcl script calls define_corners once with the union, then iterates foreach corner [get_corners] { ... } and emits CORNER + ARC / INTERCONNECT / SETUP_HOLD lines tagged with the corner index.
- Single-corner designs use one entry: same code path, no special case.
PVT extraction (process / voltage / temperature) — OpenSTA exposes these via Liberty's operating conditions. Tcl extracts via the corner's pvt object. If unavailable, process="?", voltage=0.0, temperature=0.0.
Vendor extensions
OpenSTA does not expose a single mechanism for arbitrary annotations. For Phase 0 WS2:
- We do not produce VENDOR_EXT records.
- The IR's vendor-extension passthrough remains a forward-looking feature; a future producer (a commercial-tool-aware adapter) will populate it.
Tcl-side parsing of vendor-specific Liberty simulation blocks or SDF (VENDOR …) constructs is not in scope for Phase 0 WS2.
Parser-success assertion (WS5 portion)
Per phase-0-ir-and-oracle.md WS5: "Assertions in opensta-to-ir: non-zero IOPATHs / timing arcs resolved on non-trivial SDF input. Exit non-zero with a clear diagnostic when below threshold."
Implementation:
- --min-arcs <N> flag with default 1.
- After IR is built, count TimingArc records in the buffer.
- If below threshold and --allow-empty-parse was not passed, exit code 3 with message: opensta-to-ir: produced N timing arcs (--min-arcs <M>); use --allow-empty-parse for empty-fixture tests.
- Liberty parser-success assertion already lives in Jacquard's TimingLibrary::parse (see commit 5db131e). opensta-to-ir invokes OpenSTA's own Liberty reader rather than Jacquard's, so it surfaces missing-cell issues via OpenSTA's exit status (not our concern at this layer).
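The threshold check itself is small. A minimal sketch, assuming the arc count has already been taken from the built IR; `check_min_arcs` is a hypothetical name, and the message follows the wording quoted above with the placeholders filled in.

```rust
/// Hypothetical helper: enforce the --min-arcs threshold.
/// Returns the diagnostic to print (the caller would then exit with code 3).
pub fn check_min_arcs(
    produced: usize,
    min_arcs: usize,
    allow_empty_parse: bool,
) -> Result<(), String> {
    // --allow-empty-parse disables the check entirely (test fixtures only).
    if allow_empty_parse || produced >= min_arcs {
        return Ok(());
    }
    Err(format!(
        "opensta-to-ir: produced {} timing arcs (--min-arcs {}); \
         use --allow-empty-parse for empty-fixture tests.",
        produced, min_arcs
    ))
}
```

With the default threshold of 1, a zero-match SDF produces an error instead of an empty but well-formed IR file.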
Test plan
Fixture progression — minimum-viable to representative
- inv_chain_pnr (already in tests/timing_test/): smallest design with real SKY130 cells and SDF. Verify single arc per inverter, correct rise/fall, single corner.
- MCU SoC subset: representative of the real Jacquard flow. Verify the count of arcs matches a known baseline; spot-check a handful of arrival times against report_timing output.
- Multi-corner synthetic: hand-built tiny design with ss / tt / ff Liberty corners, verify the IR carries 3 corner records and 3 sets of values per arc.
Test types
- Unit tests (Rust): dump-format parser tested against synthetic dump strings (no OpenSTA needed).
- Integration tests (Rust + OpenSTA): invoke the binary against committed fixtures, diff the resulting IR against golden IR via timing-ir-diff. Each integration test gates itself on scripts/build-opensta.sh --print-binary succeeding; when the OpenSTA binary is unbuilt, tests skip with a clear "run scripts/build-opensta.sh" message rather than failing. CI runs them after building OpenSTA via the script.
- Failure-mode tests: missing OpenSTA, malformed Tcl dump, zero-arc input, missing required argument. Each surfaces the expected exit code.
CI integration (closes WS4 remaining work)
- A new CI job runs opensta-to-ir on each tests/timing_ir/corpus/<name>/inputs/ and diffs against expected.jtir via timing-ir-diff. Fails loud on diff or exit-code regression.
- Stress-corpus run is deferred to Phase 1.
Phased implementation
Splitting WS2 into focused PRs keeps reviewability tight. Each phase exits with a runnable end-to-end on its scope:
| Phase | Scope | Exit signal | Status |
|---|---|---|---|
| 2.1 | Single-corner, timing-arc IOPATHs only. CLI scaffolding. | AIGPDK AND2 round-trip clean through opensta-to-ir end-to-end. | ✅ Shipped (dc3db4a scaffold + 3997e06 subprocess plumbing + 50b8600 real Tcl extraction). |
| 2.2 | Add interconnect delays (wire-role edges, with optional SPEF). | Multi-cell design produces INTERCONNECT records that round-trip. | ✅ Shipped (67210c0). Test: chain_with_sdf_emits_interconnect_delay. |
| 2.3 | Add setup/hold checks. | DFF setup/hold round-trips end-to-end. | ✅ Shipped (8343b14). Test: aigpdk_dff_emits_setup_hold_records. Recovery / removal / width checks remain out of scope. |
| 2.4 | Multi-corner. | 3-corner synthetic fixture produces 3-corner IR. | ✅ Shipped (530bb36 builder + 59fde04 per-corner Tcl emission + d110174 integration test + 50f4bf5 real-sky130 multi-corner follow-up). Tests: aigpdk_dff_emits_per_corner_timing_values, sky130_multi_corner_emits_per_corner_values. |
| 2.5 | CI corpus integration; golden-IR fixtures for representative designs. | WS4 corpus job in CI; WS2 task complete. | ✅ Shipped (90558bb). Runner: cargo test -p opensta-to-ir corpus. |
Beyond original WS2 scope:
- Pillar B Stage 1: per-DFF CLOCK_ARRIVAL records (c403cc8). Adds clock arrival times to the IR so downstream consumers can compute per-DFF setup/hold margins without re-running OpenSTA. Test: dff_with_sdc_clock_emits_clock_arrival. Tracked separately in post-phase-0-roadmap.md.
- Release hardening WS-RH.1: hard-fail on missing or too-old OpenSTA, with version probe and usage diagnostics (c9c393b). Tests: locate_accepts_min_tested_version, locate_flags_newer_than_tested. Tracked in post-phase-0-roadmap.md § Release hardening.
WS3 (delete src/sdf_parser.rs + wire interim runtime hook) was unblocked once Phase 2.3 minimum landed and has also shipped — see ws3-delete-sdf-parser.md.
Open questions — resolution
Resolutions from implementation:
- OpenSTA version pinning: Resolved by WS-RH.1 (c9c393b). Binary probes OpenSTA's version_string, accepts a [MIN_TESTED, MAX_TESTED] range, prints a usage diagnostic with the supported range on mismatch.
- OpenSTA installation: Resolved. scripts/build-opensta.sh ships with --print-binary for the dependency probe; integration tests skip cleanly when the binary isn't built. Documented in the script's --help and the post-Phase-0 roadmap.
- Tcl-script versioning: Resolved. The # format-version: 1 header check is enforced in dump.rs; the binary refuses unknown versions with an explicit error.
- Conditional arcs (SDF COND): Partial. The condition field is plumbed end-to-end (dump format → Rust parser → IR builder), but the Tcl emission side does not yet populate it for conditional variants. Defer until a real design surfaces a COND arc that needs distinguishing.
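The version-pinning resolution above can be sketched as a plain range check. This is an illustration only, not the code that shipped in c9c393b: the bound values, the MAJOR.MINOR.PATCH string shape, and the names parse_version, check_version, and VersionCheck are all assumptions made for the example.

```rust
/// Hypothetical bounds; the real supported range lives in the opensta-to-ir crate.
const MIN_TESTED: (u32, u32, u32) = (2, 5, 0);
const MAX_TESTED: (u32, u32, u32) = (2, 7, 0);

/// Outcome of probing a version string (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum VersionCheck {
    Supported,
    TooOld,
    NewerThanTested,
    Unparseable,
}

/// Parse "MAJOR.MINOR.PATCH" into a tuple; tuples compare lexicographically.
pub fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    let patch: u32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Classify a probed version against the inclusive [MIN_TESTED, MAX_TESTED] range.
pub fn check_version(version_string: &str) -> VersionCheck {
    match parse_version(version_string) {
        None => VersionCheck::Unparseable,
        Some(v) if v < MIN_TESTED => VersionCheck::TooOld,
        Some(v) if v > MAX_TESTED => VersionCheck::NewerThanTested,
        Some(_) => VersionCheck::Supported,
    }
}
```

Keeping "too old" and "newer than tested" as separate outcomes lets the caller hard-fail on one and print a diagnostic on the other, which is the split the two WS-RH.1 test names suggest.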
Still open / deferred:
- Long-running designs: streaming dump emission (Tcl flushing line-by-line, Rust incremental read) — defer until profiling on a real SoC shows memory pressure.
- Strict Tcl error handling: the --strict-tcl flag was specced but not implemented. Current behaviour captures all stderr and replays on failure; no warning-to-error upgrade path. Land if it becomes a real CI hygiene concern.
Risks
- OpenSTA's Tcl API is large and not all of it is documented. Some primitives we'll need (e.g., per-corner delay values for a specific arc) may require digging through Sta.cc. Mitigation: budget time, lean on report_path text output as a fallback if the structured API proves opaque for a given query.
- OpenSTA may be slow on big designs: the structured walk over millions of arcs is single-threaded. Mitigation: --keep-tmp for profiling, accept slow phase-0 runs, optimise later if it blocks CI.
- Format drift between Tcl and Rust: both sides advance together; the format-version line plus version-mismatch fail-loud catches drift. Add a unit test that the Rust parser rejects an unexpected version line.
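The format-drift mitigation can be illustrated with a small header check. This is a sketch under assumptions: that the header is the first line of the dump, and that the names check_format_version and HeaderError exist only for this example (the real check lives in dump.rs and may look different).

```rust
/// The only dump format version this sketch accepts.
const SUPPORTED_FORMAT_VERSION: u32 = 1;

/// Hypothetical error type for the header check.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    Missing,
    Malformed(String),
    Unsupported(u32),
}

/// Verify that the first line of a dump reads `# format-version: 1`.
pub fn check_format_version(dump: &str) -> Result<(), HeaderError> {
    let first = dump.lines().next().ok_or(HeaderError::Missing)?;
    let rest = first
        .strip_prefix("# format-version:")
        .ok_or_else(|| HeaderError::Malformed(first.to_string()))?;
    let version: u32 = rest
        .trim()
        .parse()
        .map_err(|_| HeaderError::Malformed(first.to_string()))?;
    if version == SUPPORTED_FORMAT_VERSION {
        Ok(())
    } else {
        // Fail loud: an unknown version is never silently accepted.
        Err(HeaderError::Unsupported(version))
    }
}
```

A unit test of the kind the risk item asks for would feed a dump whose header says version 2 and assert that the result is the Unsupported error.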
Non-goals
- A general SDF parser. (The whole point: avoid that.)
- Wire-level reactivity or feedback to OpenSTA mid-run (this is a one-shot extract).
- Comparison against OpenTimer (that's a separate ADR-0003-spike concern).
- Replacing OpenSTA's role as oracle in CI: opensta-to-ir is a producer, not a checker.
References
- ../adr/0001-opensta-as-oracle.md: subprocess model, license posture.
- ../adr/0002-timing-ir.md: IR contract this tool emits.
- ../adr/0005-opensta-vendoring-and-corpus.md: vendor/opensta/ submodule.
- ../adr/0006-sdf-preprocessing-model.md: interim runtime hook + release-time cutover.
- phase-0-ir-and-oracle.md: WS2 row in the work breakdown.
- crates/timing-ir/schemas/timing_ir.fbs: schema this tool produces.
- vendor/opensta/doc/StaApi.txt: OpenSTA Tcl API reference.
Last updated: 2026-04-28 (design); 2026-05-15 (status flip to Implemented).
Plan — WS3: delete SDF parser, wire IR consumer + interim runtime hook
Status: Implemented — kept as historical record. Note: the "interim" / "pre-release-only" framing throughout this document describes the original ADR 0006 model. Per ADR 0006 § Amendment (2026-05-02), the runtime subprocess wrapper is now the shipping mechanism — Phase 3 (native Rust SDF→IR) is no longer release-gating. This document is preserved for the implementation phasing record; for current shipping intent see ADR 0006 § Amendment and post-phase-0-roadmap.md § Phase 3.
Phase: 0 (executes WS3 from phase-0-ir-and-oracle.md).
Predecessors: WS2 phases 2.1 + 2.3-minimum (delay arcs + setup/hold checks landed). Sufficient IR coverage for runtime cutover.
ADRs: 0002 (IR), 0006 (SDF preprocessing model + interim cutover; amended 2026-05-02).
Goal
Delete src/sdf_parser.rs and migrate src/flatten.rs's timing-loading to consume the timing IR directly. Wire jacquard sim --timing-ir <PATH> as the canonical input path, and (per ADR 0006) keep --timing-sdf <PATH> working pre-release as a contributor-ergonomics convenience that internally subprocesses opensta-to-ir.
End state:
- No hand-rolled SDF parsing in the Jacquard codebase.
- Runtime SDF input still works (via internal subprocess) until first release.
- flatten.rs consumes timing_ir::TimingIR<'_> for arc / setup / hold loading.
- All flatten.rs tests that previously hand-built SDF strings are migrated to build IR fixtures via the timing-ir crate's FlatBuffers builders.
Surface analysis
src/sdf_parser.rs (1099 lines) defines SdfFile, SdfDelay, SdfCorner, TimingCheckType, and parses SDF text. Consumers:
- src/flatten.rs: load_timing_from_sdf(...) is the only non-test consumer; iterates SdfFile.get_cell(path), uses SdfDelay for wire delays, TimingCheckType::Setup/Hold for check identification. ~200 lines of integration plus 7+ test fixtures that build SDF strings inline.
- src/sim/setup.rs: translates the --sdf-corner CLI string into SdfCorner and calls SdfFile::parse_file.
- src/aig.rs: test imports only.
- src/lib.rs: module declaration only.
Architecture changes
New: src/sim/timing_ir_loader.rs
Thin module that owns the IR file buffer (so consumers can borrow TimingIR<'_> views from it):
    pub struct TimingIrFile {
        buf: Vec<u8>,
    }

    impl TimingIrFile {
        pub fn from_path(path: &Path) -> Result<Self, ...> { ... }
        pub fn from_bytes(buf: Vec<u8>) -> Result<Self, ...> { ... }
        pub fn view(&self) -> Result<timing_ir::TimingIR<'_>, ...> {
            timing_ir::root_as_timing_ir(&self.buf)
        }
    }
The TimingIR view holds a lifetime tied to the buffer. Callers keep the TimingIrFile alive while iterating the view.
Modified: src/flatten.rs
Replace load_timing_from_sdf with load_timing_from_ir:
    pub fn load_timing_from_ir(
        &mut self,
        aig: &AIG,
        netlistdb: &NetlistDB,
        ir: &timing_ir::TimingIR<'_>,
        clock_period_ps: u64,
        liberty_fallback: Option<&TimingLibrary>,
        debug: bool,
    ) { ... }
Logic translation table:
| Old (SdfFile) | New (TimingIR<'_>) |
|---|---|
| sdf.get_cell(path) | Index ir.timing_arcs() / ir.setup_hold_checks() by cell_instance (build a HashMap<&str, _> once). |
| cell.iopaths | Filter timing arcs by cell_instance == path. |
| cell.timing_checks | Filter setup/hold checks by cell_instance == path. |
| SdfDelay { rise, fall, ... } | TimingArc.rise_delay() / .fall_delay() (per-corner); take corner 0 max for now. |
| TimingCheckType::Setup / ::Hold | SetupHoldCheck.setup() / .hold() per record. |
| cell.interconnect_delays | ir.interconnect_delays(): empty until WS2.2 lands; tolerate. |
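The first row of the table (index once, then look up per cell) might look like the following sketch. ArcRecord is a hypothetical owned stand-in for the FlatBuffers arc view, and index_by_cell is a name invented for this example; the real loader would borrow strings from the IR buffer instead.

```rust
use std::collections::HashMap;

/// Stand-in for one IR timing arc (hypothetical; the real type is a FlatBuffers view).
#[derive(Debug, Clone, PartialEq)]
pub struct ArcRecord {
    pub cell_instance: String,
    pub from_pin: String,
    pub to_pin: String,
    pub rise_max_ps: u64,
    pub fall_max_ps: u64,
}

/// Group arcs by cell instance in one pass so later per-cell lookups
/// do not rescan the whole arc list.
pub fn index_by_cell(arcs: &[ArcRecord]) -> HashMap<&str, Vec<&ArcRecord>> {
    let mut index: HashMap<&str, Vec<&ArcRecord>> = HashMap::new();
    for arc in arcs {
        index
            .entry(arc.cell_instance.as_str())
            .or_default()
            .push(arc);
    }
    index
}
```

The map borrows its keys from the arc records, so it has to be dropped before the records are, which mirrors the rule that callers keep the TimingIrFile alive while using a view.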
The hierarchy-prefix detection (lines 1793-1820 of current flatten.rs) is independent of source format — same logic applies, just use IR's cell_instance strings instead of SDF's. Keep the heuristic.
Modified: CLI surface (src/bin/jacquard.rs, src/sim/setup.rs)
- Add a --timing-ir <PATH> flag that loads IR directly via TimingIrFile::from_path.
- Retarget --timing-sdf <PATH> (and the existing --sdf-corner) to: spawn opensta-to-ir as a subprocess, capture its IR output, call load_timing_from_ir. Mark the code site INTERIM per ADR 0006.
- The interim hook needs Liberty + Verilog paths to feed opensta-to-ir; the jacquard sim CLI already takes those, so plumb them through.
- Keep --sdf-corner for backward compat: the interim wrapper passes it as --corner to opensta-to-ir.
Deletions
- src/sdf_parser.rs: entire file.
- src/lib.rs: pub mod sdf_parser line.
- src/aig.rs: use crate::sdf_parser::{SdfCorner, SdfFile} test imports; rewrite or delete the affected tests.
- src/flatten.rs: use crate::sdf_parser::SdfFile; rewrite test fixtures.
Test migration strategy
Test fixtures in flatten.rs currently look like:
    let sdf_content = r#"(DELAYFILE ... )"#;
    let sdf = SdfFile::parse_str(sdf_content, SdfCorner::Typ).expect("...");
    flat.load_timing_from_sdf(&aig, &netlistdb, &sdf, ...);
After cutover:
    let ir_buf = build_test_ir(&TestIrSpec {
        arcs: vec![ /* (cell, from, to, rise_max, fall_max) */ ],
        setup_hold: vec![ /* (cell, d, clk, edge, setup, hold) */ ],
    });
    let ir = root_as_timing_ir(&ir_buf).unwrap();
    flat.load_timing_from_ir(&aig, &netlistdb, &ir, ...);
A build_test_ir helper in flatten.rs::tests mirrors build_ir_with_arcs from crates/timing-ir/tests/diff.rs. Single source of truth would be nicer; for now duplicate it (deduplication is a future cleanup).
Phased implementation
| Phase | Scope | Exit signal |
|---|---|---|
| 3.1 | Add src/sim/timing_ir_loader.rs and flatten.rs::load_timing_from_ir (parallel to _from_sdf). No CLI surface, no deletions. Unit-test the new function with a small synthetic IR. | New function compiles + passes unit test; existing _from_sdf path still works. |
| 3.2 | Add jacquard sim --timing-ir <PATH> CLI flag wired to load_timing_from_ir. End-to-end test: pre-generate IR via opensta-to-ir, run jacquard sim --timing-ir, compare against the existing --timing-sdf baseline. | A representative timing test (e.g., one of the existing tests/timing_test/) produces matching VCD output via both paths. |
| 3.3 | Retarget --timing-sdf to subprocess opensta-to-ir internally, then consume IR. Tag the code site INTERIM per ADR 0006. | Existing --timing-sdf regression tests pass through the new path. |
| 3.4 | Delete src/sdf_parser.rs. Migrate flatten.rs test fixtures from SDF strings to IR builders. Migrate aig.rs test imports. | All cargo test --lib tests pass; src/sdf_parser.rs is gone; the only crate::sdf_parser:: reference is git log. |
Each phase exits cleanly. Phase 3.4 is the irreversible deletion — gates on phases 3.1-3.3 having green CI on the migration tests.
Open questions
- Hierarchy separator: SDF uses "." while OpenSTA's default divider is "/". Our IR's cell_instance strings come from OpenSTA, so they use "/". The flatten.rs hierarchy-prefix detection logic currently uses "."; after cutover it needs to use "/". Verify by running on a hierarchical design (MCU SoC) before declaring 3.4 ready.
- --sdf-corner semantics under IR: today this picks one of Min/Typ/Max from SDF triples. The IR has min/typ/max per TimingValue already; the corner selection becomes "pick which of the three to use" applied per-arc rather than per-file. Document the mapping.
- Default-corner consistency: WS2 emits default as the corner name. Pre-existing Jacquard tests may not look at corner names; this needs a spot-check.
- liberty_fallback semantics: today, for cells absent from SDF, we fall back to Liberty-computed delays. Under IR, OpenSTA-computed values are already in the IR's arcs (as Origin::Computed). So liberty_fallback is potentially dead. Decide whether to drop it in 3.4 or keep it as a safety net.
- Multi-corner (post-WS2.4): when WS2.4 lands, the IR will have multiple corners. flatten.rs currently picks one. Define the per-corner selection contract: an explicit corner-name CLI flag, or a default to a named corner.
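The --sdf-corner question above ("pick which of the three to use", applied per arc) can be made concrete with a small sketch. Corner, TimingValue, parse_corner, and select are stand-ins written for this example; the real TimingValue is a FlatBuffers type and the accepted CLI spellings are an assumption.

```rust
/// Corner selector mirroring the Min/Typ/Max choice of --sdf-corner.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Corner {
    Min,
    Typ,
    Max,
}

/// Stand-in for the IR's min/typ/max triple (hypothetical owned struct).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TimingValue {
    pub min: f64,
    pub typ: f64,
    pub max: f64,
}

/// Parse the CLI string case-insensitively; unknown names yield None.
pub fn parse_corner(s: &str) -> Option<Corner> {
    match s.to_ascii_lowercase().as_str() {
        "min" => Some(Corner::Min),
        "typ" => Some(Corner::Typ),
        "max" => Some(Corner::Max),
        _ => None,
    }
}

/// Apply the selection to one value; the loader would call this per arc.
pub fn select(value: TimingValue, corner: Corner) -> f64 {
    match corner {
        Corner::Min => value.min,
        Corner::Typ => value.typ,
        Corner::Max => value.max,
    }
}
```

Because the selection is a pure function of one value, the same CLI string can be parsed once and then applied uniformly to delay arcs and setup/hold checks.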
Risks
- flatten.rs test churn: 7+ test fixtures need rewrites. Each is a focused mechanical change but the bulk adds up. Mitigation: a build_test_ir helper standardizes the pattern.
- Hidden-bug exposure: the existing SDF parser had quirks. The IR parser has different ones (or none). Migration may surface bugs that were latent. Treat any test failure during 3.4 as a real bug, not "just adjust the test."
- Hierarchy-separator regression: if not caught in phase 3.2 testing (which tests on a single design), it could land in 3.4 and break a hierarchical design that wasn't previously regression-tested. Mitigation: include a hierarchical design in the 3.2 verification matrix.
- Cutover timing: WS3 lands while WS2.2 (interconnects) and WS2.4 (multi-corner) are still pending. flatten.rs's cutover assumes those will land later — test fixtures should not depend on interconnect delays or multi-corner behaviour for at-least-3.4 to pass.
Walk-back
If 3.4 surfaces blocking issues, ADR 0006 already permits deferring deletion: keep src/sdf_parser.rs alive but tagged LEGACY — superseded by IR consumer; remove before first release, and ship preprocessing-only for the interim. The runtime SDF subprocess wrapper covers the contributor ergonomics. The native Rust SDF parser rewrite (Phase 3 in the original phasing) is the durable replacement.
Non-goals
- A native Rust SDF parser. (Original ADR 0006 Phase 3; not part of WS3.)
- Validating SDF round-trip equivalence between the old parser and OpenSTA. (CI corpus test in WS4/WS2.5 covers this when fixtures exist.)
- Refactoring the broader flatten.rs structure beyond what migration requires.
References
- ../adr/0002-timing-ir.md: IR contract.
- ../adr/0006-sdf-preprocessing-model.md: interim runtime subprocess + release-time cutover.
- phase-0-ir-and-oracle.md: WS3 row.
- ws2-opensta-to-ir.md: produces the IR this consumer reads.
- crates/opensta-to-ir/: subprocess target for the interim --timing-sdf hook.
- crates/timing-ir/: IR library + builders for test fixtures.
Last updated: 2026-04-28
Plan — WS3 follow-up: re-add cosim --sdf via opensta-to-ir
Status: Deferred. Tracked here so future work can pick it up.
Predecessor: WS3 phase 3.4 (deletes hand-rolled src/sdf_parser.rs).
Background
Phase 3.4 deleted src/sdf_parser.rs. The jacquard sim subcommand kept
SDF input working (Phase 3.3 wired --sdf through
setup::load_sdf_via_opensta_to_ir, an internal subprocess wrapper that
calls the opensta-to-ir crate to convert SDF→IR). The jacquard cosim
subcommand chose Option B of the phase 3.4 handoff: drop --sdf
entirely rather than thread --liberty through. As a result, cosim now
only accepts pre-converted IR via --timing-ir.
What was removed in 3.4
- CosimArgs::sdf, sdf_corner, sdf_debug CLI fields (src/bin/jacquard.rs).
- The config.timing.sdf_file / sdf_corner fallback path in src/sim/cosim_metal.rs::run_cosim.
- TimingSimConfig::sdf_file and sdf_corner JSON fields (src/testbench.rs).
User-facing migration (current state)
The tests/mcu_soc/ cosim flow that used to load SDF via the testbench
config now needs an explicit pre-conversion step.
Feed 6_final.v directly to opensta-to-ir
Retraction (2026-05-18). Earlier versions of this section
recommended feeding tests/mcu_soc/data/top_synth.v (post-synthesis,
pre-P&R) to opensta-to-ir to dodge a parse error on 6_final.v's
chipflow integration wrapper. That was wrong: top_synth.v is
missing the ~236K cells P&R inserts (clkbuf_regs_* CTS buffers,
ANTENNA_* diodes, delaybuf_*, fillers), so OpenSTA silently drops
every SDF entry referencing a P&R-inserted cell and the resulting IR
is missing the bulk of the design's timing. The "28162 matched /
2090 unmatched" verification log we celebrated at the time measured
jtir records against the cosim-loaded netlist, not SDF coverage
against the jtir — high surface match rate, materially incomplete
IR. See ADR 0009 (OpenSTA Verilog reader input constraints) for
the broader rule.
opensta-to-ir now transparently extracts module <--top> from each
input file before invoking OpenSTA (implementation in
crates/opensta-to-ir/src/verilog_filter.rs). For the chipflow
mcu_soc case this strips the openframe_project_wrapper module
automatically; the same handling kicks in for any LibreLane +
wafer.space user (hazard3 and future tapeouts) whose final netlist
carries an integration wrapper around the structural top.
# Convert SDF → IR once. Pass 6_final.v directly; the wrapper module
# is dropped automatically.
opensta-to-ir \
--liberty /path/to/sky130_fd_sc_hd__tt_025C_1v80.lib \
--verilog tests/mcu_soc/data/6_final.v \
--sdf tests/mcu_soc/data/6_final.sdf \
--top top \
--output tests/mcu_soc/data/6_final.jtir
# Run cosim with the pre-converted IR. Cosim loads 6_final.v (the
# wrapper) because that's what carries GPIO ports. The IR consumer's
# hierarchy-prefix detection strips the `top_inst/` prefix from the
# wrapper's cell paths so they match the IR's instance names.
cargo run -r --features metal --bin jacquard -- cosim \
tests/mcu_soc/data/6_final.v \
--config tests/mcu_soc/sim_config_sky130.json \
--top-module openframe_project_wrapper \
--timing-ir tests/mcu_soc/data/6_final.jtir
tests/mcu_soc/sim_config_sky130.json no longer carries sdf_file /
sdf_corner (the fields would be silently ignored if added back;
cosim does not consume them).
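The prefix handling mentioned in the cosim command comment (dropping top_inst/ so wrapper cell paths line up with IR instance names) could be applied along these lines. This only sketches the stripping step for an already-known prefix; how the real heuristic detects the prefix is not shown, and strip_hierarchy_prefix is a hypothetical name.

```rust
/// Strip `prefix` plus one '/' separator from the front of a cell path.
/// Paths that do not start with exactly that hierarchy level are returned
/// unchanged, so "top_inst2/u1" is not mistaken for a child of "top_inst".
pub fn strip_hierarchy_prefix<'a>(path: &'a str, prefix: &str) -> &'a str {
    path.strip_prefix(prefix)
        .and_then(|rest| rest.strip_prefix('/'))
        .unwrap_or(path)
}
```

Requiring the separator after the prefix is the important detail: a plain string-prefix test would also match sibling instances whose names merely begin with the same characters.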
Events-reference comparison: nuances
tests/mcu_soc/events_reference.json was wired into the sky130 cosim
config as part of phase 3.4 verification. End-to-end pipeline result
on a 3M-tick run:
- 67 UART bytes captured; the reference's 155 UART events end at
  timestamp 4,187,182. All 67 captured payloads match the reference's
  leading bytes (decoded UART output:
  ....: nyaa~!\nSoC type: CA7F100F\nFlash ID: CA7CA7FF\nQuad mode).
  No payload divergence.
- 15 non-UART entries in the reference (cxxrtl-emitted SPI deselect
  events with payload: "") are filtered out at parse time by the
  tolerant deserializer in cosim_metal.rs::run_cosim. Without that
  filter the comparison panicked on the first SPI entry.
chipflow's num_steps and timestamp are edge-counted
Retraction. Earlier drafts of this section claimed Jacquard's
--max-cycles counts half-cycles. That was a misdiagnosis based on
reading MultiClockScheduler::new (which does emit per-edge raw
entries) without noticing the pairing layer at
src/sim/cosim_metal.rs:2604-2675 that collapses them into one
paired buffer per cycle. Today, --max-cycles N correctly counts
N full clock cycles: each cosim tick does one fall-edge dispatch
plus one rise-edge dispatch and DFFs capture once per tick. Verified
via --stimulus-vcd trace (5 ticks → simulated time spans 0–200000
ps for a 40 ns period clock, exactly 5 cycles).
The actual unit difference vs chipflow's cxxrtl harness:
- chipflow's num_steps is the count of tick() calls; each tick() bumps ++timestamp twice (once after the negedge dispatch, once after the posedge), so the events_reference.json timestamp field counts clock edges (a full cycle = posedge-to-posedge = 2 edges). The harness:

      auto tick = [&]() {
          {{interface}}.step(timestamp);
          top.clk.set(false);
          agent.step();
          ++timestamp; // post-negedge (odd)
          top.clk.set(true);
          agent.step();
          ++timestamp; // post-posedge (even)
      };
      for (int i = 0; i < num_steps; i++)
          tick();

  See chipflow-lib/chipflow/common/sim/main.cc.jinja:32-74.
- The half-tick timestamp is an intentional design, not a bug: parity tags each event with the clock phase it fired on (useful for verification of async paths).
- chipflow's num_steps therefore doubles as an edge budget: 3 M num_steps = 3 M edges = 1.5 M full clock cycles.
To compare a Jacquard cosim run against today's events_reference.json,
divide reference timestamps by 2 to convert edges → cycles. Empirical
spot-check on mcu_soc/sky130: byte-0 in Jacquard at --max-cycles 200000 arrives at tick 28682; reference timestamp 58290 / 2 = 29145
cycles; ratio 0.984× (simulators agree on simulated time within 2%).
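The spot-check arithmetic above can be reproduced with two small helpers. The function names are hypothetical; the numbers in the usage note are the ones quoted in this section.

```rust
/// Convert a chipflow edge-counted timestamp into full clock cycles
/// (two edges per cycle, integer division).
pub fn edges_to_cycles(edge_timestamp: u64) -> u64 {
    edge_timestamp / 2
}

/// Ratio of a Jacquard tick (already in cycles) to a reference timestamp
/// (in edges). Values near 1.0 mean both simulators agree on simulated time.
pub fn tick_ratio(jacquard_tick: u64, reference_edge_timestamp: u64) -> f64 {
    jacquard_tick as f64 / edges_to_cycles(reference_edge_timestamp) as f64
}
```

With the byte-0 numbers from the text, edges_to_cycles(58290) is 29145 and tick_ratio(28682, 58290) comes out at roughly 0.984.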
The earlier "67 of 155 events captured" gap is not a budget
issue — chipflow drives input stimulus via design/tests/input.json
and reference events 69+ require those driven inputs. The input-stimulus
dispatcher was added in commit 4a1a989, and the mcu_soc/sky130 cosim
now matches the cut-down chipflow reference 1:1 (90/90 events).
The earlier "Jacquard ~14% slower per byte than cxxrtl" claim relied on a phantom half-cycle correction; it is also retracted. There is no rate gap to explain at this level.
Done: --max-cycles renamed to --max-clock-edges (commit 46b5c28)
Cosim's internal granularity moved from full clock cycles to scheduler
edges, aligning Jacquard's CLI 1:1 with chipflow's num_steps and
unlocking per-edge event timestamping. Section retained for context on
the unit conventions captured above.
Option A — restore cosim --sdf ergonomics
When this becomes a priority, mirror the jacquard sim surface:
Changes
- Add --liberty to CosimArgs (src/bin/jacquard.rs). Plumb it through DesignArgs::liberty (currently hardcoded None in cmd_cosim). Also pass through --top-module if it is not already.
- Add --sdf, --sdf-corner, --sdf-debug back to CosimArgs. Make them mutually exclusive with --timing-ir (clap conflicts_with = "timing_ir").
- Re-add TimingSimConfig::sdf_file / sdf_corner (optional), plus a new liberty_file field for the OpenSTA invocation. Update tests/mcu_soc/sim_config_sky130.json to use the new shape.
- Restore the cosim config-file fallback: in src/sim/cosim_metal.rs::run_cosim, when timing is not yet enabled and the config provides SDF + Liberty paths, call setup::load_sdf_via_opensta_to_ir. Match the priority order: CLI > config.timing.* > nothing.
- Update the --output-vcd error message to mention --sdf again.
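The priority order in the fallback item (CLI > config.timing.* > nothing) reduces to an Option chain. This is a sketch of that one rule only; SdfSource and resolve_sdf are hypothetical names, and since Option A is deferred no such function exists in the codebase today.

```rust
use std::path::PathBuf;

/// Where the SDF path came from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum SdfSource {
    Cli(PathBuf),
    Config(PathBuf),
}

/// Resolve the SDF input: a CLI value wins over a config-file value,
/// and None means timing from SDF stays disabled.
pub fn resolve_sdf(cli: Option<PathBuf>, config: Option<PathBuf>) -> Option<SdfSource> {
    cli.map(SdfSource::Cli)
        .or_else(|| config.map(SdfSource::Config))
}
```

Returning the source alongside the path makes it easy to report in diagnostics which of the two inputs was actually used.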
Out of scope for Option A
- Rebuilding a hand-rolled SDF parser. (See ADR 0006 — the durable replacement is the native Rust SDF→IR converter, tracked separately as Phase 3 in the original phasing.)
- Adding cosim-specific corner-selection beyond what jacquard sim already offers. The IR's min/typ/max triple is selected via ir_corner0_max (currently always max); changing that is a separate concern that affects both subcommands.
Verification
After Option A lands:
cargo build --features metal
cargo test --lib
# Manual smoke test of the previous mcu_soc workflow:
cargo run -r --features metal --bin jacquard -- cosim \
tests/mcu_soc/data/6_final.v \
--config tests/mcu_soc/sim_config_sky130.json \
--liberty <path>/sky130.lib \
--sdf tests/mcu_soc/data/6_final.sdf
Should produce equivalent results to the pre-3.4 hand-rolled-parser
path within the IR's representational bounds (single-value
interconnect delays, max corner selection).
Walk-back
If Option A is never picked up before first release, the existing IR-only
cosim surface is fine — contributors using SDF can pre-convert via
opensta-to-ir and pass the resulting .jtir. The follow-up exists as a
contributor-ergonomics improvement, not a correctness gap.
Multi-clock and stimulus architecture — exploratory roadmap
Status: Captured architectural thinking. Most phases here are demand-driven and will only be picked up when a real-world workload requires them. Phases 1 and 2 may be worth scheduling on their own merits in a future release; the rest are written down so the design space is on record when the need appears.
This is a design-space doc, not a scheduling doc. It complements
post-phase-0-roadmap.md (which schedules committed work) by capturing the
architecture for two related areas — multi-clock-domain support and stimulus
generation — that today have working but limited implementations.
Why now
The conversation that produced this doc was about supporting cosim against external testbench environments (UVM, CocoTB) and external clock sources (PHY, audio, DFS). Two observations crystallised the architecture:
- Real designs partition into large synchronous islands with thin boundaries. External-clock and DFS scenarios look intractable until you notice that <1 % of nets typically cross domains; the bulk of the design is batchable inside one island.
- Stimulus generation and stimulus consumption don't have to share a loop. Today cosim couples the testbench tick-by-tick to the GPU dispatch. Decoupling them — via streaming or full precompute — turns the GPU from a ping-ponging coprocessor into a stream consumer.
Both observations point at architecture changes that compose cleanly with each other, with the existing multi-clock plumbing, and with the existing X-prop / timing-arrival infrastructure.
What exists today
Worth pinning down so the gap is precise:
- Multi-clock-domain functional support. MultiClockScheduler in src/sim/cosim_metal.rs:1347 builds a tick-by-tick edge schedule over the LCM of all domains' periods (with GCD granularity). DFFs are tagged by clock domain via clock_pin2aigpins in src/aig.rs:209. Each scheduler tick asserts only the firing domains' posedge/negedge flag bits; the GPU kernel gates DFF write-back on those flags, so non-firing domains' DFFs hold.
- LCM constraint. The scheduler asserts schedule_len <= 1_000_000 (cosim_metal.rs:1376). Commensurable periods (PLL-derived) work; truly non-commensurable external clocks (audio, USB-recovered, DFS-mid-flight) hit the cap.
- Cosim stimulus. InputDispatcher (src/sim/input_stim.rs) consumes a chipflow-compatible wait/action/stop JSON command list. Peripheral models (src/sim/models/) drain queued actions per edge and emit events. Generation is interleaved with the GPU dispatch loop: every tick (or every few ticks) round-trips through the host.
- VCD replay path. jacquard sim already runs from a precomputed input VCD with no host-side reactive logic. This is, in effect, the "Level 1" precomputed-edge mode described below; the gap is between cosim's reactive loop and sim's flat replay, not in the kernel itself.
- CDC checking. None today. SDF setup/hold checks exist (src/timing_report.rs) but are not wired through any CDC-specific path.
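To see why the LCM constraint bites, here is a rough model of the schedule length. It assumes the schedule has LCM(periods) / GCD(periods) entries, which is one reading of "LCM of all domains' periods with GCD granularity"; the real scheduler may count half-period edges differently. The helper names are hypothetical.

```rust
/// Cap quoted in the text (cosim_metal.rs asserts schedule_len <= 1_000_000).
pub const MAX_SCHEDULE_LEN: u64 = 1_000_000;

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Modelled schedule length: LCM of all periods divided by their GCD.
/// Returns None for empty input, a zero period, or arithmetic overflow.
pub fn schedule_len(periods_ps: &[u64]) -> Option<u64> {
    let (&first, rest) = periods_ps.split_first()?;
    if first == 0 {
        return None;
    }
    let mut g = first;
    let mut l = first;
    for &p in rest {
        if p == 0 {
            return None;
        }
        g = gcd(g, p);
        l = (l / gcd(l, p)).checked_mul(p)?;
    }
    Some(l / g)
}

/// True when the modelled schedule fits under the cap.
pub fn fits_lcm_schedule(periods_ps: &[u64]) -> bool {
    matches!(schedule_len(periods_ps), Some(n) if n <= MAX_SCHEDULE_LEN)
}
```

Under this model a 40 ns and a 25 ns clock need only 40 entries, while pairing a 40 ns clock with a 22,675,737 ps period (roughly 44.1 kHz) has GCD 1 and blows far past the cap.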
Architecture: two orthogonal axes
The work falls cleanly into two independent dimensions.
Axis 1 — Spatial: synchronous islands with thin boundaries
A static analysis pass partitions the AIG into islands: maximal connected sets of gates whose transitive fanin/fanout stays inside one clock domain. Whatever's left is the boundary — combinational gates and DFFs whose data cones cross domains. In real designs the boundary is small, dominated by synchronizers (2FF), async FIFO control, and handshake glue.
Per-island execution lets the GPU:
- Skip evaluation of an island whose state hasn't changed.
- Batch K consecutive ticks of a fast island into one kernel launch when the slow island has no edges in the window.
- Treat the boundary as a small mailbox (slots written by the source island and read by the destination island) rather than a global state vector.
This is essentially functional partitioning for parallel discrete-event simulation, but the GPU dataflow model gets more benefit than a CPU sim because batched dataflow is exactly what a fast island's run-ahead window wants.
Axis 2 — Temporal: stimulus generation decoupled from consumption
The cosim host loop is the throughput floor today. Decoupling has three levels:
- Replay: the testbench has already produced a complete input VCD; the
  GPU just plays it back. Today's jacquard sim is this case.
- Streaming buffer: testbench runs in a separate thread feeding a ring
  buffer of (tick, input_op) tuples. GPU consumes batches. As long as the
  producer keeps up on average, the GPU never stalls. Works because most
  ticks have no input change and peripheral state machines run far slower
  than the kernel.
- Record-and-replay with divergence detection: pass 1 runs full cosim and
  records every input transition; pass 2 replays at line-rate while
  checksumming outputs against the recorded run. If outputs diverge, abort
  and fall back. Wins decisively for regression CI where most runs confirm
  "nothing externally observable changed".
Phase breakdown
Each phase is independently shippable. The phase numbering here is local to
this doc and should not be confused with the timing-IR phase numbering in
post-phase-0-roadmap.md.
| Phase | Topic | Trigger |
|---|---|---|
| MC.1 | Static island partitioner (analysis only, emits metadata) | Standalone-useful for CDC reporting; could land in a future release without further work |
| MC.2 | Min-heap multi-clock scheduler (replaces LCM precompute) | First non-commensurable external clock or DFS use case lands |
| MC.3 | Streaming stimulus buffer (decouples testbench thread from kernel) | First workload where cosim CPU↔GPU round-trip is measured as the bottleneck |
| MC.4 | Per-island kernel dispatch + multi-rate batching | MC.1+MC.2 in place; first multi-domain workload large enough that whole-AIG eval per tick is wasteful |
| MC.5 | Record-and-replay with divergence detection | Regression CI throughput becomes a release blocker |
| MC.6+ | Speculation staircase, AOT trace compilation, profile-guided kernel specialization | Demand-driven; deferred until measurement shows residual sync overhead after MC.4 |
MC.1 — Static island partitioner
Walk the AIG; for each gate compute the set of clock domains its transitive
fanin/fanout touches. Tag gates as island-internal (fanin and fanout both
inside one domain) or boundary (touches more than one domain on either
side). Emit per-island gate counts and a list of boundary gates as metadata
on the existing FlattenedScript.
What it enables on its own, even with no runtime change:
- Diagnostic: "this design has 14 inter-domain combinational paths from
  audio_clk→core_clk and 2 the other way". Useful for designers reviewing
  CDC structure.
- Data structure that MC.2 / MC.4 / CDC reporting all need.
- Sanity-check on the "<1 %" boundary-surface assumption for the workloads that motivate further phases.
Classification policy for derived signals (e.g. a sync-FIFO read pointer in
clock_b qualified by an output of a sync chain from clock_a): classify
aggressively. Only gates whose direct fanin includes pins from multiple
domains are boundary; downstream gates fed by a domain-tagged
pre-synchronizer output inherit that domain. This pushes the boundary in as
close to the structural CDC crossing as possible and is what makes the
"<1 %" claim hold on real designs — a lazy classification that propagated
"multi-domain" forward through every downstream cone would yield a
boundary surface that swallowed half the design.
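The aggressive policy can be sketched on a toy graph. Everything here is an assumption made for illustration: Node and Class are hypothetical types, nodes must be listed in topological order, DFF outputs are the only domain-tagged sources, and a gate fed by a boundary gate is itself treated as boundary until a DFF re-tags the signal. The real pass would work on the AIG's DriverType.

```rust
use std::collections::BTreeSet;

/// Toy netlist node (hypothetical; not the AIG representation).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    /// Output pin of a DFF, tagged with the DFF's clock domain.
    DffOutput { domain: usize },
    /// Combinational gate; `fanin` holds indices of earlier nodes.
    Gate { fanin: Vec<usize> },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Island(usize),
    Boundary,
}

/// Classify by direct fanin only. Panics if a fanin index refers to a
/// node that has not been classified yet (input not in topological order).
pub fn classify(nodes: &[Node]) -> Vec<Class> {
    let mut classes: Vec<Class> = Vec::with_capacity(nodes.len());
    for node in nodes {
        let class = match node {
            Node::DffOutput { domain } => Class::Island(*domain),
            Node::Gate { fanin } => {
                let mut domains = BTreeSet::new();
                let mut mixed = false;
                for &i in fanin {
                    match classes[i] {
                        Class::Island(d) => {
                            domains.insert(d);
                        }
                        Class::Boundary => mixed = true,
                    }
                }
                match (mixed, domains.len()) {
                    // Exactly one domain on every direct fanin: inherit it.
                    (false, 1) => Class::Island(*domains.iter().next().unwrap()),
                    // Several domains, a boundary fanin, or no tagged fanin.
                    _ => Class::Boundary,
                }
            }
        };
        classes.push(class);
    }
    classes
}
```

The point the sketch makes is that a synchronizer's DFF output re-enters the graph as a clean single-domain source, so logic downstream of it lands back in an island instead of dragging the "multi-domain" tag through its whole cone.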
Code locations: extends aig.rs (domain analysis on DriverType) and
flatten.rs (metadata on FlattenedScriptV1). No kernel changes.
MC.2 — Min-heap multi-clock scheduler
Replace MultiClockScheduler's precomputed Vec<TickEdges> with a min-heap
of (next-edge-time, domain) pairs. Pop the next edge, dispatch, push the
domain's next edge back. No LCM constraint; non-commensurable periods are
free. DFS support falls out: when the DUT writes a clock-control register,
the host updates the heap entry's period.
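A minimal sketch of the min-heap idea, using the standard library's BinaryHeap with Reverse to get earliest-first ordering. HeapScheduler and its methods are hypothetical names; the sketch assumes one edge per period with the first edge one period after time zero, and it models DFS by changing the period used the next time a domain is re-armed.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical heap-backed scheduler; not the MultiClockScheduler API.
pub struct HeapScheduler {
    /// Min-heap of (next edge time in ps, domain index).
    heap: BinaryHeap<Reverse<(u64, usize)>>,
    /// Current period per domain in ps; mutable to model DFS.
    periods_ps: Vec<u64>,
}

impl HeapScheduler {
    pub fn new(periods_ps: Vec<u64>) -> Self {
        assert!(periods_ps.iter().all(|&p| p > 0), "periods must be non-zero");
        let heap: BinaryHeap<_> = periods_ps
            .iter()
            .enumerate()
            .map(|(domain, &p)| Reverse((p, domain)))
            .collect();
        Self { heap, periods_ps }
    }

    /// Pop the earliest edge and re-arm that domain one period later.
    pub fn next_edge(&mut self) -> Option<(u64, usize)> {
        let Reverse((t, domain)) = self.heap.pop()?;
        self.heap.push(Reverse((t + self.periods_ps[domain], domain)));
        Some((t, domain))
    }

    /// DFS hook: edges armed after this call use the new period.
    pub fn set_period(&mut self, domain: usize, period_ps: u64) {
        assert!(period_ps > 0, "period must be non-zero");
        self.periods_ps[domain] = period_ps;
    }
}
```

Nothing in this loop depends on the periods sharing a common factor, which is why the LCM cap disappears. Edges that coincide in time come out as separate pops, ordered by domain index, so a caller that wants one dispatch per instant would need to merge them.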
DFS hook design: explicit, not generic signal-watching. The cosim config
declares (control_signal, period_table) pairs; the host polls the named
bit each tick (cheap — one bit) and updates the heap. Generic
"call-back-on-arbitrary-signal" is rejected as too coupled.
Code locations: MultiClockScheduler::new and build_edge_ops in
cosim_metal.rs. Same per-domain flag emission, different scheduling
backend.
MC.3 — Streaming stimulus buffer
InputDispatcher becomes a trait; today's FileDispatcher is one
implementation. New implementations:
- ThreadedDispatcher: runs peripheral models on a separate thread; emits (tick, input_op) into a lock-free SPSC ring buffer; GPU loop consumes batches.
- StreamDispatcher: same shape but the producer is a JSON-lines stream over a Unix socket / stdio (this is also the bridge to UVM/CocoTB peer testbenches).
Latency budget: the producer must be at least one tick ahead of the consumer. For transaction-level workloads this is easy (peripheral state machines run orders of magnitude slower than the GPU). For sub-cycle reactive loops it isn't, and those workloads stay on the synchronous path.
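The producer/consumer split can be prototyped with the standard library alone. This sketch swaps the lock-free SPSC ring buffer named above for std::sync::mpsc::sync_channel, a bounded channel that gives the same back-pressure behaviour but is not lock-free. InputOp, spawn_producer, and drain_batch are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// One queued stimulus operation (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct InputOp {
    pub tick: u64,
    pub pin: u32,
    pub value: bool,
}

/// Spawn a producer thread that streams `ops` (sorted by tick) through a
/// bounded channel; the producer blocks whenever the buffer is full.
pub fn spawn_producer(ops: Vec<InputOp>, capacity: usize) -> Receiver<InputOp> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for op in ops {
            if tx.send(op).is_err() {
                break; // consumer went away
            }
        }
    });
    rx
}

/// Collect every op scheduled at or before `up_to_tick`. The first op
/// beyond the window is parked in `pending` for the next call.
pub fn drain_batch(
    rx: &Receiver<InputOp>,
    pending: &mut Option<InputOp>,
    up_to_tick: u64,
) -> Vec<InputOp> {
    let mut batch = Vec::new();
    loop {
        let op = match pending.take() {
            Some(op) => op,
            None => match rx.recv() {
                Ok(op) => op,
                Err(_) => break, // producer finished and channel drained
            },
        };
        if op.tick > up_to_tick {
            *pending = Some(op);
            break;
        }
        batch.push(op);
    }
    batch
}
```

The blocking recv is where the latency budget shows up: if the producer has not yet emitted an op past the window, the consumer stalls until it does, which is exactly the case the synchronous path is kept for.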
Code locations: refactor input_stim.rs around a trait; new module for
ring-buffer plumbing; cosim main loop drains a batch per dispatch instead of
one tick.
MC.4 — Per-island kernel dispatch + multi-rate batching
Build per-island execution scripts (and one boundary script) from the metadata MC.1 produces. Cosim main loop becomes:
    loop {
        let (next_t, domain) = scheduler.peek();
        let lookahead = scheduler.next_other_domain_edge(domain) - now;
        let edges_in_window = lookahead / domain.period;
        dispatch(island_script[domain], edges = edges_in_window);
        dispatch(boundary_script); // only if boundary signals changed
        advance_clock(now + edges_in_window * domain.period);
    }
Boundary mailbox lives in shared state-buffer slots that the source island's script writes and the destination island's script reads. Repcut continues to partition each island's script across GPU blocks independently.
Tight-boundary gates (combinationally fed by both domains) force a sync point on every edge of either side; MC.1's metadata identifies these so the runtime knows when batching can extend.
MC.5 — Record-and-replay with divergence detection
Add --record-stimulus to cosim that emits a complete tick-by-tick input
VCD and a per-tick output checksum. Add --replay-stimulus to sim (or a
new mode) that consumes the VCD, runs at line-rate, and verifies the
checksum each batch.
Divergence handling is two-tier, not just abort:
- Mismatch in watched signals (the existing cosim signals_of_interest set, or a --watch CLI argument) → abort and require re-recording. This is the genuine "the design's externally observable behaviour changed" case: the recording is now stale and replay is unsafe.
- Mismatch in unwatched signals → warn-and-continue against the recorded transitions. Internal microarchitectural changes that don't move the observable surface are normal during development; aborting on them defeats the purpose of accelerating regression CI, where most runs exist to confirm "nothing externally observable changed".
The watchset is the user-visible policy lever — it specifies what "externally observable" means for this design. Default to the cosim output signals (the natural CI invariant) plus any user-declared checkpoint signals.
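The two-tier policy can be sketched as a pure function over the set of mismatched signal names. ReplayVerdict and classify_divergence are hypothetical names, and the sketch assumes the per-batch checksum failure has already been narrowed down to individual signals.

```rust
use std::collections::HashSet;

/// Outcome of comparing one replay batch against the recording.
#[derive(Debug, PartialEq, Eq)]
pub enum ReplayVerdict {
    /// Nothing diverged.
    Match,
    /// Only unwatched (internal) signals diverged: log and keep going.
    WarnAndContinue(Vec<String>),
    /// At least one watched signal diverged: the recording is stale.
    Abort(Vec<String>),
}

/// Classify a batch's mismatches against the user's watchset.
pub fn classify_divergence(mismatched: &[&str], watchset: &HashSet<&str>) -> ReplayVerdict {
    let (watched, unwatched): (Vec<&str>, Vec<&str>) = mismatched
        .iter()
        .copied()
        .partition(|s| watchset.contains(s));

    if !watched.is_empty() {
        ReplayVerdict::Abort(watched.into_iter().map(String::from).collect())
    } else if !unwatched.is_empty() {
        ReplayVerdict::WarnAndContinue(unwatched.into_iter().map(String::from).collect())
    } else {
        ReplayVerdict::Match
    }
}
```

A watched mismatch takes priority even when unwatched signals also diverged, since the abort case is the one that invalidates the recording.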
Useful primarily as a regression-CI accelerator. Doesn't help one-off runs.
Cross-test sharing. A single design accumulates many test cases. The natural extension of record-and-replay is to share the design-side specialized kernel across all tests in the suite and vary only the stimulus recording. For a suite of N tests against one design, recording costs N× pass-1 (one per test, on demand or in parallel) but replay costs N× line-rate-kernel-launches sharing the same compiled state-buffer layout. That's a multiplicative win on top of per-test record-and-replay and is the actual leverage point for full-suite CI throughput.
MC.6+ — Deferred sophistication
Documented now so the design space is on record:
- Speculation staircase for hot boundaries: value prediction → protocol pattern recognition → control-slice reachable-set enumeration → full case enumeration. Each tier larger and cheaper-to-skip. Add a "case" dimension to the kernel dispatch only if measured sync overhead after MC.4 justifies it.
- AOT trace compilation: when stimulus is fully known (replay mode), compile the schedule offline — fold constant inputs into AIG constants, merge no-op ticks, sort transitions by domain. Profile-guided specialization for designs with lots of "configured once at boot" inputs. Composes directly with MC.5: a recording is a complete stimulus trace, so the AOT compiler can fold every input value into the kernel unconditionally. The resulting binary is valid only until either the design or the recording changes, so the lifecycle model is "compile per (design SHA, recording SHA) pair, cache for the test session, invalidate on either source changing". Acceptable cost for a 100×-replay regression run; not for one-off interactive sim.
- CDC verification mode: jitter injection on coincident edges and
random X-injection on detected async-source paths. Reuses MC.1's
boundary metadata and existing X-prop infrastructure. Distinct from
static CDC checking (Spyglass, Real Intent), which is explicitly out
of scope — that's a different product. The jitter-injection half is
designed in ADR 0012 and partly
built; remaining work is tracked in
issue #92 /
cdc-jitter-completion.md. X-injection stays deferred until MC.1 lands.
Out of scope (explicit non-goals)
These come up adjacent and are worth being clear about:
- Pin-level VPI / GPI fidelity. Implementing enough VPI for unmodified cocotb / SystemVerilog testbenches. The surface area is enormous and Jacquard would be lying about delta cycles, NBA regions, #delay semantics, and X-propagation behaviour. Use transaction-level peer protocols (the natural extension of input.json over a socket) instead.
- Metastability simulation. No RTL simulator does this; CDC verification is structural/formal (Spyglass, JasperGold-CDC, Real Intent) and a separate product.
- Structural CDC checking (synchronizer recognition rules, gray-code analysis). Different product. MC.1's boundary metadata enables a light diagnostic but not a verification flow.
- DUT-internal #delay. Requires an event-driven kernel; destroys the batched dataflow that gives Jacquard its speedup. Permanently unsupported.
- Async resets / latches in DUT. Same reason. Permanently unsupported (already documented in CLAUDE.md).
Implementation triggers
When to revisit and pull which phase off the shelf:
| Trigger | Pulls |
|---|---|
| First user workload with non-commensurable external clocks (audio, USB, DFS) | MC.2 |
| First UVM/CocoTB integration request reaches engineering scoping | MC.3 |
| User-visible CDC reporting requested | MC.1 |
| Multi-domain workload measurably bottlenecked on whole-AIG-per-tick eval | MC.1 + MC.4 |
| Regression CI total time exceeds release tolerance | MC.5 |
| Post-MC.4 measurement shows boundary-sync overhead >10 % | MC.6 speculation tier 1 (value prediction) |
Why MC.1 and MC.2 may be worth doing standalone
The user observation in the originating discussion was that MC.1 and MC.2 are worth carrying in a future release on their own merits, ahead of any specific workload demand. Rationale:
- MC.1 has standalone diagnostic value. A "boundary report" for any multi-clock design — count of cross-domain combinational paths, location of inter-domain DFF samples — is useful to any user reviewing CDC structure, independent of whether the runtime ever uses the partition.
- MC.2 lifts a real correctness limit. The current LCM cap silently fails on legitimate designs (any audio-clock SoC, anything with DFS). Replacing precompute with a min-heap is a small, contained change that removes a category of "your design doesn't fit" errors.
- Both are foundational for the rest of the architecture. Doing them early means later phases pick up cleanly.
If MC.1 + MC.2 ship in isolation, they don't commit Jacquard to any of the later phases. Each later phase remains demand-driven.
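To illustrate why a min-heap removes the LCM cap, here is a minimal sketch of a next-edge scheduler built on std's BinaryHeap. EdgeScheduler and next_edge are hypothetical names, not the existing MultiClockScheduler API, and the sketch assumes each domain is described by a single non-zero edge period in picoseconds with all domains starting at time 0.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Produces clock edges in time order without precomputing a schedule,
/// so periods need not share a small common multiple.
pub struct EdgeScheduler {
    /// Min-heap of (next edge time, domain index).
    heap: BinaryHeap<Reverse<(u64, usize)>>,
    periods_ps: Vec<u64>,
}

impl EdgeScheduler {
    /// All domains have their first edge at t = 0 (an assumption).
    pub fn new(periods_ps: &[u64]) -> Self {
        assert!(periods_ps.iter().all(|&p| p > 0), "periods must be non-zero");
        let heap = (0..periods_ps.len()).map(|d| Reverse((0u64, d))).collect();
        Self { heap, periods_ps: periods_ps.to_vec() }
    }

    /// Pop the earliest edge and schedule that domain's following edge.
    /// Ties are broken by the lower domain index.
    pub fn next_edge(&mut self) -> Option<(u64, usize)> {
        let Reverse((t, domain)) = self.heap.pop()?;
        self.heap.push(Reverse((t + self.periods_ps[domain], domain)));
        Some((t, domain))
    }
}
```

Memory stays proportional to the number of domains regardless of how the periods relate, which is the property a precomputed LCM table lacks for audio or DFS clocks.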
References
- Current multi-clock infrastructure: src/sim/cosim_metal.rs:855 and following (ClockDomainFlags, MultiClockScheduler).
- Per-DFF clock-domain tagging: src/aig.rs:204 (clock_pin2aigpins).
- Cosim stimulus protocol: src/sim/input_stim.rs, src/sim/models/mod.rs.
- Existing precomputed-edge path (replay): jacquard sim and src/sim/vcd_io.rs.
- Adjacent committed roadmap: docs/plans/post-phase-0-roadmap.md.
- Synchronous-only constraint and rationale: CLAUDE.md "Key limitation".
Declarative cell metadata — Tier 1 + minimal Tier 2 + port mapping
Status: Implemented — historical record. Tier 1, minimal Tier 2, and the port-mapping schema have all landed. ADRs:
- 0010 — Declarative cell metadata for PDK enablement (Tier 1 + minimal Tier 2)
- 0011 — RAM port-mapping schema
(the port-mapping extension originally deferred by ADR 0010)
Issues: #67, #80.
Driving designs: the wafer.space chip_top.pnl.v blocked on gf180mcu_ocd_ip_sram__sram1024x8m8wm1, then the JTAG-DM workflow in PR #78 surfacing the need for real RAM backing storage.
Scope (as shipped)
Originally scoped to one slice (Tier 1 + minimal Tier 2 — opaque
kind = "ram" with no port resolution). Expanded mid-flight when
the JTAG-DM workflow (PR #78) surfaced the need for explicit-port
RAMs with real backing storage:
- Tier 1: --cell-library + sverilogparse-backed pin tables (landed 2026-05-19 in PR #65/#68).
- Tier 2 minimal: kind discriminator in TOML, opaque-RAM mode (landed alongside Tier 1).
- Port-mapping schema (ADR 0011, v1.1): [cells.NAME.ram] sub-table for explicit-port RAMs with backing storage. Landed in this PR alongside SramInitConfig ELF preload (closes #80).
Deliverables
- --cell-library <PATH> CLI flag on jacquard sim, jacquard cosim. Repeatable. Each path is parsed via sverilogparse at startup; results merged into a runtime LeafPinProvider extension.
- <PATH>.cells.toml autoload + --cell-manifest <PATH> override. TOML schema as in ADR 0010 § Tier 2. Required field schema_version = "1.0". Per-cell kind discriminator, v1.0 vocabulary.
- New code path in aig.rs: after PdkVariant::classify falls through (no built-in match), consult manifest. For kind = "ram", allocate RAMBlock in opaque mode: outputs routed to X-source slots, no port resolution.
- Tests: TOML parsing unit tests; integration test exercising a synthetic kind = "ram" cell through AIG construction + sim (mini fixture, not the full tapeout design).
- Doc update: docs/adding-a-pdk.md gains a new section "Adding third-party IP via manifest", linked from existing per-PDK recipes.
Out of scope (deferred)
- Port-mapping schema ([cells.NAME.ports]). Future ADR.
- Other kind values beyond what the tapeout fixture exercises end-to-end (ram, plus filler if cheap parity demo). Adding other kinds is data-only and can land per-need.
- Migration of built-in sky130.rs / gf180mcu.rs classifiers to manifest data. Stays in this codebase as the fallback.
- build.rs pin-table scanner removal. Stays.
Phasing
| Phase | Output |
|---|---|
| P1 | --cell-library parsing + LeafPinProvider extension + tests. No AIG-construction changes yet — verify pin tables alone. |
| P2 | Manifest TOML parser + CellManifest struct + schema_version validation. Standalone unit tests. |
| P3 | aig.rs integration — manifest threaded through, new fallback path for kind = "ram" opaque mode. Add the compute_x_sources-style test exercising the new path. |
| P4 | Smoke test against a representative reduced fixture; confirm jacquard sim clears gf180mcu_ocd_ip_sram_*. The full downstream-tapeout netlist is the real-world target but not in-tree. |
| P5 | Doc update (adding-a-pdk.md); update gf180mcu-enablement.md § Follow-on cleanup to mark items 1/2/3 superseded by this work. |
Each phase is its own commit. No squashing until the spike feedback loop confirms shape.
Open questions to settle in code
- Autoload path discovery: spec says foo.v → foo.cells.toml sibling. Does that handle the multi-file library case (a.v + b.v sharing one manifest)? Probably yes: autoload each sibling, merge into the single CellManifest. Explicit --cell-manifest flag still wins for users who want a single consolidated file.
- Conflict policy: if a cell name appears both in a built-in classifier AND in a manifest, built-in wins (per ADR 0010 integration ordering). Warn on conflict to surface accidental collisions.
- Empty-library noise: parsing a .v file containing only (* blackbox *) modules with no logic should succeed without warnings, since that's the expected shape for IP libraries.
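The first two questions can be sketched in a few lines. The function and enum names are hypothetical; the sketch models the built-in classifiers and the manifest as plain sets of cell names, which is a simplification of whatever PdkVariant::classify and CellManifest actually hold.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Sibling manifest for a library file: foo.v -> foo.cells.toml.
pub fn autoload_manifest_path(library: &Path) -> PathBuf {
    library.with_extension("cells.toml")
}

/// Which source ended up classifying a cell.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CellSource {
    BuiltIn,
    Manifest,
}

/// Built-in classifiers win over the manifest; a collision is reported
/// through `warnings` rather than treated as an error.
pub fn resolve_cell(
    cell: &str,
    builtin: &HashSet<&str>,
    manifest: &HashSet<&str>,
    warnings: &mut Vec<String>,
) -> Option<CellSource> {
    match (builtin.contains(cell), manifest.contains(cell)) {
        (true, true) => {
            warnings.push(format!(
                "cell {} is in both a built-in classifier and the manifest; built-in wins",
                cell
            ));
            Some(CellSource::BuiltIn)
        }
        (true, false) => Some(CellSource::BuiltIn),
        (false, true) => Some(CellSource::Manifest),
        (false, false) => None,
    }
}
```

For the multi-file case, calling autoload_manifest_path once per --cell-library path and merging the results gives the "autoload each sibling" behaviour described above.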
Not promised
- Memory contents simulation for kind = "ram" in v1.0. Documented in ADR 0010 § "kind = ram semantics in v1.0".
- Stable opaque-RAM port routing beyond "outputs are X-source slots". The set of outputs is what sverilogparse reports; if a cell's port list changes, the routing follows.
Cosim Peripheral Models
Architecture: ADR 0013.
This plan tracks implementation work for the cosim peripheral model framework. ADR 0013 documents the architecture (two execution domains, observe-only vs bidirectional GPU patterns, ring buffers, plural config convention); this doc tracks the concrete workstreams.
Phase 1: Multi-UART (#90)
First peripheral using the plural-config + array-in-kernel conventions from ADR 0013.
Schema — src/testbench.rs
Add name: Option<String> to UartConfig. Add
uarts: Vec<UartConfig> to TestbenchConfig. Add
effective_uarts() mirroring effective_clocks():
pub fn effective_uarts(&self) -> Vec<UartConfig> {
    let mut out = self.uarts.clone();
    if let Some(ref u) = self.uart {
        out.insert(0, u.clone());
    }
    out
}
Existing "uart": {...} configs work unchanged. New form:
"uarts": [{"name": "console", ...}, {"name": "debug", ...}].
Both may coexist; uart is prepended to uarts.
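A self-contained sketch of the coexistence rule, useful as a starting point for the effective_uarts unit tests. The struct definitions here are trimmed stand-ins: only name comes from the plan, and cycles_per_bit is a placeholder field borrowed from the kernel-side struct.

```rust
/// Trimmed stand-in for the real UartConfig.
#[derive(Debug, Clone, PartialEq)]
pub struct UartConfig {
    pub name: Option<String>,
    pub cycles_per_bit: u32, // placeholder field for illustration
}

/// Trimmed stand-in for the real TestbenchConfig.
#[derive(Debug, Default)]
pub struct TestbenchConfig {
    /// Legacy singular form: "uart": {...}
    pub uart: Option<UartConfig>,
    /// Plural form: "uarts": [{...}, {...}]
    pub uarts: Vec<UartConfig>,
}

impl TestbenchConfig {
    /// Legacy `uart` (if any) first, then every `uarts` entry in order.
    pub fn effective_uarts(&self) -> Vec<UartConfig> {
        let mut out = self.uarts.clone();
        if let Some(ref u) = self.uart {
            out.insert(0, u.clone());
        }
        out
    }
}
```

Because the legacy entry always lands at index 0, a design that previously used only "uart" keeps the same channel index after adding "uarts" entries.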
Metal kernel — csrc/kernel_v1.metal
MAX_UARTS = 4. Restructure the three UART types:
#define MAX_UARTS 4
struct UartPerChannelConfig {
u32 tx_out_pos;
u32 cycles_per_bit;
};
struct UartParams {
u32 state_size;
u32 n_uarts; // replaces has_uart
u32 _pad[2];
UartPerChannelConfig channels[MAX_UARTS];
};
UartDecoderState and UartChannel structs unchanged — the device
buffers hold [MAX_UARTS] elements. gpu_io_step buffer signature
unchanged (same 6 slots); the UART decode block becomes a loop over
n_uarts.
Rust runtime — src/sim/cosim_metal.rs
- Repr structs (~line 130): update UartParams to match kernel. Add UartPerChannelConfig. Keep UartDecoderState and UartChannel unchanged.
- Config resolution (~line 2229): iterate effective_uarts().
- Buffer allocation (~line 2820): size buffers for MAX_UARTS elements. Init each UartDecoderState with last_tx=1.
- RX driver creation (~line 2544): one UartRxDriver per entry, named uart_{name} (fallback uart_{index}).
- CPU drain (~line 3990): iterate N channels with per-channel uart_read_head[i]. Label events with UART name.
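The driver-naming rule can be sketched as follows. uart_driver_names is a hypothetical helper, and rejecting more than MAX_UARTS entries is an assumption made here; the plan only states that the buffers are sized for MAX_UARTS.

```rust
/// Mirrors the kernel-side constant.
pub const MAX_UARTS: usize = 4;

/// Derive one driver name per configured UART: uart_{name} when a name
/// is given, uart_{index} otherwise.
pub fn uart_driver_names(names: &[Option<&str>]) -> Result<Vec<String>, String> {
    if names.len() > MAX_UARTS {
        // Assumed policy: refuse rather than silently drop channels.
        return Err(format!(
            "{} UARTs configured, but the kernel supports at most {}",
            names.len(),
            MAX_UARTS
        ));
    }
    Ok(names
        .iter()
        .enumerate()
        .map(|(index, name)| match name {
            Some(n) => format!("uart_{}", n),
            None => format!("uart_{}", index),
        })
        .collect())
}
```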
Verification
- cargo build --release --features metal compiles.
- cargo test --lib passes (add effective_uarts unit tests).
- Existing MCU SoC cosim CI passes unchanged (single "uart" config).
- Local smoke: temporarily edit tests/jtag_minimal/sim_config.json to use "uarts": [...] syntax, confirm identical results.
Not in scope
- Dual-UART test fixture: separate follow-up with a small 2-TX design.
- CUDA/HIP: cosim is Metal-only; no kernel changes needed.
Future phases
| Phase | Scope | Status |
|---|---|---|
| 2 | Refactor gpu_io_step toward common params/ring-buffer layout | Future |
| 3 | Multi-Flash / external RAM (bidirectional pattern) | Deferred (no use case) |
| — | Multi-JTAG | Not needed (TAP daisy-chain suffices) |
Plan: Config-driven AHB/APB bus transaction tracing
Goal
Trace AHB5, AHB-Lite, and APB3 bus transactions in cosim, compactly, without baking signal names into source. Output as CSV (machine-readable transaction table) and annotated VCD (transactions as a signal group for waveform viewers). Decode site: GPU capture + CPU protocol FSM (the kernel stays dumb; protocol semantics live in testable Rust).
Order: APB3 first (validate against the Hazard3 JTAG-DM APB DMI in
tests/jtag_minimal/), then AHB-Lite, then AHB5.
Why this shape
The existing "Wishbone bus trace" (build_wb_trace_params,
cosim_metal.rs:1277; gpu_io_step, kernel_v1.metal:1182) proves the
mechanism — a GPU observe-only peripheral that packs a compact per-tick
entry into a ring buffer only when the bus is active/changed, drained by the
CPU — but it is hardcoded to one VexRiscv-style SoC (literal names
cpu.fetch.ibus__cyc, spiflash.ctrl.wb_bus__ack, …). We generalize that
mechanism into a config-driven, protocol-aware monitor. It is observe-only
(we watch design outputs, never drive), so it fits the ADR-0013 GPU
observe-only peripheral pattern, and gets the effective_*()-style plural
config for free.
Two existing pieces are reused:
- Multi-candidate name resolution in src/sim/trace_signals.rs: handles Yosys-flattened / scalar-expanded / structural hierarchical naming. Refactor the candidate generator into a shared helper so the bus tracer binds pins the same way --trace-signals does.
- Extra-observables VCD path (emit_extra_observables, vcd_io.rs:635): the model for emitting synthesized signals into the output VCD.
The hardcoded WbTrace is left intact for now (it has a passing test); migrating it onto the general mechanism is a clean follow-up, not a prerequisite.
Design
1. Config schema — src/testbench.rs
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum BusProtocol { Apb3, AhbLite, Ahb5 }

#[derive(Debug, Clone, Deserialize)]
pub struct BusTraceConfig {
    pub name: String,
    pub protocol: BusProtocol,
    /// Hierarchical prefix; standard protocol pin names are appended.
    pub prefix: String,
    #[serde(default = "default_addr_bits")]
    pub addr_bits: usize, // 32
    #[serde(default = "default_data_bits")]
    pub data_bits: usize, // 32
    /// Optional per-pin overrides: logical pin name -> explicit net name,
    /// for designs whose pins don't follow `{prefix}{PIN}`.
    #[serde(default)]
    pub signals: HashMap<String, String>,
}
Add to TestbenchConfig:
#[serde(default)]
pub bus_traces: Vec<BusTraceConfig>,
New feature, so no singular legacy form. (effective_bus_traces() provided
for symmetry with effective_uarts(), even though it just returns the Vec.)
2. Protocol pin maps + CPU decoder — new src/sim/models/bus_trace.rs
Logical-pin tables per protocol:
- APB3: psel penable pwrite pready pslverr paddr[] pwdata[] prdata[]
- AHB-Lite: htrans[1:0] haddr[] hwrite hsize[2:0] hburst[2:0] hready hresp hwdata[] hrdata[]
- AHB5: AHB-Lite + optional hnonsec hexcl hexokay hmaster[] (resolved if present, ignored if absent)
Default net name {prefix}{pin} (lowercased), overridable via signals.
Resolution via the shared multi-candidate resolver (item 4).
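A minimal sketch of the default-name rule with overrides. default_net_name is a hypothetical helper, and it assumes "lowercased" applies to the pin name only, leaving the configured prefix untouched. The returned string would then be handed to the shared multi-candidate resolver, which is not modelled here.

```rust
use std::collections::HashMap;

/// Net name for a logical protocol pin: an explicit entry in `signals`
/// wins, otherwise `{prefix}{pin}` with the pin lowercased.
pub fn default_net_name(prefix: &str, pin: &str, signals: &HashMap<String, String>) -> String {
    match signals.get(pin) {
        Some(explicit) => explicit.clone(),
        None => format!("{}{}", prefix, pin.to_lowercase()),
    }
}
```

With prefix "soc.apb_" and pin "PADDR" this yields "soc.apb_paddr", unless signals maps "PADDR" to some other net.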
BusTraceDecoder (per bus) consumes raw captured beats and emits:
pub struct BusTransaction {
    pub tick: u64,
    pub bus: String,
    pub protocol: BusProtocol,
    pub dir: Dir,                 // Read | Write
    pub addr: u64,
    pub data: u64,
    pub resp: BusResp,            // Ok | Error
    pub burst: Option<BurstInfo>, // beat index / length for AHB
}
- APB3 FSM: GPU gates capture on psel & penable & pready (access-phase complete), so each captured beat is a complete transaction. dir = pwrite, data = pwrite ? pwdata : prdata, resp = pslverr.
- AHB FSM: GPU gates capture on hready high (pipeline advance) and records htrans, haddr, hwrite, hsize, hburst, hwdata, hrdata, hresp. CPU keeps a 1-deep pending address-phase record and pairs address beat N with the data on beat N+1; tracks burst beat counter from hburst / htrans==SEQ.
Pure-Rust, unit-tested with synthetic beat sequences — no GPU required. This is the testability win of CPU-side decode.
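The AHB address/data pairing is the least obvious part of the decoder, so here is a minimal sketch of just that step. The types AhbBeat, Txn, and AhbPairing are hypothetical and much narrower than BusTransaction; burst tracking, hsize, and the AHB5 extras are left out. It relies on every captured beat having hready high, so each beat both completes the previous transfer's data phase and may start a new address phase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dir {
    Read,
    Write,
}

/// One captured beat (sampled on a tick where hready was high).
#[derive(Debug, Clone, Copy)]
pub struct AhbBeat {
    pub tick: u64,
    pub htrans: u8, // 0b00 IDLE, 0b01 BUSY, 0b10 NONSEQ, 0b11 SEQ
    pub haddr: u32,
    pub hwrite: bool,
    pub hwdata: u32,
    pub hrdata: u32,
    pub hresp: bool,
}

/// A completed transfer, reduced to the fields this sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Txn {
    pub tick: u64,
    pub dir: Dir,
    pub addr: u32,
    pub data: u32,
    pub error: bool,
}

/// 1-deep pending address-phase record: (haddr, hwrite).
#[derive(Default)]
pub struct AhbPairing {
    pending: Option<(u32, bool)>,
}

impl AhbPairing {
    /// Feed beat N+1; returns the transfer whose address phase was beat N.
    pub fn push(&mut self, beat: &AhbBeat) -> Option<Txn> {
        // Data phase: this beat carries the data for the pending address.
        let done = self.pending.take().map(|(addr, write)| Txn {
            tick: beat.tick,
            dir: if write { Dir::Write } else { Dir::Read },
            addr,
            data: if write { beat.hwdata } else { beat.hrdata },
            error: beat.hresp,
        });
        // Address phase: NONSEQ and SEQ (htrans bit 1 set) start a transfer.
        if beat.htrans & 0b10 != 0 {
            self.pending = Some((beat.haddr, beat.hwrite));
        }
        done
    }
}
```

Because the struct holds no GPU state, synthetic beat vectors are enough to exercise it, which is the testability point made above.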
3. GPU capture — csrc/kernel_v1.metal + src/sim/cosim_metal.rs
Generalize the WbTrace structs into protocol-agnostic capture:
#define MAX_BUS_TRACES 4
#define BUS_TRACE_MAX_ADR_BITS 32
#define BUS_TRACE_MAX_DAT_BITS 32
struct BusTraceParams { // one per configured bus
u32 protocol; // 0=apb3 1=ahb-lite 2=ahb5
u32 gate_a_pos, gate_b_pos, gate_c_pos; // edge-gating bits (psel/penable/pready or hready/htrans)
u32 dir_pos, resp_pos;
u32 addr_pos[BUS_TRACE_MAX_ADR_BITS];
u32 wdata_pos[BUS_TRACE_MAX_DAT_BITS];
u32 rdata_pos[BUS_TRACE_MAX_DAT_BITS];
u32 ctrl_pos[8]; // htrans, hsize, hburst, hnonsec, ...
u32 addr_bits, data_bits;
};
struct BusTraceEntry { u32 tick, flags, ctrl; u32 addr, wdata, rdata; };
struct BusTraceChannel { u32 write_head, capacity, current_tick, n_buses; /* entries follow */ };
The kernel computes the per-protocol gate, and on a gating edge packs one
BusTraceEntry (bus id in flags high bits). No FSM, no pairing on GPU.
gpu_io_step currently uses buffer slots 0–5 (UART + WbTrace). Add slots 6–7
for BusTraceParams[] + BusTraceChannel. Metal allows ≫8 buffers, so extend
the existing dispatch rather than adding a kernel.
Rust mirrors of the structs in cosim_metal.rs (next to WbTraceParams),
build_bus_trace_params() resolving pins for each configured bus, buffer
allocation sized MAX_BUS_TRACES, and a per-bus read head in the drain loop
(near cosim_metal.rs:4057) feeding each BusTraceDecoder.
4. Shared signal resolver — refactor src/sim/trace_signals.rs
Extract the multi-candidate name → AIG-pin / state-position resolver
(currently internal to trace-signal registration) into a reusable helper
callable from build_bus_trace_params. Keeps one source of truth for the
Yosys/scalar/structural naming conventions.
5. Output
- CSV (--bus-trace-csv <PATH>): drain-time, one row per BusTransaction. Header: tick,bus,protocol,dir,addr,data,resp,burst. Trivial; lands in Phase 1.
- Annotated VCD: synthesized per-bus VCD vars ({bus}_addr, {bus}_wdata / {bus}_rdata, {bus}_dir, {bus}_resp) that value-change at transaction-complete ticks. This needs a new "virtual signal" emission path in vcd_io.rs: unlike existing extra-observables (raw nets sampled per tick from the state buffer), these are sparse CPU-decoded events the VCD writer must interleave by tick. Bigger plumbing → Phase 3. Dovetails with the wire-bundle-scripting / Surfer direction in project memory.
6. CLI — src/bin/jacquard.rs
- --bus-trace-csv <PATH> (Phase 1)
- bus VCD annotation folded into the output / --output-vcd when bus_traces is configured, or a dedicated --bus-trace-vcd flag (Phase 3)
Status
Phase 1 is complete (APB3 end-to-end + CSV). Validated by
tests/apb_trace/ — a dedicated synthesized APB3 design (the Hazard3
JTAG-DM post-PnR netlist drops the APB addr/data nets during flattening,
so a names-preserved design was built instead). CI step:
Run APB3 bus-trace cosim (ADR 0013). Phases 2–3 remain.
Phasing
- Phase 1: APB3 end-to-end. ✅ Done. Config schema, pin maps, shared resolver, APB3 GPU capture, APB3 CPU decoder, CSV output. Validated on tests/apb_trace/ (synthesized APB3 design). APB3 FSM unit-tested.
- Phase 2: AHB-Lite + AHB5. Pipeline pairing, burst tracking, AHB5 extra signals. Unit-test the AHB FSM. Needs an AHB design to integration-test against (open question, see below).
- Phase 3: Annotated VCD. Virtual-signal emission path in vcd_io.rs.
- Follow-up: migrate WbTrace onto the general mechanism (express the VexRiscv ibus/dbus as configured buses), then delete the hardcoded path.
Verification
- Unit: APB3 & AHB FSM decoders against synthetic beat vectors (pure Rust, no GPU).
- Integration (Phase 1): cosim the Hazard3 JTAG-DM with --bus-trace-csv, assert the expected DMI register accesses (DMCONTROL/DMSTATUS) appear.
- Build: cargo build --release --features metal clean; existing cosim tests (single-UART, WbTrace) unaffected since bus_traces defaults empty.
Open questions
- AHB integration test design. APB3 validates on the existing Hazard3 JTAG-DM. Phase 2 needs an AHB-Lite/AHB5 design: do we have one, or synthesize a small AHB peripheral (like tests/dual_uart/)?
- Per-bus ring vs shared ring. One BusTraceChannel with a bus-id field (simpler allocation) vs one ring per bus (no cross-bus contention). Start shared; revisit if a hot multi-bus design overflows.
- CUDA/HIP. Cosim is Metal-only today; no kernel changes needed elsewhere now, but the general design should port cleanly when CUDA cosim lands.
ADR impact
This generalizes the cosim peripheral architecture — update ADR-0013 (plural-peripheral configs) to record the config-driven bus-monitor pattern and the GPU-capture/CPU-decode split, once Phase 1 is real.
Plan: complete ADR 0012 CDC jitter injection
Tracks the deferred half of ADR 0012. Issue: #92.
Where it stands
Implemented: the run-parameters file + per-domain seeded PRNG
(src/sim/run_params.rs), jitter_ps per ClockConfig, the uniform
per-domain draw, and a jitter displacement applied to the timing-VCD
event timestamp (cosim_metal.rs, inside the --output-vcd block
only). So today jitter perturbs the waveform timeline but nothing
else — it does not reach the setup/hold checker, model-driven clocks,
or coincident-edge ordering.
The goal of this plan is to make jitter actually stress CDC paths, then extend it to model-driven clocks and tidy the loose ends, so ADR 0012's present-tense design fully matches the code.
Phase 1 — Jitter reaches the timing checker (the core value)
Right now jitter_displacement only adjusts the VCD base_timestamp
(cosim_metal.rs:~3928-3948) and is computed inside the timing-VCD
emission block, so it has no effect without --output-vcd and never
influences violations.
- Hoist the per-tick per-domain displacement draw out of the VCD block so it is available whenever jitter_active, independent of --output-vcd.
- Apply each domain's displacement to the arrival offsets that setup/hold checking consumes (the arrival_state section), not just the VCD base timestamp, so a jittered edge can move a margin across the setup/hold boundary and surface in --timing-report.
- True per-domain perturbation (ADR §4): keep a displacement per firing domain this tick rather than the current single global value (the loop overwrites jitter_displacement with the last domain's draw). Coincident edges from domains A and B then move independently, exercising both orderings over a seed sweep.
Verify: a small two-domain design with a deliberately marginal CDC path; assert that a seed sweep produces both "no violation" and "violation" outcomes, and that a fixed seed reproduces exactly.
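The per-domain draw can be illustrated with a small seeded generator. This is a sketch under assumptions, not the code in src/sim/run_params.rs: DomainJitter is a hypothetical type, the seed derivation and the splitmix64-style mixer are stand-ins for whatever RunParams actually uses, and the draw is assumed to be uniform over [-jitter_ps, +jitter_ps].

```rust
/// splitmix64-style mixing step; advances `state` and returns 64 bits.
pub fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Independent jitter stream for one clock domain.
pub struct DomainJitter {
    state: u64,
    jitter_ps: u32,
}

impl DomainJitter {
    /// Derive a per-domain stream from the master seed (assumed scheme).
    pub fn new(master_seed: u64, domain_index: u64, jitter_ps: u32) -> Self {
        let mut s = master_seed ^ domain_index.wrapping_mul(0xD1B5_4A32_D192_ED03);
        let state = splitmix64(&mut s);
        Self { state, jitter_ps }
    }

    /// Displacement in picoseconds, in [-jitter_ps, +jitter_ps].
    /// The modulo introduces a negligible bias for small budgets.
    pub fn draw_ps(&mut self) -> i64 {
        let span = 2 * self.jitter_ps as u64 + 1;
        (splitmix64(&mut self.state) % span) as i64 - self.jitter_ps as i64
    }
}
```

Keeping one DomainJitter per firing domain, rather than one shared value, is what lets coincident edges from two domains move independently while a fixed master seed still reproduces the run.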
Phase 2 — Model-driven clock jitter (ADR §3)
Model-driven clocks (JtagReplayModel, SPI SCK, …) bypass the scheduler
and currently get no jitter.
- Add --cdc-model-jitter-ps <N> (and/or per-model jitter_ps in config) → a budget + seeded stream via RunParams::domain_seed(model_name).
- After a model fires its edge, displace the timing-model arrival for that transition (not the functional edge; the DFF still samples on the same tick), mirroring the Phase 1 arrival-offset path.
Verify: extend tests/jtag_minimal (model-driven TCK) with a model
jitter budget; confirm reproducibility and that TCK→sys_clk CDC margins
vary by seed.
Phase 3 — Hygiene / correctness guards
- gcd_ps / 2 constraint (ADR §2): at startup, error (or clamp with a loud warning) if any jitter_ps > scheduler.gcd_ps / 2, since larger values would reorder edges across GCD ticks.
- Always persist the seed (ADR §1): when neither --run-params nor --output-vcd is given, RunParams::generate() currently does not write the file. Persist to a default path unconditionally so every run is replayable.
- master_seed in the VCD header (ADR §1/§5): emit the master seed as a VCD header comment in vcd_io.rs, so the seed is recoverable from an output artifact, not just the INFO log.
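The first guard is simple enough to sketch. check_jitter_budget is a hypothetical function, and it assumes scheduler.gcd_ps is the GCD of the domains' edge periods in picoseconds; the real scheduler may derive it differently (for example from half-periods). The sketch takes the "error" option rather than "clamp with a warning".

```rust
/// Euclid's algorithm; gcd(0, x) == x so it can seed a fold.
pub fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Reject any per-domain jitter budget above half the GCD tick, since a
/// larger displacement could reorder edges across GCD ticks.
pub fn check_jitter_budget(periods_ps: &[u64], jitter_ps: &[u64]) -> Result<(), String> {
    let gcd_ps = periods_ps.iter().copied().fold(0, gcd);
    for (domain, &j) in jitter_ps.iter().enumerate() {
        if j > gcd_ps / 2 {
            return Err(format!(
                "domain {}: jitter_ps {} exceeds gcd_ps / 2 = {}",
                domain,
                j,
                gcd_ps / 2
            ));
        }
    }
    Ok(())
}
```

For periods of 10000 ps and 4000 ps the GCD tick is 2000 ps, so budgets up to 1000 ps pass and 1001 ps is rejected.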
Phase 4 — CI CDC stress sweep (ADR Consequences)
Once jitter feeds violations (Phase 1), add a lightweight CI step: run
the marginal-CDC design across a few sequential seeds, upload each run's
run_params.json as an artifact, fail if an unexpected violation
appears. Gives every PR a cheap CDC regression.
Out of scope (separate ADRs / plans)
- X-injection on CDC paths (needs MC.1 island partitioner — ADR 0012 "Deferred").
- Non-uniform jitter distributions (Gaussian period jitter, etc.) — the seed+budget interface is distribution-agnostic, add later.
- Frequency sweep / DFS.
Spike — OpenTimer on SKY130 and MCU SoC
Status: Proposed. Not yet executed.
Time box: Half a day. Extend by up to one day if initial signs are positive but hitting specific SKY130 quirks. Abort and fall back if first-four-hours progress is blocked.
Goal
Determine whether OpenTimer (MIT, C++17) can reliably parse and analyse Jacquard's real-flow inputs — SKY130 Liberty and OpenLane2 MCU SoC post-P&R output — well enough to serve as Jacquard's in-process reference STA (per ADR 0003).
The outcome resolves ADR 0003's Pending Spike status to either Accepted or Superseded.
Out of scope for this spike
- C++ FFI / bindgen integration work. Pure spike on OpenTimer's standalone behaviour.
- Timing-IR integration. Establishing that OpenTimer produces usable arrival/slack output is sufficient; converting it to IR belongs in phase 1.
- Performance measurement beyond rough "does it complete in reasonable time."
- GF130 coverage. SKY130 is the spike target; GF130 private-track confirmation is later.
Setup
Required artefacts (checked before starting):
- OpenTimer clone and local build (MIT licence, standard CMake).
- SKY130 Liberty file(s) matching the corner the MCU SoC flow uses. At minimum sky130_fd_sc_hd__tt_025C_1v80.lib.
- MCU SoC post-P&R output: synthesised .v, SDC, and (critically) .spef. Check that the current OpenLane2 invocation is configured to produce SPEF; if not, enable it. OpenTimer requires SPEF; it does not consume SDF.
- Jacquard's current timing-analysis binary output on the same design for comparison.
- OpenSTA installed locally, for three-way comparison.
Success criteria
The spike answers four questions. Each is a pass/fail observation, not a measurement.
Q1 — Does OpenTimer parse SKY130 Liberty without errors?
- Pass: clean parse, no warnings that indicate misinterpreted cells.
- Partial: parses but warns on specific cells, in particular sky130_fd_sc_hd__dlygate4sd3_* or anything with non-trivial conditional timing. Document which cells and whether their timing is discarded or mishandled.
- Fail: parse errors, segfaults, or silently-wrong output on recognised cells.
Q2 — Does OpenTimer compute arrivals on the MCU SoC design?
Feed .lib + .v + .spef + .sdc. Run report_timing -worst 20 or equivalent. Observe:
- Pass: produces a full timing report with reasonable-looking arrivals (non-zero, monotonic along paths).
- Partial: produces a report but with suspect values (many zeros, missing cells, incomplete paths).
- Fail: hangs, crashes, or refuses to analyse.
Q3 — Does OpenTimer's result agree with OpenSTA?
Run OpenSTA on the same inputs, compare top-20 critical endpoints' arrivals. Declare tolerance: ±5% on arrival time, ±10 ps absolute floor for very short paths.
- Pass: all top-20 endpoints within tolerance.
- Partial: most within tolerance, a small number of outliers traceable to specific delay-model differences (e.g., CCS vs NLDM).
- Fail: systematic disagreement suggesting OpenTimer is computing something meaningfully different. Investigate; if the disagreement is on SKY130 cell interpretation (a PDK handling issue) this is essentially a fail for our purposes.
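The declared tolerance can be written down as a predicate so the comparison is mechanical. within_tolerance is a hypothetical helper, and it reads the rule as "allowed error is 5% of the reference arrival, but never less than 10 ps", with OpenSTA as the reference; that reading of the floor is an assumption.

```rust
/// True if `candidate_ps` agrees with `reference_ps` to within ±5%,
/// with a 10 ps absolute floor so very short paths are not over-penalised.
pub fn within_tolerance(reference_ps: f64, candidate_ps: f64) -> bool {
    let allowed_ps = (reference_ps.abs() * 0.05).max(10.0);
    (candidate_ps - reference_ps).abs() <= allowed_ps
}

/// Count how many endpoint pairs (reference, candidate) fall outside
/// the tolerance; zero corresponds to a Q3 "Pass".
pub fn count_outliers(pairs_ps: &[(f64, f64)]) -> usize {
    pairs_ps
        .iter()
        .filter(|(reference, candidate)| !within_tolerance(*reference, *candidate))
        .count()
}
```

A non-zero outlier count does not by itself separate "Partial" from "Fail"; that still depends on whether the outliers trace back to a known delay-model difference.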
Q4 — Does OpenTimer's result correlate with Jacquard's current timing analysis?
Compare worst-slack and top-K endpoint lists (not exact values — pessimism differences are expected and documented). Observe:
- Pass: top-K lists overlap substantially; worst-slack is on a comparable path.
- Informational: any systematic discrepancy tells us what the pessimism delta actually looks like in practice. This data informs R4 (critical-path refinement reporting) whether OpenTimer is adopted or not.
Decision matrix
| Q1 | Q2 | Q3 | Outcome |
|---|---|---|---|
| Pass | Pass | Pass | ADR 0003 → Accepted. Proceed to phase 1 integration. |
| Pass | Pass | Partial | ADR 0003 → Accepted with documented scope limits. Define where OpenTimer is authoritative vs deferred to OpenSTA. |
| Pass | Partial | — | ADR 0003 → Accepted provisionally; spike extends to investigate Q2 anomalies. |
| Partial | — | — | ADR 0003 → Accepted with SKY130 cell workarounds documented, or → Superseded if the workarounds are too invasive. |
| Fail on any | — | — | ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation. Revisit libreda-sta or in-house walker as alternatives in a follow-up ADR. |
Fallback
If the spike fails, Jacquard operates with:
- OpenSTA subprocess validation in CI (ADR 0001) as the sole timing-reference mechanism.
- No per-PR in-process timing cross-check; feedback timing degrades.
- Phase 1 drops OpenTimer integration work and refocuses on tightening OpenSTA-driven CI.
Superseding ADR 0003 is clean — it is currently Pending Spike so no downstream work has accrued to it. Phases 0 and 2 are unaffected.
Progress log
Setup (2026-04-23 → 2026-04-30)
- OpenTimer 2.1.0 and OpenSTA 3.1.0 cloned to Jacquard-depends/ and built locally. Build notes in that repo's README.md.
- SKY130 Liberty already on disk via volare: ~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/libs.ref/sky130_fd_sc_hd/lib/sky130_fd_sc_hd__tt_025C_1v80.lib.
- Spike artefacts kept in this worktree under spike-out/ (gitignored; reproducible from Jacquard-depends/).
Q1 — Liberty parse (2026-04-30) — Pass
| Tool | Cells loaded | Wall time | Warnings |
|---|---|---|---|
| OpenTimer 2.1.0 | 428 | 0.12 s | 1 |
| OpenSTA 3.1.0 | 428 | 0.18 s | 0 |
Cell counts agree exactly. OpenSTA parses cleanly. OpenTimer emits one warning:
W celllib.cpp:274] unexpected lut template variable normalized_voltage
The normalized_voltage axis appears in exactly one place in the Liberty:
the library-level normalized_driver_waveform("driver_waveform_template")
block, which is CCS-driver-waveform data. No per-cell timing arc references
it — cell_rise/cell_fall/rise_constraint/fall_constraint all use
the NLDM templates del_1_7_7, vio_3_3_1, constraint_3_0_1. So the
warning has no impact on arrival/slack computation under NLDM, which is
what OpenTimer does anyway.
Operational note: OpenTimer's read_celllib is lazy — the parse only
runs when an action like update_timing (or report_*) forces taskflow
execution. Issuing dump_celllib immediately after read_celllib reports
"celllib not found" because the read hasn't fired yet. Always insert
update_timing before any inspection command.
The documented read_celllib -min|-max <file> syntax silently no-ops; bare
read_celllib <file> loads the lib as both min and max corners. Filed as a
docs/build mismatch in our Jacquard-depends/README.md.
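The ordering rule from the operational note can be enforced mechanically when command lists are generated by a script. This is a hedged sketch under the behaviour described above (reads are lazy; update_timing and report_* force execution; dump_* does not). The helper name with_forced_updates is hypothetical, and how the resulting list is fed to OpenTimer's shell is left out.

```python
def with_forced_updates(commands):
    """Insert update_timing before any dump_* command that follows a lazy read_*."""
    out = []
    pending_read = False  # True while a read_* has been issued but not yet executed
    for cmd in commands:
        words = cmd.split()
        if not words:
            continue  # skip blank lines
        verb = words[0]
        if verb.startswith("read_"):
            pending_read = True
        elif verb == "update_timing" or verb.startswith("report_"):
            # Per the note above, these force taskflow execution themselves.
            pending_read = False
        elif verb.startswith("dump_") and pending_read:
            out.append("update_timing")
            pending_read = False
        out.append(cmd)
    return out


script = with_forced_updates([
    "read_celllib sky130_fd_sc_hd__tt_025C_1v80.lib",
    "dump_celllib",
])
print("\n".join(script))
```

The bare read_celllib form is used deliberately, since the -min|-max variant was observed to no-op.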
Q2 — Arrival computation on SKY130 (2026-05-01) — Fail
Used OpenSTA's bundled gcd_sky130hd example (a canonical SKY130-HD GCD
with .v, .sdc, .spef, .lib) as a fast smoke test before tackling
MCU-SoC SPEF generation. If OpenTimer can't handle this, the MCU-SoC
effort is wasted.
OpenSTA baseline: clean run, period 5 ns, top arrival 4.82 ns, WNS 0.00, slack 0.09 met. 0.28 s wall, zero warnings.
OpenTimer: could not produce a single timing path. The result was
no critical path found, wns = nan, tns = nan — even after working
around the following issues, each of which had to be discovered and
patched manually:
| # | Issue | Workaround tried | Status |
|---|---|---|---|
| 1 | read_celllib -min\|-max <file> silently no-ops | bare read_celllib <file> loads as both corners | works |
| 2 | dump_* after read_* reports state-not-loaded because the read is lazy | insert update_timing before any inspection | works |
| 3 | Tap cells in post-P&R Verilog (sky130_fd_sc_hd__tapvpwrvgnd_*) trigger 1040 cell not found in celllib errors and abort the netlist load | strip tap cell instances from Verilog | works |
| 4 | OpenTimer's bundled SDC parser uses pre-TCL-8.5 syntax (trace variable VAR w CMD); fails on the system's TCL 8.6 with bad option "variable" and produces zero parsed commands — even on OpenTimer's own bundled examples | patch ot/sdc/sdcparsercore.tcl:144 to trace add variable sdc_version write __set_v | works (one-line fix; should be upstreamed) |
| 5 | OpenSTA-style SDC with set period 5 / expr $period * 0.2 / [all_inputs] parses as zero commands | hand-write a literal SDC with create_clock -name clk -period 5 [get_ports clk] | works for trivial constraints; non-trivial SDC remains uncovered |
| 6 | SPEF *PORTS section (standard SPEF, IEEE 1481, emitted by OpenROAD/OpenLane) is rejected with a parse error pointing at the first port line | strip *PORTS block from SPEF before reading | works |
| 7 | Verilog bus ports (input [31:0] req_msg;) are not bit-blasted by OpenTimer's Verilog parser, but post-P&R SPEF references the bus as bit-indexed nets (req_msg[0], req_msg[1], …). 48 bus-element nets fail to match between netlist and SPEF | none found | blocking |
| 8 | After all of the above, two interior pins (_251_:B, _218_:B) report "not found in rctree" and the timing graph remains disconnected enough that no path can be reported | not investigated further | blocking |
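Workarounds 3, 4, and 6 in the table are plain text transformations, so they can be sketched as one small pre-processing module. The code below is an illustration under stated assumptions, not the patches applied during the spike. All function names are hypothetical, the tap-cell regex assumes one instance per statement, the SPEF filter assumes section keywords start with `*` followed by an uppercase letter, and the pre-patch TCL line is inferred from the description in row 4.

```python
import re

# Issue 3: tap-cell instances such as
#   sky130_fd_sc_hd__tapvpwrvgnd_1 TAP_12 ();
TAP_INSTANCE_RE = re.compile(
    r'^[ \t]*sky130_fd_sc_hd__tapvpwrvgnd_\w+\s+[^\s(]+\s*\([^;]*\)\s*;[ \t]*\n?',
    re.MULTILINE,
)


def strip_tap_cells(verilog):
    """Remove tap-cell instance statements from a post-P&R netlist."""
    return TAP_INSTANCE_RE.sub("", verilog)


def strip_spef_ports(spef):
    """Drop the *PORTS section, keeping everything before and after it."""
    out, in_ports = [], False
    for line in spef.splitlines(keepends=True):
        if line.startswith("*PORTS"):
            in_ports = True
            continue
        # The next section keyword (e.g. *D_NET) ends the block. Port entries
        # start with a plain name or a *<number> name-map index instead.
        if in_ports and re.match(r"\*[A-Z]", line):
            in_ports = False
        if not in_ports:
            out.append(line)
    return "".join(out)


# Issue 4: old-style "trace variable NAME OPS CMD" to "trace add variable ...".
_TRACE_OPS = {"r": "read", "w": "write", "u": "unset", "a": "array"}
OLD_TRACE_RE = re.compile(r'\btrace\s+variable\s+(\S+)\s+([rwua]+)\s+')


def modernize_trace(tcl):
    """Rewrite legacy trace calls to the 'trace add variable' form."""
    def repl(match):
        ops = " ".join(_TRACE_OPS[c] for c in match.group(2))
        if " " in ops:
            ops = "{" + ops + "}"  # a multi-op list must be braced in TCL
        return f"trace add variable {match.group(1)} {ops} "
    return OLD_TRACE_RE.sub(repl, tcl)
```

For example, modernize_trace applied to the assumed original line `trace variable sdc_version w __set_v` yields the patched line quoted in row 4. None of these helpers addresses issues 7 or 8, which remain blocking.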
Issues 7 and 8 mean that on a SKY130 design with bus ports — i.e. any
design that talks to the rest of the world — OpenTimer cannot compute
arrivals from a standard OpenROAD .v/.spef pair without inputs being
pre-processed by code that doesn't exist.
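The naming mismatch behind issue 7 can be made concrete with a short diagnostic. This is only a sketch of how the gap could be measured, not a workaround that was tried in the spike. The helper names are hypothetical, the regex covers only simple `input [msb:lsb] name;` style declarations, and SPEF bracket escaping is not handled.

```python
import re

# Simple bus port declarations, e.g.  input [31:0] req_msg;
BUS_PORT_RE = re.compile(r'\b(?:input|output|inout)\s+\[(\d+):(\d+)\]\s+(\w+)\s*;')


def bit_blasted_ports(verilog):
    """Expand each declared bus port into the bit-indexed names SPEF refers to."""
    names = []
    for msb, lsb, name in BUS_PORT_RE.findall(verilog):
        low, high = sorted((int(msb), int(lsb)))
        names.extend(f"{name}[{i}]" for i in range(low, high + 1))
    return names


def unmatched_spef_nets(spef_net_names, netlist_net_names):
    """Return SPEF nets that have no identically named net in the netlist."""
    known = set(netlist_net_names)
    return sorted(n for n in spef_net_names if n not in known)


bits = bit_blasted_ports("input [31:0] req_msg;")
# A parser that does not bit-blast knows only the bus name itself.
missing = unmatched_spef_nets(bits + ["clk"], ["req_msg", "clk"])
print(len(bits), len(missing))
```

In this toy case every bit of req_msg is unmatched while clk matches, which mirrors the pattern reported in row 7 of the table.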
The cumulative finding is not "OpenTimer mishandles a few SKY130 cells". It is that OpenTimer's input pipeline (Verilog parser, SPEF parser, bundled SDC parser) is incomplete relative to what real OpenROAD-flow outputs contain, and the gaps fall on hot paths (bus ports, tap cells, modern TCL, OpenROAD-emitted SPEF). The cells themselves parse fine (Q1); it's the surrounding ecosystem that doesn't.
Q3, Q4 — not run
Q3 (cross-check vs OpenSTA) and Q4 (correlation with Jacquard's
timing-analysis) both depend on OpenTimer producing arrivals. With Q2
unable to produce a single path, they're moot for this spike.
Decision
ADR 0003 → Superseded. Per the spike's decision matrix ("Fail on any → ADR 0003 → Superseded. Fall back to OpenSTA-subprocess-only validation"), the right move is to retire the in-process-OpenTimer plan and lean on OpenSTA-subprocess validation (ADR 0001) as the sole timing reference. A follow-up ADR should consider libreda-sta or an in-house walker if an in-process reference is still wanted later.
OpenTimer's strengths (in-process C++17, taskflow-based, MIT, fast for the academic benchmarks it ships with) are real, but the input-pipeline gaps are large enough that adopting it would mean owning a non-trivial fork — the opposite of what a "lightweight in-process reference" is supposed to be.
The Liberty parser is genuinely capable (Q1 passed cleanly on the 12 MB SKY130 NLDM lib in 120 ms), so OpenTimer remains an option for future narrow tasks like Liberty introspection, but not as the STA engine.
Setup notes worth keeping
- OpenSTA bundles gcd_sky130hd.{v,sdc,spef} and sky130hd_tt.lib.gz: a cleaner SKY130 smoke-test fixture than anything we'd have produced from chipflow in the time we had.
- ~/.volare/volare/sky130/versions/c6d73a35f524070e85faff4a6a9eef49553ebc2b/sky130A/... is the live SKY130 PDK already on this machine (chipflow installs it). No need to fetch it separately.
Deliverable
A short report added to this document as a "Spike outcome" section, summarising:
- Which Q1–Q4 answers were observed.
- Specific SKY130 cells where OpenTimer misbehaves (if any).
- Whether SPEF generation had to be added to the OpenLane2 flow, and what that change was.
- Decision: confirm, scope-limit, or supersede ADR 0003.
Links
- ADR 0003: OpenTimer as in-process reference STA.
- ../timing-correctness.md: requirement R2.
- ../plans/phase-0-ir-and-oracle.md: phase 0 (independent of this spike; runs in parallel).