Timing Simulation in GEM

See also: timing-correctness.md — forward-looking validation contract and timing IR requirements (in progress). The document below describes current behaviour.

This document explains GEM's boomerang evaluation architecture and how timing simulation with per-gate delays can be implemented efficiently on GPU.

Current status (what ships today)

Phase 1 closed 2026-05-02; see plans/post-phase-0-roadmap.md.

Capability	Status
Liberty parsing	✅
SDF back-annotation via `opensta-to-ir`	✅
Per-DFF clock-arrival folding (Pillar B Stages 1+2)	✅
GPU-side setup/hold violation detection	✅ Metal, CUDA, HIP (`sim`); the `cosim` path is a follow-up
Symbolic violation messages (`top/cpu/regs[7][bit 22] [word=42]`)	✅ Metal, CUDA, HIP (`sim`)
`--timing-report <path.json>` structured end-of-run report	✅ Metal, CUDA, HIP (`sim`); not yet wired for `cosim`
`--timing-summary` human-readable text summary	✅ Metal, CUDA, HIP (`sim`); not yet wired for `cosim`
OpenSTA detection + version check	✅
Multi-corner timing IR (`--timing-corner <name>`)	✅
`--sdf-corner` (min/typ/max selection from one SDF)	⚠ One corner at a time
Per-receiver wire delay (Pillar C Tier 1)	❌ Phase 2 (blocked on ADR 0007)

What "HIP" means above. HIP is two quite different things, and for most of this project's life the table only ever meant the first:

HIP-over-CUDA — hip-runtime-nvidia, built with the CUDA toolkit and executed on an NVIDIA GPU (HIP Tests (NVIDIA backend), tesla4-runner). Source-compatible with CUDA and, in practice, CUDA semantics throughout.
HIP on real ROCm — an actual AMD GPU (HIP Tests (ROCm backend)). This had never been built, let alone run, until 2026-07-15, and sim did not run there at all until the non-cooperative fallback landed: the device has no cooperative launch, so the grid-wide barrier sim relies on is unavailable and the host drives one launch per (cycle, stage) instead. See spikes/amd-laptop-backend.md.

Both are now covered by CI and both pass, so the ✅s above are accurate for each. The distinction still matters when reading a green tick: it is what hid #203 — an out-of-bounds read that nvcc silently truncated back into range, so CUDA and HIP-over-CUDA were accidentally correct while ROCm faulted. Arrival/violation arithmetic itself is host-side Rust (#195) and therefore backend-independent.

See timing-violations.md for the full violation-output interface and why-jacquard.md for positioning.

Background: The Simulation Challenge

GEM simulates And-Inverter Graphs (AIGs) where every node is either:

A primary input (value comes from VCD stimulus)
An AND gate with two inputs (possibly inverted)

Traditional simulation evaluates gates in topological order, which is inherently serial. GPUs excel at massive parallelism - thousands of threads doing the same operation on different data. GEM bridges this gap with the boomerang architecture.

Boomerang Evaluation

Core Concept

The boomerang structure is a hierarchical reduction tree that maps an AIG onto GPU threads. It's called "boomerang" because data flows down the tree during reduction, then results are written back out at various levels - like a boomerang going out and returning.

Hierarchy Structure

GEM uses BOOMERANG_NUM_STAGES = 13, meaning the tree has 2^13 = 8192 leaf positions:

Level 0 (inputs):   8192 positions
Level 1:            4096 positions  (8192 / 2)
Level 2:            2048 positions
Level 3:            1024 positions
Level 4:             512 positions
Level 5:             256 positions
Level 6:             128 positions
Level 7:              64 positions
Level 8:              32 positions
Level 9:              16 positions
Level 10:              8 positions
Level 11:              4 positions
Level 12:              2 positions
Level 13 (output):     1 position

Each level halves the number of positions by computing AND gates that combine pairs.

Thread Organization

A GPU block has 256 threads (threadIdx.x = 0..255). Each thread holds a 32-bit word where each bit represents an independent Boolean signal:

Thread 0:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
Thread 1:   [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
...
Thread 255: [bit0, bit1, bit2, ... bit31]  = 32 Boolean signals
            ─────────────────────────────
            Total: 256 × 32 = 8192 signals per level

Thread position refers to threadIdx.x - which of the 256 threads we're addressing. Each thread position processes 32 signals in parallel using SIMD operations.

Memory Layout

__shared__ u32 shared_metadata[256];   // Partition configuration
__shared__ u32 shared_writeouts[256];  // Output staging area
__shared__ u32 shared_state[256];      // Working state (8192 bits)

The shared_state array holds the current level's values during reduction.

The Reduction Process

Phase 1: Level 0 → Level 1 (hier[0])

Only threads 128-255 are active. Each computes 32 AND gates in parallel:

if(threadIdx.x >= 128) {
    u32 hier_input_a = shared_state[threadIdx.x - 128];  // From threads 0-127
    u32 hier_input_b = hier_input;                        // This thread's data

    // 32 AND gates computed simultaneously (one per bit)
    u32 ret = (hier_input_a ^ hier_flag_xora) &
              ((hier_input_b ^ hier_flag_xorb) | hier_flag_orb);

    shared_state[threadIdx.x] = ret;
}

The xora, xorb, and orb flags encode:

xora/xorb: Input inversions (for AND-inverter graph)
orb: Passthrough mode (when output equals input A, skip the AND)

Visual representation:

Before:  [T0][T1]...[T127] [T128][T129]...[T255]
              │                  │
              └───────┬──────────┘
                      │
                   AND gates (128 threads × 32 bits = 4096 gates)
                      │
                      ▼
After:   [----unused----] [T128][T129]...[T255]
                          (128 × 32 = 4096 results)

Phase 2: Levels 1-3 (Shared Memory)

for(int hi = 1; hi <= 3; ++hi) {
    int hier_width = 1 << (7 - hi);  // 64, 32, 16
    if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
        u32 hier_input_a = shared_state[threadIdx.x + hier_width];
        u32 hier_input_b = shared_state[threadIdx.x + hier_width * 2];
        u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        shared_state[threadIdx.x] = ret;
    }
    __syncthreads();  // Barrier between levels
}

Each level activates fewer threads:

Level 1: threads 64-127 (64 threads → 2048 gates)
Level 2: threads 32-63 (32 threads → 1024 gates)
Level 3: threads 16-31 (16 threads → 512 gates)

Phase 3: Levels 4-7 (Warp Shuffle)

Within a single warp (32 threads), data exchange uses fast shuffle instructions instead of shared memory:

if(threadIdx.x < 32) {
    for(int hi = 4; hi <= 7; ++hi) {
        int hier_width = 1 << (7 - hi);  // 8, 4, 2, 1
        u32 hier_input_a = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width);
        u32 hier_input_b = __shfl_down_sync(0xffffffff, tmp_cur_hi, hier_width * 2);
        if(threadIdx.x >= hier_width && threadIdx.x < hier_width * 2) {
            tmp_cur_hi = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
        }
    }
}

No synchronization needed - warp shuffle is implicitly synchronized.

Phase 4: Levels 8-12 (Bit Operations)

The final levels operate on bits within a single u32, computed by thread 0 only:

if(threadIdx.x == 0) {
    // Level 8: 32 → 16 (operates on upper/lower halves)
    u32 r8 = ((v1 << 16) ^ xora) & ((v1 ^ xorb) | orb) & 0xffff0000;

    // Level 9: 16 → 8
    u32 r9 = ((r8 >> 8) ^ xora) & (((r8 >> 16) ^ xorb) | orb) & 0xff00;

    // Level 10: 8 → 4
    u32 r10 = ((r9 >> 4) ^ xora) & (((r9 >> 8) ^ xorb) | orb) & 0xf0;

    // Level 11: 4 → 2
    u32 r11 = ((r10 >> 2) ^ xora) & (((r10 >> 4) ^ xorb) | orb) & 0b1100;

    // Level 12: 2 → 1
    u32 r12 = ((r11 >> 1) ^ xora) & (((r11 >> 2) ^ xorb) | orb) & 0b10;

    tmp_cur_hi = r8 | r9 | r10 | r11 | r12;
}

Write-Outs

Results are captured at various levels (not just the final output) and written to global memory:

if((writeout_hook_i >> 8) == bs_i) {
    shared_writeouts[threadIdx.x] = shared_state[writeout_hook_i & 255];
}

This is the "return" part of the boomerang - results flow back from intermediate levels.

Timing Simulation Approaches

Approach Comparison

Approach	Parallelism	Memory	Accuracy	GPU Fit
Event-driven	Poor (serial queue)	Low	Exact	Bad
Time-wheel	Medium	High	Configurable	Medium
Levelized	Excellent	Low	Conservative	Best
Oblivious	Maximum	Very High	Exact	Wasteful

Recommended: Levelized with Delay Accumulation

This approach piggybacks on the existing boomerang structure with minimal changes.

Data Structure Addition

// Add to shared memory (256 bytes additional)
__shared__ u8 shared_arrival[256];  // One arrival time per thread position

Each thread position stores a single 8-bit arrival time representing the maximum arrival across all 32 bits in that position.

Modified AND Gate Evaluation

// Current (value only):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

// With timing (add ~4 instructions):
u32 ret = (hier_input_a ^ xora) & ((hier_input_b ^ xorb) | orb);
shared_state[threadIdx.x] = ret;

u8 arr_a = shared_arrival[threadIdx.x - offset_a];
u8 arr_b = shared_arrival[threadIdx.x - offset_b];
u8 arr_ret = min(max(arr_a, arr_b) + GATE_DELAY, 255);  // Saturating add
shared_arrival[threadIdx.x] = arr_ret;

Complexity Analysis

Same number of kernel launches as zero-delay simulation
O(levels × cycles) - identical to current
~256 bytes additional shared memory per partition
Estimated 10-20% performance overhead

The Approximation Trade-off

What We Track

One arrival time per thread position (256 values) instead of per signal (8192 values).

Implications

If thread position 50 contains signals A, B, C with different true arrivals:

Signal A: 15ps (shortest path)
Signal B: 23ps (longest path)
Signal C: 8ps  (medium path)

We store only: arrival[50] = 23ps (the maximum).

Why This Works

Conservative: We might report false violations, but never miss real ones
Correlated signals: Signals at the same thread position are often topologically nearby with similar timing
Endpoint focus: We ultimately only care about arrivals at DFF D inputs

When Full Accuracy is Needed

For bit-accurate timing, you would need:

// 8KB additional shared memory (may exceed limits)
__shared__ u8 shared_arrival[256][32];  // Per-bit arrivals

This is feasible but significantly increases memory pressure and computation.

Implementation Phases

Phase 1: CPU Timing Analysis (Completed)

Liberty parser for delay extraction
Static timing analysis on AIG
CPU reference simulation with delays
Timing violation detection

Phase 2: Hybrid GPU+CPU (Completed)

GPU performs zero-delay value simulation
CPU performs timing analysis on results
Validates infrastructure without kernel changes

Phase 3: GPU Arrival Tracking (Completed)

Added shared_arrival[256] (u16) to Metal and CUDA kernels
Arrivals tracked during boomerang reduction at all hierarchy levels
Per-gate delays injected via script padding slots from SDF data
DFF timing constraint checking at cycle boundaries (setup/hold)
Timing-aware VCD output (--timed flag)
Validated against CVC reference simulator (88ps / 7.1% conservative overestimate)

Phase 4: Full Integration (Partial)

Timing violation events via event buffer (completed)
Per-cycle timing reports (completed)
Integration with output VCD (completed via --timed)
Timing-aware bit packing for reduced approximation error (future)

Conservative Timing Model: Sources of Overestimation

Jacquard's GPU timing is intentionally conservative — it may over-estimate arrival times but will never under-estimate them. This is important for setup violation detection: false positives are safe, false negatives would miss real bugs.

There are three independent sources of conservatism, each adding to the overestimate:

Source 1: max(rise, fall) per cell

The GPU kernel tracks a single u16 arrival per thread position. It cannot distinguish between rising and falling signal transitions because each thread processes 32 packed Boolean signals simultaneously — there's no per-bit transition direction available.

How it works: For each cell, inject_timing_to_script() computes:

#![allow(unused)]
fn main() {
delay = max(gate_delays[pin].rise_ps, gate_delays[pin].fall_ps)
}

Impact: For the SKY130 inv_chain test (16 inverters), rise delays average ~10ps larger than fall delays. In a real inverter chain, transitions alternate (rise→fall→rise), so half the cells use the smaller fall delay. Jacquard uses the larger rise delay for all.

Measured: 80ps overestimate on 1235ps (6.5%) for 16 inverters with ~10ps rise/fall asymmetry per cell.

Source 2: max wire delay across all input pins

For multi-input cells (AND gates, MUXes), INTERCONNECT delays to different input pins may differ significantly. Jacquard takes the maximum across all input pins:

#![allow(unused)]
fn main() {
// wire_delays_per_cell: dest_cellid → max(all input wire delays)
entry.rise_ps = entry.rise_ps.max(ic.delay.rise_ps);
entry.fall_ps = entry.fall_ps.max(ic.delay.fall_ps);
}

Impact: If an AND gate has input A arriving via a 10ps wire and input B via a 200ps wire, Jacquard assigns 200ps to the cell regardless of which input is on the critical path. An event-driven simulator would correctly propagate the 10ps arrival on input A independently.

When this matters: Designs with highly asymmetric routing (e.g., one input is local, another crosses the chip). Well-routed designs typically have balanced wire delays to multi-input cells.

Source 3: max arrival across 32 packed signals per thread

Each thread position holds 32 independent Boolean signals. Jacquard tracks one arrival per thread position (the maximum across all 32 signals):

Thread 50: [signal_A: 5ps, signal_B: 23ps, signal_C: 8ps, ...]
Tracked:   arrival[50] = 23ps (max of all 32)

Impact: If signals with very different timing are packed into the same thread, the fastest signals inherit the slowest signal's arrival time.

Mitigation: The bit-packing algorithm can sort signals by estimated timing before assignment (see "Timing-Aware Bit Packing" section). This keeps similar-timing signals together, reducing the max approximation error.

Combined Effect

These sources are multiplicative in the worst case. For the inv_chain test:

Source	Contribution	Notes
max(rise, fall)	+80ps	8 inverters × 10ps asymmetry
max wire delay	+8ps	8 wires × 1ps asymmetry
max per thread	0ps	Only 1 signal per thread in this test
Total overestimate	88ps / 7.1%	vs CVC transition-accurate result

For larger designs with more routing asymmetry and denser bit packing, the combined overestimate could be larger. The bit-packing sort (Source 3) is the most actionable mitigation.

CVC Reference Validation

The inv_chain design (2 DFFs + 16 SKY130 inverters) was validated against CVC (open-src-cvc), an event-driven Verilog simulator with native SDF back-annotation:

CVC:  clk_to_q=350ps  chain=885ps  total=1235ps  (transition-accurate)
Jacquard: clk_to_q=350ps  chain=973ps  total=1323ps  (conservative max)
Difference: 88ps (7.1% overestimate)

Both simulators agree on CLK→Q delay (350ps) because the DFF has a single output transition direction per clock edge. The chain delay differs because CVC tracks actual rise/fall polarity through each inverter.

To run the CVC comparison locally:

bash tests/timing_test/cvc/run_cvc.sh

Requires Docker (builds CVC from source on first run).

Delay Data Encoding

Script Format

The existing boomerang section has padding that can store delay data:

Current format per thread per stage:
  [xora: u32]
  [xorb: u32]
  [orb:  u32]
  [padding: u32]  ← Can store delay here

PackedDelay Structure

#![allow(unused)]
fn main() {
#[repr(C)]
pub struct PackedDelay {
    pub rise_ps: u16,  // Rising edge delay in picoseconds
    pub fall_ps: u16,  // Falling edge delay in picoseconds
}
}

For simplified timing, a single uniform delay constant can be used instead of per-gate delays.

Timing Violation Detection

At Each Cycle Boundary

The GPU kernel checks timing constraints per state word (32 signals) after the boomerang evaluation completes. Arrivals and constraints use u16 picosecond values (range 0–65535 ps). Arithmetic is performed in u32 to avoid overflow when summing arrival + setup:

// After boomerang completes, before next cycle
// arrival: u16 max accumulated delay for this 32-signal group
// constraint_word: packed [setup_ps:16][hold_ps:16]
u16 setup_ps = constraint_word >> 16;
u16 hold_ps  = constraint_word & 0xFFFF;

// Setup check: skip when arrival == 0 (no data propagated, e.g. first cycle
// or DFF with constant inputs)
if (arrival > 0 && (u32)arrival + (u32)setup_ps > clock_period_ps) {
    int slack = (int)clock_period_ps - (int)arrival - (int)setup_ps;
    write_event(event_buffer, EVENT_TYPE_SETUP_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)setup_ps);
}

// Hold check: no arrival > 0 guard (hold violations matter even at cycle 0)
if ((u32)arrival < (u32)hold_ps) {
    int slack = (int)arrival - (int)hold_ps;
    write_event(event_buffer, EVENT_TYPE_HOLD_VIOLATION,
                cycle, io_offset + threadIdx.x,
                (u32)slack, (u32)arrival, (u32)hold_ps);
}

Event Buffer Integration

#![allow(unused)]
fn main() {
pub enum EventType {
    Stop = 0,
    Finish = 1,
    Display = 2,
    AssertFail = 3,
    SetupViolation = 4,   // Timing events
    HoldViolation = 5,
}
}

For full details on interpreting violation reports and tracing violations to source signals, see docs/timing-violations.md.

Timing-Aware Bit Packing

The Problem

Each thread position holds 32 signals packed into a u32. When tracking timing with one arrival value per thread position, we approximate all 32 signals as having the same arrival time (the maximum).

This approximation is accurate when signals in the same thread have similar timing. But the default placement algorithm uses first-fit for bit assignment:

#![allow(unused)]
fn main() {
// Default: first available slot
for i in 0..hier[selected_level].len() {
    if hier[selected_level][i] == usize::MAX {
        slot_at_level = i;  // First-fit, not timing-aware
        break;
    }
}
}

This can result in signals with very different timing sharing a thread:

Thread 50 (accidental grouping):
  bit 0: level 5,  ~5ps arrival
  bit 1: level 12, ~12ps arrival  ← 7ps difference!
  bit 2: level 6,  ~6ps arrival

Thread 50 (timing-aware grouping):
  bit 0: level 5, ~5ps arrival
  bit 1: level 5, ~5ps arrival    ← similar timing
  bit 2: level 6, ~6ps arrival

Current Timing Correlation

The placement algorithm already computes logic levels:

#![allow(unused)]
fn main() {
// Level = max(level of inputs) + 1
level[node] = max(level[input_a], level[input_b]) + 1;
}

Logic level correlates with timing (more levels = more gate delays), but signals at the same level can still have different actual delays due to:

Different gate types (AND2_00_0 vs AND2_11_1)
Different wire loads
Path reconvergence

Solution: Sort by Timing Before Packing

Before assigning bit positions, sort signals by their estimated arrival time:

#![allow(unused)]
fn main() {
// Collect nodes at this level
let mut nodes_to_place: Vec<_> = candidates
    .filter(|n| level[n] == selected_level)
    .collect();

// Sort by arrival time (level as proxy, or actual timing if available)
nodes_to_place.sort_by_key(|n| arrival_estimate[n]);

// Place in sorted order - similar timing ends up in same thread
for (slot, node) in nodes_to_place.iter().enumerate() {
    place_bit(..., slot, *node);
}
}

Alternative Approaches

Approach	Complexity	Effectiveness	When to Use
Sort by timing	Low	Good	Default choice
Timing-aware partitioning	High	Best	Large designs
Post-placement swapping	Medium	Good	Fine-tuning
Timing bands	Low	Moderate	Simple heuristic

Timing Bands

Group signals into arrival time bands:

Band 0: 0-10ps   → Threads 0-63
Band 1: 10-20ps  → Threads 64-127
Band 2: 20-30ps  → Threads 128-191
Band 3: 30+ps    → Threads 192-255

Measuring Packing Quality

Diagnostic to measure timing variance per thread:

#![allow(unused)]
fn main() {
fn analyze_timing_packing(hier: &Hierarchy, arrivals: &[u64]) {
    for thread in 0..256 {
        let times: Vec<_> = get_bits_in_thread(hier, thread)
            .map(|b| arrivals[b])
            .collect();

        let range = times.iter().max() - times.iter().min();
        let variance = compute_variance(&times);

        if range > threshold {
            warn!("Thread {} has {}ps timing spread", thread, range);
        }
    }
}
}

Impact on Approximation Accuracy

With timing-aware packing:

Reduced false positives: Fewer spurious timing violations from max approximation
Tighter bounds: Per-thread arrival closer to actual signal arrivals
Better critical path identification: Max arrival more accurately reflects true critical path

Performance Expectations

Metric	Zero-Delay	With Timing
Kernel launches	N	N
Shared memory	3KB	3.25KB
Registers	~32	~36
Instructions/gate	~5	~9
Estimated overhead	-	15-25%

The overhead is modest because:

Timing operations are simple (max, add)
Memory access pattern is identical
No additional synchronization needed
Same parallelism structure

References

src/pe.rs - Partition executor and boomerang stage construction
csrc/kernel_v1_impl.cuh - GPU kernel implementation
src/flatten.rs - Script generation with timing data
src/event_buffer.rs - GPU→CPU event communication
src/liberty_parser.rs - Timing library parsing

Jacquard Documentation