Verification

Readable but wrong is the cardinal sin of reverse engineering. WARDEN treats verification not as a post-hoc sanity check but as a first-class stage of the loop. This page explains what “verified” means along three distinct axes, what runs unconditionally today, what activates when native tooling is present, and what is scaffolded for the future.

Three honest axes of “100%”

“100% reverse engineered” is a precise, three-part claim. Each axis addresses a different failure mode.

Axis	Claim	Failure mode it guards
Symbol coverage	Every defined function has a binding: a name, a provenance, and a confidence score. No `func_412` stays anonymous.	Dark spots that conceal unknown behavior.
Behavioral equivalence	Reconstructions are differentially executed against the original until their outputs and side effects match on a corpus of inputs.	Plausible-sounding names that describe the wrong behavior.
Change accountability	Across any two versions, every delta is mapped to a function and a semantic explanation. Nothing changes silently.	Missed security-relevant changes in an update.

Coverage and accountability run fully today. Behavioral equivalence is corpus-bounded evidence, not a formal proof, and the automated harness activates only when the right native toolchain is present.

Running the check

warden verify app_v1.wasm

The verify command reads the file, runs the determinism check, and prints a colored summary. It then calls differential_plan and reports whether the optional behavioral harness is ready:

42/42 functions fingerprint deterministically
differential equivalence ready: False  tooling={'wasm2c': False, 'w2c2': False, 'cc': True, 'wabt_validate': False, 'can_differential': False}

If any function’s stable_id is unstable, the exit line prints in red and the offending function indices are listed. verify takes no options beyond the .wasm path. The determinism check always runs; differential readiness is always reported.

What runs today: `verify_determinism`

verify_determinism(wasm_bytes: bytes) -> DeterminismReport is the zero-dependency check that runs on every call to warden verify. It requires only the Python standard library.

How it works

The function parses the same raw bytes twice via parse_module, calls fingerprint_function on each pair of defined functions, and compares the resulting stable_id strings byte-exactly. The returned DeterminismReport carries:

total: number of defined (non-import) functions examined.
stable: count whose stable_id matched across both parse runs.
unstable: list of function indices where the id differed.
ok: True when unstable is empty.
summary: a one-liner such as "42/42 functions fingerprint deterministically".
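The fields above can be pictured with a small sketch. It is illustrative only: `DeterminismReportSketch`, `check_determinism`, and `toy_fingerprints` are hypothetical names, and the fingerprint callback stands in for the real parse_module plus fingerprint_function pair, whose internals are not shown on this page.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeterminismReportSketch:
    """Hypothetical mirror of the report fields listed above."""
    total: int
    unstable: List[int] = field(default_factory=list)

    @property
    def stable(self) -> int:
        return self.total - len(self.unstable)

    @property
    def ok(self) -> bool:
        return not self.unstable

    @property
    def summary(self) -> str:
        return f"{self.stable}/{self.total} functions fingerprint deterministically"


def check_determinism(data: bytes, fingerprint_all: Callable[[bytes], List[str]]):
    """Fingerprint the same bytes twice and compare the ids position by position."""
    first = fingerprint_all(data)
    second = fingerprint_all(data)
    unstable = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    return DeterminismReportSketch(total=len(first), unstable=unstable)


def toy_fingerprints(data: bytes) -> List[str]:
    """Stand-in fingerprint: one hash per 4-byte chunk (not WARDEN's algorithm)."""
    return [hashlib.sha256(data[i:i + 4]).hexdigest()[:16] for i in range(0, len(data), 4)]


_calls = {"n": 0}


def flaky_fingerprints(data: bytes) -> List[str]:
    """Deliberately unstable stand-in: the second id changes on every call."""
    _calls["n"] += 1
    return ["stable-id", f"drifting-id-{_calls['n']}"]


good = check_determinism(b"\x00asm\x01\x00\x00\x00", toy_fingerprints)
bad = check_determinism(b"\x00asm\x01\x00\x00\x00", flaky_fingerprints)
print(good.summary)
print(bad.ok, bad.unstable)
```

The flaky stand-in shows the failure the check exists to catch: an id that depends on anything other than the bytes.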

Why it matters for carry-over

The stable_id is WARDEN’s stable function identity: a composite of the structural fingerprint, call-neighborhood, and type signature that stays constant across rebuilds even when table indices shift. The entire annotation carry-over mechanism (the feature that makes RE incremental rather than Sisyphean) depends on every stable_id being reproducible from the binary bytes alone.

If a stable_id were non-deterministic, annotations written against one parse would silently fail to port to the next version. An unstable result from verify_determinism indicates a bug in the fingerprinting engine that must be fixed before any carry-over can be trusted.

Because the check operates on raw bytes with no external tooling, it runs identically in CI, in a sandboxed environment, and on a laptop with nothing native installed.

Differential execution (runs today)

The warden.interp mini-interpreter is the middle tier: it makes behavioral equivalence runnable right now, with zero native toolchain required, for the i32 integer subset and the f32 float subset, which together cover much of the arithmetic and glue code that Emscripten emits.

What the interpreter covers

execute_function is a pure-Python stack-machine evaluator for the i32 integer subset. It models:

Arithmetic: add, sub, mul, and both signed and unsigned division and remainder (div_s, div_u, rem_s, rem_u).
Bitwise: and, or, xor, the shifts shl, shr_s, shr_u, and the rotates rotl, rotr.
Bit counting: clz (count leading zeros), ctz (count trailing zeros), and popcnt (count set bits).
All i32 comparisons, signed and unsigned (eq, ne, lt_s, lt_u, gt_s, gt_u, le_s, le_u, ge_s, ge_u, eqz).
The select opcode (pick one of two values on a condition).
Structured control: block, loop, if/else/end, br, br_if, return.
Local variable operations: local.get, local.set, local.tee.
Direct function calls (with recursive fuel tracking).
Linear memory loads and stores at full and narrow widths: i32.load, i32.store, the narrow loads i32.load8_s, i32.load8_u, i32.load16_s, i32.load16_u, and the narrow stores i32.store8, i32.store16.
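As a rough illustration of what modeling these i32 opcodes in pure Python involves, the sketch below implements a few of them on plain integers masked to 32 bits. The helper names are hypothetical and this is not WARDEN's interpreter code; it only follows the standard WebAssembly semantics for these operations.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x


def rotl32(x: int, n: int) -> int:
    """i32.rotl: rotate left, with the shift count taken modulo 32."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def shr_s32(x: int, n: int) -> int:
    """i32.shr_s: arithmetic shift right, replicating the sign bit."""
    return (to_signed32(x) >> (n & 31)) & MASK32


def clz32(x: int) -> int:
    """i32.clz: count leading zero bits (32 for an input of zero)."""
    return 32 - (x & MASK32).bit_length()


def ctz32(x: int) -> int:
    """i32.ctz: count trailing zero bits (32 for an input of zero)."""
    x &= MASK32
    return 32 if x == 0 else (x & -x).bit_length() - 1


def popcnt32(x: int) -> int:
    """i32.popcnt: count set bits."""
    return bin(x & MASK32).count("1")


print(hex(rotl32(0x80000001, 1)), hex(shr_s32(0x80000000, 31)), clz32(1), ctz32(8))
```

Python integers are unbounded, so every result is masked back to 32 bits; forgetting the mask is the most common way such a model drifts from a real engine.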

The interpreter now also executes the f32 single-precision float subset:

f32 constants (f32.const).
f32 arithmetic: add, sub, mul, div, min, and max.
f32 unary math: neg, abs, and sqrt.
All f32 comparisons: eq, ne, lt, gt, le, and ge.

This means behavioral equivalence now covers f32 functions on any machine, with no native toolchain required, exactly as it already did for the integer subset. Anything outside that scope raises UnsupportedExecution. The harness catches this and records the pair as “undecided” rather than crashing, so a partially-modeled module still yields useful results for the functions that are covered.

Integer division and remainder follow the WebAssembly rules. An integer division by zero raises UnsupportedExecution rather than returning a wrong number, and the signed forms apply two’s-complement semantics before dividing. Float division does not trap: f32.div by zero follows IEEE-754 and yields an infinity or NaN, exactly as a real engine would. The narrow loads come in signed and unsigned pairs: load8_s sign-extends the byte into the full i32, while load8_u zero-extends it.
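The rules in this paragraph can be sketched as follows. The helper names are hypothetical, and the `UnsupportedExecution` class here is a local stand-in for WARDEN's exception. Treating the `INT_MIN / -1` overflow like division by zero is an assumption based on the WebAssembly trap rules; this page does not say how WARDEN handles that case.

```python
import math
import struct

MASK32 = 0xFFFFFFFF


class UnsupportedExecution(Exception):
    """Local stand-in for WARDEN's exception of the same name."""


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x


def div_s32(a: int, b: int) -> int:
    """i32.div_s: signed division, truncating toward zero."""
    sa, sb = to_signed32(a), to_signed32(b)
    if sb == 0:
        raise UnsupportedExecution("integer divide by zero")
    if sa == -0x8000_0000 and sb == -1:
        raise UnsupportedExecution("integer overflow")  # assumed handling
    q = abs(sa) // abs(sb)
    return (-q if (sa < 0) != (sb < 0) else q) & MASK32


def rem_s32(a: int, b: int) -> int:
    """i32.rem_s: the remainder takes the sign of the dividend."""
    sa, sb = to_signed32(a), to_signed32(b)
    if sb == 0:
        raise UnsupportedExecution("integer divide by zero")
    r = abs(sa) % abs(sb)
    return (-r if sa < 0 else r) & MASK32


def load8_s(memory: bytearray, addr: int) -> int:
    """Sign-extend one byte into a full i32."""
    b = memory[addr]
    return (b - 0x100 if b & 0x80 else b) & MASK32


def load8_u(memory: bytearray, addr: int) -> int:
    """Zero-extend one byte into a full i32."""
    return memory[addr]


def f32_round(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # finite but too large for f32
        return math.copysign(math.inf, x)


def f32_div(a: float, b: float) -> float:
    """f32.div: never traps; x/0 is +/-inf and 0/0 is NaN, per IEEE-754."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
    return f32_round(a / b)


def traps(fn, *args) -> bool:
    """Report whether a call raises UnsupportedExecution."""
    try:
        fn(*args)
    except UnsupportedExecution:
        return True
    return False


print(hex(div_s32(0xFFFFFFF9, 2)), hex(load8_s(bytearray([0xFF]), 0)), f32_div(1.0, 0.0))
```

Python's `//` floors toward negative infinity and float division by zero raises `ZeroDivisionError`, so both cases need explicit handling to match an engine.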

Execution is deterministic by construction: the result depends only on the module bytes, the function index, and the argument vector. No wall-clock, no RNG, no external state.

Running one function

warden exec <label> <index> [args...]

The exec command looks up the named project, parses its .wasm, locates function <index>, and executes it on the provided arguments.

warden exec app_v1 42 10 3
# → [91]

Programmatically:

from warden.ingest import parse_module
from warden.interp import execute_function

module = parse_module(wasm_bytes)
func = module.functions[42]
results = execute_function(module, func, [10, 3])
# → [91]

execute_function accepts optional keyword arguments:

Argument	Default	Purpose
`host`	`None`	Callback for imported function calls; absent → imports are no-ops
`memory`	Zeroed 64 KiB	Pre-seeded `bytearray`; observe stores after the call
`fuel`	`100000`	Instruction budget; exceeded → `UnsupportedExecution`
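To show how an instruction budget turns a runaway loop into a clean UnsupportedExecution, here is a toy stack machine. It is a sketch, not the real evaluator: the tuple-based program format, the `jump` pseudo-opcode, and the names `run_sketch` and `exhausts` are all hypothetical.

```python
MASK32 = 0xFFFFFFFF


class UnsupportedExecution(Exception):
    """Local stand-in for WARDEN's exception of the same name."""


def run_sketch(program, args, *, fuel=100_000):
    """Evaluate a tiny i32 stack program; every instruction costs one unit of fuel."""
    stack, local_vars, pc = [], list(args), 0
    while pc < len(program):
        if fuel <= 0:
            raise UnsupportedExecution("fuel exhausted")
        fuel -= 1
        op, *imm = program[pc]
        pc += 1
        if op == "local.get":
            stack.append(local_vars[imm[0]] & MASK32)
        elif op == "i32.const":
            stack.append(imm[0] & MASK32)
        elif op == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK32)
        elif op == "i32.mul":
            b, a = stack.pop(), stack.pop()
            stack.append((a * b) & MASK32)
        elif op == "jump":  # hypothetical stand-in for a branch back to a loop head
            pc = imm[0]
        else:
            raise UnsupportedExecution(f"unmodeled opcode: {op}")
    return stack


def exhausts(program, args, fuel) -> bool:
    """Report whether the program fails to finish within the given budget."""
    try:
        run_sketch(program, args, fuel=fuel)
    except UnsupportedExecution:
        return True
    return False


# Computes a * b + 1 in five instructions.
program = [("local.get", 0), ("local.get", 1), ("i32.mul",), ("i32.const", 1), ("i32.add",)]
print(run_sketch(program, [10, 3]))
print(exhausts([("jump", 0)], [], fuel=50))
```

Routing both "out of fuel" and "unknown opcode" through the same exception lets a caller treat them uniformly as undecided rather than wrong.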

The deterministic input corpus: `generate_inputs`

A differential check is only as good as the inputs it tries, and those inputs must be the same on every run and on every machine. verify.corpus.generate_inputs produces a fixed corpus of argument vectors from the function’s arity alone, with no use of the wall clock and no use of the random module:

from warden.verify.corpus import generate_inputs

inputs = generate_inputs(2)
# → [[0, 1], [1, 2], [2, 0xFFFFFFFF], [0xFFFFFFFF, 0x7FFFFFFF], ...]

generate_inputs(param_count, *, count=64, seed=0) returns a list of up to count vectors, each a list of param_count i32 values. The corpus always leads with the cases that catch the most bugs, then fills the rest from a seeded integer recurrence:

Boundary values that probe sign and overflow edges: 0, 1, 2, -1 (as 0xFFFFFFFF), 0x7FFFFFFF (the largest positive signed i32), 0x80000000 (the most negative), 7, and 1024.
Pseudo-random spreads drawn from a hand-written, seeded linear congruential recurrence (x = (1103515245 * x + 12345) & 0xFFFFFFFF) so the same seed and count always yield the same corpus, byte for byte.
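A minimal sketch of a generator with this shape follows. The boundary list and the recurrence are taken from the text above; rotating the boundary list across parameters is an assumption, chosen because it reproduces the first vectors in the example output. `generate_inputs_sketch` is a hypothetical name, not the library function.

```python
BOUNDARIES = [0, 1, 2, 0xFFFFFFFF, 0x7FFFFFFF, 0x80000000, 7, 1024]


def generate_inputs_sketch(param_count, *, count=64, seed=0):
    """Return up to `count` argument vectors of `param_count` i32 values each."""
    if param_count == 0:
        return [[]]
    vectors = []
    # Lead with boundary values; rotating the start index is an assumed layout.
    for start in range(min(count, len(BOUNDARIES))):
        vectors.append(
            [BOUNDARIES[(start + j) % len(BOUNDARIES)] for j in range(param_count)]
        )
    # Fill the remainder from the seeded linear congruential recurrence.
    x = seed & 0xFFFFFFFF
    while len(vectors) < count:
        vec = []
        for _ in range(param_count):
            x = (1103515245 * x + 12345) & 0xFFFFFFFF
            vec.append(x)
        vectors.append(vec)
    return vectors


corpus = generate_inputs_sketch(2)
print(corpus[:4])
```

Because nothing here reads the clock or the random module, two calls with the same arguments return equal lists, which is the property the differential checks rely on.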

A param_count of 0 yields a single empty argument vector, [[]], regardless of count.

The corpus generator also produces deterministic float inputs for f32 parameters through generate_float_inputs(param_count, *, count=32, seed=0). It leads with boundary cases that probe the float edges (0.0, 1.0, -1.0, 0.5, 2.0, and a large and a small magnitude), then fills the rest from the same seeded recurrence mapped to floats, so the same seed and count always yield the same float corpus, value for value. As with the integer corpus, the inputs depend only on the function’s arity, with no wall clock and no random module, and a param_count of 0 yields [[]].

To map a stored function’s type signature to a parameter count, corpus.parse_param_count reads a signature string such as "(i32, i32) -> (i32)" and returns the count. It tolerates the unknown-type placeholder "(?) -> (?)" and an empty parameter list, returning 0 for both.
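The two helpers described here can be approximated as below. Both functions are hypothetical sketches: the parsing is one plausible reading of the signature format shown, and the mapping from recurrence output to floats (a linear map into [-1, 1)) is an assumption, since the text does not specify the real one.

```python
def parse_param_count_sketch(signature: str) -> int:
    """Count parameters in a signature string such as "(i32, i32) -> (i32)"."""
    params = signature.split("->", 1)[0].strip()
    inner = params.strip("()").strip()
    if not inner or inner == "?":
        return 0  # empty parameter list or unknown-type placeholder
    return len([p for p in inner.split(",") if p.strip()])


def float_fill_sketch(n: int, *, seed: int = 0):
    """Map the seeded integer recurrence to floats in [-1, 1) (assumed mapping)."""
    x = seed & 0xFFFFFFFF
    out = []
    for _ in range(n):
        x = (1103515245 * x + 12345) & 0xFFFFFFFF
        out.append(x / 0x1_0000_0000 * 2.0 - 1.0)
    return out


print(parse_param_count_sketch("(i32, i32) -> (i32)"))
print(parse_param_count_sketch("(?) -> (?)"))
```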

The recurrence is a fixed integer formula, not the random module. Determinism is a hard requirement here for the same reason it is for the fingerprint: a check that depends on the clock or on an unseeded RNG cannot be reproduced in CI, so its result could not be trusted. Pass an explicit seed to sweep a different but still fully reproducible corpus.

Differential execution

differential_execute runs two functions from two modules over the same input corpus and reports per-input agreement:

from warden.interp import differential_execute

rows = differential_execute(mod_v1, fn_v1, mod_v2, fn_v2, inputs=[
    [0, 0], [1, 2], [255, 1], [0x7FFFFFFF, 1],
])
# Each row: {"args": [...], "a": [...], "b": [...], "match": True|False}

The match field is True when both sides return identical result stacks. If either side raises UnsupportedExecution, that side records None and match is False (undecided, not wrong).

The comparison is NaN-safe: two float results that are both NaN count as a match, even though NaN is never equal to itself under normal float equality, so a function that legitimately returns NaN on both sides is not flagged as a spurious divergence.

Concrete example. parse_token v1 and v2 differ structurally (v2 adds a bounds check), but the bounds-check result is dropped before the return. differential_execute shows that they agree on every input in the corpus: every row has match: True. internal_crc, by contrast, shows match: False on most inputs, which is a genuine behavioral change.

# parse_token v1 == v2: bounds check is dropped, behavior unchanged
{"args": [10, 3],          "a": [91],  "b": [91],  "match": True}
{"args": [0x7FFFFFFF, 1],  "a": [...], "b": [...],  "match": True}

# internal_crc v1 != v2: genuine behavioral divergence
{"args": [10, 3],  "a": [0x1234], "b": [0xABCD], "match": False}

differential_execute never mutates the modules or functions it receives, and each input gets a fresh zeroed memory. You can run it in a loop across a large corpus without side effects.
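The NaN-safe comparison and the row shape described above could look roughly like this. The function names are hypothetical and the real implementation may differ; the sketch only encodes the rules as stated: None on either side means undecided, and two NaN results count as equal.

```python
import math


def values_match(a, b) -> bool:
    """Equality that treats NaN == NaN as a match."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def stacks_match(a, b) -> bool:
    """Compare two result stacks; None marks a side that raised UnsupportedExecution."""
    if a is None or b is None:
        return False  # undecided, not wrong
    return len(a) == len(b) and all(values_match(x, y) for x, y in zip(a, b))


def make_row(args, a, b) -> dict:
    """Build one result row in the shape shown in the examples above."""
    return {"args": list(args), "a": a, "b": b, "match": stacks_match(a, b)}


print(make_row([10, 3], [91], [91]))
print(make_row([0.0, 0.0], [float("nan")], [float("nan")])["match"])
```

Callers that need to tell "undecided" apart from "mismatched" should check for None before reading match, since both cases set it to False.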

Proving two versions equivalent: `differential_versions`

differential_versions is the one-call answer to the question diffing raises: a function’s structure changed across an update, but did its behavior change? It works from the knowledge base, pairs generate_inputs with differential_execute so you do not have to hand-write a corpus, and it needs no native toolchain at all:

from warden.verify import differential_versions

report = differential_versions(kb, v1.id, v2.id)
# report["checked"]     → number of (function, input) pairs tried
# report["matched"]     → how many agreed
# report["mismatched"]  → how many returned different results
# report["undecided"]   → pairs where one side hit an unmodeled opcode
# report["divergences"] → [{"func_index", "args", "a", "b"}, ...] for each mismatch

differential_versions(kb, from_version_id, to_version_id, *, func_index=None, samples=64) parses both versions from their recorded .wasm paths (via kb.version_paths), picks the function to compare (the one at func_index, or every defined index present in both versions when func_index is None), generates the deterministic corpus from each function’s type signature, runs both sides, and aggregates the result. The corpus is chosen by parameter type: a function whose parameters are all floats (f32/f64) gets the float corpus from generate_float_inputs, and every other function gets the integer corpus from generate_inputs. Because the corpus is fixed and the interpreter is deterministic, the same two versions always produce the same report.

Resolve the functions

With func_index set, just that function is compared, if it is defined in both versions. Otherwise every defined function index present in both versions is compared.

Read the parameter count

Each function’s type signature fixes how many values an input vector needs. parse_param_count reads the count, and the parameter types select the corpus: generate_float_inputs for all-float parameters, generate_inputs otherwise.

Execute both versions over every input

Each side runs in the pure-Python interpreter on a fresh zeroed memory, exactly as differential_execute does.

Classify and tally each row

A row where both sides returned the same stack increments matched. A row where they returned different stacks increments mismatched and is appended to divergences. A row where either side raised UnsupportedExecution (the interpreter returned None) increments undecided.
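The classification step can be sketched as a small aggregation over rows. `tally_sketch` is a hypothetical helper, and the input shape (a dict from function index to rows) is an assumption made for the example; the report keys match those listed earlier.

```python
def tally_sketch(rows_by_func: dict) -> dict:
    """Aggregate per-input rows into the report shape described above."""
    report = {"checked": 0, "matched": 0, "mismatched": 0, "undecided": 0, "divergences": []}
    for func_index, rows in rows_by_func.items():
        for row in rows:
            report["checked"] += 1
            if row["a"] is None or row["b"] is None:
                report["undecided"] += 1  # one side hit an unmodeled opcode
            elif row["match"]:
                report["matched"] += 1
            else:
                report["mismatched"] += 1
                report["divergences"].append(
                    {"func_index": func_index, "args": row["args"], "a": row["a"], "b": row["b"]}
                )
    return report


# Illustrative rows only; the values are made up for the example.
rows = {
    3: [{"args": [10, 3], "a": [91], "b": [91], "match": True}],
    5: [
        {"args": [10, 3], "a": [4660], "b": [4661], "match": False},
        {"args": [1, 2], "a": None, "b": [7], "match": False},
    ],
}
report = tally_sketch(rows)
print(report["checked"], report["matched"], report["mismatched"], report["undecided"])
```

Checking for None before checking match is what keeps an unmodeled opcode from being counted as a behavioral divergence.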

Concrete example. Run it on parse_token (index 3), whose v2 adds a bounds check whose result is dropped, and on internal_crc (index 5), whose v2 genuinely adds one to the returned value:

parse = differential_versions(kb, v1.id, v2.id, func_index=3)
# parse["mismatched"] → 0   (the dropped bounds check does not change behavior)

crc = differential_versions(kb, v1.id, v2.id, func_index=5)
# crc["mismatched"] > 0, crc["divergences"] lists the inputs that diverge

This is the behavioral half of the carry-over story. Diffing says the bytes moved; this says whether the behavior moved with them.

Three-tier verification model

With the interpreter in place, WARDEN now has three stacked tiers for behavioral equivalence:

Tier	Dependency	Coverage
`warden.interp` (mini-interpreter)	Pure Python, stdlib only	i32 integer subset and the f32 float subset; runs today on any machine
`wasm2c` / `w2c2` + `cc` (native harness)	WABT + C compiler	Full numeric and memory behavior; corpus-bounded
SeeWasm (symbolic cross-check)	SeeWasm (future)	Path-condition soundness for security-critical branches

The interpreter fills the gap between the zero-cost determinism check above and the native harness below. For the integer-heavy glue code Emscripten emits, it is often sufficient on its own.

The optional differential harness

The behavioral harness activates only when the right native toolchain is present. Nothing silently pretends a check ran. WARDEN reports readiness honestly.

Tooling detection: `tooling_status()`

tooling_status() uses shutil.which to probe PATH for four tools:

Field	Binary probed	Role
`wasm2c`	`wasm2c`	WABT’s WebAssembly-to-C transpiler
`w2c2`	`w2c2`	Alternative wasm-to-C transpiler (faster compile)
`cc`	`cc`, `clang`, or `gcc` (first found)	C compiler to build the lifted output
`wabt_validate`	`wasm-validate`	Structural validation of the `.wasm` before lifting

The can_differential property is True when (wasm2c or w2c2) and cc. That is the minimum required to execute the harness. wabt_validate is independent, useful for confirming the input is well-formed before spending time lifting it.
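A minimal sketch of this detection logic, assuming a dataclass-based status object, is shown below. The names `ToolingStatusSketch` and `tooling_status_sketch` are hypothetical, and the `which` parameter is added here only so the example can be exercised without touching the real PATH.

```python
import shutil
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolingStatusSketch:
    wasm2c: bool
    w2c2: bool
    cc: bool
    wabt_validate: bool

    @property
    def can_differential(self) -> bool:
        # Minimum for the harness: one transpiler plus a C compiler.
        return (self.wasm2c or self.w2c2) and self.cc

    def as_dict(self) -> dict:
        d = asdict(self)
        d["can_differential"] = self.can_differential
        return d


def tooling_status_sketch(which=shutil.which) -> ToolingStatusSketch:
    """Probe PATH for each tool; `which` is injectable for testing (an assumption)."""
    return ToolingStatusSketch(
        wasm2c=which("wasm2c") is not None,
        w2c2=which("w2c2") is not None,
        cc=any(which(name) is not None for name in ("cc", "clang", "gcc")),
        wabt_validate=which("wasm-validate") is not None,
    )


def only_cc(name):
    """Fake lookup simulating a machine with a C compiler but no transpiler."""
    return "/usr/bin/cc" if name == "cc" else None


status = tooling_status_sketch(which=only_cc)
print(status.as_dict())
```

With only a compiler present, the sketch yields the same dictionary as the sample verify output earlier on this page: cc is True and can_differential is False.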

The plan: `differential_plan(wasm_path)`

differential_plan(wasm_path: str) -> dict calls tooling_status() and returns a dict describing what would run and whether it can:

{
    "tooling": { ... },        # ToolingStatus.as_dict()
    "ready": True | False,     # can_differential
    "steps": [ ... ],          # concrete shell commands
    "note": "...",             # human-readable install hint when not ready
}

When ready is False, the note field tells you exactly what to install. The concrete steps it describes, in order:

Transpile the original to C

wasm2c app_v1.wasm -o target.c

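wasm2c (from WABT) produces C source that is intended to be functionally equivalent to the original by construction. The generated code is not standalone: it is compiled together with WABT’s runtime in the next step.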

Compile the lifted C to a native executable

cc -O2 target.c wasm-rt-impl.c -o target_ref

The WABT runtime shim wasm-rt-impl.c is included with WABT.

Compile the agent/LLM reconstruction the same way

The reconstruction is compiled identically to produce a second native executable.

Differential execution over a fuzzer corpus

Both executables are run over a fuzzer-generated input corpus. Outputs and observable memory side effects must agree across every input.
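The plan dictionary and the steps above could be assembled along these lines. This is a sketch under stated assumptions: `differential_plan_sketch` is a hypothetical name, the tooling status is passed in as a plain dict, the wording of the note is invented for illustration, and only the wasm2c commands shown on this page are used.

```python
def differential_plan_sketch(wasm_path: str, tooling: dict) -> dict:
    """Describe what would run, and whether it can, without running anything."""
    has_transpiler = bool(tooling.get("wasm2c") or tooling.get("w2c2"))
    has_cc = bool(tooling.get("cc"))
    ready = has_transpiler and has_cc

    steps = [
        f"wasm2c {wasm_path} -o target.c",
        "cc -O2 target.c wasm-rt-impl.c -o target_ref",
        "compile the reconstruction the same way",
        "run both executables over the input corpus and compare outputs",
    ]

    missing = []
    if not has_transpiler:
        missing.append("wasm2c or w2c2")
    if not has_cc:
        missing.append("a C compiler (cc, clang, or gcc)")
    note = "" if ready else "missing: " + ", ".join(missing)  # illustrative wording

    return {"tooling": dict(tooling), "ready": ready, "steps": steps, "note": note}


plan = differential_plan_sketch(
    "app_v1.wasm",
    {"wasm2c": False, "w2c2": False, "cc": True, "wabt_validate": False},
)
print(plan["ready"], plan["note"])
```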

Running the pipeline: `run_differential`

differential_plan describes the work; run_differential does it when it can. It is the orchestration entry point: it runs the wasm2c/w2c2 differential pipeline when a transpiler and a C compiler are detected, and returns the plan instead of pretending when they are not:

from warden.verify import run_differential

result = run_differential("app_v1.wasm", reference_path="reconstruction.wasm")
# Toolchain present:
#   result["ran"]    → True
#   result["tool"]   → "wasm2c" or "w2c2", whichever was detected
#   result["steps"]  → one entry per command run, with its captured output
#   result["result"] → the final command's result
# Toolchain absent:
#   result["ran"]    → False
#   result["reason"] → why the native run could not happen
#   result["plan"]   → the same dict differential_plan returns

The full signature is run_differential(wasm_path, *, reference_path=None, func_index=None, samples=64, runner=None). The control flow is deliberately conservative.

Detect the toolchain

run_differential calls tooling_status() first. If can_differential is False, it does no work: it returns {"ran": False, "reason": ..., "plan": ...}, where plan is exactly what differential_plan returns. Nothing is faked.

Build the commands

When (wasm2c or w2c2) and cc holds, the command lists are built by small pure helpers (_wasm2c_cmd, _w2c2_cmd, _cc_cmd). The transpiler it uses is whichever of wasm2c or w2c2 was detected.

Run through an injectable runner

Each command runs through runner, which defaults to a thin wrapper over subprocess.run that captures output and raises on a non-zero exit status. Passing a custom runner is what lets a test drive the full orchestration without wasm2c or a compiler installed.

Report the outcome

The returned dict records that the run executed, which tooling was found, and the per-command results. A reported pass means a real native run agreed, not that a check was skipped.

run_differential shells out through subprocess and shutil from the standard library only. The runner indirection keeps it testable and keeps it honest: every native command it runs is exactly the one the helpers build, so the plan and the run never drift apart.
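The runner indirection can be illustrated with a short sketch. `default_runner`, `run_pipeline_sketch`, and `fake_runner` are hypothetical names; the commands are the ones shown earlier on this page, and the real helpers may build them differently.

```python
import subprocess


def default_runner(cmd):
    """Thin wrapper over subprocess.run: capture output, raise on non-zero exit."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


def run_pipeline_sketch(commands, runner=default_runner):
    """Run each command through the runner and record the per-command results."""
    steps = [{"cmd": cmd, "result": runner(cmd)} for cmd in commands]
    return {"ran": True, "steps": steps, "result": steps[-1]["result"] if steps else None}


recorded = []


def fake_runner(cmd):
    """Test double: records the command instead of executing it."""
    recorded.append(cmd)
    return f"ok: {cmd[0]}"


commands = [
    ["wasm2c", "app_v1.wasm", "-o", "target.c"],
    ["cc", "-O2", "target.c", "wasm-rt-impl.c", "-o", "target_ref"],
]
outcome = run_pipeline_sketch(commands, runner=fake_runner)
print(outcome["result"])
```

Because the fake runner sees exactly the command lists that would be executed, a test can assert on them directly with no toolchain installed.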

Activating the harness

macOS (Homebrew):

brew install wabt

Debian/Ubuntu:

apt install wabt gcc

After installation, warden verify app_v1.wasm will report differential equivalence ready: True.

What behavioral equivalence actually claims

When the differential harness reports a function as verified, the precise claim is corpus-bounded behavioral equivalence: the reconstruction agrees with the original on every input tried, to the depth the fuzzer reached. This is strong, automatable evidence. It is not a formal proof. The corpus is finite, and an adversarial input outside it could in principle expose a divergence. Formal equivalence checking via SMT-based symbolic execution over all inputs is a future direction for cryptographically critical functions, but it is not what the current harness provides.

For most reverse-engineering purposes (understanding behavior, tracking changes, writing accurate documentation), corpus-bounded evidence is sufficient and is considerably stronger than “the name sounds right.”

Future dynamic ground-truth hooks

Three additional sources of behavioral evidence are identified in the architecture vision. The first two are scaffolded as future work; the third, Oracle matching, is available today.

SeeWasm: symbolic cross-check

For specific claimed path conditions (“this branch fires when magic == 0xCAFE”), SeeWasm can confirm the condition symbolically rather than relying on the model’s assertion. This would plug in as an additional gate in the verifier after the differential harness, providing path-condition soundness for security-critical branches.

Wasabi / Frida / Chrome DevTools: dynamic ground truth

Wasabi is a WebAssembly instrumentation framework; Frida and Chrome DevTools expose runtime traces. Live execution traces confirm which element-table targets actually fire, which atomic operations guard which memory regions, and what values flow through indirect calls. This is precisely the information hardest to recover statically. These hooks would validate the thread model and dynCall attribution produced by the concurrency agent.

Oracle-as-oracle: free verification for matched functions

Any function matched by the Emscripten Oracle to a known upstream source (musl, dlmalloc, the Emscripten runtime) is verified by definition: its behavior is already documented in the public source. The reconstruction can be checked directly against that real source code. This is available today for Oracle-matched functions, at no additional toolchain cost.

Verification in the confidence economy

Every symbol written to the KB carries a provenance and a confidence score. The verifier controls whether an agent proposal is promoted:

Oracle matches (provenance="oracle") are inherently verified against compiled ground truth. They receive the highest confidence.
Agent proposals (provenance="agent") are gated by the verifier before write-back. Until behavioral equivalence is confirmed, they carry a confidence below 1.0 and are marked unverified.
Human names (provenance="human", locked=True) are sovereign. The verifier does not touch them and they cannot be overwritten by agent passes.

An unverified annotation is never silently promoted to ground truth, so subsequent carry-overs and agent passes cannot propagate from it as if it were confirmed. Verification failure is always visible in the KB.
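The gating rules above can be sketched as a single decision function. Everything here is hypothetical: the `SymbolSketch` record, the `gate_proposal` name, and in particular the 0.99 cap, which merely stands in for "a confidence below 1.0" and is not a documented WARDEN value.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SymbolSketch:
    name: str
    provenance: str  # "oracle", "agent", or "human"
    confidence: float
    locked: bool = False
    verified: bool = False


def gate_proposal(existing: Optional[SymbolSketch], proposal: SymbolSketch, *, equivalent: bool):
    """Decide which symbol is written back to the KB."""
    if existing is not None and existing.locked:
        return existing  # locked human names are never overwritten
    if proposal.provenance != "agent":
        return proposal  # only agent proposals are gated in this sketch
    if equivalent:
        return replace(proposal, verified=True)
    # Unconfirmed agent proposal: keep it visible as unverified, below 1.0.
    return replace(proposal, confidence=min(proposal.confidence, 0.99), verified=False)


human = SymbolSketch("parse_token", "human", 1.0, locked=True)
guess = SymbolSketch("tokenize_input", "agent", 1.0)

kept = gate_proposal(human, guess, equivalent=True)
pending = gate_proposal(None, guess, equivalent=False)
print(kept.name, pending.confidence, pending.verified)
```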

Summary of checks

Check	Dependency	Runs today	Guarantee
`verify_determinism`	Pure Python, stdlib only	Yes	`stable_id` byte-stability, which guards all carry-over
`generate_inputs` corpus	Pure Python, stdlib only	Yes	A fixed, reproducible input set so every differential run is comparable
`differential_versions`	Pure Python, stdlib only	Yes	Corpus-bounded behavioral equivalence between two versions for the i32 and f32 subsets
`warden.interp` differential execution	Pure Python, stdlib only	Yes	Corpus-bounded behavioral equivalence for the i32 and f32 subsets
Structural validation	`wasm-validate` (WABT)	When installed	Input is a well-formed WebAssembly module
`run_differential` harness	`wasm2c` or `w2c2` + `cc`	When installed	Corpus-bounded behavioral equivalence (full instruction set)
Oracle source verification	Emscripten Oracle corpus	Today, Oracle-matched functions only	Exact behavioral match to upstream source
Symbolic cross-check	SeeWasm	Future	Path-condition soundness for critical branches
Dynamic trace validation	Wasabi / Frida / DevTools	Future	Runtime dynCall and atomic ground truth

The agent crew

Where the proposals the verifier gates actually come from.

Diff and carry-over

How verified annotations carry forward when a new version ships.
