The Oracle is the phase that typically moves the biggest block of coverage in a single step. Emscripten, musl, dlmalloc/emmalloc, libc++, and the whole WebAssembly runtime are open source. WARDEN exploits that: it compiles the exact runtime a target was built with, across the build-flag matrix that affects codegen, retains the name section, fingerprints every labeled function into a portable store, and then runs that store against the stripped target. Functions that match at or above the acceptance threshold get the real upstream name attached immediately, with provenance oracle.
Alpha status. The matching engine, signature store, and corpus farm script are all implemented and working. The seed store that ships with a fresh clone is intentionally empty. You build your own. A shared community corpus is on the roadmap.

Compile your own ground truth

No other FLIRT- or BinDiff-style workflow auto-builds its corpus from the toolchain’s own source across the version/flag matrix. The critical ingredient is --profiling-funcs. Without it, an optimized Emscripten build strips the wasm name section. The corpus builder, extract_signatures, looks up each function’s name from:
  1. module.names.function_names: the wasm name section (primary source).
  2. func.export_names[0]: used if the function is exported and the name section is absent.
If neither yields a name, the function is silently skipped. Retaining names is not optional: it is the entire source of ground truth. Use --profiling-funcs (or -g / -gsource-map) on every corpus build. This flag forces Emscripten to emit the name section even at -O2 and -Oz where it would otherwise be stripped.
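The lookup order above can be sketched as follows. This is a minimal illustration and not the actual extract_signatures code: resolve_name is a hypothetical helper, and a plain dict and list stand in for the parsed name section and a function's export names.

```python
from typing import Optional


def resolve_name(
    func_index: int,
    function_names: dict[int, str],
    export_names: list[str],
) -> Optional[str]:
    """Return the ground-truth name for one function, or None to skip it.

    Hypothetical helper: `function_names` stands in for the parsed wasm
    name section and `export_names` for the function's export list.
    """
    # 1. Primary source: the wasm name section.
    name = function_names.get(func_index)
    if name:
        return name
    # 2. Fallback: the first export name, if the function is exported.
    if export_names:
        return export_names[0]
    # 3. No label available: the function contributes nothing to the corpus.
    return None
```

A build without a name section and without exports therefore yields no signatures at all, which is why the name-retaining flag matters on every corpus build.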

The signature store

Signature

Defined in src/warden/oracle/signatures.py. One Signature is a labeled fingerprint of a single function extracted from a corpus build.
Field               Type            Meaning
name                str             The real upstream name, from the wasm name section or export.
library             str             Which library: musl, musl-pthread, libc++, libc++abi, dlmalloc, wasi-libc, emscripten, or the --library default.
emscripten_version  str | None      The emsdk version these were built with.
opt_level           str | None      The -O level, e.g. -O2.
source_ref          str | None      Optional upstream source link.
structural_hash     str             Blake2b of the normalized control-flow/opcode skeleton.
exact_hash          str             SHA-256 of the raw function body.
minhash             list[int]       32-element MinHash over instruction 4-grams (fuzzy matching).
histogram           dict[str, int]  Opcode-class frequency histogram.
call_targets        list[str]       Import call-neighborhood (module.field strings).
type_signature      str             WASM type signature string.
instruction_count   int             Total instruction count.
local_calls         int             Count of calls to locally-defined functions.
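To make the minhash field concrete, here is a minimal sketch of a 32-element MinHash over opcode 4-grams. The hashing scheme (seeded BLAKE2b), the handling of functions shorter than four instructions, and the function names are assumptions for illustration; WARDEN's real fingerprinting may normalize and hash differently.

```python
import hashlib

NUM_HASHES = 32  # matches the 32-element minhash field
SHINGLE = 4      # instruction 4-grams


def _seeded_hash(seed: int, shingle: tuple[str, ...]) -> int:
    """64-bit hash of one shingle under one seed (assumed scheme)."""
    data = f"{seed}:" + "\x1f".join(shingle)
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(opcodes: list[str]) -> list[int]:
    """Return NUM_HASHES minimum hash values over the set of opcode 4-grams."""
    shingles = {
        tuple(opcodes[i:i + SHINGLE])
        for i in range(len(opcodes) - SHINGLE + 1)
    }
    if not shingles:
        # Assumption: a very short body is treated as a single shingle.
        shingles = {tuple(opcodes)}
    return [min(_seeded_hash(seed, s) for s in shingles)
            for seed in range(NUM_HASHES)]


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of slots on which two MinHash vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two bodies that share most of their 4-grams agree on most slots, so the estimate degrades gradually as the compiler reorders or inlines code.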
Signature.fingerprint() reconstructs a FunctionFingerprint so the same similarity() function used everywhere in WARDEN can compare a corpus entry against a live target function.
Library classification is automatic based on name prefixes:
Prefix               Assigned library
emscripten_, __em_   emscripten
pthread_, __pthread  musl-pthread
_ZNSt, _ZSt, _ZN     libc++
__cxa_               libc++abi
dlmalloc, dlfree     dlmalloc
__wasi, wasi_        wasi-libc
(no match)           whatever --library you passed
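The prefix table can be read as a first-match rule list. The sketch below assumes the rules are tried in the order shown in the table; classify_library and PREFIX_RULES are illustrative names, not WARDEN's API.

```python
# Rules in table order; the first matching prefix group wins (assumed ordering).
PREFIX_RULES: list[tuple[tuple[str, ...], str]] = [
    (("emscripten_", "__em_"), "emscripten"),
    (("pthread_", "__pthread"), "musl-pthread"),
    (("_ZNSt", "_ZSt", "_ZN"), "libc++"),
    (("__cxa_",), "libc++abi"),
    (("dlmalloc", "dlfree"), "dlmalloc"),
    (("__wasi", "wasi_"), "wasi-libc"),
]


def classify_library(name: str, default: str = "musl") -> str:
    """Map a function name to a library label, falling back to `default`."""
    for prefixes, library in PREFIX_RULES:
        # str.startswith accepts a tuple of alternatives.
        if name.startswith(prefixes):
            return library
    return default
```

Note that the _ZN prefix covers every Itanium-mangled C++ name, so application C++ code compiled into a reference program would also be labeled libc++ under this rule.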

SignatureStore

A SignatureStore is an ordered list of Signature objects with JSON persistence. The on-disk format is plain JSON (portable, git-diffable, and shareable without a database).
store = SignatureStore.load("oracle.json")  # returns empty store if file is missing
store.add(sig)
store.save("oracle.json")
len(store)          # number of signatures
store.libraries()   # dict[library_name, count]
On disk, the store is a single JSON object:
{
  "version": 1,
  "count": 214,
  "signatures": [ { "name": "memcpy", "library": "musl", ... }, ... ]
}
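Because the format is plain JSON, the store can also be inspected without WARDEN. The following standard-library sketch mirrors the behaviours described above (an empty store when the file is missing, a count that tracks the signatures array, a per-library tally). The helper names are hypothetical and this is not the SignatureStore implementation.

```python
import json
from collections import Counter
from pathlib import Path


def load_store(path: str) -> dict:
    """Load a store file, or return an empty store if the file is missing."""
    p = Path(path)
    if not p.exists():
        return {"version": 1, "count": 0, "signatures": []}
    return json.loads(p.read_text(encoding="utf-8"))


def add_signature(store: dict, sig: dict) -> None:
    """Append one signature record and keep `count` in sync."""
    store["signatures"].append(sig)
    store["count"] = len(store["signatures"])


def save_store(store: dict, path: str) -> None:
    """Write the store as indented JSON so that diffs stay readable."""
    text = json.dumps(store, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")


def libraries(store: dict) -> dict[str, int]:
    """Per-library signature counts."""
    return dict(Counter(sig["library"] for sig in store["signatures"]))
```

Sorted keys and fixed indentation are a design choice for this sketch: they keep commits of oracle.json small when only a few signatures change.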

The shipped seed store is empty

src/warden/oracle/seed_signatures.json ships with "count": 0 and an empty signatures array. This is deliberate: a pre-built corpus would encode assumptions about which emsdk versions and flag combinations matter for your target. You build the store that matches your target and commit it alongside your project.

Identifying functions: identify and ORACLE_THRESHOLD

identify is defined in src/warden/oracle/match.py. Given a KnowledgeBase, a version_id, and a SignatureStore, it:
  1. Fetch all defined functions for the version. Imports are excluded because they already have names from the JS glue.
  2. Reconstruct a fingerprint from the KB row. This uses fingerprint_from_record, so the original .wasm does not need to be re-parsed.
  3. Score every corpus signature against the target fingerprint. The similarity() function short-circuits to 1.0 on an exact body match and otherwise combines four weighted signals:
Signal                        Weight               Method
Exact body equality           short-circuit → 1.0  SHA-256 match
Fuzzy instruction similarity  0.45                 MinHash Jaccard over 4-grams
Opcode-class histogram        0.25                 cosine similarity
Call-neighborhood overlap     0.20                 import Jaccard
Structural skeleton match     +0.10                hash equality bonus
  4. Accept or reject. The best-scoring signature wins. If score >= threshold, it is an Oracle match.
The acceptance threshold is:
ORACLE_THRESHOLD = 0.82   # src/warden/oracle/match.py
When write=True (the default), every accepted match is written back to the KB with:
  • Provenance "oracle" and confidence = score.
  • A Symbol entry with the matched name, type signature, and a summary noting library and version.
  • An evidence trail: {"kind": "oracle", "detail": "<library> <emver> @<opt> score=<N>"}.
  • A row in oracle_matches keyed to the internal function ID.
Human-set names (provenance = "human", locked = True) are sovereign. The Oracle cannot overwrite them, no matter the score.
The --threshold flag overrides ORACLE_THRESHOLD at the CLI. Lower values catch more functions at the cost of false positives; higher values are conservative. The default of 0.82 is a reasonable starting point for functions of moderate size, but you should audit matches after the first run.
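As a rough illustration of how the weights and the threshold interact, the sketch below recombines the signals from the table. The Fingerprint class, the helper functions, the treatment of two empty call sets as full agreement, and the clamp to 1.0 are all assumptions; the real similarity() in WARDEN may differ in these details.

```python
import math
from dataclasses import dataclass, field

ORACLE_THRESHOLD = 0.82


@dataclass
class Fingerprint:
    """Hypothetical stand-in for the fields similarity() needs."""
    exact_hash: str
    structural_hash: str
    minhash: list[int]
    histogram: dict[str, int]
    call_targets: list[str] = field(default_factory=list)


def _minhash_jaccard(a: list[int], b: list[int]) -> float:
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def _cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def _set_jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: no import calls on either side counts as agreement
    return len(sa & sb) / len(sa | sb)


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    if a.exact_hash == b.exact_hash:
        return 1.0  # exact body match short-circuits
    score = (0.45 * _minhash_jaccard(a.minhash, b.minhash)
             + 0.25 * _cosine(a.histogram, b.histogram)
             + 0.20 * _set_jaccard(a.call_targets, b.call_targets))
    if a.structural_hash == b.structural_hash:
        score += 0.10  # structural skeleton bonus
    return min(score, 1.0)


def best_match(target: Fingerprint, corpus: dict[str, Fingerprint],
               threshold: float = ORACLE_THRESHOLD):
    """Return (name, score) for the best corpus entry, or None if below threshold."""
    scored = [(similarity(target, fp), name) for name, fp in corpus.items()]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score >= threshold else None
```

Under these assumptions, a function whose structural hash differs can score at most 0.90, so reaching 0.82 without the bonus requires near-complete agreement on the other three signals.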

The OracleMatch return type

identify returns list[OracleMatch]:
from dataclasses import dataclass

@dataclass
class OracleMatch:
    func_index: int
    stable_id: str
    matched_name: str
    library: str
    emscripten_version: str | None
    opt_level: str | None
    score: float
    source_ref: str | None

Scaling: the MinHash-LSH index

The default identify scan is O(targets × signatures). For a corpus of a few hundred signatures that is negligible, but once the matrix farm has accumulated thousands of entries across many versions and opt levels the linear pass becomes a bottleneck. warden.oracle.index provides SignatureIndex, a banded MinHash + structural-hash index that reduces each function’s search space to a small set of candidates before any scoring happens.

How it works

SignatureIndex.build(store, *, bands=8) splits each signature’s 32-element MinHash into bands equal slices (8 slices of 4 values each by default) and hashes each slice into a bucket. Two signatures that agree on any whole band land in the same bucket. The probability of that collision grows with their true Jaccard similarity, so near-matches across -O levels still surface. Signatures are also indexed by their exact structural_hash, so structurally identical functions are always found regardless of their MinHash values.
index.candidates(fp) returns the deduplicated union of every signature sharing a band-bucket or the structural hash with fp. This is a pure dictionary lookup on the hot path.
identify_indexed is the LSH-accelerated mirror of identify. When the index yields no candidates for a function it falls back to the full store, so no function is ever silently dropped. The result is identical to the linear scan at ORACLE_THRESHOLD = 0.82.
from warden.oracle.index import SignatureIndex, identify_indexed
from warden.oracle import SignatureStore
from warden.kb import KnowledgeBase

store = SignatureStore.load("oracle.json")
index = SignatureIndex.build(store, bands=8)

# Query candidates for one fingerprint
candidates = index.candidates(fp)   # list[Signature], usually << len(store)

# Full pass: same matches as identify(), faster on large stores
with KnowledgeBase("project.db") as kb:
    v = kb.get_version("v1")
    matches = identify_indexed(kb, v.id, store, threshold=0.82, write=True)
bands=8 (the default) balances recall and index size well for a 32-element MinHash. Lower it (fewer, longer bands) to narrow candidates further; raise it (more, shorter bands) to catch more edge cases at the cost of a larger candidate set.
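The banding scheme itself is small enough to sketch with plain dictionaries. BandedIndex, band_keys, and collision_probability below are hypothetical names and not the SignatureIndex implementation; the probability formula is the standard LSH estimate and assumes MinHash slots agree independently with probability equal to the Jaccard similarity.

```python
from collections import defaultdict


def band_keys(minhash: list[int], bands: int = 8) -> list[tuple]:
    """Split a MinHash vector into `bands` equal slices, keyed by band number."""
    rows = len(minhash) // bands
    return [(i, tuple(minhash[i * rows:(i + 1) * rows])) for i in range(bands)]


class BandedIndex:
    """Toy banded MinHash index with an extra structural-hash lookup."""

    def __init__(self, bands: int = 8) -> None:
        self.bands = bands
        self.buckets: dict[tuple, set[str]] = defaultdict(set)
        self.by_structure: dict[str, set[str]] = defaultdict(set)

    def add(self, sig_id: str, minhash: list[int], structural_hash: str) -> None:
        for key in band_keys(minhash, self.bands):
            self.buckets[key].add(sig_id)
        self.by_structure[structural_hash].add(sig_id)

    def candidates(self, minhash: list[int], structural_hash: str) -> set[str]:
        """Union of signatures sharing any whole band or the structural hash."""
        found = set(self.by_structure.get(structural_hash, ()))
        for key in band_keys(minhash, self.bands):
            found |= self.buckets.get(key, set())
        return found


def collision_probability(s: float, bands: int = 8, slots: int = 32) -> float:
    """P(at least one band matches) for Jaccard similarity s: 1 - (1 - s^r)^b."""
    rows = slots // bands
    return 1 - (1 - s ** rows) ** bands
```

With 32 slots and 8 bands, two vectors that agree on at least 25 slots must share a whole band, because 7 mismatched slots can break at most 7 of the 8 bands. For a fixed 32 slots, more bands means fewer rows per band, so collision_probability rises and the candidate set grows.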

Version inference

After identify returns, call infer_version to infer which Emscripten version the target was built with:
def infer_version(matches: list[OracleMatch]) -> VersionInference:
    ...

@dataclass
class VersionInference:
    emscripten_version: str | None   # e.g. "3.1.55"
    confidence: float                # fraction of matches that voted for this version
    histogram: dict[str, int]        # full vote distribution
The implementation tallies emscripten_version across all matches using Counter.most_common. The version with the highest vote share wins; confidence is count / total_votes. When 80% of your Oracle matches are tagged 3.1.55, the inference is reliable. This version pin matters downstream:
  • It is stored on the ModuleVersion row and shown by warden versions.
  • The diff engine uses it to classify toolchain churn: if v1 infers 3.1.50 and v2 infers 3.1.61, functions that changed can be re-examined against the Oracle and reported as “changed due to Emscripten upgrade” rather than “application change”.
  • It sharpens dynCall/elem-table convention assumptions for any subsequent JS glue analysis.
The accuracy of the inference is bounded by corpus coverage. A wider flag matrix and more reference programs produce more matches and a more confident vote.
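The vote described above fits in a few lines. This sketch works on a bare list of version strings rather than OracleMatch objects, and it assumes matches with no version tag are ignored; VersionVote and infer_from_versions are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionVote:
    """Hypothetical mirror of the VersionInference fields."""
    emscripten_version: Optional[str]
    confidence: float
    histogram: dict[str, int] = field(default_factory=dict)


def infer_from_versions(versions: list[Optional[str]]) -> VersionVote:
    """Majority vote over the emscripten_version tags of accepted matches."""
    # Assumption: untagged matches (None) do not vote.
    votes = Counter(v for v in versions if v is not None)
    if not votes:
        return VersionVote(None, 0.0, {})
    winner, count = votes.most_common(1)[0]
    total = sum(votes.values())
    return VersionVote(winner, count / total, dict(votes))
```

For example, eight matches tagged 3.1.55 and two tagged 3.1.50 give a winner of 3.1.55 with confidence 0.8.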

CLI usage

Step 1: Build a signature store

warden oracle build <wasm> [<wasm> ...] \
    --out oracle.json \
    --emver 3.1.55 \
    --opt -O2 \
    [--library musl]
  • <wasm>: one or more .wasm files with a name section (--profiling-funcs at compile time).
  • --out / -o: the output (or existing) oracle.json. If the file already exists, new signatures are appended; the CLI reports +N new.
  • --emver: the Emscripten version string to tag these signatures with.
  • --opt: the optimization level (e.g. -O2, -Oz).
  • --library: fallback label for names that do not match any known prefix (default musl).
After building, the CLI prints the total signature count and the per-library breakdown:
Wrote 214 signatures (+214) to oracle.json
Libraries: dlmalloc=8, emscripten=12, libc++=34, musl=160

Step 2: Identify runtime functions in a target

warden oracle identify v1 \
    --store oracle.json \
    [--threshold 0.82] \
    [--indexed] \
    [--db warden.db]
  • v1: a version label already ingested with warden ingest.
  • --store / -s: path to the oracle.json built above (default oracle.json).
  • --threshold: override ORACLE_THRESHOLD.
  • --indexed: use the MinHash-LSH index (identify_indexed) instead of the linear scan. Produces identical matches but is significantly faster once the store grows large. The index is built in memory at the start of each run. No separate build step is required.
The CLI prints a match table and, when infer_version succeeds, the toolchain line:
 idx  name              library   score  emver
 ───  ────────────────  ────────  ─────  ──────
  12  memcpy            musl      1.00   3.1.55
  14  memset            musl      1.00   3.1.55
  19  __cxa_throw       libc++abi 0.91   3.1.55
  ...

Inferred Emscripten version: 3.1.55 (confidence 0.91)
All matches are written into the KB immediately.

Step 3: Check coverage

warden coverage v1
defined functions   184
named               112  (60%)
  by oracle          98
  by human            0
  by agent           14

The emsdk matrix corpus farm

Building a corpus by hand for each version and flag combination is tedious. The scripts in scripts/corpus/ provide a containerized, reproducible alternative. The only host dependencies are Docker and the warden CLI.
cd scripts/corpus

./build_matrix.sh \
  --versions "3.1.50 3.1.55 3.1.61" \
  --opt "-O0 -O2 -Oz" \
  --out ../../oracle.json

What build_matrix.sh does

For each (emscripten_version, opt_level) pair:
  1. Pull the official emsdk image. emscripten/emsdk:<version> is pulled from Docker Hub. Emscripten never touches your host.
  2. Compile reference programs inside the container. Every .c file under scripts/corpus/reference/ is compiled with emcc, passing --profiling-funcs and -sEXPORT_ALL=1. --profiling-funcs forces the name section to be emitted even at high -O levels. Without it, the wasm has no labels and extract_signatures extracts nothing useful.
  3. Copy the resulting .wasm files back to the host. The container is discarded; only the compiled artifacts remain.
  4. Call warden oracle build:

warden oracle build <wasms...> \
    --out <store> \
    --emver <version> \
    --opt <opt>
Signatures are appended to the same oracle.json across all matrix entries, so the build is incremental and you can interrupt and resume safely.
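The loop above can be pictured as a plan of command lines, one group per (version, opt level) pair. The sketch below only builds argument lists and runs nothing. The mount point, output naming, and the plan_matrix helper are assumptions for illustration and are not taken from build_matrix.sh; the warden oracle build arguments follow the form shown above.

```python
from itertools import product


def plan_matrix(versions, opt_levels, sources, store, workdir="/tmp/corpus"):
    """Return one {'compile': [...], 'build': [...]} entry per matrix cell.

    Dry-run only: nothing is executed. Paths and naming are hypothetical.
    """
    plan = []
    for version, opt in product(versions, opt_levels):
        tag = f"{version}_{opt.lstrip('-')}"
        compiles, wasms = [], []
        for src in sources:
            stem = src.rsplit(".", 1)[0]
            # emcc writes <name>.js plus <name>.wasm next to it.
            out_js = f"out/{stem}_{tag}.js"
            compiles.append([
                "docker", "run", "--rm",
                "-v", f"{workdir}:/src", "-w", "/src",
                f"emscripten/emsdk:{version}",
                "emcc", src, opt,
                "--profiling-funcs", "-sEXPORT_ALL=1",
                "-o", out_js,
            ])
            wasms.append(f"{workdir}/out/{stem}_{tag}.wasm")
        build = ["warden", "oracle", "build", *wasms,
                 "--out", store, "--emver", version, "--opt", opt]
        plan.append({"compile": compiles, "build": build})
    return plan
```

Printing the plan before running it is a cheap way to confirm that every cell of the matrix is tagged with the version and opt level it was actually compiled with.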

Why the matrix matters

Emscripten codegen varies across optimization levels in ways that affect fingerprints. A function compiled at -O0 has a different structural skeleton than the same function at -Oz after inlining and dead-code elimination. A corpus that covers only one opt level will miss matches in targets compiled at a different level. Fuzzy MinHash similarity tolerates some variation, but a match still has to reach ORACLE_THRESHOLD = 0.82, so breadth of coverage in the corpus is the primary lever for match rate.

Reference programs

scripts/corpus/reference/ currently contains strings.c, a program that exercises musl string functions and the Emscripten runtime. The design calls for more: programs exercising pthreads, exceptions, different allocator configurations, and ideally portions of the Emscripten runtime itself. To widen the Oracle’s reach, add .c files to reference/ and extend the FLAGS array in build_matrix.sh to cover -pthread, -sPROXY_TO_PTHREAD, -fexceptions, and LTO.
The corpus accumulates into a single oracle.json. Because the format is plain JSON, it can be committed to your project repository and shared with collaborators. Everyone runs against the same ground truth without rebuilding.

Using the Oracle as a library

from warden.oracle import (
    SignatureStore,
    build_corpus_from_files,
    identify,
    infer_version,
    ORACLE_THRESHOLD,
    load_seed_store,
)
from warden.kb import KnowledgeBase

# Build a store from labeled modules
store = build_corpus_from_files(
    ["runtime_debug.wasm"],
    library="musl",
    emscripten_version="3.1.55",
    opt_level="-O2",
)
store.save("oracle.json")

# Identify against the KB
with KnowledgeBase("project.db") as kb:
    v = kb.get_version("v1")
    matches = identify(kb, v.id, store, threshold=ORACLE_THRESHOLD)
    inferred = infer_version(matches)
    print(inferred.emscripten_version, inferred.confidence)
load_seed_store() loads src/warden/oracle/seed_signatures.json (the empty store that ships with the package). It is exported from warden.oracle for completeness but has no signatures until you build them.

See also

Ingest (phase 1)

What happens before the Oracle: parsing, fingerprinting, seeding from exports and the name section.

Agents (phase 3)

The Oracle handles runtime code; the agent crew names what remains.

Diff and carry-over (phase 4)

How the inferred Emscripten version feeds into toolchain-churn suppression during diffing.

CLI reference

Full flag listing for warden oracle build and warden oracle identify.
Last modified on June 7, 2026