oracle.
Alpha status. The matching engine, signature store, and corpus farm script are all
implemented and working. The seed store that ships with a fresh clone is intentionally empty.
You build your own. A shared community corpus is on the roadmap.
Compile your own ground truth
No other FLIRT- or BinDiff-style workflow auto-builds its corpus from the toolchain’s own source across the version/flag matrix. The critical ingredient is --profiling-funcs.
Without it, an optimized Emscripten build strips the wasm name section. The corpus builder,
extract_signatures, looks up each function’s name from:
- module.names.function_names: the wasm name section (primary source).
- func.export_names[0]: used if the function is exported and the name section is absent.
Pass --profiling-funcs (or -g / -gsource-map) on every corpus build. This flag forces
Emscripten to emit the name section even at -O2 and -Oz, where it would otherwise be
stripped.
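The lookup order described above can be sketched as follows. This is a minimal illustration, not the actual extract_signatures code: resolve_name and its parameters are hypothetical, and applying the export fallback per function (rather than only when the whole name section is missing) is an assumption.

```python
from typing import Optional


def resolve_name(
    func_index: int,
    function_names: dict[int, str],
    export_names: list[str],
) -> Optional[str]:
    """Return a label for one function, or None if it cannot be named.

    Hypothetical helper: `function_names` stands in for the parsed name
    section, `export_names` for the export names of this one function.
    """
    # Primary source: the wasm name section.
    name = function_names.get(func_index)
    if name:
        return name
    # Fallback: the first export name, if the function is exported.
    if export_names:
        return export_names[0]
    # No name section entry and no export: the function stays unlabeled.
    return None
```

A function that resolves to None contributes nothing to the corpus, which is why a stripped name section leaves only the exported functions labeled.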
The signature store
Signature
Defined in src/warden/oracle/signatures.py. One Signature is a labeled fingerprint of a
single function extracted from a corpus build.
| Field | Type | Meaning |
|---|---|---|
| name | str | The real upstream name, from the wasm name section or export. |
| library | str | Which library: musl, musl-pthread, libc++, libc++abi, dlmalloc, wasi-libc, emscripten, or the --library default. |
| emscripten_version | str \| None | The emsdk version these were built with. |
| opt_level | str \| None | The -O level, e.g. -O2. |
| source_ref | str \| None | Optional upstream source link. |
| structural_hash | str | Blake2b of the normalized control-flow/opcode skeleton. |
| exact_hash | str | SHA-256 of the raw function body. |
| minhash | list[int] | 32-element MinHash over instruction 4-grams (fuzzy matching). |
| histogram | dict[str, int] | Opcode-class frequency histogram. |
| call_targets | list[str] | Import call-neighborhood (module.field strings). |
| type_signature | str | WASM type signature string. |
| instruction_count | int | Total instruction count. |
| local_calls | int | Count of calls to locally-defined functions. |
Signature.fingerprint() reconstructs a FunctionFingerprint so the same similarity()
function used everywhere in WARDEN can compare a corpus entry against a live target function.
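To make the three hash fields in the table concrete, here is a rough sketch of how such values could be computed with the standard library. It is not WARDEN's implementation: the skeleton normalization (opcode mnemonics only), the digest sizes, and the salted-Blake2b hash family used for MinHash are all assumptions chosen for illustration.

```python
import hashlib


def exact_hash(body: bytes) -> str:
    """SHA-256 of the raw function body bytes."""
    return hashlib.sha256(body).hexdigest()


def structural_hash(opcodes: list[str]) -> str:
    """Blake2b over a normalized skeleton.

    Assumption: the skeleton is the opcode mnemonic sequence with all
    immediates dropped; the real normalization may differ.
    """
    skeleton = "\n".join(opcodes).encode()
    return hashlib.blake2b(skeleton, digest_size=16).hexdigest()


def minhash(opcodes: list[str], num_perm: int = 32, n: int = 4) -> list[int]:
    """32-element MinHash over instruction 4-grams."""
    shingles = {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}
    if not shingles:
        # Functions shorter than n instructions: use the whole body once.
        shingles = {tuple(opcodes)}
    signature = []
    for seed in range(num_perm):
        # One salted hash function per MinHash position.
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(" ".join(s).encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature
```

Two functions whose 4-gram sets overlap heavily will agree on most MinHash positions, which is what makes the field usable for fuzzy matching across opt levels.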
Library classification is automatic based on name prefixes:
| Prefix | Assigned library |
|---|---|
| emscripten_, __em_ | emscripten |
| pthread_, __pthread | musl-pthread |
| _ZNSt, _ZSt, _ZN | libc++ |
| __cxa_ | libc++abi |
| dlmalloc, dlfree | dlmalloc |
| __wasi, wasi_ | wasi-libc |
| (no match) | whatever --library you passed |
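The prefix table above maps directly onto a first-match lookup. The sketch below assumes the rows are checked in the order shown and that musl is the fallback (the documented --library default); PREFIX_TABLE and classify_library are illustrative names, not the real ones.

```python
# Rows mirror the prefix table; the order of evaluation is an assumption.
PREFIX_TABLE = [
    (("emscripten_", "__em_"), "emscripten"),
    (("pthread_", "__pthread"), "musl-pthread"),
    (("_ZNSt", "_ZSt", "_ZN"), "libc++"),
    (("__cxa_",), "libc++abi"),
    (("dlmalloc", "dlfree"), "dlmalloc"),
    (("__wasi", "wasi_"), "wasi-libc"),
]


def classify_library(name: str, default: str = "musl") -> str:
    """Return the library label for a function name."""
    for prefixes, library in PREFIX_TABLE:
        # str.startswith accepts a tuple and matches any of its members.
        if name.startswith(prefixes):
            return library
    # No prefix matched: fall back to whatever --library was passed.
    return default
```

For example, classify_library("__cxa_throw") yields libc++abi, while an unprefixed name such as strlen falls through to the default.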
SignatureStore
A SignatureStore is an ordered list of Signature objects with JSON persistence. The on-disk
format is plain JSON (portable, git-diffable, and shareable without a database).
The shipped seed store is empty
src/warden/oracle/seed_signatures.json ships with "count": 0 and an empty signatures
array. This is deliberate: a pre-built corpus would encode assumptions about which emsdk
versions and flag combinations matter for your target. You build the store that matches your
target and commit it alongside your project.
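As an illustration of the on-disk format, the sketch below appends signatures to a JSON store and returns how many were new. Only the count and signatures keys come from the description of the seed file; the helper name append_signatures and the deduplication key (name plus exact_hash) are assumptions.

```python
import json
from pathlib import Path


def append_signatures(path: Path, new_sigs: list[dict]) -> int:
    """Append signatures to a JSON store; return the number actually added."""
    if path.exists():
        data = json.loads(path.read_text())
    else:
        # Same shape as the empty seed store: zero count, empty array.
        data = {"count": 0, "signatures": []}

    # Assumption: a signature is a duplicate if name and exact_hash both match.
    seen = {(s["name"], s["exact_hash"]) for s in data["signatures"]}
    added = 0
    for sig in new_sigs:
        key = (sig["name"], sig["exact_hash"])
        if key not in seen:
            seen.add(key)
            data["signatures"].append(sig)
            added += 1

    data["count"] = len(data["signatures"])
    # Indented, key-sorted output keeps the file git-diffable.
    path.write_text(json.dumps(data, indent=2, sort_keys=True) + "\n")
    return added
```

Writing the file with stable key order and indentation is what keeps successive corpus builds reviewable as ordinary diffs.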
Identifying functions: identify and ORACLE_THRESHOLD
identify is defined in src/warden/oracle/match.py. Given a KnowledgeBase, a version_id,
and a SignatureStore, it:
1. Fetch all defined functions for the version. Imports are excluded; they already have names from the JS glue.
2. Reconstruct a fingerprint from the KB row. This uses fingerprint_from_record, so no re-parsing of the original .wasm is required.
3. Score every corpus signature against the target fingerprint. The similarity() function combines four weighted signals, plus a short circuit for exact body matches:

| Signal | Weight | Method |
|---|---|---|
| Exact body equality | short-circuit → 1.0 | SHA-256 match |
| Fuzzy instruction similarity | 0.45 | MinHash Jaccard over 4-grams |
| Opcode-class histogram | 0.25 | cosine similarity |
| Call-neighborhood overlap | 0.20 | import Jaccard |
| Structural skeleton match | +0.10 | hash equality bonus |
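The weighting in the table can be sketched as a single scoring function. This is an approximation under stated assumptions, not the real similarity(): the Fingerprint class is a stand-in for FunctionFingerprint, two empty call neighborhoods are treated as a perfect overlap, and the final score is capped at 1.0.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Fingerprint:
    """Hypothetical stand-in holding only the fields the score needs."""
    exact_hash: str
    structural_hash: str
    minhash: list[int]
    histogram: dict[str, int]
    call_targets: list[str] = field(default_factory=list)


def _minhash_jaccard(m1: list[int], m2: list[int]) -> float:
    # The fraction of agreeing positions estimates the Jaccard similarity.
    if not m1 or len(m1) != len(m2):
        return 0.0
    return sum(x == y for x, y in zip(m1, m2)) / len(m1)


def _cosine(h1: dict[str, int], h2: dict[str, int]) -> float:
    dot = sum(v * h2.get(k, 0) for k, v in h1.items())
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def _jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # assumption: two import-free functions overlap fully
    return len(a & b) / len(a | b)


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    # Exact body equality short-circuits to a perfect score.
    if a.exact_hash == b.exact_hash:
        return 1.0
    score = (
        0.45 * _minhash_jaccard(a.minhash, b.minhash)
        + 0.25 * _cosine(a.histogram, b.histogram)
        + 0.20 * _jaccard(set(a.call_targets), set(b.call_targets))
    )
    # Structural skeleton equality adds a flat bonus.
    if a.structural_hash == b.structural_hash:
        score += 0.10
    return min(score, 1.0)
```

Because the structural bonus is worth only 0.10, a function cannot reach the 0.82 default threshold on skeleton equality alone; it also needs strong agreement on the fuzzy signals.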
When write=True (the default), every accepted match is written back to the KB with:
- Provenance "oracle" and confidence = score.
- A Symbol entry with the matched name, type signature, and a summary noting library and version.
- An evidence trail: {"kind": "oracle", "detail": "<library> <emver> @<opt> score=<N>"}.
- A row in oracle_matches keyed to the internal function ID.
The --threshold flag overrides ORACLE_THRESHOLD at the CLI. Lower values catch more functions
at the cost of false positives; higher values are conservative. The default of 0.82 is a
reasonable starting point for functions of moderate size, but you should audit matches after
the first run.
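The effect of the threshold can be seen in a small acceptance sketch. It assumes the best-scoring signature is the only candidate considered per function and that the comparison is inclusive (score >= threshold); best_match is a hypothetical helper, not part of the documented API.

```python
from typing import Optional

ORACLE_THRESHOLD = 0.82  # default documented above


def best_match(
    scores: dict[str, float],
    threshold: float = ORACLE_THRESHOLD,
) -> Optional[tuple[str, float]]:
    """Pick the top-scoring signature name, or None if it is below threshold."""
    if not scores:
        return None
    name, score = max(scores.items(), key=lambda kv: kv[1])
    # Assumption: a score exactly equal to the threshold is accepted.
    return (name, score) if score >= threshold else None
```

Running the same scores through a lower threshold shows which borderline matches would start being accepted, which is a cheap way to audit the setting before writing anything back to the KB.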
The OracleMatch return type
identify returns list[OracleMatch]:
Scaling: the MinHash-LSH index
The default identify scan is O(targets × signatures). For a corpus of a few hundred signatures
that is negligible, but once the matrix farm has accumulated thousands of entries across many
versions and opt levels the linear pass becomes a bottleneck. warden.oracle.index provides
SignatureIndex, a banded MinHash + structural-hash index that reduces each function’s search
space to a small set of candidates before any scoring happens.
How it works
SignatureIndex.build(store, *, bands=8) splits each signature’s 32-element MinHash into
bands equal slices and hashes each slice into a bucket. Two signatures that agree on any
whole band land in the same bucket. The probability of that collision grows with their true
Jaccard similarity, so near-matches across -O levels still surface. Signatures are also indexed
by their exact structural_hash, so structurally identical functions are always found regardless
of their MinHash values.
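For intuition about how likely a near-match is to surface, the standard banded-LSH estimate can be computed directly. With 32 MinHash values split into 8 bands, each band has 4 rows. The formula assumes each MinHash position agrees independently with probability equal to the true Jaccard similarity; the function name is illustrative.

```python
def band_collision_probability(jaccard: float, bands: int = 8, rows: int = 4) -> float:
    """Probability that two signatures share at least one whole band.

    A single band matches with probability jaccard ** rows, so the chance
    that none of the bands match is (1 - jaccard ** rows) ** bands.
    """
    return 1.0 - (1.0 - jaccard ** rows) ** bands
```

Under these assumptions, two functions with a MinHash Jaccard of 0.82 share a bucket about 99% of the time, while a pair at 0.5 does so roughly 40% of the time. Note that this describes only the MinHash signal, not the combined similarity score.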
index.candidates(fp) returns the deduplicated union of every signature sharing a band-bucket
or the structural hash with fp. This is a pure dictionary lookup on the hot path.
identify_indexed is the LSH-accelerated mirror of identify. When the index yields no
candidates for a function it falls back to the full store, so no function is ever silently
dropped. The result is identical to the linear scan at ORACLE_THRESHOLD = 0.82.
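A compact sketch of the banding scheme, candidate lookup, and full-store fallback described in this section follows. It is not the SignatureIndex source: BandedIndex and candidates_or_all are hypothetical names, signatures are referred to by opaque ids, and each band slice is used directly as a dictionary key instead of being hashed explicitly.

```python
from collections import defaultdict


class BandedIndex:
    """Illustrative banded MinHash + structural-hash index."""

    def __init__(self, bands: int = 8):
        self.bands = bands
        self.buckets = defaultdict(set)       # (band number, slice) -> ids
        self.by_structure = defaultdict(set)  # structural_hash -> ids

    def _band_keys(self, minhash: list[int]) -> list[tuple]:
        rows = len(minhash) // self.bands
        # Tag each slice with its band number so bands never collide.
        return [
            (b, tuple(minhash[b * rows:(b + 1) * rows]))
            for b in range(self.bands)
        ]

    def add(self, sig_id: str, minhash: list[int], structural_hash: str) -> None:
        for key in self._band_keys(minhash):
            self.buckets[key].add(sig_id)
        self.by_structure[structural_hash].add(sig_id)

    def candidates(self, minhash: list[int], structural_hash: str) -> set:
        # Union of structural-hash hits and every shared band bucket.
        found = set(self.by_structure.get(structural_hash, ()))
        for key in self._band_keys(minhash):
            found |= self.buckets.get(key, set())
        return found


def candidates_or_all(index: BandedIndex, all_ids, minhash, structural_hash) -> set:
    """Fall back to the full store when the index yields nothing."""
    found = index.candidates(minhash, structural_hash)
    return found if found else set(all_ids)
```

Only the returned candidates then need to be scored, so the per-function cost depends on bucket sizes rather than on the size of the whole store.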
Version inference
After identify returns, call infer_version to infer which Emscripten version the target was
built with:
infer_version tallies emscripten_version across all matches using Counter.most_common.
The version with the highest vote share wins; confidence is count / total_votes. When 80% of
your Oracle matches are tagged 3.1.55, the inference is reliable.
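The vote described above amounts to a few lines around Counter.most_common. The sketch below takes a plain list of version tags instead of OracleMatch objects and skips untagged matches; both choices, and the name vote_version, are assumptions made to keep the example self-contained.

```python
from collections import Counter
from typing import Optional


def vote_version(versions: list[Optional[str]]) -> Optional[tuple[str, float]]:
    """Return (winning version, confidence) or None if nothing was tagged."""
    # Assumption: matches without a version tag do not vote.
    votes = Counter(v for v in versions if v is not None)
    if not votes:
        return None
    version, count = votes.most_common(1)[0]
    # Confidence is the winner's share of all votes cast.
    return version, count / sum(votes.values())
```

With eight matches tagged 3.1.55 and two tagged 3.1.50, the result is 3.1.55 with a confidence of 0.8.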
This version pin matters downstream:
- It is stored on the ModuleVersion row and shown by warden versions.
- The diff engine uses it to classify toolchain churn: if v1 infers 3.1.50 and v2 infers 3.1.61, functions that changed can be re-examined against the Oracle and reported as “changed due to Emscripten upgrade” rather than “application change”.
- It sharpens dynCall/elem-table convention assumptions for any subsequent JS glue analysis.
CLI usage
Step 1: Build a signature store
- <wasm>: one or more .wasm files with a name section (--profiling-funcs at compile time).
- --out / -o: the output (or existing) oracle.json. If the file already exists, new signatures are appended; the CLI reports +N new.
- --emver: the Emscripten version string to tag these signatures with.
- --opt: the optimization level (e.g. -O2, -Oz).
- --library: fallback label for names that do not match any known prefix (default musl).
Step 2: Identify runtime functions in a target
- v1: a version label already ingested with warden ingest.
- --store / -s: path to the oracle.json built above (default oracle.json).
- --threshold: override ORACLE_THRESHOLD.
- --indexed: use the MinHash-LSH index (identify_indexed) instead of the linear scan. It produces identical matches but is significantly faster once the store grows large. The index is built in memory at the start of each run, so no separate build step is required.
When infer_version succeeds, the output also includes the toolchain line:
Step 3: Check coverage
The emsdk matrix corpus farm
Building a corpus by hand for each version and flag combination is tedious. The scripts in scripts/corpus/ provide a containerized, reproducible alternative. The only host dependencies
are Docker and the warden CLI.
What build_matrix.sh does
For each (emscripten_version, opt_level) pair:
1. Pull the official emsdk image. emscripten/emsdk:<version> is pulled from Docker Hub. Emscripten never touches your host.
2. Compile reference programs inside the container. Every .c file under scripts/corpus/reference/ is compiled with emcc, passing --profiling-funcs and -sEXPORT_ALL=1. --profiling-funcs forces the name section to be emitted even at high -O levels. Without it, the wasm has no labels and extract_signatures extracts nothing useful.
3. Copy the resulting .wasm files back to the host. The container is discarded; only the compiled artifacts remain.
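The shape of the matrix loop can be sketched in Python by enumerating one container command per (version, opt level, source) combination. This is not a port of build_matrix.sh: matrix_commands, the bind-mounted working directory, and the output file naming are assumptions, and a bind mount is used here in place of an explicit copy-back step.

```python
from itertools import product


def matrix_commands(
    versions: list[str],
    opt_levels: list[str],
    sources: list[str],
    host_dir: str,
    workdir: str = "/src",
) -> list[list[str]]:
    """Build one `docker run ... emcc` argv per matrix cell and source file."""
    commands = []
    for version, opt in product(versions, opt_levels):
        for src in sources:
            # Hypothetical naming scheme: <program>-<version><opt>.wasm
            out = f"{src.removesuffix('.c')}-{version}{opt}.wasm"
            commands.append([
                "docker", "run", "--rm",
                "-v", f"{host_dir}:{workdir}", "-w", workdir,
                f"emscripten/emsdk:{version}",
                "emcc", opt, "--profiling-funcs", "-sEXPORT_ALL=1",
                src, "-o", out,
            ])
    return commands
```

Each returned list could be passed to subprocess.run; generating the commands separately from running them makes the matrix easy to inspect before any image is pulled.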
Why the matrix matters
Emscripten codegen varies across optimization levels in ways that affect fingerprints. A function compiled at -O0 has a different structural skeleton than the same function at -Oz after
inlining and dead-code elimination. A corpus that covers only one opt level will miss matches in
targets compiled at a different level. The MinHash fuzzy similarity (ORACLE_THRESHOLD = 0.82)
tolerates some variation, but breadth of coverage in the corpus is the primary lever for match
rate.
Reference programs
scripts/corpus/reference/ currently contains strings.c, a program that exercises musl
string functions and the Emscripten runtime. The design calls for more: programs exercising
pthreads, exceptions, different allocator configurations, and ideally portions of the Emscripten
runtime itself. To widen the Oracle’s reach, add .c files to reference/ and extend the
FLAGS array in build_matrix.sh to cover -pthread, -sPROXY_TO_PTHREAD, -fexceptions,
and LTO.
The corpus accumulates into a single oracle.json. Because the format is plain JSON, it can be
committed to your project repository and shared with collaborators. Everyone runs against the
same ground truth without rebuilding.
Using the Oracle as a library
load_seed_store() loads src/warden/oracle/seed_signatures.json (the empty store that ships
with the package). It is exported from warden.oracle for completeness but has no signatures
until you build them.
See also
Ingest (phase 1)
What happens before the Oracle: parsing, fingerprinting, seeding from exports and the name
section.
Agents (phase 3)
The Oracle handles runtime code; the agent crew names what remains.
Diff and carry-over (phase 4)
How the inferred Emscripten version feeds into toolchain-churn suppression during diffing.
CLI reference
Full flag listing for warden oracle build and warden oracle identify.