warden ingest reads the raw binary, recovers everything the file reveals for free, and
writes the result into the knowledge base. By the time the command returns, every function
already has a stable identity, and many have a name.
Overview
Parse the .wasm binary
A pure-Python WASM binary parser (ingest/wasm.py) reads the file section by section
and produces a Module: a structured object containing every function, type, import,
export, memory, element segment, data segment, and the name custom section.
Parse the Emscripten JS glue (optional)
If you pass --glue, the regex-driven JS glue parser (ingest/jsglue.py) extracts the
Emscripten version string, dynCall_* signatures, exported symbol names, import bindings,
and pthread / PROXY_TO_PTHREAD markers. None of these live inside the .wasm itself.
Fingerprint every function
identity.fingerprint_function() derives four complementary fingerprints from each
Function and writes them into the functions table alongside a stable_id, the
carry-over key used by every later phase. See core concepts for how the
fingerprints work.
The WASM binary parser
ingest/wasm.py implements a complete pure-Python reader of the WebAssembly binary
format. No native toolchain (no WABT,
no Binaryen, no system libraries) is required.
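To make the "pure Python, no toolchain" claim concrete, here is a minimal sketch of walking the top-level sections of a WebAssembly binary. It is an illustration written for this page, not the code in ingest/wasm.py; the names `read_uleb128` and `walk_sections` are hypothetical.

```python
from typing import List, Tuple

WASM_MAGIC = b"\x00asm"

# Section ids as defined by the WebAssembly binary format.
SECTION_NAMES = {
    0: "custom", 1: "type", 2: "import", 3: "function", 4: "table",
    5: "memory", 6: "global", 7: "export", 8: "start", 9: "element",
    10: "code", 11: "data", 12: "datacount",
}


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def walk_sections(data: bytes) -> List[Tuple[str, int, int]]:
    """Return (section_name, payload_offset, payload_size) in file order."""
    if data[:4] != WASM_MAGIC:
        raise ValueError("not a WebAssembly binary")
    if int.from_bytes(data[4:8], "little") != 1:
        raise ValueError("unsupported WebAssembly version")
    pos = 8
    sections = []
    while pos < len(data):
        section_id = data[pos]
        size, payload_start = read_uleb128(data, pos + 1)
        payload_end = payload_start + size
        if payload_end > len(data):
            raise ValueError("section runs past end of file")
        name = SECTION_NAMES.get(section_id, f"unknown({section_id})")
        sections.append((name, payload_start, size))
        pos = payload_end  # sections are laid out back to back
    return sections
```

Every section carries its own byte length, so a reader can skip or defer any payload without understanding its contents.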
What gets parsed
The parser walks every standard section in a single left-to-right pass:
| Section | What is recovered |
|---|---|
| Type | Function type signatures such as (i32, i32) -> i32. Free in every module, including stripped ones. |
| Import | All imports, keyed by module.field. Imported functions occupy the low function-index slots. |
| Function | Type-index assignments for locally defined functions. |
| Memory | Limits (minimum/maximum pages), shared flag (threads), and memory64 flag. |
| Global | Mutable/immutable globals with their init expressions. |
| Export | All exported names and the function indices they point to. |
| Element | Table initializers; ref.func targets are resolved to indices where statically known. |
| Code | Per-function locals and body bytes; each body is immediately passed to the disassembler. |
| Data | Active and passive data segments; active segments carry a constant-expression offset so string extraction can compute absolute addresses. |
| Name (custom) | The name custom section (see below). |
| Other custom | Preserved verbatim in module.custom_sections for downstream use. |
Import, export, and name-section names are attached to the Function objects so the
module.function_name() resolver can apply the name-section > export > import priority chain.
Name resolution priority
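The name-section > export > import chain can be sketched as a simple fall-through. This is a hedged illustration, not the body of module.function_name(); the `FunctionNames` container is hypothetical, and taking the first export when a function has several is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FunctionNames:
    """Hypothetical holder for the names one function may carry."""
    name_section: Optional[str] = None
    exports: List[str] = field(default_factory=list)
    import_field: Optional[str] = None  # e.g. "env.emscripten_memcpy_js"


def resolve_name(names: FunctionNames) -> Optional[str]:
    """Apply the name-section > export > import priority chain."""
    if names.name_section:
        return names.name_section
    if names.exports:
        return names.exports[0]  # assumption: first export wins
    return names.import_field    # None when no free name exists
```

A stripped internal function has none of the three sources, so the resolver returns None and the function stays unnamed until a later phase proposes something.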
The name section
ingest/names.py parses the WebAssembly name custom section. Emscripten retains this
section unless you explicitly strip it (e.g. with wasm-strip). It is the richest source of
high-fidelity names in a debug or --profiling-funcs build.
Three subsections are parsed:
- Module name (subsection 0): the module-level name string.
- Function names (subsection 1): a map from function index to name. This is the primary source for function_names.
- Local names (subsection 2): per-function local-variable names. Useful for the agent crew when constructing context.
The subsections that were found are listed in NameSection.present_subsections. A malformed
subsection does not abort the parse: it is silently skipped and the rest of the section is
still read.
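The skip-on-error behaviour works because each subsection declares its own size, so a reader can jump to the next one even when the current payload is garbage. The sketch below assumes that layout and handles subsections 0 and 1 only; `parse_name_section` and the dictionary keys it returns are hypothetical, not the API of ingest/names.py.

```python
from typing import Any, Dict, Tuple


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_name(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a length-prefixed UTF-8 string."""
    length, pos = read_uleb128(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("name runs past end of subsection")
    return buf[pos:end].decode("utf-8"), end


def parse_name_section(payload: bytes) -> Dict[str, Any]:
    out: Dict[str, Any] = {"module_name": None, "function_names": {},
                           "present_subsections": [], "skipped": []}
    pos = 0
    while pos < len(payload):
        sub_id = payload[pos]
        try:
            size, start = read_uleb128(payload, pos + 1)
        except IndexError:
            out["skipped"].append(sub_id)
            break  # truncated header: nothing left to resynchronise on
        end = start + size
        if end > len(payload):
            out["skipped"].append(sub_id)
            break
        body = payload[start:end]
        pos = end  # advance first so a bad body cannot stall the loop
        try:
            if sub_id == 0:
                out["module_name"], _ = read_name(body, 0)
            elif sub_id == 1:
                count, p = read_uleb128(body, 0)
                names = {}
                for _ in range(count):
                    index, p = read_uleb128(body, p)
                    name, p = read_name(body, p)
                    names[index] = name
                out["function_names"] = names
            out["present_subsections"].append(sub_id)
        except (IndexError, ValueError):
            out["skipped"].append(sub_id)  # malformed: skip, keep reading
    return out
```

The sketch records skipped ids only so the behaviour is observable in a test; the parser described above skips silently.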
In a production Emscripten build compiled with -O2 or higher and no --profiling-funcs,
the name section is often absent or covers only a handful of exports. The parser handles
both cases gracefully: when the section is absent, NameSection.present returns False and
name-section seeds are not written.
Full opcode disassembly
ingest/opcodes.py contains the instruction decoder. It covers the MVP plus every extension
that Emscripten actually emits:
- Sign-extension ops: i32.extend8_s, i64.extend32_s, and friends.
- Non-trapping float-to-int: the 0xFC prefix family (sub-opcodes 0–7).
- Bulk memory: memory.init, data.drop, memory.copy, memory.fill, table.* (also 0xFC).
- Reference types & tail calls: ref.null, ref.is_null, ref.func, return_call, return_call_indirect.
- Threads / atomics: the full 0xFE prefix family: atomic.fence, notify/wait, and all load/store/rmw/cmpxchg variants. Central to the -pthread story.
- SIMD: the 0xFD prefix family, with per-sub-opcode immediate layouts (memarg, lane byte, 16-byte shuffles, and no-immediate ops).
Each Instruction records its offset and size inside the function body, its
opcode (first byte) and sub_opcode for prefixed families, a mnemonic, an opcode klass
(used for the histogram fingerprint), and the parsed immediates.
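The record described above can be pictured with a small sketch. The field names follow the prose, but the dataclass, the tiny opcode tables, and the klass labels ("variable", "numeric", "memory", "control") are assumptions made for illustration; the decoder in ingest/opcodes.py covers far more.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instruction:
    offset: int                # byte offset inside the function body
    size: int                  # total encoded length in bytes
    opcode: int                # first byte
    sub_opcode: Optional[int]  # only set for prefixed families
    mnemonic: str
    klass: str                 # bucket used for an opcode histogram
    immediates: Tuple[int, ...]


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


# (mnemonic, klass, number of unsigned LEB128 immediates): a toy subset.
PLAIN = {0x20: ("local.get", "variable", 1),
         0x6A: ("i32.add", "numeric", 0),
         0x0B: ("end", "control", 0)}
PREFIX_FC = {0x0A: ("memory.copy", "memory", 2),
             0x0B: ("memory.fill", "memory", 1)}


def decode(body: bytes) -> List[Instruction]:
    out = []
    pos = 0
    while pos < len(body):
        start = pos
        opcode = body[pos]
        pos += 1
        sub = None
        if opcode == 0xFC:  # prefixed family: sub-opcode follows as LEB128
            sub, pos = read_uleb128(body, pos)
            entry = PREFIX_FC.get(sub)
        else:
            entry = PLAIN.get(opcode)
        if entry is None:
            raise ValueError(f"opcode 0x{opcode:02x} not modelled in this sketch")
        mnemonic, klass, n_imm = entry
        imms = []
        for _ in range(n_imm):
            value, pos = read_uleb128(body, pos)
            imms.append(value)
        out.append(Instruction(start, pos - start, opcode, sub,
                               mnemonic, klass, tuple(imms)))
    return out


def klass_histogram(instructions: List[Instruction]) -> Counter:
    """Count instructions per klass."""
    return Counter(ins.klass for ins in instructions)
```

Because the histogram counts classes rather than exact bytes, two builds of the same function that differ only in immediates (for example, shifted local indices) still produce the same counts.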
The degradation path. An opcode whose immediate layout the decoder does not model raises
UnsupportedOpcode. The code section parser catches that exception and sets
func.disasm_error instead of aborting:
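A minimal sketch of this catch-and-record pattern follows. Only the names UnsupportedOpcode, disasm_error, instructions, and exact_hash come from the text above; the `Function` dataclass, the toy disassembler, and the choice of SHA-256 for exact_hash are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


class UnsupportedOpcode(Exception):
    """Raised when an opcode's immediate layout is not modelled."""


@dataclass
class Function:
    index: int
    body: bytes
    instructions: Optional[List[str]] = None
    disasm_error: Optional[str] = None

    @property
    def exact_hash(self) -> str:
        # Assumption: the hash algorithm is not stated in this page.
        return hashlib.sha256(self.body).hexdigest()


def toy_disassemble(body: bytes) -> List[str]:
    """Stand-in decoder that only knows two one-byte opcodes."""
    known = {0x01: "nop", 0x0B: "end"}
    out = []
    for offset, byte in enumerate(body):
        if byte not in known:
            raise UnsupportedOpcode(f"opcode 0x{byte:02x} at offset {offset}")
        out.append(known[byte])
    return out


def disassemble_into(func: Function,
                     disassemble: Callable[[bytes], List[str]] = toy_disassemble) -> Function:
    """Disassemble one body; on failure record the error and keep going."""
    try:
        func.instructions = disassemble(func.body)
    except UnsupportedOpcode as exc:
        func.instructions = None       # no partial instruction list
        func.disasm_error = str(exc)   # surfaced to later phases
    return func
```

One unknown opcode costs the instruction-level view of a single function, while the rest of the module is still ingested.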
A function with disasm_error set has instructions = None but still has its full raw
body bytes. It is still byte-fingerprintable via exact_hash and contributes to the KB
with whatever fingerprints can be derived. The KB notes the error so downstream analysis
knows the disassembly is incomplete.
The Emscripten JS glue parser
The JS file Emscripten emits alongside the .wasm is what jsglue.py calls the Rosetta
stone of the module. Facts that do not exist inside the binary at all live only in the glue:
| Fact | How it is extracted |
|---|---|
| Emscripten version | Four regex patterns covering EMSCRIPTEN_VERSION, @emscripten/X.Y.Z, GENERATED_BY, and inline package references. |
| dynCall_* signatures | All dynCall_<sig> identifiers found anywhere in the file; the signature suffix (e.g. viii, iji) describes the indirect-call ABI. |
| Exported symbols | Module['_foo'] and wasmExports['bar'] patterns, plus the asm[...] form from older glue. |
| Import bindings | Object literal member patterns like _emscripten_memcpy_js: _emscripten_memcpy_js. |
| pthread / threading | Presence of PThread, pthread_create, _emscripten_proxy, PROXY_TO_PTHREAD, proxyToMainThread, spawnThread, worker.js, _emscripten_thread_init. |
| PROXY_TO_PTHREAD | Checked explicitly; activates a separate flag on GlueInfo. |
| Memory growth | ALLOW_MEMORY_GROWTH, _emscripten_resize_heap, growMemory. |
Extraction is best-effort: a partially populated GlueInfo is preferable to a hard
failure on an unfamiliar layout. If no version string is found, GlueInfo.notes records
that explicitly.
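As an illustration of the regex-driven approach, the dynCall_* row of the table might be implemented along these lines. The pattern is an assumption, not the one in ingest/jsglue.py; it restricts the suffix to the signature letters v, i, j, f, d, and p.

```python
import re
from typing import List

# Assumed pattern: "dynCall_" followed by one or more signature letters.
DYNCALL_RE = re.compile(r"dynCall_([vijfdp]+)\b")


def dyncall_signatures(glue_source: str) -> List[str]:
    """Return the distinct signature suffixes, sorted for stable output."""
    return sorted({m.group(1) for m in DYNCALL_RE.finditer(glue_source)})
```

Collecting into a set first means repeated mentions of the same helper in definitions, call sites, and export tables count once.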
Minification is detected via a crude heuristic: if the average line length exceeds 200
characters, GlueInfo.minified is set to True. This signals that symbol extraction may be
incomplete.
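The heuristic is simple enough to state in a few lines. This sketch assumes the average is taken over all lines, blank ones included, and that an empty file counts as not minified; the function name is hypothetical.

```python
def looks_minified(source: str, threshold: float = 200.0) -> bool:
    """True when the mean line length exceeds the threshold."""
    lines = source.splitlines()
    if not lines:
        return False
    average = sum(len(line) for line in lines) / len(lines)
    return average > threshold
```

A single-line bundle of a few kilobytes trips the check immediately, while hand-formatted glue with short lines stays well under it.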
The glue is optional. If you only have the .wasm, ingestion still works. You lose
the version string, dynCall signatures, and threading-model facts, but the binary is still
fully parsed. Pass --glue whenever the glue file is available; it sharpens every
downstream heuristic.
What ingest seeds for free
project.ingest_into_kb() calls _seed_symbol_for() for every function immediately after
fingerprinting. The function mines the three tiers of free naming evidence and writes an
initial Symbol into the KB if any evidence is present:
| Evidence tier | Provenance | Confidence | Condition |
|---|---|---|---|
| Name section | export | 0.90 | Function index appears in NameSection.function_names |
| Export name | export | 0.85 | Function has at least one export name |
| Import field | import | 0.80 | Function is imported (is_import = True) |
Seeds are written through kb.upsert_symbol(), the same economy-gated write path used by the
Oracle and agents, so they cannot be accidentally overwritten by lower-authority sources
later. A debug build that retains its name section can reach substantial KB coverage before
you run a single Oracle or agent pass.
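The tier table and the gated write can be sketched together. The provenance strings and confidences are copied from the table above; everything else is an assumption, in particular the rule that a write is rejected only when the existing symbol has strictly higher confidence. The real provenance economy may weigh authority differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Symbol:
    name: str
    provenance: str
    confidence: float


# (evidence key, provenance, confidence), highest tier first.
TIERS = [
    ("name_section", "export", 0.90),
    ("export", "export", 0.85),
    ("import", "import", 0.80),
]


def pick_seed(evidence: Dict[str, str]) -> Optional[Symbol]:
    """Return a seed from the best tier that has a name, or None."""
    for key, provenance, confidence in TIERS:
        name = evidence.get(key)
        if name:
            return Symbol(name, provenance, confidence)
    return None


def upsert_symbol(kb: Dict[int, Symbol], func_index: int, candidate: Symbol) -> bool:
    """Toy gate: keep the existing symbol if it is more confident."""
    current = kb.get(func_index)
    if current is not None and current.confidence > candidate.confidence:
        return False
    kb[func_index] = candidate
    return True
```

Under this toy rule a 0.90 name-section seed survives a later 0.50 guess, which is the protective behaviour the paragraph above describes.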
Using warden ingest
- <wasm>: path to the .wasm file (required).
- --glue, -g: optional Emscripten .js glue file.
- --label, -l: version label; defaults to the filename stem.
- --notes: free-text note stored with this version in the KB.
- --db: project database path (default warden.db, or WARDEN_DB env var).
Using ingest_into_kb from Python
IngestResult.glue_info holds the full GlueInfo object if a glue file was parsed.
IngestResult.notes carries any parser warnings (e.g. missing version string, empty export
list from a minified glue).
You can also parse a module without touching the KB:
What comes next
Ingestion does not run the Oracle, agents, or diff. Those are separate commands you layer on top. The typical sequence after ingest:
Oracle identification
Match every defined function against a corpus of labeled Emscripten/musl/libc signatures
to collapse runtime code instantly.
Agent crew
Propose names for the application-specific remainder, gated by the provenance economy.
Diff & carry-over
When a new version ships, ingest it and diff against the previous label.
Export
Emit a C header, pseudocode listing, git-diffable text, or a Ghidra rename script.