categorize_strings
Extract strings from binary files and classify them into semantic categories such as anti-debug, hardware IDs, crypto, network, and more, enabling quick identification of code behaviors.
Instructions
Extract strings from path and bucket them into semantic categories.
The categorization vocabulary is loaded from
data/drm-indicators.yaml::string_categories at MCP-server
load time. Two categories (anti_debug, hwid) inherit
their keyword lists from the existing catalog sections via a
seed_from pointer; the rest have inline keyword lists.
When a future agent adds a new HWID API to
hwid_apis.high_signal, the hwid category picks it up on the
next MCP-server reload with no Python changes.
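The seed_from mechanism described above can be sketched as follows. This is a minimal illustration, not the server's actual loader: the layout of data/drm-indicators.yaml is not shown in this document, so the dictionary below is a hypothetical, already-parsed stand-in. The names anti_debug_apis, keywords, resolve_pointer, and load_categories are assumptions, as are the sample keyword values.

```python
from typing import Any

# Hypothetical stand-in for the parsed YAML catalog. Only string_categories,
# seed_from, and hwid_apis.high_signal are named in the text; every other key
# and all keyword values here are illustrative assumptions.
CATALOG: dict[str, Any] = {
    "hwid_apis": {"high_signal": ["GetVolumeInformationW", "GetAdaptersInfo"]},
    "anti_debug_apis": {"high_signal": ["IsDebuggerPresent"]},
    "string_categories": {
        "hwid": {"seed_from": "hwid_apis.high_signal"},
        "anti_debug": {"seed_from": "anti_debug_apis.high_signal"},
        "crypto": {"keywords": ["AES", "BCryptEncrypt"]},
    },
}


def resolve_pointer(catalog: dict[str, Any], dotted: str) -> list[str]:
    """Follow a dotted path such as 'hwid_apis.high_signal' into the catalog."""
    node: Any = catalog
    for key in dotted.split("."):
        node = node[key]
    return list(node)


def load_categories(catalog: dict[str, Any]) -> dict[str, list[str]]:
    """Build category -> keyword list, following seed_from where present."""
    resolved: dict[str, list[str]] = {}
    for name, spec in catalog["string_categories"].items():
        if "seed_from" in spec:
            resolved[name] = resolve_pointer(catalog, spec["seed_from"])
        else:
            resolved[name] = list(spec.get("keywords", []))
    return resolved
```

Because the hwid list is resolved at load time rather than copied into the category definition, appending an entry to the catalog section and reloading is enough for the category to see it.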
The return shape is a strict superset of the extract_strings return shape:
::

    {
      "path": "...",
      "min_length": 5,
      "totals": {"ascii_extracted": N, "utf16le_extracted": N,
                 "deduplicated": N, "categorized": N},
      "truncated": {"input": bool, "per_category": bool,
                    "per_encoding": bool},
      "by_category": {
        "anti_debug": {"count": N, "samples": [{"string":..., "section":...}, ...]},
        "hwid": {"count": N, "samples": [...]},
        "crypto": {"count": N, "samples": [...]},
        "network": {"count": N, "samples": [...]},
        "registry": {"count": N, "samples": [...]},
        "process": {"count": N, "samples": [...]},
        "file": {"count": N, "samples": [...]},
        "fingerprint": {"count": N, "samples": [...]},
        "activation": {"count": N, "samples": [...]},
        "obfuscation": {"count": N, "samples": [...]},
        "misc": {"count": N, "samples": [...]}
      },
      "ascii_capped": [...],          # backward-compat with extract_strings
      "utf16le_capped": [...],
      "uncategorized_sample": [...]   # 50 misc strings (helps spot missing categories)
    }

On large binaries (e.g. a 500+ MB Unity IL2CPP GameAssembly.dll
wrapped by an encrypted-VM bytecode interpreter), pass
skip_sections=[".idata", ".xtls", ".xpdata", ".udata", ".xdata",
".didata", ".ecode", ".00cfg"] to skip the encrypted-VM
bytecode regions. Those sections contain no readable strings;
the categorization result is the same and the memory footprint
drops dramatically.
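To illustrate why skipping sections reduces work, here is a minimal sketch of section-aware ASCII extraction. It is not the tool's implementation: the function name extract_ascii and the (name, bytes) input format are hypothetical, and PE parsing is left out entirely. The point is that a skipped section is never scanned.

```python
import re


def extract_ascii(sections, min_length=5, skip_sections=()):
    """Return printable-ASCII runs of at least min_length bytes per section.

    sections is an iterable of (section_name, raw_bytes) pairs; this input
    format is an assumption made for the sketch.
    """
    skip = set(skip_sections)
    # Runs of printable ASCII (0x20-0x7e) that are at least min_length long.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    found = []
    for name, data in sections:
        if name in skip:
            continue  # skipped sections are never scanned
        for match in pattern.finditer(data):
            found.append({"string": match.group().decode("ascii"),
                          "section": name})
    return found
```

Each hit carries its section name, matching the {"string":..., "section":...} sample entries in the return shape.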
Categories are descriptive: they describe observable string content, not specific commercial products.
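A caller will often want a quick ranking of which categories dominate a binary. The sketch below assumes only the by_category layout shown in the return shape. The helper name rank_categories and its with_misc flag are hypothetical and not part of the tool.

```python
def rank_categories(result, with_misc=False):
    """Return (category, count) pairs sorted by descending count, then name."""
    rows = [
        (name, bucket["count"])
        for name, bucket in result["by_category"].items()
        if with_misc or name != "misc"
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Leaving misc out by default keeps the catch-all bucket from dominating the ranking.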
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| path | Yes | ||
| min_length | No | ||
| categories | No | ||
| include_misc | No | ||
| max_per_category | No | ||
| samples_per_category | No | ||
| skip_sections | No | ||