redflag-mcp
Utilizes OpenAI models (gpt-4o-mini) for extracting structured AML red flag indicators from regulatory documents during the extraction pipeline phase.
MCP server exposing AML red flag knowledge as queryable tools. Compliance officers ask natural-language questions; the server returns relevant, sourced red flags from either a local LanceDB vector store or a packaged SQLite FTS5 corpus.
Hosted Connector
Public users should start with the hosted MCP URL:
https://<deployment>/mcp
Add that URL in a hosted MCP client, enable the connector, and ask AML red flag research questions such as:
What red flags apply to TBML invoice mismatch?
Which red flags cover bulk cash movement to Mexico?
List source coverage for the corpus.
Public hosted mode is not for confidential customer, transaction, institution, or investigation details. User prompts are sent to the hosted MCP service operator and the host client. Use local desktop or institution-hosted deployments for sensitive institution-specific context.
The hosted connector is backed by a verified packaged corpus. End users do not need Python, repository setup, package downloads, ingestion, OpenAI keys, or environment variables. Operators should use docs/hosted-deployment.md for Railway deployment, corpus activation, rollback, logging, and validation.
Overview
Eight distinct workflows:
URL pipeline: download URLs, optionally inspect local captures, then extract red flags
Source harvesting: bulk-download PDFs and web pages from a catalog CSV into sources.yaml
Extraction: pull AML red flags out of PDFs or web pages using an LLM and save them as YAML
Source registry: rebuild red_flag_sources/registry.csv, the audit ledger for extracted, downloaded, and not-downloaded sources
Ingestion: embed the YAML files and load them into the local vector database
Corpus packaging: build a versioned SQLite FTS5 package for offline lexical runtime use
Hosted deployment: run the ASGI MCP service from a verified corpus package at one public /mcp URL
Query: the MCP server answers search and filtering requests against the configured local or hosted store
URL Pipeline
Use scripts/pipeline.py for day-to-day source onboarding from URLs. It supports both a one-shot workflow and a review checkpoint between download and extraction.
Create a plain text file with one URL per line:
https://example.gov/report.pdf
https://example.gov/red-flag-guidance
Blank lines and non-HTTP(S) lines are skipped.
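The skipping rule can be illustrated with a minimal Python sketch. This is not the code in scripts/pipeline.py; the function name read_url_list is hypothetical, and the check (an http or https scheme plus a host) is an assumption about what counts as an HTTP(S) line.

```python
from urllib.parse import urlparse


def read_url_list(text: str) -> list[str]:
    """Return the HTTP(S) URLs from a one-URL-per-line list, in file order."""
    urls = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank line: skipped
        parsed = urlparse(line)
        # Assumption: a usable line has an http/https scheme and a host.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(line)
    return urls


sample = """https://example.gov/report.pdf

ftp://example.gov/archive.zip
notes about this batch
https://example.gov/red-flag-guidance
"""
print(read_url_list(sample))
```

Under these assumptions only the two example.gov HTTPS lines survive; the blank line, the FTP URL, and the free-text note are dropped.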
One-shot download and extraction
uv run python scripts/pipeline.py run urls.txt
Use this when you trust the source list and want to download each URL, register it in red_flag_sources/sources.yaml, extract red flags, update data/source/.extracted_sources.yaml, and rebuild red_flag_sources/registry.csv.
Download, inspect, then extract
# Download PDFs/web captures and update sources.yaml + registry.csv
uv run python scripts/pipeline.py download urls.txt
# Inspect red_flag_sources/pdf/ and red_flag_sources/markdown/, then extract downloaded rows
uv run python scripts/pipeline.py extract
Use the two-step flow when you want to inspect Jina Reader markdown captures or downloaded PDFs before spending OpenAI extraction calls.
Options
# Bypass registry deduplication and re-download/re-extract
uv run python scripts/pipeline.py run urls.txt --force
# Extract downloaded sources in parallel; default is sequential unless --parallel is present
uv run python scripts/pipeline.py extract --parallel
uv run python scripts/pipeline.py extract --parallel 8
uv run python scripts/pipeline.py run urls.txt --parallel 4
Deduplication uses red_flag_sources/registry.csv by source_url. Re-run scripts/build_registry.py first if you manually edited sources.yaml, catalog CSVs, or YAML source files and need the pipeline to see the latest status.
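As a rough sketch of registry-based deduplication, the Python below filters a candidate URL list against the source_url column of a registry CSV. The function names and the two-column sample are hypothetical; only the source_url and status column names come from this README, and the real registry.csv may carry more columns.

```python
import csv
import io


def registered_urls(registry_csv: str) -> set[str]:
    """Collect every non-empty source_url value from registry CSV text."""
    reader = csv.DictReader(io.StringIO(registry_csv))
    return {row["source_url"].strip() for row in reader if row.get("source_url")}


def urls_to_fetch(candidates: list[str], registry_csv: str, force: bool = False) -> list[str]:
    """Drop candidates already in the registry unless force is set (like --force)."""
    if force:
        return list(candidates)
    seen = registered_urls(registry_csv)
    return [url for url in candidates if url not in seen]


# Hypothetical registry contents for illustration only.
registry = "source_url,status\nhttps://example.gov/report.pdf,downloaded\n"
candidates = ["https://example.gov/report.pdf", "https://example.gov/red-flag-guidance"]
print(urls_to_fetch(candidates, registry))
print(urls_to_fetch(candidates, registry, force=True))
```

The sketch compares URLs as exact strings, so a stale registry hides nothing but also catches nothing new, which is why rebuilding the registry after manual edits matters.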
Source Harvesting
scripts/harvest_sources.py automates acquisition of regulatory documents from the Global AML/CFT/Sanctions Red Flag Catalog. It reads the Direct URL column, classifies each URL as a PDF or web page, downloads the file, and registers it in red_flag_sources/sources.yaml.
uv run python scripts/harvest_sources.py red_flag_sources/Global_AML_CFT_Sanctions_Red_Flag_Catalog.csv
What it does:
Reads the Direct URL column from each CSV row
Skips blank, malformed, or already-registered URLs
Classifies the URL as PDF via path heuristics (.pdf suffix, /download, /file), falling back to an HTTP HEAD check for ambiguous cases
Downloads PDFs to red_flag_sources/pdf/NNN.pdf
Fetches web pages via the Jina Reader API and saves cleaned markdown to red_flag_sources/markdown/NNN.md
Appends each new entry to sources.yaml (written once at the end)
Prints a final summary: PDFs downloaded, web pages fetched, skipped, failed
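The classification step in the list above might look roughly like the following Python sketch. It is not the logic in scripts/harvest_sources.py: the function names are hypothetical, and two choices are assumptions, namely that any URL matching none of the path heuristics counts as ambiguous, and that a failed HEAD request falls back to treating the URL as a web page.

```python
import urllib.request
from urllib.parse import urlparse


def head_content_type(url: str, timeout: float = 10.0) -> str:
    """Issue an HTTP HEAD request and return the Content-Type header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def classify_url(url: str, head=head_content_type) -> str:
    """Return "pdf" or "web" using path heuristics, then a HEAD check."""
    path = urlparse(url).path.lower()
    # Path heuristics named in the list above.
    if path.endswith(".pdf") or "/download" in path or "/file" in path:
        return "pdf"
    # Assumption: everything else is ambiguous and needs the HEAD check.
    try:
        content_type = head(url)
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return "web"
    return "pdf" if "application/pdf" in content_type.lower() else "web"


# Offline usage: inject a fake HEAD function instead of touching the network.
print(classify_url("https://example.gov/report.pdf"))
print(classify_url("https://example.gov/guidance", head=lambda url: "text/html"))
```

Passing the HEAD function as a parameter keeps the heuristic testable without network access.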
The script is idempotent — re-running against the same CSV produces no new files or registry entries. Per-URL failures are logged and skipped without aborting the run.
red_flag_sources/
  Global_AML_CFT_Sanctions_Red_Flag_Catalog.csv   # input catalog (~218 URLs)
  sources.yaml                                    # registry of all harvested URLs
  pdf/                                            # downloaded PDFs (gitignored via *.pdf)
  markdown/                                       # Jina Reader captures (gitignored)
After harvesting, rebuild the status registry and pass downloaded files to extraction:
uv run python scripts/build_registry.py
# Extract red flags from all newly downloaded PDFs
uv run python scripts/extract.py --parallel
# Or target a specific serial range
uv run python scripts/extract.py --range 039-060 --parallel
Use harvest_sources.py when the input is a catalog CSV. Use pipeline.py when the input is a simple URL list or when you want the download-inspect-extract workflow.
Note: sources.yaml is the shared URL registry for pipeline.py, harvest_sources.py, and build_sources_registry.py. Do not run these scripts concurrently: each can overwrite sources.yaml after updating it.
Extraction Pipeline
scripts/extract.py takes a regulatory document (PDF file or URL), sends its text to an OpenAI model, and writes a structured YAML file into data/source/. Each extracted entry includes a source_url linking back to the original document.
Prerequisites
uv sync --extra dev
export OPENAI_API_KEY=sk-...
Adding sources in bulk (recommended workflow)
Use scripts/pipeline.py for new URL lists:
uv run python scripts/pipeline.py download urls.txt
uv run python scripts/pipeline.py extract --parallel
This downloads into red_flag_sources/pdf/ or red_flag_sources/markdown/, updates sources.yaml, extracts downloaded registry rows, updates .extracted_sources.yaml, and rebuilds registry.csv.
For catalog CSVs, use scripts/harvest_sources.py first, then scripts/extract.py.
Manual PDF workflow
PDFs are stored in red_flag_sources/pdf/ and should be named with a zero-padded serial prefix:
red_flag_sources/pdf/
  001_fincen_alert_russian_sanctions_evasion.pdf
  002_ffiec_bsa_aml_examination_manual.pdf
  003_fatf_guidance_virtual_assets.pdf
Each serial number maps to a public URL for the source document in red_flag_sources/sources.yaml. For the legacy manual flow, maintain that mapping in red_flag_sources/pdflinks.txt, one URL per line, in serial order:
# FinCEN Russian Sanctions Evasion Alert
https://fincen.gov/sites/default/files/2022-06/Alert%20FIN-2022-Alert001_508C.pdf
# FFIEC BSA/AML Examination Manual
https://bsaaml.ffiec.gov/manual
# FATF Guidance on Virtual Assets
https://www.fatf-gafi.org/...
Blank lines and lines starting with # are ignored. After editing pdflinks.txt, regenerate sources.yaml and registry.csv:
uv run python scripts/build_sources_registry.py
uv run python scripts/build_registry.py
Then run batch extraction:
uv run python scripts/extract.py --parallel
Only new (unprocessed) PDFs are extracted; previously processed sources are skipped automatically.
Batch extraction commands
# Sequential batch
uv run python scripts/extract.py
# Parallel batch (4 workers by default)
uv run python scripts/extract.py --parallel
# Parallel batch with custom worker count
uv run python scripts/extract.py --parallel 8
# Force re-extract everything
uv run python scripts/extract.py --force --parallel
# Process only PDFs in a serial range (e.g. 001 through 005)
uv run python scripts/extract.py --range 001-005
# Range + parallel
uv run python scripts/extract.py --range 001-005 --parallel
# Force re-extract a range
uv run python scripts/extract.py --force --range 001-005 --parallel
Note: --range applies only to numbered PDFs. Web URLs in Weblinks.md are excluded when a range is active.
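To make the --range semantics concrete, here is a small Python sketch of selecting files by zero-padded serial prefix. The function names are hypothetical, and treating the range as inclusive on both ends is an assumption based on the "001 through 005" wording above; the parsing in scripts/extract.py may differ.

```python
import re


def parse_serial_range(spec: str) -> tuple[int, int]:
    """Parse a START-END spec such as "001-005" into integer bounds."""
    match = re.fullmatch(r"(\d+)-(\d+)", spec.strip())
    if match is None:
        raise ValueError(f"expected START-END, got {spec!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"range start exceeds end: {spec!r}")
    return start, end


def select_by_range(filenames: list[str], spec: str) -> list[str]:
    """Keep files whose leading digits fall inside the range (inclusive)."""
    start, end = parse_serial_range(spec)
    selected = []
    for name in filenames:
        prefix = re.match(r"\d+", name)
        # Files without a numeric prefix are excluded when a range is active.
        if prefix and start <= int(prefix.group()) <= end:
            selected.append(name)
    return selected


files = ["001_fincen_alert.pdf", "039.pdf", "060.pdf", "061.pdf", "notes.md"]
print(select_by_range(files, "039-060"))
```

Comparing the prefix as an integer means 039 and 39 select the same file, so padding width does not affect matching in this sketch.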
Single source (ad hoc)
# Extract from a local PDF
uv run python scripts/extract.py red_flag_sources/pdf/001_fincen_alert.pdf
# Extract from a URL
uv run python scripts/extract.py https://example.com/regulatory-guidance
# Re-extract a source that was already processed
uv run python scripts/extract.py --force red_flag_sources/pdf/001_fincen_alert.pdf
For single-source PDFs, make sure sources.yaml maps the file's serial prefix to the public URL before extraction so the extractor can populate source_url in the output. If you maintain the legacy pdflinks.txt file, run build_sources_registry.py and then build_registry.py first.
What it does
Fetches the document: downloads the web page (strips nav/footer/scripts) or reads text from the PDF via pdfplumber
Sends to OpenAI: prompts gpt-4o-mini (override with OPENAI_EXTRACTION_MODEL) to extract every distinct AML red flag indicator as structured JSON
Validates: each returned flag is checked against the RedFlagSource schema; invalid entries are skipped with a warning
Writes YAML: saves to data/source/<slug>.yaml, one entry per red flag
Updates the manifest: records the source in data/source/.extracted_sources.yaml to prevent re-processing
Rebuilds the source registry: updates red_flag_sources/registry.csv after successful batch or single-source extraction
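The validation step in the list above can be approximated with the Python sketch below. It does not reproduce the project's RedFlagSource schema: FlagRecord is a hypothetical stand-in, and the field names id and description are assumptions chosen to mirror the two required string fields in the output schema.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("extract_sketch")


@dataclass
class FlagRecord:
    """Hypothetical stand-in for the RedFlagSource schema."""
    id: str
    description: str
    source_url: Optional[str] = None


def validate_flags(raw_flags: list, source_url: Optional[str] = None) -> list[FlagRecord]:
    """Keep well-formed entries; skip the rest with a warning."""
    valid = []
    for index, raw in enumerate(raw_flags):
        if not isinstance(raw, dict):
            logger.warning("skipping entry %d: not an object", index)
            continue
        flag_id, description = raw.get("id"), raw.get("description")
        if not isinstance(flag_id, str) or not flag_id.strip():
            logger.warning("skipping entry %d: missing id", index)
            continue
        if not isinstance(description, str) or not description.strip():
            logger.warning("skipping entry %d: missing description", index)
            continue
        # Fall back to the document-level URL when the entry has none.
        url = raw.get("source_url") or source_url
        valid.append(FlagRecord(flag_id.strip(), description.strip(), url))
    return valid


model_output = [
    {"id": "demo-01", "description": "Invoice value does not match shipped goods."},
    {"id": "demo-02"},
    "not a flag",
]
records = validate_flags(model_output, source_url="https://example.gov/report.pdf")
print([record.id for record in records])
```

Skipping bad entries one at a time, rather than rejecting the whole response, matches the behavior described above, where invalid entries produce a warning and the rest are still written.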
Output schema
Each entry in the YAML file has the following fields:
Field | Type | Required | Description |
| string | yes | Unique identifier |
| string | yes | Standalone description of the red flag indicator |
source_url | string | no | Public URL of the source document |
product_types | list[string] | no | Financial products this applies to |
industry_types | list[string] | no | Customer industries or sectors this applies to |
customer_profiles | list[string] | no | Customer archetypes this applies to |
geographic_footprints | list[string] | no | Relevant geographies or corridors |
regulatory_source | string | no | Source document name or authority |
regulator | string | no | Abbreviated issuing authority |
regulator_jurisdiction | string | no | Canonical jurisdiction code deterministically derived from regulator |
issued_date | string | no | Publication date of the source document (ISO 8601: YYYY-MM-DD, YYYY-MM, or YYYY) |
| string | no | |
| string | no | AML typology |
| string | no | Optional simulation complexity code |
typology_family | list[string] | no | Higher-level AML typology families |
transaction_patterns | list[string] | no | Observable behavioral patterns |
key_terms | list[string] | no | Short searchable phrases, instruments, thresholds, or acronyms |
regulator and issued_date are requested during extraction. regulator_jurisdiction is derived in code from regulator; if the regulator is missing or unmapped, it stays unset and ingestion logs a warning. typology_family, transaction_patterns, and key_terms are added to existing YAML source files by running scripts/ingest.py --write-back-yaml (see Enriching YAML source files below).
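A deterministic derivation of regulator_jurisdiction could be sketched in Python as follows. The mapping table is illustrative only and is not the project's actual table; the function name and the case-insensitive lookup are also assumptions.

```python
import logging
from typing import Optional

logger = logging.getLogger("jurisdiction_sketch")

# Illustrative mapping only; the project's real table is not shown in this README.
REGULATOR_JURISDICTIONS = {
    "FinCEN": "US",
    "FFIEC": "US",
    "AUSTRAC": "AU",
    "MAS": "SG",
    "FCA": "GB",
    "Tracfin": "FR",
    "EBA": "EU",
}
_LOOKUP = {name.casefold(): code for name, code in REGULATOR_JURISDICTIONS.items()}


def derive_jurisdiction(regulator: Optional[str]) -> Optional[str]:
    """Map a regulator abbreviation to a jurisdiction code, or None if unknown."""
    if not regulator or not regulator.strip():
        logger.warning("regulator missing; regulator_jurisdiction left unset")
        return None
    code = _LOOKUP.get(regulator.strip().casefold())
    if code is None:
        logger.warning("regulator %r is unmapped; regulator_jurisdiction left unset", regulator)
    return code


print(derive_jurisdiction("FinCEN"))
print(derive_jurisdiction("Unknown Authority"))
```

Keeping this lookup in code rather than asking the LLM for it means the same regulator always yields the same code, which is what makes exact filtering on regulator_jurisdiction reliable.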
Deduplication
data/source/.extracted_sources.yaml tracks every processed source by its canonical path or URL. Sources already in the manifest are skipped in both batch and single-source mode. Use --force to re-extract a source regardless.
Source Registry
scripts/build_registry.py rebuilds red_flag_sources/registry.csv from scratch. The registry is a human-readable audit ledger across three states:
Status | Meaning |
| A YAML file exists in |
| The URL is present in |
| The URL appears in the catalog CSVs, but is not present in |
Run it manually after editing catalog CSVs, sources.yaml, or extracted YAML files outside the normal scripts:
uv run python scripts/build_registry.py
You usually do not need to run it after pipeline.py extract or extract.py; both rebuild the registry after successful extraction. pipeline.py download rebuilds it after each successful download so newly captured URLs appear as downloaded.
The registry powers pipeline deduplication and extraction auto-discovery:
pipeline.py download skips URLs already present in registry.csv unless --force is used.
pipeline.py extract finds rows with status == "downloaded" and extracts their local PDF or markdown files.
Ingestion
After extraction, embed the YAML files and load them into the vector database:
uv run python scripts/ingest.py
For the initial local corpus, ingest only the three target files:
uv run python scripts/ingest.py \
data/source/001_federal_child_nutrition_fraud.yaml \
data/source/002_oil_smuggling_cartels.yaml \
data/source/003_bulk_cash_smuggling_repatriation.yaml
This generates embeddings with nomic-embed-text-v1.5 and upserts records into LanceDB at data/vectors/. Run ingestion before connecting the MCP server to a desktop client; the embedding model downloads on first use and is better cached during ingestion than during server startup.
OPENAI_API_KEY is optional for ingestion. When it is set, ingestion can auto-tag missing metadata into the derived LanceDB records. When it is not set, ingestion preserves available YAML metadata and leaves missing rich consultation fields empty. Source YAML files are not rewritten by normal ingestion.
Enriching YAML source files (write-back)
To enrich source YAML files with typology_family, transaction_patterns, key_terms, regulator, regulator_jurisdiction, and issued_date — fields used for offline keyword search and faceted filtering — run ingestion with --write-back-yaml:
export OPENAI_API_KEY=sk-...
uv run python scripts/ingest.py --write-back-yaml data/source/001_federal_child_nutrition_fraud.yaml
Write-back supports the same batch selection styles as extraction:
# All visible YAML files in data/source/
uv run python scripts/ingest.py --write-back-yaml
# Multiple explicit YAML files
uv run python scripts/ingest.py --write-back-yaml \
data/source/001_federal_child_nutrition_fraud.yaml \
data/source/002_oil_smuggling_cartels.yaml
# Serial range by source filename prefix
uv run python scripts/ingest.py --write-back-yaml --range 001-003
# Parallel file-level write-back (4 workers by default, or pass a count)
uv run python scripts/ingest.py --write-back-yaml --range 001-003 --parallel
uv run python scripts/ingest.py --write-back-yaml --parallel 8
This enriches each selected source file in-place and exits without updating the vector database. Existing metadata is not overwritten by the LLM; only missing fields are requested, and deterministic fields such as regulator_jurisdiction are derived in code. After write-back, re-run normal ingestion to load the enriched records:
uv run python scripts/ingest.py data/source/001_federal_child_nutrition_fraud.yaml
Note: If you deploy this change against an existing data/vectors/ store, delete the store and re-ingest from scratch so the new columns (typology_family, transaction_patterns, key_terms, regulator, regulator_jurisdiction, issued_date) are present in the LanceDB schema:
rm -rf data/vectors/
uv run python scripts/ingest.py
Corpus Packaging
Maintainers can build a versioned, verifiable SQLite FTS5 corpus package from approved YAML records:
uv run python scripts/build_corpus.py \
--output-dir dist/corpus \
--version 2026.04.29 \
--all-sources
# Or build a curated corpus from explicit YAML files
uv run python scripts/build_corpus.py \
--output-dir dist/corpus \
--version 2026.04.29 \
data/source/001_federal_child_nutrition_fraud.yaml \
data/source/002_oil_smuggling_cartels.yaml \
data/source/003_bulk_cash_smuggling_repatriation.yaml
uv run python scripts/verify_corpus.py dist/corpus/redflag-corpus-2026.04.29.zip
The package contains manifest.json and redflags.sqlite. The manifest records schema version, build timestamp, source record hashes, file hashes, record/source counts, and source redistribution metadata. Source documents are treated as URL-only unless data/lexicon/source_metadata.yaml explicitly clears them for bundling.
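To show what verifying a package against its manifest can involve, here is a minimal Python sketch. It is not scripts/verify_corpus.py: the manifest key file_hashes, the use of SHA-256, and both function names are assumptions, since this README does not describe the manifest layout.

```python
import hashlib
import io
import json
import zipfile

REQUIRED_MEMBERS = ("manifest.json", "redflags.sqlite")


def verify_package(package: bytes) -> list[str]:
    """Return a list of problems; an empty list means the package checks out."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        names = set(archive.namelist())
        missing = [member for member in REQUIRED_MEMBERS if member not in names]
        if missing:
            return [f"missing member: {member}" for member in missing]
        manifest = json.loads(archive.read("manifest.json"))
        # Assumption: the manifest stores {"file_hashes": {name: sha256 hex}}.
        for name, expected in manifest.get("file_hashes", {}).items():
            if name not in names:
                problems.append(f"manifest lists absent file: {name}")
                continue
            actual = hashlib.sha256(archive.read(name)).hexdigest()
            if actual != expected:
                problems.append(f"hash mismatch: {name}")
    return problems


def build_demo_package(db_bytes: bytes, recorded_hash: str = "") -> bytes:
    """Build a tiny in-memory package for demonstration purposes."""
    digest = recorded_hash or hashlib.sha256(db_bytes).hexdigest()
    manifest = {"file_hashes": {"redflags.sqlite": digest}}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest))
        archive.writestr("redflags.sqlite", db_bytes)
    return buffer.getvalue()


print(verify_package(build_demo_package(b"demo database")))
print(verify_package(build_demo_package(b"demo database", recorded_hash="0" * 64)))
```

Returning a list of problems instead of raising on the first one lets a maintainer see every mismatch in a single run.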
The current SQLite lexical corpus schema version is 3. Rebuild older corpus packages after schema changes that add stored fields or filters.
Run the hosted retrieval smoke benchmark before publishing a corpus package:
uv run python scripts/evaluate_retrieval.py \
--corpus dist/corpus/redflag-corpus-2026.04.29.zip \
--benchmark data/eval/hosted_retrieval_queries.yaml
This benchmark checks representative alias, geography, typology, product/channel, and source-specific queries against the lexical corpus. It is a launch gate, not proof of broad AML retrieval quality.
Running from a corpus
The server can run directly against a built SQLite corpus without loading the embedding model:
REDFLAG_CORPUS_PATH=dist/corpus/redflags.sqlite uv run python -m redflag_mcp
It can also verify and install a ZIP package into a local corpus cache:
REDFLAG_CORPUS_PACKAGE=dist/corpus/redflag-corpus-2026.04.29.zip \
REDFLAG_CORPUS_CACHE_DIR=~/.redflag-mcp \
uv run python -m redflag_mcp
For release-index driven activation:
REDFLAG_CORPUS_RELEASE_INDEX=dist/corpus/releases.json \
REDFLAG_CORPUS_VERSION=2026.04.29 \
REDFLAG_CORPUS_CACHE_DIR=~/.redflag-mcp \
uv run python -m redflag_mcp
Set REDFLAG_CORPUS_AUTO_UPDATE=0 to reuse the active cached corpus without checking the package or release index. When no corpus environment variables are set, the server falls back to the LanceDB vector store at data/vectors/.
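The fallback behavior can be pictured with a short Python sketch of environment-driven backend selection. The variable names come from this section, but the precedence order (path, then package, then release index) and the function name are assumptions; the server's real resolution logic may differ.

```python
from typing import Mapping

DEFAULT_VECTOR_STORE = "data/vectors/"


def select_backend(env: Mapping[str, str]) -> tuple[str, str]:
    """Pick a store from corpus environment variables, else fall back to LanceDB."""
    # Assumed precedence: direct SQLite path, then ZIP package, then release index.
    if env.get("REDFLAG_CORPUS_PATH"):
        return ("sqlite", env["REDFLAG_CORPUS_PATH"])
    if env.get("REDFLAG_CORPUS_PACKAGE"):
        return ("package", env["REDFLAG_CORPUS_PACKAGE"])
    if env.get("REDFLAG_CORPUS_RELEASE_INDEX"):
        return ("release_index", env["REDFLAG_CORPUS_RELEASE_INDEX"])
    return ("lancedb", DEFAULT_VECTOR_STORE)


print(select_backend({}))
print(select_backend({"REDFLAG_CORPUS_PATH": "dist/corpus/redflags.sqlite"}))
```

In real use the mapping would be os.environ; taking it as a parameter keeps the sketch easy to exercise without changing the process environment.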
MCP Server
# Start server (stdio mode, for Claude Desktop / Claude Code)
uv run python -m redflag_mcp
# Start in MCP inspector
uv run mcp dev src/redflag_mcp/server.py
# Start as HTTP server (for OpenAI agents or other HTTP clients)
MCP_TRANSPORT=http MCP_HOST=0.0.0.0 MCP_PORT=8000 uv run python -m redflag_mcp
# Start from a packaged corpus instead of LanceDB
REDFLAG_CORPUS_PACKAGE=dist/corpus/redflag-corpus-2026.04.29.zip uv run python -m redflag_mcp
The server exposes hosted-client-compatible tools for request routing, semantic search, exact metadata filtering, source browsing, and filter discovery:
classify_red_flag_request for deciding whether a request needs more context, exact metadata filtering, filtered semantic search, or direct semantic search
search_red_flags for natural-language relevance search with sourced, ranked results
filter_red_flags for exact metadata requests that should not use embedding search. Filters include product_types, industry_types, customer_profiles, geographic_footprints, typology_family, transaction_patterns, category, risk_level, regulator, regulator_jurisdiction, issued_after, issued_before, regulatory_source, source_url, and source_id.
get_red_flag for the full text and citation metadata for one red flag
list_filters for available metadata filter values
list_sources and get_source for ingested source coverage and citation context
It is fully offline after ingestion or corpus installation — no API keys required at query time.
Use from Codex
For local Codex threads, prefer stdio so Codex starts the MCP server automatically:
codex mcp add redflag-mcp -- zsh -lc 'cd /Users/learningmachine/Documents/Python-dev/redflag-mcp && HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 uv run python -m redflag_mcp'
Verify the registration:
codex mcp list
codex mcp get redflag-mcp
Then start a new Codex thread and ask for the server by name, for example:
Use the redflag-mcp MCP server. List the available AML red flag filters.
If you already have the HTTP server running, you can register that instead:
codex mcp add redflag-mcp-http --url http://127.0.0.1:8000/mcp
Local smoke checks
After ingesting the three target files, verify the tools with:
list_filters
list_sources
classify_red_flag_request(query="what red flags apply to my crypto product?")
filter_red_flags(product_types=["depository"], category="fraud_nexus", risk_level="medium")
filter_red_flags(typology_family=["trade_based_money_laundering"], transaction_patterns=["trade_document_manipulation"])
filter_red_flags(regulator="FinCEN", issued_after="2024", issued_before="2026")
filter_red_flags(regulator_jurisdiction="FR")
search_red_flags(query="federal child nutrition program sponsor receives reimbursements inconsistent with its profile", product_types=["depository"])
search_red_flags(query="TBML invoice mismatch")
search_red_flags(query="southwest border oil company wires for waste oil or hazardous materials")
search_red_flags(query="bulk cash moved by armored car service to Mexico")
get_red_flag(red_flag_id="001_federal_child_nutrition_fraud-01")
For a vague query such as "what should I look for in business accounts?", the calling agent should call classify_red_flag_request and ask a brief consultation question covering product/channel, industry, customer profile, geography, and transaction channel or volume when the route is needs_more_context. For exact metadata requests such as "show medium-risk fraud nexus red flags for depository products" or "red flags from regulators in France", it should call filter_red_flags instead of semantic search, translating country names to regulator_jurisdiction codes such as FR, SG, AU, GB, US, and EU. For requests with both usable filters and a rich narrative, it should call search_red_flags with filters so metadata controls eligibility and embeddings rank the matching records.
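The idea that metadata controls eligibility while embeddings rank the matching records can be sketched in plain Python. This toy example does not use LanceDB or the real embedding model: the two-dimensional vectors, the record layout, and the any-overlap rule for list-valued filters are all assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(record: dict, filters: dict) -> bool:
    """A record is eligible only if every filter shares a value with it."""
    for field, wanted in filters.items():
        wanted_set = set(wanted) if isinstance(wanted, (list, tuple, set)) else {wanted}
        value = record.get(field)
        have = set(value) if isinstance(value, (list, tuple, set)) else {value}
        if not wanted_set & have:
            return False
    return True


def filtered_search(records: list[dict], query_vector: list[float], filters: dict, limit: int = 5) -> list[str]:
    """Filter first, then rank the survivors by similarity to the query."""
    eligible = [record for record in records if matches(record, filters)]
    ranked = sorted(eligible, key=lambda r: cosine(r["vector"], query_vector), reverse=True)
    return [record["id"] for record in ranked[:limit]]


# Toy records with made-up vectors.
records = [
    {"id": "a", "product_types": ["depository"], "risk_level": "medium", "vector": [1.0, 0.0]},
    {"id": "b", "product_types": ["depository"], "risk_level": "medium", "vector": [0.6, 0.8]},
    {"id": "c", "product_types": ["crypto"], "risk_level": "medium", "vector": [0.0, 1.0]},
]
print(filtered_search(records, [0.0, 1.0], {"product_types": ["depository"]}))
```

Record c is the closest match to the query vector but is never returned, because the product filter removes it before ranking; that is the behavior the routing guidance above relies on.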
Development
uv sync --extra dev # Install dependencies
uv run pytest tests/ # Run tests
uv run ruff check src/ # Lint
uv run mypy src/ # Type check