Aegis DQ
The open-source agentic data quality framework. Validate data contracts, diagnose failures with LLM root-cause analysis, and auto-generate SQL remediation — all in a single CI step or Python call.
- 31 rule types — completeness, uniqueness, validity, referential integrity, statistical, ML anomaly detection
- 6 warehouse adapters — DuckDB, Postgres/Redshift, BigQuery, Databricks, AWS Athena, Snowflake
- Pluggable LLMs — Anthropic Claude, OpenAI, Ollama (local), AWS Bedrock
- Agentic pipeline — plan → parallel validation → LLM diagnose → RCA → SQL remediate → report
GitHub Actions — Quick Start
Add a data quality gate to any workflow in under 2 minutes:
```yaml
# .github/workflows/data-quality.yml
name: Data Quality
on: [push, pull_request]
jobs:
  data-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate data quality
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
          db: data/warehouse.duckdb
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```

The step fails the job automatically when any rule fails, blocking broken data from reaching production. Set `fail-on-failure: 'false'` to report without blocking.
Offline mode (no API key required):
```yaml
      - name: Validate data quality (offline)
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
          db: data/warehouse.duckdb
          no-llm: 'true'
```

Action inputs
| Input | Default | Description |
|---|---|---|
| `rules-file` | | Path to rules YAML |
| `db` | | DuckDB file path |
| | — | PostgreSQL / Redshift connection DSN |
| `no-llm` | | Skip LLM — free offline validation |
| | (provider default) | Override the default model |
| `fail-on-failure` | | Fail the step when rules fail |
| | (latest) | Pin a specific version |
| `anthropic-api-key` | — | Required when using the Anthropic provider |
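The inputs above combine into a report-only mode. A sketch reusing the Quick Start step, with `fail-on-failure: 'false'` as the only change:

```yaml
      - name: Validate data quality (report only)
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
          db: data/warehouse.duckdb
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          fail-on-failure: 'false'
```

The job succeeds either way, while the report and outputs stay available to later steps.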
Action outputs
The action exposes these outputs:

- total rules evaluated
- rules that passed
- rules that failed
- `pass-rate` — pass rate as a decimal
- absolute path to the full JSON report
Using outputs in downstream steps:
```yaml
      - name: Validate data quality
        id: dq
        uses: aegis-dq/aegis-dq@v0.7.0
        with:
          rules-file: rules.yaml
      - name: Post summary
        run: echo "Pass rate: ${{ steps.dq.outputs.pass-rate }}%"
```

Demo

```
╭──────────────────────────────────────────────────────╮
│  Aegis DQ — RetailCo E-commerce Demo                 │
│  LLM: amazon.nova-pro-v1:0 via AWS Bedrock           │
╰──────────────────────────────────────────────────────╯
✓ Pipeline complete in 7.1s · 12 rules · $0.0056 LLM cost

╭──────────────── Validation Summary ─────────────────╮
│ Rules checked │ 12                                  │
│ Passed        │ 1    │ Failed │ 11                  │
│ Pass rate     │ 8%   │ Cost   │ $0.005576           │
╰─────────────────────────────────────────────────────╯
```
LLM Diagnoses
orders_customer_fk → Order placed with customer_id=99 that does not exist.
Likely cause: customer deleted or test record not cleaned up.
products_sku_unique → Duplicate SKU-001 — two products share the same identifier.
Likely cause: duplicate import from supplier feed.
Remediation SQL (LLM-generated)
orders_status_valid UPDATE orders SET status = 'SHIPPED' WHERE status = 'DISPATCHED';
products_price_positive UPDATE products SET price = ABS(price) WHERE price < 0;
products_stock_non_negative UPDATE products SET stock_quantity = 0 WHERE stock_quantity < 0;

Why Aegis?
| | Aegis DQ | Great Expectations / Soda | Monte Carlo / Anomalo |
|---|---|---|---|
| Open source | ✅ Apache 2.0 | ✅ | ❌ Commercial |
| Agentic LLM diagnosis + RCA | ✅ | ❌ | ✅ Proprietary |
| SQL auto-fix proposals | ✅ | ❌ | ❌ |
| Audit trail (per-decision log) | ✅ | Partial | ✅ Proprietary |
| Pluggable LLM (Anthropic, OpenAI, Bedrock, Ollama) | ✅ | ❌ | ❌ |
| dbt integration | ✅ | ✅ | Partial |
| Portable open rule standard | ✅ | Partial | ❌ |
| ML anomaly detection | ✅ built-in | ❌ | ✅ Proprietary |
Install
```shell
pip install aegis-dq
```

Optional extras add:

- BigQuery adapter
- Databricks adapter
- AWS Athena adapter
- PostgreSQL / Redshift adapter
- Snowflake adapter
- REST API server (FastAPI + uvicorn)
- OpenAI LLM provider
- Airflow integration
- MCP server for Claude Desktop
- scikit-learn anomaly detection
5-minute quickstart
Seed a demo DuckDB database:
```python
import duckdb

con = duckdb.connect("demo.db")
con.execute("""
    CREATE TABLE orders AS
    SELECT i AS order_id, 'placed' AS status, i * 9.99 AS revenue
    FROM range(1, 10001) t(i)
""")
# introduce some bad data: 50 NULL order_ids, then 10 negative revenues
con.execute("UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0")
con.execute("UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0")
con.close()
```

Generate a starter rules file and run:

```shell
aegis init
export ANTHROPIC_API_KEY=sk-ant-...
aegis run rules.yaml --db demo.db
```

Run without an API key (validation only, no LLM diagnosis):

```shell
aegis run rules.yaml --db demo.db --no-llm
```

Pipeline
Every aegis run passes your data through a LangGraph pipeline:
```
rules (Python / YAML)
        │
        ▼
      plan ──► parallel_table ──► reconcile ──► remediate ──► report
                     │
           ┌──────────────────┐
           │ per table:       │
           │   execute        │
           │   classify       │
           │   diagnose       │  ← concurrent across all tables
           │   rca            │
           └──────────────────┘
```

plan — parse and validate rules, build an execution graph
parallel_table — concurrently fans out per table: execute all rules, classify failures by severity, diagnose with LLM, and trace root causes
reconcile — compare results against expected thresholds
remediate — LLM proposes a targeted SQL fix for each diagnosed failure
report — structured JSON + optional Slack notification
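The fan-out in parallel_table can be pictured with stdlib concurrency. This is an illustrative sketch, not Aegis's actual implementation — the stage functions below are toy stand-ins for execute, classify, and diagnose:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for Aegis's per-table stages (hypothetical, illustration only).
def execute_rules(table):
    # Pretend every table has exactly one failing rule.
    return [{"table": table, "rule": "not_null", "passed": False}]

def classify(results):
    # Keep only the failures.
    return [r for r in results if not r["passed"]]

def diagnose(failure):
    # Stand-in for the LLM diagnosis step.
    return f"{failure['table']}.{failure['rule']}: root cause TBD"

def run_table(table):
    # execute -> classify -> diagnose, all scoped to one table
    failures = classify(execute_rules(table))
    return {"table": table, "diagnoses": [diagnose(f) for f in failures]}

def parallel_table(tables):
    # Fan out: each table's whole stage chain runs concurrently,
    # mirroring the "concurrent across all tables" box in the diagram.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_table, tables))

results = parallel_table(["orders", "products"])
print(results[0]["diagnoses"])  # e.g. ['orders.not_null: root cause TBD']
```

`pool.map` preserves input order, so per-table results come back in the order the plan listed them even though execution interleaves.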
Rule types (31 total)
The 31 built-in rule types span nine categories: Completeness, Uniqueness, Validity, Referential, Statistical, Timeliness, Volume, Cross-table, and ML / Anomaly.
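As a minimal sketch of what two of these categories check, here is a completeness (not-null) and a uniqueness probe expressed as plain SQL, using stdlib sqlite3 as a stand-in warehouse — Aegis's own adapters and rule engine handle this internally:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (sku TEXT, price REAL)")
con.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("SKU-001", 9.99), ("SKU-001", 12.50), ("SKU-002", None)],
)

# Completeness: count rows where the column is NULL.
nulls = con.execute(
    "SELECT COUNT(*) FROM products WHERE price IS NULL"
).fetchone()[0]

# Uniqueness: count values that appear more than once.
dupes = con.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT sku FROM products GROUP BY sku HAVING COUNT(*) > 1)"
).fetchone()[0]

print(nulls, dupes)  # → 1 1  (one NULL price, one duplicated SKU)
```

A rule passes when its probe returns zero offending rows; anything else is classified as a failure and handed to the diagnosis stage.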
Example rule:
```yaml
rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_non_negative
      severity: critical
      owner: revenue-team
      tags: [revenue, validity]
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"
```

Generate rules with the LLM
Instead of writing rules by hand, let Aegis introspect your table schema and generate a draft rules file:
```shell
# Schema-aware structural rules (not_null, between, unique, accepted_values...)
aegis generate orders --db warehouse.duckdb --output orders_rules.yaml
```

Add a `--kb` document — any plain-text or Markdown file describing your business logic — and the LLM generates business validation rules alongside the structural ones:
```shell
aegis generate orders \
  --db warehouse.duckdb \
  --kb docs/orders_policy.md \
  --output orders_rules.yaml
```

What goes in a KB file? Anything your team knows about the data:
```markdown
# orders_policy.md
- status must be one of: placed, confirmed, shipped, delivered, cancelled
- amount must be greater than 0; refunds are handled in a separate table
- customer_id must reference a valid customer (no test accounts: id > 1000)
- order_date must not be in the future
- discount_pct must be between 0 and 0.5 (max 50% discount)
```

The LLM turns these into accepted_values, sql_expression, between, and foreign_key rules automatically. Generated rules are stamped `status: draft` — review them, promote to `active`, and commit.
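As an illustration — not literal tool output, and field names beyond those in the example rule above are assumptions — the first policy bullet might come back as a draft rule shaped like:

```yaml
rules:
  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_status_accepted_values
      severity: major
      status: draft
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: accepted_values
      column: status
      values: [placed, confirmed, shipped, delivered, cancelled]
```

Reviewing and promoting the draft keeps a human in the loop before any generated rule can gate a pipeline.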
All aegis generate options:
| Flag | Default | Description |
|---|---|---|
| `--db` | — | DuckDB file for schema introspection |
| `--kb` | — | Business-context file (text/markdown) |
| `--output` | | Output YAML file |
| | | Cap on the number of rules generated |
| | | Skip SQL verification of generated rules |
| | | Persist rules to the version store |
| | | LLM provider |
| | (default) | Override model |
Warehouse adapters
| Adapter | Install | Status |
|---|---|---|
| DuckDB | built-in | ✅ GA |
| BigQuery | | ✅ GA |
| Databricks | | ✅ GA |
| AWS Athena | | ✅ GA |
| Postgres / Redshift | | ✅ GA |
| Snowflake | | ✅ GA |
LLM providers
Supported providers: Anthropic Claude (built-in), OpenAI, Ollama (local), and AWS Bedrock.
Switch providers at the CLI:
```shell
aegis run rules.yaml --llm openai --llm-model gpt-4o
aegis run rules.yaml --llm ollama --llm-model llama3.2
aegis run rules.yaml --llm bedrock --llm-model amazon.nova-pro-v1:0
```

Integrations
| Integration | What it does |
|---|---|
| GitHub Action | CI/CD gate — fails the job when rules fail |
| | REST API server |
| | MCP server for Claude Desktop / tool use |
| | Convert a dbt manifest to Aegis rules |
CLI reference
| Command | Description |
|---|---|
| `aegis init` | Generate a starter `rules.yaml` |
| | Check YAML syntax + schema (no warehouse needed) |
| `aegis generate` | LLM-generate rules from a table schema |
| `aegis run` | Run validation, diagnose failures, produce a report |
| | Browse built-in rule templates |
| | Inspect the LLM decision trail for a past run |
| | Full-text search across audit logs |
| | Convert a dbt manifest to Aegis rules |
| | Start the MCP server for Claude Desktop |
aegis run flags:
| Flag | Default | Description |
|---|---|---|
| `--db` | | DuckDB file path |
| `--llm` | | LLM provider |
| `--llm-model` | (provider default) | Override model name |
| `--no-llm` | | Skip LLM diagnosis entirely |
| | (none) | Write the full JSON report to a file |
| | (none) | Slack webhook URL |
| | | When to notify |
Roadmap
| Phase | Version | Items | Status |
|---|---|---|---|
| Foundation | v0.1 | Core agent, DuckDB, CLI, audit trail | ✅ Done |
| Differentiate | v0.5 | BigQuery, Databricks, Athena, Airflow, Ollama, RCA, ShareGPT export, FTS5 search, dbt, MCP | ✅ Done |
| Quality | v0.7 | SQL verification pipeline, rule versioning | ✅ Done |
| Mature | v1.0 | Postgres, REST API, parallel subagents, VS Code extension, eval suite, banking/healthcare packs | 🚧 In progress |
Full issue tracker: github.com/aegis-dq/aegis-dq/issues
Contributing
Contributions are welcome. See CONTRIBUTING.md to get started.
Good first issues: label:good first issue
License

Apache 2.0