Offers community support through Discord channel
References GitHub for project hosting, stars, forks, and issue tracking
Supports interaction with Hugging Face datasets, enabling evaluation of data quality for datasets hosted on the platform
Provides evaluation capabilities for LaTeX formulas in datasets
Supports evaluation of Markdown formatting in datasets and content extraction quality assessment
Acknowledges MLflow as a related project in the model evaluation ecosystem
Integrates with OpenAI models like GPT-4o for LLM-based data quality assessment using various evaluation prompts
Integrates with pre-commit for code quality checks
Enables installation through PyPI package registry
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Dingo MCP Server evaluate my dataset for data quality issues".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
Introduction
Dingo is a comprehensive AI data, model, and application quality evaluation tool designed for ML practitioners, data engineers, and AI researchers. It helps you systematically assess and improve the quality of training data, fine-tuning datasets, and production AI systems.
Why Dingo?
🎯 Production-Grade Quality Checks - From pre-training datasets to RAG systems, ensure your AI gets high-quality data
🗄️ Multi-Source Data Integration - Seamlessly connect to local files, SQL databases (PostgreSQL/MySQL/SQLite), HuggingFace datasets, and S3 storage
🔍 Multi-Field Evaluation - Apply different quality rules to different fields in parallel (e.g., ISBN validation for isbn, text quality for title)
🤖 RAG System Assessment - Comprehensive evaluation of retrieval and generation quality with 5 academic-backed metrics
🧠 LLM & Rule & Agent Hybrid - Combine fast heuristic rules (30+ built-in) with LLM-based deep assessment
🚀 Flexible Execution - Run locally for rapid iteration or scale with Spark for billion-scale datasets
📊 Rich Reporting - Detailed quality reports with GUI visualization and field-level insights
Architecture Diagram

Quick Start
Installation
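The original install snippet is not included in this extract. A minimal sketch, assuming the package is published on PyPI under the name dingo-python (verify the name against the project docs):

```python
# Install from PyPI first (package name assumed; check the project README):
#   pip install dingo-python
# Then confirm the package imports cleanly:
import dingo
```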
Example Use Cases of Dingo
1. Evaluate LLM chat data
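The original snippet is not shown here; below is a hedged sketch of what an SDK call for chat/SFT-style data might look like. The import paths (dingo.config.InputArgs, dingo.exec.Executor), the config keys, and the "sft" eval group name are assumptions based on typical Dingo examples, so check them against your installed version.

```python
from dingo.config import InputArgs   # assumed import path
from dingo.exec import Executor      # assumed import path

# Evaluate chat/SFT records with the built-in quality checks (keys are assumed)
input_data = {
    "input_path": "chat_data.jsonl",           # local file with prompt/response pairs
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "eval_group": "sft",                   # 3H-style checks for fine-tuning data
        "result_save": {"bad": True},          # keep bad cases for the GUI report
    },
}

executor = Executor.exec_map["local"](InputArgs(**input_data))
print(executor.execute())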
2. Evaluate Dataset
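Likewise, a hedged sketch for evaluating a hosted dataset; the "hugging_face" source value and the rule-group name are assumptions, not confirmed API values.

```python
from dingo.config import InputArgs   # assumed import path
from dingo.exec import Executor      # assumed import path

input_data = {
    "input_path": "tatsu-lab/alpaca",          # any dataset id on the Hugging Face hub
    "dataset": {"source": "hugging_face", "format": "plaintext"},
    "executor": {
        "eval_group": "default",               # rule-based heuristic checks (group name assumed)
        "result_save": {"bad": True},
    },
}

executor = Executor.exec_map["local"](InputArgs(**input_data))
print(executor.execute())
```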
Command Line Interface
Evaluate with Rule Sets
Evaluate with LLM (e.g., GPT-4o)
GUI Visualization
After evaluation (with result_save.bad=True), a frontend page will be automatically generated. To manually start the frontend:
Where output_directory contains the evaluation results with a summary.json file.

Online Demo
Try Dingo on our online demo on Hugging Face 🤗.
Local Demo
Try Dingo locally:

Google Colab Demo
Experience Dingo interactively with the Google Colab notebook:
MCP Server
Dingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:
Video Demonstration
To help you get started quickly with Dingo MCP, we've created a video walkthrough:
https://github.com/user-attachments/assets/aca26f4c-3f2e-445e-9ef9-9331c4d7a37b
This video demonstrates step-by-step how to use Dingo MCP server with Cursor.
🎓 Key Concepts for Practitioners
What Makes Dingo Production-Ready?
1. Multi-Field Evaluation Pipeline
Apply different quality checks to different fields in a single pass:
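The original configuration snippet is not shown; here is a sketch of the idea with the field-to-rule mapping expressed as a plain dict. All names below are hypothetical stand-ins, not Dingo's actual schema.

```python
# Hypothetical per-field rule mapping: each column gets its own rule list,
# and all fields are checked in a single pass over the data.
field_rules = {
    "isbn":  [lambda v: len(v.replace("-", "")) in (10, 13)],   # stand-in ISBN format check
    "title": [lambda v: 0 < len(v) < 200],                      # stand-in text-quality check
}

rows = [{"isbn": "978-3-16-148410-0", "title": "Example Book"}]

report = {field: [] for field in field_rules}        # independent results per field
for row in rows:
    for field, rules in field_rules.items():
        report[field].append(all(rule(row[field]) for rule in rules))

print(report)   # e.g. {'isbn': [True], 'title': [True]}
```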
Why It Matters: Evaluate structured data (like database tables) without writing separate scripts for each field.
2. Stream Processing for Large Datasets
SQL datasources use SQLAlchemy's server-side cursors:
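The sketch below illustrates the server-side cursor technique itself using standard SQLAlchemy calls; the connection string, table, and per-row check are placeholders, and this is not Dingo's internal datasource code.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")   # placeholder DSN

with engine.connect() as conn:
    # stream_results=True requests a server-side cursor, so rows are fetched
    # in batches instead of materializing the whole result set in memory.
    result = conn.execution_options(stream_results=True).execute(
        text("SELECT id, content FROM documents")
    )
    for batch in result.partitions(10_000):    # iterate in fixed-size chunks
        for row in batch:
            ok = len(row.content or "") > 0    # stand-in for a real quality rule
```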
Why It Matters: Process production databases without exporting to intermediate files.
3. Field Isolation in Memory
RAG evaluations prevent context bleeding across different field combinations:
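A minimal sketch of the isolation idea (all names hypothetical): each field combination gets its own result bucket, so one combination's scores never mix with another's.

```python
records = [{"question": "Q1", "context": "C1", "alt_context": "C2", "answer": "A1"}]

combinations = [
    ("question", "context", "answer"),
    ("question", "alt_context", "answer"),
]

scores = {}
for combo in combinations:
    scores[combo] = []                                   # fresh, isolated bucket per combination
    for record in records:
        payload = {k: record[k] for k in combo}          # only this combination's fields
        scores[combo].append(len(payload["answer"]))     # stand-in for a real RAG metric
```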
Why It Matters: Accurate metric calculations when evaluating multiple field combinations.
4. Hybrid Rule-LLM Strategy
Combine fast rules (100% coverage) with sampled LLM checks (10% coverage):
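A sketch of the sampling strategy; rule_min_length and llm_quality_check are placeholders for illustration, not Dingo APIs.

```python
import random

records = [{"content": f"sample text number {i}"} for i in range(1_000)]

def rule_min_length(rec):            # fast heuristic rule: runs on every record
    return len(rec["content"]) >= 10

def llm_quality_check(rec):          # placeholder for an LLM call (e.g. a GPT-4o quality prompt)
    return True

rule_results = [rule_min_length(r) for r in records]        # 100% coverage, near-zero cost
sampled = random.sample(records, k=len(records) // 10)      # 10% sample for deep checks
llm_results = [llm_quality_check(r) for r in sampled]       # semantic depth at bounded cost
```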
Why It Matters: Balance cost and coverage for production-scale evaluation.
5. Extensibility Through Registration
Clean plugin architecture for custom rules, prompts, and models:
Why It Matters: Adapt to domain-specific requirements without forking the codebase.
📚 Data Quality Metrics
Dingo provides 70+ evaluation metrics across multiple dimensions, combining rule-based speed with LLM-based depth.
Metric Categories
Category | Examples | Use Case |
Pretrain Text Quality | Completeness, Effectiveness, Similarity, Security | LLM pre-training data filtering |
SFT Data Quality | Honest, Helpful, Harmless (3H) | Instruction fine-tuning data |
RAG Evaluation | Faithfulness, Context Precision, Answer Relevancy | RAG system assessment |
Hallucination Detection | HHEM-2.1-Open, Factuality Check | Production AI reliability |
Classification | Topic categorization, Content labeling | Data organization |
Multimodal | Image-text relevance, VLM quality | Vision-language data |
Security | PII detection, Perspective API toxicity | Privacy and safety |
📊 View Complete Metrics Documentation →
📖 RAG Evaluation Guide → | Chinese version
🔍 Hallucination Detection Guide → | Chinese version
✅ Factuality Assessment Guide → | Chinese version
Most metrics are backed by academic research to ensure scientific rigor.
Quick Metric Usage
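The original usage snippet is not included here; a hedged sketch reusing the SDK shape assumed above, with the metric group selected via eval_group (the "rag" group name is an assumption).

```python
from dingo.config import InputArgs   # assumed import path
from dingo.exec import Executor      # assumed import path

input_args = InputArgs(**{
    "input_path": "rag_records.jsonl",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {"eval_group": "rag"},   # swap for "sft", "pretrain", etc. (names assumed)
})
print(Executor.exec_map["local"](input_args).execute())
```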
Customization: All prompts are defined in the dingo/model/llm/ directory (organized by category: text_quality/, rag/, hhh/, etc.). Extend or modify them for domain-specific requirements.
🌟 Feature Highlights
📊 Multi-Source Data Integration
Diverse Data Sources - Connect to where your data lives
✅ Local Files: JSONL, CSV, TXT, Parquet
✅ SQL Databases: PostgreSQL, MySQL, SQLite, Oracle, SQL Server (with stream processing)
✅ Cloud Storage: S3 and S3-compatible storage
✅ ML Platforms: Direct HuggingFace datasets integration
Enterprise-Ready SQL Support - Production database integration
✅ Memory-efficient streaming for billion-scale datasets
✅ Connection pooling and automatic resource cleanup
✅ Complex SQL queries (JOIN, WHERE, aggregations)
✅ Multiple dialect support with SQLAlchemy
Multi-Field Quality Checks - Different rules for different fields
✅ Parallel evaluation pipelines (e.g., ISBN validation + text quality simultaneously)
✅ Field aliasing and nested field extraction (user.profile.name)
✅ Independent result reports per field
✅ ETL pipeline architecture for flexible data transformation
🤖 RAG System Evaluation
5 Academic-Backed Metrics - Based on RAGAS, DeepEval, TruLens research
✅ Faithfulness: Answer-context consistency (hallucination detection)
✅ Answer Relevancy: Answer-query alignment
✅ Context Precision: Retrieval precision
✅ Context Recall: Retrieval recall
✅ Context Relevancy: Context-query relevance
Comprehensive Reporting - Auto-aggregated statistics
✅ Average, min, max, standard deviation for each metric
✅ Field-grouped results
✅ Batch and single evaluation modes
🧠 Hybrid Evaluation System
Rule-Based - Fast, deterministic, cost-effective
✅ 30+ built-in rules (text quality, format, PII detection)
✅ Regex, heuristics, statistical checks
✅ Custom rule registration
LLM-Based - Deep semantic understanding
✅ OpenAI (GPT-4o, GPT-3.5), DeepSeek, Kimi
✅ Local models (Llama3, Qwen)
✅ Vision-Language Models (InternVL, Gemini)
✅ Custom prompt registration
Agent-Based - Multi-step reasoning with tools
✅ Web search integration (Tavily)
✅ Adaptive context gathering
✅ Multi-source fact verification
✅ Custom agent & tool registration
Extensible Architecture
✅ Plugin-based rule/prompt/model registration
✅ Clean separation of concerns (agents, tools, orchestration)
✅ Domain-specific customization
🚀 Flexible Execution & Integration
Multiple Interfaces
✅ CLI for quick checks
✅ Python SDK for integration
✅ MCP (Model Context Protocol) server for IDEs (Cursor, etc.)
Scalable Execution
✅ Local executor for rapid iteration
✅ Spark executor for distributed processing
✅ Configurable concurrency and batching
Data Sources
✅ Local Files: JSONL, CSV, TXT, Parquet formats
✅ Hugging Face: Direct integration with HF datasets hub
✅ S3 Storage: AWS S3 and S3-compatible storage
✅ SQL Databases: PostgreSQL, MySQL, SQLite, Oracle, SQL Server (stream processing for large-scale data)
Modalities
✅ Text (chat, documents, code)
✅ Images (with VLM support)
✅ Multimodal (text + image consistency)
📈 Rich Reporting & Visualization
Multi-Level Reports
✅ Summary JSON with overall scores
✅ Field-level breakdown
✅ Per-rule violation details
✅ Type and name distribution
GUI Visualization
✅ Built-in web interface
✅ Interactive data exploration
✅ Anomaly tracking
Metric Aggregation
✅ Automatic statistics (avg, min, max, std_dev)
✅ Field-grouped metrics
✅ Overall quality score
📖 User Guide
🔧 Extensibility
Dingo uses a clean plugin architecture for domain-specific customization:
Custom Rule Registration
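The registration snippet is not shown in this extract; a sketch of what a registered rule might look like. The Model.rule_register decorator, the BaseRule base class, the ModelRes result type, and the eval signature are assumptions drawn from typical Dingo extension examples, so verify them against your installed version.

```python
from dingo.model import Model                  # assumed import path
from dingo.model.modelres import ModelRes      # assumed result type
from dingo.model.rule.base import BaseRule     # assumed base class

@Model.rule_register("QUALITY_BAD_CUSTOM", ["default"])   # metric type + rule groups (assumed)
class RuleNoPlaceholderText(BaseRule):
    """Flag records that still contain TODO/FIXME placeholder text."""

    @classmethod
    def eval(cls, input_data) -> ModelRes:
        res = ModelRes()
        if "TODO" in input_data.content or "FIXME" in input_data.content:
            res.error_status = True
            res.type = "QUALITY_BAD_CUSTOM"
            res.reason = ["placeholder text found"]
        return res
```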
Custom LLM/Prompt Registration
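Similarly, a hedged sketch for registering a custom prompt; the Model.prompt_register decorator, the BasePrompt class, and its content attribute are assumptions rather than confirmed API.

```python
from dingo.model import Model                      # assumed import path
from dingo.model.prompt.base import BasePrompt     # assumed base class

@Model.prompt_register("QUALITY_CUSTOM_TONE", [])  # prompt name + groups (assumed)
class PromptToneCheck(BasePrompt):
    content = """
    You are a data quality reviewer. Decide whether the following text keeps a
    neutral, professional tone. Answer with a JSON object {"score": 0 or 1, "reason": "..."}.
    Text: {content}
    """
```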
Examples:
Agent-Based Evaluation with Tools
Dingo supports agent-based evaluators that can use external tools for multi-step reasoning and adaptive context gathering:
Built-in Agent:
AgentHallucination: Enhanced hallucination detection with web search fallback
Configuration Example:
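The original configuration block is not shown; a hedged sketch of how an agent-based evaluator with web-search fallback might be configured. The key names here (including the prompt selection key and the Tavily API key field) are assumptions.

```python
from dingo.config import InputArgs   # assumed import path
from dingo.exec import Executor      # assumed import path

input_data = {
    "input_path": "claims.jsonl",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {"prompt_list": ["AgentHallucination"]},   # built-in agent (selection key assumed)
    "llm_config": {
        "key": "YOUR_OPENAI_KEY",             # LLM backing the agent's reasoning steps
        "model": "gpt-4o",
        "tavily_api_key": "YOUR_TAVILY_KEY",  # enables the web-search fallback (key name assumed)
    },
}

print(Executor.exec_map["local"](InputArgs(**input_data)).execute())
```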
Learn More:
Agent Development Guide - Comprehensive guide for creating custom agents and tools
AgentHallucination Example - Production agent example
AgentFactCheck Example - LangChain agent example
⚙️ Execution Modes
Local Executor (Development & Small-Scale)
Best For: Rapid iteration, debugging, datasets < 100K rows
Spark Executor (Production & Large-Scale)
Best For: Production pipelines, distributed processing, datasets > 1M rows
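A hedged sketch of switching between the two executors; the exec_map keys and the idea that the same InputArgs config works for both backends are assumptions.

```python
from dingo.config import InputArgs   # assumed import path
from dingo.exec import Executor      # assumed import path

args = InputArgs(**{
    "input_path": "s3://bucket/pretrain/part-*.jsonl",    # placeholder path
    "dataset": {"source": "s3", "format": "jsonl"},
    "executor": {"eval_group": "pretrain"},                # group name assumed
})

# Same configuration, different executor backend (exec_map keys assumed):
local_result = Executor.exec_map["local"](args).execute()   # laptop-scale iteration
spark_result = Executor.exec_map["spark"](args).execute()   # distributed, billion-scale runs
```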
Evaluation Reports
After evaluation, Dingo generates:
Summary Report (summary.json): Overall metrics and scores
Detailed Reports: Specific issues for each rule violation
Report Description:
score: num_good / total
type_ratio: the count of a given type / total, e.g., QUALITY_BAD_COMPLETENESS / total
Example summary:
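The original example is not included in this extract; an illustrative sketch of the shape described above (field names beyond score and type_ratio are assumptions, shown as a Python dict for readability).

```python
# Illustrative only; the actual summary.json fields may differ.
summary = {
    "score": 0.95,                          # num_good / total
    "num_good": 950,
    "num_bad": 50,
    "total": 1000,
    "type_ratio": {
        "QUALITY_BAD_COMPLETENESS": 0.03,   # count of this type / total
        "QUALITY_BAD_RELEVANCE": 0.02,
    },
    "name_ratio": {
        "QUALITY_BAD_COMPLETENESS-RuleExample": 0.03,   # per-rule breakdown (names illustrative)
    },
}
```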
🚀 Roadmap & Contributions
Future Plans
Agent-as-a-Judge - Multi-agent debate patterns for bias reduction and complex reasoning
SaaS Platform - Hosted evaluation service with API access and dashboard
Audio & Video Modalities - Extend beyond text/image
Diversity Metrics - Statistical diversity assessment
Real-time Monitoring - Continuous quality checks in production pipelines
Limitations
The current built-in detection rules and model methods primarily focus on common data quality issues. For special evaluation needs, we recommend customizing detection rules.
Acknowledgments
Contribution
We appreciate all the contributors for their efforts to improve and enhance Dingo. Please refer to the Contribution Guide for guidance on contributing to the project.
License
This project is licensed under the Apache 2.0 open source license.
This project uses fasttext for some functionality, including language detection. fasttext is licensed under the MIT License, which is compatible with our Apache 2.0 license and provides flexibility for various usage scenarios.
Citation
If you find this project useful, please consider citing our tool: