Name: MissingBias_Detector
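Description: A specialized diagnostic engine for detecting non-random missingness in datasets. The tool tests whether the "missingness" of a primary variable is statistically dependent on the values of a secondary covariate. A detected dependency means the data are not Missing Completely At Random (MCAR): they are at least Missing At Random (MAR), and possibly Missing Not At Random (MNAR). MNAR itself, where missingness depends on the unobserved values, cannot be confirmed from observed data alone. Use this to determine whether missing data can be safely deleted or whether it requires advanced imputation to avoid systematic bias in downstream models.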
Why This Tool is Mandatory for Data Cleaning
Prevents Selection Bias: Detecting biased missingness ensures that the agent does not inadvertently delete a specific sub-population (e.g., all high-temperature readings from a sensor that fails only when it is hot).
Automated Strategy Selection: Provides the statistical evidence needed to choose between Deletion (if no bias is found) and Imputation/Source Investigation (if bias is detected).
Math Error Prevention: Offloads complex dependency testing (like Little’s MCAR test or logistic modeling of missingness) to a dedicated engine, so the LLM does not have to perform the calculations itself and risk arithmetic errors.
Operational Logic
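The tool analyzes a dictionary containing two aligned arrays (equal length, with each position referring to the same record in both arrays):
Target Array (Index 0): The variable containing missing values (null, NaN, empty strings, or the string "NA" as in the example input).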
Predictor Array (Index 1): The potential biasing variable used to see if its values influence the probability of the Target Array being missing.
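The logic above can be illustrated with a minimal sketch. The tool's actual statistical test is not documented here, so this example assumes a simple permutation test on the difference in predictor means between rows where the target is missing and rows where it is observed. The function names, the missing-value markers, and the choice to skip rows whose predictor is itself missing are all assumptions made for illustration.

```python
import math
import random


def is_missing(value):
    """Assumed markers: None, float NaN, empty string, or the string "NA"."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() in ("", "NA")
    return isinstance(value, float) and math.isnan(value)


def missingness_depends_on(target, predictor, n_perm=10_000, seed=0):
    """Permutation test (illustrative, not the tool's documented method).

    Returns a p-value for the hypothesis that the predictor has the same
    mean whether or not the target is missing.
    """
    if len(target) != len(predictor):
        raise ValueError("arrays must be aligned (same length)")

    # Keep only rows where the predictor itself is observed (an assumption).
    rows = [(is_missing(t), float(p))
            for t, p in zip(target, predictor) if not is_missing(p)]
    missing_group = [p for m, p in rows if m]
    observed_group = [p for m, p in rows if not m]
    if not missing_group or not observed_group:
        raise ValueError("need both missing and observed target rows")

    n_missing = len(missing_group)

    def gap(values):
        # Absolute difference in means between the first n_missing values
        # (treated as the "target missing" group) and the remainder.
        first, rest = values[:n_missing], values[n_missing:]
        return abs(sum(first) / len(first) - sum(rest) / len(rest))

    pooled = missing_group + observed_group
    observed_gap = gap(pooled)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if gap(pooled) >= observed_gap - 1e-9:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Applied to the arrays in the Example Input, the four rows with a missing target carry the four largest observed predictor values, so this sketch returns a small p-value, which points toward biased missingness.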
Recommended Workflows
Exploratory Data Analysis (EDA): Run this on every ordered pair of columns (each column in turn as the Target Array, each other column as the Predictor Array) to identify hidden dependencies in a new dataset.
Hardware/Sensor Audits: Identify "unreliable sources" (e.g., which satellite sensor or survey researcher is producing the most incomplete data).
Pre-Training Validation: Ensure that "dropping rows" won't result in a biased training set that compromises model generalization.
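For the EDA workflow, the ordered column pairs can be generated mechanically. The sketch below is one possible way to build the input dictionaries; the helper name build_payloads and the column-dictionary layout are hypothetical, and the two key names are copied verbatim from the Example Input. How each payload is submitted to the tool is left out, since that interface is not described here.

```python
from itertools import permutations


def build_payloads(columns):
    """Yield (target_name, predictor_name, payload) for every ordered pair.

    `columns` is assumed to map column names to equal-length lists.
    """
    for target_name, predictor_name in permutations(columns, 2):
        payload = {
            # Key names copied verbatim from the Example Input.
            "array_with missingness": columns[target_name],
            "array_causing_bias": columns[predictor_name],
        }
        yield target_name, predictor_name, payload


# Hypothetical three-column dataset: 3 * 2 = 6 ordered pairs.
columns = {
    "temperature": [21.5, "NA", 30.1],
    "humidity": [0.41, 0.39, "NA"],
    "pressure": [1012.0, 1009.5, 1011.2],
}
payloads = list(build_payloads(columns))
```

Ordered pairs matter because the two roles are not symmetric: testing whether temperature is missing given humidity is a different question from testing whether humidity is missing given temperature.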
Interpretation of Results
Bias Detected: You must not simply delete the missing rows. You must investigate the source of the bias or use statistical imputation.
No Bias Detected: Missingness is likely stochastic; deleting the affected rows carries a lower risk of biasing the analysis.
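The two outcomes above map onto a simple decision rule. The sketch below assumes that a 1 in "missing_is_biased" means bias was detected and a 0 means it was not; only the value 1 appears in the Example Output, so the meaning of 0 is an assumption. The function name and the returned labels are hypothetical.

```python
def choose_strategy(result):
    """Map the tool's output to a cleaning strategy (illustrative only).

    Assumes 1 = bias detected and 0 = no bias detected.
    """
    flags = result["missing_is_biased"]
    if any(int(flag) == 1 for flag in flags):
        # Biased missingness: do not drop rows.
        return "impute_or_investigate_source"
    # Missingness looks stochastic: deletion is the lower-risk option.
    return "delete_rows"


strategy = choose_strategy({"missing_is_biased": [1]})
```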
Example Input:
{
"array_with missingness":["NA",166.445,470.604,25.0739,49.1652,324.7797,190.9287,"NA",451.39,405.4469,"NA",347.1129,253.0294,141.4462,"NA",241.4338,160.2388,123.1855,51.5936,151.8691,309.7825],
"array_causing_bias":[418.3812,"NA",14.552,329.5427,"NA",119.1472,"NA",462.8084,320.5384,148.8701,412.0277,125.1991,"NA",255.8993,441.0706,"NA",297.2804,"NA","NA",296.7565,111.2001]
}
Example Output:
{"missing_is_biased":[1]}