AutoFE-ShiftBench is a reproducible, large-scale benchmark for evaluating the trade-off between predictive performance and robustness under realistic feature corruption. It compares models trained on raw features against pipelines enhanced with Automated Feature Engineering (AutoFE) across 20 diverse datasets and 10 models.
Most AutoML and feature engineering evaluations optimize only for clean-test accuracy. This project adds a strict robustness lens by injecting synthetic perturbations (Gaussian noise, missing values, label noise) into the data. Crucially, the benchmark implements a Nested Cross-Validation approach in which AutoFE is fit strictly inside each fold, so that no held-out data leaks into the engineered features.
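To illustrate what fitting inside the fold means, here is a minimal standard-library sketch. It is not the project's code: `stratified_folds`, `MeanCenterer`, and `iter_fold_data` are hypothetical names, and the mean-centering step is only a toy stand-in for a real AutoFE transformer. The point is that the step learns its parameters from the training rows of each fold and is then applied, unchanged, to the test rows.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(y, n_splits=5, seed=0):
    """Assign row indices to folds so each fold keeps similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


class MeanCenterer:
    """Toy stand-in for an AutoFE step: learns column means in fit()."""

    def fit(self, X):
        self.means_ = [statistics.fmean(column) for column in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


def iter_fold_data(X, y, n_splits=5, seed=0):
    """Yield (X_train, X_test) per fold, fitting the step on training rows only."""
    folds = stratified_folds(y, n_splits, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        step = MeanCenterer().fit([X[i] for i in train_idx])  # never sees test rows
        yield (
            step.transform([X[i] for i in train_idx]),
            step.transform([X[i] for i in test_idx]),
        )
```

Fitting the step once on the full dataset before splitting would let test-row statistics influence the training features, which is the leakage this design avoids.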
| Configuration | Details |
|---|---|
| Datasets | 20 OpenML tabular datasets (capped at 100,000 rows) |
| Task | Classification (Binary & Multiclass) |
| Validation Strategy | Stratified 5-Fold Cross-Validation |
| Replications | 5 Random Seeds |
| Models (10) | Logistic Regression, Random Forest, Extra Trees, XGBoost, LightGBM, CatBoost, SVM, KNN, Gaussian Naive Bayes, MLP Neural Network |
| Perturbations (10) | clean, gaussian_0.01, gaussian_0.05, gaussian_0.10, missing_0.05, missing_0.10, missing_0.20, label_0.05, label_0.10, label_0.20 |
| Metrics | ROC-AUC, PR-AUC, F1 (Macro), MCC, Balanced Accuracy, Accuracy, Log Loss, Brier Score, Precision, Recall |
| Statistical Analysis | Cliff's Delta (effect size); Wilcoxon Signed-Rank, Friedman, and Nemenyi tests (significance) |
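The perturbation labels in the table above follow a `kind_level` pattern. The sketch below shows one way such labels could be parsed and applied using only the standard library. It is an illustration under assumptions, not the contents of `src/shift_generator.py`: the function names are hypothetical, and so is the meaning given to each level (noise standard deviation relative to each column's standard deviation, per-cell missingness probability, per-row label flip probability).

```python
import math
import random
import statistics


def parse_condition(name):
    """Split a label such as 'gaussian_0.05' into ('gaussian', 0.05)."""
    if name == "clean":
        return "clean", 0.0
    kind, level = name.rsplit("_", 1)
    return kind, float(level)


def perturb(X, y, condition, seed=0):
    """Return perturbed copies of X (rows of floats) and y (labels)."""
    kind, level = parse_condition(condition)
    rng = random.Random(seed)
    X = [list(row) for row in X]
    y = list(y)
    if kind == "gaussian":
        # Assumption: noise std is `level` times each column's std.
        stds = [statistics.pstdev(column) for column in zip(*X)]
        X = [[v + rng.gauss(0.0, level * s) for v, s in zip(row, stds)] for row in X]
    elif kind == "missing":
        # Assumption: each cell is blanked independently with probability `level`.
        X = [[math.nan if rng.random() < level else v for v in row] for row in X]
    elif kind == "label":
        # Assumption: each label flips to a different class with probability `level`.
        # Requires at least two classes in y.
        classes = sorted(set(y))
        y = [
            rng.choice([c for c in classes if c != label]) if rng.random() < level else label
            for label in y
        ]
    elif kind != "clean":
        raise ValueError(f"unknown condition: {condition}")
    return X, y
```

For example, `perturb(X, y, "missing_0.10")` returns a copy of `X` with roughly 10% of cells set to NaN and leaves `y` untouched. On real arrays a vectorized NumPy version would be the more natural choice.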
Below are the 20 OpenML datasets included in this benchmark. The pipeline automatically caps datasets exceeding 100K rows (via random downsampling) to keep runtimes feasible on standard hardware.
| # | Dataset | Domain / Topic | Link |
|---|---|---|---|
| 1 | Adult | Income Prediction | OpenML Search |
| 2 | Bank Marketing | Marketing | OpenML Search |
| 3 | APS Failure | Industrial / Sensor | OpenML Search |
| 4 | Electricity | Energy | OpenML Search |
| 5 | Covertype | Forest Cover | OpenML Search |
| 6 | Dry Bean | Agriculture | OpenML Search |
| 7 | Crop Recommendation | Agriculture | OpenML Search |
| 8 | Breast Cancer Wisconsin | Medical | OpenML Search |
| 9 | Heart Disease | Medical | OpenML Search |
| 10 | Diabetes | Medical | OpenML Search |
| 11 | Haberman Survival | Medical | OpenML Search |
| 12 | Ionosphere | Physics / Radar | OpenML Search |
| 13 | Sonar | Physics / Sonar | OpenML Search |
| 14 | Statlog German Credit | Finance | OpenML Search |
| 15 | Credit Default | Finance | OpenML Search |
| 16 | Mushroom | Biology | OpenML Search |
| 17 | Magic Telescope | Astronomy | OpenML Search |
| 18 | Spambase | NLP / Email | OpenML Search |
| 19 | Wine Quality (Red) | Chemistry | OpenML Search |
| 20 | Rice (Cammeo/Osmancik) | Agriculture | OpenML Search |
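The 100K-row cap mentioned above amounts to a seeded random subsample. A minimal sketch, assuming uniform sampling without replacement (the source does not say whether the real loader stratifies by class); `cap_rows` and `MAX_ROWS` are hypothetical names:

```python
import random

MAX_ROWS = 100_000


def cap_rows(rows, max_rows=MAX_ROWS, seed=0):
    """Return rows unchanged if small enough, else a uniform random subset.

    The kept rows stay in their original order, and the same seed always
    selects the same subset.
    """
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]
```

With a pandas DataFrame, the equivalent one-liner would be `df.sample(n=max_rows, random_state=seed)`.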
We recommend using uv or pip to install dependencies in a virtual environment.
# Create environment
uv venv
# Activate environment (Windows)
.venv\Scripts\activate
# Install dependencies
uv pip install -r requirements.txt
The following command downloads all 20 datasets from OpenML, caps them at 100K rows, computes structural meta-features, and saves them in data/raw.
python src/data_loader.py
The benchmark uses multiprocessing to parallelize tasks while strictly avoiding thread thrashing (by forcing model algorithms to execute synchronously within the worker process). The --n-workers flag dictates how many datasets/folds to process simultaneously.
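The general pattern of process-level parallelism with single-threaded workers can be sketched as follows. This is an assumption-laden illustration, not the code in `src/pipeline_runner.py`: all function names are hypothetical, the task granularity (dataset, seed, fold) is a guess, and capping native thread pools through environment variables is one common technique that may differ from what the project does.

```python
import itertools
import os
from concurrent.futures import ProcessPoolExecutor

# Environment variables honoured by common native thread pools.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_threads():
    """Cap native thread pools at one thread for the current process.

    These variables are read when the numerical libraries load, so this
    should run before those libraries are first imported.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = "1"


def build_tasks(datasets, seeds, n_folds):
    """Enumerate one task per (dataset, seed, fold) combination."""
    return list(itertools.product(datasets, seeds, range(n_folds)))


def run_task(task):
    """Placeholder unit of work; a real worker would fit models with n_jobs=1."""
    dataset, seed, fold = task
    return {"dataset": dataset, "seed": seed, "fold": fold, "pid": os.getpid()}


def run_all(tasks, n_workers):
    """Fan tasks out over n_workers processes, each limited to one thread."""
    with ProcessPoolExecutor(max_workers=n_workers, initializer=limit_threads) as pool:
        return list(pool.map(run_task, tasks))
```

Calling `run_all(build_tasks(["adult", "sonar"], [0, 1], 5), n_workers=4)` from a script's `if __name__ == "__main__":` block would process the 20 tasks four at a time. Keeping each worker single-threaded means the total number of busy threads stays close to `n_workers` rather than multiplying with each model's own thread pool.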
# Run the full benchmark suite
# (Adjust n-workers based on your CPU. It is recommended to leave 1 core free for IO)
python -m src.pipeline_runner --n-workers 4
Results are streamed sequentially to a JSON Lines file (reports/tables/results_stream.jsonl) to ensure nothing is lost during long-running benchmarks.
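The append-only JSON Lines pattern is what makes partial runs recoverable: each finished fit becomes one self-contained line. A minimal sketch of writing and reading such a stream is shown below; the helper names and the record fields used in the usage note are hypothetical, since the actual schema is not documented here.

```python
import json
from pathlib import Path


def append_result(path, record):
    """Append one result as a single JSON line, closing the file after each write."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_results(path):
    """Load every parseable record, skipping blank or malformed lines.

    A run that was interrupted mid-write can leave a truncated last line;
    skipping it keeps all earlier results usable.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return records
```

For example, `append_result("reports/tables/results_stream.jsonl", {"dataset": "adult", "model": "xgboost", "roc_auc": 0.91})` adds one row, and `read_results(...)` returns a list of dictionaries that can be passed directly to `pandas.DataFrame`.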
If you want to quickly test the pipeline end-to-end on a single dataset, fold, and condition, run:
python -m src.pipeline_runner --max-datasets 1 --max-seeds 1 --max-folds 1 --max-conditions 1 --n-workers 1
AutoFE-ShiftBench/
├── config/
│ └── dataset_list.yaml # Defines the 20 benchmark datasets to download and run
├── data/ # (Git-ignored) Where artifacts are cached
│ └── raw/ # Downloaded CSV datasets and JSON meta-features
├── reports/ # (Git-ignored) Where outputs are saved
│ ├── figures/ # Generated publication plots (PDF, PNG)
│ ├── tables/
│ │ ├── results_stream.jsonl # Streaming pipeline results (1 row per model fit)
│ │ └── statistical_results.csv # Final effect sizes and significance tests
│ └── terminal.log # Running log of the pipeline executions
├── src/
│ ├── data_loader.py # Downloads OpenML datasets and computes meta-features
│ ├── pipeline_runner.py # Multiprocessing orchestrator and Nested CV loop
│ ├── model.py # Instantiates the 10 core ML algorithms
│ ├── shift_generator.py # Injects realistic perturbations (noise, missing, label)
│ ├── preprocessing.py # Handles standard scaling and encoding
│ ├── feature_engineering.py # Wrapper for automated feature generation (Featuretools)
│ ├── feature_selection.py # Filters the generated AutoFE explosion
│ ├── evaluation.py # Calculates the 10 core classification metrics
│ ├── statistics.py # Computes Cliff's Delta, Wilcoxon, Friedman, Nemenyi
│ ├── shap_explainer.py # Computes fast clustered SHAP importances
│ ├── plotting.py # Generates the 30-figure Seaborn visual suite
│ └── verify_claims.py # (Deprecated) Old claim verifier, superseded by statistics.py
├── task.md # Development tracking checklist
├── walkthrough.md # Detailed implementation notes
├── requirements.txt # Project dependencies
└── README.md # This documentation
See LICENSE file for details.