Production-style machine learning project for predicting student final exam scores and pass/fail outcomes.
Student Performance Predictor is an end-to-end machine learning application that solves two related academic analytics problems:
- Regression: predict a student's final exam score from academic, behavioral, and demographic inputs.
- Classification: predict whether the student is likely to pass or fail.
The project is structured as a portfolio-ready ML system, not a single notebook. It includes data ingestion, cleaning, EDA, feature engineering, model comparison, hyperparameter tuning, model persistence, and an interactive Streamlit application.
- UCI Student Performance dataset integration with a deterministic offline fallback dataset.
- Data quality handling for missing values, duplicates, invalid values, categorical encoding, and scaling.
- Engineered academic features such as study efficiency, homework ratio, engagement score, sleep quality index, grade trend, and risk index.
- Regression models: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, and optional XGBoost.
- Classification models: Logistic Regression, Decision Tree, Random Forest, SVM, Gradient Boosting, and optional XGBoost.
- Hyperparameter tuning with RandomizedSearchCV.
- Saved joblib artifacts for regression, classification, preprocessing, encoder, and scaler.
- Streamlit interface with score prediction, pass/fail probability, confidence score, and feature contribution chart.
- Reproducible notebooks for EDA and model training.
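The data quality and feature engineering items above can be illustrated with a minimal sketch. It assumes that invalid values are corrected by clipping to a valid range; the ranges and the function name `clean_and_engineer` are hypothetical, and only the Study Efficiency formula (study hours times attendance) comes from this README.

```python
import pandas as pd

# Assumed valid ranges; the project's real limits may differ.
VALID_RANGES = {
    "study_hours": (0, 24),
    "attendance": (0, 100),
    "sleep_hours": (0, 24),
}


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, clip out-of-range values, add study_efficiency."""
    out = df.drop_duplicates().copy()
    for column, (low, high) in VALID_RANGES.items():
        if column in out.columns:
            # Clipping is one possible correction; dropping rows is another.
            out[column] = out[column].clip(lower=low, upper=high)
    # Study Efficiency = Study Hours x Attendance, as stated in this README.
    out["study_efficiency"] = out["study_hours"] * out["attendance"]
    return out.reset_index(drop=True)
```

The other engineered features (homework ratio, engagement score, and so on) are omitted because their formulas are not stated here.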
Add screenshots after running the Streamlit app:
- screenshots/app_home.png
- screenshots/prediction_result.png
- reports/figures/regression_feature_importance.png
- reports/figures/classification_feature_importance.png
- reports/figures/confusion_matrix.png
- reports/figures/roc_curve.png
Primary source: UCI Machine Learning Repository, Student Performance Datasets.
The training code attempts to download:
https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
The UCI data contains Portuguese school performance records. This project standardizes the source fields into a clean portfolio schema:
- gender
- age
- study_hours
- attendance
- sleep_hours
- previous_grade
- internet_access
- parent_education
- family_income
- extra_classes
- assignments_completed
- participation
- final_score
- pass_fail
If the dataset cannot be downloaded, src/utils.py creates a deterministic synthetic dataset so the project still trains and runs offline.
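The download-with-fallback behavior can be sketched roughly as follows. This is not the actual code in src/utils.py: the function names, the column subset, the value ranges, and the score formula are all illustrative assumptions. The only facts taken from the source are the download URL and the requirement that the fallback be deterministic.

```python
import io
import urllib.request
import zipfile

import numpy as np
import pandas as pd

UCI_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"


def make_fallback_dataset(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Build a synthetic dataset; a fixed seed makes every run identical."""
    rng = np.random.default_rng(seed)
    study_hours = rng.uniform(0, 10, n_rows).round(1)
    attendance = rng.uniform(50, 100, n_rows).round(1)
    previous_grade = rng.uniform(30, 100, n_rows).round(1)
    noise = rng.normal(0, 5, n_rows)
    # Hypothetical formula, used only to give the synthetic target some signal.
    raw_score = 0.6 * previous_grade + 2.0 * study_hours + 0.2 * attendance + noise
    return pd.DataFrame({
        "study_hours": study_hours,
        "attendance": attendance,
        "previous_grade": previous_grade,
        "final_score": np.clip(raw_score, 0, 100).round(1),
    })


def load_raw_data(url: str = UCI_URL, timeout: float = 10.0) -> pd.DataFrame:
    """Try the UCI download first; fall back to synthetic data on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            archive = zipfile.ZipFile(io.BytesIO(response.read()))
        # The UCI CSV files are semicolon-separated.
        with archive.open("student-mat.csv") as handle:
            return pd.read_csv(handle, sep=";")
    except (OSError, zipfile.BadZipFile, KeyError):
        # Network errors, a corrupt archive, or a missing member all land here.
        return make_fallback_dataset()
```

The sketch returns the raw UCI columns on success and skips the mapping into the standardized schema, which the real project performs as a separate step.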
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
student-performance-predictor/
├── data/
│ ├── raw/
│ └── processed/
├── notebooks/
│ ├── EDA.ipynb
│ └── Model_Training.ipynb
├── models/
│ ├── regression.pkl
│ ├── classifier.pkl
│ ├── preprocessor.pkl
│ ├── encoders_scaler.pkl
│ └── metrics.json
├── reports/
│ └── figures/
├── screenshots/
├── src/
│ ├── preprocessing.py
│ ├── feature_engineering.py
│ ├── train.py
│ ├── predict.py
│ └── utils.py
├── app.py
├── requirements.txt
├── pyproject.toml
├── README.md
├── LICENSE
└── .gitignore
- Load UCI data or create the offline fallback dataset.
- Standardize source columns into a consistent application schema.
- Remove duplicates and correct invalid numeric ranges.
- Create engineered features:
  - Study Efficiency = Study Hours x Attendance
  - Homework Ratio
  - Academic Engagement Score
  - Sleep Quality Index
  - Grade Trend
  - Risk Index
- Split data into train and test sets.
- Build sklearn pipelines with median imputation, categorical imputation, standard scaling, and one-hot encoding.
- Train multiple regression and classification models.
- Tune final Random Forest models with RandomizedSearchCV.
- Evaluate with regression and classification metrics.
- Save models, preprocessing artifacts, metrics, and diagnostic plots.
- Serve predictions through Streamlit.
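The split, pipeline, and tuning steps above could look roughly like this in scikit-learn. It is a sketch under assumptions: the column subsets, the search space, and the names `build_search` and `tune` are hypothetical, and the real training code in src/train.py may be organized differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed subsets of the schema, chosen only for illustration.
NUMERIC = ["study_hours", "attendance", "previous_grade"]
CATEGORICAL = ["gender", "internet_access"]


def build_search(random_state: int = 42) -> RandomizedSearchCV:
    """Preprocessing plus Random Forest, wrapped in a randomized search."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(random_state=random_state)),
    ])
    # Hypothetical search space; the "model__" prefix targets the final step.
    param_distributions = {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 5, 10],
        "model__min_samples_leaf": [1, 2, 4],
    }
    return RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=5,
        cv=3,
        scoring="neg_mean_absolute_error",
        random_state=random_state,
    )


def tune(df: pd.DataFrame, target: str = "final_score"):
    """Split, run the search on the training part, score on the held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[NUMERIC + CATEGORICAL], df[target], test_size=0.2, random_state=42
    )
    search = build_search().fit(X_train, y_train)
    # With this scoring, score() returns the negated MAE on the test split.
    return search, search.score(X_test, y_test)
```

Keeping imputation, scaling, and encoding inside the pipeline means each cross-validation fold fits them on its own training portion, which avoids leaking test statistics into preprocessing.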
The EDA notebook includes:
- Missing-value audit
- Duplicate-record check
- Summary statistics
- IQR outlier detection
- Histograms
- Box plots
- Final-score distribution
- Pass/fail distribution
- Correlation heatmap
- Pair plot
- Categorical distributions
- Feature importance preview
Each graph is followed by an interpretation explaining what the visualization contributes to model design.
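As one example of the checks listed above, IQR outlier detection can be written in a few lines of pandas. This is a generic sketch of the standard rule with the conventional 1.5 multiplier; the notebook's own implementation and the function name `iqr_outliers` are not taken from the project.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Usage: flag suspicious values in a small sample column.
scores = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(scores[iqr_outliers(scores)])
```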
Regression models:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor, when installed
Classification models:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Machine
- Gradient Boosting Classifier
- XGBoost Classifier, when installed
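The "when installed" entries above suggest a guarded import. The sketch below shows one way to build the regression candidate set and rank it by cross-validated MAE; the function names and the choice of MAE as the ranking metric are assumptions, not the project's confirmed selection logic.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def regression_candidates(random_state: int = 42) -> dict:
    """Return the baseline regressors, adding XGBoost only if it imports."""
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=random_state),
        "Random Forest": RandomForestRegressor(random_state=random_state),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }
    try:
        from xgboost import XGBRegressor
    except ImportError:
        pass  # XGBoost is optional; continue with the scikit-learn models.
    else:
        models["XGBoost"] = XGBRegressor(random_state=random_state)
    return models


def compare(models: dict, X, y, cv: int = 5) -> dict:
    """Rank models by mean cross-validated MAE, lowest (best) first."""
    scores = {}
    for name, model in models.items():
        neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
        scores[name] = float(-neg_mae.mean())
    return dict(sorted(scores.items(), key=lambda item: item[1]))
```

The same pattern applies to the classification candidates by swapping in the classifier classes and a classification scoring string.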
Regression metrics:
- MAE
- MSE
- RMSE
- R2 Score
Classification metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC Curve
- Confusion Matrix
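The metrics above map directly onto functions in `sklearn.metrics`. A minimal sketch follows; the helper names and the dictionary layout are illustrative and do not necessarily match what the project writes to metrics.json.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)


def regression_report(y_true, y_pred) -> dict:
    """MAE, MSE, RMSE, and R2 for a set of score predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": float(r2_score(y_true, y_pred)),
    }


def classification_summary(y_true, y_pred) -> dict:
    """Accuracy, precision, recall, F1, and the confusion matrix."""
    return {
        "Accuracy": float(accuracy_score(y_true, y_pred)),
        "Precision": float(precision_score(y_true, y_pred)),
        "Recall": float(recall_score(y_true, y_pred)),
        "F1": float(f1_score(y_true, y_pred)),
        # Rows are true classes, columns are predicted classes.
        "ConfusionMatrix": confusion_matrix(y_true, y_pred).tolist(),
    }
```

The ROC curve is left out of the sketch because it needs predicted probabilities rather than hard labels.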
python -m src.train
Generated artifacts:
- models/regression.pkl
- models/classifier.pkl
- models/preprocessor.pkl
- models/encoders_scaler.pkl
- models/metrics.json
- reports/figures/regression_model_comparison.csv
- reports/figures/classification_model_comparison.csv
- reports/figures/regression_feature_importance.png
- reports/figures/classification_feature_importance.png
- reports/figures/confusion_matrix.png
- reports/figures/roc_curve.png
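Saving and reloading these artifacts with joblib might look like the sketch below. The file names regression.pkl and metrics.json match the list above, but the helper functions and the stand-in model are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression


def save_artifacts(model, metrics: dict, model_dir: Path) -> None:
    """Persist a fitted model with joblib and its metrics as JSON."""
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_dir / "regression.pkl")
    (model_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))


def load_regression(model_dir: Path):
    """Reload the regression model saved by save_artifacts."""
    return joblib.load(model_dir / "regression.pkl")


def roundtrip_demo() -> float:
    """Fit a tiny stand-in model, save it, reload it, and predict once."""
    model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "models"
        save_artifacts(model, {"MAE": 0.0}, model_dir)
        restored = load_regression(model_dir)
    return float(restored.predict([[3.0]])[0])
```

Pickled models should be loaded with the same scikit-learn version that produced them, so pinning versions in requirements.txt helps keep saved artifacts usable.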
streamlit run app.py
The app sidebar accepts student details and returns:
- Predicted final score
- Pass/fail result
- Pass probability
- Confidence score
- Feature contribution chart
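How the pass probability and confidence score are computed is not spelled out here. One plausible sketch, assuming the classifier exposes `predict_proba`, that the pass class is labeled 1, and that confidence is the probability of the predicted class:

```python
from sklearn.linear_model import LogisticRegression


def predict_pass(classifier, features, pass_label=1) -> dict:
    """Return the pass/fail result, pass probability, and a confidence score."""
    probabilities = classifier.predict_proba([features])[0]
    # Look up the column for the pass class instead of assuming its position.
    pass_index = list(classifier.classes_).index(pass_label)
    pass_probability = float(probabilities[pass_index])
    return {
        "result": "Pass" if pass_probability >= 0.5 else "Fail",
        "pass_probability": pass_probability,
        # Assumed definition: probability assigned to the predicted class.
        "confidence": max(pass_probability, 1.0 - pass_probability),
    }


# Usage with a stand-in classifier trained on a single made-up feature.
demo_classifier = LogisticRegression().fit(
    [[20], [30], [40], [60], [70], [80]], [0, 0, 0, 1, 1, 1]
)
print(predict_pass(demo_classifier, [90]))
```

The 0.5 cutoff is the default decision threshold; a tuned threshold would replace it in the comparison.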
The latest training run used the UCI Student Performance dataset and selected the best evaluated candidate from the baseline and tuned model set.
Selected regression model: Random Forest Regressor
| Metric | Value |
|---|---|
| MAE | 5.4359 |
| MSE | 61.7454 |
| RMSE | 7.8578 |
| R2 | 0.8801 |
Selected classification model: Gradient Boosting Classifier
| Metric | Value |
|---|---|
| Accuracy | 0.9494 |
| Precision | 0.9552 |
| Recall | 0.9846 |
| F1 Score | 0.9697 |
The final metrics are saved to models/metrics.json, and comparison tables are saved under reports/figures/.
The expected strong predictors are previous grades, attendance, study efficiency, assignments completed, engagement score, and risk index.
- Add model monitoring for prediction drift.
- Add SHAP explanations for richer local interpretability.
- Add a REST API with FastAPI.
- Add automated CI checks for formatting and training smoke tests.
- Expand the dataset with school-specific and semester-specific records.
- Add threshold optimization for pass/fail decisions.
This project is licensed under the MIT License. See LICENSE for details.
Aditya Verma
Portfolio project for GitHub, LinkedIn, and internship applications.