Student Performance Predictor

Production-style machine learning project for predicting student final exam scores and pass/fail outcomes.

Project Overview

Student Performance Predictor is an end-to-end machine learning application that solves two related academic analytics problems:

Regression: predict a student's final exam score from academic, behavioral, and demographic inputs.
Classification: predict whether the student is likely to pass or fail.

The project is structured as a portfolio-ready ML system, not a single notebook. It includes data ingestion, cleaning, EDA, feature engineering, model comparison, hyperparameter tuning, model persistence, and an interactive Streamlit application.

Features

UCI Student Performance dataset integration with a deterministic offline fallback dataset.
Data quality handling for missing values, duplicates, invalid values, categorical encoding, and scaling.
Engineered academic features such as study efficiency, homework ratio, engagement score, sleep quality index, grade trend, and risk index.
Regression models: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, and optional XGBoost.
Classification models: Logistic Regression, Decision Tree, Random Forest, SVM, Gradient Boosting, and optional XGBoost.
Hyperparameter tuning with RandomizedSearchCV.
Saved joblib artifacts for regression, classification, preprocessing, encoder, and scaler.
Streamlit interface with score prediction, pass/fail probability, confidence score, and feature contribution chart.
Reproducible notebooks for EDA and model training.

Screenshots

Add screenshots after running the Streamlit app:

screenshots/app_home.png
screenshots/prediction_result.png
reports/figures/regression_feature_importance.png
reports/figures/classification_feature_importance.png
reports/figures/confusion_matrix.png
reports/figures/roc_curve.png

Dataset

Primary source: UCI Machine Learning Repository, Student Performance Datasets.

The training code attempts to download:

https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip

The UCI data contains Portuguese school performance records. This project standardizes the source fields into a clean portfolio schema:

gender
age
study_hours
attendance
sleep_hours
previous_grade
internet_access
parent_education
family_income
extra_classes
assignments_completed
participation
final_score
pass_fail

If the dataset cannot be downloaded, src/utils.py creates a deterministic synthetic dataset so the project still trains and runs offline.

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Project Structure

student-performance-predictor/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── EDA.ipynb
│   └── Model_Training.ipynb
├── models/
│   ├── regression.pkl
│   ├── classifier.pkl
│   ├── preprocessor.pkl
│   ├── encoders_scaler.pkl
│   └── metrics.json
├── reports/
│   └── figures/
├── screenshots/
├── src/
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   ├── train.py
│   ├── predict.py
│   └── utils.py
├── app.py
├── requirements.txt
├── pyproject.toml
├── README.md
├── LICENSE
└── .gitignore

Machine Learning Workflow

Load UCI data or create the offline fallback dataset.
Standardize source columns into a consistent application schema.
Remove duplicates and correct invalid numeric ranges.
Create engineered features:
- Study Efficiency = Study Hours x Attendance
- Homework Ratio
- Academic Engagement Score
- Sleep Quality Index
- Grade Trend
- Risk Index
Split data into train and test sets.
Build sklearn pipelines with median imputation, categorical imputation, standard scaling, and one-hot encoding.
Train multiple regression and classification models.
Tune final Random Forest models with RandomizedSearchCV.
Evaluate with regression and classification metrics.
Save models, preprocessing artifacts, metrics, and diagnostic plots.
Serve predictions through Streamlit.

Exploratory Data Analysis

The EDA notebook includes:

Missing-value audit
Duplicate-record check
Summary statistics
IQR outlier detection
Histograms
Box plots
Final-score distribution
Pass/fail distribution
Correlation heatmap
Pair plot
Categorical distributions
Feature importance preview

Each graph is followed by an interpretation explaining what the visualization contributes to model design.

Algorithms Used

Regression:

Linear Regression
Decision Tree Regressor
Random Forest Regressor
Gradient Boosting Regressor
XGBoost Regressor, when installed

Classification:

Logistic Regression
Decision Tree Classifier
Random Forest Classifier
Support Vector Machine
Gradient Boosting Classifier
XGBoost Classifier, when installed

Evaluation Metrics

Regression:

MAE
MSE
RMSE
R2 Score

Classification:

Accuracy
Precision
Recall
F1 Score
ROC Curve
Confusion Matrix

Run Training

python -m src.train

Generated artifacts:

models/regression.pkl
models/classifier.pkl
models/preprocessor.pkl
models/encoders_scaler.pkl
models/metrics.json
reports/figures/regression_model_comparison.csv
reports/figures/classification_model_comparison.csv
reports/figures/regression_feature_importance.png
reports/figures/classification_feature_importance.png
reports/figures/confusion_matrix.png
reports/figures/roc_curve.png

Run the Web App

streamlit run app.py

The app sidebar accepts student details and returns:

Predicted final score
Pass/fail result
Pass probability
Confidence score
Feature contribution chart

Results

The latest training run used the UCI Student Performance dataset and selected the best evaluated candidate from the baseline and tuned model set.

Selected regression model: Random Forest Regressor

Metric	Value
MAE	5.4359
MSE	61.7454
RMSE	7.8578
R2	0.8801

Selected classification model: Gradient Boosting Classifier

Metric	Value
Accuracy	0.9494
Precision	0.9552
Recall	0.9846
F1 Score	0.9697

The final metrics are saved to models/metrics.json, and comparison tables are saved under reports/figures/.

The expected strong predictors are previous grades, attendance, study efficiency, assignments completed, engagement score, and risk index.

Future Improvements

Add model monitoring for prediction drift.
Add SHAP explanations for richer local interpretability.
Add a REST API with FastAPI.
Add automated CI checks for formatting and training smoke tests.
Expand the dataset with school-specific and semester-specific records.
Add threshold optimization for pass/fail decisions.

License

This project is licensed under the MIT License. See LICENSE for details.

Author

Aditya Verma

Portfolio project for GitHub, LinkedIn, and internship applications.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Student Performance Predictor

Project Overview

Features

Screenshots

Dataset

Installation

Project Structure

Machine Learning Workflow

Exploratory Data Analysis

Algorithms Used

Evaluation Metrics

Run Training

Run the Web App

Results

Future Improvements

License

Author

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
models		models
notebooks		notebooks
reports/figures		reports/figures
screenshots		screenshots
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Student Performance Predictor

Project Overview

Features

Screenshots

Dataset

Installation

Project Structure

Machine Learning Workflow

Exploratory Data Analysis

Algorithms Used

Evaluation Metrics

Run Training

Run the Web App

Results

Future Improvements

License

Author

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages