drkocoglu/python-ml-class-projects

Python ML Class Projects

A collection of six machine-learning mini-projects converted from an original MATLAB course into Python. Most projects implement a core algorithm from scratch (gradient descent, softmax regression, a multi-layer perceptron, etc.) and, where useful, compare the hand-written implementation against an established library to validate it; the CNN in proj6 is built with PyTorch.

The emphasis throughout is on understanding the mechanics — building the optimizer, the backprop, the bias-variance experiment by hand — rather than calling a black-box .fit(). Library implementations are used as benchmarks, not as the solution.


Repository layout

python-ml-class-projects/
├── data/                      # datasets (NOT in git — see "Data" below)
│   ├── proj1Dataset.xlsx
│   ├── MNIST/
│   └── celegans/
├── proj1-linear-regression/
├── proj2-trainig vs test error/
├── proj3-Bias vs Variance/
├── proj4-logistic-regression/
├── proj5-MLP/
├── proj6-CNN/
├── pyproject.toml             # dependencies (managed by uv)
├── uv.lock                    # locked, reproducible versions
└── README.md

Each project folder contains its source scripts, an empty conftest.py (see Running the tests), and a tests/ subfolder with its pytest suite.

The data/ folder lives outside the project folders and is shared: some datasets are used by more than one project (the C. elegans images are used by both proj4 and proj6). Keeping data in one place avoids duplicating large files across projects.


The projects

proj1 — Linear Regression

Develops the gradient-descent algorithm and applies it to linear regression, then compares the results against built-in linear-regression implementations to confirm correctness. This is the foundation the later projects build on. Data: data/proj1Dataset.xlsx

Figure: proj1, gradient descent vs closed-form linear regression

The from-scratch gradient-descent fit (right) matches the closed-form least-squares solution (left), confirming the optimizer is correct.
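The comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's code: the data are synthetic stand-ins for the spreadsheet, and the function names, learning rate, and iteration count are assumptions chosen for the example.

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.1, n_iters=5000):
    """Minimise the mean squared error with plain batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])      # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        grad = 2.0 / len(y) * Xb.T @ (Xb @ w - y)   # gradient of the MSE
        w -= lr * grad
    return w

def fit_closed_form(X, y):
    """Least-squares solution used as the reference."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

# Synthetic stand-in for the spreadsheet data (values are made up).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 1.5 + 3.0 * x + rng.normal(0.0, 0.1, size=200)

w_gd = fit_gradient_descent(x, y)
w_ls = fit_closed_form(x, y)
print("gradient descent:", w_gd)
print("closed form:     ", w_ls)
```

If the optimizer is correct and has converged, the two weight vectors should agree to several decimal places.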

proj2 — Training vs Test Error (Polynomial Regression)

Explores model complexity: fits polynomials of increasing degree (n = 1, 2, 3, …, 9) to noisy sin(2πx) data and watches the training-error/test-error trade-off as degree grows — the classic under-fit → good-fit → over-fit curve. Results are compared against a built-in polynomial-regression framework. A second experiment varies the amount of training data and shows how the fit quality changes with dataset size. Data: generated synthetically in code (no external file).

Figure: proj2, gradient descent vs built-in polynomial regression

Degree-9 fits. With little data (N=10, top) the high-degree polynomial overfits wildly; with more data (N=100, bottom) the same degree fits cleanly — showing how dataset size controls overfitting.
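A rough sketch of the degree sweep, assuming a noise level of 0.3 and a closed-form least-squares fit in place of the project's gradient-descent optimizer. The helper names are hypothetical and the numbers it prints depend on the random seed.

```python
import numpy as np

def make_data(n, rng, noise=0.3):
    """Noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, t

def design(x, degree):
    """Polynomial design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def rms(pred, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
x_train, t_train = make_data(10, rng)     # small training set, N = 10
x_test, t_test = make_data(100, rng)      # held-out test set

train_err, test_err = [], []
for degree in range(1, 10):
    # Least-squares fit (an assumption; the project uses gradient descent).
    w, *_ = np.linalg.lstsq(design(x_train, degree), t_train, rcond=None)
    train_err.append(rms(design(x_train, degree) @ w, t_train))
    test_err.append(rms(design(x_test, degree) @ w, t_test))

for degree, (tr, te) in enumerate(zip(train_err, test_err), start=1):
    print(f"degree {degree}: train RMS {tr:.3f}, test RMS {te:.3f}")
```

With only ten training points, the degree-9 polynomial can pass through nearly every point, so its training error collapses while its test error does not.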

proj3 — Bias-Variance Trade-off

Demonstrates the bias-variance trade-off and the effect of regularization using a radial-basis-function (RBF) regression framework. The regularizer here is L2 regularization (ridge / Tikhonov regularization — the penalty term λ‖w‖² added to the least-squares objective), which shrinks the weights and trades a little bias for a large reduction in variance. The project sweeps λ across a range and plots bias², variance, and test error so the trade-off is visible directly. Data: generated synthetically in code.

Figure: proj3, bias-variance trade-off vs regularization

As regularization (ln λ) increases, bias² rises and variance falls. Their sum tracks the test error almost exactly — the bias-variance decomposition made visible.
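The experiment can be approximated as follows. This sketch assumes Gaussian basis functions, a sin(2πx) target, the same input grid for every dataset, and only three values of ln λ; none of these settings are taken from the project, and the function names are hypothetical.

```python
import numpy as np

def rbf_design(x, centers, width):
    """Gaussian radial basis functions evaluated at x, one column per center."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width**2))

def ridge_fit(Phi, t, lam):
    """Minimise ||Phi w - t||^2 + lam * ||w||^2 in closed form."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
n_sets, n_points, noise = 100, 25, 0.3
x = np.linspace(0.0, 1.0, n_points)            # same inputs for every dataset
h = np.sin(2 * np.pi * x)                      # noise-free target (assumed)
targets = h + rng.normal(0.0, noise, size=(n_sets, n_points))
Phi = rbf_design(x, np.linspace(0.0, 1.0, 24), width=0.1)

ln_lambdas = [-3.0, 0.0, 3.0]
bias2, variance = [], []
for ln_lam in ln_lambdas:
    # One fitted curve per dataset, all with the same regularization strength.
    preds = np.array([Phi @ ridge_fit(Phi, t, np.exp(ln_lam)) for t in targets])
    mean_pred = preds.mean(axis=0)
    bias2.append(float(np.mean((mean_pred - h) ** 2)))
    variance.append(float(np.mean(preds.var(axis=0))))

for ln_lam, b, v in zip(ln_lambdas, bias2, variance):
    print(f"ln lambda = {ln_lam:+.0f}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Bias² is measured against the noise-free curve, and variance is the spread of the fitted curves across datasets, averaged over the input grid.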

proj4 — Softmax Regression

The assignment asked for logistic regression; with permission, this project implements softmax (multinomial) regression instead, a generalisation that also handles more than two classes. It develops the softmax regression algorithm from scratch (mini-batch gradient descent with momentum and optional L2 regularization).

The implementation was validated on two benchmarks before real use:

  • Iris (3 classes) — checked against scikit-learn's logistic regression.
  • MNIST (10 classes) — handwritten-digit classification.

Once validated, the same algorithm was applied to the C. elegans images (worm vs defect). Data: data/MNIST/, data/celegans/ (and Iris, loaded from scikit-learn).
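For orientation, here is a compact sketch of softmax regression trained with mini-batch gradient descent, momentum, and an L2 penalty. It is an illustration under assumptions, not the project's implementation: the data are synthetic blobs rather than Iris or MNIST, the hyperparameters are arbitrary, and for brevity the L2 penalty is applied to the bias weights as well.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.05, momentum=0.9, l2=1e-3,
                  batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent with momentum on the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([X, np.ones(len(X))])     # append a bias feature
    Y = np.eye(n_classes)[y]                       # one-hot targets
    W = np.zeros((Xb.shape[1], n_classes))
    V = np.zeros_like(W)                           # momentum buffer
    for _ in range(epochs):
        order = rng.permutation(len(Xb))           # reshuffle every epoch
        for start in range(0, len(Xb), batch_size):
            idx = order[start:start + batch_size]
            P = softmax(Xb[idx] @ W)
            grad = Xb[idx].T @ (P - Y[idx]) / len(idx) + l2 * W
            V = momentum * V - lr * grad
            W += V
    return W

def predict(X, W):
    """Most probable class for each row of X."""
    Xb = np.column_stack([X, np.ones(len(X))])
    return softmax(Xb @ W).argmax(axis=1)

# Three well-separated synthetic blobs (made-up data, not Iris or MNIST).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(0.0, 0.5, size=(300, 2))

W = train_softmax(X, y, n_classes=3)
accuracy = float(np.mean(predict(X, W) == y))
print(f"training accuracy: {accuracy:.3f}")
```

With two classes this reduces to ordinary logistic regression, which is why a logistic-regression library makes a reasonable reference for validation.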

Figure: proj4, MNIST softmax prediction

The from-scratch softmax classifier predicting an MNIST test digit. An interactive viewer steps through test images one at a time from the terminal.

proj5 — Multi-Layer Perceptron (from scratch)

Implements an MLP from scratch, including forward propagation, full backpropagation, and momentum-based gradient descent, applied to two tasks:

  • Part 1 — Classification: an XOR-style problem (not linearly separable, so a hidden layer is required).
  • Part 2 — Regression: fitting sin(2πx).

It lets you try several common activation functions, and which one is appropriate depends on where it sits in the network:

| Activation | Where it's used | Notes |
|---|---|---|
| sigmoid | hidden layers (classification) | smooth, bounded (0, 1); used in Part 1's hidden layer |
| tanh | hidden layers (regression) | smooth, bounded (−1, 1); fits smooth targets like sin well |
| ReLU | hidden layers | piecewise-linear, unbounded; fast but can produce "dead" units with poor init / few units |
| softmax | output layer only, classification only | converts final-layer scores into a probability distribution over classes; paired with cross-entropy loss |

The key rules: softmax is only ever used at the very end, and only for classification (never in a hidden layer, never for regression). Regression networks use a linear output (no activation on the final layer) with a least-squares loss. Hidden layers use sigmoid / tanh / ReLU.
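These rules can be made concrete with a small forward pass. The sketch below assumes a single hidden layer and hypothetical names (`forward`, `HIDDEN`, the `task` argument); the project's own interface may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hidden-layer choices; softmax is deliberately absent from this table.
HIDDEN = {"sigmoid": sigmoid, "tanh": np.tanh, "relu": relu}

def forward(X, W1, b1, W2, b2, hidden="tanh", task="regression"):
    """One-hidden-layer forward pass; `task` selects the output rule."""
    A1 = HIDDEN[hidden](X @ W1 + b1)
    Z2 = A1 @ W2 + b2
    if task == "classification":
        return softmax(Z2)      # probabilities, paired with cross-entropy
    return Z2                   # linear output, paired with least squares

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)

probs = forward(X, W1, b1, W2, b2, hidden="sigmoid", task="classification")
values = forward(X, W1, b1, W2, b2, hidden="tanh", task="regression")
print(probs.sum(axis=1))        # each row of probabilities sums to 1
```

Keeping softmax out of the hidden-layer lookup makes the "output layer only" rule hard to violate by accident.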

Beyond the assignment, proj5 also includes an optional multi-layer ("deep") extension that lets you configure the number of hidden layers and the width of each layer for further experimentation, with optional batch normalization on the hidden layers (which keeps activations well-scaled as depth grows and allows a larger learning rate). Data: generated synthetically in code.
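As a reminder of what batch normalization does to a hidden layer's pre-activations, here is a training-mode forward pass only. It is a sketch under assumptions: the function name and `eps` value are illustrative, and the running statistics used at inference time and the backward pass are omitted.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch, then scale and shift."""
    mean = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_hat = (Z - mean) / np.sqrt(var + eps)
    return gamma * Z_hat + beta

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=20.0, size=(64, 8))   # badly scaled pre-activations
out = batch_norm_forward(Z, gamma=np.ones(8), beta=np.zeros(8))

print(out.mean(axis=0).round(6))   # close to 0 for every feature
print(out.std(axis=0).round(3))    # close to 1 for every feature
```

Here `gamma` and `beta` are the learnable scale and shift; with the identity values used above the output is simply the standardized input.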

Figure: proj5, XOR decision boundary (Part 1)

Part 1: the from-scratch MLP learns a nonlinear decision surface that separates the XOR classes — impossible without a hidden layer.

proj6 — Convolutional Neural Network

A CNN (PyTorch) trained to classify the C. elegans images: worm (class 1) vs defect (class 0).

C. elegans is a roundworm widely used in biological research — including drug-discovery work aimed at improving patient quality of life. The worms are imaged on plates, but plates sometimes have defects that make the animals hard to track; this classifier separates clean worm images from defective ones.

proj6 ships with a pre-trained model (in its model/ folder) that celegans_predict.py uses to classify test images. You can also train a new model by running celegans_cnn.py — note that doing so overwrites the existing model, which may change prediction accuracy. Data: data/celegans/
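If you change the architecture before retraining, the input size of the first fully connected layer has to match what the convolution and pooling stack produces. The sketch below shows the standard output-size formula (dilation 1). The layer list and the 28×28 single-channel input are hypothetical and are not the architecture in celegans_cnn.py.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(height, width, channels, layers):
    """Walk a list of (kernel, stride, padding, out_channels) layers.

    `out_channels=None` marks a pooling layer, which keeps the channel count.
    """
    for kernel, stride, padding, out_channels in layers:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        if out_channels is not None:
            channels = out_channels
    return channels * height * width

# Hypothetical architecture and input size, chosen only for illustration.
layers = [
    (3, 1, 1, 16),    # 3x3 conv, padding 1: keeps 28x28
    (2, 2, 0, None),  # 2x2 max-pool: 28 -> 14
    (3, 1, 1, 32),    # 3x3 conv, padding 1: keeps 14x14
    (2, 2, 0, None),  # 2x2 max-pool: 14 -> 7
]
n_features = flattened_features(28, 28, 1, layers)
print(n_features)  # 32 * 7 * 7 = 1568
```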

Figure: proj6, C. elegans CNN predictions

The CNN classifying a test pair: a clean worm (left) and a plate defect (right), both predicted correctly with high confidence. An interactive viewer steps through test pairs from the terminal.


Data

The data/ folder is not committed to this repository.

The full dataset is ~147 MB (MNIST binaries plus thousands of C. elegans images). Committing that to git would bloat every clone permanently, since git retains the full history of large binary files. So data/ is git-ignored, and you provide it locally.

Expected layout

data/
├── proj1Dataset.xlsx                 # proj1
├── MNIST/                            # proj4 (benchmark)
│   ├── train-images.idx3-ubyte
│   ├── train-labels.idx1-ubyte
│   ├── t10k-images.idx3-ubyte
│   └── t10k-labels.idx1-ubyte
└── celegans/                        # proj4 and proj6
    ├── 0/                           # defect  (class 0)
    │   ├── training/
    │   └── test/
    └── 1/                           # worm    (class 1)
        ├── training/
        └── test/
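Because data/ is supplied locally, a quick check that the layout matches can save a confusing traceback later. The helper below is a hypothetical convenience script, not part of the repository; it only tests that the paths listed above exist.

```python
from pathlib import Path

# Relative paths taken from the expected layout above.
EXPECTED = [
    "proj1Dataset.xlsx",
    "MNIST/train-images.idx3-ubyte",
    "MNIST/train-labels.idx1-ubyte",
    "MNIST/t10k-images.idx3-ubyte",
    "MNIST/t10k-labels.idx1-ubyte",
    "celegans/0/training",
    "celegans/0/test",
    "celegans/1/training",
    "celegans/1/test",
]

def missing_data(data_dir="data"):
    """Return the expected entries that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_data()
    if missing:
        print("Missing from data/:")
        for rel in missing:
            print("  ", rel)
    else:
        print("data/ layout looks complete.")
```

Run it from the repository root so that the relative data/ path resolves.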

Where each dataset comes from

  • proj1Dataset.xlsx — the small spreadsheet for proj1 (included with the course materials).
  • MNIST — the standard handwritten-digit dataset, available from its canonical source in the IDX ubyte format shown above.
  • C. elegans — extracted from microscopy source images. The extraction scripts are intentionally not included: extraction required manual data preparation afterward, and shipping the scripts without that manual context would cause more confusion than it resolves. The images are already split into training/ and test/ under each class folder.
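The IDX files have a small binary header: two zero bytes, a data-type code (0x08 for unsigned bytes), the number of dimensions, then one big-endian 32-bit size per dimension, followed by the raw data. A minimal reader might look like the sketch below; `read_idx` is a hypothetical helper and may not match how proj4 loads the files.

```python
import struct

import numpy as np

def read_idx(path):
    """Read an IDX file of unsigned bytes into a NumPy array."""
    with open(path, "rb") as f:
        # Magic number: two zero bytes, a type code, and the dimension count.
        zero, dtype_code, n_dims = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError(f"{path}: not an unsigned-byte IDX file")
        # One big-endian 32-bit integer per dimension.
        shape = struct.unpack(">" + "I" * n_dims, f.read(4 * n_dims))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)

# Example usage once data/MNIST/ is in place:
# images = read_idx("data/MNIST/train-images.idx3-ubyte")   # shape (60000, 28, 28)
# labels = read_idx("data/MNIST/train-labels.idx1-ubyte")   # shape (60000,)
```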

If you only want to run a subset, you only need the data for those projects (e.g. proj2, proj3, and proj5 generate their data in code and need nothing in data/ at all).


Setup (uv)

This project uses uv for dependency management, so there is no requirements.txt. Dependencies are declared in pyproject.toml and pinned in uv.lock for fully reproducible installs.

1. Install uv (if you don't already have it).

  • Windows (PowerShell):
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  • macOS / Linux:
    curl -LsSf https://astral.sh/uv/install.sh | sh

2. Install the dependencies. From the repository root:

uv sync

This reads pyproject.toml and uv.lock, creates a virtual environment (.venv/), and installs the exact locked versions. You do not create or manage the venv by hand.

What the files do:

  • pyproject.toml — declares the project and its dependencies.
  • uv.lock — the resolved, locked versions uv sync installs from (commit this; it's what makes installs reproducible).
  • .venv/ — the environment uv creates. Not committed.

3. Run things with uv run, which uses the project environment automatically:

uv run python proj1-linear-regression/linear_regression.py

Shared environment: all six projects share one virtual environment at the repository root — there is not a separate venv per project. Run uv sync once and every project is ready.

Editor

These projects were written in VS Code. You don't need it, but if you hit import or interpreter issues, VS Code makes them easy to avoid: open the repo root as the workspace folder and select the .venv interpreter (Ctrl/Cmd+Shift+P → "Python: Select Interpreter" → the .venv in the repo root). Several scripts also open matplotlib windows, so run them in an environment with a display rather than a headless terminal.


Running the tests

Each project has a real pytest suite under its tests/ folder. These are genuine assertion-based tests: they verify behavior and fail when something breaks, rather than merely printing output. They exist to catch regressions. The algorithms here are hand-written and interdependent, so a test that pins down, say, "softmax weights come out with the right shape" or "backprop matches a numerical gradient check" protects you the day an edit silently breaks one of them. Highlights include numerical gradient checks on the MLP's backpropagation and checks of the CNN's layer-shape arithmetic.
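To show what a numerical gradient check involves, here is a self-contained sketch for a tiny tanh network with a linear output and a least-squares loss. It is not the repository's test: the network size, loss scaling, and function names are assumptions made for the example.

```python
import numpy as np

def loss_and_grads(params, X, T):
    """Tiny tanh MLP with linear output and half mean-squared-error loss."""
    W1, b1, W2, b2 = params
    A1 = np.tanh(X @ W1 + b1)
    Y = A1 @ W2 + b2
    diff = Y - T
    loss = 0.5 * np.sum(diff**2) / len(X)
    dY = diff / len(X)                       # dLoss/dY
    dW2, db2 = A1.T @ dY, dY.sum(axis=0)
    dZ1 = (dY @ W2.T) * (1.0 - A1**2)        # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

def numerical_grads(params, X, T, eps=1e-6):
    """Central differences, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for i in np.ndindex(p.shape):
            old = p[i]
            p[i] = old + eps
            plus, _ = loss_and_grads(params, X, T)
            p[i] = old - eps
            minus, _ = loss_and_grads(params, X, T)
            p[i] = old                       # restore the original value
            g[i] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X, T = rng.normal(size=(6, 2)), rng.normal(size=(6, 1))
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

_, analytic = loss_and_grads(params, X, T)
numeric = numerical_grads(params, X, T)
max_err = max(float(np.max(np.abs(a - n))) for a, n in zip(analytic, numeric))
print(f"largest analytic/numeric difference: {max_err:.2e}")
```

In a pytest suite the final line would become an assertion that the difference stays below a small tolerance, so a sign error or a missing term in the backward pass fails the test immediately.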

Run everything at once

From the repository root:

uv run python -m pytest

This discovers and runs every project's tests in one pass. Add detail or brevity as needed:

uv run python -m pytest -v      # verbose: one line per test
uv run python -m pytest -q      # quiet: compact summary

Not using uv? Every command in this section also works without the uv run prefix: call python -m pytest directly. The only requirement is that the dependencies are installed in the active Python environment (e.g. you ran uv sync and activated .venv, or installed the packages another way). With uv, uv run handles the environment for you; without it, activate your environment first and drop the prefix:

python -m pytest          # run everything
python -m pytest -v       # verbose
python -m pytest -q       # quiet

Run one project

uv run python -m pytest "proj5-MLP"
# without uv:
python -m pytest "proj5-MLP"

Run a single test file or test

uv run python -m pytest "proj5-MLP/tests/test_mlp.py"
uv run python -m pytest "proj5-MLP/tests/test_mlp.py::test_gradient_check_weights_no_bn"
# without uv:
python -m pytest "proj5-MLP/tests/test_mlp.py"

A note on the conftest.py files

Each project has an empty conftest.py at its root (next to the source files, not inside tests/). It is required: the test files live in tests/, but the modules they import live one level up in the project root. The presence of conftest.py tells pytest to add the project folder to the import path so from models import ... resolves. Don't delete these files — without them, every test in tests/ fails with ModuleNotFoundError.
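The effect can be reproduced outside pytest. The sketch below assumes pytest's default import mode, in which the folder containing conftest.py is inserted into sys.path; it imitates that step by hand with a throwaway folder and a hypothetical module name (`demo_models`), so it does not touch the real projects.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def can_import(name):
    """Return True if `name` imports with the current sys.path."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "demo-project"          # hypothetical project folder
    project.mkdir()
    (project / "demo_models.py").write_text("ANSWER = 42\n")

    before = can_import("demo_models")            # project folder not on sys.path
    sys.path.insert(0, str(project))              # roughly what pytest does
    after = can_import("demo_models")

    sys.path.remove(str(project))                 # clean up the demo
    sys.modules.pop("demo_models", None)

print(before, after)   # False True
```

The first import fails for the same reason the tests would fail without conftest.py: the module sits one level above tests/, in a folder Python does not search by default.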