A curated list for Efficient Large Language Models
Updated Jun 17, 2025 - Python
[ICML 2023] Official implementation of our paper "BiBench: Benchmarking and Analyzing Network Binarization".
[NeurIPS 2023 Spotlight] Official implementation of our paper "QuantSR: Accurate Low-bit Quantization for Efficient Image Super-Resolution".
The official implementation of the ICML 2023 paper OFQ-ViT
[ICLR 2026] This is the official PyTorch implementation of "QVGen: Pushing the Limit of Quantized Video Generative Models".
Chat with Llama 2 and get responses that cite reference documents retrieved from a vector database. Runs a locally available model using GPTQ 4-bit quantization.
A tutorial of model quantization using TensorFlow
PyTorch implementation of "BiDense: Binarization for Dense Prediction", a binary neural network for dense prediction tasks.
This project distills a ViT model into a compact CNN, reducing its size to 1.24MB with minimal accuracy loss. ONNXRuntime with CUDA boosts inference speed, while FastAPI and Docker simplify deployment.
On-device Perceive → Reason pipeline for Apple Silicon: Core ML + Vision for perception, a swappable LanguageModel (Apple Foundation Models or Claude) for reasoning. Python conversion/quantization toolkit plus a SwiftUI reference app.
🔬 Curiosity-Driven Quantized Mixture of Experts
Comprehensive performance analysis of DeepSeek V3 quantization levels (FP16, Q8_0, Q4_0) on 16GB GPU environments.
Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
Official ICML 2026 Spotlight implementation for structural MoE compression, including attribution-guided channel scoring, coverage-maximized pruning, compact checkpoint construction, and fine-tuning support.
🚀 Next-gen FP8 diffusion pipeline with ComfyUI backend & smart caching. Professional image generation on free Colab — zero setup, modular src/ package, one-click notebook.
PyTorch benchmark harness comparing full / fine-tuned / quantised NLP models on accuracy, latency, memory, and energy per 1,000 predictions. Produces accuracy-vs-emissions trade-off curves for stakeholder consumption.
Fine-tuning Llama models with QLoRA using Unsloth for supervised instruction tasks
Unofficial implementation of NCNet using Flax and JAX
Automated INT8 quantization pipeline for ONNX models (segmentation, classification, and anomaly detection) using ONNX Runtime QDQ format. Supports efficient deployment on edge devices such as Raspberry Pi.
Two-stage confidence-gated YOLOv8 detector for autonomous driving, optimized for CPU with OpenVINO INT8 (0.946 mAP@0.5, ~2× faster). Built for uOttawa's SEG4180 (Applied ML, Dr. Daniel Shapiro).