Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.
Updated Jul 3, 2026 · Python
Hard Reasoning Benchmark filtered with disagreement scores
PromptGuard is a pragmatic, opinionated framework for establishing continuous integration for LLM behavior. It operates on a simple, verifiable principle: run the same prompts across multiple model configurations, compare outputs against defined expectations, and flag semantic regressions.
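The principle described above (same prompts, several model configurations, outputs checked against expectations) can be sketched in a few lines of Python. This is a minimal illustration only, not PromptGuard's actual API: the `Case` type, the `find_regressions` helper, and the two stub models are hypothetical names, and a plain substring check stands in for whatever semantic comparison the real framework performs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not PromptGuard's actual API.
ModelFn = Callable[[str], str]


@dataclass
class Case:
    prompt: str
    must_contain: List[str]  # simple stand-in for a semantic expectation


def find_regressions(
    cases: List[Case], configs: Dict[str, ModelFn]
) -> List[Tuple[str, str, List[str]]]:
    """Run every case against every configuration and collect failures.

    Each failure is (config name, prompt, expected terms missing from the output).
    """
    failures = []
    for name, model in configs.items():
        for case in cases:
            output = model(case.prompt).lower()
            missing = [t for t in case.must_contain if t.lower() not in output]
            if missing:
                failures.append((name, case.prompt, missing))
    return failures


# Stub "models" so the sketch runs without any network or API key.
def baseline(prompt: str) -> str:
    return "The capital of France is Paris."


def candidate(prompt: str) -> str:
    return "I am not sure."


cases = [Case("What is the capital of France?", ["Paris"])]
failures = find_regressions(cases, {"baseline": baseline, "candidate": candidate})

for config_name, prompt, missing in failures:
    print(f"REGRESSION [{config_name}] {prompt!r}: missing {missing}")
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so the pipeline fails when a configuration regresses.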
Capability Schema Spec defines a shared semantic language for world model evaluation. Standardize capability definition, observation, and verification across models and benchmarks. Not a benchmark—a shared language. Define • Observe • Verify
Reference implementation of the Capability Schema Specification. Proves that world model capabilities can be defined, observed, and verified in practice — with real checkpoints, real simulators, and real scores. Define • Observe • Verify • Deliver
Enterprise-style RAG reliability platform for MLOps docs: cited answers, evals, traces, FastAPI, Next.js.
A reproducible visual-attribute verification framework combining group-disjoint evaluation, audited LoRA controls, calibration analysis, and CI-backed evidence contracts.
Multi-LLM consensus engine for automated code review, diff analysis, and risk scoring.