Local LLM Benchmark System

Minimum viable benchmark framework for a local OpenAI-compatible model server on Mac.

Directory Layout

local_llm_bench/
├── configs/
├── datasets/
├── outputs/
├── reports/
├── scripts/
└── local_llm_bench/

Environment

conda create -n llm-bench-py312 python=3.12 -y
conda activate llm-bench-py312
pip install -r requirements.txt

Default LM Studio Endpoint

http://localhost:12345/v1/chat/completions

Run MBPP

python scripts/run_mbpp.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20

Run HumanEval

python scripts/run_humaneval.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20

Run Context Probe

python scripts/run_context.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b

Notes

The client targets any OpenAI-compatible server, so the same code can point at Open WebUI, LiteLLM, vLLM, llama.cpp, or MLX backends.
Stream mode is used to measure TTFT. If the server does not include token usage in streaming responses, the runner can fall back to a non-stream usage fetch.

Performance Monitoring

powermetrics for CPU/GPU/power sampling during runs.
asitop for a lightweight live view of Apple Silicon utilization.
Activity Monitor for manual correlation with spikes, throttling, or memory pressure.

Future Extensions

pass@1 and pass@k scoring for MBPP / HumanEval.
Code execution verification for generated Python solutions.
Additional dataset loaders for GSM8K and MMLU.
Unified backend profiles for Open WebUI, LiteLLM, vLLM, llama.cpp, and MLX.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local LLM Benchmark System

Directory Layout

Environment

Default LM Studio Endpoint

Run MBPP

Run HumanEval

Run Context Probe

Notes

Performance Monitoring

Future Extensions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
configs		configs
datasets		datasets
local_llm_bench		local_llm_bench
outputs		outputs
reports		reports
scripts		scripts
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
main.py		main.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Local LLM Benchmark System

Directory Layout

Environment

Default LM Studio Endpoint

Run MBPP

Run HumanEval

Run Context Probe

Notes

Performance Monitoring

Future Extensions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages