Minimum viable benchmark framework for a local OpenAI-compatible model server on Mac.
local_llm_bench/
├── configs/
├── datasets/
├── outputs/
├── reports/
├── scripts/
└── local_llm_bench/

conda create -n llm-bench-py312 python=3.12 -y
conda activate llm-bench-py312
pip install -r requirements.txt

http://localhost:12345/v1/chat/completions

python scripts/run_mbpp.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20
python scripts/run_humaneval.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20
python scripts/run_context.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b

- The client targets any OpenAI-compatible server, so the same code can point at Open WebUI, LiteLLM, vLLM, llama.cpp, or MLX backends.
- Stream mode is used to measure TTFT. If the server does not include token usage in streaming responses, the runner can fall back to a non-stream usage fetch.
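
As a rough illustration of the TTFT measurement described above, the sketch below consumes OpenAI-style server-sent-event lines and records when the first non-empty content delta arrives. The function name `measure_stream`, its parameters, and the returned keys are hypothetical and not taken from this repository. It assumes the usual `data: {...}` / `data: [DONE]` framing, and it flags when no `usage` object was seen so the caller can issue a non-stream request for token counts.

```python
import json
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    lines: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    start: Optional[float] = None,
) -> dict:
    """Consume OpenAI-style SSE lines and record time to first token.

    `start` should be a clock reading taken just before the request was
    sent; if omitted, timing starts when this function is called.
    """
    t0 = clock() if start is None else start
    ttft: Optional[float] = None
    usage = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                if ttft is None:
                    ttft = clock() - t0  # first visible token
                parts.append(text)
    return {
        "ttft_s": ttft,
        "total_s": clock() - t0,
        "text": "".join(parts),
        "usage": usage,
        # True means the stream carried no usage block, so a follow-up
        # non-stream request would be needed for token counts.
        "needs_usage_fallback": usage is None,
    }
```

With an HTTP client that exposes the response as an iterator of decoded lines, the same function can be fed directly from the live stream. Passing `start` from before the request is sent keeps connection and prompt-processing time inside the TTFT figure.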
- powermetrics for CPU/GPU/power sampling during runs.
- asitop for a lightweight live view of Apple Silicon utilization.
- Activity Monitor for manual correlation with spikes, throttling, or memory pressure.
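
One possible way to run powermetrics alongside a benchmark is to launch it as a background process from Python. The sketch below is an assumption about how that could look, not part of this repository: the helper names, the default output path, and the sampling interval are all hypothetical. It relies on the `--samplers`, `-i` (interval in milliseconds), `-n` (sample count), and `-o` (output file) options of macOS powermetrics, which requires root privileges.

```python
import subprocess
from typing import List


def powermetrics_cmd(
    interval_ms: int = 1000,
    samples: int = 30,
    out_path: str = "outputs/powermetrics.txt",  # hypothetical location
) -> List[str]:
    """Build a powermetrics command line for CPU and GPU power sampling."""
    if interval_ms <= 0 or samples <= 0:
        raise ValueError("interval_ms and samples must be positive")
    return [
        "sudo", "powermetrics",
        "--samplers", "cpu_power,gpu_power",
        "-i", str(interval_ms),
        "-n", str(samples),
        "-o", out_path,
    ]


def start_sampler(cmd: List[str]) -> subprocess.Popen:
    """Start sampling in the background; call .wait() after the benchmark."""
    return subprocess.Popen(cmd)
```

Choosing `samples` so that `interval_ms * samples` slightly exceeds the expected run time lets the sampler exit on its own, which avoids having to signal a process started under sudo.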
- pass@1 and pass@k scoring for MBPP / HumanEval.
- Code execution verification for generated Python solutions.
- Additional dataset loaders for GSM8K and MMLU.
- Unified backend profiles for Open WebUI, LiteLLM, vLLM, llama.cpp, and MLX.
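
The scoring and execution items above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: `passes_tests` and `pass_at_k` are hypothetical names, the solution and its tests are assumed to be plain Python source strings, and the subprocess call only isolates crashes and timeouts (it is not a security sandbox). The `pass_at_k` function uses the unbiased estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that passed.

```python
import math
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run solution + tests in a fresh interpreter; True if it exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs and infinite loops as failures
    return proc.returncode == 0


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Per-problem scores from `pass_at_k` would then be averaged over the dataset. With one sample per problem (n = 1, k = 1) this reduces to the plain fraction of problems solved.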