Skip to content

kid0114/Modelbase

Repository files navigation

Local LLM Benchmark System

Minimum viable benchmark framework for a local OpenAI-compatible model server on Mac.

Directory Layout

local_llm_bench/
├── configs/
├── datasets/
├── outputs/
├── reports/
├── scripts/
└── local_llm_bench/

Environment

conda create -n llm-bench-py312 python=3.12 -y
conda activate llm-bench-py312
pip install -r requirements.txt

Default LM Studio Endpoint

http://localhost:12345/v1/chat/completions

Run MBPP

python scripts/run_mbpp.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20

Run HumanEval

python scripts/run_humaneval.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b --limit 20

Run Context Probe

python scripts/run_context.py --base-url http://localhost:12345/v1 --model qwen3-coder-next-mlx-4b

Notes

  • The client targets any OpenAI-compatible server, so the same code can point at Open WebUI, LiteLLM, vLLM, llama.cpp, or MLX backends.
  • Stream mode is used to measure TTFT. If the server does not include token usage in streaming responses, the runner can fall back to a non-stream usage fetch.

Performance Monitoring

  • powermetrics for CPU/GPU/power sampling during runs.
  • asitop for a lightweight live view of Apple Silicon utilization.
  • Activity Monitor for manual correlation with spikes, throttling, or memory pressure.

Future Extensions

  • pass@1 and pass@k scoring for MBPP / HumanEval.
  • Code execution verification for generated Python solutions.
  • Additional dataset loaders for GSM8K and MMLU.
  • Unified backend profiles for Open WebUI, LiteLLM, vLLM, llama.cpp, and MLX.

About

Local LLM benchmark for OpenAI-compatible model servers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages