A config-driven LLM evaluation harness with a CLI and a dashboard. Define an eval in YAML, run it against multiple models, score every output with deterministic, LLM-as-judge, and cost/latency graders, and compare the results side by side.
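
To make the define, run, score, compare flow concrete, here is a minimal Python sketch. It is not this project's API: the config keys, the stand-in models, and the function names (`EVAL_CONFIG`, `exact_match`, `run_eval`, `format_comparison`) are all hypothetical. The eval definition is a plain dict standing in for a parsed YAML file. Only a deterministic grader and a latency measurement are shown; LLM-as-judge and cost graders are left out.

```python
import time
from typing import Callable, Dict

# Hypothetical eval definition: the dict a YAML file might parse into.
# These keys are illustrative and not taken from the project.
EVAL_CONFIG = {
    "name": "capital-cities",
    "cases": [
        {"prompt": "Capital of France?", "expected": "Paris"},
        {"prompt": "Capital of Japan?", "expected": "Tokyo"},
    ],
}


# Stand-in "models": plain functions from prompt to output text.
def model_a(prompt: str) -> str:
    answers = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
    return answers.get(prompt, "unknown")


def model_b(prompt: str) -> str:
    answers = {"Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


def exact_match(output: str, expected: str) -> float:
    """Deterministic grader: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    config: dict, models: Dict[str, Callable[[str], str]]
) -> Dict[str, Dict[str, float]]:
    """Run every case against every model; record accuracy and mean latency."""
    results: Dict[str, Dict[str, float]] = {}
    for name, model in models.items():
        scores, latencies = [], []
        for case in config["cases"]:
            start = time.perf_counter()
            output = model(case["prompt"])
            latencies.append(time.perf_counter() - start)
            scores.append(exact_match(output, case["expected"]))
        results[name] = {
            "accuracy": sum(scores) / len(scores),
            "mean_latency_s": sum(latencies) / len(latencies),
        }
    return results


def format_comparison(results: Dict[str, Dict[str, float]]) -> str:
    """Render one row per model so the scores can be read side by side."""
    lines = [f"{'model':<10} {'accuracy':>8} {'latency_s':>10}"]
    for name, metrics in results.items():
        lines.append(
            f"{name:<10} {metrics['accuracy']:>8.2f} {metrics['mean_latency_s']:>10.6f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    models = {"model_a": model_a, "model_b": model_b}
    print(format_comparison(run_eval(EVAL_CONFIG, models)))
```

In this sketch each grader is just a function from an output to a score, so adding another grader type would mean adding another function and another column in the comparison.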