No description provided.
Shared domain types and Zod schemas for agent-eval-harness
Trajectory loading, evaluation, and comparison for agent-eval-harness
Cost tracking, budget management, and reporting for agent-eval-harness
Latency monitoring, SLA enforcement, and optimization analysis for agent-eval-harness
Golden trajectory management, comparison, and curation for agent-eval-harness
OpenTelemetry observability (tracing, metrics, logging, dashboards) for agent-eval-harness
Orchestrated evaluation suite runner with results aggregation for agent-eval-harness
Tool-use validation (selection, schema compliance, result verification) for agent-eval-harness
Provider-agnostic LLM-as-judge with calibration and consensus for agent-eval-harness
Evaluate and improve AI agents from runs, traces, judges, and feedback. Compare candidates, cluster failures, measure lift, and gate releases.
The agent eval standard for MCP. Score every agent output for quality, safety, and cost.
Web-based playground for browsing agent-eval experiment results
CI regression gates, threshold checks, and JUnit/GitHub integration for agent-eval-harness
Three-layer MCP tool server (judge, suite, gate) for agent-eval-harness
CLI for agent-eval-harness with eval, judge, compare, gate, golden, report, and serve commands
Framework for testing AI coding agents in isolated sandboxes
Agent evaluation framework with LLM-based grading for AI agent quality assessment
General-purpose eval harness for running trials against CLI agents
TypeScript client for the AumOS agent-eval evaluation framework — benchmarks, metrics, and run comparisons
evaluate statically-analyzable expressions
Compare coding agents head-to-head. Pass rate, cost, time, consistency — one command.
Static + schema + routing + spawn-fixture eval harness for *.md subagents (Claude Code, etc.). Catches description bloat, fence-mimicry, low routing margin, and schema regressions before they ship.