Static + schema + routing + spawn-fixture eval harness for *.md subagents (Claude Code, etc.). Catches description bloat, fence-mimicry, low routing margin, and schema regressions before they ship.
Shared domain types and Zod schemas for agent-eval-harness
Trajectory loading, evaluation, and comparison for agent-eval-harness
Latency monitoring, SLA enforcement, and optimization analysis for agent-eval-harness
Orchestrated evaluation suite runner with results aggregation for agent-eval-harness
Cost tracking, budget management, and reporting for agent-eval-harness
Golden trajectory management, comparison, and curation for agent-eval-harness
OpenTelemetry observability (tracing, metrics, logging, dashboards) for agent-eval-harness
Tool-use validation (selection, schema compliance, result verification) for agent-eval-harness
Provider-agnostic LLM-as-judge with calibration and consensus for agent-eval-harness
Three-layer MCP tool server (judge, suite, gate) for agent-eval-harness
CI regression gates, threshold checks, and JUnit/GitHub integration for agent-eval-harness
CLI interface for agent-eval-harness with eval, judge, compare, gate, golden, report, and serve commands
General-purpose eval harness for running trials against CLI agents
Deep Agents - a library for building controllable AI agents with LangGraph
Evaluate statically analyzable expressions
Evaluate and improve AI agents from runs, traces, judges, and feedback. Compare candidates, cluster failures, measure lift, and gate releases.
Solo-dev harness engineering kit for Claude Code, with experimental Codex and Kiro CLI runtime rendering.
Stable application runtime and operator control plane for agent workspaces.
Simple JavaScript expression evaluator
Evaluate Node.js require() module content directly
Mathematical expression evaluator fork with an exports map and security fixes for prototype pollution and code injection
A flexible math expression evaluator
Require or eval modules