LLM eval & testing toolkit
promptfoo custom provider for running evals against a Lobu agent
Genkit AI framework plugin for Promptfoo.
Promptfoo + OpenCode eval harness for agent behavior. Owns model/tier policy, provider wiring, package discovery, state export, and artifact guards. Consumers own eval YAML, prompts, fixtures, and assertions.
Promptfoo-native core package for evaluating reusable agent skills.
TracePact integration for Promptfoo — tool-trace assertions in your eval config
promptfoo extension for writing AI evaluations for Twilio AI Assistants
A converter from Promptfoo test formats into the EVA-LLM ecosystem
Close the eval-to-improvement loop for promptfoo. Automatically evaluate, identify low-scoring prompts, rewrite with any LLM, and re-evaluate.
TypeScript CLI + SDK for planned Claude Code harness runs, LLM bundle judging, and summary.md + results.jsonl reports with Braintrust + Promptfoo telemetry.
Enforcement scaffold for AI applications on Vercel AI SDK v6 -- git hooks, AGENTS.md generation, IDE configs, tool validation, and quality gates. Pairs with Langfuse for observability and Promptfoo for evals.
CLI bridge between Eqho AI platform and promptfoo evaluations
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scoring functions with TypeScript support
CLI tool that blocks the --no-verify flag in git commands. Prevents AI agents from bypassing git hooks.
CLI for deploying and managing AI agents on Lobu
LLM-as-a-Judge abstraction layer using ai-sdk and plugins
Reserved name — future eval harness (golden paths, injection, traffic replay) for the Maestro agent runtime. See https://github.com/costasoftware/maestro.
MCP server that simulates malicious behaviors for security testing
Pipeline-kit eval foundation — defineEval, runEval, case/scorer/score types
Local development platform for chat agents — workspaces, multi-provider LLMs, feedback corpus.
A terminal-based tool for local runs and debugging of eva-run
Collector and post-collect runtime package for Agent Skill Harbor.
Security evals for the AI era. Probes · Targets · Graders · Proof. Confirmed XSS / SQLi / BOLA / prompt-injection / MCP-RCE with reproducible proof attached to every finding.
A set of CLI tools to help you iterate on your LLM prompts.
CLI for Assay
High-performance evaluation framework for LLM agents (Core)
Tamper-evident, cryptographically verifiable evidence bundles for the Assay compliance framework.
MCP server integration for Assay
Metrics library for Assay
Policy types and compilation logic for Assay
Pack registry client for remote pack distribution (SPEC-Pack-Registry-v1)
Internal/experimental substrate for Assay measured-run workflows. Runner orchestration, archive assembly, and layer normalizers for the Assay-Runner candidate (Phase 2D Slice 2). No standalone product guarantee; API surface remains narrow and intentionally undocumented for third-party use; semver tracks the Assay workspace.
Internal/experimental substrate for Assay measured-run workflows. Linux-only platform adapter for the Assay-Runner candidate; hosts cgroup placement primitives (Phase 2D Slice 3). No standalone product guarantee; API surface remains narrow and intentionally undocumented for third-party use; semver tracks the Assay workspace.
Internal/experimental substrate for Assay measured-run workflows. Versioned schema types and constants for the Assay-Runner v0 contracts (Phase 2D Slice 1). No standalone product guarantee; API surface remains narrow and intentionally undocumented for third-party use; semver tracks the Assay workspace.
Internal/experimental substrate for Assay measured-run workflows. Compatibility wrapper crate that re-exports the Assay-Runner candidate surface from assay-runner-schema (Slice 1) and assay-runner-core (Slice 2). No standalone product guarantee; legacy alias retained for readers of pre-extraction history; semver tracks the Assay workspace.
Simulation harness for Assay (internal, API unstable)