GitHub Action for evaluating MCP server tool calls using LLM-based scoring
LLM eval & testing toolkit
No description provided.
Harness-backed AI testing on top of Vitest.
Test your LLM-powered apps with a TypeScript-native, Vitest-based eval runner. No API key required.
Much like tests in traditional software, evals are an important part of bringing LLM applications to production. The goal of this package is to help provide a starting point for you to write evals for your LLM applications, from which you can write more c
A library for running evaluations for AI use cases
Offline evaluation framework for Output.ai workflows
Universal library for evaluating AI models
Harness AI Evals Service APIs integrated with React hooks
Axiom AI SDK provides an API to wrap your AI calls with observability instrumentation, offline evals, and online evals.
CLI entry point for AgentV
No description provided.
Usage, cost, and response telemetry primitives for Agent Assistant
Autonomous coding agent CLI - capable of creating/editing files, running commands, using the browser, and more
Ink-based evaluation console for m4trix
⚠️ **Beta Notice**: This SDK is currently in **beta**. All APIs are **experimental** and subject to change. Please review the [release notes](https://github.com/statsig-io/statsig-ai-node/releases) for any **breaking changes** before upgrading.
A general-purpose LLM evaluation framework with dataset loading, scoring, run persistence, model comparison, and console reporting.
AI SDK harness adapter for vitest-evals.
Record, replay, and score evaluation primitives for Beach applications — built on the event log.
Arize evals package
Run Claude Code 24/7 on your Claude Pro/Max subscription over Telegram. Open-source alternative to OpenClaw and NanoClaw — no API keys.
pi-ai harness adapter with tool replay for vitest-evals.
promptfoo custom provider for running evals against a Lobu agent
LLM evaluation engine for Rails.
Wraps RubyLLM::Chat with input/output contracts, business-rule validation, retry with model escalation on validation failure, pre-flight cost ceilings, and an evaluation framework. Sibling abstraction to RubyLLM::Agent — same niche (reusable class-based prompts), wider contract.