A Stencil web component library for LLM test runner functionality
Generate tests to evaluate the intelligence of large language models.
A high-level API to automate web browsers
No description provided.
CLI wrapper for LLM Test Bench - A production-grade framework for testing and benchmarking Large Language Models
Enforce real-time token budgets and spending limits for OpenAI, Anthropic Claude, and Google Gemini API calls in Node.js
Playwright reporter that outputs structured JSON for LLM agents. Minimal console output, flat schema, easy to filter to failures.
Superfast runtime validators with only one line
Lightweight, zero-dependency LLM API cost & token usage tracker for OpenAI, Anthropic, Gemini, Mistral, Groq, and DeepSeek
TypeScript bindings for LangChain
Display language model outputs in your React project.
[llm-ui](https://llm-ui.com) markdown block.
Hardware accelerated language model chats on browsers
[llm-ui](https://llm-ui.com) code block.
LLM eval & testing toolkit
A super simple text splitter for LLMs
[llm-ui](https://llm-ui.com) JSON blocks for building custom components.
General-purpose agent with transport abstraction, state management, and attachment support
Test your LLM-powered apps with a TypeScript-native, Vitest-based eval runner. No API key required.
Detox driver for Wix Pilot usage
A production-grade CLI for testing and benchmarking LLM applications with support for GPT-5, Claude Opus 4, Gemini 2.5, and 65+ models
Core library for LLM Test Bench - comprehensive testing framework for Large Language Models with 65+ supported models across 14+ providers
Dataset management and utilities for LLM Test Bench - load, validate, and manage test datasets
Provides a RubyLLM::Provider that allows you to stub responses for testing purposes. You can stub individual responses or a sequence of responses, and you can also temporarily stub responses within a block.
TNG (Test Next Generation) is a Rails gem that automatically generates comprehensive test files by analyzing your Ruby code using static analysis and AI. It supports models, controllers, and services with intelligent test case generation.
Generate tests for your code using LLMs. This gem is a CLI tool that uses OpenAI to generate test code for your code. It uses a configuration file to match files with the right test code generation instructions. It is designed to be used with Ruby on Rails, but it can be used with any codebase. It is a work in progress.
Output test code using LLM agents.
Define qualitative evaluation criteria and let an LLM judge if responses pass. Perfect for testing AI agents, comparing models, and evaluating subjective qualities.
A prompt testing library for LLMs that allows comparing patterns of prompts.
Probatio Diabolica runs custom *_spec.rb files with a DSL inspired by RSpec and supports text/image/PDF reporting.
A thin Minitest wrapper around promptfoo that brings prompt testing to Ruby projects. Test LLM prompts with a familiar Minitest-like DSL, supporting multiple providers and assertion types.
CompletionKit is a prompt testing platform that runs as a Rails engine or a standalone app. Run prompts against real datasets, score every output with an LLM judge against criteria you define, track prompt versions, and get AI-generated improvement suggestions grounded in your actual results. Includes a web UI, REST API, and a built-in MCP server with 34 tools.
Fleet pipeline validation: tests, lint, security scan, adversarial LLM review
A Rails engine providing comprehensive observability for LLM-powered applications. Features include session tracking, trace analysis, prompt management, cost monitoring, and optional chat/agent testing UI (with RubyLLM integration).
Provider-agnostic LLM evaluation with pluggable metrics, statistical A/B comparison, and test framework integration. Ragas for Ruby, powered by RubyLLM.
No description provided.