The data is still too hard to find and access. We need to make it accessible so we can start doing natural language processing and machine learning at a faster rate.
Registry of DraCor corpora
A tool to grab the latest Corpora Project data locally and access it.
A Node module exposing NLTK stopwords corpora and providing utility functions for removing stopwords
CLI toolchain for validating, indexing, and bootstrapping docs-mcp documentation corpora
Chiron corpora list.
Local-first catalytic memory for Markdown corpora and AI agents.
Leapable — Context engineering MCP server. Query expert knowledge corpora, manage your data, and search with full provenance. One-line install: npx leapable-mcp
MCP server for exposing dspack design system corpora to AI coding agents
Text corpora from Project Gutenberg used by NLTK.
Generate large rule and skill progressive-disclosure pressure-test corpora.
Unified memory-layer plugin for Claude Code. One MCP server, multiple typed corpora (vault, code, plans, docs, research, project-map), unified cross-corpus knowledge graph, conditional tool registration, in-process drift detection, and human↔code translat
Self-hosted, open-source documentation framework. Drop-in compatible with Mintlify projects — render existing docs.json/MDX corpora unmodified, with Astro under the hood. Includes a CLI (init, dev, build, migrate), pluggable themes, and full MDX component
Prime CLI — compile, check, install, publish, search Prime atoms and corpora.
automated corpora alignment script using GIZA++ and mkcls
Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.
Hunalign is a bilingual sentence aligner, useful for aligning parallel corpora.
NuBerea SDK — Client library for the NuBerea biblical data platform. Query morphological corpora, lexicons, Bible texts, manuscripts, and scrolls.
SIGN validator. Parses, validates, and compiles .sign corpora.
The JSON Schema of dicto corpora data
No description provided.
Pure Delphi 13 tree-sitter sub-grammar — drops {$IFDEF}/pp_* tokens entirely and expects preprocessor-resolved source as input. Pairs with delphi13-preprocessor to reach 99%+ pass on real Delphi 13 corpora.
This is a library of miniature search engines for small corpora. It's similar to [Lunr](https://lunrjs.com/) but less well-made. It offers multiple search engines to handle different scoring preferences and context considerations. For example, the `BM25Se
Zombie ipsum reversus ab viral inferno, nam rick grimes malum cerebro. De carne lumbering animata corpora quaeritis. Summus brains sit, morbo vel maleficia? De apocalypsi gorger omero undead survivor dictum mauris. Hi mindless mortuis soulless creaturas
A complete library to interact with Aiplatform (protocol v1beta1)
Binary byte pair encoding (BPE) trainer and CLI compatible with Hugging Face tokenizers
A composable, deterministic text data pipeline for ML. Ingest, denoise, chunk, split, and sample multi-source corpora into reproducible training triplets.
Generate fake data with various generators.
Semantic code search CLI — like ripgrep but for meaning
Semantic code + document search engine. Cacheless static-embedding + cross-encoder rerank by default; optional ModernBERT/BGE transformer engines with GPU backends. Tree-sitter chunking, hybrid BM25 + PageRank, composable ranking layers.
MCP + LSP server for ripvec — semantic code search, PageRank repo maps, and multi-language code intelligence
CLI and HTTP server for BM25 Turbo
The fastest BM25 information retrieval engine — 28K QPS on 8.8M docs
SysML v2 grammar for tree-sitter
Compression Dictionary Transport toolkit
Pure Rust port of libdivsufsort suffix array construction library
MExiCo is a library and API for the management of multimodal experimental corpora.
Simple text processing for small data sets.
Use various corpora to generate random mnemonic names for things.
A tool to construct longitudinal corpora from web data
BardBot can generate Markov sentences for you from a number of Shakespearean character corpora
classified provides an abstract interface to common Ruby classifiers. It lets you compare these classifiers on common corpora by accuracy, precision, recall, and F-measure.
DSPy datasets provide prebuilt loaders, caching, and schema metadata for benchmark corpora used in DSPy examples and teleprompters.
Whistlepig is a minimalist realtime full-text search index. Its goal is to be as small and minimally-featured as possible, while still remaining useful, performant and scalable to large corpora. If you want realtime full-text search without the frills, Whistlepig may be for you.
A gem for checking if a given email is corporate or not.
generates a random string of corporate bs and makes your computer say it