Pure JavaScript cross-platform module to extract text from PDFs.
PDF extraction and rendering across all JavaScript runtimes
TagSpaces PDF extraction module
Web search, URL fetching, GitHub repo cloning, PDF extraction, YouTube video understanding, and local video analysis for Pi coding agent
High-performance PDF extraction — Rust engine, Node.js interface
A powerful PDF extraction library for Node.js built on Mozilla's pdf.js - extract text, tables, and visual elements with precision
Rust-powered PDF extraction for Node
OkraPDF command-line interface for PDF extraction and document chat
Parse MinerU PDF extraction JSON output into clean Markdown
MCP server for Moldova's National Agency for Solving Complaints (ANSC) — appeals, decisions, hearing schedule, and multi-modal PDF extraction.
A library to extract content from PDFs
A Rust toolkit for detecting and extracting metadata, text, and content from various file formats
Extract text from email attachments (PDF + image OCR). PDF text via `pdf-extract` (pure Rust); OCR via the `tesseract` CLI subprocess (not linked as a C library). Two-stage fallback for scanned PDFs: try embedded text first, fall back to OCR on the raw bytes if the text is too short. Returns `ExtractionResult` with text + language + confidence + page count + JSON metadata.
Self-contained web search MCP server. 9 backends with automatic fallback. Works from any IP.
Local-first MCP server bridging Claude to your Zotero library — search, read, cite, enrich, write — over stdio or streamable-HTTP with OAuth 2.1.
High-performance PDF text extraction library for vectorization pipelines
Build LLM applications in Rust with type safety: chains, agents, RAG, LangGraph, embeddings, vector stores, and 20+ document loaders. A LangChain port supporting OpenAI, Claude, Gemini, Mistral, Bedrock, Ollama, and more. Includes streaming, structured output, and multi-agent (Deep Agent) workflows.
High-performance document conversion engine for AI/LLM embeddings - 27 formats supported
A high-performance, reasoning-based RAG indexer in Rust following the PageIndex pattern.
Fast pure-Rust PDF extraction library and CLI — ~10-50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction from PDFs. By Clark Labs Inc.
TUI for webpage summarisation
A flexible rule-based file and folder comparison tool and crate with HTML reporting. Compares CSV, JSON, and text files, PDF text, and images.
Grim is a simple gem for extracting a page from a PDF and converting it to an image, as well as extracting the text from the page as a string. It gives you an easy-to-use API to Ghostscript, ImageMagick, and pdftotext specific to this use case.
Extract citations from PDFs.
PDF content extraction tool and library.
Extract tables from PDFs as structured data. Uses Ghostscript to render the PDF to an image, then recognizes table separators optically. No OpenCV or other heavy dependencies.
This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the Apache PDFBox library.
This is a ChupaText decomposer plugin to extract text and metadata from PDF. You can use the `pdf` decomposer.
FillablePDF is an extremely simple and lightweight utility that bridges iText and Ruby in order to fill out fillable PDF forms or extract field values from previously filled out PDF forms.
Simple wrapper around a CLI for extracting text from PDF and Word documents
Nameday data extraction from Valsts valodas centrs PDF
Kreuzberg is a high-performance document intelligence library with a Rust core and native Ruby bindings via Magnus. Extract text, metadata, and structured data from 75+ file formats including PDF, DOCX, PPTX, XLSX, HTML, RTF, images (with OCR), email, archives, and more. Features async/sync APIs, text chunking, language detection, and keyword extraction.
No description provided.
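The two-stage fallback described in the email attachment entry (try embedded text first, fall back to OCR when the text is too short) can be sketched roughly as follows. This is a minimal illustration in Python, not that crate's implementation: the `Extraction` type, the `extract_with_fallback` function, the `min_chars` threshold of 32, and the two injected callables are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extraction:
    """Hypothetical result type: the extracted text and which stage produced it."""
    text: str
    used_ocr: bool


def extract_with_fallback(
    data: bytes,
    read_embedded_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
    min_chars: int = 32,  # assumed threshold; the source only says "too short"
) -> Extraction:
    """Try the embedded text layer first; fall back to OCR if it is too short."""
    embedded = read_embedded_text(data).strip()
    if len(embedded) >= min_chars:
        return Extraction(text=embedded, used_ocr=False)
    # Scanned PDFs usually carry little or no embedded text, so OCR the raw bytes.
    return Extraction(text=run_ocr(data).strip(), used_ocr=True)


if __name__ == "__main__":
    # Stub extractors stand in for a real PDF text reader and an OCR call.
    scanned = extract_with_fallback(
        b"%PDF-...",
        read_embedded_text=lambda _: "  ",
        run_ocr=lambda _: "text recovered by OCR",
    )
    print(scanned.used_ocr, scanned.text)
```

Passing the two extractors in as callables keeps the sketch independent of any particular PDF or OCR library; in practice they would wrap whichever text extractor and OCR tool are actually in use.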