Content extraction with built-in sanitization via hanzo-guard
RAG-based codebase indexing and semantic search - dual-purpose library and MCP server
Extractous provides a fast and efficient way to extract content from all kinds of file formats, including PDF, Word, Excel, CSV, email, etc. Internally it uses a natively compiled Apache Tika for formats that are not supported natively by the Rust core.
Extract indicators (IP, domain, email, hashes, etc.) from a string or a PDF file
PDF → Markdown extractor with figure rasterization, table & banner detection. Built on pdfium-render.
A preprocessor for text and HTML corpora
A pure-Rust PDF library — create, parse, and render PDF documents with zero C dependencies
Extract plain text from HTML, PDF, and other document formats
A crate for interpreting PDF files.
nosy: a summarization tool for various kinds of content, powered by artificial intelligence
Get markdown out of any document — Pandoc + pdfium + platform-native OCR, dispatched per format.
Tools to parse Screenplay-formatted documents into semantically-typed structs.
PDFTk wrapper to extract form fields
Nameday data extraction from Valsts valodas centrs PDF
Extracts tables from PDF text using spacing and position heuristics.
No description provided.