Node module that can OCR PDFs that are not searchable
Ultra-fast, offline, and free PDF OCR using native macOS Vision Framework and PDFKit. Supports Vietnamese & English.
The most powerful PDF toolkit for n8n — HTML to PDF, sign PDF, OCR, extract tables, merge/split, compress, convert to Excel/Word/PPTX, and 30+ more operations
A high-performance, parallelized PDF OCR tool using Tesseract.js WASM
```bash
npm i node-pdf-ocr
```
CoffeeScript lib for PDF OCR and text extraction.
JavaScript-only library to perform OCR on scanned PDFs to turn them into searchable PDFs
High-quality OCR and text extraction for images and PDFs.
The Adobe PDF Services Node.js SDK provides APIs for creating, combining, exporting and manipulating PDFs.
A CLI tool for OCR processing of PDF files using Mistral API with optional LLM verification
A Node.js wrapper for the opendataloader-pdf Java CLI.
A robust, strictly-typed Node.js and Browser library for parsing office files (.docx, .pptx, .xlsx, .odt, .odp, .ods, .pdf, .rtf, .csv, .md, .html) and generating high-fidelity outputs in Markdown, HTML, CSV, RTF, and RAG-focused chunks.
Read text and parse tables from PDF files. Supports tabular data with automatic column detection, and rule-based parsing.
Fast PDF classification and text extraction. Detect text-based vs scanned PDFs, extract text by region with quality checks. Native Rust performance via napi-rs.
A Node.js wrapper for the Tesseract OCR API
Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable PDF files and performing OCR on PDFs that are just scanned images of text.
n8n community node to convert HTML and CSS to PDF using PdfMunk API - perfect for invoices, reports, certificates, and document generation
A library for loading PDFs and using OCR with Tesseract.js
Extract text from scanned PDF documents using OCR powered by PDF API Hub
Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run directly in your browser or in Node!
Display PDFs in your React app as easily as if they were images.
A simple wrapper around command-line utils to assist in PDF / Image OCR (Optical Character Recognition) processing using Tesseract.
Create and modify PDF files with JavaScript
OCR integration for scanned PDFs with pluggable engine support
OCR engine trait + HTTP and Tesseract implementations.
OCR is a Ruby gem that allows you to easily extract text from image files (JPG, PNG, PDF) using Tesseract OCR engine. It provides a simple, intuitive interface for integrating OCR capabilities into your Ruby or Rails applications.
Extracts text from PDF files using Tesseract; the text is added to the PDF as a background layer.
A utility gem for processing PDFs for OCR and TEI
Native Ruby gem for parsing documents (PDF, DOCX, XLSX, images with OCR) with zero runtime dependencies. Statically links MuPDF for PDF extraction and Tesseract for OCR.
Tahweel is a tool for converting PDF files to txt, docx, or json using OCR through multiple engines, currently supporting Google Drive only.
A tool to combine PDF tools, OCR tools and image processing into a single interface as both a CLI and a library.
Copyleaks detects plagiarism and checks content distribution online. Use Copyleaks to find out whether textual content is original and whether it has been used before. With Copyleaks cloud, publishers, academics, and more can scan files (pdf, doc, docx, ocr...), URLs, and free text for plagiarism.
Extract text from images and PDFs stored in Active Storage using a high-performance Rust OCR server
Provides PDF outline extraction, precision page numbering, and OCR via HexaPDF. AGPL-3.0 licensed.
Kreuzberg is a high-performance document intelligence library with a Rust core and native Ruby bindings via Magnus. Extract text, metadata, and structured data from 75+ file formats including PDF, DOCX, PPTX, XLSX, HTML, RTF, images (with OCR), email, archives, and more. Features async/sync APIs, text chunking, language detection, and keyword extraction.
Ruby SDK for the Convertorio API. Convert files between 20+ formats with AI-powered OCR for text extraction. Supports JPG, PNG, WebP, AVIF, HEIC, GIF, BMP, TIFF, ICO, PDF, and more.