BETAmodules.com is in beta — open to partnerships & joint ventures.Build with us

scrapetor

v0.2.0RubyGems· Ruby

Scrapetor is a Ruby HTML parsing + scraping toolkit. The parser is a native C arena DOM with structural indexes built at parse time and NEON SIMD scanners in the SAX hot loop. A streaming extraction engine compiles the schema DSL into a single forward pass — no DOM materialised, one Ruby boundary crossing per document. On builds where libcurl is available, Scrapetor::Fetcher adds an HTTP/2-capable fetch layer with per-thread connection cache, shared DNS + TLS session pool, in-process gzip / deflate / brotli / zstd decoding, iconv charset transcoding, retry + exponential backoff, ETag / Last-Modified disk cache with bulk revalidation, per-host throttle, cookie jar, basic + bearer auth, proxy, and three bulk concurrency models (parallel_fetch / multi_fetch / streaming multi_each). Scrapetor::Session ties the cookie / auth / throttle / retry policies together. Also ships robots.txt + sitemap.xml parsers, a bounded-memory streaming HTML parser, and structured-data extractors (JSON-LD, OpenGraph, Schema.org, Microdata, RDFa, Twitter Cards). The Net::HTTP-based Scrapetor.fetch is preserved as the no-libcurl fallback.

The verdict
Maintained. Niche but maintained, actively maintained.
Live from the RubyGems registry · derived rules, not AI
How it scores
MaintenanceHealthy
PopularityNiche
SecurityClean
LicensePermissive
DepsZero deps
Maintenance
Last published this month.
Popularity
162 downloads / week
Security
No known advisories for this version (OSV).
License
MIT
Dependencies
No runtime dependencies
Recent releases
  • 0.2.0this month
scrapetor — Scrapetor is a Ruby HTML parsing + scraping toolkit. The parser is a native C arena DOM with structural indexes built at parse time and NEON SIMD scanners in the SAX hot loop. A streaming extraction engine compiles the schema DSL into a single forward pass — no DOM materialised, one Ruby boundary crossing per document. On builds where libcurl is available, Scrapetor::Fetcher adds an HTTP/2-capable fetch layer with per-thread connection cache, shared DNS + TLS session pool, in-process gzip / deflate / brotli / zstd decoding, iconv charset transcoding, retry + exponential backoff, ETag / Last-Modified disk cache with bulk revalidation, per-host throttle, cookie jar, basic + bearer auth, proxy, and three bulk concurrency models (parallel_fetch / multi_fetch / streaming multi_each). Scrapetor::Session ties the cookie / auth / throttle / retry policies together. Also ships robots.txt + sitemap.xml parsers, a bounded-memory streaming HTML parser, and structured-data extractors (JSON-LD, OpenGraph, Schema.org, Microdata, RDFa, Twitter Cards). The Net::HTTP-based Scrapetor.fetch is preserved as the no-libcurl fallback. (Ruby / RubyGems) · Modules