A library to crawl and extract cleaned HTML content from URLs.
URL crawler for analysing web content
The fastest directory crawler & globbing alternative to glob, fast-glob, & tiny-glob. Crawls 1m files in < 1s
TarzanDB URL crawler core package
This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, as a single JSON file.
A DOM implementation based on triple-linked lists
Used to run a web crawler that checks for errors on specified pages.
Express middleware for serving prerendered JavaScript-rendered pages for SEO
A library to recursively retrieve and serialize Notion pages with customization for machine learning applications.
A module for crawling thredds catalogs
Analyzes license information for multiple node.js modules (package.json files) as part of your software project.
This is an ES6 adaptation of the original PHP library CrawlerDetect. It helps you detect bots, crawlers, and spiders via the user agent.
Very straightforward, event-driven web crawler. Features a flexible queue interface and a basic cache mechanism with an extensible backend.
Inspecting Node.js's Network with Chrome DevTools
A CLI tool to crawl documentation sites and create a search index for Upstash Search.
Device detection module for Nuxt
Crawler is a ready-to-use web spider that supports proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2.
x-ray's crawler
An implementation of the WHATWG URL Standard's URL API and parsing machinery
Create xml sitemaps from the command line.
HTTP request module customized for crawlers.
A web crawler that works with prember to discover URLs in your app
Encode a URL to a percent-encoded form, excluding already-encoded sequences
AST codebase scanner and URL crawler for longcelot-seo
Does things
Traverses all HTML files from given directory and checks links found in them.
Stupid crawler that looks for URLs on a given site. Results are saved as two CSV files: one with found URLs and another with failed URLs.
Retrieves a list of URLs to seed the crawler by publishing them to a RabbitMQ exchange.
Post URLs to Wayback Machine (Internet Archive), using a crawler, from Sitemap(s) or a list of URLs.
validate-website is a web crawler that checks markup validity against XML Schema / DTD and reports not-found URLs.
A generic web crawler that doesn't crawl outside URLs.
Arachnid is a web crawler that relies on Bloom Filters to efficiently store visited URLs and Typhoeus to avoid the overhead of Mechanize when crawling every page on a domain.
Arachnidish is a web crawler that relies on Bloom Filters to efficiently store visited URLs and Typhoeus to avoid the overhead of Mechanize when crawling every page on a domain.
livedoor-feeddiscover performs feed autodiscovery using the livedoor Feed Discover API. The livedoor Feed Discover API finds Atom/RSS feeds in the livedoor Reader crawler database, so livedoor-feeddiscover does not access the target URL.
== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

Medusa is a framework for the Ruby language to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

=== Features

* Choose the links to follow on each page with +focus_crawl+
* Multi-threaded design for high performance
* Tracks +301+ HTTP redirects
* Allows exclusion of URLs based on regular expressions
* Records response time for each page
* Obeys _robots.txt_ directives (optional, but recommended)
* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).

<b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>

=== Examples

Medusa is versatile and meant to be used programmatically. You can start with one or multiple URIs:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2)

Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
    crawler.discard_page_bodies = some_flag

    # Persist all the pages state across crawl-runs.
    crawler.clear_on_startup = false
    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')

    crawler.skip_links_like(/private/)

    crawler.on_pages_like(/public/) do |page|
      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
    end

    # Use arbitrary logic, page by page, to keep customizing the crawl.
    crawler.focus_crawl(/public/) do |page|
      page.links.first
    end
  end
No description provided.
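
One entry above describes encoding a URL to a percent-encoded form while leaving already-encoded sequences alone. A minimal Ruby sketch of that idea follows. The name +encode_url+ and the +SAFE_CHARS+ set are assumptions made for illustration, not the API of any package listed here. The sketch treats the RFC 3986 reserved and unreserved characters as safe, keeps any existing <tt>%XX</tt> escape untouched, and encodes everything else (including a lone <tt>%</tt>) byte by byte as UTF-8.

```ruby
# Hypothetical helper for illustration only; not taken from any listed package.
# Characters passed through unchanged: RFC 3986 unreserved plus reserved.
SAFE_CHARS = (('A'..'Z').to_a + ('a'..'z').to_a + ('0'..'9').to_a).join +
             %q{-._~:/?#[]@!$&'()*+,;=}

def encode_url(url)
  # Each match is either an existing %XX escape (3 characters) or one character.
  url.gsub(/%[0-9A-Fa-f]{2}|./m) do |token|
    if token.length == 3 || SAFE_CHARS.include?(token)
      token # already encoded, or safe to leave as-is
    else
      # Percent-encode every byte of the character (a lone "%" becomes "%25").
      token.bytes.map { |b| format('%%%02X', b) }.join
    end
  end
end

puts encode_url('http://example.com/a b?q=%20x%')
```

Because existing escapes are preserved, running the helper on its own output should leave the string unchanged, which is the property that makes it safe to apply to URLs of unknown origin.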