Release history

Changelog

Notable changes to Crawlboy, the Python CLI for sitemap-to-Markdown crawling. The format follows Keep a Changelog, and version numbers follow Semantic Versioning.

View all releases on GitHub

v1.2.0

Latest
Changed
  • Crawl4AI — minimum dependency raised to >=0.8.6 (unclecode/crawl4ai 0.8.x: security fixes, MarkdownGenerationResult API, Playwright/patchright updates)
  • lxml — CI and Docker builds now upgrade lxml to >=6.1.0 after the main pip install to address PYSEC-2026-87; the workaround remains until crawl4ai relaxes its lxml~=5.3 pin
Install
$ pip install crawlboy==1.2.0
$ crawl4ai-setup

After install, upgrade lxml if you run your own security audits: pip install 'lxml>=6.1.0'
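
To confirm that the upgrade took effect, a check along these lines can be run in the same environment. This is a minimal sketch: `version_tuple` and `lxml_is_patched` are illustrative names rather than part of Crawlboy, and the parser only looks at the leading numeric part of the version string, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata
from typing import Optional

MIN_LXML = (6, 1, 0)  # floor taken from the note above


def version_tuple(version: str) -> tuple:
    """Turn '6.1.0' (or '6.1', '6.1.0rc1') into a 3-part integer tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return (0, 0, 0)
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad '6.1' to (6, 1, 0)
    return tuple(parts)


def lxml_is_patched(installed: Optional[str] = None) -> bool:
    """Return True if the given (or installed) lxml version meets the floor."""
    if installed is None:
        try:
            installed = metadata.version("lxml")
        except metadata.PackageNotFoundError:
            return False  # lxml is not installed at all
    return version_tuple(installed) >= MIN_LXML


if __name__ == "__main__":
    print("lxml patched:", lxml_is_patched())
```

Calling `lxml_is_patched()` with no argument inspects the active environment; passing a version string makes the comparison easy to exercise in isolation.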

v1.1.0

Added
  • --meta-frontmatter — optional YAML frontmatter on each Markdown file with source_url and extracted HTML metadata (title, canonical, meta name / property / http-equiv), plus matching interactive wizard option
  • PyYAML — dependency for frontmatter serialization
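
The kind of metadata listed above can be collected with the standard library alone. The sketch below is an assumption about the general approach, not Crawlboy's actual implementation: the `MetaCollector` class, the `extract_frontmatter_fields` helper, and the nested `meta` layout are all hypothetical. The resulting dictionary is the sort of structure that PyYAML could then serialize as frontmatter.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <title>, the canonical link, and <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.canonical = None
        # One bucket per attribute family named in the changelog entry.
        self.meta = {"name": {}, "property": {}, "http-equiv": {}}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel and attrs.get("href"):
                self.canonical = attrs["href"]
        elif tag == "meta":
            content = attrs.get("content")
            if content is None:
                return
            for family in self.meta:
                key = attrs.get(family)
                if key:
                    self.meta[family][key] = content

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip() or None


def extract_frontmatter_fields(source_url: str, html: str) -> dict:
    """Build a plain dict suitable for YAML serialization (layout is assumed)."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return {
        "source_url": source_url,
        "title": collector.title,
        "canonical": collector.canonical,
        "meta": collector.meta,
    }
```

Keeping extraction separate from serialization means the same dictionary could be written as YAML frontmatter or inspected directly in tests.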

v1.0.0

Added
  • Sitemap crawling — sequentially crawls every URL from XML sitemaps with Crawl4AI
  • Nested sitemap support — recursively follows <sitemapindex> entries
  • Markdown output — converts crawled pages to Markdown, one file per URL with mirrored directory structure
  • HTML export — optional --save-html flag to preserve raw HTML alongside Markdown
  • Image download — --download-images to save media locally with content-addressed filenames (deduped across crawl) and automatic path rewriting in Markdown and HTML
  • Automatic sitemap discovery — auto-detects sitemap from robots.txt or common paths (/sitemap.xml, /sitemap_index.xml, etc.)
  • Interactive CLI — guided wizard with questionary and Rich for easy configuration
  • Flexible URL modes — direct sitemap URL (--sitemap-url) or site root discovery (--site-url)
  • Host filtering — only URLs on the site's own origin are crawled by default; pass --include-offsite-urls to crawl every listed URL
  • Error logging — failures logged to errors.jsonl with paths and error details
  • Performance tuning — configurable per-page delay, page timeout, and max URL limit
  • Browser control — --no-headless to show browser window for debugging
  • Docker support — includes Dockerfile for containerized execution with pre-installed browser dependencies
  • Fail-fast mode — --fail-fast to stop on first error for rapid iteration
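
Two of the items above, mirrored output paths and content-addressed image names, can be sketched as follows. The function names, the `index` fallback for directory-style URLs, and the choice of SHA-256 are assumptions made for illustration; the changelog does not state how Crawlboy maps URLs to paths or which hash it uses.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def markdown_path_for(url: str) -> PurePosixPath:
    """Map a page URL to a relative .md path that mirrors the URL path."""
    raw = urlsplit(url).path
    # Drop empty and dot segments so the result stays inside the output root.
    segments = [s for s in raw.split("/") if s not in ("", ".", "..")]
    if not segments or raw.endswith("/"):
        segments.append("index")  # assumed name for directory-style URLs
    segments[-1] += ".md"
    return PurePosixPath(*segments)


def content_addressed_name(data: bytes, source_url: str) -> str:
    """Name a downloaded file by the hash of its bytes, keeping the extension."""
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    suffix = PurePosixPath(urlsplit(source_url).path).suffix.lower()
    return digest + suffix


if __name__ == "__main__":
    print(markdown_path_for("https://example.com/docs/intro/"))
    print(content_addressed_name(b"fake image bytes", "https://example.com/a/logo.PNG"))
```

Because the name depends only on the file's bytes, the same image referenced from many pages resolves to one file on disk, which is what makes deduplication across a crawl straightforward.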
Technical Details
  • Built with Crawl4AI for intelligent page crawling
  • Requires Python 3.10+
  • Installable via pip install crawlboy
  • CLI entry point: crawlboy
  • Docker image based on Playwright Python for browser automation
  • Async-first architecture for efficient crawling
  • XML namespace-safe sitemap parsing
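
Namespace-safe parsing and recursive handling of `<sitemapindex>` files can be illustrated with the standard library. This is a hedged sketch rather than Crawlboy's code: `local_name`, `parse_sitemap`, and `collect_urls` are hypothetical helpers, and `fetch` stands in for whatever function downloads a sitemap and returns its text.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set, Tuple


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so matching works with or without one."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> Tuple[str, List[str]]:
    """Return the root element kind and every direct <loc> value."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # <url> or <sitemap> entries
        for child in entry:
            if local_name(child.tag) == "loc":
                text = (child.text or "").strip()
                if text:
                    locs.append(text)
    return local_name(root.tag), locs


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Follow nested sitemap indexes and return page URLs in document order."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(sitemap_url)
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind != "sitemapindex":
        return locs
    urls: List[str] = []
    for child_url in locs:
        urls.extend(collect_urls(child_url, fetch, seen))
    return urls
```

Matching on the local tag name avoids hard-coding the sitemap namespace URI, so files that omit the namespace or use a different prefix are parsed the same way.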