Add discover valid sitemaps utility (port from JS) #1740

@vdusek

Summary

Port the discoverValidSitemaps() utility from Crawlee JS to Python.

JS source: packages/utils/src/internals/sitemap.ts#3392

How it works in JS

```typescript
async function* discoverValidSitemaps(
    urls: string[],
    options?: { proxyUrl?: string; httpClient?: BaseHttpClient }
): AsyncIterable<string>
```

  1. Group input URLs by hostname
  2. For each domain, discover sitemaps from (in order):
    • Sitemap: entries in robots.txt
    • Input URLs that match /sitemap\.(xml|txt)(\.gz)?$/i
    • HEAD-request probing of /sitemap.xml, /sitemap.txt, /sitemap_index.xml (fallback)
  3. Deduplicate and process domains concurrently

Returns an async iterable yielding sitemap URLs as discovered.
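
For reference, here is a minimal Python sketch of the same flow. It is illustrative only: it uses httpx and plain asyncio instead of Crawlee's HTTP client abstraction, assumes `https://` when probing a hostname, and the names (`discover_valid_sitemaps`, `process_domain`) are placeholders rather than a final API.

```python
# Illustrative sketch only: uses httpx + asyncio directly instead of Crawlee's
# HTTP client abstraction, and assumes https:// when probing a hostname.
import asyncio
import re
from collections import defaultdict
from collections.abc import AsyncIterator
from urllib.parse import urlparse

import httpx

_SITEMAP_URL_RE = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)
_COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')


async def discover_valid_sitemaps(urls: list[str]) -> AsyncIterator[str]:
    """Yield sitemap URLs for the domains of the given URLs as they are discovered."""
    # 1. Group the input URLs by hostname.
    by_host: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        hostname = urlparse(url).hostname
        if hostname:
            by_host[hostname].append(url)

    queue: asyncio.Queue[str | None] = asyncio.Queue()

    async def process_domain(hostname: str, domain_urls: list[str]) -> None:
        found: set[str] = set()
        base = f'https://{hostname}'  # Scheme handling simplified for the sketch.

        async with httpx.AsyncClient(timeout=10) as client:
            # 2a. Sitemap: entries declared in robots.txt.
            try:
                response = await client.get(f'{base}/robots.txt')
                if response.status_code == 200:
                    for line in response.text.splitlines():
                        if line.lower().startswith('sitemap:'):
                            found.add(line.split(':', 1)[1].strip())
            except httpx.HTTPError:
                pass

            # 2b. Input URLs that already look like sitemap files.
            found.update(url for url in domain_urls if _SITEMAP_URL_RE.search(url))

            # 2c. Fallback: HEAD-probe the common sitemap locations.
            if not found:
                for path in _COMMON_SITEMAP_PATHS:
                    candidate = f'{base}{path}'
                    try:
                        if (await client.head(candidate)).status_code == 200:
                            found.add(candidate)
                    except httpx.HTTPError:
                        pass

        for sitemap_url in found:
            await queue.put(sitemap_url)

    # 3. Process all domains concurrently; yield results as they arrive.
    async def run_all() -> None:
        await asyncio.gather(*(process_domain(h, u) for h, u in by_host.items()))
        await queue.put(None)  # Sentinel: every domain has been processed.

    runner = asyncio.create_task(run_all())
    emitted: set[str] = set()
    while (item := await queue.get()) is not None:
        if item not in emitted:  # Deduplicate across the whole run.
            emitted.add(item)
            yield item
    await runner
```

Usage would mirror the JS async iterable: `async for sitemap_url in discover_valid_sitemaps(urls): ...`.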

What Python already has

  • Sitemap.try_common_names() — probes /sitemap.xml and /sitemap.txt for a single URL (missing /sitemap_index.xml)
  • RobotsTxtFile.find() + get_sitemaps() — fetches and extracts Sitemap: entries from robots.txt

What's missing: the orchestrating function that combines these steps, groups by hostname, validates via HEAD requests, detects direct sitemap URLs from input, and processes domains concurrently.
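
To illustrate where the existing building blocks slot in, here is a hedged sketch of the robots.txt step. The import path and the exact call signature of RobotsTxtFile.find() are assumptions (in the current codebase it may also require an HTTP client or proxy configuration); only the method names confirmed above are relied on.

```python
# Hedged sketch: the import path and the signature of RobotsTxtFile.find() are
# assumptions; the real method may require an HTTP client or proxy arguments.
from crawlee._utils.robots import RobotsTxtFile


async def sitemaps_from_robots(domain_url: str) -> list[str]:
    # Fetch and parse robots.txt for the given domain, then return any
    # `Sitemap:` entries it declares.
    robots = await RobotsTxtFile.find(domain_url)
    return robots.get_sitemaps()
```

Extending Sitemap.try_common_names() to also cover /sitemap_index.xml would round out the fallback step.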

Metadata

Labels

enhancement (New feature or request), t-tooling (Issues with this label are in the ownership of the tooling team.)
