Status: Open
Labels: enhancement (new feature or request), t-tooling (issues owned by the tooling team)
Description
Summary
Port the discoverValidSitemaps() utility from Crawlee JS to Python.
JS source: packages/utils/src/internals/sitemap.ts — #3392
How it works in JS
async function* discoverValidSitemaps(
    urls: string[],
    options?: { proxyUrl?: string; httpClient?: BaseHttpClient }
): AsyncIterable<string>

- Group input URLs by hostname
- For each domain, discover sitemaps from (in order):
  - Sitemap: entries in robots.txt
  - Input URLs that match /sitemap\.(xml|txt)(\.gz)?$/i
  - HEAD-request probing of /sitemap.xml, /sitemap.txt, /sitemap_index.xml (fallback)
- Deduplicate and process domains concurrently
Returns an async iterable yielding sitemap URLs as discovered.
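The two purely local steps above (grouping by hostname and spotting input URLs that already point at a sitemap) could look roughly like this in Python. This is a minimal sketch using only the standard library; the names `group_by_hostname`, `is_direct_sitemap_url` and `SITEMAP_URL_PATTERN` are hypothetical and not part of Crawlee. It also assumes the regex is applied to the URL path rather than to the full URL string.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Python counterpart of the JS pattern /sitemap\.(xml|txt)(\.gz)?$/i
# (hypothetical constant name).
SITEMAP_URL_PATTERN = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)


def group_by_hostname(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by hostname, skipping inputs that have no hostname."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname:
            groups[hostname].append(url)
    return dict(groups)


def is_direct_sitemap_url(url: str) -> bool:
    """Return True if the URL path looks like a sitemap file.

    Assumption: the pattern is matched against the path only, so a query
    string after the file name does not prevent a match.
    """
    return SITEMAP_URL_PATTERN.search(urlsplit(url).path) is not None
```

Matching against the path is a design choice of this sketch; the port should follow whatever the JS implementation actually tests the pattern against.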
What Python already has
- Sitemap.try_common_names(): probes /sitemap.xml and /sitemap.txt for a single URL (missing /sitemap_index.xml)
- RobotsTxtFile.find() + get_sitemaps(): fetches and extracts Sitemap: entries from robots.txt
What's missing: the orchestrating function that combines these steps, groups by hostname, validates via HEAD requests, detects direct sitemap URLs from input, and processes domains concurrently.
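One possible shape for that orchestrating function is sketched below. It is deliberately decoupled from Crawlee: the robots.txt lookup and the HEAD check are injected as callables (`fetch_robots_sitemaps` and `url_exists`, both hypothetical), so the real port would wire in `RobotsTxtFile` and an HTTP client instead. Two behaviours are assumptions rather than facts from the JS source: "fallback" is read as "probe common paths only when nothing else was found for the domain", and only the probed paths are checked with HEAD requests.

```python
import asyncio
import re
from collections.abc import AsyncIterator, Awaitable, Callable
from urllib.parse import urlsplit

COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')
SITEMAP_URL_PATTERN = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)


async def discover_valid_sitemaps(
    urls: list[str],
    *,
    fetch_robots_sitemaps: Callable[[str], Awaitable[list[str]]],  # hypothetical hook
    url_exists: Callable[[str], Awaitable[bool]],  # hypothetical HEAD check
) -> AsyncIterator[str]:
    """Yield sitemap URLs as they are discovered, one task per hostname."""
    groups: dict[str, list[str]] = {}
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname:
            groups.setdefault(hostname, []).append(url)

    queue: asyncio.Queue = asyncio.Queue()  # carries URLs, then None as end marker
    seen: set[str] = set()

    async def emit(url: str) -> None:
        # Deduplicate across all domains before handing the URL to the consumer.
        if url not in seen:
            seen.add(url)
            await queue.put(url)

    async def process_domain(domain_urls: list[str]) -> None:
        first = urlsplit(domain_urls[0])
        origin = f'{first.scheme}://{first.netloc}'
        found = False
        # 1. Sitemap: entries from robots.txt
        for url in await fetch_robots_sitemaps(origin):
            await emit(url)
            found = True
        # 2. Input URLs that already look like sitemap files
        for url in domain_urls:
            if SITEMAP_URL_PATTERN.search(urlsplit(url).path):
                await emit(url)
                found = True
        # 3. Fallback (assumed): probe common locations only if nothing was found
        if not found:
            for path in COMMON_SITEMAP_PATHS:
                candidate = origin + path
                if await url_exists(candidate):
                    await emit(candidate)

    async def run_all() -> None:
        try:
            await asyncio.gather(*(process_domain(g) for g in groups.values()))
        finally:
            await queue.put(None)  # always signal the end of the stream

    task = asyncio.create_task(run_all())
    try:
        while (item := await queue.get()) is not None:
            yield item
        await task  # re-raise any error from the domain tasks
    finally:
        task.cancel()  # no-op if finished; stops work if the consumer exits early
```

The queue lets results from different domains reach the caller in discovery order rather than waiting for every domain to finish, which mirrors the "yielding sitemap URLs as discovered" behaviour described for the JS version.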