Add CSS background image extraction feature (Issue #1691) #1702
Open
YxmMyth wants to merge 1 commit into unclecode:main
This commit adds support for extracting CSS background images during crawling, addressing issue unclecode#1691 where background images were being skipped.

## Changes

### New Files

- crawl4ai/js_snippet/extract_css_backgrounds.js: JavaScript snippet that extracts background images from computed styles in the browser

### Modified Files

- crawl4ai/models.py:
  - Added `css_images` field to Media class
  - Added `css_images_data` field to AsyncCrawlResponse
- crawl4ai/async_configs.py:
  - Added CSS background image configuration parameters to CrawlerRunConfig:
    - extract_css_images (bool, default False)
    - css_image_min_width (int, default 100)
    - css_image_min_height (int, default 100)
    - css_image_score_threshold (int, default 2)
    - css_exclude_repeating (bool, default True)
- crawl4ai/content_scraping_strategy.py:
  - Added process_css_background_images() method
  - Integrated CSS image extraction into _process_element()
  - Added css_images to media dictionary
- crawl4ai/async_crawler_strategy.py:
  - Added JavaScript execution in _crawl_web() to extract CSS backgrounds
  - Included css_images_data in AsyncCrawlResponse
- crawl4ai/async_webcrawler.py:
  - Modified aprocess_html() to accept and pass css_images_data
  - Added Dict type import

## Features

- Extracts background images set by both inline styles and stylesheets
- Uses window.getComputedStyle() so the extracted value reflects the style the browser actually applied
- Filters out small elements and repeating patterns
- Scores each image based on element size and properties
- Opt-in (disabled by default) for backward compatibility
- Stores results separately in media.css_images

## Usage

```python
result = await crawler.arun(
    url="https://example.com",
    extract_css_images=True,
    css_image_min_width=100,
    css_image_min_height=100,
)
css_images = result.media.get('css_images', [])
```

Closes unclecode#1691

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
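For review purposes, the filtering and scoring listed under Features could look roughly like the sketch below. It is not the code in this PR. The record keys (`width`, `height`, `background_repeat`, `background_size`), the `CssImageFilter` class, and the scoring heuristic are all assumptions made for illustration; only the four option names and their defaults mirror the CrawlerRunConfig parameters above.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Matches url(...) targets in a computed `background-image` value,
# with or without surrounding quotes.
_URL_RE = re.compile(r'url\(\s*(["\']?)(.*?)\1\s*\)')


@dataclass
class CssImageFilter:
    """Hypothetical stand-in for the css_* options on CrawlerRunConfig."""
    min_width: int = 100
    min_height: int = 100
    score_threshold: int = 2
    exclude_repeating: bool = True


def extract_urls(background_image: str) -> List[str]:
    """Return every url(...) target; 'none' and gradients yield an empty list."""
    return [m.group(2) for m in _URL_RE.finditer(background_image)]


def score(record: Dict) -> int:
    """Assumed heuristic: one point per signal that the image is content."""
    points = 0
    if record["width"] * record["height"] >= 200 * 200:
        points += 1
    if record.get("background_size") in ("cover", "contain"):
        points += 1
    if record.get("background_repeat", "repeat") == "no-repeat":
        points += 1
    return points


def filter_css_images(records: List[Dict], cfg: CssImageFilter) -> List[Dict]:
    """Drop small elements, tiled patterns, and low-scoring images."""
    kept = []
    for record in records:
        if record["width"] < cfg.min_width or record["height"] < cfg.min_height:
            continue
        repeat = record.get("background_repeat", "repeat")
        if cfg.exclude_repeating and repeat in ("repeat", "repeat-x", "repeat-y"):
            continue
        if score(record) < cfg.score_threshold:
            continue
        kept.append(record)
    return kept
```

One point worth checking in the real implementation: the computed `background-repeat` defaults to `repeat`, so a hero image that relies on `background-size: cover` without an explicit `no-repeat` would be discarded by a filter as strict as this one.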