Web Scraping Using Image Processing

Published at: 2/14/2024
Categories: scraping, screenshot, python, automation
Author: hessler5

My final project for my software engineering bootcamp was a web scraping site that uses image processing to scrape images from a given URL. The idea came while I was looking for a way to gather a large set of images to use as a machine learning data set. I knew that scraping would be the way to go for this kind of collection, but I was unhappy with how brittle traditional web scrapers are.

The scraper's back end uses Selenium, Pillow, and ChromeDriver to accomplish its task. First, Selenium opens the page using ChromeDriver in headless mode and injects custom CSS into the DOM. The custom CSS draws a colored border around each image and places a color key in the top left corner of the page. Additional CSS makes sure that all images, as well as the color key, render on top of the rest of the page. The color key is necessary because Pillow reads the RGB values of pixels differently from how they render on the webpage, so the key serves as a reference for the border color as it actually appears in the screenshot. Once the CSS is injected, Selenium takes a screenshot of the entire page.

Website screenshot with CSS injection
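
The injection step described above could be sketched roughly as follows. This is not the project's actual code: the function names, the marker color, the border and key sizes, and the use of `body::before` for the color key are all assumptions made for illustration. Resizing the window to the document height is one common way to approximate a full-page screenshot in headless Chrome.

```python
KEY_COLOR = "rgb(255, 0, 255)"  # hypothetical marker color, chosen to be rare on real pages


def build_marker_css(color=KEY_COLOR, border_px=3, key_px=10):
    """Return CSS that borders every <img> and draws a color key at the top left."""
    top = "2147483647"  # a very large z-index so markers render above page content
    return (
        f"img {{ border: {border_px}px solid {color} !important; "
        f"position: relative !important; z-index: {top} !important; }}\n"
        f"body::before {{ content: ''; position: absolute; top: 0; left: 0; "
        f"width: {key_px}px; height: {key_px}px; background: {color}; "
        f"z-index: {top}; }}"
    )


def capture_marked_page(url, screenshot_path, width=1280):
    """Open url in headless Chrome, inject the marker CSS, and save a screenshot."""
    # Imported here so build_marker_css stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Append a <style> element carrying the marker CSS to the page.
        driver.execute_script(
            "const s = document.createElement('style');"
            "s.textContent = arguments[0];"
            "document.head.appendChild(s);",
            build_marker_css(),
        )
        # Grow the window to the document height so the whole page is captured.
        height = driver.execute_script(
            "return document.documentElement.scrollHeight"
        )
        driver.set_window_size(width, height)
        driver.save_screenshot(screenshot_path)
    finally:
        driver.quit()
```

Calling `capture_marked_page("https://example.com", "page.png")` would then produce the marked screenshot, assuming Chrome and a matching driver are available on the machine.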

The screenshot is then processed with the Python Pillow library, which scans each pixel to see whether it matches the color key in the top left corner. When a pixel matches the key, an algorithm checks whether that pixel is the start of an image. If an image is detected, its height and width are measured and then used to crop the page screenshot into the desired sub-image.
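
A minimal sketch of this scan is shown below. The original algorithm is not published in this post, so the corner test used here (a key-colored pixel with no key-colored neighbor to its left or above) is an assumption, as are the function names, the `key_size`, `min_size`, and `border` parameters, and the premise that the color key is a solid square at the top left. The scanning function only needs an object indexable as `pixels[x, y]`, which is what Pillow's `Image.load()` returns.

```python
def find_marked_boxes(pixels, width, height, key_size=10, min_size=8):
    """Return (left, top, right, bottom) boxes for regions framed in the key color."""
    key = pixels[0, 0][:3]  # read the color as it actually appears in the screenshot

    def is_key(x, y):
        return 0 <= x < width and 0 <= y < height and pixels[x, y][:3] == key

    boxes = []
    for y in range(height):
        for x in range(width):
            if x < key_size and y < key_size:
                continue  # skip the color key swatch itself
            # A border's top-left corner has no key-colored pixel to its left or above.
            if not is_key(x, y) or is_key(x - 1, y) or is_key(x, y - 1):
                continue
            if any(l <= x < r and t <= y < b for l, t, r, b in boxes):
                continue  # key-colored pixel inside an image that was already found
            right = x
            while is_key(right, y):  # walk along the top edge to measure the width
                right += 1
            bottom = y
            while is_key(x, bottom):  # walk down the left edge to measure the height
                bottom += 1
            if right - x >= min_size and bottom - y >= min_size:
                boxes.append((x, y, right, bottom))
    return boxes


def crop_marked_images(screenshot_path, border=3):
    """Crop each framed region out of the screenshot, excluding the border."""
    from PIL import Image  # imported here so the scan above has no hard dependency

    image = Image.open(screenshot_path).convert("RGB")
    boxes = find_marked_boxes(image.load(), image.width, image.height)
    return [
        image.crop((l + border, t + border, r - border, b - border))
        for l, t, r, b in boxes
    ]
```

A pure-Python loop over every pixel is slow on tall screenshots, and the `min_size` filter is there only to discard stray pixels that happen to match the key color.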

The cropped images are then zipped up and sent to the front end, where a user can rename and download them. All images are deleted from the back end after the scrape is complete.

Pixel Harvester front end
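
The zipping step could look something like the sketch below. It assumes the cropped images are Pillow `Image` objects (anything with a compatible `save` method works) and builds the archive in memory, which is one way to avoid leaving files behind on the back end. The function name and the file naming scheme are hypothetical, and the post does not say how the project actually stores or deletes its images.

```python
import io
import zipfile


def zip_images(images, prefix="image"):
    """Pack images into an in-memory ZIP archive and return its bytes."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for index, image in enumerate(images, start=1):
            png = io.BytesIO()
            image.save(png, format="PNG")  # same call signature as Pillow's Image.save
            zf.writestr(f"{prefix}_{index}.png", png.getvalue())
    return archive.getvalue()
```

The returned bytes can be sent as the body of an HTTP response, leaving the renaming and download to the front end.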

Advantages

  1. The biggest advantage is that Pixel Harvester can work on any website that uses image tags, regardless of HTML structure or CSS selectors.

  2. This scraper captures every image on the webpage without having to visit each individual image link, which lowers the scraper's overall footprint on the website.

  3. The scraped images have the same resolution at which they appear on the website.

Challenges

  1. The biggest challenge this web scraper faces is the one all web scrapers face: bot detection. Because many websites do not render in headless mode, screenshots of those websites will be blank and there will be no images to capture. This scraper makes no attempt to subvert bot detection, in order to respect the terms of service of websites that do not wish to be scraped.

  2. Because websites are coded in so many different ways, some websites that appear to work at first will fail to produce the desired results. For example, if a website displays images through the CSS background-image property instead of HTML img tags, the scraper will not be able to detect those images and no results will be returned.
