
Python: Extract and download all images from a web page

Last updated: May 20, 2023

This example-based article walks you through two different ways to programmatically extract and download all images from a web page with Python. The first approach uses requests and beautifulsoup4, while the second one uses scrapy.

Using Requests and BeautifulSoup

In this example, we will get all images from this sample web page:

https://api.slingacademy.com/v1/examples/sample-page.html

Before writing any Python code, make sure the required libraries are installed (lxml is the HTML parser we'll pass to BeautifulSoup):

pip install requests beautifulsoup4 lxml

The complete code:

import os
from os.path import basename

import requests
from bs4 import BeautifulSoup

# We will extract and download all images from this page
url = "https://api.slingacademy.com/v1/examples/sample-page.html"

r = requests.get(url)
html = r.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "lxml")

# Find all image tags
images = soup.find_all("img")

# The images will be saved to a folder named "images"
# (auto-create it if it doesn't exist)
os.makedirs("images", exist_ok=True)

# Loop through the images and download them
for image in images:
  # Get the image source URL; skip tags without one
  src = image.get("src")
  if not src:
    continue

  with open(os.path.join("images", basename(src)), "wb") as f:
    f.write(requests.get(src).content)

print("All images downloaded!")

After running the code, you should see a folder named images in the same directory as your Python file. Inside that folder, you’ll see a couple of newly saved images. If not, please recheck your internet connection and make sure you didn’t make any typos in the code.
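One thing to keep in mind: the code above downloads each src as-is, which works because the sample page uses absolute URLs. On pages that reference images with relative paths (e.g. /public/img/photo.png), you would need to resolve each src against the page URL first. A minimal sketch using the standard library's urllib.parse.urljoin (the paths below are just illustrative):

```python
from urllib.parse import urljoin

page_url = "https://api.slingacademy.com/v1/examples/sample-page.html"

# A relative src is resolved against the page URL
print(urljoin(page_url, "/public/img/photo.png"))
# https://api.slingacademy.com/public/img/photo.png

# An absolute src is returned unchanged
print(urljoin(page_url, "https://example.com/a.png"))
# https://example.com/a.png
```

Passing every src through urljoin before downloading makes the script work on both kinds of pages.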

Using Scrapy

We will use the same target URL as the preceding example:

https://api.slingacademy.com/v1/examples/sample-page.html

Get scrapy installed by running this command:

pip install scrapy

The image pipeline requires the pillow package to work properly (even though we don't need to import it in our Python code). Scrapy uses it for thumbnailing and for normalizing images to JPEG/RGB format. Install pillow:

pip install pillow

Then use it like so:

# Import Scrapy and other libraries
import scrapy
from scrapy.crawler import CrawlerProcess

# Define an item class for images
class ImageItem(scrapy.Item):
  image_urls = scrapy.Field()
  images = scrapy.Field()

# Define a spider class for image scraping
class ImageSpider(scrapy.Spider):
  name = "image_spider"
  start_urls = ["https://api.slingacademy.com/v1/examples/sample-page.html"]

  custom_settings = {
    "IMAGES_STORE": "images",  # Store the images in the "images" folder
    "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
    "LOG_ENABLED": True,
    "LOG_FILE": "logs.txt",
    "LOG_LEVEL": "ERROR",
    "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
  }

  def parse(self, response):
    # Find all image tags
    images = response.xpath("//img")

    # Loop through the images and yield items
    for image in images:
      # Get the image source URL; skip tags without one
      src = image.xpath("@src").get()
      if not src:
        continue

      # Resolve relative URLs and yield an item with the image URL
      yield ImageItem(image_urls=[response.urljoin(src)])

# Create a crawler process with some settings
process = CrawlerProcess()
 
# Start the crawler with the spider
process.crawl(ImageSpider) 
process.start() 

The downloaded images will be located in <your project>/images/full. The tutorial ends here. Happy coding & have a nice day!
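Unlike the first approach, the image pipeline doesn't keep the original file names: by default, Scrapy names each downloaded image after the SHA-1 hash of its URL, stored under the full/ subfolder with a .jpg extension (a result of the JPEG normalization mentioned above). A small stdlib sketch of that naming scheme, so you know what files to expect:

```python
import hashlib

def image_file_path(url: str) -> str:
  # Mirrors ImagesPipeline's default naming: "full/" plus the
  # SHA-1 hash of the image URL (40 hex chars) plus ".jpg"
  image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
  return f"full/{image_guid}.jpg"

# Hypothetical image URL, for illustration only
path = image_file_path("https://api.slingacademy.com/public/sample.jpg")
print(path)
```

If you need human-readable names, you can subclass ImagesPipeline and override its file_path method, but the default hash-based scheme has the advantage of being deduplicated and collision-free.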
