Python: Extract and download all images from a web page

Updated: May 20, 2023 By: Frienzied Flame

This example-based article walks you through two ways to programmatically extract and download all images from a web page with Python. The first approach uses requests and beautifulsoup4, while the second uses scrapy.

Using Requests and BeautifulSoup

In this example, we will get all images from this sample web page:

https://api.slingacademy.com/v1/examples/sample-page.html

Before writing any Python code, install the required libraries:

pip install requests beautifulsoup4

The complete code:

import os
from os.path import basename
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# We will extract and download all images from this page
url = "https://api.slingacademy.com/v1/examples/sample-page.html"

r = requests.get(url, timeout=10)
r.raise_for_status()  # Stop here if the page could not be fetched
html = r.text

# Parse the HTML with Python's built-in parser,
# so no package beyond the two installed above is needed
soup = BeautifulSoup(html, "html.parser")

# Find all image tags
images = soup.find_all("img")

# Create the "images" folder once, if it doesn't exist yet
os.makedirs("images", exist_ok=True)

# Loop through the images and download them
for image in images:
  # Get the image source URL; skip tags that have no src attribute
  src = image.get("src")
  if not src:
    continue

  # Turn relative paths into absolute URLs based on the page URL
  src = urljoin(url, src)

  # Download the image and save it to the "images" folder
  with open(os.path.join("images", basename(src)), "wb") as f:
    f.write(requests.get(src, timeout=10).content)

print("All images downloaded!")

After running the code, you should see a folder named images in the directory you ran the script from. Inside that folder, you’ll find the images saved from the page. If not, recheck your internet connection and make sure there are no typos in the code.
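One thing to watch for when reusing this approach on other pages: the src attribute of an image tag is not always an absolute URL with a clean file name. It may be a relative path, carry a query string, or hold inline data. The following is a minimal sketch of one way to handle those cases with the standard library; resolve_image is a hypothetical helper name and the example.com URLs are placeholders, not part of the sample page.

```python
from os.path import basename
from urllib.parse import urljoin, urlparse


def resolve_image(page_url, src):
    """Return (absolute_url, filename) for an <img> src, or None to skip it."""
    # A missing src or an inline data URI gives us nothing to download
    if not src or src.startswith("data:"):
        return None

    # Resolve relative paths against the URL of the page they came from
    absolute_url = urljoin(page_url, src)

    # Take the file name from the URL path only, ignoring any query string
    filename = basename(urlparse(absolute_url).path)
    if not filename:
        return None

    return absolute_url, filename


# Placeholder page URL used only for illustration
page = "https://example.com/blog/post.html"

print(resolve_image(page, "img/cat.png"))
print(resolve_image(page, "/static/dog.jpg?v=2"))
print(resolve_image(page, "https://cdn.example.com/a/b/bird.webp"))
print(resolve_image(page, None))
```

In a download loop, the returned URL would go to requests.get() and the returned file name to open(), with None results skipped.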

Using Scrapy

We will use the same target URL as the preceding example:

https://api.slingacademy.com/v1/examples/sample-page.html

Install scrapy by running this command:

pip install scrapy

The image pipeline requires the package pillow to work properly (even though we don’t need to import it into our Python code). The package is used for thumbnailing and normalizing images to JPEG/RGB format. Install pillow:

pip install pillow
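To illustrate what pillow makes possible, here is a sketch of settings that ask the image pipeline to also save thumbnails and to skip very small images. None of this is required for the example in this article. The setting names are Scrapy's own, but the variable name THUMBNAIL_SETTINGS and all sizes are arbitrary choices made for illustration; check them against the documentation of your Scrapy version before relying on them.

```python
# Hypothetical additions to a spider's custom_settings dictionary.
# The setting names come from Scrapy; the values are arbitrary examples.
THUMBNAIL_SETTINGS = {
    "IMAGES_STORE": "images",
    "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
    # Also save resized copies under images/thumbs/small and images/thumbs/big
    "IMAGES_THUMBS": {"small": (50, 50), "big": (270, 270)},
    # Skip images smaller than 110x110 pixels (icons, tracking pixels, etc.)
    "IMAGES_MIN_WIDTH": 110,
    "IMAGES_MIN_HEIGHT": 110,
}
```

A dictionary like this would be assigned to custom_settings in the spider class, in place of a plain storage-only configuration.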

With both packages installed, here is the complete code:

# Import Scrapy and other libraries
import scrapy
from scrapy.crawler import CrawlerProcess

# Define an item class for images
class ImageItem(scrapy.Item):
  image_urls = scrapy.Field()
  images = scrapy.Field()

# Define a spider class for image scraping
class ImageSpider(scrapy.Spider):
  name = "image_spider"
  start_urls = ["https://api.slingacademy.com/v1/examples/sample-page.html"]

  custom_settings = {
    "IMAGES_STORE": "images",  # Store the images in the "images" folder
    "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
    "LOG_ENABLED": True,
    "LOG_FILE": "logs.txt",
    "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    "LOG_LEVEL": "ERROR"
  }

  def parse(self, response):
    # Collect the src attribute of every image tag
    # (tags without a src attribute are skipped automatically)
    for src in response.xpath("//img/@src").getall():
      # The image pipeline needs absolute URLs, so resolve relative
      # paths against the page URL before yielding the item
      yield ImageItem(image_urls=[response.urljoin(src)])

# Create a crawler process (the settings come from the spider's custom_settings)
process = CrawlerProcess()

# Start the crawler with the spider
process.crawl(ImageSpider)
process.start()
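A note on the file names you will find afterwards: the image pipeline does not keep the original names. Assuming the default behaviour of ImagesPipeline (worth confirming against the Scrapy version you have installed), each file is named after the SHA-1 hash of its URL and saved with a .jpg extension. The sketch below mimics that scheme; expected_image_path is a hypothetical helper, not a Scrapy function, and the URL is a placeholder.

```python
import hashlib


def expected_image_path(image_url):
    """Hypothetical helper: mimic the default naming of saved images."""
    # Hash the URL so every image gets a unique, filesystem-safe name
    image_guid = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    # Files go to the "full" subfolder of IMAGES_STORE with a .jpg extension
    return f"full/{image_guid}.jpg"


# Placeholder URL used only for illustration
print(expected_image_path("https://example.com/photos/cat.png"))
```

This can be handy for matching a saved file back to the URL it came from.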

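The downloaded images will be located in <your project>/images/full. Because the spider sets LOG_FILE and LOG_LEVEL, any errors are written to logs.txt rather than the console, so check that file if no images appear. The tutorial ends here. Happy coding & have a nice day!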