This example-based article walks you through two different ways to programmatically extract and download all images from a web page with Python. The first approach uses requests and beautifulsoup4, while the second one uses scrapy.
Using Requests and BeautifulSoup
In this example, we will get all images from this sample web page:
https://api.slingacademy.com/v1/examples/sample-page.html
Before writing Python code, make sure you don’t forget to install the required libraries (lxml is included because the code below uses it as the HTML parser):
pip install requests beautifulsoup4 lxml
The complete code:
import os
from os.path import basename
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# We will extract and download all images from this page
url = "https://api.slingacademy.com/v1/examples/sample-page.html"
r = requests.get(url)
html = r.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "lxml")

# Find all image tags
images = soup.find_all("img")

# Create the "images" folder if it doesn't exist
os.makedirs("images", exist_ok=True)

# Loop through the images and download them
for image in images:
    # Get the image source URL (resolve relative URLs against the page URL)
    src = urljoin(url, image["src"])

    # Download the image and save it to the "images" folder
    with open(os.path.join("images", basename(src)), "wb") as f:
        f.write(requests.get(src).content)

print("All images downloaded!")
After running the code, you should see a folder named images in the same directory as your Python file. Inside that folder, you’ll see a couple of newly saved images. If not, please recheck your internet connection and make sure you didn’t make any typos in the code.
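One detail worth knowing: os.path.basename simply takes everything after the last slash, so an image URL that carries a query string would produce an awkward filename. A small sketch of a safer approach using only the standard library (the helper name clean_filename is our own, not part of any library):

```python
from os.path import basename
from urllib.parse import urlparse

def clean_filename(src: str) -> str:
    # Parse the URL first so the query string and fragment are dropped,
    # then take the last path component as the filename
    return basename(urlparse(src).path)

print(clean_filename("https://example.com/img/photo.jpg?size=large"))  # photo.jpg
print(basename("https://example.com/img/photo.jpg?size=large"))        # photo.jpg?size=large
```

The sample page used in this tutorial serves plain image URLs, so the simpler basename(src) is enough there, but the helper is handy for arbitrary pages.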
Using Scrapy
We will use the same target URL as the preceding example:
https://api.slingacademy.com/v1/examples/sample-page.html
Get scrapy installed by running this command:
pip install scrapy
The image pipeline requires the package pillow to work properly (even though we don’t need to import it into our Python code). The package is used for thumbnailing and normalizing images to JPEG/RGB format. Install pillow:
pip install pillow
Then use it like so:
# Import Scrapy and other libraries
import scrapy
from scrapy.crawler import CrawlerProcess

# Define an item class for images
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

# Define a spider class for image scraping
class ImageSpider(scrapy.Spider):
    name = "image_spider"
    start_urls = ["https://api.slingacademy.com/v1/examples/sample-page.html"]
    custom_settings = {
        "IMAGES_STORE": "images",  # Store the images in the "images" folder
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "LOG_ENABLED": True,
        "LOG_FILE": "logs.txt",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
        "LOG_LEVEL": "ERROR",
    }

    def parse(self, response):
        # Find all image tags
        images = response.xpath("//img")

        # Loop through the images and yield items
        for image in images:
            # Get the image source URL (resolve relative URLs against the page)
            src = response.urljoin(image.xpath("@src").get())

            # Yield an item with the image URL
            yield ImageItem(image_urls=[src])

# Create a crawler process and start the crawler with the spider
process = CrawlerProcess()
process.crawl(ImageSpider)
process.start()
The downloaded images will be located in <your project>/images/full
. The tutorial ends here. Happy coding & have a nice day!
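Unlike the first example, the image pipeline does not keep the original filenames: by default, each file is named after the SHA-1 hash of its image URL and saved under the full/ subfolder. If you ever need to map a URL back to its file on disk, the default name can be reproduced with the standard library. This is a sketch based on the pipeline’s default behavior (the .jpg extension reflects the JPEG normalization mentioned earlier; the authoritative logic lives in the pipeline’s file_path method):

```python
import hashlib

def default_image_path(url: str) -> str:
    # The pipeline names files after the SHA-1 hex digest of the image URL;
    # images are normalized to JPEG, hence the .jpg extension
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{guid}.jpg"

print(default_image_path("https://api.slingacademy.com/public/sample-photos/1.jpeg"))
```

You can override file_path in a custom ImagesPipeline subclass if you prefer human-readable filenames.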