Python: Using aiohttp to crawl webpages asynchronously

Updated: August 8, 2023 By: Goodman

This concise, code-centric article will show you how to asynchronously crawl a single webpage or a list of webpages using aiohttp and BeautifulSoup 4 in Python.

Overview

async/await

async/await is a way of writing asynchronous code in Python, which means that the code can run without blocking or waiting for other tasks to finish. This is useful for web scraping and crawling, because we often need to make many requests to different URLs or websites, and we don’t want to waste time waiting for each response before moving on to the next one. By using async/await, we can make multiple requests concurrently, and handle the responses as they arrive, without blocking the main thread of execution.
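
To see the difference this makes, here’s a tiny self-contained sketch (no real HTTP involved; the function name and delays are made up for illustration) where two simulated requests run concurrently and finish in roughly 2 seconds instead of 3:

import asyncio


async def fake_request(name, delay):
    # Simulate a network request that takes `delay` seconds
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"


async def main():
    # Both "requests" run at the same time, so the total is ~2s, not 3s
    results = await asyncio.gather(fake_request("page-1", 2), fake_request("page-2", 1))
    print(results)


asyncio.run(main())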

aiohttp

aiohttp is an open-source library that provides an asynchronous HTTP client and server for Python. It allows us to make HTTP requests and handle HTTP responses using the async/await syntax. You can install aiohttp by running:

pip install aiohttp

To get the HTML source of a page, you can do it like this:

async with session.get(url) as response:
    html = await response.text()

To download an image from the internet, you can do as follows:

async with session.get(image_url) as response:
    data = await response.read()
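
Both snippets above assume you already have a session object created from aiohttp.ClientSession. For reference, a minimal self-contained sketch that creates the session, fetches a page, and prints part of its HTML (using the sample page we’ll also crawl later in this article) might look like this:

import asyncio

import aiohttp


async def fetch_html(url):
    # A session should be created once and reused for multiple requests
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


html = asyncio.run(fetch_html("https://api.slingacademy.com/v1/examples/sample-page.html"))
print(html[:200])  # print the first 200 characters of the page source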

More details can be seen in the complete example in the next section of this article.

BeautifulSoup 4

BeautifulSoup 4 is a library that allows us to parse HTML content and extract data from it. It works by creating a soup object, which represents the HTML document as a tree of tags, attributes, strings, and other objects. We can then navigate and search through the soup object using various methods and properties. You can install BeautifulSoup 4 like so:

pip install beautifulsoup4

Here’s how we can extract all image URLs from raw HTML:

soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")
image_urls = [image["src"] for image in images]
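
Besides find_all(), the soup object supports other lookups such as find(), tag attribute access, and get_text(). Here’s a small self-contained sketch (the inline HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Sample Page</h1>
    <p class="intro">Hello from <a href="https://www.slingacademy.com">Sling Academy</a></p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())  # Sample Page
print(soup.find("p", class_="intro").get_text())  # Hello from Sling Academy
print(soup.a["href"])  # https://www.slingacademy.com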

Build an Async Web Crawler

We’ll combine aiohttp, BeautifulSoup 4, and asyncio to build a fast, efficient web crawler that can fetch and parse raw HTML as well as download all the images from one or many URLs. The steps are:

  1. Define an async function that takes a session object and a URL as parameters, and returns a list of image URLs from that webpage.
  2. Define an async function that takes a session object and an image URL as parameters, and downloads the image to a local file.
  3. Define an async function that takes a list of URLs as a parameter, creates a session object, and builds a list of tasks for crawling each URL and downloading the images it contains.
  4. Run the last function with asyncio.run(), which creates an event loop and runs it until completion.

In the example to come, we’ll use the sample webpage below for learning and practicing purposes:

https://api.slingacademy.com/v1/examples/sample-page.html

The complete code (with explanations):

# SlingAcademy.com
# This code uses Python 3.11.4

import aiohttp
import asyncio
from bs4 import BeautifulSoup


# Define an async function that takes a session object and a url as parameters,
# and returns a list of image urls from that webpage
async def crawl(session, url):
    # Make a GET request to the url and wait for the response
    async with session.get(url) as response:
        # Read the response content as text
        html = await response.text()
        # Create a soup object from the html content
        soup = BeautifulSoup(html, "html.parser")
        # Find all the img tags in the soup object
        images = soup.find_all("img")
        # Extract the src attributes from each image tag
        image_urls = [image["src"] for image in images]
        # Return the list of image urls
        return image_urls


# Define an async function that takes a session object and an image url as parameters,
# and downloads the image to a local file
async def download(session, image_url):
    # Make a GET request to the image url and wait for the response
    async with session.get(image_url) as response:
        # Read the response content as bytes
        data = await response.read()
        # Get the file name from the image url
        file_name = image_url.split("/")[-1]
        # Open a local file with the same name in binary write mode
        with open(file_name, "wb") as file:
            # Write the data to the file
            file.write(data)
            # Print a message indicating success
            print(f"Downloaded {file_name}")


# Define an async function that takes a list of urls as parameters,
# and creates a session object and a list of tasks for crawling and downloading each url
async def main(urls):
    # Create a session object
    async with aiohttp.ClientSession() as session:
        # Create an empty list of tasks
        tasks = []
        # For each url in the list of urls
        for url in urls:
            # Create a task for crawling the url and append it to the list of tasks
            tasks.append(asyncio.create_task(crawl(session, url)))
        # Wait for all the tasks to finish and get their results
        results = await asyncio.gather(*tasks)
        # Flatten the results into one list of image urls
        image_urls = [image_url for result in results for image_url in result]
        # Create another empty list of tasks
        tasks = []
        # For each image url in the list of image urls
        for image_url in image_urls:
            # Create a task for downloading the image url and append it to the list of tasks
            tasks.append(asyncio.create_task(download(session, image_url)))
        # Wait for all the tasks to finish
        await asyncio.gather(*tasks)


# Run the main function until it is done with a list of urls
# (asyncio.run() creates an event loop, runs the coroutine, and closes the loop for us)
my_urls = [
    "https://api.slingacademy.com/v1/examples/sample-page.html",
    # you can add more urls here
]
asyncio.run(main(my_urls))

After running the code above, you’ll see the downloaded images in the same directory as your Python script.

Some messages will also be printed out:

Downloaded 1.jpeg
Downloaded 2.jpeg
Downloaded 4.jpeg
Downloaded 3.jpeg

What’s Next?

The code in the example above works, but there’s still plenty of room for improvement. Here are some things you can do on your own to get better results:

  • The code assumes that the image URLs are absolute and valid, but in reality, they might be relative or broken. You might need to resolve relative URLs (for example, with urllib.parse.urljoin, as in the sketch after this list) and handle exceptions for broken ones.
  • The code does not check for duplicate image URLs, so it might download the same image multiple times. You might want to use a set or a dictionary to store the image URLs and avoid repetition (the sketch below uses a set for this).
  • The code does not handle redirects, cookies, authentication, headers, proxies, or other HTTP features that might be required for some websites. You might need to use some options or methods from the aiohttp library to deal with these issues.
  • The code does not limit the number of concurrent requests or downloads, so it might overload the server or your network bandwidth. You might want to use a semaphore or a queue to control the concurrency and avoid being blocked or throttled by the server (the sketch below uses an asyncio.Semaphore for this purpose).
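
As a starting point for the first, second, and last items above, here’s a hedged sketch that combines them: it resolves relative src values with urllib.parse.urljoin, removes duplicate image URLs with a set, and caps concurrent downloads with an asyncio.Semaphore (the limit of 5 is an arbitrary choice for illustration, not a recommendation):

import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup


async def crawl(session, url):
    async with session.get(url) as response:
        html = await response.text()
        soup = BeautifulSoup(html, "html.parser")
        # urljoin resolves relative src values against the page URL
        return [urljoin(url, img["src"]) for img in soup.find_all("img") if img.get("src")]


async def download(session, semaphore, image_url):
    async with semaphore:  # wait here if too many downloads are already in progress
        async with session.get(image_url) as response:
            data = await response.read()
            file_name = image_url.split("/")[-1]
            with open(file_name, "wb") as file:
                file.write(data)
            print(f"Downloaded {file_name}")


async def main(urls):
    # Allow at most 5 downloads to run at the same time
    semaphore = asyncio.Semaphore(5)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(crawl(session, url) for url in urls))
        # A set removes duplicate image URLs before downloading
        image_urls = {u for result in results for u in result}
        await asyncio.gather(*(download(session, semaphore, u) for u in image_urls))


asyncio.run(main(["https://api.slingacademy.com/v1/examples/sample-page.html"]))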

This tutorial ends here. If you have any questions related to the topic we’ve discussed, just leave a comment. I’m more than happy to hear from you.