Python asyncio: How to download a list of files in parallel

Updated: February 12, 2024 By: Guest Contributor

Overview

In today’s fast-paced digital era, efficiency is key. Whether you’re a developer working on a high-load system, a data scientist who needs to download large datasets, or simply someone looking to speed up your code, Python’s asyncio library is an invaluable tool for IO-bound, high-level structured network tasks — and especially for downloading many files at once.

This tutorial will guide you through the process of using asyncio along with aiohttp to download a list of files in parallel. We’ll start with the basics and progressively delve into more advanced concepts, providing code examples and their expected outputs at each step.

Getting Started

Before diving into the code, it’s essential to understand the core concepts behind asyncio and how asynchronous programming works in Python. Asyncio is an asynchronous I/O framework that uses coroutines and an event loop to run code in a non-blocking manner: while one task waits on I/O, the event loop switches to another, so many tasks make progress concurrently on a single thread. This is particularly useful for IO-bound work, such as downloading files from the internet.
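
To see this behavior in isolation, here is a minimal sketch that uses asyncio.sleep to stand in for network waits (the task names and delays are arbitrary). Because both coroutines run concurrently, the total runtime is about two seconds rather than three:

import asyncio

async def fake_download(name, delay):
    # await hands control back to the event loop while this task "waits"
    await asyncio.sleep(delay)
    print(f"{name} finished after {delay}s")

async def main():
    # both coroutines are in flight at the same time
    await asyncio.gather(fake_download("task-1", 1), fake_download("task-2", 2))

asyncio.run(main())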

To begin, install aiohttp (asyncio itself ships with Python). Run the following command in your terminal:

pip install aiohttp

Basic Parallel Downloads

Let’s start with a simple example. The following code will download three files in parallel:

import asyncio
import aiohttp

async def download_file(session, url):
    # stream the response body so large files are never held in memory whole
    async with session.get(url) as response:
        filename = url.split('/')[-1]
        with open(filename, 'wb') as f:
            # read the body in 1 KB chunks until the stream is exhausted
            while True:
                chunk = await response.content.read(1024)
                if not chunk:
                    break
                f.write(chunk)
        print(f"Downloaded {filename}")

async def main():
    urls = ['http://example.com/file1', 'http://example.com/file2', 'http://example.com/file3']
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(download_file(session, url)) for url in urls]
        await asyncio.gather(*tasks)

asyncio.run(main())
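
With reachable URLs, the output looks something like this, the order depending on which download finishes first:

Downloaded file1
Downloaded file3
Downloaded file2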

This code starts an asyncio event loop, creates a task for each download, and then waits for all of them with asyncio.gather. Because each task yields control while waiting for network data, the three downloads proceed concurrently rather than one after another.

A key thing to note here is the use of async with for resource management, which ensures the session and each response are closed properly once the tasks complete.
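
One caveat: open() and f.write() are blocking calls, which briefly stall the event loop on each write. For most downloads this is negligible, but for heavy workloads you can offload file writes with the third-party aiofiles package (an extra dependency, installed with pip install aiofiles). A minimal sketch of the write loop using it:

import aiofiles

async def save_body(response, filename):
    # async file handle: writes no longer block the event loop
    async with aiofiles.open(filename, 'wb') as f:
        async for chunk in response.content.iter_chunked(1024):
            await f.write(chunk)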

Advanced Usage

While the previous example demonstrates the basic premise of parallel downloads, a real-world application often demands more sophistication. This might include error handling, rate limiting, or working with large sets of URLs.

Error Handling

To handle errors gracefully, modify the download_file function to include a try-except block. Calling response.raise_for_status() additionally turns HTTP error responses (4xx/5xx) into exceptions, so they are caught as well:

async def download_file(session, url):
    try:
        async with session.get(url) as response:
            response.raise_for_status()  # raise for 4xx/5xx responses
            ...  # download body as before (chunked read and write)
    except aiohttp.ClientError as e:
        print(f"Failed to download {url}, error: {e}")

Rate Limiting

To avoid overwhelming the server with too many simultaneous requests, you can cap concurrency with asyncio’s Semaphore:

async def download_file(session, url, semaphore):
    async with semaphore:
        # runs only while a semaphore slot is held, capping concurrency
        ...  # download body as before

async def main():
    semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests
    urls = [...]  # your list of URLs
    async with aiohttp.ClientSession() as session:
        # every task shares the same semaphore
        tasks = [asyncio.create_task(download_file(session, url, semaphore))
                 for url in urls]
        await asyncio.gather(*tasks)
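
As an alternative to a semaphore, aiohttp can enforce a connection cap for you: the limit parameter of TCPConnector bounds how many connections a session opens at once (it defaults to 100):

# cap the whole session at 10 concurrent connections
connector = aiohttp.TCPConnector(limit=10)
async with aiohttp.ClientSession(connector=connector) as session:
    ...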

Working with Large Sets of URLs

When dealing with a large list of URLs, launching every download at once can exhaust sockets, file handles, or memory. Processing the list in fixed-size chunks keeps resource usage bounded: each chunk downloads concurrently, and the next chunk starts only once the current one finishes. Here’s how you might implement this:

async def main():
    urls = [...]  # a large list of URLs
    chunk_size = 20
    # reuse one session (and its connection pool) across all chunks
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), chunk_size):
            chunk = urls[i:i + chunk_size]
            tasks = [asyncio.create_task(download_file(session, url)) for url in chunk]
            await asyncio.gather(*tasks)  # finish this chunk before starting the next
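
Note the trade-off versus the semaphore approach: chunking waits for the slowest download in each batch before starting the next, while a semaphore keeps a steady number of downloads in flight at all times. The semaphore usually gives better throughput; chunking is simpler when you also want to checkpoint progress between batches.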

Conclusion

Using asyncio and aiohttp, Python programmers can download many files concurrently, significantly reducing the overall execution time of IO-bound work. This tutorial has walked through basic and advanced techniques for doing so — from a simple gather-based downloader to error handling, concurrency limits, and chunked processing — equipping you to apply these patterns in your own projects.