Python Requests module: How to crawl raw HTML from a URL

Updated: January 2, 2024 By: Guest Contributor

Introduction

Gathering data from the internet has become an essential task for many applications. The Python Requests module simplifies crawling and retrieving raw HTML from URLs with its user-friendly interface and robust capabilities.

Getting Started with Requests

The first step in using the Requests module to crawl raw HTML from a website is to install the module (if it’s not already installed) using pip:

pip install requests

Once installed, you can perform a basic GET request to retrieve the content of a web page:

import requests

url = 'http://example.com'
response = requests.get(url)
html_content = response.text

print(html_content)
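The Response object carries more than the decoded body. Before parsing a page, it can be useful to inspect attributes such as the status code, the detected encoding, and the raw bytes (the URL below is the same placeholder used above):

```python
import requests

url = 'http://example.com'
response = requests.get(url)

print(response.status_code)              # numeric HTTP status, e.g. 200
print(response.encoding)                 # charset used to decode .text
print(response.headers['Content-Type'])  # e.g. 'text/html; charset=UTF-8'
print(len(response.content))             # size of the raw, undecoded bytes
```

Note that `response.content` holds the raw bytes, while `response.text` is those bytes decoded using `response.encoding`.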

Handling Response Status Codes

Before processing the HTML content, it’s essential to check the response status code to ensure the request was successful:

if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Not Found.')
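Hard-coded numbers work, but Requests also exposes named status constants in `requests.codes`, which can make checks like the ones above more readable:

```python
import requests

# Named constants from requests.codes avoid magic numbers:
print(requests.codes.ok)                 # 200
print(requests.codes.not_found)          # 404
print(requests.codes.too_many_requests)  # 429, common when crawling
```

A check then reads `if response.status_code == requests.codes.ok:` rather than comparing against a bare 200.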

Advanced Handling of Requests

For more complex scenarios, you might need to incorporate headers, cookies, and other parameters in your request. Here’s how to send a request with custom headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (...)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

response = requests.get(url, headers=headers)
html_content = response.text
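Query-string parameters can be passed the same way via the params argument. If you want to see exactly what would be sent without making a network call, `requests.Request` plus `prepare()` builds the request offline; the URL and parameter names below are made up for illustration:

```python
import requests

# Build the request without sending it, to inspect what would go on the wire.
req = requests.Request(
    'GET',
    'http://example.com/search',
    params={'q': 'python', 'page': 2},
    headers={'User-Agent': 'my-crawler/1.0'},
)
prepared = req.prepare()

print(prepared.url)                    # http://example.com/search?q=python&page=2
print(prepared.headers['User-Agent'])  # my-crawler/1.0
```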

Working with Sessions

Sessions can be used to persist certain parameters across multiple requests. For instance, if you need to maintain a logged-in state, you can use a session object:

with requests.Session() as session:
    session.post('http://example.com/login', data={'username':'admin', 'password':'password'})
    response = session.get('http://example.com/protected_page')
    print(response.text)
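Anything set on the session itself is reused by every request it sends. For example, session-wide headers persist across calls, and cookies returned by the server are stored on the session automatically; you can also seed them by hand (the values below are illustrative):

```python
import requests

with requests.Session() as session:
    # Headers set here are merged into every request this session makes.
    session.headers.update({'User-Agent': 'my-crawler/1.0'})
    # Cookies can be seeded manually; server-set cookies land here too.
    session.cookies.set('theme', 'dark')

    print(session.headers['User-Agent'])  # my-crawler/1.0
    print(session.cookies.get('theme'))   # dark
```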

Error Handling

It’s good practice to catch the exceptions that can occur during a request so that connection errors, timeouts, and HTTP errors are handled gracefully:

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('HTTP Error:', e)
except requests.exceptions.ConnectionError as e:
    print('Connection Error:', e)
except requests.exceptions.Timeout as e:
    print('Timeout Error:', e)
except requests.exceptions.RequestException as e:
    print('Request Exception:', e)
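Beyond catching exceptions, transient failures can be retried automatically by mounting an HTTPAdapter configured with urllib3’s Retry on a session. This is a sketch; the retry count, backoff factor, and status list are illustrative choices, not required values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on connection problems and on these 5xx responses,
# with exponential backoff between attempts.
retry = Retry(total=3, backoff_factor=0.5,
              status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount('http://', adapter)   # applies to all http:// URLs
session.mount('https://', adapter)  # and all https:// URLs

# session.get(url, timeout=5) now retries transparently before raising.
```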

Conclusion

In this tutorial, we’ve explored the Python Requests module, starting from the basics and moving on to more advanced topics such as handling custom headers, sessions, and errors. Understanding how to utilize this powerful module enables you to efficiently crawl and process raw HTML data from the web to fuel your applications and analyses.