Sling Academy

Python Requests module: How to crawl raw HTML from a URL

Last updated: January 02, 2024

Introduction

Many applications need to gather data from the web. The Python Requests module makes it straightforward to fetch the raw HTML of a URL: a single function call sends the HTTP request, and the response object exposes the page body, status code, and headers.

Getting Started with Requests

Requests is a third-party package, so the first step is to install it with pip if it’s not already installed:

pip install requests

Once installed, you can perform a basic GET request to retrieve the content of a web page:

import requests

url = 'http://example.com'
response = requests.get(url)
html_content = response.text

print(html_content)

Handling Response Status Codes

Before processing the HTML content, it’s essential to check the response status code to ensure the request was successful:

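if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Not Found.')
else:
    # Any other code (redirect, server error, ...) is reported as-is
    print('Unexpected status code:', response.status_code)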
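
The check above names only a couple of status codes explicitly. As a rough sketch of a more general approach, the helper below groups any code by its range (2xx, 3xx, 4xx, 5xx). The function name describe_status is hypothetical and not part of Requests; it relies only on the standard library's http.HTTPStatus for the reason phrase.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Return a short label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Code is not one of the standard registered values
        phrase = "Unknown"

    if 200 <= status_code < 300:
        category = "success"
    elif 300 <= status_code < 400:
        category = "redirect"
    elif 400 <= status_code < 500:
        category = "client error"
    elif 500 <= status_code < 600:
        category = "server error"
    else:
        category = "unexpected"

    return f"{status_code} {phrase} ({category})"
```

With a response in hand, you could call print(describe_status(response.status_code)) before deciding whether to parse the HTML.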

Advanced Handling of Requests

For more complex scenarios, you might need to incorporate headers, cookies, and other parameters in your request. Here’s how to send a request with custom headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (...)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

response = requests.get(url, headers=headers)
html_content = response.text
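
If you want to confirm which headers would actually go out, one option is to build the request without sending it. The sketch below uses requests.Request and its prepare() method for that purpose. The header values are placeholders chosen for illustration, not values recommended by this tutorial.

```python
import requests

# Placeholder header values for illustration only
headers = {
    "User-Agent": "my-crawler/0.1",
    "Accept": "text/html",
}

# Build the request object; nothing is sent over the network here
request = requests.Request("GET", "http://example.com", headers=headers)
prepared = request.prepare()

# Inspect what would be sent
print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

This is handy when debugging a site that responds differently depending on the User-Agent or Accept headers.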

Working with Sessions

Sessions can be used to persist certain parameters across multiple requests. For instance, if you need to maintain a logged-in state, you can use a session object:

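with requests.Session() as session:
    # Log in once; cookies returned by the server are stored on the session
    session.post('http://example.com/login', data={'username':'admin', 'password':'password'})
    # Later requests through the same session send those cookies automatically
    response = session.get('http://example.com/protected_page')
    print(response.text)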
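
Cookies are not the only thing a session persists. As a small sketch, the snippet below sets a default header on the session and then uses Session.prepare_request to show that the header is merged into a request without sending anything. The User-Agent value is a placeholder, and the URL is the same illustrative one used above.

```python
import requests

with requests.Session() as session:
    # Headers set on the session apply to every request made through it
    session.headers.update({"User-Agent": "my-crawler/0.1"})  # placeholder value

    # prepare_request merges session-level settings into the request
    # without sending it, so the result can be inspected offline
    request = requests.Request("GET", "http://example.com/protected_page")
    prepared = session.prepare_request(request)
    merged_user_agent = prepared.headers["User-Agent"]

print(merged_user_agent)
```

Setting shared headers once on the session avoids repeating the headers argument in every call.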

Error Handling

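It’s good practice to catch the exceptions that Requests can raise so that HTTP errors, connection failures, and timeouts are handled gracefully. List the specific exceptions first and finish with RequestException, the base class for all of them, as a catch-all:

try:
    response = requests.get(url, timeout=5)
    # Raise HTTPError if the server returned a 4xx or 5xx status
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('HTTP Error:', e)
except requests.exceptions.ConnectionError as e:
    print('Connection Error:', e)
except requests.exceptions.Timeout as e:
    print('Timeout Error:', e)
except requests.exceptions.RequestException as e:
    # Base class: catches any other error raised by Requests
    print('Request Exception:', e)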
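
When the caller does not need to distinguish between failure types, the handling above can be folded into one reusable function. The following is a minimal sketch, assuming that returning None on failure suits your application; fetch_html is a hypothetical helper name, not part of Requests.

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the page's HTML, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise HTTPError if the server returned a 4xx or 5xx status
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # RequestException is the base class of HTTPError,
        # ConnectionError, Timeout, and the other Requests errors
        print(f"Request failed for {url!r}: {e}")
        return None
    return response.text
```

A call such as html_content = fetch_html('http://example.com') then needs only a check for None before further processing.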

Conclusion

In this tutorial, we’ve explored the Python Requests module, starting with a basic GET request and moving on to status codes, custom headers, sessions, and error handling. With these pieces in place, you can reliably fetch raw HTML from the web and pass it on to your applications and analyses.

Series: Python: Network & JSON tutorials