Introduction
Gathering data from the internet has become an essential task for many applications. The Python Requests module simplifies crawling and retrieving raw HTML from URLs with its user-friendly interface and robust capabilities.
Getting Started with Requests
The first step in using the Requests module to crawl raw HTML from a website is to install the module (if it’s not already installed) using pip:
pip install requests
Once installed, you can perform a basic GET request to retrieve the content of a web page:
import requests

# Fetch the page and read the decoded HTML body
url = 'http://example.com'
response = requests.get(url)
html_content = response.text
print(html_content)
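Note that response.text returns the body decoded with the encoding Requests infers from the response headers. If you need the raw bytes, or the inferred encoding looks wrong, you can fall back on response.content and set the encoding yourself. A minimal sketch (http://example.com is just a placeholder):

import requests

response = requests.get('http://example.com')

raw_bytes = response.content          # undecoded body, useful for binary data or odd encodings
print(response.encoding)              # encoding taken from the Content-Type header
response.encoding = response.apparent_encoding  # re-guess from the body if the header is missing or wrong
html_content = response.text          # decoded using the encoding set above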
Handling Response Status Codes
Before processing the HTML content, it’s essential to check the response status code to ensure the request was successful:
if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Not Found.')
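If you only care about success versus failure, Requests also offers two shortcuts; a small sketch of both (raise_for_status is covered again under error handling below):

if response.ok:                 # True for any status code below 400
    html_content = response.text
else:
    print('Request failed with status', response.status_code)

# Or raise an HTTPError for 4xx/5xx responses instead of branching manually
response.raise_for_status()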
Advanced Handling of Requests
For more complex scenarios, you might need to incorporate headers, cookies, and other parameters in your request. Here’s how to send a request with custom headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (...)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

response = requests.get(url, headers=headers)
html_content = response.text
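Cookies and query-string parameters are passed the same way, via the cookies and params keyword arguments. A brief sketch (the parameter names and cookie value here are made up for illustration):

params = {'page': 2, 'q': 'python'}   # appended to the URL as ?page=2&q=python
cookies = {'session_id': 'abc123'}    # hypothetical cookie value

response = requests.get(url, headers=headers, params=params, cookies=cookies)
print(response.url)                   # final URL including the encoded query string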
Working with Sessions
Sessions can be used to persist certain parameters across multiple requests. For instance, if you need to maintain a logged-in state, you can use a session object:
with requests.Session() as session:
    # Log in once; the session keeps the resulting cookies for later requests
    session.post('http://example.com/login', data={'username': 'admin', 'password': 'password'})
    response = session.get('http://example.com/protected_page')
    print(response.text)
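A session can also carry default headers and a retry policy so that every request made through it behaves consistently. One way to set this up, as a sketch (the retry settings below are illustrative, not a recommendation):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Default headers sent with every request made through this session
session.headers.update({'User-Agent': 'Mozilla/5.0 (...)'})

# Retry transient failures: assumed policy of 3 retries with backoff on common 5xx codes
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('http://example.com')
print(response.status_code)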
Error Handling
It’s good practice to catch the exceptions Requests can raise, so that connection errors, timeouts, and HTTP errors are handled gracefully:
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('HTTP Error:', e)
except requests.exceptions.ConnectionError as e:
    print('Connection Error:', e)
except requests.exceptions.Timeout as e:
    print('Timeout Error:', e)
except requests.exceptions.RequestException as e:
    print('Request Exception:', e)
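Putting these pieces together, the whole fetch can be wrapped in a small helper that returns the HTML on success and None on failure; a minimal sketch (the function name fetch_html is just an illustration):

import requests

def fetch_html(url, timeout=5):
    """Return the page's HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print('Request failed:', e)
        return None

html_content = fetch_html('http://example.com')
if html_content is not None:
    print(html_content[:200])  # first 200 characters as a quick sanity check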
Conclusion
In this tutorial, we’ve explored the Python Requests module, starting from the basics and moving on to more advanced topics such as handling custom headers, sessions, and errors. Understanding how to utilize this powerful module enables you to efficiently crawl and process raw HTML data from the web to fuel your applications and analyses.