
Python Requests module: How to crawl raw HTML from a URL

Last updated: January 02, 2024

Introduction

Gathering data from the internet is a common task in many applications. The Python Requests module makes it straightforward to fetch the raw HTML of a page: one function call sends the HTTP request and returns a response object that holds the status code, headers, and body.

Getting Started with Requests

The first step in using the Requests module to crawl raw HTML from a website is to install it with pip, if it is not already installed:

pip install requests

Once installed, you can perform a basic GET request to retrieve the content of a web page:

import requests

url = 'http://example.com'
response = requests.get(url)
html_content = response.text

print(html_content)

Handling Response Status Codes

Before processing the HTML content, it’s essential to check the response status code to ensure the request was successful:

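if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Not Found.')
else:
    # Any other status (redirect, other client error, server error)
    print('Unexpected status:', response.status_code)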
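
Real pages return many more codes than 200 and 404. As a rough sketch, a small helper can group any status code into its broader class. The function name `describe_status` is hypothetical and not part of Requests; `requests.codes` is the library's lookup of named status codes.

```python
import requests


def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse category.

    Hypothetical helper for illustration; not part of the Requests API.
    """
    if status_code == requests.codes.not_found:  # 404, reported separately
        return 'not found'
    if 200 <= status_code < 300:
        return 'success'
    if 300 <= status_code < 400:
        return 'redirect'
    if 400 <= status_code < 500:
        return 'client error'
    if 500 <= status_code < 600:
        return 'server error'
    return 'other'


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

When a simple yes/no answer is enough, `response.ok` is True for any status code below 400.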

Advanced Handling of Requests

For more complex scenarios, you might need to incorporate headers, cookies, and other parameters in your request. Here’s how to send a request with custom headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (...)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

response = requests.get(url, headers=headers)
html_content = response.text
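
To check which headers would actually go out, one option is to build the request and prepare it locally without contacting any server. This is a minimal sketch: the `User-Agent` value `my-crawler/0.1` is a made-up placeholder, and nothing is sent over the network.

```python
import requests

url = 'http://example.com'
headers = {
    # Placeholder value for illustration; use a string that identifies your own client
    'User-Agent': 'my-crawler/0.1',
    'Accept-Language': 'en-US,en;q=0.5',
}

# Build the request object and prepare it; this does not send anything
prepared = requests.Request('GET', url, headers=headers).prepare()

print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```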

Working with Sessions

Sessions can be used to persist certain parameters across multiple requests. For instance, if you need to maintain a logged-in state, you can use a session object:

with requests.Session() as session:
    session.post('http://example.com/login', data={'username':'admin', 'password':'password'})
    response = session.get('http://example.com/protected_page')
    print(response.text)
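
Cookies set by a login response are one thing a session remembers; default headers are another. The sketch below prepares requests locally instead of sending them, to suggest how a header set once on the session is attached to every later request. The header value and the second path `/another_page` are assumptions for illustration.

```python
import requests

with requests.Session() as session:
    # Set once; the session merges this into every request it prepares
    session.headers.update({'User-Agent': 'my-crawler/0.1'})  # placeholder value

    first = session.prepare_request(
        requests.Request('GET', 'http://example.com/protected_page'))
    second = session.prepare_request(
        requests.Request('GET', 'http://example.com/another_page'))

    # Both prepared requests carry the session-level header
    print(first.headers['User-Agent'])
    print(second.headers['User-Agent'])
```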

Error Handling

It’s good practice to catch the exceptions that Requests can raise, so that connection errors, timeouts, and HTTP error statuses are handled gracefully instead of crashing the crawler:

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('HTTP Error:', e)
except requests.exceptions.ConnectionError as e:
    print('Connection Error:', e)
except requests.exceptions.Timeout as e:
    print('Timeout Error:', e)
except requests.exceptions.RequestException as e:
    # Base class of all Requests exceptions; catches anything not handled above
    print('Request Exception:', e)
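
The same pattern can be folded into a reusable function. The following is a sketch under stated assumptions: the name `fetch_html`, the 5-second default timeout, and the choice to return `None` on failure are illustrative, not something prescribed by Requests.

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the page's HTML as text, or None if the request fails.

    Hypothetical helper for illustration.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        # RequestException is the base class of HTTPError, ConnectionError,
        # Timeout, and the other exceptions Requests raises
        print('Request failed:', e)
        return None
    return response.text


# A URL with no scheme is rejected before any network traffic happens
print(fetch_html('not-a-valid-url'))
```

Catching only the base class keeps the helper short; a crawler that needs to retry on timeouts but give up on 404s would keep the separate `except` branches instead.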

Conclusion

In this tutorial, we’ve explored the Python Requests module, starting from the basics and moving on to more advanced topics such as handling custom headers, sessions, and errors. Understanding how to utilize this powerful module enables you to efficiently crawl and process raw HTML data from the web to fuel your applications and analyses.
