Introduction
Working with HTML responses in Python is a common task for developers. Using the Requests module alongside parsers like BeautifulSoup, we can easily navigate and manipulate HTML content fetched from the web.
Setting up the Environment
Before parsing HTML with Python Requests, you need to install the necessary packages. Open your terminal or command prompt and run:
pip install requests
pip install beautifulsoup4
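BeautifulSoup uses Python's built-in html.parser by default, but it can also use lxml as a faster backend. Installing it is optional:
pip install lxml
If you do, you can pass 'lxml' instead of 'html.parser' when constructing the soup later on.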
Fetching HTML Content
To fetch HTML content from a webpage, we use the Requests module's get method:
import requests
response = requests.get('https://example.com')
html_content = response.text
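In practice, you may also want to set a timeout, send a User-Agent header, and confirm the request succeeded before parsing. A minimal sketch, where the header value is only an illustrative placeholder:
import requests

headers = {'User-Agent': 'my-scraper/1.0'}  # illustrative value; identify your client however you like
response = requests.get('https://example.com', headers=headers, timeout=10)
if response.ok:
    html_content = response.text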
Parsing HTML with BeautifulSoup
Once we have the HTML content, we can use BeautifulSoup to parse and extract data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
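As a quick sanity check, you can inspect the parsed tree right away, for example by printing the page title:
print(soup.title.string if soup.title else 'No <title> found')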
Navigating the HTML Tree
With BeautifulSoup, locating elements by tag is straightforward:
headers = soup.find_all('h1')
for header in headers:
    print(header.text)
To find elements by class or id:
navigation_bar = soup.find('div', {'class': 'nav-bar'})
footer = soup.find('footer', {'id': 'site-footer'})
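Note that find returns None when nothing matches, so it is safer to guard before using the result. A short sketch:
navigation_bar = soup.find('div', {'class': 'nav-bar'})
if navigation_bar is not None:
    print(navigation_bar.get_text(strip=True))
else:
    print('No nav bar on this page')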
Extracting Attributes and Text
Extracting attributes like href from anchor tags can be done with:
for link in soup.find_all('a'):
    print(link.get('href'))
Similarly, to extract text you can use:
for paragraph in soup.find_all('p'):
    print(paragraph.text)
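If the text contains stray whitespace or newlines, get_text(strip=True) produces cleaner output:
for paragraph in soup.find_all('p'):
    print(paragraph.get_text(strip=True))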
Handling Relative URLs
If you encounter relative URLs, you can resolve them with urljoin from Python's standard urllib.parse module:
from urllib.parse import urljoin
base_url = 'https://example.com'
for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # skip anchors that have no href attribute
        absolute_url = urljoin(base_url, href)
        print(absolute_url)
Advanced Parsing: Using Selectors
You can make use of CSS selectors with the select method:
for item in soup.select('div.content > p.entry'):
    print(item.text)
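For a single match, select_one returns the first element matching the selector, or None if there is none:
first_entry = soup.select_one('div.content > p.entry')
if first_entry is not None:
    print(first_entry.text)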
Working with Forms
To work with forms, you can extract form fields and prepare data for submission:
form = soup.find('form')
form_action = form['action']
form_data = {field['name']: field.get('value', '') for field in form.find_all('input') if field.get('name')}
response = requests.post(urljoin(base_url, form_action), data=form_data)
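Real forms may declare method="get" rather than POST and may omit the action attribute entirely. A hedged sketch that reuses the form and form_data from above and falls back to the base URL when no action is given:
form_method = form.get('method', 'get').lower()
form_url = urljoin(base_url, form.get('action', ''))  # falls back to base_url if action is missing
if form_method == 'post':
    response = requests.post(form_url, data=form_data)
else:
    response = requests.get(form_url, params=form_data)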
Session Handling
If maintaining sessions is necessary, use the Session object to persist cookies and headers across requests:
with requests.Session() as session:
    session.get('https://example.com/login')
    session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})
    response = session.get('https://example.com/dashboard')
    # Parse the response as before
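For example, the dashboard response can be fed to BeautifulSoup exactly as shown earlier; the h2 tags below are just an assumption about the dashboard's markup:
dashboard_soup = BeautifulSoup(response.text, 'html.parser')
for heading in dashboard_soup.find_all('h2'):  # assumed markup; adapt to the real page
    print(heading.get_text(strip=True))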
Error Handling
It’s crucial to handle potential errors in network communication:
try:
    response = requests.get('https://example.com/nonexistent', timeout=5)
    response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(f'HTTP Error: {errh}')
except requests.exceptions.ConnectionError as errc:
    print(f'Error Connecting: {errc}')
except requests.exceptions.Timeout as errt:
    print(f'Timeout Error: {errt}')
except requests.exceptions.RequestException as err:
    print(f'Something else went wrong: {err}')
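Putting it together, here is one possible sketch of a small helper that combines fetching, error handling, and parsing; the function name and the return-None-on-failure convention are simply design choices for illustration:
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=5):
    # Fetch a URL and return a parsed BeautifulSoup tree, or None on failure.
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f'Request failed: {err}')
        return None
    return BeautifulSoup(response.text, 'html.parser')

soup = fetch_soup('https://example.com')
if soup is not None:
    print(soup.title.string if soup.title else 'No title found')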
Conclusion
Python’s Requests module paired with BeautifulSoup makes it simple to fetch and parse HTML content. Through these examples, you can customize and build robust systems for web scraping and automated interactions with web pages.