Python: Crawling HTML Tables and Saving to CSV Files

Updated: February 13, 2024 By: Guest Contributor

Introduction

Web scraping is an invaluable skill in the vast domain of data science and analysis. In today’s digital world, a significant amount of data is stored in web pages as HTML tables. Python, with its rich ecosystem of libraries, provides a powerful toolkit for extracting this data and saving it in more structured formats like CSV files. In this tutorial, we’ll explore how to use Python to crawl HTML tables from web pages and save the extracted data to CSV files. We’ll cover essential topics like making HTTP requests, parsing HTML, extracting relevant data, and finally, writing this data to CSV files.

Installing Libraries

Before we dive into the code, we need to install a few Python libraries that will make our lives easier:

  • requests: To make HTTP requests to web pages.
  • BeautifulSoup4: For parsing HTML and navigating the parsed document tree.
  • pandas: For manipulating the extracted data and saving it to CSV files.

You can install these libraries using pip:

pip install requests beautifulsoup4 pandas

Making an HTTP Request

The first step in web scraping is to make an HTTP request to the web page from which you want to extract data. Using the requests library, we can do this in just a few lines of code:

import requests

url = "https://example.com/table-page"
response = requests.get(url, timeout=10)  # timeout prevents hanging on a slow server

if response.status_code == 200:
    print("Successfully retrieved the page.")
else:
    print(f"Failed to retrieve the page (status code: {response.status_code}).")

Parsing the HTML Content

Once we have the HTML content of the page, the next step is to parse it. We will use BeautifulSoup4 for this purpose:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
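
As a quick sanity check before extracting anything, you can count how many <table> elements the parser found:

# Confirm the parser actually found tables on the page
print(f"Found {len(soup.find_all('table'))} table(s).")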

Extracting Data from HTML Tables

HTML tables are defined with the <table> tag; each row is a <tr> element, and each cell within a row is either a <td> (data) or <th> (header) element. To extract table data, we navigate through these tags:

tables = soup.find_all('table')
for table in tables:
    for row in table.find_all('tr'):
        cols = row.find_all(['td', 'th'])  # both header and data cells
        data = [ele.text.strip() for ele in cols]  # clean cell text
        print(data)

This prints one list of cell values per row, for every table on the page.
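
If the page contains several tables and you only need one, you can target it by an attribute instead of looping over all of them. Note that the id value data-table below is a hypothetical placeholder; inspect the actual page to find the right selector:

# 'data-table' is a hypothetical id -- inspect the real page for the actual one
target = soup.find('table', {'id': 'data-table'})
if target is not None:
    for row in target.find_all('tr'):
        print([ele.text.strip() for ele in row.find_all(['td', 'th'])])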

Saving Data to CSV Files

Once we have the data extracted from the tables, the next step is to save it into a CSV file. This is where pandas comes into play:

import pandas as pd

df = pd.DataFrame(data)
df.to_csv('extracted_table.csv', index=False)

However, note that in the code above, data must be a list of lists, where each sublist represents one row of the CSV file. As written, the extraction loop from the previous section leaves data holding only the last row it processed, so the loop needs to accumulate rows instead, as shown below.
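
A minimal adjustment that collects every row before building the DataFrame might look like this (reusing the tables variable from earlier):

import pandas as pd

all_data = []
for table in tables:
    for row in table.find_all('tr'):
        cols = row.find_all(['td', 'th'])
        all_data.append([ele.text.strip() for ele in cols])  # one sublist per row

df = pd.DataFrame(all_data)
df.to_csv('extracted_table.csv', index=False)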

Handling Complex Tables

Some HTML tables have nested tables, or header cells that span multiple columns or rows. Handling such structure requires more careful navigation and data structuring. A reasonable first step is to separate the header row from the data rows:

for i, table in enumerate(tables):
    # Take the column headers from the first row's <th> cells
    first_row = table.find('tr')
    if first_row is None:
        continue  # skip empty tables
    headers = [th.text.strip() for th in first_row.find_all('th')]

    all_rows = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        cols = row.find_all(['td', 'th'])
        all_rows.append([ele.text.strip() for ele in cols])

    # Fall back to default integer column labels if there are no <th> headers
    df = pd.DataFrame(all_rows, columns=headers or None)
    df.to_csv(f'complex_table_{i}.csv', index=False)  # one file per table
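
The loop above still assumes one cell per column. For header or data cells that span several columns via the colspan attribute, one approach (a minimal sketch; rowspan handling is omitted for brevity) is to repeat a cell's value across every column it covers:

def expand_row(row):
    """Repeat each cell's text across the columns its colspan covers."""
    cells = []
    for ele in row.find_all(['td', 'th']):
        text = ele.text.strip()
        span = int(ele.get('colspan', 1))  # colspan defaults to 1 when absent
        cells.extend([text] * span)
    return cells

You can then call expand_row(row) in place of the per-cell list comprehension when building all_rows, so that every row ends up with the same number of columns.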

Conclusion

Python makes web scraping accessible and efficient. By following the steps outlined in this tutorial, you can easily crawl HTML tables and save the data to CSV files for further analysis or archiving. Whether it’s for data analysis, machine learning projects, or simply automating the collection of information from the web, these techniques provide a solid foundation.

However, when scraping web pages, always be mindful of the website’s robots.txt file and terms of use to ensure that you’re respecting the site’s policies regarding automated data collection.
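
If you want to automate that check, Python's standard library ships with a robots.txt parser; here is a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Only crawl if robots.txt permits fetching this URL for any user agent
if rp.can_fetch("*", "https://example.com/table-page"):
    print("Allowed to crawl this page.")
else:
    print("robots.txt disallows crawling this page.")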