
Python: Crawling HTML tables and saving to CSV files

Last updated: February 13, 2024

Introduction

Web scraping is an invaluable skill in the vast domain of data science and analysis. In today’s digital world, a significant amount of data is stored in web pages as HTML tables. Python, with its rich ecosystem of libraries, provides a powerful toolkit for extracting this data and saving it in more structured formats like CSV files. In this tutorial, we’ll explore how to use Python to crawl HTML tables from web pages and save the extracted data to CSV files. We’ll cover essential topics like making HTTP requests, parsing HTML, extracting relevant data, and finally, writing this data to CSV files.

Installing Libraries

Before we dive into the code, we need to install a few Python libraries that will make our lives easier:

  • requests: To make HTTP requests to web pages.
  • BeautifulSoup4: For parsing HTML and navigating the parsed document tree.
  • pandas: For manipulating the extracted data and saving it to CSV files.

You can install these libraries using pip:

pip install requests beautifulsoup4 pandas

Making an HTTP Request

The first step in web scraping is to make an HTTP request to the web page from which you want to extract data. Using the requests library, we can do this in just a few lines of code:

import requests

url = "https://example.com/table-page"
response = requests.get(url)

# A status code of 200 means the request succeeded
if response.status_code == 200:
    print("Successfully retrieved the page.")
else:
    print(f"Failed to retrieve the page (status code: {response.status_code}).")

Parsing the HTML Content

Once we have the HTML content of the page, the next step is to parse it. We will use BeautifulSoup4 for this purpose:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
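
The soup object lets us navigate the parsed document tree. If the page contains several tables, you can also target a specific one by its attributes instead of grabbing them all; the id and class values below are hypothetical:

# Find a single table by a (hypothetical) id or CSS class
table = soup.find('table', id='prices')
# or: table = soup.find('table', class_='data-table')
if table is not None:
    print(table.find('tr'))  # the first row of that table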

Extracting Data from HTML Tables

HTML tables are defined with the <table> tag, with each row denoted by a <tr> tag and each cell within a row by a <td> (data) or <th> (header) tag. To extract table data, we navigate through these tags, collecting each row into a list of lists for later use:

tables = soup.find_all('table')

data = []
for table in tables:
    for row in table.find_all('tr'):
        cols = row.find_all(['td', 'th'])
        data.append([ele.text.strip() for ele in cols])
        print(data[-1])

This will print every row found in the table(s) on the page, while also accumulating the rows in data as a list of lists for the next step.

Saving Data to CSV Files

Once we have the data extracted from the tables, the next step is to save it into a CSV file. This is where pandas comes into play:

import pandas as pd

df = pd.DataFrame(data)
df.to_csv('extracted_table.csv', index=False)

Note that data here is the list of lists built in the extraction loop above, where each sublist represents one row of the CSV file.
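
As an aside, pandas can also parse HTML tables directly with pd.read_html, which returns one DataFrame per <table> found; note that this requires an HTML parser such as lxml or html5lib to be installed. A minimal sketch:

import pandas as pd
from io import StringIO

# read_html parses every <table> in the HTML into its own DataFrame;
# wrapping the HTML in StringIO avoids a deprecation warning in newer pandas
dfs = pd.read_html(StringIO(response.text))
for i, df in enumerate(dfs):
    df.to_csv(f'table_{i}.csv', index=False)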

Handling Complex Tables

Some HTML tables might have nested tables or headers that span multiple columns or rows. Handling such complexity requires careful navigation and data structuring:

for i, table in enumerate(tables):
    # Collect the column headers from the <th> cells
    headers = [header.text.strip() for header in table.find_all('th')]

    # Collect only the data rows; rows without <td> cells (such as the
    # header row) are skipped so the headers don't reappear as data
    all_rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if cols:
            all_rows.append([ele.text.strip() for ele in cols])

    df = pd.DataFrame(all_rows, columns=headers)
    # Each table gets its own file so later tables don't overwrite earlier ones
    df.to_csv(f'complex_table_{i}.csv', index=False)
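
The code above still assumes one cell per column. For headers or cells that span multiple columns via a colspan attribute, one approach (a sketch rather than a complete solution; rowspan would need similar bookkeeping) is to repeat each cell's text once per spanned column so every row ends up the same width:

for table in tables:
    expanded_rows = []
    for row in table.find_all('tr'):
        expanded = []
        for cell in row.find_all(['td', 'th']):
            # colspan defaults to 1 when the attribute is absent
            span = int(cell.get('colspan', 1))
            expanded.extend([cell.text.strip()] * span)
        expanded_rows.append(expanded)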

Conclusion

Python makes web scraping accessible and efficient. By following the steps outlined in this tutorial, you can easily crawl HTML tables and save the data to CSV files for further analysis or archiving. Whether it’s for data analysis, machine learning projects, or simply automating the collection of information from the web, these techniques provide a solid foundation.

However, when scraping web pages, always be mindful of the website’s robots.txt file and terms of use to ensure that you’re respecting the site’s policies regarding automated data collection.
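
As a quick illustration using only the standard library, urllib.robotparser can check whether a path is allowed before you crawl it (the URLs below reuse the example from earlier):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "*" checks the rules that apply to any user agent
if rp.can_fetch("*", "https://example.com/table-page"):
    print("Crawling this page is allowed by robots.txt.")
else:
    print("robots.txt disallows crawling this page.")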
