Pandas: How to parse an HTML table into a DataFrame

Updated: February 19, 2024 By: Guest Contributor Post a comment

Overview

Parsing HTML tables into Pandas DataFrames is a convenient way to extract and manipulate web data. Pandas, a powerful data manipulation library in Python, provides functionalities that make this task relatively straightforward. This tutorial will guide you through extracting data from HTML tables and converting it into a DataFrame with several code examples.

Getting Started

Before diving into the code, ensure that you have Pandas installed in your Python environment. Additionally, you’ll need lxml, html5lib, and BeautifulSoup4 as Pandas uses these libraries to parse HTML:

pip install pandas lxml html5lib beautifulsoup4

Basic Table Parsing

To begin, let’s parse a simple HTML table. Suppose you have the following HTML:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>John</td>
    <td>22</td>
  </tr>
  <tr>
    <td>Anna</td>
    <td>30</td>
  </tr>
</table>

To convert this to a DataFrame, use the following Python code:

import pandas as pd

# Read HTML table into a DataFrame
html = '''
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>John</td>
    <td>22</td>
  </tr>
  <tr>
    <td>Anna</td>
    <td>30</td>
  </tr>
</table>'''
df = pd.read_html(html)[0]
print(df)

# Output:
#     Name  Age
# 0  John   22
# 1  Anna   30

Handling Complex Tables

Web tables can be more complex, containing multiple header rows, nested tables, or missing values. To manage such complexity, Pandas offers various parameters in the read_html function. Consider a table with multiple headers:

<table>
  <tr>
    <th>Level 1</th> <th></th>
    <th>Level 2</th>
  </tr>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>Gender</th>
  </tr>
  <tr>
    <td>John</td>
    <td>22</td>
    <td>Male</td>
  </tr>
</table>

To parse this correctly, adjust the header parameter:

df_complex = pd.read_html(html, header=[0,1])[0]
print(df_complex)

This adjusts the DataFrame to properly align the hierarchical headers. Similarly, for tables with missing values or special formatting requirements, you can utilize additional parameters like fillna, index_col, or converters to tailor the DataFrame to your needs.

Extracting Tables from URL

In many cases, you might want to directly parse tables from a web page. Pandas makes this exceptionally easy. Use the read_html function and pass the URL of the web page:

url = 'https://example.com/page_with_table.html'
df_webpage = pd.read_html(url)[0]
print(df_webpage)

This approach automatically identifies and extracts tables from the HTML content retrieved from the website. It’s vital to inspect the page and ensure that it does not use dynamic table loading (e.g., through JavaScript) as Pandas won’t be able to parse such tables.

Advanced Techniques

For even more control and customization, you can preprocess the HTML with BeautifulSoup before converting it to a DataFrame. This allows for filtering, modification, or enhancement of the HTML table:

from bs4 import BeautifulSoup
import requests

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Select and modify the table as needed
# Then convert to DataFrame

This method is particularly useful when dealing with troublesome HTML structures or when precise manipulation of the table before conversion is necessary.

Conclusion

Parsing HTML tables into Pandas DataFrames presents a flexible and powerful approach to web data extraction and analysis. With the methods outlined in this tutorial, you can efficiently handle a wide range of HTML table complexities, empowering your data science projects with the rich, structured data from the web.