Introduction
Web scraping is an invaluable skill in the vast domain of data science and analysis. In today’s digital world, a significant amount of data is stored in web pages as HTML tables. Python, with its rich ecosystem of libraries, provides a powerful toolkit for extracting this data and saving it in more structured formats like CSV files. In this tutorial, we’ll explore how to use Python to scrape HTML tables from web pages and save the extracted data to CSV files. We’ll cover essential topics like making HTTP requests, parsing HTML, extracting relevant data, and finally, writing this data to CSV files.
Installing Libraries
Before we dive into the code, we need to install a few Python libraries that will make our lives easier:
- requests: To make HTTP requests to web pages.
- BeautifulSoup4: For parsing HTML and navigating the parsed document tree.
- pandas: For manipulating the extracted data and saving it to CSV files.
You can install these libraries using pip:
pip install requests beautifulsoup4 pandas
Making an HTTP Request
The first step in web scraping is to make an HTTP request to the web page from which you want to extract data. Using the requests library, we can do this in just a few lines of code:
import requests
url = "https://example.com/table-page"
response = requests.get(url)
if response.status_code == 200:
    print("Successfully retrieved the page.")
else:
    print("Failed to retrieve the page.")
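In practice, it’s also a good idea to set a timeout and handle network errors explicitly, so a slow or unreachable server doesn’t hang or crash your script. A minimal sketch (the helper name fetch_html is just for illustration):

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise an exception on 4xx/5xx status codes
        return response.text
    except requests.RequestException as exc:
        print(f"Failed to retrieve {url}: {exc}")
        return None
```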
Parsing the HTML Content
Once we have the HTML content of the page, the next step is to parse it. We will use BeautifulSoup4 for this purpose:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
Extracting Data from HTML Tables
HTML tables are defined with the <table> tag; rows are marked with <tr>, and individual cells with <td> (data) or <th> (header) tags. To extract table data, we navigate through these tags:
tables = soup.find_all('table')
for table in tables:
    for row in table.find_all('tr'):
        cols = row.find_all(['td', 'th'])
        data = [ele.text.strip() for ele in cols]
        print(data)
This will print every row of every table on the page, one list of cell values per row.
Saving Data to CSV Files
Once we have the data extracted from the tables, the next step is to save it into a CSV file. This is where pandas comes into play:
import pandas as pd
df = pd.DataFrame(data)
df.to_csv('extracted_table.csv', index=False)
However, note that in the code above, data should be a list of lists, where each sublist represents one row in the CSV file. As written, the extraction loop only keeps the last row it processed, so be sure to collect each row into a master list before building the DataFrame.
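Putting these pieces together, the adjusted loop appends every row to a list of lists and builds the DataFrame from that. Here is a self-contained sketch that uses a small inline HTML snippet in place of a live page:

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Collect every row as a list of cell strings.
rows = []
for row in table.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    rows.append([ele.text.strip() for ele in cols])

# The first row holds the headers; the rest are data rows.
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_csv('extracted_table.csv', index=False)
```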
Handling Complex Tables
Some HTML tables might have nested tables or headers that span multiple columns or rows. Handling such complexity requires careful navigation and data structuring:
for i, table in enumerate(tables):
    # Collect the header labels from <th> cells.
    headers = [header.text.strip() for header in table.find_all('th')]
    all_rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if not cols:
            continue  # skip header-only rows so headers aren't repeated as data
        all_rows.append([ele.text.strip() for ele in cols])
    df = pd.DataFrame(all_rows, columns=headers if headers else None)
    # Use a distinct filename per table so earlier tables aren't overwritten.
    df.to_csv(f'complex_table_{i}.csv', index=False)
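The loop above treats every cell as occupying a single column, but it doesn’t account for colspan attributes: a cell with colspan="2" spans two columns, and ignoring that misaligns the rows that follow. One simple way to keep rows aligned is to repeat the cell’s text once per spanned column. A minimal sketch, assuming a table without rowspan:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th colspan="2">Person</th><th>Age</th></tr>
  <tr><td>Alice</td><td>Smith</td><td>30</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

expanded_rows = []
for row in soup.find_all('tr'):
    expanded = []
    for cell in row.find_all(['td', 'th']):
        span = int(cell.get('colspan', 1))  # default to a single column
        expanded.extend([cell.text.strip()] * span)  # repeat across spanned columns
    expanded_rows.append(expanded)

print(expanded_rows)
# Every row now has the same number of columns.
```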
Conclusion
Python makes web scraping accessible and efficient. By following the steps outlined in this tutorial, you can easily scrape HTML tables and save the data to CSV files for further analysis or archiving. Whether it’s for data analysis, machine learning projects, or simply automating the collection of information from the web, these techniques provide a solid foundation.
However, when scraping web pages, always be mindful of the website’s robots.txt file and terms of use to ensure that you’re respecting the site’s policies regarding automated data collection.
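Python’s standard library can help with this check: urllib.robotparser reads a site’s robots.txt rules and tells you whether a given path may be fetched. A minimal sketch, parsing an example robots.txt inline rather than fetching it from a live site:

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt that disallows /private/ for all user agents.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/table-page"))    # allowed
print(rp.can_fetch("*", "https://example.com/private/data"))  # disallowed
```

In a real script you would point the parser at the live file with rp.set_url("https://example.com/robots.txt") followed by rp.read(), and check can_fetch before every request.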