PHP: How to Extract all Headings from HTML Source (2 Ways)

Updated: February 2, 2024 By: Guest Contributor

Overview

Extracting headings from HTML source code is useful for tasks such as content analysis, SEO, building a table of contents, and creating content summaries. PHP ships with both a DOM extension and PCRE regular-expression functions, and either can do the job. In this tutorial, we’ll look at two methods to extract all headings (from <h1> to <h6>) from a given HTML source using PHP.

Prerequisites

Before you begin, just make sure you have the following:

  • Basic understanding of PHP
  • Access to a PHP environment (Local or server-based)
  • Basic understanding of HTML

Method 1: Using DOMDocument

The first method uses the DOMDocument class, which is part of PHP’s DOM extension. It parses the markup into a tree instead of matching text patterns, so it copes with attributes, nested tags, and imperfect markup more reliably than a regular expression does.


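// Sample HTML to parse
$sourceHtml = '<html><body><h1>Welcome to my blog</h1><h2>PHP Tutorials</h2><p>Lorem ipsum...</p></body></html>';
$doc = new DOMDocument();
// @ suppresses the warnings that loadHTML() emits for malformed markup
@$doc->loadHTML($sourceHtml);
$xpath = new DOMXPath($doc);

// Query each heading level in turn, from h1 to h6
for ($i = 1; $i <= 6; $i++) {
    $query = sprintf('//h%d', $i);
    $entries = $xpath->query($query);
    foreach ($entries as $entry) {
        // Escape the heading text so it is safe to print inside HTML
        echo "<p>Found: " . htmlspecialchars($entry->nodeValue) . "</p>";
    }
}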

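In the above example, we first load the HTML into a DOMDocument, then use DOMXPath to run one query per heading level (//h1 through //h6) and print the text of each match. Because the loop goes level by level, the output is grouped by level (all <h1> elements first, then all <h2> elements, and so on) rather than following the order in which the headings appear in the document.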
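The loop above visits one heading level at a time. If document order matters, one option is a single XPath query that matches all six levels at once. The following is a minimal sketch under that assumption; the helper name extractHeadings and the shape of the returned array are illustrative choices, not part of the original example.

```php
<?php
// Hypothetical helper (not from the original example): returns headings
// in document order as ['level' => int, 'text' => string] entries.
function extractHeadings(string $html): array
{
    // Collect libxml parse warnings internally instead of using @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // One location step with a predicate, so nodes come back in document order.
    $query = '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';

    $headings = [];
    foreach ($xpath->query($query) as $node) {
        $headings[] = [
            'level' => (int) substr($node->nodeName, 1), // "h2" becomes 2
            'text'  => trim($node->textContent),
        ];
    }
    return $headings;
}

$html = '<html><body><h2>Intro</h2><h1>Title</h1><h3>Details</h3></body></html>';
foreach (extractHeadings($html) as $heading) {
    echo $heading['level'], ': ', $heading['text'], PHP_EOL;
}
```

Returning the level alongside the text keeps the helper usable for things like building a table of contents, where both pieces are needed.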

Method 2: Using Regular Expressions

Parsing HTML with regular expressions (regex) is generally not recommended: a pattern has no notion of nesting, comments, or attribute values that contain a > character, so it can miss headings or match the wrong text. Even so, this method can be useful for quick scripts or when dealing with simple, predictable HTML structures.


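$htmlContent = 'Your HTML content here';
// Group 1 captures the tag name (h1 to h6); group 2 captures the heading content
$pattern = '/<(h[1-6])[^>]*>(.*?)<\/\1>/is';

preg_match_all($pattern, $htmlContent, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[2] is the heading content, $match[1] is the tag name
    echo "<p>Found: {$match[2]} ({$match[1]})</p>";
}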

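This code snippet uses PHP’s preg_match_all function to find every heading in an HTML string. In the pattern, (h[1-6]) captures the tag name, [^>]* skips any attributes, (.*?) lazily captures the heading content, and <\/\1> requires a closing tag that matches the opening one. The i flag makes the match case-insensitive, and the s flag lets the dot match newlines so that multi-line headings are found. With PREG_SET_ORDER, each element of $matches holds one heading: the tag name in index 1 and the content in index 2.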
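Headings often contain nested markup such as links or emphasis, which the captured content will include verbatim. As a rough sketch of one way to handle that, the helper below (a hypothetical extractHeadingsRegex, not part of the original snippet) writes a similar pattern in extended form with comments, then strips nested tags and decodes entities. It assumes simple, well-formed input and shares the limitations of any regex-based approach.

```php
<?php
// Hypothetical helper (an illustration, not the original snippet):
// regex-based extraction for simple, well-formed HTML only.
function extractHeadingsRegex(string $html): array
{
    // The x flag allows whitespace and # comments inside the pattern.
    $pattern = '~
        <(h[1-6])\b   # opening tag name, captured as group 1
        [^>]*>        # any attributes, up to the end of the opening tag
        (.*?)         # heading content, captured lazily as group 2
        </\1\s*>      # closing tag that matches group 1
    ~isx';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $headings = [];
    foreach ($matches as $match) {
        // Drop nested markup such as <a> or <em>, then decode entities.
        $text = html_entity_decode(strip_tags($match[2]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $headings[] = [
            'tag'  => strtolower($match[1]),
            'text' => trim($text),
        ];
    }
    return $headings;
}

$sample = '<h1 class="title">Hello <em>World</em></h1><p>Body</p><H2>A &amp; B</H2>';
foreach (extractHeadingsRegex($sample) as $heading) {
    echo $heading['tag'], ': ', $heading['text'], PHP_EOL;
}
```

Because the text is decoded here, it should be escaped again before being printed into an HTML page.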

Best Practices and Considerations

  • DOM vs. Regex: Whenever possible, prefer using DOMDocument for HTML parsing as it’s more reliable and secure. Regular expressions can be useful in certain conditions but are generally less safe and accurate.
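  • Error Handling: DOMDocument emits warnings when it meets malformed HTML. You can suppress them by prefixing the loadHTML call with an @ symbol, but calling libxml_use_internal_errors(true) before loading is usually the better choice: the warnings are collected rather than discarded, so you can inspect them with libxml_get_errors() and clear them with libxml_clear_errors().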
  • Performance: Keep in mind that HTML parsing and element extraction can be resource-intensive, especially for large documents or complex queries. Optimize by parsing only the necessary parts when possible.
  • Security: Be cautious when handling HTML content, especially if sourced from user input, to avoid cross-site scripting (XSS) attacks. Extracted heading text can contain characters such as < and &, so escape it with PHP’s htmlspecialchars or htmlentities functions before printing it into a page.
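The error-handling and security points above can be sketched together as follows. This is only an illustration: the renderHeadingList helper and the sample input are assumptions made for the example, and real code would apply whatever escaping fits its output context.

```php
<?php
// Hypothetical helper (an illustration only): escapes each heading text
// before wrapping it in list markup.
function renderHeadingList(array $texts): string
{
    $items = '';
    foreach ($texts as $text) {
        $items .= '<li>' . htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</li>';
    }
    return '<ul>' . $items . '</ul>';
}

// Untrusted input: the parser decodes these entities, so the heading text
// becomes a literal <script> tag that must not be echoed unescaped.
$untrusted = '<h1>&lt;script&gt;alert(1)&lt;/script&gt;</h1><h2>Q&amp;A</h2>';

libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$doc = new DOMDocument();
$doc->loadHTML($untrusted);
libxml_clear_errors();              // discard the collected warnings

$texts = [];
foreach ((new DOMXPath($doc))->query('//h1|//h2') as $node) {
    $texts[] = $node->textContent;
}

echo renderHeadingList($texts), PHP_EOL;
```

Escaping at the point of output, rather than when the text is extracted, keeps the stored heading text usable in non-HTML contexts as well.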

Conclusion

Extracting headings from HTML documents with PHP is useful in many scenarios, from web scraping to content analysis. The DOMDocument approach is generally preferred for its reliability and flexibility, while regular expressions offer a quick alternative for simple, predictable markup. Whichever method you choose, remember to escape extracted text before output, especially when dealing with user-generated content, to maintain the security and integrity of your applications.