PHP: How to Extract all Headings from HTML Source (2 Ways)

Last updated: February 02, 2024

Overview

Extracting headings from HTML source code is useful for content analysis, SEO, and building content summaries. PHP's DOM extension and regular expression functions make this straightforward. In this tutorial, we'll look at two methods for extracting all headings (from <h1> to <h6>) from a given HTML source using PHP.

Prerequisites

Before you begin, just make sure you have the following:

  • Basic understanding of PHP
  • Access to a PHP environment (local or server-based)
  • Basic understanding of HTML

Method 1: Using DOMDocument

The first method uses the DOMDocument class, which is part of PHP's DOM extension. Because it parses the markup into a tree of nodes instead of matching text patterns, this approach copes better with attributes, nested tags, and imperfect HTML.


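$sourceHtml = '<html><body><h1>Welcome to my blog</h1><h2>PHP Tutorials</h2><p>Lorem ipsum...</p></body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sourceHtml);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Run one XPath query per heading level, from h1 to h6.
for ($i = 1; $i <= 6; $i++) {
    $query = sprintf('//h%d', $i);
    $entries = $xpath->query($query);
    foreach ($entries as $entry) {
        // Escape the heading text before echoing it into HTML output.
        echo "<p>Found: " . htmlspecialchars($entry->nodeValue) . "</p>";
    }
}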

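In the above example, we first load the HTML into a DOMDocument and then use DOMXPath to run one query per heading level. The nodeValue of each match gives the text of that heading. Because the queries run level by level, the output is grouped by heading level (all <h1> elements first, then <h2>, and so on) rather than by the order in which the headings appear in the document.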
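If the headings are needed in the order they appear in the document (for an outline, say), one option is a single XPath query that matches any heading element. The sketch below is an illustration under that assumption, not part of the original example; the function name extractHeadings and the shape of the returned array are hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns headings in document order as
 * [['level' => 1, 'text' => '...'], ...].
 */
function extractHeadings(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    // Collect libxml warnings internally and restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A single location path returns its matches in document order.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]'
    );

    $headings = [];
    foreach ($nodes as $node) {
        $headings[] = [
            'level' => (int) substr($node->nodeName, 1), // "h2" becomes 2
            'text'  => trim($node->textContent),
        ];
    }
    return $headings;
}

$html = '<html><body><h2>Intro</h2><h1>Title</h1><h3>Details</h3></body></html>';
foreach (extractHeadings($html) as $heading) {
    echo "h{$heading['level']}: {$heading['text']}", PHP_EOL;
}
// Prints:
// h2: Intro
// h1: Title
// h3: Details
```

Returning an array instead of echoing keeps the extraction separate from the presentation, so the caller decides how (and whether) to escape and render the text.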

Method 2: Using Regular Expressions

Parsing HTML with regular expressions (regex) is generally not recommended, because patterns are brittle: they can break on malformed markup, comments, or attribute values that contain a > character. Even so, this method can be useful for quick scripts or when dealing with simple, predictable HTML structures.


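$htmlContent = 'Your HTML content here';
// Group 1 captures the tag name (h1 to h6); \1 requires a matching closing tag.
// Group 2 lazily captures the inner content. Flags: i = case-insensitive, s = dot matches newlines.
$pattern = '/<(h[1-6])[^>]*>(.*?)<\/\1>/is';

preg_match_all($pattern, $htmlContent, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[2] may contain nested tags, so strip them and escape the remaining text.
    echo "<p>Found: " . htmlspecialchars(strip_tags($match[2])) . " ({$match[1]})</p>";
}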

This code snippet demonstrates how to use PHP’s preg_match_all function to search for all headings in an HTML string. The key is to define a regular expression that matches headings while capturing their content for extraction.
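To see how such a pattern behaves on headings with attributes, nested tags, and entities, here is a small sketch that wraps the regex approach in a function. The function name extractHeadingsWithRegex and the returned array shape are hypothetical, and the pattern is a slightly stricter variant (it adds \b after the tag name and allows whitespace before the closing >), not a full HTML parser.

```php
<?php
/**
 * Hypothetical helper: extracts headings with a regular expression.
 * Suitable only for simple, well-formed markup.
 */
function extractHeadingsWithRegex(string $html): array
{
    // \b stops "h1" from matching a longer tag name such as "h1x".
    $pattern = '/<(h[1-6])\b[^>]*>(.*?)<\/\1\s*>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $headings = [];
    foreach ($matches as $match) {
        // Remove nested tags, then decode entities such as &amp; into plain text.
        $text = html_entity_decode(strip_tags($match[2]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $headings[] = [
            'tag'  => strtolower($match[1]),
            'text' => trim($text),
        ];
    }
    return $headings;
}

$html = '<H1 class="title">Hello <em>World</em></H1>' . "\n"
      . '<h2>Tips &amp; Tricks</h2><p>Not a heading</p>';

foreach (extractHeadingsWithRegex($html) as $heading) {
    echo "{$heading['tag']}: {$heading['text']}", PHP_EOL;
}
// Prints:
// h1: Hello World
// h2: Tips & Tricks
```

The returned text is decoded plain text, so it still needs escaping (for example with htmlspecialchars) before being written into an HTML page.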

Best Practices and Considerations

  • DOM vs. Regex: Whenever possible, prefer using DOMDocument for HTML parsing as it’s more reliable and secure. Regular expressions can be useful in certain conditions but are generally less safe and accurate.
  • Error Handling: When using DOMDocument, suppress warnings that result from malformed HTML by prefixing the loadHTML method with an @ symbol or by setting libxml_use_internal_errors(true) before loading HTML.
  • Performance: Keep in mind that HTML parsing and element extraction can be resource-intensive, especially for large documents or complex queries. Optimize by parsing only the necessary parts when possible.
  • Security: Be cautious when handling HTML content, especially if it comes from user input, to avoid cross-site scripting (XSS) attacks. Extracted heading text is plain text, so escape it with PHP's htmlspecialchars or htmlentities functions before writing it into an HTML page.
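The error-handling and security points above can be sketched together as follows. This is a minimal illustration, assuming the input is untrusted; the function name renderHeadingsSafely and its return shape are hypothetical. It collects parser messages with libxml_use_internal_errors instead of the @ operator and escapes every heading before output.

```php
<?php
/**
 * Hypothetical helper: parses untrusted HTML and returns escaped heading
 * lines plus any parser messages libxml reported.
 */
function renderHeadingsSafely(string $html): array
{
    // Keep parser warnings in an internal buffer instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $errors = libxml_get_errors();   // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]'
    );

    $lines = [];
    foreach ($nodes as $node) {
        // textContent is decoded text, so it must be escaped again before output.
        $lines[] = htmlspecialchars(trim($node->textContent), ENT_QUOTES, 'UTF-8');
    }
    return ['lines' => $lines, 'errors' => $errors];
}

// The heading text decodes to a literal <script> tag; echoing it raw would inject markup.
$untrusted = '<h1>Tips &lt;script&gt;alert(1)&lt;/script&gt;</h1>';
$result = renderHeadingsSafely($untrusted);

foreach ($result['lines'] as $line) {
    echo "<p>Found: {$line}</p>", PHP_EOL;
}
// Prints: <p>Found: Tips &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Inspecting the collected errors (for example, logging each message) is optional; the point is that malformed input no longer produces warnings in the page output.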

Conclusion

Extracting headings from HTML documents with PHP is useful in many scenarios, from web scraping to content analysis. The DOMDocument approach is generally preferred for its reliability and flexibility, while regular expressions offer a quick alternative for simple, predictable markup. Whichever method you choose, remember to escape extracted text before output, especially when dealing with user-generated content.
