PHP: How to remove HTML tags from a string

Updated: January 10, 2024 By: Guest Contributor

Overview

Removing HTML tags from strings is a common task in PHP development when you need plain text for processing, storage, or display. Doing it correctly also matters for security: user-supplied markup that reaches a page unfiltered can carry cross-site scripting (XSS) payloads.

Basics of Stripping HTML Tags

PHP provides a built-in function strip_tags(), which strips HTML and PHP tags from a string.

$stringWithHtml = '<h1>Hello World!</h1>';
$cleanString = strip_tags($stringWithHtml);
echo $cleanString; // Outputs: Hello World!
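
A few edge cases are worth seeing before relying on strip_tags() for cleanup. The sketch below uses made-up sample strings and shows that only the tags themselves are removed: text inside a script element survives, HTML comments are dropped, and adjacent block elements run together because no whitespace is inserted. The space-padding step at the end is one possible workaround, not something strip_tags() does for you.

```php
<?php
// Sample inputs are illustrative only.

// The <script> tags are removed, but the JavaScript between them stays as plain text.
$withScript = strip_tags('<script>alert(1)</script>Hello');

// HTML comments are always removed.
$withComment = strip_tags('<!-- internal note -->Hello');

// No space is added where tags used to be, so block elements run together.
$runTogether = strip_tags('<p>One</p><p>Two</p>');

// Possible workaround: put a space before every tag, strip, then collapse whitespace.
// Note that this also touches any literal "<" in the text.
$padded = str_replace('<', ' <', '<p>One</p><p>Two</p>');
$spaced = trim(preg_replace('/\s+/', ' ', strip_tags($padded)));

echo $withScript, "\n";  // alert(1)Hello
echo $withComment, "\n"; // Hello
echo $runTogether, "\n"; // OneTwo
echo $spaced, "\n";      // One Two
```

In every case above, all tags are removed from the input.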

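However, you may want to keep certain tags for formatting. Pass them as the second argument to strip_tags(). Note that the attributes of allowed tags, such as style, are left untouched.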

$stringWithHtml = '<p>Hello,<span style="color:red;"> World!</span></p>';
$allowedTags = '<p><span>';
$cleanString = strip_tags($stringWithHtml, $allowedTags);
echo $cleanString; // Outputs: <p>Hello,<span style="color:red;"> World!</span></p>
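
To make the effect of the allow list easier to see, here is a small sketch with a hypothetical input in which one tag is kept and another is removed. It also shows the array form of the second argument, which is available as of PHP 7.4, and illustrates that attributes on allowed tags, including event handlers, pass through unchanged.

```php
<?php
// Hypothetical input: an allowed <p> with an event handler and a disallowed <b>.
$input = '<p onclick="evil()">Hi <b>there</b></p>';

// String form: allowed tags are written with angle brackets.
$kept = strip_tags($input, '<p>');

// Array form (PHP 7.4+): tag names without angle brackets.
$keptArray = strip_tags($input, ['p']);

echo $kept, "\n"; // <p onclick="evil()">Hi there</p>
var_dump($kept === $keptArray); // bool(true)
```

Because the onclick attribute survives, an allow list on its own should not be treated as a defense against XSS.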

Dealing with Malicious Code

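While strip_tags() is effective at removing markup, it is not enough to prevent XSS attacks on its own: it leaves the text between tags in place and does not touch the attributes of allowed tags. When the goal is to display untrusted input safely, htmlspecialchars() takes a different approach. Instead of removing tags, it converts special characters to HTML entities so the browser shows them as text.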

// PHP concatenates strings with the dot operator, not +
$stringWithHtml = '<script>alert("XSS Attack!")</script>' .
    '<div>Some text</div>';
$safeString = htmlspecialchars($stringWithHtml);
echo $safeString; // Outputs: &lt;script&gt;alert(&quot;XSS Attack!&quot;)&lt;/script&gt;&lt;div&gt;Some text&lt;/div&gt;
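
Since the default flags of htmlspecialchars() changed in PHP 8.1, it can help to pass them explicitly. The following is a minimal sketch of a wrapper; the function name escapeForHtml() and the sample comment string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: escape text for safe output in an HTML context.
function escapeForHtml(string $text): string
{
    // ENT_QUOTES escapes both single and double quotes;
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$comment = '<a href=\'#\' onclick="steal()">Tom & Jerry</a>';

$escaped = escapeForHtml($comment);
echo $escaped, "\n";
// &lt;a href=&#039;#&#039; onclick=&quot;steal()&quot;&gt;Tom &amp; Jerry&lt;/a&gt;

// Stripping first and escaping afterwards yields plain, display-safe text.
$plain = escapeForHtml(strip_tags($comment));
echo $plain, "\n"; // Tom &amp; Jerry
```

Escaping is best done at output time, so the stored value stays unchanged and can be escaped appropriately for each context.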

Custom HTML Tag Stripping Functions

What if you need more control? You can write custom functions using regular expressions with preg_replace().

$stringWithHtml = '<div style="font-size: 18px;">Text</div>';
$cleanString = preg_replace('/<\/?.+?(>|$)/s', '', $stringWithHtml); // closing parenthesis added to the group
echo $cleanString; // Outputs: Text
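
One thing a custom function can do that strip_tags() cannot is drop the contents of script and style elements along with the tags. The sketch below is one way to do that; the function name stripTagsAndScripts() and the sample input are assumptions. Regular expressions are not a full HTML parser, so treat this as a convenience for reasonably well-formed input rather than a security control.

```php
<?php
// Hypothetical helper: remove <script>/<style> elements with their contents, then all other tags.
function stripTagsAndScripts(string $html): string
{
    // \1 refers back to the tag name captured by (script|style);
    // the "i" flag ignores case and "s" lets "." match newlines.
    $withoutBlocks = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);

    // preg_replace() returns null on failure; fall back to the original input.
    if ($withoutBlocks === null) {
        $withoutBlocks = $html;
    }

    return trim(strip_tags($withoutBlocks));
}

$html = '<style>p { color: red; }</style><p>Hi</p><script>alert(1)</script>';
echo stripTagsAndScripts($html), "\n"; // Hi
```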

Using DOMDocument for Advanced HTML Manipulation

For more complex operations, such as removing scripts but keeping other tags intact, the DOMDocument class is very powerful.

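$stringWithHtml = '<div>Some text</div><script>alert("XSS Attack!")</script>';

// Collect parser warnings about imperfect markup instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($stringWithHtml);
libxml_clear_errors();

// getElementsByTagName() returns a live list, so copy it before removing nodes;
// removing while iterating the live list can skip elements
$scriptTags = iterator_to_array($dom->getElementsByTagName('script'));

foreach ($scriptTags as $tag) {
    $tag->parentNode->removeChild($tag);
}

echo $dom->saveHTML(); // Outputs the full document (doctype, <html>, <body>) without script tags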
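
When the input is a fragment rather than a whole page, the doctype and html/body wrapper that saveHTML() produces are usually unwanted. The sketch below wraps the same idea in a function that returns only the body's contents. The name removeElements() and its parameters are assumptions for illustration. It requires the DOM extension, and it does not address character encoding, which loadHTML() may not detect as UTF-8 without a hint.

```php
<?php
// Hypothetical helper: remove every element with one of the given tag names
// and return the remaining fragment without the <html>/<body> wrapper.
function removeElements(string $html, array $tagNames): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tagNames as $tagName) {
        // Copy the live node list before removing anything from it.
        foreach (iterator_to_array($dom->getElementsByTagName($tagName)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> separately to skip the wrapper elements.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }

    return $result;
}

$out = removeElements('<div>Some text</div><script>alert(1)</script>', ['script']);
echo $out, "\n";
```

Passing the tag names as an array makes it easy to remove several element types, for example ['script', 'style', 'iframe'], in one call.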

Libraries for Sanitizing HTML

Third-party libraries such as HTML Purifier filter HTML against an allow list of elements and attributes, which lets you keep safe formatting while removing scripts and other dangerous markup.

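// Adjust the path to match your installation (or use Composer's autoloader instead)
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Sample untrusted input (illustrative)
$dirtyHtml = '<p>Hello</p><script>alert("XSS Attack!")</script>';

$cleanHtml = $purifier->purify($dirtyHtml);
echo $cleanHtml;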

Conclusion

PHP offers several ways to remove HTML tags from strings. strip_tags() covers quick cleanup, htmlspecialchars() makes untrusted text safe to display, regular expressions allow customized stripping, and DOMDocument handles structural changes such as removing specific elements. When you need to accept user-supplied HTML and keep some of it, a dedicated library like HTML Purifier is the more robust choice.