Overview
Removing HTML tags from strings is a common task in PHP development, ensuring clean text for processing, storage, or display. Doing it correctly matters for security and data integrity: markup left in user input can break page layout or open the door to cross-site scripting (XSS).
Basics of Stripping HTML Tags
PHP provides the built-in function strip_tags(), which strips HTML and PHP tags from a string.
$stringWithHtml = '<h1>Hello World!</h1>';
$cleanString = strip_tags($stringWithHtml);
echo $cleanString; // Outputs: Hello World!
However, sometimes you might want to allow certain tags for formatting purposes.
$stringWithHtml = '<p>Hello,<span style="color:red;"> World!</span></p>';
$allowedTags = '<p><span>';
$cleanString = strip_tags($stringWithHtml, $allowedTags);
echo $cleanString; // Outputs: <p>Hello,<span style="color:red;"> World!</span></p>
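Two behaviours of strip_tags() are easy to miss when allowing tags: attributes on allowed tags are kept as they are, and only the tags themselves are removed, not the text between them. The snippet below is a small illustration of both; the input strings and variable names are made up for the example, and the array form of the second argument assumes PHP 7.4 or later.

```php
<?php
// Attributes on allowed tags are left untouched, including event handlers.
$input = '<p onclick="alert(1)">Hello, <b>World!</b></p>';
$kept = strip_tags($input, '<p>');
// $kept is: <p onclick="alert(1)">Hello, World!</p>

// As of PHP 7.4 the allowed tags may also be given as an array of tag names.
$keptFromArray = strip_tags($input, ['p']);

// Only the tags are removed; the text between them stays in the result.
$scriptText = strip_tags('<script>alert("hi")</script>Done');
// $scriptText is: alert("hi")Done

echo $kept, PHP_EOL, $scriptText, PHP_EOL;
```

Because of this, output that keeps any tags should still be treated as untrusted HTML.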
Dealing with Malicious Code
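While strip_tags() is effective at removing markup, it is not enough on its own to prevent XSS attacks: attributes on allowed tags and the text inside removed tags are left in place. When the goal is to display untrusted input safely, htmlspecialchars() comes into play, converting special characters to HTML entities so the browser renders them as text.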
// PHP concatenates strings with "." (not "+")
$stringWithHtml = '<script>alert("XSS Attack!")</script>' .
    '<div>Some text</div>';
$safeString = htmlspecialchars($stringWithHtml);
echo $safeString; // Outputs: &lt;script&gt;alert(&quot;XSS Attack!&quot;)&lt;/script&gt;&lt;div&gt;Some text&lt;/div&gt;
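The default flags of htmlspecialchars() differ between PHP versions (single quotes are only escaped by default from PHP 8.1), so it can help to pass the flags and encoding explicitly. The following is a minimal sketch; escapeHtml() is a hypothetical helper name, not a PHP built-in, and the sample input is invented.

```php
<?php
// Hypothetical helper: escapes text for output in an HTML context.
function escapeHtml(string $text): string
{
    // ENT_QUOTES escapes both single and double quotes;
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$userInput = '<a href=\'#\' onclick="steal()">Tom & Jerry</a>';
$escaped = escapeHtml($userInput);

echo $escaped;
// Outputs: &lt;a href=&#039;#&#039; onclick=&quot;steal()&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```

Escaping keeps the original text visible to the reader, whereas strip_tags() discards the markup; which one fits depends on whether the tags should be shown or removed.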
Custom HTML Tag Stripping Functions
What if you need more control? You can write custom functions using regular expressions with preg_replace().
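$stringWithHtml = '<div style="font-size: 18px;">Text</div>';
// Match an opening or closing tag up to the next ">" (or the end of the string if the tag is never closed)
$cleanString = preg_replace('/<\/?.+?(>|$)/s', '', $stringWithHtml);
echo $cleanString; // Outputs: Text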
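One case where a custom function can help is removing script and style blocks together with their contents before stripping the remaining tags. The sketch below is one possible approach under that assumption; stripTagsAndScripts() is a hypothetical name, and because regular expressions are not a full HTML parser, unusual or malformed markup may slip through.

```php
<?php
// Hypothetical helper: removes <script>/<style> blocks including their contents,
// then strips all remaining tags.
function stripTagsAndScripts(string $html): string
{
    // "i" makes the match case-insensitive, "s" lets "." span newlines;
    // \1 requires the closing tag to match the opening tag name.
    $withoutBlocks = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);

    // preg_replace() returns null on failure; fall back to the original input.
    if ($withoutBlocks === null) {
        $withoutBlocks = $html;
    }

    return trim(strip_tags($withoutBlocks));
}

$dirty = '<p>Hi</p><script>alert(1)</script><style>p{color:red}</style> there';
echo stripTagsAndScripts($dirty); // Outputs: Hi there
```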
Using DOMDocument for Advanced HTML Manipulation
For more complex operations, such as removing scripts but keeping other tags intact, the DOMDocument class is very powerful.
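$stringWithHtml = '<div>Some text</div><script>alert("XSS")</script>';
$dom = new DOMDocument();
$dom->loadHTML($stringWithHtml);
// getElementsByTagName() returns a live list, so copy it before removing nodes
$scriptTags = iterator_to_array($dom->getElementsByTagName('script'));
foreach ($scriptTags as $tag) {
    $tag->parentNode->removeChild($tag);
}
echo $dom->saveHTML(); // Outputs HTML without script tags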
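The same parser-based approach can be applied to attributes as well as elements. As an illustration, the sketch below removes inline event-handler attributes such as onclick using DOMXPath. The function name removeEventHandlers() is hypothetical, the code assumes the DOM extension is enabled, and it is not a complete sanitizer (for example, it does not inspect javascript: URLs).

```php
<?php
// Hypothetical helper: strips inline event-handler attributes (onclick, onload, ...)
// from an HTML fragment while leaving the elements themselves in place.
function removeEventHandlers(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Real-world markup is often invalid; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every attribute whose name starts with "on" and copy the result
    // into an array before modifying the document.
    $xpath = new DOMXPath($dom);
    $attributes = iterator_to_array($xpath->query("//@*[starts-with(name(), 'on')]"));
    foreach ($attributes as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

$cleaned = removeEventHandlers('<p onclick="evil()">Hi <b>there</b></p>');
echo $cleaned;
```

Returning only the children of the body element avoids the doctype and html wrapper that saveHTML() would otherwise add around a fragment.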
Libraries for Sanitizing HTML
Third-party libraries, like HTML Purifier, provide a solid foundation for cleaning up HTML content whilst maintaining a balance between security and flexibility.
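require_once 'HTMLPurifier.auto.php';
// Example input; in practice this would be untrusted HTML such as user-submitted content
$dirtyHtml = '<p>Hello</p><script>alert("XSS")</script>';
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$cleanHtml = $purifier->purify($dirtyHtml);
echo $cleanHtml;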
Conclusion
PHP offers multiple ways to remove HTML tags from strings. With built-in functions for quick usage, regular expressions for customized solutions, and advanced classes like DOMDocument, you have the flexibility to handle HTML content securely. External libraries like HTML Purifier can be adopted for more robust requirements and enhanced security.