Python: 5 ways to remove HTML tags from a string

Updated: May 20, 2023 By: Frienzied Flame

This concise, example-based article walks you through five approaches to stripping HTML tags from a string in Python to get plain text.

The raw HTML string we will use in the examples to come is shown below:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

As you can see, it contains several common HTML tags such as <script>, <a>, <img>, and <p>, void elements such as <br> and <hr>, and a sample comment. We use a fairly long HTML string so that each method is exercised against several different constructs. If the test HTML string is too short and simple, potential pitfalls might be overlooked.

Using lxml

lxml is a powerful, fast library for processing HTML and XML. It is an external package, so we need to install it first:

pip install lxml

Example:

from lxml import etree

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    parser = etree.HTMLParser()
    tree = etree.fromstring(text, parser)
    return etree.tostring(tree, encoding='unicode', method='text')

plain_text = remove_html_tags(html_string)
print(plain_text.strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link

Using Regular Expressions

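You can use the re module to build a pattern that matches everything from a < to the nearest >, and then use re.sub() to replace each match with an empty string. This is quick to write, but it does not really parse HTML, so it works best on simple, well-formed markup.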

Example:

import re

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    # Non-greedy match from "<" to the nearest ">"
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)

result = remove_html_tags(html_string)

# print the result without leading and trailing whitespace
print(result.strip())

The output is exactly the same as with the previous method:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
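
One thing to keep in mind with the pattern above: by default, the dot does not match a newline, so a tag or comment that spans several lines is left in place. HTML entities such as &amp; are not decoded either. The sketch below is one possible variation, not the only way to do it. The function name remove_html_tags_multiline and the sample string are made up for illustration.

```python
import html
import re

# re.DOTALL lets "." match newlines, so tags and comments that span
# several lines are matched as well.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)

def remove_html_tags_multiline(text):
    """Strip tags (including multi-line ones), then decode HTML entities."""
    without_tags = TAG_PATTERN.sub("", text)
    # Decode entities only after the tags are gone, so that text such as
    # "&lt;b&gt;" is not mistaken for a tag.
    return html.unescape(without_tags)

sample = """<p>Fish &amp; chips</p>
<!-- a comment
     spanning two lines -->
<p>Done</p>"""

print(remove_html_tags_multiline(sample))
```

This is still a plain pattern match: a > character inside a comment or an attribute value ends the match early, so a real parser remains the safer choice for messy input.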

Using BeautifulSoup

This solution involves using the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.

Install the library:

pip install beautifulsoup4

Then utilize it like so:

from bs4 import BeautifulSoup

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print(remove_html_tags(html_string).strip())

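You get the same plain text as in the previous examples. Note that get_text() keeps the whitespace between tags by default; the listing below shows the text with the leading indentation of each line trimmed: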

Sling Academy


This is a heading
Some meaningless text

Sample link
Sample link
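
The extracted text in this article still contains blank lines, and usually indentation carried over from the HTML source. If you want compact text no matter which method produced it, a small post-processing step can help. This is a minimal sketch using only built-in string methods; the helper name normalize_whitespace is hypothetical.

```python
def normalize_whitespace(text):
    """Trim every line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "Sling Academy\n    \n    \n        This is a heading\n        Some meaningless text\n"
print(normalize_whitespace(raw))
# Sling Academy
# This is a heading
# Some meaningless text
```

You can apply it to the return value of any remove_html_tags() function shown in this article.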

Using a for loop and if…else statements

This technique needs no imports, and you can customize it as needed. Our weapons are just a for loop, some if...else statements, and some basic string operations.

Here’s the code:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    inside_tag = False
    result = []
    for char in text:
        if char == '<':
            inside_tag = True
        elif char == '>':
            inside_tag = False
        elif not inside_tag:
            # Keep only the characters that are outside of a tag
            result.append(char)
    return ''.join(result)

print(remove_html_tags(html_string).strip()) 

The output is the same:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
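
As an example of customizing the loop, the sketch below also tracks quoted attribute values, so that a > inside an attribute does not end the tag early. With the simple version, the input used here would leak part of the attribute into the result. The function name remove_html_tags_quote_aware and the sample string are made up for illustration.

```python
def remove_html_tags_quote_aware(text):
    inside_tag = False
    quote_char = None  # the quote that opened the current attribute value
    result = []
    for char in text:
        if inside_tag:
            if quote_char:
                # Inside a quoted value: only the matching quote ends it
                if char == quote_char:
                    quote_char = None
            elif char in ('"', "'"):
                quote_char = char
            elif char == '>':
                inside_tag = False
        elif char == '<':
            inside_tag = True
        else:
            result.append(char)
    return ''.join(result)

sample = '<a title="a > b" href="#">Sample link</a>'
print(remove_html_tags_quote_aware(sample))  # Sample link
```

The trade-off is that quotes are now significant everywhere between < and >, so an apostrophe inside an HTML comment would be treated as the start of a quoted value. Treat this as a starting point rather than a complete solution.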

Using HTMLParser

This solution uses Python’s built-in html.parser module to parse the HTML and collect the text, so nothing needs to be installed. The code is a little longer than in the preceding approaches, though.

Example:

from html.parser import HTMLParser

class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        # Called by the parser for the text it finds between tags
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result)

def remove_html_tags(text):
    remover = HTMLTagRemover()
    remover.feed(text)
    return remover.get_text()

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

print(remove_html_tags(html_string).strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
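
The sample HTML has an empty <script> element, so the class above has nothing unwanted to collect. In real pages, though, the parser also passes the contents of <script> and <style> elements to handle_data(), and they would end up in the result. The sketch below shows one way to skip them by overriding handle_starttag() and handle_endtag(). The names VisibleTextExtractor, SKIPPED_TAGS, and extract_visible_text are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    # Elements whose contents should not appear in the plain text
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_visible_text(text):
    extractor = VisibleTextExtractor()
    extractor.feed(text)
    extractor.close()  # flush anything the parser is still buffering
    return "".join(extractor.parts)

sample = "<style>p { color: red; }</style><p>Hello</p><script>alert('hi');</script>"
print(extract_visible_text(sample))  # Hello
```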

That’s it. Happy coding & have a nice day!