Python: 5 ways to remove HTML tags from a string

This concise, example-based article walks you through five different approaches to stripping HTML tags from a string in Python to get plain text.

The raw HTML string we will use in the examples to come is shown below:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

As you can see, it contains several common HTML tags such as <script>, <a>, <img>, and <p>, void elements such as <br> and <hr>, and a sample comment. We use a fairly long HTML string so that each method is exercised against a range of cases. If the test HTML string is too short and simple, potential pitfalls might be overlooked.

Using lxml
Using Regular Expressions
Using BeautifulSoup
Using a for loop and if…else statements
Using HTMLParser

Using lxml

lxml is a fast, widely used library for processing HTML and XML, built on the C libraries libxml2 and libxslt. It is an external package, so we need to install it first:

pip install lxml

Example:

from lxml import etree

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    parser = etree.HTMLParser()
    tree = etree.fromstring(text, parser)
    return etree.tostring(tree, encoding='unicode', method='text')

plain_text = remove_html_tags(html_string)
print(plain_text.strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
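
The output above still carries the indentation and blank lines of the source HTML. If that is unwanted, a small post-processing step can tidy it up. The following is a minimal sketch that uses only the standard library; tidy_whitespace and raw_text are names made up for this illustration, and the helper works on the plain text produced by any of the methods in this article:

```python
def tidy_whitespace(text):
    """Strip each line and drop the lines that end up empty."""
    stripped_lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped_lines if line)

# Text shaped like the output above (leftover indentation and blank lines)
raw_text = (
    "Sling Academy\n"
    "    \n"
    "    \n"
    "        This is a heading\n"
    "        Some meaningless text\n"
    "        \n"
    "          Sample link\n"
    "            Sample link"
)

print(tidy_whitespace(raw_text))
```

Each remaining line holds one piece of text with no leading or trailing spaces.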

Using Regular Expressions

You can use the re module to create a pattern that matches any text inside < and >, and then use the re.sub() method to replace them with empty strings.

Example:

import re

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
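    # Non-greedy match of everything between < and >;
    # re.DOTALL also covers tags that are split across several lines
    clean = re.compile(r'<.*?>', re.DOTALL)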
    return re.sub(clean, '', text)

result = remove_html_tags(html_string)

# print the result without leading and trailing white spaces
print(result.strip())

The output is exactly the same as with the previous method:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
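
A plain tag pattern has two gaps worth knowing about: a comment that itself contains a > character is only partly removed, and HTML entities such as &amp; are left untouched. The following is a rough sketch of one way to cover both, assuming it is acceptable to drop comments entirely; COMMENT_RE, TAG_RE, strip_tags_and_unescape, and sample are hypothetical names introduced for this illustration:

```python
import html
import re

# Hypothetical patterns for this sketch
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # whole comments, even multi-line ones
TAG_RE = re.compile(r"<[^>]*>")                    # any remaining tag

def strip_tags_and_unescape(text):
    # Remove comments first, so a ">" inside a comment cannot end the match early
    without_comments = COMMENT_RE.sub("", text)
    without_tags = TAG_RE.sub("", without_comments)
    # Decode entities last, so "&lt;" is not mistaken for the start of a tag
    return html.unescape(without_tags)

sample = "<p>Fish &amp; chips</p>\n<!-- a\nmulti-line <b>comment</b> -->\n<p>5 &lt; 7</p>"
print(strip_tags_and_unescape(sample))
```

Regular expressions still do not understand HTML structure, so for untrusted or messy markup a real parser is the safer choice.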

Using BeautifulSoup

This solution involves using the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.

Install the library:

pip install beautifulsoup4

Then utilize it like so:

from bs4 import BeautifulSoup

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print(remove_html_tags(html_string).strip())

The result is the same plain text as in the previous examples, because get_text() keeps the whitespace of the source by default:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
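
If you want each piece of text on its own line with no leftover indentation, get_text() accepts separator and strip arguments. The following is a minimal sketch, assuming beautifulsoup4 is installed; html_to_lines and sample are names made up for this illustration:

```python
from bs4 import BeautifulSoup

def html_to_lines(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # strip=True trims every text fragment and skips whitespace-only ones;
    # separator="\n" joins the remaining fragments with a newline
    return soup.get_text(separator="\n", strip=True)

sample = """
<div>
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
</div>
"""
print(html_to_lines(sample))
```

With these arguments there is no need to call strip() on the result afterwards.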

Using a for loop and if…else statements

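This technique is very flexible, and you can customize it as needed. Our weapons are just a for loop, some if...else statements, and some basic string operations. Keep in mind that it treats every < as the start of a tag and every > as the end of one, so a stray < or > in the text or in an attribute value will throw it off.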

Here’s the code:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    inside_tag = False
    result = ''
    for char in text:
        if char == '<':
            inside_tag = True
        elif char == '>':
            inside_tag = False
        else:
            if not inside_tag:
                result += char
    return result

print(remove_html_tags(html_string).strip())

The output is the same:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
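
As an illustration of how the loop can be customized, the sketch below also tracks quoted attribute values, so that a > inside an attribute does not end the tag early. It is a starting point under simple assumptions rather than a complete solution: a comment that contains a quote character would still confuse it. The names remove_html_tags_quote_aware and sample are made up for this example:

```python
def remove_html_tags_quote_aware(text):
    result = []
    inside_tag = False
    quote_char = None  # the quote that opened the current attribute value, if any
    for char in text:
        if inside_tag:
            if quote_char:
                # Inside a quoted value: only the matching quote ends it
                if char == quote_char:
                    quote_char = None
            elif char in ('"', "'"):
                quote_char = char
            elif char == '>':
                inside_tag = False
        elif char == '<':
            inside_tag = True
        else:
            result.append(char)
    # Joining a list once avoids repeated string concatenation in the loop
    return ''.join(result)

sample = '<a title="a > b" href="#">Sample link</a>'
print(remove_html_tags_quote_aware(sample))
```

For this sample the function keeps only the link text, whereas a loop that ignores quotes would leak part of the title attribute into the result.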

Using HTMLParser

This solution makes use of Python's built-in html.parser module to parse the HTML and extract the text. It needs no third-party package, but the code is a little longer than in the preceding approaches.

Example:

from html.parser import HTMLParser

class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result)

def remove_html_tags(text):
    remover = HTMLTagRemover()
    remover.feed(text)
    return remover.get_text()

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

print(remove_html_tags(html_string).strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
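
The sample string has an empty <script> element, but when a page contains inline JavaScript or CSS, handle_data() receives that code as well. The following is a minimal sketch of one way to leave it out, by counting how deep the parser currently is inside <script> or <style>; VisibleTextExtractor, extract_visible_text, and sample are hypothetical names used only for this illustration:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_visible_text(markup):
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)

sample = "<p>Hello</p> <script>console.log('hi');</script> <style>p { color: red; }</style> <p>World</p>"
print(extract_visible_text(sample))
```

Only the text of the two paragraphs survives. The same idea extends to any other element whose content you want to ignore.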

That’s it. Happy coding & have a nice day!
