Python: 5 ways to remove HTML tags from a string

Last updated: May 20, 2023

This concise, example-based article walks you through five approaches to stripping HTML tags from a string in Python to get plain text.

The raw HTML string we will use in the examples to come is shown below:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

As you can see, it contains several common HTML tags such as <script>, <a>, <img>, and <p>, void elements such as <br> and <hr>, and a sample comment. We use a fairly long HTML string so that each method is exercised against a variety of cases. If the test HTML string is too short and simple, potential pitfalls might be overlooked.

Using lxml

lxml is a fast, widely used library for processing HTML and XML. It is an external package, so we need to install it first:

pip install lxml

Example:

from lxml import etree

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    # Parse with lxml's HTML parser, then serialize only the text nodes
    parser = etree.HTMLParser()
    tree = etree.fromstring(text, parser)
    return etree.tostring(tree, encoding='unicode', method='text')

plain_text = remove_html_tags(html_string)
print(plain_text.strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
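
The output above still carries the indentation and blank lines of the original markup, because only the tags were dropped. If that matters for your use case, a small post-processing step can tidy it up. The sketch below uses only the standard library. Note that tidy_whitespace is a hypothetical helper name chosen for this example, not part of lxml, and it can be applied to the output of any method in this article:

```python
def tidy_whitespace(text):
    """Strip each line and drop the lines that end up empty."""
    stripped_lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped_lines if line)


# Sample input imitating the tag-stripped output shown above
raw_text = "Sling Academy\n    \n    \n        This is a heading\n        Some meaningless text\n"
print(tidy_whitespace(raw_text))
```

Whether you want one piece of text per line, as here, or everything joined with single spaces depends on what you do with the text afterwards.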

Using Regular Expressions

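You can use the re module to create a pattern that matches anything between < and >, and then use re.sub() to replace each match with an empty string. This works well for simple, well-formed markup like our sample, but keep in mind that a regular expression is not a real HTML parser.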

Example:

import re

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    # Non-greedy match: everything from '<' up to the nearest '>'
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)

result = remove_html_tags(html_string)

# print the result without leading and trailing white spaces
print(result.strip())

The output is exactly the same as with the previous method:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
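
Keep in mind that this pattern only looks for the nearest > and knows nothing about HTML syntax. The sketch below, using made-up input strings, illustrates two cases where it falls short: a tag that spans several lines (the dot does not match a newline unless re.DOTALL is set) and an attribute value that contains a > character. For input like that, a real parser is usually the safer choice.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')                    # pattern used above
TAG_PATTERN_DOTALL = re.compile(r'<.*?>', re.DOTALL)  # '.' also matches newlines

# Case 1: a tag that spans two lines
multi_line = '<a\n href="#">link</a>'
print(repr(TAG_PATTERN.sub('', multi_line)))          # the opening tag is left behind
print(repr(TAG_PATTERN_DOTALL.sub('', multi_line)))   # 'link'

# Case 2: a '>' inside an attribute value ends the match too early
tricky = '<img alt="a > b">caption'
print(repr(TAG_PATTERN_DOTALL.sub('', tricky)))       # part of the attribute leaks out
```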

Using BeautifulSoup

This solution involves using the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.

Install the library:

pip install beautifulsoup4

Then use it like this:

from bs4 import BeautifulSoup

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    # 'text' instead of 'input' avoids shadowing the built-in input()
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print(remove_html_tags(html_string).strip())

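The plain text is the same as in the previous examples. Note that get_text() keeps the whitespace found between tags, so the indentation is still there:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link

If you would rather have each piece of text trimmed, get_text() also accepts separator and strip arguments, for example soup.get_text(separator='\n', strip=True).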

Using a for loop and if…else statements

This technique is flexible, and you can customize it as needed. All it takes is a for loop, a few if...else statements, and some basic string operations. It assumes that every < starts a tag and every > ends one.

Here’s the code:

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

def remove_html_tags(text):
    inside_tag = False  # True while we are between '<' and '>'
    result = ''
    for char in text:
        if char == '<':
            inside_tag = True
        elif char == '>':
            inside_tag = False
        elif not inside_tag:
            # keep only the characters found outside of tags
            result += char
    return result

print(remove_html_tags(html_string).strip())

The output is the same:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
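
As an illustration of that flexibility, here is one possible variation (a sketch, not the only way to do it). It puts a space where each tag used to be so that neighbouring words do not run together, keeps a stray > that appears outside any tag, and collapses runs of whitespace at the end. The function name remove_html_tags_spaced is made up for this example:

```python
def remove_html_tags_spaced(text):
    inside_tag = False
    parts = []  # collecting pieces in a list avoids repeated string concatenation
    for char in text:
        if char == '<':
            inside_tag = True
        elif char == '>' and inside_tag:
            inside_tag = False
            parts.append(' ')  # leave a space where the tag was
        elif not inside_tag:
            parts.append(char)
    # split() + join() collapses every run of whitespace into a single space
    return ' '.join(''.join(parts).split())


print(remove_html_tags_spaced('<p>Hello</p><p>world</p>'))  # Hello world
print(remove_html_tags_spaced('<b>1 > 0</b> is true'))      # 1 > 0 is true
```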

Using HTMLParser

This solution uses the HTMLParser class from Python's built-in html.parser module to parse the HTML and extract the text. It needs no third-party package, although the code is a little longer than in the preceding approaches.

Example:

from html.parser import HTMLParser

class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result)

def remove_html_tags(text):
    remover = HTMLTagRemover()
    remover.feed(text)
    remover.close()  # process any data still buffered by the parser
    return remover.get_text()

html_string = """
<html>
    <head>
        <script src="script.js"></script>
        <link rel="stylesheet" href="styles.css">
        <title>Sling Academy</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>Some meaningless text</p>
        <div>
          <a href="https://www.slingacademy.com">Sample link</a>
            <a href="https://www.slingacademy.com">Sample link</a>
            <img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
        </div>
        <br>
        <hr>
        <!-- This is an example of a comment -->
    </body>
</html>
"""

print(remove_html_tags(html_string).strip())

Output:

Sling Academy
    
    
        This is a heading
        Some meaningless text
        
          Sample link
            Sample link
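
One thing to be aware of: html.parser passes the contents of <script> and <style> elements to handle_data() as well, so inline JavaScript or CSS would end up in the result (our sample string has none). A possible extension, sketched below with the hypothetical names VisibleTextExtractor and remove_html_tags_and_scripts, tracks when the parser is inside such an element and ignores the data it sees there:

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.result = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # collect text only when we are outside the skipped elements
        if self.skip_depth == 0:
            self.result.append(data)

    def get_text(self):
        return ''.join(self.result)


def remove_html_tags_and_scripts(text):
    extractor = VisibleTextExtractor()
    extractor.feed(text)
    extractor.close()
    return extractor.get_text()


sample = '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style><p>world</p>'
print(remove_html_tags_and_scripts(sample))  # Helloworld
```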

That’s it. Happy coding & have a nice day!
