This concise, example-based article will walk you through some different approaches to stripping HTML tags from a given string in Python (to get plain text).
The raw HTML string we will use in the examples to come is shown below:
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
As you can see, it contains several common HTML tags like <script>, <a>, <img>, and <p>, void (self-closing) elements like <br> and <hr>, and a sample comment. We use a fairly long HTML string so that each method is exercised against several kinds of markup. If the test HTML string is too short and simple, potential pitfalls might be overlooked.
Using lxml
lxml is a powerful library for processing HTML and XML. It is fast and copes well with imperfect, real-world markup. This is an external package, so we need to install it first:
pip install lxml
Example:
from lxml import etree
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
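def remove_html_tags(text):
    # Parse the string with lxml's lenient HTML parser
    parser = etree.HTMLParser()
    tree = etree.fromstring(text, parser)
    # method='text' serializes only the text content, without the markup
    return etree.tostring(tree, encoding='unicode', method='text')

plain_text = remove_html_tags(html_string)
print(plain_text.strip())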
Output:
Sling Academy
This is a heading
Some meaningless text
Sample link
Sample link
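Depending on the input, the text left behind after stripping tags can still contain blank lines and leading spaces where the tags used to be. Below is a minimal sketch of a clean-up step that can follow any of the methods in this article. The helper name collapse_blank_lines and the raw sample string are made up for illustration; only the standard library is used.

```python
def collapse_blank_lines(text):
    # Trim each line, then keep only the lines that still contain text
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

# Hypothetical leftover text, as it might look right after tag removal
raw = "\n  Sling Academy \n\n\n   This is a heading\n \nSample link\n"
print(collapse_blank_lines(raw))
```

This only tidies whitespace. It does not decide which text is visible on the page, so it complements the tag-removal step rather than replacing it.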
Using Regular Expressions
You can use the re module to create a pattern that matches any text between < and >, and then use the re.sub() function to replace every match with an empty string.
Example:
import re
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
def remove_html_tags(text):
    # Non-greedy match of everything between '<' and the next '>'
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

result = remove_html_tags(html_string)
# print the result without leading and trailing white spaces
print(result.strip())
The output is exactly the same as what we got with the previous method:
Sling Academy
This is a heading
Some meaningless text
Sample link
Sample link
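The simple pattern has a few blind spots worth knowing about: without re.DOTALL the dot does not cross line breaks, so a tag or comment that spans several lines is left in place. Inline JavaScript or CSS between <script> or <style> tags is kept as if it were text, and entities such as &amp; are not decoded. The sketch below is one possible way to cover these cases with the standard library only. The function name strip_tags_regex and the sample string are assumptions made for this illustration, and regular expressions remain a best-effort tool for HTML, not a full parser.

```python
import re
from html import unescape

# DOTALL lets "." match newlines, so multi-line blocks are covered
SCRIPT_STYLE_RE = re.compile(
    r'<(script|style)\b[^>]*>.*?</\1\s*>', re.DOTALL | re.IGNORECASE
)
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# [^>] also matches newlines, so tags broken across lines are removed
TAG_RE = re.compile(r'<[^>]+>')

def strip_tags_regex(text):
    text = SCRIPT_STYLE_RE.sub('', text)  # drop script/style blocks with their content
    text = COMMENT_RE.sub('', text)       # drop comments, including multi-line ones
    text = TAG_RE.sub('', text)           # drop the remaining tags
    return unescape(text)                 # turn entities like &amp; into characters

# Hypothetical input with an entity, inline script, and a multi-line comment
sample = ('<p>Fish &amp; chips</p><script>if (a < b) { run(); }</script>'
          '<!-- multi\nline --><b>done</b>')
print(strip_tags_regex(sample))
```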
Using BeautifulSoup
This solution uses the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.
Install the library:
pip install beautifulsoup4
Then utilize it like so:
from bs4 import BeautifulSoup
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
def remove_html_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() returns the document's text with the markup left out
    return soup.get_text()

print(remove_html_tags(html_string).strip())
Again, the result is the same plain text as in the previous examples:
Sling Academy
This is a heading
Some meaningless text
Sample link
Sample link
Using a for loop and if…else statements
This technique is very flexible, and you can customize it as needed. Our tools are just a for loop, some if...else statements, and some basic string operations.
Here’s the code:
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
def remove_html_tags(text):
    inside_tag = False
    result = ''
    for char in text:
        if char == '<':
            # Start of a tag: stop copying characters
            inside_tag = True
        elif char == '>':
            # End of a tag: resume copying with the next character
            inside_tag = False
        else:
            if not inside_tag:
                result += char
    return result

print(remove_html_tags(html_string).strip())
The output is the same:
Sling Academy
This is a heading
Some meaningless text
Sample link
Sample link
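As an example of customizing the loop: the version above treats every > as the end of a tag, so an attribute value that itself contains > (for instance title="a > b") ends the tag too early and leaks part of the attribute into the output. The sketch below adds one more piece of state to track quoted attribute values. The function name remove_html_tags_quoted is made up for this illustration. It is still a simplification: an apostrophe inside an HTML comment, for example, would be mistaken for an opening quote.

```python
def remove_html_tags_quoted(text):
    result = []          # collect characters in a list and join once at the end
    inside_tag = False
    quote = None         # the quote character currently open inside a tag, if any
    for char in text:
        if inside_tag:
            if quote:
                # Inside a quoted attribute value: only the matching quote ends it
                if char == quote:
                    quote = None
            elif char in ('"', "'"):
                quote = char
            elif char == '>':
                inside_tag = False
        elif char == '<':
            inside_tag = True
        else:
            result.append(char)
    return ''.join(result)

print(remove_html_tags_quoted('<a title="a > b">Sample link</a>'))
```

Collecting the characters in a list and joining them once also avoids building many intermediate strings on long inputs.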
Using HTMLParser
This solution makes use of Python's built-in html.parser module to parse the HTML and extract the text. However, it is a little longer than the preceding approaches.
Example:
from html.parser import HTMLParser
class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result)

def remove_html_tags(text):
    remover = HTMLTagRemover()
    remover.feed(text)
    # Flush any text the parser is still buffering
    remover.close()
    return remover.get_text()
html_string = """
<html>
<head>
<script src="script.js"></script>
<link rel="stylesheet" href="styles.css">
<title>Sling Academy</title>
</head>
<body>
<h1>This is a heading</h1>
<p>Some meaningless text</p>
<div>
<a href="https://www.slingacademy.com">Sample link</a>
<a href="https://www.slingacademy.com">Sample link</a>
<img src="https://api.slingacademy.com/public/sample-photos/1.jpeg" alt="sample image"/>
</div>
<br>
<hr>
<!-- This is an example of a comment -->
</body>
</html>
"""
print(remove_html_tags(html_string).strip())
Output:
Sling Academy
This is a heading
Some meaningless text
Sample link
Sample link
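Because HTMLParser also reports where each tag starts and ends, the same idea can be extended to ignore text that is not meant for readers, such as inline JavaScript or CSS. The sketch below overrides handle_starttag() and handle_endtag() for that purpose. The class name VisibleTextExtractor, the function name extract_visible_text, and the sample string are assumptions made for this illustration, not part of the standard library.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    # Text inside these elements is code or styling, not readable content
    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_visible_text(text):
    parser = VisibleTextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

# Hypothetical input with inline script and style content
sample = ('<p>Hello</p><script>var x = "<b>";</script>'
          '<style>p { color: red; }</style><p>World</p>')
print(extract_visible_text(sample))
```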
That’s it. Happy coding & have a nice day!