Python: How to unescape HTML entities in a string

Updated: May 20, 2023 By: Frienzied Flame

Overview

HTML entities are short text sequences that stand in for characters that have special meaning in HTML or that are hard to type or unavailable in the document's character set. They start with an ampersand (&) and end with a semicolon (;). Some common HTML entities are:

&amp; // ampersand
&lt; // less than
&gt; // greater than
&copy; // copyright
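Each of these entities also has numeric spellings (decimal and hexadecimal) that refer to the same character. As a small sketch, the standard library's html.entities.name2codepoint mapping can be used to list all three forms; entity_forms is a hypothetical helper name used only for illustration:

```python
from html.entities import name2codepoint

def entity_forms(name):
    """Return the named, decimal, and hex spellings of an entity plus its character."""
    code = name2codepoint[name]  # e.g. "copy" -> 169
    return f"&{name};", f"&#{code};", f"&#x{code:X};", chr(code)

# Print the equivalent spellings for the entities listed above.
for name in ("amp", "lt", "gt", "copy"):
    print(*entity_forms(name))
```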

This practical, example-centric article shows you a couple of different ways to unescape HTML entities in a given string in Python. No more boring words; let’s get to the point.

Using the html module

You can use the html.unescape() function to turn all HTML entities into their corresponding characters. Here’s how you can do it:

import html

def unescape_html_entities(text):
    return html.unescape(text)

text = "&copy;2023 Sling Academy. Happy coding &amp; enjoy the day."
print(unescape_html_entities(text))

Output:

©2023 Sling Academy. Happy coding & enjoy the day.

html is part of Python’s standard library (html.unescape() is available in Python 3.4 and later), so you don’t have to install anything.
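Two details are worth a quick illustration: html.unescape() decodes numeric references as well as named ones, and a single call removes only one level of escaping. The sketch below assumes your input may have been escaped more than once; unescape_fully and its max_passes parameter are hypothetical names, not part of the html module:

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
print(html.unescape("&copy; &#169; &#xA9;"))  # © © ©

def unescape_fully(text, max_passes=5):
    """Unescape repeatedly until the text stops changing (hypothetical helper)."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

double_escaped = "Tom &amp;amp; Jerry"
print(html.unescape(double_escaped))   # Tom &amp; Jerry
print(unescape_fully(double_escaped))  # Tom & Jerry
```

Only reach for repeated unescaping when you know the data was escaped multiple times; otherwise it can decode text that was meant to contain a literal entity such as &amp;.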

Using BeautifulSoup4

This solution uses the beautifulsoup4 library to parse the string as HTML and return its text content, with all HTML entities converted to their corresponding characters.

Install the library:

pip install beautifulsoup4

Example:

from bs4 import BeautifulSoup

def unescape_html_entities(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

text = "Is 1 &gt; 2? &amp; &lt; ? I dunno. &quot;Yes&quot; &amp; &#39;No&#39;."
print(unescape_html_entities(text))

Output:

Is 1 > 2? & < ? I dunno. "Yes" & 'No'.
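Note that parsing the string as HTML and extracting its text also drops any real tags in the input, whereas html.unescape() leaves them in place. The sketch below illustrates that difference using only the standard library's html.parser.HTMLParser, as a stand-in for the BeautifulSoup approach rather than a copy of it; TextExtractor and strip_tags_and_unescape are hypothetical names:

```python
import html
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes; tags are ignored (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags_and_unescape(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)

sample = "<b>Tom &amp; Jerry</b> &lt;3"
print(html.unescape(sample))            # <b>Tom & Jerry</b> <3
print(strip_tags_and_unescape(sample))  # Tom & Jerry <3
```

If the input is plain text that merely contains entities, html.unescape() is usually the simpler choice; a parser-based approach fits better when the input is an HTML fragment and you want the visible text only.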

That’s it. Happy coding & have a nice day!