HTML entities are character references that stand in for characters with special meaning in HTML (such as < and &) or for characters that are hard to type or missing from the document's encoding. Each one starts with an ampersand (&) and ends with a semicolon (;). Some common HTML entities are:
&amp; // ampersand
&lt; // less than
&gt; // greater than
&copy; // copyright
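If you want to check which character a named entity stands for, the standard library ships a lookup table in html.entities. The snippet below is a small sketch of that lookup; the helper name entity_table is made up for this illustration.

```python
import html.entities

def entity_table(names):
    """Map entity names (without '&' and ';') to the characters they represent."""
    table = {}
    for name in names:
        # name2codepoint maps an entity name such as "amp" to its Unicode code point
        codepoint = html.entities.name2codepoint[name]
        table[f"&{name};"] = chr(codepoint)
    return table

for entity, char in entity_table(["amp", "lt", "gt", "copy"]).items():
    print(f"{entity} -> {char} (U+{ord(char):04X})")
```

Note that name2codepoint only covers the HTML 4 names. The larger html.entities.html5 dictionary covers the HTML5 set and is keyed with the trailing semicolon (for example "copy;").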
This practical, example-centric article shows you a couple of different ways to unescape HTML entities in a given string in Python. No more boring words; let’s get to the point.
Using the html module
You can use the html.unescape() function to turn all HTML entities into their corresponding characters. Here’s how you can do it:
import html

def unescape_html_entities(text):
    # Convert named and numeric character references to the characters they stand for
    return html.unescape(text)

text = "&copy;2023 Sling Academy. Happy coding &amp; enjoy the day."
print(unescape_html_entities(text))

Output:

©2023 Sling Academy. Happy coding & enjoy the day.
html is part of Python’s standard library (html.unescape() is available in Python 3.4 and later), so you don’t have to install anything.
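html.unescape() also handles decimal and hexadecimal character references, and it makes a single pass over the string. The sketch below shows both behaviors. The helper unescape_fully is a hypothetical name used only here, for input that you know was escaped more than once.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same characters
print(html.unescape("&copy; 2023"))    # © 2023
print(html.unescape("&#169; 2023"))    # © 2023
print(html.unescape("&#xA9; 2023"))    # © 2023

# Only one level of escaping is removed per call
print(html.unescape("&amp;lt;"))       # &lt;

def unescape_fully(text, max_passes=3):
    """Unescape repeatedly until the text stops changing (hypothetical helper)."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

print(unescape_fully("&amp;lt;"))      # <

# html.escape() is the inverse operation, so a round trip returns the original
original = 'a < b & "c"'
print(html.unescape(html.escape(original)) == original)  # True
```

Repeated unescaping is only appropriate when the data really was escaped more than once. Applied to text that was escaped a single time, it can decode sequences the author meant literally.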
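Using BeautifulSoup
This solution leverages the beautifulsoup4 library to parse the string and return its text with all HTML entities converted to their corresponding characters. Because get_text() returns text content only, any HTML tags in the input are dropped as well.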
Install the library:
pip install beautifulsoup4
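from bs4 import BeautifulSoup

def unescape_html_entities(text):
    # Parse the string as HTML; entities are decoded while parsing
    soup = BeautifulSoup(text, 'html.parser')
    # get_text() returns the decoded text content
    return soup.get_text()

text = "Is 1 &gt; 2? &amp; &lt; ? I dunno. &quot;Yes&quot; &amp; &#39;No&#39;."
print(unescape_html_entities(text))

Output:

Is 1 > 2? & < ? I dunno. "Yes" & 'No'.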
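One practical difference between the two approaches is what happens to markup. A parser-based approach extracts text, so tags disappear, while html.unescape() leaves them in place. The sketch below illustrates this with a dependency-free stand-in built on html.parser.HTMLParser from the standard library. It is not BeautifulSoup's implementation, and the names TextExtractor and extract_text are made up for this illustration.

```python
import html
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes only; tags are ignored (illustrative stand-in)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text nodes
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()

markup = "<p>Is 1 &gt; 2? <b>&quot;Yes&quot;</b> &amp; &#39;No&#39;.</p>"

print(extract_text(markup))   # Is 1 > 2? "Yes" & 'No'.
print(html.unescape(markup))  # <p>Is 1 > 2? <b>"Yes"</b> & 'No'.</p>
```

If the input is plain text with entities and you want any tags preserved, html.unescape() is the simpler choice. If the input is an HTML fragment and you only want the readable text, a parser is the better fit.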
That’s it. Happy coding & have a nice day!