When developing web and other kinds of applications with Python, there might be cases where you want (or have to) substitute bad words with asterisks (
*), such as:
- Censorship: In cases where you need to filter sensitive or inappropriate content, replacing unwanted words with asterisks can help mask or censor offensive language.
- Data anonymization: When dealing with personally identifiable information (PII), replacing names or other identifying words with asterisks can help protect privacy.
- Profanity filtering: In text analysis or content moderation systems, replacing offensive or inappropriate words with asterisks can be part of a profanity filtering mechanism.
This easy-to-understand, step-by-step tutorial will show you how to replace bad words in a string with asterisk characters by using regular expressions.
- Create a list of bad words that you want to hide from your text.
- Import the
- Define a regular expression pattern to match each element in the list of undesired words.
- Use the
re.sub()function to replace unwanted words with asterisks (the number of asterisks will be equal to the number of characters in substituted words). By adding the
flags=re.IGNORECASEparameter to the
re.sub()function, the regular expression will match the unwanted words regardless of their case.
import re # Define a reusable function to replace unwanted words with asterisks def replace_unwanted_words(text, unwanted_words): pattern = r'\b(?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b' modified_text = re.sub( pattern, lambda match: '*' * len(match.group()), text, flags=re.IGNORECASE) return modified_text # Define a list of unwanted words unwanted_words = ['stupid', 'horrible', 'ugly'] # Test the function text1 = 'This is a stupid, horrible, and ugly text.' print(replace_unwanted_words(text1, unwanted_words)) # Text with both upper and lower case letters text2 = "This IS a STUPID, HORRIBLe, and UGLY text." print(replace_unwanted_words(text2, unwanted_words))
This is a ******, ********, and **** text. This IS a ******, ********, and **** text.
If you’re not familiar with regular expressions in Python, you are likely to find it hard to understand the pattern used in the code snippet above. Let me explain it.
The regular expression pattern is constructed dynamically based on the
r'\b(?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b'
Its components are:
rat the beginning denotes a raw string literal.
\brepresents a word boundary to match whole words.
(?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b: This part dynamically constructs a pattern by joining the
re.escapeis used to escape special characters in unwanted words to ensure they are treated literally. The
)create a non-capturing group.