Python: Replace unwanted words in a string with asterisks

Updated: June 3, 2023 By: Khue Post a comment

When developing web and other kinds of applications with Python, there might be cases where you want (or have to) substitute bad words with asterisks (*), such as:

  • Censorship: In cases where you need to filter sensitive or inappropriate content, replacing unwanted words with asterisks can help mask or censor offensive language.
  • Data anonymization: When dealing with personally identifiable information (PII), replacing names or other identifying words with asterisks can help protect privacy.
  • Profanity filtering: In text analysis or content moderation systems, replacing offensive or inappropriate words with asterisks can be part of a profanity filtering mechanism.

This easy-to-understand, step-by-step tutorial will show you how to replace bad words in a string with asterisk characters by using regular expressions.

The steps

  1. Create a list of bad words that you want to hide from your text.
  2. Import the re module.
  3. Define a regular expression pattern to match each element in the list of undesired words.
  4. Use the re.sub() function to replace unwanted words with asterisks (the number of asterisks will be equal to the number of characters in substituted words). By adding the flags=re.IGNORECASE parameter to the re.sub() function, the regular expression will match the unwanted words regardless of their case.

Code example

import re

# Define a reusable function to replace unwanted words with asterisks
def replace_unwanted_words(text, unwanted_words):
    pattern = r'\b(?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b'
    modified_text = re.sub(
        pattern, 
        lambda match: '*' * len(match.group()), 
        text, 
        flags=re.IGNORECASE)
    
    return modified_text


# Define a list of unwanted words
unwanted_words = ['stupid', 'horrible', 'ugly']

# Test the function
text1 = 'This is a stupid, horrible, and ugly text.'
print(replace_unwanted_words(text1, unwanted_words))

# Text with both upper and lower case letters
text2 = "This IS a STUPID, HORRIBLe, and UGLY text."
print(replace_unwanted_words(text2, unwanted_words))

Output:

This is a ******, ********, and **** text.
This IS a ******, ********, and **** text.

Pattern explained

If you’re not familiar with regular expressions in Python, you are likely to find it hard to understand the pattern used in the code snippet above. Let me explain it.

The regular expression pattern is constructed dynamically based on the unwanted_words input:

r'\b(?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b'

Its components are:

  • r'\b(?:: The r at the beginning denotes a raw string literal. \b represents a word boundary to match whole words.
  • (?:' + '|'.join(map(re.escape, unwanted_words)) + r')\b: This part dynamically constructs a pattern by joining the unwanted_words using the | (OR) operator. re.escape is used to escape special characters in unwanted words to ensure they are treated literally. The (?: and ) create a non-capturing group.

See also: The modern Python regular expressions cheat sheet.