Python: Get a list of unique words/characters from a string

When working with language-related tasks in Python, you may need to get a list of unique words or characters from a string for word frequency analysis, tokenization, deduplication, vocabulary creation, or data cleaning and preprocessing.

This concise, straight-to-the-point article will walk you through two approaches to extracting unique words and characters from a given string in Python. There’s no time to waste; let’s get our hands dirty with code!

Using the split() method and Set conversion

If your input string doesn’t contain punctuation, or you don’t mind punctuation staying attached to the words in your results (split() only splits on whitespace, so "Dog," and "Dog" count as different words), this approach is fine. Otherwise, use the regular expression approach in the next section.

The steps are:

1. Split the string into individual words using the split() method. For characters, no splitting is needed, because a string is already an iterable of its characters.
2. Convert the resulting list (or the string itself, for characters) to a set using the set() function to eliminate duplicates.
3. Convert the set back to a list if needed.

Code example:

text = "blue red green red blue yellow orange blue"

# list of unique words
unique_words = list(set(text.split()))
print(unique_words)

# list of unique characters
unique_characters = list(set(text))
print(unique_characters)

Output (sets are unordered, so the order of the items may differ on your machine):

['green', 'red', 'yellow', 'blue', 'orange']
['b', 'a', 'r', 'd', 'o', 'g', 'w', 'e', 'u', 'l', 'n', 'y', ' ']
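
Because a set does not remember the order in which items were added, the lists above can come out in a different order on every run. If you need the unique items in order of first appearance, one possible variation (not part of the steps above) is to use dict.fromkeys() instead of set(), since dictionary keys are unique and keep insertion order in Python 3.7 and later. A minimal sketch, with variable names chosen only for illustration:

```python
text = "blue red green red blue yellow orange blue"

# dict keys are unique and preserve insertion order (Python 3.7+)
unique_words_ordered = list(dict.fromkeys(text.split()))
print(unique_words_ordered)
# ['blue', 'red', 'green', 'yellow', 'orange']

# a string is iterable, so the same trick works for characters
unique_chars_ordered = list(dict.fromkeys(text))
print(unique_chars_ordered)
# ['b', 'l', 'u', 'e', ' ', 'r', 'd', 'g', 'n', 'y', 'o', 'w', 'a']

# if any stable order is enough, sorting the set also works
unique_words_sorted = sorted(set(text.split()))
print(unique_words_sorted)
# ['blue', 'green', 'orange', 'red', 'yellow']
```

Whether you want first-appearance order or alphabetical order depends on what you do with the list afterwards.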

Using regular expressions and Set conversion

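The main difference between this approach and the previous one is that punctuation and whitespace are excluded from the results. You will get a list of unique “clean” words and a list of unique “clean” word characters (letters, digits, and the underscore, because that is what \w matches). Note that matching is case-sensitive, so “Dog” and “dog” are treated as different words.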
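
To make the difference concrete, here is a small side-by-side sketch that runs both approaches on the same punctuated string. The variable names are illustrative only, and sorted() is used just to make the printed order reproducible:

```python
import re

sample = "Dog, Cat! Dog Dragon Cat! Dog."

# split() only breaks on whitespace, so punctuation stays attached
split_words = sorted(set(sample.split()))
print(split_words)
# ['Cat!', 'Dog', 'Dog,', 'Dog.', 'Dragon']

# \b\w+\b matches runs of word characters only, dropping punctuation
regex_words = sorted(set(re.findall(r"\b\w+\b", sample)))
print(regex_words)
# ['Cat', 'Dog', 'Dragon']
```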

The steps:

1. Use the re.findall() function with a regular expression pattern to extract all words or characters from the string.
2. Convert the resulting list to a set using the set() function to remove duplicates.
3. Convert the set back to a list if required.

Code example:

import re

text = "Dog, Cat! Dog Dragon Cat! Dog."

# list of all unique words
unique_words = list(set(re.findall(r'\b\w+\b', text)))
print(unique_words)

# list of unique characters
unique_characters = list(set(re.findall(r'\w', text)))
print(unique_characters)

Output (the order of the items may vary):

['Dragon', 'Cat', 'Dog']
['o', 'D', 'a', 'n', 'C', 'g', 'r', 't']
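
Two details of the regular expression approach are easy to overlook: matching is case-sensitive, and \w also matches the underscore. If that is not what you want, one possible adjustment is to lower-case the text before matching and to use the character class [^\W_] (word characters minus the underscore) for characters. This is a sketch under those assumptions, and the sample string is made up for illustration:

```python
import re

text = "Dog, cat! DOG dragon Cat! my_dog."

# lower-case first so "Dog", "DOG" and "dog" count as one word
unique_words = sorted(set(re.findall(r"\b\w+\b", text.lower())))
print(unique_words)
# ['cat', 'dog', 'dragon', 'my_dog']

# [^\W_] means "a word character that is not an underscore"
unique_alnum_chars = sorted(set(re.findall(r"[^\W_]", text.lower())))
print(unique_alnum_chars)
# ['a', 'c', 'd', 'g', 'm', 'n', 'o', 'r', 't', 'y']
```

Note that my_dog is still returned as a single word, because the underscore is a word character inside the \b\w+\b pattern.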

That’s it. Happy coding & have a nice day!

Series: Working with Strings in Python