When working with language-related tasks in Python, you may need to get a list of unique words or characters from a string to perform word frequency analytics, tokenization, deduplication, vocabulary creation, or data cleaning and preprocessing tasks.
This concise, straight-to-the-point article will walk you through a couple of different approaches to extracting unique words and characters from a given string in Python. There’s no time to waste; let’s get our hands dirty with code!
Using the split() method and Set conversion
If your input string doesn’t contain punctuation, or you don’t mind punctuation showing up in your results, this approach is fine. Otherwise, use the regular-expression approach described in the next section.
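To make the punctuation caveat concrete, here is a minimal sketch. The `sample` string is a made-up input used only for illustration; it shows that `split()` breaks on whitespace alone, so punctuation stays attached to the words.

```python
# Hypothetical sample text containing punctuation (illustration only)
sample = "red, green! red blue. green"

# split() only breaks on whitespace, so "red," and "red" are different items
words_with_punctuation = set(sample.split())

# sorted() is used here only to get a stable order for printing
print(sorted(words_with_punctuation))
# ['blue.', 'green', 'green!', 'red', 'red,']
```

Here "green" and "green!" are counted as two distinct words, which is usually not what you want for vocabulary building.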
The steps are:
- Split the string into individual words using the split() method. For characters, no splitting is needed, because a string is already a sequence of characters.
- Convert the result to a set using the set() function to eliminate duplicates.
- Convert the set back to a list if needed.
Code example:
text = "blue red green red blue yellow orange blue"
# list of unique words
unique_words = list(set(text.split()))
print(unique_words)
# list of unique characters
unique_characters = list(set(text))
print(unique_characters)
Output (the order of the elements may differ on your machine, because sets are unordered):
['green', 'red', 'yellow', 'blue', 'orange']
['b', 'a', 'r', 'd', 'o', 'g', 'w', 'e', 'u', 'l', 'n', 'y', ' ']
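A set does not remember the order in which items were added. If you want the unique items in the order they first appear, one possible alternative (a swapped-in technique, not part of the approach above) is dict.fromkeys(), which relies on dictionaries preserving insertion order in Python 3.7 and later. The variable names below are illustrative.

```python
text = "blue red green red blue yellow orange blue"

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this deduplicates while preserving first-seen order
ordered_unique_words = list(dict.fromkeys(text.split()))
print(ordered_unique_words)
# ['blue', 'red', 'green', 'yellow', 'orange']

# The same idea works for characters, since a string is iterable
ordered_unique_characters = list(dict.fromkeys(text))
print(ordered_unique_characters)
```

If you only need a reproducible order rather than the original one, wrapping the set in sorted() is a simpler option.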
Using regular expressions and Set conversion
The main difference between this approach and the previous one is that it leaves punctuation and spaces out of the results. You will get a list of unique “clean” words and a list of unique “clean” word characters (letters, digits, and the underscore, which is what \w matches).
The steps:
- Use the re.findall() function with a regular expression pattern to extract all words or characters from the string.
- Convert the resulting list to a set using the set() function to remove duplicates.
- Convert the set back to a list if required.
Code example:
import re
text = "Dog, Cat! Dog Dragon Cat! Dog."
# list of all unique words
unique_words = list(set(re.findall(r'\b\w+\b', text)))
print(unique_words)
# list of unique characters
unique_characters = list(set(re.findall(r'\w', text)))
print(unique_characters)
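Output (the order of the elements may differ on your machine, because sets are unordered):
['Dog', 'Cat', 'Dragon']
['o', 'D', 'a', 'n', 'C', 'g', 'r', 't']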
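Note that the example above is case-sensitive: "Dog" and "dog" would be treated as two different words. If that is not what you want, a minimal sketch is to lowercase the text before extracting words. The `sample` string here is hypothetical and used only for illustration.

```python
import re

# Hypothetical sample text with mixed capitalization (illustration only)
sample = "Dog, dog! CAT cat Dragon."

# Lowercase first so that "Dog" and "dog" collapse into one word,
# then sort the set to get a stable, readable order
unique_lower_words = sorted(set(re.findall(r'\b\w+\b', sample.lower())))
print(unique_lower_words)
# ['cat', 'dog', 'dragon']
```

For text in languages with more complex case rules, str.casefold() may be a better fit than str.lower().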
That’s it. Happy coding & have a nice day!