Python: Get a list of unique words/characters from a string

Last updated: June 03, 2023

When working on text-processing tasks in Python, you may need a list of the unique words or characters in a string, for example for word frequency analysis, tokenization, deduplication, vocabulary creation, or data cleaning and preprocessing.

This concise, straight-to-the-point article walks you through two approaches to extracting unique words and characters from a given string in Python. There’s no time to waste; let’s get our hands dirty with code!

Using the split() method and Set conversion

If your input string doesn’t contain punctuation or you don’t care about the appearance of punctuation in your results, this approach is fine. Otherwise, see the approach in the later section (that uses regular expressions).
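To see why punctuation matters here, consider a minimal sketch that runs split() on a string containing punctuation. The sample string is the one used in the regular expression section of this article; the variable name tokens is only illustrative.

```python
text = "Dog, Cat! Dog Dragon Cat! Dog."

# split() only breaks on whitespace, so punctuation stays attached to each word
tokens = set(text.split())

# sorted() is used only to make the printed order predictable
print(sorted(tokens))
# ['Cat!', 'Dog', 'Dog,', 'Dog.', 'Dragon']
```

Here 'Dog', 'Dog,' and 'Dog.' are treated as three different words, which is usually not what you want.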

The steps are:

  1. Split the string into individual words using the split() method. For characters, no splitting is needed: a string is already an iterable of characters, so you can pass it straight to set().
  2. Convert the resulting list to a set using the set() function to eliminate duplicates.
  3. Convert the set back to a list if needed.

Code example:

text = "blue red green red blue yellow orange blue"

# list of unique words
unique_words = list(set(text.split()))
print(unique_words)

# list of unique characters
unique_characters = list(set(text))
print(unique_characters)

Output (sets are unordered, so the order of the items may differ on your machine or between runs):

['green', 'red', 'yellow', 'blue', 'orange']
['b', 'a', 'r', 'd', 'o', 'g', 'w', 'e', 'u', 'l', 'n', 'y', ' ']
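Because a set does not keep the original order of the words, you may prefer an order-preserving variant. The sketch below swaps set() for dict.fromkeys(), which relies on dictionaries keeping insertion order (guaranteed since Python 3.7). This is an alternative technique, not part of the approach above, and the variable names are illustrative.

```python
text = "blue red green red blue yellow orange blue"

# dict keys are unique and keep first-seen order, so duplicates are dropped
# while the original order of appearance is preserved
unique_words_ordered = list(dict.fromkeys(text.split()))
print(unique_words_ordered)
# ['blue', 'red', 'green', 'yellow', 'orange']

# the same idea works for characters, since a string iterates character by character
unique_characters_ordered = list(dict.fromkeys(text))
print(unique_characters_ordered)
# ['b', 'l', 'u', 'e', ' ', 'r', 'd', 'g', 'n', 'y', 'o', 'w', 'a']
```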

Using regular expressions and Set conversion

The main difference from the previous approach is that punctuation and whitespace are dropped from the results. You get a list of unique “clean” words and a list of unique “clean” word characters. Note that \w matches letters, digits, and the underscore, and that matching is case-sensitive, so “Dog” and “dog” would count as two different words.

The steps:

  1. Use the re.findall() function with a regular expression pattern to extract all words or characters from the string.
  2. Convert the resulting list to a set using the set() function to remove duplicates.
  3. Convert the set back to a list if required.

Code example:

import re

text = "Dog, Cat! Dog Dragon Cat! Dog."

# list of all unique words
unique_words = list(set(re.findall(r'\b\w+\b', text)))
print(unique_words)

# list of unique characters
unique_characters = list(set(re.findall(r'\w', text)))
print(unique_characters)

Output (the order of the items may differ, since sets are unordered):

['Dragon', 'Cat', 'Dog']
['o', 'D', 'a', 'n', 'C', 'g', 'r', 't']
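If words that differ only in letter case should count as the same word, one option is to lowercase the text before extracting words. The following is a minimal sketch under that assumption; the sample string with mixed casing is made up for illustration and is not from the example above.

```python
import re

# hypothetical input with inconsistent casing
text = "Dog, cat! dog Dragon Cat! DOG."

# lowercase first, then extract words and deduplicate;
# sorted() returns a list in a predictable alphabetical order
unique_words = sorted(set(re.findall(r'\b\w+\b', text.lower())))
print(unique_words)
# ['cat', 'dog', 'dragon']
```

Whether to normalize case depends on your task: for vocabulary building it is often helpful, while for tasks where capitalization carries meaning you may want to keep the original casing.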

That’s it. Happy coding & have a nice day!
