Python: Get a list of unique words/characters from a string

Last updated: June 03, 2023

When working with language-related tasks in Python, you may need a list of the unique words or characters in a string, for example for word frequency analysis, tokenization, deduplication, vocabulary creation, or data cleaning and preprocessing.

This concise, straight-to-the-point article will walk you through a couple of different approaches to extracting unique words and characters from a given string in Python. There’s no time to waste; let’s get our hands dirty with code!

Using the split() method and Set conversion

If your input string doesn’t contain punctuation, or you don’t mind punctuation showing up in your results, this approach is fine. Otherwise, use the regular-expression approach described in the next section.
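To make the punctuation caveat concrete, here is a minimal sketch using a made-up sample sentence (the text and the variable name tokens are illustrative only). Because split() breaks on whitespace alone, punctuation stays attached to the words, so the same word can appear several times in slightly different forms:

```python
# Hypothetical sample sentence that contains punctuation
text = "Dog, Cat! Dog Dragon Cat! Dog."

# split() only breaks on whitespace, so "Dog", "Dog," and "Dog."
# are treated as three different tokens
tokens = set(text.split())

# sorted() is used here only to make the printed order predictable
print(sorted(tokens))  # ['Cat!', 'Dog', 'Dog,', 'Dog.', 'Dragon']
```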

The steps are:

  1. Split the string into individual words using the split() method. For characters, no splitting is needed: a string is already a sequence of characters, so it can be passed to set() directly.
  2. Convert the resulting list to a set using the set() function to eliminate duplicates.
  3. Convert the set back to a list if needed.

Code example:

text = "blue red green red blue yellow orange blue"

# list of unique words
unique_words = list(set(text.split()))
print(unique_words)

# list of unique characters
unique_characters = list(set(text))
print(unique_characters)

Output (the order of the elements may differ on your machine because sets are unordered):

['green', 'red', 'yellow', 'blue', 'orange']
['b', 'a', 'r', 'd', 'o', 'g', 'w', 'e', 'u', 'l', 'n', 'y', ' ']
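Since a set does not remember the order in which items were added, the lists above can come out in a different order from one run to the next. If the first-seen order matters to you, one common alternative (not part of the steps above, just a sketch) is dict.fromkeys(), which relies on dictionaries preserving insertion order in Python 3.7 and later. Sorting the set is another option when you only need a stable, predictable order:

```python
text = "blue red green red blue yellow orange blue"

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first appearance
unique_words = list(dict.fromkeys(text.split()))
print(unique_words)  # ['blue', 'red', 'green', 'yellow', 'orange']

# The same trick works for characters, because a string is iterable
unique_characters = list(dict.fromkeys(text))
print(unique_characters)

# Alternatively, sort the set for a deterministic alphabetical order
sorted_words = sorted(set(text.split()))
print(sorted_words)  # ['blue', 'green', 'orange', 'red', 'yellow']
```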

Using regular expressions and Set conversion

The main difference between this approach and the previous one is that it strips punctuation and whitespace from the results. You will get a list of unique “clean” words and a list of unique “clean” word characters (letters, digits, and the underscore, which is what \w matches).

The steps:

  1. Use the re.findall() function with a regular expression pattern to extract all words or characters from the string.
  2. Convert the resulting list to a set using the set() function to remove duplicates.
  3. Convert the set back to a list if required.

Code example:

import re

text = "Dog, Cat! Dog Dragon Cat! Dog."

# list of all unique words
unique_words = list(set(re.findall(r'\b\w+\b', text)))
print(unique_words)

# list of unique characters
unique_characters = list(set(re.findall(r'\w', text)))
print(unique_characters)

Output (the order of the elements may vary because sets are unordered):

['Dragon', 'Cat', 'Dog']
['o', 'D', 'a', 'n', 'C', 'g', 'r', 't']
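Note that the regular-expression approach is still case-sensitive, so “Dog” and “dog” would count as two different words. If that is not what you want, a possible variation (an assumption about your needs, using a made-up sample string) is to lowercase the text before extracting the words:

```python
import re

# Hypothetical sample text with mixed letter case
text = "Dog, cat! dog Dragon CAT! Dog."

# Lowercase first so that "Dog", "dog" and "DOG" collapse into one word,
# then extract words with the same pattern as in the example above
unique_words = sorted(set(re.findall(r'\b\w+\b', text.lower())))
print(unique_words)  # ['cat', 'dog', 'dragon']
```

For text that goes beyond plain English, str.casefold() can be used in place of str.lower() for more aggressive case normalization.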

That’s it. Happy coding & have a nice day!
