Using `TfidfVectorizer` for Text Classification in Scikit-Learn

Last updated: December 17, 2024

Text classification, the task of assigning documents to predefined categories, is a foundational problem in Natural Language Processing (NLP). Scikit-learn, a widely used machine learning library for Python, provides the pieces needed to build a text classifier. In particular, TfidfVectorizer converts raw text into the numerical feature vectors that machine learning models require.

What is TF-IDF?

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document within a corpus. A word's weight increases with the number of times it appears in the document and is offset by the number of documents in the corpus that contain it, so words that occur almost everywhere (such as "the") receive low weights.
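
To make the definition concrete, here is a minimal sketch that computes TF-IDF by hand and compares it with TfidfVectorizer. It assumes scikit-learn's default settings, where the inverse document frequency is smoothed as ln((1 + n) / (1 + df)) + 1 and each document vector is scaled to unit Euclidean length. The three-sentence corpus is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "python is fun",
    "python python statistics",
    "statistics is useful",
]

# Raw term counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: the number of documents that contain each term.
df = (counts > 0).sum(axis=0)

# Smoothed IDF, matching the default smooth_idf=True behaviour.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's result on the same corpus.
sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Other settings (for example sublinear_tf=True or norm=None) change the weighting, so the manual steps above only mirror the defaults.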

Using TfidfVectorizer in Scikit-Learn

Scikit-learn's TfidfVectorizer tokenizes raw text, counts terms, and applies TF-IDF weighting in a single step. The steps below apply it to a small text classification task, with Python code for each stage.

Step 1: Import Libraries

First, import the classes and functions used in this tutorial.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

Step 2: Data Preparation

Prepare sample data. For simplicity, consider a set of text documents and associated labels.

texts = [
    'I love programming in Python',
    'Python is a great programming language',
    'Machine learning languages may include Python',
    'Statistics and data analysis are fun',
    'Data science and statistics use Python'
]
labels = ['Python', 'Python', 'Python', 'Statistics', 'Statistics']

Split your data into training and testing datasets using train_test_split() from scikit-learn.

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
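
As a quick check on the split, the sketch below prints how many documents land in each partition. It repeats the sample data so that it runs on its own. With five documents and test_size=0.2, exactly one document is held out for testing.

```python
from sklearn.model_selection import train_test_split

texts = [
    'I love programming in Python',
    'Python is a great programming language',
    'Machine learning languages may include Python',
    'Statistics and data analysis are fun',
    'Data science and statistics use Python'
]
labels = ['Python', 'Python', 'Python', 'Statistics', 'Statistics']

# Same call as in the tutorial; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Number of training and test documents.
print(len(X_train), len(X_test))  # prints: 4 1
```

On a real dataset you would want many more examples per class, so that both partitions contain every label.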

Step 3: Create the Model

Construct a pipeline that first transforms data using TfidfVectorizer and then applies the Naive Bayes classifier.

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

Step 4: Train the Model

Fit the model on the training data.

model.fit(X_train, y_train)
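
Once fitted, the individual pipeline stages can be inspected. The sketch below relies on the step names that make_pipeline derives from the lowercased class names. For brevity it fits on all five sample documents rather than on the training split, which is an assumption of this example and not part of the tutorial's workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    'I love programming in Python',
    'Python is a great programming language',
    'Machine learning languages may include Python',
    'Statistics and data analysis are fun',
    'Data science and statistics use Python'
]
labels = ['Python', 'Python', 'Python', 'Statistics', 'Statistics']

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# make_pipeline names each step after its class, in lowercase.
vectorizer = model.named_steps['tfidfvectorizer']
classifier = model.named_steps['multinomialnb']

# Number of distinct terms learned by the vectorizer.
print(len(vectorizer.vocabulary_))

# Class labels known to the classifier, in sorted order.
print(classifier.classes_)
```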

Step 5: Evaluate the Model

Predict categories for the test data and calculate accuracy.

predicted_labels = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predicted_labels)
print(f'Accuracy: {accuracy * 100:.2f}%')

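With a single test document, accuracy can only be 0% or 100%, so treat the number as illustrative. The output might look something like: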

Accuracy: 100.00%
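
A score computed from a very small test set is noisy. One common way to get a steadier estimate is cross-validation, sketched below with cross_val_score. The choice of cv=2 is an assumption forced by the tiny sample (the smaller class has only two documents); on real data, 5 or 10 folds are more typical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    'I love programming in Python',
    'Python is a great programming language',
    'Machine learning languages may include Python',
    'Statistics and data analysis are fun',
    'Data science and statistics use Python'
]
labels = ['Python', 'Python', 'Python', 'Statistics', 'Statistics']

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each fold refits the whole pipeline, so the vocabulary and IDF weights
# are learned from that fold's training documents only.
scores = cross_val_score(model, texts, labels, cv=2, scoring='accuracy')

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

Passing the pipeline (rather than pre-vectorized features) to cross_val_score keeps test documents from influencing the TF-IDF statistics.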

Conclusion

The TfidfVectorizer in scikit-learn simplifies the transformation of text data into vector form, making it suitable for machine learning models. Combined with a classifier like Naive Bayes, it offers a powerful yet straightforward mechanism for text classification. Although the example above uses a tiny dataset, TfidfVectorizer returns a sparse matrix and MultinomialNB trains in a single pass over the data, so the same pipeline remains practical on much larger corpora.

As next steps, consider tuning the hyperparameters of TfidfVectorizer (for example ngram_range, min_df, or stop_words) and of the classifier for better performance, and try other classifiers available in scikit-learn.
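
As a starting point for tuning, here is a minimal sketch using GridSearchCV over the pipeline. The parameter grid is an illustrative assumption, not a recommended configuration, and cv=2 is used only because the sample data is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    'I love programming in Python',
    'Python is a great programming language',
    'Machine learning languages may include Python',
    'Statistics and data analysis are fun',
    'Data science and statistics use Python'
]
labels = ['Python', 'Python', 'Python', 'Statistics', 'Statistics']

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    'tfidfvectorizer__stop_words': [None, 'english'],   # keep or drop common English words
    'multinomialnb__alpha': [0.1, 1.0],                 # Naive Bayes smoothing strength
}

# Evaluate every combination with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

To try a different classifier, replace MultinomialNB() in make_pipeline with another scikit-learn estimator that accepts sparse input, such as LogisticRegression, and adjust the grid keys to match the new step name.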
