
Fetching the 20 Newsgroups Dataset with Scikit-Learn

Last updated: December 17, 2024

In the world of machine learning and text classification, the 20 Newsgroups dataset is a well-known benchmark that is often used by researchers and practitioners alike. It consists of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups, making it a compelling dataset for testing text processing techniques, natural language processing (NLP) and machine learning algorithms. In this article, we will demonstrate how to fetch and preprocess this dataset using the Scikit-Learn library in Python.

Prerequisites

Before we begin, ensure you have the following installed:

  • Python 3.6 or later
  • Scikit-Learn: A library for machine learning in Python
  • Pandas & NumPy: For data manipulation and numerical computations

You can install these packages using pip.

pip install numpy pandas scikit-learn

Loading the 20 Newsgroups Dataset

Scikit-Learn provides a convenient API to load this dataset directly. Let’s start by loading the dataset:

from sklearn.datasets import fetch_20newsgroups

# Download the dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

In this code, we download both the training and testing subsets of the 20 Newsgroups data. Setting random_state makes the shuffling order reproducible across runs; the train/test split itself is fixed by the dataset, not by this parameter.
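The fetch_20newsgroups function also accepts a categories parameter to restrict the download to specific newsgroups, and a remove parameter to strip headers, footers, and quoted replies, so that a classifier cannot exploit metadata instead of the message body. A minimal sketch (the two category names chosen here are just an example):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch only two categories and strip metadata that would
# otherwise leak easy classification clues
subset = fetch_20newsgroups(
    subset='train',
    categories=['sci.space', 'rec.autos'],
    remove=('headers', 'footers', 'quotes'),
    shuffle=True,
    random_state=42,
)

print(subset.target_names)   # category names, sorted alphabetically
print(len(subset.data))      # number of documents in these two categories
```

Stripping metadata typically lowers reported accuracy, but gives a more honest estimate of how the model would perform on text from outside this dataset.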

Understanding the Dataset

The fetch_20newsgroups function returns a Bunch object, which is similar to a dictionary. It contains a few key and useful attributes:

  • data: This contains the raw newsgroup text files.
  • target: This holds the category index for each document.
  • target_names: It lists the category names corresponding to each index in target.

# Exploring the dataset
print("Category names: ", newsgroups_train.target_names)
print("First text sample: ", newsgroups_train.data[0])
print("Corresponding category: ", newsgroups_train.target[0])
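To get a feel for how the documents are distributed across categories, we can count the occurrences of each target index with NumPy. A short sketch building on the training subset loaded above:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# Count how many training documents fall into each of the 20 categories
counts = np.bincount(newsgroups_train.target)
for name, count in zip(newsgroups_train.target_names, counts):
    print(f"{name}: {count}")

print("Total training documents:", len(newsgroups_train.data))
```

The training split contains roughly 11,000 documents, fairly evenly spread across the 20 newsgroups, which is one reason this dataset is a convenient classification benchmark.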

Preprocessing the Text Data

Before we can use this textual data for model training or evaluation, some minimal preprocessing is necessary. The text must be converted into a structured numerical format that a model can understand, typically via text vectorization techniques such as Bag of Words or TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

# Use TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

Here we employ TfidfVectorizer from Scikit-Learn to convert the collection of raw documents into a matrix of TF-IDF features, discarding common English stop words that carry little discriminative value. Note that we call fit_transform only on the training data and plain transform on the test data, so the test set is encoded with the vocabulary learned from the training set.

Training a Model

Assuming a typical use case, let's train a simple Naive Bayes model, which is often effective for text classification tasks.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Train the model
model = MultinomialNB()
model.fit(X_train, newsgroups_train.target)

# Make predictions
predictions = model.predict(X_test)

Evaluating the Model

Once we have our predictions, it's vital to evaluate the model’s performance. By using metrics such as accuracy, precision, and recall, we can grasp how well our classifier is working.

# Evaluate the model
accuracy = accuracy_score(newsgroups_test.target, predictions)
report = classification_report(newsgroups_test.target, predictions, target_names=newsgroups_test.target_names)

print("Accuracy: ", accuracy)
print(report)

The above steps provide a foundation for text classification using the 20 Newsgroups dataset along with Scikit-Learn. With a multitude of customization options available for preprocessing and learning algorithms, the actual implementation can be tailored further to better fit specific problem requirements.
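As a final sanity check, the fitted vectorizer and model can classify a brand-new piece of text. The snippet below rebuilds the pipeline from this article end to end; the example sentence is hypothetical:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Rebuild the training pipeline from the article
train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train.data)
model = MultinomialNB().fit(X_train, train.target)

# Classify an unseen document (hypothetical example text)
new_doc = ["NASA plans another shuttle mission to deploy a satellite."]
pred = model.predict(vectorizer.transform(new_doc))
print(train.target_names[pred[0]])  # prints one of the 20 category names
```

Note that new text must go through the same fitted vectorizer (transform, not fit_transform) so its features line up with the columns the model was trained on.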


Series: Scikit-Learn Tutorials
