In the realm of machine learning, Non-Negative Matrix Factorization (NMF) is a powerful technique for dimensionality reduction, particularly useful when the data consists of non-negative values. It decomposes a matrix into two smaller non-negative matrices, making it useful for tasks such as topic modeling, parts-based representation learning, and collaborative filtering. In this article, we’ll walk through applying NMF using Scikit-Learn, a popular machine learning library in Python.
Understanding Non-Negative Matrix Factorization
NMF aims to find two non-negative matrices W and H whose product approximates the original matrix X (i.e., X ≈ W * H). The product W * H is a low-rank approximation of X, where W serves as a weight (feature) matrix and H serves as a coefficient matrix. This factorization is particularly advantageous in applications like document clustering or image compression.
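To make the factorization concrete, here is a minimal sketch on a tiny made-up non-negative matrix (the values are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative matrix X (4 samples x 5 features); values are made up
X = np.array([
    [1.0, 0.5, 0.0, 2.0, 1.0],
    [0.0, 1.5, 1.0, 0.0, 0.5],
    [2.0, 0.0, 0.5, 1.0, 0.0],
    [0.5, 1.0, 2.0, 0.0, 1.5],
])

model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = model.fit_transform(X)   # shape (4, 2): per-sample component weights
H = model.components_        # shape (2, 5): per-component feature weights

X_approx = W @ H             # low-rank, non-negative approximation of X
```

Both factors are constrained to be non-negative, which is what makes the resulting components easy to interpret as additive parts.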
Pre-requisites
- Basic knowledge of machine learning concepts.
- Python installed on your machine along with Scikit-Learn and NumPy.
Setting Up the Environment
Begin by installing the necessary Python packages. If you haven't installed Scikit-Learn and NumPy, you can do so using pip:
pip install numpy scikit-learn
Example: Applying NMF with Scikit-Learn
To demonstrate NMF, let's use a dataset of text documents, where the aim is to extract important topics. We’ll use the popular 20 Newsgroups dataset available in Scikit-Learn.
Step 1: Load the Dataset
First, let’s load the dataset using Scikit-Learn's built-in methods:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
Step 2: Vectorize the Documents
Using TfidfVectorizer, we’ll convert the text documents into a matrix of TF-IDF (term frequency-inverse document frequency) features:
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(newsgroups_train.data)
This converts our text data into a sparse matrix with at most 1000 features.
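The same vectorization step can be checked on a small stand-in corpus (the documents below are illustrative, not from 20 Newsgroups), confirming that the result is a non-negative sparse matrix shaped (documents, features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy stand-in corpus, purely for illustration
docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stock markets rallied sharply",
]
vec = TfidfVectorizer(max_features=1000, stop_words='english')
X = vec.fit_transform(docs)

print(X.shape)   # (n_documents, n_features), with n_features capped at 1000
```

Non-negativity of the TF-IDF values is what makes this matrix a valid input for NMF.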
Step 3: Applying NMF
Now, let’s decompose the TF-IDF matrix using NMF:
from sklearn.decomposition import NMF
# Set the number of topics
n_components = 10
# Create an NMF model
nmf_model = NMF(n_components=n_components, random_state=1, init='nndsvda')
# Fit the model to the TF-IDF data
W = nmf_model.fit_transform(X_tfidf)
H = nmf_model.components_
The W matrix (documents by topics) holds each document's topic weights, while the H matrix (topics by terms) holds each topic's term weights.
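As a quick sanity check on those shapes, here is a sketch using random non-negative data as a stand-in for the TF-IDF matrix (100 documents, 50 terms, 10 topics; the sizes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 50))           # stand-in for an (n_documents, n_terms) TF-IDF matrix

nmf = NMF(n_components=10, init='nndsvda', random_state=1, max_iter=400)
W = nmf.fit_transform(X)            # (n_documents, n_topics): topic weights per document
H = nmf.components_                 # (n_topics, n_terms): term weights per topic

# Frobenius norm of the residual measures how well W @ H approximates X
error = np.linalg.norm(X - W @ H)
```

A lower residual means a better approximation; in practice you would tune n_components rather than chase a perfect reconstruction.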
Step 4: Interpret Topics
The feature matrix H can be used to identify the top terms for each topic. Let’s define a utility to fetch the top words:
def display_topics(H, feature_names, num_top_words):
    for topic_idx, topic in enumerate(H):
        # Sort term weights in descending order and keep the top num_top_words indices
        top_terms = [feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]
        print(f"Topic {topic_idx}:", " ".join(top_terms))
# Display the top words for each topic
feature_names = tfidf_vectorizer.get_feature_names_out()
display_topics(H, feature_names, 10)
This function prints the top 10 terms for each topic identified by the NMF model.
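Beyond listing top terms, the W matrix can also assign each document its dominant topic. A minimal sketch with a toy corpus (the documents and topic count here are illustrative, not the 20 Newsgroups setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Illustrative mini-corpus: two space-themed and two sports-themed documents
docs = [
    "the spacecraft entered orbit around mars",
    "the rocket launch was delayed by weather",
    "the team scored in the final minute of the game",
    "fans cheered as the players took the field",
]
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(X)

# Each document's dominant topic is the column with the largest weight in W
dominant_topic = W.argmax(axis=1)
print(dominant_topic)   # one topic index per document
```

This per-document assignment is what makes NMF useful for clustering documents by topic.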
Conclusion
Non-Negative Matrix Factorization is a highly effective method for encoding data into simpler, more interpretable forms. By applying it to text data, you can unearth valuable topics and insights with a few lines of code using Scikit-Learn. As you continue exploring NMF, consider experimenting with different datasets and adjusting hyperparameters to fine-tune the model’s performance.