Pandas: How to count non-NA/null values in a DataFrame (4 ways)

Updated: February 20, 2024 By: Guest Contributor


Pandas, a cornerstone library in Python for data manipulation and analysis, offers various approaches for handling missing data within a DataFrame. A common task when preprocessing data is counting non-NA/null values to assess data completeness or perform subsequent analysis. This tutorial will guide you through several methods to accomplish this, ranging from straightforward to more nuanced approaches.

Preparing a Test DataFrame

Before we delve into counting non-NA/null values, let’s create a simple DataFrame to work with:

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [29, None, 25, np.nan, 42],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Austin'],
    'Profession': [None, 'Developer', 'Designer', 'Developer', np.nan]
})

This DataFrame represents a snapshot of a dataset, including some missing values represented by None or np.nan.

Method 1: count()

The simplest way to count non-NA/null values across each column is to use the count() method:

# Counting non-null values in each column
print(df.count())

This method directly provides the count of non-null values per column. By default, count() works column-wise; pass axis=1 (or axis='columns') to get row-wise counts instead.
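To illustrate the row-wise variant, here is a small self-contained sketch using the sample DataFrame from above (in that data, every row happens to have exactly one missing value):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [29, None, 25, np.nan, 42],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Austin'],
    'Profession': [None, 'Developer', 'Designer', 'Developer', np.nan]
})

# Per-column counts (the default behavior)
col_counts = df.count()

# Per-row counts: pass axis=1
row_counts = df.count(axis=1)
print(row_counts)
```

Each value in row_counts is the number of non-null cells in that row, which is handy for flagging records that are too incomplete to keep.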

Method 2: notnull() with sum()

A more granular approach can be achieved by combining notnull() with sum():

# Counting non-null values using notnull() and sum()
print(df.notnull().sum())

This method provides both column-wise and row-wise counts when needed, and because notnull() returns a boolean mask, it composes naturally with filtering and aggregation for more detailed analytical tasks.
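Since notnull() (also available as the alias notna()) yields a boolean DataFrame, the resulting counts can be aggregated along either axis, or summed twice for a grand total of non-null cells. A short sketch using the sample data above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [29, None, 25, np.nan, 42],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Austin'],
    'Profession': [None, 'Developer', 'Designer', 'Developer', np.nan]
})

# Boolean mask: True where a cell is not NA/null
mask = df.notnull()

col_counts = mask.sum()          # non-null values per column
row_counts = mask.sum(axis=1)    # non-null values per row
total = mask.sum().sum()         # non-null cells in the whole frame
print(total)
```

The grand-total form is a quick way to compute an overall completeness ratio, e.g. total / df.size.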

Method 3: Using info()

The info() method is useful for getting a quick summary of the DataFrame, including the count of non-NA values:

# Using info() to get summary details
df.info()

Although info() prints its summary rather than returning an object suitable for further manipulation, it is an efficient way to get an at-a-glance overview of your data, including per-column non-null counts and dtypes.
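If you do need the info() summary as a string rather than printed output, the method accepts a buf argument pointing at any writable buffer. A minimal sketch:

```python
import io

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [29, None, 25, np.nan, 42],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Austin'],
    'Profession': [None, 'Developer', 'Designer', 'Developer', np.nan]
})

# Redirect info()'s output into an in-memory buffer
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

This is useful for logging the summary or attaching it to a data-quality report.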

Advanced Methods

Custom Functions

For customized needs, you can wrap these methods within your own functions to calculate specific metrics or for more complex filtering conditions:

def count_non_null(df, col_name):
    """ Counts non-NA/null values in a specific column. """
    return df[col_name].notnull().sum()

# Example usage
count = count_non_null(df, 'Age')
print(f'Non-null Age values count: {count}')

This custom function example highlights how we can extend built-in methods to fit specific needs.

Combining Multiple Conditions

For analyses that require counting non-null values under specific conditions, Pandas conditional selection can be paired with notnull() seamlessly:

# Counting non-null 'Age' values for users from 'New York'
count = df[df['City'] == 'New York']['Age'].notnull().sum()
print(f'Non-null Age values for New Yorkers: {count}')

These advanced methods and customizations underscore the flexibility of pandas when working with missing data.


Conclusion

Counting non-NA/null values in a Pandas DataFrame is a foundational skill for data analysis and can be achieved through multiple methods depending on the need. Whether it’s a quick count with count(), a detailed analysis using notnull() and sum(), or a custom function for specific conditions, Pandas provides the tools necessary for effective data manipulation. Mastering these techniques ensures a solid groundwork for handling missing values in your data exploration and cleaning processes.