Pandas: Select rows from DataFrame A but not in DataFrame B (3 ways)

Updated: February 21, 2024 By: Guest Contributor Post a comment

Overview

Data analysis and manipulation in Python often requires handling large datasets and comparing them to extract meaningful insights. Pandas, being one of the most powerful and widely used data manipulation libraries, provides various ways to filter and manipulate data. In this tutorial, we will explore how to select rows from one DataFrame that are not present in another DataFrame. This task is common in data analysis when dealing with disjoint datasets, cleaning data, or finding unique entries. We will start with basic approaches and gradually move to more advanced techniques, providing code examples for each step.

Prerequisites

  • Basic knowledge of Python
  • Understanding of Pandas library
  • An installed version of Pandas. If you haven’t, you can install it using pip install pandas

Preparing 2 Sample DataFrames

For demonstration purposes, let’s create two sample DataFrames:

import pandas as pd

# Create DataFrame A
df_a = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
})

# Create DataFrame B
df_b = pd.DataFrame({
    'ID': [4, 5, 6, 7, 8],
    'Name': ['David', 'Eve', 'Frank', 'Grace', 'Hannah']
})

Our goal is to select rows from df_a that are not present in df_b based on a key column, which we’ll assume to be the ‘ID’ column in this case.

Method 1: Using the ~ Operator and isin() Method

One basic approach is to use the ~ (tilde) operator combined with the isin() method. The ~ operator negates the boolean Series returned by isin(), effectively selecting rows that do not match.

result = df_a[~df_a['ID'].isin(df_b['ID'])]
print(result)

This code will output:

   ID     Name
0  1    Alice
1  2      Bob
2  3  Charlie

Method 2: Using Merge with Indicator

Another approach is to use the merge method with the indicator=True option, which adds a column indicating from which DataFrame each row originates. We can then filter based on this indicator.

merged = df_a. merge(df_b, on='ID', how='outer', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns=['_merge'])
print(result)

This will output the rows from df_a that are not in df_b, excluding the ‘Name’ column from df_b:

   ID     Name
0  1    Alice
1  2      Bob
2  3  Charlie

Method 3: Using DataFrame.query with Merge

For a more code-efficient and advanced approach, combining merge and query can offer a powerful solution. First, conduct an outer merge and then use query to filter rows.

merged = df_a.merge(df_b, on='ID', how='outer', indicator=True)
result = merged.query("_merge == 'left_only'").drop('_merge', axis=1)
print(result)

This method produces the same output as the previous methods but could be more readable and concise for some users.

Advanced Technique: Using concat and drop_duplicates

For cases where you want to find unique rows based on all columns, you can use concat and drop_duplicates. This is particularly useful when the data does not have a clear key column for comparison or when you need to consider the entirety of the data’s content.

unique_rows = pd.concat([df_a, df_b]).drop_duplicates(keep=False)
print(unique_rows)

Keep in mind, this method considers all columns when determining uniqueness, so it might give different results compared to the methods focusing only on the ‘ID’ column.

Conclusion

Selecting rows from one DataFrame that are not present in another is a common task in data analysis, and Pandas provides several powerful tools to accomplish this. Whether through the use of boolean indexing, merge operations, or concatenation, each method has its own advantages depending on the specifics of the data and the analysis goals. Experimenting with these techniques will help you develop a deeper understanding of Pandas and data manipulation in Python.