Pandas DataFrame: Add new column based on values from existing columns

Updated: February 20, 2024 By: Guest Contributor

Introduction

Adding a new column to a DataFrame based on values from existing columns is a common operation in data manipulation and analysis. This operation can enhance or adjust the original dataset for further analysis, visualization, or modeling. In this tutorial, we will explore several methods to achieve this using Pandas, a powerful and flexible data analysis and manipulation tool in Python.

First, ensure you have Pandas installed in your environment by running pip install pandas in your terminal or command prompt.

Basic Column Addition

Let’s start with a basic example where we add a new column whose values are calculated from existing ones. Suppose we have a DataFrame df with two columns, A and B, and we want to create a new column C as the sum of these two columns.

import pandas as pd

df = pd.DataFrame({
  'A': [1, 2, 3, 4],
  'B': [5, 6, 7, 8]
})

df['C'] = df['A'] + df['B']
print(df)

This simple operation results in a new column C, which is the sum of columns A and B.
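If you would rather not modify the original DataFrame, the assign() method is one alternative. The short sketch below rebuilds the same sample data and shows that assign() returns a new DataFrame; the names df_with_c and df_chained are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
})

# assign() returns a new DataFrame and leaves df untouched
df_with_c = df.assign(C=df['A'] + df['B'])

# A callable receives the DataFrame being assigned to,
# which is convenient inside a method chain
df_chained = df.assign(C=lambda d: d['A'] + d['B'])

print(df_with_c)
print(list(df.columns))  # the original still has only A and B
```

Direct assignment with df['C'] = ... is usually simpler for one-off scripts, while assign() tends to fit better in chained transformations.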

Using apply() Function

The apply() method applies a function along one axis of a DataFrame: with axis=0 (the default) the function receives each column, and with axis=1 it receives each row. Let’s say we want to generate a new column Score by applying a custom function that evaluates values from other columns.

def custom_score(row):
    return row['A'] * 2 + row['B']

df['Score'] = df.apply(custom_score, axis=1)
print(df)

Here, the apply() function takes each row of the DataFrame and applies the custom_score function to it, generating a new column Score.
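Because custom_score only uses arithmetic, the same result can be computed directly on the columns. Row-wise apply() calls the Python function once per row, so the vectorized form is generally faster on large DataFrames. The sketch below recreates the sample data and compares the two approaches; score_apply and score_vectorized are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

def custom_score(row):
    return row['A'] * 2 + row['B']

# Row-wise: custom_score is called once for every row
score_apply = df.apply(custom_score, axis=1)

# Vectorized: the arithmetic runs on whole columns at once
score_vectorized = df['A'] * 2 + df['B']

# Both approaches produce the same values
print(score_apply.tolist() == score_vectorized.tolist())
```

Row-wise apply() remains useful when the logic cannot easily be expressed as column arithmetic, for example when it involves branching or calls to other Python functions.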

Using lambda Functions

For more concise and inline operations, we can use lambda functions. This is particularly useful for simple operations. For example, to create a new column D that is double the value of column A, we can do the following:

df['D'] = df['A'].apply(lambda x: x * 2)
print(df)

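Note that here apply() is called on a single column (a Series), so the lambda receives one value at a time. This differs from the DataFrame apply() with axis=1 shown previously, where the function receives an entire row. For simple arithmetic like this, df['A'] * 2 gives the same result without apply().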
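To make the distinction concrete, here is a minimal sketch of what the lambda receives in each case. The variable names are illustrative, and the sample data is recreated so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Series.apply: the lambda receives one scalar value at a time
doubled = df['A'].apply(lambda x: x * 2)

# DataFrame.apply with the default axis=0: the lambda receives each column as a Series
column_totals = df.apply(lambda col: col.sum())

# DataFrame.apply with axis=1: the lambda receives each row as a Series
row_totals = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(doubled.tolist())
print(column_totals.to_dict())
print(row_totals.tolist())
```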

Combining Columns with Different Criteria

Sometimes, the new column’s value depends on a condition rather than on arithmetic. For a two-way choice, NumPy’s where() function is a good fit: it takes a boolean condition, the value to use where the condition is true, and the value to use where it is false. Let’s say we need a new column Status that marks a row as ‘Passed’ if the Score column is greater than 5, and ‘Failed’ otherwise.

import numpy as np

df['Status'] = np.where(df['Score'] > 5, 'Passed', 'Failed')
print(df)

This operation evaluates each row based on the condition provided and assigns values accordingly.
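The condition passed to np.where() can also combine several columns. The sketch below assumes a hypothetical extra rule (B must be below 8) purely for illustration; it is not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df['Score'] = df['A'] * 2 + df['B']

# Wrap each comparison in parentheses: & binds more tightly than > and <
# The B < 8 threshold is a made-up rule for this illustration
both = (df['Score'] > 5) & (df['B'] < 8)

df['Status'] = np.where(both, 'Passed', 'Failed')
print(df)
```

Use & for "and", | for "or", and ~ for "not" when combining boolean Series; the Python keywords and, or, and not do not work element-wise.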

Advanced Usage: Using vectorized operations with np.select

For more complex conditional column creation, np.select() can handle multiple conditions. Imagine a scenario where we want to categorize our rows into three categories based on the Score: ‘High’, ‘Medium’, and ‘Low’.

conditions = [
    df['Score'] > 10,
    df['Score'] > 5,
    df['Score'] <= 5
]

choices = ['High', 'Medium', 'Low']

df['Category'] = np.select(conditions, choices, default='Unknown')
print(df)

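This approach lets you pair a list of conditions with a list of choices. The conditions are checked in order and the first one that matches determines the value, so the more specific condition (Score > 10) must come before the broader one (Score > 5). Rows that match none of the conditions receive the default value.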
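A small sketch of how np.select() resolves overlapping conditions, using a made-up scores Series that includes a missing value. The swapped variable shows what happens when the broader condition is listed first.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including a missing value
scores = pd.Series([3, 7, 12, np.nan], name='Score')

# Specific condition first: 12 matches "> 10" before "> 5" is considered
category = np.select(
    [scores > 10, scores > 5, scores <= 5],
    ['High', 'Medium', 'Low'],
    default='Unknown'
)

# Broad condition first: 12 already matches "> 5", so 'High' is never chosen
swapped = np.select(
    [scores > 5, scores > 10, scores <= 5],
    ['Medium', 'High', 'Low'],
    default='Unknown'
)

print(category.tolist())
print(swapped.tolist())
```

The missing value fails every comparison, so it falls through to the default in both cases.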

Creating Columns with Data from External Sources

Lastly, you might find yourself needing to add a column with data derived from an external source. Suppose we have another DataFrame external_df that contains additional information we want to merge with our original DataFrame based on a common key.

external_df = pd.DataFrame({
  'Key': [1, 2, 3, 4],
  'ExternalData': [9, 10, 11, 12]
})

df = df.merge(external_df, how='left', left_on='A', right_on='Key')

df = df.drop(columns='Key')  # Remove the helper Key column after the merge
print(df)

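This operation combines data from external_df into our original DataFrame, adding new information relevant to our analysis. Because how='left' is used, every row of the original DataFrame is kept, and any row without a matching Key receives NaN in ExternalData.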
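When only a single column needs to be looked up, Series.map() with an indexed Series is one alternative that avoids the extra Key column. The sketch below uses slightly different sample data (a value of 5 in column A that has no match) to show how missing keys behave; lookup is an illustrative name.

```python
import pandas as pd

# Sample data for illustration; 5 has no matching Key in external_df
df = pd.DataFrame({'A': [1, 2, 3, 5]})

external_df = pd.DataFrame({
    'Key': [1, 2, 3, 4],
    'ExternalData': [9, 10, 11, 12]
})

# Build a Series indexed by Key; map() looks up each value of A in that index
lookup = external_df.set_index('Key')['ExternalData']
df['ExternalData'] = df['A'].map(lookup)

print(df)
```

Values of A with no matching Key become NaN, which also turns the new column into a float column. This approach assumes the keys in the lookup Series are unique; for several columns or duplicate keys, merge() is usually the better fit.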

Conclusion

Adding a new column to a DataFrame based on values from existing columns is a versatile technique that can significantly enhance your data analysis process. By understanding and utilizing the different methods and functions provided by Pandas, you can manipulate your datasets in powerful and efficient ways. Whether you’re performing simple operations or complex condition-based column additions, Pandas offers the tools necessary to achieve your data manipulation goals.