Introduction
When working with data in Python, a frequent need is to create new columns based on conditions applied to existing ones. In this tutorial, we’ll walk through four examples of using if-else conditions to create new columns in a Pandas DataFrame, ranging from a single condition to several combined conditions and binning. These techniques come up constantly in data preprocessing, feature engineering, and data analysis.
Setup: Import Pandas and Create a Sample DataFrame
First, let’s import the Pandas library and create a sample DataFrame to work with:
import pandas as pd
df = pd.DataFrame({
'Age': [25, 38, 15, 22, 45, 33],
'Salary': [50000, 80000, 0, 32000, 120000, 95000],
'Gender': ['Female', 'Male', 'Female', 'Female', 'Male', 'Male']
})
Example 1: Basic If-Else Condition
Let’s start with a simple scenario where we create a new column, ‘Adult’, to indicate whether each person is an adult (18 or over) or not:
df['Adult'] = ['Yes' if x>=18 else 'No' for x in df['Age']]
print(df)
The output should show our DataFrame with the new column:
   Age  Salary  Gender  Adult
0   25   50000  Female    Yes
1   38   80000    Male    Yes
2   15       0  Female     No
3   22   32000  Female    Yes
4   45  120000    Male    Yes
5   33   95000    Male    Yes
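When there are more than two outcomes, a list comprehension gets hard to read. One common alternative is to put the branches in a small function and apply it to the column. The sketch below is illustrative only: the helper `age_bracket`, the column name 'Age Bracket', and the thresholds are assumptions made for this example, not part of the original walkthrough.

```python
import pandas as pd

# Same ages as the tutorial's sample DataFrame.
df = pd.DataFrame({'Age': [25, 38, 15, 22, 45, 33]})

def age_bracket(age):
    """Hypothetical helper: map one age to one of three labels."""
    if age < 18:
        return 'Minor'
    elif age < 40:
        return 'Under 40'
    else:
        return '40 or over'

# Series.apply calls the function once per value in the column.
df['Age Bracket'] = df['Age'].apply(age_bracket)
print(df)
```

Because the function runs once per row in Python, this tends to be slower than vectorized approaches on large DataFrames, but it keeps arbitrary if/elif/else logic easy to read.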
Example 2: Advanced If-Else with Multiple Conditions
Next, let’s create a new column, ‘Financial Status’, whose value depends on conditions that combine the ‘Salary’ and ‘Age’ columns:
df['Financial Status'] = 'NA'  # default for rows that match none of the rules below
df.loc[df['Salary'] > 50000, 'Financial Status'] = 'Well-off'
df.loc[(df['Salary'] <= 50000) & (df['Age'] < 30), 'Financial Status'] = 'Starting Out'
df.loc[(df['Salary'] <= 50000) & (df['Age'] >= 30), 'Financial Status'] = 'Experienced, but modest'
print(df)
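The output would look like this. With this sample data, every row matches one of the first two rules, so neither the 'NA' default nor 'Experienced, but modest' appears:
   Age  Salary  Gender  Adult  Financial Status
0   25   50000  Female    Yes      Starting Out
1   38   80000    Male    Yes          Well-off
2   15       0  Female     No      Starting Out
3   22   32000  Female    Yes      Starting Out
4   45  120000    Male    Yes          Well-off
5   33   95000    Male    Yes          Well-off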
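The same rules can also be written as a single expression with np.select, which takes a list of conditions and a matching list of values. This is a sketch of an alternative rather than the approach used above; it rebuilds a trimmed copy of the sample DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 38, 15, 22, 45, 33],
    'Salary': [50000, 80000, 0, 32000, 120000, 95000],
})

# Conditions are evaluated in order; for each row, the first True one wins.
conditions = [
    df['Salary'] > 50000,
    (df['Salary'] <= 50000) & (df['Age'] < 30),
    (df['Salary'] <= 50000) & (df['Age'] >= 30),
]
choices = ['Well-off', 'Starting Out', 'Experienced, but modest']

# default is used for rows where no condition is True.
df['Financial Status'] = np.select(conditions, choices, default='NA')
print(df)
```

Keeping the conditions and labels in two parallel lists makes it easier to add or reorder rules than a sequence of separate df.loc assignments.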
Example 3: Using np.where
Now, for a more concise way to implement conditional logic, we turn to np.where from the NumPy library. Here, we’ll use it to add a ‘Student’ column, indicating whether the individual is likely a student (in this example, anyone under 22).
import numpy as np
df['Student'] = np.where(df['Age'] < 22, 'Yes', 'No')
print(df)
The resulting DataFrame:
   Age  Salary  Gender  Adult  Financial Status  Student
0   25   50000  Female    Yes      Starting Out       No
1   38   80000    Male    Yes          Well-off       No
2   15       0  Female     No      Starting Out      Yes
3   22   32000  Female    Yes      Starting Out       No
4   45  120000    Male    Yes          Well-off       No
5   33   95000    Male    Yes          Well-off       No
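np.where handles a single if-else, but calls can be nested so that the "else" branch is itself another np.where. The following is a minimal sketch under assumed thresholds; the column name 'Life Stage' and its labels are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same ages as the tutorial's sample DataFrame.
df = pd.DataFrame({'Age': [25, 38, 15, 22, 45, 33]})

# Outer call: under 18 -> 'Minor'; otherwise fall through to the inner call.
# Inner call: under 30 -> 'Young adult'; otherwise 'Adult 30+'.
df['Life Stage'] = np.where(
    df['Age'] < 18,
    'Minor',
    np.where(df['Age'] < 30, 'Young adult', 'Adult 30+'),
)
print(df)
```

Nesting stays readable for two or three levels; beyond that, a list-based approach such as np.select is usually easier to maintain.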
Example 4: Using pd.cut for Categorical Variables
For our final example, we’ll categorize the ‘Age’ column into bins to create a new ‘Age Group’ column. This is particularly useful when working with continuous data that you’d like to analyze categorically. By default, pd.cut treats each bin as open on the left and closed on the right, so the bins below are (0, 20], (20, 40], and (40, 60].
df['Age Group'] = pd.cut(df['Age'], bins=[0, 20, 40, 60], labels=['Youth', 'Adult', 'Senior'])
print(df)
The updated DataFrame would look like:
   Age  Salary  Gender  Adult  Financial Status  Student  Age Group
0   25   50000  Female    Yes      Starting Out       No      Adult
1   38   80000    Male    Yes          Well-off       No      Adult
2   15       0  Female     No      Starting Out      Yes      Youth
3   22   32000  Female    Yes      Starting Out       No      Adult
4   45  120000    Male    Yes          Well-off       No     Senior
5   33   95000    Male    Yes          Well-off       No      Adult
Conclusion
Creating new columns based on if-else conditions is a fundamental technique in data manipulation with Pandas. Through these examples, we’ve covered approaches from basic to advanced: a list comprehension, boolean masks with df.loc, np.where, and pd.cut. Together, these cover most of the conditional-column logic that comes up in data preprocessing and feature engineering.