Introduction
Performing data analysis often requires the computation and addition of new columns to the existing DataFrame, especially when dealing with percentages, which provide valuable insights into the relative sizes of parts to the whole. Pandas, a powerful data manipulation library in Python, simplifies these tasks significantly. In this tutorial, we will learn various methods to add a percentage column to a DataFrame, based on the values of other columns, using Pandas.
Prerequisites
Before proceeding, ensure you have Pandas installed. You can install Pandas using pip:
pip install pandas
Also, ensure you have basic familiarity with Python and Pandas. Let’s start with the simplest case and gradually move to more complex examples.
Basic Percentage of a Single Column
First, we create a DataFrame to work with:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4]})
To add a percentage column that represents each value’s percentage of the sum of column ‘A’, we use:
df['Perc_A'] = (df['A'] / df['A'].sum()) * 100
print(df)
Output:
A Perc_A
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
Percentage Change Between Rows
Next, we calculate the percentage change between rows in a column:
df['A_Percent_Change'] = df['A'].pct_change() * 100
print(df)
Output:
A A_Percent_Change
0 1 NaN
1 2 100.000000
2 3 50.000000
3 4 33.333333
Note: The first row in ‘A_Percent_Change’ is NaN since there’s no previous row to compare.
Weighted Percentage of Multiple Columns
If your DataFrame has multiple columns and you want to get a weighted percentage of those columns into a new column, here’s how you could do it:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4]})
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5]})
total = df["A"].sum() + df["B"].sum()
df["Weighted_Perc"] = ((df["A"] + df["B"]) / total) * 100
print(df)
Output:
A B Weighted_Perc
0 1 2 12.500000
1 2 3 20.833333
2 3 4 29.166667
3 4 5 37.500000
Conditional Percentages
For more complex scenarios, where the percentage column is conditional upon the values of other columns, the np.where
function from NumPy can be very useful. For example, if you want to create a percentage column where only values above a certain threshold in column ‘A’ are considered:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["Perc_Above_Threshold"] = np.where(df["A"] > 2, (df["A"] / df["A"].sum()) * 100, 0)
print(df)
Output:
A Perc_Above_Threshold
0 1 0.0
1 2 0.0
2 3 30.0
3 4 40.0
5. Using Apply Functions to Compute Custom Percentages
When you have complex requirements for computing percentages that cannot be easily expressed with arithmetic operations or conditions, using the apply
function can be very powerful. Let’s create a column that applies a custom function to compute the percentage:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5]})
def custom_percentage(row):
return (row["A"] / (row["A"] + row["B"])) * 100
df["Custom_Perc"] = df.apply(custom_percentage, axis=1)
print(df)
This function calculates the percentage of ‘A’ in the total of ‘A’ and ‘B’ for each row and applies it across the DataFrame.
Output:
A B Custom_Perc
0 1 2 33.333333
1 2 3 40.000000
2 3 4 42.857143
3 4 5 44.444444
Conclusion
Adding percentage columns to a DataFrame is a common task in data analysis. As demonstrated, Pandas provides multiple ways to approach this task, from simple percentage calculations to applying custom functions for more complex requirements. Choosing the right approach depends on the specific needs of your data analysis project. Armed with these techniques, you’re now well-equipped to add insightful percentage-based metrics to your datasets.