Introduction
Working with data in Python often means dealing with missing values. The pandas library, a powerhouse for data manipulation and analysis, provides the fillna() method to handle missing data in DataFrames. This tutorial walks through five practical examples of fillna(), moving from basic applications to more advanced uses.
What does fillna() do?
The pandas.DataFrame.fillna() method fills in missing values (NaN or None) in a DataFrame. The fill value can be a scalar constant, a dictionary, a Series, or another DataFrame. By default, fillna() returns a new DataFrame and leaves the original unchanged; passing inplace=True modifies the original instead.
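As a minimal sketch of the copy-versus-in-place behavior described above (the two-column frame and its values are made up for illustration), the default call leaves the original untouched:

```python
import pandas as pd

# Hypothetical frame with one gap in each column
df = pd.DataFrame({"a": [1.0, None], "b": [None, "x"]})

# Without inplace=True, fillna returns a new DataFrame
filled = df.fillna({"a": 0.0, "b": "missing"})

print(df.isna().sum().sum())      # 2: the original still has both gaps
print(filled.isna().sum().sum())  # 0: the returned copy has none
```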
Example 1: Filling with a Constant Value
import pandas as pd
# Creating a sample DataFrame with missing values
data = {"Name": ["John", "Jane", "Anna"], "Age": [28, None, 22], "City": [None, "New York", "London"]}
df = pd.DataFrame(data)
# Filling missing values with a constant
filled_df = df.fillna("Unknown")
print(filled_df)
This simple example fills every missing value in the DataFrame with the constant string “Unknown”. The result has no missing values, but note that ‘Age’ now mixes numbers and text, so it is no longer a numeric column:
   Name      Age      City
0  John     28.0   Unknown
1  Jane  Unknown  New York
2  Anna     22.0    London
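Because a blanket string fill also lands in numeric columns such as ‘Age’, it can be useful to restrict the constant to the non-numeric columns. The following is one possible sketch rather than the only approach; the name text_cols is introduced here purely for illustration:

```python
import pandas as pd

data = {"Name": ["John", "Jane", "Anna"], "Age": [28, None, 22], "City": [None, "New York", "London"]}
df = pd.DataFrame(data)

# Select every column that is not numeric (here: Name and City)
text_cols = df.select_dtypes(exclude="number").columns

# Fill only those columns; the gap in Age stays NaN and the column stays numeric
df[text_cols] = df[text_cols].fillna("Unknown")

print(df)
```

The remaining gap in ‘Age’ can then be handled separately with a numeric fill value.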
Example 2: Filling with Column-Specific Values
import pandas as pd
# Again, starting with our sample DataFrame
data = {"Name": ["John", "Jane", "Anna"], "Age": [28, None, 22], "City": [None, "New York", "London"]}
df = pd.DataFrame(data)
# Filling missing values using a dictionary to specify different fill values for each column
fill_values = {"Age": df["Age"].mean(), "City": "Not Provided"}
df.fillna(fill_values, inplace=True)
print(df)
In this example, missing values in the ‘Age’ column are filled with the column’s mean (25.0, since NaN is skipped when averaging 28 and 22), and those in the ‘City’ column with the string “Not Provided”. Choosing a fill value per column keeps ‘Age’ numeric and gives each column a placeholder that makes sense for its data:
   Name   Age          City
0  John  28.0  Not Provided
1  Jane  25.0      New York
2  Anna  22.0        London
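When a DataFrame has several numeric columns, writing the dictionary by hand gets tedious. One option, sketched below under the assumption that the mean is an acceptable fill for every numeric column, is to pass the Series returned by df.mean(numeric_only=True), which fillna() matches against column names. The ‘Score’ column is hypothetical and added only so that there are two numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "Anna"],
    "Age": [28, None, 22],
    "Score": [None, 80.0, 90.0],  # hypothetical extra numeric column
    "City": [None, "New York", "London"],
})

# One mean per numeric column, indexed by column name
column_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean;
# City has no entry in column_means, so its gap is left alone
filled = df.fillna(column_means)
print(filled)
```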
Example 3: Forward and Backward Filling (‘ffill’ and ‘bfill’)
import pandas as pd
# Yet again, our starting point DataFrame
data = {"Name": ["John", "Jane", "Anna"], "Age": [28, None, 22], "City": [None, "New York", "London"]}
df = pd.DataFrame(data)
# Forward fill: each gap takes the value from the previous row.
# fillna(method='ffill') is deprecated since pandas 2.1; DataFrame.ffill() replaces it.
ffilled_df = df.ffill()
# For the sake of illustration, backward fill the same data with DataFrame.bfill()
bfilled_df = df.bfill()
print(bfilled_df)
Forward fill (‘ffill’) copies the value from the previous row into a gap, while backward fill (‘bfill’) uses the next row’s value. This approach suits time series and other ordered data where neighboring rows are related. The backward-filled DataFrame looks like this:
   Name   Age      City
0  John  28.0  New York
1  Jane  22.0  New York
2  Anna  22.0    London
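A forward fill has nothing to copy into a gap in the first row, and a backward fill has nothing for a gap in the last row. The sketch below, which assumes the dedicated DataFrame.ffill() and DataFrame.bfill() methods available in current pandas, shows the leftover gap and one common way to close it by chaining the two fills:

```python
import pandas as pd

data = {"Name": ["John", "Jane", "Anna"], "Age": [28, None, 22], "City": [None, "New York", "London"]}
df = pd.DataFrame(data)

# Forward fill alone: Age in row 1 is filled, but City in row 0 has no earlier value
forward = df.ffill()
print(forward)

# Chaining a backward fill closes whatever the forward fill could not reach
both = df.ffill().bfill()
print(both)
```

Both methods also accept a limit argument that caps how many consecutive gaps are filled, which helps avoid stretching a stale value across a long run of missing rows.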
Example 4: Filling with a Series
import pandas as pd
# Sample DataFrame
data = {"Name": ["John", "Jane", "Anna"], "Sales": [None, 150, None]}
df = pd.DataFrame(data)
# Creating a Series to use for filling missing values
fill_series = pd.Series([100, 110, 120])
# Filling missing values in the 'Sales' column with the Series values
df['Sales'] = df['Sales'].fillna(fill_series)
print(df)
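Here, we fill missing values in the ‘Sales’ column from a Series. fillna() aligns the Series with the column by index label, so row 0 receives 100 and row 2 receives 120; the value 110 at index 1 is ignored because that row already has a value. Since the column contains NaN, it is stored as floats:
   Name  Sales
0  John  100.0
1  Jane  150.0
2  Anna  120.0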
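To make the index alignment concrete, the sketch below passes a Series whose labels are out of order and cover only one of the two gaps. The fill values are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jane", "Anna"], "Sales": [None, 150, None]})

# Labels, not positions, decide which row each value fills
fill_series = pd.Series([120, 999], index=[2, 1])

df["Sales"] = df["Sales"].fillna(fill_series)
print(df)
```

Row 2 is filled with 120, row 1 keeps its existing 150, and row 0 stays missing because the Series has no entry labeled 0.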
Example 5: Filling Using a Function
import pandas as pd
# DataFrame setup
data = {"Name": ["John", "Jane", "Anna"], "Performance": [None, "Good", None]}
df = pd.DataFrame(data)
# Defining a custom function to fill missing values based on other column values or conditions
def fill_performance(row):
    if row['Name'] == 'John':
        return 'Excellent'
    else:
        return 'Satisfactory'
df['Performance'] = df.apply(lambda row: fill_performance(row) if pd.isna(row['Performance']) else row['Performance'], axis=1)
print(df)
In this advanced example, a custom function fills missing values based on other values in the same row. Note that the filling here is done by DataFrame.apply() together with pd.isna() rather than by fillna() itself, because fillna() does not accept a function as its fill value. For rule-based fills like this one, apply() complements fillna():
   Name   Performance
0  John     Excellent
1  Jane          Good
2  Anna  Satisfactory
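The same result can be reached with fillna() itself by first building a Series of per-row defaults and letting fillna() use it only where ‘Performance’ is missing. This is an alternative sketch, assuming NumPy is available; the name defaults is introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jane", "Anna"], "Performance": [None, "Good", None]})

# Compute a default for every row from the Name column
defaults = pd.Series(
    np.where(df["Name"] == "John", "Excellent", "Satisfactory"),
    index=df.index,
)

# fillna takes a default only for rows where Performance is missing
df["Performance"] = df["Performance"].fillna(defaults)
print(df)
```

This version avoids a Python-level function call per row, which tends to matter on larger DataFrames.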
Conclusion
Throughout this tutorial, we explored five strategies for filling missing data around the pandas.DataFrame.fillna() method, ranging from simple constant substitution to conditional, row-aware imputation. With these techniques, you can choose a fill strategy that suits each column and keep missing values from distorting your analysis.