Introduction
In data analysis, understanding the shape of the distribution of your data can be as crucial as knowing its central tendency or variability. The kurtosis() method in Pandas aids in assessing the shape, specifically the ‘tailedness’ of the data distribution. This tutorial walks you through using the DataFrame.kurtosis() method in Pandas, from basic to advanced usage, accompanied by multiple code examples.
What does ‘Kurtosis’ really mean?
Kurtosis is a measure of the ‘tailedness’ of a data distribution compared to a normal distribution. High kurtosis implies a distribution with heavy tails, indicating a higher probability of extreme values (outliers). Conversely, low kurtosis suggests lighter tails and fewer extreme values. Under Pearson’s definition a normal distribution has a kurtosis of 3; Pandas reports excess kurtosis (Fisher’s definition), which subtracts 3 so that a normal distribution scores 0. Kurtosis values can be:
- Leptokurtic (kurtosis > 3, excess kurtosis > 0): Distribution has heavier tails than a normal distribution.
- Platykurtic (kurtosis < 3, excess kurtosis < 0): Distribution has lighter tails than a normal distribution.
- Mesokurtic (kurtosis = 3, excess kurtosis = 0): Distribution resembles a normal distribution in terms of tailedness.
Now, let’s dive into using the kurtosis() method in Pandas.
Basic Usage
First, ensure you have Pandas installed (the leading ! is notebook syntax; drop it when running the command in a terminal):
!pip install pandas
To begin with, let’s create a simple DataFrame and compute its kurtosis:
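import pandas as pd
import numpy as np
# Two columns with different tail behavior
df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),   # normal: excess kurtosis near 0
    'B': np.random.uniform(-1, 1, 1000)  # uniform: excess kurtosis near -1.2
})
# Column-wise excess kurtosis (Fisher's definition)
kurtosis = df.kurtosis()
print(kurtosis)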
The output might resemble:
A -0.012
B -1.200
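Here, column A, drawn from a normal distribution, shows an excess kurtosis close to 0, indicating a mesokurtic distribution. Column B, derived from a uniform distribution, exhibits negative excess kurtosis (the theoretical value for a uniform distribution is -1.2), pointing towards a platykurtic distribution. Because the data are random, the exact figures change from run to run.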
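To see what the reported number is made of, the sketch below recomputes the bias-corrected excess kurtosis by hand and compares it with the Pandas result. The helper name sample_excess_kurtosis and the fixed seed are assumptions made for this illustration; neither is part of Pandas or of the example above.

```python
import numpy as np
import pandas as pd

def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (hypothetical helper, not a Pandas API)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment
    m4 = np.mean(dev ** 4)   # fourth central moment
    g2 = m4 / m2 ** 2 - 3.0  # uncorrected excess kurtosis
    # Standard small-sample correction applied to g2
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability only
demo = pd.DataFrame({
    "A": rng.normal(0, 1, 1000),
    "B": rng.uniform(-1, 1, 1000),
})

by_pandas = demo.kurtosis()
by_hand = demo.apply(sample_excess_kurtosis)  # applied column by column
print(pd.DataFrame({"pandas": by_pandas, "by_hand": by_hand}))
```

The two columns of the printed table should agree to floating-point precision, and B should land near the theoretical -1.2 of a uniform distribution.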
Working with Outliers
In datasets prone to outliers, understanding kurtosis becomes even more essential. The following example mixes 990 standard-normal values with 10 values drawn from a much wider distribution:
df_with_outliers = pd.DataFrame({
    'C': np.concatenate([np.random.normal(0, 1, 990), np.random.normal(0, 10, 10)])
})
kurtosis_outliers = df_with_outliers.kurtosis()
print(kurtosis_outliers)
The output typically shows an excess kurtosis far above 0, which indicates heavy tails: the distribution is leptokurtic, and a small number of extreme values dominates the statistic.
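A side-by-side comparison makes the effect of those few extreme values easier to see. The sketch below computes kurtosis for the well-behaved values alone and again after the wide component is appended; the seed and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed: an assumption, for repeatability only
core = rng.normal(0, 1, 990)      # well-behaved bulk of the data
extreme = rng.normal(0, 10, 10)   # a few values from a much wider distribution

clean_kurt = pd.Series(core).kurtosis()
mixed_kurt = pd.Series(np.concatenate([core, extreme])).kurtosis()

print(f"without the wide component: {clean_kurt:.3f}")
print(f"with the wide component:    {mixed_kurt:.3f}")
```

Only 1% of the values differ between the two series, yet the second figure should be many times larger than the first, because kurtosis weights deviations by their fourth power.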
Adjusting for Bias
The kurtosis method in Pandas always corrects for bias in a sample and always returns Fisher’s (excess) kurtosis. It does not take a fisher parameter (that option belongs to scipy.stats.kurtosis); its documented arguments are axis, skipna and numeric_only. If you prefer Pearson’s definition, add 3 to the result:
pearson_kurtosis = df.kurtosis() + 3
print(pearson_kurtosis)
Under Pearson’s definition, a normal distribution has a kurtosis of 3, not 0. If you need an uncorrected estimate, for example when working with a full population, scipy.stats.kurtosis(values, fisher=False, bias=True) is one option.
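How much the bias correction matters depends on sample size. The sketch below compares the corrected estimate Pandas reports with an uncorrected, moment-based one for a small and a large sample. The helper uncorrected_excess_kurtosis is a hypothetical name introduced here, and the seed is an assumption.

```python
import numpy as np
import pandas as pd

def uncorrected_excess_kurtosis(series):
    """Moment-based excess kurtosis with no bias correction (hypothetical helper)."""
    dev = series - series.mean()
    return (dev ** 4).mean() / (dev ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(1)  # fixed seed: an assumption, for repeatability only
for n in (20, 1000):
    s = pd.Series(rng.normal(0, 1, n))
    corrected = s.kurtosis()                     # bias-corrected excess kurtosis
    uncorrected = uncorrected_excess_kurtosis(s)
    print(f"n={n:4d}  corrected={corrected:.3f}  uncorrected={uncorrected:.3f}  "
          f"Pearson (corrected + 3)={corrected + 3:.3f}")
```

The gap between the two estimates is typically visible at n=20 and close to negligible at n=1000, so the choice matters mostly for small samples.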
Group-wise Kurtosis
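Pandas also allows computing kurtosis for grouped data, which is valuable for comparing distributions across categories. Grouped objects do not expose kurtosis() in every Pandas version, so this example aggregates each numeric column with Series.kurt instead:
df['Category'] = np.random.choice(['Group 1', 'Group 2'], size=1000)
kurtosis_by_group = df.groupby('Category')[['A', 'B']].agg(pd.Series.kurt)
print(kurtosis_by_group)
Each row of the result holds the excess kurtosis of columns A and B within one group. Because the categories here are assigned at random, both groups should show similar values; with real data, a large gap between groups signals differently shaped tails.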
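A self-contained version of the grouped computation might look like the sketch below. It aggregates with Series.kurt rather than calling kurtosis() on the grouped object, on the assumption that the grouped method is not available in every Pandas version. The seed and the added group-size column n are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # fixed seed: an assumption, for repeatability only
grouped_demo = pd.DataFrame({
    "A": rng.normal(0, 1, 1000),
    "B": rng.uniform(-1, 1, 1000),
    "Category": rng.choice(["Group 1", "Group 2"], size=1000),
})

# Select the numeric columns explicitly, then compute kurtosis per group and column
kurt_by_group = grouped_demo.groupby("Category")[["A", "B"]].agg(pd.Series.kurt)

# Group sizes help judge how reliable each estimate is
counts = grouped_demo.groupby("Category").size()
print(kurt_by_group.assign(n=counts))
```

Reporting the group size next to each estimate is a small design choice: kurtosis is noisy for small groups, so a striking value from a handful of rows deserves less weight.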
Time Series Data
For time series data, kurtosis indicates how prone the series is to extreme moves, which helps when looking for anomalies. Consider a DataFrame with a datetime index:
time_series_df = pd.DataFrame({
    'D': np.random.normal(0, 1, 365)
}, index=pd.date_range('2020-01-01', periods=365))
kurtosis_time_series = time_series_df.kurtosis()
print(kurtosis_time_series)
This call returns one value for the whole year, so it cannot by itself say when unusual fluctuations occurred. To pinpoint such periods, compute kurtosis over a moving window, for example with time_series_df['D'].rolling(30).kurt(); windows with unusually high values point to transient external influences or data irregularities.
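Locating when a series behaved unusually requires kurtosis over a moving window rather than a single figure. The sketch below uses the rolling(...).kurt() method with a 30-day window. The window length, the seed and the artificial spike injected at position 200 are all assumptions made so that there is something to detect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # fixed seed: an assumption, for repeatability only
values = rng.normal(0, 1, 365)
values[200] = 12.0              # artificial spike, injected for illustration

ts = pd.DataFrame({"D": values}, index=pd.date_range("2020-01-01", periods=365))

# Excess kurtosis of each trailing 30-day window; the first 29 entries are NaN
rolling_kurt = ts["D"].rolling(window=30).kurt()

peak_date = rolling_kurt.idxmax()
print(f"highest rolling kurtosis: {rolling_kurt.max():.2f} in the window ending {peak_date.date()}")
```

The peak should fall in one of the 30 windows that contain the spike. A shorter window localizes anomalies more sharply but makes each estimate noisier, so the window length is worth tuning to the data.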
Conclusion
This tutorial has explored the DataFrame.kurtosis() method in Pandas across different scenarios from basic to advanced. Understanding and applying kurtosis aids in gaining deeper insights into the distribution characteristics of your data, facilitating more informed data-driven decision-making.