Working with DataFrame.kurtosis() method in Pandas (practical examples)

Introduction
What does ‘Kurtosis’ really mean?
Basic Usage
Working with Outliers
Adjusting for Bias
Group-wise Kurtosis
Time Series Data
Conclusion

Introduction

In data analysis, understanding the shape of the distribution of your data can be as crucial as knowing its central tendency or variability. The kurtosis() method in Pandas aids in assessing the shape, specifically the ‘tailedness’ of the data distribution. This tutorial walks you through using the DataFrame.kurtosis() method in Pandas, from basic to advanced usage, accompanied by multiple code examples.

What does ‘Kurtosis’ really mean?

Kurtosis is a measure of the ‘tailedness’ of a data distribution compared to a normal distribution. High kurtosis implies heavy tails, meaning extreme values (outliers) are more likely than under a normal distribution. Conversely, low kurtosis suggests lighter tails and fewer extreme values. Kurtosis values can be:

Leptokurtic (kurtosis > 3, excess kurtosis > 0): Distribution has heavier tails than a normal distribution.
Platykurtic (kurtosis < 3, excess kurtosis < 0): Distribution has lighter tails than a normal distribution.
Mesokurtic (kurtosis = 3, excess kurtosis = 0): Distribution resembles a normal distribution in terms of tailedness.

The threshold of 3 follows Pearson’s definition. Pandas reports excess kurtosis (Fisher’s definition), which subtracts 3 so that a normal distribution scores approximately 0.
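To make the three categories concrete, here is a minimal sketch that draws large samples from three textbook distributions and compares their kurtosis as reported by Pandas. The seed, the sample size, and the choice of a Laplace distribution as the heavy-tailed case are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary fixed seed for repeatability
n = 100_000                      # large sample so estimates sit near theory

samples = pd.DataFrame({
    "laplace": rng.laplace(0, 1, n),    # heavy tails (theoretical excess kurtosis: 3)
    "normal": rng.normal(0, 1, n),      # reference shape (theoretical excess kurtosis: 0)
    "uniform": rng.uniform(-1, 1, n),   # light tails (theoretical excess kurtosis: -1.2)
})

# DataFrame.kurtosis() returns one excess-kurtosis value per column
excess = samples.kurtosis()
print(excess)
```

The Laplace column should come out clearly positive (leptokurtic), the normal column near zero (mesokurtic), and the uniform column clearly negative (platykurtic).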

Now, let’s dive into using the kurtosis() method in Pandas.

Basic Usage

First, ensure you have Pandas installed:

!pip install pandas

To begin with, let’s create a simple DataFrame and compute its kurtosis:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.uniform(-1, 1, 1000)
})

kurtosis = df.kurtosis()
print(kurtosis)

The output might resemble:

A   -0.012
B   -1.200

Here, column A, drawn from a normal distribution, shows a kurtosis close to 0, indicating a mesokurtic distribution. Column B, derived from a uniform distribution, exhibits negative kurtosis, pointing towards a platykurtic distribution.
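Real DataFrames often contain missing values and non-numeric columns. The sketch below shows how the skipna and numeric_only parameters of kurtosis() behave in that situation. The DataFrame mixed and its label column are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
mixed = pd.DataFrame({
    "A": rng.normal(0, 1, 200),
    "B": rng.uniform(-1, 1, 200),
    "label": ["x"] * 200,        # non-numeric column
})
mixed.loc[0, "A"] = np.nan       # introduce one missing value

# numeric_only=True skips 'label'; skipna=True (the default) ignores the NaN
col_kurt = mixed.kurtosis(numeric_only=True)

# With skipna=False, a column containing NaN yields NaN
strict_kurt = mixed.kurtosis(numeric_only=True, skipna=False)

# kurt() is an alias of kurtosis()
alias_kurt = mixed.kurt(numeric_only=True)

print(col_kurt)
print(strict_kurt)
```

Passing numeric_only=True explicitly is the safer habit when a DataFrame may contain text columns, since how non-numeric columns are handled by default has changed between Pandas versions.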

Working with Outliers

In datasets prone to outliers, understanding kurtosis becomes even more essential. The following example illustrates handling a DataFrame with potential outliers:

df_with_outliers = pd.DataFrame({
    'C': np.concatenate([np.random.normal(0, 1, 990), np.random.normal(0, 10, 10)])
})

kurtosis_outliers = df_with_outliers.kurtosis()
print(kurtosis_outliers)

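The output shows a large positive kurtosis, far above the near-zero value expected for normally distributed data. This signals heavy tails: a leptokurtic distribution in which extreme values are more likely. The exact value varies considerably from run to run, because it is driven by just ten wide-variance draws.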
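To see how much a handful of extreme values moves the statistic, the following sketch computes kurtosis for the same data with and without the ten wide-variance draws. The seed and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)        # arbitrary fixed seed
clean = rng.normal(0, 1, 990)         # well-behaved bulk of the data
extreme = rng.normal(0, 10, 10)       # ten draws with ten times the spread

clean_series = pd.Series(clean)
contaminated = pd.Series(np.concatenate([clean, extreme]))

kurt_clean = clean_series.kurt()
kurt_contaminated = contaminated.kurt()

print(f"without outliers: {kurt_clean:.3f}")
print(f"with outliers:    {kurt_contaminated:.3f}")
```

Only 1% of the observations differ between the two series, yet the kurtosis changes substantially, because the statistic depends on fourth powers of the deviations from the mean.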

Adjusting for Bias

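By default, the kurtosis method in Pandas returns the sample excess kurtosis (Fisher’s definition, where a normal distribution scores 0) with a bias correction for sample size. Unlike scipy.stats.kurtosis, the Pandas method has no fisher or bias parameter, so fisher=False is not a valid argument here. If you prefer Pearson’s definition, add 3 to the result:

pearson_kurtosis = df.kurtosis() + 3
print(pearson_kurtosis)

Under Pearson’s definition, a normal distribution has a kurtosis of 3, not 0. If you need the uncorrected (population) estimate instead, for example when your data covers the whole population, compute it from the central moments or use a function that exposes that choice, such as scipy.stats.kurtosis.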
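The relationship between the uncorrected estimate, the bias-corrected estimate, and Pearson’s definition can be checked by hand. The sketch below computes the moments with NumPy and applies the standard small-sample correction. It assumes that this correction is the one Pandas applies, which the final print lets you check on your own installation. The names g2 and G2 are conventional labels, not Pandas API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(rng.normal(0, 1, 50))   # small sample, where the correction is visible
n = len(s)

dev = s - s.mean()
m2 = (dev ** 2).mean()                # second central moment (divides by n)
m4 = (dev ** 4).mean()                # fourth central moment (divides by n)

g2 = m4 / m2 ** 2 - 3                 # uncorrected (population-style) excess kurtosis
pearson = g2 + 3                      # Pearson's definition: normal distribution -> 3

# Standard bias correction for sample excess kurtosis
G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(f"uncorrected excess kurtosis: {g2:.4f}")
print(f"bias-corrected (by hand):    {G2:.4f}")
print(f"pandas Series.kurt():        {s.kurt():.4f}")
```

For large samples the two estimates converge, so the distinction matters mainly for small datasets.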

Group-wise Kurtosis

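Kurtosis can also be computed for grouped data, which is valuable for comparing distributions across different categories. Depending on your Pandas version, grouped objects may not provide a kurtosis() method of their own, so a portable approach is to pass a function to agg():

df['Category'] = np.random.choice(['Group 1', 'Group 2'], size=1000)

# Apply Series.kurt() to each numeric column within each group
kurtosis_by_group = df.groupby('Category')[['A', 'B']].agg(lambda s: s.kurt())
print(kurtosis_by_group)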

The results offer insights into the distribution characteristic of each group, essential for comparative analysis.
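Because the groups above are assigned at random, their kurtosis values will be nearly identical. A comparison is more informative when the groups really differ in shape, as in this sketch. The group labels, the seed, and the distributions chosen are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5_000

grouped_df = pd.DataFrame({
    "Category": ["heavy"] * n + ["light"] * n,
    "value": np.concatenate([
        rng.laplace(0, 1, n),     # heavy-tailed group
        rng.uniform(-1, 1, n),    # light-tailed group
    ]),
})

# One excess-kurtosis value per group
by_group = grouped_df.groupby("Category")["value"].agg(lambda s: s.kurt())
print(by_group)
```

The "heavy" group should show positive excess kurtosis and the "light" group negative, making the difference in tail behaviour visible at a glance.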

Time Series Data

For time series data, understanding seasonal trends or detecting anomalies becomes simpler by evaluating the kurtosis over time. Consider a DataFrame with datetime indexes:

time_series_df = pd.DataFrame({
    'D': np.random.normal(0, 1, 365)
}, index=pd.date_range('2020-01-01', periods=365))

kurtosis_time_series = time_series_df.kurtosis()
print(kurtosis_time_series)

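Note that this produces a single value summarizing the entire year. To pinpoint periods with unusual fluctuation, indicative of transient external influences or data irregularities, compute kurtosis over rolling or resampled windows rather than over the whole series.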
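One way to evaluate kurtosis over time is a rolling window. The sketch below injects a single artificial spike into a year of random data and uses rolling(...).kurt() to locate it. The spike date, its magnitude, and the 30-day window are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
dates = pd.date_range("2020-01-01", periods=365)
ts = pd.Series(rng.normal(0, 1, 365), index=dates)

spike_date = pd.Timestamp("2020-07-01")   # hypothetical anomaly
ts.loc[spike_date] = 10.0                 # one extreme observation

window = 30
# Kurtosis of each trailing 30-day window; the first window-1 entries are NaN
rolling_kurt = ts.rolling(window).kurt()

peak_date = rolling_kurt.idxmax()
print(f"rolling kurtosis peaks on {peak_date.date()} at {rolling_kurt.max():.2f}")
```

The rolling kurtosis stays elevated for as long as the spike remains inside the window, so the peak falls within the 30 days starting at the spike date rather than necessarily on the spike itself.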

Conclusion

This tutorial has explored the DataFrame.kurtosis() method in Pandas across different scenarios from basic to advanced. Understanding and applying kurtosis aids in gaining deeper insights into the distribution characteristics of your data, facilitating more informed data-driven decision-making.
