How to Use NumPy for Descriptive Statistics

Updated: January 22, 2024 By: Guest Contributor

Introduction

NumPy (Numerical Python) is an open-source Python library that’s widely used for scientific computing, especially for mathematical operations on large, multi-dimensional arrays and matrices. It is particularly popular in data analysis, machine learning, and engineering. Descriptive statistics summarize the basic features of a dataset through quantitative measures such as the mean, median, and standard deviation. In this tutorial, we will explore how NumPy can be used to compute various descriptive statistical measures.

Getting Things Ready

Before diving into calculations, ensure that you have NumPy installed. If not, install it using pip:

pip install numpy

After installation, you can import NumPy:

import numpy as np

Basic Statistics with NumPy

To demonstrate some basic descriptive statistics, let’s begin by creating some example data using NumPy:

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Now let’s calculate some basic descriptive statistics:

Mean

mean_value = np.mean(data)
print(f'Mean Value: {mean_value}')
# Output: Mean Value: 5.5

Median

median_value = np.median(data)
print(f'Median Value: {median_value}')
# Output: Median Value: 5.5

Standard Deviation

std_dev = np.std(data)
print(f'Standard Deviation: {std_dev}')
# Output: Standard Deviation: 2.8722813232690143

Variance

variance = np.var(data)
print(f'Variance: {variance}')
# Output: Variance: 8.25
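
Note that np.std and np.var divide by N by default (ddof=0), which gives the population statistics shown above. When the data are a sample drawn from a larger population, the sample estimates that divide by N - 1 are usually preferred, and they can be requested through the ddof parameter. A minimal sketch reusing the same example data (the names sample_var and sample_std are only illustrative):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# ddof=1 divides by N - 1 instead of N (Bessel's correction)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(f'Sample Variance: {sample_var}')
print(f'Sample Standard Deviation: {sample_std}')
```

For this data the sample variance is roughly 9.17 and the sample standard deviation roughly 3.03, both slightly larger than the population values.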

This is just the tip of the iceberg. NumPy offers a comprehensive set of functions for more complex statistical measurements. Let’s delve into some more advanced concepts.

Advanced Statistics with NumPy

For more sophisticated analysis, such as weighted averages, percentiles, and correlation coefficients, we can still rely on NumPy’s capabilities.

Weighted Average

weights = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
weighted_avg = np.average(data, weights=weights)
print(f'Weighted Average: {weighted_avg}')
# Output: Weighted Average: 7.0

Percentiles

fiftieth_percentile = np.percentile(data, 50)
print(f'50th Percentile (Median): {fiftieth_percentile}')

ninety_percentile = np.percentile(data, 90)
print(f'90th Percentile: {ninety_percentile}')
# Outputs:
# 50th Percentile (Median): 5.5
# 90th Percentile: 9.1
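
np.percentile also accepts a sequence of percentiles, which makes it easy to get the quartiles and the interquartile range (IQR) in one call. The sketch below assumes the same example data; the names q1, q3, and iqr are illustrative, and the results rely on NumPy's default linear interpolation between data points.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Passing a list of percentiles returns one result per entry
q1, q3 = np.percentile(data, [25, 75])

# The IQR measures the spread of the middle 50% of the data
iqr = q3 - q1

print(f'Q1: {q1}, Q3: {q3}, IQR: {iqr}')
# Output: Q1: 3.25, Q3: 7.75, IQR: 4.5
```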

Correlation Coefficient

data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([5, 4, 3, 2, 1])
correlation = np.corrcoef(data_x, data_y)
print(f'Correlation Coefficient Matrix:\n{correlation}')
# Output: Correlation Coefficient Matrix:
# [[ 1. -1.]
#  [-1.  1.]]

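These measures answer different questions: weighted averages let some observations count more than others, percentiles describe where values fall within the distribution, and correlation coefficients measure the strength and direction of the linear relationship between two variables.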
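
np.corrcoef returns a full matrix of Pearson correlation coefficients, with ones on the diagonal (each variable against itself). If only the single coefficient between two variables is needed, it can be read from an off-diagonal entry. A small sketch using the same two arrays (the name r_xy is illustrative):

```python
import numpy as np

data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([5, 4, 3, 2, 1])

# Entry [0, 1] (equal to [1, 0]) is the correlation between x and y
r_xy = np.corrcoef(data_x, data_y)[0, 1]

print(f'Correlation between x and y: {r_xy:.2f}')
# Output: Correlation between x and y: -1.00
```

A value of -1 indicates a perfect negative linear relationship: as data_x increases, data_y decreases by a proportional amount.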

Descriptive Statistics on Multi-dimensional Data

Often, data comes in multi-dimensional arrays, where rows and columns might represent several groups of observations. NumPy’s statistical functions work directly on such arrays.

multi_dimensional_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mean_multi = np.mean(multi_dimensional_data, axis=0)
print(f'Means of each column: {mean_multi}')
# Output: Means of each column: [4. 5. 6.]

The axis parameter determines the direction along which the mean is computed. With axis=0, NumPy returns one mean per column; with axis=1, it returns one mean per row. If axis is omitted, the mean is taken over all elements of the array.

We can perform similar operations for other statistical measures to analyze multi-dimensional datasets efficiently.
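
As a brief illustration of the same idea with other functions, the sketch below computes row means, per-column standard deviations, and the median of all values for the same 3x3 array. The variable names are illustrative.

```python
import numpy as np

multi_dimensional_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=1 collapses the columns, giving one value per row
row_means = np.mean(multi_dimensional_data, axis=1)

# axis=0 collapses the rows, giving one value per column
col_std = np.std(multi_dimensional_data, axis=0)

# With no axis argument, the statistic is computed over the flattened array
overall_median = np.median(multi_dimensional_data)

print(f'Means of each row: {row_means}')
print(f'Standard deviation of each column: {col_std}')
print(f'Median of all values: {overall_median}')
```

Here the row means are 2, 5, and 8, each column has the same standard deviation (the square root of 6, about 2.45), and the overall median is 5.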

Working with Missing Data

In real-life datasets, missing values are common, and handling them is an important part of data preprocessing. NumPy represents missing numeric values as NaN (np.nan) and provides functions that ignore these values during computations.

data_with_nan = np.array([1, 2, np.nan, 4, 5])
mean_without_nan = np.nanmean(data_with_nan)
print(f'Mean without NaN: {mean_without_nan}')
# Output: Mean without NaN: 3.0

Similarly, we can use np.nanstd, np.nanvar, np.nanpercentile, and more to compute statistics without being affected by NaNs.
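
To see why these variants matter, compare them with the regular functions, which propagate NaN into the result. The following sketch reuses the same array; the variable names are illustrative.

```python
import numpy as np

data_with_nan = np.array([1, 2, np.nan, 4, 5])

# The regular function returns nan as soon as one value is missing
plain_mean = np.mean(data_with_nan)

# The nan-aware variants compute over the remaining values [1, 2, 4, 5]
nan_median = np.nanmedian(data_with_nan)
nan_std = np.nanstd(data_with_nan)
nan_p90 = np.nanpercentile(data_with_nan, 90)

print(f'Plain mean: {plain_mean}')
print(f'Median without NaN: {nan_median}')
print(f'Standard deviation without NaN: {nan_std:.4f}')
print(f'90th percentile without NaN: {nan_p90:.2f}')
```

The plain mean comes out as nan, while the nan-aware results are computed from the four valid values only (median 3.0, standard deviation about 1.5811, 90th percentile about 4.70).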

Conclusion

This tutorial has covered how NumPy can be utilized to perform descriptive statistical analysis on data of varying complexity. As a powerful tool with an extensive array of functions, NumPy simplifies the computation of statistical measures, helping you gain insights into your data swiftly. With practice, these techniques become invaluable in the analysis and interpretation of datasets across diverse applications.