How to Use NumPy for Simple Statistical Analysis

Updated: January 22, 2024 By: Guest Contributor

Introduction

Python’s NumPy library is a cornerstone of data analysis and scientific computing. It offers a broad set of mathematical functions, random number generators, linear algebra routines, and more. This tutorial walks you through simple statistical analysis with NumPy. We’ll begin with the basics and gradually move to more advanced operations, showing code examples and their outputs where applicable.

Setting Up Your Environment

Before we start, ensure that you have Python installed on your system. You can install NumPy using pip:

pip install numpy

Importing NumPy

Once installed, you can import the NumPy package:

import numpy as np

Basics of NumPy Arrays

NumPy operations are primarily performed on ndarrays (N-dimensional arrays), the library’s core data structure. Let’s create a one-dimensional array:

data = np.array([1, 2, 3, 4, 5])
print(data)

Output: [1 2 3 4 5]
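Before computing statistics, it can help to check how an array is laid out. The short sketch below inspects the same array through three standard ndarray attributes; it is only an illustration of what to look at, not a required step.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# shape: length along each dimension; ndim: number of dimensions;
# size: total number of elements
print(data.shape)  # (5,)
print(data.ndim)   # 1
print(data.size)   # 5
```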

Descriptive Statistics

Now let’s discuss some fundamental statistical operations.

Mean

The mean, or average, is a measure of the central tendency of a dataset.

mean_value = np.mean(data)
print(mean_value)

Output: 3.0
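Real datasets often contain missing values, and a single NaN makes np.mean return NaN. The sketch below uses a hypothetical array named readings (not part of the tutorial's data) to contrast np.mean with np.nanmean, which ignores NaN entries.

```python
import numpy as np

# Hypothetical readings with one missing value (illustrative data only)
readings = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

plain_mean = np.mean(readings)     # NaN propagates through the sum
clean_mean = np.nanmean(readings)  # NaN is skipped: (1 + 2 + 4 + 5) / 4

print(plain_mean)  # nan
print(clean_mean)  # 3.0
```

Whether skipping missing values is appropriate depends on why they are missing, so treat this as one option rather than a default.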

Median

The median is the middle value of the dataset once it is sorted.

median_value = np.median(data)
print(median_value)

Output: 3.0
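When the dataset has an even number of values there is no single middle element, and np.median returns the average of the two middle values. The arrays below are made up for illustration and are deliberately unsorted to show that sorting beforehand is not needed.

```python
import numpy as np

# Illustrative arrays; np.median sorts a copy internally
odd_data = np.array([5, 1, 3])
even_data = np.array([4, 1, 3, 2])

odd_median = np.median(odd_data)    # middle of [1, 3, 5]
even_median = np.median(even_data)  # average of 2 and 3 in [1, 2, 3, 4]

print(odd_median)   # 3.0
print(even_median)  # 2.5
```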

Variance

Variance measures how far the data spread from the mean. With its default settings, np.var returns the population variance.

variance_value = np.var(data)
print(variance_value)

Output: 2.0

Standard Deviation

Standard deviation is the square root of the variance, indicating the amount of variation or dispersion in a set of values.

std_dev_value = np.std(data)
print(std_dev_value)

Output: 1.4142135623730951
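Both np.var and np.std divide by N by default, which treats the data as the whole population. If the data are a sample drawn from a larger population, the usual unbiased estimate divides by N - 1 instead. A minimal sketch using the ddof ("delta degrees of freedom") parameter on the same array:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# ddof=1 changes the divisor from N to N - 1
sample_variance = np.var(data, ddof=1)  # 10 / 4
sample_std = np.std(data, ddof=1)       # square root of the sample variance

print(sample_variance)  # 2.5
print(sample_std)       # roughly 1.58
```

Which divisor is appropriate depends on the analysis; other libraries may use a different default, so it is worth checking before comparing results.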

Random Numbers and Distributions

NumPy can also generate random numbers and draw samples from various distributions, which is often useful in statistical analysis.

Generating Random Numbers

For example, here’s how you can draw 1,000 random numbers from a normal distribution with mean 0 and standard deviation 1:

normal_distribution = np.random.normal(0, 1, size=1000)

Descriptive Statistics on Distributions

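Let’s calculate the mean and standard deviation of these numbers. Because the sample is random, the exact values change from run to run, but they should be close to 0 and 1: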

print('Mean:', np.mean(normal_distribution))
print('Standard deviation:', np.std(normal_distribution))
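If you need the same random numbers on every run, for example to make an analysis reproducible, one option is NumPy's Generator interface with a fixed seed. The sketch below assumes an arbitrary seed value (12345) chosen only for illustration.

```python
import numpy as np

# A seeded Generator yields the same sequence on every run;
# the seed value itself is arbitrary
rng = np.random.default_rng(seed=12345)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

sample_mean = np.mean(samples)
sample_std = np.std(samples)

print('Mean:', sample_mean)               # close to 0
print('Standard deviation:', sample_std)  # close to 1
```

The np.random.normal call used earlier still works; the Generator approach simply keeps the random state in an explicit object instead of a global one.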

Correlation Coefficients

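Another important statistical tool is the correlation coefficient, which measures the strength of the linear association between two variables. np.corrcoef returns a correlation matrix: the diagonal entries are 1, and the off-diagonal entries hold the Pearson coefficient between x and y.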

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3.5, 3, 4.5, 6, 5.5])
correlation = np.corrcoef(x, y)
print(correlation)
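Since np.corrcoef returns a 2x2 matrix for two inputs, you usually want a single entry from it. The sketch below pulls out the off-diagonal value and, as a sanity check, recomputes the Pearson coefficient from its definition (covariance divided by the product of the standard deviations). The names r and r_manual are introduced here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3.5, 3, 4.5, 6, 5.5])

# Entry [0, 1] (equal to [1, 0]) is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Manual check: population covariance over the product of population
# standard deviations (the 1/N factors cancel)
covariance = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = covariance / (np.std(x) * np.std(y))

print(r)         # roughly 0.93
print(r_manual)  # same value up to floating-point rounding
```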

Advanced Operations

With a firm grasp of the basics, we can now look into some more elaborate statistical methods provided by NumPy.

Hypothesis Testing

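NumPy does not ship ready-made hypothesis tests, but its array operations and random sampling are enough to compute test statistics by hand or to run simple resampling-based tests. For standard tests such as the t-test, a dedicated library like SciPy is the usual choice.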
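As one illustration of what is possible with NumPy alone, here is a minimal sketch of a two-sided permutation test for a difference in means. The two groups, the seed, and the number of permutations are hypothetical values chosen for the example; this is a sketch of the general idea, not a substitute for a vetted statistics library.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Hypothetical measurements for two groups (illustrative data only)
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.4])
group_b = np.array([4.2, 4.8, 4.5, 5.0, 4.4, 4.7])

observed = np.mean(group_a) - np.mean(group_b)
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Under the null hypothesis the group labels are interchangeable, so we
# shuffle the pooled data and count how often a difference at least as
# extreme as the observed one arises by chance.
n_permutations = 10_000
extreme_count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = np.mean(shuffled[:n_a]) - np.mean(shuffled[n_a:])
    if abs(diff) >= abs(observed):
        extreme_count += 1

p_value = extreme_count / n_permutations
print('Observed difference:', observed)
print('Approximate p-value:', p_value)
```

A small p-value suggests the observed difference would be unusual if both groups came from the same distribution. The estimate gets more stable as n_permutations grows.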

Linear Regression

You can fit a simple linear regression with NumPy’s polyfit function to model the relationship between two variables. A degree-1 fit returns two coefficients, highest power first: the slope, then the intercept.

coefficients = np.polyfit(x, y, 1)
print(coefficients)
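To make the fitted coefficients easier to use, the sketch below unpacks them into a slope and an intercept, evaluates the fitted line with np.polyval, and computes the coefficient of determination (R squared) from the residuals. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3.5, 3, 4.5, 6, 5.5])

# Degree-1 fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line at the original x values
predicted = np.polyval([slope, intercept], x)

# R squared: share of the variance in y explained by the line
residuals = y - predicted
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print('Slope:', slope)          # roughly 0.757
print('Intercept:', intercept)  # roughly 1.433
print('R squared:', r_squared)  # roughly 0.857
```

For a straight-line fit, R squared equals the square of the Pearson correlation coefficient between x and y.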

Working with Multidimensional Data

Most datasets in the real world are multidimensional, and NumPy is perfectly equipped to handle them.

Multi-dimensional Mean

Here’s how to calculate the mean across different axes of a multi-dimensional dataset:

multi_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print('Mean of entire dataset:', np.mean(multi_data))
print('Mean of each column:', np.mean(multi_data, axis=0))
print('Mean of each row:', np.mean(multi_data, axis=1))
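The axis argument works the same way for other statistics. As a small sketch, the example below combines per-column means and standard deviations to standardize each column (subtract the mean, divide by the standard deviation), a common preprocessing step. The name standardized is introduced here for illustration.

```python
import numpy as np

multi_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 collapses the rows, giving one value per column
col_means = multi_data.mean(axis=0)  # [4. 5. 6.]
col_stds = multi_data.std(axis=0)    # population std of each column

# Broadcasting applies the per-column values to every row
standardized = (multi_data - col_means) / col_stds

print(standardized)
print('Column means after scaling:', standardized.mean(axis=0))  # about 0
print('Column stds after scaling:', standardized.std(axis=0))    # about 1
```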

Conclusion

In this tutorial, we’ve walked through various statistical tools NumPy provides from foundational concepts to more advanced topics. Whether you’re analyzing simple datasets or delving into more complex data structures, NumPy offers the functionality needed for thorough statistical analysis.