NumPy: How to store multiple data types in an array

Updated: January 23, 2024 By: Guest Contributor

Introduction

NumPy is a fundamental package for scientific computing in Python. It provides support for large multi-dimensional array objects and various tools to work with them. One common question is how to store multiple data types in a NumPy array. This tutorial aims to answer that through a step-by-step approach, with code examples ranging from basic to advanced use-cases.

The Essence of NumPy Arrays

Before diving into mixed data types, let’s understand NumPy arrays. A standard NumPy array can only have one data type. When you create a NumPy array using numpy.array(), all elements are typically coerced into a single data type. Here’s a simple example:

import numpy as np

arr = np.array([1, 2, 3])
print(arr.dtype)
# Output: int64 (int32 on some platforms, such as Windows with NumPy older than 2.0)

If you try to create an array with mixed types such as integers and strings, you’ll notice that NumPy will convert them all into the same data type:

arr = np.array([1, 'two', 3])
print(arr)
print(arr.dtype)
# Output: ['1' 'two' '3']
# Output: <U21

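As displayed, NumPy converts the integers to strings so that every element shares one data type. The dtype <U21 denotes a little-endian Unicode string of up to 21 characters, which is wide enough to hold the text form of any 64-bit integer.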
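The same coercion can be observed with other combinations. The following is a minimal sketch (the variable names mixed_numbers and mixed_text are illustrative only) showing that NumPy picks one common dtype for the whole array:

```python
import numpy as np

# Integers mixed with a float are promoted to a floating-point dtype.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# Integers mixed with a string are all stored as Unicode strings.
mixed_text = np.array([1, 'two', 3])
print(mixed_text.dtype.kind)  # 'U' marks a Unicode string dtype

# The first element is now the text '1', not the number 1.
print(mixed_text[0] == '1')
```

Because the integers become text, numeric operations on such an array are not available until the values are converted back explicitly.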

Structured Arrays: A Solution

NumPy provides a way to create arrays with mixed data types: ‘structured arrays’. A structured array stores a different data type in each field (column), similar to a table or spreadsheet. The data type of each field is specified as a list of (name, type) pairs. Let’s see an example:

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)
print(people)
print(people.dtype)
# Output: [('Alice', 25, 55.) ('Bob', 32, 60.5)]
# Data type: [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')]

In this snippet, we create an array with name, age, and weight fields, and assign them string, integer, and float data types respectively.
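To see what the dtype specification produces, the array can be inspected through its dtype. This is a small sketch that rebuilds the people array from the example above; the variable first is just an illustrative name:

```python
import numpy as np

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)

# Field names, in declaration order.
print(people.dtype.names)

# Bytes per record: 10 characters * 4 bytes + 4 (i4) + 4 (f4) = 48.
print(people.dtype.itemsize)

# The array is one-dimensional: each element is a whole record.
print(people.shape)

# A single record can be indexed by field name as well.
first = people[0]
print(first['name'], first['age'])
```

Note that each record in data is a tuple. The array has shape (2,), not (2, 3), because the three fields live inside each element rather than along a second axis.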

Accessing and Modifying Structured Arrays

Accessing data from structured arrays feels a bit different because you can access fields by their names. The following examples show how to use this method to get and set data:

print(people['name'])
# Output: ['Alice' 'Bob']

people['age'] += 1
print(people['age'])
# Output: [26 33]

You can also combine field access with NumPy’s boolean and fancy indexing to filter or modify records, which is more concise than looping over a list of tuples.
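As a sketch of what combining field access with indexing can look like, the example below filters records with a boolean mask and then updates one field conditionally. The names older and is_alice are illustrative, and the data is the same sample used earlier:

```python
import numpy as np

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)

# Filter: keep only the records whose age field exceeds 30.
older = people[people['age'] > 30]
print(older['name'])

# Conditional update: select the field first, then apply the mask.
# people['weight'] is a view, so the assignment changes people itself.
is_alice = people['name'] == 'Alice'
people['weight'][is_alice] = 56.0
print(people['weight'])
```

The order matters in the update: writing people[is_alice]['weight'] = 56.0 would assign into a temporary copy created by the boolean index and leave people unchanged.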

Advanced: Record Arrays

To level up the structured array, NumPy offers record arrays, or np.recarray. Record arrays allow for attribute-like access to the fields:

people_rec = people.view(np.recarray)
print(people_rec.age)
# Output: [26 33]

However, keep in mind that record arrays can be slower than plain structured arrays, because attribute access adds an extra lookup layer on top of field access.
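The relationship between a structured array and its record-array view can be checked directly. This sketch assumes the same sample data as before and shows that the view shares memory with the original, so a change made through one is visible through the other:

```python
import numpy as np

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)

# View the same buffer as a record array; no data is copied.
people_rec = people.view(np.recarray)
print(np.shares_memory(people, people_rec))

# Attribute access and field-name access return the same values.
print(people_rec.age, people_rec['age'])

# Writing through the record array updates the original array too.
people_rec.age[0] = 26
print(people['age'])
```

If the attribute lookup cost matters in a hot loop, the plain people['age'] form can still be used on the same data.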

Mixed Datatypes and Operations

Operations on structured arrays are limited compared to regular NumPy arrays: arithmetic is not defined for whole records, because a record mixes data types. Individual fields, however, behave like ordinary arrays, so most calculations are done field by field. Consider which operations you’ll need before opting for structured arrays.

# Add 0.5 to every weight by operating on the field, not on whole records
adjusted_weights = people['weight'] + 0.5
people['weight'] = adjusted_weights
print(people['weight'])
# Output: [55.5 61. ]
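To make the limitation concrete, the sketch below tries arithmetic on whole records, which NumPy rejects with a TypeError (or a subclass of it), and then shows two operations that do work: a reduction on a single field and sorting records by a field with np.sort and its order argument. The names whole_record_failed and by_weight are illustrative:

```python
import numpy as np

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)

# Arithmetic on whole records is not defined for a mixed dtype.
whole_record_failed = False
try:
    people + 1
except TypeError as exc:
    whole_record_failed = True
    print('Whole-record arithmetic failed:', type(exc).__name__)

# Reductions work on an individual numeric field.
print(people['age'].mean())

# Records can be sorted by a named field; [::-1] reverses to descending.
by_weight = np.sort(people, order='weight')[::-1]
print(by_weight['name'])
```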

Case Study: Reading CSV Data

A practical use case for structured arrays is reading CSV files into a NumPy array with mixed datatypes. NumPy allows you to load data with np.genfromtxt(), which can be used for this task:

# Assumes 'people.csv' has a header row with the columns name, age, and weight
data = np.genfromtxt('people.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
print(data)

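Note the use of names=True, which tells NumPy to read the field names from the first row of the CSV file, and dtype=None, which lets NumPy infer the data type of each column individually. The result is a structured array whose fields are named after the column headers.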
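Since people.csv is not included here, the following sketch substitutes an in-memory io.StringIO object with made-up contents for the file, which np.genfromtxt also accepts. The CSV text is an assumption chosen to match the earlier sample data:

```python
import io
import numpy as np

# Hypothetical file contents standing in for 'people.csv'.
csv_text = "name,age,weight\nAlice,25,55.0\nBob,32,60.5\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    names=True,       # take field names from the header row
    dtype=None,       # infer a dtype for each column separately
    encoding='utf-8',
)

# Field names come from the header row.
print(data.dtype.names)

# Each column keeps its own inferred type.
print(data['name'])
print(data['age'].dtype.kind, data['weight'].dtype.kind)
```

The inferred integer and float widths follow the platform defaults, so for a fixed layout such as 'i4' and 'f4' it is safer to pass an explicit dtype list instead of dtype=None.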

Tips for Working with Structured Arrays

  • Always define the correct data types for each field in the structure to maintain data integrity.
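  • Work on individual fields where possible, as many standard NumPy operations (whole-record arithmetic, for example) do not apply to structured arrays.
  • Structured arrays store data record by record, so a single field is a strided view rather than a contiguous block. If operations on one field seem slower than expected, consider copying that field into a contiguous array first.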
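The memory layout tip can be illustrated by looking at the strides of a field. This is a minimal sketch using the sample data from earlier; ages and ages_packed are illustrative names, and whether the copy actually speeds anything up depends on the workload:

```python
import numpy as np

data = [('Alice', 25, 55.0), ('Bob', 32, 60.5)]
dtypes = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array(data, dtype=dtypes)

# A field is a view into the records: consecutive ages sit one full
# record (48 bytes) apart, not 4 bytes apart.
ages = people['age']
print(ages.strides)
print(ages.flags['C_CONTIGUOUS'])

# Copy the field into its own contiguous buffer for heavy computation.
ages_packed = np.ascontiguousarray(ages)
print(ages_packed.strides)
print(np.shares_memory(ages_packed, people))
```

The packed array is an independent copy, so any results have to be assigned back to people['age'] if the structured array should reflect them.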

Conclusion

Storing multiple data types in one array is possible with NumPy’s structured arrays. They allow table-like data to be kept in a single array and accessed by field name, at the cost of losing arithmetic on whole records. Applications that manage heterogeneous data can benefit from structured arrays while still using NumPy’s efficient computation on individual fields.