Pandas: Sorting rows by multiple columns in a DataFrame

Overview
Basic Sorting
Sorting by Multiple Columns
Handling Missing Values
Sorting with a Custom Order
Advanced Sorting: Using the key Parameter
Chained Sorting
Conclusion

Overview

Handling data efficiently is integral to data analysis in Python, and Pandas is the go-to library for such tasks. Among many useful functionalities that Pandas provides, sorting data frames by rows based on one or several columns is a frequent need. This tutorial will guide you through various ways to sort DataFrame rows by multiple columns using Pandas, starting from basic approaches to more advanced techniques.

Before we dive in, ensure that you have Pandas installed. If not, you can install it using pip:

pip install pandas

Basic Sorting

Let’s start with a basic example. Suppose we have a DataFrame with several columns, and we want to sort the DataFrame by one of these columns.

import pandas as pd

# Sample DataFrame
data = {
  'Name': ['John', 'Anna', 'Peter', 'Linda'],
  'Age': [28, 34, 29, 32],
  'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)

# Sorting by 'Age'
df_sorted_by_age = df.sort_values(by='Age')
print(df_sorted_by_age)

This will sort the DataFrame in ascending order by the “Age” column:

    Name  Age     City
0   John   28  New York
2   Peter   29  Berlin
3   Linda   32  London
1   Anna   34    Paris

Sorting by Multiple Columns

Often, we need to sort by more than one column. Pandas makes this easy. Assume we want to sort the aforementioned DataFrame first by ‘City’ in ascending order and then by ‘Age’ in descending order.

df_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(df_sorted)

This results in a DataFrame sorted first by the ‘City’ column in ascending order, then by the ‘Age’ column in descending order:

    Name  Age     City
2  Peter   29  Berlin
3  Linda   32  London
1  Anna   34    Paris
0  John   28  New York

Handling Missing Values

Dealing with missing values is an essential part of data preprocessing. Pandas provides options to control how missing values are sorted. By default, missing values are sorted to the end, but you can change this behavior using the na_position parameter:

# Assuming 'Age' column has missing values
df['Age'].iloc[2] = None

df_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False], na_position='first')
print(df_sorted)

In this case, rows with missing values in the ‘Age’ column will be sorted to the top.

Sorting with a Custom Order

Sometimes, the required sorting order does not follow the typical ascending or descending pattern. Instead, you might need a specific custom order for categorical data. Although Pandas does not provide a direct way to sort by custom orders, you can achieve this by mapping categories to an order index and then sorting based on that index.

# Mapping cities to a custom order
order_index = {'London': 1, 'Berlin': 2, 'Paris': 3, 'New York': 4}
df['CityOrder'] = df['City'].map(order_index)

# Now, sorting by 'CityOrder' and then removing the auxiliary column
df_sorted = df.sort_values(by='CityOrder').drop('CityOrder', axis=1)
print(df_sorted)

This will sort the DataFrame based on the custom city order provided.

Advanced Sorting: Using the `key` Parameter

In recent Pandas versions, the sort_values function has been enhanced with a key parameter, allowing for more advanced sorting mechanisms. This parameter accepts a function that transforms the values before sorting, providing a powerful way to sort based on custom criteria.

# Suppose we want to sort by the length of the names

df_sorted_by_name_length = df.sort_values(by='Name', key=lambda col: col.str.len())
print(df_sorted_by_name_length)

This example sorts the DataFrame by the length of the names in ascending order.

Chained Sorting

For even more complex sorting requirements, Pandas allows chaining multiple sorting operations. This can be useful when you want to apply different sorting criteria to different subsets of your DataFrame.

# First, sort by 'City', then by 'Age' within each city
df.sort_values(by='City', inplace=True)
df.sort_values(by='Age', kind='mergesort', inplace=True)
print(df)

Notice that we used ‘mergesort’ for the second sorting because it is stable and keeps the order of ‘City’ after sorting by ‘Age’.

Conclusion

Sorting DataFrame rows by multiple columns in Pandas is a versatile and powerful operation that provides deep insights into your data. Whether it’s a straightforward sort by one or two columns, or more complex operations involving custom orders and the key parameter, Pandas offers flexible ways to order your data. Understanding these sorting techniques is crucial for data preprocessing and analysis, ensuring that you can handle your data with finesse.

Next Article: How to convert a DataFrame to a MongoDB document (Pandas + PyMongo)

Previous Article: Pandas DataFrame: How to search rows that match a string keyword

Series: DateFrames in Pandas

Pandas