Overview
Handling data efficiently is integral to data analysis in Python, and Pandas is the go-to library for such tasks. Among many useful functionalities that Pandas provides, sorting data frames by rows based on one or several columns is a frequent need. This tutorial will guide you through various ways to sort DataFrame rows by multiple columns using Pandas, starting from basic approaches to more advanced techniques.
Before we dive in, ensure that you have Pandas installed. If not, you can install it using pip:
pip install pandas
Basic Sorting
Let’s start with a basic example. Suppose we have a DataFrame with several columns, and we want to sort the DataFrame by one of these columns.
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
# Sorting by 'Age'
df_sorted_by_age = df.sort_values(by='Age')
print(df_sorted_by_age)
This will sort the DataFrame in ascending order by the “Age” column:
Name Age City
0 John 28 New York
2 Peter 29 Berlin
3 Linda 32 London
1 Anna 34 Paris
Sorting by Multiple Columns
Often, we need to sort by more than one column. Pandas makes this easy. Assume we want to sort the aforementioned DataFrame first by ‘City’ in ascending order and then by ‘Age’ in descending order.
df_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(df_sorted)
This results in a DataFrame sorted first by the ‘City’ column in ascending order, then by the ‘Age’ column in descending order:
Name Age City
2 Peter 29 Berlin
3 Linda 32 London
1 Anna 34 Paris
0 John 28 New York
Handling Missing Values
Dealing with missing values is an essential part of data preprocessing. Pandas provides options to control how missing values are sorted. By default, missing values are sorted to the end, but you can change this behavior using the na_position
parameter:
# Assuming 'Age' column has missing values
df['Age'].iloc[2] = None
df_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False], na_position='first')
print(df_sorted)
In this case, rows with missing values in the ‘Age’ column will be sorted to the top.
Sorting with a Custom Order
Sometimes, the required sorting order does not follow the typical ascending or descending pattern. Instead, you might need a specific custom order for categorical data. Although Pandas does not provide a direct way to sort by custom orders, you can achieve this by mapping categories to an order index and then sorting based on that index.
# Mapping cities to a custom order
order_index = {'London': 1, 'Berlin': 2, 'Paris': 3, 'New York': 4}
df['CityOrder'] = df['City'].map(order_index)
# Now, sorting by 'CityOrder' and then removing the auxiliary column
df_sorted = df.sort_values(by='CityOrder').drop('CityOrder', axis=1)
print(df_sorted)
This will sort the DataFrame based on the custom city order provided.
Advanced Sorting: Using the key
Parameter
In recent Pandas versions, the sort_values
function has been enhanced with a key
parameter, allowing for more advanced sorting mechanisms. This parameter accepts a function that transforms the values before sorting, providing a powerful way to sort based on custom criteria.
# Suppose we want to sort by the length of the names
df_sorted_by_name_length = df.sort_values(by='Name', key=lambda col: col.str.len())
print(df_sorted_by_name_length)
This example sorts the DataFrame by the length of the names in ascending order.
Chained Sorting
For even more complex sorting requirements, Pandas allows chaining multiple sorting operations. This can be useful when you want to apply different sorting criteria to different subsets of your DataFrame.
# First, sort by 'City', then by 'Age' within each city
df.sort_values(by='City', inplace=True)
df.sort_values(by='Age', kind='mergesort', inplace=True)
print(df)
Notice that we used ‘mergesort’ for the second sorting because it is stable and keeps the order of ‘City’ after sorting by ‘Age’.
Conclusion
Sorting DataFrame rows by multiple columns in Pandas is a versatile and powerful operation that provides deep insights into your data. Whether it’s a straightforward sort by one or two columns, or more complex operations involving custom orders and the key
parameter, Pandas offers flexible ways to order your data. Understanding these sorting techniques is crucial for data preprocessing and analysis, ensuring that you can handle your data with finesse.