Pandas: How to get the summary of a DataFrame (3 examples)

Updated: February 19, 2024 By: Guest Contributor

Introduction

Pandas is a powerful, open-source data analysis and manipulation tool built on top of the Python programming language.

DataFrames are the core data structure of the Pandas library and are particularly useful for handling structured data. Before delving deep into data analysis or manipulation, it is often necessary to get an overview or summary of the DataFrame to understand its structure, size, and types of data it contains. This initial step is crucial for any data science project as it helps in identifying potential issues such as missing values, understanding the nature of the columns, and getting a sense of the data distribution.

In this article, we’ll walk through three ways to summarize a DataFrame with Pandas: .info() for structure and data types, .describe() for descriptive statistics, and .agg() for custom aggregations.

Getting Started

First, ensure you have Pandas installed in your environment. If not, you can install it using pip:

pip install pandas

After installation, you can import Pandas into your Python script:

import pandas as pd

Example 1: Basic Summary with .info() Method

The .info() method provides a concise summary of a DataFrame: the index range, each column’s name, dtype, and non-null count, and the memory usage. Let’s start with a simple example:

import pandas as pd

df = pd.DataFrame({
  'A': [1, 2, 3, 4],
  'B': ["a", "b", "c", "d"],
  'C': pd.date_range('20230101', periods=4),
  'D': pd.Categorical(["test", "train", "test", "train"])
})

# .info() prints the summary itself and returns None,
# so wrapping it in print() would add a stray "None" line
df.info()

This call prints basic details about your DataFrame, helping you see its size, column types, and non-null counts at a glance.
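Because .info() writes its report to standard output rather than returning it, it can help to know how to capture the text or fetch the same facts as values. The following is a minimal sketch, not the only way to do it: the missing value in column 'B' is a hypothetical addition to make the non-null counts more interesting, and the variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical data: same columns as the example above,
# but with one missing value in 'B'
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ["a", None, "c", "d"],
    'C': pd.date_range('20230101', periods=4),
    'D': pd.Categorical(["test", "train", "test", "train"]),
})

# Pass a buffer to capture the report as a string instead of printing it;
# memory_usage='deep' asks for a more accurate memory estimate
buffer = io.StringIO()
df.info(buf=buffer, memory_usage='deep')
info_text = buffer.getvalue()
print(info_text)

# Attributes and methods that return values rather than printing
n_rows, n_cols = df.shape            # (rows, columns)
missing_per_column = df.isna().sum()  # count of missing values per column
print(n_rows, n_cols)
print(missing_per_column)
```

Capturing the report in a buffer is handy for logging, while df.shape and df.isna().sum() are easier to use in later code because they return ordinary objects.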

Example 2: Descriptive Statistics with .describe() Method

For a more quantitative summary, the .describe() method can be used. It generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. Here’s how you can use it:

import pandas as pd

df = pd.DataFrame({
  'A': [1, 2, 3, 4],
  'B': [10, 20, 30, 40]
})

print(df.describe())

By default, .describe() includes only numeric columns and reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum for each one.
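When a DataFrame also holds non-numeric columns, the include and percentiles parameters of .describe() can widen or adjust the summary. The sketch below assumes a hypothetical text column named 'label' added to the data above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'label': ['x', 'y', 'x', 'x'],  # hypothetical text column
})

# Default behavior: only the numeric columns A and B are summarized
numeric_summary = df.describe()
print(numeric_summary)

# include='all' adds the text column, with unique/top/freq rows;
# cells that do not apply to a column are filled with NaN
full_summary = df.describe(include='all')
print(full_summary)

# Request specific percentiles instead of the default quartiles
tail_summary = df.describe(percentiles=[0.1, 0.9])
print(tail_summary)
```

In the include='all' output, statistics such as mean have no meaning for the text column, so those cells show NaN rather than a value.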

Example 3: Custom Descriptions with .agg() Method

For more advanced and customized summaries, the .agg() method allows you to apply a wide variety of statistical methods to a DataFrame. This is particularly useful when the standard summary provided by .describe() isn’t enough. Here’s an example of how .agg() can be used to summarize the data further:

import pandas as pd

df = pd.DataFrame({
  'A': [1, 2, 3, 4],
  'B': [10, 20, 30, 40]
})

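# Store the result and print it so the summary is visible when run as a script
summary = df.agg({
  'A': ['sum', 'min', 'max', 'mean'],
  'B': ['mean', 'std']
})
print(summary)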

This produces a custom summary based on the aggregation functions defined for each column. In the code above, we calculate the sum, min, max, and mean for column ‘A’ and the mean and standard deviation for column ‘B’. The result is a DataFrame with one row per aggregation; where a function was not requested for a column, the cell is NaN.
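To make the shape of the .agg() result concrete, here is a small sketch that reads individual values back out and derives one more statistic from them. The variable names (summary, overall, value_range) are illustrative choices, not part of the Pandas API.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
})

# Per-column aggregations: rows are function names, columns are A and B
summary = df.agg({
    'A': ['sum', 'min', 'max', 'mean'],
    'B': ['mean', 'std'],
})
print(summary)

# Individual results can be looked up by (row label, column label)
total_a = summary.loc['sum', 'A']
# 'sum' was not requested for B, so that cell is NaN
b_sum_missing = pd.isna(summary.loc['sum', 'B'])

# Passing a list applies every function to every column
overall = df.agg(['min', 'max', 'median'])
# Derive the range (max - min) of each column from the aggregated rows
value_range = overall.loc['max'] - overall.loc['min']
print(value_range)
```

Passing a dictionary gives per-column control at the cost of NaN gaps, while passing a list yields a fully populated table.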

Conclusion

Pandas offers several methods for summarizing and understanding your DataFrame quickly: .info() for structure and data types, .describe() for descriptive statistics, and .agg() for custom summaries. Using them early gives you a clear picture of your data and a solid foundation for further analysis and manipulation.