
Pandas: DataFrame Describe

Gaining insights into the statistical properties of a dataset is vital for data analysis and machine learning. Built on NumPy Array Operations, Pandas provides the describe() method to generate descriptive statistics for numerical and categorical columns in a DataFrame. This guide explores Pandas DataFrame Describe, covering key techniques, customization options, and applications in data exploration and preprocessing workflows.


01. Why Use DataFrame Describe?

Datasets often contain numerical and categorical variables that require statistical summarization to understand distributions, detect outliers, or identify data issues. The describe() method in Pandas offers a quick, vectorized way to compute key statistics like mean, standard deviation, and frequency counts, leveraging NumPy’s efficiency. This tool is essential for exploratory data analysis, feature engineering, and preparing datasets for machine learning or statistical modeling.

Example: Basic DataFrame Describe

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 35],
    'Salary': [50000, 60000, 55000, None]
})

# Generate descriptive statistics
description = df.describe()

print("Descriptive statistics:\n", description)

Output:

Descriptive statistics:
             Age       Salary
count   3.000000      3.000000
mean   30.000000  55000.000000
std     5.000000   5000.000000
min    25.000000  50000.000000
25%    27.500000  52500.000000
50%    30.000000  55000.000000
75%    32.500000  57500.000000
max    35.000000  60000.000000

Explanation:

  • describe() - Summarizes numerical columns with count, mean, std, min, max, and quartiles.
  • Excludes non-numerical columns (e.g., Name) by default.
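
describe() also works directly on a single Series when only one column is of interest; a minimal sketch reusing the DataFrame above:

import pandas as pd
import numpy as np

# Same sample DataFrame as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 35],
    'Salary': [50000, 60000, 55000, None]
})

# Summarize a single column; the result is a Series of statistics
age_stats = df['Age'].describe()

print("Age statistics:\n", age_stats)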

02. Key Describe-Related Methods and Options

The describe() method, along with complementary tools, provides flexible ways to summarize DataFrame data efficiently. These methods, optimized with NumPy, support customization for different data types. The table below summarizes key methods and their applications in data exploration:

Method/Option                      Description            Use Case
describe()                         Basic statistics       Summarize numerical or categorical columns
describe(include='all')            Include/exclude        Analyze specific data types (e.g., object, category)
describe(percentiles=[0.1, 0.9])   Custom percentiles     Customize quantile outputs
describe().loc['mean']             Column-specific        Extract specific statistics
mean(), std()                      Complementary stats    Compute individual statistics


2.1 Basic Descriptive Statistics

Example: Summarizing Numerical Data

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    'Price': [10.5, 15.0, np.nan, 12.5],
    'Stock': [100, 200, 150, None]
})

# Generate statistics
description = df.describe()

print("Descriptive statistics:\n", description)

Output:

Descriptive statistics:
            Price       Stock
count   3.000000    3.000000
mean   12.666667  150.000000
std     2.254625   50.000000
min    10.500000  100.000000
25%    11.500000  125.000000
50%    12.500000  150.000000
75%    13.750000  175.000000
max    15.000000  200.000000

Explanation:

  • describe() - Computes count, mean, std, min, max, and quartiles for numerical columns.
  • Ignores missing values (e.g., Price has 3 non-null values).
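
To see exactly how many values each summary is based on, compare count() (which skips NaN) with the total number of rows; a small sketch using the same DataFrame:

import pandas as pd
import numpy as np

# Same sample DataFrame as above
df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    'Price': [10.5, 15.0, np.nan, 12.5],
    'Stock': [100, 200, 150, None]
})

# count() excludes NaN values, len() counts every row; comparing the
# two shows how many values describe() actually used per column
print("Rows in DataFrame:", len(df))
print("Non-null Price values:", df['Price'].count())
print("Non-null Stock values:", df['Stock'].count())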

2.2 Including Non-Numerical Data

Example: Summarizing All Data Types

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', None],
    'Score': [85, 90, np.nan, 95],
    'Grade': ['High', 'Medium', 'High', 'Low']
})

# Summarize all columns
description = df.describe(include='all')

print("Descriptive statistics (all columns):\n", description)

Output:

Descriptive statistics (all columns):
        Category      Score  Grade
count         3    3.000000      4
unique        2         NaN      3
top           A         NaN   High
freq          2         NaN      2
mean        NaN   90.000000    NaN
std         NaN    5.000000    NaN
min         NaN   85.000000    NaN
25%         NaN   87.500000    NaN
50%         NaN   90.000000    NaN
75%         NaN   92.500000    NaN
max         NaN   95.000000    NaN

Explanation:

  • describe(include='all') - Includes object and categorical columns, showing count, unique values, top value, and frequency.
  • Numerical columns retain standard statistics; non-numerical show categorical summaries.
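
Beyond include='all', the include and exclude parameters accept specific dtypes, so you can summarize only the text columns or everything except numbers; a brief sketch with the same DataFrame:

import pandas as pd
import numpy as np

# Same sample DataFrame as above
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', None],
    'Score': [85, 90, np.nan, 95],
    'Grade': ['High', 'Medium', 'High', 'Low']
})

# Summarize only object (string) columns
object_summary = df.describe(include=['object'])

# Summarize everything except numerical columns
non_numeric_summary = df.describe(exclude=[np.number])

print("Object columns:\n", object_summary)
print("\nNon-numeric columns:\n", non_numeric_summary)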

2.3 Customizing Percentiles

Example: Specifying Percentiles

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'Boston'],
    'Population': [8.4e6, 2.7e6, 0.7e6],
    'Area': [783.8, 606.1, 89.6]
})

# Summarize with custom percentiles
description = df.describe(percentiles=[0.1, 0.5, 0.9])

print("Descriptive statistics with custom percentiles:\n", description)

Output:

Descriptive statistics with custom percentiles:
         Population         Area
count  3.000000e+00     3.000000
mean   3.933333e+06   493.166667
std    3.995414e+06   360.615950
min    7.000000e+05    89.600000
10%    1.100000e+06   192.900000
50%    2.700000e+06   606.100000
90%    7.260000e+06   748.260000
max    8.400000e+06   783.800000

Explanation:

  • describe(percentiles=[0.1, 0.5, 0.9]) - Customizes quantiles (e.g., 10th, 50th, 90th percentiles).
  • Useful for analyzing skewed distributions or outlier detection.
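
For outlier detection, the quartiles reported by describe() can feed the standard 1.5 * IQR rule; a minimal sketch with illustrative values (the 'Value' column is made up for this example):

import pandas as pd

# Illustrative data with one likely outlier
df = pd.DataFrame({'Value': [1, 2, 3, 4, 5, 100]})

stats = df['Value'].describe()

# Interquartile range from the 25% and 75% rows of describe()
iqr = stats['75%'] - stats['25%']
lower = stats['25%'] - 1.5 * iqr
upper = stats['75%'] + 1.5 * iqr

# Rows outside the IQR fences are flagged as potential outliers
outliers = df[(df['Value'] < lower) | (df['Value'] > upper)]

print("IQR:", iqr)
print("Outlier bounds:", lower, "to", upper)
print("Potential outliers:\n", outliers)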

2.4 Extracting Specific Statistics

Example: Accessing Specific Metrics

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Item': ['A', 'B', 'C', 'D'],
    'Price': [10.0, 15.0, np.nan, 20.0],
    'Stock': [100, 200, 150, None]
})

# Extract specific statistics
stats = df.describe()
mean_prices = stats.loc['mean', 'Price']
max_stock = stats.loc['max', 'Stock']

print("Descriptive statistics:\n", stats)
print("\nMean Price:", mean_prices)
print("Max Stock:", max_stock)

Output:

Descriptive statistics:
            Price       Stock
count   3.000000    3.000000
mean   15.000000  150.000000
std     5.000000   50.000000
min    10.000000  100.000000
25%    12.500000  125.000000
50%    15.000000  150.000000
75%    17.500000  175.000000
max    20.000000  200.000000

Mean Price: 15.0
Max Stock: 200.0

Explanation:

  • describe().loc - Extracts specific statistics (e.g., mean of Price).
  • Enables targeted analysis for feature engineering or reporting.
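
You can also pull an entire row of statistics at once, or transpose the result so each column's summary sits on its own row, which is easier to scan for wide DataFrames; a short sketch:

import pandas as pd
import numpy as np

# Sample DataFrame with numerical columns
df = pd.DataFrame({
    'Price': [10.0, 15.0, np.nan, 20.0],
    'Stock': [100, 200, 150, None]
})

stats = df.describe()

# One row of statistics across all numerical columns
means = stats.loc['mean']

# Transposed summary: one row per column
transposed = stats.T

print("Column means:\n", means)
print("\nTransposed summary:\n", transposed)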

2.5 Complementary Statistical Methods

Example: Computing Individual Statistics

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, np.nan, 30]
})

# Compute individual statistics
mean_value = df['Value'].mean()
std_value = df['Value'].std()
median_value = df['Value'].median()

print("Mean Value:", mean_value)
print("Standard Deviation:", std_value)
print("Median Value:", median_value)

Output:

Mean Value: 20.0
Standard Deviation: 10.0
Median Value: 20.0

Explanation:

  • mean(), std(), median() - Compute specific statistics for targeted analysis.
  • Complements describe() for fine-grained insights.
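
When only a handful of statistics is needed, agg() computes a chosen set in one call, a lighter-weight alternative to the full describe() table; a small sketch:

import pandas as pd
import numpy as np

# Same sample DataFrame as above
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, np.nan, 30]
})

# Compute a custom set of statistics in one call
summary = df['Value'].agg(['count', 'mean', 'std', 'median', 'min', 'max'])

print("Selected statistics:\n", summary)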

2.6 Incorrect Describe Usage

Example: Misinterpreting Describe Output

import pandas as pd
import numpy as np

# Create a DataFrame with skewed data
df = pd.DataFrame({
    'Feature': [1, 2, 1000, 3],
    'Label': ['A', 'B', 'C', 'A']
})

# Incorrect: Assuming normal distribution from describe
description = df.describe()
print("Descriptive statistics:\n", description)
print("\nAssuming 'Feature' is normally distributed without further checks")

Output:

Descriptive statistics:
           Feature
count     4.000000
mean    251.500000
std     499.000668
min       1.000000
25%       1.750000
50%       2.500000
75%     252.250000
max    1000.000000

Assuming 'Feature' is normally distributed without further checks

Explanation:

  • High standard deviation (about 499) indicates outliers, but assuming normality without visualization (e.g., histogram) can mislead analysis.
  • Solution: Pair describe() with plots or value_counts() for context.
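
One way to add that context without a plot is to compare the mean with the median and check skewness; a minimal sketch (skewness is used here in place of a histogram):

import pandas as pd

# Same skewed DataFrame as above
df = pd.DataFrame({
    'Feature': [1, 2, 1000, 3],
    'Label': ['A', 'B', 'C', 'A']
})

# A large gap between mean and median, or a high skewness value,
# signals that the distribution is far from normal
print("Mean:", df['Feature'].mean())
print("Median:", df['Feature'].median())
print("Skewness:", df['Feature'].skew())

# value_counts() gives context for the categorical column
print("\nLabel frequencies:\n", df['Label'].value_counts())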

03. Effective Usage

3.1 Recommended Practices

  • Use describe() to quickly identify distributions and potential outliers.

Example: Comprehensive Statistical Analysis

import pandas as pd
import numpy as np

# Create a large DataFrame
df = pd.DataFrame({
    'ID': np.arange(1000, dtype='int32'),
    'Category': pd.Series(['A', 'B', 'C', 'A'] * 250).astype('category'),
    'Price': pd.Series([10.5, 15.0, np.nan, 20.0] * 250),
    'Stock': pd.Series([100, 200, 150, None] * 250)
})

# Comprehensive analysis
num_stats = df.describe(percentiles=[0.1, 0.5, 0.9])
cat_stats = df['Category'].describe()
missing = df.isna().sum()
mean_price = df['Price'].mean()

print("Numerical statistics:\n", num_stats)
print("\nCategorical statistics:\n", cat_stats)
print("\nMissing values:\n", missing)
print("\nMean Price:", mean_price)

Output:

Numerical statistics:
               ID       Price       Stock
count  1000.000000  750.000000  750.000000
mean    499.500000   15.166667  150.000000
std     288.819436    3.882738   40.852073
min       0.000000   10.500000  100.000000
10%      99.900000   10.500000  100.000000
50%     499.500000   15.000000  150.000000
90%     899.100000   20.000000  200.000000
max     999.000000   20.000000  200.000000

Categorical statistics:
count     1000
unique       3
top          A
freq       500
Name: Category, dtype: object

Missing values:
ID            0
Category      0
Price       250
Stock       250
dtype: int64

Mean Price: 15.166666666666666

  • Use describe(include='all') or astype('category') for categorical analysis.
  • Customize percentiles to focus on specific distribution points.
  • Check missing values with isna().sum() to contextualize statistics.

3.2 Practices to Avoid

  • Avoid using describe() without checking data quality (e.g., missing values or outliers).

Example: Ignoring Missing Data Impact

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
    'Feature': [1, 2, np.nan, np.nan],
    'Label': ['A', 'B', 'C', 'A']
})

# Incorrect: Relying on describe without checking missing values
description = df.describe()
print("Descriptive statistics:\n", description)
print("\nAssuming complete data without checking missing values")

Output:

Descriptive statistics:
           Feature
count     2.000000
mean      1.500000
std       0.707107
min       1.000000
25%       1.250000
50%       1.500000
75%       1.750000
max       2.000000

Assuming complete data without checking missing values

  • Only 2 non-null values in Feature skew the statistics, misleading interpretations.
  • Solution: Check isna().sum() and inspect data with head() first.
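
A minimal sketch of that check, run before interpreting the summary:

import pandas as pd
import numpy as np

# Same DataFrame with missing values as above
df = pd.DataFrame({
    'Feature': [1, 2, np.nan, np.nan],
    'Label': ['A', 'B', 'C', 'A']
})

# Inspect the raw data and missing-value counts before trusting describe()
print("First rows:\n", df.head())
print("\nMissing values per column:\n", df.isna().sum())
print("\nShare of missing values per column:\n", df.isna().mean())

# Only then interpret the statistics, knowing half of Feature is missing
print("\nDescriptive statistics:\n", df.describe())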

04. Common Use Cases in Data Exploration

4.1 Feature Distribution Analysis

Analyze feature distributions to inform preprocessing or modeling.

Example: Exploring Feature Distributions

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [10, 20, 15, 30],
    'Label': ['A', 'B', 'A', 'C']
})

# Analyze distributions
stats = df.describe(percentiles=[0.25, 0.75])
missing = df.isna().sum()

print("Feature distributions:\n", stats)
print("\nMissing values:\n", missing)

Output:

Feature distributions:
          Feature1   Feature2
count    3.000000   4.000000
mean     2.333333  18.750000
std      1.527525   8.539126
min      1.000000  10.000000
25%      1.500000  13.750000
50%      2.000000  17.500000
75%      3.000000  22.500000
max      4.000000  30.000000

Missing values:
Feature1    1
Feature2    0
Label       0
dtype: int64

Explanation:

  • describe() - Reveals distributions (e.g., Feature2 has higher variance).
  • Paired with isna().sum() to contextualize missing data impact.
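
To compare distributions across groups, describe() can also be applied per group via groupby(); a short sketch using the same DataFrame:

import pandas as pd
import numpy as np

# Same sample DataFrame as above
df = pd.DataFrame({
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [10, 20, 15, 30],
    'Label': ['A', 'B', 'A', 'C']
})

# One row of summary statistics per Label value
grouped_stats = df.groupby('Label')['Feature2'].describe()

print("Feature2 statistics by Label:\n", grouped_stats)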

4.2 Categorical Data Summarization

Summarize categorical variables to understand distributions.

Example: Analyzing Categorical Columns

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'West', 'South'],
    'Sales': [1000, 1500, 1200, 800, 2000]
})

# Summarize categorical data
cat_stats = df['Region'].describe()
value_counts = df['Region'].value_counts()

print("Categorical statistics:\n", cat_stats)
print("\nValue counts:\n", value_counts)

Output:

Categorical statistics:
count         5
unique        3
top       North
freq          2
Name: Region, dtype: object

Value counts:
Region
North    2
South    2
West     1
Name: count, dtype: int64

Explanation:

  • describe() - Summarizes categorical data (count, unique, top, freq).
  • value_counts() - Provides detailed frequency distribution.
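
value_counts(normalize=True) reports relative frequencies instead of raw counts, and groupby() extends the summary to numerical columns within each category; a brief sketch:

import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'West', 'South'],
    'Sales': [1000, 1500, 1200, 800, 2000]
})

# Relative frequencies of each Region
proportions = df['Region'].value_counts(normalize=True)

# Sales statistics within each Region
sales_by_region = df.groupby('Region')['Sales'].describe()

print("Region proportions:\n", proportions)
print("\nSales statistics per Region:\n", sales_by_region)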

Conclusion

Pandas’ describe() method, built on NumPy Array Operations, is a powerful tool for summarizing numerical and categorical data. Key takeaways:

  • Use describe() to compute key statistics for numerical and categorical columns.
  • Customize with include='all' or percentiles for flexibility.
  • Pair with isna().sum() and visualizations to validate findings.
  • Apply in feature distribution analysis and categorical summarization for effective exploration.

With Pandas, you can efficiently uncover insights into your data’s statistical properties, setting the stage for robust preprocessing and modeling!
