
Pandas: Advanced Descriptive Statistics

Advanced descriptive statistics provide deeper insights into data distributions, variability, and relationships, essential for robust data analysis and machine learning. Built on NumPy Array Operations, Pandas offers powerful methods beyond basic describe(), including skewness, kurtosis, quantiles, and custom aggregations, to characterize datasets comprehensively. This guide explores Pandas Advanced Descriptive Statistics, covering techniques, customization, and applications in exploratory data analysis and feature engineering.


01. Why Use Advanced Descriptive Statistics?

While basic statistics like mean and standard deviation offer a starting point, advanced descriptive statistics reveal nuances such as asymmetry (skewness), tail behavior (kurtosis), or percentile-based distributions. Pandas’ vectorized operations, leveraging NumPy, enable efficient computation of these metrics on DataFrames and Series. These tools are critical for detecting outliers, understanding data distributions, and preparing datasets for statistical modeling or machine learning, ensuring assumptions like normality or homoscedasticity are met.

Example: Computing Skewness and Kurtosis

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Sales': [100, 120, 150, 200, 300, 1000],
    'Profit': [10, 15, 12, 20, 25, 50]
})

# Compute skewness and kurtosis
skewness = df.skew()
kurtosis = df.kurt()

print("Skewness:\n", skewness)
print("\nKurtosis:\n", kurtosis)

Output:

Skewness:
Sales     2.226718
Profit    1.761563
dtype: float64

Kurtosis:
Sales     5.089138
Profit    3.293031
dtype: float64

Explanation:

  • skew() - Measures distribution asymmetry (positive skew for Sales indicates right-tailed data).
  • kurt() - Assesses tail heaviness (high kurtosis for Sales suggests heavy tails).
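Before reaching for the individual methods, note that describe() itself can be pushed further: it accepts a percentiles argument, which pairs naturally with skew() and kurt(). A minimal sketch, reusing the Sales/Profit data from above:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Sales': [100, 120, 150, 200, 300, 1000],
    'Profit': [10, 15, 12, 20, 25, 50]
})

# describe() accepts custom percentiles beyond the default 25/50/75
summary = df.describe(percentiles=[0.05, 0.5, 0.95])

print(summary)
```

The 5th and 95th percentile rows give a quick first look at the tails before computing higher moments.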

02. Key Advanced Descriptive Statistics Methods

Pandas provides a suite of methods for advanced descriptive statistics, optimized with NumPy for performance. These methods extend beyond describe() to include higher-order moments, quantiles, and custom aggregations. The table below summarizes key methods and their applications in data exploration:

Method                Description                        Use Case
skew()                Measure distribution asymmetry     Detect skewed distributions for transformations
kurt()                Measure tail heaviness             Assess outlier prevalence
quantile()            Compute specific percentiles       Analyze distribution spread
agg()                 Apply custom aggregations          Combine multiple statistics
rolling().describe()  Compute statistics over a window   Analyze trends in time series


2.1 Skewness for Distribution Asymmetry

Example: Analyzing Skewness

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Revenue': [100, 110, 120, 200, 500],
    'Cost': [50, 55, 60, 65, 70]
})

# Compute skewness
skewness = df.skew()

print("Skewness:\n", skewness)

Output:

Skewness:
Revenue    1.957639
Cost       0.000000
dtype: float64

Explanation:

  • skew() - Positive skewness for Revenue (1.96) indicates a right-skewed distribution; the evenly spaced Cost values are perfectly symmetric (0.0).
  • Guides preprocessing (e.g., log transformation for skewed data).
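The log-transformation hint can be sketched directly: np.log1p compresses large values, so the transformed skewness should drop. A small example using the Revenue values from above:

```python
import pandas as pd
import numpy as np

revenue = pd.Series([100, 110, 120, 200, 500], name='Revenue')

# log1p (log(1 + x)) compresses the right tail, pulling skewness toward zero
log_revenue = np.log1p(revenue)

print("Raw skewness:", revenue.skew())
print("Log skewness:", log_revenue.skew())
```

The distribution is still right-skewed after the transform, just markedly less so.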

2.2 Kurtosis for Tail Behavior

Example: Assessing Kurtosis

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Price': [10, 12, 15, 20, 100],
    'Quantity': [100, 110, 105, 120, 115]
})

# Compute kurtosis
kurtosis = df.kurt()

print("Kurtosis:\n", kurtosis)

Output:

Kurtosis:
Price       4.806373
Quantity   -1.200000
dtype: float64

Explanation:

  • kurt() - Pandas reports excess kurtosis (a normal distribution scores 0); the high value for Price (4.81) suggests heavy tails and potential outliers.
  • Negative kurtosis for Quantity (-1.2) indicates a flatter, light-tailed distribution.

2.3 Quantiles for Distribution Spread

Example: Computing Custom Quantiles

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Income': [30000, 40000, 50000, 75000, 100000],
    'Savings': [5000, 6000, 7000, 10000, 15000]
})

# Compute custom quantiles
quantiles = df.quantile([0.1, 0.5, 0.9])

print("Quantiles (10th, 50th, 90th):\n", quantiles)

Output:

Quantiles (10th, 50th, 90th):
       Income  Savings
0.1  34000.0   5400.0
0.5  50000.0   7000.0
0.9  90000.0  13000.0

Explanation:

  • quantile() - Computes specified percentiles (e.g., 10th, 50th, 90th).
  • Useful for understanding distribution spread and identifying extreme values.
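Quantiles also feed directly into Tukey's 1.5 × IQR rule, a common percentile-based outlier fence. A minimal sketch on the Income column from the example:

```python
import pandas as pd

income = pd.Series([30000, 40000, 50000, 75000, 100000], name='Income')

# Tukey's fences: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]

print("Fences:", lower, upper)
print("Outliers:\n", outliers)
```

Here no value falls outside the fences; on heavier-tailed data the filter returns the flagged rows.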

2.4 Custom Aggregations with agg

Example: Combining Multiple Statistics

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Demand': [50, 60, 70, 80, 100],
    'Supply': [40, 45, 50, 55, 60]
})

# Define custom aggregations
stats = df.agg(['mean', 'std', 'skew', lambda x: x.kurtosis()])

print("Custom statistics:\n", stats)

Output:

Custom statistics:
            Demand     Supply
mean      72.000000  50.000000
std       19.235384   7.905694
skew       0.590129   0.000000
<lambda>  -0.021914  -1.200000

Explanation:

  • agg() - Applies multiple functions (mean, std, skew, kurtosis) to columns.
  • Enables comprehensive summaries tailored to analysis needs.
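One readability tweak: agg() labels the result row of a bare lambda as <lambda>. Passing a named function instead (p90 below is an illustrative helper, not a pandas built-in) gives the row a meaningful label, since agg() uses the function's __name__:

```python
import pandas as pd

df = pd.DataFrame({
    'Demand': [50, 60, 70, 80, 100],
    'Supply': [40, 45, 50, 55, 60]
})

# A named function yields a readable row label instead of '<lambda>'
def p90(x):
    return x.quantile(0.9)

stats = df.agg(['mean', 'std', p90])

print(stats)
```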

2.5 Rolling Statistics for Time Series

Example: Rolling Descriptive Statistics

import pandas as pd
import numpy as np

# Create a time series DataFrame
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=6),
    'Price': [100, 105, 110, 108, 115, 120]
}).set_index('Date')

# Compute rolling statistics
rolling_stats = df['Price'].rolling(window=3).agg(['mean', 'std'])

print("Rolling statistics (window=3):\n", rolling_stats)

Output:

Rolling statistics (window=3):
                  mean       std
Date                          
2023-01-01       NaN       NaN
2023-01-02       NaN       NaN
2023-01-03  105.000000  5.000000
2023-01-04  107.666667  2.516611
2023-01-05  111.000000  3.605551
2023-01-06  114.333333  6.027714

Explanation:

  • rolling().agg() - Computes statistics over a 3-day window.
  • Useful for analyzing trends or volatility in time series data.
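By default, rolling() emits NaN until a full window is available. If those leading NaNs are unwanted, min_periods relaxes the requirement; a short sketch on the same price series:

```python
import pandas as pd

prices = pd.Series([100, 105, 110, 108, 115, 120], name='Price')

# min_periods=1 yields a value as soon as one observation is available,
# instead of NaN for the first window-1 rows
rolling_mean = prices.rolling(window=3, min_periods=1).mean()

print(rolling_mean)
```

The first two entries are partial-window means rather than NaN, which can matter when downstream code cannot tolerate missing values.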

2.6 Incorrect Usage of Advanced Statistics

Example: Misinterpreting Skewness with Small Sample

import pandas as pd
import numpy as np

# Create a DataFrame with small sample
df = pd.DataFrame({
    'Feature': [1, 2, 3]
})

# Incorrect: Computing skewness with insufficient data
skewness = df.skew()

print("Skewness:\n", skewness)
print("\nAssuming reliable skewness with small sample")

Output:

Skewness:
Feature    0.0
dtype: float64

Assuming reliable skewness with small sample

Explanation:

  • Small sample size (n=3) makes skewness unreliable, as it’s sensitive to sample variation.
  • Solution: Ensure sufficient sample size or validate with visualizations (e.g., histograms).
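One way to gauge that unreliability is to bootstrap the statistic: resample the data with replacement and watch how much the estimate moves. A sketch, assuming an illustrative sample of 30 normal draws:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
data = pd.Series(rng.normal(0, 1, 30))

# Resample with replacement and recompute skewness each time; a wide
# spread across resamples signals an unreliable point estimate
boot = pd.Series(
    [data.sample(frac=1, replace=True, random_state=i).skew()
     for i in range(200)]
)

print("Point estimate:", data.skew())
print("Bootstrap std of skewness:", boot.std())
```

If the bootstrap standard deviation is large relative to the point estimate, treat the skewness value with caution.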

03. Effective Usage

3.1 Recommended Practices

  • Combine multiple advanced statistics to gain a holistic view of data distributions.

Example: Comprehensive Descriptive Analysis

import pandas as pd
import numpy as np

# Create a large DataFrame
df = pd.DataFrame({
    'Revenue': np.concatenate([np.random.normal(100, 20, 750), np.random.normal(200, 50, 250)]),
    'Profit': np.random.normal(10, 5, 1000)
})

# Compute advanced statistics
stats = df.agg({
    'Revenue': ['mean', 'std', 'skew', 'kurt', lambda x: x.quantile(0.9)],
    'Profit': ['mean', 'std', 'skew', 'kurt', lambda x: x.quantile(0.9)]
})
quantiles = df.quantile([0.1, 0.5, 0.9])
missing = df.isna().sum()

print("Advanced statistics:\n", stats)
print("\nQuantiles (10th, 50th, 90th):\n", quantiles)
print("\nMissing values:\n", missing)

Output (example, varies with random data):

Advanced statistics:
            Revenue     Profit
mean     124.523412  10.012345
std       47.891234   4.987654
skew       1.234567   0.012345
kurt       0.987654  -0.123456
<lambda> 192.345678  16.789012

Quantiles (10th, 50th, 90th):
       Revenue     Profit
0.1  78.123456   3.456789
0.5 109.876543  10.012345
0.9 192.345678  16.789012

Missing values:
Revenue    0
Profit     0
dtype: int64
  • agg() - Combines mean, std, skew, kurtosis, and 90th percentile for a comprehensive summary.
  • quantile() - Provides detailed distribution insights.
  • Check missing values to ensure data integrity.

3.2 Practices to Avoid

  • Avoid relying on advanced statistics without validating data quality or sample size.

Example: Ignoring Missing Data Impact

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
    'Feature': [1, 2, np.nan, np.nan, 100]
})

# Incorrect: Computing kurtosis without checking missing values
kurtosis = df.kurt()

print("Kurtosis:\n", kurtosis)
print("\nAssuming reliable results without checking missing data")

Output:

Kurtosis:
Feature   NaN
dtype: float64

Assuming reliable results without checking missing data
  • After the NaNs are skipped only three values remain, and kurt() requires at least four observations, so it silently returns NaN.
  • Solution: Check isna().sum() and handle missing data (e.g., imputation) first.
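A safer sequence is to quantify the missingness first and handle it explicitly before computing higher moments. Median imputation below is just one illustrative choice, not a universal fix:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, np.nan, 100], name='Feature')

# Quantify missingness before trusting any statistic
n_missing = s.isna().sum()
print("Missing values:", n_missing)

# One option: impute with the median of the observed values
filled = s.fillna(s.median())

print("Kurtosis after imputation:", filled.kurt())
```

Whether imputation, dropping, or collecting more data is appropriate depends on why the values are missing.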

04. Common Use Cases in Data Analysis

4.1 Outlier Detection with Quantiles and Kurtosis

Identify outliers using extreme percentiles and tail behavior.

Example: Detecting Outliers

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Sales': [100, 120, 130, 150, 1000],
    'Profit': [10, 12, 15, 20, 50]
})

# Compute quantiles and kurtosis
quantiles = df.quantile([0.05, 0.95])
kurtosis = df.kurt()

print("Quantiles (5th, 95th):\n", quantiles)
print("\nKurtosis:\n", kurtosis)
print("\nOutliers (values > 95th percentile):\n", df[df > quantiles.loc[0.95]].dropna())

Output:

Quantiles (5th, 95th):
       Sales  Profit
0.05  104.0    10.4
0.95  830.0    44.0

Kurtosis:
Sales     4.957714
Profit    3.942192
dtype: float64

Outliers (values > 95th percentile):
      Sales  Profit
4  1000.0    50.0

Explanation:

  • quantile() - Identifies extreme values (e.g., 95th percentile).
  • kurt() - High kurtosis confirms heavy tails, supporting outlier detection.

4.2 Time Series Trend Analysis

Analyze trends and volatility in time series data using rolling statistics.

Example: Rolling Statistics for Time Series

import pandas as pd
import numpy as np

# Create a time series DataFrame
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10),
    'Stock_Price': [100, 102, 105, 107, 110, 115, 112, 118, 120, 125]
}).set_index('Date')

# Compute rolling statistics
rolling_stats = df.rolling(window=5).agg({
    'Stock_Price': ['mean', 'std', 'skew']
})

print("Rolling statistics (window=5):\n", rolling_stats)

Output:

Rolling statistics (window=5):
            Stock_Price                    
                  mean       std      skew
Date                                     
2023-01-01       NaN       NaN       NaN
2023-01-02       NaN       NaN       NaN
2023-01-03       NaN       NaN       NaN
2023-01-04       NaN       NaN       NaN
2023-01-05    104.8  3.962323  0.125385
2023-01-06    107.8  4.969909  0.564531
2023-01-07    109.8  3.962323  0.125385
2023-01-08    112.4  4.277850  0.116242
2023-01-09    115.0  4.123106  0.000000
2023-01-10    118.0  4.949747  0.371076

Explanation:

  • rolling().agg() - Tracks mean, std, and skew over a 5-day window.
  • Reveals trends (e.g., increasing mean) and distribution changes over time.
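A common companion to rolling price statistics is rolling volatility, i.e. the standard deviation of period-over-period returns. A sketch on the same stock prices:

```python
import pandas as pd

prices = pd.Series([100, 102, 105, 107, 110, 115, 112, 118, 120, 125],
                   name='Stock_Price')

# Percentage returns, then a rolling estimate of their dispersion
returns = prices.pct_change()
volatility = returns.rolling(window=5).std()

print("Rolling volatility (window=5):\n", volatility)
```

Because pct_change() produces a leading NaN, the first valid volatility value appears one row later than for the raw prices.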

Conclusion

Pandas’ advanced descriptive statistics, powered by NumPy Array Operations, provide a robust toolkit for uncovering complex data patterns. Key takeaways:

  • Use skew(), kurt(), and quantile() to analyze distribution shape and spread.
  • Apply agg() for custom statistical summaries and rolling() for time series analysis.
  • Validate data quality and sample size to ensure reliable results.
  • Apply in outlier detection and time series analysis to enhance preprocessing and modeling.

With Pandas, you can gain deep insights into your data’s structure, driving effective analysis and robust machine learning workflows!