Pandas: Advanced Descriptive Statistics
Advanced descriptive statistics provide deeper insights into data distributions, variability, and relationships, essential for robust data analysis and machine learning. Built on NumPy Array Operations, Pandas offers powerful methods beyond basic describe(), including skewness, kurtosis, quantiles, and custom aggregations, to characterize datasets comprehensively. This guide explores Pandas Advanced Descriptive Statistics, covering techniques, customization, and applications in exploratory data analysis and feature engineering.
01. Why Use Advanced Descriptive Statistics?
While basic statistics like mean and standard deviation offer a starting point, advanced descriptive statistics reveal nuances such as asymmetry (skewness), tail behavior (kurtosis), or percentile-based distributions. Pandas' vectorized operations, leveraging NumPy, enable efficient computation of these metrics on DataFrames and Series. These tools are critical for detecting outliers, understanding data distributions, and preparing datasets for statistical modeling or machine learning, where they help check whether assumptions such as normality or homoscedasticity plausibly hold.
Example: Computing Skewness and Kurtosis
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'Sales': [100, 120, 150, 200, 300, 1000],
'Profit': [10, 15, 12, 20, 25, 50]
})
# Compute skewness and kurtosis
skewness = df.skew()
kurtosis = df.kurt()
print("Skewness:\n", skewness)
print("\nKurtosis:\n", kurtosis)
Output:
Skewness:
Sales 2.226718
Profit 1.761564
dtype: float64
Kurtosis:
Sales 5.089138
Profit 3.293031
dtype: float64
Explanation:
- skew(): Measures distribution asymmetry (positive skew for Sales indicates right-tailed data).
- kurt(): Assesses tail heaviness (high kurtosis for Sales suggests heavy tails).
02. Key Advanced Descriptive Statistics Methods
Pandas provides a suite of methods for advanced descriptive statistics, optimized with NumPy for performance. These methods extend beyond describe() to include higher-order moments, quantiles, and custom aggregations. The table below summarizes key methods and their applications in data exploration:
Method | Description | Use Case |
---|---|---|
skew() | Measure distribution asymmetry | Detect skewed distributions for transformations |
kurt() | Measure tail heaviness | Assess outlier prevalence |
quantile() | Compute specific percentiles | Analyze distribution spread |
agg() | Apply custom aggregations | Combine multiple statistics |
rolling().agg() | Compute statistics over a window | Analyze trends in time series |
2.1 Skewness for Distribution Asymmetry
Example: Analyzing Skewness
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
'Revenue': [100, 110, 120, 200, 500],
'Cost': [50, 55, 60, 65, 70]
})
# Compute skewness
skewness = df.skew()
print("Skewness:\n", skewness)
Output:
Skewness:
Revenue 1.957602
Cost 0.000000
dtype: float64
Explanation:
- skew(): Positive skewness for Revenue (1.96) indicates a right-skewed distribution, while the evenly spaced Cost values are symmetric (skewness 0).
- Guides preprocessing (e.g., log transformation for skewed data).
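The log transformation mentioned above can be sketched as follows. This is a minimal illustration that reuses the Revenue values from the example; the choice of np.log1p (rather than np.log) and the variable names are assumptions made here, not something the example prescribes.

```python
import numpy as np
import pandas as pd

# Same Revenue values as the skewness example above
revenue = pd.Series([100, 110, 120, 200, 500], name="Revenue")

# log1p compresses large values more strongly than small ones,
# which pulls in the long right tail
log_revenue = np.log1p(revenue)

skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(f"Skew before: {skew_before:.3f}, after log1p: {skew_after:.3f}")
```

The transformed series is still right-skewed, only less so; with five points this is an illustration of the mechanics rather than evidence that the transform is appropriate.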
2.2 Kurtosis for Tail Behavior
Example: Assessing Kurtosis
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
'Price': [10, 12, 15, 20, 100],
'Quantity': [100, 110, 105, 120, 115]
})
# Compute kurtosis
kurtosis = df.kurt()
print("Kurtosis:\n", kurtosis)
Output:
Kurtosis:
Price 4.806373
Quantity -1.200000
dtype: float64
Explanation:
- kurt(): High kurtosis for Price (4.81) suggests heavy tails, indicating potential outliers.
- Negative kurtosis for Quantity (-1.20) indicates a flatter distribution.
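To make the number returned by kurt() less of a black box, the sketch below recomputes it by hand. It assumes the bias-corrected excess (Fisher) definition, under which a normal distribution scores about 0; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

price = pd.Series([10, 12, 15, 20, 100], name="Price")

n = price.count()
dev = price - price.mean()
m2 = (dev ** 2).sum()   # sum of squared deviations
m4 = (dev ** 4).sum()   # sum of fourth-power deviations

# Bias-corrected sample excess kurtosis; needs at least 4 observations
manual_kurt = (
    n * (n + 1) * (n - 1) * m4 / ((n - 2) * (n - 3) * m2 ** 2)
    - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
)

print(round(manual_kurt, 4), round(price.kurt(), 4))
```

The (n - 3) term in the denominator is why this estimator is undefined for fewer than four observations.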
2.3 Quantiles for Distribution Spread
Example: Computing Custom Quantiles
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
'Income': [30000, 40000, 50000, 75000, 100000],
'Savings': [5000, 6000, 7000, 10000, 15000]
})
# Compute custom quantiles
quantiles = df.quantile([0.1, 0.5, 0.9])
print("Quantiles (10th, 50th, 90th):\n", quantiles)
Output:
Quantiles (10th, 50th, 90th):
Income Savings
0.1 34000.0 5400.0
0.5 50000.0 7000.0
0.9 90000.0 13000.0
Explanation:
- quantile(): Computes specified percentiles (e.g., 10th, 50th, 90th).
- Useful for understanding distribution spread and identifying extreme values.
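One common way to turn quantiles into a single spread measure is the interquartile range (IQR). The sketch below reuses the Income and Savings data from the example; the IQR itself is an addition for illustration and relies on the default linear interpolation of quantile().

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [30000, 40000, 50000, 75000, 100000],
    "Savings": [5000, 6000, 7000, 10000, 15000],
})

# 25th and 75th percentiles for every column
quartiles = df.quantile([0.25, 0.75])

# Interquartile range: width of the middle 50% of each column
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(iqr)
```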
2.4 Custom Aggregations with agg
Example: Combining Multiple Statistics
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
'Demand': [50, 60, 70, 80, 100],
'Supply': [40, 45, 50, 55, 60]
})
# Define custom aggregations
stats = df.agg(['mean', 'std', 'skew', lambda x: x.kurtosis()])
print("Custom statistics:\n", stats)
Output:
Custom statistics:
Demand Supply
mean 72.000000 50.000000
std 19.235384 7.905694
skew 0.590129 0.000000
<lambda> -0.021914 -1.200000
Explanation:
- agg(): Applies multiple functions (mean, std, skew, kurtosis) to columns.
- Enables comprehensive summaries tailored to analysis needs.
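Lambdas show up as an uninformative <lambda> row label. A possible alternative, sketched below, is to pass small named functions so the row labels describe the statistic. The helper names p90 and iqr are hypothetical; both call quantile(), the same pattern the lambda above uses with kurtosis().

```python
import pandas as pd

df = pd.DataFrame({
    "Demand": [50, 60, 70, 80, 100],
    "Supply": [40, 45, 50, 55, 60],
})

def p90(x):
    """90th percentile of a column (hypothetical helper)."""
    return x.quantile(0.9)

def iqr(x):
    """Interquartile range of a column (hypothetical helper)."""
    return x.quantile(0.75) - x.quantile(0.25)

# Row labels are taken from the function names
stats = df.agg(["mean", p90, iqr])
print(stats)
```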
2.5 Rolling Statistics for Time Series
Example: Rolling Descriptive Statistics
import pandas as pd
import numpy as np
# Create a time series DataFrame
df = pd.DataFrame({
'Date': pd.date_range('2023-01-01', periods=6),
'Price': [100, 105, 110, 108, 115, 120]
}).set_index('Date')
# Compute rolling statistics
rolling_stats = df['Price'].rolling(window=3).agg(['mean', 'std'])
print("Rolling statistics (window=3):\n", rolling_stats)
Output:
Rolling statistics (window=3):
mean std
Date
2023-01-01 NaN NaN
2023-01-02 NaN NaN
2023-01-03 105.000000 5.000000
2023-01-04 107.666667 2.516611
2023-01-05 111.000000 3.605551
2023-01-06 114.333333 6.027714
Explanation:
- rolling().agg(): Computes statistics over a 3-day window.
- Useful for analyzing trends or volatility in time series data.
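The leading NaN rows appear because a full window is required by default. If partial windows are acceptable, the min_periods parameter of rolling() relaxes that requirement. A minimal sketch on the same Price data (the variable names are illustrative):

```python
import pandas as pd

prices = pd.Series(
    [100, 105, 110, 108, 115, 120],
    index=pd.date_range("2023-01-01", periods=6),
    name="Price",
)

# Default: a result only once the window holds 3 observations
strict = prices.rolling(window=3).mean()

# min_periods=1: compute from whatever observations are available so far
partial = prices.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"strict": strict, "partial": partial}))
```

Early values from partial windows rest on fewer observations, so they are noisier than the rest of the series.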
2.6 Incorrect Usage of Advanced Statistics
Example: Misinterpreting Skewness with Small Sample
import pandas as pd
import numpy as np
# Create a DataFrame with small sample
df = pd.DataFrame({
'Feature': [1, 2, 3]
})
# Incorrect: Computing skewness with insufficient data
skewness = df.skew()
print("Skewness:\n", skewness)
print("\nAssuming reliable skewness with small sample")
Output:
Skewness:
Feature 0.0
dtype: float64
Assuming reliable skewness with small sample
Explanation:
- Small sample size (n=3) makes skewness unreliable, as it’s sensitive to sample variation.
- Solution: Ensure sufficient sample size or validate with visualizations (e.g., histograms).
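One way to enforce the sample-size check is a small guard around skew(). The sketch below is only an illustration: the helper name skew_if_enough and the threshold of 30 observations are assumptions chosen here, not a rule from Pandas.

```python
import numpy as np
import pandas as pd

def skew_if_enough(series, min_n=30):
    """Return skewness only when at least min_n non-missing values exist."""
    if series.count() < min_n:
        return np.nan
    return series.skew()

small = pd.Series([1, 2, 3])
large = pd.Series(np.arange(100) ** 2)  # squares pile up at the low end, long right tail

print(skew_if_enough(small))  # too few observations, so no estimate is reported
print(skew_if_enough(large))
```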
03. Effective Usage
3.1 Recommended Practices
- Combine multiple advanced statistics to gain a holistic view of data distributions.
Example: Comprehensive Descriptive Analysis
import pandas as pd
import numpy as np
# Create a large DataFrame
df = pd.DataFrame({
'Revenue': np.concatenate([np.random.normal(100, 20, 750), np.random.normal(200, 50, 250)]),
'Profit': np.random.normal(10, 5, 1000)
})
# Compute advanced statistics
stats = df.agg({
'Revenue': ['mean', 'std', 'skew', 'kurt', lambda x: x.quantile(0.9)],
'Profit': ['mean', 'std', 'skew', 'kurt', lambda x: x.quantile(0.9)]
})
quantiles = df.quantile([0.1, 0.5, 0.9])
missing = df.isna().sum()
print("Advanced statistics:\n", stats)
print("\nQuantiles (10th, 50th, 90th):\n", quantiles)
print("\nMissing values:\n", missing)
Output (example, varies with random data):
Advanced statistics:
Revenue Profit
mean 124.523412 10.012345
std 47.891234 4.987654
skew 1.234567 0.012345
kurt 0.987654 -0.123456
<lambda> 192.345678 16.789012
Quantiles (10th, 50th, 90th):
Revenue Profit
0.1 78.123456 3.456789
0.5 109.876543 10.012345
0.9 192.345678 16.789012
Missing values:
Revenue 0
Profit 0
dtype: int64
Explanation:
- agg(): Combines mean, std, skew, kurtosis, and 90th percentile for a comprehensive summary.
- quantile(): Provides detailed distribution insights.
- Check missing values to ensure data integrity.
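Because the example draws random data, its output changes on every run. If reproducible numbers are needed, a seeded generator is one option. The sketch below assumes NumPy's default_rng and an arbitrary seed of 42; the function name make_frame is hypothetical.

```python
import numpy as np
import pandas as pd

def make_frame(seed=42):
    """Build the Revenue/Profit frame from a seeded random generator."""
    rng = np.random.default_rng(seed)
    revenue = np.concatenate([
        rng.normal(100, 20, 750),
        rng.normal(200, 50, 250),
    ])
    profit = rng.normal(10, 5, 1000)
    return pd.DataFrame({"Revenue": revenue, "Profit": profit})

df_a = make_frame()
df_b = make_frame()

# Same seed, same data, so the statistics are repeatable across runs
print(df_a.equals(df_b))
print(df_a.agg(["mean", "std", "skew", "kurt"]))
```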
3.2 Practices to Avoid
- Avoid relying on advanced statistics without validating data quality or sample size.
Example: Ignoring Missing Data Impact
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
df = pd.DataFrame({
'Feature': [1, 2, np.nan, np.nan, 100]
})
# Incorrect: Computing kurtosis without checking missing values
kurtosis = df.kurt()
print("Kurtosis:\n", kurtosis)
print("\nAssuming reliable results without checking missing data")
Output:
Kurtosis:
Feature NaN
dtype: float64
Assuming reliable results without checking missing data
Explanation:
- Missing values reduce the effective sample size. Only three valid observations remain here, and kurt() needs at least four, so the result is NaN rather than a usable estimate.
- Solution: Check isna().sum() and handle missing data (e.g., imputation) first.
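The check can be sketched as below, on the same Feature column. The threshold of four valid values reflects the bias-corrected kurtosis formula, which divides by (n - 3); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Feature": [1, 2, np.nan, np.nan, 100]})

n_missing = df["Feature"].isna().sum()  # number of NaN entries
n_valid = df["Feature"].count()         # number of non-missing entries

# kurt() skips NaN by default, so judge reliability by the valid count
kurt = df["Feature"].kurt() if n_valid >= 4 else np.nan

print(f"missing={n_missing}, valid={n_valid}, kurtosis={kurt}")
```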
04. Common Use Cases in Data Analysis
4.1 Outlier Detection with Quantiles and Kurtosis
Identify outliers using extreme percentiles and tail behavior.
Example: Detecting Outliers
import pandas as pd
import numpy as np
# Create a DataFrame
df = pd.DataFrame({
'Sales': [100, 120, 130, 150, 1000],
'Profit': [10, 12, 15, 20, 50]
})
# Compute quantiles and kurtosis
quantiles = df.quantile([0.05, 0.95])
kurtosis = df.kurt()
print("Quantiles (5th, 95th):\n", quantiles)
print("\nKurtosis:\n", kurtosis)
print("\nOutliers (values > 95th percentile):\n", df[df > quantiles.loc[0.95]].dropna())
Output:
Quantiles (5th, 95th):
Sales Profit
0.05 104.0 10.4
0.95 830.0 44.0
Kurtosis:
Sales 4.957714
Profit 3.942192
dtype: float64
Outliers (values > 95th percentile):
Sales Profit
4 1000.0 50.0
Explanation:
- quantile(): Identifies extreme values (e.g., 95th percentile).
- kurt(): High kurtosis confirms heavy tails, supporting outlier detection.
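Note that df[df > threshold].dropna() keeps only rows in which every column exceeds its threshold. If a row should be flagged when any single column is extreme, a per-column boolean mask is one alternative. A minimal sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [100, 120, 130, 150, 1000],
    "Profit": [10, 12, 15, 20, 50],
})

# 95th percentile of each column
upper = df.quantile(0.95)

# True where a value exceeds its own column's threshold
flags = df.gt(upper, axis="columns")

# Keep rows with at least one flagged column
outlier_rows = df[flags.any(axis=1)]
print(outlier_rows)
```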
4.2 Time Series Trend Analysis
Analyze trends and volatility in time series data using rolling statistics.
Example: Rolling Statistics for Time Series
import pandas as pd
import numpy as np
# Create a time series DataFrame
df = pd.DataFrame({
'Date': pd.date_range('2023-01-01', periods=10),
'Stock_Price': [100, 102, 105, 107, 110, 115, 112, 118, 120, 125]
}).set_index('Date')
# Compute rolling statistics
rolling_stats = df.rolling(window=5).agg({
'Stock_Price': ['mean', 'std', 'skew']
})
print("Rolling statistics (window=5):\n", rolling_stats)
Output:
Rolling statistics (window=5):
Stock_Price
mean std skew
Date
2023-01-01 NaN NaN NaN
2023-01-02 NaN NaN NaN
2023-01-03 NaN NaN NaN
2023-01-04 NaN NaN NaN
2023-01-05 104.8 3.962323 0.125385
2023-01-06 107.8 4.969909 0.564531
2023-01-07 109.8 3.962323 0.125385
2023-01-08 112.4 4.277850 0.116242
2023-01-09 115.0 4.123106 0.000000
2023-01-10 118.0 4.949747 0.371076
Explanation:
- rolling().agg(): Tracks mean, std, and skew over a 5-day window.
- Reveals trends (e.g., increasing mean) and distribution changes over time.
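Rolling windows also support order-based statistics, which are less sensitive to a single extreme price than mean and std. The sketch below adds a rolling median and a rolling range on the same Stock_Price values; these two statistics are illustrative additions, not part of the example above.

```python
import pandas as pd

prices = pd.Series(
    [100, 102, 105, 107, 110, 115, 112, 118, 120, 125],
    index=pd.date_range("2023-01-01", periods=10),
    name="Stock_Price",
)

window = prices.rolling(window=5)

# Median of each 5-day window
rolling_median = window.median()

# High-low range of each 5-day window as a simple volatility proxy
rolling_range = window.max() - window.min()

print(pd.DataFrame({"median": rolling_median, "range": rolling_range}))
```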
Conclusion
Pandas’ advanced descriptive statistics, powered by NumPy Array Operations, provide a robust toolkit for uncovering complex data patterns. Key takeaways:
- Use skew(), kurt(), and quantile() to analyze distribution shape and spread.
- Apply agg() for custom statistical summaries and rolling() for time series analysis.
- Validate data quality and sample size to ensure reliable results.
- Apply in outlier detection and time series analysis to enhance preprocessing and modeling.
With Pandas, you can gain deep insights into your data’s structure, driving effective analysis and robust machine learning workflows!