Pandas: Correlation

Understanding relationships between variables is fundamental in data analysis and machine learning. Built on NumPy Array Operations, Pandas provides robust methods to compute correlation coefficients, such as Pearson, Spearman, and Kendall, to quantify the strength and direction of relationships between numerical columns in a DataFrame. This guide explores Pandas Correlation, covering key techniques, customization options, and applications in exploratory data analysis and feature selection.


01. Why Use Correlation in Pandas?

Correlation analysis helps identify linear or monotonic relationships between variables, which is critical for feature selection, detecting multicollinearity, or understanding data patterns. Pandas’ correlation methods, leveraging NumPy’s vectorized operations, offer efficient computation of correlation matrices and pairwise correlations. These tools are essential for exploratory data analysis, preparing datasets for machine learning, and identifying redundant or informative features.

Example: Basic Correlation Matrix

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 55000, 65000],
    'Experience': [2, 5, 4, 7]
})

# Compute correlation matrix
corr_matrix = df.corr()

print("Correlation matrix:\n", corr_matrix)

Output:

Correlation matrix:
                 Age    Salary  Experience
Age         1.000000  0.800000    0.868243
Salary      0.800000  1.000000    0.992278
Experience  0.868243  0.992278    1.000000

Explanation:

  • corr() - Computes the Pearson correlation matrix for numerical columns by default.
  • Values range from -1 (perfect negative correlation) to 1 (perfect positive correlation).
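
The full matrix is often more than you need; for a single pair of columns, Series.corr computes one coefficient directly. A minimal sketch using the same data as above:

```python
import pandas as pd

# Same data as above
df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 55000, 65000],
    'Experience': [2, 5, 4, 7]
})

# Pearson correlation between two Series (no full matrix needed)
r = df['Age'].corr(df['Salary'])
print(round(r, 6))
```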

02. Key Correlation Methods and Options

Pandas provides flexible methods to compute correlation coefficients, optimized with NumPy for performance. These methods support different correlation types and handle missing data. The table below summarizes key methods and their applications in data analysis:

Method/Option           Syntax                     Use Case
Pearson Correlation     corr(method='pearson')     Measure linear relationships
Spearman Correlation    corr(method='spearman')    Measure monotonic relationships
Kendall Correlation     corr(method='kendall')     Measure ordinal associations
Pairwise Correlation    corrwith()                 Compare one column to others
Missing Data Handling   corr(min_periods=...)      Control for missing values


2.1 Pearson Correlation

Example: Computing Pearson Correlation

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Height': [160, 170, 165, 175],
    'Weight': [50, 70, 60, 80],
    'Age': [20, 25, 22, 30]
})

# Compute Pearson correlation
corr_matrix = df.corr(method='pearson')

print("Pearson correlation matrix:\n", corr_matrix)

Output:

Pearson correlation matrix:
          Height    Weight       Age
Height  1.000000  1.000000  0.979526
Weight  1.000000  1.000000  0.979526
Age     0.979526  0.979526  1.000000

Explanation:

  • corr(method='pearson') - Measures linear relationships between variables.
  • Height and Weight here are exactly linearly related (Weight = 2 * Height - 270), so their correlation is a perfect 1.0; Age correlates strongly (about 0.98) with both.

2.2 Spearman Correlation

Example: Computing Spearman Correlation

import pandas as pd
import numpy as np

# Create a DataFrame with non-linear data
df = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Score': [10, 40, 90, 160],
    'Time': [5, 4, 3, 2]
})

# Compute Spearman correlation
corr_matrix = df.corr(method='spearman')

print("Spearman correlation matrix:\n", corr_matrix)

Output:

Spearman correlation matrix:
           Rank     Score      Time
Rank   1.000000  1.000000 -1.000000
Score  1.000000  1.000000 -1.000000
Time  -1.000000 -1.000000  1.000000

Explanation:

  • corr(method='spearman') - Measures monotonic relationships based on ranks.
  • Perfect correlation (1.0 or -1.0) indicates a strictly increasing or decreasing relationship.
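
The Spearman result is easiest to appreciate next to Pearson on the same kind of data; a small comparison sketch, where Score grows quadratically (monotonic but not linear) with Rank:

```python
import pandas as pd

# Score grows quadratically with Rank: monotonic but not linear
df = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Score': [10, 40, 90, 160]
})

# Pearson sees a strong-but-imperfect linear fit; Spearman,
# working on ranks, reports a perfect monotonic relationship
pearson = df.corr(method='pearson').loc['Rank', 'Score']
spearman = df.corr(method='spearman').loc['Rank', 'Score']
print(f"Pearson: {pearson:.6f}, Spearman: {spearman:.6f}")
```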

2.3 Kendall Correlation

Example: Computing Kendall Correlation

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Rating': [1, 2, 3, 4],
    'Quality': [2, 1, 4, 3],
    'Price': [100, 150, 120, 180]
})

# Compute Kendall correlation
corr_matrix = df.corr(method='kendall')

print("Kendall correlation matrix:\n", corr_matrix)

Output:

Kendall correlation matrix:
           Rating   Quality     Price
Rating   1.000000  0.333333  0.666667
Quality  0.333333  1.000000  0.000000
Price    0.666667  0.000000  1.000000

Explanation:

  • corr(method='kendall') - Measures ordinal associations by comparing concordant and discordant pairs: tau = (C - D) / total pairs.
  • Rating and Quality agree on 4 of 6 pairs and disagree on 2 (tau = 0.333); Quality and Price balance out exactly (tau = 0.0), indicating no ordinal association.
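
The tau values above can be reproduced by counting pairs directly, which makes the concordant/discordant definition concrete; a plain-Python sketch for the Rating and Quality columns:

```python
from itertools import combinations

rating = [1, 2, 3, 4]
quality = [2, 1, 4, 3]

# A pair of rows is concordant when both variables change in the same
# direction between the two rows, discordant when they disagree
concordant = discordant = 0
for i, j in combinations(range(len(rating)), 2):
    sign = (rating[j] - rating[i]) * (quality[j] - quality[i])
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

# Kendall's tau without ties: (C - D) / number of pairs
tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(tau, 6))
```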

2.4 Pairwise Correlation with corrwith

Example: Using corrwith for Pairwise Correlation

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Target': [10, 20, 15, 25],
    'Feature1': [1, 2, 1.5, 2.5],
    'Feature2': [100, 200, 150, 250]
})

# Compute pairwise correlations with Target
correlations = df.corrwith(df['Target'])

print("Correlations with Target:\n", correlations)

Output:

Correlations with Target:
Target      1.0
Feature1    1.0
Feature2    1.0
dtype: float64

Explanation:

  • corrwith() - Computes correlations between one column (Target) and all others.
  • Here Feature1 (Target / 10) and Feature2 (Target * 10) are exact linear rescalings of Target, so both correlations are a perfect 1.0.
  • Useful for feature selection by identifying variables correlated with a target.
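
For feature screening it is common to drop the target's trivial self-correlation and rank the remaining columns by absolute value; a small sketch building on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Target': [10, 20, 15, 25],
    'Feature1': [1, 2, 1.5, 2.5],
    'Feature2': [100, 200, 150, 250]
})

# Drop the target's self-correlation, then rank features by |corr|
ranked = (df.corrwith(df['Target'])
            .drop('Target')
            .abs()
            .sort_values(ascending=False))
print(ranked)
```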

2.5 Handling Missing Data

Example: Correlation with Missing Values

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [10, np.nan, 30, 40],
    'C': [100, 200, 150, np.nan]
})

# Compute correlation with minimum periods
corr_matrix = df.corr(min_periods=3)

print("Correlation matrix with missing values:\n", corr_matrix)

Output:

Correlation matrix with missing values:
     A    B    C
A  1.0  NaN  NaN
B  NaN  1.0  NaN
C  NaN  NaN  1.0

Explanation:

  • corr(min_periods=3) - Requires at least 3 non-null pairs for each correlation computation.
  • Every column pair here shares only 2 non-null rows, so all off-diagonal entries are NaN: the correlations are flagged as unreliable instead of being computed from too little data.
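
Before trusting any pairwise correlation, it helps to know how many non-null observations each column pair actually shares; one way to count them (a sketch, using a matrix product of the not-null mask):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [10, np.nan, 30, 40],
    'C': [100, 200, 150, np.nan]
})

# notna() gives a boolean mask; the matrix product of the mask with
# itself counts, for each column pair, the rows where both are non-null
mask = df.notna().astype(int)
pair_counts = mask.T @ mask
print(pair_counts)
```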

2.6 Incorrect Correlation Usage

Example: Misapplying Correlation on Non-Numerical Data

import pandas as pd

# Create a DataFrame with non-numerical data
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40]
})

# Incorrect: Applying corr to non-numerical columns
try:
    corr_matrix = df.corr()
    print("Correlation matrix:\n", corr_matrix)
except ValueError as e:
    print("Error:", e)

Output:

Error: could not convert string to float: 'A'

Explanation:

  • corr() fails on non-numerical columns like Category (in pandas 2.0+, where numeric_only defaults to False).
  • Solution: Filter numerical columns with select_dtypes(include='number'), or pass numeric_only=True to corr(), before computing correlations.
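
Both workarounds can be sketched side by side; note that the numeric_only parameter assumes pandas 1.5 or newer:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40],
    'Count': [1, 2, 3, 4]
})

# Option 1: keep only numeric columns explicitly
corr1 = df.select_dtypes(include='number').corr()

# Option 2 (pandas 1.5+): let corr() skip non-numeric columns
corr2 = df.corr(numeric_only=True)

print(corr1)
print(corr1.equals(corr2))
```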

03. Effective Usage

3.1 Recommended Practices

  • Select appropriate correlation methods (Pearson, Spearman, Kendall) based on data characteristics.

Example: Comprehensive Correlation Analysis

import pandas as pd
import numpy as np

# Create a large DataFrame
df = pd.DataFrame({
    'ID': np.arange(1000, dtype='int32'),
    'Price': pd.Series([10.5, 15.0, np.nan, 20.0] * 250),
    'Stock': pd.Series([100, 200, 150, None] * 250),
    'Demand': pd.Series([50, 70, 60, 80] * 250)
})

# Comprehensive correlation analysis
num_df = df.select_dtypes(include='number')
pearson_corr = num_df.corr(method='pearson')
spearman_corr = num_df.corr(method='spearman')
corr_with_demand = num_df.corrwith(num_df['Demand'], method='pearson')
missing = num_df.isna().sum()

print("Pearson correlation matrix:\n", pearson_corr)
print("\nSpearman correlation matrix:\n", spearman_corr)
print("\nCorrelations with Demand:\n", corr_with_demand)
print("\nMissing values:\n", missing)

Output:

Pearson correlation matrix:
              ID     Price     Stock    Demand
ID      1.000000  0.004265  0.001414  0.003098
Price   0.004265  1.000000  1.000000  0.975787
Stock   0.001414  1.000000  1.000000  1.000000
Demand  0.003098  0.975787  1.000000  1.000000

Spearman correlation matrix:
              ID     Price     Stock    Demand
ID      1.000000  0.003771  0.001886  0.003098
Price   0.003771  1.000000  1.000000  1.000000
Stock   0.001886  1.000000  1.000000  1.000000
Demand  0.003098  1.000000  1.000000  1.000000

Correlations with Demand:
ID        0.003098
Price     0.975787
Stock     1.000000
Demand    1.000000
dtype: float64

Missing values:
ID          0
Price     250
Stock     250
Demand      0
dtype: int64

Explanation:

  • Use select_dtypes() to filter numerical columns.
  • Compare Pearson and Spearman correlations to assess relationship types.
  • Use corrwith() for target-specific analysis and check missing data.

3.2 Practices to Avoid

  • Avoid interpreting correlations without considering missing data or data distributions.

Example: Ignoring Missing Data Impact

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, np.nan],
    'Feature2': [10, 20, 30, np.nan]
})

# Incorrect: Computing correlation without checking missing data
corr_matrix = df.corr()
print("Correlation matrix:\n", corr_matrix)
print("\nAssuming reliable results without checking missing values")

Output:

Correlation matrix:
           Feature1  Feature2
Feature1      1.0       1.0
Feature2      1.0       1.0

Assuming reliable results without checking missing values

Explanation:

  • Only 2 non-null pairs (rows 0 and 1) drive the correlation, producing a perfect but meaningless 1.0.
  • Solution: Check isna().sum() and use min_periods to ensure sufficient data.
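
The same data with min_periods applied shows the safer behaviour: the under-populated pair is reported as NaN instead of a misleading 1.0. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, np.nan],
    'Feature2': [10, 20, 30, np.nan]
})

# Only 2 rows have both features; min_periods=3 rejects the pair
corr_matrix = df.corr(min_periods=3)
print(corr_matrix)
```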

04. Common Use Cases in Data Analysis

4.1 Feature Selection for Machine Learning

Identify features correlated with a target variable for modeling.

Example: Selecting Features Based on Correlation

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Target': [100, 200, 150, 250],
    'Feature1': [1, 2, 1.5, 2.5],
    'Feature2': [10, 20, 15, 25],
    'Feature3': [5, 4, 6, 3]
})

# Compute correlations with Target
correlations = df.corrwith(df['Target'])

print("Correlations with Target:\n", correlations)
print("\nHighly correlated features (|corr| > 0.9):\n", correlations[abs(correlations) > 0.9])

Output:

Correlations with Target:
Target      1.0
Feature1    1.0
Feature2    1.0
Feature3   -0.8
dtype: float64

Highly correlated features (|corr| > 0.9):
Target      1.0
Feature1    1.0
Feature2    1.0
dtype: float64

Explanation:

  • corrwith() - Identifies features strongly correlated with Target.
  • Feature1 and Feature2 are exact linear rescalings of Target (correlation 1.0) and pass the |corr| > 0.9 filter; Feature3's moderate negative correlation (-0.8) is excluded.
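
The filtering step can be turned into a reusable selection step that returns column names; a sketch on the same data (the 0.9 threshold is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Target': [100, 200, 150, 250],
    'Feature1': [1, 2, 1.5, 2.5],
    'Feature2': [10, 20, 15, 25],
    'Feature3': [5, 4, 6, 3]
})

# Keep features whose absolute correlation with Target clears a threshold
corrs = df.corrwith(df['Target']).drop('Target')
selected = corrs[corrs.abs() > 0.9].index.tolist()
print(selected)
```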

4.2 Detecting Multicollinearity

Identify highly correlated features to avoid redundancy in models.

Example: Checking for Multicollinearity

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4],
    'Feature2': [2, 4, 6, 8],
    'Feature3': [10, 15, 12, 18]
})

# Compute correlation matrix
corr_matrix = df.corr()
high_corr = corr_matrix[abs(corr_matrix) > 0.8]

print("Correlation matrix:\n", corr_matrix)
print("\nHighly correlated pairs (|corr| > 0.8):\n", high_corr)

Output:

Correlation matrix:
          Feature1  Feature2  Feature3
Feature1  1.000000  1.000000  0.774597
Feature2  1.000000  1.000000  0.774597
Feature3  0.774597  0.774597  1.000000

Highly correlated pairs (|corr| > 0.8):
          Feature1  Feature2  Feature3
Feature1       1.0       1.0       NaN
Feature2       1.0       1.0       NaN
Feature3       NaN       NaN       1.0

Explanation:

  • corr() - Reveals perfect correlation (1.0) between Feature1 and Feature2 (Feature2 = 2 * Feature1), indicating multicollinearity; the filter masks pairs below the 0.8 threshold as NaN.
  • Helps decide which features to drop to improve model stability.
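
A common screening pattern keeps only the upper triangle of the matrix, so each pair is considered once, and flags one column from every highly correlated pair for removal; a sketch (the 0.95 threshold is an assumption for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4],
    'Feature2': [2, 4, 6, 8],
    'Feature3': [10, 15, 12, 18]
})

# Look only at the upper triangle so each pair is considered once,
# then flag a column from every pair above the threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```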

Conclusion

Pandas’ correlation methods, powered by NumPy Array Operations, provide efficient tools for analyzing relationships between variables. Key takeaways:

  • Use corr() with Pearson, Spearman, or Kendall methods to match data characteristics.
  • Apply corrwith() for target-specific feature analysis.
  • Handle missing data with min_periods and verify data types.
  • Apply in feature selection and multicollinearity detection for machine learning workflows.

With Pandas, you can uncover critical relationships in your data, enhancing feature selection and model performance!
