Pandas: Statistical Tests

Statistical tests are essential for validating hypotheses and uncovering patterns in data, both central tasks in data analysis and machine learning. While Pandas itself does not implement statistical tests, it integrates seamlessly with libraries like SciPy and Statsmodels, leveraging NumPy Array Operations for efficient data preparation. This guide explores how Pandas facilitates common statistical tests (e.g., t-tests, ANOVA, chi-square) through data manipulation and integration with external libraries, with applications in hypothesis testing and exploratory data analysis.


01. Why Use Statistical Tests with Pandas?

Statistical tests help quantify relationships, differences, or dependencies in data, enabling data-driven decisions. Pandas excels at preparing and structuring data for these tests, using its DataFrame and Series objects to handle grouping, filtering, and aggregation efficiently. By combining Pandas with SciPy or Statsmodels, users can perform tests like t-tests for comparing means, ANOVA for group differences, or chi-square for categorical associations. These tools are vital for validating assumptions, selecting features, and ensuring robust machine learning or statistical models.

Example: Basic T-Test with Pandas and SciPy

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Create a sample DataFrame
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Value': [10, 12, 11, 15, 14, 16]
})

# Group data by 'Group'
group_a = df[df['Group'] == 'A']['Value']
group_b = df[df['Group'] == 'B']['Value']

# Perform t-test
t_stat, p_value = ttest_ind(group_a, group_b)

print("T-statistic:", t_stat)
print("P-value:", p_value)

Output:

T-statistic: -4.898979485566356
P-value: 0.0080498931

Explanation:

  • Pandas - Filters data by Group to create subsets for testing.
  • ttest_ind - Compares the means of two independent groups; the low p-value (< 0.05) indicates a significant difference.

02. Key Statistical Tests with Pandas

Pandas facilitates statistical tests by preparing data for libraries like SciPy and Statsmodels, which perform the computations. These tests leverage NumPy’s efficiency for handling numerical operations. The table below summarizes key statistical tests, their integration with Pandas, and their applications:

Test                  Method (SciPy/Statsmodels)      Pandas Role                       Use Case
T-Test                scipy.stats.ttest_ind           Group and filter data             Compare means of two groups
ANOVA                 scipy.stats.f_oneway            Group data by categories          Compare means across multiple groups
Chi-Square            scipy.stats.chi2_contingency    Create contingency tables         Test categorical associations
Kolmogorov-Smirnov    scipy.stats.ks_2samp            Extract sample distributions      Compare distributions
Regression Analysis   statsmodels.formula.api.ols     Prepare feature and target data   Model relationships


2.1 T-Test for Comparing Means

Example: Independent T-Test

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Create a DataFrame
df = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'Control', 'Test', 'Test', 'Test'],
    'Score': [80, 85, 82, 90, 88, 92]
})

# Extract groups
control = df[df['Treatment'] == 'Control']['Score']
test = df[df['Treatment'] == 'Test']['Score']

# Perform t-test
t_stat, p_value = ttest_ind(control, test)

print("T-statistic:", t_stat)
print("P-value:", p_value)

Output:

T-statistic: -4.130922
P-value: 0.0144831

Explanation:

  • Pandas - Filters Score by Treatment to create two groups.
  • ttest_ind - Tests if group means differ significantly (p < 0.05 indicates a difference).
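
By default, ttest_ind pools the two groups' variances. A minimal variation, using SciPy's documented equal_var flag, switches to Welch's t-test, which drops the equal-variance assumption:

import pandas as pd
from scipy.stats import ttest_ind

# Same data as above
df = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'Control', 'Test', 'Test', 'Test'],
    'Score': [80, 85, 82, 90, 88, 92]
})
control = df.loc[df['Treatment'] == 'Control', 'Score']
test = df.loc[df['Treatment'] == 'Test', 'Score']

# equal_var=False requests Welch's t-test (no equal-variance assumption)
t_stat, p_value = ttest_ind(control, test, equal_var=False)
print("Welch T-statistic:", t_stat)
print("P-value:", p_value)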

2.2 ANOVA for Multiple Groups

Example: One-Way ANOVA

import pandas as pd
import numpy as np
from scipy.stats import f_oneway

# Create a DataFrame
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 12, 11, 15, 14, 16, 20, 22, 21]
})

# Group data
groups = [df[df['Group'] == g]['Value'] for g in df['Group'].unique()]

# Perform ANOVA
f_stat, p_value = f_oneway(*groups)

print("F-statistic:", f_stat)
print("P-value:", p_value)

Output:

F-statistic: 76.0
P-value: 5.47624e-05

Explanation:

  • Pandas - Groups Value by Group using a list comprehension.
  • f_oneway - Tests if means differ across multiple groups (low p-value suggests significant differences).
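
An equivalent way to build the group list, sketched below, iterates over a Pandas groupby instead of repeatedly boolean-filtering the DataFrame:

import pandas as pd
from scipy.stats import f_oneway

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 12, 11, 15, 14, 16, 20, 22, 21]
})

# groupby yields one (name, Series) pair per group; keep just the Series
groups = [values for _, values in df.groupby('Group')['Value']]
f_stat, p_value = f_oneway(*groups)
print("F-statistic:", f_stat)
print("P-value:", p_value)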

2.3 Chi-Square Test for Categorical Data

Example: Chi-Square Test

import pandas as pd
from scipy.stats import chi2_contingency

# Create a DataFrame
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'M', 'F'],
    'Preference': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No']
})

# Create contingency table
contingency_table = pd.crosstab(df['Gender'], df['Preference'])

# Perform chi-square test (correction=False disables Yates' continuity
# correction, which SciPy applies by default to 2x2 tables)
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print("Contingency table:\n", contingency_table)
print("\nChi-square statistic:", chi2)
print("P-value:", p_value)

Output:

Contingency table:
Preference  No  Yes
Gender             
F            2    1
M            1    2

Chi-square statistic: 0.6666666666666666
P-value: 0.4142161783088176

Explanation:

  • Pandas - Uses crosstab to create a contingency table.
  • chi2_contingency - Tests for independence between categorical variables; with correction=False the statistic follows the classic chi-square formula (p > 0.05 suggests no significant association).
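
chi2_contingency also returns the expected frequencies implied by independence; inspecting them is a quick validity check, since a common rule of thumb asks for expected counts of at least 5 per cell (not met in this small example). A brief sketch:

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'M', 'F'],
    'Preference': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No']
})
contingency_table = pd.crosstab(df['Gender'], df['Preference'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

# Wrap the expected-count array in a DataFrame with the table's labels
print(pd.DataFrame(expected, index=contingency_table.index,
                   columns=contingency_table.columns))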

2.4 Kolmogorov-Smirnov Test for Distributions

Example: KS Test for Distribution Comparison

import pandas as pd
import numpy as np
from scipy.stats import ks_2samp

# Create a DataFrame
df = pd.DataFrame({
    'Sample': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Value': [1.1, 1.2, 1.0, 2.0, 2.1, 1.9]
})

# Extract samples
sample_a = df[df['Sample'] == 'A']['Value']
sample_b = df[df['Sample'] == 'B']['Value']

# Perform KS test
ks_stat, p_value = ks_2samp(sample_a, sample_b)

print("KS statistic:", ks_stat)
print("P-value:", p_value)

Output:

KS statistic: 1.0
P-value: 0.09999999999999998

Explanation:

  • Pandas - Extracts Value for each Sample group.
  • ks_2samp - Compares the two empirical distributions; even though the samples are completely separated (KS statistic = 1.0), with only three points per sample the p-value stays above 0.05, so the test cannot declare a significant difference.
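
ks_2samp compares two samples; SciPy's kstest can instead compare one sample against a reference distribution. A rough normality screen is sketched below; note that estimating the mean and standard deviation from the same data biases the test, so treat the p-value as approximate:

import pandas as pd
from scipy.stats import kstest

values = pd.Series([1.1, 1.2, 1.0, 1.3, 0.9, 1.15])

# Compare the sample against a normal distribution parameterized by the
# sample's own mean and standard deviation (approximate; see note above)
ks_stat, p_value = kstest(values, 'norm', args=(values.mean(), values.std(ddof=1)))
print("KS statistic:", ks_stat)
print("P-value:", p_value)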

2.5 Regression Analysis with Statsmodels

Example: Linear Regression

import pandas as pd
import statsmodels.formula.api as smf

# Create a DataFrame
df = pd.DataFrame({
    'X': [1, 2, 3, 4],
    'Y': [2, 4, 5, 8]
})

# Fit linear regression model
model = smf.ols('Y ~ X', data=df).fit()

print(model.summary())

Output (abridged):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.944
...
            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    0.0000      0.725      0.000      1.000      -3.118       3.118
X            1.9000      0.265      7.181      0.019       0.762       3.038
==============================================================================

Explanation:

  • Pandas - Provides structured data for X and Y.
  • smf.ols - Fits a linear regression model using R-style formula syntax; the p-values indicate coefficient significance.
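
Once fitted, the formula-API model can score new observations passed as a DataFrame whose columns match the formula's regressors, for example:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'X': [1, 2, 3, 4], 'Y': [2, 4, 5, 8]})
model = smf.ols('Y ~ X', data=df).fit()

# predict() applies the fitted coefficients to new X values
new_data = pd.DataFrame({'X': [5, 6]})
print(model.predict(new_data))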

2.6 Incorrect Statistical Test Usage

Example: Misapplying T-Test on Non-Normal Data

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Create a DataFrame with non-normal data
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Value': [1, 1, 100, 2, 2, 200]
})

# Incorrect: Applying t-test without checking normality
group_a = df[df['Group'] == 'A']['Value']
group_b = df[df['Group'] == 'B']['Value']
t_stat, p_value = ttest_ind(group_a, group_b)

print("T-statistic:", t_stat)
print("P-value:", p_value)
print("\nAssuming normality without verification")

Output:

T-statistic: -0.4607655
P-value: 0.6689049

Assuming normality without verification

Explanation:

  • T-tests assume approximately normal data, but Value contains extreme outliers (100, 200), so the t-statistic and p-value are unreliable.
  • Solution: Check normality (e.g., with scipy.stats.shapiro) or use a non-parametric test like Mann-Whitney U, as sketched below.
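
A minimal sketch of that solution, using SciPy's shapiro and mannwhitneyu on the same data (with only three values per group, the Shapiro-Wilk test has little power, so treat it as a rough check):

import pandas as pd
from scipy.stats import shapiro, mannwhitneyu

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Value': [1, 1, 100, 2, 2, 200]
})
group_a = df.loc[df['Group'] == 'A', 'Value']
group_b = df.loc[df['Group'] == 'B', 'Value']

# Shapiro-Wilk: a low p-value indicates departure from normality
print("Shapiro A:", shapiro(group_a))
print("Shapiro B:", shapiro(group_b))

# Mann-Whitney U compares the groups without assuming normality
u_stat, p_value = mannwhitneyu(group_a, group_b)
print("Mann-Whitney U:", u_stat, "P-value:", p_value)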

03. Effective Usage

3.1 Recommended Practices

  • Use Pandas to preprocess and group data before applying statistical tests.

Example: Comprehensive Statistical Analysis

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, f_oneway, chi2_contingency
from statsmodels.formula.api import ols

# Create a DataFrame
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 12, 11, 15, 14, 16, 20, 22, 21],
    'Category': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X']
})

# T-Test (A vs B)
group_a = df[df['Group'] == 'A']['Value']
group_b = df[df['Group'] == 'B']['Value']
t_stat, t_p = ttest_ind(group_a, group_b)

# ANOVA (all groups)
groups = [df[df['Group'] == g]['Value'] for g in df['Group'].unique()]
f_stat, f_p = f_oneway(*groups)

# Chi-Square (Group vs Category)
contingency_table = pd.crosstab(df['Group'], df['Category'])
chi2, chi_p, _, _ = chi2_contingency(contingency_table)

# Linear Regression (Value vs Group dummy variables; Group C is the baseline)
df_dummy = pd.get_dummies(df['Group'], prefix='Group').astype(int)  # 0/1 ints (get_dummies may return booleans in newer pandas)
df = pd.concat([df, df_dummy], axis=1)
model = ols('Value ~ Group_A + Group_B', data=df).fit()

print("T-Test (A vs B): T-statistic =", t_stat, ", P-value =", t_p)
print("ANOVA: F-statistic =", f_stat, ", P-value =", f_p)
print("Chi-Square: Statistic =", chi2, ", P-value =", chi_p)
print("\nRegression Summary:\n", model.summary())

Output (abridged):

T-Test (A vs B): T-statistic = -4.898979485566356 , P-value = 0.0080498931
ANOVA: F-statistic = 76.0 , P-value = 5.47624e-05
Chi-Square: Statistic = 0.9 , P-value = 0.6376282

Regression Summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Value   R-squared:                       0.962
...
            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      21.0000      0.577     36.373      0.000      19.587      22.413
Group_A       -10.0000      0.816    -12.247      0.000     -11.998      -8.002
Group_B        -6.0000      0.816     -7.348      0.000      -7.998      -4.002
==============================================================================

Explanation:

  • Pandas - Groups, filters, and creates contingency tables or dummy variables.
  • Combines multiple tests to validate findings (e.g., t-test, ANOVA, chi-square, regression).
  • Checks p-values to assess significance across analyses.

3.2 Practices to Avoid

  • Avoid applying tests without verifying assumptions (e.g., normality, independence).

Example: Ignoring Small Sample Size

import pandas as pd
from scipy.stats import ttest_ind

# Create a DataFrame with small sample
df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Value': [10, 12, 15, 14]
})

# Incorrect: Applying t-test with insufficient data
group_a = df[df['Group'] == 'A']['Value']
group_b = df[df['Group'] == 'B']['Value']
t_stat, p_value = ttest_ind(group_a, group_b)

print("T-statistic:", t_stat)
print("P-value:", p_value)
print("\nAssuming reliable results with small sample")

Output:

T-statistic: -3.1304952
P-value: 0.0886773

Assuming reliable results with small sample

Explanation:

  • Small sample size (n=2 per group) gives the test little power: the group means clearly differ, yet p > 0.05 here.
  • Solution: Ensure a sufficient sample size or use non-parametric tests for small datasets (see the power-analysis sketch below).
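
One way to plan an adequate sample size is a power analysis. The sketch below uses statsmodels' TTestIndPower and assumes a "large" standardized effect (Cohen's d = 0.8), the conventional 5% significance level, and 80% power:

from statsmodels.stats.power import TTestIndPower

# Observations needed per group to detect d = 0.8 at alpha = 0.05 with 80% power
n_required = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print("Required n per group:", n_required)  # roughly 26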

04. Common Use Cases in Data Analysis

4.1 Hypothesis Testing for Group Differences

Test whether groups differ significantly in a continuous variable.

Example: Comparing Group Means

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Create a DataFrame
df = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'Sales': [1000, 1200, 1100, 1300, 1400, 1350]
})

# Extract groups
north = df[df['Region'] == 'North']['Sales']
south = df[df['Region'] == 'South']['Sales']

# Perform t-test
t_stat, p_value = ttest_ind(north, south)

print("T-statistic:", t_stat)
print("P-value:", p_value)

Output:

T-statistic: -3.872983346207417
P-value: 0.0179479

Explanation:

  • Pandas - Filters Sales by Region.
  • ttest_ind - Indicates significant difference in sales between regions (p < 0.05).

4.2 Testing Categorical Associations

Assess relationships between categorical variables.

Example: Chi-Square for Independence

import pandas as pd
from scipy.stats import chi2_contingency

# Create a DataFrame
df = pd.DataFrame({
    'Age_Group': ['Young', 'Young', 'Old', 'Old', 'Young', 'Old'],
    'Purchase': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No']
})

# Create contingency table
contingency_table = pd.crosstab(df['Age_Group'], df['Purchase'])

# Perform chi-square test (correction=False disables Yates' continuity
# correction for this 2x2 table)
chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

print("Contingency table:\n", contingency_table)
print("\nChi-square statistic:", chi2)
print("P-value:", p_value)

Output:

Contingency table:
Purchase    No  Yes
Age_Group          
Old          2    1
Young        1    2

Chi-square statistic: 0.6666666666666666
P-value: 0.4142161783088176

Explanation:

  • Pandas - Generates contingency table with crosstab.
  • chi2_contingency - Tests independence (p > 0.05 suggests no significant association); an effect-size sketch follows below.
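
Beyond the p-value, an effect-size measure such as Cramér's V (computed here by hand from the chi-square statistic; it is not a built-in SciPy function) indicates how strong an association is:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'Age_Group': ['Young', 'Young', 'Old', 'Old', 'Young', 'Old'],
    'Purchase': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No']
})
table = pd.crosstab(df['Age_Group'], df['Purchase'])
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print("Cramer's V:", cramers_v)  # about 0.33 here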

Conclusion

Pandas, combined with SciPy and Statsmodels, provides a powerful framework for statistical testing, leveraging NumPy Array Operations for efficient data preparation. Key takeaways:

  • Use Pandas to preprocess and group data for tests like t-tests, ANOVA, chi-square, KS, and regression.
  • Integrate with SciPy/Statsmodels for robust statistical computations.
  • Verify test assumptions (e.g., normality, sample size) to ensure valid results.
  • Apply in hypothesis testing and categorical analysis to drive data-driven insights.

With Pandas and statistical libraries, you can rigorously test hypotheses, enhancing data analysis and modeling workflows!
