Pandas: Regular Expressions

Regular expressions (regex) are powerful tools for pattern matching and text manipulation, essential for extracting structured information from unstructured data. Built on NumPy Array Operations and integrated with Pandas’ str accessor, regex methods enable efficient processing of text in Series and DataFrames. This guide explores Pandas regular expressions, covering key techniques, optimization strategies, and applications in machine learning and data preprocessing.


01. Why Use Regular Expressions in Pandas?

Text data often contains complex patterns, such as email addresses, phone numbers, or inconsistent formats, that require advanced parsing. Pandas’ regex methods, leveraging Python’s re module and NumPy’s vectorized operations, allow for fast, scalable pattern matching, extraction, and replacement. These capabilities are critical for cleaning datasets, extracting features for natural language processing (NLP), or standardizing data for machine learning.

Example: Basic Regex Extraction

import pandas as pd

# Create a DataFrame with text data
df = pd.DataFrame({
    'Contact': ['alice@company.com', 'bob.jones@other.org', 'charlie: 123-456-7890', 'david@firm.com'],
    'Code': ['ID-001', 'ID-002', 'NA', 'ID-003']
})

# Extract email domains using regex
df['Domain'] = df['Contact'].str.extract(r'@([\w.]+)')

print("DataFrame with extracted domains:\n", df)

Output:

DataFrame with extracted domains:
                  Contact    Code       Domain
0      alice@company.com  ID-001  company.com
1    bob.jones@other.org  ID-002    other.org
2  charlie: 123-456-7890      NA          NaN
3         david@firm.com  ID-003     firm.com

Explanation:

  • str.extract() - Captures regex groups (e.g., email domains after @).
  • Regex pattern @([\w.]+) - Matches characters and dots after @.
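Beyond a single group, str.extract() returns one column per capture group, and named groups become readable column names, which is convenient for feature engineering. A minimal sketch (the group names user and domain are illustrative):

```python
import pandas as pd

contacts = pd.Series(['alice@company.com', 'david@firm.com'])

# Named capture groups become the column names of the result
parts = contacts.str.extract(r'(?P<user>[\w.]+)@(?P<domain>[\w.]+)')
print(parts)
```

The result is a DataFrame with 'user' and 'domain' columns, ready to join back onto the original frame.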

02. Key Regex Methods

Pandas’ str accessor supports regex operations for pattern matching, extraction, and replacement, optimized for vectorized performance. The table below summarizes key regex methods and their machine learning applications:

Task              Method                   ML Use Case
Pattern Matching  contains()               Filter rows by text patterns
Extraction        extract(), extractall()  Create features from text
Replacement       replace()                Standardize text formats
Splitting         split()                  Parse structured text
Counting          count()                  Quantify pattern occurrences


2.1 Pattern Matching with contains()

Example: Filtering by Pattern

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Description': ['Product A: 2023-01', 'Item B: 2022-12', 'Service C', 'Product D: 2023-02'],
    'Price': [100, 150, 200, 120]
})

# Filter rows with date patterns
df_filtered = df[df['Description'].str.contains(r'\d{4}-\d{2}', na=False)]

print("Filtered DataFrame:\n", df_filtered)

Output:

Filtered DataFrame:
          Description  Price
0  Product A: 2023-01    100
1     Item B: 2022-12    150
3  Product D: 2023-02    120

Explanation:

  • str.contains() - Returns a boolean mask for strings matching the regex (e.g., \d{4}-\d{2} for YYYY-MM).
  • na=False - Treats missing values as non-matches, so the result stays a usable boolean mask.
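str.contains() also accepts case and flags arguments, so matching can be made case-insensitive without changing the pattern. A small sketch:

```python
import re
import pandas as pd

s = pd.Series(['Product A', 'product B', 'Service C', None])

# case=False ignores letter case; flags=re.IGNORECASE is equivalent
mask = s.str.contains(r'product', case=False, na=False)
print(mask.tolist())  # [True, True, False, False]
```

Passing flags=re.IGNORECASE instead of case=False gives the same result and generalizes to other re flags.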

2.2 Extracting Patterns

Example: Extracting Phone Numbers

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Contact': ['Call: 123-456-7890', 'Phone: 987-654-3210', 'Email: bob@company.com', 'Tel: 555-123-4567'],
    'Status': ['Active', 'Inactive', 'Active', 'Pending']
})

# Extract phone numbers
df['Phone'] = df['Contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')

print("DataFrame with extracted phone numbers:\n", df)

Output:

DataFrame with extracted phone numbers:
                  Contact    Status         Phone
0      Call: 123-456-7890    Active  123-456-7890
1     Phone: 987-654-3210  Inactive  987-654-3210
2  Email: bob@company.com    Active           NaN
3       Tel: 555-123-4567   Pending  555-123-4567

Explanation:

  • str.extract() - Captures the first match of the regex pattern (e.g., \d{3}-\d{3}-\d{4} for phone numbers).
  • Non-matching rows return NaN.
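extract() stops at the first match per string; extractall(), listed in the table above, returns every match as a MultiIndexed DataFrame (one row per match). A brief sketch with a string containing two phone numbers:

```python
import pandas as pd

s = pd.Series(['123-456-7890 or 555-123-4567', 'no phone here'])

# Each match becomes its own row, indexed by (original row, match number)
all_phones = s.str.extractall(r'(\d{3}-\d{3}-\d{4})')
print(all_phones)
```

Rows with no match (like 'no phone here') simply contribute no rows to the result.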

2.3 Replacing Patterns

Example: Standardizing Formats

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Address': ['123 Main St.', '456 Elm Street', '789 Pine St', '101 Oak Ave.'],
    'Code': ['A-001', 'B_002', 'C-003', 'D_004']
})

# Standardize formats using regex
df_cleaned = df.copy()
df_cleaned['Address'] = df_cleaned['Address'].str.replace(r'Street|St\.?', 'Street', regex=True)
df_cleaned['Code'] = df_cleaned['Code'].str.replace(r'[-_]', '', regex=True)

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
           Address  Code
0  123 Main Street  A001
1   456 Elm Street  B002
2  789 Pine Street  C003
3     101 Oak Ave.  D004

Explanation:

  • str.replace() - Replaces regex matches; put longer alternatives (Street) before shorter ones (St\.?) so the abbreviation cannot partially match inside the word Street.
  • Supports complex patterns like [-_] to remove hyphens or underscores.
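str.replace() also supports backreferences to capture groups in the replacement string, which is useful for reordering parts of a match rather than just deleting or swapping it. A small sketch that reorders YYYY-MM dates:

```python
import pandas as pd

dates = pd.Series(['2023-01', '2022-12'])

# \1 and \2 refer back to the first and second capture groups
reordered = dates.str.replace(r'(\d{4})-(\d{2})', r'\2/\1', regex=True)
print(reordered.tolist())  # ['01/2023', '12/2022']
```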

2.4 Splitting with Regex

Example: Splitting Text by Delimiters

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Data': ['tag1,tag2;tag3', 'tag4:tag5', 'tag6', 'tag7,tag8'],
    'Value': [10, 20, 30, 40]
})

# Split tags using regex
df['Tags'] = df['Data'].str.split(r'[,;:]')

print("DataFrame with split tags:\n", df)

Output:

DataFrame with split tags:
             Data  Value                Tags
0  tag1,tag2;tag3     10  [tag1, tag2, tag3]
1       tag4:tag5     20        [tag4, tag5]
2            tag6     30              [tag6]
3       tag7,tag8     40        [tag7, tag8]

Explanation:

  • str.split() - Splits strings on regex patterns (e.g., [,;:] for commas, semicolons, or colons).
  • Returns lists of substrings for flexible processing.
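When downstream code expects columns rather than lists, str.split() can spread the pieces with expand=True; shorter rows are padded with missing values. A quick sketch (the explicit regex=True keyword requires pandas 1.4+; older versions treat multi-character patterns as regex by default):

```python
import pandas as pd

s = pd.Series(['tag1,tag2;tag3', 'tag4:tag5'])

# expand=True turns the split pieces into separate DataFrame columns
cols = s.str.split(r'[,;:]', regex=True, expand=True)
print(cols)
```

The second row has only two tags, so its third column is filled with a missing value.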

2.5 Counting Pattern Occurrences

Example: Counting Digits

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Text': ['Order123', 'Item45', 'Product6789', 'NoDigits'],
    'Category': ['A', 'B', 'A', 'C']
})

# Count digits in text
df['Digit_Count'] = df['Text'].str.count(r'\d')

print("DataFrame with digit counts:\n", df)

Output:

DataFrame with digit counts:
          Text Category  Digit_Count
0     Order123        A            3
1       Item45        B            2
2  Product6789        A            4
3     NoDigits        C            0

Explanation:

  • str.count() - Counts occurrences of the regex pattern (e.g., \d for digits).
  • Useful for feature engineering in NLP.
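Where count() only tallies matches, str.findall() returns the matching substrings themselves as a list per row, which is handy when the matched values (not just their number) are the feature. A quick sketch:

```python
import pandas as pd

s = pd.Series(['Order123', 'NoDigits'])

# findall returns a list of every match in each string
digits = s.str.findall(r'\d+')
print(digits.tolist())  # [['123'], []]
```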

2.6 Incorrect Regex Usage

Example: Invalid Regex Pattern

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Email': ['alice@company.com', 'bob@other.org', None, 'david@firm.com'],
    'Score': [85, 90, 95, 100]
})

# Incorrect: Unclosed bracket in regex
try:
    df['Domain'] = df['Email'].str.extract(r'@[\w.')
    print("DataFrame with extracted domains:\n", df)
except Exception as e:
    print("Error:", e)

Output:

Error: unterminated character set at position 1

Explanation:

  • An invalid regex pattern (e.g., an unterminated character set) raises re.error before any rows are processed.
  • Solution: Validate patterns first, and remember that extract() requires at least one capture group (e.g., @([\w.]+)); handle nulls with na=False where applicable.
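One defensive pattern is to compile the regex with Python's re module before applying it to a Series, so a malformed pattern fails fast with a clear message instead of mid-pipeline. A sketch under that assumption:

```python
import re
import pandas as pd

s = pd.Series(['alice@company.com', 'bob@other.org'])
pattern = r'@([\w.]+)'  # valid pattern with the required capture group

# re.compile raises re.error for malformed patterns before any data is touched
try:
    re.compile(pattern)
except re.error as exc:
    raise ValueError(f"Invalid regex pattern: {exc}")

domains = s.str.extract(pattern)
print(domains[0].tolist())  # ['company.com', 'other.org']
```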

03. Effective Usage

3.1 Recommended Practices

  • Handle null values before applying regex methods to avoid errors.

Example: Comprehensive Regex Processing

import pandas as pd
import numpy as np

# Create a large DataFrame with text data
df = pd.DataFrame({
    'ID': np.arange(1000, dtype='int32'),
    'Contact': pd.Series(['alice@company.com', 'bob: 123-456-7890', None, 'david@firm.com'] * 250),
    'Code': pd.Series(['A-001', 'B_002', 'C-003', 'D_004'] * 250)
})

# Clean and extract with regex
df_cleaned = df.copy()
df_cleaned['Contact'] = df_cleaned['Contact'].fillna('Unknown')
df_cleaned['Is_Email'] = df_cleaned['Contact'].str.contains(r'@[\w.]+', na=False)
df_cleaned['Phone'] = df_cleaned['Contact'].str.extract(r'(\d{3}-\d{3}-\d{4})')
df_cleaned['Code'] = df_cleaned['Code'].str.replace(r'[-_]', '', regex=True)

print("Cleaned DataFrame head:\n", df_cleaned.head())
print("Memory usage (bytes):\n", df_cleaned.memory_usage(deep=True).sum())

Output:

Cleaned DataFrame head:
   ID            Contact  Code  Is_Email         Phone
0   0  alice@company.com  A001      True           NaN
1   1  bob: 123-456-7890  B002     False  123-456-7890
2   2            Unknown  C003     False           NaN
3   3     david@firm.com  D004      True           NaN
4   4  alice@company.com  A001      True           NaN
Memory usage (bytes):
 324000
Explanation:

  • fillna() - Prevents errors with null values before applying string methods.
  • Use na=False in contains() for robust filtering.
  • Combine regex methods for feature extraction and cleaning in one pass.

3.2 Practices to Avoid

  • Avoid complex regex patterns without testing, as they can be error-prone.

Example: Overly Complex Regex

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Text': ['ID: A-001', 'ID: B-002', 'ID: C-003'],
    'Value': [10, 20, 30]
})

# Incorrect: Overly complex and untested regex
try:
    df['Code'] = df['Text'].str.extract(r'(?P<code-1>ID: [A-Z]-[0-9]{3})')
    print("DataFrame with extracted codes:\n", df)
except Exception as e:
    print("Error:", e)

Output:

Error: bad character in group name 'code-1' at position 4

Explanation:

  • Complex regex patterns can lead to syntax errors (here, a hyphen in a group name) or unexpected results.
  • Solution: Test patterns with tools like regex101.com and simplify where possible.

04. Common Use Cases in Machine Learning

4.1 Feature Extraction for NLP

Extract structured features from text for NLP or modeling.

Example: Extracting Product Codes

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Description': ['Product A-123', 'Item B-456', 'Service C', 'Product D-789'],
    'Price': [100, 150, 200, 120]
})

# Extract product codes
df['Code'] = df['Description'].str.extract(r'([A-Z]-[0-9]{3})')
df['Is_Product'] = df['Description'].str.contains(r'Product|Item', na=False)

print("DataFrame with extracted features:\n", df)

Output:

DataFrame with extracted features:
     Description  Price   Code  Is_Product
0  Product A-123    100  A-123        True
1     Item B-456    150  B-456        True
2      Service C    200    NaN       False
3  Product D-789    120  D-789        True

Explanation:

  • str.extract() - Creates features like product codes.
  • str.contains() - Generates binary features for classification.

4.2 Data Cleaning for Consistency

Standardize text formats to ensure data quality.

Example: Cleaning Inconsistent IDs

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'ID': ['A_001', 'B-002', 'C_003', 'D-004'],
    'Score': [85, 90, 95, 100]
})

# Standardize ID formats
df_cleaned = df.copy()
df_cleaned['ID'] = df_cleaned['ID'].str.replace(r'[_-]', '', regex=True)

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
      ID  Score
0  A001     85
1  B002     90
2  C003     95
3  D004    100

Explanation:

  • str.replace() - Removes inconsistent delimiters for uniform IDs.
  • Ensures consistency for merging or grouping operations.

Conclusion

Pandas’ regular expression methods, integrated with NumPy Array Operations and the str accessor, provide robust tools for pattern matching, extraction, and cleaning of text data. Key takeaways:

  • Use contains(), extract(), and replace() for pattern-based text processing.
  • Handle null values with fillna() or na=False to avoid errors.
  • Test and simplify regex patterns to ensure reliability.
  • Apply in feature extraction and data cleaning for machine learning workflows.

With Pandas regex, you can efficiently transform and extract insights from text data, enhancing data quality for analytics and modeling!
