Pandas: Removing Duplicates

Duplicate rows in a dataset can skew analysis, bias machine learning models, and inflate results. Built on fast NumPy array operations, Pandas provides the drop_duplicates() method to identify and remove duplicate rows efficiently. This guide covers removing duplicates in Pandas, including core techniques, performance considerations, and applications in machine learning workflows.


01. Why Remove Duplicates?

Duplicates can arise from data collection errors, merges, or redundant entries, leading to inaccurate statistics or overfitting in machine learning. The drop_duplicates() method in Pandas allows you to eliminate duplicate rows based on all or specific columns, ensuring clean, representative datasets. Its integration with NumPy ensures fast processing, making it ideal for preprocessing large datasets.
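
Before dropping anything, it is often worth measuring the problem. A quick sketch using the companion duplicated() method, which flags every occurrence of a row after its first:

import pandas as pd

# Create a small DataFrame with one repeated row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# duplicated() returns a boolean Series: True for each repeat after the first
print(df.duplicated())

# Summing the flags counts the rows drop_duplicates() would remove
print("Duplicate rows:", df.duplicated().sum())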

Example: Basic Duplicate Removal

import pandas as pd

# Create a DataFrame with duplicates
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'Salary': [50000, 60000, 50000, 70000, 60000]
})

# Remove duplicates
df_cleaned = df.drop_duplicates()

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
       Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
3  Charlie   35   70000

Explanation:

  • drop_duplicates() - Removes rows with identical values across all columns.
  • By default, keeps the first occurrence of each unique row.
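
Note that the surviving rows keep their original index labels (0, 1, 3 above). If a contiguous index is preferred, recent Pandas versions (1.0+) accept ignore_index=True:

# Renumber the surviving rows from 0 instead of keeping the original labels
df_cleaned = df.drop_duplicates(ignore_index=True)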

02. Key Techniques for Removing Duplicates

The drop_duplicates() method offers flexible options: deduplicating on all or specific columns, choosing which occurrence to keep, and scaling to large datasets. The table below summarizes the key techniques and their machine learning applications:

Technique                  Description                                            ML Use Case
Full Row Duplicates        Removes rows identical across all columns              Ensure unique training samples
Subset Duplicates          Removes duplicates based on specific columns           Maintain unique keys (e.g., IDs)
Keep Strategy              Chooses which duplicate to retain (first, last, none)  Preserve most recent data
Performance Optimization   Handles large datasets efficiently                     Scale to big data preprocessing


2.1 Removing Full Row Duplicates

Example: Dropping All Duplicates

import pandas as pd

# Create a DataFrame with duplicates
df = pd.DataFrame({
    'ID': [1, 2, 1, 3, 2],
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Score': [85, 90, 85, 95, 90]
})

# Remove duplicates across all columns
df_cleaned = df.drop_duplicates()

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
    ID     Name  Score
0   1    Alice     85
1   2      Bob     90
3   3  Charlie     95

Explanation:

  • drop_duplicates() - Considers all columns for duplicate checks.
  • Keeps the first occurrence by default.

2.2 Removing Duplicates Based on Subset

Example: Dropping Duplicates by Specific Columns

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 1, 3, 2],
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Score': [85, 90, 88, 95, 92]
})

# Remove duplicates based on 'ID'
df_cleaned_id = df.drop_duplicates(subset=['ID'])

# Remove duplicates based on 'Name' and 'Score'
df_cleaned_name_score = df.drop_duplicates(subset=['Name', 'Score'])

print("Cleaned by ID:\n", df_cleaned_id)
print("Cleaned by Name and Score:\n", df_cleaned_name_score)

Output:

Cleaned by ID:
    ID     Name  Score
0   1    Alice     85
1   2      Bob     90
3   3  Charlie     95
Cleaned by Name and Score:
    ID     Name  Score
0   1    Alice     85
1   2      Bob     90
2   1    Alice     88
3   3  Charlie     95
4   2      Bob     92

Explanation:

  • subset - Specifies columns to check for duplicates.
  • Useful for ensuring unique keys or combinations.

2.3 Choosing Which Duplicate to Keep

Example: Keep First, Last, or None

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 1, 3, 2],
    'Score': [85, 90, 88, 95, 90],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
})

# Keep first occurrence
df_first = df.drop_duplicates(subset=['ID'], keep='first')

# Keep last occurrence
df_last = df.drop_duplicates(subset=['ID'], keep='last')

# Keep no duplicates
df_none = df.drop_duplicates(subset=['ID'], keep=False)

print("Keep first:\n", df_first)
print("Keep last:\n", df_last)
print("Keep none:\n", df_none)

Output:

Keep first:
   ID  Score        Date
0   1     85  2023-01-01
1   2     90  2023-01-02
3   3     95  2023-01-04
Keep last:
   ID  Score        Date
2   1     88  2023-01-03
3   3     95  2023-01-04
4   2     90  2023-01-05
Keep none:
   ID  Score        Date
3   3     95  2023-01-04

Explanation:

  • keep='first' - Retains the first occurrence (default).
  • keep='last' - Retains the last occurrence.
  • keep=False - Removes all duplicates.
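
Because keep='first' and keep='last' are purely positional, sort the data first when "last" should mean "most recent". A short sketch, reusing the DataFrame above, that sorts by Date before deduplicating:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 3, 2],
    'Score': [85, 90, 88, 95, 90],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
})

# Sort chronologically so keep='last' reliably retains the newest row per ID
df_recent = df.sort_values('Date').drop_duplicates(subset=['ID'], keep='last')
print(df_recent)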

2.4 Handling Large Datasets

Example: Removing Duplicates in Large Data

import pandas as pd
import numpy as np

# Create a large DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'ID': np.repeat(np.arange(500), 2),  # Duplicates in ID
    'Value': np.random.randn(1000).astype('float32'),
    'Category': np.random.choice(['A', 'B', 'C'], 1000)
})

# Remove duplicates based on 'ID'
df_cleaned = df.drop_duplicates(subset=['ID'], keep='first')

print("Original shape:", df.shape)
print("Cleaned shape:", df_cleaned.shape)
print("Cleaned DataFrame head:\n", df_cleaned.head())

Output:

Original shape: (1000, 3)
Cleaned shape: (500, 3)
Cleaned DataFrame head:
   ID     Value Category
0   0  0.496714        B
2   1  0.647689        C
4   2 -0.234153        A
6   3  1.579213        B
8   4 -0.469474        C

Explanation:

  • Efficiently handles large datasets using NumPy-based operations.
  • float32 dtype reduces memory usage.
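
For data too large to load in one pass, the same idea can be applied chunk by chunk. A minimal sketch, assuming a hypothetical large.csv file with an 'ID' column:

import pandas as pd

chunks = []
for chunk in pd.read_csv('large.csv', chunksize=100_000):
    # Deduplicate within each chunk first to limit what accumulates in memory
    chunks.append(chunk.drop_duplicates(subset=['ID']))

# A final pass catches duplicates that straddle chunk boundaries
df_cleaned = pd.concat(chunks, ignore_index=True).drop_duplicates(subset=['ID'])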

2.5 Incorrect Usage

Example: Incorrect Subset Selection

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 1, 3],
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Score': [85, 90, 88, 95]
})

# Incorrect: Remove duplicates without specifying subset
df_wrong = df.drop_duplicates()  # Considers all columns

print("Incorrectly cleaned DataFrame:\n", df_wrong)

Output:

Incorrectly cleaned DataFrame:
   ID     Name  Score
0   1   Alice     85
1   2     Bob     90
2   1   Alice     88
3   3  Charlie     95

Explanation:

  • Without subset=['ID'], both ID-1 rows survive because their scores differ, so the duplicate ID remains.
  • Solution: Use subset to target key columns, as shown in the corrected sketch below.
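
For comparison, a corrected version keyed on ID, continuing from the DataFrame above:

# Correct: deduplicate on the key column so both ID-1 rows collapse to one
df_right = df.drop_duplicates(subset=['ID'], keep='first')
print("Correctly cleaned DataFrame:\n", df_right)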

03. Effective Usage

3.1 Best Practices

  • Specify subset to focus on relevant columns.

Example: Optimized Duplicate Removal

import pandas as pd
import numpy as np

# Create a large DataFrame
df = pd.DataFrame({
    'ID': np.repeat(np.arange(1000), 2),
    'Category': pd.Series(np.random.choice(['A', 'B', 'C'], 2000), dtype='category'),
    'Value': np.random.randn(2000).astype('float32'),
    'Date': pd.date_range('2023-01-01', periods=2000, freq='D')
})

# Remove duplicates based on 'ID', keeping last
df_cleaned = df.drop_duplicates(subset=['ID'], keep='last')

print("Original shape:", df.shape)
print("Cleaned shape:", df_cleaned.shape)
print("Cleaned DataFrame head:\n", df_cleaned.head())
print("Memory usage (bytes):\n", df_cleaned.memory_usage(deep=True).sum())

Output:

Original shape: (2000, 4)
Cleaned shape: (1000, 4)
Cleaned DataFrame head:
     ID Category     Value       Date
1     0        B -0.789012 2023-01-02
3     1        A  1.234567 2023-01-04
5     2        C -0.345678 2023-01-06
7     3        B  0.567890 2023-01-08
9     4        A  0.123456 2023-01-10
Memory usage (bytes):
 39128

Explanation:

  • subset=['ID'] - Ensures unique IDs.
  • keep='last' - Retains the most recent entry.
  • category and float32 dtypes optimize memory.
  • Without a fixed random seed, the Value and Category entries shown will differ across runs.
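
After deduplicating on a key, a cheap sanity check protects downstream code. Continuing from the example above:

# Verify the deduplicated key really is unique before relying on it
assert df_cleaned['ID'].is_unique, "Deduplication left repeated IDs"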

3.2 Practices to Avoid

  • Avoid removing duplicates without verifying column relevance.

Example: Overly Broad Duplicate Removal

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 1, 3],
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Score': [85, 90, 85, 95]
})

# Incorrect: Remove duplicates across all columns
df_wrong = df.drop_duplicates()

print("Incorrectly cleaned DataFrame:\n", df_wrong)

Output:

Incorrectly cleaned DataFrame:
    ID     Name  Score
0   1    Alice     85
1   2      Bob     90
3   3  Charlie     95

Explanation:

  • Here the two ID-1 rows happened to be fully identical, so one was dropped, but full-row deduplication misses ID duplicates whose other fields differ.
  • Solution: Use subset=['ID'] for targeted deduplication.

04. Common Use Cases in Machine Learning

4.1 Ensuring Unique Training Samples

Remove duplicate rows to prevent overfitting in model training.

Example: Cleaning Training Data

import pandas as pd
import numpy as np

# Create a DataFrame with duplicates
df = pd.DataFrame({
    'Feature1': [1.0, 2.0, 1.0, 3.0],
    'Feature2': [4.0, 5.0, 4.0, 6.0],
    'Target': [0, 1, 0, 1]
})

# Remove duplicates
df_cleaned = df.drop_duplicates()

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
   Feature1  Feature2  Target
0      1.0      4.0       0
1      2.0      5.0       1
3      3.0      6.0       1

Explanation:

  • Eliminates duplicate feature-target pairs.
  • Ensures unbiased model training.
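
A related check in machine learning pipelines is train/test leakage: identical feature rows appearing in both splits. A small sketch with toy frames (hypothetical values), using duplicated() on a concatenated DataFrame:

import pandas as pd

train = pd.DataFrame({'Feature1': [1.0, 2.0], 'Feature2': [4.0, 5.0]})
test = pd.DataFrame({'Feature1': [2.0, 3.0], 'Feature2': [5.0, 6.0]})

# Stack both splits; keep=False flags every member of a duplicated group
combined = pd.concat([train, test], keys=['train', 'test'])
overlap = combined[combined.duplicated(keep=False)].loc['test']
print("Test rows also present in train:\n", overlap)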

4.2 Deduplicating Prediction Data

Ensure unique IDs in prediction datasets for accurate evaluation.

Example: Cleaning Prediction Data

import pandas as pd

# Create a DataFrame with duplicate IDs
df = pd.DataFrame({
    'ID': [1, 2, 1, 3],
    'Prediction': [0.75, 0.80, 0.78, 0.95],
    'Actual': [1, 0, 1, 1]
})

# Remove duplicates based on 'ID', keeping last
df_cleaned = df.drop_duplicates(subset=['ID'], keep='last')

print("Cleaned DataFrame:\n", df_cleaned)

Output:

Cleaned DataFrame:
   ID  Prediction  Actual
1   2       0.80       0
2   1       0.78       1
3   3       0.95       1

Explanation:

  • subset=['ID'] - Ensures unique IDs.
  • keep='last' - Retains the latest prediction.
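
An alternative formulation uses groupby() to take the last entry per ID. Note that groupby sorts by the key, and last() takes the last non-null value per column, which matches keep='last' only when there are no missing values:

# Equivalent here: group by ID and take the last entry of each group
df_alt = df.groupby('ID', as_index=False).last()
print(df_alt)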

Conclusion

Pandas’ drop_duplicates() method, backed by fast NumPy array operations, removes duplicate rows to ensure clean, unbiased datasets for machine learning and analysis. Key takeaways:

  • Use subset to target specific columns for deduplication.
  • Choose an appropriate keep strategy ('first', 'last', or False).
  • Optimize memory with compact dtypes for large datasets.
  • Avoid broad deduplication without specifying relevant columns.

With Pandas, you can effectively remove duplicates to prepare high-quality datasets for machine learning and analytics!
