Day 4: Master Data Cleaning in 4 Simple Steps: A Python Guide

Data cleaning is one of the most crucial tasks for data scientists. Let's break it down into four manageable steps to help you clean your datasets like a pro!

Sun Jan 5, 2025

Step 1: Handle Duplicates 🧹

Duplicate data can skew your analysis and lead to inaccurate conclusions. Here’s how you can check for and remove duplicates in your dataset:


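# Count fully duplicated rows before removing them
print(df.duplicated().sum())

# Remove duplicate rows (drop_duplicates returns a new DataFrame, so reassign it)
df = df.drop_duplicates()

# Check value counts for a specific column to spot repeated entries
print(df['column'].value_counts())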

Pro Tip: Look out for near-duplicates with slight variations in formatting, like case sensitivity or extra spaces!
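
Here is a minimal sketch of that tip: normalize the text first, then deduplicate on the normalized values. The sample data and the `name` column are made up for illustration, so adapt the normalization to whatever variations your own data contains.

```python
import pandas as pd

# Hypothetical sample data: the same two names with different case and spacing
df = pd.DataFrame({"name": ["Alice", "alice ", " ALICE", "Bob", "bob"]})

# Build a normalized key: trim surrounding whitespace and lowercase
normalized = df["name"].str.strip().str.lower()

# Keep only the first row seen for each normalized key
deduped = df[~normalized.duplicated()]

print(deduped)
```

Deduplicating on a separate key keeps the original spelling of the surviving rows, which is handy if you still want to review them by hand.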

Step 2: Handle Missing Values 🕵️‍♀️

Missing values are another common challenge. Rather than dropping the affected rows outright, consider these techniques:

Fill with the mean or median for numerical columns
Use the previous or next value to fill missing data
Create a ‘Missing’ category for categorical data

For example, here's how you can fill missing values with the median:


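# Fill missing numerical values with the median
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())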
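
The other two techniques from the list can be sketched the same way. This is only an illustration: the sample data and the `temperature` and `city` columns are assumptions, and whether forward or backward filling makes sense depends on your data being ordered (for example, by time).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "temperature": [20.0, np.nan, 22.0, np.nan],
    "city": ["Paris", None, "Rome", None],
})

# Forward fill: carry the previous value into each gap
df["temp_ffill"] = df["temperature"].ffill()

# Backward fill: pull the next value into each gap (a trailing gap stays NaN)
df["temp_bfill"] = df["temperature"].bfill()

# Label missing categorical values explicitly
df["city"] = df["city"].fillna("Missing")

print(df)
```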

Step 3: Standardize Your Data 📝

Inconsistent data can hinder your analysis. Standardizing the data ensures that everything is in the same format. Here’s how:

Convert text to lowercase
Remove special characters from strings
Fix any spelling mistakes
Ensure consistent formatting (e.g., date formats)

Example for standardizing column names and text:


# Standardizing column names: converting to lowercase and replacing spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Standardizing string columns: converting to lowercase and removing special characters
# (regex=True is required in pandas 2.0+, where str.replace treats the pattern as literal text by default)
df['column'] = df['column'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
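
Date formats deserve their own quick example. The sketch below assumes a made-up `order_date` column stored as day/month/year text and rewrites it in ISO 8601 form; swap in the format string that matches your own data.

```python
import pandas as pd

# Hypothetical sample: dates stored as day/month/year text, one entry invalid
df = pd.DataFrame({"order_date": ["05/01/2025", "17/03/2024", "not a date"]})

# Parse with an explicit format; values that do not match become NaT
parsed = pd.to_datetime(df["order_date"], format="%d/%m/%Y", errors="coerce")

# Write the dates back out as YYYY-MM-DD text (unparsed entries stay missing)
df["order_date_iso"] = parsed.dt.strftime("%Y-%m-%d")

print(df)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates, and `errors="coerce"` turns unparseable entries into missing values you can handle as in Step 2.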

Step 4: Check for Outliers ⚠️

Outliers can heavily influence your data analysis. Use these methods to detect and manage outliers:

Visualize outliers with box plots
Use the Z-score method to identify outliers in numerical columns
Apply the IQR (Interquartile Range) method to detect outliers

Here’s an example of detecting outliers using the Z-score and IQR methods:


# Using the Z-score method (assumes missing values were already handled in Step 2)
import numpy as np
from scipy import stats
z_scores = stats.zscore(df['column'])
df_clean = df[np.abs(z_scores) < 3]  # Keep rows whose absolute Z-score is below 3

# Using IQR method to detect outliers
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))]
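
Detecting outliers is only half the job, so here is a rough sketch of two ways to manage them once the IQR fences are known: dropping the rows, or capping the values at the fences. The sample data is made up, and capping is just one possible choice, not necessarily the right one for your analysis.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier
df = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 100]})

Q1 = df["column"].quantile(0.25)
Q3 = df["column"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: drop the rows that fall outside the IQR fences
df_no_outliers = df[df["column"].between(lower, upper)]

# Option 2: cap values at the fences instead of dropping rows
df["column_capped"] = df["column"].clip(lower=lower, upper=upper)

print(df_no_outliers)
print(df)
```

One caveat worth knowing: in a sample this small, the Z-score method with a threshold of 3 would not flag the value 100 at all, because a single extreme value also inflates the standard deviation. The IQR method is less sensitive to that effect.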

Final Check: Ensure Your Data is Clean ✅

After completing the above steps, run a final check to ensure your data is clean and ready for analysis:


# Final validation check
df.info()  # Dataframe summary (prints directly)
print(df.describe())  # Summary statistics
print(df.isnull().sum())  # Check for remaining missing values
print(df.duplicated().sum())  # Check for remaining duplicate rows
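
If you run these checks often, you could wrap them in a small helper. The function below is a hypothetical sketch (the name `cleaning_report` and its keys are my own, not part of pandas) that returns the counts as a dictionary, so you can assert on them in a script.

```python
import pandas as pd

def cleaning_report(df: pd.DataFrame) -> dict:
    """Summarize leftover problems; the name and keys are illustrative."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }

# Hypothetical sample with one duplicate row and one missing value
sample = pd.DataFrame({"id": [1, 2, 2, 3], "score": [0.5, 0.7, 0.7, None]})
report = cleaning_report(sample)
print(report)
```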

Once these checks come back clean, your dataset should be in good shape for analysis and modeling!

Hands-On Practice: Explore the Google Colab Notebook

Want to practice these steps? I’ve created a Google Colab notebook where you can apply these data cleaning techniques on a sample dataset. Click the link below to get started:

Open Google Colab Notebook