Step 1: Handle Duplicates 🧹
Duplicate data can skew your analysis and lead to inaccurate conclusions. Here’s how you can check for and remove duplicates in your dataset:
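# Counting fully duplicated rows before removing them
df.duplicated().sum()
# Removing duplicate rows (drop_duplicates returns a new DataFrame, so assign it back)
df = df.drop_duplicates()
# Checking value counts for a specific column
df['column'].value_counts()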
Pro Tip: Look out for near-duplicates with slight variations in formatting, like case sensitivity or extra spaces!
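To make that tip concrete, here's a minimal sketch of catching near-duplicates by comparing on a normalized key. The sample data and the column names (name, city) are made up for illustration, and the normalization rules (trim, collapse spaces, lowercase) are just one reasonable choice:

```python
import pandas as pd

# Hypothetical sample data: one customer entered three times with different casing and spacing
df = pd.DataFrame({
    "name": ["Alice Smith", "alice smith ", "  ALICE SMITH", "Bob Jones"],
    "city": ["Paris", "Paris", "Paris", "Berlin"],
})

# Exact-match deduplication treats every variant as a distinct row
exact = df.drop_duplicates()

# Build a normalized key: trim whitespace, collapse runs of spaces, lowercase
name_key = (
    df["name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)

# Flag rows whose (normalized name, city) pair has already appeared, then keep the rest
is_near_dupe = df.assign(name=name_key).duplicated(subset=["name", "city"])
deduped = df[~is_near_dupe]

print(len(exact), len(deduped))
```

The original spelling of the first occurrence is kept, since the normalized key is only used for the comparison.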
Step 2: Handle Missing Values 🕵️‍♀️
Missing values are another common challenge. Dropping the affected rows throws away information, so consider these techniques first:
- Fill with the mean or median for numerical columns
- Use the previous or next value to fill missing data
- Create a ‘Missing’ category for categorical data
For example, here's how you can fill missing values with the median:
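# Fill missing numerical values with the median
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())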
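The other two techniques from the list can be sketched the same way. This is an illustrative example on made-up data; the column names (reading, segment) are assumptions, not part of any real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "reading": [1.0, np.nan, 3.0, np.nan],
    "segment": ["a", None, "b", None],
})

# Forward fill: carry the previous value down into each gap
df["reading_ffill"] = df["reading"].ffill()

# Backward fill: pull the next value up into each gap
# (the last row stays missing because nothing follows it)
df["reading_bfill"] = df["reading"].bfill()

# Categorical data: make missingness its own explicit category
df["segment"] = df["segment"].fillna("Missing")

print(df)
```

Forward and backward fill only make sense when row order carries meaning, such as time series sorted by timestamp.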
Step 3: Standardize Your Data 📝
Inconsistent data can hinder your analysis. Standardizing the data ensures that everything is in the same format. Here’s how:
- Convert text to lowercase
- Remove special characters from strings
- Fix any spelling mistakes
- Ensure consistent formatting (e.g., date formats)
Example for standardizing column names and text:
# Standardizing column names: converting to lowercase and replacing spaces
df.columns = df.columns.str.lower().str.replace(' ', '_')
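# Standardizing string columns: converting to lowercase and removing special characters
# (regex=True is required; pandas 2.0+ treats the pattern as a literal string by default)
df['column'] = df['column'].str.lower().str.replace(r'[^\w\s]', '', regex=True)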
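The remaining two items on the list, spelling fixes and consistent date formats, might look like this. It's a rough sketch on invented data: the column names (signup_date, country), the date format, and the corrections mapping are all assumptions you'd replace with your own:

```python
import pandas as pd

# Hypothetical sample data with one unparseable date and one misspelled country
df = pd.DataFrame({
    "signup_date": ["2024-01-05", "2024-02-05", "not a date"],
    "country": ["germany", "germny", "france"],
})

# Parse dates with an explicit format; values that do not match become NaT
# instead of raising, so they can be reviewed later
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Fix known misspellings with an explicit mapping (only exact matches are replaced)
corrections = {"germny": "germany"}
df["country"] = df["country"].replace(corrections)

print(df)
```

An explicit mapping is deliberately conservative: it only changes values you've reviewed, which avoids silently "correcting" legitimate entries.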
Step 4: Check for Outliers ⚠️
Outliers can heavily influence your data analysis. Use these methods to detect and manage outliers:
- Visualize outliers with box plots
- Use the Z-score method to identify outliers in numerical columns
- Apply the IQR (Interquartile Range) method to detect outliers
Here’s an example of detecting outliers using the Z-score and IQR methods:
# Using Z-score method
from scipy import stats
z_scores = stats.zscore(df['column'])  # Assumes the column has no missing values
df_clean = df[abs(z_scores) < 3] # Keeping rows whose absolute Z-score is below 3 (catches low and high outliers)
# Using IQR method to detect outliers
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))]
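Detection is only half of the job, so here's one possible way to manage outliers once you've found them: capping values at the IQR fences instead of dropping rows. The sample data is made up, and capping is just one option among several (removal, transformation, or leaving them in when they're genuine):

```python
import pandas as pd

# Hypothetical sample data with one obvious outlier
df = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 100]})

# Compute the IQR fences
q1 = df["column"].quantile(0.25)
q3 = df["column"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences (missing values, if any, would also be flagged)
is_outlier = ~df["column"].between(lower, upper)

# Cap values at the fences instead of removing the rows
df["column_capped"] = df["column"].clip(lower=lower, upper=upper)

print(df[is_outlier])
print(df["column_capped"].max())
```

Capping keeps the row count intact, which helps when other columns in those rows are still valid.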
Final Check: Ensure Your Data is Clean ✅
After completing the above steps, run a final check to ensure your data is clean and ready for analysis:
# Final validation check
df.info() # Dataframe summary
df.describe() # Summary statistics
df.isnull().sum() # Check for remaining missing values
Once these checks come back clean, your dataset is in good shape for analysis and modeling!
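If you'd rather not eyeball the output every time, the same checks can be wrapped in a small helper. This is a sketch, not a standard pandas feature: validate_clean is a hypothetical function name, and it only covers duplicates and missing values:

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []

    # Fully duplicated rows
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate row(s)")

    # Remaining missing values, reported per column
    missing = df.isnull().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{count} missing value(s) in '{col}'")

    return problems

# Hypothetical sample data: one repeated row and one missing value
dirty = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
for problem in validate_clean(dirty):
    print(problem)
```

Returning a list instead of raising on the first problem lets you see everything that still needs attention in one pass.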
Hands-On Practice: Explore the Google Colab Notebook
Want to practice these steps? I’ve created a Google Colab notebook where you can apply these data cleaning techniques on a sample dataset. Click the link below to get started: