Data Wrangling with Pandas: Slicing, Cleaning, and Combining Datasets

 

You've got your data, but it's not quite ready for analysis. Welcome to data wrangling, the crucial and often time-consuming step of preparing raw data. Fortunately, the pandas library offers a powerful and intuitive toolkit for this very purpose. In this post, we'll dive into the essential pandas functions for effective data manipulation: slicing with loc and iloc, handling missing values, and combining multiple DataFrames.
Slicing with loc vs. iloc: Label-Based vs. Position-Based Indexing
Selecting specific rows and columns is a fundamental task. Pandas gives us two primary tools for this: loc and iloc. The key difference is how they reference data.
1. loc (Label-Based Indexing)
Use loc when you want to select data based on the row and column labels or names.
python
import pandas as pd

# Create a sample DataFrame with a custom index
data = {'city': ['New York', 'Los Angeles', 'Chicago'],
        'population': [8.4, 3.9, 2.7],
        'country': ['USA', 'USA', 'USA']}
df = pd.DataFrame(data, index=['NY', 'LA', 'CHI'])

# Select a single row by its label
df.loc['NY']
# Output:
# city        New York
# population       8.4
# country          USA
# Name: NY, dtype: object

# Select a specific cell
df.loc['LA', 'population']
# Output: 3.9

# Slice rows and select a list of columns
df.loc['NY':'LA', ['city', 'population']]
# Output:
#        city  population
# NY  New York         8.4
# LA  Los Angeles         3.9

An important note about loc is that when you slice a range, such as NY':'LA', it is inclusive of both the start and end labels.
2. iloc (Integer-Based Indexing)
Use iloc when you want to select data based on the row and column integer positions, just like indexing a Python list.
python
# Select the first row by its position
df.iloc[0]
# Output:
# city        New York
# population       8.4
# country          USA
# Name: NY, dtype: object

# Select a specific cell using integer positions
df.iloc[1, 1]
# Output: 3.9

# Slice rows and columns using integer positions
df.iloc[0:2, 0:2]
# Output:
#        city  population
# NY  New York         8.4
# LA  Los Angeles         3.9

Unlike loc, slicing with iloc follows standard Python conventions, where the end of the range is exclusive.
Handling Missing Values
Real-world data is rarely pristine. Pandas represents missing values with NaN (Not a Number) and provides simple ways to deal with them.
python
import numpy as np

# Create a DataFrame with missing values
df_missing = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

# Check for missing values (returns a boolean DataFrame)
df_missing.isnull()

# Count missing values per column
df_missing.isnull().sum()
# Output:
# A    1
# B    1
# dtype: int64

# Option 1: Drop rows with any missing values
df_cleaned = df_missing.dropna()

# Option 2: Fill missing values with a specific value
df_filled = df_missing.fillna(0)

# Option 3: Fill missing values with the mean of the column
df_filled_mean = df_missing.fillna(df_missing.mean())

Combining DataFrames
When data is split across multiple tables, pandas offers functions to combine them. The two most common methods are concat and merge.
1. pd.concat() (Stacking or Joining Side-by-Side)
Use concat to stack DataFrames together either vertically (adding more rows) or horizontally (adding more columns).
python
df1 = pd.DataFrame({'Name': ['A', 'B'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['C', 'D'], 'Age': [35, 40]})

# Stack vertically (along rows, default axis=0)
pd.concat([df1, df2], ignore_index=True)

df3 = pd.DataFrame({'City': ['NY', 'LA'], 'Salary': [70000, 80000]}, index=[0, 1])

# Join horizontally (along columns, axis=1)
pd.concat([df1, df3], axis=1)

2. pd.merge() (SQL-Style Joins)
Use merge to combine DataFrames based on a common key, much like a SQL join.
python
# Sample data
df_employees = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df_salaries = pd.DataFrame({'id': [1, 3, 4], 'salary': [60000, 75000, 90000]})

# Inner merge (default): Only includes common IDs
pd.merge(df_employees, df_salaries, on='id', how='inner')

# Left merge: Keeps all rows from the left DataFrame
pd.merge(df_employees, df_salaries, on='id', how='left')


Comments

Popular Posts