Data Cleaning or Data Preprocessing

The inconvenient truth about the data mining process!

Data scientists spend 60-80% of their time here!

Data cleaning tasks

  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

HANDLING MISSING DATA

Missing data may need to be ignored, replaced or inferred.

Easiest solution: depending on how much data is missing, ignore/delete rows with missing attribute values



e.g., deleting 5% of your data may be reasonable for a dataset with millions of rows

Replace missing data with a carefully chosen value

e.g., the most common value

Advanced solution: build a separate model to infer missing values;

e.g., infer height from weight
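
A minimal sketch of this idea, assuming a small pandas DataFrame with hypothetical weight and height columns: fit a scikit-learn linear regression on the complete rows and use it to fill in the missing heights.

```python
# Sketch: infer missing "height" values from "weight" with a simple regression model.
# The DataFrame and its column names are hypothetical examples.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "weight": [60, 72, 80, 55, 90, 68],
    "height": [165, 178, None, 160, 185, None],
})

known = df[df["height"].notna()]      # rows where height is present
missing = df[df["height"].isna()]     # rows to fill in

model = LinearRegression().fit(known[["weight"]], known["height"])
df.loc[df["height"].isna(), "height"] = model.predict(missing[["weight"]])
print(df)
```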

Fill it in automatically with:

When you do not have a class label:

  • The attribute mean for continuous data (common approach)
  • The attribute median if the distribution is skewed, for example
  • The attribute mode for categorical data

When you have a class label:

  • A new class label, e.g., “unknown”
  • The attribute mean for all samples belonging to the same class/target (smarter!)
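
The pandas sketch below illustrates these fill-in rules on a hypothetical DataFrame (the income, color, and label columns are made up for the example):

```python
# Sketch: common fill-in strategies with pandas; the DataFrame is illustrative.
import pandas as pd

df = pd.DataFrame({
    "income": [40, None, 55, None, 70, 65],
    "color":  ["red", None, "blue", "red", "red", "blue"],
    "label":  ["A", "A", "B", "B", "A", "B"],
})

# No class label available: attribute mean / median / mode
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())
df["color_mode"]    = df["color"].fillna(df["color"].mode()[0])

# Class label available: mean of the samples in the same class (smarter)
df["income_by_class"] = df["income"].fillna(
    df.groupby("label")["income"].transform("mean")
)
print(df)
```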

NOISY DATA

Noise refers to modification of original values

Examples: from occasional manual or equipment errors to the consistent distortion of a person’s voice when talking on a poor phone connection

Identify or remove outliers (note: outliers may be interesting)

Smooth noisy data


OUTLIER DETECTION

Look at the distribution of a single attribute:

  • Interquartile Range (IQR) method
    IQR = Q3 - Q1
    Lower outliers: x <= Q1 - 1.5 * IQR
    Upper outliers: x >= Q3 + 1.5 * IQR
  • Standard Deviation method
    Lower outliers: x <= μ - 3 * σ
    Upper outliers: x >= μ + 3 * σ
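
A short numpy sketch of both rules, run on an illustrative array (the value 210 is an injected outlier):

```python
# Sketch: flag outliers in a single numeric attribute with the IQR and 3-sigma rules.
import numpy as np

values = np.array([15, 8, 9, 4, 21, 34, 24, 29, 26, 28, 25, 210])

# Interquartile Range (IQR) method
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values <= q1 - 1.5 * iqr) | (values >= q3 + 1.5 * iqr)]

# Standard deviation method
mu, sigma = values.mean(), values.std()
sd_outliers = values[(values <= mu - 3 * sigma) | (values >= mu + 3 * sigma)]

print(iqr_outliers, sd_outliers)
```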



DATA SMOOTHING

Binning method:

 first sort the data and partition it into bins

 then smooth by bin means, bin medians, bin boundaries, etc.

Regression

 smooth by fitting the data into regression functions

BINNING

Equal-width (distance) partitioning:

  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  • Outliers may dominate the presentation, and some ranges may be empty
  • Skewed data is not handled well
  • Tends to be used more to create categories (we will see discretization)
Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling: e.g., 7-day average
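
A brief pandas sketch of both partitioning schemes, using the price data from the example below:

```python
# Sketch: equal-width vs. equal-depth binning with pandas.
import pandas as pd

prices = pd.Series([15, 8, 9, 4, 21, 34, 24, 29, 26, 28, 25, 21])

equal_width = pd.cut(prices, bins=3)    # 3 intervals of equal width W = (34 - 4)/3 = 10
equal_depth = pd.qcut(prices, q=3)      # 3 intervals, each holding ~4 of the 12 values

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```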

BINNING AND SMOOTHING

Data for price (in dollars):
15, 8, 9, 4, 21, 34, 24, 29, 26, 28, 25, 21

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into 3 (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means (values were rounded to the nearest integer here, but this is not necessary):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
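
The same equi-depth smoothing can be reproduced with a few lines of plain Python (no library assumptions):

```python
# Sketch: equi-depth binning, then smoothing by bin means and by bin boundaries.
prices = sorted([15, 8, 9, 4, 21, 34, 24, 29, 26, 28, 25, 21])
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # 3 bins of 4 values each

# Smooth by bin means (rounded to the nearest integer)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smooth by bin boundaries: replace each value with the closer of its bin's min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```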

SMOOTHING WITH REGRESSION

Data smoothing can also be done by regression: linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other; more advanced multiple regression extends this to more than two attributes, fitting the data to a multidimensional surface.
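
A minimal numpy sketch, assuming synthetic x/y data: fit the best line and replace the noisy values with the fitted ones.

```python
# Sketch: smoothing a noisy attribute by fitting a linear regression with numpy.
import numpy as np

x = np.arange(10, dtype=float)                      # e.g., time or another attribute
y = 2.0 * x + 1.0 + np.random.normal(0, 1.5, 10)    # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)          # find the "best" line
y_smoothed = slope * x + intercept                  # replace values with fitted ones
print(np.round(y_smoothed, 2))
```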

Correct Inconsistent Data

In real-world datasets, inconsistencies are common. These can appear as conflicting entries, mismatched formats, or typos—for example, the same city spelled differently (“New York” vs. “NYC”) or dates recorded in multiple formats. Correcting inconsistent data ensures that each attribute accurately reflects the underlying information. Standardizing formats, applying validation rules, and reconciling conflicting entries not only improves data quality but also prevents errors from propagating into analysis and machine learning models.
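
As a rough illustration (the column names, spellings, and date formats below are made up), pandas can standardize such entries; note that format="mixed" assumes pandas 2.0 or later:

```python
# Sketch: standardizing inconsistent city names and date formats with pandas.
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "NYC", "new york", "Boston"],
    "date": ["2024-01-05", "2024/01/05", "Jan 5, 2024", "2024-02-10"],
})

# Reconcile conflicting spellings of the same city
df["city"] = df["city"].str.strip().str.title().replace({"Nyc": "New York"})

# Standardize dates recorded in multiple formats (format="mixed" needs pandas >= 2.0)
df["date"] = pd.to_datetime(df["date"], format="mixed")
print(df)
```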


Resolve Redundancy from Data Integration

When combining data from multiple sources, redundancy often arises—multiple records may describe the same entity or attribute. Left unresolved, this duplication can skew analysis and inflate statistics. Data integration techniques, such as entity resolution, deduplication, and careful schema alignment, help eliminate redundancy. By consolidating overlapping records and maintaining a single, authoritative version of each data object, organizations can create a cleaner, more reliable dataset that is easier to analyze and interpret.
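
A small pandas sketch of deduplication after integration (the records are illustrative):

```python
# Sketch: consolidating duplicate records into a single authoritative version.
import pandas as pd

customers = pd.DataFrame({
    "name":  ["Alice Smith", "alice smith", "Bob Jones"],
    "email": ["alice@example.com", "alice@example.com", "bob@example.com"],
    "spend": [120.0, 120.0, 80.0],
})

# Normalize the key attributes, then keep one record per entity
customers["name"] = customers["name"].str.title()
deduped = customers.drop_duplicates(subset=["name", "email"], keep="first")
print(deduped)
```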
