The inconvenient truth about the data mining process: data scientists spend 60-80% of their time on data cleaning!
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
HANDLING MISSING DATA
Missing data may need to be ignored, replaced, or inferred.
Easiest solution: depending on how much data is missing, ignore/delete rows with missing attribute values;
e.g., deleting 5% of your data may be reasonable for a dataset with millions of rows.
Replace missing data with a carefully chosen value;
e.g., the most common value.
Advanced solution: build a separate model to infer missing values;
e.g., infer height from weight, as in the sketch below.
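A minimal sketch of all three options using pandas and scikit-learn; the column names and toy values here are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two numeric attributes with gaps in 'height'.
df = pd.DataFrame({
    "weight": [60.0, 72.0, 80.0, 55.0, 90.0, 68.0],
    "height": [165.0, 178.0, None, 165.0, 185.0, None],
})

# Easiest: drop rows with any missing attribute values.
dropped = df.dropna()

# Replace: fill with a carefully chosen value, e.g. the most common one.
replaced = df.fillna({"height": df["height"].mode()[0]})

# Advanced: train a small model on complete rows to infer height from weight.
known = df.dropna(subset=["height"])
model = LinearRegression().fit(known[["weight"]], known["height"])
missing = df["height"].isna()
df.loc[missing, "height"] = model.predict(df.loc[missing, ["weight"]])
```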
Fill it in automatically with:
When you do not have a class label:
- The attribute mean for continuous data (common approach)
- The attribute median if the distribution is skewed, for example
- The attribute mode for categorical data
When you have a class label:
- A new class label, e.g., “unknown”
- The attribute mean for all samples belonging to the same class/target (smarter!)
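As a sketch, here is how each of these fills looks in pandas (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, None, 52_000, 61_000, None, 45_000],
    "city":   ["NYC", "NYC", None, "Boston", "Boston", "NYC"],
    "label":  ["low", "low", "high", "high", "high", "low"],
})

# Without a class label:
mean_fill   = df["income"].fillna(df["income"].mean())    # continuous data
median_fill = df["income"].fillna(df["income"].median())  # skewed distribution
mode_fill   = df["city"].fillna(df["city"].mode()[0])     # categorical data

# With a class label:
unknown_fill = df["city"].fillna("unknown")               # new category
class_fill = df.groupby("label")["income"].transform(     # per-class mean
    lambda s: s.fillna(s.mean())
)
```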
NOISY DATA
Noise refers to modification of original values.
Examples range from occasional manual or equipment errors to the consistent distortion of a person’s voice over a poor phone connection.
Identify or remove outliers (note: outliers may be interesting)
Smooth noisy data
- Interquartile range (IQR) method
- Standard deviation method
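Both outlier rules can be expressed in a few lines; a sketch with pandas, using the common 1.5×IQR and 3-sigma cutoffs:

```python
import pandas as pd

# Mostly well-behaved values with one suspicious reading (95).
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 12, 13,
               11, 12, 10, 13, 12, 11, 14, 12, 11, 95])

# Interquartile range (IQR) method: flag points beyond 1.5 * IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Standard deviation method: flag points more than 3 sigma from the mean.
z = (s - s.mean()) / s.std()
sd_outliers = s[z.abs() > 3]
```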
DATA SMOOTHING
Binning method:
first sort the data and partition it into bins,
then smooth by bin means, bin medians, bin boundaries, etc.
Regression
smooth by fitting the data to regression functions
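A sketch of both smoothing approaches on a small sorted series (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Sorted prices, partitioned into three equal-depth bins.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(prices, q=3)

# Smooth by bin means: every value is replaced by its bin's mean.
by_bin_mean = prices.groupby(bins, observed=True).transform("mean")

# Smooth by regression: fit a line and replace values with fitted ones.
x = np.arange(len(prices))
slope, intercept = np.polyfit(x, prices, deg=1)
by_regression = slope * x + intercept
```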
BINNING
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
- Outliers may dominate presentation, some ranges may be empty
- Skewed data is not handled well
- Tends to be used more to create categories (we will see this in discretization)
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling: e.g., a 7-day average
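A quick sketch contrasting the two partitioning schemes with pandas, on illustrative values:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4)/3 = 10.
equal_width = pd.cut(values, bins=3)

# Equal-depth: 3 intervals with roughly the same number of samples each.
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())  # counts can be very uneven
print(equal_depth.value_counts().sort_index())  # 4 samples per bin here
```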
Correct Inconsistent Data
In real-world datasets, inconsistencies are common. These can appear as conflicting entries, mismatched formats, or typos—for example, the same city spelled differently (“New York” vs. “NYC”) or dates recorded in multiple formats. Correcting inconsistent data ensures that each attribute accurately reflects the underlying information. Standardizing formats, applying validation rules, and reconciling conflicting entries not only improves data quality but also prevents errors from propagating into analysis and machine learning models.
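As a sketch, standardizing spellings and date formats in pandas might look like this (the mapping rule and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "NYC", "new york", "Boston"],
    "date": ["2023-01-05", "01/05/2023", "Jan 5, 2023", "2023-02-10"],
})

# Reconcile conflicting spellings with an explicit mapping rule.
city_map = {"nyc": "New York", "new york": "New York", "boston": "Boston"}
df["city"] = df["city"].str.lower().map(city_map)

# Standardize mixed date formats by parsing each entry individually.
df["date"] = df["date"].map(pd.to_datetime)

# A simple validation rule: no dates in the future.
assert (df["date"] <= pd.Timestamp.now()).all()
```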
Resolve Redundancy from Data Integration
When combining data from multiple sources, redundancy often arises—multiple records may describe the same entity or attribute. Left unresolved, this duplication can skew analysis and inflate statistics. Data integration techniques, such as entity resolution, deduplication, and careful schema alignment, help eliminate redundancy. By consolidating overlapping records and maintaining a single, authoritative version of each data object, organizations can create a cleaner, more reliable dataset that is easier to analyze and interpret.
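A minimal deduplication sketch in pandas; the normalized-email key is just one hypothetical matching rule for entity resolution:

```python
import pandas as pd

# Two sources describing overlapping sets of people.
a = pd.DataFrame({"id": [1, 2], "email": ["ann@x.com", "bob@y.com"]})
b = pd.DataFrame({"id": [7, 8], "email": ["Ann@x.com ", "cat@z.com"]})

merged = pd.concat([a, b], ignore_index=True)

# Simple entity resolution: normalize a key, then keep one record per key.
merged["key"] = merged["email"].str.lower().str.strip()
deduped = merged.drop_duplicates(subset="key", keep="first")
```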