Data Ingestion: Source and Destination
Sources and targets: who/what/where
The chapter says: don’t just “grab data.” Think about each source carefully.
For every source (Stripe, app DB, CSV dump, etc.), ask:
- Who will we work with?
  - Which team owns it? Marketing, Payments, Product, etc.
- How will the data be used?
  - Reporting? ML? Finance audits? Real-time alerts?
- What’s the frequency?
  - Does it change once a day? Every second? Is it a one-time historical dump?
- What’s the volume?
  - Thousands of rows? Billions? This affects performance & cost.
- What’s the format?
  - JSON, CSV, database table, files on S3, weird FTP dumps…
- What’s the quality?
  - Clean and consistent? Missing values? Weird codes that need decoding? (See the sketch after this list.)
- How will we store it after ingestion?
  - Data lake, lakehouse (Delta/Iceberg/Hudi), warehouse tables, etc.
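A quick way to answer the volume and quality questions is to profile a sample of the source before building anything. A minimal sketch, assuming pandas is available; the file name, columns, and status-code mapping are hypothetical:

```python
# First-pass profile of a new source file.
# "payments.csv", "status_code", and the mapping are hypothetical examples.
import pandas as pd

df = pd.read_csv("payments.csv")

# Volume: how much data are we actually dealing with?
print(f"{len(df):,} rows, {df.shape[1]} columns")

# Quality: which columns have missing values, and how badly?
print(df.isna().mean().sort_values(ascending=False).head(10))

# "Weird codes that need decoding": map opaque source codes to labels
# and flag anything the mapping does not cover.
status_map = {0: "pending", 1: "settled", 2: "refunded"}
df["status_label"] = df["status_code"].map(status_map)
print(f"{df['status_label'].isna().sum()} rows with unrecognized status codes")
```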
Same idea for destinations (targets), but with more focus on stakeholders:
- Who reads this data?
- What tools do they use (BI, ML notebooks, etc.)?
- Does it need staging layers (raw → cleaned → business-ready)?
They also talk about:

Staging data:
- Often you first land it in a lake/lakehouse (S3/GCS/Azure + Delta/Iceberg/Hudi).
- You keep raw data + cleaned versions → easier backfills, schema changes, history (see the sketch below).
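To make the raw-plus-cleaned idea concrete, here is a minimal local sketch, assuming pandas with a parquet engine (e.g., pyarrow) is installed. The paths, file, and columns are made up for illustration; in practice the layers would live in S3/GCS/Azure, often as Delta/Iceberg/Hudi tables:

```python
# Land raw data as received, then derive a cleaned layer from it.
from datetime import date
from pathlib import Path

import pandas as pd

run_date = date.today().isoformat()

# 1. Raw layer: keep the source data exactly as received, partitioned
#    by ingestion date, so history and backfills stay cheap.
raw_dir = Path(f"lake/raw/stripe/charges/{run_date}")
raw_dir.mkdir(parents=True, exist_ok=True)
raw = pd.read_json("stripe_charges_dump.json")  # hypothetical source dump
raw.to_parquet(raw_dir / "charges.parquet")

# 2. Cleaned layer: derived from raw, never from the source directly,
#    so a change in cleaning logic only needs a rebuild from raw.
cleaned = raw.dropna(subset=["charge_id"]).drop_duplicates("charge_id")
cleaned["amount_usd"] = cleaned["amount_cents"] / 100  # hypothetical column
clean_dir = Path(f"lake/cleaned/stripe/charges/{run_date}")
clean_dir.mkdir(parents=True, exist_ok=True)
cleaned.to_parquet(clean_dir / "charges.parquet")
```

If the schema changes upstream or a cleaning bug turns up, you rebuild the cleaned layer from the raw files instead of re-pulling from the source.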
OLAP vs OLTP (warehouse vs transactional DB):
- OLAP: columnar, great for analytics (Redshift, BigQuery, Snowflake, Databricks SQL).
- OLTP: row-based, great for app transactions (Postgres, MySQL). A query-shape comparison follows.
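The practical difference shows up in query shape. A minimal sketch, using stdlib sqlite3 as a stand-in for a row-based OLTP database and DuckDB (pip install duckdb) as a stand-in for a columnar OLAP engine; both stand-ins are my assumption, the chapter names Postgres/MySQL and Redshift/BigQuery/Snowflake/Databricks SQL:

```python
import sqlite3

import duckdb  # stand-in columnar engine; an assumption, not from the chapter

# OLTP: fetch or update single rows by key, many times per second.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
oltp.execute("INSERT INTO orders VALUES (1, 19.99), (2, 5.00)")
print(oltp.execute("SELECT total FROM orders WHERE id = 2").fetchone())

# OLAP: scan a few columns across many rows, aggregate once.
olap = duckdb.connect()
olap.execute(
    "CREATE TABLE orders AS "
    "SELECT range AS id, random() * 100 AS total FROM range(1000000)"
)
print(olap.execute("SELECT COUNT(*), AVG(total) FROM orders").fetchone())
```

Columnar storage makes the second query cheap because only the total column has to be read; a row store would drag every column off disk.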
Change Data Capture (CDC):
- Instead of reloading everything, just capture what changed in the source and update downstream (see the sketch below).
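The simplest way to approximate CDC is a high-watermark query: remember how far you got last run and pull only rows updated since then. A minimal sketch with a hypothetical orders table and updated_at column; production log-based CDC tools (e.g., Debezium) read the database's transaction log instead:

```python
# High-watermark incremental pull: the simplest form of the CDC idea.
import sqlite3
from datetime import datetime, timezone

source = sqlite3.connect("app.db")  # hypothetical source database

def load_changes(last_watermark: str) -> tuple[list, str]:
    """Fetch rows updated after the previous run's watermark."""
    rows = source.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()
    new_watermark = datetime.now(timezone.utc).isoformat()
    return rows, new_watermark

changed, watermark = load_changes("2024-01-01T00:00:00+00:00")
# Upsert `changed` downstream, then persist `watermark` somewhere durable
# so the next run starts exactly where this one stopped.
```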