Data Ingestion: Source and Destination
Sources and targets: who/what/where
The chapter says: don’t just “grab data.” Think about each source carefully.
For every source (Stripe, app DB, CSV dump, etc.), ask:
- Who will we work with?
  - Which team owns it? Marketing, Payments, Product, etc.
- How will the data be used?
  - Reporting? ML? Finance audits? Real-time alerts?
- What’s the frequency?
  - Does it change once a day? Every second? Is it a one-time historical dump?
- What’s the volume?
  - Thousands of rows? Billions? This affects performance & cost.
- What’s the format?
  - JSON, CSV, database table, files on S3, weird FTP dumps…
- What’s the quality?
  - Clean and consistent? Missing values? Weird codes that need decoding? (See the sketch after this list.)
- How will we store it after ingestion?
  - Data lake, lakehouse (Delta/Iceberg/Hudi), warehouse tables, etc.
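A quick way to answer the volume and quality questions is to profile a sample of the source before building anything. A minimal sketch, assuming pandas is available; the file name, columns, and status-code mapping are hypothetical:

```python
# First-pass profile of a new source file.
# "payments.csv", "status_code", and the mapping are hypothetical examples.
import pandas as pd

df = pd.read_csv("payments.csv")

# Volume: how much data are we actually dealing with?
print(f"{len(df):,} rows, {df.shape[1]} columns")

# Quality: which columns have missing values, and how badly?
print(df.isna().mean().sort_values(ascending=False).head(10))

# "Weird codes that need decoding": map opaque source codes to labels
# and flag anything the mapping does not cover.
status_map = {0: "pending", 1: "settled", 2: "refunded"}
df["status_label"] = df["status_code"].map(status_map)
print(f"{df['status_label'].isna().sum()} rows with unrecognized status codes")
```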
Same idea for destinations (targets), but with more focus on stakeholders:
- Who reads this data?
- What tools do they use (BI, ML notebooks, etc.)?
- Does it need staging layers (raw → cleaned → business-ready)?
They also talk about:

Staging data:
- Often you first land it in a lake/lakehouse (S3/GCS/Azure + Delta/Iceberg/Hudi).
- You keep raw data + cleaned versions → easier backfills, schema changes, history (see the sketch below).
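To make the raw-plus-cleaned idea concrete, here is a minimal local sketch, assuming pandas with a parquet engine (e.g., pyarrow) is installed. The paths, file, and columns are made up for illustration; in practice the layers would live in S3/GCS/Azure, often as Delta/Iceberg/Hudi tables:

```python
# Land raw data as received, then derive a cleaned layer from it.
from datetime import date
from pathlib import Path

import pandas as pd

run_date = date.today().isoformat()

# 1. Raw layer: keep the source data exactly as received, partitioned
#    by ingestion date, so history and backfills stay cheap.
raw_dir = Path(f"lake/raw/stripe/charges/{run_date}")
raw_dir.mkdir(parents=True, exist_ok=True)
raw = pd.read_json("stripe_charges_dump.json")  # hypothetical source dump
raw.to_parquet(raw_dir / "charges.parquet")

# 2. Cleaned layer: derived from raw, never from the source directly,
#    so a change in cleaning logic only needs a rebuild from raw.
cleaned = raw.dropna(subset=["charge_id"]).drop_duplicates("charge_id")
cleaned["amount_usd"] = cleaned["amount_cents"] / 100  # hypothetical column
clean_dir = Path(f"lake/cleaned/stripe/charges/{run_date}")
clean_dir.mkdir(parents=True, exist_ok=True)
cleaned.to_parquet(clean_dir / "charges.parquet")
```

If the schema changes upstream or a cleaning bug turns up, you rebuild the cleaned layer from the raw files instead of re-pulling from the source.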
OLAP vs OLTP (warehouse vs transactional DB):
- OLAP: columnar, great for analytics (Redshift, BigQuery, Snowflake, Databricks SQL).
- OLTP: row-based, great for app transactions (Postgres, MySQL). A query-shape comparison follows.
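The practical difference shows up in query shape. A minimal sketch, using stdlib sqlite3 as a stand-in for a row-based OLTP database and DuckDB (pip install duckdb) as a stand-in for a columnar OLAP engine; both stand-ins are my assumption, the chapter names Postgres/MySQL and Redshift/BigQuery/Snowflake/Databricks SQL:

```python
import sqlite3

import duckdb  # stand-in columnar engine; an assumption, not from the chapter

# OLTP: fetch or update single rows by key, many times per second.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
oltp.execute("INSERT INTO orders VALUES (1, 19.99), (2, 5.00)")
print(oltp.execute("SELECT total FROM orders WHERE id = 2").fetchone())

# OLAP: scan a few columns across many rows, aggregate once.
olap = duckdb.connect()
olap.execute(
    "CREATE TABLE orders AS "
    "SELECT range AS id, random() * 100 AS total FROM range(1000000)"
)
print(olap.execute("SELECT COUNT(*), AVG(total) FROM orders").fetchone())
```

Columnar storage makes the second query cheap because only the total column has to be read; a row store would drag every column off disk.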
Change Data Capture (CDC):
- Instead of reloading everything, just capture what changed in the source and update downstream (see the sketch below).
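The simplest way to approximate CDC is a high-watermark query: remember how far you got last run and pull only rows updated since then. A minimal sketch with a hypothetical orders table and updated_at column; production log-based CDC tools (e.g., Debezium) read the database's transaction log instead:

```python
# High-watermark incremental pull: the simplest form of the CDC idea.
import sqlite3
from datetime import datetime, timezone

source = sqlite3.connect("app.db")  # hypothetical source database

def load_changes(last_watermark: str) -> tuple[list, str]:
    """Fetch rows updated after the previous run's watermark."""
    rows = source.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()
    new_watermark = datetime.now(timezone.utc).isoformat()
    return rows, new_watermark

changed, watermark = load_changes("2024-01-01T00:00:00+00:00")
# Upsert `changed` downstream, then persist `watermark` somewhere durable
# so the next run starts exactly where this one stopped.
```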