Data Ingestion: Source and Destination


Sources and targets: who/what/where

The chapter says: don’t just “grab data.” Think about each source carefully.

For every source (Stripe, app DB, CSV dump, etc.), ask (a profile sketch capturing these questions follows the list):

  1. Who will we work with?

    • Which team owns it? Marketing, Payments, Product, etc.

  2. How will the data be used?

    • Reporting? ML? Finance audits? Real-time alerts?

  3. What’s the frequency?

    • Does it change once a day? Every second? Is it a one-time historical dump?

  4. What’s the volume?

    • Thousands of rows? Billions? This affects performance & cost.

  5. What’s the format?

    • JSON, CSV, database table, files on S3, weird FTP dumps…

  6. What’s the quality?

    • Clean and consistent? Missing values? Weird codes that need decoding?

  7. How will we store it after ingestion?

    • Data lake, lakehouse (Delta/Iceberg/Hudi), warehouse tables, etc.
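To make the checklist concrete, here's a minimal sketch of recording the answers as a per-source profile that ingestion code can read. All names, fields, and example values are hypothetical, not from the chapter:

    from dataclasses import dataclass, field

    # Hypothetical per-source profile; fields mirror the seven questions above.
    @dataclass
    class SourceProfile:
        name: str                  # e.g. "stripe_payments"
        owner_team: str            # (1) who we work with
        use_cases: list[str]       # (2) how the data is used
        frequency: str             # (3) "daily", "streaming", "one-time dump"
        approx_volume: str         # (4) rough row counts; drives cost/perf choices
        format: str                # (5) "json", "csv", "db table", "s3 files"
        known_quality_issues: list[str] = field(default_factory=list)  # (6)
        landing_zone: str = "s3://lake/raw/"   # (7) where it lands after ingestion

    stripe = SourceProfile(
        name="stripe_payments",
        owner_team="Payments",
        use_cases=["finance audits", "reporting"],
        frequency="hourly",
        approx_volume="~1M rows/day",
        format="json (API)",
        known_quality_issues=["amounts in cents", "nullable customer_id"],
    )

Keeping a profile like this next to each pipeline turns the checklist into something reviewable instead of tribal knowledge.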

Same idea for destinations (targets), but the questions focus more on stakeholders:

  • Who reads this data?

  • What tools do they use (BI, ML notebooks, etc.)?

  • Does it need staging layers (raw → cleaned → business-ready)?

The chapter also covers:

  • Staging data:

    • Often you first land it in a lake/lakehouse (S3/GCS/Azure + Delta/Iceberg/Hudi).

    • You keep raw data + cleaned versions → easier backfills, schema changes, history (see the staging sketch after this list).

  • OLAP vs OLTP (warehouse vs transactional DB):

    • OLAP: columnar, great for analytics (Redshift, BigQuery, Snowflake, Databricks SQL).

    • OLTP: row-based, great for app transactions (Postgres, MySQL); a toy layout comparison follows this list.

  • Change Data Capture (CDC):

    • Instead of reloading everything, capture only what changed in the source and apply it downstream (sketch after this list).
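
A minimal sketch of the raw → cleaned staging idea, assuming a JSON-lines file already landed in the raw zone and pandas (with pyarrow) for the cleaned layer; the paths, columns, and cleaning rules are all made up:

    import pandas as pd

    RAW = "lake/raw/stripe_payments/2024-01-15.json"          # landed as-is, never edited
    CLEAN = "lake/cleaned/stripe_payments/2024-01-15.parquet"

    # The cleaned layer is derived from raw, so backfills and schema changes
    # just mean rerunning this step over the raw history.
    df = pd.read_json(RAW, lines=True)
    df["amount"] = df["amount_cents"] / 100        # decode a "weird code"
    df = df.dropna(subset=["customer_id"])         # basic quality rule
    df.to_parquet(CLEAN, index=False)              # needs pyarrow installed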
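
A toy pure-Python illustration of the layout difference (not a real storage engine): an OLAP-style aggregate scans one contiguous column, while the row layout keeps each record together for OLTP-style point reads and writes.

    # Row-based (OLTP-style): each record stored together.
    rows = [
        {"id": 1, "customer": "a", "amount": 10.0},
        {"id": 2, "customer": "b", "amount": 25.0},
        {"id": 3, "customer": "a", "amount": 7.5},
    ]

    # Column-based (OLAP-style): each column stored together.
    cols = {
        "id": [1, 2, 3],
        "customer": ["a", "b", "a"],
        "amount": [10.0, 25.0, 7.5],
    }

    rows[1]["amount"] = 30.0       # OLTP point update: one record, all fields at hand
    total = sum(cols["amount"])    # OLAP aggregate: scans a single column
    print(total)                   # 42.5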
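
Proper CDC tails the source database's transaction log (tools like Debezium do this); a simpler approximation of the same idea is an incremental pull against an updated_at high-watermark, sketched below with a throwaway SQLite table (table name and columns are hypothetical):

    import sqlite3

    def pull_changes(conn: sqlite3.Connection, last_watermark: str):
        """Fetch only rows changed since the previous run."""
        cur = conn.execute(
            "SELECT id, status, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_watermark,),
        )
        changes = cur.fetchall()
        new_watermark = changes[-1][2] if changes else last_watermark
        return changes, new_watermark

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, "paid",     "2024-01-01T10:00:00"),
        (2, "refunded", "2024-01-02T09:30:00"),
    ])

    changes, wm = pull_changes(conn, "2024-01-01T12:00:00")
    print(changes)  # only order 2, the row that changed after the watermark

Each run applies the changes downstream (typically as upserts) and persists the new watermark so the next run starts where this one left off.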

