Ingestion considerations: how the data behaves


This section is: “what properties of the data affect how we ingest it?”

Four big ones:

a) Frequency (batch vs streaming)

  • Batch: process chunks on a schedule

    • e.g., “run every night at 2 AM”

  • Micro-batch: run more often (every few minutes)

    • Feels “almost real-time”

  • Streaming: continuous; data flows as events happen

    • Often uses Kafka / Kinesis / Pub/Sub + Spark/Flink/Bytewax etc.

    • Common methods of processing unbounded streaming data (see the Python sketch after this list):

    1. Windowing: Segmenting a data source into finite chunks based on temporal boundaries.

    2. Fixed windows: Data is essentially “micro-batched” and read into a target in small, fixed-size windows.

    3. Sliding windows: Similar to fixed windows, but with overlapping boundaries.

    4. Sessions: Dynamic windows in which sequences of events are separated by gaps of inactivity; the “window” is defined by the data itself.

    5. Time-agnostic: Suitable for data where time isn’t crucial, often handled as batch workloads.
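
As a rough illustration (not how Flink, Beam, or Spark implement it internally), here is a pure-Python sketch of how fixed, sliding, and session windows assign timestamped events to buckets; the events, window sizes, and inactivity gap are all made-up values.

```python
from datetime import datetime, timedelta

# Hypothetical timestamped events: (event_time, payload).
events = [
    (datetime(2024, 1, 1, 0, 0, 5), "a"),
    (datetime(2024, 1, 1, 0, 0, 42), "b"),
    (datetime(2024, 1, 1, 0, 1, 10), "c"),
    (datetime(2024, 1, 1, 0, 5, 30), "d"),
]

EPOCH = datetime(1970, 1, 1)

def fixed_window(ts, size=timedelta(minutes=1)):
    """Fixed (tumbling) windows: each event falls in exactly one bucket."""
    buckets = (ts - EPOCH) // size          # timedelta // timedelta -> int
    return EPOCH + buckets * size           # start time of the event's window

def sliding_windows(ts, size=timedelta(minutes=1), step=timedelta(seconds=30)):
    """Sliding windows: overlapping buckets, so one event can land in several."""
    first = (ts - size - EPOCH) // step + 1  # earliest window whose span covers ts
    start = EPOCH + first * step
    starts = []
    while start <= ts:
        starts.append(start)
        start += step
    return starts

def session_windows(timestamps, gap=timedelta(minutes=2)):
    """Session windows: a new session starts after a gap of inactivity.
    Assumes timestamps are already sorted."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

for ts, payload in events:
    print(payload, "fixed:", fixed_window(ts), "sliding:", sliding_windows(ts))

print("sessions:", session_windows([ts for ts, _ in events]))
```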

Key idea: don’t over-engineer. “Right-time data” is enough for most use cases:

  • Moving from daily → hourly already feels huge for the business.
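
To make the batch / “right-time” idea concrete, here is a minimal sketch of a nightly ingestion job, assuming Apache Airflow 2.4+ (for the `schedule` argument); the DAG id, task id, and `ingest_orders` function are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_orders():
    # Placeholder for the actual extract-and-load logic.
    print("pulling yesterday's orders from the source system")

# Nightly batch at 2 AM. Moving from daily to hourly "right-time" ingestion
# is just schedule="0 * * * *" -- a one-line change, not a pipeline rewrite.
with DAG(
    dag_id="nightly_orders_ingest",   # hypothetical name
    schedule="0 2 * * *",             # cron: every day at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
```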

b) Volume (how much)

  • High volume means:

    • Need compressed, efficient formats (Parquet/Avro, then Delta/Iceberg/Hudi on top); see the Parquet sketch after this list

    • Care about throughput, latency, cost, and retention

  • Also: decide how long to keep and where to archive old data.
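
As a small, hedged example of the “compressed, efficient formats” point: writing and selectively reading Parquet with pandas (assumes a Parquet engine such as pyarrow is installed); the DataFrame columns and file name are invented.

```python
import pandas as pd

# Hypothetical click-event data; column names and sizes are made up for illustration.
df = pd.DataFrame({
    "event_id": range(1_000),
    "user_id": [f"u{i % 50}" for i in range(1_000)],
    "ts": pd.to_datetime("2024-01-01") + pd.to_timedelta(range(1_000), unit="s"),
})

# Columnar + compressed beats row-oriented CSV/JSON at high volume.
df.to_parquet("events.parquet", compression="snappy", index=False)

# Reading back only the columns you need keeps scans (and cost) down.
subset = pd.read_parquet("events.parquet", columns=["user_id", "ts"])
print(subset.head())
```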

c) Structure / shape

  • Structured: tables, fixed schemas (SQL DBs)

  • Semi-structured: JSON, XML, other nested formats

  • Unstructured: text, images, video, audio

Modern tools let you keep semi-structured data (like JSON) as-is and query it with SQL later. But you still need to validate it and handle missing keys/NULLs.
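
For example, here is a hedged sketch of landing newline-delimited JSON as-is and querying it with SQL via DuckDB, with a COALESCE guard for a missing key; the file name, fields, and records are made up.

```python
import json
import duckdb

# Hypothetical semi-structured events; note the second record is missing "amount".
records = [
    {"order_id": 1, "customer": {"id": "c1"}, "amount": 42.0},
    {"order_id": 2, "customer": {"id": "c2"}},
]
with open("orders.json", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Query the raw JSON directly with SQL; COALESCE guards against missing keys / NULLs.
result = duckdb.sql("""
    SELECT order_id,
           customer.id AS customer_id,
           COALESCE(amount, 0.0) AS amount
    FROM read_json_auto('orders.json')
""").fetchall()
print(result)
```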

d) Format and variety

  • Real life = many sources, many formats, many quirks.

  • Variety is why ingestion is tricky: you need flexible pipelines and good observability.


5. Choosing an ingestion solution (tools strategy)

This is: “How do we actually implement ingestion?”

They split tools into two big styles, plus a hybrid of the two:

1) Declarative (“tell it what you want”)

You configure things in a UI or YAML; tool handles the details.

  • Legacy tools: Talend, Pentaho, etc. (traditional enterprise ETL suites)

  • Modern SaaS/OSS: Fivetran, Stitch, Airbyte

    • Many ready-made connectors, easy to set up

  • Native platform features (inside Databricks, cloud, etc.)

    • e.g., “connect and ingest” directly from message bus or cloud storage

  • Pros:

    • Fast to get started, less engineering

    • Vendors maintain connectors and handle schema/API changes

  • Cons:

    • Less flexible for weird edge cases

    • Vendor lock-in → hard/expensive to switch later

    • You depend on them to add new connectors

2) Imperative (“write the code yourself”)

You write code/pipelines: Python scripts, Lambdas, Airflow DAGs, Beam, custom connectors, etc. (a minimal sketch follows the pros and cons below).

  • Pros:

    • Maximum flexibility; you can handle any weird source

    • You decide patterns, testing, standards

  • Cons:

    • Expensive in time and people

    • Needs strong engineering discipline (testing, maintainability)

    • Overkill for small teams or simple needs
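
As a hedged sketch of the imperative style, here is a self-contained Python script (standard library only) that pulls JSON from a hypothetical HTTP API and lands it as timestamped raw files; the URL, paths, and naming convention are assumptions, not a prescribed pattern.

```python
import json
import pathlib
import urllib.request
from datetime import datetime, timezone

# Hypothetical source API and landing path; in practice these come from config/secrets.
SOURCE_URL = "https://api.example.com/v1/orders?since=2024-01-01"
LANDING_DIR = pathlib.Path("landing/orders")

def ingest_once() -> pathlib.Path:
    """Pull one page of records from the source API and land it as raw JSON,
    stamped with the ingestion time so reruns never overwrite earlier loads."""
    with urllib.request.urlopen(SOURCE_URL, timeout=30) as resp:
        payload = json.load(resp)

    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = LANDING_DIR / f"orders_{stamp}.json"
    out.write_text(json.dumps(payload))
    return out

if __name__ == "__main__":
    print("landed", ingest_once())
```

The flexibility is the point: pagination, retries, odd auth schemes, and custom validation all go wherever you need them, but you also own the testing and maintenance.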

3) Hybrid (what most sane teams do)

  • Use declarative tools (Fivetran/Airbyte/native connectors) for:

    • Common sources: Salesforce, Stripe, Google Ads, etc.

  • Use custom/imperative code where:

    • The source is weird, niche, or super critical

  • Maybe contribute extra connectors back to open source (Airbyte, Singer, dlt, etc.)

Analogy they use:

Most of the time you need a Toyota, not a Formula 1 car.
But sometimes, for very special problems, you do need that race car.
