Ingestion considerations: how the data behaves
This section is: “what properties of the data affect how we ingest it?”
Four big ones:
a) Frequency (batch vs streaming)
- Batch: process chunks on a schedule
  - e.g., "run every night at 2 AM"
- Micro-batch: run more often (every few minutes)
  - Feels "almost real-time"
- Streaming: continuous; data flows as events happen
  - Often uses Kafka / Kinesis / Pub/Sub + Spark/Flink/Bytewax etc. (minimal consumer sketch below)
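To make the streaming style concrete, here is a minimal consumer-loop sketch using the kafka-python client. The broker address, topic name, and event fields are assumptions for illustration, not a recommended setup.

```python
# Minimal streaming-consumer sketch (assumes kafka-python is installed and a
# broker runs at localhost:9092; the "orders" topic and event fields are made up).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Events are handled one at a time, as they arrive, instead of in scheduled chunks.
for message in consumer:
    event = message.value
    print(event.get("order_id"), event.get("amount"))
```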
Methods
Common methods for handling unbounded streaming data include:
- Windowing: segmenting a data source into finite chunks based on temporal boundaries (toy example below).
  - Fixed windows: data is essentially "micro-batched" and written to a target in small, fixed-size windows.
  - Sliding windows: similar to fixed windows, but with overlapping boundaries.
  - Sessions: dynamic windows in which sequences of events are separated by gaps of inactivity; here the "window" is defined by the data itself.
  - Time-agnostic: suitable for data where time isn't crucial, often handled with batch workloads.
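A toy, framework-free illustration of the fixed vs sliding window idea; real engines (Spark, Flink, Beam) handle this with watermarks and state, and the event data and window sizes here are invented.

```python
# Toy illustration of fixed (tumbling) vs sliding windows over timestamped events.
# Pure Python, no streaming framework; event shape and window sizes are made up.
from collections import defaultdict

events = [  # (epoch_seconds, value)
    (0, 1), (30, 2), (65, 3), (90, 4), (130, 5),
]

def fixed_windows(events, size_s=60):
    """Assign each event to the 60-second bucket its timestamp falls into."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // size_s * size_s].append(value)
    return dict(buckets)

def sliding_windows(events, size_s=60, step_s=30):
    """Like fixed windows, but a new 60 s window starts every 30 s, so they overlap."""
    last_ts = max(ts for ts, _ in events)
    windows = {}
    start = 0
    while start <= last_ts:
        windows[start] = [v for ts, v in events if start <= ts < start + size_s]
        start += step_s
    return windows

print(fixed_windows(events))    # {0: [1, 2], 60: [3, 4], 120: [5]}
print(sliding_windows(events))  # overlapping buckets keyed by window start
```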
Key idea: don't over-engineer. "Right-time" data is enough for most:
- Moving from daily → hourly already feels huge for the business.
b) Volume (how much)
- High volume means:
  - Need compressed, efficient formats (Parquet/Avro, then Delta/Iceberg/Hudi on top; sketch below)
  - Care about throughput, latency, cost, and retention
- Also: decide how long to keep and where to archive old data.
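A quick sketch of the "compressed, efficient format" point using pandas with Parquet (pyarrow under the hood); Delta/Iceberg/Hudi then layer table features like transactions and time travel on top of files like this. The columns and path are made up.

```python
# Sketch: writing a compressed, columnar file with pandas (pyarrow engine).
import pandas as pd

df = pd.DataFrame(
    {"event_id": [1, 2, 3], "amount": [9.99, 14.50, 3.25]}
)

# Parquet is columnar and compressed; snappy trades some compression ratio for speed.
df.to_parquet("events.parquet", compression="snappy")

print(pd.read_parquet("events.parquet"))
```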
c) Structure / shape
- Structured: tables, fixed schemas (SQL DBs)
- Semi-structured: JSON, XML, nested stuff
- Unstructured: text, images, video, audio

Modern tools let you keep semi-structured data (like JSON) and query it with SQL later. But you still need to validate it and think about missing keys/NULLs (see the sketch below).
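A small sketch of the "missing keys/NULLs" point in plain Python; the payload shape and field names are invented.

```python
# Sketch: defensive handling of semi-structured JSON where keys may be missing.
import json

raw = '{"user_id": 42, "address": {"city": "Berlin"}}'   # note: no "email" key
record = json.loads(raw)

# .get() with a default avoids KeyError and makes the "missing" case explicit.
email = record.get("email")                      # None if absent -> NULL downstream
city = record.get("address", {}).get("city")     # safe nested lookup

row = {"user_id": record["user_id"], "email": email, "city": city}
print(row)  # {'user_id': 42, 'email': None, 'city': 'Berlin'}
```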
d) Format and variety
- Real life = many sources, many formats, many quirks.
- Variety is why ingestion is tricky: you need flexible pipelines and good observability.
5. Choosing an ingestion solution (tools strategy)
This is: “How do we actually implement ingestion?”
They split tools into two big styles:
1) Declarative (“tell it what you want”)
You configure things in a UI or YAML; the tool handles the details.
- Legacy tools: Talend, Pentaho, etc. (enterprise ETL tools, less modern)
- Modern SaaS/OSS: Fivetran, Stitch, Airbyte
  - Many ready-made connectors, easy to set up
- Native platform features (inside Databricks, cloud platforms, etc.)
  - e.g., "connect and ingest" directly from a message bus or cloud storage
- Pros:
  - Fast to get started, less engineering
  - Vendors maintain connectors and handle schema/API changes
- Cons:
  - Less flexible for weird edge cases
  - Vendor lock-in → hard/expensive to switch later
  - You depend on them to add new connectors
2) Imperative ("write the code yourself")
You write code/pipelines: Python scripts, Lambdas, Airflow DAGs, Beam, custom connectors, etc. (see the sketch after this list).
- Pros:
  - Maximum flexibility; you can handle any weird source
  - You decide patterns, testing, standards
- Cons:
  - Expensive in time and people
  - Needs strong engineering discipline (testing, maintainability)
  - Overkill for small teams or simple needs
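For flavor, here is a minimal imperative ingestion script: pull a day of records from an HTTP API and land them as JSON Lines. The endpoint, parameters, and output path are hypothetical; a real job would add retries, pagination, incremental state, and monitoring.

```python
# Sketch of a hand-rolled, imperative ingestion job: pull from an HTTP API and
# land the raw records as JSON Lines. Endpoint and paths are made up.
import json
from datetime import date, timedelta

import requests

API_URL = "https://api.example.com/v1/orders"      # hypothetical endpoint

def ingest_orders(day: date, out_path: str) -> int:
    """Fetch one day of orders and append them as JSON Lines; return row count."""
    resp = requests.get(API_URL, params={"date": day.isoformat()}, timeout=30)
    resp.raise_for_status()                        # fail loudly on HTTP errors
    records = resp.json()
    with open(out_path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)

if __name__ == "__main__":
    yesterday = date.today() - timedelta(days=1)
    n = ingest_orders(yesterday, f"orders_{yesterday}.jsonl")
    print(f"ingested {n} records")
```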
3) Hybrid (what most sane teams do)
- Use declarative tools (Fivetran/Airbyte/native connectors) for:
  - Common sources: Salesforce, Stripe, Google Ads, etc.
- Use custom/imperative code where:
  - The source is weird, niche, or super critical
- Maybe contribute extra connectors back to open source (Airbyte, Singer, dlt, etc.)
Analogy they use:
Most of the time you need a Toyota, not a Formula 1 car.
But sometimes, for very special problems, you do need that race car.