Ingestion considerations: how the data behaves
This section is: “what properties of the data affect how we ingest it?”
Four big ones:
a) Frequency (batch vs streaming)
- Batch: process chunks on a schedule
  - e.g., "run every night at 2 AM"
- Micro-batch: run more often (every few minutes)
  - Feels "almost real-time"
- Streaming: continuous; data flows as events happen
  - Often uses Kafka / Kinesis / Pub/Sub + Spark/Flink/Bytewax etc. (minimal consumer sketch below)
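To make the streaming style concrete, here is a minimal consumer-loop sketch using the kafka-python client. The broker address, topic name, and event fields are assumptions for illustration, not a recommended setup.

```python
# Minimal streaming-consumer sketch (assumes kafka-python is installed and a
# broker runs at localhost:9092; the "orders" topic and event fields are made up).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Events are handled one at a time, as they arrive, instead of in scheduled chunks.
for message in consumer:
    event = message.value
    print(event.get("order_id"), event.get("amount"))
```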
Methods
Common methods for handling unbounded streaming data include:
- Windowing: segmenting a data source into finite chunks based on temporal boundaries (toy example below).
  - Fixed windows: data is essentially "micro-batched" and written to a target in small, fixed-size windows.
  - Sliding windows: similar to fixed windows, but with overlapping boundaries.
  - Sessions: dynamic windows in which sequences of events are separated by gaps of inactivity; here the "window" is defined by the data itself.
  - Time-agnostic: suitable for data where time isn't crucial, often handled with batch workloads.
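A toy, framework-free illustration of the fixed vs sliding window idea; real engines (Spark, Flink, Beam) handle this with watermarks and state, and the event data and window sizes here are invented.

```python
# Toy illustration of fixed (tumbling) vs sliding windows over timestamped events.
# Pure Python, no streaming framework; event shape and window sizes are made up.
from collections import defaultdict

events = [  # (epoch_seconds, value)
    (0, 1), (30, 2), (65, 3), (90, 4), (130, 5),
]

def fixed_windows(events, size_s=60):
    """Assign each event to the 60-second bucket its timestamp falls into."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // size_s * size_s].append(value)
    return dict(buckets)

def sliding_windows(events, size_s=60, step_s=30):
    """Like fixed windows, but a new 60 s window starts every 30 s, so they overlap."""
    last_ts = max(ts for ts, _ in events)
    windows = {}
    start = 0
    while start <= last_ts:
        windows[start] = [v for ts, v in events if start <= ts < start + size_s]
        start += step_s
    return windows

print(fixed_windows(events))    # {0: [1, 2], 60: [3, 4], 120: [5]}
print(sliding_windows(events))  # overlapping buckets keyed by window start
```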
Key idea: don't over-engineer. "Right-time" data is enough for most:
- Moving from daily → hourly already feels huge for the business.
b) Volume (how much)
- High volume means:
  - Need compressed, efficient formats (Parquet/Avro, then Delta/Iceberg/Hudi on top; sketch below)
  - Care about throughput, latency, cost, and retention
- Also: decide how long to keep and where to archive old data.
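A quick sketch of the "compressed, efficient format" point using pandas with Parquet (pyarrow under the hood); Delta/Iceberg/Hudi then layer table features like transactions and time travel on top of files like this. The columns and path are made up.

```python
# Sketch: writing a compressed, columnar file with pandas (pyarrow engine).
import pandas as pd

df = pd.DataFrame(
    {"event_id": [1, 2, 3], "amount": [9.99, 14.50, 3.25]}
)

# Parquet is columnar and compressed; snappy trades some compression ratio for speed.
df.to_parquet("events.parquet", compression="snappy")

print(pd.read_parquet("events.parquet"))
```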
c) Structure / shape
- Structured: tables, fixed schemas (SQL DBs)
- Semi-structured: JSON, XML, nested stuff
- Unstructured: text, images, video, audio

Modern tools let you keep semi-structured data (like JSON) and query it with SQL later. But you still need to validate it and think about missing keys/NULLs (see the sketch below).
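A small sketch of the "missing keys/NULLs" point in plain Python; the payload shape and field names are invented.

```python
# Sketch: defensive handling of semi-structured JSON where keys may be missing.
import json

raw = '{"user_id": 42, "address": {"city": "Berlin"}}'   # note: no "email" key
record = json.loads(raw)

# .get() with a default avoids KeyError and makes the "missing" case explicit.
email = record.get("email")                      # None if absent -> NULL downstream
city = record.get("address", {}).get("city")     # safe nested lookup

row = {"user_id": record["user_id"], "email": email, "city": city}
print(row)  # {'user_id': 42, 'email': None, 'city': 'Berlin'}
```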
d) Format and variety
- Real life = many sources, many formats, many quirks.
- Variety is why ingestion is tricky: you need flexible pipelines and good observability.
5. Choosing an ingestion solution (tools strategy)
This is: “How do we actually implement ingestion?”
They split tools into two big styles:
1) Declarative (“tell it what you want”)
You configure things in a UI or YAML; the tool handles the details.
- Legacy tools: Talend, Pentaho, etc. (enterprise ETL tools, less modern)
- Modern SaaS/OSS: Fivetran, Stitch, Airbyte
  - Many ready-made connectors, easy to set up
- Native platform features (inside Databricks, cloud platforms, etc.)
  - e.g., "connect and ingest" directly from a message bus or cloud storage
- Pros:
  - Fast to get started, less engineering
  - Vendors maintain connectors and handle schema/API changes
- Cons:
  - Less flexible for weird edge cases
  - Vendor lock-in → hard/expensive to switch later
  - You depend on them to add new connectors
2) Imperative ("write the code yourself")
You write code/pipelines: Python scripts, Lambdas, Airflow DAGs, Beam, custom connectors, etc. (see the sketch after this list).
- Pros:
  - Maximum flexibility; you can handle any weird source
  - You decide patterns, testing, standards
- Cons:
  - Expensive in time and people
  - Needs strong engineering discipline (testing, maintainability)
  - Overkill for small teams or simple needs
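For flavor, here is a minimal imperative ingestion script: pull a day of records from an HTTP API and land them as JSON Lines. The endpoint, parameters, and output path are hypothetical; a real job would add retries, pagination, incremental state, and monitoring.

```python
# Sketch of a hand-rolled, imperative ingestion job: pull from an HTTP API and
# land the raw records as JSON Lines. Endpoint and paths are made up.
import json
from datetime import date, timedelta

import requests

API_URL = "https://api.example.com/v1/orders"      # hypothetical endpoint

def ingest_orders(day: date, out_path: str) -> int:
    """Fetch one day of orders and append them as JSON Lines; return row count."""
    resp = requests.get(API_URL, params={"date": day.isoformat()}, timeout=30)
    resp.raise_for_status()                        # fail loudly on HTTP errors
    records = resp.json()
    with open(out_path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)

if __name__ == "__main__":
    yesterday = date.today() - timedelta(days=1)
    n = ingest_orders(yesterday, f"orders_{yesterday}.jsonl")
    print(f"ingested {n} records")
```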
3) Hybrid (what most sane teams do)
- Use declarative tools (Fivetran/Airbyte/native connectors) for:
  - Common sources: Salesforce, Stripe, Google Ads, etc.
- Use custom/imperative code where:
  - The source is weird, niche, or super critical
- Maybe contribute extra connectors back to open source (Airbyte, Singer, dlt, etc.)
Analogy they use:
Most of the time you need a Toyota, not a Formula 1 car.
But sometimes, for very special problems, you do need that race car.