Tuesday, November 18, 2025

Data Ingestion

Data ingestion, in essence, involves transferring data from a source to a designated target.

Its primary aim is to usher data into an environment! ronment primed for staging, processing, analysis, and artificial intel! ligence/machine learning (AI/ML). While massive organizations may focus on moving data internally (among teams), for most of us, data ingestion emphasizes pulling data from external sources and directing it to in-house targets. 


ETL steps

Data ingestion: now vs then

  • Old world: traditional ETL

    • Extract → Transform → Load

    • You heavily clean/transform before loading into a warehouse.

  • New world: mostly ELT

    • Extract → Load everything into cheap cloud storage → Transform later

    • Storage is cheap, compute is flexible, so people prefer “store first, think later”.

  • Big trends that changed ingestion:

    • Cloud + warehouses + lakehouses

    • Streaming/real-time data, not just nightly batches

So: we still do “ETL”, but the order + tools changed.



Monday, November 17, 2025

20. Index basics

What is an index? 

Imagine a giant book with 10 million pages.
You want the page that talks about “Spark ETL”.

Without an index, you would:

  • Flip page 1

  • Flip page 2

  • Flip page 3...
    …until you find it.

That’s slow.
Now imagine doing that on a database table with a million rows.
Same pain.

An index is a shortcut : like the index at the back of a book.

In books:

  • “Spark ETL → Page 567”

In databases:

  • WHERE name = "Kavana"
    → The index jumps directly to the row instead of scanning everything.


Why do we need an index?

Without an index:
Database checks EVERY row. (called full table scan)

With an index:
Database jumps to the exact row instantly.

Speed difference:

  • Without index: O(n) → slow as the table grows

  • With index: O(log n) → lightning fast


Small SQL example 

Table: users

Columns: idnameemail

If we search:

    SELECT *
    FROM users
    WHERE email = "kav@example.com";

Without index on email:

DB checks all 5 million rows.

With index on email

DB jumps directly to the row.

Creating the index:

CREATE INDEX idx_email ON users(email);
 CREATE INDEX idx_email ON Users(email);
 SELECT *
    FROM users
    WHERE email = "kav@example.com";
Boom. Search is 100x faster.

What indexes actually store 

An index is like a mini sorted phonebook that maps:

email → row location namerow location id → row location

Think:

  • index = “cheat sheet”

  • table = “full data”


Downside 

Indexes speed up searching, but slow down inserts/updates, because the index must stay sorted.

So:

  • Good for searching (WHERE, JOIN)

  • Too many indexes = slower writes

Wednesday, November 5, 2025

15. Window Functions

Powering Advanced Analytics in SQL

 Window Functions  perform calculations across a set of table rows that are somehow related to the current row, but crucially, they do not collapse the rows. This allows for sophisticated analytical tasks like ranking, calculating running totals, or comparing a row's value to a previous or next row, all within the same result set.

13. Triggers

What is a Trigger?

A trigger in SQL is a special kind of stored program that automatically executes when certain events occur in a table — such as an INSERT, UPDATE, or DELETE.

Think of it like a “background script” that runs whenever your data changes — no manual call needed!


Trigger Syntax 


CREATE TRIGGER TriggerName(
{ BEFORE | AFTER } { INSERT | UPDATE | DELETE }
ON Table_Name
FOR EACH ROW
BEGIN 
--
END;


Example: Automatically Update Inventory After a Sale

Trigger Code

CREATE TRIGGER Update_Inventory_Trigger(
AFTER INSERT 
ON Sales
FOR EACH ROW
BEGIN 
UPDATE Inventory
SET quantity = quantity - :NEW.qantity
WHERE product_id = :NEW.product_id
END;

Explanation

  • :NEW
    → Refers to the newly inserted row in
    Sales

12. Stored Procedure

A Stored Procedure is a set of SQL statements that performs an operation or sequence of operations. Think of it as a complete program stored on the database server.

Example Use Case: Automated Order Processing

A procedure might be used to finalize an order

  1. Update the Inventory table (UPDATE).

  2. Insert a new record into the Sales table (INSERT).

  3. Commit the transaction.

Oracle PL/SQL Stored Procedure

CREATE OR REPLACE PROCEDURE Finalize_Order ( ... )

  • Purpose: Defines a stored procedure in Oracle.

  • CREATE OR REPLACE: Creates a new procedure or replaces an existing one with the same name.

  • Finalize_Order: The name of the stored procedure.

  • (...): The parameter list—used to pass input values into the procedure.


CREATE OR
REPLACE PROCEDURE Finalize_Order(
p_product_id   IN NUMBER,
p_quantity     IN NUMBER,
p_customer_id  IN NUMBER
)
  • Each line defines a parameter:

    • p_product_id: product being sold.

    • p_quantity: how many units are sold.

    • p_customer_id: the buyer’s ID.

  • IN → means data flows into the procedure (input-only parameter).

  • NUMBER → defines the datatype.

IS

    IS

    • Marks the start of the procedure body in PL/SQL.

    • Similar to saying “beginning of declaration and code section.”

    • Use SET SERVEROUTPUT ON; in SQL Developer or SQL*Plus to see the output.


    BEGIN

      BEGIN

      • Marks the start of executable statements.

      • All SQL logic (update, insert, etc.) goes between BEGIN and END.

      • Use SET SERVEROUTPUT ON; in SQL Developer or SQL*Plus to see the output.


       Update Inventory

      UDATE Inventory
      SET quantity = quantity - p_quantity
      WHERE product_id = p_product_id;

      • Reduces stock quantity by the amount sold.

      • Inventory → table name.

      • SET quantity = quantity - p_quantity → subtracts ordered quantity from available stock.

      • WHERE product_id = p_product_id → ensures only that product’s row is updated.


         Insert Sale Record

        INSERT INTO Sales(sales_id, product_id, customer_id, 
        quantity, sale_date)
        VALUES (sales_id.NEXTVAL, p_product_id, 
        p_customer_id, p_quantity, SYSDATE);
        • Adds a new row into the Sales table for this transaction.

        • sales_seq.NEXTVAL → uses a sequence named sales_seq to generate a unique sale ID automatically.

        • SYSDATE → system date/time (when the sale happens).


          Commit Transaction

          COMMIT;
          • Makes the changes permanent in the database (update + insert).

          • Without COMMIT, changes wouldn’t persist after the procedure ends.

          COMMIT;



            Success Message

            DBMS_OUTPUT.PUT_LINE('Order finalized successfully.');
            • Prints a message to the SQL console (for debugging/confirmation).

            • DBMS_OUTPUT is an Oracle package for displaying messages.

            • Use SET SERVEROUTPUT ON; in SQL Developer or SQL*Plus to see the output.

            DBMS_OUTPUT.PUT_LINE('Order finalized successfully.');

            Full code:



            CREATE OR
            REPLACE PROCEDURE Finalize_Order(
            p_product_id   IN NUMBER,
            p_quantity     IN NUMBER,
            p_customer_id  IN NUMBER
            )
            IS
            BEGIN
              -- Step 1 : Update Inventory
              UDATE Inventory
              SET quantity = quantity - p_quantity
              WHERE product_id = p_product_id;
              -- Step 2 : Insert Sales Values 
            INSERT INTO Sales(sales_id, product_id,
            customer_id, quantity, sale_date)
              VALUES (sales_id.NEXTVAL, p_product_id,
             p_customer_id, p_quantity, SYSDATE);
            COMMIT;
            DBMS_OUTPUT.PUT_LINE('Order finalized successfully.');
            END Finalize_Order;


            To Run it: 

            SET SERVEROUTPUT ON;
            EXEC Finalize_Order(101, 2, 501);

            Fl Key Characteristics

            • Purpose: To execute a sequence of DML (INSERTUPDATEDELETE) or DDL (CREATEALTER) commands, or complex retrieval tasks.

            • Return Value: Can return zero, one, or multiple result sets (tables of data). It can also return a single numeric value (an exit code or status code) to indicate success or failure.

            • Execution: Executed using the EXEC or EXECUTE command.

            • Transactional Control: Procedures can manage transactions (using BEGIN TRANCOMMITROLLBACK), making them ideal for ensuring multi-step processes either fully succeed or completely fail.




            11. Views and Materialized Views

            In SQL, Views and Materialized Views are essential tools for both simplifying complex queries and significantly improving reporting performance. They don't store data themselves (in the case of a standard View) but rather act as permanent, pre-defined windows into your underlying data.


            1. Views: The Virtual Table

            A View is a saved SQL query that is stored in the database as a virtual table. When you query a View, the database executes the underlying SQL query in real-time.

            Purpose and Usage

            • Simplification: Views hide the complexity of joining multiple tables. Instead of writing a complex 5-table JOIN query every time, you simply SELECT * FROM SalesReportView.

            • Security: You can restrict a user's access to only certain columns and rows by granting permission to the View, while denying access to the sensitive base tables.

            • Logic Abstraction: If the definition of a calculation (like "Active Users") changes, you update the View once, and all reports and applications relying on it are automatically updated.

            Creation Example

            This View combines employee names with their department names, hiding the complexity of the JOIN.

            CREATE VIEW EmployeeDepartmentView AS
            SELECT
                E.Name,
                D.DeptName,
                E.Salary
            FROM
                Employees E
            INNER JOIN
                Departments D ON E.DeptID = D.DeptID
            WHERE
                E.IsActive = TRUE;
            

            Key Feature: Real-Time Data

            Views do not store data; they are just stored SQL code. Every time the View is queried, it pulls the very latest data from the base tables.


            2. Materialized Views: The Performance Snapshot

            A Materialized View (MV) is similar to a standard View, but it is different in a critical way: it stores the results of the query in a dedicated physical table.

            Purpose and Usage

            • Performance: MVs are designed for speed. They pre-calculate and store the results of complex aggregations (like SUM,AVG,GROUP BY) or slow, multi-table joins.

            • Reporting: Ideal for dashboards and reports that query the same, large data set repeatedly, as the database simply reads from a static table rather than re-running the complex query.

            Creation Example

            This MV calculates total sales per city and stores the result.

            CREATE MATERIALIZED VIEW CitySalesSummary AS
            SELECT
                C.City,
                SUM(S.Quantity * P.Price) AS TotalRevenue
            FROM
                Sales S
            JOIN
                Customer C ON S.CustomerID = C.CustomerID
            JOIN
                Product P ON S.ProductID = P.ProductID
            GROUP BY
                C.City;
            

            Key Feature: Refreshment

            Since the data is static, MVs must be periodically updated or "refreshed."

            • Manual Refresh: The MV is only updated when you explicitly run the refresh command (e.g., REFRESH MATERIALIZED VIEW CitySalesSummary;).

            • Scheduled Refresh: In production environments, MVs are often refreshed automatically on a schedule (e.g., every hour, or once nightly) to balance data freshness against performance gains.

            TF-IDF

              TF-IDF stands for Term Frequency–Inverse Document Frequency . It’s a numerical statistic used in text mining and natural language process...