Mastering Databricks: Key Concepts and Features

๐ŸŒ Databricks Basics

What is Databricks?

  • Cloud platform for storing, processing, and analyzing data.

  • Built on Apache Spark → fast, scalable.

  • Lakehouse = one place for data lake + warehouse + ML.

  • Used for ETL, streaming, AI/ML.

Persist data (very important idea):

  • In-memory DataFrame dies when the cluster stops.

  • Persisted data = saved as Delta / Parquet / tables → survives cluster restarts (see the sketch below).
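
A minimal PySpark sketch of the idea (the DataFrame, table name, and path are hypothetical):

df = spark.range(10)  # in-memory DataFrame: gone when the cluster terminates

# persisted as a Delta table: survives cluster restarts
df.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

# or persisted as Delta files at a storage path
df.write.format("delta").mode("overwrite").save("dbfs:/tmp/sales_delta")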

Cluster = compute

  • A cluster = group of machines that run notebooks & jobs.

Notebooks & magics

  • Languages: Python, SQL, Scala, R

  • Switch cells:

    • %python, %sql, %scala, %r

  • Run another notebook:

    • %run /path/to/notebook

  • Use filesystem & dbutils:

    • %fs ls /

    • dbutils.fs.ls("dbfs:/")

    • dbutils.fs.rm("dbfs:/path", recurse=True)

  • Show data nicely:

    • display(df) or display(spark.table("table_name"))


🧱 Storage + Metastore + Delta

Hive Metastore (metadata, the "where"):

  • Stores table names, schema, locations.

  • "Tells Databricks where the data is."

Storage (DBFS / S3 / ADLS – the "what"):

  • Holds actual files: Parquet, Delta, CSV, JSON…

Delta Transaction Log _delta_log (the "how it changed"):

  • Tracks all changes to the table.

  • Enables ACID + time travel.

ACID (remember the meaning):

  • A – Atomic: all-or-nothing operations.

  • C – Consistent: rules & constraints always respected.

  • I – Isolated: concurrent writes don't break each other.

  • D – Durable: once committed, not lost.

Together:

  • Metastore = where

  • Storage = what

  • Delta log = how it changed (see the sketch below)
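
To see the three pieces on disk, a small sketch (hypothetical managed table "events" in the default database):

df.write.format("delta").saveAsTable("events")   # registers metadata in the metastore = the "where"

display(dbutils.fs.ls("dbfs:/user/hive/warehouse/events/"))             # Parquet data files = the "what"
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/events/_delta_log/"))  # JSON commit files = the "how it changed"
display(spark.sql("DESCRIBE EXTENDED events"))                          # metastore view of the same table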


๐Ÿ” Advanced Delta Lake Features (only core ones)

  • History:

    DESCRIBE HISTORY table_name;
  • Detail (metadata):

    DESCRIBE DETAIL table_name;
  • Time travel (read old version):

    SELECT * FROM table_name VERSION AS OF 5;
    SELECT * FROM table_name@v5;
  • Restore table to old version:

    RESTORE TABLE table_name TO VERSION AS OF 5;
  • Optimize layout:

    OPTIMIZE table_name ZORDER BY (col1);
  • Clean old files:

    VACUUM table_name RETAIN 168 HOURS;
  • See actual files in DBFS:

    %fs ls 'dbfs:/path/to/folder'

๐Ÿ“ Optimizing Data Layout (Partition / ZORDER / Cluster)

Partition = split data into folders by column(s):

CREATE TABLE sales (...) PARTITIONED BY (year, month);
ALTER TABLE sales ADD PARTITION (year=2025, month=10);  -- Hive-style tables only; Delta adds partitions automatically as data arrives

  • Best when queries filter on those columns (date, region, etc.).

ZORDER = reorder data inside files for fast filtering:

OPTIMIZE sales ZORDER BY (customer_id, date);

  • Best for columns used in WHERE / JOIN that are not partitions.

Clustering (liquid clustering, AUTO) = automatic layout:

CREATE TABLE sales (...) CLUSTER BY (customer_id);
ALTER TABLE sales CLUSTER BY AUTO;

  • AUTO lets Databricks continuously optimize layout based on query patterns.

  • Think:

    • Partition / ZORDER → manual layout

    • Clustering → automatic layout (you don't tune it)


🧠 Databases & Tables (Hive Metastore)

Default database:

  • Name: default

  • Path:

    • dbfs:/user/hive/warehouse/

Create database / schema:

CREATE DATABASE mydb;
-- or
CREATE SCHEMA mydb;

Path example:

  • dbfs:/user/hive/warehouse/mydb.db/

Custom location:

CREATE DATABASE mydb
LOCATION 'dbfs:/custom/path/';

Managed vs External tables:

Type     | Create                    | Data location    | DROP TABLE does…
Managed  | CREATE TABLE              | Managed by Hive  | Deletes data + metadata
External | CREATE TABLE ... LOCATION | Your custom path | Deletes metadata only
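
A minimal sketch of the difference via spark.sql (table names are hypothetical; the path matches the example below):

# Managed: Hive controls the storage location; DROP TABLE deletes data + metadata
spark.sql("CREATE TABLE managed_demo (id INT, name STRING)")

# External: data stays at your path; DROP TABLE deletes metadata only
spark.sql("""
  CREATE TABLE external_demo (id INT, name STRING)
  LOCATION 'dbfs:/data/external_table_folder/'
""")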

Delete external data (manually):

%fs rm -r dbfs:/data/external_table_folder/

or

dbutils.fs.rm("dbfs:/data/external_table_folder/", recurse=True)

Choose database before creating:

USE mydb;

🧱 Setting Up Delta Tables

CTAS – Create Table As Select:

CREATE TABLE new_table AS
SELECT col1, col2 AS new_name
FROM old_table;

  • Creates table + loads data in one step.

  • Can add options:

CREATE TABLE new_table
COMMENT 'description'
PARTITIONED BY (city, date)
LOCATION 'dbfs:/custom/path/'
AS SELECT * FROM source_table;

Table constraints:

ALTER TABLE table_name
ADD CONSTRAINT chk_price CHECK (price > 0);

  • NOT NULL, CHECK (...) = cleaner data.

Cloning:

  • Deep clone (full copy + data backup):

CREATE TABLE target_table
DEEP CLONE source_table;

  • Shallow clone (metadata only, share data files):

CREATE TABLE target_table
SHALLOW CLONE source_table;

๐Ÿ‘๏ธ Views & Spark Session

Types of views:

Type        | Lifetime                     | Create
Stored view | Permanent (until DROP)       | CREATE VIEW v AS SELECT ...
Temp view   | Current session only         | CREATE TEMP VIEW v AS ...
Global temp | Whole cluster, until restart | CREATE GLOBAL TEMP VIEW v AS ...

  • Temp view → only in the current notebook/session.

  • Global temp view → shared across notebooks on the same cluster; queried via the global_temp schema (see the sketch below).
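
    A small sketch, assuming a DataFrame df; global temp views are always read through the global_temp schema:

    df.createOrReplaceTempView("orders_tmp")             # this notebook/session only
    df.createOrReplaceGlobalTempView("orders_glob")      # any notebook on this cluster

    display(spark.sql("SELECT * FROM orders_tmp"))                # temp view: plain name
    display(spark.sql("SELECT * FROM global_temp.orders_glob"))   # global temp: global_temp prefix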

Spark session starts when:

  • You open & attach a notebook to a cluster.

Spark session ends when:

  • The notebook is detached, the cluster restarts, or you install new packages.

  • When the session ends → temp views disappear; global temp views live until cluster restart.


📂 Querying Files

Query files directly:

SELECT * FROM json.`/path/customers.json`;
SELECT * FROM parquet.`/path/data/`;
SELECT * FROM csv.`/path/files_*.csv`;

Direct file queries work best for self-describing formats (JSON, Parquet); CSV needs a schema and options (see below).

CTAS from JSON/Parquet folder:

CREATE TABLE customers AS
SELECT * FROM json.`/path/customers-json`;

For CSV → use TEMP VIEW with schema + options:

CREATE TEMP VIEW books_tmp_vw
(book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
path = "/path/books-csv/export_*.csv",
header = "true",
delimiter = ";"
);
CREATE TABLE books AS
SELECT * FROM books_tmp_vw;


🧩 read_files() & _metadata

read_files (simple file reading with options):

SELECT * FROM read_files(
'/path/*.csv',
format => 'csv',
header => 'true',
delimiter => ';'
);

  • Infers schema across files.

  • Handles options and adds a _rescued_data column for rows that don't match the inferred schema.

_metadata column (file info):

SELECT *,
_metadata.file_path,
_metadata.file_name
FROM json.`/path/customers-json`;

  • Useful for tracking which file each row came from.


📥 Writing to Tables (Overwrite / Append / Merge)

1๏ธโƒฃ Overwrite all data

CREATE OR REPLACE TABLE table_name AS
SELECT ...;

  • Recreates the table and can change its schema; on Delta, previous versions remain in the table history.

INSERT OVERWRITE TABLE table_name
SELECT ...;

  • Keeps the table and its schema (new data must match it), replaces all rows, records the operation in Delta history.

2๏ธโƒฃ Append

INSERT INTO table_name
SELECT ...;

  • Adds new rows.

  • Watch out for duplicates.

3๏ธโƒฃ Merge (upsert โ€“ best for incremental loads)

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

  • Handles updates + inserts safely.

  • Great for CDC / incremental data.


🔧 Advanced Transformations (JSON, arrays, joins, pivot)

Parse JSON:

SELECT from_json(json_col, schema) AS parsed;
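
The schema argument above is a placeholder; a minimal PySpark sketch, assuming a DataFrame df with a string column json_col shaped like {"id": 1, "name": "a"}:

from pyspark.sql.functions import from_json, schema_of_json, col

sample = '{"id": 1, "name": "a"}'   # example document used only to derive the schema
parsed_df = (df
    .withColumn("parsed", from_json(col("json_col"), schema_of_json(sample)))
    .select("parsed.id", "parsed.name"))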

explode (array → rows):

SELECT order_id, explode(books) AS book
FROM orders;

collect_set (rows → array of unique values):

SELECT customer_id, collect_set(order_id) AS order_ids
FROM orders
GROUP BY customer_id;

flatten + array_distinct:

array_distinct(flatten(collect_set(books.book_id)))

Joins:

SELECT *
FROM orders o
JOIN books b
ON o.book_id = b.book_id;

Set operations:

SELECT * FROM A UNION SELECT * FROM B;
SELECT * FROM A INTERSECT SELECT * FROM B;
SELECT * FROM A MINUS SELECT * FROM B;

Pivot:

SELECT * FROM data
PIVOT (sum(value) FOR key IN ('A', 'B', 'C'));

🧮 Higher-Order Functions & UDFs

Filter array:

SELECT
order_id,
FILTER(books, b -> b.quantity >= 2) AS multi_books
FROM orders;

Transform array:

SELECT
order_id,
TRANSFORM(books, b -> b.subtotal * 0.8) AS discounted
FROM orders;

SQL UDF (user-defined function):

CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat('https://www.', split(email, '@')[1]);

Use it:

SELECT email, get_url(email) FROM customers;

⚡ Structured Streaming (Core Ideas)

Data stream = data that keeps arriving.

Basic pattern:

streamDF = spark.readStream.table("Input_Table")

(streamDF.writeStream
    .trigger(processingTime="2 minutes")
    .outputMode("append")
    .option("checkpointLocation", "/chkpt")
    .table("Output_Table"))

  • readStream → live input

  • writeStream → continuous output

  • checkpointLocation → remembers what's processed

Triggers:

  • Default: ~0.5s

  • Fixed: .trigger(processingTime="5 minutes")

  • Once: .trigger(once=True)

  • Available now: .trigger(availableNow=True)

Output modes:

  • "append" โ†’ only new rows

  • "complete" โ†’ recompute whole result each batch

Checkpointing = reliability + exactly-once (with idempotent sinks like Delta).


๐Ÿ” Incremental Ingestion (COPY INTO & Auto Loader)

Idempotent concept:
"Run it again → same final result (no duplicates)."

📦 COPY INTO (SQL)

COPY INTO my_table
FROM '/path/files/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header'='true')
COPY_OPTIONS('mergeSchema'='true');

  • Only loads new files.

  • Good for small/medium batch ingestion.

โš™๏ธ Auto Loader (cloudFiles)

(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/schemas/customers")
    .load("/path/data")
    .writeStream
    .option("checkpointLocation", "/chkpt")
    .table("customers"))

  • For large scale, near real-time, billions of files.

  • Uses schemaLocation + checkpoint for evolution & reliability.

Useful options:

  • Only load specific files:

    .option("pathGlobFilter", "*.png")
  • Schema evolution:

    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")

🥇 Medallion / Multi-Hop Architecture

  • Bronze: raw data (from source).

  • Silver: cleaned, filtered, joined.

  • Gold: aggregated, business-ready (for BI/ML).

You move data: Bronze → Silver → Gold with Delta tables (sketched below).
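
A compact batch sketch of the flow (table names, source path, and columns are hypothetical):

from pyspark.sql.functions import col, to_date, sum as _sum

# Bronze: raw ingest, stored as-is
(spark.read.json("dbfs:/raw/orders/")
    .write.format("delta").mode("append").saveAsTable("bronze_orders"))

# Silver: cleaned and filtered
(spark.table("bronze_orders")
    .where(col("order_id").isNotNull())
    .withColumn("order_date", to_date("order_timestamp"))
    .write.format("delta").mode("overwrite").saveAsTable("silver_orders"))

# Gold: aggregated, business-ready
(spark.table("silver_orders")
    .groupBy("order_date")
    .agg(_sum("amount").alias("revenue"))
    .write.format("delta").mode("overwrite").saveAsTable("gold_daily_revenue"))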


🤖 DLT (Delta Live Tables) & CDC

DLT:

  • Define tables using SQL or Python (see the Python sketch below).

  • System handles dependencies, quality, incremental updates.

  • Good for Bronze → Silver → Gold pipelines.
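
A minimal Python DLT sketch (table names and the source path are hypothetical; the dlt module is only available inside a DLT pipeline):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested with Auto Loader")
def bronze_orders():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/raw/orders"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_id", "order_id IS NOT NULL")   # data-quality expectation
def silver_orders():
    return dlt.read_stream("bronze_orders").where(col("quantity") > 0)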

CDC with DLT:

APPLY CHANGES INTO LIVE.target
FROM STREAM(LIVE.cdc_source)
KEYS (id)
APPLY AS DELETE WHEN op = 'DELETE'
SEQUENCE BY seq_col
COLUMNS *;

  • Applies insert / update / delete automatically.

  • Keeps target table in sync with CDC feed.


📦 Databricks Asset Bundles (super short)

  • Bundle = databricks.yml + resources + code.

  • Used to deploy jobs/pipelines from dev → prod.

  • Helps standardize deployment.


📊 Databricks SQL & Dashboards

Two main experiences:

  • Data Science & Engineering → Notebooks, Python, ETL, ML.

  • Databricks SQL → SQL-only, dashboards, BI, alerts.

You can:

  • Create dashboards from SQL queries.

  • Set alerts (e.g., revenue < 5000) → send email/Slack.


๐Ÿ” Data Governance (Hive Metastore)

Key commands:

GRANT SELECT ON TABLE my_table TO `user@company.com`;
REVOKE SELECT ON TABLE my_table FROM `user@company.com`;
SHOW GRANTS ON TABLE my_table;

Common privileges: SELECT, MODIFY, CREATE, USAGE, READ_METADATA, ALL PRIVILEGES.

Best practice: grant to roles/groups, not individuals.


🧱 Unity Catalog (Modern Governance)

  • UC = central governance across workspaces.

  • 3-level name:

    SELECT * FROM catalog.schema.table;

Hierarchy:

  • Metastore

    • Catalog

      • Schema

        • Table / View / Function

Permissions with:

GRANT SELECT ON TABLE catalog.schema.table TO `group_name`;

Data lineage:

  • UC tracks upstream → transform → downstream.

  • Helps debug and audit.


🔗 Delta Sharing

  • Open protocol to share Delta tables across orgs.

  • Provider uses:

CREATE SHARE my_share;
ALTER SHARE my_share ADD TABLE catalog.schema.table;
GRANT SELECT ON SHARE my_share TO RECIPIENT some_recipient;

  • Recipient reads live data (read-only).


๐ŸŒ Lakehouse Federation

  • Query external databases (MySQL, PostgreSQL, etc.) directly from Databricks.

  • No copy needed; external databases appear as foreign catalogs (see the sketch below).

  • Good for live reads / quick analysis.

  • Heavy / complex queries may be slower because compute happens partly outside Databricks.
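
A hedged setup sketch via spark.sql (connection name, catalog name, host, database, and secret scope are placeholders):

# One-time admin setup: a UC connection plus a foreign catalog mirroring the external DB
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS mysql_conn TYPE mysql
  OPTIONS (host 'db.example.com', port '3306',
           user 'reader', password secret('my_scope', 'mysql_pw'))
""")
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS mysql_cat
  USING CONNECTION mysql_conn OPTIONS (database 'shop')
""")

# Afterwards the external tables read like any Unity Catalog table
display(spark.sql("SELECT * FROM mysql_cat.shop.orders LIMIT 10"))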


๐Ÿ–ฅ๏ธ Clusters & SQL Warehouses

Cluster types:

  • All-purpose: interactive notebooks, ad hoc.

  • Job clusters: created per job, auto-terminate → cheaper for scheduled workloads.

Best practices:

  • Enable auto-termination.

  • Use autoscaling.

  • Use Job clusters for pipelines.

  • Use Spot instances when failures are tolerable.

  • Use Photon for fast SQL / Delta performance.

SQL Warehouses:

  • For SQL + BI dashboards.

  • Serverless SQL warehouse:

    • Start fast, scale automatically.

    • No cluster tuning.

  • Use a Pro SQL warehouse when you need custom networking or when serverless is not available.