Mastering Databricks: Key Concepts and Features
Databricks Basics
What is Databricks?
Cloud platform for storing, processing, and analyzing data.
Built on Apache Spark: fast, scalable.
Lakehouse = one place for data lake + warehouse + ML.
Used for ETL, streaming, AI/ML.
Persist data (very important idea):
In-memory DataFrame dies when the cluster stops.
Persisted data = saved as Delta / Parquet / tables → survives cluster restarts (sketch below).
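A minimal SQL sketch of the difference, using hypothetical names and a hypothetical path (tmp_events stands in for an in-memory DataFrame; events_delta is the persisted copy):
-- Temp view: exists only for the current Spark session, like an in-memory DataFrame
CREATE TEMP VIEW tmp_events AS SELECT * FROM json.`/path/events.json`;
-- Delta table: written to storage, still there after the cluster restarts
CREATE TABLE events_delta AS SELECT * FROM tmp_events;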
Cluster = compute
A cluster = group of machines that run notebooks & jobs.
Notebooks & magics
Languages: Python, SQL, Scala, R
Switch cells:
%python, %sql, %scala, %r
Run another notebook:
%run /path/to/notebook
Use filesystem & dbutils:
%fs ls /
dbutils.fs.ls("dbfs:/")
dbutils.fs.rm("dbfs:/path", recurse=True)
Show data nicely:
display(df) or display(table_name)
Storage + Metastore + Delta
Hive Metastore (metadata, the "where"):
Stores table names, schema, locations.
"Tells Databricks where the data is."
Storage (DBFS / S3 / ADLS, the "what"):
Holds actual files: Parquet, Delta, CSV, JSON, ...
Delta Transaction Log _delta_log (the "how it changed"):
Tracks all changes to the table.
Enables ACID + time travel.
ACID (remember meaning):
A = Atomic: all-or-nothing operations.
C = Consistent: rules & constraints always respected.
I = Isolated: concurrent writes don't break each other.
D = Durable: once committed, not lost.
Together:
Metastore = where
Storage = what
Delta log = how it changed
Advanced Delta Lake Features (only core ones)
History:
DESCRIBE HISTORY table_name;
Detail (metadata):
DESCRIBE DETAIL table_name;
Time travel (read old version):
SELECT * FROM table_name VERSION AS OF 5;
SELECT * FROM table_name@v5;
Restore table to old version:
RESTORE TABLE table_name TO VERSION AS OF 5;
Optimize layout:
OPTIMIZE table_name ZORDER BY (col1);
Clean old files:
VACUUM table_name RETAIN 168 HOURS;
See actual files in DBFS:
%fs ls 'dbfs:/path/to/folder'
Optimizing Data Layout (Partition / ZORDER / Cluster)
Partition = split data into folders by column(s):
CREATE TABLE sales (...) PARTITIONED BY (year, month);
ALTER TABLE sales ADD PARTITION (year=2025, month=10);
Best when queries filter on those columns (date, region, etc.).
ZORDER = reorder data inside files for fast filtering:
OPTIMIZE sales ZORDER BY (customer_id, date);
Best for columns used in WHERE / JOIN that are not partitions.
Clustering (liquid clustering / AUTO) = flexible, automatic layout:
CREATE TABLE sales (...) CLUSTER BY (customer_id);
ALTER TABLE sales CLUSTER BY AUTO;
AUTO lets Databricks continuously optimize the layout based on query patterns.
Think:
Partition / ZORDER → manual layout
Clustering → automatic layout (you don't tune it)
Databases & Tables (Hive Metastore)
Default database:
Name: default
Path: dbfs:/user/hive/warehouse/
Create database / schema:
CREATE DATABASE mydb;
-- or
CREATE SCHEMA mydb;
Path example:
dbfs:/user/hive/warehouse/mydb.db/
Custom location:
CREATE DATABASE mydb
LOCATION 'dbfs:/custom/path/';
Managed vs External tables:
| Type | Create | Data location | DROP TABLE does... |
|---|---|---|---|
| Managed | CREATE TABLE | Managed by Hive | Deletes data + metadata |
| External | CREATE TABLE ... LOCATION | Your custom path | Deletes metadata only |
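For example (hypothetical table names; the path matches the cleanup commands below), an external table is one whose LOCATION you manage yourself:
-- External Delta table: DROP TABLE removes only the metastore entry, the files stay at LOCATION
CREATE TABLE sales_ext
USING DELTA
LOCATION 'dbfs:/data/external_table_folder/'
AS SELECT * FROM sales;  -- 'sales' is a hypothetical source table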
Delete external data (manually):
%fs rm -r dbfs:/data/external_table_folder/
or
dbutils.fs.rm("dbfs:/data/external_table_folder/", recurse=True)
Choose database before creating:
USE mydb;
Setting Up Delta Tables
CTAS = Create Table As Select:
CREATE TABLE new_table AS
SELECT col1, col2 AS new_name
FROM old_table;
Creates table + loads data in one step.
Can add options:
CREATE TABLE new_table
COMMENT 'description'
PARTITIONED BY (city, date)
LOCATION 'dbfs:/custom/path/'
AS SELECT * FROM source_table;
Table constraints:
ALTER TABLE table_name
ADD CONSTRAINT chk_price CHECK (price > 0);
NOT NULL, CHECK (...) = cleaner data.
Cloning:
Deep clone (full copy + data backup):
CREATE TABLE target_table
DEEP CLONE source_table;
Shallow clone (metadata only, share data files):
CREATE TABLE target_table
SHALLOW CLONE source_table;
Views & Spark Session
Types of views:
| Type | Lifetime | Create |
|---|---|---|
| Stored view | Permanent (until DROP) | CREATE VIEW v AS SELECT ... |
| Temp view | Current session only | CREATE TEMP VIEW v AS ... |
| Global temp | Whole cluster, until restart | CREATE GLOBAL TEMP VIEW v AS ... |
Temp view → only in the notebook/session.
Global temp view → across notebooks on the same cluster.
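Global temp views live in the reserved global_temp schema, so qualify them when reading (hypothetical view name):
CREATE GLOBAL TEMP VIEW recent_orders AS SELECT * FROM orders WHERE order_date >= '2025-01-01';
SELECT * FROM global_temp.recent_orders;  -- works from any notebook attached to the same cluster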
Spark session starts when:
You open & attach a notebook to a cluster.
Spark session ends when:
The notebook is detached, the cluster restarts, or you install new packages.
When the session ends → temp views disappear; global temp views live until cluster restart.
Querying Files
Query files directly:
SELECT * FROM json.`/path/customers.json`;
SELECT * FROM parquet.`/path/data/`;
SELECT * FROM csv.`/path/files_*.csv`;
Better for JSON/Parquet (self-describing schema).
CTAS from JSON/Parquet folder:
CREATE TABLE customers AS
SELECT * FROM json.`/path/customers-json`;
For CSV → use a TEMP VIEW with schema + options:
CREATE TEMP VIEW books_tmp_vw
(book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
path = "/path/books-csv/export_*.csv",
header = "true",
delimiter = ";"
);
CREATE TABLE books AS
SELECT * FROM books_tmp_vw;
read_files() & _metadata
read_files (simple file reading with options):
SELECT * FROM read_files(
'/path/*.csv',
format => 'csv',
header => 'true',
delimiter => ';'
);
Infers schema across files.
Handles options + a _rescued_data column for bad rows.
_metadata column (file info):
SELECT *,
_metadata.file_path,
_metadata.file_name
FROM json.`/path/customers-json`;
Useful for tracking which file each row came from.
Writing to Tables (Overwrite / Append / Merge)
1) Overwrite all data
CREATE OR REPLACE TABLE table_name AS
SELECT ...;
Recreates the table → loses history & some metadata.
INSERT OVERWRITE TABLE table_name
SELECT ...;
Keeps table, schema, Delta history, replaces all rows.
2) Append
INSERT INTO table_name
SELECT ...;
Adds new rows.
Watch out for duplicates.
3) Merge (upsert, best for incremental loads)
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Handles updates + inserts safely.
Great for CDC / incremental data.
Advanced Transformations (JSON, arrays, joins, pivot)
Parse JSON:
SELECT from_json(json_col, schema) AS parsed;
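A concrete sketch, assuming a hypothetical string column profile_json; schema_of_json derives the schema from a sample JSON literal:
SELECT customer_id,
       from_json(profile_json, schema_of_json('{"city":"Paris","age":30}')) AS profile
FROM customers;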
explode (array → rows):
SELECT order_id, explode(books) AS book
FROM orders;
collect (rows → array):
SELECT customer_id, collect_set(order_id) AS order_ids
FROM orders
GROUP BY customer_id;
flatten + array_distinct (nested arrays → one flat, de-duplicated array):
SELECT customer_id,
       array_distinct(flatten(collect_set(books.book_id))) AS book_ids
FROM orders GROUP BY customer_id;
Joins:
SELECT *
FROM orders o
JOIN books b
ON o.book_id = b.book_id;
Set operations:
SELECT * FROM A UNION SELECT * FROM B;
SELECT * FROM A INTERSECT SELECT * FROM B;
SELECT * FROM A MINUS SELECT * FROM B;
Pivot:
SELECT * FROM data
PIVOT (sum(value) FOR key IN ('A', 'B', 'C'));
Higher-Order Functions & UDFs
Filter array:
SELECT
order_id,
FILTER(books, b -> b.quantity >= 2) AS multi_books
FROM orders;
Transform array:
SELECT
order_id,
TRANSFORM(books, b -> b.subtotal * 0.8) AS discounted
FROM orders;
SQL UDF (user-defined function):
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat('https://www.', split(email, '@')[1]);
Use it:
SELECT email, get_url(email) FROM customers;
Structured Streaming (Core Ideas)
Data stream = data that keeps arriving.
Basic pattern:
streamDF = spark.readStream.table("Input_Table")
(streamDF.writeStream
  .trigger(processingTime="2 minutes")
  .outputMode("append")
  .option("checkpointLocation", "/chkpt")
  .table("Output_Table"))
readStream → live input
writeStream → continuous output
checkpointLocation → remembers what has been processed
Triggers:
Default: ~0.5s
Fixed:
.trigger(processingTime="5 minutes")
Once:
.trigger(once=True)
Available now:
.trigger(availableNow=True)
Output modes:
"append"โ only new rows"complete"โ recompute whole result each batch
Checkpointing = reliability + exactly-once (with idempotent sinks like Delta).
Incremental Ingestion (COPY INTO & Auto Loader)
Idempotent concept:
"Run it again → same final result (no duplicates)."
COPY INTO (SQL)
COPY INTO my_table
FROM '/path/files/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header'='true')
COPY_OPTIONS('mergeSchema'='true');
Only loads new files.
Good for small/medium batch ingestion.
Auto Loader (cloudFiles)
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "/schemas/customers")
  .load("/path/data")
  .writeStream
  .option("checkpointLocation", "/chkpt")
  .table("customers"))
For large scale, near real-time, billions of files.
Uses schemaLocation + checkpoint for evolution & reliability.
Useful options:
Only load specific files:
.option("pathGlobFilter", "*.png")Schema evolution:
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
Medallion / Multi-Hop Architecture
Bronze: raw data (from source).
Silver: cleaned, filtered, joined.
Gold: aggregated, business-ready (for BI/ML).
You move data Bronze → Silver → Gold with Delta tables (sketch below).
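A minimal batch-style sketch with hypothetical paths and table names (real pipelines often do the same hops with streaming or DLT):
-- Bronze: raw ingest
CREATE TABLE bronze_orders AS
SELECT *, _metadata.file_name AS source_file FROM json.`/raw/orders/`;
-- Silver: cleaned and typed
CREATE TABLE silver_orders AS
SELECT order_id, customer_id, order_date, CAST(amount AS DOUBLE) AS amount
FROM bronze_orders WHERE order_id IS NOT NULL;
-- Gold: business-ready aggregate
CREATE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue FROM silver_orders GROUP BY order_date;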
DLT (Delta Live Tables) & CDC
DLT:
Define tables using SQL or Python.
System handles dependencies, quality, incremental updates.
Good for Bronze → Silver → Gold pipelines (sketch below).
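A minimal DLT SQL sketch (hypothetical names and path); DLT infers the dependency from the LIVE reference, and the expectation enforces data quality:
-- Bronze: incremental ingest with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM cloud_files('/raw/orders/', 'json');
-- Silver: incremental clean-up with a data-quality expectation
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
FROM STREAM(LIVE.bronze_orders);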
CDC with DLT:
APPLY CHANGES INTO LIVE.target
FROM STREAM(LIVE.cdc_source)
KEYS (id)
APPLY AS DELETE WHEN op = 'DELETE'
SEQUENCE BY seq_col
COLUMNS *;
Applies insert / update / delete automatically.
Keeps target table in sync with CDC feed.
Databricks Asset Bundles (super short)
Bundle = databricks.yml + resources + code.
Used to deploy jobs/pipelines from dev → prod.
Helps standardize deployment.
Databricks SQL & Dashboards
Two main experiences:
Data Science & Engineering → Notebooks, Python, ETL, ML.
Databricks SQL → SQL-only, dashboards, BI, alerts.
You can:
Create dashboards from SQL queries.
Set alerts (e.g., revenue < 5000) → send email/Slack.
Data Governance (Hive Metastore)
Key commands:
GRANT SELECT ON TABLE my_table TO 'user@company.com';
REVOKE SELECT ON TABLE my_table FROM 'user@company.com';
SHOW GRANTS ON TABLE my_table;
Common privileges: SELECT, MODIFY, CREATE, USAGE, READ METADATA, ALL PRIVILEGES.
Best practice: grant to roles/groups, not individuals.
Unity Catalog (Modern Governance)
UC = central governance across workspaces.
3-level name:
SELECT * FROM catalog.schema.table;
Hierarchy:
Metastore
Catalog
Schema
Table / View / Function
Permissions with:
GRANT SELECT ON TABLE catalog.schema.table TO `group_name`;
Data lineage:
UC tracks upstream → transform → downstream.
Helps debug and audit.
Delta Sharing
Open protocol to share Delta tables across orgs.
Provider uses:
CREATE SHARE my_share;
ALTER SHARE my_share ADD TABLE catalog.schema.table;
GRANT SELECT ON SHARE my_share TO RECIPIENT some_recipient;
Recipient reads live data (read-only).
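On the recipient side (Databricks-to-Databricks sharing, hypothetical names), the share is typically mounted as a catalog and queried with the usual three-level name:
CREATE CATALOG customers_shared USING SHARE provider_name.my_share;
SELECT * FROM customers_shared.some_schema.some_table;  -- read-only, always the provider's live data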
Lakehouse Federation
Query external databases (MySQL, PostgreSQL, etc.) directly from Databricks.
No copy needed; external databases appear as foreign catalogs (see the sketch after this list).
Good for live reads / quick analysis.
Heavy / complex queries may be slower because compute happens partly outside Databricks.
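A sketch of the setup (hypothetical names, host, and secret scope): a connection stores the credentials, and a foreign catalog exposes the remote database:
CREATE CONNECTION pg_conn TYPE postgresql
OPTIONS (host 'db.example.com', port '5432', user 'reader', password secret('my_scope', 'pg_pwd'));
CREATE FOREIGN CATALOG pg_sales USING CONNECTION pg_conn
OPTIONS (database 'sales');
SELECT * FROM pg_sales.public.orders;  -- live read against PostgreSQL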
Clusters & SQL Warehouses
Cluster types:
All-purpose: interactive notebooks, ad hoc.
Job clusters: created per job, auto-terminate → cheaper for scheduled workloads.
Best practices:
Enable auto-termination.
Use autoscaling.
Use Job clusters for pipelines.
Use Spot instances when failures are tolerable.
Use Photon for fast SQL / Delta performance.
SQL Warehouses:
For SQL + BI dashboards.
Serverless SQL warehouse:
Start fast, scale automatically.
No cluster tuning.
Use a Pro SQL warehouse when you need custom networking or when serverless is not available.
