Mastering Databricks: Key Concepts and Features
Databricks Basics
What is Databricks?
Cloud platform for storing, processing, and analyzing data.
Built on Apache Spark: fast, scalable.
Lakehouse = one place for data lake + warehouse + ML.
Used for ETL, streaming, AI/ML.
Persist data (very important idea):
In-memory DataFrame dies when the cluster stops.
Persisted data = saved as Delta / Parquet / tables → survives cluster restarts (sketch below).
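A minimal SQL sketch of the difference, using hypothetical names and a hypothetical path (tmp_events stands in for an in-memory DataFrame; events_delta is the persisted copy):
-- Temp view: exists only for the current Spark session, like an in-memory DataFrame
CREATE TEMP VIEW tmp_events AS SELECT * FROM json.`/path/events.json`;
-- Delta table: written to storage, still there after the cluster restarts
CREATE TABLE events_delta AS SELECT * FROM tmp_events;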
Cluster = compute
A cluster = group of machines that run notebooks & jobs.
Notebooks & magics
Languages: Python, SQL, Scala, R
Switch cells:
%python, %sql, %scala, %r
Run another notebook:
%run /path/to/notebook
Use filesystem & dbutils:
%fs ls /
dbutils.fs.ls("dbfs:/")
dbutils.fs.rm("dbfs:/path", recurse=True)
Show data nicely:
display(df) or display(table_name)
Storage + Metastore + Delta
Hive Metastore (metadata, the "where"):
Stores table names, schema, locations.
"Tells Databricks where the data is."
Storage (DBFS / S3 / ADLS, the "what"):
Holds actual files: Parquet, Delta, CSV, JSON, ...
Delta Transaction Log _delta_log (the "how it changed"):
Tracks all changes to the table.
Enables ACID + time travel.
ACID (remember meaning):
A = Atomic: all-or-nothing operations.
C = Consistent: rules & constraints always respected.
I = Isolated: concurrent writes don't break each other.
D = Durable: once committed, not lost.
Together:
Metastore = where
Storage = what
Delta log = how it changed
Advanced Delta Lake Features (only core ones)
History:
DESCRIBE HISTORY table_name;
Detail (metadata):
DESCRIBE DETAIL table_name;
Time travel (read old version):
SELECT * FROM table_name VERSION AS OF 5;
SELECT * FROM table_name@v5;
Restore table to old version:
RESTORE TABLE table_name TO VERSION AS OF 5;
Optimize layout:
OPTIMIZE table_name ZORDER BY (col1);
Clean old files:
VACUUM table_name RETAIN 168 HOURS;
See actual files in DBFS:
%fs ls 'dbfs:/path/to/folder'
Optimizing Data Layout (Partition / ZORDER / Cluster)
Partition = split data into folders by column(s):
CREATE TABLE sales (...) PARTITIONED BY (year, month);
ALTER TABLE sales ADD PARTITION (year=2025, month=10);
Best when queries filter on those columns (date, region, etc.).
ZORDER = reorder data inside files for fast filtering:
OPTIMIZE sales ZORDER BY (customer_id, date);
Best for columns used in WHERE / JOIN that are not partitions.
Clustering (liquid clustering / AUTO) = flexible, automatic layout:
CREATE TABLE sales (...) CLUSTER BY (customer_id);
ALTER TABLE sales CLUSTER BY AUTO;
AUTO lets Databricks continuously optimize the layout based on query patterns.
Think:
Partition / ZORDER → manual layout
Clustering → automatic layout (you don't tune it)
Databases & Tables (Hive Metastore)
Default database:
Name: default
Path: dbfs:/user/hive/warehouse/
Create database / schema:
CREATE DATABASE mydb;
-- or
CREATE SCHEMA mydb;
Path example:
dbfs:/user/hive/warehouse/mydb.db/
Custom location:
CREATE DATABASE mydb
LOCATION 'dbfs:/custom/path/';
Managed vs External tables:
| Type | Create | Data location | DROP TABLE does... |
|---|---|---|---|
| Managed | CREATE TABLE | Managed by Hive | Deletes data + metadata |
| External | CREATE TABLE ... LOCATION | Your custom path | Deletes metadata only |
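For example (hypothetical table names; the path matches the cleanup commands below), an external table is one whose LOCATION you manage yourself:
-- External Delta table: DROP TABLE removes only the metastore entry, the files stay at LOCATION
CREATE TABLE sales_ext
USING DELTA
LOCATION 'dbfs:/data/external_table_folder/'
AS SELECT * FROM sales;  -- 'sales' is a hypothetical source table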
Delete external data (manually):
%fs rm -r dbfs:/data/external_table_folder/
or
dbutils.fs.rm("dbfs:/data/external_table_folder/", recurse=True)
Choose database before creating:
USE mydb;
Setting Up Delta Tables
CTAS = Create Table As Select:
CREATE TABLE new_table AS
SELECT col1, col2 AS new_name
FROM old_table;
Creates table + loads data in one step.
Can add options:
CREATE TABLE new_table
COMMENT 'description'
PARTITIONED BY (city, date)
LOCATION 'dbfs:/custom/path/'
AS SELECT * FROM source_table;
Table constraints:
ALTER TABLE table_name
ADD CONSTRAINT chk_price CHECK (price > 0);
NOT NULL, CHECK (...) = cleaner data.
Cloning:
Deep clone (full copy + data backup):
CREATE TABLE target_table
DEEP CLONE source_table;
Shallow clone (metadata only, share data files):
CREATE TABLE target_table
SHALLOW CLONE source_table;
Views & Spark Session
Types of views:
| Type | Lifetime | Create |
|---|---|---|
| Stored view | Permanent (until DROP) | CREATE VIEW v AS SELECT ... |
| Temp view | Current session only | CREATE TEMP VIEW v AS ... |
| Global temp | Whole cluster, until restart | CREATE GLOBAL TEMP VIEW v AS ... |
Temp view → only in the notebook/session.
Global temp view → across notebooks on the same cluster.
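Global temp views live in the reserved global_temp schema, so qualify them when reading (hypothetical view name):
CREATE GLOBAL TEMP VIEW recent_orders AS SELECT * FROM orders WHERE order_date >= '2025-01-01';
SELECT * FROM global_temp.recent_orders;  -- works from any notebook attached to the same cluster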
Spark session starts when:
You open & attach a notebook to a cluster.
Spark session ends when:
The notebook is detached, the cluster restarts, or you install new packages.
When the session ends → temp views disappear; global temp views live until cluster restart.
Querying Files
Query files directly:
SELECT * FROM json.`/path/customers.json`;
SELECT * FROM parquet.`/path/data/`;
SELECT * FROM csv.`/path/files_*.csv`;
Better for JSON/Parquet (self-describing schema).
CTAS from JSON/Parquet folder:
CREATE TABLE customers AS
SELECT * FROM json.`/path/customers-json`;
For CSV → use a TEMP VIEW with schema + options:
CREATE TEMP VIEW books_tmp_vw
(book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
path = "/path/books-csv/export_*.csv",
header = "true",
delimiter = ";"
);
CREATE TABLE books AS
SELECT * FROM books_tmp_vw;
read_files() & _metadata
read_files (simple file reading with options):
SELECT * FROM read_files(
'/path/*.csv',
format => 'csv',
header => 'true',
delimiter => ';'
);
Infers schema across files.
Handles options + a _rescued_data column for bad rows.
_metadata column (file info):
SELECT *,
_metadata.file_path,
_metadata.file_name
FROM json.`/path/customers-json`;
Useful for tracking which file each row came from.
Writing to Tables (Overwrite / Append / Merge)
1) Overwrite all data
CREATE OR REPLACE TABLE table_name AS
SELECT ...;
Recreates the table → loses history & some metadata.
INSERT OVERWRITE TABLE table_name
SELECT ...;
Keeps table, schema, Delta history, replaces all rows.
2) Append
INSERT INTO table_name
SELECT ...;
Adds new rows.
Watch out for duplicates.
3) Merge (upsert, best for incremental loads)
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Handles updates + inserts safely.
Great for CDC / incremental data.
Advanced Transformations (JSON, arrays, joins, pivot)
Parse JSON:
SELECT from_json(json_col, schema) AS parsed;
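A concrete sketch, assuming a hypothetical string column profile_json; schema_of_json derives the schema from a sample JSON literal:
SELECT customer_id,
       from_json(profile_json, schema_of_json('{"city":"Paris","age":30}')) AS profile
FROM customers;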
explode (array → rows):
SELECT order_id, explode(books) AS book
FROM orders;
collect (rows → array):
SELECT customer_id, collect_set(order_id) AS order_ids
FROM orders
GROUP BY customer_id;
flatten + array_distinct (nested arrays → one flat, de-duplicated array):
SELECT customer_id,
       array_distinct(flatten(collect_set(books.book_id))) AS book_ids
FROM orders GROUP BY customer_id;
Joins:
SELECT *
FROM orders o
JOIN books b
ON o.book_id = b.book_id;
Set operations:
SELECT * FROM A UNION SELECT * FROM B;
SELECT * FROM A INTERSECT SELECT * FROM B;
SELECT * FROM A MINUS SELECT * FROM B;
Pivot:
SELECT * FROM data
PIVOT (sum(value) FOR key IN ('A', 'B', 'C'));
Higher-Order Functions & UDFs
Filter array:
SELECT
order_id,
FILTER(books, b -> b.quantity >= 2) AS multi_books
FROM orders;
Transform array:
SELECT
order_id,
TRANSFORM(books, b -> b.subtotal * 0.8) AS discounted
FROM orders;
SQL UDF (user-defined function):
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat('https://www.', split(email, '@')[1]);
Use it:
SELECT email, get_url(email) FROM customers;
Structured Streaming (Core Ideas)
Data stream = data that keeps arriving.
Basic pattern:
streamDF = spark.readStream.table("Input_Table")
(streamDF.writeStream
  .trigger(processingTime="2 minutes")
  .outputMode("append")
  .option("checkpointLocation", "/chkpt")
  .table("Output_Table"))
readStream → live input
writeStream → continuous output
checkpointLocation → remembers what has been processed
Triggers:
Default: ~0.5s
Fixed:
.trigger(processingTime="5 minutes")
Once:
.trigger(once=True)
Available now:
.trigger(availableNow=True)
Output modes:
"append"โ only new rows"complete"โ recompute whole result each batch
Checkpointing = reliability + exactly-once (with idempotent sinks like Delta).
Incremental Ingestion (COPY INTO & Auto Loader)
Idempotent concept:
"Run it again → same final result (no duplicates)."
COPY INTO (SQL)
COPY INTO my_table
FROM '/path/files/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header'='true')
COPY_OPTIONS('mergeSchema'='true');
Only loads new files.
Good for small/medium batch ingestion.
Auto Loader (cloudFiles)
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.schemaLocation", "/schemas/customers")
  .load("/path/data")
  .writeStream
  .option("checkpointLocation", "/chkpt")
  .table("customers"))
For large scale, near real-time, billions of files.
Uses schemaLocation + checkpoint for evolution & reliability.
Useful options:
Only load specific files:
.option("pathGlobFilter", "*.png")Schema evolution:
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
Medallion / Multi-Hop Architecture
Bronze: raw data (from source).
Silver: cleaned, filtered, joined.
Gold: aggregated, business-ready (for BI/ML).
You move data Bronze → Silver → Gold with Delta tables (sketch below).
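A minimal batch-style sketch with hypothetical paths and table names (real pipelines often do the same hops with streaming or DLT):
-- Bronze: raw ingest
CREATE TABLE bronze_orders AS
SELECT *, _metadata.file_name AS source_file FROM json.`/raw/orders/`;
-- Silver: cleaned and typed
CREATE TABLE silver_orders AS
SELECT order_id, customer_id, order_date, CAST(amount AS DOUBLE) AS amount
FROM bronze_orders WHERE order_id IS NOT NULL;
-- Gold: business-ready aggregate
CREATE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue FROM silver_orders GROUP BY order_date;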
DLT (Delta Live Tables) & CDC
DLT:
Define tables using SQL or Python.
System handles dependencies, quality, incremental updates.
Good for Bronze → Silver → Gold pipelines (sketch below).
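A minimal DLT SQL sketch (hypothetical names and path); DLT infers the dependency from the LIVE reference, and the expectation enforces data quality:
-- Bronze: incremental ingest with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM cloud_files('/raw/orders/', 'json');
-- Silver: incremental clean-up with a data-quality expectation
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
FROM STREAM(LIVE.bronze_orders);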
CDC with DLT:
APPLY CHANGES INTO LIVE.target
FROM STREAM(LIVE.cdc_source)
KEYS (id)
APPLY AS DELETE WHEN op = 'DELETE'
SEQUENCE BY seq_col
COLUMNS *;
Applies insert / update / delete automatically.
Keeps target table in sync with CDC feed.
Databricks Asset Bundles (super short)
Bundle = databricks.yml + resources + code.
Used to deploy jobs/pipelines from dev → prod.
Helps standardize deployment.
Databricks SQL & Dashboards
Two main experiences:
Data Science & Engineering → Notebooks, Python, ETL, ML.
Databricks SQL → SQL-only, dashboards, BI, alerts.
You can:
Create dashboards from SQL queries.
Set alerts (e.g., revenue < 5000) → send email/Slack.
Data Governance (Hive Metastore)
Key commands:
GRANT SELECT ON TABLE my_table TO 'user@company.com';
REVOKE SELECT ON TABLE my_table FROM 'user@company.com';
SHOW GRANTS ON TABLE my_table;
Common privileges: SELECT, MODIFY, CREATE, USAGE, READ METADATA, ALL PRIVILEGES.
Best practice: grant to roles/groups, not individuals.
Unity Catalog (Modern Governance)
UC = central governance across workspaces.
3-level name:
SELECT * FROM catalog.schema.table;
Hierarchy:
Metastore
Catalog
Schema
Table / View / Function
Permissions with:
GRANT SELECT ON TABLE catalog.schema.table TO `group_name`;
Data lineage:
UC tracks upstream → transform → downstream.
Helps debug and audit.
Delta Sharing
Open protocol to share Delta tables across orgs.
Provider uses:
CREATE SHARE my_share;
ALTER SHARE my_share ADD TABLE catalog.schema.table;
GRANT SELECT ON SHARE my_share TO RECIPIENT some_recipient;
Recipient reads live data (read-only).
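On the recipient side (Databricks-to-Databricks sharing, hypothetical names), the share is typically mounted as a catalog and queried with the usual three-level name:
CREATE CATALOG customers_shared USING SHARE provider_name.my_share;
SELECT * FROM customers_shared.some_schema.some_table;  -- read-only, always the provider's live data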
Lakehouse Federation
Query external databases (MySQL, PostgreSQL, etc.) directly from Databricks.
No copy needed; external databases appear as foreign catalogs (see the sketch after this list).
Good for live reads / quick analysis.
Heavy / complex queries may be slower because compute happens partly outside Databricks.
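A sketch of the setup (hypothetical names, host, and secret scope): a connection stores the credentials, and a foreign catalog exposes the remote database:
CREATE CONNECTION pg_conn TYPE postgresql
OPTIONS (host 'db.example.com', port '5432', user 'reader', password secret('my_scope', 'pg_pwd'));
CREATE FOREIGN CATALOG pg_sales USING CONNECTION pg_conn
OPTIONS (database 'sales');
SELECT * FROM pg_sales.public.orders;  -- live read against PostgreSQL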
Clusters & SQL Warehouses
Cluster types:
All-purpose: interactive notebooks, ad hoc.
Job clusters: created per job, auto-terminate → cheaper for scheduled workloads.
Best practices:
Enable auto-termination.
Use autoscaling.
Use Job clusters for pipelines.
Use Spot instances when failures are tolerable.
Use Photon for fast SQL / Delta performance.
SQL Warehouses:
For SQL + BI dashboards.
Serverless SQL warehouse:
Start fast, scale automatically.
No cluster tuning.
Use a Pro SQL warehouse when you need custom networking or when serverless is not available.
