Database Normalization Forms and Indexing Techniques
1NF: A table is in 1NF if:
- The domain of each attribute contains only atomic values.
- The value of each attribute contains only a single value from that domain.
Functional Dependency: Y is functionally dependent on X in T if for each value x of T.X there is precisely one corresponding value y of T.Y.
Full Functional Dependency: Y is fully functionally dependent on X in T if Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X.
Insertion Anomaly: Cannot insert new patrons in the system until they have borrowed books.
Update Anomaly: Must update all rows involving a given patron if he or she moves.
Deletion Anomaly: Will lose information about patrons that have returned all the books they have borrowed.
Axioms:
- Reflexivity: If Y is a subset of X, then X→Y.
- Augmentation: If X→Y, then WX→WY.
- Transitivity: If X→Y and Y→Z, then X→Z.
Derived Rules:
- Union: If X→Y and X→Z, then X→YZ.
- Decomposition: If X→YZ, then X→Y and X→Z.
- Pseudotransitivity: If X→Y and WY→Z, then XW→Z.
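These axioms and rules can be applied mechanically to compute the closure X+ of an attribute set X, i.e., every attribute that X functionally determines. A minimal Python sketch (the `closure` helper and the representation of FDs as `(lhs, rhs)` frozenset pairs are illustrative choices, not part of the notes above):

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under a list of
    functional dependencies given as (lhs, rhs) pairs of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have all of lhs, reflexivity/transitivity give us rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Example: A -> B and B -> C imply A -> C (transitivity).
fds = [(frozenset("A"), frozenset("B")),
       (frozenset("B"), frozenset("C"))]
print(sorted(closure({"A"}, fds)))  # ['A', 'B', 'C']
```

X → Y holds exactly when Y is a subset of closure(X), so this one helper suffices to test any derived dependency.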
2NF: A table is in 2NF if:
- It is in 1NF.
- No non-prime attribute is dependent on any proper subset of any candidate key of the table.
A non-prime attribute of a table is an attribute that is not a part of any candidate key of the table.
A candidate key is a minimal superkey.
3NF: A table is in 3NF if:
- It is in 2NF.
- All its attributes are determined only by its candidate keys and not by any non-prime attributes.
BCNF: A stricter form of 3NF.
A table T is in BCNF if:
- For every one of its non-trivial dependencies X → Y, X is a super key for T.
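The BCNF condition can be checked directly from the definition: for every non-trivial dependency X → Y, verify that X is a superkey, i.e., that the closure of X covers all attributes. A Python sketch (the `bcnf_violations` helper and the example relation are illustrative):

```python
def closure(attrs, fds):
    """Closure of an attribute set under FDs given as (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(attributes, fds):
    """Return the non-trivial FDs whose left-hand side is not a superkey."""
    bad = []
    for lhs, rhs in fds:
        if rhs <= lhs:                                # trivial dependency
            continue
        if closure(lhs, fds) != set(attributes):      # lhs is not a superkey
            bad.append((lhs, rhs))
    return bad

# Classic example: Enrolment(ID, Course, Instructor) with
# {ID, Course} -> Instructor and Instructor -> Course.
# The second FD violates BCNF (Instructor is not a superkey).
attrs = {"ID", "Course", "Instructor"}
fds = [(frozenset({"ID", "Course"}), frozenset({"Instructor"})),
       (frozenset({"Instructor"}), frozenset({"Course"}))]
print(bcnf_violations(attrs, fds))
```

The example table is in 3NF (Course is a prime attribute) but not in BCNF, which is exactly the kind of case where the two forms differ.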
Most tables that are in 3NF also are in BCNF.
Multivalued Dependency: A multivalued dependency X => Y occurs if:
- For any value xc actually occurring in the table, the set of y entries associated with xc is the same regardless of the z values (the remaining attributes) it appears with.
Trivial Multivalued Dependency: A trivial multivalued dependency X => Y is one where either:
- Y is a subset of X, or
- Z is empty (X ∪ Y covers all the column headings).
4NF: A table is in 4NF if, for every one of its non-trivial multivalued dependencies X => Y, X is either:
- A candidate key, or
- A superset of a candidate key.
A table T is subject to a join dependency if it can always be recreated by joining multiple tables each having a subset of the attributes of T. The join dependency is said to be trivial if one of the tables in the join has all the attributes of the table T.
Notation: *{ A, B, …} on T
5NF: A table T is said to be 5NF if:
- Every non-trivial join dependency in it is implied by its candidate keys.
A join dependency *{A, B, …, Z} on T is implied by the candidate key(s) of T if and only if each of A, B, …, Z is a superkey for T.
Conventional Indexes:
- Sparse: One entry per data block.
- Dense: One entry per record.
Sparse Index:
- Identifies the first record of the block.
- Requires data to be sorted.
- Advantages:
- Occupy less space.
- Can keep more of it in main memory (Faster access).
Dense Index:
- Data do not have to be sorted.
- Advantages:
- Can tell if a record exists without accessing the data file.
- No data sorting.
Indexes based on primary key:
- Each key value corresponds to one record.
- Table is sorted on primary key: Sparse index.
- Table either non-sorted or sorted on other field: Dense index.
Indexes based on other fields:
- Each key value corresponds to more than one record (Clustering index).
- Table is sorted on that field: Sparse index.
- Table is either non-sorted or sorted on other field: Dense index.
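The sparse-versus-dense distinction can be sketched in a few lines of Python (the block layout and helper names here are illustrative):

```python
from bisect import bisect_right

# Sorted data file split into blocks of 3 records (keys only, for brevity).
blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26]]

# Sparse index: one entry per block, keyed on the block's first record.
sparse = [blk[0] for blk in blocks]          # [2, 11, 20]

# Dense index: one entry per record, mapping key -> (block, offset).
dense = {k: (i, j) for i, blk in enumerate(blocks) for j, k in enumerate(blk)}

def sparse_lookup(key):
    """Binary-search the sparse index for the only block that could
    hold key, then scan inside that block."""
    i = bisect_right(sparse, key) - 1        # last block whose first key <= key
    if i < 0:
        return None
    return key if key in blocks[i] else None

print(sparse_lookup(14))   # found in block 1
print(sparse_lookup(15))   # None -- but we had to read the block to know
print(15 in dense)         # False -- dense index answers without touching file
```

Note how the sparse index only works because `blocks` is sorted, while the dense dictionary would work on any ordering, matching the trade-offs listed above.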
When the index is very large, we index the index. Two-level or three-level index.
Index at top level is called master index.
(Normally a sparse index)
Updating indexed tables can be painful: indexes decrease performance on inserts, updates, and deletes, and they take up space. Some databases will also fold indexed field values to a single case. (interviewques.wordpress.com)
B and B+ Trees : Dynamic indexing structures that can evolve as records are added or deleted. Optimized for searches on block devices. Neither B nor B+ trees are binary: the objective is to increase the branching factor (degree, or fan-out) so as to reduce the number of device accesses.
B Tree : Bottom nodes are leaf nodes, whose child pointers are NULL.
The number of child nodes typically varies between d and 2d (order d).
Will split nodes that would otherwise have contained 2d + 1 child nodes.
Will merge nodes that contain fewer than d child nodes.
Two Basic Operations :
- Split: when trying to add to a full node, split the node at its central value.
- Promote: the central value of the split node must be inserted higher up, which may require a new split.
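The split-and-promote step can be sketched as follows (a simplified illustration on a bare key list; a real B tree node would also carry child pointers that must be distributed between the two halves):

```python
def split(keys):
    """Split an overfull sorted key list at its central value.
    Returns (left_keys, promoted_key, right_keys); the promoted key
    is inserted into the parent node, which may split in turn."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Order d = 2: a node that would hold 2d + 1 = 5 keys must split.
left, promoted, right = split([10, 20, 30, 40, 50])
print(left, promoted, right)   # [10, 20] 30 [40, 50]
```

If the promotion overfills the parent, the same split is applied one level up, which is how the tree grows in height at the root.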
B+ Tree : Two types of nodes.
Internal nodes have no data pointers.
Leaf nodes have no tree pointers.
Advantages : Removing unneeded pointers allows us to pack more keys in each node.
Higher fan-out for a given node size
Normally one block
Having all keys present in the leaf nodes allows us to build a linked list of all keys
If m is the order of the tree
Every internal node has at most m children.
Every internal node (except root) has at least floor(m ⁄ 2) children.
The root has at least two children if it is not a leaf node.
Every leaf has at least floor(m/2) keys
Every leaf has at most m − 1 keys
An internal node with k children has k − 1 keys.
All leaves appear in the same level
A B+ tree of degree m and height h will store:
- At most m^(h − 1) × (m − 1) = m^h − m^(h − 1) records.
- At least 2 × floor(m ⁄ 2)^(h − 1) records.
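These bounds are easy to check numerically; a small Python sketch under the floor conventions used above (`bplus_capacity` is an illustrative name):

```python
from math import floor

def bplus_capacity(m, h):
    """Record-count bounds for a B+ tree of order m and height h,
    assuming leaves hold between floor(m/2) and m - 1 keys."""
    most = m ** (h - 1) * (m - 1)          # m^(h-1) leaves, m - 1 keys each
    least = 2 * floor(m / 2) ** (h - 1)    # root has at least two children
    return least, most

least, most = bplus_capacity(100, 3)
print(least, most)   # 5000 990000
```

The point of the exercise: with m = 100, three levels already cover nearly a million records, which is why B+ trees stay so shallow in practice.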
Hashing : Define m target addresses (the “buckets”)
Create a hash function h(k) that is defined for all possible values of the key k and returns an integer value h such that 0 ≤ h ≤ m – 1
Bucket Organization :
1) Buckets contain records. When a bucket is full, records go to an overflow bucket.
2) Buckets contain pairs (a key plus a pointer to the record). When a bucket is full, pairs go to an overflow bucket.
A good hash function should distribute records evenly among the buckets. A bad one will have too many overflowing buckets, or too many empty or nearly empty buckets.
1) If the key is numeric: divide the key by the number of buckets and keep the remainder. If the number of buckets is a power of two, this amounts to selecting the log2 m least significant bits of the key.
2) Otherwise: transform the key into a numerical value, then divide that value by the number of buckets and keep the remainder.
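Both cases can be sketched in Python; "divide" here means keeping the remainder, which guarantees 0 ≤ h(k) ≤ m − 1. The polynomial fold used for string keys is one illustrative choice, not the only one:

```python
M = 8  # number of buckets (a power of two)

def hash_numeric(key):
    """Numeric key: remainder after division by the bucket count.
    For M = 2^k this keeps exactly the k least significant bits."""
    return key % M

def hash_string(key):
    """Non-numeric key: fold the characters into an integer first
    (simple polynomial fold), then take the remainder."""
    value = 0
    for ch in key:
        value = value * 31 + ord(ch)
    return value % M

print(hash_numeric(42))                 # 2
print(hash_numeric(42) == 42 & 0b111)   # True: same as the 3 low bits
```

The mask equivalence is why powers of two are popular bucket counts: the modulo becomes a single bit-AND.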
Dynamic Hashing : Dynamic hashing lets the hash table grow as the number of records grow. Two techniques : Extendible and linear hashing.
Extendible : Represents hash values as bit strings (001, 110, …).
Introduces an additional level of indirection, the directory: one entry per hash-prefix value; multiple entries can point to the same bucket.
Extendible hashing allows the hash table contents:
1) To grow, by splitting buckets.
2) To shrink, by merging buckets.
but it adds one level of indirection (no problem if the directory can reside in main memory).
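A minimal extendible hashing sketch in Python, assuming integer keys used directly as hash values and ignoring deletion/merging (class and method names are illustrative):

```python
class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.keys = []

class ExtendibleHash:
    """Directory-based extendible hashing over integer keys."""
    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # Use the global_depth least significant bits of the hash value.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.bucket_size:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Directory is too coarse: double it (each entry gets a twin).
            self.directory += self.directory
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)            # retry after the split

    def _split(self, bucket):
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redirect the matching half of the directory entries to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the keys on the newly significant bit.
        old = bucket.keys
        bucket.keys = [k for k in old if not k & high_bit]
        sibling.keys = [k for k in old if k & high_bit]

    def lookup(self, key):
        return key in self.directory[self._index(key)].keys

h = ExtendibleHash(bucket_size=2)
for k in [0, 1, 2, 3, 4, 5]:
    h.insert(k)
print(h.global_depth)                     # 2 -- directory doubled once
print([h.lookup(k) for k in range(6)])    # all True
```

Note that a split only doubles the directory when the overflowing bucket's local depth already equals the global depth; otherwise only directory pointers are rewired.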
Linear : Doesn't add an additional level of indirection. Reduces but doesn't eliminate overflow buckets. Uses a family of hash functions.
Start with m buckets and hi(K) = K mod m.
When a first bucket overflows :
- Create an overflow bucket.
- Create a new bucket at location m.
- Apply hash function hi+1(K) = K mod 2m to the contents of bucket 0, which will now be split between buckets 0 and m.
When a second bucket overflows :
- Create an overflow bucket.
- Create a new bucket at location m + 1.
- Apply hash function hi+1(K) = K mod 2m to the contents of bucket 1, which will now be split between buckets 1 and m + 1.
Each time a bucket overflows :
- Create an overflow bucket.
- Create a new bucket and apply hash function hi+1(K) = K mod 2m to the contents of the successor s + 1 of the last bucket s that was split.
- The contents of bucket s + 1 will now be split between buckets s + 1 and m + s + 1.
The size of the hash table grows linearly at each split until all buckets use the new hash function.
Advantages : The hash table grows linearly.
As we split buckets in linear order, bookkeeping is very simple:
- Need only keep track of the last bucket s that was split.
- Buckets 0 to s use the new hash function hi+1(K) = K mod 2m.
- Buckets s + 1 to m − 1 still use the old hash function hi(K) = K mod m.
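A minimal linear hashing sketch in Python, assuming integer keys and representing each bucket together with its overflow chain as a single list (class and method names are illustrative):

```python
class LinearHash:
    """Linear hashing: buckets are split in order 0, 1, 2, ...
    each time any bucket overflows."""
    def __init__(self, m=2, bucket_size=2):
        self.m = m                  # initial bucket count
        self.level = 0              # h_i(K) = K mod (m * 2^level)
        self.next_split = 0         # the split pointer s
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(m)]

    def _address(self, key):
        n = self.m * (1 << self.level)
        a = key % n                 # old hash function h_i
        if a < self.next_split:
            a = key % (2 * n)       # bucket already split: use h_{i+1}
        return a

    def insert(self, key):
        a = self._address(key)
        self.buckets[a].append(key)          # the list is its own overflow chain
        if len(self.buckets[a]) > self.bucket_size:
            self._split()

    def _split(self):
        n = self.m * (1 << self.level)
        s = self.next_split
        self.buckets.append([])              # new bucket at location n + s
        old = self.buckets[s]
        self.buckets[s] = [k for k in old if k % (2 * n) == s]
        self.buckets[n + s] = [k for k in old if k % (2 * n) != s]
        self.next_split += 1
        if self.next_split == n:             # every bucket now uses h_{i+1}
            self.level += 1
            self.next_split = 0

    def lookup(self, key):
        return key in self.buckets[self._address(key)]

h = LinearHash(m=2, bucket_size=2)
for k in [0, 2, 4, 1, 3, 5]:
    h.insert(k)
print(len(h.buckets))                       # 4 -- both original buckets split
print(all(h.lookup(k) for k in range(6)))   # True
```

The key point the sketch shows: the bucket that splits is chosen by the pointer s, not by which bucket overflowed, so the table grows one bucket at a time with no directory.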
