Core Concepts in Database Management Systems (DBMS)
Data Abstraction and Its Three Levels in DBMS
Data abstraction is the mechanism by which a Database Management System (DBMS) separates how data is physically stored from how users and applications see it. It hides the complexities of how data is stored and managed, providing a simplified view of the data. Since efficiency often requires complex data structures to represent data, developers hide this complexity from users through several levels of abstraction that simplify interactions with the system:
Physical Level (Internal View)
The lowest level of abstraction describes how the data are actually stored.
Example (Library System):
- Shelving: How books are arranged on shelves (e.g., by Dewey Decimal Classification, author, or title).
- Storage Space: The physical layout of the library, including stacks, reading rooms, and storage areas.
- Barcode System: How books are tagged and tracked using barcodes.
Analogy: A librarian organizing new books by subject and placing them on specific shelves.
Logical Level (Conceptual View)
The next-higher level of abstraction describes what data are stored in the database and what relationships exist among those data.
Example (Library System):
- Cataloging Rules: How books are classified, indexed, and described in the library catalog.
- Database Structure: The fields and relationships in the library’s database (e.g., title, author, ISBN, availability status).
- Search Functionality: How users can search the catalog using keywords, subject headings, or author names.
Analogy: A librarian creating a new catalog entry for a book, including its title, author, subject headings, and call number.
View Level (External View)
The highest level of abstraction describes only part of the entire database, tailored to specific user needs.
Example (Library System):
- User Interfaces: Online library catalogs, mobile apps, and search engines.
- Access Control: Limiting access to certain parts of the catalog or specific functions based on user permissions (e.g., student, faculty, staff).
- Personalized Recommendations: Suggesting books based on user history and preferences.
Analogy: A student searching the online library catalog for books on a specific topic, while a librarian accesses detailed inventory information.
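To make the three levels concrete, here is a minimal sketch using Python's built-in sqlite3 module; the library-catalog table, its columns, and the view name are illustrative assumptions rather than part of any particular system. The storage engine's on-disk pages and indexes play the role of the physical level, the table schema is the logical level, and the view is one external view.

```python
import sqlite3

# Logical level: the schema says what is stored and how items relate,
# without exposing how SQLite lays out pages and B-trees on disk (physical level).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE book (
        isbn   TEXT PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT NOT NULL,
        status TEXT NOT NULL        -- e.g. 'available' or 'checked out'
    )
""")
conn.execute("INSERT INTO book VALUES ('978-0-00', 'Intro to Databases', 'A. Author', 'available')")

# View level: a student-facing catalog exposes only part of the database.
conn.execute("CREATE VIEW student_catalog AS SELECT title, author, status FROM book")
print(conn.execute("SELECT * FROM student_catalog").fetchall())
# [('Intro to Databases', 'A. Author', 'available')]
```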
Comparison of Pipelining and Materialization in Query Performance
| Aspect | Pipelining | Materialization |
| --- | --- | --- |
| Definition | Intermediate results are passed directly from one operation to the next without being written to disk or temporary storage. | Intermediate results are fully computed and stored (materialized) in temporary storage before being used by the next operation. |
| Performance Impact | Reduces I/O and memory usage, leading to faster query execution, especially for large or streaming data, since results are not written to and read from disk. | Incurs additional I/O and storage overhead, as intermediate results must be written to and read from disk or memory, potentially slowing query execution. |
| Resource Utilization | More efficient use of CPU and memory, as data flows continuously through operators. | Higher resource usage due to storage and retrieval of intermediate results. |
| Suitability | Best when intermediate results are not reused or are small; ideal for real-time or streaming scenarios. | Beneficial when intermediate results are reused by multiple operations or queries, avoiding recomputation at the cost of extra storage. |
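The difference can be sketched with plain Python iterators: generators hand each row straight to the next operator (pipelining), while building complete lists mimics materializing intermediate results. The operator names and sample rows below are invented for illustration.

```python
# Pipelining: each operator pulls rows from the one below it on demand,
# so no intermediate result is ever fully built.
def scan(rows):
    for row in rows:
        yield row

def select_price_over(rows, threshold):
    for row in rows:
        if row["price"] > threshold:
            yield row

def project_name(rows):
    for row in rows:
        yield row["name"]

orders = [{"name": "pen", "price": 2}, {"name": "desk", "price": 120}]

# Pipelined plan: rows flow straight through the three operators.
pipelined = project_name(select_price_over(scan(orders), 100))
print(list(pipelined))                      # ['desk']

# Materialized plan: each intermediate result is fully built (here as a list;
# a real DBMS might spill it to temporary disk storage) before the next step runs.
scanned   = list(scan(orders))
filtered  = list(select_price_over(scanned, 100))
projected = list(project_name(filtered))
print(projected)                            # ['desk']
```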
Query Cost Estimation Techniques for Optimization
Query cost estimation predicts how expensive a query will be to execute so that the query optimizer (or analyzer) can choose the most efficient plan. The optimizer examines the query, analyzes statistics about the data, evaluates several candidate execution plans, and selects the least costly option.
To measure query cost, the DBMS generates multiple candidate plans for a query and compares their estimated costs. The net estimated cost of a plan is computed by estimating the cost of every operation within that plan and summing these costs to obtain the total estimated cost of the query evaluation plan.
Example of Cost Estimation:
We typically use the number of block transfers (blocks read from disk) and the number of disk seeks to estimate the cost of a query evaluation plan.
If the disk subsystem takes an average of tT seconds to transfer a block of data and has an average block-access time (disk seek time plus rotational latency) of tS seconds, then an operation that transfers b blocks and performs S seeks would take:
Cost = (b * tT) + (S * tS) seconds.
The values of tT and tS must be calibrated for the disk system in use. Typical values for a high-end disk today might be tS = 4 milliseconds and tT = 0.1 milliseconds, assuming a 4-kilobyte block size and a transfer rate of 40 megabytes per second.
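As a quick sanity check of the formula, the short Python snippet below plugs in the calibration values quoted above; the 10,000-block scan with a single initial seek is an assumed example, not a figure from the text.

```python
# Worked example of Cost = (b * tT) + (S * tS), using tS = 4 ms and tT = 0.1 ms.
t_T = 0.1 / 1000   # block transfer time in seconds
t_S = 4.0 / 1000   # block-access (seek + rotational latency) time in seconds

def plan_cost(blocks_transferred, seeks):
    """Estimated I/O cost of a plan that transfers `blocks_transferred` blocks
    and performs `seeks` random disk seeks."""
    return blocks_transferred * t_T + seeks * t_S

# e.g. a sequential scan of 10,000 blocks with a single initial seek:
print(plan_cost(10_000, 1))   # 1.004 seconds
```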
The cost of a query evaluation plan is assessed by considering the following resources:
- The number of disk accesses.
- The CPU time required to execute the query.
- The communication costs involved in distributed or parallel database systems.
Generally, cost estimation assumes the worst case: that all data must initially be read from disk. In practice, some of the data may already be present in main memory; because this effect is usually disregarded, the actual execution cost often turns out to be lower than the estimate.
The response time (the time required to execute the plan) could also be used to assess the cost of a query evaluation plan. However, response time is difficult to calculate without actually executing the plan, because it depends on the contents of the buffer at the moment the query starts executing.
ACID Properties: Ensuring Transaction Reliability
ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These properties are crucial for maintaining the reliability and consistency of transactions in databases.
1. Atomicity
Atomicity ensures that a transaction is treated as a single, indivisible unit of work.
- How It Works: If any part of the transaction fails, the entire transaction is rolled back, and the database is returned to its original state. This prevents partial updates that could leave the database in an inconsistent state.
- Example: Imagine transferring money from one account to another. Atomicity ensures that either both the debit and credit operations succeed, or neither does, maintaining the integrity of the accounts (see the sketch below).
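A minimal sketch of atomicity using Python's sqlite3 module; the account names, balances, and CHECK constraint are illustrative assumptions. The failed debit aborts the whole transfer, so the earlier credit is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

try:
    with conn:  # begins a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 'B'")  # credit succeeds
        conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 'A'")  # debit violates CHECK
except sqlite3.IntegrityError:
    pass  # the whole transfer is aborted, not just the failing statement

# Neither balance changed: the earlier credit to B was rolled back with the failed debit.
print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [('A', 100), ('B', 50)]
```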
2. Consistency
Consistency ensures that the database remains in a valid state before and after a transaction.
- How It Works: The database must adhere to predefined rules and constraints. If a transaction violates these rules (e.g., uniqueness constraints), it is rolled back to maintain consistency.
- Example: If a transaction tries to insert a record with an invalid or duplicate key, the transaction is rolled back to prevent inconsistency.
3. Isolation
Isolation ensures that transactions operate independently and do not interfere with each other.
- How It Works: Changes made by one transaction are not visible to other transactions until the first transaction is committed. This prevents concurrent transactions from accessing inconsistent data.
- Example: When multiple users are updating different parts of a database simultaneously, isolation ensures that each user sees a consistent view of the data until all transactions are committed.
4. Durability
Durability ensures that once a transaction is committed, its effects are permanent and survive even system failures.
- How It Works: Databases use mechanisms like transaction logs to ensure that committed transactions are not lost. Even if the system crashes after a transaction is committed, the database can recover and maintain the committed state.
- Example: After a transaction is committed, the changes are written to a log file. If the system fails before the changes are written to disk, the log file ensures that the transaction can be replayed during recovery, maintaining durability.
Applying Log-Based Recovery After System Failure
Every DBMS maintains system logs, which record every system activity along with a timestamp of when it occurred. Databases manage several log files for operations such as errors, queries, and other database updates. Log records take the following forms:
- [start transaction, T]: marks the start of transaction T's execution.
- [write item, T, X, old value, new value]: indicates that transaction T changes the value of item X from the old value to the new value.
- [read item, T, X]: indicates that transaction T reads the value of X.
- [commit, T]: signifies that the transaction's modifications have been committed to the database and can no longer be changed by the transaction.
- [abort, T]: indicates that transaction T has been canceled.
These logs let us trace how the state of the data changes during a transaction and recover it to an earlier or later state. An undo operation inspects the [write item, T, X, old value, new value] records of an uncommitted transaction and restores each item to its old value.
Conversely, the new state of data lost in a system failure can only be restored for transactions that reached [commit, T]; a redo operation reapplies the new values recorded in their write records. Consider a series of transactions t1, t2, t3, and t4. If the system crashes after the fourth transaction, the data can still be recovered to a consistent state using the checkpoint established during transaction t1: committed work recorded after the checkpoint is redone, and uncommitted work is undone.
After all the records of a transaction are written to the log, a checkpoint is created to transfer the logs from local (volatile) storage to permanent storage for future use.
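The undo/redo idea can be sketched with a toy, in-memory log in Python; the record tuples mirror the formats listed above, and the transactions, items, and values are invented for illustration.

```python
# Toy write-ahead log: each entry mirrors the record formats listed above.
log = [
    ("start",  "T1"),
    ("write",  "T1", "X", 10, 50),   # (op, txn, item, old value, new value)
    ("commit", "T1"),
    ("start",  "T2"),
    ("write",  "T2", "Y", 7, 99),
    # system crashes here: T2 never committed
]

db = {"X": 10, "Y": 7}               # assumed on-disk state after the crash

committed = {rec[1] for rec in log if rec[0] == "commit"}

# Redo: scan the log forward and reapply new values of committed transactions.
for rec in log:
    if rec[0] == "write" and rec[1] in committed:
        _, _, item, old, new = rec
        db[item] = new

# Undo: scan the log backward and restore old values of uncommitted transactions.
for rec in reversed(log):
    if rec[0] == "write" and rec[1] not in committed:
        _, _, item, old, new = rec
        db[item] = old

print(db)   # {'X': 50, 'Y': 7}  -- T1's update survives, T2's is undone
```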
Web Databases Versus Data Warehouses
Web Database
A web database is a type of database application that is managed and accessed through the Internet, rather than being stored on a single computer. It allows website operators and users to store, organize, and analyze large amounts of data online, making it an essential tool for both businesses and individuals.
Example: Online Store
Imagine an online shop. The website uses a web database to store information about products, customers, and orders. When a customer makes a purchase, the website updates the database in real time—reducing inventory, recording the order, and saving customer details.
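As a rough sketch of that real-time update (the table and column names are invented for illustration, with sqlite3 standing in for the web database), the purchase handler might wrap the inventory change and the order record in a single transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (sku TEXT PRIMARY KEY, stock INTEGER NOT NULL);
    CREATE TABLE orders  (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER, customer TEXT);
    INSERT INTO product VALUES ('MUG-01', 12);
""")

# One purchase updates several tables in a single transaction.
with conn:
    conn.execute("UPDATE product SET stock = stock - 1 WHERE sku = 'MUG-01'")
    conn.execute("INSERT INTO orders (sku, qty, customer) VALUES ('MUG-01', 1, 'alice@example.com')")

print(conn.execute("SELECT stock FROM product").fetchone())  # (11,)
```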
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
- Subject-Oriented: A data warehouse can be used to analyze a particular subject area, such as “sales.”
- Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single, unified way of identification.
- Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data. This contrasts with a transaction system, where often only the most recent data is kept (e.g., only the most recent address of a customer).
- Non-volatile: Once data is in the data warehouse, it will not change. Historical data in a data warehouse should never be altered.
Comparing First and Second Normal Forms (1NF & 2NF)
| Aspect | First Normal Form (1NF) | Second Normal Form (2NF) |
| --- | --- | --- |
| Definition | Ensures that all attributes in a table hold atomic values (no multi-valued or composite attributes). | Builds on 1NF by eliminating partial dependencies, ensuring every non-key attribute depends on the whole primary key. |
| Key Dependency | Does not require identification of functional dependencies. | Requires identification of functional dependencies so that partial dependencies can be removed. |
| Primary Key | Composite keys are allowed without restriction. | Composite keys are allowed, but no non-key attribute may depend on only part of a composite key. |
| Goals | Eliminates repeating groups and multi-valued attributes. | Ensures full functional dependency on the key and eliminates redundancy caused by partial dependencies. |
| Strength | Less restrictive. | More restrictive, ensuring stronger data integrity. |
| Update Anomalies | Does not address update anomalies. | Addresses update anomalies caused by partial dependencies. |
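A small sketch of the difference using Python's sqlite3 module; the enrollment/course schema and the partial dependency of course_name on course_id are assumed purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1NF but not 2NF: the key is (student_id, course_id), yet course_name
# depends only on course_id -- a partial dependency.
conn.execute("""
    CREATE TABLE enrollment_1nf (
        student_id  INTEGER,
        course_id   INTEGER,
        course_name TEXT,
        grade       TEXT,
        PRIMARY KEY (student_id, course_id)
    )
""")

# 2NF: move the partially dependent attribute into its own table, so every
# non-key attribute depends on the whole key of its table.
conn.executescript("""
    CREATE TABLE course (
        course_id   INTEGER PRIMARY KEY,
        course_name TEXT
    );
    CREATE TABLE enrollment_2nf (
        student_id INTEGER,
        course_id  INTEGER REFERENCES course(course_id),
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)
    );
""")
```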
Classification of Failures in Database Systems
Failure in a database can be defined as its inability to execute a specified transaction or the loss of data. A DBMS is vulnerable to several kinds of failures, each requiring different management strategies. Reasons for database failures include network failure, system crash, natural disasters, carelessness, sabotage (corrupting data intentionally), and software errors.
Transaction Failure
A transaction failure occurs if a transaction is unable to execute or reaches a point where it cannot proceed further.
Reasons for transaction failure in DBMS:
- Logical Error: Occurs if a transaction is unable to execute because of mistakes in the code or internal faults.
- System Error: Occurs when the database system itself terminates an active transaction due to a system issue or because it is unable to proceed. For example, the system ends an operating transaction if it reaches a deadlock condition or if resources are unavailable.
System Crash
A system crash usually occurs when there is some sort of hardware or software breakdown. Other external problems that cause the system to abruptly stop or crash include transaction failure, operating system errors, power cuts, or main memory crash. These types of failures are often termed soft failures and are responsible for data losses in the volatile memory. It is assumed that a system crash does not affect the data stored in non-volatile storage; this is known as the fail-stop assumption.
Data-Transfer Failure
Data-transfer failures are categorized as disk failures that occur during a data-transfer operation, resulting in the loss of content from disk storage. Other reasons for disk failures include disk head crash, disk unreachability, formation of bad sectors, or read-write errors on the disk. To quickly recover from a disk failure, a backup copy of the data stored on other tapes or disks can be used. Thus, frequent data backup is a good practice.
Fundamental Data Models in DBMS
1. Relational Data Model
This model organizes data in the form of rows and columns within a table (also called a relation). Thus, a relational model uses tables to represent data and the relationships among them. This model was initially described by Edgar F. Codd in 1969 and is the most widely used model, employed primarily in commercial data processing applications.
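For instance, a minimal relational schema (the student/enrollment tables are illustrative assumptions) can be created with Python's sqlite3 module:

```python
import sqlite3

# A relation is a table: each column is an attribute, each row is a tuple.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id   INTEGER PRIMARY KEY,
        student_name TEXT
    );
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),  -- relationship between tables
        course     TEXT
    );
""")
```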
2. Entity-Relationship (ER) Data Model
An ER model is the logical representation of data as objects and relationships among them. These objects are known as entities, and a relationship is an association among these entities. This model was designed by Peter Chen and published in 1976. It is widely used in database design. Entities are described by a set of attributes (e.g., student_name, student_id describe the ‘student’ entity). A set of the same type of entities is known as an ‘Entity Set,’ and the set of the same type of relationships is known as a ‘Relationship Set.’
3. Object-Based Data Model
This is an extension of the ER model incorporating notions of functions (methods), encapsulation, and object identity. This model supports a rich type system that includes structured and collection types. Developed in the 1980s, various database systems followed this object-oriented approach, in which objects encapsulate both data and the operations on it.
4. Semistructured Data Model
This data model differs from the other three because it allows data specifications where individual data items of the same type may have different attribute sets. The Extensible Markup Language (XML) is widely used for representing semistructured data. Although XML was initially designed for including markup information in text documents, it gained importance due to its application in data exchange.
Example: A common example of this data model is web-based data sources, where items of the same kind may not share an identical schema: some entities may be missing attributes, while others may have extra attributes. This approach allows flexible data storage. A value stored in an attribute can be either an atomic value or a collection of values. Emails, HTML documents, and web pages are further examples of semistructured data.
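A tiny Python example of semistructured data using the standard xml.etree module; the catalog and its books are invented, and the second book deliberately lacks attributes the first one has.

```python
import xml.etree.ElementTree as ET

# Two items of the same type with different attribute sets -- the hallmark
# of semistructured data described above.
catalog = ET.fromstring("""
<catalog>
  <book isbn="978-0-00"><title>DB Concepts</title><author>A. Author</author></book>
  <book><title>Old Pamphlet</title></book>   <!-- no isbn, no author -->
</catalog>
""")

for book in catalog.findall("book"):
    title = book.findtext("title")
    isbn  = book.get("isbn", "unknown")       # the attribute may be missing
    print(title, isbn)
```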
Checkpoints and Their Contribution to Database Recovery
A checkpoint is an operation in which a database system saves its current state (committed changes and the associated log records) to disk. It enables faster recovery after a system failure, because only the work recorded after the most recent checkpoint needs to be re-examined.
Types of Checkpoints
- Automatic Checkpoints:
- Occur at regular intervals (e.g., hourly or daily).
- Useful for large, frequently updated databases.
- Configured by the database administrator to ensure quick recovery.
- Manual Checkpoints:
- Triggered manually by the database administrator.
- Ideal for smaller databases with fewer updates.
- Allows control over when checkpoints occur.
Role in Recovery
After a crash, the database uses checkpoint data and transaction logs to restore its last consistent state (see the sketch after this list):
- Roll Forward (Redo): Replays committed transactions that occurred after the last checkpoint.
- Rollback (Undo): Undoes uncommitted changes to maintain consistency.
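Continuing the toy log format from the log-based recovery section above, the sketch below shows how a checkpoint record limits recovery to the tail of the log; all transactions, items, and values are invented for illustration.

```python
# Toy log with a checkpoint record: everything before it is already safe on disk.
log = [
    ("start",  "T1"), ("write", "T1", "X", 1, 5), ("commit", "T1"),
    ("checkpoint",),
    ("start",  "T2"), ("write", "T2", "Y", 2, 9), ("commit", "T2"),
    ("start",  "T3"), ("write", "T3", "Z", 3, 8),
    # crash: T3 never committed
]

# Recovery only has to examine the log after the last checkpoint.
last_cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
tail = log[last_cp + 1:]

committed = {rec[1] for rec in tail if rec[0] == "commit"}
db = {"X": 5, "Y": 2, "Z": 8}      # assumed on-disk state: T3's write reached disk

for rec in tail:                                   # roll forward (redo)
    if rec[0] == "write" and rec[1] in committed:
        db[rec[2]] = rec[4]
for rec in reversed(tail):                         # rollback (undo)
    if rec[0] == "write" and rec[1] not in committed:
        db[rec[2]] = rec[3]

print(db)   # {'X': 5, 'Y': 9, 'Z': 3}  -- T2 redone, T3 undone
```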
Benefits of Checkpoints
- Speeds up recovery by reducing the time needed to process logs.
- Improves consistency during failures.
- Allows administrators to balance performance and recovery speed by adjusting checkpoint intervals.
In short, checkpoints are essential for maintaining data integrity and ensuring efficient recovery after system crashes.