Database Management Systems: Architectures, Models, NoSQL, and Transactions
1) What is a Database? Explain the three-schema architecture with a neat diagram.
• A database is an organized collection of data stored in a computer system and usually controlled by a database management system (DBMS).
• The Three-Schema Architecture: The goal of the three-schema architecture is to separate user applications from the physical database.
1. The Internal Level (Internal Schema):
➢ Describes the physical storage structure of the database.
➢ Uses a physical data model and describes the complete details of data storage and access paths for the database.
2. The Conceptual Level (Conceptual Schema):
➢ Describes the structure of the whole database for a community of users.
➢ Hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations, and constraints.
➢ A representational data model is used to describe the conceptual schema when a database system is implemented.
➢ This implementation conceptual schema is often based on a conceptual schema design in a high-level data model.
3. The External or View Level (External Schemas or User Views):
➢ Describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group.
➢ As in the previous level, each external schema is typically implemented using a representational data model, possibly based on an external schema design in a high-level data model.
Benefits:
➢ Data Abstraction: Separates physical storage details from the logical data model, making it easier to manage the database.
➢ Data Independence: Changes at one level do not affect other levels. For example, changes in the physical storage structure do not impact the logical structure of the database or user views.
➢ Security: Provides different views for different users, ensuring that users can only access the data they are authorized to see.
Why do we need mapping between schema levels?
➢ Data independence can be defined as the capacity to change the schema at one level of a database system without having to change the schema at the next higher level. We can define two types of data independence:
1. Logical Data Independence:
➢ Capacity to change the conceptual schema without having to change external schemas or application programs.
➢ Change the conceptual schema to expand the database, to change constraints, or to reduce the database.
➢ Changes to constraints can be applied to the conceptual schema without affecting the external schemas or application programs.
2. Physical Data Independence:
➢ Capacity to change the internal schema without having to change the conceptual schema; hence, the external schemas need not be changed either. Changes to the internal schema may be needed, for example, to create additional access structures that improve the performance of retrieval or update.
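The idea of physical data independence can be sketched in a few lines of Python (a hypothetical illustration, not DBMS code): the conceptual-level interface stays fixed while the internal storage structure changes.

```python
# Toy illustration of physical data independence: the conceptual-level
# interface (insert/find by key) stays fixed while the internal storage
# structure changes from an unsorted list to a hash-based dict.

class ListStorage:                        # one possible internal schema
    def __init__(self):
        self._rows = []
    def insert(self, key, value):
        self._rows.append((key, value))
    def find(self, key):                  # linear scan access path
        return next((v for k, v in self._rows if k == key), None)

class DictStorage:                        # an alternative internal schema
    def __init__(self):
        self._rows = {}
    def insert(self, key, value):
        self._rows[key] = value
    def find(self, key):                  # hash-lookup access path
        return self._rows.get(key)

# Application code written against the conceptual interface is unchanged:
for storage in (ListStorage(), DictStorage()):
    storage.insert("E1", "Alice")
    print(storage.find("E1"))             # Alice, for either internal schema
```

Swapping `ListStorage` for `DictStorage` changes only the access path, not the application code, which is exactly what the mapping between the internal and conceptual levels makes possible.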
2) What are the advantages of using the DBMS approach? Explain.
1. Controlling Redundancy:
• Data redundancy (such as tends to occur in the “file processing” approach) leads to wasted storage space, duplication of effort (when multiple copies of a datum need to be updated), and a higher likelihood of the introduction of inconsistency.
• On the other hand, redundancy can be used to improve the performance of queries. Indexes, for example, are entirely redundant but help the DBMS in processing queries more quickly.
• A DBMS should provide the capability to automatically enforce the rule that no inconsistencies are introduced when data is updated.
2. Restricting Unauthorized Access:
• A DBMS should provide a security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. Then, the DBMS should enforce these restrictions automatically.
3. Providing Persistent Storage for Program Objects:
• Object-oriented database systems make it easier for complex runtime objects (e.g., lists, trees) to be saved in secondary storage so as to survive beyond program termination and to be retrievable at a later time.
4. Providing Storage Structures and Search Techniques for Efficient Query Processing:
• Database systems must provide capabilities for efficiently executing queries and updates. The query processing and optimization module of the DBMS is responsible for choosing an efficient query execution plan for each query based on the existing storage structures.
5. Providing Backup and Recovery:
• A DBMS must provide facilities for recovering from hardware or software failures. The backup and recovery subsystem of the DBMS is responsible for recovery.
• The recovery subsystem must ensure that the transaction is resumed from the point at which it was interrupted so that its full effect is recorded in the database.
6. Providing Multiple User Interfaces:
• Because many types of users with varying levels of technical knowledge use a database, a DBMS should provide a variety of user interfaces. For example: query languages for casual users, programming-language interfaces for application programmers, forms and/or command codes for parametric users, and menu-driven interfaces for stand-alone users.
3) Explain the categories of Data Models:
• A data model is a collection of concepts that can be used to describe the structure of a database; it provides the necessary means to achieve data abstraction.
• By the structure of a database we mean the data types, relationships, and constraints that apply to the data. Most data models also include a set of basic operations for specifying retrievals and updates on the database.
1. High-Level or Conceptual Data Models:
➢ Provide concepts that are close to the way many users perceive data.
➢ Use concepts such as entities, attributes, and relationships.
➢ An entity represents a real-world object or concept, such as an employee or a project from the miniworld that is described in the database.
➢ An attribute represents some property of interest that further describes an entity, such as the employee’s name or salary.
2. Low-Level or Physical Data Models:
➢ Provide concepts that describe the details of how data is stored on the computer storage media, typically magnetic disks.
➢ Concepts provided by low-level data models are generally meant for computer specialists, not for end users.
3. Representational (or Implementation) Data Models:
➢ Provide concepts that may be easily understood by end users but that are not too far removed from the way data is organized in computer storage.
➢ They sit between high-level and low-level models: they hide many details of data storage on disk but can be implemented directly on a computer system.
4) What is NoSQL? Explain the CAP theorem:
NoSQL (Not Only SQL) refers to a category of database management systems that do not use traditional relational databases’ tabular structure. Instead, NoSQL databases offer various models such as document, key-value, column-family, and graph. They are designed to handle a wide range of data types and structures, making them suitable for modern applications that require scalability and flexibility.
The CAP Theorem, originally introduced as the CAP principle, can be used to explain some of the competing requirements in a distributed system with replication. It is a tool used to make system designers aware of the trade-offs while designing networked shared-data systems.
• The three letters in CAP refer to three desirable properties of distributed systems with replicated data: consistency (among replicated copies), availability (of the system for read and write operations), and partition tolerance (in the face of the nodes in the system being partitioned by a network fault).
• Consistency – Consistency means that all nodes have the same copies of a replicated data item visible to the various transactions: a guarantee that every node in a distributed cluster returns the same, most recent, successful write, so that every client has the same view of the data. There are various consistency models; consistency in CAP refers to linearizable (atomic) consistency, a very strong form of consistency.
• Availability – Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed. Every non-failing node returns a response for all the read and write requests in a reasonable amount of time. The key word here is “every”. In simple terms, every node (on either side of a network partition) must be able to respond in a reasonable amount of time.
• Partition Tolerance – Partition tolerance means that the system can continue operating even if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among each other. That means the system continues to function and upholds its consistency guarantees in spite of network partitions. Network partitions are a fact of life.
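The trade-off the theorem describes can be sketched with a toy Python model (purely illustrative; real distributed stores are far more involved): during a partition, a consistency-first store sacrifices availability, while an availability-first store stays available but lets replicas diverge.

```python
# Toy illustration of the CAP trade-off during a network partition.
# Two replicas hold copies of one data item; when `partitioned` is True
# they cannot exchange updates.

class Replica:
    def __init__(self):
        self.value = 0

class CPStore:
    """Consistency-first: refuse writes it cannot replicate (gives up availability)."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, v):
        if self.partitioned:
            return "unavailable"          # reject rather than let replicas diverge
        self.a.value = self.b.value = v   # replicate synchronously
        return "ok"

class APStore:
    """Availability-first: accept writes on the reachable replica (gives up consistency)."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, v):
        self.a.value = v                  # local replica always accepts
        if not self.partitioned:
            self.b.value = v              # replicate only when the network allows
        return "ok"

cp, ap = CPStore(), APStore()
cp.partitioned = ap.partitioned = True
print(cp.write(42))                            # "unavailable": stays consistent
print(ap.write(42), ap.a.value == ap.b.value)  # "ok False": stays available, diverges
```

Since partitions cannot be ruled out in practice, the real design choice is between the CP and AP behaviors modeled above.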
5) What are document-based NoSQL systems? Explain basic operations CRUD in MongoDB:
• Document-based NoSQL systems are a type of database that stores, retrieves, and manages data as documents. These documents are typically stored in formats like JSON (JavaScript Object Notation), BSON (Binary JSON), or XML.
• MongoDB has several CRUD operations, where CRUD stands for (create, read, update, delete).
1. Create Operations: The create or insert operations are used to insert or add new documents to the collection, using methods such as insertOne() and insertMany(). If the collection does not exist, the insert operation creates it in the database.
2. Read Operations: The read operations are used to retrieve documents from the collection, i.e., to query a collection for documents, using methods such as find() and findOne().
3. Update Operations: The update operations are used to update or modify existing documents in the collection, using methods provided by MongoDB such as updateOne(), updateMany(), and replaceOne().
4. Delete Operations: The delete operations are used to delete or remove documents from a collection, using methods provided by MongoDB such as deleteOne() and deleteMany().
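The four operations can be illustrated with a toy in-memory collection in Python whose method names merely mirror MongoDB's insert_one/find/update_one/delete_one; this is a plain-Python sketch, not the real driver API.

```python
# A toy in-memory "collection" mimicking the shape of MongoDB's CRUD calls.
# Not the real driver: queries match documents whose fields equal the
# query's fields, which is the simplest MongoDB-style equality match.

class ToyCollection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):                 # Create
        self.docs.append(dict(doc))

    def find(self, query):                     # Read: match on all query fields
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query, new_values):   # Update: modify first match
        for d in self.docs:
            if all(d.get(k) == v for k, v in query.items()):
                d.update(new_values)
                return 1
        return 0

    def delete_one(self, query):               # Delete: remove first match
        for d in self.docs:
            if all(d.get(k) == v for k, v in query.items()):
                self.docs.remove(d)
                return 1
        return 0

employees = ToyCollection()
employees.insert_one({"name": "Asha", "dept": "Sales"})
employees.insert_one({"name": "Ravi", "dept": "HR"})
employees.update_one({"name": "Ravi"}, {"dept": "Sales"})
employees.delete_one({"name": "Asha"})
print(employees.find({"dept": "Sales"}))   # [{'name': 'Ravi', 'dept': 'Sales'}]
```

Against a real server, the same sequence would be written with pymongo or the mongo shell, with the documents stored as BSON rather than Python dicts.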
6) What is a NoSQL Graph database? Explain Neo4j:
• A graph database is a type of NoSQL database that is designed to handle data with complex relationships and interconnections. In a graph database, data is stored as nodes and edges, where nodes represent entities and edges represent the relationships between those entities.
• A Neo4j graph database stores nodes and relationships instead of tables or documents.
Nodes: Represent entities or objects (e.g., people, products, locations). Each node can have properties (attributes) associated with it.
Relationships: Represent connections between nodes. Relationships can be directed or undirected and can also have properties.
Properties: Key-value pairs associated with nodes and relationships providing additional information about the entities and their connections.
Labels and Properties:
Labels: Tags for nodes. For example, a node can be labeled EMPLOYEE, DEPARTMENT, PROJECT, etc.
Properties: Key-value pairs associated with nodes or relationships. Properties are enclosed in curly brackets {}.
Nodes can have multiple labels, e.g., PERSON:EMPLOYEE:MANAGER.
Relationships and Relationship Types:
Relationships: Connections between nodes which can have properties.
Types: Labels for relationships. For example, WorksFor, Manager, LocatedIn.
Paths: Sequences of nodes and relationships. Used in queries to match patterns.
Optional Schema:
· Neo4j can operate without a schema, but it supports creating indexes and constraints.
· Indexes: Improve query performance by indexing properties.
· Constraints: Ensure data integrity such as unique property values for nodes with a specific label.
Indexing and Node Identifiers:
Each node gets a unique system-defined identifier.
Users can create indexes for efficient retrieval using node properties.
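The node/label/relationship/property model can be sketched in Python; the following is a toy stand-in for Neo4j (hypothetical names throughout), with a pattern match analogous to a simple Cypher MATCH.

```python
# A toy property graph mirroring Neo4j's data model: nodes carry labels
# and key-value properties; relationships are typed, directed, and can
# carry properties of their own.

class Node:
    def __init__(self, labels, **props):
        self.labels = set(labels)          # e.g. {"PERSON", "EMPLOYEE"}
        self.props = props                 # key-value properties

graph = []                                 # list of (start, TYPE, end, props)

alice = Node({"PERSON", "EMPLOYEE"}, name="Alice")
dept = Node({"DEPARTMENT"}, name="Research")
graph.append((alice, "WORKS_FOR", dept, {"since": 2020}))

# Pattern match analogous to the Cypher query:
#   MATCH (p:EMPLOYEE)-[:WORKS_FOR]->(d) RETURN d.name
matches = [end.props["name"]
           for start, rtype, end, _ in graph
           if "EMPLOYEE" in start.labels and rtype == "WORKS_FOR"]
print(matches)                             # ['Research']
```

In Neo4j itself, nodes additionally get system-defined identifiers, and indexes on properties make such pattern matches efficient over large graphs.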
7) Demonstrate the Database Transaction with a transaction diagram:
➢ A transaction in Database Management Systems (DBMS) can be defined as a set of logically related operations.
➢ It is the result of a request made by the user to access the contents of the database and perform operations on it.
➢ It consists of various operations and has various states in its completion journey.
➢ It also has some specific properties that must be followed to keep the database consistent.
Key Components of a Transaction:
• Transaction Start: Initiates the transaction.
• Operations: The sequence of database actions (e.g., read, write).
• Commit: Finalizes the transaction, making all changes permanent.
• Rollback: Reverts changes if the transaction encounters an issue, ensuring no partial updates are made.
These are different types of Transaction States:
• Active State: While the instructions of the transaction are running, the transaction is in the active state. If all the read and write operations are performed without any error, it goes to the “partially committed state”; if any instruction fails, it goes to the “failed state”.
• Partially Committed: After completion of all the read and write operations, the changes exist only in main memory or the local buffer. If the changes are made permanent on the database, the state changes to the “committed state”; in case of failure, it goes to the “failed state”.
• Failed State: The transaction goes to the “failed state” when any of its instructions fails, or when a failure occurs while making the changes permanent on the database.
• Aborted State: After any type of failure, the transaction moves from the “failed state” to the “aborted state”. Since the changes in the previous states were made only to the local buffer or main memory, they are deleted or rolled back.
• Committed State: The state in which the changes are made permanent on the database; the transaction is then complete and moves to the “terminated state”.
• Terminated State: Once a transaction has been rolled back, or has arrived from the “committed state”, the system is consistent and ready for a new transaction, and the old transaction is terminated.
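The state diagram described by these bullets can be sketched as a small Python function; this is an illustrative model of the state transitions, not real DBMS code.

```python
# Minimal sketch of the transaction state diagram:
#   ACTIVE -> PARTIALLY_COMMITTED -> COMMITTED on success,
#   ACTIVE -> FAILED -> ABORTED on error,
# and both paths end in TERMINATED.

def run_transaction(operations):
    state = "ACTIVE"
    try:
        for op in operations:
            op()                          # execute each read/write operation
        state = "PARTIALLY_COMMITTED"     # changes live only in the local buffer
        state = "COMMITTED"               # changes made permanent on the database
    except Exception:
        state = "FAILED"                  # an instruction raised an error
        state = "ABORTED"                 # buffered changes are rolled back
    return state, "TERMINATED"            # either way, the transaction terminates

ok = run_transaction([lambda: None])      # all operations succeed
bad = run_transaction([lambda: 1 / 0])    # an operation fails mid-transaction
print(ok)                                 # ('COMMITTED', 'TERMINATED')
print(bad)                                # ('ABORTED', 'TERMINATED')
```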
8) Demonstrate the System Log in a database transaction:
A system log (or transaction log) in a database transaction is a mechanism used to ensure data integrity and consistency by recording all changes made during a transaction. This log is crucial for recovery in case of a failure and for maintaining the ACID properties of transactions.
Example Scenario:
Let’s consider a bank transaction where $100 is transferred from Account A to Account B. We’ll demonstrate how to use a system log to record each step of this transaction.
Database Schema:
Accounts Table: Holds account details and balances.
Transaction Log Table: Records each step of the transaction.
Implementation:
-- MySQL stored-procedure syntax: IF/THEN, ROW_COUNT(), and LEAVE are only
-- valid inside a routine, so the transfer is wrapped in a procedure.
-- Assume Account A has account_id = 1 and Account B has account_id = 2.
DELIMITER //
CREATE PROCEDURE transfer_100()
proc: BEGIN
    -- Start the transaction and log the start
    START TRANSACTION;
    INSERT INTO Transaction_Log (transaction_id, action, status)
    VALUES (1, 'Transaction Start', 'Pending');

    -- Withdraw from Account A
    UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;

    -- Check if the withdrawal was successful
    IF ROW_COUNT() = 1 THEN
        INSERT INTO Transaction_Log (transaction_id, action, status)
        VALUES (1, 'Withdraw from Account A', 'Success');
    ELSE
        -- Roll back first, then log the failure (logging before ROLLBACK
        -- would undo the log entry along with the rest of the transaction)
        ROLLBACK;
        INSERT INTO Transaction_Log (transaction_id, action, status)
        VALUES (1, 'Withdraw from Account A', 'Failed');
        LEAVE proc;
    END IF;

    -- Deposit into Account B
    UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;

    -- Check if the deposit was successful
    IF ROW_COUNT() = 1 THEN
        INSERT INTO Transaction_Log (transaction_id, action, status)
        VALUES (1, 'Deposit into Account B', 'Success');
    ELSE
        ROLLBACK;
        INSERT INTO Transaction_Log (transaction_id, action, status)
        VALUES (1, 'Deposit into Account B', 'Failed');
        LEAVE proc;
    END IF;

    -- Commit the transaction and log the successful commit
    COMMIT;
    INSERT INTO Transaction_Log (transaction_id, action, status)
    VALUES (1, 'Transaction Commit', 'Success');
END //
DELIMITER ;
9) What is Functional dependency? Explain the inference rules for functional dependency with proof:
A functional dependency is a constraint between two sets of attributes from the database.
Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y also in R (written X → Y) if and only if each X value is associated with at most one Y value.
X is the determinant set and Y is the dependent attribute. Thus, given a tuple and the values of the attributes in X, one can determine the corresponding value of the Y attribute.
The abbreviation for functional dependency is FD or f.d. The set of attributes X is called the left-hand side of the FD and Y is called the right-hand side.
A functional dependency is a property of the semantics or meaning of the attributes.
The database designers will use their understanding of the semantics of the attributes of R to specify the functional dependencies that should hold on all relation states (extensions) r of R.
Informal Design Guidelines for Relation Schemas:
Four informal guidelines may be used as measures to determine the quality of relation schema design:
1. Imparting Clear Semantics to Attributes in Relations:
• The semantics of a relation refers to its meaning resulting from the interpretation of attribute values in a tuple.
• Whenever we group attributes to form a relation schema, we assume that attributes belonging to one relation have certain real-world meaning and a proper interpretation associated with them.
• The easier it is to explain the semantics of the relation, the better the relation schema design will be.
2. Redundant Information in Tuples and Update Anomalies:
• One goal of schema design is to minimize the storage space used by the base relations.
· Grouping attributes into relation schemas has a significant effect on storage space.
• For example, compare the space used by the two base relations EMPLOYEE and DEPARTMENT with that for an EMP_DEPT base relation.
• In EMP_DEPT, the attribute values pertaining to a particular department (Dnumber, Dname, Dmgr_ssn) are repeated for every employee who works for that department.
3. NULL Values in Tuples:
• If many of the attributes do not apply to all tuples in the relation, we end up with many NULLs in those tuples.
• This can waste space at the storage level.
• May lead to problems with understanding the meaning of the attributes.
• May also lead to problems with specifying JOIN operations.
• It is also unclear how to account for them when aggregate operations such as COUNT or SUM are applied.
• SELECT and JOIN operations involve comparisons; if NULL values are present, the results may become unpredictable.
A NULL can have multiple interpretations:
• The attribute does not apply to this tuple. For example, Visa_status may not apply to U.S. students.
• The attribute value for this tuple is unknown. For example, the Date_of_birth may be unknown for an employee.
• The value is known but absent; that is, it has not been recorded yet. For example, the Home_Phone_Number for an employee may exist but may not be available and recorded yet.
Inference Rules for Functional Dependencies:
Inference rules are used to infer new functional dependencies from a given set of functional dependencies. Some common inference rules are:
1. Reflexive Rule: If Y is a subset of X, then X → Y (A set of attributes always determines itself or its subset).
2. Augmentation Rule: If X → Y, then XZ → YZ (Adding attributes to both sides of an FD preserves the dependency).
3. Transitive Rule: If X → Y and Y → Z, then X → Z (If X determines Y and Y determines Z, then X determines Z).
Proof of Transitive Rule:
Assume X → Y and Y → Z hold. We need to prove that X → Z also holds.
Let’s consider two tuples t1 and t2 in relation R such that t1[X] = t2[X]. Since X → Y, we have t1[Y] = t2[Y]. Now, since Y → Z and t1[Y] = t2[Y], we have t1[Z] = t2[Z]. Therefore, we have shown that if t1[X] = t2[X], then t1[Z] = t2[Z], which implies X → Z.
These three rules, known as Armstrong’s axioms, are sound and complete: they can be used to derive all other inference rules for functional dependencies, such as the union, decomposition, and pseudotransitive rules.
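Armstrong’s rules give a practical algorithm for computing the closure of an attribute set X: the set of all attributes that X functionally determines. X → Y holds exactly when Y lies inside the closure of X. A short Python sketch follows, using hypothetical attribute names in the style of an EMP_PROJ relation.

```python
# Attribute-closure computation: repeatedly apply the given FDs until no
# new attributes can be added. Adding rhs when lhs is already contained
# in the result is an application of the transitive rule.

def closure(X, fds):
    """fds is a list of (lhs, rhs) pairs of attribute sets."""
    result = set(X)                       # reflexivity: X determines itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)        # transitivity adds the new attributes
                changed = True
    return result

# Hypothetical FDs: SSN -> Ename; Pnumber -> {Pname, Plocation};
# {SSN, Pnumber} -> Hours
fds = [({"SSN"}, {"Ename"}),
       ({"Pnumber"}, {"Pname", "Plocation"}),
       ({"SSN", "Pnumber"}, {"Hours"})]
print(sorted(closure({"SSN", "Pnumber"}, fds)))
# ['Ename', 'Hours', 'Plocation', 'Pname', 'Pnumber', 'SSN']
```

Since the closure of {SSN, Pnumber} contains every attribute, {SSN, Pnumber} is a key of the relation under these dependencies.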