Essential Database and Data Management Concepts
Database Fundamentals
Core Database Terminology
- Catalog: A set of schemas that constitute the description of a database.
- Schema: The structure that contains descriptions of objects created by a user (base tables, views, constraints).
- Data Definition Language (DDL): Commands that define a database, including creating, altering, and dropping tables and establishing constraints.
- Data Manipulation Language (DML): Commands that maintain and query a database.
- Data Control Language (DCL): Commands that control a database, including administering privileges and committing data.
Referential Integrity & Constraints
- Referential Integrity: A constraint that ensures that foreign key values of a table must match primary key values of a related table in 1:M relationships.
- Restrictions: Deletes, updates, and inserts of primary keys.
Database Views
Types of Views
- Dynamic View: A “virtual table” created dynamically upon request by a user. No data is actually stored; instead, data from the base table is made available to the user. It is based on an SQL SELECT statement on base tables or other views.
- Materialized View: A copy or replication of data. Data is actually stored and must be refreshed periodically to match corresponding base tables.
Advantages of Views
- Simplify query commands.
- Assist with data security (but don’t rely solely on views for security; more important security measures exist).
- Enhance programming productivity.
- Contain the most current base table data.
- Use little storage space.
- Provide a customized view for the user.
- Establish physical data independence.
Disadvantages of Views
- Use processing time each time the view is referenced.
- May or may not be directly updateable.
Database Partitioning
Advantages of Partitioning
- Efficiency: Records used together are grouped together.
- Local Optimization: Each partition can be optimized for performance.
- Security: Data not relevant to users are segregated.
- Recovery and Uptime: Smaller files take less time to back up.
- Load Balancing: Partitions stored on different disks, which reduces contention.
Disadvantages of Partitioning
- Inconsistent Access Speed: Slow retrievals across partitions.
- Complexity: Non-transparent partitioning.
- Extra Space or Update Time: Duplicate data; access from multiple partitions.
Types of Partitioning
- Range Partitioning: Partitions defined by a range of field values. Could result in unbalanced distribution of rows. Like-valued fields share partitions.
- Hash Partitioning: Partitions defined via hash functions. Will guarantee balanced distribution of rows. A partition could contain widely varying valued fields.
- List Partitioning: Based on predefined lists of values for the partitioning key.
- Composite Partitioning: A combination of the other approaches.
Application Partitioning
Placing portions of the application code in different locations (client vs. server) after it is written.
Advantages of Application Partitioning
- Improved performance.
- Improved interoperability.
- Balanced workloads.
Application Architecture & Middleware
Middleware and APIs
- Middleware: Software that allows an application to interoperate with other software without requiring the user to understand and code low-level operations.
- Application Program Interface (API): Routines that an application uses to direct the performance of procedures by the computer’s operating system.
- Common Database APIs: ODBC, ADO.NET, JDBC.
Three-Tier Architecture
Processing in Three-Tier Architecture
- Static Page Requests: .htm or .html requests handled by the Web server.
- Dynamic Page Requests: .jsp, .aspx, and .php requests are routed to the application server. Server-side processing by JSP servlet, ASP.NET application, ColdFusion, or PHP. Database access via JDBC, ADO.NET, or other database middleware.
Benefits of Three-Tier Architecture
- Scalability.
- Technological flexibility.
- Long-term cost reduction.
- Better match of systems to business needs.
- Improved customer service.
- Competitive advantage.
- Reduced risk.
XML (Extensible Markup Language)
A text-based markup language (like HTML) that uses elements, tags, and attributes. It includes document type declarations (DTDs), XML schemas, comments, and entity references. XML revolutionizes the way data are exchanged over the Internet.
Document Structure Declarations (DSD), XML Schema (XSD), and Relax NG are replacing DTDs for validating XML document structure. XSD is a language for defining XML databases, recommended by the W3C.
Data Warehousing & Business Intelligence
Data Warehouse
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.
- Subject-Oriented: E.g., customers, patients, students, products.
- Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources.
- Time-Variant: Can study trends and changes.
- Non-Updatable: Read-only, periodically refreshed.
Data Mart
A data warehouse that is limited in scope.
Data Types in Data Warehouses
- Event: A database action (create/update/delete) that results from a transaction.
- Transient Data: Changes to existing records are written over previous records, thus destroying the previous data content.
- Periodic Data: Never physically altered or deleted once they have been added to the store.
Derived Data
Objectives of Derived Data
- Ease of use for decision support applications.
- Fast response to predefined user queries.
- Customized data for particular target audiences.
- Ad-hoc query support.
- Data mining capabilities.
Characteristics of Derived Data
- Detailed (mostly periodic) data.
- Aggregate (for summary).
- Distributed (to departmental server).
Surrogate Keys
Dimension table keys should be surrogate (non-intelligent and non-business related), because:
- Business keys may change over time.
- Helps keep track of non-key attribute values for a given primary key.
- Surrogate keys are simpler and shorter.
- Surrogate keys can be the same length and format for all keys.
Advanced Database Concepts & Constraints
Referential Integrity Rules
A rule stating that any foreign key value (on the relation of the many side) MUST match a primary key value in the relation of the one side. (Or the foreign key can be null).
Delete Rules
- Restrict: Do not allow deletion of the “parent” side if related rows exist in the “dependent” side.
- Cascade: Automatically delete “dependent” side rows that correspond with the “parent” side row to be deleted.
- Set-to-Null: Set the foreign key in the dependent side to null if deleting from the parent side. (Not allowed for weak entities).
Enterprise Keys
Primary keys that are unique in the whole database, not just within a single relation. Corresponds with the concept of an object ID in object-oriented systems.
Cardinality Constraints
The number of instances of one entity that can or must be associated with each instance of another entity.
- Minimum Cardinality: If zero, then optional. If one or more, then mandatory.
- Maximum Cardinality: The maximum number.
Aggregate Functions
- Scalar Aggregate: A single value returned from an SQL query with an aggregate function.
- Vector Aggregate: Multiple values returned from an SQL query with an aggregate function (via GROUP BY).
Generalization & Specialization in Data Modeling
- Generalization: The process of defining a more general entity type from a set of more specialized entity types. (BOTTOM-UP approach).
- Specialization: The process of defining one or more subtypes of the supertype and forming supertype/subtype relationships. (TOP-DOWN approach).
Completeness Constraints
Whether an instance of a supertype must also be a member of at least one subtype.
- Total Specialization Rule: Yes (represented by a double line).
- Partial Specialization Rule: No (represented by a single line).
Disjointness Constraints
Whether an instance of a supertype may simultaneously be a member of two (or more) subtypes.
- Disjoint Rule: An instance of the supertype can be only ONE of the subtypes.
- Overlap Rule: An instance of the supertype could be more than one of the subtypes.
Subtype Discriminator
An attribute of the supertype whose values determine the target subtype(s).
- Disjoint: A simple attribute with alternative values to indicate the possible subtypes.
- Overlapping: A composite attribute whose subparts pertain to different subtypes. Each subpart contains a Boolean value to indicate whether or not the instance belongs to the associated subtype.
Packaged Data Models
Advantages of Packaged Data Models
- Use proven model components.
- Save time and cost.
- Less likelihood of data model errors.
- Easier to evolve and modify over time.
- Aid in requirements determination.
- Easier to read.
- Supertype/subtype hierarchies promote reuse.
- Many-to-many relationships enhance model flexibility.
- Vendor-supplied data model fosters integration with vendor’s applications.
- Universal models support inter-organizational systems.
Compliance & IT Governance Frameworks
- Sarbanes-Oxley Act (SOX): Protects investors by improving accuracy and reliability of financial reporting.
- Committee of Sponsoring Organizations (COSO) of the Treadway Commission: Provides frameworks for enterprise risk management, internal control, and fraud deterrence.
- IT Infrastructure Library (ITIL): A set of detailed practices for IT service management.
- Control Objectives for Information and Related Technology (COBIT): A framework for IT governance and management.
Data Warehouse Granularity
What level of detail do you want in the fact table?
- Transactional Grain: Finest level of detail.
- Aggregated Grain: More summarized level.
Finer grains lead to better market basket analysis capability, more dimension tables, and more rows in the fact table. In Web-based commerce, the finest granularity is a click.
Distributed Database Concepts
Transparency in Distributed Databases
- Location Transparency: User does not have to know the location of the data. Data requests are automatically forwarded to appropriate sites.
- Local Autonomy: A local site can operate with its database when network connections fail. Each site controls its own data and security.
- Concurrency Transparency: A design goal for distributed databases, providing the appearance of serial transactions, even though simultaneous transactions are occurring.
Concurrency Control & Transactions
- Time Stamping: A concurrency control mechanism that assigns a globally unique timestamp to each transaction. An alternative to locks in distributed databases.
- Commit: An algorithm to ensure that a transaction is either successfully completed or aborted.
- Two-Phase Commit: An algorithm for coordinating updates in a distributed database.
Semijoin Operation
Only the joining attribute of the query is sent from one site to the other, rather than all selected attributes.
Object-Oriented Database Concepts
- Association Role: The role of an object in an association; the end of an association where it connects to a class.
- Multiplicity: How many objects participate in an association. Represented as Lower-bound…Upper-bound (cardinality).
- Abstract Operation: Defines the form or protocol of the operation, but not its implementation.
- Method: The implementation of an operation.
- Polymorphism: The same operation may apply to two or more different classes in different ways.
- Aggregation: A part-of relationship between a component object and an aggregate object.
- Composition: A stronger form of aggregation in which a part object belongs to only one whole object and exists only as part of the whole object.
- Recursive Aggregation: Composition where the component object is an instance of the same class as the aggregate object.
Data Governance & Master Data Management
- Data Governance: High-level organizational groups and processes overseeing data stewardship across the organization.
- Data Steward: A person responsible for ensuring that organizational applications properly support the organization’s data quality goals.
Master Data Management (MDM)
Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas.
Three Main MDM Architectures
- Identity Registry: Master data remains in source systems; the registry provides applications with location.
- Integration Hub: Data changes are broadcast through a central service to subscribing databases.
- Persistent: A central “golden record” is maintained; all applications have access. Requires applications to push data. Prone to data duplication.
ETL Process
The process of:
- Capture/Extract
- Scrub or Data Cleansing
- Transform
- Load and Index
Data Administration vs. Database Administration
- Data Administration: A high-level function that is responsible for the overall management of data resources in an organization, including maintaining corporate-wide definitions and standards.
- Database Administration: A technical function that is responsible for physical database design and for dealing with technical issues such as security enforcement, database performance, and backup and recovery.
Platform for Privacy Protection (P3P)
Addresses the following aspects of data privacy:
- Who collects data.
- What data is collected and for what purpose.
- Who is data shared with.
- Can users control access to their data.
- How are disputes resolved.
- Policies for retaining data.
- Where are policies kept and how can they be accessed.