Database Systems & Data Warehousing Concepts

Data Warehouse Architectures

Metadata in Knowledge Data Lifecycle

Examples of metadata that can be used at every step of the Knowledge Data Lifecycle:

  • Data Selection: Data description, date of data, structures, data sources, extraction mechanisms.
  • Data Pre-processing: Cleaning logs, methods, missing values.
  • Data Mining & Machine Learning: Data types for algorithms, size of dataset.
  • Evaluation & Interpretation: Knowledge derived, data mining patterns.

CSV vs. DBMS: Advantages Comparison

Advantages of Using Separate Files (CSV):

  • Simplicity
  • Portability
  • Cost-effective
  • Easier to share
  • Less complexity

Advantages of Using DBMS:

  • Guarantees the structure of the data
  • Easier to collaborate
  • Easier to query
  • Faster query execution

Inmon’s Data Warehouse Characteristics

Characteristics of a Data Warehouse (DW) in W. H. Inmon’s definition, which increase the amount of data:

  • Subject-oriented: Data is organized around the major subjects of the enterprise, not around applications. The same data may appear replicated under different subjects (e.g., the same customer data in several subject areas) to facilitate queries.
  • Integrated: Data is gathered from multiple disparate sources and integrated into a consistent format.
  • Time-variant: The DW maintains a historical record of data, allowing analysis of changes over time. You keep the whole history.
  • Non-volatile: Once data is loaded into the DW, it is not typically updated, changed, or deleted. It remains static and unchanged over time, retaining historical data.

Time-Variant vs. Non-Volatile Data Warehouses

Time-variant: The time-variant characteristic of a Data Warehouse implies that it is designed to capture and store historical data, allowing users to analyze how data changes over time. It maintains a historical record of data, and new data is added as it becomes available. This relates to valid time, which reflects what is happening in the real world.

Non-volatile: The non-volatile characteristic means that once data is loaded into the Data Warehouse, it is not typically updated, changed, or deleted. It remains static and unchanged over time. Any changes or corrections to the source data are handled separately in the source systems. This relates to transaction time, reflecting when data was recorded.
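
As an illustration, a minimal sketch (hypothetical dim_customer table; all names are assumptions) of how the two notions often materialize in a warehouse table: the valid-time columns describe when a version of the row held in the real world, while the load timestamp records when the row entered the DW and, the DW being non-volatile, rows are only appended, never updated or deleted.

CREATE TABLE dim_customer (
  customer_key INTEGER PRIMARY KEY,  -- surrogate key of this version of the customer
  customer_id  INTEGER,              -- business key in the source system
  address      VARCHAR(100),
  valid_from   DATE,                 -- valid time: when this version started to hold in the real world
  valid_to     DATE,                 -- valid time: when it stopped holding (NULL while current)
  loaded_at    TIMESTAMP             -- transaction time: when the row was recorded in the DW
);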

Online Analytical Processing (OLAP)

OLAP Schema Differences

Briefly explain the difference between:

  1. Star vs. Snowflake Schema

    In a Star Schema, the dimension tables are denormalized and directly linked to the fact table without further normalization. In a Snowflake Schema, dimension tables are normalized into multiple related tables, meaning a dimension table has at least one other table linked to it. A snowflake schema therefore makes the aggregation hierarchies explicit, for example the levels of a time hierarchy (see the DDL sketch after this list).

  2. Star vs. Galaxy Schema

    A Star Schema is composed of one central fact table surrounded by multiple dimension tables. A Galaxy Schema (or Fact Constellation Schema) is composed of two or more fact tables that share one or more dimension tables. Essentially, a Galaxy schema consists of multiple star schemas joined by common dimensions.
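
As an illustration, a minimal DDL sketch (hypothetical sales fact and time dimension; all names are assumptions) contrasting the two options: in the star variant the whole time hierarchy sits denormalized in one dimension table, while in the snowflake variant the month level is normalized into its own table.

-- Star schema: denormalized time dimension
CREATE TABLE dim_date_star (
  date_id  INTEGER PRIMARY KEY,
  day_date DATE,
  month_no INTEGER,
  year_no  INTEGER
);

-- Snowflake schema: the month level becomes a separate, normalized table
CREATE TABLE dim_month (
  month_id INTEGER PRIMARY KEY,
  month_no INTEGER,
  year_no  INTEGER
);
CREATE TABLE dim_date_snow (
  date_id  INTEGER PRIMARY KEY,
  day_date DATE,
  month_id INTEGER REFERENCES dim_month(month_id)
);

-- The fact table is the same in both cases: it references only the date dimension
CREATE TABLE fact_sales (
  date_id INTEGER REFERENCES dim_date_star(date_id),
  amount  DECIMAL(10,2)
);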

Snowflake Schema in RDBMS

Would you implement a snowflake schema in a Relational DBMS (RDBMS)?

No, because it requires more joins, which can be more expensive than the redundancy found in a star schema. While it reduces data redundancy, the increased join complexity can negatively impact query performance in an RDBMS.

Dimensions Without Attributes or Hierarchies

Consider a multidimensional schema with a dimension that does not have any attribute or aggregation hierarchy (e.g., order number). Briefly justify whether you would implement it like:

(b) Only an attribute in the fact table, but not pointing to any dimensional table.

This approach is suitable because if a dimension has no attributes or hierarchy, creating a separate dimension table would introduce unnecessary overhead without providing any analytical benefit. Storing it directly as an attribute in the fact table simplifies the schema and avoids extra joins.
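
As an illustration, a minimal sketch (hypothetical fact_order table; names are assumptions) of option (b): the order number is kept as a plain attribute of the fact table, often called a degenerate dimension, instead of a foreign key to a one-column dimension table.

CREATE TABLE fact_order (
  date_id      INTEGER,        -- foreign key to the date dimension
  product_id   INTEGER,        -- foreign key to the product dimension
  order_number VARCHAR(20),    -- degenerate dimension: no attributes, no hierarchy, no dimension table
  amount       DECIMAL(10,2)
);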

Handling Overlapping Dimension Elements

One of the three necessary summarizability conditions is disjointness. How can you implement a dimension with a level whose elements are overlapping, avoiding summarizability problems? Give a concrete example using a UML class diagram with only two levels (you do not need to draw the whole snowflake schema).

One common approach is to use a bridge table or association class to represent the many-to-many relationship between the overlapping elements. The association class (bridge table) helps maintain the many-to-many relationship, allowing for flexibility in handling overlapping elements without violating summarizability conditions. (Note: UML class diagram not provided in original text.)
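
A minimal sketch under the assumption of two hypothetical levels, Product and Category, where a product may belong to several (overlapping) categories: the bridge table materializes the many-to-many association, and a weighting factor can be used so that measures are not double-counted when rolling up.

CREATE TABLE category (
  category_id   INTEGER PRIMARY KEY,
  category_name VARCHAR(50)
);

CREATE TABLE product (
  product_id   INTEGER PRIMARY KEY,
  product_name VARCHAR(50)
);

-- Bridge table (association class): one row per product/category pair
CREATE TABLE product_category (
  product_id  INTEGER REFERENCES product(product_id),
  category_id INTEGER REFERENCES category(category_id),
  weight      DECIMAL(5,4),    -- e.g., 0.5 if the product belongs to two categories
  PRIMARY KEY (product_id, category_id)
);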

SQL NULL Value Meanings

Name the three possible meanings of the NULL value in standard SQL:

  • Missing
  • Not applicable
  • Unknown

SQL GROUP BY Modifiers & Queries

Using the new GROUP BY modifiers and the star schema from the exercise (not reproduced here), write queries that obtain the results shown in the following exercises:

Query Example: Police Data Aggregation

a) For each age range, sum of all the Police with the rank of “Caporal” assigned to each city in the county “Baix Llobregat” with specialty “Support”

SELECT a.rankID, p.age, a.cityID, SUM(a.police)
FROM assignment a, profile p, city c
WHERE a.profileID=p.ID AND a.cityID=c.ID
AND a.rankID='Caporal' AND c.county='Baix Llobregat'
AND a.SpecialityID='Support'
GROUP BY a.rankID, ROLLUP(p.age, a.cityID)

Query Example: Average Female Police

c) Average female police of any rank in the “Basic” scale assigned every year to the province of “Barcelona”, belonging to specialties “TEDAX”.

SELECT d.year, p.gender, c.province, a.specialty, AVG(a.police)
FROM date d, profile p, city c, assignment a, rank r
WHERE a.cityID=c.ID AND d.ID=a.dateID AND p.ID=a.profileID
AND a.RankID=r.name AND p.gender='Female'
AND r.scale='Basic' AND c.province='Barcelona'
AND a.specialty = 'TEDAX'
GROUP BY CUBE(p.gender, c.province), ROLLUP(d.year, a.specialty)

Query Example: Average Male Police

e) Average male police assigned every year to a given city with a certain rank. There are 5 attributes involved, and since 5 is a prime number the required result cannot be obtained as a combination of CUBE and ROLLUP; the grouping sets would have to be listed explicitly with GROUPING SETS (see the sketch below).
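
A hedged sketch of how such a result could be written when no CUBE/ROLLUP combination fits: GROUPING SETS lists the required aggregation levels explicitly. Column names follow the earlier queries; the exact grouping sets depend on the result table of the exercise, which is not reproduced here, so the ones below are only illustrative.

SELECT d.year, c.name, a.rankID, AVG(a.police)
FROM assignment a, date d, city c, profile p
WHERE a.dateID=d.ID AND a.cityID=c.ID AND a.profileID=p.ID
AND p.gender='Male'
GROUP BY GROUPING SETS (
  (d.year, c.name),
  (d.year, a.rankID),
  (d.year),
  (c.name),
  ()
)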

Query Optimization

Syntactic Optimization Heuristic Rules

Name the two heuristic rules of syntactic optimization, and briefly explain the rationale behind them (i.e., why they, in general, work).

  1. Apply selections as early as possible (push selections down towards the data sources): filtering rows early reduces the number of tuples in the intermediate results, so every subsequent operation (joins, sorts, aggregations) processes less data.
  2. Apply projections as early as possible (push projections down): discarding unneeded columns early reduces the size of each tuple, so intermediate results take less space and fewer blocks have to be read, written, and moved between operators.

In both cases the rationale is the same: the cost of the remaining operations grows with the size of the intermediate results, and these rules shrink those results as soon as possible (see the example after this list).
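
As an illustration at the SQL level (hypothetical employee and department tables), the second query below is equivalent to the first but has the selection on city and the projections of only the needed columns pushed below the join, so the join processes fewer and smaller rows.

SELECT e.name, d.dept_name
FROM employee e JOIN department d ON e.dept_id = d.dept_id
WHERE d.city = 'Barcelona';

-- Equivalent form with the selection and the projections pushed down
SELECT e.name, d.dept_name
FROM (SELECT name, dept_id FROM employee) e
JOIN (SELECT dept_id, dept_name FROM department WHERE city = 'Barcelona') d
ON e.dept_id = d.dept_id;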

SQL Query for Projection Merge

Give an example of an SQL query whose syntactic optimization would require the merge of two projections.

SELECT e.name
FROM (SELECT name, salary, department FROM employee) e;

The inner and the outer SELECT clauses produce two consecutive projections, which syntactic optimization merges into a single projection on name.

Clustered Index Scan Cost Estimation

We estimate the cost of a table scan in a clustered index like ⌈1.5B⌉ · D. However, we estimate the cost of searching one tuple without an index like 0.5B · D. Briefly explain why we round it up in the first case and not in the second.

The scan cost ⌈1.5B⌉ · D counts an actual number of physical block reads (roughly 1.5B blocks, because the blocks of a clustered index are not completely full), and since whole blocks are read, that count must be an integer, so it is rounded up. The cost 0.5B · D of searching one tuple without an index is an expected (average) value: on average half of the table is scanned before the tuple is found, and an average cost does not have to correspond to an integer number of blocks, so no rounding is applied.
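
As an illustrative worked example (numbers are assumptions, not from the exercise): with B = 7 blocks and D = 10 ms per block access, the scan costs ⌈1.5 · 7⌉ · D = ⌈10.5⌉ · 10 = 110 ms, whereas the average search cost is 0.5 · 7 · 10 = 35 ms; the 10.5 must be rounded up because 10.5 physical blocks cannot be read, while the 35 ms is only an expected value and needs no rounding.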

Pipelining Join Algorithms

Name the two join algorithms that allow pipelining and briefly explain what is the benefit of using it, and where it comes from.

The two join algorithms that allow pipelining are:

  • Hash Join
  • Merge Join

The benefit of pipelining comes from the ability to start producing results earlier in the execution process. Instead of waiting for an entire intermediate result set to be fully computed and materialized before the next operation begins, pipelining allows data to flow directly from one operator to the next as it becomes available. This reduces the need for temporary storage, improves overall query response time, and can significantly enhance performance for complex queries.

Sequential vs. Random Access Costs

Given the figure below (Note: figure not provided in original text), answer the following questions in order to study the difference between sequential and random access.

Reading the Whole Table

a) If we need to read the whole table. How many blocks do we read if using a table scan (ignoring the index)? How many blocks do we read using the B+ tree index to read the whole table?

  • Table Scan: We read as many blocks as the table occupies, as it reads the entire table sequentially without using any index.
  • B+ Tree Index: We would read the index blocks (entering at the root, going down to the leaf level, and then traversing all the leaf nodes) plus the data blocks; without caching, each data block is typically fetched more than once (in this example, roughly twice), because consecutive index entries can point to rows stored in different blocks. Reading the whole table through the index is therefore generally less efficient than a plain table scan; a B+ tree only pays off when few rows are needed.

Retrieving a Single Tuple

b) If we need to retrieve the employee number 6. How many blocks do we read if using a table scan? How many blocks do we read if using the B+ tree index to find it?

  • Table Scan: On average, 0.5B blocks (where B is the total number of blocks in the table), as it involves scanning half the table on average to find a specific tuple.
  • B+ Tree Index: Typically one block read per index level (the root, any intermediate level, and a leaf) plus one more block read to fetch the data row itself, i.e., around 3–4 block reads in total, depending on the tree’s depth.

Materialized Views

Materialized View Query Conditions

Which are the three conditions for a query to be solved from a materialized view without accessing the source tables?

  • Match condition
  • Summarize condition
  • Rewrite condition

Materialized View Query Answering Justification

Justify if we can answer or not the queries using materialized views.

Yes: if the materialized view stores the SUM and the COUNT, the requested AVG can be computed as SUM/COUNT. However, it is not possible when the query needs finer detail than the view provides, e.g., when the query asks for a specific customer but the customer does not appear in the view’s GROUP BY clause, so that level of detail has simply not been stored.
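
A minimal sketch of the SUM/COUNT argument (hypothetical sales table; materialized-view syntax as in Oracle or PostgreSQL): the view stores both aggregates per month, so a query asking for the monthly average can be rewritten against it, whereas a query asking for a specific customer could not, because the customer is not among the view’s grouping attributes.

CREATE TABLE sales (month_no INTEGER, customer_id INTEGER, amount DECIMAL(10,2));

CREATE MATERIALIZED VIEW monthly_sales AS
SELECT month_no, SUM(amount) AS sum_amount, COUNT(amount) AS cnt_amount
FROM sales
GROUP BY month_no;

-- Rewritten query: the average is derived from the stored SUM and COUNT
SELECT month_no, sum_amount / cnt_amount AS avg_amount
FROM monthly_sales;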

ETL Processes

DBMS-Based Data Extraction Mechanisms

Name three extraction mechanisms using exclusively the DBMS and briefly explain the requirements each of these pose on it.

  • Change Data Capture (CDC): Requires the DBMS to have features for tracking and capturing changes made to data in real-time or near real-time (e.g., transaction logs, triggers, or specific CDC tools).
  • Database Triggers: Requires the DBMS to support triggers that can execute predefined actions (like logging changes to a separate table) when specific DML operations (INSERT, UPDATE, DELETE) occur on source tables (see the sketch after this list).
  • Materialized Views: Requires the DBMS to support materialized views, which are pre-computed summary tables that can be refreshed periodically or incrementally to reflect changes in the source data.
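
A minimal sketch of the trigger-based mechanism (hypothetical customer source table and customer_changes staging table; MySQL-style trigger syntax, other DBMSs differ slightly): every update is logged so the ETL process can later extract only the deltas.

-- Source table and a staging table that records the changes
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name VARCHAR(50));
CREATE TABLE customer_changes (customer_id INTEGER, change_type CHAR(1), changed_at TIMESTAMP);

-- Trigger that writes one row into the staging table for each update on the source
CREATE TRIGGER trg_customer_update
AFTER UPDATE ON customer
FOR EACH ROW
INSERT INTO customer_changes VALUES (NEW.customer_id, 'U', CURRENT_TIMESTAMP);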

Advantages of ETL Tools

Name three advantages of creating ETL processes using ETL tools (against hand-coded scripts).

  • Ease of Use (visual development, pre-built connectors)
  • Scalability and Maintainability (easier to manage complex flows, better performance handling)
  • Reusability (components can be reused across different projects)

ETL Quality Improvement via Design

Which ETL process quality characteristic is primarily improved by introducing Logical and Conceptual designs? Briefly explain how.

The primary ETL process quality characteristic improved by introducing Logical and Conceptual designs is Data Quality and Data Integration.

  • Data Quality: By designing data models and transformations at the logical and conceptual levels, potential data quality issues (e.g., inconsistencies, inaccuracies, missing values) can be identified and addressed proactively. This ensures that the transformed data aligns with the intended meaning and business rules.
  • Data Integration: These design phases define clear mappings and transformations across disparate sources before implementation begins, reducing redundancy and ensuring that data from different systems is combined with a consistent structure and meaning.

Schema & Data Integration

Steps to Resolve Semantic Heterogeneity

Name the three major steps to solve semantic heterogeneity.

  • Schema integration
  • Semantic reconciliation
  • Query preprocessing and optimization

Federated vs. Wrapper/Mediator Architectures

Which is the main difference between a federated architecture and an architecture based on wrappers and mediators?

  • Federated Architecture: In a federated architecture, each data source maintains its autonomy, and the federation framework coordinates queries across multiple autonomous databases. It provides a single, unified view of distributed data while allowing sources to remain independent.
  • Wrappers and Mediators: In an architecture based on wrappers and mediators, wrappers encapsulate the functionality to access and understand individual data sources, translating queries and results between the global schema and the local source. Mediators then integrate the data from multiple wrappers, providing a unified view. This approach often involves more centralized control over data access and integration logic.

Schematic Discrepancy Example

Give an example of schematic discrepancy (different from that in the slides), where data in one source is represented as metadata in the other. Draw and briefly explain it.

Scenario: Consider two sources that record people’s phone numbers. In one source the kind of phone is data (a column value); in the other, the same information is encoded in the column names, i.e., as metadata.

  • Source 1: A table `Phone(person, phone_type, number)`, where `phone_type` takes values such as 'home', 'work', and 'mobile'; the kind of phone is plain data.
  • Source 2: A table `Person(name, home_phone, work_phone, mobile_phone)`, where the kind of phone appears only in the attribute names; the same information is now metadata (part of the schema), and supporting a new kind of phone requires changing the schema rather than inserting rows.

(Note: Drawing not provided in original text.)
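
A minimal DDL sketch of the two hypothetical sources (table and column names are illustrative and stand in for the missing drawing):

-- Source 1: the kind of phone is data (a value of phone_type)
CREATE TABLE phone_src1 (
  person       VARCHAR(50),
  phone_type   VARCHAR(10),   -- 'home', 'work', 'mobile'
  phone_number VARCHAR(20)
);

-- Source 2: the same kinds of phone are metadata (column names)
CREATE TABLE person_src2 (
  name         VARCHAR(50),
  home_phone   VARCHAR(20),
  work_phone   VARCHAR(20),
  mobile_phone VARCHAR(20)    -- a new kind of phone would require an ALTER TABLE
);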

R-Swoosh Algorithm Worst-Case Cost

Which is the worst-case cost of the R-Swoosh algorithm and when would it happen?

R-Swoosh is a generic entity (record) resolution algorithm. Its worst-case cost is quadratic in the number of input records: O(n²) pairwise comparisons (about n(n−1)/2 for n records). This worst case happens when no two records match (or matches are only found at the very end), because every new record must then be compared against all the records already kept, and no merge ever reduces the set that remains to be compared.