Storage Systems: A Comprehensive Guide to Technologies and Concepts
SCSI vs FC
SCSI (Small Computer System Interface) and Fibre Channel (FC) are both interfaces used for connecting servers to external disk storage, and both are commonly used with Storage Area Networks (SANs).
- SCSI can be a performance bottleneck and a single point of failure, which can lead to downtime.
- RAID controllers are typically SCSI hardware.
- SCSI is media-specific, using only copper cables.
- Fibre Channel is a more robust option, with less potential for downtime.
- Fibre Channel hardware is known as a Host Bus Adapter (HBA).
- Fibre Channel is media-independent, supporting both copper and fiber optic cables.
FC vs. iSCSI
Fibre Channel (FC)
- Currently the market leader for shared storage technologies.
- Provides the highest performance levels.
- Designed for mission-critical applications.
- Components are relatively expensive, particularly per-server HBA costs.
- Relatively difficult to implement and manage.
Internet SCSI (iSCSI)
- Relatively new, but usage is increasing rapidly.
- Performance can approach Fibre Channel speeds.
- A better fit than Network Attached Storage (NAS) for database workloads.
- A good fit for Small to Medium Size Businesses (SMBs).
- Relatively inexpensive compared to Fibre Channel.
- Relatively easy to implement and manage.
NAS Benefits
- Increases performance throughput to end users.
- Minimizes investment in additional servers.
- Provides storage pooling.
- Provides heterogeneous file serving.
- Utilizes existing infrastructure, tools, and processes.
Benefits of SAN
- Reduces the cost of external storage.
- Increases performance.
- Centralized and improved tape backup.
- LAN-less backup.
- High-speed, no single-point-of-failure clustering solutions.
- Consolidation.
Goals of BigTable
- Data is highly available at any time.
- Very high read/write rates.
- Efficient scans over all or interesting subsets of data.
- Asynchronous and continuous updates.
- High Scalability.
- Data is organized as (row, column, timestamp) -> cell contents.
- No table-wide integrity constraints.
- No multirow transactions.
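The (row, column, timestamp) -> cell contents model above can be sketched as a sparse, versioned map. This is a minimal illustration, not Bigtable's actual API; the class and method names are invented:

```python
from collections import defaultdict

class SparseTable:
    """Sketch of Bigtable's data model: (row, column, timestamp) -> cell."""

    def __init__(self):
        # row -> column -> {timestamp: value}; sparse, no table-wide constraints
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.cells[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent version of a cell."""
        versions = self.cells[row][column]
        return versions[max(versions)]

t = SparseTable()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "contents:", 5, "<html>v5</html>")
print(t.get("com.cnn.www", "contents:"))  # most recent timestamp wins: v5
```

Note that each cell keeps multiple timestamped versions; reads default to the latest one.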
How is Chubby Used?
- Ensures at most one active master at any time.
- Stores the bootstrap location of Bigtable data.
- Discovers tablet servers and finalizes tablet server deaths.
- Stores Bigtable schema information (the column family information for each table).
- Stores access control lists.
- If Chubby is unavailable for an extended period of time, Bigtable becomes unavailable.
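The "at most one active master" guarantee comes from Chubby acting as a lock service. A minimal sketch of the idea, using a local lock as a stand-in for Chubby (all names here are illustrative assumptions):

```python
import threading

class LockService:
    """Stand-in for Chubby: whoever grabs the exclusive lock is the master."""

    def __init__(self):
        self._lock = threading.Lock()
        self.master = None

    def try_acquire_master(self, server):
        # Non-blocking acquire: only the first caller succeeds.
        if self._lock.acquire(blocking=False):
            self.master = server
            return True
        return False

chubby = LockService()
print(chubby.try_acquire_master("server-1"))  # True: becomes the master
print(chubby.try_acquire_master("server-2"))  # False: at most one master
```

In the real system the lock lives in the replicated Chubby cell, which is why an extended Chubby outage makes Bigtable unavailable.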
SSTable
An SSTable (Sorted String Table) is a sorted file of key-value string pairs, containing chunks of data plus an index.
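A toy in-memory version of this layout can make the "sorted data plus index" structure concrete. The class below is a sketch under assumed names, not Bigtable's file format:

```python
import bisect

class SSTable:
    """Immutable sorted key-value store: a data blob plus a key->offset index."""

    def __init__(self, pairs):
        items = sorted(pairs.items())      # SSTables are sorted by key
        self.index = []                    # (key, offset) pairs
        chunks = []
        offset = 0
        for key, value in items:
            self.index.append((key, offset))
            chunks.append(value)
            offset += len(value)
        self.data = "".join(chunks)        # contiguous chunks of data

    def get(self, key):
        keys = [k for k, _ in self.index]
        i = bisect.bisect_left(keys, key)  # binary search the index
        if i == len(keys) or keys[i] != key:
            return None
        start = self.index[i][1]
        end = self.index[i + 1][1] if i + 1 < len(self.index) else len(self.data)
        return self.data[start:end]

sst = SSTable({"b": "bee", "a": "ay", "c": "sea"})
print(sst.get("b"))  # "bee"
```

Because the file is sorted and the index is small, a lookup is one binary search over the in-memory index plus one read of the data chunk.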
Tablet
A tablet contains a range of rows from a table. It is built out of multiple SSTables and stored in tablet servers.
Table
Multiple tablets make up a table. SSTables can be shared between tablets and may overlap one another, but tablets themselves do not overlap.
Fault Tolerance and Load Balancing
- The master is responsible for load balancing and fault tolerance.
- Chubby is used to hold locks for tablet servers, detect failed servers, and monitor tablet server status.
- The master keeps track of available tablet servers and unassigned tablets.
- If a server fails, tablet recovery is initiated.
Recovering a Tablet
- A new tablet server reads data from the METADATA table.
- Metadata contains a list of SSTables and pointers into any commit log that may contain data for the tablet.
- The server reads the indices of the SSTables into memory.
- The memtable is reconstructed by applying all of the updates since the redo points.
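The last recovery step, rebuilding the memtable from the commit log, can be sketched as a simple replay loop. The log format and names below are assumptions for illustration, not Bigtable's actual structures:

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a tablet's memtable by replaying mutations since the redo point."""
    memtable = {}
    for seq, key, value in commit_log:
        if seq >= redo_point:          # skip updates already in SSTables
            memtable[key] = value      # later entries overwrite earlier ones
    return memtable

# Entries are (sequence number, key, value); redo point 2 means entries
# before seq 2 are already durable in SSTables.
log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new")]
print(recover_memtable(log, 2))  # {'b': 'x', 'a': 'new'}
```

The SSTable indices read in the previous step serve point lookups for older data, while the replayed memtable holds everything newer than the redo point.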
Refinements
- Group column families together into an SSTable.
- Compress locality groups.
- Use Bloom Filters on locality groups to avoid searching the SSTable.
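The Bloom-filter refinement works because a negative answer is definitive: if the filter says a key is absent, the SSTable on disk need not be searched at all. A minimal sketch, with the bit count and hash count chosen arbitrarily:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # integer used as a bit array

    def _positions(self, key):
        # Derive k independent bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row17")
print(bf.might_contain("row17"))   # True: present keys always pass
# A miss reports absent (barring a rare false positive), skipping the disk read.
```

In Bigtable the filter is built per locality group, so most lookups for nonexistent rows never touch the SSTable files.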
What is Spanner?
Spanner is a globally distributed database system designed for strong consistency with wide area replication. It offers:
- Auto-sharding and auto-rebalancing.
- Automatic failure response.
- User/application control over data replication and placement.
- Transaction serialization via global timestamps.
- Acknowledges clock uncertainty and guarantees a bound on it.
- Uses a novel TrueTime API for concurrency control.
- Enables consistent backups and atomic schema updates during ongoing transactions.
- Features lock-free distributed read transactions.
- Provides external consistency of distributed transactions.
- Implementation integrates concurrency control, replication, and 2PC (2 Phase Commit).
TrueTime
TrueTime is a key enabling technology for Spanner. It provides:
- Interval-based global time.
- Exposes uncertainty in clock.
- Leverages hardware features like GPS and Atomic Clocks.
- A set of time master servers per datacenter and time slave daemons per machine.
- Daemons poll various masters and reach a consensus about the correct timestamp.
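The essence of the interval-based API above is that a "now" call returns a bound, not a point. The sketch below assumes a fixed uncertainty epsilon and invented method names; real TrueTime derives the bound from GPS and atomic-clock masters:

```python
import time

class TrueTime:
    """Sketch of an interval clock: now() returns (earliest, latest)."""

    def __init__(self, uncertainty_s=0.007):   # ~7 ms epsilon, an assumption
        self.eps = uncertainty_s

    def now(self):
        t = time.time()
        # The true time is guaranteed to lie inside this interval.
        return (t - self.eps, t + self.eps)

    def after(self, t):
        # True only when the true time is definitely past t.
        earliest, _ = self.now()
        return earliest > t

tt = TrueTime()
earliest, latest = tt.now()
print(latest - earliest)  # interval width = 2 * epsilon
```

Spanner's commit-wait uses exactly this `after` test: a transaction's timestamp is not released until the uncertainty interval around it has provably passed.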
Dynamo
Dynamo is a distributed database system designed for high availability and fault tolerance. Key features include:
- Every node has the same responsibilities as its peers.
- No updates are rejected due to failures or concurrent writes.
- Conflict resolution is executed during reads instead of writes, resulting in an "always writeable" system.
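The "always writeable" property can be sketched as a store that never rejects a write and instead surfaces divergent versions at read time. The class below is an illustration with invented names, not Dynamo's API (Dynamo actually tracks causality with vector clocks):

```python
class AlwaysWriteableStore:
    """Writes always succeed; reads reconcile any concurrent versions."""

    def __init__(self):
        self.versions = {}          # key -> list of concurrent versions

    def put(self, key, value):
        # Never reject a write: just record another version.
        self.versions.setdefault(key, []).append(value)

    def get(self, key, reconcile=None):
        vals = self.versions.get(key, [])
        if reconcile and len(vals) > 1:
            merged = reconcile(vals)       # conflict resolution at read time
            self.versions[key] = [merged]
            return [merged]
        return vals

store = AlwaysWriteableStore()
store.put("cart", {"a"})                   # two concurrent writes...
store.put("cart", {"b"})
# ...reconciled on read, here by a shopping-cart-style set union.
print(store.get("cart", reconcile=lambda vs: set().union(*vs)))
```

The shopping-cart merge shown (union of items) is the canonical Dynamo example of application-level reconciliation.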
Replica Synchronization
Dynamo implements an anti-entropy (replica synchronization) protocol to keep replicas synchronized, addressing scenarios where hinted replicas become unavailable before they can be returned to the original replica node. This protocol utilizes a Merkle tree.
Merkle Tree
A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children.
Advantages of Merkle Tree
- Each branch of the tree can be checked independently without requiring nodes to download the entire tree.
- Reduces the amount of data that needs to be transferred while checking for inconsistencies among replicas.
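A small sketch makes both advantages concrete: replicas exchange only the root hash, and identical roots prove identical data without transferring it. Function names here are illustrative:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(values):
    """Leaves hash individual values; parents hash their concatenated children."""
    level = [h(v.encode()) for v in values]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node if odd
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = ["v1", "v2", "v3", "v4"]
replica_b = ["v1", "v2", "v3", "v4"]
print(merkle_root(replica_a) == merkle_root(replica_b))  # True: in sync
```

When roots differ, the replicas descend the tree comparing child hashes, so only the branches that actually diverge are fetched.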
Membership Detection
Dynamo uses an explicit mechanism to initiate the addition or removal of nodes from the Dynamo ring. This mechanism involves:
- The node serving the request writes the membership change and its time of issue to a persistent store.
- Membership changes form a history, as nodes can be removed and added back multiple times.
- A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.
- Each node contacts a randomly chosen peer every second, and the two nodes efficiently reconcile their persisted membership change histories.
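The reconciliation step in the last bullet amounts to merging two persisted change histories into one ordered view. A minimal sketch with an assumed (timestamp, node, change) record format:

```python
def reconcile(history_a, history_b):
    """Merge two membership-change histories into one consistent view."""
    # Union of both histories, ordered by time of issue.
    return sorted(set(history_a) | set(history_b))

# Each record is (time of issue, node, change).
a = [(1, "nodeX", "add"), (3, "nodeY", "add")]
b = [(1, "nodeX", "add"), (2, "nodeZ", "add"), (4, "nodeY", "remove")]
merged = reconcile(a, b)
print(merged)  # both peers now hold the same four-event history
```

After each such pairwise gossip exchange both peers persist the merged history, and repeated random pairings drive every node toward the same eventually consistent membership view.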
RAID
RAID (Redundant Array of Independent Disks) is a data storage virtualization technology that combines multiple disk drive components into a logical unit for data redundancy or performance improvement. Data is distributed across the drives in various ways, known as RAID levels. Each scheme provides a different balance between reliability, availability, performance, and capacity.
- RAID levels greater than RAID 0 provide protection against unrecoverable (sector) read errors and whole disk failure.
- RAID 0 consists of striping without mirroring or parity.
- RAID 1 consists of mirroring without parity or striping.
- RAID 2 consists of bit-level striping with dedicated Hamming-code parity.
- RAID 3 consists of byte-level striping with dedicated parity.
- RAID 4 consists of block-level (block-interleaved) striping with dedicated parity. Wasted storage is small, one parity block per N data blocks, but the dedicated parity disk becomes a hot spot, since every write must update it.
- RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives. It requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks.
RAID 5 is seriously affected by the general trend toward larger drives: rebuild times grow with capacity, which increases the chance of a second failure during the rebuild.
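The parity schemes in RAID 4 and RAID 5 rest on one property of XOR: the parity block is the XOR of the stripe's data blocks, so any single lost block can be recomputed from the survivors. A short sketch:

```python
from functools import reduce

def parity(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]   # three data blocks
p = parity(stripe)                                  # parity block for the stripe

# Simulate losing the second disk: rebuild its block from parity + survivors.
rebuilt = parity([stripe[0], stripe[2], p])
print(rebuilt == stripe[1])  # True: the lost block is fully recovered
```

This is also why rebuilds are expensive: reconstructing one failed disk requires reading every surviving disk in full, which is the exposure window described above.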
Intelligent Storage System
An intelligent storage system consists of four key components: front end, cache, back end, and physical disks.
High-end Storage Systems
High-end storage systems, referred to as active-active arrays, are generally aimed at large enterprises for centralizing corporate data. These arrays are designed with a large number of controllers and cache memory. An active-active array implies that the host can perform I/Os to its LUNs across any of the available paths.
Midrange Storage Systems
Midrange storage systems, also referred to as active-passive arrays, are designed for small and medium enterprises. They have two controllers, each with cache, RAID controllers, and disk drive interfaces. Hosts can perform I/Os to LUNs only through active paths. Other paths remain passive until the active path fails. Midrange arrays are less scalable compared to high-end arrays.