A Comprehensive Guide to Storage Systems: Architectures, Technologies, and Performance
SCSI vs. FC
- Interface for: internal storage (SCSI), external disks (FC)
- Used with: SAN
- Potential downtime: present with both SCSI and FC
- Hardware: RAID controller (SCSI), HBA (FC)
- Media: copper only (SCSI), media independent – copper or fiber optic (FC)
FC vs. iSCSI
Fibre Channel (FC)
- Current market leader for shared storage technologies
- Provides the highest performance levels
- Designed for mission-critical applications
- Cost of components is relatively high, particularly per-server HBA costs
- Relatively difficult to implement and manage
iSCSI
- Relatively new, but usage is increasing rapidly
- Performance can approach Fibre Channel speeds
- A better fit for databases than NAS
- A good fit for Small to Medium Size Businesses
- Relatively inexpensive, compared to Fibre Channel
- Relatively easy to implement and manage
NAS Benefits
- Increases throughput to end users
- Minimizes investment in additional servers
- Provides storage pooling
- Provides heterogeneous file serving
- Uses existing infrastructure, tools, and processes
Benefits of SAN
- Reduced cost of external storage
- Increased performance
- Centralized and improved tape backup
- LAN-less backup
- High-speed clustering solutions with no single point of failure
- Consolidation
Goals of BigTable
- Data is highly available at any time
- Very high read/write rates
- Efficient scans over all or interesting subsets of data
- Asynchronous and continuous updates
- High scalability
- (row, column, timestamp) -> cell contents (the data model; see the sketch after this list)
- No table-wide integrity constraints
- No multi-row transactions
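A minimal sketch of the (row, column, timestamp) -> cell contents data model listed above, as a sparse in-memory map with versioned cells. The class and method names are illustrative assumptions, not Bigtable's API; note the single-row writes, matching the goals above.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of the Bigtable data model: a sparse map
    (row, column, timestamp) -> cell contents."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        # Single-row writes only; no table-wide constraints are enforced.
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(reverse=True)  # keep the newest version first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._rows[row][column]:
            if timestamp is None or ts <= timestamp:
                return value
        return None

# Webtable-style example
t = SparseTable()
t.put("com.example.www", "contents:", 3, "<html>v3</html>")
t.put("com.example.www", "anchor:news.example.org", 5, "Example")
print(t.get("com.example.www", "contents:"))  # -> <html>v3</html>
```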
How is Chubby Used?
- Ensure at most one active master at any time (see the lock sketch after this list)
- Store the bootstrap location of Bigtable data
- Discover tablet servers and finalize tablet server deaths
- Store Bigtable schema information (the column family information for each table)
- Store access control lists
- If Chubby is unavailable for an extended period, Bigtable becomes unavailable
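A hedged sketch of the first use above: master election via an exclusive lock. The in-process ToyLockService below stands in for Chubby; its methods are illustrative assumptions, not Chubby's real client API, sessions, or leases.

```python
import threading

class ToyLockService:
    """In-process stand-in for a Chubby-like lock service (illustration only;
    real Chubby is a replicated, Paxos-based service with sessions and leases)."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def try_acquire(self, path, owner):
        """Grant the exclusive lock at `path` to `owner` only if it is free."""
        with self._guard:
            if path not in self._locks:
                self._locks[path] = owner
                return True
            return False

    def release(self, path, owner):
        with self._guard:
            if self._locks.get(path) == owner:
                del self._locks[path]

# At most one of several would-be masters can win the exclusive lock.
chubby = ToyLockService()
winners = [name for name in ("m1", "m2", "m3")
           if chubby.try_acquire("/bigtable/master-lock", name)]
print(winners)  # -> ['m1']: exactly one active master
```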
SSTable
- Sorted file of key-value string pairs
- Chunks of data plus an index
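A minimal sketch of the SSTable idea just described: immutable, sorted key-value data split into chunks, with an index of chunk start keys so a lookup touches only one chunk. The in-memory layout and names are simplified assumptions, not the real on-disk format.

```python
import bisect

class ToySSTable:
    """Sorted key-value pairs split into chunks, plus an index of each
    chunk's first key (a simplification of the on-disk format)."""

    def __init__(self, items, chunk_size=2):
        items = sorted(items)                      # keys kept in sorted order
        self.chunks = [items[i:i + chunk_size]
                       for i in range(0, len(items), chunk_size)]
        self.index = [chunk[0][0] for chunk in self.chunks]  # first key per chunk

    def get(self, key):
        # The index narrows the lookup to one chunk; only that chunk would
        # have to be read from disk.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None
        for k, v in self.chunks[pos]:
            if k == key:
                return v
        return None

sst = ToySSTable([("apple", 1), ("kiwi", 2), ("mango", 3), ("pear", 4)])
print(sst.get("mango"))  # -> 3
```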
Tablet
- Contains some range of rows of the table
- Built out of multiple SSTables
- Tablets are stored in Tablet servers
Table
- Multiple tablets make up the table
- SSTables can be shared
- Tablets do not overlap, SSTables can overlap
Fault Tolerance and Load Balancing
- Master responsible for load balancing and fault tolerance
- Uses Chubby to hold locks on tablet servers and to restart failed servers
- Master checks the status of tablet servers
- Keep track of available tablet servers and unassigned tablets
- If a server fails, start tablet recovery
Recovering Tablet
- New tablet server reads data from METADATA table
- Metadata contains the list of SSTables and a set of redo points: pointers into any commit logs that may contain data for the tablet
- Server reads the indices of the SSTables in memory
- Reconstructs the memtable by applying all of the updates since the redo points
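A hedged sketch of the recovery steps above: read the tablet's SSTable list and redo point from METADATA, then replay only the commit-log records after the redo point to rebuild the memtable. The data shapes and field names are illustrative assumptions.

```python
def recover_tablet(metadata_entry, commit_log):
    """Rebuild a tablet's in-memory state after reassignment (toy version)."""
    sstables = metadata_entry["sstables"]       # immutable on-disk state
    redo_point = metadata_entry["redo_point"]   # log position already persisted

    # Reconstruct the memtable by applying only updates after the redo point.
    memtable = {}
    for seq, key, value in commit_log:
        if seq > redo_point:
            memtable[key] = value
    return sstables, memtable

meta = {"sstables": ["sst-001", "sst-002"], "redo_point": 17}
log = [(16, "a", "old"), (18, "a", "new"), (19, "b", "x")]
print(recover_tablet(meta, log))
# -> (['sst-001', 'sst-002'], {'a': 'new', 'b': 'x'})
```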
Refinements
- Locality groups: group multiple column families together into an SSTable
- Locality groups can be compressed
- Bloom filters on locality groups – avoid searching SSTables that do not contain the requested row/column pair
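A minimal Bloom filter sketch for the last refinement above: a per-SSTable (or per-locality-group) filter answers "definitely not present" or "maybe present", so most lookups for absent rows/columns skip the SSTable entirely. Hash choices and sizes below are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    """Probabilistic membership test: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per SSTable: a negative answer means the SSTable need not be read.
bf = BloomFilter()
bf.add("com.example.www/contents:")
print(bf.might_contain("com.example.www/contents:"))  # True
print(bf.might_contain("org.missing/anchor:"))        # False (almost certainly)
```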
What is Spanner?
- For strong consistency with wide-area replication
- Auto-sharding, auto-rebalancing, automatic failure response
- Exposes control of data replication and placement to user/application
- Enables transaction serialization via global timestamps
- Acknowledges clock uncertainty and guarantees a bound on it
- Uses novel TrueTime API to accomplish concurrency control
- Enables consistent backups, atomic schema updates during ongoing transactions
Features:
- Lock-free distributed read transactions
Properties:
- External consistency of distributed transactions
Implementation:
- Integration of concurrency control, replication, and 2PC (2 Phase Commit)
Enabling Technology: TrueTime
- Interval-based global time
- Exposes uncertainty in the clock
- Leverages hardware features like GPS and Atomic Clocks
- Set of time master servers per data center and time slave daemon per machine
- Daemon polls a variety of masters and reaches a consensus about the correct timestamp
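A hedged sketch of the interval-based idea above: a now() call returns [earliest, latest] bounds on absolute time, and waiting out the uncertainty ("commit wait") before exposing a transaction's timestamp is what allows externally consistent ordering. The function names and the fixed uncertainty bound are assumptions modeled on the paper's description, not a real TrueTime client.

```python
import time

CLOCK_UNCERTAINTY_S = 0.007  # assumed bound (epsilon) on local clock error

def tt_now():
    """Return an interval [earliest, latest] guaranteed to contain absolute time."""
    t = time.time()
    return (t - CLOCK_UNCERTAINTY_S, t + CLOCK_UNCERTAINTY_S)

def commit_wait(commit_timestamp):
    """Block until `commit_timestamp` is definitely in the past, i.e. until
    the earliest bound exceeds it; afterwards any later transaction anywhere
    is guaranteed a strictly larger timestamp."""
    while tt_now()[0] <= commit_timestamp:
        time.sleep(0.001)

# A transaction picks its commit timestamp at the latest bound, then waits.
_, chosen_ts = tt_now()
commit_wait(chosen_ts)
print("commit made visible only after real time passed the chosen timestamp")
```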
Dynamo
- Every node in Dynamo should have the same set of responsibilities as its peers
- No updates rejected due to failures or concurrent writes
- Conflict resolution is executed during read instead of write, i.e. “always writeable”
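A hedged sketch of the "always writeable" point above: writes are never rejected, divergent versions simply accumulate, and the reader resolves the conflict with an application-supplied merge function. Version tracking is reduced to a plain list here; real Dynamo tracks causality with vector clocks.

```python
class AlwaysWriteableStore:
    """Writes never fail on conflict; divergent versions are kept and
    reconciled only when a client reads (simplified illustration)."""

    def __init__(self):
        self._versions = {}   # key -> list of concurrently written values

    def put(self, key, value):
        # Accept every write, even if it conflicts with existing versions.
        self._versions.setdefault(key, []).append(value)

    def get(self, key, reconcile):
        versions = self._versions.get(key, [])
        if len(versions) <= 1:
            return versions[0] if versions else None
        merged = reconcile(versions)      # conflict resolution at read time
        self._versions[key] = [merged]    # write back the reconciled result
        return merged

store = AlwaysWriteableStore()
store.put("cart:42", {"apples"})
store.put("cart:42", {"bananas"})         # concurrent write, not rejected
merged = store.get("cart:42", reconcile=lambda vs: set().union(*vs))
print(merged)  # shopping-cart style merge keeping both 'apples' and 'bananas'
```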
Replica Synchronization
There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.
Merkle Tree
- A hash tree where leaves are hashes of the values of individual keys
- Parent nodes higher in the tree are hashes of their respective children
Advantages of Merkle Tree
- Each branch of the tree can be checked independently without requiring nodes to download the entire tree
- Help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas
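A minimal Merkle tree sketch matching the description above: leaves hash individual values, parents hash their children, and two replicas compare the root (and then subtree hashes) to localize divergent keys without shipping all the data. Construction details such as padding an odd level are simplifying assumptions.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(values):
    """Return the tree as a list of levels: leaves first, root level last."""
    level = [h(v.encode()) for v in values]             # leaves: hashes of values
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                               # duplicate last node if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

replica_a = build_merkle(["k1=v1", "k2=v2", "k3=v3", "k4=v4"])
replica_b = build_merkle(["k1=v1", "k2=v2", "k3=STALE", "k4=v4"])

# Equal roots mean the replicas agree; differing roots let them descend only
# into mismatching branches instead of transferring every key.
print(replica_a[-1] == replica_b[-1])    # -> False: the replicas diverge somewhere
print([i for i, (x, y) in enumerate(zip(replica_a[0], replica_b[0])) if x != y])
# -> [2]: only the key range behind leaf 2 needs synchronization
```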
Membership Detection
- Explicit mechanism to initiate the addition/removal of nodes from the Dynamo ring
- The node that serves the request writes the membership change and its time of issue to persistent storage
- The membership changes form a history because nodes can be removed and added back multiple times
- A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership
- Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories
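A hedged sketch of the gossip-based reconciliation just described: each node persists a history of (issue time, node, change) records and periodically merges histories with a randomly chosen peer, so all nodes converge on an eventually consistent membership view. The data structures and merge rule are illustrative assumptions.

```python
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.history = set()   # persisted (issue_time, node, "add" | "remove") records

    def record_change(self, issue_time, node, change):
        self.history.add((issue_time, node, change))

    def gossip_with(self, peer):
        # Reconcile persisted membership-change histories (both keep the union).
        merged = self.history | peer.history
        self.history, peer.history = set(merged), set(merged)

    def members(self):
        # The latest change per node decides whether it is currently in the ring.
        latest = {}
        for issue_time, node, change in sorted(self.history):
            latest[node] = change
        return {n for n, c in latest.items() if c == "add"}

nodes = [GossipNode(f"n{i}") for i in range(4)]
nodes[0].record_change(1, "n3", "add")
nodes[1].record_change(2, "n3", "remove")   # n3 removed after being added

# Each round, every node reconciles with one randomly chosen peer.
while any(n.history != nodes[0].history for n in nodes):
    for node in nodes:
        node.gossip_with(random.choice([p for p in nodes if p is not node]))

print(nodes[2].members())  # -> set(): all nodes agree n3 was added, then removed
```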
RAID
RAID (Redundant Array of Independent Disks) is a data storage virtualization technology that combines multiple disk drive components into a logical unit for data redundancy or performance improvement. Data is distributed across the drives in one of several ways, referred to as RAID levels. Each scheme provides a different balance between the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable (sector) read errors, as well as whole disk failure.
- RAID 0: Consists of striping, without mirroring or parity.
- RAID 1: Consists of mirroring, without parity or striping.
- RAID 2: Consists of bit-level striping with dedicated Hamming-code parity.
- RAID 3: Consists of byte-level striping with dedicated parity.
- RAID 4: Consists of block-level striping with dedicated (block-interleaved) parity. Wasted storage is small: one parity block per N data blocks. The dedicated parity disk becomes a hot spot.
- RAID 5: Consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives. It requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks. RAID 5 is seriously affected by the general trends regarding array rebuild time and the chance of failure during rebuild.
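A minimal sketch of the parity mechanism behind RAID 4/5: the parity block is the byte-wise XOR of the data blocks in a stripe, so any single lost block can be recomputed by XOR-ing the survivors. Parity rotation across drives (RAID 5) and rebuild scheduling are not modeled.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

# One stripe: three data blocks plus one parity block (RAID 4/5 style).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing one drive: its block is rebuilt from the survivors plus parity.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
recovered = xor_blocks(survivors)
print(recovered == data[lost])  # -> True: the lost block is fully reconstructed
```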
Intelligent Storage System
An intelligent storage system consists of four key components: front end, cache, back end, and physical disks.
High-End Storage Systems
- High-end storage systems, referred to as active-active arrays, are generally aimed at large enterprises for centralizing corporate data
- These arrays are designed with a large number of controllers and cache memory
- An active-active array implies that the host can perform I/Os to its LUNs across any of the available paths
Midrange Storage Systems
- Also referred to as active-passive arrays
- Host can perform I/Os to LUNs only through active paths
- Other paths remain passive until the active path fails
- Midrange arrays have two controllers, each with cache, RAID controllers, and disk drive interfaces
- Designed for small and medium enterprises
- Less scalable as compared to high-end arrays
Performance Metrics for Storage
- Storage efficiency
- Saturation throughput
- Rebuild time
- Mean time to data loss (MTTDL; see the approximation after this list)
- Encoding/Decoding/Update/Rebuild complexity
- Sequential read/write bandwidth
- Scale-out
- High availability
- Reliability vs. cost
- Systems cannot be taken down for backing up data
- Rebuild time should be small
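For the mean time to data loss metric above, a commonly used back-of-the-envelope approximation for a single-parity array such as RAID 5 (assuming independent, exponential disk failures; not taken from these notes) is, with N the number of disks, MTTF the per-disk mean time to failure, and MTTR the rebuild time:

```latex
\mathrm{MTTDL}_{\mathrm{RAID5}} \approx \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
```

With illustrative numbers N = 8, MTTF = 10^6 hours, and MTTR = 24 hours, this gives roughly 10^12 / (8 · 7 · 24) ≈ 7.4 × 10^8 hours. The N(N-1) factor and the MTTR in the denominator are why larger arrays and longer rebuilds (the "rebuild time should be small" point above) both hurt reliability.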