Understanding Storage Technologies: SCSI, FC, iSCSI, and More

SCSI vs. FC

SCSI is the interface for internal storage, while Fibre Channel (FC) attaches external disks and is used with SANs. SCSI expansion carries potential downtime. The RAID controller is the SCSI-side hardware, while the HBA (host bus adapter) is the Fibre Channel hardware. SCSI is media specific (copper only), whereas FC is media independent (copper or fibre optic).

FC vs. iSCSI

Fibre Channel (FC): Current market leader for shared storage technologies. Provides the highest performance levels and is designed for mission-critical applications. The cost of components is relatively high, particularly per server HBA costs. It is relatively difficult to implement and manage.

iSCSI: Relatively new, but usage is increasing rapidly. Performance can approach Fibre Channel speeds. A better fit for databases than NAS, and a good fit for small to medium-sized businesses. It is relatively inexpensive compared to Fibre Channel and relatively easy to implement and manage.

Benefits of NAS

  • Increases performance throughput to end users
  • Minimizes investment in additional servers
  • Provides storage pooling
  • Provides heterogeneous file serving
  • Uses existing infrastructure, tools, and processes

Benefits of SAN

  • Reduces cost of external storage
  • Increases performance
  • Centralized and improved tape backup
  • LAN-less backup
  • High-speed, no single-point-of-failure clustering solutions
  • Consolidation

Goals of BigTable

  • Data is highly available at any time
  • Very high read/write rates
  • Efficient scans over all or interesting subsets of data
  • Asynchronous and continuous updates
  • High scalability; the data model maps (row, column, timestamp) -> cell contents
  • No table-wide integrity constraints
  • No multi-row transactions
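The (row, column, timestamp) -> cell contents mapping above can be sketched as a toy in-memory map. This is only an illustration of the data model; the real system is a distributed, persistent, sorted map, and the class and method names here are assumptions.

```python
class ToyBigtable:
    """Sparse map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the latest timestamp for (row, column)."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions)[1]  # highest timestamp wins

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))  # latest version: "<html>v2</html>"
```

Note how the map is sparse: no table-wide schema or integrity constraints are enforced, matching the goals listed above.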

How is Chubby Used?

  • Ensure at most one active master at any time
  • Store the bootstrap location of Bigtable data
  • Discover tablet servers and finalize tablet server deaths
  • Store Bigtable schema information (the column family information for each table)
  • Store access control lists
  • If Chubby is unavailable for an extended period, Bigtable becomes unavailable

SSTable

A sorted, immutable file of key-value string pairs: chunks of data plus an index of chunk locations.
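The "chunks plus index" layout can be sketched as below: a sparse in-memory index locates the right chunk, so a lookup touches only one chunk (one disk seek in the real system). The chunk size and names are illustrative assumptions.

```python
import bisect

class SSTable:
    """Immutable, sorted key-value pairs with a sparse chunk index."""

    def __init__(self, pairs, chunk_size=2):
        self.pairs = sorted(pairs)          # sorted (key, value) pairs
        self.chunk_size = chunk_size
        # Sparse index: first key of every chunk -> chunk start offset.
        self.index = [(self.pairs[i][0], i)
                      for i in range(0, len(self.pairs), chunk_size)]

    def get(self, key):
        # Binary-search the index to find the candidate chunk,
        # then scan only that chunk.
        keys = [k for k, _ in self.index]
        pos = bisect.bisect_right(keys, key) - 1
        if pos < 0:
            return None
        start = self.index[pos][1]
        for k, v in self.pairs[start:start + self.chunk_size]:
            if k == key:
                return v
        return None

sst = SSTable([("b", "2"), ("a", "1"), ("d", "4"), ("c", "3")])
print(sst.get("c"))  # "3"
```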

Tablet

Contains some range of rows of the table, built out of multiple SSTables. Tablets are stored in tablet servers.

Table

Multiple tablets make up a table. Tablets do not overlap in row range, but SSTables can be shared between tablets and may overlap.

Fault Tolerance and Load Balancing

The master is responsible for load balancing and fault tolerance. It uses Chubby to keep locks of tablet servers, restart failed servers, and check the status of tablet servers. It keeps track of available tablet servers and unassigned tablets. If a server fails, it starts tablet recovery.

Recovering Tablet

A new tablet server reads data from the METADATA table. The metadata contains the list of SSTables that make up the tablet and pointers into any commit log that may contain data for it. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates committed since the redo point.
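The replay step can be sketched as follows, under the assumption that the commit log is a list of (sequence_number, key, value) records and the redo point is the sequence number of the last update already persisted in SSTables.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild the memtable by replaying updates after the redo point."""
    memtable = {}
    for seq, key, value in commit_log:
        if seq > redo_point:      # older updates are already in SSTables
            memtable[key] = value  # later entries overwrite earlier ones
    return memtable

log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new")]
print(recover_memtable(log, redo_point=1))  # {'b': 'x', 'a': 'new'}
```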

Refinements

Column families can be grouped together into locality groups, each stored in its own SSTable. Locality groups can be compressed. Bloom filters on locality groups help avoid searching SSTables that do not contain the requested row.
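The Bloom-filter idea is that a negative answer is definite, so it lets a read skip an SSTable entirely; only a positive answer (which may be a false positive) forces a disk read. A minimal sketch, with the bit-array size and hash count chosen arbitrarily:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions per key from a salted hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False => key definitely absent; True => possibly present.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("row1")
print(bf.might_contain("row1"))  # True
```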

What is Spanner?

Spanner provides strong consistency with wide-area replication, auto-sharding, auto-rebalancing, and automatic failure response. It exposes control of data replication and placement to the user/application. It serializes transactions via global timestamps, acknowledging clock uncertainty and guaranteeing a bound on it. Its novel TrueTime API implements concurrency control and enables consistent backups and atomic schema updates during ongoing transactions.

  • Feature: Lock-free distributed read transactions.
  • Property: External consistency of distributed transactions.
  • Implementation: Integration of concurrency control, replication, and 2PC (two-phase commit).
  • Enabling technology: TrueTime, an interval-based global time API that exposes the uncertainty in the clock. It leverages hardware such as GPS receivers and atomic clocks, with a set of time master servers per datacenter and a time slave daemon per machine. Each daemon polls a variety of masters and reaches a consensus about the correct timestamp.
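The interval idea can be sketched as below: TrueTime returns an interval [earliest, latest] guaranteed to contain real time, and a transaction's commit waits until its timestamp is definitely in the past. The epsilon value and helper names here are assumptions; the real bound comes from the GPS/atomic-clock masters.

```python
import time

EPSILON = 0.005  # assumed worst-case clock uncertainty, in seconds

def tt_now():
    """Return an interval [earliest, latest] containing real time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_timestamp):
    """Block until the commit timestamp is guaranteed to be in the past."""
    while tt_now()[0] <= commit_timestamp:
        time.sleep(0.001)

earliest, latest = tt_now()
s = latest       # choose a commit timestamp at the latest bound
commit_wait(s)   # after this returns, every node agrees s has passed
print(tt_now()[0] > s)  # True
```

The commit wait is what makes external consistency possible without locks on reads: once the wait ends, no later transaction can be assigned a smaller timestamp.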

Dynamo

Every node in Dynamo should have the same set of responsibilities as its peers. No updates are rejected due to failures or concurrent writes. Conflict resolution is executed during read instead of write, i.e., “always writable”.

Replica Synchronization

There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.

Merkle Tree: A hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children.

Advantage of Merkle Tree

Each branch of the tree can be checked independently without requiring nodes to download the entire tree. This helps in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas.
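Replica comparison with Merkle roots can be sketched as follows, assuming each replica's key range fits in a small list of (key, value) leaves. Equal roots mean the replicas agree; unequal roots mean the nodes descend the differing branches, transferring only hashes rather than the full data.

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(leaves):
    """Leaves are hashes of individual values; parents hash their children."""
    level = [h(v) for _, v in sorted(leaves)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last hash on odd levels
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

a = [("k1", "v1"), ("k2", "v2")]
b = [("k1", "v1"), ("k2", "v2-stale")]
print(merkle_root(a) == merkle_root(a))  # True: replicas agree
print(merkle_root(a) == merkle_root(b))  # False: descend to find the diff
```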

Membership Detection

Explicit mechanism to initiate addition/removal of nodes from the Dynamo ring. The node that serves the request writes the membership change and its time of issue to persistent store. The membership changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second, and the two nodes efficiently reconcile their persisted membership change histories.
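The reconciliation step of the gossip exchange can be sketched as a merge of two persisted histories. The representation here (node id mapped to its latest timestamped change) is an assumption for illustration.

```python
def reconcile(history_a, history_b):
    """Merge two membership histories, keeping the latest change per node."""
    merged = dict(history_a)
    for node, (t, change) in history_b.items():
        if node not in merged or t > merged[node][0]:
            merged[node] = (t, change)
    return merged

a = {"n1": (10, "added"), "n2": (12, "added")}
b = {"n2": (15, "removed"), "n3": (11, "added")}
view = reconcile(a, b)
print(view["n2"])  # (15, 'removed') -- the later change wins
```

Because every pair of nodes converges on the same merge, repeated random exchanges drive all nodes toward an eventually consistent view of membership.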

RAID

Data storage virtualization technology that combines multiple disk drive components into a logical unit for the purposes of data redundancy or performance improvement. Data is distributed across the drives in one of several ways, referred to as RAID levels. Each scheme provides a different balance between the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable (sector) read errors, as well as whole disk failure.

  • RAID 0: striping without mirroring or parity.
  • RAID 1: mirroring without parity or striping.
  • RAID 2: bit-level striping with dedicated Hamming-code parity.
  • RAID 3: byte-level striping with dedicated parity.
  • RAID 4: block-level striping with dedicated (block-interleaved) parity. Wasted storage is small, one parity block for N data blocks, but the parity disk becomes a hot spot.
  • RAID 5: block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives. All drives but one must be present to operate; upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks.
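The parity used in RAID 4/5 is plain XOR across the data blocks in a stripe, which is why any single lost block can be rebuilt from the survivors. A minimal sketch with 4-byte blocks on three data drives:

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on drives 0..2
parity = xor_blocks(data)            # parity block for the stripe

# Drive 1 fails: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

This also shows why rebuild time grows with drive size: reconstructing one failed drive requires reading every surviving drive in full, the trend noted below for RAID 5.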

RAID 5 is seriously affected by the general trends regarding array rebuild time and chance of failure during rebuild.

Intelligent Storage System

An intelligent storage system consists of four key components: front end, cache, back end, and physical disks.

High-end Storage Systems

High-end storage systems, referred to as active-active arrays, are generally aimed at large enterprises for centralizing corporate data. These arrays are designed with a large number of controllers and cache memory. An active-active array implies that the host can perform I/Os to its LUNs across any of the available paths.

Midrange Storage Systems

Also referred to as active-passive arrays, the host can perform I/Os to LUNs only through active paths. Other paths remain passive until the active path fails. Midrange arrays have two controllers, each with cache, RAID controllers, and disk drive interfaces. They are designed for small and medium enterprises and are less scalable compared to high-end arrays.

Performance Metrics for Storage

  • Storage efficiency
  • Saturation throughput
  • Rebuild time
  • Mean time to data loss
  • Encoding/Decoding/Update/Rebuild complexity
  • Sequential read/write bandwidth
  • Scale-out
  • High availability
  • Reliability vs. cost
  • Systems cannot be taken down for backing up data
  • Rebuild time should be small