Storage Technologies: Fibre Channel, iSCSI, Erasure Coding, and More

Fibre Channel vs. iSCSI
(Fibre Channel:
• The current market leader for shared storage technologies
• Provides the highest performance levels
• Designed for mission-critical applications
• Cost of components is relatively high, particularly the per-server HBA cost
• Relatively difficult to implement and manage
iSCSI:
• Relatively new, but usage is increasing rapidly
• Performance can approach Fibre Channel speeds
• A better fit for databases than NAS
• A good fit for small to medium-sized businesses
• Relatively inexpensive compared to Fibre Channel
• Relatively easy to implement and manage)
Inter Switch Links
(An ISL connects two or more FC switches to each other using E_Ports. ISLs are used to transfer host-to-storage data as well as fabric management traffic from one switch to another. ISL is also one of the scaling mechanisms in SAN connectivity.)

Erasure Coding (n = k + m)

A systematic erasure code stores the data in the clear on k of the n disks. There are k data disks, and m coding or “parity” disks. A non-systematic erasure code stores only coding information, but we still use k, m, and n to describe the code. When disks fail, their contents become unusable, and the storage system detects this; this failure mode is called an erasure. An MDS (“Maximum Distance Separable”) code can reconstruct the data after any m failures, i.e., from any k surviving disks.
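To make the definitions concrete, a minimal worked instance (the parameter values are chosen here for illustration and are not from the source):

$$k = 4 \text{ data disks},\quad m = 2 \text{ coding disks},\quad n = k + m = 6.$$

An MDS code with these parameters tolerates any $m = 2$ erasures: the data can be rebuilt from any $k = 4$ surviving disks, at a redundancy factor of $n/k = 1.5$.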

A horizontal code is systematic, and partitions the disks into data disks and coding disks. A vertical code is systematic, but each disk is required to hold some data and some coding. Vertical codes can be MDS, so long as they tolerate all combinations of m failed disks.

Two views of a stripe. Why the two views? Because w is small (often 1 bit), and operating over a collection of symbols is much faster than operating over a single symbol. Load can also be balanced by partitioning the system into many stripes and rotating the disk IDs.

Arithmetic for Erasure Codes:
• When w = 1: XORs only. Otherwise, Galois Field arithmetic GF(2^w).
• w is 2, 4, 8, 16, 32, 64, or 128 so that words fit evenly into computer words.
• Multiplication of a*b in GF(2^w):
  – Table lookup for w = 2, 4, 8.
  – Three table lookups plus a little arithmetic for w = 16.
  – 16 table lookups plus XORs for w = 32.
  – More expensive for w = 64, 128.
• Multiplication of a buffer by a constant a in GF(2^w):
  – Used to be roughly a factor of 4 worse than XOR for w = 2, 4, 8, 16; with SIMD instructions the gap has largely closed, with w = 32 slightly slower and the larger values slower still.
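As a concrete illustration of single-word multiplication in GF(2^8), here is a minimal sketch assuming the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), a common choice in Reed-Solomon implementations; the class and method names are made up for this example.

import java.util.*;

public class GF256 {
    // Carry-less "Russian peasant" multiplication with modular reduction.
    static int mul(int a, int b) {
        int p = 0;
        for (int i = 0; i < 8; i++) {
            if ((b & 1) != 0) p ^= a;       // addition in GF(2^w) is XOR
            boolean carry = (a & 0x80) != 0;
            a = (a << 1) & 0xFF;
            if (carry) a ^= 0x1D;           // reduce modulo the field polynomial 0x11d
            b >>= 1;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.printf("0x%02x%n", mul(0x57, 0x13)); // one example product in GF(2^8)
    }
}

Real implementations replace the bit loop with the table lookups described above.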

Reed-Solomon Codes: Properties:
• MDS erasure codes for any n and k; that means any m = (n - k) failures can be tolerated without data loss.
• (Theoretical view): one word per disk per stripe (r = 1); w constrained so that n ≤ 2^w.
• Systematic and non-systematic forms.

Replication vs. Erasure Coding (For a fixed redundancy level, an erasure-encoded system has at least the same level of reliability as the corresponding replicated system. For a fixed level of reliability, an erasure-encoded system requires less disk space, or can store more data with the same amount of disk space. EC is cheaper in cost compared to replication.)
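A worked comparison for the replication-vs-EC claim (the parameters are illustrative, not from the source): triple replication versus an 8 + 3 Reed-Solomon layout.

$$\text{3-way replication: redundancy factor } \tfrac{n}{k} = \tfrac{3}{1} = 3, \text{ tolerates the loss of 2 copies.}$$
$$\text{RS with } k = 8,\ m = 3: \text{ redundancy factor } \tfrac{n}{k} = \tfrac{11}{8} \approx 1.38, \text{ tolerates the loss of 3 fragments.}$$

So for this choice the erasure-coded layout survives more failures while using well under half the raw disk space.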

Erasure Coding (note: in this part of the notes, m denotes the number of data fragments and n the number of coding fragments, i.e., an m + n system):
• Given a value of m and the individual disk reliability, we can choose n to achieve the desired level of protection.
• Increasing n while holding m fixed increases reliability; increasing m while holding n fixed decreases reliability.
• We can increase reliability by increasing m and n proportionately, keeping the redundancy factor (m + n)/m constant.
• Larger values of m and n do result in greater data protection for a fixed level of disk reliability. However, while larger values of m and n reduce the chance of complete failure (irrecoverable data loss), they potentially increase latency and reconstruction costs by increasing the chance of at least one recoverable disk failure.
• The reconstruction time depends greatly on how EC is implemented, but it is always some amount of time, and hence an overhead associated with EC that is not required with replication.
• An EC system can be slower than a replicated system, even when all fragments are in the same data center.
• The more data fragments there are to manage, the more work is required to rebuild the fragment inventory database when failures occur.
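A toy model of these trends, assuming independent fragment failures with probability p (the class name, the numbers, and the independence assumption are all illustrative, not from the source): data is lost only when more than n of the m + n fragments fail.

public class EcReliability {
    // Binomial coefficient C(a, b), computed in floating point for this sketch.
    static double choose(int a, int b) {
        double c = 1;
        for (int i = 1; i <= b; i++) c = c * (a - b + i) / i;
        return c;
    }

    // Probability that more than n of the (m + n) fragments fail.
    static double lossProb(int m, int n, double p) {
        int total = m + n;
        double loss = 0;
        for (int failed = n + 1; failed <= total; failed++) {
            loss += choose(total, failed) * Math.pow(p, failed) * Math.pow(1 - p, total - failed);
        }
        return loss;
    }

    public static void main(String[] args) {
        double p = 0.01;                          // illustrative per-fragment failure probability
        System.out.println(lossProb(8, 3, p));    // 8 + 3 erasure coding
        System.out.println(lossProb(8, 4, p));    // more coding fragments, same m: lower loss probability
        System.out.println(lossProb(12, 3, p));   // more data fragments, same n: higher loss probability
    }
}

Running it shows the loss probability dropping when n grows with m fixed and rising when m grows with n fixed, matching the statements above.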

Disk Allocation to DCs: If a replicated system keeps one copy in each of d data centers, the probabilities of an object being lost or unavailable are p_l^d and p_u^d respectively (where p_l and p_u are the per-copy loss and unavailability probabilities). In EC systems, the number of fragments per object is typically larger than the number of DCs. Because data fragments are then inevitably co-located, those fragments have correlated probabilities of being unavailable, and so the unavailability probability for the system goes up.
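For instance (numbers are illustrative only):

$$d = 3,\quad p_u = 10^{-2} \;\Rightarrow\; p_u^{\,d} = 10^{-6},$$

i.e., with one independent replica in each of three data centers, the object is unavailable only when all three replicas are unavailable at once.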

Cost of disk failure: While EC can lower the probability of a catastrophic failure, it increases the probability of a recoverable failure. The probability that one or more disks in a set will fail is approximately the probability of a particular disk failing times the number of disks.
Cost of a recoverable disk failure:
• The cost of replacing hard disks is proportional to the total number of disks in use, whether those disks contain replicated objects or erasure-encoded object fragments.
• Erasure encoding does not increase the number of disk failures.
• By reducing the number of disks needed, it can reduce the number of failures.
• However, erasure encoding increases the probability that an object request is affected by a disk failure.

Latency due to disk failure
(In an m + n erasure-encoded system, an object request could be filled by reconstructing the object from the m nearest fragments. Since object requests more often encounter (recoverable) disk failures in erasure-encoded systems, they will more often involve an increase in latency.
Suppose a replicated system maintains k copies of an object, each in a different data center, and that requesting data from the nearest data center has latency L1 and from the second data center latency L2. The expected latency is approximately (1 - p)L1 + pL2, where p is the probability that the object is unavailable at the nearest data center.
For example, if there is a probability of 0.001 that an object is unavailable at the nearest data center, and the second data center has 100 times greater latency, the expected latency is 10% greater (that is, pL2/L1 = 0.001 * 100 = 0.1) than if both replicas were located in the nearest data center.
In an erasure-encoded system, fragments could be geographically distributed in many ways. If latency is a concern, an m + n system would store at least m fragments in a single location so that only local reads would be necessary under normal circumstances. If m fragments are stored in one location, the probability of one local disk failure is mp. This means the probability of a single local failure, and hence the necessity of transferring data from a more distant data center, is m times greater for an erasure-coded system than for a replicated system. The expected latency increases from approximately (1 - p)L1 + pL2 in a replicated system to approximately (1 - p)L1 + mpL2 in an erasure-encoded system.
For 8 + 3 erasure coding, suppose there is a probability of 0.001 that an object is unavailable at the nearest data center, and the second data center has 100 times greater latency:
• We pick our units so that the latency of the nearer server is L1 = 1.
• With m = 8, the expected latency would be 1.099 using replication and 1.799 using erasure coding (see the sketch below).
• For active data, i.e. objects that are accessed frequently, latency is a major concern and the latency advantage of replication should be considered.
The probability that exactly one out of a set of m disks will fail is approximately mp if p is the probability of an individual failure; the probability of one or more failures is 1 - (1 - p)^m, which is also approximately mp when p is small.)
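The 1.099 and 1.799 figures can be reproduced with a few lines of arithmetic; the class name and the plugged-in values (p = 0.001, L2 = 100·L1, m = 8) follow the example in the text, and the formulas are the approximations given above.

public class EcLatency {
    public static void main(String[] args) {
        double p = 0.001, l1 = 1.0, l2 = 100.0;
        int m = 8; // 8 + 3 erasure coding: m data fragments are needed for a read

        double replicated = (1 - p) * l1 + p * l2;        // approx. 1.099
        double erasureCoded = (1 - p) * l1 + m * p * l2;  // approx. 1.799

        System.out.printf("replication:    %.3f%n", replicated);
        System.out.printf("erasure coding: %.3f%n", erasureCoded);
    }
}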

Accumulo: User Management(
createuser dennis
grant System.CREATE_TABLE -s -u dennis
grant Table.READ -t mytable -u dennis
grant Table.WRITE -t mytable -u dennis
setauths -s public,private -u dennis
)
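For reference, a rough sketch of the same grants through the Java client API (assuming a Connector conn obtained as in "Obtaining a Connection" below; the method names come from the Accumulo 1.x SecurityOperations interface, but treat this as an untested sketch and the password as a placeholder):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.admin.SecurityOperations;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.SystemPermission;
import org.apache.accumulo.core.security.TablePermission;

public class UserSetup {
    static void setUpDennis(Connector conn) throws Exception {
        SecurityOperations secOps = conn.securityOperations();
        secOps.createLocalUser("dennis", new PasswordToken("changeme"));          // createuser dennis ("changeme" is a placeholder)
        secOps.grantSystemPermission("dennis", SystemPermission.CREATE_TABLE);    // grant System.CREATE_TABLE -s -u dennis
        secOps.grantTablePermission("dennis", "mytable", TablePermission.READ);   // grant Table.READ -t mytable -u dennis
        secOps.grantTablePermission("dennis", "mytable", TablePermission.WRITE);  // grant Table.WRITE -t mytable -u dennis
        secOps.changeUserAuthorizations("dennis", new Authorizations("public", "private")); // setauths -s public,private -u dennis
    }
}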

Writing Data:
insert row1 colf1 colq1 value1
insert row4 colf1 colq1 value4 -l public
insert row5 colf1 colq1 value5 -t 4000
Reading Data:
scan -s private
scan -b row2 -s public
Deleting Data:
delete row1 colf1 colq1
deleterows -b row3 -e row7
deletemany -b row3 -e row7 -c colf1:colq1

Obtaining a Connection(
String instanceName = "myinstance";
String zookeepers = "zk1,zk2,zk3";
Instance inst = new ZooKeeperInstance(instanceName, zookeepers);
Connector conn = inst.getConnector("username", new PasswordToken("passwd"));
)

Obtaining a Mini Connection(
File tmpDir = ..;
MiniAccumuloCluster mini = new MiniAccumuloCluster(tmpDir, "password");
mini.start();
…
Instance instance = new ZooKeeperInstance(mini.getInstanceName(), mini.getZooKeepers());
Connector conn = instance.getConnector("root", new PasswordToken("password"));
…
mini.stop();
tmpDir.delete();
)

Mini takes longer to set up and tear down than Mock, but it runs all of the components locally and therefore matches the execution logic of a distributed instance much more closely (major difference: Mini stores data in the local file system rather than HDFS).

Table Management(
TableOperations tableOps = conn.tableOperations();
if (!tableOps.exists("mytable")) {
  tableOps.create("mytable");
}
…
tableOps.delete("mytable");
)

Writing Data(
BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());
ColumnVisibility publicVis = new ColumnVisibility("public");
…
Mutation m = new Mutation("Dennis");
m.put("Address", "City", publicVis, "Buffalo");
m.put("Attr", "Age", privateVis, "40");
writer.addMutation(m);
writer.close();
----
MultiTableBatchWriter mtbw = conn.createMultiTableBatchWriter(new BatchWriterConfig());
BatchWriter bw1 = mtbw.getBatchWriter("table1");
BatchWriter bw2 = mtbw.getBatchWriter("table2");
…
bw1.addMutation(m1);
bw2.addMutation(m2);
…
mtbw.close();
)

Reading Data(
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("mytable", auths);
for (Entry<Key, Value> entry : scan) {
  String n = entry.getKey().getRow().toString();
  if (n.startsWith("D") || n.startsWith("E")) {
    process(entry.getKey(), entry.getValue());
  }
}
scan.close();
----
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("mytable", auths);
scan.setRange(new Range("D", true, "F", false));
for (Entry<Key, Value> entry : scan) {
  process(entry.getKey(), entry.getValue());
}
scan.close();
----
scan.fetchColumnFamily(new Text("Address"));
scan.fetchColumn(new Text("Address"), new Text("City"));
scan.fetchColumn(new Text("Address"), new Text("State"));
----
IteratorSetting is = new IteratorSetting(10, "my-iter", RegExFilter.class);
RegExFilter.setRegexs(is, "^[AEIOU].*", null, null, null, false);
scan.addScanIterator(is);
)

Isolation ensures that once a scanner starts processing a row, either all or none of a set of related changes to that row are seen by the scanner.
Scanner scan = new IsolatedScanner(conn.createScanner("mytable", auths));

Deleting Data(
Mutation m = new Mutation("Dennis");
m.putDelete("Address", "City");
m.putDelete("Address", "State");
writer.addMutation(m);
// …or delete entire ranges (rows)
BatchDeleter bd = conn.createBatchDeleter("mytable", auths, 4, new BatchWriterConfig());
bd.setRanges(ranges);
bd.delete();
bd.close();
)

Caching
• Write-back cache: drive reports that writes are complete after they have been cached.
• Write-through cache: drive reports that writes are complete after they have been written to disk.
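A toy illustration of where the two policies acknowledge a write (the class is hypothetical, not a real drive or driver):

import java.util.ArrayDeque;
import java.util.Queue;

public class ToyDrive {
    private final Queue<byte[]> cache = new ArrayDeque<>();
    private final boolean writeBack;

    ToyDrive(boolean writeBack) { this.writeBack = writeBack; }

    // Returns ("reports complete") as soon as the policy allows it.
    void write(byte[] block) {
        cache.add(block);        // data lands in the drive's RAM cache
        if (writeBack) return;   // write-back: acknowledge now, flush to media later
        flush();                 // write-through: persist to the media before acknowledging
    }

    void flush() {
        while (!cache.isEmpty()) persistToMedia(cache.remove());
    }

    private void persistToMedia(byte[] block) { /* media write elided */ }

    public static void main(String[] args) {
        new ToyDrive(true).write(new byte[512]);   // acknowledged before data is on disk
        new ToyDrive(false).write(new byte[512]);  // acknowledged only after flush()
    }
}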

Disk Scheduling(
SSTF Scheduling: The bad: SSTF is prone to starvation.
SCAN Scheduling: The bad: average access times are higher for requests at high and low addresses.
)

Scheduling problems that take place in the kernel:
• Process scheduling
• Page swapping
Where should disk scheduling be implemented?
• OS scheduling: the OS can implement SSTF or LOOK by ordering the queue by LBA; however, the OS cannot account for rotational delay (see the sketch below).
• On-disk scheduling: the disk knows the exact position of the head and platters, so it can implement more advanced schedulers (e.g., SPTF); but this requires specialized hardware and drivers.
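A minimal sketch of an OS-level SSTF pass over logical block addresses (the class name and the numbers are illustrative; a real scheduler works incrementally as requests arrive rather than ordering a fixed batch):

import java.util.ArrayList;
import java.util.List;

public class SstfSketch {
    // Repeatedly pick the pending request closest to the current head position.
    static List<Long> order(long headLba, List<Long> pending) {
        List<Long> remaining = new ArrayList<>(pending);
        List<Long> schedule = new ArrayList<>();
        long pos = headLba;
        while (!remaining.isEmpty()) {
            Long closest = remaining.get(0);
            for (Long lba : remaining) {
                if (Math.abs(lba - pos) < Math.abs(closest - pos)) {
                    closest = lba;
                }
            }
            remaining.remove(closest);
            schedule.add(closest);
            pos = closest;  // distant requests can starve if nearby ones keep arriving
        }
        return schedule;
    }

    public static void main(String[] args) {
        // Head at LBA 50: nearest-first order is [52, 55, 90, 12, 10].
        System.out.println(order(50, List.of(10L, 55L, 90L, 52L, 12L)));
    }
}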

Advantages of SSDs(
• More resilient against physical damage: no sensitive read head or moving parts; immune to changes in temperature
• Greatly reduced power consumption: no mechanical, moving parts
• Much faster than hard drives: >500 MB/s vs ~200 MB/s for hard drives
• No penalty for random access: each flash cell can be addressed directly; no need to rotate or seek
• Extremely high throughput
)

Challenges with Flash(
• Flash memory is written in pages, but erased in blocks (pages: 4 – 16 KB; blocks: 128 – 256 KB)
• Thus, flash memory can become fragmented
• This leads to the write amplification problem: to rewrite even a single page, the drive may first have to copy the block's remaining valid pages elsewhere and erase the whole block (see the worked example below)
)
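A worked instance of write amplification (WA) using the page/block sizes above; the specific numbers are illustrative:

$$\text{WA} = \frac{\text{data written to flash}}{\text{data written by the host}}$$

A 128 KB block holds $128/4 = 32$ pages of 4 KB. If updating one 4 KB page forces the other 31 still-valid pages to be relocated before the block is erased, the drive writes $32 \times 4$ KB for 4 KB of host data, so $\text{WA} = 32$.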

Wear Leveling(
Dynamic wear leveling:
• Uses a map to link logical block addresses (LBAs) from the OS to the physical flash memory.
• Each time the OS writes replacement data, the map is updated: the original physical block is marked as invalid data, and a new block is linked to that map entry (see the sketch after this list).
• Each time a block of data is re-written to the flash memory, it is written to a new location.
• However, flash memory blocks that never receive replacement data sustain no additional wear; the name comes from only the dynamic data being recycled.
• Such a device may last longer than one with no wear leveling, but some blocks remain active yet go unused by the time the device is no longer operable.
Static wear leveling:
• Also uses a map to link LBAs to physical memory addresses.
• Works the same as dynamic wear leveling, except that static blocks that do not change are periodically moved so that these low-usage cells can be used by other data.
• This rotational effect enables an SSD to continue operating until most of the blocks are near their end of life.
• Sometimes referred to as global wear leveling, since the entire image is leveled.
)
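A minimal sketch of the dynamic scheme described above (the class, field names, and the "pick the least-worn free block" policy are illustrative assumptions, not a real flash translation layer):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicWearLeveler {
    private final Map<Long, Integer> lbaToBlock = new HashMap<>(); // LBA -> physical block
    private final int[] eraseCount;                                // wear per physical block
    private final List<Integer> freeBlocks = new ArrayList<>();

    DynamicWearLeveler(int physicalBlocks) {
        eraseCount = new int[physicalBlocks];
        for (int b = 0; b < physicalBlocks; b++) freeBlocks.add(b);
    }

    // On every (re)write, map the LBA to a fresh block; the old block is invalidated.
    // (Free-pool exhaustion and garbage collection are elided in this sketch.)
    void write(long lba, byte[] data) {
        Integer old = lbaToBlock.get(lba);
        if (old != null) {
            eraseCount[old]++;        // old block will be erased before reuse
            freeBlocks.add(old);      // marked invalid, returned to the free pool
        }
        int target = leastWornFree(); // spread writes across free blocks
        freeBlocks.remove(Integer.valueOf(target));
        lbaToBlock.put(lba, target);
        // programToFlash(target, data) elided
    }

    private int leastWornFree() {
        int best = freeBlocks.get(0);
        for (int b : freeBlocks) if (eraseCount[b] < eraseCount[best]) best = b;
        return best;
    }
}

Blocks whose LBAs are never rewritten keep their original physical block, which is exactly the limitation static wear leveling addresses by periodically relocating them.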