Exploring Disk Storage and Cache Optimization: Solutions to Common Problems

Assume:

  • Bits per track = K * x * r (fixed by the circumference of the innermost track)
  • Track count = M * (1 - x) * r
  • M, K are constant

So:

  • Bit count bc = (bits per track) * (track count) = M * K * r^2 * x * (1 - x)
  • When x == 1/2, bc is maximum

T_avg_seek = 4ms

T_avg_rotation = 1/2 * 1/15000 * 60s/min * 1000ms/s = 2ms

T_avg_transfer = 1/15000 * 1/800 * 60s/min * 1000ms/s = 0.005ms

So:

T_access = 6.005ms

Almost same as problem 6.4 in the book.

A.

Best case: blocks are mapped sequentially and on the same cylinder, so we pay the seek and rotational latency only once.

File size 2 MB, block size 512 B, block count 2 MB / 512 B ≈ 4000

Blocks per track = 1000, so we need 4 full rotations to read all the data.

B.

Worst case: blocks are mapped to random positions on the disk.

| m | C | B | E | S | t | s | b |
|---|------|----|-----|-----|----|---|---|
| 32 | 1024 | 4 | 4 | 64 | 24 | 6 | 2 |
| 32 | 1024 | 4 | 256 | 1 | 30 | 0 | 2 |
| 32 | 1024 | 8 | 1 | 128 | 22 | 7 | 3 |
| 32 | 1024 | 8 | 128 | 1 | 29 | 0 | 3 |
| 32 | 1024 | 32 | 1 | 32 | 22 | 5 | 5 |
| 32 | 1024 | 32 | 4 | 8 | 24 | 3 | 5 |
| m | C | B | E | S | t | s | b |
|---|------|----|----|-----|----|---|---|
| 32 | 2048 | 8 | 1 | 256 | 21 | 8 | 3 |
| 32 | 2048 | 4 | 4 | 128 | 23 | 7 | 2 |
| 32 | 1024 | 2 | 8 | 64 | 25 | 6 | 1 |
| 32 | 1024 | 32 | 2 | 16 | 23 | 4 | 5 |

A.

t = 0x45 = 0b01000101, s = 0b001, b = xx (xx is 00/01/10/11)

Address may be:

  • t = 0x45: address range 0x08A4 – 0x08A7
  • t = 0x38: address range 0x0704 – 0x0707

B.

0x1238 – 0x123B

Same as 6.27

A.

None

B.

  • 0x18F0 – 0x18F3
  • 0x00B0 – 0x00B3

C.

0x0E34 – 0x0E37

D.

0x1BDC – 0x1BDF

A.

B.

| read/write | addr | hit? | value (or unknown) |
|---|---|---|---|
| read | 0x834 | No | |
| write | 0x836 | Yes | unknown |
| read | 0xFFD | Yes | 0xC0 |

A.

C = E * B * S = 128B

B.

A.

B.

| param | value |
|---|---|
| CO | 0x02 |
| CI | 0x06 |
| CT | 0x38 |
| hit? | Yes |
| return | 0xEB |

A.

B.

| param | value |
|---|---|
| CO | 0x00 |
| CI | 0x02 |
| CT | 0xB7 |
| hit? | No |
| return | |

Same as 6.27

  • 0x1788 – 0x178B
  • 0x16C8 – 0x16CB

src:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | h | m |
| r1 | m | h | m | h |
| r2 | m | m | h | m |
| r3 | m | h | m | h |

dst:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | m | m |
| r1 | m | m | m | m |
| r2 | m | m | m | m |
| r3 | m | m | m | m |

src:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |

dst:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |

A.

C = 512, E = 1, B = 16, S = 32

Total read count: 2 * 128

x[0][i] address: i * 4

x[1][i] address: (128 + i) * 4 = 512 + i * 4

So x[0][i] and x[1][i] map to the same cache set, and with E = 1 they keep evicting each other.

Miss rate 100%

B.

C = 1024, E = 1, B = 16, S = 64

sizeof(x) == 2 * 128 * 4 == C

The whole array can be cached.

Miss rate is dependent on block size B.

B = 16, sizeof(int) = 4, so

Miss rate is 25%

C.

C = 512, E = 2, B = 16, S = 16

Total read count: 2 * 128

x[0][i] address: i * 4

x[1][i] address: (128 + i) * 4 = 512 + i * 4

So x[0][i] and x[1][i] map to the same set, but with E = 2 they can occupy the set's two different ways.

In the first half, all elements can be cached. Miss rate is 25%.

In the second half

x[0][i] is not in the cache. Under the LRU strategy, x[0][i]'s block replaces the block that held x[0][i-64], and x[1][i]'s block replaces the one that held x[1][i-64]. Each block of 4 ints is still loaded exactly once, so the miss rate is 25%.

Finally, the miss rate is still 25%.

D.

No

If B is still 16 and sizeof(int) = 4, a block can only hold 4 ints at a time.

The first read of each block misses and loads it from memory; the next 3 reads hit.

So 25% is the lowest possible miss rate.

E.

Yes

Assume B = 32: a block then holds 8 ints at a time, so there is only 1 miss per 8 int reads.

The miss rate can be 12.5%.

C = 4096, B = 16, E = 1, S = 256

N = 64

sizeof(array_t) == 64 * 64 * 4 == 16384 == 4 * C

A. sumA

Read memory address order:

0, 4, 8, 12, ….., 4096 * 4 – 4

Read cache order:

0, 0, 0, 0, 1, 1, 1, 1,…..255, 255, 255, 255, 0, 0, 0, 0,…..255, 255, 255, 255

The first read of each block misses; the next 3 reads hit.

Miss rate: 25%

B. sumB

Read memory address order:

0, 64 * 4, 64 * 4 * 2, …. 64 * 4 * 63, 4, 64 * 4 + 4, 64 * 4 * 2 + 4, …. 4096 * 4 – 4

Read cache set order (set = (addr / 16) mod 256):

0, 16, 32, 48, … 240, repeated 4 times per column, each repetition with a different tag

Rows j and j + 16 map to the same set with different tags, so within one column every access evicts the block its set was holding: all 64 reads miss.

The next columns (i = 1, 2, 3) would like to reuse those blocks, but each block has already been evicted again by the time the traversal returns to its set, so they all miss too.

So the miss rate is 100%.

C. sumC

It is easy to see that reading a[j][i+1] and a[j+1][i+1] always hits: each lies in the same block as the a[j][i] / a[j+1][i] read just before it.

Total read count = 64 * 64

Because of i += 2, each block of 4 ints is touched by two iterations (i and i + 2): the first touch misses, the second hits.

So 50% of the a[j][i] and a[j+1][i] reads miss.

Total read miss count = 64/2 * 64 * 50% = 64 * 64 / 4

So the miss rate is still 25%.

N = 60

A. sumA

Read memory by step 1

Miss rate 25%

B. sumB

It’s interesting.

Let’s see the first inner loop a[0][0] -> a[59][0]

Read memory address order:

0, 60 * 4, 60 * 4 * 2, …. 60 * 4 * 59

Read cache set order (set = 15j mod 256):

0, 15, 30, …, 240, 255 (j = 0–17), 14, 29, …, 254 (j = 18–34), 13, 28, …, 253 (j = 35–51), 12, 27, 42, 57, 72, 87, 102, 117 (j = 52–59)

All 60 reads miss, and every block lands in a different set, so nothing is evicted.

Next 3 loops: a[0][1] -> a[59][1], a[0][2] -> a[59][2], a[0][3] -> a[59][3]

All hit

Because N = 60 is not a power of 2, the column accesses never conflict, and the miss rate is 25%.

C. sumC

Same miss rate as when N = 64

25%

A.

4 * 16 * 16

B.

sizeof(point_color) == 16 and B = 32, so one cache block holds 2 point_color structs.

The first write to square[j][i] misses and loads the block holding square[j][i] and square[j][i+1]; the remaining 7 writes to those two structs all hit.

So the miss count is 4 * 16 * 16 * 1/8

C.

1/8

A.

4 * 16 * 16

B.

sizeof(point_color) == 16 and B = 32, so one cache block holds 2 point_color structs.

The first write to square[j][i] misses and caches square[j][i] and square[j][i+1]; the following writes to square[j][i] hit.

In the next iterations:

square[j+1][i] misses, but its block does not conflict with square[j][i]'s.

square[j+8][i] misses and its block evicts square[j][i]'s block, so when we eventually reach square[j][i+1], it misses again.

So miss count is 4 * 16 * 16 * 1/4

C.

1/4

A.

4 * 16 * 16

B.

The first loop is like 6.38, but with only one write per struct: the write count is 16 * 16 and the miss rate is 1/2 (one miss per 2-struct block).

The second loop is like 6.39, but with three writes per struct: the write count is 16 * 16 * 3 and the miss rate is 1/6 (again one miss per 2-struct block).

Miss count is

16 * 16 * 1/2 + 16 * 16 * 3 * 1/6

C.

1/4

In every loop iteration, the first write to a pixel misses and caches that pixel; the following 3 writes to it hit.

Miss rate is 1/4


C = 64KB, B = 4B, sizeof(pixel) = 4, so one block holds exactly one pixel.

The first write to each pixel misses and caches it; the next 3 writes hit.

Miss rate is 1/4

Here every access touches a block that is not in the cache, so every access misses.

Miss rate 100%

Assume matrix size N = 4, cache block size B = 8Byte, sizeof(int) = 4

Function transpose

Every cache block can store 2 int numbers

Traverse src by step 1, order:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  • order 0: cache miss, load src[0], src[1]
  • order 1: cache hit (loaded at order 0)
  • order 2: cache miss, load src[2], src[3]
  • ….
  • order 15: cache hit

If B is greater, the hit rate will be greater too.

Traverse dst by step 4, order:

0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

  • element 0: miss, load dst[0], dst[1]
  • element 4: miss, load dst[4], dst[5]
  • element 8: miss, load dst[8], dst[9]
  • element 12: miss, load dst[12], dst[13]
  • element 1: interesting: order 0 already loaded dst[1], so it can hit. But in many caches, elements 4/8/12 may share a cache set with element 0, so a miss is also possible.
  • ….
  • element 15: hit / miss

Let’s assume the worst case: all elements miss the cache

If we split the matrix into 2 * 2 blocks:

  • transpose block 0 in place
  • transpose block 1 with block 2 (swap and transpose)
  • transpose block 3 in place

Code

When i = 0 and j = 0, we transpose block 0 in place.

If element 0 misses, element 1 must hit; if element 4 misses, element 5 must hit.

So 50% is the highest hit rate we can reach with such a small cache block size.

If B and the cache size C are larger, we can split the matrix into 4 * 4, 8 * 8, or even larger blocks; in theory, this approaches the highest possible hit rate.

Final code:

The matrix is 1024 * 1024, and the transpose runs 1000 times to measure the program's run time.

Changing BLOCK from 2 to 16 gives the statistics below.

Original function run time: 16.46s

| BLOCK | time (s) |
|---|---|
| 2 | 9.99 |
| 3 | 7.16 |
| 4 | 5.6 |
| 5 | 5.66 |
| 6 | 5.34 |
| 7 | 5.39 |
| 8 | 5.38 |
| 9 | 5.48 |
| 10 | 6.21 |
| 11 | 7.9 |
| 12 | 10.17 |
| 13 | 11.14 |
| 14 | 11.88 |
| 15 | 12.11 |
| 16 | 11.85 |

Tip: look up your CPU's cache parameters when choosing BLOCK.

Same as 6.45, pay attention to the brilliant comment 🙂

Measure time statistics:

Original function run time: 30.28s

| BLOCK | time (s) |
|---|---|
| 1 | 14.26 |
| 2 | 12.01 |
| 3 | 7.43 |
| 4 | 6.20 |
| 5 | 6.08 |
| 6 | 5.86 |
| 7 | 5.70 |
| 8 | 5.67 |
| 9 | 6.30 |
| 10 | 6.39 |
| 11 | 6.21 |
| 12 | 6.18 |
| 13 | 5.9 |
| 14 | 6.3 |
| 15 | 5.88 |
| 16 | 5.92 |