Exploring Disk Storage and Cache Optimization: Solutions to Common Problems
Assume the platter has radius r and the recording zone runs from x * r out to r, and:
- Bits per track = K * x * r (proportional to the circumference of the innermost track; K is a constant)
- Track count = M * (r - x * r) = M * r * (1 - x) (proportional to the width of the recording zone; M is a constant)
So:
- Bit count bc = (K * x * r) * (M * r * (1 - x)) = M * K * r^2 * x * (1 - x)
- d(bc)/dx = M * K * r^2 * (1 - 2x), which is zero at x == 1/2, so bc is maximum when x == 1/2
T_avg_seek = 4ms
T_avg_rotation = 1/2 * 1/15000 * 60s/min * 1000ms/s = 2ms
T_avg_transfer = 1/15000 * 1/800 * 60s/min * 1000ms/s = 0.005ms
So:
T_access = 4 + 2 + 0.005 = 6.005ms
Almost same as problem 6.4 in the book.
A.
Best case: the blocks are mapped sequentially on the same cylinder, so we pay the seek and rotational delay only once.
File size 2MB, block size 512B, block count = 2MB / 512B = 4096
Blocks per track = 800 (the sectors-per-track figure above), so reading all the data takes 4096 / 800 = 5.12 rotations: T = 4 + 2 + 5.12 * 4 = 26.48ms
B.
Worst case: the blocks are mapped randomly, so every block pays its own average seek and rotational delay: T = 4096 * (4 + 2 + 0.005) ≈ 24.6s
| m | C | B | E | S | t | s | b |
|---|---|---|---|---|---|---|---|
| 32 | 1024 | 4 | 4 | 64 | 24 | 6 | 2 |
| 32 | 1024 | 4 | 256 | 1 | 30 | 0 | 2 |
| 32 | 1024 | 8 | 1 | 128 | 22 | 7 | 3 |
| 32 | 1024 | 8 | 128 | 1 | 29 | 0 | 3 |
| 32 | 1024 | 32 | 1 | 32 | 22 | 5 | 5 |
| 32 | 1024 | 32 | 4 | 8 | 24 | 3 | 5 |
| m | C | B | E | S | t | s | b |
|---|---|---|---|---|---|---|---|
| 32 | 2048 | 8 | 1 | 256 | 21 | 8 | 3 |
| 32 | 2048 | 4 | 4 | 128 | 23 | 7 | 2 |
| 32 | 1024 | 2 | 8 | 64 | 25 | 6 | 1 |
| 32 | 1024 | 32 | 2 | 16 | 23 | 4 | 5 |
A.
Set 1 (s = 0b001) can hit on two tags, with b = xx (xx is any of 00/01/10/11). The possible addresses are:
- t = 0x45 = 0b01000101: address range 0x08A4 – 0x08A7
- t = 0x38: address range 0x0704 – 0x0707
B.
0x1238 – 0x123B
Same as 6.27
A.
None
B.
- 0x18F0 – 0x18F3
- 0x00B0 – 0x00B3
C.
0x0E34 – 0x0E37
D.
0x1BDC – 0x1BDF
A.
B.
| read/write | addr | hit? | value (or unknown) |
|---|---|---|---|
| read | 0x834 | No | |
| write | 0x836 | Yes | unknown |
| read | 0xFFD | Yes | 0xC0 |
A.
C = E * B * S = 128B
B.
A.
B.
| param | value |
|---|---|
| CO | 0x02 |
| CI | 0x06 |
| CT | 0x38 |
| hit? | Yes |
| return | 0xEB |
A.
B.
| param | value |
|---|---|
| CO | 0x00 |
| CI | 0x02 |
| CT | 0xB7 |
| hit? | No |
| return | none |
Same as 6.27
- 0x1788 – 0x178B
- 0x16C8 – 0x16CB
src:
|   | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | h | m |
| r1 | m | h | m | h |
| r2 | m | m | h | m |
| r3 | m | h | m | h |
dst:
|   | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | m | m |
| r1 | m | m | m | m |
| r2 | m | m | m | m |
| r3 | m | m | m | m |
src:
|   | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |
dst:
|   | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |
A.
C = 512, E = 1, B = 16, S = 32
Total read count: 2 * 128
x[0][i] address: i * 4
x[1][i] address: (128 + i) * 4 = 512 + i * 4
So x[0][i] and x[1][i] map to the same set (their addresses differ by exactly C = 512), and with E = 1 they keep evicting each other.
Miss rate: 100%
B.
C = 1024, E = 1, B = 16, S = 64
sizeof(x) == 2 * 128 * 4 == C
The whole array can be cached.
The miss rate now depends only on the block size B.
B = 16 and sizeof(int) = 4, so each block holds 4 ints: only the first of every 4 reads misses.
Miss rate is 25%
C.
C = 512, E = 2, B = 16, S = 16
Total read count: 2 * 128
x[0][i] address: i * 4
x[1][i] address: (128 + i) * 4 = 512 + i * 4
So x[0][i] and x[1][i] map to the same set, but with E = 2 they can occupy the set's two lines side by side.
In the first half (i < 64), the blocks of both rows fit in the two lines of each set: only the first of every 4 reads misses, so the miss rate is 25%.
In the second half (i >= 64), x[0][i] is not in the cache; by the LRU policy its block replaces the block of x[0][i-64], and likewise x[1][i] replaces x[1][i-64]. The pattern repeats, so the miss rate stays 25%.
Finally, the overall miss rate is still 25%.
D.
No.
With B = 16 and sizeof(int) = 4, a block can only hold 4 ints at a time.
The first read of each block misses and loads the block into the cache; the next 3 reads hit.
So 25% is the lowest possible miss rate.
E.
Yes.
If B = 32, each block holds 8 ints.
There is only 1 miss per 8 int reads, so the miss rate can drop to 12.5%.
C = 4096, B = 16, E = 1, S = 256
N = 64
sizeof(array_t) == 64 * 64 * 4 == 16384 == 4 * C
A. sumA
Read memory address order:
0, 4, 8, 12, ….., 4096 * 4 – 4
Cache set order:
0, 0, 0, 0, 1, 1, 1, 1, ….. 255, 255, 255, 255, 0, 0, 0, 0, ….. 255, 255, 255, 255
The first read of each block misses; the next 3 reads hit.
Miss rate: 25%
B. sumB
Read memory address order:
0, 64 * 4, 64 * 4 * 2, …. 64 * 4 * 63, 4, 64 * 4 + 4, 64 * 4 * 2 + 4, …. 4096 * 4 – 4
Cache set order:
0, 16, 32, 48, … 240, repeated 4 times per column: rows j, j+16, j+32 and j+48 all map to the same set, because 16 rows (16 sets each) already wrap the 256-set cache.
Each repetition carries a different tag, so in a direct-mapped cache the blocks keep evicting one another, and by the time the next column returns to a set its block has been replaced as well.
So every read misses: the miss rate is 100%.
C. sumC
Easy to see that a[j][i+1] and a[j+1][i+1] always hit: they sit in the blocks just loaded for a[j][i] and a[j+1][i].
But a[j][i] and a[j+1][i] always miss. The inner loop walks down a whole column, so rows that are 16 apart conflict in the same set, exactly as in sumB, and every block is evicted again before the next column pass (i+2) can reuse it.
Total read count = 64 * 64; each 2 * 2 group contributes 2 misses out of 4 reads, so the total miss count = 64 * 64 / 2.
So the miss rate is 50%.
N = 60
A. sumA
Read memory by step 1
Miss rate 25%
B. sumB
It’s interesting.
Let’s see the first inner loop a[0][0] -> a[59][0]
Read memory address order:
0, 60 * 4, 60 * 4 * 2, …. 60 * 4 * 59
Cache set order (set = 15 * j mod 256):
0, 15, 30, ….. 255, (18 numbers) 14, 29, ….. 254, (17 numbers) 13, 28, ….. 253, (17 numbers) 12, 27, 42, 57, 72, 87, 102, 117 (8 numbers)
All 60 sets are distinct, so every read misses (cold misses) but no block evicts another.
Next 3 loops: a[0][1] -> a[59][1], a[0][2] -> a[59][2], a[0][3] -> a[59][3]
All hit
Because the 60 rows spread over 60 distinct sets, there are no conflict misses: although N is smaller and not a power of 2, the miss rate improves to 25% (from 100% at N = 64).
C. sumC
With N = 60 the rows no longer conflict in the cache, so a block loaded for one column pair survives until the next pair reuses it: only the first access to each block misses.
Miss rate: 25%
A.
4 * 16 * 16
B.
sizeof(point_color) == 16, B = 32
The first write misses and caches a block holding 2 point_color structs; the next 7 writes hit.
So the miss count is 4 * 16 * 16 * 1/8 = 128
C.
1/8
A.
4 * 16 * 16
B.
sizeof(point_color) == 16, B = 32
The first write misses and caches a block holding 2 point_color structs; the rest of that struct hits.
In the following iterations:
- square[j+1][i] misses, but its cache block does not conflict with square[j][i]'s block.
- square[j+8][i] misses, and its cache block overwrites square[j][i]'s block, so when we reach square[j][i+1] it misses again.
So miss count is 4 * 16 * 16 * 1/4
C.
1/4
A.
4 * 16 * 16
B.
First loop: same as 6.38, but the write count is 16 * 16 and the miss rate is 1/2.
Second loop: same as 6.39, but the write count is 16 * 16 * 3 and the miss rate is 1/6.
Miss count:
16 * 16 * 1/2 + 16 * 16 * 3 * 1/6 = 128 + 128 = 256
C.
1/4
In every iteration, the first access to a pixel misses and caches the whole pixel; the remaining accesses to it hit.
Miss rate is 1/4
Same as
C = 64KB, B = 4B, sizeof(pixel) = 4
The first access to a pixel misses and caches it; the remaining accesses hit.
Miss rate is 1/4
Same as
Every access misses.
Miss rate: 100%
Assume matrix size N = 4, cache block size B = 8Byte, sizeof(int) = 4
Function transpose
Every cache block can store 2 int numbers
Traverse src by step 1, order:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- element 0: cache miss, loads src[0], src[1]
- element 1: cache hit, thanks to element 0
- element 2: cache miss, loads src[2], src[3]
- ….
- element 15: cache hit
If B is greater, the hit rate will be greater too.
Traverse dst by step 4, order:
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
- element 0: miss, loads dst[0], dst[1]
- element 4: miss, loads dst[4], dst[5]
- element 8: miss, loads dst[8], dst[9]
- element 12: miss, loads dst[12], dst[13]
- element 1: interesting: element 0 already loaded dst[1] into the cache, so it can hit. But in many cases elements 4/8/12 share a cache block with element 0, so a miss is also possible.
- ….
- element 15: hit / miss
Let’s assume the worst case: all dst accesses miss the cache.
If we split the matrix by 2 * 2
- transpose block 0 in place
- swap-transpose block 1 with block 2
- transpose block 3 in place
Code
When i = 0 and j = 0, we transpose block 0 in place.
If element 0 misses, element 1 must hit; if element 4 misses, element 5 must hit.
So 50% is the highest hit rate achievable with such a small cache block.
If B and the cache size C are larger, we can split the matrix into 4 * 4, 8 * 8, or even larger blocks; theoretically, this achieves the highest hit rate.
Finally code:
In the test code, the matrix size is 1024 * 1024 and the transpose runs 1000 times to get a measurable run time.
Changing BLOCK from 2 to 16 gives the following timings.
Original (unblocked) function run time: 16.46s
| BLOCK | time(s) |
|---|---|
| 2 | 9.99 |
| 3 | 7.16 |
| 4 | 5.6 |
| 5 | 5.66 |
| 6 | 5.34 |
| 7 | 5.39 |
| 8 | 5.38 |
| 9 | 5.48 |
| 10 | 6.21 |
| 11 | 7.9 |
| 12 | 10.17 |
| 13 | 11.14 |
| 14 | 11.88 |
| 15 | 12.11 |
| 16 | 11.85 |
Tip: look up CPU cache info
Same as 6.45, pay attention to the brilliant comment 🙂
Measured time statistics:
Original function run time: 30.28s
| BLOCK | time(s) |
|---|---|
| 1 | 14.26 |
| 2 | 12.01 |
| 3 | 7.43 |
| 4 | 6.20 |
| 5 | 6.08 |
| 6 | 5.86 |
| 7 | 5.70 |
| 8 | 5.67 |
| 9 | 6.30 |
| 10 | 6.39 |
| 11 | 6.21 |
| 12 | 6.18 |
| 13 | 5.9 |
| 14 | 6.3 |
| 15 | 5.88 |
| 16 | 5.92 |
