Exploring Disk Storage and Cache Optimization: Solutions to Common Problems

Assume:

  • Bits per track = K * x * r (fixed by the circumference of the innermost track)
  • Track count = M * (1 - x) * r
  • M, K are constant

So:

  • Bit count bc = (bits per track) * (track count) = M * K * r^2 * x * (1 - x)
  • When x == 1/2, bc is maximum

T_avg_seek = 4ms

T_avg_rotation = 1/2 * 1/15000 * 60s/min * 1000ms/s = 2ms

T_avg_transfer = 1/15000 * 1/800 * 60s/min * 1000ms/s = 0.005ms

So:

T_access = 6.005ms

Almost same as problem 6.4 in the book.

A.

Best case: blocks are mapped sequentially and on the same cylinder, so we pay the seek and rotational latency only once.

File size 2 MB, block size 512 B, block count 2 MB / 512 B ≈ 4000

Blocks per track = 1000, so we need 4 full rotations to read all the data.

B.

Worst case: blocks are mapped to random positions on the disk.

| m | C | B | E | S | t | s | b |
|---|------|----|-----|-----|----|---|---|
| 32 | 1024 | 4 | 4 | 64 | 24 | 6 | 2 |
| 32 | 1024 | 4 | 256 | 1 | 30 | 0 | 2 |
| 32 | 1024 | 8 | 1 | 128 | 22 | 7 | 3 |
| 32 | 1024 | 8 | 128 | 1 | 29 | 0 | 3 |
| 32 | 1024 | 32 | 1 | 32 | 22 | 5 | 5 |
| 32 | 1024 | 32 | 4 | 8 | 24 | 3 | 5 |
| m | C | B | E | S | t | s | b |
|---|------|----|----|-----|----|---|---|
| 32 | 2048 | 8 | 1 | 256 | 21 | 8 | 3 |
| 32 | 2048 | 4 | 4 | 128 | 23 | 7 | 2 |
| 32 | 1024 | 2 | 8 | 64 | 25 | 6 | 1 |
| 32 | 1024 | 32 | 2 | 16 | 23 | 4 | 5 |

A.

t = 0x45 = 0b01000101, s = 0b001, b = xx (xx is 00/01/10/11)

Address may be:

  • t = 0x45: address range 0x08A4 – 0x08A7
  • t = 0x38: address range 0x0704 – 0x0707

B.

0x1238 – 0x123B

Same as 6.27

A.

None

B.

  • 0x18F0 – 0x18F3
  • 0x00B0 – 0x00B3

C.

0x0E34 – 0x0E37

D.

0x1BDC – 0x1BDF

A.

B.

| read/write | addr | hit? | value (or unknown) |
|---|---|---|---|
| read | 0x834 | No | |
| write | 0x836 | Yes | unknown |
| read | 0xFFD | Yes | 0xC0 |

A.

C = E * B * S = 128B

B.

A.

B.

| param | value |
|---|---|
| CO | 0x02 |
| CI | 0x06 |
| CT | 0x38 |
| hit? | Yes |
| return | 0xEB |

A.

B.

| param | value |
|---|---|
| CO | 0x00 |
| CI | 0x02 |
| CT | 0xB7 |
| hit? | No |
| return | |

Same as 6.27

  • 0x1788 – 0x178B
  • 0x16C8 – 0x16CB

src:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | h | m |
| r1 | m | h | m | h |
| r2 | m | m | h | m |
| r3 | m | h | m | h |

dst:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | m | m | m |
| r1 | m | m | m | m |
| r2 | m | m | m | m |
| r3 | m | m | m | m |

src:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |

dst:

|  | c0 | c1 | c2 | c3 |
|---|---|---|---|---|
| r0 | m | h | h | h |
| r1 | m | h | h | h |
| r2 | m | h | h | h |
| r3 | m | h | h | h |

A.

C = 512, E = 1, B = 16, S = 32

Total read count: 2 * 128

x[0][i] address: i * 4

x[1][i] address: (128 + i) * 4 = 512 + i * 4

So x[0][i] and x[1][i] map to the same cache set, and with E = 1 they keep evicting each other.

Miss rate 100%

B.

C = 1024, E = 1, B = 16, S = 64

sizeof(x) == 2 * 128 * 4 == C

The whole array can be cached.

Miss rate is dependent on block size B.

B = 16, sizeof(int) = 4, so

Miss rate is 25%

C.

C = 512, E = 2, B = 16, S = 16

Total read count: 2 * 128

x[0][i] address: i * 4

x[1][i] address: (128 + i) * 4 = 512 + i * 4

So x[0][i] and x[1][i] map to the same set, but with E = 2 they can occupy the set's two different ways.

In the first half, all elements can be cached. Miss rate is 25%.

In the second half

x[0][i] is not in the cache. Under the LRU strategy, x[0][i]'s block replaces the block that held x[0][i-64], and x[1][i]'s block replaces the one that held x[1][i-64]. Each block of 4 ints is still loaded exactly once, so the miss rate is 25%.

Finally, the miss rate is still 25%.

D.

No

If B is still 16 and sizeof(int) = 4, a block can only hold 4 ints at a time.

The first read of each block misses and loads it from memory; the next 3 reads hit.

So 25% is the lowest possible miss rate.

E.

Yes

Assume B = 32: a block then holds 8 ints at a time, so there is only 1 miss per 8 int reads.

The miss rate can be 12.5%.

C = 4096, B = 16, E = 1, S = 256

N = 64

sizeof(array_t) == 64 * 64 * 4 == 16384 == 4 * C

A. sumA

Read memory address order:

0, 4, 8, 12, ….., 4096 * 4 – 4

Read cache order:

0, 0, 0, 0, 1, 1, 1, 1,…..255, 255, 255, 255, 0, 0, 0, 0,…..255, 255, 255, 255

The first read of each block misses; the next 3 reads hit.

Miss rate: 25%

B. sumB

Read memory address order:

0, 64 * 4, 64 * 4 * 2, …. 64 * 4 * 63, 4, 64 * 4 + 4, 64 * 4 * 2 + 4, …. 4096 * 4 – 4

Read cache set order (set = (addr / 16) mod 256):

0, 16, 32, 48, … 240, repeated 4 times per column, each repetition with a different tag

Rows j and j + 16 map to the same set with different tags, so within one column every access evicts the block its set was holding: all 64 reads miss.

The next columns (i = 1, 2, 3) would like to reuse those blocks, but each block has already been evicted again by the time the traversal returns to its set, so they all miss too.

So the miss rate is 100%.

C. sumC

It is easy to see that reading a[j][i+1] and a[j+1][i+1] always hits: each lies in the same block as the a[j][i] / a[j+1][i] read just before it.

Total read count = 64 * 64

Because of i += 2, each block of 4 ints is touched by two iterations (i and i + 2): the first touch misses, the second hits.

So 50% of the a[j][i] and a[j+1][i] reads miss.

Total read miss count = 64/2 * 64 * 50% = 64 * 64 / 4

So the miss rate is still 25%.

N = 60

A. sumA

Read memory by step 1

Miss rate 25%

B. sumB

It’s interesting.

Let’s see the first inner loop a[0][0] -> a[59][0]

Read memory address order:

0, 60 * 4, 60 * 4 * 2, …. 60 * 4 * 59

Read cache set order (set = 15j mod 256):

0, 15, 30, …, 240, 255 (j = 0–17), 14, 29, …, 254 (j = 18–34), 13, 28, …, 253 (j = 35–51), 12, 27, 42, 57, 72, 87, 102, 117 (j = 52–59)

All 60 reads miss, and every block lands in a different set, so nothing is evicted.

Next 3 loops: a[0][1] -> a[59][1], a[0][2] -> a[59][2], a[0][3] -> a[59][3]

All hit

Because N = 60 is not a power of 2, the column accesses never conflict, and the miss rate is 25%.

C. sumC

Same miss rate as when N = 64

25%

A.

4 * 16 * 16

B.

sizeof(point_color) == 16 and B = 32, so one cache block holds 2 point_color structs.

The first write to square[j][i] misses and loads the block holding square[j][i] and square[j][i+1]; the remaining 7 writes to those two structs all hit.

So the miss count is 4 * 16 * 16 * 1/8

C.

1/8

A.

4 * 16 * 16

B.

sizeof(point_color) == 16 and B = 32, so one cache block holds 2 point_color structs.

The first write to square[j][i] misses and caches square[j][i] and square[j][i+1]; the following writes to square[j][i] hit.

In the next iterations:

square[j+1][i] misses, but its block does not conflict with square[j][i]'s.

square[j+8][i] misses and its block evicts square[j][i]'s block, so when we eventually reach square[j][i+1], it misses again.

So miss count is 4 * 16 * 16 * 1/4

C.

1/4

A.

4 * 16 * 16

B.

The first loop is like 6.38, but with only one write per struct: the write count is 16 * 16 and the miss rate is 1/2 (one miss per 2-struct block).

The second loop is like 6.39, but with three writes per struct: the write count is 16 * 16 * 3 and the miss rate is 1/6 (again one miss per 2-struct block).

Miss count is

16 * 16 * 1/2 + 16 * 16 * 3 * 1/6

C.

1/4

In every loop iteration, the first write to a pixel misses and caches that pixel; the following 3 writes to it hit.

Miss rate is 1/4


C = 64KB, B = 4B, sizeof(pixel) = 4, so one block holds exactly one pixel.

The first write to each pixel misses and caches it; the next 3 writes hit.

Miss rate is 1/4

Here every access touches a block that is not in the cache, so every access misses.

Miss rate 100%

Assume matrix size N = 4, cache block size B = 8Byte, sizeof(int) = 4

Function transpose

Every cache block can store 2 int numbers

Traverse src by step 1, order:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  • order 0: cache miss, load src[0], src[1]
  • order 1: cache hit (loaded at order 0)
  • order 2: cache miss, load src[2], src[3]
  • ….
  • order 15: cache hit

If B is greater, the hit rate will be greater too.

Traverse dst by step 4, order:

0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

  • element 0: miss, load dst[0], dst[1]
  • element 4: miss, load dst[4], dst[5]
  • element 8: miss, load dst[8], dst[9]
  • element 12: miss, load dst[12], dst[13]
  • element 1: interesting: order 0 already loaded dst[1], so it can hit. But in many caches, elements 4/8/12 may share a cache set with element 0, so a miss is also possible.
  • ….
  • element 15: hit / miss

Let’s assume the worst case: all elements miss the cache

If we split the matrix into 2 * 2 blocks:

  • transpose block 0 in place
  • transpose block 1 with block 2 (swap and transpose)
  • transpose block 3 in place

Code

When i = 0 and j = 0, we transpose block 0 in place.

If element 0 misses, element 1 must hit; if element 4 misses, element 5 must hit.

So 50% is the highest hit rate we can reach with such a small cache block size.

If B and the cache size C are larger, we can split the matrix into 4 * 4, 8 * 8, or even larger blocks; in theory, this approaches the highest possible hit rate.

Final code:

The matrix is 1024 * 1024, and the transpose runs 1000 times to measure the program's run time.

Changing BLOCK from 2 to 16 gives the statistics below.

Original function run time: 16.46s

| BLOCK | time (s) |
|---|---|
| 2 | 9.99 |
| 3 | 7.16 |
| 4 | 5.6 |
| 5 | 5.66 |
| 6 | 5.34 |
| 7 | 5.39 |
| 8 | 5.38 |
| 9 | 5.48 |
| 10 | 6.21 |
| 11 | 7.9 |
| 12 | 10.17 |
| 13 | 11.14 |
| 14 | 11.88 |
| 15 | 12.11 |
| 16 | 11.85 |

Tip: look up your CPU's cache parameters when choosing BLOCK.

Same as 6.45, pay attention to the brilliant comment 🙂

Measure time statistics:

Original function run time: 30.28s

| BLOCK | time (s) |
|---|---|
| 1 | 14.26 |
| 2 | 12.01 |
| 3 | 7.43 |
| 4 | 6.20 |
| 5 | 6.08 |
| 6 | 5.86 |
| 7 | 5.70 |
| 8 | 5.67 |
| 9 | 6.30 |
| 10 | 6.39 |
| 11 | 6.21 |
| 12 | 6.18 |
| 13 | 5.9 |
| 14 | 6.3 |
| 15 | 5.88 |
| 16 | 5.92 |