Pipeline Performance and Hazard Detection in Computer Architecture

Given a non-pipelined architecture running at 2.5GHz that takes 5 cycles to finish an instruction, you want to make it pipelined with 5 stages. The increase in hardware forces you to run the machine at 2GHz. The only stalls are caused by:

  • memory – (30% of the total instructions) a stall of 50 cycles happens in 2% of the memory instructions
  • branch – (20% of the total instructions) a stall of 2 cycles happens in 20% of the branch instructions

What is the speedup? Clearly show all work.

Non-pipelined time per instruction = 5 cycles / 2.5 GHz = 2 ns

Pipelined CPI = 1 + (0.30 × 0.02 × 50) + (0.20 × 0.20 × 2) = 1 + 0.30 + 0.08 = 1.38

Pipelined time per instruction = 1.38 / 2 GHz = 0.69 ns

Speedup = 2 ns / 0.69 ns ≈ 2.90
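A quick numeric check of the speedup, using only the figures from the problem statement:

```python
# Speedup check for the 5-stage pipeline problem above.
# Non-pipelined: 2.5 GHz, 5 cycles per instruction.
# Pipelined: 2 GHz, CPI = 1 plus stall cycles from memory and branch stalls.

t_old = 5 / 2.5e9            # seconds per instruction, non-pipelined

cpi_new = 1 + 0.30 * 0.02 * 50 + 0.20 * 0.20 * 2   # 1 + 0.30 + 0.08 = 1.38
t_new = cpi_new / 2e9        # seconds per instruction, pipelined

speedup = t_old / t_new
print(f"CPI = {cpi_new:.2f}, speedup = {speedup:.2f}")  # CPI = 1.38, speedup = 2.90
```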

For the following code, indicate where and label the type of hazard (RAW, WAW, WAR).

square: ld.b R2,0(R3) !load the byte at mem[R3] and place it in R2

        add R3,R3,4 !R3=R3+4, next word

        mul R2,R2,R2 !R2=R2*R2, square

        st.b R2,0(R3) !store low-order byte in R2 to mem[R3]

        sub R2,R3,R4 !R2=R3-R4

        bnez R2,square !if R2 != 0 then goto square

square: ld.b R2,0(R3) !1

        add R3,R3,4 !2 - WAR with 1

        mul R2,R2,R2 !3 - RAW with 1

        st.b R2,0(R3) !4 - RAW with 2, 3

        sub R2,R3,R4 !5 - RAW with 2, WAR with 4

        bnez R2,square !6 - RAW with 5
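The labeling above can be checked mechanically. A minimal sketch in Python (the instruction encoding and the `hazards` helper are my own illustration, not part of the original exercise):

```python
# Minimal data-hazard finder for the loop above.
# Each instruction is (text, dest_reg, [source_regs]); None means no destination.
code = [
    ("ld.b R2,0(R3)",  "R2", ["R3"]),
    ("add R3,R3,4",    "R3", ["R3"]),
    ("mul R2,R2,R2",   "R2", ["R2"]),
    ("st.b R2,0(R3)",  None, ["R2", "R3"]),
    ("sub R2,R3,R4",   "R2", ["R3", "R4"]),
    ("bnez R2,square", None, ["R2"]),
]

def hazards(code):
    """Report RAW/WAR/WAW pairs between each instruction and all earlier ones."""
    found = []
    for j, (_, dst_j, srcs_j) in enumerate(code):
        for i in range(j):
            _, dst_i, srcs_i = code[i]
            if dst_i and dst_i in srcs_j:            # earlier write, later read
                found.append((i + 1, j + 1, "RAW", dst_i))
            if dst_j and dst_j in srcs_i:            # earlier read, later write
                found.append((i + 1, j + 1, "WAR", dst_j))
            if dst_i and dst_j and dst_i == dst_j:   # two writes, same register
                found.append((i + 1, j + 1, "WAW", dst_i))
    return found

for i, j, kind, reg in hazards(code):
    print(f"{kind} on {reg}: instruction {j} vs {i}")
```

Note that this naive checker reports every register overlap between instruction pairs (for example a WAW between 1 and 3), while the hand labels above keep only the nearest dependence for each value.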

(3) This question will be about the following computer:

  • 1.5 GHz, No Pipeline, instruction buffers that negate instruction load penalties 80% of the time
  • 32KB L1 cache, 1ns access, 4-way associative, write-through, not write-allocate, 2% miss rate for data and a miss rate of 1/60 for instructions
  • 1MB L2 cache, 12ns access, 4-way associative, write-back, write-allocate, 30% miss rate, 20% dirty
  • 1GB RAM, 50ns access

(a) Does it make sense to have a branch delay slot, since there is no pipeline? Why or why not?

No. A branch delay slot is only useful for reducing the branch penalty in a pipelined machine. In a pipelined machine, when a branch is taken, the two or three instructions following the branch are typically already in the pipeline. These instructions must be flushed, reducing the efficiency of the pipeline. The branch delay slot reduces this penalty by one cycle.

(b) What would the ideal speedup be for this computer if a 5 stage pipeline were installed?

Ideal speedup equals the number of stages in the pipeline, thus 5.

(c) What is the effective access time for loading instructions?

Effective access time = L1 access + L1 instruction miss rate × (L2 access + L2 miss rate × RAM access)

= 1 ns + (1/60) × (12 ns + 0.30 × 50 ns) = 1 ns + 0.45 ns = 1.45 ns

(treating an L2 miss as a plain RAM access and ignoring write-backs)

(d) What is the CPI penalty to each instruction due to loading instructions?

The instruction buffers hide the fetch penalty 80% of the time, so the average fetch cost is 0.20 × 1.45 ns = 0.29 ns per instruction. At 1.5 GHz one cycle is 0.667 ns, so the CPI penalty is 0.29 / 0.667 ≈ 0.44 cycles per instruction.

(e) What would you recommend changing to make this faster?

Two-thirds of the penalty is due to the access time of level one cache. I would look at trying to reduce or negate this first.
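The "Equation" placeholders in (c) and (d) did not survive extraction; one consistent reading of the given parameters, sketched in Python (L2 misses treated as plain RAM accesses, write-backs ignored):

```python
# Effective access time for instruction fetch, parts (c) and (d) above.
l1_access, l2_access, ram_access = 1.0, 12.0, 50.0   # ns
l1_imiss, l2_miss = 1 / 60, 0.30
cycle = 1 / 1.5                                      # ns per cycle at 1.5 GHz

# (c) effective (average) instruction access time
amat_i = l1_access + l1_imiss * (l2_access + l2_miss * ram_access)

# (d) the instruction buffers hide the fetch penalty 80% of the time
cpi_penalty = 0.20 * amat_i / cycle

print(f"AMAT = {amat_i:.2f} ns, CPI penalty = {cpi_penalty:.3f}")
```

The L1 term contributes 0.20 × 1 ns of the 0.20 × 1.45 ns average, i.e. about two-thirds of the penalty, which matches the recommendation in (e).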


5.2

How many 32-bit integers can be stored in a 16-byte cache line?

Let 1 byte = 8 bits

16 bytes = 128 bits

4 * 32 bits = 128 bits

Therefore, four 32-bit integers can be stored in a 16-byte cache line.

References to which variable exhibit temporal locality?

a)

for (I = 0; I < …; I++)

for (J = 0; J < …; J++)

A[J][3] = B[3][0] + A[3][I];

Temporal locality:

This refers to the tendency for a processor to access memory locations that have been used recently. For example, when an iteration loop is executed, the processor executes the same set of instructions repeatedly.

b)

for (J = 0; J < …; J++)

for (I = 0; I < …; I++)

A[I][J] = B[3][0] + A[3][I];

Temporal locality: I, J

References to which variable exhibit spatial locality?
A) for (I = 0; I < …; I++)

for (J = 0; J < …; J++)

A[I][J] = B[J][0] + A[J][I];

Spatial locality:

This refers to the tendency of execution to involve several memory locations that are clustered. This reflects the tendency of a processor to access instructions sequentially.

Spatial locality also reflects the tendency of a program to access data locations sequentially, such as when processing a table of data. Therefore, in the above code, the spatial locality is:

Spatial locality: A[I][J]

5.22

References to which variable exhibit temporal locality?

for (I = 1:8000)

for (J = 1:8)

A(I,J) = B(J,0) + A(J,I);

end

end

Temporal Locality

This refers to the tendency for a processor to access memory locations that have been used recently. For example, when an iteration loop is executed, the processor executes the same set of instructions repeatedly.

Temporal locality: I, J, B(J,0)

 C)

for (J = 1:8)

for (I = 1:8000)

A(I,J) = B(J,0) + A(J,I);

end

end

D) Temporal Locality: I, J

References to which variables exhibit spatial locality?

Step 1:

a) for (I=1:8000)

for J=1:8

A(I,J)=B(J,0)+A(J,I)

end

end

Step 2

Spatial locality

This refers to the tendency of execution to involve several memory locations that are clustered. This reflects the tendency of a processor to access instructions sequentially. Spatial locality also reflects the tendency of a program to access data locations sequentially, such as when processing a table of data. Therefore, in the above code, the spatial locality is:

Spatial locality: A(I,J), A(J,I), B(J,0)      

Step 3

for J=1:8

for (I = 1:8000)

A(I,J)=B(J,0)+A(J,I)

end

end

1. Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?

Given L1 hit time of P1 = 0.62ns

a) Clock rate of P1 = (1 / (L1 hit time))

P1 = (1 / 0.62ns) = 1.613GHz

b) Given L1 hit time of P2 = 0.66ns

Clock rate of P2 = (1 / (L1 hit time))

P2 = (1 / 0.66ns)

P2 = 1.515GHz

c) Given L1 hit time of P1 = 0.96ns

Clock rate of P1 = 1 / (L1 hit time)

P1 = 1 / 0.96ns

P1 = 1.042GHz

d) Given L1 hit time of P2 = 1.08ns

Clock rate of P2 = 1 / (L1 hit time)

P2 = 1 / 1.08ns

P2 = 0.926GHz

2. What is the AMAT for each of P1 and P2?

a) Given data

L1 hit time of P1 = 0.62ns

L1 miss rate = 11.4%

Miss penalty = AMAT(DRAM) = 70ns

AMAT (Average Memory Access Time) = (Hit time + (Miss rate * Miss Penalty))

AMAT = (0.62 + (11.4% * 70)) = (0.62 + 7.98)

AMAT = 8.6ns

b) For processor 2

Given data

L1 hit time of P2 = 0.66ns

L1 miss rate of P2 = 8.0%

AMAT = (Hit time + (Miss rate * Miss Penalty)) = (0.66 + (8.0% * 70)) = (0.66 + 5.6)

AMAT = 6.26ns

c) For processor 1

L1 hit time of P1 = 0.96ns

L1 miss rate = 4.3%

AMAT = (Hit time + (Miss rate * Miss Penalty)) = (0.96 + (4.3% * 70)) = (0.96 + 3.01)

AMAT = 3.97ns

d) For processor 2

Given data

L1 hit time of P2 = 1.08ns

L1 miss rate = 3.4%

AMAT = (Hit time + (Miss rate * Miss Penalty)) = (1.08 + (3.4% * 70)) = (1.08 + 2.38)

AMAT = 3.46ns
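The four AMAT computations above all follow the same formula; a quick check:

```python
# AMAT = L1 hit time + L1 miss rate * miss penalty (DRAM access), in ns.
def amat(hit_ns, miss_rate, penalty_ns=70.0):
    return hit_ns + miss_rate * penalty_ns

cases = {
    "P1 (0.62 ns, 11.4%)": amat(0.62, 0.114),
    "P2 (0.66 ns,  8.0%)": amat(0.66, 0.080),
    "P1 (0.96 ns,  4.3%)": amat(0.96, 0.043),
    "P2 (1.08 ns,  3.4%)": amat(1.08, 0.034),
}
for name, t in cases.items():
    print(f"{name}: {t:.2f} ns")
```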

3. Assuming a base CPI of 1.0, what is the total CPI for each of P1 and P2? Which processor is faster?

Main memory access = 70ns

(P1) L1 hit time = 0.62ns

(P1) L1 miss rate = 11.4%

(P2) L1 hit time = 0.66ns

(P2) L1 miss rate = 8.0%

Base CPI = 1.0

b) CPI of P1

Miss penalty (in cycles) = Main memory access / cycle time = (70 / 0.62) ≈ 112.9 cycles

CPI memory = (L1 miss rate of P1 * Miss penalty)

CPI memory = (0.114 * 112.9) ≈ 12.87

CPI total = CPI base + CPI memory

CPI total = 1.0 + 12.87

CPI total = 13.87

c) CPI of P2

Miss penalty = (Main memory access / cycle time) = (70 / 0.66) ≈ 106.1 cycles

CPI memory = (L1 miss rate of P2 * Miss penalty)

CPI memory = (0.08 * 106.1) ≈ 8.48

CPI total = CPI base + CPI memory

CPI total = 1.0 + 8.48

CPI total = 9.48

Comparing execution times: P1 takes (1 + 0.114 * 70/0.62) * 0.62 ≈ 8.6 ns per instruction, while P2 takes 9.48 * 0.66 ≈ 6.26 ns, so P2 is faster.

d) Main memory access = 70ns

(P1) L1 hit time = 0.96ns

(P1) L1 miss rate = 4.3%

(P2) L1 hit time = 1.08ns

(P2) L1 miss rate = 3.4%

Base CPI = 1.0

CPI of P1

Miss penalty = (Main memory access / cycle time) = (70 / 0.96) ≈ 72.9 cycles

CPI memory = (L1 miss rate of P1 * Miss penalty)

CPI memory = (0.043 * 72.9) ≈ 3.14

CPI total = CPI base + CPI memory

CPI total = 1.0 + 3.14

CPI total = 4.14

CPI of P2

Miss penalty = (Main memory access / cycle time) = (70 / 1.08) ≈ 64.8 cycles

CPI memory = (L1 miss rate of P2 * Miss penalty)

CPI memory = (0.034 * 64.8) ≈ 2.20

CPI total = CPI base + CPI memory

CPI total = 1.0 + 2.20

CPI total = 3.20

Comparing execution times: P1 takes (1 + 0.043 * 70/0.96) * 0.96 ≈ 3.97 ns per instruction, while P2 takes 3.20 * 1.08 ≈ 3.46 ns, so P2 is faster.
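A numeric check of the four total-CPI computations. Since the cycle time equals the L1 hit time, the miss penalty in cycles is the 70 ns memory access divided by the hit time:

```python
# Total CPI = base CPI + L1 miss rate * miss penalty (in cycles).
# Cycle time equals the L1 hit time, so penalty_cycles = 70 ns / hit_ns.
def total_cpi(hit_ns, miss_rate, mem_ns=70.0, base=1.0):
    return base + miss_rate * (mem_ns / hit_ns)

for name, hit, mr in [("P1", 0.62, 0.114), ("P2", 0.66, 0.080),
                      ("P1'", 0.96, 0.043), ("P2'", 1.08, 0.034)]:
    cpi = total_cpi(hit, mr)
    # CPI * cycle time recovers the AMAT-like time per memory-bound instruction
    print(f"{name}: CPI = {cpi:.2f}, time/instr = {cpi * hit:.2f} ns")
```

In both configurations P2's time per instruction is lower, so P2 is faster even though the raw CPI numbers alone are not comparable across different clock rates.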

4. What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?

Hit time L1 = 0.62ns

Hit time L2 = 3.22ns

Miss rate L1 = 11.4%

Miss rate L2 = 98%

Miss penalty L2 = AMAT DRAM

Miss penalty = 70ns

Miss Penalty L1 = AMAT L2 = Hit time L2 + (Miss rate L2 * Miss penalty L2)

Miss Penalty L1 = (3.22 + 98% * 70) = 3.22 + (0.98 * 70) = (3.22 + 68.6) = 71.82ns

AMAT = Hit time L1 + (Miss rate L1 * Miss Penalty L1)

AMAT = 0.62 + (11.4% * 71.82) = 0.62 + (0.114 * 71.82) = (0.62 + 8.19) = 8.81ns

Therefore, AMAT is worse with the L2 cache

d)

Hit time L1 = 0.96 ns

Hit time L2 = 11.48ns

Miss rate L1 = 4.3%

Miss rate L2 = 73%

Miss penalty L2 = AMAT DRAM

Miss penalty L2 = 70ns

e)

Miss Penalty L1 = AMAT L2 = Hit time L2 + (Miss rate L2 * Miss penalty L2)

Miss Penalty L1 = (11.48 + (73% * 70)) = (11.48 + (0.73 * 70)) = (11.48 + 51.1) = 62.58ns

f)

AMAT = Hit time L1 + (Miss rate L1 * Miss penalty L1)

AMAT = (0.96 + (4.3% * 62.58)) = (0.96 + (0.043 * 62.58)) = (0.96 + 2.69)

AMAT = 3.65ns

AMAT is better with L2

5. Assuming a base CPI of 1.0, what is the total CPI for P1 with the addition of an L2 Cache?

6. Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1’s performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2’s performance?
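Questions 5 and 6 are left unanswered above. A sketch of the arithmetic under the same conventions used earlier (base CPI 1.0, cycle time equal to the L1 hit time, write-backs ignored); the numbers below are my own working, not from the original:

```python
# Parts 5 & 6: P1 with an L2 cache, base CPI = 1.0, cycle time = L1 hit time.
cycle = 0.62                 # ns, P1 cycle time
mr1, mr2 = 0.114, 0.98       # P1's L1 miss rate and L2 (local) miss rate
l2_hit, mem = 3.22, 70.0     # ns

# Every L1 miss pays an L2 access; L2 misses additionally pay a DRAM access.
cpi_p1 = 1 + (mr1 * l2_hit + mr1 * mr2 * mem) / cycle
t_p1 = cpi_p1 * cycle                      # ns per instruction

t_p2 = (1 + 0.080 * (70.0 / 0.66)) * 0.66  # P2 (no L2), from the earlier part

# Part 6: the L1 miss rate P1 would need for its time/instruction to match P2's.
needed = (t_p2 - cycle) / (l2_hit + mr2 * mem)

print(f"P1+L2: CPI = {cpi_p1:.2f}, {t_p1:.2f} ns/instr; P2: {t_p2:.2f} ns/instr")
print(f"P1 L1 miss rate needed to match P2: {needed:.1%}")
```

P2 is still faster, so the last line answers question 6: P1 would need its L1 miss rate to drop to roughly 7.9%.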

Chapter 4

1) If there is no forwarding or hazard detection, insert nops to ensure correct execution.

22.1



2) Repeat, but now insert nops only when a hazard cannot be avoided by changing or rearranging these instructions. You can assume register R7 can be used to hold temporary values in your modified code.

22.2

3) If the processor has forwarding, but we forget to implement the hazard detection unit, what happens when this code executes?

With forwarding, the hazard detection unit is still needed because it must insert a one-cycle stall whenever the load supplies a value to the instruction that immediately follows that load. Without the hazard detection unit, the instruction that depends on the immediately preceding load gets the stale value the register had before the load instruction.
22.3

4) If there is forwarding, for the first five cycles during the execution of this code, specify which signals are asserted in each cycle by hazard detection and forwarding units.

The outputs of the hazard detection unit are PCWrite, IF/IDWrite, and ID/EXZero (which controls the Mux after the output of the Control unit). Note that IF/IDWrite is always equal to PCWrite, and ID/EXZero is always the opposite of PCWrite. As a result, we will only show the value of PCWrite for each cycle. The outputs of the forwarding unit are ALUin1 and ALUin2, which control the Muxes that select the first and second inputs of the ALU. The three possible values for ALUin1 or ALUin2 are 0 (no forwarding), 1 (forward the ALU output from the previous instruction), or 2 (forward the data value from the second-previous instruction).
22.4

5) If there is no forwarding, what new inputs and output signals do we need for the hazard detection? Using the instruction sequence as an example, explain why each signal is needed.

The instruction currently in the ID stage needs to be stalled if it depends on a value produced by the instruction in the EX stage or the instruction in the MEM stage. So we need to check the destination register of these two instructions. For the instruction in the EX stage, we need to check Rd for R-type instructions and Rt for loads. For the instruction in the MEM stage, the destination register is already selected (by the Mux in the EX stage), so we need to check that register number (this is the bottommost output of the EX/MEM pipeline register). The additional inputs to the hazard detection unit are register Rd from the ID/EX pipeline register and the number of the destination register from the EX/MEM pipeline register. The Rt field from the ID/EX register is already an input of the hazard detection unit in Figure 4.60. No additional outputs are needed. We can stall the pipeline using the three output signals that we already have.

6) For the new hazard detection unit, specify which output signals it asserts in each of the first five cycles during the execution of this code.

As explained for 4.21.5, we only need to specify the value of the PCWrite signal because IF/IDWrite is equal to PCWrite and the ID/EXzero signal is its opposite. We have:
22.6



1) Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage.

22.11

2) Draw the pipeline execution diagram again, but now assume that delay slots are used in the given code. The instruction that follows the branch is now the delay slot instruction for that branch.

22.22

3) One way to move the branch resolution one stage earlier is to not need an ALU operation in conditional branches. The branch instructions would be "bez Rd,Label" and "bnez Rd,Label", which branch if the register is zero or nonzero, respectively. Change this code to use these branch instructions instead of beq. You can assume that register $8 is available for you to use as a temporary register, and that a seq R-type instruction can be used.

22.33

4) Using the first branch instruction in the given code as an example, describe the hazard detection logic needed to support branch execution in the ID stage. Which type of hazard is this new logic supposed to detect?

The hazard detection logic must detect situations in which the branch depends on the result of the immediately preceding R-type instruction, or on the result of either of the two preceding loads. When the branch uses its register operands' values in its ID stage, the R-type instruction's result is still being generated in the EX stage. Thus we must stall the processor and repeat the ID stage of the branch in the next cycle. Similarly, if the branch depends on a load that immediately precedes it, the load's result is only generated two cycles after the branch enters the ID stage, so we must stall the branch for two cycles. Finally, if the branch depends on a load that is the second-previous instruction, the load is completing its MEM stage when the branch is in its ID stage, so we must stall the branch for one cycle. In all three cases, the hazard is a data hazard. Note that in all three cases, we assume that the values of preceding instructions are forwarded to the ID stage of the branch if possible.

5) Draw the pipeline execution diagram for this code, assuming that branches are executed in the ID stage.

The following is the pipeline diagram when branches are executed in the ID stage, including new stalls due to data dependencies:
22.5



22.51

6) Using the first branch instruction in the given code as an example, describe the forwarding support that must be added to support branch execution in the ID stage. Compare the complexity of this new forwarding unit to the complexity of the existing forwarding unit.

Branch instructions are now executed in the ID stage. If a branch uses a register value produced by the immediately preceding instruction, the branch must be stalled as described for 4.22.4, because the preceding instruction is still in the EX stage when the branch needs the value in the ID stage. If the branch in the ID stage depends on an R-type instruction in the MEM stage, we need forwarding to ensure the branch's correct execution. Similarly, if the branch in the ID stage depends on an R-type or load instruction in the WB stage, we need forwarding to ensure the branch's correct execution. Overall, we need another forwarding unit that takes the same inputs as the one that forwards to the EX stage. The new forwarding unit should control two Muxes placed right before the branch comparator. Each Mux selects between the value read from Registers, the ALU output from the EX/MEM pipeline register, and the data value from the MEM/WB pipeline register. The complexity of the new forwarding unit is the same as that of the existing one.

1) Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are determined in the EX stage, that there are no data hazards, and that no delay slots are used.

Each branch not correctly predicted by the always-taken predictor will cause 3 stall cycles, so we have:

a) 3 * (1 – 0.40) * 0.15 = 0.27

b) 3 * (1 – 0.60) * 0.10 = 0.12

2) For the “always not-taken” predictor

Each branch not correctly predicted by the always-not-taken predictor will cause 3 stall cycles, so we have:

a. 3 * (1 – 0.60) * 0.15 = 0.18

b. 3 * (1 – 0.40) * 0.10 = 0.18

3) For the 2-bit predictor

a) 3 * (1 – 0.80) * 0.15 = 0.090
b) 3 * (1 – 0.95) * 0.10 = 0.015

4) With the 2-bit predictor, what speed-up would be achieved if we could convert half of the branch instructions in a way that replaced a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.

Correctly predicted branches had a CPI of 1 and now they become ALU instructions whose CPI is also 1. Incorrectly predicted instructions that are converted also become ALU instructions with a CPI of 1, so we have:
23.4
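The 23.4 figure is not reproduced above. Assuming the extra-CPI values 0.090 and 0.015 from part 3, the arithmetic it contains would be roughly:

```python
# Part 4: half of all branches (predicted and mispredicted alike) become
# one ALU instruction each, so half of the misprediction stalls disappear
# while the instruction count stays the same.
for extra in (0.090, 0.015):
    old_cpi = 1.0 + extra
    new_cpi = 1.0 + extra / 2
    print(f"speedup = {old_cpi / new_cpi:.4f}")
```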

5) With the 2-bit predictor, what speed-up would be achieved if we converted half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.

23.5
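The 23.5 figure is not reproduced above. A sketch of the arithmetic, assuming the 15% and 10% branch fractions from part 1 and the extra-CPI values from part 3 (these pairings are my reading, not stated in the original):

```python
# Part 5: half of the branches become TWO ALU instructions each, so the
# instruction count grows by 0.5 * branch fraction while half of the
# misprediction stalls disappear.
for branches, extra in [(0.15, 0.090), (0.10, 0.015)]:
    old_time = 1.0 + extra                       # cycles per original instruction
    new_time = 1.0 + extra / 2 + 0.5 * branches  # one extra ALU op per converted branch
    print(f"speedup = {old_time / new_time:.3f}")
```

Both ratios come out below 1, i.e. under these assumptions replacing a branch with two ALU instructions is a net slowdown.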

6) Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict loop-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions?

23.6
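The 23.6 figure is not reproduced above. Assuming the overall 2-bit predictor accuracies of 80% and 95% used in part 3, the arithmetic would be:

```python
# Part 6: overall accuracy = 0.80 * 1.0 + 0.20 * x, where x is the accuracy
# on the hard-to-predict 20% of branches.  Solve: x = (overall - 0.80) / 0.20
for overall in (0.80, 0.95):
    x = (overall - 0.80) / 0.20
    print(f"overall {overall:.0%}: accuracy on hard branches = {x:.0%}")
```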