Pipelining and Cache Performance: Speedup Calculation and Optimization

Pipelining and Speedup Calculation

Given a non-pipelined architecture running at 2.5GHz that takes 5 cycles to finish an instruction. You want to make it pipelined with 5 stages. The increase in hardware forces you to run the machine at 2GHz.

The only stalls are caused by:

  • Memory stalls: (30% of total instructions) a stall of 50 cycles happens in 2% of the memory instructions.
  • Branch stalls: (20% of total instructions) a stall of 2 cycles happens in 20% of the branch instructions.

What is the speedup? Clearly show all work.

Equation

Hazard Detection in Code

For the following code, indicate and label the type of hazard (RAW, WAW, WAR):

square: ld.b R2,0(R3) ! load the byte at mem[R3] and place it in R2
add R3,R3,4     ! R3=R3+4, next word
mul R2,R2,R2     ! R2=R2*R2, square
st.b R2,0(R3)    ! store low order byte in R2 to mem[R3]
sub R2,R3,R4     ! R2=R3-R8
bnez R2, square  ! if R2 !=0 then goto square

Hazard Analysis:

square: ld.b R2,0(R3) ! 1-
add R3,R3,4     ! 2- WAR with 1
mul R2,R2,R2     ! 3- RAW with 1
st.b R2,0(R3)    ! 4- RAW with 2, 3
sub R2,R3,R4     ! 5- RAW with 2, WAR with 4
bnez R2, square  ! 6- RAW with 5

Computer Architecture Analysis

This question will be about the following computer:

  • 1.5 GHz, No Pipeline, instruction buffers that negate instruction load penalties 80% of the time
  • 32Kb L1 cache, 1ns access, 4-way associative, write-through, not write-allocate, 2% miss rate for Data and a miss rate of 1/60 for Instructions
  • 1Mb L2 cache, 12ns access, 4-way associative, write-back, write-allocate, 30% miss rate, 20% dirty
  • 1Gb RAM, 50ns access

Branch Delay Slot

(a) Does it make sense to have a branch delay slot since there is no pipeline? Why or why not?

No, a branch delay slot is only useful to reduce the penalty of address jumps in a pipelined machine. In a pipelined machine, when the branch occurs, two or three instructions following the branch are typically already loaded. These instructions must be flushed, reducing the efficiency of the pipeline. The branch delay slot reduces the penalty by one.

Ideal Speedup with Pipelining

(b) What would the ideal speedup be for this computer if a 5-stage pipeline were installed?

Ideal speedup equals the number of stages in the pipeline, thus 5.

Effective Access Time for Instructions

(c) What is the effective access time for loading instructions?

Equation

CPI Penalty Due to Instruction Loading

(d) What is the CPI penalty to each instruction due to loading instructions?

Equation

Performance Optimization

(e) What would you recommend changing to make this faster?

Two-thirds of the penalty is due to the access time of level one cache. I would look at trying to reduce or negate this first.

Cache Line Capacity

How many 32-bit integers can be stored in a 16-byte cache line?

Let 1 byte = 8 bits

16 bytes = 128 bits

4 * 32 bits = 128 bits

Therefore, four 32-bit integers can be stored in a 16-byte cache line.

Locality Analysis

Temporal Locality

References to which variable exhibit temporal locality?

Definition: Temporal locality refers to the tendency for a processor to access memory locations that have been used recently. For example, when an iteration loop is executed, the processor executes the same set of instructions repeatedly.

Example A:

for (I=0; I<...; I++) {
  for (J=0; J<...; J++) {
    A[J][3] = B[3][0] + A[3][I];
  }
}

Temporal Locality: I, B[3][0]

Example B:

for (j=0; j<...; j++) {
  for (I=0; I<...; I++) {
    A[1][j] = B[3][0] + A[3][I];
  }
}

Temporal Locality: j, I, B[3][0]

Example C:

for (I=1; I<=800; I++) {
  for (J=1; J<=8; J++) {
    A(I,J) = B(J,0) + A(J,I);
  }
}

Temporal Locality: I, J, B(J,0)

Example D:

for (J=1; J<=8; J++) {
  for (I=1; I<=8000; I++) {
    A(I,J) = B(J,0) + A(J,I);
  }
}

Temporal Locality: I, J, B(J,0)

Spatial Locality

References to which variable exhibit spatial locality?

 
A)for(I=0;I

for(J=0;j

A[I][J]=B[j][0]+A[J][I];

spatial locality:

this is refers to the tedency of execution to involve a number of memory locations that are clustered this refleds the tedency of a processor to access instructions seqnentially.

spotal location also reflects the tendency of a program to access data locations sequentially suden as when processing a table of data. therefore in the store code the spctial locality  is spectal localty:A[I][J]

References to which variable exhibit temporal locality?

for(I=1:800)

for J=1.8

A(I,J)=B(J,0)+A(J,I);

end

end

Temporal Locality

This is reters to the tedency for a processor to acces memory locations that have been used recently for example, when an itevation loop is executed, the processor executes the same set of instruction repetealy

Temporal locality: I,J,B(3,0)

 C)

for J=1;8

for (I=1:8000)

4(I,J)=B(J,0)+A(J,I);

end

end

D) tempoval Locality:I,J

References to which variables exhibit spatial locality?

step 1:

a) for (I=1:8000)

for J=1:8

A(I,J)=B(J,0)+A(J,I)

end

end

step 2

spotied locality

this is reters to the tedeuncy of executing to involve a number of memory locations that ance  clustered. this reflects the tedency of a processor to access instructions sequenually spatal location  also vectors retlects the tendency of a program to access data locating sequentially, such as when processing a table of data. therefore in the above code the spotial locality b   

spetal locality=A(I,J),A(J,I),B(J,O)      

step3

for J=1:8

for(I=1,8000)

A(I,J)=B(J,0)+A(J,I)

end

end