Pipelining and Cache Performance: Speedup Calculation and Optimization
Pipelining and Speedup Calculation
Given a non-pipelined architecture running at 2.5GHz that takes 5 cycles to finish an instruction. You want to make it pipelined with 5 stages. The increase in hardware forces you to run the machine at 2GHz.
The only stalls are caused by:
- Memory stalls: (30% of total instructions) a stall of 50 cycles happens in 2% of the memory instructions.
- Branch stalls: (20% of total instructions) a stall of 2 cycles happens in 20% of the branch instructions.
What is the speedup? Clearly show all work.
Hazard Detection in Code
For the following code, indicate and label the type of hazard (RAW, WAW, WAR):
square: ld.b R2,0(R3) ! load the byte at mem[R3] and place it in R2
add R3,R3,4 ! R3=R3+4, next word
mul R2,R2,R2 ! R2=R2*R2, square
st.b R2,0(R3) ! store low order byte in R2 to mem[R3]
sub R2,R3,R4 ! R2=R3-R8
bnez R2, square ! if R2 !=0 then goto square
Hazard Analysis:
square: ld.b R2,0(R3) ! 1-
add R3,R3,4 ! 2- WAR with 1
mul R2,R2,R2 ! 3- RAW with 1
st.b R2,0(R3) ! 4- RAW with 2, 3
sub R2,R3,R4 ! 5- RAW with 2, WAR with 4
bnez R2, square ! 6- RAW with 5
Computer Architecture Analysis
This question will be about the following computer:
- 1.5 GHz, No Pipeline, instruction buffers that negate instruction load penalties 80% of the time
- 32Kb L1 cache, 1ns access, 4-way associative, write-through, not write-allocate, 2% miss rate for Data and a miss rate of 1/60 for Instructions
- 1Mb L2 cache, 12ns access, 4-way associative, write-back, write-allocate, 30% miss rate, 20% dirty
- 1Gb RAM, 50ns access
Branch Delay Slot
(a) Does it make sense to have a branch delay slot since there is no pipeline? Why or why not?
No, a branch delay slot is only useful to reduce the penalty of address jumps in a pipelined machine. In a pipelined machine, when the branch occurs, two or three instructions following the branch are typically already loaded. These instructions must be flushed, reducing the efficiency of the pipeline. The branch delay slot reduces the penalty by one.
Ideal Speedup with Pipelining
(b) What would the ideal speedup be for this computer if a 5-stage pipeline were installed?
Ideal speedup equals the number of stages in the pipeline, thus 5.
Effective Access Time for Instructions
(c) What is the effective access time for loading instructions?
CPI Penalty Due to Instruction Loading
(d) What is the CPI penalty to each instruction due to loading instructions?
Performance Optimization
(e) What would you recommend changing to make this faster?
Two-thirds of the penalty is due to the access time of level one cache. I would look at trying to reduce or negate this first.
Cache Line Capacity
How many 32-bit integers can be stored in a 16-byte cache line?
Let 1 byte = 8 bits
16 bytes = 128 bits
4 * 32 bits = 128 bits
Therefore, four 32-bit integers can be stored in a 16-byte cache line.
Locality Analysis
Temporal Locality
References to which variable exhibit temporal locality?
Definition: Temporal locality refers to the tendency for a processor to access memory locations that have been used recently. For example, when an iteration loop is executed, the processor executes the same set of instructions repeatedly.
Example A:
for (I=0; I<...; I++) {
for (J=0; J<...; J++) {
A[J][3] = B[3][0] + A[3][I];
}
}
Temporal Locality: I, B[3][0]
Example B:
for (j=0; j<...; j++) {
for (I=0; I<...; I++) {
A[1][j] = B[3][0] + A[3][I];
}
}
Temporal Locality: j, I, B[3][0]
Example C:
for (I=1; I<=800; I++) {
for (J=1; J<=8; J++) {
A(I,J) = B(J,0) + A(J,I);
}
}
Temporal Locality: I, J, B(J,0)
Example D:
for (J=1; J<=8; J++) {
for (I=1; I<=8000; I++) {
A(I,J) = B(J,0) + A(J,I);
}
}
Temporal Locality: I, J, B(J,0)
Spatial Locality
References to which variable exhibit spatial locality?
A)for(I=0;I
for(J=0;j
A[I][J]=B[j][0]+A[J][I];
spatial locality:
this is refers to the tedency of execution to involve a number of memory locations that are clustered this refleds the tedency of a processor to access instructions seqnentially.
spotal location also reflects the tendency of a program to access data locations sequentially suden as when processing a table of data. therefore in the store code the spctial locality is spectal localty:A[I][J]
References to which variable exhibit temporal locality?
for(I=1:800)
for J=1.8
A(I,J)=B(J,0)+A(J,I);
end
end
Temporal Locality
This is reters to the tedency for a processor to acces memory locations that have been used recently for example, when an itevation loop is executed, the processor executes the same set of instruction repetealy
Temporal locality: I,J,B(3,0)
C)
for J=1;8
for (I=1:8000)
4(I,J)=B(J,0)+A(J,I);
end
end
D) tempoval Locality:I,J
References to which variables exhibit spatial locality?
step 1:
a) for (I=1:8000)
for J=1:8
A(I,J)=B(J,0)+A(J,I)
end
end
step 2
spotied locality
this is reters to the tedeuncy of executing to involve a number of memory locations that ance clustered. this reflects the tedency of a processor to access instructions sequenually spatal location also vectors retlects the tendency of a program to access data locating sequentially, such as when processing a table of data. therefore in the above code the spotial locality b
spetal locality=A(I,J),A(J,I),B(J,O)
step3
for J=1:8
for(I=1,8000)
A(I,J)=B(J,0)+A(J,I)
end
end