RISC-V Single-Cycle and Pipelined Processor Design and Analysis

NYU Tandon School of Engineering

Fall 2019, ECE 6913 – Section C

Instructor: Azeez Bhavnagarwala, email: ajb20@nyu.edu

TA: Likith Purushotham, email: lp2267@nyu.edu

Homework Assignment 3 [released Saturday October 26th 2019] [due* Monday November 4th 2019, before 11:55 PM]

You are allowed to discuss HW assignments only with other colleagues taking the class. You are not allowed to share your solutions with other colleagues in the class. Please feel free to reach out to the TA or to the Instructor during office hours or by appointment if you need any help with the HW.

Please enter your responses in this Word document after you download it from NYU Classes. Please use the NYU Classes portal to send in your completed HW. If you are having difficulty doing this before the deadline, please convert it to PDF when you are done and email it to ajb20@nyu.edu before 11:55 PM on Monday November 4th 2019. Circumstances may prevent you from doing so – please let the instructor know about these.

1. Consider the following instruction: and rd, rs1, rs2

Interpretation: Reg[rd] = Reg[rs1] AND Reg[rs2]

Figure 4.19 (next page) shows the operation of the datapath for an Rtype instruction, such as add x1, x2, x3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information:

  1. The instruction is fetched, and the PC is incremented.
  2. Two registers, x2 and x3, are read from the register file; also, the main control unit computes the setting of the control lines during this step.
  3. The ALU operates on the data read from the register file, using portions of the opcode to generate the ALU function.
  4. The result from the ALU is written into the destination register (x1) in the register file.

1.1 What are the values of control signals generated by the control in Figure 4.10 (4:19) of RISCV text for this instruction?

Fig 4.18 of text shows the values of the control signal for R-type instructions as well:

1.2 Which resources (blocks) perform a useful function for this instruction?

All units except Data memory, sign extension and Branch Adder

1.3 Which resources (blocks) produce no output for this instruction? Whichh resources produce output that is not used?

2. Consider the following instruction mix:

2.1 What fraction of all instructions use data memory?

Load and Store Instructions only: 35%

2.2 What fraction of all instructions use instruction memory?

All instructions – 100% (Obviously)

2.3 What fraction of all instructions use the sign extend?

I-Type, Load, Store, Branch, Jump : 76%

2.4 What is the sign extend doing during cycles in which its output is not needed?

It could be providing extensions of the immediate fields or clock gating them during this time

3. In this exercise, we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor fetches the following instruction word: 0x00c6ba23.

00c6ba2316 = 0000 0000 1100 0110 1011 1010 0010 0011

The opcode in the least significant 7 bits (red) is ‘35’ suggesting the given instruction word is store doubleword or sd with register x12 in the (rs2) destination register field (brown) and register x13 in the (rs1) base address register field (blue)

so, 00c6ba2316 = sd x12, 20(x13)

3.1 What are the values of the ALU control unit’s inputs for this instruction?

ALU Control line bits are 0010 [from table in fig 4.12 of text]

Since the ‘add’ operation is executed in the ALU for a sd instruction (add offset in immediate 5 bit field (20) to base address (in rs1), the ALU Control lines take the value 0010

3.2 What is the new PC address after this instruction is executed? Highlight the path through which this value is determined.

Since the sd instruction does not take any branch, the new PC address after this instruction is executed is PC+4

3.3 For each mux, show the values of its inputs and outputs during the execution of this instruction. List values that are register outputs at Reg [xn].

The 3 mux units shown in Fig 4.17 are listed in the table below:

ALUsrc, PCsrc and MemtoReg For the sd instruction, ALUsrc is ‘1’ (in Table 4.18 reproduced below) and takes the input to this mux from the sign extend unit (0x0..014) – instead of register rs2 (x12 in this problem), to add to the contents of the base register rs1 in the ALU to determine the address in memory to store contents of register rs2. The decimal number ‘20’ corresponding to the sign extended offset (in hex) = 0x00..014. Table 4.18 from text – providing 8 control signals from the Control unit based on the 7-bit opcode of the instruction Table 4.16 from text – defining 6 of 8 control signals from the Control unit (not including the ALUop bits). PCsrc is ‘0’ since this mux is asserted only for branch instructions MemtoReg is value is irrelevant (don’t care). In table below it is asserted ‘1’ this choice picks data from Data Memory to send to the register file for Write back. Data Memory is in Write mode (Mem Write =1 in table 4.18 for a sd instruction) so the data at output buffer of Data memory is undefined. This data is routed to write port of Register file array. But Reg Write control bit in Table 4.18 is ‘0’ so undefined data not used in Register File.

Mux Control input input 1 input 2 output ALUsrc 1 rs2 0x00..014 0x0..014 PCsrc 0 PC+4 offs sll2 PC + 4 MemtoReg 1 rs1+0x..14 unknown unknown 3.4 What are the input values for the ALU and the two add units? From Fig 4.17 below, ALU inputs are rs1 (Reg[x13]) and 0x00..014 from the mux controlled by ALUsrc Branch adder inputs: PC and 0x0..014 shifted by 1 (0x0..28) PC adder inputs: PC and 4 3.5 What are the values of all inputs for the registers unit? Read register 1 (rs1) input port has the 5-bit instruction field [19-15] identifying the register x13, Read register 2 (rs2) input port has the 5-bit instruction field [24 – 20] identifying the register x11. Write register input port has undefined/irrelevant bits since Reg Write control signal is disabled for the sd instruction. 4. Section 4.4 (Simple, single cycle implementation scheme) does not discuss I-type instructions like addi or andi. 4.1 What additional logic blocks, if any, are needed to add I-type instructions to the CPU shown in Figure 4.21 (reproduced below)? Add any necessary logic blocks to the Figure below and explain their purpose. 4.2 List the values of the signals generated by the control unit for addi. Explain the reasoning for any “don’t care” control signals. 5. Problems in this exercise assume that the logic blocks used to implement a processor’s datapath have the following latencies: “Register read” is the time needed after the rising clock edge for the new register value to appear on the output. This value applies to the PC only. “Register setup” is the amount of time a register’s data input must be stable before the rising edge of the clock. This value applies to both the PC and Register File 5.1 What is the latency of an R-type instruction (i.e., how long must the clock period be to ensure that this instruction works correctly)? the R-type and I-type instructions both have identical delays: 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – there is no number given for sign extension delay, so we can assume it is less than the register file delay of 150 ps] 5. ALU delay: 200 ps 6. Mux for WB to Register File of result from ALU: 25 ps 7. Write back setup time for Register File registers: 20 ps total = 700ps 5.2 What is the latency of ld? (Check your answer carefully. Many students place extra muxes on the critical path.) For ld , item 6 below is new: 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – number given for sign extension delay of 50 ps is less than the 150 ps for register file read access, so we can assume that mux control signal is designed to fire only when both inputs are ready] 5. ALU delay: 200 ps 6. Read data from Data Mem, whose address provided by ALU: 250 ps 7. Mux for WB to Register File of result from Data Mem: 25 ps 8. Write back setup time for Register File registers: 20 ps total ld = 950ps 5.3 What is the latency of sd? (Check your answer carefully. Many students place extra muxes on the critical path.) for sd 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – there is no number given for sign extension delay, so we can assume it is less than the register file delay of 150 ps] 5. ALU delay: 200 ps 6. Write Data from R2 to Data Mem: 250 ps Total for sd: 905 ps 5.4 What is the latency of beq? 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – there is no number given for sign extension delay, so we can assume it is less than the register file delay of 150 ps] 5. ALU delay: 200 ps 6. Single gate delay AND of ‘zero’ output from ALU and ‘Branch’ control output from Control unit: 5ps 7. Mux for Branch address to PC: 25 ps 8. Write back setup time for PC register: 20 ps 5.5 What is the latency of an I-type instruction? [same as 5.1 for R-type] 5.6 What is the minimum clock period for this CPU? 950 ps { longest execution time instruction: ld } 6. Suppose you could build a CPU where the clock cycle time was different for each instruction. What would the speedup of this new CPU be over the CPU presented in Figure 4.21 from RISCV text (reproduced in Problem 4) given the instruction mix below? Assuming only the above 4 instructions used with the given frequencies of their occurrence. cycle time = TR-typexfR-type + Tldxfld + Tsdxfsd + Tbeqxfbeq = 700psx0.52 + 950psx0.25 + 905psx0.11 + 705psx0.12 = 785.6ps Speed-up = 950ps / 785.6 ps = 1.21 7. Consider the addition of a multiplier to the CPU shown in Figure 4.21 (reproduced in Problem 4). This addition will add 300 ps to the latency of the ALU, but will reduce the number of instructions by 5% (because there will no longer be a need to emulate the multiply instruction). 7.1 What is the clock cycle time with and without this improvement? Previous problem: 950ps, with multiplier added: 1250ps 7.2 What is the speedup achieved by adding this improvement? time to run new CPU with multiplier reduces by 5%, so (assuming a CPI = 1) execution time from fewer instructions = 0.95 x 1250ps. Speedup = 950ps / (0.95x 1250ps) = 0.8 ( version with multiplier slows down 20%) 7.3 What is the slowest the new ALU can be and still result in improved performance? Even if the new CPU has 5% fewer instructions, it can improve the cycle time by at most 50ps. However, this is insufficient to make up for the 300ps penalty by adding a multiplier unit. So, the new ALU cannot be any slower than it is to result in improved performance of the new CPU 8. ld is the instruction with the longest latency on the CPU from Section 4.4. If we modified ld and sd so that there was no offset (i.e., the address to be loaded from/stored to must be calculated and placed in rs1 before calling ld/sd), then no instruction would use both the ALU and Data memory. This would allow us to reduce the clock cycle time. However, it would also increase the number of instructions, because many ld and sd instructions would need to be replaced with ld/add or sd/add combinations. 8.1 What would the new clock cycle time be? For the new ld instruction , items 5 and 6 below proceed in parallel: with the faster of these (ALU delay) removed from the critical path 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – number given for sign extension delay of 50 ps is less than the 150 ps for register file read access, so we can assume that mux control signal is designed to fire only when both inputs are ready] 5. ALU delay: 200 ps 6. Read data from Data Mem, whose address provided by ALU: 250 ps 7. Mux for WB to Register File of result from Data Mem: 25 ps 8. Write back setup time for Register File registers: 20 ps total ld = 750ps For the new sd instruction , items 5 and 6 below proceed in parallel: with the faster of these (ALU delay) removed from the critical path 1. PC Register Read delay following rising edge of clock: 30 ps 2. Instruction Memory: 250ps 3. Read Register File: 150 ps 4. Mux to pick data from either R2 or sign extended bit data from immediate field of instruction = 25 ps [note – there is no number given for sign extension delay, so we can assume it is less than the register file delay of 150 ps] 5. ALU delay: 200 ps 6. Write Data from R2 to Data Mem: 250 ps Total for sd: 705 ps 8.2 Would a program with the instruction mix presented in Problem 5 run faster or slower on this new CPU? By how much? (For simplicity, assume every ld and sd instruction is replaced with a sequence of two instructions.) Since ld/sd instructions are 36% of all instructions in problem 2,3 the number of instructions would increase by 1.36x. So the new cycle time of 750ps would multiply with 1.36n to give 1020n ps. Speedup = 950 / 1020 = 0.93 or a reduction in speed by 7% 8.3 What is the primary factor that influences whether a program will run faster or slower on the new CPU? Since load/store instructions are the slowest instructions, the frequency of their occurrence will impact how much of an improvement the new CPU sees. If the frequency of their occurrence is very small, then the penalty of higher instruction count is small while the improvement in cycle time is much bigger. for example if only 10% of the program is load/store then then 750ps x 1.1 = 825ps and the speedup is 950/825 = 15% 8.4 Do you consider the original CPU (as shown in Figure 4.21/Problem 4) a better overall design; or do you consider the new CPU a better overall design? Why? see above response – depends on how often the ld/sd instruction occur. The improvement to cycle time is 22% by eliminating the ALU op for ld/sd. however, of ld/sd occur more than 22% of instructions then it is not a better CPU 9. In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies: Also, assume that instructions executed by the processor are broken down as follows: 9.1 What is the clock cycle time in a pipelined and non-pipelined processor? pipelined 350ps (slowest stage) non-pipelined 1250ps 9.2 What is the total latency of an ld instruction in a pipelined and non-pipelined processor? pipelined and non-pipelined = 1250ps 9.3 If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor? the slowest stage: 350ps. Now pipelined version is 300ps cycle time 9.4 Assuming there are no stalls or hazards, what is the utilization of the data memory? load and store = 35% ( the only instructions that access data memory) 9.5 Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit? 65% – only ALU and load utilize write back to register file write port 10. What is the minimum number of cycles needed to completely execute n instructions on a CPU with a k stage pipeline? Justify your formula For a k stage pipeline, the first instruction reaches the end of the pipeline in k cycles. After this, the remaining n-1 instructions execute at 1 cycle per instruction over the next n-1 cycles so, n instructions take k + n -1 cycles to execute in a k stage pipeline 11. Assume that x11 is initialized to 11 and x12 is initialized to 22. Suppose you executed the code below on a version of the pipeline from Section 4.5 that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). What would the final values of registers x13 and x14 be? addix11, x12, 5 addx13, x11, x12 addix14, x11, 15 x13 = 33, x14 = 26 12. Assume that x11 is initialized to 11 and x12 is initialized to 22. Suppose you executed the code below on a version of the pipeline from Section 4.5 (of RISCV text) that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). What would the final values of register x15 be? Assume the register file is written at the beginning of the cycle and read at the end of a cycle. Therefore, an ID stage will return the results of a WB state occurring during the same cycle. See Section 4.7 and Figure 4.51 for details. addix11, x12, 5 #x11 does not update to 27 (=22+5) until the add instruction below addx13, x11, x12 #x13 = 33 addix14, x11, 15 #x11 = 26 addx15, x11, x11 #x15 = 54 13. Add NOP instructions to the code below so that it will run correctly on a pipeline that does not handle data hazards. addix11, x12, 5 NOP NOP addx13, x11, x12 addix14, x11, 15 NOP