

#### Introduction to RISC-V

#### Jielun Tan, James Connolly, Ian Wrzesinski

Last updated 02/2024





- What is RISC-V
- Why RISC-V
- ISA overview
- Software environment
- Project 3 stuff





- RISC-V (pronounced "risk-five") is an open, free ISA enabling a new era of processor innovation through open standard collaboration. Born in academia and research, RISC-V ISA delivers a new level of free, extensible software and hardware freedom on architecture, paving the way for the next 50 years of computing design and innovation. (RISC-V Foundation)
  - This is just marketing talk obviously, don't get carried away...
- It originated at UC Berkeley, but it now has its own foundation with a large number of contributors
- Wide adoption within industry
  - Nvidia, Alibaba, Western Digital, etc. all have RV cores in their products
  - A fruit company, a letters company and others have developments
  - Many startups: SiFive, Esperanto, Tenstorrent, etc.





- Why not OpenRISC?
  - OpenRISC has condition codes and branch delay slots, which complicate higher-performance implementations; it also uses a fixed 32-bit encoding with 16-bit immediates, has no support for the 2008 revision of the IEEE 754 floating-point standard, etc.
  - The reality is no one uses it so yeah
- MIPS although now open sourced...
  - MIPS now makes RISC-V products
  - MIPS is also very convoluted
- License issues for Arm...
- Only academia still deals with Alpha
  - Press F to pay respect to DEC, Compaq and HP







- A completely open ISA that is freely available to academia and industry
- A real ISA suitable for direct native hardware implementation, not just simulation or binary translation
- An ISA that avoids "over-architecting" for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom, ASIC, FPGA), but which allows efficient implementation in any of these
- An ISA separated into a small base integer ISA, usable by itself as a base for customized accelerators or for educational purposes, and optional standard extensions, to support general-purpose software development
  - Most ISAs can be used for educational purposes if we just take a subset, however we'd have to build custom software around that
  - With RV, that support is built in





- Lets us explore more layers of the computing stack, mainly compilers and systems
- Can arbitrarily generate test cases, since we can just write in C now!
  - Easier for you to test
  - Easier for the staff to shuffle around test cases
  - Easier to generate large test cases that can actually benefit from additional features and properly reward those who worked on extra features





- Base ISA + many extensions, including privileged modes
- 32, 64 and 128-bit address space
  - We only use 32-bit for now, the other two only add a few instructions
- 32 integer registers
- Byte level addressing for memory, little endian
- Instructions must be aligned to 32-bit boundaries (unless they are compressed)
- No condition codes or carry out bits to detect overflow
  - Intentional, these can be achieved in software
- Comparisons are built in for branches
  - e.g. beq x1, x2, offset
- Does support misaligned memory access by default
  - You don't have to worry about this, all tests should be compiled with strict alignment
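
Since there are no condition codes, an overflow check becomes an ordinary compare-and-branch sequence. The sketch below models 32-bit registers in Python to show the idea. The helper names are made up for illustration, and the assembly in the comments is the kind of sequence the ISA manual suggests, not something specific to this project.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a, b):
    """What `add` produces: the low 32 bits of a + b, with no flags."""
    return (a + b) & MASK32


def unsigned_add_overflows(a, b):
    # add  t0, t1, t2
    # bltu t0, t1, overflow     (sum wrapped around if it is below an operand)
    return add32(a, b) < (a & MASK32)


def signed_add_overflows(a, b):
    # add  t0, t1, t2
    # slti t3, t2, 0            (is b negative?)
    # slt  t4, t0, t1           (is sum < a?)
    # bne  t3, t4, overflow     (the two must agree when nothing overflowed)
    total = to_signed32(add32(a, b))
    b_negative = to_signed32(b) < 0
    sum_below_a = total < to_signed32(a)
    return b_negative != sum_below_a
```

The point is that the check costs a few extra instructions only where software actually wants it, instead of every add paying for flag logic.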



## ISA Overview - Instruction Formats

- 6 different encoding formats for instructions
- A looooooot of pseudoinstructions
  - You can read about all of them in the specification (highly recommended; the base integer and multiplication extensions aren't long at all)

| Format | Fields from bit 31 down to bit 0 (instruction bit range in parentheses) |
|--------|--------------------------------------------------------------------------|
| R-type | funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| I-type | imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| S-type | imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0) |
| B-type | imm[12] (31), imm[10:5] (30:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:1] (11:8), imm[11] (7), opcode (6:0) |
| U-type | imm[31:12] (31:12), rd (11:7), opcode (6:0) |
| J-type | imm[20] (31), imm[10:1] (30:21), imm[11] (20), imm[19:12] (19:12), rd (11:7), opcode (6:0) |

Figure 2.3: RISC-V base instruction formats showing immediate variants.
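
To make the formats concrete, here is a minimal Python sketch that pulls the fields and immediates out of a 32-bit instruction word. The function names are hypothetical. The bit positions follow the base instruction formats, and the words in the usage note are standard RV32I encodings.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Sign-extend a width-bit value to a Python int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_fields(word):
    """Register and function fields sit in the same place in every format."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
        "funct7": bits(word, 31, 25),
    }


def imm_i(word):
    return sign_extend(bits(word, 31, 20), 12)


def imm_s(word):
    return sign_extend((bits(word, 31, 25) << 5) | bits(word, 11, 7), 12)


def imm_b(word):
    value = (
        (bits(word, 31, 31) << 12)
        | (bits(word, 7, 7) << 11)
        | (bits(word, 30, 25) << 5)
        | (bits(word, 11, 8) << 1)
    )
    return sign_extend(value, 13)


def imm_u(word):
    return bits(word, 31, 12) << 12


def imm_j(word):
    value = (
        (bits(word, 31, 31) << 20)
        | (bits(word, 19, 12) << 12)
        | (bits(word, 20, 20) << 11)
        | (bits(word, 30, 21) << 1)
    )
    return sign_extend(value, 21)
```

For example, `0xff010113` is `addi sp, sp, -16`, so `imm_i(0xff010113)` gives -16 with `rs1` and `rd` both equal to 2. Similarly, `0x00812423` is `sw s0, 8(sp)`, so `imm_s` gives 8. Note that the sign bit of every immediate is always instruction bit 31, which keeps sign extension cheap in hardware.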





- Also a list of Control and Status Registers (CSRs)
  - Many are important if interrupt support is needed
  - You can choose to implement them in the final project
  - Here are some examples, you can read more about them in the privileged spec

| Number | Name     | Description                              |
|--------|----------|------------------------------------------|
| 0x000  | ustatus  | User status register.                    |
| 0x004  | uie      | User interrupt-enable register.          |
| 0x005  | utvec    | User trap handler base address.          |
| 0x040  | uscratch | Scratch register for user trap handlers. |
| 0x041  | uepc     | User exception program counter.          |
| 0x042  | ucause   | User trap cause.                         |
| 0x043  | utval    | User bad address or instruction.         |
| 0x044  | uip      | User interrupt pending.                  |



## ISA Overview - More Extensions

- V: vector extension, if staff in the future want to spice things up
- A: atomic extension; will be partially used to implement locks in the future
- F, D, Q, L: floating-point extensions, can be supported for people's own interest
- C: compressed extension to increase code density
- E: for embedded systems; reduced number of registers (only 16), can be combined with C to save ROM
- T: RISC-V has plans to support transactional memory (omegalul)
- Z series: basically all future extensions, since they ran out of letters











Assembly vs. High-level Language







- Why do I remotely even care about software in a hardware class
  - Believe me, it's important
  - Architecture is the bridge between the two (insert preaching)
  - Software support dictates hardware adoption
    - There'd be nothing to run on hardware without software just saying...
- For RISC-V, we will have both C programs and assembly programs to test
- At the same time, you also need to have a grasp of how C works at a very low level
  - It doesn't affect your implementation, but knowing this will make your life easier
  - You will also learn A LOT





- There's a full suite of GNU tools for RISC-V
  - gcc: compiler
  - as: assembler
  - ld: linker
  - objdump: disassembler
  - objcopy: copies one object file to another, can change link formats
  - g++: don't use this, no linker support
  - A lot more that you can explore yourself...





- What happens when you compile a program?
  - You generate an ELF, not Legolas though
  - But rather Executable and Linkable Format
- What is actually inside an ELF?
  - ELF/program header
    - Usually tells what OS it's for
    - Where in memory to put the program in
  - .text: the actual instructions of the program
  - .rodata: read-only data, but we don't enforce that
  - .data: modifiable program data
  - Section header table: where's what
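
If you want to peek at the header yourself, the sketch below reads the first few fields of a 32-bit little-endian ELF file. It is a minimal illustration: `parse_elf32_header` is a hypothetical helper, it ignores everything past the entry point, and it rejects 64-bit or big-endian files. The offsets and the RISC-V machine number (243) come from the ELF specification.

```python
import struct

EM_RISCV = 243  # e_machine value assigned to RISC-V


def parse_elf32_header(data):
    """Decode the start of an ELF32 little-endian header from raw bytes."""
    if len(data) < 28 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = data[4]  # 1 = 32-bit objects, 2 = 64-bit objects
    ei_data = data[5]   # 1 = little endian, 2 = big endian
    if ei_class != 1 or ei_data != 1:
        raise ValueError("this sketch only handles 32-bit little-endian ELF")
    # e_type, e_machine (16 bits each), then e_version and e_entry (32 bits each)
    e_type, e_machine, e_version, e_entry = struct.unpack_from("<HHII", data, 16)
    return {
        "type": e_type,
        "machine": e_machine,
        "is_riscv": e_machine == EM_RISCV,
        "entry": e_entry,
    }
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them in; which path to use depends on where your build leaves the ELF.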







- Flashback to 370 or whatever computer organization class you had
- What does the memory space for a program look like?
- Stack for local (automatic) variables; it grows downward, so the stack pointer decrements
- Heap for dynamic memory; it grows upward, so its pointer increments







- Example of program space allocation for Arm processors
  - The linker allocates the memory space
- In general, memory addresses around 0x0 are precious
  - Some peripherals on the serial buses can only talk to limited addresses

- In our case, the text section starts at 0x0 to simplify loading
- The stack pointer starts at 0x10000
  - The end of the testbench memory space
  - This means any program that you write, text+data+stack < 64KiB
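
The 64 KiB budget is just arithmetic on those two addresses. A minimal sketch, assuming the image (text plus data) is loaded contiguously from 0x0; the function names are hypothetical:

```python
TEXT_BASE = 0x0      # from the slide: .text starts at 0x0
STACK_TOP = 0x10000  # from the slide: initial stack pointer


def stack_headroom(image_bytes):
    """Bytes the stack can grow before it runs into the loaded image.

    image_bytes is the combined size of text and data (an assumption of
    this sketch: they are laid out back to back starting at TEXT_BASE).
    """
    return STACK_TOP - (TEXT_BASE + image_bytes)


def fits_in_memory(text_bytes, data_bytes, stack_bytes):
    """The slide's rule: text + data + stack must stay under 64 KiB."""
    return text_bytes + data_bytes + stack_bytes < STACK_TOP - TEXT_BASE
```

Nothing stops a deep recursion from walking the stack down into your data, so this is worth a quick sanity check when a C test misbehaves.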





## Software Environment - Program Space

C Source Code

```c
extern void exit();
#include <stdlib.h>

int** avg_pooling(int image [][4]) {
    int down_sample [2][2];
    int avg = 0;
    for (int i = 0; i < 2; ++i) {
        down_sample[0][i] = (image[0][i*2] + image[1][i*2] + image[0][i*2+1] + image[1][i*2+1]) >> 2;
        down_sample[1][i] = (image[3][i*2] + image[4][i*2] + image[3][i*2+1] + image[4][i*2+1]) >> 2;
    }
    return down_sample;
}

int main() {
    int image [4][4];
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            image[i][j] = rand();
        }
    }
    int** worse_image;
    worse_image = avg_pooling(image);
    return 0;
}
```

#### RISC-V Assembly (dump files) and Machine Code

```
Disassembly of section .text:

000000c8 <avg_pooling>:
    for (int i = 0; i < 2; ++i) {
        down_sample[0][i] = (image[0][i*2] + image[1][i*2] + image[0][i*2+1] + image[1][i*2+1]) >> 2;
        down_sample[1][i] = (image[3][i*2] + image[4][i*2] + image[3][i*2+1] + image[4][i*2+1]) >> 2;
    }
    return down_sample;
}
  c8:   00000513    li      a0,0
  cc:   00008067    ret

000000d0 <main>:
int main() {
  d0:   ff010113    addi    sp,sp,-16
  d4:   00812423    sw      s0,8(sp)
  d8:   00112623    sw      ra,12(sp)
  dc:   00400413    li      s0,4
    int image [4][4];
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            image[i][j] = rand();
  e0:   00000097    auipc   ra,0x0
  e4:   054080e7    jalr    84(ra) # 134 <rand>
  e8:   00000097    auipc   ra,0x0
  ec:   04c080e7    jalr    76(ra) # 134 <rand>
  f0:   00000097    auipc   ra,0x0
  f4:   044080e7    jalr    68(ra) # 134 <rand>
  f8:   fff40413    addi    s0,s0,-1
  fc:   00000097    auipc   ra,0x0
 100:   038080e7    jalr    56(ra) # 134 <rand>
    for (int i = 0; i < 4; ++i) {
 104:   fc041ee3    bnez    s0,e0 <main+0x10>
    }
    int** worse_image;
    worse_image = avg_pooling(image);
    return 0;
}
 108:   00c12083    lw      ra,12(sp)
 10c:   00812403    lw      s0,8(sp)
 110:   00000513    li      a0,0
 114:   01010113    addi    sp,sp,16
 118:   00008067    ret

0000011c <srand>:
 11c:   00000797    auipc   a5,0x0
 120:   52478793    addi    a5,a5,1316 # 640 <_impure_ptr>
 124:   0007a783    lw      a5,0(a5)
 128:   0aa7a423    sw      a0,168(a5)
 12c:   0a07a623    sw      zero,172(a5)
 130:   00008067    ret

Disassembly of section .data:

00000200 <impure_data>:
 200:   0000
 202:   0000
 204:   04ec
 206:   0000
 208:   0554
 20a:   0000
 20c:   05bc
        ...
 2a6:   0000
 2a8:   0001
 2aa:   0000
 2ac:   0000
 2ae:   0000
 2b0:   330e
 2b2:   abcd
 2b4:   1234
 2b6:   e66d
 2b8:   deec
 2ba:   0005
 2bc:   0000000b
        ...

Disassembly of section .srodata:

000001e0 <_global_impure_ptr>:
 1e0:   0200
        ...

Disassembly of section .sdata:

00000640 <_impure_ptr>:
 640:   0200
```
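
The `auipc ra,0x0` / `jalr 84(ra)` pairs in the dump are how the compiler emits a PC-relative call. The sketch below recomputes the call target from the two instruction words, which is a handy cross-check when reading dump files. The function name is hypothetical; the address arithmetic follows the base ISA definition of the two instructions.

```python
def sign_extend(value, width):
    """Sign-extend a width-bit value to a Python int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def auipc_jalr_target(auipc_pc, auipc_word, jalr_word):
    """Address reached by an auipc at auipc_pc followed by a jalr on its result."""
    upper = auipc_word & 0xFFFFF000                   # U-type immediate, already shifted
    base = (auipc_pc + upper) & 0xFFFFFFFF            # value auipc writes to rd
    offset = sign_extend((jalr_word >> 20) & 0xFFF, 12)  # I-type immediate
    return (base + offset) & 0xFFFFFFFE               # jalr clears the lowest bit
```

With the first pair from `main`, `auipc_jalr_target(0xe0, 0x00000097, 0x054080e7)` is 0xe0 + 84 = 0x134, matching the `# 134 <rand>` annotation in the dump.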





- On every function call, the frame pointer saves the previous stack pointer
- The caller and the callee each save the registers they are responsible for







- Registers aren't just registers, each of them has a meaning
- This convention is part of the Application Binary Interface (ABI)

| Register | ABI Name | Description                       | Saver  |
|----------|----------|-----------------------------------|--------|
| x0       | zero     | Hard-wired zero                   |        |
| x1       | ra       | Return address                    | Caller |
| x2       | sp       | Stack pointer                     | Callee |
| x3       | gp       | Global pointer                    | _      |
| x4       | tp       | Thread pointer                    | _      |
| x5       | t0       | Temporary/alternate link register | Caller |
| x6-7     | t1-2     | Temporaries                       | Caller |
| x8       | s0/fp    | Saved register/frame pointer      | Callee |
| x9       | s1       | Saved register                    | Callee |
| x10-11   | a0-1     | Function arguments/return values  | Caller |
| x12-17   | a2-7     | Function arguments                | Caller |
| x18-27   | s2-11    | Saved registers                   | Callee |
| x28-31   | t3-6     | Temporaries                       | Caller |
| f0-7     | ft0-7    | FP temporaries                    | Caller |
| f8-9     | fs0-1    | FP saved registers                | Callee |
| f10-11   | fa0-1    | FP arguments/return values        | Caller |
| f12-17   | fa2-7    | FP arguments                      | Caller |
| f18-27   | fs2-11   | FP saved registers                | Callee |
| f28-31   | ft8-11   | FP temporaries                    | Caller |
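
When you flip between numeric and ABI register names in dump files, a tiny lookup helps. This is a sketch that encodes only the integer rows of the table above; the function names are hypothetical.

```python
def abi_name(n):
    """Map integer register number x<n> to its ABI name."""
    if not 0 <= n <= 31:
        raise ValueError("RV32I has registers x0 through x31")
    fixed = {0: "zero", 1: "ra", 2: "sp", 3: "gp", 4: "tp"}
    if n in fixed:
        return fixed[n]
    if 5 <= n <= 7:
        return f"t{n - 5}"    # x5-x7   are t0-t2
    if 8 <= n <= 9:
        return f"s{n - 8}"    # x8-x9   are s0-s1 (s0 doubles as fp)
    if 10 <= n <= 17:
        return f"a{n - 10}"   # x10-x17 are a0-a7
    if 18 <= n <= 27:
        return f"s{n - 16}"   # x18-x27 are s2-s11
    return f"t{n - 25}"       # x28-x31 are t3-t6


def is_callee_saved(n):
    """True for the registers the table marks as saved by the callee."""
    return n == 2 or 8 <= n <= 9 or 18 <= n <= 27
```

For instance, `abi_name(8)` is `s0` and `is_callee_saved(8)` is true, which is why a function that uses `s0` stores it to the stack on entry and reloads it before returning.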





- Same 5-stage in-order pipeline as in class
  - No hazard detection or forwarding logic
  - More info at Appendix A of the textbook
    - Starting with the 8th Edition of Hennessy and Patterson, all examples should be using RISC-V as well
  - Supports RV32IM minus divide, remainder and system instructions
- Need to add hazard logic
- Programs still run because only one instruction is allowed in the pipeline at a time











- Forwarding
  - Like what we covered in class
  - Need to forward results from later stages to EX
- Structural Hazards
  - Only one memory port for fetching and memory accesses
  - Memory gets priority over fetch
    - You need this to guarantee forward progress
- Control Hazards
  - Predict not taken, resolved in MEM stage
  - Flush IF/ID, ID/EX, EX/MEM if incorrect





- These diagrams are outdated, but still helpful









































- Makefile: you've seen this multiple times now
  - Compiles the Verilog executables simv and syn\_simv (and vis\_simv)
  - Also contains targets for compiling/linking C and assembly programs
  - More on the next slide
- verilog/: the Verilog code directory, which you should edit and change
- test/: directory with the testbench, memory, and pipeline printing code
- synth/: directory where synthesis output will be created, and where the synthesis script lives
- programs/: test programs, both C code and assembly; also where .mem files are compiled
  - You will write your own test\_[12345].s assembly files here!
- output/: where the .out, .cpi, .wb, and .ppln files are created





- `make programs/<my program>.mem`: compile `<my program>` into a machine code .mem file
- `make <my program>.out`: run `<my program>` on simv by loading the .mem file into memory
  - Ex: for the program programs/no\_hazard.s, run `make no_hazard.out`
  - For the program programs/omegalul.c, run `make omegalul.out`
  - This creates `<my program>.out`, `<my program>.cpi`, `<my program>.wb`, and `<my program>.ppln` in the output/ directory
- `make <my program>.dump`: create .dump\_x and .dump\_abi dump files in programs/
  - .dump\_x has numeric register names: x0, x1, x2, etc.
  - .dump\_abi has RISC-V ABI register names: sp, ra, a0, t0, etc.
- `make <my program>.verdi`: run the program on simv with Verdi
- `make <my program>.syn.verdi`: run the program on syn\_simv with Verdi
- `make <my program>.vis`: run the VTUBER visual debugger, an extremely useful tool for project 3
- `make simv`: create the Verilog executable simv from the testbench/sources







- output/<my\_program>.out: \$display() output of the testbench; contains the final memory status
- output/<my\_program>.cpi: final CPI and total running time
- output/<my\_program>.wb: register file writeback at each PC
- output/<my\_program>.ppln: pipeline output file showing which PC/instruction is in each stage, as well as activity to/from memory

We've given you three correct outputs for project 3 in the correct\_out/ directory. However, they test separate things, and none of them contains every combination of hazards.

Use the program diff to check correctness:

```
make no_hazard.out
diff output/no_hazard.wb correct_out/no_hazard.wb
diff output/no_hazard.ppln correct_out/no_hazard.ppln
grep '@@@' output/no_hazard.out | diff - correct_out/no_hazard.out
```

(Try with: alias diff="git diff --no-index")



# Writing your own assembly unit tests

- You will write up to 5 assembly unit tests matching programs/test\_[12345].s
- These must expose either a correctness or CPI bug in 4 buggy processors
- Copy the assembly from the existing assembly files in programs/
- Look at each of the hazard bullet points in the project spec
- Some internet assembly resources (not affiliated):
  - https://projectf.io/posts/riscv-arithmetic/
  - https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md
- Notes:
  - Don't write anything that has misaligned memory access in assembly
  - Compiler won't generate any for C





- Branches should resolve in the same stage they are currently resolved in
- All forwarding must be to the EX stage, even if the data isn't needed until a later stage





- add 1 2 3 ; reg 3 = reg 1 + reg 2
- nand 3 4 5 ; reg 5 = reg 3 ~& reg 4
- add 6 3 7 ; reg 7 = reg 6 + reg 3
- lw 3 6 10 ; reg 6 = Mem[reg 3 + 10]
- sw 6 2 12 ; Mem[reg 6 + 12] = reg 2
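
To see which of these five instructions actually depend on each other, the sketch below lists every read-after-write pair and how many instructions apart the producer and consumer are. The tuple encoding of the program is an assumption made for this example (operation, destination register, source registers), read off the comments above.

```python
# (operation, destination register or None, source registers)
PROGRAM = [
    ("add", 3, (1, 2)),
    ("nand", 5, (3, 4)),
    ("add", 7, (6, 3)),
    ("lw", 6, (3,)),
    ("sw", None, (6, 2)),
]


def raw_dependencies(program):
    """Return (consumer, producer, register, distance) for each RAW pair."""
    deps = []
    last_writer = {}  # register number -> index of most recent writer
    for index, (_op, dest, sources) in enumerate(program):
        for reg in sources:
            if reg != 0 and reg in last_writer:
                producer = last_writer[reg]
                deps.append((index, producer, reg, index - producer))
        if dest is not None:
            last_writer[dest] = index
    return deps
```

Running `raw_dependencies(PROGRAM)` shows reg 3 being consumed at distances 1, 2, and 3 from the first `add`, plus the `sw` reading reg 6 right after the `lw` writes it. That last pair is the load-use case, where forwarding alone is not enough.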















































- Any stalling due to data hazards must occur in the decode stage. (That is, if stalling is required the dependent instruction should stall in the decode stage.)
  - Instructions following the stalling instruction in the IF stage will have to stay in the IF stage. Put another way, if you need to insert an invalid instruction, it should be inserted in the EX stage




















































- If you wish to insert a nop you must invalidate the instruction. Otherwise your CPI numbers will be wrong
- If there is a structural hazard in the memory, you should let the load/store go and have the fetch stage wait on getting memory
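
The last two rules can be modeled in a few lines. This is a behavioral sketch with hypothetical names. The CPI helper assumes that only valid instructions reaching the end of the pipeline are counted, which is what the invalidation rule above implies; check the testbench for how it is really computed.

```python
def arbitrate_memory_port(mem_stage_wants_port, fetch_wants_port):
    """Decide who owns the single memory port this cycle.

    Returns (owner, fetch_must_wait). The MEM stage always wins so that
    older instructions keep making forward progress.
    """
    if mem_stage_wants_port:
        return "MEM", fetch_wants_port
    if fetch_wants_port:
        return "IF", False
    return None, False


def cpi(retired_valid_per_cycle):
    """Cycles per instruction from one valid bit per simulated cycle."""
    retired = sum(1 for valid in retired_valid_per_cycle if valid)
    if retired == 0:
        raise ValueError("no valid instruction retired")
    return len(retired_valid_per_cycle) / retired
```

For example, four cycles with one invalid bubble give `cpi([True, False, True, True])`, which is 4/3. If that bubble were left marked valid, the same run would report a CPI of 1.0, which is exactly how an un-invalidated nop skews the numbers.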

















- Error on load access
  - Tried to access an invalid memory address
- Be careful of using "op{a,b}\_select" signals
  - They are mux select signals for ALU inputs, not indications of whether instruction uses rs1 and/or rs2
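
One way to keep the two ideas apart is to derive "does this instruction read rs1/rs2" from the opcode alone. The sketch below does that for the base RV32I opcodes (the M extension shares the OP opcode); it is an illustration with hypothetical names, not the decoder from the provided code, and it ignores system and fence instructions.

```python
OPCODE_LUI = 0x37
OPCODE_AUIPC = 0x17
OPCODE_JAL = 0x6F
OPCODE_JALR = 0x67
OPCODE_BRANCH = 0x63
OPCODE_LOAD = 0x03
OPCODE_STORE = 0x23
OPCODE_OP_IMM = 0x13
OPCODE_OP = 0x33

READS_RS1 = {OPCODE_JALR, OPCODE_BRANCH, OPCODE_LOAD,
             OPCODE_STORE, OPCODE_OP_IMM, OPCODE_OP}
READS_RS2 = {OPCODE_BRANCH, OPCODE_STORE, OPCODE_OP}


def register_sources(word):
    """Return (rs1, rs2) actually read by the instruction, None if unused."""
    opcode = word & 0x7F
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    return (
        rs1 if opcode in READS_RS1 else None,
        rs2 if opcode in READS_RS2 else None,
    )
```

A store is the classic trap: `sw s0, 8(sp)` (`0x00812423`) reads both `sp` and `s0`, even though in a typical datapath the second ALU input for a store is the immediate, not rs2. Conversely, `auipc` reads no registers at all, although bits 19:15 and 24:20 of its encoding still hold some value.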





- Try to tackle one thing at a time
- Be careful of register 0!
- Be aware of where operand data is coming from
  - Not all instructions receive source data from ALU output.
- "Forward data into EX stage"
  - Essentially means widen muxes for ALU input, or add muxes for EX/MEM pipeline register inputs.
- Adding signals should be very easy if you use structs wisely





- Examine <my\_program>.ppln output
- Find first incorrect register write/memory load
- Trace back execution of that instruction
- Don't start with Verdi, instead try the Visual debugger!
  - make <my\_program>.vis
  - Verdi is useful only once you know precisely where/when a bug is occurring
  - But it is still a good tool for synthesis and tracing X values
- Avoid "staring at your screen" debugging
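
If scrolling through diff output gets tedious, a few lines of Python can report just the first line where two output files disagree. This is a minimal sketch: `first_mismatch` is a hypothetical helper, and it treats the files as plain lists of lines without assuming anything about their format.

```python
from itertools import zip_longest


def first_mismatch(actual_lines, expected_lines):
    """Return (line_number, actual, expected) for the first difference.

    Line numbers start at 1. A missing line shows up as None. Returns
    None when the two sequences are identical.
    """
    pairs = zip_longest(actual_lines, expected_lines)
    for number, (actual, expected) in enumerate(pairs, start=1):
        if actual != expected:
            return number, actual, expected
    return None
```

To use it on real files, read each one with `open(path).read().splitlines()` and pass the two lists in, for example your output/ .wb file and the matching file in correct\_out/. The first mismatching writeback is usually the right place to start tracing backwards.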

