Building a RISC System in an FPGA




Part 1: Tools, Instruction Set, and Datapath

FEATURE ARTICLE
Jan Gray

CIRCUIT CELLAR® Issue 116 March 2000

To kick off this three-part article, Jan's going to port a C compiler, design an instruction set, write an assembler and simulator, and design the CPU datapath. Get reading, you've only got a month before your connecting article arrives!

I used to envy CPU designers, the lucky engineers with access to expensive tools and fabs. But field-programmable gate arrays (FPGAs) have made custom-processor and integrated-system design much more accessible.

20–50-MHz FPGA CPUs are perfect for many embedded applications. They can support custom instructions and function units, and can be reconfigured to enhance system-on-chip (SoC) development, testing, debugging, and tuning. Of course, FPGA systems offer high integration, short time-to-market, low NRE costs, and easy field updates of entire systems.

FPGA CPUs may also provide new answers to old problems. Consider one system designed by Philip Freidin. During self-test, its FPGA is configured as a CPU and it runs the tests. Later the FPGA is reconfigured for normal operation as a hardwired signal processing datapath. The ephemeral CPU is free and saves money by eliminating test interfaces.

Register    Use
r0          always zero
r1          reserved for assembler
r2          function return value
r3–r5       function arguments
r6–r9       temporaries
r10–r12     register variables
r13         stack pointer (sp)
r14         interrupt return address
r15         return address

Table 1: The xr16 C language calling conventions assign a fixed role to each register. To minimize the cost of function calls, up to three arguments, the return address, and the return value are passed in registers.

THE PROJECT

Several companies sell FPGA CPU cores, but most are synthesized implementations of existing instruction sets, filling huge, expensive FPGAs, and are too slow and too costly for production use. These cores are marketed as ASIC prototyping platforms.

In contrast, this article shows how a streamlined and thrifty CPU design, optimized for FPGAs, can achieve a cost-effective integrated computer system, even for low-volume products that can't justify an ASIC run.

I'll build an SoC, including a 16-bit RISC CPU, memory controller, video display controller, and peripherals, in a small Xilinx 4005XL. I'll apply free software tools including a C compiler and assembler, and design the chip using Xilinx Student Edition.

If you're new to Xilinx FPGAs, you can get started with the Student Edition 1.5. This package includes the development tools and a textbook with many lab exercises.[3]

The Xilinx university-program folks confirm that Student Edition is not just for students, but also for professionals continuing their education. Because it is discounted with respect to their commercial products, you do not receive telephone support, although there is web and fax-back support. You also do not receive maintenance updates; if you need the
next version of the software, you have to buy it all over again. Nevertheless, Student Edition is a good deal and a great way to learn about FPGA design.

My goal is to put together a simple, fast 16-bit processor that runs C code. Rather than implement a complex legacy instruction set, I'll design a new one streamlined for FPGA implementation: a classic pipelined RISC with 16-bit instructions and sixteen 16-bit registers. To get things started, let's get a C compiler.

C COMPILER

Fraser and Hanson's book is the literate source code of their lcc retargetable C compiler.[1] I downloaded the V.4.1 distribution and modified it to target the nascent RISC, xr16.

Most of lcc is machine independent; targets are defined using machine description (md) files. lcc ships with x86, MIPS, and SPARC md files, and my job was to write [...]. I copied [...] from [...], added it to the makefile, and added an xr16 target option. I designed xr16 register conventions (see Table 1) and changed my md to target them.

At this point, I had a C compiler for a 32-bit 16-register RISC, but needed to target a 16-bit machine with sizeof(int)=sizeof(void*)=2. lcc obtains target operand sizes from md tables, so I just changed some entries from 4 to 2:

    Interface xr16IR = {
        1, 1, 0, /* char */
        2, 2, 0, /* short */
        2, 2, 0, /* int */
        2, 2, 0, /* T* */

Next, lcc needs operators that load a 2-byte int into a register, add 2-byte int registers, dereference a 2-byte pointer, and so on. The lcc ops utility prints the required operator set. I modified my tables and instruction templates accordingly. For example:

    reg: CVUI2(INDIRU1(addr)) \
        "lb r%c,%0\n" 1

uses lb rd,addr to load an unsigned char at addr and zero-extend it into a 16-bit int register.

    stmt: EQI2(reg,con) \
        "cmpi r%0,%1\nbeq %a\n" 2

uses a cmpi, beq sequence to compare a register to a constant and branch to this label if equal.

I removed any remaining 32-bit assumptions inherited from [...], arranged to store long ints in register pairs, and call helper routines for mul, div, rem, and some shifts.

My port was up and running in just one day, but I had already read the lcc book. Let's see what she can do. Listing 1 is the source for a binary tree search routine, and Listing 2 is the assembly code lcc-xr16 emits.

    typedef struct TreeNode {
        int key;
        struct TreeNode *left, *right;
    } *Tree;

    Tree search(int key, Tree p) {
        while (p && p->key != key)
            if (p->key < key)
                p = p->right;
            else
                p = p->left;
        return p;
    }

Listing 1: This sample C code declares a binary search tree data structure and defines a binary search function. Search returns a pointer to the tree node whose key compares equal to the argument key, or NULL if not found.

INSTRUCTION SET

Now, let's refine the instruction set and choose an instruction encoding. My goals and constraints include: cover the C (integer) operator set, fixed-size 16-bit instructions, easily decoded, easily pipelined, with three-operand instructions (dest = src1 op src2/imm), as encoding space allows. I also want it to be byte addressable (load and store bytes and words), and provide one addressing mode, disp(reg). To support long ints we need add/subtract with carry and shift left/right extended.

Which instructions merit the most bits? Reviewing early compiler output from test applications shows that the most common instructions (static frequency) are lw (load word), 24%; sw (store word), 13%; mov (reg-reg move), 12%; lea (load effective address), 8%; call, 8%; br, 6%; and cmp, 6%. Mov, lea, and cmp can be synthesized from add or sub with r0. 69% of loads/stores use disp(reg) addressing, 21% are absolute, and 10% are register indirect.

Therefore we make these choices:

• add, sub, addi are 3-operand
• less common operations (logical ops, add/sub with carry, and shifts) are 2-operand to conserve opcode space
• r0 always reads as 0
• 4-bit immediate fields
• for 16-bit constants, an optional immediate prefix imm establishes the most significant 12 bits of the immediate operand of the instruction that immediately follows
• no condition codes, rather use an interlocked compare and conditional branch sequence
• jal (jump-and-link) jumps to an effective address, saving the return address in a register
• call func encodes jal r15,func in one 16-bit instruction (provided the function is 16-byte aligned)
• perform mul, div, rem, and variable and multiple bit shifts in software

The six instruction formats are shown in Table 2 and the 43 distinct instructions are shown in Table 3. adds, subs, shifts, and imm are uninterruptible prefixes. Loads/stores take two cycles, jump and branch-taken take three cycles (no branch delay slots).

Format   15–12   11–8    7–4     3–0
rrr      op      rd      ra      rb
rri      op      rd      ra      imm
rr       op      rd      fn      rb
ri       op      rd      fn      imm
i12      op      imm12 (bits 11–0)
br       op      cond    disp8 (bits 7–0)

Table 2: The xr16 has six instruction formats, each with 4-bit opcode and register fields.
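
As an illustration of the formats in Table 2, the following C sketch packs fields into a 16-bit instruction word. It is not taken from the xr16 assembler: the helper names (xr16_pack, xr16_rrr, xr16_rri, xr16_mov) are hypothetical, the opcode values are the ones listed in the Hex column of Table 3, and the operand order chosen for the synthesized mov is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Opcodes taken from the Hex column of Table 3. */
enum { OP_ADD = 0x0, OP_SUB = 0x1, OP_ADDI = 0x2 };

/* Hypothetical helper: pack four 4-bit fields, most significant first. */
uint16_t xr16_pack(unsigned f15_12, unsigned f11_8, unsigned f7_4, unsigned f3_0)
{
    assert(f15_12 < 16 && f11_8 < 16 && f7_4 < 16 && f3_0 < 16);
    return (uint16_t)((f15_12 << 12) | (f11_8 << 8) | (f7_4 << 4) | f3_0);
}

/* rrr format: op rd ra rb */
uint16_t xr16_rrr(unsigned op, unsigned rd, unsigned ra, unsigned rb)
{
    return xr16_pack(op, rd, ra, rb);
}

/* rri format: op rd ra imm, here with a signed 4-bit immediate (-8..7),
   as used by addi. */
uint16_t xr16_rri(unsigned op, unsigned rd, unsigned ra, int imm)
{
    assert(imm >= -8 && imm <= 7);
    return xr16_pack(op, rd, ra, (unsigned)imm & 0xFu);
}

/* mov rd,ra synthesized as add rd,ra,r0 (r0 always reads as 0).
   The operand order is an assumption made for this sketch. */
uint16_t xr16_mov(unsigned rd, unsigned ra)
{
    return xr16_rrr(OP_ADD, rd, ra, 0);
}
```

For example, xr16_rrr(OP_ADD, 3, 1, 2) yields 0x0312 for add r3,r1,r2, and xr16_rri(OP_ADDI, 3, 1, -1) yields 0x231F for addi r3,r1,-1. Because every field is a 4-bit nibble, the encoded word can be read directly in hex.
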

The four-bit imm field encodes either an int (-8 to 7): add/sub, logic, shifts; unsigned (0–15): lb, sb; or unsigned word displacement (0, 2, ..., 30): lw, sw, jal, call.

Some assembly instructions are formed from other machine instructions, as you can see in Table 4. Note that only signed char data use lbs.

ASSEMBLER

I wrote a little multipass assembler to translate the lcc assembly output into an executable image.

The xr16 assembler reads one or more assembly files and emits both image and listing files. The lexical analyzer reads the source characters and recognizes tokens like the identifier _main. The parser scans tokens on each line and recognizes instructions and operands, such as register names and effective address expressions. The symbol table remembers labels and their addresses, and a fixup table remembers symbolic references.

In pass one, the assembler parses each line. Labels are added to the symbol table. Each instruction expands into one or more machine instructions. If an operand refers to a label, we record a fixup to it.

In pass two, we check all branch fixups. If a branch displacement exceeds 128 words, we rewrite it using a jump. Because inserting a jump may make other branches far, we repeat until no far branches remain.

Next, we evaluate fixups. For each one, we look up the target address and apply that to the fixup subject word. Lastly, we emit the output files.

I also wrote a simple instruction set simulator. It is useful for exercising both the compiler and the embedded application in a friendly environment.

Well, by now you are probably wondering if there is any hardware to this project. Indeed there is! First, let's consider our target FPGA device.

THE FPGA

The Xilinx XC4005XL-PC84C-3 is a 3.3-V FPGA in an 84-pin J-lead PLCC package. This SRAM-based device must be configured by external ROM or host at power-up. It has a 14 × 14 array of configurable logic blocks (CLBs) and 61 bonded-out I/O blocks (IOBs) in a sea of programmable interconnect.

Every CLB has two 4-input look-up tables (4-LUTs) and two flip-flops. Each 4-LUT can implement any logic function of 4 inputs, or a 16 × 1-bit synchronous static RAM, or ROM. Each CLB also has "carry logic" to build fast, compact ripple-carry adders.

Each IOB offers input and output buffers and flip-flops. The output buffer can be 3-stated for bidirectional I/O. The programmable interconnect routes CLB/IOB output signals to other CLB/IOB inputs. It also provides wide-fanout low-skew clock lines, and horizontal long lines, which can be driven by 3-state buffers at each CLB.[2]

The XC4000XL architecture would appear to have been designed with CPUs in mind. Just eight CLBs can build a single-port 16 × 16-bit register file (using LUTs as SRAM), a 16-bit adder/subtractor (using carry logic), or a four-function 16-bit logic unit. Because each LUT has a flip-flop, the device is register rich, enabling a pipelined implementation style; and as each flip-flop has a dedicated clock enable input, it's easy to stall the pipeline when necessary. Long line buses and 3-state drivers form an efficient word-wide multiplexer of the many function unit results, and even an on-chip 3-state peripheral bus.

THE PROCESSOR INTERFACE

Figure 1 gives you a good look at the xr16 processor macro symbol. The interface was designed to be easy to use with an on-chip bus. The key signals are:

• CLK: the system clock
• AN[15:0]: next memory address
• READN: next access is a read
• WORDN: next access is 16-bit data
• ACE: address clock enable (the above signals are valid, start next access)
• RDY: memory ready input (the current access completes this cycle)
• INSN[15:0]: instruction word input
• D[15:0]: on-chip bidirectional data bus to load/store data

Figure 1: The xr16 processor symbol ports, which include instruction and data buses, next address and memory controls, and bus controls, constitute its interface to the system memory controller. (Symbol XR16; ports shown: CLK, RDY, IREQ, DMAREQ, ZERODMA, INSN[15:0], D[15:0], AN[15:0], ACE, WORDN, READN, DBUSN, DMA, UDT, LDT, UDLDT.)

The memory/bus controller (which I'll explain further in Part 3) decodes the address and activates the selected memory or peripheral. Later it asserts RDY to signal that the memory access is done.

As Figure 2 shows, the CPU is simply a datapath that is steered by a control unit. Next month, I'll examine the control unit in greater detail. The rest of this article explores the design and implementation of the datapath.

Figure 2: The control unit receives instructions, decodes them, and drives both the memory control outputs and the datapath control signals. (Blocks shown: control unit CTRL16 and datapath DP16.)

DATAPATH RESOURCES

The instruction set evolved with the datapath implementation. Each new idea was first evaluated in terms of the additional logic required and its impact on the processor cycle time.

To execute one instruction per cycle you need a 16-entry 16-bit register file with two read ports (r1 and r2 in add r3, r1, r2) and one write port (r3 in add r3, r1, r2); an immediate operand multiplexer (mux) to select the immediate field as an operand (addi r3, r1, 2); an arithmetic/logic unit (ALU) (sub r3, r1, r2; xor r3, r1); a shifter (srai r3, 1); and an effective address adder to compute reg+offset (lw r3, 2(r1)).

You'll also need a mux to select a result from the adder, logic unit, left or right shifter, return address, or load data; logic to check a result for zero, negative, carry-out, or overflow; a program counter (PC), PC incrementer, branch displacement adder (br L), and a mux to load the PC with a jump target address (call _foo); and a mux to share the memory port for instruction fetch (addr ← PC) and load/store (addr ← effective address).

Careful design and reuse will let you minimize the datapath area because the adder, with the immediate mux, can do the effective address add, and the PC incrementer can also add branch displacements. The memory address mux can help load the PC with the jump target.

DATAPATH SCHEMATIC

Figure 3: The pipelined datapath has an execution unit, a result multiplexer, and an address/PC unit. Operands from the register file or immediate field are selected and latched into the A and B operand registers. Then the function units, including ADDSUB, operate upon A and B, and one of the results is driven onto RESULT[15:0] and written back into the register file. Meanwhile, the address/PC unit increments the PC to help fetch the next instruction.

Figure 3 is the culmination of these ideas. There are three groups of resources. The execution unit is the heart of the processor. It fetches operands from the register file and the immediate fields of the instruction register, presents them to the add/sub, logic, and (trivial) shift units, and writes back the result to the register
file. The result multiplexer selects one result from the various function units. The address/PC unit drives the next memory address, and includes the PC, PC adder, and address mux. Now, let's see how each resource is implemented in our FPGA.

REGISTER FILE

During each cycle, we must read two register operands and write back one result. You get two read ports (AREG and BREG) by keeping [...]

[...] format instruction which follows.

So, the B operand mux IMMED is a 16-bit-wide selection of either BREG, 0[15:4]||IR[3:0], sign[15:4]||IR[3:0], or IR[11:0]||0[3:0] ("||" means bit concatenation).

I used an unusual 2-1 mux with a fourth "force constant" input for this zero/sign extension function, primarily because it fits in a single 4-LUT. So, as with FWD, [...]

Hex    Fmt   Assembler                               Semantics
0dab   rrr   add rd,ra,rb                            rd = ra + rb;
1dab   rrr   sub rd,ra,rb                            rd = ra - rb;
2dai   rri   addi rd,ra,imm                          rd = ra + imm;
3d*b   rr    {and or xor andn adc sbc} rd,rb         rd = rd op rb;
4d*i   ri    {andi ori xori andni adci sbci slli
             slxi srai srli srxi} rd,imm             rd = rd op imm;
5dai   rri   lw rd,imm(ra)                           rd = *(int*)(ra+imm);
6dai   rri   lb rd,imm(ra)                           rd = *(byte*)(ra+imm);
8dai   rri   sw rd,imm(ra)                           *(int*)(ra+imm) = rd;
9dai   rri   sb rd,imm(ra)                           *(byte*)(ra+imm) = rd;
Adai   rri   jal rd,imm(ra)                          rd = pc, pc = ra + imm;
B*dd   br    {br brn beq bne bc bnc bv bnv blt bge
             ble bgt bltu bgeu bleu bgtu} label      if (cond) pc += 2*disp8;
Ciii   i12   call func                               r15 = pc, pc = imm12 [...]
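
The B operand selections described above can be modeled in a few lines of C. This is a behavioral sketch only, not the author's schematic: the function names are hypothetical, ir stands for the 16-bit instruction register, and the last helper assumes the behavior summarized in Table 5, where an instruction following an imm prefix keeps B[15:4] and replaces only the low four bits.

```c
#include <stdint.h>

/* Zero extension: 0[15:4] || IR[3:0] */
uint16_t imm_zero_ext(uint16_t ir)
{
    return ir & 0x000Fu;
}

/* Sign extension: sign[15:4] || IR[3:0], where sign is IR bit 3 */
uint16_t imm_sign_ext(uint16_t ir)
{
    uint16_t low = ir & 0x000Fu;
    return (low & 0x8u) ? (uint16_t)(low | 0xFFF0u) : low;
}

/* imm prefix: IR[11:0] || 0[3:0] */
uint16_t imm_prefix(uint16_t ir)
{
    return (uint16_t)((ir & 0x0FFFu) << 4);
}

/* Instruction after an imm prefix: B[15:4] is kept from the prefix,
   and IR[3:0] fills B[3:0] (assumed from Table 5). */
uint16_t imm_after_prefix(uint16_t b_prev, uint16_t ir)
{
    return (uint16_t)((b_prev & 0xFFF0u) | (ir & 0x000Fu));
}
```

With these helpers, a prefix whose low 12 bits are 0x123 produces B = 0x1230, and a following addi rd,ra,4 produces B = 0x1234, which is how a full 16-bit constant is assembled from two 16-bit instructions.
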
[...] 4-LUT. Thus, the 16-bit logic unit is a column of eight CLBs.

ADDSUB adds B to A, or subtracts B from A, according to its ADD input. It reads carry-in (CI) and drives carry-out (CO), and overflow (V). ADDSUB is an instance of the ADSU16 library symbol, and is 10 CLBs high: one to anchor the ripple-carry adder, eight to add/sub 16 bits, and one to compute carry-out and overflow.

Z, the zero detector, is a 2.5-CLB NOR-tree of the SUM[15:0] output.

The shifter produces either A>>1 or [...]

[...] the RESULT bus. The control unit asserts RFWE (register file write enable), and sets RNA=RNB=3 to write the result into both REGFILEs' r3.

DEVELOPMENT TOOLS

This hardware was designed, simulated, and compiled on a PC using the Foundation tools in Xilinx Student Edition 1.5. I used schematics for this project because their 2-D layout makes it easier to understand the data flow, because they offer explicit control, and because they support the [...]

Figure 4: In the datapath floorplan, RLOC attributes applied to the datapath schematic pin down the datapath elements to specific CLB locations. The RESULT[15:0] bus runs horizontally across the bottom [...] (Column labels in the figure: FWD, A, UDLDBUF, ZHBUF; IMMED, B, LDBUF, UDBUF; LOGIC, DOUT, LOGICBUF; BREGS, BREG, SRBUF; AREGS, AREG, SLBUF; ADDSUB, SUMBUF; PC, RET, RETBUF; ADDRMUX; PCDISP, Z; PCINCR.)
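
The ADDSUB and Z behavior described on this page can be approximated with a small C model. This is a sketch under stated assumptions, not a description of the ADSU16 symbol's internals: the type and function names are hypothetical, and subtraction is assumed to follow the usual two's-complement form A + ~B + CI, so that CI = 1 gives exactly A - B and CO = 1 then means no borrow.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t sum; /* SUM[15:0] */
    bool co;      /* carry-out (CO) */
    bool v;       /* signed overflow (V) */
    bool z;       /* zero detect (Z): all sum bits are 0 */
} AddSubResult;

/* add != 0 selects A + B + CI; add == 0 selects A + ~B + CI (assumed). */
AddSubResult addsub16(uint16_t a, uint16_t b, bool add, bool ci)
{
    uint16_t operand = add ? b : (uint16_t)~b;
    uint32_t wide = (uint32_t)a + operand + (ci ? 1u : 0u);
    AddSubResult r;

    r.sum = (uint16_t)wide;
    r.co = ((wide >> 16) & 1u) != 0;
    /* Overflow: both adder inputs share a sign that the sum does not. */
    r.v = (((a ^ r.sum) & (operand ^ r.sum)) & 0x8000u) != 0;
    r.z = (r.sum == 0);
    return r;
}
```

Under these assumptions, addsub16(5, 5, false, true) returns a zero sum with Z and CO set, and addsub16(0x7FFF, 1, true, false) returns 0x8000 with V set, which are the conditions a compare-and-branch sequence would test.
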
Instruction(s)      A         B
add rd,ra,rb        AREG      BREG
addi rd,ra,i4       AREG      sign-ext imm
sb rd,i4(ra)        AREG      zero-ext imm
imm 0x123           ignored   imm12 || 0[3:0]
addi rd,ra,4        AREG      B[15:4] || imm
add1 r3,r1,r2       AREG      BREG
add2 r5,r3,r4       RESULT    BREG

Table 5: Depending on the instruction or instruction sequence, A is either AREG or the forwarded result, and B is either BREG or an immediate field of the instruction register.

[...] structures, TBUFs, and flip-flop clock enables), floorplanning (placing functions in columns, ordering columns to reduce interconnect requirements, and running the 3-state bus horizontally over the columns), iterative design (measuring the area and delay effects of each potential feature), and using timing-driven place-and-route and iterative timing improvement.

I apply timing constraints, such as net CLK period=28;, which causes par to find critical paths in the design and prioritize their placement and routing to best meet the constraints. Next, I run trce to find critical paths. Then I fix them, rebuild, and repeat until performance is satisfactory.

I've built some tools, settled on an instruction set, built a datapath to execute it, and learned how to implement it efficiently in an FPGA. Next month, I'll design the control unit.

Jan Gray is a software developer whose products include a leading C++ compiler. He has been building FPGA processors and systems since 1994, and now he designs for Gray Research LLC. You may reach him at [...].

SOFTWARE

Visit the Circuit Cellar web site for more information, including specifications, source code, schematics, and links to related sites.

REFERENCES

[1] C. Fraser and D. Hanson, A Retargetable C Compiler: Design and Implementation, Benjamin/Cummings, Redwood City, CA, 1995.
[2] T. Cantrell, "VolksArray," Circuit Cellar, April 1998, pp. 82-86.
[3] D. Van den Bout, The Practical Xilinx Designer Lab Book, Prentice Hall, 1998. (Available separately and included with Xilinx Student Edition.)

SOURCE

Xilinx Student Edition 1.5
Xilinx, Inc.
(408) 559-7778
Fax: (408) 559-7114

© Circuit Cellar, The Magazine for Computer Applications. Reprinted with permission. For subscription information call (860) 875-2199, email [...] or on our web site at [...].