Building a RISC System in an FPGA Part 3


Now that the xr16 RISC processor is complete, it's time to tie everything together and wrap up this series. In this final part, Jan designs a demo system that includes an on-chip bus, memory controller, video controller, and peripherals.


FEATURE ARTICLE
Jan Gray

Part 3: System-on-a-Chip Design

CIRCUIT CELLAR® Issue 118 May 2000

The xr16 RISC processor is designed; now it's time to design the rest of the System-on-a-Chip (SoC). Besides the CPU, the FPGA hosts an on-chip bus, bus controller, parallel port, RAM, video controller, and an external SRAM controller.

This month, I'll show how simple interfaces can make SoC design as straightforward as classic CPU, glue logic, memory, peripherals, and PCB design used to be.

XS40 BOARD

The project targets the XESS XS40-005XL V.1.2 FPGA board in Photo 1, which includes a Xilinx XC4005XL, 12-MHz oscillator (see Figure 1), 32-KB SRAM, 8031 MCU, 7-segment LED, voltage regulators, and parallel port and VGA port connectors. It's simple, inexpensive, and is featured in The Practical Xilinx Designer Lab Book included with Xilinx Student Edition.

I chose this board because it is well supported with documentation and tools, and because it can be used for both the XSE exercises and this project.

A SYSTEM-ON-A-CHIP

I'll build an integrated system from the resources at hand: the FPGA, RAM, the video and parallel ports, and the 12-MHz oscillator.

I used the RAM for program, data, and video memory. The byte-wide, asynchronous SRAM isn't ideal, but it is fast enough for you to read and latch a byte on each clock edge, thereby fetching a 16-bit instruction during each cycle.

By displaying all 32 KB of RAM, you can fashion a bitmapped 576 × 455 monochrome video display at VGA-compatible sync frequencies. How quaint, to watch every bit on screen!

Refer also to Figure 4, the FPGA top-level schematic.

Address      Resource
0000-7FFF    external 32-KB RAM, video frame buffer
  0000         reset handler
  0010         interrupt handler
FF00-FFFF    I/O control registers, 8 peripherals × 32 bytes
  FF00-FF1F    0: 16-word on-chip IRAM
  FF21         1: parallel port input byte
  FF41         2: parallel port output byte
  FF60-FF7F    3: unused
  …            …
  FFE0-FFFF    7: unused

Table 1: The system memory map includes eight decoded peripheral control register address blocks.
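The memory map in Table 1 lends itself to a very small address decoder. The following Python sketch is only an illustration, not the schematic logic: the names `decode_address` and `describe` are hypothetical, it assumes each 32-byte I/O block is selected by address bits 7:5, and it assumes that addresses outside FF00-FFFF (including 8000-FEFF, which Table 1 does not list) are treated as accesses that wrap onto the 32-KB RAM.

```python
# Illustrative model of the Table 1 memory map (not the actual MEMCTRL logic).
PERIPHERALS = {
    0: "IRAM (16-word on-chip RAM)",
    1: "parallel port input",
    2: "parallel port output",
}  # blocks 3..7 are unused in Table 1


def decode_address(addr):
    """Classify a 16-bit byte address according to Table 1.

    Returns ("io", block, offset) for FF00-FFFF, where block is the
    peripheral number 0..7 and offset is the byte offset 0..31 in its block,
    or ("ram", offset) for everything else.
    """
    addr &= 0xFFFF
    if addr >> 8 == 0xFF:
        block = (addr >> 5) & 0x7   # 8 peripherals x 32 bytes each
        offset = addr & 0x1F
        return ("io", block, offset)
    # Assumption: addresses 8000-FEFF, which Table 1 does not list,
    # alias onto the 32-KB external RAM (15 address bits).
    return ("ram", addr & 0x7FFF)


def describe(addr):
    """Return a one-line, human-readable description of an address."""
    kind, *rest = decode_address(addr)
    if kind == "io":
        block, offset = rest
        name = PERIPHERALS.get(block, "unused")
        return f"{addr:04X}: I/O block {block} ({name}), offset {offset}"
    return f"{addr:04X}: external RAM, offset {rest[0]:04X}"


if __name__ == "__main__":
    for a in (0x0000, 0x0010, 0x7FFF, 0xFF00, 0xFF21, 0xFF41, 0xFFE0):
        print(describe(a))
```

Because every peripheral owns a fixed 32-byte block, adding a peripheral in this model only means adding an entry to the table; the decoding arithmetic does not change.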
The FPGA top-level schematic in Figure 4 includes the processor (P), the system memory/bus controller (MEMCTRL), the on-chip 16-bit data bus (D15:0), on-chip peripherals (PARIN, PAROUT, and IRAM), the external SRAM interface, and the VGA video controller.

Figure 1: The system schematic depicts the subset of the XS40 needed for our project. The 8031 (not shown) is held in reset.

DECISIONS, DECISIONS

Before examining the design, let's briefly explore the on-chip bus design space. (This is not the sort of thing you worry about when designing to someone else's microprocessor, but in an FPGA SoC, you have a little more freedom.)

Bus design issues include how many bus masters are permitted, how the bus is clocked and pipelined, how wide it is, whether it provides byte addressing, and whether it is split from or unified with the processor core RESULT bus.

For XSOC, the pipelined on-chip 16-bit data bus D15:0 is single-mastered (but recall the CPU also performs DMA transfers), the bus clock is the CPU clock, and the on-chip data bus is unified with the processor's RESULT15:0 data bus. All of these design decisions help to keep this project simple.

BUS CONTROLS

MEMCTRL, the system bus/memory controller, interfaces the processor to the on-chip and off-chip peripherals. It receives the pipelined "next transaction" memory request signals AN15:0, WORDN, READN, DBUSN, and ACE from the CPU. Then, it decodes the address, enables some peripheral or memory, and later asserts RDY in the clock cycle in which the memory cycle completes. I/O registers are memory mapped (see Table 1).

There are eight transaction types: (external RAM or I/O) × (read or write) × (byte or word), all decoded from AN15:0, WORDN, and READN.

MEMCTRL manages transfers on the on-chip data bus D15:0 and the external data bus XD7:0 by asserting various tri-state output enables (xT) and control register clock enables (xCE). These enable signals are asserted according to the transaction type (see Table 3).

For example, during sw r0,0xFF00, MEMCTRL decodes an I/O write word request. It asserts LDT and UDT, driving the store data onto D15:0, and asserts IRAM/LCE and IRAM/UCE, writing D15:0 into IRAM's SRAMs:

IRAM/D15:0 := D15:0 ← DOUT15:0

Next, consider a store to external RAM: sw r0,0x0100. Because the external data bus is only eight bits wide, first store the least significant byte, then the most significant byte. First, MEMCTRL asserts LDT and XDOUTT:

XD7:0 ← D7:0 ← DOUT7:0

Later, it asserts UDLDT and XDOUTT:

XD7:0 ← D7:0 ← DOUT15:8

BUS INTERFACE

Now, let's design an on-chip bus peripheral interface to enable robust and easy reuse of peripheral cores and to prepare for an ecology of interoperable cores to come.

It helps to distinguish between core users and core designers. The former are more numerous, while the latter are more experienced. Therefore, I make ease-of-use tradeoffs in favor of core users.

Because FPGAs are malleable and FPGA SoC design is so new, I wanted an interface that can evolve to address new requirements without invalidating existing designs.

With these two considerations in mind, I borrowed a few ideas from the software world and defined an abstract control signal bus with all of the common control signals collected into an opaque bus CTRL15:0. MEMCTRL drives CTRL and also does I/O address decoding, driving the eight I/O selects SEL7:0.

Now, you need only instantiate the core, attach CLK, CTRL, D, some SELi, any core-specific inputs and outputs, and you're done!

Contrast this with interfacing to a traditional peripheral IC. Each IC has its own idiosyncratic set of control signals, I/O register addresses, chip selects, byte read and write strobes, ready, interrupt request, and such. They don't call it glue logic for nothing.

Of course, we can't just sweep all the complexity under the rug. Each core must decode CTRL and recover the relevant control signals. This is done with the DCTRL (CTRL decoder) macro (see Figure 5). DCTRL inputs SELi, CTRL15:0, and CLK and outputs the local I/O register address, upper and lower byte output enables (read strobes), and clock enables (write strobes).

Within each DCTRL instance, you do final address decoding for the specific peripheral, combining its SELi signal with the I/O select within CTRL15:0. Here XIN8 only uses LDT (the LSB output enable). The other DCTRL outputs are unloaded and automatically eliminated by the FPGA implementation tools.

Using DCTRL and the on-chip tri-state bus, the typical overhead per peripheral is only one or two CLBs, and perhaps a column of TBUFs.

Control signal abstraction can also make bus interface evolution easy. If you revise MEMCTRL and DCTRL together, arbitrary changes to CTRL15:0 can be made without invalidating any existing designs.

Enable    Effect
LDT       D7:0 ← DOUT7:0
UDT       D15:8 ← DOUT15:8
UDLDT     D7:0 ← DOUT15:8
XDOUTT    XD7:0 ← D7:0
LXDT      D7:0 ← XDIN7:0
UXDT      D15:8 ← XDIN15:8
p/LDT     D7:0 ← p/D7:0
p/UDT     D15:8 ← p/D15:8
p/LCE     p/D7:0 := D7:0
p/UCE     p/D15:8 := D15:8

Table 2: There is a set of enables p/* within each peripheral. DOUT15:0 is the CPU store data output register (see Part 1, Circuit Cellar 116).
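The enables in Table 2 can be read as plain register transfers. As a hedged illustration of the two-step external word store described under BUS CONTROLS, the Python sketch below models only the CPU-side enables LDT, UDT, UDLDT, and XDOUTT. The function names `drive_buses` and `external_word_store`, and the use of `None` for an undriven (tri-stated) byte lane, are assumptions made for this example and are not part of the XSOC design.

```python
def drive_buses(enables, dout):
    """Return (d_hi, d_lo, xd) for one bus phase.

    enables: set of asserted enable names from Table 2.
    dout:    16-bit value in the CPU store data register DOUT15:0.
    A value of None models a byte lane that nobody is driving.
    """
    if {"LDT", "UDLDT"}.issubset(enables):
        raise ValueError("LDT and UDLDT would both drive D7:0")
    d_hi = d_lo = xd = None
    if "LDT" in enables:      # D7:0  <- DOUT7:0
        d_lo = dout & 0xFF
    if "UDLDT" in enables:    # D7:0  <- DOUT15:8
        d_lo = (dout >> 8) & 0xFF
    if "UDT" in enables:      # D15:8 <- DOUT15:8
        d_hi = (dout >> 8) & 0xFF
    if "XDOUTT" in enables:   # XD7:0 <- D7:0
        xd = d_lo
    return d_hi, d_lo, xd


def external_word_store(dout):
    """Bytes seen on XD7:0 during a word store to external RAM, in order."""
    first = drive_buses({"LDT", "XDOUTT"}, dout)[2]     # least significant byte
    second = drive_buses({"UDLDT", "XDOUTT"}, dout)[2]  # most significant byte
    return [first, second]


if __name__ == "__main__":
    # Word store of ABCD to external RAM: low byte first, then high byte.
    print([f"{b:02X}" for b in external_word_store(0xABCD)])  # ['CD', 'AB']
    # I/O word write: both byte lanes of D15:0 are driven at once.
    print(drive_buses({"LDT", "UDT"}, 0x1234))
```

With DOUT holding ABCD, the external bus carries CD and then AB, which is the byte order Figure 2 shows for a word write.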
To add new bus features, simply design a new decoder, DCTRL_v2, causing no changes to existing DCTRL clients.

EXTERNAL I/O INTERFACE?

There isn't one. If it were necessary to attach external peripherals, perhaps to the XD7:0 bus, you might design some on-chip external peripheral adapter macros. Just like an on-chip peripheral, each adapter would take CTRL and some SELi, but its job would be to use additional I/O pins to control its peripheral IC's chip selects and so forth. Of course, as a CTRL15:0 client, it would be able to raise interrupts, insert wait states, and so forth.

EXTERNAL RAM

The external RAM is a classic 32-KB fast asynchronous SRAM with a 15-ns access time (tAA). Its pins include A14:0 (address), D7:0 (data in/out), /CS (chip select), /WE (write enable), and /OE (output enable).

Refer to Figure 2 and the external bus and SRAM interface block of Figure 5.

Figure 2: The RAM interface signals (CLK, XA[14:1], XA_0, XD[7:0], /WE, /OE) for three memory transactions are: read 1234 from address 0010, write ABCD to address 0200 (half cycles W1–W6), and read 5678 from address 0012.

XA14:1 is 14 IOBs configured as OFDXs (output flip-flops with clock enables). XA14:1 captures the next address AN14:1 at the start of each new memory transaction. XA0 (XA_0) is the least significant bit of the external address. It is a logic output and can change on either CLK edge.

XD7:0 is eight IOBs configured as eight sets of simultaneous OBUFTs (tri-state output buffers), IBUFs (input buffers), and IFDs (input flip-flops). During a RAM write, XDOUTT is asserted, RAMNOE is deasserted, and the OBUFTs drive D7:0 out onto XD7:0. During a RAM read, XDOUTT is deasserted, RAMNOE is asserted, and the RAM drives its output data onto XD7:0. The data is input through the IBUFs and latched in the XDIN IFDs (on each falling CLK edge).

To keep the CPU busy with fresh new instructions, the system reads both bytes of a 16-bit word in one cycle. In the first half cycle, it sets XA0=0, reading the MSB, and latches it in XDIN. In the second half cycle, the system sets XA0=1, reading the LSB, and reads it through IBUFs. The catenation of these two bytes, XDIN15:0, feeds the CPU's INSN port, the video controller's PIX port, and D15:0 via the byte-wide tri-state buffers LXD and UXD.

Writes to asynchronous SRAM require careful design. Let's see if we can safely write one byte per clock cycle. The key constraints are:

• address must be valid before asserting /WE
• data must be valid before deasserting /WE
• /WE must be deasserted briefly
• no address/data hold time after /WE

I required a fully synchronous design to be able to slow or stop the clock and was unwilling to employ any asynchronous delay tricks.

Accomplishing this requires one half clock to settle the write address, one half clock to assert /WE, and one half clock to deassert it. Therefore, byte writes take two full cycles, and word writes take three (e.g., a word write takes six half cycles W1–W6):

• W1: assert XA14:1, data LSB, XA0=1
• W2: assert /WE
• W3: deassert /WE, hold XA and data
• W4: assert data MSB, XA0=0
• W5: assert /WE
• W6: deassert /WE, hold XA and data

Figure 3: The rest of the device contains the automatically placed processor control unit and other logic.

MEMCTRL DESIGN

I've discussed the responsibilities of MEMCTRL design: address decoding, on-chip bus control, and external RAM control. Now, let's review its implementation (see Figure 6).

In address decoding, if the next access is a load/store to address FFxx, the access is to memory-mapped I/O, and SELIO is asserted. Otherwise, it's a RAM access.

Within each peripheral's DCTRL instance, its SELi (decoded from AN7:5) and CTRLSELIO combine to develop that peripheral's output and clock enables.

For bus control, the current state of the memory transaction finite state machine determines which controls are asserted. The CPU asserts ACE (address clock enable) to request the next transaction and awaits RDY. MEMCTRL decodes the request, and the FSM enters the IO, RAMRD, or RAMWR state. The latter has three sub-states (W12, W34, and W56) corresponding to pairs of the W1–W6 half-states described previously.

In the IO state, RDY is asserted unless the selected peripheral deasserts CTRL0, the I/O ready line, thereby inserting a wait state.

In the RAMRD state, RDY is asserted immediately because all RAM reads require only one clock cycle. In the RAMWR state, RDY is asserted on W34 for byte stores and on W56 for word stores.

Transaction       Cycles   Enables
RAM read byte     1        LXDT
RAM read word     1        LXDT, UXDT
RAM write byte    2        LDT, XDOUTT
RAM write word    3        LDT or UDLDT, XDOUTT
I/O read byte     1+       p/LDT
I/O read word     1+       p/LDT, p/UDT
I/O write byte    1+       LDT, p/LCE
I/O write word    1+       LDT, UDT, p/LCE, p/UCE

Table 3: Depending on the memory transaction, different bus output enables and register clock enables are asserted.

The write controller uses flip-flops W23_45 and W45, which are clocked on CLK falling edges. So, W34 is true during W3 and W4, while W45 is true during W4 and W5. From the W* signals you derive glitch-free control signals XA_0, /WE, /OE, and so on.

The rest of MEMCTRL is straightforward. Note how E encodes (renames) the various peripheral control signals to CTRL15:0.

I technology-mapped some logic using FMAPs. Timing analysis had revealed poor automatic mapping of this logic. This change shaved a few nanoseconds off the critical path.

Now that we've covered the implementation of MEMCTRL, let's turn our attention to peripherals.

PARALLEL PORT I/O

I provided parallel port I/O to communicate with the host. The XS40 board provides eight parallel port data inputs and five status outputs. Reserving a few for debug I/Os, I used six inputs and four outputs.

During lb rd,FF41, the PARIN input peripheral is selected, driving the inputs 00 || PAR_D5:0 onto D7:0 (see Figure 5).

Figure 5: The XIN8 (PARIN) implementation shows the CTRL decoder output LDT that enables the input byte to be driven onto the data bus.

During sb r1,FF21, the PAROUT output peripheral is selected, capturing the store data D3:0 in flip-flops, which drive the PC_S6:3 status outputs. XOUT4 is as simple as XIN8. It has a DCTRL decoder, of course, and clocks D3:0 on LCE (LSB clock enable). This parallel port requires only three CLBs, eight TBUFs, and 10 IOBs!

ON-CHIP RAM

XSOC also includes a 16 × 16-bit RAM peripheral. It uses all of the DCTRL outputs: A4:1 to select the word to read or write, LCE and UCE as lower and upper byte write strobes, and LDT and UDT as lower and upper byte output enables.

Photo 1: Here's the XS40 board, with the project design loaded into the FPGA and running a demo program that's drawing graphics on the monitor.

VIDEO CONTROLLER

The bit-mapped video controller, based on ideas from [1], displays all 32 KB of external SRAM at 576 × 455 resolution, monochrome.

It runs autonomously from the CPU, and so is not a peripheral on the on-chip bus. It uses DMA to fetch video data, which consumes about 10% of memory bandwidth.

A video signal is a series of frames; each frame is a series of lines, and each line is a series of pixels. The video controller fetches 16-pixel words of video memory, shifts the pixels out serially, and uses horizontal and vertical sync pulses to format the pixels into frames and lines for the monitor.

To generate VGA-compatible horizontal and vertical sync timings, the VGA controller shifts pixels out at 24 MHz, twice the system clock rate, shifting one out when CLK is high and a second when it is low. The horizontal and vertical sync pulses are advanced a few clocks (lines) to center the display in the frame (see Table 5).

The VGA ports are described in Table 6. The first five ports request new pixel data via the DMA controller. The rest are the VGA video outputs. The red, green, and blue intensities R1, R0, G1, G0, B1, and B0 drive resistor-based 2-bit D/A converters, providing up to 64 colors (4 × 4 × 4). However, at this resolution, with 32 KB of RAM, you can only support a monochrome (1-bit/pixel) display. So, each pixel bit drives all six outputs, drawing black or white pixels.

To generate horizontal and vertical syncs and a video blanking signal, you need a 9-bit horizontal cycle counter and a 10-bit vertical line counter.

After 288 clocks, it's time to blank the video. Assert horizontal sync after 308 clocks, deassert it after 353, and reset the counter and re-enable video after 381 clocks (one line).

In the vertical direction, the VGA controller must blank video after 455 lines, assert vertical sync after 486 lines, deassert it after 488 lines, and reset the counter, re-enable video, and reset the video DMA address counter after 528 lines.

The simplest way to build each counter is with a Xilinx library binary counter, such as a CC16RE. But because I had just about filled the FPGA, and because they're cool, I designed a more compact 10-bit linear feedback shift register (LFSR) counter. This uses a 10-bit serial shift register whose input is the XOR of certain shift register output taps.

An n-bit LFSR repeats every 2^n − 1 cycles, but you can make an arbitrary m-cycle counter by complementing the LFSR input bit, thereby short-circuiting the full sequence when a particular bit pattern is recognized. My LFSR counter design program can be downloaded from the Circuit Cellar web site.

Referring to Figure 7, note the video controller contains two LFSR counters, H and V. Each has four comparators to compare the LFSR bit patterns to the count patterns output by my program.

Each of the J-K flip-flops HENN, NHSYNC, VEN, and NVSYNC is set on reaching one counter value and reset on reaching another.
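The short-circuiting idea can be tried out in a few lines of Python. This is a sketch under stated assumptions, not the LFSR counter design program mentioned above: the feedback taps (bits 9 and 6, one maximal-length choice for a 10-bit register), the seed, the use of a 10-bit register for both counters, and the names `lfsr_next`, `design_counter`, and `measure_period` are all chosen for illustration. It walks the full sequence, then searches for a state at which complementing the input bit closes a loop of exactly m states.

```python
WIDTH = 10
MASK = 2 ** WIDTH - 1


def lfsr_next(state, complement=False):
    """Shift left by one; the new bit 0 is the XOR of bits 9 and 6.

    These taps are an assumption for this sketch (a maximal-length choice
    for 10 bits); the article does not say which taps its counters use.
    """
    bit = ((state >> 9) ^ (state >> 6)) & 1
    if complement:
        bit ^= 1
    return (state * 2 | bit) & MASK


def design_counter(m, seed=1):
    """Find (start, pattern) so that complementing the input bit whenever
    the register equals `pattern` yields a counter that repeats every m cycles.
    """
    order, position = [], {}
    state = seed
    while state not in position:          # enumerate the full LFSR sequence
        position[state] = len(order)
        order.append(state)
        state = lfsr_next(state)
    period = len(order)
    for k, pattern in enumerate(order):
        start = lfsr_next(pattern, complement=True)
        if start not in position:         # jump would land in the all-zero lockup state
            continue
        loop_length = (k - position[start]) % period + 1
        if loop_length == m:
            return start, pattern
    raise ValueError(f"no {m}-cycle counter found")


def measure_period(start, pattern):
    """Count cycles until the short-circuited counter returns to `start`."""
    state, cycles = start, 0
    while True:
        state = lfsr_next(state, complement=(state == pattern))
        cycles += 1
        if state == start:
            return cycles


if __name__ == "__main__":
    for m in (381, 528):                  # clocks per line and lines per frame
        start, pattern = design_counter(m)
        print(m, f"{start:010b}", f"{pattern:010b}", measure_period(start, pattern))
```

In hardware terms, `pattern` plays the role of one comparator: when the register matches it, the input bit is complemented and the sequence jumps back to `start`. Further comparators on other states of the same loop would mark events such as blanking and sync.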
NHSYNC is asserted low during clocks 308–353, and NVSYNC during lines 486–488. HEN is the pipelined horizontal video enable, and VEN is the vertical video enable. When both are true, you fetch and shift out video data.

In the video datapath, each clock shifts out two bits of video data. Every eight clocks, WORD goes true, and it requests a new 16-bit word of video data from memory. REQ is asserted, registering a pending DMA transfer with the CPU.

Five or fewer clocks later, the CPU performs the DMA load, asserting ACK. The video data word is latched in the PIXELS staging register. On the eighth clock, this word is loaded into the PMUX 8 × 2 parallel-load serial-out shift register.

Two bits shift out of PMUX during each clock, and feed a 2–1 mux that drives the 1-bit pixel each half clock.

SYSTEM BRING-UP

After designing the CPU, I designed a simple test-fixture using on-chip ROM and ran my test programs in the Foundation simulator.

After simulating test programs for hundreds of cycles, I compiled the design using the Xilinx tools and tested it on my XS40 board. Using a parallel port output for CLK, I wrote shell scripts to single-step the processor and observe PC7:1 on the LEDs. Later, I ran the CPU at up to 20 MHz.

Starting from a core set of working instructions, it was easy to test the rest, one at a time. If something went awry, I could do a binary search for the problem, insert a stop: goto stop; breakpoint into my test, recompile, and download. A real remote debugger would be nice!

Armed with a working CPU, it is easy to add and test new features, one by one. I added double-cycled reads from external RAM, then MEMCTRL, then LED output registers. Writing text messages to the seven-segment LED was a big milestone. RAM writes were next. And, late in the project I added DMA, the video controller, and interrupts.

I want to emphasize the importance of thorough testing. You have your work cut out for you when properly testing a pipelined processor and an SoC.

This has been a proof-of-concept project, and I have focused on design issues. To ship something like this, you would need to budget as much or more time for validation as for the design and implementation.

The final system floorplan, as placed on our 14 × 14 CLB FPGA, is shown in Figure 3.

SERIES WRAP-UP

In this three-part series, I have presented the complete design and implementation of a real, full-featured, pipelined microprocessor and an integrated System-on-a-Chip. I designed a new instruction set, ported a C compiler, and discussed how to …

Figure 4: The processor (P) issues requests to MEMCTRL, accessing instruction and data via the on-chip bus D15:0 or external SRAM. Integrated peripherals provide parallel port I/O and on-chip RAM. The VGA controller fetches pixel data via DMA.

Port      Description
PIX15:0   next 16-bit pixel word
REQ       request DMA of next word
RESET     reset DMA address counter
ACK       DMA acknowledge input
CLK       system clock
R1,R0     2-bit red intensity
G1,G0     2-bit green intensity
B1,B0     2-bit blue intensity
NHSYNC    active-low horizontal sync
NVSYNC    active-low vertical sync

Quantity                       Value
two-pixel clock                83.3 ns
one-pixel half-clock           41.7 ns
visible pixels/line            576
visible clocks/line            288
horizontal sync "on" clock     308
horizontal sync "off" clock    353
line total clocks              381
line time                      31.8 µs
visible lines/frame            455
vertical sync "on" line        486
vertical sync "off" line       488
frame total lines              528
frame time                     16.8 ms

Tables 5 & 6: The 12-MHz clock and 24-MHz pixel shift frequency determine the pixels per line and lines per frame, as well as the horizontal and vertical counter values for sync and blanking events.
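The timing entries in Tables 5 & 6 follow from the 12-MHz clock and the counter values. As a quick cross-check, the arithmetic can be reproduced in Python. The variable names are illustrative, and the refresh-rate figure is derived here rather than quoted from the tables.

```python
CLK_MHZ = 12                 # system clock
CLOCKS_PER_LINE = 381        # line total clocks
LINES_PER_FRAME = 528        # frame total lines
VISIBLE_CLOCKS = 288         # visible clocks/line
VISIBLE_LINES = 455          # visible lines/frame
PIXELS_PER_CLOCK = 2         # one pixel per half clock

clock_ns = 1000 / CLK_MHZ                        # two-pixel clock period
pixel_ns = clock_ns / PIXELS_PER_CLOCK           # one-pixel half-clock
line_us = CLOCKS_PER_LINE / CLK_MHZ              # line time in microseconds
frame_ms = LINES_PER_FRAME * line_us / 1000      # frame time in milliseconds
refresh_hz = 1000 / frame_ms                     # derived, not from the tables

visible_pixels = VISIBLE_CLOCKS * PIXELS_PER_CLOCK * VISIBLE_LINES
ram_bits = 32 * 1024 * 8                         # 32-KB frame buffer, 1 bit/pixel

if __name__ == "__main__":
    print(f"clock {clock_ns:.1f} ns, pixel {pixel_ns:.1f} ns")
    print(f"line {line_us:.2f} us, frame {frame_ms:.2f} ms, {refresh_hz:.1f} Hz")
    print(f"{visible_pixels} visible pixels, {ram_bits} bits of RAM")
```

The 576 × 455 display uses 262,080 of the 262,144 bits in the 32-KB RAM, which is why the screen can show essentially all of memory.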
Figure 6: The memory controller consists of an address decoder, a memory transaction state machine, and miscellaneous on-chip bus and external RAM control logic.

Figure 7: As you can see, the video controller contains two LFSR counters that each have four comparators for comparing the LFSR bit patterns to the count patterns that are output by the program that I wrote.
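To make the memory transaction state machine of Figure 6 concrete, here is a minimal Python sketch of the states MEMCTRL visits and the cycle in which RDY is asserted, following the MEMCTRL DESIGN description and the cycle counts in Table 3. The generator name `transaction_cycles`, its arguments, and the modeling of I/O wait states as a simple count are assumptions for illustration; the real controller is a schematic, not software.

```python
def transaction_cycles(kind, word=False, io_wait_states=0):
    """Yield (state, rdy) for each clock cycle of one memory transaction.

    kind: "io", "ram_read", or "ram_write".
    word: True for a 16-bit access, False for a byte access.
    io_wait_states: cycles in which a peripheral holds off RDY by
                    deasserting the I/O ready line CTRL0 (hypothetical knob).
    """
    if kind == "io":
        for _ in range(io_wait_states):
            yield ("IO", False)        # peripheral not ready: wait state
        yield ("IO", True)
    elif kind == "ram_read":
        yield ("RAMRD", True)          # byte or word: one clock cycle either way
    elif kind == "ram_write":
        yield ("W12", False)           # half cycles W1, W2
        if word:
            yield ("W34", False)       # half cycles W3, W4
            yield ("W56", True)        # half cycles W5, W6
        else:
            yield ("W34", True)        # byte store completes here
    else:
        raise ValueError(f"unknown transaction kind: {kind}")


if __name__ == "__main__":
    for kind, word in (("ram_read", True), ("ram_write", False),
                       ("ram_write", True), ("io", False)):
        states = list(transaction_cycles(kind, word))
        print(kind, "word" if word else "byte", states)
```

Counting the yielded cycles reproduces the Cycles column of Table 3: one for any RAM read, two for a RAM byte write, three for a RAM word write, and one or more for I/O.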
Jan Gray is a software developer whose products include a leading C++ compiler. He has been building FPGA processors and systems since 1994, and he now designs for Gray Research LLC.

Please note that I do not warrant that you have the right to build something based upon the ideas discussed in this series of articles under the relevant intellectual property laws in your jurisdiction.

SOFTWARE

You may download more information, including specifications, source code, schematics, and links to related sites from the Circuit Cellar web site.

REFERENCE

[1] VGA Signal Generation with the XS Board, XESS App Note

SOURCES

XESS XS40-005XL

FPGAs, Student Edition tools
Xilinx, Inc.
(408) 559-7778
Fax: (408) 559-7114

© Circuit Cellar, The Magazine for Computer Applications. Reprinted with permission. For subscription information call (860) 875-2199.