# Software Radio Architecture P13

Shared by: Tien Van Van | Date: | File type: PDF | Pages: 30

## Document description

Performance Management The material covered in this chapter can reduce DSP hardware costs by a factor of 2 : 1 or more. Thus it is pivotal and in some sense the culmination of the SDR design aspects of this text. I. OVERVIEW OF PERFORMANCE MANAGEMENT Resources critical to software radio architecture include I/O bandwidth, memory, and processing capacity.

## Text content: Software Radio Architecture P13

Software Radio Architecture: Object-Oriented Approaches to Wireless Systems Engineering, Joseph Mitola III. Copyright © 2000 John Wiley & Sons, Inc. ISBNs: 0-471-38492-5 (Hardback); 0-471-21664-X (Electronic)

13 Performance Management

The material covered in this chapter can reduce DSP hardware costs by a factor of 2:1 or more. Thus it is pivotal and in some sense the culmination of the SDR design aspects of this text.

I. OVERVIEW OF PERFORMANCE MANAGEMENT

Resources critical to software radio architecture include I/O bandwidth, memory, and processing capacity. Good estimates of the demand for such resources result in a well-informed mapping of software objects to heterogeneous multiprocessing hardware. Depending on the details of the hardware, the critical resource may be the capacity of the embedded processor(s), memory, bus, mass storage, or some other input/output (I/O) subsystem.

A. Conformable Measures of Demand and Capacity

MIPS, MOPS, and MFLOPS are not interchangeable. Many contemporary processors, for example, include pipelined floating-point arithmetic or single-instruction FFT butterfly operations. These operations require processor clock cycles. One may, however, express demand in the common measure of millions of operations per second (MOPS), where an operation is the average work accomplished in a single clock cycle of an SDR word width and operation mix. Although software radios may be implemented with 16-bit words, this requires systematic control of dynamic range in each processing stage (e.g., through automatic gain control and other normalization functions). Thus 32-bit equivalent words provide a more useful reference point, in spite of the fact that FPGA implementations use limited-precision arithmetic for efficiency.
The mix of computation (e.g., filtering) versus I/O (e.g., for a T1 multiplexer) depends strongly on the radio application, so this chapter provides tools for quantitatively determining the instruction mix for a given SDR application. One useful generalization for the mix is that the RF conversion and modem segments are computationally intensive, dominated by FIR filtering and frequency translation. Another is that the INFOSEC and network segments are dominated by I/O or bitstream functions. Those protocols with elaborate networking and error control may be dominated by bitstream functions. Layers of packetization may be dominated by packing and unpacking bitstreams using protocol state machines.
MIPS and MFLOPS may both be converted to MOPS. In addition, 16-bit, 32-bit, 64-bit, and extended-precision arithmetic mixes may also be expressed in Byte-MOPS, MOPS times bytes transformed by the operation. Processor I/O, DMA, auxiliary I/O throughput, memory, and bus bandwidths may all be expressed in MOPS. In this case the operand is the number of bytes in a data word and the operation is store or fetch.

A critical resource is any computational entity (CPU, DSP unit, floating-point processor, I/O bus, etc.) in the system. MOPS must be accumulated for each critical resource independently. Finally, software demand must be translated rigorously to equivalent MOPS. Benchmarking is the key to this last step. Hand-coded assembly-language algorithms may outperform high-order language (HOL) code (e.g., Ada or C) by an order of magnitude. In addition, hand-coded HOL generally outperforms code-generating software tools, in some cases by an order of magnitude. Rigorous analysis of demand and capacity in terms of standardized MOPS per critical resource yields useful predictions of performance. Initial estimates generated during the project-planning phase are generally not more accurate than a factor of two. Thus, one must sustain the performance management discipline described in this chapter throughout the project in order to ensure that performance budgets converge so that the product may be delivered on time and within specifications.

B. Initial Demand Estimates

Table 13-1 illustrates how design parameters drive the resource demand of the associated segment. The associated demand may exceed the capacity of today's general-purpose processors. But the capacity estimates help identify the class of hardware that best supports a given segment. One may determine the number of operations required per point for a point operation such as a digital filter.
One hundred operations per point is representative for a high-quality frequency translation and FIR filter, for example. One then multiplies by the critical parameter shown in the table to obtain a first cut at processing demand. Multiplying the sampling rate of the stream being filtered times 100 quickly yields a rough order of magnitude demand estimate. Processing demand depends, to a first-order approximation, on the signal bandwidths and on the complexity of key operations within the IF, baseband, bitstream, and source segments as follows:

    D = Dif + N × (Dbb + Dbs + Ds) + Doh

where:

    Dif = Wa × (G1 + G2) × 2.5
    Dbb = Wc × (Gm + Gd)
    Dbs = Rb × G3 × (1/r)
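The demand formula above can be sketched as a short calculation. This is a minimal illustration, not code from the text; the function name and argument order are assumptions of this sketch, and the numeric check uses the IF values that appear in Table 13-2 (25 MHz access bandwidth, G1 = G2 = 100 OPS/sample):

```python
def aggregate_demand(wa, g1, g2, n, wc, gm, gd, rb, g3, r, ds, doh):
    """Aggregate SDR demand in OPS: D = Dif + N*(Dbb + Dbs + Ds) + Doh."""
    dif = wa * (g1 + g2) * 2.5      # IF: service-band and channel filtering
    dbb = wc * (gm + gd)            # baseband: modulation + demodulation
    dbs = rb * g3 * (1.0 / r)       # bitstream: FEC etc., scaled by code rate r
    return dif + n * (dbb + dbs + ds) + doh

# IF-only check with the illustrative values: 25e6 * (100 + 100) * 2.5 = 12.5 GOPS
dif_only = aggregate_demand(25e6, 100, 100, 0, 0, 0, 0, 0, 0, 1, 0, 0)
print(dif_only)  # 12500000000.0
```

Setting N = 0 isolates the IF term, reproducing the 12.5 GOPS entry of Table 13-2.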
TABLE 13-1 Illustrative Functions, Segments, and Resource Demand Drivers

| Application | Radio Function | Segment | First-Order Demand Drivers |
|---|---|---|---|
| Analog speech | Companding | Source | Speech bandwidth (Wv) and sampling rate |
| | Gap suppression | Bitstream | Gap identification algorithm complexity |
| | FM modulation | Baseband | Interpolation required (Wfm/Wv) |
| | Up conversion | IF | IF carrier and FM bandwidth: fi, Wi = Wfm |
| Receiver | Band selection | IF | Access bandwidth (Wa) |
| | Channel selection | IF | Channel bandwidth (Wc) |
| | FM demodulation | Baseband | fi, Wi |
| | DS0 reconstruction | Bitstream | Speech bandwidth; vocoder |
| TDMA/TDM | Voice codec | Source | Voice codec complexity |
| | FEC coding | Bitstream | Code rate; block vs. convolutional |
| | Framing | Bitstream | Frame rate (Rf); bunched vs. distributed |
| | MSK modulation | Baseband | Baud rate (Rb) |
| | Up conversion | IF | fi, Wi + Rb/2 |
| | Band selection | IF | Access bandwidth (Wa) |
| | Channel selection | IF | Channel bandwidth (Wi = Wc) |
| | Demodulation | Baseband | Baud rate (Rb) or channel bandwidth (Wc) |
| | Demultiplexing | Bitstream | Frame rate (Rf) |
| | FEC decoding | Bitstream | Code rate |
| CDMA | Voice codec | Source | Choice of voice codec |
| | FEC coding | Bitstream | Code rate |
| | Spreading | Baseband | Chip rate (Rc) |
| | Up conversion | IF | Wc, fi, Rc |
| | Band selection | IF | Wc, fi, Rc |
| | Despreading | Baseband | Chip rate (Rc) |
| | FEC decoding | Bitstream | Code rate |

D is aggregate demand (in standardized MOPS). Dif, Dbb, Dbs, and Ds are the IF, baseband, bitstream, and source processing demands, respectively. Doh is the management overhead processing demand. Wa is the bandwidth of the accessed service band. G1 is the per-point complexity of the service-band isolation filter. G2 is the complexity of subscriber channel-isolation filters. N is the number of subscribers. Wc is the bandwidth of a single channel. Gm is the complexity of modulation processing and filtering. Gd is the complexity of demodulation processing (carrier recovery, Doppler tracking, soft decoding, postprocessing for TCM, etc.).
Rb is the data rate of the (nonredundant) bitstream. The code rate is r. G3 is the per-point complexity of bitstream processing per channel (e.g., FEC). Table 13-2 shows how parameters of processing demand are related in an illustrative application.

This real-time demand must be met by processors with sufficient capacity to support real-time performance. At present, most IF processing is off-loaded to special-purpose digital receiver chips because general-purpose processors with sufficient MOPS are not yet cost-effective. This tradeoff changes approximately every 18 months in favor of the general-purpose processor. Aggregate baseband and bitstream-processing demand of 4 to 10 MOPS per user is within the capabilities of most DSP chips. Therefore, several tens of subscribers may
TABLE 13-2 Illustrative Processing Demand

| Segment | Parameter | Illustrative Value | Demand |
|---|---|---|---|
| IF | Ws | 25 MHz | |
| | G1 | 100 OPS/sample | Ws × G1 × 2.5 = 6.25 GOPS* |
| | G2 | 100 OPS/sample | Dif = Ws × (G1 + G2) × 2.5 = 12.5 GOPS* |
| Baseband | N | 30/cell site | |
| | Wc | 30 kHz | |
| | Gm | 20 OPS/sample | Wc × Gm = 0.6 MOPS |
| | Gd | 50 OPS/sample | Dbb = Wc × (Gm + Gd) = 2.1 MOPS |
| Bitstream | r | 1 b/b | |
| | Rb | 64 kbps | |
| | G3 | 1/8 FLOPS/bps | Dbs = G3 × Rb/r = 0.32 MOPS |
| Source | Ds | 1.6 MIPS/user | N × G4 = 4.02 MIPS per user |
| | | | N × (Wc × (Gm + Gd) + Rb × G3/r + G4) = 120.6 MOPS per cell site |
| Overhead | Do | 2 MOPS | |
| Aggregate | D | | 122.6 MOPS per cell site (excluding IF) |

\* Typically performed in digital ASICs in contemporary implementations.

be accommodated by the highest-performance DSP chips. Aggregate demand of all users of 122.6 MOPS, including overhead, is nominally within the capacity of a quad TMS320C50 board. When multiplexing more than one user's stream into a single processor, memory buffer sizes, bus bandwidth, and fan-in/fan-out may cost additional overhead MOPS.

C. Facility Utilization Accurately Predicts Performance

The critical design parameter in relating processing demand to processor capacity is resource utilization. Resource utilization is the ratio of average offered demand to average effective capacity. When expressed as a ratio of MOPS, utilization applies to buses, mass storage, and I/O as well as to CPUs and DSP chips. The bottleneck is the critical resource that limits system throughput. Identifying the bottleneck requires the analysis and benchmarking described in this chapter. The simplified analysis given above applies if the processor is the critical resource; the designer should work to make it so. Sometimes, however, I/O, the backplane bus, or memory will be the critical resource. The SDR systems engineer must understand these bottlenecks in detail for a given design. The SDR architect must project changes in bottlenecks over time.
The following applies to all such critical resources. Utilization, ρ, is the ratio of offered demand to critical-resource capacity, ρ = D/C, where D is average resource demand and C is average realizable capacity, both in MOPS. Figure 13-1 shows how queuing delay at the resource varies as a function of processor utilization. In a multithreaded DSP, there

Figure 13-1 Facility utilization characterizes system stability.

may be no explicit queues, but if more than one thread, task, user, etc. is ready to run, its time spent waiting for the resource constitutes queuing delay. The curve f(ρ) represents exponentially distributed service times, while g(ρ) represents constant service times. Simple functions like digital filters have constant service times. That is, it takes the same 350 operations every time a 35-point FIR filter is invoked. More complex functions with logic or convergence properties, such as demodulators, are more accurately modeled with exponentially distributed service times.

Robust performance occurs when ρ is less than 0.5, which leaves 50% spare capacity. The undesired events that result in service degradation will occur with noticeable regularity for 0.5 < ρ < 0.75. For ρ > 0.75, the system is generally unstable, with queue overflows regularly destroying essential information. Systems operating in the marginal region will miss isochronous constraints, causing increased user annoyance as ρ increases.

An analysis of variance is required to establish the risk that the required time delays will be exceeded, causing an unacceptable fault in the real-time stream. The incomplete Gamma distribution relates the risk of exceeding a specified delay to the ratio of the specification to the average delay. Assumptions about the relationship of the mean to the variance determine the choice of Gamma parameters. Software radios work well if there is a 95 to 99% probability of staying within required performance. A useful rule of thumb sets peak predicted demand at one-third of benchmarked processor capacity: D < C/3. If D is accurate and task scheduling is random, with uniform arrival rates and exponential service times, then on average less than 1% of the tasks will fail to meet specified performance.
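The shape of the f(ρ) and g(ρ) curves in Figure 13-1 can be reproduced with the standard queueing results for exponential versus constant service times. The M/M/1 and M/D/1 waiting-time formulas below, normalized to unit service time, are an assumption of this sketch (the text does not name the underlying queueing models):

```python
def f_exponential(rho):
    """Mean queuing delay for exponentially distributed service times (M/M/1)."""
    return rho / (1.0 - rho)

def g_constant(rho):
    """Mean queuing delay for constant service times (M/D/1): half of M/M/1."""
    return rho / (2.0 * (1.0 - rho))

# Robust (0.5), marginal (0.75), and unstable (0.9) regions from the text
for rho in (0.5, 0.75, 0.9):
    print(rho, round(f_exponential(rho), 2), round(g_constant(rho), 2))
```

At ρ = 0.5 an average job waits one full service time; by ρ = 0.9 it waits nine, which is why the text places the stability boundary near ρ = 0.75.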
Figure 13-2 Four-step performance management process.

Simulation and rapid prototyping refine the estimates obtained from this simple model. But there is no free lunch. SDRs require three to four times the raw hardware processing capacity of ASICs and special-purpose chips. SDRs therefore lag special-purpose hardware implementations by about one hardware generation, or three to five years. Thus, canonical software-radio architectures have appeared first in base stations employing contemporary hardware implementations. The process for managing performance of such multichannel, multithreaded, multiprocessing systems is now defined.

II. PERFORMANCE MANAGEMENT PROCESS FLOW

The performance management process consists of the four steps illustrated in Figure 13-2. The first step is the identification of the system's critical resources. The critical resource model characterizes each significant processing facility, data flow path, and control flow path in the system. Sometimes, the system bottleneck can be counterintuitive. For example, in one distributed-processing command-and-control (C2) system, there were two control processors, a Hardware Executive (HE) and a System Controller (SC). The system had unstable behavior in the integration laboratory and thus was six months late. Since I was the newly assigned system engineer for a derivative (and more complex) C2 system, I investigated the performance stability problems of the baseline system. Since the derivative system was to be delivered on a firm-fixed-price commercial contract, the analysis was profit-motivated. The timing and loading analyses that had been created during the proposal phase of the project over two years earlier were hopelessly irrelevant. Consequently, we
specification could state "Response time to operator commands shall be two seconds or less 95% of the time given a throughput of N." In this case, the system is both functionally acceptable and passes its acceptance test because the throughput condition and statistical structure of the response time are accurately reflected in the specification. The author has been awarded more than one cash bonus in his career for using these techniques to deliver a system that the customer found to be stable and robust and that the test engineers found easy to sell off. These proven techniques are now described.

III. ESTIMATING PROCESSING DEMAND

To have predictable performance in the development of an SDR, one must first know how to estimate processing demand. This includes both the mechanics of benchmarking and the intuition of how to properly interpret benchmarks. The approach is introduced with an example.

A. Pseudocode Example—T1 Multiplexer

In the late 1980s, there was a competitive procurement for a high-end military C2 system. The evolution of the proposal included an important case study in the estimation of processing demand. The DoD wanted 64 kbps "clear" channels over T1 lines. There was a way to do this with a customized T1 multiplexer board, a complex, expensive hardware item that was available from only one source. The general manager (GM) wanted a lower-cost approach. I suggested that we consider doing the T1 multiplexer (mux) in software. Literally on a napkin at lunch, I wrote the pseudocode and created the rough order of magnitude (ROM) loading analysis shown in Figure 13-3. The T1 multiplexer is a synchronous device that aggregates 24 parallel channels of DS0 voice into a single 1.544 Mbps serial stream. The companion demultiplexer extracts 24 parallel DS0 channels from a single input stream. DS0 is sampled at 8000 samples per second and coded in companded 8-bit bytes.
This generates 8000 times 24 bytes, or 192,000 bytes per second, of input to a software mux. The pseudocode consists of the inner loop of the software mux or demux. Mux and demux are the same except for the addressing of the "get byte from slot" and "put byte" instructions. Adding up the processing in the inner loop, there are 15 data-movement instructions to be executed per input byte. Multiplying this complexity per byte times the 192,000-byte data rate yields 2.88 MIPS. In addition to the mux functions, the multiplexer board maintained synchronization using a bunched frame alignment word in the spirit of the European E1 or CEPT mux hierarchy. The algorithm to test and maintain synchronization consumed an order of magnitude fewer resources than the mux, so it was not included in this initial analysis.

Again, MIPS, MOPS, and MFLOPS are not interchangeable. But given that this is being done on a napkin over lunch, it is acceptable to use MIPS as a ROM estimate of processing demand. The capacity of three then-popular
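The napkin arithmetic above reduces to two multiplications. The constants are the ones quoted in the text; the variable names are this sketch's own:

```python
DS0_RATE = 8000          # samples (bytes) per second per DS0 channel
CHANNELS = 24            # DS0 channels in a T1 frame
INSTR_PER_BYTE = 15      # data-movement instructions in the mux inner loop

bytes_per_sec = DS0_RATE * CHANNELS             # 192,000 bytes per second
rom_demand_mips = bytes_per_sec * INSTR_PER_BYTE / 1e6
print(rom_demand_mips)   # 2.88, the ROM demand estimate from the text
```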
Figure 13-3 Specialized T1-mux ROM feasibility analysis.

single-board computers is also shown in Figure 13-3, also in MIPS as published by the manufacturer. Dividing the demand by the capacity yields the facility utilization. This is also the amount of time that the central processing unit (CPU) accomplishes useful work in each second, CPU-seconds of work per second. The VAX was projected to use 350 milliseconds per second, processing one real-time T1 stream more or less comfortably. The VAX was also the most expensive processor. The Sun was also within real-time constraints, but only marginally. The Gould machine was not in the running. However, that answer was unacceptable because the Gould processor was the least expensive of the single-board computers. The GM therefore liked the Gould the best. But the process of estimating software demand is like measuring a marshmallow with a micrometer. The result is a function of how hard you twist the knob, so we decided we were close enough to begin twisting the knobs.

To refine the lunchtime estimates, we implemented benchmarks. The prototype mux pseudocode was implemented and tested on each machine. The first implementation was constrained by the rules of the procurement to "Ada reuse." This meant that we had to search a standard library of existing Ada code to find a package to use to implement the function. The software team identified a queuing package that could be called in such a way as to implement the mux. The Ada-reuse library is very portable, so it ran on all three machines. We put the benchmark code in a loop that would execute about ten million bytes of data with the operating-system parameters set to run this function in an essentially dedicated machine. The time to process 192,000 bytes (one second of data) is then computed. This time is shown in Figure 13-4 for each of the series of benchmarks that resulted.
If it takes 10 seconds to process one second of data, then it would take at least 10 machines working in parallel to process the stream in real time. The ordinate (vertical axis), which shows the facility utilization of the benchmark, can therefore be viewed as the number of machines needed for real-time performance.

Figure 13-4 Five benchmarks yield successively better machine utilization.

The Ada-reuse approach thus required 2.5 VAX, 6 Sun, and 10 Gould machines for notional real-time performance. Of course, one could not actually implement this application using that number of machines in parallel, because each machine would be 100% utilized, which, as indicated above, cannot deliver robust performance. With proper facility utilization of 50%, it would actually take 5 VAX, 12 Sun, and 20 Gould processors. The purpose of the benchmarks is to gain insights into the feasibility of the approach. Thus one may think of the inverse of the facility utilization as the number of full-time machines needed to do the software work, provided there is no doubt that twice that number would be required for a robust implementation.

The reuse library included run-time error checking with nested subroutine calls. A Pascal style of programming replaces subroutine calls with For-loops and replaces dynamically checked subroutine calling parameters with fixed loop parameters (e.g., Slot = 1, 24). An exemption to the Ada-reuse mandate of the procurement would be required for this approach. The second column in Figure 13-4 shows 0.7 VAX, 3.2 Sun, and 4.5 Gould machines needed for the Ada Pascal style. A VAX, then, can do the job this way in real time, if somewhat marginally. Pascal style is not as efficient as is possible in Ada, however. The In-line style replaces For-loops with explicit code. Thus, for example, the For-loop code segment:

```
For Slot = 1, 24
    X = Get (T1-buffer, Slot);
    Put (X, Channel[Slot]);
End For
```
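The machines-needed bookkeeping from Figure 13-4 can be sketched as follows; the benchmark utilizations are those quoted above, and the doubling rule reflects the 50% utilization target. The dictionary layout and function name are this sketch's assumptions:

```python
import math

# Machines needed for notional (100% utilized) real-time performance, per the
# Ada-reuse benchmark column of Figure 13-4
ada_reuse = {"VAX": 2.5, "Sun": 6.0, "Gould": 10.0}

def robust_count(notional_machines):
    """Double the notional count so each machine runs at no more than 50%."""
    return math.ceil(2 * notional_machines)

for cpu, n in ada_reuse.items():
    print(cpu, robust_count(n))   # VAX 5, Sun 12, Gould 20, as in the text
```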
Figure 13-6 Quantifying performance of a new processor.

accesses 4 bytes, multiplying as shown in the figure normalizes both types to byte-operations per second so they may be added without inconsistency. The result is a representative capacity estimate for the target processor. Consider the following example. Suppose caller identification uses 60% data-movement and 40% floating-point instructions on the library host processor. Suppose further that the new processor requires an estimated 10 ns for data movement and 40 ns for floating-point arithmetic. Capacity, C, of the processor is:

    Data movement:  60% × (1/10 ns) × 2 bytes/instruction
    Floating point: 40% × (1/40 ns) × 4 bytes/instruction

    C = 0.6 × 100 MHz × 2 + 0.4 × 25 MHz × 4 = 160 MByte-OPS/sec

This estimate of capacity is compatible with the way in which the object uses the ISA. The estimates may be validated by computing the MByte-OPS/sec for the original processor and for the original object. Inefficiencies introduced by cache misses, for example, can be calibrated out this way. Suppose a data-movement instruction should take 30 ns on the library host processor. Computing the MIPS of the library object yields a total wall-clock time in the validation step equivalent to 40 ns used per data-movement instruction. This means that the clock time of the target processor should also be multiplied by 4/3. This additional factor times the nominal execution time yields an effective execution time on the new processor that is closer to what one will experience with the rehosted software. The net effect of this analysis is to establish a processing-capacity estimate for a new host processor that is compatible with the way in which the library object uses machine resources.

C. Thread Analysis and Object Load Factors

Often, however, one will not have a comparable library object on which to base the preceding analysis. In such cases, code inspection yields the necessary insights.
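The capacity estimate above is easy to mis-transcribe, so here it is as executable arithmetic. The mix fractions, instruction times, and byte widths come from the example in the text; folding the 4/3 factor into overall capacity at the end is this sketch's simplification of the text's execution-time adjustment:

```python
# Instruction mix on the library host: (fraction, seconds per instruction, bytes moved)
mix = [
    (0.60, 10e-9, 2),   # data movement: 10 ns, 2 bytes per instruction
    (0.40, 40e-9, 4),   # floating point: 40 ns, 4 bytes per instruction
]
capacity = sum(frac * (1.0 / t) * width for frac, t, width in mix)
print(capacity / 1e6)    # ~160.0 MByte-OPS/sec, as in the text

# Calibration: data movement measured at 40 ns instead of the nominal 30 ns,
# so effective throughput shrinks by the 4/3 execution-time factor
calibrated = capacity / (4.0 / 3.0)
print(calibrated / 1e6)  # ~120.0 MByte-OPS/sec effective
```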
Recall the T1 benchmarking example above. At the conclusion of the benchmarking work, each implementation style used a specified fraction of a given processor. This fraction is readily converted into equivalent instructions per invocation by the equation:

    Instructions = (Processor MIPS × Processor fraction) / Invocation rate
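As a worked instance of this conversion (the numbers here are hypothetical, not taken from the text): an implementation style that consumed half of an 8-MIPS processor while being invoked 8,000 times per second implies 500 instructions per invocation.

```python
def instructions_per_invocation(processor_mips, fraction, invocation_rate_hz):
    """Instructions = (Processor MIPS x processor fraction) / invocation rate."""
    return processor_mips * 1e6 * fraction / invocation_rate_hz

# Hypothetical: 8 MIPS processor, 50% consumed, 8,000 invocations per second
print(instructions_per_invocation(8.0, 0.5, 8000))  # 500.0
```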