Model-Based Design for Embedded Systems- P8
Model-Based Design for Embedded Systems- P8: The unparalleled flexibility of computation has been a key driver and feature bonanza in the development of a wide range of products across a broad and diverse spectrum of applications, such as automotive, aerospace, health care, and consumer electronics.
object request broker (HORBA) when the support of small-grain parallelism is needed. Our most recent developments in MultiFlex are mostly focused on the support of the streaming programming model, as well as its interaction with the client–server model. SMP subsystems are still of interest, and they are becoming increasingly well supported commercially [14,21]. Moreover, our focus is on data-intensive applications in multimedia and communications. For these applications, our focus has been primarily on streaming and client–server programming models, for which explicit communication-centric approaches seem most appropriate. This chapter will introduce the MultiFlex framework specialized at supporting the streaming and client–server programming models. However, we will focus primarily on our recent streaming programming model and mapping tools.

7.2.1 Iterative Mapping Flow

MultiFlex supports an iterative process, using initial mapping results to guide the stepwise refinement and optimization of the application-to-platform mapping. Different assignment and scheduling strategies can be employed in this process. An overview of the MultiFlex toolset, which supports the client–server and streaming programming models, is given in Figure 7.2.

FIGURE 7.2 MultiFlex toolset overview. [Figure: client–server and streaming application specifications with core C functions are translated to an intermediate representation (IR); the IR is mapped, transformed, and scheduled under application constraints and an abstract platform specification, using static and dynamic tools plus visualization and performance analysis, down to component assembly on target video, mobile, and multimedia platforms.]

The design methodology requires three inputs:
MPSoC Platform Mapping Tools for Data-Dominated Applications

• The application specification—the application can be specified as a set of communicating blocks; it can be programmed using the streaming model or client–server programming model semantics.
• Application-specific information (e.g., quality-of-service requirements, measured or estimated execution characteristics of the application, data I/O characteristics, etc.).
• The abstract platform specification—this information includes the main characteristics of the target platform that will execute the application.

An intermediate representation (IR) is used to express the high-level application in a language-neutral form. It is translated automatically from one or more user-level capture environments. The internal structure of the application capture is highly inspired by the Fractal component model [23]. Although we have focused mostly on the IR-to-platform mapping stages, we have experimented with graphical capture from a commercial toolset [7], and a textual capture language similar to StreamIt [3] has also been experimented with.

In the MultiFlex approach, the IR is mapped, transformed, and scheduled; finally the application is transformed into targeted code that can run on the platform. There is a flexibility/performance trade-off between what can be calculated and compiled statically, and what can be evaluated at runtime. As shown in Figure 7.2, our approach is currently implemented using a combination of both, allowing a certain degree of adaptive behavior, while making use of more powerful offline static tools when possible. Finally, the MultiFlex visualization and performance analysis tools help to validate the final results or to provide information for the improvement of the results through further iterations.

7.2.2 Streaming Programming Model

As introduced above, the streaming programming model [1] has been designed for use with data-dominated applications.
In this computing model, an application is organized into streams and computational kernels to expose its inherent locality and concurrency. Streams represent the flow of data, while kernels are computational tasks that manipulate and transform the data. Many data-oriented applications can easily be seen as sequences of transformations applied on a data stream. Examples of languages based on the streaming computing model are ESTEREL [4], Lucid [5], StreamIt [3], and Brook [2]. Frameworks for stream computing visualization are also available (e.g., Ptolemy [6] and Simulink [7]). In essence, our streaming programming model is well suited to a distributed-memory, parallel architecture (although mapping is possible on shared-memory platforms), and favors an implementation using software libraries invoked from the traditional sequential C language, rather than proposing language extensions or a completely new execution model.
The entry to the mapping tools uses an XML-based IR that describes the application as a topology with semantic tags on tasks. During the mapping process, the semantic information is used to generate the schedulers and all the glue necessary to execute the tasks according to their firing conditions. In summary, the objectives of the streaming design flow are:

• To refine the application mapping in an iterative process, rather than having a one-way, top-down code generation
• To support multiple streaming execution models and firing conditions
• To support both restricted synchronous data-flow and more dynamic data-flow blocks
• To be controlled by the user to achieve the mechanical transformations, rather than making decisions for the user

We first present the mapping flow in Section 7.3, and at the end of the section, we will give more details on the streaming programming model.

7.3 MultiFlex Streaming Mapping Flow

The MultiFlex technology includes support for a range of streaming programming model variants. Streaming applications can be used alone or in interoperation with client–server applications. The MultiFlex streaming tool flow is illustrated in Figure 7.3. The different stages of this flow will be described in the next sections. The application mapping begins with the assignment of the application blocks to the platform resources. The IR transformations consist mainly of splitting and/or clustering the application blocks; they are performed for optimization purposes (e.g., memory optimization), and they also imply the insertion of communication mechanisms (e.g., FIFOs and local buffers). The scheduling defines the sharing of a processor between several blocks of the application. Most of the IR mapping, transforming, and scheduling is realized statically (at compilation time), rather than dynamically (at runtime).
The methodology targets large-scale multicore platforms including a uniform layered communication network based on STMicroelectronics' network-on-chip (NoC) backbone infrastructure [18] and a small number of H/W-based communication IPs for efficient data transfer (e.g., stream-oriented DMAs or message-passing accelerators [9]). Although we consider our methodology to be compatible with the integration of application-specific hardware accelerators using high-level hardware synthesis, we are not targeting such platforms currently.
FIGURE 7.3 MultiFlex tool flow for streaming applications. [Figure: the application functional capture (filter core C functions, abstract dataflow, communication services) is translated to the IR; MpAssign takes the IR, the application constraints (profiling, communication volume, user assignment directives), and the platform specification (number and types of PEs, communication resources and topology, storage resources) and produces assignment directives; MpCompose then assembles communication components, core functions, and H/W abstraction wrappers from a library for the target video, mobile, or multimedia platform.]

7.3.1 Abstraction Levels

In the MultiFlex methodology, a data-dominated application is gradually mapped on a multicore platform by passing through several abstractions:

• The application level—at this level, the application is organized as a set of communicating blocks. The targeted architecture is completely abstracted.
• The partitioning level—at this level, the application blocks are grouped in partitions; each partition will be executed on a PE of the target architecture. PEs can be instruction-set programmable processors, reconfigurable hardware, or standard hardware.
• The communication level—at this level, the scheduling and the communication mechanisms used on each processor between the different blocks forming a partition are detailed.
• The target architecture level—at this level, the final code executed on the targeted platforms is generated.

Table 7.2 summarizes the different abstractions, models, and tools provided by MultiFlex in order to map complex data-oriented applications onto multiprocessor platforms.
TABLE 7.2 Abstraction, Models, and Tools in MultiFlex

Abstraction Level         | Model                                                                     | Refinement Tool
Application level         | Set of communicating blocks                                               | Textual or graphical front-end
Partition level           | Set of communicating blocks and directives to assign blocks to processors | MpAssign
Communication level       | Set of communicating blocks and required communication components         | MpCompose
Target architecture level | Final code loaded and executed on the target platform                     | Component-based compilation back-end

7.3.2 Application Functional Capture

The application is functionally captured as a set of communicating blocks. A basic (or primitive) block consists of a behavior that implements a known interface. The implementation part of the block uses streaming application programming interface (API) calls to get input and output data buffers to communicate with other tasks. Blocks are connected through communication channels (in short, channels) via their interfaces. The basic blocks can be grouped in hierarchical blocks or composites. The main types of basic blocks supported in the MultiFlex approach are:

• Simple data-flow block: This type of block consumes and produces tokens on all inputs and outputs, respectively, when executed. It is launched when there is data available at all inputs, and there is sufficient free space in downstream components for all outputs to write the results.
• Synchronous client–server block: This block needs to perform one or many remote procedure calls before being able to push data in the output interface. It must therefore be scheduled differently than the simple data-flow block.
• Server block: This block can be executed once all the arguments of the call are available. Often this type of block can be used to model a H/W coprocessor.
• Delay memory: This type of block can be used to store a given number of data tokens (an explicit state).
Figure 7.4 gives the graphical representation of a streaming application capture which interacts with a client–server application. Here, we focus mostly on streaming applications.
FIGURE 7.4 Application functional capture. [Figure: graphical notation for blocks with synchronous dataflow, synchronous client, server, and composite semantics; each block shows its interfaces (init, process, end), data type, state, token rate, and maximum number of elements.]

From the point of view of the application programmer, the first step is to split the application into processing blocks with buffer-based I/O ports. User code corresponding to the block behavior is written using the C language. Using component structures, each block has its private state, and implements a constructor (init), a work section (process), and a destructor (end). To obtain access to I/O port data buffers, the blocks have to use a predefined API. A run-to-completion execution model is proposed as a compromise between programming and mapping flexibility. The user can extend the local schedulers to allow the local control of the components, based on application-specific control interfaces. The dataflow graph may contain blocks that use client–server semantics, with application-specific interfaces, to perform remote object calls that can be dispatched to a pool of servers.

7.3.3 Application Constraints

The following application constraints are used by the MultiFlex streaming tools:

1. Block profiling information: for a given block, this represents the average number of clock cycles required for the block execution on a target processor.
2. Communication volume: the size of data exchanged on each channel.
3. User assignment directives. Three types of directives are supported by the tool:
   a. Assign a block to a specific processor
   b. Assign two blocks to the same processor (can be any processor)
   c. Assign two blocks to any two different processors
7.3.4 The High-Level Platform Specification

The high-level platform specification is an abstraction of the processing, communication, and storage resources of the target platform. In the current implementation, the information stored is as follows:

• Number and type of PEs.
• Program and data memory size constraints (for each programmable PE).
• Information on the NoC topology. Our target platform uses the STNoC, which is based on the "Spidergon" topology [18]. We include the latency measures for single and multihop communication.
• Constraints on communication engines: the number of physical links available for communication with the NoC.

7.3.5 Intermediate Format

MultiFlex relies on intermediate representations (IRs) to capture the application, the constraints, and high-level platform descriptions. The topology of the application—the block declarations and their connectivity—is expressed using an XML-based intermediate format. It is also used to store task annotations, such as the block execution semantics. Other block annotations are used for the application profiling and block assignments. Edges are annotated with the communication volume information. The IR is designed to support the refinement of the application as it is iteratively mapped to the platform. This implies supporting the multiple abstraction levels involved in the assignment and mapping process described in the next sections.

7.3.6 Model Assumptions and Distinctive Features

In this section, we provide more details about the streaming model. This background information will help in explaining the mapping tools in the next section. The task specification includes the data type for each I/O port as well as the maximum amount of data consumed or produced on these ports. This information is an important characteristic of the application capture because it is at the foundation of our streaming model: each task has a known computation grain size.
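A fragment of such an XML-based IR might look as follows; all element and attribute names here are hypothetical, since the actual schema is not given in the text:

```xml
<!-- Hypothetical sketch of the XML-based IR; element and attribute
     names are invented for illustration, not the actual schema. -->
<application name="example">
  <!-- Blocks carry semantic tags and profiling annotations. -->
  <block id="B1" semantics="simple-dataflow" cycles="75">
    <out port="o" type="int16" maxTokens="64"/>
  </block>
  <block id="B2" semantics="server" cycles="40">
    <in port="i" type="int16" maxTokens="64"/>
  </block>
  <!-- Edges are annotated with the communication volume. -->
  <channel from="B1.o" to="B2.i" volume="64"/>
  <!-- Assignment directives added during iterative refinement. -->
  <assign block="B1" pe="PE0"/>
</application>
```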
This means we know the amount of data required to fire the process function of the task for a single iteration without starving on input data, and we know the maximum amount of output data that can be produced each time. This is a requirement for the nonblocking, or run-to-completion, execution of the task, which simplifies the scheduling and communication infrastructure and reduces the system overhead. Finally, we can quantify the computation requirements of each task for a single iteration.
The run-to-completion execution model allows dissociating the scheduling of the tasks from the actual processing function, providing clear scheduling points. Application developers focus on implementing and optimizing the task functions (using the C language), and on expressing the functionality in a way that is natural for the application, without trying to balance the task loads in the first place. This means each task can work on a different data packet size and have a different computation load. The assignment and scheduling of the tasks can be done in a separate phase (usually performed later), allowing the exploration of the mapping parameters, such as the task assignment and the FIFO and buffer sizes, to be conducted without changing the functionality of the tasks: a basic principle to allow correct-by-construction automated refinement.

The run-to-completion execution model is a compromise: it requires more constrained programming but leads to higher flexibility in terms of mapping. However, in certain cases, we have no choice but to support multiple concurrent execution contexts. We use cooperative threading to schedule special tasks that use a mix of streaming and client–server constructs. Such tasks are able to invoke remote services via client–server (DSOC) calls, including synchronous methods (with return values) that cause the caller task to block, waiting for an answer.

In addition, we are evaluating the pros and cons of supporting tasks with unrestricted I/O and very fine-grain communication. To be able to eventually run several tasks of this nature on the same processor, we may need a software kernel or make use of hardware threading if the underlying platform provides it. To be able to choose the correct scheduler to deploy on each PE, we have introduced semantic tags, which describe the high-level behavior type of each task. This information is stored in the IR.
We have defined a small set of task types, previously listed in Section 7.3.2. This allows a mix of execution models and firing conditions, thus providing a rich programming environment. Having clear semantic tags is a way to ensure the mapping tools can optimize the scheduling and communications on each processor, rather than systematically supporting all features and being designed for the worst case. The nonblocking execution is only one characteristic of streaming compared to our DSOC client–server message-passing programming model. As opposed to DSOC, our streaming programming model does not provide data marshaling (although, in principle, this could be integrated in the case of heterogeneous streaming subsystems). When compared to asynchronous concurrent components, another distinction of the streaming model is the data-driven scheduling. In event-based programming, asynchronous calls (of unknown size) can be generated during the execution of a single reaction, and those must be queued. The quantity of events may result in complex triggering protocols to be defined and implemented by the application programmer. This remains a well-known drawback of event-based systems. With the data-flow approach, the
clear data-triggered execution semantics, and the specification of I/O data ports resolve the scheduling, memory management, and memory ownership problems inherent to asynchronous remote method invocations. Finally, another characteristic of our implementation of the streaming programming model, which is also shared with our SMP and DSOC models, is the fact that application code is reused "as is," i.e., no source code transformations are performed. We see two beneficial consequences of this common approach. In terms of debugging, it is an asset, since the programmer can use a standard C source-level debugger to verify the unmodified code of the task core functions. The other main advantage is related to profiling. Once again, it is relatively easy for an application engineer to understand and optimize the task functions with a profiling report, because the source code is untouched.

7.4 MultiFlex Streaming Mapping Tools

7.4.1 Task Assignment Tool

The main objective of the MpAssign tool (see Figure 7.5) is to assign application blocks to processors while optimizing two objectives:

1. Balance the task load on all processors
2. Minimize the inter-processor communication load

FIGURE 7.5 MpAssign tool. [Figure: the application graph (blocks B1–B5 with filter core C functions), the platform specification (number and types of PEs, communication resources and topology, storage resources), and the application constraints (profiling, communication volume, user assignment directives) feed MpAssign, which produces assignment directives placing the blocks onto PE1 and PE2.]
The inter-processor communication cost is given by the data volume exchanged between two processors, related to each task. The tool receives as inputs the application capture, the application constraints, and the high-level platform specification. The output of the tool is a set of assignment directives specifying which blocks are mapped on each processor, the average load of each processor, and the cost for each inter-processor communication. The lower portion of Figure 7.5 gives a visual representation of the MpAssign output. The tool provides the visual display of the resulting block assignments to processors.

The implemented algorithm for the MpAssign tool is inspired from Marculescu's research [10] and is based on graph traversal approaches, where ready tasks with minimal cost variance are assigned iteratively. The two main graph traversal approaches implemented in MpAssign are:

• The list-based approach, using mainly the breadth-first principle—a task is ready if all its predecessors are assigned
• The path-based approach, using mainly the depth-first principle—a task is ready if one predecessor is assigned and it is on the critical path

A cost estimator C(t, p) of assigning a task t on processor p is used. This cost estimator is computed using the following equation:

C(t, p) = w1 ∗ Cproc + w2 ∗ Ccomm + w3 ∗ Csucc    (7.1)

where
• Cproc is the additional average processing cost required when the task t is assigned to processor p
• Ccomm is the communication cost required for the communication of task t with the preceding tasks
• Csucc represents a look-ahead cost concerning the successor tasks, i.e., the minimal cost estimate of mapping a number of successor tasks; this assumes state space exploration for a predefined look-ahead depth
• wi represents the weight associated with each cost factor (Cproc, Ccomm, and Csucc) and indicates the significance of the factor in the total cost C(t, p) as compared with the other factors; the factors are weighted by the designer to set their relative importance

7.4.2 Task Refinement and Communication Generation Tools

The main objective of the MpCompose tool (see Figure 7.6) is to generate one application graph per PE, each graph containing the desired computation blocks from the application, one local scheduler, and the required communication components. To perform this functionality, MpCompose requires the following three inputs:
FIGURE 7.6 MpCompose tool. [Figure: MpCompose takes the application IR (with assignments) and abstract communication services, and produces per-PE graphs in which blocks B1 and B2 on PE1 are connected through local binding LB1/2, while global binding GB1/3 connects B1 on PE1 to B3 on PE2; each PE has a scheduler with control and data interfaces.]

• The application capture
• The platform description
• The set of directives, optionally generated by the MpAssign tool

The MpCompose tool relies on a library of abstract communication services that provide different communication mechanisms that can be inserted in the application graph. Three types of services are currently supported by MpCompose:

1. Local bindings, consisting mainly of a FIFO implemented with memory buffers and enabling intra-processor communication (e.g., block B1 is connected to block B2 via local buffer LB1/2).
2. Global binding FIFOs, which enable inter-processor communication (e.g., block B1 on PE1 communicates with block B3 on PE2 via external buffers GB1/3).
3. A scheduler on each PE, which is configurable in terms of number and types of blocks and which enables the sharing of a processor between several application blocks.

A set of libraries is used to abstract part of the platform and provide communication and synchronization mechanisms (point-to-point communication, semaphores, access to shared memory, access to I/O, etc.). The various FIFO components have a default depth, but these are configuration values that can be changed during the mapping. Since we support custom data types for I/O port tokens, each element of a FIFO has a size that matches the data type and maximum size specified in the intermediate format.
There is no global central controller: a local scheduler is created on each processor. This component is the main controller and has access to the control interface of all the components it is responsible for scheduling. The proper control interface for each filter task is automatically added, based on the type of filter specified in the application IR, and connected to the scheduler. The implementations of the schedulers are partly generated; for example, the list of filter tasks (a static list) and some setup code for the hardware communication accelerators are automatically created. The core scheduling function can be pulled from a library or customized by the application programmer.

The output of MpCompose is a set of component descriptions, one for each processor. From the point of view of the top-level component definitions, these components are not connected together; however, communicating processors use the platform-specific features to actually implement the buffer-based communication at runtime. The set of independent component definitions allows a monoprocessor component-based infrastructure to be used for compilation.

7.4.3 Component Back-End Compilation

Starting from the set of processor graphs, the component back-end generates the targeted code that can run on the platform. MultiFlex tools currently target the Fractal component model, and more specifically its C implementation [19]. Even though this toolset supports features such as a binding controller and a life-cycle manager to allow dynamic insertion/removal of components in the graph at runtime, we are not currently using any of the dynamic features of components, such as runtime elaboration, introspection, etc., mainly for code size reasons. Nevertheless, we expect multimedia application requirements to push toward this direction.
Until then, we mainly use the component model as a back-end to represent the software architecture to be built on each processor. MpCompose generates one architecture (.fractal) file describing the components and their topology for each CPU. The Fractal tools will generate the required C glue code to bind components and to create block instance structures, and will compile all the code into an executable for the specified processor by invoking the target cross-compiler. This build process is invoked for each PE, thus producing a binary for each processor.

7.4.4 Runtime Support Components

The main services provided by the MultiFlex components at runtime are scheduling and communication. The scheduler in fact controls both the communication components and the application tasks. The scheduler interleaves communication and processing at the block level. For each input port, the scheduler checks if there is available data in the local memory. If not, it checks if the input FIFO is empty. If not, the scheduler orders the input FIFO to perform the transfer into local memory. This is
typically done by coprocessors such as DMAs or specialized hardware communication engines. While the transfer occurs, the scheduler can manage other tasks. In the same manner, it can look for previously produced output data ready to be transmitted from local memory to another processor, using an output FIFO. Tasks with more dynamic (data-dependent) behaviors may produce less data than their allowed maximum, including no data at all. If a task is ready to execute, the scheduler simply calls its process function in the same context. The user tasks make use of an API that is based on pointers; thus we avoid data copies between the tasks and the local queues managed by the scheduler. So, in a nutshell, the run-to-completion model allows the scheduler to run ready tasks and manage input and output data consumed or produced by the tasks, while allowing data transfers to take place in parallel, thus overlapping communication and processing without the need for threading. The tasks can have different computation and communication costs: the mapping tools will help to balance the overall task load between processors, with the objective to keep the streaming fabric busy and the latency minimized.

7.5 Experimental Results

7.5.1 3G Application Mapping Experiments

In this section, we present mapping results using the MpAssign tool on an application graph having the characteristics of a 3G WCDMA/FDD base-station application from [13]. The block diagram of this application is presented in Figure 7.7 and contains two main chains: transmitter (tx) and receiver (rx).

FIGURE 7.7 3G application block diagram. [Figure: the transmit chain contains duplicated tx_crc (70) and tx_vtc (75) blocks feeding tx_rm (40), tx_fi (80), tx_rfs (30), tx_si (80), and tx_sm (170); the receive chain contains rx_rake (175), rx_si (80), rx_rm (40), rx_rfa (30), and rx_fi (80) feeding duplicated rx_vtd (205) and rx_crc (70) blocks; edges are labeled a–d with the communication volumes given below.]

The blocks
The communication volumes are a = 260, b = 3136, c = 1280, and d = 768.
are annotated with numbers that represent the estimated processing load, while each edge has an estimated communication volume given in the figure caption. These numbers are extracted from [13], where the computation cost corresponds to a latency (in microseconds) for a PE to execute one iteration of the corresponding functional block, while the edge cost corresponds to the volume of data (in 16-bit words) transferred at each iteration between the connected functional blocks.

A manual and static mapping of this application is presented in [13], using a 2D-mesh of 46 PEs, where each PE executes only one of the functional blocks, some of which are duplicated to expose more potential parallel processing. We use this example in this chapter mainly for illustrative purposes, to show that MpAssign can be used to automatically explore different mappings where, optionally, multiple functional blocks can be mapped on the same PE to balance the processing load. To expose more potential parallel processing, we create a set of functionally equivalent application graphs of the above reference application in which we duplicate the transmitter and receiver processing chains several times. In our experiments, four versions have been explored:

• v1: 1 transmitter and 1 receiver (original reference application)
• v2: 2 transmitters and 2 receivers
• v3: 3 transmitters and 3 receivers
• v4: 4 transmitters and 4 receivers

Version v1 will be mapped on a 16-processor architecture (v1/16). Version v2 will be mapped on a 16-processor architecture (v2/16) and a 32-processor architecture (v2/32). Version v3 will be mapped on a 32-processor architecture (v3/32) and a 48-processor architecture (v3/48). Version v4 will be mapped on a 48-processor architecture (v4/48). This results in six different mapping configurations (v1/16, v2/16, v2/32, v3/32, v3/48, v4/48) to explore.
For the experiments, we suppose that each PE can execute any of the functional blocks, and that the NoC connecting all the PEs is the STMicroelectronics Spidergon [18]. As described in section "Task Assignment Tool," our mapping heuristic allows exploring different solutions in order to find a good compromise between communication and PE load balancing. These different solutions can be obtained by varying the parameters w1, w2, and w3 (see Equation 7.1): a high value of w1 promotes solutions with good load balancing, while a high value of w2 promotes solutions with minimal communications. The parameter w3, which favors the selection based on an optimistic look-ahead search, will be fixed at 100. For our experiments, three combinations of w1 and w2 will be studied:
• c1 (w1 = 1000, w2 = 10): This weight combination tends to maximize the load balancing.
• c2 (w1 = 100, w2 = 100): This weight combination tends to balance load and communications.
• c3 (w1 = 10, w2 = 1000): This weight combination tends to minimize the communications.

Each of the six configurations described above will be tested with these three weight parameter combinations, which results in a total of 18 experiments. For each experiment, we will extract the following statistics:

• Load variance (LV), given by Equation 7.2, where for each mapping solution x, load(PE_i) is the sum of the costs of the tasks assigned to PE_i, avgload is the average load defined as the sum of all task costs divided by the number of PEs, and p is the number of PEs.

LV(x) = \frac{\sum_{i=0}^{p-1} (load(PE_i) - avgload)^2}{p}    (7.2)

• Maximal load (ML), defined as max(load(PE_i)), where 0 ≤ i ≤ p − 1.
• Total communication (TC), given by the sum of each edge cost times the NoC distance of the route related to that edge.
• Maximal communication (MC), the maximum communication cost found between any two PEs.

The LV statistic gives an approximation of the quality of the load balancing. The ML statistic is related to a lower bound on the application performance after mapping, since the application throughput depends on the slowest PE. The MC statistic also gives a lower bound on the application performance, but this time with respect to the worst case of communication contention (instead of with respect to processing, as for ML). Finally, the TC indicator gives an approximation of the quality of the communication mapping.

Figure 7.8 shows the resulting LV statistic for the different application configurations and mapping weight combinations. The best results are given by the mapping weight combination c1. This is predictable because c1 promotes solutions with good load balancing, which means a low LV value.
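The LV and ML computations above can be sketched in a few lines of C, assuming the per-PE loads have already been accumulated from a candidate assignment (function names are illustrative, not MpAssign internals):

```c
#include <stddef.h>

/* LV(x): variance of the per-PE loads around avgload (Equation 7.2). */
double load_variance(const double *load, size_t p) {
    double avg = 0.0;
    for (size_t i = 0; i < p; i++)
        avg += load[i];
    avg /= (double)p;                 /* avgload */
    double lv = 0.0;
    for (size_t i = 0; i < p; i++) {
        double d = load[i] - avg;
        lv += d * d;
    }
    return lv / (double)p;
}

/* ML: load of the most heavily loaded PE, a lower bound on the
   iteration period since throughput follows the slowest PE. */
double maximal_load(const double *load, size_t p) {
    double ml = load[0];
    for (size_t i = 1; i < p; i++)
        if (load[i] > ml)
            ml = load[i];
    return ml;
}
```

A perfectly balanced mapping yields LV = 0; the c1 weight combination drives the heuristic toward this regime.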
Figure 7.9 presents the resulting ML statistic for the different application configurations and mapping weight combinations. Following the same logic as for Figure 7.8, the best results here are given by the mapping weight combination c1.

Figure 7.10 presents the resulting TC statistic for the different application configurations and mapping weight combinations. This time, the best results are given by the mapping weight combination c3. This is predictable because c3 promotes solutions with low communication costs.

Figure 7.11 presents the resulting MC statistic for the different application configurations and mapping weight combinations. Contrary to Figure 7.10,
FIGURE 7.8 Load variance (configurations v1/16–v4/48 under weight combinations c1, c2, c3).

FIGURE 7.9 Maximal load (at any PE).

FIGURE 7.10 Total communications.
FIGURE 7.11 Maximal communication (through any PE).

the best results are given by the mapping weight combination c2. Since the tool does not try to optimize this statistic, it appears that when optimizing either for load balancing or for TC, the obtained solution may have a worst-case communication contention at a PE. This is one aspect of the mapping heuristics that needs improvement.

These results show that the selection of the final task assignment solution really depends on the target performance, the architecture hardware budget, and the acceptable communication bandwidth. Nevertheless, the MpAssign tool, by generating an interesting subset of mapping solutions, allows architects to concentrate on the more detailed and time-consuming analysis, rather than on trying to find task assignment solutions. At this level, the various costs remain estimates based on platform and application abstractions and assumptions. For a candidate solution, the refinement can continue down to the target architecture level.

7.5.2 Refinement and Simulation

For a given solution, MpAssign provides an output text file that contains the task assignment directives, the resulting average load of each processor, and the cost for each inter-processor communication. The mapping results are also available in a graphical representation. For the purpose of a simpler display, we have created a mapping example of only eight processors, shown in Figure 7.12. We see that the dataflow source and sink blocks have been assigned to dedicated I/O processors (the first and the last, respectively), thus keeping the data-intensive tasks on the remaining six PEs. The intent was to isolate on different processors the interesting task set that we will want to profile later on, during the simulation.
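The per-processor loads and the TC-style communication cost reported for a solution can be recomputed from the assignment itself. The sketch below is an assumption for illustration: the hop-distance matrix supplied by the caller stands in for the actual Spidergon routing, and the function names are not MpAssign APIs.

```c
#include <stddef.h>

/* Recompute per-PE loads from an assignment vector
   (assign[t] = PE index of task t). */
void pe_loads(const double *task_cost, const int *assign, size_t n_tasks,
              double *load, size_t p) {
    for (size_t i = 0; i < p; i++)
        load[i] = 0.0;
    for (size_t t = 0; t < n_tasks; t++)
        load[assign[t]] += task_cost[t];
}

/* TC: each edge volume weighted by the NoC hop distance of its route.
   dist is a p-by-p hop-distance matrix flattened row-major; edges whose
   endpoints share a PE contribute zero (distance 0). */
double total_communication(const int *edge_src, const int *edge_dst,
                           const double *edge_vol, size_t n_edges,
                           const int *assign, const int *dist, size_t p) {
    double tc = 0.0;
    for (size_t e = 0; e < n_edges; e++) {
        int from = assign[edge_src[e]];
        int to   = assign[edge_dst[e]];
        tc += edge_vol[e] * (double)dist[(size_t)from * p + (size_t)to];
    }
    return tc;
}
```

Mapping communicating tasks onto the same PE, as the c3 combination favors, directly removes their edges from the TC sum.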
Starting from the mapping solution presented in Figure 7.12, MpCompose uses those task assignment directives to perform the software synthesis. Final compilation is carried out by the component back-end.
FIGURE 7.12 Sample output of MpAssign for a single solution. (Geometric location corresponds to PE assignment, here PE2; block height is proportional to task load, here 80 μs; PE load variance: 50, out of an average load of 275; connectivity between PEs, here from PE2 to PE6.)

The next step is the execution of the application on an instrumented virtual platform for performance analysis. For each processor, we obtain the function profiling report. By adding up the time spent in the task process functions versus the time spent in the scheduler or waiting for input data to arrive, we obtain the effective processor utilization. Task code optimization can be done orthogonally with the profiling report, in the same way as in a monoprocessor flow. However, in the MultiFlex flow, the user can update the IR with new task profiling information and rerun MpAssign to see how this influences the suggested task assignments. The simulations on the virtual platform should additionally provide NoC bandwidth and contention information, given by the instrumented links and routers. Our multicore virtual platform is currently under development, and the accurate STNoC model is in the process of being integrated. Meanwhile, a fast, functional interconnect implementation is used instead.

7.6 Conclusions

The increasing need for flexibility in multimedia SoCs for consumer applications is leading to a new class of programmable, multiprocessor solutions. The high computation and data bandwidth requirements of these
applications pose new challenges in the expression of the applications, the platform architectures to support them, and the application-to-platform mapping tools.

In this chapter, we elaborated on these challenges and introduced the MPSoC platform mapping technologies under development at STMicroelectronics, called MultiFlex, with emphasis on the assignment and scheduling of streaming applications. The use of these tools, integrated in a design flow that proposes a stepwise mapping refinement, was illustrated with a 3G base-station application example.

While keeping the application code unmodified, we have seen how the MultiFlex tools refine an IR of the application, based on information related to the application properties, the high-level platform characteristics, as well as user mapping constraints.

The MpAssign tool provides mapping solutions that minimize (1) the communications on the NoC and (2) the processing load variance. By changing the weight factors, the user can direct the heuristics to favor different classes of solutions.

For a chosen solution, the MpCompose tool provides the required intra- and inter-processor communications and local task schedulers, by instantiating generic components with the proper specific configuration.

Finally, for each processor, a self-contained component description generated by MpCompose can be given to the component compilation back-end, which takes care of implementing the low-level component glue code and invoking the compiler for final compilation and linking, ready for simulation and analysis.

This methodology supports a user-driven, iterative approach that automates the mechanical mapping transformations. Performance and profiling information obtained from a given platform mapping iteration can be exploited by the user and the mapping tools to guide the next optimization cycle.
7.6.1 Outlook

We are currently looking at alternative approaches for the implementation of the MpAssign tool, such as evolutionary algorithms, to cope with the scalability problem of the list-based heuristics presented in this chapter. In fact, if we want to add several optimization goals to the algorithm, such as minimizing memory usage or power, it will become difficult to implement an efficient list-based algorithm. Other areas of research include looking at how we can support finer-grain streaming with scalar-based I/O (similar to StreamIt [3]), mixed with our buffer-based dataflow approach. We are also evaluating the dynamic elaboration features of the component framework, which could extend our methodology toward runtime application deployment.
References

1. G. De Micheli and L. Benini, Networks on Chip: Technology and Tools, Morgan Kaufmann, San Francisco, CA, 2006.

2. I. Buck, Brook: A Streaming Programming Language, available online at http://graphics.stanford.edu/streamlang/brook_v0.2.pdf

3. W. Thies, M. Karczmarek, and S.P. Amarasinghe, StreamIt: A language for streaming applications, in Proceedings of the International Conference on Compiler Construction, Grenoble, France, April 2002.

4. G. Berry, P. Couronne, and G. Gonthier, Synchronous Programming of Reactive Systems: An Introduction to ESTEREL, Elsevier, Amsterdam, the Netherlands, 1988, pp. 35–56.

5. W.W. Wadge and E.A. Ashcroft, Lucid, the Data-Flow Programming Language, Academic Press, New York, 1985.

6. J.T. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt, Ptolemy: A framework for simulating and prototyping heterogeneous systems, Journal of Computer Simulation, special issue on “Simulation Software Development,” 4, 155–182, April 1994.

7. The MathWorks: Matlab and Simulink for technical computing, available online at http://www.mathworks.com

8. Data marshalling, http://www.webopedia.com/TERM/D/data_marshalling.html

9. P. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and D. Lyonnard, A multi-processor SoC platform and tools for communications applications, in Embedded Systems Handbook, CRC Press, Boca Raton, FL, 2004.

10. J. Hu and R. Marculescu, Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints, in Proceedings of DATE 2004, Paris, France.

11. M. Paganini, Nomadik®: A mobile multimedia application processor platform, in Proceedings of ASP-DAC (Asia and South Pacific Design Automation Conference), Yokohama, Japan, January 2007, pp. 749–750.

12. P.G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, D. Lyonnard, O. Benny, B. Lavigueur, D. Lo, G. Beltrame, V. Gagné, and G. Nicolescu, Parallel programming models for a multi-processor SoC platform applied to networking and multimedia, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(7), 667–680, July 2006.