Model-Based Design for Embedded Systems- P9


The unparalleled flexibility of computation has been a key driver and feature bonanza in the development of a wide range of products across a broad and diverse spectrum of applications, such as automotive, aerospace, health care, and consumer electronics.


Retargetable, Embedded Software Design Methodology

[...] the necessary information needed for each translation step. Based on the task-dependency information that tells how to connect the tasks, the translator determines the number of intertask communication channels. Based on the period and deadline information of tasks, the run-time system is synthesized. With the memory map information of each processor, the translator defines the shared variables in the shared region.

To support a new target architecture in the proposed workflow, we have to add translation rules of the generic API to the translator, make a target-specific OpenMP translator for data-parallel tasks, and apply the generation rule of task-scheduling codes tailored for the target OS. Each step of the CIC translator is explained in this section.

8.5.1 Generic API Translation

Since the CIC task code uses generic APIs for target-independent specification, the generic APIs must be translated to target-dependent APIs. If the target processor has an OS installed, generic APIs are translated into OS APIs; otherwise, they are translated into communication APIs that are defined by directly accessing the hardware devices. We implement the OS API library and the communication API library, both optimized for each target architecture.

For most generic APIs, API translation is achieved by simple redefinition of the API function. Figure 8.6a shows an example where the translator replaces the MQ_RECEIVE API with a “read_port” function for a target processor with pthread support. The read_port function is defined using pthread APIs and the memcpy C library function. However, some APIs need additional treatment. For example, the READ API needs different function prototypes depending on the target architecture, as illustrated in Figure 8.6b. Maeng et al. [14] presented a rule-based translation technique that is general enough to translate any API if the translation rule is defined in a pattern-list file.

(a) Generic API:
    ...
    MQ_RECEIVE(port_id, buf, size);
    ...

    Translated code:
    int read_port(int channel_id, unsigned char *buf, int len) {
        ...
        pthread_mutex_lock(channel_mutex);
        ...
        memcpy(buf, channel->start, len);
        ...
        pthread_mutex_unlock(channel_mutex);
    }

(b) Generic API:
    ...
    file = OPEN("input.dat", O_RDONLY);
    ...
    READ(file, data, 100);
    ...
    CLOSE(file);

    Translated code, using fopen/fread/fclose:
    #include ...
    file = fopen("input.dat", "r");
    ...
    fread(data, 1, 100, file);
    ...
    fclose(file);

    Translated code, using open/read/close:
    #include ...
    file = open("input.dat", O_RDONLY);
    ...
    read(file, data, 100);
    ...
    close(file);

FIGURE 8.6 Examples of generic API translation: (a) MQ_RECEIVE operation, (b) READ operation.

8.5.2 HW-Interfacing Code Generation

If a code segment is contained within a HW pragma section and its translation rule exists in an architecture information file, the CIC translator replaces the code segment with the HW-interfacing code, considering the parameters of the HW accelerator and the buffer variables that are defined in the architecture section of the CIC. The translation rule of HW-interfacing code for a specific HW is separately specified as a HW-interface library code.

Note that some HW accelerators work together with other HW IPs. For example, a HW accelerator may notify the processor of its completion through an interrupt; in this case, an interrupt controller is needed. The CIC translator generates a combination of the HW accelerator and interrupt controller, as shown in the next section.

8.5.3 OpenMP Translator

If an OpenMP compiler is available for the target, task codes with OpenMP directives can be used directly. Otherwise, we need to translate the task code with OpenMP directives to a parallel code. Note that we do not need a general OpenMP translator, since we use OpenMP directives only to specify the data-parallel CIC task. However, we have to make a separate OpenMP translator for each target architecture in order to achieve optimal performance.
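As a rough illustration of what a data-parallel task body marked with an OpenMP directive might look like, the sketch below clamps the sum of a predicted block and a residual. The function name add_residual and its parameters are hypothetical and do not come from the chapter; only the standard "parallel for" directive is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data-parallel task body (not from the chapter): add a
 * residual to a predicted block and clamp each sample to [0, 255]. */
void add_residual(const unsigned char *pred, const short *residual,
                  unsigned char *out, int n)
{
    int i;

    assert(pred != NULL && residual != NULL && out != NULL && n >= 0);

    /* Iterations are independent, so the loop can be shared among threads.
     * Without OpenMP support the directive is skipped and the loop runs
     * sequentially with the same result. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++) {
        int v = pred[i] + residual[i];
        if (v < 0) v = 0;
        if (v > 255) v = 255;
        out[i] = (unsigned char)v;
    }
}
```

Because the directive is the only parallel construct, the same source can be compiled by an OpenMP compiler where one exists or handed to a target-specific translator where it does not.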
For a distributed memory architecture, we developed an OpenMP translator that translates an OpenMP task code to MPI code using a minimal subset of the MPI library, for the following reasons: (1) MPI is a standard that is easily ported to various software platforms. (2) Porting the MPI library is much easier than modifying the OpenMP translator itself for a new target architecture. Figure 8.7 shows the structure of the translated MPI program. As shown in the figure, the translated code has a master–worker structure: the master processor executes the entire code, while the worker processors execute the parallel region only. When the master processor meets the parallel region, it broadcasts the shared data to the worker processors. Then, all processors concurrently execute the parallel region. The master processor synchronizes all the processors at the end of the parallel loop and collects the results from the worker processors. For performance optimization, we have to minimize the amount of interprocessor communication.
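The master–worker structure can be mimicked in plain C to show how a parallel loop might be divided. This is a sequential stand-in under stated assumptions, not the translator's output: the names block_range and parallel_sum_of_squares are hypothetical, the block partitioning is only one possible policy, and no MPI calls are used.

```c
#include <assert.h>

/* Half-open iteration range [*lo, *hi) handled by processor `rank`
 * (0 = master) when n iterations are divided among nprocs processors.
 * The first n % nprocs ranks take one extra iteration. */
void block_range(int n, int nprocs, int rank, int *lo, int *hi)
{
    int base, extra;

    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    base = n / nprocs;
    extra = n % nprocs;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Sequential simulation of the parallel region: every simulated processor
 * reads the same input (standing in for the broadcast), computes a partial
 * result on its own slice, and the master adds the partial results
 * (standing in for the collection step). */
long parallel_sum_of_squares(const int *data, int n, int nprocs)
{
    long total = 0;
    int rank;

    for (rank = 0; rank < nprocs; rank++) {
        int lo, hi, i;
        long partial = 0;

        block_range(n, nprocs, rank, &lo, &hi);
        for (i = lo; i < hi; i++)
            partial += (long)data[i] * data[i];
        total += partial; /* master gathers each partial result */
    }
    return total;
}
```

In a real translation the broadcast and collection steps are actual messages, which is why the amount of shared data exchanged per parallel region matters for performance.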
[Figure 8.7: flowchart with one master processor and three worker processors. All processors initialize; the master works alone until the parallel region starts; the shared data are broadcast; all processors work in the parallel region; the workers send the shared data and the master receives and updates them; after the parallel region ends, the master works alone.]

FIGURE 8.7 The workflow of translated MPI codes. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.5.4 Scheduling Code Generation

The last step of the proposed CIC translator is to generate the task-scheduling code for each processor core. Many tasks may be mapped to each processor, with different real-time constraints and dependency information. We remind the reader that a task code is defined by three functions: “{task name}_init()”, “{task name}_go()”, and “{task name}_wrapup()”. The generated scheduling code initializes the mapped tasks by calling “{task name}_init()” and wraps them up after the scheduling loop finishes its execution by calling “{task name}_wrapup()”.

The main body of the scheduling code differs depending on whether an OS is available for the target processor. If there is a POSIX-compliant OS, we generate a thread-based scheduling code, as shown in Figure 8.8a. A POSIX thread is created for each task (lines 17 and 18) with an assigned priority level, if available. The thread, as shown in lines 3 to 5, executes the main body of the task, “{task name}_go()”, and schedules itself based on its timing constraints by calling the “sleep()” method. If the OS is not POSIX-compliant, the CIC translator should be extended to generate the OS-specific scheduling code.

If no OS is available for the target processor, the translator should synthesize the run-time scheduler that schedules the mapped tasks.
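The self-scheduling step of a periodic thread (line 5 of Figure 8.8a) can be isolated as a small helper. The sketch below is an assumption about how the remaining time could be computed; the struct periodic_task, its fields, and the function name are hypothetical, and times are taken to be in milliseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task timing state; times are in milliseconds. */
typedef struct {
    long period;        /* task period */
    long next_release;  /* absolute time of the next activation */
} periodic_task;

/* Called after one execution of the task body, at time `now`.
 * Returns how long the thread should sleep before its next activation
 * and advances next_release by one period. A task that finished late
 * (overrun) gets a sleep time of 0. */
long sleep_until_next_period(periodic_task *t, long now)
{
    long remaining;

    assert(t != NULL && t->period > 0);
    remaining = t->next_release - now;
    t->next_release += t->period;
    return remaining > 0 ? remaining : 0;
}
```

A thread body would call the task's go function, read the clock, and pass the returned value to a sleep call such as POSIX nanosleep. Keeping the computation separate from the sleep call makes the overrun case easy to check.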
The CIC translator generates a data structure for each task, containing the three main functions of the task (“init()”, “go()”, and “wrapup()”). With this data structure, a real-time scheduler is synthesized by the CIC translator. Figure 8.8b shows the pseudocode of a generated scheduling code. The generated scheduling code may be changed by replacing the function “void scheduler()” or “int get_next_task()” to support another scheduling algorithm.

(a)
     1. void *thread_task_0_func(void *argv) {
     2.     ...
     3.     task_0_go();
     4.     get_time(&time);
     5.     sleep(task_0->next_period - time); // sleep for the remaining time
     6.     ...
     7. }
     8. int main() {
     9.     ...
    10.     pthread_t thread_task_0;
    11.     struct sched_param thread_task_0_param;
    12.     ...
    13.     thread_task_0_param.sched_priority = 0;
    14.     pthread_attr_setschedparam(..., &thread_task_0_param);
    15.     ...
    16.     task_init(); /* {task_name}_init() functions are called */
    17.     pthread_create(&thread_task_0,
    18.         &thread_task_0_attr, thread_task_0_func, NULL);
    19.     ...
    20.     task_wrapup(); /* {task_name}_wrapup() functions are called */
    21. }

(b)
    typedef struct {
        void (*init)();
        int (*go)();
        void (*wrapup)();
        int period, priority, ...;
    } task;
    task taskInfo[] = { {task1_init, task1_go, task1_wrapup, 100, 0},
                        {task2_init, task2_go, task2_wrapup, 200, 0} };

    void scheduler() {
        while (all_task_done() == FALSE) {
            int taskId = get_next_task();
            taskInfo[taskId].go();
        }
    }

    int main() {
        init();      /* {task_name}_init() functions are called */
        scheduler(); /* scheduler code */
        wrapup();    /* {task_name}_wrapup() functions are called */
        return 0;
    }

FIGURE 8.8 Pseudocode of generated scheduling code: (a) if OS is available, and (b) if OS is not available. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.6 Preliminary Experiments

An embedded software development framework based on the proposed methodology, named HOPES, is under development. While it allows the use of any model for initial specification, the current implementation is being done with the PeaCE model, which is the model used in the PeaCE hardware–software codesign environment for multimedia embedded systems design [15]. To verify the viability of the proposed programming framework, we built a virtual prototyping system, based on the Carbon SoC Designer [16], that consists of multiple ARM926EJ-S subsystems connected to each other through a shared bus, as shown in Figure 8.9. The H.263 decoder depicted in Figure 8.3 is used for the preliminary experiments.

8.6.1 Design Space Exploration

We specified the functional parallelism of the H.263 decoder with six tasks, as shown in Figure 8.3, where each task is assigned an index. For data parallelism, the data-parallel region of the motion compensation task is specified with an OpenMP directive. In this experiment, we explored the design space of parallelizing the algorithm, considering both functional and data parallelism simultaneously. As is evident in Figure 8.3, tasks 1 to 3 can be executed in parallel; thus, they are mapped to multiple processors with three configurations, as shown in Table 8.1. For example, in the second configuration, task 1 is mapped to processor 1 and the other tasks are mapped to processor 0.

[Figure 8.9: block diagram of two subsystems, each with an Arm926ej-s processor, local memory, an interrupt controller, and accelerators HW1 and HW2, together with a shared memory and HW3.]

FIGURE 8.9 The target architecture for preliminary experiments. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)
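Because a mapping configuration is pure data, the three configurations can be written down as a small table. The sketch below transcribes the task-to-processor assignments of Table 8.1 into a C array; the array and function names are hypothetical, and the actual format of the architecture information file is not shown in this section.

```c
#include <assert.h>

#define NUM_TASKS   6
#define NUM_CONFIGS 3

/* task_to_proc[c][t] is the processor that runs task t in mapping
 * configuration c + 1 (values transcribed from Table 8.1). */
static const int task_to_proc[NUM_CONFIGS][NUM_TASKS] = {
    {0, 0, 0, 0, 0, 0},  /* configuration 1: all tasks on processor 0 */
    {0, 1, 0, 0, 0, 0},  /* configuration 2: task 1 on processor 1 */
    {0, 1, 2, 0, 0, 0},  /* configuration 3: task 2 also moved, to processor 2 */
};

/* Number of processors used by a configuration (1-based index),
 * assuming processor ids are assigned contiguously from 0. */
int processors_used(int config)
{
    int t, max_proc = 0;

    assert(config >= 1 && config <= NUM_CONFIGS);
    for (t = 0; t < NUM_TASKS; t++)
        if (task_to_proc[config - 1][t] > max_proc)
            max_proc = task_to_proc[config - 1][t];
    return max_proc + 1;
}
```

Switching between configurations changes only one row of data, which mirrors the idea that the task code itself stays untouched during design space exploration.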
TABLE 8.1 Task Mapping to Processors

                The Configuration of Task Mapping
Processor Id    1                          2                          3
0               Task 0, Task 1, Task 2,    Task 0, Task 2, Task 3,    Task 0, Task 3,
                Task 3, Task 4, Task 5     Task 4, Task 5             Task 4, Task 5
1               N/A                        Task 1                     Task 1
2               N/A                        N/A                        Task 2

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.

TABLE 8.2 Execution Cycles for Nine Configurations

The Number of Processors     The Configuration of Task Mapping
for Data-Parallelism         1              2              3
No OpenMP                    158,099,172    146,464,503    146,557,779
2                            167,119,458    152,753,214    153,127,710
4                            168,640,527    154,159,995    155,415,942

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.

For each configuration of task mapping, we parallelized task 4 using one, two, and four processors. As a result, we prepared nine configurations in total, as listed in Table 8.2. In the proposed framework, each configuration is specified simply by changing the task-mapping information in the architecture information file. The CIC translator generates the executable C codes automatically.

Table 8.2 shows the performance results for these nine configurations. For functional parallelism, the best performance is obtained by using two processors, as reported in the first row (the “No OpenMP” case). The H.263 decoder algorithm uses a 4:1:1 format frame, so the computation of Y macroblock decoding is about four times larger than that of U or V macroblock decoding. Therefore, the macroblock decoding of U and V can be merged on one processor while the macroblock decoding of Y runs on another processor. No performance gain is obtained by exploiting data parallelism, because the computation workload of motion compensation is not large enough to outweigh the communication overhead incurred by parallel execution.
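The argument that two processors suffice for functional parallelism can be checked with a small idealized load model. The sketch below is only an illustration: the function name makespan is hypothetical, the work units 4, 1, and 1 stand for the Y, U, and V decoding loads implied by the 4:1:1 format, and communication cost is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 8

/* Idealized finish time when each task's work (in arbitrary units) is
 * assigned to a processor and the processors run concurrently: the most
 * heavily loaded processor determines the total. Communication cost is
 * ignored. */
int makespan(const int *work, const int *proc, int ntasks, int nprocs)
{
    int load[MAX_PROCS] = {0};
    int t, p, worst = 0;

    assert(work != NULL && proc != NULL);
    assert(ntasks >= 0 && nprocs > 0 && nprocs <= MAX_PROCS);
    for (t = 0; t < ntasks; t++) {
        assert(proc[t] >= 0 && proc[t] < nprocs);
        load[proc[t]] += work[t];
    }
    for (p = 0; p < nprocs; p++)
        if (load[p] > worst)
            worst = load[p];
    return worst;
}
```

With work {4, 1, 1}, one processor needs 6 units, and placing Y alone while U and V share a second processor needs 4 units. Giving U and V separate processors still needs 4 units, because Y dominates. This is consistent with the small difference between the second and third configurations in the first row of Table 8.2.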
8.6.2 HW-Interfacing Code Generation

Next, we accelerated the code segment of IDCT in the macroblock decoding tasks (task 1 to task 3) with a HW accelerator, as shown in Figure 8.10a. We use the RealView SoC Designer to model the entire system, including the HW accelerator. Two kinds of inverse discrete cosine transformation (IDCT) accelerator are used. One uses an interrupt signal for completion notification, and the other uses polling to detect the completion. The latter is specified in the architecture section as illustrated in Figure 8.10b, where the library name of the HW-interfacing code is set to IDCT_slave and its base address to 0x2F000000.

(a)
    #pragma hardware IDCT (output.data, input.data) {
        /* code segments for IDCT */
    }

(b) [Architecture section entry with the values IDCT, IDCT_slave, and 0x2F000000.]

(c) [Architecture section entries with the values IDCT, IDCT_interrupt, and 0x2F000000, and IRQ_CONTROLLER, irq_controller, and 0xA801000.]

FIGURE 8.10 (a) Code segment wrapped with HW pragma and architecture section information of IDCT, (b) when interrupt is not used, and (c) when interrupt is used. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

Figure 8.11a shows the assigned address map of the IDCT accelerator, and Figure 8.11b shows the generated HW-interfacing code. This code is substituted for the code segment contained within the HW pragma section. In Figure 8.11b, bold letters are changeable according to the parameters specified in a task code and in the architecture information file; they specify the base address for the HW interface data structure and the input and output port names of the associated CIC task. Note that the interfacing code uses polling at line 6 of Figure 8.11b.

If we use the accelerator with interrupt, an interrupt controller is additionally attached to the target platform, as shown in Figure 8.10c, with information on the code library name, IRQ_CONTROLLER, and its base address, 0xA801000. The new IDCT accelerator has the same address map as the previous one, except for [...]

Address (Offset)    I/O Type    Comment
0                   Read        Semaphore
4                   Write       IDCT start
8                   Read        Complete flag
12                  Write       IDCT clear
64 ~ 191            Write       Input data
192 ~ 319           Read        Output data

(a)

    1. int i;
    2. volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
    3. while (idct_base[0] == 1); // try to obtain hardware resource
    4. for (i=0;i [...]
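To make the polling style of interface concrete, the following sketch drives a register block laid out according to the address map in Figure 8.11a. It is not the elided listing: the access sequence, the meaning of the flag values (semaphore 1 = busy, complete flag 0 = not finished), the written values, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Word indices derived from the byte offsets in Figure 8.11a. */
enum {
    IDCT_SEMAPHORE = 0,        /* offset 0, read */
    IDCT_START     = 4 / 4,    /* offset 4, write */
    IDCT_COMPLETE  = 8 / 4,    /* offset 8, read */
    IDCT_CLEAR     = 12 / 4,   /* offset 12, write */
    IDCT_INPUT     = 64 / 4,   /* offsets 64..191, write */
    IDCT_OUTPUT    = 192 / 4,  /* offsets 192..319, read */
    IDCT_WORDS     = 128 / 4   /* 128 bytes per data window */
};

/* Assumed protocol (hypothetical): wait for the semaphore, write the
 * input window, start the accelerator, poll the complete flag, read the
 * output window, then clear the accelerator. */
void idct_poll(volatile uint32_t *base, const uint32_t *in, uint32_t *out)
{
    int i;

    assert(base != 0 && in != 0 && out != 0);
    while (base[IDCT_SEMAPHORE] == 1)
        ;                               /* resource held by another user */
    for (i = 0; i < IDCT_WORDS; i++)
        base[IDCT_INPUT + i] = in[i];
    base[IDCT_START] = 1;
    while (base[IDCT_COMPLETE] == 0)
        ;                               /* polling for completion */
    for (i = 0; i < IDCT_WORDS; i++)
        out[i] = base[IDCT_OUTPUT + i];
    base[IDCT_CLEAR] = 1;
}
```

Passing the base pointer as a parameter, rather than hard-coding 0x2F000000, lets the same routine run against an ordinary array on a host machine, which is convenient for checking the access sequence before real hardware or a simulator is available.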
    int complete;
    ...
    volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
    while (idct_base[0] == 1); // try to obtain hardware resource
    complete = 0;
    for (i=0;i [...]
    void cyg_user_start(void) {
        cyg_thread_create(taskInfo[0]->priority, TE_task_0,
            (cyg_addrword_t)0, "TE_task_0", (void *)&TaskStk[0],
            TASK_STK_SIZE-1, &handle[0], &thread[0]);
        ...
        init_task();
        cyg_thread_resume(handle[0]);
        ...
    }
    void TE_task_0(cyg_addrword_t data) {
        while (!finished)
            if (this task is executable) taskInfo[0]->go();
            else cyg_thread_yield();
    }
    void TE_main(cyg_addrword_t data) {
        while (1)
            if (all_task_is_done()) {
                wrapup_task();
                exit(1);
            }
    }

FIGURE 8.13 Pseudocode of an automatically generated scheduler for eCos. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

(a)
    int get_next_task() {
        a. find executable tasks
        b. find the tasks that have the smallest value of time count
        c. select the task that has not been executed for the longest time
        d. add period to the time count of the selected task
        e. return the selected task id
    }

(b)
    int get_next_task() {
        a. find executable tasks
        b. select the task that has the smallest period
        c. update task information
        d. return the selected task id
    }

FIGURE 8.14 Pseudocode of “get_next_task()” without OS support: (a) default, and (b) RMS scheduler. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

[...] for the IDCT function block. Third, the software environment for the target system is prepared, which includes the run-time scheduler and the target-dependent API library.
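The RMS variant of “get_next_task()” in Figure 8.14b can be turned into a small runnable sketch. The struct task_state, its fields, and the function name are hypothetical; the step “update task information” is left out because the figure does not say what is updated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduling state for one task. */
typedef struct {
    int period;      /* activation period; a smaller period means higher RMS priority */
    int executable;  /* nonzero when the task is ready to run */
} task_state;

/* Rate-monotonic selection: among the executable tasks, return the index
 * of the one with the smallest period, or -1 when none is executable.
 * Ties go to the lower index. */
int get_next_task_rms(const task_state *tasks, int ntasks)
{
    int i, best = -1;

    assert(tasks != NULL && ntasks >= 0);
    for (i = 0; i < ntasks; i++) {
        if (!tasks[i].executable)
            continue;
        if (best < 0 || tasks[i].period < tasks[best].period)
            best = i;
    }
    return best;
}
```

Swapping this function for the default selection changes the scheduling policy without touching the scheduler loop, which is the kind of replacement the generated code is meant to allow.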
[Figure 8.15: four block diagrams. (a) One Arm926ej-s processor with local memory and a shared memory. (b) The same with an IDCT block attached. (c) The same with an IDCT block and an interrupt controller attached. (d) Two Arm926ej-s processors, each with local memory, and a shared memory.]

FIGURE 8.15 Four target configurations for productivity analysis: (a) initial architecture, (b) HW IDCT is attached, (c) HW IDCT and interrupt controller are attached, and (d) additional processor and local memory are attached. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

At first, we needed to port the application code to the simulation environment shown in Figure 8.15a. The application code consists of about 2400 lines of C code, of which 167 lines are target dependent. The target-dependent codes should be rewritten using target-dependent APIs defined for the target simulator. It took about 5 h to execute the application on the simulator of our initial configuration (Figure 8.15a). The simulation porting overhead is directly proportional to the target-dependent code size. In addition, the overhead increases as the total code size increases, since we need to identify the target-dependent codes throughout the entire application code.

Next, we changed the target architecture to those shown in Figure 8.15b and c by using two kinds of IDCT HW IPs. The interface code between the processor and the IDCT HW should be inserted. It took about 2 h and 3 h to write and debug the interfacing code with the IDCT HW IP, without and with the interrupt controller, respectively. The sizes of the interface code without and with the interrupt controller were 14 and 48 lines, respectively. Note that the overhead will increase if the HW IP has a more complex interfacing protocol.

Last, we modified the task mapping by adding one more processor, as shown in Figure 8.15d.
For this analysis, we needed to make an additional data structure of software tasks to link with the run-time scheduler on each processor. It took about 2 h to make the data structure of all tasks and attach it to the default scheduler. Then, it took about 0.5 h to modify the data structure according to the task-mapping decision. Note that to change the task-mapping configuration, the algorithm part of the software code need not be modified. We summarize the overheads of manual software modification in Table 8.3.

TABLE 8.3 Time Overhead for Manual Software Modification

                       Description                                        Code Line      Time (h)
Figure 8.15a →         Initial porting overhead to the target simulator   167 of 2400    5
Figure 8.15b and c     Making HW interface code of IDCT                   14             2
                       (Figure 8.15a → Figure 8.15b)
                       Modifying HW interface code to use interrupt       48             3
                       controller (Figure 8.15a → Figure 8.15c)
Figure 8.15a →         Making initial data structure for scheduler        31             2
Figure 8.15d           Modification of data structure according to        12             0.5
                       the task mapping decision

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.

By contrast, in the proposed framework, design space exploration is performed simply by modifying the architecture information file, not the task code. Modifying the architecture information file is much easier than modifying the task code directly, and needs only a few minutes. Then the CIC translator generates the target code automatically in about a minute. Of course, it requires a significant amount of time to establish the translation environment for a new target. But once the environment is set up for each candidate processing element, we believe that the proposed framework improves design productivity dramatically for design space exploration of various architecture and task-mapping candidates.

8.7 Conclusion

In this chapter, we presented a retargetable parallel programming framework for MPSoC, based on a new parallel programming model called the CIC. The CIC specifies the design constraints and task codes separately.
Furthermore, the functional parallelism and data parallelism of application tasks are specified independently of the target architecture and design constraints.
Then, the CIC translator translates the CIC into the final parallel code, considering the target architecture and design constraints, to make the CIC retargetable. Temporal parallelism is exploited by inserting pipeline buffers between CIC tasks, and where to put the pipeline buffers is determined at the mapping stage. We have developed a mapping algorithm that considers temporal parallelism as well as functional and data parallelism [17].

Preliminary experiments with an H.263 decoder example prove the viability of the proposed parallel programming framework: it increases the design productivity of MPSoC software significantly. Many issues remain to be researched, including the optimal mapping of CIC tasks to a given target architecture, the exploration of the optimal target architecture, and the optimization of the CIC translator for specific target architectures. In addition, we have to extend the CIC to improve the expression capability of the model.

References

1. Message Passing Interface Forum, MPI: A message-passing interface standard, International Journal of Supercomputer Applications and High Performance Computing, 8(3/4), 1994, 159–416.
2. OpenMP Architecture Review Board, OpenMP C and C++ application program interface, http://www.openmp.org, Version 1.0, 1998.
3. M. Sato, S. Satoh, K. Kusano, and Y. Tanaka, Design of OpenMP compiler for an SMP cluster, in EWOMP’99, Lund, Sweden, 1999.
4. F. Liu and V. Chaudhary, A practical OpenMP compiler for system on chips, in WOMPAT 2003, Toronto, Canada, June 26–27, 2003, pp. 54–68.
5. Y. Hotta, M. Sato, Y. Nakajima, and Y. Ojima, OpenMP implementation and performance on embedded Renesas M32R chip multiprocessor, in EWOMP, Stockholm, Sweden, October, 2004.
6. W. Jeun and S. Ha, Effective OpenMP implementation and translation for multiprocessor system-on-chip without using OS, in 12th Asia and South Pacific Design Automation Conference (ASP-DAC’2007), Yokohama, Japan, 2007, pp. 44–49.
7. R. Eigenmann, J. Hoeflinger, and D. Padua, On the automatic parallelization of the perfect benchmarks(R), IEEE Transactions on Parallel and Distributed Systems, 9(1), 1998, 5–23.
8. G. Martin, Overview of the MPSoC design challenge, in 43rd Design Automation Conference, San Francisco, CA, July, 2006, pp. 274–279.
9. P. G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and G. Nicolescu, Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management, in CODES+ISSS 2004, Stockholm, Sweden, 2004, pp. 48–53.
10. P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijizer, and G. Essink, Design and programming of embedded multiprocessors: An interface-centric approach, in Proceedings of CODES+ISSS 2004, Stockholm, Sweden, 2004, pp. 206–217.
11. A. Jerraya, A. Bouchhima, and F. Petrot, Programming models and HW-SW interfaces abstraction for multi-processor SoC, in 43rd Design Automation Conference, San Francisco, CA, July 24–28, 2006, pp. 280–285.
12. K. Balasubramanian, A. Gokhale, G. Karsai, J. Sztipanovits, and S. Neema, Developing applications using model-driven design environments, IEEE Computer, 39(2), 2006, 33–40.
13. K. Kim, J. Lee, H. Park, and S. Ha, Automatic H.264 Encoder synthesis for the cell processor from a target independent specification, in 6th IEEE Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia’2008), Atlanta, GA, 2008.
14. J. Maeng, J. Kim, and M. Ryu, An RTOS API translator for model-driven embedded software development, in 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA’06), Sydney, Australia, August 16–18, 2006, pp. 363–367.
15. S. Ha, C. Lee, Y. Yi, S. Kwon, and Y. Joo, PeaCE: A hardware-software codesign environment for multimedia embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES), 12(3), Article 24, August 2007.
16. Carbon SoC Designer homepage, http://carbondesignsystems.com/products_socd.shtml
17. H. Yang and S. Ha, Pipelined data parallel task mapping/scheduling technique for MPSoC, in DATE 2009, Nice, France, April 2009.
18. S. Kwon, Y. Kim, W. Jeun, S. Ha, and Y. Paek, A retargetable parallel-programming framework for MPSoC, ACM Transactions on Design Automation of Electronic Systems (TODAES), 13(3), Article 39, July 2008.
9 Programming Models for MPSoC

Katalin Popovici and Ahmed Jerraya

CONTENTS
9.1 Introduction
9.2 Hardware–Software Architecture for MPSoC
    9.2.1 Hardware Architecture
    9.2.2 Software Architecture
    9.2.3 Hardware–Software Interface
9.3 Programming Models
    9.3.1 Programming Models Used in Software
    9.3.2 Programming Models for SoC Design
    9.3.3 Defining a Programming Model for SoC
9.4 Existing Programming Models
9.5 Simulink- and SystemC-Based MPSoC Programming Environment
    9.5.1 Programming Models at Different Abstraction Levels Using Simulink and SystemC
    9.5.2 MPSoC Programming Steps
9.6 Experiments with H.264 Encoder Application
    9.6.1 Application and Architecture Specification
    9.6.2 Programming at the System Architecture Level
    9.6.3 Programming at the Virtual Architecture Level
    9.6.4 Programming at the Transaction Accurate Architecture Level
    9.6.5 Programming at the Virtual Prototype Level
9.7 Conclusions
References

9.1 Introduction

Multimedia applications impose demanding constraints in terms of time to market and design quality. Efficient hardware platforms do exist for these applications. These feature heterogeneous multiprocessor architectures with specific I/O components in order to achieve computation and communication performance [1]. Heterogeneous MPSoC includes different kinds of processing units (digital signal processor [DSP], microcontroller, application-specific instruction set processor [ASIP], etc.) and different communication schemes (fast links, nonstandard memory organization and access). Typical heterogeneous platforms used in industry are TI OMAP [2], ST Nomadik [3], Philips Nexperia [4], and Atmel Diopis [5]. Next-generation MPSoC promises to be a multitile architecture that integrates hundreds of DSPs and microcontrollers on a single chip [6]. The software running on these heterogeneous MPSoC architectures is generally organized into several stacks made of different software layers.

Programming heterogeneous MPSoC architectures becomes a key issue because of two competing requirements: (1) Reducing the software development cost and the overall design time requires a higher level programming model. Usually, high level programming models reduce the amount of architecture detail that needs to be handled by the application software designers, and accelerate the design process. The use of a high level programming model also allows concurrent software–hardware design, thus reducing the overall SoC design time. (2) Improving the performance of the overall system requires finding the best matches between the hardware and the software. This is generally obtained through low level programming. Thus, the key challenge is to find a programming environment able to satisfy these two opposing requirements.

Programming MPSoCs means generating software stacks that run efficiently on the various processors, while exploiting the available resources of the architecture. Producing efficient code requires that the software take into account the capabilities of the target platform. For instance, a data exchange between two different processors may use different schemes (global memory accessible by both processing units, local memory of one of the processors, dedicated hardware FIFO components, etc.). Additionally, different synchronization schemes (polling, interrupts) may be used to coordinate this data exchange.
Each of these communication schemes has advantages and disadvantages in terms of performance (e.g., latency, throughput), resource sharing (e.g., multitasking, parallel I/O), and communication overhead (e.g., memory size, execution time). In an ideal design flow, programming a specific architecture consists of partitioning and mapping, application software code generation, and hardware-dependent software (HdS) code generation (Figure 9.1). The HdS is made of the lower software layers that may incorporate an operating sys- tem (OS), communication management, and a hardware abstraction layer (HAL) to allow the OS functions to access the hardware resources of the platform. Unfortunately, we are still missing such an ideal generic flow, which can efficiently map high level programs on heterogeneous MPSoC architectures. Traditional software development strategies make use of the concept of a software development platform to debug the software before the hardware is ready, thus allowing parallel hardware–software design. As illustrated in Figure 9.2, the software development platform is an abstract model of the architecture in form of a run time library or simulator aimed to execute the software. The combination of this platform with the software code given
as a high level representation produces an executable model that emulates the execution of the final system, including the hardware and software architecture. Generic software development platforms have been designed to fully abstract the hardware–software interfaces; for example, MPITCH is a run time execution environment designed to execute parallel software code written using MPI [7]. The use of generic platforms does not allow simulating the software execution with detailed hardware–software interaction. Therefore, it does not allow debugging the lower layers of the software stack, for instance, the OS or the implementation of the high level communication primitives. The validation and debug of the HdS is the main bottleneck in MPSoC design [8] because each processor subsystem requires a specific HdS implementation to be efficient.

FIGURE 9.1 Software design flow. (Blocks: application specification; partitioning + mapping; SW code generation, comprising application code generation and HdS code generation; final application software code; execution on the MPSoC.)

FIGURE 9.2 Software development platform. (Blocks: software code; development platform; HW abstraction; executable model generation; executable model; debug and performance validation; hardware platform.)

The use of programming models for the software design of heterogeneous MPSoCs requires the definition of new design automation methods to enable concurrent design of hardware and software. This also requires new
models to deal with nonstandard, application-specific hardware–software interfaces at several abstraction levels.

In this chapter, we define the programming models used to abstract hardware–software interfaces in heterogeneous MPSoCs. Then, we propose a programming environment that identifies several programming models at different MPSoC abstraction levels. The proposed approach combines the Simulink® environment for high level programming and the SystemC design language for low level programming. The proposed methodology is applied to a heterogeneous multiprocessor platform to explore the communication architecture and to generate efficient executable code of the software stacks for an H.264 video encoder application.

The chapter is composed of seven sections. Section 9.1 gives a short introduction to present the context of MPSoC programming models and environments. Section 9.2 describes the hardware and software organization of the MPSoC, including hardware–software interfaces. Section 9.3 gives the definition of the programming models and MPSoC abstraction levels. Section 9.4 lists several existing programming models. Section 9.5 summarizes the main steps of the proposed programming environment, based on the Simulink and SystemC design languages. Section 9.6 addresses the experimental results, followed by the conclusion.

9.2 Hardware–Software Architecture for MPSoC

The literature describes mainly two kinds of organization for multiprocessor architectures, called shared memory and message passing [9]. This classification fixes both the hardware and the software organization for each class of architecture. The shared memory organization generally assumes a multitasking application organized as a single software stack and a hardware architecture made of multiple identical processors (CPUs). The communication between the different CPUs is performed through a global shared memory.
The message passing organization assumes multiple software stacks running on nonidentical subsystems that may include different CPUs and/or different I/O systems, in addition to specific local memory architectures. The communication between the different subsystems generally proceeds by message passing. Heterogeneous MPSoCs generally combine both models to integrate a massive number of processors on a single chip [10]. Future heterogeneous MPSoCs will be made of a few heterogeneous subsystems, where each subsystem may include a massive number of identical processors running a specific software stack.

In the following sections, we describe the hardware organization, the software stack composition, and the hardware–software interface for MPSoC architectures.
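To make the message passing organization more concrete, here is a small sketch of a one-slot mailbox between two subsystems that each keep their data in private local memory. It is an illustration under stated assumptions, not the mechanism used in the chapter: `mailbox`, `msg_send`, `msg_recv`, and `MSG_MAX` are hypothetical names, and one sender and one receiver are assumed. The point it shows is that the payload is copied, so the receiver never reads the sender's memory directly, in contrast to the shared memory organization.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX 64u  /* hypothetical maximum payload size in bytes */

/* Hypothetical one-slot mailbox sitting between two subsystems. */
typedef struct {
    unsigned char payload[MSG_MAX];
    size_t        len;
    atomic_bool   full;  /* set by the sender, cleared by the receiver */
} mailbox;

void mailbox_init(mailbox *m) {
    assert(m);
    m->len = 0;
    atomic_init(&m->full, false);
}

/* Copy len bytes out of the sender's local memory.
   Returns false if the slot is busy or the size is not in 1..MSG_MAX. */
bool msg_send(mailbox *m, const void *src, size_t len) {
    if (len == 0 || len > MSG_MAX ||
        atomic_load_explicit(&m->full, memory_order_acquire))
        return false;
    memcpy(m->payload, src, len);
    m->len = len;
    atomic_store_explicit(&m->full, true, memory_order_release);
    return true;
}

/* Copy the pending message into the receiver's local memory.
   Returns the number of bytes copied, or 0 if no message is pending
   or the destination buffer is too small (the message stays queued). */
size_t msg_recv(mailbox *m, void *dst, size_t cap) {
    if (!atomic_load_explicit(&m->full, memory_order_acquire))
        return 0;
    size_t n = m->len;
    if (n > cap)
        return 0;
    memcpy(dst, m->payload, n);
    atomic_store_explicit(&m->full, false, memory_order_release);
    return n;
}
```

Because the sender's buffer is copied at send time, the sender may reuse or modify it immediately afterwards without affecting what the receiver sees.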
FIGURE 9.3 MPSoC hardware–software architecture. (Software subsystems, SW-SS, each run application tasks on top of a software stack made of the HdS API, communication and OS layers, the HAL API, and the HAL, over hardware made of a CPU, peripherals, and intra-subsystem communication. Hardware subsystems, HW-SS, and SW-SS are connected by the inter-subsystem communication network.)

9.2.1 Hardware Architecture

Generally, MPSoC architectures may be represented as a set of processing subsystems or components that interact via an inter-subsystem communication network (Figure 9.3).

The processing subsystems may be either hardware subsystems (HW-SS) or software subsystems (SW-SS). The SW-SS are programmable subsystems that include one or several identical processing units, or CPUs. Different kinds of processing units may be needed for the different subsystems to realize different types of functionality (e.g., a DSP for data oriented operations, a general purpose processor [GPP] for control oriented operations, and an ASIP for application specific computation). Each SW-SS executes a software stack.

In addition to the CPU, the hardware part of a SW-SS generally includes auxiliary components and peripherals to speed up computation and communication. This may range from simple bus arbitration to sophisticated memory and parallel I/O architectures.

9.2.2 Software Architecture

In classical literature, a subsystem is organized into layers for the purpose of standardization and reuse. Unfortunately, each layer induces additional cost and performance overheads. In this chapter, we consider that within a subsystem the software stack is structured in only three layers, as depicted in Figure 9.3. The top layer is the software application, which may be a multitasking description or a single task function. The application layer consists of a set of tasks that make use of a programming model or application programming interface (API) to abstract the underlying HdS layer. These APIs correspond to the HdS APIs.
The separation between the application layer and the underlying HdS layer is required to facilitate concurrent software and hardware development.
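The layering described above can be sketched in a few lines of C. This is only an illustrative arrangement, not the chapter's generated code: the types `hal_ops`, `hds_channel`, and `sim_fifo` and the functions `hds_send`, `hds_recv`, and `task_sum` are hypothetical. The application task is written only against the HdS API, and the HAL is a table of platform-specific operations, so the same task code could run on a host-side model before the hardware is ready.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HAL layer: hypothetical table of platform-specific operations. */
typedef struct {
    void     (*write_word)(void *ctx, uint32_t word);
    uint32_t (*read_word)(void *ctx);
    void      *ctx;
} hal_ops;

/* HdS API layer: the only interface the application tasks see. */
typedef struct {
    const hal_ops *hal;
} hds_channel;

void hds_send(hds_channel *ch, const uint32_t *buf, size_t n) {
    assert(ch && ch->hal);
    for (size_t i = 0; i < n; i++)
        ch->hal->write_word(ch->hal->ctx, buf[i]);
}

void hds_recv(hds_channel *ch, uint32_t *buf, size_t n) {
    assert(ch && ch->hal);
    for (size_t i = 0; i < n; i++)
        buf[i] = ch->hal->read_word(ch->hal->ctx);
}

/* One possible HAL back end: a host-side model of a hardware FIFO.
   It performs no full/empty checks, which real hardware access would need. */
#define SIM_DEPTH 16u
typedef struct {
    uint32_t slots[SIM_DEPTH];
    size_t   head, tail;
} sim_fifo;

void sim_write(void *ctx, uint32_t word) {
    sim_fifo *f = ctx;
    f->slots[f->tail++ % SIM_DEPTH] = word;
}

uint32_t sim_read(void *ctx) {
    sim_fifo *f = ctx;
    return f->slots[f->head++ % SIM_DEPTH];
}

/* Application layer: a task that knows nothing about the platform. */
uint32_t task_sum(hds_channel *in, size_t n) {
    uint32_t sum = 0, word;
    for (size_t i = 0; i < n; i++) {
        hds_recv(in, &word, 1);
        sum += word;
    }
    return sum;
}
```

Retargeting the task to another subsystem would then mean supplying a different `hal_ops` table (for example, one that accesses memory-mapped registers) while `task_sum` stays unchanged.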