Fig. 12.2 Larrabee architecture from Intel (blocks in the figure: multi-threaded wide-SIMD cores with D$ and I$, an L2 cache, memory controllers, fixed-function logic, texture logic, a display interface, and a system interface)
Fig. 12.3 Fermi architecture from NVIDIA (blocks in the figure: an array of shared multiprocessor cores around a common L2 cache, multiple DRAM interfaces, a host interface, and the GigaThread scheduler)
In Fermi, each group of cores forms a shared multiprocessor (SM). The block diagram of a single SM is shown in Fig. 12.4, and the block diagram of a core within an SM is shown in Fig. 12.5.
With these upcoming architectures, newer approaches for hardware acceleration of algorithms would become viable. These approaches could exploit the more general computing paradigm offered by the newer architectures. For example, the close coupling between the GPU and the CPU (which reside on the same die) would reduce the communication cost.
Also, in these upcoming architectures the instruction dispatch unit is distributed and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm of current GPUs), which in turn would enable acceleration opportunities for more EDA applications.
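To make the communication-cost point concrete, the sketch below (plain CUDA written here for illustration; it is not code from this monograph, and the kernel, array size, and launch configuration are arbitrary choices) times the host-to-device transfer separately from the computation itself. For a trivial kernel such as this one, the transfer typically dominates the total runtime, and that is precisely the component an on-die CPU-GPU pairing would shrink.

```cuda
// A minimal, hedged sketch (not code from the monograph): it separates the time
// spent moving data across the host-device interface from the time spent
// computing, to show where the communication cost discussed above appears.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel: a tiny amount of work per element, so the data transfer
// (rather than the computation) dominates the measured runtime.
__global__ void scale(float *d, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

int main() {
    const int n = 1 << 22;                 // ~4M floats; an arbitrary illustrative size
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // communication over the host-device link
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);      // computation on the GPU
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    printf("host-to-device copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(d);
    free(h);
    return 0;
}
```

On current discrete GPUs, pinned or mapped host memory (cudaHostAlloc) can reduce this copy cost; an integrated CPU-GPU die removes most of it, which is why the tighter coupling matters for communication-bound EDA workloads.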
The approaches presented in this monograph collectively aim to help the CAD community accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques for rearchitecting several EDA algorithms so as to maximally harness their performance on the alternative platforms under consideration.
Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi (blocks in the figure: an instruction cache, two schedulers with dispatch units, a register file, 32 cores, 16 load/store units, 4 special function units, an interconnect network, 64 KB of configurable cache/shared memory, and a uniform cache)
Fig. 12.5 Block diagram of a single processor (core) in an SM (blocks in the figure: a CUDA core containing a dispatch port, an operand collector, an INT unit, an FP unit, and a result queue)
References
1. The MiniSAT Page. http://www.cs.chalmers.se/cs/research/formalmethods/minisat/main.html
2. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
3. OmegaSim Mixed-Signal Fast-SPICE Simulator. http://www.nascentric.com/product.html
4. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference on Test, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
5. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27(3), 1–15 (2008)
6. Silva, M., Sakallah, J.: GRASP - a new search algorithm for satisfiability. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 220–227 (1996)
Index
A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl's Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40

B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
  EDA, 3
  survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
  SSTA, 108
Board test, 15
Boolean Constant Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
  SPICE, 158
BSIM4
  SPICE, 158
Bulldog Fortran, 171

C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated EDA, 3
Control plus data parallel
Core, 185
Critical line
  critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33

D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
  SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
  fault table, 134
Dynamic
  power, 10
Dynamic bulk modulation, 10

E
EDA, 3
Embedded processor, 10

F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
  data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
  Factor Graph, 87

G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85

H
Hardware
  IP cores, 15
HDL, 10, 14, 19
Hybrid
  SAT, 85

I
Immediate dominator
  dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
  security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
  non-chronological, 32
Intra-bin
  non-chronological, 32
IP cores, 15

K
Kernel, 28, 167, 184

L
Larrabee, 184
Latency, 11, 13
Leakage
  power, 10
Levelize, 112
Literal, 37
  free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12

M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
  SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
  SSTA, 101, 106
Moore's Law, 24
MOPs, 17
MOPs per watt
  MOPs, 17
Multi-GPU, 16
Multi-port memory, 20
Multiprocessor, 12, 24
MUX, 11

N
Newton-Raphson, 154
NMOS
  passgates, 11
Non-chronological backtrack, 32, 43, 45, 64, 68, 85
Non-recurring engineering, 10, 18
Non-volatile memory, 20

O
Off-chip, 14
On-chip, 14
OPB, 67, 72

P
Paging, 12
Parafrase, 170
Parallel
  SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
  SSTA, 108
Pattern parallel
  data parallel, 120
PCI, 15
PCI-X
  PCI, 15
Pipeline, 11
Piracy
  security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
  average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
  code, 16
Programmable, 12
Prototyping, 16

Q
QuickPath Interconnect, 18

R
Random
  variations, 106
Re-programmability, 19
Reconfigurable logic
  FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19

S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
  3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
  IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
  SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
  SAT, 83, 85
Subroutine, 167
Subsumption
  resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
  survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic variations, 106

T
Termination cell, 40
Texture fetching
  Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree
  Factor Graph, 87

U
Unate covering, 134

V
Variable, 31, 37
  Factor Graph, 87
Variable ordering
  SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93

W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19

X
XC2VP30
  FPGA, 32