Fig. 12.2 Larrabee architecture from Intel (blocks in the figure: multi-threaded wide-SIMD cores with D$ and I$, an L2 cache, memory controllers, fixed-function logic, texture logic, a display interface, and a system interface)
Fig. 12.3 Fermi architecture from NVIDIA (blocks in the figure: an array of shared multiprocessor cores around a common L2 cache, multiple DRAM interfaces, a host interface, and the GigaThread scheduler)
In Fermi, each group of cores forms a shared multiprocessor (SM). The block diagram of a single SM is shown in Fig. 12.4, and the block diagram of a core within an SM is shown in Fig. 12.5.
With these upcoming architectures, newer approaches for hardware acceleration of algorithms would become viable. These approaches could exploit the more general computing paradigm offered by the newer architectures. For example, the close coupling between the GPU and the CPU (which reside on the same die) would reduce the communication cost.
Also, in these upcoming architectures the instruction dispatch unit is distributed and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm of current GPUs), which in turn would enable acceleration opportunities for more EDA applications.
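To make the communication-cost point concrete, the sketch below (plain CUDA written here for illustration; it is not code from this monograph, and the kernel, array size, and launch configuration are arbitrary choices) times the host-to-device transfer separately from the computation itself. For a trivial kernel such as this one, the transfer typically dominates the total runtime, and that is precisely the component an on-die CPU-GPU pairing would shrink.

```cuda
// A minimal, hedged sketch (not code from the monograph): it separates the time
// spent moving data across the host-device interface from the time spent
// computing, to show where the communication cost discussed above appears.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel: a tiny amount of work per element, so the data transfer
// (rather than the computation) dominates the measured runtime.
__global__ void scale(float *d, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

int main() {
    const int n = 1 << 22;                 // ~4M floats; an arbitrary illustrative size
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // communication over the host-device link
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);      // computation on the GPU
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    printf("host-to-device copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(d);
    free(h);
    return 0;
}
```

On current discrete GPUs, pinned or mapped host memory (cudaHostAlloc) can reduce this copy cost; an integrated CPU-GPU die removes most of it, which is why the tighter coupling matters for communication-bound EDA workloads.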
The approaches presented in this monograph collectively aim to help the CAD community accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques for rearchitecting several EDA algorithms so as to maximally harness their performance on the alternative platforms under consideration.
Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi (blocks in the figure: an instruction cache, two schedulers with dispatch units, a register file, 32 cores, 16 load/store units, 4 special function units, an interconnect network, 64 KB of configurable cache/shared memory, and a uniform cache)
Fig. 12.5 Block diagram of a single processor (core) in an SM (blocks in the figure: a CUDA core containing a dispatch port, an operand collector, an INT unit, an FP unit, and a result queue)
References
1. The MiniSAT Page. http://www.cs.chalmers.se/cs/research/formalmethods/minisat/main.html
2. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
3. OmegaSim Mixed-Signal Fast-SPICE Simulator. http://www.nascentric.com/product.html
4. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference on Test, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
5. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27(3), 1–15 (2008)
6. Silva, M., Sakallah, J.: GRASP - a new search algorithm for satisfiability. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 220–227 (1996)
Index
A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl's Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40

B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
  EDA, 3
  survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
  SSTA, 108
Board test, 15
Boolean Constant Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
  SPICE, 158
BSIM4
  SPICE, 158
Bulldog Fortran, 171

C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated EDA, 3
Control plus data parallel
Core, 185
Critical line
  critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33

D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
  SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
  fault table, 134
Dynamic
  power, 10
Dynamic bulk modulation, 10

E
EDA, 3
Embedded processor, 10

F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
  data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
  Factor Graph, 87

G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85

H
Hardware
  IP cores, 15
HDL, 10, 14, 19
Hybrid
  SAT, 85

I
Immediate dominator
  dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
  security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
  non-chronological, 32
Intra-bin
  non-chronological, 32
IP cores, 15

K
Kernel, 28, 167, 184

L
Larrabee, 184
Latency, 11, 13
Leakage
  power, 10
Levelize, 112
Literal, 37
  free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12

M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
  SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
  SSTA, 101, 106
Moore's Law, 24
MOPs, 17
MOPs per watt
  MOPs, 17
Multi-GPU, 16
Multi-port memory, 20
Multiprocessor, 12, 24
MUX, 11

N
Newton-Raphson, 154
NMOS
  passgates, 11
Non-chronological backtrack, 32, 43, 45, 64, 68, 85
Non-recurring engineering, 10, 18
Non-volatile memory, 20

O
Off-chip, 14
On-chip, 14
OPB, 67, 72

P
Paging, 12
Parafrase, 170
Parallel
  SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
  SSTA, 108
Pattern parallel
  data parallel, 120
PCI, 15
PCI-X
  PCI, 15
Pipeline, 11
Piracy
  security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
  average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
  code, 16
Programmable, 12
Prototyping, 16

Q
QuickPath Interconnect, 18

R
Random
  variations, 106
Re-programmability, 19
Reconfigurable logic
  FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19

S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
  3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
  IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
  SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
  SAT, 83, 85
Subroutine, 167
Subsumption
  resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
  survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic variations, 106

T
Termination cell, 40
Texture fetching
  Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree
  Factor Graph, 87

U
Unate covering, 134

V
Variable, 31, 37
  Factor Graph, 87
Variable ordering
  SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93

W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19

X
XC2VP30
  FPGA, 32