# Research Article: Novel VLSI Algorithm and Architecture with Good Quantization Properties for a …

*EURASIP Journal on Advances in Signal Processing*

…for k = 1, …, N, with

    α = π/2N.    (2)

To simplify the presentation, the constant coefficient √(2/N) will be dropped from (1), which represents the definition of the DCT; a multiplier will be added at the end of the VLSI array to scale the output sequence by this constant.

We will reformulate relation (1) as a parallel decomposition based on a cycle and a pseudo-cycle convolution form, using a single new restructured input sequence, as opposed to prior work where two such auxiliary input sequences were used. Further, we will use the properties of the DCT kernel and those of the Galois field of indexes to appropriately permute the auxiliary input and output sequences.

Thus, we introduce an auxiliary input sequence {xa(i) : i = 0, 1, …, N − 1}. It can be recursively defined as follows:

    xa(N − 1) = x(N − 1),    (3)

    xa(i) = (−1)^i · x(i) + xa(i + 1),  for i = N − 2, …, 0.    (4)

Using this restructured input sequence, we can reformulate (1) as follows:

    X(0) = xa(0) + 2 · Σ_{i=1}^{(N−1)/2} (−1)^{φ(i)} · [xa(φ(i)) − xa(φ(i + (N−1)/2))],

    X(k) = [xa(0) + T(k)] · cos(kα),  for k = 1, …, N − 1.    (5)

The new auxiliary output sequence {T(k) : k = 1, 2, …, N − 1} can be computed in parallel as a cycle and a pseudo-cycle convolution if the transform length N is a prime number, as follows:

    T(δ(k)) = Σ_{i=1}^{(N−1)/2} (−1)^{ξ(k,i)} · [xa(φ(i − k)) − xa(φ(i − k + (N−1)/2))] · [2 · cos(ψ(i) · 2α)],

    T(γ(k)) = Σ_{i=1}^{(N−1)/2} (−1)^{ζ(i)} · [xa(φ(i − k)) + xa(φ(i − k + (N−1)/2))] · [2 · cos(ψ(i) · 2α)],

for k = 0, 1, …, (N − 1)/2,    (6)

where

    ψ(k) = φ′(k)           if φ′(k) ≤ (N − 1)/2,
           φ′(N − 1 + k)   otherwise,    (7)

with

    φ′(k) = φ(k)           if k > 0,
            φ(N − 1 + k)   otherwise,

    φ(k) = ⟨g^k⟩_N,    (8)

where ⟨x⟩_N denotes the result of x modulo N and g is a primitive root modulo N. We have also used the properties of the Galois field of indexes to convert the computation of the DCT into a convolution form.

## 3. Examples

To illustrate our approach, we will consider two examples for the 1D DCT: one with length N = 7 and primitive root g = 3, and the other with length N = 11 and primitive root g = 2.

### 3.1. DCT Algorithm with Length N = 7

We recursively compute the auxiliary input sequence {xa(i) : i = 0, …, N − 1} as follows:

    xa(6) = x(6),
    xa(i) = (−1)^i · x(i) + xa(i + 1),  for i = 5, …, 0.    (9)

Using the auxiliary input sequence {xa(i) : i = 0, …, N − 1}, we can write (6) in a matrix-vector product form as

    [T(4)]   [ −(xa(3) − xa(4))   −(xa(2) − xa(5))   −(xa(6) − xa(1)) ]   [c(2)]
    [T(2)] = [   xa(6) − xa(1)      xa(3) − xa(4)    −(xa(2) − xa(5)) ] · [c(1)]    (10)
    [T(6)]   [   xa(2) − xa(5)    −(xa(6) − xa(1))     xa(3) − xa(4)  ]   [c(3)]
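The backward recursion (3)-(4) and the Galois-field index map φ(k) = ⟨g^k⟩_N can be sketched in a few lines of Python. This is our own illustration (the function names `aux_sequence` and `phi` are ours, not from the paper); it also checks the property that makes φ a valid permutation of {1, …, N − 1} for a prime length.

```python
# Auxiliary input sequence from equations (3)-(4):
# xa(N-1) = x(N-1); xa(i) = (-1)^i * x(i) + xa(i+1) for i = N-2, ..., 0.
def aux_sequence(x):
    N = len(x)
    xa = [0.0] * N
    xa[N - 1] = x[N - 1]
    for i in range(N - 2, -1, -1):
        xa[i] = (-1) ** i * x[i] + xa[i + 1]
    return xa

# Index map phi(k) = <g^k> mod N from equation (8); for prime N and a
# primitive root g, the values {phi(0), ..., phi(N-2)} enumerate {1, ..., N-1}.
def phi(g, N):
    return [pow(g, k, N) for k in range(N - 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # an arbitrary length-7 input
xa = aux_sequence(x)

# Unrolling the recursion shows xa(0) is the alternating sum of the input
# (for odd N, the last term enters with a + sign, matching (3)).
assert abs(xa[0] - sum((-1) ** i * x[i] for i in range(7))) < 1e-12

# N = 7 with primitive root g = 3, as in Section 3.1.
assert sorted(phi(3, 7)) == [1, 2, 3, 4, 5, 6]
# N = 11 with primitive root g = 2, as in Section 3.2.
assert sorted(phi(2, 11)) == list(range(1, 11))
```

The permutation property is exactly what lets the auxiliary output T(k) be organized as cycle and pseudo-cycle convolutions in (6).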
and

    [T(3)]   [ xa(3) + xa(4)   xa(2) + xa(5)   xa(6) + xa(1) ]   [ c(2)]
    [T(5)] = [ xa(6) + xa(1)   xa(3) + xa(4)   xa(2) + xa(5) ] · [−c(1)]    (11)
    [T(1)]   [ xa(2) + xa(5)   xa(6) + xa(1)   xa(3) + xa(4) ]   [−c(3)]

where c(k) denotes 2 · cos(2kα).

The index mappings δ(i) and γ(i) realize a partition of the permutation of the indexes {1, 2, 3, 4, 5, 6} into two groups. They are defined as follows:

    δ(i): 1 → 4, 2 → 2, 3 → 6,
    γ(i): 1 → 3, 2 → 5, 3 → 1.    (12)

The functions ξ(k, i) and ζ(i) define the signs of the terms in (10) and (11), respectively: ξ(k, i) is defined by the matrix

    [1 1 1]
    [0 0 1]
    [0 1 0]

and ζ(i) is defined by the vector [0 1 1].

Finally, the output sequence {X(k) : k = 1, 2, …, N − 1} can be computed as

    X(1) = [xa(0) + T(1)] · cos(α),
    X(2) = [xa(0) + T(2)] · cos(2α),
    X(3) = [xa(0) + T(3)] · cos(3α),
    X(4) = [xa(0) + T(4)] · cos(4α),    (13)
    X(5) = [xa(0) + T(5)] · cos(5α),
    X(6) = [xa(0) + T(6)] · cos(6α),

together with

    X(0) = xa(0) + 2 · Σ_{i=1}^{3} (−1)^{φ(i)} · [xa(φ(i)) − xa(φ(i + 3))].

### 3.2. DCT Algorithm with Length N = 11

We recursively compute the auxiliary input sequence {xa(i) : i = 0, …, N − 1} as follows:

    xa(10) = x(10),
    xa(i) = (−1)^i · x(i) + xa(i + 1),  for i = 9, …, 0.    (14)

Using the auxiliary input sequence {xa(i) : i = 0, …, N − 1}, we can write (6) in a matrix-vector product form as

    [T(2) ]   [   xa(2) − xa(9)    −(xa(4) − xa(7))     xa(8) − xa(3)       xa(5) − xa(6)     xa(10) − xa(1)  ]   [c(4)]
    [T(4) ]   [  xa(10) − xa(1)    −(xa(2) − xa(9))    −(xa(4) − xa(7))   −(xa(8) − xa(3))   −(xa(5) − xa(6)) ]   [c(3)]
    [T(8) ] = [ −(xa(5) − xa(6))  −(xa(10) − xa(1))    −(xa(2) − xa(9))   −(xa(4) − xa(7))     xa(8) − xa(3)  ] · [c(5)]    (15)
    [T(6) ]   [   xa(8) − xa(3)      xa(5) − xa(6)    −(xa(10) − xa(1))  −(xa(2) − xa(9))      xa(4) − xa(7)  ]   [c(1)]
    [T(10)]   [   xa(4) − xa(7)    −(xa(8) − xa(3))     xa(5) − xa(6)    −(xa(10) − xa(1))     xa(2) − xa(9)  ]   [c(2)]

and

    [T(9)]   [  xa(2) + xa(9)    xa(4) + xa(7)    xa(8) + xa(3)    xa(5) + xa(6)   xa(10) + xa(1) ]   [ c(4)]
    [T(7)]   [ xa(10) + xa(1)    xa(2) + xa(9)    xa(4) + xa(7)    xa(8) + xa(3)    xa(5) + xa(6) ]   [−c(3)]
    [T(3)] = [  xa(5) + xa(6)   xa(10) + xa(1)    xa(2) + xa(9)    xa(4) + xa(7)    xa(8) + xa(3) ] · [−c(5)]    (16)
    [T(5)]   [  xa(8) + xa(3)    xa(5) + xa(6)   xa(10) + xa(1)    xa(2) + xa(9)    xa(4) + xa(7) ]   [−c(1)]
    [T(1)]   [  xa(4) + xa(7)    xa(8) + xa(3)    xa(5) + xa(6)   xa(10) + xa(1)    xa(2) + xa(9) ]   [ c(2)]
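The second system above is an ordinary cyclic convolution: each row of the 5×5 matrix is the previous row rotated right by one position. A short sketch (our own illustration; the variable names and sample values are hypothetical) shows how such a matrix-vector product reduces to a circular convolution over the permuted indexes:

```python
# Circulant matrix whose first row is `row`: row r of the matrix is
# `row` rotated right by r positions, as in the cycle-convolution part
# of the N = 11 example above.
def circulant(row):
    n = len(row)
    return [[row[(j - r) % n] for j in range(n)] for r in range(n)]

def matvec(A, v):
    return [sum(a * b for a, b in zip(r, v)) for r in A]

# The same result written directly as a circular convolution:
# y[r] = sum_j row[(j - r) mod n] * v[j].
def circular_conv(row, v):
    n = len(row)
    return [sum(row[(j - r) % n] * v[j] for j in range(n)) for r in range(n)]

s = [1.0, 2.0, 3.0, 4.0, 5.0]          # stands for xa(2)+xa(9), ..., xa(10)+xa(1)
c = [0.5, -0.25, 0.75, -0.5, 0.25]     # stands for the cosine column vector

assert matvec(circulant(s), c) == circular_conv(s, c)
```

It is this convolution structure, plus sign patterns for the pseudo-cycle part, that lets both systems be mapped onto the same linear systolic array.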
Figure 1: Systolic array architecture for the DCT of length N = 7, where xa_k(i, j) = x_k(i) + x_k(j) and xb_k(i, j) = x_k(i) − x_k(j), with a = α.

Figure 2: The structure of the first processing element (PE) of Figure 1 (a) and the structure of the other processing elements of Figure 1 (b).
…with lengths N = 7 and N = 11, with those of the algorithm proposed by Hou, and with a direct-form implementation.

### 5.1. Fixed-Point Quantization Error Analysis

We will analyze the fixed-point error for the kernel of our architecture, represented by the VLSI implementation of (6). This part contributes decisively to the hardware complexity and the power consumption of the VLSI implementation of the DCT. We will show, analytically and by computer simulation, that it has good quantization properties that can be exploited to further reduce the hardware complexity and the power consumption of our implementation.

We can write (6) in a generic form:

    T(k) = Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · u(i − k) · cos(ψ(i) · 2α).    (18)

Let

    u(i − k) = û(i − k) + Δu(i − k),    (19)

where û(i − k) is the fixed-point representation of the input data and Δu(i − k) is the error between the actual value and its fixed-point representation. Thus,

    T(k) = Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · [û(i − k) + Δu(i − k)] · cos(ψ(i) · 2α).    (20)

We suppose that the process that governs the errors is linear, and thus we can utilize the superposition property. Then (20) becomes

    T(k) = Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · û(i − k) · cos(ψ(i) · 2α)
         + Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δu(i − k) · cos(ψ(i) · 2α).    (21)

We can write

    û(i − k) · cos(ψ(i) · 2α) = −u(i − k)_0 · 2^0 · cos(ψ(i) · 2α)
                              + Σ_{j=1}^{L−1} u(i − k)_j · 2^{−j} · cos(ψ(i) · 2α).    (22)

The sums in (22), for all combinations of the bits {u_0, u_1, …, u_{L−1}}, are computed using a floating-point representation of the coefficients cos(ψ(i) · 2α); the result is then truncated and stored in a ROM. Thus, we can use the following error model for the constant multiplication (22), where ũ is the quantization of the input and Q(·) is the truncation operator:

    ũ · cos(i) = Q(u · cos(i)) + Δ_{u·cos}(i),    (23)

where Δ_{u·cos}(i) is the truncation error. Thus, we can write

    T(k) = Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Q[û(i − k) · cos(ψ(i) · 2α)]
         + Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δ_{u·cos}(ψ(i) · 2α)
         + Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δu(i − k) · cos(ψ(i) · 2α),

    e(k) = T(k) − Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Q[û(i − k) · cos(ψ(i) · 2α)]
         = Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δ_{u·cos}(ψ(i) · 2α)
         + Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δu(i − k) · cos(ψ(i) · 2α).    (24)

We will compute the second-order statistic σ_T² of the error term. This parameter describes the average behavior of the error and is related to the MSE (mean-squared error) and the SNR (signal-to-noise ratio). We assume that the errors are uncorrelated and have zero mean. We have

    σ_T² = E[e²(k)]
         = E[( Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δ_{u·cos}(ψ(i) · 2α)
             + Σ_{i=1}^{(N−1)/2} (−1)^{δ(k,i)} · Δu(i − k) · cos(ψ(i) · 2α) )²]
         = Σ_{i=1}^{(N−1)/2} E[Δ²_{u·cos}(ψ(i) · 2α)]
         + Σ_{i=1}^{(N−1)/2} E[Δ²u(i − k)] · cos²(ψ(i) · 2α).    (25)
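The error model behind this derivation treats each truncation as a small additive noise source with the standard variance 2^(−2B)/12 for B fractional bits. A quick Monte Carlo sketch (ours, not from the paper) confirms that figure for a simple truncation quantizer:

```python
import math
import random

def truncate(v, bits):
    """Fixed-point truncation to `bits` fractional bits (round toward -inf)."""
    q = 2.0 ** -bits
    return math.floor(v / q) * q

random.seed(0)
B = 8
samples = [random.uniform(-1.0, 1.0) for _ in range(200_000)]
errors = [v - truncate(v, B) for v in samples]

# Truncation error lies in [0, 2^-B); its variance about the mean is the
# familiar q^2 / 12 used for sigma_Delta^2 in the analysis.
mean = sum(errors) / len(errors)
var = sum((e - mean) ** 2 for e in errors) / len(errors)

theory = 2.0 ** (-2 * B) / 12.0
assert 0.9 * theory < var < 1.1 * theory
```

Note that truncation error is not zero-mean (its mean is about 2^(−B)/2); the zero-mean assumption in the analysis applies after centering, which does not change the variance.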
Figure 5: The structure of the preprocessing stage for the DCT of length N = 7 (sign unit Sgn[(−1)^i], MUX/DEMUX, permute-and-split stage, add stage with latch, and control word C = 0011).

It results that

    σ_T² = Σ_{i=1}^{(N−1)/2} σ_Δ² + σ_Δu² · Σ_{i=1}^{(N−1)/2} cos²(ψ(i) · 2α)
         = (N − 1)/2 · σ_Δ² + σ_Δu² · Σ_{i=1}^{(N−1)/2} cos²(i · 2α).    (26)

It can easily be seen that

    σ_T² < (N − 1)/2 · (σ_Δ² + σ_Δu²).    (27)

We can assume that

    σ_Δ² = 2^(−2M)/12,  σ_Δx² = 2^(−2L)/12.    (28)

We will verify relation (26) by computer simulation using the SNR parameter. In performing the fixed-point round-off error analysis, we will use the following assumptions:

(i) the input sequence x(i) is quantized using L bits;
(ii) the output of each ROM-based multiplier is quantized using M bits;
(iii) the errors are uncorrelated with one another and with the input;
(iv) the input sequence x(i) is uniformly distributed in (−1, 1) with zero mean;
(v) the round-off error at each multiplier is uniformly distributed with zero mean.

The SNR parameter is computed as

    SNR = 10 · log10(σ_O² / σ_T²),    (29)

where σ_O² is the variance of the output sequence and σ_T² is the variance of the quantization error at the output of the transform.

Figure 6: Truncation error model for a ROM-based multiplication: ũ and the coefficient cos(i) enter the truncation operator Q(·), producing Q(u · cos(i)).

Using the graphic representations shown in Figures 7–9, we can see that the computed values agree with those obtained from simulations: the simulated SNR values are similar to those computed using relations (26) and (29), represented by the snrDfcT plots. Figure 7 shows the dependence of the SNR of our solution, for the transform lengths N = 7 and N = 11, on the numbers of bits L and M used in the quantization of the input sequence and of the output of each ROM-based multiplier, respectively, when M = L. Figure 8 shows the dependence of the SNR of our solution with N = 7 as a function of L, for M = 10, 12, and 14, respectively, and Figure 9 shows the same dependence for our solution with transform length N = 11.

Thus, we have obtained analytically, and verified by simulation, the dependence of the variance of the output round-off error on the quantization error of the input sequence, σ_Δx². This is a significant result, especially for our architecture, as it allows us to choose the value of L appropriately, with L < M. It can be used to significantly reduce the hardware complexity of our implementation, since the dimension of the ROM used to implement a multiplier in our architecture is given by M · 2^L and thus increases exponentially with the number of bits L used to quantize the input u(i) of each multiplier. It can also be seen that if L > M, the improvement of the SNR is insignificant. Using these dependences, we can easily choose L significantly less than M.

Using the method proposed in [?], where the analysis is made for the direct-form IDCT, we can also obtain, for the direct-form DCT, the round-off error variance (σ_{N_i})²:

    σ_{N_i}² = (N − 1) · σ_R²,  for 0 ≤ i ≤ N − 1.    (30)

As compared with a direct-form implementation, known to be robust to fixed-point effects and thus used by many chip manufacturers, the round-off error variance σ_T² for the kernel of our solution, given by relation (26), is significantly better, as we will also see from the computer simulations. Note that the analysis in [?] is made for the direct-form IDCT, but it is similar for the direct-form DCT.
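The practical consequence of (26)–(29) is a trade-off: the ROM behind each multiplier holds one M-bit word per L-bit input pattern, so growing L is exponentially expensive while growing M is only linear. The sketch below is our own illustration of this trade-off (the function names are ours, and the variance bound is the inequality (27) with the models of (28)):

```python
def rom_bits(L, M):
    # A ROM-based multiplier stores one M-bit product per L-bit input
    # pattern: M * 2^L bits total, exponential in L.
    return M * 2 ** L

def error_variance_bound(N, L, M):
    # Bound (27): sigma_T^2 < (N-1)/2 * (sigma_Delta^2 + sigma_Delta_u^2),
    # with sigma_Delta^2 = 2^(-2M)/12 (ROM output truncation) and the input
    # quantization term 2^(-2L)/12, per (28).
    s_delta = 2.0 ** (-2 * M) / 12.0
    s_delta_u = 2.0 ** (-2 * L) / 12.0
    return (N - 1) / 2 * (s_delta + s_delta_u)

# Adding one bit to L doubles the ROM; doubling M only doubles it linearly.
assert rom_bits(11, 6) == 2 * rom_bits(10, 6)
assert rom_bits(10, 12) == 2 * rom_bits(10, 6)

# Once L exceeds M, the ROM truncation term dominates, so extra input bits
# buy almost nothing: the bound shrinks by barely a factor of two.
b_eq = error_variance_bound(11, 10, 10)
b_big_L = error_variance_bound(11, 14, 10)
assert b_big_L > 0.5 * b_eq
```

This is the quantitative reason the text recommends choosing L significantly less than M.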
Figure 7: SNR variation as a function of M when M = L, for N = 7 and N = 11 (snrDfc7, snrDfc11), compared with the computed values (snrDfcT).

Figure 8: SNR variation as a function of L for our solution with N = 7: (a) M = 10, (b) M = 12, (c) M = 14.

Figure 9: SNR variation as a function of L for our solution with N = 11: (a) M = 10, (b) M = 12, (c) M = 14.
In the case where hardwired multipliers are used instead of LUT-based multipliers, there is another error term, due to the quantization of the multiplier coefficients cos(ψ(i) · 2α), of the form

    Σ_{i=1}^{(N−1)/2} E[u(i − k)²] · [Δcos(ψ(i) · 2α)]²,    (31)

where Δcos(ψ(i) · 2α) is the quantization error of the multiplier coefficient. Thus, the quantization error will be significantly greater for a similar hardwired-multiplier-based DCT architecture. It follows that the proposed ROM-based architecture will have better numerical properties than similar hardwired-multiplier-based architectures for the DCT.

Let us also observe that, instead of quantizing the result of relation (22) by truncation, we can store in the ROMs the rounded value of that result. The rounding error is smaller than the truncation error; thus, the numerical properties of the proposed solution can be further improved.

This result shows that the kernel of our implementation based on (6) has good quantization properties, the error due to a fixed-point representation being small: one of the best results for DCT implementations, as will also be shown using computer simulations. Thus, the proposed algorithm and architecture are a very robust solution for a fixed-point implementation of the DCT. This property can be exploited to further reduce the hardware complexity and the power consumption of the main part of our architecture, represented by the above-mentioned kernel, which makes the main contribution to the hardware complexity of our architecture and to its power consumption.

Figure 10: Mean square error as a function of the number of bits (M = L), for Hou (N = 8), the proposed solution (N = 7 and N = 11), and the direct form (N = 8).

### 5.2. Comparison with Other Relevant Implementations of DCT
In our computer simulations, in order to demonstrate the good quantization properties of the proposed solution in comparison with other relevant ones, we have used several significant parameters, such as the PSNR (peak signal-to-noise ratio) and the following measures:

(i) the overall MSE, defined as

    MSE = (1/(mn)) · Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I_finit(i, j) − I(i, j)]²,    (32)

(ii) the peak error, defined as

    PPE = Max_i Max_j |I_finit(i, j) − I(i, j)|,    (33)

where I_finit(i, j) denotes the finite-precision reconstruction of the image I(i, j).

The values of the root of the MSE and the values of the PPE for different values of the word length L, when L = M, are represented in Figures 10 and 11. The values of the PSNR are presented in Figure 12, in dB, for different values of the word length L when M = L. It can easily be seen that the PSNR values are better for our algorithm than those reported by Hou and those of the direct form.

Figure 11: Peak error as a function of the number of bits (M = L), for Hou (N = 8), the proposed solution (N = 7 and N = 11), and the direct form (N = 7).

The obtained numerical properties of the proposed algorithm and its associated VLSI architecture can be exploited to significantly decrease the hardware complexity. Thus, in the proposed architecture, the dimension of the ROMs used to implement the multipliers depends exponentially on the word length L used for the fixed-point representation of the input operands. Consequently, we can use small-size ROMs with a short access time, reducing the hardware complexity and improving the speed. As a result, the overall hardware complexity will be significantly reduced due to these good numerical properties, and so will the power consumption. This improvement is explained, besides the inner properties of the proposed algorithm, by two design decisions that will be explained below.
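The quality measures of (32) and (33) are straightforward to implement; the sketch below is our own illustration (the reference to a peak value of 255 in the PSNR is our assumption, since the paper's peak convention is not shown in this excerpt):

```python
import math

def mse(ref, test):
    # Overall MSE of equation (32): mean of squared pixel differences.
    m, n = len(ref), len(ref[0])
    return sum((test[i][j] - ref[i][j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)

def peak_error(ref, test):
    # Peak pixel error (PPE) of equation (33): max absolute difference.
    return max(abs(test[i][j] - ref[i][j])
               for i in range(len(ref)) for j in range(len(ref[0])))

def psnr(ref, test, peak=255.0):
    # Peak signal-to-noise ratio in dB; the peak value is an assumption.
    return 10.0 * math.log10(peak ** 2 / mse(ref, test))

ref  = [[10.0, 20.0], [30.0, 40.0]]   # hypothetical reference image
test = [[11.0, 20.0], [30.0, 38.0]]   # hypothetical fixed-point reconstruction

assert mse(ref, test) == (1.0 + 4.0) / 4          # 1.25
assert peak_error(ref, test) == 2.0
```

A lower MSE and PPE, or equivalently a higher PSNR, at the same word length L is what Figures 10–12 report in favor of the proposed solution.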
Figure 12: Peak signal-to-noise ratio (PSNR) as a function of the number of bits (M = L), for Hou (N = 8), the proposed solution (N = 7 and N = 11), and the direct form.

Figure 13: ROM words as a function of the DCT length N, for the proposed solution and the reference designs.

Figure 14: Number of bits for the I/O channels at DCT lengths N = 7, 11, and 17.

First, we can increase the precision used in representing the data in relations (3) and (4) without significantly increasing the hardware complexity. Then, we can use a high precision for the coefficients c(i) used in computing the partial results stored in the ROMs that implement the multiplications with these coefficients.

## 6. Comparison

The throughput of the VLSI architecture that can be obtained using the proposed algorithm is 2/(N − 1), similar to [16, 33], but doubled compared to [?]. The pipeline period (average computation time) is (N − 1)T/2, as in [16, 33], but the clock period T is significantly different in these three cases. In the proposed design, T = max(T_Mem, T_a), which is significantly less than in [?], where T = T_Mem + T_a; in [?], where T = T_Mul + 3T_a; and in [?], where T = T_Mul + T_a. We have used the following notations: T_Mul, the multiplication time; T_a, the adder time; and T_Mem, the memory access time. Thus, the proposed design is significantly faster.

If we compare with [?], we can see that (10) can be computed in parallel, and thus the throughput can be doubled with respect to [?], where such a parallelization is not available. Moreover, because the two equations have a similar form and the same length, they can be mapped onto the same linear systolic array, with the ROMs used to implement the shared multipliers. Thus, a significant increase in throughput can be obtained without doubling the hardware, as compared with [?]. The number of control bits is also significantly reduced for a double computation of (10), as compared with [?].

In order to illustrate the advantages of our proposal, we will consider a numerical example with length N = 37 and L = 10 bits for the input of the multipliers. In this case, the numbers of multipliers for our proposed solution and for [16, 34] are very small, being 2 and 1, respectively. Our proposed solution uses 576 words of ROM, [?] uses 1152 words, [?] uses 22 528 words, and [?] has 9 473 000 words. The number of adders is 57 for our solution, 77 for [?], 54 for [?], 107 for [?], and only 20 for [?]. However, [24, 34] have only half the throughput of [16, 33] and of our solution. The number of bits for the I/O channels is 88 for our proposed solution, 107 for [?], and 71 for [?]. In [33, 34], we have only 41 and 20 bits, respectively, for the I/O channels.

The comparison between our solution and other VLSI implementations of the DCT is presented in Table 1. Comparing the hardware complexity, we can see from Table 1 that the number of ROM words in our design is half that of [?] and significantly less than in
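The clock-period argument in the comparison can be made concrete with a toy calculation. The component delays below are purely illustrative values of our own, not figures from the paper; only the period formulas themselves come from the text:

```python
# Hypothetical component delays in ns (illustrative values only).
T_mem = 2.0   # memory (ROM) access time
T_a   = 1.5   # adder time
T_mul = 4.0   # multiplier time

# Clock-period models quoted in the comparison.
T_proposed = max(T_mem, T_a)   # proposed design: ROM access and add overlap
T_alt1     = T_mem + T_a       # serial ROM access followed by an add
T_alt2     = T_mul + 3 * T_a   # multiplier plus three adds
T_alt3     = T_mul + T_a       # multiplier plus one add

assert T_proposed < min(T_alt1, T_alt2, T_alt3)

# The pipeline period (average computation time) is (N - 1) * T / 2, so a
# shorter clock period directly shortens the pipeline period as well.
N = 11
assert (N - 1) * T_proposed / 2 < (N - 1) * T_alt1 / 2
```

With any delays for which a ROM access is cheaper than a full multiplication, the max-based period of the proposed design beats the additive periods of the other schemes.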