EURASIP Journal on Applied Signal Processing 2005:9, 1382–1399
© 2005 Hindawi Publishing Corporation

A Two-Channel Training Algorithm for Hidden Markov Model and Its Application to Lip Reading

Liang Dong
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260
Email: engp0564@nus.edu.sg

Say Wei Foo
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
Email: eswfoo@ntu.edu.sg

Yong Lian
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260
Email: eleliany@nus.edu.sg

Received 1 November 2003; Revised 12 May 2004

The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification such as speech recognition since the 1980s. In this paper, a novel two-channel training strategy is proposed for discriminative training of the HMM. For the proposed training strategy, a novel separable-distance function that measures the difference between a pair of training samples is adopted as the criterion function. The symbol emission matrix of an HMM is split into two channels: a static channel to maintain the validity of the HMM and a dynamic channel that is modified to maximize the separable distance. The parameters of the two-channel HMM are estimated by iterative application of expectation-maximization (EM) operations. As an example of the application of the novel approach, a hierarchical speaker-dependent visual speech recognition system is trained using the two-channel HMMs. Results of experiments on identifying a group of confusable visemes indicate that the proposed approach is able to increase the recognition accuracy by an average of 20% compared with conventional HMMs that are trained with the Baum-Welch estimation.

Keywords and phrases: viseme recognition, two-channel hidden Markov model, discriminative training, separable-distance function.

1. INTRODUCTION

The focus of most automatic speech recognition techniques is on the spoken sounds alone. If the speaking environment is noise free and the recognition engine is well configured, a high recognition rate is attainable for most speakers. However, in real-world environments such as the office, bus station, shop, and factory, the captured speech may be heavily polluted by background noise and cross-speaker noise. When such a signal is presented to a sound-based speech recognition system, the recognition accuracy may drop dramatically. One solution to enhance speech recognition accuracy under noisy conditions is to jointly process information from multiple modalities of speech. Automatic lip reading, in which the visual aspect of speech is considered for speech recognition, is one such modality.

It has long been observed that the presence of visual cues such as the movement of the lips, facial muscles, teeth, and tongue may enhance human speech perception. Systematic studies on lip reading have been carried out since the 1950s [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Sumby and Pollack [1] showed that the incorporation of visual information added an equivalent 12 dB gain in signal-to-noise ratio.

Among the various techniques for visual speech recognition, the hidden Markov model (HMM) holds the greatest promise due to its capabilities in modeling and analyzing temporal processes, as reported in [9, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. Most of the reported HMM-based visual speech processing systems take an individual word as the basic recognition unit and train an HMM to model it. Such an approach works well with a limited vocabulary such as a digit set [15, 30], a small number of AVletters [31], and isolated words or nonsense words [32], but it is difficult to extend these methods to large-vocabulary recognition tasks, as a great number of word models has to be trained. One solution to this problem is to build subword models such as phoneme models. Any word that is presented to the recognition system is broken down into subwords. In this way, even if a word is not included in training the system, the system can still make a good guess at its identity.
The smallest visibly distinguishable unit of visual speech is commonly referred to as a viseme [33]. Like phonemes, which are the basic building blocks of the sound of a language, visemes are the basic constituents of the visual representation of words. The time variation of the mouth shape in speech is small compared with the corresponding variation of the acoustic waveform. Some previous experiments indicate that traditional HMM classifiers, which are trained with the Baum-Welch algorithm, are sometimes unable to separate mouth shapes with small differences [34]. Such small differences have prompted some researchers to regard the relationship between phonemes and visemes as a many-to-one mapping. For example, although the phonemes /b/, /m/, /p/ are acoustically distinguishable, the sequences of mouth shapes for the three sounds are not readily distinguishable; hence the three phonemes are grouped into one viseme category. An early viseme grouping was suggested by Binnie et al. [35]. The MPEG-4 multimedia standard adopted the same viseme grouping strategy for face animation, in which fourteen viseme groups are included [36]. However, different groupings are adopted by different researchers to fulfill specific requirements [37, 38].

Motivated by the need for an approach to differentiate visemes that are only slightly different, we propose a novel approach to improve the discriminative power of HMM classifiers. The approach aims at amplifying the separable distance between a pair of training samples. A two-channel HMM is developed: one channel, called the static channel, is kept fixed to maintain the validity of the probabilistic framework, and the other channel, called the dynamic channel, is modified to amplify the difference between the training pair.

A hierarchical classifier is also proposed based on the two-channel training strategy. At the top level, broad identification is performed, and fine identification is subsequently carried out within the broad category identified. Experimental results indicate that the proposed classifier surpasses the traditional ML HMM classifier in identifying the mouth shapes.

Although the proposed method is developed for the recognition of visemes, it can also be applied to any sequence classification problem. As such, the theoretical background and the training strategy of the two-channel discriminative training method are introduced first in Sections 2, 3, and 4. This is followed by a discussion of the general properties and extensions of the training strategy in Sections 5 and 6, respectively. Details of the application of the method to viseme recognition and the experimental results obtained are given in Section 7. The concluding remarks are presented in Section 8.

2. REVIEW OF HIDDEN MARKOV MODEL

The hidden Markov model is also referred to as a hidden Markov process (HMP), as the latter term emphasizes the stochastic process rather than the model itself. The HMP was first introduced by Baum and Petrie [39] in 1966. The basic theories and properties of the HMP were introduced in full generality in a series of papers by Baum and his colleagues [40, 41, 42, 43]; these include the convergence of the entropy function of an HMP, the computation of the conditional probability, and the local convergence of the maximum-likelihood (ML) parameter estimation of the HMM. Application of the HMM to speech processing took place in the mid-1970s. A phonetic speech recognition system that adopts an HMM-based classifier was first developed at IBM [44, 45]. Applications of the HMM to speech processing were further explored by Rabiner and Juang [46, 47].

The beauty of the HMM is that it is able to reveal the underlying process of signal generation even though the properties of the signal source remain largely unknown. Assume that O_M = {O_1, O_2, ..., O_M} is the discrete set of observed symbols and S_N = {S_1, S_2, ..., S_N} is the set of states; an N-state, M-symbol discrete HMM θ(π, A, B) consists of the following three components.

(1) The probability array of the initial state: π = [π_i] = [P(s_1 = S_i)]_{1×N}, where s_1 is the first state in the state chain.
(2) The state-transition matrix: A = [a_ij] = [P(s_{t+1} = S_j | s_t = S_i)]_{N×N}, where s_{t+1} and s_t denote the (t+1)th state and the tth state in the state chain.
(3) The symbol emission probability matrix: B = [b_ij] = [P(o_t = O_j | s_t = S_i)]_{N×M}, where o_t is the tth observed symbol in the observation sequence.

In a K-class identification problem, assume that x^T = (x_1, x_2, ..., x_T) is a sample of a particular class, say class d_i. The probability of occurrence of the sample x^T given the HMM θ(π, A, B), denoted by P(x^T | θ), is computed using either the forward or the backward process, and the optimal hidden-state chain is revealed using Viterbi matching [46]. Training of the HMM is the process of determining the parameter set θ(π, A, B) to fulfill a certain criterion function such as P(x^T | θ) or the mutual information [46, 48]. For training of the HMM, the Baum-Welch training algorithm is popularly adopted. The Baum-Welch algorithm is an ML estimation; thus the HMM so obtained, θ_ML, is one that maximizes the probability P(x^T | θ). Mathematically,

θ_ML = argmax_θ P(x^T | θ).   (1)

The Baum-Welch training can be realized at a relatively high speed, as expectation-maximization (EM) estimation is adopted in the training process.

However, the parameters of the HMM are solely determined by the correct samples, while the relationship between the correct samples and incorrect ones is not taken into consideration. The method, in its original form, is thus not developed for fine recognition. If another sample y^T of class d_j (j ≠ i) is similar to x^T, the scored probability P(y^T | θ) may be close to P(x^T | θ), and θ_ML may not be able to distinguish x^T and y^T. One solution to this problem is to adopt a training strategy that maximizes the mutual information I_M(θ, x^T) defined as

I_M(θ, x^T) = log P(x^T | θ) − log Σ_{θ' ≠ θ} P(x^T | θ') P(θ').   (2)
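In practice, P(x^T | θ) in (1) is evaluated with the forward process mentioned above. The following is a minimal sketch of that recursion for a discrete HMM; the two-state toy model and its numbers are invented for illustration and are not from the paper.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(x^T | theta) for a discrete HMM theta = (pi, A, B).

    pi  : (N,) initial-state probabilities, pi[i] = P(s_1 = S_i)
    A   : (N, N) state-transition matrix, A[i, j] = P(s_{t+1} = S_j | s_t = S_i)
    B   : (N, M) symbol-emission matrix, B[i, j] = P(o_t = O_j | s_t = S_i)
    obs : observation sequence as symbol indices (x_1, ..., x_T)
    """
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(x_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction step of the forward process
    return alpha.sum()                  # P(x^T | theta) = sum_i alpha_T(i)

# Toy 2-state, 2-symbol model (numbers invented for illustration).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward_likelihood(pi, A, B, [0, 1, 0]))
```

Summing the final forward variables reproduces exactly the probability that a brute-force enumeration of all state chains would give, at O(N²T) instead of O(N^T) cost.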
[Figure 1: The block diagram of a two-channel HMM. Over a common state chain (..., state i, state i+1, state i+2, ...), the two-channel HMM is the combination of a static-channel HMM and a dynamic-channel HMM.]

This method is referred to as maximum mutual information (MMI) estimation [48]. It increases the a posteriori probability of the model corresponding to the training data, and thus the overall discriminative power of the HMM obtained is guaranteed. However, analytical solutions to (2) are difficult to realize, and the implementation of MMI estimation is tedious. A computationally less intensive approach is desirable.

3. PRINCIPLES OF TWO-CHANNEL HMM

To improve the discriminative ability of the HMM and, at the same time, to facilitate the process of parameter tuning, the following two-channel training method is proposed, where the HMM is specially tailored to amplify the difference between two similar samples.

The block diagram of the two-channel HMM is given in Figure 1. It consists of a static-channel HMM and a dynamic-channel HMM. For the static channel, a normal HMM derived from a parameter-smoothed ML approach is used. A new HMM for the dynamic channel is to be derived. Details of the derivation of the dynamic-channel HMM are described in the following paragraphs.

Assume that in a two-class identification problem, {x^T : d_1} and {y^T : d_2} are a pair of training samples, where x^T = (x_1, x_2, ..., x_T) and y^T = (y_1, y_2, ..., y_T) are observation sequences of length T and d_1 and d_2 are the class labels. The observed symbols in x^T and y^T are from the symbol set O_M. P(x^T | θ) and P(y^T | θ) are the scored probabilities for x^T and y^T given the HMM θ, respectively. The pair of training samples x^T and y^T must be of the same length so that their probabilities P(x^T | θ) and P(y^T | θ) can be suitably compared. Such a comparison is meaningless if the samples are of different lengths; the shorter sequence may give a larger probability than the longer one even if it is not the true sample of θ.

Define a new function I(x^T, y^T, θ), called the separable-distance function, as follows:

I(x^T, y^T, θ) = log P(x^T | θ) − log P(y^T | θ).   (3)

A large value of I(x^T, y^T, θ) would mean that x^T and y^T are more distinct and separable. The strategy then is to determine the HMM θ_MSD (MSD for maximum separable distance) that maximizes I(x^T, y^T, θ). Mathematically,

θ_MSD = argmax_θ I(x^T, y^T, θ).   (4)

For the proposed training strategy, the parameter set for the static-channel HMM is determined in the normal way, such as the ML approach. For the dynamic-channel HMM, to maintain synchronization of the duration and transition of states, the same set of values for π and A as derived for the static-channel HMM is used; only the parameters of matrix B are adjusted.

As a first step towards the maximization of the separable-distance function I(x^T, y^T, θ), an auxiliary function F(x^T, y^T, θ, λ) involving I(x^T, y^T, θ) and the parameters of B is defined as

F(x^T, y^T, θ, λ) = I(x^T, y^T, θ) + Σ_{i=1}^{N} λ_i (1 − Σ_{j=1}^{M} b_ij),   (5)

where λ_i is the Lagrange multiplier for the ith state and Σ_{j=1}^{M} b_ij = 1 (i = 1, 2, ..., N). By maximizing F(x^T, y^T, θ, λ), I(x^T, y^T, θ) is also maximized. Differentiating F(x^T, y^T, θ, λ) with respect to b_ij and setting the result to 0, we have

∂ log P(x^T | θ)/∂b_ij − ∂ log P(y^T | θ)/∂b_ij = λ_i.   (6)

Since λ_i is positive, the optimum value obtained for I(x^T, y^T, θ) is a maximum, as the solutions for b_ij must be positive. In (6), log P(x^T | θ) and log P(y^T | θ) may be computed by summing up all the probabilities over time T:

log P(x^T | θ) = Σ_{τ=1}^{T} log Σ_{i=1}^{N} P(s_τ^T = S_i) b_i(x_τ^T).   (7)

Note that the state-transition coefficients a_ij do not appear explicitly in (7); they are included in the term P(s_τ^T = S_i).
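The separable-distance function (3) reduces to two log-likelihood evaluations under the same model θ. A small sketch follows, using a per-step-scaled forward recursion to avoid numerical underflow; the function and variable names are ours, not the paper's.

```python
import numpy as np

def log_likelihood(pi, A, B, obs):
    """log P(obs | theta) via the forward process with per-step scaling
    (rescaling alpha at every step avoids numerical underflow for long T)."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t, o in enumerate(obs):
        if t > 0:
            alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()          # scaling factor; log P accumulates log(c)
        log_p += np.log(c)
        alpha = alpha / c
    return log_p

def separable_distance(pi, A, B, x, y):
    """Eq. (3): I(x^T, y^T, theta) = log P(x^T|theta) - log P(y^T|theta).
    The pair must have the same length T for the comparison to be fair."""
    assert len(x) == len(y), "training pair must be of equal length"
    return log_likelihood(pi, A, B, x) - log_likelihood(pi, A, B, y)
```

By construction I(x^T, x^T, θ) = 0, and the θ_MSD of (4) is whichever admissible θ drives I upward for the given pair.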
The two partial derivatives in (6) may be evaluated separately as follows:

∂ log P(x^T | θ)/∂b_ij = Σ_{τ=1, x_τ^T=O_j}^{T} b_ij^{-1} P(s_τ^T = S_i | θ, x^T) = b_ij^{-1} Σ_{τ=1}^{T} P(s_τ^T = S_i, x_τ^T = O_j | θ, x^T),
∂ log P(y^T | θ)/∂b_ij = Σ_{τ=1, y_τ^T=O_j}^{T} b_ij^{-1} P(s_τ^T = S_i | θ, y^T) = b_ij^{-1} Σ_{τ=1}^{T} P(s_τ^T = S_i, y_τ^T = O_j | θ, y^T).   (8)

By defining

E(S_i, O_j | θ, x^T) = Σ_{τ=1}^{T} P(s_τ^T = S_i, x_τ^T = O_j | θ, x^T),
E(S_i, O_j | θ, y^T) = Σ_{τ=1}^{T} P(s_τ^T = S_i, y_τ^T = O_j | θ, y^T),   (9)
D_ij(x^T, y^T, θ) = E(S_i, O_j | θ, x^T) − E(S_i, O_j | θ, y^T),

equation (6) can be written as

[E(S_i, O_j | θ, x^T) − E(S_i, O_j | θ, y^T)] / b_ij = D_ij(x^T, y^T, θ) / b_ij = λ_i,  1 ≤ j ≤ M.   (10)

By making use of the fact that Σ_{j=1}^{M} b_ij = 1, it can be shown that

b_ij = D_ij(x^T, y^T, θ) / Σ_{j=1}^{M} D_ij(x^T, y^T, θ),  i = 1, 2, ..., N, j = 1, 2, ..., M.   (11)

The set {b_ij} (i = 1, 2, ..., N, j = 1, 2, ..., M) so obtained gives the maximum value of I(x^T, y^T, θ).

An algorithm for the computation of the values may be developed by using the standard expectation-maximization (EM) technique. By considering x^T and y^T as the observed data and the state sequence s^T = (s_1^T, s_2^T, ..., s_T^T) as the hidden or unobserved data, the estimation of E_θ(I) = E[I(x^T, y^T, s^T | θ') | x^T, y^T, θ] from the incomplete data x^T and y^T is then given by [49]

E_θ(I) = Σ_{s^T ∈ S} I(x^T, y^T, s^T | θ') P(x^T, y^T, s^T | θ)
       = Σ_{s^T ∈ S} [log P(x^T, s^T | θ') − log P(y^T, s^T | θ')] P(x^T, y^T, s^T | θ),   (12)

where θ and θ' are the HMM before training and the HMM after training, respectively, and S denotes all the state combinations of length T. The purpose of the E-step of the EM estimation is to calculate E_θ(I). By using the auxiliary function Q_x(θ, θ') proposed in [48] and defined as follows:

Q_x(θ, θ') = Σ_{s^T ∈ S} log P(x^T, s^T | θ') P(x^T, s^T | θ),   (13)

equation (12) can be written as

E_θ(I) = Q_x(θ, θ') P(y^T | s^T, θ) − Q_y(θ, θ') P(x^T | s^T, θ).   (14)

Q_x(θ, θ') and Q_y(θ, θ') may be further analyzed by breaking up the probability P(x^T, s^T | θ') as follows:

P(x^T, s^T | θ') = π'_{s_0} Π_{τ=1}^{T} a'_{s_{τ−1}, s_τ} Π_{τ=1}^{T} b'_{s_τ}(x_τ^T),   (15)

where π', a', and b' are the parameters of θ'. Here, we assume that the initial distribution starts at τ = 0 instead of τ = 1 for notational convenience. The Q function then becomes

Q_x(θ, θ') = Σ_{s^T ∈ S} log π'_{s_0} P(x^T, s^T | θ) + Σ_{s^T ∈ S} Σ_{τ=1}^{T} log a'_{s_{τ−1}, s_τ} P(x^T, s^T | θ) + Σ_{s^T ∈ S} Σ_{τ=1}^{T} log b'_{s_τ}(x_τ^T) P(x^T, s^T | θ).   (16)

The parameters to be optimized are now separated into three independent terms. From (14) and (16), E_θ(I) can also be divided into the following three terms:

E_θ(I) = E_θ(π, I) + E_θ(a, I) + E_θ(b, I),   (17)

where

E_θ(π, I) = Σ_{s^T ∈ S} log π'_{s_0} [P(x^T, y^T, s^T | θ) − P(x^T, y^T, s^T | θ)] = 0,
E_θ(a, I) = Σ_{s^T ∈ S} Σ_{τ=1}^{T} log a'_{s_{τ−1}, s_τ} [P(x^T, y^T, s^T | θ) − P(x^T, y^T, s^T | θ)] = 0,
E_θ(b, I) = Σ_{s^T ∈ S} [Σ_{τ=1}^{T} log b'_{s_τ}(x_τ^T) − Σ_{τ=1}^{T} log b'_{s_τ}(y_τ^T)] P(x^T, y^T, s^T | θ).   (18)
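Equations (12)–(16) sum complete-data terms over every state combination s^T ∈ S, with the joint probability factorized as in (15). That factorization can be checked by brute force on a tiny model, as sketched below (exponential in T, so illustration only; for simplicity we take the initial state at τ = 1 rather than the paper's τ = 0 convention, and the names are ours).

```python
import itertools
import numpy as np

def joint_prob(pi, A, B, obs, path):
    """Eq. (15): P(x^T, s^T | theta) = pi_{s_1} * prod_t a_{s_{t-1}, s_t}
    * prod_t b_{s_t}(x_t), with the initial state taken at t = 1."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

def sum_over_paths(pi, A, B, obs):
    """Marginalize the joint over every state combination s^T in S,
    as the sums in (12)-(13) do.  Exponential in T: illustration only."""
    N, T = len(pi), len(obs)
    return sum(joint_prob(pi, A, B, obs, path)
               for path in itertools.product(range(N), repeat=T))
```

Summing (15) over all N^T paths must reproduce the forward likelihood P(x^T | θ), which is a quick sanity check on any implementation of these sums.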
[Figure 2: The two-channel structure of the ith state of a left-right HMM. The static channel [b^s_ij] carries weightage 1 − ω_i and the dynamic channel [b^d_ij] carries weightage ω_i.]

E_θ(π, I) and E_θ(a, I) are associated with the hidden-state sequence s^T. It is assumed that x^T and y^T are drawn independently and emitted from the same state sequence s^T; hence both E_θ(π, I) and E_θ(a, I) become 0. E_θ(b, I), on the other hand, is related to the symbols that appear in x^T and y^T and contributes to E_θ(I). By enumerating all the state combinations, we have

E_θ(b, I) = Σ_{i=1}^{N} Σ_{τ=1}^{T} [log b_i(x_τ^T) − log b_i(y_τ^T)] P(x^T, y^T, s_τ^T = S_i | θ).   (19)

If Σ_{τ=1}^{T} [log b_i(x_τ^T) − log b_i(y_τ^T)] is arranged according to the order of appearance of the symbols (O_j) within x^T and y^T, we have

E_θ(b, I) = Σ_{τ=1}^{T} Σ_{i=1}^{N} Σ_{j=1}^{M} log b_ij [P(x_τ^T = O_j, s_τ^T = S_i | θ, x^T) − P(y_τ^T = O_j, s_τ^T = S_i | θ, y^T)] P(x^T, y^T | θ)   (20)

or

E_θ(b, I) = Σ_{i=1}^{N} Σ_{j=1}^{M} log b_ij [E(S_i, O_j | θ, x^T) − E(S_i, O_j | θ, y^T)] P(x^T, y^T | θ).   (21)

In the M-step of the EM estimation, b_ij is adjusted to maximize E_θ(b, I) or E_θ(I). Since Σ_{j=1}^{M} b_ij = 1 and (21) has the form Σ_{j=1}^{M} w_j log v_j, which attains a global maximum at the point v_j = w_j / Σ_{j=1}^{M} w_j (j = 1, 2, ..., M), the re-estimated value of b_ij of θ' that leads to the maximum E_θ(I) is given by

b_ij = [E(S_i, O_j | θ, x^T) − E(S_i, O_j | θ, y^T)] / Σ_{j=1}^{M} [E(S_i, O_j | θ, x^T) − E(S_i, O_j | θ, y^T)] = D_ij(x^T, y^T, θ) / Σ_{j=1}^{M} D_ij(x^T, y^T, θ).   (22)

This equation, compared with (11), enables the re-estimation of the symbol emission coefficients b_ij from the expectations of the existing HMM. The above derivations strictly observe the standard optimization strategy [49], where the expectation of the value of the separable-distance function, E_θ(I), is computed in the E-step and the coefficients b_ij are adjusted to maximize E_θ(I) in the M-step. The convergence of the method is therefore guaranteed. However, b_ij may not be estimated by applying (22) alone; other considerations must be taken into account, such as when D_ij(x^T, y^T, θ) is less than or equal to 0. Further discussion on the determination of the values of b_ij is given in the subsequent sections.

To modify the parameters according to (22) and simultaneously ensure the validity of the model, a two-channel structure as depicted in Figure 2 is proposed. The elements (b_ij) of matrix B of the two-channel HMM are decomposed into two parts as

b_ij = b^s_ij + b^d_ij  (∀i = 1, 2, ..., N, j = 1, 2, ..., M),   (23)

b^s_ij for the static channel and b^d_ij for the dynamic channel. The dynamic-channel coefficients b^d_ij are the key source of the discriminative power. The b^s_ij are computed using the parameter-smoothed ML HMM and weighted. As long as the b_ij computed from (22) is greater than b^s_ij, b^d_ij is determined as the difference between b_ij and b^s_ij according to (23); otherwise b^d_ij is set to 0.

To avoid the occurrence of zero or negative probability, b^s_ij (∀i = 1, 2, ..., N, ∀j = 1, 2, ..., M) should be kept greater than 0 in the training procedure, and at the same time,
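A sketch of the M-step update (22) together with the channel decomposition (23) follows, assuming the expectation matrices E(S_i, O_j | θ, ·) have already been accumulated and that every row of D sums to a positive value; the case D_ij ≤ 0 is deferred to the symbol-set partition of Section 4.2, and the names are ours.

```python
import numpy as np

def reestimate_b(Ex, Ey):
    """Eq. (22): b_ij proportional to D_ij = E(S_i,O_j|theta,x) - E(S_i,O_j|theta,y),
    normalized over the symbols j.  Ex and Ey are N x M expectation matrices;
    each row of D = Ex - Ey is assumed to sum to a positive value."""
    D = Ex - Ey
    return D / D.sum(axis=1, keepdims=True)

def split_channels(b_new, b_static):
    """Eq. (23): b_ij = b^s_ij + b^d_ij with b^d_ij >= 0.  The dynamic part is
    the excess of the re-estimated b over the static channel; where b_new does
    not exceed b_static, the dynamic coefficient is set to 0."""
    b_dyn = np.maximum(b_new - b_static, 0.0)
    return b_static + b_dyn, b_dyn
```

Keeping the static channel strictly positive and the dynamic channel nonnegative realizes the constraint b_ij = b^s_ij + b^d_ij ≥ b^s_ij > 0 discussed in the text.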
the dynamic-channel coefficient b^d_ij (∀i = 1, 2, ..., N, ∀j = 1, 2, ..., M) should be nonnegative. Thus the probability constraint b_ij = b^s_ij + b^d_ij ≥ b^s_ij > 0 is met.

In addition, the relative weightage of the static channel and the dynamic channel may be controlled by the credibility weighing factor ω_i (i = 1, 2, ..., N); different states may have different values. If the weightage of the dynamic channel is set to ω_i by scaling of the coefficients,

Σ_{j=1}^{M} b^d_ij = ω_i,  0 ≤ ω_i < 1, ∀i = 1, 2, ..., N,   (24)

then the weightage of the static channel has to be set as follows:

Σ_{j=1}^{M} b^s_ij = 1 − ω_i,  0 ≤ ω_i < 1, ∀i = 1, 2, ..., N.   (25)

4. TWO-CHANNEL TRAINING STRATEGY

4.1. Parameter initialization

The parameter-smoothed ML HMM of x^T, θ^x_ML, which is trained using the Baum-Welch estimation, is referred to as the base HMM. The static-channel HMM is derived from the base HMM after applying the scaling factor. Parameter smoothing is carried out for θ^x_ML to prevent the occurrence of zero probability. Parameter smoothing is the simple measure of setting b_ij to some minimum value, for example, ε = 10^{-3}, if the estimated conditional probability b_ij = 0 [46]. As a result, even though symbol O_j never appears in the training set, there is still a nonzero probability of its occurrence in θ^x_ML. Parameter smoothing is a post-training adjustment to decrease the error rate, because the training set, which is usually limited by its size, may not cover erratic samples.

Before carrying out discriminative training, ω_i (the credibility weighing factor of the ith state), b^s_ij (the static-channel coefficients), and b^d_ij (the dynamic-channel coefficients) are initialized.

The static-channel coefficients b^s_ij are given by

[b^s_i1 b^s_i2 ··· b^s_iM] = (1 − ω_i) [b_i1 b_i2 ··· b_iM],  1 ≤ i ≤ N, 0 ≤ ω_i < 1,   (26)

where b_ij is the symbol emission probability of θ^x_ML.

As for the dynamic-channel coefficients b^d_ij, a random or uniform initial distribution usually works well. In the experiments conducted in this paper, uniform values equal to ω_i/M are assigned to the b^d_ij's as initial values.

The selection of ω_i is flexible and largely problem-dependent. A large value of ω_i means that a large weightage is assigned to the dynamic channel and the discriminative power is enhanced. However, as we adjust b^d_ij in the direction of increasing I(x^T, y^T, θ), the probability of the correct observation P(x^T | θ) will normally decrease. This situation is undesirable because the two-channel HMM obtained is then unlikely to generate even the correct samples.

A guideline for the determination of the value of ω_i is as follows. If the training pairs are very similar to each other such that P(x^T | θ^x_ML) ≈ P(y^T | θ^x_ML), ω_i should be set to a large value to guarantee good discrimination; on the other hand, if P(x^T | θ^x_ML) ≫ P(y^T | θ^x_ML), ω_i should be set to a small value to keep P(x^T | θ) reasonably large. In addition, different values will be used for different states because they contribute differently to the scored probabilities. However, the values of ω_i for the different states should not differ greatly.

Based on the above considerations, the following procedures are taken to determine ω_i. Given the base HMM θ^x_ML and the training pair x^T and y^T, the optimal state chains are searched using the Viterbi algorithm. If θ^x_ML is a left-right model and the expected (optimal) duration of the ith state (i = 1, 2, ..., N) of x^T is from t_i to t_i + τ_i, P(x^T | θ^x_ML) is then written as follows:

P(x^T | θ^x_ML) = P(x^T_{t_1}, ..., x^T_{t_1+τ_1} | θ^x_ML) P(x^T_{t_2}, ..., x^T_{t_2+τ_2} | θ^x_ML) ··· P(x^T_{t_N}, ..., x^T_{t_N+τ_N} | θ^x_ML);   (27)

P(y^T | θ^x_ML) is decomposed in the same way.

Let P_dur(x^T, S_i | θ^x_ML) = P(x^T_{t_i}, ..., x^T_{t_i+τ_i} | θ^x_ML). This probability may be computed as follows:

P_dur(x^T, S_i | θ^x_ML) = Π_{t=t_i}^{t_i+τ_i} Σ_{j=1}^{N} P(s_t^T = S_j) b_j(x_t^T).   (28)

P_dur(x^T, S_i | θ^x_ML) may also be computed using the forward variables α_t^x(i) = P(x^T_{t_i+1}, ..., x^T_{t_i+t}, s^T_{t_i+t} = S_i | θ^x_ML) and/or the backward variables β_t^x(i) = P(x^T_{t_i+t+1}, ..., x^T_{t_i+τ_i+1} | s^T_{t_i+t} = S_i, θ^x_ML) [46].

However, if θ^x_ML is not a left-right model but an ergodic model, the expected duration of a state will consist of a number of separated time slices, for example, k slices such as t_i1 to t_i1 + τ_i1, t_i2 to t_i2 + τ_i2, ..., and t_ik to t_ik + τ_ik. P_dur(x^T, S_i | θ^x_ML) is then computed by multiplying them together as shown:

P_dur(x^T, S_i | θ^x_ML) = P(x^T_{t_i1}, ..., x^T_{t_i1+τ_i1} | θ^x_ML) P(x^T_{t_i2}, ..., x^T_{t_i2+τ_i2} | θ^x_ML) ··· P(x^T_{t_ik}, ..., x^T_{t_ik+τ_ik} | θ^x_ML).   (29)

The value of ω_i is derived by comparing the corresponding P_dur(x^T, S_i | θ^x_ML) and P_dur(y^T, S_i | θ^x_ML). If P_dur(x^T, S_i | θ^x_ML) ≫ P_dur(y^T, S_i | θ^x_ML), this indicates that the coefficients of the ith state of the base model are good enough for discrimination, and ω_i should be set to a small value to preserve the original ML configuration. If P_dur(x^T, S_i | θ^x_ML) < P_dur(y^T, S_i | θ^x_ML) or P_dur(x^T, S_i | θ^x_ML) ≈ P_dur(y^T, S_i | θ^x_ML), this indicates that state S_i is not able to distinguish between x^T and y^T; thus ω_i must be set to a value large enough to ensure P_dur(x^T, S_i | θ) > P_dur(y^T, S_i | θ), where θ is the two-channel HMM.
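The initialization of Section 4.1 can be sketched as follows: the static channel is the scaled ML row of (26), the dynamic channel starts uniform at ω_i/M, and ω_i itself may be computed from the duration-probability ratio using the smoothing expression given in (30); all helper names are ours.

```python
import numpy as np

def credibility_factor(p_dur_x, p_dur_y, C=1.0, D=0.1):
    """Eq. (30): omega_i = 1 / (1 + C * v**D), v = P_dur(x,S_i)/P_dur(y,S_i).
    A state that already separates the pair well (large v) gets a small omega_i;
    a state with v near 1 gets a large omega_i."""
    v = p_dur_x / p_dur_y
    return 1.0 / (1.0 + C * v ** D)

def init_channels(b_ml, omega):
    """Eqs. (24)-(26): for each state i the static channel is the scaled ML row
    (1 - omega_i) * b_i of the base HMM, and the dynamic channel starts uniform
    at omega_i / M.  b_ml is the N x M emission matrix of the base HMM."""
    N, M = b_ml.shape
    b_static = (1.0 - omega)[:, None] * b_ml
    b_dynamic = np.tile((omega / M)[:, None], (1, M))
    return b_static, b_dynamic
```

With this initialization each static row sums to 1 − ω_i, each dynamic row sums to ω_i, and their sum is a valid emission distribution, as (24)–(25) require.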
In practice, ω_i can be manually selected according to the conditions mentioned above (which is preferred), or it can be computed using the following expression:

ω_i = 1 / (1 + C v^D),   (30)

where v = P_dur(x^T, S_i | θ^x_ML) / P_dur(y^T, S_i | θ^x_ML), and C (C > 0) and D are constants that jointly control the smoothness of ω_i with respect to v. Since C > 0 and v > 0, ω_i < 1; by using suitable values of C and D, a set of credibility factors ω_i is computed for the states of the target HMM. For example, if the range of v is 10^{-3} to 10^5, a typical setting is C = 1.0 and D = 0.1. Once the values of ω_i (i = 1, 2, ..., N) are determined, they are not changed in the training process.

4.2. Partition of the observation symbol set

Let θ denote the HMM with the above initial configuration. The coefficients of the dynamic channel are adjusted according to the following procedures. First, E(S_i, O_j | θ, x^T) and E(S_i, O_j | θ, y^T) are computed through the counting process. Using the forward variables α_τ^x(i) = P(x_1^T, ..., x_τ^T, s_τ^T = S_i | θ) and the backward variables β_τ^x(i) = P(x_{τ+1}^T, ..., x_T^T | s_τ^T = S_i, θ) [46], the following two probabilities are computed:

ξ_τ^x(i, j) = P(s_τ^T = S_i, s_{τ+1}^T = S_j | x^T, θ) = α_τ^x(i) a_ij b_j(x_{τ+1}^T) β_{τ+1}^x(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_τ^x(i) a_ij b_j(x_{τ+1}^T) β_{τ+1}^x(j),
γ_τ^x(i) = P(s_τ^T = S_i | x^T, θ) = Σ_{j=1}^{N} ξ_τ^x(i, j);   (31)

ξ_τ^y(i, j) and γ_τ^y(i) are obtained in the same manner. By counting the states, we have

E(S_i, O_j | θ, x^T) = Σ_{τ=1, x_τ^T=O_j}^{T} γ_τ^x(i),
E(S_i, O_j | θ, y^T) = Σ_{τ=1, y_τ^T=O_j}^{T} γ_τ^y(i).   (32)

It is shown in (22) that to maximize I(x^T, y^T, θ), b_ij should be set proportional to D_ij(x^T, y^T, θ). However, for certain symbols, for example O_p, the expectation D_ip(x^T, y^T, θ) may be less than 0. Since the symbol emission coefficients cannot take negative values, these symbols have to be specially treated. For this reason, the symbol set O_M = {O_1, O_2, ..., O_M} is partitioned into the subset V = {V_1, V_2, ..., V_K} and its complement U = {U_1, U_2, ..., U_{M−K}} (O_M = U ∪ V) according to the following criterion:

{V_1, V_2, ..., V_K} = arg_{O_j} { E(S_i, O_j | θ, x^T) / E(S_i, O_j | θ, y^T) > η }  (η ≥ 1),   (33)

where η is the threshold, with a typical value of 1. η is set to a larger value if the set V is required to contain only the few dominant symbols. With η ≥ 1, E(S_i, V_j | θ, x^T) − E(S_i, V_j | θ, y^T) > 0. As an illustration, the distributions of the values of E(S_i, O_j | θ, x^T) and E(S_i, O_j | θ, y^T) for different symbol labels are shown in Figure 3a. The symbols filtered into set V when η is set to 1 are shown in Figure 3b.

[Figure 3: (a) Distributions of E(S_i, O_j | θ, x^T) and E(S_i, O_j | θ, y^T) for various symbols. (b) Distribution of E(S_i, O_j | θ, x^T) for the symbols in V.]

4.3. Modification to the dynamic channel

For each state, the symbol set is partitioned according to the procedures described in Section 4.2. As an example, consider the ith state. For symbols in the set U, the symbol emission coefficient b_i(U_j) (U_j ∈ U) should be set as small as possible. Let b^d_i(U_j) = 0, so that b_i(U_j) = b^s_i(U_j). For symbols in the set V, the corresponding dynamic-channel coefficient b^d_i(V_k) is computed according to (34), which is derived from (22):

b^d_i(V_k) = P_D(S_i, V_k, x^T, y^T) × [ω_i + Σ_{j=1}^{K} b^s_i(V_j)] − b^s_i(V_k),  k = 1, 2, ..., K,   (34)

where

P_D(S_i, V_k, x^T, y^T) = [E(S_i, V_k | θ, x^T) − E(S_i, V_k | θ, y^T)] / Σ_{j=1}^{K} [E(S_i, V_j | θ, x^T) − E(S_i, V_j | θ, y^T)].   (35)
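A sketch of the per-state partition (33) and dynamic-channel update (34)–(35) follows, including the re-estimation loop of Section 4.3 that transfers any symbol whose update turns negative from V back to U. The names are ours; Ex_i and Ey_i denote the rows E(S_i, O_j | θ, x^T) and E(S_i, O_j | θ, y^T) of the expectation matrices.

```python
import numpy as np

def partition_symbols(Ex_i, Ey_i, eta=1.0):
    """Eq. (33): V collects the symbols with E(S_i,O_j|x) / E(S_i,O_j|y) > eta;
    U is the complement.  The ratio test is done by multiplication so that a
    zero expectation on the y side causes no division error."""
    in_V = Ex_i > eta * Ey_i
    return in_V, ~in_V

def update_dynamic(Ex_i, Ey_i, b_static_i, omega_i, eta=1.0):
    """Eqs. (34)-(35) for one state: the dynamic coefficients of the symbols in
    V are set proportional to the normalized positive difference P_D; symbols
    in U keep b^d = 0, and any V symbol whose update turns negative is moved
    to U, after which the remaining V symbols are re-estimated."""
    in_V, _ = partition_symbols(Ex_i, Ey_i, eta)
    b_dyn = np.zeros_like(b_static_i)
    while in_V.any():
        diff = (Ex_i - Ey_i)[in_V]          # positive for eta >= 1
        P_D = diff / diff.sum()             # eq. (35)
        budget = omega_i + b_static_i[in_V].sum()
        cand = P_D * budget - b_static_i[in_V]   # eq. (34)
        if (cand >= 0).all():
            b_dyn[in_V] = cand
            break
        idx = np.flatnonzero(in_V)          # transfer offenders from V to U
        in_V[idx[cand < 0]] = False
    return b_dyn
```

When no clipping occurs, the updated dynamic coefficients sum exactly to ω_i, so the weightage constraint (24) is preserved by construction.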
However, some coefficients obtained may still be negative, for example, b^d_i(V_l) < 0 because of a large value of b^s_i(V_l). In this case, it indicates that b^s_i(V_l) alone is large enough for separation. To prevent negative values from appearing in the dynamic channel, the symbol V_l is transferred from V to U and b^d_i(V_l) is set to 0. The coefficients of the remaining symbols in V are re-estimated using (34) until all the b^d_i(V_k)'s are greater than 0. This situation (some b^d_i(V_l) < 0) usually happens at the first few epochs of training, and it is not conducive to convergence because there is a steep jump in the surface of I(x^T, y^T, θ). To relieve this problem, a larger value of η in (33) may be used.

4.4. Termination

Optimization is done by iteratively calling the training epoch described in Sections 4.2 and 4.3. After each epoch, the separable distance I(x^T, y^T, θ) of the HMM θ obtained is calculated and compared with that obtained in the last epoch. If I(x^T, y^T, θ) does not change by more than a predefined value, training is terminated and the target two-channel HMM is established.

The following state-duration validation procedure is added to make the training strategy complete. After each training epoch, E(S_i | θ, x^T) and E(S_i | θ, y^T) are computed and compared with each other. Using the forward variables and backward variables, the state duration of x^T is obtained as follows:

E(S_i | θ, x^T) = Σ_{τ=1}^{T} α_τ^x(i) β_τ^x(i) / Σ_{i=1}^{N} α_τ^x(i) β_τ^x(i),  i = 1, 2, ..., N,   (36)

and E(S_i | θ, y^T) is computed in the same way. If E(S_i | θ, x^T) ≈ E(S_i | θ, y^T) (they need not be exactly the same), for example, 1.2 E(S_i | θ, y^T) > E(S_i | θ, x^T) > 0.8 E(S_i | θ, y^T), training continues; otherwise, training stops even if I(x^T, y^T, θ) keeps on increasing.

If the I(x^T, y^T, θ) of the final HMM θ does not meet a certain discriminative requirement, for example, if I(x^T, y^T, θ) is less than a desired value, a new base HMM or a smaller ω_i should be used instead.

5. PROPERTIES OF THE TWO-CHANNEL TRAINING STRATEGY

5.1. State alignment

One of the requirements for the proposed training strategy is that the state durations of the training pair, say x^T and y^T, are comparable. This is a requirement for (22). If the state durations, for example, E(S_i | θ, x^T) and E(S_i | θ, y^T), differ too much, D_ij(x^T, y^T, θ) will become meaningless. For example, if E(S_i | θ, x^T) ≪ E(S_i | θ, y^T), then even though the symbol O_j takes a much greater portion in E(S_i | θ, x^T) than in E(S_i | θ, y^T), the computed D_ij(x^T, y^T, θ) may still be less than 0. The outcome is that b_ij is always set to b^s_ij rather than adjusted to increase I(x^T, y^T, θ). Fortunately, if the corresponding state durations of the training pair are very different, the normal ML HMMs are usually adequate to distinguish the states.

5.2. Speed of convergence

As discussed in Section 3, the convergence of the parameter-estimation strategy proposed in (22) is guaranteed according to the EM optimization principles. In the implementation of discriminative training, only some of the symbol emission coefficients in the dynamic channel are modified according to (22), while the others remain unchanged. However, convergence is still assured because, firstly, the surface of I(x^T, y^T, θ) with respect to b_ij is continuous, and, secondly, adjusting the dynamic-channel elements according to the two-channel training strategy leads to increased E_θ(I). A conceptual illustration of how b_ij is modified when the symbol set is divided into the subsets V and U is given in Figure 4.

[Figure 4: The surface of I and the direction of parameter adjustment: (a) in the b_i1-b_i2 plane, b_i1 and b_i2 move along the line b^d_i1 + b^d_i2 = ω_i; (b) in the b_i1-b_i3 plane, b_i3 is held at b^s_i3.]
For ease durations of the training pair are very different, the normal of explanation, we assume that the symbol set contains only three symbols O1 , O2 , and O3 with O1 , O2 ∈ V and O3 ∈ U ML HMMs are usually adequate to distinguish the states.
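The state-duration validation step above can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the forward variables alpha[t][i] and backward variables beta[t][i] have already been computed by a standard forward-backward pass, and all function names are ours.

```python
# Sketch of the state-duration validation of Section 4.3 (illustrative).
# alpha[t][i], beta[t][i] are assumed to come from a forward-backward pass.

def expected_state_duration(alpha, beta):
    """E(S_i | theta, x^T): sum over t of the posterior state occupancy,
    alpha_t(i)*beta_t(i) normalised over the N states at each t (Eq. (36))."""
    T, N = len(alpha), len(alpha[0])
    duration = [0.0] * N
    for t in range(T):
        norm = sum(alpha[t][i] * beta[t][i] for i in range(N))
        for i in range(N):
            duration[i] += alpha[t][i] * beta[t][i] / norm
    return duration

def durations_comparable(dur_x, dur_y, low=0.8, high=1.2):
    """Training continues only while each state duration of x stays within
    0.8 to 1.2 times the corresponding duration of y."""
    return all(low * dy < dx < high * dy for dx, dy in zip(dur_x, dur_y))
```

Since the per-frame occupancies are normalised over states, the durations of all states sum to the sequence length T, which gives a quick sanity check on the forward-backward output.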
Let θ_t denote the HMM trained at the t-th round and θ_{t+1} the HMM obtained at the (t+1)-th round. The surface of the separable distance (the I surface) is denoted by I' = I(x^T, y^T, θ_{t+1}) for θ_{t+1} and I = I(x^T, y^T, θ_t) for θ_t; clearly I' > I. The I surface is mapped to the b_i1-b_i2 plane (Figure 4a) and the b_i1-b_i3 plane (Figure 4b). In the training phase, b_i1 and b_i2 are modified along the line b_i1^d + b_i2^d = ω_i to reach a better estimate θ_{t+1}, as shown in Figure 4a. In the b_i1-b_i3 plane, b_i3 is held at the constant b_i3^s while b_i1 is modified along the line b_i3 = b_i3^s in the direction d, as shown in Figure 4b. The direction of parameter adjustment given by (22) is denoted by d'. In the two-channel approach, since only b_i1 and b_i2 are modified according to (22) while b_i3 remains unchanged, d may lead to a lower speed of convergence than d' does.

5.3. Improvement to the discriminative power

The improvement to the discriminative power is estimated as follows. Assume that θ is the two-channel HMM obtained. The lower bound of the probability P(y^T | θ) is given by

P(y^T | θ) ≥ (1 − ω_max)^T P(y^T | θ_ML^x),   (37)

where ω_max = max(ω_1, ω_2, ..., ω_N). Because the base HMM is the parameter-smoothed ML HMM of x^T, it is natural to assume that P(x^T | θ_ML^x) ≥ P(x^T | θ). The upper bound of the separable distance is then

I(x^T, y^T, θ) ≤ log [ P(x^T | θ_ML^x) / ( (1 − ω_max)^T P(y^T | θ_ML^x) ) ] = −T log(1 − ω_max) + I(x^T, y^T, θ_ML^x).   (38)

In practice, the gain of I(x^T, y^T, θ) is much smaller than this theoretical upper bound; it depends on the resemblance between x^T and y^T and on the setting of ω_i.

6. EXTENSIONS OF THE TWO-CHANNEL TRAINING ALGORITHM

6.1. Training samples with different lengths

Up to this point, the training sequences are assumed to be of equal length. This is necessary because we cannot properly compare the probability scores of two sequences of different lengths. To extend the training strategy to sequences of different lengths, linear adjustment is first carried out as follows. Given the training pair x^Tx of length T_x and y^Ty of length T_y, the objective function (10) is modified as follows:

λ_i = [ Σ_{τ=1}^{Tx} P(s_τ^Tx = S_i, x_τ^Tx = O_j | θ, x^Tx) ] / b_ij − (T_x/T_y) [ Σ_{τ=1}^{Ty} P(s_τ^Ty = S_i, y_τ^Ty = O_j | θ, y^Ty) ] / b_ij,  ∀ j = 1, 2, ..., M.   (39)

Parameter estimation is then carried out as follows:

b_ij = [ E(S_i, O_j | θ, x^Tx) − (T_x/T_y) E(S_i, O_j | θ, y^Ty) ] / Σ_{j=1}^{M} [ E(S_i, O_j | θ, x^Tx) − (T_x/T_y) E(S_i, O_j | θ, y^Ty) ].   (40)

The expectations of the different states of y^Ty are normalized by the scale factor T_x/T_y. This approach is easy to implement; however, it does not take into account nonlinear variation of the signal such as local stretch or squash. If the training sequences exhibit obvious nonlinear variation, some nonlinear processing such as sequence truncation or symbol pruning may be carried out to adjust the training sequences to the same length [50].

6.2. Multiple training samples

In order to obtain a reliable model, multiple observations must be used to train the HMM. The extension of the proposed method to multiple training samples may be carried out as follows. Consider two labeled sample sets X = {x^(1), x^(2), ..., x^(k) : d_1} and Y = {y^(1), y^(2), ..., y^(l) : d_2}, where X has k samples and Y has l samples. The separable-distance function that takes all these samples into account is

I(X, Y, θ) = (1/k) Σ_{m=1}^{k} log P(x^(m) | θ) − (1/l) Σ_{n=1}^{l} log P(y^(n) | θ).   (41)

For simplicity, if we assume that the observation sequences in X and Y have the same length T, then (10) may be rewritten as

λ_i = [ (1/k) Σ_{m=1}^{k} E(S_i, O_j | θ, x^(m)) − (1/l) Σ_{n=1}^{l} E(S_i, O_j | θ, y^(n)) ] / b_ij,  1 ≤ j ≤ M.   (42)

The probability coefficients are then estimated using the following:

b_ij = [ (1/k) Σ_{m=1}^{k} E(S_i, O_j | θ, x^(m)) − (1/l) Σ_{n=1}^{l} E(S_i, O_j | θ, y^(n)) ] / Σ_{j=1}^{M} [ (1/k) Σ_{m=1}^{k} E(S_i, O_j | θ, x^(m)) − (1/l) Σ_{n=1}^{l} E(S_i, O_j | θ, y^(n)) ].   (43)
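The multiple-sample re-estimation of (43) reduces to averaging the joint expectations over the correct and incorrect sample sets, taking their difference per symbol, and normalising. The sketch below is a hypothetical implementation, assuming the expectations E(S_i, O_j | θ, ·) for one state S_i have already been accumulated (e.g. by forward-backward); the names are ours, and the negative entries that (43) does not rule out would be handled by the symbol-transfer step of Section 4.3.

```python
# Illustrative sketch of Eq. (43) for a fixed state S_i (not the authors' code).

def reestimate_dynamic_channel(expect_x, expect_y):
    """expect_x: k vectors of E(S_i, O_j | theta, x^(m)), one entry per symbol j.
    expect_y: l vectors of E(S_i, O_j | theta, y^(n)) for the incorrect samples.
    Returns the re-estimated emission row b_ij of Eq. (43)."""
    k, l = len(expect_x), len(expect_y)
    M = len(expect_x[0])
    # Averaged difference between correct and incorrect expectations, per symbol.
    diff = [
        sum(e[j] for e in expect_x) / k - sum(e[j] for e in expect_y) / l
        for j in range(M)
    ]
    total = sum(diff)
    # Normalise so the emission row sums to one.
    return [d / total for d in diff]
```

Note that the returned row always sums to one by construction; it is only the non-negativity of the individual b_ij that the validation step of Section 4.3 must enforce.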
Figure 5: Block diagram of the viseme recognition system.

Table 1: The 18 visemes selected for the experiments.
/a:/, /ai/, /æ/, /ei/, /i/, /j/, /ie/, /o/, /oi/, /th/, /sh/, /tZ/, /dZ/, /eu/, /au/, /p/, /m/, /b/

7. APPLICATION TO LIP READING

The proposed two-channel HMM method is applied to speaker-dependent lip reading for modeling and recognizing the basic visual speech elements of the English language. For the experiments reported in this paper, the visemes are treated as having a one-to-one mapping with the phonemes in order to test the discriminative power of the proposed method. As there are 48 phonemes in the English language [47], 48 visemes are considered.

The block diagram of the viseme recognition system is given in Figure 5. The lip movement is captured with a video camera, and the sequence of images is processed to extract the essential features relevant to the lip movement. For each frame, a feature vector is extracted; the sequence of feature vectors thus represents the movement of the lips during viseme production. This vector sequence is then presented as input to the proposed classifier. A hierarchical structure is adopted such that, for a system with K visemes to be recognized, R (usually R < K) ML HMM classifiers are employed for preliminary recognition. The output of the preliminary recognition is a coarse identity, which may include L (usually 1 < L < K) viseme classes. Fine recognition is then performed using a bank of two-channel HMMs, and the most probable viseme is chosen as the identity of the input. Details of the various steps involved are given in the following sections.

7.1. Data acquisition

For our experiments, a professional English speaker is engaged. The speaker is asked to articulate each of the 18 phonemes in Table 1 one hundred times. These 18 visemes are chosen because some of them bear close similarity to others. The lip movements of the speaker are captured at 50 frames per second. Each pronunciation starts from a closed mouth and ends with a closed mouth. This type of sample is referred to as a text-independent viseme sample, which is different from samples extracted from various contexts, for example, from different words. The video clips that record the productions of text-independent visemes are normalized such that all the visemes have a uniform duration of 0.5 second, or equivalently 25 frames.

7.2. Feature extraction

Each frame of the video clip shows the lip area of the speaker during articulation (Figure 6a). To eliminate the effect of changes in brightness, the RGB (red, green, blue) components of the image are converted into HSV (hue, saturation, value) components. The RGB-to-HSV conversion algorithm proposed in [51, 52] is adopted in our experiments. As illustrated by the histograms of the hue component shown in Figure 7, the hue values of the lip region and of the remaining lip-excluded image occupy different regions of the histogram. A threshold may therefore be manually selected to segment the lip region from the entire image, as shown in Figure 6b. This threshold usually corresponds to a local minimum (valley) in the histogram, as shown in Figure 7a. Note that for different speakers and lighting conditions, the threshold may differ.

The boundaries of the lips are tracked using a geometric template with dynamic contours that fit an elastic object [53, 54, 55]. As the contours of the lips are simple, the requirement on the selection of the dynamic contours that build the template is not stringent. Lip-tracking experiments show that Bezier curves fit the shape of the lips well [34]. In our experiments, the parameterized template consists of ten Bezier curves, eight of them characterizing the lip contours and two describing the tongue when it is visible (Figure 6c). The template is controlled by the points marked as small circles in Figure 6c. Lip tracking is carried out by fitting the template so as to minimize a certain energy function.
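The hue-based segmentation described above can be sketched as follows: convert RGB to HSV, histogram the hue channel, and take a histogram valley as the segmentation threshold. The standard-library `colorsys` module implements a conventional RGB-to-HSV conversion (not necessarily the algorithm of [51, 52]), and the automatic valley search below is a simplification of the manual threshold selection used in the paper; all helper names are ours.

```python
# Illustrative hue-threshold segmentation sketch (not the paper's pipeline).
import colorsys

def hue_channel(rgb_pixels):
    """Map (r, g, b) tuples with components in [0, 1] to hue values in [0, 1)."""
    return [colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in rgb_pixels]

def valley_threshold(hues, bins=32):
    """Pick the hue at the deepest interior local minimum of the histogram."""
    hist = [0] * bins
    for h in hues:
        hist[min(int(h * bins), bins - 1)] += 1
    valleys = [i for i in range(1, bins - 1)
               if hist[i] <= hist[i - 1] and hist[i] <= hist[i + 1]]
    best = min(valleys, key=lambda i: hist[i])
    return (best + 0.5) / bins

def lip_mask(hues, threshold):
    """Boolean mask; which side of the threshold is 'lip' depends on the speaker."""
    return [h >= threshold for h in hues]
```

With a bimodal hue distribution (lip pixels in one mode, skin pixels in the other), the returned threshold falls between the two modes, which matches the valley-selection behaviour described for Figure 7a.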
The energy function comprises the following four terms:

E_lip = −(1/R1) ∫_{R1} H(x) dx,
E_edge = −(1/(C1 + C2)) ∫_{C1+C2} [ (H+(x) − H(x)) + (H−(x) − H(x)) ] dx,
E_hole = −(1/(R2 − R3)) ∫_{R2−R3} H(x) dx,
E_inertia = ||Γ_{t+1} − Γ_t||²,   (44)

where R1, R2, R3, C1, and C2 are the areas and contours illustrated in Figure 6c. H(x) is a function of the hue of a given pixel; H+(x) is the hue function of the closest right-hand pixel and H−(x) that of the closest left-hand pixel. Γ_{t+1} and Γ_t are the matched templates at times t + 1 and t, and ||Γ_{t+1} − Γ_t|| denotes the Euclidean distance between the two templates (further details may be found in [55]). The overall energy E of the template is the linear combination of the components:

E = c1 E_lip + c2 E_edge + c3 E_hole + c4 E_inertia.   (45)

Similarly, the energy terms for the tongue template are

E_tongue-area = −(1/R3) ∫_{R3} H(x) dx  if R3 > 0,
E_tongue-edge = −(1/C3) ∫_{C3} [ (H+(x) − H(x)) + (H−(x) − H(x)) ] dx  if C3 > 0,
E_tongue-inertia = ||T_tongue,t+1 − T_tongue,t||²,   (46)

and the overall tongue energy is

E_tongue = c5 E_tongue-area + c6 E_tongue-edge + c7 E_tongue-inertia.   (47)

Initially, the dynamic contours are configured to provide a crude match to the lips. This can be done by comparing the enclosed region of the template with the segmented lip region depicted in Figure 6b. Following that, the template is matched to the image sequence by adopting different values of the parameters {c_i} (i = 1, 2, ..., 7) over a number of search epochs (a detailed discussion is given in [53, 54, 55]). The matched template is pictured in Figure 6d. It can be seen that the matched template is symmetric and smooth, and is therefore easy to process.

Figure 6: (a) Original image. (b) Segmented lip area. (c) Parameterized lip template. (d) Geometric measures extracted from the lip template: (1) thickness of the upper bow; (2) thickness of the lower bow; (3) thickness of the lip corner; (4) position of the lip corner; (5) position of the upper lip; (6) position of the lower bow; (7) curvature of the upper-exterior boundary; (8) curvature of the lower-exterior boundary; (9) curvature of the upper-interior boundary; (10) curvature of the lower-interior boundary; (11) width of the tongue (when it is visible).

Eleven geometric parameters, shown in Figure 6d, are extracted from the matched template to form a feature vector. These features indicate the thicknesses of various parts of the lips, the positions of some key points, and the curvatures of the bows. They are chosen because they uniquely determine the shape of the lips and best characterize the movement of the lips.

Principal components analysis (PCA) is carried out to reduce the dimension of the feature vectors from eleven to seven. The resulting feature vectors are clustered into groups using the K-means algorithm. In the experiments conducted, 128 clusters are created for the vector database. The means of the 128 clusters form the symbol set O^128 = (O_1, O_2, ..., O_128) of the HMM; they are used to encode the vector sequences presented to the system.

7.3. Configuration of the viseme model

Investigation of the lip dynamics reveals that the movement of the lips can be partitioned into three phases during the production of a text-independent viseme. The initial phase begins with a closed mouth and ends with the start of sound production. The intermediate phase is the articulation phase, which is the period when sound is produced. The third phase is the end phase, when the mouth returns to the relaxed state. Figure 8 illustrates the changes of the lips in the three phases and the corresponding acoustic waveform when the phoneme /u/ is uttered. To associate the HMM with the physical process of viseme production, the three-state left-right HMM structure shown in Figure 9 is adopted.
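The vector-quantization step above (PCA-reduced feature vectors clustered into a codebook whose cluster means serve as the HMM symbols) can be sketched with a plain Lloyd's K-means. This is a stdlib-only illustration, not the authors' implementation; in the paper the vectors would be 7-dimensional and k = 128, while the toy usage below is deliberately tiny.

```python
# Illustrative Lloyd's K-means codebook for vector quantisation (names ours).
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Cluster tuples of floats; the returned means act as the symbol set."""
    rng = random.Random(seed)
    means = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to its nearest current mean.
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(v, means[i])))
            groups[j].append(v)
        for i, g in enumerate(groups):
            if g:  # Recompute the mean of each non-empty cluster.
                means[i] = tuple(sum(col) / len(g) for col in zip(*g))
    return means

def encode(vector, means):
    """Replace a feature vector by the index of its nearest symbol O_j."""
    return min(range(len(means)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, means[i])))
```

Encoding every frame of a 25-frame clip through `encode` yields the discrete observation sequence that the viseme HMMs are trained on.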
Figure 7: Isolation of the lip region from the entire image using the hue distribution. (a) Histogram of the hue component for the entire image. (b) Histogram of the hue component for the actual lip region. (c) Histogram of the hue component for the actual lip-excluded image.

Using this structure, the state-transition matrix A has the form

A = [ a_{1,1} a_{1,2} 0 0 ; 0 a_{2,2} a_{2,3} 0 ; 0 0 a_{3,3} a_{3,4} ; 0 0 0 1 ],   (48)

where the 4th state is a null state that indicates the end of viseme production. The initial values of the coefficients in matrices A and B are set according to the statistics of the three phases. Given a viseme sample, the approximate initial phase, articulation phase, and end phase are segmented from the image sequence and the acoustic signal (an illustration is given in Figure 8), and the duration of each phase is counted. The coefficients a_{i,i} and a_{i,i+1} are initialized with these durations. For example, if the duration of state S_i is T_i, the initial value of a_{i,i} is set to T_i/(T_i + 1) and the initial value of a_{i,i+1} is set to 1/(T_i + 1), as these values maximize a_{i,i}^{T_i} a_{i,i+1}. Matrix B is initialized in a similar manner: if symbol O_j appears T(O_j) times in state S_i, the initial value of b_ij is set to T(O_j)/T_i. With this arrangement, the states of the HMM are aligned with the three phases of viseme production and are hence referred to as the initial state, articulation state, and end state.

For each of the 18 visemes in Table 1, an HMM with the above configuration is trained using the Baum-Welch estimation. After parameter smoothing, the parameter-smoothed ML HMM is ready for the subsequent two-channel discriminative training.
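The duration-based initialisation just described can be sketched directly: a_{i,i} = T_i/(T_i + 1), a_{i,i+1} = 1/(T_i + 1), and b_ij = T(O_j)/T_i for a three-state left-right model with a trailing null state. The function below is an illustrative sketch (names ours); `phase_symbols` lists, per state, the symbol indices observed in that segmented phase.

```python
# Illustrative initialisation of A and B from phase statistics (Eq. (48)).

def init_left_right_hmm(phase_symbols, num_symbols):
    """phase_symbols: one list of observed symbol indices per state phase.
    Returns (A, B) with an extra column in A for the null end state."""
    N = len(phase_symbols)                   # three articulation phases
    A = [[0.0] * (N + 1) for _ in range(N)]  # row i: transitions out of S_i
    B = []
    for i, symbols in enumerate(phase_symbols):
        Ti = len(symbols)                    # duration of state S_i in frames
        A[i][i] = Ti / (Ti + 1)              # self-loop a_{i,i}
        A[i][i + 1] = 1 / (Ti + 1)           # forward transition a_{i,i+1}
        row = [0.0] * num_symbols
        for s in symbols:
            row[s] += 1 / Ti                 # b_ij = T(O_j) / T_i
        B.append(row)
    return A, B
```

Each emission row sums to one by construction, and the self-loop/forward split maximizes the likelihood a_{i,i}^{T_i} a_{i,i+1} of staying in the phase for exactly T_i frames.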
Figure 8: The three phases of viseme production. (a) Initial phase. (b) Articulation phase. (c) End phase.

Figure 9: Three-state left-right viseme model.

7.4. Viseme classifier

The block diagram of the proposed hierarchical viseme classifier is given in Figure 10. Visemes that are too similar to be separated by the normal ML HMMs are clustered into one macro class. In the figure, θ_Mac1, θ_Mac2, ..., θ_MacR are the R ML HMMs for the R macro classes. The similarity between the visemes is measured as follows.

Assume that X_i = {x_1^i, x_2^i, ..., x_{l_i}^i : d_i} is the set of training samples of viseme d_i (i = 1, 2, ..., 18, as 18 visemes are involved), where x_j^i is the j-th training sample and l_i is the number of samples. An ML HMM is trained for each of the 18 visemes using the Baum-Welch estimation. Let θ_1, θ_2, ..., θ_18 denote the 18 ML HMMs. For {x_1^i, x_2^i, ..., x_{l_i}^i : d_i}, the joint probability scored by θ_j is computed as follows:

P(X_i | θ_j) = Π_{n=1}^{l_i} P(x_n^i | θ_j).   (49)

A viseme model θ_i is able to separate visemes d_i and d_j if the following condition applies:

log P(X_i | θ_i) − log P(X_i | θ_j) ≥ K l_i  ∀ j = 1, 2, ..., 18, j ≠ i,   (50)

where K is a positive constant that is set according to the length of the training samples; for long training samples, a large value of K is desired. For the 25-frame samples adopted in our experiments, K is set equal to 2. If the condition stated in (50) is not met, visemes d_i and d_j are categorized into the same macro class, and their training samples are jointly used to train the ML HMM of the macro class. θ_Mac1, θ_Mac2, ..., θ_MacR are obtained in this way.

For an input viseme z^T to be identified, the probabilities P(z^T | θ_Mac1), P(z^T | θ_Mac2), ..., P(z^T | θ_MacR) are computed and compared with one another. The macro identity of z^T is determined by the HMM that gives the largest probability.

A macro class may consist of several similar visemes. Fine recognition within a macro class is carried out at the second layer. Assume that macro class i comprises L visemes V_1, V_2, ..., V_L. A number of two-channel HMMs are trained with the proposed discriminative training strategy. For V_1, L − 1 HMMs, θ_{1∧2}, θ_{1∧3}, ..., θ_{1∧L}, are trained to separate the samples of V_1 from those of V_2, V_3, ..., V_L, respectively. Taking θ_{1∧2} as an example, the parameter-smoothed ML HMM of V_1, θ_ML^1, is adopted as the base HMM; the samples of V_1 are used as the correct samples (x^T in (3)) and the samples of V_2 as the incorrect samples (y^T in (3)) while training θ_{1∧2}. There is a total of L(L − 1) two-channel HMMs in macro class i.

For an input viseme z^T to be identified, the following hypothesis is made:

H_{i∧j} = i  if log P(z^T | θ_{i∧j}) − log P(z^T | θ_{j∧i}) > K, and H_{i∧j} = 0 otherwise,   (51)

where K is the positive constant defined in (50). For the 25-frame sequences input to the system, K is chosen equal to 2. H_{i∧j} = i indicates a vote for V_i. The decision about the identity of z^T is made by a majority vote of all the two-channel HMMs; the viseme class that receives the maximum number of votes is chosen as the identity of z^T, denoted ID(z^T). Mathematically,

ID(z^T) = arg max_i Number(H_{i∧j} = i)  ∀ i, j = 1, 2, ..., L, i ≠ j.   (52)

If two viseme classes, say V_i and V_j, receive the same number of votes, the decision about the identity of z^T is made by comparing P(z^T | θ_{i∧j}) and P(z^T | θ_{j∧i}).
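The pairwise-vote rule of (51)-(52) can be sketched as follows. This is an illustrative implementation (names ours), assuming the log-probabilities log P(z^T | θ_{i∧j}) have already been scored by the trained two-channel HMMs and are supplied as a dict keyed by ordered class pairs (i, j); K = 2 matches the paper's setting for 25-frame sequences.

```python
# Illustrative majority-vote decision over pairwise two-channel HMMs.
from itertools import permutations

def majority_vote(logp, classes, K=2.0):
    """logp[(i, j)] = log P(z | theta_{i^j}). A vote for class i is cast
    whenever the margin over the reverse model exceeds K (Eq. (51));
    the class with the most votes wins (Eq. (52))."""
    votes = {c: 0 for c in classes}
    for i, j in permutations(classes, 2):
        if logp[(i, j)] - logp[(j, i)] > K:
            votes[i] += 1
    return max(classes, key=lambda c: votes[c])
```

Ties would be broken by the direct pairwise comparison of (53); the sketch above simply returns the first class among those with the maximum vote count.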
ID(z^T) = i  if log P(z^T | θ_{i∧j}) > log P(z^T | θ_{j∧i}), and ID(z^T) = j otherwise.   (53)

Figure 10: Flow chart of the hierarchical viseme classifier.

Figure 11: Viseme boundaries formed by the two-channel HMMs.

The decision is based on pairwise comparisons of the hypotheses. The proposed hierarchical structure greatly reduces the computational load and increases the accuracy of recognition because pairwise comparisons are carried out within each macro class, which comprises far fewer candidate classes than the entire set. If coarse identification were not performed, the number of classes would increase and the number of pairwise comparisons would grow rapidly.

The two-channel HMMs act as boundary functions for the viseme they represent: each serves to separate the correct samples from the samples of another viseme. A conceptual illustration is given in Figure 11, where the macro class comprises five visemes V_1, V_2, ..., V_5. θ_{1∧2}, θ_{1∧3}, ..., θ_{1∧5} build the decision boundaries that delimit V_1 from the similar visemes. The proposed two-channel HMM model is specially tailored to the target viseme and its "surroundings"; as a result, it is more accurate than the traditional modeling method that uses a single ML HMM.

7.5. Performance of the system

Experiments are carried out to assess the performance of the proposed system. For the experiments conducted in this paper, 100 samples are drawn for each viseme, 50 for training and the remaining 50 for testing. By computing and comparing the probabilities scored by the different viseme models using (49) and (50), the 18 visemes are clustered into 6 macro classes, as shown in Table 2.

The results of fine recognition of some confusable visemes are listed in Table 3. Each row of Table 3 shows two similar visemes that belong to the same macro class. The first viseme label (in boldface) is the target viseme and is denoted by x; the second viseme is the incorrect viseme and is denoted by y. θ_ML denotes the parameter-smoothed ML HMM trained with the samples of x.
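The macro-class construction of (49)-(50) amounts to a pairwise separability test followed by merging the visemes that fail it. The sketch below is an illustrative implementation (names ours): logP[i][j] is the joint log-probability log P(X_i | θ_j) summed over the l_i samples of viseme i, and visemes are merged with a small union-find whenever the per-sample margin of (50) is not met.

```python
# Illustrative macro-class grouping from pairwise log-probability margins.

def macro_classes(logP, n, lengths, K=2.0):
    """logP[i][j] = log P(X_i | theta_j); lengths[i] = l_i, the sample count.
    Visemes i, j share a macro class when Eq. (50) fails for the pair."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for i in range(n):
        for j in range(n):
            if i != j and logP[i][i] - logP[i][j] < K * lengths[i]:
                parent[find(i)] = find(j)   # not separable: merge classes
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Running this over the 18 viseme models with K = 2 would reproduce the kind of partition shown in Table 2, with confusable visemes sharing a macro class.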
Table 2: The macro classes for coarse identification.

Macro class 1: /a:/, /ai/, /æ/ — Macro class 4: /o/, /oi/
Macro class 2: /ei/, /i/, /j/, /ie/ — Macro class 5: /th/, /sh/, /tZ/, /dZ/
Macro class 3: /eu/, /au/ — Macro class 6: /p/, /m/, /b/

Table 3: The average values of the probability and separable-distance function of the ML HMMs and the two-channel HMMs.

Viseme pair (x, y) | θ_ML: P, I | θ_1*: P, I | θ_2**: P, I | ω_1, ω_2, ω_3
/a:/, /ai/ | −14.1, 1.196 | −17.1, 5.571 | −18.3, 6.589 | 0.5, 0.5, 0.5
/ei/, /i/ | −14.7, 2.162 | −19.3, 5.977 | −20.9, 7.008 | 0.6, 0.8, 0.6
/au/, /eu/ | −15.6, 2.990 | −18.1, 5.872 | −18.5, 6.555 | 0.6, 0.5, 0.6
/o/, /oi/ | −13.9, 0.830 | −17.5, 2.508 | −18.7, 3.296 | 0.5, 0.5, 0.5
/th/, /sh/ | −15.7, 0.602 | −19.0, 2.809 | −18.5, 2.732 | 0.4, 0.4, 0.4
/p/, /m/ | −16.3, 1.144 | −19.0, 3.102 | −17.1, 2.233 | 0.4, 0.5, 0.4

Configuration of the two-channel HMMs:
* For θ_1, ω_1, ω_2, and ω_3 are set according to (30), with C = 1.0 and D = 0.1.
** For θ_2, ω_1, ω_2, and ω_3 are manually selected.

With θ_ML as the base HMM, two two-channel HMMs, θ_1 and θ_2, are trained with the samples of x as the target training samples and the samples of y as the incorrect training samples. Different sets of the credibility factors (ω_1, ω_2, and ω_3 for the three states) are used for θ_1 and θ_2. P is the average log probability scored for the testing samples, computed as P = (1/l) Σ_{i=1}^{l} log P(x_i | θ), where x_i is the i-th testing sample of viseme x and l is the number of testing samples. I = (1/l²) Σ_{i=1}^{l} Σ_{j=1}^{l} I(x_i, y_j, θ) is the average separable distance. The value of I gives an indication of the discriminative power: the larger the value of I, the higher the discriminative power. For all settings of (ω_1, ω_2, ω_3), the two-channel HMMs give a much larger separable distance than the ML HMMs. This shows that better discrimination is attained using the two-channel viseme classifiers than using the ML HMM classifiers; in addition, different levels of discrimination can be attained by adjusting the credibility factors. However, the two-channel HMMs give a smaller average probability for the target samples than the normal ML HMMs, indicating that the two-channel HMMs perform well at discriminating confusable visemes but are not as good at modeling the visemes.

The change of I(x, y, θ) with respect to the training epochs in the two-channel training is depicted in Figure 12. For the three-state left-right HMMs and 25-frame training samples adopted in the experiments, the separable distance becomes stable after ten to twenty epochs. Such a speed of convergence shows that the two-channel training is not computationally intensive for viseme recognition. It is also observed that I(x, y, θ) may drop in the first few training epochs. This phenomenon can be attributed to the fact that some symbols in subset V are transferred to U while training the dynamic-channel coefficients, as explained in Section 4.3. Figure 12d illustrates early termination: the training process stops even though I(x, y, θ) still shows a tendency to increase. As explained in Section 5.1, if the state durations of the target training samples and the incorrect training samples differ greatly, that is, if the state-alignment condition is violated, the two-channel training should terminate immediately.

The performance of the proposed hierarchical system is compared with that of the traditional recognition system in which (parameter-smoothed) ML HMMs are used as the viseme classifiers. The ML HMMs and the two-channel HMMs involved are trained with the same set of training samples. The credibility factors of the two-channel HMMs are set according to (30), with C = 0.1 and D = 0.1. The decision about the identity of an input testing sample is made according to (49), (50), (51), and (52), where K = 2. The false rejection error rates (FRRs), or Type-II errors, of the two types of viseme classifiers are computed for the 50 testing samples of each of the 18 visemes. Note that, as some of the 18 visemes can be accurately identified by the ML HMMs with FRRs of less than 10% [34], the improvement resulting from the two-channel training approach is not prominent for these visemes. In Table 4, only the FRRs of 12 confusable visemes are listed.

Compared with the conventional ML HMM classifier, the classification error of the proposed hierarchical viseme classifier is reduced by about 20%. Thus the two-channel training algorithm is able to significantly increase the discriminative ability of HMMs for identifying visemes.

8. CONCLUSION

In this paper, a novel two-channel training strategy for hidden Markov model is proposed.
A separable-distance function, which measures the difference between a pair of training samples, is applied as the objective function. To maximize the separable distance while maintaining the validity of the probabilistic framework of the HMM, a two-channel HMM structure is used. The parameters in one channel, named the dynamic channel, are optimized through a series of expectation-maximization (EM) estimations where feasible, while the parameters in the other channel, the static channel, are kept fixed. An HMM trained in this way amplifies the difference between the training samples. This strategy is especially suited to increasing the discriminative ability of HMMs over confusable observations.

The proposed training strategy is applied to viseme recognition. A hierarchical system is developed in which normal ML HMM classifiers implement coarse recognition and two-channel HMMs carry out fine recognition. To extend the classification from binary-class to multiple-class, a decision rule based on majority voting is adopted. Experimental results show that the classification error of the proposed viseme
classifier is on average 20% less than that of the popular ML HMM classifier, while only 10 to 20 training epochs are required in the training process.

Figure 12: Change of I(x, y, θ) during the training process.

Table 4: Classification error ε1 of the conventional classifier and classification error ε2 of the two-channel classifier.

Viseme: ε1, ε2 | Viseme: ε1, ε2
/a:/: 64%, 12% | /o/: 46%, 28%
/ai/: 60%, 40% | /oi/: 36%, 8%
/ei/: 46%, 22% | /th/: 18%, 16%
/i/: 52%, 32% | /sh/: 20%, 12%
/au/: 30%, 18% | /p/: 36%, 12%
/eu/: 26%, 16% | /m/: 32%, 32%

The two-channel training strategy thus provides significant improvement over the traditional Baum-Welch estimation in fine recognition. However, the proposed method requires state alignment among the training samples; in other words, the samples should be of sufficient similarity that the durations of the corresponding states are comparable.

Although the two-channel HMM is illustrated for viseme classification in this paper, the method is applicable to any sequence classification problem in which the sequences to be recognized are of comparable length. Such applications include speech recognition, speaker identification, and handwriting recognition.

REFERENCES

[1] W. H. Sumby and I. Pollack, "Visual contributions to speech intelligibility in noise," Journal of the Acoustical Society of America, vol. 26, pp. 212–215, 1954.
[2] K. K. Neely, "Effect of visual factors on the intelligibility of speech," Journal of the Acoustical Society of America, vol. 28, no. 6, pp. 1275–1277, 1956.
[3] C. A. Binnie, A. A. Montgomery, and P. L. Jackson, "Auditory and visual contributions to the perception of consonants," Journal of Speech and Hearing Research, vol. 17, pp. 619–630, 1974.
[4] D. Reisberg, J. McLean, and A. Goldfield, "Easy to hear but hard to understand: A lipreading advantage with intact auditory stimuli," in Hearing by Eye: The Psychology of Lipreading, B. Dodd and R. Campbell, Eds., pp. 97–113, Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 1987.
[5] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, 1976.
[6] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 1987.
[7] R. Campbell and B. Dodd, "Hearing by eye," Quarterly Journal of Experimental Psychology, vol. 32, pp. 85–99, 1980.
[8] E. D. Petajan, Automatic lipreading to enhance speech recognition, Ph.D. dissertation, University of Illinois at Urbana-Champaign, Urbana, Ill, USA, 1984.
[9] A. J. Goldschen, Continuous automatic speech recognition by lipreading, Ph.D. dissertation, George Washington University, Washington, DC, USA, 1993.
[10] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, "Integration of acoustic and visual speech signals using neural networks," IEEE Commun. Mag., vol. 27, no. 11, pp. 65–71, 1989.
[11] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," in Proc. IEEE International Joint Conference on Neural Networks (IJCNN '92), pp. 285–295, Baltimore, Md, USA, June 1992.
[12] C. Bregler and S. Omohundro, "Nonlinear manifold learning for visual speech recognition," in Proc. IEEE 5th International Conference on Computer Vision (ICCV '95), pp. 494–499, Cambridge, Mass, USA, June 1995.
[13] P. Silsbee and A. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, no. 5, pp. 337–351, 1996.
[14] D. G. Stork and H. L. Lu, "Speechreading by Boltzmann zippers," in Machines that Learn Workshop, Snowbird, Utah, USA, April 1996.
[15] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 9–21, 2001.
[16] D. G. Stork and M. E. Hennecke, "Speechreading: An overview of image processing, feature extraction, sensory integration and pattern recognition techniques," in Proc. 2nd International Conference on Automatic Face and Gesture Recognition, pp. 16–26, Killington, Vt, USA, October 1996.
[17] S. W. Foo, Y. Lian, and L. Dong, "Recognition of visual speech elements using adaptively boosted hidden Markov models," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 693–705, 2004.
[18] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, "Continuous optical automatic speech recognition by lipreading," in Proc. 28th Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 572–577, Pacific Grove, Calif, USA, October–November 1994.
[19] W. J. Welsh, A. D. Simon, R. A. Hutchinson, and S. Searby, "A speech-driven 'talking-head' in real time," in Proc. Picture Coding Symposium, pp. 7.6-1–7.6-2, Cambridge, Mass, USA, March 1990.
[20] P. L. Silsbee and A. C. Bovic, "Visual lipreading by computer to improve automatic speech recognition accuracy," Tech. Rep. TR-93-02-90, University of Texas Computer and Vision Research Center, Austin, Tex, USA, 1993.
[21] P. L. Silsbee and A. C. Bovic, "Medium vocabulary audiovisual
[24] X. Z. Zhang, R. M. Mersereau, and M. A. Clements, "Audio-visual speech recognition by speechreading," in Proc. 14th International Conference on Digital Signal Processing (DSP '02), vol. 2, pp. 1069–1072, Santorini, Greece, July 2002.
[25] G. Gravier, G. Potamianos, and C. Neti, "Asynchrony modeling for audio-visual speech recognition," in Proc. International Conference on Human Language Technology (HLT '02), San Diego, Calif, USA, March 2002, available on proceedings CD.
[26] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1274–1288, 2002.
[27] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
[28] S. W. Foo and L. Dong, "A boosted multi-HMM classifier for recognition of visual speech elements," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2, pp. 285–288, Hong Kong, China, April 2003.
[29] J. J. Williams and A. K. Katsaggelos, "An HMM-based speech-to-video synthesizer," IEEE Trans. Neural Networks, vol. 13, no. 4, pp. 900–915, 2002.
[30] J. Luettin, N. A. Thacker, and S. W. Beet, "Speechreading using shape and intensity information," in Proc. 4th International Conference on Spoken Language Processing (ICSLP '96), vol. 1, pp. 58–61, Philadelphia, Pa, USA, October 1996.
[31] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 2, pp. 198–213, 2002.
[32] A. Adjoudani and C. Benoît, "On the integration of auditory and visual parameters in an HMM-based ASR," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds., NATO ASI Series, pp. 461–471, Springer-Verlag, Berlin, Germany, 1996.
[33] E. Owens and B. Blazek, "Visemes observed by hearing-impaired and normal-hearing adult viewers," Journal of Speech and Hearing Research, vol. 28, pp. 381–393, 1985.
[34] S. W. Foo and L. Dong, "Recognition of visual speech elements using hidden Markov models," in Proc. 3rd IEEE Pacific Rim Conference on Multimedia (PCM '02), pp. 607–614, December 2002.
[35] C. Binnie, A. Montgomery, and P. Jackson, "Auditory and visual contributions to the perception of consonants," Journal of Speech and Hearing Research, vol. 17, pp. 619–630, 1974.
[36] A. M. Tekalp and J. Ostermann, "Face and 2-D mesh animation in MPEG-4," Signal Processing: Image Communication, vol. 15, no. 4-5, pp. 387–421, 2000, special issue on MPEG-4.
[37] S. Morishima, S. Ogata, K. Murai, and S. Nakamura, "Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 2, pp. 2117–2120, Orlando, Fla, USA, May 2002.
[38] S. W. Foo, Y. Lian, and L.
Dong, “A two-channel training al- speech recognition,” in NATO ASI New Advances and Trends in gorithm for hidden Markov model to identify visual speech Speech Recognition and Coding, pp. 13–16, Bubion, Granada, elements,” in Proc. IEEE Int. Symp. Circuits and Systems (IS- Spain, June–July 1993. CAS ’03), vol. 2, pp. 572–575, Bangkok, Thailand, May 2003. [22] M. J. Tomlinson, M. J. Russell, and N. M. Brooke, “Integrating [39] L. E. Baum and T. Petrie, “Statistical inference for probabilis- audio and visual information to provide highly robust speech tic functions of finite state Markov chains,” Annals of Mathe- recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, and matical Statistics, vol. 37, pp. 1554–1563, 1966. Signal Processing (ICASSP ’96), vol. 2, pp. 821–824, Atlanta, [40] L. E. Baum and G. R. Sell, “Growth functions for transforma- Ga, USA, May 1996. tions on manifolds,” Pacific Journal of Mathematics, vol. 27, [23] J. Luettin, N. A. Thacker, and S. W. Beet, “Speechreading us- no. 2, pp. 211–227, 1968. ing shape and intensity information,” in Proc. 4th Interna- [41] T. Petrie, “Probabilistic functions of finite state Markov tional Conference on Spoken Language Processing (ICSLP ’96), chains,” Annals of Mathematical Statistics, vol. 40, no. 1, pp. 58–61, Philadelphia, Pa, USA, October 1996. pp. 97–115, 1969.
Say Wei Foo received the B.Eng. degree in electrical engineering from the University of Newcastle, Australia, in 1972, the M.S. degree in industrial and systems engineering from the University of Singapore in 1979, and the Ph.D. degree in electrical engineering from Imperial College, University of London, in 1983. From 1972 to 1973, he was with the Electrical Branch, Lands and Estates Department, Ministry of Defense, Singapore. From 1973 to 1992, he worked in the Electronics Division of the Defense Science Organization, Singapore, where he conducted research and carried out development work on security equipment. From 1992 to 2001, he was an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore. In 2002, he joined the School of Electrical and Electronic Engineering, Nanyang Technological University. He has authored and coauthored over one hundred published articles. His research interests include speech signal processing, speaker recognition, and musical note recognition.

Yong Lian received the B.S. degree from the School of Management, Shanghai Jiao Tong University, China, in 1984, and the Ph.D. degree from the Department of Electrical Engineering, National University of Singapore, Singapore, in 1994. He was with the Institute of Microcomputer Research, Shanghai Jiao Tong University, Brighten Information Technology Ltd., SyQuest Technology International, and Xyplex Inc. from 1984 to 1996. He joined the National University of Singapore in 1996, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. His research interests include digital filter design, VLSI implementation of high-speed digital systems, biomedical instrumentation, and RF IC design. Dr. Lian received the 1996 IEEE Circuits and Systems Society's Guillemin-Cauer Award for the best paper published in IEEE Transactions on Circuits and Systems Part II. He currently serves as an Associate Editor for the IEEE Transactions on Circuits and Systems Part II and has been an Associate Editor for Circuits, Systems and Signal Processing since 2000. Dr. Lian serves as the Secretary and a Member of the IEEE Circuits and Systems Society's Biomedical Circuits and Systems Technical Committee and Digital Signal Processing Technical Committee, respectively.

Liang Dong received the B.Eng. degree in electronic engineering from Beijing University of Aeronautics and Astronautics, China, in 1997, and the M.Eng. degree in electrical engineering from the Second Academy of China Aerospace in 2000. Currently, he is a Ph.D. candidate at the National University of Singapore and is working at the Institute for Infocomm Research, Singapore. His research interests include speech processing, image processing, and video processing.