# Thuật toán ICA - 13: Practical Considerations

Chia sẻ: Dph Dph | Ngày: | Loại File: PDF | Số trang:10

0
84
lượt xem
17

## Thuật toán ICA - 13: Practical Considerations

Mô tả tài liệu

In the preceding chapters, we presented several approaches for the estimation of the independent component analysis (ICA) model. In particular, several algorithms were proposed for the estimation of the basic version of the model, which has a square mixing matrix and no noise. Now we are, in principle, ready to apply those algorithms on real data sets. Many such applications will be discussed in Part IV.

Chủ đề:

Bình luận(0)

Lưu

## Nội dung Text: Thuật toán ICA - 13: Practical Considerations

1. Independent Component Analysis. Aapo Hyv¨ rinen, Juha Karhunen, Erkki Oja a Copyright  2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic) 13 Practical Considerations In the preceding chapters, we presented several approaches for the estimation of the independent component analysis (ICA) model. In particular, several algorithms were proposed for the estimation of the basic version of the model, which has a square mixing matrix and no noise. Now we are, in principle, ready to apply those algorithms on real data sets. Many such applications will be discussed in Part IV. However, when applying the ICA algorithms to real data, some practical con- siderations arise and need to be taken into account. In this chapter, we discuss different problems that may arise, in particular, overlearning and noise in the data. We also propose some preprocessing techniques (dimension reduction by principal component analysis, time ﬁltering) that may be useful and even necessary before the application of the ICA algorithms in practice. 13.1 PREPROCESSING BY TIME FILTERING The success of ICA for a given data set may depend crucially on performing some application-dependent preprocessing steps. In the basic methods discussed in the previous chapters, we always used centering in preprocessing, and often whitening was done as well. Here we discuss further preprocessing methods that are not necessary in theory, but are often very useful in practice. 263
2. 264 PRACTICAL CONSIDERATIONS 13.1.1 Why time ﬁltering is possible In many cases, the observed random variables are, in fact, time signals or time series, which means that they describe the time course of some phenomenon or system. Thus the sample index t in xi (t) is a time index. In such a case, it may be very useful to ﬁlter the signals. In other words, this means taking moving averages of the time series. Of course, in the ICA model no time structure is assumed, so ﬁltering is not x always possible: If the sample points (t) cannot be ordered in any meaningful way with respect to t, ﬁltering is not meaningful, either. For time series, any linear ﬁltering of the signals is allowed, since it does not change the ICA model. In fact, if we ﬁlter linearly the observed signals xi (t) to obtain new signals, say xi (t), the ICA model still holds for xi (t), with the same X mixing matrix. This can be seen as follows. Denote by the matrix that contains x x S the observations (1) ::: (T ) as its columns, and similarly for . Then the ICA model can be expressed as: X = AS (13.1) X Now, time ﬁltering of corresponds to multiplying X from the right by a matrix, let M us call it . This gives X = XM = ASM = AS (13.2) which shows that the ICA model still remains valid. The independent components are ﬁltered by the same ﬁltering that was applied on the mixtures. They are not mixed with each other in S M because the matrix is by deﬁnition a component-wise ﬁltering matrix. Since the mixing matrix remains unchanged, we can use the ﬁltered data in the ICA estimating method only. After estimating the mixing matrix, we can apply the same mixing matrix on the original data to obtain the independent components. The question then arises what kind of ﬁltering could be useful. In the following, we consider three different kinds of ﬁltering: high-pass and low-pass ﬁltering, as well as their compromise.
3. PREPROCESSING BY TIME FILTERING 265 13.1.2 Low-pass ﬁltering Basically, low-pass ﬁltering means that every sample point is replaced by a weighted M average of that point and the points immediately before it.1 This is a form of smoothing the data. Then the matrix in (13.2) would be something like 0 . 1 B B 1 1 1 0 . . 0 0 0 0 C C B B ::: 0 1 1 1 0 0 0 0 ::: C C 1B M= 3B C C ::: ::: B 0 0 1 1 1 0 0 0 C B C ::: ::: B 0 0 0 1 1 1 0 0 C (13.3) B C ::: ::: B 0 0 0 0 1 1 1 0 C B C ::: ::: @ ::: 0 0 0 0 0 1 1 1 ::: A . . . Low-pass ﬁltering is often used because it tends to reduce noise. This is a well- known property in signal processing that is explained in most basic signal processing textbooks. In the basic ICA model, the effect of noise is more or less neglected; see Chapter 15 for a detailed discussion. Thus basic ICA methods work much better with data that does not have much noise, and reducing noise is thus useful and sometimes even necessary. A possible problem with low-pass ﬁltering is that it reduces the information in the data, since the fast-changing, high-frequency features of the data are lost. It often happens that this leads to a reduction of independence as well (see next section). 13.1.3 High-pass ﬁltering and innovations High-pass ﬁltering is the opposite of low-pass ﬁltering. The point is to remove slowly changing trends from the data. Thus a low-pass ﬁltered version is subtracted from the signal. A classic way of doing high-pass ﬁltering is differencing, which means M replacing every sample point by the difference between the value at that point and the value at the preceding point. Thus, the matrix in (13.2) would be 0 . 1 B B 1 1 0 . . 0 0 0 0 C C B B ::: 0 1 1 0 0 0 0 C C ::: B C M=B C ::: ::: B 0 0 1 1 0 0 0 C B C ::: ::: B 0 0 0 1 1 0 0 C (13.4) B C ::: ::: B 0 0 0 0 1 1 0 C B C ::: ::: @ ::: 0 0 0 0 0 1 1 A ::: . . . 1 To have a causal ﬁlter, points after the current point may be left out of the averaging.
4. 266 PRACTICAL CONSIDERATIONS High-pass ﬁltering may be useful in ICA because in certain cases it increases the independence of the components. It often happens in practice that the compo- nents have slowly changing trends or ﬂuctuations, in which case they are not very independent. If these slow ﬂuctuations are removed by high-pass ﬁltering the ﬁl- tered components are often much more independent. A more principled approach to high-pass ﬁltering is to consider it in the light of innovation processes. Innovation processes Given a stochastic process s(t), we deﬁne its innovation process ~(t) as the error of the best prediction of s(t), given its past. Such a best s prediction is given by the conditional expectation of s(t) given its past, because it is the expected value of the conditional distribution of s(t) given its past. Thus the s innovation process of ~(t) is deﬁned by ~(t) = s(t) s E fs( )js( t t 1) s(t 2) :::g (13.5) s The expression “innovation” describes the fact that ~(t) contains all the new infor- mation about the process that can be obtained at time t by observing s(t). The concept of innovations can be utilized in the estimation of the ICA model due to the following property: Theorem 13.1 If x(t) and s(t) follow the basic ICA model, then the innovation processes x(t) and ~(t) follow the ICA model as well. In particular, the components ~ s si (t) are independent from each other. ~ On the other hand, independence of the innovations does not imply the indepen- dence of the si (t). Thus, the innovations are more often independent from each other than the original processes. Moreover, one could argue that the innovations are usually more nongaussian than the original processes. This is because the si (t) is a kind of moving average of the innovation process, and sums tend to be more gaussian than the original variable. Together these mean that the innovation process is more susceptible to be independent and nongaussian, and thus to fulﬁll the basic assumptions in ICA. Innovation processes were discussed in more detail in [194], where it was also shown that using innovations, it is possible to separate signals (images of faces) that are otherwise strongly correlated and very difﬁcult to separate. The connection between innovations and ordinary ﬁltering techniques is that the computation of the innovation process is often rather similar to high-pass ﬁltering. Thus, the arguments in favor of using innovation processes apply at least partly in favor of high-pass ﬁltering. A possible problem with high-pass ﬁltering, however, is that it may increase noise for the same reasons that low-pass ﬁltering decreases noise. 13.1.4 Optimal ﬁltering Both of the preceding types of ﬁltering have their pros and cons. The optimum would be to ﬁnd a ﬁlter that increases the independence of the components while reducing
5. PREPROCESSING BY PCA 267 noise. To achieve this, some compromise between high- and low-pass ﬁltering may be the best solution. This leads to band-pass ﬁltering, in which the highest and the lowest frequencies are ﬁltered out, leaving a suitable frequency band in between. What this band should be depends on the data and general answers are impossible to give. In addition to simple low-pass/high-pass ﬁltering, one might also use more so- phisticated techniques. For example, one might take the (1-D) wavelet transforms of the data [102, 290, 17]. Other time-frequency decompositions could be used as well. 13.2 PREPROCESSING BY PCA A common preprocessing technique for multidimensional data is to reduce its dimen- sion by principal component analysis (PCA). PCA was explained in more detail in Chapter 6. Basically, the data is projected linearly onto a subspace x = En x ~ (13.6) so that the maximum amount of information (in the least-squares sense) is preserved. Reducing dimension in this way has several beneﬁts which we discuss in the next subsections. 13.2.1 Making the mixing matrix square First, let us consider the case where the the number of independent components n is smaller than the number of mixtures, say m. Performing ICA on the mixtures directly can cause big problems in such a case, since the basic ICA model does not hold anymore. Using PCA we can reduce the dimension of the data to n. After such a reduction, the number of mixtures and ICs are equal, the mixing matrix is square, and the basic ICA model holds. The question is whether PCA is able to ﬁnd the subspace correctly, so that the n ICs can be estimated from the reduced mixtures. This is not true in general, but in a special case it turns out to be the case. If the data consists of n ICs only, with no noise added, the whole data is contained in an n-dimensional subspace. Using PCA for dimension reduction clearly ﬁnds this n-dimensional subspace, since the eigenvalues corresponding to that subspace, and only those eigenvalues, are nonzero. Thus reducing dimension with PCA works correctly. In practice, the data is usually not exactly contained in the subspace, due to noise and other factors, but if the noise level is low, PCA still ﬁnds approximately the right subspace; see Section 6.1.3. In the general case, some “weak” ICs may be lost in the dimension reduction process, but PCA may still be a good idea for optimal estimation of the “strong” ICs [313]. Performing ﬁrst PCA and then ICA has an interesting interpretation in terms of factor analysis. In factor analysis, it is conventional that after ﬁnding the factor subspace, the actual basis vectors for that subspace are determined by some criteria
6. 268 PRACTICAL CONSIDERATIONS that make the mixing matrix as simple as possible [166]. This is called factor rotation. Now, ICA can be interpreted as one method for determining this factor rotation, based on higher-order statistics instead of the structure of the mixing matrix. 13.2.2 Reducing noise and preventing overlearning A well-known beneﬁt of reducing the dimension of the data is that it reduces noise, as was already discussed in Chapter 6. Often, the dimensions that have been omitted consist mainly of noise. This is especially true in the case where the number of ICs is smaller than the number of mixtures. Another beneﬁt of reducing dimensions is that it prevents overlearning, to which the rest of this subsection is devoted. Overlearning means that if the number of parameters in a statistical model is too large when compared to the number of available data points, the estimation of the parameters becomes difﬁcult, maybe impossible. The estimation of the parameters is then too much determined by the available sample points, instead of the actual process that generated the data, which is what we are really interested in. Overlearning in ICA [214] typically produces estimates of the ICs that have a single spike or bump, and are practically zero everywhere else. This is because in the space of source signals of unit variance, nongaussianity is more or less maximized by such spike/bump signals. This becomes easily comprehensible if we consider the extreme case where the sample size T equals the dimension of the data m, and these are both equal to the number of independent components n. Let us collect x x X the realizations (t) of as the columns of the matrix , and denote by the S s corresponding matrix of the realizations of (t), as in (13.1). Note that now all the matrices in (13.1) are square. This means that by changing the values of A (and X keeping ﬁxed), we can give any values whatsoever to the elements of . This is S a case of serious overlearning, not unlike the classic case of regression with equal numbers of data points and parameters. Thus it is clear that in this case, the estimate of S that is obtained by ICA estimation depends little on the observed data. Let us assume that the densities of the source signals are known to be supergaussian (i.e., positively kurtotic). Then the B ICA estimation basically consists of ﬁnding a separating matrix that maximizes a measure of the supergaussianities (or sparsities) of the estimates of the source signals. Intuitively, it is easy to see that sparsity is maximized when the source signals each have only one nonzero point. Thus we see that ICA estimation with an insufﬁcient sample size leads to a form of overlearning that gives artifactual (spurious) source signals. Such source signals are characterized by large spikes. An important fact shown experimentally [214] is that a similar phenomenon is much more likely to occur if the source signals are not independently and identically distributed (i.i.d.) in time, but have strong time-dependencies. In such cases the sample size needed to get rid of overlearning is much larger, and the source signals are better characterized by bumps, i.e., low-pass ﬁltered versions of spikes. An intuitive way of explaining this phenomenon is to consider such a signal as being constant on N=k blocks of k consecutive sample points. This means that the data can
7. HOW MANY COMPONENTS SHOULD BE ESTIMATED? 269 be considered as really having only N=k sample points; each sample point has simply been repeated k times. Thus, in the case of overlearning, the estimation procedure gives “spikes” that have a width of k time points, i.e., bumps. Here we illustrate the phenomenon by separation of artiﬁcial source signals. Three positively kurtotic signals, with 500 sample points each, were used in these simulations, and are depicted in Fig. 13.1 a. Five hundred mixtures were produced, and a very small amount of gaussian noise was added to each mixture separately. As an example of a successful ICA estimation, Fig. 13.1 b shows the result of applying the FastICA and maximum likelihood (ML) gradient ascent algorithms (denoted by “Bell-Sejnowski”) to the mixed signals. In both approaches, the prepro- cessing (whitening) stage included a dimension reduction of the data into the ﬁrst three principal components. It is evident that both algorithms are able to extract all the initial signals. In contrast, when the whitening is made with very small dimension reduction (we took 400 dimensions), we see the emergence of spiky solutions (like Dirac functions), which is an extreme case of kurtosis maximization (Fig. 13.1 c). The algorithm used in FastICA was of a deﬂationary type, from which we plot the ﬁrst ﬁve components extracted. As for the ML gradient ascent, which was of a symmetric type, we show ﬁve representative solutions to the 400 extracted. Thus, we see here that without dimension reduction, we are not able to estimate the source signals. Fig. 13.1 d presents an intermediate stage of dimension reduction (from the original 500 mixtures we took 50 whitened vectors). We see that the actual source signals are revealed by both methods, even though each resulting vector is more noisy than the ones shown in Fig. 13.1 b. For the ﬁnal example, in Fig. 13.1 e, we low-pass ﬁltered the mixed signals, prior to the independent component analysis, using a 10 delay moving average ﬁlter. Taking the same amount of principal components as in d, we can see that we lose all the original source signals: the decompositions show a bumpy structure corresponding to the low-pass ﬁltering of the spiky outputs presented in c. Through low-pass ﬁltering, we have reduced the information contained in the data, and so the estimation is rendered impossible even with this, not very weak, dimension reduction. Thus, we see that with this low-pass ﬁltered data, a much stronger dimension reduction by PCA is necessary to prevent overlearning. In addition to PCA, some kind of prior information on the mixing matrix could be useful in preventing overlearning. This is considered in detail in Section 20.1.3. 13.3 HOW MANY COMPONENTS SHOULD BE ESTIMATED? Another problem that often arises in practice is to decide the number of ICs to be estimated. This problem does not arise if one simply estimates the same number of components as the dimension of the data. This may not always be a good idea, however.
8. 270 PRACTICAL CONSIDERATIONS (a) Fast ICA Bell−Sejnowski (b) (c) (d) (e) Fig. 13.1 (From [214]) Illustration of the importance of the degree of dimension reduction and ﬁltering in artiﬁcially generated data, using FastICA and a gradient algorithm for ML estimation. (a) Original positively kurtotic signals. (b) ICA decomposition in which the preprocessing includes a dimension reduction to the ﬁrst 3 principal components. (c) Poor, i.e., too weak dimension reduction. (d) Decomposition using an intermediate dimension reduction (50 components retained). (e) Same results as in (d) but using low-pass ﬁltered mixtures
9. CHOICE OF ALGORITHM 271 First, since dimension reduction by PCA is often necessary, one must choose the number of principal components to be retained. This is a classic problem; see Chapter 6. It is usually solved by choosing the minimum number of principal components that explain the data well enough, containing, for example, 90% of the variance. Often, the dimension is actually chosen by trial and error with no theoretical guidelines. Second, for computational reasons we may prefer to estimate only a smaller number of ICs than the dimension of the data (after PCA preprocessing). This is the case when the dimension of the data is very large, and we do not want to reduce the dimension by PCA too much, since PCA always contains the risk of not including the ICs in the reduced data. Using FastICA and other algorithms that allow estimation of a smaller number of components, we can thus perform a kind of dimension reduction by ICA. In fact, this is an idea somewhat similar to projection pursuit. Here, it is even more difﬁcult to give any guidelines as to how many components should be estimated. Trial and error may be the only method applicable. Information-theoretic, Bayesian, and other criteria for determining the number of ICs are discussed in more detail in [231, 81, 385]. 13.4 CHOICE OF ALGORITHM Now we shall brieﬂy discuss the choice of ICA algorithm from a practical viewpoint. As will be discussed in detail in Chapter 14, most estimation principles and objective functions for ICA are equivalent, at least in theory. So, the main choice is reduced to a couple of points: One choice is between estimating all the independent components in parallel, or just estimating a few of them (possibly one-by-one). This corresponds to choosing between symmetric and hierarchical decorrelation. In most cases, symmetric decorrelation is recommended. Deﬂation is mainly useful in the case where we want to estimate only a very limited number of ICs, and other special cases. The disadvantage with deﬂationary orthogonalization is that the estimation errors in the components that are estimated ﬁrst accumulate and increase the errors in the later components. One must also choose the nonlinearity used in the algorithms. It seems that the robust, nonpolynomial nonlinearities are to be preferred in most applications. The simplest thing to do is to just use the tanh function as the nonlinearity g . This is sufﬁcient when using FastICA. (When using gradient algorithms, especially in the ML framework, a second function needs to be used as well; see Chapter 9.) Finally, there is the choice between on-line and batch algorithms. In most cases, the whole data set is available before the estimation, which is called in different contexts batch, block, or off-line estimation. This is the case where FastICA can be used, and it is the algorithm that we recommend. On-
10. 272 PRACTICAL CONSIDERATIONS line or adaptive algorithms are needed in signal-processing applications where the mixing matrix may change on-line, and fast tracking is needed. In the on-line case, the recommended algorithms are those obtained by stochastic gradient methods. It should also be noted that in some cases, the FastICA algorithm may not converge well as Newton-type algorithms sometimes exhibit oscillatory behavior. This problem can be alleviated by using gradient methods, or combinations of the two (see [197]). 13.5 CONCLUDING REMARKS AND REFERENCES In this chapter, we considered some practical problems in ICA. When dealing with time signals, low-pass ﬁltering of the data is useful to reduce noise. On the other hand, high-pass ﬁltering, or computing innovation processes is useful to increase the independence and nongaussianity of the components. One of these, or their combi- nation may be very useful in practice. Another very useful thing to do is to reduce the dimension of the data by PCA. This reduces noise and prevents overlearning. It may also solve the problems with data that has a smaller number of ICs than mixtures. Problems 13.1 Take a Fourier transform on every observed signal xi (t). Does the ICA model still hold, and in what way? 13.2 Prove the theorem on innovations. Computer assignments 13.1 Take a gaussian white noise sequence. Low-pass ﬁlter it by a low-pass ﬁlter with coefﬁcients (...,0,0,1,1,1,1,1,0,0,0,...). What does the signal look like? 13.2 High-pass ﬁlter the gaussian white noise sequence. What does the signal look like? 13.3 Generate 100 samples of 100 independent components. Run FastICA on this data without any mixing. What do the estimated ICs look like? Is the estimate of the mixing matrix close to identity?