8 Data-Reusing Adaptive Learning Algorithms
8.1 Perspective
In this chapter, a class of data-reusing learning algorithms for recurrent neural networks is analysed. The analysis starts from the case of feedforward neurons and proceeds to networks with feedback, trained with gradient descent learning algorithms. It is shown that the class of data-reusing algorithms outperforms the standard (a priori) algorithms for nonlinear adaptive filtering in terms of the instantaneous prediction error. The relationships between the a priori and a posteriori errors, the learning rate and the norm of the input vector are derived in this context.
8.2 Introduction
The so-called a posteriori error estimates provide us with, roughly speaking, some
information after computation. From a practical point of view, they are valuable and
useful, since real-life problems are often nonlinear, large, ill-conditioned, unstable or
have multiple solutions and singularities (Hlavacek and Krizek 1998). The a posteriori
error estimators are local in a computational sense, and their computation should be far less expensive than the computation of
an exact numerical solution of the problem. An account of the essence of a posteriori
techniques is given in Appendix F.
In the area of linear adaptive filters, the most comprehensive overviews of a posteriori techniques can be found in Treichler (1987) and Ljung and Soderstrom (1983).
These techniques are also known as data-reusing techniques (Douglas and Rupp 1997;
Roy and Shynk 1989; Schnaufer and Jenkins 1993; Sheu et al. 1992). The quality of an
a posteriori error estimator is often measured by its efficiency index, i.e. the ratio of
the estimated error to the true error. It has been shown that the a posteriori approach
in the neural network framework introduces a kind of normalisation of the employed
learning algorithm (Mandic and Chambers 1998c). Consequently, it is expected that
the instantaneous a posteriori output error ē(k) is smaller in magnitude than the
corresponding a priori error e(k) for a non-expansive nonlinearity Φ (Mandic and Chambers 1998c; Treichler 1987).
8.2.1 Towards an A Posteriori Nonlinear Predictor
To obtain an a posteriori RNN-based nonlinear predictor, let us, for simplicity, consider a NARMA recurrent perceptron, the output of which can be expressed as
y(k) = Φ(u^T(k) w(k)),    (8.1)
where the information vector
u(k) = [x(k-1), ..., x(k-M), 1, y(k-1), ..., y(k-N)]^T    (8.2)
comprises both the external input and feedback signals. As the updated weight vector w(k+1) is available before the arrival of the next input vector u(k+1), an a posteriori output estimate ȳ(k) can be formed as
ȳ(k) = Φ(u^T(k) w(k+1)).    (8.3)
The corresponding instantaneous a priori and a posteriori errors at the output neuron
of a neural network are given, respectively, as
e(k) = d(k) - y(k)    (a priori error),    (8.4)
ē(k) = d(k) - ȳ(k)    (a posteriori error),    (8.5)
where d(k) is some teaching signal. The a posteriori outputs (8.3) can be used to form
an a posteriori information vector
ū(k) = [x(k-1), ..., x(k-M), 1, ȳ(k-1), ..., ȳ(k-N)]^T,    (8.6)
which can replace the a priori information vector (8.2) in the output (8.3) and weight
update calculations (6.43)–(6.45). This also results in greater accuracy (Ljung and
Soderstrom 1983). An alternate representation of such an algorithm is the so-called
a posteriori error gradient descent algorithm (Ljung and Soderstrom 1983; Treichler
1987), explained later in this chapter.
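To make the interplay between (8.1)–(8.6) concrete, the following minimal NumPy sketch computes one a priori / a posteriori step of such a recurrent perceptron. The logistic choice of Φ, the plain gradient weight update (a simplified stand-in for (6.43)–(6.45)) and the function name a_posteriori_step are illustrative assumptions rather than the exact algorithm of Chapter 6.

    import numpy as np

    def logistic(v):
        # Contractive, monotonically increasing nonlinearity Phi
        return 1.0 / (1.0 + np.exp(-v))

    def a_posteriori_step(x_past, y_past, w, d, eta=0.1):
        """One a priori / a posteriori step of a NARMA recurrent perceptron.

        x_past : [x(k-1), ..., x(k-M)]  external inputs
        y_past : [y(k-1), ..., y(k-N)]  fed-back outputs
        w      : weight vector of length M + 1 + N
        d      : teaching signal d(k)
        """
        u = np.concatenate([x_past, [1.0], y_past])   # information vector (8.2)
        y = logistic(u @ w)                           # a priori output (8.1)
        e = d - y                                     # a priori error (8.4)
        # Simplified gradient update of the weights (stand-in for (6.43)-(6.45))
        w_next = w + eta * e * y * (1.0 - y) * u
        y_bar = logistic(u @ w_next)                  # a posteriori output (8.3)
        e_bar = d - y_bar                             # a posteriori error (8.5)
        return y, e, y_bar, e_bar, w_next

Feeding the a posteriori outputs ȳ(k-1), ..., ȳ(k-N) back in place of y(k-1), ..., y(k-N) then gives the a posteriori information vector ū(k) of (8.6).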
A simple data-reusing algorithm for linear adaptive filters
The procedure of calculating the instantaneous error, output and weight update may be repeated a number of times, keeping the same external input vector x(k) and teaching signal d(k), which results in improved error estimation. Let us consider such a data-reusing LMS algorithm for FIR adaptive filters, described by (Mandic and Chambers 2000e)
e_i(k) = d(k) - x^T(k) w_i(k),
w_{i+1}(k) = w_i(k) + η e_i(k) x(k),
subject to |e_{i+1}(k)| ≤ γ|e_i(k)|, 0 < γ < 1, i = 1, ..., L.    (8.7)
Figure 8.1 Convergence curves for a repeatedly applied data-reusing algorithm (linear case): NWEV (dB) against sample number for N = 1, 2, 3, 4, 5 reuses. Input: speech signal recording; filter: MATLAB filter(1,[1 0.5 0.2],...); noise: 20 dB SNR Gaussian; step size: 0.01.
From (8.7), w(k+1) is associated with the index (L+1), i.e. w(k+1) = w_{L+1}(k), whereas for L = 1 the problem reduces to the standard a priori algorithm, i.e. w_1(k) = w(k), w_2(k) = w(k+1). Convergence curves for such a reiterated LMS algorithm for a data-reusing FIR filter applied to echo cancellation are shown in Figure 8.1. The averaged squared prediction error becomes smaller with the number of iterations, N. For N → ∞, the prediction error becomes that of the NLMS¹ algorithm. A geometrical perspective of the procedure (8.7) is given in Appendix F and Figures F.2 and F.3. This provides advantageous stabilising features as compared with standard algorithms, which is further elaborated in Section F.2.2 of Appendix F. In practice, however, the advantage of the a posteriori algorithms is not always significant, and depends on the physics of the problem and the chosen filter.
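A minimal Python sketch of the reiterated update (8.7) is given below; the step size η and the number of reuses L are arbitrary illustrative values, and the condition |e_{i+1}(k)| ≤ γ|e_i(k)| is not enforced explicitly.

    import numpy as np

    def data_reusing_lms_step(w, x, d, eta=0.01, L=3):
        # One data-reusing LMS update, Eq. (8.7): the same pair (x(k), d(k))
        # is reused L times; w_{L+1}(k) then plays the role of w(k+1).
        for _ in range(L):
            e = d - x @ w           # e_i(k) = d(k) - x^T(k) w_i(k)
            w = w + eta * e * x     # w_{i+1}(k) = w_i(k) + eta e_i(k) x(k)
        return w

With L = 1 this is the standard (a priori) LMS step, while letting L grow drives the update towards the normalised (NLMS) solution for the current input vector.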
8.2.2 Note on the Computational Complexity
It has been shown that the computational complexity of the a priori RTRL algorithm is O(N^4) (Haykin 1994; Williams and Zipser 1995), with N denoting the number of neurons in the RNN. If, in order to improve the performance, the number of neurons in the network is increased from N to (N+1), the time required for the new adaptation process to finish can increase dramatically. To illustrate this problem, the relative change in the computational load when the number of neurons increases, i.e. the ratio (N+1)^4/N^4, is shown in Figure 8.2.
¹ In fact, for the linear case, the NLMS algorithm is approached by repeating this kind of data-reusing an infinite number of times (Nitzberg 1985; Roy and Shynk 1989; Schnaufer and Jenkins 1993). For further details, see Appendix F.
Figure 8.2 Ratio of the increase of computational burden with N: the computational complexity ratio (N+1)^4/N^4 against the number of neurons N in the RNN.
In other words, the a posteriori procedure applied to the network with N neurons should have the computational load C_L given by
C_L(N^4) ≤ C_L,a posteriori < C_L((N+1)^4).    (8.8)
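For concreteness, the ratio shown in Figure 8.2 and underlying the bound (8.8) can be tabulated directly; the snippet below simply evaluates (N+1)^4/N^4 for N = 1, ..., 10.

    # Growth of the O(N^4) RTRL complexity when one neuron is added (cf. Figure 8.2)
    for N in range(1, 11):
        ratio = (N + 1) ** 4 / N ** 4
        print(f"N = {N:2d}:  (N+1)^4 / N^4 = {ratio:5.2f}")
    # e.g. N = 1 gives 16.00, N = 2 gives about 5.06, N = 10 gives about 1.46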
8.2.3 Chapter Summary
A detailed account of various data-reusing techniques for nonlinear adaptive filters
realised as neural networks is provided. The relationships between the a priori and
a posteriori errors are derived and the corresponding bounds on learning rates are
analysed. This class of algorithms performs better than the standard algorithms and does not introduce a significant additional computational burden; moreover, when the data-reusing step is iterated an infinite number of times, these algorithms converge to a class of normalised algorithms.
8.3 A Class of Simple A Posteriori Algorithms
Consider a simple computational model of a feedforward neural adaptive filter shown in Figure 6.4. The aim is to preserve
|ē(k)| ≤ γ|e(k)|,    0 ≤ γ < 1,    (8.9)
at each iteration, for both feedforward and recurrent neural networks acting as a
nonlinear predictor. The problem of obtaining the a posteriori error can be represented
in the gradient descent setting as (Mandic and Chambers 2000e)
w(k+1) = w(k) - η∇_w E(k),
ē(k) = d(k) - Φ(x^T(k) w(k+1)),
subject to |ē(k)| ≤ γ|e(k)|, 0 < γ < 1.    (8.10)
From (8.10), the actual learning is performed in the standard manner, i.e. a priori using e(k), whereas an improved a posteriori error ē(k) is calculated at every discrete time interval using the updated weight vector w(k+1). The gradient descent algorithm for this computational model, with the cost function in the form of E(k) = ½e²(k), is given by
e(k) = d(k) - Φ(x^T(k) w(k)),
w(k+1) = w(k) + η(k) e(k) Φ'(x^T(k) w(k)) x(k),
ē(k) = d(k) - Φ(x^T(k) w(k+1)).    (8.11)
This case represents a generalisation of the LMS algorithm for FIR adaptive linear filters. Let us express the a posteriori error term from above as
ē(k) = d(k) - Φ(x^T(k) w(k)) - [Φ(x^T(k) w(k+1)) - Φ(x^T(k) w(k))].    (8.12)
Using the CMT, for a contractive, monotonically increasing Φ and positive e(k) and ē(k), we have
Φ(x^T(k) w(k+1)) - Φ(x^T(k) w(k)) = α(k) x^T(k) ∆w(k),    (8.13)
where α(k) = Φ'(ξ) < 1, ξ ∈ (x^T(k) w(k), x^T(k) w(k+1)). Using (8.11)–(8.13) yields
ē(k) = [1 - η(k) α(k) Φ'(k) ‖x(k)‖₂²] e(k),    (8.14)
where Φ'(k) = Φ'(x^T(k) w(k)). The learning rate
η(k) = 1 / (α(k) Φ'(k) ‖x(k)‖₂²),
which minimises (8.14), is approximately that of a normalised nonlinear gradient descent algorithm (9.15), given in Chapter 9.
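The following short sketch performs one step of (8.11) and checks that the a posteriori error is indeed smaller in magnitude than the a priori error, as (8.14) predicts whenever 0 < η(k)α(k)Φ'(k)‖x(k)‖₂² < 2. The logistic nonlinearity, the step size and the test data are illustrative assumptions.

    import numpy as np

    def logistic(v):
        return 1.0 / (1.0 + np.exp(-v))

    def a_posteriori_gradient_step(w, x, d, eta=0.2):
        # One a priori / a posteriori gradient descent step, Eq. (8.11)
        y = logistic(x @ w)
        e = d - y                              # a priori error
        phi_prime = y * (1.0 - y)              # Phi'(x^T(k) w(k)) for the logistic
        w_next = w + eta * e * phi_prime * x   # weight update
        e_bar = d - logistic(x @ w_next)       # a posteriori error
        return e, e_bar, w_next

    rng = np.random.default_rng(0)
    w = 0.1 * rng.standard_normal(5)
    x = rng.standard_normal(5)
    e, e_bar, _ = a_posteriori_gradient_step(w, x, d=0.7)
    print(abs(e_bar) <= abs(e))   # True for a sufficiently small step size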
To obtain the bounds of such an a posteriori error, premultiplying the weight update equation in (8.11) by x^T(k) and applying the nonlinear activation function Φ on either side yields (Mandic and Chambers 2000e)
Φ(x^T(k) w(k+1)) = Φ(x^T(k) w(k) + η(k) e(k) Φ'(k) ‖x(k)‖₂²).    (8.15)
Further analysis depends on the function Φ, which can exhibit either contractive or expansive behaviour. For simplicity, let us consider a class of contractive functions Φ, which satisfy²
Φ(a + b) ≤ Φ(a) + Φ(b).    (8.16)
With a = x^T(k) w(k) and b = η(k) e(k) Φ'(k) ‖x(k)‖₂², applying (8.16) to (8.15) and subtracting d(k) from both sides of the resulting equation, due to contractivity of Φ, we obtain
ē(k) ≥ e(k) - Φ(η(k) e(k) Φ'(k) ‖x(k)‖₂²).    (8.17)
² This is the case, for instance, for many sigmoid functions. For many other functions this is satisfied in a certain range of interest. For instance, for a = 0, positive b and a saturating, monotonically increasing, positive sigmoid, Φ(a + b) = Φ(b) < Φ(a) + Φ(b). The condition Φ(a + b) ≤ Φ(a) + Φ(b) is satisfied for the logistic function on all of its range and for the positive range of the tanh activation function. For many other functions, |Φ(a + b)| ≤ |Φ(a) + Φ(b)|.
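As a quick numerical illustration (not a proof) of condition (8.16) and of the footnote above, the following snippet checks Φ(a + b) ≤ Φ(a) + Φ(b) for the logistic function over a grid of arguments.

    import numpy as np

    def logistic(v):
        return 1.0 / (1.0 + np.exp(-v))

    # Check the subadditivity condition (8.16) for the logistic function on a grid
    grid = np.linspace(-10.0, 10.0, 201)
    a, b = np.meshgrid(grid, grid)
    holds = logistic(a + b) <= logistic(a) + logistic(b) + 1e-12
    print(np.all(holds))   # True over the whole grid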