doi:10.1046/j.1432-1033.2002.03115.x
Eur. J. Biochem. 269, 4219–4225 (2002) © FEBS 2002
Prediction of protein structural class by amino acid and polypeptide composition
Rui-yan Luo1, Zhi-ping Feng2,3 and Jia-kun Liu1,3
1Department of Mathematics, Tianjin University, Tianjin, China; 2Department of Physics, Tianjin University, Tianjin, China; 3LiuHui Center for Applied Mathematics, Nankai University and Tianjin University, Tianjin, China
A new approach to predicting the structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the compositions of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account, based on stepwise discriminant analysis. The results of the jackknife test show that this new approach leads to higher predictive sensitivity and specificity for datasets of reduced sequence similarity. For the dataset PDB40-B constructed by Brenner and colleagues, 75.2% of protein domain sequences are correctly assigned in the jackknife test for the four structural classes all-α, all-β, α/β and α + β, an improvement of 19.4% in the jackknife test and 25.5% in the resubstitution test over the component-coupled algorithm using amino acid composition alone (AAC approach) on the same dataset. In the cross-validation test with the dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiora, accuracies of 100% and 99.7% can easily be achieved in the resubstitution test and the jackknife test, respectively, merely by taking the composition of dipeptides into account. Therefore, this new method provides an effective tool for extracting valuable information from protein sequences, which can be used for the systematic analysis of small or medium-sized protein sequences. The computer programs used in this paper are available on request.

Keywords: stepwise discriminant analysis; polypeptides; amino acid composition; domain structural class; seeded peptides.

Correspondence to Z.-P. Feng, Department of Physics, Tianjin University, Tianjin 300072, China. Fax: +86 22 87890061, Tel.: +86 22 87891908, E-mail: zpfeng@eyou.com
Abbreviation: AAC, the component-coupled algorithm using amino acid composition alone.
(Received 5 May 2002, revised 6 July 2002, accepted 10 July 2002)
It is generally accepted that protein structure is determined by its amino acid sequence [1] and that knowledge of protein structures plays an important role in understanding their functions. Understanding the rules relating amino acid sequence to three-dimensional protein structure is one of the major goals of contemporary molecular biology. A priori knowledge of protein structural classes has become quite useful from both an experimental and a theoretical point of view. The structural class of a protein presents an intuitive description of its overall fold, and the restrictions imposed by the structural class have a high impact on its tertiary structure prediction [2]. The accuracy of secondary structure prediction from amino acid sequences may be improved by incorporating knowledge of structural classes [3–9]. Some researchers have claimed that knowledge of structural classes might be used to reduce the scope of the conformational space searched during energy optimization, and to provide useful information for heuristic approaches [8,10] to finding the tertiary structure of a protein.

Fold classes were defined over 20 years ago as general ways of describing folds that reflect the secondary-structure elements and general aspects of their arrangement [11]. According to the original concept of protein structural class presented by Levitt and Chothia, protein folds can be classified into one of four classes: all-α, all-β, α/β and α + β. Ever since then, various quantitative classification rules have been proposed based on the percentages of α-helices and β-sheets in a protein. The introduction of these quantitative rules has stimulated the development of protein structural class prediction.

Historically, Nishikawa's finding [12–15] that the structural classes of proteins correlate strongly with amino acid composition marked the onset of algorithm development aimed at predicting the structural class of a protein from its amino acid composition alone. A number of algorithms have addressed this topic, such as the least Hamming distance [3], the least Euclidean distance [15], discriminant analysis [16,17], vector decomposition [2,18], the component-coupled algorithm [19,20], and fuzzy structural vectors [21]. There have also been reviews and critical articles addressing the same problems [22,23]. In general, because different studies have used different datasets, the evaluation of existing algorithms is still controversial. To improve structure prediction significantly, more information is required. In addition to amino acid composition, it might be expected that taking into account the sequence order along the primary structure of a protein would improve predictive accuracy [24,25]. Hidden Markov models (HMMs) can also be used for discrimination or multiple alignment; however, the large number of parameters in such models has made it hard to keep up with the ever increasing amount of sequence data [26,27].
This paper presents a novel method of extracting the sorting features of sequences in the different domain structural classes. Based on stepwise discriminant analysis and a method of lengthening 'seeded' residues or peptides, several critical amino acid residues and polypeptides in the four major protein domain classes are selected, and their occurrence frequencies in protein sequences are used to predict the domain structural classes. For the low-homology dataset PDB40-B constructed by Brenner et al. [28], the overall predictive accuracy thus obtained reaches 91.7% and 75.2% in the resubstitution and jackknife tests, respectively. This is about 25.5% (resubstitution test) and 19.4% (jackknife test) higher than that based on the component-coupled algorithm using amino acid composition alone (AAC approach) for the same dataset. In the cross-validation test with the dataset PDB40-J constructed by Park et al. [29], using the parameters derived from PDB40-B, more than 80% predictive accuracy is obtained. Furthermore, for the dataset of 359 sequences constructed by Chou and Maggiora [20], accuracies of 100% and 99.7% can easily be achieved in the resubstitution and jackknife tests, respectively, merely by taking the composition of dipeptides into account in the second cycle of the iterative process. It is simple to put the present method into practical use for datasets that include other classes, whenever larger, nonredundant datasets become available. Therefore, this new method provides an effective tool for extracting valuable information from protein sequences, which would be useful for the systematic analysis of small or medium-sized protein sequences.

DATABASE AND METHODS

Datasets

The datasets used in this paper are PDB40-B and PDB40-J, established by Brenner et al. and Park et al. [28,29], respectively. The pairwise sequence identities in both datasets are less than 40%, and both datasets are available at http://scop.berkeley.edu/parse/index.html. Because these datasets are extracted from SCOP [30], which classifies the structural classes of protein domains on the basis of evolutionary relationships and the principles that govern the three-dimensional structure of proteins, they are relatively objective and reliable. Small proteins, and most of those of medium size, have single domains [30]; therefore, a tool for predicting protein domain structural classes can be used in analysing sequences of small or medium size. For larger, multidomain proteins, the servers DomainParser (http://compbio.ornl.gov/structure/domainparser) and DomCut (http://www.kazusa.or.jp/tech/suyama/domcut/domcut.html) can first be used to decompose a multidomain protein into individual structural domains. The underlying principle used in these servers is that residue–residue contacts are denser within a domain than between domains [31]. In dataset PDB40-B there are 1323 domain sequences in seven structural classes, of which 1112 belong to the four major structural classes, i.e. 85% of the sequences in PDB40-B are in the all-α, all-β, α/β or α + β class. The sequences in the other three classes are too scarce to have statistical significance and are therefore left out of this study. After excluding the sequences with unknown residues, a dataset including 1054 sequences is obtained, labeled PDB40-b. In this dataset there are 220 domain sequences of the all-α class, 309 of the all-β class, 285 of the α/β class, and 240 of the α + β class. Similarly, there are 935 sequences in total in dataset PDB40-J, of which 790 belong to the four major classes, again about 85%. After excluding the sequences with unknown residues, 162 sequences of the all-α class, 209 of the all-β class, 222 of the α/β class and 165 of the α + β class remain; this dataset of 758 sequences in total is labeled PDB40-j. As 386 protein domain sequences are identical between PDB40-b and PDB40-j, the subset of the latter excluding these identical sequences is denoted PDB40-j1. In order to clarify the advantage of the algorithm proposed in this paper, the dataset constructed by Chou & Maggiora [20] is also used. The compositions of all the datasets are listed in Table 1.

Predictive method

Stepwise discriminant analysis from multivariate statistics [32] is used to extract classifying features from the sequences of the different structural classes. In stepwise discrimination, variables are chosen to enter or leave the model according to two criteria: firstly, the significance level of an F-test from an analysis of covariance, where the variables already chosen act as covariates and the variable under consideration is the dependent variable; and secondly, the squared partial correlation for predicting the variable under consideration from the variable representing the classification of observations, controlling for the effects of the variables already selected for the model.
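As a rough, hedged illustration of the entry criterion only (the actual analysis follows the stepwise discriminant procedure of [32], in which variables may also leave the model), the sketch below ranks candidate frequency variables by a one-way ANOVA F-statistic across the structural classes. The array names `freqs` and `labels` and the cut-off `n_keep` are illustrative assumptions, not quantities from the paper.

```python
# A minimal sketch, assuming `freqs` is an (n_sequences x n_variables) array of
# peptide frequencies and `labels` gives the structural class of each sequence.
# Only the F-test idea behind variable entry is shown; true stepwise selection
# also conditions on the variables already in the model and removes variables.
import numpy as np

def anova_f(x, labels):
    """One-way ANOVA F-statistic of a single frequency variable across classes."""
    x, labels = np.asarray(x, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = x.mean()
    ss_between = sum((labels == c).sum() * (x[labels == c].mean() - grand_mean) ** 2
                     for c in classes)
    ss_within = sum(((x[labels == c] - x[labels == c].mean()) ** 2).sum()
                    for c in classes)
    df_between, df_within = len(classes) - 1, len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within + 1e-12)

def select_variables(freqs, labels, n_keep=50):
    """Return the indices of the n_keep variables that best separate the classes."""
    freqs = np.asarray(freqs, dtype=float)
    scores = np.array([anova_f(freqs[:, j], labels) for j in range(freqs.shape[1])])
    return np.argsort(scores)[::-1][:n_keep]
```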
Let S(1) be the set composed of the 20 amino acids, namely S(1) = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Let S(i) be the set of all possible i-peptides (i a positive integer greater than 1), where an i-peptide is a string of i members of S(1). For brevity, the elements of S(i) are ordered lexicographically. Therefore,

S(2) = {AA, AC, AD, ..., AY, CA, CC, ..., YW, YY},
S(3) = {AAA, AAC, AAD, ..., AAY, ACA, ACC, ..., YYY},

etc. Apparently, the number of elements in S(i) is 20^i (i = 1, 2, ...), which grows at an exponential rate. In the following description, S(i) (i ≥ 1) is called the set of peptides, although it represents the set of single amino acid residues when i = 1 and the set of i-peptides when i > 1. To predict the structural class of each query sequence, a subset of S(i), denoted T(i), is constructed such that the result of prediction using T(i) is almost the same as that using S(i), while the number of elements in T(i) is much smaller than 20^i. To form T(i), first construct a subset of T(i-1), denoted T'(i-1), whose members are called 'seeded (i-1)-peptide(s)'; then add every amino acid of S(1) in front of and behind each element of T'(i-1) to obtain a set including T(i). Meanwhile, denote the quantitative variable sets that represent the occurrence frequencies of the elements of T(i) and T'(i) as X(i) and X'(i), respectively. The frequency is defined as follows. For a given protein domain sequence of length L, the total number of all possible i-peptides in it is L - i + 1, so the frequency of a certain i-peptide in this domain is

f = \frac{\text{number of occurrences of the } i\text{-peptide in the domain}}{L - i + 1}     (1)

Although written for i > 1, this form is also valid for i = 1, the case of single amino acid residues, so we can let i ≥ 1 in Eqn (1).
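The frequency of Eqn (1) is straightforward to compute; the short sketch below does so for an arbitrary candidate set of i-peptides (the sequence and peptides in the example are toy values, not data from the paper).

```python
# Frequency of Eqn (1): an i-peptide occurring m times in a domain sequence of
# length L has frequency m / (L - i + 1).
from collections import Counter

def ipeptide_frequencies(sequence, i, peptides):
    """Return the Eqn (1) frequency of each peptide in `peptides` (all of length i)."""
    total = len(sequence) - i + 1
    counts = Counter(sequence[p:p + i] for p in range(total))
    return {pep: counts[pep] / total for pep in peptides}

# Toy example: frequencies of the dipeptides AL and KV in a short sequence.
print(ipeptide_frequencies("MALKVALAAL", 2, ["AL", "KV"]))
```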
Table 1. The datasets used in this paper. PDB40-b and PDB40-j were obtained by excluding the sequences with unknown residues from PDB40-B and PDB40-J, respectively. PDB40-j1 is the subset of PDB40-j without the 386 sequences identical with PDB40-b. The dataset constructed by Chou & Maggiora [20] is also used.

                          Number of sequences in each class
Dataset                 All-α   All-β   α/β    α + β   Total
PDB40-b                  220     309    285     240    1054
PDB40-j                  162     209    222     165     758
PDB40-j1                 166      91    140      75     372
Chou & Maggiora [20]      82      85     99      93     359
Suppose N protein domains form a set G, which is the union of n subsets, i.e.

G = G_1 \cup G_2 \cup \cdots \cup G_n

The size of each subset is N_k (k = 1, 2, ..., n), so that

N = \sum_{k=1}^{n} N_k

Denote the frequency, defined by Eqn (1), of the s-th i-peptide of S(i) in the p-th sequence of subset G_k as x_{s,p}^{i,k}. Then

x_s^{i,k} = [x_{s,1}^{i,k}, x_{s,2}^{i,k}, ..., x_{s,N_k}^{i,k}]^T

is the vector of frequencies of the s-th i-peptide of S(i) in all sequences of subset G_k, and

x_s^i = [x_s^{i,1}, x_s^{i,2}, ..., x_s^{i,n}]^T     (s = 1, 2, ..., 20^i)     (2)

is that of the s-th i-peptide of S(i) in all sequences of G. Each element of X(i) is defined, as in Eqn (2), as the vector of frequencies of one i-peptide of T(i) in all sequences of G, and each element of X'(i) as the vector of frequencies of one i-peptide of T'(i) in all sequences of G. The mean of x_s^{i,k} in subset G_k is defined as

r_s^{i,k} = \frac{1}{N_k} \sum_{p=1}^{N_k} x_{s,p}^{i,k}     (k = 1, 2, ..., n)     (3)

The algorithm is presented as follows.

1. Choose 'seeded residues'. Let i = 1 and T(1) = S(1). Perform stepwise discriminant analysis with the elements of X(1) to obtain a subset of X(1), denoted X'(1). Accordingly, a subset of T(1) is obtained, denoted T'(1). Suppose T'(1) has n(1) elements; they are called 'seeded residues'. Denote the result of discrimination with the variables in X'(1) as R(1).

2. Lengthen the 'seeded i-peptides', i.e. construct T(i + 1) from T'(i) (i = 1, 2, 3, ...). For all members of T'(i), add each of the 20 elements of S(1) in front of and behind them to obtain n(i) × 20 × 2 = 40 × n(i) elements belonging to S(i + 1) (repeated ones are counted twice). Delete duplicates so that each variable appears only once, and suppose the number of (i + 1)-peptides thus obtained is m(i + 1). Calculate their frequencies in each domain structural class and the corresponding means in the n subgroups, defined by Eqn (3), and then keep only the variables at least one of whose subgroup means exceeds the given threshold. In other words, for each of the m(i + 1) elements, supposing it is the s-th element of S(i + 1), it is retained if any of

r_s^{i+1,1}, r_s^{i+1,2}, ..., r_s^{i+1,n}

is greater than the threshold. Putting the retained variables together gives a quantitative variable set X(i + 1) as well as the corresponding polypeptide set T(i + 1).

3. Choose 'seeded (i + 1)-peptides', i.e. construct T'(i + 1) from T(i + 1) (i = 1, 2, 3, ...). Perform stepwise discrimination with all the elements of X(i + 1) to select the relatively important variables. Suppose n(i + 1) variables are chosen; they form a subset of X(i + 1) called X'(i + 1), and the corresponding polypeptide set T'(i + 1), consisting of 'seeded (i + 1)-peptides', is obtained as well.

4. Check the discriminant results when polypeptides are taken into account. Put together all the variables in X(1), X(2), ..., X(i + 1) and perform stepwise discriminant analysis; denote the discriminant result with the chosen variables as R(i + 1). If none of the variables of X(i + 1) enters the model for discrimination, or if R(i + 1) is not better than R(i), stop the iterative process: the variables obtained from X(1), X(2), ..., X(i) by stepwise discrimination are the desired variables for the final prediction and R(i) is the highest predictive accuracy. Otherwise, let i ← i + 1 and go back to step 2 to lengthen the 'seeded i-peptides' and continue the iterative process.
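To make step 2 concrete, here is a self-contained sketch that lengthens a set of seeded peptides by one residue at each end and keeps only those whose class-mean frequency (Eqn 3) exceeds the threshold in at least one class. The toy dataset, threshold value and function names are illustrative assumptions, not taken from the paper; the filtered set would then be passed to the stepwise selection of step 3.

```python
# Step 2 in miniature: grow seeded i-peptides into (i+1)-peptides and filter
# them by the class-mean frequency of Eqn (3).  The dataset below is a toy
# stand-in; in the paper the classes are the four structural classes.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency(peptide, sequence):
    """Eqn (1): occurrences of `peptide` divided by L - i + 1."""
    total = len(sequence) - len(peptide) + 1
    counts = Counter(sequence[p:p + len(peptide)] for p in range(total))
    return counts[peptide] / total

def lengthen_and_filter(seeded, dataset, threshold):
    """`dataset` maps class name -> list of domain sequences."""
    candidates = {aa + pep for pep in seeded for aa in AMINO_ACIDS}
    candidates |= {pep + aa for pep in seeded for aa in AMINO_ACIDS}
    kept = []
    for pep in sorted(candidates):
        class_means = [sum(frequency(pep, seq) for seq in seqs) / len(seqs)
                       for seqs in dataset.values()]
        if max(class_means) > threshold:   # retain if any class mean exceeds it
            kept.append(pep)
    return kept

toy = {"all-alpha": ["MALKVALAAL", "ALAAVKKLAL"],
       "all-beta": ["GSSVTVSLTD", "VTVSGSSLTV"]}
print(lengthen_and_filter(["AL", "VT"], toy, 0.05))
```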
Evaluation of the predictive results

In order to assess the accuracy of a predictive algorithm, the sensitivity and specificity for each class are calculated according to the definitions of Baldi et al. [33] and denoted QD and QM, respectively. In addition, predictions are evaluated by the resubstitution test and the jackknife test; the former reflects the self-consistency of the algorithm studied and the latter its extrapolating effectiveness. Among the various cross-validation tests in the literature for evaluating the extrapolating effectiveness of an algorithm, the jackknife test is thought to be the more rigorous and reliable [2,19,20,24]. In the jackknife test, each domain sequence in the training dataset is singled out in turn as an independent test sample, and all the frequencies of the chosen peptide(s) are derived from the remaining domain sequences.
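A minimal sketch of this protocol is given below; `fit` and `predict` are hypothetical callables standing in for whatever discriminant model is used, and the QD/QM helper follows the ratios that appear in Tables 2 and 4 (correct predictions divided by the observed and by the predicted class members, respectively).

```python
# Jackknife test: each domain is held out in turn, the model is re-derived from
# the remaining domains, and the held-out domain is predicted.  `fit`/`predict`
# are placeholders, not functions from the paper.
def jackknife(samples, labels, fit, predict):
    predictions = []
    for i in range(len(samples)):
        model = fit(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(predict(model, samples[i]))
    return predictions

def qd_qm(labels, predictions, cls):
    """Per-class sensitivity QD (correct/observed) and specificity QM
    (correct/predicted), matching the ratios shown in Tables 2 and 4."""
    tp = sum(t == cls and p == cls for t, p in zip(labels, predictions))
    observed = sum(t == cls for t in labels)
    predicted = sum(p == cls for p in predictions)
    return tp / observed, (tp / predicted if predicted else 0.0)
```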
RESULTS AND DISCUSSION
The final predictive results for dataset PDB40-b are summarized in Table 2. For comparison, the predictive accuracy based on the AAC approach is also listed for the same dataset. From the table we can see that the overall rate of correct prediction by the current method is 91.7% in the resubstitution test and 75.2% in the jackknife test, which is about 25.5% and 19.4%, respectively, higher than that of the AAC approach. For each class of protein domain sequences, both the sensitivity and the specificity are also much higher. The peptides finally selected for prediction are listed in Table 3. Of the 296 peptides, 12 are single amino acid residues, 62 are dipeptides, 130 are tripeptides, 23 are tetrapeptides, 66 are pentapeptides and three are hexapeptides. Using the composition of these peptides, the sequences in PDB40-j are predicted, with a total of 85.8% of sequences correctly assigned (Table 4). Excluding the sequences identical between PDB40-b and PDB40-j gives dataset PDB40-j1, for which the overall percentage of correct assignment is 82.3%. Furthermore, we performed an independent test by randomly dividing PDB40-b into two subsets, one of 160 sequences as the test set and the other as the training set. Based on the 296 peptides listed in Table 3, the division and test were performed 100 times; the predictive accuracies in these independent tests range from 64.33% to 79.50%, with an average value of 75%. The above results indicate that the presented method of extracting useful information from sequences is effective and that good extrapolating effectiveness can be obtained.

Table 2. Comparison of predictive results for PDB40-b using different approaches. QD and QM are sensitivity and specificity, as defined previously [33]. 'This study' denotes the algorithm presented in this paper, and AAC [19,20] refers to the component-coupled algorithm based on amino acid composition alone. The resubstitution test and the jackknife test are denoted Resb. and Jack., respectively.

     Method              All-α            All-β            α/β              α + β            Overall
QD   This study/Resb.    (208/220) 94.5   (284/309) 91.9   (262/285) 91.9   (213/240) 88.8   (967/1054) 91.7
     This study/Jack.    (161/220) 73.2   (241/309) 78.0   (232/285) 81.4   (159/240) 66.3   (793/1054) 75.2
     AAC/Resb.           (136/220) 61.8   (201/309) 65.0   (244/285) 85.6   (117/240) 48.8   (698/1054) 66.2
     AAC/Jack.           (119/220) 54.1   (184/309) 59.5   (215/285) 75.4   (70/240) 29.2    (588/1054) 55.8
QM   This study/Resb.    (208/211) 98.6   (284/306) 92.8   (262/293) 89.4   (213/244) 87.3   (967/1054) 91.7
     This study/Jack.    (161/182) 88.4   (241/301) 80.1   (232/326) 71.2   (159/245) 64.9   (793/1054) 75.2
     AAC/Resb.           (136/183) 74.3   (201/270) 74.4   (244/408) 59.8   (117/193) 60.0   (698/1054) 66.2
     AAC/Jack.           (119/196) 60.7   (184/282) 65.2   (215/389) 55.3   (70/187) 37.4    (588/1054) 55.8

The contribution provided by polypeptides in prediction

The higher predictive accuracy implies that the information brought in by the selected polypeptides is significant. Figure 1 depicts the improvement of the predictive accuracy for dataset PDB40-b as longer peptides are taken into account. The solid lines represent the results of the jackknife tests, and the solid dots indicate the results obtained with the lower thresholds. It can be seen that the predictive accuracy increases over the first several iterations, which implies that, with the input of some longer peptides, more and more sorting features of the sequences are captured. However, this improvement does not continue endlessly. Since the probability of a longer polypeptide appearing in a domain sequence is generally smaller, its occurrence frequencies in most sequences are zero, and a variable set with many zeros offers little, and doubtful, information, because in such a case it is difficult to tell useful information from errors. This phenomenon of first ascending and then descending holds for all the predictions with the different datasets. For dataset PDB40-b specifically, the predictions with the lower thresholds begin to drop at the seventh iteration, and the final 296 variables are chosen from X(1), X(2), ..., X(6) by stepwise discrimination. With these variables the highest predictive accuracy of 75.2% in the jackknife test is reached.

Fig. 1. The improvement of overall predictive accuracy as longer peptides are taken into account. The optimum prediction for dataset PDB40-b is obtained at the sixth iteration when the lower thresholds are set (filled circles) and at the fifth iteration when the higher thresholds are set (open circles). The solid line represents the results in the jackknife test and the dotted line the results in the resubstitution test. The general trends of the curves are the same whether higher or lower thresholds are set: the accuracy usually improves over the first several iterations and then decreases. If a lower threshold is set in an iteration, more variables are chosen and the predictive accuracy is usually higher; if a higher threshold is set, fewer variables are chosen and the predictive accuracy is lower, owing to the loss of some useful information.

From Table 3 it can be seen that the amino acid residues and polypeptides are not listed according to their length; in fact, they are listed in the order in which they were chosen in the stepwise discriminant analysis among all the elements of X(1), X(2), ..., X(6). The order in which variables are chosen in stepwise discrimination is therefore not determined by the absolute values of their frequencies. Some longer peptides are superior to shorter ones and provide more information in the analysis despite their smaller occurrence frequencies. Therefore, an algorithm concerned only with amino acid or dipeptide composition, whose occurrence frequencies are usually greater, is not able to extract such information from the sequences.

In the process of variable selection, some variables chosen in the earlier steps are removed later in the same stepwise discriminant analysis, because certain variables chosen later can substitute for the removed ones. Taking dataset PDB40-b (with the lower thresholds) as an example, when choosing variables from X(1), X(2), ..., X(6), glycine (G) is the fourth element to enter the discriminant model, but it is later removed. Referring to Table 3, we find that many polypeptides containing G are chosen for the final prediction; they cover the role of G and carry more sorting features. ALAAA is another variable chosen early in the process, but with the choice of ALAAV, ALAAF and ALAA, ALAAA is removed and is not among the variables for the final prediction. This phenomenon of replacement among amino acids or peptides explains why the variables in Table 3 (far fewer than 20 + 400 + 8000 + ... + 20^6) contain plentiful information and can result in higher predictive accuracy. The similar roles played by different peptides make it unnecessary to input all peptides as parameters for the discriminant analysis, and stepwise discrimination provides an effective way to select the variables that best embody the information for classification.
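The independent test described above is a repeated random split; a sketch of that protocol, under the assumption of hypothetical `fit` and `predict` callables for the discriminant model, could look like the following.

```python
# Repeated random splits: 100 times, hold out a 160-sequence test set, derive
# the model from the remaining sequences, and record the held-out accuracy.
import random

def random_split_test(samples, labels, fit, predict, n_test=160, repeats=100, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(hits / n_test)
    return min(accuracies), sum(accuracies) / repeats, max(accuracies)
```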
Table 3. The peptides chosen for the final prediction. The variables are derived from dataset PDB40-b in the sixth cycle of the process and are listed in the order in which they enter the model in the stepwise discriminant analysis with X(1), X(2), ..., X(6). Of the 296 variables chosen, 12 are single amino acid residues, 62 are dipeptides, 130 are tripeptides, 23 are tetrapeptides, 66 are pentapeptides and three are hexapeptides. Their frequencies are used in the final prediction.
P IEAI LFKK LDE ALK KV FQ AC AAQ H VLA
DAI AEV CA I EAF LLD ESG CVSLT AGAEA GVS DF LLEEE TTT IP RAE ELD
FN GDIL GKS LEEE TVS LGS GE LLAEP VL NGK RVR ILS AALEG ALGV LTV AALER GDIVI SKV QV LEV TK QK LIFA AET FLE GVDIL VTV GSS VDG GDTV AYV PVI IIL EMKAE LLDL RD FLK EDK TLDEV DFL LERLT AGL EALKP AVL SKVKI YN PQ GAEAL DLN KHLKL PGSSV NA FP DQ LAA SKVKR ESLKV GIA M GYV VLRD RLAT GQP GLR LAG GGK EY ALAAV GDTVY LEELG GLA TQALA DRLAA AEVAV LEELN YQ AEALK GDSGG VE GII EALK LKKRG VVI PQVSL VF LAS EAEVA L YA QLN PAGT VPA GAEAH VDI FSG SLN ALAAF IDG AGG VIL ALGKQ DIAAA EIL SLD IAG FG YV RQ KD LEALK NK LLDLM LLAEL LERLD C EAA V DG NV DGRI VIS NLD PGAEA FLAG RT ARA LGR AEL VVG SLK LRK AEE IAEA EQLG KLA LDL TLDEQ MT EDL EALKY PVT LKT PT GGF W GA A PGE DS AIN IAQ RLLAE IT TFT KLK SH ALAA VTLGM QALAG ARKFL LLDLL RRK KHLK ELE LE GLL LI VAVTE TVTVSL MLERL GS SAA AALEE RF LD SP KHL PL EF AALEA LVD EGI DIA VSLTD DLA LKN II LIFAD IAL SAG SVT IFA MR SW ALGKT GI K DA DGI VTS GGR KVL LLN ILG EALEP AG QALA VEAIL DDA ANA VLR NTG GGV GDL VIN KPL GQGD ELK TV TD GK IRKLE RQALA EDV ELG D LIF VLI DTG KDGRI VSLKE DLV VH VRE KAF AR Q SLA VSS ETV ARLKE QLNMYG VTLGN VTVS NAD KGI LAP LKKR AYA AI TT RPGSS DDGRI AIK LLAEQ GDTVK DT LK LGG LNA NKL TVL TL IK AGK GDILR RLLA GGG RLIFAD
Table 4. Results of independent tests with the variables derived from PDB40-b. The predictions are performed using the variables listed in Table 3. QD and QM are sensitivity and specificity as defined previously [33].

Dataset     Accuracy   All-α            All-β            α/β              α + β           Overall
PDB40-j     QD         (142/162) 87.7   (185/209) 88.5   (194/222) 87.4   (129/165) 78.2  (650/758) 85.8
            QM         (142/150) 94.7   (185/211) 87.7   (194/239) 81.2   (129/158) 81.6  (650/758) 85.8
PDB40-j1    QD         (61/74) 82.4     (77/92) 83.7     (108/121) 89.3   (60/85) 70.6    (306/372) 82.3
            QM         (61/66) 92.4     (77/91) 84.6     (108/139) 77.7   (60/76) 78.9    (306/372) 82.3

The influence of threshold(s) on prediction

Setting different thresholds in the algorithm usually leads to different predictive results. Since only the variables whose mean in at least one class is greater than the given threshold are considered in the stepwise discriminant analysis, thresholds that are too high exclude too many variables and lead to a poor final prediction. However, this does not mean that the lower the threshold is set, the more effective the prediction will be. For dataset PDB40-b, the average frequencies of the dipeptides in the four domain structural classes range from 0.0000 to 0.0103. The overall predictive accuracy in the jackknife test is 599/1054 = 56.83% if no threshold is set, and 597/1054 = 56.64% if 0.0015 is taken as the threshold, so the advantage of the lower threshold is not considerable. In the case of tripeptides there are 8000 different strings in total, which is a large number. Since 0.0015 is set as the threshold in the second iteration, the 'seeded dipeptides' contain only 89 elements; adding the 20 amino acids in front of and behind these 'seeded dipeptides' yields a total of 3560 tripeptides (repeated ones counted twice). Therefore, if the threshold is too small, there will be a large number of variables, possibly even more than the number of observations in the discriminant analysis; alternatively, the pooled covariance matrix of the variables will be singular. Neither of these two cases is acceptable, and therefore the threshold can be neither too high nor too low. Its setting depends on both the variable frequencies and our purpose, because lower thresholds bring both more variables and a greater burden in calculation. The thresholds set in the different iterations, the numbers of variables chosen and the overall predictive accuracies in the jackknife test are listed in Table 5.
Table 5. The influence of thresholds on prediction for dataset PDB40-b. Q_total^Jack is the overall rate of correct prediction [33] in the jackknife test and usually improves over the first several cycles of the method. If a lower threshold is set in each cycle, more variables (peptides) are chosen and the predictive accuracy is usually higher; if a higher threshold is set in each cycle, fewer variables are chosen and the predictive accuracy is lower, owing to the loss of some useful information.

             Lower thresholds                                 Higher thresholds
Iteration    Threshold   No. of variables   Q_total^Jack     Threshold   No. of variables   Q_total^Jack
2            0.001500     65                56.6             0.003000     50                55.6
3            0.000400    198                70.4             0.000550     99                60.1
4            0.000150    197                72.1             0.000120    170                65.5
5            0.000000    271                74.0             0.000050    185                68.6
6            0.000000    296                75.2             0.000001    170                66.1
7            0.000000    321                74.9             –            –                 –

The influence of sequence homology in datasets

The sequences in datasets PDB40-b and PDB40-j have low mutual similarity; the pairwise sequence identities within both datasets are less than 40%. The predictive accuracies obtained for these datasets with the AAC approach are much lower than those reported in the literature [2,19,20], because of the different datasets used. To clarify this matter and to further demonstrate the effectiveness of the proposed algorithm, the same procedure of lengthening 'seeded peptide(s)' and choosing variables was applied to the dataset used by Chou & Maggiora [20], which consists of 359 sequences of unknown sequence similarity. If no threshold is set in the second iteration, the predictive accuracy reaches 100% and 99.7% in the resubstitution and jackknife tests, respectively, with 216 variables, of which 212 are dipeptides and only four are single residues. The predictive accuracies with the AAC approach are only 94.4% and 84.7%, respectively, in these two tests. These results indicate that the sequence features in the dataset of Chou & Maggiora [20] are easily extracted by the presented algorithm. A similar improvement can also be obtained for the datasets used by Chou and other authors [19,25,34]. Because the AAC approach is essentially the same as the Bayes algorithm [22,23] and is considered the most powerful tool for predicting protein structural classes, the improvement presented in this paper is significant, and the method of selecting variables based on stepwise discriminant analysis and lengthening the 'seeded peptides' is an effective one.

CONCLUSION

Instead of the traditional approach based on amino acid composition alone, this paper presents an algorithm that selects several dipeptides, tripeptides and longer peptides and uses them in prediction. Applying the method to a nonredundant dataset, considerable improvements in the overall predictive accuracy (about 20%) are achieved compared with the traditional amino acid composition approach. For the dataset constructed by Brenner et al. the overall predictive accuracy reaches 75.2% in the jackknife test and 91.7% in the resubstitution test. For the dataset established by Chou and Maggiora [20], the predictive accuracy easily reaches 100% and 99.7% in the resubstitution and jackknife tests, respectively, by taking into account the composition of dipeptides in the second iteration of the algorithm. Furthermore, the variables derived from a larger nonredundant dataset contain more meaningful information for classification and have good extrapolating effectiveness, and can therefore be used in practical prediction. It can be inferred that, as the protein structural datasets are extended, more accurate, sufficient and significant information about structural classes will be extracted with the method provided in this paper.
Therefore, this new method provides an effective tool for extracting valuable information from protein sequences, which can be used for the systematic analysis of small or medium-sized protein sequences. It may also be useful for other assignment problems in proteomics and genome research.
ACKNOWLEDGEMENTS

Valuable discussions with Prof. F. S. Ma are greatly acknowledged. The current study was supported in part by grant 90103031 from the NSF, grant 90104015, an IBM SUR Grant, and the LiuHui Center for Applied Mathematics, Nankai University and Tianjin University.

REFERENCES

1. Anfinsen, C.B. (1973) Principles that govern the folding of protein chains. Science 181, 223–230.
2. Chou, K.C. & Zhang, C.T. (1995) Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30, 275–349.
3. Chou, P.Y. (1989) Prediction of protein structural classes from amino acid composition. In Prediction of Protein Structures and the Principles of Protein Conformation (Fasman, G.D., ed.), pp. 549–586. Plenum Press, New York.
4. Deleage, G. & Roux, B. (1987) An algorithm for protein secondary structure prediction based on class prediction. Protein Eng. 1, 289–294.
5. Deleage, G. & Dixon, J.S. (1989) Use of class prediction to improve protein secondary structure prediction. In Prediction of Protein Structures and the Principles of Protein Conformation (Fasman, G.D., ed.), pp. 587–597. Plenum Press, New York.
6. Kneller, D.G., Cohen, F.E. & Langridge, R. (1990) Improvements in secondary structure prediction by enhanced neural network. J. Mol. Biol. 214, 171–182.
7. Muggleton, S., King, R.D. & Sternberg, M.J. (1992) Protein secondary structure prediction using logic-based machine learning. Protein Eng. 5, 647–657. (Corrigendum appears in Protein Eng. 6, 549.)
8. Cohen, F.E. & Kuntz, I.D. (1987) Prediction of the three-dimensional structure of human growth hormone. Proteins: Struct. Funct. Genet. 2, 162–166.
9. Cohen, B., Presnell, S.R. & Cohen, F.E. (1993) Origins of structural diversity within sequentially identical hexapeptides. Protein Sci. 2, 2134–2145.
10. Carlacci, L., Chou, K.C. & Maggiora, G.M. (1991) A heuristic approach to predicting the tertiary structure of bovine somatotropin. Biochemistry 30, 4389–4398.
11. Levitt, M. & Chothia, C. (1976) Structural patterns in globular proteins. Nature 261, 552–557.
12. Nishikawa, K. & Ooi, T. (1982) Correlation of amino acid composition of a protein to its structural and biological characters. J. Biochem. 91, 1821–1824.
13. Nishikawa, K., Kubota, Y. & Ooi, T. (1983) Classification of proteins into groups based on amino acid composition and other characters. I. Angular distribution. J. Biochem. 94, 981–995.
14. Nishikawa, K., Kubota, Y. & Ooi, T. (1983) Classification of proteins into groups based on amino acid composition and other characters. II. Grouping into four types. J. Biochem. 94, 997–1007.
15. Nakashima, H., Nishikawa, K. & Ooi, T. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem. 99, 152–162.
16. Klein, P. (1986) Prediction of protein structural class by discriminant analysis. Biochim. Biophys. Acta 874, 205–215.
17. Klein, P. & Delisi, C. (1986) Prediction of protein structural classes from amino acid sequence. Biopolymers 25, 1659–1672.
18. Zhang, C.T. & Chou, K.C. (1992) An optimization approach to predicting protein structural class from amino acid composition. Protein Sci. 1, 401–408.
19. Chou, K.C., Liu, W.M., Maggiora, G.M. & Zhang, C.T. (1998) Prediction and classification of domain structural classes. Proteins: Struct. Funct. Genet. 31, 97–103.
20. Chou, K.C. & Maggiora, G.M. (1998) Domain structural class prediction. Protein Eng. 11, 523–538.
21. Boberg, J., Salakoski, T. & Vihinen, M. (1995) Accurate prediction of protein secondary structural class with fuzzy structural vectors. Protein Eng. 8, 505–512.
22. Wang, Z.X. & Yuan, Z. (2000) How good is prediction of protein structural class by the component-coupled method? Proteins: Struct. Funct. Genet. 38, 165–175.
23. Zhou, G.P. & Assa-Munt, N. (2001) Some insights into protein structural class prediction. Proteins: Struct. Funct. Genet. 44, 57–59.
24. Bu, W.S., Feng, Z.P., Zhang, Z.D. & Zhang, C.T. (1999) Prediction of protein (domain) structural classes based on amino-acid index. Eur. J. Biochem. 266, 1043–1049.
25. Kumarevel, T.S., Gromiha, M.M. & Ponnuswamy, M.N. (2000) Structural class prediction: an application of residue distribution along the sequence. Biophys. Chem. 88, 81–101.
26. Hughey, R. & Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. 12, 95–107.
27. Nakai, K. (2000) Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54, 277–344.
28. Brenner, S.E., Chothia, C. & Hubbard, T. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA 95, 6073–6078.
29. Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. & Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284, 1201–1210.
30. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
31. Xu, Y., Xu, D. & Gabow, H.N. (2000) Protein domain decomposition using a graph-theoretic approach. Bioinformatics 16, 1091–1104.
32. Srivastava, M.S. & Carter, E.M. (1983) An Introduction to Applied Multivariate Statistics. Elsevier Science Publishing Co., New York.
33. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A. & Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16, 412–424.
34. Chou, K.C. (1995) A novel approach to predicting protein structural classes in a (20–1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21, 319–344.