A knowledge-based potential function predicts the specificity and relative binding energy of RNA-binding proteins

Suxin Zheng1,*, Timothy A. Robertson2,* and Gabriele Varani1,2

1 Department of Chemistry, University of Washington, Seattle, WA, USA 2 Department of Biochemistry, University of Washington, Seattle, WA, USA

Keywords distance-dependent potential; protein–RNA interaction; RRM recognition; statistical potential

Correspondence G. Varani, Department of Chemistry and Department of Biochemistry, University of Washington, Seattle, WA 98195, USA Fax: +1 206 685 8665 Tel: +1 206 543 7113 E-mail: varani@chem.washington.edu

*These authors contributed equally to this work

RNA–protein interactions are fundamental to gene expression. Thus, the molecular basis for the sequence dependence of protein–RNA recognition has been extensively studied experimentally. However, there have been very few computational studies of this problem, and no sustained attempt has been made towards using computational methods to predict or alter the sequence-specificity of these proteins. In the present study, we provide a distance-dependent statistical potential function derived from our previous work on protein–DNA interactions. This potential function discriminates native structures from decoys, successfully predicts the native sequences recognized by sequence-specific RNA-binding proteins, and recapitulates experimentally determined relative changes in binding energy due to mutations of individual amino acids at protein–RNA interfaces. Thus, this work demonstrates that statistical models allow the quantitative analysis of protein–RNA recognition based on structure and can be applied to modeling protein–RNA interfaces for prediction and design purposes.

(Received 25 July 2007, revised 22 September 2007, accepted 19 October 2007)

doi:10.1111/j.1742-4658.2007.06155.x

Abbreviations KH, K homology; MD, molecular dynamics; PDB, Protein Data Bank; RRM, RNA recognition motif; SRP, signal recognition particle.

FEBS Journal 274 (2007) 6378–6391 © 2007 The Authors Journal compilation © 2007 FEBS

The sequence-specific recognition of RNA by proteins plays a fundamental role in gene expression by directing different cellular RNAs to specific processing pathways or subcellular locations. Many experimental studies have explored the molecular basis for the sequence dependence of protein–RNA recognition [1–4]; more recently, a few studies have explored this problem from a computational perspective as well [5–16]. However, these early studies have emphasized qualitative descriptions of the recognition process; relatively few attempts have been made to quantify the characteristics of protein–RNA interactions using computational approaches [17]. Here, we present a new approach for predicting the specificity of RNA-binding proteins and for evaluating the contribution of individual amino acids to the energetics of protein–RNA complexes.

Knowledge-based potential functions have been employed in protein structure prediction [18–27], as well as in the prediction of protein–protein [25,28–30] and protein–ligand interactions [30–33]. A few studies have explored the use of knowledge-based methods for the prediction of protein–DNA interactions from structure [30,34,35]. More recently, our group [36] and others [37] have independently demonstrated that knowledge-based potentials can provide quantitative descriptions of protein–DNA interfaces comparable to those provided using molecular mechanics force fields [37].

The relative scarcity of high-resolution structures of protein–RNA complexes has represented an understandable barrier to the quantitative application of computational approaches to the problem of protein–RNA recognition. However, we have previously demonstrated that a statistical hydrogen bonding potential can discriminate native structures of protein–RNA complexes from docking decoy sets [17]. As hydrogen bonds represent only approximately 25% of contacts between protein and RNA [12], we reasoned that a more comprehensive approach would describe these interactions more effectively.

In the present study, we report the application of an all-atom, distance-dependent statistical potential to the prediction of sequence-specific recognition between proteins and RNA. We demonstrate that this approach can discriminate native structures of complexes from even close docking decoys, recapitulate experimentally determined relative binding energies (ΔΔGs) for several protein–RNA complexes, and predict the RNA sequences recognized by a number of different RNA recognition motif (RRM) and K homology (KH) domains. These results demonstrate that statistical models can be applied to problems requiring the high-resolution modeling of protein–RNA interactions. The anticipated future enrichment of the structural database will further improve the predictive performance of the potential.

Results

The all-atom distance potential is constructed from the distribution of interatomic distances observed in the high-resolution (< 2.5 Å) structures of protein–RNA complexes deposited in the Protein Data Bank (PDB). In this approach, the 'correctness' of a protein–RNA structure is assumed to be approximated by the sum of the probabilities of observing the set of intermolecular distances defined in the 3D structure, relative to the likelihood of encountering such distances in the dataset of all protein–RNA structures. This kind of method was proposed by Sippl [20], and has been applied to protein structure prediction, protein–protein and protein–ligand interactions [18–33], as well as to protein–DNA recognition [30,34–37]. The distance-dependent statistical potential used here for protein–RNA interfaces is essentially identical to the score recently described by us for protein–DNA complexes [36]. The primary difference is the introduction of a new pseudocount correction, where an optimized number of pseudocounts are added to the observed counts for each atom pair (for additional details, see Experimental procedures). As a control, we also tested a simple contact-counting method, wherein every contact between protein and RNA (within a given distance cut-off) was assigned the same score of −1.

When we investigated protein–DNA complexes using the same approach, we demonstrated that the all-atom potential outperformed a reduced-atom description, where atoms were grouped according to their chemical similarity (as described in the Experimental procedures) [36]. Given the relative sparsity of the structural database, we investigated whether a reduced-atom representation would lead to improved performance in the protein–RNA case. The all-atom potential performs better than the reduced-atom potential (mean Z-score −5.45 versus −4.66; see also supplementary Table S4), although the difference is not as striking as for protein–DNA complexes. We believe this is due to less favorable statistics (fewer structures of protein–RNA complexes). We anticipate that the increasing availability of protein–RNA structures, together with the availability of data on specificity, will further improve the performance of the knowledge-based predictive method presented here. We retained the all-atom representation because it is already slightly better than the reduced-atom approach.

Docking decoy discrimination

An important property of any potential function is its ability to discriminate cognate (native, crystallographically determined) structures from noncognate (decoy) structures [38]. As a preliminary test of our method, and a direct comparison with previous work, we used our distance-dependent potential to evaluate five sets of docking decoys generated for the application of the rosetta physical potential function to protein–RNA interactions [17]. These decoys were created using a combination of rigid-body docking and protein side-chain repacking, and range in rmsd (relative to the native structure) from 0.2 Å to over 20 Å. Thus, they represent a solid basis for comparison to a much more complex scoring method (the multiterm, hybrid physical/statistical potential function used by rosetta).

When scored with the distance-dependent potential, the native complex can always be identified as the best structure in each of the five decoy sets (Fig. 1), even for decoys that are very close to the native structure. The native structure Z-scores for these decoy sets are shown in Table 1. These values indicate a strong discriminatory ability, comparable to that reported by Chen et al. [17] using their significantly more complex scoring method. Overall, the distance potential (using a 6 Å cut-off) results in a mean native Z-score of −5.45, versus the value of −6.37 obtained by Chen et al. [17] (Table 1); this difference is statistically insignificant (P = 0.53, Welch's two-sided t-test), suggesting that the two methods perform comparably on this test.

Fig. 1. Score–rmsd plots for the five docking decoy sets generated by Chen et al. [17]; the score generated by the distance-dependent potential (in arbitrary units) is plotted versus the deviation from the native structure (open circle at rmsd = 0). (A) Poly A-binding protein in complex with polyadenylate RNA (PDB code: 1CVJ). (B) Nova-2 KH RNA-binding domain 3 (PDB code: 1EC6). (C) HuD protein in complex with AU-rich RNA (PDB code: 1FXL). (D) Human SRP19 in complex with human SRP RNA (PDB code: 1JID). (E) Human U1A protein in complex with U1 snRNA hairpin (PDB code: 1URN). Close-up views of near-native decoys (0–3 Å rmsd) are shown in the insets.

The protein–RNA score has distinctive properties compared to the protein–DNA potential. When we scored the protein–RNA decoy set using the protein–DNA potential, the average Z-score was approximately half that obtained with the protein–RNA potential (−2.84 versus −5.45; see also supplementary Table S4). Thus, although the chemistry of RNA and DNA is very similar, the structure of RNA allows for different interactions between proteins and the two nucleic acids, and these differences are reflected in this result.
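The Z-score comparison used in this section can be illustrated with a minimal sketch. It assumes that the native Z-score is the native score minus the mean decoy score, divided by the sample standard deviation of the decoy scores; the exact definition used for Table 1 is not restated here, so this is an assumption. The decoy and native scores are hypothetical. The two five-value lists are the per-complex distance-dependent and ROSETTA + HB Z-scores from Table 1, used only to recompute the Welch statistic that underlies the reported P value.

```python
import math
import statistics


def native_z_score(native_score, decoy_scores):
    """Z-score of the native structure relative to the decoy score distribution.

    Assumes lower (more negative) scores are better and uses the sample SD.
    """
    mean = statistics.mean(decoy_scores)
    sd = statistics.stdev(decoy_scores)
    return (native_score - mean) / sd


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df


# Hypothetical decoy scores (arbitrary units) and a hypothetical native score.
decoys = [-40.0, -35.0, -30.0, -25.0, -20.0]
z_native = native_z_score(-60.0, decoys)

# Native Z-scores from Table 1, in the order 1CVJ, 1EC6, 1FXL, 1JID, 1URN.
distance_dependent = [-7.02, -6.46, -2.66, -6.29, -4.80]
rosetta_hb = [-5.11, -6.53, -2.70, -9.12, -8.39]
t_stat, dof = welch_t(distance_dependent, rosetta_hb)

print(f"native Z = {z_native:.2f}, Welch t = {t_stat:.2f}, df = {dof:.1f}")
```

With roughly seven degrees of freedom, a t statistic of this size corresponds to a two-sided P far above 0.05, which is consistent with the reported P = 0.53.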

To test whether the statistical potential is simply reflecting the size of an interface or the number of intermolecular contacts, we also used a very simple contact-counting potential to evaluate the same decoys; in this method, the fitness of an interface is evaluated by counting the number of close approaches between the protein and RNA. Reassuringly, this method was much less effective, providing an average Z-score of −2.64, less than half of the average native Z-score found using the distance potential (Table 1).

Interestingly, the magnitude of the observed Z-scores declines significantly as the contact cut-off is increased from 6 Å to 10 Å and then to 12 Å (see supplementary Table S5), suggesting that short-range contacts provide the bulk of the discriminatory power in this test. This result suggests that protein–RNA recognition specificity is primarily determined by short-range intermolecular contacts. Long-range effects (e.g. nonlocal electrostatics) appear to play a more limited role, at least in decoy discrimination.

Table 1. Native Z-scores and score–rmsd correlation coefficients for the protein–RNA docking decoy sets prepared by Chen et al. [17].

Z-scores
Complex     Distance-dependent (a)   Coulomb (b)     ROSETTA + HB (c)   Contact count (a)
1CVJ        −7.02                    −1.19           −5.11              −2.44
1EC6        −6.46                    −1.09           −6.53              −3.00
1FXL        −2.66                    −1.55           −2.70              −1.26
1JID        −6.29                    −1.36           −9.12              −3.09
1URN        −4.80                    −1.35           −8.39              −3.39
Mean ± SD   −5.45 ± 1.76             −1.31 ± 0.18    −6.37 ± 2.58       −2.64 ± 0.84

(a) Using a 6 Å contact cut-off. (b) From Chen et al. [17] and referring to a potential lacking the directional component of hydrogen bonding (HB) interactions. (c) From Chen et al. [17] and referring to the complete potential function.

To test the discrimination ability of the potential for near-native decoys, we next compared its ability to discriminate near-native protein–RNA structures with that of the force field implemented in the amber 8 molecular simulation package. We generated near-native protein–RNA decoys for 21 protein–RNA complexes by conducting molecular dynamics (MD) simulations of the native complexes, and by selecting multiple time-steps from the resulting trajectories for each structure. We then scored these structures using the distance-dependent potential function, and examined the correlations between distance scores and amber energies for each decoy set.

This is a difficult test of score performance because the structures are very close to native. Indeed, neither the distance-dependent score nor the amber potential appears to be able to discriminate native structures from these very near-native, MD-generated decoys (average Z-score of −0.69 versus −0.59; Table 2). Although there is no correlation of either score with rmsd, the distance-dependent statistical potential is somewhat correlated (average R2 = 0.41) with the energy values predicted by the amber force field. Thus, it remains very difficult for either approach to discriminate the native structure from structures that are close to it in energy.

Table 2. Z-scores and correlations for near-native decoys generated by MD simulation.

PDB code    Largest rmsd (Å)   Z-score, distance-dependent   Z-score, AMBER   Distance-dependent versus AMBER (R2)
1B7F        2.33               −3.51                         −3.01            0.37
1CVJ        2.63               −1.13                         −0.44            0.51
1DFU        2.45               −1.82                         −1.93            0.39
1E7K        2.58               −0.38                         0.83             0.34
1EC6        5.21               −3.02                         −2.29            0.71
1FJE        3.54               1.49                          2.51             0.42
1FXL        2.22               −1.32                         −1.34            0.47
1JBS        2.46               −0.31                         0.60             0.38
1JID        2.92               0.26                          1.05             0.54
1K1G        3.48               0.52                          1.95             0.52
1KNZ        2.44               −1.39                         −1.69            0.24
1M8W        2.47               −1.10                         −1.21            0.44
1R9F        2.55               −1.24                         −0.19            0.11
1RKJ        3.85               0.92                          3.00             0.19
1URN        3.06               −1.08                         −2.88            0.46
2AD9        3.94               −0.68                         −2.09            0.42
2ADB        2.66               0.37                          −2.15            0.16
2ADC        3.01               −0.21                         −0.65            0.63
2ASB        2.12               −0.94                         −2.07            0.48
2ATW        2.19               −1.38                         −2.94            0.33
2CJK        2.75               1.42                          2.47             0.42
Mean ± SD                      −0.69 ± 1.28                  −0.59 ± 1.94     0.41 ± 0.15

Identifying RNA-binding sequences from structure

Having established the performance of the statistical potential function in decoy discrimination, we investigated the ability of the potential to perform tasks relevant to its intended application. First, we sought to evaluate whether the potential could predict the cognate recognition sequences of RNA-binding proteins. This is a particularly important problem because sequence specificity is known for only a fraction of all RNA-binding proteins. The ability to predict (or at least narrow down) the cognate sequence for 'orphan' RNA-binding proteins would greatly facilitate the design of biological experiments aimed at dissecting the function of these proteins. It is also a problem that is not well suited for MD approaches because of the demanding computational requirements.

This application relies on a specific structural model of RNA recognition by RRM and KH domains involving four nucleotides, as detailed in the Experimental procedures. This model is strongly supported by previous research on the mechanism of RNA recognition for RRM proteins [6,39,40] and by the structure of existing KH domains bound to RNA [6,41–44]. As a consequence of the assumptions of the model, complexes containing two RNA-binding domains were divided into independent structures (e.g. 1CVJ_1 and 1CVJ_2 represent the first and second Poly A binding protein domain of structure 1CVJ, respectively), and the two domains were considered structurally and thermodynamically unrelated. Because the model assumes that each RRM and KH domain binds to each of four nucleotides independently, we generated a set of 4^4 (256) different structures for each protein–RNA complex by computationally 'threading' all possible four-nucleotide combinations onto the RNA bases nearest the center of the β-sheet structure of the RRM. We then scored these sequence-variant structures with the distance-dependent potential function.

Figure 2 shows the results of this analysis. If the potential and model of recognition were perfect, and if each structure was sequence-specific and corresponded to the most favorable sequence recognized by a given domain, the cognate sequences of the tested structures would be expected to rank as number 1. Because it is unlikely that the cognate recognition sequences for all domains will be consistently assigned the best score, we expressed sequence-discrimination performance in terms of percentiles (where perfect discrimination of the cognate recognition sequence would result in a percentile score of 100). Remarkably, we found that 18 of the 29 tested RRM and KH domain complexes had their cognate recognition sequence ranked above the 90th percentile (i.e. had better than ten-fold enrichment for the correct sequence). Furthermore, the distance-dependent potential ranks the cognate recognition sequences of the protein–RNA complexes in our test set above the 90th percentile, on average. By contrast, when we performed the same test using a simple counting potential as a control (Fig. 2), the average rank was the 41st percentile.

Among successful examples of binding-sequence discrimination, the native sequences of the RRM1 of Sex-lethal protein (1B7F_1) and KH1 domain of Poly C-binding protein-2 were both ranked first out of 256 sequences, whereas KH domain 3 of hnRNP K (1ZZI), RRM of U2B″ protein (1A9N) and RRM 4 of Polypyrimidine Tract Binding protein (2ADC_1) each had their cognate recognition sequences ranked in the top 3 (Supplemental Table S2). However, prediction was less successful for other RRM domains, such as the U1A complex (the cognate recognition sequence of U1A protein was ranked at 30). This result is nonetheless not too surprising, because U1A recognizes a noncanonical, seven-nucleotide sequence (AUUGCAC) and makes an unusually specific and strong interaction with RNA, unparalleled in other known RRMs [45]. Relatively poor results were also obtained for the Poly A binding protein (1CVJ_1, rank 19), and for RRM1 of the HuD protein (1FXL_1, rank 32). Both Pab and HuD utilize two domains to achieve sequence-specific recognition in a cooperative manner and do not discriminate well between sequences that are related to their cognate recognition motif (A-rich and AU-rich sequences, respectively) [46]. Notably, however, the nonsequence-specific RNA helicase protein (PDB code: 2DB3, included as a negative control) had an expectedly poor cognate sequence rank of 226/256.
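The enumeration and ranking steps of the sequence test can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_score` and `TOY_PREFERENCE` are hypothetical stand-ins for threading a sequence onto the complex and scoring it with the distance-dependent potential, and the percentile formula (rank 1 of 256 maps to 100) is one plausible convention that matches the description of a perfect score.

```python
from itertools import product

BASES = "ACGU"


def all_tetramers():
    """Enumerate the 4**4 = 256 possible four-nucleotide sequences."""
    return ["".join(p) for p in product(BASES, repeat=4)]


def rank_and_percentile(scores, cognate):
    """Rank the cognate sequence (1 = lowest, i.e. best, score) and convert to a percentile.

    Assumed convention: rank 1 maps to 100 and the last rank maps to 0.
    """
    ordered = sorted(scores, key=scores.get)
    rank = ordered.index(cognate) + 1
    n = len(ordered)
    percentile = 100.0 * (n - rank) / (n - 1)
    return rank, percentile


# Hypothetical per-position base preferences standing in for the real potential.
TOY_PREFERENCE = [{"U": -2.0}, {"G": -1.5}, {"U": -1.0}, {"A": -0.5}]


def toy_score(seq):
    """Hypothetical score: sum of independent per-position contributions."""
    return sum(pref.get(base, 0.0) for pref, base in zip(TOY_PREFERENCE, seq))


scores = {seq: toy_score(seq) for seq in all_tetramers()}
rank, pct = rank_and_percentile(scores, "UGUA")
print(len(scores), rank, pct)
```

Scoring each position independently mirrors the four-nucleotide model's assumption that each nucleotide is bound independently; a real scoring function would evaluate the threaded structure instead.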

Estimating experimentally determined relative RNA-binding afﬁnities

Fig. 2. Structure-based identification of RRM recognition sequences. The cognate sequence is ranked by the distance potential (cut-off = 6 Å) for RRM/KH domain proteins. The red line represents the rank of cognate recognition sequences using the contact-counting score; the blue line represents the rank of these sequences using the distance-dependent potential. The points in each colored line are sorted independently by rank; the x-axis is the sort order. The dashed line represents the 10th percentile.

A second very important property of any potential function is the ability to recapitulate the sequence dependence of experimental binding energies; this is a prerequisite if the potential is to be applied to problems of protein–RNA interface prediction or design. Fortunately, a few structures have a relatively dense set of experimentally determined binding constants for interface mutations. We used these experimentally characterized mutants to create a set of computationally 'mutated' structures of the complexes (Table 3),


Table 3. Correlations between the distance-dependent score and the experimental free energy of binding for several mutant protein–RNA complexes.

Distance-dependent (6 Å, 10 Å, 12 Å); Contact counting (6 Å, 10 Å, 12 Å)

Protein mutations

0.43

0.50

0.65 0.19

0.10

0.08

0.81

0.97 0.43

0.14

0.09

0.43

MS2 (no cytosine at position −5) MS2 mutations (with cytosine at −5) Fox-1 U1A (a) U1A (b)

0.40 0.27 0.04

0.45 0.48 0.14

0.47 0.47 0.42 0.65 0.29 −0.06 −0.03 0.39 0.29 −0.06 −0.03

RNA mutations

0.56

Fox-1 SRP; 2′-OH mutations SRP; base mutations

0.20 −0.39 −0.57 0.30 0.52 0.36 0.87 −0.07 −0.03 −0.07 0.01

0.33 0.30 0.07

0.35 0.29 0.05

(a) The native U1A complex was included in the training set for this experiment. (b) The U2B″ complex (U1A homolog) was included in the training set for this experiment.

and have scored these structures using the distance-dependent statistical potential.

A first very instructive example is provided by mutants of bacteriophage MS2 coat protein [47,48]. Starting with the crystal structure of the complex between MS2 coat protein and the cognate RNA hairpin (PDB code: 1ZDI), a series of structures were generated, representing the RNA and protein mutants for which binding constants are reported in the literature. The distance-dependent potential scores for these structures were then compared with the known binding constants for each mutation. When all of the MS2 mutations were considered together, a poor correlation was observed between distance score and experimental binding affinities (data not shown). However, excellent correlations were obtained between these values when the binding-affinity data were divided into two subsets (Table 3, Fig. 3). A first set corresponds to complexes where the bound RNA hairpin contained an adenine, guanine or uridine base at position −5; the second set instead contains protein mutants where the bound RNA contained a cytosine at this position. Within each set of mutants, the correlation between distance score and experimental binding affinity is strong (R2 = 0.65, Fig. 3A; R2 = 0.97, Fig. 3B), and statistically significant at the 95% confidence level. Figure 3C shows a likely explanation for this result: an intramolecular hydrogen bond formed by the cytosine at position −5 [47]. When this nucleotide is mutated to any other base, the intramolecular hydrogen bond is lost, leading to a reorganization of the RNA structure.

Fig. 3. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for mutants of the MS2 coat protein. (A) Complexes between protein mutants and RNA containing nucleotides other than cytosine at position −5. (B) Complexes between protein mutants and RNA containing cytosine at position −5. (C) The characteristic intramolecular hydrogen bond between the amino group of C5 and the O1P atom of U6 observed in the structure of the MS2–RNA complex containing a cytosine at position −5 that helps organize the RNA structure for protein binding [47].

This result does not provide direct information on the relative contribution of that hydrogen bond to the overall binding energy; it is simply implied that mutations must be segregated into two groups to obtain a clear correlation between experimental and predicted relative affinities. The most likely explanation for this result is that, at present, the statistical potential does not consider RNA intramolecular contacts; therefore, contributions to binding energy due to changes in RNA structure (i.e. those that occur when that hydrogen bond is lost) cannot be captured by our current approach.
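Why pooling the two MS2 subsets can hide a correlation that is clear within each subset can be illustrated with a small sketch. All numbers below are hypothetical: each subset is perfectly linear in score versus log Kd, but the second subset is shifted by a constant offset, standing in for an energetic contribution (such as a change in RNA structure) that the score does not see. The statistic shown is the squared Pearson correlation; whether this is exactly the R2 reported in Table 3 is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(pairs):
    """Squared Pearson correlation for a list of (score, log Kd) pairs."""
    scores, log_kd = zip(*pairs)
    return pearson_r(scores, log_kd) ** 2


# Hypothetical (score, log Kd) pairs; subset_b carries a constant offset in log Kd.
subset_a = [(-10.0, 1.0), (-8.0, 2.0), (-6.0, 3.0), (-4.0, 4.0)]
subset_b = [(-10.0, 4.0), (-8.0, 5.0), (-6.0, 6.0), (-4.0, 7.0)]

r2_a = r_squared(subset_a)
r2_b = r_squared(subset_b)
r2_pooled = r_squared(subset_a + subset_b)
print(round(r2_a, 2), round(r2_b, 2), round(r2_pooled, 2))
```

In this toy case the within-subset correlations are perfect while the pooled correlation is much weaker, which is the qualitative pattern described for the MS2 data.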

A second example that reinforces our interpretation of the results obtained with MS2 is provided by Fox-1 protein, which regulates alternative splicing of tissue-specific exons by binding to the GCAUG sequence [49]. The structure of the complex (PDB code: 2ERR) and the experimental binding constants for two sets of related mutations have been reported [49]: one set for mutations on the Fox-1 protein and a second set for mutations to its target RNA molecule. A moderately strong correlation was observed between the distance score and the protein mutation data (R2 = 0.46, Fig. 4), but an anticorrelation was observed for the set of RNA mutations (R2 = −0.57; Table 3). As in the previous case, this result reflects the failure of the current statistical potential to capture the energetic contribution associated with the disruption of RNA intramolecular interactions that are a characteristic of this complex [49].

Fig. 4. (A) Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for mutants of the Fox-1 protein. (B) The intramolecular hydrogen bond between uracil 1 and cytosine 3, and the non-Watson–Crick base pair between guanine 2 and adenine 4 for the RNA in complex with Fox-1 protein (PDB code: 2ERR). The protein is represented in yellow; the RNA structure is colored by atom type.

A third example is human U1A protein (PDB code: 1URN), a great model for the RRM superfamily because of the availability of NMR and crystallographic structures [50,51], as well as binding data. In this case, we observed poor correlations between the distance-dependent score and the experimentally determined dissociation constants (Kd) [52] when we conducted a test using a training set of strictly nonhomologous protein–RNA structures. Initially, we assumed that this observation would reflect the very large and energetically significant conformational changes that have been observed in the RNA and protein upon complex formation [53]. However, when the U1A complex itself was included in the training set, we obtained moderate to strong correlations (R2 values between 0.27 and 0.65, depending on the choice of distance cut-off). This suggests that U1A binds to RNA by forming intermolecular interactions that are not commonly observed in the database of training structures. This hypothesis is supported by the observation that the inclusion of a close U1A homolog (the U2B″–U2A′ complex) in the training set improves the results of this test as well (R2 increases from 0.04 to 0.39; Table 3). Thus, it appears that the structure of the U1A or of its homologous complex contains a set of protein–RNA atomic contacts (i.e. interatomic distances) that are not well represented in the 71 other protein–RNA complexes in our training set.

Figure 5 shows the final example, a universally conserved component of the core of the signal recognition particle (SRP). The structure of the complex (PDB code: 1HQ1) and the binding affinity of a series of RNA mutants have been determined [54]. The distance potential results in scores that correlate significantly (R2 = 0.52, P ≤ 0.05) with experimental binding affinities for mutations involving substitutions of deoxynucleotides for their corresponding ribonucleotides. However, as observed for Fox-1, no significant correlation was found for mutations of nucleotides that disrupt critical RNA intramolecular interactions. In this final case, these mutations involve the disruption of base pairs near the binding interface that define the secondary structure of the RNA, which is obviously important for recognition, but do not contribute directly to the formation of intermolecular contacts [54].

Fig. 5. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for ribose-to-deoxyribose mutants of a universally conserved protein component of the SRP.

Discussion

The central role of protein–RNA interactions in the regulation of gene expression has led to considerable interest in the biochemical processes underlying these interactions [55–57]. However, much of this research has been devoted to the study of the structure/function relationship for individual protein–RNA complexes, and little effort has been made to develop quantitative models that might describe these interactions more comprehensively. Thus, our understanding of the mechanisms driving protein–RNA recognition is still largely descriptive [11]. Recent work on protein–DNA interactions has shown that quantitative models of protein–nucleic acid recognition can provide insight into the mechanisms of gene regulation [58,59], and, in the not too distant future, promise to allow the rational design of DNA-binding proteins with altered specificity [60]. The development of computational tools capable of predicting the specificity of RNA-binding proteins across entire families (such as the RRM superfamily), or of redesigning the specificity of these proteins, would be of equal importance in dissecting post-transcriptional regulatory mechanisms, and in providing new tools to interrogate gene expression pathways.

In a previous study, our group demonstrated that a statistical potential function could be surprisingly accurate when used to predict protein–DNA interactions from structure [36]; this result was corroborated by a similar study published concurrently by another group [37]. Given these results, we hypothesized that the same approach would be equally successful with protein–RNA interfaces. Indeed, although various statistical techniques have been used by a number of groups for the prediction of protein structures, protein–DNA and protein–ligand interactions [18–35], such an approach has never been applied to protein–RNA interactions.

In the present study, we describe the successful application of the distance-dependent, all-atom statistical potential function to the prediction of the energetics and recognition specificity of protein–RNA interactions. We demonstrate that the statistical potential can recapitulate experimentally determined relative binding constants for a number of protein–RNA complexes (with the caveat that it cannot yet capture the effect of mutations on RNA–RNA interactions). We also demonstrate that this simple technique is remarkably successful at predicting the cognate recognition sequences of a wide variety of RNA-binding proteins.

The challenge of near native decoy discrimination

The statistical potential performs very well in classical decoy discrimination tests. It is quite remarkable that similar Z-scores in tests of decoy discrimination are obtained for the statistical score and the rosetta-derived score because this second method contains many more adjustable parameters that are optimized to reproduce the average composition of these interfaces as observed in nature. By contrast, the current statistical potential was generated 'as is' from the observed frequency of intermolecular contacts in the database of protein–RNA structures. Thus, it appears that the distance-dependent statistical potential implicitly captures at least some of the complexities of these intermolecular interactions that are explicitly enumerated in physical energy functions.
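How a potential can be generated directly from observed contact frequencies can be sketched with a generic inverse-Boltzmann construction. This is a minimal illustration, not the published parameterization: the binning, the pooled reference state and, in particular, the pseudocount scheme (pseudocounts added in proportion to the reference distribution) are assumptions, since the actual details are given in the Experimental procedures. The atom-type labels and training observations are hypothetical.

```python
import math
from collections import defaultdict


def build_potential(observations, n_bins=6, bin_width=1.0, pseudocount=1.0):
    """Derive per-bin scores for each (protein atom type, RNA atom type) pair.

    observations: iterable of (protein_atom_type, rna_atom_type, distance).
    """
    pair_counts = defaultdict(lambda: [0.0] * n_bins)
    pooled = [0.0] * n_bins
    for p_atom, r_atom, dist in observations:
        b = int(dist // bin_width)
        if 0 <= b < n_bins:
            pair_counts[(p_atom, r_atom)][b] += 1.0
            pooled[b] += 1.0
    grand_total = sum(pooled)
    potential = {}
    for pair, counts in pair_counts.items():
        pair_total = sum(counts)
        scores = []
        for b in range(n_bins):
            ref = pooled[b] / grand_total  # reference state: all atom pairs pooled
            if ref == 0.0:
                scores.append(0.0)  # no observations in this bin for any pair
                continue
            # Assumed pseudocount scheme: pseudocounts follow the reference distribution.
            p_obs = (counts[b] + pseudocount * ref) / (pair_total + pseudocount)
            scores.append(-math.log(p_obs / ref))
        potential[pair] = scores
    return potential


def score_interface(potential, contacts, bin_width=1.0, cutoff=6.0):
    """Sum pair scores over all intermolecular contacts inside the cut-off."""
    total = 0.0
    for p_atom, r_atom, dist in contacts:
        bins = potential.get((p_atom, r_atom))
        b = int(dist // bin_width)
        if bins is not None and dist < cutoff and b < len(bins):
            total += bins[b]
    return total


# Hypothetical training data: N...O pairs enriched near 3 A, C...C pairs near 5.5 A.
training = ([("N", "O", 2.9)] * 8 + [("N", "O", 5.5)] * 2 +
            [("C", "C", 2.9)] * 2 + [("C", "C", 5.5)] * 8)
potential = build_potential(training)
interface_score = score_interface(potential, [("N", "O", 2.8), ("C", "C", 5.2)])
print(interface_score)
```

Distances that are over-represented for a pair relative to the pooled reference receive negative (favorable) scores, and under-represented distances receive positive scores; no term is fitted to binding data, which is the sense in which such a potential is used 'as is'.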

The challenge of near native decoy discrimination

The question of how to generate and discriminate near-native decoys is still an open challenge for many areas of computational structural biology [61,62]. The docking decoy set used here contains many near-native decoys (e.g. < 1 Å rmsd) that can be discriminated by the distance-dependent potential (Fig. 1). However, when testing against the exceptionally near-native decoys generated by extracting snapshots from MD simulations (Table 2), we found that near-native decoys could not be reliably discriminated from native structures, not even by AMBER, which was used to conduct the MD simulations. Thus, the question of how to create a potential that is sensitive to the extremely subtle structural variations present in very near-native decoys remains a challenging and important area of research. We are hopeful that the incorporation of terms describing the higher-order geometric preferences of protein–RNA interfaces (e.g. the incorporation of a directional hydrogen-bonding potential) [17] may enhance the discriminatory power of our method, as will the inevitable increase in high-resolution structural data available for training. Nevertheless, the distance-dependent potential function already performs on par with the AMBER and Rosetta force fields in decoy discrimination tests.

The impact of contact distance cut-off on discriminatory power

The contact distance cut-offs used in the present study were varied to determine the value that maximizes decoy discrimination performance for protein–RNA complexes. Previously, Robertson et al. [36] showed that shorter contact cut-offs result in optimal discrimination ability in protein–DNA complexes, whereas Samudrala et al. [21] found that a longer cut-off (> 10 Å) was better able to discriminate correct structures during protein structure prediction experiments. Finally, Lu et al. [23] demonstrated that the first coordination shell (i.e. a cut-off between 3.5 Å and 6.5 Å) achieves the greatest selectivity for protein decoys created using gapless threading procedures; thus, the question remains as to the best choice of contact cut-off.

To evaluate the influence of different cut-off values in our study, replicate experiments were conducted using 6 Å, 10 Å and 12 Å distance cut-offs. In nearly all of our tests, the use of a shorter contact cut-off (6 Å) results in greater selectivity for structural details of the interface (Table 1). For the prediction of mutation energies, however, a longer cut-off appears to outperform shorter cut-off values for some sets of mutation data (Table 3). Some of these mutations are not near the protein–RNA interface (e.g. one of the U1A mutations, D79V, is 9 Å from the RNA molecule), and only the use of a longer cut-off value can capture these effects. In light of the differing conclusions of previous research [21,23,36], these results imply that a 'one size fits all' approach to energy function design may be limiting. In other words, it may be possible to significantly improve potential functions by customizing their parameterization to particular problems.

Prediction of RNA recognition sequences from protein–RNA complex structures

An obvious but yet to be attempted application of any potential function for protein–RNA interactions is the prediction of cognate binding sequences. In a test of sequence recognition for 29 unique KH and RRM domains, we found that the potential is able to identify (within the 10th percentile) the cognate RNA recognition motifs of these domains approximately 70% of the time. As not all RRM/KH domains (for example, U1A) obey the simple four-nucleotide recognition model that we have introduced (where each nucleotide makes independent interactions with the protein) [6], and the specificity of some proteins is limited (i.e. they bind nearly equally well to a set of related sequences), this is a remarkably strong result. Despite the simple form of the statistical potential, and the over-simplifications of the four-nucleotide recognition model, this method is surprisingly robust over the diverse set of RNA-binding domains that we have considered.

Prediction of relative protein–RNA binding energies

When we evaluated the relative free energy of a set of mutations for several protein–RNA complexes of known structure, the distance-dependent potential was successful within defined structural classes. We observed strong, statistically significant (P ≤ 0.05) score–energy correlations for several sets of mutations that we tested; however, to achieve these results, it was necessary to subdivide several of the mutation data sets. For example, for the MS2 complex, the mutation data had to be divided into two classes based on the presence or absence of a cytosine at position −5 in the RNA. A likely explanation for the importance of the −5 cytosine mutation is offered by the observation that the amino group of the cytosine at position −5 makes an intramolecular hydrogen bond that increases the propensity of the free RNA to adopt the structure seen in the complex [48] (Fig. 3C). Because the distance potential currently measures only intermolecular interactions, it is unable to capture the thermodynamic effect of interactions within the RNA or protein, and of mutation-induced changes in RNA (or protein) structure. The good correlations of distance potential with experimental binding energies (i.e. when sequence mutations are grouped according to the base identity at position −5) strongly suggest that the potential captures intermolecular interactions well.
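The grouping of mutation data into classes before computing score–energy correlations can be sketched as follows. This is a minimal illustration that assumes a Pearson correlation coefficient; the text does not state which correlation measure was used. The class labels, function names and numbers are hypothetical placeholders, not data or code from the present study.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_by_class(records):
    """Group (class_label, score, energy) records and correlate within each class."""
    groups = defaultdict(list)
    for label, score, energy in records:
        groups[label].append((score, energy))
    return {
        label: pearson_r([s for s, _ in pairs], [e for _, e in pairs])
        for label, pairs in groups.items()
    }


# Placeholder records for illustration only; not measurements from the study.
toy_records = [
    ("C at -5", 1.0, 0.5), ("C at -5", 2.0, 1.0), ("C at -5", 3.0, 1.5),
    ("no C at -5", 1.0, 2.0), ("no C at -5", 2.0, 1.0), ("no C at -5", 3.0, 0.0),
]
per_class = correlation_by_class(toy_records)
pooled = pearson_r([s for _, s, _ in toy_records], [e for _, _, e in toy_records])
print(per_class, pooled)
```

In this toy set the pooled correlation is weak even though each class is perfectly correlated on its own, which mirrors the reason given above for subdividing the MS2 data: an effect that the score does not model can mask a relationship that holds within each class.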

The same limitations observed in the MS2 mutation data led to the failures in prediction for RNA mutations in the Fox-1 and SRP complexes. In the structure of the Fox-1 complex, nucleotide U1 interacts with C3 by forming an intramolecular hydrogen bond, whereas G2 and A4 form a non-Watson–Crick base pair [49] (Fig. 4). Four out of seven Fox-1 RNA mutations that were tested directly affect these intramolecular interactions, which are not evaluated by the statistical potential used in the present study. In the case of the RNA mutations to the SRP complex, the mutated RNA residues are located in a double-stranded region of RNA, and do not interact with the protein [54], yet the disruption of the helix clearly affects the binding energy. The effect of these changes in RNA conformation cannot be captured by the intermolecular potential function used here.

Given these observations, it is reasonable to conclude that the omission of protein intramolecular contacts might also limit the predictive power of the method. However, additional examples will need to be examined before definite conclusions can be made concerning the applications of statistical potentials to prediction of relative binding energies.

The effect of training set composition on potential function performance

All knowledge-based potentials face the possibility of unintentional bias or over-training because their training depends upon the selection of a representative sample of structures. If great care is not exercised to ensure that this training set is unbiased (i.e. structurally heterogeneous), it is possible to create a statistical potential that unfairly scores certain structures more favorably than others simply because they are over-represented in the training set.

The challenge of over-fitting is particularly acute for protein–RNA interactions because there are relatively few high-resolution structures of protein–RNA complexes. Because of this limitation, a combined training/test set was used in the present study. To avoid bias, a 'leave one out' cross-validation strategy was employed: the tested structure was always excluded from the training set. Thus, every test in the present study was conducted with a different score, trained using only those structures that were not homologous to the tested protein–RNA complex.

This strategy cannot be avoided at the present time, yet it leads to situations where the training data does not contain enough information to capture particular structural phenomena. For example, we observed virtually no correlation between the distance-dependent score and the experimental binding affinity for mutations of U1A protein until the U1A complex structure was added to the training set (Table 3). Addition of the homologous U2B″ complex structure (PDB code: 1A9N) to the training set improved these results considerably, indicating that the training set was missing critical structural information that would help to discriminate native-like contacts unique to the U1A complex (an unusually high-affinity RRM, with a long, seven-nucleotide recognition sequence) [52]. We anticipate that the performance of the method will improve with the size of the structural database, as more high-resolution protein–RNA structures become available.

Conclusions

We have introduced a statistical potential function that discriminates the structures of native protein–RNA complexes from decoys, reproduces experimentally determined relative binding affinities for a number of RNA-binding proteins, and predicts cognate binding sequences for a large set of protein–RNA complexes. The statistical potential performs as well as highly optimized physical potential functions in tests of docking decoy discrimination. We anticipate that the performance of the potential will only increase with the size of the structural database and as terms are added to the model to account for protein and RNA intramolecular interactions that are currently ignored. Nevertheless, even in its current implementation, this statistical model achieves a high degree of sensitivity to subtle changes in protein–RNA interface structure. We are optimistic that this knowledge-based potential function will find broad application to problems requiring the high-resolution modeling of protein–RNA interfaces, such as structure-based genome annotation, or the rational design of novel RNA-binding proteins.

Experimental procedures

All-atom distance potential

The potential function used here is identical to a previously described method [36] (a more complete description of the method is provided in supplementary Doc. S1), with the exception of a modified low-count correction. In the present study, the correction described by Sippl [20] is replaced with a weighted pseudocount method, where a constant number of pseudocounts ($P$) are added to the observed counts for each atom pair. A pseudocount correction value of 75 ensured the greatest performance of the potential, as indicated by the average Z-scores [63] (see supplementary Table S3). These pseudocounts are allocated over distance bins in proportion to the background frequency $f(d_{ij})$ values, as calculated using Eqn (4) from a previous study [36], leading to an updated expression for $f(d_{ij}, t_i, t_j)$:

$$f(d_{ij}, t_i, t_j)_{\mathrm{adj}} = \frac{N_{\mathrm{obs}}(d_{ij}, t_i, t_j) + P \cdot f(d_{ij})}{\sum_{d_{ij}} N_{\mathrm{obs}}(d_{ij}, t_i, t_j) + P}$$

Here, $d_{ij}$ is the cartesian distance observed between two atoms ($i$, $j$), of types $t_i$ and $t_j$, and $N_{\mathrm{obs}}(d_{ij}, t_i, t_j)$ represents the number of atoms of types $t_i$ and $t_j$ observed in the structure training set, separated by a distance of at least $d_{ij}$.

As a control, we also tested a simple, contact-counting method, wherein every contact between protein and RNA (within a given distance cut-off) was assigned the same score of −1.

Atom type selection

Atom score types were assigned using the method of Robertson and Varani [36]. Briefly, the all-atom potential treats every atom, in every residue, as a unique type (e.g. alanine Cβ and arginine Cβ are considered as unique atom types under this scheme), resulting in a total of 158 protein, and 81 RNA atom types. Using a 10 Å cut-off, there are a total of 1 639 295 counts; with this representation, they are distributed over 158 × 81 × 8 bins, for an average of nearly 16 counts in each bin. When using a reduced atom representation, chemically similar atoms were grouped together based on the CHARMM atom definition, as previously described [30,36].

Selection of protein–RNA training set

The training set contains crystal structures of protein–RNA complexes downloaded from the PDB [64] with resolution better than 2.5 Å. Structures with more than 20% sequence identity were identified using the ExPASy sequence-redundancy tool [65]; the higher-resolution structure of every homologous pair was retained. After filtering, the training set contained 72 protein–RNA complexes (the 50S ribosome structure comprises 28 individual peptide chains in complex with RNA, plus 44 independent protein–RNA crystal structures).

Because of the limited number of protein–RNA structures, it was necessary to use a combined training/test set. Thus, to assess the performance of the potential without biasing the result, the native structure of scored complexes was excluded from the training set for the score ('leave one out' cross validation).

Construction of test sets

Five high quality docking decoy sets [17] were used for initial decoy-discrimination tests. These were constructed using the docking module of Rosetta, which incorporates energy minimization through the use of a protein side-chain repacking algorithm [66,67]. Each of these decoy sets contains 2000 structures with deviations as low as 0.2 Å rmsd from the native complex structure.

Additionally, large numbers of near-native decoys were generated for 21 different protein–RNA complexes by extracting time-step structures from MD trajectories, created using AMBER 8 in a deformation-like process with the ff99 force field [68]. These MD-generated decoys are especially near-native structures; the maximum decoy rmsd for 21 sets is below 4 Å, and only seven decoy sets have a maximum rmsd greater than 3 Å.

To generate these decoy sets, the initial structure of each native complex was first minimized in 500 steps (250 steps of steepest-descent and 250 steps of conjugate gradient minimization), then heated from 0 to 400 K in 20 ps using a Langevin dynamics algorithm [69,70]. Snapshots were taken every 0.05 ps, and a total of 400 structures were extracted from each MD simulation. The binding free energy was calculated using the mm_gbsa module of AMBER 8 as:

$$\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - (G_{\mathrm{protein}} + G_{\mathrm{RNA}})$$

where $G_{\mathrm{complex}}$, $G_{\mathrm{protein}}$ and $G_{\mathrm{RNA}}$ represent the mm_gbsa-calculated free energies of the protein–RNA complex, the free protein and the free RNA, respectively.

Prediction of sequence-specificity

As many RNA-binding domains of the RRM superfamily interact in a conserved fashion with four nucleotides across the surface of the β-sheet [6,39,40], and recognition by KH domains appears to be conserved between different domains as well [6,41–44], we adopted a four-nucleotide model for our sequence-specificity test. Starting with each complex in the training set containing one or more RRM or KH domains, we extracted the protein coordinates and the four nucleotides bound at the center of the domain (for structures containing more than one RNA-binding domain, the structure was divided into two independent domains). This approach was chosen even though we are well aware that this simple model of RRM recognition would fail for domains that bind anomalously (e.g. U1A protein), or in situations where two domains cooperatively define specificity.

For all chosen domains, every nucleotide was replaced with A, U, C and G, systematically, in all possible combinations, using Insight II 2000 (Accelrys Software Inc., San Diego, CA, USA). Thus, 256 different structures were generated for each binding domain, and minimized using

AMBER 8 [68] in 20 steps to regularize the local structure. Some RRM and KH domains in complex with single strand DNA (PDB codes: 2UP1, 1WTB, 1X0F, 1ZZI and 1ZZJ) were also included in the test set because recognition of single stranded RNA and DNA are mechanistically similar. Protein mutations were modeled using MOE (Chemical Computing Group, Montreal, Canada), followed by energy minimization with AMBER; the conformation of the mutated residue with side chain conformation most similar to the native residue was retained.

Acknowledgements

We wish to thank Dr Yu Chen for providing the protein–RNA decoy sets and Mr Daniel Bjerre for many valuable discussions. The study was supported by grants from NIH.

References

1 Amosova O, Broitman SL & Fresco JR (2003) Alanine-scanning mutagenesis of the predicted rRNA-binding domain of ErmC′ redefines the substrate-binding site and suggests a model for protein–RNA interactions. Nucleic Acids Res 31, 4941–4949.

2 Law MJ, Rice AJ, Lin P & Laird-Offringa IA (2006) The role of RNA structure in the interaction of U1A protein with U1 hairpin II RNA. RNA 12, 1168–1178.

3 Xia T, Wan C, Roberts RW & Zewail AH (2005) RNA–protein recognition: single-residue ultrafast dynamical control of structural specificity and function. PNAS 102, 13013–13018.

4 White SA, Hoeger M, Schweppe JJ, Shillingford A, Shipilov V & Zarutskie J (2004) Internal loop mutations in the ribosomal protein L30 binding site of the yeast L30 RNA transcript. RNA 10, 369–377.

5 Allers J & Shamoo Y (2001) Structure-based analysis of protein–RNA interactions using the program ENTANGLE. J Mol Biol 311, 75–86.

6 Auweter SD, Oberstrass FC & Allain FHT (2006) Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res 17, 4943–4959.

7 Chen Y & Varani G (2005) Protein families and RNA recognition. FEBS J 272, 2088–2097.

8 Draper DE (1995) Protein–RNA recognition. Annu Rev Biochem 64, 593–620.

9 Guzman RND, Turner RB & Summers MF (1998) Protein–RNA recognition. Biopolymers (Nucleic Acid Sci) 48, 181–195.

10 Jones S, Daley DTA, Luscombe NM, Berman HM & Thornton JM (2001) Protein–RNA interactions: a structural analysis. Nucleic Acids Res 29, 943–954.

11 Messias AC & Sattler M (2004) Structural basis of single-stranded RNA recognition. Acc Chem Res 37, 279–287.

12 Treger M & Westhof E (2001) Statistical analysis of atomic contacts at RNA–protein interfaces. J Mol Recognit 14, 199–214.

13 Nadassy K, Wodak SJ & Janin J (1999) Structural features of protein–nucleic acid recognition sites. Biochemistry 38, 1999–2017.

14 Perez-Canadillas J-M & Varani G (2001) Recent advances in RNA–protein recognition. Curr Opin Struct Biol 11, 53–58.

15 Stefl R, Skrisovska L & Allain FH-T (2005) RNA sequence- and shape-dependent recognition by proteins in the ribonucleoprotein particle. EMBO Rep 6, 33–38.

16 Frankel AD (2000) Fitting peptides into the RNA world. Curr Opin Struct Biol 10, 332–340.

17 Chen Y, Kortemme T, Robertson T, Baker D & Varani G (2004) A new hydrogen-bonding potential for the design of protein–RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Res 32, 5147–5162.

18 Sippl M, Ortner M, Jaritz M, Lackner P & Flöckner H (1996) Helmholtz free energies of atom pair interactions in proteins. Fold Des 1, 289–298.

19 Sippl M (1993) Boltzmann's principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J Comput Aided Mol Des 7, 473–501.

20 Sippl MJ (1990) Calculation of conformational ensembles from potentials of mean force: an approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213, 859–883.

21 Samudrala R & Moult J (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275, 895–916.

22 Skolnick J, Kolinski A & Ortiz A (2000) Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins: Struct Funct Genet 38, 3–16.

23 Lu H & Skolnick J (2001) A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins: Struct Funct Genet 44, 223–232.

24 Zhou H & Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 11, 2714–2726.

25 Zhang C, Liu S, Zhou H & Zhou Y (2004) An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 13, 400–411.

26 Skolnick J (2006) In quest of an empirical potential for protein structure prediction. Curr Opin Struct Biol 16, 166–171.

27 Weichenberger CX & Sippl MJ (2006) Self-consistent assignment of asparagine and glutamine amide rotamers in protein crystal structures. Structure 14, 967–972.

28 Jiang L, Gao Y, Mao F, Liu Z & Lai L (2002) Potential of mean force for protein–protein interaction studies. Proteins: Struct Funct Genet 46, 190–196.

29 Lu H, Lu L & Skolnick J (2003) Development of unified statistical potentials describing protein–protein interactions. Biophys J 84, 1895–1901.

30 Zhang C, Liu S, Zhu Q & Zhou Y (2005) A knowledge-based energy function for protein–ligand, protein–protein, and protein–DNA complexes. J Med Chem 48, 2325–2335.

31 Ishchenko AV & Shakhnovich EI (2002) SMall Molecule Growth 2001 (SMoG2001): an improved knowledge-based scoring function for protein–ligand interactions. J Med Chem 45, 2770–2780.

32 Velec HFG, Gohlke H & Klebe G (2005) DrugScoreCSD-knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction. J Med Chem 48, 6296–6303.

33 DeWitte RS & Shakhnovich EI (1996) SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence. J Am Chem Soc 118, 11733–11744.

34 Liu Z, Mao F, Guo J-T, Yan B, Wang P, Qu Y & Xu Y (2005) Quantitative evaluation of protein–DNA interactions using an optimized knowledge-based potential. Nucleic Acids Res 33, 546–558.

35 Kono H & Sarai A (1999) Structure-based prediction of DNA target sites by regulatory proteins. Proteins: Struct Funct Genet 35, 114–131.

36 Robertson TA & Varani G (2007) An all-atom, distance-dependent scoring function for the prediction of protein–DNA interactions from structure. Proteins: Struct Funct Bioinform 66, 359–374.

37 Donald JE, Chen WW & Shakhnovich EI (2007) Energetics of protein–DNA interactions. Nucleic Acids Res 35, 1039–1047.

38 Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K, Casari G & Sippl MJ (1990) Identification of native protein folds amongst a large number of incorrect models: the calculation of low energy conformations from potentials of mean force. J Mol Biol 216, 167–180.

39 Wang X & Tanaka Hall TM (2001) Structural basis for recognition of AU-rich element RNA by the HuD protein. Nat Struct Mol Biol 8, 141–145.

40 Maris C, Dominguez C & Allain FHT (2005) The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression. FEBS J 272, 2118–2131.

41 Beuth B, Pennell S, Arnvig KB, Martin SR & Taylor IA (2005) Structure of a Mycobacterium tuberculosis NusA–RNA complex. EMBO J 24, 3576–3587.

42 Lewis HA, Musunuru K, Jensen KB, Edo C, Chen H, Darnell RB & Burley SK (2000) Sequence-specific RNA binding by a Nova KH domain: implications for paraneoplastic disease and the fragile X syndrome. Cell 100, 323–332.

43 Siomi H, Matunis MJ, Michael WM & Dreyfuss G (1993) The pre-mRNA binding K protein contains a novel evolutionary conserved motif. Nucleic Acids Res 21, 1193–1198.

44 Grishin NV (2001) KH domain: one motif, two folds. Nucleic Acids Res 29, 638–643.

45 Tsai DE, Harper DS & Keene JD (1991) U1-snRNP-A protein selects a ten nucleotide consensus sequence from a degenerate RNA pool presented in various structural contexts. Nucleic Acids Res 19, 4931–4936.

46 Lunde BM, Moore C & Varani G (2007) RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol 8, 479–490.

47 Valegard K, Murray JB, Stonehouse NJ, van den Worm S, Stockley PG & Liljas L (1997) The three-dimensional structures of two complexes between recombinant MS2 capsids and RNA operator fragments reveal sequence-specific protein–RNA interactions. J Mol Biol 270, 724–738.

48 Johansson HE, Dertinger D, LeCuyer KA, Behlen LS, Greef CH & Uhlenbeck OC (1998) A thermodynamic analysis of the sequence-specific binding of RNA by bacteriophage MS2 coat protein. PNAS 95, 9244–9249.

49 Auweter SD, Fasan R, Reymond L, Underwood JG, Black DL, Pitsch S & Allain FH-T (2006) Molecular basis of RNA recognition by the human alternative splicing factor Fox-1. EMBO J 25, 163–173.

50 Oubridge C, Ito N, Evans PR, Teo CH & Nagai K (1994) Crystal structure at 1.92 Å resolution of the RNA-binding domain of the U1A spliceosomal protein complexed with an RNA hairpin. Nature 372, 432–438.

51 Allain FHT, Gubser CC, Howe PWA, Nagai K, Neuhaus D & Varani G (1996) Specificity of ribonucleoprotein interaction determined by RNA folding during complex formation. Nature 380, 646–650.

52 Timm H, Jessen Oubridge C, Teo CH, Pritchard C & Nagai K (1991) Identification of molecular contacts between the U1A small nuclear ribonucleoprotein and U1 RNA. EMBO J 10, 3447–3456.

53 Gubser CC & Varani G (1996) Structure of the polyadenylation regulatory element of the human U1A pre-mRNA-3′-untranslated region and interaction with the U1A protein. Biochemistry 35, 2253–2267.

54 Batey RT, Sagar MB & Doudna JA (2001) Structural and energetic analysis of RNA recognition by a universally conserved protein from the signal recognition particle. J Mol Biol 307, 229–246.

55 Siomi H & Dreyfuss G (1997) RNA-binding proteins as regulators of gene expression. Curr Opin Genet Dev 7, 345–353.

56 Onesto C, Berra E, Grepin R & Pages G (2004) Poly(A)-binding protein-interacting protein 2, a strong regulator of vascular endothelial growth factor mRNA. J Biol Chem 279, 34217–34226.

57 Kinnaird JH, Maitland K, Walker GA, Wheatley I, Thompson FJ & Devaney E (2004) HRP-2, a heterogeneous nuclear ribonucleoprotein, is essential for embryogenesis and oogenesis in Caenorhabditis elegans. Exp Cell Res 298, 418–430.

58 Havranek JJ, Duarte CM & Baker D (2004) A simple physical model for the prediction and design of protein–DNA interactions. J Mol Biol 344, 59–70.

59 Morii T, Sato S, Hagihara M, Mori Y, Imoto K & Makino K (2002) Structure-based design of a leucine zipper protein with new DNA contacting region. Biochemistry 41, 2177–2183.

60 Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ, Stoddard BL & Baker D (2006) Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441, 656–659.

61 Gray JJ (2006) High-resolution protein–protein docking. Curr Opin Struct Biol 16, 183–193.

62 Wang K, Fain B, Levitt M & Samudrala R (2004) Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct Biol 4, 8.

63 Sippl MJ (1993) Recognition of errors in three-dimensional structures of proteins. Proteins: Struct Funct Genet 17, 355–362.

64 Berman H, Henrick K & Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Mol Biol 10, 980–980.

65 Notredame C. ExPASy sequence-redundancy tool. Available at http://ca.expasy.org/cgi-bin/reduce_redundancy.cgi.

66 Gray JJ, Moughon SE, Kortemme T, Schueler-Furman O, Misura KMS, Morozov AV & Baker D (2003) Protein–protein docking predictions for the CAPRI experiment. Proteins: Struct Funct Genet 52, 118–122.

67 Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA & Baker D (2003) Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331, 281–299.

68 Case DA, Darden TA, Cheatham TE III, Simmerling CL, Wang J, Duke RE, Luo R, Merz KM, Wang B, Pearlman DA et al. (2004) Amber8. University of California, San Francisco, CA.

69 Izaguirre JA, Catarello DP, Wozniak JM & Skeel RD (2001) Langevin stabilization of molecular dynamics. J Chem Phys 114, 2090–2098.

70 Pastor R, Brooks B & Szabo A (1988) An analysis of the accuracy of Langevin and molecular dynamics algorithms. Mol Physics 65, 1409–1419.

Supplementary material

The following supplementary material is available online:

Doc. S1. Detailed description of the construction and derivation of the distance-dependent potential.

Table S1. The complex structure (PDB code) in protein–RNA training set.

Table S2. The cognate sequence rank by distance potential and contact score (cut-off = 6 Å) in RRM/KH domain sequence decoy (256) sets.

Table S3. Optimization of the pseudocounts in decoy sets discrimination.

Table S4. Comparison of the Z-scores obtained on the same set of protein–RNA decoys for the RNA, DNA and reduced atom distance potentials in decoy discrimination.

Table S5. Comparison of the performance of the potential with different upper cut-off values in decoy discrimination.

This material is available as part of the online article from http://www.blackwell-synergy.com

Please note: Blackwell Publishing is not responsible for the content or functionality of any supplementary materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.
