Modules, multidomain proteins and organismic complexity

Hedvig Tordai, Alinda Nagy, Krisztina Farkas, László Bányai and László Patthy

Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary

Keywords domain; exon-shuffling; module; multidomain protein; organismic complexity

Correspondence L. Patthy, Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, POBox 7, H-1518, Hungary Fax: +361 4665465 Tel: +361 2093537 E-mail: patthy@enzim.hu

(Received 9 May 2005, revised 9 August 2005, accepted 12 August 2005)

Originally the term ‘protein module’ was coined to distinguish mobile domains that frequently occur as building blocks of diverse multidomain proteins from ‘static’ domains that usually exist only as stand-alone units of single-domain proteins. Despite the widespread use of the term ‘mobile domain’, the distinction between static and mobile domains is rather vague as it is not easy to quantify the mobility of domains. In the present work we show that the most appropriate measure of the mobility of domains is the number of types of local environments in which a given domain is present. Ranking of domains with respect to this parameter in different evolutionary lineages highlighted marked differences in the propensity of domains to form multidomain proteins. Our analyses have also shown that there is a correlation between domain size and domain mobility: smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. It is also shown that shuffling of a limited set of modules was facilitated by intronic recombination in the metazoan lineage and this has contributed significantly to the emergence of novel complex multidomain proteins, novel functions and increased organismic complexity of metazoa.

doi:10.1111/j.1742-4658.2005.04917.x

FEBS Journal 272 (2005) 5064–5078 © 2005 FEBS

Abbreviations TSP1, thrombospondin type I.

The average size of a protein domain of known crystal structure is about 175 residues; proteins that are larger than 200–300 residues usually consist of multiple protein folds [1]. The individual structural domains of such multidomain proteins are defined as compact folds that are relatively independent inasmuch as the interactions within one domain are more significant than with other domains. The individual domains of multidomain proteins usually fold independently of the other domains.

Some multidomain proteins contain multiple copies of a single type of structural domain, indicating that internal duplication of a gene segment encoding a domain has given rise to such proteins. Many multidomain proteins contain different types of domains (i.e. domains that are not homologous to each other). The genes of such multidomain proteins were created by joining two or more gene segments that encode different protein domains. Such multidomain proteins, consisting of multiple domains of independent evolutionary origin, are frequently referred to as mosaic proteins.

Multidomain proteins have some unique features that endow them with major evolutionary significance. In multidomain proteins a large number of functions (different binding activities, catalytic activities) may coexist, making such proteins indispensable constituents of regulatory or structural networks where multiple interactions (protein–protein, protein–ligand, protein–DNA, etc., interactions) are essential. For example, the domains that constitute multidomain proteins of the intracellular and extracellular signaling pathways mediate multiple interactions with other components of the signaling pathways. Similarly, the coexistence of different domains with different binding specificities is also essential for the biological function of multidomain proteins of the extracellular matrix: the multiple, specific interactions among matrix constituents are indispensable for the proper architecture of the extracellular matrix. As a corollary of their involvement in multiple interactions, formation of novel multidomain proteins is likely to contribute significantly to the evolution of increased organismic complexity since the latter reflects the complexity of interactions among genes, proteins, cells, tissues and organs [2].

Despite such valuable properties of large, complex multidomain proteins the vast majority of proteins contain only one domain [3–5]. Furthermore, recent studies have revealed that the majority of multidomain proteins tend to have very few domains. Wolf et al. [4] have counted the number of different folds in each protein of proteomes of archaea, bacteria and eukarya and the average fraction of the proteins with each given number of domains was calculated. It has been concluded from these analyses that the distribution of single-, two-, three-domain, etc., proteins in archaea, bacteria and eukarya is such that each next class (e.g. two-domain proteins vs. single-domain proteins, three-domain proteins vs. two-domain proteins, etc.) contains significantly fewer entries than the previous one. More recent mathematical analyses of the distribution of multidomain proteins according to the number of different constituent domains have revealed that their distribution follows a power law, i.e. single-domain proteins are the most abundant, whereas proteins containing larger numbers of domain-types are increasingly less frequent. This type of distribution is consistent with a random recombination (joining and breaking) model of evolution of multidomain architectures [6].

The observation of Wolf et al. [4] that the size distribution of multidomain proteins was very similar in eukaryotes and prokaryotes apparently contradicted the notion that evolution of complex eukaryotes favored (and benefited from) the formation of more and larger multidomain proteins as they contributed to their increased organismic complexity.

Recent analyses, however, provided evidence that there may be a connection between the propensity of protein domains to form multidomain architectures and organismic complexity. For example, Koonin et al. [6] have shown that, although in all proteomes the domain distribution is compatible with a random recombination model of the evolution of multidomain architectures, the likelihood of domain joining appears to increase in the order Archaea < Bacteria < Eukaryotes, and there is a significant excess of larger multidomain proteins in Eukaryotes. Similarly, Wuchty [5] has shown that higher organisms tend to have more complex multidomain proteins. Using graph theory-based tools to survey and compare protein domain organizations of different organisms Ye and Godzik [7] have shown that the number of domains, the number of domain combinations, and the size of the largest connected component of domain-combination networks (measured by the number of domains it consists of) of each organism increase with the complexity of the organisms.

The propensity of different domain types to form multidomain proteins shows great variation, ranging from ‘static’ domains that rarely or never occur in multidomain proteins, to ‘mobile’ domains (usually referred to as modules) that frequently participate in gene-rearrangements to build multidomain proteins. Various analyses of the number of multidomain architectures in which different domain-types are involved have shown that their distribution also follows a power law: a minority of domain-types (the ‘mobile’ modules) occur in numerous multidomain proteins, whereas the majority of domains belong to categories that are rarely used in multidomain proteins [5,6,8]. Such a power law distribution indicates that the chance of a domain to be used in the construction of novel multidomain proteins is proportional to the number of times it has already been used.

As for any other type of genetic change, the frequency of joining a given domain-type to other domains to create novel multidomain architectures reflects the probability of such a genetic change and the probability of its fixation. In other words, the propensity of a domain to form multidomain proteins is a function of the frequency of genetic events that can lead to such gene-fusions and the selective value of the resulting chimeric proteins. Accordingly, it is likely that the most mobile modules have acquired this status as a result of a combination of special structural, functional and genomic features [9].

First, certain structural features of domains may facilitate their preferential proliferation in multidomain proteins. For example, the stability and folding autonomy of domains in multidomain proteins may be of utmost importance for their mobility as this minimizes the influence of neighboring domains [9]. Folding autonomy can ensure that folding of the domain is not deranged when inserted into a novel protein environment. It seems thus very likely that the most widely used domains have been selected according to the rate, robustness and autonomy of folding [10]. It is noteworthy in this respect that multidomain proteins are under-represented in Archaea compared with the other two kingdoms of life and this fact is thought to be related to the lower stability of multidomain proteins in the hyperthermophilic environments where most archaeal species live [6].


H. Tordai et al. Mobile domains

Second, functional aspects may also contribute to the proliferation of certain domains. For example, in complex cellular signaling pathways there is a greater demand for domains that mediate interaction with other constituents of the pathways (e.g. protein kinase domain) thus selection may have favored the spread of these modules to other multidomain proteins. Finally, special genomic features of certain genes (gene-segments) may have significantly facilitated their combination with other domains.

To gain further insight into the factors that influence the mobility of domains and control the creation of multidomain proteins, in the present work we have compared the propensity of different domains to form multidomain proteins in several major groups of organisms (Bacteria, Archaea, Protozoa, Plants, Fungi, Metazoa) as well as in individual proteomes of some representative species.

The specific questions we have addressed were: (a) What is the most appropriate parameter that reflects the evolutionary mobility of protein domains? (b) Are there significant differences in the propensity to form multidomain proteins in different evolutionary lineages? (c) How do structural and functional properties of domains influence their mobility? (d) Is there reliable evidence for the notion that intronic recombination has significantly contributed to the remarkable mobility of some domain-types in metazoa?

Results and discussion

Differences in the propensity to form multidomain proteins in different evolutionary lineages

As shown in Table 1, different evolutionary groups show significant differences in the propensity to form multidomain proteins: the proportion of multidomain proteins decreases in the order metazoa > plants > fungi ≈ protozoa > bacteria > archaea. At one extreme we find archaea where only 23% of the entries contain more than one Pfam-A domain, while metazoa represent the other extreme where 39% of the entries correspond to multidomain proteins. It is also clear from Table 1 that in metazoa a larger proportion of Pfam-A domains participates in the construction of multidomain proteins than in archaea. Furthermore, the multidomain proteins of metazoa tend to be larger than those in archaea: multidomain proteins with more than 10 Pfam-A domains are nine times more frequent in metazoa than in archaea (Table 2). This observation is in harmony with earlier conclusions that the average protein length is considerably greater in eukaryotes than in prokaryotes [11]. These differences between different evolutionary lineages are unlikely to be due to differences in annotation coverage. As shown recently by Ekman et al. [12], the Pfam-A domain coverage is similar for archaea, bacteria and eukarya: in each group about 70% of the proteins have at least one Pfam-A domain. In agreement with this conclusion, our analyses have also shown that Pfam-A coverage is similar for bacteria, archaea, protozoa, plants, fungi and metazoa (Table 3).

Table 1. Domains and multidomain proteins in different groups of organisms. (a) Proteins containing at least one Pfam-A domain; (b) Pfam-A domains; (c) proteins containing at least two Pfam-A domains; (d) domains occurring in at least one multidomain protein; (e) domains occurring only as stand-alone domains in single-domain proteins.

Parameter                                 Bacteria      Archaea     Protozoa    Plants        Fungi       Metazoa
Proteins (a)                              273 859       23 728      16 756      57 620        20 371      129 881
Domains (b)                               4079          1725        1967        2562          2249        3272
Multidomain proteins (c) (% of proteins)  73 076 (27%)  5529 (23%)  5298 (32%)  20 359 (35%)  6434 (32%)  51 085 (39%)
Mobile domains (d) (% of domains)         1974 (48%)    776 (45%)   932 (47%)   1305 (51%)    1102 (49%)  1748 (53%)
Static domains (e) (% of domains)         2105 (52%)    949 (55%)   1035 (53%)  1257 (49%)    1147 (51%)  1524 (47%)

Table 2. Percentage of multidomain proteins containing more than N Pfam-A domains in different groups of organisms.

N    Bacteria  Archaea  Protozoa  Plants  Fungi  Metazoa
1    26.67     23.30    31.62     35.33   31.58  39.33
2    8.88      7.01     14.95     14.28   12.98  17.97
3    3.94      3.40     8.84      8.00    6.88   11.11
4    1.96      1.66     5.72      5.33    4.15   7.66
5    1.24      1.13     3.94      3.84    2.72   5.62
10   0.27      0.19     1.14      1.22    0.39   1.74

Table 3. Percentage of positions in domain-triplet types occupied by Pfam-A domains vs. Nterm, Cterm and Unknown regions in multidomain proteins of different groups of organisms. For definition of domain-triplet type, Nterm, Cterm and Unknown regions in domain-triplets see Methods.

Position  Bacteria  Archaea  Protozoa  Plants  Fungi  Metazoa
Pfam-A    71        70       68        68      67     71
Nterm     8         9        6         6       7      6
Cterm     8         8        7         7       7      6
Unknown   13        13       19        18      19     16

To gain a deeper insight into the factors controlling the frequency and size of multidomain proteins in different groups of organisms we have plotted the number of multidomain proteins vs. the number of constituent domains. Earlier studies have pointed out that such distributions usually fit the power law $P(i) \cong c\,i^{-\gamma}$, where $P(i)$ is the number of multidomain proteins containing exactly $i$ domains, $c$ is a normalization constant and $\gamma$ is a parameter, which typically assumes values between 1 and 3 [13]. In double-logarithmic plots, the plot of $P(i)$ as a function of $i$ is a straight line with negative slope, $-\gamma$. As shown in Fig. 1, in the case of each evolutionary group the data closely follow straight lines in double-logarithmic plots consistent with power-law dependence. The distribution of values of metazoan multidomain proteins was found to be significantly different from those of multidomain proteins of plants (P = 0.0002), bacteria (P < 0.0001), fungi (P < 0.0001) or archaea (P < 0.0001). The fact that the slopes of the curves in Fig. 1 are increasingly steeper in the order metazoa → plants → bacteria ≈ fungi ≈ archaea (Table 4) indicates that the likelihood of domain joining is greater in metazoa than in prokaryotes, plants and fungi. Surprisingly, the slope in protozoa is similar to that observed for metazoa.

Fig. 1. Distribution of multidomain proteins with respect to the number of constituent domains. The figure shows the number of constituent domains (x axis, log10 scale) compared with the number of multidomain proteins that have that number of domains (y axis, log10 scale) in bacteria (A), archaea (B), protozoa (C), plants (D), fungi (E) and metazoa (F). Parameters of the plots are compiled in Table 4.

Table 4. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where $P(i)$ is the number of multidomain proteins and $i$ is the number of constituent domains.

Parameter  Bacteria  Archaea   Protozoa  Plants    Fungi     Metazoa
γ          3.1744    2.9343    2.8356    2.7101    3.0635    2.7457
R          0.9822    0.9597    0.9737    0.9785    0.9755    0.9865
P          < 0.0001  < 0.0001  < 0.0001  < 0.0001  < 0.0001  < 0.0001
N (values as extracted; their assignment to the groups cannot be recovered): 23, 21, 36, 40, 34, 65

A possible explanation for the unusual abundance of larger multidomain proteins in protozoa is that parasitic protists have acquired metazoan-like multidomain proteins through lateral gene transfer. Recently it has been shown that different lineages of apicomplexan protozoa (e.g. Plasmodium, Cryptosporidium) have acquired distinct but overlapping sets of multidomain surface proteins constructed from adhesion domains typical of animal proteins, although in no case do they share multidomain architectures identical to those of animals [14,15]. Some of these proteins contain conserved adhesion domains such as the epidermal growth factor-like domain (EGF domain), thrombospondin type I (TSP1) domain, the von Willebrand factor A (vWA) domain and the PAN/APPLE domain that are typically abundant in animal surface proteins but are absent or rarely present in surface adhesion molecules in other eukaryotic lineages. A systematic analysis of the C. parvum proteome has identified 32 widely conserved surface domains distributed in 51 proteins, including 24 noncatalytic protein- or carbohydrate-interacting domains and seven catalytic domains. Most strikingly, 10 of these domains, namely, TSP1, sushi/CCP, Notch/Lin (NL1), NEC (neurexin-collagen domain), fibronectin type 2 (FN2), pentraxin, MAM domain (a domain present in meprin, A5, receptor protein tyrosine phosphatase mu), ephrin-receptor EGF-like domain, the animal signaling protein hedgehog-type HINT domain and the scavenger domain have thus far been found only in the surface proteins of animals other than apicomplexans. The remaining domains such as the EGF, LCCL domain (a domain first found in Limulus factor C, Coch-5b2 and Lgl1), Kringle, SCP domain are seen in some other eukaryotes, but predominantly found only in animals. In phylogenetic analyses affinities between specific apicomplexan and animal versions were recovered [16], making horizontal gene transfer from animals, followed by selective retention of functionally relevant proteins involved in adhesion as the most parsimonious explanation for these observations.

It thus appears that metazoa favor the formation of larger multidomain proteins than archaea, bacteria, fungi and plants. To test whether this is related to the fact that the world of extracellular (and some transmembrane) multidomain proteins has significantly expanded in metazoa [2,9], we have analyzed the size distribution of extracellular, transmembrane and intracellular multidomain proteins of metazoa separately.

Differences in the propensity to form extracellular, intracellular and transmembrane multidomain proteins in metazoa

Double-logarithmic plots of the number of extracellular, intracellular and transmembrane multidomain proteins vs. the number of constituent domains have revealed that in each case the data follow straight lines consistent with power-law dependence. The distribution of values for extracellular multidomain proteins, however, differed significantly from those of intracellular multidomain proteins (P < 0.0001), of transmembrane multidomain proteins (P = 0.0010) or of total metazoan multidomain proteins (P < 0.0001). The slope for extracellular multidomain proteins is shallower than the value for intracellular multidomain proteins, for transmembrane proteins or for total metazoan multidomain proteins (Table 5).

Table 5. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where $P(i)$ is the number of extracellular, transmembrane, intracellular or class 1–1 multidomain proteins and $i$ is the number of constituent domains.

Parameter  Total     Extracellular  Transmembrane  Intracellular  Class 1–1
γ          2.7457    2.1107         2.7479         2.8071         2.4031
R          0.9865    0.9362         0.9684         0.9781         0.9714
N          40        35             34             37             39
P          < 0.0001  < 0.0001       < 0.0001       < 0.0001       < 0.0001

These observations indicate that the ratio of domain joining/breaking is greater for extracellular than for intracellular multidomain proteins of metazoa. To test whether this reflects the fact that exon-shuffling of class 1–1 modules contributed primarily to the creation of extracellular (and extracellular parts of some transmembrane) multidomain proteins of metazoa [2,9], we have analyzed the size distribution of multidomain proteins assembled from class 1–1 modules (irrespective of their subcellular localization).

Power law distribution of metazoan multidomain proteins assembled by exon-shuffling from class 1–1 modules

Analysis of the double-logarithmic plot of the number of multidomain proteins assembled from class 1–1 modules vs. the number of constituent domains has revealed that the distribution of values differs significantly from those for total metazoan multidomain proteins (P < 0.0001) or for intracellular metazoan multidomain proteins (P < 0.0001). The slope for multidomain proteins assembled from class 1–1 modules is shallower than the values for intracellular or total metazoan multidomain proteins (Table 5). This observation is consistent with the notion that exon-shuffling of class 1–1 modules has favored the creation of larger (primarily extracellular) multidomain proteins of metazoa.

Domain size and propensity to form multidomain proteins

By plotting the size of multidomain proteins as a function of the number of constituent Pfam-A domains we obtained a linear relationship (Y = A + B*X), where X is the number of domains, B is the average size (in amino acid residues) of Pfam-A domains actually used to build multidomain proteins and Y is the size of the multidomain proteins (Fig. 2). The value of B was found to be 80 amino acid residues, much smaller than the average size of Pfam-A domains (178 residues) present in the Pfam-A database. This observation suggests that smaller domains are more likely to be used in the construction of multidomain proteins. As the value of A is 302 amino acid residues this also suggests that Pfam-A domains larger than average are more likely to be static, stand-alone domains. Figure 2 thus suggests that larger Pfam-A domains predominate in single- and oligodomain proteins, whereas larger multidomain proteins are constructed from smaller Pfam-A domains. It is noteworthy in this respect that the most versatile mobile modules (e.g. the EGF, ig, fn1, TSP_1, Sushi, Ldl_recept_a, SH3–1, SH2 modules, kringles) are less than 100 amino acid residues long. A possible explanation for this phenomenon is that smaller, compact domains are more likely to satisfy the folding autonomy criterion that is crucial for their structural integrity in multidomain proteins. This explanation is supported by the fact that the rate of protein folding of single-domain proteins is inversely proportional to protein length [17,18].

Fig. 2. Pfam-A domain number and protein size of metazoan proteins. The line shows a linear fit according to equation Y = A + B*X, where Y is the size of proteins with a given number of Pfam-A domains, X is the number of constituent Pfam-A domains (N = 129 881; B = 80.1 ± 0.3425; A = 302.3 ± 1.243; r² = 0.2987, P < 0.0001, 95% confidence interval). The figure shows only the data for proteins containing fewer than 25 domains; the squares represent the average size of proteins with a given number of Pfam-A domains.

Measuring the evolutionary mobility of protein domains

It has long been known that the propensity of individual domains to form multidomain architectures shows significant differences: whereas the majority of domains are rarely observed in multidomain proteins, some domains are extremely widely used [18]. Nevertheless, the distinction between static and mobile domains is rather vague since it is not simple to measure domain mobility.

The frequent reuse (‘mobility’) of a protein domain increases several types of parameters such as (a) the number of proteins in proteome(s) in which it is present; (b) number of copies of the domain in proteome(s); (c) number of other domain-types with which the given domain co-occurs to form multidomain proteins; and (d) number of multidomain protein architectures (linear sequence of domains, domain-organizations) in which the given domain occurs.

Parameters (a) and (b) are rarely used to illustrate differences in the mobility of protein domains, as it is clear that these parameters may also be affected by and may have more to do with gene duplications or domain duplications than with domain mobility.

In recent years the mobility of a domain was most frequently measured by the number of other domain-types with which the given domain co-occurs (to which it is ‘connected’) in multidomain proteins [5,7]. An obvious problem with this ‘co-occurrence’ or ‘connectivity’ approach is that a domain may co-occur with a large number of other domains in large families of multidomain proteins in which the given domain is always in the same local context, i.e. it shows no sign of mobility (Fig. 3). We face a similar problem if we wish to use the number of multidomain protein architectures to measure mobility of domains: a domain may occur in a large number of different architectures in which the given domain is always in the same local context (Fig. 3). As illustrated in this figure, during evolution of multidomain protein families domain insertions distant from the given domain may lead to marked changes in the number of architectures in which a given domain is present and marked changes in the number of co-occurring domains even though the given domain is present in the same local environment. To assess the significance of these problems, in the present work we have introduced the number of local architecture-types in which the given domain occurs as a measure of its mobility. Local architecture (local context) is defined as the ‘triplet’ consisting of the closest upstream (if any) and downstream (if any) domain neighbors of the given domain.

As illustrated in Table 6, ranking domains with respect to the number of types of domains co-occurring with the domain in metazoan multidomain proteins (CO-OCCURRENCE), number of types of metazoan multidomain protein architectures in which the domain is present (ARCHITECTURE) and number of local architectures (TRIPLETS) in which the domain is present give very different results.
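As a rough illustration of how the three measures compared here (CO-OCCURRENCE, ARCHITECTURE and TRIPLETS) can diverge, the following Python sketch counts them for each domain in a toy set of proteins. The function name `mobility_measures`, the input layout (protein identifier mapped to an ordered list of domain names) and the use of "Nterm" and "Cterm" as placeholders for a missing neighbor are assumptions made for this example. The toy architectures are invented, and the treatment of Unknown regions described in Methods is not reproduced.

```python
from collections import defaultdict


def mobility_measures(proteins):
    """Count three mobility measures for every domain in multidomain proteins.

    proteins: dict mapping a protein identifier to its domain names,
    listed from the N-terminus to the C-terminus (assumed input layout).
    """
    co_occurring = defaultdict(set)   # domain -> other domain types seen with it
    architectures = defaultdict(set)  # domain -> full linear domain sequences
    triplets = defaultdict(set)       # domain -> (upstream, domain, downstream)

    for domains in proteins.values():
        if len(domains) < 2:
            continue  # single-domain proteins are not multidomain proteins
        architecture = tuple(domains)
        domain_types = set(domains)
        for pos, dom in enumerate(domains):
            co_occurring[dom] |= domain_types - {dom}
            architectures[dom].add(architecture)
            # Assumption: a missing neighbor is recorded as a terminal marker.
            upstream = domains[pos - 1] if pos > 0 else "Nterm"
            downstream = domains[pos + 1] if pos + 1 < len(domains) else "Cterm"
            triplets[dom].add((upstream, dom, downstream))

    return {
        dom: {
            "co_occurrence": len(co_occurring[dom]),
            "architecture": len(architectures[dom]),
            "triplets": len(triplets[dom]),
        }
        for dom in architectures
    }


# Invented toy proteins: GPS always sits between EGF and CUB, while
# insertions happen further upstream, away from GPS.
toy_proteins = {
    "p1": ["EGF", "GPS", "CUB"],
    "p2": ["Lectin_C", "EGF", "GPS", "CUB"],
    "p3": ["ig", "LRR", "EGF", "GPS", "CUB"],
}

for name, counts in sorted(mobility_measures(toy_proteins).items()):
    print(name, counts)
```

In this toy set GPS co-occurs with five domain types and appears in three architectures, yet it has a single triplet type, whereas EGF, whose upstream neighbor changes, has three.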

H. Tordai et al. Mobile domains

Probable G protein-coupled receptor 97 precursor Q8R0T6

CD97 antigen precursor P48960

Brain - specific angiogenesis inhibitor 1 Q8CGM0

Latrophilin-like protein LAT-2 AAQ84879

Probable G protein-coupled receptor 126 precursor Q86SQ4

Probable G protein-coupled receptor 125 precursor Q8IWK6

Probable G protein-coupled receptor 116 precursor Q8IZF2

Latrophilin-1 O97830

Polycystic kidney disease and receptor for egg jelly related protein precursor Q9Z0T6

Receptor for egg jelly 3 protein Q95V80

Fig. 3. Domain organization of representa- tive multidomain proteins containing the GPS-domain (G-protein-coupled receptor proteolytic site domain). The rectangles highlight the two types of local environ- ments in which the GPS-domain occurs. The multidomain proteins shown represent 10 distinct architectures and contain 18 types of co-occurring domains. Note that GPS-containing multidomain proteins have diverse architecture due to the relatively high number of co-occurring domain-types, although the local environment of the GPS domain is mostly unchanged: it is present in only two triplet types.

A

Table 6. Ranking of Pfam-A domains in metazoa with respect to parameters reflecting their evolutionary mobility. Only the top-rank- ing 20 are shown; the domains are listed in the order of decreasing mobility. The domain names correspond to those used by the Pfam database (http://www.sanger.ac.uk/Software/Pfam/) [22]. Class 1–1 modules are highlighted in bold.

Number of types of co-occurring domains Number of types of architectures Number of types of ‘triplets’ Rank

B

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Pkinase EGF Ank PH zf-C3HC4 zf-C2H2 fn1 EGF_CA ig SH3–1 I-set efhand PDZ LRR WD40 Lectin_C Ldl_recept_a IQ TSP_1 Helicase_C EGF I-set fn1 ig LRR Pkinase EGF_CA Ank zf-C2H2 PH SH3–1 Ldl_recept_a Laminin_G_2 Collagen Sushi efhand PDZ IQ CUB zf-C3HC4 EGF Pkinase ig PH fn3 EGF_CA I-set SH3–1 CUB Ldl_recept_a zf-C2H2 TSP_1 Ank Sushi zf-C3HC4 efhand PDZ zf-CCHC C1–1 SH2

The similarities and differences of the information- content of ‘TRIPLET’ vs. ‘ARCHITECTURE’ and ‘CO-OCCURRENCE’ are illustrated in Fig. 4. As shown in Fig. 4(A), there is a clear linear relationship (Y ¼ B*X; R ¼ 0.9156, P < 0.0001) between the number of architecture-types (X) and the number of

FEBS Journal 272 (2005) 5064–5078 ª 2005 FEBS

5070

Fig. 4. Comparison of the number of triplet types, number of archi- tecture-types and number of co-occurring domain-types in metazoan multidomain proteins. (A) The figure shows a linear fit according to equation Y ¼ A + B*X, where Y is the number of triplet types, X is the number of architecture-types containing a given domain (N ¼ 1748; B ¼ 0.3673; R ¼ 0.9156, P < 0.0001). (B) The figure shows a linear fit according to equation Y ¼ A + B*X, where Y is the number of triplet types containing a given domain, X is the number of domain-types co-occurring with that domain (N ¼ 1748; B ¼ 1.082; R ¼ 0.9144, P < 0.0001). Class 1–1 modules showing greatest mobi- lity (present in more than 15 triplet types) are highlighted in red.

triplet types (i.e. local architecture-types, Y) in which a given domain is present, but the slope of the line (B ¼0.3673) indicates that a given local architecture- type may be found in several different global architec- tures, i.e. there is a uniform tendency that changes in architecture occur at distant regions. Furthermore, examination of the data reveals that domains (repeats) more prone to duplication than to shuffling (LRR, Ank, etc.) are the ones that deviate from this linear relationship most significantly.

Similarly, there is a linear relationship (R ¼ 0.9144, P < 0.0001) between the number of domain types with which a given domain co-occurs (connectivity) and the number of triplet types in which it is present (Fig. 4B). Nevertheless, examination of data reveals that the majority of mobile class 1–1 modules known to have been shuffled by exon-shuffling [20] deviate from this linear relationship most significantly inasmuch as they have higher triplet numbers than expected by the linear relationship, they are above the line calculated by linear regression analysis (Fig. 4B). This is also reflected in the fact that in Table 6, class 1–1 modules (e.g. CUB-, TSP1, Ldl_recept_a) occupy more prominent positions in the TRIPLET column than in the ARCHITEC- TURE or CO-OCCURRENCE columns.

H. Tordai et al. Mobile domains

On the other hand, domains [e.g. GPS (G-protein- coupled receptor proteolytic site domain), Fig. 3] that are present in almost invariable local environments of a vast variety of multidomain protein architectures are present in much lower number of triplet types than expected if we assume a perfect linear relation. As illustrated in Fig. 3, the high number of domains co-occurring with the GPS domain, the high number is present reflects of architecture-types in which it domain-shuffling events distant from the GPS domain and has little to do with mobility of the GPS domain.

It thus appears that the number of local architecture-types (‘triplets’) in which the given domain is present is a more relevant parameter to reflect the ‘shuffling’ or ‘insertion’ of a mobile domain into different environments. Ranking of domains according to this parameter has revealed that the best known mobile modules (EGF, ig, I-set, SH3–1, fn1, PH, EGF_CA, CUB, TSP_1, Ldl_recept_a, sushi, etc.) occupy most of the top 20 positions in the TRIPLET column of Table 6.

Domain size and domain mobility

We have used the number of triplet types in which a Pfam-A domain occurs as a measure of its mobility to investigate whether mobility correlates with domain size. As shown in Fig. 5 there is a significant inverse correlation between domain size and domain mobility. This observation is in harmony with the data shown in Fig. 2 that also suggest that smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. There are a few noteworthy exceptions to the generalization that the domains showing greatest mobility are small. One of these exceptions is the protein kinase domain that – with an average size of 228 amino acids – shows the second greatest mobility in metazoan multidomain proteins (Fig. 5 and Table 6). It seems likely that its mobility reflects primarily the great demand of this domain in signaling networks.

Power law distribution of domain mobility

It is evident from Fig. 5 that the majority of domains occur in a relatively small number of local architecture-types, whereas a small minority of domains serves as versatile building blocks of multidomain proteins. This is in agreement with recent observations that power laws describe the distribution of domains with respect to the number of multidomain architectures in which they occur [5,6,8].

FEBS Journal 272 (2005) 5064–5078 © 2005 FEBS


Fig. 5. Domain size and domain mobility. The number of domain-triplet types in which a Pfam-A domain occurs in metazoan multidomain proteins is plotted as a function of the average size of the given Pfam-A domain family (in amino acid residues). Note that there is an inverse correlation between domain size and domain mobility (Pearson r = −0.1507, number of pairs = 1748, P < 0.0001).
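The Pearson coefficient quoted in the legend of Fig. 5 is the standard product-moment correlation between average domain size and the number of triplet types. A minimal sketch of that calculation follows; the function name and the six size/triplet pairs are hypothetical and are not taken from the data set analyzed in the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: average domain size (residues) and number of triplet types.
domain_sizes = [40, 60, 100, 150, 250, 400]
triplet_types = [30, 22, 12, 8, 3, 1]
r_toy = pearson_r(domain_sizes, triplet_types)
print(r_toy < 0)  # an inverse relation gives a negative coefficient
```

A negative value indicates that larger domains tend to occur in fewer local environments; its significance would still have to be assessed separately, as was done for the value reported in the legend.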

To analyze the factors that influence domain mobility in different groups of organisms we have plotted the number of domain-types as a function of their ‘mobility’, mobility being expressed either as the number of domain-types co-occurring with the given domain or the number of triplet types in which a domain occurs.

In the case of the co-occurrence approach the data follow straight lines in double-logarithmic plots consistent with power-law dependence. The distribution of values of metazoa was found to be significantly different from those of bacteria (P < 0.0001), archaea (P = 0.0002), plants (P < 0.0001) and fungi (P < 0.0001). The slopes of the curves increase in the order metazoa < bacteria < plants < archaea < fungi (Table 7A). To test whether this is related to the fact that shuffling of (class 1–1) modules and creation of extracellular and transmembrane multidomain proteins was significantly facilitated by intronic recombination in metazoa [9,21], we have analyzed the domain co-occurrence plots for extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa separately.

The distribution of values for extracellular multidomain proteins differed significantly from that of intracellular multidomain proteins (P < 0.0051). The slope for extracellular multidomain proteins is shallower than the value for intracellular multidomain proteins (Table 8A).

Furthermore, the values for multidomain proteins assembled from class 1–1 modules differed significantly from those for total metazoan multidomain proteins (P < 0.0001). The slope for class 1–1 multidomain proteins is shallower than the value for total metazoan multidomain proteins (Table 8A). These observations are consistent with the notion that intronic recombination greatly increased the mobility of class 1–1 modules in metazoa, and this facilitated the creation of novel extracellular multidomain proteins of animals.

Analysis of domain mobility of metazoan proteins with the triplet approach has also revealed that the data follow straight lines in double-logarithmic plots. In the case of the triplet plots the distribution of values in metazoa is also significantly different from those of bacteria (P < 0.0001), archaea (P < 0.0001), protozoa (P < 0.001), plants (P < 0.0001) and fungi (P < 0.0001). Comparison of the slopes of co-occurrence vs. triplet plots in different groups of organisms (Tables 7A and B) has revealed that in each case the slopes of the triplet plots are steeper than those of co-occurrence plots (metazoa: γ = 1.6170 vs. 2.0125; bacteria: γ = 1.8207 vs. 2.2851; archaea: γ = 2.0278 vs. 2.4554; protozoa: γ = 1.9690 vs. 2.7118; plants: γ = 1.8616 vs. 2.5508; fungi: γ = 2.2128 vs. 2.8692). It seems likely that this is due to the difference of the two approaches: the ‘global’ co-occurrence approach tends to overestimate the mobility of domains as opposed to the ‘local’ triplet approach (Fig. 3). Nevertheless, the results of the two analyses are similar inasmuch as metazoan domains display the shallowest slopes.

It is interesting to point out that the mobility distribution of domains of protozoa is very similar to those of Plants and Fungi (Table 7B), whereas the size-frequency distribution of protozoan multidomain proteins is more similar to that of metazoa (Table 4). A possible explanation for this apparent contradiction is that the lateral gene transfer of multidomain proteins from animal hosts affects the size distribution of the multidomain protein pool of parasitic protozoa, but the domains thus acquired have lost their mobility in the intron-poor genomes of protists.

In the case of the triplet plots the distribution of values for multidomain proteins assembled from class 1–1 modules is significantly different from those of extracellular proteins (P < 0.0001), transmembrane multidomain proteins (P < 0.0001), intracellular multidomain proteins (P < 0.0001) or total metazoan multidomain proteins (P < 0.0001). The slope of the triplet plot for multidomain proteins assembled from class 1–1 modules is shallower than that for extracellular proteins, for transmembrane proteins, for intracellular multidomain proteins or for total metazoan multidomain proteins (Table 8). Comparison of co-occurrence plots (Table 8A) and triplet plots (Table 8B) for extracellular, intracellular, transmembrane and class 1–1 multidomain proteins has also revealed that the order of the slopes is similar in the two approaches: class 1–1 < extracellular < transmembrane < intracellular multidomain proteins.

Table 7. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.

        Bacteria   Archaea   Protozoa   Plants   Fungi    Metazoa
(A) i is the number of types of domains with which they co-occur in multidomain proteins
γ       1.8207     2.0278    1.9690     1.8616   2.2128   1.6170
R       0.9755     0.9316    0.9596     0.9498   0.9688   0.9588
N (as listed): 43, 21, 31, 16, 22, 50; P < 0.0001 for each fit
(B) i is the number of domain-triplet types (local architectures) in which they occur in multidomain proteins
γ       2.2851     2.4554    2.7118     2.5508   2.8692   2.0125
R       0.9726     0.9357    0.9852     0.9624   0.9734   0.9568
N (as listed): 11, 35, 16, 25, 16, 42; P ≤ 0.0001 for each fit

Table 8. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.

        Total    Extracellular   Transmembrane   Intracellular   Class 1–1
(A) co-occurrence plots
γ       1.6170   1.3193          1.4542          1.7839          1.0984
R       0.9588   0.8879          0.9337          0.9425          0.8951
N (as listed): 50, 17, 19, 25; P < 0.0001 for each fit
(B) triplet plots
γ       2.0125   1.6241          1.8397          2.3389          0.9233
R       0.9568   0.8668          0.9265          0.9638          0.9101
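The parameters reported in Tables 7 and 8 come from straight-line fits to double-logarithmic plots. The sketch below shows one way such a fit can be computed, assuming ordinary least squares on log10-transformed values; the paper's own fits were made with GraphPad Prism, and the function name `fit_power_law` and the synthetic counts are illustrative assumptions only.

```python
import math

def fit_power_law(counts):
    """Fit P(i) = c * i**(-gamma) by least squares on (log10 i, log10 P(i)).

    `counts` maps a mobility value i (co-occurring domain-types or triplet
    types) to P(i), the number of domains having that value.
    Returns (gamma, c, r), where r is the correlation of the log-log points
    and is negative for a decaying distribution.
    """
    pts = [(math.log10(i), math.log10(p)) for i, p in counts.items() if i > 0 and p > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return -slope, 10 ** intercept, r

# Synthetic counts that follow an exact power law with gamma = 2 and c = 1000.
synthetic_counts = {i: 1000 * i ** -2.0 for i in range(1, 11)}
gamma_fit, c_fit, r_fit = fit_power_law(synthetic_counts)
print(round(gamma_fit, 6), round(c_fit, 3))
```

A shallower slope (smaller γ) means that a larger share of domains reaches high mobility values, which is how the differences between groups in Tables 7 and 8 are read.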

Global and local domain co-occurrence networks

Power law distributions are intimately related to the so-called scale-free networks: networks in which the frequency distribution of node degrees (i.e. the number of other nodes to which a given node is connected) follows a power law. Accordingly, power law distributions are frequently analyzed and visualized through scale-free networks.

The basis of the scale-free behavior of network evolution (and power law distributions) is that the probability of a node acquiring a new connection is proportional to the number of links that node already has: there is a greater likelihood of nodes being added to pre-existing hubs. For example, the fact that the casting of actors in movies and the distribution of people according to their wealth follow a power law is a manifestation of ‘the rich get richer’ principle [5,6]. By analogy, the fact that the distribution of domains according to the frequency they are used to build multidomain proteins follows a power law indicates that the chance of a domain to be used is proportional to the number of times it has already been used.

In the present work domain co-occurrence networks and triplet networks were used to illustrate and quantify the mobility of domains and the complexity of multidomain protein networks. The number of vertices, connectivities (edges) and the size of the largest connected component were used to characterize the complexity of the domain networks of different groups of organisms (Table 9). The size (the number of vertices) of the largest connected component increases linearly (with a slope of 1.0529) with the number of total vertices (as we proceed from prokaryotes to higher eukaryotes), with a ‘lag’ of about 500 vertices (Fig. 6A). A possible explanation for this phenomenon is that some ancient domains formed ancient multidomain proteins but apparently they no longer participate in novel domain combinations, instead remaining ‘islands’, separated from the largest connected component of the domain network. An illustrative example of this group is the ancient multidomain protein RNA polymerase Rpb1, constructed from domains RNA_pol_Rpb1–1, RNA_pol_Rpb1–2, RNA_pol_Rpb1–3, RNA_pol_Rpb1–4 and RNA_pol_Rpb1–5, which combine only with each other.

The number of architecture types also increases with the number of total vertices, and the correlation is best described by a semilogarithmic plot (Fig. 6B) consistent with a model in which domains combine at random. It is noteworthy, however, that in the linear fit to equation Y = A + B*X the value of A suggests that there are frozen ancient multidomain architectures, the constituent domains of which do not participate in the construction of novel multidomain proteins. It appears that this is another manifestation of what we said above in connection with the set of vertices excluded from the largest connected component of domain networks: some ancient domains form ancient multidomain proteins with permanent domain partners (and conserved architectures) but they are no longer used in the construction of novel multidomain architectures. A comparison of the list of domains excluded from the largest connected components in all organisms with the list of domains in conserved multidomain architectures shared by all organisms has revealed significant similarities. For example, ancient domains/multidomain proteins (fulfilling basic functions) such as Enolase_C and Enolase_N of enolase,

Gp_dh_C and Gp_dh_N of glyceraldehyde 3-phosphate dehydrogenase, Ldh_1_C and Ldh_1_N of lactate/malate dehydrogenase, FGGY_N and FGGY_C of the FGGY family of carbohydrate kinases, THF_DHG_CYH and THF_DHG_CYH_C of tetrahydrofolate dehydrogenase/cyclohydrolases, RNA_pol_Rpb1–1, RNA_pol_Rpb1–2, RNA_pol_Rpb1–3, RNA_pol_Rpb1–4 and RNA_pol_Rpb1–5 of RNA polymerase Rpb1, DNA_photolyase and FAD_binding_7 of DNA photolyases, and ATP-cone, Ribonuc_red_lgN and Ribonuc_red_lgC of ribonucleotide reductases are present in both groups.

Table 8 (panel legends). (A) i is the number of types of domains with which they co-occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa (N = 37, P < 0.0001, as listed). (B) i is the number of domain-triplet types (local architectures) in which they occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa (N as listed: 42, 13, 17, 19, 28; P ≤ 0.0001 for each fit).

Domain networks and organismic complexity

As illustrated in Figs 6, 7 and 8 and summarized in Table 9, the total number of vertices and edges, the size of the largest connected component of domain networks, and the number of architecture types all increase in parallel with the evolution of higher organisms of greater organismic complexity.

At one extreme we find Archaea with the lowest values for the parameters reflecting the complexity of the world of multidomain proteins. Conversely, metazoa, particularly Chordates, have the highest values in all these parameters.

Figures 7 and 8 also show that significant changes occurred in the structural organization of the domain-networks of eukaryotes in parallel with the evolution of complex multicellular organisms. In the case of Protozoa, Fungi and Plants the largest connected components of their domain networks consist of two subclusters containing nuclear and cytoplasmic signaling domains. The domain-network of metazoa is strikingly different from those of other eukaryotes inasmuch as it has a third distinct subcluster consisting of extracellular (primarily class 1–1) modules. This extracellular subcluster is connected to the cytoplasmic signaling subcluster through multidomain transmembrane proteins such as receptor kinases, G-protein coupled receptors, etc. Comparison of domain-networks of different eukaryotes thus confirms that the evolution of increased organismic complexity in metazoa is intimately associated with the generation of novel extracellular and transmembrane multidomain proteins that mediate the interactions among their cells, tissues and organs [2].

Table 9. Parameters of domain-networks and number of architecture types in different groups of organisms.

                              Full domain-network    Largest connected component
Group / species               Vertex    Edge         Vertex (% of total)    Edge     Architecture types
Bacteria                      1792      3684         1339 (75%)             3334     5123
  Escherichia coli            779       824          96 (12%)               171      678
  Bacillus subtilis           650       683          120 (18%)              182      478
Archaea                       713       799          294 (41%)              458      960
  Methanococcus jannaschii    323       286          21 (7%)                27       223
  Sulfolobus solfataricus     318       251          16 (5%)                25       200
  Archaeoglobus fulgidus      382       359          48 (13%)               62       290
Protozoa                      823       1129         384 (47%)              700      1362
  Plasmodium falciparum       413       407          34 (8%)                44       422
Plants                        1191      1764         643 (54%)              1286     2836
  Arabidopsis thaliana        1040      1366         491 (47%)              901      1607
Fungi                         1021      1397         497 (49%)              868      1471
  Saccharomyces cerevisiae    695       794          191 (27%)              289      622
  Neurospora crassa           855       1055         355 (42%)              565      873
Metazoa                       1586      3786         1159 (73%)             3472     5797
  Invertebrates               1296      2563         822 (63%)              2107     3061
  Caenorhabditis elegans      1071      1825         565 (53%)              1308     1659
  Drosophila melanogaster     1109      1944         604 (54%)              1457     1758
  Chordates                   1422      3196         967 (68%)              2843     4093
  Homo sapiens                1335      2755         838 (63%)              2335     2753

Fig. 6. Correlation of the number of total vertices of domain networks with the number of vertices in the largest connected component (LCC) and with the number of architecture types. (A) Linear fit according to the equation Y = A + B*X, where Y is the number of vertices in the largest connected component and X is the number of total vertices of domain networks (B = 1.0529; X intercept at Y = 0: 538; R = 0.9843, P < 0.0001). (B) Linear fit according to the equation Y = A + B*X, where Y is the log of the number of architecture types and X is the number of total vertices of domain networks (Y intercept at X = 0: 2.109; R = 0.9746, P < 0.0001). The data analyzed here are shown in Table 9 and represent networks of Escherichia coli, Bacillus subtilis, Methanococcus jannaschii, Sulfolobus solfataricus, Archaeoglobus fulgidus, Plasmodium falciparum, Arabidopsis thaliana, Saccharomyces cerevisiae, Neurospora crassa, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, bacteria, archaea, protozoa, plants, fungi, metazoa, invertebrates, chordates.

Fig. 7. Domain co-occurrence networks of protozoan (A), plant (B), fungal (C), and metazoan (D) multidomain proteins. Domains (vertices) are colored according to their subcellular locales (see Methods). Green dots correspond to nuclear domains, yellow dots indicate signaling/cytoplasmic domains, magenta dots identify extracellular domains. Class 1–1 modules (mostly extracellular domains) are identified by red dots.

Fig. 8. Domain co-occurrence networks of multidomain proteins of Escherichia coli (A), Archaeoglobus fulgidus (B), Plasmodium falciparum (C), Arabidopsis thaliana (D), Neurospora crassa (E) and Homo sapiens (F). Domains (vertices) are colored as in Fig. 7.

Conclusions

In the present work we have shown that the most appropriate measure of the mobility of domains is the number of types of local environments (number of types of domain-triplets) in which a given domain-type is present. According to this definition, domains that always occur in the same local environment are considered ‘static domains’, whereas ‘mobile domains’ are those that occur in numerous different local environments. Using this triplet-approach we have determined the mobility-distribution of Pfam-A domains in Bacteria, Archaea, Protozoa, Plants, Fungi and metazoa and have found that in each group the distribution of the mobility of domains fits the power law $P(i) \cong c\,i^{-\gamma}$, where P(i) is the number of domain-types present in i triplets, c is a normalization constant and γ is a parameter which assumes values between 1 and 3. Comparison of the domain-mobility distributions in different groups of organisms has revealed that the γ-value of metazoa was lower than those of other eukaryotes or prokaryotes, indicating that in metazoa more domains have greater mobility. Analysis of different subgroups of metazoan multidomain proteins (extracellular, intracellular and transmembrane multidomain proteins) has revealed that extracellular domains used in the construction of extracellular proteins (and extracellular parts of transmembrane proteins) of metazoa are particularly enriched in domains of greater mobility. Among the extracellular domains the so-called class 1–1 modules, i.e. domains which have been shuffled by recombination in phase 1 introns flanking these domains [2,9], were found to show the greatest mobility. These observations provide statistical support for our earlier suggestions that intronic recombination has contributed significantly to the remarkable mobility of the class 1–1 domains in metazoa, thereby facilitating the construction of extracellular and transmembrane multidomain proteins unique for metazoa [2,9].

Methods

Databases of multidomain proteins

Information on full-length protein sequences was downloaded from the UniProt Knowledgebase (http://www.expasy.uniprot.org) to create databases of proteins of Bacteria, Archaea, Protozoa, Plants, Fungi and metazoa as well as databases of proteins of representative species (Escherichia coli, Bacillus subtilis, Archaeoglobus fulgidus, Methanococcus jannaschii, Sulfolobus solfataricus, Plasmodium falciparum, Arabidopsis thaliana, Saccharomyces cerevisiae, Neurospora crassa, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens).

Domain analyses

The domain composition, domain connectivity, domain architecture and domain-neighborhood characteristics of proteins were analyzed based on the Pfam-A domain annotations of the database entries. In the present work we have used only information on the better-characterized Pfam-A domains of the Pfam Protein Families Database Version 15.0 (http://www.sanger.ac.uk/Software/Pfam/) [22].

We have defined a multidomain protein as a protein containing at least two Pfam-A domains irrespective of their evolutionary relationship. The domain composition of a multidomain protein was defined by the sum of constituent Pfam-A domain-types. The domain architecture of a multidomain protein was defined as the linear sequence of constituent Pfam-A domains, irrespective of the distance between consecutive Pfam-A domains. Domain connectivities were calculated as the number of domain types with which a given domain co-occurs.

The local architectures in which a given domain (domain Di) occurs were defined as domain triplets Du-Di-Dd containing the closest upstream (Du, if any) and downstream (Dd, if any) domain neighbors of the given domain. As the mean value of the size of Pfam-A domains is 178 amino acid residues, a window of 180 amino acid residues was used to find the closest domain neighbors. The reason for this limitation is to minimize the danger that an unidentified domain is skipped and a more distant domain is identified as neighbor. If no Pfam-A domain was found within this window because no Pfam-A domain was assigned, the unassigned region was defined as Unknown. If the distance of the boundary of domain Di from the N-terminus (or C-terminus) of the protein was less than 50 residues (i.e. significantly shorter than a typical Pfam-A domain) and there was no Pfam-A domain within this region, then the upstream region was defined as Nterm and the downstream region was defined as Cterm. To assess the number of contexts in which a given domain (domain Di) can occur in multidomain proteins we have listed all domain triplets Du-Di-Dd, where Di is the domain analyzed and Du and Dd are the domains flanking domain Di at its N- and C-terminal boundary, respectively.

Statistical analyses

Statistical analyses were performed using GraphPad Prism version 4.02 for Windows, GraphPad Software (San Diego, CA, USA) (http://www.graphpad.com).

The statistical significance of differences of power law distributions was assessed by two-tailed paired t-test with 95% confidence interval. In the case of the size distributions of multidomain proteins we have expressed the number of proteins containing a given number of Pfam-A domains as a percentage of single-domain proteins present in the given group. The data thus normalized (to eliminate the problem that the number of proteins analyzed is very different in the different groups) were subjected to paired t-tests to decide whether the distributions are significantly different or not. Similarly, we have analyzed differences in power-law distributions of domain mobility by paired t-tests following normalization of the data: we have expressed the number of domains with a given value of domain mobility (measured as number of co-occurring domains or number of triplets in which a domain is present) as a percentage of domains with a mobility value of 1 (i.e. which co-occur with only one domain, or are present only in one triplet type).

Linear regression analysis with 95% confidence interval was used to analyze double logarithmic plots of the individual power-law distributions. We have calculated the significance of Pearson’s correlation coefficient between the domain size and domain-connectivity.

Identification of extracellular, intracellular and transmembrane multidomain proteins of metazoa

Extracellular, transmembrane and intracellular multidomain proteins of metazoa were identified on the basis of the subcellular location information of database entries. Extracellular proteins were identified as those annotated as extracellular, secreted or plasma proteins; intracellular proteins were identified as those annotated as intracellular, cytoplasmic, nuclear, mitochondrial or cytoskeletal proteins; transmembrane proteins were identified as those annotated as membrane proteins. The correct assignment of proteins to these categories was also checked by the presence or absence of annotated transmembrane domains and the presence or absence of extracellular or intracellular (cytoplasmic, nuclear) Pfam-A domains. Extracellular proteins were identified as those containing extracellular domains but lacking intracellular domains and transmembrane domains. Intracellular proteins were identified as those containing intracellular domains but lacking extracellular domains and transmembrane domains. Transmembrane multidomain proteins were identified as those containing intracellular and/or extracellular domains and transmembrane domains. Pfam-A domains were assigned a subcellular locale with the help of the SMART Simple Modular Architecture Research Tool Version 4.0 (http://smart.embl-heidelberg.de/) [23], which classifies domains in three major categories on the basis of their most probable cellular localization [24].

Network analyses

Pfam-A domain co-occurrence (or connectivity) graphs were defined by the vertex set consisting of all domains found within multidomain proteins of a given species, of a given group of organisms or of a given subgroup of multidomain proteins. Two domains (vertices in the network) were regarded as being connected by an edge in the network if they co-occur in at least one multidomain protein irrespective of their distance in the linearly arranged domains within the protein.

Undirected triplet networks were similarly defined, except that in this case two domains were regarded as being connected by an edge if they are adjacent in at least one domain triplet.
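The triplet and network definitions given in the Methods can be sketched as follows. This is one possible reading of the rules, not the authors' implementation: the input format (name, start, end with 1-based inclusive coordinates), all function names, the decision to leave Nterm, Cterm and Unknown out of the triplet network, and the two toy proteins are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW = 180        # residues searched for the closest neighbor (Methods)
TERMINAL_GAP = 50   # a shorter distance to a protein end counts as Nterm/Cterm
PLACEHOLDERS = {"Nterm", "Cterm", "Unknown"}

def triplets(domains, protein_length):
    """Return the (Du, Di, Dd) triplets of one protein.

    `domains` is a list of (name, start, end) sorted by start (assumed format).
    """
    result = []
    last = len(domains) - 1
    for k, (name, start, end) in enumerate(domains):
        if k > 0 and start - domains[k - 1][2] - 1 <= WINDOW:
            up = domains[k - 1][0]
        elif k == 0 and start - 1 < TERMINAL_GAP:
            up = "Nterm"
        else:
            up = "Unknown"
        if k < last and domains[k + 1][1] - end - 1 <= WINDOW:
            down = domains[k + 1][0]
        elif k == last and protein_length - end < TERMINAL_GAP:
            down = "Cterm"
        else:
            down = "Unknown"
        result.append((up, name, down))
    return result

def co_occurrence_edges(proteins):
    """Edge between two domain types found in the same protein, at any distance."""
    edges = set()
    for _, domains in proteins:
        names = sorted({name for name, _, _ in domains})
        for idx, a in enumerate(names):
            edges.update((a, b) for b in names[idx + 1:])
    return edges

def triplet_edges(proteins):
    """Edge between two domain types that are adjacent in at least one triplet."""
    edges = set()
    for length, domains in proteins:
        for up, mid, down in triplets(domains, length):
            for neighbor in (up, down):
                if neighbor not in PLACEHOLDERS and neighbor != mid:
                    edges.add(tuple(sorted((neighbor, mid))))
    return edges

def largest_component(vertices, edges):
    """Vertex set of the largest connected component (depth-first search)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), set()
    for v in vertices:
        if v in seen:
            continue
        component, stack = {v}, [v]
        while stack:
            for w in adjacency[stack.pop()]:
                if w not in component:
                    component.add(w)
                    stack.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Two hypothetical proteins given as (length, [(domain, start, end), ...]).
toy_proteins = [
    (400, [("A", 10, 100), ("B", 120, 220), ("C", 240, 380)]),
    (900, [("A", 5, 95), ("D", 600, 700)]),
]
toy_vertices = {name for _, doms in toy_proteins for name, _, _ in doms}
global_edges = co_occurrence_edges(toy_proteins)
local_edges = triplet_edges(toy_proteins)
print(len(largest_component(toy_vertices, global_edges)))
print(len(largest_component(toy_vertices, local_edges)))
```

In the second toy protein the two domains are more than 180 residues apart, so they are linked in the co-occurrence network but not in the triplet network; this is the sense in which the co-occurrence view is ‘global’ and the triplet view ‘local’.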

Pajek 1.01 for Windows 32, a program for large-network analysis and visualization [25] (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), and LGL [26] (http://bioinformatics.icmb.utexas.edu/lgl/) were used for the calculation and visualization of domain co-occurrence or triplet networks.

Identification of metazoan multidomain proteins assembled by exon-shuffling from class 1–1 modules

Metazoan multidomain proteins assembled by exon-shuffling were identified as those containing class 1–1 modules that were shuffled by recombination in phase 1 introns flanking these domains [20].

Acknowledgements

The authors wish to acknowledge support from the BioSapiens project funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LHSG-CT-2003–503265.

References

1 Gerstein M (1997) A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J Mol Biol 274, 562–576.

2 Patthy L (2003) Modular assembly of genes and the evolution of new functions. Genetica 118, 217–231.

3 Doolittle R (1995) The multiplicity of domains in proteins. Annu Rev Biochem 64, 287–314.

4 Wolf YI, Brenner SE, Bash PA & Koonin EV (1999) Distribution of protein folds in the three superkingdoms of life. Genome Res 9, 17–26.

5 Wuchty S (2001) Scale-free behavior in protein domain networks. Mol Biol Evol 18, 1694–1702.

6 Koonin EV, Wolf YI & Karev GP (2002) The structure of the protein universe and genome evolution. Nature 420, 218–223.

7 Ye Y & Godzik A (2004) Comparative analysis of protein domain organization. Genome Res 14, 343–353.

8 Apic G, Gough J & Teichmann SA (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311–325.

9 Patthy L (1999) Protein Evolution. Blackwell Science, Oxford.

10 Trexler M & Patthy L (1983) Folding autonomy of the kringle 4 fragment of human plasminogen. Proc Natl Acad Sci USA 80, 2457–2461.

11 Das S, Yu L, Gaitatzes C, Rogers R, Freeman J, Bienkowska J, Adams RM, Smith TF & Lindelien J (1997) Biology’s new Rosetta stone. Nature 385, 29–30.

12 Ekman D, Bjorklund AK, Frey-Skott J & Elofsson A (2005) Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol 348, 231–243.

13 Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS & Koonin EV (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol 2, 18.

14 Deng M, Templeton TJ, London NR, Bauer C, Schroeder AA & Abrahamsen MS (2002) Cryptosporidium parvum genes containing thrombospondin type 1 domains. Infect Immun 70, 6987–6995.

15 Templeton TJ, Iyer LM, Anantharaman V, Enomoto S, Abrahante JE, Subramanian GM, Hoffman SL, Abrahamsen MS & Aravind L (2004) Comparative analysis of apicomplexa and genomic diversity in eukaryotes. Genome Res 14, 1686–1695.

16 Pradel G, Hayton K, Aravind L, Iyer LM, Abrahamsen MS, Bonawitz A, Mejia C & Templeton TJ (2004) A multidomain adhesion protein family expressed in Plasmodium falciparum is essential for transmission to the mosquito. J Exp Med 199, 1533–1544.

17 Ivankov DN, Garbuzynskiy SO, Alm E, Plaxco KW, Baker D & Finkelstein AV (2003) Contact order revisited: influence of protein size on the folding rate. Protein Sci 12, 2057–2062.

18 Punta M & Rost B (2005) Protein folding rates estimated from contact predictions. J Mol Biol 348, 507–512.

19 Patthy L (1985) Evolution of the proteases of blood coagulation and fibrinolysis by assembly from modules. Cell 41, 657–663.

20 Bányai L & Patthy L (2004) Evidence that human genes of modular proteins have retained significantly more ancestral introns than their fly or worm orthologues. FEBS Lett 565, 127–132.

21 Patthy L (1999) Genome evolution and the evolution of exon-shuffling – a review. Gene 238, 103–114.

22 Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al. (2004) The Pfam protein families database. Nucleic Acids Res 32 (Database issue), D138–41.

23 Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP & Bork P (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res 32 (Database issue), D142–4.

24 Mott R, Schultz J, Bork P & Ponting CP (2002) Predicting protein cellular localization using a domain projection method. Genome Res 12, 1168–1174.

25 Batagelj V & Mrvar A (1998) PAJEK – program for large network analysis. Connections 21, 47–57.

26 Adai AT, Date SV, Wieland S & Marcotte EM (2004) LGL: creating a map of protein function with an algorithm for visualizing very large biological networks. J Mol Biol 340, 179–190.
