Reannotation of hypothetical ORFs in plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043
Ling-Ling Chen, Bin-Guang Ma and Na Gao
Shandong Provincial Research Center for Bioinformatic Engineering and Technique, Shandong University of Technology, Zibo, China
Keywords clusters of orthologous groups of proteins; function assignment; hypothetical ORFs; plant pathogen; principal component analysis
Correspondence L.-L. Chen, Shandong Provincial Research Center for Bioinformatic Engineering and Technique, Center for Advanced Study, Shandong University of Technology, Zibo 255049, China Fax: +86 533 278 0271 Tel: +86 533 278 0271 E-mail: llchen@sdut.edu.cn
(Received 18 August 2007, revised 3 October 2007, accepted 12 November 2007)
Over-annotation of hypothetical ORFs is a common phenomenon in bacterial genomes, which necessitates confirming the coding reliability of hypothetical ORFs and then predicting their functions. The important plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) is a typical case because more than a quarter of its annotated ORFs are hypothetical. Our analysis focuses on annotation of Eca1043 hypothetical ORFs, and comprises two efforts: (a) based on the Z-curve method, 49 originally annotated hypothetical ORFs are recognized as noncoding, a result that is further supported by principal components analysis and other evidence; and (b) using sequence-alignment tools and some functional resources, more than half of the hypothetical genes were assigned functions. The potential functions of 427 hypothetical genes are summarized according to the cluster of orthologous groups functional category. Moreover, 114 and 86 hypothetical genes are recognized as putative ‘membrane proteins’ and ‘exported proteins’, respectively. Reannotation of Eca1043 hypothetical ORFs will benefit research into the lifestyle, metabolism and pathogenicity of this important plant pathogen. Our study also offers a model for the reannotation of hypothetical ORFs in microbial genomes.
doi:10.1111/j.1742-4658.2007.06190.x
FEBS Journal 275 (2008) 198–206 © 2007 The Authors Journal compilation © 2007 FEBS
L.-L. Chen et al. Hypothetical ORFs reannotation in Eca1043
Abbreviations COG, cluster of orthologous groups; NCBI, National Center for Biotechnology Information; PCA, principal components analysis.

Currently, more than 500 completely sequenced microbial genomes are available from public databases, which provide an unprecedented opportunity to study the genetics, biochemistry and evolutionary features of these species. Such analyses depend strongly on the gene annotation of each species. However, many genomes contain numerous hypothetical ORFs for which no functional information exists. Thanks to the progress of genome-sequencing projects, a large number of hypothetical ORFs can now be assigned functions. Furthermore, some annotated hypothetical ORFs do not actually encode proteins, so the number of annotated ORFs is usually greater than the number of actual protein-coding genes for most microbial genomes [1,2].

Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) belongs to the Enterobacteriaceae, a family noted for its human pathogens [3]. Eca1043 is a commercially important plant pathogen that is restricted to potato in temperate regions; it can cause blackleg in the field and soft rot in tubers after harvest [3]. Although soft rot pathogenesis relies primarily on the prolific production of extracellular plant-cell-wall-degrading enzymes that cause extensive tissue maceration, recent discoveries suggest that the process may be far more complex than previously thought [4,5]. The Eca1043 genome was sequenced in 2004, and its annotated ORFs can be divided into two groups: (a) genes with known functions, and (b) hypothetical ORFs. Whether these hypothetical ORFs are protein-coding genes is uncertain and their functions are unknown. Because more than a quarter of the ORFs in Eca1043 are hypothetical, it is necessary to reannotate them. Using sequence-alignment tools (e.g. BLAST and FASTA) [6–8] and other functional resources (e.g. InterPro and KEGG) [9,10], we predicted the functions of 427 hypothetical ORFs. The predicted functions of 109 hypothetical ORFs are highly reliable, with sequence coverage > 80%, identity ≥ 80% and E value < 1e-20 to their homologous proteins. Moreover, 114 and 86 hypothetical ORFs are recognized as putative ‘membrane proteins’ and ‘exported proteins’, respectively. In addition, 49 hypothetical ORFs are identified as noncoding ORFs using a methodology based on Z-curve theory [11]. Using principal components analysis (PCA), it can be observed intuitively that most of the identified noncoding ORFs lie away from the core of function-known genes and close to random sequences. Other evidence also suggests that the 49 recognized noncoding ORFs are unlikely to code for proteins. Consequently, the number of hypothetical genes in Eca1043 decreases from 1254 to 578. These results are highly significant for research into the adaptation, lifestyle and pathogenicity of this important plant pathogen.
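The reliability criteria quoted above, together with the looser criteria given in the Results for the other 318 assignments (query coverage > 80%, identity > 30%, E value < 1e-10), can be sketched as a simple two-tier filter. This is only an illustration under stated assumptions and not the authors' pipeline: the `Hit` record, its field names, the tier labels and the example hits are hypothetical; only the thresholds and the ORF counts are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best database hit for one hypothetical ORF (hypothetical record layout)."""
    orf: str
    coverage: float  # percentage of the ORF covered by the alignment
    identity: float  # percentage identity of the alignment
    evalue: float    # E value reported by the search tool


def reliability_tier(hit: Hit) -> str:
    """Classify a hit with the thresholds quoted in the text."""
    if hit.coverage > 80 and hit.identity >= 80 and hit.evalue < 1e-20:
        return "high"        # criteria for the 109 highly reliable assignments
    if hit.coverage > 80 and hit.identity > 30 and hit.evalue < 1e-10:
        return "moderate"    # criteria for the other 318 assignments
    return "unassigned"


# Bookkeeping of the hypothetical ORFs, using the counts reported in the text.
total_hypothetical = 1254
noncoding, assigned, membrane, exported = 49, 109 + 318, 114, 86
remaining = total_hypothetical - noncoding - assigned - membrane - exported

# Made-up hits, for illustration only.
examples = [
    Hit("orf_a", 95.0, 88.0, 1e-60),
    Hit("orf_b", 90.0, 45.0, 1e-15),
    Hit("orf_c", 60.0, 92.0, 1e-40),
]
tiers = {h.orf: reliability_tier(h) for h in examples}
```

With these counts, `remaining` reproduces the 578 hypothetical genes that are left after reannotation, which is a quick consistency check on the figures reported in the text.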
Results and Discussion
Identification of 49 noncoding ORFs
In the first stage of annotation, the 1254 hypothetical ORFs were re-identified using the Z-curve method [11]. First, the 2246 genes of known function were randomly divided into two almost equal parts. The former served as a training set to calculate Fisher coefficients, and the latter served as a test set to assess the accuracy of the algorithm. Both the training set and the test set should include positive and negative samples. In the Eca1043 genome, 80.6% of the whole DNA sequence is coding and the remaining intergenic regions are dominated by structural RNA sequences, so it is difficult to prepare an appropriate set of negative samples. Thus, the following procedure was used to produce negative samples. Each of the known genes was randomly shuffled 10 000 times, so that it was transformed into a random sequence. Shuffled sequences then served as negative samples. The detailed process of discrimination analysis has been described previously [12]. The sensitivity sn and specificity sp, which were used to evaluate the algorithm, were defined as sn = TP/(TP + FN) and sp = TN/(TN + FP), where TP, TN, FP and FN are the fractions of correct positive, correct negative, false-positive and false-negative predictions, respectively. The accuracy was defined as the average of sn and sp. After performing 10-fold cross-validation tests, the mean sensitivity, specificity and SD were obtained (Table 1). The prediction accuracy was as high as 99.58%. All positive samples in the first group and the corresponding negative samples were then merged, forming a new and larger training set. The final Fisher coefficients and thresholds were based on the larger training set. Using the final Fisher coefficients and the criterion for deciding coding/noncoding, the hypothetical ORFs in Eca1043 were re-identified. A total of 49 of the 1254 hypothetical ORFs were recognized as noncoding (Table 2).

Why are the recognized noncoding ORFs unlikely to encode proteins?

The need to fold a peptide chain into a stable and functional protein imposes rigorous constraints on coding sequences. Many constraints have been observed and the generally accepted base usage pattern
is the RḠN prototype, where R, Ḡ and N denote purine, nonguanine and any base at the first, second and third codon positions, respectively [13–17]. It is suggested that the first, second and third codon positions are associated with the biosynthetic pathway, hydrophobicity pattern and the α-helix- or β-strand-forming potentiality of the coded amino acid, respectively [13–17]. By contrast, the negative samples are shuffled sequences of function-known genes, so the frequencies of the bases at the three ‘codon’ positions are almost identical (note that the term ‘codon’ in a negative sample is meaningless). The base distribution pattern of negative sample sequences is the NNN type. The difference between the two codon types, RḠN and NNN, forms the basis of our method for distinguishing between protein-coding and noncoding ORFs.

Table 1. The genome feature, sensitivity, specificity and accuracy over 10-fold cross-validation tests for Eca1043.
GC content (%)   Length (bp)   Sensitivity^a (%)   Specificity^a (%)   Accuracy^b (%)
50.97            5 064 019     99.64 ± 0.002       99.53 ± 0.001       99.58
a ± SD. b Accuracy is defined as the average of the sensitivity and specificity.

Table 2. The synonyms of the 49 recognized noncoding ORFs.
Synonym
ECA0394  ECA0675  ECA1610  ECA2470  ECA2864  ECA3404  ECA3677
ECA0547  ECA0726  ECA1636  ECA2505  ECA2874  ECA3405  ECA3982
ECA0579  ECA1062  ECA1771  ECA2513  ECA2890  ECA3412  ECA4287
ECA0586  ECA1066  ECA2121  ECA2658  ECA2896  ECA3414  ECA4295
ECA0590  ECA1183  ECA2124  ECA2706  ECA3326  ECA3521  ECA4306
ECA0637  ECA1522  ECA2129  ECA2859  ECA3385  ECA3674  ECA4442
ECA0670  ECA1584  ECA2234  ECA2862  ECA3397  ECA3676  ECA4484

The difference between coding and noncoding sequences can be viewed intuitively using PCA. PCA describes the correlation structure among the variables of a given data set. The first derived direction is chosen to maximize the SD of the derived variable, the second is chosen to maximize the SD among directions uncorrelated with the first, and so forth [12]. Figure 1 shows the distribution of points on the principal plane spanned by the first two principal components. The coding and noncoding sequences are represented by open circles and triangles, respectively. It can be seen that the two principal axes separate the coding and noncoding sequences into two almost nonoverlapping clusters. This separation reflects the fact that base usage at the three codon positions differs markedly between coding and noncoding sequences. The recognized noncoding ORFs are represented by filled stars, which are distributed far from the core of function-known genes and close to random sequences. This implies that the 49 ORFs listed in Table 2 are unlikely to encode proteins.
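The reasoning above (codon-position base usage, shuffled negative samples, projection onto two principal axes) can be illustrated with a small self-contained sketch. Several simplifications are assumptions made here and not part of the study: the positive sequences are synthetic strings generated with an RḠN-like bias instead of real Eca1043 genes, only the nine phase-dependent single-nucleotide Z-curve variables are used instead of the full set of 21, each sequence is shuffled once, and PCA is computed directly from a singular value decomposition with NumPy. All function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def zcurve_phase_features(seq):
    """Nine phase-dependent Z-curve variables: (x, y, z) at codon positions 1, 2, 3."""
    feats = []
    for phase in range(3):
        sub = seq[phase::3]
        n = max(len(sub), 1)
        a, c, g, t = (sub.count(b) / n for b in BASES)
        feats += [(a + g) - (c + t),   # x: purine vs pyrimidine
                  (a + c) - (g + t),   # y: amino vs keto
                  (a + t) - (g + c)]   # z: weak vs strong hydrogen bonding
    return np.array(feats)


def shuffled(seq, rng):
    """Negative sample: a random permutation of the bases of one sequence."""
    return "".join(rng.permutation(list(seq)))


def toy_coding_sequence(n_codons, rng):
    """Synthetic stand-in for a gene, with codons biased toward purine, non-G, any base."""
    first = rng.choice(list(BASES), size=n_codons, p=[0.35, 0.15, 0.35, 0.15])
    second = rng.choice(list(BASES), size=n_codons, p=[0.32, 0.27, 0.09, 0.32])
    third = rng.choice(list(BASES), size=n_codons)
    return "".join(a + b + c for a, b, c in zip(first, second, third))


def pca_scores(X, k=2):
    """Project the rows of X onto the first k principal axes (SVD of centred data)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)   # fraction of total variance per axis
    return Xc @ Vt[:k].T, explained[:k]


rng = np.random.default_rng(0)
positives = [toy_coding_sequence(300, rng) for _ in range(200)]
negatives = [shuffled(s, rng) for s in positives]
X = np.array([zcurve_phase_features(s) for s in positives + negatives])

scores, explained = pca_scores(X)
pc1_pos, pc1_neg = scores[:200, 0], scores[200:, 0]
print("variance explained by the first two axes:", explained)
print("mean PC1 score, coding-like vs shuffled:", pc1_pos.mean(), pc1_neg.mean())
```

In this toy setting the shuffled sequences lose the position-specific base usage while keeping the overall composition, so the two groups are expected to separate along the first principal axis, which mirrors the qualitative picture described for Fig. 1.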
[Fig. 1 plot. x-axis: the first principal component; y-axis: the second principal component.]
In the latest version of RefSeq annotation, clusters of orthologous groups (COGs) of proteins were added to the annotation file. Each COG is a group of three or more proteins that are inferred to be orthologs, i.e. they have evolved from a common ancestor [18,19]. Computational analysis of complete microbial genomes shows that prokaryotic proteins are generally highly conserved, with ~70% of them containing ancient conserved regions shared by homologs from distantly related species [18,19]. Therefore, an annotated ORF within a COG is highly likely to be a protein-coding gene with homologs in other species. Of the 2246 genes of known function, 84.3% are included in at least one COG; the ratio decreases to 75.3% in ‘putative’ and ‘probable’ ORFs, and decreases further to 40.6% in ‘hypothetical’ ORFs. Of the 49 recognized noncoding ORFs listed in Table 2, only 4 (8.2%) contain COG codes. In addition, previous statistics have shown that over-annotation of short ORFs is one of the major problems in prokaryotic genome annotation [1]. We therefore compared the average length of the 2246 function-known genes in the first group with that of the 49 recognized noncoding ORFs. The average length of the recognized noncoding ORFs (330 bp) is much shorter than that of the function-known genes (1112 bp; Table 3). All the above evidence strongly suggests that the 49 ORFs are over-annotated short ORFs. Of course, our conclusion is only theoretical and needs to be verified by experiments.
Table 3. Average length and percentage of ORFs with COG code for 2246 function-known protein-coding genes and 49 recognized noncoding ORFs in Eca1043.
Feature               Genes with known functions   Recognized noncoding ORFs
With COG code         84.3%                        8.2%
Average length (bp)   1112                         330

Fig. 1. The distribution of points on the principal plane spanned by the first (x) and second (y) principal axes using PCA in Eca1043. Open circles represent the function-known genes, open triangles represent the corresponding negative samples and filled stars denote ORFs recognized as noncoding. The first and second principal axes account for 26.2% and 22.3% of the total inertia of the 21-dimensional space, respectively. Note that the distribution of the open circles is separate from that of the open triangles, indicating that coding and noncoding sequences are well distinguished. Furthermore, most of the identified noncoding ORFs are far from the core of open circles, and close to the core of open triangles, implying that the 49 recognized noncoding ORFs listed in Table 2 are unlikely to encode proteins.
L.-L. Chen et al. Hypothetical ORFs reannotation in Eca1043
Function annotation of the hypothetical genes
After identifying the 49 noncoding ORFs, the next step was to assign functions to the remaining hypothetical genes. In the original annotation of the Eca1043 genome, although the authors queried all the ORFs against the complete set of ORFs from 64 selected fully annotated bacterial genomes obtained from the National Center for Biotechnology Information (NCBI) to determine their functions [3], more than a quarter of the annotated genes still had no functional information. Three years have passed and now > 500 complete bacterial and archaeal genomes are annotated in the NCBI, so a large number of new functional genes can be obtained from public databases. Furthermore, many studies with knowledge of Eca1043 genes have been published in the last 3 years. All this information provides valuable resources for assigning functions to a large number of hypothetical genes. After collecting all this information and systematically searching nonredundant nucleotide and protein databases, functions were assigned to 109 hypothetical genes with high reliability; the synonyms, protein lengths, E values, identities and predicted functions (products) are listed in Table 4. The aligned length covered at least 80% of each gene, with identity ≥ 80% and E value < 1e-20. Furthermore, the functions of another 318 hypothetical genes were assigned with query coverage > 80%, identity > 30% and E value < 1e-10 (see supplementary Table S1).

The predicted functions of the above 427 hypothetical genes were summarized according to COG functional categories. The latest version of COG is classified into 25 functional categories and each category is symbolized by a capital letter: J, A, K, L, B, D, Y, V, T, M, N, Z, W, O, U, C, G, E, F, H, I, P, Q, R and S. Details of the functions of the codes are listed in Table 5. The 25 functional categories are summarized into four functional groups. According to the COG functional category, 48, 79, 97 and 167 newly annotated hypothetical genes belong to the ‘information storage and processing’, ‘cellular processes and signaling’, ‘metabolism’ and ‘poorly characterized’ groups, respectively. Detailed information about annotated hypothetical genes in each COG functional category is summarized in Table 5. Of the 427 newly annotated genes, 50 can be classified into two or more functional categories and 36 cannot be assigned to any category.

As pointed out by Bell et al., Eca1043 has the ability to use a range of different nutrients to adapt to diverse environments [3]. In the original annotation study, 80 putative ABC transporters, 36 putative methyl-accepting chemotaxis protein genes and 336 putative regulators were annotated, which supports the view that Eca1043 is able to respond to a wide range of nutrient sources and live in different environments [3]. In our analysis, more genes associated with a variety of lifestyles and habitats for Eca1043 were identified, including 23 transporters, 17 regulators, 15 transferases and 1 methyl-accepting chemotaxis protein. Furthermore, in addition to the 427 newly annotated hypothetical genes, 114 hypothetical genes were recognized as putative ‘membrane proteins’ and 86 as ‘exported proteins’, which are detailed in supplementary Tables S2 and S3, respectively. It is highly possible that some of the putative ‘exported proteins’ are related to the pathogenicity of Eca1043.

In conclusion, 1254 hypothetical ORFs in the important plant pathogen Eca1043 are reannotated in this analysis. First, 49 originally annotated hypothetical ORFs are recognized as noncoding ORFs using a methodology based on the Z-curve method. The recognized noncoding ORFs are very unlikely to encode proteins, as supported by PCA evidence, average length distribution and COG functional category occupation. Second, using sequence-alignment tools and some functional resources, potential functions for 427 hypothetical genes have been predicted. Moreover, 114 and 86 hypothetical genes are recognized as putative ‘membrane proteins’ and ‘exported proteins’, respectively. Therefore, the number of hypothetical genes decreases to 578. These results provide more information than the earlier annotation, and will benefit research into the lifestyle, metabolism and pathogenicity of this important plant pathogen.

Experimental procedures

The length of the Eca1043 genome is ~5.06 Mb and the original annotation was submitted to GenBank (accession number BX950851) in July 2004 [3]. Subsequently, a curated annotation was made available by RefSeq at NCBI (NC_004547). The number of annotated ORFs in the two databases is the same. The sequence and annotation files analyzed in this study were downloaded from NCBI RefSeq (updated 9 February 2007) and the number of annotated ORFs was 4472. Among them, two ORFs (ECA0773 and ECA2198) have lengths that are not divisible by three, which indicates that they are not protein-coding genes, and they were therefore excluded from this analysis. The remaining 4470 ORFs can be classified into two groups: the first contains 2246 genes with confirmed functions and 970 genes with ‘putative’ or ‘probable’ functions, of which the 2246 function-confirmed genes are used for training the parameters; the second group contains 1254 hypothetical ORFs, whose
Table 4. Synonyms, COG functional categories and predicted functions (products) of 109 Eca1043 hypothetical genes with BLAST search identity ≥ 80%, E value < 1e-20 and aligned length covering at least 80% of each gene.
Synonym  E value  Length (aa)^a  Functional category^b  Identity (%)  Product
ECA0018  2e-39    89             S                       98            YihD
ECA0019  0.0      311            R                       95            YihE
ECA0054  1e-107   248            QR                      82            SAM-dependent methyltransferases
ECA0061  6e-128   280            R                       80            Protein involved in catabolism of external DNA
ECA0063  2e-38    95             JD                      88            Addiction module toxin, RelE/StbE family
ECA0064  2e-31    83             D                       87            Antitoxin of toxin-antitoxin stability system
ECA0130  7e-118   287            S                       87            YicC N-terminal domain protein
ECA0264  1e-166   333            R                       83            Twin-arginine translocation pathway signal
ECA0285  3e-153   285            R                       96            Predicted P-loop-containing kinase
ECA0293  2e-151   357            MR                      81            Predicted sugar phosphate isomerase involved in capsule formation
ECA0296  5e-113   260            Q                       89            ABC-type transport system involved in resistance to organic solvents, permease component
ECA0298  9e-92    209            Q                       82            ABC-type transport system involved in resistance to organic solvents, auxiliary component
ECA0313  2e-147   309            R                       82            Putative Fe-S oxidoreductase
ECA0327  8e-127   261            S                       84            Extradiol ring-cleavage dioxygenase, class III enzyme, subunit B
ECA0338  5e-133   287            G                       80            Fructose-bisphosphate aldolase, class II family
ECA0383  1e-70    155            G                       85            Beta-galactosidase, beta subunit
ECA0420  0.0      373            S                       84            Cupin 4 family protein
ECA0444  3e-27    83             D                       87            YefM protein
ECA0512  2e-27    80             K                       90            Putative regulatory protein
ECA0631  4e-17    73             S                       88            CsbD-like family
ECA0636  7e-45    131            S                       82            DoxD-like family protein
ECA0696  2e-35    97             J                       88            Putative RNA-binding protein
ECA0710  7e-74    140            S                       99            YhbC-like protein
ECA0721  2e-143   292            O                       81            Collagenase and related proteases
ECA0757  7e-79    215            R                       83            Putative oxidoreductase
ECA0837  2e-66    148            I                       86            Oligoketide cyclase/lipid transport protein
ECA0882  3e-137   299            P                       80            Dyp-type peroxidase family
ECA0971  8e-142   277            R                       90            Predicted TIM-barrel enzyme, possibly a dioxygenase
ECA0975  9e-39    90             CO                      89            Fe(II) trafficking protein YggX
ECA0983  2e-92    238            P                       82            Membrane protein TerC, possibly involved in tellurium resistance
ECA1010  9e-136   272            H                       89            HesA/MoeB/ThiF family protein
ECA1024  2e-30    116            S                       88            tRNA pseudouridine synthase C
ECA1071  7e-25    75             K                       80            DNA-directed RNA polymerase, subunit M/Transcription elongation factor TFIIS
ECA1125  2e-67    149            K                       91            Ribonucleotide reductase regulator NrdR-like
ECA1155  4e-112   231            L                       85            ExsB protein
ECA1191  9e-64    159            S                       80            YbaK/ebsC protein
ECA1196  5e-138   304            O                       88            Membrane protease subunits, stomatin/prohibitin homologs
ECA1317  1e-61    159            R                       81            Predicted metal-dependent hydrolase
ECA1319  0.0      474            J                       88            tRNA-methylthiotransferase MiaB protein
ECA1333  2e-22    93             –                       84            LexA regulated, putative SOS response
ECA1405  6e-167   358            G                       84            Putative sugar ABC transporter
ECA1410  1e-172   369            D                       81            Mrp protein
ECA1578  1e-66    179            S                       86            Nucleoprotein/polynucleotide-associated enzyme
ECA1585  0.0      512            R                       81            Deoxyribodipyrimidine photolyase-like protein
ECA1645  4e-116   254            –                       88            Putative plasmid replication protein
ECA1663  1e-30    94             K                       86            Predicted transcriptional regulators
ECA1684  2e-85    321            GER                     90            Permeases of the drug/metabolite transporter (DMT) superfamily
ECA1762  2e-50    109            P                       92            Sulfite reductase
ECA1763  5e-73    219            R                       83            Putative transport protein
ECA1781  0.0      382            R                       83            Rhodanese-like domain protein
ECA1782  4e-82    192            S                       81            Protein yceI
ECA1809  2e-50    116            FGR                     85            Histidine triad (HIT) protein
ECA1814  3e-85    180            R                       82            Predicted esterase
ECA1816  3e-57    189            S                       87            Predicted outer membrane lipoprotein
ECA1860  4e-188   499            O                       85            FeS assembly protein SufB
ECA1927  3e-51    116            O                       87            Glutaredoxin-like protein
ECA1956  1e-177   389            G                       80            Predicted N-acetylglucosaminyl transferase
ECA1958  4e-34    109            J                       87            Translation initiation factor SUI1
ECA1986  0.0      465            R                       82            Predicted ATPase
ECA1995  3e-165   311            D                       89            YdaO
ECA2292  1e-96    206            J                       84            Putative translation factor (SUA5)
ECA2348  0.0      644            T                       94            Putative Ser protein kinase
ECA2359  0.0      513            S                       85            Putative sporulation protein
ECA2367  2e-30    95             S                       82            Protein ycgL
ECA2464  0.0      484            J                       10            Ribosomal RNA small subunit methyltransferase F
ECA2511  3e-116   259            QR                      80            Predicted methyltransferase
ECA2512  7e-158   327            QR                      81            SAM-dependent methyltransferases
ECA2525  9e-163   401            GEPR                    82            Permeases of the major facilitator superfamily
ECA2529  2e-120   245            ER                      84            Histidinol phosphatase and related hydrolases of the PHP family
ECA2560  6e-33    105            S                       85            Putative alpha helix protein
ECA2683  0.0      562            R                       89            TrkA, potassium channel-family protein
ECA2708  0.0      442            S                       88            Putative FeS oxidoreductase
ECA2777  9e-32    109            –                       80            Putative phage-related exported protein
ECA2812  6e-96    235            R                       87            Integral membrane protein, interacts with FtsH
ECA2977  3e-96    177            G                       96            ABC-type sugar transport system, periplasmic component
ECA3034  4e-86    199            R                       86            Predicted hydrolases of HD superfamily
ECA3037  1e-77    164            S                       82            YfbU family protein
ECA3057  2e-95    220            S                       85            DedA protein (dsg-1 protein)
ECA3059  6e-151   336            E                       80            Putative aspartate-semialdehyde dehydrogenase
ECA3070  3e-155   310            J                       82            Adenine-specific methylase
ECA3087  2e-123   263            –                       81            Necrosis-inducing protein
ECA3115  5e-205   413            E                       93            Aspartate/tyrosine/aromatic aminotransferase
ECA3135  2e-59    133            S                       80            Cupin 2, conserved barrel
ECA3223  0.0      398            R                       85            Radical SAM enzyme, Cfr family
ECA3262  2e-151   308            R                       99            N-acetylmuramic acid 6-phosphate etherase
ECA3288  5e-56    127            R                       85            Autonomous glycyl radical cofactor
ECA3306  2e-52    115            S                       93            Iron–sulfur cluster assembly accessory protein
ECA3361  1e-108   264            R                       82            Cytochrome c assembly protein
ECA3382  6e-40    95             C                       92            Rhs protein
ECA3428  1e-88    172            S                       88            Putative hemolysin co-regulated protein
ECA3472  3e-137   255            R                       86            Predicted glutamine amidotransferase
ECA3475  2e-55    116            V                       90            HNH endonuclease
ECA3487  1e-97    205            G                       83            Probable dehydratase
ECA3523  7e-87    187            E                       82            D,D-heptose 1,7-bisphosphate phosphatase
ECA3623  5e-65    141            K                       86            Putative negative regulator
ECA3774  8e-31    95             G                       88            Phosphotransferase enzyme II, B component
ECA3792  0.0      477            G                       84            Na+/melibiose symporter and related transporters
ECA3824  2e-77    152            S                       94            MraZ protein
ECA3860  2e-51    125            P                       80            ApaG protein
ECA3877  5e-144   313            H                       81            FAD synthase
ECA3894  4e-66    160            S                       82            CreA protein
ECA4059  1e-73    200            E                       86            Lysine exporter protein
ECA4063  2e-61    135            O                       88            Predicted redox protein, regulator of disulfide bond formation
ECA4134  2e-97    191            O                       91            HesB/YadR/YfhF
ECA4157  3e-64    197            U                       81            Multiple antibiotic resistance (MarC)-related protein
ECA4162  5e-112   231            R                       83            Pirin-related protein
ECA4275  1e-88    172            S                       89            Putative hemolysin co-regulated protein
ECA4329  2e-94    218            G                       81            Class II aldolase/adducin domain protein
ECA4353  4e-30    81             O                       85            SirA protein
a Amino acid length of each hypothetical gene. b The 25 functional categories in the COG database, i.e. J, A, K, L, B, D, Y, V, T, M, N, Z, W, O, U, C, G, E, F, H, I, P, Q, R and S. In addition, ‘–’ denotes that the corresponding gene cannot be assigned to any COG category.
Table 5. The number of newly annotated hypothetical genes in each of the 25 COG functional categories.
Group                                 Code  Description                                                     Number^a
Information storage and processing    J     Translation, ribosomal structure and biogenesis                 9
                                      A     RNA processing and modification                                 –
                                      K     Transcription                                                   24
                                      L     Replication, recombination and repair                           15
                                      B     Chromatin structure and dynamics                                –
Cellular processes and signaling      D     Cell cycle control, cell division, chromosome partitioning      12
                                      Y     Nuclear structure                                               –
                                      V     Defense mechanisms                                              6
                                      T     Signal transduction mechanisms                                  16
                                      M     Cell wall/membrane/envelope biogenesis                          18
                                      N     Cell motility                                                   2
                                      Z     Cytoskeleton                                                    –
                                      W     Extracellular structures                                        –
                                      U     Intracellular trafficking, secretion, and vesicular transport   6
                                      O     Posttranslational modification, protein turnover, chaperones    19
Metabolism                            C     Energy production and conversion                                6
                                      G     Carbohydrate transport and metabolism                           33
                                      E     Amino acid transport and metabolism                             23
                                      F     Nucleotide transport and metabolism                             4
                                      H     Coenzyme transport and metabolism                               4
                                      I     Lipid transport and metabolism                                  5
                                      P     Inorganic ion transport and metabolism                          13
                                      Q     Secondary metabolites biosynthesis, transport and catabolism    9
Poorly characterized                  R     General function prediction only                                81
                                      S     Function unknown                                                86
a ‘–’ indicates there is no newly annotated gene in this COG functional category.
coding status is identified and their functions predicted in this analysis.

Re-recognizing hypothetical ORFs

The method adopted here is based on the Z-curve of DNA sequence [11], which has been successfully used to find genes in microbial [20,21] and eukaryotic [22,23] genomes. In this analysis, 21 Z-curve variables were adopted, including nine variables of phase-dependent single nucleotides and 12 of phase-independent dinucleotides. For details about these variables, please refer to Gao and Zhang [23]. The Fisher linear discrimination algorithm was used to differentiate protein-coding and noncoding sequences; the procedure was as detailed previously [20,23].

Assigning functions to hypothetical genes

Hypothetical ORFs were compared with nucleotide and protein sequences in public nonredundant databases using alignment tools such as BLAST [6,7] and FASTA [8]. Other functional assignment resources, such as InterPro [9] and KEGG [10], were also used. Furthermore, studies from the past 3 years with information about Eca1043 were collected and used to manually assign functions to some hypothetical ORFs.

Acknowledgements

The authors wish to thank Professor Hong-Yu Zhang and Dr Hong-Yu Ou for their valuable suggestions.
The study was supported by the National Natural Science Foundation of China (30600119), the National Basic Research Program of China (2003CB114400) and the scientific research funds of the Shandong University of Technology (grant 2004KJM29 and 04KQ14).

References

1 Nielsen P & Krogh A (2005) Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21, 4322–4329.
2 Skovgaard M, Jensen LJ, Brunak S, Ussery D & Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17, 425–428.
3 Bell KS, Sebaihia M, Pritchard L, Holden MT, Hyman LJ, Holeva MC, Thomson NR, Bentley SD, Churcher LJ, Mungall K et al. (2004) Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc Natl Acad Sci USA 101, 11105–11110.
4 Pérombelon MCM (2002) Potato diseases caused by soft rot erwinias: an overview of pathogenesis. Plant Pathol 51, 1–12.
5 Toth IK, Bell KS, Holeva MC & Birch PRJ (2003) Soft rot erwiniae: from genes to genomes. Mol Plant Pathol 4, 17–30.
6 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
7 Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV & Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29, 2994–3005.
8 Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183, 63–98.
9 Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R et al. (2007) New developments in the InterPro database. Nucleic Acids Res 35, D224–D228.
10 Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M & Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354–D357.
11 Zhang CT & Zhang R (1991) Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res 19, 6313–6317.
12 Dillon WR & Goldstein M (1984) Multivariate Analysis – Methods and Applications (Wiley Series in Probability and Mathematical Statistics). Wiley, New York, NY.
13 Trifonov EN (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J Mol Biol 194, 643–652.
14 Zhang CT & Chou KC (1994) A graphic approach to analyzing codon usage in 1562 E. coli protein coding sequences. J Mol Biol 238, 1–8.
15 Gupta SK, Majumdar S, Bhattacharya TK & Ghosh TC (2000) Studies on the relationships between the synonymous codon usage and protein secondary structural units. Biochem Biophys Res Commun 269, 692–696.
16 Pan A, Dutta C & Das J (1998) Codon usage in highly expressed genes of Haemophillus influenzae and Mycobacterium tuberculosis: translational selection versus mutational bias. Gene 215, 405–413.
17 Chiusano ML, Alvarez-Valin F, Di Giulio M, D’Onofrio G, Ammirato G, Colonna G & Bernardi G (2000) Second codon positions of genes and the secondary structures of proteins. Relationships and implications for the origin of the genetic code. Gene 261, 63–69.
18 Tatusov RL, Koonin EV & Lipman DJ (1997) A genomic perspective on protein families. Science 278, 631–637.
19 Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
20 Chen LL & Zhang CT (2003) Gene recognition from questionable ORFs in bacterial and archaeal genomes. J Biomol Struct Dyn 21, 99–110.
21 Guo FB, Ou HY & Zhang CT (2003) ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 31, 1780–1789.
22 Zhang CT & Wang J (2000) Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. Nucleic Acids Res 28, 2804–2814.
23 Gao F & Zhang CT (2004) Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 20, 673–681.

Supplementary material

The following supplementary material is available online: Table S1. Synonyms, COG functional categories and predicted functions (products) of 318 Eca hypothetical
genes with BLAST search identity > 30%, E value < 1e-10 and aligned length covering at least 80% of each gene. Table S2. Synonyms of the 114 recognized membrane proteins. Table S3. Synonyms of the 86 recognized exported proteins.

This material is available as part of the online article from http://www.blackwell-synergy.com

Please note: Blackwell Publishing are not responsible for the content or functionality of any supplementary materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.