On the Encoding of Proteins for Disordered Regions Prediction
2013, PLOS ONE
https://doi.org/10.1371/JOURNAL.PONE.0082252
Abstract
Disordered regions, i.e., regions of proteins that do not adopt a stable three-dimensional structure, have been shown to play varied and critical roles in many biological processes. Predicting and understanding their formation is therefore a key subproblem of protein structure and function inference. A wide range of machine learning approaches has been developed to automatically predict disordered regions of proteins. A key factor in the success of these methods is the way in which protein information is encoded into features. We recently proposed a systematic methodology to study the relevance of various feature encodings in the context of disulfide connectivity pattern prediction. In the present paper, we adapt this methodology to the problem of predicting disordered regions and assess it on proteins from the 10th CASP competition, as well as on a very large subset of proteins extracted from PDB. Our results, obtained with ensembles of extremely randomized trees, highlight a novel feature function that encodes the proximity of residues according to their accessibility to the solvent and that plays the second most important role in the prediction of disordered regions, just after evolutionary information. Furthermore, even though our approach treats each residue independently, our results are very competitive with the state of the art in terms of accuracy. A web application is available at x3Disorder.
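To make the idea of encoding protein information into per-residue features concrete, the following is a minimal sketch of one common encoding: a one-hot sliding window over the amino acid sequence. It is an illustration only and is not the feature function studied in the paper; the names `one_hot`, `window_features`, `encode_protein` and the parameter `half_width` are hypothetical. Each residue gets its own feature vector, which matches the setting in which residues are treated independently.

```python
# Hypothetical illustration: one-hot sliding-window encoding of a protein
# sequence, producing one independent feature vector per residue.
# This is NOT the paper's feature function, only a simple stand-in.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Return a 20-dimensional indicator vector (all zeros for padding or unknown)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def window_features(sequence, position, half_width=2):
    """Concatenate one-hot vectors for the window centred on `position`."""
    features = []
    for offset in range(-half_width, half_width + 1):
        i = position + offset
        # Positions outside the sequence are encoded as padding (all zeros).
        residue = sequence[i] if 0 <= i < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features


def encode_protein(sequence, half_width=2):
    """Return one feature vector per residue of `sequence`."""
    return [window_features(sequence, p, half_width) for p in range(len(sequence))]


if __name__ == "__main__":
    rows = encode_protein("MKVLA", half_width=1)
    print(len(rows), "residues,", len(rows[0]), "features each")
```

Each resulting vector could then be passed, together with an ordered/disordered label, to any per-residue classifier, for example an ensemble of extremely randomized trees.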
References (37)
- Becker J, Maes F, Wehenkel L (2013) On the relevance of sophisticated structural annotations for disulfide connectivity pattern prediction. PLoS One 8: e56621.
- Uversky VN, Oldfield CJ, Dunker AK (2005) Showing your id: intrinsic disorder as an id for recognition, regulation and cell signaling. Journal of Molecular Recognition 18: 343-384.
- Uversky VN (2009) The mysterious unfoldome: structureless, underappreciated, yet vital part of any given proteome. BioMed Research International 2010.
- Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology 6: 197-208.
- Wootton JC (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Computers & Chemistry 18: 269-285.
- Jones D, Ward J (2003) Prediction of disordered regions in proteins from position specific score matrices. Proteins: Structure, Function, and Bioinformatics 53: 573-578.
- Deng X, Eickholt J, Cheng J (2009) Predisorder: ab initio sequence-based prediction of protein disordered regions. BMC Bioinformatics 10: 436.
- Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker A, et al. (2005) Optimizing long intrinsic disorder predictors with protein evolutionary information. Journal of Bioinformatics and Computational Biology 3: 35-60.
- Yang Z, Thomson R, McNeil P, Esnouf R (2005) Ronn: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21: 3369-3376.
- Zhang T, Faraggi E, Xue B, Dunker A, Uversky V, et al. (2012) Spine-d: accurate prediction of short and long disordered regions by a single neural-network based method. Journal of Biomolecular Structure and Dynamics 29: 799-813.
- Shimizu K, Hirose S, Noguchi T (2007) Poodle-s: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 23: 2337-2338.
- Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T (2007) Poodle-l: a two-level svm prediction system for reliably predicting long disordered regions. Bioinformatics 23: 2046-2053.
- Shimizu K, Muraoka Y, Hirose S, Tomii K, Noguchi T (2007) Predicting mostly disordered proteins by using structure-unknown protein data. BMC Bioinformatics 8: 78.
- Vullo A, Bortolami O, Pollastri G, Tosatto S (2006) Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Research 34: W164-W168.
- Ishida T, Kinoshita K (2008) Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 24: 1344-1348.
- Figure 6. ROC curves on PDB30 dataset. ROC curve of our method (Becker et al.) and three freely downloadable predictors: DISOPRED2 [34], ESpritz [36] and IUPred [35]. doi:10.1371/journal.pone.0082252.g006
- Mizianty M, Stach W, Chen K, Kedarisetti K, Disfani F, et al. (2010) Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 26: i489-i496.
- Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A (2011) Evaluation of disorder predictions in casp9. Proteins: Structure, Function, and Bioinformatics 79: 107-118.
- Deng X, Eickholt J, Cheng J (2012) A comprehensive overview of computational protein disorder prediction methods. Molecular BioSystems 8: 114-121.
- Cheng J, Sweredoski MJ, Baldi P (2005) Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery 11: 213-222.
- Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, et al. (2000) The protein data bank. Nucleic Acids Research 28: 235-242.
- Mika S, Rost B (2003) Uniqueprot: creating representative protein sequence sets. Nucleic Acids Research: 3789-3791.
- Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Bioinformatics 9: 56-68.
- Dondoshansky I, Wolf Y (2002) Blastclust (ncbi software development toolkit). NCBI, Bethesda, Md.
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, et al. (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Research 25: 3389-3402.
- Pruitt K, Tatusova T, Brown G, Maglott D (2012) Ncbi reference sequences (refseq): current status, new features and genome annotation policy. Nucleic Acids Research 40: D130-D135.
- Cheng J, Randall AZ, Sweredoski MJ, Baldi P (2005) Scratch: a protein structure and structural feature prediction server. Nucleic Acids Research 33: W72-W76.
- Wang L, Sauer UH (2008) Ond-crf: predicting order and disorder in proteins using conditional random fields. Bioinformatics 24: 1401-1402.
- Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN (2010) Pondr-fit: a meta-predictor of intrinsically disordered amino acids. Biochimica et Biophysica Acta (BBA)-Proteins & Proteomics 1804: 996-1010.
- Kozlowski L, Bujnicki J (2012) Metadisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13: 111.
- Kim H, Park H (2003) Protein secondary structure prediction based on an improved support vector machines approach. Protein Engineering 16: 553-560.
- Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Machine Learning 63: 3-42.
- Breiman L (2001) Random forests. In: Machine Learning. pp. 5-32.
- Eickholt J, Cheng J (2013) Dndisorder: predicting protein disorder using boosting and deep networks. BMC Bioinformatics 14: 88.
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. Journal of Molecular Biology 337: 635-645.
- Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of Molecular Biology 347: 827-839.
- Walsh I, Martin AJ, Di Domenico T, Tosatto SC (2012) Espritz: accurate and fast prediction of protein disorder. Bioinformatics 28: 503-509.