Fast Model-based Protein Homology Discovery without Alignment
DOI: https://doi.org/10.18034/apjee.v1i2.580
Keywords: Protein homology discovery, Support vector machines (SVMs), Homology detection, LSTM Network
Abstract
The need for fast gene classification tools is growing as more genomes are sequenced. To evaluate a newly sequenced genome, the genes must first be identified and translated into amino acid sequences, which are then classified into structural or functional classes. Protein homology detection using sequence alignment algorithms is the most effective approach to protein classification. Discriminative approaches such as support vector machines (SVMs), as well as position-specific scoring matrices (PSSMs) derived from PSI-BLAST, have recently been used to improve alignment algorithms. However, alignment algorithms are time-consuming when a new sequence must be compared with a large number of known sequences, and the same is true for SVMs. Building a PSSM for the new sequence is even more time-consuming. On today's computers, the best-performing approaches would take roughly 25 hours to classify the sequences of a new genome (20,000 genes) as belonging to a single class; there are, however, hundreds of classes to choose from. Another flaw of alignment algorithms is that they do not construct a model of the positive class but instead measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only common classification approaches that build a model of the positive class, but they have poor classification performance. The advantage of a model is that it can be analyzed for chemical features shared by all members of the class, giving new insights into protein function and structure. We applied LSTM to a well-known remote protein homology detection benchmark, in which a protein must be classified as a member of a SCOP superfamily. LSTM achieves state-of-the-art classification performance while being significantly faster than other algorithms with similar classification performance.
LSTM is five orders of magnitude faster than methods that perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). To test the modeling capabilities of the algorithm, we applied LSTM to PROSITE classes and analyzed the derived patterns. Because it does not require established similarity metrics such as BLOSUM or PAM matrices, LSTM is complementary to alignment-based techniques. LSTM retrieved the PROSITE motif in 8 out of 15 classes. In the remaining seven classes, it generated alternative motifs that, on average, outperform the PROSITE motifs in classification.
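To make "without alignment" concrete, the following is a minimal sketch of how a recurrent model can score a protein in a single left-to-right pass, with no comparison against stored sequences. It is not the authors' implementation. The cell is a generic LSTM with input, forget, and output gates, the hidden size of 8 is arbitrary, the weights are random and untrained, and the names `init_params` and `lstm_score` are hypothetical. The point is only that the cost of scoring grows linearly with sequence length and does not depend on the number of known sequences.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dim indicator vector (unknown -> all zeros)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(n_in, n_hidden, seed=0):
    """Random, UNTRAINED weights: placeholders for illustration only."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
    gates = {g: matrix(n_hidden, n_in + n_hidden) for g in ("i", "f", "o", "g")}
    biases = {g: [0.0] * n_hidden for g in gates}
    readout = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return gates, biases, readout


def matvec(m, v, b):
    """Return m @ v + b for a list-of-lists matrix and list vectors."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(m, b)]


def lstm_score(sequence, params, n_hidden):
    """Scan the sequence once, left to right, and return a score in (0, 1)."""
    gates, biases, readout = params
    h = [0.0] * n_hidden  # hidden state
    c = [0.0] * n_hidden  # memory cell state
    for residue in sequence:
        z = one_hot(residue) + h  # concatenated input [x_t, h_{t-1}]
        i = [sigmoid(a) for a in matvec(gates["i"], z, biases["i"])]
        f = [sigmoid(a) for a in matvec(gates["f"], z, biases["f"])]
        o = [sigmoid(a) for a in matvec(gates["o"], z, biases["o"])]
        g = [math.tanh(a) for a in matvec(gates["g"], z, biases["g"])]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    # Logistic readout of the final hidden state: class-membership score.
    return sigmoid(sum(w * hk for w, hk in zip(readout, h)))


N_HIDDEN = 8  # arbitrary size chosen for the sketch
params = init_params(len(AMINO_ACIDS), N_HIDDEN)
score = lstm_score("MKTAYIAKQRQISFVKSHFSRQ", params, N_HIDDEN)
print(f"membership score (untrained weights): {score:.3f}")
```

In a trained model of this kind, one such network per class would replace the database search: classifying a new sequence means running the scan once per class rather than aligning it against every known member.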