Design and evaluation of CATPA

Mehmet M. Dalkilic; Arijit Sengupta

doi:10.1145/1066677.1066721


2005, Proceedings of the 2005 ACM symposium on Applied computing - SAC '05

https://doi.org/10.1145/1066677.1066721


Abstract

We present a new application for experimental biologists, the Curation Alignment Tool for Protein Analysis (CATPA), that allows for the efficient and effective creation, storage, management, and querying of experimentally curated protein families. As the number of discovered genomic and proteomic sequences outpaces our ability to understand them, the experimental biologist, our primary link to a fundamental understanding of genomic and proteomic information, is left further behind in the race to automate and semi-automate information discovery.

Key takeaways

CATPA enhances curation by allowing efficient management and querying of protein families.
The system integrates both curation and annotation, addressing gaps in existing bioinformatics tools.
CATPA utilizes a Java GUI and MySQL for user-friendly interaction and robust data management.
An evaluation with 47 subjects demonstrated CATPA's superior efficiency over PFAAT in multiple tasks.
Standardized vocabulary via Gene Ontology improves data sharing and search accuracy for biologists.

Figures (10)

Fig. 1. The spectrum of adding information from curation to annotation.

Fig. 2. The ER Model for the CATPA architecture

Fig. 3. The CATPA user interface. The alignment shows conserved regions (shaded).

Fig. 4. CATPA Interface: Browsing GO terms

Fig. 6. CATPA Interface: Querying results in CATPA. The workplace shows a point (with the mouse arrow) that when clicked moves to the residue in the alignment below.

Table I. Demographic composition of subjects

Table II. Demographic composition of subjects

Table III. Demographic composition of subjects

Table IV. Subject verbal protocol comments and interpretations

Submitted for review, November 2003.
