New Feature Extraction Methods for Classifying Proteins from Amino Acid Sequences.

Date of Submission

December 2003

Date of Award

Winter 12-12-2004

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science


Machine Intelligence Unit (MIU-Kolkata)


Bandyopadhyay, Sanghamitra (MIU-Kolkata; ISI)

Abstract (Summary of the Work)

Bioinformatics is an interdisciplinary area of Computer Science and Molecular Biology that has developed in recent years [2]. It was necessitated by the ever-increasing amount of raw data generated and routinely collected by molecular biologists, a result of the Human Genome Project and similar efforts, along with the dramatic evolution of technology for information storage and retrieval. In response, a number of researchers have developed techniques to interpret the data and discover concepts in DNA, RNA and protein databases. An important problem in Bioinformatics is the classification of protein sequences. Proteins are chains of amino acids and form the basic building blocks of a living organism. Classifying a protein sequence allows one to infer the structure and function of the protein. Perhaps the most important practical application of such knowledge is in drug discovery. A primary challenge in classifying protein sequences lies in the proper extraction of a feature vector: a good input representation is crucial for proper classification of the proteins. In this dissertation we propose new feature extraction methods for classifying proteins from amino acid sequences.

This chapter is organized as follows. Section 1.1 gives the basic concepts of Molecular Biology: the basic structure and function of proteins and nucleic acids, the mechanism of molecular genetics, and other related terminology that appears in research work on Bioinformatics. Section 1.2 gives an overview of the existing biological sequence databases. Section 1.3 details what Bioinformatics is and the role of a computer scientist in the field of molecular biology. Finally, Section 1.4 presents conclusions and the organization of the thesis.

1.1 Basic Concepts of Molecular Biology

All living things are made up of tiny living parts called cells. Similar cells join to form tissues. Similar tissues organize themselves to form organs. Similar organs arrange themselves to form an organism. Thus, at the molecular level, both complex and simple organisms have a similar chemistry. The main actors in the chemistry of life are molecules called proteins and nucleic acids. Roughly speaking, proteins are responsible for what a living being is and what it does. The distinguished scientist Russell Doolittle once wrote, "we are our proteins". Nucleic acids, on the other hand, encode the information necessary to produce proteins and are responsible for passing this recipe along to subsequent generations. Molecular biology research [1, 23, 24] is basically devoted to understanding the structure and function of proteins and nucleic acids. In the following section we provide a preliminary discussion of the structure and function of proteins.

Proteins [25] are basically the tissue building blocks of a living being. Proteins are large organic molecules and are among the most important components in the cells of an organism. They are more diverse in structure and function than any other kind of molecule. Structural proteins act as tissue building blocks, whereas other proteins, known as enzymes, act as catalysts of chemical reactions occurring inside any living organism.


Creative Commons License

Creative Commons Attribution 4.0 International License

