Knowledge Discovery from Gene Expression Data in a Computational Intelligent Framework: Identifying marker genes and Cancer Subtypes.

Date of Submission

December 2005

Date of Award

Winter 12-12-2006

Institute Name (Publisher)

Indian Statistical Institute

Document Type

Master's Dissertation

Degree Name

Master of Technology

Subject Name

Computer Science

Department

Electronics and Communication Sciences Unit (ECSU-Kolkata)

Supervisor

Pal, Nikhil Ranjan (ECSU-Kolkata; ISI)

Abstract (Summary of the Work)

There has been substantial improvement in cancer classification over the last decades. However, there is no general approach for identifying new cancer types (class discovery) or for assigning a tumor to known classes (class prediction), and there is no well-accepted method for identifying "marker genes". Traditional histological classification of cancer subtypes is informative but incomplete. Recent studies of gene expression suggest that molecular classification can be used for effective diagnosis and for prediction of cancer type and treatment outcome. Here, we study microarray gene expression data of lung cancer with a view to discovering two types of knowledge: cancer subtypes (class discovery) and marker genes. In the context of the first problem, the effect of various normalization schemes is studied in conjunction with different clustering algorithms. Experimentally, we found that, because of the high dimensionality of the data, c-means type clustering algorithms and their variants are not very effective. So we applied a feature selection algorithm to reduce the dimension. Typically, researchers use unsupervised feature selection for class discovery. Such features do not ensure class-discriminating power. So we take a different route. We first find the marker genes. These reduce the dimensionality while retaining the class-discriminating power of the genes. These marker genes are then used to discover cancer subtypes. We found that just nine informative features (genes) suffice to preserve the discriminating power. However, for class discovery we considered fifteen important features. Application of various clustering algorithms (c-means, fuzzy c-means and the Gustafson-Kessel method) and the self-organizing feature map (SOFM) finds clusters that are generally consistent with the histological classification. The analysis reveals previously defined types, subtypes and many additional details of lung cancer.
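
The "marker genes first, class discovery second" route described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the feature selection algorithm or the normalization scheme, so a Fisher-like between/within variance score and per-gene z-scoring stand in for them, and the function names (zscore_genes, rank_genes, c_means) and the toy expression values are hypothetical, not taken from the dissertation.

```python
import random
from statistics import mean, pstdev


def zscore_genes(samples):
    """Standardize each gene (column) to zero mean and unit variance."""
    columns = list(zip(*samples))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in samples]


def rank_genes(samples, labels):
    """Rank genes by a Fisher-like score (assumed stand-in for the
    dissertation's unnamed supervised feature selection method)."""
    classes = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        values = [row[g] for row in samples]
        overall = mean(values)
        between = within = 0.0
        for c in classes:
            group = [v for v, y in zip(values, labels) if y == c]
            mu = mean(group)
            between += len(group) * (mu - overall) ** 2
            within += sum((v - mu) ** 2 for v in group)
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def c_means(points, c, iterations=50, seed=0):
    """Hard c-means with squared Euclidean distance; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every sample to its nearest center.
        assignment = [
            min(range(c), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Move each center to the mean of its members (empty clusters keep their center).
        for j in range(c):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [mean(col) for col in zip(*members)]
    return assignment


# Toy data (invented): 6 samples x 4 genes; genes 0 and 2 differ between the two classes.
expression = [
    [1.0, 5.1, 0.2, 3.0],
    [1.2, 4.9, 0.1, 3.3],
    [0.9, 5.0, 0.3, 2.9],
    [4.0, 5.2, 2.1, 3.1],
    [4.2, 4.8, 2.0, 3.2],
    [3.9, 5.0, 2.2, 3.0],
]
histology = [0, 0, 0, 1, 1, 1]  # known class labels, used only for gene selection

normalized = zscore_genes(expression)
markers = rank_genes(normalized, histology)[:2]            # keep the top-ranked genes
reduced = [[row[g] for g in markers] for row in normalized]
clusters = c_means(reduced, c=2)                           # class discovery on marker genes
```

The labels are used only to choose genes; the clustering step itself never sees them, which is what allows the resulting clusters to be compared against the histological classification afterwards.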

Comments

ProQuest Collection ID: http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:28843236

Control Number

ISI-DISS-2005-147

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

DOI

http://dspace.isical.ac.in:8080/jspui/handle/10263/6316
