Finding optimum width of discretization for gene expressions using functional annotations
Computers in Biology and Medicine
Discretizing gene expression values is an important preprocessing step because it helps reduce noise and experimental error, which in turn improves results in tasks such as gene regulatory network analysis and disease prediction. A supervised discretization method for gene expressions that uses gene annotations is developed. The method, called “Gene Annotation Based Discretization” (GABD), determines the discretization width by maximizing the positive predictive value (PPV), computed using gene annotations, for the top 20,000 gene pairs. The discretized expressions capture gene similarity better than the original expressions do. The performance of GABD is compared, in terms of PPV, with existing discretization methods such as equal width discretization, equal frequency discretization and k-means discretization. The utility of GABD is also shown by clustering genes using the k-medoid algorithm and thereby predicting the function of 23 unclassified Saccharomyces cerevisiae genes using a p-value cutoff of 10^-10. The source code for GABD is available at http://www.sampa.droppages.com/GABD.html.
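The width selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is at the URL above). The function names, the Manhattan distance used to rank gene pairs, the rule that a pair counts as a true positive when the two genes share at least one annotation, and the grid of candidate widths are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def discretize(expr, width):
    """Map each expression value to the index of its fixed-width bin."""
    return np.floor(np.asarray(expr, dtype=float) / width).astype(int)


def ppv_top_pairs(expr, annotations, width, top_k):
    """PPV of the top_k closest gene pairs after discretization.

    expr: genes x conditions array; annotations: one set of labels per gene.
    """
    binned = discretize(expr, width)
    pairs = list(combinations(range(len(binned)), 2))
    # Assumed similarity: a smaller Manhattan distance means a more similar pair.
    dist = [np.abs(binned[i] - binned[j]).sum() for i, j in pairs]
    order = np.argsort(dist, kind="stable")[:top_k]
    # Assumed true-positive rule: the two genes share at least one annotation.
    hits = sum(
        1 for idx in order
        if annotations[pairs[idx][0]] & annotations[pairs[idx][1]]
    )
    return hits / len(order)


def best_width(expr, annotations, widths, top_k):
    """Return the candidate width with the highest PPV, plus all scores."""
    scores = [ppv_top_pairs(expr, annotations, w, top_k) for w in widths]
    return widths[int(np.argmax(scores))], scores
```

With a real data set, `top_k` would be 20,000 as in the abstract, and `widths` would be whatever candidate range one chooses to search. Computing all pairwise distances in a Python loop is only practical for small gene sets.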
Misra, Sampa and Ray, Shubhra Sankar, "Finding optimum width of discretization for gene expressions using functional annotations" (2017). Journal Articles. 2358.