Journal Articles

Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis

Jayanta Kumar Das, Indian Statistical Institute, Kolkata
Antara Sengupta, MCKV Institute of Engineering
Pabitra Pal Choudhury, Johns Hopkins University
Swarup Roy, Sikkim University

Article Type

Research Article

Publication Title

Gene

Abstract

Phylogenetic analysis of real biological taxa based on sequence similarity remains a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon from the numerical values of its constituent nucleotides, termed the degree of the codon. From these, we obtain the degree of each amino acid based on the degrees of the codons that code for it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate a 20-dimensional feature vector for every coding DNA sequence or protein sequence. We use these features to perform phylogenetic analysis of the set of candidate sequences. For experimental assessment, we use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), Xylanases, and low identity (<40%) and high identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families). We compare our results with sixteen (16) well-known methods, including both alignment-based and alignment-free methods. Several assessment indices are used for performance analysis, namely the Pearson correlation coefficient, RF (Robinson-Foulds) distance and ROC score. CoFASA produces results very similar to those of alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE). Further, CoFASA performs better than well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are useful in separating the sequences according to their taxonomic labels. Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
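
The pipeline described in the abstract (nucleotide weights, codon degree, amino acid degree, 20-dimensional feature vector) can be illustrated with a minimal Python sketch. The abstract does not give the actual weights or formulas, so everything numeric here is an assumption made for illustration only: the weights in `NUCLEOTIDE_WEIGHT`, the codon degree as a sum of nucleotide weights, the amino acid degree as the mean over its synonymous codons, and the feature as degree times relative abundance. The function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter
from itertools import product
import math

# ASSUMPTION: illustrative weights; the abstract does not state the real values.
NUCLEOTIDE_WEIGHT = {"T": 1, "C": 2, "A": 3, "G": 4}

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_degree(codon):
    """ASSUMPTION: degree of a codon = sum of its nucleotide weights."""
    return sum(NUCLEOTIDE_WEIGHT[base] for base in codon)


def amino_acid_degrees():
    """ASSUMPTION: degree of an amino acid = mean degree of its codons."""
    groups = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            groups.setdefault(aa, []).append(codon_degree(codon))
    return {aa: sum(d) / len(d) for aa, d in groups.items()}


DEGREES = amino_acid_degrees()
AMINO_ACIDS = sorted(DEGREES)  # fixed order of the 20 feature dimensions


def translate(cds):
    """Translate a coding DNA sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3])
        if aa is None:      # skip codons with ambiguous bases
            continue
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def feature_vector(protein):
    """ASSUMPTION: feature = amino acid degree * relative abundance."""
    counts = Counter(aa for aa in protein.upper() if aa in DEGREES)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [DEGREES[aa] * counts[aa] / total for aa in AMINO_ACIDS]


def distance(protein_a, protein_b):
    """Euclidean distance between two feature vectors (one possible choice)."""
    return math.dist(feature_vector(protein_a), feature_vector(protein_b))
```

Pairwise values from `distance` could fill a distance matrix that is then passed to a standard tree-building method; the choice of Euclidean distance is likewise an assumption of this sketch.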

DOI

10.1016/j.gene.2020.145096

Publication Date

1-15-2021

Recommended Citation

Das, Jayanta Kumar; Sengupta, Antara; Choudhury, Pabitra Pal; and Roy, Swarup, "Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis" (2021). Journal Articles. 2122.
https://digitalcommons.isical.ac.in/journal-articles/2122

This document is currently not available here.
