A Model for Distributed Processing and Analyses of NGS Data under Map-Reduce Paradigm
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Massively parallel sequencing, introduced by next-generation sequencing (NGS) technology, has led to exponential growth in sequencing data while greatly reducing cost and increasing throughput. This explosion of data has introduced new challenges for its storage, integration, processing, and analysis. In this paper, we propose a novel distributed model under the Map-Reduce paradigm to address the NGS big data problem. The architecture of the model follows a modular Map-Reduce based approach with three phases that support various analytical pipelines. The first phase generates detailed base-level information for individual genomes by granulating the alignment data. The other two phases process this base-level information independently and in parallel. One of them provides an integrated DNA profile of multiple individuals, whereas the other generates contigs with similar features within an individual. Each of the three phases produces a repository of genomic information that facilitates other analytical pipelines. Simulated and real experimental prototypes are provided as results to show the effectiveness of the model and its superiority over a few existing popular models and tools. A detailed description of the scope of applications of this model is also included in this article.
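The three-phase structure described in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: the record layout, the function names (`map_alignment`, `shuffle`, `reduce_base`, `phase1`, `phase2`, `phase3`), the toy data, and the use of read depth as the "similar feature" for contigs are all assumptions made for illustration. A real deployment would run the map and reduce steps on a distributed framework.

```python
from collections import Counter, defaultdict

# Hypothetical toy alignments: (individual, 1-based start position, aligned bases).
ALIGNMENTS = [
    ("ind1", 1, "ACGT"),
    ("ind1", 3, "GTTA"),
    ("ind2", 2, "CGTA"),
]

def map_alignment(record):
    """Map step: granulate one aligned read into per-base key/value pairs."""
    individual, start, bases = record
    for offset, base in enumerate(bases):
        yield (individual, start + offset), base

def shuffle(pairs):
    """Group values by key, as a Map-Reduce framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_base(values):
    """Reduce step: summarise all bases observed at one position."""
    counts = Counter(values)
    return {"depth": len(values), "consensus": counts.most_common(1)[0][0]}

def phase1(alignments):
    """Phase 1: base-level information per (individual, position)."""
    pairs = (kv for rec in alignments for kv in map_alignment(rec))
    return {key: reduce_base(vals) for key, vals in shuffle(pairs).items()}

def phase2(base_info):
    """Phase 2: integrate consensus calls across individuals, keyed by position."""
    profile = defaultdict(dict)
    for (individual, pos), info in base_info.items():
        profile[pos][individual] = info["consensus"]
    return dict(profile)

def phase3(base_info, individual):
    """Phase 3: merge adjacent positions of one individual sharing the same depth.

    Depth is an assumed stand-in for whatever feature defines a contig.
    Returns a list of (start, end, depth) tuples.
    """
    positions = sorted(p for (ind, p) in base_info if ind == individual)
    contigs = []
    for pos in positions:
        depth = base_info[(individual, pos)]["depth"]
        if contigs and contigs[-1][1] == pos - 1 and contigs[-1][2] == depth:
            contigs[-1] = (contigs[-1][0], pos, depth)  # extend the current contig
        else:
            contigs.append((pos, pos, depth))  # start a new contig
    return contigs

base_info = phase1(ALIGNMENTS)
print(phase2(base_info)[5])
print(phase3(base_info, "ind1"))
```

In this sketch, phases 2 and 3 read only the output of phase 1 and never each other's output, which is what would allow them to run independently and in parallel, as the abstract states.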
Samaddar, Sandip; Sinha, Rituparna; and De, Rajat K., "A Model for Distributed Processing and Analyses of NGS Data under Map-Reduce Paradigm" (2019). Journal Articles. 871.