PASTASpark: multiple sequence alignment meets Big Data

Title: PASTASpark: multiple sequence alignment meets Big Data
Authors: José M. Abuín, Tomás F. Pena and Juan C. Pichel
Type: Journal article
Source: Bioinformatics, Oxford University Press, Vol. 33, No. 18, pp. 2948-2950, 2017.
Rank: Ranked Q1 in Biochemistry by SJR
ISSN: 1367-4803
DOI: 10.1093/bioinformatics/btx354
Abstract: Motivation: One basic step in many bioinformatics analyses is Multiple Sequence Alignment (MSA). One of the state-of-the-art tools for performing MSA is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading, but it is limited to processing datasets on shared-memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most time-consuming task. Results: Speedups of up to 10× with respect to single-threaded PASTA were observed, which allows an ultra-large dataset of 200,000 sequences to be processed within the 24-hr limit. Availability: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark
Keywords: Genomics, Multiple Sequence Alignment, Spark, Big Data