

Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

NU author(s): Dr Jacek Cala, Dr Jannetta Steyn, Professor Paolo Missier



This is the final published version of a conference proceedings paper (inc. abstract), published in its final definitive form by CEUR-WS series, 2018.

For re-use rights please refer to the publisher's terms and conditions.


Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.

Publication metadata

Author(s): Tucci N, Cala J, Steyn J, Missier P

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: SEBD '18 – 26th Italian Symposium on Advanced Database Systems

Year of Conference: 2018

Print publication date: 27/06/2018

Online publication date: 27/06/2018

Acceptance date: 01/06/2018

Date deposited: 08/07/2018

Publisher: CEUR-WS series