NU author(s): Dr Jacek Cala, Dr Jannetta Steyn, Professor Paolo Missier
© 2018 CEUR-WS. All rights reserved.
Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and clinical use. Achieving scalability, however, requires significant effort to enable parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing processing times and cost with those of the new Microsoft Genomics Services.
Author(s): Tucci N, Cala J, Steyn J, Missier P
Publication type: Conference Proceedings (inc. Abstract)
Publication status: Published
Conference Name: 26th Italian Symposium in Advanced Database Systems (SEBD 2018)
Year of Conference: 2018
Online publication date: 24/06/2018
Acceptance date: 24/06/2018