Scalable and efficient whole-exome data processing using workflows on the cloud

NU author(s): Dr Jacek Cala, Eyad Marei, Dr Yaobo Xu, Professor Paolo Missier


Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WFMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster, to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution based on the e-Science Central WFMS and deployed in the cloud clearly outperforms the original HPC-based implementation achieving up to 2.3x speed-up. However, in order to deliver such performance we describe the importance of optimising the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating parallelisation of complex pipelines are required.

Publication metadata

Author(s): Cala J, Marei E, Xu Y, Takeda K, Missier P

Publication type: Article

Publication status: Published

Journal: Future Generation Computer Systems

Year: 2016

Volume: 65

Pages: 153-168

Print publication date: 01/12/2016

Online publication date: 28/01/2016

Acceptance date: 04/01/2016

Date deposited: 19/01/2016

ISSN (print): 0167-739X

ISSN (electronic): 1872-7115

Publisher: Elsevier


DOI: 10.1016/j.future.2016.01.001

Notes: Special Issue: Big Data in the Cloud



Funder reference | Funder name
(none listed) | Microsoft Azure for Research programme
BH135498/PD0204 | NIHR grant through the Newcastle Biomedical Research Centre