Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data

Dou, C; Cui, Y; Sun, D; Wong, R; Atif, M; Li, G; Ranjan, R

doi:10.1007/s11227-017-2008-8

Lookup NU author(s): Professor Raj Ranjan

Abstract

© 2017 Springer Science+Business Media New York

Record matching refers to identifying pairs of records that relate to the same entities across different data sources. In many data mining applications, record matching is usually associated with quadratic complexity. In practice, the number of non-matching record pairs far exceeds the number of matching pairs; this is known as the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, no fast and effective blocking algorithm yet exists for big data. In this paper, we report on the use of big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently and in a distributed manner on partitions of the whole data set. To improve efficiency, we adopt a probabilistic technique that balances the speed and the effectiveness of our proposed distributed blocking algorithm. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
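
To make the idea of blocking concrete, the following is a minimal sketch of generic key-based blocking. It is not the algorithm proposed in the paper. The record layout, the `blocking_key` function, and the choice of key (first three letters of a surname) are all assumptions made for illustration. It shows how grouping records by a cheap key shrinks the set of candidate pairs compared with the full quadratic comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: lowercased first three letters of the surname.
    # Real systems choose or learn keys suited to their data.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    # Group record ids by blocking key, then pair ids only within a block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records only; not data from the paper.
records = [
    {"id": 1, "surname": "Ranjan"},
    {"id": 2, "surname": "Ranjann"},
    {"id": 3, "surname": "Wong"},
    {"id": 4, "surname": "Wongg"},
    {"id": 5, "surname": "Atif"},
]

# Without blocking, every record is compared with every other: n(n-1)/2 pairs.
all_pairs = len(records) * (len(records) - 1) // 2
pairs = candidate_pairs(records)
print(all_pairs, len(pairs))
```

In this toy input, blocking leaves only the pairs that share a key, so the expensive matching step runs on far fewer comparisons. In a distributed setting, a function like `candidate_pairs` could in principle be applied to each data partition separately, at the cost of missing matches that span partitions.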

Publication metadata

Author(s): Dou C, Cui Y, Sun D, Wong R, Atif M, Li G, Ranjan R

Publication type: Article

Publication status: Published

Journal: Journal of Supercomputing

Year: 2019

Volume: 75

Issue: 2

Pages: 623-645

Print publication date: 01/02/2019

Online publication date: 16/03/2017

Acceptance date: 02/04/2016

ISSN (print): 0920-8542

ISSN (electronic): 1573-0484

Publisher: Springer New York LLC

URL: https://doi.org/10.1007/s11227-017-2008-8

DOI: 10.1007/s11227-017-2008-8
