Lookup NU author(s): Professor Raj Ranjan
© 2017 Springer Science+Business Media New York
Record matching refers to identifying pairs of records that relate to the same entities across different data sources. In many data mining applications, record matching has quadratic complexity, because every record must potentially be compared with every other. In practice, the number of non-matching record pairs far exceeds the number of matching pairs; this is called the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. For big data, however, there is not yet a fast and effective blocking algorithm. In this paper, we report on the use of big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed manner, on partitions of the whole dataset. To improve efficiency, we adopt a probabilistic technique that balances the speed and the effectiveness of the distributed blocking algorithm we propose. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
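As a rough illustration of the general idea of blocking on partitions, the following minimal sketch groups records by a blocking key within each partition and compares only records that share a key. It is not the algorithm proposed in the paper, and it omits the probabilistic technique entirely. The key function (first three letters of a surname), the record layout, and the function names are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def block_partition(records):
    # Build blocks for one partition; needs no data from other partitions.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    return blocks


def candidate_pairs(partitions):
    # Merge the per-partition blocks by key, then pair up ids inside each block.
    merged = defaultdict(list)
    for part in partitions:
        for key, ids in block_partition(part).items():
            merged[key].extend(ids)
    pairs = set()
    for ids in merged.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy data: two partitions of an assumed dataset.
partitions = [
    [{"id": 1, "surname": "Ranjan"}, {"id": 2, "surname": "Wong"}],
    [{"id": 3, "surname": "Ranjen"}, {"id": 4, "surname": "Cui"}],
]
pairs = candidate_pairs(partitions)
total = len(list(combinations(range(1, 5), 2)))
print(f"{len(pairs)} candidate pair(s) instead of {total}")
```

With four records, an exhaustive comparison would examine six pairs; here only records 1 and 3 share a key, so a single candidate pair survives. Each partition is blocked without reference to the others, which is what allows that step to run in parallel.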
Author(s): Dou C, Cui Y, Sun D, Wong R, Atif M, Li G, Ranjan R
Publication type: Article
Publication status: Published
Journal: Journal of Supercomputing
Year: 2019
Volume: 75
Issue: 2
Pages: 623-645
Print publication date: 01/02/2019
Online publication date: 16/03/2017
Acceptance date: 02/04/2016
ISSN (print): 0920-8542
ISSN (electronic): 1573-0484
Publisher: Springer New York LLC
URL: https://doi.org/10.1007/s11227-017-2008-8
DOI: 10.1007/s11227-017-2008-8