DduP - Towards a Deduplication Framework Utilising Apache Spark

Download www.btw-2015.de

Abstract: This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and to bridge the gap between big data, high performance and duplicate detection. A first prototype exists, but the overall project is still a work in progress. DduP utilises the Apache Spark framework [ZCF+10], the promising successor of Apache Hadoop MapReduce [Had14], and its modules MLlib [MLl14] and GraphX [XCD+14]. The three main goals of this project are creating a prototype of the DduP framework, analysing the deduplication process with respect to scalability and performance, and evaluating the behaviour of different small cluster configurations.

Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory

Niklas Wilcke

BTW 2015 | Database |

Post Info

Added	17 Apr 2016
Updated	17 Apr 2016
Type	Journal
Year	2015
Where	BTW
Authors	Niklas Wilcke
