Data Partitioning for Parallel Entity Matching

Download dbs.uni-leipzig.de

Entity matching is an important and difficult step in integrating web data. To reduce the typically high execution time of matching, we investigate how entity matching can be performed in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be executed independently. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they affect the overall communication overhead and the memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies on real-world product data from different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
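
The abstract does not spell out the partitioning strategies, so the following Python sketch is only one plausible reading of them, not the authors' implementation. It assumes a size-based strategy that cuts the input into partitions of at most `max_size` entities and creates one match task per pair of partitions, and a blocking-based strategy that first groups entities by a blocking key and then splits oversized blocks. All function names, the `brand` blocking key, and the sample product records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def size_based_partitions(entities, max_size):
    """Split entities into consecutive partitions of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def match_tasks(num_partitions):
    """Pair every partition with itself and with each later partition.

    Each (i, j) pair is an independent match task, so the whole
    Cartesian product of entity comparisons is covered exactly once.
    """
    return list(combinations_with_replacement(range(num_partitions), 2))


def blocking_based_partitions(entities, blocking_key, max_size):
    """Group entities by a blocking key, then split oversized blocks.

    Only entities that share a key are compared, which shrinks the
    search space; splitting keeps each task's memory footprint bounded.
    """
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    return {key: size_based_partitions(block, max_size)
            for key, block in blocks.items()}


# Hypothetical product records from two web shops.
products = [
    {"id": 1, "brand": "acme", "title": "Acme Phone X"},
    {"id": 2, "brand": "acme", "title": "ACME PhoneX 64GB"},
    {"id": 3, "brand": "zenith", "title": "Zenith Tab 10"},
    {"id": 4, "brand": "acme", "title": "Acme Charger"},
    {"id": 5, "brand": "zenith", "title": "Zenith Tablet 10 inch"},
]

partitions = size_based_partitions(products, max_size=2)
tasks = match_tasks(len(partitions))
blocked = blocking_based_partitions(products, lambda e: e["brand"], max_size=2)
```

Under these assumptions, p partitions yield p + p(p-1)/2 match tasks. Smaller partitions lower the memory needed per task but raise the number of tasks and the amount of data shipped between nodes, which is the trade-off the abstract points to.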

Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Gross, Hanna Köpcke, Erhard Rahm

CORR 2010 | Distributed Infrastructure | Education | Match Tasks | Multiple Match Tasks

» A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore ...

» Performance aware secure code partitioning

» Approximate String Matching in DNA Sequences

» Parallel Selection Query Processing Involving Index in Parallel Database Systems

» Combining Flexibility and Scalability in a Peer-to-Peer Publish/Subscribe System

» Querying Composite Objects in Semistructured Data

» Compact graph representations and parallel connectivity algorithms for massive dynamic net...

» Clusterfile: A Flexible Physical Layout Parallel File System

Post Info

Added	09 Dec 2010
Updated	09 Dec 2010
Type	Journal
Year	2010
Where	CORR
Authors	Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Gross, Hanna Köpcke, Erhard Rahm
