Entity matching, also known as record linkage, is a crucial aspect of Natural Language Processing (NLP) that involves identifying and linking records to determine whether two entities refer to the same real-world object.[1]
Problem definition
Let $A$ and $B$ be two data sources. $A$ has the attributes $(A_1, A_2, \dots, A_n)$, and we denote records as $a = (a_1, a_2, \dots, a_n) \in A$. Similarly, $B$ has the attributes $(B_1, B_2, \dots, B_m)$, and we denote records as $b = (b_1, b_2, \dots, b_m) \in B$. A data source is a set of records, and a record is a tuple following a specific schema of attributes. An attribute is defined by the intended semantics of its values. So $A_i = B_j$ if and only if values $a_i$ of $A_i$ are intended to carry the same information as values $b_j$ of $B_j$; the specific syntax of the attribute values is irrelevant. Attributes can also have metadata (like a name) associated with them, but this does not affect equality between them. We call the tuple of attributes $(A_1, A_2, \dots, A_n)$ the schema of data source $A$, and correspondingly for $B$. The goal of entity matching is to find the largest possible binary relation $M \subseteq A \times B$ such that $a$ and $b$ refer to the same entity for all $(a, b) \in M$.[2] Obviously, the number of potential pairs is the size of the Cartesian product, $|A \times B| = |A| \cdot |B|$.
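To make the definition concrete, here is a minimal sketch (the records and the `same_entity` matcher are invented for illustration) of building the relation $M$ by brute force over the Cartesian product:

```python
from itertools import product
import difflib

# Two toy data sources sharing the schema (name, city); all values invented.
A = [("Jon Smith", "Oslo"), ("Ada Lovelace", "London")]
B = [("John Smith", "Oslo"), ("Alan Turing", "London")]

def same_entity(a, b):
    # Hypothetical matcher: same city and sufficiently similar names.
    name_sim = difflib.SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    return a[1] == b[1] and name_sim > 0.8

# The relation M, built naively by checking all |A x B| = |A| * |B| pairs.
M = [(a, b) for a, b in product(A, B) if same_entity(a, b)]
print(len(A) * len(B))  # 4 pairs examined
print(M)
```

Only the "Jon Smith"/"John Smith" pair survives; the quadratic number of pairs examined is what the blocking step below is designed to avoid.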
Traditional entity matching process
The traditional entity matching process consists of data preprocessing, schema matching, blocking, record pair comparison, and classification.
- Schema matching aims to find out which attributes should be compared to one another, essentially identifying semantically related attributes. In practice, this step is often performed manually as part of preprocessing, simply by making sure to transform both data sources into the same schema format.
- Since the number of potential matches grows quadratically, it is common to pick out candidate pairs $C \subseteq A \times B$ in a separate step before any records are compared directly. This step is called blocking, and its goal is to bring the number of potential matches down to $|C| \ll |A \times B|$, a level where record-to-record comparison is feasible.
- When the number of candidate pairs $|C|$ has been reduced to a manageable amount, we can compare individual records $(a, b) \in C$. The pairwise comparison results in a similarity vector $s$, consisting of one or more numerical values indicating how similar or dissimilar the two records are.
- Lastly, the objective of the classification step is to classify each candidate pair as either a match or a nonmatch based on the similarity vector. In cases where $|s| = 1$, simple thresholding might be enough, while $|s| > 1$ requires more elaborate solutions.

Frequently, the pairwise comparison and classification steps are combined into a single comparison step in modeling. In the overall process, the two key components are blocking and comparison.
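The steps above can be sketched end to end in Python. This is an illustrative toy pipeline, not a reference implementation: the data, the blocking key (exact city agreement), and the 0.6 threshold are all assumptions.

```python
from itertools import product
import difflib

# Toy data sources with schema (name, city); all values invented.
A = [("Jon Smith", "Oslo"), ("Ada Lovelace", "London"), ("Bob Ross", "Bergen")]
B = [("John Smith", "Oslo"), ("Alan Turing", "London"), ("Robert Ross", "Bergen")]

# Blocking: only keep pairs that agree on the city attribute,
# so |C| is (ideally much) smaller than |A x B|.
C = [(a, b) for a, b in product(A, B) if a[1] == b[1]]

def similarity_vector(a, b):
    # Pairwise comparison: one entry per compared attribute (here just the name).
    name_sim = difflib.SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    return [name_sim]

# Classification: with |s| = 1, a simple threshold suffices (0.6 is assumed).
matches = [(a, b) for a, b in C if similarity_vector(a, b)[0] > 0.6]
print(len(C), len(matches))
```

Blocking keeps 3 of the 9 possible pairs, and thresholding then rejects the dissimilar "Ada Lovelace"/"Alan Turing" pair.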
Metrics
For blocking:
- $PR = |C| / |A \times B|$
- $P/E = |C| / |A|$
For comparison:
- precision/recall measures
- $F_1$
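As a sketch, these comparison metrics can be computed from a predicted and a true set of matching pairs (both sets below are invented for illustration):

```python
# Hypothetical predicted matches versus ground-truth matches,
# each pair given as (index in A, index in B).
predicted = {(0, 0), (1, 1), (2, 2), (3, 4)}
actual = {(0, 0), (1, 1), (2, 2), (3, 3)}

tp = len(predicted & actual)          # true positives
precision = tp / len(predicted)       # fraction of predicted pairs that are correct
recall = tp / len(actual)             # fraction of true pairs that were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```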
Datasets
The early approaches to entity matching were mostly concerned with matching personal records (census data or medical records). Such datasets are usually not publicly available due to privacy concerns.[2] An overview of the most popular public datasets is listed below: