blink - Record Linkage for Empirically Motivated Priors
An implementation of the model in Steorts (2015) <DOI:10.1214/15-BA965SI>, which performs Bayesian entity resolution for categorical and text data, for any distance function defined by the user. In addition, precision and recall are provided in the package so that results can be compared with other methods such as logistic regression, Bayesian additive regression trees (BART), or random forests. The experiments are reproducible and illustrated using a simple vignette. LICENSE: GPL-3 + file license.
Last updated 1 year ago
5.72 score 5 stars 1 dependent 70 scripts 288 downloads
clevr - Clustering and Link Prediction Evaluation in R
Tools for evaluating link prediction and clustering algorithms with respect to ground truth. Includes efficient implementations of common performance measures such as pairwise precision/recall, cluster homogeneity/completeness, variation of information, and the Rand index.
Last updated 1 year ago
clustering-evaluation entity-resolution evaluation-metrics link-prediction record-linkage cpp
4.77 score 12 stars 49 scripts 293 downloads
representr - Create Representative Records After Entity Resolution
An implementation of Kaplan, Betancourt, Steorts (2022) <doi:10.1080/00031305.2022.2041482> that creates representative records for use in downstream tasks after entity resolution is performed. Multiple methods for creating the representative records (data sets) are provided.
Last updated 2 years ago
downstream-tasks post-linkage-analysis record-linkage cpp
4.68 score 8 stars 12 scripts 238 downloads
cd - CD Data for Entity Resolution
Duplicated music data (pre-processed and formatted) for entity resolution. The data set contains 9763 records in total. Labeled gold standard records are included and can be treated as unique identifiers.
Last updated 7 years ago
datalinkage
4.16 score 29 scripts 222 downloads
klsh - Blocking for Record Linkage
An implementation of the blocking algorithm KLSH in Steorts, Ventura, Sadinle, Fienberg (2014) <DOI:10.1007/978-3-319-11257-2_20>, which is a k-means variant of locality sensitive hashing. The method is illustrated with examples and a vignette.
Last updated 4 years ago
3.70 score 3 scripts 176 downloads
cora - Cora Data for Entity Resolution
Duplicated publication data (pre-processed and formatted) for entity resolution. This data set contains a total of 1879 records. The following variables are included in the data set: id, title, book title, authors, address, date, year, editor, journal, volume, pages, publisher, institution, type, tech, note. The data set has a respective gold data set that provides information on which records match based on id.
Last updated 5 years ago
datalinkage
3.35 score 3 stars 15 scripts 173 downloads
restaurant - Restaurant Data for Entity Resolution
Duplicated restaurant data (pre-processed and formatted) for entity resolution. The package contains formatted restaurant data from two sources: the Zagats portion has 331 records and the Fodors portion has 533 records. The following variables are included in the data set: id, name, address, city, phone, type. The data set has a respective gold data set that provides information on which records match based on id.
Last updated 7 years ago
datalinkage
2.00 score 1 star 162 downloads