blink - Record Linkage for Empirically Motivated Priors
An implementation of the model in Steorts (2015) <DOI:10.1214/15-BA965SI>, which performs Bayesian entity resolution for categorical and text data using any user-defined distance function. The package also provides precision and recall, so results can be compared with other methods such as logistic regression, Bayesian additive regression trees (BART), or random forests. The experiments are reproducible and illustrated in a simple vignette. LICENSE: GPL-3 + file license.
5.74 score, 5 stars, 1 dependent, 74 scripts, 298 downloads
clevr - Clustering and Link Prediction Evaluation in R
Tools for evaluating link prediction and clustering algorithms with respect to ground truth. Includes efficient implementations of common performance measures such as pairwise precision/recall, cluster homogeneity/completeness, variation of information, and the Rand index.
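As a rough illustration of what pairwise measures compute, here is a minimal Python sketch. It is not clevr's API or implementation: the function names `coclustered_pairs` and `pairwise_scores` are hypothetical, and the brute-force pair enumeration is only suitable for small inputs. Each clustering is assumed to be given as a list of cluster labels, one per record.

```python
from itertools import combinations


def coclustered_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, pairwise recall, and Rand index of a predicted clustering.

    Hypothetical helper for illustration only; assumes both label lists
    describe the same records in the same order.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    true_pairs = coclustered_pairs(true_labels)
    pred_pairs = coclustered_pairs(pred_labels)

    n = len(true_labels)
    total_pairs = n * (n - 1) // 2

    tp = len(true_pairs & pred_pairs)   # linked in both clusterings
    fp = len(pred_pairs) - tp           # linked only in the prediction
    fn = len(true_pairs) - tp           # linked only in the ground truth

    nan = float("nan")
    precision = tp / len(pred_pairs) if pred_pairs else nan
    recall = tp / len(true_pairs) if true_pairs else nan
    # Rand index: fraction of record pairs on which the two clusterings agree.
    rand = (total_pairs - fp - fn) / total_pairs if total_pairs else nan

    return {"precision": precision, "recall": recall, "rand_index": rand}


if __name__ == "__main__":
    truth = [1, 1, 2, 2, 3]
    predicted = [1, 1, 1, 2, 3]
    print(pairwise_scores(truth, predicted))
```

In this toy example the prediction links three pairs of records, only one of which is a true match, so precision is 1/3; it recovers one of the two true matching pairs, so recall is 1/2.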
clustering-evaluation, entity-resolution, evaluation-metrics, link-prediction, record-linkage, cpp
5.62 score, 15 stars, 1 dependent, 92 scripts, 280 downloads
representr - Create Representative Records After Entity Resolution
An implementation of Kaplan, Betancourt, Steorts (2022) <doi:10.1080/00031305.2022.2041482> that creates representative records for use in downstream tasks after entity resolution is performed. Multiple methods for creating the representative records (data sets) are provided.
downstream-tasks, post-linkage-analysis, record-linkage, cpp
4.81 score, 10 stars, 13 scripts, 230 downloads
klsh - Blocking for Record Linkage
An implementation of the blocking algorithm KLSH in Steorts, Ventura, Sadinle, Fienberg (2014) <DOI:10.1007/978-3-319-11257-2_20>, which is a k-means variant of locality sensitive hashing. The method is illustrated with examples and a vignette.
3.70 score, 3 scripts, 205 downloads
cd - CD Data for Entity Resolution
Duplicated music data (pre-processed and formatted) for entity resolution. The data set contains 9763 records. Labeled gold-standard records are included and can be used as unique identifiers.
datalinkage
3.47 score, 59 scripts, 162 downloads
cora - Cora Data for Entity Resolution
Duplicated publication data (pre-processed and formatted) for entity resolution. The data set contains 1879 records with the following variables: id, title, book title, authors, address, date, year, editor, journal, volume, pages, publisher, institution, type, tech, note. A corresponding gold-standard data set indicates which records match, based on id.
datalinkage
3.35 score 3 stars 15 scripts 191 downloads