Package 'clevr'

Title: Clustering and Link Prediction Evaluation in R
Description: Tools for evaluating link prediction and clustering algorithms with respect to ground truth. Includes efficient implementations of common performance measures such as pairwise precision/recall, cluster homogeneity/completeness, variation of information, Rand index etc.
Authors: Neil Marchant [aut, cre], Rebecca Steorts [aut], Olivier Binette [ctb]
Maintainer: Neil Marchant
License: GPL-2
Version: 0.1.2
Built: 2024-11-06 04:41:36 UTC
Source: https://github.com/cleanzr/clevr


Accuracy of Linked Pairs

Description

Computes the accuracy of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

accuracy_pairs(true_pairs, pred_pairs, num_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

num_pairs

the total number of coreferent and non-coreferent pairs, excluding equivalent pairs with reversed ids.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The accuracy is defined as:

$$\frac{|T \cap P| + |T' \cap P'|}{N}$$

where:

  • $T$ is the set of true coreferent pairs,

  • $P$ is the set of predicted coreferent pairs,

  • $T'$ is the set of true non-coreferent pairs,

  • $P'$ is the set of predicted non-coreferent pairs, and

  • $N$ is the total number of coreferent and non-coreferent pairs.
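The definition can be checked by hand in base R. The following is a minimal sketch for intuition, not the package's implementation: the helper `pair_keys` is a hypothetical name introduced here, pairs are assumed to be unordered, and the example assumes 4 elements (hence 6 possible pairs).

```r
# Hypothetical helper: encode each unordered pair as a "min-max" string key
# and drop duplicates, so that (3, 2) and (2, 3) count as the same pair.
pair_keys <- function(pairs) {
  unique(paste(pmin(pairs[, 1], pairs[, 2]),
               pmax(pairs[, 1], pairs[, 2]), sep = "-"))
}

true_pairs <- rbind(c(1, 2), c(2, 3), c(1, 3))  # 3-clique on elements 1..3
pred_pairs <- rbind(c(1, 2), c(3, 2))           # misses edge (1, 3)
num_pairs  <- choose(4, 2)                      # 4 elements, 6 unordered pairs

t_keys <- pair_keys(true_pairs)
p_keys <- pair_keys(pred_pairs)

tp  <- length(intersect(t_keys, p_keys))         # |T intersect P|
tn  <- num_pairs - length(union(t_keys, p_keys)) # |T' intersect P'|
acc <- (tp + tn) / num_pairs
acc
```

Pairs that appear in neither set are counted as true negatives, which is why `num_pairs` has to be supplied.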

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
num_pairs <- 3                              # assuming 3 elements
accuracy_pairs(true_pairs, pred_pairs, num_pairs)

Adjusted Rand Index Between Clusterings

Description

Computes the adjusted Rand index (ARI) between two clusterings, such as a predicted and ground truth clustering.

Usage

adj_rand_index(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Details

The adjusted Rand index (ARI) is a variant of the Rand index (RI) which is corrected for chance using the Permutation Model for clusterings. It is related to the RI as follows:

$$\mathrm{ARI} = \frac{RI - E(RI)}{1 - E(RI)},$$

where $E(RI)$ is the expected value of the RI under the Permutation Model. Unlike the RI, the ARI takes values in the range -1 to 1. A value of 1 indicates that the clusterings are identical, while a value of 0 is the expected score when the clusterings are drawn randomly and independently of one another.
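As an illustration of the chance correction, the sketch below computes the RI and its expectation from pair counts in base R and then applies the formula. It assumes the standard pair-counting expressions of Hubert and Arabie (1985); the variable names are introduced here for illustration and are not part of the package.

```r
true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

ct      <- table(pred, true)            # dense contingency table
n_pairs <- choose(length(true), 2)      # total number of element pairs

same_both <- sum(choose(as.vector(ct), 2))  # pairs co-clustered in both
same_pred <- sum(choose(rowSums(ct), 2))    # pairs co-clustered in pred
same_true <- sum(choose(colSums(ct), 2))    # pairs co-clustered in true

# Rand index and its expected value under the Permutation Model
expected_both <- same_pred * same_true / n_pairs
ri   <- (n_pairs + 2 * same_both     - same_pred - same_true) / n_pairs
e_ri <- (n_pairs + 2 * expected_both - same_pred - same_true) / n_pairs

ari <- (ri - e_ri) / (1 - e_ri)
ari
```

For this example the RI is 0.6, while the ARI is considerably lower (1/6) once agreement expected by chance is subtracted.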

References

Hubert, L., Arabie, P. "Comparing partitions." Journal of Classification 2, 193–218 (1985). doi:10.1007/BF01908075

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
adj_rand_index(true, pred)

Balanced Accuracy of Linked Pairs

Description

Computes the balanced accuracy of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

balanced_accuracy_pairs(true_pairs, pred_pairs, num_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

num_pairs

the total number of coreferent and non-coreferent pairs, excluding equivalent pairs with reversed ids.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The balanced accuracy is defined as:

$$\frac{\frac{|T \cap P|}{|T|} + \frac{|T' \cap P'|}{|T'|}}{2}$$

where:

  • $T$ is the set of true coreferent pairs,

  • $P$ is the set of predicted coreferent pairs,

  • $T'$ is the set of true non-coreferent pairs, and

  • $P'$ is the set of predicted non-coreferent pairs.

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
num_pairs <- 3                              # assuming 3 elements
balanced_accuracy_pairs(true_pairs, pred_pairs, num_pairs)

Canonicalize element pairs

Description

Coerce a collection of element pairs into canonical form. Facilitates testing of equivalence.

Usage

canonicalize_pairs(pairs, ordered = FALSE)

Arguments

pairs

a matrix or data.frame of element pairs where rows correspond to element pairs and columns correspond to element identifiers.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Value

Returns the element pairs in canonical form, so that:

  • the first element id precedes the second element id lexicographically if ordered = FALSE—i.e. pair (3, 2) becomes pair (2, 3);

  • pairs with missing element ids are removed;

  • duplicate pairs are removed; and

  • the rows in the matrix/data.frame pairs are sorted lexicographically by the first element id, then by the second element id.
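The four steps listed above can be sketched in base R as follows. This is an illustrative approximation only, not the package's code: `canonicalize_sketch` is a hypothetical name, and it assumes numeric ids, so rows are sorted numerically rather than lexicographically.

```r
# Hypothetical re-implementation of the canonical form for numeric ids.
canonicalize_sketch <- function(pairs, ordered = FALSE) {
  pairs <- as.matrix(pairs)
  # remove pairs with missing element ids
  pairs <- pairs[stats::complete.cases(pairs), , drop = FALSE]
  if (!ordered) {
    # put the smaller id first, so (3, 2) becomes (2, 3)
    pairs <- cbind(pmin(pairs[, 1], pairs[, 2]),
                   pmax(pairs[, 1], pairs[, 2]))
  }
  # remove duplicate rows
  pairs <- unique(pairs)
  # sort rows by first id, then by second id
  pairs[order(pairs[, 1], pairs[, 2]), , drop = FALSE]
}

messy <- rbind(c(2, 1), c(1, 2), c(3, 1), c(NA, 2), c(1, 2))
clean <- canonicalize_sketch(messy)
clean
```

Two collections of pairs can then be tested for equivalence by comparing their canonical forms row by row.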

Examples

messy_pairs <- rbind(c(2,1), c(1,2), c(3,1), c(1,2))
clean_pairs <- canonicalize_pairs(messy_pairs)
all(rbind(c(1,2), c(1,3)) == clean_pairs) # duplicates removed and order fixed

Transform Clustering Representations

Description

Transform between different representations of a clustering.

Usage

clusters_to_membership(clusters, elem_ids = NULL, clust_ids = NULL)

membership_to_clusters(membership, elem_ids = NULL, clust_ids = NULL)

clusters_to_pairs(clusters)

membership_to_pairs(membership, elem_ids = NULL)

pairs_to_membership(pairs, elem_ids)

pairs_to_clusters(pairs, elem_ids)

Arguments

clusters

a representation of a clustering as a list of vectors, where the i-th vector contains the identifiers of elements assigned to the i-th cluster. If clust_ids is specified (see below), the i-th cluster is identified according to the corresponding entry in clust_ids. Otherwise the i-th cluster is identified according to its name (if clusters is a named list) or its integer index i.

elem_ids

a vector specifying the complete set of identifiers for the cluster elements in canonical order. Optional for all functions excluding pairs_to_membership and pairs_to_clusters.

clust_ids

a vector specifying the complete set of identifiers for the clusters in canonical order. Optional for all functions.

membership

a representation of a clustering as a membership vector, where the i-th entry contains the cluster identifier for the i-th element. If elem_ids is specified (see above), the i-th element is identified according to the corresponding entry in elem_ids. Otherwise the i-th element is identified according to its name (if membership is a named vector) or its integer index i.

pairs

a representation of a clustering as a matrix or data.frame containing all pairs of elements that are co-clustered. The rows of the matrix/data.frame index pairs and the columns index the identifiers of the constituent elements. The elem_ids argument (see above) must be specified in order to recover singleton clusters (containing a single element).

Value

clusters_to_membership and pairs_to_membership both return a membership vector representation of the clustering. The order of the elements is taken from elem_ids if specified, otherwise the elements are ordered lexicographically by their identifiers. For pairs_to_membership, the cluster identifiers cannot be recovered and are taken to be integers.

membership_to_clusters and pairs_to_clusters both return a representation of the clustering as a list of vectors. The order of the clusters is taken from clust_ids if specified, otherwise the clusters are ordered lexicographically by their identifiers. For pairs_to_clusters, the cluster identifiers cannot be recovered and are taken to be integers.

clusters_to_pairs and membership_to_pairs both return a representation of the clustering as a matrix of element pairs that are co-clustered. This representation results in loss of information, as singleton clusters (with one element) and cluster identifiers are not represented.
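The loss of information in the pairs representation can be seen in a small base R sketch. It mimics the membership-to-pairs direction only and is not the package's implementation; the variable names are illustrative.

```r
# Membership vector: Item2 sits alone in its own cluster.
membership <- c(Item1 = 1, Item2 = 2, Item3 = 1, Item4 = 1)

# Group element names by cluster id (list-of-vectors representation).
clusters <- split(names(membership), membership)

# Only clusters with at least two elements contribute pairs.
non_singletons <- Filter(function(ids) length(ids) > 1, clusters)

# Enumerate all within-cluster pairs, one row per pair.
pairs <- do.call(
  rbind,
  lapply(unname(non_singletons), function(ids) t(combn(ids, 2)))
)
pairs
```

The singleton `Item2` does not appear in `pairs`, which is why `elem_ids` must be supplied when converting back from pairs.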

Examples

## A clustering of three items represented as a membership vector
m <- c("Item1" = 1, "Item2" = 2, "Item3" = 1)

# Transform to list of clusters
membership_to_clusters(m)
# Specify different identifiers for the items
membership_to_clusters(m, elem_ids = c(1, 2, 3))
# Transform to array of pairs that are co-clustered
membership_to_pairs(m)

## A clustering represented as a list of clusters
cl <- list("ClustA" = c(1,3), "ClustB" = c(2))

# Transform to membership vector representation
clusters_to_membership(cl)
# Transform to array of pairs that are co-clustered
clusters_to_pairs(cl)

## A clustering (incompletely) represented as an array of pairs that
## are co-clustered
p <- rbind(c(1,3)) # pairs of elements in the same cluster
ids <- c(1,2,3)    # necessary to specify set of all elements

# Transform to membership vector representation
pairs_to_membership(p, ids)
# Transform to list of clusters
pairs_to_clusters(p, ids)

Completeness Between Clusterings

Description

Computes the completeness between two clusterings, such as a predicted and ground truth clustering.

Usage

completeness(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Details

Completeness is an entropy-based measure of the similarity between two clusterings, say $t$ and $p$. The completeness is high if all members of a given cluster in $t$ are assigned to a single cluster in $p$. The completeness ranges between 0 and 1, where 1 indicates perfect completeness.

References

Rosenberg, A. and Hirschberg, J. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), (2007).

See Also

homogeneity evaluates the homogeneity, which is a dual measure to completeness. v_measure evaluates the harmonic mean of completeness and homogeneity.

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
completeness(true, pred)

Contingency Table for Clusterings

Description

Compute the contingency table for a predicted clustering given a ground truth clustering.

Usage

contingency_table_clusters(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Value

Returns a table $C$ (stored as a sparse matrix) such that $C_{ij}$ counts the number of elements assigned to cluster $i$ in pred and cluster $j$ in true.
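For small inputs, a dense table with the same row/column layout can be built with base R's `table()`. This is a sketch for intuition only; the package's function returns a sparse matrix, and its exact class is not reproduced here.

```r
true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

# Rows index predicted clusters, columns index true clusters.
ct <- table(pred = pred, true = true)
ct

# Element 3 is the only one placed in pred cluster 2 but true cluster 1.
ct["2", "1"]
```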

See Also

eval_report_clusters computes common evaluation measures derived from the output of this function.

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
contingency_table_clusters(true, pred)

Binary Contingency Table for Linked Pairs

Description

Compute the binary contingency table for a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

contingency_table_pairs(
  true_pairs,
  pred_pairs,
  num_pairs = NULL,
  ordered = FALSE
)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

num_pairs

the total number of coreferent and non-coreferent pairs, excluding equivalent pairs with reversed ids. If not provided, the true negative cell will be set to NA.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Value

Returns a $2 \times 2$ contingency table of the form:

             Truth
   Prediction   TRUE  FALSE
        TRUE      TP     FP
        FALSE     FN     TN

See Also

The membership_to_pairs and clusters_to_pairs functions can be used to transform other clustering representations into lists of pairs, as required by this function. The eval_report_pairs function computes common evaluation measures derived from binary contingency matrices, like the ones output by this function.

Examples

### Example where pairs/edges are undirected
# ground truth is 3-clique
true_pairs <- rbind(c(1,2), c(2,3), c(1,3))
# prediction misses one edge
pred_pairs <- rbind(c(1,2), c(2,3))
# total number of pairs assuming 3 elements
num_pairs <- 3 * (3 - 1) / 2
contingency_table_pairs(true_pairs, pred_pairs, num_pairs)

### Example where pairs/edges are directed
# ground truth is a 3-star
true_pairs <- rbind(c(2,1), c(3,1), c(4,1))
# prediction gets direction of one edge incorrect
pred_pairs <- rbind(c(2,1), c(3,1), c(1,4))
# total number of pairs assuming 4 elements
num_pairs <- 4 * 4
contingency_table_pairs(true_pairs, pred_pairs, num_pairs, ordered = TRUE)

Evaluation Report for Clustering

Description

Compute various evaluation measures for a predicted clustering using a ground truth clustering as a reference.

Usage

eval_report_clusters(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Value

Returns a list containing the following measures:

homogeneity

see homogeneity

completeness

see completeness

v_measure

see v_measure

rand_index

see rand_index

adj_rand_index

see adj_rand_index

variation_info

see variation_info

mutual_info

see mutual_info

fowlkes_mallows

see fowlkes_mallows

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
eval_report_clusters(true, pred)

Evaluation Report for Linked Pairs

Description

Compute various evaluation measures for a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

eval_report_pairs(true_pairs, pred_pairs, num_pairs = NULL, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

num_pairs

the total number of coreferent and non-coreferent pairs, excluding equivalent pairs with reversed ids. If not provided, measures that depend on the number of true negatives will be returned as NA.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Value

Returns a list containing the following measures:

precision

see precision_pairs

recall

see recall_pairs

specificity

see specificity_pairs

sensitivity

see sensitivity_pairs

f1score

see f_measure_pairs

accuracy

see accuracy_pairs

balanced_accuracy

see balanced_accuracy_pairs

fowlkes_mallows

see fowlkes_mallows_pairs

See Also

The contingency_table_pairs function can be used to compute the contingency table for entity resolution or record linkage problems.

Examples

### Example where pairs/edges are undirected
# ground truth is 3-clique
true_pairs <- rbind(c(1,2), c(2,3), c(1,3))
# prediction misses one edge
pred_pairs <- rbind(c(1,2), c(2,3))
# total number of pairs assuming 3 elements
num_pairs <- 3 * (3 - 1) / 2
eval_report_pairs(true_pairs, pred_pairs, num_pairs)

### Example where pairs/edges are directed
# ground truth is a 3-star
true_pairs <- rbind(c(2,1), c(3,1), c(4,1))
# prediction gets direction of one edge incorrect
pred_pairs <- rbind(c(2,1), c(3,1), c(1,4))
# total number of pairs assuming 4 elements
num_pairs <- 4 * 4
eval_report_pairs(true_pairs, pred_pairs, num_pairs, ordered = TRUE)

F-measure of Linked Pairs

Description

Computes the F-measure (a.k.a. F-score) of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

f_measure_pairs(true_pairs, pred_pairs, beta = 1, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

beta

non-negative weight. A value of 0 assigns no weight to recall (i.e. the measure reduces to precision), while larger values assign increasing weight to recall. A value of 1 weights precision and recall equally.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The $\beta$-weighted F-measure is defined as the weighted harmonic mean of precision $P$ and recall $R$:

$$(1 + \beta^2)\frac{P \cdot R}{\beta^2 \cdot P + R}.$$
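The effect of the weight can be seen by evaluating the formula directly. In the sketch below, `f_beta` is a hypothetical helper (not a package function), and the precision and recall values correspond to a prediction that recovers 2 of 3 true pairs with no false positives.

```r
# Hypothetical helper implementing the weighted harmonic mean above.
f_beta <- function(p, r, beta = 1) {
  (1 + beta^2) * p * r / (beta^2 * p + r)
}

p <- 2 / 2  # precision: both predicted pairs are true pairs
r <- 2 / 3  # recall: 2 of the 3 true pairs were predicted

f1 <- f_beta(p, r)            # equal weight on precision and recall
f0 <- f_beta(p, r, beta = 0)  # reduces to precision
f2 <- f_beta(p, r, beta = 2)  # more weight on recall
c(f0 = f0, f1 = f1, f2 = f2)
```

As the weight grows, the score moves from the precision (1) towards the recall (2/3).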

References

Van Rijsbergen, C. J. "Information Retrieval." (2nd ed.). Butterworth-Heinemann, USA, (1979).

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
f_measure_pairs(true_pairs, pred_pairs)     # beta defaults to 1

Fowlkes-Mallows Index Between Clusterings

Description

Computes the Fowlkes-Mallows index between two clusterings, such as a predicted and ground truth clustering.

Usage

fowlkes_mallows(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Details

The Fowlkes-Mallows index is defined as the geometric mean of precision and recall, computed with respect to pairs of elements.

References

Fowlkes, E. B. and Mallows, C. L. "A Method for Comparing Two Hierarchical Clusterings." Journal of the American Statistical Association 78:383, 553-569, (1983). doi:10.1080/01621459.1983.10478008

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
fowlkes_mallows(true, pred)

Fowlkes-Mallows Index of Linked Pairs

Description

Computes the Fowlkes-Mallows index for a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

fowlkes_mallows_pairs(true_pairs, pred_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The Fowlkes-Mallows index is defined as the geometric mean of precision $P$ and recall $R$:

$$\sqrt{P R}.$$

References

Fowlkes, E. B. and Mallows, C. L. "A Method for Comparing Two Hierarchical Clusterings." Journal of the American Statistical Association 78:383, 553-569, (1983). doi:10.1080/01621459.1983.10478008

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
fowlkes_mallows_pairs(true_pairs, pred_pairs)

Homogeneity Between Clusterings

Description

Computes the homogeneity between two clusterings, such as a predicted and ground truth clustering.

Usage

homogeneity(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Details

Homogeneity is an entropy-based measure of the similarity between two clusterings, say $t$ and $p$. The homogeneity is high if each cluster in $p$ contains only members of a single cluster in $t$. The homogeneity ranges between 0 and 1, where 1 indicates perfect homogeneity.

References

Rosenberg, A. and Hirschberg, J. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), (2007).

See Also

completeness evaluates the completeness, which is a dual measure to homogeneity. v_measure evaluates the harmonic mean of completeness and homogeneity.

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
homogeneity(true, pred)

Mutual Information Between Clusterings

Description

Computes the mutual information between two clusterings, such as a predicted and ground truth clustering.

Usage

mutual_info(true, pred, base = exp(1))

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

base

base of the logarithm. Defaults to exp(1).

Details

Mutual information is an entropy-based measure of the similarity between two clusterings.
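For intuition, the sketch below computes mutual information from the empirical joint distribution of cluster labels in base R. It assumes the standard definition of mutual information; `mi_sketch` is a hypothetical name and is not the package's implementation.

```r
# Hypothetical helper: mutual information of two membership vectors.
mi_sketch <- function(true, pred, base = exp(1)) {
  joint  <- table(pred, true) / length(true)  # empirical joint distribution
  indep  <- outer(rowSums(joint), colSums(joint))  # product of marginals
  nz     <- as.vector(joint) > 0              # skip empty cells (0 * log 0 = 0)
  sum(as.vector(joint)[nz] *
        log(as.vector(joint)[nz] / as.vector(indep)[nz], base = base))
}

true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

mi      <- mi_sketch(true, pred)            # in nats
mi_bits <- mi_sketch(true, pred, base = 2)  # in bits
c(nats = mi, bits = mi_bits)
```

Changing `base` only rescales the result: the value in bits equals the value in nats divided by `log(2)`.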

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
mutual_info(true, pred)

Precision of Linked Pairs

Description

Computes the precision of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

precision_pairs(true_pairs, pred_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The precision is defined as:

$$\frac{|T \cap P|}{|P|}$$

where $T$ is the set of true coreferent pairs and $P$ is the set of predicted coreferent pairs.

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
precision_pairs(true_pairs, pred_pairs)

Rand Index Between Clusterings

Description

Computes the Rand index (RI) between two clusterings, such as a predicted and ground truth clustering.

Usage

rand_index(true, pred)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

Details

The Rand index (RI) can be expressed as:

$$\frac{a + b}{\binom{n}{2}},$$

where

  • $n$ is the number of elements,

  • $a$ is the number of pairs of elements that appear in the same cluster in both clusterings, and

  • $b$ is the number of pairs of elements that appear in distinct clusters in both clusterings.

The RI takes on values between 0 and 1, where 1 denotes exact agreement between the clusterings and 0 denotes disagreement on all pairs of elements.
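The counts in the definition can be obtained by brute-force enumeration of all element pairs. The base R sketch below is meant only to make the definition concrete; it is quadratic in the number of elements and is not how the package computes the index.

```r
true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

n   <- length(true)
idx <- combn(n, 2)        # each column is one pair of element indices

same_true <- true[idx[1, ]] == true[idx[2, ]]  # co-clustered in true?
same_pred <- pred[idx[1, ]] == pred[idx[2, ]]  # co-clustered in pred?

a <- sum(same_true & same_pred)    # same cluster in both clusterings
b <- sum(!same_true & !same_pred)  # distinct clusters in both clusterings

ri <- (a + b) / choose(n, 2)
ri
```

Here 2 pairs agree on being together and 4 agree on being apart, out of 10 pairs in total.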

References

Rand, W. M. "Objective Criteria for the Evaluation of Clustering Methods." Journal of the American Statistical Association 66(336), 846-850 (1971). doi:10.1080/01621459.1971.10482356

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
rand_index(true, pred)

Recall of Linked Pairs

Description

Computes the recall of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

recall_pairs(true_pairs, pred_pairs, ordered = FALSE)

sensitivity_pairs(true_pairs, pred_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The recall is defined as:

$$\frac{|T \cap P|}{|T|}$$

where $T$ is the set of true coreferent pairs and $P$ is the set of predicted coreferent pairs.

Note

sensitivity_pairs is an alias for recall_pairs.

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
recall_pairs(true_pairs, pred_pairs)

Specificity of Linked Pairs

Description

Computes the specificity of a set of predicted coreferent (linked) pairs given a set of ground truth coreferent pairs.

Usage

specificity_pairs(true_pairs, pred_pairs, num_pairs, ordered = FALSE)

Arguments

true_pairs

set of true coreferent pairs stored in a matrix or data.frame, where rows index pairs and columns index the ids of the constituents. Any pairs not included are assumed to be non-coreferent. Duplicate pairs (including equivalent pairs with reversed ids) are automatically removed.

pred_pairs

set of predicted coreferent pairs, following the same specification as true_pairs.

num_pairs

the total number of coreferent and non-coreferent pairs, excluding equivalent pairs with reversed ids.

ordered

whether to treat the element pairs as ordered, i.e. whether pair $(x, y)$ is distinct from pair $(y, x)$ for $x \neq y$. Defaults to FALSE, which is appropriate for clustering, undirected link prediction, record linkage etc.

Details

The specificity is defined as:

$$\frac{|P' \cap T'|}{|T'|}$$

where $T'$ is the set of true non-coreferent pairs and $P'$ is the set of predicted non-coreferent pairs.

Examples

true_pairs <- rbind(c(1,2), c(2,3), c(1,3)) # ground truth is 3-clique
pred_pairs <- rbind(c(1,2), c(2,3))         # prediction misses one edge
num_pairs <- 3                              # assuming 3 elements
specificity_pairs(true_pairs, pred_pairs, num_pairs)

V-measure Between Clusterings

Description

Computes the V-measure between two clusterings, such as a predicted and ground truth clustering.

Usage

v_measure(true, pred, beta = 1)

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

beta

non-negative weight. A value of 0 assigns no weight to completeness (i.e. the measure reduces to homogeneity), while larger values assign increasing weight to completeness. A value of 1 weights completeness and homogeneity equally.

Details

V-measure is defined as the $\beta$-weighted harmonic mean of homogeneity $h$ and completeness $c$:

$$(1 + \beta)\frac{h \cdot c}{\beta \cdot h + c}.$$

The range of V-measure is between 0 and 1, where 1 corresponds to a perfect match between the clusterings. It is equivalent to the normalised mutual information, when the aggregation function is the arithmetic mean.
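The equivalence with normalised mutual information can be checked numerically for $\beta = 1$. The base R sketch below assumes the conditional-entropy definitions of homogeneity and completeness from Rosenberg and Hirschberg (2007); the helper `entropy` and the variable names are illustrative, and degenerate cases (a clustering with zero entropy) are not handled.

```r
# Hypothetical helper: Shannon entropy (in nats) of a probability vector.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

joint   <- table(pred, true) / length(true)
h_true  <- entropy(colSums(joint))
h_pred  <- entropy(rowSums(joint))
h_joint <- entropy(as.vector(joint))

homog <- 1 - (h_joint - h_pred) / h_true  # 1 - H(true | pred) / H(true)
compl <- 1 - (h_joint - h_true) / h_pred  # 1 - H(pred | true) / H(pred)

beta <- 1
v <- (1 + beta) * homog * compl / (beta * homog + compl)

# Mutual information normalised by the arithmetic mean of the entropies
nmi <- (h_true + h_pred - h_joint) / ((h_true + h_pred) / 2)
c(v_measure = v, nmi = nmi)
```

Both quantities agree for this example, as expected when completeness and homogeneity are weighted equally.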

References

Rosenberg, A. and Hirschberg, J. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), (2007).

Becker, H. "Identification and characterization of events in social media." PhD dissertation, Columbia University, (2011).

See Also

homogeneity and completeness evaluate the component measures upon which this measure is based.

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
v_measure(true, pred)

Variation of Information Between Clusterings

Description

Computes the variation of information between two clusterings, such as a predicted and ground truth clustering.

Usage

variation_info(true, pred, base = exp(1))

Arguments

true

ground truth clustering represented as a membership vector. Each entry corresponds to an element and the value identifies the assigned cluster. The specific values of the cluster identifiers are arbitrary.

pred

predicted clustering represented as a membership vector.

base

base of the logarithm. Defaults to exp(1).

Details

Variation of information is an entropy-based distance metric on the space of clusterings. It is unnormalized and varies between $0$ and $\log(N)$, where $N$ is the number of clustered elements. Larger values of the distance metric correspond to greater dissimilarity between the clusterings.
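A base R sketch of the distance is given below, assuming the usual expression of the variation of information as the sum of the two conditional entropies (Meilă, 2003). The names `entropy` and `vi_sketch` are hypothetical and do not belong to the package.

```r
# Hypothetical helper: Shannon entropy of a probability vector.
entropy <- function(p, base = exp(1)) {
  p <- p[p > 0]
  -sum(p * log(p, base = base))
}

# VI = H(true | pred) + H(pred | true) = 2 * H(joint) - H(true) - H(pred)
vi_sketch <- function(true, pred, base = exp(1)) {
  joint <- table(pred, true) / length(true)
  2 * entropy(as.vector(joint), base) -
    entropy(rowSums(joint), base) -
    entropy(colSums(joint), base)
}

true <- c(1, 1, 1, 2, 2)  # ground truth clustering
pred <- c(1, 1, 2, 2, 2)  # predicted clustering

vi      <- vi_sketch(true, pred)
vi_self <- vi_sketch(true, true)  # distance from a clustering to itself
c(vi = vi, vi_self = vi_self, upper_bound = log(length(true)))
```

The distance between a clustering and itself is zero, and the value for the example stays below the upper bound of `log(5)`.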

References

Arabie, P. and Boorman, S. A. "Multidimensional scaling of measures of distance between partitions." Journal of Mathematical Psychology 10:2, 148-203, (1973). doi:10.1016/0022-2496(73)90012-6

Meilă, M. "Comparing Clusterings by the Variation of Information." In: Learning Theory and Kernel Machines, Lecture Notes in Computer Science 2777, Springer, Berlin, Heidelberg, (2003). doi:10.1007/978-3-540-45167-9_14

Examples

true <- c(1,1,1,2,2)  # ground truth clustering
pred <- c(1,1,2,2,2)  # predicted clustering
variation_info(true, pred)