findAnnoy {BiocNeighbors}R Documentation

Find approximate nearest neighbors

Description

Use the Annoy algorithm to identify approximate nearest neighbors from a dataset.

Usage

findAnnoy(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(), 
    precomputed=NULL, subset=NULL, ...)

Arguments

X

A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

k

A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index

A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance

A logical scalar indicating whether distances to the nearest neighbors should be recorded.

BPPARAM

A BiocParallelParam object indicating how the search should be parallelized.

precomputed

An AnnoyIndex object from running buildAnnoy on X.

subset

A vector indicating the rows of X for which the nearest neighbors should be identified.

...

Further arguments to be passed to buildAnnoy if precomputed=NULL.

Details

This function uses the Annoy method (Approximate Nearest Neighbours Oh Yeah) by Erik Bernhardsson to identify approximate k-nearest neighbors in high-dimensional data. Briefly, a tree is constructed by using a random hyperplane to split the points at each internal node. For a given input data point, all points in the same leaf node are defined as a set of potential nearest neighbors. This is repeated across multiple trees, and the union of all such sets is searched to identify the actual nearest neighbors.

By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKNN with subset=NULL.

Turning off get.index or get.distance will not return the corresponding matrices in the output. This may provide a slight speed boost when these returned values are not of interest. Using BPPARAM will also split the search across multiple workers, which should increase speed proportionally (in theory) to the number of cores.

If the function is to be called multiple times with the same X (e.g., with different subset), it may be faster to call buildAnnoy once externally, and pass the returned object to findAnnoy via the precomputed argument. This avoids unnecessary re-indexing and can provide a substantial speed-up. Note that when precomputed is supplied, the value of X is completely ignored.

Value

A list is returned containing:

If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

Author(s)

Aaron Lun

References

Bernhardsson E (2018). Annoy. https://github.com/spotify/annoy

See Also

See buildAnnoy to construct the index files ahead of time.

Examples

Y <- matrix(rnorm(100000), ncol=20)
out <- findAnnoy(Y, k=25)
head(out$index)
head(out$distance)

[Package BiocNeighbors version 1.0.0 Index]