findAnnoy {BiocNeighbors} | R Documentation |
Use the Annoy algorithm to identify approximate nearest neighbors from a dataset.
findAnnoy(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(), precomputed=NULL, subset=NULL, ...)
X |
A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions). |
k |
A positive integer scalar specifying the number of nearest neighbors to retrieve. |
get.index |
A logical scalar indicating whether the indices of the nearest neighbors should be recorded. |
get.distance |
A logical scalar indicating whether distances to the nearest neighbors should be recorded. |
BPPARAM |
A BiocParallelParam object indicating how the search should be parallelized. |
precomputed |
An AnnoyIndex object from running |
subset |
A vector indicating the rows of |
... |
Further arguments to be passed to |
This function uses the Annoy method (Approximate Nearest Neighbours Oh Yeah) by Erik Bernhardsson to identify approximate k-nearest neighbors in high-dimensional data. Briefly, a tree is constructed by using a random hyperplane to split the points at each internal node. For a given input data point, all points in the same leaf node are defined as a set of potential nearest neighbors. This is repeated across multiple trees, and the union of all such sets is searched to identify the actual nearest neighbors.
By default, nearest neighbors are identified for all data points within X
.
If subset
is specified, nearest neighbors are only detected for the points in the subset.
This yields the same result as (but is more efficient than) subsetting the output matrices after running findKNN
with subset=NULL
.
Turning off get.index
or get.distance
will not return the corresponding matrices in the output.
This may provide a slight speed boost when these returned values are not of interest.
Using BPPARAM
will also split the search across multiple workers, which should increase speed proportionally (in theory) to the number of cores.
If the function is to be called multiple times with the same X
(e.g., with different subset
),
it may be faster to call buildAnnoy
once externally, and pass the returned object to findAnnoy
via the precomputed
argument.
This avoids unnecessary re-indexing and can provide a substantial speed-up.
Note that when precomputed
is supplied, the value of X
is completely ignored.
A list is returned containing:
index
, if get.index=TRUE
.
This is an integer matrix where each row corresponds to a point (denoted here as i) in X
.
The row for i contains the row indices of X
that are the nearest neighbors to point i, sorted by increasing distance from i.
distance
, if get.distance=TRUE
.
This is a numeric matrix where each row corresponds to a point (as above) and contains the sorted distances of the neighbors from i.
If subset
is not NULL
, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset
.
Aaron Lun
Bernhardsson E (2018). Annoy. https://github.com/spotify/annoy
See buildAnnoy
to construct the index files ahead of time.
Y <- matrix(rnorm(100000), ncol=20) out <- findAnnoy(Y, k=25) head(out$index) head(out$distance)