findKmknn {BiocNeighbors}	R Documentation

Find nearest neighbors

Description

Use the KMKNN (K-means for k-nearest neighbors) algorithm to identify nearest neighbors from a dataset.

Usage

findKmknn(X, k, get.index=TRUE, get.distance=TRUE, BPPARAM=SerialParam(), 
    precomputed=NULL, subset=NULL, raw.index=FALSE, ...)

Arguments

X

A numeric matrix where rows correspond to data points and columns correspond to variables (i.e., dimensions).

k

A positive integer scalar specifying the number of nearest neighbors to retrieve.

get.index

A logical scalar indicating whether the indices of the nearest neighbors should be recorded.

get.distance

A logical scalar indicating whether distances to the nearest neighbors should be recorded.

BPPARAM

A BiocParallelParam object indicating how the search should be parallelized.

precomputed

A KmknnIndex object returned from running buildKmknn on X.

subset

A vector indicating the rows of X for which the nearest neighbors should be identified.

raw.index

A logical scalar indicating whether column indices to the reordered data in precomputed should be directly returned.

...

Further arguments to pass to buildKmknn if precomputed=NULL.

Details

This function uses the method proposed by Wang (2012) to quickly identify k-nearest neighbors in high-dimensional data. Briefly, data points are rapidly clustered into N clusters using k-means clustering in buildKmknn, where N is the square root of the number of points. This clustering is then used to speed up the nearest neighbor search across X, exploiting the triangle inequality between cluster centers, the query point and each point in the cluster to narrow the search space.
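The pruning idea can be illustrated in base R. The following is a simplified sketch, not the package's actual implementation: points are grouped with kmeans(), and for a point x in a cluster with center c, the triangle inequality gives the lower bound d(q, x) >= |d(q, c) - d(c, x)|, so any point whose bound exceeds the current k-th best distance can be skipped without computing its distance to the query q.

```r
# Simplified base-R sketch of KMKNN-style pruning (illustrative only).
set.seed(42)
X <- matrix(rnorm(200 * 5), ncol=5)
q <- rnorm(5)
k <- 3

ncl <- floor(sqrt(nrow(X)))
km <- kmeans(X, centers=ncl, nstart=5)
r <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))  # d(c, x) for each point
dqc <- sqrt(rowSums(sweep(km$centers, 2, q)^2))       # d(q, c) for each center

best <- rep(Inf, k)       # current k smallest distances found so far
best.id <- integer(k)     # row indices of those neighbors
for (cl in order(dqc)) {  # visit clusters closest to the query first
    members <- which(km$cluster == cl)
    lb <- abs(dqc[cl] - r[members])      # lower bound on d(q, x)
    for (i in members[lb < max(best)]) { # skip points that cannot qualify
        d <- sqrt(sum((q - X[i, ])^2))
        if (d < max(best)) {
            w <- which.max(best)
            best[w] <- d
            best.id[w] <- i
        }
    }
}
o <- order(best)
best <- best[o]
best.id <- best.id[o]
```

The sketch still visits every cluster, so its correctness does not depend on the bound; the bound only reduces the number of exact distance computations.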

By default, nearest neighbors are identified for all data points within X. If subset is specified, nearest neighbors are only detected for the points in the subset. This yields the same result as (but is more efficient than) subsetting the output matrices after running findKmknn with subset=NULL.
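As a sketch of this equivalence (assuming BiocNeighbors is installed):

```r
library(BiocNeighbors)
Y <- matrix(rnorm(2000), ncol=20)
full <- findKmknn(Y, k=5)
sub <- findKmknn(Y, k=5, subset=1:10)
# Rows of 'sub' correspond to the first 10 points of Y, in order;
# barring ties, this matches subsetting the full output.
identical(full$index[1:10, ], sub$index)
```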

If get.index=FALSE or get.distance=FALSE, the corresponding matrix is omitted from the output. This may provide a slight speed boost when these values are not of interest. Setting BPPARAM will also split the search across multiple workers, which should (in theory) increase speed proportionally to the number of cores.

If the function is to be called multiple times with the same X (e.g., with different subset), it may be faster to call buildKmknn once externally, and pass the returned object to findKmknn via the precomputed argument. This avoids unnecessary repeated k-means clustering and can provide a substantial speed-up. Note that when precomputed is supplied, the value of X is completely ignored.
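A sketch of this pattern (assuming BiocNeighbors is installed), where the k-means clustering is performed only once:

```r
library(BiocNeighbors)
Y <- matrix(rnorm(2000), ncol=20)
pre <- buildKmknn(Y)  # cluster once, reuse for every search
out1 <- findKmknn(Y, k=5, precomputed=pre, subset=1:10)
out2 <- findKmknn(Y, k=5, precomputed=pre, subset=11:20)
```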

Currently, only Euclidean distances are supported, but support may be added for other distance types depending on demand. It remains to be seen whether the speed-up achieved with k-means is still applicable to alternative distance metrics.

Note that the code here was originally derived from an implementation in the cydar package (Lun et al., 2017).

Value

A list is returned containing:

index, if get.index=TRUE: an integer matrix where each row corresponds to a point in X and contains the row indices in X of its k nearest neighbors, sorted by increasing distance.

distance, if get.distance=TRUE: a numeric matrix of the same dimensions, where each row contains the distances to the k nearest neighbors of the corresponding point, sorted in increasing order.

If subset is not NULL, each row of the above matrices refers to a point in the subset, in the same order as supplied in subset.

If raw.index=TRUE, the values in index refer to columns of KmknnIndex_clustered_data(precomputed).

Ties and random seeds

In general, this function is fully deterministic, despite the use of a stochastic k-means step in buildKmknn. The only exception occurs when there are tied distances to neighbors, at which point the order and/or identity of the k-nearest neighboring points is not well-defined.

A warning will be raised if ties are detected among the k+1 nearest neighbors, as this indicates that the order/identity is arbitrary. Specifically, ties are detected when a larger distance is less than (1 + 1e-10)-fold of the smaller distance. This criterion tends to be somewhat conservative in the sense that it will warn users even if there is no problem (i.e., the distances are truly different). However, more accurate detection is difficult to achieve due to the vagaries of numerical precision across different machines.

In the presence of ties, the output will depend on the ordering of points in the buildKmknn output. Users should set the seed to guarantee consistent (albeit arbitrary) results across different runs of the function. Note, however, that the exact selection of tied points depends on the numerical precision of the system. Thus, even after setting a seed, there is no guarantee that the results will be reproducible across machines (especially Windows)!
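For example (assuming BiocNeighbors is installed), fixing the seed before each search gives consistent tie-breaking on a given machine:

```r
library(BiocNeighbors)
Y <- matrix(rnorm(2000), ncol=20)
set.seed(1000)
res1 <- findKmknn(Y, k=5)
set.seed(1000)
res2 <- findKmknn(Y, k=5)
# Same seed, same ordering from buildKmknn, hence the same output.
identical(res1, res2)
```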

Returning raw indices

Advanced users can also set raw.index=TRUE, which yields results equivalent to running findKmknn on t(PRE) directly, where PRE is the output of KmknnIndex_clustered_data(precomputed). With this setting, the indices in the output index matrix refer to columns of PRE. Similarly, the subset argument is assumed to refer to columns of PRE.

This setting may be more convenient when the reordered data in precomputed is used elsewhere, e.g., for plotting. With raw.index=TRUE, users avoid the need to switch between the original ordering and that from buildKmknn. Of course, it is also the user's responsibility to be aware of the reordering in downstream applications.
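A sketch of the raw-index workflow (assuming BiocNeighbors is installed):

```r
library(BiocNeighbors)
Y <- matrix(rnorm(2000), ncol=20)
pre <- buildKmknn(Y)
out <- findKmknn(Y, k=5, precomputed=pre, raw.index=TRUE)
PRE <- KmknnIndex_clustered_data(pre)  # dimensions x reordered points
# Entries of out$index are column indices into PRE, e.g., the coordinates
# of the nearest neighbor of the first (reordered) point:
PRE[, out$index[1, 1]]
```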

Author(s)

Aaron Lun

References

Wang X (2012). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. Proc Int Jt Conf Neural Netw, 43, 6:2351-2358.

Lun ATL, Richard AC, Marioni JC (2017). Testing for differential abundance in mass cytometry data. Nat. Methods, 14, 7:707-709.

See Also

buildKmknn to build the index structure ahead of time.

Examples

Y <- matrix(rnorm(100000), ncol=20)
out <- findKmknn(Y, k=25)
head(out$index)
head(out$distance)

[Package BiocNeighbors version 1.0.0 Index]