BiocNeighbors 1.16.0
The BiocNeighbors package implements a few algorithms for exact nearest neighbor searching, including the k-means for k-nearest neighbors (KMKNN) algorithm and vantage point (VP) trees.
Both KMKNN and VP-trees involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic.
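The determinism claim can be checked empirically. The following is a minimal sketch, assuming BiocNeighbors is installed; the object names (pts, run1, run2, vp) and the seeds are arbitrary choices for illustration. It repeats the KMKNN search under different random seeds and also compares against a VP tree search.

```r
library(BiocNeighbors)

# Small illustrative dataset: 1000 points in 10 dimensions.
set.seed(1)
pts <- matrix(runif(1000 * 10), ncol = 10)

# Same search with different seeds before index construction.
set.seed(10)
run1 <- findKNN(pts, k = 5, BNPARAM = KmknnParam())
set.seed(20)
run2 <- findKNN(pts, k = 5, BNPARAM = KmknnParam())

# Same search with a different exact algorithm.
vp <- findKNN(pts, k = 5, BNPARAM = VptreeParam())

# Compare neighbor identities and distances across runs and algorithms.
same_seed_index <- identical(run1$index, run2$index)
same_algo_index <- identical(run1$index, vp$index)
same_algo_dist <- isTRUE(all.equal(run1$distance, vp$distance))
```

With continuous random data such as this, tied distances are not expected, so all three comparisons should come out as TRUE; data with exact ties may behave differently.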
The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.
library(BiocNeighbors)
# 10000 points drawn uniformly from a 20-dimensional unit hypercube.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
The findKNN() method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting BNPARAM=KmknnParam() (which is also the default, so this is not strictly necessary here).
We could use a VP tree instead by setting BNPARAM=VptreeParam().
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6961 7813 9170 1226 3839 7008 9424 2876 3767  7456
## [2,] 8990  125 8362 6248 4779 2685 9581 7234 3846  5162
## [3,] 9326 3930 1608 7055 2299 2535 9984 3870 6416  4849
## [4,] 3103 9028 5464 6662 2537 4710  553 2122 6900  9294
## [5,] 3242  718 6475 6813 9395 2844 7682 2232 1385  6772
## [6,] 2221  352 6141 7122  591 1269  480 7758 4629  1807
head(fout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.7948161 0.8700061 0.9444299 0.9736093 1.0045227 1.0259050 1.0373362
## [2,] 0.7987711 0.8186981 0.8391710 0.8547602 0.8618549 0.8623643 0.8785859
## [3,] 0.9887107 1.0396488 1.0614358 1.0758956 1.0850965 1.0995462 1.0998265
## [4,] 1.0073244 1.0550710 1.0565176 1.0693396 1.0985015 1.1051071 1.1145887
## [5,] 0.9491025 0.9575265 0.9926144 1.0278524 1.0385156 1.0391379 1.0672122
## [6,] 1.0073630 1.0495905 1.0893822 1.0962145 1.1590247 1.1655495 1.1744522
##           [,8]      [,9]     [,10]
## [1,] 1.0566652 1.0574998 1.0726549
## [2,] 0.8855493 0.8856993 0.8926532
## [3,] 1.1020565 1.1131608 1.1305606
## [4,] 1.1284751 1.1318379 1.1322208
## [5,] 1.0704776 1.0827966 1.0901881
## [6,] 1.1923896 1.1974756 1.2143994
Each row of the index matrix corresponds to a point in data and contains the row indices in data that are its nearest neighbors.
For example, the 3rd point in data has the following nearest neighbors:
fout$index[3,]
## [1] 9326 3930 1608 7055 2299 2535 9984 3870 6416 4849
… with the following distances to those neighbors:
fout$distance[3,]
## [1] 0.9887107 1.0396488 1.0614358 1.0758956 1.0850965 1.0995462 1.0998265
## [8] 1.1020565 1.1131608 1.1305606
Note that the reported neighbors are sorted by distance.
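To make the meaning of the index and distance matrices concrete, the same quantities can be computed by brute force in base R on a small dataset. This is only an illustrative sketch and not how the package performs the search; the names small, dmat, brute_index and brute_distance are made up for this example.

```r
# Brute-force k-nearest neighbors in base R, for illustration only.
set.seed(100)
small <- matrix(runif(50 * 3), ncol = 3)  # 50 points in 3 dimensions
k <- 4

# Full pairwise Euclidean distance matrix between all points.
dmat <- as.matrix(dist(small))

# A point is not reported as its own neighbor, so mask the diagonal.
diag(dmat) <- Inf

# For each point (row), keep the k closest other points, sorted by distance.
brute_index <- t(apply(dmat, 1, function(d) order(d)[seq_len(k)]))
brute_distance <- t(apply(dmat, 1, function(d) sort(d)[seq_len(k)]))
dimnames(brute_index) <- NULL
dimnames(brute_distance) <- NULL

# Neighbors of the 3rd point and the distances to them.
brute_index[3, ]
brute_distance[3, ]
```

Row i of brute_index plays the same role as row i of fout$index: it lists the rows of the input matrix closest to point i, in order of increasing distance. The brute-force approach needs the full distance matrix, which is what index-based algorithms avoid on larger datasets.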
Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small data set:
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
We then use the queryKNN() function to identify the 5 nearest neighbors in data for each point in query.
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 3896 7662 2570 1502 7470
## [2,] 4784 8439 9245 4159 3410
## [3,] 7134 2162 1791 8784 3087
## [4,] 1039 1661 3299 2363 7723
## [5,] 9271 1113 8542 9340 7815
## [6,] 8201 3139  555 6463 9062
head(qout$distance)
##           [,1]      [,2]      [,3]     [,4]     [,5]
## [1,] 0.9698087 1.0042420 1.0136370 1.030114 1.030365
## [2,] 0.8719075 0.9062713 0.9257835 1.011957 1.013228
## [3,] 0.9806023 0.9935614 1.0215643 1.062581 1.064352
## [4,] 1.0010272 1.0449616 1.1036209 1.129081 1.161795
## [5,] 1.0027726 1.0117239 1.0292581 1.029434 1.046989
## [6,] 0.8826955 1.0486813 1.0554361 1.065353 1.092932
Each row of the index matrix contains the row indices in data that are the nearest neighbors of a point in query.
For example, the 3rd point in query has the following nearest neighbors in data:
qout$index[3,]
## [1] 7134 2162 1791 8784 3087
… with the following distances to those neighbors:
qout$distance[3,]
## [1] 0.9806023 0.9935614 1.0215643 1.0625811 1.0643518
Again, the reported neighbors are sorted by distance.
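The query search can likewise be spelled out by brute force in base R. The sketch below uses small made-up matrices (ref and qry, standing in for data and query) and is meant only to show what the returned indices refer to, not to reproduce the package's algorithm.

```r
# Brute-force query search in base R, for illustration only.
set.seed(200)
ref <- matrix(runif(40 * 3), ncol = 3)  # reference points (role of 'data')
qry <- matrix(runif(5 * 3), ncol = 3)   # query points (role of 'query')
k <- 3

# cross[i, j] is the Euclidean distance from query point i to reference point j.
cross <- apply(ref, 1, function(r) sqrt(colSums((t(qry) - r)^2)))

# For each query point (row), keep the k closest reference points.
q_index <- t(apply(cross, 1, function(d) order(d)[seq_len(k)]))
q_distance <- t(apply(cross, 1, function(d) sort(d)[seq_len(k)]))

# Neighbors in 'ref' of the 3rd query point.
q_index[3, ]
q_distance[3, ]
```

Unlike the within-dataset search, there is no self-match to exclude here: the query points are not rows of the reference matrix, so every reference point is a candidate neighbor.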
Users can perform the search for a subset of query points using the subset= argument.
This yields the same result as performing the search for all points and then subsetting the output, but is more efficient.
findKNN(data, k=5, subset=3:5)
## $index
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 9326 3930 1608 7055 2299
## [2,] 3103 9028 5464 6662 2537
## [3,] 3242  718 6475 6813 9395
## 
## $distance
##           [,1]      [,2]      [,3]     [,4]     [,5]
## [1,] 0.9887107 1.0396488 1.0614358 1.075896 1.085097
## [2,] 1.0073244 1.0550710 1.0565176 1.069340 1.098501
## [3,] 0.9491025 0.9575265 0.9926144 1.027852 1.038516
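The equivalence between subset= and subsetting the full output can be checked directly. This is a small sketch, assuming BiocNeighbors is installed; pts, full and sub are illustrative names and the dataset is deliberately smaller than the one above.

```r
library(BiocNeighbors)

# Small illustrative dataset: 500 points in 5 dimensions.
set.seed(300)
pts <- matrix(runif(500 * 5), ncol = 5)

# Search for all points, then for points 3 to 5 only.
full <- findKNN(pts, k = 5)
sub <- findKNN(pts, k = 5, subset = 3:5)

# The subset search should match the corresponding rows of the full search.
index_match <- isTRUE(all.equal(sub$index, full$index[3:5, , drop = FALSE]))
distance_match <- isTRUE(all.equal(sub$distance, full$distance[3:5, , drop = FALSE]))
```

Both index_match and distance_match are expected to be TRUE for data without tied distances. The saving comes from skipping the search for the points that are not in the subset.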
If only the indices are of interest, users can set get.distance=FALSE to avoid returning the matrix of distances.
This will save some time and memory.
names(findKNN(data, k=2, get.distance=FALSE))
## [1] "index"
It is also simple to speed up functions by parallelizing the calculations with the BiocParallel framework.
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
For multiple queries to a constant data, the pre-clustering can be performed in a separate step with buildIndex().
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering.
pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
The default setting is to search on the Euclidean distance.
Alternatively, we can use the Manhattan distance by setting distance="Manhattan" in the BiocNeighborParam object.
out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))
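For readers less familiar with the two metrics, the difference can be illustrated in base R on a small example. This sketch uses dist() rather than BiocNeighbors, and the names pts, d_euc, d_man and nearest are made up for illustration.

```r
# Euclidean versus Manhattan distance in base R, for illustration only.
set.seed(400)
pts <- matrix(runif(30 * 4), ncol = 4)  # 30 points in 4 dimensions

d_euc <- as.matrix(dist(pts, method = "euclidean"))
d_man <- as.matrix(dist(pts, method = "manhattan"))

# The same two distances computed by hand for points 1 and 2.
euc_12 <- sqrt(sum((pts[1, ] - pts[2, ])^2))  # root of summed squared differences
man_12 <- sum(abs(pts[1, ] - pts[2, ]))       # sum of absolute differences

# Single nearest neighbor of each point under a given distance matrix.
nearest <- function(dmat) {
    diag(dmat) <- Inf  # a point is not its own neighbor
    unname(apply(dmat, 1, which.min))
}
nn_euc <- nearest(d_euc)
nn_man <- nearest(d_man)

# Fraction of points whose nearest neighbor is the same under both metrics.
mean(nn_euc == nn_man)
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, and the two metrics can rank neighbors differently, so the choice of distance= can change which neighbors are reported.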
Advanced users may also be interested in the raw.index= argument, which returns indices directly to the precomputed object rather than to data.
This may be useful inside package functions where it may be more convenient to work on a common precomputed object.
sessionInfo()
## R version 4.2.1 Patched (2022-07-09 r82577)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_GB/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.32.0 BiocNeighbors_1.16.0 knitr_1.40
## [4] BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] lattice_0.20-45 R6_2.5.1 rlang_1.0.6
## [7] fastmap_1.1.0 stringr_1.4.1 tools_4.2.1
## [10] parallel_4.2.1 grid_4.2.1 xfun_0.34
## [13] cli_3.4.1 jquerylib_0.1.4 htmltools_0.5.3
## [16] yaml_2.3.6 digest_0.6.30 bookdown_0.29
## [19] Matrix_1.5-1 BiocManager_1.30.19 S4Vectors_0.36.0
## [22] sass_0.4.2 codetools_0.2-18 cachem_1.0.6
## [25] evaluate_0.17 rmarkdown_2.17 stringi_1.7.8
## [28] compiler_4.2.1 bslib_0.4.0 stats4_4.2.1
## [31] jsonlite_1.8.3