BiocNeighbors 1.16.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods used in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8] [,9] [,10]
## [1,] 2998 1996 7866 1603 6569  715 8320  3960 6861  5813
## [2,] 2027  898 5381 1951 1995 6884 4890   296 1030  6163
## [3,] 5722 3862 8488   36 7346 7006 8348  7214 5395  4427
## [4,] 1464   63 1019 2373 8717 1868 1132 10000 1086  5936
## [5,] 1050 7891 7529 4149 4003 3090 8007  3072 9456  7049
## [6,] 6671 3870 2515 9055 5986 1401  681  9326 4823  3904
head(fout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]     [,7]
## [1,] 0.9699740 0.9703456 1.0137383 1.0141190 1.0490384 1.0536724 1.056115
## [2,] 0.9925946 1.0504065 1.0597878 1.1027910 1.1158421 1.1208044 1.122202
## [3,] 0.8949631 0.9087799 0.9735734 0.9748551 0.9949543 1.0176134 1.018482
## [4,] 0.8200140 1.0148401 1.1165154 1.1287633 1.1367433 1.1593237 1.176634
## [5,] 0.8186470 0.8713682 0.9101841 0.9520418 0.9759419 0.9858167 1.001212
## [6,] 0.8441145 0.9574950 0.9832371 0.9993249 0.9996191 1.0020701 1.020006
##          [,8]     [,9]    [,10]
## [1,] 1.060883 1.062336 1.064488
## [2,] 1.131508 1.131803 1.141006
## [3,] 1.019078 1.029804 1.060567
## [4,] 1.186096 1.187308 1.187478
## [5,] 1.010752 1.021255 1.031924
## [6,] 1.026687 1.030553 1.042332
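Because the search is approximate, the reported neighbors may differ from the true nearest neighbors. One way to gauge this is to compare against an exact search on the same data. The snippet below is a minimal sketch that assumes the exact KMKNN algorithm (KmknnParam()) as the reference; the `recall` variable and the seed are introduced here for illustration only.

```r
library(BiocNeighbors)

set.seed(100) # reproducible random data
data <- matrix(runif(10000 * 20), ncol=20)

# Approximate search with Annoy, and an exact search (assumed reference: KMKNN).
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of exact neighbors that Annoy also reported.
recall <- vapply(seq_len(nrow(data)), function(i) {
    mean(exact$index[i,] %in% approx$index[i,])
}, numeric(1))

# Average recall across all points; values near 1 indicate close agreement.
mean(recall)
```

A low average would suggest tuning the algorithm's parameters, as described in the documentation for AnnoyParam().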
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  580 4335 1238 8520 2291
## [2,] 3190 9194 6097 2735 2947
## [3,]  688 5201 7866 4676  220
## [4,] 4091 8798 4744 7038 8309
## [5,] 3763 3401 1564 1807 2715
## [6,] 9486 9312 2801 5079 9197
head(qout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.7866626 0.8574749 0.8771691 0.9088970 0.9366355
## [2,] 0.8449786 0.9715865 1.0555539 1.0805581 1.0809981
## [3,] 0.8957751 0.9174234 0.9761440 0.9787058 0.9813795
## [4,] 0.9105396 0.9521742 1.0249147 1.0279537 1.1160817
## [5,] 0.8012452 0.9045596 0.9730635 1.0240366 1.0565325
## [6,] 1.0099212 1.0357484 1.0416129 1.0639051 1.1244622
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
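As a brief illustration of switching to HNSW, the sketch below repeats the earlier searches with HnswParam() in place of AnnoyParam(). The data are regenerated so that the snippet is self-contained; the names `hout` and `hqout` are introduced here for illustration.

```r
library(BiocNeighbors)

data <- matrix(runif(10000 * 20), ncol=20)
query <- matrix(runif(1000 * 20), ncol=20)

# Same calls as for Annoy; only the BNPARAM argument changes.
hout <- findKNN(data, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

# One row per searched point, one column per neighbor.
dim(hout$index)
dim(hqout$index)
```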
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
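The other options listed above can be sketched in the same way. The example below is illustrative only: the choice of the first 10 points for subset and of SnowParam(2) from BiocParallel as the parallel backend are assumptions made here, and any other BiocParallelParam object could be substituted.

```r
library(BiocNeighbors)
library(BiocParallel)

data <- matrix(runif(10000 * 20), ncol=20)

# Find neighbors for the first 10 points only.
sub <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub$index)

# Skip the distance matrix when only neighbor identities are needed.
nodist <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(nodist)

# Distribute the search across two workers (assumed backend: SnowParam).
par <- findKNN(data, k=5, BPPARAM=SnowParam(2), BNPARAM=AnnoyParam())
dim(par$index)
```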
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
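For instance, a Manhattan-distance search with Annoy might look like the sketch below. The manual check on the first point is added here for illustration; small discrepancies between the two values are to be expected, since the index may store coordinates at reduced precision.

```r
library(BiocNeighbors)

data <- matrix(runif(10000 * 20), ncol=20)

# Request Manhattan distances through the parameter object.
mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance from the first point to its first reported neighbor.
i <- 1
j <- mout$index[i, 1]
reported <- mout$distance[i, 1]
manual <- sum(abs(data[i,] - data[j,]))
c(reported=reported, manual=manual)
```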
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() (see footnote 1) and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/var/folders/db/4tvgx8jx4z3fm1gzlnlzw9rc0000gq/T//RtmpeueRjh/file14be018f1d696.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
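The sketch below shows one possible way to keep the index file outside the default location and to clean it up afterwards. It assumes that the directory argument of AnnoyParam() controls where the file is written; `index.dir` is a hypothetical location (a subdirectory of tempdir() is used here only so that the example is safe to run) and should be replaced by a directory that outlives the session.

```r
library(BiocNeighbors)

data <- matrix(runif(10000 * 20), ncol=20)

# Hypothetical index location; substitute a persistent directory in practice.
index.dir <- file.path(tempdir(), "annoy-indices")
dir.create(index.dir, showWarnings=FALSE)

# Assumption: 'directory' determines where the index file is created.
pre <- buildIndex(data, BNPARAM=AnnoyParam(directory=index.dir))
path <- AnnoyIndex_path(pre)
file.exists(path)

# The pre-built index is used as before.
out <- findKNN(BNINDEX=pre, k=5)

# Remove the index file once it is no longer needed.
unlink(path)
```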
sessionInfo()
## R version 4.2.1 Patched (2022-07-09 r82577)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_GB/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.16.0 knitr_1.40 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] BiocParallel_1.32.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.6 fastmap_1.1.0 stringr_1.4.1
## [10] tools_4.2.1 parallel_4.2.1 grid_4.2.1
## [13] xfun_0.34 cli_3.4.1 jquerylib_0.1.4
## [16] htmltools_0.5.3 yaml_2.3.6 digest_0.6.30
## [19] bookdown_0.29 Matrix_1.5-1 BiocManager_1.30.19
## [22] S4Vectors_0.36.0 sass_0.4.2 codetools_0.2-18
## [25] cachem_1.0.6 evaluate_0.17 rmarkdown_2.17
## [28] stringi_1.7.8 compiler_4.2.1 bslib_0.4.0
## [31] stats4_4.2.1 jsonlite_1.8.3
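1. On HPC file systems, you can change the session's temporary directory (for example, by setting the TMPDIR environment variable before R starts) to a location that is more amenable to concurrent access.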