BiocNeighbors 1.12.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, namely the Annoy and HNSW methods.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
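library(BiocNeighbors)
# Simulate 10000 observations in 20 dimensions.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
# Find the 10 nearest neighbors of each observation with Annoy.
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())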
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7585 6096 5103 3529 9976 3415 1910 2404 9571 1648
## [2,] 8835 7063 2014 1742 8736 7701 6900 5101 4778 8465
## [3,] 3076 2390 3752 8468 5212 5401 2678 6416 4012 2065
## [4,] 3572 109 7240 9176 6278 2866 6706 2588 8903 2663
## [5,] 6228 287 9535 836 1177 3209 7709 9800 7773 4041
## [6,] 7591 9579 1050 3651 7984 6208 6250 6345 3925 8118
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8562882 0.9785632 0.9955938 1.0029528 1.0154270 1.0189490 1.036193
## [2,] 0.8980052 0.9264596 0.9304165 0.9488965 0.9520898 0.9711837 0.988787
## [3,] 0.8909540 0.9162792 0.9691771 1.0108618 1.0434083 1.0516258 1.078825
## [4,] 0.8938934 1.0189710 1.0424507 1.0498170 1.0501077 1.0742526 1.085795
## [5,] 0.9439647 1.0002828 1.0523891 1.0920228 1.0920416 1.0932543 1.098912
## [6,] 1.1933516 1.2220123 1.2371436 1.2576003 1.2633315 1.2724845 1.278140
## [,8] [,9] [,10]
## [1,] 1.0381757 1.0398611 1.056635
## [2,] 0.9930208 0.9975103 1.002489
## [3,] 1.0908662 1.0932020 1.093322
## [4,] 1.0986246 1.1096017 1.110590
## [5,] 1.1026709 1.1068655 1.109273
## [6,] 1.2812349 1.2950578 1.306112
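Because the search is approximate, it can be useful to check how often the reported neighbors agree with those from an exact search. The following is a minimal sketch under some assumptions: it uses a smaller simulated dataset to keep the exact search quick, takes KmknnParam() as the exact reference, and introduces the illustrative names `small`, `overlap` and `recall`, none of which appear in the example above.

```r
library(BiocNeighbors)

set.seed(1000)
# Smaller simulated dataset (illustrative size, not from the example above).
small <- matrix(runif(2000 * 20), ncol = 20)

# Exact reference search and approximate Annoy search with the same k.
exact <- findKNN(small, k = 10, BNPARAM = KmknnParam())
approx <- findKNN(small, k = 10, BNPARAM = AnnoyParam())

# For each point, the fraction of approximate neighbors that are also
# reported by the exact search.
overlap <- vapply(seq_len(nrow(small)), function(i) {
    mean(approx$index[i, ] %in% exact$index[i, ])
}, numeric(1))

# Average agreement across all points (1 means identical neighbor sets).
recall <- mean(overlap)
recall
```

A value close to 1 suggests that the approximation loses little for this dataset; the acceptable threshold depends on the downstream application.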
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9291 8417 4496 944 1505
## [2,] 8223 6341 6101 3826 8639
## [3,] 6639 2815 622 7802 5025
## [4,] 4314 3986 2075 1006 9606
## [5,] 729 5634 6383 8336 6407
## [6,] 6261 6205 8184 9407 8448
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0299851 1.0347896 1.0756702 1.0787002 1.0796717
## [2,] 0.9112312 1.0651323 1.0655141 1.1191463 1.1643677
## [3,] 0.8830458 0.9194919 0.9801800 1.0236783 1.0753007
## [4,] 0.7582997 0.8892292 0.9390271 0.9575239 0.9592733
## [5,] 0.8588887 0.9869007 1.0197400 1.0462971 1.0939988
## [6,] 0.7314596 0.8228635 0.8780414 0.9247340 0.9578670
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
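As a brief sketch of this switch, the calls below repeat the two searches with HnswParam() in place of AnnoyParam(). The object names (`hdata`, `hquery`, `hout`, `hqout`) and the reduced dataset sizes are illustrative choices, not part of the original examples.

```r
library(BiocNeighbors)

set.seed(42)
# Simulated reference and query datasets (sizes chosen for illustration).
hdata <- matrix(runif(5000 * 20), ncol = 20)
hquery <- matrix(runif(500 * 20), ncol = 20)

# Neighbors of each point within the same dataset, using HNSW.
hout <- findKNN(hdata, k = 10, BNPARAM = HnswParam())

# Neighbors in 'hdata' for each point in 'hquery', using HNSW.
hqout <- queryKNN(hdata, hquery, k = 5, BNPARAM = HnswParam())

head(hout$index)
head(hqout$distance)
```

The output has the same structure as for Annoy: an index matrix and a distance matrix with one row per point and one column per neighbor.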
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
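The subset and get.distance options can be combined in a single call. The sketch below assumes a freshly simulated matrix (`pts`, an illustrative name) and restricts the search to the first ten points while skipping the distance calculation; parallelization via BPPARAM is not shown.

```r
library(BiocNeighbors)

set.seed(1)
# Simulated dataset (illustrative name and size).
pts <- matrix(runif(5000 * 20), ncol = 20)

# Only find neighbors for the first 10 points, and do not return distances.
sub <- findKNN(pts, k = 5, subset = 1:10, get.distance = FALSE,
    BNPARAM = AnnoyParam())

# One row per requested point, one column per neighbor.
dim(sub$index)
```

Neighbors are still drawn from all rows of `pts`; subset only limits the points for which results are reported.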
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
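A short sketch of a Manhattan-distance search with Annoy follows. The dataset `pts` and the manual check are illustrative additions; the check simply recomputes the sum of absolute differences for one reported neighbor to confirm which metric is being used.

```r
library(BiocNeighbors)

set.seed(2)
# Simulated dataset (illustrative name and size).
pts <- matrix(runif(5000 * 20), ncol = 20)

# Approximate search using the Manhattan distance instead of Euclidean.
mout <- findKNN(pts, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the Manhattan distance between the first point and its
# first reported neighbor, for comparison with the reported value.
j <- mout$index[1, 1]
manual <- sum(abs(pts[1, ] - pts[j, ]))
c(reported = mout$distance[1, 1], manual = manual)
```

The two values should agree up to floating-point precision, since the underlying library may store coordinates in single precision.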
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the temporary directory (e.g., via the TMPDIR environment variable) to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpVyyuJN/fileb8bb514afcb5.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This path can then be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
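The sketch below shows one way to write the Annoy index to a chosen path and clean it up afterwards. It rests on an assumption: that buildIndex() forwards an `fname` argument to the Annoy index builder, as in the 1.x releases of BiocNeighbors. Check the documentation of your installed version before relying on it. The path `idx.path` and the other object names are hypothetical, and the file is placed under tempdir() only to keep the demonstration tidy; an index meant to persist would point to a permanent location.

```r
library(BiocNeighbors)

set.seed(3)
# Simulated dataset (illustrative name and size).
pts <- matrix(runif(2000 * 20), ncol = 20)

# Hypothetical index location; use a permanent directory for real persistence.
idx.path <- file.path(tempdir(), "annoy-demo.idx")

# Assumption: 'fname' is passed through buildIndex() to the Annoy builder.
pre2 <- buildIndex(pts, BNPARAM = AnnoyParam(), fname = idx.path)
AnnoyIndex_path(pre2)
built <- file.exists(idx.path)

# The pre-built index is used as before.
out3 <- findKNN(BNINDEX = pre2, k = 5)

# Cleaning up the index file is now the user's responsibility.
unlink(idx.path)
```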
sessionInfo()
## R version 4.1.1 Patched (2021-08-22 r80813)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.12.0 knitr_1.36 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 magrittr_2.0.1 BiocGenerics_0.40.0
## [4] BiocParallel_1.28.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_0.4.12 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.1.1 parallel_4.1.1 grid_4.1.1
## [13] xfun_0.27 jquerylib_0.1.4 htmltools_0.5.2
## [16] yaml_2.2.1 digest_0.6.28 bookdown_0.24
## [19] Matrix_1.3-4 BiocManager_1.30.16 S4Vectors_0.32.0
## [22] sass_0.4.0 evaluate_0.14 rmarkdown_2.11
## [25] stringi_1.7.5 compiler_4.1.1 bslib_0.3.1
## [28] stats4_4.1.1 jsonlite_1.7.2