BiocNeighbors 1.18.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1411 3156 5127 9706 8865 2033 6608 7808 8992 1531
## [2,] 8650 2146 8057 60 8299 9951 1923 8408 6779 7254
## [3,] 1333 8306 8523 8188 3783 3211 543 4802 9816 1212
## [4,] 2901 1747 566 7570 2121 3328 3620 2050 4617 4536
## [5,] 5441 6389 6655 8523 1782 3448 2649 467 5320 2477
## [6,] 843 4472 4934 506 8133 9518 797 631 5956 2589
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9113433 0.9442689 0.9674024 0.9716540 0.9831457 1.0116637 1.0335116
## [2,] 0.8163012 0.8510935 0.9216015 0.9657613 0.9907292 0.9948630 0.9986632
## [3,] 0.8113317 0.8649326 0.8666590 0.8732036 0.8749936 0.8755187 0.8899532
## [4,] 0.9229928 0.9834348 0.9905853 1.0542908 1.0658935 1.0660372 1.0678189
## [5,] 0.8148131 0.8749768 0.8932300 0.9118466 1.0324681 1.0409095 1.0410593
## [6,] 0.9233640 0.9345645 0.9394530 0.9717999 0.9935158 0.9954746 1.0179756
## [,8] [,9] [,10]
## [1,] 1.0360973 1.0893233 1.0962842
## [2,] 1.0065637 1.0135359 1.0158643
## [3,] 0.9001028 0.9132335 0.9384832
## [4,] 1.0878972 1.0988855 1.1017878
## [5,] 1.0424314 1.0544243 1.0622605
## [6,] 1.0183369 1.0280355 1.0496942
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8828 8021 1216 5974 4222
## [2,] 1002 7945 2511 8710 226
## [3,] 2887 5068 3095 2796 7391
## [4,] 6334 2460 9655 5531 2016
## [5,] 8816 12 2179 9628 6158
## [6,] 9345 8683 7753 3880 7428
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7474802 0.8132715 0.9063365 0.9830542 0.9879160
## [2,] 0.9383955 0.9541859 1.0251299 1.0256882 1.0368404
## [3,] 0.8047271 1.1227342 1.1300002 1.1451337 1.1456198
## [4,] 0.8857076 0.9328380 0.9795797 0.9918284 0.9975939
## [5,] 1.0858150 1.1334616 1.1439090 1.1892228 1.1904478
## [6,] 0.9220677 0.9237384 1.0348144 1.0457319 1.0841793
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.17-bioc\\tmpdir\\Rtmpsfj5aE\\file2be45cf57059.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2022 x64 (build 20348)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.18.0 knitr_1.42 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.0 xfun_0.39
## [4] jsonlite_1.8.4 S4Vectors_0.38.0 htmltools_0.5.5
## [7] stats4_4.3.0 sass_0.4.5 rmarkdown_2.21
## [10] grid_4.3.0 evaluate_0.20 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.33
## [16] BiocManager_1.30.20 compiler_4.3.0 codetools_0.2-19
## [19] Rcpp_1.0.10 BiocParallel_1.34.0 lattice_0.21-8
## [22] digest_0.6.31 R6_2.5.1 parallel_4.3.0
## [25] bslib_0.4.2 Matrix_1.5-4 tools_4.3.0
## [28] BiocGenerics_0.46.0 cachem_1.0.7