Scientific computing in Python is well established. This package takes advantage of recent work at RStudio that fosters Python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation, and this package takes an initial step in this direction.
A key motivation is experimenting with an incremental PCA implementation on very large data that do not fit in memory. We have also provided an interface to the sklearn.cluster.KMeans procedure.
The package includes a list of references to Python modules.
We can acquire Python documentation of included modules with reticulate’s py_help. The following result could get stale:
skd = reticulate::import("sklearn")$decomposition
reticulate::py_help(skd)
Help on package sklearn.decomposition in sklearn:
NAME
    sklearn.decomposition
FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py
DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the algorithms of
    this module can be regarded as dimensionality reduction techniques.
PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
...
The reticulate package is designed to limit the amount of effort required to convert data between R and Python for natural use in each language.
np = reticulate::import("numpy", convert=FALSE, delay_load=TRUE)
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = np$genfromtxt(irloc, delimiter=',')
To examine a submatrix, we use the take method from numpy. The bracket format seen below notifies us that we are not looking at data native to R.
## array([[5.1, 3.5, 1.4, 0.2],
##        [4.9, 3. , 1.4, 0.2],
##        [4.7, 3.2, 1.3, 0.2]])
We’ll use R’s prcomp as a first test to demonstrate performance of the sklearn modules with the iris data.
We have a Python representation of the iris data. We compute the PCA as follows:
## SkDecomp instance, method:  PCA 
## use getTransformed() to acquire projected input.
This returns an object that can be reused through Python methods.
The numerical transformation is accessed via getTransformed.
## [1] 150   4
##           [,1]       [,2]        [,3]         [,4]
## [1,] -2.684126  0.3193972 -0.02791483 -0.002262437
## [2,] -2.714142 -0.1770012 -0.21046427 -0.099026550
## [3,] -2.888991 -0.1449494  0.01790026 -0.019968390
## [4,] -2.745343 -0.3182990  0.03155937  0.075575817
## [5,] -2.728717  0.3267545  0.09007924  0.061258593
## [6,] -2.280860  0.7413304  0.16867766  0.024200858
Concordance with the R computation can be checked:
##      PC1 PC2 PC3 PC4
## [1,]   1   0   0   0
## [2,]   0  -1   0   0
## [3,]   0   0  -1   0
## [4,]   0   0   0  -1
A computation supporting a priori bounding of memory consumption is available. In this procedure one can also select the number of principal components to compute.
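The bounded-memory idea can be sketched directly against scikit-learn. The snippet below is a minimal illustration, not this package's implementation: it assumes scikit-learn's bundled copy of the iris data (load_iris) and the choices n_components=2 and batch_size=25, and it compares the result with ordinary PCA through absolute correlations, since principal components are determined only up to sign.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

# Iris measurements (150 x 4), taken from scikit-learn's bundled copy.
X, _ = load_iris(return_X_y=True)

# Reference: ordinary PCA, which works on the whole matrix at once.
full = PCA(n_components=2).fit_transform(X)

# Incremental PCA: fit() walks through X in blocks of batch_size rows,
# so only one block at a time enters the decomposition update.
ipca = IncrementalPCA(n_components=2, batch_size=25)
incr = ipca.fit(X).transform(X)

# Components are defined only up to sign, so compare absolute correlations.
abs_cor = [abs(np.corrcoef(full[:, j], incr[:, j])[0, 1]) for j in range(2)]
print(np.round(abs_cor, 3))
```

Smaller values of batch_size lower peak memory use at the cost of more update steps.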
ippca = skIncrPCA(iris[,1:4])
ippcab = skIncrPCA(iris[,1:4], batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
##         PC1 PC2   PC3    PC4
## [1,]  1.000   0  0.00  0.000
## [2,] -0.001  -1 -0.01 -0.001
This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object, for which there is no container at present. Again the number of components computed can be specified.
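For orientation, the chunk-by-chunk update can be written in plain scikit-learn terms. This is a sketch under stated assumptions rather than the code behind skPartialPCA_step: it assumes the underlying call is IncrementalPCA.partial_fit, uses scikit-learn's bundled iris data, and feeds three chunks of 50 rows selected with numpy's take along axis 0.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

X, _ = load_iris(return_X_y=True)

# n_components=None keeps all 4 components for this 4-column matrix.
ipc = IncrementalPCA()

# Update the same estimator with three successive chunks of 50 rows.
for start in (0, 50, 100):
    chunk = np.take(X, np.arange(start, start + 50), axis=0)
    ipc.partial_fit(chunk)

# Project the first six rows with the incrementally fitted model.
head = ipc.transform(X[:6])

# Ordinary PCA on the full matrix, for comparison up to column signs.
ref = PCA().fit_transform(X)[:6]
print(np.allclose(np.abs(head), np.abs(ref), atol=1e-4))
```

Because every component is retained, no information is discarded between updates, so the projections should agree with a full PCA apart from the sign of each column.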
ta = np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
##           [,1]       [,2]        [,3]         [,4]
## [1,] -2.684126  0.3193972 -0.02791483  0.002262437
## [2,] -2.714142 -0.1770012 -0.21046427  0.099026550
## [3,] -2.888991 -0.1449494  0.01790026  0.019968390
## [4,] -2.745343 -0.3182990  0.03155937 -0.075575817
## [5,] -2.728717  0.3267545  0.09007924 -0.061258593
## [6,] -2.280860  0.7413304  0.16867766 -0.024200858
##            PC1        PC2         PC3          PC4
## [1,] -2.684126 -0.3193972  0.02791483  0.002262437
## [2,] -2.714142  0.1770012  0.21046427  0.099026550
## [3,] -2.888991  0.1449494 -0.01790026  0.019968390
## [4,] -2.745343  0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
We have extracted methylation data for the Yoruban subcohort of CEPH from the yriMulti package. Data from chr6 and chr17 are available in an HDF5 matrix in this BiocSklearn package. A reference to the dataset through the h5py File interface is created by H5matref.
See skPartialPCA_h5 for the basilisk interface, and example(H5matref) for working directly with HDF5.
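The pattern of fitting from an on-disk matrix, one block of rows at a time, can be sketched as follows. Everything here is illustrative: partial_pca_by_blocks is a hypothetical helper, the data are random numbers, and a NumPy memory-mapped file stands in for the HDF5 dataset so that the sketch runs without h5py. An h5py dataset supports the same row slicing and shape attribute, so it could be passed as source in the same way.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA


def partial_pca_by_blocks(source, n_components, block_rows):
    """Fit IncrementalPCA by reading `source` one block of rows at a time.

    `source` is any 2-D object with a `shape` and row slicing,
    e.g. a numpy memmap (used here) or an h5py dataset.
    """
    ipca = IncrementalPCA(n_components=n_components)
    for start in range(0, source.shape[0], block_rows):
        # Only this block is materialised in memory.
        block = np.asarray(source[start:start + block_rows])
        ipca.partial_fit(block)
    return ipca


rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.npy")
    np.save(path, rng.normal(size=(1000, 20)))  # stand-in data on disk
    on_disk = np.load(path, mmap_mode="r")      # memory-mapped, not loaded
    model = partial_pca_by_blocks(on_disk, n_components=3, block_rows=100)
    del on_disk  # release the mapping before the directory is removed

print(model.components_.shape)  # (3, 20)
```

Peak memory is governed by block_rows rather than by the number of rows in the file, which is the property that matters for very large matrices.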
We need more applications and profiling.