mbkmeans {mbkmeans}R Documentation

Mini-Batch k-means for large single cell sequencing data

Description

This is an implementation of the mini-batch k-means algorithm of Sculley (2010) for large single cell sequencing data with the dimensionality reduction results as input in the reducedDim() slot.

Usage

mbkmeans(x, ...)

## S4 method for signature 'SummarizedExperiment'
mbkmeans(x, whichAssay = 1, ...)

## S4 method for signature 'SingleCellExperiment'
mbkmeans(x, reduceMethod = "PCA",
  whichAssay = 1, ...)

## S4 method for signature 'LinearEmbeddingMatrix'
mbkmeans(x, ...)

## S4 method for signature 'ANY'
mbkmeans(x, clusters, batch_size = ifelse(ncol(x) > 100,
  ceiling(ncol(x) * 0.05), ncol(x)), max_iters = 100, num_init = 1,
  init_fraction = ifelse(ncol(x) > 100, 0.25, 1),
  initializer = "kmeans++", calc_wcss = FALSE, early_stop_iter = 10,
  verbose = FALSE, CENTROIDS = NULL, tol = 1e-04)

Arguments

x

The object on which to run mini-batch k-means. It can be a matrix-like object (e.g., matrix, Matrix, DelayedMatrix, HDF5Matrix) with genes in the rows and samples in the columns. Specialized methods are defined for SummarizedExperiment and SingleCellExperiment.

...

Arguments to pass to the matrix method.

whichAssay

The assay to use as input to mini-batch k-means. If x is a SingleCellExperiment, this is ignored unless reduceMethod = NA.

reduceMethod

Name of dimensionality reduction results to use as input to mini-batch k-means. Set to NA to use the full matrix.

clusters

the number of clusters

batch_size

the size of the mini batches. By default, it equals the number of observations if there are less than 100 cells and 5 percent of observations otherwise.

max_iters

the maximum number of clustering iterations

num_init

number of times the algorithm will be run with different centroid seeds

init_fraction

percentage of data to use for the initialization centroids (applies if initializer is kmeans++ ). Should be a float number between 0.0 and 1.0. By default, it uses all the data if there are less than 100 cells and 25 percent of the data otherwise.

initializer

the method of initialization. One of kmeans++ and random. See details for more information

calc_wcss

either TRUE or FALSE, indicating whether the result of WCSS is shown. FALSE is default

early_stop_iter

continue that many iterations after calculation of the best within-cluster-sum-of-squared-error

verbose

either TRUE or FALSE, indicating whether progress is printed during clustering

CENTROIDS

a matrix of initial cluster centroids. The rows of the CENTROIDS matrix should be equal to the number of clusters and the columns should be equal to the columns of the data

tol

a float number. If, in case of an iteration (iteration > 1 and iteration < max_iters) 'tol' is greater than the squared norm of the centroids, then kmeans has converged

Details

The implementation is largely based on the MiniBatchKmeans function of the ClusterR package. The contribution of this package is to provide support for on-disk data representations such as HDF5, through the use of DelayedMatrix and HDF5Matrix objects, as well as for sparse data representation through the classes of the Matrix package. We also provide high-level methods for objects of class SummarizedExperiment, SingleCellExperiment, and LinearEmbeddingMatrix.

This function performs k-means clustering using mini batches.

kmeans++: kmeans++ initialization. Reference : http://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf AND http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work

random: random selection of data rows as initial centroids

Value

A list with the following attributes: centroids, WCSS_per_cluster, best_initialization, iters_per_initialization.

a list with the following attributes: centroids, WCSS_per_cluster, best_initialization, iters_per_initialization

Author(s)

Lampros Mouselimis and Yuwei Ni

References

Sculley. Web-Scale K-Means Clustering. WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. ACM 978-1-60558-799-8/10/04.

https://github.com/mlampros/ClusterR

Examples

library(SummarizedExperiment)
se <- SummarizedExperiment(matrix(rnorm(100), ncol=10))
mbkmeans(se, clusters = 2)
library(SingleCellExperiment)
sce <- SingleCellExperiment(matrix(rnorm(100), ncol=10))
mbkmeans(sce, clusters = 2, reduceMethod = NA)
x<-matrix(rnorm(100), ncol=10)
mbkmeans(x,clusters = 3)


[Package mbkmeans version 1.0.0 Index]