robustica package#

Submodules#

robustica.InferComponents module#

class robustica.InferComponents.InferComponents(max_variance_explained_ratio=0.8, n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)#

Bases: object

Estimate the number of principal components needed to explain a certain amount of variance using sklearn.decomposition.PCA.

Parameters
  • max_variance_explained_ratio (float, default=0.8) – Threshold of maximum variance explained by the desired number of components.

  • n_components (int, float or 'mle', default=None) –

    Number of components to keep. If n_components is not set, all components are kept:

    n_components == min(n_samples, n_features)
    

    If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'. If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Hence, the None case results in:

    n_components == min(n_samples, n_features) - 1
    

  • copy (bool, default=True) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.

  • whiten (bool, default=False) – When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') –

    If auto :

    The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    If full :

    run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing

    If arpack :

    run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)

    If randomized :

    run randomized SVD by the method of Halko et al.

    New in version 0.18.0.

  • tol (float, default=0.0) – Tolerance for singular values computed by svd_solver == ‘arpack’. Must be in the range [0.0, infinity). New in version 0.18.0.

  • iterated_power (int or 'auto', default='auto') – Number of iterations for the power method computed by svd_solver == ‘randomized’. Must be in the range [0, infinity). New in version 0.18.0.

  • random_state (int, RandomState instance or None, default=None) – Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int for reproducible results across multiple function calls. See Glossary. New in version 0.18.0.

pca#
Type

instance of sklearn.decomposition.PCA

cumsum_#

Cumulative explained variance ratio.

Type

np.array of length n_components

inferred_components_#

Number of components required to explain max_variance_explained_ratio amount of variance.

Type

int

Examples

from robustica.examples import make_sampledata
from robustica import InferComponents

X = make_sampledata(200, 50)
ncomp = InferComponents().fit_predict(X)
ncomp

fit(X)#

Run PCA and get the number of components necessary to explain as much variance as defined by max_variance_explained_ratio.

Parameters

X (np.array of shape (n_samples, n_features)) – Data input.

Return type

self

fit_predict(X)#

Run PCA, get the number of components necessary to explain as much variance as defined by max_variance_explained_ratio, and return the inferred number of components.

Parameters

X (np.array of shape (n_samples, n_features)) – Data input.

Returns

inferred_components – Number of components required to explain max_variance_explained_ratio amount of variance.

Return type

int

predict()#

After having run self.fit(X), returns self.inferred_components_

Parameters

self

Returns

self.inferred_components_ – Number of components required to explain max_variance_explained_ratio amount of variance.

Return type

int
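
As a rough illustration of the estimate that InferComponents describes, the sketch below uses plain NumPy (an SVD of the centered data standing in for sklearn.decomposition.PCA) to find the smallest number of components whose cumulative explained variance ratio reaches the threshold. The helper name infer_n_components is hypothetical, and whether InferComponents handles the threshold boundary in exactly this way is an assumption.

```python
import numpy as np


def infer_n_components(X, max_variance_explained_ratio=0.8):
    """Hypothetical helper, not part of robustica.

    Returns the smallest number of principal components whose cumulative
    explained variance ratio reaches the threshold, plus the cumulative curve.
    """
    Xc = X - X.mean(axis=0)  # center each feature, as PCA does
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumsum = np.cumsum(explained_variance_ratio)
    # index of the first cumulative value >= threshold, converted to a count
    n = int(np.searchsorted(cumsum, max_variance_explained_ratio)) + 1
    return min(n, len(cumsum)), cumsum


rng = np.random.default_rng(0)
X = rng.random((200, 50))  # (n_samples, n_features)
n_inferred, cumsum = infer_n_components(X, 0.8)
print(n_inferred, cumsum[n_inferred - 1])
```

The cumulative curve plays the role of the cumsum_ attribute, and the returned count plays the role of inferred_components_.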

robustica.RobustICA module#

class robustica.RobustICA.RobustICA(n_components=None, algorithm='parallel', whiten='arbitrary-variance', fun='logcosh', fun_args=None, max_iter=200, tol=0.0001, w_init=None, random_state=None, n_jobs=None, robust_runs=100, robust_infer_signs=True, robust_dimreduce=True, robust_method='DBSCAN', robust_kws={}, robust_precompdist_func='abs_pearson_dist', verbose=True)#

Bases: object

Class to perform robust Independent Component Analysis (ICA) using different methods to cluster together the independent components computed via sklearn.decomposition.FastICA.

By default, it carries out the Icasso algorithm using agglomerative clustering with average linkage and a precomputed Pearson dissimilarity matrix.

Schematically, RobustICA works like this:
  1. Run ICA multiple times and save source (S) and mixing (A) matrices.

  2. Cluster the components into robust components using Ss across runs.
    2.1) If we use a precomputed dissimilarity:

    2.1.1) Precompute dissimilarity

    2.2) If we don’t use a precomputed dissimilarity:

    2.2.1) (Optional) Infer and correct component signs across runs

    2.2.2) (Optional) Reduce the feature space with PCA

    2.3) Cluster components across all S runs

    2.4) Use clustering labels to compute the centroid of each cluster, i.e. the robust component in both S and A.
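
To make steps 2.2.1 and 2.4 more concrete, here is a minimal NumPy sketch on simulated data. It is not the robustica implementation: the "ICA runs" are simulated as noisy, randomly sign-flipped copies of known sources, the cluster labels are assumed to be already known instead of coming from a clustering step, and the sign convention (orient each component so that its largest-magnitude weight is positive) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components, robust_runs = 100, 3, 5

# Simulate the outcome of step 1: each "run" returns the same sources,
# up to an arbitrary sign flip and a little noise.
S_true = rng.laplace(size=(n_features, n_components))
runs = []
for _ in range(robust_runs):
    flips = rng.choice([-1.0, 1.0], size=n_components)
    runs.append(S_true * flips + 0.01 * rng.normal(size=S_true.shape))
S_all = np.hstack(runs)  # shape (n_features, n_components * robust_runs)

# Assumed-known cluster labels (in practice these come from step 2.3).
labels = np.tile(np.arange(n_components), robust_runs)

# Step 2.2.1 (assumed convention): flip each component so that its
# largest-magnitude weight is positive.
top = np.abs(S_all).argmax(axis=0)
signs = np.sign(S_all[top, np.arange(S_all.shape[1])])
S_signed = S_all * signs

# Step 2.4: the centroid of each cluster is the robust component; the
# within-cluster standard deviation measures its variability across runs.
S = np.column_stack(
    [S_signed[:, labels == k].mean(axis=1) for k in range(n_components)]
)
S_std = np.column_stack(
    [S_signed[:, labels == k].std(axis=1) for k in range(n_components)]
)
print(S.shape, S_std.shape)
```

Without the sign correction, components that differ only by sign would partially cancel when averaged, which is why the correction precedes the centroid computation.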

Parameters
  • n_components (int, default=None) – Number of components to use. If None is passed, all are used.

  • algorithm ({'parallel', 'deflation'}, default='parallel') – Apply parallel or deflational algorithm for FastICA.

  • whiten (bool or 'arbitrary-variance', default='arbitrary-variance') – If whiten is False, the data is considered to be already whitened, and no whitening is performed. WARNING: for scikit-learn>1.3, the default is set to whiten='arbitrary-variance' to maintain the behavior of whiten=True.

  • fun ({'logcosh', 'exp', 'cube'} or callable, default='logcosh') –

    The functional form of the G function used in the approximation to neg-entropy. Could be either ‘logcosh’, ‘exp’, or ‘cube’. You can also provide your own function. It should return a tuple containing the value of the function, and of its derivative, in the point. Example:

    def my_g(x):
        return x ** 3, (3 * x ** 2).mean(axis=-1)
    

  • fun_args (dict, default=None) – Arguments to send to the functional form. If empty and if fun=’logcosh’, fun_args will take value {‘alpha’ : 1.0}.

  • max_iter (int, default=200) – Maximum number of iterations during fit.

  • tol (float, default=1e-4) – Tolerance on update at each iteration.

  • w_init (ndarray of shape (n_components, n_components), default=None) – The mixing matrix to be used to initialize the algorithm.

  • random_state (int, RandomState instance or None, default=None) –

    Used to initialize w_init when not specified, with a normal distribution. Pass an int, for reproducible results across multiple function calls. See Glossary.

  • robust_runs (int, default=100) – Number of times to run FastICA.

  • robust_infer_signs (bool, default=True) – If robust_infer_signs is True, we infer and correct the signs of components across ICA runs before clustering them.

  • robust_method (str or callable, default="DBSCAN") –

    Clustering class to compute robust components across ICA runs. If str, choose one of the following clustering algorithms from sklearn.cluster:

    • "AgglomerativeClustering"

    • "AffinityPropagation"

    • "Birch"

    • "DBSCAN"

    • "FeatureAgglomeration"

    • "KMeans"

    • "MiniBatchKMeans"

    • "MeanShift"

    • "OPTICS"

    • "SpectralClustering"

    or from sklearn_extra.cluster:
    • "KMedoids"

    • "CommonNNClustering"

    If class, the algorithm expects a clustering class with a self.fit() method that creates a self.clustering.labels_ attribute returning the list of clustering labels.

  • robust_kws (dict, default={"linkage": "average"}) – Keyword arguments to send to clustering class defined by robust_method. If robust_method is str and if “n_clusters” or “min_samples” are not defined in robust_kws, robust_kws will be updated with either {“n_clusters”: self.n_components} or {“min_samples”: int(self.robust_runs * 0.5)} accordingly.

  • robust_dimreduce (bool, default=True) – If robust_dimreduce is True, we use sklearn.decomposition.PCA with the same n_components to reduce the feature space across ICA runs after sign inference and correction (if robust_infer_signs=True) and before clustering.

  • robust_precompdist_func ("abs_pearson_dist" or callable, default="abs_pearson_dist") – If robust_kws contains the value “precomputed”, we precompute a distance matrix by executing robust_precompdist_func and use it for clustering.
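
The default-filling rule described for robust_kws can be sketched as follows. The helper fill_robust_kws and the two name sets are hypothetical; the grouping reflects which of the listed clustering classes accept an n_clusters or a min_samples constructor argument, and how RobustICA actually decides between the two is an assumption.

```python
# Hypothetical grouping of the string options by the keyword they accept.
USES_N_CLUSTERS = {
    "AgglomerativeClustering", "Birch", "FeatureAgglomeration", "KMeans",
    "MiniBatchKMeans", "SpectralClustering", "KMedoids",
}
USES_MIN_SAMPLES = {"DBSCAN", "OPTICS", "CommonNNClustering"}


def fill_robust_kws(robust_method, robust_kws, n_components, robust_runs):
    """Return a copy of robust_kws with the missing default filled in."""
    kws = dict(robust_kws)  # copy so the caller's dict is not mutated
    if robust_method in USES_N_CLUSTERS:
        kws.setdefault("n_clusters", n_components)
    elif robust_method in USES_MIN_SAMPLES:
        kws.setdefault("min_samples", int(robust_runs * 0.5))
    return kws


print(fill_robust_kws("DBSCAN", {}, n_components=10, robust_runs=100))
# {'min_samples': 50}
print(fill_robust_kws("KMeans", {"n_clusters": 4}, n_components=10, robust_runs=100))
# {'n_clusters': 4}
```

Values supplied by the user are left untouched; only a missing key is filled in.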

S#

Robust source matrix computed using the centroids of every cluster.

Type

np.array of shape (n_features, n_components)

A#

Robust mixing matrix computed using the centroids of every cluster.

Type

np.array of shape (n_samples, n_components)

S_std#

Within robust component standard deviation across features.

Type

np.array of shape (n_features, n_components)

A_std#

Within robust component standard deviation across samples.

Type

np.array of shape (n_samples, n_components)

S_all#

Concatenated source matrices corresponding to every run of ICA.

Type

np.array of shape (n_features, n_components * robust_runs)

A_all#

Concatenated mixing matrices corresponding to every run of ICA.

Type

np.array of shape (n_samples, n_components * robust_runs)

time#

Time to execute every run of ICA for robust_runs times. Dictionary structured as {run : seconds}.

Type

dict of length robust_runs

signs_#

Array of positive or negative ones used to correct for signs before clustering.

Type

np.array of length n_components * robust_runs

orientation_#

Array of positive or negative ones used to orient labeled components after clustering so that largest weights face positive.

Type

np.array of length n_components * robust_runs

clustering#

Instance used to cluster components in S_all across ICA runs. The clustering labels can be found in the attribute self.clustering.labels_. In self.clustering.stats_ you can find information on cluster sizes and mean standard deviations per cluster in both S and A robust matrices.

Type

class instance

Examples

from robustica import RobustICA
from robustica.examples import make_sampledata

X = make_sampledata(200,50)
rica = RobustICA(n_components=10)
S, A = rica.fit_transform(X)

Notes

Icasso procedure based on Himberg, J., & Hyvarinen, A. “Icasso: software for investigating the reliability of ICA estimates by clustering and visualization”. IEEE XIII Workshop on Neural Networks for Signal Processing (2003). DOI: https://doi.org/10.1109/NNSP.2003.1318025

Centroid computation based on Sastry, Anand V., et al. “The Escherichia coli transcriptome mostly consists of independently regulated modules.” Nature communications 10.1 (2019): 1-14. DOI: https://doi.org/10.1038/s41467-019-13483-w

evaluate_clustering(S_all, labels, signs, orientation, metric='euclidean')#

After having executed the self.fit(X) method, computes per-sample silhouette scores and the index of quality (Iq) proposed by Himberg, J., & Hyvarinen, A. (2004) (DOI: https://doi.org/10.1109/NNSP.2003.1318025).

Silhouette scores for each component are computed using sklearn.metrics.silhouette_samples.

Iq scores for each cluster (i.e. robust component) are computed using robustica.RobustICA.compute_iq.

Parameters
  • S_all (np.array of shape (n_features, n_components * robust_runs)) – Concatenated source matrices corresponding to every run of ICA.

  • labels (array-like object of length n_components * robust_runs) – List of clustering labels.

  • signs (np.array of length n_components * robust_runs) – Array of positive or negative ones used to correct for signs before clustering.

  • orientation (np.array of length n_components * robust_runs) – Array of positive or negative ones used to orient labeled components after clustering so that largest weights face positive.

  • metric (str) – Metric to use to evaluate the clustering with sklearn.metrics.silhouette_samples. If metric=’precomputed’, S_all has to be a square matrix with a diagonal of 0s.

Returns

evaluation – Dataframe with information on the average silhouette scores and Iq for each cluster.

Return type

pd.DataFrame

self.clustering.silhouette_scores_#

Silhouette coefficient for each component.

Type

np.array of length n_components * robust_runs

self.clustering.iq_scores_#

Iq coefficient for each component.

Type

np.array of length n_components * robust_runs

fit(X)#

Runs ICA robust_runs times and computes robust independent components.

Parameters

X (np.array of shape (n_features, n_samples)) – Data input.

Return type

self

fit_transform(X)#

Runs ICA robust_runs times and computes robust independent components and returns the robust S and A matrices.

Parameters

X (np.array of shape (n_features, n_samples)) – Data input.

Returns

  • S (np.array of shape (n_features, n_components)) – Robust source matrix computed using the centroids of every cluster.

  • A (np.array of shape (n_samples, n_components)) – Robust mixing matrix computed using the centroids of every cluster.

transform()#

After having executed the self.fit(X) method, return robust S and A matrices.

Parameters

self

Returns

  • S (np.array of shape (n_features, n_components)) – Robust source matrix computed using the centroids of every cluster.

  • A (np.array of shape (n_samples, n_components)) – Robust mixing matrix computed using the centroids of every cluster.

robustica.RobustICA.abs_pearson_dist(X)#

Compute Pearson dissimilarity between columns.

Parameters

X (np.array of shape (n_features, n_samples)) – Input data.

Returns

D – Dissimilarity matrix.

Return type

np.array of shape (n_samples, n_samples)
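
For intuition, a dissimilarity of this kind can be sketched in plain NumPy. The sketch assumes that "Pearson dissimilarity" means one minus the absolute Pearson correlation between columns, as the function name suggests; the name abs_pearson_dist_sketch is hypothetical, and the clipping to [0, 1] is a guard against floating-point error that the library may or may not apply.

```python
import numpy as np


def abs_pearson_dist_sketch(X):
    """Hypothetical stand-in: 1 - |Pearson correlation| between columns of X."""
    corr = np.corrcoef(X, rowvar=False)  # columns are the variables
    return np.clip(1 - np.abs(corr), 0, 1)


rng = np.random.default_rng(0)
X = rng.random((15, 5))
D = abs_pearson_dist_sketch(X)
print(D.shape)  # (5, 5)
```

Taking the absolute value makes perfectly anticorrelated columns count as identical, which suits ICA components whose sign is arbitrary.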

Examples

from robustica import abs_pearson_dist
from robustica.examples import make_sampledata

X = make_sampledata(15, 5)
D = abs_pearson_dist(X)
D.shape

robustica.RobustICA.compute_iq(X, labels, precomputed=False)#

Compute cluster index of quality as suggested by Himberg, J., & Hyvarinen (2004) (DOI: https://doi.org/10.1109/NNSP.2003.1318025). This method requires computing a square correlation matrix.

Parameters
  • X (np.array of shape (n_samples, n_features)) –

  • labels (list or np.array) – Clustering labels indicating to which cluster every observation belongs.

  • precomputed (bool, default=False) – Indicates whether X is a square pairwise correlation matrix.

Returns

df – Dataframe with cluster labels (‘cluster_id’) and their corresponding Iq scores.

Return type

pd.DataFrame
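
The index of quality from the Icasso paper is the mean similarity within a cluster minus the mean similarity between that cluster and everything outside it, with similarity taken as the absolute correlation. A minimal sketch under that definition is shown below; iq_sketch is a hypothetical name, it returns a plain dict instead of a DataFrame, and details such as whether the diagonal is included may differ from compute_iq.

```python
import numpy as np


def iq_sketch(X, labels):
    """Hypothetical helper: Iq per cluster from rows of X (one row per observation)."""
    labels = np.asarray(labels)
    sim = np.abs(np.corrcoef(X))  # square matrix of |correlation| between rows
    scores = {}
    for k in np.unique(labels):
        inside = labels == k
        intra = sim[np.ix_(inside, inside)].mean()
        # mean similarity to observations outside the cluster (0 if none)
        extra = sim[np.ix_(inside, ~inside)].mean() if (~inside).any() else 0.0
        scores[int(k)] = float(intra - extra)
    return scores


# Two well-separated groups: rows are noisy copies of one of two base vectors.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(2, 200))
labels = [1, 1, 2, 1, 2]
X = np.vstack(
    [(base_a if lab == 1 else base_b) + 0.05 * rng.normal(size=200) for lab in labels]
)
scores = iq_sketch(X, labels)
print(scores)
```

Values close to 1 indicate compact, well-isolated clusters; values near 0 indicate clusters that are no more similar internally than to the rest.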

Examples

from robustica import compute_iq
from robustica.examples import make_sampledata

X = make_sampledata(5, 15)
labels = [1,1,2,1,2]
df = compute_iq(X, labels)
df

robustica.RobustICA.corrmats(X, Y)#

Vectorized implementation of pairwise correlations between rows in X and rows in Y. Make sure that the number of columns in X and Y is the same.

Parameters
  • X (np.array of shape (n_features_x, n_samples_x)) –

  • Y (np.array of shape (n_features_y, n_samples_y)) –

Returns

r

Return type

np.array of shape (n_features_x, n_features_y)
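
One common way to vectorize row-wise Pearson correlations is to center and normalize each row and then take a single matrix product. The sketch below illustrates that idea; corrmats_sketch is a hypothetical name, and the actual corrmats implementation may be organized differently.

```python
import numpy as np


def corrmats_sketch(X, Y):
    """Hypothetical helper: Pearson correlation between every row of X and every row of Y."""
    Xc = X - X.mean(axis=1, keepdims=True)  # center each row
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)  # unit-length rows
    Yn = Yc / np.linalg.norm(Yc, axis=1, keepdims=True)
    return Xn @ Yn.T  # dot products of unit vectors are correlations


rng = np.random.default_rng(0)
X = rng.random((15, 5))
Y = rng.random((20, 5))
r = corrmats_sketch(X, Y)
print(r.shape)  # (15, 20)
```

The matrix product requires X and Y to have the same number of columns, which is the constraint stated in the description.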

Examples

from robustica import corrmats
from robustica.examples import make_sampledata

X = make_sampledata(15, 5)
Y = make_sampledata(20, 5)
r = corrmats(X, Y)
r.shape

robustica.RobustICA.optional_import_sklearn_extra(method)#

robustica.examples module#

robustica.examples.make_sampledata(nrow, ncol, seed=None)#

Prepare a random sample dataset with np.random.rand.

Parameters
  • nrow (int) – Number of desired rows.

  • ncol (int) – Number of desired columns.

  • seed (int, default=None) – Random seed in case we want full reproducibility.

Returns

sampledata – Resulting random sample dataset.

Return type

np.array of shape (nrow, ncol)

Example

from robustica.examples import make_sampledata
X = make_sampledata(ncol=300, nrow=2000, seed=123)

Module contents#