RobustICA#
- robustica.abs_pearson_dist(X)#
Compute Pearson dissimilarity between columns.
- Parameters
X (np.array of shape (n_features, n_samples)) – Input data.
- Returns
D – Dissimilarity matrix.
- Return type
np.array of shape (n_samples, n_samples)
Examples
from robustica import abs_pearson_dist
from robustica.examples import make_sampledata
X = make_sampledata(15, 5)
D = abs_pearson_dist(X)
D.shape
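The docstring does not spell out the formula, but the function name suggests one minus the absolute Pearson correlation between columns. The NumPy sketch below illustrates that assumed definition; it is not a copy of the library's implementation.

import numpy as np

def abs_pearson_dist_sketch(X):
    # Assumed definition: 1 - |Pearson correlation| between columns (samples) of X.
    r = np.corrcoef(X.T)  # (n_samples, n_samples) correlations between columns
    return 1 - np.abs(r)

X = np.random.rand(15, 5)
print(abs_pearson_dist_sketch(X).shape)  # (5, 5)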
- robustica.corrmats(X, Y)#
Vectorized implementation of pairwise correlations between rows of X and rows of Y. Make sure that X and Y have the same number of columns (n_samples_x == n_samples_y).
- Parameters
X (np.array of shape (n_features_x, n_samples_x)) –
Y (np.array of shape (n_features_y, n_samples_y)) –
- Returns
r – Pairwise correlations between rows of X and rows of Y.
- Return type
np.array of shape (n_features_x, n_features_y)
Examples
from robustica import corrmats
from robustica.examples import make_sampledata
X = make_sampledata(15, 5)
Y = make_sampledata(20, 5)
r = corrmats(X, Y)
r.shape
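A common way to vectorize pairwise row correlations is to standardize each row and take a matrix product. The sketch below illustrates that idea; it is not necessarily the implementation used by corrmats.

import numpy as np

def corrmats_sketch(X, Y):
    # Standardize rows so that Pearson correlation reduces to a dot product / n_samples.
    Xc = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    Yc = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)
    return Xc @ Yc.T / X.shape[1]  # (n_features_x, n_features_y)

X, Y = np.random.rand(15, 5), np.random.rand(20, 5)
print(corrmats_sketch(X, Y).shape)  # (15, 20)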
- robustica.compute_iq(X, labels, precomputed=False)#
Compute the cluster index of quality (Iq) as suggested by Himberg & Hyvarinen (2003) (DOI: https://doi.org/10.1109/NNSP.2003.1318025). This method requires computing a square correlation matrix.
- Parameters
X (np.array of shape (n_samples, n_features)) – Input data, or a square correlation matrix of shape (n_samples, n_samples) if precomputed=True.
labels (list or np.array) – Clustering labels indicating to which cluster every observation belongs.
precomputed (bool, default=False) – Indicates whether X is a square pairwise correlation matrix.
- Returns
df – Dataframe with cluster labels (‘cluster_id’) and their corresponding Iq scores.
- Return type
pd.DataFrame
Examples
from robustica import compute_iq
from robustica.examples import make_sampledata
X = make_sampledata(5, 15)
labels = [1, 1, 2, 1, 2]
df = compute_iq(X, labels)
df
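Assuming the Icasso definition from Himberg & Hyvarinen, the Iq of a cluster is its average intra-cluster similarity minus its average extra-cluster similarity, with similarity taken as the absolute correlation. A minimal sketch of that idea on a precomputed square correlation matrix (iq_sketch is a hypothetical helper, not part of robustica):

import numpy as np

def iq_sketch(corr, labels):
    # Iq per cluster: mean |corr| within the cluster minus mean |corr| to all other observations.
    corr, labels = np.abs(np.asarray(corr)), np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        inside = labels == c
        within = corr[np.ix_(inside, inside)].mean()
        between = corr[np.ix_(inside, ~inside)].mean() if (~inside).any() else 0.0
        scores[c] = within - between
    return scores

corr = np.corrcoef(np.random.rand(5, 15))  # square correlation matrix of 5 observations
print(iq_sketch(corr, [1, 1, 2, 1, 2]))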
- class robustica.RobustICA(n_components=None, algorithm='parallel', whiten='arbitrary-variance', fun='logcosh', fun_args=None, max_iter=200, tol=0.0001, w_init=None, random_state=None, n_jobs=None, robust_runs=100, robust_infer_signs=True, robust_dimreduce=True, robust_method='DBSCAN', robust_kws={}, robust_precompdist_func='abs_pearson_dist', verbose=True)#
Class to perform robust Independent Component Analysis (ICA) using different methods to cluster together the independent components computed via sklearn.decomposition.FastICA.
By default, it carries out the Icasso algorithm using agglomerative clustering with average linkage and a precomputed Pearson dissimilarity matrix.
- Schematically, RobustICA works like this (a rough sketch follows this list):
1) Run ICA multiple times and save the source (S) and mixing (A) matrices.
2) Cluster the components into robust components using the S matrices across runs:
   2.1) If we use a precomputed dissimilarity:
      2.1.1) Precompute the dissimilarity matrix.
   2.2) If we don't use a precomputed dissimilarity:
      2.2.1) (Optional) Infer and correct component signs across runs.
      2.2.2) (Optional) Reduce the feature space with PCA.
   2.3) Cluster components across all S runs.
   2.4) Use the clustering labels to compute the centroid of each cluster, i.e. the robust component, in both S and A.
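For orientation, the workflow above can be sketched with scikit-learn building blocks. This is a simplified illustration (using the Icasso-style defaults mentioned above: agglomerative clustering with average linkage), not the library's actual implementation; sign inference (2.2.1) and centroid orientation are omitted.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA, FastICA

def robust_ica_sketch(X, n_components=10, robust_runs=100):
    # 1) Run ICA multiple times and save the source (S) and mixing (A) matrices.
    S_runs, A_runs = [], []
    for run in range(robust_runs):
        ica = FastICA(n_components=n_components, random_state=run)
        S_runs.append(ica.fit_transform(X))  # (X.shape[0], n_components)
        A_runs.append(ica.mixing_)           # (X.shape[1], n_components)
    S_all = np.hstack(S_runs)  # sources from every run, concatenated column-wise
    A_all = np.hstack(A_runs)  # mixing matrices from every run, concatenated column-wise

    # 2.2.2) Optionally reduce the feature space before clustering.
    feats = PCA(n_components=n_components).fit_transform(S_all.T)

    # 2.3) Cluster components across all runs.
    labels = AgglomerativeClustering(n_clusters=n_components, linkage="average").fit(feats).labels_

    # 2.4) The centroid of each cluster is the robust component, in both S and A.
    S = np.stack([S_all[:, labels == c].mean(axis=1) for c in np.unique(labels)], axis=1)
    A = np.stack([A_all[:, labels == c].mean(axis=1) for c in np.unique(labels)], axis=1)
    return S, A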
- Parameters
n_components (int, default=None) – Number of components to use. If None is passed, all are used.
algorithm ({'parallel', 'deflation'}, default='parallel') – Apply parallel or deflational algorithm for FastICA.
whiten (str or bool, default='arbitrary-variance') – If whiten is False, the data is considered to already be whitened and no whitening is performed. WARNING: with scikit-learn>1.3, the default was set to whiten='arbitrary-variance' to maintain the previous behavior (equivalent to the old whiten=True).
fun ({'logcosh', 'exp', 'cube'} or callable, default='logcosh') –
The functional form of the G function used in the approximation to neg-entropy. Could be either 'logcosh', 'exp', or 'cube'. You can also provide your own function. It should return a tuple containing the value of the function and of its derivative at the point. Example:
def my_g(x):
    return x ** 3, (3 * x ** 2).mean(axis=-1)
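For example, assuming fun is forwarded to sklearn.decomposition.FastICA as the description above suggests, the custom contrast function could be used like this (illustrative usage):

from robustica import RobustICA
from robustica.examples import make_sampledata

def my_g(x):
    # return G(x) and the mean of its derivative along the last axis
    return x ** 3, (3 * x ** 2).mean(axis=-1)

X = make_sampledata(200, 50)
rica = RobustICA(n_components=10, fun=my_g)
S, A = rica.fit_transform(X)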
fun_args (dict, default=None) – Arguments to send to the functional form. If empty and if fun=’logcosh’, fun_args will take value {‘alpha’ : 1.0}.
max_iter (int, default=200) – Maximum number of iterations during fit.
tol (float, default=1e-4) – Tolerance on update at each iteration.
w_init (ndarray of shape (n_components, n_components), default=None) – The mixing matrix to be used to initialize the algorithm.
random_state (int, RandomState instance or None, default=None) – Used to initialize w_init when not specified, with a normal distribution. Pass an int for reproducible results across multiple function calls. See Glossary.
robust_runs (int, default=100) – Number of times to run FastICA.
robust_infer_signs (bool, default=True) – If robust_infer_signs is True, we infer and correct the signs of components across ICA runs before clustering them.
robust_method (str or callable, default="DBSCAN") –
Clustering class used to compute robust components across ICA runs. If str, choose one of the following clustering algorithms from sklearn.cluster:
"AgglomerativeClustering"
"AffinityPropagation"
"Birch"
"DBSCAN"
"FeatureAgglomeration"
"KMeans"
"MiniBatchKMeans"
"MeanShift"
"OPTICS"
"SpectralClustering"
or one of the following from sklearn_extra.cluster:
"KMedoids"
"CommonNNClustering"
If a class is passed, it is expected to implement a fit() method that creates a labels_ attribute containing the list of clustering labels (available afterwards as self.clustering.labels_).
robust_kws (dict, default={"linkage": "average"}) – Keyword arguments to send to the clustering class defined by robust_method. If robust_method is a str and "n_clusters" or "min_samples" are not defined in robust_kws, robust_kws will be updated with either {"n_clusters": self.n_components} or {"min_samples": int(self.robust_runs * 0.5)}, accordingly.
robust_dimreduce (bool, default=True) – If robust_dimreduce is True, we use sklearn.decomposition.PCA with the same n_components to reduce the feature space across ICA runs after sign inference and correction (if robust_infer_signs=True) and before clustering.
robust_precompdist_func ("abs_pearson_dist" or callable, default="abs_pearson_dist") – If robust_kws contains the value "precomputed", we precompute a distance matrix by executing robust_precompdist_func and use it for clustering (see the configuration sketch after this parameter list).
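For instance, an Icasso-like setup (agglomerative clustering on a precomputed Pearson dissimilarity, as described at the top of this class) might be configured roughly as follows. The keyword that names the metric in AgglomerativeClustering depends on your scikit-learn version ("metric" in recent releases, "affinity" in older ones), so treat this as a sketch rather than exact API usage.

from robustica import RobustICA
from robustica.examples import make_sampledata

X = make_sampledata(200, 50)
rica = RobustICA(
    n_components=10,
    robust_method="AgglomerativeClustering",
    # passing the value "precomputed" makes RobustICA build the dissimilarity
    # matrix with robust_precompdist_func before clustering
    robust_kws={"linkage": "average", "metric": "precomputed"},
    robust_precompdist_func="abs_pearson_dist",
)
S, A = rica.fit_transform(X)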
- S#
Robust source matrix computed using the centroids of every cluster.
- Type
np.array of shape (n_features, n_components)
- A#
Robust mixing matrix computed using the centroids of every cluster.
- Type
np.array of shape (n_samples, n_components)
- S_std#
Standard deviation across features within each robust component.
- Type
np.array of shape (n_features, n_components)
- A_std#
Standard deviation across samples within each robust component.
- Type
np.array of shape (n_samples, n_components)
- S_all#
Concatenated source matrices corresponding to every run of ICA.
- Type
np.array of shape (n_features, n_components * robust_runs)
- A_all#
Concatenated mixing matrices corresponding to every run of ICA.
- Type
np.array of shape (n_samples, n_components * robust_runs)
- time#
Time to execute each of the robust_runs ICA runs. Dictionary structured as {run: seconds}.
- Type
dict of length robust_runs
- signs_#
Array of positive or negative ones used to correct for signs before clustering.
- Type
np.array of length n_components * robust_runs
- orientation_#
Array of positive or negative ones used to orient labeled components after clustering so that the largest absolute weights are positive.
- Type
np.array of length n_components * robust_runs
- clustering#
Instance used to cluster components in S_all across ICA runs. The clustering labels can be found in the attribute self.clustering.labels_. In self.clustering.stats_ you can find information on cluster sizes and mean standard deviations per cluster in both S and A robust matrices.
- Type
class instance
Examples
from robustica import RobustICA
from robustica.examples import make_sampledata
X = make_sampledata(200, 50)
rica = RobustICA(n_components=10)
S, A = rica.fit_transform(X)
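Continuing the example above, the documented attributes can be inspected after fitting. The commented shapes follow from the documented attribute shapes with X = make_sampledata(200, 50), n_components=10 and the default robust_runs=100.

rica.S.shape             # (200, 10): robust source matrix (n_features, n_components)
rica.A.shape             # (50, 10): robust mixing matrix (n_samples, n_components)
rica.S_all.shape         # (200, 1000): sources from every ICA run
rica.clustering.labels_  # cluster label assigned to every component across runs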
Notes
Icasso procedure based on Himberg, J., & Hyvarinen, A. “Icasso: software for investigating the reliability of ICA estimates by clustering and visualization”. IEEE XIII Workshop on Neural Networks for Signal Processing (2003). DOI: https://doi.org/10.1109/NNSP.2003.1318025
Centroid computation based on Sastry, Anand V., et al. “The Escherichia coli transcriptome mostly consists of independently regulated modules.” Nature communications 10.1 (2019): 1-14. DOI: https://doi.org/10.1038/s41467-019-13483-w