Using a custom clustering class#

Load modules#

[1]:
from robustica import RobustICA, corrmats

import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils import parallel_backend

import matplotlib.pyplot as plt
import seaborn as sns

Load data from Sastry (2019)#

[2]:
url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-019-13483-w/MediaObjects/41467_2019_13483_MOESM4_ESM.xlsx"
data = pd.ExcelFile(url)
data.sheet_names
[2]:
['README',
 'Metadata',
 'Expression Data',
 'S matrix',
 'A matrix',
 'Media Recipes',
 'Gene Information']
[3]:
data.parse("README")
[3]:
Sheet Name Description
0 Experimental Conditions Experimental conditions for each sample in PRE...
1 Expression Data Expression levels of genes (log2 Transcripts p...
2 S matrix I-modulon gene coefficients. Each column is an...
3 A matrix Condition-specific i-modulon activities. Each ...
4 Media Description Recipes for base media and trace element mixtures
5 Gene Information Mapping of b-numbers to gene names/annotations...
6 NaN NaN
7 Metadata Columns Description
8 Sample ID Unique sample identifier for experiment (e.g. ...
9 Study Short 1-2 word description of study where samp...
10 Project ID Identifier for study
11 Condition ID Identifier for unique experimental conditions
12 Replicate # Suffix for sample ID to distinguish between re...
13 Strain Description Describes gene knock-out, mutation knock-in, o...
14 Strain MG1655 or BW25113
15 Base Media Media recipe as described in Media Description...
16 Carbon Source (g/L) Carbon source in g/L
17 Nitrogen Source (g/L) Nitrogen source in g/L
18 Electron Acceptor Aerobic (O2), anaerobic (None), or nitrate (KN...
19 Trace Element Mixture Trace element recipe as described in Media Des...
20 Supplement Additional media supplements in each experimen...
21 Temperature (C) Temperature of growth condition in Celsius
22 pH Neutral (7) or acidic (5.5)
23 Antibiotic Treatment with low concentration Kanamycin for...
24 Culture Type Batch or Chemostat (always exponential growth ...
25 Growth Rate (1/hr) Measured growth rate of strain in media
26 Evolved Sample Whether the strain was from an adaptive labora...
27 Isolate Type Clonal or Population (only applicable to ALE s...
28 Sequencing Machine Type of machine used for sequencing
29 Additional Details Additional information about experiment
30 Biological Replicates Number of biological replicates within study
31 Alignment Percent of reads aligned by bowtie to referenc...
32 DOI Accession number for study
33 GEO GEO Accession ID

Preprocess data#

Following the article’s methods, we compute the log2FCs with respect to controls

[4]:
# get gene expression
X = data.parse("Expression Data").set_index("log-TPM")

# get controls
controls = [c for c in X.columns if "control" in c]
print(controls)

# compute log2FCs with respect to controls
X = X - X[controls].mean(1).values.reshape(-1, 1)

X
['control__wt_glc__1', 'control__wt_glc__2']
[4]:
control__wt_glc__1 control__wt_glc__2 fur__wt_dpd__1 fur__wt_dpd__2 fur__wt_fe__1 fur__wt_fe__2 fur__delfur_dpd__1 fur__delfur_dpd__2 fur__delfur_fe2__1 fur__delfur_fe2__2 ... efeU__menFentC_ale29__1 efeU__menFentC_ale29__2 efeU__menFentC_ale30__1 efeU__menFentC_ale30__2 efeU__menFentCubiC_ale36__1 efeU__menFentCubiC_ale36__2 efeU__menFentCubiC_ale37__1 efeU__menFentCubiC_ale37__2 efeU__menFentCubiC_ale38__1 efeU__menFentCubiC_ale38__2
log-TPM
b0002 -0.061772 0.061772 0.636527 0.819793 -0.003615 -0.289353 -1.092023 -0.777289 0.161343 0.145641 ... -0.797097 -0.791859 0.080114 0.102154 0.608180 0.657673 0.813105 0.854813 0.427986 0.484338
b0003 -0.053742 0.053742 0.954439 1.334385 0.307588 0.128414 -0.872563 -0.277893 0.428542 0.391761 ... -0.309105 -0.352535 -0.155074 -0.077145 0.447030 0.439881 0.554528 0.569030 0.154905 0.294799
b0004 -0.065095 0.065095 -0.202697 0.119195 -0.264995 -0.546017 -1.918349 -1.577736 -0.474815 -0.495312 ... -0.184898 -0.225615 0.019575 0.063986 0.483343 0.452754 0.524828 0.581878 0.293239 0.341040
b0005 0.028802 -0.028802 -0.865171 -0.951179 0.428769 0.123564 -1.660351 -1.531147 0.240353 -0.151132 ... -0.308221 -0.581714 0.018820 0.004040 -1.228763 -1.451750 -0.839203 -0.529349 -0.413336 -0.478682
b0006 0.009087 -0.009087 -0.131039 -0.124079 -0.144870 -0.090152 -0.219917 -0.046648 -0.044537 -0.089204 ... 1.464603 1.415706 1.230831 1.165153 0.447447 0.458852 0.421417 0.408077 1.151066 1.198529
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
b4688 -0.261325 0.261325 -1.425581 -2.734490 0.181893 0.514395 -1.943947 -1.992701 0.066037 -0.695325 ... -0.885297 -0.462485 -2.734490 -1.451148 -1.379069 -1.567420 -0.999610 -1.726577 -2.734490 -1.189069
b4693 -0.278909 0.278909 1.361362 1.020310 0.608108 0.988541 2.558416 2.142724 3.120867 3.104887 ... -0.374963 0.856574 -1.147824 -0.814089 2.054471 1.853620 1.957717 1.943582 2.233115 2.023755
b4696_1 0.050526 -0.050526 1.166436 1.043373 -0.531441 -0.581626 0.914055 0.731165 -0.127269 -0.164046 ... 0.261604 0.278426 0.201089 -0.017780 0.138178 0.122287 0.504402 0.425213 0.629383 0.805945
b4696_2 -0.031653 0.031653 0.785573 0.881353 -0.477271 -0.916095 0.837603 0.801393 -0.071710 -0.000540 ... -0.499371 0.398783 0.096609 -0.103446 -0.519098 0.615363 0.343959 0.580288 0.366905 0.702608
b4705 0.724324 -0.724324 -4.350151 -4.317498 -0.747489 -1.257045 -3.308337 -4.421970 -2.679693 -1.872713 ... -1.968530 -1.365300 -5.468290 -2.997169 -3.673367 -3.161608 -3.959910 -4.088644 -5.468290 -5.468290

3923 rows × 278 columns

Run robustica with custom clustering classes#

Using the classical Icasso algorithm#

[5]:
%%time

rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_infer_signs=False,
    robust_dimreduce=False,
    robust_kws={"affinity": "precomputed", "linkage": "average"},
)

with parallel_backend("loky", n_jobs=4):
    S_classic, A_classic = rica.fit_transform(X.values)

print(S_classic.shape, A_classic.shape)
  0%|          | 0/8 [00:00<?, ?it/s]
Running FastICA multiple times...
100%|██████████| 8/8 [00:03<00:00,  2.03it/s]
Precomputing distance matrix...
Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 2.42 s, sys: 242 ms, total: 2.67 s
Wall time: 8.59 s

Using your clustering algorithm#

Create your own clustering class#

In this example, we create a clustering class that computes the distance within the class.

[6]:
class MyClusteringClass:
    def __init__(self, **kws):
        self.clustering = AgglomerativeClustering(
            affinity="precomputed", linkage="average", **kws
        )

    def fit(self, X):
        # compute dissimilarity matrix
        D = 1 - np.abs(np.corrcoef(X))

        # cluster
        self.clustering.fit(D)
        self.labels_ = self.clustering.labels_  # required instance

Run#

You just need to define your class in the robust_method and its keyword arguments in robust_kws. In this example,

[7]:
%%time

rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_infer_signs=False,
    robust_dimreduce=False,
    robust_method=MyClusteringClass,
    robust_kws={"n_clusters": 100},  # define expected n. clusters
    robust_precompdist_func=None,
)

with parallel_backend("loky", n_jobs=4):
    S, A = rica.fit_transform(X.values)

print(S.shape, A.shape)
  0%|          | 0/8 [00:00<?, ?it/s]
Running FastICA multiple times...
100%|██████████| 8/8 [00:03<00:00,  2.47it/s]
/home/miquel/miniconda3/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 2.52 s, sys: 64.5 ms, total: 2.58 s
Wall time: 8.13 s

Correspondence between the two approaches#

[8]:
plt.figure(figsize=(11, 10))
sns.heatmap(
    np.abs(corrmats(S.T, S_classic.T)),
    cmap="coolwarm",
    center=0,
    cbar_kws={"label": r"$abs(Pearson Correlation)$"},
)
plt.xlabel("Classic ICs")
plt.ylabel("My ICs")

plt.show()
_images/customize_clustering_18_0.png

Using other existing clustering algorithms#

robustica accepts any clustering class as robust_method. It just needs to create a self.labels_ attribute. For example, we can use `HDBSCAN <https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html>`__.

[9]:
from hdbscan import HDBSCAN
[10]:
rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_method=HDBSCAN,
    robust_kws={"min_cluster_size": 4},  # half of robust_runs
)

with parallel_backend("loky", n_jobs=4):
    S, A = rica.fit_transform(X.values)

print(S.shape, A.shape)
  0%|          | 0/8 [00:00<?, ?it/s]
Running FastICA multiple times...
100%|██████████| 8/8 [00:02<00:00,  2.80it/s]
Inferring sign of components...
Reducing dimensions...
Clustering...
/home/miquel/miniconda3/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
Computing centroids...
(3923, 93) (278, 93)
[11]:
rica.clustering.stats_
[11]:
cluster_id cluster_size S_mean_std A_mean_std
0 -1.0 63.0 0.015966 4.107105
1 0.0 8.0 0.015966 5.892067
2 1.0 8.0 0.015966 13.805991
3 2.0 8.0 0.015966 9.572825
4 3.0 8.0 0.015966 9.232571
... ... ... ... ...
88 87.0 8.0 0.015966 4.127089
89 88.0 8.0 0.015966 3.455316
90 89.0 4.0 0.015966 6.677909
91 90.0 8.0 0.015966 3.843471
92 91.0 7.0 0.015966 3.586185

93 rows × 4 columns