Using a custom clustering class - robustica 0.1.4 documentation

Load modules#

[1]:

from robustica import RobustICA, corrmats

import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils import parallel_backend

import matplotlib.pyplot as plt
import seaborn as sns

Load data from Sastry (2019)#

[2]:

url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-019-13483-w/MediaObjects/41467_2019_13483_MOESM4_ESM.xlsx"
data = pd.ExcelFile(url)
data.sheet_names

[2]:

['README',
 'Metadata',
 'Expression Data',
 'S matrix',
 'A matrix',
 'Media Recipes',
 'Gene Information']

[3]:

data.parse("README")

[3]:

	Sheet Name	Description
0	Experimental Conditions	Experimental conditions for each sample in PRE...
1	Expression Data	Expression levels of genes (log2 Transcripts p...
2	S matrix	I-modulon gene coefficients. Each column is an...
3	A matrix	Condition-specific i-modulon activities. Each ...
4	Media Description	Recipes for base media and trace element mixtures
5	Gene Information	Mapping of b-numbers to gene names/annotations...
6	NaN	NaN
7	Metadata Columns	Description
8	Sample ID	Unique sample identifier for experiment (e.g. ...
9	Study	Short 1-2 word description of study where samp...
10	Project ID	Identifier for study
11	Condition ID	Identifier for unique experimental conditions
12	Replicate #	Suffix for sample ID to distinguish between re...
13	Strain Description	Describes gene knock-out, mutation knock-in, o...
14	Strain	MG1655 or BW25113
15	Base Media	Media recipe as described in Media Description...
16	Carbon Source (g/L)	Carbon source in g/L
17	Nitrogen Source (g/L)	Nitrogen source in g/L
18	Electron Acceptor	Aerobic (O2), anaerobic (None), or nitrate (KN...
19	Trace Element Mixture	Trace element recipe as described in Media Des...
20	Supplement	Additional media supplements in each experimen...
21	Temperature (C)	Temperature of growth condition in Celsius
22	pH	Neutral (7) or acidic (5.5)
23	Antibiotic	Treatment with low concentration Kanamycin for...
24	Culture Type	Batch or Chemostat (always exponential growth ...
25	Growth Rate (1/hr)	Measured growth rate of strain in media
26	Evolved Sample	Whether the strain was from an adaptive labora...
27	Isolate Type	Clonal or Population (only applicable to ALE s...
28	Sequencing Machine	Type of machine used for sequencing
29	Additional Details	Additional information about experiment
30	Biological Replicates	Number of biological replicates within study
31	Alignment	Percent of reads aligned by bowtie to referenc...
32	DOI	Accession number for study
33	GEO	GEO Accession ID

Preprocess data#

Following the article’s methods, we compute the log2FCs with respect to controls

[4]:

# get gene expression
X = data.parse("Expression Data").set_index("log-TPM")

# get controls
controls = [c for c in X.columns if "control" in c]
print(controls)

# compute log2FCs with respect to controls
X = X - X[controls].mean(1).values.reshape(-1, 1)

X

['control__wt_glc__1', 'control__wt_glc__2']

[4]:

	control__wt_glc__1	control__wt_glc__2	fur__wt_dpd__1	fur__wt_dpd__2	fur__wt_fe__1	fur__wt_fe__2	fur__delfur_dpd__1	fur__delfur_dpd__2	fur__delfur_fe2__1	fur__delfur_fe2__2	...	efeU__menFentC_ale29__1	efeU__menFentC_ale29__2	efeU__menFentC_ale30__1	efeU__menFentC_ale30__2	efeU__menFentCubiC_ale36__1	efeU__menFentCubiC_ale36__2	efeU__menFentCubiC_ale37__1	efeU__menFentCubiC_ale37__2	efeU__menFentCubiC_ale38__1	efeU__menFentCubiC_ale38__2
log-TPM
b0002	-0.061772	0.061772	0.636527	0.819793	-0.003615	-0.289353	-1.092023	-0.777289	0.161343	0.145641	...	-0.797097	-0.791859	0.080114	0.102154	0.608180	0.657673	0.813105	0.854813	0.427986	0.484338
b0003	-0.053742	0.053742	0.954439	1.334385	0.307588	0.128414	-0.872563	-0.277893	0.428542	0.391761	...	-0.309105	-0.352535	-0.155074	-0.077145	0.447030	0.439881	0.554528	0.569030	0.154905	0.294799
b0004	-0.065095	0.065095	-0.202697	0.119195	-0.264995	-0.546017	-1.918349	-1.577736	-0.474815	-0.495312	...	-0.184898	-0.225615	0.019575	0.063986	0.483343	0.452754	0.524828	0.581878	0.293239	0.341040
b0005	0.028802	-0.028802	-0.865171	-0.951179	0.428769	0.123564	-1.660351	-1.531147	0.240353	-0.151132	...	-0.308221	-0.581714	0.018820	0.004040	-1.228763	-1.451750	-0.839203	-0.529349	-0.413336	-0.478682
b0006	0.009087	-0.009087	-0.131039	-0.124079	-0.144870	-0.090152	-0.219917	-0.046648	-0.044537	-0.089204	...	1.464603	1.415706	1.230831	1.165153	0.447447	0.458852	0.421417	0.408077	1.151066	1.198529
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
b4688	-0.261325	0.261325	-1.425581	-2.734490	0.181893	0.514395	-1.943947	-1.992701	0.066037	-0.695325	...	-0.885297	-0.462485	-2.734490	-1.451148	-1.379069	-1.567420	-0.999610	-1.726577	-2.734490	-1.189069
b4693	-0.278909	0.278909	1.361362	1.020310	0.608108	0.988541	2.558416	2.142724	3.120867	3.104887	...	-0.374963	0.856574	-1.147824	-0.814089	2.054471	1.853620	1.957717	1.943582	2.233115	2.023755
b4696_1	0.050526	-0.050526	1.166436	1.043373	-0.531441	-0.581626	0.914055	0.731165	-0.127269	-0.164046	...	0.261604	0.278426	0.201089	-0.017780	0.138178	0.122287	0.504402	0.425213	0.629383	0.805945
b4696_2	-0.031653	0.031653	0.785573	0.881353	-0.477271	-0.916095	0.837603	0.801393	-0.071710	-0.000540	...	-0.499371	0.398783	0.096609	-0.103446	-0.519098	0.615363	0.343959	0.580288	0.366905	0.702608
b4705	0.724324	-0.724324	-4.350151	-4.317498	-0.747489	-1.257045	-3.308337	-4.421970	-2.679693	-1.872713	...	-1.968530	-1.365300	-5.468290	-2.997169	-3.673367	-3.161608	-3.959910	-4.088644	-5.468290	-5.468290

3923 rows × 278 columns

Run `robustica` with custom clustering classes#

Using the classical Icasso algorithm#

[5]:

%%time

rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_infer_signs=False,
    robust_dimreduce=False,
    robust_kws={"affinity": "precomputed", "linkage": "average"},
)

with parallel_backend("loky", n_jobs=4):
    S_classic, A_classic = rica.fit_transform(X.values)

print(S_classic.shape, A_classic.shape)

  0%|          | 0/8 [00:00<?, ?it/s]

Running FastICA multiple times...

100%|██████████| 8/8 [00:03<00:00,  2.03it/s]

Precomputing distance matrix...
Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 2.42 s, sys: 242 ms, total: 2.67 s
Wall time: 8.59 s

Using your clustering algorithm#

Create your own clustering class#

In this example, we create a clustering class that computes the distance within the class.

[6]:

class MyClusteringClass:
    def __init__(self, **kws):
        self.clustering = AgglomerativeClustering(
            affinity="precomputed", linkage="average", **kws
        )

    def fit(self, X):
        # compute dissimilarity matrix
        D = 1 - np.abs(np.corrcoef(X))

        # cluster
        self.clustering.fit(D)
        self.labels_ = self.clustering.labels_  # required instance

Run#

You just need to define your class in the robust_method and its keyword arguments in robust_kws. In this example,

[7]:

%%time

rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_infer_signs=False,
    robust_dimreduce=False,
    robust_method=MyClusteringClass,
    robust_kws={"n_clusters": 100},  # define expected n. clusters
    robust_precompdist_func=None,
)

with parallel_backend("loky", n_jobs=4):
    S, A = rica.fit_transform(X.values)

print(S.shape, A.shape)

  0%|          | 0/8 [00:00<?, ?it/s]

Running FastICA multiple times...

100%|██████████| 8/8 [00:03<00:00,  2.47it/s]
/home/miquel/miniconda3/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 2.52 s, sys: 64.5 ms, total: 2.58 s
Wall time: 8.13 s

Correspondence between the two approaches#

[8]:

plt.figure(figsize=(11, 10))
sns.heatmap(
    np.abs(corrmats(S.T, S_classic.T)),
    cmap="coolwarm",
    center=0,
    cbar_kws={"label": r"$abs(Pearson Correlation)$"},
)
plt.xlabel("Classic ICs")
plt.ylabel("My ICs")

plt.show()

Using other existing clustering algorithms#

robustica accepts any clustering class as robust_method. It just needs to create a self.labels_ attribute. For example, we can use `HDBSCAN <https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html>`__.

[9]:

from hdbscan import HDBSCAN

[10]:

rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_method=HDBSCAN,
    robust_kws={"min_cluster_size": 4},  # half of robust_runs
)

with parallel_backend("loky", n_jobs=4):
    S, A = rica.fit_transform(X.values)

print(S.shape, A.shape)

  0%|          | 0/8 [00:00<?, ?it/s]

Running FastICA multiple times...

100%|██████████| 8/8 [00:02<00:00,  2.80it/s]

Inferring sign of components...
Reducing dimensions...
Clustering...

/home/miquel/miniconda3/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

Computing centroids...
(3923, 93) (278, 93)

[11]:

rica.clustering.stats_

[11]:

	cluster_id	cluster_size	S_mean_std	A_mean_std
0	-1.0	63.0	0.015966	4.107105
1	0.0	8.0	0.015966	5.892067
2	1.0	8.0	0.015966	13.805991
3	2.0	8.0	0.015966	9.572825
4	3.0	8.0	0.015966	9.232571
...	...	...	...	...
88	87.0	8.0	0.015966	4.127089
89	88.0	8.0	0.015966	3.455316
90	89.0	4.0	0.015966	6.677909
91	90.0	8.0	0.015966	3.843471
92	91.0	7.0	0.015966	3.586185

93 rows × 4 columns

Using a custom clustering class#