Basic pipeline for exploratory analysis#
Load modules#
[1]:
from robustica import RobustICA, corrmats
import pandas as pd
import numpy as np
from sklearn.utils import parallel_backend
import matplotlib.pyplot as plt
import seaborn as sns
Load data from Sastry (2019)#
[2]:
url = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-019-13483-w/MediaObjects/41467_2019_13483_MOESM4_ESM.xlsx"
data = pd.ExcelFile(url)
data.sheet_names
[2]:
['README',
 'Metadata',
 'Expression Data',
 'S matrix',
 'A matrix',
 'Media Recipes',
 'Gene Information']
[3]:
data.parse("README")
[3]:
| Sheet Name | Description | |
|---|---|---|
| 0 | Experimental Conditions | Experimental conditions for each sample in PRE... | 
| 1 | Expression Data | Expression levels of genes (log2 Transcripts p... | 
| 2 | S matrix | I-modulon gene coefficients. Each column is an... | 
| 3 | A matrix | Condition-specific i-modulon activities. Each ... | 
| 4 | Media Description | Recipes for base media and trace element mixtures | 
| 5 | Gene Information | Mapping of b-numbers to gene names/annotations... | 
| 6 | NaN | NaN | 
| 7 | Metadata Columns | Description | 
| 8 | Sample ID | Unique sample identifier for experiment (e.g. ... | 
| 9 | Study | Short 1-2 word description of study where samp... | 
| 10 | Project ID | Identifier for study | 
| 11 | Condition ID | Identifier for unique experimental conditions | 
| 12 | Replicate # | Suffix for sample ID to distinguish between re... | 
| 13 | Strain Description | Describes gene knock-out, mutation knock-in, o... | 
| 14 | Strain | MG1655 or BW25113 | 
| 15 | Base Media | Media recipe as described in Media Description... | 
| 16 | Carbon Source (g/L) | Carbon source in g/L | 
| 17 | Nitrogen Source (g/L) | Nitrogen source in g/L | 
| 18 | Electron Acceptor | Aerobic (O2), anaerobic (None), or nitrate (KN... | 
| 19 | Trace Element Mixture | Trace element recipe as described in Media Des... | 
| 20 | Supplement | Additional media supplements in each experimen... | 
| 21 | Temperature (C) | Temperature of growth condition in Celsius | 
| 22 | pH | Neutral (7) or acidic (5.5) | 
| 23 | Antibiotic | Treatment with low concentration Kanamycin for... | 
| 24 | Culture Type | Batch or Chemostat (always exponential growth ... | 
| 25 | Growth Rate (1/hr) | Measured growth rate of strain in media | 
| 26 | Evolved Sample | Whether the strain was from an adaptive labora... | 
| 27 | Isolate Type | Clonal or Population (only applicable to ALE s... | 
| 28 | Sequencing Machine | Type of machine used for sequencing | 
| 29 | Additional Details | Additional information about experiment | 
| 30 | Biological Replicates | Number of biological replicates within study | 
| 31 | Alignment | Percent of reads aligned by bowtie to referenc... | 
| 32 | DOI | Accession number for study | 
| 33 | GEO | GEO Accession ID | 
Preprocess data#
Following the article’s methods, we compute the log2FCs with respect to controls
[4]:
# get gene expression
X = data.parse("Expression Data").set_index("log-TPM")
# get controls
controls = [c for c in X.columns if "control" in c]
print(controls)
# compute log2FCs with respect to controls
X = X - X[controls].mean(1).values.reshape(-1, 1)
X
['control__wt_glc__1', 'control__wt_glc__2']
[4]:
| control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| log-TPM | |||||||||||||||||||||
| b0002 | -0.061772 | 0.061772 | 0.636527 | 0.819793 | -0.003615 | -0.289353 | -1.092023 | -0.777289 | 0.161343 | 0.145641 | ... | -0.797097 | -0.791859 | 0.080114 | 0.102154 | 0.608180 | 0.657673 | 0.813105 | 0.854813 | 0.427986 | 0.484338 | 
| b0003 | -0.053742 | 0.053742 | 0.954439 | 1.334385 | 0.307588 | 0.128414 | -0.872563 | -0.277893 | 0.428542 | 0.391761 | ... | -0.309105 | -0.352535 | -0.155074 | -0.077145 | 0.447030 | 0.439881 | 0.554528 | 0.569030 | 0.154905 | 0.294799 | 
| b0004 | -0.065095 | 0.065095 | -0.202697 | 0.119195 | -0.264995 | -0.546017 | -1.918349 | -1.577736 | -0.474815 | -0.495312 | ... | -0.184898 | -0.225615 | 0.019575 | 0.063986 | 0.483343 | 0.452754 | 0.524828 | 0.581878 | 0.293239 | 0.341040 | 
| b0005 | 0.028802 | -0.028802 | -0.865171 | -0.951179 | 0.428769 | 0.123564 | -1.660351 | -1.531147 | 0.240353 | -0.151132 | ... | -0.308221 | -0.581714 | 0.018820 | 0.004040 | -1.228763 | -1.451750 | -0.839203 | -0.529349 | -0.413336 | -0.478682 | 
| b0006 | 0.009087 | -0.009087 | -0.131039 | -0.124079 | -0.144870 | -0.090152 | -0.219917 | -0.046648 | -0.044537 | -0.089204 | ... | 1.464603 | 1.415706 | 1.230831 | 1.165153 | 0.447447 | 0.458852 | 0.421417 | 0.408077 | 1.151066 | 1.198529 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| b4688 | -0.261325 | 0.261325 | -1.425581 | -2.734490 | 0.181893 | 0.514395 | -1.943947 | -1.992701 | 0.066037 | -0.695325 | ... | -0.885297 | -0.462485 | -2.734490 | -1.451148 | -1.379069 | -1.567420 | -0.999610 | -1.726577 | -2.734490 | -1.189069 | 
| b4693 | -0.278909 | 0.278909 | 1.361362 | 1.020310 | 0.608108 | 0.988541 | 2.558416 | 2.142724 | 3.120867 | 3.104887 | ... | -0.374963 | 0.856574 | -1.147824 | -0.814089 | 2.054471 | 1.853620 | 1.957717 | 1.943582 | 2.233115 | 2.023755 | 
| b4696_1 | 0.050526 | -0.050526 | 1.166436 | 1.043373 | -0.531441 | -0.581626 | 0.914055 | 0.731165 | -0.127269 | -0.164046 | ... | 0.261604 | 0.278426 | 0.201089 | -0.017780 | 0.138178 | 0.122287 | 0.504402 | 0.425213 | 0.629383 | 0.805945 | 
| b4696_2 | -0.031653 | 0.031653 | 0.785573 | 0.881353 | -0.477271 | -0.916095 | 0.837603 | 0.801393 | -0.071710 | -0.000540 | ... | -0.499371 | 0.398783 | 0.096609 | -0.103446 | -0.519098 | 0.615363 | 0.343959 | 0.580288 | 0.366905 | 0.702608 | 
| b4705 | 0.724324 | -0.724324 | -4.350151 | -4.317498 | -0.747489 | -1.257045 | -3.308337 | -4.421970 | -2.679693 | -1.872713 | ... | -1.968530 | -1.365300 | -5.468290 | -2.997169 | -3.673367 | -3.161608 | -3.959910 | -4.088644 | -5.468290 | -5.468290 | 
3923 rows × 278 columns
Run robustica#
Using our version of the Icasso algorithm#
By default, robustica performs robust ICA by 1. running FastICA multiple times (robust_runs=100) 2. inferring and correcting the signs of components across runs (robust_infer_signs=True) 3. compressing the feature space (rows) (robust_dimreduce=True) 4. using sign-corrected and compressed components to obtain clusters of independent components (robust_method="AgglomerativeClustering"
and robust_kws={"linkage": "average"}) 5. uses the clustering labels to compute robust independent components
[5]:
%%time
rica = RobustICA(n_components=100, robust_runs=8)
with parallel_backend("loky", n_jobs=4):
    S, A = rica.fit_transform(X.values)
print(S.shape, A.shape)
  0%|          | 0/8 [00:00<?, ?it/s]
Running FastICA multiple times...
100%|██████████| 8/8 [00:03<00:00,  2.24it/s]
Inferring sign of components...
Reducing dimensions...
Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 3.71 s, sys: 312 ms, total: 4.02 s
Wall time: 7.6 s
Using the classical Icasso algorithm#
[6]:
%%time
rica = RobustICA(
    n_components=100,
    robust_runs=8,
    robust_infer_signs=False,
    robust_dimreduce=False,
    robust_kws={"affinity": "precomputed", "linkage": "average"},
)
with parallel_backend("loky", n_jobs=4):
    S_classic, A_classic = rica.fit_transform(X.values)
print(S_classic.shape, A_classic.shape)
  0%|          | 0/8 [00:00<?, ?it/s]
Running FastICA multiple times...
100%|██████████| 8/8 [00:02<00:00,  3.09it/s]
Precomputing distance matrix...
Clustering...
Computing centroids...
(3923, 100) (278, 100)
CPU times: user 2.38 s, sys: 132 ms, total: 2.51 s
Wall time: 7.51 s
Correspondence between the two approaches#
[7]:
plt.figure(figsize=(11, 10))
sns.heatmap(
    np.abs(corrmats(S.T, S_classic.T)),
    cmap="coolwarm",
    center=0,
    cbar_kws={"label": r"$abs(Pearson Correlation)$"},
)
plt.xlabel("Classic ICs")
plt.ylabel("Default ICs")
plt.show()
