The CellRegMap model

The CellRegMap model can be cast as:

$y = W\alpha + g\beta_G + g \odot \beta_{GxC} + c + u + \epsilon$,

where

$\beta_{GxC} \sim \mathcal{N} (0, \sigma^2_{GxC}CC^T)$,

$c \sim \mathcal{N} (0, \sigma^2_{C}CC^T)$,

$u \sim \mathcal{N} (0, \sigma^2_{KC}(CC^T \odot K))$, and

$\epsilon \sim \mathcal{N} (0, \sigma^2_n I)$.
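To make the role of each term concrete, the following is a minimal NumPy sketch that simulates one phenotype vector from the model above. It is an illustration under stated assumptions, not the CellRegMap implementation: the sizes, effect sizes and variance components are arbitrary, and K is assumed to be a simple donor-membership matrix built from a one-hot factor hK with K = hK hK^T. Each random term is sampled through a low-rank factor of its covariance, so no n x n matrix has to be decomposed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): cells, donors, covariates, cellular contexts
n, n_donors, n_covs, n_ctx = 200, 10, 2, 3
donor = np.repeat(np.arange(n_donors), n // n_donors)  # donor index of each cell

# W: covariates (intercept plus random columns); C: column-standardised contexts
W = np.column_stack([np.ones(n), rng.normal(size=(n, n_covs - 1))])
C = rng.normal(size=(n, n_ctx))
C = (C - C.mean(axis=0)) / C.std(axis=0)

# g: genotype at a single SNP, drawn per donor and expanded to that donor's cells
g = rng.binomial(2, 0.3, size=n_donors)[donor].astype(float)

# hK: one-hot donor membership (assumption), so K = hK hK^T is 1 for same-donor cells
hK = np.eye(n_donors)[donor]
K = hK @ hK.T

# Arbitrary illustrative parameters
alpha = rng.normal(size=n_covs)
beta_G = 0.3
s2_gxc, s2_c, s2_kc, s2_n = 0.2, 0.3, 0.2, 0.5

# beta_GxC ~ N(0, s2_gxc * C C^T) and c ~ N(0, s2_c * C C^T), sampled as C @ z
beta_GxC = np.sqrt(s2_gxc) * C @ rng.normal(size=n_ctx)
c = np.sqrt(s2_c) * C @ rng.normal(size=n_ctx)

# u ~ N(0, s2_kc * (C C^T ⊙ K)): the row-wise products of C and hK give a factor Z
# with Z Z^T = C C^T ⊙ K
Z = np.einsum("ik,ij->ikj", C, hK).reshape(n, -1)
u = np.sqrt(s2_kc) * Z @ rng.normal(size=Z.shape[1])

eps = np.sqrt(s2_n) * rng.normal(size=n)

# y = W alpha + g beta_G + g ⊙ beta_GxC + c + u + eps
y = W @ alpha + g * beta_G + g * beta_GxC + c + u + eps
```

In this sketch y and g are vectors of length n, W is n x 2, C is n x 3 and K is n x n, which matches the one-gene, one-SNP setting described in the notes.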

Brief description of the model terms

The following terms should be provided as input files:

The following terms will be estimated by the model:

Notes

Necessary inputs

The model will not run unless y, W, g and C are all provided as input.

Each SNP-gene pair should be tested independently

The test is run independently for each gene-SNP pair. In the model above, y and g are therefore one-dimensional vectors, representing i) the expression of a single gene and ii) the genotypes at a single SNP, respectively.

As tests are independent, we recommend parallelising as much as possible, for example by submitting independent jobs for each chromosome, gene, or even gene-SNP pair.
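One simple way to prepare such jobs is to group the list of gene-SNP pairs by the unit you want to submit. The sketch below groups by chromosome; the function name and the tuple layout (chromosome, gene, SNP) are hypothetical and only illustrate the bookkeeping, not any CellRegMap API.

```python
from collections import defaultdict

def jobs_by_chromosome(pairs):
    """Group (chromosome, gene, snp) tuples into one list of (gene, snp) per chromosome."""
    jobs = defaultdict(list)
    for chrom, gene, snp in pairs:
        jobs[chrom].append((gene, snp))
    return dict(jobs)

# Hypothetical identifiers, for illustration only
pairs = [
    ("1", "geneA", "snp1"),
    ("1", "geneA", "snp2"),
    ("2", "geneB", "snp3"),
]
jobs = jobs_by_chromosome(pairs)
# Each value can be handed to a separate cluster job, since the tests are independent.
```

Grouping by gene or by single pair works the same way by changing the key.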

Covariates, cell contexts and repeatedness are fixed

W, C, hK (and thus K) remain the same across all tests (i.e., across all SNP-gene pairs).

Dimensionality

Specified dimensionality for each of the terms, where n is the total number of cells:

Normalization

For optimal model fit, we recommend standardizing or quantile normalizing (to a standard normal distribution) the phenotype vector y and column-standardizing the cellular contexts C. Standardization transforms a vector to have mean 0 and standard deviation 1; scikit-learn's StandardScaler can be used for this task. Quantile normalization is a rank-based normalization that forces the values of the vector to follow a standard normal distribution. For an implementation of quantile normalization, see here.
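As a minimal sketch of the two transformations described above, assuming NumPy and the Python standard library only: `standardize` mirrors what StandardScaler does by default (population standard deviation, column-wise), and `quantile_normalize` is one common rank-based inverse normal transform. The function names and the (rank - 0.5) / n offset are choices made for this example, not necessarily those used in the linked implementation.

```python
import numpy as np
from statistics import NormalDist

def standardize(x):
    """Centre to mean 0 and scale to standard deviation 1 (column-wise for 2-D input)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def quantile_normalize(y):
    """Map a 1-D vector to standard normal quantiles according to its ranks."""
    y = np.asarray(y, dtype=float)
    # Ranks 1..n; ties are broken by position because the sort is stable
    order = np.argsort(y, kind="stable")
    ranks = np.empty(len(y), dtype=int)
    ranks[order] = np.arange(1, len(y) + 1)
    # Offset keeps the quantiles strictly inside (0, 1)
    quantiles = (ranks - 0.5) / len(y)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

# Example usage on toy data
rng = np.random.default_rng(0)
y_norm = quantile_normalize(rng.exponential(size=100))  # phenotype vector y
C_std = standardize(rng.normal(size=(100, 5)))          # cellular contexts C
```

Ties in y receive different quantiles here; if the phenotype contains many identical values, averaging tied ranks may be preferable.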

Pseudocells

Pseudocells are formed by grouping small numbers of similar cells into “pseudocells”, which reduces issues due to sparsity and speeds up computations by reducing the sample size. Existing implementations include Metacell and the micro pooling approach within the Vision pipeline. These approaches do not directly take into account the presence of several genetically distinct donors, which is important here. To address this, we recommend applying one of these approaches to each donor separately. For an implementation of how we computed meta-cells in the CellRegMap manuscript (for the neuronal differentiation data analysis), see here.
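The per-donor logic can be sketched as follows. This is not Metacell, the Vision micro pooling method, or the procedure used in the manuscript: as a deliberately crude stand-in for "similar cells", it orders the cells of each donor along a single context value and averages consecutive chunks. The function name, the 1-D context and the chunk size are all assumptions made for illustration; the point is only that cells from different donors are never pooled together.

```python
import numpy as np

def pseudocells_per_donor(expr, donor, context, size=5):
    """Average chunks of `size` cells, ordered by a 1-D context, within each donor."""
    expr = np.asarray(expr, dtype=float)        # cells x genes
    donor = np.asarray(donor)                   # donor label per cell
    context = np.asarray(context, dtype=float)  # one context value per cell
    pooled, pooled_donor = [], []
    for d in np.unique(donor):
        idx = np.flatnonzero(donor == d)
        # Place cells with similar context values next to each other
        idx = idx[np.argsort(context[idx], kind="stable")]
        for start in range(0, len(idx), size):
            chunk = idx[start:start + size]     # the last chunk may be smaller
            pooled.append(expr[chunk].mean(axis=0))
            pooled_donor.append(d)
    return np.vstack(pooled), np.array(pooled_donor)
```

The function returns the pseudocell expression matrix together with the donor of each pseudocell, which is needed later to expand genotypes to the pseudocell level.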

Multiple testing correction

Since thousands of tests are typically run, multiple testing correction of the test p-values is necessary. Below, we provide guidelines on how to correct for multiple testing for the two main tests implemented in CellRegMap. See also the workflow here.

Association test

For discovery, apply a two-step multiple testing correction: 1) within each gene, across SNPs, controlling the family-wise error rate (FWER); 2) across genes, controlling the false discovery rate (FDR).
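A minimal sketch of such a two-step correction is given below. It assumes Bonferroni as the FWER procedure in step 1 and Benjamini-Hochberg as the FDR procedure in step 2; other procedures are possible, and the function names and gene labels are hypothetical.

```python
import numpy as np

def bonferroni_min_per_gene(pvals_by_gene):
    """Step 1: per gene, Bonferroni-adjust the smallest SNP p-value (FWER within gene)."""
    return {gene: min(1.0, min(p) * len(p)) for gene, p in pvals_by_gene.items()}

def benjamini_hochberg(pvals):
    """Step 2: Benjamini-Hochberg adjusted p-values (FDR across genes)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# Toy p-values: one list of SNP-level p-values per gene
pvals_by_gene = {
    "geneA": [0.001, 0.2, 0.5],
    "geneB": [0.04],
    "geneC": [0.03, 0.6],
}
gene_level = bonferroni_min_per_gene(pvals_by_gene)  # one p-value per gene
genes = list(gene_level)
fdr = dict(zip(genes, benjamini_hochberg([gene_level[gene] for gene in genes])))
```

For a gene with a single tested SNP, step 1 leaves the p-value unchanged, so the same code also covers the case of one SNP per gene.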

Interaction test

Test only one SNP per gene, or at least only independent SNPs. With one SNP per gene, go straight to step 2 (FDR across genes). With multiple but independent SNPs per gene, apply a Bonferroni correction within each gene as step 1, then proceed to step 2.

References

[1] Argelaguet*, Velten* et al., Molecular Systems Biology, 2018 (MOFA: multi-omics factor analysis) - link

[2] Risso et al., Nature Communications, 2018 (ZINB-WaVE: zero-inflated negative binomial-based Wanted Variation Extraction) - link

[3] Svensson et al., Bioinformatics, 2020 (LDVAE: linearly decoded variational autoencoder) - link