Remove genes with very low expression for tradeSeq analysis
Source:R/tradeSeqPreprocess.R
tradeSeqPreprocess.Rd
The functions fitGAM
and evaluateK
, used to analyze gene expression along trajectories, usually require both considerable calculation power (access to computer clusters) and time (up to days) to complete. While one might assume that computing time is uniform across genes, it appears however that model fitting takes a few seconds or less for the vast majority of genes, and that only a small fraction disproportionately affects calculation time, with model fitting taking up to hours or even days for a single gene. Further examination shows that all these genes are among the lowest expressed, and that there is an inverse correlation between the number of cells with a non-zero count for a particular gene and model fitting time, which becomes exponentially longer as the number of expressing cells tends towards zero. Based on these observations, this function was built to considerably speed up computing time / lower computing power requirements while retaining as much information as possible. It filters out genes below a certain expression threshold (10 expressing cells by default), effectively removing genes whose expression is likely just noise (and thus the cause of lengthy and difficult model fitting), rather than a signal of biological significance. This is especially common when cells are subset from a larger object for trajectory inference, and the expression matrix contains many zero or near-zero expression genes, as these would be primarily expressed in excluded cells. Additionally, it offers a second method of filtering, if computing resources are still a limiting factor and further reduction in the number of genes is needed, by retaining only the top genes according to deviance, which provides a better information estimate than selecting variable genes, as the latter tends to still include a few genes with near-zero expressing cells. Finally, if fitGAM
and evaluateK
will be run with a condition, it also allows filtering based on the specific metadata that will be used.
Usage
tradeSeqPreprocess(
sds,
min.cells = 10,
nb.genes = NULL,
condition = NULL,
filter.mito = FALSE,
filter.ribo = FALSE,
filter.ncRNA = FALSE,
species = "human",
plot.genes = TRUE,
output.data = FALSE
)
Arguments
- sds
A SingleCellExperiment object containing the pseudotime values of one or several lineages, computed using
slingshot
(usually, the input object tofitGAM
).- min.cells
Numeric. The minimum number of cells with 1 count or more to keep a gene.
- nb.genes
Numeric. The maximum number of genes to keep according to deviance, as computed by
devianceFeatureSelection
internally. IfNULL
, all genes are kept.- condition
Character or Factor. Either the name of a metadata present in
sds
(for example, 'treatment', 'disease', etc) to subset the object and separately remove lowly expressed genes, or the identities, as character or factor, of length equal to the number of cells insds
.- filter.mito
Logical. If
TRUE
, mitochondrial features will be filtered out.- filter.ribo
Logical. If
TRUE
, ribosomal features will be filtered out.- filter.ncRNA
Logical. If
TRUE
, non-coding RNA features will be filtered out.- species
Character. The species name to filter out non-coding RNA features. If 'human', a dataset named ncRNA_human built from genenames database will be used as reference. If 'mouse', only pseudogenes will be filtered out based on a dataset named pseudogenes_mouse and built from dreamBase2 database. These datasets are loaded with RightOmicsTools and may be checked for more information.
- plot.genes
Logical. If
TRUE
, a bar plot showing the number of genes removed according to the number of cells with at least 1 count or more is displayed.- output.data
Logical. If
TRUE
, the function will return alist
containing the filtered SingleCellExperiment object, adata.frame
object with the number of cells with at least 1 count or more for each removed gene, and adata.frame
object with the data used to build the bar plot.