Single-cell RNA-seq (scRNA-seq) technologies can uncover changes in the molecular states that underlie cellular phenotypes. However, to understand dynamic cellular processes, computational tools are needed to extract temporal information from the snapshots of cellular gene expression that scRNA-seq provides. To address this challenge, we have developed a neural ordinary differential equation based method, RNAForecaster, for predicting gene expression states in single cells for multiple future time steps. We demonstrate that RNAForecaster can accurately predict future expression states in simulated scRNA-seq data. We then show that using metabolic labeling scRNA-seq data from constitutively dividing cells, RNAForecaster accurately recapitulates many of the expected changes in gene expression during progression through the cell cycle over a three day period. Finally, we extend RNAForecaster to use unspliced and spliced counts from scRNA-seq to predict the impact of knock down experiments in pancreatic development. Thus, RNAForecaster enables estimation of future expression states in biological systems.


Pancreatic ductal adenocarcinoma (PDAC) is a devastating malignancy driven by a heterogeneous tumor microenvironment enriched with cancer associated fibroblasts (CAFs) that influence its overall immunosuppressive composition. We examined inflammation in PDAC by leveraging our existing patient-derived organoid (PDO) model and a novel PDO-CAF co-culture. We first identified induction of major histocompatibility complex class II expression following treatment with interferon gamma. In parallel, we collated an atlas of 6 published single-cell RNA-sequencing datasets (174,394 cells) combining 61 PDAC (142,807 cells) and 16 non-malignant samples. By combining in silico modeling and in vitro PDO co-culture, we define a gene expression pattern of inflammatory processes in epithelial tumor cells. Following computational inferences, we examined interactions between epithelial tumor cells and CAFs, focusing on VEGF-A and ITGB1 pathways. This work, integrating computational and biological approaches, highlights the value of convergence to accelerate our understanding of key drivers of PDAC. Statement of SignificanceWe established PDO-CAF co-cultures to model tumor cell interactions and validated discoveries using transfer learning into a single-cell RNA-seq atlas of PDAC tumors. This bidirectional approach from human to experimental systems facilitates interrogation of PDAC biology, including the role of inflammation associated with VEGF-A crosstalk between CAFs and tumor cells.


Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. Still, inferring biological processes requires additional post hoc statistics and annotation for interpretation of features learned from software packages developed for NMF implementation. Here, we aim to introduce a suite of computational tools that implement NMF and provide methods for accurate, clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations, and open questions in the field is followed by three vignettes for the Bayesian NMF algorithm CoGAPS (Coordinated Gene Activity across Pattern Subsets). Each vignette will demonstrate NMF analysis to quantify cell state transitions in public domain single-cell RNA-sequencing (scRNA-seq) data of malignant epithelial cells in 25 pancreatic ductal adenocarcinoma (PDAC) tumors and 11 control samples. The first uses PyCoGAPS, our new Python interface for CoGAPS that we developed to enhance runtime of Bayesian NMF for large datasets. The second vignette steps through the same analysis using our R CoGAPS interface, and the third introduces two new cloud-based, plug-and-play options for running CoGAPS using GenePattern Notebook and Docker. By providing Python support, cloud-based computing options, and relevant example workflows, we facilitate user-friendly interpretation and implementation of NMF for single-cell analyses.


RNA velocity analysis of single cells promises to predict temporal dynamics from gene expression. Indeed, in many systems, it has been observed that RNA velocity produces a vector field that qualitatively reflects known features of the system. Despite this observation, the limitations of RNA velocity estimates are poorly understood. Using real data and simulations, we dissect the impact of different steps in the RNA velocity workflow on the estimated vector field. We find that the process of mapping RNA velocity estimates into a low-dimensional representation, such as those produced by UMAP, has a large impact on the result. The RNA velocity vector field strongly depends on the k-NN graph of the data. This dependence leads to significant estimator errors when the k-NN graph is not a faithful representation of the true data structure, a feature that cannot be known for most real datasets. Finally, we establish that RNA velocity estimates expression speed neither at the gene nor cellular level. We propose that RNA velocity is best considered a smoothed interpolation of the observed k-NN structure, as opposed to an extrapolation of future cellular states, and that the use of RNA velocity as a validation of latent space embedding structures is circular.


Trans-differentiation of human induced pluripotent stem cells into neurons (hiPSC-N) via Ngn2-induction has become an efficient system to quickly generate neurons for disease modeling and in vitro assay development, a significant step up from previously used neoplastic and other cell lines. Recent single-cell interrogation of Ngn2-induced neurons however, has revealed some similarities to unexpected neuronal lineages. Similarly, a straightforward method to generate hiPSC derived astrocytes (hiPSC-A) for the study of neuropsychiatric disorders has also been described. Here we examine the homogeneity and similarity of hiPSC-N and hiPSC-A to their in vivo counterparts, the impact of different lengths of time post Ngn2 induction on hiPSC-N (15 or 21 days) and of hiPSC-N / hiPSC-A co-culture. We explore how often genes differentially expressed between conditions relate to genetic risk for neuropsychiatric disease. Leveraging the wealth of existing public single-cell RNA-seq (scRNA-seq) data in Ngn2-induced neurons and in vivo data from the developing brain, we provide perspectives on the lineage origins and maturation of hiPSC-N and hiPSC-A. Both show heterogeneity and share similarity with multiple in vivo cell fates, and both cell types more precisely approximate their in vivo counterparts when co-cultured. hiPSC-A show more heterogeneity and similarities to other non-neural cell types, especially when cultured in isolation. Gene expression data from the hiPSC-N show excess of genes linked to schizophrenia (SZ) and autism spectrum disorders (ASD) as has been previously shown for neural stem cells and neurons. These overrepresentations of disease genes are strongest in our system at early times (day 15) in Ngn2-induction/maturation of neurons, which together with our observation of similarities with in vivo neurons earlier in development suggest they may be a better model for neurodevelopmental disorders. We have assembled this new scRNA-seq data along with the public data explored here as an integrated biologist-friendly web-resource for researchers seeking to understand this system more deeply:


Recent advances in spatial transcriptomics (ST) enable gene expression measurements from a tissue sample while retaining its spatial context. This technology enables unprecedented in situ resolution of the regulatory pathways that underlie the heterogeneity in the tumor and its microenvironment (TME). The direct characterization of cellular co-localization with spatial technologies facilities quantification of the molecular changes resulting from direct cell-cell interaction, as occurs in tumor-immune interactions. We present SpaceMarkers, a novel bioinformatics algorithm to infer molecular changes from cell-cell interaction from latent space analysis of ST data. We apply this approach to infer molecular changes from tumor-immune interactions in Visium spatial transcriptomics data of metastasis, invasive and precursor lesions, and immunotherapy treatment. Further transfer learning in matched scRNA-seq data enabled further quantification of the specific cell types in which SpaceMarkers are enriched. Altogether, SpaceMarkers can identify the location and context-specific molecular interactions within the TME from ST data.


SO_SCPLOWUMMARYC_SCPLOWIntegrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high throughput biological data being generated. In particular such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.


BackgroundThe cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle with high-resolution from single-cell RNA-seq data. ResultsHere, we present tricycle, an R/Bioconductor package, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the use of transfer learning. We estimate a cell cycle embedding using a fixed reference dataset and project new data into this reference embedding; an approach that overcomes key limitations of learning a dataset dependent embedding. Tricycle then predicts a cell-specific position in the cell cycle based on the data projection. The accuracy of tricycle compares favorably to gold-standard experimental assays, which generally require specialized measurements in specifically constructed in vitro systems. Using internal controls which are available for any dataset, we show that tricycle predictions generalize to datasets with multiple cell types, across tissues, species and even sequencing assays. ConclusionsTricycle generalizes across datasets, is highly scalable and applicable to atlas-level single-cell RNA-seq data.


Latent space techniques have emerged as powerful tools to identify genes and gene sets responsible for cell-type and species-specific differences in single-cell data. Transfer learning methods can compare learned latent spaces across biological systems. However, the robustness that comes from leveraging information across multiple genes in transfer learning is often attained at the sacrifice of gene-wise precision. Thus, methods are needed to identify genes, defined as important within a particular latent space, that significantly differ between contexts. To address this challenge, we have developed a new framework, scProject, and a new metric, projectionDrivers, to quantitatively examine latent space usage across single-cell experimental systems while concurrently extracting the genes driving the differential usage of the latent space between defined contrasts. Here, we demonstrate the efficacy, utility, and scalability of scProject with projectionDrivers and provide experimental validation for predicted species-specific differences between the developing mouse and human retina.


Variability between human pluripotent stem cell (hPSC) lines remains a challenge and opportunity in biomedicine. We identify differences in the spontaneous self-organization of individual hPSC lines during self-renewal that lie along a fundamental axis of in vivo development. Distinct stable and dynamic transcriptional elements were revealed by decomposition of RNA-seq data in pluripotency and early lineage emergence. Stable differences within pluripotency predicted regional bias in the dynamics of neural differentiation that were also observed in large collections of hPSC lines. Using replicate human induced PSC (hiPSC) lines and paired adult tissue, we demonstrate that cells from individual humans expressed unique transcriptional signatures that were maintained throughout life. In addition, replicate hiPSC lines from one donor showed divergent expression phenotypes driven by distinct chromatin states. These stable transcriptional states are under both genetic and epigenetic control and predict bias in subsequent forebrain and hindbrain neural trajectories. These results define mechanisms controlling transcription in pluripotency to first establish human neural diversity. Data availability: GSE164055; O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=200 SRC="FIGDIR/small/435870v1_ufig1.gif" ALT="Figure 1"> View larger version (45K): org.highwire.dtl.DTLVardef@1c9434borg.highwire.dtl.DTLVardef@60f3e4org.highwire.dtl.DTLVardef@12c2f2aorg.highwire.dtl.DTLVardef@1ccae99_HPS_FORMAT_FIGEXP M_FIG C_FIG


Parallel processing circuits are thought to dramatically expand the network capabilities of the nervous system. Magnocellular and parvocellular oxytocin neurons have been proposed to subserve two parallel streams of social information processing, which allow a single molecule to encode a diverse array of ethologically distinct behaviors, although to date direct evidence to support this hypothesis is lacking. Here we provide the first comprehensive characterization of magnocellular and parvocellular oxytocin neurons, validated across anatomical, projection target, electrophysiological, and transcriptional criteria. We next used novel multiple feature selection tools in Fmr1 KO mice to provide direct evidence that normal functioning of the parvocellular but not magnocellular oxytocin pathway is required for autism-relevant social reward behavior. Finally, we demonstrate that autism risk genes are uniquely enriched in parvocellular oxytocin neurons. Taken together these results provide the first evidence that oxytocin pathway specific pathogenic mechanisms account for social impairments across a broad range of autism etiologies. One Sentence SummaryPathoclisis of parvocellular oxytocin neurons plays an important role in the pathogenesis of social impairments in autism.


Better understanding the progression of neural stem cells (NSCs) in the developing cerebral cortex is important for modeling neurogenesis and defining the pathogenesis of neuropsychiatric disorders. Here we used RNA-sequencing, cell imaging and lineage tracing of mouse and human in vitro NSCs to model the generation of cortical neuronal fates. We show that conserved signaling mechanisms regulate the acute transition from proliferative NSCs to committed glutamatergic excitatory neurons. As human telencephalic NSCs developed from pluripotency in vitro, they first transitioned through organizer states that spatially pattern the cortex before generating glutamatergic precursor fates. NSCs derived from multiple human pluripotent lines varied in these early patterning states leading differentially to dorsal or ventral telencephalic fates. This work furthers systematic analysis of the earliest patterning events that generate the major neuronal trajectories of the human telencephalon.


The development of single-cell RNA-Sequencing (scRNA-Seq) has allowed high resolution analysis of cell type diversity and transcriptional networks controlling cell fate specification. To identify the transcriptional networks governing human retinal development, we performed scRNA-Seq over retinal organoid and in vivo retinal development, across 20 timepoints. Using both pseudotemporal and cross-species analyses, we examined the conservation of gene expression across retinal progenitor maturation and specification of all seven major retinal cell types. Furthermore, we examined gene expression differences between developing macula and periphery and between two distinct populations of horizontal cells. We also identify both shared and species-specific patterns of gene expression during human and mouse retinal development. Finally, we identify an unexpected role for ATOH7 expression in regulation of photoreceptor specification during late retinogenesis. These results provide a roadmap to future studies of human retinal development, and may help guide the design of cell-based therapies for treating retinal dystrophies.


MotivationDimension reduction techniques are widely used to interpret high-dimensional biological data. Features learned from these methods are used to discover both technical artifacts and novel biological phenomena. Such feature discovery is critically import to large single-cell datasets, where lack of a ground truth limits validation and interpretation. Transfer learning (TL) can be used to relate the features learned from one source dataset to a new target dataset to perform biologically-driven validation by evaluating their use in or association with additional sample annotations in that independent target dataset.\n\nResultsWe developed an R/Bioconductor package, projectR, to perform TL for analyses of genomics data via TL of clustering, correlation, and factorization methods. We then demonstrate the utility TL for integrated data analysis with an example for spatial single-cell analysis.\n\nAvailabilityprojectR is available on Bioconductor and at\n\;


Bioinformatics techniques to analyze time course bulk and single cell omics data are advancing. The absence of a known ground truth of the dynamics of molecular changes challenges benchmarking their performance on real data. Realistic simulated time-course datasets are essential to assess the performance of time course bioinformatics algorithms. We develop an R/Bioconductor package, CancerInSilico, to simulate bulk and single cell transcriptional data from a known ground truth obtained from mathematical models of cellular systems. This package contains a general R infrastructure for running cell-based models and simulating gene expression data based on the model states. We show how to use this package to simulate a gene expression data set and consequently benchmark analysis methods on this data set with a known ground truth. The package is freely available via Bioconductor:


Tumor heterogeneity provides a complex challenge to cancer treatment and is a critical component of therapeutic response, disease recurrence, and patient survival. Single-cell RNA-sequencing (scRNA-seq) technologies reveal the prevalence of intra-and inter-tumor heterogeneity. Computational techniques are essential to quantify the differences in variation of these profiles between distinct cell types, tumor subtypes, and patients to fully characterize intra-and inter-tumor molecular heterogeneity. We devised a new algorithm, Expression Variation Analysis in Single Cells (EVAsc), to perform multivariate statistical analyses of differential variation of expression in gene sets for scRNA-seq. EVAsc has high sensitivity and specificity to detect pathways with true differential heterogeneity in simulated data. We then apply EVAsc to several public domain scRNA-seq tumor datasets to quantify the landscape of tumor heterogeneity in several key applications in cancer genomics, i.e. immunogenicity, cancer subtypes, and metastasis. Immune pathway heterogeneity in hematopoietic cell populations in breast tumors corresponded to the amount diversity present in the T-cell repertoire of each individual. In head and neck squamous cell carcinoma (HNSCC) patients, we found dramatic differences in pathway dysregulation across basal primary tumors. Within the basal primary tumors we also identified increased immune dysregulation in individuals with a high proportion of fibroblasts present in the tumor microenvironment. Moreover, cells in HNSCC primary tumors had significantly more heterogeneity across pathways than cells in metastases, consistent with a model of clonal outgrowth. These results demonstrate the broad utility of EVAsc to quantify inter-and intra-tumor heterogeneity from scRNA-seq data without reliance on low dimensional visualization.


New approaches are urgently needed to glean biological insights from the vast amounts of single cell RNA sequencing (scRNA-Seq) data now being generated. To this end, we propose that cell identity should map to a reduced set of factors which will describe both exclusive and shared biology of individual cells, and that the dimensions which contain these factors reflect biologically meaningful relationships across different platforms, tissues and species. To find a robust set of dependent factors in large-scale scRNA- Seq data, we developed a Bayesian non-negative matrix factorization (NMF) algorithm, scCoGAPS. Application of scCoGAPS to scRNA-Seq data obtained over the course of mouse retinal development identified gene expression signatures for factors associated with specific cell types and continuous biological processes. To test whether these signatures are shared across diverse cellular contexts, we developed projectR to map biologically disparate datasets into the factors learned by scCoGAPS. Because projecting these dimensions preserve relative distances between samples, biologically meaningful relationships/factors will stratify new data consistent with their underlying processes, allowing labels or information from one dataset to be used for annotation of the other--a machine learning concept called transfer learning. Using projectR, data from multiple datasets was used to annotate latent spaces and reveal novel parallels between developmental programs in other tissues, species and cellular assays. Using this approach we are able to transfer cell type and state designations across datasets to rapidly annotate cellular features in a new dataset without a priori knowledge of their type, identify a species-specific signature of microglial cells, and identify a previously undescribed subpopulation of neurosecretory cells within the lung. Together, these algorithms define biologically meaningful dimensions of cellular identity, state, and trajectories that persist across technologies, molecular features, and species.\n\nGRAPHICAL ABSTRACT\n\nO_FIG O_LINKSMALLFIG WIDTH=174 HEIGHT=200 SRC=\"FIGDIR/small/395004_ufig1.gif\" ALT=\"Figure 1\">\nView larger version (81K):\norg.highwire.dtl.DTLVardef@dd1c07org.highwire.dtl.DTLVardef@5b1109org.highwire.dtl.DTLVardef@bb6714org.highwire.dtl.DTLVardef@16c66f0_HPS_FORMAT_FIGEXP M_FIG C_FIG


Precise temporal control of gene expression in neuronal progenitors is necessary for correct regulation of neurogenesis and cell fate specification. However, the extensive cellular heterogeneity of the developing CNS has posed a major obstacle to identifying the gene regulatory networks that control these processes. To address this, we used single cell RNA-sequencing to profile ten developmental stages encompassing the full course of retinal neurogenesis. This allowed us to comprehensively characterize changes in gene expression that occur during initiation of neurogenesis, changes in developmental competence, and specification and differentiation of each of the major retinal cell types. These data identify transitions in gene expression between early and late-stage retinal progenitors, as well as a classification of neurogenic progenitors. We identify here the NFI family of transcription factors (Nfia, Nfib, and Nfix) as genes with enriched expression within late RPCs, and show they are regulators of bipolar interneuron and Muller glia specification and the control of proliferative quiescence.


Omics data contains signal from the molecular, physical, and kinetic inter- and intra-cellular interactions that control biological systems. Matrix factorization techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in topics ranging from pathway discovery to time course analysis. We review exemplary applications of matrix factorization for systems-level analyses. We discuss appropriate application of these methods, their limitations, and focus on analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with matrix factorization enables discovery from high-throughput data beyond the limits of current biological knowledge--answering questions from high-dimensional data that we have not yet thought to ask.


BACKGROUNDTargeted therapies specifically act by blocking the activity of proteins that are encoded by genes critical for tumorigenesis. However, most cancers acquire resistance and long-term disease remission is rarely observed. Understanding the time course of molecular changes responsible for the development of acquired resistance could enable optimization of patients treatment options. Clinically, acquired therapeutic resistance can only be studied at a single time point in resistant tumors. To determine the dynamics of these molecular changes, we obtained high throughput omics data weekly during the development of cetuximab resistance in a head and neck cancer in vitro model.\n\nRESULTSAn unsupervised algorithm, CoGAPS, was used to quantify the evolving transcriptional and epigenetic changes. Applying a PatternMarker statistic to the results from CoGAPS enabled novel heatmap-based visualization of the dynamics in these time course omics data. We demonstrate that transcriptional changes result from immediate therapeutic response or resistance, whereas epigenetic alterations only occur with resistance. Integrated analysis demonstrates delayed onset of changes in DNA methylation relative to transcription, suggesting that resistance is stabilized epigenetically.\n\nCONCLUSIONSGenes with epigenetic alterations associated with resistance that have concordant expression changes are hypothesized to stabilize resistance. These genes include FGFR1, which was associated with EGFR inhibitor resistance previously. Thus, integrated omics analysis distinguishes the timing of molecular drivers of resistance. Our findings provide a relevant towards better understanding of the time course progression of changes resulting in acquired resistance to targeted therapies. This is an important contribution to the development of alternative treatment strategies that would introduce new drugs before the resistant phenotype develops.


Cancer is a complex disease, driven by aberrant activity in numerous signaling pathways in even individual malignant cells. Epigenetic changes are critical mediators of these functional changes that drive and maintain the malignant phenotype. Changes in DNA methylation, histone acetylation and methylation, non-coding RNAs, post-translational modifications are all epigenetic drivers in cancer, independent of changes in the DNA sequence. These epigenetic alterations, once thought to be crucial only for the malignant phenotype maintenance, are now recognized as critical also for disrupting essential pathways that protect the cells from uncontrolled growth, longer survival and establishment in distant sites from the original tissue. In this review, we focus on DNA methylation and chromatin structure in cancer. While associated with cancer, the precise functional role of these alterations is an area of active research using emerging high-throughput approaches and bioinformatics analysis tools. Therefore, this review describes these high-throughput measurement technologies, public domain databases for high-throughput epigenetic data in tumors and model systems, and bioinformatics algorithms for their analysis. Advances in bioinformatics data integration techniques that combine these epigenetic data with genomics data are essential to infer the function of specific epigenetic alterations in cancer, and are therefore also a focus of this review. Future studies using these emerging technologies will elucidate how alterations in the cancer epigenome cooperate with genetic aberrations to cause tumorigenesis initiation and progression. This deeper understanding is essential to future studies that will precisely infer patients prognosis and select patients who will be responsive to emerging epigenetic therapies.


SummaryNon-negative Matrix Factorization (NMF) algorithms associate gene expression with biological processes (e.g., time-course dynamics or disease subtypes). Compared with univariate associations, the relative weights of NMF solutions can obscure biomarkers. Therefore, we developed a novel PatternMarkers statistic to extract genes for biological validation and enhanced visualization of NMF results. Finding novel and unbiased gene markers with PatternMarkers requires whole-genome data. However, NMF algorithms typically do not converge for the tens of thousands of genes in genome-wide profiling. Therefore, we also developed Genome-Wide CoGAPS Analysis in Parallel Sets (GWCoGAPS), the first robust whole genome Bayesian NMF using the sparse, MCMC algorithm, CoGAPS. This software contains analytic and visualization tools including a Shiny web application, patternMatcher, which are generalized for any NMF. Using these tools, we find granular brain-region and cell-type specific signatures with corresponding biomarkers in GTex data, illustrating GWCoGAPS and patternMarkers ascertainment of data-driven biomarkers from whole-genome data.\n\nAvailabilityPatternMarkers & GWCoGAPS are in the CoGAPS Bioconductor package (3.5) under the GPL license.\n\;;