Data analysis of transcriptome sequencing data

I want to learn more about data analysis and statistics for transcriptome sequencing data. I would like to read some of the important papers and books in the field, and maybe take some MOOCs if any are available.

More precisely, I have data on differentially expressed genes across different groups of individuals, and I want to test whether the genes that are more highly expressed in one group are also more polymorphic.

Any ideas?
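One possible way to frame the question, as a rough sketch rather than a recommended pipeline: summarize polymorphism per gene (SNPs per kb is assumed here purely for illustration) and compare the genes that are more highly expressed in one group against the remaining genes with a rank-based test. The function names and the toy values below are hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test with a normal approximation.

    No tie or continuity correction is applied, so the p-value is approximate.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value

# Hypothetical SNPs-per-kb values, for illustration only.
upregulated_genes = [5.0, 6.0, 7.0]
other_genes = [1.0, 2.0, 3.0]
u_stat, p_value = mann_whitney_u(upregulated_genes, other_genes)
print(u_stat, p_value)
```

With real data it is probably worth checking whether expression level itself affects how many polymorphisms can be called, since highly expressed genes receive more reads and may therefore appear more polymorphic.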

Transcriptome analysis using next-generation sequencing

Up-to-date research in biology, biotechnology, and medicine requires fast genome and transcriptome analysis technologies for investigating cellular state, physiology, and activity. Here, microarray technology and next-generation sequencing of transcripts (RNA-Seq) are the state of the art. Whereas microarray technology is limited with respect to the amount of RNA required, the quantification of transcript levels, and the sequence information obtained, RNA-Seq provides nearly unlimited possibilities in modern bioanalysis. This chapter presents a detailed description of next-generation sequencing (NGS), describes the impact of this technology on transcriptome analysis, and explains its possibilities to explore the modern RNA world.


► We present a detailed description of next-generation sequencing (NGS).
► We describe the technologies and platforms for transcriptome analysis.
► We explain the analysis of NGS data.
► We inform about further applications of NGS.


Flavonoids are a group of secondary metabolites that are extensively distributed in plants. They have been divided into several major subgroups such as anthocyanins, proanthocyanidins, flavonols, flavones, and isoflavones [1]. These metabolites play important biological roles specifically related to plant development and defense. Anthocyanins are water soluble pigments that are mainly involved in flower and fruit coloration. Therefore, anthocyanins are important for attracting pollinators and they also influence seed dispersal [2]. Additionally, anthocyanins are natural antioxidants [3]. Proanthocyanidins are condensed tannins and are primarily concentrated in seeds, but they also affect fruit flavor [4]. Flavonols, flavones, flavanones, and isoflavones help protect plants from ultraviolet radiation and pathogens [5]. Furthermore, flavonoids are essential for plant adaptations to biotic and abiotic stresses [6].

The flavonoid biosynthesis pathway is a branch of the phenylpropanoid pathway [7] and requires several enzymes. For example, genes encoding PAL (phenylalanine ammonia lyase), CHS (chalcone synthase), CHI (chalcone isomerase), and F3H (flavanone 3-hydroxylase) are the early biosynthetic genes (EBGs) that produce common precursors in the early steps of the pathway [8]. The late biosynthetic genes (LBGs) contribute to a later stage, during which specific flavonoid products such as anthocyanins, proanthocyanidins, and flavonols are synthesized. The LBGs include those encoding DFR (dihydroflavonol 4-reductase), ANS (anthocyanidin synthase), and UFGT (UDP-glucose:flavonoid 3-glucosyltransferase), which are specifically involved in anthocyanin biosynthesis [9]. In contrast, LAR (leucoanthocyanidin reductase) and ANR (anthocyanidin reductase) are key enzymes mediating proanthocyanidin biosynthesis [10]. Additionally, FLS (flavonol synthase) is specific for flavonol biosynthesis [11]. The structural genes of the flavonoid biosynthesis pathway are transcriptionally controlled by the MYB–bHLH–WDR (MBW) complex comprising a MYB transcription factor, a basic helix-loop-helix (bHLH), and a WD-repeat protein [12].

Flavonoid biosynthesis is affected by various factors, including light [13], temperature [14], water deficit [15], and nutrient deficiency [16]. Moreover, phytohormones are among the most important regulators of the biosynthesis of flavonoid compounds in plants. The effects of plant hormones, such as jasmonate [17, 18], abscisic acid [19, 20], auxin [21], ethylene [22], cytokinin [23], and gibberellin [24], on flavonoid accumulation have been widely studied.

Jasmonates are oxylipins (oxygenated fatty acids) synthesized by the octadecanoid/hexadecanoid pathways [25]. Jasmonic acid can be metabolized to several derivatives, including methyl jasmonate (MeJA), jasmonoyl-isoleucine (JA-Ile), jasmonyl-1-aminocyclopropane-1-carboxylic acid (JA-ACC), glucosylated derivatives of JA (e.g., JA-O-Glc), and cis-jasmone. However, of these derivatives, only MeJA and JA-Ile have been well characterized [26]. Multiple studies have revealed that MeJA application induces flavonoid biosynthesis in different fruit species such as apple (Malus domestica) [27], grape [28], blueberry [29], and strawberry (Fragaria × ananassa) [30]. In pear, the post-harvest application of MeJA induces anthocyanin accumulation in the fruit peel under UV-B/Vis irradiation [31]. In addition to anthocyanin, Ni et al. [22] reported that MeJA increases the accumulation of other flavonoid derivatives, including flavone and isoflavone, in pear fruit.

The molecular mechanism underlying jasmonate-induced anthocyanin accumulation has been clarified in Arabidopsis thaliana (Arabidopsis) and apple [17, 32, 33]. Jasmonate ZIM-domain proteins (JAZs) are substrates of the SCF COI1 complex and negatively regulate the jasmonate signaling pathway [34, 35]. The JAZ proteins can directly interact with MYB and bHLH and disrupt the formation of the MBW complex [32, 36]. After the jasmonate signal is perceived, JAZ proteins are recruited by COI1 to the SCF COI1 complex for ubiquitination and are subsequently degraded by the 26S proteasome pathway [32]. This triggers the release of MYB and bHLH transcription factors and the formation of the MBW complex to activate the expression of flavonoid biosynthesis pathway structural genes [18, 33]. The expression levels of MYB and bHLH transcription factor genes are upregulated by MeJA in Arabidopsis and apple, suggesting these transcription factors are regulated by the jasmonate signaling pathway. However, the molecular mechanism associated with MeJA-induced flavonoid biosynthesis in pear is largely unknown. Therefore, in the present study, pear calli treated with MeJA underwent a comprehensive transcriptome analysis to identify the differentially expressed genes (DEGs) between the MeJA-treated and untreated control pear calli. Moreover, a co-expression network was constructed to detect the transcripts specifically related to MeJA-induced flavonoid biosynthesis. This study generated a pool of candidate genes that should be analyzed in greater detail to clarify the molecular mechanism associated with MeJA-induced flavonoid biosynthesis in pear. Specifically, we examined pear calli because of their lack of seasonal restrictions and the ease with which gene effects can be observed in a homogeneous system, which can substantially accelerate the study of gene functions in pear.


Pathway classification map of the differentially expressed genes based on transcriptome sequencing

The cDNA libraries were constructed from the W and X groups of mandarin fish and sequenced using the Illumina HiSeq 2000 system. High-quality reads were assembled. After removing partially overlapping sequences, a total of 77,312 distinct sequences were obtained (All-Unigene, mean size: 1138 bp, N50: 2334 bp). Of these unigenes, 49.06% (37,927) were shorter than 500 bp and 50.94% (39,385) were longer than 500 bp, including 34.38% (26,578) that were longer than 1000 bp. We found 54 genes to be differentially expressed between the two groups, with 29 up-regulated and 25 down-regulated in mandarin fish of Group X. Metabolic pathways contained the most differentially expressed genes (Fig. 1a and b); lipid metabolism, signal transduction and global overview maps contained 10, 6 and 13 differentially expressed genes, respectively (Fig. 1a). The rich factors of steroid biosynthesis and glycerolipid metabolism were the largest of all (Fig. 1b). Details of the differentially expressed genes between the two groups are presented in Table 1. The sequencing data in this study have been deposited in the Sequence Read Archive (SRA) database (accession number: PRJNA613186).
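The assembly statistics quoted above (mean size, N50, and the share of unigenes above a length cutoff) can be computed from a list of sequence lengths. The following is a minimal sketch, assuming the common definition of N50 as the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembled bases. The function names and the toy lengths are hypothetical and are not the study's data.

```python
def n50(lengths):
    """Return the N50 (in bp) of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

def length_summary(lengths, cutoffs=(500, 1000)):
    """Summarize an assembly: count, mean length, N50, and fraction above each cutoff."""
    summary = {
        "count": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }
    for cutoff in cutoffs:
        longer = sum(1 for x in lengths if x > cutoff)
        summary[f"fraction_gt_{cutoff}"] = longer / len(lengths)
    return summary

# Toy unigene lengths in bp, for illustration only.
toy_lengths = [300, 400, 600, 1200, 2500]
print(length_summary(toy_lengths))
```

Whether a sequence of exactly 500 bp counts as "longer than 500 bp" is a choice the source does not specify; the sketch uses a strict comparison.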

a Pathway classification map of the differentially expressed genes. b Rich factor of the differentially expressed genes in different pathways based on transcriptome sequencing

Analysis of differential metabolites of two groups

We analyzed the metabolic profiles of the two groups by LC-MS in positive (ESI+) and negative (ESI−) scan modes, and selected 9249 ions for subsequent analyses (4155 ions in ESI+ mode and 5094 ions in ESI− mode).

The normalized data were analyzed by multivariate analysis using PCA and PLS-DA. In the PCA, the positive and negative ions from the two groups formed two clusters that were clearly separated by the first two components (Fig. 2a). PLS-DA also clearly separated the two groups (Fig. 2b), suggesting significant biochemical changes. Hierarchical clustering analysis (HCA) of the differential metabolites showed that Groups X and W differed significantly (Fig. 2c). Information on these metabolomic biomarkers is listed in Table 2.

a PCA scores scatter plot in positive ion (left) and in negative ion (right) scan modes for the two groups. b PLS-DA scores scatter plot in positive ion (left) and in negative ion (right) scan modes for the two groups. c The heat map of differential metabolites from the related pathways between the two groups in both positive and negative modes. Each line represents a differential metabolite and each cross represents a plasma sample group. Different colors represent different abundance intensities, with abundance increasing gradually from dark to red

To identify the metabolites, we used the freely accessible Kyoto Encyclopedia of Genes and Genomes (KEGG) database to elucidate their putative functions. In positive mode, 44 and 20 ions were identified at the MS1 and MS2 levels, respectively; in negative mode, 24 and 11 ions were identified at the MS1 and MS2 levels, respectively. Details of the differential ions between the two groups are presented in Table 3.

The common pathways of differential metabolites and genes

In the retinol metabolism pathway, the metabolites retinol, 9-cis-retinol and 11-cis-retinol were higher in mandarin fish of Group X than in those of Group W, and RDH (retinol dehydrogenase) gene expression was likewise higher in Group X (Fig. 3a). In the glycerolipid metabolism pathway, triacylglycerol lipase gene expression was higher in mandarin fish of Group X, and glycerophosphoric metabolites were also higher in Group X (Fig. 3b). In the biosynthesis of unsaturated fatty acids pathway, stearoyl-CoA gene expression and DPA (docosapentaenoic acid) metabolites were higher in fish of Group X than in those of Group W (Fig. 3c).

Pathways of the differentially expressed genes and metabolites based on transcriptome and metabolome. a Retinol metabolism b Glycerolipid metabolism c Biosynthesis of unsaturated fatty acids

TFIIF gene expression and DNA methylation

As shown in Fig. 4a, general transcription factor IIF (TFIIF) gene expression was higher in the mandarin fish of Group X than in those of Group W. We then used methylation analysis software to search for CpG islands in the 5000 bp upstream of the transcription initiation site (designated as 0) of TFIIF. As shown in Fig. 4b, one CpG island containing 9 CpG sites was found at −3619 to −3574 bp of the TFIIF gene. The total DNA methylation level was significantly higher in the fish of Group X than in those of Group W (Table 4).

TFIIF gene expression and DNA methylation. a TFIIF gene expression. b Illustration of the CpG island region, which includes 9 CpG sites, and the DNA methylation patterns of the two groups (X and W) analyzed by BSP. Each line represents one individual bacterial clone, and each circle represents one single CpG dinucleotide. Open circles show unmethylated CpGs and black circles show methylated CpGs

Ezh1 gene expression and histone methylation

The mRNA expression of the histone methyltransferase gene ezh1 was lower in the mandarin fish of Group X (Fig. 5a). As the histone methyltransferase Ezh1 can methylate ‘Lys-27’ of histone H3, we analyzed the H3K27me3 levels of the two groups. The H3K27me3 level was also lower in the mandarin fish of Group X than in those of Group W (Fig. 5b).

a Validation of ezh1 mRNA expression. b The H3K27me3 protein level in Groups X and W. Data are mean ± SEM (n = 6); a significant difference is marked with an asterisk (P < 0.05)

A step-by-step guide to submitting RNA-Seq data to NCBI

The analysis of transcriptome data from non-model organisms contributes to our understanding of diverse aspects of evolutionary biology, including developmental processes, speciation, adaptation, and extinction. Underlying this diversity is one shared feature, the generation of enormous amounts of sequence data. Data availability requirements in most journals oblige researchers to make their raw transcriptome data publicly available, and the databases housed at the National Center for Biotechnology Information (NCBI) are a popular choice for data deposition. Unfortunately, the successful submission of raw sequences to the Sequence Read Archive (SRA) and transcriptome assemblies to the Transcriptome Shotgun Assembly (TSA) can be challenging for novice users, significantly delaying data availability and publication. Researchers from the University of Veterinary Medicine Hannover present two comprehensive protocols for submitting RNA-Seq data to NCBI databases, accompanied by an easy-to-use website that facilitates the timely submission of data by researchers of any experience level.

RNA-seq: the principle

RNA-seq, also called whole-transcriptome shotgun sequencing, refers to the use of high-throughput sequencing technologies (see below) for characterizing the RNA content and composition of a given sample. Due to technological limitations at present, sequence information from transcripts cannot be retrieved as a whole, but is randomly decomposed into short reads of up to several hundred base pairs (Fig. 2). In the absence of genome or transcriptome information, transcripts first need to be reconstructed from these reads (or read pairs), which is referred to as de novo assembly. In the case where transcript or genome information is readily available, reads can be directly aligned onto the reference. Further, counting the reads that fall onto a given transcript provides a digital measurement of transcript abundance, which serves as the starting point for biological inference (Fig. 1).
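The counting step described above can be sketched as follows. This is a deliberately simplified illustration that assumes each read has already been assigned to exactly one transcript; the function names and read identifiers are hypothetical, and real pipelines also have to deal with multi-mapping reads, read pairs and strand information. Counts per million (CPM) is shown as one common way to make counts comparable across libraries of different size.

```python
from collections import Counter

def count_reads(alignments):
    """Count aligned reads per transcript.

    alignments: iterable of (read_id, transcript_id) pairs, one per aligned read.
    """
    return Counter(transcript_id for _read_id, transcript_id in alignments)

def counts_per_million(counts):
    """Scale raw counts so that each library sums to one million."""
    total = sum(counts.values())
    return {transcript: n * 1_000_000 / total for transcript, n in counts.items()}

# Toy alignments, for illustration only.
toy_alignments = [("r1", "tA"), ("r2", "tA"), ("r3", "tB"), ("r4", "tA")]
raw_counts = count_reads(toy_alignments)
print(raw_counts)
print(counts_per_million(raw_counts))
```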

Table of contents (16 chapters)

Comparison of Gene Expression Profiles in Nonmodel Eukaryotic Organisms with RNA-Seq

Microarray Data Analysis for Transcriptome Profiling

Pathway and Network Analysis of Differentially Expressed Genes in Transcriptomes

QuickRNASeq: Guide for Pipeline Implementation and for Interactive Results Visualization

Tracking Alternatively Spliced Isoforms from Long Reads by SpliceHunter

RNA-Seq-Based Transcript Structure Analysis with TrBorderExt

Analysis of RNA Editing Sites from RNA-Seq Data Using GIREMI

Bioinformatic Analysis of MicroRNA Sequencing Data

Microarray-Based MicroRNA Expression Data Analysis with Bioconductor

Identification and Expression Analysis of Long Intergenic Noncoding RNAs

Analysis of RNA-Seq Data Using TEtranscripts

Computational Analysis of RNA–Protein Interactions via Deep Sequencing

Predicting Gene Expression Noise from Gene Expression Variations

A Protocol for Epigenetic Imprinting Analysis with RNA-Seq Data

Single-Cell Transcriptome Analysis Using SINCERA Pipeline

Mathematical Modeling and Deconvolution of Molecular Heterogeneity Identifies Novel Subpopulations in Complex Tissues

Big Data to the Bench: Transcriptome Analysis for Undergraduates

Next-generation sequencing (NGS)-based methods are revolutionizing biology. Their prevalence requires biologists to be increasingly knowledgeable about computational methods to manage the enormous scale of data. As such, early introduction to NGS analysis and conceptual connection to wet-lab experiments is crucial for training young scientists. However, significant challenges impede the introduction of these methods into the undergraduate classroom, including the need for specialized computer programs and knowledge of computer coding. Here, we describe a semester-long, course-based undergraduate research experience at a liberal arts college combining RNA-sequencing (RNA-seq) analysis with student-driven, wet-lab experiments to investigate plant responses to light. Students derived hypotheses based on analysis of RNA-seq data and designed follow-up studies of gene expression and plant growth. Our assessments indicate that students acquired knowledge of big data analysis and computer coding; however, earlier exposure to computational methods may be beneficial. Our course requires minimal prior knowledge of plant biology, is easy to replicate, and can be modified to a shorter, directed-inquiry module. This framework promotes exploration of the links between gene expression and phenotype using examples that are clear and tractable and improves computational skills and bioinformatics self-efficacy to prepare students for the "big data" era of modern biology.


Summary of the schedule of class activities.

Student analysis of gene expression and phenotype of shade-treated Arabidopsis seedlings. (A) Flowchart…

Computational Methods for Next Generation Sequencing Data Analysis

This book provides an in-depth survey of some of the recent developments in NGS and discusses mathematical and computational challenges in various application areas of NGS technologies. The 18 chapters featured in this book have been authored by bioinformatics experts and represent the latest work in leading labs actively contributing to the fast-growing field of NGS. The book is divided into four parts:

Part I focuses on computing and experimental infrastructure for NGS analysis, including chapters on cloud computing, modular pipelines for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.

Part II concentrates on analysis of DNA sequencing data, covering the classic scaffolding problem, detection of genomic variants, including insertions and deletions, and analysis of DNA methylation sequencing data.

Part III is devoted to analysis of RNA-seq data. This part discusses algorithms and compares software tools for transcriptome assembly along with methods for detection of alternative splicing and tools for transcriptome quantification and differential expression analysis.

Part IV explores computational tools for NGS applications in microbiomics, including a discussion on error correction of NGS reads from viral populations, methods for viral quasispecies reconstruction, and a survey of state-of-the-art methods and future trends in microbiome analysis.

Computational Methods for Next Generation Sequencing Data Analysis:

  • Reviews computational techniques such as new combinatorial optimization methods, data structures, high performance computing, machine learning, and inference algorithms
  • Discusses the mathematical and computational challenges in NGS technologies
  • Covers NGS error correction, de novo genome and transcriptome assembly, variant detection from NGS reads, and more

This text is a reference for biomedical professionals interested in expanding their knowledge of computational techniques for NGS data analysis. The book is also useful for graduate and post-graduate students in bioinformatics.

Author Bios

Ion Mandoiu, PhD, is an associate professor in the Computer Science and Engineering Department at the University of Connecticut, USA. His main research interests are in the design and analysis of approximation algorithms for NP-hard optimization problems, particularly in the area of bioinformatics. Dr. Mandoiu has authored over 100 refereed articles in journals and conference proceedings. He has also co-edited (with A. Zelikovsky) a book on Bioinformatics Algorithms: Techniques and Applications (Wiley 2008).

Alexander Zelikovsky, PhD, is a Distinguished University Professor with the Computer Science Department at the Georgia State University, USA. His research focuses on discrete algorithms and their applications in computational biotechnology and biology, bioinformatics, VLSI CAD, and wireless networks. Dr. Zelikovsky has authored more than 170 refereed publications. He served as the co-Chair of International Symposium on Bioinformatics Research and Applications (2005-2016) and the Workshop on Computational Advances in Next-Generation Sequencing (2011-2015).

Reviewers' comments

Reviewers report 1

Rohan Williams, John Curtin School of Medical Research, Australian National University, Australia. Nominated by Gavin Huttley

RNA-Seq and related high-throughput sequencing are receiving intense attention due to their potential to survey the transcriptome in an unbiased, global fashion. While it is likely that these sequencing based approaches will permit a major advance on microarray based technologies, it is also highly likely that unanticipated systematic errors will be present in these data and will need to be corrected in order to permit appropriate application. While expression microarrays and tiling arrays are known to be subject to a number of such effects, to date there has been little investigation of issues in the emerging RNA-Seq literature. Oshlack and Wakefield now present a re-analysis of data from several recent RNA-Seq studies to show that identification of differential expression is positively biased towards longer transcripts (and has the potential to impact downstream interpretation at a functional level). Although it is recognised that tag count will be proportional to the product of expression level and transcript length, adjusting for transcript length does not remove this effect: the authors show the effect arises from increased variance for shorter transcripts. They further argue that this effect is unlikely to be removed by exon-level analysis. Interestingly, this effect is not observable in microarray expression platforms. This paper represents an important contribution to the ongoing development of analysis methodology for RNA-Seq and I recommend it for publication in Biology Direct.

Reviewers report 2

Nicole Cloonan, Institute for Molecular Bioscience, The University of Queensland, Australia. Nominated by Mark Ragan

In this paper, the authors describe "transcript length bias" in RNAseq data, which is the reduced statistical power to detect differential gene expression of short mRNAs when compared to long mRNAs using a "shotgun sequencing" approach. As randomly fragmented mRNA molecules will generate fewer short-read tags for a short transcript than for a longer transcript, changes in expression between two (relatively) poorly sampled transcripts are less discernible from sampling noise. The authors examine three published shotgun sequencing-based studies to show this bias exists in the sequencing data, but not in the corresponding microarray data from the same samples. This bias against short transcripts could lead to a general under-representation in gene set testing for functional categories enriched in short genes (such as cell-cell communication, innate immunity, and signal transduction). This is an important finding that the RNA sequencing community needs to be aware of.

The manuscript is generally well written, and the authors have done well to create a manuscript understandable to a biological audience without specialized mathematical or statistical training. As all of my (generally minor) concerns with this manuscript have been adequately addressed, I recommend this manuscript for publication.
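The length effect described in the reports above can be illustrated with a small calculation. This is only a sketch under a simple Poisson assumption, in which the expected read count of a transcript is proportional to its expression level, its length and the sequencing depth, and the variance equals the mean. The function name, parameters and numbers are hypothetical and are not taken from the manuscript under review.

```python
import math

def expected_signal_to_noise(expression, fold_change, length, depth=1.0):
    """Expected difference in counts divided by its standard deviation.

    Assumes Poisson counts with mean = depth * expression * length in
    condition A and the same mean times fold_change in condition B.
    """
    mu_a = depth * expression * length
    mu_b = depth * expression * fold_change * length
    # For independent Poisson counts, Var(B - A) = mu_a + mu_b.
    return (mu_b - mu_a) / math.sqrt(mu_a + mu_b)

# Same expression level and fold change, two transcript lengths (bp).
short = expected_signal_to_noise(expression=0.01, fold_change=2.0, length=1000)
long_ = expected_signal_to_noise(expression=0.01, fold_change=2.0, length=4000)
print(short, long_, long_ / short)
```

In this toy model, a transcript four times as long yields twice the signal-to-noise ratio for the same fold change, because the ratio grows with the square root of the length.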

Reviewers Report 3

James Bullard, Division of Biostatistics, School of Public Health, University of California, Berkeley, USA. Nominated by Sandrine Dudoit

In Oshlack and Wakefield the authors demonstrate a relationship between gene-length and observed significance of a statistical test in three published studies (Marioni et al., Cloonan et al., Sultan et al.). The authors demonstrate that this observed tendency is not present in the analysis of the same samples in the Marioni study when microarrays are used. This "bias" is due to the dependence of the variance on the intensity of the read-process which is proportional to the length of the transcript sequenced.

The reviewer recommends the article for publication as the issues presented are both relevant and important. In particular, the issues presented are quite pertinent with the advent of numerous high-throughput sequencing studies. The reviewer believes that in its current form the article would benefit from some revisions to either more rigorously present the mathematics or simply present the statistics described in the offending studies.

Background: paragraph 2, "We hypothesize . " Why are you hypothesizing? I think that this sentence needs reference to a particular test-statistic, then you really don't need to hypothesize anything.

Author's response: We believe the statement in the article relates to all statistical analysis methods under the assumptions we have stated; however, we have not tested, and really cannot test, all possible methods. Therefore we have used the word hypothesize, but we have also given an example in the methods section.

Background: paragraph 3, "All methods for detection of . " Doesn't this sentence appear a bit strong?

Author's response: We amended this to "Most statistical methods. "

Results: paragraph 2, Can you comment why the "length bias" is stronger for more lowly expressed genes? Also, I think it is better to present all of the data on the plots, rather than excluding the middle bin.

Author's response: We have added the sentence: "We believe the slope is lower in highly expressed genes due to the observation that nearly all of these genes have enough power to be called differentially expressed in this data set even though the p-values are higher for shorter genes."

Results: paragraph 3, In the mean-variance plots how do you compute the variance? Is this just the sample variance? What about the different numbers of counts across lanes? As for panel (2), After we divide by length we don't have a Poisson so the mean-variance plot is not correct or at least the proper interpretation of it is non-obvious (isn't it obvious that we will cause a shift on the plot because we are now scaling by length squared?)

Author's response: Yes this is exactly the point we are trying to make. This plot is meant to be more heuristic in nature rather than any rigorous proof that dividing by length doesn't remove the length bias. Therefore we have just used the sample variance without taking into account the different number of counts across lanes as a visual demonstration. To clarify we have also added the sentence: "However, when the mean is divided by the length of the transcript the relationship becomes more complex and the data is obviously no longer Poisson"

Results: paragraph 4, A potentially "better" plot would be boxplots (of gene-length) ordered from largest to smallest KEGG p-value both for microarray and sequencing data.

Author's response: Thank you for the suggestion. We felt that the plot you suggested was a little bit more tricky to interpret.

Methods: paragraph 1, The math is a little sloppy. In general, there is confusion between random variables and parameters. Specifically, I note two obvious errors: 1.) t is defined to be one thing (random variables on the rhs of equation (1)) and then redefined to be another thing (parameters on rhs of following definition). 2.) Methods: paragraph 2, μ' is a parameter then you do the Var(μ') which is incorrect, you probably want to define an X' instead, then you can take variances.

Author's response: Thanks for pointing this out. We have modified and tidied up the math.

From your treatment it appears that I can just divide t by √L to remove the dependence on L in the test statistic. Is this correct?

Author's response: No, I don't think this is possible. A t-test is like a signal to noise ratio and therefore has a specific relationship between the estimate of the mean and the standard error of the estimate. I don't believe this should be broken by essentially dividing the estimate of the mean by √L.
