RNA-seq Data Analysis: A Practical Approach, 1st edition, by Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, Garry Wong. ISBN: 1482262355, 9781482262353
Product details:
ISBN-10 : 1482262355
ISBN-13 : 9781482262353
Author: Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, Garry Wong
The State of the Art in Transcriptome Analysis. RNA sequencing (RNA-seq) data offers unprecedented information about the transcriptome, but harnessing this information with bioinformatics tools is typically a bottleneck. RNA-seq Data Analysis: A Practical Approach enables researchers to examine differential expression at gene, exon, and transcript le
Table of contents:
CHAPTER 1 Introduction to RNA-seq
1.1 INTRODUCTION
FIGURE 1.1 General scheme for RNA-seq experiments. The workflow from tissue to data in the RNA-seq method is shown with alternatives for CLIP-seq, miRNA-seq, and general RNA-seq.
1.2 ISOLATION OF RNAs
1.3 QUALITY CONTROL OF RNA
FIGURE 1.2 Agilent Bioanalyzer run showing RNA quality. Both the ladder and sample run are shown.
1.4 LIBRARY PREPARATION
TABLE 1.1 Major RNA-seq Platforms and Their General Properties
FIGURE 1.3 Schematic of RNA-seq library preparation.
1.5 MAJOR RNA-SEQ PLATFORMS
1.5.1 Illumina
1.5.2 SOLiD
1.5.3 Roche 454
1.5.4 Ion Torrent
1.5.5 Pacific Biosciences
1.5.6 Nanopore Technologies
1.6 RNA-SEQ APPLICATIONS
1.6.1 Protein Coding Gene Structure
FIGURE 1.4 Schematic of gene structure model for the human TP53 gene from ENSEMBL genome browser showing RNA-seq reads from blood and adipose tissue as support for the models.
1.6.2 Novel Protein-Coding Genes
1.6.3 Quantifying and Comparing Gene Expression
1.6.4 Expression Quantitative Trait Loci (eQTL)
1.6.5 Single-Cell RNA-seq
1.6.6 Fusion Genes
1.6.7 Gene Variations
1.6.8 Long Noncoding RNAs
1.6.9 Small Noncoding RNAs (miRNA-seq)
1.6.10 Amplification Product Sequencing (Ampli-seq)
1.7 CHOOSING AN RNA-SEQ PLATFORM
1.7.1 Eight General Principles for Choosing an RNA-seq Platform and Mode of Sequencing
1.7.1.1 Accuracy: How Accurate Must the Sequencing Be?
1.7.1.2 Reads: How Many Do I Need?
1.7.1.3 Length: How Long Must the Reads Be?
1.7.1.4 SR or PE: Single Read or Paired End?
1.7.1.5 RNA or DNA: Am I Sequencing RNA or DNA?
1.7.1.6 Material: How Much Sample Material Do I Have?
1.7.1.7 Costs: How Much Can I Spend?
1.7.1.8 Time: When Does the Work Need to Be Completed?
1.7.2 Summary
REFERENCES
CHAPTER 2 Introduction to RNA-seq Data Analysis
2.1 INTRODUCTION
FIGURE 2.1 Possible paths in RNA-seq data analysis. In the beginning, the quality of reads is checked and if necessary, reads are preprocessed to remove low-quality data and artifacts. The reads’ origin is then identified by aligning them to a reference genome if available. Novel genes and transcripts are detected using genome-guided transcriptome assembly, and gene and transcript expression are quantified. Alternatively, gene and transcript discovery can be skipped and expression quantified only for known genes and transcripts. If the reference genome is not available, reads can be aligned and quantified using a reference transcriptome instead. If a transcriptome is not available, it can be produced from reads using de novo transcriptome assembly. When abundance estimates are obtained using one of these paths, expression differences between sample groups can be analyzed using statistical testing. The details of each step can be found in the chapters indicated in parentheses.
FIGURE 2.2 The open source Chipster software provides a comprehensive collection of analysis tools for RNA-seq data via an intuitive graphical user interface. The workflow panel (bottom left) shows the relationships of result files. This screenshot shows a differential expression analysis of GM12892 and hESC cells, which are used as an example throughout the book. Gene-level expression level estimates were obtained using genome-aligned reads and the HTSeq tool, and differential expression was analyzed using the edgeR Bioconductor package. Differentially expressed genes were further filtered by fold change and visualized in the built-in genome browser.
2.2 DIFFERENTIAL EXPRESSION ANALYSIS WORKFLOW
FIGURE 2.3 Differential expression analysis workflow consists of several, interrelated steps. The typical output file formats are indicated in parentheses.
2.2.1 Step 1: Quality Control of Reads
2.2.2 Step 2: Preprocessing of Reads
2.2.3 Step 3: Aligning Reads to a Reference Genome
2.2.4 Step 4: Genome-Guided Transcriptome Assembly
2.2.5 Step 5: Calculating Expression Levels
2.2.6 Step 6: Comparing Gene Expression between Conditions
2.2.7 Step 7: Visualization of Data in Genomic Context
FIGURE 2.4 Integrative Genomics Viewer (IGV) window showing RNA-seq reads from the C. elegans abu-1 gene. The top panel shows reads from control and the bottom panel shows reads from ethanol-treated animals.
2.3 DOWNSTREAM ANALYSIS
2.3.1 Gene Annotation
2.3.2 Gene Set Enrichment Analysis
2.4 AUTOMATED WORKFLOWS AND PIPELINES
2.5 HARDWARE REQUIREMENTS
2.6 FOLLOWING THE EXAMPLES IN THE BOOK
2.6.1 Using Command Line Tools and R
2.6.2 Using the Chipster Software
2.6.3 Example Data Sets
2.7 SUMMARY
REFERENCES
CHAPTER 3 Quality Control and Preprocessing
3.1 INTRODUCTION
3.2 SOFTWARE FOR QUALITY CONTROL AND PREPROCESSING
3.2.1 FastQC
FIGURE 3.1 The beginning of the FastQC quality report offers basic statistics (right) and a judgment on the different quality aspects measured (left).
3.2.2 PRINSEQ
3.2.3 Trimmomatic
3.3 READ QUALITY ISSUES
3.3.1 Base Quality
3.3.1.1 Filtering
FIGURE 3.2 FastQC per base sequence quality plot for the forward reads (top) and reverse reads (bottom) of the example data. The plot summarizes the base quality at each base position across all reads. The y-axis shows the quality scores, and the yellow boxes represent the interquartile range (25–75%) of base quality values for each base position. The red line is the median value and the blue line is the mean. The green, orange, and red background coloring indicate good, reasonable, and poor quality, respectively. FastQC issues a warning if the lower quartile is below 10, or if the median is less than 25 in any of the base positions.
FIGURE 3.3 Per sequence quality score plot by FastQC shows the distribution of reads’ mean qualities. Both the forward reads of our example data (top) and the reverse reads (bottom) include a subset of roughly two million reads (~6%), which have a mean quality less than 2. The mean quality of the reverse reads is lower in general.
3.3.1.2 Trimming
3.3.2 Ambiguous Bases
FIGURE 3.4 PRINSEQ report on the occurrence of ambiguous bases (Ns). Two percent of reads contain Ns, and 92,452 reads contain only Ns.
3.3.3 Adapters
FIGURE 3.5 Palindrome approach of the Trimmomatic software can detect even very short partial adapters in paired-end reads in a “read-through” situation. Two paired-end reads are aligned (forward read on the top and reverse below). Adaptors are white and the insert to be sequenced is black. When the insert is short, sequencing “reads through” it to the 3′ end, resulting in a partial (or full) adapter in that end. Trimmomatic can recognize and remove them (as indicated by the crosses).
3.3.4 Read Length
3.3.5 Sequence-Specific Bias and Mismatches Caused by Random Hexamer Priming
FIGURE 3.6 Per base sequence content plot produced by FastQC. The y-axis shows the percentage of each nucleotide. The first 13 base positions show sequence-specific bias typical for Illumina RNA-seq reads.
3.3.6 GC Content
3.3.7 Duplicates
FIGURE 3.7 Extract of PRINSEQ’s duplicate report showing the number of duplicates for the 100 most duplicated reads. The most duplicated read has 92,452 copies.
3.3.8 Sequence Contamination
3.3.9 Low-Complexity Sequences and PolyA Tails
QUALITY CONTROL AND PREPROCESSING IN CHIPSTER
3.4 SUMMARY
REFERENCES
CHAPTER 4 Aligning Reads to Reference
4.1 INTRODUCTION
4.2 ALIGNMENT PROGRAMS
4.2.1 Bowtie
Building or Downloading a Reference Index
Align Reads to Genome
4.2.2 TopHat
FIGURE 4.1 Spliced alignment procedure of TopHat2. (a) Reads which did not map to the transcriptome or the genome are split into short segments and mapped to the genome again. If TopHat2 finds reads where the left and the right segment map within a user-defined maximum intron size, it maps the whole read to that genomic region in order to find potential splice sites containing known splice signals. (b) Genomic sequences flanking the potential splice sites are concatenated and indexed, and unmapped read segments (marked by a star here) are aligned to this junction flanking index with Bowtie2. (c) Segment alignments are stitched together to form whole read alignments.
Preparing the Reference Indexes
Aligning the Reads
4.2.3 STAR
Building or Downloading a Reference Index
Mapping
Output
ALIGNING READS TO REFERENCE IN CHIPSTER
4.3 ALIGNMENT STATISTICS AND UTILITIES FOR MANIPULATING ALIGNMENT FILES
SAM/BAM MANIPULATION AND ALIGNMENT STATISTICS IN CHIPSTER
4.4 VISUALIZING READS IN GENOMIC CONTEXT
VISUALIZING READS IN GENOMIC CONTEXT WITH CHIPSTER
4.5 SUMMARY
REFERENCES
CHAPTER 5 Transcriptome Assembly
5.1 INTRODUCTION
5.2 METHODS
5.2.1 Transcriptome Assembly Is Different from Genome Assembly
5.2.2 Complexity of Transcript Reconstruction
5.2.3 Assembly Process
5.2.4 de Bruijn Graph
5.2.5 Use of Abundance Information
5.3 DATA PREPROCESSING
5.3.1 Read Error Correction
5.3.2 Seecer
5.4 MAPPING-BASED ASSEMBLY
5.4.1 Cufflinks
5.4.2 Scripture
5.5 De novo Assembly
5.5.1 Velvet + Oases
5.5.2 Trinity
FIGURE 5.1 Example transcript graph resulting from Trinity assembly.
ASSEMBLING TRANSCRIPTS IN CHIPSTER
5.6 SUMMARY
FIGURE 5.2 Assembled transcript “YourSeq” mapped on human genome and shown in UCSC genome browser.
REFERENCES
CHAPTER 6 Quantitation and Annotation-Based Quality Control
6.1 INTRODUCTION
6.2 ANNOTATION-BASED QUALITY METRICS
6.2.1 Tools for Annotation-Based Quality Control
FIGURE 6.1 RSeQC plot for coverage uniformity along transcripts. The length of all transcripts is scaled to 100 nucleotides.
FIGURE 6.2 Sequencing saturation plot by RSeQC. Subsets of reads are resampled and RPKMs calculated for each subset and compared to the RPKMs from total reads. This is done separately for four different expression level categories.
FIGURE 6.3 The RSeQC software annotates the detected splice junctions as novel, partially novel, and known (a), and analyzes their saturation status by resampling (b).
ANNOTATION-BASED QUALITY CONTROL IN CHIPSTER
6.3 QUANTITATION OF GENE EXPRESSION
6.3.1 Counting Reads per Genes
6.3.1.1 HTSeq
FIGURE 6.4 HTSeq offers three modes to count reads per genomic features. Black bar indicates a read, white box indicates a gene that the read maps to, and the grey box indicates another gene which partially overlaps with the white one. Tick mark means that the read is counted for the white gene, and the question mark means that it is not counted because of ambiguity. The intersection_strict mode does not count the read if it overlaps with intronic or intergenic regions (“no_feature,” indicated as dash here). The default setting is the union mode.
6.3.2 Counting Reads per Transcripts
COUNTING READS PER GENES IN CHIPSTER
6.3.2.1 Cufflinks
6.3.2.2 eXpress
COUNTING READS PER TRANSCRIPTS IN CHIPSTER
6.3.3 Counting Reads per Exons
COUNTING READS PER EXONS IN CHIPSTER
6.4 SUMMARY
REFERENCES
CHAPTER 7 RNA-seq Analysis Framework in R and Bioconductor
7.1 INTRODUCTION
7.1.1 Installing R and Add-on Packages
7.1.2 Using R
7.2 OVERVIEW OF THE BIOCONDUCTOR PACKAGES
7.2.1 Software Packages
7.2.2 Annotation Packages
7.2.3 Experiment Packages
7.3 DESCRIPTIVE FEATURES OF THE BIOCONDUCTOR PACKAGES
7.3.1 OOP Features in R
7.4 REPRESENTING GENES AND TRANSCRIPTS IN R
7.5 REPRESENTING GENOMES IN R
7.6 REPRESENTING SNPs IN R
7.7 FORGING NEW ANNOTATION PACKAGES
7.8 SUMMARY
REFERENCES
CHAPTER 8 Differential Expression Analysis
8.1 INTRODUCTION
8.2 TECHNICAL VERSUS BIOLOGICAL REPLICATES
8.3 STATISTICAL DISTRIBUTIONS IN RNA-SEQ DATA
8.3.1 Biological Replication, Count Distributions, and Choice of Software
TABLE 8.1 List of (Some) Software Tools for Differential Expression Analysis
8.4 NORMALIZATION
FACT BOX: HOW TO SELECT A SOFTWARE PACKAGE FOR DIFFERENTIAL EXPRESSION ANALYSIS
8.5 SOFTWARE USAGE EXAMPLES
8.5.1 Using Cuffdiff
FIGURE 8.1 Typical workflows used in RNA-seq differential expression analysis.
FIGURE 8.2 Isoforms need to be considered in order to obtain unbiased gene-level expression estimates. An imaginary gene with two different isoforms is depicted above (upper left corner). For simplicity, all exons are assumed to have the same length L. Two common methods for calculating gene-level counts are the exon-intersection model, where only reads mapping to exons that are part of all of the gene isoforms are considered, and the exon-union model, where all reads mapping to any exon are considered (upper right corner). In the hypothetical case shown in the lower panel, where the reads originating from each isoform in two different conditions A and B are indicated, the actual fold change for the whole gene considering the isoforms would be estimated as (38/30), while both the exon-intersection model and the union model would estimate it as 1 (i.e., no change at all).
FACT BOX: PROS AND CONS OF CUFFDIFF
8.5.2 Using Bioconductor Packages: DESeq, edgeR, limma
8.5.3 Linear Models, the Design Matrix, and the Contrast Matrix
8.5.3.1 Design Matrix
8.5.3.2 Contrast Matrix
8.5.4 Preparations Ahead of Differential Expression Analysis
8.5.4.1 Starting from BAM Files
8.5.4.2 Starting from Individual Count Files
8.5.4.3 Starting from an Existing Count Table
8.5.4.4 Independent Filtering
8.5.5 Code Example for DESeq(2)
8.5.6 Visualization
FIGURE 8.3 Correlation heatmap of cell line samples. There is an obvious grouping of samples into two distinct groups.
FIGURE 8.4 (a) Principal component plot of cell line samples. A grouping of the samples into two groups is evident along principal component 1 (the X axis). (b) A biplot shows both the relative locations of the samples in the PC1–PC2 space and the contributions of various genes to the principal components.
FIGURE 8.5 A volcano plot showing the negative log p value against the log fold change for each gene. Genes with adjusted p value below 0.01 are colored red.
8.5.7 For Reference: Code Examples for Other Bioconductor Packages
FIGURE 8.6 (a) Bar plot showing the expression level (in normalized count units) of a specific gene in the GM and hES samples. (b) Box plot showing the expression distribution of the same gene within each group (GM or hES). The bold line indicates the median.
8.5.8 Limma
FIGURE 8.7 Consistency of differential expression calls by DESeq2, limma, SAMseq, and edgeR shown as a four-way Venn diagram. 159 of the genes are called as differentially expressed by all four packages.
8.5.9 SAMSeq (samr package)
8.5.10 edgeR
8.5.11 DESeq2 Code Example for a Multifactorial Experiment
8.5.12 For Reference: edgeR Code Example
8.5.13 Limma Code Example
ANALYZING DIFFERENTIAL EXPRESSION IN CHIPSTER
8.6 SUMMARY
REFERENCES
CHAPTER 9 Analysis of Differential Exon Usage
9.1 INTRODUCTION
9.2 Preparing the Input Files for DEXSeq
9.3 READING DATA IN TO R
9.4 ACCESSING THE ExonCountSet OBJECT
9.5 NORMALIZATION AND ESTIMATION OF THE VARIANCE
FIGURE 9.1 The mean dispersion plot. Each point in the figure represents one exon. Check that the plotted line (mean dispersion function) follows the shape of the data cloud.
9.6 TEST FOR DIFFERENTIAL EXON USAGE
9.7 VISUALIZATION
FIGURE 9.2 The MA plot of the results from a differential exon usage analysis.
FIGURE 9.3 A plot of result obtained with DEXSeq package. In the upper panel, the expression of the exons of the gene ENSG00000226742 in the two conditions (esc and gm) is shown with two lines. Every exon is clearly more expressed in the gm cells than in the esc cells. Therefore, the result is probably due to differential expression of the whole gene. True enough, the result table verifies this (res[res$geneID==“ENSG00000226742”,]), since none of the exons are statistically significantly differentially expressed. The lower panel displays the exonic structure of the gene. Exons are displayed as gray bars, and introns as black wedged lines between the exons.
FIGURE 9.4 Results for gene ENSG00000119541. Exon 1 is clearly differentially expressed between the esc and the gm cell lines. Another exon is also statistically significantly differentially expressed. Can you guess which one it is? (It is exon number 3.)
FIGURE 9.5 A summary page of the HTML report generated by the DEXSeq package. Information on the samples and the exact models fitted to the data are shown. The table at the bottom contains information for individual genes, and the first column of each line of the table is a link to a more detailed result page for that particular gene.
ANALYZING DIFFERENTIAL EXON EXPRESSION IN CHIPSTER
9.8 SUMMARY
REFERENCES
CHAPTER 10 Annotating the Results
10.1 INTRODUCTION
10.2 RETRIEVING ADDITIONAL ANNOTATIONS
10.2.1 Using an Organism-Specific Annotation Package to Retrieve Annotations for Genes
10.2.2 Using BioMart to Retrieve Annotations for Genes
10.3 USING ANNOTATIONS FOR ONTOLOGICAL ANALYSIS OF GENE SETS
10.4 GENE SET ANALYSIS IN MORE DETAIL
10.4.1 Competitive Method Using GOstats Package
10.4.2 Self-Contained Method Using Globaltest Package
10.4.3 Length Bias Corrected Method
10.5 SUMMARY
REFERENCES
CHAPTER 11 Visualization
11.1 INTRODUCTION
11.1.1 Image File Types
11.1.2 Image Resolution
11.1.3 Color Models
11.2 GRAPHICS IN R
11.2.1 Heatmap
FIGURE 11.1 A heatmap generated from the filtered parathyroidGenes data set. By default a color blind-friendly color scheme ranging from red to blue is used.
FIGURE 11.2 The heatmap generated from the parathyroidGenes data set using the pseudogenes for the plot.
FIGURE 11.3 A heatmap generated from a count table for one gene’s exons. Compare the plot with Figure 9.4.
11.2.2 Volcano Plot
FIGURE 11.4 A Volcano plot generated from the parathyroidGenes data set. A rather low number of genes are statistically significantly differentially expressed, and this fact is clearly visible.
11.2.3 MA Plot
FIGURE 11.5 An example of an MA plot for the parathyroidGenes data set. The expressed genes are highlighted with the squares.
11.2.4 Idiogram
11.2.5 Visualizing Gene and Transcript Structures
FIGURE 11.6 An idiogram of the human chromosomes with the positions of the differentially expressed genes inferred from the parathyroidGenes data set overlaid.
FIGURE 11.7 A plot of the gene (“reduce”) and transcript (“full”) structures for the gene ENSG00000119541 that was found to have two differentially expressed exons in the ENCODE data set. The gene is located on the long arm of chromosome 18.
11.3 FINALIZING THE PLOTS
11.4 SUMMARY
REFERENCES
CHAPTER 12 Small Noncoding RNAs
12.1 INTRODUCTION
TABLE 12.1 Major Classes of Small Noncoding RNAs
12.2 MICRORNAs (miRNAs)
FIGURE 12.1 miRNA biogenesis and processing pathway.
TABLE 12.2 miRNA Nomenclature
12.3 MICRORNA OFF-SET RNAs (moRNAs)
12.4 PIWI-ASSOCIATED RNAs (piRNAs)
12.5 ENDOGENOUS SILENCING RNAs (endo-siRNAs)
12.6 EXOGENOUS SILENCING RNAs (exo-siRNAs)
12.7 TRANSFER RNAs (tRNAs)
12.8 SMALL NUCLEOLAR RNAs (snoRNAs)
12.9 SMALL NUCLEAR RNAs (snRNAs)
12.10 ENHANCER-DERIVED RNAs (eRNA)
12.11 Other Small Noncoding RNAs
TABLE 12.3 miRNA Pathway Component Genes and Their Orthologs in Drosophila and C. elegans
12.12 Sequencing Methods for Discovery of Small Noncoding RNAs
12.12.1 microRNA-seq
FIGURE 12.2 Scheme for small RNA library preparation. Small RNAs are first enriched using biochemical methods. A 3′ adaptor and 5′ adaptor are ligated to the RNA molecules. Reverse transcription is performed with a primer complementary to the 3′ adaptor to produce a cDNA of the RNA. PCR is then performed on the cDNA using 5′ and 3′ PCR primers based on the ligated adaptor sequences. The 5′ primer has an overhang to the cDNA sequence where library indexes can be used for multiplexing.
12.12.2 CLIP-seq
FIGURE 12.3 Electropherogram of an miRNA library run on Agilent Bioanalyzer. Size markers are at 35 and 10,380bp, while the library shows a peak at 98bp. In the 98bp peak, the small RNAs are 21–23 nt, while the rest of the dsDNA consists of the library adaptors.
12.12.3 Degradome-seq
12.12.4 Global Run-On Sequencing (GRO-seq)
12.13 SUMMARY
REFERENCES
CHAPTER 13 Computational Analysis of Small Noncoding RNA Sequencing Data
13.1 INTRODUCTION
13.2 DISCOVERY OF SMALL RNAs—miRDeep2
13.2.1 GFF files
TABLE 13.1 A GFF File of C. elegans miRNAs
13.2.2 FASTA Files of Known miRNAs
13.2.3 Setting up the Run Environment
TABLE 13.2 FASTA File of C. elegans miRNAs
13.2.4 Running miRDeep2
13.2.4.1 miRDeep2 Output
FIGURE 13.1 Output file from miRDeep2 showing the performance scores.
FIGURE 13.2 Output file from miRDeep2 showing novel miRNAs predicted. The table has been parsed and mature as well as hairpin miRNA sequences in columns on the right have been removed.
13.3 miRANALYZER
FIGURE 13.3 Graphical output of hairpin and mapped reads from miRDeep2.
TABLE 13.3 FASTA Format, Read-Count Format, Multi-FASTA Format
FIGURE 13.4 Output from miRanalyzer.
13.3.1 Running miRanalyzer
13.4 miRNA TARGET ANALYSIS
13.4.1 Computational Prediction Methods
13.4.2 Artificial Intelligence Methods
13.4.3 Experimental Support-Based Methods
13.5 miRNA-SEQ AND mRNA-SEQ DATA INTEGRATION
13.6 SMALL RNA DATABASES AND RESOURCES
13.6.1 RNA-seq Reads of miRNAs in miRBase
FIGURE 13.5 miRBase view of an miRNA.
13.6.2 Expression Atlas of miRNAs
FIGURE 13.6 miRBase entry of hsa-mir124-1 RNA-seq reads curated from experiments.
13.6.3 Database for CLIP-seq and Degradome-seq Data
13.6.4 Databases for miRNAs and Disease
13.6.5 General Databases for the Research Community and Resources
13.6.6 miRNAblog
TABLE 13.4 Resources for miRNA-seq Analysis
13.7 SUMMARY