How do high-throughput/NGS sequencers calculate quality scores?




I am confused as to how quality scores are actually calculated by DNA sequencers like Illumina. For each base call, some quality predictor value is computed, based on various properties of the sequencing machine, like intensity of light during the read.

Do we know exactly how these quality scores are computed? Exactly how many factors go into computing these QUAL values?


I'm restricting this answer to Illumina, and even then I don't know the exact details of the raw-data analysis (the base-calling software is proprietary).

Basically, Illumina records the sequence from photographic images. Each nucleotide carries a distinct fluorescent label. In each cycle, a nucleotide is pumped in and unincorporated nucleotides are washed off (this is repeated for all four nucleotides). A laser excites the fluorophore and the emitted light is recorded as a photograph. The template DNA is present as clusters of identical strands at a given location, which makes the fluorescence easy to identify visually.

Base calling is done using image analysis. Each image is analysed for the intensities of the different colours, and the quality score is calculated from these. The quality score is essentially the log likelihood of the called nucleotide occurring at that position (based on its colour intensity) compared to the other nucleotides.

This is the simplest explanation of how Illumina does base calling. There are different kinds of errors and biases, and there are different statistical approaches to correct for them.
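As a toy illustration of the intensity-based calling described above: call the channel with the strongest signal and derive a Phred-style quality from how dominant that signal is. This is a sketch only; the normalization and the error-probability floor below are illustrative assumptions, and real Illumina base callers additionally model cross-talk, phasing, and other effects.

```python
import math

def call_base(intensities):
    """Toy base caller: call the channel with the highest intensity and
    derive a Phred-style quality from the intensity ratios.

    `intensities` maps each nucleotide to its measured signal for one
    cluster in one cycle. Treating normalized intensities as rough
    likelihoods, the error probability is the signal mass that does NOT
    belong to the winning channel (an illustrative assumption only)."""
    total = sum(intensities.values())
    base, signal = max(intensities.items(), key=lambda kv: kv[1])
    p_error = max(1.0 - signal / total, 1e-6)   # floor caps quality at Q60
    q = -10.0 * math.log10(p_error)             # Phred scale
    return base, round(q)

# One channel dominates -> confident call, high quality
print(call_base({"A": 980.0, "C": 5.0, "G": 10.0, "T": 5.0}))   # ('A', 17)
# Two channels compete -> ambiguous call, low quality
print(call_base({"A": 520.0, "C": 470.0, "G": 5.0, "T": 5.0}))  # ('A', 3)
```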

Have a look at the following references for more details:

  • Ledergerber, Christian, and Christophe Dessimoz. "Base-calling for next-generation sequencing platforms." Briefings in Bioinformatics (2011): bbq077.
  • Illumina MiSeq Imaging and Base Calling course
  • Genome Analysis Wiki

Sequencing Quality Scores

Sequencing quality scores measure the probability that a base is called incorrectly. With sequencing by synthesis (SBS) technology, each base in a read is assigned a quality score by a Phred-like algorithm [1,2], similar to that originally developed for Sanger sequencing experiments.

Sequencing Technology Video

See how Illumina SBS works.

Q Score Definition

The sequencing quality score of a given base, Q, is defined by the following equation:

  Q = -10 log10(e)

where e is the estimated probability of the base call being wrong.

  • Higher Q scores indicate a smaller probability of error.
  • Lower Q scores can result in a significant portion of the reads being unusable. They may also lead to increased false-positive variant calls, resulting in inaccurate conclusions.

As shown below, a quality score of 20 represents an error rate of 1 in 100, with a corresponding call accuracy of 99%.


Relationship Between Sequencing Quality Score and Base Call Accuracy

  Quality Score   Probability of Incorrect Base Call   Inferred Base Call Accuracy
  10 (Q10)        1 in 10                              90%
  20 (Q20)        1 in 100                             99%
  30 (Q30)        1 in 1000                            99.9%
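The Q score definition and its table translate directly into code. A minimal sketch (the ASCII offset of 33 is the standard Phred+33 convention used in Illumina FASTQ output):

```python
import math

def q_to_error_prob(q):
    """Invert Q = -10 * log10(e): probability the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_q(e):
    return -10 * math.log10(e)

def decode_phred33(quality_string):
    """FASTQ quality lines encode each Q score as the ASCII character with
    code Q + 33 (the Phred+33 convention used by Illumina output)."""
    return [ord(c) - 33 for c in quality_string]

# Reproduce the table above
for q in (10, 20, 30):
    e = q_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / e)}, accuracy {100 * (1 - e):.1f}%")

print(decode_phred33("I5"))   # prints [40, 20]
```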

Illumina Sequencing Quality Scores

Illumina sequencing chemistry delivers high accuracy, with a vast majority of bases scoring Q30 and above. This level of accuracy is ideal for a range of sequencing applications, including clinical research.

Learn how PhiX can be used as an in-run control for run quality monitoring in Illumina NGS.

Choosing an NGS Company

Seek out a best-in-class next-generation sequencing company with user-friendly bioinformatics tools and industry-leading support and service.

Additional Information About Quality Scores

For more in-depth information about sequencing quality scores, read the following technical notes:

Beginner's Guide to NGS

Considering bringing NGS to your lab, but unsure where to start? These resources cover key topics in NGS and are designed to help you plan your first experiment.


Related Solutions

Next-Generation Sequencing (NGS)

Discover the broad range of experiments you can perform with next-generation sequencing, and find out how Illumina NGS works.

Benefits of SBS Technology

Illumina SBS technology delivers proven base calling accuracy, with the fewest false positives, false negatives, and miscalls among leading NGS platforms.

Sequencing Platforms

Compare next-generation sequencing (NGS) platforms by application and specification. Find tools and guides to help you choose the right sequencer.

References
  1. Ewing B, Hillier L, Wendl MC, Green P. (1998): Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175-185
  2. Ewing B, Green P. (1998): Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186-194

Innovative technologies

At Illumina, our goal is to apply innovative technologies to the analysis of genetic variation and function, making studies possible that were not even imaginable just a few years ago. It is mission critical for us to deliver innovative, flexible, and scalable solutions to meet the needs of our customers. As a global company that places high value on collaborative interactions, rapid delivery of solutions, and providing the highest level of quality, we strive to meet this challenge. Illumina innovative sequencing and array technologies are fueling groundbreaking advancements in life science research, translational and consumer genomics, and molecular diagnostics.

For Research Use Only. Not for use in diagnostic procedures (except as specifically noted).


High-Throughput Genomics: NGS & Arrays

Researchers are harnessing the power of high-throughput next-generation sequencing (NGS) and microarray technologies to conduct large-scale, global genetic analyses. This research often focuses on multifactorial genetic discovery of disease risk markers, and may involve looking for changes in genetic variants such as single nucleotide polymorphisms (SNPs), indels, splice variants, structural variants, and methylation markers.

High-throughput genomics studies with tens to hundreds of thousands of samples require fast, cost-effective tools. Illumina offers high-throughput sequencing and array technologies with unprecedented sample-to-analysis solutions and unparalleled collaborative expertise to meet these needs.

Illumina High-Throughput Sequencing Technology

Illumina sequencing by synthesis (SBS) is a massively parallel sequencing technology that has revolutionized sequencing capabilities and launched the next generation in genomic sciences. The latest Illumina sequencers unite high-performance imaging with state-of-the-art flow cells to deliver massive increases in throughput.

Ultra-High-Throughput Sequencer

With unmatched scalable throughput, tremendous flexibility for a broad range of applications, and streamlined operation, the NovaSeq 6000 System is the most powerful high-throughput Illumina sequencer to date, perfectly positioned to help scientists perform large-scale genomics studies. The system offers output of up to 6 Tb and 20 billion reads in < 2 days.

Scientists Discuss High-Throughput Sequencing Projects

High-Throughput NGS to Identify Breast Cancer Targets

The Breast Cancer Atlas Project involves sequencing more than a million individual breast cancer cells to help researchers identify potential therapeutic targets.

High-Throughput Sequencing Supports Australia Genome Center Growth

Implementing high-capacity NGS allowed the Deakin Genomics Centre to expand projects for species ancient and new, large and small.

Value of a Panomics-Based Drug Discovery Approach

A large clinical study that integrates sequencing with imaging, multiomic technologies, and big data uncovers novel therapeutic targets for chronic diseases.

High-Throughput Genomics Approaches for Prioritizing Functional Genetic Variants

Recent advances in sequencing technologies have allowed for the development of genomics-based strategies to assay GWAS SNPs for potential functional relevance. Powerful combinations of high-throughput experimental assays, single-cell approaches, and computational analyses are accelerating the ability to link variants to function, and, by extension, link genotype to phenotype.

Featured High-Throughput Sequencing Solutions

High-Throughput Library Prep Automation

For labs preparing large quantities of NGS libraries, liquid-handling robots and other automation solutions provide a good option.

Multiplex Sequencing

Sample multiplexing allows large numbers of NGS libraries to be pooled and sequenced simultaneously during a single run.

Large-Scale Sequencing Products

Single-lot shipments and other features empower clinical labs to reduce the frequency and cost of revalidating reagents and protocols.

Bioinformatics Pipeline Setup

Find information and resources to help simplify the process of setting up an informatics infrastructure and data analysis pipeline.

LIMS for NGS

Learn how you can benefit from a laboratory information management system (LIMS) optimized for NGS, and find out what to look for.

Hear From High-Throughput Genomics Labs

Establishing and Scaling an Efficient Genotyping Facility

Prenetics created a high-throughput genotyping laboratory to serve its growing customer base in Southeast Asia.

Polygenic Risk Scores Could Become Useful Tools in the Physician's Toolbox

Researchers discuss large GWAS studies to identify disease-associated DNA risk loci and develop PRSs for clinical validation.

Scaling Up to Genotype Thousands of Samples

Resource planning and automated genotyping workflows allowed GPBio to achieve immediate efficiency and throughput gains.

Featured High-Throughput Microarray Products

Infinium Global Screening Array

A next-generation genotyping array for population-scale genetics, variant screening, pharmacogenomics studies, and precision medicine research.

Infinium XT

A comprehensive microarray solution for production-scale genotyping of up to 50,000 single or multi-species custom variants.

Illumina Array LIMS

This state-of-the-art LIMS facilitates high-throughput microarray processing and sample tracking, using advanced automation and precise robotic control.

Related Solutions

Population Genomics

National population genomics programs seek to integrate large, diverse data sets, combining clinical information with genomic data at scale in a learning health system.

High-Throughput Genotyping

Large-scale genotyping with arrays can identify variants associated with disease risk in large cohorts or populations.

SBS Technology

Illumina sequencing technology uses fluorescently labeled reversible terminators to detect bases as they’re incorporated into growing DNA strands.


Additional Resources

Driving IBD Discovery with Integrative Genomics

Dr Carl Anderson discusses integrated genomic research approaches in inflammatory bowel disease (IBD) research.


The Functional Effects of Genetic Variants

Tuuli Lappalainen, PhD is working toward identifying how genetic differences may affect an individual's risk for certain diseases.


Shared Vision for the Power of Human WGS

Genomics leaders share their perspective on the impact of high-throughput and population sequencing in clinical research.

Chan Zuckerberg Biohub and the NovaSeq System

The Chan Zuckerberg Biohub uses the NovaSeq System to conduct innovative experiments in genomics.



Next-Generation Sequencing (NGS)

Next-generation sequencing (NGS) is a massively parallel sequencing technology that offers ultra-high throughput, scalability, and speed. The technology is used to determine the order of nucleotides in entire genomes or targeted regions of DNA or RNA. NGS has revolutionized the biological sciences, allowing labs to perform a wide variety of applications and study biological systems at a level never before possible.

Today's complex genomics questions demand a depth of information beyond the capacity of traditional DNA sequencing technologies. NGS has filled that gap and become an everyday tool to address these questions.

Next-Generation Sequencing for Beginners

We'll guide you through the basics of NGS, with tutorials and tips for planning your first experiment.

See What NGS Can Do For You

NGS technology has fundamentally changed the kinds of questions scientists can ask and answer. Innovative sample preparation and data analysis options enable a broad range of applications. For example, NGS allows labs to:

  • Rapidly sequence whole genomes
  • Deeply sequence target regions
  • Utilize RNA sequencing (RNA-Seq) to discover novel RNA variants and splice sites, or quantify mRNAs for gene expression analysis
  • Analyze epigenetic factors such as genome-wide DNA methylation and DNA-protein interactions
  • Sequence cancer samples to study rare somatic variants, tumor subclones, and more
  • Study the human microbiome
  • Identify novel pathogens

Accessible Whole-Genome Sequencing

Using capillary electrophoresis-based Sanger sequencing, the Human Genome Project took over 10 years and cost nearly $3 billion.

Next-generation sequencing, in contrast, makes large-scale whole-genome sequencing (WGS) accessible and practical for the average researcher. It enables scientists to analyze the entire human genome in a single sequencing experiment, or sequence thousands to tens of thousands of genomes in one year.

NGS Data Analysis Tools

Explore user-friendly tools designed to make data analysis accessible to any scientist, regardless of bioinformatics experience.

Broad Dynamic Range for Expression Profiling

NGS-based RNA-Seq is a powerful method that enables researchers to break through the inefficiency and expense of legacy technologies such as microarrays. Microarray gene expression measurement is limited by noise at the low end and signal saturation at the high end.

In contrast, next-gen sequencing quantifies discrete, digital sequencing read counts, offering a broader dynamic range [1,2,3].

Tunable Resolution for Targeted NGS

Targeted sequencing allows you to sequence a subset of genes or specific genomic regions of interest, efficiently and cost-effectively focusing the power of NGS. NGS is highly scalable, allowing you to tune the level of resolution to meet experimental needs. Choose whether to do a shallow scan across multiple samples, or sequence at greater depth with fewer samples to find rare variants in a given region.
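The depth-versus-breadth trade-off above can be made concrete with the standard Lander-Waterman estimate of expected mean coverage, C = N x L / G. A quick sketch (the panel size and read count below are made-up illustrative numbers):

```python
def mean_coverage(num_reads, read_length_bp, target_size_bp):
    """Lander-Waterman estimate of expected mean depth: C = N * L / G."""
    return num_reads * read_length_bp / target_size_bp

# Hypothetical targeted panel: one million 150 bp reads over a 1.2 Mb panel
print(f"{mean_coverage(1_000_000, 150, 1_200_000):.0f}x")   # prints 125x
```

The same function shows the tuning choice directly: spreading the same reads over a 3.1 Gb whole genome instead of a 1.2 Mb panel drops the expected depth by more than three orders of magnitude.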

NGS for COVID-19

Next-generation sequencing is uniquely positioned in an infectious disease surveillance and outbreak model. Learn which NGS methods are recommended for detecting and characterizing SARS-CoV-2 and other respiratory pathogens, tracking transmission, studying co-infection, and investigating viral evolution.

How Does Illumina NGS Work?

Illumina sequencing utilizes a fundamentally different approach from the classic Sanger chain-termination method. It leverages sequencing by synthesis (SBS) technology – tracking the addition of labeled nucleotides as the DNA chain is copied – in a massively parallel fashion.

Next-generation sequencing generates masses of DNA sequencing data, and is both less expensive and less time-consuming than traditional Sanger sequencing [2]. Illumina sequencing systems can deliver data output ranging from 300 kilobases up to multiple terabases in a single run, depending on instrument type and configuration.


In-Depth NGS Introduction

This detailed overview of Illumina sequencing describes the evolution of genomic science, major advances in sequencing technology, key methods, the basics of Illumina sequencing chemistry, and more.

What Can You Do with Next-Generation Sequencing?

See how scientists utilize NGS to make breakthrough discoveries.
Genetics of COVID-19 Susceptibility

This UK-wide study uses NGS to compare the genomes of severely and mildly ill COVID-19 patients, to help uncover genetic factors associated with susceptibility.

Exploring the Tumor Microenvironment

Researchers use single-cell techniques to study cancer microenvironments, to elucidate gene expression patterns and gain insights into drug resistance and metastasis.

Using NGS to Study Rare Diseases

Whole-exome and transcriptome sequencing prove beneficial in uncovering mutations and pathways associated with rare genetic diseases.

Evolution of Illumina NGS

Recent Illumina next-generation sequencing technology breakthroughs include:

  • The iSeq 100 System combines a complementary metal-oxide semiconductor (CMOS) chip with one-channel SBS to deliver high-accuracy data in a compact system.
  • 2-channel SBS enables faster sequencing than the original 4-channel version of SBS technology, with the same high data accuracy.
  • Patterned flow cells offer an exceptional level of throughput for diverse sequencing applications.
  • The NovaSeq 6000 System offers tunable output of up to 6 Tb in < 2 days.

History of Illumina Sequencing

Find out how Illumina SBS technology originated and evolved over time.

Bring NGS to Your Lab

The resources below offer valuable guidance to scientists who are considering purchasing a next-generation sequencing system.

Download Buyer’s Guide

NGS Experimental Considerations

Learn about read length, coverage, quality scores, and other experimental considerations to help you plan your sequencing run.

Use our interactive tools to help you create a custom NGS protocol or select the right products and methods for your project.

Key Terms in NGS

Use our next-generation sequencing glossary to clarify key terms and important concepts as you plan your sequencing project.

Methods Guide

Access the information you need—from BeadChips to library preparation for genome, transcriptome, or epigenome studies to sequencer selection, analysis, and support—all in one place. Select the best tools for your lab with our comprehensive guide designed specifically for research applications.

Genomics News

Illumina and Next Generation Genomic Launch Expanded NIPT in Thailand

The collaboration will introduce VeriSeq™ NIPT Solution v2 in Southeast Asia

Improved Illumina Library prep kit is the ideal solution for the Australian Genome Research Facility

Illumina Announces the Thirteenth Agricultural Greater Good Grant Winner

Dr. Bertram Brenig uses genomics grant to help save the bees


Related Solutions

NGS Library Preparation

Fast, simple NGS library prep and enrichment workflows from Illumina to prepare your samples for sequencing.

Sequencing Services

Access fast, reliable next-generation sequencing services that provide high-quality data and offer extensive scientific expertise.

Illumina NGS & Microarray Training

Work with expert Illumina instructors and get hands-on training. We also offer online courses, webinars, videos, and podcasts.

References
  1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.
  2. Wilhelm BT, Landry JR. RNA-Seq—quantitative measurement of expression through massively parallel RNA sequencing. Methods. 2009;48:249–57.
  3. Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One. 2014;9(1):e78644.



Next-Generation Sequencing for Beginners

These resources cover key topics in next-generation sequencing (NGS) designed for beginners. We'll guide you through the workflow, tutorials, and planning your first experiment.

The Worldwide Impact of NGS

Next-generation sequencing is revolutionizing research, enabling experiments that weren’t possible before.


Benefits of Next-Generation Sequencing

Compare NGS to other technologies and see if it’s right for you and your research goals.

NGS vs. Sanger Sequencing

Learn the key differences between the technologies and see when NGS can be a more effective option.

NGS vs. qPCR

Discover how NGS offers higher discovery power compared to qPCR, making it a useful method for quantifying variation.

NGS vs. Microarrays

Find out why RNA sequencing with NGS offers wide dynamic range and high sensitivity for detecting novel transcripts.

How NGS Works

The basic next-generation sequencing process involves fragmenting DNA/RNA into multiple pieces, adding adapters, sequencing the libraries, and reassembling them to form a genomic sequence. In principle, the concept is similar to capillary electrophoresis. The critical difference is that NGS sequences millions of fragments in a massively parallel fashion, improving speed and accuracy while reducing the cost of sequencing.


Your NGS Workflow

Prepare
Sequence
Analyze

Next-generation sequencing involves three basic steps: library preparation, sequencing, and data analysis. Find resources to help you prepare for each step and see an example workflow for microbial whole-genome sequencing, a common NGS application.

NGS Tutorials for Beginners

Getting started with NGS can be easier than you expect. View our free tutorials for each of the major steps in the workflow. Want personalized training for your lab delivered face-to-face or virtually? We offer that too.

Planning an NGS Budget

The cost of NGS has declined dramatically in recent years, enabling labs of all sizes to introduce sequencing into their studies. There are a few factors to consider when planning your budget, such as lab equipment and sample volume.

Get Started with NGS Basics

Let's start with a detailed overview of the main steps in the next-generation sequencing workflow.

The Illumina Community

Join other Illumina customers in the Illumina Online Community. Collaborate with Illumina moderators, customers, and developers. Discuss best practices, troubleshoot, and learn about how others are using Illumina sequencers, library preparation kits, and automated data analysis to fuel their research.

Additional Resources


Next-Generation Sequencing Glossary

Find definitions for common terms and illustrations of important concepts in NGS.

NGS Workflow Consulting

Get started faster with our experimental design experts.* We’ll help you design an NGS workflow that’s right for you.

Join the Illumina Community

In our open forum, researchers can come together to support one another, ask questions, and collaborate on great science.

Contact Us

*Not available in Asia or South Pacific countries.



How do high-throughput/NGS sequencers calculate quality scores?

Overall sequencing run performance is evaluated by determining whether the sequencing run meets the Illumina specifications for quality scores and data output. Actual run performance will vary based on sample type, quality, and clusters passing filter. Specifications are based on the Illumina PhiX control library at supported cluster densities.

Where can I find instrument specifications?

Follow the links below to the instrument specification pages:

The Sequencing Analysis Viewer (SAV) is free software used to assess the performance of sequencing runs, and can be downloaded from the Illumina website:

  • SAV v2.4.7 for all instruments except on-instrument MiSeq and NextSeq 1000/2000 viewing; compatible with all off-instrument (remote) use on Windows 7 or later
  • SAV v1.8.37 for on-instrument MiSeq viewing

Once SAV is installed, open it and select the tab containing the desired query information.

How do I determine if my run meets spec?

Below is an example of a PhiX validation run (2 x 151 bp) on the MiSeq, using v2 reagents. The specifications for this run are as follows:

  • Total data output of 4.5–5.1 gigabases (Gb)
  • At least 80% of bases called with a quality score of 30 or higher (at least 80% ≥ Q30)

Does the overall quality (Q30) score meet specification?

To determine the quality score, review the Q Score Distribution chart on the Analysis tab and the Summary tab.

The quality specification for a MiSeq paired-end 151-cycle run is Q30 ≥ 80%. The run meets this specification, as the percent ≥ Q30 is >94%.

To determine the yield of the run, review the information in the Summary tab.

The yield specification for a paired-end 151-cycle run is >4.4 Gb. The run meets this specification, as the total yield is 6.10 Gb.
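The %≥Q30 metric that SAV reports can be approximated directly from a FASTQ file by decoding the Phred+33 quality line of each record. A minimal sketch (assumes an uncompressed, well-formed 4-line-per-record FASTQ; function name is mine):

```python
def percent_q30(fastq_lines):
    """Fraction of base calls at Q30 or above, from Phred+33 quality lines.

    `fastq_lines` is an iterable of raw FASTQ lines (4 lines per record;
    the 4th line of each record is the quality string)."""
    total = q30 = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:                      # quality line of each record
            for c in line.strip():
                total += 1
                if ord(c) - 33 >= 30:       # decode Phred+33, compare to Q30
                    q30 += 1
    return 100.0 * q30 / total if total else 0.0

# Two 4-base reads: all of read 1 and half of read 2 are >= Q30
reads = ["@read1", "ACGT", "+", "IIII",     # 'I' encodes Q40
         "@read2", "ACGT", "+", "!!II"]     # '!' encodes Q0
print(percent_q30(reads))                   # prints 75.0
```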

What additional information can I obtain from SAV?

The following images are from BaseSpace public data set: “MiSeq: Nextera DNA Flex (replicates of E. coli, B. cereus, and R. sphaeroides)”. Note: Nextera DNA Flex has been renamed to Illumina DNA Prep.

Analysis Tab: Overview of the run metrics.

  1. Flow Cell Chart shows color-coded metrics per tile for the entire flow cell.
  2. Data by Cycle displays various metrics for each cycle of the run. Select the displayed metric, lane, surface, and channel using the drop-down lists.
  3. Q Score Distribution shows a quick overview of the quality of the run. The Q30 for the whole run is found in the upper right of this box.
  4. Data by Lane shows plots of metrics per lane.
  5. Q score Heatmap displays a heat map for Q score by cycle.

Imaging Tab: Displays thumbnails from the run if available.

  1. Toggle which base or color channel image to view here.
  2. If thumbnails are saved for the run, they are displayed here.

Summary Tab: Provides basic data quality metrics summarized per lane and per read.

  1. Run summary per read, including quality, is reported here.
  2. More details per read including the exact density, clusters Passing Filter (PF), and % aligned.

Indexing Tab: Displays total and per-sample % reads identified, if a sample sheet was used and demultiplexing was performed.


Performance comparison of benchtop high-throughput sequencing platforms

Three benchtop high-throughput sequencing instruments are now available. The 454 GS Junior (Roche), MiSeq (Illumina) and Ion Torrent PGM (Life Technologies) are laser-printer sized and offer modest set-up and running costs. Each instrument can generate data required for a draft bacterial genome sequence in days, making them attractive for identifying and characterizing pathogens in the clinical setting. We compared the performance of these instruments by sequencing an isolate of Escherichia coli O104:H4, which caused an outbreak of food poisoning in Germany in 2011. The MiSeq had the highest throughput per run (1.6 Gb/run, 60 Mb/h) and lowest error rates. The 454 GS Junior generated the longest reads (up to 600 bases) and most contiguous assemblies but had the lowest throughput (70 Mb/run, 9 Mb/h). Run in 100-bp mode, the Ion Torrent PGM had the highest throughput (80–100 Mb/h). Unlike the MiSeq, the Ion Torrent PGM and 454 GS Junior both produced homopolymer-associated indel errors (1.5 and 0.38 errors per 100 bases, respectively).


Why Do I Need to Quantitate My Library?


There are two primary reasons that libraries must be quantitated.

  1. The chemistries that underlie Illumina sequencing require an optimal amount of adaptor-ligated DNA fragments to be loaded into the cluster generation step, for example 6-10 pM for the MiSeq® instrument (v3 chemistry).
  2. If multiple libraries are sequenced in one run, it is desirable for the sequence coverage to be equal for each library, and therefore an equal amount of each library should be moved into the cluster generation step.
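Hitting a picomolar loading target from a library measured in ng/µL requires converting mass concentration to molarity, using the average molecular weight of one double-stranded base pair (~660 g/mol). A sketch of that standard conversion (function names and the example numbers are mine; always follow the loading guidance for your specific instrument and chemistry):

```python
def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert mass concentration to molarity using the average molecular
    weight of one double-stranded base pair (~660 g/mol):
        nM = ng/uL / (660 * fragment length) * 1e6"""
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6

def dilution_factor(stock_nM, target_pM):
    """Fold-dilution needed to bring the stock to the loading target."""
    return stock_nM * 1000.0 / target_pM

# Hypothetical library: 10 ng/uL with a 400 bp average fragment size,
# to be loaded at 8 pM (within the 6-10 pM MiSeq v3 range quoted above)
stock = library_molarity_nM(10.0, 400)
print(f"stock = {stock:.1f} nM")                       # ~37.9 nM
print(f"dilute {dilution_factor(stock, 8):.0f}-fold")  # ~4735-fold
```

Note how strongly the result depends on the mean fragment size: this is why accurate sizing (e.g. by electrophoresis) matters as much as accurate mass quantitation.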

What happens to your library during sequencing?

To fully understand the importance of accurate library quantitation before sequencing, it is first necessary to understand sequencing chemistries and their interactions with the samples you'll be sequencing.

For the purposes of this article, we'll focus on the chemistries that underlie the popular (and market-leading) Illumina sequencers, although library quantitation is an important step for sequencing on any platform.

Building bridges & counting clusters

Core components of Illumina's sequencing technology are its flow cells and their cluster-generating capabilities. Illumina's sequencers are based on optical detection of DNA clusters that form on the glass flow cell, a phenomenon enabled by a dense lawn of primers pre-immobilized to the flow cell channel. As you add your library to the flow cell, the single-stranded, adaptor-ligated fragments hybridize to the immobilized primers studded across the flow cell. This step is where the accuracy of your library quantitation is put to the test.

Cluster generation then occurs: each hybridized molecule undergoes multiple rounds of amplification to produce up to 1,000 copies of the same molecule in the same location on the flow cell: a "cluster", whose diameter is 1 micron or less. For more details on cluster generation, visit Illumina.com.

The amount of DNA initially loaded onto the flow cell directly influences the density of the clusters that form. Too little DNA and the clusters will sparsely populate the flow cell. Too much DNA and the clusters will be too close together, making the sequencing data difficult to interpret due to poor resolution, and requiring libraries to be resequenced (Figure 1). Illumina's recommended input ranges, which differ depending on the specific instrument, help to ensure that the clusters forming on the flow cell have sufficient resolution without wasting valuable flow cell space.

Figure 1: Optimal cluster density enables efficient & accurate quantitation
The density of library clusters as they form on the flow cell prior to sequencing is a key factor in the success of a sequencing run. Low-concentration libraries (left) fail to make optimal use of the space, while high-concentration libraries (right) lead to densely packed clusters that are difficult to call. Optimal cluster density (center) makes the best use of flow cell real estate without overcrowding. Representative optical data generated during sequencing depicts the variation in cluster densities, as shown in the insets.

A deeper dive into equivalent representation

When you pool libraries, you increase the value of each sequencing run by increasing the number of samples that can be sequenced in a single run. However, if libraries are combined in unequal concentrations, certain libraries will be represented at the expense of others. Libraries that are significantly under-represented will need to be resequenced, costing time and money. Over-representation of libraries can result in the generation of more sequence data than required, and the subsequent discarding of sequence reads wastes sequencing capacity.

Figure 2 is an example of uneven library pooling resulting in uneven sequence coverage and the need to resequence. With 16 libraries in this pool, each library should theoretically have 6.25% of the sequence reads. However, this is not the case, and some of the libraries, such as libraries 5 and 15, would need to be resequenced.
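The remedy is to pool equimolar amounts of each library. As a back-of-the-envelope sketch (a hypothetical helper, not from the article), the volume of each library in the pool scales inversely with its concentration:

```python
def equimolar_pool_volumes(concs_nM, vol_limit_ul=10.0):
    """Volumes (ul) that give each library an equal molar amount in the pool,
    scaled so that the most dilute library contributes vol_limit_ul."""
    lowest = min(concs_nM)
    return [round(vol_limit_ul * lowest / c, 2) for c in concs_nM]

# four hypothetical libraries quantitated at different concentrations
print(equimolar_pool_volumes([4.0, 8.0, 2.0, 16.0]))  # [5.0, 2.5, 10.0, 1.25]
```

Each volume-concentration product is identical, so every library should theoretically receive the same share of reads.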

FIGURE 2: Uneven pooling of libraries yields uneven sequence coverage
Inadequate or uneven pooling of libraries can result in suboptimal data, and even lead to the need for library resequencing, as seen with library #5.

Why do my library fragments need to be adaptor-ligated?

Sequences required downstream of library preparation, such as those for cluster generation and sequencing, must be added to the DNA fragments to be sequenced; this is the primary goal of library preparation. In PCR-free library preparation workflows, all of the required sequences must be included in the adaptor sequence. In workflows that include amplification, some of the sequences, including those required for cluster generation (indicated by P5 and P7 in Figure 3), can instead be added during PCR.

FIGURE 3: Adaptor ligation workflow
The stepwise addition of the sequences P5 and P7 and the barcode (BC) can be achieved during PCR amplification of the library.

Only fragments that have a P5 sequence at one end and a P7 sequence at the other are capable of participating successfully in cluster generation. Therefore, ideally, only fragments to which both of these sequences have been attached should be counted during a library quantitation step.

However, in addition to the desired fragments with an adaptor at both ends, libraries may also contain fragments that have no adaptors, one adaptor, or adaptor-dimers. Fragments with no adaptors or only one adaptor will not form clusters. Adaptor-dimers will cluster efficiently but contain no DNA of interest (Figure 4).

FIGURE 4: Adaptors are the hallmark of productive molecules
Only library fragments containing both a P5 and a P7 adaptor will result in a flow-cell cluster. Other molecules are insufficient for cluster formation or contain no DNA of interest, so efforts should be made to exclude them from quantitation.


RESULTS

Our results show the effectiveness of combining quality scores with sequence alignment by applying LAST in two experimental settings: the first with synthetic data and the second with real data, based on cross-species mapping.

Test with simulated DNA reads

In our first experiment, we used simulated reads, since for these we know exactly where they should map. We began by sampling 100 000 random 36-bp fragments from human chromosome 1 (hg19, both strands). To simulate real sequence differences, we made random substitutions at a low level (0.2, 0.5, 2 or 5%). These substitutions consisted of 60% transitions and 40% transversions, a realistic proportion (6). To keep this initial test simple, we did not introduce any insertions or deletions. Finally, we assigned 100 000 real quality score strings (those summarized in Figure 1A) to the simulated reads, and randomly mutated each base according to the corresponding error probability.
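The substitution step of this simulation can be sketched as follows (an illustrative reimplementation, not the authors' code; the function name and the 60/40 transition/transversion split follow the description above):

```python
import random

# each base's unique transition partner (purine<->purine, pyrimidine<->pyrimidine)
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def mutate(seq, sub_rate, ts_fraction=0.6):
    """Introduce random substitutions at the given rate; ts_fraction of them
    are transitions, the rest transversions."""
    out = []
    for base in seq:
        if random.random() < sub_rate:
            if random.random() < ts_fraction:
                out.append(TRANSITION[base])  # transition
            else:
                # two transversion partners: everything except self and transition
                tv = [b for b in "ACGT" if b != base and b != TRANSITION[base]]
                out.append(random.choice(tv))
        else:
            out.append(base)
    return "".join(out)

random.seed(1)
print(mutate("ACGT" * 9, 0.05))  # a 36-bp fragment with ~5% substitutions
```

Sequencer errors would then be layered on top, base by base, using each position's quality-derived error probability.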

We then aligned the reads to chromosome 1, and checked whether or not they mapped back to their original locations. The ‘real’ sequence differences were modeled by using suitable alignment score parameters for each level of divergence (Table 1). We obtained alignments with score ≥ 120 (equivalent to 20 error-free matching bases), then calculated mapping probabilities, and kept alignments with mapping probability ≥ 0.99. Figure 2 shows the relationship between the number of correctly and incorrectly mapped reads, as the score threshold is varied between 216 (the maximum possible) and 120. As the score threshold approaches 120, falsely mapped reads increase dramatically: this is because the mapping probabilities become less reliable, since they fail to account for alignments with scores below 120. In all cases, however, mapping accuracy improves (i.e. we obtain more correctly mapped reads for a given number of incorrectly mapped ones) when we model both sequencer errors and ‘real’ substitutions. If we model only sequencer errors, there is the potential to do worse than traditional alignment, where only substitutions are modeled.
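Conceptually, a mapping probability can be computed as a softmax over the scores of a read's candidate alignments (a simplified sketch, not LAST's exact computation; the scale factor t is borrowed from Table 1, and ignoring low-scoring candidates is exactly the caveat about sub-threshold alignments noted above):

```python
import math

def mapping_probabilities(scores, t=4.34):
    """Posterior that each candidate alignment is the read's true origin,
    from alignment scores (softmax with scale factor t)."""
    weights = [math.exp(s / t) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# a best hit at the maximum score, 216, plus two weaker candidates
probs = mapping_probabilities([216, 180, 150])
print(probs[0] > 0.99)  # True: the top hit easily clears a 0.99 cutoff
```

When a competing alignment scores nearly as high as the best one, the top probability drops below the cutoff and the read is discarded as ambiguous.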

Table 1.

Alignment score parameters for DNA with various substitution rates

Substitution rate (%)   Match score   Mismatch cost (a)   t (a)     Transition cost (b)   Transversion cost (b)   t (b)
0                       6             -                   4.32809   -                     -                       4.32809
0.2                     6             26                  4.33391   23                    28                      4.33441
0.5                     6             22                  4.34295   19                    24                      4.34425
1                       6             19                  4.35838   16                    21                      4.36106
2                       6             16                  4.39082   13                    18                      4.39646
5                       6             12                  4.50212   10                    14                      4.49125
10                      6             9                   4.73387   7                     12                      4.65864
15                      6             8                   4.88281   6 (c)                 9 (c)                   4.92305 (c)

(a) Applies when there is no transition/transversion bias (i.e. one in three substitutions is a transition).

(b) For the case where 60% of substitutions are transitions.

(c) For the case where 45% of substitutions are transitions.

Mapping accuracy for 100 000 simulated 36-bp reads. The reads differ from the genome by a certain rate of ‘real’ substitutions (0.2, 0.5, 1 or 2%) plus sequencer errors. Each line shows the relationship between the number of correctly and incorrectly mapped reads as the alignment score cutoff is varied. Circles indicate a score cutoff of 150. Dotted lines show the accuracy when we model the substitutions but not the sequencer errors. Dashed lines show the accuracy when we model the sequencer errors but not the substitutions. Solid lines show the accuracy when we model both.

To check whether these conclusions hold for a different read length and quality score distribution, we repeated the test using simulated reads of size 51 (Figure 3). The main conclusion still applies: mapping accuracy is improved by modeling both sequencer errors and substitutions. This time, however, traditional alignment performs worse relative to modeling sequencer errors only. The reason, presumably, is that the error probabilities used here are higher on average than those used for the 36-bp reads (Figure 1): so it becomes more important to model sequencing errors.

Mapping accuracy for 100 000 simulated 51-bp reads. See legend of Figure 2 . Circles indicate a score cutoff of 180.

It might be argued that, since we used a particular mapping algorithm (with adaptive seeds), the conclusions may not apply to other mapping techniques. To address this concern, we repeated the experiment using LAST in a different mode, where it guarantees to find all alignments with up to two mismatches (and score ≥ 120). (Many alignments with more than two mismatches are also returned in this mode.) This resembles several popular mapping methods. The main conclusions are unchanged: mapping accuracy is improved by modeling both sequencer errors and substitutions, and in some cases modeling only sequencer errors is less accurate than traditional alignment (Figure 4).
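For intuition, the two-mismatch guarantee can be illustrated with a naive scan (a toy sketch only: LAST's actual method uses index-based seeds, not a brute-force loop, and the sequences here are made up):

```python
def map_with_two_mismatch_guarantee(read, genome):
    """Find every position where the read aligns with at most two mismatches,
    by exhaustively checking all positions (hence the 'guarantee')."""
    hits = []
    for i in range(len(genome) - len(read) + 1):
        mismatches = sum(a != b for a, b in zip(read, genome[i:i + len(read)]))
        if mismatches <= 2:
            hits.append((i, mismatches))
    return hits

print(map_with_two_mismatch_guarantee("ACGT", "AAGTTACGTACTT"))
# [(0, 1), (1, 2), (5, 0), (9, 1)]
```

Because every position is checked, the true location can never be missed when the read carries at most two differences, which is why this mode produces so few false mappings on clean simulated data.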

Mapping accuracy for 100 000 simulated 36-bp reads using a mapping procedure that guarantees to find all matches with up to two substitutions. This is identical to Figure 2 , except that a different mapping algorithm was used here.

The mapping algorithm does make a difference, though (Figures 2 and 4). The default adaptive seed method gives only a few hundred false mappings for roughly 60 000 correctly mapped reads, but the two-mismatch guarantee method gives yet fewer false mappings (only a handful) for roughly 50 000 correctly mapped reads. On the other hand, it ultimately gets fewer correctly mapped reads. In our simulation, all of the reads actually come from the reference sequence, and the two-mismatch guarantee method will never miss the correct alignment if the read has at most two differences: this is why there are so few false mappings. Real data is less clean than this, and we would expect more false mappings (see below).

In a further test, we mapped the simulated reads using a simple match/mismatch scoring matrix while also modeling the sequencer errors. This means that we accurately modeled the level of divergence, but ignored the difference between transitions and transversions. This approach works almost, but not quite, as well as when we model transitions and transversions (Supplementary Figure S2). This is worth knowing, because match/mismatch scoring schemes are simpler to implement and slightly faster than general score matrices (Supplementary Data).

Test by xeno-mapping real DNA reads

We wished to test our approach with real (not simulated) reads, but we needed a case where we could at least estimate whether the mappings are correct. To accomplish this, we mapped reads of D. melanogaster DNA (those in Figure 1A) to the genome of D. simulans, a closely related organism. This cross-species mapping exemplifies xeno-mapping and mapping to highly polymorphic genomes.

To estimate correctness, we first mapped the reads to the D. melanogaster genome, which can presumably be done much more accurately, and then used the D. melanogaster / D. simulans genome alignment from the UCSC database to cross-reference the mappings. The genome alignment no doubt has errors, but it should be much more accurate than short-read mapping because it can leverage the context provided by long sequences.

In order to construct a suitable alignment scoring scheme, we examined the divergence between D. melanogaster and D. simulans. In the UCSC ‘net’ alignments, 15% of aligned bases are mismatches, and 45% of these are transitions. There is about one gap per 101 aligned bases, and the average gap size is 6.67. These statistics suffice to construct a scoring scheme (Table 1, Supplementary Data).
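A rough sketch of how such a scoring scheme can be derived from divergence statistics (my own simplification: log-odds against a uniform background, with the match score pinned at 6; the paper's exact procedure differs, but this approximately reproduces the footnote-c row of Table 1):

```python
import math

def score_params(mismatch_rate, ts_fraction, match_score=6):
    """Log-odds transition/transversion costs for a given divergence,
    with the match score fixed and the scale factor t solved from it."""
    bg = 0.25                                      # uniform background frequency
    p_match = 1 - mismatch_rate
    p_ts = mismatch_rate * ts_fraction             # one transition partner
    p_tv = mismatch_rate * (1 - ts_fraction) / 2   # two transversion partners
    t = match_score / math.log(p_match / bg)       # scale factor ('t' in Table 1)
    ts_cost = round(-t * math.log(p_ts / bg))
    tv_cost = round(-t * math.log(p_tv / bg))
    return round(t, 3), ts_cost, tv_cost

# D. melanogaster / D. simulans: 15% mismatches, 45% of them transitions
print(score_params(0.15, 0.45))  # (4.903, 6, 9)
```

The transition cost of 6 and transversion cost of 9 match Table 1's 15% row, and the scale factor is close to its tabulated t of 4.92305.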

In this test, mapping accuracy was greatly improved by modeling real sequence differences in addition to sequencing errors (Figure 5). At a score cutoff of 150, we get 35 667 correctly mapped reads (66% of the 53 748 that could be mapped confidently to D. melanogaster) and 197 falsely mapped reads. If we model sequencing errors only, we get 26 569 correctly mapped reads (49%) and 194 falsely mapped reads.

If we model real sequence differences without gaps, the accuracy is only slightly lower than when we do allow gaps ( Figure 5 ). So it is not important to model gaps for this data set. Gaps are likely to be more important for longer reads, since a longer read is more likely to cross a gap, and it is also more likely that the alignment can be extended across the gap.

Estimated mapping accuracy for 100 000 real 36-bp reads from D. melanogaster, mapped to the D. simulans genome. Circles indicate a score cutoff of 150. The dotted line shows the mapping accuracy when we model the sequencer errors but not the real differences. The solid line shows the accuracy when we model both. The dashed red line shows the accuracy when we model both but forbid insertions and deletions. Correctness was estimated by mapping the reads to the D. melanogaster genome (modeling sequencer errors only), and using the UCSC D. melanogaster / D. simulans pairwise genome alignment to cross-reference the mappings.

For completeness, we also tried mapping the reads to either or both Drosophila genomes in two-mismatch guarantee mode (Supplementary Figure S3). All combinations support the main conclusion that mapping accuracy increases significantly when we model real sequence differences in addition to sequencing errors. When we map to D. simulans in two-mismatch guarantee mode, the slight benefit of modeling gaps disappears, perhaps because this mode requires finding large (26 bp) gapless matches (see ‘Materials and Methods’ section). As expected, two-mismatch guarantee mode does not reduce false mappings as dramatically as it did for simulated data. Finally, two-mismatch guarantee mode gives fewer correctly mapped reads (as it did for simulated reads), perhaps because it requires seeds with 18 matches (see ‘Materials and Methods’ section), making it less sensitive in general than adaptive seeds.


Mapping qualities

Current high-throughput sequencers produce reads that are short: for example, the HiSeq 2000 produces millions of reads that are 50 and 100 bp long. To align such short reads with high speed and accuracy, many short-read alignment programs have been developed, such as BWA. The major limitation is the length of the sequenced reads: many eukaryotic genomes are repetitive, making it difficult to map short reads unambiguously. Because of this, alignment programs report a mapping quality for each read that is mapped to the reference genome. A mapping quality is the probability that a read is aligned in the wrong place (i.e. the Phred-scaled posterior probability that the mapping position of the read is incorrect). The probability is calculated as:

P = 10^(-q/10), where q is the mapping quality. For example, a mapping quality of 40 gives 10^(-40/10) = 10^-4 = 0.0001, which means there is a 0.01 percent chance that the read is aligned incorrectly.
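As a quick check of this conversion, a minimal helper (my own, not part of BWA):

```python
def mapq_to_error_prob(q):
    """Phred-scaled mapping quality -> P(read is mapped to the wrong place)."""
    return 10 ** (-q / 10)

print(mapq_to_error_prob(40))  # 0.0001, a 0.01% chance of misplacement
print(mapq_to_error_prob(16))  # ~0.0251, the quality seen in the BWA example below
```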

Base calling errors with respect to mapping qualities

Sequencers make base calling mistakes and this complicates matters. To illustrate how this affects the mapping qualities using BWA, I will use an example I came across in SEQanswers. First let’s examine mapping qualities when a read maps to a specific region without suboptimal hits:

Mapping the read to our reference, BWA returns a mapping quality of 37 (which is actually the highest mapping quality BWA returns).

Next let’s create an example with suboptimal hits. Below is a reference that contains five identical stretches of 28-mers and one 28-mer with a single mismatch (the C set off by spaces below) compared to the other five:

>ref2
ACGTACGTACGTACGTA C GTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

Let’s map a read from the single mismatch stretch to this reference:

The mapping quality of the read in the second example is 16, which corresponds to a probability of 10^(-16/10) = 0.025119 of mapping to the wrong place. Even though the read maps uniquely in the reference, its mapping quality is 16 and not 37. The BWA-specific tags in the SAM file provide some nice additional information:

XT Type: Unique/Repeat/N/Mate-sw
X0 Number of best hits
X1 Number of suboptimal hits found by BWA

From the BWA tag information we can quickly deduce whether a read aligned uniquely: in this case, XT:A:U indicates a unique alignment. In addition, the X1:i:5 tag indicates that there were 5 suboptimal hits.

Mapping qualities when considering base calling errors

To model base-calling errors we can use the binomial distribution: if I expect there to be 1 base-calling error in 100 bp, I can calculate the probability of errors in a read of 25 nt using R.

If we expect 1 base-calling error in 100 bp, the probability of making two base-calling errors in 25 bp is quite low. We can then use the formula from the SEQanswers post to calculate the posterior probability that the best alignment is actually correct.

In reality, base calling is much more accurate than 1 error in 100 bases (a Phred quality score of 20). If we change the base-calling error rate to 1 in 1000 (Phred score of 30), the posterior probability that the best alignment is correct improves to 0.88879. With a base-calling error rate of 1 in 10000 (Phred score of 40), the probability improves to 0.9876531, i.e. roughly a 0.012 probability that the alignment is incorrect, which is in the same ballpark as the BWA mapping quality of 16, a 0.025 probability that the alignment is incorrect.
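The R snippets lost from the original post can be approximated in Python. A sketch of the binomial model and the SEQanswers posterior formula, assuming the best hit is a perfect 25-bp match competing with 5 one-mismatch suboptimal hits (the function names are my own):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k base-calling errors among n sequenced bases)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def posterior_best_hit(p_err, read_len=25, n_suboptimal=5):
    """Posterior probability that the best (0-mismatch) alignment is correct,
    given n_suboptimal hits that each require one miscalled base."""
    p0 = binom_pmf(0, read_len, p_err)  # read was sequenced error-free
    p1 = binom_pmf(1, read_len, p_err)  # read carries one sequencing error
    return p0 / (p0 + n_suboptimal * p1)

for q in (20, 30, 40):                  # Phred-scaled base-calling qualities
    print(q, round(posterior_best_hit(10 ** (-q / 10)), 5))
# 20 0.44196
# 30 0.88879
# 40 0.98765
```

This reproduces the 0.88879 and 0.9876531 values quoted above (the latter rounds to 0.98765), and shows how quickly the posterior collapses as base-calling quality drops.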

Does BWA make use of base calling qualities?

When I added base-calling qualities to the read

I still get the same mapping quality of 16 with BWA, indicating that base-calling qualities are not used by BWA when computing mapping qualities:

tag 0 artificial 1 16 25M * 0 0 ACGTACGTACGTACGTACGTACGTA . XT:A:U NM:i:0 X0:i:1 X1:i:5 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

This was confirmed when I examined the BWA manual, which mentioned that “Base quality is NOT considered in evaluating hits.”


This work is licensed under a Creative Commons
Attribution 4.0 International License.


Discussion and Conclusion

Although we only applied our pipeline to RNA-seq short reads in this experiment, it is also applicable to other quantitative high-throughput sequence analysis tasks, such as DNA-seq, Chip-seq, DNase-seq, Bis-seq, etc. For example, studies of allele-specific copy number variations can leverage our pipeline for DNA-seq data. The resulting read-origin annotations can be used to estimate the number of DNA copies in different parental haplotypes in later analysis steps.

Although we chose to use a diallel experiment to evaluate our new pipeline in the ‘methods and result’ sections, it is equally applicable to other multi-parental crosses. For example, our multi-alignment pipeline can be directly applied to recombinant inbred lines (RILs) [22] and backcrosses. For a multi-parental cross with N distinct inbred founders, we would generate N pseudogenomes and perform N separate alignments. The resulting N BAM files can then be merged. In this scenario, each mapping that is saved to the output has an N-bit flag set indicating which files the read was found in. This allows for cases where a mapping’s origin is shared/ambiguous between multiple founders. The latest version of Suspenders allows for a variable number of input alignments during the merging process.
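The N-bit flag bookkeeping might look like this (a hypothetical sketch: Suspenders' real merge operates on BAM records, not read-id sets, and the names here are made up):

```python
def origin_flags(read_ids_per_alignment):
    """Map each read id to an N-bit integer whose i-th bit is set when the
    read appears in the i-th founder alignment."""
    flags = {}
    for i, ids in enumerate(read_ids_per_alignment):
        for rid in ids:
            flags[rid] = flags.get(rid, 0) | (1 << i)
    return flags

# three hypothetical founder alignments
flags = origin_flags([{"r1", "r2"}, {"r2"}, {"r2", "r3"}])
print(sorted(flags.items()))  # [('r1', 1), ('r2', 7), ('r3', 4)]
```

A flag with several bits set (like r2's 7 = 0b111) marks a read whose origin is shared or ambiguous between founders.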

Furthermore, we can incorporate additional filters into the pipeline to better determine the origin of mappings. In our experiment, we only used the Unique and Quality filters as informative filters. This resulted in about 5% of the mapped reads being handled by the Random filter. Adding an additional filter before the Random filter will help to reduce the number of random choices made in the final output. One possible filter is a Pileup filter, which chooses among otherwise equal mappings the single mapping that has the most surrounding mappings supporting it. To do this, we first find all mapping sets that can be filtered by the Unique or Quality filters and use their chosen mappings to compute the read coverage at each base in the reference genome. Then, any mapping sets that could not be resolved using Unique or Quality would compare the pileup coverage of each potential mapping in the set and choose the mapping with the highest coverage. This will be particularly useful for reducing the number of reads that map to pseudogenes in RNA-seq. In cases where the pileups are not significantly different, more computation or simply using the Random filter may be necessary. Suspenders currently includes a preliminary version of this filter in the software package.

To summarize, we propose a new multi-alignment pipeline, which is generic enough to handle reads of various types of organisms from different high-throughput sequencing techniques. We demonstrated its effectiveness on RNA-seq data from a diallel cross and compared our pipeline with a single-reference pipeline. Our pipeline outperforms traditional single-reference-based alignment approaches: not only are more reads aligned, but a higher percentage of them are assigned a correct origin.

The two key components of our pipeline, Lapels and Suspenders, are Python scripts that can be downloaded at https://code.google.com/p/lapels/ and https://code.google.com/p/suspenders/.



