Information

How to check if a fastq file has single or paired end reads

How to check if a fastq file has single or paired end reads



We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I am trying to check if a fastq file has single or paired end reads. How can I achieve this with an error-proof method?

I checked wikipedia and MAQ but I want to know if is there a reliable document that describes all possible variants in sequence ID to check for single/paired end reads.

I am searching also for a library, better in Python, to achieve this.

Thanks


By now I got some interesting answers in this question on Biostars

Basically what I did is the following:

  • First of all, I checked if Sequence Id contains paired end notation. As described in this wikipedia page, for Illumina reads there are two possible notation for single/paired end reads:

    @HWUSI-EAS100R:6:73:941:1973#0/1

    If the last number is/2in some reads then the reads are paired end; otherwise they can be single end.

    The second notation is:

    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

    If the first number in the second group is2in some reads then the reads are paired end; otherwise they can be single end;

  • Then I checked for multiple files. If a sample has two fastq files it is likely that the reads are paired end. It is to note however that with a single file it is not possible to exclude that paired end reads can be interleaved in one single file, even if it is not common (in my opinion);

  • The most general method is to cross check each single reads with the whole set of reads. If the first part of the sequence ID (in this case the field starting from@and ending before the#- in the first notation - or the whitespace - in the second notation) is unique among all the reads (for each read) it is likely that the reads are single reads, otherwise - if can be found a duplicate for each reads - the reads are paired end. In this case, on *nix systems it can be achieved with the following command (thanks to the biostars answers):

    grep --no-filename @HWUSI-EAS100R:6:73:941:1973 *.fastq | cut -d"-f1 | sort | uniq -c | sort -rgk 1,1 | head

    If the result shows in the first lines a result like this:

    1 read1_ID

    1 read2_ID

    It is likely it is single end. Otherwise:

    2 read1_ID

    2 read2_ID

    it is paired end.

I skim read BioPython API documentation but I cannot find something useful to do it.

Suggestions and corrections are welcomed.

Thanks


This is complementary answer to what @gc5 provided.

for cases that are using "the second notation" which looks like:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG' ^ |________________what we are trying to extract

The following code will go through all the files iteratively and produce one output per file:

grep -P "^@" *.fastq | grep -oP "sd+" | sort | uniq -c

or if you have.fastq.gzfiles:

zgrep -e "^@" *.fastq.gz | grep -oP "sd+" | sort | uniq -c

if you have single-end you will only see ones and if you have paired-end you will see ones and twos. Also as sanity check, you can see how many of each you have:

zgrep --max-count=10000 -e "^@" *.fastq.gz | grep -oP "sd+" | sort | uniq -c
6333652 1 6333652 2

Note that I added--max-count=10000to the last one. This is particularly useful if you have paired-ends in separate files as you will get all ones form one and all twos from the other. This will only go through the first 10'000 lines which makes this one-liner much much faster.


How to check if a fastq file has single or paired end reads - Biology

Illumina sequencing technology uses cluster generation and sequencing by synthesis (SBS) chemistry to sequence millions or billions of clusters on a flow cell, depending on the sequencing platform. During SBS chemistry, for each cluster, base calls are made and stored for every cycle of sequencing by the Real-Time Analysis (RTA) software on the instrument. RTA stores the base call data in the form of individual base call (or BCL) files. When sequencing completes, the base calls in the BCL files must be converted into sequence data. This process is called BCL to FASTQ conversion.

A FASTQ file is a text file that contains the sequence data from the clusters that pass filter on a flow cell (for more information on clusters passing filter, see the “additional information” section of this bulletin). If samples were multiplexed, the first step in FASTQ file generation is demultiplexing. Demultiplexing assigns clusters to a sample, based on the cluster’s index sequence(s). After demultiplexing, the assembled sequences are written to FASTQ files per sample. If samples were not multiplexed, the demultiplexing step does not occur, and, for each flow cell lane, all clusters are assigned to a single sample.

For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file is created for each sample for each lane. FASTQ files are compressed and created with the extension *.fastq.gz.

What does a FASTQ file look like?

For each cluster that passes filter, a single sequence is written to the corresponding sample’s R1 FASTQ file, and, for a paired-end run, a single sequence is also written to the sample’s R2 FASTQ file. Each entry in a FASTQ files consists of 4 lines:

  1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary by based on the BCL to FASTQ conversion software used.
  2. The sequence (the base calls A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

Here is an example of a single entry in a R1 FASTQ file:

More detailed information on the FASTQ sequence file format can be found here.

How to view a FASTQ file

FASTQ files can contain up to millions of entries and can be several megabytes or gigabytes in size, which often makes them too large to open in a normal text editor. Generally, it is not necessary to view FASTQ files, because they are intermediate output files used as input for tools that perform downstream analysis, such as alignment to a reference or de novo assembly.

If you need to view a FASTQ file for troubleshooting purposes or out of curiosity, you will need either a text editor that can handle very large files, or access to a Unix or Linux system where large files can be viewed via the command line.

How to generate FASTQ files

FASTQ file generation is the first step for all analysis workflows used by MiSeq Reporter on the MiSeq and Local Run Manager on the MiniSeq. When analysis completes, the FASTQ files are located in <run folder>DataIntensitiesBaseCalls on the MiSeq and <output folder>Alignment_#<subfolder>Fastq on the MiniSeq.

For all runs uploaded to BaseSpace Sequence Hub, FASTQ file generation automatically occurs after the run is completely uploaded, and the FASTQ files are used as input for the various analysis apps on BaseSpace Sequence Hub. On BaseSpace Sequence Hub, you can find your FASTQ files in the project(s) associated with your run.

The bcl2fastq conversion software can be used to generate FASTQ files from data generated on all current Illumina sequencing systems.

For information on the different settings that can be applied during FASTQ file generation, see the software user guides below.


Now we get into some actual preprocessing. We will use fastq-mcf to trim adapter from our reads and do some quality filtering. We need to trim adapter, because if a fragment is short enough, we will sequence all the way through the fragment and into the adapter. Obviously the adapter sequence in not found in the genome, and can keep the read from aligning properly. To do the trimming, we need to generate an adapter file.

The first step is to get the adapter sequence. We can get this from the manual, but sequences from a PDF can pick up weird characters, so we are better off getting the adapter sequences from the Primer Sample Sheet.

We can download and display the Sample Sheet using curl :

We want the adapter sequences from the sample sheet:

Now we need to make the adapter file this needs to be in FASTA format.

Browse to scratch/bioinf_intro/myinfo

Click on the jupyter “File” menu, and select “Open”.

When the the new browser window/tab opens, click on the “Files” tab if it is not already active.

Click on the “home” symbol to go to the top level directory, then click on “myinfo”

In the “New” menu select “Text File”.

In this text file, paste the adapter lines from above.

We also want to include the reverse complement of the adapter, in case the adapter contamination as sequenced is the reverse completement of what is given. The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like “Adapter_RC”

Now clean up by making sure that …

Each sequence is on its own line

Each sequence has a name on the line before it

The sequence name is preceded by a “>”

All commas and spaces need to be removed, and non-sequence characters need to be removed from the sequence lines Now it should look like this:

Click on “untitled.txt” to change the file name to “neb_e7600_adapters.fasta”


Paired-End vs. Single-Read Sequencing

Understand the key differences between these sequencing read types

What is Paired-End Sequencing?

Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts.

In addition to producing twice the number of reads for the same time and effort in library preparation, sequences aligned as read pairs enable more accurate read alignment and the ability to detect insertion-deletion (indel) variants, which is not possible with single-read data. 1 All Illumina next-generation sequencing (NGS) systems are capable of paired-end sequencing.

What is Paired-End Sequencing?

Paired-End Sequencing Highlights

  • Simple Paired-End Libraries: Simple workflow allows generation of unique ranges of insert sizes
  • Efficient Sample Use: Requires the same amount of DNA as single-read genomic DNA or cDNA sequencing
  • Broad Range of Applications: Does not require methylation of DNA or restriction digestion can be used for bisulfite sequencing
  • Simple Data Analysis: Enables high-quality sequence assemblies with short-insert libraries. A simple modification to the standard single-read library preparation process facilitates reading both the forward and reverse template strands of each cluster during one paired-end read. Both reads contain long-range positional information, allowing for highly precise alignment of reads.
Illumina Sequencing Introduction

This overview describes major sequencing technology advances, key methods, the basics of Illumina sequencing chemistry, and more.

Paired-End DNA Sequencing

Paired-end DNA sequencing reads provide high-quality alignment across DNA regions containing repetitive sequences, and produce long contigs for de novo sequencing by filling gaps in the consensus sequence. Paired-end DNA sequencing also detects common DNA rearrangements such as insertions, deletions, and inversions.

Methods for DNA Sequencing

DNA sequencing can be applied to small, targeted regions or the entire genome through a variety of methods.

Sequencing Read Length

Choosing the right sequencing read length depends on your sample type, application, and coverage requirements. Learn how to calculate the right read length for your sequencing run.

Paired-End RNA Sequencing

Paired-end RNA sequencing (RNA-Seq) enables discovery applications such as detecting gene fusions in cancer and characterizing novel splice isoforms. 2

For paired-end RNA-Seq, use the following kits with an alternate fragmentation protocol, followed by standard Illumina paired-end cluster generation and sequencing.

For mRNA-Seq library prep, use:
For stranded total RNA library prep, use:
RNA-Seq Overview

This method offers a high-resolution view of coding and noncoding regions of the transcriptome for a deeper understanding of biology.

NGS is Revealing the Mysterious World of Microbes

Researchers are using 16sRNA to investigate the genomes of microbes and improve our understanding of human health, disease, and microbial evolution.

Single-Read Sequencing

Single-read sequencing involves sequencing DNA from only one end, and is the simplest way to utilize Illumina sequencing. This solution delivers large volumes of high-quality data, rapidly and economically. Single-read sequencing can be a good choice for certain methods such as small RNA-Seq or chromatin immunoprecipitation sequencing (ChIP-Seq).

Library Preparation

Innovative, comprehensive library prep solutions are a key part of the Illumina sequencing workflow.

Interested in receiving newsletters, case studies, and information from Illumina based on your area of interest? Sign up now.

Additional Resources

Sequencing Technology Video

See SBS technology in action.

Sequencing Technology Video

Sequencing Platform Selection Tool

Compare the speed and throughput of Illumina sequencing systems to find the best instrument for your lab.

References
  1. Nakazato T, Ohta T, Bono H. Experimental design-based functional mining and characterization of high-throughput sequencing data in the sequence read archive. PLoS One. 20138(10):e77910.
  2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 200910:57–63.

Innovative technologies

At Illumina, our goal is to apply innovative technologies to the analysis of genetic variation and function, making studies possible that were not even imaginable just a few years ago. It is mission critical for us to deliver innovative, flexible, and scalable solutions to meet the needs of our customers. As a global company that places high value on collaborative interactions, rapid delivery of solutions, and providing the highest level of quality, we strive to meet this challenge. Illumina innovative sequencing and array technologies are fueling groundbreaking advancements in life science research, translational and consumer genomics, and molecular diagnostics.

For Research Use Only. Not for use in diagnostic procedures (except as specifically noted).


How to check if a fastq file has single or paired end reads - Biology

NGmerge: merging paired-end reads and removing sequencing adapters

Gaspar JM. BMC Bioinformatics. 2018 Dec 2019(1):536. [PubMed] [BMC] [PDF]

NGmerge operates on paired-end high-throughput sequence reads in two distinct modes (Fig. 1).

In the default stitch mode, NGmerge combines paired-end reads that overlap into a single read that spans the full length of the original DNA fragment (Fig. 1A). The ends of the merged read are defined by the 5' ends of the original reads. Reads that fail the stitching process (due to a lack of sufficient overlap, or excessive sequencing errors) are placed into secondary output files, if the user requires them.

The alternative adapter-removal mode returns the original reads as pairs, removing the 3' overhangs of those reads whose valid stitched alignment has this characteristic (Fig. 1B). Reads whose alignments do not have such overhangs (or do not align at all) will also be printed to the output files, unmodified.

Figure 1. Analysis modes of NGmerge. The diagrams show the paired-end reads (R1, R2) derived from sequencing DNA fragments (white boxes) with sequencing adapters (gray boxes) on either end.

  • sample_R1.fastq.gz , sample_R2.fastq.gz (paired-end sequence files for a sample)
  • NGmerge (downloaded and compiled as described below)

To produce stitched reads (Fig. 1A): sample_merged.fastq.gz

To produce reads with adapters removed (Fig. 1B): sample_noadapters_1.fastq.gz and sample_noadapters_2.fastq.gz

The software can be downloaded from GitHub. (and you're already here! congratulations!)

A Makefile is provided for compilation with GCC, and both zlib and OpenMP are also required. The program has been tested after compilation with GCC 6.3.0, zlib 1.2.8, and OpenMP 4.0.

To compile, run make in the folder in which the software was downloaded. The executable NGmerge should be produced.

In either analysis mode (Fig. 1), NGmerge evaluates all possible gapless alignments of a pair of reads in attempting to find an optimal one. The determinations of which alignments are considered, and then which alignment (if any) is both valid and optimal, are made according to several parameters: -m , -p , -d , -e , and -s .

NGmerge begins by aligning a pair of reads (R1, R2) such that the minimum overlap parameter ( -m , default 20bp) is met. It then checks each possible alignment of the reads until they overlap with no 3' overhangs (Fig. 2A). If the -d option is selected (or in adapter-removal mode [ -a , which automatically sets -d ]), NGmerge additionally evaluates dovetailed alignments (with 3' overhangs), down to the minimum length set by the -e parameter (Fig. 2B).

Figure 2. Alignments considered by NGmerge. A: Default alignments range from those with the minimal overlap length (set by -m ), to complete overlaps with no overhangs. B: When the -d option is selected, NGmerge also evaluates dovetailed alignments.

For each alignment, NGmerge computes the fraction mismatch (the number of mismatches between the R1 and R2 reads, divided by the overlap length). Alignments with calculated values no more than the threshold set by the -p parameter (default 0.10) are considered valid. If multiple valid alignments are found, the one with the lowest fraction mismatch is selected as the optimal alignment. In rare cases where multiple alignments have identical fraction mismatches, the longest is preferred by default (unless -s is set). In all of these calculations, ambiguous bases (Ns) are considered neither matches nor mismatches.

Further descriptions of these parameters are provided below.

NGmerge analyzes unaligned paired-end reads in FASTQ format. The input files can be gzip-compressed. Multiple sets of input files can be specified, comma-separated (or space-separated, in quotes).

The input files must list the reads in the same order. The program requires that the paired reads' headers match, at least up to the first space character.

An input file of interleaved reads can be analyzed by not specifying a -2 file. Also, it is possible to read from stdin using - , e.g. -1 - .

Since the merged reads are defined by the 5' ends of the paired reads' alignments (Fig. 1A), one should be wary of quality trimming the reads at those ends. For example, when using a program such as qualTrim, one should specify -3 to ensure that quality trimming occurs only at the 3' ends, prior to using NGmerge.

The primary output file in stitch mode is the file of merged reads, in FASTQ format. It is possible to write to stdout with -o - (see also -y , below).

When specified, all the reads that failed the merging procedure will be written to the output files, as they appeared in the original inputs.

By default, all FASTQ output files will be gzip-compressed if and only if the input files are (with multiple sets of input files, the outputs will be compressed if either of the first set of inputs is). Specifying -z will guarantee that the outputs are gzip-compressed, whereas -y will guarantee that they are not, regardless of the inputs' formats. Note that all gzip-compressed outputs will automatically have '.gz' appended to their filenames, if necessary.

In stitch mode, this applies only to the optional output from -f (above). Instead of two outputs, a single interleaved output will be produced (and no '.fastq' suffix will be appended to the filename).

This log file lists the following for each read pair in the input file(s):

Read read header, not including @
OverlapLen total length of the read overlap, including Ns NA if reads were not merged (and remaining columns are left blank)
StitchedLen total length of the merged read
Mismatch fraction of mismatched bases (count of mismatches divided by overlap length [not including Ns]) must be less than or equal to -p value (see below)

This log file lists the following for each read pair whose optimal valid alignment has 3' overhangs:

Read read header, not including @
Adapter_R1 3' overhang of R1 read - if no overhang
Adapter_R2 3' overhang of R2 read - if no overhang

The columns are labeled 'Adapter' because, if the reads were not trimmed on their 5' ends, these extra sequences should be adapters. If the sequences that appear in the 'Adapter' columns are not consistent, they may be false positives, and one should consider decreasing -p or increasing -e .

For each pair of reads that was successfully merged, this log file lists alignments of the reads' sequences and quality scores, along with the resulting merged sequence and quality scores. For example:

This is the minimum overlap length (in bp) for valid alignments of a pair of reads (see Fig. 2A). Note that ambiguous bases (Ns) do not count toward this minimum length.

This parameter determines how stringent the evaluation of an alignment is. The value must be in the interval [0, 1), with lower values equating to increased stringency. Specifying -p 0 means that only perfect alignments (with no mismatches) are valid the default value of 0.10 means that a valid alignment can have at most 10% mismatches (calculated as the number of mismatches divided by the overlap length [not counting Ns]).

When this option is selected, alignments in which a read's 3' end extends past its pair's 5' end will be evaluated, down to a minimum length (see Fig. 2B). By default, such alignments are not even considered. Since the merged read is defined by the original reads' 5' ends, the 3' overhangs are automatically removed. These overhangs, which are typically adapters, can be printed to a separate log file (see -c , above).

This is the minimum overlap length (in bp) for alignments with 3' overhangs (see Fig. 2B). This value should be set to the length of the absolute shortest DNA fragment that may have been sequenced. Using a value that is too low may result in false positives, especially if the reads contain repetitive sequences.

Given multiple valid alignments with identical fraction mismatch scores, NGmerge will select the longest stitched read by default. With -s , the shortest stitched read will be preferred instead.

Quality score profile options

By default, NGmerge uses hard-coded profiles when determining the quality scores of overlapping bases. There are separate profiles for cases where the R1 base and the R2 base match, and for when they do not match. Those who do not wish to use these profiles have two alternative options:

With this option, NGmerge will use the quality score profiles in the provided file. The file must list two matrices of comma- or tab-separated values that follow header lines #match and #mismatch . One should follow the template of the given qual_profile.txt file, which mimics the hard-coded profiles of NGmerge with the quality score range of [0, 40].

With this option, NGmerge will use a method similar to that of the program fastq-join. In cases where the R1 base and R2 base match, the higher quality score is used for the merged base. When they do not match, the merged base's quality score is calculated as the difference in the two quality scores.

This option must be specified for NGmerge to run in adapter-removal mode. As indicated, it automatically sets the -d option to check for dovetailed alignments.

The formatting of the input files is described above.

In adapter-removal mode, all reads are printed to the output files. The only modifications are the clipping of the 3' overhangs of reads whose alignments have such overhangs.

With this option, instead of two outputs, a single interleaved output will be produced (and no '.fastq' suffix will be appended to the filename).

These options are described above.

This log file is described above.

In adapter-removal mode, the following files cannot be produced:

These parameters are described above.

As noted previously, the -d option is automatically set in adapter-removal mode.

To reduce computational time, one can run NGmerge across multiple cores via this option. Note that gzip compression and decompression is not parallelized, so the computational savings are not linear.

These two parameters set the range of quality scores for the input FASTQ files. The default values match the Sanger format, with quality scores in the range [0, 40] spanning ASCII values [33, 73].

Instead of printing full alignments, the log file specified by -j will list the details of the mismatches: the read header, position, and the base and quality score for both the R1 and R2 reads. This is useful for calculating separate error rates for matches and mismatches.

  • NGmerge cannot gzip-compress multiple output files that are stdout . For example, the following will produce an error:
    • -o - -a without -i
    • -f - without -a and without -i


    How to sort fastq files in order to align paired end reads using BWA.

    I am trying to align paired end reads using BWA but since the fastq files aren't sorted, its complaining that "paired reads have different names". Like this: "M01628:49:000000000-D06TG:1:1102:25364:18377", "M01628:49:000000000-D06TG:1:1101:16377:1698"

    Is there a convenient tool for sorting or do I have to create a script to do this?

    A sample read looks like this:

    Yeah, I already looked at top search results before posting the question. Unfortunately it didn't sort it correctly and that's why I was wondering if there was a tool or alternate method for this.

    Are your reads in interleaved fastq or split fastq files? If the reads are interleaved, there are some answers on Google that will deinterleave your files for you. If the reads are split into a mate1 and mate2 file already then with a little Python or Perl you can easily sort the files correctly with access to enough ram. It is probably not the most efficient method, but if you want something that you can write very quickly and only need to use to get your results, this will do the trick.

    Build a list of reads using BioPython SeqRecords

    Sort the lists by read ids 3a) Iterate through both lists, pulling 1 read from each list. 3b) Compare the read ids (don't forget the /1 and /2 or the unique identifier for the mate1 and mate2 reads) 3c) If a match is found, write the mate1 read to your sorted mate1 file and the mate2 read to your sorted mate2 file. (Better to have mate1_paired and mate2_paired lists that you use as buffers. Then write to your file every time you get 10k or 50k reads in the buffers which you'll empty after writing and start filling again. Go back to step 3a. 3d) If no match is found, take the lower read id and append it to the singleton read id buffer to write to the singleton read file. 3e) Pull a new read from the read list that the singleton read came from and repeat steps 3b-3e until a match is found

    When you're done, you'll have mate1_sorted.fq and mate2_sorted.fq files which should now be perfectly ordered with each other. You'll also have a singleton.fq file which contains reads that lost their pair at some point between sequencing and your mapping step.

    That said, don't use BWA for mapping reads unless you have no other choice. The developer is well known for his work on short read mapping, the program itself never gives you what you really want. Parsing BWA data is one of the most tedious things I've had to do with NGS data. I highly recommend only using it if the tools you are using are already setup to work with the output of BWA directly. Otherwise, look into BBMAP for your short read mapping needs. You will fall in love with the flexibility of input and output options and formats. The stats that you can ask it to generate as part of the run are incredible and it just does what you want.


    Trimmomatic

    Trimmomatic is a popular tool for trimming adapter sequences from Illumina reads. The Trimmomatic manual describes how to install this application, how to run it and it describes all of the required and optional command line parameters. If you decide to use Trimmomatic for trimming adapter sequences from Illumina reads, a minimal command that only performs adapter trimming may look like this:

    • Most sequencing runs use paired-end reads, so we specify “PE” in the command line.
    • To speed up the application, we specify the number of threads to use, up to the maximum number of available processor threads.
    • There are always two FASTQ files in a paired-end run: one file for the forward reads and one file for the reverse reads. We specify both files in the parameter list.
    • For each read file, we specify the name of a paired output file and an unpaired output file.
    • The adapter sequence(s) is/are contained in a FASTA formatted file. The ILLUMINACLIP parameter specifies the name of this file. This parameter also requires three additional fields: seedMismatches, palindromeClipThreshold, simpleClipThreshold. See the manual for more information about how to set these three fields.

    In our example, using the Nextera XT library prep kit, the “adapters.fasta” file would look like this:

    This is a standard FASTA formatted file. The first record contains the right-caret character followed by an arbitrary string. The second record contains the adapter sequence. This file can contain multiple adapter sequences by using a multi-FASTA file format. Trimmomatic output files will show which reads (if any) were trimmed.


    It only takes two minor changes to run fastq-mcf on paired data, we need to tell it to also load the read 2 file, and also what to call the trimmed output from this file.

    1. neb_adapters.fasta
    2. r1.8A_pilot.fq.gz
    3. r2.8A_pilot.fq.gz : NEW for paired-data
    4. -q 20
    5. -x 0.5
    6. -o r1.8A_pilot.trim.fastq.gz
    7. -o r2.8A_pilot.trim.fastq.gz : NEW for paired-data

    Note: Now that, since we are now including the reverse reads, contamination with the Universal adapter is now observed


    Darencard / extract_fastq_bam.md

    Sometimes FASTQ data is aligned to a reference and stored as a BAM file, instead of the normal FASTQ read files. This is okay, because it is possible to recreate raw FASTQ files based on the BAM file. The following outlines this process. The useful software samtools and bedtools are both required.

    From each bam, we need to extract:

    1. reads that mapped properly as pairs
    2. reads that didn’t map properly as pairs (both didn’t map, or one didn’t map)

    For #1, the following command will work. This was taken from this webpage.

    The -f and -F filter using flags in column 2 of the BAM file. These aren't always intuitive, and I won't describe them more here, but you can use this handy tool to better understand. Also note that the -u flag creates uncompressed BAM output rather than default compressed BAM output, so the files will be larger. This helps with quicker reading in later steps, but it isn't necessary to include this if you want to save disk space. samtools is super fast either way.

    Resolving #2 is more complicated, as there are three ways a read might not have mapped as a proper pair. A. The first read mapped but the paired read did not. B. The first read did not map but the paired read did. C. Neither paired read mapped at all. Again, flags will be used to filter the original BAM file. This information was found at this webpage.

    As you might expect, you have to then merge the three files that contain at least one unmapped pair.

    Next, these BAM files must be resorted so that they are ordered by read ID instead of location in the reference.

    At this time, it is a good idea to check that you have the correct number of reads and no redundancy. You can summarize the original BAM file to get an idea of where you started.

    Notice the toal number of input reads that is found on the first line. You want to be sure that the number of unmapped and mapped reads total this number. It is easy to check using the following commands.

    Note that one paired read is counted as two reads here. If you sum these two numbers, they should equal the number you noted above, as they do here.

    If all is good, you can now extract the FASTQ reads into two paired read files, as follows.

    And then it also makes sense to combine both the first and paired reads together from the mapped and unmapped files.

    These two files should now have the same number of reads that are exactly as you would have received them if they had come directly from the sequencer as FASTQ.

    Please also note that all of the commands above can be piped together in bash using | , which will save on disk space and time. So it is best to combine commands where possible.


    NextSeq 500

    The NextSeq 500 is different from the other Illumina sequencers in two important ways that impact the FASTQ files it generates.

    The NextSeq 500 has 4 lanes. Each lane gets the same sample or pool, but they are imaged by different cameras. Therefore, the data is tagged with lane numbers 1 to 4. However, the data in each file is for the same sample and represents distinct set of fragments for the sample. We generally keep these files separate, but not always.

    The NextSeq 500 sequences the second read of a dual-indexed library in the reverse direction from the other sequencers. We reverse complement the second barcode in the file name, but not in the FASTQ deflines.

    So for example, a barcode pair TAAGGCGA and TAGATCGC would be sequenced as TAAGGCGA and GCGATCTA . The defline for a read would contain TAAGGCGA-GCGATCTA but we would rename the FASTQ file to TAAGGCGATAGATCGC .