A tool to get annotation features by position in the reference genome

Previously, when I needed to find a genomic feature (promoter, CDS, repeat, etc.) that had been hit by a mutation (SNP), I used my own scripts to load an annotation file, parse it, map the SNP's position and output the sequence of the feature. Now that I'm moving from microbes to mammals, with huge annotations, introns/exons and huge genomes, my scripts don't work that well, and I'm too lazy to optimize the code or rewrite it in a faster language, since I'm sure there must be some well-established tools for the task. So, could you recommend some tools that do the following:

  1. Given a reference genome, annotation and a VCF (or similar variant file) match the SNPs to genomic features in the annotation
  2. Output the entire feature that underwent mutation, e.g. if the SNP happened in a protein-coding gene, I need to have the entire protein-coding sequence not just an exon/intron.

Thanks in advance.

Well, after 2 weeks and several thousand lines of code I got a demo version of my revamped annotator up and running, but then I was told to take a look at SnpEff. The tool does everything I was going to implement in my own, so if anyone ever wants to annotate SNPs with regard to the genomic features they hit and their possible effects, give SnpEff a try. It's really fast.
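
For readers who only need the basic lookup described in the question, here is a minimal Python sketch. It is not how SnpEff works internally. The feature tuples, the helper names (`build_index`, `features_at`, `feature_sequence`) and the toy reference are all made up for the example, and coordinates are assumed to be 1-based and inclusive, as in GFF and VCF.

```python
from collections import defaultdict

# Hypothetical features: (chrom, start, end, feature_type, name), 1-based inclusive.
FEATURES = [
    ("chr1", 5, 12, "promoter", "geneA_promoter"),
    ("chr1", 13, 30, "CDS", "geneA"),
    ("chr1", 25, 40, "repeat", "rep1"),
]

def build_index(features):
    """Group features by chromosome and sort them by start position."""
    index = defaultdict(list)
    for feat in features:
        index[feat[0]].append(feat)
    for feats in index.values():
        feats.sort(key=lambda f: f[1])
    return index

def features_at(index, chrom, pos):
    """Return every feature on `chrom` whose interval contains `pos`."""
    hits = []
    for feat in index.get(chrom, []):
        if feat[1] > pos:  # sorted by start, so no later feature can contain pos
            break
        if pos <= feat[2]:
            hits.append(feat)
    return hits

def feature_sequence(genome, feat):
    """Slice the whole feature out of the reference (1-based inclusive to 0-based slice)."""
    chrom, start, end = feat[0], feat[1], feat[2]
    return genome[chrom][start - 1:end]

genome = {"chr1": "ACGT" * 10}  # toy 40 bp reference
index = build_index(FEATURES)
snps = [("chr1", 8), ("chr1", 27), ("chr1", 3)]
for chrom, pos in snps:
    for feat in features_at(index, chrom, pos):
        print(chrom, pos, feat[4], feature_sequence(genome, feat))
```

The linear scan per chromosome is fine for a microbial annotation. For a mammalian genome an interval tree or a dedicated tool is the more realistic choice.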


A fully automated pipeline for high-resolution typing based on wgMLST

Synchronization of wgMLST nomenclature and sub-typing schemes from organism-specific public reference databases (e.g. BIGDSdb)

Flexible selection of typing loci possible (e.g. cgMLST, MLST) or any other user-defined selection

wgMLST allele assignments both based on the raw sequencing data (assembly-free) and on de novo assembled contigs (assembly-based)

In-depth quality assessment of the different steps in the pipeline

Integrated with the BIONUMERICS Calculation Engine for high-throughput analyses

42 in-house developed bacteria specific schemas:

Acinetobacter baumannii
Bacillus cereus
Bacillus subtilis
Burkholderia cepacia complex
Brucella spp.
Campylobacter coli - C. jejuni
Citrobacter spp.
Clostridioides difficile
Cronobacter spp.

Enterobacter cloacae
Enterococcus faecalis
Enterococcus faecium
Enterococcus raffinosus
Escherichia coli / Shigella
Francisella tularensis
Klebsiella aerogenes
Klebsiella oxytoca
Klebsiella pneumoniae

Lactobacillus sanfranciscensis
Legionella pneumophila
Leuconostoc spp.
Listeria monocytogenes
Micrococcus spp.
Mycobacterium bovis
Mycobacterium kansasii
Mycobacterium leprae
Mycobacterium tuberculosis

Neisseria gonorrhoeae
Neisseria meningitidis
Pasteurella multocida
Proteus vulgaris
Pseudomonas aeruginosa
Salmonella enterica
Serratia marcescens
Staphylococcus aureus
Staphylococcus epidermidis

Staphylococcus pseudointermedius
Stenotrophomonas maltophilia
Streptococcus agalactiae
Streptococcus mitis/oralis
Streptococcus pyogenes
Weissella spp.

In-house developed wgSNP analysis pipeline, starting from raw reads

Flexible genome mapping onto multiple reference genomes

Perform read mapping by using one of the reference algorithms such as Bowtie2* or SNAP**

Various SNP filtering templates available

Flexibility to create your own SNP filtering templates based on coverage or quality criteria, position, mutation types and much more

In-depth quality assessment of retained SNPs

Integrated with the BIONUMERICS Calculation Engine for high-throughput analyses

Integrated CFSAN SNP pipeline*** (United States Food and Drug Administration, Center for Food Safety and Applied Nutrition),
producing SNP matrices from NGS data to be used in phylogenetic analysis of pathogenic organisms typically linked to food safety.

* Langdon, William B. "Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks." BioData mining 8.1 (2015): 1.
** Zaharia, Matei, et al. "Faster and more accurate sequence alignment with SNAP." arXiv preprint arXiv:1111.5572 (2011).
*** Davis, Steve, et al. "CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data." PeerJ Computer Science 1 (2015): e20.

Comparative genome mapping and dot plot representation representing homologous sequences in direct or reverse orientation

Multiple genome alignment functionality

Genome clustering and phylogeny analysis

Genome alignment-based SNP analysis and dN/dS calculation


Identification of coding regions in prokaryotic genomes

Annotation of sequences against one or multiple reference sequences based on feature identity and chromosome synteny

Annotation of sequences using the integrated Prokka* pipeline

Integrated with the BIONUMERICS Calculation Engine for high-throughput analyses

* Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.

De novo assembly, based on one of the integrated short read de novo assemblers, including Velvet*, SPAdes**, SKESA*** and Unicycler****

Generate more accurate “hybrid” assemblies by using Unicycler****, leveraging the benefits of both data types, namely the accuracy of short reads and the structural resolving power of long reads

Integrated with the BIONUMERICS Calculation Engine for high-throughput analyses

*Zerbino, Daniel R., and Ewan Birney. "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome research 18.5 (2008): 821-829.

** Bankevich, Anton, et al. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." Journal of computational biology 19.5 (2012): 455-477.
*** Souvorov, Alexandre, Richa Agarwala, and David J. Lipman. "SKESA: strategic k-mer extension for scrupulous assemblies." Genome biology 19.1 (2018): 153.
**** Wick, Ryan R., et al. "Unicycler: resolving bacterial genome assemblies from short and long sequencing reads." PLoS computational biology 13.6 (2017): e1005595.

Estimating evolutionary relationships based on maximum parsimony and maximum likelihood methods, with phylogenetic distance scaling correction e.g. Jukes & Cantor or Kimura2

Inferring phylogenies of varying complexity based on maximum likelihood by using the embedded standard tools RAxML* or FastTree**

Integrated with the BIONUMERICS Calculation Engine for high-throughput analyses

*Stamatakis, Alexandros. "RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models." Bioinformatics 22.21 (2006): 2688-2690.

** Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. "FastTree 2–approximately maximum-likelihood trees for large alignments." PloS one 5.3 (2010).


Sequence annotations and their relationship with reference sequences

Sequence annotations are information artifacts that add biologically meaningful information to specific locations on genomic, gene, transcript or protein sequences. For example:

1) Gene OR4F5 is located on human chromosome 1 (build hg19), from position 69090 to 70008.

Sequence annotations are only meaningful if the reference sequence is known. However, specifying a stable reference is not necessarily straightforward. Before the Human Genome Project, Locus Specific Databases (LSDBs) were recommended for storing and sharing gene-centric variant annotations [1]. To date, the most popular platform for storing these transcript variants is the Leiden Open-source Variation Database v.2 (LOVD2) [2]. In each LOVD2 instance, a "stable" transcript sequence is chosen as the reference sequence of each gene. Variants are annotated with descriptions of sequence variations and positions according to the chosen transcript sequence. There are several advantages to a gene/transcript-centric annotation approach. First, a gene is much shorter than a locus or chromosome, so maintaining the sequence content is much easier. Second, it limits annotations mainly to the protein-coding regions of the genome, and therefore focuses on phenotypic effects that are easier to predict. However, LSDBs typically limit descriptions of DNA variants to a single transcript, even when multiple transcripts may be affected. Depending on which transcript is used, the variant description may look very different. To calculate the location of a variant relative to a different reference sequence, an external tool has to be used for the position conversion [3]. Disambiguation of the variant description is an essential step in the context of data integration and preservation.
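
The dependence of a variant description on the chosen transcript can be illustrated with a small sketch. It is not the conversion tool cited above. The function `transcript_to_genomic` and both exon layouts are hypothetical, and only forward-strand transcripts with 1-based inclusive coordinates are handled.

```python
def transcript_to_genomic(exons, tx_pos):
    """Map a 1-based transcript position to a genomic position.

    `exons` is a list of (start, end) genomic intervals, 1-based inclusive,
    in transcription order on the forward strand.
    """
    if tx_pos < 1:
        raise ValueError("transcript positions are 1-based")
    remaining = tx_pos
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length  # skip this exon and continue in the next one
    raise ValueError("position lies beyond the end of the transcript")

# Two hypothetical transcripts of one gene that share only their last exon.
tx_a = [(100, 149), (300, 399)]
tx_b = [(200, 219), (300, 399)]

# The same genomic base has a different position in each transcript.
print(transcript_to_genomic(tx_a, 61))
print(transcript_to_genomic(tx_b, 31))
```

Both calls resolve to the same genomic coordinate, which is why a variant position is ambiguous unless the reference transcript is stated alongside it.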

However, not all biological questions are locus specific. As sequencing technologies advanced in the past 15 years, more and more studies are omics focused, requiring a "stable" and "complete" reference genome [4]. The Human Genome project was completed in April 2003, followed by the release of human genome assembly NCBI35/hg17 in May 2004. Sequence gaps and assembly errors were removed and newly discovered genes, (non-coding) transcripts and proteins were annotated with every new release up to GRCh37/hg19 (February, 2009) [5]. As reference sequences are revised, it becomes increasingly difficult to track and compare annotations. Researchers today share their results of genome-wide genomic and epigenetic studies in publications and databases, but they often fail to mention the exact version of the reference genome sequence. Moreover, many popular annotation file formats do not explicitly ask for reference sequence version information. It is up to the user to embed this information in the file description through natural language. Consequently, when using these formats to exchange data for computational analysis and data integration, essential metadata is too easily lost. For example, the ENCODE Project Consortium [6] has effectively shared their data by publishing them as annotation tracks in the UCSC genome browser [7]. However, these annotation tracks use Browser Extensible Data (BED) format, which does not explicitly state the reference assembly version within the file. To propagate current annotations to the forthcoming GRCh38/hg20 and alternative genome assemblies, it is crucial to preserve annotations with their respective reference sequence versions.
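
One way to avoid losing the assembly version is to make it a mandatory field of every annotation record, instead of leaving it to a free-text description. The sketch below shows the idea. The `Annotation` class, the `overlaps` helper and the peak record are hypothetical, and only the OR4F5 coordinates come from the example given earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    assembly: str  # stated explicitly, e.g. "GRCh37/hg19"
    chrom: str
    start: int     # 1-based, inclusive
    end: int
    label: str

def overlaps(a, b):
    """True if two annotations overlap; refuses to compare across assemblies."""
    if a.assembly != b.assembly:
        raise ValueError(f"cannot compare {a.assembly} with {b.assembly} coordinates")
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

gene = Annotation("GRCh37/hg19", "chr1", 69090, 70008, "OR4F5")
peak = Annotation("GRCh37/hg19", "chr1", 69900, 70100, "hypothetical peak")
print(overlaps(gene, peak))
```

Comparing records from different assemblies raises an error rather than silently returning a coordinate-based answer that may be wrong.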

A Semantic Web approach to data integration

A possible approach to exposing sequence variation annotations in a computer-accessible format is provided by Semantic Web languages and tools [8]. It effectively removes the boundaries between annotating data, linking data, and making data machine readable [9-11]. By representing data and metadata in Resource Description Framework (RDF) and using shared ontologies in RDF and Web Ontology Language (OWL), mismatches between database schemas and the identity of their content can be addressed [12,13]. A first attempt for mutation data was presented by Zappa and coworkers, who produced a mutation database for TP53 as Linked Open Data [14]. They followed the principles of Linked Data [15] and applied various existing ontologies to achieve optimal interoperability. However, they did not address the problem of integrating mutation data that were annotated using different reference sequences. They did not model genomic locations of annotations in detail, which makes querying this dataset difficult.

Ontological framework for data integration across resources

Formal ontologies play an important role in semantic data integration between information systems [16,17], bringing conceptual coherence, stability, and scalability to the applied domain, which can greatly increase data interoperability [17,18]. The Open Biological and Biomedical Ontologies (OBO) Foundry provides a suite of orthogonal interoperable ontologies to aid knowledge integration in the biomedical domain [19]. To take advantage of the OBO Foundry ontologies, we have chosen Basic Formal Ontology (BFO) [20] as our upper ontological framework for data modeling. Other ontologies in OBO that are relevant to this paper include the Information Artifact Ontology (IAO) [21], the Sequence Ontology (SO) [22], the Ontology for Genetic Interval (OGI) [23], and the Relation Ontology (RO) [24].

Previous efforts on modeling biological sequences and sequence annotations in the OBO community have taken primarily a biological viewpoint. Thus, sequences refer to biological molecules, and sequence annotations refer to features defined with respect to biological process [22,25]. The SO focuses on creating a set of consistent vocabularies that describe the biological functions of these sequences and defining the biological relationships between these sequences [22]. OGI models the biological physical sequence by adopting the realism approach from BFO, and further contributes to this model by adding spatial topological relationships between sequences [23]. However, Hoehndorf et al. pointed out a gap between this biological model and information systems that are used to store sequence annotations [26]. To bridge this gap, they have proposed three views of biological sequences: molecular, syntactic, and abstract. Molecular sequences are DNA and RNA molecules as well as proteins. Syntactic sequences are strings like "ACAC" and represent the arrangement of the molecules in the molecular sequences. Abstract sequences represent an equivalence class of sequence tokens or representations. They point out that without such a clear distinction data integration is hampered. Indeed, the SO community acknowledged the lack of distinction that is made by biologists between abstract, syntactic, and molecular sequences. Bada and Eilbeck proposed a strategy of separating SO into two parallel ontologies: one for molecular sequences, the other with abstract sequences (abstract in a broader sense than meant by Hoehndorf). The former would be an extension of the Molecular Sequence Ontology while the SO would focus more on the abstract sequences referring to sequences, and parts of sequences [27]. However, this new alignment strategy is still under discussion.

Beyond the OBO Foundry there are additional relevant ontologies applicable to sequence annotation. The Feature Annotation Location Description Ontology (FALDO) is the latest effort to address the void of describing sequence annotations from the information systems' perspective [28]. It is designed to be general enough to describe annotations with various levels of location complexity, but it does not address issues such as the meaning of, or the evidence for, the location.
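
To give a feel for what a FALDO-style location looks like as RDF, the following sketch emits N-Triples using only the Python standard library. The FALDO terms used (`location`, `Region`, `begin`, `end`, `ExactPosition`, `position`, `reference`) are part of the published vocabulary. The `http://example.org/` resource URIs, the helper names and the choice to encode the assembly in the reference URI are assumptions made for illustration, not the data model proposed in this paper.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

def uri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"

def integer(value):
    """Format an integer as a typed N-Triples literal."""
    return f'"{value}"^^<{XSD_INTEGER}>'

def region_triples(annotation, reference, begin, end):
    """Return N-Triples lines locating `annotation` on `reference` from begin to end."""
    region = f"{annotation}/region"
    statements = [
        (uri(annotation), uri(FALDO + "location"), uri(region)),
        (uri(region), uri(RDF_TYPE), uri(FALDO + "Region")),
    ]
    for role, coordinate in (("begin", begin), ("end", end)):
        position = f"{region}/{role}"
        statements += [
            (uri(region), uri(FALDO + role), uri(position)),
            (uri(position), uri(RDF_TYPE), uri(FALDO + "ExactPosition")),
            (uri(position), uri(FALDO + "position"), integer(coordinate)),
            # Each position names the reference sequence it is counted on.
            (uri(position), uri(FALDO + "reference"), uri(reference)),
        ]
    return [f"{s} {p} {o} ." for s, p, o in statements]

triples = region_triples(EX + "annotation/OR4F5", EX + "assembly/hg19/chr1", 69090, 70008)
print("\n".join(triples))
```

Because every position carries its own reference, the same annotation could in principle be given a second location on another assembly without ambiguity.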

Aim of this paper

Our aim is to create an RDF data model for describing sequence annotation instances within an established ontological framework that fits our practice of working with reference sequences and different versions of genome assemblies. We provide a mechanism for linking annotation instances to different reference sequences. We also present some of the challenges in aligning our approach with current OBO Foundry ontologies.


The falling cost of high-throughput sequencing (HTS) technologies has made them accessible to small laboratories, promoting a large number of genome-sequencing projects even in nonmodel organisms. Nevertheless, genome assembly and annotation, especially in eukaryotic genomes, still represent major limitations (Dominguez Del Angel et al., 2018). The unique genomic characteristics of many nonmodel organisms, often lacking pre-existing gene models (Yandell & Ence, 2012), and the absence of closely related species with well-annotated genomes, mean that the annotation process can be very challenging. State-of-the-art pipelines for de novo genome annotation, such as braker1 (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016) or maker2 (Holt & Yandell, 2011), allow the integration of multiple lines of evidence such as RNA-seq, expressed sequence tag (EST) data, gene models from other previously annotated species or ab initio gene predictions, using software such as genemark (Lomsadze, Burns, & Borodovsky, 2014), exonerate (Slater & Birney, 2005), genomethreader (Gremme, Brendel, Sparks, & Kurtz, 2005), augustus (Stanke & Waack, 2003; Stanke, Diekhans, Baertsch, & Haussler, 2008) or snap (Korf, 2004). However, the gene models predicted by these automatic tools are often inaccurate, particularly for gene family members. Furthermore, these predictions can be especially inaccurate for medium- or low-quality assemblies, which is quite a common situation in the increasingly large number of genome drafts of nonmodel organisms used in molecular ecology studies. The correct annotation of gene families frequently requires additional programs, such as augustus-ppx (Keller, Kollmar, Stanke, & Waack, 2011), or semi-automatic, and even manual approaches, that evaluate the quality of supporting data. This latter task is usually performed in genomic annotation editors, such as apollo, which give researchers the option to work simultaneously in the same annotation project (Lee et al., 2013).

There are a number of issues affecting the quality of gene family annotations, especially for either old or fast evolving families (Yohe et al., 2019). First, new duplicates within a family usually originate by unequal crossing-over and are found in tandem arrays in the genome, with the more recent duplicates also the physically closest (Clifton et al., 2020; Vieira, Sánchez-Gracia, & Rozas, 2007). This configuration often causes local misassemblies that result in the incorrect or failed identification of tandem duplicated copies (i.e., it produces artefact, incomplete or chimeric genes along a genomic region). Second, the identification and characterization of gene copies in medium- to large-sized families tends to be laborious, requiring data from multiple sources, including well-annotated remote homologues and hidden Markov model (HMM) profiles. Certainly, the robust identification and annotation of the complete repertory of a gene family in a typical genome draft is a challenging task that requires important additional efforts, which are very tedious to perform manually.

To facilitate this curation task, we have developed bitacora, a bioinformatics pipeline to assist the comprehensive annotation of gene families in genome assemblies. bitacora requires a structurally annotated genome (general feature format [GFF] and FASTA format) or a draft assembly, and a curated database with well-annotated members of the focal gene families. The program will perform comprehensive blast and hmmer searches (Altschul, 1997; Eddy, 2011) to identify putative candidate gene regions (already annotated, or not), combine evidence from all searches and generate new gene models. The outcome of the pipeline consists of a new structural annotation (GFF) file along with the encoded sequences. These output sequences can be directly used to conduct downstream functional or evolutionary analyses or to facilitate a fine-scale re-annotation in genome browsers such as Apollo (Lee et al., 2013).


The GENCODE consortium annotates protein-coding genes, pseudogenes, long non-coding RNAs (lncRNAs) and small non-coding RNAs (sncRNAs). We define protein-coding genes as loci where the weight of available evidence supports the presence of a coding sequence (CDS). Evidence for a CDS may come from high-throughput experimental assays, the demonstration of physiological function in the research literature, the observation of homology to a known protein-coding gene, or the interpretation of evolutionary conservation data. Pseudogenes are sequences derived from protein-coding genes, containing disabling mutations such as in-frame stop codons, frameshifting indels, truncations or insertions, or for which there is no evidence of transcription. lncRNA genes are identified by a combination of transcriptional evidence and a lack of potential to be assigned as protein-coding. We do not absolutely require lncRNA genes to be longer than 200 bp, but very few annotated lncRNAs fall below this threshold, as we also require annotated lncRNAs to be free of secondary structures found in known functional sncRNAs. Currently, sncRNAs are almost entirely annotated by computational pipelines that use homology to known sncRNA sequences and predicted secondary structure to identify functional copies.

Our annotation processes use primary transcript and proteomics data, evolutionary conservation, computational methods and curated public databases such as UniProt (14). These data are integrated using a combination of expert manual annotators and computational methods to identify regions of the genome with genic potential, annotate the exon-intron structures of transcripts identified at the locus under investigation and assign a functional classification to both the individual transcript and the locus.

Broad functional classes (referred to as ‘biotypes’) of protein-coding, pseudogene, lncRNA and sncRNA are assigned as described above. More detailed functional categories are also added. For example, at the locus level we describe the provenance of pseudogenes as processed (derived via retrotransposition), unprocessed (defined by a genome duplication event) or unitary (arising from the lineage specific disruption of an ancestral protein-coding gene). At the transcript level we define transcripts belonging to protein-coding loci as protein-coding, nonsense mediated decay (NMD) (containing a premature stop codon believed likely to lead to the transcript being targeted by the nonsense-mediated decay pathway) or retained intron (containing sequence that is intronic in other transcripts from the locus). Following the structural and functional classification of transcripts, a subset of GENCODE annotation is subject to targeted experimental validation as described below to ensure consistent high quality of the gene annotation.

To cater for a variety of use cases, we create a number of annotation sets. Examples of these are our ‘GENCODE comprehensive’ and ‘GENCODE basic’ gene sets. GENCODE comprehensive includes the complete set of annotations including partial transcripts (i.e. transcripts that are not full length, but represent a unique splice form based on available evidence) and biotypes such as NMD. GENCODE basic is a subset of GENCODE comprehensive that contains only transcripts with full-length CDS. For non-coding loci, GENCODE basic includes the smallest number of transcripts that cover 80% of the exonic features, while ensuring all loci are represented by at least 1 transcript. Computational methods add additional information. For example, APPRIS, described in more detail below, identifies the most likely functional translations at protein-coding loci and TSL (transcript support level) calculates the amount and quality of supporting evidence for each transcript.
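
The text does not say how the transcripts for non-coding loci are chosen, so the following is only one plausible reading: a greedy selection that keeps adding the transcript contributing the most uncovered exons until the 80% target is reached. Greedy selection is a common approximation and does not guarantee the smallest possible set. The function `pick_transcripts`, the transcript identifiers and the exon sets are all hypothetical.

```python
def pick_transcripts(transcripts, fraction=0.8):
    """Greedily pick transcripts until `fraction` of all exons is covered.

    `transcripts` maps a transcript id to the set of exon ids it contains.
    At least one transcript is returned for any non-empty locus.
    """
    all_exons = set().union(*transcripts.values())
    target = fraction * len(all_exons)
    covered, chosen = set(), []
    while len(covered) < target:
        # Take the transcript that adds the most not-yet-covered exons;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(transcripts), key=lambda t: len(transcripts[t] - covered))
        chosen.append(best)
        covered |= transcripts[best]
    if not chosen and transcripts:
        chosen.append(sorted(transcripts)[0])
    return chosen

locus = {
    "T1": {"e1", "e2", "e3"},
    "T2": {"e3", "e4"},
    "T3": {"e5"},
    "T4": {"e1"},
}
print(pick_transcripts(locus))
```

In this toy locus two transcripts already cover four of the five exons, so the remaining transcripts are left out of the reduced set.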

Manual annotation

The GENCODE gene set is created by merging the results of manual and computational gene annotation methods. Manual gene annotation has two major modes of operation: clone-by-clone and targeted annotation. ‘Clone-by-clone’ annotation involves ‘walking’ across a genomic region, investigating the sequence, aligned expression data and computational predictions for each BAC clone. In doing so, an expert annotator investigates all possible genic features and considers all possible annotations and biotypes simultaneously. We believe this approach carries substantial advantages. For example, the decision to annotate a locus as protein-coding or pseudogenic benefits from being able to weigh both possibilities in light of all available evidence. This process helps prevent false positive and false negative misclassifications. Targeted annotation is designed to answer specific questions such as ‘is there an unannotated protein-coding gene in this position?’ Ranked target lists are generated by computational analysis based, for example, on transcriptomic data, shotgun proteomic data or conservation measures. Over the last two years mouse annotation has been dominated by the clone-by-clone approach while the human genome has been refined entirely via targeted reannotation except for the annotation of human assembly patches and haplotypes released by the Genome Reference Consortium (15), which take a clone-by-clone approach.

Over the last two years, we have focused on two broad areas: completing the first pass manual annotation across the entire mouse reference genome and a dedicated effort to improve the annotation of protein-coding genes in human and mouse.

We have completed the annotation of novel protein-coding genes, lncRNAs and pseudogenes, plus QC and updating previous annotation where necessary for mouse chromosomes 9, 10, 11, 12, 13, 14, 15, 16 and 17. These updates bring the fraction of the mouse genome with completed first pass manual annotation to approximately 97%. In addition, we have continued to work with the NCBI and Mouse Genome Informatics project at the Jackson Laboratory to resolve annotation differences for protein-coding, pseudogene and lncRNA loci. For protein-coding genes this is under the umbrella of the Consensus Coding Sequence (CCDS) project (16).

We have also manually investigated unannotated regions of high protein-coding potential identified by whole genome analysis using PhyloCSF (17) (a tool described in more detail below). In human, this led to the addition of 144 novel protein-coding genes and 271 pseudogenes (of which 42 were unitary pseudogenes). In mouse, we annotated orthologous loci for all but 11 of the 144 human protein-coding genes. We have also revisited the annotation of all olfactory receptor loci in both human and mouse, using RNAseq data to define 5′ and 3′ UTR sequences for ∼1400 loci. In human we have also targeted a ‘deep dive’ manual reannotation of genes on clinical panels for paediatric neurological disorders to identify missing functional alternative splicing. Incorporating second and third generation transcriptomic data, we reannotated ∼190 genes and added more than 3600 alternatively spliced transcripts, including ∼1400 entirely novel exons and an additional ∼30 kb of CDS. We have also completed an effort to capture all recently described unannotated microexons (18) into GENCODE, and further added an additional 146 novel microexons mined from public SLRseq data (19).

As part of the CCDS collaboration with RefSeq, we have checked a large subset of human loci where there was disagreement over gene biotype. Similarly, we have checked all UniProt manually annotated and reviewed (i.e. Swiss-Prot) accessions that lack an equivalent in GENCODE. As a result, we added 32 novel protein-coding loci to GENCODE and rejected more than 200 putative coding loci. Finally, we are manually reviewing genes previously annotated as protein-coding, but with weak or no support based on a method incorporating UniProt, APPRIS, PhyloCSF, Ensembl comparative genomics, RNA-seq, mass spectrometry and variation data (20, 21). Of the 821 loci investigated to date, 54 have had their coding status removed while a further 110 potentially dubious cases remain under review.

The approach taken is reflected in the kinds of updates captured in the annotation. For example, the targeted reannotation in human leads to the annotation of few novel protein-coding loci but many novel transcripts at updated protein-coding and lncRNA loci. Conversely, in mouse the emphasis on clone-by-clone annotation identifies many more novel loci and transcripts across a broader range of biotypes (Figure 1).

New and updated manually annotated genes and transcripts from July 2016 to June 2018. For both human (left) and mouse (right) the numbers of completely new genes and transcripts, updated genes and transcripts and the total number of manually added or edited genes and transcripts for each of four broad categories of annotation. A new gene annotation can represent a completely de novo locus with no overlap with pre-existing annotation or the reclassification of an existing complex locus into multiple loci to better represent the biology of the locus inferred from transcriptomic and/or proteomic data. A new transcript represents the annotation of a unique exon-intron structure, including novel alternative splicing at an annotated locus. Updated genes and transcripts represent pre-existing loci or transcript models that have been edited to improve the representation of biotype (e.g. changed from lncRNA to protein-coding) or structure (e.g. by extension, addition of novel exons).


Computational annotation of small RNAs

We annotate small non-coding RNAs (sncRNAs) using a variety of mechanisms. Specifically, miRNA annotations are imported directly from miRBase (22), while tRNAs are identified ab initio using tRNAScan-SE (23) although they are not included directly in the gene set. For other classes of sncRNA, including small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs) and small Cajal body-specific RNAs (scaRNAs), we use a homology-based, computational pipeline (24), which first compares sequences of known RNA families in Rfam (25) to the genome using BLAST (26). This initial step reduces the genomic search space and excludes sequences with sub-optimal alignments to the genome. We define putative sncRNA models after clustering top BLAST hits and evaluating these predictions by performing sequence and structure searches against covariance models in the Infernal suite of tools (27).
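
The clustering criterion used in the pipeline is not given here. As a rough illustration of what clustering hits can mean, the sketch below groups hits that overlap on the same chromosome and strand and keeps the best-scoring hit of each group. The hit dictionaries, their field names and the overlap rule are assumptions made for the example, not the pipeline's actual logic.

```python
def cluster_hits(hits):
    """Group overlapping hits per (chrom, strand); return one list of hits per cluster."""
    clusters = []
    current_key, current_end = None, None
    for hit in sorted(hits, key=lambda h: (h["chrom"], h["strand"], h["start"])):
        key = (hit["chrom"], hit["strand"])
        if key == current_key and hit["start"] <= current_end:
            clusters[-1].append(hit)  # overlaps the running cluster
            current_end = max(current_end, hit["end"])
        else:
            clusters.append([hit])    # start a new cluster
            current_key, current_end = key, hit["end"]
    return clusters

# Hypothetical hits with 1-based inclusive coordinates and an alignment score.
hits = [
    {"chrom": "chr1", "strand": "+", "start": 100, "end": 180, "score": 90.0},
    {"chrom": "chr1", "strand": "+", "start": 150, "end": 220, "score": 120.0},
    {"chrom": "chr1", "strand": "+", "start": 500, "end": 560, "score": 75.0},
    {"chrom": "chr1", "strand": "-", "start": 160, "end": 200, "score": 60.0},
]

clusters = cluster_hits(hits)
best_hits = [max(cluster, key=lambda h: h["score"]) for cluster in clusters]
for hit in best_hits:
    print(hit["chrom"], hit["strand"], hit["start"], hit["end"], hit["score"])
```

Each surviving hit would then be a candidate region to pass on to a slower, more specific search.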


Pseudogene annotations across 18 mouse strains were generated using a combination of manual annotation liftover and computational methods. Additionally, we were able to annotate 88 new human and 131 new mouse unitary pseudogenes relative to each other. Amongst the strains we find roughly 20 unitary pseudogenes per strain. We identified nearly 3000 ancestral pseudogenes conserved across all strains. Meanwhile, ∼20% of the pseudogenes in each strain are strain specific. In line with previous results in human, 15% of pseudogenes exhibit transcriptional activity (bioRxiv:


The genome is typically represented as a linear sequence, split over multiple chromosomes, and data are linked to the genome by occupying a range of positions on the sequence. These data fall into two broad categories. First, there are the annotations, such as gene models, transcription factor binding site predictions, GC percentage, polymorphisms, and conservation scores. Such annotations are highly processed and are often served by public databases such as NCBI or EBI. Second, there are primary experimental measurements, such as read alignments from high-throughput sequencing. Data integration, within and between those two categories, is made possible by treating the data as ranges on the genome, which acts as a common scaffold. Thus, ranges play a central role in genomic data analysis, and statistical tools should consider ranges to be as fundamental as quantitative and categorical data types.

For example, ranges are integral to the manipulation of gene model annotations. Examples include deriving candidate promoter regions, finding introns, calculating the total exonic length of a transcript or finding the exonic regions that are unique to a particular transcript in an alternatively spliced gene. Ranges also play a central role in the analysis of experimental data, where they are used to represent read alignments. In the analysis of ChIP-seq data, it is typical to calculate the depth of alignment coverage, which then serves as input to calling algorithms which output peaks as ranges. These ranges are then annotated according to their overlap with and proximity to other ranges, such as gene structures. Similarly, for RNA-seq data, analysts measure gene expression based on counting the alignments overlapping exons.
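
The gene-model manipulations listed above reduce to simple arithmetic on exon ranges. The following Python sketch assumes 1-based inclusive coordinates and non-overlapping exons. The function names and the default `upstream` window of 2000 bases are arbitrary choices for illustration and are not taken from the software described in this article.

```python
def introns(exons):
    """Gaps between consecutive exons of one transcript."""
    exons = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

def exonic_length(exons):
    """Total number of bases covered by the exons."""
    return sum(end - start + 1 for start, end in exons)

def promoter(exons, strand, upstream=2000):
    """Candidate promoter: `upstream` bases before the transcription start site."""
    exons = sorted(exons)
    if strand == "+":
        tss = exons[0][0]
        return (max(1, tss - upstream), tss - 1)
    tss = exons[-1][1]  # on the reverse strand the TSS is the highest coordinate
    return (tss + 1, tss + upstream)

exons = [(1000, 1200), (1500, 1600), (2000, 2300)]
print(introns(exons))
print(exonic_length(exons))
print(promoter(exons, "+", upstream=500))
print(promoter(exons, "-", upstream=500))
```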

All these analyses depend on specialized, range-based algorithms and data structures. For example, computations on gene models involve set operations on ranges, including intersection, union and complement. Coverage calculation is important for detecting regions of enrichment and for producing visual summaries. Overlap and nearest neighbor detection is fundamental to the annotation of ChIP-seq peaks, estimating expression from RNA-seq data and many other integrative analyses.
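The operations named above can be sketched in a few lines. The following Python is only an illustration of the concepts (union, intersection, gaps and coverage on closed, 1-based integer ranges); it is not the IRanges implementation, and the function names are chosen here for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # closed interval [start, end], 1-based


def reduce_ranges(ranges: List[Range]) -> List[Range]:
    """Union: merge overlapping or adjacent ranges into disjoint ones."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Range], b: List[Range]) -> List[Range]:
    """Intersection of two range sets, by a linear sweep over both."""
    a, b = reduce_ranges(a), reduce_ranges(b)
    out: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def gaps(ranges: List[Range]) -> List[Range]:
    """Complement within the span of the ranges (e.g. introns between exons)."""
    merged = reduce_ranges(ranges)
    return [(prev[1] + 1, nxt[0] - 1) for prev, nxt in zip(merged, merged[1:])]


def coverage(ranges: List[Range], length: int) -> List[int]:
    """Depth at positions 1..length, computed with a difference array."""
    diff = [0] * (length + 2)
    for start, end in ranges:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, running = [], 0
    for pos in range(1, length + 1):
        running += diff[pos]
        depth.append(running)
    return depth
```

For instance, gaps applied to a transcript's exons yields its introns, and coverage applied to read alignments yields the depth vector that a peak caller would consume.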

The primary argument for storing ranges in specialized, formal data structures is efficiency, in terms of both implementation and language. The notion of ranges can be made explicit in the application programming interface (API), permitting the expression of algorithms in a succinct and readable language that illustrates concepts instead of exposing implementation details. Another goal is interoperability: by using the same data structures, multiple routines, spread across different packages, can operate on the data without cumbersome conversions. Also, a data structure can be accessed through an abstraction that hides the details of the optimized implementation, and this results in looser coupling between components. Together, these benefits lead to more robust, maintainable software.

Data structures should support the storage of per-range metadata, because genomic data is multivariate and consists of much more than the ranges alone. This enables the storage of gene identifiers and other symbols with the gene ranges, and the peak heights or confidence scores with the peak ranges. Some metadata merit special treatment, such as the chromosome name and the strand. Also necessary is a data structure for storing summaries and processing results for a common set of ranges across multiple samples. Such a structure would hold, for example, the RNA-seq per-exon counts or a set of variant calls. Finally, there should be support for storing hierarchies of ranges, at least for one level of nesting, to represent, for example, the nesting of exons into transcripts. Whether it is appropriate to treat the exons as individual ranges or the transcript as a compound range depends on the use case; both should be supported.
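A minimal sketch of such a structure, written in Python rather than R, may make the requirements concrete. The class and field names below (GenomicRange, Transcript, metadata) are assumptions made for this illustration and do not mirror the GenomicRanges classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class GenomicRange:
    seqname: str                     # chromosome name, treated specially
    start: int                       # 1-based, inclusive
    end: int                         # inclusive
    strand: str = "*"                # "+", "-" or "*" (unknown)
    metadata: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not exceed end")
        if self.strand not in ("+", "-", "*"):
            raise ValueError("strand must be '+', '-' or '*'")

    @property
    def width(self) -> int:
        return self.end - self.start + 1


@dataclass
class Transcript:
    """One level of nesting: a named group of exon ranges."""
    name: str
    exons: List[GenomicRange]

    def exonic_length(self) -> int:
        # Assumes the exons of one transcript do not overlap.
        return sum(exon.width for exon in self.exons)

    def span(self) -> GenomicRange:
        """The transcript viewed as a single compound range."""
        seqnames = {exon.seqname for exon in self.exons}
        strands = {exon.strand for exon in self.exons}
        if len(seqnames) != 1 or len(strands) != 1:
            raise ValueError("exons must share one seqname and one strand")
        return GenomicRange(
            seqnames.pop(),
            min(exon.start for exon in self.exons),
            max(exon.end for exon in self.exons),
            strands.pop(),
            {"name": self.name},
        )
```

The same object therefore supports both views: iterating over exons treats them as individual ranges, while span() returns the compound range.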

These data structures are represented as classes, through which we communicate the formal definition of each data structure to the programming language. One benefit is that we can defer the regulation of data access and the tracking of data integrity to the language. In the case of functional object-oriented languages, there is another benefit: we can implement behaviors as methods on generic functions. A generic function is one that dispatches to a particular implementation, termed a method, based on the classes of the passed arguments. This means that the same API will exhibit specialized behavior depending on the input. For example, calling start on a range data structure would return the starting positions for the ranges, while calling the same function on a base R time-series object would behave differently.
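The dispatch idea can be imitated in Python with functools.singledispatch. This is an analogy only: R generics can dispatch on several arguments, whereas singledispatch looks at the first argument alone, and the Ranges and TimeSeries classes below are hypothetical stand-ins invented for the example.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


@dataclass
class Ranges:
    starts: List[int]
    widths: List[int]


@dataclass
class TimeSeries:
    values: List[float]
    first_time: float   # time of the first observation
    step: float         # spacing between observations


@singledispatch
def start(x):
    """Generic function: the method is chosen from the class of x."""
    raise TypeError(f"no start() method for {type(x).__name__}")


@start.register
def _(x: Ranges):
    # For ranges: one starting position per range.
    return list(x.starts)


@start.register
def _(x: TimeSeries):
    # For a time series: the time of the first observation.
    return x.first_time
```

Calling start(Ranges([1, 10], [5, 5])) gives the list of start positions, while start(TimeSeries([0.1, 0.2], 2000.0, 0.25)) gives a single time, so one name carries class-specific behavior.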

This paper describes the infrastructure in Bioconductor [1] for the integrative statistical analysis of range-based genomic data. Main features include scalable data structures for annotated genomic ranges and genome-length vectors, and efficient algorithms for overlap detection and other range operations. The packages that form the core of the infrastructure include IRanges, GenomicRanges and GenomicFeatures. Source code for the packages is included in the supplement, under Software S1, S2, and S3, respectively. The IRanges package provides the fundamental range data structures and operations, while GenomicRanges builds upon it to add biological semantics to the metadata, including explicit treatment of sequence name and strand. Finally, GenomicFeatures enables access to and manipulation of gene models and other annotations. Together, these packages support more than 80 other packages in Bioconductor.

Other software tools provide facilities for working with genomic ranges, e.g., bedtools [2] and cisGenome [3]. Those provide UNIX command-line interfaces and rely on common file formats (which are often incompletely specified) to interoperate with other tools, leading to workflows embodied as collections of heterogeneous scripts, system dependencies and data files. Such workflows can be difficult to maintain and challenging to reproduce. In contrast, the Bioconductor infrastructure is tightly integrated with other R packages through in-memory data structures, while still supporting interaction with external tools. The Bioconductor package genomeIntervals provides data structures for representing genomic ranges and utilities, such as overlap detection, that have much in common with the tools described here, but our tools are more extensive and have been more widely adopted.


Detection of DR Mutations

The DR is a major challenge to TB control. In this study, NGS was applied to identify DR mutations in six Beijing lineage M. tuberculosis isolates with different DR profiles. We focused on analyzing nonsynonymous mutations in the coding regions and mutations in the upstream regions of the known drug-resistant genes. Using 2 new rules, we identified 19 mutations associated with DR, including 13 known and 6 novel mutations. How these mutations were identified and their biological backgrounds and implications are discussed below.

Isoniazid-Resistant Mutations

Isoniazid, a major first-line anti-TB drug, is a prodrug that kills M. tuberculosis cells by stopping cell wall synthesis. The prodrug is transformed into the activated form, isonicotinic acyl radical, by KatG (catalase-peroxidase). Isonicotinic acyl radical and oxidized nicotinamide adenine dinucleotide (NAD+) form the isonicotinic acyl–reduced nicotinamide adenine dinucleotide (NADH) complex, which binds to protein InhA and inhibits the synthesis of mycolic acids in the bacterial cell (Rozwarski et al. 1998). The concentration ratio of NADH to NAD+ is regulated by ndh (Vilchèze et al. 2005). Therefore, mutations in inhA, katG, and ndh may cause isoniazid resistance. The transcription of Rv1592c, a gene with unknown function, is induced by isoniazid and mutations in this gene were found in isoniazid-resistant isolates (Ramaswamy et al. 2003; Aragón et al. 2006). Therefore, we also analyzed the mutations in Rv1592c. The mutations we found in these four genes are shown in figure 7a.

Fig. 7. Identified mutations in known DR genes associated with first- and second-line anti-TB drugs. Green and red rectangles denote the known and novel candidate drug-resistant mutations, respectively. The seven drugs studied are presented separately in panels (a)–(f). A mutation that is only identified in DR isolates in which there are no known DR mutations found is considered a DR mutation. The gray rectangles denote mutations deemed not associated with drug resistance because they are found in both DR and DS isolates.

As the mutation R463L in katG and the mutation I322V in Rv1592c are found in all six isolates (fig. 7a), including TCDC1 (a DS isolate), they are considered unrelated to isoniazid resistance. In addition, the mutation R463L in katG is a known lineage marker found in both isoniazid-resistant and -susceptible isolates and is thus considered not associated with isoniazid resistance (Torres et al. 2015).

In TCDC11, an XDR-TB isolate, the resistance to isoniazid can probably be explained by the known isoniazid-resistant mutation S94A in inhA (fig. 7a), although it is a low-level INH-R mutation (Vilchèze et al. 2006). Therefore, the C-1T mutation in the promoter of inhA is not a candidate mutation for isoniazid resistance.

In TCDC4 and TCDC10, two MDR isolates, we found two novel mutations, katG A479E and ndh I68T, but no other potential isoniazid-resistant mutations (fig. 7a). We therefore consider these two novel mutations candidate mutations for isoniazid resistance. In TCDC7, the mutation katG S315T is known to be associated with isoniazid resistance (Yu et al. 2003), so the new mutation Rv1592c I322F is probably not an isoniazid-resistant mutation (fig. 7a).

In TCDC5, the mutation katG R249H is a known mutation for isoniazid resistance (Brossier et al. 2016) and we found no other candidate mutation for isoniazid resistance (fig. 7a).
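The reasoning applied in this subsection can be summarized as a small decision rule. The Python below is a simplified reading of that rule, not the authors' pipeline; the isolate sets are transcribed from the paragraphs above, and the assumption that katG A479E is carried by both TCDC4 and TCDC10 is ours.

```python
from typing import Dict, Set

# Isolates and known isoniazid-resistance mutations as described in the text.
ALL_ISOLATES: Set[str] = {"TCDC1", "TCDC4", "TCDC5", "TCDC7", "TCDC10", "TCDC11"}
SUSCEPTIBLE: Set[str] = {"TCDC1"}
KNOWN: Dict[str, Set[str]] = {
    "TCDC5": {"katG R249H"},
    "TCDC7": {"katG S315T"},
    "TCDC11": {"inhA S94A"},
}


def classify(carriers: Set[str],
             susceptible: Set[str],
             known: Dict[str, Set[str]]) -> str:
    """Classify a mutation from the set of isolates that carry it."""
    # Found in a drug-susceptible isolate: cannot explain resistance.
    if carriers & susceptible:
        return "not associated"
    # A carrier already has a known resistance mutation for this drug.
    if any(known.get(isolate) for isolate in carriers):
        return "explained by known mutation"
    # Only in resistant isolates that lack any known mutation.
    return "candidate"
```

With these inputs, katG R463L (all six isolates) is classified as not associated, Rv1592c I322F (TCDC7 only) as explained by a known mutation, and katG A479E as a candidate.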

Rifampicin-Resistant Mutations

Rifampicin, a major first-line anti-TB drug, binds to the β-subunit of bacterial DNA-dependent RNA polymerase and inhibits the RNA synthesis of M. tuberculosis. Some mutations in the rpoB gene, which encodes the β-subunit of bacterial DNA-dependent RNA polymerase, are known to be associated with resistance to rifampicin (Brandis et al. 2012). In addition, fitness-compensatory mutations of rifampicin-resistant M. tuberculosis have been found in three bacterial DNA-dependent RNA polymerase genes (rpoA, rpoB, and rpoC), which encode the α, β, and β′ subunits of RNA polymerase, respectively (Hughes and Brandis 2013). We therefore examined the mutations in the rpoA, rpoB, and rpoC genes, and figure 7b shows the mutations found in rpoB and rpoC in different M. tuberculosis isolates.

The two amino acids D435 and S450 in rpoB form hydrogen bonds with the critical rifampicin hydroxyl groups at O1 and O2 (Campbell et al. 2001). Therefore, mutations at these two positions are considered candidate mutations for rifampicin resistance (fig. 7b).

In TCDC4, the known rifampicin-resistant mutation rpoB S450L is found.

In TCDC5, the known rifampicin-resistant mutation rpoB D435V is found.

In TCDC7, the known rifampicin-resistant mutation rpoB S450W is found (Casali et al. 2016).

In TCDC10, as in TCDC4, the known rifampicin-resistant mutation rpoB S450L is found. In addition, rpoC P1040R is a novel candidate compensatory mutation associated with rifampicin resistance.

In TCDC11, rpoB S450L is a known rifampicin-resistant mutation and rpoC V483A is considered a compensatory mutation associated with rifampicin resistance.

Pyrazinamide-Resistant Mutations

Pyrazinamide is activated by PZase, the product of pncA, and transformed into POA, which disrupts the assembly of the M. tuberculosis cell membrane by inhibiting fatty acid synthase I (Palomino and Martin 2014). Hence, mutations in pncA may cause pyrazinamide resistance. Figure 7c shows the mutations we found in pncA.

In TCDC4 and TCDC10, the mutation pncA G78V is considered a candidate novel mutation for pyrazinamide resistance.

In TCDC11, we identified pncA V139A, a mutation of unknown mechanism that has been found in pyrazinamide-resistant isolates. A previous study showed that the POA efflux rate was significantly lower in pyrazinamide-resistant isolates than in pyrazinamide-susceptible isolates (Zimic et al. 2012). The POA efflux rate of the pyrazinamide-resistant isolate with pncA V139A was higher than the average rate of pyrazinamide-susceptible isolates (Zimic et al. 2012). However, we still consider pncA V139A a candidate DR mutation of pyrazinamide because it is the only mutation in pncA found in TCDC11 and was found in other PZA-resistant isolates (Sheen et al. 2017). As will be seen later, this conclusion was supported by our functional assay.

Streptomycin-Resistant Mutations

Streptomycin binds to helices 1, 18, 27, and 44 of 16S ribosomal RNA and S12 ribosomal protein (Demirci et al. 2013). Therefore, mutations in 16S ribosomal RNA and rpsL, which encodes the S12 ribosomal protein, may cause streptomycin resistance. In addition, mutations in gidB were also found in streptomycin-resistant isolates (Wong et al. 2011). We therefore investigated the mutations in rrs, rpsL, and gidB. We found the known mutation rpsL K43R (Spies et al. 2011) in all four streptomycin-resistant isolates, TCDC4, TCDC5, TCDC10, and TCDC11 (fig. 7d). The mutation gidB E92D was detected in all M. tuberculosis isolates including the streptomycin-susceptible one and is thus not a candidate mutation for streptomycin resistance (Spies et al. 2011).

In TCDC11, the nucleotide substitution rrs A1401G was found (fig. 7f). However, it is known to be associated with high-level resistance to KM and AMK and only low-level resistance to CPM (Hobbie et al. 2006); it is therefore not the cause of streptomycin resistance in TCDC11.

Fluoroquinolone-Resistant Mutations

The second-line anti-TB drugs include fluoroquinolone and aminoglycoside/polypeptide drugs. Ofloxacin is one of the fluoroquinolone drugs (Ramaswamy and Musser 1998); it binds M. tuberculosis DNA gyrase and prevents the enzyme from relaxing positive supercoils of DNA. Mutations in the quinolone resistance determining region (QRDR) in gyrA and gyrB (QRDR-A and QRDR-B) are known to be associated with fluoroquinolone resistance (Piton et al. 2010). Therefore, we analyzed the mutations in the gyrA and gyrB genes (fig. 7e), and six nonsynonymous mutations in the gyrA gene and one nonsynonymous mutation in the gyrB gene were found in the six M. tuberculosis isolates. The mutations gyrA E21Q, S95T, and G668D were found in both ofloxacin-susceptible and ofloxacin-resistant M. tuberculosis, so they are not candidates for ofloxacin resistance (Farhat et al. 2016). The other three mutations (fig. 7e) are discussed below.

In TCDC4, the G88C mutation in gyrA is located at one of the fluoroquinolone binding sites, QRDR-A (Matrat et al. 2006). It is a known fluoroquinolone-resistance mutation.

In TCDC7, the A90V mutation in gyrA is detected, and like G88C, A90V is at a fluoroquinolone binding site of gyrase A. It is a known mutation for fluoroquinolone resistance (Matrat et al. 2006).

In TCDC11, the D94Y mutation in gyrA is a known fluoroquinolone-resistant mutation (Matrat et al. 2006).

In TCDC10, as in TCDC4, the mutation G88C in gyrA was found. In addition, a novel mutation in gyrB, K247N, was found. It is not located in the QRDR and has not been previously detected in fluoroquinolone-resistant isolates. Our functional assay verified the association between this mutation and ofloxacin resistance (see next section).

Second-Line Injectable Drug-Resistant Mutations

The second-line injectable drugs for tuberculosis treatment include the polypeptide antibiotic capreomycin and the aminoglycoside antibiotics amikacin and kanamycin, which bind to the A site of the 30S subunit of the M. tuberculosis ribosome, causing incorrect translation (Poehlsgaard and Douthwaite 2005). In the six M. tuberculosis isolates under study, only TCDC11 resists the second-line injectable drugs, and we found a known mutation associated with resistance to second-line injectable drugs (fig. 7f). The adenine to guanine substitution at position 1401 in rrs represses the binding of aminoglycosides to the A site of M. tuberculosis ribosomal RNA (Hobbie et al. 2006). The association between rrs A1401G and DR was considered almost 100% specific to both kanamycin and amikacin, whereas the specificity to capreomycin is lower (Jugheli et al. 2009).

Experimental Support of Candidate Drug-Resistant Mutations

The functional assays of two known and two novel drug-resistant mutations showed that in each of these four mutations, the affinity of antibiotics to their targets was reduced, raising the capacity of M. tuberculosis to resist antibiotics. One of the experimentally supported novel DR mutations associated with ofloxacin is K247N in gyrB, which is outside the QRDR-B region that is defined between residue positions 500 and 540 (Pantel et al. 2012). The mechanism by which K247N decreases the efficacy of ofloxacin is unclear, as it is not located in the enzyme catalytic core (Piton et al. 2010). Further structural analysis may be conducted to figure out the DR mechanism of K247N.

Kanamycin and amikacin are aminoglycosides and are important second-line injectable drugs. They bind to the 16S ribosomal RNA and kill bacteria by producing incorrect translation. Structural evidence has shown that the 6′ amino group in ring I of aminoglycoside cannot interact with guanine 1401 of ribosomal RNA via a hydrogen bond. In addition, repulsion between the 6′ amino group of ring I and the N1/N2 amino groups of guanine was also observed. Therefore, the substitution of adenine to guanine at 1401 of the ribosomal RNA prevents aminoglycoside from binding to the RNA and decreases the capacity of aminoglycoside to kill M. tuberculosis (Hobbie et al. 2006). We conducted a functional assay to confirm that nucleotide substitution A1401G in rrs leads to a weaker kanamycin binding to rrs and causes DR.

Genome Sequencing, Assembly, and Annotation

This study applied three NGS technologies to conduct genome sequencing and assembly of five Taiwan M. tuberculosis isolates, including one Euro-American lineage and four Beijing lineage isolates. The obtained circular genomes of one XDR and three MDR isolates are the first Taiwan M. tuberculosis assemblies in circular chromosome form and can be used as the reference for genomic study of Taiwan isolates.

Gene annotation was conducted by Prokka (Seemann 2014) with the addition of third-party tools. However, some genes were incorrectly annotated and some were missed. In order to correct the prediction errors, we developed a reference-guided gene model reannotation pipeline to adjust the gene models. With the developed reannotation pipeline, the gene sequences of 850 TCDC11 genes were revised, 232 short TCDC11 annotated sequences were removed, and 51 missed annotated genes were added to the gene models. We compared the sequence alignment percentage before and after reannotation to evaluate the performance of the reference-guided gene model reannotation pipeline. The sequences of removed genes and revised genes before correction were blasted against the H37Rv genes to identify the alignment percentage of Prokka annotation. The alignment percentage of each gene is defined by the product of alignment identity and the percentage of the sequence aligned, which is the ratio of alignment length to aligned gene length of H37Rv. Adjusted sequences of revised genes and new genes were used to compute the alignment percentage after gene reannotation. Histograms of sequence alignment percentage before and after gene model revision are illustrated by blue and dark magenta bars in supplementary figure S7, Supplementary Material online. The sequence alignment percentage of adjusted gene models is significantly higher than the Prokka one (P-value = 5 × 10^−63), indicating that our reference-guided gene model reannotation pipeline effectively corrected annotation errors.
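The alignment percentage defined above is a one-line computation. The following Python sketch restates that definition; the function name and the example numbers are hypothetical and are not taken from the study's data.

```python
def alignment_percentage(identity_pct: float,
                         alignment_length: int,
                         ref_gene_length: int) -> float:
    """Alignment identity (in percent) times the fraction of the
    reference (H37Rv) gene covered by the alignment."""
    if ref_gene_length <= 0:
        raise ValueError("reference gene length must be positive")
    return identity_pct * (alignment_length / ref_gene_length)


# Hypothetical example: 98% identity over 900 bp of a 1000 bp reference gene.
example = alignment_percentage(98.0, 900, 1000)
```

In this hypothetical case the score is about 88.2, so a gene model that is truncated relative to its H37Rv counterpart is penalized even when the aligned part is nearly identical.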

Toxin–Antitoxin System Gene Mutations

The toxin–antitoxin (TA) system is essential for bacteria to adapt to external stress. Toxin MazF3, MazF6, and MazF9 of the ribonuclease MazEF TA system are considered to respond to antibiotics and may induce drug tolerance (Tiwari et al. 2015). Our genome assembly and annotation led to the identification of the MazEF TA system in all M. tuberculosis isolates we studied. However, point mutations were found in mazF3, mazF6, and mazF8 in the isolates (table 3). The mutations T65I in MazF3 and G41V in MazF8 are found in all of the six Taiwan Beijing lineage isolates studied, including DS and resistant ones. These two mutations are thus not associated with DR.


Many important crops are recent allopolyploids with different sets of subgenomes that were derived from the interspecific hybridization between related species (Cheng et al., 2018; Zhang et al., 2019a). However, it is often a formidable task to obtain a high-quality polyploid genome assembly due to the large genome size and highly similar homeologous subgenomes that tend to greatly increase the complexity of assembly graphs. In particular, the repetitive and complex regions present a challenge, as next-generation sequencing (NGS) or second-generation sequencing platforms generally produce short reads that are incapable of spanning and resolving repetitive regions. In the last few years, several new technologies have become available to drastically improve existing reference genomes (Wang et al., 2019b; Zhang et al., 2020), for instance long-read sequencing, including single-molecule real-time (SMRT) sequencing and Oxford Nanopore, and chromosome conformation capture (Hi-C). Based on these technologies, chromosome-level assemblies of allopolyploid genomes were achieved. For example, the high-quality, chromosomal-scale reference genome of quinoa (Chenopodium quinoa Willd., 2n = 4x = 36) was successfully produced using SMRT sequencing coupled with BioNano, Hi-C and genetic maps (Jarvis et al., 2017). With the further development of sequencing technologies, it is likely that the quality of polyploid genomes could be much improved compared with the earlier draft genomes, providing richer information for the genetic diversity and molecular breeding of economic crops.

Allotetraploid oilseed rape (B. napus L., AACC, 2n = 38) is a member of the Brassicaceae family and is thought to have been derived ∼7500 years ago from the hybridization of two diploid parental genomes, Brassica rapa (AA, 2n = 20) (Wang et al., 2011) and Brassica oleracea (CC, 2n = 18) (Liu et al., 2014), and subsequent genome doubling (Chalhoub et al., 2014). There are three ecotype groups of B. napus, including spring, winter and semi-winter, that are adapted to different geographical environments (Lu et al., 2019). The widespread cultivation of the rapeseed crop has increased its exposure to diseases caused by various pathogens, including blackleg (Leptosphaeria maculans and L. biglobosa), clubroot (Plasmodiophora brassicae) and Sclerotinia stem rot (Sclerotinia sclerotiorum), leading to serious declines in yield (Neik et al., 2017; Sanogo et al., 2015; Van de Wouw et al., 2016; Wei et al., 2017).

The first reference genome sequence of B. napus, derived from the European winter-type cultivar ‘Darmor-bzh’, was previously published (Chalhoub et al., 2014) but was largely incomplete due to the limitation of the read length available at the time. The ‘Darmor-bzh’ genome sequence was assembled with NGS short reads and contains numerous sequencing gaps with missing sequences or errors, which makes it difficult to use for downstream applications. Subsequently, another European winter-type cultivar, ‘Tapidor’, was also assembled with NGS short reads, leading to a suboptimal quality assembly (Bayer et al., 2017). Additionally, a Chinese semi-winter-type cultivar ‘Ningyou7’ (NY7) with ‘double-high’ traits (high erucic acid and high glucosinolate quality) was assembled from mostly NGS short reads, while the sequencing gaps were filled in by PacBio reads (Zou et al., 2019). An alternative ‘double-low’ semi-winter-type B. napus cultivar, ‘ZS11’, attracted the attention of the B. napus community with its high oil content and high seed production, with the first genome release (ZS11_NGS) completed based on a hybrid strategy using BAC clones and NGS short reads (Sun et al., 2017), resulting in a similarly lower quality assembly. Here, we report a much-improved assembly of the ZS11 genome (ZS11_PB) through a de novo assembly integrating long PacBio SMRT reads, genetic maps and Hi-C technologies. Our new B. napus genome assembly is more complete than both Darmor-bzh and ZS11_NGS, as shown by a variety of completeness and contiguity metrics. Finally, we have annotated a near-complete set of NLR genes owing to the high contiguity and completeness of our assembly, which offers a valuable resource for future genetic and disease resistance studies in B. napus and comparative genomic studies in Brassicaceae.

Blast Results

The Blast visualisation module requires that you have a gff3 formatted set of features which you then exported as DNA or protein, and blasted. The reason is easy to understand: when you extract DNA/protein sequences for Blasting, this process loses information about where these sequences were along the genome. The results from Blast retain the identifiers from the DNA/protein sequences, so we need to “map” these identifiers to proper features with locations.

The best way to accomplish this is through the gffread tool, which can clean up a gff3 file and export various features, optionally translating them. With these outputs, the cleaned features and fasta formatted sequences, you can Blast the sequences, and then supply the resulting Blast XML outputs in addition to the cleaned features, allowing a script to re-associate these Blast results to their original locations along the genome.
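The re-association step can be pictured with a short Python sketch. It is not the script used by the JBrowse tool: it assumes the Blast hits have already been parsed into (query id, subject, e-value) tuples, that each query id equals the ID attribute of a gff3 feature, and the function names are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple

Location = Tuple[str, int, int, str]   # (seqid, start, end, strand)
Hit = Tuple[str, str, float]           # (query id, subject, e-value)


def index_gff3(lines: Iterable[str]) -> Dict[str, Location]:
    """Map each feature ID in a gff3 file to its genomic location."""
    locations: Dict[str, Location] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        if "ID" in attributes:
            locations[attributes["ID"]] = (seqid, int(start), int(end), strand)
    return locations


def place_hits(hits: Iterable[Hit],
               locations: Dict[str, Location]) -> List[tuple]:
    """Attach genome coordinates to every hit whose query id is known."""
    placed = []
    for query_id, subject, evalue in hits:
        if query_id in locations:
            placed.append((*locations[query_id], subject, evalue))
    return placed
```

Hits whose identifier is missing from the gff3 file are silently dropped here; a real script would more likely report them.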

Hands-on: Building a JBrowse for Blast results

  1. JBrowse with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • “Select the reference genome”: genome.fa
    • “Genetic Code”: 11. The Bacterial, Archaeal and Plant Plastid Code
    • In “Track Group”:
      • “Insert Track Group”
        • In “Annotation Track”:
          • “Insert Annotation Track”
            • “Track Type”: Blast XML
              • “BlastXML Track Data”: blastp vs swissprot.xml
              • “Features used in Blast Search”: blastp genes.gff3
              • “Minimum Gap Size”: 5
              • “Is this a protein blast search?”: Yes
              • In “JBrowse Feature Score Scaling & Coloring Options [Advanced]”:
                • “Color Score Algorithm”: Based on score
                  • “JBrowse style.color function’s score scaling”: Blast scaling
  2. Execute and then explore the resulting data.

    Figure 5: Blast results, coloured according to their e-value. This sort of track is commonly used to help genome annotators have additional genomic context when they are annotating.

The Vmatch large scale sequence analysis software

This is the web-site for Vmatch, a versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements.

Features of Vmatch

The Vmatch manual gives many examples on how to use Vmatch. Here are the program’s most important features.

Persistent index

Usually, in a large scale matching problem, extensive portions of the sequences under consideration are static, i.e. they do not change much over time. Therefore it makes sense to preprocess this static data to extract information from it and to store this in a structured manner, allowing efficient searches. Vmatch does exactly this: it preprocesses a set of sequences into an index structure. This is stored as a collection of several files constituting the persistent index. The index efficiently represents all substrings of the preprocessed sequences and, unlike many other sequence comparison tools, allows matching tasks to be solved in time that is independent of the size of the index. Different matching tasks require different parts of the index, but only the required parts of the index are accessed during the matching process.
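To illustrate the build-once, query-many idea, here is a toy Python sketch using a plain suffix array stored in a JSON file. It is not Vmatch's index format or algorithm: Vmatch uses enhanced suffix arrays, whereas the binary search below still depends logarithmically on the index size, and all function names are invented for the example.

```python
import json
import os
import tempfile
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def save_index(path: str, text: str, sa: List[int]) -> None:
    with open(path, "w") as handle:
        json.dump({"text": text, "sa": sa}, handle)


def load_index(path: str):
    with open(path) as handle:
        data = json.load(handle)
    return data["text"], data["sa"]


def find(text: str, sa: List[int], pattern: str) -> List[int]:
    """All occurrences of pattern, via two binary searches on the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def demo() -> List[int]:
    """Build and store the index once, then reload it and query it."""
    path = os.path.join(tempfile.mkdtemp(), "toy_index.json")
    text = "ACGTACGA"
    save_index(path, text, build_suffix_array(text))
    loaded_text, loaded_sa = load_index(path)
    return find(loaded_text, loaded_sa, "ACG")
```

The costly step (sorting the suffixes) happens once, and every later query only reads the stored index.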

Alphabet independency

Most software tools for sequence analysis are restricted to DNA and/or protein sequences. In contrast, Vmatch can process sequences over any user defined alphabet not larger than 250 symbols. Vmatch fully implements the concept of symbol mappings, denoting alphabet transformations. These allow the user to specify that different characters in the input sequences should be considered identical in the matching process. This feature is used to group similar amino acids, for example.
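The effect of a symbol mapping can be shown with a small Python sketch. The amino acid grouping used here is an arbitrary choice made for illustration, and the code does not reproduce Vmatch's symbol map file format.

```python
from typing import Dict, Iterable


def make_symbol_map(groups: Iterable[str]) -> Dict[int, str]:
    """Build a translation table that maps every symbol of a group
    to the group's first symbol."""
    table: Dict[str, str] = {}
    for group in groups:
        representative = group[0]
        for symbol in group:
            if symbol in table:
                raise ValueError(f"symbol {symbol!r} is in two groups")
            table[symbol] = representative
    return str.maketrans(table)


# Hypothetical grouping of similar amino acids.
GROUPS = ["ILVM", "KR", "DE", "ST", "FYW"]
SYMBOL_MAP = make_symbol_map(GROUPS)


def normalize(sequence: str) -> str:
    """Rewrite a sequence over the reduced alphabet before matching."""
    return sequence.translate(SYMBOL_MAP)
```

After normalization, MKLVR and IRVLK both become IKIIK, so an exact matcher run on the normalized sequences treats them as identical.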


Vmatch allows a multitude of different matching tasks to be solved using the persistent index. Every matching task is basically characterized by (1) the kind of sequences to be matched, (2) the kind of matches sought, (3) additional constraints on the matches, and (4) the kind of postprocessing to be done with the matches.

In the standard case, Vmatch matches sequences over the same alphabet. Additionally, DNA sequences can be matched against a protein sequence index in all six reading frames. Finally, DNA sequences can be translated in all six reading frames and compared against themselves.
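As a reminder of what translation in all six reading frames means, here is a self-contained Python sketch using the standard genetic code. The frame labels (+1 to +3, -1 to -3) are a common convention assumed here, not necessarily the labels Vmatch prints.

```python
from itertools import product
from typing import Dict

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> Dict[str, str]:
    """Protein translations of the three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For the sequence ATGGCCTAA, frame +1 reads MA* and frame -1 (the reverse complement TTAGGCCAT) reads LGH.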

Where appropriate, Vmatch can compute the following kinds of matches, using state-of-the-art algorithms:

  • maximal and supermaximal repeats using the algorithms of M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2:53–86, 2004
  • branching tandem repeats using the algorithm of M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch. The enhanced suffix array and its applications to genome analysis. In Proceedings of the Second Workshop on Algorithms in Bioinformatics, pages 449–463. Lecture Notes in Computer Science 2452, Springer-Verlag, 2002
  • maximal (unique) substring matches using the algorithms of S. Kurtz. A Time and Space Efficient Algorithm for the Substring Matching Problem, 2002
  • complete matches using the algorithms of U. Manber and E.W. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, 1993 and [86]

To compute degenerate substring matches or degenerate repeats, each kind of match (with the exception of tandem repeats and complete matches) can be taken as an exact seed and extended by either of two different strategies:

  • the maximum error extension strategy, as described in S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res., 29(22):4633–4642, 2001 for repeat detection,

Matches can be selected according to their length, their E-value, their identity value, or match score.

In the standard case, a match is displayed as an alignment including positional information. Alternatively, a match can directly be postprocessed in different ways:

  • inverse output, i.e. reporting of substrings not covered by a match.
  • masking of substrings covered by a match.
  • clustering of sequences according to the matches found.
  • chaining of matches, i.e. finding optimal subsets of matches which do not cross, using the algorithms described in M.I. Abouelhoda and E. Ohlebusch. A Local Chaining Algorithm and its Applications in Comparative Genomics. In Proc. 3rd Worksh. Algorithms in Bioinformatics (WABI 2003), number 2812 in Lecture Notes in Bioinformatics, pages 1–16. Springer-Verlag, 2003

N. Volfovsky, B.J. Haas, and S.L. Salzberg. A Clustering Method for Repeat Analysis in DNA Sequences. Genome Biology, 2(8):research0027.1–0027.11, 2001

Efficient algorithms and data structures

Vmatch is based on enhanced suffix arrays, described by Abouelhoda, Kurtz & Ohlebusch (2004). This data structure has been shown to be as powerful as suffix trees, with the advantage of a reduced space requirement and reduced processing time. Careful implementation of the algorithms and data structures incorporated in Vmatch has led to exceedingly fast and robust software, allowing very large sequence sets to be processed quickly. The 32-bit version of Vmatch can process up to 400 million symbols, if enough memory is available. For large server class machines (e.g. SUN-Sparc/Solaris, Intel Xeon/Linux, Compaq-Alpha/Tru64) Vmatch is available as a 64-bit version, enabling gigabytes of sequences to be processed.

Flexible input format

The most common formats for input sequences (Fasta, Genbank, EMBL, and SWISSPROT) are accepted. The user does not have to specify the input format. It is automatically recognized. All input files can contain an arbitrary number of sequences. Gzipped compressed inputs are accepted.

Customized output and match selection

Vmatch's output can be parsed by other programs easily. Furthermore, several options allow for its customization. XML output is available and new output formats can easily be incorporated without changing Vmatch's program code. Certain matches can easily be selected by user defined criteria, without intermediate output and subsequent parsing.

The parts of Vmatch

Up until now we have referred to Vmatch as a collection of programs. In the following we use the same name, vmatch (in typewriter font), for the most important program in this collection. Besides vmatch , there are the following programs available:

  1. mkvtree constructs the persistent index and stores it on files.
  2. mkdna6idx constructs an index for a DNA sequence after translating this in all six reading frames.
  3. vseqinfo delivers information about indexed database sequences.
  4. vstree2tex outputs a representation of the index in LaTeX format. It can be used, for example, for educational or debugging purposes.
  5. vseqselect selects indexed sequences satisfying specific criteria.
  6. vsubseqselect selects substrings of a specified length range from an index.
  7. converts an index from big endian to little endian architectures, or vice versa.
  8. vmatchselect sorts and selects matches delivered by vmatch.
  9. chain2dim computes optimal chains of matches from files in Vmatch format.
  10. matchcluster computes clusters of matches from files in Vmatch format.

Here is an overview of the dataflow in Vmatch.

Related tools

There are several tools which are based on the persistent index of Vmatch :

Genalyzer is a graphical user interface to visualize the output of Vmatch in form of a match graph. For details see

J.V. Choudhuri, C. Schleiermacher, S. Kurtz, and R. Giegerich. Genalyzer: Interactive visualization of sequence similarities between entire genomes. Bioinformatics , 20:1964–1965, 2004

Genalyzer is not available any more.

MGA is a program to compute multiple alignments of complete genomes. For details see

M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics , 18(Suppl. 1):S312–S320, 2002

Multimat is a program to compute multiple exact matches between three or more genome size sequences. For details see

E. Ohlebusch and S. Kurtz. Space efficient computation of rare maximal exact matches between multiple sequences. J. Comp. Biol. , 15(4):357–377, 2008

Please contact Stefan Kurtz if you are interested in using Multimat.

PossumSearch is a program to search for position specific scoring matrices. For details, see

M. Beckstette, R. Homann, R. Giegerich, and S. Kurtz. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics , 7:389, 2006

GenomeThreader is a software tool to compute gene structure predictions. The gene structure predictions are calculated using a similarity-based approach where additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments. GenomeThreader uses the matching capabilities of Vmatch to efficiently map the reference sequence to a genomic sequence. For details, see

G. Gremme, V. Brendel, M.E. Sparks, and S. Kurtz. Engineering a software tool for gene prediction in higher organisms. Information and Software Technology , 47(15):965–978, 2005

Biopieces is a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. Some Biopieces depend on Vmatch.

Previous and Current Usages

We provide an annotated bibliography listing papers which applied Vmatch and briefly describe the tasks for which Vmatch was used. We omit our own papers. The references were collected by a search in Google Scholar (which, as of Jan 2, 2016, retrieved 397 results).

Usages in Plant Genome Research

    V. Brendel, S. Kurtz, and V. Walbot. Comparative genomics of Arabidopsis and Maize: Prospects and limitations. Genome Biology , 3(3):reviews1005.1–1005.6, 2002

In this work Vmatch was used to compute a non-redundant set from a large collection of protein sequences from Zea-Maize.

Similar applications are described in

Q. Dong, L. Roy, M. Freeling, V. Walbot, and V. Brendel. ZmDB, an integrated Database for Maize Genome Research. Nucleic Acids Res. , 31:244–247, 2003.

S. Dash, J. Van Hemert, L. Hong, R. P. Wise, and J. A. Dickerson. PLEXdb: gene expression resources for plants and plant pathogens. Nucleic Acids Res. , 40(Database issue):D1194–1201, Jan 2012

PLEXdb provides a Vmatch -based web-service to match PLEXdb probes.

This work describes PlantGDB, which provides a service called [email protected] for genome wide pattern searches in plant sequences. The service is based on Vmatch .

M. Lindow and A. Krogh. Computational evidence for hundreds of non-conserved plant micrornas. BMC Genomics , 6(1):119, 2005

In this work Vmatch was used for three different tasks:

  • Searching spliced mRNA in the Arabidopsis genome to detect micromatches of length at least 20 with a maximum of 2 mismatches.
  • Finding matches of length at least 15 with at most one mismatch between predicted mature miRNA sequences and a set of ESTs as well as sequences from the Arabidopsis Small RNA Project (ASRP).
  • Aligning and performing single linkage clustering of the predicted mature miRNA sequences. Candidate pairs aligning over at least 17 bases, allowing an edit distance of 1, were grouped in the same family.
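
Single linkage clustering, as used in the last task above, can be sketched as follows. This is a simplified illustration, not the procedure of the cited paper: it links equal-length sequences that differ in at most k positions (Hamming distance) instead of aligning them with an edit distance, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def hamming_at_most(a: str, b: str, k: int) -> bool:
    """True if a and b have equal length and differ in at most k positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= k


def single_linkage(seqs: List[str], k: int = 1) -> List[List[str]]:
    """Group sequences so that any two linked by <= k mismatches share a cluster."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # every sufficiently similar pair merges the two clusters it connects
    for i, j in combinations(range(len(seqs)), 2):
        if hamming_at_most(seqs[i], seqs[j], k):
            parent[find(i)] = find(j)

    clusters: Dict[int, List[str]] = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())
```

Because linkage is transitive, two sequences that differ in more than k positions can still end up in the same family when a third sequence is close to both.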

M. Turmel, C. Otis, and C. Lemieux. The Chloroplast Genome Sequence of Chara vulgaris Sheds New Light into the Closest Green Algal Relatives of Land Plants. Molecular Biology and Evolution , 23:1324–1338, 2006

In these papers Vmatch was used to search and compare repeated elements in different chloroplast DNA.

In this work Vmatch was used to compare target genes of the tomato Chs RNAi to a tomato gene index.

M. Lindow, A. Jacobsen, S. Nygaard, Y. Mang, and A. Krogh. Intragenomic matching reveals a huge potential for mirna-mediated regulation in plants. PLOS Comput. Biol , 3(11):e238, 2007

In this work Vmatch was used to search different plant genomes for matches of length at least 20 with a maximum of 2 mismatches. Here the fact that Vmatch is an exhaustive search tool is important.

J.-C. de Cambiaire, C. Otis, M. Turmel, and C. Lemieux. The chloroplast genome sequence of the green alga leptosira terrestris: multiple losses of the inverted repeat and extensive genome rearrangements within the trebouxiophyceae. BMC Genomics , 8(1):213, 2007

In this work Vmatch was used to determine the presence of shared repeated elements of minimum length 30, with up to 10% mismatches, in different sequence sets from the green alga Leptosira terrestris.

S. Ossowski, K. Schneeberger, R.M. Clark, C. Lanz, N. Warthmann, and D. Weigel. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. , 18:2024–2033, 2008

In this work Vmatch was used to map millions of short sequence reads to the A. thaliana genome. Up to four mismatches and up to three indels were allowed in the matching process. The seed size was chosen to be 0. The reads were aligned using the best match strategy by iteratively increasing the allowed number of mismatches and gaps at each round.
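
The best match strategy can be illustrated with a small sketch: the error threshold is relaxed round by round, and the search stops at the first round that yields a hit. The sketch is an assumption-laden simplification, not the pipeline of the cited paper: it scans the genome naively, handles mismatches only (no indels), and uses hypothetical function names.

```python
from typing import List, Tuple


def hits_with_mismatches(read: str, genome: str, k: int) -> List[int]:
    """Positions where read aligns to genome with at most k mismatches (naive scan)."""
    m = len(read)
    return [
        p for p in range(len(genome) - m + 1)
        if sum(a != b for a, b in zip(read, genome[p:p + m])) <= k
    ]


def best_match(read: str, genome: str, max_mismatches: int = 4) -> Tuple[int, List[int]]:
    """Relax the mismatch threshold round by round; stop at the first round with hits.

    Returns (mismatches_allowed, positions), or (-1, []) if nothing is found.
    """
    for k in range(max_mismatches + 1):
        hits = hits_with_mismatches(read, genome, k)
        if hits:
            return k, hits
    return -1, []
```

Stopping at the first successful round means a read is never reported at a worse position when a better one exists.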

F. De Bona, S. Ossowski, K. Schneeberger, and G. Ratsch. Optimal spliced alignments of short sequence reads. Bioinformatics , 24(16):i174–180, 2008

In this work Vmatch was used to map millions of short sequence reads to the A. thaliana genome. Vmatch was part of a multi-step pipeline, combining a fast matching algorithm (Vmatch) for initial read mapping and an optimal alignment algorithm based on dynamic programming (QPALMA) for high quality detection of splice sites.

A. G. L. Assunção, E. Herrero, Y-F. Lin, B. Huettel, S. Talukdar, C. Smaczniak, R. GH Immink, M. Van Eldik, M. Fiers, H. Schat, et al. Arabidopsis thaliana transcription factors bzip19 and bzip23 regulate the adaptation to zinc deficiency. Proceedings of the National Academy of Sciences , 107(22):10296–10301, 2010

In this work Vmatch was used for motif searching in different plant genomes.

Andrea L Eveland, Namiko Satoh-Nagasawa, Alexander Goldshmidt, Sandra Meyer, Mary Beatty, Hajime Sakai, Doreen Ware, and David Jackson. Digital gene expression signatures for maize development. Plant physiology , 154(3):1024–1039, 2010

In this work Vmatch was used to map unique consensus sequence tags to the maize reference genome.

Jean-Simon Brouard, Christian Otis, Claude Lemieux, and Monique Turmel. The exceptionally large chloroplast genome of the green alga floydiella terrestris illuminates the evolutionary history of the chlorophyceae. Genome biology and evolution , 2:240, 2010

In this work Vmatch was used to identify and cluster repeated sequences in the Floydiella chloroplast genome.

Hubert Rehrauer, Catharine Aquino, Wilhelm Gruissem, Stefan R Henz, Pierre Hilson, Sascha Laubinger, Naira Naouar, Andrea Patrignani, Stephane Rombauts, Huan Shu, et al. Agronomics1: a new resource for arabidopsis transcriptome profiling. Plant Physiology , 152(2):487–499, 2010

In this work Vmatch was used to calculate direct and reverse complementary matches of length 17 bp or greater with edit distance 1 or less between five nuclear chromosomes and mitochondrial and chloroplast genome sequences.

R. S. Sekhon, H. Lin, K. L. Childs, C. N. Hansey, C. R. Buell, N. de Leon, and S. M. Kaeppler. Genome-wide atlas of transcription during maize development. Plant J. , 66(4):553–563, May 2011

In this work Vmatch was used to search probe sequences against the maize genome and the cDNA sequences of the official maize gene models.

M. Dassanayake, D. H. Oh, J. S. Haas, A. Hernandez, H. Hong, S. Ali, D. J. Yun, R. A. Bressan, J. K. Zhu, H. J. Bohnert, and J. M. Cheeseman. The genome of the extremophile crucifer Thellungiella parvula. Nat. Genet. , 43(9):913–918, Sep 2011

In this work Vmatch was used for clustering sequences assembled from 454-reads of Thellungiella parvula , a model for the evolution of plant adaptation to extreme environments.

E. M. Willing, M. Hoffmann, J. D. Klein, D. Weigel, and C. Dreyer. Paired-end RAD-seq for de novo assembly and marker design without available reference. Bioinformatics , 27(16):2187–2193, Aug 2011

In this work Vmatch was used for grouping short reads into pools representing the same RAD tag.

L. Gao, Y. Zhou, Z.-W. Wang, Y.-J. Su, and T. Wang. Evolution of the rpoB-psbZ region in fern plastid genomes: notable structural rearrangements and highly variable intergenic spacers. BMC Plant Biology , 11(1):64, 2011

In this work Vmatch was used for detecting and clustering repetitive sequences in diverse fern plastid genomes.

D. B. Sloan, A. J. Alverson, J. P. Chuckalovcak, M. Wu, D. E. McCauley, J. D. Palmer, and D. R. Taylor. Rapid evolution of enormous, multichromosomal genomes in flowering plant mitochondria with exceptionally high mutation rates. PLoS Biol. , 10(1):e1001241, Jan 2012

In this work Vmatch was used to precisely define the boundaries of all repeats with 100% sequence identity.

Anuja Dubey, Andrew Farmer, Jessica Schlueter, Steven B Cannon, Brian Abernathy, Reetu Tuteja, Jimmy Woodward, Trushar Shah, Benjamin Mulasmanovic, Himabindu Kudapa, et al. Defining the transcriptome assembly and its use for genome dynamics and transcriptome profiling studies in pigeonpea ( Cajanus cajan l.). DNA research , 18(3):153–164, 2011

In this work Vmatch was used to cluster sequences based on their six-frame translation.

Rachit K Saxena, R Varma Penmetsa, Hari D Upadhyaya, Ashish Kumar, Noelia Carrasquilla-Garcia, Jessica A Schlueter, Andrew Farmer, Adam M Whaley, Birinchi K Sarma, Gregory D May, et al. Large-scale development of cost-effective single-nucleotide polymorphism marker assays for genetic mapping in pigeonpea and comparative mapping in legumes. DNA research , 19(6):449–461, 2012

In this work Vmatch was used to identify reciprocal best matches between the pigeonpea sequences and other legume sequences.
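
A reciprocal best match is a pair in which each sequence is the other's top-scoring match. As a minimal sketch of that definition, assuming the matches have already been computed and reduced to (query, subject, score) triples (a hypothetical representation, not a Vmatch output format):

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]   # (query id, subject id, score)


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for q, s, score in hits:
        if q not in best or score > best[q][1]:
            best[q] = (s, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)
```

Ties are resolved in favour of the first hit seen, which is a simplification; a real analysis would need an explicit tie-breaking rule.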

B. Z. Haznedaroglu, D. Reeves, H. Rismani-Yazdi, and J. Peccia. Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms. BMC Bioinformatics , 13:170, 2012

In this work Vmatch was used for assembly clustering and optimization of contigs for Neochloris oleoabundans (a green microalga of the class Chlorophyceae).

M. M. Martis, S. Klemme, A. M. Banaei-Moghaddam, F. R. Blattner, J. Macas, T. Schmutzer, U. Scholz, H. Gundlach, T. Wicker, H. Šimková, P. Novak, P. Neumann, M. Kubalakova, E. Bauer, G. Haseneyer, J. Fuchs, J. Dolezel, N. Stein, K. F. Mayer, and A. Houben. Selfish supernumerary chromosome reveals its origin as a mosaic of host genome and organellar sequences. Proc. Natl. Acad. Sci. U.S.A. , 109(33):13343–13346, Aug 2012

In this work Vmatch was used to match reads against a repeat library to identify the content of the repetitive DNA per sequence read.

K. L. Childs, R. M. Davidson, and C. R. Buell. Gene coexpression network analysis as a source of functional annotation for rice genes. PloS one , 6(7):e22196, 2011

In this work Vmatch was used to align individual probes to representative gene models.

E. I. Severing, A. D. J. van Dijk, and R. C. H. J. van Ham. Assessing the contribution of alternative splicing to proteome diversity in arabidopsis thaliana using proteomics data. BMC Plant Biology , 11(1):82, 2011

In this work Vmatch was used for performing exact searches with peptides against the filtered proteome of A. thaliana .

P. Wolff, I. Weinhofer, J. Seguin, P. Roszak, C. Beisel, M.T. Donoghue, C. Spillane, M. Nordborg, M. Rehmsmeier, and C. Köhler. High-resolution analysis of parent-of-origin allelic expression in the arabidopsis endosperm. PLoS Genet , 7(6):e1002126–e1002126, 2011

In this work Vmatch was used to map RNAseq reads, allowing up to two mismatches (option -h 2) and generating maximal substring matches that are unique in some reference dataset (option -mum cand).

D. J. Fleetwood, A. K. Khan, R. D. Johnson, C. A. Young, S. Mittal, R. E. Wrenn, U. Hesse, S. J. Foster, C. L. Schardl, and B. Scott. Abundant degenerate miniature inverted-repeat transposable elements in genomes of epichloid fungal endophytes of grasses. Genome Biol Evol , 3:1253–1264, 2011

In this work Vmatch was used to identify terminal inverted repeats of length range 10-65 bp, ≥ 80% identity, maximum inter-TIR distance 650 bp in genomes of epichloid fungal endophytes of grasses.

K. L. Childs, K. Konganti, and C. R. Buell. The Biofuel Feedstock Genomics Resource: a web-based portal and database to enable functional genomics of plant biofuel feedstock species. Database (Oxford) , 2012:bar061, 2012

In this work Vmatch was used to match putative unique transcript sequence assemblies.

Y. Chen, B. J. Cassone, X. Bai, M. G. Redinbaugh, and A. P. Michel. Transcriptome of the plant virus vector Graminella nigrifrons, and the molecular interactions of maize fine streak rhabdovirus transmission. PLoS ONE , 7(7):e40613, 2012

In this work Vmatch was used for refining assemblies of Illumina reads in the context of a transcriptome project for plant virus vector Graminella nigrifrons .

N. M. Krishnan, S. Pattnaik, P. Jain, P. Gaur, R. Choudhary, S. Vaidyanathan, S. Deepak, A. K. Hariharan, P. B. Krishna, J. Nair, L. Varghese, N. K. Valivarthi, K. Dhas, K. Ramaswamy, and B. Panda. A draft of the genome and four transcriptomes of a medicinal and pesticidal angiosperm Azadirachta indica. BMC Genomics , 13:464, 2012

In this work Vmatch was used for clustering repeats and for building a consensus repeat library in the context of genome and transcriptome projects for Azadirachta indica , a medicinal and pesticidal angiosperm.

Z. Liu, S. Kumari, L. Zhang, Y. Zheng, and D. Ware. Characterization of mirnas in response to short-term waterlogging in three inbred lines of zea mays. PLoS One , 7(6):e39786, 2012

In this work Vmatch was used to map unique consensus sequence tags to the maize reference genome and to predict targets of novel miRNAs.

A. Bousios, Y. A. I. Kourmpetis, P. Pavlidis, E. Minga, A. Tsaftaris, and N. Darzentas. The turbulent life of sirevirus retrotransposons and the evolution of the maize genome: more than ten thousand elements tell the story. The Plant Journal , 69(3):475–488, 2012

In this work Vmatch was used for masking Long Terminal Repeats in the Maize Genome Sequence.

P. Hernandez, M. Martis, G. Dorado, M. Pfeifer, S. Galvez, S. Schaaf, N. Jouve, H. Šimková, M. Valarik, J. Dolezel, and K. F. Mayer. Next-generation sequencing and syntenic integration of flow-sorted arms of wheat chromosome 4A exposes the chromosome structure and gene content. Plant J. , 69(3):377–386, Feb 2012

R. Philippe, E. Paux, I. Bertin, P. Sourdille, F. Choulet, C. Laugier, H. Šimková, J. Šafář, A. Bellec, S. Vautrin, et al. A high density physical map of chromosome 1bl supports evolutionary studies, map-based cloning and sequencing in wheat. Genome Biol , 14(6):R64, 2013

Vmatch was used to mask repetitive DNA.

G. T. Howe, J. Yu, B. Knaus, R. Cronn, S. Kolpak, P. Dolan, W. W. Lorenz, and J. F. Dean. A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation. BMC Genomics , 14:137, 2013

In this work Vmatch was used to cluster 40 010 assembled isotigs.

R. Karlova, J. C. van Haarst, C. Maliepaard, H. van de Geest, A. G. Bovy, M. Lammers, G. C. Angenent, and R. A. de Maagd. Identification of microRNA targets in tomato fruit development using high-throughput sequencing and degradome analysis. J. Exp. Bot. , 64(7):1863–1878, Apr 2013

In this work Vmatch was used to preprocess short reads in the context of identifying microRNA targets in tomato fruit development.

S. M. Gross, J. A. Martin, J. Simpson, M. J. Abraham-Juarez, Z. Wang, and A. Visel. De novo transcriptome assembly of drought tolerant CAM plants, Agave deserti and Agave tequilana. BMC Genomics , 14:563, 2013

In this work Vmatch was used in an all-vs-all comparison to bin contigs into loci based on a minimum of 200 bp sequence overlap in the context of transcriptome assembly for two Agave-species.

U. Kanter, W. Heller, J. Durner, J. B. Winkler, M. Engel, H. Behrendt, A. Holzinger, P. Braun, M. Hauser, F. Ferreira, K. Mayer, M. Pfeifer, and D. Ernst. Molecular and immunological characterization of ragweed (Ambrosia artemisiifolia L.) pollen after exposure of the plants to elevated ozone over a whole growing season. PLoS ONE , 8(4):e61518, 2013

In this work Vmatch was used to align 454-reads to assembled isotigs for Ragweed pollen.

K. G. Kugler, G. Siegwart, T. Nussbaumer, C. Ametz, M. Spannagl, B. Steiner, M. Lemmens, K. F. X. Mayer, H. Buerstmayr, and W. Schweiger. Quantitative trait loci-dependent analysis of a gene co-expression network associated with fusarium head blight resistance in bread wheat (triticum aestivum l.). BMC Genomics , 14(1):728, 2013

In this work Vmatch was used for comparing gene sets.

Mihaela M Martis, Ruonan Zhou, Grit Haseneyer, Thomas Schmutzer, Jan Vrána, Marie Kubaláková, Susanne König, Karl G Kugler, Uwe Scholz, Bernd Hackauf, et al. Reticulate evolution of the rye genome. The Plant Cell , 25(10):3685–3698, 2013

In this work Vmatch was used to detect repetitive DNA content of chromosomal survey sequences from the Rye genome.

D. Kopeckỳ, M. Martis, J. Číhalíková, E. Hřibová, J. Vrána, J. Bartoš, J. Kopecká, F. Cattonaro, Š. Stočes, Petr Novák, et al. Flow sorting and sequencing meadow fescue chromosome 4f. Plant Physiology , 163(3):1323–1337, 2013

D. Kopeckỳ, M Martis, J Číhalíková, E Hřibová, J Vrána, J Bartoš, et al. Genomics of meadow fescue chromosome 4f. Plant Physiol , 163:1323–1337, 2013

Vmatch was used for identifying repetitive DNA content in contigs of meadow fescue chromosome 4F assembled from Illumina short reads.

F. Jay, Y. Wang, A. Yu, L. Taconnat, S. Pelletier, V. Colot, J.-P. Renou, and O. Voinnet. Misregulation of AUXIN RESPONSE FACTOR 8 underlies the developmental abnormalities caused by three distinct viral silencing suppressors in Arabidopsis . PLoS Pathog , 7(5):e1002035–e1002035, 2011

X. Wang, D. Weigel, and L. M. Smith. Transposon variants and their effects on gene expression in arabidopsis. PLoS Genet , 9(2):e1003255, 2013

Vmatch was used for mapping siRNA sequences to the Arabidopsis thaliana genome.

E. Henaff, C. Vives, B. Desvoyes, A. Chaurasia, J. Payet, C. Gutierrez, and J. M. Casacuberta. Extensive amplification of the E2F transcription factor binding sites by transposons during evolution of Brassica species. Plant J. , 77(6):852–862, Mar 2014

In this work Vmatch was used for the identification of binding motifs.

W Wang, G Haberer, H Gundlach, C Gläßer, TCLM Nussbaumer, MC Luo, A Lomsadze, M Borodovsky, RA Kerstetter, J Shanklin, et al. The Spirodela polyrhiza genome reveals insights into its neotenous reduction fast growth and aquatic lifestyle. Nature Communications , 5, 2014

In this work Vmatch was used for masking one sequence set with another and for mapping miRNA sequences of all plant species present in a reference database to the whole-genome assembly of Spirodela polyrhiza.

M. D. Logacheva, M. I. Schelkunov, M. S. Nuraliev, T. H. Samigullin, and A. A. Penin. The plastid genome of mycoheterotrophic monocot petrosavia stellaris exhibits both gene losses and multiple rearrangements. Genome biology and evolution , 6(1):238–246, 2014

In this work Vmatch was used for repeat detection.

X. Wang, W. Shi, and T. Rinehart. Transcriptomes That Confer to Plant Defense against Powdery Mildew Disease in Lagerstroemia indica. Int J Genomics , 2015:528395, 2015

In this work Vmatch was used to eliminate redundancies in assemblies of Illumina reads in the context of studying plant defense mechanisms.

H. Ashrafi, A. M. Hulse-Kemp, F. Wang, S. S. Yang, X. Guan, D. C. Jones, M. Matvienko, K. Mockaitis, Z. J. Chen, D. M. Stelly, et al. A long-read transcriptome assembly of cotton (l.) and intraspecific single nucleotide polymorphism discovery. The Plant Genome , 2015

In this work Vmatch was used for clustering to determine a non-redundant set of assembled contigs.

K. Ustyantsev, O. Novikova, A. Blinov, and G. Smyshlyaev. Convergent evolution of ribonuclease h in ltr retrotransposons and retroviruses. Molecular biology and evolution , 32(5):1197–1207, 2015

In this work Vmatch was used for clustering sequences based on their RT and aRNH domain.

M. Helguera, M. Rivarola, B. Clavijo, M. M. Martis, L. S. Vanzetti, S. González, I. Garbus, P. Leroy, H. Šimková, M. Valárik, et al. New insights into the wheat chromosome 4d structure and virtual gene order, revealed by survey pyrosequencing. Plant Science , 233:200–212, 2015

In this work Vmatch was used for identifying repeats in contigs assembled from 454-reads.

Qi Shen, Jun Yang, Chaolong Lu, Bo Wang, and Chi Song. The complete chloroplast genome sequence of perilla frutescens (l.). Mitochondrial DNA , preprint:1–2, 2015

In this work Vmatch was used for identifying inverted repeats in chloroplast genomes.

Bahman Panahi, Seyed Abolghasem Mohammadi, Reyhaneh Ebrahimi Khaksefidi, Jalil Fallah Mehrabadi, and Esmaeil Ebrahimie. Genome-wide analysis of alternative splicing events in Hordeum vulgare : Highlighting retention of intron-based splicing and its possible function through network analysis. FEBS letters , 589(23):3564–3575, 2015

In this work Vmatch was used to identify contaminations and repetitive elements by comparison of mRNA sequences to vector, bacterial and repeat databases.

SN Wolfenbarger, MC Twomey, DM Gadoury, BJ Knaus, NJ Grünwald, and DH Gent. Identification and distribution of mating-type idiomorphs in populations of podosphaera macularis and development of chasmothecia of the fungus. Plant Pathology , 2015

In this work Vmatch was used to cluster contigs of different assemblies into groups of homologous sequences.

Jun Yang, Chaolong Lu, Qi Shen, Yuying Yan, Changjiang Xu, and Chi Song. The complete chloroplast genome sequence of Fagopyrum cymosum. Mitochondrial DNA , pages 1–2, 2015

In this work Vmatch was used to identify inverted repeats in chloroplast genomes.

Usages in Microbial Genome Research

    The KPATH system, developed at the Lawrence Livermore National Laboratories, and described in

J.P. Fitch, S.N. Gardner, T.A. Kuczmarski, S. Kurtz, R. Myers, L.L. Ott, T.R. Slezak, E.A. Vitalis, A.T. Zemla, and P.M. McCready. Rapid development of nucleic acid diagnostics. Proceedings of the IEEE , 90(11):1708–1721, 2002

T. Slezak, T. Kuczmarski, L. Ott, C. Torres, D. Medeiros, J. Smith, B. Truitt, N. Mulakken, M. Lam, E. Vitalis, A. Zemla, C.E. Zhou, and S. Gardner. Comparative Genomics Tools Applied to Bioterrorism Defense. Briefings in Bioinformatics , 4(2):133–149, 2003

used Vmatch to detect unique substrings in large collections of DNA sequences. These unique substrings serve as signatures allowing for rapid and accurate diagnostics to identify pathogenic bacteria and viruses. A similar application is reported in S.N. Gardner, T.A. Kuczmarski, E.A. Vitalis, and T.R. Slezak. Limitations of TaqMan PCR for Detecting Viral Pathogens Illustrated by Hepatitis A, B, C, and E Viruses and Human Immunodeficiency Virus. J. of Clinical Microbiology, 41(6):2417–2427, 2003.
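
The notion of a signature can be illustrated with fixed-length substrings (k-mers): a k-mer of the target sequence is a signature candidate if it occurs in none of the other sequences. The sketch below uses plain Python sets instead of an index and ignores reverse complements; it is an illustration of the concept under these assumptions, not the KPATH method, and the function names are hypothetical.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(target: str, background: Iterable[str], k: int) -> Set[str]:
    """k-mers of target that occur in none of the background sequences."""
    unique = kmers(target, k)
    for other in background:
        unique -= kmers(other, k)
    return unique
```

For sequence collections of realistic size the set-based approach quickly runs out of memory, which is one reason an index-based tool is attractive for this task.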

N. Pobigaylo, D. Wetter, S. Szymczak, U. Schiller, S. Kurtz, F. Meyer, T.W. Nattkemper, and A. Becker. Construction of a large signature-tagged mini-Tn5 transposon library and its application to mutagenesis of Sinorhizobium meliloti. Appl Environ Microbiol., 72(6):4329–4337, 2006

In this work Vmatch was used to map signature tags to the genome of S. meliloti .

I. Grissa, G. Vergnaud, and C. Pourcel. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res , 35(Web Server issue):W52–7, 2007

I. Grissa, G. Vergnaud, and C. Pourcel. The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics , 8:172, 2007

used Vmatch to efficiently find maximal repeats, as a first step in localizing Clustered regularly interspaced short palindromic repeats (CRISPRs).

B. Voss, J. Georg, V. Schön, S. Ude, and W. R. Hess. Biocomputational prediction of non-coding RNAs in model cyanobacteria. BMC Genomics, 10:123, 2009

In this work Vmatch was used to map predicted sequences to information about Rho-independent terminators provided by a specific database.

Jeremy Schmutz, Steven B Cannon, Jessica Schlueter, Jianxin Ma, Therese Mitros, William Nelson, David L Hyten, Qijian Song, Jay J Thelen, Jianlin Cheng, et al. Genome sequence of the palaeopolyploid soybean. Nature , 463(7278):178–183, 2010

In this work Vmatch was used to cluster DNA-sequences into families based on their six-frame translation.

Bob Zimmermann, Tanja Gesell, Doris Chen, Christina Lorenz, Renée Schroeder, and J Valcarcel. Monitoring genomic sequences during selex using high-throughput sequencing: neutral selex. PLoS One , 5(2):e9169, 2010

In this work Vmatch was used to align 454-sequences to the E. coli genome and to cluster the sequences.

Fabrice Touzain, Erick Denamur, Claudine Médigue, Valérie Barbe, Meriem El Karoui, Marie-Agnès Petit, et al. Small variable segments constitute a major type of diversity of bacterial genomes at the species level. Genome Biol , 11(4):R45, 2010

In this work Vmatch was used for detecting repeats in three bacterial species.

Klaus FX Mayer, Mihaela Martis, Pete E Hedley, Hana Šimková, Hui Liu, Jenny A Morris, Burkhard Steuernagel, Stefan Taudien, Stephan Roessner, Heidrun Gundlach, et al. Unlocking the barley genome by chromosomal and comparative genomics. The Plant Cell , 23(4):1249–1263, 2011

In this work Vmatch was used for masking repeats in 454-reads.

Smruti Pushalkar, Shrinivasrao P Mane, Xiaojie Ji, Yihong Li, Clive Evans, Oswald R Crasta, Douglas Morse, Robert Meagher, Anup Singh, and Deepak Saxena. Microbial diversity in saliva of oral squamous cell carcinoma. FEMS Immunology & Medical Microbiology , 61(3):269–277, 2011

In this work Vmatch was used to identify distal primers.

J. E. Breitenbach, K. S. Shelby, and H. JR Popham. Baculovirus induced transcripts in hemocytes from the larvae of heliothis virescens. Viruses , 3(11):2047–2064, 2011

In this work Vmatch was used for removing redundant transcripts assembled in an RNA-seq study based on Illumina reads for Heliothis virescens (tobacco budworm), infected with a virus.

LR Triplett, JP Hamilton, CR Buell, NA Tisserat, V. Verdier, F Zink, and JE Leach. Genomic analysis of xanthomonas oryzae isolates from rice grown in the united states reveals substantial divergence from known x. oryzae pathovars. Applied and Environmental Microbiology , 77(12):3930–3937, 2011

In this work Vmatch was used to search unassembled Illumina reads of US and African strains of Xanthomonas oryzae for evidence of transcriptional activator-like effector sequences.

D. A. Hysom, P. Naraghi-Arani, M. Elsheikh, A. C. Carrillo, P. L. Williams, and S. N. Gardner. Skip the alignment: degenerate, multiplex primer and probe design using K-mer matching instead of alignments. PLoS ONE , 7(4):e34560, 2012

In this work Vmatch was used for selecting multiplex compatible, degenerate primers and probes to detect diverse targets such as viruses.

K. S. Shelby and H. JR Popham. Rna-seq study of microbially induced hemocyte transcripts from larval heliothis virescens (lepidoptera: Noctuidae). Insects , 3(3):743–762, 2012

In this work Vmatch was used to identify redundant contigs from de novo exome assemblies.

B. L. Hurwitz and M. B. Sullivan. The Pacific Ocean virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS ONE , 8(2):e57355, 2013

In this work Vmatch was used to identify reads which have no common 20-mers with other reads in the context of a marine viral metagenome project.

X. Zhuo, M. Rho, and C. Feschotte. Genome-wide characterization of endogenous retroviruses in the bat Myotis lucifugus reveals recent and diverse infections. J. Virol. , 87(15):8493–8501, Aug 2013

In this work Vmatch was used for clustering potential complete Endogenous retroviruses of the bat Myotis lucifugus into subfamilies.

B. L. Hurwitz, A. H. Westveld, J. R. Brum, and M. B. Sullivan. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc. Natl. Acad. Sci. U.S.A. , 111(29):10714–10719, July 2014

B. L. Hurwitz, L. Deng, B. T. Poulos, and M. B. Sullivan. Evaluation of methods to concentrate and purify ocean virus communities through comparative, replicated metagenomics. Environ. Microbiol. , 15(5):1428–1440, May 2013

J. R. Brum, B. L. Hurwitz, O. Schofield, H. W. Ducklow, and M. B. Sullivan. Seasonal time bombs: dominant temperate viruses affect southern ocean microbial dynamics. The ISME journal , 2015

Vmatch was used for k -mer analysis in the context of different marine metagenome projects.

C. J. Decker and R. Parker. Analysis of double-stranded rna from microbial communities identifies double-stranded rna virus-like elements. Cell reports , 7(3):898–906, 2014

In this work Vmatch was used for k -mer analysis in the context of microbial communities.

J. Bengtsson-Palme, F. Boulund, J. Fick, E. Kristiansson, and D. G. Larsson. Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Front Microbiol , 5:648, 2014

In this work Vmatch was used in an iterative scheme to construct contigs from reads associated with resistance genes in the context of a shotgun metagenome project.

Nicholas A Be, James B Thissen, Shea N Gardner, Kevin S McLoughlin, Viacheslav Y Fofanov, Heather Koshinsky, Sally R Ellingson, Thomas S Brettin, Paul J Jackson, and Crystal J Jaing. Detection of Bacillus anthracis DNA in complex soil and air samples using next-generation sequencing. PloS one, 8(9), 2013

In this work Vmatch was used to match probe candidate sequences against viral sequences and the human genome sequence.

Birgit Henrich, Madis Rumming, Alexander Sczyrba, Eunike Velleuer, Ralf Dietrich, Wolfgang Gerlach, Michael Gombert, Sebastian Rahn, Jens Stoye, Arndt Borkhardt, et al. Mycoplasma salivarium as a dominant coloniser of Fanconi anaemia associated oral carcinoma. PloS one , 9(3), 2014

In this work Vmatch was used to identify the species of the Streptococcaceae by comparison with the Silva release 115 16S reference sequence database.

Usages in General Web-Servers or Sequence Analysis Software

    Since 2000, the RSA-tools, described in

J. van Helden, A.F. Rios, and J. Collado-Vides. Discovering Regulatory Elements in Non-Coding Sequences by Analysis of Spaced Dyads. Nucleic Acids Res. , 28(8):1808–1818, 2000

and developed by Jacques van Helden, use Vmatch to purge sequences before computing sequence statistics. Similar applications are reported in the following papers:

R.J.M. Hulzink, H. Weerdesteyn, A.F. Croes, M.M.A. Gerats, T. van Herpen, and J. van Helden. In Silico Identification of Putative Regulatory Sequence Elements in the 5’-Untranslated Region of Genes That Are Expressed during Male Gametogenesis. Plant Physiol., 132:75–83, 2003

N. Simonis, S.J. Wodak, G.N. Cohen, and J van Helden. Combining Pattern Discovery and Discriminant Analysis to Predict Gene Co-regulation. Bioinformatics , 20:2370–2379, 2004

N. Simonis, J. van Helden, G.N. Cohen, and S.J. Wodak. Transcriptional regulation of protein complexes in yeast. Genome Biology , 5:R33, 2004.

E. Coward, S.A. Haas, and M. Vingron. SpliceNest: Visualization of Gene Structure and Alternative Splicing Based on EST Clusters. Trends Genet. , 18(1):53–55, 2002

computes gene indices and uses Vmatch to map clustered sequences to large genomes.

e2g is a web-based server that efficiently maps large EST and cDNA data sets to genomic DNA. Using Vmatch makes it possible to significantly extend the size of data that can be mapped in reasonable time. e2g is available as a web service and hosts large collections of EST sequences (e.g. 4.1 million mouse ESTs of 1.87 Gbp) in a precomputed persistent index. For details see

J. Krüger, A. Sczyrba, S. Kurtz, and R. Giegerich. e2g: An interactive web-based server for efficiently mapping large EST and cDNA sets to genomic sequences. Nucleic Acids Res. , 32:W301–W304, 2004.

In this work Vmatch was used to (1) match 130 861 vector-trimmed sequences against the maize repeat database and (2) cluster near-identical sequences.

T. Dezulian, M. Schaefer, R. Wiese, D. Weigel, and D.H. Huson. CrossLink: visualization and exploration of sequence relationships between (micro) RNAs. Nucleic Acids Res. , 34(Web Server Issue):W400–W404, 200

CrossLink is a versatile computational tool that aids in visualizing relationships between RNA sequences (particularly between ncRNAs and their putative target transcripts) in an intuitive and accessible way. Besides BLAST, CrossLink uses Vmatch to reveal the sequence relationships to be visualized.

R. Arnold, T. Rattei, P. Tischler, M.-D. Truong, V. Stümpflen, and H.W. Mewes. SIMAP - The similarity matrix of proteins. Bioinformatics , 21(Suppl. 2):ii42–ii46, 2005

used Vmatch to locate the sequences in SIMAP which are similar to a given query. This is much faster than running BLAST.

Fiers, M.W.E.J. and Van de Wetering, H. and Peeters, T.H.J.M. and van Wijk, J.J. and Nap, J-P. DNAVis: interactive visualization of comparative genome annotations. Bioinformatics , 22(3):354–355, 2005

In this work Vmatch was used to compute similarities between genomes, which are then visualized by the program DNAVis.

P.N. Seibel, J. Krüger, S. Hartmeier, K. Schwarzer, K. Löwenthal, H. Mersch, T. Dandekar, and R. Giegerich. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics , 7:490, 2006

Seibel et al. describe methods for creating web services and give examples which, among other tools, also integrate Vmatch.

J. Krumsiek, R. Arnold, and T. Rattei. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics , 23(8):1026–8, 2007

Gepard uses mkvtree to compute enhanced suffix arrays.

J. Martin, V. M. Bruno, Z. Fang, X. Meng, M. Blow, T. Zhang, G. Sherlock, M. Snyder, and Z. Wang. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics , 11:663, 2010

C. M. Lushbough, D. M. Jennewein, and V. Brendel. The bioextract server: a web-based bioinformatic workflow platform. Nucleic acids research , 39(suppl 2):W528–W532, 2011

uses Vmatch to remove duplicated sequences.

C. M. Lushbough, E. Z. Gnimpieba, and R. Dooley. Life science data analysis workflow development using the bioextract server leveraging the iplant collaborative cyberinfrastructure. Concurrency and Computation: Practice and Experience , 27(2):408–419, 2015

In this work Vmatch was used for removing duplicates in BlastP results. This use is part of a workflow in myExperiment.

Daniel Greuter, Alexander Loy, Matthias Horn, and Thomas Rattei. ProbeBase-an online resource for rRNA-targeted oligonucleotide probes and primers: new features 2016. Nucleic acids research , page gkv1232, 2015

In this work Vmatch was used for probe/primer search functionality in the probeBase database.

Current Usages in Human Genome Research

P.G. Buckley, C. Jarbo, U. Menzel, T. Mathiesen, C. Scott, S.G. Gregory, C.F. Langford, and J.P. Dumanski. Comprehensive DNA Copy Number Profiling of Meningioma Using a Chromosome 1 Tiling Path Microarray Identifies Novel Candidate Tumor Suppressor Loci. Cancer Res., 65(7):2653–2661, 2005

In this work Vmatch was used to reveal long repeats inside human chromosome 1 and long similar regions between human chromosome 1 and all other human chromosomes.

Liang, C. and Wang, G. and Liu, L. and Ji, G. and Liu, Y. and Chen, J. and Webb, J.S. and Reese, G. and Dean, J.F.D. WebTraceMiner: a web service for processing and mining EST sequence trace files. Nucleic Acids Res , 35(Web Server issue):W137–42, 2007

In this work Vmatch was used for vector screening.

Sanne Nygaard, Anders Jacobsen, Morten Lindow, Jens Eriksen, Eva Balslev, Henrik Flyger, Niels Tolstrup, Søren Møller, Anders Krogh, and Thomas Litman. Identification and analysis of miRNAs in human breast cancer and teratoma samples using deep sequencing. BMC Medical Genomics, 2(1):35, 2009

In this work Vmatch was used for mapping short reads.

Christian Cole, Andrew Sobala, Cheng Lu, Shawn R Thatcher, Andrew Bowman, John WS Brown, Pamela J Green, Geoffrey J Barton, and Gyorgy Hutvagner. Filtering of deep sequencing data reveals the existence of abundant Dicer-dependent small RNAs derived from tRNAs. RNA, 15(12):2147–2160, 2009

In this work Vmatch was used for matching reads to sets of RNA sequences and the human genome.

N. Cloonan, S. Wani, Q. Xu, J. Gu, K. Lea, S. Heater, C. Barbacioru, A. L. Steptoe, H. C. Martin, E. Nourbakhsh, et al. MicroRNAs and their isomiRs function cooperatively to target common biological pathways. Genome Biol, 12(12):R126, 2011

In this work Vmatch was used to uniquely map miRNAs against the human genome.

K Takayama, S Tsutsumi, S Katayama, T Okayama, K Horie-Inoue, K Ikeda, T Urano, C Kawazu, A Hasegawa, K Ikeo, et al. Integration of cap analysis of gene expression and chromatin immunoprecipitation analysis on array reveals genome-wide androgen receptor signaling in prostate cancer cells. Oncogene , 30(5):619–630, 2011

In this work Vmatch was used to determine the positions of CAGE tags on the human genome.

Kevin CH Ha, Emilie Lalonde, Lili Li, Luca Cavallone, Rachael Natrajan, Maryou B Lambros, Costas Mitsopoulos, Jarle Hakas, Iwanka Kozarewa, Kerry Fenwick, et al. Identification of gene fusion transcripts by transcriptome sequencing in BRCA1-mutated breast cancers and cell lines. BMC Medical Genomics , 4(1):75, 2011

In this work Vmatch was used to align sections of reads against RefSeq mRNA exon sequences.

Marie J Kidd, Zhiliang Chen, Yan Wang, Katherine J Jackson, Lyndon Zhang, Scott D Boyd, Andrew Z Fire, Mark M Tanaka, Bruno A Gaëta, and Andrew M Collins. The inference of phased haplotypes for the immunoglobulin H chain V region gene loci by analysis of VDJ gene rearrangements. The Journal of Immunology, 188(3):1333–1340, 2012

In this work Vmatch was used to align sets of genes.

Ryonosuke Yamaga, Kazuhiro Ikeda, Joost Boele, Kuniko Horie-Inoue, Ken-ichi Takayama, Tomohiko Urano, Kaoru Kaida, Piero Carninci, Jun Kawai, Yoshihide Hayashizaki, et al. Systemic identification of estrogen-regulated genes in breast cancer cells through cap analysis of gene expression mapping. Biochemical and biophysical research communications , 447(3):531–536, 2014

In this work Vmatch was used to determine the positions of CAGE tags on the human genome.

Current Usages for Different Model Organisms

A. Sczyrba, M. Beckstette, A.H. Brivanlou, R. Giegerich, and C.R. Altmann. XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis. BMC Genomics, 2005

In this work Vmatch was used to cluster 317 242 EST and cDNA sequences from Xenopus laevis. Vmatch was chosen for the following reasons:

  • First, there was no other clustering tool available which could handle large data sets efficiently and which was documented well enough to allow a detailed replication and evaluation of existing clusters.
  • Second, Vmatch identifies similarities between sequences rapidly, and it provides additional options to cluster a set of sequences based on these matches. Furthermore, the Vmatch output provides information about how the clusters were derived. Due to the efficiency of Vmatch, it was possible to perform the clustering for a wide variety of parameters on the complete sequence set. This makes it possible to study the effect of the parameter choice on the clustering.
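The idea of deriving clusters from pairwise matches can be illustrated with a minimal Python sketch. It is not Vmatch's implementation and uses none of its options: it assumes the matches are already available as pairs of sequence identifiers (the function name `cluster_from_matches` and the example identifiers are hypothetical) and applies plain single-linkage clustering with a union-find structure, so that any two sequences connected by a chain of matches end up in the same cluster.

```python
from collections import defaultdict


def cluster_from_matches(seq_ids, matches):
    """Single-linkage clustering of sequences from pairwise matches.

    seq_ids: iterable of sequence identifiers (hypothetical names).
    matches: iterable of (id_a, id_b) pairs, one per reported match.
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    # Every sequence starts as the representative of its own cluster.
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each match merges the clusters of the two matching sequences.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group the sequences by their final representative.
    clusters = defaultdict(list)
    for s in parent:
        clusters[find(s)].append(s)
    return sorted(sorted(members) for members in clusters.values())


# Illustrative input only: five sequences and three pairwise matches.
example_ids = ["est1", "est2", "est3", "est4", "est5"]
example_matches = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = cluster_from_matches(example_ids, example_matches)
print(clusters)
```

Because the matches are computed once and the clustering step is cheap, rerunning it with a match list filtered by different thresholds is one way to study how the parameter choice affects the clusters.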

In this work Vmatch was used to cluster EST sequences of Xenopus laevis.

J.A. Eisen, R.S. Coyne, M. Wu, D. Wu, M. Thiagarajan, J.R. Wortman, J.H. Badger, Q. Ren, P. Amedeo, and K.M. Jones et al. Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote. PLoS Biology , 4(9):e286, 2006

In this work Vmatch was used to search for exact repeats in the macronuclear genome sequence of the ciliate Tetrahymena thermophila.

G. J. Faulkner, A. R. Forrest, A. M. Chalk, K. Schroder, Y. Hayashizaki, P. Carninci, D. A. Hume, and S. M. Grimmond. A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics , 91(3):281–288, Mar 2008

In this work Vmatch was used for mapping

  • 11 567 973 FANTOM3 mouse CAGE tags to the mouse genome with minimum match length of 18 bp, a single internal mismatch allowed, and multiple mismatches allowed at tag ends.
  • Affymetrix GNF probe sequences to transcripts without allowing for mismatches.

In this work Vmatch was used to search small RNA signatures in entire miRNA gene sequences for Arabidopsis and rice.

R. J. Taft, E. A. Glazov, T. Lassmann, Y. Hayashizaki, P. Carninci, and J. S. Mattick. Small RNAs derived from snoRNAs. RNA , 15(7):1233–1240, Jul 2009

In this work Vmatch was used to map small RNA data sets onto the corresponding reference genomes for different model organisms.

C. Plessy, G. Pascarella, N. Bertin, A. Akalin, C. Carrieri, A. Vassalli, D. Lazarevic, J. Severin, C. Vlachouli, R. Simone, et al. Promoter architecture of mouse olfactory receptor genes. Genome research , 22(3):486–497, 2012

In this work Vmatch was used for mapping Illumina reads to the mouse genome.

Nathan J Kenny and Sebastian M Shimeld. Additive multiple k-mer transcriptome of the keelworm Pomatoceros lamarckii (annelida serpulidae) reveals annelid trochophore transcription factor cassette. Development genes and evolution , 222(6):325–339, 2012

In this work Vmatch was used for redundancy removal in the context of transcriptome assembly of a keelworm species.

Cene Gostinčar, Robin A Ohm, Tina Kogej, Silva Sonjak, Martina Turk, Janja Zajc, Polona Zalar, Martin Grube, Hui Sun, James Han, et al. Genome sequencing of four Aureobasidium pullulans varieties: biotechnological potential, stress tolerance, and description of new species. BMC Genomics, 15(1):549, 2014

In this work Vmatch was used to remove redundant contigs in a genome project of four Aureobasidium pullulans varieties.

M. McMullan, A. Gardiner, K. Bailey, E. Kemen, B. J. Ward, V. Cevik, A. Robert-Seilaniantz, T. Schultz-Larsen, A. Balmuth, E. Holub, et al. Evidence for suppression of immunity as a driver for genomic introgressions and host range expansion in races of Albugo candida, a generalist parasite. eLife, 4:e04550, 2015

In this work Vmatch was used for merging assemblies of Illumina sequenced cDNA.

C Morandin, K Dhaygude, J Paviala, K Trontti, C Wheat, and H Helanterä. Caste-biases in gene expression are specific to developmental stage in the ant Formica exsecta. Journal of evolutionary biology, 28(9):1705–1718, 2015

In this work Vmatch was used to combine and scaffold contigs.

Total number of usages: 108


Vmatch is available for download in executable form for the following platforms:


Vmatch has been developed since May 2000 by Stefan Kurtz, a professor of Computer Science at the Center for Bioinformatics, University of Hamburg, Germany.