Variant VCF: AD vs DP?

In my VCF file from GATK, I have the following definitions for AD and DP.

AD - Allelic depths for the ref and alt alleles in the order listed
DP - Approximate read depth (reads with MQ=255 or with bad mates are filtered)

I don't understand the definitions; can anybody explain them in a less technical way?

From what I can read, AD gives the number of reads supporting the reference and variant alleles. But what does DP mean? It doesn't look like the total number of reads spanning a variant, so what is it? How does it differ from AD?


DP is the total number of reads covering a particular position. If you add up the per-allele AD values, you should get a number close to DP; any difference comes down to how reads are filtered when computing each of the two numbers.
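As a concrete illustration (a hypothetical genotype entry, not taken from the question's file), summing AD roughly recovers DP:

```python
# Hypothetical VCF genotype entry for one sample at one heterozygous site:
#   FORMAT  GT:AD:DP    SAMPLE  0/1:18,7:27
ad = (18, 7)   # reads supporting the REF and ALT alleles, after AD's own read filters
dp = 27        # total reads counted at this position, after DP's own read filters

print(sum(ad))  # 25 -- close to DP but not identical,
print(dp)       # 27 -- because the two fields filter reads slightly differently
```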


Intra-host variation and evolutionary dynamics of SARS-CoV-2 populations in COVID-19 patients

As of early February 2021, the causative agent of COVID-19, SARS-CoV-2, had infected over 104 million people, with more than 2 million deaths according to official reports. Understanding the biology and virus-host interactions of SARS-CoV-2 requires knowledge of the mutation and evolution of this virus at both the inter- and intra-host levels. However, despite the many polymorphic sites identified among SARS-CoV-2 populations, intra-host variant spectra and their evolutionary dynamics remain mostly unknown.

Methods

Using high-throughput sequencing of metatranscriptomic and hybrid-captured libraries, we characterized consensus genomes and intra-host single nucleotide variations (iSNVs) in serial samples collected from eight patients with COVID-19. The distribution of iSNVs along the SARS-CoV-2 genome was analyzed, and co-occurring iSNVs among COVID-19 patients were identified. We also compared the evolutionary dynamics of the SARS-CoV-2 populations in the respiratory tract (RT) and the gastrointestinal tract (GIT).

Results

The 32 consensus genomes revealed the co-existence of different genotypes within the same patient. We further identified 40 intra-host single nucleotide variants (iSNVs). Most iSNVs (30/40) were present in a single patient, while ten iSNVs were found in at least two patients or were identical to consensus variants. Comparing allele frequencies of the iSNVs revealed a clear genetic differentiation between intra-host populations from the respiratory tract (RT) and the gastrointestinal tract (GIT), mostly driven by bottleneck events during intra-host migration. Compared to the RT populations, the GIT populations showed better maintenance and more rapid development of viral genetic diversity following the suspected intra-host bottlenecks.

Conclusions

Our findings illustrate the intra-host bottlenecks and evolutionary dynamics of SARS-CoV-2 in different anatomical sites and may provide new insights into the virus-host interactions of coronaviruses and other RNA viruses.


Experimental design

There are numerous library preparation methods (e.g. Nextera, Kapa) for constructing sequencing libraries from extracted DNA. These laboratory methods are beyond the scope of this tutorial; however, we will address a few aspects of study design below.

1. Pooled sequencing vs. individually barcoding samples

One decision researchers need to make when designing their resequencing experiments is whether to pool unbarcoded individuals together in a single sequencing library (termed Pool-seq), or to individually barcode each individual, enabling researchers to demultiplex these individuals for downstream analyses even if they are pooled for sequencing itself. There are pros and cons to both approaches, but essentially the decision comes down to cost and research objective. Many papers have been written about the pros and cons of pooled sequencing, and Schlötterer et al. 2014 provides a nice review and comparison with other methods. Briefly, we outline some of the pros and cons below:

Pooled sequencing: The main advantage for this approach is cost savings on library preparation. If large sample sizes are required for the research objectives, library preparation costs can quickly become a limiting factor. By pooling large numbers of individuals in a single population, researchers would only need to prepare a single sequencing library per pool. However, this method has limitations on possible downstream analyses and potential sequencing biases. This method can yield estimates of allele frequencies from a pooled population, but few statistics beyond that (e.g. haplotype information, linkage disequilibrium). Pool-seq also works best when large numbers of individuals (>40) are pooled together, with pools on the order of hundreds or thousands of individuals being ideal. One of the biggest drawbacks of Pool-seq is that unequal individual representation will produce biases in allele frequency estimates, and without barcodes it is impossible to know if this has occurred. This is less likely to happen with larger sample sizes.

Individually barcoded sequencing: The main advantage of this approach is that individually barcoding reads means that variants can be called for individuals, and with sufficient coverage (see below) it is possible to obtain haplotype information and other useful statistics. As mentioned above, the main drawback of this method is the cost of library preparation. This is becoming less expensive, however, as new kits become available and as it becomes possible to split library preparation reagents into multiple microreactions (e.g. Baym et al. 2015). Therefore, in this tutorial we focus on methods for creating VCFs from individually barcoded samples.

2. Sample sizes

Determining how many individuals you need to sequence depends on what types of analysis you wish to conduct downstream. If the objective of the study is to describe population structure and genetic diversity, very few individuals per population are needed because whole-genome sequencing provides so much information per individual (e.g. Nazareno et al. 2017). If the goal of the study is to perform detailed demographic inference (e.g. with the site frequency spectrum via dadi or fastsimcoal2), small numbers of individuals may be sufficient for detecting older events or testing different models, but larger numbers of individuals may be necessary to detect recent events or estimate parameters (e.g. Robinson et al. 2014).

Identifying allele frequency shifts at specific sites (e.g. looking for Fst outliers) or performing GWAS-type analyses requires larger sample sizes in order to have enough power to detect significant differences after FDR correction across millions of sites.

3. Sequencing depth

Ideally, to confidently call variants from whole-genome resequencing data, diploid organisms should be sequenced to 30x coverage. However, due to limited budgets and differing study goals, it is often possible to sequence to much lower coverage. For many population genomic goals, given a set amount of sequencing to work with (e.g. 100x coverage of the target genome), it is often more advantageous to sequence more individuals at lower coverage (e.g. 100 individuals at 1x coverage) in order to more accurately infer population genetic parameters (Buerkle and Gompert 2012). Because of limited budgets despite falling sequencing costs, an increasing number of tools are available to make use of low-coverage whole-genome resequencing for population genomic inference. For example, the ANGSD and ngsTools packages allow one to calculate site frequency spectra, diversity statistics, PCA, and admixture analyses, among others, based entirely on genotype likelihoods. Other packages, such as MAPGD, allow one to calculate linkage disequilibrium and relatedness using genotype likelihoods. By not actually calling genotypes, and instead inferring parameters from genotype likelihoods across individuals in a population, these programs avoid many of the biases associated with low-coverage genome data (e.g. Han et al. 2014).
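To make the genotype-likelihood idea concrete, here is a minimal sketch (not the actual model used by ANGSD or MAPGD) of the likelihood of the reads at one site under each possible diploid genotype, assuming independent reads and a single symmetric per-base error rate:

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Likelihood of the observed read counts under each diploid genotype,
    assuming independent reads and a symmetric per-base error rate."""
    n = n_ref + n_alt
    p_alt = {"hom_ref": err, "het": 0.5, "hom_alt": 1 - err}  # P(alt read | genotype)
    return {g: comb(n, n_alt) * p**n_alt * (1 - p)**n_ref for g, p in p_alt.items()}

# At 2x coverage the likelihoods stay diffuse, which is why low-coverage methods
# propagate them instead of committing to a single genotype call.
print(genotype_likelihoods(1, 1))    # heterozygote only mildly favoured
print(genotype_likelihoods(20, 18))  # at higher depth the genotype is unambiguous
```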

4. Sequencing mode

For whole-genome resequencing studies, it is almost always recommended to use paired-end sequencing. As genome coverage is generally a limiting factor, it is relevant that the cost per base is much lower for paired-end than for single-end data. In addition, paired-end reads generally map more reliably to reference genomes, which is highly advantageous, especially for low-coverage data.


Results and discussion

Data representation and challenges

Our initial inspection of the mutation score matrix showed that the C-scores range from 0 to 1417.14; the distribution of scores for the top ten variant genes is shown in Fig. 1. Comparison against the COSMIC database shows that nine of these ten genes (the exception being FAM38A) have evidence of abundant accumulation of somatic mutations in large population screens [15].

Distribution of total mutational scores for the top ten variant genes. The top 10 most heavily mutated genes include several proven cancer-associated genes, including MUC4 and OBSCN

Somatic mutation profiles of BC patients are very sparse, unlike other data types such as gene expression or methylation, in which nearly all genes or markers are assigned a quantitative value in every patient. Even clinically identical patients may share no more than a single mutation [16–18]. As a result, the main data structure contains a very large proportion of zero-valued entries (96 %). In addition, from a machine learning perspective, having a limited number of patients (far fewer patients than affected genes in a cohort) introduces a dimensionality challenge commonly known as the “curse of dimensionality”. In this study, we face this challenge with a sample-to-feature ratio of roughly 1:50 (358/18,117) in the main data structure.

To overcome these challenges, two approaches are commonly used: feature extraction and feature selection. Feature extraction transforms the existing features into a lower-dimensional space; widely used examples include principal component analysis (PCA) and linear discriminant analysis (LDA). Feature selection, in contrast, selects a subset of features without applying any transformation. Both approaches increase the sample-to-feature ratio and decrease the sparseness, making clustering both feasible and more effective. In this study, we used feature selection: we ranked the features (genes) in decreasing order of their variance and selected the top n features for clustering (see methods for more details). We optimized n to 854 genes in our clustering method.
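The selection step itself is simple; a sketch with NumPy (the matrix here is random and only stands in for the real patients x genes C-score matrix) could look like this:

```python
import numpy as np

# Random non-negative matrix standing in for the real patients x genes C-score matrix
rng = np.random.default_rng(0)
scores = rng.poisson(0.05, size=(358, 18117)).astype(float)

n = 854                                   # number of genes retained in the study
gene_var = scores.var(axis=0)             # per-gene variance across patients
top_idx = np.argsort(gene_var)[::-1][:n]  # indices of the n most variable genes
reduced = scores[:, top_idx]              # 358 x 854 matrix passed on to clustering
```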

Classification of breast cancers based on somatic mutations

Unsupervised clustering is the task of grouping a set of unlabeled samples such that samples in the same group are more similar, by a specified measure, to each other than to samples in other groups. Several methods aim to achieve this, such as k-means clustering, hierarchical clustering, and expectation maximization (EM) algorithms. However, these methods perform poorly, or fail to reach a solution, when applied to sparse data such as ours. We therefore chose NMF because of its proven performance on biological data [19–21]. NMF was introduced in its modern formulation by Lee and Seung [21] as a method to decompose images.

As a factorization method, the NMF algorithm takes our mutation score matrix as input and decomposes it into two smaller matrices (a basis matrix W and a coefficient matrix H). The coefficient matrix H is then used to make sample-to-cluster assignments; refer to the methods for more details.
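A generic sketch of this decomposition and of the cluster assignment from H is shown below, using scikit-learn's NMF rather than the Brunet implementation used in the study, and random data in place of the real genes x patients matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# X: genes x patients matrix of non-negative mutation scores (854 x 358 in the study;
# random values are used here only to make the sketch runnable)
rng = np.random.default_rng(0)
X = rng.gamma(shape=0.3, scale=1.0, size=(854, 358))

model = NMF(n_components=3, init="nndsvd", max_iter=1000, random_state=0)
W = model.fit_transform(X)   # 854 x 3 basis matrix (metagenes)
H = model.components_        # 3 x 358 coefficient matrix

# Each patient (column of H) is assigned to the metagene with the largest coefficient
clusters = H.argmax(axis=0)  # array of length 358 with values 0, 1, 2
```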

Using the NMF clustering algorithm on our dataset, we stably clustered the samples into three groups using the top 854 genes, i.e., those with the highest variance in mutation scores across all samples. The three groups, Cluster 1, 2, and 3, contain 169, 121, and 68 patients, respectively. Refer to the methods section for more details.

In Fig. 2, we show a representation of the input data in the mutation score matrix, focusing only on the top 50 variant genes for illustration. As can be seen, the data are very sparse (most cells are colored blue, indicating a zero score), which makes most clustering approaches inapplicable. Additional file 1: Figure S1 and Fig. 3 show the output matrices from the decomposition of the mutation score matrix that we input to the NMF algorithm. Note that multiplying the two output matrices approximately reconstructs the input data. Additional file 1: Figure S1 shows the basis matrix (W), which is not used in the scope of this study but could serve to cluster the genes. Figure 3 displays the coefficient matrix (H), in which the rows represent metagenes, a compact representation of all the genes, and the columns represent patients. We use this matrix to make sample-to-cluster assignments by assigning each sample to the cluster with the highest metagene value, i.e., the dark red color (see methods section for details).

Input matrix with C-scores of the top 50 mutated genes. The heat map shows the 50 most heavily mutated genes. The columns represent patients (358) and the rows represent genes. One challenge of the dataset is its extreme sparseness, which can be seen in the heat map: most cells are colored very close to blue, indicating a C-score of 0, with the exception of the first few columns. We identified that the main data structure is composed of 96 % zeros

Coefficient matrix (H). The coefficient matrix (H), 3 × 358 in size, is used for assigning samples to clusters. The columns of the matrix represent patients and the rows represent metagenes. We generated 3 metagenes that are used to cluster patients into 3 groups. The number of metagenes (the rank of the clustering) is determined by running the algorithm iteratively over a range of biologically reasonable parameters, as explained in the methods section

Figure 4 illustrates the stability of the clustering by displaying the consensus matrix, which was generated from 100 NMF runs using Brunet's approach [22] (explained in the methods section). We used the silhouette score of the consensus matrix to determine the optimum number of genes and clusters. In an ideal clustering, we expect entries close to either 1 or 0, indicating the probability that two samples do or do not fall in the same cluster, respectively, which produces solid colored blocks. A value of one represents the highest probability that two samples are in the same cluster (red blocks), and a value of zero denotes the opposite (blue blocks). In Fig. 4 it can be seen that the dataset clearly separates into three distinct groups.
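The consensus-matrix construction can be sketched as follows, again with scikit-learn's NMF as a stand-in for the Brunet variant; entries count how often each pair of patients co-clusters across runs:

```python
import numpy as np
from sklearn.decomposition import NMF

def consensus_matrix(X, k=3, n_runs=100):
    """Fraction of runs in which each pair of samples (columns of X) co-clusters."""
    n_samples = X.shape[1]
    C = np.zeros((n_samples, n_samples))
    for seed in range(n_runs):
        H = NMF(n_components=k, init="random", max_iter=500,
                random_state=seed).fit(X).components_
        labels = H.argmax(axis=0)
        C += labels[:, None] == labels[None, :]
    return C / n_runs  # entries near 1 or 0 throughout indicate a stable clustering
```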

Consensus matrix. The consensus matrix is 358 × 358 in size and illustrates the stability of the clustering. In the ideal case, all entries are expected to be either 0 or 1, producing solid colored blocks. The bar on top indicates the clinical stage of each patient. The silhouette score of this matrix is 0.958, which indicates a very stable clustering

Characterization of discovered clusters

We investigated the clinical significance of the discovered clusters by comparing the BC stage of the patients in each cluster. For this purpose, we analyzed the distribution of patients according to the disease stage provided in the TCGA data. We found that Cluster 1 was dominated by early-stage patients, while Cluster 3 had a much higher proportion of late-stage patients than Cluster 1 (Fisher's exact test p-value = 0.02048, Table 1). As can be seen in Table 1, the stage ratio (number of early-stage patients over late-stage patients) for Cluster 1 is more than two-fold higher than that of Cluster 3; we therefore refer to Cluster 1 as the early-stage-enriched cluster, Cluster 2 as the mixed cluster, and Cluster 3 as the late-stage-enriched cluster. This separation of patients by disease stage indicates that our clustering method can successfully discriminate breast cancer patients by disease stage using only the somatic mutational profiles derived from their exome sequencing data.
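The stage-enrichment comparison reduces to Fisher's exact test on a 2 x 2 table of patient counts; a sketch (the counts below are placeholders, not the values from Table 1):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table of patient counts: rows are clusters, columns are stages.
# The real counts are those in Table 1; the numbers below are placeholders only.
table = [[120, 49],   # Cluster 1: early stage, late stage
         [ 38, 30]]   # Cluster 3: early stage, late stage

odds_ratio, p_value = fisher_exact(table)
print(p_value)  # the study reports p = 0.02048 for its Table 1 counts
```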

Next, we compared the somatic mutation profiles of patients between the early- and late-stage-enriched clusters (Cluster 1 vs. Cluster 3). We found 358 genes with significantly higher mean mutation scores in the late-stage-enriched cluster (Cluster 3) than in the early-stage-enriched cluster (Cluster 1) (Wilcoxon rank-sum test, FDR < 0.1), while no gene had a significantly higher mean mutation score in Cluster 1 than in Cluster 3. This interesting finding indicates that these genes may have accumulated deleterious mutations leading to the progression of breast cancer into advanced disease states. The tumor suppressor genes APC and BRCA2 and the oncogene MLL are among these 358 genes. Table 2 shows the top 25 genes with significantly higher mutation scores in the late-stage-enriched cluster.
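Per-gene testing with FDR control can be sketched like this (random matrices stand in for the two clusters' mutation scores; the study compares its 169 and 68 patients):

```python
import numpy as np
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

# Random matrices stand in for the per-cluster patients x genes mutation scores
rng = np.random.default_rng(1)
scores_c1 = rng.gamma(0.3, 1.0, size=(169, 854))  # early-stage-enriched cluster
scores_c3 = rng.gamma(0.4, 1.0, size=(68, 854))   # late-stage-enriched cluster

pvals = np.array([ranksums(scores_c3[:, g], scores_c1[:, g]).pvalue
                  for g in range(scores_c1.shape[1])])
rejected, _, _, _ = multipletests(pvals, alpha=0.1, method="fdr_bh")
print(rejected.sum())  # genes passing FDR < 0.1, analogous to the study's 358
```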

We stratified the 358 genes into different gene families using the Gene Set Enrichment Analysis (GSEA) [23] tool as shown in Table 3. We observe that a significant proportion of the genes belong to transcription factor and protein kinase gene families, which are well known to be related to the progression of BC [24, 25]. Table 4 shows the assignment of these genes to functionally distinct gene families.

Network analysis of differentially mutated genes

We carried out a network analysis of the top 25 highly mutated genes (Table 2) in the late-stage-enriched cluster, relative to the early-stage-enriched cluster, to understand the functional relationships among these genes. The network in Fig. 5, generated using the Ingenuity Pathway Analysis (IPA) program, shows several interaction hubs; the genes highlighted in purple are highly mutated in the late-stage cluster patients. Most of the genes in our list interact with the central hub protein UBC, which is expected because most proteins (especially unneeded or damaged ones) are ubiquitinated before proteasomal degradation. The ubiquitin-proteasome system is known to regulate the degradation of a number of cancer-associated gene products [24]. APC (adenomatous polyposis coli) is another key tumor suppressor in this network; it acts as an antagonist of the Wnt signaling pathway and has a number of roles in cancer development and progression, such as cell migration, adhesion, and apoptosis. The role of APC mutations in breast cancers has been well documented in the literature [25]. Two transcriptional regulator genes in our list, NOTCH2 and KMT2A (MLL), are also noteworthy. NOTCH2 is a key regulator of Akt, and its role is well documented in several cancers, including in apoptosis, proliferation, and the epithelial-mesenchymal transition (EMT) pathway [26]. Several somatic mutations in NOTCH2 are also associated with different cancers in the COSMIC database [27]. MLL is a transcriptional regulator and an oncogene with a variety of roles in cell proliferation and apoptosis [28].

Interaction network analysis of the top 25 genes. The image shows the interactions of the top 25 genes with highest mutation load in the late-stage-enriched cluster compared to the early-stage-enriched cluster of patients

Class prediction of breast cancers based on somatic mutations

Using the aforementioned BC clusters, we labeled each sample with its assigned cluster and developed a classification model to see how accurately we can predict the cluster of unseen breast cancer patients based on their somatic mutations. With this model, we can predict the cluster of an unseen patient from their mutation profile and thereby gain insight into the patient's likely clinical characteristics, such as BC stage. For example, if the model predicts that a new patient belongs to Cluster 3, we can expect this patient to be at a late stage, with certain genes more likely to carry higher mutation loads.

We labeled each patient with its assigned cluster and tested five popular machine learning (ML) algorithms, namely Random Forest (RF) [15], Support Vector Machine (SVM) [29], C4.5 [30], Naïve Bayes [31], and k-Nearest Neighbor (KNN) [32], to find the most appropriate algorithm for our dataset.

We used 10-fold cross-validation to evaluate classifier performance. In each loop of the 10-fold cross-validation, after withholding the test set, we performed feature selection using the information gain method [33] and selected the top 500 genes providing the highest information gain on the training set. In total, therefore, we selected ten sets of 500 genes across the 10-fold cross-validation. Of the aforementioned ML algorithms, we selected the RF method for further use in this study, as it achieved the best 10-fold cross-validation accuracy (70.86 %). We believe that the sparseness of the data, the low sample-to-feature ratio, and the difficulty of multiclass prediction are the reasons behind this moderate accuracy.
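The evaluation scheme, with feature selection repeated inside each fold to avoid leakage, can be sketched with scikit-learn as follows (mutual information stands in for the information-gain criterion, and the data are random placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Random placeholders for the patients x genes matrix and the NMF cluster labels
rng = np.random.default_rng(0)
X = rng.gamma(0.3, 1.0, size=(358, 5000))
y = rng.integers(0, 3, size=358)

# Select the 500 most informative genes inside each fold (mutual information is
# used here as a stand-in for the paper's information-gain criterion)
clf = make_pipeline(SelectKBest(mutual_info_classif, k=500),
                    RandomForestClassifier(n_estimators=500, random_state=0))
acc = cross_val_score(clf, X, y, cv=StratifiedKFold(10, shuffle=True, random_state=0))
print(acc.mean())  # the study reports about 70.9 % for its data
```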

We also observe that the SVM algorithm achieved a very similar accuracy but with worse TPR, FPR, and F-measure, and that the KNN method yielded the worst accuracy of all the methods we tested. Table 5 shows the performance measures of each ML algorithm.

Figure 6 shows the receiver operating characteristic (ROC) curves for each class, illustrating the relationship between the TPR (sensitivity) and the FPR (1-specificity) for each class. In the perfect case, an ROC curve goes straight up along the Y-axis and then to the right parallel to the X-axis, thereby maximizing the area under the curve (AUC). An AUC close to one indicates that the classifier is predicting with maximum TP and minimum FP. We calculated the AUCs for Clusters 1, 2, and 3 (used interchangeably with "class" in this section) as 0.88, 0.8, and 0.95, respectively, indicating that the classification model best differentiates the late-stage-enriched patients from the remaining patients.
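For a three-class model, such curves are computed one class at a time (one-vs-rest); a sketch with random placeholder predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

# y_true: true cluster labels; y_prob: class probabilities from the classifier
# (e.g. predict_proba on held-out folds). Random placeholders are used here.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=358)
y_prob = rng.dirichlet(np.ones(3), size=358)

y_bin = label_binarize(y_true, classes=[0, 1, 2])
for k in range(3):  # one ROC curve per class, treated one-vs-rest
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_prob[:, k])
    print(f"class {k}: AUC = {roc_auc_score(y_bin[:, k], y_prob[:, k]):.2f}")
```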

ROC curves. The ROC curve, used to show the accuracy of the predictions made by the model, shows the relationship between the TPR (sensitivity) and the FPR (1-specificity) for each class. The AUC values indicate that the prediction model best discriminates the early-stage-enriched and late-stage-enriched classes, with AUCs of 0.88 and 0.95, respectively

We also performed a permutation test by running the same class prediction procedure with RF on 10,000 randomly labeled datasets; none of the 10-fold cross-validations achieved a better accuracy, yielding a highly significant p-value (p < 10⁻⁴; see methods for more details). This supports the robustness of our model and its prediction accuracy.
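A sketch of such a permutation test, comparing cross-validated accuracy on relabeled data against the observed accuracy (running the full 10,000 permutations is computationally heavy, so the count is a parameter here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def permutation_pvalue(X, y, observed_acc, n_perm=10_000, seed=0):
    """Fraction of label permutations whose cross-validated accuracy reaches the
    observed accuracy (running all 10,000 permutations is expensive in practice)."""
    rng = np.random.default_rng(seed)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    hits = 0
    for _ in range(n_perm):
        acc = cross_val_score(clf, X, rng.permutation(y), cv=10).mean()
        hits += acc >= observed_acc
    return (hits + 1) / (n_perm + 1)  # if no permutation wins, p < 1/n_perm
```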


How to create a correct .vcf file based on a data.frame in R? Subscript is out of bounds?

Currently I'm working with the signeR package, which allows you to create signatures of somatic mutations. First of all, I'd like to reproduce the result from their vignette.

As you can see, there are two input options: a .vcf file, or a previously preprocessed mutation counts file, which is a simple data frame obtained with their genCountMatrixFromVcf() function (in fact, from a .vcf).

Here you can find an example .vcf file and this mutation count matrix, ready for further analysis.

In the vignette they used (as I read) this data: SUBSTITUTIONS_13Apr2012_snz.txt. So I decided to create a .vcf file from this data frame and go through all the steps of the vignette.

So I took several simple steps to build a .vcf from the data mentioned above:

All mandatory columns for a .vcf file are present. I have made this type of file previously; when I used readVcfAsVRanges(), everything worked fine.

However, here with the signeR package there's a problem, because when I want to create the mutation count matrix I get an error:

So I tried to find out what is wrong with my .vcf file in comparison to 'working' ones.

Their 'working' example file (in fact, after running the genCountMatrixFromVcf() function there's also an error, but a different one; it's not important at this moment).


Proposed framework for making progress as a community

To facilitate innovation in this field, we recommend the development of a framework of common formats and application programming interfaces (APIs) that enable the many resources available to interoperate more effectively, both at the individual variant level and at large scales. We further recommend the development of a portal that can be used to annotate the current state of tools in the field and guide users on how these tools can interoperate and be used to address different research questions. The outline of the recommended GVto3D framework takes its lead both from our wider review of the field and from the presentations and discussions that occurred among the members of the research community who attended the workshop; its design incorporates the needs and existing efforts of these researchers.

Figure 1 depicts the recommended components and design of the GVto3D framework. The Tools Registry will act as a central repository of data resources and software tools related to genetic variants, protein sequences, protein structures, variant effect prediction, and variant annotation. Metadata about each resource, enabling findability of the different software tools, will be stored and offered through an interactive web interface and an API, which in turn enables the development of intelligent software that can automatically discover applicable resources and gather information about how to communicate with them to obtain the desired results. In addition to the name, description, citations, contact information, and uniform resource locators (URLs), each entry will contain information important to the tool's interoperation, such as the inputs and outputs, API support, and reference genome information.
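As an illustration only, a registry record along the lines described above might carry fields such as these (the field names are our assumptions, not a published schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToolRegistryEntry:
    """Illustrative sketch of the metadata a registry record might hold; the field
    names are assumptions based on the text above, not a published schema."""
    name: str
    description: str
    citations: List[str]
    contact: str
    url: str
    inputs: List[str]        # e.g. ["VCF", "UniProt positions"]
    outputs: List[str]       # e.g. ["annotated VCF", "JSON"]
    api_supported: bool
    reference_genome: str    # e.g. "GRCh38"
    tags: List[str] = field(default_factory=list)
```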

Components of the GVto3D portal. The Tools Registry contains a searchable description and metadata for tools, resources, and reference data sets for third-party variant effect prediction and annotation services. Standardized application programming interfaces (APIs) provide interoperability for data input and output of these third-party tools. Custom adapters can provide limited interoperability for tools that cannot adopt the API. A mapping service provides bidirectional mappings from reference genome coordinates to UniProt protein positions and to Protein Data Bank (PDB) residue positions. The tools can use the mapping service to accept variant positions in any of the three coordinate systems. A beacon system enables queries about variant positions where three-dimensional (3D) structural information and annotation are available

A second component of the portal will be the definition of standard APIs so that information can be sent to and requested from different tools in the same way, thereby reducing the software development overhead typically incurred when different tools use different APIs. It is envisaged that new third-party tools will use the API natively, while API adapters will be developed to bridge to pre-existing third-party tools. The API enables seamless interoperability between different variant-related tools and also standard access to multidirectional mapping among genomic, protein sequence, and protein structure coordinates. These mappings will be made available through APIs and as downloadable data files. Mappings will be kept up to date based on the update schedules of the underlying data sources (PDB, weekly; UniProt, monthly), freeing developers from maintaining and updating copies of these data. Once several similar resources support the standard APIs, the site can be further developed into an aggregation portal, where a query at the portal can be automatically farmed out to multiple resources and the results collated and returned to the user in a single batch. This framework advances the FAIR principles of findability, accessibility, interoperability, and reusability [99] for all tools and resources that participate.

The use of standard file formats and standardized representations of data enables interoperability of prediction tools; for example, the output from one tool can be passed as input to a second tool, thereby simplifying the comparison of different methods. Standardized formats are also essential components of a reusable set of integrated tools (a software stack), including tools for reading and interpreting data files (file parsers), APIs, and visualization tools. Most current tools use a variety of inputs and outputs, placing a large burden on the user to transform data. Standard file formats and uniform APIs will be at the core of future services that combine and compare different approaches. Various platforms and tools have different schedules and reliability of upgrades; keeping track of versions is important, as changes to software may have large effects on the results.

The VCF file format [37], despite its complexity, is the de facto standard format for storing variant calls for a wide range of variants, from SNVs to long insertions and deletions. The Global Alliance for Genomics and Health's Data Working Group File Formats Team defines the VCF specification and its evolution [100]. Variant annotations (for example, the results of prediction tools) can be captured in INFO records, a set of structured records used to add annotation to VCF files. VCF versions 4.x, including the current version 4.3 [101], define meta-information lines that describe the INFO record data types and enforce standardization [102]. In addition to VCF, a few other formats have been described: ANN defines a different standard for representing variant information in INFO fields, and VEP [97] supports a simple tab-delimited format as well as a JavaScript Object Notation (JSON) output format.
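For illustration, a prediction tool's score could be declared in the header and attached to a record as follows (the INFO ID and the values here are hypothetical, not part of any published annotation standard):

```
##INFO=<ID=PRED_SCORE,Number=1,Type=Float,Description="Hypothetical variant effect prediction score">
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO
1       866422  .   C    T    50    PASS    PRED_SCORE=0.83
```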

Regarding genome nomenclature, the Human Genome Variation Society, which aims to foster the discovery and characterization of genomic variations, including their population distribution and phenotypic associations, has established guidelines and recommendations for the nomenclature of gene variations and serves as an international standard [103].

Progress in this field depends on global collaboration and the sharing and reuse of tools. APIs provide protocols to enable this collaboration. Tools wrapped in standard APIs present a consistent interface to heterogeneous tools, enhancing interoperability, and shielding the user from changes to the underlying software. As an example, many prediction tools that use 3D protein structural information define the location of mutations at the protein level using either UniProt or PDB coordinates. Mapping genomic coordinates to 3D protein structure is non-trivial and error prone. Robust APIs that can perform this mapping with up-to-date 3D information using both types of protein coordinates can augment existing tools that are based on just linear protein sequence coordinates.

Moreover, progress in the prediction of the effect of mutations and use of 3D structural information depend on the availability of well-designed training, test, and validation sets. The tool repository will be a place to share datasets, as well as protocols and references (metadata) for how these datasets were generated. Validation sets, accompanied by well-documented tutorials or vignettes, will include a subset of variants with clearly understood effects that can be used to test the output of available resources. Eventually these can serve as a set of unit tests for the framework itself.


Comments on this article

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics, genome assembly, genome annotation, variant calling, genome engineering.

The authors have revised the document and added the INDEL call set.

In my opinion, the authors have provided sufficient data and analyses to enable the reader to understand the caveats associated with the data prepared by them.

I have one small remaining concern. I would like the authors to clarify the behaviour of their processing when dealing with multinucleotide variants (MNVs). Are they excluded? If not, the normalisation and merging approaches may produce spurious results.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, clinical genomics

The authors present a new call set from the 1000 Genomes Project. This time the call set is a fresh recall of the data against GRCh38. The call set is composed only of biallelic SNPs (biallelic in this study). Previous variant call sets on GRCh38 had been lifted over from their GRCh37 native counterparts. The study identifies SNPs by first calling variants with several algorithms, creating a union set, and then explicitly genotyping these sites across the complete data set. The genotypes are then phased. A comparison with GIAB data for NA12878 is performed to assess the sensitivity and specificity of the call set.

A native call set of the 1000 Genomes Project data on GRCh38 is quite an important effort. These data have many important downstream uses, including clinical genomics uses such as variant filtering, and population genomics studies. Notwithstanding the importance of this data set, I have issues with the data note as it stands. I think the paper simply describes what the authors did to generate these data but makes little effort to explain why it was done in this way. The latter is important for gaining confidence in the call set.

As a user of the data, I would want to be satisfied with the quality of the call set. This is particularly important as several other large-scale projects have released allele frequencies from many thousands of deeply sequenced whole genomes (e.g. TOPMed and gnomAD). In parallel, frameworks for assessing the analytical performance of variant calls have been proposed by GIAB and GA4GH and recently published. Truth sets for several important samples beyond NA12878 have been released, including the ability to assess how well phased the data sets are. As explained in the specific comments below, I believe more work should have been done to convince the reader that sufficient due diligence has been applied to this data set.
From the perspective of having a native call set on GRCh38, it is a shame that there is very little to show the potential benefits of this call set on GRCh38. There is no mention [or I may have missed it] of how alternative haplotypes were handled, nor of the implications of having variants in alternative haplotypes. The data release also ignores non-diploid regions of the genome.
From a methodological perspective, the tool chain feels outdated, with the BCFtools and GATK versions being at least two years old. Thresholds are widely used, but little work is done to explain how these thresholds were determined. I assume that the parameters used in the tools themselves are standard or perhaps were defined in previous iterations of the project. However, the filtering thresholds in Tables 1-4, at least from my reading, sound like they were plucked out of thin air.
On a more positive note, I would like to commend the authors for the great effort they have made to make the code available, organise it, document it, and make this data set reproducible.

  1. *Tool chain is really outdated*. Is this because there have been few changes to the specific algorithms used (mpileup and UnifiedGenotyper)? I assume that many improvements and bug fixes have appeared in the two-plus years since these versions were released.
  2. *Demonstrating improvements on GRCh38 and why it is better than the lift over*. I missed a figure showing how this was an improvement, for example a comparison of the lifted-over and the native call sets. Are there regions of the genome where they perform differently? Was it worth the effort? What about the ALTs: do we now have better allele frequencies in these regions? How do they affect the corresponding frequencies on the primary assembly?
  3. *Variant normalisation*. This is glossed over in the text and little is said about how it was performed. In my experience this step often introduces difficult tradeoffs. Please expand on this.
  4. *Benchmarking*. This area is quite lacking here. I understand that you are only looking at biallelic sites, so comparison with a truth set is easier. However, standards for doing this have existed for over a year and have recently been published. They are based on comparing at the haplotype level, not the site level.
  5. *Phasing*. These standards also enable comparing the phasing accuracy. Given the effort placed here to phase the genotypes, it would be helpful to also benchmark the phasing data.
  6. *Union set*. I would have loved to see a figure that shows how the various variant callers contributed to the union call set. Is there one that is unnecessary? Is there one that is responsible for many of the false positives?
  7. *Analytical performance*. "Taken together, these results demonstrate both the high sensitivity and high specificity of our callset". You seem to have fewer TPs and more FNs than in the GRCh37 release. Nowadays these numbers do not represent high sensitivity and specificity (at least with 30x genomes). At least a more nuanced discussion is required here. This is strongly linked to the thresholds you chose for the various filtering steps.
  8. *Truth sets*. It would be very helpful to add further truth sets, for example for the Ashkenazim and Chinese trios.
  9. *Joint calling*. The approach here was to do single-sample calling, assemble a union set, and then genotype those sites. Could the authors please explain why they chose this approach as opposed to joint calling? I would assume that for low-coverage genomes, joint calling may be more powerful, as it can leverage information across more samples.

Is the rationale for creating the dataset(s) clearly described?

Are the protocols appropriate and is the work technically sound?

Are sufficient details of methods and materials provided to allow replication by others?

Are the datasets clearly presented in a useable and accessible format?

1. Krusche P, Trigg L, Boutros PC, Mason CE, et al.: Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019.
2. Zook JM, McDaniel J, Olson ND, Wagner J, et al.: An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, clinical genomics

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the data note is limited to a description of what was done, without significant discussion of the rationale behind the approach. We note that this is a data note, intended to describe a data set and how it was produced; however, we have also amended the text to include more detail on the rationale behind our approach.

Also in the general observations, the reviewer highlights issues relating to data quality: other call sets, the limitation of comparing only to NA12878, and benchmarking phasing. While other call sets, such as TOPMed and gnomAD, exist, the 1000 Genomes Project data set remains unique in terms of its population composition and the fact that all data can be accessed down to the base-pair level. With regard to other truth sets beyond NA12878, we were unable to locate “gold-standard” data for any other samples in our data set. With regard to phasing, we have used the WhatsHap program to assess this and added the results to the manuscript.

With regard to GRCh38, our aim was not to demonstrate the superiority of GRCh38 but rather to provide a resource for those wishing to use that assembly. We believe the benefits of the assembly have been demonstrated by Schneider et al.

We have amended the text to make it clearer that calling was not done on alternate loci. The text has also been amended to describe the intention of using the data note format to release and describe data early, with the intention being to revisit elements not included in this set.

With regard to the tool chain, this reflects the volume of compute involved in this work. It was around two years ago that this work commenced. The final stage of the pipeline alone takes around six months to run, even with access to generous compute resources.

The text has been amended to include information on threshold selection.

With regard to the detailed points:

1) Tool chain is really outdated

Software versions reflect the length of time required to run this compute, as noted above.

2) Demonstrating improvements on GRCh38 and why it is better than the lift over

As noted above, our intention was not to demonstrate that GRCh38 was the better assembly. We consider that this has been done previously by Schneider et al. We have added a comparison with the liftover.

3) Variant normalisation

We have updated the text to include further information on this.

We note that hap.py was published after this work was submitted. We were unable, however, to establish from the manuscript how it could be used to improve upon the existing benchmarking. The summary provided in Figure 1 (https://www.nature.com/articles/s41587-019-0054-x) indicates that it wraps tools for arriving at consistent representation of variants (handled in the normalisation steps of our pipeline at the point of producing the consensus call set) and then produces a “standardised” report, providing similar metrics to those we present. From this, it appears to provide similar functionality to steps already present in our work. Our attempt to contact the authors for more information on this was not successful.

With regard to our decision to use a “truth set”, our belief is that comparing to an independently produced “gold-standard” is a valuable benchmarking strategy.

We have extended the benchmarking using WhatsHap.

This has been done with WhatsHap and the results added. As noted above, our attempt to contact the author of hap.py to establish how it could be used to benchmark phasing was sadly unsuccessful.

We have added the requested figure.

7) Analytical performance

These statements were made in the context of comparison with the phase three call set. The text has been amended. We have also compared with an initial analysis of new 30x coverage data produced by standard pipelines at NYGC. Based on our benchmark, the performance of our call set is slightly better. We agree that filtering has an impact here.

Our calling approach used low-coverage and exome data from the period of approximately 2008 to 2012 to perform joint genotyping. We believe there would be questions about the validity of benchmarking our results against samples that are simultaneously (1) not part of one of our populations, (2) differently related to other samples in the population, and (3) covered by different data types for variant calling. This excludes the Ashkenazi samples as a suitable benchmark. For the Han Chinese samples, we could not locate data that matches the profile of the data for our samples. We have updated the text in an effort to improve the discussion of issues relating to benchmarking.

This did use joint calling. The text has been amended to make this clearer.

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the data note is limited to a description of what was done, without significant discussion of the rationale behind the approach. We note that this is a data note, intended to describe a data set and how it was produced, however, we have also amended the text to include more detail relating to the rationale behind our approach.

Also, in general observations, the reviewer highlights issues relating to data quality: other call sets, the limitation of comparing to only NA12878 and benchmarking phasing. While other call sets, such as TOPmed and gNOMAD exist, the 1000 Genomes Project data set remains unique in terms of its composition of populations and that all data can be accessed to the base pair level. With regard to other truth sets, beyond NA12878, we were unable to locate “gold-standard” data for any other samples in our data set. With regard to phasing, we have used the WhatsHap program to assess this and added the results to the manuscript.

With regard to GRCh38, our aim was not to demonstrate the superiority of GRCh38 but rather to provide a resource for those wishing to use that assembly. We believe the benefits of the assembly have been demonstrated by Schneider et al.

We have amended the text to make it clearer that calling was not done on alternate loci. The text has also been amended to describe the intention of using the data note format to release and describe data early, with the intention being to revisit elements not included in this set.

With regard to the tool chain, this reflects the volume of compute involved in this work. It was around two years ago that this work commenced. The final stage of the pipeline alone takes around six months to run, even with access to generous compute resources.

The text has been amended to include information on threshold selection.

With regard to the detailed points:

1) Tool chain is really outdated

Software versions reflect the length of time required to run this compute, as noted above.

2) Demonstrating improvements on GRCh38 and why it is better than the lift over

As noted above, our intention was not to demonstrate that GRCh38 was the better assembly. We consider that this has been done previously by Schneider et al. We have added a comparison with the liftover.

3) Variant normalisation

We have updated the text to include further information on this.

We note that hap.py was published after this work was submitted. We were unable, however, to establish from the manuscript how it could be used to improve upon the existing benchmarking. The summary provided in Figure 1 (https://www.nature.com/articles/s41587-019-0054-x) indicates that it wraps tools for arriving at consistent representation of variants (handled in the normalisation steps of our pipeline at the point of producing the consensus call set) and then produces a “standardised” report, providing similar metrics to those we present. From this, it appears to provide similar functionality to steps already present in our work. Our attempt to contact the authors for more information on this was not successful.

With regard to our decision to use a “truth set”, our belief is that comparing to an independently produced “gold-standard” is a valuable benchmarking strategy.

We have extended the benchmarking using WhatsHap.

This has been done with WhatsHap and the results added. As noted above, our attempt to contact the author of hap.py to establish how it could be used to benchmark phasing was sadly unsuccessful.

We have added the requested figure.

7) Analytical performance

These statements were made in the context of comparison with the phase three call set. The text has been amended. We have also compared with an initial analysis of new 30x coverage data produced by standard pipelines at NYGC. Based on our benchmark, the performance of our call set is slightly better. We agree that filtering has an impact here.

Our calling approach used low-coverage and exome data from the period of approximately 2008 to 2012 to perform joint genotyping. We believe that there would be questions about the validity of trying to benchmarking our results with samples that are simultaneously (1) not part of one of our populations, (2) have different relatedness to other samples in the population, and (3) have different data types available for variant calling. This excludes the Ashkenazi samples as a suitable benchmark. For the Han Chinese samples, we could not locate data that matches the profile of that for our samples. We have updated the text in an effort to improve the discussion of issues relating to benchmarking.

This did use joint calling. The text has been amended to make this clearer.

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the . Continue reading First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the data note is limited to a description of what was done, without significant discussion of the rationale behind the approach. We note that this is a data note, intended to describe a data set and how it was produced, however, we have also amended the text to include more detail relating to the rationale behind our approach.

Also, in general observations, the reviewer highlights issues relating to data quality: other call sets, the limitation of comparing to only NA12878 and benchmarking phasing. While other call sets, such as TOPmed and gNOMAD exist, the 1000 Genomes Project data set remains unique in terms of its composition of populations and that all data can be accessed to the base pair level. With regard to other truth sets, beyond NA12878, we were unable to locate “gold-standard” data for any other samples in our data set. With regard to phasing, we have used the WhatsHap program to assess this and added the results to the manuscript.

With regard to GRCh38, our aim was not to demonstrate the superiority of GRCh38 but rather to provide a resource for those wishing to use that assembly. We believe the benefits of the assembly have been demonstrated by Schneider et al.

We have amended the text to make it clearer that calling was not done on alternate loci. The text has also been amended to describe the intention of using the data note format to release and describe data early, with the intention being to revisit elements not included in this set.

With regard to the tool chain, this reflects the volume of compute involved in this work. It was around two years ago that this work commenced. The final stage of the pipeline alone takes around six months to run, even with access to generous compute resources.

The text has been amended to include information on threshold selection.

With regard to the detailed points:

1) Tool chain is really outdated

Software versions reflect the length of time required to run this compute, as noted above.

2) Demonstrating improvements on GRCh38 and why it is better than the lift over

As noted above, our intention was not to demonstrate that GRCh38 was the better assembly. We consider that this has been done previously by Schneider et al. We have added a comparison with the liftover.

3) Variant normalisation

We have updated the text to include further information on this.

We note that hap.py was published after this work was submitted. We were unable, however, to establish from the manuscript how it could be used to improve upon the existing benchmarking. The summary provided in Figure 1 (https://www.nature.com/articles/s41587-019-0054-x) indicates that it wraps tools for arriving at consistent representation of variants (handled in the normalisation steps of our pipeline at the point of producing the consensus call set) and then produces a “standardised” report, providing similar metrics to those we present. From this, it appears to provide similar functionality to steps already present in our work. Our attempt to contact the authors for more information on this was not successful.

With regard to our decision to use a “truth set”, our belief is that comparing to an independently produced “gold-standard” is a valuable benchmarking strategy.

We have extended the benchmarking using WhatsHap.

This has been done with WhatsHap and the results added. As noted above, our attempt to contact the author of hap.py to establish how it could be used to benchmark phasing was sadly unsuccessful.

We have added the requested figure.

7) Analytical performance

These statements were made in the context of comparison with the phase three call set. The text has been amended. We have also compared with an initial analysis of new 30x coverage data produced by standard pipelines at NYGC. Based on our benchmark, the performance of our call set is slightly better. We agree that filtering has an impact here.

Our calling approach used low-coverage and exome data from the period of approximately 2008 to 2012 to perform joint genotyping. We believe that there would be questions about the validity of trying to benchmarking our results with samples that are simultaneously (1) not part of one of our populations, (2) have different relatedness to other samples in the population, and (3) have different data types available for variant calling. This excludes the Ashkenazi samples as a suitable benchmark. For the Han Chinese samples, we could not locate data that matches the profile of that for our samples. We have updated the text in an effort to improve the discussion of issues relating to benchmarking.

This did use joint calling. The text has been amended to make this clearer.

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the data note is limited to a description of what was done, without significant discussion of the rationale behind the approach. We note that this is a data note, intended to describe a data set and how it was produced, however, we have also amended the text to include more detail relating to the rationale behind our approach.

Also, in general observations, the reviewer highlights issues relating to data quality: other call sets, the limitation of comparing to only NA12878 and benchmarking phasing. While other call sets, such as TOPmed and gNOMAD exist, the 1000 Genomes Project data set remains unique in terms of its composition of populations and that all data can be accessed to the base pair level. With regard to other truth sets, beyond NA12878, we were unable to locate “gold-standard” data for any other samples in our data set. With regard to phasing, we have used the WhatsHap program to assess this and added the results to the manuscript.

With regard to GRCh38, our aim was not to demonstrate the superiority of GRCh38 but rather to provide a resource for those wishing to use that assembly. We believe the benefits of the assembly have been demonstrated by Schneider et al.

We have amended the text to make it clearer that calling was not done on alternate loci. The text has also been amended to describe the intention of using the data note format to release and describe data early, with the intention being to revisit elements not included in this set.

With regard to the tool chain, this reflects the volume of compute involved in this work. It was around two years ago that this work commenced. The final stage of the pipeline alone takes around six months to run, even with access to generous compute resources.

The text has been amended to include information on threshold selection.

With regard to the detailed points:

1) Tool chain is really outdated

Software versions reflect the length of time required to run this compute, as noted above.

2) Demonstrating improvements on GRCh38 and why it is better than the lift over

As noted above, our intention was not to demonstrate that GRCh38 was the better assembly. We consider that this has been done previously by Schneider et al. We have added a comparison with the liftover.

3) Variant normalisation

We have updated the text to include further information on this.

We note that hap.py was published after this work was submitted. We were unable, however, to establish from the manuscript how it could be used to improve upon the existing benchmarking. The summary provided in Figure 1 (https://www.nature.com/articles/s41587-019-0054-x) indicates that it wraps tools for arriving at consistent representation of variants (handled in the normalisation steps of our pipeline at the point of producing the consensus call set) and then produces a “standardised” report, providing similar metrics to those we present. From this, it appears to provide similar functionality to steps already present in our work. Our attempt to contact the authors for more information on this was not successful.

With regard to our decision to use a “truth set”, our belief is that comparing to an independently produced “gold-standard” is a valuable benchmarking strategy.

We have extended the benchmarking using WhatsHap.

This has been done with WhatsHap and the results added. As noted above, our attempt to contact the author of hap.py to establish how it could be used to benchmark phasing was sadly unsuccessful.

We have added the requested figure.

7) Analytical performance

These statements were made in the context of comparison with the phase three call set. The text has been amended. We have also compared with an initial analysis of new 30x coverage data produced by standard pipelines at NYGC. Based on our benchmark, the performance of our call set is slightly better. We agree that filtering has an impact here.

Our calling approach used low-coverage and exome data from the period of approximately 2008 to 2012 to perform joint genotyping. We believe that there would be questions about the validity of trying to benchmark our results with samples that are simultaneously (1) not part of one of our populations, (2) have different relatedness to other samples in the population, and (3) have different data types available for variant calling. This excludes the Ashkenazi samples as a suitable benchmark. For the Han Chinese samples, we could not locate data that matches the profile of that for our samples. We have updated the text in an effort to improve the discussion of issues relating to benchmarking.

This did use joint calling. The text has been amended to make this clearer.

In the work entitled 'Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project', Lowy-Gallego et al. describe their effort to re-analyze the 1000 genomes data on the current GRCh38 assembly. They do not perform a full variant analysis, but rather release a set of biallelic SNVs as a preliminary variant call set. They compare this variant set to the Genome in a Bottle (GIAB) variant calls on the sample NA12878.

It is great to see efforts to update important datasets onto the current human reference assembly, GRCh38. As the authors note, the GRCh38 reference represents a substantial improvement over the GRCh37 reference, but the lack of GRCh38 based annotation has hindered adoption of this version of the reference assembly. The authors go on to discuss why 'lift-over' based approaches are inadequate, which motivated this work. I agree that 'lift-over' based approaches are inadequate, but I find the results presented in this manuscript unconvincing with respect to this assertion.

The authors spend significant space in the introduction explaining the improvements in GRCh38, including the addition of alternate loci, but then put no effort into demonstrating why these are valuable. Additionally, the authors spend time discussing why 'lift-over' based approaches are inadequate, but then do no comparisons to show why their de novo approach is an improvement.

While I believe this work is important, I feel the authors fail to make the point of why doing this de novo analysis on the GRCh38 reference is important.

1. Explanation of why 'lift-over' approaches have limitations: I agree with the statement that 'lift-over' is inadequate. However, the description of this on page 1 is not clear. Statement 1, 'they rely on an equivalent region existing in the new genome, so new sequence in the improved assembly is effectively excluded', confounds two points. Regions that are present on the old assembly but not the new one will be excluded from a 'lift-over' approach. Additionally, sequence that is new on the updated reference will also be omitted - but these are two separate cases.
Point 2, relating to alignments, also confounds multiple issues. Yes - correct alignments are key to the 'lift-over' approach, but there are two 'bad alignment' cases. The case I think the manuscript is referring to is one where increased diversity in one version of the assembly can confound alignments (that is, sequence change). The other relevant case is the addition of paralogous sequence to one assembly that is missing from the other. This can lead to a locus aligning to a paralogous region rather than the equivalent locus (I have seen examples of this), which can also lead to incorrect 'lift-over'. Point three in this statement is clear, but the authors provide no evidence to actually support it.

2. The authors only provide biallelic SNPs: I can see the utility in concentrating on a restricted set of variants, but only if this set of data is actually used to demonstrate the value of de novo analysis over 'lift-over', which was not done in this manuscript. On page 3, the authors state that "These represent the major part of the SNVs present in the human genome", but I'd like more hard numbers on this. What percentage of all SNVs do the biallelics represent? What percentage of all variation do they represent?

3. Page 3, Quality control of alignment files: Are the steps presented here just the differences from the original protocol? I think this is OK, but it is not clear from reading the manuscript if this is the full set of steps or just the differences.

4. Variant discovery: Why did you use the variant calling tools you chose?

5. Variant Filtering: The exclusion of variants on the sex chromosomes seems like a significant omission and limits the use of this dataset.

6. Data set validation: I have significant concerns here. I understand why NA12878 was used for some validation. However, my understanding is that the GIAB dataset does not take the alternate loci into account in their variant calling, while this manuscript tries to take advantage of these sequences - how did this impact the comparison? For example, I would predict more conflicts in regions where alt-loci exist in GRCh38. Does this occur?
I am also not convinced that accuracy on NA12878 really translates well to other samples, particularly non-European samples (as NA12878 has European ancestry). Will the accuracy really extend to non-European samples? Also, my reading of Table 5 is that this dataset performs slightly worse than the GRCh37 call set. This does not do a lot to convince this reader that the work of doing the re-analysis is worthwhile - and I'm a believer, based on previous work I have been a part of! I have some concerns that this may be due to improvements in the new call set (due to the inclusion of the alts and more complex decoy), but it takes some significant work to track this down. There are examples of this kind of analysis 1. The authors should also clearly state what fraction of the genome they are able to assess using this method.

7. Omitted analysis: The authors discuss the value of the improved reference in the introduction, but then do nothing to show the value of the alternate loci. How many new variants are identified on these loci? How does the inclusion of these sequences change variant calling on the primary?
Perhaps most disappointing is that there is no analysis of how the de novo approach is an improvement over 'lift-over' approaches. How do the de novo variant calls compare to the 'lift-over' calls? Without this analysis, it is unclear to me that anyone would be convinced that doing the de novo call approach is worth the effort.
Lastly, the authors miss the opportunity to do an accuracy comparison by looking at the regions of the reference comprised of the 'ABC' clones. These are fosmid libraries constructed from several of the samples that went into the 1000 genomes project. These provide a great test bed for both looking at variant calls (any call in this region should be heterozygous or hemizygous as the reference sequence represents one valid haplotype in the sample being analyzed) and also allows for the confirmation of the local haplotype assertions.

Is the rationale for creating the dataset(s) clearly described?

Are the protocols appropriate and is the work technically sound?

Are sufficient details of methods and materials provided to allow replication by others?

Are the datasets clearly presented in a useable and accessible format?

1. Marks P, Garcia S, Barrio AM, Belhocine K, et al.: Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 2019; 29(4): 635-645.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics, genome assembly, genome annotation, variant calling, genome engineering.

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

We note the high level comments are broken down into detailed points and addressed with detailed comments below. We have further updated the manuscript in an effort to improve clarity where it was indicated this was lacking. We also provided the requested information comparing the generated call set with the lift-over and included assorted other updates.

In response to the high level comments, which we understand to primarily relate to a) the improvements of GRCh38 over GRCh37 and b) comparison between de novo calling versus lift-over:

a) It was not our intention to demonstrate the superiority of GRCh38 over GRCh37. We believe that the GRC, in particular in the paper by Schneider et al., have demonstrated this already. We included this information for readers who may not be familiar with these issues. However, we accept that this may give an inaccurate impression of the emphasis of the data note. As such, the text has been amended, reducing the explanation of assembly changes and, rather, making reference to the paper by Schneider et al. Our aim was to provide a resource for those wishing to adopt the new assembly, not to present the case as to why GRCh38 should be adopted, which we believe has already been made elsewhere.
b) Our emphasis is on providing resources for the community. To make the data available to users in as timely a manner as possible, we elected to use the data note format for publication. This has a focus on describing how the data was produced, with validation of data outputs being listed in the information to authors as optional. In light of the comments, we have performed a comparison with the lift over set and also looked specifically at regions of the assembly that were updated between the two assemblies. Further details are below.

1) Explanation of why “lift-over” approaches have limitations:

This relates to a set of three statements regarding the inadequacy of lift-overs.

For the first statement, the reviewer notes that the removal of sequence when moving from GRCh37 to GRCh38 and the gain of sequence are two separate cases and that these were conflated in the statement “they rely on an equivalent region existing in the new genome, so new sequence in the improved assembly is effectively excluded”. We accept that this combines multiple facets of changes between the two assemblies. The text for statement one has, therefore, been updated to focus on the central point that we sought to make: that a mapping between the assemblies is necessary before one can lift-over a given variant and that this is not always possible (for any one of a number of possible reasons). Further, we have added the number of records that could not be lifted over in the dbSNP/EVA processed files to give a concrete indication of the numbers of records where this occurs.

For the second statement, the point we wished to make was that, even when a variant can be lifted-over, it does not follow that the evidence that supported that call in the original assembly would also transfer to the new location. The text has been amended in an effort to make this clearer, also citing evidence from Schneider et al. in regard to alignments and the transition from GRCh37 to GRCh38.

For the third statement, it was noted that this was clear but that no evidence was provided to support the assertion. In light of the other changes, this text has been modified to focus on the case of new sequence being added to the assembly, pointing to specific examples shown in Figure 1, which illustrate the differences between the lift-over and de novo call sets at clinically relevant loci that were updated between the two assemblies.

2) The authors provide only biallelic SNPs:

The requested comparison with the lift-over is addressed in the response to point seven. We have added the requested numbers relating to what fraction of SNVs are biallelic (99.6%) and the number of SNVs relative to other short variants. We have also taken the opportunity to update the call set to include biallelic INDELs, a category of variant that was not previously included. Multi-allelic calls remain absent from the set as SHAPEIT is unable to handle such calls and our pipelines would require further development. Our strategy has been to release calls as soon as possible, making data that is useful to many users available promptly, and to revisit the data set to add further classes of variant as practical.

3) Page 3, Quality control of alignment files:

All of the steps used are described in the data note, not just the differences. The text has been updated in an effort to make this clearer to readers.

4) Variant discovery:

Tools were chosen in consultation with members of the 1000 Genomes Project Consortium. While our aim was to recapitulate their GRCh37 analysis on the new assembly, this would not have been feasible given the large number of callers used in the original project, the concomitant compute and the relatively complex methods used in filtering and integrating call sets, which were both compute and labour intensive. This obliged us to look at using a reduced set of callers and a simplified methodology. We sought recommendations that took into account the performance of the callers on the 1000 Genomes phase three data, which unlike most other panels is a mix of low-coverage and exome data with significantly more geographic diversity. In addition, the performance of some callers on the data set rendered their use impractical.

The text has been updated to inform readers of the above.

5) Variant Filtering:

This was intended to be an initial release of data, with the intention of revisiting and adding additional elements that require further processing. As the sex chromosomes required additional analysis, they were not included in this first release. Further, we believe that the data set is still beneficial to some users in their absence. We anticipate calls on the GRCh38 sex chromosomes being released in the future.

6) Data set validation:

We acknowledge that the GIAB NA12878 benchmark is imperfect. As the reviewer notes, it is a single sample and the differences in the versions of the reference genome used for alignment (with and without alternate loci) by us and GIAB would be expected to have a bearing on variant detection.

With regard to the alternate loci, the possibility of comparing the level of conflict with the benchmark at regions where alternate loci are present and where they are not is mentioned. Given, however, that the presence of the alternate loci would also be expected to have at least some impact across the genome (irrespective of the presence of alternate loci at that particular location), we feel that, in order to truly assess what impact the alternate loci had on the analysis, it would be necessary to repeat the analysis, using, instead, alignments where the alternate loci had not been present. As our data set also relies on joint genotyping, this would effectively mean realigning all of the data and repeating analysis on all of the data in order to answer this question. The substantial compute volume involved in this would add significant time and expense and, thus, makes this comparison impractical. It would, however, be necessary in order to derive meaningful and sound conclusions about the impact of the alternate loci on our analysis.

The reviewer also expresses concern that accuracy with NA12878 may not transfer to other samples, particularly those of non-European ancestry. Given the prevalence of data from NA12878, it would seem reasonable to conclude that calling methods should perform well, and potentially above average, with that sample. However, NA12878 has data similar to that of other samples in our data set. Further, our data set is comprised only of Illumina data, so we do not expect, for example, types of sequencing error to vary across the samples. In work done by others, comparing the new call set to 1000 Genomes phase three, we see that our results and those for phase three show a strong level of consistency across the samples (Robinson and Glusman, 2019, https://www.biorxiv.org/content/10.1101/600254v1), with no indication that NA12878 is an outlier.

In relation to our comparison with phase three, it was not our intention to try to outperform phase three, but rather to offer a de novo call set of similar quality on GRCh38. The utility is to those wishing to work with GRCh38 and with a de novo call set generated on that assembly, including the novel GRCh38 regions. The comparison with phase three is offered to assist users in understanding how our call set compares to it. Our call set shows broadly similar behaviour to phase three, with a slightly different balance of sensitivity and specificity. Given, however, that phase three involved a massively greater analysis effort, which it would be impossible to repeat with the resources available, it is perhaps not surprising that phase three sees a higher yield. In turn, this is reflected in the lift-over, but with substantial differences demonstrated in novel regions, where the de novo call set detects variants absent in the lift-over.

While we do acknowledge the limitations of the GIAB benchmark we have used, we found no better alternatives. To effectively benchmark our data, based as it is on joint genotyping, we needed “gold standard” data for samples in our data set. For short variants, the only such data set we were able to locate was GIAB NA12878. The alternatives, such as manual inspection of the data or our own assessment of alternative data types (for example, PacBio reads), also have limitations and lose the benefit of an independent “gold standard” data set created by another group.

The text has been updated in an effort to better reflect the above.

7) Omitted analysis:

The alternate loci were used in aligning reads to ensure the best possible read mapping, but variants were not called on these loci; the text has been amended to make this clearer. This is in large part because protocols for successfully calling variants on the alternate loci are lacking: the only information provided by developers of calling software of which we are aware is a beta tutorial from GATK (https://software.broadinstitute.org/gatk/documentation/article.php?id=8017).

We have extended our benchmarking work to include the lift-over data set in the comparison. We have also looked specifically at novel regions of GRCh38. These are included in the revised text.

The suggestion relating to fosmid clones is interesting and would provide further validation. We would, however, note that this is a data note, offered by the journal as a means of describing the production of a data set, where benchmarking is described as optional. Our existing benchmarking covers the genome at greater scale and should, therefore, already give a better indication of the performance of our calling genome wide. Further, we have added benchmarking of the phasing using WhatsHap.

Results

Comparison to linear aligners

Regular sequence-to-sequence alignment is a special case of sequence-to-graph alignment, where the graph consists of a linear chain of nodes. We compare GraphAligner to a well-optimized sequence-to-sequence aligner, minimap2 [13], in whole human genome read alignment. We simulated 20x coverage reads from the GRCh38 reference using pbsim [33] with default parameters and filtered out reads shorter than 1000 bp as well as reads containing any non-ATCG characters. We then aligned the reads to the reference using both minimap2 and GraphAligner and evaluated the mapping accuracy. We adopt the criteria used in the minimap2 evaluation [13] and consider a read correctly mapped if its longest alignment overlaps at least 10% with the genomic position from where it was simulated.
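
To make that criterion concrete, a minimal sketch is shown below; the interval representation, the function names, and the choice of the simulated source interval as the denominator are illustrative assumptions, not the evaluation code actually used in the paper.

```python
# Sketch of the evaluation criterion described above: a read counts as
# correctly mapped if its longest alignment overlaps the interval it was
# simulated from by at least 10%. Intervals are (chromosome, start, end);
# using the source interval length as the denominator is an assumption.

def overlap_fraction(aln, truth):
    """Fraction of the true source interval covered by the alignment."""
    chrom_a, start_a, end_a = aln
    chrom_t, start_t, end_t = truth
    if chrom_a != chrom_t:
        return 0.0
    overlap = max(0, min(end_a, end_t) - max(start_a, start_t))
    return overlap / (end_t - start_t)

def is_correctly_mapped(alignments, truth, min_fraction=0.10):
    """alignments: (chrom, start, end) intervals for one read; truth: its simulated origin."""
    if not alignments:
        return False
    # Use the longest alignment, as in the minimap2-style criterion above.
    longest = max(alignments, key=lambda a: a[2] - a[1])
    return overlap_fraction(longest, truth) >= min_fraction

# A read simulated from chr1:10,000-20,000 whose longest alignment covers
# chr1:18,500-21,000 overlaps 15% of the source interval -> counted as correct.
print(is_correctly_mapped([("chr1", 18500, 21000)], ("chr1", 10000, 20000)))  # True
```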

Table 1 shows the results. GraphAligner and minimap2 align with approximately equal accuracy, with minimap2 aligning slightly more reads correctly (95.1% vs. 95.0%). GraphAligner takes about 3× the runtime of minimap2, which we consider a modest overhead for a tool able to handle graphs in comparison to a highly optimized sequence-to-sequence mapping tool. Note that minimap2 is faster than commonly used competing tools, such as BWA-MEM [14], by more than one order of magnitude [13].

Aligning to a graph with variants

In this experiment, we evaluated the mapping accuracy to a graph with variants. We used the chromosome 22 reference (GRCh37) and all variants in the Thousand Genomes project phase 3 release [34]. We constructed a variation graph from the reference and the variants using vg [16], producing a graph of chromosome 22 with 2,212,133 variants, containing on average one variant every 15 base pairs in the non-telomeric regions (the variant graph). Then, we simulated reads of varying lengths from the chromosome 22 reference sequence (GRCh37) using pbsim [33] with the default CLR parameters and aligned them to the graph with GraphAligner. We consider a read correctly mapped if its longest alignment overlaps at least 10% with the genomic position from where it was simulated and evaluate the number of reads correctly aligned. We also aligned the same reads to the chromosome 22 reference without variants (the linear graph) with GraphAligner to differentiate between reads which could not be aligned due to variants and reads which could not be aligned due to other reasons such as short read lengths leading to missed seeds. In addition to the reads simulated from the reference, we also simulated reads from de novo diploid assembled chromosome 22 contigs of the individual HG00733 [35]. This was done to test alignment accuracy on reads with realistic variants.

Figure 1 shows the results. The left part of the figure shows the alignment accuracy for the reference-simulated reads. For comparison, the blue curve represents the results of mapping reads simulated from GRCh37 back to the (linear) reference genome and hence indicates the performance achievable in an idealized setting. When aligning to the variant graph, 95% of the reference-simulated reads are correctly aligned once the read length grows above 1200 base pairs; at 1500 base pairs, 97.0% of the reads are correctly aligned. The right part of Fig. 1 shows the accuracy for reads simulated from de novo assembled contigs. As expected, the alignment accuracy for reads simulated from contigs is worse than for reads simulated from the reference (GRCh37) when aligning to the linear reference, but similar when aligning to the graph with variants. These results show that GraphAligner is capable of accurately aligning long reads to a variation-rich graph.

Fig. 1: Fraction of reads correctly aligned at varying read lengths for the variant graph and the linear graph. Left: reads simulated from the GRCh37 reference. Right: reads simulated from de novo assembled contigs of HG00733

Comparison to vg

In this experiment, we compared GraphAligner and vg [16] for aligning long reads. We used the graph from the previous experiment containing the chromosome 22 reference and all variants in the Thousand Genomes project phase 3 release [34]. We simulated reads from the chromosome 22 reference using pbsim [33] with default parameters. Then, we aligned the simulated reads to the graph using GraphAligner and vg.

Table 2 shows the results. GraphAligner aligned 96.6% of reads correctly, consistent with the results of the variation graph experiment. In contrast, vg aligned 93.8% of reads to the correct genomic region. However, we found that some of vg’s alignments were not consistent with the graph topology, that is, the alignment traversed nodes that are not connected by an edge. In some cases, the alignment “looped back” into the same reference area multiple times and even covered both alleles of a variant (Additional file 1: Figure S2). We did not evaluate how many of vg’s alignments were inconsistent with the graph topology. GraphAligner’s runtime and peak memory include both indexing and alignment. Despite including the indexing phase, GraphAligner is almost ten times faster than vg’s mapping phase alone; when vg’s indexing is included as well, GraphAligner is over thirteen times faster than vg, and its peak memory use is about a third of vg’s.
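The consistency check described above can be expressed as a simple adjacency test over the alignment path; this is our own sketch, with node IDs and edges represented as they might be read from a GFA file:

```python
def path_is_topology_consistent(path, edges):
    """Check that every consecutive pair of nodes visited by an alignment path
    is connected by an edge in the graph.

    path  : list of node IDs in traversal order
    edges : set of (from_node, to_node) pairs
    """
    return all((a, b) in edges or (b, a) in edges
               for a, b in zip(path, path[1:]))

# Example: an alignment jumping between unconnected nodes is flagged.
edges = {(1, 2), (2, 3), (3, 4)}
assert path_is_topology_consistent([1, 2, 3], edges)
assert not path_is_topology_consistent([1, 3, 4], edges)   # 1 -> 3 has no edge
```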

Variant genotyping

We implemented a simple variant genotyping pipeline for long reads. First, a list of reference variants and a reference genome are used to build a pangenome graph using vg [16]. Then, long reads are aligned to the pangenome graph with GraphAligner. Finally, vg is used to genotype the variants according to the long read alignments.

We tested our variant genotyping pipeline using 35x coverage PacBio HiFi reads from the individual HG002 [36], using the Genome in a Bottle (GIAB) benchmarking variant set version 3.3.2 for GRCh38 [37] as the ground truth. We tested three different scenarios: first, an ideal scenario where we use the variants in the GIAB variant set to build the graph; second, a more realistic scenario where we use variants from a different source, namely the variant set by Lowy-Gallego et al. [38] called from the GRCh38 genome using data from phase 3 of the Thousand Genomes Project (1000G), to build the graph; and third, using the 1000G variants to build the graph but evaluating accuracy only on variants that occur in both the 1000G and the GIAB variant sets (1000G+GIAB). The reason for the three scenarios is that the genotyping pipeline cannot call novel variants; it only genotypes variants that are already in the list of reference variants. This separates errors caused by the pangenome approach from errors caused by an imperfect reference variant set: the GIAB scenario shows how the pipeline would behave if the reference variant set were perfect, the 1000G scenario shows the performance with a realistic, imperfect reference variant set, and the 1000G+GIAB scenario shows the performance in a realistic setting for those variants that the pipeline could in principle genotype.
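The 1000G+GIAB restriction amounts to intersecting the two variant sets; a minimal sketch, keying variants on chromosome, position, and alleles (the file names are placeholders, and the actual evaluation is done with vcfeval rather than this simplification):

```python
import gzip

def variant_keys(vcf_path):
    """Collect (chrom, pos, ref, alt) keys from a (possibly gzipped) VCF."""
    keys = set()
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):      # split multi-allelic records
                keys.add((chrom, int(pos), ref, allele))
    return keys

# Restrict the evaluation to variants present in both reference sets
# (hypothetical file names).
shared = variant_keys("1000G.vcf.gz") & variant_keys("GIAB_v3.3.2.vcf.gz")
print(f"{len(shared)} variants can in principle be genotyped by the pipeline")
```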

We evaluated the genotyping accuracy using RTG Tools vcfeval [39], which computes precision and recall for all variants, SNPs only, and non-SNPs only. vg produces a confidence value for each variant, and the evaluation produces a precision-recall curve over confidence thresholds. We selected the threshold with the highest F-measure and report the precision and recall at that threshold. We evaluated the results within the Genome in a Bottle high-confidence regions across all chromosomes in each scenario.
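Selecting the reported operating point boils down to maximizing the F-measure over the precision-recall curve; a small sketch follows (the tuple layout is our assumption, not vcfeval's output format):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def best_threshold(curve):
    """curve: iterable of (threshold, precision, recall) points, e.g. parsed from
    the precision-recall output. Returns the point with the highest F-measure."""
    return max(curve, key=lambda point: f_measure(point[1], point[2]))

# Example with made-up points:
curve = [(10, 0.990, 0.90), (20, 0.995, 0.93), (30, 0.999, 0.85)]
print(best_threshold(curve))   # (20, 0.995, 0.93)
```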

Table 3 shows the results. The genotyping accuracy is high in the GIAB scenario but lower in the 1000G scenario, showing that the choice of variant set affects the accuracy noticeably, with the F-measure dropping from 0.985 to 0.930. However, when excluding variants that the pipeline could not genotype even in principle, the F-measure is 0.970. This shows that a large part of the missing recall in the 1000G scenario comes from variants that are not included in the reference variant set.

Although previous publications [36] have reported performance exceeding the results in Table 3, the genotyping experiment shows an example use case for GraphAligner. The major limitation of the pipeline is that it cannot call novel variants and instead only genotypes known variants. We did not try varying the parameters of vg’s genotyping module or otherwise adjusting the genotyping process, which is tuned for short-read genotyping and may not be optimal for long reads.

Error correction

We have implemented a hybrid error correction pipeline based on sequence-to-graph alignment. Aligning long reads to a de Bruijn graph (DBG) built from short reads is an established method for hybrid error correction [6, 7]: a DBG is built from the short reads, the best alignment between each long read and a path in the DBG is found, and the sequence of that path is then used as the corrected long read.
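For intuition, a toy DBG can be built directly from the short reads by linking (k-1)-mers that co-occur in a k-mer; this is only an illustrative sketch, not how BCalm2 or FMLRC construct their (compacted or multi-k) graphs:

```python
from collections import defaultdict

def build_dbg(short_reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen in the reads."""
    edges = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

dbg = build_dbg(["ACGTACGA", "CGTACGAT"], k=4)
# Each k-mer links a (k-1)-mer prefix to a (k-1)-mer suffix; error correction
# then searches this graph for a path close to the long read.
print(dict(dbg))
```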

Zhang et al. [40] performed an evaluation of 16 different error correction methods. Based on their results, we chose FMLRC [8] as a fast and accurate hybrid error corrector for comparison. We also compare to LoRDEC [6] since our pipeline uses the same overall idea.

LoRDEC [6] builds a de Bruijn graph from the short reads, then aligns the long reads to it using a depth-first search and uses the path sequence as the corrected read. FMLRC [8] also aligns the reads to a graph, except instead of building one de Bruijn graph it uses an FM-index which can represent all de Bruijn graphs and dynamically vary the k-mer size. FMLRC then corrects the reads in two passes, using different k-mer sizes. Our error correction pipeline is similar to LoRDEC. Figure 2 shows the pipeline. We first self-correct the Illumina reads using Lighter [41], then build the de Bruijn graph using BCalm2 [42], align the long reads using GraphAligner with default parameters, and finally extract the path as the corrected read.
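The last step, turning the chosen graph path back into a corrected read, is a matter of spelling the path sequence; a minimal sketch, assuming a node-labelled graph with a fixed (k-1)-base overlap between consecutive nodes as in the toy DBG above:

```python
def spell_path(node_labels, overlap):
    """Concatenate node labels along a path, dropping the `overlap` bases shared
    between consecutive nodes; the result serves as the corrected read."""
    if not node_labels:
        return ""
    seq = node_labels[0]
    for label in node_labels[1:]:
        seq += label[overlap:]
    return seq

# Example: a path of 3-mers with 2-base overlaps spells the corrected sequence.
print(spell_path(["ACG", "CGT", "GTA"], overlap=2))   # "ACGTA"
```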

Fig. 2: Overview of the error correction pipeline. The circles represent data and the rectangles programs

Due to fluctuations and biases in Illumina coverage, some genomic areas are impossible to correct with short reads even in principle. Our pipeline therefore has two modes: either we output full reads, keeping uncorrected areas as they are, or clipped reads, where the uncorrected areas are removed and the read is split into multiple corrected sub-reads if needed. In the results, we present the full reads as “GraphAligner” and the clipped reads as “GraphAligner-clip.” We similarly report “LoRDEC” as full reads and “LoRDEC-clip” as clipped reads. FMLRC does not offer an option to clip the reads, so we report only full reads.
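Conceptually, the clipped mode splits each read at its uncorrected stretches; a minimal sketch, assuming a per-base mask marking which positions were covered by the corrected path (the mask representation is our own):

```python
def clip_read(sequence, corrected_mask, min_len=1):
    """Split a read into corrected sub-reads.

    sequence       : the full (corrected) read sequence
    corrected_mask : sequence of booleans, True where the base was corrected
    Returns the maximal corrected stretches of length >= min_len.
    """
    pieces, start = [], None
    for i, ok in enumerate(corrected_mask):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            pieces.append(sequence[start:i])
            start = None
    if start is not None:
        pieces.append(sequence[start:])
    return [p for p in pieces if len(p) >= min_len]

# Full mode keeps the whole read; clipped mode keeps only the corrected pieces.
print(clip_read("ACGTACGTAC", [True]*4 + [False]*2 + [True]*4))   # ['ACGT', 'GTAC']
```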

To evaluate the results, we use the evaluation methodology from Zhang et al. [40]. The long reads are first corrected, and the evaluation pipeline is then run on both the raw and the corrected reads. The first step of the evaluation removes reads shorter than 500 bp. Note that the reads are removed during the evaluation step; that is, they are still corrected in the initial correction step, and different reads may be removed from the uncorrected and corrected sets. The remaining reads are then aligned to the reference genome. The alignment yields several quality metrics, including the number of aligned reads and base pairs, read N50, error rate, and genomic coverage. Here, we report the error rate as given by samtools stats instead of alignment identity. Resource consumption is measured as CPU time and peak memory use.

We use the E. coli Illumina+PacBio dataset (E. coli, called D1-P + D1-I by Zhang et al.) and the D. melanogaster Illumina+ONT dataset (Fruit fly, called D3-O + D3-I by Zhang et al.) from Zhang et al. [40]. In addition, we use whole human genome PacBio Sequel and Illumina data from HG00733, randomly subsampled to 15x coverage for PacBio and 30x for Illumina. We use the diploid assembly from [43] as the ground truth for HG00733. We did not include LoRDEC in the fruit fly or HG00733 experiments, as the results in [40] show that FMLRC outperforms it in both speed and accuracy. Although we use the same evaluation method, our results differ slightly from those of Zhang et al. for two reasons: first, Zhang et al. use LoRDEC version 0.8 with default parameters, while we use version 0.9 with the parameters suggested for E. coli in the LoRDEC paper [6]; second, Zhang et al. use FMLRC version 0.1.2 and construct the BWT with msBWT [44], while we use version 1.0.0 and construct the BWT with RopeBWT2 [45] as recommended by the FMLRC documentation.
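Two of the evaluation steps, the 500 bp length filter and reading the error rate reported by samtools stats, can be sketched as follows; the parser assumes the usual tab-separated "SN	error rate:" line in the samtools stats output:

```python
def keep_long_reads(reads, min_len=500):
    """Drop reads shorter than 500 bp before evaluation, mirroring the first filter step.
    reads: dict mapping read name to sequence."""
    return {name: seq for name, seq in reads.items() if len(seq) >= min_len}

def error_rate_from_stats(stats_path):
    """Pull the error rate out of `samtools stats` output (SN section), assuming
    the usual 'SN<TAB>error rate:<TAB>value' layout."""
    with open(stats_path) as fh:
        for line in fh:
            if line.startswith("SN\terror rate:"):
                return float(line.split("\t")[2])
    return None
```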

Table 4 shows the results. The amount of aligned sequence is similar in all cases. For the PacBio datasets, the amount of corrected sequence is lower than the uncorrected input sequence, while for ONT, the amount of sequence increases during correction. This is consistent with the observation that insertion errors are more common than deletions in PacBio data and vice versa for ONT [47]. The number of reads is noticeably higher and the N50 lower in the clipped modes for both LoRDEC and GraphAligner, showing that most reads contain uncorrected areas and that clipping reduces read contiguity. In addition, the fruit fly and human experiments show that clipping significantly reduces the genome fraction covered by the reads. The effect of clipping is more pronounced in the more complex genomes: in the whole human genome dataset, reads are on average cut into four pieces, around 4% of the genome is lost due to clipping, and read N50 drops substantially. GraphAligner is about 30x faster and 2.7x more accurate than LoRDEC for E. coli, and over four times faster than FMLRC on all datasets. When not clipping reads, GraphAligner’s error rate is slightly worse than FMLRC’s for E. coli (0.51% vs. 0.30%), but substantially better for D. melanogaster (1.2% vs. 2.3%) and human (3.4% vs. 7.1%). For the human genome HG00733, GraphAligner thus achieves an error rate over two times lower while running over twelve times faster.

Our pipeline is a large improvement in runtime over the state of the art. The error rates are competitive for simpler genomes and significantly better for more complex genomes. We hypothesize that the two-pass method used by FMLRC can in principle enable better correction than a graph with a single k-mer size, but that FMLRC’s performance on the larger genomes is limited by its alignment method, while GraphAligner can handle the more complex genomes. When using the clipped mode, that is, when only considering the parts of the reads that have been corrected, the accuracy in the corrected areas can approach or exceed the accuracy of short reads, which emphasizes the value of the clipped mode to users. The main source of errors is in fact uncorrected areas without sufficient short-read coverage.


Acknowledgements

We thank the many people who made data and software publicly available, particularly vgteam for providing the vg toolkit as an open source software. We thank Braunvieh Schweiz for providing pedigree and genotype data of Original Braunvieh and Brown Swiss cattle. Semen samples of the sequenced bulls were kindly provided by Swissgenetics.

Review history

The review history is available as Additional file 4.

Peer review information

Andrew Cosgrove was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.


Introduction

Structural variants (SVs) such as deletions, insertions, and duplications account for a large part of the genomic diversity among individuals and have been implicated in many diseases, including cancer. With the advent of novel DNA sequencing technologies, whole genome sequencing (WGS) is becoming an integral part of cancer diagnostics that can potentially enable tailored treatments for individual patients (Stratton, 2011). Despite advances in large-scale cancer genomics projects (such as TCGA and PCAWG of the International Cancer Genome Consortium, https://icgc.org/), systematic and comprehensive analysis of massive genomic data, in particular the detection of SVs in genomes, remains challenging due to computational and algorithmic limitations (Alkan, Coe & Eichler, 2011; Yung et al., 2017; Ma et al., 2018; Gröbner et al., 2018).

Recent tools for somatic and germline SV detection (callers) exploit more than one type of information present in WGS data (Lin et al., 2015). For example, DELLY (Rausch et al., 2012) relies on split reads and discordant read pairs, while LUMPY (Layer et al., 2014) additionally utilizes read-depth information. Furthermore, callers such as Manta (Chen et al., 2016) and GRIDSS (Cameron et al., 2017) also incorporate short-read assembly. To obtain a more comprehensive and/or accurate callset, ensemble approaches have yielded promising results (English et al., 2015; Mohiyuddin et al., 2015; Becker et al., 2018; Fang et al., 2018). In such an approach, (i) a range of SV callers are executed, and (ii) their results are combined into a single callset, as sketched below. While this approach has been demonstrated to improve SV callsets, step (i) poses a major bottleneck, as running multiple SV callers efficiently on the user’s computational infrastructure and/or adding new SV callers (as they become available) is far from straightforward.
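Step (ii) is commonly implemented by clustering calls of the same type whose intervals reciprocally overlap; the sketch below is a simplified illustration, where the 50% reciprocal-overlap threshold and the record layout are our assumptions rather than the behaviour of any specific tool:

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two intervals (start, end); assumes positive lengths."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def merge_callsets(calls, min_ro=0.5):
    """calls: list of (caller, chrom, start, end, svtype). Greedily cluster calls of
    the same chromosome and SV type with >= min_ro reciprocal overlap."""
    merged = []
    for call in sorted(calls, key=lambda c: (c[1], c[4], c[2])):
        for cluster in merged:
            rep = cluster[0]
            if (call[1], call[4]) == (rep[1], rep[4]) and \
               reciprocal_overlap((call[2], call[3]), (rep[2], rep[3])) >= min_ro:
                cluster.append(call)
                break
        else:
            merged.append([call])
    return merged

calls = [("delly", "chr1", 1000, 5000, "DEL"),
         ("lumpy", "chr1", 1100, 5100, "DEL"),
         ("manta", "chr1", 9000, 9500, "DUP")]
print(len(merge_callsets(calls)))   # 2 merged SV clusters
```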