I'm just getting started out learning how to use the QTL package in R studio, and I'm trying to find additional datasets to practice with. I've already worked through the datasets here on the Rqtl website. Is there anywhere I can find similar datasets to work with? I'm specifically looking to work with autosomal data.
Qtlizer: comprehensive QTL annotation of GWAS results
Exploration of genetic variant-to-gene relationships by quantitative trait loci such as expression QTLs is a frequently used tool in genome-wide association studies. However, the wide range of public QTL databases and the lack of batch annotation features complicate a comprehensive annotation of GWAS results. In this work, we introduce the tool “Qtlizer” for annotating lists of variants in human with associated changes in gene expression and protein abundance using an integrated database of published QTLs. Features include incorporation of variants in linkage disequilibrium and reverse search by gene names. Analyzing the database for base pair distances between best significant eQTLs and their affected genes suggests that the commonly used cis-distance limit of 1,000,000 base pairs might be too restrictive, implicating a substantial amount of wrongly and yet undetected eQTLs. We also ranked genes with respect to the maximum number of tissue-specific eQTL studies in which a most significant eQTL signal was consistent. For the top 100 genes we observed the strongest enrichment with housekeeping genes (P = 2 × 10⁻⁶) and with the 10% highest expressed genes (P = 0.005) after grouping eQTLs by r² > 0.95, underlining the relevance of LD information in eQTL analyses. Qtlizer can be accessed via https://genehopper.de/qtlizer or by using the respective Bioconductor R-package (https://doi.org/10.18129/B9.bioc.Qtlizer).
Deep learning [1] has attracted wide attention in both the scientific and industrial communities. In the computer vision field, deep-learning-based techniques using convolutional neural networks (CNNs) are actively applied to various tasks, such as image classification [2], object detection [3,4], and semantic/instance segmentation [5,6,7]. Such techniques have also been influencing the field of agriculture. This involves image-based phenotyping, including weed detection [8], crop disease diagnosis [9,10], fruit detection [11], and many other applications as listed in a recent review [12]. Moreover, a neural network combining image features with environmental variables has been used to predict plant water stress for automated control of greenhouse tomato irrigation [13]. The numerous and high-context data generated in this field appear to be well suited to deep learning.
However, one of the drawbacks of using deep learning is the need to prepare a large amount of labeled data. The ImageNet dataset as of 2012 consists of 1.2 million and 150,000 manually classified images in the training dataset and validation/test dataset, respectively [14]. Meanwhile, the COCO 2014 Object Detection Task consists of 328,000 images containing 2.5 million labeled object instances of 91 categories [15]. An annotated dataset of this scale is generally difficult to prepare for an individual or a research group. In the agricultural domain, it has been reported that a sorghum head detection network can be trained with a dataset consisting of 52 images with an average of 400 objects per image [16], while a crop stem detection network was trained starting from 822 images [17]. These case studies imply that the amount of data required for a specialized task may be smaller than that for a relatively generalized task, such as the ImageNet classification and COCO detection challenges. Nonetheless, the necessary and sufficient amount of annotation data to train a neural network is generally unknown. Although many techniques to decrease the labor cost, such as domain adaptation or active learning, are widely used in plant/bio science applications [18,19,20], the annotation process is highly stressful for researchers, as it is like running a marathon without knowing the goal.
A traditional way to minimize the number of manual annotations is to learn from synthetic images, which is occasionally referred to as sim2real transfer. One of the important advantages of using a synthetic dataset for training is that the ground-truth annotations can be obtained automatically without the need for human labor. A successful example can be found in person image analysis methods that use image datasets with synthetic human models [21] for tasks such as person pose estimation [22]. Similar approaches have also been used for the preparation of training data for plant image analysis. Isokane et al. [23] used synthetic plant models for the estimation of branching patterns, while Ward et al. generated artificial images of Arabidopsis rendered from 3D models and utilized them for neural network training in leaf segmentation [24].
One drawback of the sim2real approach is the gap between the synthesized images and real scenes, e.g., nonrealistic appearances. To counter this problem, many studies attempt to generate realistic images from synthetic datasets, for example by using generative adversarial networks (GANs) [25,26]. In the plant image analysis field, Giuffrida et al. [27] used GAN-generated images to train a neural network for Arabidopsis leaf counting. Similarly, Arsenovic et al. used StyleGAN [28] to create training images for plant disease image classification [29].
On the other hand, an advantage of the sim2real approach is the capability of creating a (nearly) infinite amount of training data. One approach that bridges the sim2real gap by leveraging this advantage is domain randomization, which trains deep networks using large variations of synthetic images with randomly sampled physical parameters. Although domain randomization is somewhat related to data augmentation (e.g., randomly flipping and rotating the images), the synthetic environment enables the representation of variations under many conditions, which is generally difficult to attain by straightforward data augmentation techniques for real images. An early attempt at domain randomization generated images using different camera positions, object locations, and lighting conditions, which is similar to the technique applied to control robots [30]. For object recognition tasks, Tremblay et al. [31] proposed a method to generate images with randomized textures on synthetic data. In the plant-phenotyping field, Kuznichov et al. recently proposed a method to segment and count leaves of not only Arabidopsis but also avocado and banana by using synthetic leaf textures placed at various sizes and angles, so as to mimic images acquired in real agricultural scenes [32]. Collectively, the use of synthetic images has a huge potential in the plant-phenotyping research field.
Seed shape, along with seed size, is an important agricultural phenotype. It is one of the yield components of crops, which are affected by environmental conditions in the later developmental stages. Seed size and shape can be predictive of germination rates and subsequent development of plants [33,34]. Genetic alteration of seed size contributed to a significant increase in thousand-grain weight in contemporary barley-cultivated germplasm [35]. Several studies report the enhancement of rice yield by utilizing seed width as a metric [36,37]. Moreover, others utilized elliptic Fourier descriptors, which make it possible to handle seed shape as variables representing a closed contour, successfully characterizing the traits of various species [38,39,40,41]. Morphological parameters of seeds thus appear to be powerful metrics for both crop-yield improvement and biological studies. However, including the said reports, many of the previous studies have evaluated seed shape by qualitative metrics (e.g., whether the seeds are similar to the parental phenotype), by vernier caliper, or by manual annotation using image-processing software. Such phenotyping is generally labor-intensive and cannot completely exclude the possibility of quantification errors that differ by annotator. To execute a precise and large-scale analysis, automation of the seed-phenotyping step is preferred.
In recent years, several studies have systematically analyzed the morphology of plant seeds by image analysis. Ayoub et al. focused on barley seed characterization in terms of area, perimeter, length, width, F circle, and F shape based on digital camera-captured images [42]. Herridge et al. utilized a particle analysis function of ImageJ (https://imagej.nih.gov/ij/) to quantify and differentiate the seed size of Arabidopsis mutants from the background population [43]. SmartGrain software has been developed to realize the high-throughput phenotyping of crop seeds, successfully identifying the QTL that is responsible for seed length of rice [44]. Miller et al. reported a high-throughput image analysis to measure morphological traits of maize ears, cobs, and kernels [45]. Wen et al. developed image analysis software that can measure seed shape parameters such as width, length, and projected area, as well as the color features of maize seeds; they found a correlation between these physical characteristics and seed vigor [46]. Moreover, commercially available products such as Germination Scanalyzer (Lemnatec, Germany) and PT portable tablet tester (Greenpheno, China) also aim or have the ability to quantify the morphological shape of seeds. However, the aforementioned approaches require the seeds to be sparsely oriented for efficient segmentation. When seeds are densely sampled and physically touching each other, they are often detected as a unified region, leading to an abnormal seed shape output. This requires the user to manually reorient the seeds in a sparse manner, which is a potential barrier to securing a sufficient number of biological replicates in the course of high-throughput analysis. In such situations, deep-learning-based instance segmentation can be used to overcome the problem by segmenting the respective seed regions regardless of their orientation. Nonetheless, the annotation process described previously was thought to be the potential limiting step.
In this paper, we show that a synthetic dataset, in which the combination and orientation of seeds are artificially rendered, is sufficient to train a deep neural network for instance segmentation of real-world images. Moreover, applying our pipeline enables us to extract morphological parameters at a large scale, allowing precise characterization of barley natural variation from a multivariate perspective. The proposed method can alleviate the labor-intensive annotation process and enable rapid development of deep-learning-based image analysis pipelines in the agricultural domain, as illustrated in Fig. 1. Our method is closely related to sim2real approaches with domain randomization: we generate a large number of training images by placing synthetic seeds with actual textures at random orientations and locations.
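The idea of rendering seeds at random orientations and locations, with the instance labels obtained for free, can be sketched roughly as follows. This is only an illustration under simplifying assumptions and not the pipeline from the paper: plain ellipses stand in for textured seed cut-outs, and the function name render_synthetic_mask, the canvas size, and the semi-axis ranges are all hypothetical.

```python
import math
import random


def render_synthetic_mask(width=64, height=64, n_seeds=5, seed=0):
    """Return an instance mask: 0 is background, i marks pixels of seed i.

    Ellipses are a stand-in (assumption) for real seed textures.
    """
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]
    for label in range(1, n_seeds + 1):
        # Randomized location, size, and orientation of one synthetic seed.
        cx, cy = rng.uniform(0, width), rng.uniform(0, height)
        a, b = rng.uniform(6, 10), rng.uniform(3, 5)  # hypothetical semi-axes (pixels)
        angle = rng.uniform(0, math.pi)
        cos_t, sin_t = math.cos(angle), math.sin(angle)
        for y in range(height):
            for x in range(width):
                dx, dy = x - cx, y - cy
                # Rotate the pixel into the ellipse's own coordinate frame.
                u = dx * cos_t + dy * sin_t
                v = -dx * sin_t + dy * cos_t
                if (u / a) ** 2 + (v / b) ** 2 <= 1.0:
                    # Later seeds overwrite earlier ones, so seeds may touch or overlap.
                    mask[y][x] = label
    return mask


if __name__ == "__main__":
    mask = render_synthetic_mask()
    labels = sorted({v for row in mask for v in row})
    print("labels present in the mask:", labels)
```

Because the mask is produced by the same procedure that draws the seeds, no manual annotation is involved; a real pipeline would paste textured seed images instead of filling ellipses.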
The conventional method requires manual labeling of images to generate the training dataset, while our proposed method can substitute this step with a synthetic dataset for training a crop seed instance segmentation model.
Over the past 30 years, the metazoan Caenorhabditis elegans has become a premier animal model for determining the genetic basis of quantitative traits (1,2). The extensive knowledge of molecular, cellular and neural bases of complex phenotypes makes C. elegans an ideal system for the next endeavour: determining the role of natural genetic variation on system variation. These efforts have resulted in an accumulation of a valuable amount of phenotypic, high-throughput molecular and genotypic data across different developmental worm stages and environments in hundreds of strains (3). In addition, a similar wealth has been produced on hundreds of different C. elegans wild isolates and other species (20). For example, C. briggsae is an emerging model organism that allows evolutionary comparisons with C. elegans and quantitative genetic exploration of its own unique biological attributes (21).
This rapid increase in valuable data calls for an easily accessible database allowing for comparative analysis and meta-analysis within and across Caenorhabditis species (22). To facilitate this, we designed a public database repository for the worm community, WormQTL (http://www.wormqtl.org). Driven by the PANACEA project of the systems biology program of the EU, its design was tuned to the needs of C. elegans researchers via an intensive series of interactive design and user evaluation sessions on a mission to integrate all available data within the project.
As a result, data that were scattered across different platforms and databases can now be stored, downloaded, analysed and visualized in an easy and comprehensive way in WormQTL. In addition, the database provides a set of analysis tools with a user interface to search the database and explore genotype–phenotype mapping based on R/qtl (23,24). New data can be uploaded and downloaded using the extensible plain text format for genotype and phenotypes, XGAP (25). There is no limit to the type of data (from gene expression to protein, metabolite or cellular data) that can be accommodated because of its extensible design. All data and tools can be accessed via a public web user interface and programming interfaces to R and REST web services, which were built using the MOLGENIS biosoftware toolkit (26). Moreover, users can upload and share more R scripts as ‘plugins’ for colleagues in the community to use directly and run those on a computer cluster using software modules from xQTL workbench (27); this requires login to prevent abuse. All software can be downloaded for free to be used, for example, as a local mirror of the database and/or to host new studies.
All the software was built as open source, reusing and building on existing open source components as much as possible. WormQTL is freely accessible without registration and is hosted on a large computational cluster enabling high-throughput analyses at http://www.wormqtl.org. Below we detail the results, methods used to implement the system and future plans.
Open-Access Data and Computational Resources to Address COVID-19
COVID-19 open-access data and computational resources are being provided by federal agencies, including NIH, public consortia, and private entities. These resources are freely available to researchers, and this page will be updated as more information becomes available.
The Office of Data Science Strategy seeks to provide the research community with links to open-access data, computational, and supporting resources. These resources are being aggregated and posted for scientific and public health interests. Inclusion of a resource on this list does not mean it has been evaluated or endorsed by NIH.
To suggest a new resource, please send an email with the name of the resource, the website, and a short description to [email protected]
NIAID Clinical Trials Data Repository, [email protected], is a NIAID cloud-based, secure data platform that enables sharing of and access to reports and data sets from NIAID COVID-19 and other sponsored clinical trials for the basic and clinical research community.
A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of SARS-CoV-2 and COVID-19. Information on how to best use this resource is available.
The Broad Terra cloud workspace for best practices with COVID-19 genomics data
- Raw COVID-19 sequencing data from the NCBI Sequence Read Archive (SRA)
- Workflows for genome assembly, quality control, metagenomic classification, and aggregate statistics
- Jupyter Notebook produces quality control plots for workflow output
The open source dataset of nearly 50,000 chemical substances includes antiviral drugs and related compounds that are structurally similar to known antivirals for use in applications including research, data mining, machine learning and analytics. A COVID-19 Protein Target Thesaurus is also available. CAS is a division of the American Chemical Society.
The CDC is providing a variety of data on COVID-19 in the United States.
Maintained by China National Center for Bioinformation/National Genomics Data Center, 2019nCoVR is a comprehensive resource on COVID-19, combining up-to-date information on all published sequences, mutation analyses, literatures and others.
View listed clinical studies related to the coronavirus disease (COVID-19). Studies are submitted in a structured format directly by the sponsors and investigators conducting the studies. Submitted study information is generally posted on ClinicalTrials.gov within 2 days after initial submission and site content is updated daily. Full website content is also available through the API.
This collection of files contains information for printing 3D physical models of SARS-CoV-2 proteins and is part of the NIH 3D Print Exchange.
Freely available dataset of 45,000 scholarly articles, including over 33,000 with full text, on COVID-19, SARS-CoV-2, and related coronaviruses. This machine-readable resource is provided to enable the application of natural language processing and other AI techniques.
See the CORD-19 Challenge, developed in partnership with Kaggle. Amazon Web Services has a CORD-19 search website.
Read the accompanying call to action from the White House Office of Science & Technology Policy and learn more about the creation of CORD-19.
This web-based viewer offers 3D visualization and analysis of SARS-CoV-2 protein structures with respect to the CoV-2 mutational patterns.
The COVID-DPR provides whole slide images of histopathologic samples relevant to COVID-19, including biopsy samples and autopsy specimens. The current focus of the repository includes tissue from the lungs, heart, liver, and kidney. The repository contains examples of H1N1, SARS, and MERS for comparison.
The NCI Cancer Imaging Program (CIP) is utilizing its Cancer Imaging Archive as a resource for making COVID-19 radiology and digitized histopathology patient image sets publicly available.
A centralized sequence repository for all strains of novel corona virus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI). Included are both the original sequences submitted by the principal investigator as well as SRA-processed sequences that require the SRA Toolkit for analysis.
All Dimensions publications, datasets, and clinical trials related to COVID-19, updated daily. Content exported from the openly accessible Dimensions application accessible at https://covid-19.dimensions.ai/.
The European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory, has a COVID-19 Data Portal to facilitate data sharing and analysis and ultimately contribute to the European COVID-19 Data Platform. EMBL-EBI is part of the International Nucleotide Sequence Database Collaboration (INSDC); the National Center for Biotechnology Information (NCBI) is the U.S. partner of the INSDC.
The downloadable data file is updated daily and contains the latest available public data on COVID-19. Each row/entry contains the number of new cases reported per day and per country. You may use the data in line with ECDC’s copyright policy.
Provides rapid, open, and unrestricted access to virus nucleotide sequences and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.
Provides rapid, open, and unrestricted access to virus conceptually translated protein sequences and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.
Human transcriptional responses to SARS-CoV-2 infection
International database of hCoV-19 genome sequences and related clinical and epidemiological data
GCP is hosting a repository of public datasets and offering free hosting and queries of COVID datasets. Learn more about the free hosting and queries of COVID datasets.
Comprehensive, expert-curated portfolio of COVID‑19 publications and preprints that includes peer-reviewed articles from PubMed and preprints from medRxiv, bioRxiv, ChemRxiv, and arXiv.
NLM curated literature hub for COVID-19
NIGMS-funded modeling research. Public-access data collections with documented metadata.
NCATS is generating a collection of datasets by screening a panel of SARS-CoV-2-related assays against all approved drugs. These datasets, as well as the assay protocols used to generate them, are being made immediately available to the scientific community on this site as these screens are completed.
SARS-CoV-2 focused content from NCBI Virus, including links to related resources. Search, filter, and download the most up-to-date nucleotide and protein sequences from GenBank and RefSeq (taxid 2697049). Generate multiple sequence alignments and phylogenetic trees for sequences of interest. Provides one-click access to the Betacoronavirus BLAST database and relevant literature in PubMed.
Open-source SARS-CoV-2 genome data and analytic and visualization tools
The Inter-university Consortium for Political and Social Research (ICPSR) has launched a new repository of data examining the impact of the novel coronavirus global pandemic. This repository is a free, self-publishing option for researchers to share COVID-19 related data.
A resource to aggregate data critical to scientific research during outbreaks of emerging diseases, such as COVID-19
Small molecule compounds, bioactivity data, biological targets, bioassays, chemical substances, patents, and pathways
On March 13, national science and technology advisors from a dozen countries, including the United States, called on publishers to voluntarily agree to make their COVID-19 and coronavirus-related publications, and the available data supporting them, immediately accessible in PMC and other appropriate public repositories to support the ongoing public health emergency response efforts. The articles added to PMC are distributed through the PMC Open Access Subset and are made available in CORD-19.
The RCSB Protein Data Bank is offering access to COVID-19 related PDB structures for research and related images and videos for education.
Reactome is a free, open-source, curated and peer-reviewed pathway database. The goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education. In response to the COVID-19 pandemic, Reactome is fast-tracking the annotation of human coronavirus infection pathways.
A database of carefully validated SARS-CoV-2 protein structures, including many structural models which have been re-refined or re-processed. The resource is being updated weekly by Minor Lab at the University of Virginia as new SARS-CoV-2 structures are being deposited to the Protein Data Bank.
Provides rapid, open, and unrestricted access to virus nucleotide or metagenomic sequence data and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.
In this section we illustrate the application of the Q-, N-, and NL-methods to two simulated data sets: one with highly correlated traits and the other with uncorrelated traits. We generated data from backcrosses composed of 112 individuals with 16 chromosomes of length 400 cM containing 185 equally spaced markers each, and phenotype data on 6000 traits. The phenotype data were generated according to the models

\[
Y_k =
\begin{cases}
\beta M + \theta L + \varepsilon_k, & \text{if } Y_k \text{ belongs to a hotspot}, \\
\theta L + \varepsilon_k, & \text{if } Y_k \text{ does not belong to a hotspot},
\end{cases}
\]

where L ∼ N(0, σ²) represents a latent variable that affects all k = 1, …, 6000 traits; θ represents the latent variable effect on the phenotype and works as a tuning parameter to control the strength of the correlation among the traits; M = γQ + ε_M represents a master regulator trait that affects the phenotypes in the hotspot; β represents the master regulator effect on the phenotype; and Q represents the QTL giving rise to the hotspot. Note that traits composing the hotspot are directly affected by the master regulator M and map to Q indirectly. γ represents the QTL effect on the master regulator, and ε_k and ε_M represent independent and identically distributed error terms following a N(0, σ²) distribution.
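The generating model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the study: the 0/1 coding of the backcross genotype Q, the value of β, the reduced number of traits, and the function name simulate_traits are all hypothetical choices made for the example.

```python
import random


def simulate_traits(n_individuals=112, n_traits=20, hotspot_traits=range(5),
                    beta=1.0, theta=1.5, gamma=2.0, sigma=1.0, seed=0):
    """Simulate phenotypes Y_k under the latent-variable / master-regulator model.

    beta and the 0/1 genotype coding are assumptions; theta, gamma and
    sigma default to the values quoted for simulated example 1.
    """
    rng = random.Random(seed)
    hotspot = set(hotspot_traits)
    # Backcross genotype at the hotspot QTL Q: two classes, coded 0/1 (assumption).
    Q = [rng.randint(0, 1) for _ in range(n_individuals)]
    # Latent variable L ~ N(0, sigma^2), shared by every trait.
    L = [rng.gauss(0.0, sigma) for _ in range(n_individuals)]
    # Master regulator M = gamma * Q + eps_M.
    M = [gamma * q + rng.gauss(0.0, sigma) for q in Q]
    Y = []
    for k in range(n_traits):
        trait = []
        for i in range(n_individuals):
            # Every trait gets the latent effect plus its own error term eps_k.
            y = theta * L[i] + rng.gauss(0.0, sigma)
            # Hotspot traits additionally receive the master-regulator effect.
            if k in hotspot:
                y += beta * M[i]
            trait.append(y)
        Y.append(trait)
    return Q, Y


if __name__ == "__main__":
    Q, Y = simulate_traits()
    print(len(Y), "traits x", len(Y[0]), "individuals")
```

Setting theta=0 removes the shared latent term, which corresponds to the uncorrelated-traits scenario of the second simulated example.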
In both examples we simulated three hotspots: (i) a small hotspot located at 200 cM on chromosome 5 showing high LOD scores (see Figure S5, A and D), (ii) a big hotspot located at 200 cM on chromosome 7 showing LOD scores ranging from small to high (see Figure S5, B and E), and (iii) a big hotspot located at 200 cM on chromosome 15 showing LOD scores ranging from small to moderate (see Figure S5, C and F).
In both simulations we set σ² = 1 and γ = 2. QTL analysis was performed using Haley–Knott regression (Haley and Knott 1992) with the R/qtl software (Broman et al. 2003). We adopted Haldane’s map function and genotype error rate of 0.0001. Because we adopted a dense genetic map (our markers are ∼2.16 cM apart), we did not consider putative QTL positions between markers.
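For orientation, the marker spacing and the recombination fraction it implies under Haldane's map function, r = (1 − exp(−2d))/2 with d in Morgans, can be checked with a short sketch. The function names are hypothetical, and the spacing is computed as chromosome length divided by marker count, which is an assumption chosen because it reproduces the ∼2.16 cM figure quoted above.

```python
import math


def marker_spacing_cm(chrom_length_cm, n_markers):
    """Approximate spacing as length / marker count (assumption matching ~2.16 cM)."""
    return chrom_length_cm / n_markers


def haldane_recombination_fraction(distance_cm):
    """Haldane map function: r = 0.5 * (1 - exp(-2 d)), with d in Morgans."""
    d = distance_cm / 100.0  # convert centimorgans to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


if __name__ == "__main__":
    spacing = marker_spacing_cm(400.0, 185)
    r = haldane_recombination_fraction(spacing)
    print(f"spacing = {spacing:.2f} cM, recombination fraction = {r:.4f}")
```

At this spacing the recombination fraction between adjacent markers is only about 2%, which is consistent with the decision not to evaluate positions between markers.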
In the first example, denoted simulated example 1, we adopted a latent effect of 1.5. In the second example, denoted simulated example 2, we adopted a latent effect of 0 and simulated uncorrelated traits. Figure S6, A and B, shows the distribution of all pairwise correlations among the 6000 traits for both simulated examples. These extreme examples illustrate the effect of phenotype correlation on QTL hotspot sizes. The correlations of the real data are actually intermediate (see Figure S6C).
Figure 1 shows the results for the Q- and N-methods for simulated example 1, using α = 0.05. Figure 1A shows the hotspot architecture computed using a single-trait LOD threshold of 3.65; i.e., at each genomic location the plot shows the number of traits with LOD score >3.65. In addition to the simulated hotspots on chromosomes 5, 7, and 15, Figure 1A shows a few spurious hotspots, including a big hotspot on chromosome 8. The blue and red lines show the N- and Q-methods’ thresholds, 560 and 7, respectively. In this example the N-method was unable to detect any hotspots, whereas the Q-method detected false hotspots on chromosomes 3, 6, 8, 9, 12, and 16. Figure 1, B and C, shows the hotspot size null distributions and the 5% significance thresholds for the N- and Q-methods, respectively.
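The hotspot architecture described here is simply a per-locus count of traits whose LOD score exceeds the single-trait threshold. A minimal sketch follows, assuming the LOD scores are available as a traits-by-loci table; the function names and the toy numbers are hypothetical, and the strict inequality against the size threshold is an assumption.

```python
def hotspot_sizes(lod, threshold):
    """Count, for each locus j, the traits k with lod[k][j] > threshold."""
    n_loci = len(lod[0])
    return [sum(1 for trait in lod if trait[j] > threshold) for j in range(n_loci)]


def significant_loci(sizes, size_threshold):
    """Indices of loci whose hotspot size exceeds the size threshold (strict, by assumption)."""
    return [j for j, size in enumerate(sizes) if size > size_threshold]


if __name__ == "__main__":
    # Toy LOD table: 3 traits (rows) by 3 loci (columns); values are made up.
    lod = [
        [4.0, 1.0, 3.7],
        [5.2, 0.5, 2.0],
        [3.9, 3.66, 1.0],
    ]
    sizes = hotspot_sizes(lod, threshold=3.65)
    print(sizes)                       # hotspot size at each locus
    print(significant_loci(sizes, 2))  # loci with more than 2 traits mapping
```

In the simulated example the same count would be compared against 560 for the N-method and 7 for the Q-method.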
N- and Q-method analyses for simulated example 1. (A) Inferred hotspot architecture using a single-trait permutation threshold of 3.65 corresponding to a GWER of 5% of falsely detecting at least one QTL somewhere in the genome. The blue line at count 560 corresponds to the hotspot size expected by chance at a GWER of 5% according to the N-method permutation test. The red line at count 7 corresponds to the Q-method’s 5% significance threshold. The hotspots on chromosomes 5, 7, 8, and 15 have sizes 50, 500, 125, and 280, respectively. (B) The N-method’s permutation null distribution of the maximum genome-wide hotspot size. The blue line corresponds to the hotspot size 560 expected by chance at a GWER of 5%. (C) The Q-method’s permutation null distribution of the maximum genome-wide hotspot size. The red line at 7 shows the 5% threshold. Results are based on 1000 permutations.
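Given the maximum genome-wide hotspot size from each permutation, a 5% threshold is the upper 5% point of that null distribution. The sketch below shows one way to compute it; the particular empirical-quantile convention and the function name permutation_threshold are assumptions, since the text does not specify how the quantile was taken, and the null values in the usage example are placeholders rather than real permutation results.

```python
import math


def permutation_threshold(null_max_sizes, alpha=0.05):
    """Empirical (1 - alpha) quantile of a permutation null distribution.

    Returns the smallest observed value v such that at least a fraction
    (1 - alpha) of the null values are <= v (one common convention).
    """
    ordered = sorted(null_max_sizes)
    n = len(ordered)
    # Rounding guards against floating-point noise in (1 - alpha) * n.
    idx = math.ceil(round((1.0 - alpha) * n, 9)) - 1
    return ordered[max(idx, 0)]


if __name__ == "__main__":
    # Placeholder null distribution standing in for 1000 permutation maxima.
    null_max = list(range(1, 1001))
    print(permutation_threshold(null_max, alpha=0.05))
```

With 1000 permutations and α = 0.05, this convention returns the 950th smallest of the permutation maxima.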
Figures 2 and 3 show the NL-method analysis results for simulated example 1, using α = 0.05. Figure 2, A–D, presents the hotspot architecture inferred using four different quantile-based permutation thresholds. Figure 2A presents the hotspot architecture inferred using a LOD threshold of 7.07. Only the true hotspots (on chromosomes 5, 7, and 15) were significant by this conservative threshold. Figure 2B presents the hotspot architecture computed using a LOD threshold of 4.93 that aims to control GWER ≤ 0.05 for spurious hotspots of size 50. The hotspots on chromosomes 5, 7, and 15 were detected by this threshold. Figure 2, C and D, shows the hotspot architectures using LOD thresholds of 4.21 and 3.72, respectively. Only the hotspot on chromosome 7 was detected as significant for these thresholds. Note that neither the big spurious hotspot on chromosome 8 nor any of the other spurious hotspots shown in Figure 1A were picked up by the quantile-based thresholds.
NL-method analysis for simulated example 1. (A–D) Hotspot architecture inferred using different quantile-based permutation thresholds; i.e., for each genomic location it shows the number of traits that mapped there with a LOD score higher than the quantile-based permutation threshold. (A) Hotspot architecture inferred using a permutation LOD threshold of 7.07 corresponding to the LOD threshold that controls the probability of falsely detecting at least a single linkage for any of the traits somewhere in the genome under the null hypothesis that none of the traits have a QTL anywhere in the genome, at an error rate of 5%. (B, C, and D) Hotspot architectures computed using QTL mapping LOD thresholds of 4.93, 4.21, and 3.72 that aim to control GWER at a 5% error rate for spurious eQTL hotspots of sizes 50, 200, and 500, respectively.
Hotspot size significance profile derived with the NL-method for simulated example 1. For each genomic location (i.e., x-axis position) the hotspot sizes at which the hotspot was significant are shown, that is, at which the hotspot locus had more traits mapping to it with a LOD score higher than the threshold on the right, than expected by chance. The scale on the left shows the range of spurious hotspot sizes investigated by our approach. The scale on the right shows the respective LOD thresholds associated with the spurious hotspot sizes on the left. The range is from 7.07, the conservative empirical LOD threshold associated with a spurious “hotspot of size 1,” to 3.65, the single-trait empirical threshold, associated with a spurious hotspot of size 560. All permutation thresholds were computed targeting GWER ≤ 0.05, for n = 1, … , 560.
Figure 3 connects hotspot size to quantile-based threshold. This hotspot size significance profile depicts a sliding window of hotspot size thresholds over n = 1, …, N, where N = 560 corresponds to the hotspot size threshold derived from the N-method. For each genomic location, the hotspot size (left axis) is significant for the LOD threshold (right axis). For example, the chromosome 5 hotspot was significant up to size 49, meaning that >1 trait mapped to the hotspot locus with LOD > 7.07, >2 traits mapped to the hotspot locus with LOD > 6.46, and so on up to hotspot size 49 where >49 traits mapped to the hotspot locus with LOD > 4.93. The hotspot on chromosome 7 was significant up to size 499, and the hotspot on chromosome 15 (higher peak) was significant for hotspot sizes 2–129 and 132–143.
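The sliding-window logic behind the significance profile can be sketched as follows: for each spurious hotspot size n with its own LOD threshold, a locus is flagged at size n when more than n traits map there above that threshold. This is an illustration of the rule as described in the text, not the authors' implementation; the function name, the toy LOD scores, and the third threshold value (5.90) are hypothetical, while 7.07, 6.46, and 4.93 are taken from the text but are assigned here to sizes 1 to 4 purely for brevity.

```python
def significance_profile(lod_at_locus, thresholds):
    """Return the hotspot sizes n at which a locus is significant.

    thresholds[n - 1] is the LOD threshold paired with spurious hotspot
    size n; the locus is significant at size n when more than n traits
    have a LOD score above that threshold.
    """
    significant_sizes = []
    for n, thr in enumerate(thresholds, start=1):
        count = sum(1 for score in lod_at_locus if score > thr)
        if count > n:
            significant_sizes.append(n)
    return significant_sizes


if __name__ == "__main__":
    # Toy thresholds for sizes 1..4; 5.90 is a made-up intermediate value.
    thresholds = [7.07, 6.46, 5.90, 4.93]
    strong_locus = [8.0, 7.5, 6.8, 6.5, 5.0]  # a few traits with high LOD scores
    weak_locus = [8.0, 3.0]                   # only one trait above any threshold
    print(significance_profile(strong_locus, thresholds))
    print(significance_profile(weak_locus, thresholds))
```

Because the thresholds decrease as n grows, a small hotspot of traits with high LOD scores and a large hotspot of traits with moderate LOD scores can both be flagged, each in a different part of the size range.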
The NL-method detected only the real hotspots on chromosomes 5, 7, and 15, whereas the N-method did not detect any hotspots and the Q-method detected 6 spurious hotspots, in addition to the real hotspots. The sliding window of quantile-based thresholds detected the small hotspot composed of traits with high LOD scores on chromosome 5 as well as the big hotspots on chromosomes 7 and 15. Equally important, the NL-method dismissed spurious hotspots, such as chromosome 8, composed of numerous traits with LOD scores <5.57.
Figure 4 shows the results for the Q- and N-methods for simulated example 2, using α = 0.05. Figure 4A shows the hotspot architecture. The blue and red lines show the N- and Q-method’s thresholds, 19 and 8, respectively. In this example, both the N- and Q-methods were able to correctly pick up the hotspots on chromosomes 5, 7, and 15.
N- and Q-method analyses for simulated example 2. (A) Inferred hotspot architecture using a single-trait permutation threshold of 3.65 corresponding to a GWER of 5% of falsely detecting at least one QTL somewhere in the genome. The blue line at count 19 corresponds to the hotspot size expected by chance at a GWER of 5% according to the N-method permutation test. The red line at count 8 corresponds to the Q-method’s 5% significance threshold. The hotspots on chromosomes 5, 7, and 15 have sizes 50, 464, and 220, respectively. (B) The N-method’s permutation null distribution of the maximum genome-wide hotspot size. The blue line at 19 corresponds to the hotspot size expected by chance at a GWER of 5%. (C) The Q-method’s permutation null distribution of the maximum genome-wide hotspot size. The red line at 8 shows the 5% threshold. Results are based on 1000 permutations.
Comparison of Figures 1A and 4A shows that the spurious hotspots tend to be much smaller when the traits are uncorrelated (compare chromosome 8 on both plots), leading to much smaller N-method thresholds (compare the blue lines). The Q-method thresholds, on the other hand, are quite close. This is expected since the Q-method threshold depends on the number of significant QTL (we observed 3162 significant linkages in simulated example 1, against 3586 significant linkages in example 2) and not on the correlation among the traits.
Figure S7 displays the hotspot size significance profile for simulated example 2. The NL-method also detected the hotspots on chromosomes 5, 7, and 15.
In this simulation study we assess and compare the error rates of the Q-, N-, and NL-methods under three different levels of correlation among the traits. To determine whether the methods are capable of controlling the GWER at the target levels, we conduct separate simulation experiments as follows:
We generate a “null genetical genomics data set” from a backcross composed of (i) 6000 traits, none of which is affected by a QTL, but that are nevertheless affected by a common latent variable to generate a correlation structure among the traits, and (ii) genotype data on 2960 equally spaced markers across 16 chromosomes of length 400 cM (185 markers per chromosome). Any detected QTL hotspot is spurious, arising from correlation among the traits.
We perform QTL mapping analysis, and 1.5-LOD support interval processing, of the 6000 traits. For each one of the following single-trait QTL mapping permutation thresholds (that control GWER at the α = 0.01, 0.02, … , 0.10 levels, respectively), we do the following:
a. We compute the observed QTL matrix and generate the Q-method hotspot size threshold on the basis of 1000 permutations of the observed QTL matrix. We record whether or not we see at least one spurious hotspot of size greater than the Q-method threshold anywhere in the genome.
b. For each genomic location we count the number of traits above the single-trait LOD threshold. We compute the N-method hotspot size threshold on the basis of 1000 permutations of the null data set. We record whether at least one spurious hotspot of size greater than the N-method threshold is anywhere in the genome.
c. We compute the NL-method LOD thresholds for spurious hotspot size thresholds ranging from 1 to the N-method threshold. For each NL-method LOD threshold, $\lambda_{n,\alpha}$, where n = 1, … , N, we count, at each genomic location, how many traits mapped to that genomic location with a LOD > $\lambda_{n,\alpha}$ and record whether there is at least one spurious hotspot of size greater than n anywhere in the genome.
We repeat the first two steps 1000 times. For each one of the three methods, the proportion of times we recorded spurious hotspots, out of the 1000 simulations, gives us an estimate of the empirical GWER associated with the method.
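Parts of the simulation procedure above can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the function names are hypothetical, the latent variable and noise are assumed to be standard normal, and the Q-method permutation is assumed to shuffle each trait's row of the 0/1 QTL matrix independently across genomic locations. The QTL mapping step itself is not shown.

```python
import numpy as np

def simulate_null_traits(n_ind, n_traits, latent_effect, seed=0):
    """Traits with no QTL, correlated only through one shared latent variable.

    Assumption: latent variable and residual noise are standard normal.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(n_ind)
    noise = rng.standard_normal((n_ind, n_traits))
    return latent_effect * latent[:, None] + noise

def q_method_threshold(qtl_matrix, n_perm=1000, alpha=0.05, seed=0):
    """Permutation threshold for hotspot size from a 0/1 QTL matrix.

    qtl_matrix : (traits x loci) indicator of significant linkages.
    Assumption: each row is permuted independently, and the maximum
    column sum (largest hotspot) is recorded for each permutation.
    """
    rng = np.random.default_rng(seed)
    q = np.asarray(qtl_matrix, dtype=int)
    max_sizes = np.empty(n_perm, dtype=int)
    for b in range(n_perm):
        permuted = rng.permuted(q, axis=1)  # shuffle within each row
        max_sizes[b] = permuted.sum(axis=0).max()
    return float(np.quantile(max_sizes, 1 - alpha))

def empirical_gwer(spurious_flags):
    """Proportion of simulated null data sets with a spurious hotspot."""
    return float(np.asarray(spurious_flags, dtype=bool).mean())

def mean_abs_offdiag_corr(traits):
    """Average absolute pairwise correlation among traits (columns)."""
    corr = np.corrcoef(traits, rowvar=False)
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[mask]).mean())

# Small toy run (sizes far below the 6000 traits used in the study).
uncorrelated = simulate_null_traits(500, 20, latent_effect=0.0)
correlated = simulate_null_traits(500, 20, latent_effect=1.0)
```

Each simulated data set would contribute one True/False flag (was any spurious hotspot larger than the threshold?), and `empirical_gwer` applied to the 1000 flags gives the observed error rate for a method.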
QTL analysis was performed as described above. Figure 5 shows the simulation results for null data sets generated using latent variable effects of 0.0, 0.25, and 1.0. The Q- and N-methods, with observed GWER (red), and target error rate (black), have two α-levels, α1 for QTL mapping and α2 for the tail area of the hotspot size permutation null distribution. Figure 5 displays the results when α1 = α2 = 0.01, 0.02, … , 0.10. The NL-method has a single α-level; the red curves are the observed GWERs for spurious hotspot sizes n = 1, … , N, where N represents the N-method’s permutation threshold.
Observed GWER for the Q-, N-, and NL-methods under varying strengths of phenotype correlation. Black lines show the targeted error rates. Red curves show the observed GWER. (A–C) Results for uncorrelated phenotypes. (D–F) Results for weakly correlated phenotypes generated using a latent variable effect of 0.25. (G–I) Simulation results for highly correlated phenotypes generated using latent effect set to 1. The left, middle, and right columns show the results for the Q-, N-, and NL-methods, respectively. Note the different y-axis scales for the Q-method panels. The red curves on the NL-method panels show the observed GWER for hotspot sizes ranging from 1 to N, where N is the median N-method threshold for α = 0.10.
Figure 5, A–C, shows that for uncorrelated traits the Q- and N-methods were conservative, below target levels, whereas the NL-method shows error rates about the right target levels for most hotspot sizes. Figure 5, D and G, shows that error rates for the Q-method are higher than target levels when the traits are correlated and increase with correlation strength among the phenotypes. These results are expected since the Q-method’s thresholds depend on the number of QTL detected in the unpermuted data and tend to increase with the number of phenotypes. Because we generated the same number of phenotypes in the three simulation studies, the Q-method’s thresholds were similar. Therefore, the number and the size of the spurious QTL tend to be proportional to the correlation strength of the phenotypes. The N- and NL-methods, on the other hand, are designed to cope with the correlation structure among the phenotypes and show error rates close to the target levels as shown in Figure 5, E, F, H, and I.
Yeast data set example
In this section we illustrate and compare the Q-, N-, and NL-methods using data generated from a cross between two parent strains of yeast: a laboratory strain and a wild isolate from a California vineyard (Brem and Kruglyak 2005). The data consist of expression measurements on 5740 transcripts measured on 112 segregant strains, with dense genotype data on 2956 markers. Processing of the raw expression measurements was done as described in Brem and Kruglyak (2005), with an additional step of converting the processed measurements to normal quantiles by the transformation $\Phi^{-1}[(r_i - 0.5)/112]$, where $\Phi$ is the standard normal cumulative distribution function and the $r_i$ are the ranks. We performed QTL analysis using Haley–Knott regression (Haley and Knott 1992) with the R/qtl software (Broman et al. 2003). We adopted Haldane’s map function, with a genotype error rate of 0.0001, and set the maximum distance between positions at which genotype probabilities were calculated to 2 cM.
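The normal-quantile transformation described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' processing pipeline: the function name is hypothetical, the sample size n is taken from the length of the input (112 segregants in the yeast data), and ties are assumed to be broken by input order, which the source does not specify.

```python
from statistics import NormalDist

def normal_quantile_transform(values):
    """Map values to normal quantiles via inv_cdf((rank - 0.5) / n).

    Ranks run from 1 (smallest) to n (largest). Ties are broken by
    input order, which is an assumption made for this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]

# Toy expression values for one transcript (hypothetical numbers).
z = normal_quantile_transform([3.0, 1.0, 2.0])
```

The transformed values keep the ordering of the original measurements but are symmetric about zero, so the middle-ranked value maps to 0 and the extremes map to equal and opposite quantiles.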
Hotspot analysis of the yeast data, based on the N-method (Figure 6A), detected significant eQTL hotspots on chromosomes 2 (second peak), 3, 12 (first peak), 14, and 15 (first peak), at a GWER of 5% according to the null distribution of hotspot sizes shown in Figure 6B. The blue line represents the N-method’s significance threshold of N = 96. The maximum hotspot size on chromosome 8 was 95 and almost reached significance. Nonetheless, Figure 6A also shows suggestive (although substantially smaller) peaks on chromosomes 1, 4, 5, 7, 9, 12 (second peak), 13, 15 (second peak), and 16 that did not reach significance according to the N-method’s significance threshold.
N- and Q-method analyses for the yeast data. (A) Inferred hotspot architecture using a single-trait permutation threshold of 3.44 corresponding to a GWER of 5% of falsely detecting at least one QTL somewhere in the genome. The blue and red lines at counts 96 and 28 correspond to the hotspot size expected by chance at a GWER of 5% according to the N- and the Q-method permutation tests, respectively. (B and C) The permutation null distributions of the maximum genome-wide hotspot size based on 1000 permutations. The blue and red lines at 96 and 28 correspond, respectively, to the hotspot size expected by chance at a GWER of 5% for the N- and Q-methods.
The red line in Figure 6A represents the Q-method’s significance threshold of 28, derived from the null distribution of hotspot sizes shown in Figure 6C. The Q-method detected significant hotspots on chromosomes 2 (both peaks), 3, 4, 5 (both peaks), 7, 8, 12 (both peaks), 13, 14, and 15 (both peaks).
Figure 7 shows the hotspot significance profile for the NL-method. The major hotspots on chromosomes 2, 3, 12 (first peak), 14, and 15 (first peak) were significant across all thresholds tested, and the hotspot on chromosome 8 was significant up to size 93. Furthermore, the NL-method showed that the small hotspots detected by the Q-method on chromosomes 5, 12 (second peak), 13, and 15 (second peak) might indeed be real. Nonetheless, the small hotspots on chromosomes 4 and 7, detected by the Q-method, are less interesting than the small hotspot on chromosome 1 that was actually missed by the Q-method.
Hotspot size significance profile derived with the NL-method. The range is from 7.40, the conservative empirical LOD threshold associated with a spurious “hotspot of size 1,” to 3.45, the single-trait empirical threshold, associated with a spurious hotspot of size 96. All permutation thresholds were computed targeting GWER ≤ 0.05, for n = 1, … , 96.
Plant breeders and geneticists have benefited from the availability of tools for the rapid and cost-effective development of molecular marker-based linkage maps. As predicted by Tanksley et al., linkage maps have proven to be useful for discovering, dissecting and manipulating the genes that determine simple and complex traits in crop plants. Barley (Hordeum vulgare) is a model for plant breeding and genetics because it is diploid (2n = 2x = 14) and has a long history of genetics research. Over the past decade, increasingly dense maps of the barley genome have been constructed using multiple populations and many types of molecular markers. Most recently, Szűcs et al. reported an integrated 2383-locus linkage map developed in the Oregon Wolfe Barley (OWB) mapping population based on representative early generation markers (e.g. morphological loci, RFLPs, and SSRs) and single nucleotide polymorphisms (SNPs).
SNP markers have become increasingly important tools for molecular genetic analysis, as single base-pair changes are the most abundant small-scale genetic variation present between related sequences of DNA. To date, most SNP development efforts in larger, more complex genomes such as barley have focused on "complexity reduction" techniques that aim to sequence a fraction of the genome, such as that represented in EST collections. Once a panel of markers is established from initial SNP discovery, samples from a selected population are then genotyped using oligo-extension or array-based platforms. Both these strategies were used for construction of the current barley SNP-based maps [3, 6, 7].
The emergence of massively-parallel, next-generation sequencing (NGS) platforms capable of producing millions of short (50-100 bp) DNA sequence reads has reduced the costs of DNA sequencing and offers the tantalizing possibility of making direct genotyping-by-sequencing (GBS) practical. Recently, Huang and colleagues have elegantly demonstrated how genotyping using NGS data can facilitate the rapid development of linkage maps in domesticated rice, Oryza sativa. Despite the attractiveness of this approach and availability of next-generation sequencing platforms, at present, GBS methods retain significant limitations. First, current protocols for synthesis of DNA fragment libraries compatible with high-throughput sequencing platforms are laborious, costly and would be impractical for production efforts involving hundreds of samples. Second, sequence-based genotyping is restricted to those species with available, high-quality, pseudomolecule-sized genome assemblies. While many key economic and scientifically meritorious species will undoubtedly be sequenced as a direct result of the ongoing revolution in NGS technologies, what is required are marker platforms that can provide GBS independent of the status of an assembled genome.
Restriction-site Associated DNA (RAD) markers detect genetic variation adjacent to restriction enzyme cleavage sites across a target genome. The first iteration of RAD markers facilitated cloning of mutants isolated from genetic screens in classic model systems [12, 13]. More recent efforts have focused on adapting the RAD technique for use in NGS platforms, specifically the Illumina sequencing-by-synthesis method, to enable individual sequence based genotyping of samples. The sequenced RAD marker system enjoys two favourable characteristics for high-throughput GBS. As previously mentioned, the RAD method uses restriction enzymes as a complexity reduction strategy to reduce the sequenced portion of the genome anywhere from 0.01% to 10%. Furthermore, RAD protocols facilitate the creation of highly multiplexed NGS sequencing formulations containing many tens of samples in a single library, thereby reducing library preparation costs. While previously published RAD studies have explored NGS of limited numbers of individuals or bulked genotyping of pooled populations, the objective of this research was to determine the feasibility of constructing a RAD marker genetic map in barley. We used the OWB population as a mapping resource in order to directly compare RAD and EST-based SNP maps and to assess the quality and utility of a linkage map built with the two types of data.
qtl.outbred has been extensively tested. Firstly, we established that the triM algorithm produces exactly the same genotype probabilities as R/qtl when inbred line cross data are used (i.e. line crosses of inbred mouse strains). Secondly, we used genotypic data from an outbred line cross between domesticated and wild chickens with a simulated phenotype. Genotype probabilities were calculated with the triM algorithm using qtl.outbred to interface it with R/qtl. The single- and two-QTL genome scan for this dataset is illustrated in Figure 1. The identified peaks correspond to where the QTL were simulated. Lastly, we calculated QTL genotype probabilities for the simulated chicken intercross using GridQTL. These genotype probabilities were imported in R/qtl, using the qtl.outbred interface, and the conducted QTL scan gave similar results to those reported in Figure 1.
The graph was obtained by using outbred line cross data (domesticated and wild chicken intercross genotypic data with simulated phenotype), calculating genotype probabilities with the triM algorithm from the qtl.outbred interface and importing it directly to R/qtl where the genome scans were performed. LOD scores for Haley-Knott regression for (a) single-QTL genome scan and (b) two-QTL genome scan are reported. LOD scores are indicated on the colour scale, where numbers to the left correspond to the upper triangle indicating two-locus epistasis and values to the right correspond to the lower triangle indicating the significance for a test of two versus one QTL.
The authors would like to acknowledge all farm owners and managers who took part in our study, and in particular Joyce Voogt for her valuable insights into farmer opinions. We would like to acknowledge Fiona Brown, Nicolas Lopez-Villalobos, Danny Donaghy and Martin Correa Luna from Massey University and Sandeep Seernam from AgResearch for their help during the data collection process. Lastly, we would like to acknowledge Stella Sim, Esther Donkersloot and Neil Macdonald from LIC for providing photographs used in this research.
New NIH Resource to Analyze Biomedical Research Citations: The Open Citation Collection
Citations from scientific articles are more than lines on a page. They can, when reading between those lines, shed some light on the development of scientific thought and on the progress of biomedical technology. We’ve previously posted some examples in blogs here, here, and here. But to better see the light, we all would benefit from more comprehensive data and easier access to them.
My colleagues within the NIH Office of Portfolio Analysis sought to answer this call. Drs. Ian Hutchins and George Santangelo embarked on a hefty bibliometric endeavor over the past several years to curate biomedical citation data. They aggregated over 420 million citation links from sources like Medline, PubMed Central, Entrez, CrossRef, and other unrestricted, open-access datasets. With this information in hand, we can now take a better glimpse into relationships between basic and applied research, into how a researcher’s works are cited, and into ways to make large-scale analyses of citation metrics easier and free.
As described in their recent PLOS Biology essay, the resulting resource, called the NIH Open Citation Collection (OCC), is now freely available and ready for the biomedical and behavioral research communities to use. You can access, visualize, and bulk download OCC data as part of the NIH’s webtool called iCite (Figure 1). iCite allows users to access bibliometric tools, look at productivity of research, and see how often references are cited.
Figure 2 illustrates the new OCC web interface. Data from a group of publications are displayed on a summary table on the top. Various charts with visualizations lie beneath the summary table. They show publications over time (left), total citations per year by the publication year of the referenced article (center left) or the citing article (center right), and average citations per article in each publication year (right). These tables are customizable as publications are selected or deselected from the portfolio. You can also see information related to the article, such as links to the citing and referenced papers on PubMed, on the bottom of the screen.
The new OCC resource collection within iCite aims to reduce the costs of large-scale analyses of structured citation data, a recognized impediment for the bibliometrics field. OCC goes further still. It enhances the quality, robustness, and reproducibility of analyses using citation data. Moreover, it allows those interested to freely access structured data and share it with others. And, it also provides for transparency, which improves understanding of how knowledge flows and applied technologies develop.
Let’s use OCC to see that knowledge flow in action (Figure 3). Here the team assessed citation networks associated with the development of cancer immunotherapy. Each dot represents a scientific paper. The color represents whether the paper describes basic (green), translational (yellow), or clinical (red) science. The most influential clinical trials are shown in the large red dots in the center. These trials formed part of the evidence base FDA required for approval as a clinical treatment.
Information available in OCC will continue to grow. In addition to accumulating citations, the OCC will acquire data from preprint servers and other materials currently not indexed in PubMed.
We invite you to take a look at and use the OCC. It will be exciting to see how the research community will use this new resource when conducting their own analyses. Data from these studies delving into citation dynamics may even provide additional insights that help all of us better understand how the scientific enterprise works and how we could make it even better.