Intra-species genetic variation can be used to investigate population structure, selection, and gene flow in non-model vertebrates; and due to the plummeting costs for genome sequencing, it is now possible for small labs to obtain full-genome variation data from their species of interest. However, those labs may not have easy access to, and familiarity with, computational tools to analyze those data.
We have created a suite of tools for the Galaxy web server aimed at handling nucleotide and amino-acid polymorphisms discovered by full-genome sequencing of several individuals of the same species, or using a SNP genotyping microarray. In addition to providing user-friendly tools, a main goal is to make published analyses reproducible. While most of the examples discussed in this paper deal with nuclear-genome diversity in non-human vertebrates, we also illustrate the application of the tools to fungal genomes, human biomedical data, and mitochondrial sequences.
This project illustrates that a small group can design, implement, test, document, and distribute a Galaxy tool collection to meet the needs of a particular community of biologists.
Keywords:Genome analysis; Species conservation; Population structure; Selective sweeps; Admixture
A remarkable decrease in the cost of high-throughput sequencing has prompted many groups to address their biological questions by applying this technology. However, as is widely recognized, data analysis remains challenging for all but the largest and most experienced groups. Frequently, one of the challenges in this analysis is identifying the polymorphisms within species from the vast amount of raw data produced by the sequencing instruments, which requires computational resources and expertise not always available to a small biology lab. Fortunately, this and other data-processing steps are becoming more affordable as technology evolves.
Once the gigabases of sequence data have been filtered to, say, a few million intra-species DNA polymorphisms and a few thousand amino-acid variants, the computational requirements for their exploration are often relatively modest. Nevertheless, the analyses performed on the polymorphisms frequently require the expertise of the lab that originated and processed the biological samples, and it is burdensome for a small group to assemble and master the required computational tools. We have written a set of tools that run on the Galaxy web server to perform such analyses. They have an emphasis on understanding the population structure of the focus species and/or developing testable hypotheses about phenotypic consequences of genetic polymorphisms. In addition to avoiding the need for research groups to download and install all relevant software, a major benefit of using Galaxy for these, or indeed other, analyses is that reproducibility of published results is often enhanced. Moreover, these tools can be applied to polymorphisms identified by technologies other than sequencing, such as a SNP genotyping microarray.
The goal of this paper is to describe our Galaxy tools and illustrate their effectiveness. We begin with an overview of the new tools, after which we sketch how they were applied in a recently published manuscript, illustrate some of their capabilities, and compare their results with results published by other groups. These case-studies are listed in Table 1. Finally, we mention some lessons learned in the course of creating this Galaxy toolset.
Table 1. Examples discussed in this paper
Our tools work on polymorphism data in tabular formats that are appropriate for loading into the Galaxy web server [1-4]. The files for amino-acid variants and genes are basically just tab-delimited tables as required by Galaxy’s rich arsenal of table-manipulation tools. However, our single nucleotide variant (SNV) tables (which covers both intra-species SNPs and inter-species differences) have particular formats required by many of the tools that we have recently added to Galaxy, and a little familiarity with those formats is assumed in some of our later discussions.
Our most flexible format for SNV tables, called gd_snp (“gd” for Genome Diversity), has one row per SNV, and designated columns for the chromosome (and/or scaffold) name and position, the reference (or consensus) nucleotide, the variant nucleotide, and a quality value. For each individual (or sample) there are four columns, giving (1) the number of reads with the reference nucleotide, (2) the number of reads with the variant, (3) a genotype (0, 1, or 2 occurrences of the reference nucleotide; –1 = no genotype) and (4) a quality value for the genotype. A description of how columns are to be interpreted is specified in header lines, which can be prepared using one of our tools (#1 and 2 in the list below). Among other uses, this information lets Galaxy present the user with a simple interface for defining a set of individuals (Figure 1). In addition, there can be other columns, either supplied in the original table or generated by running Galaxy tools (e.g., each SNV’s FST value relative to two specified populations).
Figure 1. Specifying a “population”. The user is telling Galaxy that the individuals called “North1” through “North4” are to be considered a population called “North” in subsequent analyses (see the aye-aye example). Galaxy tools then know which columns of the SNV table to consult to locate information for further analysis.
Galaxy also supports a simpler format, called gd_genotype, which differs from gd_snp by having just a single column per individual, giving the genotype. The overall quality value (column 5 for gd_snp format) can also be omitted and/or other columns can be included. We provide a Galaxy tool to facilitate the conversion of VCF files and three commonly used population genetics formats (i.e., FSTAT, Genepop, and CSV, see #2 in the tool list below). VCF files that include the field’s allelic depth, genotype quality and genotype (“AD”, “GQ”, and “GT” respectively in the “FORMAT” field) can be converted into a gd_snp file. We also provide a Galaxy tool to convert gd_snp and gd_genotype tables into a form suitable for submission to dbSNP.
The following is a list of the tools we have made available on Galaxy, under the heading “Genome Diversity”. This is the status as of November 2013; we expect that the list will grow over time.
A. Basic Analyses
1. Make File: Create a gd_snp or gd_genotype file
2. Convert: Change CSV, FSTAT, Genepop or VCF to either gd_snp or gd_genotype
3. Specify Individuals: Define a collection of individuals from a SNV dataset
4. Coverage Distributions: Examine sequence coverage for SNVs
5. Phylogenetic Tree: Build a distance-based tree.
6. Filter SNPs: Discard some SNVs based on coverage or quality
7. Aggregate Individuals: Append summary columns for a population
8. Nucleotide Diversity: π and θ
9. Restore Attributes: Update properties of a SNV table
10. SNV table to VCF: For submission to dbSNP
B. Population Structure
11. Prepare Input: Filter and convert to the format needed for tools #12-14
12. PCA: Principal Component Analysis of genotype data
13. Population Complexity: Evaluate possible numbers of ancestral populations
14. Ancestry: Characterize ancestries with respect to inferred ancestral populations
15. Admixture: Map genomic intervals resembling specified source populations
C. FSTand Selective Sweeps
16. Overall FST: Estimate the relative fixation index between two populations
17. Per-SNP FST: Compute a fixation index score for each SNV
18. Remarkable Intervals: Find high-scoring runs of SNVs
19. Close relatives: Discover familial relationships
20. Pairs sequenced: Offspring estimated heterozygosity of sequenced pairs
21. Founders sequenced: Estimated heterozygosity from a pedigree with sequenced founders
22. Matings: Assignment of optimal breeding pairs
23. Inbreeding and kinship: Analyze the pedigree without genomic data
E. KEGG and GO
24. Get Pathways: Look up KEGG pathways for given Ensembl transcripts
25. Rank Pathways: Assess the impact of gene sets on pathways
26. Pathway Image: Draw a KEGG pathway, highlighting specified gene modules
27. Rank Terms: Assess the enrichment of gene sets on Gene Ontology terms
28. Cluster KEGG: Group gene categories by shared genes
F. Design Genotyping Studies
29. Sample SNPs: Select a specified number of uniformly spaced SNVs
30. Differential Cleavage: Select SNVs differentially cut by specified restriction enzymes
31. Flanking Sequence: Fetch DNA sequence for intervals surrounding the given SNVs
32. Pick Primers: Find suitable PCR primers for SNVs
33. Draw variants: show positions of SNVs and unsequenced intervals
34. Reorder individuals: exchange rows in the picture created by tool #33
35. Diversity: pi, allowing for unsequenced intervals
36. Phylip: prepare data for phylogenetic analysis by tool #37
37. RAxML: maximum-likelihood phylogenetic tree
Many of these capabilities are familiar to biologists who analyze genome sequences or genotype genetic markers in population samples. However, some detailed understanding of tools #17 and #18 is assumed in our subsequent discussions. First, the FST, or “fixation index”, assigns a value to each SNV based on the allele-frequency difference between two populations. Tool #17 (as well as #16) lets the user choose among the original definition of FST and two “unbiased estimates” of it [6,7]. The Reich formulation  has been shown to work well with small population sizes , and we use it exclusively unless stated otherwise.
Second, genomic intervals having particular characteristics, such as showing signs of a “selective sweep”, can be identified with the use of multi-individual sequence data. These sweeps are caused when an advantageous allele and neighboring linked variants increase their frequency in a population. Large FST values are one potential signature of a past selective sweep , though care must be taken because large FST values can also be created by genetic drift, demographic effects, or admixture [10-12]. We currently do not provide a tool for the traditional approach of scanning for unusual genomic intervals using a fixed “window” size, because in our opinion it would involve an undesirable number of user-selected values and options (e.g., specifying the window size, the amount of overlap between successive windows, the treatment of overlapping “significant” windows, discarding windows containing too few SNVs, etc.). Instead, we provide a tool (#18) that works with any chosen numerical column in a SNV table, for example, the FST relative to two populations or a value measuring homozygosity within a population. This tool has a single “tuning parameter”, which we call the “shift value”, set by the user. The tool subtracts this number from each SNV score, and then finds “maximal” intervals where the sum of shifted scores cannot be increased by adding or subtracting SNVs at the ends of the intervals. For instance, if the column in question contains FST values, the user could set the shift value at, say, the 90th percentile, so that 90% of the shifted values would be negative, and hence the SNVs in any high-scoring interval on average lie in the top 10%. In general, raising the shift value will lead to identification of fewer and shorter intervals. Statistical significance is estimated by a randomization strategy, in which the shifted SNV scores are shuffled some specified number of times, the highest-scoring interval found in each case, and the highest observed score is taken as the cutoff; this provides an empirical p-value, assuming that the scores are independent.
Another tool implementing a new algorithm estimates admixture in each of a set of individuals relative to two or three assumed source (often called “ancestral”) populations (tool #15). Roughly the same capability is available from other tools, such as SABER , HAPMIX  and PCAdmix . However, we implemented a simpler approach for data sets that have inadequate numbers of individuals, inadequate accuracy of genotype calls, or uncertainty about basic species parameters (e.g., mutation rate and extent of linkage disequilibrium) to justify use of complex evolutionary models. This has the advantages of much shorter execution times and simplicity of use. Like tool #18, there is a single “tuning parameter”, which we call the “genotype switch penalty”.
With two source populations, the tool’s goal is to partition every autosome of an admixed individual into three inferred “interval genotypes”: (1) both chromosomes from the first source population, (2) both chromosomes from the second source population, or (3) one chromosome from each source population. (There are six interval genotypes in the case of three source populations). Suppose for a moment that a genomic interval is entirely in one of those categories, and we want to determine which interval genotype is most likely. For each SNP, we are given the observed genotype for the potentially admixed individual and can estimate the reference allele frequency in each source population. From those data, we can compute the probability of the observed sequence of SNV genotypes being produced from each of the interval genotypes, preferring the scenario with highest probability. The remaining issue concerns the choice of when to switch from one interval genotype to another, which is solved efficiently with a technique called “dynamic programming”, a relative of a partitioning method used with hidden Markov models. As the program scans along a chromosome, a larger genotype switch penalty makes it more difficult to switch between inferred interval genotypes, so the autosomes are partitioned into fewer, but longer runs of constant genotype.
We use this capability for exploratory data analysis, where we experiment with various thresholds on minimal spacing between SNVs (to increase independence), minimal FST between the source populations (to identify “ancestry informative markers”), and switch penalty, to reach conclusions that are robust to changes in analysis parameters.
In the following paragraphs we turn to a set of examples aimed at providing the reader with a clearer understanding of the range of capabilities of the Galaxy tools. Most of them deal with analyzing variation in the nuclear genome of non-model vertebrates, but one example studies polymorphisms in the fungus Batrachochytrium dendrobatidis, two examples consider human data, including differences between sequences from normal and LGL-leukemia diseased cells of the same individual, and one study of which, deals with mitochondrial sequence data.
An analysis of low-coverage sequence data (roughly 5× coverage per individual) for aye-aye SNPs has been published  based on a de novo assembly of the aye-aye genome . The aye-aye is a lemur species with a relatively wide geographic distribution around the periphery of Madagascar. The goal of our study was to assess the species’ genetic diversity and population structure, and relate these values to the geographical range. One possible outcome of this kind of investigation might be to identify sub-populations that warrant treatment as separate “conservation units” because of their genetic distinctiveness.
We sequenced five individuals from the north of Madagascar, five from the east, and three from the west. A Galaxy coverage plot (tool #4 in the list above) showed that one of the samples from the north had particularly low sequence coverage, so we excluded it from further analysis. Also, for many SNPs the depth of sequence coverage seemed inadequate to support reliable estimations of genotype, so we omitted those SNPs from subsequent analyses (via tool #6), and also discarded SNPs where the coverage was so high as to suggest the presence of a regional duplication. The Galaxy phylogenetic-tree (#5) and PCA (#11, 12) tools indicated a clear population structure (further supported by tool #14), which appeared at first to be unrelated to the putative geographical source of the samples. We hypothesized that some samples had been mislabeled during handling, which we verified using PCR and Sanger sequencing experiments on separate DNA extractions from the source tissues, using primers identified by Galaxy tool #32 to amplify over a subset of the genotyped SNPs. We then specified three populations: North, West and East (tool #3; Figure 1). The phylogenetic tree, principal components and population-structure tools (#5, 11, 12, 14), then painted a consistent picture that the North population was particularly distinct. Figure 2 depicts the Galaxy commands that perform these analyses.
Figure 2. Commands for the aye-aye example. Depiction of the Galaxy commands needed to determine levels of sequence coverage, define sets of individuals, filter SNPs by depth of sequence coverage and non-fixation, draw a phylogenetic tree, perform a principal-components analysis, and analyze with 2 and 3 putative ancestral populations, as discussed for our aye-aye data.
The Galaxy tools also provided estimations of within-group diversity, π (tool #8), and overall FST values for pairs of aye-aye populations (tool #16), which we evaluated by synthesizing a human data set that matched the aye-aye sequences in numbers of individuals and sequence depth. We found that the overall FST between the North and East populations appeared to be 2.1 times greater than that between human sub-Saharan Africans and Europeans, despite the fact that nucleotide diversity (tool #8) within each of the three aye-aye populations is relatively low.
In addition to SNV tables, our tools produce Galaxy tables of putative amino-acid polymorphisms. For the aye-aye, we mapped the assembled contigs  and the SNPs they contain to the human genome, and used human gene annotations to infer coding exons in the aye-aye. The results of that analysis have not been published, and we sketch some observations here to illustrate the use of additional Galaxy tools.
We calculated a Locus Specific Branch Length (LSBL) score for each SNP in each of the three aye-aye populations. LSBL is a function of the pairwise FST between populations, and helps to isolate the direction of allele frequency change. It has been extensively used in previous papers (e.g., ). We then selected the SNPs that mapped to coding regions and had an LSBL score in the top 5% for each population (i.e., LSBL95, with thresholds 0.6112 for North, 0.4365 for East, and 0.5536 for West). The LSBL score can be computed for each lineage using
1. For each pair of populations, calculate the pair’s FST value for each SNP (using tool #17).
2. Use the standard Galaxy tool called “Compute an expression for every row” to compute, for each SNP:
We identified 390 coding mutations in the North population, 373 in the East and 420 in the West (above the LSBL95). Of these, the number of non-synonymous SNPs was roughly the same in the three populations (150 in 129 genes for North, 133 in 121 genes for East, and 134 in 128 genes for West). We looked for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in which these genes are known to be involved using the Get Pathways tool (#24), and then ranked them by percentage of genes affected using the Rank Pathways tool (#25). For this discussion, we consider only the West aye-aye population, for which this tool produced a list of 153 KEGG pathways for the genes with synonymous mutations, and 83 for the genes with non-synonymous mutations. For instance, the extracellular matrix (ECM) receptor interaction pathway was placed second in the synonymous ranking and third in the non-synonymous ranking. This pathway was one of eleven significantly enriched pathways for genes in the synonymous list (p = 3.8 × 10-7), and one of four in the non-synonymous list (p = 0.018). Three genes with non-synonymous mutations (LAMC2, HSPG2, and LAMA3) and eight with synonymous mutations (COL4A2, COL5A1, LAMA4, LAMB1, LAMB4, LAMC1, TNN, and SV2B) are associated with this KEGG pathway. We used the Pathway Image tool (#26) to visualize the genes’ roles in the pathway (Figure 3A).
Figure 3. Two KEGG pathways from the aye-aye data. A) KEGG pathway diagram showing the genes with coding mutations involved in the extracellular matrix-receptor interaction pathway. Eleven genes with SNPs in the top 5% by LSBL score in the West aye-aye population appear in this pathway, including three with non-synonymous mutations (LAMC2, HSPG2, and LAMA3). These genes are grouped in 5 different functional units distributed along the pathway (i.e., collagen, laminin, tenascin, perlecan, and SV2, all shown in red). B) KEGG pathway diagram for the Glycosylphosphatidylinositol-anchor biosynthesis pathway showing the central role of the PIG-N gene for GPI-anchor synthesis.
In support of these results, the Rank Terms tool (#27) produced a list of GO terms related to ECM-receptor interaction that were significantly enriched in the genes with non-synonymous mutations above LSBL95. These GO terms included “cytoskeletal anchoring at nuclear membrane” (p = 4.6 × 10-5), “laminin-5 complex” (p = 1.4 × 10-4), “basement membrane” (p = 0.0016), and “cell adhesion” (p = 0.0067). Also, we grouped the GO terms and KEGG pathways with non-synonymous mutations in the West population using the Cluster gene categories tool (#28). We used different thresholds to study the groups produced and found at a cutting threshold of 20% a cluster of gene categories that include the terms “response to glucose stimulus”, “collagen type VI”, “muscle organ development”, “sarcolema”, “extracellular matrix structural constituent”, “focal adhesion”, and “PI3k-Akt signal pathway”. Furthermore, we found consistently (with thresholds ranging from 25% to 85%) the cluster of gene categories: “response to glucose stimulus”, “collagen type VI”, “muscle organ development”, and “sarcolema”.
We studied the potential effects of non-synonymous mutations in the West population by ranking the KEGG pathways according to the changes in length and number of paths if the genes are disrupted. Among the five KEGG pathways that showed changes in both of these values, the Glycosylphosphatidylinositol (GPI)-anchor biosynthesis pathway was ranked first (change in the mean length of paths between sources and sinks = 4.5, change in the number of paths between sources and sinks = 4). The image of this pathway (Figure 3B; produced using tool #26) shows that a mutation in the gene PIG-N could disrupt the transference of phosphatidylethanolamine to the first mannose of the glycosylphosphatidylinositol. This result revealed a picture that could not have been obtained by using the overrepresentation approach: despite that only one gene (out of 23) was found to have a non-synonymous mutation, the role of this gene is required and critical in the GPI-anchor biosynthesis. Genes involved in both extracellular matrix-receptor interactions and cell adhesion (including GPI-anchor production) are implicated in tissue morphogenesis and organization [19,20]. Their role has been described in the organogenesis of kidney, lung, peripheral nerves, brain, extremities, digits, pancreas and placenta, as well as in the integrity maintenance of skeletal muscles, skin, and hair . The modules laminin and perlecan in the ECM-receptor interaction pathway include genes with non-synonymous mutations (LAMC2, HSPG2, and LAMA3). Both of these modules are involved in the linkage of extracellular matrix with dystrophin through dystrophin-associated glycoproteins (alpha-DG and beta-DG in Figure 3A; ). A failure in this linkage has been extensively associated with muscular dystrophy, as dystrophin is thought to provide mechanical reinforcement to the sarcolemma to protect it from the membrane stresses developed during muscle contraction [21-23].
The mutations affecting matrix-receptor interactions and cell adhesion are expected to evolve in concert as organisms adapt to specific niches [19,24]. Aye-ayes are highly specialized extractive foragers; they feed on insect larvae obtained from decaying tree bark, and on seeds. It has been suggested that limitations in the availability of food may explain the large individual home range requirements of this species . Previous papers have reported a relatively complex neuromuscular organization for lemurs, and have proposed that this is consistent with differences in habitat and surface utilization (e.g., arboreal vs. ground) [25,26]. Additionally, a potential for increased stress on the aye-aye's long gracile digits is generated during its locomotion, especially while descending trees . It is difficult to assess the extent to which the molecular mechanisms reported here may be implicated any kind of ongoing adaptation among aye-aye populations. However, one interpretation is that they might be involved in muscular adaptations to exploit the niche variability produced by the landscape variation, habitat diversity, and microendemism patterns of northern Madagascar . This example illustrates the use of some of our new tools, as well as the kinds of hypotheses they can lead to.
A number of methods have been developed for detecting evidence of selective sweeps using polymorphism data from multiple individuals, with each method exploiting a particular departure from the expectation with neutral evolution . A typical application of these methods is to identify genomic regions related to reproductive fitness, such as those conferring traits important for adaptation to a new environment. Several tools to support such analyses can be found in the new Genome Diversity toolset on Galaxy, and we wanted to compare their performance with accepted techniques.
The chicken genome was one of the first vertebrate genomes to be published . An analysis of multi-individual data was published later , where a windows-based approach was used to look for regions of low heterozygosity in various combinations of domestic breeds, with the goal of identifying genomic regions associated with economically important traits, such as egg or meat production. We were interested in understanding how much, and under what conditions, their results differ from genomic intervals found by our windows-free method.
The published project  sequenced ten samples from different chicken breeds, nine of which were each a pool of DNA from several individuals. Their analysis was carried out on the numbers of reads corresponding to the more common and less common allele, whose values were calculated for each combination of SNV and DNA sample. The authors kindly provided us with those numbers, from which we produced a Galaxy SNV table (gd_snp format) with 7,285,024 rows (i.e., SNVs) and 45 columns (see Methods).
A search for regions of high homozygosity and the genes within them can be conducted, starting with the SNV table and a list of chicken genes, by the following Galaxy commands, which are also depicted in Figure 4.
1. Specify individuals (tool #3 listed above), for example, all pools from domestic chickens, or all Commercial Broilers.
2. Aggregate those individuals (tool #7), to get totals of the reference alleles (column 46) and the variant alleles (column 47).
3. Use a standard Galaxy tool to compute (into column 50) the expression
where c46 and c47 are the values in columns 46 and 47. Intuitively, the two allele frequencies are c46/tot and c47/tot, where tot = c46 + c47, and we are adding their squares to quantify homozygosity.
4. Use the Remarkable Intervals tool (#18), setting the shift value to a desired threshold, say 0.9, to find intervals where the sum of the scores c50 – 0.9 is high; c50 is the value assigned to a SNV by step 3 (i.e., homozygosity).
5. Use a standard Galaxy tool to find genes that intersect the intervals identified by step 4.
Figure 4. Commands for the chicken example. Depiction of the Galaxy commands for finding genes contained in autosomal intervals of unusually high heterozygosity in eight pooled populations of domestic chickens and in two pooled samples of domestic broilers. The input datasets are the set of 7.3 million SNVs and a list of chicken genes with their chromosomal coordinates.
For the pool, the AD of all domestic individuals, 158 intervals of average length (approximately 85 kb) were reported . The intervals cover a total of 13.4 Mb, or approximately 1.3% of the chicken genome. We set the threshold in step 4 (see above) to 0.78, chosen by trial and error so that the average length of the 158 highest scoring intervals was also 85 kb. For the most part, the reported intervals agree with the highest-scoring intervals found by our window-free method. Our seventh highest-scoring interval, chr5 43,222,353-43,275,554, and their top-scoring segment, chr5 43,200,000-43,280,000, overlap the TSHR gene, which is a major focus of the paper . Our twelfth and their fourth highest scoring interval (6,252,242-6,301,349 and 6,240,000-6,300,000 on chromosome 24, respectively) overlap the gene BCDO2 for the yellow skin allele, which the authors of the original paper adopt as a proof of principle that a method can identify a known sweep . In all, 89 of their regions overlap one of our 100 highest-scoring intervals.
For other measurements of concordance between the two approaches, consider regions of low heterozygosity in the two commercial broiler lines, which are bred for efficient meat production. The paper  identified 132 intervals of average length around 62 kb, while we used the threshold 0.9 in step 4 (see above) to get an average length around 64 kb (close enough) for our highest scoring 132 intervals. One of the top-scoring reported intervals, chr1 57,340,000-57,560,000, contains several genes related to growth, including insulin-like growth factor 1 (IGF1). In our approach, the interval chr1 57,356,555-57,574,111 scores highest. The other interval reported as under selection in commercial broilers is chr4 71,720,000-71,860,000, containing the TBC1D1 gene, which had earlier been identified in several independent studies as the major Quantitative Trait Locus (QTL) explaining differences in growth between broilers and layers. Accordingly, our seventh highest-scoring interval is chr4 71,709,127-71,847,930, which also overlaps TBC1D1. Overall, our 100 highest-scoring intervals intersect 67 of their intervals. We also noticed a tendency for our highest-scoring intervals to overlap the 56% (74 of 132) of their intervals that intersect genes; our 20 highest scoring intervals overlap 15 of their gene-intersecting intervals, but only three of their intervals that do not intersect any annotated gene.
However, major differences between intervals found by the authors’ window-based approach and our window-free method can arise. Compared with our approach, their particular windows-based method favors regions with a low density of SNVs. Consider a simple example where one window has 10 SNVs, all fixed in the domestic birds (say nMaj_Allele = 30 and nMin_Allele = 0) and a second window with 100 of such SNVs. Then both windows score 0 according to the published approach, On the other hand, our approach instead works with homozygosity = 1 – heterozygosity, which is 1.0 for these SNVs. A threshold (for instance 0.9) is subtracted to give a score of 0.1, and the scores are added for each genomic interval, giving totals of 1.0 for the first interval (window) and 10.0 for the second, and a preference for the interval with more SNVs.
The seventeenth highest scoring reported interval for sweeps in commercial broilers , chr2 84,660,000-84,720,000, is not known to overlap any gene. The 1,272nd best interval from our approach (far from being statistically significant) is chr2 84,662,385-84,719,725. It is possible that the main source of this discrepancy between the two methods is the extremely low number of SNVs at chr2 84,660,000-84,720,000, namely 31 SNVs in the 60 kb interval. Giving the nearly 7.3 million SNVs in the 1 Gb chicken genome, the expected number of SNVs in this interval is around 450, making the interval an extreme outlier. We believe it is counter-intuitive to consider genomic intervals with an extremely low density of SNVs as likely candidates for having experienced (or still experiencing) positive selection; low SNV density seems more indicative of negative selection.
The demographic history and relationships between lineages of North American Canidae has often been studied using a handful of genetic markers with limited resolution of evolutionary relationships. Specifically, a few dozen co-dominant or uniparentally-inherited markers will only provide a fraction of the evolutionary history. One of the main and long-debated topics of North American canids has been the degree of admixture and species ancestries. Answers to those questions could potentially influence conservation planning. For instance, if an endangered species is identified to have a significant degree of genetic admixture, then the management options become less obvious and the relevant conservation policy may, most likely, need to be updated. To best address admixture among canids and better resolve their ancestry, a published study  analyzed genotypes from 48,036 SNVs (hereafter, referred to as 48 K) distributed genome-wide. In order to test the robustness of our tools, we reanalyzed the same dataset for admixed ancestry across wolves and coyotes of North America.
After formatting the 48 K data, individuals were designated into specific groups for subsequent testing. The California coyote (n = 12) and Yellowstone National Park gray wolves (YNP, n = 18) were labeled as non-admixed reference groups, with five putatively admixed groups identified for testing wolves from Algonquin Provincial Park (n = 2) and the Great Lakes region (n = 12), the Red wolf (n = 12), and two populations of coyotes (Northeastern, n = 13; Midwestern, n = 19) (Table 2A). To confirm the data, we conducted a principal component analysis (PCA) of the SNV genotypes and identified reference and admixed populations (Figure 5).
Table 2. Comparison of percentage of admixed ancestry results
Figure 5. Principal components analysis of canid data. PCA of the reference (California coyote and YNP wolf) and putatively admixed North American canid populations using 48 K SNVs.
From the published analysis , we had expectations a priori as to the ancestry composition of each individual. We began our two-ancestor dataset construction with computing per-SNP FST values (tool #17) using Wright’s original definition  to identify and retain 4,229 SNVs with FST > 0.4 using the two non-admixed reference populations of Yellowstone wolves and California coyotes (as per the published inclusion threshold ) as Ancestry Informative Markers (AIMs) for analysis using the admixture tool (#15). Next, we filtered SNVs with tool #6 to retain 3,175 SNVs with an average spacing of 1 SNV per 100 kb in order to construct a SNV set comparable to the published one . These methods were repeated with the inclusion of a third possible ancestor population, domestic dogs (n = 28 from modern breeds). We conducted two additional per-SNP FST analyses to compare the YNP wolf and California coyotes each with the domestic dog. As a result, we therefore filtered to keep SNVs every 300 kb to retain similar marker counts (6,375 SNVs for 3-ancestor analysis). Finally, we conducted independent analyses for each of the putatively admixed populations (Table 2A).
As per the algorithm used by SABER  to estimate the time since initial admixture (e.g., length of blocks and recombination rate), there is a significant negative correlation between ancestry proportion and admixture times (r = -0.810, 1-tail p = 0.04819). When we surveyed the impact of varying the switch value of the Galaxy admixture tool (#15), it appears to be sensitive to the timing of initial admixture. All populations had a significant correlation between ancestry and switch values (Red wolf: r = 0.9914, 1-tail p = 0.0005; Algonquin wolf: r = -0.9809, 1-tail p = 0.002; Great Lakes wolf: r = -0.9716, 1-tail p = 0.003; Northeastern coyote: r = 0.9891, 1-tail p = 0.0007; Midwestern coyote: r = -0.9721, 1-tail p = 0.0028). The captive Red wolf, for example, has been extensively monitored in their recovery area, and recent coyote hybridization events have been documented [33-36]. As a result, coyote ancestry is expected to be elevated in the Red wolf genome. Varying the switch parameter, we found that larger values (switch ~ 7-10) estimated a level of coyote ancestry comparable to that with SABER (Table 2B,C). An analogous demography is suspected of the Northeastern coyote, with the population harboring many hybrids of recent ancestry . With Galaxy, we found low levels of switch variation (mean ± SD coyote ancestry: 74% ± 3.3) and consistent representation of two ancestries within the Northeastern coyote genome (74% coyote, 26% gray wolf; Table 2B).
Admixture can result from highly complex demographic processes, either through mating preferences (e.g., [dis]assortative), backcrossing, selective removal of hybrids, or natural mechanisms of isolation. Regardless of the admixture process, quantifying ancestry is central for exploring population demography and conservation management efforts. Here we show that tabulating ancestry blocks across multiple canid genomes can be obtained quickly and accurately using the Galaxy admixture tool.
4. A southern African genome
Understanding admixture in humans is crucial for correctly tracing the flow of human populations, and it plays a substantial role in identifying genomic variants that affect disease susceptibility. Moreover, many people express a strong interest in knowing their ancestry. For instance, an individual, called ABT, of mostly Bantu-speaking heritage from South Africa, was thrilled when told that he also has some Khoisan (“bushman”) ancestry .
After that original discovery, we combined 419,974 microarray-based genotype calls for 485 humans, and added ABT’s genotypes inferred from whole-genome sequence data. Among the 485 individuals, 89 were identified as Yoruba (a well-studied population of agriculturalists from Nigeria) and 67 as Khoisan. We used this dataset to compare our admixture results (tool #15) with those obtained by the programs HAPMIX  PCAdmix  and ADMIXTURE [38,39]. With the Galaxy tool and PCAdmix, we used a third potential source population, CEU (certain individuals of European ancestry) and specified K = 4 ancestral populations with ADMIXTURE.
The programs gave reasonably comparable results: ABT’s Yoruba ancestry was estimated as 78.3%, 70.5%, 77.9% and 74.1% by HAPMIX, PCAdmix, ADMIXTURE, and the Galaxy tool, respectively. All of the programs of course depend somewhat on the settings chosen for input parameters; for instance HAPMIX produced values between 73.1% and 79% when we varied two parameters (i.e., number of generations since admixture and estimated proportions from the ancestral populations) over a wide range. PCAdmix appeared to have some problems with these data, leaving approximately 15% of ABT’s genome as having undetermined ancestry. Part of the reason that HAPMIX produces higher estimations of Yoruba admixture than the other tools is that it only allows two source populations. For example, limiting the Galaxy tool to two source populations (Khoisan and Yoruba) raises its estimation to 76.5% Yoruba.
The different breeds of domestic pigs are the result of a long domestication process from wild boars (Sus scrofa). During this process, qualities of importance to humans have been selected, shaping the genome landscape of the domestic breeds . It is estimated that European and Asian wild boars split about 1 million years ago, with their domestication occurring independently on each continent . Signals of positive selection have been found in domestic breeds that are associated with color, vertebrate number, and muscle development [40,41]. This exemplifies a number of studies that seek to identify genotype-phenotype relationships in domestic breeds, to potentially improve breeding practices, as with the chicken study, mentioned above.
We obtained 48,649,642 SNVs for 6 outgroup species and 49 Sus scrofa individuals (36 European, 6 Chinese, and several from other regions) used in those previous studies, and attempted to recapitulate some of the published results using the Galaxy tools. Following the approach of the chicken analysis, we calculated the homozygosity for four European breeds (n = 25 individuals), one Asian (n = 4) and one European wild boar (n = 6) breed [40, dataset 2]. The starting point for this analysis was a gd_genotype formatted file; thus, instead of using the aggregation tool (#7) we calculated the number of reference and alternative alleles for each population as follows:
1. Determine the columns with the genotypes of the individuals of interest (for the Asian breed c34, c35, c36 and c37).
2. Calculate the number of reference alleles in the individuals of interest (for the Asian breed ((c34==2)*2) + ((c35==2)*2) + ((c36==2)*2) + ((c37==2)*2) + ((c34==1)*1) + ((c35==1)*1) + ((c36==1)*1) + ((c37==1)*1)).
3. Calculate the number of alternative alleles in the individuals of interest (For the Asian breed ((c34==0)*2) + ((c35==0)*2) + ((c36==0)*2) + ((c37==0)*2) + ((c34==1)*1) + ((c35==1)*1) + ((c36==1)*1) + ((c37==1)*1)).
Further, we followed steps 3, 4, and 5 of the homozygosity calculation explained in the chicken example.
Published data  identified 70 selective sweeps genome-wide with a mean length of 878 kb. By trial and error we selected a shift score of 0.9889 for which the 70 top scoring intervals presented a mean size of 877 kb. Eleven of our 50 highest-scored intervals intersected reported selective sweeps , three of which overlapped the genes NR6A1, PLAG1, and LCORL to which the original study devotes a large discussion. The lengths of the intervals identified by our program agreed well with those reported previously. The lengths were on average 0.32 kb different, and exceeded 1 kb in only two cases. We attribute the observed differences to the limitations that a windows-based approach imposes: the resulting selective sweeps can only be as small or big as the specified window size. This limitation is illustrated by the intervals overlapping the gene LCORL (located in the chromosome 8 between 12,633,950 bp and 12,766,041 bp). While the windows-based approach found a sweep between 12,540,000 bp and 12,840,000; our windows-free approach determined that this interval was between the positions 12,555,236 bp and 12,807,451 bp. Thus, our approach better localizes the selective sweep to the LCORL gene, excluding a non-gene region between 12,807,452 and 12,840,000 bp.
To further explore the domestication dynamics in pigs, we used admixture studies. A PCA (with tool #12) indicated that European domestic swine, including the Landrace breed, are much closer to European wild boars than to Asian wild boars (also indicated by the phylogenetic-tree tool, #5), whereas an admixture analysis (with tool #15) suggests that about 10% of the Landrace genome is ultimately derived from Asian boars, mostly from southern China (possibly from intended crossings of European and Asian domestic swine). According to our analyses, genes overlapping the regions of Landrace genomes that have an origin in Chinese breeds are enriched for immune-related GO terms, including “defense response” (p = 8.7 × 10-11), “response to virus” (p = 5.6 × 10-6) and “defense response to bacterium” (p = 0.012).
The presence of this admixture would confound a search for selective sweeps based solely on the FST between a European domestic breed and European wild boars, because a genomic interval in the domestic breed that is derived from the Asian lineage will tend to have an FST that exceeds the genome average. Several approaches have been developed to circumvent this difficulty (e.g., ). A simple plan is to use the so-called LSBL , which treats the European and Asian wild boars symmetrically. Additionally, this score allows us to explore the specific selection occurring in the domestic breed lineage:
where L = Landrace, EW = European wild boar, and AW = Asian wild boar.
After calculating LSBLs for each group and each SNP, we determined remarkable intervals with them (using tool #18). As previously suggested, we found that genes NR6A1 and PLAG1 overlap a potential selective sweep (further analyses reveal that these genes do not overlap intervals of high LSBL in European or Chinese breeds). These genes have been in QTLs associated with an increase in the number of vertebrae in domestic pigs and body height, respectively . Other genes previously related to back fat thickness were also found to overlap intervals of high LSBL in Landrace, including ALMS1, ACP2 and ENPP1. Finally, one of those intervals overlaps the gene VRTN, previously found in a QTL suspected to cause heterogeneity of the number of vertebrae in commercial-breed pigs . Other genes overlapping selective sweeps for the Landrace breed have been previously reported in QTLs of commercial interest for pig (i.e., ELOVL6). As with other studies [32,40], we assume that finding selective sweeps overlapping previously reported QTLs is a validation for our approach.
In the same line of inquiry, we determined regions that had been potentially under positive directional selection in the lineage leading to Landrace pigs. To do so, we selected SNVs with one variant fixed in the two closed related species Sus barbatus (Bornean bearded pig) and Sus verrucosus (Java warty pig) and the other fixed in the Landrace individuals. Further, we determined genes overlapping regions with a remarkable number of these markers, and studied their enrichment in GO terms and KEGG pathways (tools #25 and #27). One of these regions, which includes the genes SPATA7 and TTC8, overlapped a QTL previous described for porcine intramuscular fat content . 903 genes were found to overlap regions enriched in SNVs under potential positive directional selection, including NR6A1 and CASP10. The CASP10 gene has been found in a putative selective sweep with several duplications in domestic pigs . Among the GO terms and KEGG pathways we found that "skeletal system morphogenesis" was significantly enriched for genes overlapping these regions (p = 0.0037).
In summary, using Galaxy tools, with special attention to the possible confounding of selective-sweep analyses by the presence of admixture, we were able to recapitulate published results and highlight additional genes of potential commercial interest.
6. Chytrid fungus
The chytridiomycota Batrachochytrium dendrobatidis (Bd) has been linked to the global decline of amphibians [45,46]. To shed light on the evolutionary history of this pathogen and to identify genomic underpinnings of its virulence, a recent paper reported the genomes of 29 Bd isolates from around the world . Among other results, the authors analyzed genes potentially under positive selection (dN > dS), those in regions exhibiting loss of heterozygosity (LOH), and those in regions with copy number variations (CNV), focusing on strains in the Global Panzootic Lineage (GPL) associated with the disease. This study illustrates the use of genome sequencing to uncover the evolutionary history of an emerging pathogen and to identify mechanisms-related shifts in virulence, with the ultimate goal of mitigating the disease’s impact.
In order to compare the results obtained from different tools and to potentially contribute to the understanding of Bd biology, we analyzed the published data and looked for evidence of selective sweeps. We downloaded the SNVs for each isolate, gene annotation and published results . After reformatting the data (gd_genotype), we uploaded it to Galaxy. Our first experiment was to check that our tools for identifying GO categories from a set of genes produce results comparable to the published results (the authors used custom computer scripts, which they make freely available). For the genes reported to have dN > dS (protein-coding differences between GPL and the outgroup isolate UM142), we found that GO terms with a significant enrichment included “DNA binding”, “protein binding”, “ATP binding”, and “nucleic acid binding”. We also found that the terms “microtubule motor activity” (p = 0.012), “microtubule-based movement” (p = 0.026), and “helicase activity” (p = 0.020) were enriched in these genes. The 35 GO terms significantly enriched for genes with LOH included “superoxide dismutase activity”, “oxidoreductase activity”, “oxidation-reduction process”, and “extracellular region”, while the 16 GO terms significantly depleted for genes with LOH include “chitin binding”. Finally, three GO terms were found to be significantly enriched for genes associated with CNVs: “aspartic-type endopeptidase activity”, “serine-type peptidase activity” and “proteolysis”. These results fit well with those in the original publication.
To go somewhat beyond what was published, we analyzed GO categories of genes in regions of high FST between an outgroup clade (UM142 and CLFT024-02) and the GP clade, as well as between two “populations” within GPL identified by PCA and phylogenetic analyses. Those results are included in the Galaxy history that we make available.
In summary, our Galaxy tools replicate many of the published results about Bd  and add some new observations. Our results suggest a fast evolution of genes associated to motility and helicase activity in the GP fungi lineage, as well as in proteolysis-related genes. As previously suggested, the peptidase genes are potentially important in Bd infection of amphibian skin . Additionally, the genes associated to motility (i.e., microtubule motor activity and microtubule-based movement) might mediate in the dispersion capabilities of the zoospores and increase virulence . Our results also suggest possible selection over a region of the Bd genome that includes five tandem SCP PR-1-like genes (BDEG_04273 to BDEG_04277), which are categorized under the GO term “extracellular region”. This selection seems to occur in all or a portion of the GLP lineage. Interestingly, the same region was found to have a loss of heterozygosity relative to UM142 and the GP clade. It has been suggested that these genes are involved in the pathogenesis and virulence of plant and animals pathogens, and may play a role in spore penetration and modulation of the host defense response [49,50]. While these results should be treated with healthy skepticism, due to a variety of possible difficulties, such as the observed high rate of copy-number variations in the Bd chromosomes, they illustrate the kinds of tantalizing observations that can be made with our Galaxy tools.
7. Human disease data
Many studies have sequenced the genomes from both normal and diseased tissues of the same individual, and looked for differences that might be associated with that disease, such as germ-line alleles that affect disease susceptibility, or variations in tumors that affect response to therapies. The new Galaxy tools can facilitate such investigations, as we now illustrate.
Analysis of human exome data has recently identified mutations in signal transducer and activator of transcription 3 (STAT3) in large granular lymphocyte (LGL) leukemia . Concurrent to these findings, our group has recently undertaken whole genome sequencing of three paired patient lymphocyte/saliva samples to look for these and other mutations. With Galaxy we are able to use simple filters applied to gd_snp files to identify potential somatic mutations. Examples of the filtering include finding SNPs with differing genotype calls between LGL and saliva, a quality score of 20 or greater for both genotypes and a minimum read depth of 8 reads in each sample. The SNPs can be further filtered to identify changes of a particular type, such as LOH or somatic mutations. Using a file of amino-acid variants caused by the SNPs, one can identify which of the SNPs leads to a predicted change in protein structure. In our case SIFT  is available in Galaxy and can be used for this purpose with the added benefit that additional output fields, such as allele frequencies and OMIM disease associations are appended, if selected.
Applying this protocol, STAT3 mutations were discovered in two of the three patients that correspond to amino acid changes of D661V and D661Y in genome 1 and 2 respectively. Previous reports  demonstrate constitutive STAT3 activation in all LGL leukemia samples, though one study  reported direct STAT3 mutations in only 31 of 77 patients. For this reason, the third genome was selected from a list of patients known to lack mutations in exon 20 or 21 of STAT3. Applying the same filters and SIFT algorithm to the SNPs from this genome did not reveal any mutations in any exon of STAT3. We then converted the Ensembl transcripts extracted from SIFT to their canonical transcripts and retrieved KEGG pathways using the Get Pathways tools (#24). A quick examination revealed two altered transcripts in the Janus Kinase (JAK)/STAT signaling pathway. Both consisted of 3' UTR mutations in the interleukin 6 receptor (IL6R) and CBL. Of these two, only the IL6R alteration is predicted to be in proximity to a conserved miRNA binding site according to the TargetScan  miRNA Regulatory Sites track on the UCSC Genome Browser [55,56]. If this variant alters miRNA binding and leads to increased translation of the IL6R, this could be one mechanism leading to aberrant STAT3 activation in those patients that do not demonstrate direct STAT3 mutation.
8. Mitochondrial polymorphism
In studies aimed at estimating evolutionary relationships, but where it is infeasible to collect data from the full (nuclear) genome, an alternative is to sequence the mitochondrial genome, which is far smaller and occurs at much higher copy number per cell. Recent methods that further enrich the concentration of mitochondrial DNA [57,58] make it possible to sequence mitochondria from very degraded samples, such as those from museum specimens. Nevertheless, the resulting data can leave intervals of the mitochondrial genome unsequenced, or sequenced to such low coverage that the results are unreliable. We have added tools to Galaxy that can perform some basic analyses for such datasets.
SNVs in the mitochondrial genome can be represented in gd_snp or gd_genotype format. In addition, we abuse the gd_genotype format to store the sequence coverage at each position in the mitochondrial sequence. Thus the file might start as follows:
This indicates that the first sample has sequencing depth roughly 35 at the start of the mitochondrial sequence (column 5), while no reads from the second sample map there (column 6). We also include a file of gene annotations for the reference sequence, with lines like:
We provide several tools to process these files, including the production of a graphical representation of variants and/or the coverage depth (tools #33 and #34; Figures 6 and 7), computation of average pairwise difference (π; tool #35), and of a phylogenetic tree (tools #36 and #37). In each case, the user specifies a set of individuals and a minimum depth of coverage.
Figure 6. Inadequately covered parts of colugo mitochondrial sequences. Intervals with sequencing depth (A) below 5 and (B) below 30 for several Sunda colugos (Galeopterus variegatus), from a published study .
Figure 7. Variants identified in cave-bear mitochondrial sequences. Locations of SNVs (vertical lines) and unsequenced regions (blue rectangles) for several cave bears  relative to GenBank entry NC011112 . SP1324, SP1844, SP2125 and NC01112 have been classified as one species of European cave bear (Ursus ingressus), and SP2083, SP2129 and SP2133 as another (Ursus spelaeus). SP2073 is an Asian cave bear (Ursus deningeri kudarensis).
Regardless of how the polymorphism data are produced, using the Galaxy system to perform the analyses and to make the results available, can greatly facilitate reproducibility of the study by other groups, as well as apple-to-apple comparisons among data from different species. The published chicken sequence data described above provides a case in point – this study  sequenced pools, each containing DNA from multiple individuals, and so knowledge of numbers of reads corresponding to each allele, rather than a single genotype, was needed for each sample at each SNV. The authors attempted to provide relevant information in their dbSNP entries with a comment line telling in which samples the alleles were observed. However, they were aware that this information is not only difficult to extract from dbSNP, but it is also not sufficiently rich to allow their analyses to be reproduced; hence, the authors have contributed the appropriate data to Galaxy. More generally, having this and similar data sets on Galaxy substantially relieves the difficulties frequently encountered when attempting to reproduce published claims . Moreover, reproducing published results is a prerequisite for exploring their resilience to changes in the parameters controlling the analysis; it is well-known that many computational tools produce output that is quite sensitive to input parameters, and progress in science is facilitated if readers can readily explore the brittleness of published claims.
The tools described in this paper were produced by members of a small genomics lab, an approach that had the benefit of relatively quick development, but drawbacks in terms of getting other groups to contribute their data. Being able to compare datasets with each other is tremendously valuable. For instance, as described above, we observed an average FST of 0.169 between the North and East populations of aye-aye, but without corresponding numbers from other data sets, it is unclear whether this value should be considered large or small. It would have been very helpful to have datasets provided by other groups, but thus far the only gd_snp and gd_genotype data sets are ones that we created or requested from the originating group. Unfortunately, biology journals are often lax about enforcing requirements that authors make their data readily available, and we have been unable to obtain certain requested datasets. Our belief is that if the toolset had been designed and contributed by a wider community, then those groups would be motivated to make their data available in Galaxy.
The development of Galaxy tools for the kinds of data discussed in this paper has only just begun. While short insertion/deletions are handled much like nucleotide substitutions, larger scale mutations, such as inversions, are not currently handled by our Galaxy tools, despite the fact that they are believed to sometimes directly influence phenotypic differences between populations [60,61]. However, proposed tools need to be carefully evaluated. Importantly, high computational demands are often a limiting factor; examples of tools whose extensive run times make them less desirable for incorporation into Galaxy include PSMC  for estimating ancestral population sizes, CoalHMM  for estimating population split times, and a tool for identifying a set of individuals that is an “optimal” choice for founding a captive-breeding population or a relocation effort . Still, we hope the reader has become convinced that the resources we provide constitute a worthwhile addition to software for genome analysis.
Putting tools on Galaxy
Our analysis tools were integrated into the Galaxy web-based platform [1-4]. We followed the documentation  to install our own Galaxy instance. This instance was used as a test-bed for integrating our analysis tools. A tool definition file was written for each analysis tool, and all of these tool definition files were added to the tool-registry file as described on the Galaxy Wiki . We added new data types  to our Galaxy instance to support our SNV table format. After verifying that all of our analysis tools were working as expected, we created a Galaxy Tool Shed  repository to share our tools with the Galaxy community by following the documentation , producing the final set of Genome Diversity tools .
Tools for analyzing SNV tables
Phylogenetic trees can be determined when sequence data come from individuals (as opposed to pooled samples). For each pair of individuals, the “informative” SNVs having a user-specified read-coverage are used, and the distance between the two individuals is the number of genotypic differences divided by the number of informative sites. A neighbor-joining tree is constructed from these differences by QuickTree . The output includes a viewable tree and formatted inputs to one of several popular tree-drawing tools. For small data sets, such as for a modest number of mitochondrial SNPs, we let users run RAxML . Principal component analysis (tool #12) is preformed by smartpca, the ancestry analysis (#14) uses admixture[38,39] and the “Remarkable Intervals” tool implements a published algorithm , Figure 6. PCR primer sequences are computed off-line by Primer3 . Determining an optimal set of breeding pairs (toll #22) is an instance of a classic computational problem known in the Operations Research literature as the “assignment problem”, and called “weighted optimal bipartite matching” by computer scientists; it can be solved efficiently .
The admixture tool (#15) uses allele frequencies in the source populations to estimate the probability that a genotype observed at one SNV in a potentially admixed individual would be generated by randomly sampling chromosomes in each of the three possible combinations (six combinations if there are three source populations). For instance, if the frequencies of the reference (or consensus) allele in the two source populations are p and q, then the probabilities of the admixed individual being homozygous for the reference allele are p2 if both chromosomes come from the first source population, q2 if both are from the second source population, and p × q if the individual has one chromosome from each source. The logarithms of these values are added along a chromosomal segment to estimate the (logarithm of the) probability that the sequence of genotypes along the segment would be produced. Logarithms are used so that values can be added instead of multiplied and to forestall underflow in computer arithmetic. A dynamic programming algorithm is used to select chromosomal positions where the source of the admixed segment is switched (e.g., from homozygous in the first source population to heterozygous).
KEGG and Gene Ontology
We implemented a set of tools to evaluate the possible effect of mutations on phenotypic differences. The first group of tools assesses the over-representation of input genes in phenolic categories (i.e., GO terms and KEGG pathways), and the second uses network metrics to calculate the impact of these genes in a given phenotype (i.e., KEGG pathways).
GOs are a broadly used category of gene annotations that describe their functions through the use of domain-specific ontologies . Each gene is associated to one or more GO terms, and in turn, each GO term can be associated to one or more genes. Our set of programs includes the Rank Terms tool (#27) to determine the enrichment of a gene list (i.e., mutated genes) in GO terms. To do so, each gene is associated to a GO term following the Ensembl annotation . Further, the probability of GO term enrichment and depletion among the genes in the input list is calculated with a two-tailed Fisher exact test, as suggested . The tool returns a table that ranks the GO terms based on the percentage of genes in an input dataset (out of the total in each category in a background list) and their enrichment/depletion probability.
Network-based approaches have been recently introduced with promising results to capture the intricate relation of genes, regulatory elements and phenotypes [80,81]. The Rank Pathways tool (#25) is designed to study phenotypes as networks. This tool takes, as input, the set of metabolic pathways and biological processes in the KEGG database [82,83] and ranks them based on two criteria. The first criterion returns a table that ranks the KEGG pathway based on the percentage of genes in an input dataset (out of the total in each pathway) and their enrichment/depletion probability (calculated by a two-tailed Fisher exact test).
The second ranking criterion ranks KEGG pathways based on the change in length and number of paths connecting sources and sinks between pathways that exclude or include the nodes representing the genes in an input list. Sources are all the nodes representing the initial reactants/products in the pathway. Sinks are all the nodes representing the final reactants/products in the pathway. In detail, the mean length and number of paths between sources and sinks is calculated for each pathway including and excluding the genes in the input dataset; further, the change in both parameters is estimated and ranked [84,85]. Gene names and networks are obtained from each KGML pathway file from the KEGG database of the reference species.
In addition, the Get Pathways tool (#24) maps KEGG genes and pathways to Ensembl codes, while the Pathway Image tool (#26) plots KEGG pathways highlighting genes of interest respectively (e.g., Figure 3). In more detail, the second tool takes as input datasets with KEGG gene codes and pathways, links the genes present in the input table to specific modules (i.e., a collection of functional units) and returns an image of a KEGG pathway highlighting (in red) the modules representing genes in the input dataset.
The published SNVs  were kindly provided to us by Carl-Johan Rubin and Leif Andersson. Importantly, the sequences were from pooled samples of birds, so the numbers of reads observed for each allele in each sample (rather than just a “genotype” for the sample) was required to reproduce their results. We created a Galaxy gd_snp table. For each sample, in addition to the two allele counts, the SNV had a “genotype” that we extracted from comments in the dbSNP records listing the samples where each allele was observed, which we included to permit attempts to reproduce some of the published  using just the information in dbSNP. (We were unable to accomplish this feat). Extracting that information required help from the dbSNP staff at NCBI. Since no quality values were available to us, we used the place-holder “–1” in columns 5, 9, 13-45. The data and a command “workflow” for the results described in this paper are available on Galaxy.
A southern African genome
Three methods were applied to detect admixtured haplotype blocks in a southern Bantu genome (ABT): PCAdmix , HAPMIX , and the Galaxy admixture tool. Applying those methods required population datasets of two or three putative ancestral populations in order to assign ancestries to each SNV or particular size of haplotype. We retrieved genotyping SNV datasets of various populations from two human variation projects, HGDP  and HapMap , and one publication . We selected 419,974 SNVs that were common among the datasets, after filtering out multiple-allelic and possible “flipped” SNVs. For the estimation of ancestry of the Bantu individual, Khoisan, Bantu, Yoruba, and two non-African (CEU and CHB) populations were selected from the datasets. For the accuracy of analyses, we included only unrelated individuals and excluded outlier individuals, which were not clustered with the corresponding populations in the PCA analysis. The final dataset used in this study consisted of 419,974 SNVs from 481 individuals. Regarding ABT, we extracted the genotypes of the same SNV positions from the ABT genome sequences .
The table of porcine SNVs was contributed by Martien Groenen.
We converted the table Bd_49.selectedSNPs.5.ACGT.10X.tab  to gd_genotype format. From the same website we obtained a mapping of gene names to GO categories, and lists of genes with dN > dS, with LOH, and with CNVs. We extracted gene annotations from the file .
The coverage data for colugo were contributed by William Murphy.
Availability of supporting data
The data sets and tools sufficient to reproduce results described in this paper are available on the GigaGalaxy website . The tools discussed here are available from the Galaxy website , under “Genome Diversity”. Links to the materials and future examples will also be made available from a Galaxy page , along with documentation for using the tools . Please send requests for other materials to email@example.com.
AD: All domestic (chickens); also abbreviates “Allelic Depth” in VCF file; AIM: Ancestry informative marker; Bp: Base pair; CNV: Copy Number Variation; gd: Genome diversity; GO: Gene ontology; GPL: Global Panzootic Lineage; KEGG: Kyoto Encyclopedia of Genes and Genomes; LOH: Loss of Heterozygosity; LSBL: Locus-specific branch length; PCA: Principal components analysis; QTL: Quantitative Trait Locus; SNP: Single-nucleotide polymorphism; SNV: Single-nucleotide variant; YNP: Yellowstone National Park (wolves).
The authors declare that they have no competing interests.
The project was designed by WM, with guidance from SCS and GHP. The Galaxy tools were written, installed, and documented by OCB-R, AR, RB, BG, CR and WM. QL provided statistical expertise. The paper was written by WM, with sections contributed by OCB-R, HLK, TLO, TPL and BMH. All authors read and approved the final manuscript.
We thank Andreanna Welsh and Charlotte Lindqvist for early testing of the Galaxy tools, Carl-Johan Rubin and Leif Andersson for graciously supplying the chicken data, and Martien Groenen for providing the pig data. The dbSNP on-line help-desk extracted the comments in the chicken SNP entries. Michael DeGiorgio discussed the interplay between admixture and identification of selective sweeps. Development of the Galaxy tools for analyzing SNVs from low-coverage sequence data was supported in part by NIH award 1 UL1 RR033184-01 to the Penn State Clinical and Translational Science Institute, with supplemental funding from a grant from the Pennsylvania Department of Health using Tobacco Commonwealth Universal Research Enhancement Funds. TLO and TPL were supported by NIH award CA 170334. Development of the tools to help guide captive-breeding programs (tools #19-#23) was supported by the Gordon and Betty Moore Foundation, and was helped by suggestions from Katherine Rall and Jon Ballou.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis.
Curr Protoc Mol Biol 2010, 10:1-21.
Evolution 1984, 38:1358-1370. Publisher Full Text
Huerta-Sánchez E, Degiorgio M, Pagani L, Tarekegn A, Ekong R, Antao T, Cardona A, Montgomery HE, Cavalleri GL, Robbins PA, Weale ME, Bradman N, Bekele E, Kivisild T, Tyler-Smith C, Nielsen R: Genetic signatures reveal high-altitude adaptation in a set of Ethiopian populations.
Perry GH, Louis EE Jr, Ratan A, Bedoya-Reina OC, Burhans R, Lei R, Johnson SE, Schuster SC, Miller W: Aye-aye population genomic analyses highlight an important center of endemism in northern Madagascar.
Perry GH, Reeves D, Melsted P, Ratan A, Miller W, Michelini K, Louis EE Jr, Pritchard JK, Mason CE, Gilad Y: A genome sequence resource for the aye-aye (Daubentonia madagascariensis), a nocturnal lemur from Madagascar.
Hum Genom 2004, 1:274-286. BioMed Central Full Text
Hutter H, Vogel BE, Plenefisch JD, Norris CR, Proenca RB, Spieth J, Guo C, Mastwal S, Zhu X, Scheel J, Hedgecock EM: Conservation and novelty in the evolution of cell adhesion and extracellular matrix genes.
Hara Y, Balci-Hayta B, Yoshida-Moriguchi T, Kanagawa M, Beltrán-Valero de Bernabé D, Gündeşli H, Willer T, Satz JS, Crawford RW, Burden SJ, Kunz S, Oldstone MB, Accardi A, Talim B, Muntoni F, Topaloğlu H, Dinçer P, Campbell KP: A dystroglycan mutation associated with limb-girdle muscular dystrophy.
Am J Phyl Anthropol 1986, 69:355-375. Publisher Full Text
Rubin CJ, Zody MC, Eriksson J, Meadows JR, Sherwood E, Webster MT, Jiang L, Ingman M, Sharpe T, Ka S, Hallböök F, Besnier F, Carlborg O, Bed'hom B, Tixier-Boichard M, Jensen P, Siegel P, Lindblad-Toh K, Andersson L: Whole-genome resequencing reveals loci under selection during chicken domestication.
Eriksson J, Larson G, Gunnarsson U, Bed'hom B, Tixier-Boichard M, Strömstedt L, Wright D, Jungerius A, Vereijken A, Randi E, Jensen P, Andersson L: Identification of the yellow skin gene reveals a hybrid origin of the domestic chicken.
von Holdt BM, Pollinger JP, Earl DA, Knowles JC, Boyko AR, Parker H, Geffen E, Pilot M, Jedrzejewski W, Jedrzejewska B, Sidorovich V, Greco C, Randi E, Musiani M, Kays R, Bustamante CD, Ostrander EA, Novembre J, Wayne RK: A genome-wide perspective on the evolutionary history of enigmatic wolf-like canids.
Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, Kasson LR, Harris RS, Petersen DC, Zhao F, Qi J, Alkan C, Kidd JM, Sun Y, Drautz DI, Bouffard P, Muzny DM, Reid JG, Nazareth LV, Wang Q, Burhans R, Riemer C, Wittekindt NE, Moorjani P, Tindall EA, Danko CG, Teo WS, Buboltz AM, Zhang Z, Ma Q, Oosthuysen A, et al.: Complete Khoisan and Bantu genomes from southern Africa.
BMC Bioinforma 2011, 12:246. BioMed Central Full Text
Rubin CJ, Megens HJ, Martinez Barrio A, Maqbool K, Sayyab S, Schwochow D, Wang C, Carlborg Ö, Jern P, Jørgensen CB, Archibald AL, Fredholm M, Groenen MA, Andersson L: Strong signatures of selection in the domestic pig genome.
Groenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, Rogel-Gaillard C, Park C, Milan D, Megens HJ, Li S, Larkin DM, Kim H, Frantz LA, Caccamo M, Ahn H, Aken BL, Anselmo A, Anthon C, Auvil L, Badaoui B, Beattie CW, Bendixen C, Berman D, Blecha F, Blomberg J, Bolund L, Bosse M, Botti S, Bujie Z, Bystrom M, Capitanu B, Carvalho-Silva D, Chardon P, Chen C, Cheng R, Choi SH, Chow W, Clark RC, Clee C, Crooijmans RP, Dawson HD, Dehais P, De Sapio F, Dibbits B, Drou N, Du ZQ, Eversole K, Fadista J, Fairley S, Faraut T, Faulkner GJ, Fowler KE, Fredholm M, Fritz E, Gilbert JG, Giuffra E, Gorodkin J, Griffin DK, Harrow JL, Hayward A, Howe K, Hu ZL, Humphray SJ, Hunt T, Hornshøj H, Jeon JT, Jern P, Jones M, Jurka J, Kanamori H, Kapetanovic R, Kim J, Kim JH, Kim KW, Kim TH, Larson G, Lee K, Lee KT, Leggett R, Lewin HA, Li Y, Liu W, Loveland JE, Lu Y, Lunney JK, Ma J, Madsen O, Mann K, Matthews L, McLaren S, Morozumi T, Murtaugh MP, Narayan J, Nguyen DT, Ni P, Oh SJ, Onteru S, Panitz F, Park EW, Park HS, Pascal G, Paudel Y, Perez-Enciso M, Ramirez-Gonzalez R, Reecy JM, Rodriguez-Zas S, Rohrer GA, Rund L, Sang Y, Schachtschneider K, Schraiber JG, Schwartz J, Scobie L, Scott C, Searle S, Servin B, Southey BR, Sperber G, Stadler P, Sweedler JV, Tafer H, Thomsen B, Wali R, Wang J, Wang J, White S, Xu X, Yerle M, Zhang G, Zhang J, Zhang J, Zhao S, Rogers J, Churcher C, Schook LB, et al.: Analyses of pig genomes provide insight into porcine demography and evolution.
Fontanesi L, Schiavo G, Galimberti G, Calò DG, Scotti E, Martelli PL, Buttazzoni L, Casadio R, Russo V: Identification and association analysis of several hundred single nucleotide polymorphisms within candidate genes for back fat thickness in Italian Large White pigs using a selective genotyping approach.
Rosenblum EB, James TY, Zamudio KR, Poorten TJ, Ilut D, Rodriguez D, Eastman JM, Richards-Hrdlicka K, Joneson S, Jenkinson TS, Longcore JE, Parra Olea G, Toledo LF, Arellano ML, Medina EM, Restrepo S, Flechas SV, Berger L, Briggs CJ, Stajich JE: Complex history of the amphibian-killing chytrid fungus revealed with genome resequencing data.
Stajich lab popgen data at GitHub.
Teixeira PJ, Thomazella DP, Vidal RO, do Prado PF, Reis O, Baroni RM, Franco SF, Mieczkowski P, Pereira GA, Mondego JM: The fungal pathogen Moniliophthora perniciosa has genes similar to plant PR-1 that are highly expressed during its interaction with cacao.
Koskela HL, Eldfors S, Ellonen P, van Adrichem AJ, Kuusanmäki H, Andersson EI, Lagström S, Clemente MJ, Olson T, Jalkanen SE, Majumder MM, Almusa H, Edgren H, Lepistö M, Mattila P, Guinta K, Koistinen P, Kuittinen T, Penttinen K, Parsons A, Knowles J, Saarela J, Wennerberg K, Kallioniemi O, Porkka K, Loughran TP Jr, Heckman CA, Maciejewski JP, Mustjoki S: Somatic STAT3 mutations in large granular lymphocytic leukemia.
Epling-Burnette PK, Liu JH, Catlett-Falcone R, Turkson J, Oshiro M, Kothapalli R, Li Y, Wang JM, Yang-Yen HF, Karras J, Jove R, Loughran TP Jr: Inhibition of STAT3 signaling leads to apoptosis of leukemic large granular lymphocytes and decreased Mcl-1 expression.
Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, Raney BJ, Pohl A, Malladi VS, Li CH, Lee BT, Learned K, Kirkup V, Hsu F, Heitner S, Harte RA, Haeussler M, Guruvadoo L, Goldman M, Giardine BM, Fujita PA, Dreszer TR, Diekhans M, Cline MS, Clawson H, Barber GP, Haussler D, Kent WJ, et al.: The UCSC Genome Browser database: extensions and updates 2013.
Romanov MN, Tuttle EM, Houck ML, Modi WS, Chemnick LG, Korody ML, Mork EM, Otten CA, Renner T, Jones KC, Dandekar S, Papp JC, Da Y, Comparative Sequencing Program NISC, Green ED, Magrini V, Hickenbotham MT, Glasscock J, McGrath S, Mardis ER, Ryder OA: The value of avian genomics to the conservation of wildlife.
Jones FC, Grabherr MG, Chan YF, Russell P, Mauceli E, Johnson J, Swofford R, Pirun M, Zody MC, White S, Birney E, Searle S, Schmutz J, Grimwood J, Dickson MC, Myers RM, Miller CT, Summers BR, Knecht AK, Brady SD, Zhang H, Pollen AA, Howes T, Amemiya C, Baldwin J, Bloom T, Jaffe DB, Nicol R, Wilkinson J, Lander ES, Di Palma F, Lindblad-Toh K, Kingsley DM, Broad Institute Genome Sequencing Platform & Whole Genome Assembly Team, et al.: The genomic basis of adaptive evolution in threespine sticklebacks.
Mailund T, Dutheil JY, Hobolth A, Lunter G, Schierup MH: Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model.
Galaxy Download and Installation.
Hub page for Galaxy Administration of computational tools.
Hub page for administering Galaxy data types.
Galaxy Tool Shed.
Huang X, Pevzner P, Miller W: Parametric recomputing in alignment graphs. In Combinatorial Pattern Matching. Volume 807. Edited by Crochemore M, Gusfield D. Berlin: Springer-Verlag; 1994::87-101. [Springer Lecture Notes in Computer Science] Publisher Full Text
Rozen S, Skaletsy HJ: Primer3 on the WWW for general users and for biologist programmers. In Bioinformatics Methods and Protocols. Edited by Misener KS. Totowa NJ: Humana Press; 2000:365-386. [Methods in Molecular Biology]
J Assoc Comput Mach 1987, 34:596-615. Publisher Full Text
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJ, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SM, et al.: Ensembl 2013.
Miller W, Hayes VM, Ratan A, Petersen DC, Wittekindt NE, Miller J, Walenz B, Knight J, Qi J, Zhao F, Wang Q, Bedoya-Reina OC, Katiyar N, Tomsho LP, Kasson LM, Hardie RA, Woodbridge P, Tindall EA, Bertelsen MF, Dixon D, Pyecroft S, Helgen KM, Lesk AM, Pringle TH, Patterson N, Zhang Y, Kreiss A, Woods GM, Jones ME, Schuster SC: Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil).
Bedoya-Reina OC, Ratan A, Harris RS, Riemer C, Burhans R, Perry GH, Schuster SC, Miller W: Network approaches reveal hypothetical causes for phenotypes in wild mammal populations. In Genome Informatics. Cambridge, UK: Robinson College; 2012.
6-9 September 2012
Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, Chen Z, Chu J, Carcassi C, Contu L, Du R, Excoffier L, Ferrara GB, Friedlaender JS, Groot H, Gurwitz D, Jenkins T, Herrera RJ, Huang X, Kidd J, Kidd KK, Langaney A, Lin AA, Mehdi SQ, Parham P, Piazza A, Pistillo MP, Qian Y, Shu Q, Xu J, Zhu S, Weber JL, Greely HT, Feldman MW, Thomas G, Dausset J, Cavalli-Sforza LL: A human genome diversity cell line panel.
Science 2002, 296:261-262. PubMed Abstract
Schlebusch CM, Skoglund P, Sjödin P, Gattepaille LM, Hernandez D, Jay F, Li S, De Jongh M, Singleton A, Blum MG, Soodyall H, Jakobsson M: Genomic variation in seven Khoe-San groups reveals adaptation and complex African history.
Bedoya-Reina OC, Ratan A, Burhans R, Kim HL, Giardine B, Riemer C, Li Q, Olson TL, Loughran TP Jr, von Holdt BM, Perry GH, Schuster SC, Miller W: GigaGalaxy workflows and histories from “Galaxy tools to study genome diversity”.
GigaScience Database 2013.
Galaxy page for this paper.
Documentation for Genome Diversity Tools.
Krause J, Unger T, Noçon A, Malaspinas AS, Kolokotronis SO, Stiller M, Soibelzon L, Spriggs H, Dear PH, Briggs AW, Bray SC, O'Brien SJ, Rabeder G, Matheus P, Cooper A, Slatkin M, Pääbo S, Hofreiter M: Mitochondrial genomes reveal an explosive radiation of extinct and extant bears near the Miocene-Pliocene boundary.