The data displayed in this applet are derived from next-generation sequencing of 120 soybean lines. Seventy-nine new lines were sequenced as part of this project and represent plant introductions and milestone cultivars. The remaining sequences represent the 41 parents used for developing the soybean nested association mapping (NAM) population. NAM parent sequences were provided by Dr. Perry Cregan (USDA-ARS) and the Soybean NAM project. For the new sequences generated by this project, twenty seeds from each line were acquired from the USDA Soybean Germplasm Collection. Seeds were planted in the USDA greenhouse at Iowa State University. Once plants reached the trifoliolate stage, leaves from up to 10 plants were pooled and genomic DNA was extracted. DNA was sent to the HudsonAlpha Institute for Biotechnology for next-generation sequencing.
In addition, replicated field trials were conducted on a subset of lines (30 of the 79 lines, plus ancestral varieties that were not sequenced) to measure protein, oil, yield, and other characteristics under standard growth conditions, in order to separate the effect of on-farm improvements from genetic gain [1],[2].
Read groups were assigned and duplicate reads were marked using the AddOrReplaceReadGroups and MarkDuplicates functions in Picard tools. Reads were realigned around indels using the IndelRealigner function in GATK [5]. The ReduceReads function was used to compress the alignment files by removing non-informative and redundant reads (default parameters except for downsample_coverage=1). Variants were called using the HaplotypeCaller function in GATK (version 2.7-2-g6bda569).
SNPs displayed in this applet are reliable SNPs from 120 lines. A total of 86,944 unique SNPs were identified for further analysis.
The lines included in this analysis are:
180501 | CL0J173-6-8 | Hood | LG92-1255 | PI574 |
4J105-3-4 | Clark | HS6-3976 | LG94-1128 | Pickett |
506-13640-2 | Clark (NAM) | Hutcheson | LG94-1906 | Prohio |
5601T | Clemson | IA3023 | LG97-7012 | Raleigh |
5M20-2-5-2 | CNS | Illini | LG98-1605 | Ralsoy |
88788 | Cook | Jackson | Lincoln | Ransom |
A3127 | Corsoy | Kanro | Magellan | Richland |
Adams | Cumberland | Kent | Mandarin | Roanoke |
A.K. | Davis | Lawrence | Maverick | S06-13640 |
Amcor | Dillon | LD00-3309 | Merit | S-100 |
Amsoy | Dorman | LD01-5907 | Mukden | Shelby |
Anderson | Douglas | LD02-4485 | NCRoy | Skylla |
Beeson | Dunfield | LD02-9050 | NE3001 | TN05-3027 |
Blackhawk | Essex | Lee | Oakland | Tokyo |
Bonus | Ford | LG00-3372 | Ogden | Tracy |
Bragg | Forrest | LG03-2979 | Pella | U03-100612 |
Braxton | Gasoy17 | LG03-3191 | Perry | Volstate |
Brim | Haberlandt | LG04-4717 | PI398 | Wayne |
Calland | Hagood | LG04-6000 | PI404 | Williams |
Capital | Harcor | LG05-4292 | PI427 | Williams82 |
Centennial | Harosoy | LG05-4317 | PI437 | Woodworth |
Century | Hawkeye | LG05-4464 | PI507 | York |
Chippewa | Hill | LG05-4832 | PI518 | Young |
CL0J095-4-6 | Holladay | LG90-2550 | PI561 | Zane |
Kinship matrices were generated with TASSEL [6] using a subset of the SNP data: one random SNP was taken from every 10,000-base interval in the genome, or the next closest SNP (Supplementary Script 1).
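The thinning step can be illustrated with a minimal Python sketch. This is not Supplementary Script 1; it only shows the idea of keeping one randomly chosen SNP per 10,000-base window on each chromosome. The function name `thin_snps`, the chromosome labels, and the positions are hypothetical, and the sketch assumes that windows containing no SNP are simply skipped, which may differ from how the "next closest SNP" rule was applied in the original analysis.

```python
import random
from collections import defaultdict


def thin_snps(snps, window=10_000, seed=0):
    """Keep one randomly chosen SNP per (chromosome, window) bin.

    snps: iterable of (chromosome, 1-based position) tuples.
    Assumption: windows without any SNP contribute nothing.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    bins = defaultdict(list)
    for chrom, pos in snps:
        # Positions 1..window fall in bin 0, window+1..2*window in bin 1, ...
        bins[(chrom, (pos - 1) // window)].append(pos)
    return sorted(
        (chrom, rng.choice(sorted(positions)))
        for (chrom, _), positions in bins.items()
    )


# Hypothetical toy input: two SNPs share the first window on Gm01.
snps = [("Gm01", 150), ("Gm01", 9800), ("Gm01", 10001),
        ("Gm01", 35000), ("Gm02", 42)]
thinned = thin_snps(snps)
```

Here `thinned` holds four SNPs: one of the two positions in the first Gm01 window, plus the single SNP from each of the other three occupied windows.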
These matrices were then clustered using Ward's method, with the distance between two lines defined as 2 - similarity. The resulting clusters were used to order the rows and columns of the heatmap.
The clustered matrices were then formatted for plotting and plotted using ggplot2 [10].
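As a rough illustration of the clustering step, the sketch below converts a similarity matrix to distances as 2 - similarity and runs Ward's agglomerative clustering using the Lance-Williams update on squared distances, returning a leaf order that could be used to arrange heatmap rows and columns. It is a small pure-Python sketch under stated assumptions, not the code used for the applet: the function names and the toy kinship values are hypothetical, and the leaf order produced by another Ward implementation may differ in how sub-branches are flipped.

```python
def kinship_to_distance(kinship):
    """Convert a symmetric similarity matrix to distances as 2 - similarity."""
    n = len(kinship)
    return [[0.0 if i == j else 2.0 - kinship[i][j] for j in range(n)]
            for i in range(n)]


def ward_leaf_order(dist):
    """Ward clustering (Lance-Williams update); returns the leaf order."""
    n = len(dist)
    members = {i: [i] for i in range(n)}  # cluster id -> ordered leaves
    d2 = {(i, j): dist[i][j] ** 2         # squared distances, keys with i < j
          for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > 1:
        a, b = min(d2, key=d2.get)        # closest pair of clusters
        na, nb = len(members[a]), len(members[b])
        dab = d2.pop((a, b))
        updated = {}
        for k in members:
            if k in (a, b):
                continue
            nk = len(members[k])
            dak = d2.pop((min(a, k), max(a, k)))
            dbk = d2.pop((min(b, k), max(b, k)))
            # Ward update for the squared distance from k to the merged cluster
            updated[k] = ((na + nk) * dak + (nb + nk) * dbk - nk * dab) / (na + nb + nk)
        members[next_id] = members.pop(a) + members.pop(b)
        for k, value in updated.items():
            d2[(k, next_id)] = value      # k < next_id always holds
        next_id += 1
    return next(iter(members.values()))


# Hypothetical kinship values: lines 0 and 2 are similar, as are lines 1 and 3.
kinship = [[2.0, 0.2, 1.9, 0.2],
           [0.2, 2.0, 0.2, 1.8],
           [1.9, 0.2, 2.0, 0.2],
           [0.2, 1.8, 0.2, 2.0]]
order = ward_leaf_order(kinship_to_distance(kinship))
```

Reordering both the rows and the columns of the matrix by `order` places closely related lines next to each other, which is what makes block structure visible in a heatmap.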
Using the combined, phased and imputed VCF file generated as a result of the SNP sampling, the following steps are performed.
Plots in this applet were generated using ggplot2 [10], and are rendered interactively using Shiny [11].
1. Specht JE, Williams JH. Contribution of genetic technology to soybean productivity: Retrospect and prospect. Genetic contributions to yield gains of five major crop plants. Crop Science Society of America; American Society of Agronomy; 1984;49–74.
2. Fox CM, Cary TR, Colgrove AL, Nafziger ED, Haudenshield JS, Hartman GL, Specht JE, Diers BW. Estimating soybean genetic gain for yield in the northern United States: Influence of cropping history. Crop Science. The Crop Science Society of America, Inc.; 2013;53:2473–2482.
3. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. Oxford Univ Press; 2010;26:873–881.
4. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al. The sequence alignment/map format and SAMtools. Bioinformatics. Oxford Univ Press; 2009;25:2078–2079.
5. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research. Cold Spring Harbor Lab; 2010;20:1297–1303.
6. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics. Oxford Univ Press; 2007;23:2633–2635.
7. Browning BL, Browning SR. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. Genetics Soc America; 2013;194:459–471.
8. Obenchain V, Lawrence M, Carey V, Gogarten S, Shannon P, Morgan M. VariantAnnotation: A Bioconductor package for exploration and annotation of genetic variants. Bioinformatics. Oxford Univ Press; 2014;btu168.
9. Yin T, Cook D, Lawrence M. ggbio: An R package for extending the grammar of graphics for genomic data. Genome Biology. BioMed Central Ltd; 2012;13:R77.
10. Wickham H. ggplot2: Elegant graphics for data analysis [Internet]. Springer New York; 2009. Available from: http://had.co.nz/ggplot2/book.
11. RStudio, Inc. Shiny: Web application framework for R [Internet]. 2014. Available from: http://www.rstudio.com/shiny/.