Use the table below to see what GlymaIDs have SNPs on a specific chromosome.

Use the table below to see SNPs at specific locations within a GlymaID.

Matching Glyma IDs

Matching SNPs

Genealogy

Data is available for cultivars shown in black.

SNP Documentation

Data Collection

The data displayed in this applet is derived from next generation sequencing of 120 soybean lines. Seventy-nine new lines were sequenced as part of this project and represent plant introductions and milestone cultivars. The remaining sequences represent the 41 parents used for developing the soybean NAM population. NAM parent sequences were provided by Dr. Perry Cregan (USDA-ARS) and the Soybean NAM project. For the new sequences generated by this project, twenty seeds from each line were acquired from the USDA Soybean Germplasm Collection. Seeds were planted in the USDA greenhouse at Iowa State University. Once plants reached the trifoliolate stage, leaves from up to 10 plants were pooled and genomic DNA was extracted. DNA was sent to Hudson Alpha Institute for Biotechnology for next-generation sequencing. In addition, replicated field trials were conducted on a subset of lines (30 of the 79 lines, plus ancestral varieties that were not sequenced) to measure protein, oil, yield, and other characteristics under standard growth conditions, to dissociate the effect of on-farm improvements from genetic gain [1],[2].

Processing Data and SNP Identification

Raw reads were aligned to the reference soybean genome (version 2) using GSNAP (version 2013-8-31) [3].
Reads that mapped uniquely were converted from SAM format to BAM format using samtools [4].
Read groups were added for each soybean line and duplicate reads were removed using AddOrReplaceReadGroups and MarkDuplicates functions in picard tools.
The resulting alignment BAM files were realigned using IndelRealigner function in GATK [5]. The ReduceReads function was used to compress the alignment files by removing non-informative and redundant reads (default parameters except for downsample_coverage=1).
SNPs and InDels were called on all reduced BAM files together using the HaplotypeCaller function in GATK (version 2.7-2-g6bda569).

SNP Sampling

Summary

SNPS displayed in this applet are reliable SNPs from 120 lines. 86944 unique SNPs were identified for further analysis.

Need to delineate "reliable" criterion.

The lines included in this analysis are:

180501	CL0J173-6-8	Hood	LG92-1255	PI574
4J105-3-4	Clark	HS6-3976	LG94-1128	Pickett
506-13640-2	Clark (NAM)	Hutcheson	LG94-1906	Prohio
5601T	Clemson	IA3023	LG97-7012	Raleigh
5M20-2-5-2	CNS	Illini	LG98-1605	Ralsoy
88788	Cook	Jackson	Lincoln	Ransom
A3127	Corsoy	Kanro	Magellan	Richland
Adams	Cumberland	Kent	Mandarin	Roanoke
A.K.	Davis	Lawrence	Maverick	S06-13640
Amcor	Dillon	LD00-3309	Merit	S-100
Amsoy	Dorman	LD01-5907	Mukden	Shelby
Anderson	Douglas	LD02-4485	NCRoy	Skylla
Beeson	Dunfield	LD02-9050	NE3001	TN05-3027
Blackhawk	Essex	Lee	Oakland	Tokyo
Bonus	Ford	LG00-3372	Ogden	Tracy
Bragg	Forrest	LG03-2979	Pella	U03-100612
Braxton	Gasoy17	LG03-3191	Perry	Volstate
Brim	Haberlandt	LG04-4717	PI398	Wayne
Calland	Hagood	LG04-6000	PI404	Williams
Capital	Harcor	LG05-4292	PI427	Williams82
Centennial	Harosoy	LG05-4317	PI437	Woodworth
Century	Hawkeye	LG05-4464	PI507	York
Chippewa	Hill	LG05-4832	PI518	Young
CL0J095-4-6	Holladay	LG90-2550	PI561	Zane

Kinship Analysis

Sampling Methodology

Kinship matrices were generated with TASSEL [6] using a subset of the SNP data, where one random SNP was taken from every 10,000 base interval in the genome or the next closest SNP (Supplementary Script 1).

Clustering and Plot of Kinship Distance

These matrices were then clustered using Ward's method, using the distance (2- similarity). Clusters were used to lay out the rows and columns of the heatmap.

library(plyr)
library(reshape2)
library(animint)
library(ggplot2) # need to use fork of ggplot2: install_github("tdhock/ggplot2") for animint compatibility
library(ggdendro)
library(grid)


# Symmetric kinship matrix except for the first column, which contains the variety names
tmp <- read.csv("kinshipMatrix.txt", sep="\t", skip=1, head=F, stringsAsFactors=F)
names(tmp) <- c("Variety", tmp[,1])

# Fix names to correspond with actual usage
rownames <- tmp$Variety
rownames <- gsub("506-13640-2", "S06-13640-2", rownames)
rownames <- gsub("901-G0\\d{1}_RS_", "", rownames)
rownames <- gsub("_NuGEN", "", rownames)
rownames <- gsub("Gasoy17", "Gasoy 17", rownames)
rownames <- gsub("RS_", "", rownames)
rownames <- gsub("Williams82", "Williams 82", rownames)
rownames <- gsub("AK_004", "A.K. (Harrow)", rownames)
rownames <- gsub("NCR", "R", rownames)

# Enforce row and column names
tmp$Variety <- rownames
names(tmp) <- c("Variety", rownames)

# Create numerical matrix for use in clustering algorithms
meanMatrix <- data.matrix(tmp[,-1])
# Ensure dimnames are accurate
dimnames(meanMatrix)[[1]] <- rownames
dimnames(meanMatrix)[[2]] <- rownames

# Convert kinship (similarity) to distance matrix by subtracting relatedness from 2 (maximum relatedness)
dd.col <- rev(as.dendrogram(hclust(as.dist(2-meanMatrix), method="ward.D")))
dd.row <- as.dendrogram(hclust(as.dist(2-meanMatrix), method="ward.D"))
col.ord <- labels(dd.col)
row.ord <- labels(dd.row)
# Get x and y coordinates
ddata_x <- dendro_data(dd.row)
ddata_y <- dendro_data(dd.col)

The clusters were then formatted for plotting and plotted using ggplot2.

# Melt kinship matrix into long form
matrixLong <- melt(tmp, id.vars = 1)
names(matrixLong) <- c("variety1", "variety2", "value")
matrixLong$value <- as.numeric(matrixLong$value)
write.csv(matrixLong, "LongKinshipMatrix.csv")

# Make long matrix factor names match clustered ordering (so tree matches x and y coordinates)
matrixLong$variety1 <- factor(matrixLong$variety1, levels=col.ord)
matrixLong$variety2 <- factor(matrixLong$variety2, levels=row.ord)

# Multiplier to ensure the tree has reasonable range compared to the heatmap
dendro.multiplier <- 1

# Static Plot
qplot(x=as.numeric(variety1), y=as.numeric(variety2), geom="tile", fill=value, data=matrixLong) + 
  theme_bw() + 
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=.5)) +  
  xlab("Variety 1") + 
  ylab("Variety 2") + 
  scale_x_continuous(expand=c(0, 1), breaks=1:79, labels=col.ord) + 
  scale_y_continuous(expand=c(0, 1), breaks=1:79, labels=row.ord) + 
  scale_fill_gradient("Relatedness",low="#ffffff", high="#374f6b") + 
  ggtitle("Kinship Matrix-based Relatedness") + 
  coord_equal()+ 
  geom_segment(data=segment(ddata_y), aes(x=x, y=y*dendro.multiplier+80, xend=xend, yend=yend*dendro.multiplier+80), inherit.aes=F) + 
  geom_segment(data=segment(ddata_x), aes(x=y*dendro.multiplier+80, y=x, xend=yend*dendro.multiplier+80, yend=xend), inherit.aes=F)

SNP Imputation and QTL Analysis

SNPs were imputed using BEAGLE [7]
TASSEL [6] was used to identify QTLs using a general linear model to control for population structure.

Need more detail here from Andrew. Also, need QTL results?

Plot Documentation

Using the combined, phased and imputed VCF file generated as a result of the SNP sampling, the following steps are performed.

Convert the VCF file into a data frame

We used the vcf2tsv command from vcflib to convert the vcf file into an easily plotable format.

vcf2tsv combined.REDUCED.1.realign.123013.uniq.sorted.SNPsOnly.MeanPlus0p150.nospace.vcf > MeanPlus0p150SD.tsv

Other options for performing the same function include the VariantAnnotation bioconductor package [8] and the ggbio bioconduuctor package [9].

library(stringr)
library(reshape2)
library(animint)
library(ggplot2)

# Read in the data from the TSV
vcfTable <- read.table("MeanPlus0p150SDimputed.tsv", sep="\t", stringsAsFactors=FALSE, header=TRUE)

# Correct column names
col.names <- c("Chromosome", "Position", "id", "Reference", "Alternate", 
               "qual", "filter", "Allele.Freq", "AlleleRSquared", "DosageRSquared",
               "Variety", "Alt_Allele_Freq", "Genotype_Probability", "Gene_State")

names(vcfTable) <- col.names

Rearranging Data for Plotting

Plots in this applet were generated using ggplot2 [10], and are rendered interactively using Shiny [11].

library(doMC)
registerDoMC()
library(dplyr)

vcfTable$Alt.Allele.Count <- round(vcfTable$Alt.Allele.Freq)

varieties <- as.character(unique(vcfTable$Variety))
seqnames <- unique(vcfTable$Chromosome[grepl("Chr", vcfTable$Chromosome)])

# remove scaffolds for now to make display easier
vcfTable <- filter(vcfTable, grepl("Chr", vcfTable$Chromosome))

# Change names to correspond to actual usage (instead of lab names) according to Jim's suggestions.
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("901-G04_RS_5601T"), "5601T")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("901-G06_RS_Bonus"), "Bonus")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("RS_"), "")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("AK_004"), "A.K.")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("Clark_NuGEN"), "Clark")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("NCRaleigh"), "Raleigh")

SNP Browser Plot

The goal here is to calculate the proportion of SNPs of each base pair (including those varieties that did not differ from the reference genome at the base pair in question).

n <- length(unique(varieties))
snp <- snpList %>% ungroup %>% group_by(Chromosome, Position, Reference) %>% 
  summarise(A=sum(Alt_Allele_Count*(Alternate=="A")), 
            G=sum(Alt_Allele_Count*(Alternate=="G")), 
            C=sum(Alt_Allele_Count*(Alternate=="C")), 
            T=sum(Alt_Allele_Count*(Alternate=="T")), 
            total=sum(Alt_Allele_Count)) %>%
  mutate(A = A + (Reference=="A")*(2*n-total), 
         G = G + (Reference=="G")*(2*n-total),
         C = C + (Reference=="C")*(2*n-total),
         T = T + (Reference=="T")*(2*n-total)) 
snp <- snp[,1:7]


library(tidyr)
snp.counts <- snp %>% gather(Nucleotide, Count, 4:7)
snp.counts <- filter(snp.counts, Count>0)
save(snp.counts, file="SNPCounts.RData")

SNP Density Plots

This code calculates the density of SNPs along each chromosome. For each group of data, the SNP density is calculated at 8192 points along the chromosome, with a bandwidth 10% of the default kernel bandwidth. SNPs are weighted according to the count of alternate alleles.

snp.density <- group_by(snpList, Chromosome, Variety) %>%
  do(as.data.frame(density(.$Position, n=2048*4, adjust=0.1, from=1, to=max(.$Position), weights=(.$Alt_Allele_Count)/sum(.$Alt_Allele_Count))[1:2]))
save(snp.density, file="SNPDensity.RData")

References

1. Specht JE, Williams JH. Contribution of genetic technology to soybean productivity—Retrospect and prospect. Genetic contributions to yield gains of five major crop plants. Crop Science Society of America; American Society of Agronomy; 1984;49–74.

2. Fox CM, Cary TR, Colgrove AL, Nafziger ED, Haudenshield JS, Hartman GL, Specht JE, Diers BW. Estimating soybean genetic gain for yield in the northern united states—Influence of cropping history. Crop Science. The Crop Science Society of America, Inc. 2013;53:2473–2482.

3. Wu TD, Nacu S. Fast and snp-tolerant detection of complex variants and splicing in short reads. Bioinformatics. Oxford Univ Press; 2010;26:873–881.

4. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, others. The sequence alignment/map format and samtools. Bioinformatics. Oxford Univ Press; 2009;25:2078–2079.

5. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, others. The genome analysis toolkit: A mapreduce framework for analyzing next-generation dna sequencing data. Genome research. Cold Spring Harbor Lab; 2010;20:1297–1303.

6. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics. Oxford Univ Press; 2007;23:2633–2635.

7. Browning BL, Browning SR. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. Genetics Soc America; 2013;194:459–471.

8. Obenchain V, Lawrence M, Carey V, Gogarten S, Shannon P, Morgan M. VariantAnnotation: A bioconductor package for exploration and annotation of genetic variants. Bioinformatics. Oxford Univ Press; 2014;btu168.

9. Yin T, Cook D, Lawrence M. Ggbio: An r package for extending the grammar of graphics for genomic data. Genome Biology. BioMed Central Ltd; 2012;13:R77.

10. Wickham H. Ggplot2: Elegant graphics for data analysis [Internet]. Springer New York; 2009. Available from: http://had.co.nz/ggplot2/book.

11. RStudio, Inc. Shiny: Web application framework for r [Internet]. 2014. Available from: http://www.rstudio.com/shiny/.

GlymaID navigation

Manual navigation

Matching Glyma IDs

Matching Glyma IDs

Matching SNPs

Genealogy

SNP Documentation

Susan VanderPlas, Di Cook, Andrew Severin, Jim Specht, Michelle Graham

July 1, 2015

Data Collection

Processing Data and SNP Identification

SNP Sampling

Summary

Kinship Analysis

Sampling Methodology

Clustering and Plot of Kinship Distance

SNP Imputation and QTL Analysis

Plot Documentation

Convert the VCF file into a data frame

Rearranging Data for Plotting

SNP Browser Plot

SNP Density Plots

References