Select one or more chromosomes. The first table will show the number of SNP sites for each glymaID on the chromosome.
Select one or more chromosomes to see the distribution of SNPs.
Manual navigation
Choose chromosome and position
For a selected glymaID, the second table shows each SNP location, and the number of varieties with SNPs at that position.
Use the table below to see what GlymaIDs have SNPs on a specific chromosome.
Use the table below to see SNPs at specific locations within a GlymaID.
Matching Glyma IDs
Matching Glyma IDs
Matching SNPs
Genealogy
Data is available for cultivars shown in black.
SNP Documentation
SNP Documentation
Susan VanderPlas, Di Cook, Andrew Severin, Jim
Specht, Michelle Graham
July 1, 2015
```
Data Collection
The data displayed in this applet is derived from next generation
sequencing of 120 soybean lines. Seventy-nine new lines were sequenced
as part of this project and represent plant introductions and milestone
cultivars. The remaining sequences represent the 41 parents used for
developing the soybean NAM population. NAM parent sequences were
provided by Dr. Perry Cregan (USDA-ARS) and the Soybean NAM project. For
the new sequences generated by this project, twenty seeds from each line
were acquired from the USDA Soybean Germplasm Collection. Seeds were
planted in the USDA greenhouse at Iowa State University. Once plants
reached the trifoliolate stage, leaves from up to 10 plants were pooled
and genomic DNA was extracted. DNA was sent to Hudson Alpha Institute
for Biotechnology for next-generation sequencing. In addition,
replicated field trials were conducted on a subset of lines (30 of the
79 lines, plus ancestral varieties that were not sequenced) to measure
protein, oil, yield, and other characteristics under standard growth
conditions, to dissociate the effect of on-farm improvements from
genetic gain [@specht1984contribution],[@fox2013estimating].
Reads that mapped uniquely were converted from SAM format to BAM
format using samtools[@li2009sequence].
Read groups were added for each soybean line and duplicate reads
were removed using AddOrReplaceReadGroups and
MarkDuplicates functions in picard tools.
The resulting alignment BAM files were realigned using
IndelRealigner function in GATK[@mckenna2010genome]. The
ReduceReads function was used to compress the alignment
files by removing non-informative and redundant reads (default
parameters except for downsample_coverage=1).
SNPs and InDels were called on all reduced BAM files together using
the HaplotypeCaller function in GATK (version
2.7-2-g6bda569).
SNP Sampling
Summary
SNPS displayed in this applet are reliable SNPs from 120 lines. 86944
unique SNPs were identified for further analysis.
Need to delineate “reliable” criterion.
The lines included in this analysis are:
180501
CL0J173-6-8
Hood
LG92-1255
PI574
4J105-3-4
Clark
HS6-3976
LG94-1128
Pickett
506-13640-2
Clark (NAM)
Hutcheson
LG94-1906
Prohio
5601T
Clemson
IA3023
LG97-7012
Raleigh
5M20-2-5-2
CNS
Illini
LG98-1605
Ralsoy
88788
Cook
Jackson
Lincoln
Ransom
A3127
Corsoy
Kanro
Magellan
Richland
Adams
Cumberland
Kent
Mandarin
Roanoke
A.K.
Davis
Lawrence
Maverick
S06-13640
Amcor
Dillon
LD00-3309
Merit
S-100
Amsoy
Dorman
LD01-5907
Mukden
Shelby
Anderson
Douglas
LD02-4485
NCRoy
Skylla
Beeson
Dunfield
LD02-9050
NE3001
TN05-3027
Blackhawk
Essex
Lee
Oakland
Tokyo
Bonus
Ford
LG00-3372
Ogden
Tracy
Bragg
Forrest
LG03-2979
Pella
U03-100612
Braxton
Gasoy17
LG03-3191
Perry
Volstate
Brim
Haberlandt
LG04-4717
PI398
Wayne
Calland
Hagood
LG04-6000
PI404
Williams
Capital
Harcor
LG05-4292
PI427
Williams82
Centennial
Harosoy
LG05-4317
PI437
Woodworth
Century
Hawkeye
LG05-4464
PI507
York
Chippewa
Hill
LG05-4832
PI518
Young
CL0J095-4-6
Holladay
LG90-2550
PI561
Zane
Kinship Analysis
Sampling Methodology
Kinship matrices were generated with TASSEL[@bradbury2007tassel] using a subset of
the SNP data, where one random SNP was taken from every 10,000 base
interval in the genome or the next closest SNP (Supplementary Script
1).
Clustering and Plot of
Kinship Distance
These matrices were then clustered using Ward’s method, using the
distance (2- similarity). Clusters were used to lay out the rows and
columns of the heatmap.
library(plyr)
library(reshape2)
library(animint)
library(ggplot2) # need to use fork of ggplot2: install_github("tdhock/ggplot2") for animint compatibility
library(ggdendro)
library(grid)
# Symmetric kinship matrix except for the first column, which contains the variety names
tmp <- read.csv("kinshipMatrix.txt", sep="\t", skip=1, head=F, stringsAsFactors=F)
names(tmp) <- c("Variety", tmp[,1])
# Fix names to correspond with actual usage
rownames <- tmp$Variety
rownames <- gsub("506-13640-2", "S06-13640-2", rownames)
rownames <- gsub("901-G0\\d{1}_RS_", "", rownames)
rownames <- gsub("_NuGEN", "", rownames)
rownames <- gsub("Gasoy17", "Gasoy 17", rownames)
rownames <- gsub("RS_", "", rownames)
rownames <- gsub("Williams82", "Williams 82", rownames)
rownames <- gsub("AK_004", "A.K. (Harrow)", rownames)
rownames <- gsub("NCR", "R", rownames)
# Enforce row and column names
tmp$Variety <- rownames
names(tmp) <- c("Variety", rownames)
# Create numerical matrix for use in clustering algorithms
meanMatrix <- data.matrix(tmp[,-1])
# Ensure dimnames are accurate
dimnames(meanMatrix)[[1]] <- rownames
dimnames(meanMatrix)[[2]] <- rownames
# Convert kinship (similarity) to distance matrix by subtracting relatedness from 2 (maximum relatedness)
dd.col <- rev(as.dendrogram(hclust(as.dist(2-meanMatrix), method="ward.D")))
dd.row <- as.dendrogram(hclust(as.dist(2-meanMatrix), method="ward.D"))
col.ord <- labels(dd.col)
row.ord <- labels(dd.row)
# Get x and y coordinates
ddata_x <- dendro_data(dd.row)
ddata_y <- dendro_data(dd.col)
The clusters were then formatted for plotting and plotted using
ggplot2.
# Melt kinship matrix into long form
matrixLong <- melt(tmp, id.vars = 1)
names(matrixLong) <- c("variety1", "variety2", "value")
matrixLong$value <- as.numeric(matrixLong$value)
write.csv(matrixLong, "LongKinshipMatrix.csv")
# Make long matrix factor names match clustered ordering (so tree matches x and y coordinates)
matrixLong$variety1 <- factor(matrixLong$variety1, levels=col.ord)
matrixLong$variety2 <- factor(matrixLong$variety2, levels=row.ord)
# Multiplier to ensure the tree has reasonable range compared to the heatmap
dendro.multiplier <- 1
# Static Plot
qplot(x=as.numeric(variety1), y=as.numeric(variety2), geom="tile", fill=value, data=matrixLong) +
theme_bw() +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=.5)) +
xlab("Variety 1") +
ylab("Variety 2") +
scale_x_continuous(expand=c(0, 1), breaks=1:79, labels=col.ord) +
scale_y_continuous(expand=c(0, 1), breaks=1:79, labels=row.ord) +
scale_fill_gradient("Relatedness",low="#ffffff", high="#374f6b") +
ggtitle("Kinship Matrix-based Relatedness") +
coord_equal()+
geom_segment(data=segment(ddata_y), aes(x=x, y=y*dendro.multiplier+80, xend=xend, yend=yend*dendro.multiplier+80), inherit.aes=F) +
geom_segment(data=segment(ddata_x), aes(x=y*dendro.multiplier+80, y=x, xend=yend*dendro.multiplier+80, yend=xend), inherit.aes=F)
SNP Imputation and QTL
Analysis
SNPs were imputed using BEAGLE[@browning2013improving]
TASSEL[@bradbury2007tassel] was used to identify
QTLs using a general linear model to control for population
structure.
Need more detail here from Andrew. Also, need QTL results?
Plot Documentation
Using the combined, phased and imputed VCF file generated as a result
of the SNP sampling, the following steps are performed.
Convert the VCF file
into a data frame
We used the vcf2tsv command from vcflib to convert the
vcf file into an easily plotable format.
Other options for performing the same function include the
VariantAnnotation bioconductor package [@obenchain2014variantannotation] and the ggbio
bioconduuctor package [@ggbio].
library(stringr)
library(reshape2)
library(animint)
library(ggplot2)
# Read in the data from the TSV
vcfTable <- read.table("MeanPlus0p150SDimputed.tsv", sep="\t", stringsAsFactors=FALSE, header=TRUE)
# Correct column names
col.names <- c("Chromosome", "Position", "id", "Reference", "Alternate",
"qual", "filter", "Allele.Freq", "AlleleRSquared", "DosageRSquared",
"Variety", "Alt_Allele_Freq", "Genotype_Probability", "Gene_State")
names(vcfTable) <- col.names
Rearranging Data for
Plotting
Plots in this applet were generated using ggplot2 [@ggplot2], and are rendered interactively using
Shiny [@shiny].
library(doMC)
registerDoMC()
library(dplyr)
vcfTable$Alt.Allele.Count <- round(vcfTable$Alt.Allele.Freq)
varieties <- as.character(unique(vcfTable$Variety))
seqnames <- unique(vcfTable$Chromosome[grepl("Chr", vcfTable$Chromosome)])
# remove scaffolds for now to make display easier
vcfTable <- filter(vcfTable, grepl("Chr", vcfTable$Chromosome))
# Change names to correspond to actual usage (instead of lab names) according to Jim's suggestions.
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("901-G04_RS_5601T"), "5601T")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("901-G06_RS_Bonus"), "Bonus")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("RS_"), "")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("AK_004"), "A.K.")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("Clark_NuGEN"), "Clark")
vcfTable$Variety <- str_replace(vcfTable$Variety, fixed("NCRaleigh"), "Raleigh")
SNP Browser Plot
The goal here is to calculate the proportion of SNPs of each base pair
(including those varieties that did not differ from the reference genome
at the base pair in question).
n <- length(unique(varieties))
snp <- snpList %>% ungroup %>% group_by(Chromosome, Position, Reference) %>%
summarise(A=sum(Alt_Allele_Count*(Alternate=="A")),
G=sum(Alt_Allele_Count*(Alternate=="G")),
C=sum(Alt_Allele_Count*(Alternate=="C")),
T=sum(Alt_Allele_Count*(Alternate=="T")),
total=sum(Alt_Allele_Count)) %>%
mutate(A = A + (Reference=="A")*(2*n-total),
G = G + (Reference=="G")*(2*n-total),
C = C + (Reference=="C")*(2*n-total),
T = T + (Reference=="T")*(2*n-total))
snp <- snp[,1:7]
library(tidyr)
snp.counts <- snp %>% gather(Nucleotide, Count, 4:7)
snp.counts <- filter(snp.counts, Count>0)
save(snp.counts, file="SNPCounts.RData")
SNP Density Plots
This code calculates the density of SNPs along each chromosome. For each
group of data, the SNP density is calculated at 8192 points along the
chromosome, with a bandwidth 10% of the default kernel bandwidth. SNPs
are weighted according to the count of alternate alleles.