19  Module 5.2: Single Trait GWAS

We will be carrying out the GWAS analysis using the runSingleTraitGwas() function from the statgenGWAS package.

runSingleTraitGwas(gData, traits = NULL, covar = NULL, snpCov = NULL, kin = NULL, kinshipMethod, remlAlgo, GLSMethod, MAF = 0.01, thrType, alpha = 0.05, LODThr = 4, nSnpLOD = 10, pThr = 0.05): performs a single-trait Genome Wide Association Study (GWAS) on phenotypic and genotypic data contained in a gData object.

19.1 Significance Threshold

Prior to running our analysis, we need to compute a threshold by which to identify significant SNPs. The threshold for selecting significant SNPs in a GWAS analysis tends to be computed by using Bonferroni correction, with an alpha of 0.05. The alpha can be modified by setting the option alpha when calling the runSingleTraitGwas() function. Two other threshold types can be used: a fixed threshold (we set the thrType parameter to “fixed”) by specifying the log10(p) (LODThr) value of the threshold, or a small threshold (we set the thrType parameter to “small” and nSnpLOD to n) which defines the n SNPs with the highest log10(p) scores as significant SNPs.

# Bonferroni correction approximation
(pvalue.thr <- 0.05 / ncol(gData$markers))
[1] 4.599816e-05
format(pvalue.thr, scientific = FALSE)
[1] "0.00004599816"
# LOD score (logarithm of the odds)
(LOD.thr <- -log10(pvalue.thr))
[1] 4.33726
# Reverse the conversion to get p-value threshold
10^(-LOD.thr)
[1] 4.599816e-05
# Exact Bonferroni correction formula
(pvalue.thr <- 1 - (1 - 0.05) ^ (1 / ncol(gData$markers)))
[1] 4.718683e-05
(LOD.thr <- -log10(pvalue.thr))
[1] 4.326179

19.2 Variance Covariance Matrix

There are two ways to compute the phenotypic variance covariance matrix used in the GWAS analysis. Either the EMMA algorithm (Kang et al. 2008) or the Newton-Raphson algorithm (Tunnicliffe 1989). Specify the method by setting the parameter remlAlgo to either “EMMA” or “NR”. By default the EMMA algorithm is used.

19.3 Running GWAS

GWAS <- runSingleTraitGwas(gData, 
                           traits  = 'ASC_Score', 
                           thrType = 'fixed', 
                           LODThr  = 3.5, 
                           kinshipMethod = 'vanRaden')

Our output is an object of class GWAS. GWAResult is a data.table with the following columns:

  • trait: trait name

  • snp: SNP name

  • chr: chromosome on which the SNP is located

  • pos: position of the SNP on the chromosome

  • allFreq: allele frequency of the SNP

  • pValue: P-value for the SNP

  • effect: effect of the SNP on the trait value

  • effectSe: standard error of the effect of the SNP on the trait value

  • RLR2: likelihood-ratio-based R2 as defined in Sun et al. (2010)

  • LOD: LOD score (logarithm of the odds) for the SNP, defined as log10(pValue)

# Visualizing results
create_dt(cbind(GWAS$GWAResult$phenotypic_data[,1:4],
                signif(GWAS$GWAResult$phenotypic_data[,-c(1:4)], 3)))

Note that the estimated effect is computed for a single allele. Its direction depends on the coding of the markers in the gData object. In this example the minor allele was used as reference allele, so the effects are the estimated effects for the minor allele.

19.4 Significant Alleles

signSnp is a data.tables containing the significant SNPs. Optionally, the SNPs close to the significant SNPs are included in the data.table. The data.table in signSnp consist of the same columns as those in GWAResult described above. Two extra columns are added:

  • snpStatus: either significant SNP or within … of a significant SNP

  • propSnpVar: proportion of the variance explained by the SNP

create_dt(cbind(GWAS$signSnp$phenotypic_data[, 1:4],
                signif(GWAS$signSnp$phenotypic_data[, 5:10], 3),
                GWAS$signSnp$phenotypic_data[, 11],
                signif(GWAS$signSnp$phenotypic_data[, 12], 3)))

We can get the name of the best marker (the one with the highest LOD value):

(snp <- GWAS$signSnp$phenotypic_data[LOD == max(LOD), snp])
[1] "SCa4_9803244"

We can manually check by using t.test, which is the old school way.

# Creating data frame for t-test
df <- as.data.frame(marker_matrix[,snp])

df$genotype  <- rownames(marker_matrix)
colnames(df) <- c('snp', 'genotype')

df <- merge(df, phenotypic_data, id = 'genotype')
df <- na.omit(df)

table(df$snp)

  C   N   T   Y 
129  11  55   5 
# Running t-test
df <- df[df$snp %in% c('C', 'G', 'A', 'T'),]
df$snp <- as.factor(df$snp)

t.test(formula(paste(trait_name, '~ snp')), data = df)

    Welch Two Sample t-test

data:  ASC_Score by snp
t = -5.4233, df = 114.29, p-value = 3.308e-07
alternative hypothesis: true difference in means between group C and group T is not equal to 0
95 percent confidence interval:
 -1.6581884 -0.7709251
sample estimates:
mean in group C mean in group T 
       4.592171        5.806727 
# Plotting t-test results
boxplot(formula(paste(trait_name, '~ snp')), data = df)

19.5 Kinship Results

A kinship matrix was produced and used in the GWAS analysis. The method for producing this matrix can be defined by thge kinshipMethod parameter in the runSingleTraitGwas() function. By default, the same kinship matrix is used for testing all SNPs (GLSMethod = “single”). When GLSMethod = “multi”, the kinship matrix is chromosome-specific. As shown by Rincent et al.(2014), this often gives a considerable improvement in power.

# Visualizing kinship matrix
create_dt(round(GWAS$kinship, 3))
# Plotting a heatmap of kinship results
heatmap(GWAS$kinship)

19.6 Population Structure

The population structure can also be explored from the produced kinship results.

# Perform principal components analysis
pca <- prcomp(GWAS$kinship)

# Cumulative Proportion
summary(pca)$importance[,1:2]
                            PC1      PC2
Standard deviation     2.373792 1.268689
Proportion of Variance 0.591670 0.169010
Cumulative Proportion  0.591670 0.760670
# Plotting correlations between traits (1st and 2nd components)
plot(pca$x[,1:2], pch = 20,
     main = paste0('Kinship PCA (', round(100*summary(pca)$importance[3,2],1), '%)'),
     xlab = paste0('PC1 (', round(100*summary(pca)$importance[2,1],1), '%)'),
     ylab = paste0('PC2 (', round(100*summary(pca)$importance[2,2],1), '%)'))

19.7 Results Summary

General information of our GWAS results can be obtained by printing the GWASInfo part of our GWAS object. GWASInfo$inflationFactor returns the inflation factor (Devlin and Roeder 1999). Ideally this factor should be 1, meaning there is no inflation at all. If the values are further away from 1, the inflation can be corrected for by setting genomicControl = TRUE in runSingleTraitGwas().

GWAS$GWASInfo
$call
runSingleTraitGwas(gData = gData, traits = "ASC_Score", kinshipMethod = "vanRaden", 
    thrType = "fixed", LODThr = 3.5)

$remlAlgo
[1] "EMMA"

$thrType
[1] "fixed"

$MAF
[1] 0.01

$GLSMethod
[1] "single"

$kinshipMethod
[1] "vanRaden"

$varComp
$varComp$phenotypic_data
$varComp$phenotypic_data$ASC_Score
      Vg       Ve 
0.646051 1.895564 



$genomicControl
[1] FALSE

$inflationFactor
$inflationFactor$phenotypic_data
ASC_Score 
0.8854797 

For a quick overview of the results, e.g. the number of significant SNPs, use the summary function.

summary(GWAS)
phenotypic_data:
    Traits analysed: ASC_Score 

    Data are available for 1087 SNPs.
     0 of them were not analyzed because their minor allele frequency is below 0.01 

    GLSMethod: single 
    kinshipMethod: vanRaden 

    Trait: ASC_Score 

        Mixed model with only polygenic effects, and no marker effects:
        Genetic variance: 0.646051 
        Residual variance: 1.895564 

        LOD-threshold: 3.5 
        Number of significant SNPs: 2 
        Smallest p-value among the significant SNPs: 6.81679e-05 
        Largest p-value among the significant SNPs: 0.0002511666 (LOD-score: 3.600038)

        No genomic control correction was applied
        Genomic control inflation-factor: 0.885 

19.8 Plots

19.8.1 QQ-Plot

A QQ-plot of the observed against the expected log10(p) values. Most of the SNPs are expected to have no effect, resulting in P-values uniformly distributed on [0,1], and leading to the identity function (y=x) on the log10(p) scale. Deviations from this line should only occur on the right side of the plot, for a small number of SNPs with an effect on the phenotype (and possibly SNPs in LD).

plot(GWAS, plotType = 'qq', trait = trait_name)

19.8.2 Manhattan Plot

A Manhattan plot of GWAS. Significant SNPs are marked in red. Plot only 5% of SNPs with a LOD below 2 (lod = 2)

plot(GWAS, plotType = 'manhattan', trait = trait_name)

19.8.3 QTL Plot

A qtl plot of GWAS. The significant SNPs are marked by circles at their genomic positions, with diameter proportional to the estimated effect size. Colors indicate the direction of the effect: green when the allele increases the trait value, and blue when it decreases the value.

plot(GWAS, plotType = 'qtl')