12  Module 3.1: Genotype Data

12.1 Introduction

Genotype data describe the genetic makeup of individuals, in this case crop accessions, at specific loci across the genome. These data allow us to associate genetic differences with traits of agronomic interest and with regional information.

  • A genotype is the specific combination of alleles an individual carries at a given locus. Depending on the ploidy of the crop, each individual carries two (diploids) or more (polyploids) alleles per locus.

  • SNPs (Single Nucleotide Polymorphisms) are single-base positions in the genome where individuals differ.

    • Example: Three crop varieties may be homozygous A/A, heterozygous A/G and homozygous G/G at a specific locus. With A as the reference allele, these genotypes can also be coded as 0, 1 and 2, that is, the number of copies of the alternate allele: 0 is homozygous for the reference allele, 1 is heterozygous, and 2 is homozygous for the alternate allele.
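
The 0/1/2 coding in the example above amounts to counting copies of the alternate allele. A minimal sketch in base R, assuming "/"-separated allele pairs and G as the alternate allele; the helper name encode_dosage and the variety labels V1 to V3 are hypothetical and used only for illustration:

```r
# Hypothetical helper: count how many copies of the alternate allele
# each genotype call carries.
encode_dosage <- function(calls, alt) {
  # Split "A/G" into c("A", "G") for every call
  alleles <- strsplit(calls, "/", fixed = TRUE)
  # The number of alleles equal to `alt` is the 0/1/2 code
  vapply(alleles, function(a) sum(a == alt), integer(1))
}

# Three illustrative varieties at one locus (REF = A, ALT = G)
calls <- c(V1 = "A/A", V2 = "A/G", V3 = "G/G")
dosage <- encode_dosage(calls, alt = "G")
print(dosage)
```

Counting alleles rather than matching whole strings means that A/G and G/A receive the same code.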

12.2 Formats

SNP data can be stored in different formats and file types, depending on the platform or program used. We will briefly discuss the most common file types.

  • VCF (Variant Call Format, .vcf): Standard format for SNPs and other variants called from sequencing data. Contains metadata, variant IDs, positions, genotype calls and other information.
    • GT = Genotype
    • 0 = REF; 1 = ALT
    • 0/0 or 1/1 = homozygous; 0/1 or 1/0 = heterozygous
#CHROM  POS      ID            REF  ALT  QUAL  FILTER  INFO  FORMAT  G1   G2   G3
1H      7253074  SCRI_RS_1929  A    C    .     PASS    .     GT      1/1  1/1  0/0
# Load the vcfR package, which provides read.vcfR() and extract.gt()
library(vcfR)
# Read vcf file
vcf <- read.vcfR("data/Barley.vcf", verbose = FALSE)
# Glimpse vcf
head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.1"
[1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
[1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read  [Truncated]"
[1] "##FORMAT=<ID=DV,Number=.,Type=Integer,Description=\"Read depth of the [Truncated]"
[1] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
[1] "##INFO=<ID=MQ,Number=1,Type=Float,Description=\"RMS Mapping Quality\">"
[1] "First 6 rows."
[1] 
[1] "***** Fixed section *****"
     CHROM POS      ID REF ALT QUAL  FILTER
[1,] "1H"  "144018" NA "A" "G" "999" "NA"  
[2,] "1H"  "147155" NA "T" "C" "999" "NA"  
[3,] "1H"  "166336" NA "C" "T" "999" "NA"  
[4,] "1H"  "173286" NA "T" "C" "999" "NA"  
[5,] "1H"  "253434" NA "C" "T" "999" "NA"  
[6,] "1H"  "253481" NA "C" "T" "999" "NA"  
[1] 
[1] "***** Genotype section *****"
     FORMAT     ICARDA_G0011 ICARDA_G0012 ICARDA_G0013 ICARDA_G0014
[1,] "GT:DP:DV" "0/0:33:0"   "0/0:23:0"   "0/0:18:0"   "0/0:23:0"  
[2,] "GT:DP:DV" "0/0:8:0"    "0/0:8:0"    "0/0:10:0"   "0/0:6:0"   
[3,] "GT:DP:DV" "0/0:16:0"   "0/0:9:0"    "0/0:9:0"    "0/0:5:0"   
[4,] "GT:DP:DV" "0/0:18:0"   "0/0:18:0"   "0/0:14:0"   "0/0:10:0"  
[5,] "GT:DP:DV" "1/1:12:12"  "1/1:19:19"  "1/1:12:11"  "1/1:10:10" 
[6,] "GT:DP:DV" "0/0:12:0"   "0/0:17:0"   "0/0:12:0"   "0/0:10:0"  
     ICARDA_G0015
[1,] "0/0:15:0"  
[2,] "0/0:8:0"   
[3,] "0/0:9:0"   
[4,] "0/0:10:0"  
[5,] "1/1:12:12" 
[6,] "0/0:11:0"  
[1] "First 6 columns only."
[1] 
[1] "Unique GT formats:"
[1] "GT:DP:DV"
[1] 
# Extract the genotype (GT) calls as a character matrix
# (variants in rows, samples in columns)
vcfMatrix <- extract.gt(vcf)
  • PLINK (.ped, .map) or Binary PLINK (.bed, .bim, .fam)
    • .ped: Pedigree/genotype data (space- or tab-delimited text)
    • .map: SNP map information (chromosome, SNP ID, genetic distance, position)
    • .bed: Binary genotype matrix
    • .bim: SNP information (the .map columns plus the two alleles)
    • .fam: Sample information
  • HapMap (.hmp.txt): Used in TASSEL. The leading columns hold SNP metadata and positions, followed by one column per sample with genotypes encoded as allele pairs (A/A, A/G, etc.).
  • Numeric Matrix (.csv, .txt): SNPs in columns and genotypes in rows (or vice versa), with data encoded as 0, 1 and 2 for homozygous for the reference allele, heterozygous, and homozygous for the alternate allele, respectively.
# Load SNP data matrix
matrix <- read.table("data/BarleyMatrix.txt", sep = "\t", header = TRUE, 
                     row.names = 1, check.names = FALSE)
# The vcf matrix we obtained can also be turned into this type of format
matrixNum <- vcfToNumericMatrix(vcfMatrix)
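
To make the conversion from GT strings to the 0/1/2 coding concrete, the following is a minimal base-R sketch. It is not the implementation of vcfToNumericMatrix(); the helper name gt_to_dosage and the toy matrix are hypothetical, and the sketch assumes diploid, biallelic calls.

```r
# Hypothetical helper: recode a character matrix of diploid, biallelic GT
# calls ("0/0", "0/1", "1/1", ...) as counts of the alternate allele.
gt_to_dosage <- function(gt) {
  # Lookup table covering unphased ("/") and phased ("|") calls
  lookup <- c("0/0" = 0L, "0/1" = 1L, "1/0" = 1L, "1/1" = 2L,
              "0|0" = 0L, "0|1" = 1L, "1|0" = 1L, "1|1" = 2L)
  # Missing or unrecognised calls (NA, "./.") become NA
  values <- unname(lookup[as.vector(gt)])
  matrix(values, nrow = nrow(gt), dimnames = dimnames(gt))
}

# Toy input shaped like extract.gt() output: variants in rows, samples in columns
toy <- matrix(c("0/0", "1/1", "0/1", NA, "1/0", "0/0"), nrow = 2,
              dimnames = list(c("snp1", "snp2"), c("G1", "G2", "G3")))
toy_num <- gt_to_dosage(toy)
print(toy_num)
```

Under the same assumptions, gt_to_dosage(vcfMatrix) would recode the matrix extracted from the VCF file. Because that matrix has variants in rows, transpose the result with t() if genotypes should be in rows and SNPs in columns.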