#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3
1H 7253074 SCRI_RS_1929 A C . PASS . GT 1/1 1/1 0/0
12 Module 3.1: Genotype Data
12.1 Introduction
Genotype data describes the genetic makeup of an individual, in this case a crop, at specific loci across the genome. These data allow us to associate genetic differences with traits of agronomic interest and with regional information.
A genotype is the specific combination of alleles at a given locus. Depending on the ploidy of the crop, each individual carries two (diploids) or more (polyploids) alleles per locus.
SNPs (Single Nucleotide Polymorphisms) are positions in the genome where a single nucleotide differs between individuals.
- Example: Three different crop variants may be homozygous A/A, heterozygous A/G and homozygous G/G at a specific locus. Taking A as the reference allele, these genotypes can also be coded as 0, 1 and 2: 0 represents homozygous for the reference allele, 1 represents heterozygous, and 2 represents homozygous for the alternate allele.
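The 0/1/2 coding in the example is a count of alternate alleles per genotype. A minimal base R sketch of that idea follows; the sample names `S1` to `S3` and the choice of G as the alternate allele are assumptions made for illustration only.

```r
# Toy genotype calls for three samples at one SNP (hypothetical data)
calls <- c(S1 = "A/A", S2 = "A/G", S3 = "G/G")
alt_allele <- "G"  # assumption: A is the reference allele, G the alternate

# Split each call into its two alleles and count copies of the alternate allele
alleles <- strsplit(calls, "/", fixed = TRUE)
dosage <- vapply(alleles, function(a) sum(a == alt_allele), integer(1))

print(dosage)
```

Counting alleles rather than matching whole strings means A/G and G/A receive the same code.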
12.2 Formats
SNP data can be stored in different formats and file types, depending on the platform or program used. We will briefly discuss the most common file types.
- VCF (Variant Call Format, .vcf): Standard format for SNPs and variants from sequencing. Contains metadata, IDs, calls, positions and other information.
  - GT = Genotype
  - 0 = REF; 1 = ALT
  - 0/0 or 1/1 = homozygous; 0/1 or 1/0 = heterozygous
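The GT rules above can be sketched as a small base R function. This is an illustrative helper, not part of vcfR: the name `classify_gt` is hypothetical, and the sketch assumes diploid calls in which "/" (unphased) or "|" (phased) separates the two alleles and "." marks a missing allele.

```r
# Classify VCF GT strings for a diploid sample (hypothetical helper)
classify_gt <- function(gt) {
  # Split on either separator: "/" (unphased) or "|" (phased)
  alleles <- strsplit(gt, "[/|]")
  vapply(alleles, function(a) {
    # Anything that is not two called alleles is treated as missing
    if (length(a) != 2 || any(a == ".")) return(NA_character_)
    if (a[1] == a[2]) "homozygous" else "heterozygous"
  }, character(1))
}

gt <- c("0/0", "0/1", "1/0", "1/1", "./.")
classes <- classify_gt(gt)
print(data.frame(gt, class = classes))
```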
# Load the vcfR package, which provides read.vcfR() and extract.gt()
library(vcfR)
# Read vcf file
vcf <- read.vcfR("data/Barley.vcf", verbose = FALSE)
# Glimpse vcf
head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.1"
[1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
[1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read [Truncated]"
[1] "##FORMAT=<ID=DV,Number=.,Type=Integer,Description=\"Read depth of the [Truncated]"
[1] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
[1] "##INFO=<ID=MQ,Number=1,Type=Float,Description=\"RMS Mapping Quality\">"
[1] "First 6 rows."
[1]
[1] "***** Fixed section *****"
CHROM POS ID REF ALT QUAL FILTER
[1,] "1H" "144018" NA "A" "G" "999" "NA"
[2,] "1H" "147155" NA "T" "C" "999" "NA"
[3,] "1H" "166336" NA "C" "T" "999" "NA"
[4,] "1H" "173286" NA "T" "C" "999" "NA"
[5,] "1H" "253434" NA "C" "T" "999" "NA"
[6,] "1H" "253481" NA "C" "T" "999" "NA"
[1]
[1] "***** Genotype section *****"
FORMAT ICARDA_G0011 ICARDA_G0012 ICARDA_G0013 ICARDA_G0014
[1,] "GT:DP:DV" "0/0:33:0" "0/0:23:0" "0/0:18:0" "0/0:23:0"
[2,] "GT:DP:DV" "0/0:8:0" "0/0:8:0" "0/0:10:0" "0/0:6:0"
[3,] "GT:DP:DV" "0/0:16:0" "0/0:9:0" "0/0:9:0" "0/0:5:0"
[4,] "GT:DP:DV" "0/0:18:0" "0/0:18:0" "0/0:14:0" "0/0:10:0"
[5,] "GT:DP:DV" "1/1:12:12" "1/1:19:19" "1/1:12:11" "1/1:10:10"
[6,] "GT:DP:DV" "0/0:12:0" "0/0:17:0" "0/0:12:0" "0/0:10:0"
ICARDA_G0015
[1,] "0/0:15:0"
[2,] "0/0:8:0"
[3,] "0/0:9:0"
[4,] "0/0:10:0"
[5,] "1/1:12:12"
[6,] "0/0:11:0"
[1] "First 6 columns only."
[1]
[1] "Unique GT formats:"
[1] "GT:DP:DV"
[1]
# Turn the genotype (GT) calls into a matrix (markers in rows, samples in columns)
vcfMatrix <- extract.gt(vcf)
- PLINK (.ped, .map) or Binary PLINK (.bed, .bim, .fam)
  - .ped: Pedigree/genotype data (text, space or tab delimited)
  - .map: SNP mapping information
  - .bed: Binary genotype matrix
  - .bim: SNP information
  - .fam: Sample information
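To make the text PLINK layout concrete, here is a minimal base R sketch that parses toy .ped and .map contents held in strings. All sample and SNP names are hypothetical. The sketch assumes the standard layout: six leading sample columns in .ped followed by two allele columns per SNP, and four columns in .map.

```r
# Toy .ped and .map contents (hypothetical), held in strings instead of files
ped_text <- "FAM1 G1 0 0 0 -9 A A C C
FAM1 G2 0 0 0 -9 A G C T"
map_text <- "1 snp1 0 1000
1 snp2 0 2000"

# .map: chromosome, SNP ID, genetic distance, base-pair position
map <- read.table(text = map_text,
                  col.names = c("chr", "snp", "cm", "bp"),
                  stringsAsFactors = FALSE)

# .ped: read everything as character so the allele "T" is not parsed as TRUE
ped <- read.table(text = ped_text, colClasses = "character")
allele_cols <- as.matrix(ped[, -(1:6)])

# Paste each pair of allele columns into one genotype per SNP
first  <- allele_cols[, seq(1, ncol(allele_cols), by = 2), drop = FALSE]
second <- allele_cols[, seq(2, ncol(allele_cols), by = 2), drop = FALSE]
geno <- matrix(paste(first, second, sep = "/"), nrow = nrow(ped),
               dimnames = list(ped[[2]], map$snp))
print(geno)
```

For real data sets, dedicated tools (PLINK itself or an R package that reads its binary files) are a better choice than hand parsing.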
- HapMap (.hmp.txt): Used in TASSEL; header includes metadata, positions and genotypes encoded as allele pairs (A/A, A/G, etc.).
- Numeric Matrix (.csv, .txt): SNPs in columns and genotypes in rows (or vice versa); data encoded as 0, 1 and 2 for homozygous for the reference allele, heterozygous, and homozygous for the alternate allele, respectively.
# Load SNP data matrix
matrix <- read.table("data/BarleyMatrix.txt", sep = "\t", header = TRUE,
                     row.names = 1, check.names = FALSE)
# The vcf matrix we obtained can also be turned into this type of format
matrixNum <- vcfToNumericMatrix(vcfMatrix)
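The conversion from GT strings to the 0/1/2 coding can be sketched in base R as follows. This is not the `vcfToNumericMatrix()` function used above, whose implementation is not shown here. The helper `gt_to_dosage` and the toy matrix are hypothetical, and the sketch assumes biallelic diploid calls with markers in rows and samples in columns.

```r
# Toy GT matrix: markers in rows, samples in columns (hypothetical data)
gt <- matrix(c("0/0", "0/1", "1/1",
               "1/1", NA,    "0/0"),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("snp1", "snp2"), c("G1", "G2", "G3")))

# Hypothetical helper: count ALT alleles (0, 1 or 2) in biallelic diploid calls
gt_to_dosage <- function(gt) {
  lookup <- c("0/0" = 0L, "0/1" = 1L, "1/0" = 1L, "1/1" = 2L,
              "0|0" = 0L, "0|1" = 1L, "1|0" = 1L, "1|1" = 2L)
  # Calls not present in the lookup table (including NA) become NA
  matrix(unname(lookup[as.vector(gt)]), nrow = nrow(gt),
         dimnames = dimnames(gt))
}

dosage <- gt_to_dosage(gt)
# Transpose so that samples are in rows and SNPs in columns
dosage_by_sample <- t(dosage)
print(dosage_by_sample)
```

A lookup table keeps the mapping explicit and returns NA for anything unexpected, such as multiallelic calls like 1/2, rather than silently miscoding it.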