# Load SNP data matrix
<- read.table("data/BarleyMatrix.txt", sep = "\t", header = TRUE,
matrix row.names = 1, check.names = FALSE)
13 Module 3.2: Data Quality Control and Filtering for SNP Data
Filtering SNP data is important for genetic and genomic studies in order to improve data quality, optimize resources and avoid noise by removing non-informative markers, and account for missing data. The most common filtering criteria we will focus on is call rate, missing data and MAF (Minor Allele Frequency).
13.1 Call Rate
Call rate refers to the percentage of non-missing data for a specific SNP marker. It is calculated by dividing the number of non-missing individuals / total number of individuals for each marker. The filtering threshold will depend on the specific data set or investigation; however, common thresholds tend to be 0.9 or 0.95.]
# Considering our n x m matrix with n markers and m individuals
# Defining our threshold
<- 0.9
call_rate
# Filtering our matrix
<- matrix[which(rowMeans(!is.na(matrix)) > call_rate),] filtered_matrix
13.2 Missing data
Just as we can filter markers with too much missing data, we can filter individuals with too many missing values. Our missing data threshold will refer to the percentage of non-missing data for each individual.
# Defining our threshold
<- 0.8
na_ind
# Filtering our matrix
<- filtered_matrix[,which(colMeans(!is.na(filtered_matrix))
filtered_matrix > na_ind)]
13.3 MAF
Minor allele frequency refers to the frequency of the least common allele for a particular SNP marker in a given population. It is commonly used as a filtering criteria as it allows you to exclude markers that contribute little to population-level analyses and helps reduce noise.
For example, if I have a marker which is homozygous 0 (AA
), heterozygous 1 (AG
), and homozygous 2 (GG
), each individual contributes 2 alleles. We then count the minor alleles across all individuals to obtain MAF.
# Calculating MAF for all markers
<- apply(filtered_matrix, 1, function(row) {
mafFreq <- row[!is.na(row)]
row <- sum(row) / (2 * length(row))
maf <- min(maf, 1 - maf)
maf
maf
})
# Filtering matrix according to maf
<- filtered_matrix[mafFreq > 0.01,] filtered_matrix
13.4 General Filtering
To simplify all previous steps, we can use the filterData()
function from our package. The function allows you to define thresholds for call rate, MAF and missing individuals. There is an additional parameter called stats
, it is set to FALSE by default but by setting it to TRUE you can get a data frame with statistics for the number of filtered markers and individuals by criteria.
<- filterData(matrix, call_rate = 0.9, maf = 0.01, na_ind = 0.8) filter
13.5 Visualization of marker information
Once we have filtered our matrix, we can visualize marker information. We first need to import our marker information.
# Importing our marker information
<- read.table("data/BarleyMarkers.tsv", sep = "\t", header = TRUE,
markers check.names = FALSE)
We can then generate our plots using the markerPlots()
function from our ICARDA package. This function requires the markers and genotype matrix data. It allows for a third parameter called chrom
, where you can define the specific chromosome you wish to plot your data for.
# Generating our plots
# (our SNP marker matrix needs to be transposed for this function)
# Individuals as rows and markers as columns
<- markerPlots(markers, t(filter)) markerplots
Warning in guide_legend(ncol = 8, bycol = TRUE): Arguments in `...` must be used.
✖ Problematic argument:
• bycol = TRUE
ℹ Did you misspell an argument name?
# Marker plot
$markerPlot markerplots
# MAF plot
$mafPlot markerplots
# Missing Data
$missingData markerplots