13  Module 3.2: Data Quality Control and Filtering for SNP Data

Filtering SNP data is important for genetic and genomic studies in order to improve data quality, optimize resources and avoid noise by removing non-informative markers, and account for missing data. The most common filtering criteria we will focus on is call rate, missing data and MAF (Minor Allele Frequency).

13.1 Call Rate

Call rate refers to the percentage of non-missing data for a specific SNP marker. It is calculated by dividing the number of non-missing individuals / total number of individuals for each marker. The filtering threshold will depend on the specific data set or investigation; however, common thresholds tend to be 0.9 or 0.95.]

# Load SNP data matrix
matrix <- read.table("data/BarleyMatrix.txt", sep = "\t", header = TRUE, 
                     row.names = 1, check.names = FALSE)
# Considering our n x m matrix with n markers and m individuals
# Defining our threshold
call_rate <- 0.9

# Filtering our matrix
filtered_matrix <- matrix[which(rowMeans(!is.na(matrix)) > call_rate),]

13.2 Missing data

Just as we can filter markers with too much missing data, we can filter individuals with too many missing values. Our missing data threshold will refer to the percentage of non-missing data for each individual.

# Defining our threshold
na_ind <- 0.8

# Filtering our matrix
filtered_matrix <- filtered_matrix[,which(colMeans(!is.na(filtered_matrix)) 
                                          > na_ind)]

13.3 MAF

Minor allele frequency refers to the frequency of the least common allele for a particular SNP marker in a given population. It is commonly used as a filtering criteria as it allows you to exclude markers that contribute little to population-level analyses and helps reduce noise.

For example, if I have a marker which is homozygous 0 (AA), heterozygous 1 (AG), and homozygous 2 (GG), each individual contributes 2 alleles. We then count the minor alleles across all individuals to obtain MAF.

# Calculating MAF for all markers
mafFreq <- apply(filtered_matrix, 1, function(row) {
  row <- row[!is.na(row)]
  maf <- sum(row) / (2 * length(row))
  maf <- min(maf, 1 - maf)
  maf
})

# Filtering matrix according to maf
filtered_matrix <- filtered_matrix[mafFreq > 0.01,]

13.4 General Filtering

To simplify all previous steps, we can use the filterData() function from our package. The function allows you to define thresholds for call rate, MAF and missing individuals. There is an additional parameter called stats, it is set to FALSE by default but by setting it to TRUE you can get a data frame with statistics for the number of filtered markers and individuals by criteria.

filter <- filterData(matrix, call_rate = 0.9, maf = 0.01, na_ind = 0.8)

13.5 Visualization of marker information

Once we have filtered our matrix, we can visualize marker information. We first need to import our marker information.

# Importing our marker information
markers <- read.table("data/BarleyMarkers.tsv", sep = "\t", header = TRUE, 
                     check.names = FALSE)

We can then generate our plots using the markerPlots() function from our ICARDA package. This function requires the markers and genotype matrix data. It allows for a third parameter called chrom, where you can define the specific chromosome you wish to plot your data for.

# Generating our plots
# (our SNP marker matrix needs to be transposed for this function)
# Individuals as rows and markers as columns
markerplots <- markerPlots(markers, t(filter))
Warning in guide_legend(ncol = 8, bycol = TRUE): Arguments in `...` must be used.
✖ Problematic argument:
• bycol = TRUE
ℹ Did you misspell an argument name?
# Marker plot
markerplots$markerPlot
# MAF plot
markerplots$mafPlot
# Missing Data
markerplots$missingData