Investigating male X heterozygosity in short read sequencing

26 Jun 2024 — Barış Salman

Links

1. Introduction
- 1.1. Calling variants on the X chromosome in males
2. Materials and Methods
3. Results
4. Discussion
5. References
6. Acronyms
7. Glossary
Comments

1. Introduction

During exome sequencing analyses, we encounter heterozygous variants on the X chromosome in male samples. This is not an accurate representation of the biology: males carry only one X and one Y, so they are hemizygous for both sex chromosomes. The exception is the pseudoautosomal region (PAR) at the beginning and the end of chromosomes X and Y, which together span around 3 Mb. This region is homologous between the two chromosomes, so males carry two copies of it and can be "homozygous" there. This homology makes it possible for the chromosomes to pair during meiosis, and the PAR is the only place where recombination can occur between X and Y (Raudsepp et al. 2012). However, the variants we are seeing are located outside of the PAR and are spread across chromosome X. Other groups have come across this as well and have used it to check sample sex as a form of quality control (Do et al. 2015). There are a lot of questions about this around the web with no satisfying answers.

My question was: where do these variants originate? We will make an empirical observation by looking at data from the 1000 Genomes Project.

1.1. Calling variants on the X chromosome in males

Variant callers assume a diploid genome when calling variants. Males have a haploid X chromosome, meaning there is only one copy, which doesn't fit the worldview of the variant callers. What makes it even harder are the highly repetitive regions and the regions that show a high degree of homology to other loci on the X chromosome. HaplotypeCaller has a parameter to set the ploidy, but since the data we are dealing with doesn't look like it came from a haploid chromosome, fitting it into hemizygous genotype calls might be a challenge.
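To make the ploidy problem concrete, here is a minimal sketch of a genotype likelihood under a diploid versus a haploid model. This is a simplified binomial toy model and not HaplotypeCaller's actual algorithm; the function names and the 1% error rate are assumptions made for illustration only.

```python
import math


def genotype_log_likelihoods(ref_reads, alt_reads, ploidy, error_rate=0.01):
    """Log-likelihood of each possible ALT copy number at a biallelic site.

    Toy binomial model: every read is an independent draw from the
    chromosomes present, with a fixed per-read error rate (assumed value).
    """
    likelihoods = {}
    for alt_copies in range(ploidy + 1):
        alt_fraction = alt_copies / ploidy
        # Probability that a single read shows the ALT allele.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihoods[alt_copies] = (
            alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)
        )
    return likelihoods


def best_genotype(ref_reads, alt_reads, ploidy):
    """Return the ALT copy number with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(ref_reads, alt_reads, ploidy)
    return max(likelihoods, key=likelihoods.get)


# A male non-PAR site where 12 of 30 reads carry the ALT allele,
# for example because reads from a homologous locus were mapped here.
diploid_call = best_genotype(ref_reads=18, alt_reads=12, ploidy=2)
haploid_call = best_genotype(ref_reads=18, alt_reads=12, ploidy=1)
```

With mixed read support like this, the diploid model settles on one ALT copy (a heterozygous call), while the haploid model has to force the same reads into either REF or ALT. Neither describes the data well, which is the challenge mentioned above.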

2. Materials and Methods

These are whole-genome Variant Call Format (VCF) files and I can't download them all. Luckily, bcftools can query specific regions of remote files over the network, which makes it both space- and speed-efficient. The region we are going to query is the X chromosome without the PAR. We are excluding the PAR since it's a diploid region. We are going to further filter the variants to keep only heterozygous ones, and since this all started from Exome Sequencing (ES), we are just going to look at the exonic regions. To make things simpler we are going to look only at SNVs, since INDELs can be challenging to genotype. We also want enough depth to decide the genotype, and we don't want anything that fails the filters.
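The filters described above can be sketched as a single predicate over one data line of a single-sample VCF. This is only an illustration in plain Python, not the bcftools commands used in the workflow. It assumes GRCh38 coordinates for the PAR, a hypothetical minimum depth of 10, and a strict FILTER value of PASS; the exonic restriction is left out because it needs a separate interval file.

```python
# GRCh38 chrX pseudoautosomal regions, 1-based inclusive (assumed build).
PAR_CHRX_GRCH38 = [(10_001, 2_781_479), (155_701_383, 156_030_895)]
MIN_DEPTH = 10  # hypothetical threshold, not taken from the post
BASES = {"A", "C", "G", "T"}


def in_par(pos):
    """True if a chrX position falls inside PAR1 or PAR2."""
    return any(start <= pos <= end for start, end in PAR_CHRX_GRCH38)


def is_het_snv_candidate(vcf_line, min_depth=MIN_DEPTH):
    """Check one single-sample VCF data line against the filters."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    ref, alt, site_filter = fields[3], fields[4], fields[6]

    # Non-PAR part of chromosome X only.
    if chrom not in ("chrX", "X") or in_par(pos):
        return False
    # Keep only records that passed the site filters.
    if site_filter != "PASS":
        return False
    # Biallelic SNVs only: one base in REF, one base in ALT.
    if ref not in BASES or alt not in BASES:
        return False

    # Pair FORMAT keys with the sample's values, e.g. GT:DP -> 0/1:30.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    alleles = sample.get("GT", "").replace("|", "/").split("/")
    # Heterozygous: two called alleles that differ.
    if len(alleles) != 2 or "." in alleles or alleles[0] == alleles[1]:
        return False

    depth = sample.get("DP", "")
    return depth.isdigit() and int(depth) >= min_depth


example = "chrX\t5000000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30"
keep = is_het_snv_candidate(example)
```

Raw caller output often has `.` in the FILTER column, so the strict PASS check here may need loosening depending on how the files were produced.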

We can then merge those files into one, count the number of heterozygous genotypes we see at each position, and look at them in IGV. Later we can group them by gene and see which genes have the most heterozygous variants.
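The counting step can be sketched as follows. This is a rough illustration in plain Python and not the merge used in the workflow; the sample IDs, positions, and gene labels are made-up placeholders, and the site-to-gene lookup is assumed to come from some annotation step that is not shown.

```python
from collections import Counter, defaultdict


def count_het_per_site(het_sites_by_sample):
    """Count how many samples are heterozygous at each site."""
    counts = Counter()
    for sites in het_sites_by_sample.values():
        # set() ensures a sample contributes at most once per site.
        counts.update(set(sites))
    return counts


def count_het_per_gene(site_counts, gene_by_site):
    """Sum per-site heterozygous counts within each gene."""
    per_gene = defaultdict(int)
    for site, n_samples in site_counts.items():
        gene = gene_by_site.get(site)
        if gene is not None:  # skip sites without an annotation
            per_gene[gene] += n_samples
    return dict(per_gene)


# Placeholder data: (chrom, pos, ref, alt) tuples per hypothetical sample.
het_sites_by_sample = {
    "sample_1": [("chrX", 5_000_000, "A", "G"), ("chrX", 7_500_000, "C", "T")],
    "sample_2": [("chrX", 5_000_000, "A", "G")],
    "sample_3": [("chrX", 5_000_000, "A", "G"), ("chrX", 7_500_100, "G", "A")],
}
gene_by_site = {
    ("chrX", 5_000_000, "A", "G"): "GENE_A",
    ("chrX", 7_500_000, "C", "T"): "GENE_B",
    ("chrX", 7_500_100, "G", "A"): "GENE_B",
}

site_counts = count_het_per_site(het_sites_by_sample)
gene_counts = count_het_per_gene(site_counts, gene_by_site)
```

Sorting `site_counts.most_common()` gives the positions worth opening in IGV first.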

We are going to implement this whole workflow in Nextflow.

2.1. VCF files from the 1000 Genomes Project

For this job we are going to use the individual variant call files from the project, since the multi-sample VCFs are phased. Among the individual variant calls there is a set from the New York Genome Center (NYGC) study from 2019. It comes in two versions: one directory named raw_calls_old and another named raw_calls_updated.

The updated directory has a note related to this study that says:

Chromosome X: male samples were processed through HaplotypeCaller with “–ploidy 1” parameter applied to non-PAR regions, and “–ploidy 2” parameter applied to PAR1 and PAR2 regions, whereas female samples were processed using “–ploidy 2” parameter applied to the entire chrX (see table below for the exact coordinates and corresponding ploidy settings).

Since we want to see the genotypes called as diploid, we will be looking at the calls from the old directory. The other difference in the old directory is that the samples are located under the major population directories.
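The ploidy rule quoted from the updated directory can be written down as a small lookup. This is a sketch only: the exact coordinates from the README's table are not reproduced here, so the GRCh38 PAR boundaries below are an assumption, and the function name is hypothetical.

```python
# GRCh38 chrX PAR boundaries, 1-based inclusive (assumed, see lead-in).
PAR1_CHRX = (10_001, 2_781_479)
PAR2_CHRX = (155_701_383, 156_030_895)


def expected_chrx_ploidy(sex, pos):
    """Ploidy the quoted note describes for a chrX position.

    Females are diploid across chrX; males are diploid only in PAR1
    and PAR2 and haploid everywhere else.
    """
    if sex == "female":
        return 2
    if sex == "male":
        in_par = any(start <= pos <= end for start, end in (PAR1_CHRX, PAR2_CHRX))
        return 2 if in_par else 1
    raise ValueError(f"unknown sex: {sex!r}")
```

The old calls treat every row of this lookup as ploidy 2, which is why heterozygous male genotypes outside the PAR are still visible in them.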

2.2. Preamble

I had these as variables while writing the workflow, but looking at it now they don't need to be variables. I am not gonna change them either, since they do make what's being done a bit more transparent.