Polygenic Risk Score Calculation

08 August 2023 — Barış Salman

1. Intro

$$
\chemname{\chemfig{[:150]@{C2}N*6(-(-N([:0]-@{C1}H)([:120]-H))=N-=-(=@{C3}O)-)}}{Cytosine} \qquad %
\chemname{\chemfig{N([:180]-@{G2}H)*6(-(-N([:180]-@{G3}H)([:-60]-H))=N-*5(-N-=N-)=-(=@{G1}O)-[,,,2])}}{Guanine}
$$

$$
\chemname{\chemfig{[:-30]*6(-N=-@{A2}N=(-N([:0]-@{A1}H)([:120]-H))-*5(-N-=N-)=)}}{Adenine} \qquad %
\chemname{\chemfig{N([:180]-@{T2}H)*6(-(=O)-N-=(-CH_3)-(=@{T1}O)-)}}{Thymine}
$$

These 4 nucleotides in sequence make up our DNA.

Our observable traits, like height and hair or eye color, as well as our diseases, manifest according to our biological heritage. For all humans, this heritage comes in the form of a DNA sequence roughly 3.25 billion bases long from each parent, chunked and packaged as chromosomes. What makes this heritage unique is the collection of differences in these bases. Every person carries a different set of these variations, made unique by the shuffling of DNA segments in germ cells. These variations can be large or small: the addition or deletion of bases, or the change of one base into another. At some locations in the genome, multiple versions of these changes are present in the population.

Some of these variations have a high enough impact to project their effect on the phenotype by themselves. Others affect the phenotype in concert, showing their effect in an additive way. We model this additive effect of multiple variations on a given phenotype with a Polygenic Risk Score (PRS). This post aims to give a basic understanding of PRS and some concepts related to it in the literature.

1.1. Polygenic

Polygenic means multiple genes are involved in the emergence of a given phenotype. "Gene" is a fuzzy concept with evolving definitions and boundaries, used as a catch-all phrase. (Cipriano and Ballarino 2018) In our case, what we are interested in are loci in the genome that carry variations affecting the same phenotype. Such a variation can sit in a gene and show its effect through its impact on the protein, like a change from one amino acid to another. However, regions outside of genes also play a role in biological processes and in the emergence of the phenotype, as we will see in association studies. And since association studies show correlation, not causation, the locus where a variant is found is not necessarily where it acts: rs1421085, located in an intron of FTO, disrupts a binding site of the repressor ARID5B and exerts its effect through genes hundreds of kilobases away. (Claussnitzer et al. 2015)

You can create this ideogram at: https://www.ncbi.nlm.nih.gov/genome/tools/gdp

With the following intervals:

  98956422 98956423
chr5 21538259 21538260
chr6 50805979 50805980
chr7 99345973 99345974
chr8 95138636 95138637
chr9 68394717 68394718
chr16 30338345 30338346
chr17 33257441 33257442
chr19 38617616 38617617
chr20 34444167 34444168
chr22 30818468 30818469

Figure 1: Variations are marked with pink circles.

1.2. Risk

Oxford English Dictionary defines risk as:

the possibility of something bad happening at some time in the future.

In statistics, there are two types of risk. (“Understanding Statistics: Risk BMJ Best Practice” n.d.)

  • Absolute Risk (AR): the ratio of events to the total number of individuals in a given group.
  • Relative Risk (RR): the ratio of the event rate in the case group to the event rate in the control group.

Here the term risk refers to relative risk, since we are comparing the case group with the control group. (“Polygenic Risk Scores” n.d.) RR should be evaluated together with absolute risk on an individual basis. (Andrade 2015)
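As a quick illustration with made-up numbers (hypothetical counts, not from any study): if 30 of 100 people in the case group and 10 of 100 people in the control group experience the event,

ar_case <- 30 / 100         # absolute risk in the case group: 0.3
ar_control <- 10 / 100      # absolute risk in the control group: 0.1
ar_case / ar_control        # relative risk: 3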

1.3. Score

We can look at two different equations that calculate PRS. Both compute $PRS_j$ for individual $j$ by summing, over SNPs, the product of the effect size (denoted $\beta_i$ in Collister et al. and $S_i$ in the PLINK formula) and the dosage of the effect allele (denoted $dosage_{ij}$ in Collister et al. and $G_{ij}$ in the PLINK formula). The difference is that the PLINK formula has $P \times M_j$ in the denominator, where $P$ is the ploidy and $M_j$ is the number of non-missing Single Nucleotide Polymorphisms (SNPs). If a sample is missing SNPs, its plain sum would be lower than that of samples with full genotypes; dividing by the ploidy and the number of non-missing SNPs makes the scores comparable between individuals with missing genotypes. (Collister, Liu, and Clifton 2022) With enough samples, these scores approach a normal distribution.

\begin{equation*}
PRS_j = \sum_{i}^{N} \beta_i \times dosage_{ij} \tag{1}
\end{equation*}

(Collister, Liu, and Clifton 2022)

\begin{equation*}
PRS_j = \frac{\sum_{i}^{N} S_i \times G_{ij}}{P \times M_j} \tag{2}
\end{equation*}

(Choi, Mak, and O’Reilly 2020)
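To make the difference concrete, here is a minimal sketch with hypothetical effect sizes and dosages, showing how the PLINK-style denominator keeps scores comparable when a genotype is missing:

effect <- c(0.2, 0.5, 1.0)  # hypothetical effect sizes for 3 SNPs
dosage <- c(1, 2, NA)       # one diploid sample; the third genotype is missing

sum(effect * dosage, na.rm = TRUE)            # equation 1: missing SNPs silently shrink the sum

m <- sum(!is.na(dosage))                      # M_j, the number of non-missing SNPs
sum(effect * dosage, na.rm = TRUE) / (2 * m)  # equation 2 with ploidy P = 2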

2. Calculating Polygenic Risk Scores

We are going to mock Genome-Wide Association Study (GWAS) results and build a mock study of our own on top of them, using only the summary statistics from the GWAS. The idea here is that most of the time we don't have access to the individual genotypes behind a GWAS; instead, we take the most significant variants and investigate their effect in our own cohort.

2.1. Genome-wide Association Studies

A great resource for viewing studies and their results is the GWAS catalog. A more hands-on example can be found at Hail’s site.

GWAS perform statistical tests on thousands of genomic loci in thousands of genomes to see whether any of those regions are associated with a given trait. (Uffelmann et al. 2021) As a result, we get a p-value for how significant each association is, together with the effect size of the variant.
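As a toy illustration of a single such test (the counts below are made up), the allele counts at one locus can be tested for association with, for example, Fisher's exact test; a GWAS in essence repeats a test like this (usually a regression with covariates) at every locus:

counts <- matrix(c(300, 200, 250, 300), nrow = 2,
                 dimnames = list(allele = c("B", "A"),
                                 group = c("Case", "Control")))
fisher.test(counts)  # returns an odds ratio estimate and a p-value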

Let’s say we have found <<snp_number>> SNPs significant in a study with <<gwas_case_number>> cases and <<gwas_control_number>> controls. First, let’s go over the SNPs.

snp_number <- <<snp_number>>

gwas <- data.frame(
  SNP_id = paste("SNP", seq(snp_number), sep="")
)
gwas[c(
  "non_effect_allele",
  "effect_allele"
)] <- t(replicate(snp_number,
                  sample(c("A", "T", "C", "G"), 2)))

print(rbind(head(gwas, 3), tail(gwas, 3)))
SNP_id  non_effect_allele  effect_allele
SNP1    C                  A
SNP2    T                  C
SNP3    G                  C
SNP8    G                  C
SNP9    A                  T
SNP10   T                  G

I selected the alleles randomly. However, we don't actually need to know the nucleotides of the non-effect and effect alleles, which is why they will simply be named A (or 0) for the non-effect allele and B (or 1) for the effect allele.

gwas$non_effect_allele = "A"
gwas$effect_allele = "B"
print(rbind(head(gwas, 3), tail(gwas, 3)))

SNP_id  non_effect_allele  effect_allele
SNP1    A                  B
SNP2    A                  B
SNP3    A                  B
SNP8    A                  B
SNP9    A                  B
SNP10   A                  B

Another thing we need from the GWAS summary statistics is the effect size of each SNP. We are going to use the log of the odds ratio (OR) as the effect size.

I want to set it up so that each successive SNP is observed <<increment_percent>> * 100 % more often in cases. The lines with step_case and obs_case just generalize this instead of hard-coding values, so it applies to any number of SNPs and any sample size.

gwas_control_number = <<gwas_control_number>>
gwas_case_number = <<gwas_case_number>>
increment_percent = <<increment_percent>>
gwas_sample_number = gwas_control_number + gwas_case_number

step_case = round(gwas_case_number * increment_percent)
gwas$obs_case = seq(gwas_case_number / 2 + step_case,
                    gwas_case_number / 2 + step_case * nrow(gwas),
                    step_case)

step_control = round(gwas_control_number * increment_percent)
gwas$obs_control = seq(gwas_control_number / 2 - step_control,
                       gwas_control_number / 2 - step_control * nrow(gwas),
                       -step_control)

# log odds ratio: odds of the effect allele in cases over its odds in controls
gwas$log_OR <- signif(log(
  (gwas$obs_case / (gwas_case_number - gwas$obs_case)) /
  (gwas$obs_control / (gwas_control_number - gwas$obs_control))
), 2)

print(rbind(head(gwas, 3), tail(gwas, 3)))

SNP_id  non_effect_allele  effect_allele  obs_case  obs_control  log_OR
SNP1    A                  B              26000     24000        0.16
SNP2    A                  B              27000     23000        0.32
SNP3    A                  B              28000     22000        0.48
SNP8    A                  B              33000     17000        1.3
SNP9    A                  B              34000     16000        1.5
SNP10   A                  B              35000     15000        1.7

gwas$log_OR[1]
gwas$log_OR[nrow(gwas)]

We get an effect size of <<gwas_first_OR>> for the first SNP and <<gwas_last_OR>> for the last one. A BB genotype would have a score contribution of $2 \times$ <<gwas_first_OR>> $=$ <<gwas_first_OR>> * 2 at the first SNP and $2 \times$ <<gwas_last_OR>> $=$ <<gwas_last_OR>> * 2 at the last.
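Equivalently, in R, using the effect sizes computed above:

2 * gwas$log_OR[1]           # BB score contribution at the first SNP
2 * gwas$log_OR[nrow(gwas)]  # BB score contribution at the last SNP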

2.2. Toy dataset

We are screening our own population for a phenotype. We aim to determine risk scores using the GWAS summary statistics and to see whether we can correlate PRS with the phenotype. For our study, we have genotyped the same <<snp_number>> SNPs in <<sample_number>> people.

The set.seed() call on the first line ensures that the get_genotype function draws the same random values every time.

set.seed(42)

sample_number <- <<sample_number>>
snp_number <- <<snp_number>>
samples <- data.frame(
  sample_id = paste("Sample", seq(sample_number), sep=""),
  status = c(
    rep("Control", sample_number / 2),
    rep("Case", sample_number / 2))
)

get_genotype <- function(prob=c(.5, .5)) {
  paste(sample(c("A", "B"), 2, replace=T, prob = prob), collapse="")
}

control_prob <- <<control_prob>>
case_prob <- <<case_prob>>
samples[gwas$SNP_id] <- rbind(
  matrix(
    replicate(
      snp_number * sample_number / 2,
      get_genotype(control_prob)),
    nrow=sample_number / 2),
  matrix(
    replicate(
      snp_number * sample_number / 2,
      get_genotype(case_prob)),
    nrow=sample_number / 2)
)
print(rbind(head(samples, 3), tail(samples, 3)))
sample_id   status   SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
Sample1     Control  BB   BA   BB   AA   AA   AB   BA   BA   AA   AB
Sample2     Control  AB   BB   AB   AA   AA   BA   AB   BA   AB   AB
Sample3     Control  BA   BB   AA   AA   AB   BB   AB   BA   AA   AB
Sample1998  Case     BB   AB   BA   AA   BB   BB   BA   BB   BA   AA
Sample1999  Case     BB   BB   AB   AB   AA   AA   BB   BA   BA   AB
Sample2000  Case     BA   BB   AB   BB   AB   AB   AA   AA   AA   AB

We draw random genotypes: controls get alleles A and B with probabilities <<control_prob>>, and cases with probabilities <<case_prob>>. This way, cases are more prone to carry the effect allele.
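As a quick sanity check of the simulation (a sketch; count_b is a small helper defined here, not part of the study code), the mean effect-allele dosage should come out higher in cases:

count_b <- function(g) (g == "AB") + (g == "BA") + 2 * (g == "BB")
genotype_matrix <- sapply(samples[gwas$SNP_id], count_b)  # samples x SNPs dosage matrix
tapply(rowMeans(genotype_matrix), samples$status, mean)   # cases should exceed controls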

2.3. Calculate the PRS

We will be using the simpler of the two equations, mainly because we don't have any missing genotypes, and secondly because, well, it's simpler.

We implement the formula with an apply call that multiplies each sample's dosages by the effect sizes; summing each resulting column then gives us each sample's score.

We have both AB and BA genotypes in the dosage list because our get_genotype function returns both orders. Since there are no haplotypes or variant phasing here, AB and BA are the same.

dosage <- list("AA" = 0, "AB" = 1, "BA" = 1, "BB" = 2)

samples$PRS <- colSums(as.data.frame(
  apply(
    samples[gwas$SNP_id],
    1,
    function(x) as.numeric(dosage[x]) * gwas$log_OR)))

print(rbind(head(samples, 3), tail(samples, 3)))
sample_id   status   SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10  PRS
Sample1     Control  BB   BA   BB   AA   AA   AB   BA   BA   AA   AB     6.78
Sample2     Control  AB   BB   AB   AA   AA   BA   AB   BA   AB   AB     7.96
Sample3     Control  BA   BB   AA   AA   AB   BB   AB   BA   AA   AB     7.77
Sample1998  Case     BB   AB   BA   AA   BB   BB   BA   BB   BA   AA     10
Sample1999  Case     BB   BB   AB   AB   AA   AA   BB   BA   BA   AB     8.99
Sample2000  Case     BA   BB   AB   BB   AB   AB   AA   AA   AA   AB     6.07
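As a sanity check, Sample1's score can be reproduced by hand from its genotypes (BB BA BB AA AA AB BA BA AA AB) and the effect sizes computed earlier:

d <- c(2, 1, 2, 0, 0, 1, 1, 1, 0, 1)  # Sample1's dosages, in SNP order
sum(d * gwas$log_OR)                  # 6.78, matching samples$PRS[1]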

2.4. Plotting the normal curves

What we typically see in papers and around the web is a plot of two normal distributions comparing cases to controls. To create the plot we will need ggplot2.

library(ggplot2)

We shift the scores by subtracting the control mean, so that the controls center around 0.

We then take the means and standard deviations of both groups to plot the curves with stat_function.

samples$status <- as.factor(with(samples, reorder(status, PRS)))
samples$PRS <- samples$PRS - mean(samples[which(samples$status=="Control"), "PRS"])
mean_control <- mean(samples[which(samples$status=="Control"), "PRS"])
sd_control <- sd(samples[which(samples$status=="Control"), "PRS"])

mean_case <- mean(samples[which(samples$status=="Case"), "PRS"])
sd_case <- sd(samples[which(samples$status=="Case"), "PRS"])

ggplot(samples, aes(x=PRS, color=status, fill=status)) +
  geom_histogram(aes(y=..density..), position="dodge", alpha=.5, bins=30) +
  stat_function(fun = dnorm,
                color="darkred",
                args = list(mean = mean_control, sd = sd_control)) +
  stat_function(fun = dnorm,
                color="darkblue",
                args = list(mean = mean_case, sd = sd_case)) +
  ggtitle(sprintf(
    "PRS distributions Controls μ %.2f σ %.2f Cases μ %.2f σ %.2f",
    mean_control, sd_control, mean_case, sd_case))

Figure: PRS distributions of controls and cases (PRS.svg).

When you visit www.pgscatalog.org or do an image search for “polygenic risk score”, you will see plots of these two peaks. Cases carry more effect alleles with higher totals on average, which is why their curve is shifted slightly to the right. The case curve also has a larger standard deviation because the effect sizes increase from SNP to SNP.

2.5. Logistic regression

Logistic regression is a statistical model that fits an S-shaped curve to the data based on probabilities. The details of the subject are out of the scope of this post; I would suggest StatQuest’s logistic regression playlist to anyone interested.

Logistic regressions are commonly used for prediction. One thing I would note in the context of PRS calculations, however, is that unlike this common use, logistic regression is not used for prediction in genetic studies, because the scores and the risk calculated here are relative. The effect becomes tangible only in the tails of the curves, but as the number of cases and controls drops there, the significance of the effect drops as well.

  • Because most of the totals are the same, the points overlap and only a few would be visible. position_jitter with w=0.5 spreads the points around a bit, allowing us to plot all of them.
ggplot(samples, aes(x=PRS, y=as.numeric(samples$status) - 1, color=status)) +
  geom_point(shape = "|", position = position_jitter(w = 0.5, h = 0)) +
  geom_hline(yintercept = c(0,1), linetype = "dashed", color = "grey") +
  scale_y_discrete(name ="Status", labels=c("Control","Case"), limits=c(0,1))

Figure: case/control status plotted against PRS (PRS2.svg).

model <- glm(status~PRS, family="binomial", data=samples)
summary(model)

Call:
glm(formula = status ~ PRS, family = "binomial", data = samples)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.30654    0.07353  -4.169 3.06e-05 ***
PRS          0.34170    0.03149  10.852  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.3  on 999  degrees of freedom
Residual deviance: 1242.5  on 998  degrees of freedom
AIC: 1246.5

Number of Fisher Scoring iterations: 4
signif(coef(model)[1], 4)
-0.3065
signif(coef(model)[2], 4)
0.3417

In the model summary we find two coefficients: <<odds_control>> for the intercept and <<odds_case>> for PRS. We can write our equation as $status =$ <<odds_control>> $+$ <<odds_case>> $\times PRS$. Inserting a sample’s score into this equation gives us the log odds of that sample having the phenotype. A one-unit increase in $PRS$ increases the log odds by <<odds_case>> units.
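Equivalently, exponentiating the PRS coefficient gives the multiplicative change in the odds; with the estimate above:

exp(coef(model)["PRS"])  # exp(0.3417) ≈ 1.41: each PRS unit raises the odds by about 41%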

We can translate the log odds into probabilities. Because it is a log, we can turn it back into odds with the exponential function. Odds are the ratio of the event happening (say $x$) to the event not happening (say $y$), which is $\frac{x}{y}$. $\beta_0+\beta_1X$ gives us the log of the odds; $e^{\beta_0+\beta_1X}$ gives us the odds. The probability is then $\frac{\frac{x}{y}}{1 + \frac{x}{y}}$, which equals $\frac{x}{y} \times \frac{y}{x+y}$ and simplifies to $\frac{x}{x+y}$: the probability of the event happening out of all outcomes.

\begin{equation*}
p(X) = \frac{e^{\beta_0+\beta_1X}}{1 + e^{\beta_0+\beta_1X}} \tag{3}
\end{equation*}

(“An Introduction to Statistical Learning” n.d.)
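In R, plogis computes exactly this transformation. For example, the probability of being a case for a hypothetical sample with a PRS of 2, using the coefficients fitted above:

plogis(-0.3065 + 0.3417 * 2)  # ≈ 0.59
predict(model, newdata = data.frame(PRS = 2), type = "response")  # same, via the model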

2.6. Plotting the regression curve

We can plot our model’s S-shaped curve, which shows the probabilities.

ggplot(samples, aes(x=PRS, y=as.numeric(samples$status) - 1)) +
  geom_point(shape = "|", position = position_jitter(w = 0.5, h = 0), aes(color=status)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  geom_hline(yintercept = c(0,1), linetype = "dashed", color = "grey") +
  scale_y_discrete(name ="Status", labels=c("Control","Case"), limits=c(0,1))

Figure: fitted logistic regression curve over PRS (PRS3.svg).

2.7. Measuring goodness of fit

One measure of how good our model is is the pseudo-$R^2$. In linear regression, $R^2$ is calculated from the sum of squared residuals. That calculation does not apply to logistic regression, because we have a probability curve instead. There are several approaches to calculating a pseudo-$R^2$; the most used one I have seen is the Nagelkerke method (also the most optimistic). The closer it gets to 1, the better the fit.

The rcompanion package has the pseudo-$R^2$ calculations; when the nagelkerke function is called, the results of the other methods are returned as well.

library(rcompanion)

nagelkerke(model)
$Models

Model: "glm, status ~ PRS, binomial, samples"
Null:  "glm, status ~ 1, binomial, samples"

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.103711
Cox and Snell (ML)                   0.133917
Nagelkerke (Cragg and Uhler)         0.178556

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -1     -71.887 143.77 3.9798e-33

$Number.of.observations

Model: 1000
Null:  1000

$Messages
[1] "Note: For models fit with REML, these statistics are based on refitting with ML"

$Warnings
[1] "None"

2.8. Enrichment in the highest PRS burden percentiles

This reproduces Table 2 from the paper by Leu et al. (2019), which looks at the most extreme tails of the curve to see whether the effect is larger there.

We split the samples above and below our percentile thresholds with the quantile function. We then pull the values used in the table from the model summary. In addition, we use confusionMatrix from the caret library to calculate the sensitivity and specificity.

What we are looking for in this table is an increase in the odds at the more extreme percentiles. One odd thing I notice is that in my simulations sensitivity goes up and specificity goes down as we get to the edge of the tail, whereas the opposite is seen in the paper.

library(caret)

my_sum_rows = list()
percents = c(20, 10, 5, 1, 0.5)
for (percent in percents) {
  samples$percentile <- ifelse(
    samples$PRS >= quantile(samples$PRS, prob = 1 - percent/100),
    "Above", "Below")
  samples$percentile <- as.factor(with(samples, reorder(percentile, PRS)))

  model <- glm(status~percentile, family="binomial", data=samples)

  threshold <- 0.5
  cm <- confusionMatrix(
    as.factor(ifelse(predict(model, type="response") > threshold, 1, 0)),
    as.factor(model$y)
  )

  above_control <- nrow(
    samples[which(samples$status=="Control" & samples$percentile=="Above"),]
  )
  below_control <- nrow(
    samples[which(samples$status=="Control" & samples$percentile=="Below"),]
  )
  above_case <- nrow(
    samples[which(samples$status=="Case" & samples$percentile=="Above"),]
  )
  below_case <- nrow(
    samples[which(samples$status=="Case" & samples$percentile=="Below"),]
  )

  my_sum_row <- data.frame(
    "Reference Group" = sprintf("Remaining %s%% ", 100-percent),
    "Cases/Controls Above PRS%" = sprintf("%s/%s", above_case, above_control),
    "Cases/Controls Below PRS%" = sprintf("%s/%s", below_case, below_control),
    "Odds Ratio" = round(exp(coef(model)[2]), 2),
    # note: confint(level=.90) gives a 90% interval, though the column is labeled "95% CI"
    "95% CI" = do.call(sprintf, c(
      "%.2f - %.2f",
      as.list(exp(confint(model, level=.90)[2,])))),
    "P value" = sprintf("%.2e",
      summary(model)$coefficients['percentileAbove', 'Pr(>|z|)']),
    "Sensitivity / Specificity" = sprintf(
      "%.3f / %.3f",
      as.list(cm$byClass)$Sensitivity,
      as.list(cm$byClass)$Specificity
    )
  )
  my_sum_rows[[sprintf("Top %s%% of distribution", percent)]] <- my_sum_row
}
my_sum_table <- do.call(rbind, my_sum_rows)
colnames(my_sum_table) <- c(
  "Reference group",
  "Cases/Controls above PRS%",
  "Cases/Controls below PRS%",
  "Odds ratio",
  "95% CI",
  "P value",
  "Sensitivity / Specificity"
)
my_sum_table
                          Reference group  Cases/Controls  Cases/Controls  Odds   95% CI        P value   Sensitivity /
                                           above PRS%      below PRS%      ratio                          Specificity
Top 20% of distribution   Remaining 80%    290/110         710/890         3.3    2.71 - 4.05   2.1e-22   0.890 / 0.290
Top 10% of distribution   Remaining 90%    149/52          851/948         3.19   2.43 - 4.23   4.68e-12  0.948 / 0.149
Top 5% of distribution    Remaining 95%    78/23           922/977         3.59   2.44 - 5.43   1.2e-07   0.977 / 0.078
Top 1% of distribution    Remaining 99%    15/6            985/994         2.52   1.18 - 5.93   0.0565    0.994 / 0.015
Top 0.5% of distribution  Remaining 99.5%  9/1             991/999         9.07   2.14 - 90.13  0.0366    0.999 / 0.009
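Because the predictor is binary (above/below the threshold), the model's odds ratio is simply the cross-product ratio of the 2x2 count table; for the top-20% row:

(290 / 110) / (710 / 890)  # ≈ 3.30, matching the "Odds ratio" column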

2.9. PRS deciles

This graph reproduces a figure from the paper by Kloeve-Mogensen et al. (2021), which compares scores from each decile. The fifth decile is used as the reference, which is why we see a stair pattern: the deciles before the fifth climb toward it and the deciles after it step up incrementally. I have found an answer that sets the fifth decile as the reference differently (socialscientist 2022); I instead just set the factor levels so the fifth comes first, which generalized linear models (glm) use as the reference. The exponentiated coefficients are then interpreted as odds ratios. (Noah 2018)

samples$deciles <- factor(.bincode(samples$PRS,
    breaks = quantile(samples$PRS, seq(0, 1, 0.1)),
    include.lowest = TRUE
  ), levels = c(5, 1, 2, 3, 4, 6, 7, 8, 9, 10))

model <- glm(status~deciles, family="binomial", data=samples)
summary(model)

Call:
glm(formula = status ~ deciles, family = "binomial", data = samples)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.19867    0.14141  -1.405  0.16006
deciles1    -1.32376    0.23203  -5.705 1.16e-08 ***
deciles2    -0.88014    0.21511  -4.092 4.28e-05 ***
deciles3    -0.36885    0.20435  -1.805  0.07108 .
deciles4    -0.03352    0.20092  -0.167  0.86749
deciles6     0.50095    0.20114   2.491  0.01276 *
deciles7     0.58733    0.20242   2.902  0.00371 **
deciles8     1.02227    0.20877   4.897 9.75e-07 ***
deciles9     1.07696    0.20974   5.135 2.82e-07 ***
deciles10    1.26406    0.21532   5.871 4.34e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2772.6  on 1999  degrees of freedom
Residual deviance: 2484.1  on 1990  degrees of freedom
AIC: 2504.1

Number of Fisher Scoring iterations: 4
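Exponentiating a decile's coefficient gives its odds ratio relative to the reference fifth decile; for example, for the top decile:

exp(coef(model)["deciles10"])  # exp(1.264) ≈ 3.54 vs. the 5th decile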
my_sum_data <- as.data.frame(cbind(
  exp(confint(model, level=.90)[2:10,]),
  "Odds ratio"=round(exp(coef(model)[2:10]), 2),
  "P value"=summary(model)$coefficients[2:10, 'Pr(>|z|)']
))
colnames(my_sum_data) <- c(
  "lower",
  "upper",
  "odds_ratio",
  "p_val"
)
my_sum_data <- rbind(my_sum_data, c(1,1,1,1))
my_sum_data$decile <- factor(c(1,2,3,4,6,7,8,9,10,5))

ggplot(my_sum_data, aes(decile, odds_ratio)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width=.2) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "grey") +
  geom_point(shape=22, size=6, color="black", fill="orange", stroke=.8)

Figure: odds ratio per PRS decile with confidence intervals, fifth decile as reference (PRS-decile.svg).

3. Conclusion

PRS is a method for estimating the underlying genetic risk that common variation creates for complex phenotypes. As of this writing, pgscatalog.org hosts 3,688 scores for 619 traits from 480 publications. Interpreting them in terms of disease risk, and their role in understanding disease etiology, still needs further work, but there are also proposals for other uses. Lu et al. conclude that PRS can be useful for prioritizing patients with low scores for monogenic testing. (Lu et al. 2022) Campbell et al. suggest that polygenic burden plays a role in the severity and penetrance of the developmental and epileptic encephalopathies. (Campbell et al. 2022) Another approach, taken by Darst et al. and Hassanin et al., is to calculate cancer risk with PRS combined with rare variants. (Hassanin et al. 2023; Darst et al. 2021) Considering the papers above, I think investigating PRS together with low-allele-frequency variants of unknown significance might give insight into some of the complex diseases.

This has been my introduction to polygenic risk scores and genetic burden. It is not a tutorial on how to perform the analyses, just an attempt to explain the basic underlying concepts. This post also does not go into the finickier subjects like population stratification, better effect size estimation, or controlling for linkage disequilibrium.

4. References

Andrade, Chittaranjan. 2015. “Understanding Relative Risk, Odds Ratio, and Related Terms: As Simple as It Can Get.” The Journal of Clinical Psychiatry 76 (7): 21865. https://doi.org/10.4088/JCP.15f10150.
“An Introduction to Statistical Learning.” n.d. An Introduction to Statistical Learning. https://www.statlearning.com. Accessed August 7, 2023.
Campbell, Ciarán, Costin Leu, Yen-Chen Anne Feng, Stefan Wolking, Claudia Moreau, Colin Ellis, Shiva Ganesan, et al. 2022. “The Role of Common Genetic Variation in Presumed Monogenic Epilepsies.” EBioMedicine 81 (July): 104098. https://doi.org/10.1016/j.ebiom.2022.104098.
Choi, Shing Wan, Timothy Shin-Heng Mak, and Paul F. O’Reilly. 2020. “Tutorial: A Guide to Performing Polygenic Risk Score Analyses.” Nature Protocols 15 (9): 2759–72. https://doi.org/10.1038/s41596-020-0353-1.
Cipriano, Andrea, and Monica Ballarino. 2018. “The Ever-Evolving Concept of the Gene: The Use of RNA/Protein Experimental Techniques to Understand Genome Functions.” Frontiers in Molecular Biosciences 5.
Claussnitzer, Melina, Simon N. Dankel, Kyoung-Han Kim, Gerald Quon, Wouter Meuleman, Christine Haugen, Viktoria Glunk, et al. 2015. “FTO Obesity Variant Circuitry and Adipocyte Browning in Humans.” New England Journal of Medicine 373 (10): 895–907. https://doi.org/10.1056/NEJMoa1502214.
Collister, Jennifer A., Xiaonan Liu, and Lei Clifton. 2022. “Calculating Polygenic Risk Scores (PRS) in UK Biobank: A Practical Guide for Epidemiologists.” Frontiers in Genetics 13.
Darst, Burcu F., Xin Sheng, Rosalind A. Eeles, Zsofia Kote-Jarai, David V. Conti, and Christopher A. Haiman. 2021. “Combined Effect of a Polygenic Risk Score and Rare Genetic Variants on Prostate Cancer Risk.” European Urology 80 (2): 134–38. https://doi.org/10.1016/j.eururo.2021.04.013.
Hassanin, Emadeldin, Isabel Spier, Dheeraj R. Bobbili, Rana Aldisi, Hannah Klinkhammer, Friederike David, Nuria Dueñas, et al. 2023. “Clinically Relevant Combined Effect of Polygenic Background, Rare Pathogenic Germline Variants, and Family History on Colorectal Cancer Incidence.” BMC Medical Genomics 16 (1): 42. https://doi.org/10.1186/s12920-023-01469-z.
Kloeve-Mogensen, Kirstine, Palle Duun Rohde, Simone Twisttmann, Marianne Nygaard, Kristina Magaard Koldby, Rudi Steffensen, Christian Møller Dahl, et al. 2021. “Polygenic Risk Score Prediction for Endometriosis.” Frontiers in Reproductive Health 3: 793226. https://doi.org/10.3389/frph.2021.793226.
Leu, Costin, Remi Stevelink, Alexander W Smith, Slavina B Goleva, Masahiro Kanai, Lisa Ferguson, Ciaran Campbell, et al. 2019. “Polygenic Burden in Focal and Generalized Epilepsies.” Brain 142 (11): 3473–81. https://doi.org/10.1093/brain/awz292.
Lu, Tianyuan, Vincenzo Forgetta, John Brent Richards, and Celia M. T. Greenwood. 2022. “Polygenic Risk Score as a Possible Tool for Identifying Familial Monogenic Causes of Complex Diseases.” Genetics in Medicine: Official Journal of the American College of Medical Genetics 24 (7): 1545–55. https://doi.org/10.1016/j.gim.2022.03.022.
Noah. 2018. “Answer to ‘Why Are Exponentiated Logistic Regression Coefficients Considered ‘Odds Ratios’?’.” Cross Validated.
“Polygenic Risk Scores.” n.d. Genome.Gov. https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores. Accessed July 29, 2023.
socialscientist. 2022. “Answer to ‘How to Calculate and Plot Odds-Ratios and Their Standard Errors from a Logistic Regression in R?’.” Stack Overflow.
Uffelmann, Emil, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. 2021. “Genome-Wide Association Studies.” Nature Reviews Methods Primers 1 (1): 1–21. https://doi.org/10.1038/s43586-021-00056-9.
“Understanding Statistics: Risk BMJ Best Practice.” n.d. Accessed July 29, 2023.

5. Acronyms

AR Absolute Risk

GWAS Genome-Wide Association Study

GLM Generalized Linear Model

PRS Polygenic Risk Score

RR Relative Risk

SNP Single Nucleotide Polymorphism

6. Glossary

Allele frequency The ratio of a given allele to the total number of alleles in a given population

Allele One of the alternative forms of a genomic locus. Human autosomal chromosomes carry two alleles at every locus, one inherited from each parent

Dosage The number of effect alleles. It is 0 if both alleles are non-effect; 1 if one allele is an effect allele; and 2 if both alleles are the effect allele

Effect size The weight of the effect allele. It can be an odds ratio, a hazard ratio, or an allele frequency

Effect allele The alternative allele associated with the phenotype

Odds ratio The ratio of the odds of two events

Odds The ratio of the number of times an event happens to the number of times it does not happen

Penetrance The probability that a variant's effect is seen in the phenotype

Variant of unknown significance A variant that we can't classify as either pathogenic or benign