What is GWAS & how does it work?

October 5, 2022 BioCertica Content Team

Written by: Nermin Đuzić, M.Sc. in Genetics, Content Specialist

Peer-reviewed by: Edin Hamzić, Ph.D. in Genetics, Chief Science Officer

In our previous article, we gave a short general introduction to the polygenic risk score (PRS) and how it works. To calculate a PRS for a particular condition or disease, we sum the individual risk estimates for all SNPs across your genome that are associated with the trait. But wait, how do we know which variants to consider for a given condition?

The answer lies in genome-wide association studies (GWAS), which serve as a reference. A GWAS can include anywhere from thousands to millions of people, typically belonging to a specific ethnicity or population, whose variants are tested for association with a given trait [1].

What is GWAS and why is it important for PRS?

In simple terms, a GWAS uses samples from many different people to identify and report genetic variants that are more common in individuals who have a certain condition (known as cases) than in individuals who do not (known as controls). In other words, these studies report how the alleles of genetic variants associated with a given disorder discriminate between affected and unaffected individuals.

In this way, GWAS data tell us which genetic variants (SNPs) we should pay attention to when assessing genetic risk for a condition of interest. Identifying these SNPs can point to the underlying mechanisms behind the trait, and may even help us estimate how much of the trait is due to genetics and how much to environmental factors. It also clarifies which alleles are more common in people with particular traits and disorders. That is how we know which variants to look for in DNA samples submitted by our users.

What does all this have to do with PRS? Under the infinitesimal (or polygenic) model, we look for genetic variants, in this case SNPs, that have been found to explain the genetic variation of a given trait. This is generally done by GWAS. The PRS model takes the effect size of each allele, as obtained from GWAS, and uses it to compute per-SNP risk scores, which are then summed into the final risk score. It’s important to note that the greater the effect size, the greater the weight given to the variant. Seems like too much information? Let’s explain the procedure in more detail below!
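As a minimal sketch of that summation: each effect size is multiplied by the number of effect alleles a person carries, and the products are added up. The SNP IDs, effect sizes, and the helper name polygenic_risk_score are made up for illustration and are not taken from any real GWAS or from our own pipeline.

```python
# Hypothetical effect sizes (regression slopes) per SNP, as a GWAS might report them.
effect_sizes = {"rs_example_1": 0.30, "rs_example_2": -0.10, "rs_example_3": 0.05}

# One person's genotype: number of effect alleles (0, 1, or 2) at each SNP.
genotype = {"rs_example_1": 2, "rs_example_2": 1, "rs_example_3": 0}


def polygenic_risk_score(effect_sizes, genotype):
    """Sum (effect size x allele count) over SNPs present in both inputs."""
    return sum(
        beta * genotype[snp]
        for snp, beta in effect_sizes.items()
        if snp in genotype
    )


score = polygenic_risk_score(effect_sizes, genotype)
print(round(score, 2))
```

SNPs with a larger effect size move the score more per allele, which is the weighting described above.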

How does it work?

To give a basic, stepwise explanation of how GWAS works, we should start with the basics of genetics and single nucleotide polymorphisms (SNPs). A person’s genetic makeup is called the genome; here we look at the genetic material located in the nucleus of the cell, where it is split into 23 pairs of chromosomes. Our DNA is composed of extremely long chains of connected subunits called nucleotides, which come in one of four forms: adenine (A), guanine (G), cytosine (C), and thymine (T).

As far as the sequence of nucleotides goes, humans share 99.5% of their genetic information [2]. The differences arise where, for example, the majority of people have nucleotide A at a particular position while the rest have nucleotide T at the same spot. These forms are called variants. Since this location in human DNA can take multiple forms, it is called a single nucleotide polymorphism (SNP). Therefore, the 99.5% similarity, or 0.5% difference, between two individuals refers not to genes but to base pairs.

SNPs are key to understanding the genetic causes of human traits and conditions. Some traits, like aptitude for music or languages, are shaped largely by the environment, while traits like eye color are highly heritable. SNPs help us determine to what extent a trait is due to genetics and which biological mechanisms may be affecting it. To do this, we have to carry out an association analysis.

Let’s suppose we want to find genetic associations with body mass index (BMI), that is, to find which SNPs (genetic variants) contribute to a person's BMI, for example variants in genes that may increase or decrease the metabolic activity of our body.

First, we need a large sample of volunteers, preferably thousands of people. Participants of the same ethnicity are used to minimize the confounding effect of other factors on genetic variation. Statistical methods such as principal component analysis (PCA) can also be used to account for differences in ethnicity and population structure. The next step is to have each participant genotyped, which means having their nucleotides recorded at many known SNP locations. This can yield information on millions of SNPs for each participant.

The next step is to record BMI for all participants. Once we have genotype and phenotype (trait) data for a large number of people, we can compute the association between the two. This is done with a genome-wide association analysis program; a commonly used one is PLINK. This software also makes it possible to apply quality control (QC) filters to the genetic dataset, removing individuals or SNPs that do not meet the QC criteria.
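To make the idea of a QC filter concrete, here is a small sketch in plain Python. It is not PLINK and does not reproduce PLINK's defaults: the two checks shown (genotype call rate and minor allele frequency), the thresholds, and the name snp_passes_qc are assumptions chosen for illustration.

```python
# Toy genotype matrix: rows are individuals, columns are SNPs.
# Values are allele counts (0, 1, 2); None marks a missing genotype call.
genotypes = [
    [0, 1, None, 0],
    [1, 2, None, 0],
    [2, 1, 0, 0],
    [1, 0, None, 0],
]
snp_ids = ["snp1", "snp2", "snp3", "snp4"]

MIN_CALL_RATE = 0.9  # assumed threshold: SNP genotyped in at least 90% of people
MIN_MAF = 0.05       # assumed threshold: minor allele frequency of at least 5%


def snp_passes_qc(calls):
    """Return True if one SNP's column of calls meets both thresholds."""
    observed = [g for g in calls if g is not None]
    call_rate = len(observed) / len(calls)
    if call_rate < MIN_CALL_RATE:
        return False
    # Each person carries two alleles, hence the factor of 2.
    allele_freq = sum(observed) / (2 * len(observed))
    maf = min(allele_freq, 1 - allele_freq)
    return maf >= MIN_MAF


columns = list(zip(*genotypes))  # transpose: one tuple of calls per SNP
kept = [snp for snp, calls in zip(snp_ids, columns) if snp_passes_qc(calls)]
print(kept)
```

In this toy data, snp3 is dropped because most calls are missing, and snp4 is dropped because nobody carries the alternative allele, so it cannot tell people apart.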

Afterwards, we perform a regression analysis for every SNP in the dataset, with each individual being a data point. Let’s say that we want to perform a regression analysis for SNP ID #1, which has alleles A and T (Figure 1). A person’s DNA contains one copy from the father and one copy from the mother, so a person’s genotype for the SNP may be AA, AT, or TT, which can be coded as 0, 1, and 2 respectively, that is, as the number of T alleles. Each individual’s number of T alleles is plotted against the trait of interest, in our case BMI. Once every individual is plotted on the graph, the program fits a line that estimates the relationship between the number of alleles and the phenotype.

Simply put, if there is no association between the SNP and BMI, the regression line will be horizontal. However, if there is an association between the two, the line will have a slope. How well the regression line fits the data points determines the p-value. A p-value is a statistical term: it is the probability of seeing an association at least as strong as the one observed purely by chance, assuming there is in fact no association between the SNP and BMI. This means that the more tightly the data points cluster around a sloped regression line, the less likely it is that the association is due to random chance, and the smaller the p-value. For each SNP we record the p-value and the slope of the regression line, which is also known as the effect size.
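The 0/1/2 coding can be sketched in a couple of lines. The function name count_allele and the choice of T as the counted allele are assumptions made for this example.

```python
def count_allele(genotype, counted_allele="T"):
    """Count copies of the chosen allele in a two-letter genotype such as 'AT'."""
    return genotype.count(counted_allele)


# 'AT' and 'TA' are the same genotype written in a different order.
codes = {g: count_allele(g) for g in ["AA", "AT", "TA", "TT"]}
print(codes)
```

These counts are the x-values that go into the regression for that SNP.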
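Here is a rough sketch of the regression for a single SNP, written in plain Python rather than PLINK. The data are made up, the function name simple_regression is hypothetical, and the p-value uses a large-sample normal approximation as a simplifying assumption; statistical software would normally use the t-distribution, which matters for a sample as tiny as this one.

```python
import math


def simple_regression(x, y):
    """Least-squares slope of y on x, with an approximate two-sided p-value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Scatter of the points around the fitted line.
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    z = slope / std_err
    # Normal approximation to the two-sided p-value (an assumption, see above).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return slope, p_value


# Made-up data: number of T alleles and BMI for nine people.
allele_counts = [0, 0, 0, 1, 1, 1, 2, 2, 2]
bmi = [22.0, 23.5, 21.0, 24.0, 25.5, 23.0, 27.0, 26.0, 28.5]

slope, p_value = simple_regression(allele_counts, bmi)
print(f"effect size (slope): {slope:.2f}, p-value: {p_value:.2e}")
```

In this toy example each extra T allele goes with a BMI that is 2.5 units higher on average, and the points sit close to the line, so the p-value is small.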

Figure 1: Association between SNP and BMI

This regression analysis is repeated for every single SNP in the dataset. With millions of SNPs, a computer could take hours or even days to produce results. However, programs like PLINK support efficient processing, such as multi-threading, to finish the analysis faster.

Moreover, these programs allow us to include covariates, which are other factors that may affect the phenotype or trait of interest. For example, BMI may be strongly affected by the amount of exercise a person does weekly, and having this information for the people in a dataset may significantly change the slope of the regression line compared to considering genotype alone. PCA is also used here to account for population substructure.

Finally, after calculating the p-values for all SNPs, we can say whether the association between each SNP and the trait is significant. This again comes down to p-values: conventionally, only values below 0.05 are considered statistically significant, meaning the association is unlikely to be due to random chance alone.

However, when testing millions of SNPs, a threshold of 0.05 can produce thousands of false positives. This can be addressed with a Bonferroni correction, which tightens the threshold required for significance by taking the typical threshold of 0.05 and dividing it by the number of SNPs in the analysis. Quantitative genetics has adopted the value of 5E-8 as the default threshold for significance, which corresponds to 0.05 divided by one million tests. The goal of applying the Bonferroni correction is to reduce the number of false positives.

To visualize the results, a Manhattan plot is produced, in which each SNP is plotted by its chromosome position on the x-axis against the negative logarithm (base 10) of its p-value on the y-axis. Dots above the line marking the threshold represent SNPs that are significantly associated with the trait. Next, these SNPs are looked up in SNP databases that indicate which genes lie in those regions. Identifying these genetic regions can help us understand biological mechanisms that may be useful in the prevention and treatment of certain genetic diseases.
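The effect of a covariate can be illustrated with a small sketch. It uses made-up data in which people with more T alleles also happen to exercise less, and it adjusts for exercise by "residualizing": removing the part of both genotype and BMI that exercise explains, then regressing what is left. This gives the same genotype slope as a regression that includes both variables, but it is only one way to do the adjustment and is not meant to describe how PLINK implements it. The function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data, constructed as BMI = 25 + 1.0 * alleles - 0.5 * exercise hours.
allele_counts = [0, 0, 1, 1, 2, 2]
exercise_hours = [6.0, 2.0, 5.0, 1.0, 4.0, 0.0]
bmi = [22.0, 24.0, 23.5, 25.5, 25.0, 27.0]

# Genotype alone: the slope also absorbs part of the exercise effect.
_, unadjusted_slope = fit_line(allele_counts, bmi)

# Adjusted for exercise: regress the residuals on each other.
_, adjusted_slope = fit_line(
    residuals(exercise_hours, allele_counts),
    residuals(exercise_hours, bmi),
)
print(f"unadjusted: {unadjusted_slope:.2f}, adjusted: {adjusted_slope:.2f}")
```

Without the covariate the SNP appears to add 1.5 BMI units per allele; once exercise is accounted for, the estimate drops to the 1.0 that was built into the toy data.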
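The coordinates behind a Manhattan plot can be computed as in the sketch below. The results are made up, and the way chromosomes are laid end to end (offsetting each by the largest position seen on the previous ones) is a simplification assumed for this example. The resulting points could be handed to any plotting library.

```python
import math

# Hypothetical results: (chromosome, base-pair position, p-value).
results = [
    (1, 1_000, 0.20),
    (1, 5_000, 1e-9),
    (2, 2_000, 0.03),
    (2, 4_000, 1e-8),
]
THRESHOLD = 5e-8

# Lay chromosomes side by side along the x-axis.
chrom_offsets = {}
offset = 0
for chrom in sorted({c for c, _, _ in results}):
    chrom_offsets[chrom] = offset
    offset += max(pos for c, pos, _ in results if c == chrom)

# x = position along the genome, y = -log10(p): smaller p-values plot higher.
points = [(chrom_offsets[c] + pos, -math.log10(p)) for c, pos, p in results]
threshold_line = -math.log10(THRESHOLD)

# SNPs whose dots would sit above the threshold line.
hits = [(c, pos) for c, pos, p in results if p < THRESHOLD]
print(points)
print(f"threshold line at y = {threshold_line:.2f}")
print("significant:", hits)
```

Taking the negative logarithm is what turns tiny p-values into the tall "skyscrapers" that give the plot its name.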

Figure 2: An example of a Manhattan plot

Key benefits and limitations of GWAS

There is no doubt that GWAS have revolutionized our understanding of the genetics behind complex diseases in recent decades, finding and reporting many significant associations between gene variants and complex diseases and traits. Among the benefits GWAS have brought in recent years are a better understanding of complex diseases and the genes that increase susceptibility to them, the discovery of new biological mechanisms underlying complex diseases, and the translation of these findings into clinical care.

However, despite these successes, GWAS have also been subject to controversy and limiting factors. Let’s try to summarize some key benefits and limitations of GWAS.

Benefits:

GWAS have been very successful in identifying novel variant-trait associations. Thousands of GWAS have been published so far, with thousands of SNPs and associations reported for a vast number of diseases and traits, including major depressive disorder, anorexia nervosa, cancers and their subtypes, type 2 diabetes, coronary heart disease, schizophrenia, inflammatory bowel disease, insomnia, BMI, etc.
GWAS can lead to the discovery of novel biological mechanisms. For example, the role of autophagy in Crohn's disease was not known until SNPs associated with this disease were discovered.
GWAS findings have multiple benefits in clinical settings, where they help translate biological insights into medical advancements. GWAS may help with disease classification and subtyping. Genetic variants identified by GWAS can be used to identify individuals at high risk of developing a condition, which may provide clues for directing the right prevention, treatment, or diagnosis.
GWAS can provide insight into ethnic variation of complex traits, since some risk SNPs and their locations on chromosomes show considerable ethnic differences in frequency and effect size.
GWAS can be used to identify novel monogenic and oligogenic disease genes.
Beyond gene identification, GWAS data may also be used for reconstruction of population history, determination of ancestry and population substructure, fine-scale estimation of location of birth, estimation of SNP heritability for complex traits, estimation of genetic correlations between traits, polygenic risk scores, forensic analyses, etc.
GWAS data generation, management, and analysis are straightforward as explained above.
GWAS and their findings are easily available today and facilitate novel discoveries.

Limitations:

There are concerns that most associations found in GWAS do not reflect functional variants, but variants in linkage disequilibrium with potential functional variants.
SNPs used in GWAS account for a rather small fraction of the heritability of complex traits. Heritability refers to the proportion of variation in a trait that is due to genetic factors alone.
Certain associations reported in GWAS may be spurious and may not point to causal variants and genes.
Since most GWAS have been conducted in European populations, polygenic risk scores are mainly available for people of European ancestry. This affects the accuracy of the PRS methodology when it is applied to groups of non-European descent, since there could be differences in their genetic variants and other confounding effects. However, the good news is that there are already large-scale GWAS that cover individuals from different ancestries, and new ones are published regularly.

These limitations have led to skepticism among non-geneticists, and to hesitancy among stakeholders and national funding agencies to fund new GWAS. However, there is considerable benefit from GWAS, as their associations have led to insights into the architecture of disease susceptibility, and to advances in clinical care and personalized medicine.

Overview of GWAS

Figure 3: Overview of GWAS

References

[1] Tam, V., Patel, N., Turcotte, M., Bossé, Y., Paré, G., & Meyre, D. (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484.
[2] Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., ... & Venter, J. C. (2007). The diploid genome sequence of an individual human. PLoS Biology, 5(10), e254.