How are ancestry results generated?
Written by: Nermin Đuzić, M.Sc. in Genetics, Content Specialist
In previous articles, we explained the whole procedure of obtaining genotyping results. Now we will talk about how we may use and interpret these results. Do you remember when we spoke about genetic tests and their importance? We mentioned that genetic analysis could be used for estimating ethnicity composition as well.
One of our products at BioCertica is a DNA ancestry report that allows you to unlock your identity and take a look into history through your genes. This test analyzes your DNA and gives an insight into the populations and cultures your ancestors belonged to.
Before we dig deeper into the explanation of genetic ancestry results, let’s first explain what genetic ancestry analysis is, step by step.
What is genetic ancestry analysis?
Genetic ancestry analysis is a complex process that combines several disciplines, such as genetics, statistics, and probability, and is based on the latest research in genomics. In this process, we use your DNA information to estimate your deep ancestry, going far back in history, generally hundreds of years [1].
However, it is essential to underline that deep ancestry analysis is distinct from recent ancestry analysis, which aims to find your recent, unknown relatives. Deep ancestry is estimated by comparing your DNA information with reference panel datasets of individuals with known ancestry.
The DNA information we use for ancestry analysis is genotyping data obtained from your saliva sample. The reference panels, on the other hand, are DNA information compiled from many (hundreds of) individuals from different populations with known ancestry.
Genetic ancestry vs. genetic genealogy
Although these two terms are closely related, they should not be used interchangeably. Genetic genealogy uses DNA information to identify your recent relatives, a couple of generations back. Genetic ancestry analysis, in contrast, estimates ancestry that goes far back in history, several hundred or even thousands of years.
Genetic genealogy uses the same DNA information as ancestry analysis, but it estimates how much DNA you share with the other individuals you want to be compared with. This genetic information is generally combined with historical and family records to determine recent relationships between you and the other studied individuals.
What are the key inputs necessary for the estimation of genetic ancestry?
To obtain genetic ancestry estimates for any individual, we need the following:
- Your DNA genotyping information
- A reference population dataset composed of populations with known ancestry
We use a sophisticated, verified methodology to compare your DNA information against our reference population dataset and obtain ancestry estimates in the form of percentages, shown in your results section (Figure 1).
What is the reference panel dataset?
Recently you had the opportunity to read the series of articles about genotyping and learn more about DNA information we obtain from your swab samples in the form of SNPs. Besides this, the second equally important element necessary for getting your genetic ancestry estimates is the reference panel dataset.
The reference panel dataset is compiled from DNA information obtained from hundreds of individuals from different populations with known ancestry. It serves as the reference point for the estimation of ancestry for all our users. In straightforward terms, we take your DNA information and compare it against the reference panel dataset to obtain ancestry estimates.
Our reference panel dataset is composed of 21 ancestry groups, each of which either consists of a single population or represents a mix of several genetically very similar populations. The complete list of ancestry groups in our report is presented below (Table 1).
We aim to release more groups in the upcoming period, make our analysis more comprehensive, and increase the resolution of our estimates. This is crucial because your ancestry results depend on how detailed and extensive the reference panel dataset is and on the amount of DNA information obtained.
Table 1: List of ancestry groups in the BioCertica Ancestry Report
| Ancestry Group | Continent | Populations |
| --- | --- | --- |
| Africa: Mande Origin | Africa | Gambian Mandinka, Mandenka people |
| Northern Africa | Africa | Bedouin and Mozabite |
| West Africa | Africa | Esan people, Yoruba people, Bantu Kenya, Luhya people |
| Native Northern American | Americas | Pima people |
| Indigenous People of Brazil: Karitiana | Americas | Karitiana people |
| Indigenous People of Brazil: Surui | Americas | Surui people |
| Central and West Asia | Asia | Hazara, Mansi, Uygur, Kyrgyz |
| Han Chinese | Asia | Han Chinese |
| Dai People | Asia | Dai Chinese |
| South Asia: Bengali | Asia | Bengali |
| South Asia: Indian Origin | Asia | Tamil, Telugu, Gujarati |
| South Asian: Dardic Origin | Asia | Kalash people |
| Middle East | Asia | Druze people |
| Japanese | Asia | Japanese |
| North Asian | Asia | Even and Yakut people |
| Caucasus | Europe | Abkhasian, Adygei, Georgian, North Ossetian |
| Western European | Europe | Utah Residents with Western European ancestry, French, British and Orcadian |
| Finnish Origin | Europe | Finnish |
| Sardinian Origin | Europe | Sardinians |
| Oceania: Bougainville | Oceania | Bougainville people |
| Papuan Origin | Oceania | Papuan Sepik, Papuan Highlands |
You may now ask where we obtained this DNA information from individuals with known ancestry to form our reference panel dataset. The 21 groups listed above were built from publicly available datasets of individuals with known ancestry.
How do we estimate ancestry?
Ancestry estimation is a time-consuming and iterative process consisting of several steps. It starts with the development of the reference panel dataset, which includes the following steps:
- Collecting publicly available DNA information from individuals from many populations worldwide with known ancestry.
- Combining all these individual data into a single dataset and performing quality control, which includes removing related and duplicated individuals, along with their SNPs, from the dataset. This is important because related and duplicated individuals in the reference dataset can distort the ancestry estimates.
- The next step is cleaning the reference panel dataset. In this step, the reference panel dataset is divided into six subsets, which are continent-related:
  - Africa
  - South and West Asia
  - North and East Asia
  - Americas
  - Europe
  - Oceania
Each subset of the reference panel dataset is analyzed using Principal Component Analysis (PCA), a statistical method we use to identify clusters of distinct populations in our reference panel dataset. Afterward, we create the genetic ancestry groups for which we provide estimates. Once the reference panel dataset is cleaned, we can proceed with ancestry estimation itself, which is done using specific algorithms and approaches.
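To make the PCA step more concrete, here is a minimal sketch in plain Python. It is only an illustration, not the pipeline used for the report: the function name `first_pc_scores`, the toy genotype matrix, and the use of power iteration to find the first principal component are all assumptions made for this example. Each row is one hypothetical individual, and each value is an allele count (0, 1, or 2) at one SNP.

```python
import math

def first_pc_scores(genotypes, iterations=200):
    """Return each individual's score on the first principal component.

    genotypes: list of rows, one per individual; each row holds allele
    counts (0, 1, or 2) for the same SNPs. Intended for toy-sized input
    in which at least two individuals differ.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each SNP (column) on its mean across individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Power iteration: repeatedly apply X^T X to a vector of SNP weights
    # until it settles on the direction of greatest variance.
    weights = [1.0 / (j + 1) for j in range(m)]  # arbitrary fixed start
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, weights)) for row in centered]
        updated = [sum(centered[i][j] * scores[i] for i in range(n))
                   for j in range(m)]
        norm = math.sqrt(sum(w * w for w in updated))
        weights = [w / norm for w in updated]

    # Project every individual onto the first principal component.
    return [sum(x * w for x, w in zip(row, weights)) for row in centered]

# Hypothetical toy data: the first three individuals resemble each other,
# and so do the last three.
genotypes = [
    [2, 2, 2, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [2, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 2, 2, 1],
    [0, 0, 0, 2, 1, 2],
]
pc1 = first_pc_scores(genotypes)
```

In this toy example, the first three individuals land on one side of the first principal component and the last three on the other, which is the kind of separation used to recognize distinct populations. Real analyses work with many thousands of SNPs and rely on dedicated, optimized tools.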
What do these ancestry results mean to you?
You may ask yourself what it means to have 75% Western or Central European origin, or why the East Asian origin confirmed through your distant relatives doesn’t appear in our results. And finally, how reliable are these estimates?
First of all, you should know that these results estimate how your DNA is associated with a specific population of a particular continent, region, or country using a predefined reference panel dataset.
The essence of the analysis is that we take your DNA data and compare it to 21 groups that we created in our current reference population dataset and estimate the composition of DNA segments that most closely match one of these groups. Finally, all the segments compared across your DNA are summarized together to create overall ancestry estimates for the corresponding ancestry groups.
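The summarizing step can be illustrated with a short Python sketch. It assumes that each DNA segment has already been assigned to its best-matching ancestry group (how that matching is done is not covered here) and that segments are weighted by their length. The function name `summarize_segments`, the segment lengths, and the length weighting are hypothetical choices made for this example.

```python
from collections import defaultdict

def summarize_segments(segments):
    """Turn per-segment group assignments into overall percentages.

    segments: list of (length, group) pairs, where `group` is the
    reference group that the segment most closely matches.
    """
    totals = defaultdict(float)
    for length, group in segments:
        totals[group] += length  # accumulate matched length per group

    total_length = sum(totals.values())
    if total_length == 0:
        return {}
    # Express each group's share of the total length as a percentage.
    return {group: 100.0 * matched / total_length
            for group, matched in totals.items()}

# Hypothetical segments with made-up lengths.
segments = [
    (50.0, "Western European"),
    (25.0, "Western European"),
    (20.0, "Finnish Origin"),
    (5.0, "Northern Africa"),
]
estimates = summarize_segments(segments)
# {'Western European': 75.0, 'Finnish Origin': 20.0, 'Northern Africa': 5.0}
```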
Currently, we have 21 groups, and your ancestry results are estimated based on those 21 groups. This means your ancestry estimates tell you how genetically similar your DNA is to some of those 21 groups.
For example, suppose your results indicate that you are 75% Western European, 20% Finnish Origin, and 5% Northern African. This means that, compared with our current reference panels, 75% of your DNA is genetically most similar to our Western European group, 20% to the Finnish Origin group, and 5% to the Northern African group.
However, as we mentioned elsewhere, your results might change as we expand our reference panel (add new populations) and increase the number of genetic variants used for estimation. If your ancestry comes from populations not covered by our reference panel, your results will point to the most genetically similar populations that are in the panel.
Also, it is essential to underline that ancestry estimates below 5% fall within the standard deviation range of the method and may not reflect genuine genetic similarity to the corresponding reference population. This is especially true of estimates below 1%. Therefore, low-value ancestry estimates (<5%) should not be interpreted literally. As mentioned above, these deviations will shrink as we increase the size and diversity of our reference panel dataset.
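As a simple illustration of this rule of thumb, the sketch below marks estimates under 5% as low confidence. The 5% cut-off comes from the paragraph above. The function name `flag_low_estimates`, the output layout, and the sample values are hypothetical.

```python
LOW_CONFIDENCE_THRESHOLD = 5.0  # percent; the cut-off described above

def flag_low_estimates(estimates, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Mark every ancestry estimate that falls below the threshold.

    estimates: dict mapping ancestry group name to a percentage.
    """
    return {
        group: {"percent": percent, "low_confidence": percent < threshold}
        for group, percent in estimates.items()
    }

# Hypothetical results for one person.
sample = {
    "Western European": 76.5,
    "Finnish Origin": 20.0,
    "Northern Africa": 3.0,
    "Sardinian Origin": 0.5,
}
flagged = flag_low_estimates(sample)
```

Here the Northern Africa (3%) and Sardinian Origin (0.5%) estimates would be flagged, so they should be read as possible noise rather than as confirmed ancestry.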
If you are interested in knowing more about yourself and your ancestry, you can unlock your identity today by ordering our Ancestry DNA kit.
References
[1] Shriver, M. D., & Kittles, R. A. (2004). Genetic ancestry and the search for personalized genetic histories. Nature Reviews Genetics, 5(8), 611-618.
[2] Siva, N. (2008). 1000 Genomes project. Nature Biotechnology, 26(3), 256-257.
[3] Cavalli-Sforza, L. L. (2005). The human genome diversity project: past, present, and future. Nature Reviews Genetics, 6(4), 333-340.
[4] Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., ... & Reich, D. (2016). The Simons genome diversity project: 300 genomes from 142 diverse populations. Nature, 538(7624), 201-206.