Saturday, 26 September 2015

Dstat reveals genetic distances

I have seen numerous PCA and ADMIXTURE analyses that try to demonstrate who the full-blooded Europeans are, as well as many analyses claiming to prove real or false migrations into or out of Europe. This is sometimes misleading and hides actual European ancestry, because the admixtures revealed by selective tests can be very small and become detectable only when they are detached from the main events of European history. My aim now is to find large-scale similarities inside Europe. This can be done with dstat analyses, which compare whole genomes without dropping meaningful genetic proportions. I run the tests by looking for differences between suggested non-European ancestries and actual European ancestries.

I suggest the following non-European populations:
  •  Nganasans, representing pure Siberians, found in North Siberia and Northeast Europe

  •  Mongolians, representing the medieval Mongol invasion of Europe

  •  Bedouins, representing present-day Middle Easterners, which rules out Early European Farmers

Any comparison needs a baseline, here the presumably least admixed Europeans. The British live on an island isolated from mainland Europe. People in Kent are thought to have their origins in Iron Age and medieval continental Western Europe. My previous analyses also show that they have very little recent non-European admixture, less than the French and Germans.

I use the original Haak et al. 2015 and Lazaridis et al. 2014 data with additional British Kent and Finnish samples downloaded from the 1000 Genomes Project. Each sample consists of 555268 SNPs. The West Finns were filtered in three steps using a PCA of the 1000 Genomes data: 1) removing the 20 westernmost samples to get rid of possible Finland-Swedes, 2) splitting the remaining 80 into eastern and western groups, and finally 3) randomly picking 13 western samples. The Kent samples were randomly sampled as well.
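The three filtering steps can be sketched roughly as follows. This is only an illustration, not the script actually used: the function name `pick_west_finns`, the sample IDs, the PC1 values, the assumption that lower PC1 means "more western", and the even split of the remaining samples into two halves are all hypothetical choices made for the example.

```python
import random

def pick_west_finns(pc1_by_sample, n_drop=20, n_pick=13, seed=0):
    """Sketch of the three-step filter (hypothetical helper).

    pc1_by_sample maps a sample ID to its coordinate on the PCA axis that
    separates west from east. ASSUMPTION: lower values are more western.
    """
    # Order the samples from west to east along the PCA axis.
    ordered = sorted(pc1_by_sample, key=pc1_by_sample.get)
    # Step 1: drop the westernmost samples (possible Finland-Swedes).
    remaining = ordered[n_drop:]
    # Step 2: split the rest into western and eastern groups.
    # ASSUMPTION: an even split; the post does not say where the cut was made.
    half = len(remaining) // 2
    western = remaining[:half]
    # Step 3: pick a random subset of the western group.
    rng = random.Random(seed)
    return rng.sample(western, n_pick)

# Made-up data: 100 samples spread evenly along the west-east axis.
samples = {f"FIN{i:03d}": i / 100 for i in range(100)}
picked = pick_west_finns(samples)
print(sorted(picked))
```

With real data the PC1 values would come from the PCA run itself, and the sign of the axis would have to be checked against samples of known origin before deciding which end is "west".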

The data is available if someone wants to repeat my tests or run their own. Please contact me in that case.

The first task is to verify the data. For this purpose I made three PCA plots:

Before testing admixtures it is a good idea to look at the genome-wide distances between the British Kent samples and other Europeans. I do this using two outgroups: the first one is extreme (Chimp), and the other (Ju-hoan-North) sets the baseline.

Admixture Dstat analyses follow the formula: 

dstat(Kent,non-European population:Outgroup,European population).  

If the result is negative, the European population is closer to Kent than to the non-European population on the axis; the larger the negative value, the closer the population is to Kent relative to the non-European population. Be aware that this test does not measure how much of the non-European admixture in question the tested population has. It measures the genome-wide genetic distance between populations, which depends mainly on the shared history of each population pair. If the tested European population is “multimixtured”, the result may surprise a reader who has only seen analyses of minor admixtures. In other words, your genetic profile can be A1+b or A2+c, where b and c are minor admixtures. With PCA or ADMIXTURE you cannot work out the overlap between A1 and A2 without knowing both minor admixtures, but you can use dstat to determine genome-wide similarity.
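To make the sign convention concrete, here is a minimal sketch of a D statistic computed from per-SNP allele frequencies. It assumes the frequency-based form D(W,X;Y,Z) = Σ(w−x)(y−z) / Σ(w+x−2wx)(y+z−2yz), which is my understanding of what qpDstat computes; the function name `d_statistic` and all the frequencies below are invented for illustration. It also leaves out the block jackknife that qpDstat uses to produce a Z-score.

```python
def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) from per-SNP allele frequencies (lists of floats in [0, 1]).

    ASSUMPTION: frequency-based form of the statistic; no standard error
    or Z-score is computed here.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(w, x, y, z):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den

# Toy allele frequencies at four SNPs (made up, not real data).
kent = [0.9, 0.1, 0.8, 0.2]
non_european = [0.2, 0.7, 0.3, 0.9]
outgroup = [0.5, 0.5, 0.5, 0.5]
european_near_kent = [0.8, 0.2, 0.7, 0.3]
european_near_other = [0.3, 0.6, 0.4, 0.8]

# dstat(Kent, non-European : Outgroup, European)
d_near_kent = d_statistic(kent, non_european, outgroup, european_near_kent)
d_near_other = d_statistic(kent, non_european, outgroup, european_near_other)
print(d_near_kent < 0, d_near_other > 0)
```

In this toy setup the test population whose frequencies track Kent gives a negative value and the one tracking the non-European population gives a positive value, which matches the reading described above.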

3 comments:

  1. D stats are quite tricky to estimate proportions from;

    One thing with them is that, taking a Dstat like D(Kent,Mongolian;Ju_Hoan_North,European), a population could tend towards 0 because it has balanced effects from both Kent and Mongolian ancestry, as you would expect from a population with a roughly (probably not exactly) equal mix of Kent and Mongolian ancestry, *or* it could tend to 0 because it is equally unrelated to both (as a population like Yoruba would be, as an extreme example, descending from many people who played no part in the founding of Eurasia).

    You could sort of visually control for this by using your innovative Wide genome similarity measure as the X and then stats like D(Kent,Mongolian;Ju_Hoan_North,European) as the Y. That would visually distinguish between populations like Cypriots who tend more towards a 0 on D(Kent,Mongolian;Ju_Hoan_North,European), but who don't share much drift with Kent overall, vs populations who do share more drift with Kent and yet still tend more to 0 on D(Kent,Mongolian;Ju_Hoan_North,European) than others.

    1. Of course, you are right: this method doesn't show admixture amounts, only a balance between baselines (e.g. Mongolian and Kent). Every method has weaknesses. As far as I know, today we have only one freely available program that tries to give absolute results: qpAdm. If you have used it you know how painful it is in practice, precisely because it tries to estimate real admixture amounts.

      Neither PCA nor ADMIXTURE gives reliable results for admixture amounts in proportion to ALL tested populations and individuals. However, qpDstat is a wonderful tool because, unlike PCA and ADMIXTURE, it simply makes genome-wide pairwise comparisons between pooled populations. PCA and ADMIXTURE analyse all populations simultaneously, but those tools have no idea about the directions of gene flow, so the result can be wrong. We have one tool that gives gene flow directions, Chromopainter, but it is YOU who makes the decisions; the software itself can't. What is important when using qpDstat is that you don't place two "unknowns" on the same side of the colon. If you put two "unknowns" on the same side, such as an ancient sample set and the modern populations you are testing, you can't expect reliable results.

      In other words, we have many programs, but none of them is infallible. So those of us making analyses have a responsibility not to misinform readers.

  2. One more point. While the formula dstat(Kent,non-European population:Outgroup,European population) doesn't show admixture amounts, only a relative position between baselines (because we have two variables and two constants, one of them an outgroup), another formula, dstat(Ju_hoan-North,X:Chimp,Kent), shows the exact genetic distance between Kent and other Europeans from the baselines, because it uses one variable and three constants, two of which are outgroups, i.e. distant ancestral observation points.