Tuesday, December 9, 2014

Big comparison of ancient samples

The f3 statistic measures shared genetic drift between populations. I did a simple run with ancient genomes using the qp3Pop program (part of the ADMIXTOOLS package), comparing each ancient genome to present-day populations, and moved the results into Excel charts. The larger the value, the more drift the present-day population shares with the ancient sample.

Ancient genomes:

Loschbour M Luxembourg_Mesolithic
LBK F GermanStuttgart_LBK
Otzi M Tyrolean_Iceman Copper Age
MA1 M Siberian_Upper_Paleolithic
AG2 M Siberian_Ice_Age
Scandinavian_HG M Swedish_HunterGatherer Neolithic
Scandinavian_farmer F Swedish_Farmer Neolithic
Motala12 M Swedish_Motala 7000 years old
LaBrana M LaBrana Mesolithic
AngloSaxon several samples UK Hinxton Iron Age
Briton several samples UK Hinxton Iron Age
NE1…NE7  Hungarian Neolithic
KO1 Hungarian Neolithic
CO1 Hungarian Copper Age
IR1 Hungarian Iron Age
BR2 Hungarian Bronze Age

The formula is f3(Mbuti;test1,test2)
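To make the formula concrete, here is a minimal sketch of a naive outgroup f3 computed from allele frequencies. It is only an illustration: qp3Pop works on genotype data, corrects for finite sample size and reports block-jackknife standard errors, none of which is done here. The function name and the input lists are my own assumptions, not part of any package.

```python
from math import isnan

def outgroup_f3(p_out, p_a, p_b):
    """Naive f3(Out; A, B): mean over SNPs of (p_out - p_a) * (p_out - p_b).

    Each argument is a list of allele frequencies (0..1), one per SNP,
    in the same SNP order. NaN marks a missing frequency.
    """
    total = 0.0
    used = 0
    for o, a, b in zip(p_out, p_a, p_b):
        # Skip SNPs where any of the three populations has no data
        if isnan(o) or isnan(a) or isnan(b):
            continue
        total += (o - a) * (o - b)
        used += 1
    if used == 0:
        raise ValueError("no SNPs with data in all three populations")
    return total / used
```

With Mbuti as the outgroup, a larger value means test1 and test2 share a longer branch of common drift after splitting from the outgroup.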

All results can be downloaded and unzipped here.

I couldn’t resist looking into why I have so many “cousins” at 23andMe in Southern Russia and Ukraine, but none in Northern Russia. This looked weird, because I have always heard that the Finns come from the north. Indeed, Western Finns like me are closer to Ukrainians in terms of shared genetic drift than to Northern Russians (Mordva and Russians living closer to the White Sea). Surprising.

Sunday, December 7, 2014

Starting with new data

I am now starting with a new data set and hopefully I can give you more reliable results.  My new set is based on the data publicly available from Lazaridis et al. 2014.   Some additions were made:

 Finnish, British and CEU samples from the 1000-genomes project
 5 ancient British samples from Hinxton
 8 ancient Hungarian samples (from Y-Str server)

Moving to Lazaridis’ data also meant changing to Affymetrix coordinates. Affymetrix doesn’t fit well with most commercial SNP sets, but it seems to give better coverage for many ancient samples. Switching to Affymetrix was not a problem for me, because my private sample collection is not big and I can give it up and take advantage of the better data. Now I have

2244 samples
555268 SNPs

Most individual samples cover nearly all of the 550k SNPs, and many ancient samples also reach over 500k SNPs.

Present ancient genomes

I am not yet familiar with these ancient samples, and it will take time to find out their secrets.

Since I have new data, I am also starting with a new software toolkit. I found the Eigenstrat format handy, because it is easy to handle with SQL tools. This decision led to another one: to use Eigensoft’s software. Luckily they released a new version just a few weeks ago.
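As an illustration of how well the Eigenstrat format goes together with SQL, here is a minimal sketch that loads the lines of an .ind file (sample ID, sex, population label) into an SQLite table and counts samples per population. The table name, column names and function names are my own assumptions, and this is not necessarily how my own database is set up.

```python
import sqlite3

def load_ind(conn, ind_lines):
    """Load Eigenstrat .ind lines (sample ID, sex, population) into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ind ("
        "sample_id TEXT PRIMARY KEY, sex TEXT, population TEXT)"
    )
    rows = []
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:  # ignore blank or malformed lines
            rows.append((fields[0], fields[1], fields[2]))
    conn.executemany("INSERT INTO ind VALUES (?, ?, ?)", rows)
    conn.commit()

def population_sizes(conn):
    """Return (population, sample count) pairs sorted by population name."""
    query = (
        "SELECT population, COUNT(*) FROM ind "
        "GROUP BY population ORDER BY population"
    )
    return conn.execute(query).fetchall()
```

The same idea extends to the .snp file, which is also a plain whitespace-separated table with one SNP per line.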

It is time to uncork the new data, starting with Eigensoft’s excellent PCA tool. It can do many things, such as LD-pruning, random sample selection and sample projection. For now I use only the random sample selection, which ensures that no population is oversampled. I use the whole 550k data set and I do not use LD-pruning, for two reasons. First, it is not necessary for avoiding excessive clustering: working with LD-pruned data I did not notice any improvement in clustering. Secondly, LD-pruning can be a disadvantage, because not all populations have the same amount of genetic drift to remove. In general LD-pruning must be used carefully and it is not for dummies like me, so I use it seldom and trust the original data.
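The idea behind random sample selection can be sketched as follows: group the samples by the population label in the .ind file and keep at most a fixed number of randomly chosen individuals per population. This is only an illustration of the principle, not the PCA tool's own implementation, and the function name and parameters are my own assumptions.

```python
import random
from collections import defaultdict

def cap_population_sizes(ind_lines, max_per_pop, seed=0):
    """Return the set of sample IDs kept after capping each population.

    ind_lines: Eigenstrat .ind lines (sample ID, sex, population label).
    max_per_pop: largest number of individuals kept per population.
    """
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    by_pop = defaultdict(list)
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:  # ignore blank or malformed lines
            by_pop[fields[2]].append(fields[0])
    kept = set()
    for sample_ids in by_pop.values():
        if len(sample_ids) > max_per_pop:
            # Oversampled population: draw a random subset
            sample_ids = rng.sample(sample_ids, max_per_pop)
        kept.update(sample_ids)
    return kept
```

Small populations are kept whole, so only the heavily sampled ones are thinned out.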

PCA including Europeans

Fst-distance table. The average standard error is 0.000762745, meaning that an individual Fst value can be off by roughly one thousandth (0.001).