Tuesday, December 9, 2014

Big comparison of ancient samples

F3 statistics measure shared genetic drift between populations.  I did a simple run with the qp3Pop program (part of the Reich lab's ADMIXTOOLS package), comparing each of the ancient genomes below to present-day populations, and charted the results in Excel.  The larger the value, the more drift the present-day population shares with the ancient sample.

Ancient genomes:

Loschbour M Luxembourg_Mesolithic
LBK F GermanStuttgart_LBK
Otzi M Tyrolean_Iceman Copper Age
MA1 M Siberian_Upper_Paleolithic
AG2 M Siberian_Ice_Age
Scandinavian_HG M Swedish_HunterGatherer Neolithic
Scandinavian_farmer F Swedish_Farmer Neolithic
Motala12 M Swedish_Motala 7000 years old
LaBrana M LaBrana Mesolithic
AngloSaxon several samples UK Hinxton Iron Age
Briton several samples UK Hinxton Iron Age
NE1…NE7  Hungarian Neolithic
KO1 Hungarian Neolithic
CO1 Hungarian Copper Age
IR1 Hungarian Iron Age
BR2 Hungarian Bronze Age

The formula is f3(Mbuti;test1,test2)
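As a rough illustration of what this statistic measures, here is a minimal Python sketch of the plain f3 estimator computed from allele frequencies. The frequencies and the population names `close` and `distant` are invented for the example. The real qp3Pop run also corrects for sample size and reports standard errors, which this sketch leaves out.

```python
from statistics import mean

def f3(outgroup, pop1, pop2):
    """Plain f3(outgroup; pop1, pop2): mean over SNPs of (o - p1) * (o - p2)."""
    if not (len(outgroup) == len(pop1) == len(pop2)):
        raise ValueError("frequency lists must have the same length")
    return mean((o - a) * (o - b) for o, a, b in zip(outgroup, pop1, pop2))

# Made-up allele frequencies at four SNPs (illustration only, not real data).
mbuti   = [0.10, 0.80, 0.50, 0.30]   # outgroup
ancient = [0.60, 0.20, 0.90, 0.70]   # hypothetical ancient sample
close   = [0.55, 0.25, 0.85, 0.65]   # hypothetical population tracking the ancient sample
distant = [0.20, 0.70, 0.55, 0.35]   # hypothetical population staying near the outgroup

f3_close = f3(mbuti, ancient, close)
f3_distant = f3(mbuti, ancient, distant)
print(f3_close, f3_distant)
```

The population that has drifted away from the outgroup together with the ancient sample gets the larger value, which is the sense in which a bigger result means more shared drift.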

All results can be downloaded and unzipped here.

I couldn't resist looking into why I have many "cousins" at 23andme in Southern Russia and Ukraine, but none in Northern Russia.  This looked odd, because I have always heard that the Finns came from the north.  Indeed, Western Finns like me are closer to Ukrainians in terms of shared genetic drift than to Northern Russians (Mordvas and Russians living closer to the White Sea).  Surprising.

Sunday, December 7, 2014

Starting with new data

I am now starting with a new data set and hopefully I can give you more reliable results.  My new set is based on the data publicly available from Lazaridis et al. 2014.   Some additions were made:

 Finnish, British and CEU samples from the 1000-genomes project
 5 ancient British samples from Hinxton
 8 ancient Hungarian samples (from Y-Str server)

Moving to the Lazaridis data also meant changing to Affymetrix coordinates.   Affymetrix does not overlap well with most commercial SNP sets, but it seems to give better coverage for many ancient samples.  The switch was not a problem for me, because my private sample collection is small and I can give it up to take advantage of the better data.   Now I have

2244 samples
555268 SNPs

Most individual samples cover nearly all of the 550k SNPs, and many ancient samples also reach over 500k SNPs.
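Per-sample coverage like this can be checked directly from an unpacked EIGENSTRAT genotype file, where each line is one SNP, each character is one sample, and `9` marks a missing genotype. The sketch below assumes that plain-text layout (it will not read packed binary files), and the function name and the demo rows are made up for illustration.

```python
def snp_coverage(geno_lines):
    """Count non-missing genotypes per sample in EIGENSTRAT-style .geno text."""
    counts = None
    for line in geno_lines:
        row = line.strip()
        if not row:
            continue  # skip blank lines
        if counts is None:
            counts = [0] * len(row)  # one counter per sample column
        elif len(row) != len(counts):
            raise ValueError("rows differ in length")
        for i, genotype in enumerate(row):
            if genotype != "9":  # 9 = missing call
                counts[i] += 1
    return counts or []

# Four SNPs, four samples; the last sample is mostly missing, like a poor ancient genome.
demo = ["0129", "2209", "1119", "0021"]
coverage = snp_coverage(demo)
print(coverage)  # [4, 4, 4, 1]
```

For a real file the same function can be fed an open file handle, since it only iterates over lines.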

Present ancient genomes

I am not yet familiar with those ancient samples and it will take time to find out their secrets.

With new data I am also starting with a new software toolkit.   I found the Eigenstrat format handy, because it is easy to handle with SQL tools.  This decision led to another one: to use Eigensoft's software.  Luckily a new version was released just a few weeks ago.

It is time to uncork the new data, starting with Eigensoft's excellent PCA tool.  It can do many things, such as LD pruning, random sample selection and sample projection.   For now I use only the random sample selection, which ensures that no population is oversampled.   I use the whole 550k data set without LD pruning, for two reasons.  First, pruning is not needed to avoid excessive clustering: working with LD-pruned data I did not notice any improvement in clustering.   Second, LD pruning can be a disadvantage, because not all populations have the same genetic drift to remove.  In general LD pruning must be used carefully and it is not for dummies like me, so I use it seldom and trust the original data.
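The idea behind random sample selection can be sketched in a few lines of Python. This is not the PCA tool's own mechanism, only an illustration of capping every population at the same maximum size; the function name, the cap and the sample labels are all hypothetical.

```python
import random
from collections import defaultdict

def cap_per_population(samples, max_per_pop, seed=0):
    """Randomly keep at most max_per_pop samples from each population."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_pop = defaultdict(list)
    for sample_id, pop in samples:
        by_pop[pop].append(sample_id)
    kept = []
    for pop in sorted(by_pop):
        ids = by_pop[pop]
        if len(ids) > max_per_pop:
            ids = rng.sample(ids, max_per_pop)  # draw without replacement
        kept.extend((sample_id, pop) for sample_id in sorted(ids))
    return kept

# Hypothetical labels: one large population and one small one.
samples = [(f"FIN{i:03d}", "Finnish") for i in range(96)]
samples += [(f"SWE{i:03d}", "Swedish") for i in range(3)]
kept = cap_per_population(samples, max_per_pop=20)
print(sum(1 for _, pop in kept if pop == "Finnish"),
      sum(1 for _, pop in kept if pop == "Swedish"))  # 20 3
```

Small populations are kept whole, while large ones are thinned so that they do not dominate the principal components.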

PCA including Europeans

Fst-distance table.  The average standard error is 0.000762745, so a value can be off by roughly one thousandth.


Friday, November 21, 2014

Do-It-Yourself Dodecad test for Finns (including Baltic Finns in general)

Wondering about Finnish history and the migrations of the last 2000 years, I put together the following Do-It-Yourself Dodecad test.   My goal was a dedicated test for Finns, but it may also work for Estonians and other Baltic-Finnic people.  It does not work for other nationalities because of the regional selection of references.  What I have done differently from many other Dodecad Oracle tests is not only the choice of references but also a tighter qualification of the Finnish samples.  It is also worth mentioning that in some tests the preprocessing of genotype data has been biased.  My data include 290000 SNPs and involve no preprocessing based on differences between populations.  So it is as it is, straight from the stock.

Reference populations:


You don't need to worry about the "calculator effect", because all my data is from public academic sources.

To run the test you first have to download the DiyDodecad scripts.  You can do it here

Please note that DiyDodecad is authored by Dienekes and included in his Dodecad Ancestry Project:

After you have uncompressed all files into your own directory (for instance Kaleva) you have to download and uncompress four Kaleva-specific files to the same directory. 


Now everything is ready for the first analyses.  Read README.txt and follow Dienekes' instructions; the only difference is that you need to use KALEVA.par instead of the Dodecad dv3.par file.

Friday, November 14, 2014

Ancient British genomes from Hinxton reveal the eastern Iron Age frontier

It is time for ancient genomes.  A month ago I read about new ancient samples from Hinxton, England, and found them interesting in terms of Finnish history.  Those samples are around 1500-2000 years old, which makes them well suited for estimating Finnish western connections.  Finnish history in Finland is rather short: at best, bloodlines go back around 2000 years, quite a short history compared with many southern Europeans.


I am now using the same data as in my roll-off analyses.  As a reminder, I applied a very strict qualification to the Finnish samples to remove all recent admixture, meaning admixture from the beginning of the Swedish era in Finland onward.  All public western Finnish samples were selected by comparison with my own genealogically proven samples, and outliers were removed.


I used Reich’s three population test (qp3Pop) with default settings.

Before going forward, some words of caution.   After testing with larger data I realized that qp3Pop, too, assumes that less diverse populations are the source populations of more diverse ones; in other words, that diverse populations are usually composed of several less diverse populations.   This is not always true and is a rather mechanical view.  In genetics the process can run in reverse: a more diverse population can turn into a less diverse one through genetic drift.  This matters because drift is exactly what we are analyzing here.

Some general observations.  The problem mentioned above does not affect ancient samples, because they lived long before us and cannot violate causality.   However, a lack of diversity can lead to overestimated admixture.

I also have some results using present-day source populations, and those results can be problematic.  Nevertheless, despite the fact that some Finnish samples come from young isolates, I assume that my Eastern Finnish samples represent the historically least mixed Finnic speakers in my data, keeping in mind that I have no Finnic (Baltic Finnish) speakers from Russia.  Additional samples from Russia could tell us about possible admixture in East Finns.

AS - Hinxton Anglo-Saxons, BR - Hinxton Iron Age Britons, EF - East Finns, WF - West Finns
PL - Poles, LT - Lithuanians, EE - Estonians, MA - Maris, CH - Chuvashes, NR - Norwegians
MR - Mordvas, BU - Belarussians

Negative f3 values most likely mean that the target is admixed from both source populations.
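Why a negative value points to admixture can be seen from a short derivation. It is a sketch under a simplifying assumption: the target's allele frequency c is an exact mixture of the source frequencies a and b, with mixing proportion alpha, and no drift has happened since. The symbols a, b, c and alpha are introduced here for illustration only.

```latex
f_3(C; A, B) = \mathbb{E}\big[(c - a)(c - b)\big]

c = \alpha a + (1 - \alpha) b, \qquad 0 \le \alpha \le 1

c - a = (1 - \alpha)(b - a), \qquad c - b = \alpha (a - b)

f_3(C; A, B) = -\alpha (1 - \alpha)\, \mathbb{E}\big[(a - b)^2\big] \le 0
```

Drift in the target after the mixing adds a non-negative term to this expression, so a positive value does not by itself rule out admixture.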



The Western Finnish map shows high ancient admixture; the Anglo-Saxon - East Finnish admixture in particular stands out!  Estonians show admixture with almost all their neighbors, which may point to continuous migration into Estonia throughout history.

Another way to look for speculative admixture between source populations is to pick the least probable target population, in this case African Pygmies.  With this method we see that Norwegians are the most Anglo-Saxon-like and Lithuanians the most Iron Age Briton-like.  West Finns come third on the Anglo-Saxon axis.  This probably means that West Finns have Anglo-Saxon-like ancestry, or that the Anglo-Saxons shared Fennoscandian ancestry with West Finns.  All those with more Iron Age British ancestry than West Finns (NR, LT, EE, PL) likely have more ancient Celtic ancestry from Central Europe.

 edit Su 16.11.2014

I thought it would be interesting to know more about the western outlier group, the Finns who are more western on PCA plots than the genealogically proven West Finnish group.  This is done by comparing both western Finnish groups to East Finns; that is, the East Finns serve as a fixed landmark on which to base the comparison.  The results are mixed, with negative values for f3(WF;EF,test) and f3(WF2;EF,test), where "test" includes Estonians, Swedes, Iron Age Britons and ancient Anglo-Saxons.

The result shows that the western outliers have more Iron Age Briton, more Estonian and more Swedish ancestry than the genealogical western group, but slightly less ancient Anglo-Saxon ancestry.  The result also indicates that both western Finnish groups have significant East Finnish-like ancestry.

The abbreviation "SE" stands for three Swedish samples who show only very little Finnish ancestry at 23andme's Ancestry Composition.

edit. Mon 17.11.2014

Another graphic, showing the ratio of Swedish to ancient Anglo-Saxon admixture among Finnish individuals, both admixtures obtained with the 3Pop software.  The East Finnish group is used as a fixed landmark.   The individual difference between AS and SE was used as the sort key, and the trend line shows the linear difference.   I would also have made comparisons to other populations, such as Estonians, but differences between the SNP sets made individual-level comparison impossible.

Although the ancient Anglo-Saxon and Swedish admixtures track each other, my judgement is that the bigger AS is compared to SE, the bigger the ancient admixture is, and vice versa.

Friday, November 7, 2014

The long and dark shadow of history

Since the last post I have done a lot of testing.  I have tried to find the limitations of the analysis tool as well as to improve my own understanding of what the results mean.  There is still much work to do, but I am moving forward piece by piece, trying to shed light on Finnish genetic history.  For this purpose I started my shared-LD tests with present-day populations, not ancient ones, although it would be more intriguing to resolve the big historical questions of our deep past.


The data consist mainly of publicly available academic samples; anyone can download the same samples over the internet.  In addition I have a few genealogically classified Finnish samples of my own.  I use them to categorize the public Finnish samples, because the public data include some Finns with foreign admixture.

Finns 96  1000genomes
Finns 7 my own collection
Norwegians 15 other sources
Poles 10 other sources
Belarussians 9 Est.BC
Chuvashes 16 Est.BC
Estonians 13 Est.BC
Lithuanians 10 Est.BC
Maris 15 Est.BC
Mordvas 14 Est.BC
Ukrainians 16 Est.BC
Swedes 3 my own collection

Preparing data

I found that the maximum SNP overlap in my data is around 550000 SNPs and the minimum around 290000.  The number used in a given test varies with the selected reference and target populations.  I also found that the minimum SNP space for reliable results is over 20 million SNPs.  It is likely, however, that more SNPs per individual sample would give steadier LD sharing, smoother roll-off curves and a smaller standard error than a larger number of samples would.   It would be better to have millions of SNPs per individual, but I did not see big quality differences when comparing the curves in this test with similar results obtained by authors using the same programs, so I suggest that our data are of similar quality as far as reasonable results are concerned.

Preparing  the Finnish data 

In the first step I ran a European-level PCA to spot possible foreign admixture and removed 13 Finnish samples located west of my genealogical West Finns.  Second, I ran a new PCA including only Finnish samples and divided them into three groups: the 19 easternmost samples (with 11 outliers excluded), the 17 westernmost samples (including my genealogical West Finns), and the rest, which form an intermediate group.  This arrangement gave me distinct eastern and western groups as well as a working Finnish reference (56 samples), on the reasoning that the intermediate group probably consists of the purest present-day Finns.


My aim is to begin with Reich's programs, starting with Rolloff.  Rolloff is a program that outputs the LD shared by a target population, filtered through two reference populations.  By changing the references you can search for different mixing routes for the target.  It also gives an estimate of the admixture time.  This dating assumes a single pulse of admixture between the references, so continuous gene flow will give erroneous admixture times, while still showing real admixture.
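The dating idea can be sketched in Python. Under the single-pulse assumption, admixture LD is expected to decay roughly as A0 * exp(-n * d), where d is genetic distance in Morgans and n is the number of generations since admixture. The sketch below fits that decay with a simple log-linear regression on noise-free synthetic values; it is not Rolloff's own fitting procedure, and the function name and all numbers are invented for illustration.

```python
import math

def fit_admixture_generations(distances_cm, ld_values):
    """Least-squares fit of ln(LD) = ln(A0) - n * d, with d in Morgans; returns n."""
    xs = [d / 100.0 for d in distances_cm]   # centimorgans -> Morgans
    ys = [math.log(v) for v in ld_values]    # requires strictly positive LD values
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx                        # the slope of the line is -n

# Synthetic decay curve sampled every 0.5 cM, generated with n = 50 generations.
true_n = 50
distances = [0.5 * k for k in range(1, 41)]  # 0.5 cM ... 20 cM
curve = [0.02 * math.exp(-true_n * d / 100.0) for d in distances]

estimate = fit_admixture_generations(distances, curve)
print(round(estimate, 3))  # 50.0
```

With continuous gene flow the real curve is a mixture of many such exponentials, which is why a single fitted n can be misleading even when the admixture itself is real.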


All analyses were run with Rolloff's defaults, except that the resolution was 0.5 cM instead of 1 cM.   I tested both values and did not notice the lower value increasing the standard error; on the contrary, it reduced it a bit.  I also noticed that Alder (another roll-off program) uses this lower value.   The lower the value, the more finely the LD curve is sampled.   Too fine a resolution, however, increases statistical noise.

These results were surprising, but the truth is that similar shared-LD tests have apparently never been done on Finns before, so I had no expectations.  I can only say that if someone finds these results unexpected, do not shoot the messenger; I would rather repeat my tests, perhaps under tighter quality control, if you wish.  I would be happy to see new results so that possible differences can be evaluated.

These results suggest that the Finnish genetic shape is the outcome of several migrations and admixture events, more than I would have expected from PCA and formal admixture analyses based on conventionally LD-pruned data.  The big genetic difference (in Fst distances) between East and West Finland might be due more to migration history than to genetic drift.  Eastern Finnish results show a rather young southeastern or eastern admixture history (Mordvas), while western results show older southern admixture (present-day Belarussians).  Both groups also show northeastern admixture (Mari->Saami?).  It is possible that all three populations are proxies; this is most likely true in the case of Belarussians.

The common history with present-day Scandinavians is smaller and older than expected, but this does not rule out possible ancient regional migrations from there to Finland.  Unfortunately I do not have enough samples to check it, and for Scandinavian migrations to Finland before the Swedish era my hopes rest more on ancient genomes.  It is worth noting that I removed all known foreign admixture, including obvious Finland-Swedish samples.  That was possible thanks to my genealogical western Finnish data.

It looks like there was no notable migration from Estonia to Finland after the common language diverged, and that southern migrations to Finland bypassed Estonia.

I am going to estimate admixture amounts in the following analyses.

Admixture times for Finnish people

Related roll-off graphics

Related PCA-plots

PCA dimensions 1 and 2

PCA dimensions 1 and 3

edit 9.11.14

Yesterday I got feedback suggesting that I could verify my results by checking the French admixture among Finns.  My first thought was oh no, I am not going to start qualifying software that has been used in several academic studies.  It is in principle unfair to ask me to do such a thing.  But then I reconsidered.  Why not; and by using Spaniards I could check whether the admixture time fits the Stone Age and the times when southern migration waves expanded to Finland.  Here is the result.
Admixture time: 197.139 generations ± 55.497 = 5914 years ± 1665 years.
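The conversion from generations to years in the line above is consistent with a generation length of 30 years. That figure is inferred from the numbers themselves, not stated in the output, so the short Python check below treats it as an assumption.

```python
YEARS_PER_GENERATION = 30  # assumption inferred from the reported figures

def generations_to_years(generations, std_error):
    """Scale an admixture date and its standard error from generations to years."""
    return generations * YEARS_PER_GENERATION, std_error * YEARS_PER_GENERATION

years, years_se = generations_to_years(197.139, 55.497)
print(round(years), round(years_se))  # 5914 1665
```

A different assumed generation length would scale both the date and its error by the same factor.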