Tuesday, December 17, 2013

Late settlement Finns

If you have followed my blog, you may remember what I wrote about the Finnish settlements (Testing 23andme's Ancestry Composition continues / Finnish results), here:

“The term late settlement is used by Finnish historians and means areas which were mostly populated during the Swedish era by administrative transactions (king’s orders) or by occupying areas in wars between Sweden and Novgorod/ later Moscow.   This means actually that the age of Finnish reference group (used by 23andme) is around 500-700 years and people living in older settlements …”


We have a problem when analyzing young, expanding settlements and populations, because they carry more genetic drift than old populations. In fact old rural populations also show genetic drift, but only very locally.  I encountered the same problem with the Utah CEU samples and resolved it by selecting the least drifted of them.  Obviously I can’t do the same with the Finnish samples without my neutrality being questioned.   I also wrote about genetic drift here (the chapter written in English):


“At first I found that all groups with high genetic drift due to isolation will strongly distort the result.   It was easy to see the effect of genetic drift and the consequential distortion, for example I dropped out a lot of HGDP-CEU samples being too homogeneous or drifted.  Young genetic drift generating own genetic components in analyses inside one sample group doesn’t figure their older common history with other groups, the reason why groups with genetic drift are useless in searching the common history of  populations.   They will also affect the root population where they come from.  This kind of genetic drift can be found from rapid expansions in some subpopulation, like in villages or in smaller cultural communities.”


So what to do?


If we want to see behind the birth of local settlements, we should get rid of the genetic drift in the results and prevent the PCA from generating drift components.  It is not difficult at all.  We only need to be aware of the sample size (number of samples) that triggers the formation of young drift components.  It can be anything from two to tens of samples, depending on the sample set.  At the moment I don’t know exactly how many late settlement Finns are needed to reach the threshold value (because I don’t have enough of them).  But I don’t need to know it, because even a few late settlement Finns belonging to the same root population show the trend: where they belong without genetic drift and where they came from (or at least where a significant part of their ancestors came from).    I can add more samples and nothing will change until I reach the threshold value at which the PCA starts generating drift components on the dimensions we want to see.
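The threshold idea can be tried out on simulated data. The following is only a minimal sketch under invented assumptions: two background populations A and B, a drifted offshoot of A, made-up allele frequencies and drift strengths, and a plain SVD-based PCA. None of the names or parameters come from the software or the samples used for the plots in this post. It measures how much of the first principal component is carried by the drifted samples when there are few of them and when there are many.

```python
import numpy as np


def simulate_genotypes(n_drifted, n_background=30, n_snps=5000, seed=0):
    """Simulate 0/1/2 genotypes for populations A, B and a drifted offshoot of A.

    Every parameter here is invented for illustration only.
    Returns (genotypes, is_drifted), where is_drifted marks the offshoot samples.
    """
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.2, 0.8, n_snps)          # shared ancestral frequencies
    delta = rng.normal(0.0, 0.1, n_snps)          # old A/B differentiation
    p_a = np.clip(base + delta / 2, 0.05, 0.95)
    p_b = np.clip(base - delta / 2, 0.05, 0.95)
    # Young drift: extra random frequency change on top of population A.
    p_d = np.clip(p_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
    blocks = [
        rng.binomial(2, p_a, size=(n_background, n_snps)),
        rng.binomial(2, p_b, size=(n_background, n_snps)),
        rng.binomial(2, p_d, size=(n_drifted, n_snps)),
    ]
    genotypes = np.vstack(blocks).astype(float)
    is_drifted = np.arange(genotypes.shape[0]) >= 2 * n_background
    return genotypes, is_drifted


def pca_scores(genotypes, n_components=2):
    """Sample scores on the leading principal components (column-centered SVD)."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def drifted_share(scores, is_drifted, component=0):
    """Fraction of a component's squared scores that comes from the drifted samples."""
    squared = scores[:, component] ** 2
    return float(squared[is_drifted].sum() / squared.sum())


def share_for(n_drifted):
    """Share of PC1 carried by the drifted group for a given group size."""
    genotypes, is_drifted = simulate_genotypes(n_drifted)
    return drifted_share(pca_scores(genotypes), is_drifted)


if __name__ == "__main__":
    for n in (3, 30):
        print(n, "drifted samples -> share of PC1:", round(share_for(n), 3))
```

In this toy setup one would expect the share to stay small with three drifted samples, where the first component still describes the older A/B split, and to grow once the drifted group is large enough to pull the component towards itself. Where that switch happens depends entirely on the invented parameters, which matches the remark above that the threshold depends on the sample set.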




The first PCA includes the same Eurasian samples that I used when analyzing old settlement Finns.  In this case the Finnish samples (SK0001, SK0002 and SK0003) are located clearly inside the North Russian cluster, but on the opposite side from the Slavic Belarusians.  After adding more late settlement Finns this would look more dramatic.  I can’t avoid concluding that Northern Russians (Vologda people and Mordvas) are a mixture of Finnish look-alike people and Slavs.





You can see an image with better resolution here


Secondly, here is the same European plot as before with old settlement Finns.  Now many Caucasian and Eurasian components are missing compared to the Eurasian plot, and the Finnish samples move towards the Lithuanians, who represent the gene pool around the eastern Baltic Sea region.   This effect would be stronger with Estonians, and strongest with late settlement Finns.  It happens because of the gene flow between late settlement Finns and old settlement Finns, and between them and Estonians.  I don't know whether this gene exchange happened before or after they adopted North Russian genes.  Maybe the Baltic-Finnic gene pool was much more widespread before the Slavic expansion.





You can see an image with better resolution here


My last graphic shows how the three Finns under test relate to the effective PCA components.  Sorry, this is available only for the European PCA; I was too lazy to work with the bigger Eurasian data.




Saturday, December 7, 2013

Testing 23andme's Ancestry Composition continues

I am happy to report that 23andme has fixed a few weird things I noticed two weeks ago.   Two Finns with obviously non-Finnish European admixture now show sensible results.  The first one now shows 37% Finnish, having been 100% Finnish before the fix.  A huge difference. The second Finn now shows 47.9%, having earlier been 99.9% Finnish.   I was not cheating.

What’s new

Basically Ancestry Composition is unchanged; only the software engine behind the user interface has been revised.  Now it is time to go ahead and look at the new results.  I have made some statistics.   The first graphic shows the results per country: how well 23andme has succeeded in assigning people to their own national gene pool.     Of course it made no sense to select national groups without their own reference group, like Estonians.  They look “mixed” regardless of their level of genetic diversity.

All Finns with known recent foreign admixture are excluded, as are Swedish-speaking citizens, but of course I can’t know what happened hundreds of years ago.   I can’t guarantee that all Finns have been the same people since the ice age, or even since the Roman Iron Age.   The Balts include Lithuanians and Latvians. The Russian group includes only ethnic Russians.

Secondly we see the standard deviation of the results for each country.  It is worth noticing that even when the national proportion is very low, as in the case of Scandinavians, the deviation within the own gene pool reflects a population diversity comparable to that of countries with higher average numbers.    This is one of those weird things related to admixture analyses, and it sometimes misleads people into thinking that an analysis showing plenty of admixture means high diversity.   That is a wrong conclusion.  Admixture results show only how much some corresponding part of your genome resembles the chosen reference set.
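To make the distinction between the average and the deviation concrete, here is a small sketch with invented percentages (they are not 23andme results, and the function name is hypothetical). It shows that two groups can have very different average own-group percentages and still have exactly the same spread, so a low average by itself says nothing about diversity.

```python
from statistics import mean, stdev


def summarize_own_group(percent_by_country):
    """Map each country to (mean, sample standard deviation) of its own-group percentages."""
    return {
        country: (mean(values), stdev(values))
        for country, values in percent_by_country.items()
    }


# Invented numbers, for illustration only (not 23andme results).
example = {
    "high_mean": [90.0, 95.0, 100.0],
    "low_mean": [30.0, 35.0, 40.0],
}

if __name__ == "__main__":
    for country, (avg, sd) in summarize_own_group(example).items():
        print(f"{country}: mean={avg:.1f} sd={sd:.1f}")
```

Both invented groups vary by the same amount around their own average, although one average is far lower than the other.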


Finnish results

Looking at the results and the origin of Finns, we can be sure that 23andme uses Finns from the late settlements in building the Finnish reference set.   The term late settlement is used by Finnish historians and means areas which were mostly populated during the Swedish era by administrative transactions (king’s orders) or by occupying areas in wars between Sweden and Novgorod/ later Moscow.   This actually means that the age of the Finnish reference group is around 500-700 years, and people living in older settlements, in areas that were populated well before the Swedish crusades to Finland, are compared to them, not vice versa.   It is impossible to find out how many genes have moved during this 500-700 year period from old settlements to late settlements and how many from late settlements to old settlements, but we know the age of both populations.   The younger entity can’t be used to classify the older one.   Who populated the late settlements in Finland is another question.

Anyway, evaluating the error caused by this poor test arrangement and putting things together anew, we could try to estimate the lowest percentage for unmixed Finns by looking at Finnish history and personal data at 23andme.  At its lowest it would be around 70%, and somewhat below that in Southwestern Finland, because people there have given the fewest genes to the late settlements, fewer than Tavastians, Karelians and Ostrobothnians.   In SW Finland a bigger portion of the old Finnish heritage remains unknown and hides inside nonspecific numbers.   You can notice this, as well as the Swedish admixture level, by looking at your shared Finnish results in Ancestry Composition.   When the Finnish percentage is smaller than 70%, we can expect some foreign admixture, more than the corresponding average Finnish admixture in, for example, Sweden and Russia.

Some more points

The highest Finnish numbers seem to come from East Finland, near Iisalmi and Kuopio; the highest in East Europe come from the Baltic countries and the Pskov and Tver regions in Russia; and the highest Scandinavian number is from Värmland (sic!), Sweden, followed by Norwegians living near Värmland on the other side of the border between Norway and Sweden. I wouldn't say I felt any déjà vu when looking at these results; it is just boring to see admixture analyses do this again.  I must say we now need new ideas. Although 23andme obviously uses its own dedicated admixture model, it still falls into the same problem as all recent admixture models that derive results from a pure admixture model without taking genetic drift into account. Unfortunately admixture models conclude that gene flow always goes from homogeneous populations to more heterogeneous ones, without understanding the effect of genetic drift, which usually happens after gene flow in the opposite direction. This happens because admixture models don’t take timelines into account and assume that higher diversity is an admixture of present-day populations.

Sunday, December 1, 2013

Controlling data

What we get when looking for genetic samples for our analyses is somewhat coincidental.   We don't know whether our samples are typical representatives of the people they should represent according to the given label.  Usually researchers confirm only that the grandparents of the sampled individuals belonged to the stated group.  But are they third-generation immigrants, are they villagers from the same village, do they speak the same language or belong to a certain cultural group? We usually don't know.  We have to place a great deal of trust in the coordination of researchers and in what they have done all over the world.  It would be a good idea to report some key figures about the samples used, and to go further and compare these key figures between public databases and studies (using the same SNP sets, etc.).   After that we could see whether the results are comparable.

Here are two key figures for my European samples.

1. Similarity

This graphic shows the similarity of each population as the average IBS shared between the samples within each population (136835 SNPs):
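For readers who want to compute a comparable figure from their own data, here is a minimal sketch of an average within-population IBS (identity by state) score. It assumes genotypes coded as 0, 1 or 2 copies of one allele and no missing calls; the function name and the coding are assumptions for illustration, not a description of the software behind the graphic.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_ibs(genotypes):
    """Average IBS proportion over all sample pairs within one population.

    genotypes: array of shape (n_samples, n_snps) with values 0, 1 or 2.
    Per SNP, two samples share 2 - |g1 - g2| alleles by state, so the
    proportion shared is 1 - |g1 - g2| / 2.
    """
    g = np.asarray(genotypes, dtype=float)
    if g.ndim != 2 or g.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two samples")
    pair_scores = [
        np.mean(1.0 - np.abs(g[i] - g[j]) / 2.0)
        for i, j in combinations(range(g.shape[0]), 2)
    ]
    return float(np.mean(pair_scores))
```

For example, `mean_pairwise_ibs([[0, 1], [1, 1]])` compares one pair of samples over two SNPs: they share one of two alleles at the first SNP and both at the second, so the score is 0.75.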

2.  Level of homozygosity

This graphic shows the average homozygosity of each population (same data as above):
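A homozygosity level of this kind can be sketched in the same way. The snippet below assumes the same 0/1/2 genotype coding with no missing calls, and takes homozygosity to mean the fraction of SNPs at which a sample is not heterozygous, averaged over the samples of a population. The exact definition used for the graphic is not stated, so this is one plausible reading, and the function name is hypothetical.

```python
import numpy as np


def mean_homozygosity(genotypes):
    """Average fraction of homozygous genotypes (0 or 2) per sample.

    genotypes: array of shape (n_samples, n_snps) with values 0, 1 or 2.
    """
    g = np.asarray(genotypes)
    if g.ndim != 2 or g.size == 0:
        raise ValueError("need a non-empty 2-D array")
    per_sample = np.mean(g != 1, axis=1)  # share of non-heterozygous SNPs
    return float(per_sample.mean())
```

With `[[0, 1, 2, 2], [1, 1, 0, 2]]` the first sample is homozygous at three of four SNPs and the second at two of four, which gives an average of 0.625.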

In both cases the Finns belong to the selected old settlement group, and the CEU samples are a selection with low genetic drift; most CEU samples show significant genetic drift.