Friday, 14 October 2016

Worldwide admixture analysis based on 14.4 million SNPs

The EGDP data, available from the Estonian Biocenter, made it possible to reach 15-30 times higher genome density than earlier available data allowed.  The new data lacks West European samples, but this was not a big problem, because western data is publicly available from the 1000 Genomes Project.  So I merged these two data sets.  As a quality check I computed heterozygosity rates for all European samples in both data sets and found the two sets to be very close to each other, although the read depth of the 1000 Genomes data is lower.  In fact, Finnish samples in both sets showed exactly the same level of heterozygosity.
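
A per-sample heterozygosity rate of the kind compared above can be computed from genotype counts. The sketch below is only an illustration: it assumes counts in the style of Plink's `--het` report (observed homozygous genotypes and non-missing genotypes per sample). The function names and all numbers are invented for the example and are not values from this analysis.

```python
from statistics import mean

def heterozygosity_rate(observed_hom, non_missing):
    """Fraction of non-missing genotypes that are heterozygous."""
    if non_missing <= 0:
        raise ValueError("non_missing must be positive")
    return (non_missing - observed_hom) / non_missing

# Hypothetical per-sample counts: (data set, observed homozygous, non-missing).
samples = [
    ("EGDP", 700_000, 1_000_000),
    ("EGDP", 690_000, 1_000_000),
    ("1000g", 705_000, 1_000_000),
    ("1000g", 695_000, 1_000_000),
]

def mean_rate_by_set(rows):
    """Average heterozygosity rate per data set."""
    by_set = {}
    for name, hom, non_missing in rows:
        by_set.setdefault(name, []).append(heterozygosity_rate(hom, non_missing))
    return {name: mean(rates) for name, rates in by_set.items()}

rates = mean_rate_by_set(samples)
for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
```

If the two data sets are compatible, the per-set averages for the same population should be close, which is the comparison described above.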

After the successful merge I had 14.4 million SNPs over all 22 chromosomes, which was far too much to process in a few days on my desktop (i7, 3.5 GHz, 32 GB memory).  Instead of thinning the whole data set to 1-2 million SNPs, I decided to use chromosomes 1 and 6 and leave the genome density untouched.  So I had two chromosomes with a bit over 2 million SNPs, still showing 15-30 times more genotype information per chromosome than other available genotype sets.  Thinning over all chromosomes to make the data set small enough for my computer would likely have introduced more algorithm-dependent bias, which I wanted to avoid.

The process

1 merging the EGDP and 1000g data sets
2 quality checks, including homozygosity/heterozygosity ratios per population
3 extracting chromosomes 1 and 6
4 thinning the data with Plink:   plink --file data --indep 50 5 2, resulting in 1.1 million SNPs
5 running admixture analyses with K values from 3 to 13 in unsupervised mode and without reference populations (i.e. without projection).
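
Steps 4 and 5 can be sketched as a list of command lines. This is a sketch under assumptions, not the exact commands used here. Only the `--indep 50 5 2` call is taken from the list above. The file prefix `data_pruned`, the `--extract`/`--make-bed` step and the use of ADMIXTURE's `--cv` flag are assumptions. The script only builds and prints the commands; it does not run them.

```python
def build_commands(prefix="data", k_min=3, k_max=13):
    """Return the command lines as argument lists; nothing is executed."""
    pruned = prefix + "_pruned"  # hypothetical output prefix
    commands = [
        # Step 4: pruning with the parameters given in the text.
        # Without --out, Plink writes the kept SNP IDs to plink.prune.in.
        ["plink", "--file", prefix, "--indep", "50", "5", "2"],
        # Assumed follow-up: keep only the retained SNPs, write binary files.
        ["plink", "--file", prefix, "--extract", "plink.prune.in",
         "--make-bed", "--out", pruned],
    ]
    # Step 5: one unsupervised ADMIXTURE run per K, with cross-validation.
    for k in range(k_min, k_max + 1):
        commands.append(["admixture", "--cv", pruned + ".bed", str(k)])
    return commands

commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Each K gets its own independent run, which is why the component colors do not line up between charts.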

Each K value was run in unsupervised mode without reference data, because projection reference data is not available for this SNP set.  You can see analyses using a projection reference, for example, in works analysing ancient and modern genomes together.  Analyses based on any kind of projection are cool, because we have no other way to assign ancestry proportions to ancient samples relative to modern ones.  I am not saying that unsupervised analysis without references is error-free, but its errors are systematic and not user-dependent.

All analyses (K values from 3 to 13) were done as individual runs without user supervision, and for that reason the colors on the charts are not consistent (making the colors consistent sounded like painful work).  Each analysis is optimized separately by the Admixture algorithm.  All this makes it more difficult to perceive differences between K values, but as soon as you get the idea I am sure you can also see the big picture and understand the details.

Hopefully this test is helpful for you.  In my opinion it gives interesting hints about Finnish relations with other populations, but the analysis itself is worldwide.

- Mordvins seem to differ from other Volga-Finnic populations and to belong to Balto-Slavic ancestry; they are probably language shifters from a Baltic to a Volga-Finnic language.

- Estonians are just what can be expected: some Estonians have Baltic ancestry, others Baltic-Finnic ancestry.  We should, however, be cautious in using linguistic terms when we speak about ancestry.

- North Russian Finno-Ugric populations seem to be Baltic-Finnic people with Siberian admixture.  The Siberian admixture is present in a lesser amount among Finns and Estonians (note that the amount of minor admixtures depends on the data/populations used, and that Admixture estimates proportions relative to the populations included).

- to some extent Swedes also show Baltic-Finnic ancestry, but the Swedish sample size is rather small for a firm conclusion.  However, if this is true, we can assume that present-day Baltic-Finnic people have largely Fennoscandinavian ancestry.

- Ingrian samples look like pure, unadmixed Baltic-Finnic people, which surprises me given their long-lasting minority status in Russia. The sample collectors have done good work.  Those samples are valuable indeed.

- thinking about all this and trying to reconstruct the history of Baltic-Finnic people, it looks like they lived north of the Latvia-Moscow axis (with Balts living to the south before the East Slavic expansion). Mixing between Baltic and Finnic people happened, and people also shifted language.

- open questions are how strong the Baltic-Finnic influence is/was in Scandinavia and, conversely, how strong the Germanic influence is/was in Finland and Estonia.  For certain political reasons this is a difficult topic to approach today.

CV errors indicate quality in general: the lower the value, the better the quality. The absolute values depend on the data used and can't be compared with other Admixture tests.

K3: 0.19708
K4: 0.19503
K5: 0.19480
K6: 0.19451
K7: 0.19432
K8: 0.19503
K9: 0.19508
K10: 0.19576
K11: 0.19708
K12: 0.19797
K13: 0.20221
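
The CV errors listed above can be ranked with a few lines of Python. This is a small illustrative sketch; the variable names are made up, and the values are copied from the list.

```python
# Cross-validation errors per K, copied from the list above.
cv_errors = {
    3: 0.19708, 4: 0.19503, 5: 0.19480, 6: 0.19451, 7: 0.19432,
    8: 0.19503, 9: 0.19508, 10: 0.19576, 11: 0.19708, 12: 0.19797,
    13: 0.20221,
}

# K with the lowest CV error.
best_k = min(cv_errors, key=cv_errors.get)

# All K values ordered from lowest to highest CV error.
ranked = sorted(cv_errors, key=cv_errors.get)

print("lowest CV error at K =", best_k)
print("three best K values:", ranked[:3])
```

The differences between K4 and K9 are small, so the ranking should be read as a rough guide rather than a strict ordering.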

Population abbreviations, download here

Analysis, download here.

You definitely need a suitable picture viewer that can handle big GIF files.

  1. This is a pretty decent short summary (even though it often says "Finnish" when it should mention Finnic/Baltic Finnic, the common ancestor of Estonian, Finnish etc.) about how the Finnic language came relatively recently to Finland from the southern side of the Gulf of Finland, Bronze Age archaic Germanic & Baltic loanwords etc.

    It might be worthwhile to mention that Saamic speakers inhabited Southern Finland and Karelia in those times, their influence accounting for some of the genetic difference between Estonians and northern Finnics.

    We get an idea of this in Finestructure PCA of the most recent Estonian Biocentre study.

    Draw a line from Mordva or Pinega Russian to Altai and you'll get the Volga-Ural between. N.Finnic is shifted towards Saami from S.Finnic. Dashed lines are towards African positions, as this is a zoom on Eurasians.

    1. Well, I don't see the word Finnish in a wrong context. There are exactly two mentions of "Finnish" people and one of "Finns". Instead, "Baltic-Finnic" people are mentioned 7 times and "Volga-Finnic" people twice :)

  2. I was a bit harsh, mostly referring to this:

    "Finnish was probably brought to Southern Finland only about 1500 years ago, though who brought it remains at best an educated guess. It was possibly spoken by trading peoples living south of the Baltic Sea, seen as prestigious by the predominantly hunter-gathering dwellers north of the gulf."

    It reads like Finnish was spoken south of the Gulf before it came to Finland but that was an ancestral language, probably better called a variant of Finnic. A minor issue.

    btw, I have an alternate idea about Mordvas. Their language has way less Baltic loanwords than Finnic and they live outside ancient Balt range, which shouldn't happen if they were language switching Balts. Maybe that kind of ancestry was just common widely and early, even beyond Balto-Slavic territory.

    1. Anyway, my point was that Mordvins do not look genetically Finnic and more resemble Balto-Slavs. As usual, the connection between language and genes can offer surprises.

    2. Yeah, but if we look at fst's or something like the PCA I linked, Maris and such look even more distant from Baltic Finns. So who really was Finno-Permic and Finnic and who's been mixed since? Probably everyone, it's silly to expect proto-populations to have survived unchanged in the Volga highway for thousands of years. Someone'll dig up the real thing eventually and then we shall see.

    3. You are wrong, because you can't see admixture on a PCA, nor with Fst. The Fst distance between Maris and Baltic-Finnic people is due to the Siberian admixture among Maris, and the same holds between Finns and Saamis. What kind of Fst distance can be expected between pure present-day Baltic-Finnics and people of half Baltic-Finnic and half East Siberian ancestry? You just oppose, although you know that I am right, and being honest you could admit it :) Then, talking about Baltic-Finnic people and Mordvins: they both belong to the same East European ancestry. They are slightly different because of their different later history. Don't make things too trivial, like "disclosing" the Fst between Baltic-Finns and Maris.

  3. What kind of nonsense is that man? Maris do not have half Baltic Finnic ancestry, they have zero because their ancestors split from the Uralic root population in the Volga before Baltic Finns existed. You know that there were no ancient Baltic Finnic or Baltic or Germanic migrations to Mari-El.

    1. I speak about genes, you all the time about languages. I warned about this in my text. Let's say this: according to linguists the Baltic-Finnic language was born in Eastern Estonia, but genes and prehistory are a different story. The problem here is more political than scientific. Look man, FamilyTreeDNA classifies the Baltic-Finnic gene pool as Finnish-Siberian, which is of course a part of this same nonsense. They do this classification because they believe that the Baltic-Finnic ancestry proportion of Maris etc. came from Siberia. Actually my tests show that it came from west to east. Maybe in the near future I will make a map showing the same using mtdna and ydna.

  4. They don't say it came from Siberia. FtDNA is doing an update eventually, changing that component's name into just Finnish.

    But these ancestries didn't come from a specific point, they've been forming all over Eastern Europe from a mixture of various hunter-gatherer, farmer and pastoralist populations since the Neolithic. We can tell there was no baltic-finnic migration to Mari-El from Y-chromosomes too, Maris don't have the right N or the right I1 for that.

    1. Of course there was no Baltic-Finnic LANGUAGE moving from west to the Urals, but there was an old Fennoscandinavian-type gene pool reaching North Russia a long time ago, before Baltic-Finns, and there was a Siberian gene pool to the east.

  5. ”We can tell there was no baltic-finnic migration to Mari-El from Y-chromosomes too, Maris don't have the right N or the right I1 for that.”

    According to the new N paper published in the summer, Mari do carry N3a3-VL29:
    Saami 6/13
    Finns 5/21
    Karelians 21/52
    Arhangel Russians 14/47
    Estonians 65/72
    Maris 6/21
    Nenets 6/39

    If the Finnish language comes from Estonia, VL29 could indeed be the original yDNA of Finnic speakers. On the basis of N3a3, there could have been a migration from the Baltic-Finnics to Maris and Nenets.

    As for I, information is much less specific. However, here they say that 5% of Mari males belong to I1.

  6. But no Z1936. VL29 could easily represent Russian admixture like Slavic R1a, most of South and Central Russian N belongs to that type. Migrations from Finland or Karelia to Mari-El are ruled out by Y-DNA evidence just as they are by archaeology.

  7. But the linguists argue that Finnic languages came from the south (Estonia) and not from Karelia. It is possible that Z1936 was originally more a Saami yDNA. I really do not see why VL29 in Maris should come from Russians.

    In any case, we need ancient yDNA from the Uralic area.

  8. Yes linguists do argue that, and these expansions are associated with expansions of accompanying Y-dna, and Z1936 has an Iron Age founder effect in Finland. It wasn't Saami any more likely than R1a was Neolithic Farmer - the conquered.

    Russian ancestry can be found all around the minorities of the Empire, nothing surprising about that.

  9. YDNA N is surely old in the Uralic world, and I am against your idea that the Proto-Uralic language could be reduced to the expansion of Z1936 c. 1000 BC. According to Jaakko Häkkinen, Proto-Uralic existed c. 3500 BC, so it requires at least the N3a3’6 level and cannot be only as old as N3a4-Z1936.

    As for mtDNA, Maris seem to have some U5b1b1a (+144) which is typical of Uralic speakers and the oldest mtDNA in Finns and Saamis (c. 4 kya old). Apart from Maris, this mtDNA has spread to Erzya, Komi Zyryans, Chuvash, Jakuts and Buryats. Maris also have some V1a which also has a high frequency in Finns and Saamis. Both U5b1b1a and V1a probably spread from west to east within the Uralic speaking world and also ended up in high frequency N carrying Turkic groups such as Chuvash, Jakuts and Buryats. J1c2 which originated in Neolithic Central Europe also spread to the Uralic area. J1c2 is shared between Finns, Maris, Erzya, Komi Permyaks, Komi Zyryans and Turkic speaking Tatars, Bashkirs and Chuvash. On the basis of mtDNA, some mtDNA haplotypes spread from west to east in the Uralic world.

  10. N3a4-Z1936 arose c. 1000 BC, and if we think that it arose somewhere close to the Karelian Isthmus within the Net Ware and brought the Finnic languages to Finland and Estonia and the Saami language to Lakeland Finland, we must admit that this yDNA spread from there to the east, as N3a4-Z1936 is also found in Khantys, Mansis, Nenets and Nganasans as well as in Dolgans, Chuvash, Tatars and Bashkirs. In that case Proto-Finnic/Saami speakers were quite influential. (see the article by Kosmenko, THE CULTURE OF BRONZE AGE NET WARE IN KARELIA)

    If we reserve N3a4-Z1936 for the Finnic languages, then N3a3-VL29 seems to be Corded Ware in Finland.

    1. Your dating is off big time, recheck the recent studies and Yfull.

  11. This is all that Yfull says: N-Z1936 * CTS1223 * Z1922 +2 SNPs.

    However, it is true that in the new Estonian paper the age of N3a4 is older than I thought on the basis of their graph. Their estimate for the Turkic and Uralic branch is 4476 years which gives the date of c. 2400 BC. The non-Turkic branch separated c. 4027 years ago which gives the date of c. 2000 BC for the Uralic N3a4 which is enough for the Net Ware.

    Their age estimate for N3a3’6 is 4995 years, and this figure fits quite well with the age of Proto-Uralic as defined by Jaakko Häkkinen (c. 3500 BC).

  12. I must add that this Uralic N3a4 is not fully Uralic as the oldest split is between a Kuban Cossack haplotype and the rest, and the youngest branch of N3a4 in this new Estonian tree is a Bashkir branch which is under an Estonian branch. Therefore, this N3a4 is really quite mixed with Volga Turkic groups, whatever it means linguistically.

  14. @Kristiina

    If we reserve N3a4-Z1936 for the Finnic languages, then N3a3-VL29 seems to be Corded Ware in Finland.

    Are you positing Y-DNA hg N for some Corded Ware groups? That sounds radical.

    1. Why do you connect VL29 with CW in Finland? The autosomal evidence does not require it.

    2. I was essentially asking the same question to Kristiina, why she connects them.

  15. To be honest, I was more irritated by Grelsson’s comment that VL29 in Maris is from Russians, as if VL29 could not have been frequent in Uralic groups and would somehow be of Indo-European origin. However, it is clear that VL29’s distribution is more western than Z1936’s distribution, and if Z1936 arrived in Finland only c. 1500 BC, it is not so far-fetched to presume that VL29 was in Finland before that, i.e. during the Corded Ware period, 3000-2000 BC. Moreover, VL29 is clearly much less frequent in Turkic groups than Z1936, and there was N1c/N3 in the Zhizhitskaya Culture c. 2500 BC close to the Finnish territory. There is a direct route from the area of the Zhizhitskaya Culture (Novgorod oblast) to Finland via the Karelian Isthmus.

    I do not see anything radical in the idea that the most frequent yDNAs in Finland were fixed during the Bronze Age. We have seen this happening in many areas. Languages are a different story, but I really do not believe that a Germanic language or a Baltic language was spoken in Finland c. 3000-2000 BC.

    I am very interested to see the results of the upcoming Baltic paper.

    I am not claiming that there was VL29 in Germany, Poland, Sweden or even in Lithuania or Latvia during the Corded Ware period. I am only arguing that it could have been present in the Finnish territory.

  16. "To be honest, I was more irritated by Grelsson’s comment that VL29 in Maris is from Russians as if VL29 could not have been frequent in Uralic groups and would somehow be of Indo-European origin."

    It doesn't need an Indo-European origin to get introduced into Maris by Russians, duh. It's enough that the expanding Russians have it.

    The point is that there is no Z1936, which you find around the Baltic, in Maris and that means it doesn't look like there were migrations related to Finnic speakers to that region.

    Following this train of thought, there were no migrations after the initial Uralic movement to the opposite direction, because all the various Volga groups have Y9022 which expanded there during Middle-Late Bronze Age looking at the MRCA.

    1. Just a reminder. I didn't write in my blog text about a migration from west to east; I only stated that North Russian FU-people look like a mixture of present-day Baltic-Finnic people and Siberians. What does this mean? In my opinion, that the whole North Russian region was a mixing area of those two ethnic groups.

  17. I am quite sure that the history in the Uralic area is more complex than only a migration from Volga-Kama to Finland and Volga Oka c. 1500 BC. There are notable differences in the grammar and lexicon of Uralic languages as well as in their mtDNA and yDNA. Russians started to expand only after the Middle Ages. I am against the idea that the history of Western Uralic languages can be reduced to a migration from Volga-Kama to the west c. 1500 BC with a subsequent isolation and then an admixture with Russians after 1600 AD.

    Your logic only holds if it is true that there was no VL29 in Uralic groups and that VL29 males spoke an unknown extinct language that was completely different from Uralic languages. If for example the original Finnic speakers carried mainly VL29 as Estonians, your logic does not hold.

    Notwithstanding this, I am open to the possibility that VL29 spoke a Corded Ware language. However, I do not think that it was a Baltic or Germanic language, but it surely shared many features/words with the modern Finnic languages.

    In any case, it really does not matter what you think or what I think as hopefully these questions will be resolved through ancient DNA. I will accept the picture that emerges through the ancient DNA.

  18. R1a and R1b separated c. 22 800 years ago and R1 and R c. 28 200 years ago, and notwithstanding this there are people with academic education who argue that the Malta boy in pre-Ice Age Siberia spoke a precursor language of Indo-European languages. Then, there are many people who think that the Samara hunter-gatherer (5 650-5 555 BC) spoke a pre-Proto-Indo-European language, although his yDNA seems to be pre-M73, which is found in Turks, Mongols and Samoyeds and which formed c. 13 600 years ago. And last but not least, it is claimed by many that Khvalynsk R1a-M198 spoke a pre-Proto-Indo-European language, although his R1a-M459 formed already c. 18 300 years ago. For many, it does not matter if yDNA is R1b-L23, R1b-M73, R1a-M459 or R1a-Z645; they are still convinced that these guys spoke an Indo-European language.

    In the Uralic area, there is N3a3 and N3a4 which separated only c. 5000 years ago, i.e. at the start of the Bronze Age, which is the estimated age of the Proto-Uralic, but in this case most people seem to agree that N3a3 and N3a4 cannot have both spoken a Uralic language and N3a3 must have spoken an unrelated extinct language.

    The double standards do not end here. Yamnaya ancestry has been considered conclusive evidence of an Indo-European migration from Yamnaya area to Central and Western Europe. By contrast, in the Uralic area it is firmly argued that genetic evidence can NEVER be used as evidence for linguistic relationship. In the Uralic area very rigorous linguistic criteria must be used but in the IE area Yamnaya autosomal affinity is enough to prove a linguistic relationship, although yesterday for example Maciamo on Eupedia confirmed that Saamis have much more Yamnaya ancestry than Germans.

  19. Regarding differences between ydna haplogroups and clades, it is good to remember that many European ydna groups went through a severe bottleneck around 5000 years ago. Those people still passed on autosomal genes. It is entirely possible to share autosomal genes without sharing ydna. We know this well, but repeatedly insist on something else. We also often have a wrong conception of a bottleneck. It doesn't mean one man living with one or tens of women. It means a time span of tens or hundreds of generations during which some clades died out. There are two explanations for this evolution. The first is genetic drift, which happens in isolation, i.e. locally. Another, not a bad one, is competition between clans formed of relatives: during lawless times clans gave safety. This explanation, too, leads us to expect local ydna differences without any necessary difference in autosomal genes.