Methods
I have seen several mitochondrial statistics based on the main haplogroups (H, U, I, etc.). Haplogroups, being tens of thousands of years old, are a very coarse way to analyze geographic areas where people have moved and mixed during recent centuries, or at most over a few thousand years. Because of this I decided to use mutation information based on the RSRS reference. The RSRS (Reconstructed Sapiens Reference Sequence) was introduced a few years ago and lists mitochondrial mutations relative to the so-called "mito-Eve", the reconstructed first woman in the human ancestral tree. Even the RSRS leaves a lot to be desired, because many mutations are common to several mitochondrial branches.
Data
The data was collected from publicly available FamilyTreeDNA projects and includes two hypervariable regions, HVR1 and HVR2. HVR2 is not available for all samples; in those cases it is marked as "no call". Otherwise all mutations are included.
Countries and sample sizes
The Finnish sample is probably the biggest ever seen in academic or any other studies. Even taking into account some bias from regional differences in personal activity, this has to be the best sample data ever seen from Finland.
Some geographical groups are underrepresented, such as the White Sea Karelians, but I expected some interest in them and included them anyway.
Results
Fst distances
Looking for country-level rather than individual statistics, I first ran Fst statistics between countries. Keeping in mind the nature of mitochondrial data and mutations, one should not expect any strict summary of ancestry; on the contrary, the results mirror European migrations over thousands of years.
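To make the idea of a pairwise Fst concrete, here is a minimal sketch. It is not the tool or estimator used for the results in this post (that is not specified here); it assumes a simple Nei-style ratio, Fst = (H_T - H_S) / H_T, summed over mutation sites, with equal weight for both populations and no sample-size correction. The function name and the input frequencies are hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Nei-style Fst between two populations.

    freqs_a, freqs_b: per-site frequencies (0..1) of the derived
    mutation in each population, in the same site order.
    Assumption: equal population weights, no bias correction.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need the same list of sites")

    h_within = 0.0  # mean within-population gene diversity, summed over sites
    h_total = 0.0   # gene diversity of the pooled population, summed over sites
    for p_a, p_b in zip(freqs_a, freqs_b):
        h_within += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_mean = (p_a + p_b) / 2
        h_total += 2 * p_mean * (1 - p_mean)

    if h_total == 0:
        return 0.0  # no variation at all, so no measurable differentiation
    return (h_total - h_within) / h_total


# Hypothetical frequencies of three mutations in two countries.
country_a = [0.10, 0.45, 0.80]
country_b = [0.15, 0.40, 0.60]
print(round(fst_two_populations(country_a, country_b), 4))
```

Running this for every pair of countries would fill a symmetric distance table of the kind shown under "Fst distances".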
Fst distances
A higher-resolution image can be downloaded here
MDS-plot based on Fst-distances:
The two leftmost dots are Poland and Germany.
And a classical Euclidean tree plot:
edit 20.0.2016 13:40
Here I reconstructed the mitochondrial genome instead of using the hypervariable mutations directly. The reconstructed SNP data was analyzed with standard analysis tools. I am quite sure that analyses done using only mutation indicators would not be successful.
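As an illustration of what such a reconstruction can look like, here is a minimal sketch that applies a list of substitutions written as ancestral base, position, derived base (for example "C16002T") to a reference sequence. It is an assumption about the general approach, not the procedure actually used for this data: the reference string, the start position and the mutation names below are invented toy values, and insertions, deletions and any other notation are simply skipped.

```python
import re

# Matches simple substitutions such as "C16002T" (ancestral, position, derived).
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def apply_substitutions(reference, mutations, start=1):
    """Return a copy of `reference` with the listed substitutions applied.

    reference: reference bases for positions start, start+1, ...
    mutations: strings like "C16002T"; anything else is ignored.
    """
    seq = list(reference)
    for mutation in mutations:
        match = SUBSTITUTION.match(mutation)
        if match is None:
            continue  # indels and unknown notation are out of scope here
        ancestral, position, derived = match.group(1), int(match.group(2)), match.group(3)
        index = position - start
        if not 0 <= index < len(seq):
            continue  # position lies outside the stretch we hold
        if seq[index] != ancestral:
            raise ValueError(f"{mutation}: reference has {seq[index]} at {position}")
        seq[index] = derived
    return "".join(seq)


# Toy reference covering positions 16001..16010 (not real RSRS sequence).
toy_reference = "ACGTACGTAC"
sample = apply_substitutions(toy_reference, ["C16002T", "T16004C"], start=16001)
print(sample)
```

Checking the ancestral base against the reference catches mutation lists that were defined against a different reference sequence.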
22.9.2016 11:30
Added Fst and genome data. Note that the genome data was reconstructed with minimal manual effort, and the original kit ID numbers have been replaced by surrogates!
Fst-data download here
Genome data download here
Hello, just a comment on the MDS plot and clustering in this post. I've been experimenting with the MDS and clustering functions in PAST3, and I'm not sure you get the best results by using Euclidean distances on the columns.
As the data you have in the table is already a distance matrix, you don't need PAST3 to transform the matrix into a set of Euclidean distances.
Instead what you can do is just ask PAST3 to use the distances you have supplied by selecting the similarity index as "User Defined Distance".
Here's an example using your table from your post: http://terheninenmaa.blogspot.co.uk/2015/01/fst-distances-in-europe-finnish-rolloff.html
Here: http://i.imgur.com/lOakTnt.png
It forms a nice approximation in two dimensions of all the distances in the table.
The transform to Euclidean seems to add a distorting curve to the plot: http://i.imgur.com/lOakTnt.png
Here's a similar MDS plot with the European Fst data from the Lazaridis et al 2016 paper: http://i.imgur.com/qRHZ5Mb.png
The Euclidean version is similar, but with a distorting curvature again: http://i.imgur.com/RzJsYkW.png
This becomes a bigger issue the higher Fst populations you have involved, and the same applies to the clustering in PAST3.
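A small numeric sketch of why the transform can bend the plot. It assumes the Euclidean option treats each row of the supplied table as a coordinate vector and measures the distance between rows; that is a reading of the behaviour described above, not a confirmed description of PAST3 internals. The three-point matrix is a made-up toy example, not Fst data.

```python
import math


def euclidean_between_rows(dist):
    """Treat each row of a distance matrix as a feature vector and
    return the Euclidean distances between those rows."""
    n = len(dist)
    return [
        [
            math.sqrt(sum((dist[i][k] - dist[j][k]) ** 2 for k in range(n)))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three populations lying on a straight line at positions 0, 1 and 3.
line = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
transformed = euclidean_between_rows(line)

# Original distances are exactly additive: d(0,1) + d(1,2) == d(0,2).
print(line[0][1] + line[1][2], line[0][2])
# After the transform the two short legs add up to more than the long one,
# so the three points can no longer sit on a straight line.
print(transformed[0][1] + transformed[1][2], transformed[0][2])
```

In the toy case the middle point is pushed off the line, and the larger the distances involved, the stronger the effect.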
For example:
Here's the Neighbour Joining Clustering on the Fst data from Lazaridis et al with Euclidean transform:
http://i.imgur.com/OcTNCMF.png
Here's when PAST3 just uses the user supplied distances from the Fst matrix:
http://i.imgur.com/f73ZsWc.png
The Euclidean mode is not bad, but there are obviously more logical population relationships when just using the user-supplied distances, e.g.
1) Sardinian joins with Sicilian, not Ashkenazi Jew
2) the Northwest Europeans split into a Norwegian-Icelandic and a more Celtic group, rather than having Orcadian with Icelandic and Norwegian in the middle of the group
3) the Czechs cluster in with the West Slavic populations, rather than the Germans
4) Canary Islanders cluster logically with Spaniards, not with North Italians
5) the Finns cluster with the Estonians, logically, rather than with Lithuanians
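Differences like these arise because neighbour joining picks the pairs to join directly from the distances it is given. Below is a minimal sketch of only the first pair-selection step, using the standard Q criterion, Q(i,j) = (n-2)·d(i,j) - sum_k d(i,k) - sum_k d(j,k). The function name is hypothetical and the five-taxon matrix is a generic toy example, not Fst data from either study.

```python
def first_neighbour_joining_pair(labels, dist):
    """Return (Q, label_i, label_j) for the pair that neighbour joining
    would merge first, i.e. the pair with the smallest Q value."""
    n = len(labels)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, labels[i], labels[j])
    return best


labels = ["a", "b", "c", "d", "e"]
toy_distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(first_neighbour_joining_pair(labels, toy_distances))
```

Feeding the same function a transformed matrix instead of the original one can change which pair wins, which is how the two modes end up with different trees.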
I'd be interested to see what happens with your Fst matrix on mtdna when you use this similarity index on the function rather than the Euclidean distance transformation of the matrix.
Many thanks, Matt. I can publish the Fst table and also the corresponding SNP data, although the latter is somewhat cryptic. May I have your help with making new plots? I am struggling with new data and have encountered problems with its ancestral-derived allele pairs. I am keen to solve those problems so that I can look into different SNP selections. Playing with different data sets becomes possible with around 10 million SNPs.
I'll publish the table tomorrow local time.