A month ago we saw a new study, Allentoft et al., with new, previously unpublished data on several Bronze Age cultures. Altogether 101 ancient samples were available, of which almost half are of reasonably high quality. I ran several PCAs and noticed some problems due to the error caused by the low-quality samples and the obviously nonrandom distribution of SNP results.
If I used standard methods, most new samples clustered somewhere between Central Europe and the Caucasus, and if I used the projection method included in Eigensoft’s PCA tool, most samples from ancient European cultures were placed among modern Europeans. So I understood that PCA wouldn’t work well and wouldn’t reveal original ancient features, and I saw it necessary to use direct comparisons between ancient and modern samples, comparing them without selective clustering. Tools like f3Stat and Dstat are straightforward methods that avoid the clustering step, which is vulnerable to low-quality data. Therefore f3- and Dstat are more applicable in this case.
My first test includes selected Eastern European populations, comparing them to other modern Europeans and ancient samples. I used Dstat and the formula is Dstat(test-a,test-b;ancient sample,Mbuti), where “test-a” is the East European sample to be tested and “test-b” is the European sample to be compared with “test-a”. If the result is positive then “test-a” is closer to the ancient sample than “test-b” is. If the result is negative then “test-b” is closer than "test-a". I moved some results to an Excel sheet to show one idea of how to make comparisons. New data is downloadable here.
I now publish some first observations. Although the locality seems to be absolutely right (ancient Scandinavians are close to modern Scandinavians, etc.), there are many surprising results which contradict results obtained by selective clustering methods. You are welcome to leave your comments if you find something surprising. Unfortunately the publicly available version of Allentoft et al. doesn’t show comparable results using f3- or Dstat, so it keeps us in suspense.
I now have only a few results from East Europe, but I’ll run more results including Central, West and South Europeans during the next week.
Examples click here.
edit 1.7.2015 11.40 am: German samples are from Estonian BC and not representative. They seem to be partly more unknown East Europeans than Germans from Germany. I should have deleted them.
edit 4.7.2015 11.10 am: More results, including Western Europeans, click here to download xlsx-sheet.
Interesting. For some comparisons, in visual terms using Poland as a UTC (Coordinated Universal Time) of 0:
http://i.imgur.com/VwBdVQJ.png - RISE Sintashta and Yamnaya
http://i.imgur.com/nSVPanQ.png - RISE Bell Beaker and Corded Ware
Similarity to Sintashta, BB and CW seems to follow the same order, which is
Baltic->Scandinavia->West Slavic & other Northwest European->more Southern European populations, with East Eurasian ancestry lowering sharing as well, for Baltic and East Slavic populations for example (some of these differences are very slight of course).
http://i.imgur.com/QZsYUZx.png - shows how the statistics for CW, Sintashta and Bell Beaker seem more or less exactly correlated for these Europeans, while Yamnaya is different.
Or in other words, it is quite simple to predict similarity to any of Sintashta, BB or CW from any of the others, while predicting Yamnaya similarity from any of those three is noisier and less reliable. Within the set Sintashta, BB and CW, a high / low level in one predicts a high / low level in another. By contrast, many populations with quite high similarity to Yamnaya, like Romanian, have a low level in Sintashta, BB or CW, and a pop high in Sintashta can be lower in Yamnaya, like Sweden... but others are low in both or high in both, so Yamnaya and these others do not predict each other as well.
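The "predictability" described here can be put into a number with a plain Pearson correlation across populations. The sketch below uses made-up D values (not taken from the spreadsheet) purely to show the calculation; the variable names are hypothetical.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two equally long lists of D values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Illustrative D(Poland, Pop; Ancient, Mbuti) values for five populations,
# listed in the same population order for every ancient sample.
# These numbers are invented for the example.
sintashta = [0.004, 0.002, 0.000, -0.003, -0.006]
corded_ware = [0.005, 0.003, 0.001, -0.002, -0.005]
yamnaya = [0.001, -0.002, 0.002, -0.001, 0.000]

r_sin_cw = pearson_r(sintashta, corded_ware)
r_sin_yam = pearson_r(sintashta, yamnaya)
print(r_sin_cw, r_sin_yam)
```

The only thing that matters for such a comparison is that every column lists the populations in the same order; a mismatch in row order would destroy the correlation.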
Matt, many thanks for the excellent graphics, which make it easier to comprehend the results. It is likely that Lithuanians have mixed very little from the Bronze Age to the present and that all other nations are more mixed. For that reason the similarity with BB and CW decreases to the south. I have to make similar tests using Neolithic Farmer genomes to see this effect. There may also have been founder effects among modern people decreasing similarity with BA people.
Remember that a sample's elevated internal IBS-sharing increases shared drift too. Behar's 10-man Lithuanian sample and the small Smolensk Russian sample have this issue. It elevates shared drift with ancient and modern genomes alike: both of the aforementioned groups have higher shared drift with Chuvashes than more eastern groups do, as the tests you did previously, @muinainensuomi, show:
RU_Smolensk Mordva : Chuvash MBUTI 0.0016 1.626
Mordva Lithuania : Chuvash MBUTI -0.0039 -4.050
So, similarity in f- or d-tests has more complicated factors than just apparent admixture differences involved.
You are absolutely right. It is better to look for bigger trends than at any single population. It is however true that Lithuanians are very little mixed and can represent something local and old. I admit that I don't understand the mechanism of increased LD inside isolation. Does it follow some general genetic rule, meaning that all heavily drifted populations approach each other, like Lithuanians and Chuvashes? I am not a geneticist.
What is good in f3- and d-tests is the lack of continental trends. You can make comparisons like Poles vs Italians and Poles vs Chuvashes showing only what they really are (taking into account the error you mentioned). If you use PCA you have to add Poles, Italians and Chuvashes to the SAME analysis to make the results commensurate, and then the component separating Italians and Chuvashes becomes the most significant one, which also affects the differences between Poles and Chuvashes/Italians. F- and d-stats are free of this error in local estimates.
Thanks. I think what you say about the Lithuanians sounds quite likely.
Another apparent pattern I spotted looking at these stats this morning is one involving the Afanasievo stats.
Although Afanasievo is supposed to be a clade with Yamnaya according to the Allentoft paper, the stat D(Poland,Pop,baAfan,Mbuti) appears to be much more linearly correlated with the stats for the other LNBA populations.
You can see this in these graphs (apologies for the messiness of them):
http://i.imgur.com/r5Zgdxq.png
(Note: I have reversed the sign for the stats given to make D(Pop,Poland,Ancient,Mbuti) from D(Poland,Pop,Ancient,Mbuti) as it is simpler for me to visualize on these graphs this way, with more affinity than Poland being positive).
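The sign reversal in the note relies on the fact that swapping the first two populations of a D-statistic negates both D and its Z-score. A minimal sketch, using one Yamnaya row from the Excel sheet as input (the function name is hypothetical):

```python
def swap_first_pair(row):
    """Turn D(A, B; C, O) into D(B, A; C, O) by negating D and Z."""
    pop_a, pop_b, ancient, outgroup, d, z = row
    return (pop_b, pop_a, ancient, outgroup, -d, -z)

# Row as reported in the sheet: Poland Sweden baYam MBUTI -0.001 -0.973
row = ("Poland", "Sweden", "baYam", "MBUTI", -0.001, -0.973)
flipped = swap_first_pair(row)
print(flipped)
```

After the swap, a positive value means the population has more affinity to the ancient sample than Poland has.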
Although there is a positive correlation between Yamnaya and Afanasievo, it is a loose one and the correlation seems stronger between Afanasievo and Sintashta, Bell Beaker, Corded Ware, Andronovo, etc.
This seems unusual in light of the idea from Allentoft that Afanasievo and Yamnaya are very much alike.
Another interesting pattern when visualising Afanasievo and Andronovo against other stats, something which will be immediately apparent, is that unlike Yamnaya vs Sintashta or Yamnaya vs CW, contrasting Afanasievo or Andronovo with Sintashta, Bell Beaker or Corded Ware seems to split Europe into a rough Eastern and Western European cline, or Russian vs non-Russian, both pointing towards the Baltic. The populations in Russia and Eastern Europe, while still not having the most shared drift with Andronovo or Afanasievo, have a relatively higher degree of it than of shared drift with Bell Beaker, Corded Ware or Sintashta.
Thanks again. Regarding Yamnaya I have a hunch that it might be distorted by the data. If Yamnaya is from Haak et al., moved to Allentoft et al., then it can have a different SNP distribution. I noticed already that all Allentoft data is consistent. The point is that both studies are done using Affymetrix SNP coordinates. I, however, use data from Estonian BC, which uses Illumina SNP coordinates. I have used data downloaded from Felix's server, the data using published full scanning, but we don't know for sure if Haak's and Allentoft's data outputs are equal with regard to Estonian BC's Illumina set. If the Yamnaya data isn't, the result can suffer from SNP mismatch. I can check this next Monday, I am now en route and can't do it.
I forgot, I had already checked SNP coverages and Yamnaya covers over 75% of the EBC data, which should be enough to give quality results. But it looks like the Yamnaya data is somehow bad. I could make similar statistics using the original Haak/Allentoft Affy data and see if it fixes the problem. But that is possible next Monday at the earliest.
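A coverage figure like the 75% mentioned above can be understood as the share of SNPs in the reference set that are also called in the ancient sample. The sketch below is only an illustration with hypothetical SNP IDs and a hypothetical function name; it does not reproduce how the check was actually done.

```python
def snp_coverage(reference_snps, sample_snps):
    """Fraction of the reference SNP set that is also present in the sample."""
    reference = set(reference_snps)
    if not reference:
        raise ValueError("reference SNP set is empty")
    return len(reference & set(sample_snps)) / len(reference)

# Hypothetical SNP IDs: three of the four reference SNPs are covered.
reference_set = ["rs1", "rs2", "rs3", "rs4"]
ancient_calls = ["rs2", "rs3", "rs4", "rs9"]
coverage = snp_coverage(reference_set, ancient_calls)
print(coverage)
```

Note that a high overall coverage does not rule out a nonrandom distribution of the covered SNPs, which is the separate concern raised about mixing data sets.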
Re: distortion of the Yamnaya data, one thing that may be relevant is that in these kinds of stats the Karasuk data actually seem more correlated with the Yamnaya data than any of the other ancients do.
http://i.imgur.com/O46CEIn.png
Karasuk is supposed to have some East Eurasian ancestry (via ADMIXTURE and various formal stats), and most of the populations which seem to break the linear correlation between baKarasuk and baYamnaya above seem to be the ones who are usually modelled as having some East Eurasian link (although Serbian at least is not). Matching what we expect, these tend to be closer to Karasuk relative to Poland and the other populations on the cline cutting through Poland, who lack East Eurasian ancestry. The correlation has a very high r2 when the off-cline populations are discarded.
So if there were an error in the Yamnaya data, it might have systematically also affected the Karasuk data in the same way, if that gives any clues?
Karasuk acting like Yamnaya + Siberian and not like Andronovo + Siberian would seem strange in itself from the point of history though, as in history the region where Karasuk is from would seem to go Yamna->Poltavka->Sintashta->Andronovo->Karasuk.
I started to wonder why Sweden and Norway are so far from each other on the Yamnaya plots and checked my data. It looks like your plots have some error regarding the Yamnaya numbers. Here are the Sweden, Norway and Utah-CEU data from the Excel sheet:
Poland Utah_CEU baYam MBUTI 0,0017 1,742
Poland Sweden baYam MBUTI -0,001 -0,973
Poland Norway baYam MBUTI -0,0021 -1,974
We see that the order should be Utah-CEU - Sweden - Norway on the x-axis, but it is Sweden - Utah-CEU - Norway.
http://www.elisanet.fi/mauri_my/sin-yam.gif
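A quick way to catch this kind of mismatch is to sort the rows by D before plotting and compare the order with the plot. The sketch below parses the three rows quoted above (note the decimal commas from the Excel sheet); the function names are hypothetical.

```python
rows = """Poland Utah_CEU baYam MBUTI 0,0017 1,742
Poland Sweden baYam MBUTI -0,001 -0,973
Poland Norway baYam MBUTI -0,0021 -1,974"""

def parse_row(line):
    """Return (population, D, Z) from one whitespace-separated row."""
    _ref, pop, _ancient, _outgroup, d, z = line.split()
    # The sheet uses decimal commas, so convert them before parsing.
    return pop, float(d.replace(",", ".")), float(z.replace(",", "."))

def order_by_d(text):
    """Population names sorted from highest to lowest D."""
    parsed = [parse_row(line) for line in text.splitlines() if line.strip()]
    return [pop for pop, _d, _z in sorted(parsed, key=lambda r: r[1], reverse=True)]

order = order_by_d(rows)
print(order)
```

Sorting from lowest to highest D gives the same sequence reversed, so reversing the sign of the stat does not change which population sits in the middle.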
Yep, there must have been an error where I didn't sort the Karasuk and Yamnaya correctly at some step when I was transferring the data and mismatched some of those populations, so any comments I made on those relative to others can be ignored (the comparisons between the others like Sintashta, Bell Beaker, Corded Ware should still be correct I think).
I've checked those stats again and corrected that, and now a PCA of them looks nice and as expected - http://i.imgur.com/HhSZXwD.png - as does the correlation between Afanasievo and Yamnaya, and between Karasuk and others - http://i.imgur.com/vWDskjN.png and http://i.imgur.com/pJoilNA.png, with the *right* outliers that Grelsson mentions. http://i.imgur.com/VLwQNGX.png
Apologies for wasting your time with that mistake.
Thanks Matt, looks great. Next week I'll try to make an admixture analysis. I would like to see it based on ancient gene distribution among modern samples, because it would follow the history, from past to present, but I am not sure if it is possible.
Matt, Vepsians, Ru_Kostr and Ru_Vologda (these are actually the HGDP samples from Kargopol) are more eastern than many of the populations deviating towards Karasuk, yet do not behave that way.
Mauri, can you test Okunevo? They are supposed to represent some sort of "Native American" resurgence, and it would be nice to see if they behave differently from Karasuk.
Grelsson,
Veps and Karelian are from a Russian study published a couple of years ago. Afaik Ru_Kostr too. Only Vologda is from the original HGDP data. Russian and Estonian researchers seem to cooperate, and the corresponding data is usually available from Estonian Biocentre. Sadly Finnish researchers seem to have only a thin connection with them; I have seen no references, which means that the peer view is negligible.
I can't say now anything about Okunevo, because I am not at home and all my data is there.
Yeah, I'm aware of where the samples came from. My point was about how various populations relate to Karasuk. There's no need to do Okunevo now, but maybe include it in the next blog entry?
I see now why I removed the Okunevo samples. They seem to have rather bad quality, but not so bad that it makes it impossible to use them, so I'll add them.
Here are results regarding Okunevo. Okunevo samples have lower quality than samples in my previous tests.
http://www.elisanet.fi/mauri_my/f4loki3.xlsx