Wednesday, 1 July 2015

Bronze and Iron Age samples analyzed using Dstat

A month ago a new study, Allentoft et al., was published with previously unpublished data on several Bronze Age cultures.  Altogether 101 ancient samples were available, of which almost half are of reasonably high quality.  I ran several PCAs and noticed problems caused by the low-quality samples and the obviously nonrandom distribution of SNP results.  With standard methods most of the new samples clustered somewhere between Central Europe and the Caucasus, and with the projection method included in Eigensoft’s PCA tool most samples from ancient European cultures were placed among modern Europeans.  So I understood that PCA would not work well and would not reveal the original ancient features, and I saw it necessary to compare ancient and modern samples directly, without selective clustering.  Tools like f3Stat and Dstat are straightforward methods that do not rely on clustering, which is vulnerable to low-quality data.  Therefore f3Stat and Dstat are more applicable in this case.
My first test takes selected Eastern European populations and compares them to other modern Europeans and to ancient samples.  I used Dstat, and the formula is Dstat(test-a,test-b;ancient sample,Mbuti), where “test-a” is the East European sample to be tested and “test-b” is the European sample it is compared with.  If the result is positive, “test-a” is closer to the ancient sample than “test-b” is.  If the result is negative, “test-b” is closer than “test-a”.  I moved some results to an Excel sheet to show one idea of how to make comparisons.  New data is downloadable here.
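
To make the sign convention concrete, here is a minimal sketch of the usual allele-frequency form of the D-statistic. It assumes four equal-length lists of per-SNP allele frequencies for the same allele; the function name `d_statistic` and the toy frequencies are made up for illustration and are not taken from the actual Dstat run, which also reports a Z-score that this sketch does not compute.

```python
def d_statistic(test_a, test_b, ancient, outgroup):
    """D(test_a, test_b; ancient, outgroup) from per-SNP allele frequencies.

    Each argument is a sequence of frequencies (0..1) of the same allele,
    one value per SNP. A positive result means test_a shares more alleles
    with the ancient sample than test_b does.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, anc, out in zip(test_a, test_b, ancient, outgroup):
        numerator += (a - b) * (anc - out)
        denominator += (a + b - 2 * a * b) * (anc + out - 2 * anc * out)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Toy data (hypothetical): test_a tracks the ancient sample more closely.
test_a = [0.9, 0.8, 0.1]
test_b = [0.5, 0.5, 0.5]
ancient = [1.0, 1.0, 0.0]
outgroup = [0.0, 0.0, 1.0]

d_ab = d_statistic(test_a, test_b, ancient, outgroup)
d_ba = d_statistic(test_b, test_a, ancient, outgroup)
print(d_ab > 0, d_ba < 0)
```

Swapping “test-a” and “test-b” only flips the sign, which is why a single run per pair is enough to say which of the two is closer to the ancient sample.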

I now publish some first observations.  Although the geography seems to be absolutely right (ancient Scandinavians are close to modern Scandinavians, etc.), there are many surprising results which contradict results obtained by selective clustering methods.  You are welcome to leave a comment if you find something surprising.  Unfortunately the publicly available version of Allentoft et al. doesn’t show comparable results using f3Stat or Dstat, so it keeps us in suspense.

I now have only a few results from East Europe, but I’ll run more, including Central, West and South Europeans, during the next week.

Examples click here.

edit 1.7.2015 11.40 am:  The German samples are from Estonian BC and are not representative.  They seem to be partly East Europeans of unknown origin rather than Germans from Germany. I should have deleted them.

edit 4.7.2015 11.10 am:   More results, including Western Europeans; click here to download the xlsx sheet.

16 comments:

  1. Interesting. For some comparisons, in visual terms using Poland as a UTC (Coordinated Universal Time) of 0: - RISE Sintashta and Yamnaya - RISE Bell Beaker and Corded Ware

    Similarity to Sintashta, BB and CW seems to follow the same order, which is
    Baltic->Scandinavia->West Slavic & other Northwest European->more Southern European populations, with East Eurasian ancestry lowering sharing as well, for Baltic and East Slavic populations for example (some of these differences are very slight of course). This shows how the statistics for CW, Sintashta and Bell Beaker seem more or less exactly correlated for these Europeans, while Yamnaya is different.

    Or in other words, it is quite simple to predict similarity to any of Sintashta, BB or CW from any of the others, while predicting Yamnaya similarity from any of those three is noisier and not as predictable. Within the set Sintashta, BB and CW, a high / low level in one predicts a high / low level in another. By contrast, many populations with quite high similarity to Yamnaya, like Romanian, have a low level in Sintashta, BB or CW, and a pop high in Sintashta can be lower in Yamnaya, like Sweden... but others are low in both or high in both, so Yamnaya and these others do not predict each other as well.

    1. Matt, many thanks for the excellent graphics, which make the results easier to comprehend. It is likely that Lithuanians have mixed very little from the Bronze Age to the present and that all other nations are more mixed. For that reason the similarity with BB and CW decreases to the south. I have to run similar tests using Neolithic Farmer genomes to see this effect. There may also have been founder effects among modern people decreasing similarity with BA people.

    2. Remember that a sample's elevated internal IBS-sharing increases shared drift too. Behar's 10-man Lithuanian sample and the small Smolensk Russian sample have this issue. It elevates shared drift with ancient and modern genomes alike: both of the aforementioned groups have higher shared drift with Chuvashes than more eastern groups do, as the tests you did previously @muinainensuomi show:

      RU_Smolensk Mordva : Chuvash MBUTI 0.0016 1.626
      Mordva Lithuania : Chuvash MBUTI -0.0039 -4.050

      So, similarity in f- or d-tests has more complicated factors than just apparent admixture differences involved.

    3. You are absolutely right. It is better to look for bigger trends than at any single population. It is however true that Lithuanians are very little mixed and can represent something local and old. I admit that I don't understand the mechanism of increased LD in isolation. Does it follow some general genetic rule, meaning that all heavily drifted populations approach each other? Like Lithuanians and Chuvashes. I am not a geneticist.

      What is good about f3- and d-tests is the lack of continental trends. You can make comparisons like Poles vs Italians and Poles vs Chuvashes showing only what they really are (taking into account the error you mentioned). If you use PCA you have to add Poles, Italians and Chuvashes to the SAME analysis to make the results commensurate, and then the component separating Italians and Chuvashes becomes the most significant one, which also affects the differences between Poles and Chuvashes/Italians. F- and d-stats are free of this error in local estimates.

    4. Thanks. I think what you say about the Lithuanians sounds quite likely.

      Another apparent pattern I have spotted looking at these stats this morning is one involving the Afanasievo stats.

      Although Afanasievo is supposed to be a clade with Yamnaya according to the Allentoft paper, the stat D(Poland,Pop,baAfan,Mbuti) appears to be much more linearly correlated with those of the other LNBA populations.

      You can see this in these graphs (apologies for the messiness of them):

      (Note: I have reversed the sign for the stats given to make D(Pop,Poland,Ancient,Mbuti) from D(Poland,Pop,Ancient,Mbuti) as it is simpler for me to visualize on these graphs this way, with more affinity than Poland being positive).

      Although there is a positive correlation between Yamnaya and Afanasievo, it is a loose one and the correlation seems stronger between Afanasievo and Sintashta, Bell Beaker, Corded Ware, Andronovo, etc.

      This seems unusual in light of the idea from Allentoft that Afanasievo and Yamnaya are very much alike.

      Another interesting pattern when visualising Afanasievo and Andronovo against other stats, something which will be immediately apparent, is that unlike Yamnaya vs Sintashta or Yamnaya vs CW, contrasting Afanasievo or Andronovo to Sintashta, Bell Beaker or Corded Ware seems to split Europe into a rough Eastern and Western European cline, or Russian vs non-Russian, both pointing towards the Baltic. These populations in Russia and Eastern Europe, while still not having the most shared drift with Andronovo or Afanasievo, have a relatively higher degree of it compared to Bell Beaker, Corded Ware, Sintashta.

    5. Thanks again. Regarding Yamnaya I have a hunch that it might be distorted by the data. If Yamnaya is from Haak et al. and was moved into Allentoft et al., then it can have a different SNP distribution. I have already noticed that all the Allentoft data is consistent. The thing is that both studies use Affymetrix SNP coordinates, whereas I use data from Estonian BC, which uses Illumina SNP coordinates. I have used data downloaded from Felix's server, based on the published full scans, but we don't know for sure whether Haak's and Allentoft's data outputs are equivalent with regard to Estonian BC's Illumina set. If the Yamnaya data isn't, the result can suffer from SNP mismatch. I can check this next Monday; I am now en route and can't do it.

    6. I forgot, I had already checked SNP coverages and Yamnaya covers over 75% of the EBC data, which should be enough to give quality results. But it looks like the Yamnaya data is somehow bad. I could compute similar statistics using the original Haak/Allentoft Affy data and see if that fixes the problem. But that is possible next Monday at the earliest.

    7. Re: distortion of the Yamnaya data, one thing that may be relevant is that in these kinds of stats the Karasuk data actually seem more correlated with the Yamnaya data than any of the other ancients do.

      Karasuk is supposed to have some East Eurasian ancestry (via ADMIXTURE and various formal stats), and most of the populations which seem to break the linear correlation between baKarasuk and baYamnaya above seem to be the ones who are usually modelled as having some East Eurasian link (although Serbian is not at least). Matching what we expect, these tend to be closer to Karasuk relative to Poland and the other populations on the cline cutting through Poland, who lack East Eurasian. The correlation has a very high r2 when the off-cline populations are discarded.

      So if there were an error in the Yamnaya data, it might have systematically also affected the Karasuk data in the same way, if that gives any clues?

      Karasuk acting like Yamnaya + Siberian and not like Andronovo + Siberian would seem strange in itself from the point of history though, as in history the region where Karasuk is from would seem to go Yamna->Poltavka->Sintashta->Andronovo->Karasuk.

    8. I started to wonder why Sweden and Norway are so far from each other on the Yamnaya plots and checked my data. It looks like your plots have some error regarding the Yamnaya numbers. Here are the Sweden, Norway and Utah-CEU data from the Excel sheet:

      Poland Utah_CEU baYam MBUTI 0,0017 1,742
      Poland Sweden baYam MBUTI -0,001 -0,973
      Poland Norway baYam MBUTI -0,0021 -1,974

      We see that the order should be Utah-CEU - Sweden - Norway on the x-axis, but it is Sweden - Utah-CEU - Norway.

    9. Yep, there must have been an error where I didn't sort the Karasuk and Yamnaya correctly at some step when I was transferring the data and mismatched some of those populations, so any comments I made on those relative to others can be ignored (the comparisons between the others like Sintashta, Bell Beaker, Corded Ware should still be correct I think).

      I've checked those stats again, and corrected that, and now a PCA of them looks nice and as expected - - as does the correlation between Afanasievo and Yamnaya, and between Karasuk and others - and, with the *right* outliers that Grelsson mentions.

      Apologies for wasting your time with that mistake.

    10. Thanks Matt, looks great. Next week I will try to make an admixture analysis. I would like to base it on the distribution of ancient genes among modern samples, because it would follow the history, from past to present, but I am not sure if it is possible.

  2. Matt, Vepsians, Ru_Kostr and Ru_Vologda (these are actually the HGDP samples from Kargopol) are more eastern than many of the populations deviating towards Karasuk, yet do not behave that way.

    Mauri, can you test Okunevo? They are supposed to represent some sort of "Native American" resurgence, and it would be nice to see if they behave differently from Karasuk.

    1. Grelsson,

      Veps and Karelian are from a Russian study published a couple of years ago. Afaik Ru_Kostr too. Only Vologda is from the original HGDP data. Russian and Estonian researchers seem to cooperate and the corresponding data is usually available from Estonian Biocentre. Sadly Finnish researchers seem to have only a thin connection with them; I have seen no references, which means that the peer view is negligible.

      I can't say anything about Okunevo now, because I am not at home and all my data is there.

  3. Yeah, I'm aware of where the samples came from. My point was about how various populations relate to Karasuk. There's no need to do Okunevo now, but maybe include it in the next blog entry?

    1. I see now why I removed the Okunevo samples. They seem to be of rather bad quality, though not so bad that it makes them impossible to use, so I'll add them.

    2. Here are results regarding Okunevo. Okunevo samples have lower quality than samples in my previous tests.