Kalevan ja Untamon geenit: Analyzing ancient samples is not a piece of cake, an example

Sunday, November 22, 2015


Testing ancient "steppe" samples on a PCA together with modern ones revealed unexpected issues. Studies have used different sets of modern samples; some include South Asian samples but not East Asian ones, probably on the assumption that East Asians are not relevant when testing Europeans. That assumption may be wrong, because we are trying to verify a history spanning thousands of years, and the migrations during that time are always at least partially unknown. Let's look at three PCA runs with different compositions. I published the first one in my previous blog entry; in the second one I now include South Asians, and in the third one East Asians as well.

Because my Gnuplot printing routine, which tries to fit everything on one page, can handle only a limited number of population names, I had to remove some ancient and Uralic samples from the plotting stage of two global views. The PCA itself includes all samples in every phase, so the x- and y-axis values are correct. I therefore present two PCA plots for each of the three phases: a global view and a close-up. The close-ups are computed from the same full sample set as the global views and are provided only for better resolution.

In my previous analysis all "steppe" samples were located very close to modern Europeans. To keep it simple, let's follow the Bronze Age Scandinavians (baSca). They seem to be the westernmost group of all Bronze Age samples.

After adding South Asians, all "steppe" samples move eastwards, and the Bronze Age Scandinavians move with them in the same direction. As far as the "steppe" samples are concerned, this starts to look like Jones et al. Sorry about the flipped pictures; SmartPca does that sometimes.

After adding East Asians the picture changes again: the "steppe" samples move back west, and some of the Bronze Age Scandinavians are now among the Basques (this is interesting indeed, think about the western megaliths, but let's forget it for now).

In conclusion, I would say that it is not always meaningful to draw conclusions about clines between modern and ancient samples if we do not know the history that connects them. We can select modern samples haphazardly or even in a prejudiced manner, and perhaps lose meaningful history.
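The effect described above can be reproduced in a toy setting. The following is only a minimal sketch on simulated genotypes, not the SmartPca runs behind the plots: the group names, sample sizes, SNP count and drift values are all made up for illustration. It runs the same PCA twice, once on a "core" panel and once after adding a genetically distant group, and compares where the same "ancient" rows end up.

```python
import numpy as np


def pca_coords(genotypes, n_components=2):
    """Return per-individual PCA coordinates for an individuals x SNPs matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def simulate(freqs, n_individuals, rng):
    """Draw diploid genotypes (0, 1 or 2 copies) from per-SNP allele frequencies."""
    return rng.binomial(2, freqs, size=(n_individuals, freqs.size)).astype(float)


rng = np.random.default_rng(0)
n_snps = 5000                               # arbitrary toy value
base = rng.uniform(0.2, 0.8, n_snps)        # shared ancestral frequencies


def drifted(scale):
    """Hypothetical related population: ancestral frequencies plus random drift."""
    return np.clip(base + rng.normal(0.0, scale, n_snps), 0.05, 0.95)


# All groups are simulated; none of them stands for a real population.
group_a = simulate(drifted(0.10), 30, rng)
group_b = simulate(drifted(0.10), 30, rng)
ancient = simulate(drifted(0.10), 5, rng)
distant = simulate(rng.uniform(0.05, 0.95, n_snps), 20, rng)  # unrelated frequencies

core_panel = np.vstack([group_a, group_b, ancient])
wide_panel = np.vstack([core_panel, distant])

core_coords = pca_coords(core_panel)
wide_coords = pca_coords(wide_panel)

# How strongly does the first axis of the wide panel track "distant vs. rest"?
is_distant = np.r_[np.zeros(len(core_panel)), np.ones(len(distant))]
pc1_vs_distant = abs(np.corrcoef(wide_coords[:, 0], is_distant)[0, 1])

ancient_rows = slice(60, 65)                # the same five individuals in both runs
print("ancient PC1, core panel:", core_coords[ancient_rows, 0])
print("ancient PC1, wide panel:", wide_coords[ancient_rows, 0])
print("|corr(PC1, distant indicator)|:", pc1_vs_distant)
```

In this toy setup the first axis of the wider panel is taken over by the contrast between the distant group and everyone else, so the coordinates of the five "ancient" individuals change even though their genotypes do not. The sign of each axis is arbitrary in any PCA, which is also why plots can come out mirrored between runs.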

30 comments:

Maju, November 22, 2015 at 12:56 PM
PCAs (or any other statistical analysis method, such as ADMIXTURE) are not indifferent to the "weight" of the various populations, hence sampling strategy is most important. For example, I notice that in graph one you have lots of North and NE Europeans and also some de facto East Asians (Nenets, Mongola). These last clutter Dim1, which becomes a West-East axis that has more to do with Eurasia than with Europe specifically, and that's why they are typically excluded from Europe-focused analyses.

When you remove these last (second graph), the PCA is still cluttered by so many Eastern European samples from small populations that it's still all about NE Europeans, with the rest being redefined in NE European terms. It's not your typical Europe PCA, something very apparent in the absence of the Basque-Caucasus Dim2 polarity that invariably appears in other, less biased PCAs.

As you again add more extra-European populations, the result reverts to a pan-Eurasian analysis, which is of little interest for the understanding of our subcontinent. The third analysis says: India vs Europe (Dim1) and India vs Bedouin (Dim2). That's what a PCA can tell, not more.

Sampling strategy is almost everything in autosomal analysis. It can be used to twist the results but, if we want to do it properly, then we must:

(1) Focus on what exactly we want to try to decipher. If we are trying to understand intra-European fine grain, then East Asians or any other external sample will bother us, unless what we are trying to understand is specifically East Asian admixture.

(2) Choose samples carefully. Personally I favor regular samples of large populations such as English, Italians, Russians, etc. and half-sized samples of key minor populations such as Basques, Sardinians, Finns. For example, a sample I could use for a Europe analysis could be: Spanish, French, English, German, Italian, Polish, Romanian, Russian, each at n=10 (approx.); Basque, Sardinians, Irish, Swedish, Latvian, Finnish, Greek, Sicilians and a Caucasian sample such as Tabassarans or Georgians, each at n=5. The difference in size is meant to keep the actually larger populations from being excessively cluttered by the smaller ones, which is just all kinds of wrong. You can use other proportions, as long as you try to allow for proper representation of the large populations: not strictly proportional, but something like logarithmically proportional (base 10) maybe.

(3) Any single approach is probably not enough to understand everything so different sampling strategies can be used to understand different aspects, producing different PCAs. Hence one with East Asians will illustrate East Asian admixture, one with lots of samples from whatever European region will illustrate the innards of that region (but be of little interest for pan-European analysis most often), etc.
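Point (2) above can be sketched as a small subsampling step applied before running the PCA. This is only an illustration under stated assumptions: the function name `subsample`, the sample IDs and the three-population sample sheet are hypothetical, and only the n=10 / n=5 targets come from the comment.

```python
import random
from collections import defaultdict


def subsample(sample_pops, targets, default_n=5, seed=0):
    """Pick at most targets[pop] sample IDs per population (default_n if unlisted)."""
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample_id, pop in sample_pops.items():
        by_pop[pop].append(sample_id)

    chosen = []
    for pop in sorted(by_pop):                  # fixed order keeps runs reproducible
        ids = sorted(by_pop[pop])
        n = min(targets.get(pop, default_n), len(ids))
        chosen.extend(rng.sample(ids, n))
    return chosen


# Hypothetical sample sheet: 12 individuals for each of three populations.
sample_pops = {
    f"{pop}_{i}": pop
    for pop in ("English", "Basque", "Finnish")
    for i in range(12)
}

# Targets as suggested in the comment: large populations n=10, key minor ones n=5.
targets = {"English": 10, "Basque": 5, "Finnish": 5}

kept = subsample(sample_pops, targets)
counts = defaultdict(int)
for sample_id in kept:
    counts[sample_pops[sample_id]] += 1
print(dict(counts))
```

The resulting list of IDs would then be used to filter the genotype file before the PCA; how that filtering is done depends on the tools in use and is not shown here.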
Anonymous, November 22, 2015 at 3:19 PM
Russia, Finland, Italy and France need regional samples because these countries have genuine differences between subpopulations, beyond drift.

Re: non-European sampling, Nenets have enough European hunter-gatherer-like ancestry, to the exclusion of East Asians, that their effect is not de facto East Asian. This is easily seen with D(X, Mbuti; Dai/Han, Loschbour), comparing with South Asians. Mongolians may be more East Asian.
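For readers unfamiliar with the notation, a statistic of the form D(W, X; Y, Z) can be sketched from per-SNP allele frequencies as below. This is a bare-bones illustration assuming the commonly used frequency-based definition; it has no standard errors or Z-scores, which dedicated tools normally add, and the frequencies in the example are invented numbers, not real data for any population.

```python
import numpy as np


def d_statistic(p_w, p_x, p_y, p_z):
    """D(W, X; Y, Z) from per-SNP allele frequencies of four populations.

    Positive when W shares more alleles with Y (and X with Z),
    negative when W shares more with Z (and X with Y).
    """
    p_w, p_x, p_y, p_z = (np.asarray(p, dtype=float) for p in (p_w, p_x, p_y, p_z))
    numerator = (p_w - p_x) * (p_y - p_z)
    denominator = (p_w + p_x - 2 * p_w * p_x) * (p_y + p_z - 2 * p_y * p_z)
    return numerator.sum() / denominator.sum()


# Made-up frequencies at two SNPs, purely to exercise the function.
test_pop = [0.1, 0.5]     # stands in for X in D(X, Mbuti; Dai/Han, Loschbour)
outgroup = [0.3, 0.4]     # stands in for Mbuti
east = [0.2, 0.9]         # stands in for Dai/Han
west = [0.6, 0.1]         # stands in for Loschbour

d = d_statistic(test_pop, outgroup, east, west)
print(d)
```

Under this sign convention, and with the populations ordered as in the comment, a more negative value for a test population would point towards Loschbour-like sharing and a more positive one towards Dai/Han-like sharing. Sign conventions differ between tools, so the ordering should always be checked.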