Sunday, March 29, 2015

Some more observations about ancient European admixtures

I am not ready to make any large-scale comparisons yet, because so many things must be checked first, but just as an exercise, here are two qpAdm analyses.

1. Yamnaya samples seem to be an admixture of Samara-HG and Iranians, with smaller additions of Karelian and Scandinavian hunter-gatherer admixtures. I got several possible results showing chisq below 1 (0.5-0.7). The main proportions were EHG (Eastern hunter-gatherers) and Iranian, although some Armenian was also possible. This may fit the idea of an ancient Iranian linguistic connection with some East European peoples.

2. I also checked the Ashkenazi samples more closely and found them to be genetically very close to Sicilians, but I was still able to find a better composition by combining Early European Neolithic farmers, Lebanese and Turkish samples. Chisq was typically around 1 or less. It was also observable that although Ashkenazim and many Southeastern Europeans showed clear Turkish-like admixture, it is likely not from present-day Turks (with the exception of some Greeks and many Cypriots), because no Asian admixture was detectable and, as far as I know, Turks have some Asian admixture. So this was a bit puzzling. No Asian, but minor North African admixture existed in some Mediterranean results.

Wednesday, March 25, 2015

Near Eastern admixture in South Europe

Haak et al. 2015 has led to confusion about the genetic makeup of Southern Europeans. Two examples of those challenging points:

Sardinians 89.8% EN+3.2% WHG+7.1% Yamnaya
Ashkenazi 90.7% EN+0% WHG+9.3% Yamnaya

and later

Ashkenazi_Jew 3.7% EHG+95.5% EN+0.8% Nganassan+0% WHG 

EN stands for Early Neolithic Europeans, samples are very likely from Neolithic Hungary. 

The models differ only slightly in their resnorm values. Are Ashkenazim more original Southern Europeans than Europeans themselves? How could we explain this shift between original Europeans and original Near Easterners? IMO, it is quite hard to believe in this kind of preserved isolation, even among Ashkenazim. This is definitely confusing, because Jews came to Europe from the Near East around 2000 years ago and the analyses I have made using present-day samples show non-European admixture for Ashkenazim. Very likely they have become more mixed lately in Europe, and yet they turn out to be like Neolithic Europeans, huh, unbelievable. (I have nothing against Ashkenazi people; my son-in-law is, according to 23andme's test, 98% Ashkenazi.)

Another thing creating confusion is the use of Bedouins as a proxy. Maybe it is not clear that Bedouins don't represent all Middle Eastern migrations to Europe; they are only a stand-in that explains part of the history, whereas real migrations have modified our genes many more times. It is okay if three or four ancient migrations can explain things, but that doesn't explain later migrations, possibly from North Africa, the Near East, Turkey ... So Near Eastern Bedouin-like admixture can be Turkish as well, if we want to find out the history.

My last observation concerns the Nganassans. Almost all Europeans in this study show admixture with Nganassans, a small isolated group in northernmost Siberia. Although it is clear that they too are meant as proxies, it is still quite surprising to see 2.8% Nganassan admixture among French people, with almost the lowest resnorm. Does the resnorm in this case mean an error rate of 5% or more? My conclusion is that Nganassans have Northeast European admixture, which is not observable in admixture analyses because they are a rather homogeneous group.
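
To make the resnorm question concrete, here is a minimal Python sketch. It assumes that resnorm is the squared norm of the residual left by a least-squares mixture fit whose weights are non-negative and sum to one. The grid search, the function names and all numbers are made up for illustration; this is not the Haak et al. pipeline or data.

```python
def mix(weights, sources):
    """Weighted sum of the source vectors, one value per statistic."""
    n_stats = len(sources[0])
    return [sum(w * src[j] for w, src in zip(weights, sources))
            for j in range(n_stats)]


def resnorm(target, sources, weights):
    """Squared norm of (target - weighted mixture of sources)."""
    fitted = mix(weights, sources)
    return sum((t - f) ** 2 for t, f in zip(target, fitted))


def fit_three_way(target, sources, steps=100):
    """Grid search over three weights that are >= 0 and sum to 1.

    A stand-in for a constrained least-squares solver, chosen only
    because it needs no external library.
    """
    best = None
    for i in range(steps + 1):
        for k in range(steps + 1 - i):
            w = (i / steps, k / steps, (steps - i - k) / steps)
            r = resnorm(target, sources, w)
            if best is None or r < best[1]:
                best = (w, r)
    return best


# Hypothetical statistics for three source groups (invented numbers).
sources = [
    [0.10, 0.02, -0.05, 0.07],
    [-0.03, 0.08, 0.01, 0.00],
    [0.04, -0.06, 0.09, 0.02],
]
# A noise-free target built as a 90/3/7 mixture of the sources.
target = mix((0.90, 0.03, 0.07), sources)

best_weights, best_resnorm = fit_three_way(target, sources)
print(best_weights, best_resnorm)
```

A resnorm near zero only says that the weighted sources reproduce the target statistics. It is not by itself a percentage error on any single proportion, so a small proxy share can sit alongside a low resnorm.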


 

Tuesday, March 24, 2015

Estimating ancient genes among present-day European populations, part 3

After adding the Karelian HG, things "clicked" in Northeast Europe. This ancient Karelian sample is closely related to the Yamnaya people, but with less or no European farmer ancestry. In the new results this lack of farmer genes was balanced by additional Hungarian Neolithic farmer genes, especially among northeastern and ancient Corded Ware samples, and as a result chisq is now much lower in Northeast Europe. The Excel sheet also shows the older chisq for comparison. It is worth noticing that both northern hunter-gatherer groups (Scandinavian and Karelian) are necessary to obtain the best fits in the north, and that my second test showed the western hunter-gatherer (Hungarian Neolithic HG) to be quite distant from all present-day northern populations, and even from some southern ones, such as Basques.

I also tried to find the missing piece that could complete certain South European admixtures. The best fit was obtained among Greeks by adding something like Turkish-Iranian, and in the case of Maltese and Sicilians also North African genes (Turkish-Iranian + Tunisian). It is hard to say exactly what this Turkish-Iranian stands for; in any case Lebanese, the best Near Eastern admixture, gave a poorer result.

Download new results here.

Sunday, March 22, 2015

Estimating ancient genes among present-day European populations, part 2

Following my obvious guideline, I substituted the Hungarian Neolithic hunter-gatherer (KO1) for the Scandinavian hunter-gatherer group. My goal was to find possible differences in the relations between Yamnaya and the two hunter-gatherer groups. I thought that the result could reveal ancient migrations. Indeed, the result shows a possible genetic connection between the Yamnaya, Corded Ware and Scandinavian cultures, which is of course reasonable from the historical perspective. The Hungarian hunter-gatherer looks more like belonging to some more southern, diverged population.

I would appreciate it if someone with good mathematical skills could make a correlation analysis between all four groups, paying attention to the differences between the Excel sheets. I myself am happy to see how tightly the Yamnaya, Corded Ware and Scandinavian HG groups correlate. It definitely looks like substituting the Hungarian HG for the Scandinavian HGs moves all Scandinavian similarity to the Corded Ware culture. Drawing speculative conclusions, I would say that early animal husbandry also connects these regions. One of the most interesting questions is now related to new finds of ancient yDna from the East European plain.


Corded-Ware-LN
- Yamnaya 0.68
- Hun. farmer 0.175
- SC-HG    0.138

Corded-Ware-LN
- Yamnaya 0.804
- Hun. farmer 0.125
- Hun. HG 0.067

This same continuity between Yamnaya, Corded Ware and North Europe is observable widely among all North Europeans.

Download my earlier xls-sheet here
Download my new xls-sheet here

Next I will try to estimate Eastern Hunter-Gatherers. Possible eastern similarity with present-day populations would be interesting.

edit 22.03.15 Adding the Karelian Hunter-Gatherer as a fifth admixture proportion decreased chisq (below 1 at best) in Northeast Europe, meaning a better fit. At the same time the proportion of Hungarian Neolithic farmers in the same results increased, compensating for the increased hunter-gatherer proportion. This was all reasonable and expected. But in the south the change was minimal; for example Basques had no ancient Karelian admixture and actually nothing changed. This all looks so expected. Incredible! The biggest issue I have now, after the next update, is to find the lost South European admixture.
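
As a rough illustration of why a lower chisq is read as a better fit, here is a minimal Python sketch. It assumes the statistics are independent, so chisq is just the sum of squared residuals divided by squared standard errors; qpAdm itself works with a full covariance matrix estimated by a block jackknife, which this toy skips. All numbers and names are invented.

```python
def chi_square(observed, fitted, standard_errors):
    """Sum of squared standardised residuals (diagonal covariance assumed)."""
    return sum(((o - f) / se) ** 2
               for o, f, se in zip(observed, fitted, standard_errors))


# Hypothetical f4-type statistics for one target population.
observed = [0.012, -0.004, 0.009, 0.003]
standard_errors = [0.002, 0.002, 0.003, 0.002]

# Hypothetical values predicted by a four-source and a five-source model.
fit_four_sources = [0.008, -0.001, 0.004, 0.006]
fit_five_sources = [0.011, -0.003, 0.008, 0.004]

chisq_four = chi_square(observed, fit_four_sources, standard_errors)
chisq_five = chi_square(observed, fit_five_sources, standard_errors)
print(round(chisq_four, 2), round(chisq_five, 2))
```

Adding a source can hardly make chisq worse, because the larger model contains the smaller one, so the drop is more convincing when it is large relative to the lost degree of freedom.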


Friday, March 20, 2015

Estimating ancient genes among present-day European populations

Reich Lab's new Admixtools version 3 includes a new admixture calculator based on f4-statistics.  So here is my first effort to find out our Neolithic genetic admixtures.   This test is based on three European Neolithic sample groups:  Yamnaya culture samples,  Neolithic farmer samples from Hungary and Neolithic hunter-gatherer samples from Southern Scandinavia.  East-Asian samples were used additionally to find out possible Northeastern migrations.

So I had four admixture groups. It wasn't enough; it soon became obvious that South Europeans had a fifth admixture dimension. I tried to solve the problem by fitting Near Eastern samples to the model, because my previous test with Mixmapper showed a later Near Eastern admixture in South Europe. But this time it didn't succeed. Surely there is a solution and I'll try to find it later. For now some South European results show poor accuracy due to this unknown admixture proportion. The best fit was obtained in the Ukrainian, Belarussian, Cornwall and Basque results. The results show elevated amounts of East Asian, which was also the case in Haak et al. The reason is either the low-quality ancient samples or the method itself.

This new tool is exciting because it makes estimates fast and seems to have fewer of the problems caused by genetic drift that are typical of admixture analyses. But I have work to do to learn its pros and cons.

Download Excel-file here.


edit 20.03.15   Sardinian and Sicilian row names corrected.

edit 21.03.15 I did preliminary tests using HungaryGamba_HG, aka KO1, instead of the Scandinavian hunter-gatherers. KO1 was a hunter-gatherer found on the Hungarian plain who lived around 5000 years ago, and the genome is about as old as the Swedish hunter-gatherer Ajv58 used in my earlier test together with other Scandinavian samples from the same period.

The result implies that KO1 shares less common ancestry with Yamnaya than the Scandinavian hunter-gatherers do. Replacing the Scandinavian HGs with KO1 led to much higher Yamnaya and lower hunter-gatherer admixtures wherever the summed admixture of Yamnaya and Scandinavian HG was highest. I am working on this but do not have time just now to complete my tests.

edit 21.03.15  Sorry about erroneous test, now corrected (underlined).  To be in a hurry is not a good thing.


Wednesday, March 18, 2015

Finestructure testing continues

This is my last test with Finestructure, now using East Finns, Karelians and a lot of other North Europeans released recently by the Estonian Biocentre. After this I see no reason to continue with Finestructure, because it fails to work properly with (sub)populations showing high genetic drift, just as I noticed in my first experience. It starts to spread the internal drift onto other connected populations, making a causality error. One interesting observation, however: Russians in Kostroma seem to belong to the same old Finnic ancestry as Mordvas, not to the same one as Maris, as supposed in some studies. This probably means that the Meryan people belonged to a western branch of Finno-Ugric-speaking people.

Download results here

Saturday, March 14, 2015

Yamnaya likelihood among present-day populations in comparison to Early Neolithic farmers, Western hunter-gatherers and Iron Age Britons

This test tries to estimate how much Yamnaya affinity present-day populations have preserved. Each of the three ancient comparison groups is compared to the Yamnaya samples and to present-day populations. Populations with higher scores are more Yamnaya-like than those with lower scores. Positive scores mean that the population in the first column has more Yamnaya admixture than the ancient comparison group, and negative scores the reverse. Please note that this test doesn't try to compare present-day populations to each other, but to show their position between ancient populations and ancient migrations.


[Figure: Hungarian EN samples versus Yamnaya]

[Figure: Loschbour versus Yamnaya]

[Figure: Iron Age Britons versus Yamnaya]

Monday, March 9, 2015

F4-statistics: Karelians, Vepsians and Estonians

I have also been working on f4-statistics for some time, but have not yet had enough time to take bigger steps. Here, however, are some quick and dirty results and how to interpret them.

Populations and sample sizes


 West-Finland   19
 Estonia   14
 Komi   16
 Udmurtia   16
 RU_Vologda   10
 Mordva   15
 RU_Tver    4
 RU-Kostr    6
 RU_Arch    1
 Lithuania   10
 East-Finland   20
 Poland   24
 Belarussia   16
 Karelia   15
 Veps   10
 RU_Smolensk    7
 RU_west   17
 MBUTI   14
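
For readers who want to see what an f4-statistic measures, here is a minimal Python sketch that computes f4(A, B; C, D) as the average over SNPs of (a - b) * (c - d), where the letters are allele frequencies. The frequencies below are invented toy values, not taken from the samples listed above, and real tools such as Admixtools also report a standard error and Z-score from a block jackknife, which is left out here.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Average over SNPs of (a - b) * (c - d) for allele frequencies."""
    products = [(a - b) * (c - d)
                for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d)]
    return sum(products) / len(products)


# Toy allele frequencies at four SNPs (hypothetical values).
outgroup = [0.50, 0.50, 0.50, 0.50]   # e.g. an African outgroup
pop_a = [0.60, 0.40, 0.50, 0.50]
pop_b = [0.50, 0.50, 0.70, 0.30]
test_pop = [0.50, 0.50, 0.68, 0.33]   # drifts together with pop_b

value = f4(outgroup, test_pop, pop_a, pop_b)
print(value)
```

With this ordering, f4(outgroup, X; A, B), a positive value means X shares more drift with B than with A, a negative value the opposite, and a value near zero is what a simple tree with A and B as equally distant from X would give. Swapping A and B only flips the sign.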



Results


Friday, March 6, 2015

More Finestructure thingy, unlinked tests

Continuing my Finestructure tests, I now have ready an unlinked test of the same data I used in my previous Finestructure test. As expected, there is not much difference from the previous linked-mode test, because that one already found quite a long history. Both tests show that Europeans fall into three main categories: Baltoslavic, Mediterranean and Germanic. The Finns seem to be a mixture of ancient Volga people and another old North European group which is today best represented by Baltoslavs, but which in deeper history was probably more widespread. Unlinked results show a bit more Swedish admixture in Finland, which also mirrors the same old history.

Download results here.


Tuesday, March 3, 2015

Starting with new data

Just the very beginning: new data downloaded and compiled.




Click here to see a large picture.

edit 04.03.15 Just to remind readers that this plot wasn't made as a projection of ancient genomes onto present-day ones and is not directly comparable to Haak's original plot (which projects ancient genomes). All ancient genomes are part of the composition. I'll do a projected version in the next update.

edit 05.03.15  Reduced the amount of Finnish samples to correspond to average North European sample size.

I also made a new map showing projected ancient genomes, i.e. ancient genomes are placed according to coordinates of present-day populations and ancient genomes themselves have no effect on generated principal components.  Click here to download.
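
The difference between the two kinds of plot can be sketched in a few lines of Python with NumPy. This is only an illustration under simple assumptions (complete genotypes coded 0/1/2, plain mean-centering, random toy data); the function names are mine, and real PCA tools additionally normalise each SNP and handle missing genotypes.

```python
import numpy as np


def fit_pca(reference, n_components=2):
    """Compute principal axes from the reference (present-day) samples only."""
    mean = reference.mean(axis=0)
    centered = reference - mean
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, axes):
    """Place samples on axes that were fitted without them."""
    return (samples - mean) @ axes.T


# Toy genotype matrices (rows = individuals, columns = SNPs), random values.
rng = np.random.default_rng(0)
present_day = rng.integers(0, 3, size=(30, 200)).astype(float)
ancient = rng.integers(0, 3, size=(4, 200)).astype(float)

mean, axes = fit_pca(present_day)
present_coords = project(present_day, mean, axes)
ancient_coords = project(ancient, mean, axes)
print(ancient_coords.shape)
```

In the unprojected version the ancient rows would simply be stacked onto the reference matrix before calling `fit_pca`, so they would help define the axes themselves.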

edit 06.03.15 Following advice sent to me, I ran new PCAs with and without projection of ancient samples using Haak's original data (exactly the same SNPs) and I didn't see any difference compared to my previous plots. On my plots the Yamna samples still form a long chain between North Europe and Turkish/Armenian/Middle East, unlike on Haak's plot, where they lie northward from North Europe.

My experience with Chromopainter and Finestructure



I used MaCH (http://csg.sph.umich.edu/abecasis/mach/tour/input_files.html and http://csg.sph.umich.edu/abecasis/mach/tour/imputation.html ) for imputation and phasing. The imputation showed good reliability in its statistics, which was expected because only a few SNPs were missing. The proportion of missing alleles was 0.03% (three per 10000), at approximately random positions. Both stages were done chromosome by chromosome; still the processing time was quite long, typically hours per run (PC: quad core Intel 4770k/3500 MHz / 16 GB memory).

Data

The data was selected from the following studies, with additional populations from the 1000-genome project (Finnish, CEU, British and Tuscan samples):
http://mbe.oxfordjournals.org/content/29/1/359
http://www.nature.com/nature/journal/v466/n7303/pdf/nature09103.pdf
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature12736.pdf
http://digitalcommons.wayne.edu/humbiol_preprints/41/

The total number of SNPs per sample was limited by these studies to around 300000 SNPs.

Problems that emerged when running Chromopainter/Finestructure

There are two disadvantages with Chromopainter and Finestructure. I have also tested the functionality using smaller "synthetic" data to see in detail how it works and where the problems are. The first problem is related to isolated "daughter" populations and is caused by the Markov chain process. The Markov chain process can’t itself be aware of the population history, and it leads to a result where more homogeneous and possibly oversampled isolated populations look more like sources than the actual donor populations, although this is not possible when the isolates are younger "daughters". It is hard to avoid this error, because Chromopainter/Finestructure doesn't take enough factual information to steer the process and to take into account the known history and the origin of “chunks” or haplotypes, i.e. causality. Actually you can supervise Chromopainter, which gives you a chance to correct this problem, or to make it even worse. In practice the only way to avoid these errors is to cut known isolated populations out of the input, but this is all up to you and the result can still be subjective.

Here is a picture showing how the clustering works:
   






The amount of additional chunks multiplies when the A-B population grows. 

This would be a perfect way to make clusters if only we could know the gene flow directions between individuals, or a reasonable way if we could know the gene flow directions between countries or putative populations, but if we have to guess, the result will be just a guess or even worse.
Another question related to the donor populations of Chromopainter is that we simply don’t know of unidirectional gene flows in Europe. It is a great idea to mark Scandinavians, Spaniards and Germans as donors if we analyze American populations, but this doesn’t work in Europe, because here we have barely any unidirectional gene flows. Any attempt to mark donors in this analysis would simply be a guess. I didn’t want to guess, so I ran Chromopainter in a neutral mode in which every individual is compared to all other individuals. Maybe I could use high-quality ancient samples as donors, but if I see a Finestructure analysis targeting only Europeans with an asymmetric admixture matrix, I would be interested in how the donor haplotypes were determined.

The other problem is also caused by the Markov chain process and is related to mixed populations. Basically it is very similar to the first problem, but needs different data preparation. When the process finds mixed individuals, it also considers the ancestral populations to be mixed. This happens because the process is relative and has no understanding of the causality between individuals. So the Markov chain process clusters both ancestral populations together with the mixed one, regardless of the history, geography and genetic distances shown in the input data. How strong this clustering is depends on the sample sizes of all three populations, the ancestral ones and the mixed one. Again we need thorough preparation of the data to avoid wrong results. In the worst case some populations are both mixed and isolated, combining both errors in the result.

The following picture demonstrates the problem concerning mixed populations.  In Chromopainter/Finestructure it is even worse because they use chunk/haplotype counts instead of haplotypes.  





Maybe you say now that this is okay, but it is not. If we put 20 Spaniards, 20 Amerindians and 20 Mestizos into the Markov chain process and get one cluster including all three populations, it would in my opinion also be okay and I don’t object to it, but it is after all very misleading, because Spaniards and Native Americans are not relatives and live on two continents thousands of miles apart. This problem is solved in Chromopainter: you can mark Spanish and Native American phased data as donor data and Mestizos as recipients. But this strategy doesn’t work in Europe, where there are no such donor-recipient pairs as there are between Europe and the Americas.

Because I am especially interested in Finnish results, here are some details. The Finnish samples include 18 samples estimated to be from old settlements. Please check the Finnish settlement definitions explained in Finnish studies (Jakkula: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2668058/ Palo: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986642/ ). The estimate is based on a comparison with earlier analyzed and better known data sources and on PCA analyses; the 1000-genome data itself is not documented well enough to make this decision. The good news is that now that Karelian and Vepsian samples have become available, it is possible to add them to Finestructure tests, and also Finnish late settlements, without the drawback of showing too much genetic drift, i.e. Finnish clusters being caught by strong intra-population chunk sharing. My next tests will include all those samples.

Western Finns show the highest similarity to the south, with Estonians, West Russians and Poles, but there are two individuals with more North Russian similarity, and some West Finns show weaker similarity with Scandinavians. It is possible that the pre-selection made using PCA was not perfect and two Karelians or Savonians were included, or that those two belong partly to some other ethnicity and the result fell into the same PCA category. A low Scandinavian chunk amount doesn’t necessarily mean low Scandinavian ancestry, only low chunk sharing, which could mean that the western ancestry is older than the southern and eastern ancestry. Mosaic patterns also show that the Scandinavian affinity based on chunk sharing (in the linked results) is more East European ancestry in Scandinavia than vice versa, although Swedish admixture in Finland is also detectable. This reasoning about old Scandinavian ancestry in Finland may surprise some people, but perhaps it can be supported by the small amount of young Scandinavian-specific y-dna in Western Finland (see for example Lappalainen et al. 2006). Swedish admixture estimated by the ratio of R1b is 8/21=38% among the Swedish-speaking population in Ostrobothnia, while Swedish speakers form around 5% of the Finnish population.
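
The 8/21 figure rests on a small sample, so it may help to see how much uncertainty it carries. The following Python sketch computes a 95% Wilson score interval for the proportion. Treating the 21 samples as a simple random sample is my assumption, not something the original study states.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


share = 8 / 21
low, high = wilson_interval(8, 21)
print(round(share, 2), round(low, 2), round(high, 2))
```

With only 21 samples the interval is wide, roughly from 20% to 60%, so the 38% should be read as a rough figure.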

Maybe there is also reason to mention 23andme’s and FtDna’s results, which sometimes give high amounts of western admixture for West Finns. There is a difference in principle between what they do and this analysis. While 23andme and FtDna created a Finnish “average Joe” and compare individuals to him, Finestructure in this analysis compares everyone to everyone, and there are no inferred archetypes, stereotypes or hypothetical ancestors for any ethnic group. Another question is how to create genetic averages, whatever they might be.

Abbreviations: CEU=Utah-Europeans, FR=France, NRG=Norway, HU=Hungary, RO=Romania, BL=Bulgaria, SE=Sweden, CR=Croatia, EE=Estonia, BR=Belarussia, PL=Poland, UKR=Ukraine, WRU=West Russia, WFI=Western Finland, MA=Mari, CH=Chuvash, MR=Mordva, NRU=Russia-Vologda, TSI=Tuscan, SP=Spain, ITALY=Abruzzo

Inferred groups on average
1: UK-Kent CEU FR NRG HU RO BL SE CR
2: EE BR PL UKR WRU
3: WFI, mixed
4: MA CH TATAR 
5: MR NRU
6: TSI SP SICILY ITALY

Imputation and phasing were done with MaCH using 50 rounds and 200 states for each chromosome, creating around 1500 shared chunks between individuals. This really reaches a deep haplotype history.


Run parameters in Finestructure:  50000 burnin, 500000 MCMC rounds, tree climbing 100000.


You can download results here (compressed .zip).