Sunday, 25 January 2015

Later Middle Eastern-like and Native American-like admixtures in Europe

I ran new MixMapper tests, now using only high-quality ancient samples: Loschbour, BR2, LBK380, NE1 and Kostenki.  This is certainly not the most informative setup, but it introduces less distortion when characterizing farmer and hunter-gatherer groups.

The second table shows the probability that a population carries more Middle Eastern admixture than those ancient farmers.  It is obtained by searching for the best admixture fits with Bedouin samples as a mandatory reference.  A negative alpha means no additional Middle Eastern admixture; a positive alpha means that additional Bedouin-like Middle Eastern admixture is needed to get the best fit with the ancient samples.  Please note that Basques show the lowest resnorm (best fit) and still need no additional Middle Eastern admixture, their best fit being at the LBK380-NE1 junction point.

My last table shows possible Karitiana-like admixture in the same way.  Note that the French, Bulgarian, Spanish, Sicilian, Greek and Turkish groups would very likely show negative Karitiana with Loschbour, but their best fit is found with the ancient farmer groups.  Basques again show the minimum admixture.

edit 26.01.15

Sardinians were accidentally left out.  They show no Karitiana-like admixture, but some later Bedouin-like admixture certainly exists, more than among Basques.

I received a question about the complete lack of Middle Eastern affinity among some North Europeans, which sounds wrong.  This test searched only for Bedouin-like admixture, not Middle Eastern admixture in general.  It is a great idea to include more Middle Easterners to get better coverage in Europe, but as far as I understand, the Bedouins are one of the least admixed local peoples there and probably represent unidirectional gene flow from the Middle East.  So the test arrangement is simple, and it would become much harder to interpret with a more complex Middle Eastern data set.  The same applies to Native American and Siberian admixtures.

Friday, 23 January 2015

Searching for ancient roots with MixMapper

After installing MixMapper I found that I would need to purchase a full MATLAB license (200 euros) to get the three-way mixture graphics working.  So I decided to move to Treemix.  But before going ahead with Treemix I made a MixMapper run using Lazaridis’ data together with ancient genomes released later.  The idea was to build a basic scaffold using as many ancient genomes as possible and to fit present-day populations onto it.  I selected all ancient genomes reaching a common coverage of 200k SNPs.  Here is the result, a two-way mixture tree:

Using this ancient structure as the unmixed ancestry I ran MixMapper for North and West Europeans.  It is worth noting that many present-day populations get several possible mixtures, evidently because the genetic distance between ancient and present-day people is considerable.  Despite that, many results are very close to what we see when analyzing modern populations; for example, the Finns get around 93-98% European and the rest Karitiana.  Karitiana and Mbuti are the only modern groups included in the ancient sample set, and without Karitiana the Finns would get MA1, Saqqaq or a totally different ancient composition, likely with more poorly fitting mixtures.

edit 23.1.15

I got a hint that She samples from East Asia would be better than Karitiana, so I made this change.  It looks like the She people are more distant from MA1 than the Karitiana are, and MA1 moves to the ancestral line of Europeans; otherwise the changes are small in Northeast Europe, where this kind of admixture exists.

edit 24.1.15

After checking the quality of the ancient genomes I became sceptical.  It looks like their quality varies too much to give trustworthy results in broad comparisons across genomes.  So my advice is to take these results with a grain of salt.

Sunday, 18 January 2015

Fst-distances in Europe, Finnish Rolloff-analyses

Many studies report European Fst-distances, but usually only for a few populations.  I am now going to fill this gap.  You are welcome to use these results in your own analyses and comparisons.  I did this work to correct some erroneous results seen earlier in Finnish numbers, caused by poor sample selection.

The two Finnish groups are defined as follows: using a European PCA and the Finnish samples from the 1000 Genomes Project, I first removed obvious outliers, then picked the 20 most eastern and the 20 most western samples along PCA dimensions 1 and 2.  So the eastern group is built from the most eastern 25% of the 1000g Finnish sample set and the western one from the most western 25%.
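The selection step can be sketched roughly as follows. This is only an illustration: the post does not say how PC1 and PC2 were combined, so the sketch assumes the samples are ranked by their projection onto a single "west to east" direction in the PC1/PC2 plane. The function name `pick_extremes`, the `direction` argument and the sample coordinates are all hypothetical.

```python
from typing import Dict, List, Tuple


def pick_extremes(coords: Dict[str, Tuple[float, float]],
                  direction: Tuple[float, float],
                  n: int) -> Tuple[List[str], List[str]]:
    """Return (low_end, high_end): the n samples with the smallest and the
    n samples with the largest projection onto `direction` (PC1/PC2 plane)."""
    if 2 * n > len(coords):
        raise ValueError("n is too large: the two groups would overlap")
    dx, dy = direction
    # Scalar projection of every sample onto the assumed west-to-east axis.
    score = {sid: x * dx + y * dy for sid, (x, y) in coords.items()}
    ranked = sorted(score, key=score.get)
    return ranked[:n], ranked[-n:]


# Hypothetical PCA coordinates (PC1, PC2) after outlier removal.
samples = {
    "s1": (-0.05, 0.01), "s2": (-0.03, 0.00), "s3": (0.00, 0.02),
    "s4": (0.01, -0.01), "s5": (0.04, 0.00), "s6": (0.06, 0.01),
}
# Assumption: larger PC1 means "more eastern".
west, east = pick_extremes(samples, direction=(1.0, 0.0), n=2)
print("west:", west, "east:", east)
```

With the real data, `n` would be 20 and the direction would have to be read off the PCA plot.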

Despite using the extremes within Finland, the genetic distance (Fst) between the Finnish groups is only 0.002, similar to the distance between Estonians and Belarusians.  It also corresponds to the result between Finnish old and new settlements in the study “The Genome-wide Patterns of Variation Expose Significant Substructure in a Founder Population” (Jakkula et al.).  The often-heard claim that the distance between Finnish groups is extremely large seems to be false, obtained only by using certain small Finnish settlements.

Distances between West Finns and other populations are a bit smaller than in Jakkula et al. (CEU: 0.006/0.004, Sweden: 0.004/0.003), but for East Finns there is no difference.  Comparing the results of the Finnish groups with other results, we see more differences in the west than in the east.  However, it sounds almost ridiculous to say that all Finns are genetically closer to Sicilians than to the Finno-Ugric Mari people, who live in Russia along the Volga river (Finnish Fst-distances 0.017/0.013 for Sicilians and 0.021/0.020 for Maris, respectively).

The data is gathered from publicly available sources released by universities and international projects.  Each country’s data consists of 10-20 samples, and each sample holds around 300,000 SNPs, without LD-pruning.  If you want the data, I can send it to you.

Fst-distances are computed with Eigensoft’s Smartpca, and the Rolloff analyses with Eigensoft’s Rolloff.
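For readers who want to see what an Fst number means in practice, here is a minimal sketch of Hudson's Fst estimator, computed as a ratio of sums over SNPs. It is an illustration under stated assumptions, not a description of what Smartpca does internally: the function name `hudson_fst` is hypothetical, the inputs are plain per-SNP allele frequencies, and the standard errors (which tools usually obtain by a block jackknife) are not computed.

```python
from typing import Sequence


def hudson_fst(p1: Sequence[float], p2: Sequence[float],
               n1: int, n2: int) -> float:
    """Hudson's Fst between two populations.

    p1, p2: allele frequencies per SNP in population 1 and 2.
    n1, n2: number of sampled chromosomes (2 x individuals), both > 1.
    """
    if len(p1) != len(p2):
        raise ValueError("both populations need the same SNP list")
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling noise.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two chromosomes, one from each population, differ.
        den += a * (1 - b) + b * (1 - a)
    if den == 0:
        raise ValueError("no SNP is polymorphic between the populations")
    return num / den


# Hypothetical frequencies for 4 SNPs, 20 individuals per group.
fst = hudson_fst([0.10, 0.50, 0.80, 0.30], [0.15, 0.45, 0.75, 0.35], 40, 40)
print(f"Fst = {fst:.4f}")
```

Because of the sampling correction the estimate can be slightly negative when two groups are practically identical, which is normal for small distances.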

Genetic distances (Fst/std. error): click here to see the table and click here to download the xls data.

I am also returning to Rolloff analyses.  Rolloff is a useful tool when used together with other analyses and information about genetic distances.  Admixture dates combined with genetic distances make it possible to evaluate the results against known history.  Unfortunately it is also possible to run Rolloff for two arbitrary populations with a large genetic distance and very little real admixture, and to draw wrong conclusions.  In that case there is no guarantee that the results are sensible.  I now try to move forward by taking into account genetic distances and admixture dates and by connecting the results with generally known and researched history.

Rolloff-results (generations/std.error):

Swedish admixture in West Finland: 126.632 / 26.533 (close to the I1-Bothnian clade age)
Swedish admixture in East Finland: 58.173 / 10.535 (close to the Tavastian or Viking migration to the east)
Western Finnish admixture in East Finland: 50.154 / 18.967 (close to the Tavastian migration to Karelia)
Eastern Finnish admixture in West Finland: 96.104 / 12.999 (close to the common Baltic-Finnic root?)
West-Russian admixture in East Finland: 52.609 / 15.212 (close to the foundation of Novgorod)
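To compare these generation counts with historical events, they have to be turned into calendar years. A minimal sketch, assuming a generation length of 29 years and 2015 as the reference year; neither value comes from this post, and generation lengths of roughly 25 to 30 years are commonly used, so the dates shift noticeably with this choice. The function name `admixture_years` is hypothetical.

```python
from typing import Tuple

YEARS_PER_GENERATION = 29.0  # assumed generation length, not from the post
SAMPLING_YEAR = 2015         # assumed reference year for "the present"


def admixture_years(generations: float, std_error: float,
                    years_per_generation: float = YEARS_PER_GENERATION,
                    sampling_year: int = SAMPLING_YEAR
                    ) -> Tuple[float, float, float]:
    """Return (earliest, central, latest) calendar years for a
    one-standard-error interval. Negative values mean years BCE."""
    earliest = sampling_year - (generations + std_error) * years_per_generation
    central = sampling_year - generations * years_per_generation
    latest = sampling_year - (generations - std_error) * years_per_generation
    return earliest, central, latest


# Rolloff results listed above: (label, generations, standard error).
ROLLOFF = [
    ("Swedish admixture in West Finland", 126.632, 26.533),
    ("Swedish admixture in East Finland", 58.173, 10.535),
    ("Western Finnish admixture in East Finland", 50.154, 18.967),
    ("Eastern Finnish admixture in West Finland", 96.104, 12.999),
    ("West-Russian admixture in East Finland", 52.609, 15.212),
]

for label, gens, se in ROLLOFF:
    early, mid, late = admixture_years(gens, se)
    print(f"{label}: around year {mid:.0f} (from {early:.0f} to {late:.0f})")
```

The intervals are wide, so the historical matches given in parentheses above should be read as rough correspondences only.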

 Click here to download related graphics.

One of the most impressive Rolloff results I have seen so far shows West Russian admixture in Eastern Finland.  In Western Finland the West Russian admixture is much older and negligible.  West Russia consists of Orjol, Kursk, Smolensk and Voron Russians; the ethnic names come from the data (Estonian BC).  As far as I can read the graph, it indicates a short genetic pulse among East Finns.


Friday, 16 January 2015

F3-statistics of Kostenki14 and Ust'-Ishim

Here are the world's oldest human (Homo sapiens) genomes run through f3-statistics using the formulas f3(mbuti;test,kostenki14) and f3(mbuti;test,Ust-Ishim).  My results certainly differ somewhat from the original studies because of the different and smaller SNP set, but I do not have time to build a new testing environment, so I use my standard data set.


Source f3 SNPs
Koryak 0,236256 425675
Japanese 0,236091 429066
Chukchi 0,236033 427413
Miao 0,235830 427436
Han 0,235741 429351
Mongola 0,235567 426397
Karitiana 0,235559 422494
Eskimo 0,235470 427107
Xibo 0,235293 426917
Korean 0,235204 426075
Hezhen 0,235168 427084
MA1 0,235030 295395
Han_NChina 0,235017 427681
LaBrana 0,234485 380064
Yakut 0,234453 428950
Loschbour 0,234170 350588
Altaian 0,233780 427545
Nganasan 0,233703 426077
Dolgan 0,233603 423032
Pima 0,233472 424626
Yukagir 0,233451 429275
Kalmyk 0,232592 428447
Selkup 0,232253 428022
Saami 0,231903 412124
Hazara 0,231409 429565
Motala 0,231265 379732
Uygur 0,231124 429040
East-Finnish 0,228334 427955
Chuvash 0,228022 428518
Burusho 0,228005 430068
West-Finnish 0,227732 428002
Mordovian 0,227516 428289
Uzbek 0,227505 429040
Scottish 0,227386 424733
Lithuanian 0,227366 427910
Russian 0,227171 429632
Estonian 0,227011 427914
Late-migr-Finnish 0,226909 425752
Nogai 0,226781 428651
Ukrainian 0,226496 427856
Basque 0,226492 429407
Pathan 0,226399 429881
Norwegian 0,226367 428222
Kent 0,226352 429342
Sindhi 0,226269 429775
CEU 0,226117 426154
Czech 0,226022 428095
Belarusian 0,225979 428132
Cornwall 0,225874 429475
Turkmen 0,225830 427921
Croatian 0,225659 428143
Spanish_North 0,225513 425477
BR2 0,225508 410515
Kalash 0,225436 428481
Orcadian 0,225426 428469
French_South 0,225277 426941
Hungarian 0,225210 429370
French 0,224550 429600
Chechen 0,224507 427795
Bergamo 0,224065 428459
Bulgarian 0,223971 428137
Kumyk 0,223804 427850
Adygei 0,223778 429220
Balkar 0,223611 428347
KO1 0,223611 227060
Balochi 0,223511 429758
Lezgin 0,223288 427957
Greek 0,223254 429378
Tuscan 0,223173 427389
LBK 0,223083 348603
Sardinian 0,223048 429279
Spanish 0,222956 430150
Abkhasian 0,222406 427787
NE1 0,222323 408466
Armenian 0,222067 428086
AngloSaxon 0,221446 364855
Briton 0,220870 387006
Sicilian 0,220530 428355
Iceman 0,220186 357120
IR1 0,220045 218652
Makrani 0,219359 429665
NE5 0,218036 185097
Maltese 0,217884 427243
NE6 0,216792 214577
CO1 0,216335 196647
NE7 0,215117 207674


Source f3 SNPs
KO1 0,281785 192452
CO1 0,275900 166516
Loschbour 0,272549 291022
NE5 0,271244 156624
NE7 0,269457 175583
LaBrana 0,269449 316501
NE6 0,268098 181650
IR1 0,265718 184708
Motala 0,261317 314780
MA1 0,259493 245408
Briton 0,258596 321656
AngloSaxon 0,258345 303958
Lithuanian 0,253541 350200
Icelandic 0,252974 350409
Estonian 0,252884 350223
Saami 0,252622 340401
Iceman 0,252033 297127
Basque 0,251777 350842
Cornwall 0,251709 350884
Orcadian 0,251538 350487
East-Finnish 0,251535 349644
Late-migr-Finnish 0,251392 348727
Kent 0,251379 350857
Spanish_North 0,251172 349075
Norwegian 0,251126 350384
CEU 0,250908 348936
Czech 0,250863 350320
Belarusian 0,250750 350254
Mordovian 0,250698 350335
Scottish 0,250626 348725
West-Finnish 0,250621 349664
French_South 0,250603 349813
Ukrainian 0,250368 350180
Hungarian 0,250142 350866
Russian 0,250120 350908
NE1 0,250051 337520
Croatian 0,249933 350312
BR2 0,249875 339188
French 0,249870 350938
LBK 0,249408 289200
Bergamo 0,249247 350473
Tuscan 0,248178 350059
Bulgarian 0,248104 350333
Spanish 0,247902 351118
Sardinian 0,247686 350834
Greek 0,246928 350845
Chuvash 0,246590 350370
Chechen 0,245638 350127
Lezgin 0,245571 350218
Adygei 0,244999 350727
Kumyk 0,244259 349999
Sicilian 0,243754 350423
Balkar 0,243645 350386
Armenian 0,243315 350223
Karitiana 0,242963 346845
Nogai 0,242688 350399
Abkhasian 0,242637 350083
Selkup 0,242314 349959
Kalash 0,242050 350303
Maltese 0,241578 349953
Pathan 0,241183 350912
Turkmen 0,240726 350053
Burusho 0,240202 350967
Pima 0,239602 347912
Uygur 0,239302 350473
Sindhi 0,239120 350854
Balochi 0,238957 350877
Uzbek 0,238909 350442
Hazara 0,238809 350680
Altaian 0,237666 349553
Eskimo 0,237437 349229
Yukagir 0,237166 350467
Chukchi 0,236065 349381
Yakut 0,235438 350339
Kalmyk 0,234958 350016
Makrani 0,234912 350837
Dolgan 0,234503 347029
Nganasan 0,234148 348588
Koryak 0,233907 348368
Xibo 0,231913 348960
Mongola 0,231745 348637
Hezhen 0,231369 348938
Han_NChina 0,231100 349387
Korean 0,230986 348358
Miao 0,230982 348983
Japanese 0,230937 350016
Han 0,230338 350178
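For reference, the outgroup f3 statistic of the form f3(mbuti;test,X) behind the tables above can be sketched as the average, over SNPs, of the product of two allele frequency differences from the outgroup. This is only an illustration of the formula, not the exact procedure behind the tables: real tools (for example qp3Pop in ADMIXTOOLS) also apply sample-size corrections and compute standard errors, both omitted here. The function name `outgroup_f3` and the frequencies are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Freq = Optional[float]  # None marks a SNP that is missing in a population


def outgroup_f3(outgroup: Sequence[Freq], test: Sequence[Freq],
                ancient: Sequence[Freq]) -> Tuple[float, int]:
    """Return (f3, number of SNPs used) for f3(outgroup; test, ancient).

    f3 is the mean of (o - t) * (o - a) over SNPs with data in all three.
    A larger value means more shared drift between `test` and `ancient`.
    """
    if not (len(outgroup) == len(test) == len(ancient)):
        raise ValueError("all three populations need the same SNP list")
    total = 0.0
    used = 0
    for o, t, a in zip(outgroup, test, ancient):
        if o is None or t is None or a is None:
            continue  # skip SNPs missing in any population
        total += (o - t) * (o - a)
        used += 1
    if used == 0:
        raise ValueError("no SNP has data in all three populations")
    return total / used, used


# Hypothetical allele frequencies for 4 SNPs; the ancient genome lacks one.
mbuti = [0.1, 0.9, 0.5, 0.2]
test_pop = [0.6, 0.3, 0.5, 0.7]
ancient = [0.5, 0.0, None, 1.0]
f3, n_snps = outgroup_f3(mbuti, test_pop, ancient)
print(f"f3 = {f3:.6f} using {n_snps} SNPs")
```

The second return value corresponds to the SNPs column: ancient genomes with lower coverage leave fewer usable SNPs, which is why the counts differ from row to row.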

Sunday, 11 January 2015

Enhanced Baltic Sea admixture tests

This updated version has some minor modifications to add more West European genetic diversity.  For that reason I replaced the Norwegian samples with Swedish ones and made some minor changes in South Europe.  As you might expect, the main result is increased western admixture on the eastern shores of the Baltic Sea.  Unfortunately it turned out to be difficult to distinguish southwestern and western admixtures in the Baltic Sea region results, evidently because ADMIXTURE cannot tell the difference between original admixture and admixture passed on through other populations outside the original mixing area.

All you need to do is download the files here and follow the instructions in my previous post, except that you use the run parameters k5a.par, k6a.par and k7a.par.  You can also download the genotype database here and start making your own analyses.  The following picture shows the data in detail:

Tuesday, 6 January 2015

Admixture analysis of the Baltic Sea region

This test shows genetic similarities in the Baltic Sea region.  It works well for all people from Northern Russia to Norway, including all areas along the Baltic Sea.  It doesn’t work if your origin is more southern or eastern, or if you are an immigrant with genetic admixture from other regions.  This test doesn’t tell exactly where you or your ancestors are from, not even within the above-mentioned zone.  It doesn’t tell about ancient migrations.  It does tell reliably about genetic similarities between people.

I implemented this test in the DIYDodecad procedure (made and authorized by Dienekes), but to reach the goal I made some changes to the original run procedure to get more resolution.  It is clear to me that the original concept loses genetic signals.

Another question is how to select reference samples to capture local genetic differences.  I used PCA to find the criteria.  The selection is based on the fact that all ancient migrations came to Northern Europe through the same routes.  It is easy to see how this happened by analyzing Europe-wide PCA plots.


This PCA also represents the samples used in this admixture analysis.  Distances on the plot don't correspond to genetic distances; the main European cluster is compressed by the pull of the distant Uralic groups.

Using PCA I cropped out all Near Eastern and Caucasian samples, as well as everything beyond them as seen from the European perspective.  Almost all Near Eastern gene flow came to North Europe through South and West Europe.  All this can be seen on the PCA plots.  Some Asian gene flow came from the east to the Baltic Sea, especially to Russia and Finland.  So I included some Uralic populations to represent this gene flow, but I didn’t take any East or North Asian samples, because the eastern peoples of Europe who speak Uralic and Turkic languages show up to 30% Siberian gene flow, and adding Asians would have reduced local resolution.

Some notes

- I tried to balance the size of ethnic groups and avoid oversampling local populations.

- I use Finnish old settlement samples only.  The sample group is cleaned of all samples with foreign admixture as well as I could.  The Finnish samples most likely represent Tavastians.

- I don't do population-dependent LD processing.  IMO LD-pruning is often the reason for poor results, along with oversampling of test groups.

- I still lack North German and Saami samples.  It would also be nice to have those Baltic-Finnic Russian minorities that are available to Russian researchers.

- I use only public data distributed by universities, thus all individual DIY results are free of the “calculator effect”.  ADMIXTURE runs are carried out in two phases.  In the initial phase the samples were run in UNSUPERVISED mode.  The output phase was run in SUPERVISED mode using all homogeneous populations as control groups, regardless of the admixture rate shown in the preceding unsupervised run.  In this way I was hopefully able to fix the usual problem of homogeneous populations ruining admixture analyses.  This means that any population whose individuals all show similar results, such as 30%/70% in two k-groups, is treated as a control group, but if individuals under the same population label show varying admixtures, the population is not used as a control in the supervised output phase.

- I cannot publish the k-distribution data for the samples used, because of overfitting in the admixture runs and because I do not have enough samples to produce meaningful “calculator effect” free results.  But here are some example results:

Lithuanian k6
17.46% West-Europe
16.49% North-Baltic
2.56% South-Europe
4.50% East-Europe&Volgaic
51.92% E-Cntral-Euro&S-Balt
7.08% Southeast-Europe 

Lithuanian k7
17.11% West-Europe
14.76% North-Baltic
2.43% South-Europe
16.82% North-Russia
2.07% East-Europe&Volgaic
39.94% E-Cntral-Euro&S-Balt
6.88% Southeast-Europe

North Russian k6
15.69% West-Europe
14.97% North-Baltic
5.81% South-Europe
19.05% East-Europe&Volgaic
37.83% E-Cntral-Euro&S-Balt
6.65% Southeast-Europe

North Russian k7
15.49% West-Europe
12.58% North-Baltic
5.68% South-Europe
20.67% North-Russia
15.55% East-Europe&Volgaic
23.70% E-Cntral-Euro&S-Balt
6.33% Southeast-Europe

North American k5
42.61% West-Europe
9.36% North-Baltic
25.47% South-Europe
2.79% East-Europe&Volgaic
19.77% E-Cntral-Euro&S-Balt

North American k6
39.04% West-Europe
10.26% North-Baltic
17.24% South-Europe
3.02% East-Europe&Volgaic
19.60% E-Cntral-Euro&S-Balt
10.84% Southeast-Europe

North American k7
38.80% West-Europe
9.24% North-Baltic
17.13% South-Europe
8.42% North-Russia
1.91% East-Europe&Volgaic
13.84% E-Cntral-Euro&S-Balt
10.67% Southeast-Europe

Southwestern Finnish k6
25.25% West-Europe
27.50% North-Baltic
3.94% South-Europe
11.39% East-Europe&Volgaic
26.53% E-Cntral-Euro&S-Balt
5.38% Southeast-Europe

Southwestern Finnish k7
24.99% West-Europe
26.20% North-Baltic
3.67% South-Europe
11.76% North-Russia
9.48% East-Europe&Volgaic
18.59% E-Cntral-Euro&S-Balt
5.32% Southeast-Europe

The Mediterranean (South-Europe) component peaks in Central Italy, extending to North Italy and the Iberian Peninsula.  West Europe peaks in England and Norway.  North Baltic peaks in Tavastia, Finland.  East-Central Euro&South Baltic peaks in Belarus and Lithuania.  East Europe peaks in the Volga/Ural regions and North Russia in the Kargopol/Vologda/Mordva regions.  Southeast Europe is highest among Bulgarians and Romanians.
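The rule from the notes above for choosing control groups for the supervised phase (a population is a control group when all its individuals show nearly the same proportions in the unsupervised run) can be sketched like this. It is a minimal illustration, not the exact procedure used: the homogeneity measure (standard deviation per component), the threshold `max_sd` and all proportions in the example are assumptions.

```python
from statistics import pstdev
from typing import Dict, List, Sequence


def is_homogeneous(q_rows: Sequence[Sequence[float]],
                   max_sd: float = 0.05) -> bool:
    """True if every ancestry component varies little between individuals.

    q_rows: one row of ancestry proportions per individual (unsupervised run).
    max_sd: assumed threshold for the per-component standard deviation.
    """
    if len(q_rows) < 2:
        return False  # a single individual says nothing about homogeneity
    k = len(q_rows[0])
    for component in range(k):
        values = [row[component] for row in q_rows]
        if pstdev(values) > max_sd:
            return False
    return True


def control_groups(q_by_population: Dict[str, List[List[float]]],
                   max_sd: float = 0.05) -> List[str]:
    """Names of populations that qualify as supervised control groups."""
    return sorted(pop for pop, rows in q_by_population.items()
                  if is_homogeneous(rows, max_sd))


# Hypothetical unsupervised K=2 results.
q = {
    "PopA": [[0.30, 0.70], [0.31, 0.69], [0.29, 0.71]],  # mixed but uniform
    "PopB": [[0.10, 0.90], [0.60, 0.40], [0.35, 0.65]],  # varying admixture
}
print(control_groups(q))
```

Note that PopA qualifies even though it is 30%/70% mixed: what matters is that its individuals agree with each other, not that they are unadmixed.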

You can download all the Diydodecad files needed to run your own tests from here.  When running the analyses, use the parameter files “k<n>.par”, where n is the desired k value: 5, 6 or 7.  More detailed instructions about Diydodecad and its installation can be found here, and the original Diydodecad files are downloadable here.  My download package, however, includes everything necessary to run the tests.

You can download the genotype data used in this test in EIGENSTRAT format here.  It can be converted to PED format using, for example, Eigensoft's CONVERTF; the result is usable, although you will probably lose the original allele pairs.
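CONVERTF reads its settings from a parameter file. A minimal sketch that builds such a file for an EIGENSTRAT to PED conversion is shown below. The parameter keys follow the CONVERTF documentation, but the file prefixes `baltic` and `baltic_ped` are hypothetical placeholders, not the names of the files in the download package, so check them against the files you actually have.

```python
def convertf_par(prefix_in: str, prefix_out: str) -> str:
    """Build the text of a CONVERTF parameter file (EIGENSTRAT -> PED).

    prefix_in:  path prefix of the .geno/.snp/.ind input files (assumed names).
    prefix_out: path prefix for the converted output files.
    """
    lines = [
        f"genotypename:    {prefix_in}.geno",
        f"snpname:         {prefix_in}.snp",
        f"indivname:       {prefix_in}.ind",
        "outputformat:    PED",
        f"genotypeoutname: {prefix_out}.ped",
        f"snpoutname:      {prefix_out}.pedsnp",
        f"indivoutname:    {prefix_out}.pedind",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical prefixes; replace them with the real file names.
par_text = convertf_par("baltic", "baltic_ped")
print(par_text)
```

Save the printed text to a file, for example par.EIGENSTRAT.PED, and run `convertf -p par.EIGENSTRAT.PED`.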