Sunday, 20 January 2019

Y chromosome mutations decoded

Thanks to the mutation map of the newest ISOGG Y-DNA Haplogroup Tree, I am now able to decode Y-DNA mutations.  The whole matrix includes over 300,000 Y-chromosomal SNPs and mutation checks, but the result is limited to the mutations found in the BAM files.  The code tested so far works with Build 19/37, but I have also decoded Build 20/38.  I am waiting for my Big Y results and will test Build 20/38 once they arrive.  Nevertheless, novel mutations are detected as well.  My code reads the BAM format, but FASTQ input is also possible if needed.  The second step, after decoding the BAM files, matches the mutations against the ISOGG tree, and the result looks like this:

This particular result was produced from an ancient Kola Peninsula sample, BOO002, but my code works with modern samples as well.  The haplogroup here is N1a1a1a1a, in other words N-L392, along with many parallel mutations shown by the ISOGG tree.  You can see that some downstream mutations point to other haplogroups, because the same downstream mutation can exist in several haplogroups.  I am happy with this, but if someone wants to code a tree based on this code, I will share it (the code, not only the data) for testing purposes.

Saturday, 5 January 2019

Potential pitfall of IBD and other statistics due to homozygous IBD

It is a well-known issue that homozygous IBD can lead to erroneous results in many statistics aimed at inferring ancestry, whether we use present-day or ancient samples.  Here are Beagle statistics showing homozygous IBD within populations using 600,000 SNPs.  The results are not universally applicable because of the low sample numbers, but they are still valid for showing how ancestry statistics can go wrong with any selected data.  Homozygous IBD can also reveal poor (unrepresentative) sample selection.  It is also worth noting that random individuals can have large homozygous segments near the centromere while still showing relatively low overall homozygous IBD; hence a ROH test is not a good method for showing inbreeding.

Saturday, 29 December 2018

False correlation between yDna N1c1 and Asian admixture in Finland and Baltic area

Time and again I see people drawing a connection between Finnish N1c1 and eastern admixture.  Regardless of the eastern origin of N1c, there is no such correlation in Finland.  The reality is even worse for those who cherish this fallacy: if we also count the Baltic countries, the correlation turns out to be negative.  In Finland alone, all male haplogroups show an equal level of Asian admixture, and the only difference comes from locality, not from the male haplogroup.  A rational person would conclude that the Asian admixture comes from a local source.  This is a no-brainer and I don't even need to prove it.  Everyone familiar with the matter knows it, but that doesn't prevent the biggest Finnish newspaper from spreading this urban myth.  Google translation, click here.

Epilogue.  The fallacy of the eastern origin of Finns has many sources.  I am only interested in the opinions that trouble Finnish people and researchers, because I don't care much about "public opinions" without a scientific basis.  A common idea in Finland is that two "races" have lived here.  It rests on a belief in separate Finnish origins (divided along the Lappeenranta-Vaasa axis or some other line) and is driven by Finnish "race realists" who inherited their opinions from the old Swedish school.  Now some Finnish scientists have accepted this idea and detached themselves from known historical facts.

Thursday, 27 December 2018

QpAdm - what it means in practice

As we saw in my previous posts, the relationship between fit and standard error is very meaningful.  We saw that the Basques are a loose mixture of East European Steppe and ancient Iberian people, but they are only distant descendants of those two groups, and we can't prove that these two are their only ancestors, although both definitely contributed genes to the Basques.  I made a similar test showing that the Greeks are distant descendants of Iron Age Anatolians and Bronze Age Balkan people, but again we can't prove that those two were their only ancestors.  Probably not.

                       Balkans_BronzeAge   Anatolia_IA
best coefficients:     0.470               0.530
Jackknife mean:        0.475197121         0.524802879
std. errors:           0.077               0.077

fixed pat  wt  dof     chisq       tail prob
00  0     8    15.062       0.0579554     0.470     0.530
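
The tail prob column can be reproduced from chisq and dof.  The following is a minimal sketch, assuming (as the numbers in these tables suggest) that tail prob is the upper-tail probability of a chi-square distribution with dof degrees of freedom.  The function name `chi2_tail_prob` is a hypothetical helper, not part of qpAdm.

```python
import math


def chi2_tail_prob(chisq, dof):
    """Upper-tail probability of a chi-square distribution.

    Uses the recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with a = dof / 2
    and x = chisq / 2. Only the standard library is needed.
    """
    half = chisq / 2.0
    if dof % 2 == 0:
        # Even dof: start from Q(1, x) = exp(-x)
        a, q = 1.0, math.exp(-half)
    else:
        # Odd dof: start from Q(1/2, x) = erfc(sqrt(x))
        a, q = 0.5, math.erfc(math.sqrt(half))
    while a < dof / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


# chisq and dof taken from the Greek model above
print(round(chi2_tail_prob(15.062, 8), 4))
```

For the Greek model above, `chi2_tail_prob(15.062, 8)` gives about 0.058, in line with the reported 0.0579554.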

On the other hand, qpAdm showed that the Finns are quite strictly descendants of Iron Age Scanian, Iron Age Baltic and Iron Age Saami people, but we can't prove the exact proportions of those three admixtures, as the high standard errors showed.  It is easy to understand that admixtures from closely related populations are harder to determine than admixtures from distant populations, because close relatives share much common ancestry.

But how accurate are results showing very distant ancestry and moderately low standard errors if the fit is poor?  I tested it.  The following tests show the admixtures of the Iron Age Saami people of Levaluhta, Ostrobothnia.

We see that there is only a small difference between the admixtures of the Iron Age Saami modelled with present-day Finns and with Iron Age Scandinavians, each in conjunction with the Bolshoy outlier.  Chisq is high and tail prob. is below 0.4, but the std. error is only 6% at most.  Nothing requires such a high admixture similarity, because the genetic distance between Finns and Scandinavians is rather large.  Such similarity is achieved only through the large genetic distance of the Bolshoy outlier.

Another example, although not equally striking.

Chisq is between 10 and 21 and tail prob. between 0.006 and 0.24.  FI21 shows the best fit.  The std. error is 5% for FI4 and FI12 and highest (9%) for FI21.

Saturday, 22 December 2018

Exciting results of Basques and Estonians, updated: Southwest Finnish results

I recalibrated the qpAdm references to improve accuracy.  Please read my previous post for my opinion and for the problems with qpAdm.

New references:

Kostenki14 813405
MA1 625746
WHG 703903
EHG 975726
CHG 889688
Ganj_Dareh_N 794892
West_Siberia_N 670626
Anatolia_Neolithic 889986
Mbuti1M 971767
Wichi1M 971774

At first I tried to work out the admixture of present-day Estonians, which is really challenging without proper Iron Age samples.  So I had to use the best available modern samples.

                       Latvian1M       FI12            AncFinn
best coefficients:     0.408           0.451           0.141
Jackknife mean:        0.404965620     0.441429933     0.153604447
std. errors:           0.248           0.159           0.203

fixed pat  wt  dof     chisq       tail prob
000  0     7     2.724         0.90932     0.408     0.451     0.141
001  1     8     3.392        0.907406     0.531     0.469     0.000
010  1     8    10.133        0.255821     2.398    -0.000    -1.398  infeasible
100  1     8     5.005        0.757038     0.000     0.626     0.374


Latvian1M  -  Latvian samples covering 1 million SNPs
FI12 - this is me: one of my individual samples covering 1 million SNPs, which in this case gives the best fit.  So I represent present-day Finns here.
AncFinn -  an ancient Finnish sample from Damgaard et al. 2018.
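
The "fixed pat" rows can also be read programmatically, for example to pick out the feasible models.  This is a minimal sketch, assuming the rows are whitespace-separated in the layout shown above (pattern, wt, dof, chisq, tail prob, then the coefficients, with an optional trailing flag).  `parse_pattern_row` and `is_feasible` are hypothetical helper names, and the tolerance is an arbitrary choice.

```python
def parse_pattern_row(line):
    """Split one 'fixed pat' row into named fields."""
    tokens = line.split()
    coefficients = []
    for token in tokens[5:]:
        try:
            coefficients.append(float(token))
        except ValueError:
            break  # trailing flag such as "infeasible"
    return {
        "pattern": tokens[0],
        "wt": int(tokens[1]),
        "dof": int(tokens[2]),
        "chisq": float(tokens[3]),
        "tail_prob": float(tokens[4]),
        "coefficients": coefficients,
    }


def is_feasible(coefficients, tol=1e-3):
    """A model is treated as feasible if every coefficient lies in [0, 1]."""
    return all(-tol <= c <= 1.0 + tol for c in coefficients)


# Rows copied from the Estonian result above
rows = [
    "000  0     7     2.724         0.90932     0.408     0.451     0.141",
    "001  1     8     3.392        0.907406     0.531     0.469     0.000",
    "010  1     8    10.133        0.255821     2.398    -0.000    -1.398  infeasible",
    "100  1     8     5.005        0.757038     0.000     0.626     0.374",
]
for row in map(parse_pattern_row, rows):
    print(row["pattern"], row["tail_prob"], is_feasible(row["coefficients"]))
```

Only pattern 010 is rejected here, which agrees with the "infeasible" flag printed in the table.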

A reasonable fit for the Basques was even more challenging.  My test shows that the Basques are on average two thirds ancient people of the Iberian Peninsula and one third of Steppe origin.

                       SE_Iberia_CA    Yamnaya_Samara
best coefficients:     0.671           0.329
Jackknife mean:        0.670091689     0.329908311
std. errors:           0.022           0.022

fixed pat  wt  dof     chisq       tail prob
00  0     8     5.956        0.652145     0.671     0.329

edit 25.12.2018 12:55

A new Southwest Finnish result using the recalibrated references.  Recalibration here means better results for Asian and African admixtures.  There was also an inconsistency between Iron Gate and WHG, so Iron Gate was removed.

                       Scania_IA       Baltic_IA       Levaluhta
best coefficients:     0.483           0.358           0.159
Jackknife mean:        0.450272111     0.373492239     0.176235650
std. errors:           0.182           0.183           0.111

     fixed pat wt dof chisq       tail prob
     000  0     7     0.657        0.998647     0.483     0.358     0.159
     001  1     8     2.229         0.97317     0.596     0.404     0.000
     010  1     8     3.018         0.93322     0.803     0.000     0.197
     100  1     8     5.884        0.660177    -0.000     0.849     0.151
     011  2     9     4.219        0.896439     1.000     0.000     0.000
     101  2     9     6.568        0.681978     0.000     1.000     0.000
     110  2     9    21.991      0.00890644     0.000     0.000     1.000

Tuesday, 18 December 2018

Still not enough West European Iron Age samples to get proper qpAdm results of West Europeans

My attempt to model present-day Swedes did not turn out as I hoped, because of the lack of proper western Iron Age samples.  I then tried to find the best possible solution using Scania_IA and older samples.  I noticed that in every variation we need Iron Age samples that are currently unavailable and unknown to achieve reasonable results.  So I have to set such tests aside until West European Iron Age samples become available.  Several Central European Late Copper Age samples turned out to be the best candidates, but they did not give proper fits, for instance:

                       Scania_IA       Protoboleraz_LCA
best coefficients:     0.949           0.051
Jackknife mean:        0.947619305     0.052380695
std. errors:           0.041           0.041

This is the best I can do right now.

An issue beyond qpAdm itself is how to judge standard errors.  While we can consider a low standard error good, there is also good reason to consider a high standard error reasonable in many cases.  When two or more populations share much common ancestry (as is often the case today), qpAdm can't determine which one is the right source.  For instance, for admixtures built from Swedes and Norwegians the standard error can be very high, because qpAdm is not able to split the ancestry that both populations have in common.  So when we try to minimize the standard error, we in fact abandon the most obvious result.  People usually try to avoid this dilemma in two ways: 1) using very ancient or distant samples to avoid common ancestry, or 2) accepting very high chisq and small tail prob values.  In the latter case we actually accept a poorer fit in order to show falsely better-looking results.
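
The "Jackknife mean" and "std. errors" rows come from resampling over blocks of SNPs.  As a rough illustration of the idea only, here is an unweighted delete-one-block jackknife on toy numbers.  qpAdm's own procedure works on genomic blocks and, to my understanding, weights them by SNP count, so this sketch is not a reimplementation.  `block_jackknife` and the toy data are assumptions made for illustration.

```python
import math


def block_jackknife(blocks, estimator):
    """Unweighted delete-one-block jackknife.

    blocks    : list of lists of observations (one list per block)
    estimator : function mapping a flat list of observations to a number
    Returns the mean of the leave-one-out estimates and its standard error.
    """
    n = len(blocks)
    leave_one_out = []
    for skip in range(n):
        # Pool every block except the one being left out
        rest = [x for i, block in enumerate(blocks) if i != skip for x in block]
        leave_one_out.append(estimator(rest))
    mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - mean) ** 2 for t in leave_one_out)
    return mean, math.sqrt(variance)


def average(values):
    return sum(values) / len(values)


# Toy data: three blocks of made-up observations, not real genotype data
toy_blocks = [[1, 2], [3, 4], [5, 6]]
print(block_jackknife(toy_blocks, average))
```

The point of the sketch is that the standard error measures how much the estimate moves when part of the genome is left out.  When two sources share most of their ancestry, leaving out a block can shift weight freely from one source to the other, and the standard error grows.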

A result showing high standard errors:


                       Scania_IA       Baltic_IA       Poland_BA
best coefficients:     0.560           0.108           0.332
Jackknife mean:        0.253950408     0.349222728     0.396826864
std. errors:           0.532           0.634           0.389

In this case all the admixtures overlap, which causes statistical shifts and uncertainty between them and hence high standard errors, but the chisq and tail prob values are still relatively good: 2.290 and 0.942093, respectively.

Another case shows low standard errors, but poorer coverage of admixtures:


                       Scania_IA       Hungary_LCA
best coefficients:     0.948           0.052
Jackknife mean:        0.946235880     0.053764120
std. errors:           0.043           0.043

The chisq and tail prob values were 7.413 and 0.492767, respectively.
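
One way to compare the two cases is to turn each coefficient and standard error into a rough interval.  This is a minimal sketch, assuming a normal approximation (coefficient plus or minus 1.96 times the std. error).  That approximation is an illustrative assumption, not something qpAdm reports, and `approx_interval` is a hypothetical helper.

```python
def approx_interval(coef, se, z=1.96):
    """Rough 95% interval under a normal approximation (an assumption)."""
    return coef - z * se, coef + z * se


# (best coefficient, std. error) copied from the two results above
high_se_case = {
    "Scania_IA": (0.560, 0.532),
    "Baltic_IA": (0.108, 0.634),
    "Poland_BA": (0.332, 0.389),
}
low_se_case = {
    "Scania_IA": (0.948, 0.043),
    "Hungary_LCA": (0.052, 0.043),
}

for label, case in (("high std. errors", high_se_case), ("low std. errors", low_se_case)):
    for source, (coef, se) in case.items():
        low, high = approx_interval(coef, se)
        print(f"{label:17s} {source:12s} {low:+.3f} .. {high:+.3f}")
```

In the first case every interval covers the whole range from 0 to 1, so the individual proportions are essentially undetermined even though the fit is good.  In the second case the Scania_IA interval is narrow, at the price of a poorer fit.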

I could make the latter example more provocative for similar target populations, with standard errors of 1-2 percentage points, chisq values around 10-20 and tail prob values around 0.1-0.2.


Monday, 10 December 2018

Finnish genetic composition: Iron Age Baltic, Iron Age Germanic and probably Iron Age Saami people

You have probably read my previous post on European genetic structures built with Admixture and Eurasian data.  My aim was to make an admixture analysis free of recent genetic drift.  It gave the following admixtures for Southwestern Finns (Finnish, k=10):

- Saami 14%
- Baltic 49%
- Germanic 35%

The post is here.

Now I have tested the same admixtures using Iron Age samples and qpAdm.  QpAdm allocates admixture proportions among the given source populations; in these tests the source populations were Baltic Iron Age, Scanian Iron Age and Levaluhta Iron Age.  Levaluhta consists of five Iron Age remains found in Ostrobothnia, Finland.  The ID of the Scanian IA sample is RISE174 and that of the Baltic IA sample is DA171.  Although my results are unambiguous, there is some uncertainty regarding the software and the chosen references, and as the first to explore something like this I am curious to see the results of professional geneticists.  I simply can't understand why this matter wouldn't interest researchers as well.  All samples I use here are publicly available.

Southwest Finns:

chisq       tail prob
1.112       0.992818 

Levaluhta 0.115 
Scania_IA 0.437 
Baltic_IA  0.448


chisq       tail prob
1.555       0.980354

Levaluhta          0.202 
FIN_Southwest 0.614 
Baltic_IA           0.184
Nganasan        -0.000

and another result of Vepsa

chisq       tail prob
2.734       0.908488
Levaluhta          0.000 
FIN_Southwest 0.764 
Baltic_IA            0.164 
Nganasan          0.072

The match is much worse when using present-day Baltic, Germanic and Saami counterparts.

Southwest Finns:

chisq       tail prob
9.270       0.233881

Saami    0.156 
Latvian  0.297 
Swedish 0.548

Outgroups and coverages were

Kostenki14 813405
MA1 625746
WHG 703903
Iron_Gates_HG 884901
EHG 983976
CHG 889688
Ganj_Dareh_N 794892
West_Siberia_N 670626
Anatolia_Neolithic 889986
LBK_EN 880957

Typical SNP coverage of target groups

Vepsa1M 971774
Levaluhta 377788
FIN-Southwest 1102712
Baltic_IA  97311
Scania_IA 373352
Nganasan1M 971774

I have lost some SNPs due to allele mismatches or multiallelic conversion errors in Plink, but the coverage is still reasonable.  Some ancient samples need to be reconverted.

edit 11.12.2018 16:30

Adding Russians from Pinega makes a perfect match for Vepsas.  Not a big surprise.

chisq       tail prob
0.825       0.991377    

FIN-Southwest 0.672    
RusPinega        0.130    
Baltic_IA           0.093    
Levaluhta          0.105