After a long testing period I now have tools to process fastq files, and I can create PED and EIGENSTRAT samples from the original scan results. The workflow is based on BWA and GATK; in outline:
1. mapping fastq files separately using BWA-MEM
2. sorting and merging (samtools)
3. extracting mapped reads above a certain mapping quality (samtools)
4. removing duplicates (Picard tools)
5. recalibrating base quality scores (GATK)
6. calling genotypes (GATK), checking the base quality
7. updating RS IDs
8. extracting PED from VCF (vcftools)
9. converting PED to EIGENSTRAT
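The first four steps of the list can be sketched as a small Python helper that only builds the command lines and executes nothing. This is an illustration under assumptions, not the exact commands used here: the file names, the reference name `reference.fa`, the thread count and the mapping-quality cut-off of 30 are all hypothetical, and the GATK and Picard steps are left out because their invocation depends on the tool version.

```python
import shlex


def build_mapping_commands(sample, fastq_files, reference="reference.fa",
                           threads=8, min_mapq=30):
    """Return shell command strings for steps 1-4; nothing is executed.

    All names and thresholds are hypothetical placeholders.
    """
    commands = []
    sorted_bams = []
    for i, fastq in enumerate(fastq_files, start=1):
        sorted_bam = f"{sample}.part{i}.sorted.bam"
        # Step 1 and the sorting half of step 2: map one fastq file
        # with BWA-MEM and pipe the alignments into samtools sort.
        commands.append(
            f"bwa mem -t {threads} {shlex.quote(reference)} {shlex.quote(fastq)}"
            f" | samtools sort -@ {threads} -o {shlex.quote(sorted_bam)} -"
        )
        sorted_bams.append(sorted_bam)

    # Merging half of step 2: combine the per-file BAMs into one.
    merged = f"{sample}.merged.bam"
    commands.append(
        "samtools merge -f "
        + " ".join(shlex.quote(p) for p in [merged] + sorted_bams)
    )

    # Step 3: keep mapped reads (-F 4 drops unmapped reads) with
    # mapping quality of at least min_mapq.
    filtered = f"{sample}.q{min_mapq}.bam"
    commands.append(
        f"samtools view -b -F 4 -q {min_mapq} "
        f"-o {shlex.quote(filtered)} {shlex.quote(merged)}"
    )
    return commands


if __name__ == "__main__":
    for cmd in build_mapping_commands("sample1", ["run1.fastq", "run2.fastq"]):
        print(cmd)
```

Printing the commands first makes it easy to check them by eye before running anything that takes hours per sample.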
Processing one sample takes 6-12 hours on my laptop (i7, 3.5 GHz, 8 threads used, 32 GB memory).
After checking all samples from the study release I was sure that I could find more information using qpDstat, which compares shared genetic drift rather than IBS (identity by state), which was used in the original study. I considered this possible because I have seen in my own work that IBS often gives high results for unmixed and drifted populations, and such a result does not necessarily imply common ancestry to the same degree. Mixed populations are correspondingly underestimated. In this sense qpDstat beats IBS statistics.
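For readers who have not met the statistic, here is a minimal sketch of the D-statistic as it is usually defined from allele frequencies: the sum over SNPs of (w - x)(y - z), divided by the sum of (w + x - 2wx)(y + z - 2yz). This is only the point estimate, not a reimplementation of qpDstat; the function name is hypothetical, and the block jackknife that qpDstat uses for standard errors and Z-scores is left out.

```python
def d_statistic(w, x, y, z):
    """Point estimate of D(W, X; Y, Z) from per-SNP allele frequencies.

    Each argument is a sequence of frequencies in [0, 1], one value per SNP,
    all in the same SNP order. Hypothetical helper for illustration only.
    """
    numerator = 0.0
    denominator = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        # Excess of one allele-sharing pattern over the other at this SNP.
        numerator += (wi - xi) * (yi - zi)
        # Normalisation: chance that W and X differ and Y and Z differ.
        denominator += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator
```

A value near zero means W and X are symmetrically related to Y and Z; a clearly positive or negative value means one of them shares more drift with Y than the other does.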
A few comments about the results. I used Kent samples as a baseline in comparison to the ancient Anglo-Saxon and Iron Age Briton samples, assuming that present-day Brits should be the closest relatives of their ancestors. This was not true in all cases. Apparently Brits are more mixed than some other North Europeans.
I used my new Finnish grouping, which splits Finns into two genetic patterns: one consisting of more local ancestry and another resembling the German Corded Ware samples published last year. Both groups represent around 20% of the original 1000genomes Finnish sample set, after removing outliers. In this test the local 1000g group in particular seems to differ from the East Finnish samples, which I have gathered directly from volunteers. The difference between East and West Finland is explained more by the Iron Age British sample than by the Anglo-Saxon sample. The Anglo-Saxon sample shows high similarity with present-day Scandinavians and looks more widespread than the Iron Age Briton everywhere in northernmost Europe. Of course, more Iron Age West European samples could tell us more and perhaps confirm my results. Hopefully British researchers will soon dig up more Iron Age samples to fulfill my dreams.
I have two samples from my project members (ISX and LSX), added to better illustrate the Finnish 1000genomes samples. Both project samples are from genealogists of Finnish-speaking ancestry.
I have two databases: the smaller one holds 1 million SNPs but only a few populations, and the larger one holds 0.25 million SNPs and around 3000 samples. In this particular case the first one makes it possible to use around 200k SNPs; the latter gives 112368 SNPs (Anglo-Saxon) and 71478 SNPs (Iron Age sample).
edit 25.2.2016 23:25
Replacing Kent with an outgroup (Mbuti) we get absolute distances with reasonable accuracy. Closest to the Anglo-Saxon sample are
Sweden
Norway
Kent
Sounds good.
And closest to the Iron Age Briton are
France
Norway
Welsh
followed by Ireland, FinnsMostCW and Kent.
edit 26.2.2016 13:40
Ranking of ancient genomes released last year by the Reich Lab. It should be noted that some results are based on small numbers of SNPs. It is likely that Hungary_MBA and Germany_Bronze_Age get too high scores due to fewer SNPs.
The second column is the calculated difference between Yoruba and the Anglo-Saxon or Iron Age Briton sample, compared to the difference between each ancient population and the Anglo-Saxon or Iron Age Briton sample. The third column is the number of SNPs.
Anglo-Saxon:
Hungary_MBA.SG 0.4435 23477
Remedello_BA.SG 0.4254 178071
Germany_Bronze_Age.SG 0.4209 26827
Bell_Beaker_Germany.SG 0.4137 136565
Sintashta_MBA_RISE.SG 0.4127 245264
Andronovo.SG 0.4016 285166
Corded_Ware_Estonia.SG 0.4002 154859
Bell_Beaker_Czech.SG 0.3981 190277
Iron Age Brit:
Hungary_MBA.SG 0.4531 15019
Germany_Bronze_Age.SG 0.4109 16253
Bell_Beaker_Germany.SG 0.4071 86809
Remedello_BA.SG 0.4061 114783
Sintashta_MBA_RISE.SG 0.3982 154956
Nordic_LBA.SG 0.3839 11704
Andronovo.SG 0.3764 182788
Maros.SG 0.3758 59040
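Because some of these scores rest on few SNPs, it can help to flag such rows automatically. A minimal sketch follows, using the Anglo-Saxon rows exactly as listed above; the function names are hypothetical, and the cut-off of 30000 SNPs is an arbitrary assumption chosen only for illustration, not a recommended threshold.

```python
# Anglo-Saxon rows copied from the table above: name, score, SNP count.
ANGLO_SAXON_ROWS = """\
Hungary_MBA.SG 0.4435 23477
Remedello_BA.SG 0.4254 178071
Germany_Bronze_Age.SG 0.4209 26827
Bell_Beaker_Germany.SG 0.4137 136565
Sintashta_MBA_RISE.SG 0.4127 245264
Andronovo.SG 0.4016 285166
Corded_Ware_Estonia.SG 0.4002 154859
Bell_Beaker_Czech.SG 0.3981 190277
"""


def parse_rows(text):
    """Split whitespace-separated rows into (name, score, snp_count) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, score, snps = line.split()
        rows.append((name, float(score), int(snps)))
    return rows


def low_coverage(rows, min_snps=30000):
    """Return names of rows whose SNP count is below min_snps (hypothetical cut-off)."""
    return [name for name, _score, snps in rows if snps < min_snps]


if __name__ == "__main__":
    print(low_coverage(parse_rows(ANGLO_SAXON_ROWS)))
```

With this cut-off the two flagged rows are the same two populations singled out in the note above as probably scoring too high.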
IBS results from "Genomic signals of migration and continuity in Britain before the Anglo-Saxons" seemed quite reasonable. The major difference compared to these D-stats is that the Finnish sample was relatively closer to the Anglo-Saxon than to the Iron Age samples, but in absolute terms it was still closer than Central and East European samples.
Could you run an IBS similarity test for the IA and AS samples using the larger database of your first figure so we could compare the difference between IBS and D-stat?
The difference between those IBS results and my D-stat results is that mine show less similarity between ancient samples and drifted populations, like Scots. I do have Scots, Irish and Welsh samples. Unfortunately I have no real German samples. Another shortcoming of the study is that it uses only one Finnish sample, and because the Finns are a very diverse (mixed) population, I have to say that one sample is not a statistic.
I am going to continue with other ancient British samples. Once I am ready I can also do IBS statistics, although IBS is not as reliable as D-stat and genetic drift for finding common ancestry.
Sorry for the typos, due to a too small touchscreen.
The low number of Finnish samples in the original study is just another reason to do the IBS stats with your dataset. Only that way can we see whether MostCW, Local and maybe your project references show different results than what the D-stats give here, and how those results will differ.
We can already see from the D-stats that there is massive differentiation between different Finnish samples and groups, far more than between many individual European countries, and I figure the result will be the same using IBS. But we can't be absolutely sure before the test is done.
I would really like to do it and will do it. But there are some basics to follow before one can understand the difference between IBS and genetic drift. IBS is a simple statistic showing allele similarity, and more mixed populations usually get low results regardless of ancestry, whereas genetic drift shows smaller, specific proportions of common ancestry. In the Finnish results this means that intrapopulation differences can grow, because some Finnish "tribes" can lack some admixtures even though the Finns partly share the same root. It also follows that some Finns show high drift similarity with Iron Age Brits and others show almost no common drift with those Brits. When we then look at IBS we see no similar sorting, because IBS sees the whole genome as one big lump. I hope you got the gist of my explanation.
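To make the "one big lump" point concrete, here is a minimal sketch of a pairwise IBS similarity, assuming genotypes are coded as counts of one allele (0, 1 or 2) with `None` for missing calls. The function name is hypothetical and this is not the exact procedure of the study under discussion; it only shows that IBS averages allele sharing over all SNPs without asking where the sharing comes from.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Average proportion of alleles shared identical by state.

    Genotypes are allele counts (0, 1 or 2) per SNP; None marks a missing call.
    Hypothetical helper for illustration only.
    """
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip SNPs missing in either individual
        # Two diploid genotypes share 2, 1 or 0 alleles by state.
        shared += 2 - abs(a - b)
        compared += 2
    if compared == 0:
        raise ValueError("no overlapping genotype calls")
    return shared / compared
```

Every SNP contributes equally to the average, which is why drift shared with one particular population cannot be separated from overall similarity this way.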
I think we would still see pretty significant differences. In the study these ancient British samples are from, the IBS results of Scots and Welsh differed from each other more than those of Lithuanians and French did, and differences between Finnish subpopulations should be greater than those between subpopulations from the British Isles.
We will see. From what I have seen at 23andme, the Finnish intrapopulation IBS difference is puzzling: many pure Finns are very far from each other, but in general people living in late Finnish settlements get high IBS numbers. This is what happens when people share a common root and are less mixed. In the old settlements they then share less IBS than, for example, Lithuanians. But as I wrote, IBS is not directly comparable with genetic drift. Comparing ancient genomes to modern populations using IBS is especially problematic. For that reason we use qpDstat and qpAdm.
I mean that the IBS sharing of Scots and Welsh, Lithuanians and French with ancient Britons differed. Not Scots sharing with other Scots or Welsh, or Lithuanians with other Lithuanians.
okey-dokey
One more thing, you might want to put the 1000genomes IDs for the MostCW and Local groups (HG00X etc.) up in a post. That way others with the data can repeat/verify the results too.
Of course, you're welcome
FinnLocal
HG00176
HG00182
HG00185
HG00187
HG00266
HG00276
HG00282
HG00304
HG00309
HG00313
HG00326
HG00330
HG00338
HG00365
HG00378
FinnMostCW
HG00174
HG00177
HG00178
HG00190
HG00268
HG00274
HG00277
HG00285
HG00311
HG00320
HG00353
HG00362
HG00364
HG00376
HG00381
HG00384
Originally I had a few more, but some samples turned out to be outliers, likely with Slavic admixture (although those outliers are classified as pure Finns in the 1000g project).
Whoever tries to generate PCAs using those samples will be surprised after having seen the ordinary West Finn / East Finn groupings. PCA is not the way to go. I have said it hundreds of times. Why? Because PCA does things wrongly: it makes a bifurcation based on the 4-8% Siberian and the genetic drift common in East Finland. It gives less importance to the remaining 80% or so.
I don't think the difference between "local" and East Finns is related to Siberian because we did that comparison before using verified East Finns and the difference was insignificant, a near-zero result.
EastFinnish FinnLocal Nganassan Chimp.DG 0.0005 Z 0.256 SNP 231799
So it has to be something else, drift or, if admixture, something unrelated to Nganasans.
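The "insignificant, near-zero" reading of that line can be checked mechanically. A minimal sketch, assuming the line layout shown in the comment above (four population labels, the D value, then `Z` and `SNP` fields), which is not necessarily the raw qpDstat output format; the function names are hypothetical, and the |Z| >= 3 cut-off is the commonly used convention, stated here as an assumption.

```python
def parse_dstat_line(line):
    """Parse a line laid out as: W X Y Z_pop D 'Z' z_score 'SNP' count.

    The layout is assumed from the comment above, not from qpDstat itself.
    """
    tokens = line.split()
    return {
        "pops": tuple(tokens[:4]),
        "d": float(tokens[4]),
        "z": float(tokens[6]),
        "snps": int(tokens[8]),
    }


def is_significant(result, z_threshold=3.0):
    """True if the Z-score reaches the chosen cut-off (|Z| >= 3 by convention)."""
    return abs(result["z"]) >= z_threshold


if __name__ == "__main__":
    line = "EastFinnish FinnLocal Nganassan Chimp.DG 0.0005 Z 0.256 SNP 231799"
    result = parse_dstat_line(line)
    print(result["d"], result["z"], is_significant(result))
```

For the quoted line the Z-score of 0.256 is far below the cut-off, so the test gives no evidence that the two Finnish groups differ in their relation to Nganasans.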
Sorry, my previous message meant that PCA USUALLY makes the Finnish bifurcation on the Siberian admixture and the genetic drift of the late Finnish settlements. I don't do that; my grouping is based on the main historical events in Finland. I don't use PCA for this purpose. I only stated that other people do it using PCA, and if you use PCA with my dedicated Local and MostCW samples you get nothing informative if you expect the "old-fashioned results" typically obtained in tests. So your previous D-stat is just as expected; there is practically no Siberian difference, but it is not a question of genetic drift.
My FinnLocal is not made by extracting the highest Siberian among Finns, so that is not the question. The local Finnish group seems to be a local group, as I named it. But most people (other bloggers and those doing business, like 23andme and FTDNA) create the Finnish or East Finnish group by extracting Siberian admixture and/or genetic drift. I don't do that. My FinnLocal group represents local ancestry from the beginning of time. In other words, I am very well aware of what other people do and what has gone wrong, historically speaking. I am not bragging; I only know more about Finnish history and can follow it. Frankly speaking, insistently following a presumption that expects only Siberian admixture and/or genetic drift is a simple, pat answer for what people don't know. Sorry to say that, but I have to be honest.