After a long testing period I now have tools to process fastq files, and I can create PED and EIGENSTRAT samples from the original scan results. The workflow is based on BWA and GATK; in outline:
1. mapping each fastq file separately with BWA-MEM
2. sorting and merging (samtools)
3. extracting mapped reads above a certain mapping quality (samtools)
4. removing duplicates (Picard tools)
5. recalibrating base quality scores (GATK)
6. calling genotypes (GATK), checking the base quality
7. updating rs IDs
8. extracting PED from VCF (vcftools)
9. converting PED to EIGENSTRAT
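As a rough illustration, the samtools, Picard and vcftools steps above could be scripted like this. This is only a sketch, not my exact scripts: the file names, the thread count, the mapping quality cutoff of 30 and the helper `build_commands` are assumptions made for the example. The GATK steps (5 and 6), the rs ID update (7) and the EIGENSTRAT conversion (9) are left out because their command lines depend heavily on the tool version. The function only builds the command strings and does not run anything.

```python
from pathlib import Path

# Assumed values for illustration only; adjust to your own setup.
REFERENCE = "reference.fa"
THREADS = 8
MIN_MAPQ = 30


def build_commands(sample, fastq_files):
    """Build shell command strings for steps 1-4 and 8 of the list above."""
    commands = []
    sorted_bams = []
    for fastq in fastq_files:
        stem = Path(fastq).name.split(".")[0]
        bam = f"{stem}.sorted.bam"
        # Steps 1-2: map each fastq file separately and sort the output.
        commands.append(
            f"bwa mem -t {THREADS} {REFERENCE} {fastq} | samtools sort -o {bam} -"
        )
        sorted_bams.append(bam)
    # Step 2: merge the per-file BAMs into one BAM per sample.
    merged = f"{sample}.merged.bam"
    commands.append(f"samtools merge {merged} " + " ".join(sorted_bams))
    # Step 3: keep mapped reads (-F 4) at or above the MAPQ cutoff (-q).
    filtered = f"{sample}.filtered.bam"
    commands.append(f"samtools view -b -F 4 -q {MIN_MAPQ} -o {filtered} {merged}")
    # Step 4: remove duplicate reads with Picard MarkDuplicates.
    commands.append(
        f"java -jar picard.jar MarkDuplicates I={filtered} O={sample}.dedup.bam "
        f"M={sample}.dup_metrics.txt REMOVE_DUPLICATES=true"
    )
    # Step 8: convert the VCF produced by steps 5-7 (not shown) to PED/MAP.
    commands.append(f"vcftools --vcf {sample}.vcf --plink --out {sample}")
    return commands


if __name__ == "__main__":
    for command in build_commands("sample1", ["lane1.fastq", "lane2.fastq"]):
        print(command)
```

Printing the commands first makes it easy to check them by eye before feeding them to a shell.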
Processing one sample takes 6-12 hours on my laptop (i7, 3.5 GHz, 8 threads used, 32 GB memory).
After checking all samples from the study release I was sure that I could find more information using qpDstat, which compares shared genetic drift rather than IBS, the measure used in the original study. I considered this possible because I have seen in my own work that IBS often gives high values for unmixed, drifted populations, and such a result does not necessarily imply common ancestry to the same degree. Mixed populations, in turn, are evidently underestimated. In this respect qpDstat beats IBS statistics.
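For readers who have not met the statistic before, here is a minimal sketch of the point estimate behind D(W, X; Y, Z), computed from per-SNP allele frequencies. It is a simplified illustration, not the qpDstat code: the function name `d_statistic` and the toy frequencies are made up for the example, and the block jackknife that qpDstat uses for standard errors and Z-scores is left out.

```python
def d_statistic(w, x, y, z):
    """Point estimate of D(W, X; Y, Z) from per-SNP allele frequencies.

    Each argument is a list of allele frequencies (0.0-1.0) for one
    population, with the same SNPs in the same order in all four lists.
    """
    numerator = 0.0
    denominator = 0.0
    for fw, fx, fy, fz in zip(w, x, y, z):
        numerator += (fw - fx) * (fy - fz)
        denominator += (fw + fx - 2 * fw * fx) * (fy + fz - 2 * fy * fz)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


if __name__ == "__main__":
    # Toy frequencies for three SNPs, invented for this example.
    w = [0.0, 0.0, 1.0]
    x = [1.0, 1.0, 0.0]
    y = [0.0, 1.0, 1.0]
    z = [1.0, 0.0, 0.0]
    print(d_statistic(w, x, y, z))
```

Swapping Y and Z flips the sign, so the sign of the result tells which of the two populations shares more alleles with W relative to X.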
A few comments about the results. I used the Kent samples as a baseline in comparisons with the ancient Anglo-Saxon and Iron Age Briton samples, assuming that present-day Brits should be the closest relatives of their ancestors. This was not true in all cases. Apparently Brits are more mixed than some other North Europeans.
I used my new Finnish grouping, which splits Finns into two genetic patterns: one with more local ancestry and another resembling the German Corded Ware samples published last year. Each group represents around 20% of the original 1000 Genomes Finnish sample set after removing outliers. In this test the local 1000 Genomes group differs particularly from the East Finnish samples, which I have gathered directly from volunteers. The difference between East and West Finland is explained more by the Iron Age British sample than by the Anglo-Saxon sample. The Anglo-Saxon sample shows high similarity with present-day Scandinavians and looks more widespread than the Iron Age Briton everywhere in northernmost Europe. Of course, more Iron Age West European samples could tell more and perhaps confirm my results. Hopefully British researchers will soon dig up more Iron Age samples to fulfill my dreams.
I have added two samples from my project members (ISX and LSX) to give better context for the Finnish 1000 Genomes samples. Both project samples are from genealogists of Finnish-speaking ancestry.
I have two databases: the smaller one holds 1 million SNPs but only a few populations, and the larger one holds 0.25 million SNPs and around 3000 samples. In this particular case the first one allows around 200k SNPs to be used, while the latter gives 112368 SNPs (Anglo-Saxon) and 71478 SNPs (Iron Age sample).
edit 25.2.2016 23:25
Replacing Kent with an outgroup (Mbuti) gives absolute distances with reasonable accuracy. Closest to the Anglo-Saxon sample are
And closest to the Iron Age Briton are
followed by Ireland, FinnsMostCW and Kent.
edit 26.2.2016 13:40
Ranking of the ancient genomes released last year by the Reich Lab. Note that some results are based on small numbers of SNPs. It is likely that Hungary_MBA and Germany Bronze Age get too high scores because of their low SNP counts.
The second column is the calculated difference between Yoruba and the Anglo-Saxon or Iron Age Briton sample, compared to the difference between the ancient population and the Anglo-Saxon or Iron Age Briton sample. The third column is the number of SNPs.
Anglo-Saxon:
Hungary_MBA.SG 0.4435 23477
Remedello_BA.SG 0.4254 178071
Germany_Bronze_Age.SG 0.4209 26827
Bell_Beaker_Germany.SG 0.4137 136565
Sintashta_MBA_RISE.SG 0.4127 245264
Andronovo.SG 0.4016 285166
Corded_Ware_Estonia.SG 0.4002 154859
Bell_Beaker_Czech.SG 0.3981 190277
Iron Age Brit:
Hungary_MBA.SG 0.4531 15019
Germany_Bronze_Age.SG 0.4109 16253
Bell_Beaker_Germany.SG 0.4071 86809
Remedello_BA.SG 0.4061 114783
Sintashta_MBA_RISE.SG 0.3982 154956
Nordic_LBA.SG 0.3839 11704
Andronovo.SG 0.3764 182788
Maros.SG 0.3758 59040
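To make the caveat about low SNP counts concrete, the rows above can be parsed and the thinly covered ones flagged. This is just a sketch: the cutoff of 30000 SNPs is an arbitrary value chosen for the example, not a threshold used in the analysis, and the helper names are made up.

```python
MIN_SNPS = 30000  # arbitrary illustrative cutoff, not from the analysis

ANGLO_SAXON = """\
Hungary_MBA.SG 0.4435 23477
Remedello_BA.SG 0.4254 178071
Germany_Bronze_Age.SG 0.4209 26827
Bell_Beaker_Germany.SG 0.4137 136565
Sintashta_MBA_RISE.SG 0.4127 245264
Andronovo.SG 0.4016 285166
Corded_Ware_Estonia.SG 0.4002 154859
Bell_Beaker_Czech.SG 0.3981 190277
"""

IRON_AGE_BRIT = """\
Hungary_MBA.SG 0.4531 15019
Germany_Bronze_Age.SG 0.4109 16253
Bell_Beaker_Germany.SG 0.4071 86809
Remedello_BA.SG 0.4061 114783
Sintashta_MBA_RISE.SG 0.3982 154956
Nordic_LBA.SG 0.3839 11704
Andronovo.SG 0.3764 182788
Maros.SG 0.3758 59040
"""


def parse_rows(text):
    """Turn 'name score snps' lines into (name, score, snps) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, score, snps = line.split()
        rows.append((name, float(score), int(snps)))
    return rows


def low_coverage(rows, min_snps=MIN_SNPS):
    """Return the names of populations with fewer than min_snps SNPs."""
    return [name for name, _, snps in rows if snps < min_snps]


if __name__ == "__main__":
    for label, table in (("Anglo-Saxon", ANGLO_SAXON), ("Iron Age Brit", IRON_AGE_BRIT)):
        print(label, low_coverage(parse_rows(table)))
```

With this cutoff the flagged populations include Hungary_MBA and Germany_Bronze_Age in both lists, which are exactly the two whose scores I suspect of being inflated.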