Tuesday, May 5, 2020

Creating average genetic samples

Ancient samples in particular often have poor quality, even when the number of samples is reasonable.  I wrote a simple Linux shell script that builds an average sample from a sample group, which makes it possible to raise sample quality to a level that is reasonable for analyses based on allele frequencies.  The script reads EIGENSTRAT format, and the result is formed by picking alleles randomly from the pooled samples.  Unfortunately, Linux shell scripts don't support indexed files, so I had to make some compromises to keep the run time reasonable.

The result of this script will not work with analyses based on IBD or principal components, so it cannot be used, for example, as an input file for the popular Eurogenes G25 test, but it should work with all analyses on the Dodecad platform.  If you are not familiar with these distinctions, the outcome may well be disappointing.

The script is freely available here.  You also need an rs-id file, which is available here; it should be unzipped into the same directory as the script.  Some results are shown below, using available academic samples and based on my own models:

Medieval Nomads

East_Asian 42.4
Uralic 24.5
Siberian 15.9
Northeast_European 6.1
Mediterranean 4.4
East_Scandinavian 3.3
Fennoscandinavian 2.2
AMBIG_European 1.2

Iberian Chalcolithic

Mediterranean 80.4
Northwest_European 15.6
Central_European 2.7
East_Scandinavian 1.4

Hungarian Bronze Age

Mediterranean 86.4
Central_European 11.3
Fennoscandinavian 1.1
East_Scandinavian 1.2

Polish Bronze Age

Northwest_European 39.0
Slavic 37.8
Fennoscandinavian 11.9
Central_European 6.5
East_Scandinavian 3.6
Baltic 1.2

Estonian Iron Age

Baltic 57.1
Slavic 26.1
Fennoscandinavian 11.5
Finnic 2.8
East_Scandinavian 2.2
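
For readers who want to see the idea of "picking alleles randomly from pooled samples" in concrete form, here is a minimal Python sketch. It is not the shell script itself, only an illustration under my own assumptions: each line of an EIGENSTRAT .geno file is one SNP, each character is one sample (0, 1 or 2 copies of the reference allele, 9 for missing), every non-missing sample contributes two alleles to the pool, and two alleles are drawn with replacement to form the averaged genotype. The function names are hypothetical.

```python
import random

MISSING = "9"  # EIGENSTRAT code for a missing genotype


def average_genotype(row, rng):
    """Return one genotype character (0, 1, 2 or 9) for a single SNP row."""
    pool = []
    for g in row.strip():
        if g == MISSING:
            continue  # missing calls add nothing to the pool
        ref = int(g)  # number of reference alleles in this sample
        # Assumption: each sample contributes its two alleles to the pool.
        pool.extend([1] * ref + [0] * (2 - ref))
    if not pool:
        return MISSING  # no sample in the group covers this SNP
    # Assumption: draw two alleles with replacement and count reference alleles.
    return str(rng.choice(pool) + rng.choice(pool))


def average_sample(geno_lines, seed=None):
    """Build one averaged genotype column from the rows of a .geno file."""
    rng = random.Random(seed)
    return [average_genotype(line, rng) for line in geno_lines]
```

A SNP is missing in the averaged sample only when it is missing in every pooled sample, which is why pooling several low-coverage samples raises the call rate. Because the alleles are drawn at random, two runs without a fixed seed will give slightly different averaged samples.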

edit 8.5. 15:30

The script has been edited so that it also accepts rs-ids in the input; the original version accepted only concatenated ids (chr:location).  Please note that only hg19/GRCh37 mappings are supported.  The new version is available here.
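
As an illustration of what accepting both id styles can look like, here is a small Python sketch. It is not taken from the script. The layout of the lookup table (two whitespace-separated columns, rs-id first and chr:location second) and the function names are assumptions of mine, and the ids in the usage example are made up.

```python
def load_rsid_map(lines):
    """Read 'rs-id chr:location' pairs (assumed layout) into a dictionary."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        rsid, position = fields[0], fields[1]
        mapping[rsid] = position
    return mapping


def to_position_id(snp_id, rsid_map):
    """Return the chr:location id for a SNP, or None if the rs-id is unknown."""
    if snp_id.startswith("rs"):
        return rsid_map.get(snp_id)
    return snp_id  # already in chr:location form


# Usage with made-up ids:
rsid_map = load_rsid_map(["rs123 1:1000", "rs456 2:2000"])
print(to_position_id("rs123", rsid_map))
print(to_position_id("3:500", rsid_map))
```

The positions in such a table are only valid for one genome build, so a table built on hg19/GRCh37 will give wrong matches for input mapped to another build.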