Sunday, December 7, 2014

Starting with new data



I am now starting with a new data set, and hopefully I can give you more reliable results. My new set is based on the data made publicly available by Lazaridis et al. 2014. Some additions were made:

 Finnish, British and CEU samples from the 1000 Genomes Project
 5 ancient British samples from Hinxton
 8 ancient Hungarian samples (from Y-Str server)

Moving to Lazaridis’ data also meant changing to Affymetrix coordinates. Affymetrix doesn’t match most commercial SNP sets well, but it seems to give better coverage for many ancient samples. The switch was not a problem for me, because my private sample collection is small and I can give it up to take advantage of the better data. Now I have

2244 samples
555268 SNPs

Most individual samples cover nearly the whole 550k SNP set, and many ancient samples also reach over 500k SNPs.

Ancient genomes included
BR2-Hungarian
CO1-Hungarian
IR1-Hungarian
KO1-Hungarian
NE1-Hungarian
NE5-Hungarian
NE6-Hungarian
NE7-Hungarian
Hinxton1
Hinxton2
Hinxton3
Hinxton4
Hinxton5
Denisova
Loschbour
LBK
Mezmaiskaya
Otzi
Saqqaq
MA1
AG2
Swedish_HunterGatherer
Swedish_Farmer
Motala_merge
Motala12
LaBrana
Denisova_light
Vindija_light

I am not yet familiar with these ancient samples, and it will take time to find out their secrets.

With the new data I am also starting with a new software toolkit. I found the Eigenstrat format handy, because it is easy to handle with SQL tools. This decision led to another one: to use the Eigensoft software. Luckily, a new version was released just a few weeks ago.
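To show why the Eigenstrat format sits so well with SQL tools, here is a minimal Python sketch, not my actual pipeline. It assumes the usual text layout: one line per sample in the .ind file (ID, sex, population), one line per SNP in the .snp file (name, chromosome, genetic position, physical position, alleles), and one line per SNP in the .geno file with one character per sample, where 9 means missing. The table names, column names and the functions `load_eigenstrat` and `coverage` are my own inventions for illustration.

```python
import sqlite3

def load_eigenstrat(ind_lines, snp_lines, geno_lines):
    """Load EIGENSTRAT-style text lines into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ind (idx INTEGER PRIMARY KEY, sample TEXT, "
               "sex TEXT, population TEXT)")
    db.execute("CREATE TABLE snp (idx INTEGER PRIMARY KEY, name TEXT, "
               "chrom TEXT, gpos REAL, ppos INTEGER)")
    db.execute("CREATE TABLE geno (snp_idx INTEGER, ind_idx INTEGER, g INTEGER)")

    # .ind: sample ID, sex, population label
    for i, line in enumerate(ind_lines):
        sample, sex, population = line.split()[:3]
        db.execute("INSERT INTO ind VALUES (?,?,?,?)",
                   (i, sample, sex, population))

    # .snp: SNP name, chromosome, genetic position, physical position
    for j, line in enumerate(snp_lines):
        name, chrom, gpos, ppos = line.split()[:4]
        db.execute("INSERT INTO snp VALUES (?,?,?,?,?)",
                   (j, name, chrom, float(gpos), int(ppos)))

    # .geno: row = SNP, column = sample, value 0/1/2 or 9 for missing
    for j, line in enumerate(geno_lines):
        for i, ch in enumerate(line.strip()):
            db.execute("INSERT INTO geno VALUES (?,?,?)", (j, i, int(ch)))
    return db

def coverage(db):
    """Return {sample: number of SNPs with a non-missing genotype}."""
    rows = db.execute(
        "SELECT ind.sample, SUM(geno.g != 9) "
        "FROM ind JOIN geno ON geno.ind_idx = ind.idx "
        "GROUP BY ind.idx ORDER BY ind.idx").fetchall()
    return dict(rows)
```

With the data in tables like these, a question such as "how many SNPs does each ancient sample cover" becomes a single query. For the full 550k set one long table would be slow to fill row by row, so this is only meant to show the idea on small inputs.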

It is time to uncork the new data, starting with Eigensoft’s excellent PCA tool. It can do many things, such as LD-pruning, random sample selection and sample projection. For now I use only the random sample selection, which ensures that no population is oversampled. I use the whole 550k data and do not use LD-pruning, for two reasons. First, it is not needed to avoid excessive clustering: working with LD-pruned data, I did not notice any improvement in clustering. Second, LD-pruning can be disadvantageous, because not all populations have the same amount of genetic drift to remove. In general LD-pruning must be used carefully and it is not for dummies like me, so I seldom use it and trust the original data.
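The idea behind random sample selection can be sketched in a few lines of Python. This is not Eigensoft's code, only a stand-alone illustration under my own assumptions: samples come as (sample ID, population) pairs, and the function name `cap_per_population` and its parameters `max_per_pop` and `seed` are hypothetical.

```python
import random
from collections import defaultdict

def cap_per_population(samples, max_per_pop, seed=0):
    """Keep at most max_per_pop randomly chosen samples per population.

    samples: list of (sample_id, population) pairs.
    Returns a list of (sample_id, population) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable

    # Group sample IDs by their population label.
    by_pop = defaultdict(list)
    for sample_id, population in samples:
        by_pop[population].append(sample_id)

    kept = []
    for population, ids in by_pop.items():
        if len(ids) <= max_per_pop:
            chosen = ids  # small populations are kept whole
        else:
            chosen = rng.sample(ids, max_per_pop)  # draw without replacement
        kept.extend((sample_id, population) for sample_id in chosen)
    return kept
```

With a cap of 5, a population of 10 samples is thinned to 5 random ones, while a population of 3 is kept as it is, so large populations cannot dominate the principal components.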

PCA including Europeans





Fst-distance table. The average standard error is 0.000762745, meaning that the values can be off by about one thousandth.


Fst-table
