I used MaCH
(http://csg.sph.umich.edu/abecasis/mach/tour/input_files.html
and http://csg.sph.umich.edu/abecasis/mach/tour/imputation.html ) for imputation and
phasing. The imputation statistics indicated good reliability,
which was expected because only a few SNPs were missing. The
proportion of missing alleles was 0.03% (three per 10000), at approximately
random positions. Both stages were run chromosome
by chromosome, but the processing time was still long, typically hours per
run (PC: quad-core Intel 4770K / 3500 MHz / 16 GB memory).
Data
The data was selected from the following studies, with additional populations from the 1000 Genomes Project (Finnish, CEU, British and Tuscan samples):
http://mbe.oxfordjournals.org/content/29/1/359
http://www.nature.com/nature/journal/v466/n7303/pdf/nature09103.pdf
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature12736.pdf
http://digitalcommons.wayne.edu/humbiol_preprints/41/
These studies limited the total number
of SNPs per sample to around 300000.
Problems that emerged when running Chromopainter/Finestructure
Two problems arise with Chromopainter and Finestructure. I also tested them with smaller "synthetic" data to see in detail how they work and where the problems come from. The first problem concerns isolated "daughter" populations and is caused by the Markov chain process. The process has no knowledge of population history, so more homogeneous and possibly oversampled isolated populations end up looking more like sources than the populations that actually donated, although this is impossible when the isolates are younger "daughters". This error is hard to avoid, because Chromopainter/Finestructure doesn't offer enough factual information to steer the process and to take into account the known history and the origin of “chunks” or haplotypes, i.e. causality. You can supervise Chromopainter, which gives you a chance to correct this problem, or to make it even worse. In practice the only way to avoid these errors is to remove known isolated populations from the input, but this is a choice left to you, and the result can still be subjective.
Here is a
picture showing how the clustering works:
The number of additional chunks multiplies when
the A-B population grows.
This would
be a perfect way to make clusters if only we knew the gene flow directions
between individuals, and a reasonable way if we knew the gene flow
directions between countries or putative populations, but if we have to guess,
the result will be just a guess or even worse.
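The sample-size effect described for isolated populations can be illustrated with a toy calculation. This is only a sketch under my own simplifying assumption that a recipient copies from every other individual with a weight equal to a pairwise population similarity. It is not Chromopainter's actual algorithm, and the function name, population names and similarity values are hypothetical.

```python
def chunk_shares(pop_sizes, similarity, recipient_pop):
    """Expected share of copied chunks per donor population for one recipient.

    Toy assumption: the recipient copies from every other individual with a
    weight equal to the pairwise population similarity, so a population's
    share grows with both its similarity and its sample size.
    """
    weights = {}
    for pop, size in pop_sizes.items():
        # The recipient cannot copy from itself.
        n_donors = size - 1 if pop == recipient_pop else size
        weights[pop] = n_donors * similarity[frozenset((recipient_pop, pop))]
    total = sum(weights.values())
    return {pop: w / total for pop, w in weights.items()}


# Hypothetical similarities: the isolate is homogeneous internally, but it is
# no closer to the source than source individuals are to each other.
similarity = {
    frozenset(("source",)): 1.0,
    frozenset(("source", "isolate")): 1.0,
    frozenset(("isolate",)): 3.0,
}

for n_isolate in (20, 60):
    shares = chunk_shares({"source": 20, "isolate": n_isolate}, similarity, "source")
    print(n_isolate, round(shares["isolate"], 3))
# prints:
# 20 0.513
# 60 0.759
```

In this toy model a source individual appears to receive a larger share of its chunks from the isolate simply because the isolate is sampled more heavily, even though the pairwise similarity did not change.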
Another
question related to the donor populations of Chromopainter is that we simply
don’t know the unidirectional gene flows in Europe.
It is a great idea to mark Scandinavians, Spaniards and Germans as donors
if we analyze American populations, but
this doesn’t work in Europe, because here we have barely any unidirectional gene
flows. Any attempt to mark donors in this
analysis would simply be a guess. I didn’t want to guess, so I ran Chromopainter
in a neutral mode in which every individual is compared to all other
individuals. Maybe I could use high-quality
ancient samples as donors, but if I saw a Finestructure analysis targeting only
Europeans with an asymmetric admixture matrix, I would be interested in how the
donor haplotypes were determined.
The second problem is also caused by the Markov chain process and concerns mixed populations. It is very similar to the first problem but requires different data preparation. When the process finds mixed individuals, it also treats the ancestral populations as mixed. This happens because the process is relative and has no understanding of causality between individuals. So the Markov chain process clusters both ancestral populations together with the mixed one, in spite of the history, geography and genetic distances shown in the input data. How strong this clustering is depends on the sample sizes of all three populations, the two ancestral ones and the mixed one. Again, thorough preparation of the data is needed to avoid wrong results. In the worst case some populations are both mixed and isolated, combining both errors in the result.
The
following picture demonstrates the problem concerning mixed populations. In Chromopainter/Finestructure it is even
worse because they use chunk/haplotype counts instead of haplotypes.
Maybe you
will say now that this is okay, but it is not.
If we put 20 Spaniards, 20 Amerindians and 20 Mestizos into the Markov
chain process and get one cluster including all three populations, that would
in my opinion be acceptable as such and I wouldn’t object to it, but it would still be very misleading,
because Spaniards and Native Americans are not relatives and live on two
continents thousands of miles apart. Chromopainter
solves this problem: you can mark Spanish and Native American
phased data as donor data and Mestizos as recipients. But this strategy
doesn’t work in Europe, where there are no donor-recipient pairs like those
between Europe and the Americas.
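The difference between the neutral all-against-all design and a design with marked donors can be sketched as a list of (donor, recipient) pairs. This is only an illustration of the two designs under my own simplified representation. It is not Chromopainter's input format, and the function name `painting_pairs` is hypothetical.

```python
from itertools import permutations


def painting_pairs(labels, donor_pops=None):
    """Return (donor, recipient) index pairs for a painting design.

    labels: one population label per individual.
    donor_pops: None for all-against-all; otherwise the set of populations
    used as donors, with every other individual treated as recipient only.
    """
    n = len(labels)
    if donor_pops is None:
        # Neutral mode: every individual is painted using all the others.
        return list(permutations(range(n), 2))
    donors = [i for i in range(n) if labels[i] in donor_pops]
    recipients = [i for i in range(n) if labels[i] not in donor_pops]
    return [(d, r) for d in donors for r in recipients]


labels = ["Spaniard"] * 20 + ["Amerindian"] * 20 + ["Mestizo"] * 20
all_vs_all = painting_pairs(labels)
supervised = painting_pairs(labels, donor_pops={"Spaniard", "Amerindian"})
print(len(all_vs_all), len(supervised))  # prints: 3540 800
```

In the supervised design the direction of copying is fixed in advance by the donor labels, which is exactly the information that is missing for most European populations.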
Because I
am especially interested in the Finnish results, here are some
details. The Finnish samples include 18 samples estimated to be from
the old settlements. Please check the Finnish settlement definitions
explained in Finnish studies (Jakkula: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2668058/
Palo: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986642/ ). The
estimate is based on a comparison with previously analyzed, better-known data
sources and on PCA analyses; the 1000-genome data itself is not documented well enough
to make this decision. The good news is
that now that Karelian and Vepsian samples have become available, it is possible to
add them, and also the Finnish late settlements, to Finestructure tests without
the drawback of showing too much genetic
drift, i.e. Finnish clusters being caught by strong intra-population chunk
sharing. My next tests will include all
those samples.
Western
Finns show the highest similarity to the south, with Estonians, West Russians and
Poles, but there are two individuals with more North Russian similarity, and
some West Finns show weaker similarity with Scandinavians. It is
possible that the pre-selection made using PCA was not perfect and two
Karelians or Savonians were included, or that those two belong partly to some
other ethnicity and fell into the same PCA category. A low Scandinavian chunk amount doesn’t
necessarily mean low Scandinavian ancestry, only low chunk sharing, which could
mean that the western ancestry is older than the southern and eastern ancestry. Mosaic patterns also show that the
Scandinavian affinity based on chunk sharing (in the linked results) is more
East European ancestry in Scandinavia than vice versa, although Swedish
admixture in Finland is also detectable. This reasoning about old Scandinavian
ancestry in Finland may surprise some people, but perhaps it is supported
by the small amount of young Scandinavian-specific Y-DNA in Western Finland
(see for example Lappalainen et al. 2006). Swedish admixture estimated by the ratio of R1b is 8/21=38% among the Swedish-speaking population in Ostrobothnia, while Swedish speakers form around 5% of the Finnish population.
Maybe it
is also worth mentioning that 23andme’s and FtDna’s results sometimes give
high amounts of western admixture for West Finns. There is a difference in principle between what
they do and this analysis. While
23andme and FtDna create a Finnish “average Joe” and compare individuals to
him, Finestructure in this analysis compares everyone to everyone, and there are
no inferred archetypes, stereotypes or hypothetical ancestors
for any ethnic group. Another question
is how to create genetic averages, whatever they might be.
Abbreviations: CEU=Utah-Europeans, FR=France, NRG=Norway,
HU=Hungary, RO=Romania, BL=Bulgaria, SE=Sweden, CR=Croatia, EE=Estonia, BR=Belarus, UKR=Ukraine, WRU=West
Russia, WFI=Western Finland, MA=Mari, CH=Chuvash, MR=Mordva, NRU=Russia-Vologda, TSI=Tuscan, SP=Spain, ITALY=Abruzzo
Inferred
groups, on average
1: UK-Kent CEU FR NRG HU RO BL SE CR
2: EE BR PL UKR WRU
3: WFI, mixed
4: MA CH TATAR
5: MR NRU
6: TSI SP SICILY ITALY
Imputation and phasing were done with MaCH using
50 rounds and 200 states for each chromosome, creating around 1500 shared chunks
between individuals. This really reaches deep into haplotype history.
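For readers who want to reproduce the per-chromosome phasing runs, here is a minimal sketch that only builds the command lines. The option names follow the MaCH tour pages linked at the top of this post, but the file names, the restriction to chromosomes 1-22 and the exact option set are my assumptions for illustration, so check them against your own MaCH version before running anything.

```python
def mach_phase_command(chrom, rounds=50, states=200):
    """Build one MaCH phasing command line for a single chromosome.

    File names are hypothetical placeholders; the command is only built,
    not executed.
    """
    return [
        "mach1",
        "-d", f"chr{chrom}.dat",   # marker list for this chromosome (assumed name)
        "-p", f"chr{chrom}.ped",   # genotypes for this chromosome (assumed name)
        "--rounds", str(rounds),
        "--states", str(states),
        "--phase",
        "--prefix", f"chr{chrom}.phased",
    ]


# One run per autosome, matching the chromosome-by-chromosome workflow.
commands = [mach_phase_command(c) for c in range(1, 23)]
for cmd in commands[:2]:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` once the file names match your data. Since a single run typically took hours on my PC, it makes sense to start them one at a time.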
Run parameters in Finestructure: 50000 burnin, 500000 MCMC rounds, tree climbing 100000.
You can download results here (compressed .zip).