Tuesday 20 December 2016

New data gives 11 million SNPs

Thanks to the latest updates of free gene banks and the hard work of several projects, I have now been able to increase the SNP count to 11 million per sample.  More SNPs means better accuracy, whether I use all SNPs (especially in drift analyses) or prune for LD first.  In my new database I have combined samples from three sources: the 1000 Genomes Project, the Estonian Biocentre samples used by Pagani et al. 2016 and the Simons Genome Diversity Project (SGDP).  For now the sample size is only 866.

Here, as a showcase of the new data, are two PCA plots and some comments.  Instead of making a picture centered on continental Europe I included four outgroups to see their effect.  Those outgroups are Armenians, Mongolians, Sardinians and Saamis.  For now the individual names are picked straight from the original sources and can be somewhat ambiguous.

As we can see, there are several clusters, which makes it possible to evaluate the data.  For example, the Scandinavians of SGDP and Pagani et al. cluster with East Europeans.

Personally I don't pay much attention to PCA figures, because the result depends on the selected samples, their numbers, the ratios between population sizes, how mixed the individuals are, etc.  My upcoming high-resolution tests will be much more interesting.

Added time 12:50

In case someone is interested in where Mordvins fall on this map: they are very similar to North Russians and RusKU (Pagani et al.) and shift towards Mongolians.  Baltic Finns shift towards Saamis.  Sorry, GIMP does something unwanted with the colors.


Wednesday 16 November 2016

Ancient admixtures look shifty

Some ancestry results are hard to believe.  FamilyTreeDna's new Ancient Origins gives me the following results:

Metal Age Invader 12%
Farmer 30%
Hunter-Gatherer 54%
Non-European 4%

By Metal Age Invaders they refer to the Metal Age Yamnaya culture, by Farmers to the Neolithic Anatolian migration to Europe and by Hunter-Gatherers to the ancient La Braña, Loschbour and Motala samples.  For the non-European proportion they give a hint to look at myOrigins, which is FamilyTreeDna's admixture analysis based on present-day populations.  My myOrigins gives me only one non-European group, Middle Easterners.  I doubt it; the non-European share in my Ancient Origins test is likely Asian.

Going further, I compared my Ancient Origins results to scientific papers; Haak et al. 2015 gives comparable figures.  Haak et al. give the following results for Finns:

EN (Farmers) 31.5%
Nganasan (Asian) 10.2%
WHG (Hunter-Gatherer) 7.9%
Yamnaya (Metal Age Intruders) 50.4%

Norwegians, respectively, get the following in this study:
EN (Farmers) 48.2%
Nganasan (Asian) 4.2%
WHG (Hunter-Gatherer) 0%
Yamnaya (Metal Age Intruders) 47.5%

We can see a huge shift between Yamnaya/Metal Age Intruders and Hunter-Gatherers when comparing Ancient Origins with Haak et al.  I know something about the method used by Haak et al., but I have no idea what FamilyTreeDna did.  However, if I had to guess, I would say that they could have used very drastic LD pruning.  I can get similar differences with heavily pruned data, and it makes sense.  The Metal Age invasion of Europe happened during the Bronze Age, thousands of years later than the arrival of hunter-gatherers.  So it is reasonable to assume that we still carry much more Bronze Age genetic drift than drift from hunter-gatherers; thus removing LD removes more of the Metal Age Intruder ancestry.  Pruning present-day samples doesn't have the same effect, due to their more similar genetic composition.

I also ran some admixture tests.  Pruning LD changes the ancient admixtures a lot.

My result without pruning

Anatolian_Neolithic 31.4
BA_East_European_Steppe 44.8
East_and_Southeast_Asian 10.8
Western_Hunter_Gatherer 13

and after pruning

Anatolian_Neolithic 27.5
BA_East_European_Steppe 25.9
East_and_Southeast_Asian 7.8
Western_Hunter_Gatherer  38.8

I am not saying that the difference between results of FamilyTreeDna and Haak et al. is caused by pruning, because I don't know it.  I only state that pruning ancient samples is risky.

Wednesday 9 November 2016

Project admix results, revised

My previous test lacked German reference samples.  Together with the fact that my Swedish reference samples seem to be somewhat off, this gave results biased towards Balto-Slavs.  I have now added the German samples available from Pagani et al. 2016 and rerun all project samples, plus two new Finnish samples.  Additionally I tested the three Finnish samples introduced by the aforementioned study.  Soon after downloading those samples I understood that they don't represent average Finns, so they are covered separately after the project results.

I had difficulties editing the columns, and after some fruitless efforts I copy-pasted everything in plain text format.

A new grouping, Karelian-Finnic, indicates the sum of Karelian and Veps people.

Finland     57,0
AMBIG_Europe     25,0
Balto-Slavic     12,9
Baltic-Finnic     2,5

Finland     37,2
AMBIG_Europe     28,0
Balto-Slavic     14,8
NW-Atlantic-Europe     10,6
Saami     3,9


Finland     62,3
AMBIG_Europe     33,0
Baltic-Finnic     2,3

Finland     47,2
AMBIG_Europe     18,9
NW-Atlantic-Europe     18,1
Northeast-Europe     15,8

Finland     53,8
AMBIG_Europe     33,1
Baltic-Finnic     11,7

Finland     43,0
AMBIG_Europe     36,0
Baltic-Finnic     12,5
NW-Atlantic-Europe     7,9


Finland     78,7
AMBIG_Europe     17,4
TunNenets     3,4


Finland     56,5
Karelia     25,4
AMBIG_Europe     17,4

Finland     42,1
AMBIG_Europe     27,7
Karelia     24,5
Karelian-Finnic     5,0

Finland     43,1
Saami     21,5
AMBIG_Europe     10,9
Karelian-Finnic     10,2
AMBIGUOUS     10,0
AMBIG_Siberian     4,3

Finland     63,7
AMBIG_Europe     31,7
Baltic-Finnic     1,8

Finland     71,6
AMBIG_Europe     18,0
Central-Europe     10,2


Finland     69,8
Balto-Slavic     16,0
AMBIG_Europe     11,3
Baltic-Finnic     1,6


Finland     62,0
Karelian-Finnic     21,2
AMBIG_Europe     14,9


Finland     43,1
AMBIG_Europe     22,9
Estonia     21,8
Karelia     10,3

Finland     33,9
Central-Europe     24,0
Karelia     13,8
Baltic-Finnic     9,8
AMBIG_Europe     9,5
RU_Pinega     5,6
Karelian-Finnic     1,3


Finland     46,1
Karelian-Finnic     19,7
Balto-Slavic     14,5
AMBIG_Europe     8,8
Baltic-Finnic     6,5
Saami     3,7


Finland    0,62
AMBIG_Europe    0,20
Northeast-Europe    0,08
RU_Pinega    0,05
Saami    0,03

Finland     57,8
AMBIG_Europe     21,8
Balto-Slavic     10,9
Baltic-Finnic     4,3

Finland     53,1
Karelia     28,0
AMBIG_Europe     10,7
Northeast-Europe     4,8
Karelian-Finnic     1,2

NW-Atlantic-Europe     32,8
Central-Europe     32,5
Balto-Slavic     19,3
AMBIG_Europe     13,3


Baltic-Finnic     27,6
Central-Europe     21,2
AMBIG_Europe     19,3
Norway     17,5
NW-Atlantic-Europe     12,9


Norway     53,0
Central-Europe     18,3
Balto-Slavic     13,7
NW-Atlantic-Europe     8,1
AMBIG_Europe     6,5

AMBIG_Europe     28,9
NW-Atlantic-Europe     18,3
Central-Europe     18,3
Ireland     14,1
GermanyAustria     11,5
Northeast-Europe     7,9

Central-Europe     31,5
NW-Atlantic-Europe     24,7
AMBIG_Europe     16,5
Finland     14,5
Balto-Slavic     11,9

AMBIG_Europe     29,7
NW-Atlantic-Europe     26,1
Sweden     20,5
Orcadian     11,0
Central-Europe     10,7

Additionally, here are some freely available genomes, used only for checking the method.

Genomes Unzipped, VXP
North-Italy     24,9
Central-Europe     20,7
AMBIG_Europe     18,4
Norway     13,7
NW-Atlantic-Europe     12,0
South-Europe     6,6

Genomes Unzipped, JKP
Central-Europe     28,9
South-Europe     19,8
NW-Atlantic-Europe     19,1
Spain     12,5
AMBIG_Europe     11,3
AMBIG_SEURASIA     2,0                                      

Razib Khan, downloaded here.
Indian     35,6
Sindhi     22,3
Cambodian     12,8
AMBIGUOUS     10,6
Burusho     8,6
IndianJew     6,3
AMBIG_Southeast-Asian     2,4

Blaine Bettinger, downloaded here.         
He looks British, with a small portion of Native American.
Central-Europe     24,9
Kent     24,1
AMBIG_Europe     21,2
Welsh     9,3
Ireland     7,3
Atlantic-Europe     3,3
Native-American     1,9

Tests using Pagani et al. Finns as a Finnish reference   
Karelia    28,0
AMBIG_Europe    23,8
Central-Europe    17,8
Baltic-Finnic    12,6
Finland    12,1
Karelian-Finnic    3,4

Estonia    23,7
AMBIG_Europe    22,5
Karelia    18,6
Central-Europe    18,5
Finland    7,9
Karelian-Finnic    4,7

Karelia    46,3
AMBIG_Europe    16,1
Finland    10,4
Baltic-Finnic    8,7
Northeast-Europe    8,5
Saami    4,3
Karelian-Finnic    2,8

I tested three Finns, seen above: two of them are typical Western Finns without any obvious foreign admixture and one should be a typical Finn from East Finland.  The first row below shows the average result using the average Finnish reference picked from the 1000 Genomes data, and the second row shows the average result after changing the reference to the Finnish samples of Pagani et al.
FI12, FI14 and FI21, average Finnish result when using the average Finnish reference    64,8

FI12, FI14 and FI21, average Finnish result when using Pagani Finnish samples as a reference    10,1

In this particular case, the Pagani Finns not only almost fully mismatch with average Finns; using them as the reference also eliminates the Finnish admixture from Swedish results where it is present in analyses based on the average Finnish reference, in some cases replacing Finnish admixture with Karelian and Veps.  This is really odd.

A map giving an estimate of admixture regions in Europe

Monday 31 October 2016

Project admixtures, fitted ancient proportions

Here are the ancient European proportions of project members and, for comparison, some academic present-day samples (not all fully covered by the references, though), one random sample per population.  The results don't express primary proportions of Anatolian Neolithic and various hunter-gatherer populations, but add-ons over European LNBA samples.  The European LNBA itself was already a genetic mixture, including admixtures similar to the aforesaid West Eurasians and probably also from still unknown ancient populations.  Similarly, "BA East European Steppe" already included eastern hunter-gatherer admixture.  My aim was not to fix all admixtures on the same time level, but to get good coverage and make project samples comparable to each other.

XLS-sheet is available from here.

Saturday 29 October 2016

Project admixture results

While preparing my ancient haplotyping analyses I decided to test project members using Dna.Land's Ancestry program.  Many thanks to the authors for distributing it.  All you need to do is compile it and start your analyses.

All results are "as is", straight from the analyses.  Some comments:

- Finns and Norwegians are easily identified.
- Swedes and Estonians (the latter don't belong to my project) can't be confidently identified with the academic reference I have used in this and in my previous analyses.
- many Finns have minor Saami admixture.  This makes sense, and Saami ancestry is the most likely source of the Finnish Siberian admixture.  In most cases we can forget Nganasans and other distant and small Siberian populations.  The minor Saami admixture among Finns is pervasive, pointing not only to Siberian ancestry but to the complex history of ancient Fennoscandia; otherwise we would see in these results the real Siberians that were also included in my tests (Nganasans, TunNenets, Nenets, Yakuts and numerous "semi-Siberians" from more southern North Asian regions).
- I didn't get the weird "Finnish-South European" admixtures seen on FamilyTreeDna and Dna.Land result pages.  This is because my Finnish reference is built from average Finns, not from Finnish minority groups.
- the ambiguous Balto-Slavic admixture among Finns is mostly from Latvia, Lithuania or Russian Tver.  Russians living north of the Tver region are classified as "Northeast Europe", except Karelians and Veps, who belong to Baltic-Finns with Estonians and Finns.  Saamis form their own group.
- the ambiguous Northwest European admixture among Finns is mostly Swedish.
- the ambiguous European admixture is usually some combination of two above-mentioned groups.
- "Ambiguous" means that the result of several individual bootstrap tests was ambiguous, meaning high dispersion of results.   
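To make the last point concrete, here is a minimal sketch of flagging a component as ambiguous when its share varies too much across bootstrap runs.  The function name, the use of the standard deviation and the 5-point cut-off are all assumptions for illustration; this is not the rule the Ancestry program itself applies.

```python
from statistics import mean, pstdev

def label_component(name, bootstrap_values, max_sd=5.0):
    """Return (label, mean share) for one ancestry component.

    bootstrap_values: shares in percent from repeated bootstrap runs.
    max_sd: hypothetical cut-off; a larger spread is called ambiguous.
    """
    spread = pstdev(bootstrap_values)
    label = name if spread <= max_sd else "Ambiguous " + name
    return label, mean(bootstrap_values)

# Toy numbers, not taken from any real run.
stable = label_component("Finland", [63.0, 64.5, 64.2])
noisy = label_component("Balto-Slavic", [0.0, 14.0, 7.0])
```

With these toy values the first component keeps its name, while the second one, whose share jumps between runs, gets the "Ambiguous" prefix.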

Finland 63,9
Ambiguous Northeast-Europe 11,9
RU_Pinega 8,9
Ambiguous Balto-Slavic 6,9
Ambiguous Europe 4,6
Iran_Jew 2,9

Finland 42,5
Ambiguous Northwest-Europe 15,9
Karelia 9,7
Ambiguous Balto-Slavic 9,5
Ambiguous Europe 8,3
Ambiguous Northeast-Europe 7,2
Ambiguous 3,8
Saami 3,1

Finland 69,2
Latvia 13,0
Ambiguous Baltic-Finnic 8,2
Ambiguous Northwest-Europe 6,3
Saami 1,7
Ambiguous 1,4

Finland 51,8
Ambiguous Northwest-Europe 22,7
RU_Smolensk 9,8
Ambiguous Northeast-Europe 7,1
RU_Pinega 4,7
Ambiguous Europe 3,6

Finland 52,4
Estonia 17,4
Karelia 15,3
Ireland 11,0
Saami 2,0
Ambiguous Europe 1,1

Finland 43,8
Karelia 12,3
Ambiguous Northwest-Europe 11,7
Ambiguous Baltic-Finnic 10,2
Lithuania 9,5
Ambiguous Northeast-Europe 7,4
Ambiguous Europe 3,5
Ambiguous Balto-Slavic 1,0

Finland 44,2
Karelia 27,9
Latvia 12,4
Ambiguous Europe 10,4
Ambiguous Baltic-Finnic 3,4
Ambiguous 1,6

Finland 66,5
Karelia 22,5
Ambiguous Europe 8,3
Saami 2,3

Finland 63,3
Karelia 23,2
Ambiguous Europe 8,1
Ambiguous Baltic-Finnic 2,8
Ambiguous 2,6

Finland 54,7
Karelia 17,0
Ambiguous Baltic-Finnic 15,9
Ambiguous Balto-Slavic 5,8
Saami 3,5
Ambiguous Europe 3,1

Finland 84,3
Ambiguous Balto-Slavic 8,0
TunNenets 4,2
Ambiguous Baltic-Finnic 3,5

Finland 63,6
Karelia 24,9
Ambiguous Europe 10,6

Finland 48,7
Saami 22,0
Karelia 12,2
Ambiguous 6,0
Nenets 4,0
Latvia 3,2
Ambiguous Europe 2,8
Ambiguous Siberian 1,0

Finland 72,9
Ambiguous Balto-Slavic 16,0
Ambiguous Europe 6,6
Ambiguous Baltic-Finnic 3,3
Ambiguous 1,3

Finland 82,1
Ambiguous Europe 17,0

Finland 44,1
Estonia 26,5
Karelia 10,2
Ambiguous Europe 13,1
Ambiguous Baltic-Finnic 4,2
Ambiguous 1,9

Finland 32,7
Karelia 17,7
Estonia 15,2
Sweden 14,6
Tatar 7,0
Ambiguous Europe 6,5
RU_Pinega 5,5

Utah_CEU 18,4
Ambiguous Northwest-Europe 18,2
Sweden 17,6
Belarussia 10,8
Welsh 8,2
Ambiguous Baltic-Finnic 8,1
Latvia 5,9
GermanyAustria 5,8
Ambiguous Balto-Slavic 3,1
Ambiguous 2,9
Ambiguous Europe 1,1

Sweden 20,5
Ambiguous Northwest-Europe 19,7
Ambiguous Baltic-Finnic 19,3
GermanyAustria 13,1
Ireland 11,3
Latvia 5,1
Ambiguous Central-Europe 4,8
Ambiguous Europe 4,6
Ambiguous Balto-Slavic 1,5

Norway 20,0
Sweden 19,9
Veps 13,9
Kent 12,9
Orcadian 12,5
Ambiguous Europe 9,3
Ambiguous Central-Europe 7,0
Ambiguous Northwest-Europe 2,3
Ambiguous Baltic-Finnic 2,0

Norway 17,9
France 17,5
Estonia 16,7
Finland 14,2
Utah_CEU 14,0
Ambiguous Europe 7,2
Ambiguous Northwest-Europe 6,6
Scotland 5,6

Norway 53,0
Ambiguous Northwest-Europe 24,3
Ambiguous Central-Europe 11,2
Ambiguous Europe 5,5
Veps 5,2

Utah_CEU 35,5
Finland 17,5
Ambiguous Northwest-Europe 14,2
Ambiguous Balto-Slavic 9,5
Veps 8,7
GermanyAustria 7,7
Ambiguous Northeast-Europe 4,3
Ambiguous 1,6
Ambiguous Europe 1,0

Tuesday 18 October 2016

European coarse population structure using 14.4 million markers

I already made a Finestructure analysis before my previous Admixture-based work, but didn't publish it because it gave so little additional information.  I used the same data as with Admixture.  The workflow:

1 extracting chromosomes 1 and 6
2 phasing haplotypes (HAPI-UR run ten times, then making a consensus)
3 running Chromopainter in linked mode, without defining donor haplotypes
4 running Finestructure with a burn-in of 200000 and a runtime of 2000000 iterations

As a result we see a very obvious grouping: each ethnic group clusters together.  Some caveats have to be made about the Chromopainter-Finestructure combination:

-  first of all, Finestructure doesn't really use specific haplotypes, but the number of shared haplotypes and the haplotype lengths between individuals.  So there is no guarantee that in a three-sample case (individuals a, b and c) all three share common haplotypes, even when the Finestructure result shows haplotype sharing for all three samples.  This can lead to pseudo-ancestry between individuals and also to a wrong tree grouping.

- using donor haplotypes can be methodologically unreliable.  We can assign donor haplotypes for people living in the Americas, but it is not equally reliable for people living in the Old World.  It is a chicken-and-egg question: if we really know the donors before testing, we know the result before we have the result.  I have seen methods creating donor types (selections of prepared haplotypes), but I can't see how they could work reliably.  Note also that speaking about donor populations (I have seen it) makes this even more problematic; to know the donor populations we must already know the population grouping before the analysis, and we bind donor populations to something that exists today but did not necessarily exist thousands of years ago.

While checking the data I saw a questionable sample group: Swedes.  They look more eastern than can reasonably be expected.

In general, when looking at any results the first question is "does the result look obvious?".  If we have two different results based on any kind of supervised method (like using donor haplotypes/populations), it is only common sense to consider the more obvious result the better one.  Here we have a philosophical question: what "the obvious" means for you and for me.  It makes sense, but an idea of "too obvious" leads us to tin foil hat theories.  Perfection is suspicious.  We don't want it, although it is also possible in practice.  Another, much more sensible question regarding donor haplotypes would be whether we could assign donor haplotypes of Bronze Age Europeans based on ancient samples.  That would make sense.

Download the Finestructure picture here.

Friday 14 October 2016

Worldwide admixture analysis based on 14.4 million SNPs

The EGDP data, available from the Estonian Biocentre, made it possible to reach 15-30 times higher genome density than earlier available data.  The new data lacks West European samples, but this was not a big problem thanks to the publicly available western data from the 1000 Genomes Project.  So I merged these two data sets.  As a quality check I computed heterozygosity rates for all European samples in both data sets and found the two sets to be considerably close to each other, although the read depth of the 1000 Genomes data is lower.  In fact, Finnish samples in both sets showed exactly the same level of heterozygosity.
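The heterozygosity check can be pictured with a small sketch like the following.  It assumes genotypes coded as alternate-allele counts (0, 1, 2, or None for a no-call); the function names and the toy data are made up for illustration and are not the tooling actually used here.

```python
def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous.

    genotypes: iterable of 0/1/2 alternate-allele counts, None for a no-call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)

def mean_rate_by_dataset(samples):
    """samples: list of (dataset_name, genotypes); returns {dataset: mean rate}."""
    rates = {}
    for dataset, genotypes in samples:
        rates.setdefault(dataset, []).append(heterozygosity_rate(genotypes))
    return {name: sum(vals) / len(vals) for name, vals in rates.items()}

# Toy samples only; the real check covers every European sample in both sets.
toy = [("EGDP", [0, 1, 2, 1, None]), ("1000g", [1, 1, 0, 2])]
rates = mean_rate_by_dataset(toy)
```

If the two data sets give clearly different mean rates for the same population, something in the calling or merging probably went wrong.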

After the successful merge I had 14.4 million SNPs over all 22 chromosomes, which was far too much to process in a few days on my desktop (i7, 3.5 GHz, 32 GB memory).  Instead of thinning the whole data set to 1-2 million SNPs I decided to use chromosomes 1 and 6 and leave the genome density untouched.  So I had two chromosomes with a bit over 2 million SNPs, still showing 15-30 times more genotype information per chromosome than other available genotype sets.  Thinning over all chromosomes to make the dataset small enough for my computer would likely have induced more algorithm-dependent bias, which I wanted to avoid.

The process

1 merging EGDP and 1000g data sets
2 quality checks, including homozygosity/heterozygosity ratios per population
3 extracting chromosomes 1 and 6
4 thinning data by Plink:   plink --file data --indep 50 5 2, resulting in 1.1 million SNPs
5 running admixture analyses with K values from 3 to 13 in unsupervised mode and without reference populations (=projection).

Each K value was run in unsupervised mode without reference data, because projection reference data is not available for this SNP set.  You can see analyses using a projection reference, for example, in works analysing ancient and modern genomes together.  Analyses based on projection are cool, because we have no other way to assign proportions of ancient samples to modern ones.  I am not saying that unsupervised analysis without references is error-free, but its errors are systematic and not user-dependent.

All analyses (K values from 3 to 13) are run as individual runs without user supervision, and for that reason colors on the charts are not consistent (it sounded like painful work to get the colors consistent).  Each analysis is optimized separately by the Admixture algorithm.  All this makes it harder to perceive differences between K values, but as soon as you get the idea I am sure you can see the big picture and understand the details.

Hopefully this test is helpful for you.  In my opinion it gives interesting hints about Finnish relations with other populations, but the analysis itself is worldwide.

- Mordvins seem to differ from other Volga-Finnic populations and belong to Balto-Slavic ancestry; they are probably language shifters from a Baltic to a Volga-Finnic language.

- Estonians are just what can be expected: some Estonians have Baltic ancestry, others Baltic-Finnic ancestry.  We should, however, be cautious in using linguistic terms when we speak about ancestry.

- North Russian Finno-Ugric populations seem to be Baltic-Finnic people with Siberian admixture.  The Siberian admixture is present in a lesser amount among Finns and Estonians (note that the amount of minor admixtures depends on the data/populations used, and Admixture assigns proportions relative to the populations included).

- to some extent Swedes also show Baltic-Finnic ancestry, but the Swedish sample size is rather small for a firm conclusion.  However, if this is true, we can assume that present-day Baltic-Finnic people have largely Fennoscandian ancestry.

- Ingrian samples show up like pure unadmixed Baltic-Finnic people, which surprises me because of their long lasting minority status in Russia. Sample collectors have done good work.  Those samples are valuable indeed.

- thinking about all this and trying to reconstruct the history of Baltic-Finnic people, it looks like they lived north of the Latvia-Moscow axis (with Balts living to the south before the East Slavic expansion).  Mixing between Baltic and Finnic people happened, and people also shifted language.

- open questions are how strong the Baltic-Finnic influence is/was in Scandinavia and conversely how strong the Germanic influence is/was in Finland and Estonia.  For certain political reasons it is a difficult approach today.

CV errors indicate quality in general: the lower the value, the better the quality.  Absolute values depend on the data used and can't be compared to other Admixture tests.

K3: 0.19708
K4: 0.19503
K5: 0.19480
K6: 0.19451
K7: 0.19432
K8: 0.19503
K9: 0.19508
K10: 0.19576
K11: 0.19708
K12: 0.19797
K13: 0.20221
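For reading the list above, a small sketch that parses the values exactly as printed here and picks the K with the lowest CV error.  The helper name parse_cv is an assumption for this example, not part of Admixture.

```python
import re

cv_report = """\
K3: 0.19708
K4: 0.19503
K5: 0.19480
K6: 0.19451
K7: 0.19432
K8: 0.19503
K9: 0.19508
K10: 0.19576
K11: 0.19708
K12: 0.19797
K13: 0.20221
"""

def parse_cv(text):
    """Turn lines like 'K7: 0.19432' into {7: 0.19432}."""
    return {int(k): float(v) for k, v in re.findall(r"K(\d+):\s*([0-9.]+)", text)}

cv_errors = parse_cv(cv_report)
best_k = min(cv_errors, key=cv_errors.get)  # K with the lowest CV error
```

The minimum is at K=7, although the differences between K=4 and K=9 are small.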

Population abbreviations, download here

Analysis, download here.

You definitely need a picture viewer able to handle big GIF files.

Thursday 8 September 2016

Worldwide diversity based on 3.2 million X chromosome markers

Genetic diversity tests are usually done using around 300-500 thousand markers.  It is, however, possible to use many more markers (SNPs) with the data already available from the 1000 Genomes Project.  The downside is that we have only a few populations; the upside is that we see the big picture accurately, without possible bad sampling.

I made this test using Chromopainter and Finestructure.  Unfortunately Chromopainter is a rather inefficient tool and unable to make use of the available computing resources (threads, memory).  Without this drawback I would have done this with 25 million markers instead of only 3.2 million.

The process:

1 Vcftools, parameters --remove-indels --chr 23
2 Haplotyping using HAPI-UR and all samples, run three times and combined into a consensus
3 Made a manual selection for random samples, 10-20 of each population
4 Chromopainter,  without specifying donor haplotypes
5 Finestructure  with run parameters 30000/300000
6 MDS using Past.

Additionally I ran Vcftools using the parameters --keep-only-indels and --chr 23.  The result was filtered and biallelic deletions (CN=0) were counted.  Male results were treated as biallelic, so CN=0 should give us the number of effective deletions in both cases, for females and males.


MDS done by Past:

All previous pictures are downloadable with better resolution, here.

Deletions per 3.2 million markers (averages per sample):

The British subgrouping is gathered from the internet and can be unreliable.  The Finnish subgroups are: those with the highest Siberian admixture, the group being "most Finnish" / local, those closest to ancient Corded Ware samples, and the rest of all 99 samples.  The last Finnish group includes all outliers.

Saturday 20 August 2016

Mitochondrial diversity in Europe


I have seen several mitochondrial statistics using the main haplogroups, H, U, I etc.  Haplogroups, being tens of thousands of years old, are a very coarse way to analyze geographic areas where people have moved and mixed during recent centuries and at most over some thousands of years.  Because of this I decided to use mutation information based on the RSRS reference.  The RSRS was introduced a few years ago and lists mitochondrial mutations counted from the so-called "mito-Eve", the reconstructed first woman in the human ancestral tree.  Even the RSRS leaves a lot to be desired, because many mutations are common to several mitochondrial branches.


The data is collected from publicly available FamilyTreeDna projects and includes two hypervariable regions, HVR1 and HVR2.  HVR2 is not available for all samples; in those cases it is marked as "no call".  Otherwise all mutations are included.

Countries and sample sizes

The Finnish sample size is probably the biggest ever seen in academic or any other studies.  Even taking into account some bias in regional personal activity, this has to be the best sample data from Finland ever seen.

Some geographical areas are underrepresented, like White Sea Karelians, but I expected some interest and included them.


Fst distances

Seeking country-level rather than individual statistics, I first ran Fst statistics between countries.  Keeping in mind the nature of mitochondrial data and mutations, we should not expect any strict summary of ancestry; rather, the results mirror European migrations over thousands of years.
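As an illustration of what an Fst distance between two countries measures for haploid data like this, here is a minimal sketch using Hudson's pairwise-difference form, Fst = 1 - Hw/Hb.  This is not the tool used for the results here; the choice of estimator, the function names and the mutation labels m1..m4 are assumptions made for the example.

```python
from itertools import combinations, product

def differences(a, b):
    """Number of mutations (relative to the reference) carried by only one sample."""
    return len(a ^ b)

def mean_within(pop):
    """Mean pairwise difference inside one population (needs at least 2 samples)."""
    pairs = list(combinations(pop, 2))
    return sum(differences(a, b) for a, b in pairs) / len(pairs)

def mean_between(pop1, pop2):
    """Mean pairwise difference between samples of two populations."""
    total = sum(differences(a, b) for a, b in product(pop1, pop2))
    return total / (len(pop1) * len(pop2))

def hudson_fst(pop1, pop2):
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    if hb == 0:
        return 0.0  # identical samples everywhere: no differentiation
    return 1 - hw / hb

# Each sample is the set of its mutations; labels are placeholders.
country_a = [frozenset({"m1"}), frozenset({"m1"}), frozenset({"m1", "m2"})]
country_b = [frozenset({"m3"}), frozenset({"m3"}), frozenset({"m3", "m4"})]
fst = hudson_fst(country_a, country_b)
```

In this toy case the two countries share no mutations, so the value is high (0.75); real neighbouring countries give values much closer to zero.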

Fst distances

 Image with better resolution can be downloaded here

 MDS-plot based on Fst-distances:

The two leftmost dots are Poland and Germany.

And classical euclidean tree plot:

edit 20.0.2016 13:40

Here I reconstructed the mitochondrial genome instead of straightforwardly using the hypervariable mutations.  The reconstructed SNP data was analyzed with standard analysis tools.  I am quite sure that analyses done using only mutation indicators will not be successful.

22.9.2016 11:30

Added Fst and genome data.  Notice that the genome data is reconstructed using minimum labor input and original kit-id numbers are substituted by surrogates!

Fst-data download here
Genome data download here

Monday 27 June 2016

Global ROH-results

ROH (runs of homozygosity) estimates individual autozygosity for a subpopulation.  After reading the study "Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity" I stopped to ponder its statistics, because figure 5 shows decimals for country ROH averages.  ROH counts are integers, so averages below 1 are not possible without individual zero values, and zero values in practice mean some lost data.  It seemed necessary to count shorter ROH segments to get more precise results.  Although my statistics look reasonable in general, I can't take responsibility for possible bad sampling regarding some ethnic groups.

Data and processes

Primary data: 600 kSNP, with a very low no-call rate
LD-pruning:  ./plink --noweb --bfile LARGEDATA --indep-pairwise 200 25 0.4
Pruned data: 160 kSNP
ROH process: ./plink --noweb --bfile LARGEDATA --extract plink.prune.in --homozyg --homozyg-window-kb 5000 --homozyg-window-snp 25 --homozyg-snp 50 --homozyg-window-het 1 --homozyg-window-missing 1  --homozyg-density 50 --homozyg-window-threshold 0.05 --homozyg-gap 100 --homozyg-kb 1000

My goal was to find smaller ROH segments, and this was done by changing three parameters: --homozyg-density, --homozyg-snp and --homozyg-kb.  The changes were not big, but enough.  There is an optimum combination of SNP and base pair lengths, and compared to the study I picked a smaller base pair length (1500->1000) and a longer SNP length (25->50).  This did the trick.
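The per-individual numbers plotted below (ROH count and total ROH length) can be derived from the segment list that Plink writes.  The sketch assumes a whitespace-separated .hom style table with a header row containing IID and KB columns; the function name and the three toy rows are made up for illustration.

```python
from collections import defaultdict

def summarize_roh(hom_lines):
    """Return {IID: (segment count, total length in base pairs)}.

    hom_lines: lines of a .hom style table; the header must name IID and KB.
    """
    rows = [line.split() for line in hom_lines if line.strip()]
    header, segments = rows[0], rows[1:]
    iid_col, kb_col = header.index("IID"), header.index("KB")
    summary = defaultdict(lambda: [0, 0.0])
    for seg in segments:
        entry = summary[seg[iid_col]]
        entry[0] += 1                            # one more ROH segment
        entry[1] += float(seg[kb_col]) * 1000    # kb -> base pairs
    return {iid: (count, bp) for iid, (count, bp) in summary.items()}

# Toy table with only the columns needed here.
toy_hom = [
    "FID IID CHR KB NSNP",
    "F1 S1 1 1200.5 60",
    "F1 S1 6 2300 110",
    "F2 S2 1 1000 50",
]
roh = summarize_roh(toy_hom)
```

Each individual then becomes one point: count on the X-axis, total length on the Y-axis.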

ROH count on the X-axis, total ROH length in basepairs on the Y-axis.   

Large picture:

Small picture covering the left bottom corner:

Pictures with better resolution:


Zoom in

Tuesday 31 May 2016

I1-L22 revised

I revised my earlier test of I1-L22, trying to figure out the Scandinavian and Finnish clades using the TMRCA method based on 67 STR markers.  The main reason for doing this is the newly available CTS2208 samples.  It is really fascinating to see how CTS2208 divides the L22 subclades into two branches, implying that the Finnish "Bothnian" clade is older than its estimated age of 1850 years.  Here are the recent TMRCA estimates:

L22 - 4100 BP
Z74 - 4100 BP  (It is not credible to assume both clades being 4100 years old and L22 is likely older than predicted)

P109 - 3400 BP
CTS2208 - 2800 BP 
L205 - 1400 BP
L287 - 1850 BP
L258 - 1700 BP

The logic is that a downstream clade can be older than its calculated TMRCA, at most as old as the TMRCA of its nearest known upstream clade.
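This bounding rule can be written down as a small sketch.  The TMRCA values are the ones listed above; the parent map, however, is an assumption made for the example (CTS2208 placed under L22 and the "Bothnian" L287 under CTS2208) and should be checked against the actual tree.

```python
# TMRCA estimates in years before present, from the list above.
tmrca_bp = {"L22": 4100, "CTS2208": 2800, "L287": 1850}

# Assumed placement for illustration only: child -> nearest known upstream clade.
parent = {"CTS2208": "L22", "L287": "CTS2208"}

def age_range(clade):
    """Return (calculated TMRCA, upper bound from the upstream clade or None)."""
    lower = tmrca_bp[clade]
    upstream = parent.get(clade)
    upper = tmrca_bp[upstream] if upstream is not None else None
    return lower, upper

bothnian_range = age_range("L287")
```

Under this assumed placement the "Bothnian" clade would be between 1850 and 2800 years old.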

Here is also a tree figure.  67 STR markers are not enough to create a perfect tree, but it still gives a certain idea of the close relation of the "Bothnian" clade and CTS2208.


Thursday 5 May 2016

Comparison of Ice Age and modern Europeans, Ice Age remix

Thanks to the new study "The genetic history of Ice Age Europe" and the corresponding data we now have many more really old human samples.  As a quick experiment I made some comparisons between those ancient samples, following the grouping presented in the study, and modern Europeans.  Using dstat and selected third populations from America, Asia and Europe, I try to infer how much ancestry selected Europeans, compared with Karitiana, Han and French, share with selected ancient samples.

The dstat formula was D(European population, Karitiana/Han/French; ancient sample group, Chimp)
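For readers unfamiliar with the statistic, here is a minimal sketch of how a D value of this form can be computed from per-SNP allele frequencies, following the commonly used frequency-based definition.  The function name and the toy frequencies are assumptions for illustration; the real program also estimates standard errors with a block jackknife, which this sketch leaves out.

```python
def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) from allele frequencies at the same SNPs.

    Positive values mean P3 shares more alleles with P1 than with P2
    (P4 is the outgroup, here Chimp).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den

# Toy frequencies at two SNPs, not real data.
european = [1.0, 0.0]
han = [0.0, 0.0]
ancient = [1.0, 0.0]
chimp = [0.0, 0.0]
d_value = d_statistic(european, han, ancient, chimp)
```

In the toy case the ancient sample shares its derived allele only with the European population, so D is at its maximum.  Swapping the first two populations flips the sign.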

06.05.16 20:05  There was a small error in El Mirón numbers, showing somewhat too low similarity for Europeans.  Now corrected.

15.05.16 11.00  Added dstat graphics (as above) for Northeast Europe:

16.05.16 18:45

Added GoyetQ116-1 to the first series of graphics.

Wednesday 20 April 2016

Neolithic and Bronze Age Irish samples, compared to modern populations

It was worth waiting a few weeks to see these Irish samples, especially because I already expected that insular Irish samples could reveal new things about ancient people who lived in Northwest Europe.  You can see the original study here.  There are four samples, three from Rathlin Island in Northern Ireland and one from Ballynahatty, which is also located in Northern Ireland.  Two of the Rathlin samples are of low quality and don't work well with my database based on the Estonian Biocentre's data.  Maybe I'll load them into the Lazaridis database later.  The third Rathlin sample and the Ballynahatty sample are, however, excellent.

Picking from the study

- Ballynahatty, a Neolithic woman (3343–3020 cal BC)
- Rathlin, in context of  an early megalithic passage-like grave, an Early Bronze Age man from Rathlin Island (2026–1885 cal BC)

I was really excited when I started to analyse the Rathlin samples, because they could reveal new knowledge about ancient people who lived in North Europe before the eastern Bronze Age steppe migrations.  I decided to compare them to present-day populations instead of ancient samples, to make the results tangible.  First I tested which modern populations are closest to the Rathlin and Ballynahatty samples and found that the Rathlin genome is still closest to the Irish.  The Ballynahatty sample was closest to present-day Sardinians, as is typical of the Neolithic era.

After processing all this from fastq-files 1) I made two qpDstat comparisons to find out who of modern populations resembles best those ancient Irish samples in comparison with best fits of modern populations.  In comparison with the Rathlin man I included also my project samples, mainly Finnish and Swedish individuals.

Rathlin and modern populations

Ballynahatty and modern populations

Inspired by the western origin of the Saami people, I made one more comparison using another database, to get reliable results with the Saami sample introduced by Haak et al. 2015. It looks like, despite their remarkable North Asian admixture, they have more Rathlin-like ancestry than Eastern Finns, who have less North Siberian admixture.

Saami between ancient samples, using the arrangement seen already in my previous post

FI15 is from Northern Karelia, FI12 is western Finnish, FI10 is from Finnish Lapland.

Finally, after downloading and testing DNA.LAND's admixture program, I made some admixture analyses. You can find and download the software from their site, here. This small program is based on allele frequencies, and the method is probably Markov chain Monte Carlo. It is not based on original alleles and genetic drift, so there is always a residual admixture. There are other weaknesses too, which could be a topic of their own. For now I only say that, in my opinion, it has problems modelling closely related populations that differ in their minor admixtures.

Two results using references downloaded from DNA.LAND


CSAMERICA 0.00697236
KALASH 0.0165295
NEEUROPE 0.223957
NEUROPE 0.731415
SWEUROPE 0.0185991


ITALY 0.0116662
SARDINIA 0.565326
SWEUROPE 0.423008

Two results using my Estonian-BC database as reference


Bulgaria 0.0524726
Colombian 0.00995061
Ireland 0.213786
Kalash 0.00510618
Latvia 0.0102334
Lithuania 0.221711
Orcadian 0.140857
RU_Smolensk 0.0244413
Scotland 0.207837
Udmurtia 0.0262457
Welsh 0.0840646


Basque 0.0446958
Ireland 0.0547766
NorthItaly 0.0953578
Sardinian 0.569061
Scotland 0.0409351
Sicily 0.0670495
Spain 0.12727
Tuscany 0.000854723

1)  I have changed my fastq process. Although BWA is an excellent program for mapping reads, its automatic trimming is not powerful enough, so I have now rerun all older samples as well, using a separate trimming program.


Wednesday, 23 March 2016

Two-fold ancestry of Finnish people

It has been a common idea, especially among linguists, that the Baltic Finnic languages came from the Volga region, from the so-called Volga river bend near Samara. It is a carefully cherished tradition in Finnish science, but any movement of people from there to Finland is still without genetic evidence. Now I am going to prove something that contradicts this idea of the Volga origin of Finns, or at least gives a new view of it. I'll show plausible genetic evidence of a Volga-Saami connection using the Saami sample (Haak et al. 2015 and Lazaridis et al. 2014), which shows very high similarity with the ancient Eneolithic Samara sample (Mathieson et al.).

The other half of my Finnish story is about ancient Central European influence in Finland. Around 20% of the Finnish samples from the 1000 Genomes project show Corded Ware similarity comparable to Estonians and Lithuanians, and western Finnish project samples show Corded Ware similarity equal to Swedes, some even more, despite the fact that they are much more "eastern" than present-day Swedes.

This Finnish duality doesn't tell where and when the mixing occurred, and so far I have not seen any genetic evidence about the Baltic Finnic origin. It looks very possible that genetically the Baltic Finns were born somewhere in the region from Estonia to the White Sea, no matter what the origin of the Baltic Finnic language may have been.

Saami results

Saamis are genetically closer to the Eneolithic Samara people than Mordovians (Mordva) and Chuvashes are. It is worth noticing that Mordovians, who live near the Volga, are not closer to those ancient people who lived in Samara. The Saami people live thousands of kilometers and thousands of years away from the suggested Volga home range. The Siberian admixture of Chuvashes roughly equals that of the Saami. This statistic has, however, very limited use, because the Saami people are not Central Europeans, but it still shows them being comparable to Central Europeans when compared to ancient East European samples. What could be the best outcome?

Some readers may think that the Eneolithic Samara - Saami - Finnish genetic connection is based only on the amount of Siberian admixture. That is not true and is easily proved false. Chuvashes and Mansi people (and Komis, not included), who have high Siberian admixture, are far from Eneolithic Samara, definitely not comparable to the Saamis. Similarly, those Finns who are closest to Eneolithic Samara have less Siberian admixture than Russians living in the Archangel and Pinega regions of Russia (see the project results).

Only people in northernmost Europe beat Saami_WGA in comparison with Eneolithic Samara. I have to admit, this is a somewhat complicated question. So let's look at another perspective on the supposed Finnish ancestry, the Corded Ware samples. It is less complicated.

Corded Ware results

Only Lithuanians beat the Finnish CW group (20% of the Finnish samples from the 1000g project after removing outliers) when the test is done using over half a million SNPs. Even the Lithuanians would be beaten with a more homogeneous Finnish sample group. There is every variation from very CW-looking to only moderately CW-looking. They don't look like they came from the Volga bend. Not really.

Then I combine the Saami and CW results and project members. To do this I have to use my smaller database, based on Estonian Biocentre's data. The accuracy is somewhat poorer. The numbers show the difference between Eneolithic Samara and German Corded Ware affinities in Finland and in neighboring countries, as well as results for project members. By using the Eneolithic Samara and CW samples, the Siberian-like admixture is excluded and the results show only affinities common to those two groups, even if the tested populations or project members have extra Siberian admixture. It is important to understand that this table alone doesn't tell how much of those two ancient affinities individuals and populations have (it tells only a ratio). To see the big picture you also have to take into account the two previous tables, which show how significant the relation between ancient and modern populations is.

Project results

Sunday, 13 March 2016

Continuing tests with ancient Brits, better material and final results

Continuing with ancient British samples. This is fascinating because these samples have high sequencing quality, giving precise results. I now use 4 samples:

1. Hinxton2 is HI2 from the study "Iron Age and Anglo-Saxon genomes from East England reveal British migration history". HI2 Hinxton Male 170 BCE – 80 CE.

2. Rabrit3 is one of the Roman Age samples from the study "Genomic signals of migration and continuity in Britain before the Anglo-Saxons". I can't identify which of the six local samples from Driffield Terrace it is, because the study authors don't give the connection between sample labels and sample data. Rabrit3 was processed from the sample files ERR1043145, ERR1043146 and ERR1043147.

3. Iabrit is M1489 from the same study (Genomic signals...).  M1489 Iron Age Melton, age estimate between 210 BC and 40 AD.

4. Anglosaxon/anglosaxon2 is NO3423, again from the same study. NO3423 Anglo-Saxon Norton on Tees. The age estimate is unknown, but the sample is described as Anglo-Saxon.

All four samples were reprocessed using BWA-mem as described in my previous post. BWA-mem trims reads automatically and gives great results with minimal manual work and control; the process is fully automated. Before choosing BWA-mem I tested three other programs.

I have also standardized the sample selection in this test to ensure the same SNP coverage for all samples and to avoid errors due to SNP filtering and differences in SNP counts. So each ancient sample is compared against modern populations in almost exactly the same way. This is fundamental, because differences in the SNP count especially can cause severe biases in the results.

Here are Dstat results:

result:        CEU Mbuti_Pygmy     iabrit   Chimp.DG      0.4532   100.000 17762   6683 184450
result:        CEU Mbuti_Pygmy anglosaxon   Chimp.DG      0.4604   100.000 28390  10489 287312
result:     French Mbuti_Pygmy     iabrit   Chimp.DG      0.4533   100.000 17714   6663 184450
result:     French Mbuti_Pygmy anglosaxon   Chimp.DG      0.4579   100.000 28268  10511 287312
result:  FinnLocal Mbuti_Pygmy     iabrit   Chimp.DG      0.4491   100.000 17677   6721 184450
result:  FinnLocal Mbuti_Pygmy anglosaxon   Chimp.DG      0.4542   100.000 28218  10592 287312
result: FinnMostCW Mbuti_Pygmy     iabrit   Chimp.DG      0.4539   100.000 17757   6670 184450
result: FinnMostCW Mbuti_Pygmy anglosaxon   Chimp.DG      0.4592   100.000 28352  10509 287312
result:        IBS Mbuti_Pygmy     iabrit   Chimp.DG      0.4481   100.000 17602   6709 184450
result:        IBS Mbuti_Pygmy anglosaxon   Chimp.DG      0.4521   100.000 28066  10590 287312
result:       Kent Mbuti_Pygmy     iabrit   Chimp.DG      0.4546   100.000 17771   6664 184450
result:       Kent Mbuti_Pygmy anglosaxon   Chimp.DG      0.4604   100.000 28375  10484 287312
result:    Estonia Mbuti_Pygmy     iabrit   Chimp.DG      0.4536   100.000 17611   6620 183032
result:    Estonia Mbuti_Pygmy anglosaxon   Chimp.DG      0.4609   100.000 28197  10406 285211
result:  Sardinian Mbuti_Pygmy     iabrit   Chimp.DG      0.4498   100.000 17634   6692 184450
result:  Sardinian Mbuti_Pygmy anglosaxon   Chimp.DG      0.4528   100.000 28105  10587 287311
result:   Orcadian Mbuti_Pygmy     iabrit   Chimp.DG      0.4552   100.000 17766   6652 184450
result:   Orcadian Mbuti_Pygmy anglosaxon   Chimp.DG      0.4595   100.000 28345  10496 287311
result:        TSI Mbuti_Pygmy     iabrit   Chimp.DG      0.4479   100.000 17597   6711 184450
result:        TSI Mbuti_Pygmy anglosaxon   Chimp.DG      0.4529   100.000 28076  10573 287312
result: North_Italian Mbuti_Pygmy     iabrit   Chimp.DG      0.4497   100.000 17653   6702 184450
result: North_Italian Mbuti_Pygmy anglosaxon   Chimp.DG      0.4557   100.000 28173  10533 287311
result: Russian_Vologda Mbuti_Pygmy     iabrit   Chimp.DG      0.4483   100.000 17635   6718 184450
result: Russian_Vologda Mbuti_Pygmy anglosaxon   Chimp.DG      0.4525   100.000 28129  10603 287311

result:        CEU Mbuti_Pygmy   hinxton2   Chimp.DG      0.4391   100.000 45931  17904 433006
result:        CEU Mbuti_Pygmy    rabrit3   Chimp.DG      0.4557   100.000 29993  11214 299676
result:     French Mbuti_Pygmy   hinxton2   Chimp.DG      0.4375   100.000 45806  17923 433006
result:     French Mbuti_Pygmy    rabrit3   Chimp.DG      0.4536   100.000 29889  11235 299676
result:  FinnLocal Mbuti_Pygmy   hinxton2   Chimp.DG      0.4327   100.000 45622  18064 433006
result:  FinnLocal Mbuti_Pygmy    rabrit3   Chimp.DG      0.4486   100.000 29784  11336 299676
result: FinnMostCW Mbuti_Pygmy   hinxton2   Chimp.DG      0.4384   100.000 45896  17921 433006
result: FinnMostCW Mbuti_Pygmy    rabrit3   Chimp.DG      0.4540   100.000 29933  11239 299676
result:        IBS Mbuti_Pygmy   hinxton2   Chimp.DG      0.4329   100.000 45492  18006 433006
result:        IBS Mbuti_Pygmy    rabrit3   Chimp.DG      0.4505   100.000 29732  11263 299676
result:       Kent Mbuti_Pygmy   hinxton2   Chimp.DG      0.4402   100.000 45988  17876 433006
result:       Kent Mbuti_Pygmy    rabrit3   Chimp.DG      0.4574   100.000 30041  11186 299676
result:    Estonia Mbuti_Pygmy   hinxton2   Chimp.DG      0.4394   100.000 45638  17775 430442
result:    Estonia Mbuti_Pygmy    rabrit3   Chimp.DG      0.4558   100.000 29766  11126 297499
result:  Sardinian Mbuti_Pygmy   hinxton2   Chimp.DG      0.4327   100.000 45516  18021 433005
result:  Sardinian Mbuti_Pygmy    rabrit3   Chimp.DG      0.4515   100.000 29768  11250 299675
result:   Orcadian Mbuti_Pygmy   hinxton2   Chimp.DG      0.4384   100.000 45892  17919 433005
result:   Orcadian Mbuti_Pygmy    rabrit3   Chimp.DG      0.4571   100.000 30016  11185 299675
result:        TSI Mbuti_Pygmy   hinxton2   Chimp.DG      0.4329   100.000 45487  18005 433006
result:        TSI Mbuti_Pygmy    rabrit3   Chimp.DG      0.4491   100.000 29704  11292 299676
result: North_Italian Mbuti_Pygmy   hinxton2   Chimp.DG      0.4346   100.000 45645  17989 433005
result: North_Italian Mbuti_Pygmy    rabrit3   Chimp.DG      0.4528   100.000 29838  11238 299675
result: Russian_Vologda Mbuti_Pygmy   hinxton2   Chimp.DG      0.4337   100.000 45616  18017 433005
result: Russian_Vologda Mbuti_Pygmy    rabrit3   Chimp.DG      0.4493   100.000 29754  11306 299675
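The numbers in these lines can be sanity-checked by hand. As I read the qpDstat output, the columns after the four population names are D, Z, the BABA count, the ABBA count and the number of SNPs used, and D should be close to (BABA - ABBA) / (BABA + ABBA). Here is a small sketch under that assumption; the function names are my own, and because the printed counts are rounded the recomputed value can differ in the last digit.

```python
def parse_result(line):
    """Split one qpDstat 'result:' line into named fields.

    Assumed column order: W X Y Z D Z-score BABA ABBA nSNPs.
    """
    fields = line.split()
    if len(fields) != 10 or fields[0] != "result:":
        raise ValueError("not a qpDstat result line: %r" % line)
    return {
        "pops": tuple(fields[1:5]),
        "d": float(fields[5]),
        "z": float(fields[6]),
        "baba": int(fields[7]),
        "abba": int(fields[8]),
        "nsnps": int(fields[9]),
    }


def d_from_counts(baba, abba):
    """Recompute D from the printed BABA and ABBA counts."""
    return (baba - abba) / (baba + abba)


line = "result:        CEU Mbuti_Pygmy     iabrit   Chimp.DG      0.4532   100.000 17762   6683 184450"
r = parse_result(line)
print(r["pops"], r["d"], round(d_from_counts(r["baba"], r["abba"]), 4))
```

For the CEU / iabrit line the recomputed value, (17762 - 6683) / (17762 + 6683), rounds to the printed 0.4532.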

And here are corresponding graphic maps:


Results differ somewhat from what I got earlier, obviously due to the stricter data preparation and more neutral outgroups.

Finally, I also made IBS statistics and a PCA plot using the same data. It is, however, reasonable to note that due to the homozygosity error of ancient samples, the most homozygous modern populations get an extra boost and give results that are too high. This is typical for Balts, Irishmen and Scots. I don't know about Basque homozygosity.

I was able to include extra populations using Plink with the --geno 0.01 option to standardize the SNP set as much as possible.

Creating a PCA needs more samples to pick proper and all-inclusive components, so it is done here using another data set with fewer SNPs and more populations.
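To show what this filter does: Plink's --geno option removes variants whose missing call rate exceeds the given threshold, so with 0.01 only SNPs called in at least 99% of samples remain. The sketch below imitates that rule on a toy in-memory table; the data layout and function name are invented for illustration and have nothing to do with Plink's own file formats.

```python
def filter_by_missingness(genotypes, max_missing=0.01):
    """Keep SNPs whose share of missing calls is at most max_missing.

    genotypes: dict mapping SNP id -> list of calls, one per sample,
    where None marks a missing call (hypothetical toy layout).
    """
    kept = {}
    for snp_id, calls in genotypes.items():
        missing_rate = sum(1 for c in calls if c is None) / len(calls)
        if missing_rate <= max_missing:
            kept[snp_id] = calls
    return kept


toy = {
    "rs1": [0, 1, 2, 1],       # fully called, kept
    "rs2": [0, None, 2, 1],    # 25% missing, dropped
    "rs3": [2, 2, 2, 2],       # fully called, kept
}
print(sorted(filter_by_missingness(toy)))
```

With only a handful of ancient samples in the file, such a strict threshold in practice keeps only SNPs that every sample covers, which is the point of the standardization.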

edit 17.3.2016 23:05

I read a comment on a Finnish history forum that using two outgroups, as I did in this post, is not recommended and can distort results. I gladly admit that this is true. But the reason for using two outgroups is very clear: it gives a large number of mutually comparable results, instead of comparing only three populations at a time. Using three target populations and one outgroup makes it impossible, or at least painful, to compare results from separate qpDstat runs. Of course the latter method, using three targets, gives better accuracy.

But there is no smoke without fire here; my tests using two outgroups look reliable. In my previous results (above) the FinnMostCW group was very close to the Iron Age British sample, closer than the French sample group. I made a new test using the same data, now with three target populations and one outgroup. It confirms my previous results:

  0           FinnMostCW   16
  1           French   28
  2           iabrit    1
  3           Chimp.DG    1
jackknife block size:     0.050
snps: 605676  indivs: 46
number of blocks for jackknife: 551
nrows, ncols: 46 605676
result: FinnMostCW     French     iabrit   Chimp.DG      0.0020     1.039  9159   9122 184450

Indeed, I will have to come back to this question with larger data.  

edit 18.3.2016 12:40

Here is another qpDstat result using three target populations. I am quite disappointed by the way some people react when they are not happy with some results. My only goal is to make objective tests using primarily European genetic data. My focus is not on Finnish results, nor do I try to avoid producing reliable results about Finns.

  0           FinnMostCW   16
  1           FinnLocal   15
  2           iabrit    1
  3           Chimp.DG    1
jackknife block size:     0.050
snps: 605676  indivs: 33
number of blocks for jackknife: 551
nrows, ncols: 33 605676
result: FinnMostCW  FinnLocal     iabrit   Chimp.DG      0.0073     3.750  9120   8989 184450
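For reading three-target results like the two above, a common rule of thumb (an assumption here, not something qpDstat itself prints) is to treat |Z| of about 3 or more as a significant deviation from zero. The sketch below applies that rule to a D and Z pair; the function name and the default threshold are illustrative only.

```python
def interpret(first, second, target, d, z, threshold=3.0):
    """Describe a D(first, second; target, outgroup) result in words.

    threshold is the assumed |Z| cut-off for significance.
    """
    if abs(z) < threshold:
        return "%s and %s are not significantly different relative to %s" % (
            first, second, target)
    closer = first if d > 0 else second
    return "%s is significantly closer to %s" % (closer, target)


print(interpret("FinnMostCW", "French", "iabrit", 0.0020, 1.039))
print(interpret("FinnMostCW", "FinnLocal", "iabrit", 0.0073, 3.750))
```

Under that rule the FinnMostCW versus FinnLocal difference (Z = 3.750) counts as significant, while FinnMostCW versus French (Z = 1.039) means FinnMostCW is at least not further from the Iron Age sample than the French are, which is why larger data is needed to separate them.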