Tuesday, December 17, 2013

Late settlement Finns




If you have followed my blog, you may remember what I wrote about the Finnish settlements (Testing 23andme's Ancestry Composition continues / Finnish results), here:


“The term late settlement is used by Finnish historians and means areas which were mostly populated during the Swedish era by administrative transactions (king’s orders) or by occupying areas in wars between Sweden and Novgorod/ later Moscow.   This means actually that the age of Finnish reference group (used by 23andme) is around 500-700 years and people living in older settlements …”

 

Analyzing young, expanding settlements and populations is problematic because they carry more genetic drift than old populations. Old rural populations also have genetic drift, but only very locally. I encountered the same problem with the Utah-CEU samples and resolved it by selecting the least drifted of them. Obviously I can’t do the same with the Finnish samples without my neutrality being questioned. I also wrote about genetic drift here (in the chapter written in English):

 

“At first I found that any group with high genetic drift due to isolation strongly distorts the result. The effect of genetic drift and the resulting distortion was easy to see; for example, I dropped a lot of HGDP-CEU samples for being too homogeneous or drifted. Young genetic drift generates its own genetic components in analyses within one sample group and does not reflect the group's older common history with other groups, which is why groups with genetic drift are useless in searching for the common history of populations. They also affect the root population they come from. This kind of genetic drift can be found after rapid expansions in a subpopulation, as in villages or smaller cultural communities.”

 

So what to do?

 

If we want to see behind the birth of local settlements, we should get rid of the genetic drift in the results and prevent the PCA from generating drift components. This is not difficult at all. We only need to be aware of the sample size (number of samples) that triggers the formation of young drift components. It can be anything from two to tens of samples, depending on the sample set. I don’t yet know exactly how many late settlement Finns are needed to reach the threshold, because I don’t have enough of them. But I don’t need to know, because even a few late settlement Finns belonging to the same root population show the trend: where they belong without genetic drift and where they came from (or at least where a significant part of their ancestors came from). I can add more samples and nothing will change until I reach the threshold at which the PCA starts generating drift components on the dimensions we want to see.

 

Results

 

The first PCA includes the same Eurasian samples I used when analyzing old settlement Finns. Here the Finnish samples (SK0001, SK0002 and SK0003) are located clearly inside the North Russian cluster, but on the opposite side from the Slavic Belarusians. Adding more late settlement Finns would make this look more dramatic. I can’t avoid concluding that Northern Russians (Vologda people and Mordvas) are a mixture of Finnish look-alike people and Slavs.

 

 

 

 

You can see an image with better resolution here

 

Secondly, here is the same European plot as before with old settlement Finns. Now many of the Caucasian and Eurasian components seen in the Eurasian plot are missing, and the Finnish samples move towards the Lithuanians, who represent the gene pool around the eastern Baltic Sea region. This effect would be stronger with Estonians, and strongest with late settlement Finns. It happens because of gene flow between late settlement Finns and old settlement Finns, and between them and Estonians. I don't know whether this gene exchange happened before or after they adopted North Russian genes. Maybe the Baltic-Finnic gene pool was much more widespread before the Slavic expansion.

   

 

 

 

You can see an image with better resolution here

 

My last graphic shows how the three Finns under test relate to the effective PCA components. Sorry, this is available only for the European PCA; I was too lazy to work with the bigger Eurasian data.

 

 

 

Saturday, December 7, 2013

Testing 23andme's Ancestry Composition continues



I am happy to report that 23andme has fixed a few weird things I noticed two weeks ago. Two Finns with obviously non-Finnish European admixture now show sensible results. The first now shows 37% Finnish, having been 100% Finnish before the fix. A huge difference. The second Finn now shows 47.9%, down from 99.9% Finnish. I was not cheating.

What’s new

Basically Ancestry Composition is unchanged; only the software engine behind the user interface has been revised. Now it is time to go ahead and look at the new results. I have compiled some statistics. The first graphic shows the results per country: how well 23andme has succeeded in assigning people to their own national gene pool. It of course made no sense to include national groups without a reference group of their own, like Estonians. They look “mixed” regardless of their level of genetic diversity.


   
All Finns with known recent foreign admixture are excluded, as are Swedish-speaking citizens, but of course I can’t know what happened hundreds of years ago. I can’t guarantee that all Finns are the same people since the Ice Age, or even since the Roman Iron Age. The Balts include Lithuanians and Latvians. The Russian group includes only ethnic Russians.


Secondly we see the standard deviation of the results for each country. Note that even when the national proportion is very low, as with Scandinavians, the deviation within the own gene pool indicates population diversity comparable to that of countries with higher average numbers. This is one of the oddities of admixture analyses, and it sometimes misleads people into thinking that an analysis showing plenty of admixture means high diversity. That is a wrong conclusion. Admixture results show only how much the corresponding part of your genome resembles the chosen reference set.
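The difference between the average national proportion and its spread can be illustrated with a few lines of Python. This is only a sketch: the numbers are invented for illustration and are not 23andme results, and the function name summarize_by_country is my own.

```python
from statistics import mean, pstdev


def summarize_by_country(results):
    """Return {country: (mean, population standard deviation)}.

    `results` maps a country name to the list of "own national pool"
    percentages reported for individuals from that country.
    """
    return {
        country: (mean(values), pstdev(values))
        for country, values in results.items()
        if values
    }


# Invented numbers, for illustration only (not 23andme output).
example = {
    "Country A": [95.0, 97.0, 99.0],
    "Country B": [20.0, 22.0, 24.0],
}

summary = summarize_by_country(example)
for country, (avg, sd) in summary.items():
    print(f"{country}: mean={avg:.1f}%  sd={sd:.2f}")
```

The two invented countries have very different means but the same spread, which is the situation described above: a low national proportion does not by itself say anything about the deviation.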

  

Finnish results

Looking at the results and the origin of the Finns, we can be sure that 23andme uses Finns from the late settlements to build the Finnish reference set. The term late settlement is used by Finnish historians and means areas which were mostly populated during the Swedish era by administrative transactions (king’s orders) or by occupying areas in wars between Sweden and Novgorod/ later Moscow. This means actually that the age of the Finnish reference group is around 500-700 years and people living in older settlements, in areas that were populated well before the Swedish crusades to Finland, are compared to them, not vice versa. It is impossible to find out how many genes have moved during this 500-700 year period from old settlements to late settlements and how many in the other direction, but we know the age of both populations. The younger entity can’t be used to classify the older one. Who populated the late settlements in Finland is another question.

Anyway, by evaluating the error caused by this poor test arrangement and putting things together anew, we can try to estimate the lowest percentage for unmixed Finns from Finnish history and personal data at 23andme. At its lowest it would be around 70%, and somewhat below that in Southwestern Finland, because Southwestern Finns have given the fewest genes to the late settlements, fewer than Tavastians, Karelians and Ostrobothnians. In SW Finland a bigger portion of the old Finnish heritage remains unknown and hides inside the nonspecific numbers. You can see this, as well as the Swedish admixture level, by looking at your shared Finnish results in Ancestry Composition. When the Finnish percentage is below 70%, we can expect some foreign admixture beyond the corresponding average Finnish admixture in, for example, Sweden and Russia.
  
Some points more

The highest Finnish numbers seem to come from East Finland, near Iisalmi and Kuopio; the highest in East Europe from the Baltic countries and the Pskov and Tver regions of Russia; and the highest Scandinavian number is from Värmland (sic!), Sweden, followed by Norwegians living near Värmland on the other side of the Norwegian-Swedish border. I wouldn't call it déjà vu, but it is boring to see admixture analyses do this again. We need new ideas. Although 23andme obviously uses its own dedicated admixture model, it still falls into the same problem as all recent admixture models that derive results from a pure admixture model and don't take genetic drift into account. Unfortunately admixture models conclude that gene flow always goes from homogeneous populations to more heterogeneous ones, without understanding the effect of genetic drift, which usually happens after gene flow in the opposite direction. This happens because admixture models don’t take timelines into account and assume that higher diversity is an admixture of present-day populations.

Sunday, December 1, 2013

Controlling data

What we get when looking for genetic samples for our analyses is somewhat a matter of chance. We don't know whether our samples are typical representatives of the people they should represent according to their label. Usually researchers confirm only that the grandparents of the sampled individuals belonged to the named group. But are they third-generation immigrants, villagers from the same village, do they speak the same language or belong to a certain cultural group? We usually don't know. We have to place a lot of trust in the coordination of researchers and in what they have done all over the world. It would be a good idea to report some key figures about the samples used, and to go further and compare these key figures between public databases and studies (using the same SNP sets etc.). After that we could see whether the results are comparable.

Here are two key figures for my European samples.

1. Similarity

This graphic shows the similarity within each population as the average IBS sharing between the samples of that population (136835 SNPs):




2.  Level of homozygosity

This graphic shows the average homozygosity of each population (same data as above):





In both cases the Finns belong to the selected old settlement group, and the CEU samples are those selected for low genetic drift; most CEU samples carry significant genetic drift.
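For readers who want to compute the same two key figures for their own data, here is a minimal Python sketch. It assumes genotypes coded as B-allele counts (0, 1, 2) with -1 for a no-call, and it assumes that IBS similarity is the proportion of shared alleles over the SNPs called in both samples. The function names and the toy genotypes are mine and only illustrate the idea; they are not the exact procedure behind the graphics.

```python
from itertools import combinations

NO_CALL = -1  # genotype codes: 0 = AA, 1 = AB, 2 = BB, -1 = no-call


def ibs_similarity(a, b):
    """Proportion of alleles shared identical-by-state by two samples."""
    shared = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga == NO_CALL or gb == NO_CALL:
            continue  # skip SNPs missing in either sample
        shared += 2 - abs(ga - gb)  # 2, 1 or 0 shared alleles
        compared += 2
    if compared == 0:
        raise ValueError("no SNPs called in both samples")
    return shared / compared


def mean_within_population_ibs(samples):
    """Average IBS similarity over all pairs of samples in one population."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(ibs_similarity(a, b) for a, b in pairs) / len(pairs)


def homozygosity(genotypes):
    """Fraction of called SNPs that are homozygous (coded 0 or 2)."""
    called = [g for g in genotypes if g != NO_CALL]
    if not called:
        raise ValueError("sample has no called SNPs")
    return sum(1 for g in called if g in (0, 2)) / len(called)


def mean_homozygosity(samples):
    """Average homozygosity over the samples of one population."""
    return sum(homozygosity(s) for s in samples) / len(samples)


# Toy genotypes, invented for illustration only.
toy_population = [
    [0, 1, 2, 2, -1],
    [0, 1, 1, 2, 0],
    [2, 1, 2, 2, 0],
]
print(mean_within_population_ibs(toy_population))
print(mean_homozygosity(toy_population))
```

Reporting both numbers per population, together with the SNP count, would make it possible to compare sample sets between studies in the way suggested above.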

Friday, November 29, 2013

23andme, please give me an explanation

I have long wondered about some weird results you are offering me. You tell me that Ancestry Composition is a state-of-the-art tool for finding our ancestry. But let's look closer at it. Now is a good time to look, because you are revising this ancestry tool and we can soon compare old and new results, and I will do so regardless of the progress. You can convince me that this will look better after the service upgrade.

First, here are some of the genetic PCA maps you give me. I have gathered some examples of individuals and ethnic groups to illustrate the situation.

Here is where some Northern Europeans seem to be located on the maps:




Here are two mixed Finns, the first one has ancestry of 50% Finnish and 50% German, the second one is 50% Finnish and 50% Irish:



Here are two other mixed Finns, the first one is 75% Finnish and 25% French-Swedish, the second one is obviously partly English.  I am not sure about the second admixture, but his name is English and his ancestral names are mostly English and Swedish, but also Finnish:




Everything looks fine on these maps. The Finns are the northernmost ethnic group, and all the mixed Finns are closer to the populations their admixture comes from. But some things don't look as good as I expected when I look at the corresponding results in Ancestry Composition.

Here are two results corresponding to the map 1:





And here are two results corresponding to the map 2:





It shouldn't be difficult to see the difference: the mixed Finns on the first map are only half Finnish, which is true. On the second map the mixed Finns are fully Finnish. So what is the inside story? I can only draw conclusions from public information, and I assure you I have tried to find out. But however hard I try, I can't see any difference other than that the two persons on the first map live in America and the two on the second map live in Finland. And they really do live in these places. Can you explain it?

Wednesday, November 27, 2013

Project data freely available

The genome data I use, excluding the Finnish samples, is available here. The file package contains three files: the data itself, the SNP list and the sample list. Each row of the data contains the 136835 coded SNP values of one sample. The codes are

0 - homozygous A allele
1 - heterozygous
2 - homozygous B allele
-1 - "no-call", i.e. the value is missing

There are 667 rows, corresponding to the sample list.
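As a rough illustration of this coding, the following Python sketch parses one data row and checks it. The delimiter is an assumption on my part (whitespace or commas); check the actual file before relying on it. The function names are hypothetical.

```python
VALID_CODES = {0, 1, 2, -1}


def parse_row(line, expected_snps=None):
    """Parse one sample row into a list of genotype codes.

    Assumes the values are separated by whitespace or commas.
    """
    codes = [int(token) for token in line.replace(",", " ").split()]
    unexpected = set(codes) - VALID_CODES
    if unexpected:
        raise ValueError(f"unexpected genotype codes: {sorted(unexpected)}")
    if expected_snps is not None and len(codes) != expected_snps:
        raise ValueError(f"expected {expected_snps} values, got {len(codes)}")
    return codes


def no_call_rate(codes):
    """Share of SNPs in one sample that are coded as no-call (-1)."""
    return codes.count(-1) / len(codes)


# A shortened, invented row; a real row would hold 136835 values.
row = parse_row("0 1 2 -1 2 0", expected_snps=6)
print(row, no_call_rate(row))
```

With the real file, expected_snps would be 136835 and the number of parsed rows should come to 667.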

Coding into A and B alleles is a common practice among researchers. See, for example, the data in GEO.

The data file extracted from the package is best handled with EditPad Lite, which is suited to editing large character files. You can download the program here.

PCA statistics are easiest to produce with R. You can download the statistics program R here. My coding works directly with R's functions. For your own samples you need to restrict your sample to the SNP list, code the A/T and C/G values as numbers, transpose your samples, and add them, for example with EditPad Lite, to the samples you have selected from the project data. Transposing should work in R, but Past can also do it for smaller numbers of samples, although with Past you can compute statistics only up to 10000 SNPs.
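The preparation steps for your own sample can be sketched in Python as follows. This is a sketch under assumptions: I assume the SNP list gives, for each SNP id, which base is the A allele and which is the B allele, and that your raw data gives a two-letter genotype per SNP id. The names encode_genotype, encode_sample and transpose are my own, and strand differences between data sources are not handled here.

```python
def encode_genotype(genotype, allele_a, allele_b):
    """Code a two-letter genotype as 0 (AA), 1 (AB), 2 (BB) or -1 (no-call)."""
    if len(genotype) != 2:
        return -1
    if any(base not in (allele_a, allele_b) for base in genotype):
        return -1
    return genotype.count(allele_b)  # number of B alleles


def encode_sample(raw_calls, snp_list):
    """Restrict one sample to the project SNP list, in list order.

    raw_calls: {snp_id: genotype string}, e.g. {"rs1": "AG"}
    snp_list:  [(snp_id, allele_a, allele_b), ...]
    SNPs missing from raw_calls become no-calls.
    """
    return [
        encode_genotype(raw_calls.get(snp_id, "--"), allele_a, allele_b)
        for snp_id, allele_a, allele_b in snp_list
    ]


def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*rows)]


# Invented SNP ids and genotypes, for illustration only.
snp_list = [("rs1", "A", "G"), ("rs2", "C", "T"), ("rs3", "A", "C")]
raw_calls = {"rs1": "AG", "rs2": "TT", "rs4": "GG"}

encoded = encode_sample(raw_calls, snp_list)
print(encoded)
print(transpose([encoded, [0, 0, 2]]))
```

The encoded list is one row in the same layout as the project data, so it can be appended to the rows you selected from the package.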


Monday, November 25, 2013

IBS distances

An Excel spreadsheet (Excel 2007 or later) containing the IBS distances of the project members and the reference population samples can be downloaded here.

The IBS figure tells the size of the shared pool of SNP values. As with the previous results, the distances were computed from LD-pruned data, so the figures do not match in magnitude those computed from raw data. LD pruning removes redundancy in the genome. LD redundancy is affected by the age of the genome segments shared by the samples being compared. I will later produce statistics from this IBS data, so learning Excel is not necessary.
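To give an idea of what LD pruning does, here is a simplified Python sketch. It is a greedy variant under my own assumptions, not the procedure or the settings actually used for the project data: a SNP is kept only if its squared correlation with each of the most recently kept SNPs stays at or below a threshold. The threshold of 0.5 and the window of 50 are arbitrary illustration values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two SNPs' genotype vectors.

    Samples with a no-call (-1) at either SNP are left out.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a != -1 and b != -1]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (var_x * var_y)


def ld_prune(snps, threshold=0.5, window=50):
    """Greedy pruning of correlated SNPs.

    snps: list of genotype vectors, one per SNP, in genomic order.
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for index, snp in enumerate(snps):
        recent = kept[-window:]
        if all(r_squared(snp, snps[j]) <= threshold for j in recent):
            kept.append(index)
    return kept


# Invented genotype vectors (one list per SNP, five samples each).
toy_snps = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],  # identical to the first SNP
    [2, 1, 0, 1, 2],  # mirror image of the first SNP
    [0, 0, 1, 2, 2],  # uncorrelated with the first SNP
]
print(ld_prune(toy_snps))
```

Because redundant SNPs are removed, IBS figures computed after pruning are on a different scale from those computed on raw data, as noted above.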

Thursday, November 21, 2013

Edge populations detected by yDna



I made this PCA map using distribution numbers from Eupedia. One small modification was made: the Finnish numbers were replaced by the Finnish east-west distribution from the study by Lappalainen et al.

yDna, of course, represents only a small part of the genome. It does, however, pass on genetic inheritance without recombination. Recombination is typical for autosomal genes, so the yDna is also free of similar genetic drift. A strong yDna polarization can’t be completely ignored when estimating the overall homogeneity of populations, because it is connected with autosomal inheritance through individuals. For example, if 100% of the men in a population belong to the same branch of a certain haplogroup, this can have a strong parallel effect on the autosomal side of the genome. In this context, don't confuse the terms homogeneity, the admixture shown by admixture analyses, and the opposing terms heterozygosity and homozygosity.

We obviously can’t claim that diversity in yDna always means diversity in auDna, nor that homogeneous yDna leads to homogeneous autosomal genes per se, but looking at history and genetic studies we certainly see a connection between yDna and auDna. This makes yDna helpful in estimating population structures. Keep in mind, however, that when analyses lead to interpretations of history, we should also know something about the history of the yDna groups before drawing conclusions, to avoid circular arguments in evaluating that history. Be careful.

Nevertheless, a strong correlation between yDna and auDna is obvious in this small analysis. I could put comparable autosomal analyses here. The Basques, the Irish and certain British people show homogeneous fractions of yDna, which seems to correlate with a strong history of their own and thus likely also points to a certain homogeneity of these populations (can we accept this from the analyses alone, without relying on our foreknowledge?). That the Bosnian Croats and the Catalans belong to these edge groups also sounds acceptable. The thousand-dollar question is whether all people in the Balkans belonging to haplogroups I2 and I2a came from Bosnia, and whether the Brits, Scots and Spaniards belonging to haplogroup R1b came from Wales, the Basque country and Catalonia, or whether we have other explanations. I think we have: genetic isolation and genetic drift of these populations.


Click here to see a big view.

Tuesday, November 19, 2013

The first chapter

The development of the past few years and new studies have made a fair amount of genetic data on various populations available to amateur researchers too. Much of the data essential to the genetic history of the Finns is, however, available only to university researchers. To mention a few, as far as I know no data is freely available for the following groups: Northwest and West Russians, Poles, Estonians, Swedes, Norwegians and Saami. These groups would of course be essential for the Finns.

By collecting data from several sources, a sufficient number of local samples can be found for broader analyses. For now, the Finnish gene pool has to be assessed by comparing it with the neighbours' neighbours. The most important neighbours would be the Estonians and the Swedes, but adding them to the reference data is possible only with the help of volunteer data donors. So for the time being I do my analyses under these limitations.

I have compared my results with other available results, and the closest to mine are Dr. McDonald's PCA results. The small differences between our results on the Europe map are probably due to differences in the Finnish samples used. My Finnish samples fall closer to Central Europe. This can be explained by the fact that my Finns are mostly from the old southwestern settlement areas. I believe the doctor has, like me, spent a lot of time standardizing the data. Standardizing the data is a prerequisite for reliable results, because the available analysis tools do not recognize all the properties of the data. Roughly put, the old data-processing adage “garbage in, garbage out” applies here too.


Results

PCA

Europe-level PCA plot






Large image here



The position of the Finns is best seen in the 45-degree 3D projection.







The Finns fall between the Balts and the North Russians. Adding Scandinavians would probably change the position of the Finns, but how remains to be seen until I get enough Scandinavians into the reference data. My guess is that the effect is not very large: the Scandinavians would probably replace the CEU samples by falling between them and the Finns, while the position of the Finns relative to the Balts and the Russians would probably stay the same. I estimate that the effect of adding Estonians to the data would be small, because they are very close to the Baltic samples.

Judging from the relative position of the Finns compared with other peoples, Finnish genes can be estimated to occur in a geographical area from the northern Baltics to the White Sea. On the map, all these areas border on the areas conquered by the Slavs 1000 years ago. Transferred straightforwardly onto the map, the Southwest Finns fall around Pskov and Novgorod. Taking into account the mixing that has taken place in Finland, the origin of the Southwest Finns may also lie further west in the Baltics.

I also computed the averages of the nationalities' loading values in relation to the SNP data.







The loading results give an estimate of how well the different nationalities are represented in the component selection of the dimensions (PC1 and PC2). The Finns are slightly above average. Thus the results for the Finns are comparable with those of the other nationalities.


The PCA data of the Europe analysis is available here


In the broad Eurasian plot, the gene flows that stand out clearly from outside Europe are those of North and East Asia, Central and South Asia, and the Middle East/North Africa. Within Europe the resolution weakens compared with the Europe plot, as some of the Europe-specific components are replaced by intercontinental components. In addition, the position of the Finns is affected by their North Asian-type genetic contribution, whose origin one can only guess at. My own guess is that it comes from ancient mixing with the indigenous population of the Finnish area, which was more Arctic in origin than those who brought the Finnish language.

The position of the Finns between the North Russians, the Balts and the Belarusians does not essentially change from the Europe plot; the Finns and the North Russians merely shift slightly eastwards along the Siberia - East Asia line. If the group of North Russians and Finns is moved in the 10 o'clock direction, the arrangement nearly matches the Europe plot.

The broad Structure analysis showed that all Siberian groups are partly of European origin. It is entirely possible that part of this European share originates from Northern Russia or the Volga region. I kept these groups small, because with a large number of genetically drifted populations the direction of the European genetic share would have turned, in the analyses, to run from Siberia to Europe. I cannot estimate the directions of the gene flows across the Urals; I can only divide the North Asian share into an East Asian and a European part. I don't think anyone else can either.

Eurasia-level PCA plot





Large image here


The PCA data of the Eurasia analysis is available here




Structure

The Structure analyses show that the Finns belong to the East European peoples. The Finns separate into an East European group of their own at K = 7 and fall within the main East European group at smaller values.






At the Eurasian level, the North Asian genetic contribution of the Finns shows clearly as a share of about 5%.




Summary

In all plots the Finns fall into the same group as the Balts of the Baltic Sea region and the northern Slavs. Adding Scandinavian and Saami samples to the reference populations would probably change the situation somewhat, but the basic arrangement would hardly change essentially. It is hardly possible to obtain Saami samples in numbers comparable to the other samples. The Saami samples would also bring technical problems, which I will not go into here because of the breadth of the subject. The Finnishness of the Saami populations, or the Saami-ness of the Finns, would of course be an interesting subject of study. For now, however, as far as the Saami hypothesis is concerned, I content myself with measuring the eastern (Siberian?) heritage of the Finns. This rests on the assumption that the Saami are partly of eastern origin. I cannot present anything certain about this, however, and it would be rash to say anything on the matter now. The eastern share of the Finnish heritage is indisputable, but its effect should not be overestimated, as the position of the Finns among the East European populations on the PCA maps and in the Structure results shows. With their loading values and numbers of components, the Finns fall naturally into their own place on the Europe map. Deviations would show up either in the PC scores or in the loading values.

Next I intend to turn to IBS statistics. At the individual level, IBS gives more accurate results than the preceding analyses. Dienekes' DIY admix is also on the to-do list.

I will not take new members into the project until I have completed all the planned analyses.

Thanks to all the Finns who sent sample data.


In English

My goal has been to examine Finnish genes from a historical perspective and to avoid the problems that occur with unqualified data, i.e. to avoid results biased by poor sample sizes, homogeneous sample groups, genetic drift etc.

What is different?

The main difference is that I use samples from old Finnish settlements instead of the Finnish samples commonly used by bloggers and researchers, which come from young isolates (in certain studies from a single Finnish village) or from the historically young Helsinki metropolitan data. My samples have known old Finnish ancestry and no prominent foreign admixture.

Selecting the data

My SNP set includes all the SNPs common to the Dienekes/Dodecad V3 and Eurogenes Jtest admixture analyses. I made this selection to ensure that my SNP set is comparable with some well-known references and to avoid speculation that I am biased by the data selection. So the only thing of my own is the Finnish sample set, and this is just where I want to be different: I want to follow Finnish history. Additionally I took around 10000 AID-SNPs from known academic studies. After putting these SNPs together I chose the SNPs common to all my data sources (individual samples) and defined the working data as the union of these two preselected SNP sets. In practice this meant the SNPs common to HGDP, HapMap, Yunusbayev, Rasmussen, Behar, 23andme, FtDna and the two blogger data sets mentioned above. In the final phase I dropped about 3000 SNPs for low quality (high no-call rate). After all this I had 136835 qualified SNPs left.
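The selection logic can be sketched in Python roughly as follows. This reflects my reading of the steps above, not the exact script used: take the SNPs shared by the two calculators, add the extra SNPs, keep only those present in every data source, and finally drop SNPs with too many no-calls. The function names, the SNP ids and the no-call limit of 0.1 are invented for illustration; the limit actually applied is not stated.

```python
def common_snps(sources):
    """SNP ids present in every data source ({source_name: iterable of ids})."""
    id_sets = [set(ids) for ids in sources.values()]
    return set.intersection(*id_sets) if id_sets else set()


def build_working_set(calculator_sets, extra_snps, sources):
    """Union of the two preselected sets, restricted to SNPs in all sources."""
    shared_by_calculators = set.intersection(*(set(s) for s in calculator_sets))
    preselected = shared_by_calculators | set(extra_snps)
    return preselected & common_snps(sources)


def drop_high_no_call(snp_ids, calls_by_snp, max_no_call_rate):
    """Keep SNPs whose share of no-calls (-1) does not exceed the limit."""
    kept = []
    for snp_id in sorted(snp_ids):
        calls = calls_by_snp[snp_id]
        if calls.count(-1) / len(calls) <= max_no_call_rate:
            kept.append(snp_id)
    return kept


# Invented ids and genotypes, for illustration only.
calculators = [{"rs1", "rs2", "rs3"}, {"rs2", "rs3", "rs4"}]
extra = {"rs5"}
sources = {
    "HGDP": ["rs2", "rs3", "rs5", "rs9"],
    "HapMap": ["rs2", "rs5", "rs7"],
}
working = build_working_set(calculators, extra, sources)

calls_by_snp = {"rs2": [0, 1, 2, -1], "rs5": [0, 0, 1, 2]}
qualified = drop_high_no_call(working, calls_by_snp, max_no_call_rate=0.1)
print(sorted(working), qualified)
```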

I converted the data to full equivalence between the original alleles, meaning that allele A is coded as 0 and allele B as 1. I don’t use SNP-level coding for homozygotes and heterozygotes. Of course I can still check runs of homozygosity.

There was also reason to drop individual samples for low quality. At first I found that any group with high genetic drift due to isolation strongly distorts the result. The effect of genetic drift and the resulting distortion was easy to see; for example, I dropped a lot of HGDP-CEU samples for being too homogeneous or drifted. Young genetic drift generates its own genetic components in analyses within one sample group and does not reflect the group's older common history with other groups, which is why groups with genetic drift are useless in searching for the common history of populations. They also affect the root population they come from. This kind of genetic drift can be found after rapid expansions in a subpopulation, as in villages or smaller cultural communities. There were some obvious outliers too.


Another issue that distorts an analysis is oversampling. It is especially problematic if the group data is already biased by genetic drift, because oversampling amplifies the error caused by any other bias. Undersampling is also problematic if you want to analyse certain populations, but it doesn't much amplify samples with genetic drift and doesn't destroy your analysis the way oversampling does. If you take 100 German samples from all over the country and 100 samples from one Icelandic village, you can be sure of getting biased results for both groups.
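One simple way to limit oversampling is to cap every group at the same maximum size before the analysis. The Python sketch below shows that idea only; it is an assumption about how one could do it, not a description of how the groups were actually trimmed, and random downsampling does nothing about drift inside a group. The group names and the cap of 25 are made up.

```python
import random


def cap_group_sizes(groups, max_per_group, seed=0):
    """Randomly downsample each group to at most `max_per_group` samples.

    groups: {group_name: list of sample ids}
    A fixed seed keeps the selection reproducible.
    """
    rng = random.Random(seed)
    capped = {}
    for name, samples in groups.items():
        if len(samples) > max_per_group:
            capped[name] = rng.sample(samples, max_per_group)
        else:
            capped[name] = list(samples)
    return capped


# Invented sample ids, for illustration only.
groups = {
    "German": [f"DE{i}" for i in range(100)],
    "Icelandic village": [f"IS{i}" for i in range(100)],
    "Small group": [f"SG{i}" for i in range(8)],
}
capped = cap_group_sizes(groups, max_per_group=25)
print({name: len(samples) for name, samples in capped.items()})
```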

The distortion caused by bad samples before the quality check was especially notable on the European plot, because of its higher resolution and the smaller overall genetic distances between samples. On the worldwide plot local samples are not affected in the same way, because at world level the genetic distances are bigger.

The Structure analysis needed some data cropping. Almost 140000 SNPs were far too many for a Structure analysis; it would have taken weeks to run on my computers. To minimize the effect of the cropping on quality, I ran the worldwide data through a PCA and selected the 35000 SNPs showing the most meaningful loading values in the first five PCs. After that I had 35000 SNPs ready for the Structure analysis, which is quite a lot compared to the number researchers usually use.
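The cropping step can be sketched in Python like this. What counts as a "most meaningful" loading is not defined above, so the score used here, the largest absolute loading over the first five PCs, is my own assumption, as are the function name and the toy loadings.

```python
import heapq


def select_snps_by_loading(loadings, n_keep, n_components=5):
    """Pick the n_keep SNPs with the largest absolute loading on any of
    the first n_components principal components.

    loadings: {snp_id: [loading_pc1, loading_pc2, ...]}
    """
    def score(item):
        _, values = item
        return max(abs(v) for v in values[:n_components])

    best = heapq.nlargest(n_keep, loadings.items(), key=score)
    return [snp_id for snp_id, _ in best]


# Invented loadings with six PCs per SNP; only the first five are scored.
toy_loadings = {
    "rs1": [0.01, -0.02, 0.00, 0.00, 0.00, 0.00],
    "rs2": [0.30, 0.00, 0.00, 0.00, 0.00, 0.00],
    "rs3": [0.00, 0.00, -0.25, 0.00, 0.00, 0.00],
    "rs4": [0.00, 0.00, 0.00, 0.00, 0.00, 0.90],
}
selected = select_snps_by_loading(toy_loadings, n_keep=2)
print(selected)
```

For the real data, n_keep would be 35000 and the loadings would come from the worldwide PCA.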

Running the analysis

There is very little to say about it. I ran all PCAs in R and the structure analysis in Structure 2.3.4. I kept the default values for all run-time parameters.
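For readers without R, the idea behind the first principal component can be shown with a small pure-Python sketch. It is not what R does internally: it uses power iteration on the sample-by-sample matrix of centered genotypes, assumes there are no missing values, and returns only the first component. The function names and the toy genotypes are mine.

```python
import math


def center_columns(rows):
    """Subtract each SNP's mean from its column."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def first_pc_scores(rows, iterations=200):
    """Scores of each sample on the first principal component.

    rows: one list of genotype values per sample, no missing values.
    The sign of the scores is arbitrary.
    """
    x = center_columns(rows)
    n = len(x)
    # Sample-by-sample matrix of inner products of centered genotypes.
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    u = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            return [0.0] * n  # no variation left to describe
        u = [v / norm for v in w]
        eigenvalue = norm  # converges to the leading eigenvalue
    return [math.sqrt(eigenvalue) * v for v in u]


# Invented genotypes: samples 0-1 and samples 2-3 form two groups.
toy_samples = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [2, 2, 1, 0],
    [2, 1, 1, 0],
]
scores = first_pc_scores(toy_samples)
print(scores)
```

In the toy data the two groups end up on opposite sides of zero on the first component, which is the kind of separation the PCA plots show at a much larger scale.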

Receiving new samples

I will not accept new samples before I have completed the initial phase including all planned analyses. I’ll announce later how to take part in my project.

Needing more reference data

I am interested in getting North-West Russians, Estonians and Scandinavians. I would really appreciate it if you could help me get them. I need 10 samples from each ethnic group, but fewer are enough if they are very representative.

The next step

Next I am going to run individual IBS statistics and possibly DIY admix.

Pictures 

European PCA click here 

Eurasian PCA click here 

European K3-7 click here

Eurasian K7 click here