--- authors: - givenNames: - Manjusha familyNames: - Chintalapati type: Person emails: - m_chintalapati@berkeley.edu affiliations: - name: Department of Molecular and Cell Biology, University of California, Berkeley address: addressLocality: Berkeley addressCountry: United States type: PostalAddress type: Organization - givenNames: - Nick familyNames: - Patterson type: Person emails: - nickp@broadinstitute.org affiliations: - name: Broad Institute of Harvard and MIT address: addressLocality: Cambridge addressCountry: United States type: PostalAddress type: Organization - name: Human Evolutionary Biology, Harvard University address: addressLocality: Boston addressCountry: United States type: PostalAddress type: Organization - givenNames: - Priya familyNames: - Moorjani type: Person emails: - moorjani@berkeley.edu affiliations: - name: Department of Molecular and Cell Biology, University of California, Berkeley address: addressLocality: Berkeley addressCountry: United States type: PostalAddress type: Organization - name: Center for Computational Biology, University of California, Berkeley address: addressLocality: Berkeley addressCountry: United States type: PostalAddress type: Organization editors: - givenNames: - George - H familyNames: - Perry type: Person affiliations: - name: United States type: Organization datePublished: value: '2022-05-30' type: Date dateReceived: value: '2022-02-05' type: Date dateAccepted: value: '2022-05-29' type: Date title: >- The spatiotemporal patterns of major human admixture events during the European Holocene description: - >- Recent studies have shown that admixture has been pervasive throughout human history. While several methods exist for dating admixture in contemporary populations, they are not suitable for sparse, low coverage ancient genomic data. Thus, we developed - type: Emphasis content: - DATES (Distribution of Ancestry Tracts of Evolutionary Signals) - ' that leverages ancestry covariance patterns across the genome of a single individual to infer the timing of admixture. ' - type: Emphasis content: - DATES - ' provides reliable estimates under various demographic scenarios and outperforms available methods for ancient DNA applications. Using ' - type: Emphasis content: - DATES - ' on~1100 ancient genomes from sixteen regions in Europe and west Asia, we reconstruct the chronology of the formation of the ancestral populations and the fine-scale details of the spread of Neolithic farming and Steppe pastoralist-related ancestry across Europe. By studying the genetic formation of Anatolian farmers, we infer that gene flow related to Iranian Neolithic farmers occurred before 9600 BCE, predating the advent of agriculture in Anatolia. Contrary to the archaeological evidence, we estimate that early Steppe pastoralist groups (Yamnaya and Afanasievo) were genetically formed more than a millennium before the start of Steppe pastoralism. Our analyses provide new insights on the origins and spread of farming and Indo-European languages, highlighting the power of genomic dating methods to elucidate the legacy of human migrations.' isPartOf: volumeNumber: 11 isPartOf: title: eLife issns: - 2050-084X identifiers: - name: nlm-ta propertyID: https://registry.identifiers.org/registry/nlm-ta value: elife type: PropertyValue - name: publisher-id propertyID: https://registry.identifiers.org/registry/publisher-id value: eLife type: PropertyValue publisher: name: eLife Sciences Publications, Ltd type: Organization type: Periodical type: PublicationVolume licenses: - url: http://creativecommons.org/licenses/by/4.0/ content: - content: - 'This article is distributed under the terms of the ' - content: - Creative Commons Attribution License target: http://creativecommons.org/licenses/by/4.0/ type: Link - >- , which permits unrestricted use and redistribution provided that the original author and source are credited. type: Paragraph type: CreativeWork keywords: - genomic clocks - ancient DNA - admixture - molecular clock - European Holocene - Neolithic - Human identifiers: - name: publisher-id propertyID: https://registry.identifiers.org/registry/publisher-id value: 77625 type: PropertyValue - name: doi propertyID: https://registry.identifiers.org/registry/doi value: 10.7554/eLife.77625 type: PropertyValue - name: elocation-id propertyID: https://registry.identifiers.org/registry/elocation-id value: e77625 type: PropertyValue fundedBy: - identifiers: - value: R35GM142978 type: PropertyValue funders: - name: National Institutes of Health type: Organization type: MonetaryGrant - identifiers: - value: Career Award at the Scientific Interface type: PropertyValue funders: - name: Burroughs Wellcome Fund type: Organization type: MonetaryGrant - identifiers: - value: Sloan Research Fellowship type: PropertyValue funders: - name: Alfred P. Sloan Foundation type: Organization type: MonetaryGrant about: - name: Evolutionary Biology type: DefinedTerm - name: Genetics and Genomics type: DefinedTerm genre: - Research Article bibliography: elife-77625.references.bib --- # Introduction Recent studies have shown that population mixture (or ‘admixture’) is pervasive throughout human history, including mixture between the ancestors of modern humans and archaic hominins (i.e., Neanderthals and Denisovans), as well as in the history of many contemporary human groups such as African Americans, South Asians, and Europeans [@bib50]. Understanding the timing and signatures of admixture offers insights into the historical context in which the mixture occurred and enables the characterization of the evolutionary and functional impact of the gene flow. Many admixed groups are formed due to population movements involving ancient migrations that predate historical records. The recent availability of genomic data for a large number of present-day and ancient genomes provides an unprecedented opportunity to reconstruct population events using genetic data, providing evidence complementary to linguistics and archaeology. To characterize patterns of admixture, genetic methods use the insight that the genome of an admixed individual is a mosaic of chromosomal segments inherited from distinct ancestral populations [@bib11]. Due to recombination, these ancestral segments get shuffled in each generation and become smaller and smaller over time. The length of the segments is inversely proportional to the time elapsed since the mixture [@bib11; @bib40]. Several genetic approaches – ROLLOFF [@bib40; @bib49], ALDER [@bib37], Globetrotter [@bib27], and Tracts [@bib19] – have been developed that use this insight by characterizing patterns of admixture linkage disequilibrium (LD) or haplotype lengths across the genome to infer the timing of mixture. Haplotype-based methods perform chromosome painting or local ancestry inference at each locus in the genome and characterize the distribution of ancestry tract lengths to estimate the time of mixture [@bib19; @bib27]. This requires accurate phasing and inference of local ancestry, which is often difficult when the admixture events are old (as ancestry blocks become smaller over time) or when reference data from ancestral populations is unavailable. Admixture LD-based methods, on the other hand, measure the extent of the allelic correlation across markers to infer the time of admixture [@bib37; @bib40]. They do not require phased data from the target or reference populations and work reliably for dating older admixture events (>100 generations). However, they tend to be less efficient in characterizing admixture events between closely related ancestral groups. While highly accurate for dating admixture events using data from present-day samples, current methods do not work reliably for dating admixture events using ancient genomes. Ancient DNA samples often have high rates of DNA degradation, contamination (from human and other sources), and low sequencing depth, leading to a large proportion of missing variants and uneven coverage across the genome [@bib46]. Additionally, most studies generate pseudo-haploid genotype calls – consisting of a haploid genotype determined by randomly selecting one allele at the variant site – that can lead to some issues in the inference. In such sparse datasets, estimating admixture LD can be noisy and biased (see Simulations below). Moreover, haplotype-based methods require phased data from both admixed and reference populations which remains challenging for ancient DNA specimens [@bib19; @bib27]. An extension of admixture LD-based methods, recently introduced by [@bib41], leverages ancestry covariance patterns that can be measured in a single sample using low coverage data. This approach measures the allelic correlation across neighboring sites, but instead of measuring admixture LD across multiple samples, it integrates data across markers within a single diploid genome. Using a set of ascertained markers that are informative for Neanderthal ancestry (where sub-Saharan Africans are fixed for the ancestral alleles and Neanderthals have a derived allele), [@bib41], inferred the timing of Neanderthal gene flow in Upper Paleolithic Eurasian samples and showed the approach works accurately in ancient DNA samples [@bib41]. However, this approach is inapplicable for dating admixture events within modern human populations, as there are very few fixed differences across populations [@bib3]. Motivated by the single sample statistic in [@bib41], we developed _DATES (Distribution of Ancestry Tracts of Evolutionary Signals_) that measures the ancestry covariance across the genome in a single admixed individual, weighted by the allele frequency difference between two ancestral populations. This method was first introduced in [@bib43], where it was used to infer the date of gene flow between groups related to Ancient Ancestral South Indians, Iranian farmers, and Steppe pastoralists in ancient South and Central Asian populations [@bib43]. In this study, we evaluate the performance of _DATES_ by carrying out extensive simulations for a range of demographic scenarios and comparing the approach to other published genomic dating methods. We then apply _DATES_ to infer the chronology of the genetic formation of the ancestral populations of Europeans and the spatiotemporal patterns of admixture during the European Holocene using data from \~1100 ancient DNA specimens spanning \~8000–350 BCE. # Results ## Overview of _DATES_: model and simulations _DATES_ estimates the time of admixture by measuring the weighted ancestry covariance across the genome using data from a single diploid genome and two reference populations (representing the ancestral source populations). _DATES_ works like haplotype-based methods as it is applicable to a single genome and not like admixture LD-based methods, which by definition require multiple genomes to be co-analyzed; but unlike haplotype-based methods, it is more flexible as it does not require local ancestry inference. There are three main steps in _DATES_: we start by first learning the genome-wide ancestry proportions by performing a simple regression analysis to model the observed genotypes in an admixed individual as a linear mix of allele frequencies from two reference populations. For each marker, we then compute the likelihood of the observed genotype in the admixed individual using the estimated ancestry proportions and allele frequencies in each reference population (this is similar in spirit to local ancestry inference). This information is, in turn, used to compute the joint likelihood of shared ancestry at two neighboring markers, accounting for the probability of recombination between the two markers. Finally, we compute the covariance across pairs of markers located at a particular genetic distance, weighted by the allele frequency differences in the reference populations (Materials and methods). Following [@bib41], we bin the markers that occur at a similar genetic distance across the genome, rather than estimating admixture LD for each pair of markers, and compute the covariance across increasing genetic distance between markers. The estimated covariance is expected to decay exponentially with genetic distance, and the rate of decay is informative of the time of the mixture [@bib11; @bib40]. Assuming the gene flow occurred instantaneously, we can then infer the average date of gene flow by fitting an exponential distribution to the decay pattern (Materials and methods). In cases where data for multiple individuals is available, we compute the likelihood by summing over all individuals. To make _DATES_ computationally tractable, we implement the fast Fourier transform (FFT) for calculating ancestry covariance as described in ALDER [@bib37]. This provides a speedup from ${\rm O}\left({n}^{2}\right)$ to ${\rm O}\left(n\mathrm{log}n\right)$ that reduces the typical runtimes from hours to seconds with minimal loss in accuracy ([Appendix 1—figure 2](https://elifesciences.org/articles/77625#app1fig2)). To assess the reliability of _DATES_, we performed simulations where we constructed 10 admixed diploid genomes by randomly sampling haplotypes from two source populations (Materials and methods). Briefly, we simulated individual genomes with 20% European and 80% African ancestry by using phased haplotypes of northern Europeans (Utah European Americans, CEU) and west Africans (Yoruba from Nigeria, YRI) from the 1000 Genomes Project, respectively [@bib3]. As reference populations in _DATES_, we used closely related surrogate populations of French and Yoruba respectively, from the Human Genome Diversity Panel (HGDP) [@bib35]. We first investigated the accuracy of _DATES_ by varying the time of admixture between 10 and 300 generations. For comparison, we also applied ALDER [@bib37] to these simulations. Both methods reliably recovered the time of admixture up to 200 generations or \~5600 years ago, assuming a generation time of 28 years [@bib41], though _DATES_ was more precise than ALDER for older admixture events (>100 generations) ([Figure 1—figure supplement 1](#fig1s1), [Appendix 1—table 4](https://elifesciences.org/articles/77625#app1table4)). Further, _DATES_ shows accurate results even for single target samples ([Figure 1A](#fig1), [Figure 1—figure supplement 2A](#fig1s2)) and even when few reference individuals are available for dating ([Figure 1A](#fig1), [Figure 1—figure supplement 2B](#fig1s2)). However, the use of large numbers of reference samples, if available, can improve the inference. In _DATES_, allele frequencies of the reference populations are used for computing the likelihood as well as the weighted pairwise ancestry covariance across the genome (Materials and methods). With large samples, allele frequencies of the reference populations are more reliably computed, which in turn, can improve the precision of inferred dates ([Figure 1A](#fig1), [Figure 1—figure supplement 2B](#fig1s2)). chunk: Figure 1. ::: ### Simulation results. We constructed _n_ admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry using \~380,000 genome-wide SNPs for admixture dates ranging between 10 and 200 generations. To minimize any issues with overfitting, we used French and Yoruba from the Human Genome Diversity Panel as reference populations in _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. We show the true time of admixture (X-axis, in generations) and the estimated time of admixture (±1 SE) (Y-axis, in generations). Standard errors were calculated using a weighted block jackknife approach by removing one chromosome in each run (Materials and methods). (**A**) Effect of sample size: We varied the sample size (**_n_**) of target group between 1 and 10 individuals. (**B**) Effect of data quality: To mimic the features of ancient genomes, we generated _n_=10 target individuals with pseudo-haploid genotypes and missing genotype rate as 10% (orange), 30% (purple), and 60% (green). See [Figure 1—figure supplements 1](#fig1s1)–[!number(9)](#fig1s9) for additional simulations to test the performance of DATES. ```{r fig.width = 10, fig.height = 5, message=FALSE, warning=FALSE} #' @width 10 #' @height 5 library(ggplot2) library(maps) library(ggrepel) library(rnaturalearth) library(rnaturalearthdata) library(rgeos) par(mfrow=c(1,2)) # Panel A data=read.table(file = "data/Figure1_data_panelA", header = T) dd=data[data$n==1,] plot(x=dd$admixture_time,y=dd$d_mean, col="darkorange2",pch=16,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$d_mean+dd$d_se)), main="(A) Variation in sample size",cex.axis=1.2,cex.lab=1.2); grid (); segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean+dd$d_se),col="darkorange2"); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean+dd$d_se),y1 = (dd$d_mean+dd$d_se),col="darkorange2"); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean-dd$d_se),col="darkorange2"); dd=data[data$n==10,] points(x=dd$admixture_time,y=dd$d_mean, col="green3",pch=16,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$d_mean+dd$d_se)), main="(A) Variation in sample size",cex.axis=1.2,cex.lab=1.2); grid (); segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean+dd$d_se),col="green3"); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean+dd$d_se),y1 = (dd$d_mean+dd$d_se),col="green3"); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$d_mean-dd$d_se),y1 = (dd$d_mean-dd$d_se),col="green3"); lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2) legend("topleft",legend = c("n=1","n=20"),col=c("darkorange2","green3"),pch = 16,bg = "white"); #panel B data=read.table(file = "data/Figure1_data_panelB") kol=c("tomato","darkviolet","forestgreen") len1=c("missing proportion=10%","missing proportion=30%","missing proportion=60%") len=c(10,30,60) c=1; dd=data[data$V2==10,] plot(x=dd$V1,y=dd$V3, col="tomato",pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4)), main="(B) Effect of data quality",cex.axis=1.2,cex.lab=1.2); grid (); lines(x=dd$V1,y=dd$V5, col=dd$V8,pch=6,type="p") segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="tomato"); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="tomato"); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="tomato"); c=2; dd=data[data$V2==30,] points(x=dd$V1+2,y=dd$V3, col="darkviolet",pch=16,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4))); grid (); segments(x0 =dd$V1+2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="darkviolet"); segments(x0 =dd$V1+2-2,x1 = dd$V1+2+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="darkviolet"); segments(x0 =dd$V1+2-2,x1 = dd$V1+2+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="darkviolet"); dd=data[data$V2==60,] c=3 points(x=dd$V1+4,y=dd$V3, col="forestgreen",pch=17); grid (); segments(x0 =dd$V1+4,x1 = dd$V1+4,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="forestgreen"); segments(x0 =dd$V1+4-2,x1 = dd$V1+4+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col="forestgreen"); segments(x0 =dd$V1+4-2,x1 = dd$V1+4+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col="forestgreen"); lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2) legend("topleft",legend = len1,col=kol,pch = c(15,16,17),bg = "white"); ``` ::: {#fig1} chunk: Figure 1—figure supplement 1. ::: ### Varying time of admixture up to 300 generations. We simulated data for 10 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry and varied the time of admixture between 10 and 300 generations. The X-axis shows the true time of admixture, and the Y-axis shows the estimated time of admixture (±1 SE) inferred using _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. ```{r fig.width=8, fig.height=6, message=FALSE, warning=FALSE} #' @width 8 #' @height 6 data=read.table(file = "data/Figure1_Supplement1") plot(x=data$V1,y = data$V3,col="deepskyblue",pch=15, xlab = "True admixture time (generations)",ylab = "Estimated admixture time (generations)", main = "Varying time of admixture up to 300 generations",xlim=c(0,300), ylim = c(0,max(data$V3[1:30]))); grid (nx=5,ny=4) segments(x0 =data$V1,x1 = data$V1,y0 =(data$V3+data$V6), y1 = (data$V3-data$V6),col="deepskyblue") segments(x0 =data$V1-2,x1 = data$V1+2,y0 =(data$V3+data$V6), y1 = (data$V3+data$V6),col="deepskyblue") segments(x0 =data$V1-2,x1 = data$V1+2,y0 =(data$V3-data$V6), y1 = (data$V3-data$V6),col="deepskyblue") lines(x=data$V1,y=data$V1,col="darkgrey",lty=2) ``` ::: {#fig1s1} chunk: Figure 1—figure supplement 2. ::: ### Impact of sample size of the target (admixed) and reference populations. We simulated $n$ admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry and applied _DATES_ with _m_ reference samples of French and Yoruba ancestry. (**A**) Effect of sample size of target population. Each panel shows the results of simulations with _n_ target individuals shown in the legend and _m_=28 French and _m_=21 Yoruba reference samples from each source group. (**B**) Effect of sample size (**_m_**) of reference populations. Each panel shows the results of simulations with _n_=10 target individuals and _m_ reference samples from each source group shown in the legend. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=10, fig.height=6, message=FALSE, warning=FALSE} #' @width 10 #' @height 6 par(mfrow=c(2,4)) data=read.table(file = "data/Figure1_Supplement2A") kol=c("darkorange2","green3","violetred1","royalblue1","coral","yellow3","cyan2","darkgoldenrod1","darkorchid1","dodgerblue1", "paleturquoise4"); len=c("n=1","n=5","n=10","n=20") arr=c(1,5,10,20) c=1; for(i in arr) { k=which(arr==i); dd=data[data$V2==i,] plot(x=dd$V1,y=dd$V3, col=kol[c],pch=16,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$V3+dd$V4))); grid (); segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2) legend("topleft",legend = len[k],col=kol[c],pch = 16,bg = "white"); c=c+1 } # Panel B - reference sample size data=read.table(file = "data/Figure1_Supplement2B") kol=c("darkorange","green3","cyan2","violetred1","royalblue1") len1=c("reference pop size=1","reference pop size=5","reference pop size=10","reference pop size=20") len=c(1,5,10,20) #par(mfrow=c(3,2)) c=1; for(i in len) { #len=1 k=which(i==len); dd=data[data$V1==i,] plot(x=dd$V2,y=dd$V3, col=kol[c],pch=17,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4))); grid (); segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); lines(x=dd$V2,y=dd$V2,col="darkgrey",lty=2) legend("topleft",legend = len1[k],col=kol[c],pch = 17,bg = "white"); c=c+1; } title("(A) Effect of sample size of target population", line = -1.5, outer = TRUE,cex=1.3) title("(B) Effect of sample size of reference populations", line = -25, outer = TRUE,cex=1.3) ``` ::: {#fig1s2} chunk: Figure 1—figure supplement 3. ::: ### Impact of admixture proportion. We simulated data for 10 admixed individuals with European (CEU) ancestry (_α_) in the range of 1–40% (the rest derived from Africans). We ran _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_ to infer the time of admixture and ancestry proportion. (**A**) Impact on the estimated time of admixture: Each panel shows the estimated date of admixture for a different value of $\alpha$ shown in the legend. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. (**B**) Impact on estimated ancestry proportion: Each panel shows the estimated proportion of admixture for a different value of $\alpha$ shown in the legend. The red dashed horizontal line further indicates the value of $\alpha$ used. The true time of admixture is shown on X-axis with the inferred proportion of admixture on Y-axis. ```{r fig.width=10, fig.height=6, message=FALSE, warning=FALSE} #' @width 10 #' @height 6 par(mfrow=c(2,4)) data=read.table(file = "data/Figure1_Supplement3A") len=c("α=0.01","α=0.05","α=0.2","α=0.4") arr=c(0.01,0.05,2,4) kol=c("green","darkorchid","yellowgreen","blue","deeppink","red2","purple") #print (arr) c=1 for(i in arr) { k=which(arr==i); dd=data[data$V2==i,] plot(x=dd$V1,y=dd$V3, col=kol[c],pch=18,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(0,max(dd$V3+dd$V4)),cex=1.5); grid (); #main="Admixture time for varying θ" segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2) legend("topleft",legend = len[k],col=kol[c],pch = 18,bg = "white"); c=c+1; } title("(A) Impact of admixture proportion on the estimated time of admixture", line = -1.5, outer = TRUE,cex=1.3) # panel B- proportions of admixture data=read.table(file = "data/Figure1_Supplement3B") len=c("α=0.01","α=0.05","α=0.1","α=0.2","α=0.4") c=1; for(i in arr) { #i=5 if(i<1) {rr=i} else {rr=i/10} k=which(arr==i); dd=data[data$V2==i,] lim=max(dd$V6)+rr/4 plot(x=dd$V1,y=dd$V6, col=kol[c],pch=19,xlab = "True admixture time (gen)", ylab = "Estimated Theta (θ)",las=1,ylim = c(0,lim)); grid (); #main="Admixture prorportion inference" segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V6+dd$V10),y1 = (dd$V6-dd$V10),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V6+dd$V10),y1 = (dd$V6+dd$V10),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V6-dd$V10),y1 = (dd$V6-dd$V10),col=kol[c]); lines(x=dd$V1,y=rep(rr,NROW(dd)),col="red",lty=2) legend("bottomright",legend = len[k],col=kol[c],pch = 19,bg = "white"); c=c+1; } title("(B) Impact of admixture proportion on estimated ancestry proportion", line = -25, outer = TRUE,cex=1.3) ``` ::: {#fig1s3} chunk: Figure 1—figure supplement 4. ::: ### Impact of divergence between the ancestral population and reference populations used in _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. We simulated 10 admixed individuals with 80% European (CEU) and 20% African (YRI) ancestry. We applied _DATES_ to infer the timing of admixture using reference populations. In each panel, we show the estimated dates of admixture using French and a group that is increasingly divergent from Yoruba (shown in the legend as the _F~ST~_ with Yoruba). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=9, fig.height=9, message=FALSE, warning=FALSE} #' @width 9 #' @height 9 par(mfrow=c(2,2)) data=read.table(file = "data/Figure1_Supplement4") kol=c("darkorange","green3","cyan","violetred1") len1=c("Fst(Yoruba-Yoruba)=0.000","Fst(Yoruba-BantuKenya)=0.009", "Fst(Yoruba-San)=0.103") arr=c("Yoruba","BantuKenya","San") c=1; for(i in arr) { dd=data[data$V2==i,] plot(x=dd$V3,y=dd$V4, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V4-dd$V5),220)); grid (); segments(x0 =dd$V3,x1 = dd$V3,y0 =(dd$V4+dd$V5),y1 = (dd$V4-dd$V5),col=kol[c]); segments(x0 =dd$V3-2,x1 = dd$V3+2,y0 =(dd$V4+dd$V5),y1 = (dd$V4+dd$V5),col=kol[c]); segments(x0 =dd$V3-2,x1 = dd$V3+2,y0 =(dd$V4-dd$V5),y1 = (dd$V4-dd$V5),col=kol[c]); lines(x=dd$V3,y=dd$V3,col="darkgrey",lty=2) legend("topleft",legend = len1[c],col=kol[c],pch = 15,bg = "white"); c=c+1; } title("Impact of divergence between the ancestral population and reference populations used in DATES", line = -1.5, outer = TRUE,cex=1.3) ``` ::: {#fig1s4} chunk: Figure 1—figure supplement 5. ::: ### Impact of divergence between the two source populations. We simulated _n_ admixed individuals with 20% European (CEU) and 80% ancestry from a range of populations with increasing relatedness to Europeans (shown in the legend as the _F~ST~_ to Europeans). Specifically, the other reference population we used was either West Africans (YRI), East Asians (CHB), South Americans (MXL) or Southern Europeans (TSI). We used the following reference populations for the inference: French (for all simulations) with one of the other references as either Yoruba, Tujia, Maya, or Italian, respectively. We show results for varying target sample sizes of (**A**) _n_=10 and (**B**) _n_=1. The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. We note the inferred dates for CEU/TSI mixtures were not significant for older timescales and hence not shown. ```{r fig.width=12, fig.height=6, message=FALSE, warning=FALSE} #' @width 12 #' @height 6 par(mfrow=c(2,4)) data=read.table(file = "data/Figure1_Supplement5A") len1=c("CEU/YRI", "CEU/CHB","CEU/MXL","CEU/TSI") len2=c("French/Yoruba(0.154)", "French/Tujia(0.110)", "French/Maya(0.037)","French/Italian(0.004)") len=c("YRI","CHB","MXL","TSI") for(i in len) { # i="ITU" dd=data[data$V1==i,] k=which(i==len); plot(x=dd$V2,y=dd$V3, col="navy",pch=19,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)",ylim = c(0,250), main=paste("True source populations:",len1[k],sep=" ")); grid (); segments(x0 =dd$V2 ,x1 = dd$V2, y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="navy"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 =(dd$V3+dd$V4),col="navy"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 =(dd$V3-dd$V4),col="navy"); lines(x=dd$V2,y=dd$V2,col="darkgrey") kol=c("white",dd$V7[1]) legend("topleft",legend = paste(" References",len2[k],sep="="), col="navy",pch = c(19),bg = "white"); } title("(A) Impact of divergence between the two source populations (n=10)", line = -1, outer = TRUE,cex=1.3) # panel B- for target n=1 data=read.table(file = "data/Figure1_Supplement5B") for(i in len) { dd=data[data$V1==i,] k=which(i==len); plot(x=dd$V2,y=dd$V3, col="red",pch=16,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)",ylim = c(0,250), main=paste("True source populations:",len1[k],sep=" ")); grid (); segments(x0 =dd$V2 ,x1 = dd$V2, y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col="red"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3+dd$V4),y1 =(dd$V3+dd$V4),col="red"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V3-dd$V4),y1 =(dd$V3-dd$V4),col="red"); lines(x=dd$V2,y=dd$V2,col="darkgrey") kol=c("white",dd$V8[1]) legend("topleft",legend = paste(" References",len2[k],sep="="), col="red",pch = c(16),bg = "white"); } title("(B) Impact of divergence between the two source populations (n=1)", line = -23, outer = TRUE,cex=1.3) ``` ::: {#fig1s5} chunk: Figure 1—figure supplement 6. ::: ### Impact of using the admixed individuals themselves as one of the reference groups in _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. We simulated data for 10 admixed individuals with European (CEU) ancestry ($\alpha$) in the range of 20–80% (the rest derived from Africans \[YRI\]) using CEU and YRI as reference populations. Using a non-overlapping set of CEU and YRI individuals, we generated 10 additional individuals that we used as reference samples in _DATES_. For each simulation, we ran _DATES_ with Europeans (French) and a non-overlapping set of simulated admixed individuals as the reference populations (shown in blue), or Yoruba and simulated admixed individuals (shown in orange). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=11, fig.height=11, message=FALSE, warning=FALSE} #' @width 11 #' @height 11 par(mfrow=c(2,2)) data=read.table(file = "data/Figure1_Supplement6") len=c(2,4,6,8) a=c("A","B","C","D") par(mfrow=c(2,2)) for(i in len) { dd=data[data$V1==i,] k=which(i==len) rr=paste("(",paste(a[k],paste("Admixed pop (as target and reference):",i*10,"%CEU+",(10-i)*10,"%YRI"),sep=") "),sep="") plot(x=dd$V2,y=dd$V4, col="deepskyblue",pch=15,xlab = "True admixture time (gen)", ylab = "Estimated admixture time (gen)", ylim = c(0,max(data$V4+data$V5)),main=rr); grid () lines(x=dd$V2,y=dd$V8, col="orange2",pch=19,type="p") segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V4+dd$V5),y1 = (dd$V4-dd$V5),col="deepskyblue"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V4+dd$V5),y1 = (dd$V4+dd$V5),col="deepskyblue"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V4-dd$V5),y1 = (dd$V4-dd$V5),col="deepskyblue"); segments(x0 =dd$V2,x1 = dd$V2,y0 =(dd$V8+dd$V9),y1 = (dd$V8-dd$V9),col="orange2"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V8+dd$V9),y1 = (dd$V8+dd$V9),col="orange2"); segments(x0 =dd$V2-2,x1 = dd$V2+2,y0 =(dd$V8-dd$V9),y1 = (dd$V8-dd$V9),col="orange2"); lines(x=dd$V2,y=dd$V2,col="darkgrey",lty=2) legend("topleft",legend = c("Refpops:French and Admixed","Refpops:Yoruba and Admixed"),text.col = c("deepskyblue","orange2"),pch = c(15,19), col = c("deepskyblue","orange2"),bg = "white"); } ``` ::: {#fig1s6} chunk: Figure 1—figure supplement 7. ::: ### Impact of sample size and data quality of target samples. We simulated data for _n_ admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry. In each panel, we varied three key features of the data from the target population, notably sample size (_n_=1 or 10), type of genotypes (diploid or pseudo-haploid) and missing genotype rate (between 10% and 60%). (**A**) Diploid genotypes with missing data for _n_=10 admixed individuals. Each panel shows the results of _x_% of missing diploid genotypes (shown in the legend). (**B**) Pseudo-haploid genotypes with missing data for _n_=10 admixed individuals. Each panel shows the results of _x_% of missing pseudo-haploid genotypes (shown in the legend). (**C**) Pseudo-haploid genotypes with missing data for _n_=1 admixed individuals. Each panel shows the results of _x_% of missing pseudo-haploid genotypes (shown in the legend). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=12, fig.height=9, message=FALSE, warning=FALSE} #' @width 12 #' @height 9 par(mfrow=c(3,4)) data=read.table(file = "data/Figure1_Supplement7A") kol=c("darkorange","green3","cyan","violetred1","royalblue1","orangered1","darkorchid1", "yellowgreen","slateblue2","tomato","palevioletred2","red","orange") len1=c("missing prop=10%","missing prop=20%","missing prop=40%","missing prop=60%") len=c(10,20,40,60) c=1; for(i in len) { dd=data[data$V2==i,] k=which(i==len); plot(x=dd$V1,y=dd$V3, col=kol[k],pch=17,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(0,220)); grid (); lines(x=dd$V1,y=dd$V5, col=dd$V8,pch=6,type="p") segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[k]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[k]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[k]); lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2) legend("topleft",legend = len1[k],col=kol[k],pch = c(17,15),bg = "white"); c=c+1; } title("(A) Diploid genotypes with missing data for n=10 admixed individuals", line = -1.5, outer = TRUE,cex=1.3) data=read.table(file = "data/Figure1_Supplement7B") data1=read.table(file = "data/Figure1_Supplement7C") kol=c("darkorange","green3","cyan","violetred1","royalblue1","orangered1","darkorchid1","yellowgreen","slateblue2","tomato","palevioletred2") c=1; for(i in len) { dd=data[data$V2==i,] k=which(i==len); plot(x=dd$V1,y=dd$V3, col=kol[c],pch=16,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(min(dd$V3-dd$V4),max(dd$V3+dd$V4,250))); grid (); segments(x0 =dd$V1,x1 = dd$V1,y0 =(dd$V3+dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3+dd$V4),y1 = (dd$V3+dd$V4),col=kol[c]); segments(x0 =dd$V1-2,x1 = dd$V1+2,y0 =(dd$V3-dd$V4),y1 = (dd$V3-dd$V4),col=kol[c]); lines(x=dd$V1,y=dd$V1,col="darkgrey",lty=2) legend("topleft",legend = c(len1[k]),col=kol[c],pch = 16,bg = "white"); c=c+1; } title("(B) Pseudo-haploid genotypes with missing data for n=10 admixed individuals", line = -25, outer = TRUE,cex=1.3) c=1; for(i in len) { dd1=data1[data1$V2==i,] k=which(i==len); plot(x=dd1$V1,y=dd1$V3, col=kol[c],pch=22,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", ylim = c(min(dd1$V3-dd1$V4),max(dd1$V3+dd1$V4,250))); grid (); segments(x0 =dd1$V1,x1 = dd1$V1,y0 =(dd1$V3+dd1$V4),y1 = (dd1$V3-dd1$V4),col=kol[c],lty=2); segments(x0 =dd1$V1-2,x1 = dd1$V1+2,y0 =(dd1$V3+dd1$V4),y1 = (dd1$V3+dd1$V4),col=kol[c],lty=2); segments(x0 =dd1$V1-2,x1 = dd1$V1+2,y0 =(dd1$V3-dd1$V4),y1 = (dd1$V3-dd1$V4),col=kol[c],lty=2); lines(x=dd1$V1,y=dd1$V1,col="darkgrey",lty=2) legend("topleft",legend = c(len1[k]),col=kol[c],pch = 16,bg = "white"); c=c+1; } title("(C) Pseudo-haploid genotypes with missing data for n=1 admixed individuals", line = -47, outer = TRUE,cex=1.3) ``` ::: {#fig1s7} chunk: Figure 1—figure supplement 8. ::: ### Impact of data quality of target and reference populations as a function of divergence between true and reference populations used in _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. We simulated data for _n_=10 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry with pseudo-haploid genotypes. The reference populations used also had pseudo-haploid genotypes. We further varied three key features of the data, missing genotype rate in reference populations, missing genotype rate in target populations, and divergence between true source populations and reference population used for the analysis. In each row, we show the admixture dates using reference populations with increasing divergence to true source population (_F~ST~_ shown in the row title). In each column, we varied the missing genotype rate in the target population (shown in the column title). Further, each panel shows results of missing data in the reference genomes (shown in the legend). (**a**) Reference populations of French and Yoruba (_F~ST~_(true, reference)~0). (**b**) Reference populations of French and Bantu Kenya (_F~ST~_(true, reference)~0.009). (**c**) Reference populations of French and San (_F~ST~_(true, reference)~0.103). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=10, fig.height=10, message=FALSE, warning=FALSE} #' @width 10 #' @height 10 par(mfrow=c(3,3)) data=read.table(file = "data/Figure1_Supplement8",header=T) kol=c("darkorange","darkorchid1","cyan","deeppink") ref_used=c("FY","FB","FS") len1=c("Reference missing prop=0%","Reference missing prop=20%","Reference missing prop=40%") len=c(20,40,60) rlen=c(20,40) for(w in ref_used) { sub=data[data$references==w,] for(r in len) { c=1; rd=sub[sub$missing_target==r,] w=which(r==len); dd=rd[rd$missing_ref==0,] plot(x=dd$admixture_time,y=dd$dates_n10, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)",main=paste("Missing proportion in the target: ",paste(len[w],"%",sep=""),sep = ""),xlim=c(0,210),ylim = c(0,250)); grid (); segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10+dd$dates_n10_se),col=kol[c]); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n10-dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]); lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2) c=c+1; err=3 for(i in rlen) { dd=rd[rd$missing_ref==i,] k=which(i==len); points(x=dd$admixture_time+err,y=dd$dates_n10, col=kol[c],pch=15); grid (); segments(x0 =dd$admixture_time+err,x1 = dd$admixture_time+err,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]); segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n10+dd$dates_n10_se),y1 = (dd$dates_n10+dd$dates_n10_se),col=kol[c]); segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n10-dd$dates_n10_se),y1 = (dd$dates_n10-dd$dates_n10_se),col=kol[c]); lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2) c=c+1; err=err+5 } legend("topleft",legend = c(len1),col=kol[1:4],pch = 15,bg = "white"); } } title("(a) Reference populations of French and Yoruba (FST(true, reference) ~ 0) for target population n=10 ", line = -1, outer = TRUE,cex=1.3) title("(b) Reference populations of French and Bantu Kenya ( FST(true, reference) ~ 0.009) for target population n=10", line = -26, outer = TRUE,cex=1.3) title("(c) Reference populations of French and San ( FST(true, reference) ~ 0.103) for target population n=10", line = -51, outer = TRUE,cex=1.3) ``` ::: {#fig1s8} chunk: Figure 1—figure supplement 9. ::: ### Impact of small sample size and data quality of target and reference populations as a function of divergence between true and reference populations used in _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_. We simulated data for _n_=1 admixed individuals with 20% European (CEU) and 80% African (YRI) ancestry with pseudo-haploid genotypes. The reference populations used also had pseudo-haploid genotypes. We further varied three key features of the data, missing genotype rate in reference populations, missing genotype rate in target populations, and divergence between true source populations and reference population used for the analysis. In each row, we show the admixture dates using reference populations with increasing divergence to true source population (_F~ST~_ shown in the row title). In each column, we varied the missing genotype rate in the target population (shown in the column title). Further, each panel shows results of missing data in the reference genomes (shown in the legend). (**a**) Reference populations of French and Yoruba (_F~ST~_(true, reference)~0). (**b**) Reference populations of French and Bantu Kenya (_F~ST~_(true, reference)~0.009). (**c**) Reference populations of French and San (_F~ST~_(true, reference)~0.103). The true admixture time is shown on X-axis, and the estimated time of admixture (±1 SE) is shown on Y-axis. ```{r fig.width=10, fig.height=10, message=FALSE, warning=FALSE} #' @width 10 #' @height 10 par(mfrow=c(3,3)) data=read.table(file = "data/Figure1_Supplement9",header=T) kol=c("darkorange","darkorchid1","cyan","deeppink") ref_used=c("FY","FB","FS") len1=c("Reference missing prop=0%","Reference missing prop=20%","Reference missing prop=40%") len=c(20,40,60) rlen=c(20,40) for(w in ref_used) { sub=data[data$references==w,] for(r in len) { c=1; rd=sub[sub$missing_target==r,] w=which(r==len); dd=rd[rd$missing_ref==0,] plot(x=dd$admixture_time,y=dd$dates_n1, col=kol[c],pch=15,xlab = "True admixture time (gen)",ylab = "Estimated admixture time (gen)", main=paste("Missing proportion in the target: ",paste(len[w],"%",sep=""),sep = ""),xlim=c(0,210), ylim = c(0,250)); grid (); segments(x0 =dd$admixture_time,x1 = dd$admixture_time,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1+dd$dates_n1_se),col=kol[c]); segments(x0 =dd$admixture_time-2,x1 = dd$admixture_time+2,y0 =(dd$dates_n1-dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]); lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2) c=c+1; err=3 for(i in rlen) { dd=rd[rd$missing_ref==i,] k=which(i==len); points(x=dd$admixture_time+err,y=dd$dates_n1, col=kol[c],pch=15); grid (); segments(x0 =dd$admixture_time+err,x1 = dd$admixture_time+err,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]); segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n1+dd$dates_n1_se),y1 = (dd$dates_n1+dd$dates_n1_se),col=kol[c]); segments(x0 =dd$admixture_time+err-2,x1 = dd$admixture_time+err+2,y0 =(dd$dates_n1-dd$dates_n1_se),y1 = (dd$dates_n1-dd$dates_n1_se),col=kol[c]); lines(x=dd$admixture_time,y=dd$admixture_time,col="darkgrey",lty=2) c=c+1; err=err+5 } legend("topleft",legend = c(len1),col=kol[1:4],pch = 15,bg = "white"); } } title("(a) Reference populations of French and Yoruba (FST(true, reference) ~ 0) for target population n=1 ", line = -1, outer = TRUE,cex=1.3) title("(b) Reference populations of French and Bantu Kenya ( FST(true, reference) ~ 0.009) for target population n=1 ", line = -26, outer = TRUE,cex=1.3) title("(c) Reference populations of French and San ( FST(true, reference) ~ 0.103) for target population n=1 ", line = -51, outer = TRUE,cex=1.3) ``` ::: {#fig1s9} Next, we tested _DATES_ for features such as varying admixture proportions and use of surrogate populations as reference groups. By varying of European ancestry proportion between \~1% and 50% (the rest derived from west Africans), we observed _DATES_ accurately estimated the timing in all cases ([Figure 1—figure supplement 3A](#fig1s3)). However, the inferred admixture proportion was overestimated for lower admixture proportions (<10%) ([Figure 1—figure supplement 3B](#fig1s3)). Thus, we caution against using _DATES_ for estimating ancestry proportions. _DATES_ works reliably for dating admixtures between related groups such as Europeans and Mexicans (_F~ST~_~ 0.03), though it was unable to distinguish mixtures of Southern and Northern Europeans (_F~ST~_< 0.005) ([Figure 1—figure supplement 5](#fig1s5)). We found _DATES_ is robust to the use of highly divergent surrogates as reference populations. For example, the use of Khomani San as the reference population instead of the true ancestral population of Yoruba (_F~ST~_ ~ 0.1) provides unbiased dates of admixture ([Figure 1—figure supplement 4](#fig1s4)). In this regard, for ancient DNA where sometimes only sparse data is available, one can also use present-day samples as reference populations to increase the quality and sample size of the ancestral groups. In principle, as long as the allele frequencies in the reference samples are correlated to the ancestral allele frequencies, the inference of admixture dates should remain unbiased (Materials and methods). In practice, however, recent demographic events (e.g., strong founder events or admixture from additional sources, etc.) in the history of the present-day samples could lead to significant deviation from the ancestral allele frequencies. Thus, the reference populations should be carefully chosen. Another idea is to use the admixed populations themselves as one of the reference populations as demonstrated by the single reference setup in ALDER [@bib37]. Admixed individuals have intermediate allele frequencies to the ancestral populations and thus weighted LD or ancestry covariance can be computed with only one reference population (albeit, with reduced power). [@bib37], showed that the use of admixed populations as one of the references does not bias the rate of decay of the weighted LD (i.e., time of admixture), though the amplitude of the decay curve (not used in _DATES_) can be biased under some scenarios. To verify _DATES_ provides reliable results under this setup, we applied _DATES_ with a single reference population and used the admixed population as the other reference. Like ALDER, our inferred dates of admixture were accurate and comparable to using two reference populations. ([Figure 1—figure supplement 6](#fig1s6)). An important feature of _DATES_ is that it does not require phased data and is applicable to datasets with small sample sizes, making it in principle useful for ancient DNA applications. To test the reliability of _DATES_ for ancient genomes, we simulated data mimicking the relevant features of ancient genomes, namely small sample sizes (_n_=1–20), large proportions of missing genotypes (between 10% and 60%), and pseudo-haploid genotype calls (instead of diploid genotype calls) in reference and/or target samples. _DATES_ showed reliable results under various setups, even when only a single admixed individual was available ([Figure 1B](#fig1), [Figure 1—figure supplements 1](#fig1s1)–[!number(9)](#fig1s9)). In contrast, admixture LD-based methods require more than one sample and do not work reliably with missing data. For example, ALDER estimates were very unstable for simulations with >40% missing data. For older dates (>100 generations), there was a slight bias even with >10% missing genotypes ([Appendix 1—figure 5](https://elifesciences.org/articles/77625#app1fig5)). This is expected as LD calculations leverage shared patterns across samples, thus variable missingness of genotypes across individuals leads to substantial loss of data leading to unstable and noisy inference. We also generated data for combinations of features including small sample sizes, pseudo-haploid genotypes with large proportions of missing genotypes in both target and reference samples, and use of highly divergent reference samples. We found _DATES_ yielded reliable results with large amounts (~40–60%) of missing data, either in the target or references, even with highly divergent reference populations ([Figure 1—figure supplement 8](#fig1s8)). This was also true when a single target sample was available, though as expected, the inference becomes noisier for older dates and large fractions of missing data ([Figure 1—figure supplement 9](#fig1s9)). The robust performance of _DATES_ in sparse datasets highlights a major advantage for ancient DNA applications. _DATES_ assumes a model of instantaneous gene flow with a single pulse of mixture between two source populations. However, many human populations have a history of multiple pulses of gene flow. To test the performance of _DATES_ for multi-way admixture events, we generated admixed individuals with ancestry from three sources (East Asians, Africans, and Europeans) where the gene flow occurred at two distinct time points ([Appendix 2—figure 1](https://elifesciences.org/articles/77625#app2fig1)). By applying _DATES_ with pairs of reference populations, we observed that _DATES_ recovered both admixture times for target populations that had equal contributions from all three ancestral groups ([Appendix 2—figure 2](https://elifesciences.org/articles/77625#app2fig2)). In the case of unequal admixture proportions from three ancestral groups, _DATES_ inferred the timing of the recent admixture event in most cases. In some cases, however, the inferred dates were intermediate to the two pulses when the ancestry proportion of the recent event was low ([Appendix 2—figure 3](https://elifesciences.org/articles/77625#app2fig3)). This confounding could be eliminated if the reference populations were set up to match the model of gene flow. For example, the inferred times of admixture were accurate if the two references used in _DATES_ were: reference 1: the source population for the recent event and reference 2: pooled individuals from both ancestral populations that contributed to the first admixture event, or the intermediate admixed group formed after the first event ([Appendix 2—table 1](https://elifesciences.org/articles/77625#app2table1)). This highlights how the choice of reference populations can help to tune the method to infer the timing of specific admixture events more reliably. Finally, we explored the impact of more complex demographic events, including continuous admixture and founder events using coalescent simulations (Appendix 2). In the case of continuous admixture, _DATES_ inferred an intermediate timing between the start and the end of the gene flow period, similar to other methods like ALDER and Globetrotter ([@bib27]; [@bib37]; [Appendix 2—table 2](https://elifesciences.org/articles/77625#app2table2)). In the case of populations with founder events, we inferred unbiased dates of admixture in most cases except when the founder event was extreme (_N~e~_ ~ 10) or the population had maintained a low population size (_N~e~_ < 100) until the present (i.e., no recovery bottleneck) ([Appendix 2—figure 4](https://elifesciences.org/articles/77625#app2fig4), [Appendix 2—table 3](https://elifesciences.org/articles/77625#app2table3)). In humans, few populations have such extreme founder events, and thus, in most other cases, our inferred admixture dates should be robust to founder events [@bib55]. We note that while _DATES_ is not a formal test of admixture, in simulations, we find that in the absence of gene flow, the method does not infer significant dates of admixture even if the target has a complex demographic history ([Appendix 2—figure 6](https://elifesciences.org/articles/77625#app2fig6), [Appendix 2—figure 7](https://elifesciences.org/articles/77625#app2fig7)). ## Comparison to other methods We assessed the reliability of _DATES_ in real data by comparing our results with published methods: Globetrotter, ALDER, and ROLLOFF. These methods are designed for the analysis of present-day samples that typically have high-quality data with limited missing variants. In addition, Globetrotter uses phased data which is challenging for ancient DNA samples. Thus, instead of rerunning other methods, we took advantage of the published results for contemporary samples presented in [@bib27]. Following [@bib27], we created a merged dataset including individuals from HGDP [@bib35; @bib5; @bib28] (Materials and methods). We applied _DATES_ and ALDER to 29 target groups using the reference populations reported in Table S12 in [@bib27], excluding one group where the population label was unclear. Interestingly, the majority of these groups (25/29) failed ALDER’s formal test of admixture; either because the results of the single reference and two reference analyses yielded inconsistent estimates or because the target had long-range shared LD with one of the reference populations ([Appendix 1—table 5](https://elifesciences.org/articles/77625#app1table5)). Using _DATES_, we inferred significant dates of admixture in 20 groups, and 14 of those were consistent with estimates based on Globetrotter. In the case of the six populations that disagreed across the two methods, most of the populations appear to have a history of multiple pulses of gene flow either involving more than two populations (e.g., Brahui [@bib48]) or multiple instances of contact between the same two reference groups (e.g., Mandenka [@bib51]) or the model of admixture differed (e.g., recent ancient DNA studies suggest present-day Bulgarians have ancestry from western hunter-gatherers \[HGs\], Near Eastern farmers, and Steppe pastoralists from Eurasia [@bib24] but were modeled as a mixture of Polish and Cypriots in Globetrotter). In case of complex admixture scenarios, the inconsistencies across the two methods are hard to interpret as Globetrotter and _DATES_ could be capturing different events or the weighting of both events could differ. Finally, the estimated admixture timing based on _DATES_, ROLLOFF, and ALDER (assuming two-way admixture regardless of the formal test results) were found to be highly concordant ([Appendix 1—table 5](https://elifesciences.org/articles/77625#app1table5)). ## Fine-scale patterns of population mixtures in ancient Europe Recent ancient DNA studies have shown that present-day Europeans derive ancestry from three distinct sources: (a) HG-related ancestry that is closely related to Mesolithic HGs from Europe; (b) Anatolian farmer-related ancestry related to Neolithic farmers from the Near East and associated to the spread of farming to Europe; and (c) Steppe pastoralist-related ancestry that is related to the Yamnaya pastoralists from Russia and Ukraine [@bib1; @bib24; @bib53]. Many open questions remain about the timing and dynamics of these population interactions, in particular related to the formation of the ancestral groups (which were themselves admixed) and their expansion across Europe. To characterize the spatial and temporal patterns of mixtures in Europe in the past 10,000 years, we used 1096 ancient European samples from 152 groups from the publicly available Allen Ancient DNA Resource (AADR) spanning a time range of \~8000–350 BCE (Materials and methods, [Supplementary file 1A](https://elifesciences.org/articles/77625#supp1)). Using _DATES_, we characterized the timing of the various gene flow events, and below, we describe the key events in chronological order focusing on three main periods. ### Holocene to Mesolithic Pre-Neolithic Europe was inhabited by HGs until the arrival of the first farmers from the Near East [@bib23; @bib29]. There was large diversity among HGs with four main groups – western hunter-gatherers (WHGs) that were related to the Villabruna cluster in central Europe, eastern hunter-gatherers (EHGs) from Russia and Ukraine related to the Upper Paleolithic group of Ancestral North Eurasians (ANEs), Caucasus hunter-gatherers (CHGs) from Georgia associated to the first farmers from Iran, and the GoyetQ2-cluster associated to the Magdalenian culture in Spain and Portugal [@bib15; @bib18; @bib30; @bib52; @bib53]. Most Mesolithic HGs fall on two main clines of relatedness: one cline that extends from Scandinavia to central Europe showing variable WHG-EHG ancestry, and the other in southern Europe with WHG-GoyetQ2 ancestry [@bib52]. The latter is already present in the 17,000 BCE _El Mirón_ individual from Spain, suggesting that the GoyetQ2-related gene flow occurred well before the Holocene. However, the WHG-EHG cline was formed more recently during the Mesolithic period, though the precise timing remains less well understood. To characterize the formation of the WHG-EHG cline, we used genomic data from 16 ancient HG groups (_n_=101) with estimated ages of \~7500–3600 BCE. We first verified the ancestry of each HG group using _qpAdm_ that compares the allele frequency correlations between the target and a set of source populations to formally test the model of admixture and then infer the ancestry proportions for the best-fitted model [@bib24]. For each target population, we chose the most parsimonious model, that is, fitting the data with the minimum number of source populations. Consistent with previous studies, our _qpAdm_ analysis showed that most HGs from Scandinavia, the Baltic Sea region, and central Europe could be modeled as a two-way mixture of WHG- and EHG-related ancestry ([Supplementary file 2A](https://elifesciences.org/articles/77625#supp2)). To confirm that the target populations do not harbor Anatolian farmer-related ancestry (that could lead to some confounding in estimated admixture dates), we applied _D_-statistics of the form _D_(Mbuti, _target_, WHG, Anatolian farmers) where target = Mesolithic HGs. We observed that none of the target groups had a stronger affinity to Anatolian farmers than WHG ([Supplementary file 2B](https://elifesciences.org/articles/77625#supp2)). Together, these results suggest that the mixtures we date below reflect pre-Neolithic contacts between the HGs. To infer the timing of the mixtures in the history of Mesolithic European HGs, we applied _DATES_ to HGs from Scandinavia, the Baltic regions, and central Europe using WHG- and EHG-related groups as reference populations. _DATES_ infers the time of admixture in generations before the sample lived. Accounting for the average sampling age of the specimens and the mean human generation time of 28 years [@bib41], we inferred the admixture time in years before present or in BCE (Materials and methods). We report the average dates (or median, where specified) in BCE in the main text and provide additional details in [Figure 2](#fig2) and [Supplementary file 1B](https://elifesciences.org/articles/77625#supp1) including the sample sizes, dates in generations, and BCE for each population. Among HGs, we inferred that the earliest admixture occurred in Scandinavian HGs from Norway and Sweden with a range of average dates of \~80–113 generations before the samples lived ([Figure 2—figure supplement 1](#fig2s1)). This translates to admixture dates of \~10,200–8000 BCE, with the most recent dates inferred in Motala HGs from Sweden suggesting substantial substructure in HGs ([Figure 2](#fig2)). In the Baltic region, we inferred the range of admixture dates of \~8700–6000 BCE in Latvia and Lithuania HGs, postdating the mixture in Scandinavia. In southeast Europe, the Iron Gates region of the Danube Basin shows widespread evidence of mixtures between HG groups and, in the case of some outliers, the mixture of HGs and Anatolian farmer-related ancestry as early as the Mesolithic period [@bib14]. Further, these groups showed a strong affinity to the WHG-related ancestry in Anatolian populations, suggesting ancient interactions with Near Eastern populations [@bib14]. We applied _qpAdm_ to test the model of admixture in Iron Gates HG and found that the parsimonious model with WHG- and EHG-related ancestry provides a good fit to the data. Further, when we tested the model with Anatolian-related ancestry using Anatolian HG (AHG) as an additional source population, the AHG ancestry proportion was not significant ([Supplementary file 2A](https://elifesciences.org/articles/77625#supp2)). Applying _DATES_ to Iron Gates HG with WHG and EHG as reference populations, we inferred this group was genetically formed in \~9200 BCE (95% confidence interval: 10,000–8400 BCE). Our samples of the Iron Gates HGs include a wide range of C14 dates between 8800 and 5700 BCE. We confirmed our dates were robust to the sampling age of the individuals as we obtained statistically consistent dates when all samples were combined as one group or when subsets of samples were grouped in bins of 500 years ([Figure 2—figure supplement 2](#fig2s2)). The most recent dates of \~7500 BCE were inferred in eastern Europe in Ukraine HGs, highlighting how the WHG-EHG cline was formed over a period \~2000–3000 years ([Figure 2](#fig2), [Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). chunk: Figure 2. ::: #### Timeline of admixture events in ancient Europe. We applied _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_ to ancient samples from Europe. In the right panel, we show the sampling locations of the ancient specimens, and in the left panel, we show the admixture dates for each target group listed on the X-axis. The inferred dates in generations were converted to dates in BCE by assuming a mean generation time of 28 years [@bib40] and accounting for the average sampling age (shown as gray dots) of all ancient individuals in the target group (Materials and methods). The top panel shows the formation of western hunter-gatherer (WHG)-eastern hunter-gatherer (EHG) cline (in blue) using Mesolithic hunter-gatherers (HGs) as the target and EHG and WHG as reference populations. The middle panel shows admixture dates of local HGs and Anatolian farmers (in orange) using Neolithic European groups as targets and Anatolian farmers-related groups and WHG-related groups as reference populations. The bottom panel shows the spread of Steppe pastoralist-related ancestry (in green) estimated using middle and late Neolithic, Chalcolithic, and Bronze Age samples from Europe as target populations and early Steppe pastoralist-related groups (Afanasievo and Yamnaya Samara) and a set of Anatolian farmers and WHG-related groups as reference populations. For the middle to late Bronze Age (MLBA) samples from Eurasia, we used the early Steppe pastoralist-related groups and the Neolithic European groups as reference populations. The cultural affiliation (Corded Ware Complex \[CWC\], Bell Beaker complex \[BBC\], or Steppe MLBA cultures) of the individuals is shown in the legend. See [Figure 2—figure supplements 1](#fig2s1) and [!number(2)](#fig2s2) we applied DATESfor decay curves for all samples and stratified datesfor Iron Gates HGs. ```{r fig.width=10, fig.height=10, message=FALSE, warning=FALSE} #' @width 10 #' @height 10 sf::sf_use_s2(FALSE) data=read.table(file = "data/Figure3_data_map",header = T) world <- ne_countries(scale = "medium", returnclass = "sf") Europe <- world[which(world$continent == "Europe"),] # Hunter Gatherer distribution panel dd=data[data$col=="deepskyblue",] ggplot(Europe) + geom_sf() + coord_sf(xlim = c(-15,40), ylim = c(35,70), expand = FALSE) + geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=4,show.legend = F) + scale_fill_manual(values ="deepskyblue") + theme(plot.title = element_text(colour = "black")) #Neolithic Farmer spread dd=data[data$col=="orange2",] ggplot(Europe) + geom_sf() + coord_sf(xlim = c(-15,40), ylim = c(35,70), expand = FALSE) + geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=4,show.legend = F) + theme(plot.title = element_text(colour = "black")) # MLBA bronze age samples dd=rbind(data[data$col=="green3",],data[data$col=="lightpink2",]) ggplot(world) + geom_sf() + coord_sf(xlim = c(-15,90), ylim = c(30,80), expand = FALSE) + geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=dd$pc,cex=3,show.legend = F) + theme(plot.title = element_text(colour = "black")) # Admixture time plots par(mai = c(1,1,0.1,3)) data=read.table(file = "data/Figure3_data_admixturetimes") fossil_range=read.table(file = "data/Figure3_data_fossilages") # Hunter Gatherer mixture panel dd=rbind(data[data$V8=="deepskyblue",]) target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),] group_names=unique(target_fossil$V2) target_fossil$V2 <- as.character(target_fossil$V2) target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950 for(i in group_names) { k=which(dd$V1==i) dd[k,9]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) dd[k,10]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6))) dd[k,11]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) } plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p", col=as.character(dd$V8),pch=20,cex=2, ylim = c(4000,12200), xlim = c(0.5,length(dd$V1)+0.5), xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main=""); grid(nx=8,ny=10); segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7), y1 = (dd$V6-dd$V7),col=as.character(dd$V8),lty = 1) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7), y1 = (dd$V6+dd$V7),col=as.character(dd$V8)) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7), y1 = (dd$V6-dd$V7),col=as.character(dd$V8)) axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1))) name_lab=gsub(".SG","",dd$V1) points(x=seq(1,length(dd$V1),1),y=dd$V11,type = "p",col="grey20",pch=1,cex=1.1) text(1:length(dd$V4), rep(3500, length(dd$V4)), labels= name_lab, col="black", srt=25, xpd=TRUE, adj=1,cex=1) axis(2, seq(4000,12000,1000), lty = 1,col = "black",tck="y",labels = rep('',length(seq(4000,12000,1000)))) text(rep(-0.2,max(4000,12000,500)), seq(4000,12000,1000),labels= seq(4000,12000,1000), col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8) abline(v=3.5,lty=3,col="grey20") abline(v=6.5,lty=3,col="grey20") text(c(1,4,7),rep(12000,5), labels= c("Scandinavian HG","Baltic region HG","Central European HG"), col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=1) legend(8.9,12000,legend = c("DATES in BCE ± 1SE","Average C14 radiocarbon age"),col=c("grey20","grey20"), pch=c(20,1),lty=c(1,-1),xpd=T,bty='n') # Neolithic Farmer mixture par(mai = c(1.4,1,0.1,0.5)) dd=data[data$V8=="orange2",] target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),] group_names=unique(target_fossil$V2) target_fossil$V2 <- as.character(target_fossil$V2) target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950 for(i in group_names) { k=which(dd$V1==i) dd[k,9]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) dd[k,10]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6))) dd[k,11]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) } plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p", col=as.character(dd$V8),pch=20,cex=2, ylim = c(2000,6800), xlim = c(1,length(dd$V1)+0.5), xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main=""); grid(nx=8,ny=10); segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7), y1 = (dd$V6-dd$V7),col=as.character(dd$V8),lty = 1) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7), y1 = (dd$V6+dd$V7),col=as.character(dd$V8)) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7), y1 = (dd$V6-dd$V7),col=as.character(dd$V8)) axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1))) points(x=seq(1,length(dd$V1),1),y=dd$V11,type = "p",col="grey20",pch=1,cex=1) name_lab=gsub("_published","",gsub("_published.DG","",gsub(".SG","",dd$V1))) text(1:length(dd$V4), rep(1800, length(dd$V4)), labels= name_lab, col="black", srt=35, xpd=TRUE, adj=1,cex=0.7, cex.lab=0.7) axis(2, seq(2000,6800,500), lty = 1,col = "black",tck="y",labels = rep('',length(seq(2000,6800,500)))) text(rep(-3.5,max(2000,6800,500)), seq(2000,6800,500),labels= seq(2000,6800,500), col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8) abline(v=2.5,lty=3,col="grey20") abline(v=20.5,lty=3,col="grey20") abline(v=23.5,lty=3,col="grey20") abline(v=28.5,lty=3,col="grey20") abline(v=36.5,lty=3,col="grey20") abline(v=42.5,lty=3,col="grey20") abline(v=47.5,lty=3,col="grey20") abline(v=56.5,lty=3,col="grey20") abline(v=63.5,lty=3,col="grey20") text(c(-1.3,10,20.6,23.8,28.6,31.8,37,43,48,53,57,63.53),rep(6500,5), labels= c("Balkans","Hungary","Czech","Germany","Poland","Ukraine", "France","Italy","Spain","Portugal","Britain countries","Scandinavia"),col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8) text(c(20.6,23.6),rep(6200,5), labels= c("Republic","Austria"), col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8) # Bronze age steppe mixture par(mai = c(2.6,1,0.1,2.8)) dd=rbind(data[data$V8=="green3",],data[data$V8=="lightpink2",]) dd$V8=gsub("springgreen","green3",dd$V8) target_fossil=fossil_range[fossil_range$V2%in%as.character(dd$V1),] group_names=unique(target_fossil$V2) target_fossil$V2 <- as.character(target_fossil$V2) target_fossil$V6 <- as.numeric(as.character(target_fossil$V3))-1950 pch_str=as.numeric(dd$V9) for(i in group_names) { k=which(dd$V1==i) dd[k,10]=min(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) dd[k,11]=max((as.numeric(target_fossil[target_fossil$V2==i,]$V6))) dd[k,12]=mean(as.numeric(target_fossil[target_fossil$V2==i,]$V6)) } plot(x=seq(1,length(dd$V1),1),y=dd$V6,type = "p", col=dd$V8,pch=dd$V9,cex=1.5, ylim = c(500,5000), xlim = c(1,length(dd$V1)+0.5), xaxt='n',yaxt='n',xlab = "",ylab = "Estimated admixture time (BCE)", main=""); grid(nx=8,ny=10); segments(x0 =seq(1,length(dd$V1),1),x1 = seq(1,length(dd$V1),1),y0 =(dd$V6+dd$V7), y1 = (dd$V6-dd$V7),col=dd$V8,lty = 1) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6+dd$V7), y1 = (dd$V6+dd$V7),col=dd$V8) segments(x0 =seq(1,length(dd$V1),1)-0.1,x1 = seq(1,length(dd$V1),1)+0.1,y0 =(dd$V6-dd$V7), y1 = (dd$V6-dd$V7),col=dd$V8) axis(1, 1:length(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', length(dd$V1))) points(x=seq(1,length(dd$V1),1),y=dd$V12,type = "p",col="grey20",pch=1,cex=0.8) name_lab=gsub(".SG","",dd$V1) text(1:length(dd$V4), rep(350, length(dd$V4)), labels= name_lab, col="black", srt=45, xpd=TRUE, adj=1,cex=1) axis(2, seq(500,5000,500), lty = 1,col = "black",tck="y",labels = rep('',length(seq(500,5000,500)))) text(rep(-2,max(500,5000,500)), seq(500,5000,500),labels = c(seq(500,5000,500)), col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=1) abline(v=7.5,lty=3,col="grey20") abline(v=26.5,lty=3,col="grey20") text(c(1.6,15,29),rep(4900,5), labels= c("Late Neolithic","Chacolithic to Bronze Age","Middle to Late Bronze Age"),col="grey30", srt=0, xpd=TRUE, adj=-0.1,cex=0.8) legend(35.5,5000,legend = c("Corded Ware complex","Bell Beaker complex"), col=c(rep("green3",2)),pch=c(15,17),lty=c(1,1),xpd=T,cex=1) ``` ::: {#fig2} chunk: Figure 2—figure supplement 1. ::: #### _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_ ancestry covariance decay curves. We show the weighted ancestry covariance decay curves generated using _DATES_ for all the target groups analyzed in the study. Each subplot shows the decay curve for one target population with the associated reference groups shown in the title. For each target, in the legend, we show the inferred average dates of admixture (±1 SE) in generations before the individual lived, in BCE that accounts for the average age of all the individuals in the target and the mean generation time of human populations (see Materials and methods). We also show the normalized root-mean-square deviation (NRMSD) values for all fitted curves and the plots with NRMSD >0.7 are shown in gray. For consistency, we use the same colors as [Figure 2](#fig2). ```{r fig.width=10, fig.height=10, message=FALSE, warning=FALSE} #' @width 10 #' @height 10 NRMSD <- function(y, yfit,na.rm = TRUE) { # y is the vector of empirical values # yfit is the vector of fitted values if (length(y) != length(yfit)) stop('y and yfit should have the same length') if (na.rm) { isna = is.na(y) | is.na(yfit) y = y[!isna] yfit = yfit[!isna] } nrmsd = sqrt(mean( (yfit - y)**2, na.rm = na.rm )) * (max(yfit, na.rm = na.rm) - min(yfit, na.rm = na.rm))**(-1) return(nrmsd) } #Usage: nmrds=round(NRMSD(data$V2,data$V3),4) admix_dates=read.table(file = "data/Figure3_supplement1_admixture_dates") par(mfrow=c(6,4),oma=c(0,0,3.5,0)) layout(matrix(seq(1,24,1), nrow = 6), heights=c(1,1)) # HG decay curves dates_files=read.table(file = "data/Figure3_supplement1_HG_curves") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files/"; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'.',fixed=TRUE))[1] data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") title_name=gsub("EHG_WHG_","",mm1) bce=admix_dates[admix_dates$Population==title_name,] main_t=paste("Target:",paste(title_name,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:EHG-WHG") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="deepskyblue"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=2) len=paste(paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$DATES_mean.BCE.,bce$DATES_SE.BCE.,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n")) legend("topright",legend = len,col = kol,bty='n',cex=0.7) } ### Farmer formation data=read.table(file = "data/Figure3_supplement1_decay_files/AAF.fit") jout=read.table(file = "data/Figure3_supplement1_decay_files/AAF.jout") plot(x=data$V1,y=data$V2,type="p",col="darkorchid",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", las=1,main = paste("Target: Early Anatolian farmers (n=9)","References: Levant_pooled-Iran_N_pooled",sep="\n") ,cex=1,ylab ="",cex.main=0.8) title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) lines(x=data$V1,y=data$V3,type = "l",col="darkorchid",lty=2) nmrsd=round(NRMSD(data$V2,data$V3),4) len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "), paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+8071,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = "darkorchid",lty=c(2,-1,-1),cex=0.7,bty='n') dates_files=read.table(file = "data/Figure3_supplement1_FF_curves") mt=c("Iran_N-Anatolian_N","Iran_N-Anatolian_N") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1] data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) title_name=gsub("published_merged","Anatolian Farmer",gsub("Anatolia_N_Serbia_Iron_Gates_HG_","",mm1)) bce=admix_dates[admix_dates$Population==title_name,] main_t=paste("Target:",paste(title_name,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n",mt[i]) len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="firebrick"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main =main_t ,cex=1,cex.main=0.8) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1) len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = kol,bty='n',cex=0.7) } ### Farmer spread dates_files=read.table(file = "data/Figure3_supplement1_Neolithic_curves") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1] refm=paste(unlist(strsplit(as.character(mm),'-',fixed=TRUE))[2],"-",unlist(strsplit(as.character(mm),'-',fixed=TRUE))[3]) data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) bce=admix_dates[admix_dates$Population==mm1,] main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:",gsub("Turkey_N","AnatolianN",refm)) len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="orange2"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1) len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = kol,bty='n',cex=0.7) } ### Steppe formation- EBA dates_files=read.table(file = "data/Figure3_supplement1_Steppe_formation") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1] data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) bce=admix_dates[admix_dates$Population==mm1,] main_t=gsub("_pub","",paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Iran_N_pooled-EHG_pooled")) len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="deeppink"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main = main_t,cex=1.2,cex.main=0.8) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1) len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = kol,bty='n',cex=0.7) } ### Steppe formation - MLBA dates_files=read.table(file = "data/Figure3_supplement1_Steppe_MLBA") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files" ; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1] data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) bce=admix_dates[admix_dates$Population==mm1,] main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Steppe EBA-Neolithic groups") len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="lightpink"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.8) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1) len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = kol,bty='n',cex=0.7) } ### Steppe spread dates_files=read.table(file = "data/Figure3_supplement1_Steppe_spread") for (i in 1:nrow(dates_files)) { file=dates_files$V1[i] path="data/Figure3_supplement1_decay_files/" ; var=paste(path,file,sep = "/") mm=unlist(strsplit(as.character(file),'/',fixed=TRUE))[1] mm1=unlist(strsplit(as.character(mm),'-',fixed=TRUE))[1] data=read.table(file = var) jout=gsub("fit", "jout", file); time=paste(path,jout,sep = "/") estimate=read.table(file = time) name=gsub("estimate_","",gsub(".fit", "", file)) title_name=gsub("Afanasievo_Anatolia_N_","",mm1) bce=admix_dates[admix_dates$Population==title_name,] main_t=paste("Target:",paste(mm1,paste(paste("(n=",bce$n,sep=""),")",sep=""),sep=" "),"\n","References:Steppe groups-(WHG+AnatoliaN)") len=paste("Estimate:",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "+/-"),sep = " ") nmrsd=round(NRMSD(data$V2,data$V3),4) if(nmrsd>0.7) {kol="grey20"} else {kol="green3"}; plot(x=data$V1,y=data$V2,type="p",col=kol,pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", ylab = "Ancestry covariance",las=1,main = main_t,cex=1,cex.main=0.7) lines(x=data$V1,y=data$V3,type = "l",col=kol,lty=1) len=paste(paste(paste("DATES estimate (gen)",paste(round(estimate$V2,0),round(estimate$V5,0),sep = "±"),sep = ":"), paste("DATES estimate (BCE)",paste(bce$V9,bce$V10,sep=" ± "),sep=":"),sep = "\n"), paste("NRMSD",nmrsd,sep="="),sep = "\n") legend("topright",legend = len,col = kol,bty='n',cex=0.8) } ``` ::: {#fig2s1} chunk: Figure 2—figure supplement 2. ::: #### Timing of western hunter-gatherer (WHG) and eastern hunter-gatherer (EHG) admixture in Iron Gates hunter-gatherer (HG) samples. The time of admixture in Iron Gates HG samples grouped in bins of C14 age of 500 years. The C14 age is shown on X-axis and the admixture time in BCE for corresponding samples is shown on the Y-axis. ```{r fig.width=9, fig.height=6, message=FALSE, warning=FALSE} #' @width 9 #' @height 6 dd=read.table(file = "data/Figure3_supplement2") plot(seq(1,NROW(dd$V1),1),dd$V8,col=as.character(dd$V10),pch=15, ylim =c(3000,19000),las=1, xaxt='n',xlab="",ylab="Admixture time in years BCE", main = "Admixture dates in Iron_Gates samples grouped by c14 age in bins of 500 years"); grid() segments(x0 =seq(1,NROW(dd$V1),1),x1 = seq(1,NROW(dd$V1),1),y0 =(dd$V8+2*dd$V9), y1 = (dd$V8-2*dd$V9),col=as.character(dd$V10),lty = 1) segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V8+2*dd$V9), y1 = (dd$V8+2*dd$V9),col=as.character(dd$V10)) segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V8-2*dd$V9), y1 = (dd$V8-2*dd$V9),col=as.character(dd$V10)) points(seq(1,NROW(dd$V1),1),dd$V7) segments(x0 =seq(1,NROW(dd$V1),1),x1 = seq(1,NROW(dd$V1),1),y0 =(dd$V2-1950), y1 = (dd$V3-1950)) segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V2-1950), y1 = (dd$V2-1950)) segments(x0 =seq(1,NROW(dd$V1),1)-0.1,x1 = seq(1,NROW(dd$V1),1)+0.1,y0 =(dd$V3-1950), y1 = (dd$V3-1950)) axis(1, 1:NROW(dd$V1), lty = 1,col = "black",tck="y",labels = rep('', NROW(dd$V1))) text(1:NROW(dd$V1), rep(2500, NROW(dd$V4)), labels= gsub("IronGates-","",dd$V1), col="black", srt=25, xpd=TRUE, adj=1,cex=1) legend("topright",legend = c("WHG-EHG admixture per c14 bin","WHG-EHG admixture in all samples","average c14 ages"), col = c("blue","cyan2","grey40"),lty=c(1,1,1),pch=c(15,15,1)) ``` ::: {#fig2s2} ## Early to middle Neolithic Neolithic farming began in the Near East – the Levant, Anatolia, and Iran – and spread to Europe and other parts of the world [@bib23; @bib34; @bib53]. The first farmers of Europe were related to Anatolian farmers, whose origin remains unclear. The early Neolithic Anatolian farmers (Aceramic Anatolian farmers) had majority ancestry from AHG with some gene flow from the first farmers from Iran [@bib14]. AHG, in turn, had ancestry from Levant HG (Natufians) and some mysterious HG group related to the ancestors of WHG individuals from central Europe – a gene flow event that likely occurred in the late Pleistocene [@bib14]. Using _qpAdm_, we confirmed that early Anatolian farmers could be modeled as a mixture of AHG and Iran Neolithic farmer-related groups ([Supplementary file 2C](https://elifesciences.org/articles/77625#supp2)). To learn about the timing of the genetic formation of early Anatolian farmers, we applied _DATES_ using Iran Neolithic farmer-related individuals and other reference as groups with AHG ancestry. Since there are limited samples of AHG ancestry, we instead used pooled individuals of WHG-related and Levant Neolithic farmer-related individuals to represent the main ancestry components of AHG. We note that the application of _DATES_ to three-way admixed groups such as early Anatolian farmers can lead to intermediate dates between the first and second pulse of gene flow unless the reference populations are chosen carefully ([Appendix 2—table 1](https://elifesciences.org/articles/77625#app2table1)). Our setup with pooled reference populations should recover the timing of the most recent event (in this case, the gene flow from CHG or Iran Neolithic-related groups) reliably. We infer the Iran Neolithic farmer-related gene flow occurred \~10,900 BCE (12,200–9600 BCE) ([Figure 3](#fig3)), predating the origin of farming in Anatolia [@bib9]. During the subsequent millennia, these early farmers further admixed with Levant Neolithic groups to form Anatolian Neolithic farmers who spread towards the west to Europe and in the east to mix with Iran Neolithic farmers, forming the Chalcolithic groups of Seh Gabi and Hajji Firuz ([Supplementary file 2C](https://elifesciences.org/articles/77625#supp2)). Using _DATES_, we inferred that these Chalcolithic groups were genetically formed in \~7600–5700 BCE ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). chunk: Figure 3. ::: ### Genetic formation of early Anatolian farmers and early Bronze Age Steppe pastoralists. The top panel shows a map with sampling locations of the target groups analyzed for admixture dating. The bottom panels show the inferred times of admixture for each target using _DATES (Distribution of Ancestry Tracts of Evolutionary Signals)_ by fitting an exponential function with an affine term $y=A{e}^{-\lambda d}+c$, where _d_ is the genetic distance in Morgans and $\lambda$ = (_t_+1) is the number of generations since admixture (**_t_**) (Materials and methods). We start the fit at a genetic distance (**_d_**) >0.5 cM (centiMorgans) to minimize confounding with background LD and estimate a standard error by performing a weighted block jackknife removing one chromosome in each run. For each target, in the legend, we show the inferred average dates of admixture (±1 SE) in generations before the individual lived, in BCE accounting for the average age of all the individuals and the mean human generation time, and the normalized root-mean-square deviation (NRMSD) values to assess the fit of the exponential curve (Materials and methods). The bottom left shows the ancestry covariance decay curve for early Anatolian farmers inferred using one reference group as a set of pooled individuals of western hunter-gatherer (WHG)-related and Levant Neolithic farmers-related individuals as a proxy of Anatolian hunter-gatherer (AHG) ancestry and the second reference group containing Iran Neolithic farmer-related individuals. The bottom right shows the ancestry covariance decay curve for early Steppe pastoralists groups, including all Yamnaya and Afanasievo individuals as the target group and eastern hunter-gatherer (EHG)-related and Iran Neolithic farmer-related groups as reference populations. ```{r fig.width=12, fig.height=6, message=FALSE, warning=FALSE} #' @width 6 #' @height 12 dd=read.table(file = "data/Figure2_Data_map" ,header = T) world <- ne_countries(scale = "medium", returnclass = "sf") ggplot(world) + geom_sf() + coord_sf(xlim = c(-25,120), ylim = c(35,80), expand = FALSE) + geom_point(data = dd, aes(x=long,y=lat),col=dd$col,inherit.aes = FALSE,pch=18,cex=4,show.legend = F) + geom_text_repel(data=dd,aes(x=long+10,y=lat),label=dd$sample,cex=4)+ theme_bw() # decay curves par(mai = c(1,1.1,0.5,0.5), mfrow=c(1,2)) data=read.table(file = "data/Figure2_Data_AnatoliaFarmer.fit", header = F) jout=read.table(file = "data/Figure2_Data_AnatoliaFarmer.jout",header = F) plot(x=data$V1,y=data$V2,type="p",col="darkorchid",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", las=1,main = "Early Anatolian farmers",cex=2,cex.main=1.5,cex.axis=1.2,cex.lab=1.2, ylab ="") title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) lines(x=data$V1,y=data$V3,type = "l",col="darkorchid",lty=2) nrmsd=round(NRMSD(data$V2,data$V3),4) len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "), paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+8071,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"), paste("NRMSD",nrmsd,sep="="),sep = "\n") legend("topright",legend = len,col = "darkorchid",lty=c(2,-1,-1), cex=1,bty='n') data=read.table(file = "data/Figure2_Data_SteppeFarmer.fit",header = F) jout=read.table(file = "data/Figure2_Data_SteppeFarmer.jout",header = F) plot(x=data$V1,y=data$V2,type="p",col="deeppink",pch="*",xlim = c(0,20),xlab = "Genetic Distance(cM)", las=1,main = "Early Steppe Pastoralists",cex=2,cex.main=1.5,cex.axis=1.2,cex.lab=1.2,ylab="") title(ylab = "Ancestry covariance", mgp = c(4.2, 1, 0)) lines(x=data$V1,y=data$V3,type = "l",col="deeppink",lty=2) nrmsd=round(NRMSD(data$V2,data$V3),4) len=paste(paste(paste("DATES estimate (gen)",paste(round(jout$V2,0),round(jout$V5,0),sep = " ± "),sep = ": "), paste("DATES estimate (BCE)",paste(round(jout$V2,0)*28+2881,round(jout$V5,0)*28,sep = " ± "),sep = ": "),sep = "\n"), paste("NRMSD",nrmsd,sep="="),sep = "\n") legend("topright",legend = len,col = "deeppink",lty=c(2,-1,-1), cex=1,bty='n') ``` ::: {#fig3} In Europe, the Anatolian Neolithic farmers mixed with the local indigenous HGs contributing between \~40% and 98% of ancestry to the Neolithic Europeans. To elucidate the fine-scale patterns and regional dynamics of these mixtures, we applied _DATES_ to time transect samples from 94 groups (_n_=657) sampled from 16 regions in Europe, ranging from \~6000 to 1900 BCE and encompassing individuals from the early Neolithic to Chalcolithic periods ([Supplementary file 1A](https://elifesciences.org/articles/77625#supp1)). Using _qpAdm_, we first confirmed that the Neolithic Europeans could be modeled as a mixture of European HG-related ancestry and Anatolian farmer-related ancestry and inferred their ancestry proportions ([Supplementary file 2D](https://elifesciences.org/articles/77625#supp2)). For most target populations (\~80%), we found the model of gene flow between Anatolian farmer-related and WHG-related ancestry provided a good fit to the data (p-value > 0.05). In some populations, we found variation in the source of the HG-related ancestry and including either EHG- or GoyetQ2-related ancestry groups improved the fit of the model. In five groups, none of the models fit, despite excluding outlier individuals whose ancestry profile differed from the majority of the individuals in the group ([Supplementary file 2E](https://elifesciences.org/articles/77625#supp2)). To confirm that the target populations do not harbor Steppe pastoralist-related ancestry, we applied _D_-statistics of the form _D_(Mbuti, _target_, Anatolian farmers, Steppe pastoralists) where target = Neolithic European groups. We observed that four groups had a stronger affinity to Steppe pastoralists compared to Anatolian farmers, and hence we excluded these from further analysis ([Supplementary file 2F](https://elifesciences.org/articles/77625#supp2)). After filtering, we applied _DATES_ to 86 European Neolithic groups using WHG-related individuals and Anatolian farmers as reference populations. Earlier analysis has suggested that farming spread along two main routes in Europe, from southeast to central Europe (‘continental route’) and along the Mediterranean coastline to Iberia (‘coastal route') [@bib20; @bib21; @bib52]. Consistent with this, we inferred one of the earliest timings of gene flow was in the Balkans around 6400 BCE. Using the most comprehensive time-transect in Hungary with 19 groups (_n_=63) spanning from middle Neolithic to late Chalcolithic, we inferred the admixture dates ranged from \~6100 to 4500 BCE. Under a model of a single shared gene flow event in the common ancestors of all individuals, we would expect to obtain similar dates of admixture (before present) after accounting for the age of the ancient specimens. Similar to [@bib36], we observed that the estimated dates in middle Neolithic individuals were substantially older than those inferred in late Neolithic or Chalcolithic individuals [@bib7]. This would be expected if the underlying model of gene flow involved multiple pulses of gene flow, such that the timing in the middle Neolithic samples reflects the initial two-way mixture and the timing in the Chalcolithic samples captures both recent and older events. Interestingly, [@bib36], and other recent studies have documented increasing HG ancestry from \~3% to 15% from the Neolithic to Chalcolithic period [@bib24; @bib36; @bib52], suggesting that there was additional HG gene flow after the initial mixture. This highlights that the interactions between local HGs and incoming Anatolian farmers were complex with multiple gene flow events or continuous admixture between these two groups, which explains the increasing HG ancestry and more recent dates in Chalcolithic individuals ([Supplementary file 2D](https://elifesciences.org/articles/77625#supp2)). Mirroring the pattern in Hungary, we documented the resurgence of HG ancestry in the Czech Republic, France, Germany, and southern Europe. In central Europe, we inferred that the Anatolian farmer-related gene flow ranged between \~5600 and 5000 BCE across Germany and Czech Republic, with some exceptions. For instance, in the Blätterhöhle site from Germany, the inferred dates were more recent (\~4000 BCE), consistent with the occupation of both HGs and farmers in this region until the late Neolithic [@bib36]. In eastern Europe, using samples related to the Funnel Beaker culture (TRB; from German _Trichterbecher_) from Poland, we dated the Anatolian farmer-related gene flow occurred on average \~4700 BCE (5300–4200 BCE). Following the TRB decline, the Baden culture and the Globular Amphora culture appeared in many areas of Poland and Ukraine [@bib15]. These cultures had close contact with the Corded Ware complex (CWC) and Steppe pastoralists’ societies, though we found a parsimonious model without Steppe pastoralist-related ancestry provides a good fit to the GAC individuals ([Supplementary file 2D](https://elifesciences.org/articles/77625#supp2)). Applying _DATES_, we inferred the Anatolian farmer and HG-related mixture in GAC ranged between \~4700 and 3900 BCE, predating the spread of Steppe pastoralists to eastern Europe [@bib1; @bib24]. Along the Mediterranean route, we characterized Anatolian farmer-related gene flow in Italy, Iberia, France, and the British Isles. Using samples from five groups in Italy, we inferred the earliest dates of gene flow of \~6100 BCE, and within the millennium, the Anatolian farmer-related ancestry spread from Sardinia to Sicily ([Figure 2](#fig2)). In Iberia, the Anatolian farmer-related mixture ranged from \~5700 to 4300 BCE and showed evidence for an increase in HG ancestry from \~9% to 20% after the initial gene flow. In France, previous studies have shown that Anatolian farmer-related ancestry came from both routes, along the continental route in the north and along the costal route in the south [@bib52]. This is reflected in the source of the HG ancestry, which is predominantly EHG and WHG-related in the north and includes WHG and Goyet-Q2 ancestry in the south [@bib52]. Consistently, we also observed that the admixture dates in France were structured along these routes, with the median estimate of \~5100 BCE in the east and much older \~5500 BCE in the south ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). In Scandinavia, we inferred markedly more recent dates of admixture of \~4300 BCE using samples from Sweden associated with the TRB culture and Ansarve Megalithic tombs, consistent with a late introduction of farming to Scandinavia [@bib39]. Finally, we inferred recent dates of admixture in Neolithic samples from the British Isles (England, Scotland, and Ireland) with the median timing of \~5000 BCE across the three regions. Interestingly, unlike in western and southern Europe, we obtained overlapping dates across eight groups including early to late Neolithic samples from British Isles. This is consistent with previous studies that suggest there was no resurgence in HG ancestry during the Neolithic in Britain [@bib8]. Thus our dates can be interpreted as the time of the main mixture of HGs and Anatolian farmers in this region, implying that the farmer-related ancestry reached Britain a millennium after its arrival in continental Europe. By 4300 BCE, we find that Anatolian farmer-related ancestry is present in nearly all regions in Europe. ### Late Neolithic to Bronze Age The beginning of the Bronze Age (BA) was a period of major cultural and demographic change in Eurasia, accompanied by the spread of Yamnaya Steppe pastoralist-related ancestry from Pontic-Caspian steppes across Europe and South Asia [@bib24]. The archaeological record documents that the early Steppe pastoralists cultures of Yamnaya and Afanasievo, with characteristic burial styles and pottery, appeared \~3300–2600 BCE [@bib42]. These groups were formed as a mixture of EHG-related individuals and CHG-related groups associated with the first farmers from Iran [@bib30; @bib43; @bib56]. Using _qpAdm_, we first tested how well this model fits the data from eight early Steppe pastoralist groups, including seven groups associated with Yamnaya culture and one group related to the Afanasievo culture (Materials and methods). For all but two Yamnaya groups (from Hungary Baden and Russia Kalmykia), we found this model provides a good fit to the data ([Supplementary file 2G](https://elifesciences.org/articles/77625#supp2)). We note that the samples from Kalmykia in our dataset were shotgun sequenced, and in the _qpAdm_ analysis, we are mixing shotgun and capture data that could potentially lead to technical issues. To understand the timing of the formation of the early Steppe pastoralist-related groups, we applied _DATES_ using pooled EHG-related and pooled Iranian Neolithic farmer-related individuals. Focusing on the groups with the largest sample sizes, Yamnaya Samara (_n_=10) and Afanasievo (_n_=19), we inferred the admixture occurred between 40 and 45 generations before the individuals lived, translating to an admixture timing of \~4100 BCE ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). We obtained qualitatively similar dates across four Yamnaya and one Afanasievo groups, consistent with the findings that these groups descend from a recent common ancestor (we note for the Ozera samples from Ukraine, the dates were not significant). This is also further supported by the insight that the genetic differentiation across early Steppe pastoralist groups is very low (_F~ST~_ ~ 0.000–0.006) ([Supplementary file 2H](https://elifesciences.org/articles/77625#supp2)). Thus, we combined all early Steppe pastoralist individuals in one group to obtain a more precise estimate for the genetic formation of proto-Yamnaya of \~4400–4000 BCE ([Figure 3](#fig3)). These dates are noteworthy as they predate the archaeological evidence by more than a millennium [@bib2] and have important implications for understanding the origin of proto-Pontic Caspian cultures and their spread to Europe and South Asia. Over the following millennium, the Yamnaya-derived ancestry spread across Europe through CWC and Bell Beaker complex (BBC) cultures. Present-day Europeans derive between \~10% and 60% Steppe pastoralist-related ancestry, which was not seen in Neolithic samples. To obtain a precise chronology of the spread of Steppe pastoralist-related ancestry across Europe, we analyzed 109 late Neolithic, Chalcolithic, and BA samples dated between 3000 and 750 CE from 18 regions, including samples associated with the CWC and BBC cultures. We first confirmed that most target samples had Steppe pastoralist-related ancestry, in addition to European HG-related and Anatolian farmer-related ancestry using _qpAdm_. We excluded 20 groups that could not be parsimoniously modeled as a three-way mixture even after removing individual outliers. After filtering, we retained 79 groups for dating Steppe pastoralist-related gene flow across Europe ([Supplementary file 2I and J](https://elifesciences.org/articles/77625#supp2)). As BA Europeans have ancestry from three distinct groups, we applied _DATES_ using the following two reference populations, one group including early Steppe pastoralists (Yamnaya and Afanasievo) and the other group that is a the proxy for the ancestral Neolithic Europe population using pooled samples of WHG-related and Anatolian farmer-related individuals. To learn about the spread of CWC culture across Europe, we used seven late Neolithic and Bronze age groups, including five associated with CWC artifacts. Using _DATES_, we inferred that the oldest date of Steppe pastoralists gene flow in Europe was \~3200 BCE in Scandinavia in samples associated with Battle Axe Culture in Sweden and Single Grave Culture in Denmark that were both contemporary to CWC. The samples from Scandinavia showed large heterogeneity in ancestry, including some individuals with majority Steppe pastoralist-related ancestry (and negligible amounts of Anatolian farmer-related ancestry), consistent with patterns expected from recent gene flow [@bib38]. Strikingly, we inferred the timing of admixture in central Europe (Germany and the Czech Republic) and eastern Europe (Estonia and Poland) to be remarkably similar. These dates fall within a narrow range of \~3000–2900 BCE across diverse regions, suggesting that the mixed population associated with the Corded Ware culture formed over a short time and spread across Europe rapidly with very little further mixture ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). Following the Corded Ware culture, from around 2800 to 2300 BCE, Bell Beaker pottery became widespread across Europe [@bib17]. Using 19 Chalcolithic and BA samples, including 10 associated with Beaker-complex artifacts, we inferred the dynamics of the spread of the Beaker complex across Europe. We inferred the oldest date of Steppe pastoralist-related admixture was \~3200 BCE (3600–2800 BCE) in early Bronze Age (EBA) Mallorca samples from Iberia. We note the EBA Mallorca sample is not directly associated with Beaker culture, but _qpAdm_ modeling suggests that this individual is clade with the small subset of Iberian Beaker-complex-associated individuals who carried Steppe pastoralist-related ancestry [@bib16]. Most individuals from Iberia, however, had negligible Steppe pastoralist-related ancestry suggesting the Beaker culture was not accompanied by major gene flow in Iberia despite the earliest dates ([Supplementary file 2I](https://elifesciences.org/articles/77625#supp2)). In central and western Europe, where Steppe pastoralist gene flow was more pervasive, we inferred the median date of the mixture was \~2700 BCE with the oldest dates in the Netherlands, followed by Germany and France ([Figure 2](#fig2)). There was, however, large heterogeneity in the dates across Europe and even within the same region. For example, comparing two BA groups from the Netherlands suggests a wide range of dates \~3000 BCE and 2500 BCE, and four groups from Germany indicate a range of \~2900–2700 BCE. From central Europe, the Steppe pastoralist-related ancestry spread quickly to the British Isles, where people with Steppe pastoralist ancestry replaced 90% of the genetic ancestry of individuals from Britain. Our estimates for the time of gene flow in Bell Beakers samples from England suggest that the gene flow occurred \~2700 BCE (2770–2550 BCE). Our estimated dates of admixture are older than the dates of arrival of this ancestry in Britain [@bib44] and, interestingly, overlap the dates in central Europe. Given that a significant fraction of the Beaker individuals were recent migrants from central Europe, we interpret our dates reflect the admixture into ancestors of the British Beaker people, occurring in mainland Europe [@bib44]. The middle to late Bronze Age (MLBA) led to the final integration of Steppe pastoralist-related ancestry in Europe. In southern Europe, EBA samples had limited Steppe pastoralist-related ancestry, though present-day individuals harbor between \~5% and 30% of this ancestry [@bib24]. Using pooled samples of MLBA from Spain, we inferred major mixture occurred \~2500 BCE in Iberia. We inferred a similar timing in Italy using individuals associated with the Bell Beaker culture and EBA samples from Sicily ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). In Sardinia, a majority of the BA samples do not have Steppe pastoralist-related ancestry. In a few individuals, we found evidence for Steppe pastoralist-related ancestry, though in most cases, this ancestry proportion overlapped 0 and the inferred dates of admixture were very noisy ([Supplementary file 2I](https://elifesciences.org/articles/77625#supp2)). Using Iron Age samples from Sardinia, we inferred the gene flow occurred \~2600 BCE, though there is a large uncertainty associated with this estimate (3700–1490 BCE). In other parts of continental Europe and the British Isles, the Steppe pastoralist-related ancestry got diluted over time, as evidenced by more recent dates in LBA (late Bronze Age) than EBA or MBA (middle Bronze Age) samples in Germany, England, and Scotland, and an increase in Neolithic farmer ancestry during this period ([@bib45]; [Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). Finally, the CWC expanded to the east to form the archaeological complexes of Sintashta, Srubnaya, Andronovo, and the BA cultures of Kazakhstan. Samples associated with these cultures harbor mixed ancestry from the Yamnaya Steppe pastoralist-related groups (CWC, in some cases) and Neolithic individuals from central Europe ([Supplementary file 2K](https://elifesciences.org/articles/77625#supp2); [@bib43]). Applying _DATES_ to eight MLBA Steppe pastoralist groups, we inferred the precise timing for the formation of these groups beginning in the third millennium BCE. These groups were formed chronologically, with the date of genetic formation of \~3200 BCE for Sintashta culture, followed by \~2900 BCE for Srubnaya and Andronovo cultures. In the central Steppe region (present-day Kazakhstan), we obtained median dates of \~2800 BCE for the expansion of Steppe pastoralist-related ancestry in four Kazakh cultures of Maitan Alakul, Aktogai, and Kairan. By \~2700 BCE, most of these cultures had almost 60–70% Yamnaya Steppe pastoralist-related ancestry ([Supplementary file 1B](https://elifesciences.org/articles/77625#supp1)). These groups, in turn, expanded eastwards, transforming the genetic composition of populations in South Asia. # Discussion We developed _DATES_ that measures ancestry covariance patterns in a single diploid individual genome to estimate the time of admixture. Using extensive simulations, we show that _DATES_ provides accurate estimates of the timing of admixture across a range of demographic scenarios. Application of _DATES_ to present-day samples shows that the results are concordant with published methods – ROLLOFF, ALDER, and Globetrotter. For sparse datasets, _DATES_ outperforms published methods as it does not require phased data and works reliably with limited samples, large proportions of missing variants, as well as pseudo-haploid genotypes. This makes _DATES_ ideally suited for the analysis of ancient DNA samples. We illustrate the application of _DATES_ by reconstructing population movements and admixtures during the European Holocene. We confirm and extend signals that were previously identified such as the resurgence of HG ancestry during the Neolithic and provide new details about the genetic formation of the ancestral populations of Europeans and the spread of CWC and BBC cultures across Europe. Together, our analysis provides a detailed timeline and insights into the dynamics of the Neolithization of Europe and the spread of Steppe pastoralist-related ancestry across Europe. First, we document that the Mesolithic HGs formed as a mixture of WHG and EHG ancestry \~10,200–7400 BCE. These dates are consistent with the archaeological evidence for the appearance of lithic technology associated with eastern HGs in Scandinavia and the Baltic regions [@bib22; @bib32]. Next, we studied the timing of the genetic formation of Anatolian farmers. The earliest evidence of agriculture comes from the Fertile Crescent, the southern Levant, and the Zagros Mountains of Iran and dated to around 10,000 BCE. In central Anatolia, farming has been documented c. 8300 BCE [@bib4; @bib6]. It has been long debated if Neolithic farming groups from Iran and the Levant introduced agriculture to Anatolia or HGs in the region locally adopted agricultural practices. The early Anatolian farmers can be modeled as a mixture of local HGs related to Caucasus HGs or the first farmers from Iran [@bib14]. By applying _DATES_ (assuming a single instantaneous admixture), we inferred that the Iran Neolithic gene flow occurred around 10,900 BCE (~12,200–9600 BCE). An alternate possibility is that there was a long period of gradual gene flow between the two groups and our dates reflect intermediate dates between the start and end of the gene flow. An upper bound for such a mixture comes from the lack of Iran Neolithic ancestry in AHGs at 13,000 BCE, and a lower bound comes from the C14 dates of early Anatolian farmers, one of which is directly dated at 8269–8210 BCE [@bib14]. In either case (instantaneous admixture or gradual gene flow), the genetic mixture that formed Anatolian farmers predates the advent of agriculture in this region [@bib4; @bib6]. This supports the model that AHGs locally transitioned to agricultural subsistence, and most probably, there was cultural diffusion from other regions in Near East (Iran and Levant) [@bib14]. Future studies with more dense temporal sampling will shed light on the demographic processes that led to the transition from foraging to farming in the Near East, and in turn, elucidate the relative roles of demic and cultural diffusion in the dispersal of technologies like agriculture across populations. Using data from 16 regions in Europe, we reconstruct a detailed chronology and dynamics of the expansion of Anatolian farmers during the Neolithic period. We infer that starting in \~6400 BCE, gene flow from Anatolian farmers became widespread across Europe, and by \~4300 BCE, it was present in almost all parts of continental Europe and the British Isles. These dates are significantly more recent than the estimates of farming based on archaeological evidence in some parts of Europe, suggesting that the local HGs and farmers coexisted for more than a millennium before the mixture occurred [@bib24; @bib36]. In many regions, after the initial mixture, there was a resurgence of HG ancestry, highlighting the complexities of these ancient interactions. We note that our results are consistent with two previous genetic studies, [@bib36], and [@bib52], that applied genetic dating methods to a subset of samples we used in our analysis. [@bib36], used a modified version of ALDER to infer the timing of admixture in three regions (_n_=151). We obtained statistically consistent results for all overlapping samples (within two standard errors) ([Appendix 1—table 6](https://elifesciences.org/articles/77625#app1table6)). An advantage of our approach over the modified ALDER approach is that we do not rely on helper samples (higher coverage individuals combined with the target group) for dating; unless these have a similar ancestry profile, they could bias the inferred dates. Our results are concordant with [@bib52], that used a previous version of _DATES_ to infer the timing of Neolithic gene flow in 32 groups (vs. 86 groups in our study). We find the performance of both versions of _DATES_ is similar, though some implementation details have improved ([Appendix 1—table 1](https://elifesciences.org/articles/77625#app1table1)). The second major migration occurred when populations associated with the Yamnaya culture in the Pontic-Caspian steppes expanded across Europe. Our analysis reveals the precise timing of the genetic formation of the early Steppe pastoralist groups – Yamnaya and Afanasievo – occurred \~4400–4000 BCE. This estimate predates the archaeological evidence by more than a millennium [@bib2] and suggests the presence of an ancient ‘ghost’ population of proto-Yamnaya around this time. Understanding the source and location of this ghost population will provide deep insights into the history of Pontic-Caspian cultures and the origin of Indo-European languages that have been associated to have spread with Steppe pastoralists ancestry to Europe and South Asia [@bib24; @bib33]. Starting in \~3200 BCE, the Yamnaya-derived cultures of CWC and BBC spread westwards, bringing Steppe pastoralist-related ancestry to Europe. Our analysis reveals striking differences in the spread of these two cultures: the CWC formation is similar across diverse regions separated by thousands of kilometers, suggesting a rapid spread after the initial formation of this group, while the spread of BBC culture was more complex and heterogeneous across regions. We find the earliest evidence of Steppe pastoralist-related ancestry in Iberia around 3200 BCE, though this ancestry only becomes widespread after 2500 BCE. In central Europe, the gene flow occurred simultaneously with archaeological evidence and was coexisting with the CWC in some parts [@bib57; @bib44]. Finally, in the British Isles, the Bell Beaker culture spreads rapidly from central Europe and replaces almost 90% of the ancestry of individuals in this region [@bib44]. Recent analysis has shown remarkable parallels in the history of Europe and South Asia; with both groups deriving ancestry from local indigenous HGs, Near Eastern farmers, and Steppe pastoralist-related groups [@bib43]. Interestingly, however, the timing of the two major migrations events differs across the two subcontinents. Both mixtures occurred in Europe almost a millennium before they occurred in South Asia. In Europe, the Neolithic migrations primarily involved Anatolian farmers, while the source of Neolithic ancestry is closer to Iran Neolithic farmers in South Asia. The Steppe pastoralist-related gene flow occurred in the context of the spread of CWC and BBC cultures in Europe around 3200–2500 BCE; in South Asia, this ancestry arrived with Steppe MLB A cultures that were formed much later in 1800–1500 BCE [@bib43]. The Steppe MLBA groups have ancestry from Steppe pastoralist derived groups and European Neolithic farmers following the eastward expansion of CWC groups between \~3200 and 2700 BCE. Understanding the origin and migration paths of the ancestral groups thus helps to illuminate the differences in the timeline of the spread of Steppe pastoralists across the two subcontinents of Eurasia. Genomic dating methods like _DATES_ provide an independent and complementary approach for reconstructing population history. By focusing on the genetic clock based on recombination rate, we provide an independent estimate of the timing of evolutionary events up to several thousands of years. Our analysis also has advantages over the temporal sampling of ancient DNA, in that we can obtain direct estimates of when a population was formed, rather than inferring putative bounds for the timing based on the absence/presence of a particular ancestry signature (which may be sensitive to sampling choice or density). Genetic approaches provide complementary evidence to archaeology and linguistics as they date the time of admixture and not migration. Both dates are similar in many contemporary populations like African Americans and Latinos, though this may not be generally true [@bib27]. This is underscored by our dates for the Anatolian farmer-related mixture, which postdates evidence of material culture related to agriculture by almost two millennia in some regions. This suggests that European HGs and farmers resided side by side for several thousand years before mixing [@bib7; @bib54]. This underscores how genetic dates can provide complementary evidence to archaeology and help to build a comprehensive picture of population origins and movements. # Materials and methods ## Dataset We analyzed 1096 ancient European samples from 152 groups restricting to data from 1,233,013 autosomal SNP positions that were genotyped using the Affymetrix Human Origins array (the V44.3 release of the AADR; <https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data>). We filtered this dataset to remove samples that were marked as contaminated, low coverage, outliers, duplicates, or first- or second-degree relatives. We grouped individuals together from a particular culture or region. Details of sample affiliation and grouping used is described in [Supplementary file 1A](https://elifesciences.org/articles/77625#supp1). ## Modeling admixture history We applied _qpAdm_ from ADMIXTOOLS to identify the best fitting model and estimate the ancestry proportions in a target population modeled as a mixture of _n_ ‘reference’ populations using a set of ‘Outgroup’ populations [@bib24]. We set the details: YES parameter, which reports a normally distributed Z-score to evaluate the goodness of fit of the model (standard errors were estimated with a Block Jackknife). For each target population, we chose the most parsimonious model, that is, fitting the data with the minimum number of source populations. We excluded models where the p-value < 0.05 indicating a poor fit to the data. Details of the _qpAdm_ analysis for each group are reported in [Supplementary file 2](https://elifesciences.org/articles/77625#supp2). We also applied _D_-statistics in some cases using _qpDstat_ in ADMIXTOOLS with default parameters. ## _DATES_: model and implementation _DATES_ leverages the weighted ancestry covariance patterns across the genome of an admixed individual to infer the time of admixture. This method extends the idea introduced in ROLLOFF and ALDER and [@bib41] to be applicable to dating admixture events between modern human populations using a single genome. ### Basic model and notation Assume we have an admixed individual _C_ with ancestry from source populations _A_ and _B_, with ancestry proportion of $\alpha$ and $\beta =\left(1-\alpha \right)$, respectively. This mixture occurred $t$ generations ago. First, we model the genotypes of _C_ as a linear mix of allele frequencies of populations _A_ and _B_. For any SNP _i_, let the genotype of _C_ be ${g}_{i}$ and allele frequency in _A_ and _B_ be ${p}_{A}\left(i\right)$ and ${p}_{B}\left(i\right).$ We can then infer the mixing fraction $\alpha$ from population _A_ by solving the simple linear regression by minimizing the residuals. $$ R={\sum }_{i}{({g}_{i}-\left(\alpha {p}_{A}\left(i\right)+\left(1-\alpha \right){p}_{B}\left(i\right)\right))}^{2} $$ Let ${a}_{i}$ be the probability of observing ${g}_{i}$ in _C_ given the observed genotype in _A,_ and ${b}_{i}$ be the probability of observing ${g}_{i}$ in _C_ given the observed genotype in _B_ $$ {a}_{i}=P\left({g}_{i}\right|A) $$ $$ {b}_{i}=P\left({g}_{i}\right|B) $$ We can then compute the likelihood _L~i~_ of observing a genotype ${g}_{i}$ in the admixed individual $$ {L}_{i}=\alpha {a}_{i}+\beta {b}_{i} $$ For a pair of neighboring markers _S_~1~, _S_~2~ located at a genetic distance of _d_ Morgans, the probability of no recombination between the two markers is given by $\theta ={e}^{-td}.$ Accounting for recombination, the log-likelihood that the two markers have the same ancestry is then given by: $$ \mathcal{L}=\mathcal{}\mathrm{l}\mathrm{o}\mathrm{g}[\left(1-\theta \right){L}_{1}{L}_{2}+\theta \left(\alpha {a}_{1}{a}_{2}+\beta {b}_{1}{b}_{2}\right)] $$ Let _K~i~_ represent the ancestry at marker _S~i~_. Expanding as a power series in _θ,_ the coefficient of $\theta$ is _QK~1~K~2~,_ where $$ Q=\alpha \beta $$ $$ {K}_{i}=\frac{\left({a}_{i}-{b}_{i}\right)}{{L}_{i}} $$ We can compute the ancestry covariance, _A(d_), across pairs of markers _S_~1~, _S_~2~ separated by distance _d_ as $$ A\left(d\right)=\frac{\sum _{s\left(d\right)}\left({K}_{1}-\overline{{K}_{1}}\right)\left({K}_{2}-\overline{{K}_{2}}\right)}{|S\left(d\right)|} $$ where _S(d_) is a set of markers _S_~1~, _S_~2~ located _d_ Morgans apart. The ancestry covariance $A\left(d\right)$ is expected to follow an exponential decay with _d_ with the rate of decay depending on the time since admixture (${\displaystyle t+1}$). $$ A(d)\sim {e}^{-\left(t+1\right)d} $$ The factor of (_t_+1) comes from the insight that in the first generation following admixture, the admixed population derives one chromosome from each ancestral group. The mixing of chromosomes only begins in the following generations as the chromosomes recombine. This means that if we fit _t_ generations, we are likely to underestimate the time of admixture. We note that previous methods like ALDER and ROLLOFF, however, incorrectly fit _t_ generations to infer the time of mixture. In practice, however, this has little effect on the inference except maybe in case of very recent admixture dates. We infer the time of the mixture by fitting an exponential distribution with affine term using least squares. _DATES_ is applicable for dating admixture in a single individual. When multiple individuals from an admixed population are available, _DATES_ computes the log-likelihood by summing over all individuals. ## Application to real data We applied _DATES_ using genome-wide SNP data from the target population and two reference populations. To infer the allele frequency in the ancestral populations more reliably, where specified, we pooled individuals deriving the majority of their ancestry from the population of interest ([Supplementary file 1A](https://elifesciences.org/articles/77625#supp1)). We computed the weighted ancestry covariance between 0.45 cM (centiMorgans) (to minimize the impact of background LD) and 100 cM, with a bin size of 0.1 cM. We plotted the weighted covariance with genetic distance and obtained a date by fitting an exponential function with an affine term $y=A{e}^{-\lambda d}+c$, where _d_ is the genetic distance in Morgans and $\lambda$ = (_t_+1) is the number of generations since admixture (_t_). We computed standard errors using weighted block jackknife, where one chromosome was removed in each run [@bib10]. Following [@bib55], we examined the quality of the exponential fit by computing the normalized root-mean-square deviation (NRMSD) between the empirical ancestry covariance values ${\displaystyle z}$ and the fitted ones ${\displaystyle \hat{z}}$, across all the genetic distance bins (Appendix 1). $$ NRMSD=\frac{1}{max(\hat{z})-min(\hat{z})}\sqrt{\frac{\sum ^{D}(z-\hat{z}{)}^{2}}{N}} $$ The estimated dates of admixture were considered significant if the (a) _Z_-score > 2, (b) $\lambda$ < 200 generations and (c) NRMSD < 0.7. We converted the inferred dates from generations to years by assuming a mean generation time of 28 years [@bib41]. For ancient samples, we added the sampling age of the ancient specimen ([Supplementary file 1A](https://elifesciences.org/articles/77625#supp1)). When multiple individuals were available, we used the average sampling ages to offset the admixture dates. We report dates in BCE by assuming the 1950 convention. ## Comparison of old and new version of _DATES_ An earlier version of _DATES_ (version v753) was released in [@bib43]. The current method (version 4010) released in this study differs in some key aspects of the implementation as described below. 1. Use of regression model vs. likelihood approach: In v753, we used a regression model to infer the residuals at each site in the genotype by conditioning on the allele frequency in the reference population and the genome-wide estimate of the admixture proportion [@bib43]. In contrast, in the current version (v4010), we use a more rigorous likelihood framework where we infer the probability of ancestry from each reference population at each site in the genome ([Equation 3](#equ5)). 2. Rate of decay of exponential fit: In v753, like ALDER and ROLLOFF, we fit an exponential decay with the rate of _t_ generations. However, this assumes that mosaic chromosomes are formed in the generation when the gene flow occurs. However, in reality, the mixing of ancestry only begins in the following generations as the chromosomes of distinct ancestry recombine. To correctly account for this effect, we fit an exponential with the rate of (_t_+1) in _DATES_ v4010. In practice, this has a minor effect on the dates reported earlier, as in most cases the uncertainty is much larger than one generation. 3. Goodness of fit test: In v4010, we implemented the NRMSD to assess the fit of the exponential curve. NRMSD computes the deviation between the empirical estimate and fitted data in order to provide a statistical way to characterize the noisiness of the fitted curve. Lower values of NRMSD suggest a better fit, however, there is no clear interpretation of the absolute value of NRMSD. Based on the empirical distribution of NRMSD values in our study samples ([Appendix 1—figure 3](https://elifesciences.org/articles/77625#app1fig3)), we infer a conservative threshold of 0.7 to define a ‘good’ fit. We caution that users should adjust this threshold based on their application and always visually inspect their exponential fits to ensure reliable results. 4. Support for arbitrary number of chromosomes: Unlike v753 that was optimized for parameters in humans, the new version supports an arbitrary number of chromosomes (inputted by the user) so _DATES_ can be used in any species. A comparison of the two version of _DATES_ using simulated data ([Appendix 1—table 1](https://elifesciences.org/articles/77625#app1table1)) and empirical data ([Appendix 1—table 2](https://elifesciences.org/articles/77625#app1table2), [Appendix 1—table 3](https://elifesciences.org/articles/77625#app1table3)) yields qualitatively similar results. ## Simulations We constructed admixed genomes following the approach described in [@bib40]. This method requires phased haplotypes from two source populations and uses two key parameters to simulate data from admixed individuals, (a) the mixture proportion (${\displaystyle \alpha }$) that represents the probability that a particular sampled haplotype comes from one of the reference panels, namely _source_~1~ and _source_~2~, and (b) the time of mixture (${\displaystyle \lambda }$) which is the number of generations since mixture. To simulate an admixed individual, we begin at the start of the chromosome and sample a haplotype from either _source_~1~ with a probability ($\alpha$) and _source_~2~ with a probability (${\displaystyle 1-\alpha }$). At each subsequent marker, we check if there was a recombination event between the two neighboring markers. A recombination event occurs with a probability of (${\displaystyle 1-{e}^{-\lambda g}}$), where _g_ is the genetic distance in Morgans. We use the time of ${\displaystyle \lambda =(t+1)}$ generations to account for the fact that in the first generation following admixture, the offspring inherits one chromosome of each ancestry. In the next generation, the crossovers lead to a mixing of ancestry. Thus, when a recombination event occurs, we resample the ancestry between _source_~1~ and _source_~2~, otherwise, we copy the haplotype from the same source population. (Note, a recombination event can lead to a switch to a haplotype of the same ancestry.) Once the ancestry is chosen, we randomly pick a haplotype from the ancestral pool (without replacement) and copy its sequence to the genome of the admixed individual. This process is continued until we reach the end of the chromosome. Using this approach, we generate the genomes of $n$ admixed individuals. The simulated haploid chromosomes are merged at random to construct diploid admixed individuals. This algorithm requires more than $2n$ ancestral haplotypes for generating data for $n$ diploid admixed individuals [@bib40]. For more than two reference populations, the same algorithm is repeated iteratively. We used 111 CEU and 112 YRI phased 1000 genomes phase 3 dataset [@bib3] for generating 10 admixed genomes for \~380,000 SNPs (unless otherwise stated). For the inference, we used French and Yoruba from HGDP [@bib35]. We generated data for various demographic scenarios, where we varied the time the admixture (${\displaystyle \lambda }$), proportion of mixture ($\alpha$), sample size in the reference and target populations, divergence between the ancestral and reference populations used and studied their impact on the estimated dates. We also characterized the impact of features of ancient DNA such as missing data, pseudo-haploid genotypes, and limited sample size. In order to simulate pseudo-haploid genotypes, we randomly sampled an allele at each heterozygous site and assigned it as the homozygous genotype at that site [@bib26]. To generate missing data, we set the genotype call at a site as ‘missing’ or ‘unknown’ (in eigenstrat format as _!number(9)_) where the proportion of missing genotypes ranged between 5% and 60% in our simulations. We also evaluated the impact of the choice of reference populations used in _DATES_ in case of simple and multiple pulses of admixture. To study the impact of complex scenarios of admixture involving founder events and continuous gene flow, we used a coalescent simulator, MaCs [@bib12]. We simulated 100 Mb of three populations with an effective population size of 12,500, mutation rate of 1.2 × 10^–8^ and recombination rate 1 × 10^–8^ per base pair per generation, respectively [@bib25; @bib31]. We assumed the admixture occurred continuously over a period of time or was followed by the bottleneck. In case of the latter, the duration of the bottleneck was 1–10 generations with reduction in effective population size from 12,500 to 10–1000 and the population recovered to its original size after the bottleneck or maintained a small size until present (no recovery founder event). For each simulation, we generated data for two haploid chromosomes and combined these to generate one diploid chromosome. ## Software availability The executable and source code for _DATES_ will be available on GitHub: <https://github.com/MoorjaniLab/DATES_v4010> (copy archived at [swh:1:rev:e034dc0d6fe8d41a828796f07791d50011b6bb04](https://archive.softwareheritage.org/swh:1:dir:0cd95e97a6e3186515f8812dca928ba30829d132;origin=https://github.com/MoorjaniLab/DATES_v4010;visit=swh:1:snp:bb2c8ef48a7ffe4366f654ee057c257c7780e88c;anchor=swh:1:rev:e034dc0d6fe8d41a828796f07791d50011b6bb04); [@bib13]).