# Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data

### Anand Bhaskar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2013-239

December 20, 2013

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-239.pdf

The recent availability of large-sample high-throughput sequencing data has given us an unprecedented opportunity to resolve in fine detail the historical demographic processes that have shaped the genomes of modern human populations. Such understanding of population demography is important for several applications, including: avoiding false positives in genome-wide association studies; calibrating null models of neutral genome evolution in order to find regions under selection; studying the impact of bottlenecks and small founder populations on genetic mutational load; and reconstructing large-scale historical human migration and admixture events.

In this dissertation, we consider some statistical, algorithmic and robustness aspects of demographic inference from genomic variation data. In particular, we study the problem of determining the historical effective size of a population from the sample frequency spectrum (SFS), which measures the distribution of allele frequencies in a sample of sequences drawn from the population.
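To make the definition of the SFS concrete, the following minimal sketch tabulates it from a small matrix of sampled sequences. The 0/1 encoding (0 for the ancestral allele, 1 for the derived allele), the function name `sample_frequency_spectrum`, and the toy data are illustrative assumptions, not part of the dissertation.

```python
def sample_frequency_spectrum(haplotypes):
    """Return [xi_1, ..., xi_{n-1}] for n sampled sequences.

    xi_i is the number of sites at which exactly i of the n sequences
    carry the derived allele (encoded as 1; ancestral is 0).
    Sites where no sequence or every sequence carries the derived allele
    are not segregating in the sample and are skipped.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        derived_count = sum(h[site] for h in haplotypes)
        if 0 < derived_count < n:
            sfs[derived_count - 1] += 1
    return sfs


# Hypothetical toy sample: 4 sequences (rows) typed at 5 sites (columns).
haplotypes = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
print(sample_frequency_spectrum(haplotypes))
```

In this toy sample one site is a singleton, one a doubleton, and one a tripleton, while the two monomorphic sites contribute nothing.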

From the statistical or information-theoretic perspective, it is known that this inverse problem does not have a unique solution in general, no matter how large the sample size. For any population allele frequency distribution, there exist infinitely many population size functions that are consistent with this distribution. While such a non-identifiability result might appear to pose a serious problem for statistical inference algorithms, we show that the situation is considerably better in practice. In particular, we prove that if the true population size function is piecewise-defined with each piece belonging to some family of biologically motivated functions, then the SFS of a finite sample of sequences uniquely determines the underlying demography. We obtain a general bound on the sample size sufficient for identifiability; this bound depends on the number of pieces in the demographic model and on the family of functions for each piece. We also give concrete instantiations of this bound for piecewise-constant and piecewise-exponential models that are commonly used in demographic inference analyses.

From the algorithmic perspective, we build on analytic results for the expected SFS of a time-varying population size function and develop an efficient likelihood-based algorithm to infer piecewise-exponential population size histories from large sample allele frequency data. By considering very large samples, our method can resolve details of the population history from the very recent past that are not otherwise accessible using smaller samples.
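As a rough illustration of likelihood-based fitting to the SFS, the sketch below covers only the simplest special case, a constant population size, where the expected SFS under the standard neutral coalescent is the classical θ/i. It is not the piecewise-exponential algorithm described above. The function names and the multinomial form of the likelihood are assumptions made for this example.

```python
import math


def expected_sfs_constant(n, theta):
    """Expected SFS for a constant-size population: E[xi_i] = theta / i.

    This is the classical neutral-coalescent result for i = 1, ..., n - 1.
    It stands in here for the more general time-varying formulas.
    """
    return [theta / i for i in range(1, n)]


def multinomial_log_likelihood(observed, expected):
    """Log-likelihood of observed SFS counts, up to an additive constant.

    Each segregating site is assumed to fall in frequency class i with
    probability proportional to the expected SFS entry for that class.
    """
    total = sum(expected)
    probs = [e / total for e in expected]
    return sum(o * math.log(p) for o, p in zip(observed, probs) if o > 0)


# Hypothetical observed counts for a sample of n = 4 sequences.
observed = [6, 3, 2]
ll_constant = multinomial_log_likelihood(observed, expected_sfs_constant(4, 1.0))
ll_flat = multinomial_log_likelihood(observed, [1.0, 1.0, 1.0])
print(ll_constant > ll_flat)
```

A full inference procedure would replace `expected_sfs_constant` with the expected SFS of a parameterized size history and maximize the likelihood over those parameters.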

The third aspect of this dissertation is concerned with understanding the robustness of widely used evolutionary models to violations of model assumptions. Continuous-time evolutionary models like Kingman's coalescent and its dual diffusion process are derived from discrete models of random mating by assuming that the sample size being analyzed is much smaller than the population size. However, the very large sample datasets being produced due to advances in high-throughput sequencing technologies are approaching the limits of this assumption. To investigate this issue, we develop exact algorithms for computation under the discrete-time Wright-Fisher model and use these algorithms to study the distortions in several genealogical quantities arising due to the coalescent approximation. Our findings indicate that for several demographic models inferred from large-scale sequence data, there can be substantial genealogical deviations introduced by the coalescent approximation that might influence the results of inference studies.
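One simple way to see why the sample-size assumption matters is to compare, for a single generation, the probability that no two sampled lineages share a parent. The sketch below does this for a haploid Wright-Fisher population of constant size N (an assumption made for illustration; for diploids, replace N by 2N). It is not one of the exact algorithms developed in the dissertation, and the function names are hypothetical.

```python
import math


def wf_prob_no_coalescence(n, N):
    """Exact Wright-Fisher probability that n lineages pick n distinct parents.

    Each lineage chooses its parent uniformly at random among N individuals,
    so the probability is the product over i = 1..n-1 of (1 - i / N).
    """
    p = 1.0
    for i in range(1, n):
        p *= 1.0 - i / N
    return p


def coalescent_prob_no_coalescence(n, N):
    """Kingman coalescent approximation for the same one-generation event.

    With time measured in units of N generations, n lineages coalesce at
    rate n(n-1)/2, so no event occurs in one generation with probability
    exp(-n(n-1) / (2N)).
    """
    return math.exp(-n * (n - 1) / (2.0 * N))


# Small sample relative to N: the two probabilities nearly agree.
print(wf_prob_no_coalescence(5, 10000), coalescent_prob_no_coalescence(5, 10000))
# Sample size comparable to N: the approximation overstates the probability.
print(wf_prob_no_coalescence(50, 100), coalescent_prob_no_coalescence(50, 100))
```

Because log(1 - x) ≤ -x, the exact probability never exceeds the coalescent value, and the gap widens as n grows relative to N.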

**Advisor:** Yun S. Song

BibTeX citation:

@phdthesis{Bhaskar:EECS-2013-239, Author = {Bhaskar, Anand}, Title = {Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data}, School = {EECS Department, University of California, Berkeley}, Year = {2013}, Month = {Dec}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-239.html}, Number = {UCB/EECS-2013-239} }

EndNote citation:

%0 Thesis %A Bhaskar, Anand %T Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data %I EECS Department, University of California, Berkeley %D 2013 %8 December 20 %@ UCB/EECS-2013-239 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-239.html %F Bhaskar:EECS-2013-239