How Genomic Data Are Changing Population Genetics: An Interview with Molly Przeworski
By using statistical methods to compare genomic data across species, such as chimpanzees and humans, the Przeworski Lab is gaining insights into the origins of genetic variation and adaptation. (Photo: Common chimpanzee at the Leipzig Zoo. Thomas Lersch, Wikimedia Commons.)
Launched approximately 100 years ago, population genetics is a subfield within evolutionary biology that seeks to explain how processes such as mutation, natural selection, and random genetic drift lead to genetic variation within and between species. Population genetics was originally born from the convergence of Mendelian genetics and biostatistics, but with the recent availability of genome sequencing data and high-performance computing technologies, it has bloomed into a mature computational science that is providing increasingly high-resolution models of the processes that drive evolution.
Molly Przeworski, a professor in the Columbia University Departments of Biological Sciences and Systems Biology, majored in mathematics at Princeton before beginning her PhD in evolutionary biology at the University of Chicago in the mid-1990s. While there, she realized that the availability of increasingly large data sets was changing population genetics, and has since been interested in using statistical approaches to investigate questions such as how genetic variation drives adaptation and why mutation rate and recombination rate differ among species. In the following interview, she describes how population genetics is itself evolving, as well as some of her laboratory’s contributions to the field.
Population genetics is a discipline that has been changing over the past couple of decades. Could you talk about how you have seen it develop in your career so far?
Since the 1920s, population genetics has been focused on modeling evolutionary processes that occur on time scales that are too slow to be observed. In its early days, even the object of study—genetic variation among individuals—was very hard to measure. The first genetic loci that were found to be variable in humans were those responsible for determining blood groups, because they were easy to assay. But those kinds of variable loci—or “polymorphisms”—were few and far between. The vast majority of genetic variation among individuals was completely inaccessible until the 1960s, when people started developing techniques to look at protein variants, and until after 1983, when Marty Kreitman and others started using sequencing to survey genetic variation in populations. For many decades, then, it was a strange field, which was trying to ask deep questions about evolutionary processes without having access to the data that were needed for inferring what might have happened in the past.
That all changed in the late 1990s. As improvements in technologies for genome sequencing made it less labor intensive, a trickle of data started coming in. By the time I completed my degree it was clear that more were on the way and that statistical approaches would be needed to analyze them. Later, during a postdoc with Peter Donnelly in the Oxford University Department of Statistics, I became interested in the idea that it should be possible to learn not just the molecular basis of human adaptations, but when they occurred in our evolutionary history. I developed a statistical method to tackle this problem.
Around this time I met Svante Pääbo, who later became best known for sequencing the Neanderthal genome, but who at that point had just determined that the gene FOXP2 had been under natural selection at some point in human evolution. The gene is of particular interest because it plays a role in speech and language development in humans. His lab had shown evidence that the gene was involved in some kind of adaptation in human ancestors, but he didn’t have a clear sense of how to date when that adaptive change to the gene occurred. He invited me to work at the Max Planck Institute in Leipzig, and I accepted, both because I was interested in this specific application and because it got me closer to real data, which I had been chasing at the time.
In the years since then, thousands of genomes of humans and most organisms you can think of have become available, and population genetics has become a field with almost unlimited data. This has made it possible to ask many new kinds of questions and finally put century-old theories to the test. In my current work I focus on adaptation and the processes that generate genetic variation.
How do statistical and computational methods help to explore these kinds of issues?
I’ve always been interested in questions related to the origins of genetic differences among humans and other species: What fraction of those differences confers fitness advantages or is there by chance? What are the processes that produce genetic variation? And how does that variation play out through population dynamics to bring about adaptations such as bipedalism in humans or eyespots on butterflies?
To conceive of how we investigate this, think of your ancestry. You have many ancestors in your family tree, but in any particular position of your genome, you only inherit DNA from two of them. Which of the two ancestors you inherit DNA from changes across positions in the genome because of recombination, the shuffling of segments of DNA that is a natural part of cell division. Over long time scales, this means that if you compare your chromosomes to someone else’s, you might have close ancestors in common for some bits of your genome, but not for others.
Now imagine that a beneficial mutation arises in somebody. By definition, carriers of that beneficial mutation leave more offspring, and so it spreads through the population faster than variants that have a bad effect or no effect. Using computational means to analyze genomic data, we can observe genetic variation and identify regions of the genome where all individuals in our cohorts are very similar — unusually similar. We might look at hundreds of individuals, asking how many bits of DNA they have in common in any given position of the genome. These regions of high similarity indicate variants that spread very rapidly, and give us a hint that something in those regions was beneficial in the context of natural selection. We want to know how that process occurs, how long it takes, how strongly beneficial it was, and how many genetic alterations it takes to bring about complicated adaptations. In a sense, we’re using experimentally obtained genomic data and statistical methods to reverse engineer the processes that drive evolution.
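The idea of scanning for unusually similar regions can be illustrated with a minimal sketch. It assumes a handful of aligned sequences and measures the average pairwise difference per site in consecutive windows; the toy sequences, the function names, and the window size are all hypothetical and are not taken from the lab's own methods. Real analyses also have to account for variation in mutation rate, recombination, and demographic history, so low diversity alone is only a hint of selection.

```python
from itertools import combinations


def pairwise_diversity(haplotypes):
    """Mean number of differences per site, averaged over all pairs of sequences."""
    n_sites = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / (len(pairs) * n_sites)


def windowed_diversity(haplotypes, window):
    """Diversity in consecutive, non-overlapping windows along the alignment."""
    n_sites = len(haplotypes[0])
    return [
        (start, pairwise_diversity([h[start:start + window] for h in haplotypes]))
        for start in range(0, n_sites - window + 1, window)
    ]


# Hypothetical toy alignment: 4 sequences, 12 sites.
# The middle window (sites 4-7) is identical in every sequence.
toy = [
    "ACGTAAAATGCA",
    "ACTTAAAATGGA",
    "AGGTAAAATCCA",
    "ACGAAAAATGCT",
]

scan = windowed_diversity(toy, window=4)
lowest = min(scan, key=lambda item: item[1])  # window with the least variation
for start, diversity in scan:
    print(f"sites {start}-{start + 3}: diversity = {diversity:.3f}")
```

In this toy example the window with the lowest diversity is the middle one, where every sequence is identical. In real data such a window would be a candidate region to examine further, not proof that a beneficial variant spread there.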
So in addition to using computational methods to identify genetic variation, would you say that you are trying to understand the underlying genomic machinery that actually drives this variation?
Yes, I think that’s right. The sources of all genetic variation, whether it leads to a disease or an adaptation, are chance changes to the genome. These can be mutation events or alterations that occur during recombination. For a long time, I’ve been interested in viewing these phenomena not just as inputs that produce other traits, but as traits that are themselves specified genetically. For example, the repair factors that determine how frequently mutations occur in a particular organism or that specify where in the genome the machinery for recombination should place itself — called recombination hotspots — are themselves genetically specified. This means that evolution acts on the very inputs to the processes that enable adaptation and natural selection.
This perspective raises several questions. For example, the mutation rate across species tends to be low but not zero; why is that? Or why does recombination occur in genes in yeast but more often outside of genes in humans? What are the evolutionary consequences of these types of findings? In this sense, I’m interested in understanding not just the mechanisms that produce genetic variation, but also why those values and patterns exist, and why they differ between species.
You mentioned you are interested in understanding how genetic variation enables adaptation. Can you give an example of your work on this topic?
One area that we’re interested in exploring is the benefits of persistent genetic variation. If you consider variation in things like eye color, for example, variation itself can be tens of thousands to hundreds of thousands of years old; in exceptional cases it might be a million years old. In a paper we published a couple of years ago, we were interested in trying to see whether there are regions of the genome where the variation we see among humans is unusually old, where selection led to the maintenance of diversity in the population rather than one type outcompeting another.
We took a set of 120 human samples and 10 chimpanzee samples, and asked whether there are regions of the genome in which stability in variation among humans is so old that it predates the evolutionary split with chimpanzees. We had previously shown that the A and B blood groups are extremely old variations. They’re millions of years old, so old that humans and gibbons have the same blood types because they inherited variants that were already present in their common ancestor. We were looking for similar regions in chimpanzees and found dozens of DNA segments where the same haplotypes are present in both humans and chimpanzees.
When identical, cross-species variation is so old, you know it’s not there by chance, because if it were it would have been lost over the ages due to genetic drift. There must be some form of selection that makes it advantageous to maintain variation in the population, and we wanted to understand how this happens.
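The argument that neutral variation is eventually lost to drift can be illustrated with a small simulation. This is a sketch under simplifying assumptions (a Wright-Fisher model with a constant number of gene copies, no selection, and no new mutation); the population size, starting frequency, and function name are hypothetical choices for illustration and do not come from the study.

```python
import random


def generations_until_loss_of_variation(n_copies, start_freq, max_generations, rng):
    """Simulate neutral drift for one variant in a Wright-Fisher population.

    Returns the generation at which the variant is either lost or fixed
    (that is, when variation at the site disappears), or None if both
    versions still coexist after max_generations.
    """
    count = round(start_freq * n_copies)
    for generation in range(1, max_generations + 1):
        freq = count / n_copies
        # Each gene copy in the next generation picks a parent at random.
        count = sum(rng.random() < freq for _ in range(n_copies))
        if count == 0 or count == n_copies:
            return generation
    return None


rng = random.Random(1)  # fixed seed so the run is repeatable
times = [
    generations_until_loss_of_variation(100, 0.5, 5000, rng)
    for _ in range(20)
]
still_polymorphic = sum(t is None for t in times)
print("runs still polymorphic after 5000 generations:", still_polymorphic)
```

With only 100 gene copies, variation at a neutral site is expected to disappear within a few hundred generations. That is why variants shared across species for millions of years point to some form of selection maintaining them.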
We found that many of the regions that humans and chimps share are involved in the production of membrane glycoproteins, a class of proteins that viruses use to enter host cells and that some bacteria imitate to evade being attacked by the host immune system. This finding led us to hypothesize that some of the examples we found of persistent variation could be related to pathogen-host coevolution. Typically, as resistance to a pathogen builds up in the human population, the pathogen has less of a foothold in its hosts. But as the pathogen evolves, the frequency of resistance decreases and then the pathogen shores itself back up. Over time you see a cycle that maintains both resistant and nonresistant host types in the population. We think that the long-term, cross-species persistence of variation in the genes that mediate these relationships could reflect this.
Recently, a paper was published in which the authors conducted a genome-wide association study to map the genetic basis of susceptibility to malaria in Africa. Interestingly, the one significant hit of a genetic variant that modulates whether a person is susceptible to malaria or not was on one of the regions that we had found in our earlier paper as being unusually old in humans. We were happy to see this, because their discovery was consistent with the notion that variability is maintained in genomic regions involved in host-pathogen coevolution.
Modern population genetics clearly has a role in explaining features of evolution and natural selection, but are there ways in which it intersects with other biological fields?
In a lot of molecular biology or cell biology, an experiment involves breaking a system in some way — perhaps by introducing a mutation or doing a mutant screen — and then learning the function of the particular gene or pathway you broke by observing the experiment’s effects. But when you study genetic variation using a population genetics approach, you’re in principle looking at the results of a huge mutagenesis experiment that has already been performed.
All of us are essentially mutants, huge allelic series walking around, and we all show differences in the activity of particular genes and in fluxes in different molecular signaling pathways. We’re like a living molecular biology experiment, with the difference that you know that the individuals you are studying can survive with whatever mutations they have. Species lose genetic changes that are hugely deleterious through natural selection, so population genetics allows you to computationally try out all combinations of variants that an organism can have and still survive. This provides a tremendous amount of information to be mined about all kinds of genetic processes and how they work.
A recent study in the Przeworski Lab showed that the zebra finch and long-tailed finch share recombination hotspots that have been conserved over several millions of years of evolution. (Courtesy: Science.)
From this perspective, one thing that’s exciting about the analysis of genetic variation today is the possibility of learning about biology in a much broader set of organisms. The reason biologists study mice and fruit flies is that well-established genetic tools exist for these systems, and they can easily be bred. But instead of engineering mice with specific defects, it is now becoming possible to look at different individuals and use computational approaches to identify genetic factors that differentiate them with respect to particular traits. In this way, statistical approaches based in population genetics also allow you to learn a lot about the genetics of non-model organisms, because you don’t actually have to do the breeding and knockout experiments that are necessary for traditional animal models.
For example, in a recent paper our lab published in Science, we studied zebra finches. In the past, the species has been used as a model organism for learning, but in terms of genetic resources not much exists. Yet by using statistical approaches for analyzing genomic data, we showed that we can learn a great deal about meiotic recombination in birds. What I’m particularly excited about these days is taking a very broad taxonomic perspective, beyond the handful of model organisms, to ask how recombination works in animals like fish, frogs, snakes, and turtles, and get a sense of the whole breadth of what is possible — and what’s not possible — and how and why it changes over time.
Do the findings coming out of population genetics have implications for human health?
Population genetics has many implications for disease genetics, specifically for understanding genetic susceptibility to complex diseases. Population genetics aims to interpret genetic variation. Predicting who is more susceptible to a disease, mapping the relevant genetic variants, and distinguishing the roles of environment and genetics in causing a disease are all traditional population and quantitative genetics questions. In some sense, human genetics today is applied population genetics: looking at variation among humans and trying to relate it to variation in phenotypes.
There are many other areas of synergy between population genetics and other areas in biology: for example, my lab works on variation in recombination, which is a key to understanding aneuploidy (i.e., when fetuses have the wrong number of chromosomes), a major source of infertility in humans. It turns out that the major mechanism by which this occurs involves errors in recombination, and so understanding what is tolerable and what is not tolerable in terms of variation in recombination is critical to understanding the genesis of this phenomenon.
With all of the data you now have to work with, it sounds like it’s a golden age for population genetics. Do you see other developments on the horizon?
I think it is a golden age for the field, not just because there are a lot of data, but also because it is now possible to test the many evolutionary theories that population genetics has produced over quite a long period. Considering that it’s only recently that we have this kind of genomic data, it’s fascinating to me that Darwin in particular turned out to be right about almost everything he wrote. The level of detail in the evidence we can now marshal is mind-boggling, and it all supports his theory. I also continue to find it remarkable how useful the models developed by earlier population geneticists have turned out to be for making sense of the data people are generating today. Even in the data-driven age we find ourselves in, their continued relevance speaks to the power of abstract reasoning.
For 30 years the evolutionary literature has posited, for example, that if you see conservation of a particular bit of DNA in distantly related species, it must be really important. Now people are using this concept when they scan for mutations that cause disease. If you perform genome sequencing on a patient and notice that there are two mutations in a gene, you might ask whether this is unusual. Information about whether that gene is evolutionarily conserved and whether mutations in it can be tolerated can be hugely informative, as it can help to prioritize the mutations that are most likely to cause disease, as opposed to those that occurred by chance in that patient. Here, evolutionary principles that were developed long before these kinds of data became available turn out to be really powerful.
In general, my sense is that there is an increasing synergy between many areas of biology and human genetics. In some ways we might be seeing the end of population genetics, as analysis of genetic variation becomes an essential part of molecular biology, cell biology, neuroscience, and other fields. Over time, a lot of tools developed within the context of population genetics are likely to be adopted by those disciplines. Perhaps they will even subsume population genetics, while evolutionary biology will keep asking more specific questions about evolution and natural selection.
— Interview by Chris Williams
Enard W, Przeworski M, Fisher SE, Lai CS, Wiebe V, Kitano T, Monaco AP, Pääbo S. Molecular evolution of FOXP2, a gene involved in speech and language. Nature. 2002 Aug 22;418(6900):869-72.
Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G, Donnelly P, McVean G, Przeworski M. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science. 2013 Mar 29;339(6127):1578-82.
Ségurel L, Thompson EE, Flutre T, Lovstad J, Venkat A, Margulis SW, Moyse J, Ross S, Gamble K, Sella G, Ober C, Przeworski M. The ABO blood group is a trans-species polymorphism in primates. Proc Natl Acad Sci U S A. 2012 Nov 6;109(45):18493-8.
Singhal S, Leffler EM, Sannareddy K, Turner I, Venn O, Hooper DM, Strand AI, Li Q, Raney B, Balakrishnan CN, Griffith SC, McVean G, Przeworski M. Stable recombination hotspots in birds. Science. 2015 Nov 20;350(6263):928-32.
Kermany AR, Ségurel L, Oliver TR, Przeworski M. TroX: a new method to learn about the genesis of aneuploidy from trisomic products of conception. Bioinformatics. 2014 Jul 15;30(14):2035-42.