OPEN ACCESS Freely available online                                                         PLOS COMPUTATIONAL BIOLOGY



The Time Scale of Evolutionary Innovation
Krishnendu Chatterjee1*, Andreas Pavlogiannis1, Ben Adlam2, Martin A. Nowak2

1 IST Austria, Klosterneuburg, Austria, 2 Program for Evolutionary Dynamics, Department of Organismic and Evolutionary Biology, Department of Mathematics, Harvard
University, Cambridge, Massachusetts, United States of America


     Abstract
     A fundamental question in biology is the following: what is the time scale that is needed for evolutionary innovations?
     There are many results that characterize single steps in terms of the fixation time of new mutants arising in populations of
     certain size and structure. But here we ask a different question, which is concerned with the much longer time scale of
     evolutionary trajectories: how long does it take for a population exploring a fitness landscape to find target sequences that
     encode new biological functions? Our key variable is the length, L, of the genetic sequence that undergoes adaptation. In
     computer science there is a crucial distinction between problems that require algorithms which take polynomial or
     exponential time. The latter are considered to be intractable. Here we develop a theoretical approach that allows us to
     estimate the time of evolution as a function of L. We show that adaptation on many fitness landscapes takes time that is
     exponential in L, even if there are broad selection gradients and many targets uniformly distributed in sequence space.
     These negative results lead us to search for specific mechanisms that allow evolution to work on polynomial time scales. We
     study a regeneration process and show that it enables evolution to work in polynomial time.

  Citation: Chatterjee K, Pavlogiannis A, Adlam B, Nowak MA (2014) The Time Scale of Evolutionary Innovation. PLoS Comput Biol 10(9): e1003818.
  doi:10.1371/journal.pcbi.1003818
  Editor: Niko Beerenwinkel, ETH Zurich, Switzerland
  Received December 13, 2013; Accepted July 21, 2014; Published September 11, 2014
  Copyright: © 2014 Chatterjee et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
  unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
  Funding: Austrian Science Fund (FWF) Grant No P23499-N23, FWF NFN Grant No S11407-N23 (RiSE), ERC Start grant (279307: Graph Games), and Microsoft
  Faculty Fellows award. Support from the John Templeton foundation is gratefully acknowledged. The funders had no role in study design, data collection and
  analysis, decision to publish, or preparation of the manuscript.
  Competing Interests: The authors have declared that no competing interests exist.
  * Email: Krishnendu.Chatterjee@ist.ac.at


Introduction

   Our planet came into existence 4.6 billion years ago. There is clear chemical evidence for life on
earth 3.5 billion years ago [1,2]. The evolutionary process generated procaria, eucaria and complex
multi-cellular organisms. Throughout the history of life, evolution had to discover sequences of
biological polymers that perform specific, complicated functions. The average length of bacterial
genes is about 1000 nucleotides, that of human genes about 3000 nucleotides. The longest known
bacterial gene contains more than $10^5$ nucleotides, the longest human gene more than $10^6$. A
basic question is what time scale evolution requires to discover the sequences that perform desired
functions. While many results exist for the fixation time of individual mutants [3-15], here we ask
how the time scale of evolution depends on the length L of the sequence that needs to be adapted. We
consider the crucial distinction of polynomial versus exponential time [16-18]. A time scale that
grows exponentially in L is infeasible for long sequences.
   Evolutionary dynamics operates in sequence space, which can be imagined as a discrete
multi-dimensional lattice that arises when all sequences of a given length are arranged such that
nearest neighbors differ by one point mutation [19]. For constant selection, each point in sequence
space is associated with a non-negative fitness value (reproductive rate). The resulting fitness
landscape is a high dimensional mountain range. Populations explore fitness landscapes searching for
elevated regions, ridges, and peaks [20-27].
   A question that has been extensively studied is how long it takes for existing biological
functions to improve under natural selection. This problem leads to the study of adaptive walks on
fitness landscapes [15,20,21,28,29]. In this paper we ask a different question: how long does it take
for evolution to discover a new function? More specifically, our aim is to estimate the expected
discovery time of new biological functions: how long does it take for a population of reproducing
organisms to discover a biological function that is not present at the beginning of the search? We
will discuss two approximations for rugged fitness landscapes. We also discuss the significance of
clustered peaks.
   We consider an alphabet of size four, as is the case for DNA and RNA, and a nucleotide sequence of
length L. We consider a population of size N, which reproduces asexually. The mutation rate, u, is
small: individual mutations are introduced and evaluated by natural selection and random drift one at
a time. The probability that the evolutionary process moves from a sequence i to a sequence j, which
is at Hamming distance one from i, is given by $P_{i,j} = N\,[u/(3L)]\,\rho_{i,j}$, where
$\rho_{i,j}$ is the fixation probability of sequence j in a population consisting of sequence i. In
the special case of a flat fitness landscape, we have $\rho_{i,j} = 1/N$, and $P_{i,j} = u/(3L)$.
Thus we have an evolutionary random walk, where each step is a jump to a neighboring sequence of
Hamming distance one.

Results

   Consider a high-dimensional sequence space. A particular biological function can be instantiated
by some of the sequences. Each sequence i has a fitness value $f_i$, which measures the ability of the
sequence i to encode the desired function. Biological fitness landscapes are typically expected to
have many peaks [29-31].
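
The random walk on a flat landscape described in the Introduction can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes neutral evolution (every step jumps to one of the 3L neighbors with equal probability), encodes the four letters as the integers 0 to 3, and tracks the Hamming distance to an arbitrary reference sequence (here all zeros). The function name `simulate_neutral_walk`, the seed, and the parameter values are hypothetical choices made for this example.

```python
import random


def simulate_neutral_walk(L, steps, seed=0):
    """Neutral walk over a 4-letter alphabet.

    Returns the Hamming distance to the all-zero reference sequence
    after each step. The walk starts exactly at the reference.
    """
    rng = random.Random(seed)
    seq = [0] * L
    distance = 0
    trace = []
    for _ in range(steps):
        pos = rng.randrange(L)                      # position hit by the mutation
        old = seq[pos]
        new = rng.choice([a for a in range(4) if a != old])  # one of the 3 other letters
        seq[pos] = new
        # Update the distance incrementally instead of recounting mismatches.
        distance += (new != 0) - (old != 0)
        trace.append(distance)
    return trace


if __name__ == "__main__":
    L = 200
    trace = simulate_neutral_walk(L, steps=5000)
    tail = trace[-2000:]
    print("mean distance / L over the last 2000 steps:", sum(tail) / len(tail) / L)
```

Because a uniformly random sequence differs from any fixed reference in about three quarters of its positions, the distance in such a run is expected to settle near 3L/4 even though the walk starts on the reference itself.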


PLOS Computational Biology | www.ploscompbiol.org                                                      September 2014 | Volume 10 | Issue 9 | e1003818






   Author Summary

  Evolutionary adaptation can be described as a biased, stochastic walk of a population of sequences
  in a high dimensional sequence space. The population explores a fitness landscape. The
  mutation-selection process biases the population towards regions of higher fitness. In this paper we
  estimate the time scale that is needed for evolutionary innovation. Our key parameter is the length
  of the genetic sequence that needs to be adapted. We show that a variety of evolutionary processes
  take exponential time in sequence length. We propose a specific process, which we call
  'regeneration processes', and show that it allows evolution to work on polynomial time scales. In
  this view, evolution can solve a problem efficiently if it has solved a similar problem already.

They can be highly rugged due to epistatic effects of mutations [32-34]. They can also contain large
regions or networks of neutrality [20,21]. Empirical studies of short RNA sequences have revealed that
the underlying fitness landscape has low peak density [35]: around 15 peaks in $4^{24}$ sequences.
   For the purpose of estimating the expected discovery time we can approximate the fitness landscape
with a binary step function over the sequence space. We discuss two different approximations
(Figure 1). For the first approximation, we consider the scenario where fitness values below some
threshold, $f_{min}$, have negligible contribution; those sequences do not instantiate the desired
function (either not at all or only below the minimum level that could be detected by natural
selection). We approximate the rugged fitness landscape as follows: if $f_i < f_{min}$, then
$f_i = 0$; if $f_i \ge f_{min}$, then $f_i = 1$. The set of sequences with $f_i \ge f_{min}$
constitutes the target set, and the remaining fitness landscape is neutral.
   The second approximation works as follows. Consider the evolutionary process exploring a rugged
fitness landscape where the goal is to attain a fitness level $f^*$. Local maxima below $f^*$ slow
down the evolutionary process to attain $f^*$, because the evolutionary walk might get stuck in those
local maxima. In order to derive lower bounds for the expected discovery time, the rugged fitness
landscape can be approximated as follows. Let $f'$ be the fitness value of the highest local maximum
below $f^*$. Then for every sequence in a mountain range with a local maximum below $f^*$ we assign
the fitness value $f'$. The mountain ranges with local maxima above $f^*$ are the target sequences.
Note that the target set includes sequences that start at the upslope of mountain ranges with peaks
above $f^*$. Thus, again we obtain a fitness landscape with clustered targets and a neutral region,
where the neutral region consists of all sequences whose fitness values have been assigned to $f'$.
The two approximations are illustrated in Figure 1. For $f^* = f_{min}$, the second approximation
generates larger target areas than the first approximation and is therefore more lenient.
   Our key results for estimating the discovery time can now be formulated for binary fitness
landscapes, but they apply to any type of rugged landscape using one of the two approximations. We
note that our methods can also be applied for certain non-binary fitness landscapes, and an example of
a fitness landscape with a large gradient arising from multiplicative fitness effects is discussed in
Sections 6 and 7 of Text S1.
   We now present our main results in the following order. We first estimate the discovery time of a
single search aiming to find a single broad peak. Then we study multiple simultaneous searches for a
single broad peak. Finally, we consider multiple broad peaks that are uniformly randomly distributed
in sequence space.
   We first study a broad peak of target sequences described as follows: consider a specific sequence;
any sequence within a certain Hamming distance of that sequence belongs to the target set.
Specifically, we consider that the evolutionary process has succeeded if the population discovers a
sequence that differs from the specific sequence in no more than a fraction c of positions. We refer
to the specific sequence as the target center and c as the width (or radius) of the peak. For example,
if L = 100 and c = 0.1, then the target center is surrounded by a cloud of approximately $10^{18}$
sequences. For a single broad peak with width c, the target set contains at least $2^{cL}/(3L)$
sequences, which is an exponential function of L. The fitness landscape outside the broad peak is
flat. We refer to this binary fitness landscape as a broad peak landscape. The population needs to
discover any one of the target sequences in the broad peak, starting from some sequence that is not in
the broad peak. We establish the following result.
   Theorem 1. Consider a single search exploring a broad peak landscape with width c and mutation
rate u. The following assertions hold:

 • if $c < 3/4$, then there exists $L_0 \in \mathbb{N}$ such that for all sequence spaces of sequence
   length $L \ge L_0$, the expected discovery time is at least
   $\exp\left[(3-4c)\,\frac{L}{16}\,\log\frac{6}{4c+3}\right]$;
 • if $c \ge 3/4$, then for all sequence spaces of sequence length L, the expected discovery time is
   at most $O(L^3/u)$.

   Our result can be interpreted as follows (see Theorem S2 and Corollary S2 in Text S1): (i) if
$c < 3/4$, then the expected discovery time is exponential in L; and (ii) if $c \ge 3/4$, then the
expected discovery time is polynomial in L. Thus, we have derived a strong dichotomy result which
shows a sharp transition from polynomial to exponential time depending on whether a specific condition
on c does or does not hold.
   For the four letter alphabet most random sequences have Hamming distance 3L/4 from the target
center. If the population is further away than this Hamming distance, then random drift will bring it
closer. If the population is closer than this Hamming distance, then random drift will push it further
away. This argument constitutes the intuitive reason that c = 3/4 is the critical threshold. If the
peak has a width of less than c = 3/4, then we prove that the expected discovery time by random drift
is exponential in the sequence length L (see Figure 2). This result holds for any population size, N,
as long as $4^L \gg N$, which is certainly the case for realistic values of L and N. In the Text S1 we
also present a more general result, where along with a single broad peak, instead of a flat landscape
outside the peak we consider a multiplicative fitness landscape and establish a sharp dichotomy result
that generalizes Theorem 1 (see Corollary S2 in Text S1).
   Remark 1. We highlight two important aspects of our results.

 1. First, when we establish exponential lower bounds for the expected discovery time, these lower
    bounds hold even if the starting sequence is only a few steps away from the target set.
 2. Second, we present strong dichotomy results, and derive mathematically the most precise and
    strongest form of the boundary condition.

   Let us now give a numerical example to demonstrate that exponential time is intractable. Bacterial
life on earth has been around for at least 3.5 billion years, which corresponds to $3 \times 10^{13}$
hours. Assuming fast bacterial cell division of 20-30 minutes on average we have at most $10^{14}$
generations. The expected discovery






[Figure 1: panel images not reproduced; the horizontal axis of each panel is the projection of sequence space.]
Figure 1. Approximations of a highly rugged fitness landscape by broad peaks and neutral regions. The figures depict examples of
highly rugged fitness landscapes where the sequence space has been projected in one dimension. (A) Sequences with fitness below some level $f_{min}$
are functionally very different from the desired function, and selection cannot act upon them. All other sequences are considered as targets. The fitness
landscape is approximated by a step function: if $f_i < f_{min}$, then $f_i = 0$, otherwise $f_i = 1$. (B) Local maxima below the desired fitness threshold $f^*$ are
known to slow down the evolutionary random walk towards sequences that attain fitness at least $f^*$. We approximate the fitness landscape by broad
peaks and neutral regions by increasing the fitness of every sequence that belongs to a mountain range with fitness below $f^*$ to the maximal local
maximum $f'$ below $f^*$. Note that the target set starts from the upslope of a mountain range whose peak exceeds $f^*$.
doi:10.1371/journal.pcbi.1003818.g001
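
The first approximation in Figure 1(A) amounts to thresholding the landscape. The sketch below shows that step on a one-dimensional projection, under stated assumptions: the fitness values and the threshold are made-up numbers for illustration only, and the helper names `binarize_landscape` and `target_clusters` are hypothetical rather than taken from the paper.

```python
def binarize_landscape(fitness, f_min):
    """Step-function approximation: 1 where fitness >= f_min (target), 0 elsewhere (neutral)."""
    return [1 if f >= f_min else 0 for f in fitness]


def target_clusters(binary):
    """Group consecutive target positions of a 1-D projection into (start, end) index ranges."""
    clusters = []
    start = None
    for i, b in enumerate(binary):
        if b and start is None:
            start = i                       # a new broad peak begins here
        elif not b and start is not None:
            clusters.append((start, i - 1))  # the current peak ended at the previous index
            start = None
    if start is not None:
        clusters.append((start, len(binary) - 1))
    return clusters


if __name__ == "__main__":
    # Hypothetical fitness values along a projected sequence space.
    fitness = [0.1, 0.4, 0.9, 0.7, 0.2, 0.3, 0.8, 0.1]
    binary = binarize_landscape(fitness, f_min=0.5)
    print(binary)
    print(target_clusters(binary))
```

Everything below the threshold collapses into a single neutral plateau, which is why the remaining analysis only needs the location and width of the target clusters.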


time for a sequence of length L = 1000 with a very large broad peak of c = 1/2 is approximately
$10^{65}$ generations; see Table 1.
   If individual evolutionary processes cannot find targets in polynomial time, then perhaps the
success of evolution is based on the fact that many populations are searching independently and in
parallel for a particular adaptation. We prove that multiple, independent parallel searches are not
the solution of the problem, if the starting sequence is far away from the target center. Formally we
show the following result.
   Theorem 2. In all cases where the lower bound on the expected discovery time is exponential, for
all polynomials $p_1(\cdot)$, $p_2(\cdot)$ and $p_3(\cdot)$, for any starting sequence with Hamming
distance at least 3L/4 from the target center, the probability for any one out of $p_3(L)$ independent
multiple searches to reach the target set within $p_1(L)$ steps is at most $1/p_2(L)$.
   If an evolutionary process takes exponential time, then polynomially many independent searches do
not find the target in polynomial time with reasonable probability (for details see


[Figure 2: images not reproduced. Panel A is titled exp(L) and panel B is titled poly(L), each with Hamming distance on the horizontal axis; panel C has sequence length, L, on the horizontal axis.]
Figure 2. Broad peak with different fitness landscapes. For the broad peak there is a specific sequence, and all sequences that are within
Hamming distance cL are part of the target set. The fitness landscape is flat outside the broad peak. (A) If the width of the broad peak is c < 3/4, then
the expected discovery time is exponential in sequence length, L. (B) If the width of the broad peak is c ≥ 3/4, then the expected discovery time is
polynomial in sequence length, L. (C) Numerical calculations for broad peak fitness landscapes. We observe exponential expected discovery time for
c = 1/3 and c = 1/2, whereas polynomial expected discovery time for c = 3/4.
doi:10.1371/journal.pcbi.1003818.g002
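
The qualitative behavior in Figure 2(C) can be reproduced with a small exact calculation. The sketch below is an illustration under stated assumptions, not the authors' numerical method: it assumes a neutral walk outside the peak, so that from Hamming distance d the walk moves closer with probability d/(3L) and further away with probability (L-d)/L (the two balance at d = 3L/4). It counts jumps of the walk rather than generations, and by default it starts at the largest possible distance d = L. The function name `expected_steps_to_peak` is hypothetical.

```python
import math
from fractions import Fraction


def expected_steps_to_peak(L, c, start=None):
    """Expected number of neutral steps until the Hamming distance to the
    target center is at most floor(c * L).

    Pass c as a Fraction (for example Fraction(1, 2)) so that floor(c * L) is exact.
    """
    k = math.floor(c * L)           # largest distance that still lies inside the peak
    if start is None:
        start = L
    if start <= k:
        return Fraction(0)
    # h holds the expected number of steps to get from distance d to d - 1.
    # At d = L no step can increase the distance, and a step decreases it
    # with probability 1/3, so h = 3 there.
    h = Fraction(3)
    total = Fraction(0)
    for d in range(L, k, -1):
        if d < L:
            p_closer = Fraction(d, 3 * L)
            p_farther = Fraction(L - d, L)
            # First-step analysis: h_d * p_closer = 1 + p_farther * h_{d+1}
            h = (1 + p_farther * h) / p_closer
        if d <= start:
            total += h
    return total


if __name__ == "__main__":
    for c in (Fraction(1, 3), Fraction(1, 2), Fraction(3, 4)):
        times = [float(expected_steps_to_peak(L, c)) for L in (20, 40, 60)]
        print(f"c = {c}: {times}")
```

Doubling L multiplies the result by a large factor for c = 1/3 and c = 1/2 but only by a small one for c = 3/4, in line with the dichotomy of Theorem 1.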





 Table 1. Numerical data for discovery time in flat fitness landscapes.

                 c = 1/3                  c = 1/2                  c = 3/4
  L = 10^2       1.02·10^18               7.36·10^7                183
  L = 10^3       5.89·10^[…]              1.28·10^65               2666

 Numerical data for the discovery time of broad peaks with width c = 1/3, 1/2, and 3/4 embedded in flat fitness landscapes. First the discovery time is computed for
 small values of L as shown in Figure 2(C). Then the exponential growth is extrapolated to L = 100 and L = 1000, respectively. We show the discovery times for c = 1/2
 and 1/3. For c = 3/4 the values are polynomial in L.
 doi:10.1371/journal.pcbi.1003818.t001
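
The arithmetic behind the paper's numerical examples is easy to check. The sketch below is a back-of-the-envelope illustration, not part of the paper. It converts 3.5 billion years into a generation budget, then evaluates the informal bound 1 - exp(-Mb/d) on the probability that any of M independent searches succeeds within b steps when a single search needs d steps in expectation. The function names are hypothetical, the year length of 365.25 days is an assumption, and the count of 10^24 searches is taken as 10^30 cells divided by a population size of 10^6.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # assumed average year length


def generation_budget(years, minutes_per_generation):
    """Number of generations that fit into the given time span."""
    return years * HOURS_PER_YEAR * 60 / minutes_per_generation


def multi_search_success_bound(num_searches, steps, expected_time):
    """Informal upper bound 1 - exp(-M * b / d) on the success probability.

    expm1 keeps precision when M * b / d is tiny.
    """
    return -math.expm1(-num_searches * steps / expected_time)


if __name__ == "__main__":
    print("hours:", 3.5e9 * HOURS_PER_YEAR)
    print("generations at 20 min:", generation_budget(3.5e9, 20))
    print("generations at 30 min:", generation_budget(3.5e9, 30))
    # M = 1e24 searches, b = 1e14 generations, d = 1e65 (c = 1/2, L = 1000)
    print("success bound:", multi_search_success_bound(1e24, 1e14, 1e65))
```

With these inputs the generation budget stays below 10^14 and the bound stays below 10^-26, so even a planet full of parallel searches does not offset an exponential discovery time.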

Theorem S5 in the Text S1). We also show an informal and approximate calculation of the success
probability for M independent searches, as follows: if the expected discovery time is exponential
(say, d), then the probability that all M independent searches fail up to b steps is at least
$\exp(-Mb/d)$ (i.e., the success probability within b steps of any of the searches is at most
$1 - \exp(-Mb/d)$), when the starting sequence is far away from the target center. In such a case, one
could quickly exhaust the physical resources of an entire planet. The estimated number of bacterial
cells [36] on earth is about $10^{30}$. To give a specific example let us assume that there are
$10^{24}$ independent searches, each with population size $N = 10^6$. The probability that at least
one of those independent searches succeeds within $10^{14}$ generations for sequence length L = 1000
and broad peak of c = 1/2 is less than $10^{-26}$.
   In our basic model, individual mutants are evaluated one at a time. The situation of many mutant
lineages evolving in parallel is similar to the multiple searches described above. As we show that
whenever a single search takes exponential time, multiple independent searches do not lead to
polynomial time solutions, our results imply intractability for this case as well.
   We now explore the case of multiple broad peaks that are uniformly and randomly distributed.
Consider that there are m target centers. Around each target center there is a selection gradient
extending up to a distance cL. Formally we can consider any fitness function f that assigns zero
fitness to a sequence whose Hamming distance exceeds cL from all the target centers, which in
particular is subsumed by considering the multiple broad peaks where around each center we consider a
broad peak of target set with peak width c. We establish the following result:
   Theorem 3. Consider a single search under the multiple broad peak fitness landscape of
$m \ll 4^L$ target centers chosen uniformly at random, with peak width at most c for each center and
$c < 3/4$. Then with high probability, the expected discovery time of the target set is at least
$(1/m)\exp[2L(3/4-c)^2]$.
   Whether or not the function $(1/m)\exp[2L(3/4-c)^2]$ is exponential in L depends on how m changes
with L. But even if we assume exponentially many broad peak centers, m, with peak width cL where
$c < 3/4$, we need not obtain polynomial time (Figure 3 and Theorem S6 in Text S1).
   It is known that recombination may accelerate evolution on certain fitness landscapes [28,37-39],
and recombination may also slow down evolution on other fitness landscapes [40]. Recombination,
however, reduces the discovery time only by at most a linear factor in sequence length
[28,37,38,41,42]. A linear or even polynomial factor improvement over an exponential function does not
convert the exponential function into a polynomial one. Hence, recombination can make a significant
difference only if the underlying evolutionary process without recombination already operates in
polynomial time.
   What are then adaptive problems that can be solved by evolution in polynomial time? We propose a
"regeneration process". The basic idea is that evolution can solve a new problem efficiently if it
has solved a similar problem already. Suppose gene duplication or genome rearrangement can give rise
to starting sequences that are at most k point mutations away from the target set, where k is a number
that is independent of L. It is important that starting sequences can be regenerated again and again.
We prove that $L^{k+1}$ many searches are sufficient in order to find the target in polynomial time
with high probability (see Figure 4 and Section 10 in Text S1). The upper bound, $L^{k+1}$, holds even
for neutral drift (without selection). Note that in this case, the expected discovery time for any
single search is still exponential. Therefore, most of the $L^{k+1}$ searches do not succeed in
polynomial time; however, with high probability one of the searches succeeds in polynomial time. There
are two key aspects to the "regeneration process": (a) the starting sequence is only a small number of
steps away from the target; and (b) the starting sequence can be generated repeatedly. This process
enables evolution to overcome the exponential barrier. The upper bound, $L^{k+1}$, may possibly be
further reduced if selection and/or recombination are included.

Discussion

   The regeneration process formalizes the role of several existing ideas. First, it ties in with the
proposal that gene duplications and genome rearrangements are major events leading to the emergence
of new genes [43]. Second, evolution can be seen as a tinkerer playing around with small modifications
of existing sequences rather than creating entirely new ones [44]. Third, the process is related to
Gillespie's suggestion [29] that the starting sequence for an evolutionary search must have high
fitness. In our theory, proximity in fitness value is replaced by proximity in sequence space.
However, our results show that proximity alone is insufficient to break the exponential barrier, and
only when combined with the process of regeneration does it yield polynomial discovery time with high
probability. Our process can also explain the emergence of orphan genes arising from non-coding
regions [45]. Section 12 of the Text S1 discusses the connection of our approach to existing results.
   There is one other scenario that must be mentioned. It is possible that certain biological
functions are hyper-abundant in sequence space [21] and that a process generating a large number of
random sequences will find the function with high probability. For example, Bartel & Szostak [46]
isolated a new ribozyme from a pool of about $10^{15}$ random sequences of length L = 220. While such
a process is conceivable for small effective sequence length, it cannot represent a general solution
for large L.
   Our theory has clear empirical implications. The regeneration process can be tested in systems of
in vitro evolution [47]. A
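
One way to see why on the order of L^(k+1) regenerated searches can suffice is a simple counting argument. The sketch below is an informal illustration and not the proof given in Text S1. It assumes neutral drift, a start exactly k point mutations away from one particular target sequence, and it only counts the most direct route: k consecutive steps that each repair one of the remaining mismatches, which happens with probability k!/(3L)^k. The function names are hypothetical.

```python
import math


def direct_hit_probability(L, k):
    """Probability that k consecutive neutral steps each repair one of the
    remaining mismatches: (k/3L) * ((k-1)/3L) * ... * (1/3L) = k! / (3L)^k."""
    return math.factorial(k) / (3 * L) ** k


def failure_probability(L, k, restarts):
    """Probability that none of `restarts` independent regenerated searches
    takes the direct k-step route to the target."""
    p = direct_hit_probability(L, k)
    # (1 - p) ** restarts, computed in log space for numerical stability.
    return math.exp(restarts * math.log1p(-p))


if __name__ == "__main__":
    L, k = 100, 2
    print("single direct hit:", direct_hit_probability(L, k))
    print("failure after L^(k+1) restarts:", failure_probability(L, k, L ** (k + 1)))
```

Under these assumptions the failure probability after L^(k+1) restarts is at most exp(-k! L / 3^k), which shrinks as L grows for fixed k, while a single search almost always misses the direct route.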





[Figure 3, panels A-E: plots not reproduced. Recoverable labels: "Start of search" (panel A); horizontal axes "Sequence length, L" and "Time limit".]

Figure 3. The search for randomly, uniformly distributed targets in sequence space. (A) The target set consists of m random sequences;
each one of them is surrounded by a broad peak of width up to cL. The figure shows a pictorial illustration where the L-dimensional sequence space
is projected onto two dimensions. From a randomly chosen starting sequence outside the target set, the expected discovery time is at least
$(1/m)\exp(2L(3/4 - c)^2)$, which can be exponential in L. (B) Computer simulations showing the average discovery time of m = 100, 150, and 200
targets, with c = 1/3. We observe exponential dependency on L. The discovery time is averaged over 200 runs. (C) Success probability estimated as
the fraction of the 200 searches that succeed in finding one of the target sequences within 101 generations. The success probability drops
exponentially with L. (D) Success probability as a function of time for L = 42, 45, and 48. (E) Discovery time for a large number of randomly generated
target sequences. Either m = 2t or m = 4t sequences were generated. For b = 0 and h = 3 the target set consists of balls of Hamming distance 0
and 3 (respectively) around each sequence. The figure shows the average discovery time of 100 runs. As expected we observe that the discovery time
grows exponentially with sequence length, L.
doi:10.1371/journal.pcbi.1003818.g003
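
The lower bound quoted in the Figure 3 caption, (1/m) exp(2L(3/4 - c)^2), can be checked numerically for small L. The following is a minimal sketch in Python, not the authors' code. It assumes a four-letter alphabet, a target ball of integer Hamming radius r = cL around each of m targets, and sequences drawn uniformly at random (the stationary regime described in the Methods). The function names are hypothetical. It compares the exact fraction of sequence space covered by one ball with the Hoeffding bound exp(-2L(3/4 - c)^2), and derives the corresponding lower bound on the expected number of uniform draws.

```python
from fractions import Fraction
from math import comb, exp


def ball_fraction(L: int, r: int) -> Fraction:
    """Exact fraction of the 4**L sequences within Hamming distance r of one fixed target."""
    # A sequence at distance k differs at k of L positions, with 3 choices per differing position.
    count = sum(comb(L, k) * 3**k for k in range(r + 1))
    return Fraction(count, 4**L)


def hoeffding_fraction_bound(L: int, r: int) -> float:
    """Hoeffding upper bound exp(-2 L (3/4 - c)^2) on ball_fraction, with c = r / L < 3/4."""
    c = r / L
    if c >= 0.75:
        raise ValueError("the bound is only meaningful for c < 3/4")
    return exp(-2 * L * (0.75 - c) ** 2)


def discovery_time_lower_bound(L: int, r: int, m: int) -> float:
    """Lower bound exp(2 L (3/4 - c)^2) / m on the expected number of uniform draws.

    By the union bound, one uniform draw lands in some ball with probability at most
    m times the single-ball bound, so the expected number of draws is at least the inverse.
    """
    return 1.0 / (m * hoeffding_fraction_bound(L, r))


if __name__ == "__main__":
    for L in (15, 30, 45):
        r = L // 3  # c = 1/3, as in panel B
        print(L, float(ball_fraction(L, r)), hoeffding_fraction_bound(L, r),
              discovery_time_lower_bound(L, r, m=100))
```

The exact ball fraction is computed with integer arithmetic so that the comparison with the bound is not affected by rounding in the binomial sum.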



[Figure 4: schematic not reproduced. Recoverable labels: "Process re-generating starting sequence at Hamming distance k from target"; "Target"; "Hamming distance".]

Figure 4. Regeneration process. Gene duplication (or possibly some other process) generates a steady stream of starting sequences that are
a constant number k of mutations away from the target. Many searches drift away from the target, but some will succeed in polynomially many
steps. We prove that L^(k+1) searches ensure that with high probability some search succeeds in polynomially many steps.
doi:10.1371/journal.pcbi.1003818.g004

starting sequence can be generated by introducing k point mutations in a known protein encoding
sequence of length L. If these point mutations destroy the function of the protein, then the
expected discovery time of any one attempt to find the original sequence should be exponential in
L. But only polynomially many searches in L are required to find the target with high probability
in polynomially many steps. The same setup can be used to explore whether the biological function
can be found elsewhere in sequence space: the evolutionary trajectory beginning with the starting
sequence could discover new solutions. Our theory also highlights how important it is to explore
the distribution of biological functions in sequence space both for RNA [20,21,35,46] and in the
protein universe [48].
   In summary, we have developed a theory that allows us to estimate time scales of evolutionary
trajectories. We have shown that various natural processes of evolution take exponential time as
function of the sequence length, L. In some cases we have established strong dichotomy results
for precise boundary conditions. We have proposed a mechanism that allows evolution in polynomial
time scales. Some interesting directions of future work are as follows: (1) Consider various
forms of rugged fitness landscapes and study more refined approximations as compared to




the ones we consider; and then estimate the expected discovery time for the refined
approximations. (2) While in this paper we characterize the difference between exponential and
polynomial for the expected discovery time, more refined analysis (such as efficiency for
polynomial time, like cubic vs quadratic time) for specific fitness landscapes using mechanisms
like recombination is another interesting problem.

Materials and Methods

   Our results are based on a mathematical analysis of the underlying stochastic processes. For
Markov chains on the one-dimensional grid, we describe recurrence relations for the expected
hitting time and present lower and upper bounds on the expected hitting time using combinatorial
analysis (see Text S1 for details). We now present the basic intuitive arguments of the main
results.

Markov chain on the one-dimensional grid
   For a single broad peak, due to symmetry we can interpret the evolutionary random walk as a
Markov chain on the one-dimensional grid. A sequence of type i is i steps away from the target,
where i is the Hamming distance between this sequence and the target. The probability that a type
i sequence mutates to a type i-1 sequence is given by ui/(3L). The stochastic process of the
evolutionary random walk is a Markov chain on the one-dimensional grid.

The basic recurrence relation
   Consider a Markov chain on the one-dimensional grid, and let H(j,i) denote the expected hitting
time from i to j. The general recurrence relation for the expected hitting time is as follows:

$$H(j,i) = 1 + P_{i,i+1} H(j,i+1) + P_{i,i-1} H(j,i-1) + P_{i,i} H(j,i) \tag{1}$$

for j < i < L, with boundary condition H(j,j) = 0. The interpretation is as follows. Given the
current state i, if i ≠ j, at least one transition will be made to a neighboring state i', with
probability $P_{i,i'}$, from which the hitting time is H(j,i').

Intuition behind Theorem 1
   Theorem 1 is derived by obtaining precise bounds for the recurrence relation of the hitting
time (Equation 1). Consider that $P_{k,k-1} > 0$ for all j < k ≤ i (i.e., progress towards state
j is always possible), as otherwise j is never reached from i. We show (see Lemma 2 in the Text
S1) that we can write H(j,i) as a sum, $H(j,i) = \sum_{n=L-i}^{L-j-1} b_n$, where $b_n$ is the
sequence defined as:

$$b_0 = \frac{1}{P_{L,L-1}}, \qquad b_n = \frac{1 + P_{L-n,L-n+1}\, b_{n-1}}{P_{L-n,L-n-1}} \quad \text{for } n > 0. \tag{2}$$

   The basic intuition obtained from Equation 2 is as follows: (i) If
$P_{L-n,L-n+1} / P_{L-n,L-n-1} \ge \lambda$, for some constant $\lambda > 1$, then the sequence
$b_n$ grows at least as fast as a geometric series with factor $\lambda$. (ii) On the other hand,
if $P_{L-n,L-n+1} / P_{L-n,L-n-1} \le 1$ and $P_{L-n,L-n-1} \ge \alpha$ for some constant
$\alpha > 0$, then the sequence $b_n$ grows at most as fast as an arithmetic series with
difference $1/\alpha$. From the above case analysis the result for Theorem 1 is obtained as
follows: If c < 3/4, then for all $cL < n < \frac{3+4c}{8} L$ we have
$P_{L-n,L-n+1} / P_{L-n,L-n-1} \ge \lambda$ for some $\lambda > 1$, and hence the sequence $b_n$
grows geometrically for a linear length in L. Then H(cL,i) is at least exponential in L for all
states i > cL (i.e., for all sequences outside of the target set). This corresponds to case 1 of
Theorem 1. On the other hand, if c ≥ 3/4, then it is $P_{L-n,L-n+1} / P_{L-n,L-n-1} \le 1$, and
case 2 of Theorem 1 is derived (for details see Corollary 2 in Text S1).

Intuition behind Theorem 2
   The basic intuition for the result is as follows: consider a single search for which the
expected hitting time is exponential. Then for the single search the probability to succeed in
polynomially many steps is negligible (as otherwise the expectation would not have been
exponential). In case of independent searches, the independence ensures that the probability that
all searches fail is the product of the probabilities that every single search fails. Using the
above arguments we establish Theorem 2 (for details see Section 8 in Text S1).

Intuition behind Theorem 3
   For this result, it is first convenient to view the evolutionary walk taking place in the
sequence space of all sequences of length L, under no selection. Each sequence has 3L neighbors,
and considering that a point mutation happens, the transition probability to each of them is
1/(3L). The underlying Markov chain due to symmetry has fast mixing time, i.e., the number of
steps to converge to the stationary distribution (the mixing time) is O(L log L). Again by
symmetry the stationary distribution is the uniform distribution. If c < 3/4, then from Theorem 1
we obtain that the expected time to reach a single broad peak is exponential. By union bound, if
m << 4^L, the probability to reach any of the m broad peaks within O(L log L) steps is negligible.
Since after the first O(L log L) steps the Markov chain converges to the stationary distribution,
then each step of the process can be interpreted as selection of sequences uniformly at random
among all sequences. Using Hoeffding's inequality, we show that with high probability, in
expectation $\exp(2(3/4 - c)^2 L)/m$ such steps are required before a sequence is found that
belongs to the target set. Thus we obtain the result of Theorem 3 (for details see Section 9 in
Text S1).

Remark about techniques
   An important aspect of our work is that we establish our results using elementary techniques
for analysis of Markov chains. The use of more advanced mathematical machinery, such as
martingales [49] or drift analysis [50,51], can possibly be used to derive more refined results.
While in this work our goal is to distinguish between exponential and polynomial time, whether
the techniques from [49-51] can lead to a more refined characterization within polynomial time is
an interesting direction for future work.

Supporting Information

Text S1 Detailed proofs for "The Time Scale of Evolutionary Innovation."
(PDF)
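
The hitting-time recurrence of the Methods (Equations 1 and 2) can be evaluated directly for small L. The following is a minimal sketch in Python, not the authors' implementation. It uses the downward probability ui/(3L) stated in the text. The upward probability u(L - i)/L is an assumption of this sketch (a neutral walk over a four-letter alphabet with no selection), and the function names are hypothetical. It sums the terms b_n to obtain the expected hitting time H(j,i).

```python
def transition_probs(i: int, L: int, u: float) -> tuple[float, float]:
    """Return (P_{i,i-1}, P_{i,i+1}) for a sequence at Hamming distance i from the target."""
    down = u * i / (3 * L)   # stated in the text: a mismatched site mutates to the target letter
    up = u * (L - i) / L     # assumption of this sketch: any mutation at a matching site moves away
    return down, up


def expected_hitting_time(j: int, i: int, L: int, u: float = 1.0) -> float:
    """Expected hitting time H(j, i) from distance i to distance j < i, via the b_n recurrence."""
    if not 0 <= j < i <= L:
        raise ValueError("require 0 <= j < i <= L")
    total = 0.0
    b_prev = 0.0
    for n in range(L - j):               # n = 0 .. L-j-1, i.e. states L down to j+1
        down, up = transition_probs(L - n, L, u)
        if n == 0:
            b = 1.0 / down               # b_0 = 1 / P_{L,L-1}
        else:
            b = (1.0 + up * b_prev) / down
        if n >= L - i:                   # H(j,i) sums b_n for n = L-i .. L-j-1
            total += b
        b_prev = b
    return total


if __name__ == "__main__":
    # Target ball of radius L/3 (c = 1/3 < 3/4), starting from the maximal distance L.
    for L in (12, 24, 36, 48):
        print(L, expected_hitting_time(L // 3, L, L))
```

Under these assumptions the printed values grow rapidly with L, which is the behavior expected for c < 3/4. Moving the target boundary j to 3L/4 or above gives a way to explore the second case numerically.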





Acknowledgments

We thank Nick Barton and Daniel Weissman for helpful discussions and pointing us to relevant literature.

Author Contributions

Conceived and designed the experiments: KC AP BA MAN. Analyzed the data: KC AP BA MAN. Wrote the paper: KC AP BA MAN.


Refer            