AbstractWe present a new approach to perform molecular simulations using evolutionary algorithms. The main application of this work is the simulation of dense amorphous polymers and the goal is to improve the efficiency of sampling, in other words to obtain valid samples from the phase state more rapidly. Our approach is based on parallel Markovian Monte Carlo simulations of the same physico-chemical system, where we optimise some Monte Carlo parameters by means of a real coded genetic algorithm.

Bertrand Braunschweig z, Herv´e Toulhoat z z Institut Franc¸ais du P´etrole 1 et 4, avenue de Bois-Pr´eau BP 311 - 92852 Rueil-Malmaison Cedex - France fBertrand.Braunschweig, [email protected]

namic properties through integrals of the form :

< A >=

Molecular simulations of dense amorphous polymers is facing the major challenge of sampling efficiency with the increase in the size and complexity of molecules : the potential energy surface of such systems are characterized by numerous local minima separated by very high barriers, hence difficult to sample either along the trajectories obtained from direct Molecular Dynamics (MD, see [1] [3]) or through conventional Markovian Monte Carlo simulations. Similar difficulties are encountered with protein folding simulations, that is translating the 1D genomic information into 3D bioactive edifice. Therefore, a major challenge for molecular simulation is to develop more efficient elementary moves ([7], [5]) in order to achieve more efficient exploration and sampling of configurational space for long chain molecules. Here, we are prototyping an approach using evolutionary algorithms (EA) in order to improve the statistical efficiency of Markovian Monte Carlo systems employing several types of such elementary moves. We show the basics of Markovian Monte Carlo simulations in section 2; section 3 presents the EA approach: encoding, fitness, algorithm, operators; section 4 details the Monte Carlo moves and the molecular model used; section 5 presents results on the case of polyethylene configurations, and is followed by a short conclusion.

x

2R A(x)exp( U (x))dx y 2 exp( U (y ))dy

=

B (x) =

R

exp( U (x)) y 2 exp( U (y ))dy

The main purpose of molecular simulation is to simulate matter at the atomic level in order to compute some thermody-

(2)

1 with kB T , kB the Boltzmann constant, T the temperature of the system, and U x its energy. The Monte Carlo scheme is a way to estimate this integral by the following average :

()

A^ =

( )

N X i=1

A(xi ) ; (xi )i=1:::N

2

(3)

with the xi i=1:::N some random trials according to B . The problem arises from the fact that one cannot take “direct” samples according to B , as its normalizing term would require to compute an integral over the whole phase space, which is infeasible for non trivial cases. The key point is then to design a Markov Chain, whose state space is our phase state, in such a way that it is ergodic and admits the desired limit distribution B (see [4]). This is done by applying the Metropolis ([6]) algorithm that defines a random walk according to the following rules: 1 Choose any starting point x from

as the initial state.

2 Generate a new point y by operating a move step from x. 3 Accept or reject y as the new state with a probability a

x; y and go back to 2 :

( )

a

(x; y ) = min (1; exp( U ))

2 The problem of sampling efficiency

(1)

Where A is some observable property that one can compute from a given state, and is the phase space, which is in the case of Markovian Monte Carlo simulation the space of geometrical coordinates of each component of the simulated system under prescribed conditions. The distribution associated to that space is the Boltzmann distribution B :

8x 2 ; 1 Introduction

R

where

(4)

U is the difference of energy of the two states.

The resulting sequence of states (which may include multiple instances of the same state in case of multiple consecutive

rejections) defines a correct sample according to B as long as enough states are visited. Of course the minimum time of simulation depends on the considered distribution, and in the field of molecular simulation an autocorrelation criterion is often used to decide when to stop the random walk. The problem of efficiency is then to keep this mixing time as small as possible.

3 Evolutionary Approach to improving Markovian Monte Carlo simulations In this section we show the chosen EA approach to improve Monte Carlo molecular simulations. The basic principle is to evolve sets of parameters describing distribution frequencies of allowed Monte Carlo movements. 3.1 Optimising parameter sets We deal with molecular simulations of amorphous polymers for which generating valid configuration samples is long and difficult. We consider parallel simulations of the same system. The goal is to produce a correct sampling of the search space, with all parallel simulations contributing to the same sampling. In traditional Monte Carlo approaches, users empirically adjust the parameters of a simulation, e.g. in case of several allowed Monte Carlo movements, the relative frequencies of those movements along the simulation. These frequencies have no consequence on the limit distribution of valid configurations, since they only impact the way the search space is sampled during the simulation. However finding good sets of such frequencies for a specific problem can significantly improve the performance of the whole process. The chosen reference algorithm (RA) that will be used here for the comparison of other approaches consists in ns simulations of polymer systems with different initial states in the same physico-chemical conditions (i.e. different points from the same phase space), each simulation using an equiprobable distribution of allowed movements. 3.2 Real-valued encoding, multicriteria fitness function The relative frequencies of movements are real numbers taking values in the interval ; and are used as such in our algorithm. The sum of m frequencies for the m types of movements is equal to 1. Other real-valued parameters of movements are added to the chromosome. The fitness function uses two complementary criteria: the first one, autocorrelation, is well known by practitioners of molecular simulation (see [5] for example). We compute the decreasing rate of autocorrelation of end-to-end chain vectors, previously normalised. The end-to-end chain vector is the vector linking both ends of a chain. For one simulation, autocorrelation of those vectors can be calculated periodically after a sufficient numbers of elementary moves, and is equal to the means of all single chain’s autocorrelation over the set

[0 1℄

of chains in the simulation box: nsamples X

a=

1

nsamples

( )

i=1

< vi ; v0 >

(5)

Where vi i2f0:::nsamples g are the normalized end-to-end vectors. The second criterion is the mass center displacement, that is, the distance between the current position of the mass center of a chain, and of its initial position. The greater the displacement, the more sampling has taken place in the simulation. 3.3 Cycles and generations We can summarize the algorithm as follows:

we simulate ns systems with identical physicochemical parameters; each system has a chromosome composed of the realvalued parameters of the simulation, in particular the frequencies distributions; the initial population is generated randomly; one simulation of one system consists in n cycles; those n cycles are further segmented in ng generations (see figure 1) inside which individuals of the populations are evaluated along with the fitness criteria presented above (Note: since an individual is a set of parameters, its evaluation requires many Monte Carlo movements in cycles). Our EA uses a range of standard strategies and parameter settings: Stochastic universal sampling ([2]) with full population replacement, uniform crossover and gaussian mutation operators.

A cycle consists in several elementary Monte-Carlo moves (trials to generate new conformations), totalling a pre-defined CPU time. This way, fitnesses represent the true efficiency of a parameter set over a specified period of time. This makes our algorithm dependent on the hardware, OS and software used, but supportive of cheap and efficient mixing moves. The fitness of an individual is computed over n =ng cycles, and is defined as:

f (x; ! ) = [1

aavg (x; ! )℄ d2avg (x; ! )

(6)

where ! denotes the random part of the simulation, aavg the average autocorrelation, and d2avg the average square displacement. Strictly speaking, improving efficiency is a multiobjective problem (i.e. optimising a and d2 ). We choose to formulate this problem via a single fitness because these two components are closely related, as we shall see in the numerical experiments. Moreover a desirable behaviour is a good “global” performance, i.e. all numerical simulations must be efficient. The goal of our evolutionary algorithm is thus to find rapidly many good solutions instead of locating a particular optimum.

range interactions, depending on the density and on this radius, is finally added. Simulated Systems

= 112

#V P B () kB

...

Simulation time ...

Figure 1: Schematic representation of our EA-based parallel MC simulation: bars stand for simulated systems. Each system is given MC move frequencies corresponding to an individual of the population, performance is measured on n =ng MC cycles and returned as the fitness score. At each new generation, new individuals are created and each simulation continues with corresponding (hopefully better) frequencies. In comparison, for the reference algorithm (RA), a unique set of frequencies is used for all the systems during all the simulation time.

4 Molecular model, Monte Carlo moves Molecular simulation specialists usually compare their results with experimental measurements. There is now a solid background of such comparisons that allows to conclude that molecular simulation correctly predicts physical properties of matter. Instead, we compare our results with those obtained by other authors on the same polymers systems. In this paper, we adopt the polyethylene model described in [5], with the same constants:

= 3 94

#LJ (rij ) = 4LJ

"

LJ 12 rij

= 8 67

(8)

= 1116 = 1462 = 368 = 3156 =

#tor kB

=

5 X k=0

k osk ()

(9)

Cubic simulation model box with periodic boundary conditions; Splitting the simulation box in cubix cells so as to accelerate interaction potentials computation. The length of a cell has to be larger than ut , hence the search for neighbours can be restricted to the cell of the site and the 26 neighbouring cells. This implies that this method is useful only if the simulation box is at least 4 times larger than ut , corresponding to at least 64 cells.

Translation: the whole molecule is translated along a random path. The translation distance is randomly chosen in the interval ; dmax . The parameter dmax is usually dynamically adjusted in order to reach a prescribed acceptance rate. In our case dmax is an element of the genome. This translation move can generate expensive calculations of Lennard-Jones potential if there are too many interaction sites and is therefore costly for long molecules, as we will see later;

[0

# LJ 6 (7) rij

In practice, in order to compute the total interaction potential of a particular site, we only consider neighbouring sites that are located within a cutoff radius ut ˚ A term taking into account long (here ut : A).

0 ) 2

Our Monte Carlo moves are commonly used in molecular simulation literature:

˚ corfixed monomer-to-monomer link length of 1.54 A, responding to C-C bond length;

= 0 098

= 21 k (

4.1 Monte-Carlo Moves

unified atom model considering each CH2, CH3 group as a single active site;

Lennard-Jones pair interaction potential. The charac˚ and the potential well’s teristic radius is LJ : A, : k al=mol. For two sites i; j whose depth LJ distance between is rij , then we have:

= 57950

Ryckaert and Bellmans torsional potential, function of dihedral angle (given by 4 following monomers of the same chain). Given 0 K , 1 K,

2 K , 3 K , 4 K , 5 K:

= 1578 3788

Generations

Van der Ploeg and Berendsen tension potential, function of a bond angle (given by 3 following monomers of the same chain) allowed to fluctuate around the mean o . Given k K:rad 1: value 0

℄

Rotation: an ending monomer is rotated within a sphere centered on its preceding site; energy variation must be calculated for the three potentials; this simple move only concerns one monomer and is therefore very fast to execute; Reptation (or ”snaking”): an ending monomer is removed, added at the other end of the chain, and rotated. The calculation requirements are the same as with rotation;

Figure 5: Flip Figure 2: Translation

Flip: a monomer inside a chain is rotated along the axis of its two neighbouring sites. The site is moved, two tension angles change, four torsion angles change. The rotation angle is randomly chosen in the interval ; max .The parameter max is usually dynamically adjusted in order to reach a prescribed acceptance rate, and is an element of the genome in our case.

[0

℄

5 Numerical results. We now present some numerical experiments in order to compare efficiency of our EA algorithm to reference algorithm (RA). We recall that the latter consists in the parallel and indepent MC simulations of ns polymer systems, with the same MC parameters and in the same physico-chemical conditions. The simulations are performed in the NnV T ensemble : Figure 3: Rotation

= 20. n : constant number of monomers. n = 1200 (which N : constant number of chains. N

means polymers of size 60).

V : constant volume. This condition is imposed by a ˚. constant simulation box length lB : A

= 35 64 T : constant temperature. T = 394:4K

The simulations were performed on a bi-processor PC (Intel Pentium II 350 MHz), and the simulation cycle was defined as 1 second of process time. On average, about 500 elementary moves are performed within a cycle. The test consisted in 16 instances of this system, which means that the quantities plotted in the following figures are averages on these 16 parrall runs. The total simulation time consisted in n cycles (divided in ng generations of 50 cycles in the case of the EA). To summarize EA parameters :

= 5000

Figure 4: Reptation

= 100

Population size : 16 Number of generations : 100 Full population replacement with Stochastic Universal Sampling

= 0:5 Mutation probability : pm = 0:1 Crossover probability : p

1400 Rotation Reptation Flip Translation

1200

This first set of tests use the four MC moves presented before, with the following frequency distribution for the RA : Rotation 60/181

Reptation 60/181

Flip 60/181

Translation 1/181

60

As the translation simultaneously moves monomers at a time, its frequency is divided by this amount.

Number of instances

5.1 Set A 1000 800 600 400 200 0 0

0.2

0.3

0.4

0.5

0.6

0.7

Frequencies

1 Chain vector autocorrelation

0.1

EA RA 0.9

Figure 8: Set A : histogram of MC moves frequencies. For each move, values of all individuals of all generations are counted.

0.8

We see on figures 6 and 7 that the RA starts to perform better than the EA, but both finish at approximately the same 0.7 performance, with a little advantage for the EA. This can be explained by the fact that in the initial population each move 0.6 have on average the same frequency = (in comparison to = for the RA), including the translation, which reveals to 0.5 be bad moves for this system. And as the evolution discards it, performances improve. We can check this if we look at the 0.4 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 histogram (figure 8) of the moves where translation frequencies are located near 0 for most, and only a few values a little Simulation time * 1s greater (corresponding to first generations). Figure 6: Set A : averaged end-to-end vector autocorrelation plotted against cycles of simulation 5.2 Set B : without translation move

1 181

14

We now drop the translation move. The three remaining moves have now an equal frequency of = in the RA. Figures 9 and 10 show that our EA is clearly able to find good parameters and at the same time leads to good performances in comparison to the RA. The histogram of moves (figures 11) shows that the reptation has a major role for this system, which was also the case for the previous set, but this cannot be a general rule. In fact each move may perform differently depending of the type of simulated molecules and also depending on the conditions (temperature, density, pressure, etc...), and this motivates the adoption of a strategy that finds “on the fly” some good parameter settings.

13

Center of mass displacement

16 EA RA

14 12 10 8 6 4 2

6 Conclusion and future exciting work

0 0

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 We have shown how to evolve sets of parameters of Markovian Monte Carlo molecular simulations using a real-valued Simulation time * 1s

Figure 7: Set A : averaged mass center displacement (in re2 : A˚ 2 ) plotted against cycles duced units, unit LJ of simulation.

1

=

= 15 52

genetic algorithm. Using this approach, a better sampling of the configurational space is obtained at almost no cost, compared to similar reference simulation based on fixed sets of parameters. The practicality of our approach has been shown

250 Rotation Reptation Flip

Chain vector autocorrelation

EA RA

0.9 0.8 0.7

100

50

0

0.5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Frequencies

0.4

Figure 11: Set B (without translation move) : histogram of 0

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 MC moves frequencies. For each move, values of all individSimulation time * 1s uals of all generations are counted.

Figure 9: Set B (without translation move) : averaged endto-end vector autocorrelation plotted against cycles of simulation.

20 Center of mass displacement

150

0.6

0.3

EA RA

15

on a well-known reference problem in simulation of amorphous polymers. It can and will be extended to more complex material for which obtaining a correct sampling is even longer and more difficult. For this purpose, we plan to implement our algorithm on a ”PC farm” of several hundreds of PCs. Our approach is not a direct optimisation task. Instead, our only goal is to generate a better sampling of a very large search space using evolution as a means to increasing diversity while keeping a high acceptance rate. It is particularly important in molecular simulation where sampling is the key factor. On the contrary to pure optimisation EA implementations, the important output of our algorithm is not the best individual of the final population but rather the quality of all individuals generated along the evolution of the EA. As such it can be considered as an evolutionary adaptive system acting in a noisy environment.

10

Bibliography [1] (1987) M.P. Allen, D.J. Tildesley, Computer Simulation of Liquids, Oxford Science Publications.

5

0 0

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Simulation time * 1s

Figure 10: Set B (without translation move) : averaged mass center displacement (in reduced units, 2 unit LJ : A˚ 2 ) plotted against cycles of simulation.

1

Number of instances

200 1

=

= 15 52

[2] (1987) J.E. Baker. Reducing bias and inefficiency in the selection algorithm. Genetic Algorithms and their applications : Proceedings of the Second International Conference on Genetic Algorithms, 14-21. [3] (1996) Daan Frenkel and Berend Smit Understanding Molecular Simulation, From Algorithms to Applications, Academic Press. [4] (1998) Mark Jerrum. Mathematical Foundations of the Markov Chain Monte Carlo Method, in “Probabilistic Methods for Algorithmic Discrete Mathematics”, M.

1

Habib, C. Mc Diarmid. J. Ramirez-Alfonsin, B. Reeds (Eds), Springer Verlag. [5] (1999) V.G. Mavrantzas, T.D. Boone, E. Zervopoulou and D.N. Theodorou. End-Bridging Monte Carlo: A Fast Algorithm for Atomistic Simulation of Condensed Phases of Long Polymer Chains., Macromolecules, 32, 5072-5096. [6] (1953) N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.N. Teller, and E. Teller, Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087-1092. [7] (1992) J.I. Siepmann, D. Frenkel, Configurationnalbias Monte Carlo: A new sampling scheme for flexible chains, Mol. Phys., 75, 59-70.