Creating/Finding a small but realistic MD model for my research?

I’m completely new to Psi4 and have only just installed it today, because I learned that it was used to generate the MD17 dataset, which I am interested in.

I’m currently starting on a neural network approach to molecular dynamics, and for that I need a dataset. The ideal dataset for my research is essentially the MD17 dataset found here: However, there is a problem with this dataset for my use case. As quoted from the originating article, the MD17 dataset was created as follows:

Reference data generation. The data used for training the DFT models were created running ab initio MD in the NVT ensemble using the Nosé-Hoover thermostat at 500 K during a 200 ps simulation with a resolution of 0.5 fs. We computed forces and energies using all-electrons at the generalized gradient approximation level of theory with the Perdew-Burke-Ernzerhof (PBE) [65] exchange-correlation functional, treating van der Waals interactions with the Tkatchenko-Scheffler (TS) method [66]. All calculations were performed with FHI-aims [67]. The final training data was generated by subsampling the full trajectory under preservation of the Maxwell-Boltzmann distribution for the energies.
To create the coupled cluster datasets, we reused the same geometries as for the DFT models and recomputed energies and forces using all-electron coupled cluster with single, double, and perturbative triple excitations (CCSD(T)). The Dunning’s correlation-consistent basis set cc-pVTZ was used for ethanol, cc-pVDZ for toluene and malonaldehyde and CCSD/cc-pVDZ for aspirin. All calculations were performed with the Psi4 [68] software suite.

So the data has been subsampled, meaning that consecutive samples in the MD17 dataset are not separated by a constant time-step, and a constant time-step is what I need for my work.

So my questions are:
Is there any way of generating this dataset again given the above information? I have tried contacting the author, but haven’t heard anything back yet.

Alternatively, are there any other simple systems like this available online, or does anyone have any scripts or tutorials for generating a molecular system dataset?
What I need are the atomic positions at each step, and ideally the atomic velocities and force vectors as well. I would like to generate at least 100k-500k time-steps, since I need quite a lot of data for the neural network training.
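To make the data format concrete, here is a minimal sketch of the kind of record I am after: a velocity Verlet loop that stores positions, velocities, forces and potential energy at every fixed time-step. The function names, the two-atom harmonic bond and the reduced units are all placeholders invented for illustration; in a real run the force call would come from an ab initio code and a thermostat would be added for NVT.

```python
import math

def harmonic_bond_forces(positions, k=1.0, r0=1.0):
    """Toy stand-in for an ab initio force call: one harmonic bond between two atoms."""
    (x0, y0, z0), (x1, y1, z1) = positions
    d = (x1 - x0, y1 - y0, z1 - z0)
    r = math.sqrt(sum(c * c for c in d))
    energy = 0.5 * k * (r - r0) ** 2
    # Force on atom 1 is -dE/dr along the bond direction; atom 0 gets the opposite.
    f1 = tuple(-k * (r - r0) * c / r for c in d)
    f0 = tuple(-c for c in f1)
    return energy, [f0, f1]

def run_velocity_verlet(positions, velocities, masses, dt, n_steps, force_fn):
    """Integrate with velocity Verlet (NVE) and record every frame at a fixed time-step."""
    energy, forces = force_fn(positions)
    frames = []
    for step in range(n_steps):
        frames.append({
            "time": step * dt,
            "positions": [tuple(p) for p in positions],
            "velocities": [tuple(v) for v in velocities],
            "forces": [tuple(f) for f in forces],
            "potential_energy": energy,
        })
        # Half-step velocity update, then full-step position update.
        velocities = [tuple(v[i] + 0.5 * dt * f[i] / m for i in range(3))
                      for v, f, m in zip(velocities, forces, masses)]
        positions = [tuple(p[i] + dt * v[i] for i in range(3))
                     for p, v in zip(positions, velocities)]
        # Forces at the new positions, then the second half-step velocity update.
        energy, forces = force_fn(positions)
        velocities = [tuple(v[i] + 0.5 * dt * f[i] / m for i in range(3))
                      for v, f, m in zip(velocities, forces, masses)]
    return frames

def total_energy(frame, masses):
    """Kinetic plus potential energy of one recorded frame."""
    kinetic = sum(0.5 * m * sum(c * c for c in v)
                  for v, m in zip(frame["velocities"], masses))
    return kinetic + frame["potential_energy"]

# Reduced (arbitrary) units; the bond starts stretched and both atoms start at rest.
masses = [1.0, 1.0]
frames = run_velocity_verlet(
    positions=[(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)],
    velocities=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    masses=masses, dt=0.01, n_steps=1000, force_fn=harmonic_bond_forces)
```

Every entry in `frames` is separated from the next by exactly `dt`, which is the property the subsampled MD17 data lacks.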

Any insight from experienced psi4 users or people in the field of molecular dynamics would be greatly appreciated.

The quoted passage used Psi4 only for the final CCSD(T) calculations. Psi4 has no MD capabilities of its own. Unless you need a specific method from Psi4, it is easier to find a specialised program for your needs. I recommend cp2k because it’s free and fast and can do a lot.

Thank you!
I don’t quite understand the purpose of the CCSD(T) calculations. Do they recompute the potential energy and the forces on each atom in each frame to get higher accuracy, or is it something unrelated? (Just wondering whether I will still need to do this if I end up creating my own dataset.)
I will check out cp2k now.

I guess it was to improve the energetics, yes.
However, while CCSD(T) can be accurate, it needs a large basis set, and I question the quality of the CCSD(T)/cc-pVDZ calculations. A modern hybrid or double-hybrid DFT functional would probably give more accurate results.

It is typically not that useful to use full MD trajectories for learning potential energy surfaces, since the correlation between subsequent points on the trajectory is very high. This is why subsampling is highly effective. You don’t need the whole trajectory, you just need some kind of ergodic sampling of the system’s full phase space. (Of course, this is very rarely sampled by MD; other methods are often used instead.)
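As a rough illustration of how correlated neighbouring frames are, the sketch below computes the normalized autocorrelation of a toy periodic signal sampled every 0.5 fs. The signal, its 20 fs period and the helper name `autocorrelation` are assumptions made up for the example; a real check would use a bond length or energy series from an actual trajectory.

```python
import math

def autocorrelation(series, lag):
    """Normalized autocorrelation of a 1D series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered) / n
    covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
    return covariance / variance

# Toy "bond length" signal: hypothetical 20 fs vibration sampled every 0.5 fs.
dt_fs = 0.5
period_fs = 20.0
signal = [math.cos(2 * math.pi * i * dt_fs / period_fs) for i in range(4000)]

lag_one = autocorrelation(signal, 1)       # neighbouring frames, 0.5 fs apart
lag_quarter = autocorrelation(signal, 10)  # frames a quarter period (5 fs) apart
```

For this toy signal the lag-1 value is close to 1, while frames a quarter period apart are nearly uncorrelated, which is the intuition behind subsampling.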

That is a good point, but for my particular use case I need either all the time-steps or, at the very least, a constant time interval between consecutive subsampled frames.
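To show what I mean by a constant interval, here is a small sketch of fixed-stride subsampling: keeping every k-th frame of a trajectory with time-step dt gives frames spaced by exactly k * dt. The function name and the numbers are hypothetical, and the "frames" are just integer indices standing in for real trajectory frames.

```python
def subsample_fixed_stride(frames, stride, dt):
    """Keep every `stride`-th frame; kept frames are evenly spaced by stride * dt."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride], stride * dt

# Hypothetical example: 0.5 fs MD step, keeping every 10th frame.
frame_indices = list(range(100))
kept, effective_dt = subsample_fixed_stride(frame_indices, stride=10, dt=0.5)

# Spacing between kept frames, in units of the original step.
spacings = [b - a for a, b in zip(kept, kept[1:])]
```

This differs from the energy-distribution-preserving subsampling described in the quote, which does not keep the frame spacing uniform.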

In any case, I’m currently in the process of setting up a cp2k simulation of aspirin for this particular purpose, where I get the velocity, position, forces and energy for each timestep.

Thank you @hokru for pointing me in the right direction!