Process killed when running a calculation on a molecule with 148 atoms using wB97M-D3BJ/def2-TZVPPD

I am trying to run single-point energy and force calculations on structures of the buckyball catcher molecule from the MD22 dataset, which has 148 atoms. I am using the Psi4 calculator through ASE, with the following function:

import ase.io
from ase.calculators.psi4 import Psi4
from tqdm import tqdm


def main(filename, save_every=20, num_threads=32, memory=1000, start=0):
    mol_name = filename.split('/')[-1].split('.')[0]

    print(f'Calculating {mol_name} with {num_threads} threads!')
    print(f'Saving every {save_every} steps.')
    print(f'Loading {filename}...')
    md17_asp = ase.io.read(filename, ':')  # read all structures in the file
    atoms = md17_asp[0]

    props = ['DIPOLE', 'QUADRUPOLE', 'WIBERG_LOWDIN_INDICES', 'MAYER_INDICES']
    calc = Psi4(atoms=atoms, method='wB97M-D3BJ', num_threads=num_threads,
                memory=f'{memory}MB', basis='def2-TZVPPD',
                properties=props, wcombine=False)
    # Initial gradient call with the requested properties.
    calc.psi4.gradient('wB97M-D3BJ/def2-TZVPPD', properties=props)

    engs = []
    forces = []

    print('Starting 1000 calculations!')
    for i in tqdm(range(start, 1000)):
        a = md17_asp[i]
        a.set_calculator(calc)
        engs.append(a.get_potential_energy())
        forces.append(a.get_forces())

where I set the number of threads to 128 or 256 and the memory to 16 GB (these defaults are overridden from the command line).
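The override itself is roughly the following sketch; the flag names here are illustrative rather than the exact ones in my script:

import argparse

# Illustrative driver; the actual flag names in my script may differ.
parser = argparse.ArgumentParser()
parser.add_argument('filename')
parser.add_argument('--num-threads', type=int, default=32)
parser.add_argument('--memory', type=int, default=1000, help='memory in MB')
parser.add_argument('--start', type=int, default=0)
args = parser.parse_args()

main(args.filename, num_threads=args.num_threads, memory=args.memory,
     start=args.start)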

Eventually I get an error like the following:

==> Pre-Iterations <==

  SCF Guess: Superposition of Atomic Densities via on-the-fly atomic UHF (no occupation information).

   -------------------------
    Irrep   Nso     Nmo
   -------------------------
     A       4916    4827
   -------------------------
    Total    4916    4827
   -------------------------

  ==> Iterations <==

                           Total Energy        Delta E     RMS |[F,P]|

slurmstepd: error: Detected 1 oom_kill event in StepId=25582113.0. Some of the step tasks have been OOM Killed.
srun: error: nid005211: task 0: Out Of Memory
srun: Terminating StepId=25582113.0

My Psi4 version is 1.9.1. I have tried increasing the memory further, but the process just seems to get stuck at some point. I don’t know if there is another way to speed this up or parallelize the calculation that could somehow get around the issue. At the moment, the calculation runs for 12+ hours on the first structure and then gets killed with the above message. This is my first time using Psi4, so any help would be appreciated.

Thank you in advance!

  • Psi4’s parallelism comes from threaded BLAS (usually MKL), so it’s probably not worth setting the thread count above about 16 (see the sketch after this list).
  • You mention 16 GB of memory, but the posted script has 1000 MB. Could you check higher up in the output what it is actually set to, particularly this bit? This value can be increased into the hundreds of GB.
         ---------------------------------------------------------
                                   SCF
            by Justin Turney, Rob Parrish, and Andy Simmonett
                              RHF Reference
                        1 Threads,    572 MiB Core
         ---------------------------------------------------------
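Both settings can be passed either through the ASE calculator keywords or directly on the psi4 module; a minimal sketch (the numbers are placeholders, not recommendations for this system):

import psi4
from ase.calculators.psi4 import Psi4

# Set the globals directly in Psi4 (placeholder values)...
psi4.set_num_threads(16)     # threaded BLAS; more than ~16 threads rarely helps
psi4.set_memory('200 GB')    # keep this below what the scheduler actually grants

# ...or pass the same settings through the ASE calculator keywords.
calc = Psi4(method='wB97M-D3BJ',
            basis='def2-TZVPPD',
            num_threads=16,
            memory='200000MB')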

Thank you for the response. I have tried setting it as high as 400 GB and still get the same error (I override the default from the code above with a command-line argument):


     ---------------------------------------------------------
                                   SCF
               by Justin Turney, Rob Parrish, Andy Simmonett
                          and Daniel G. A. Smith
                              RKS Reference
                      256 Threads, 381469 MiB Core
     ---------------------------------------------------------

And thank you for the tip re. the threads.

Ok, so that rules out the easiest mistake :slight_smile: . I have heard a report of a very high-memory SCF job in SAPT failing unexpectedly, but of course these aren’t easy to get access to and debug. Are you leaving a good bit of padding between the Slurm max memory and the memory handed to Psi4? Normally I’d do 5% extra, but maybe 15% could help if it’s Slurm triggering the OOM rather than the node itself.
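For example, something along these lines (only a sketch; it assumes the job was submitted with --mem, so Slurm exports SLURM_MEM_PER_NODE in MB):

import os
from ase.calculators.psi4 import Psi4

# Sketch of the padding idea: hand Psi4 roughly 15% less memory than the
# Slurm allocation so the cgroup limit is not what triggers the OOM kill.
slurm_mem_mb = int(os.environ['SLURM_MEM_PER_NODE'])  # set by Slurm when --mem is used
psi4_mem_mb = int(slurm_mem_mb * 0.85)

calc = Psi4(method='wB97M-D3BJ',
            basis='def2-TZVPPD',
            num_threads=16,
            memory=f'{psi4_mem_mb}MB')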

I have gotten it to work on a local cluster by reducing to 16 threads and running with 500 GB, but I am still getting issues on Slurm. I can try adding the padding relative to the Slurm max memory.

In the meantime, I was also wondering whether there is an easy way to do checkpointing / resume a calculation. On NERSC the maximum job time is 24 hours, and these calculations have either been crashing or hitting the time limit without completing. If I could resume a calculation wherever it left off, I might be able to continue it on NERSC.
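At the script level I can imagine dumping partial results and restarting from a given index, roughly like the sketch below (it reuses the save_every and start arguments from my function above; np.savez is just a placeholder format), although that only helps across structures, not within a single long calculation:

import numpy as np
from tqdm import tqdm

def run_range(structures, calc, mol_name, start=0, save_every=20):
    # Sketch of script-level checkpointing: skip everything before `start`
    # and dump partial results every `save_every` structures.
    engs, forces = [], []
    for i in tqdm(range(start, len(structures))):
        a = structures[i]
        a.calc = calc
        engs.append(a.get_potential_energy())
        forces.append(a.get_forces())
        if (i + 1) % save_every == 0 or i == len(structures) - 1:
            np.savez(f'{mol_name}_{start}_{i}.npz',
                     energies=np.array(engs),
                     forces=np.array(forces),
                     last_index=i)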

Thanks for all the help.