Is it normal for double-hybrid DFT calculations to use so much memory?

psi4 (v1.3.2) uses a lot of memory during a single-point calculation using density fitting, a double-hybrid functional, and a large basis set (i.e., DF-DSD-BLYP-D3(BJ)/def2-QZVPP). E.g., a molecule with NBF=1963 and NAUX=4358 needed 160 GB of memory (80 GB x 2 cores; 94% memory efficiency). Some molecules seem to do fine with just 4 GB of memory per core (e.g., 32 GB or 64 GB total), but I cannot seem to determine the pattern. This happens with both unrestricted and restricted determinants.

According to this (older) post, NBF^2*NAUX/0.8 = ca. 20 GB should be sufficient. And my experience with other software makes this amount of memory usage surprising.

I know a lot of improvements to the DFT module are coming in v1.4, but the excessive memory use only seems to happen during the DF-MP2 part. In any case, I’ll have to wait for a stable release, since the admins won’t deploy the developer version on the cluster.

I can provide more information if needed, of course.

The OpenMP nature of PSI4 means that most memory is shared and the per-thread overhead is usually small. Don’t think about the per-core memory too much.

PSI4 will try to keep as many integrals and intermediates in memory as it is allowed to, in order to reduce I/O.

Does it crash or complain if you specify below 160 GB? The minimal amount of memory needed should be much smaller.

Yes, it usually crashes if I specify less memory. Psi4 does not print an error to the output file but the following error is printed to STDOUT:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4541276.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Granted, it does use the memory efficiently, and these calculations take only ~15 min to 2 h on 2 cores with all that memory, but it would be more convenient if the single point could run within the amount of memory I give to the previous steps (i.e., optimize, freq, etc.).
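
For context, the overall workflow looks roughly like the sketch below (the cheaper method and basis here are placeholders, not my actual settings); only the final double-hybrid single point blows past the memory given to the earlier steps:

# Earlier steps run comfortably within ~4 GB/core (placeholder method/basis).
set basis def2-svp
optimize('revpbe')
frequencies('revpbe')

# Only this step, which includes the DF-MP2 correlation part, needs ~160 GB.
set basis def2-qzvpp
energy('dsd-blyp')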

Not sure what is going on, would you mind posting your input? I can try with the release candidate.

psi4 is not 100% strict about the memory limit, but exceeding the assigned memory by a large margin should not happen. And the calculation is not that large.

It’s also worth pointing out that Rob’s DF-MP2 code is aggressively optimized, so excessive memory use surprises me.

Keeping in line with the good advice from Jonathon on a different question of mine, I’ve run some tests to try and pin down the issue.

I cannot publicly disclose specifics about the molecules themselves (test is a placeholder), but they have between 30 and 100 atoms, NBF < 2000, and NAUX < 4000 with def2-QZVPP. There is nothing particularly unusual about the molecules; I can discuss a specific example privately if it comes down to it.

Using the following input (all other examples below only have the amount of memory changed):

# Read the molecule from an XYZ file (test is a placeholder name).
with open('test.revpbe.xyz') as f:
    test_xyz = f.read()
test = psi4.core.Molecule.from_string(test_xyz, dtype='xyz')
activate(test)

memory 64 GB  # 4 GB x 16 cores
set reference rks  # Happens with open-shell cases, too.
set basis def2-qzvpp
E_dubhyb, wfn = energy('dsd-blyp', return_wfn=True)

The job fails and throws an OUT_OF_MEMORY error even though seff reports 0% memory use (is this a slurm accounting quirk or indicative of a bug in psi4?):

    $ seff 4759647
Job ID: 4759647
Cluster: cedar
User/Group: gibacic/gibacic
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:20:36
CPU Efficiency: 42.68% of 00:48:16 core-walltime
Job Wall-clock time: 00:03:01
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 64.00 GB (4.00 GB/core)

However, with more memory and fewer cores, it uses all the memory efficiently before running out at the DF-MP2 step.

[...]
memory 120 GB # 60 GB x 2 cores
[...]

has the following efficiency:

    $ seff 4765160
Job ID: 4765160
Cluster: cedar
User/Group: gibacic/gibacic
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:29:50
CPU Efficiency: 93.35% of 01:36:14 core-walltime
Job Wall-clock time: 00:48:07
Memory Utilized: 119.30 GB
Memory Efficiency: 99.42% of 120.00 GB

Giving it even more (i.e., just enough) memory allows the job to complete normally.

[...]
memory 160 GB # 80 GB x 2 cores
[...]
    $ seff 4760349
Job ID: 4760349
Cluster: cedar
User/Group: gibacic/gibacic
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 03:02:16
CPU Efficiency: 96.44% of 03:09:00 core-walltime
Job Wall-clock time: 01:34:30
Memory Utilized: 151.40 GB
Memory Efficiency: 94.62% of 160.00 GB

Here’s the list of the modules loaded during testing in case there is an obvious compatibility issue:

    $ module list

Currently Loaded Modules:
  1) CCconfig
  2) gentoo/2020       (S)
  3) gcccore/.9.3.0    (H)
  4) imkl/2020.1.217   (math)
  5) intel/2020.1.217  (t)
  6) ucx/1.8.0
  7) libfabric/1.10.1
  8) openmpi/4.0.3     (m)
  9) StdEnv/2020       (S)
 10) libxc/4.3.4       (chem)
 11) libffi/3.3
 12) python/3.8.2      (t)
 13) ipykernel/2020b
 14) scipy-stack/2020b (math)
 15) psi4/1.3.2        (chem)
 16) dftd3-lib/0.9

  Where:
   S:     Module is Sticky, requires --force to unload or purge
   m:     MPI implementations / Implémentations MPI
   math:  Mathematical libraries / Bibliothèques mathématiques
   t:     Tools for development / Outils de développement
   chem:  Chemistry libraries/apps / Logiciels de chimie
   H:     Hidden Module

Please let me know if you need more information or if there are other tests I can run to figure out what is going on.


I am not familiar with slurm, so I don’t know whether the OOM error is real or whether slurm just flags one because the resource usage briefly spikes above the slurm limits.
It’s never a good idea to give a program the maximum allowed memory; something like 80-90% is better.
Chances are high that some overhead is unaccounted for somewhere.

A test case of mine, of similar size, exceeds the 64 GB limit by almost 3 GB.

Ah, that was also recommended for ORCA but I thought it was peculiar to it alone. Good to know!

When I tell Psi4 about only 80% of the allocated memory (i.e., memory 50 GB while SLURM gets 64 GB), the calculation still fails and throws an OUT_OF_MEMORY error to stderr during the SCF step, so I think Psi4 is really running out of memory.
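
Concretely, the only change in the input for that test was the memory line (SLURM’s request stayed at 64 GB):

memory 50 GB  # ~80% of the 64 GB SLURM allocation, per the advice above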

To me this means these calculations are still using far more memory than one would expect for this level of theory. In other packages, a calculation of this size would be fine with 4 GB of memory, or it would at least switch to some direct algorithm and read/write the intermediates to disk.

Any recommendations on how to debug this? Maybe I’ve got something configured incorrectly?

I’d like to help figure out if this is a problem with the specific installation I’m using or if it is some bug in Psi4.

In any case, the code is still very fast when it has enough memory (ca. 15 min to 2 h with two cores on these rather large systems).

Thanks for the information. As a sanity check, I’d like to confirm some non-identifying details of your output files. I’ve included the snippets I see when I throw dsd-blyp/def2-qzvpp at three benzene molecules using only a single thread and the developer version of Psi4 1.4. (Memory use there was under 2 GB, and this is 1566 basis functions and 3546 auxiliary functions.)

First, I’d like to confirm your SCF Algorithm. In my output file, it looks like:

  ==> Algorithm <== 

  SCF Algorithm Type is DF. 

The second detail is your integral setup.

  ==> Integral Setup <==

  ==> DiskDFJK: Density-Fitted J/K Matrices <==

    J tasked:                  Yes
    K tasked:                  Yes
    wK tasked:                  No
    OpenMP threads:              1
    Integrals threads:           1
    Memory [MiB]:              643
    Algorithm:                Disk
    Integral Cache:           NONE

The third detail is information about DFT grids.

  Cached 0.9% of DFT collocation blocks in 0.074 [GiB].

For a job that failed (memory 64 GB, i.e., 4 GB x 16 cores), it appears to be using the on-disk algorithms:

  ==> Algorithm <== 

  SCF Algorithm Type is DF. 
  ==> Integral Setup <==

  ==> DiskDFJK: Density-Fitted J/K Matrices <==

    J tasked:                  Yes
    K tasked:                  Yes
    wK tasked:                  No
    OpenMP threads:             16
    Integrals threads:          16
    Memory [MiB]:            23938
    Algorithm:                Disk
    Integral Cache:           NONE
    Schwarz Cutoff:          1E-12
    Fitting Condition:       1E-10
  Cached 100.0% of DFT collocation blocks in 22.269 [GiB].

From a job that completed successfully (memory 160 GB, i.e., 80 GB x 2 cores), we see it has switched to in-memory density fitting (but still says Algorithm: Disk).

  ==> Algorithm <==

  SCF Algorithm Type is DF.
  ==> Integral Setup <==

  DFHelper Memory: AOs need 45.145 GiB; user supplied 89.490 GiB. Using in-core AOs. 

  ==> MemDFJK: Density-Fitted J/K Matrices <==

    J tasked:                   Yes  
    K tasked:                   Yes  
    wK tasked:                   No   
    OpenMP threads:               2    
    Memory [MiB]:             91637
    Algorithm:                 Disk 
    Schwarz Cutoff:           1E-12
    Mask sparsity (%):      41.5794
    Fitting Condition:        1E-10
Cached 100.0% of DFT collocation blocks in 22.269 [GiB]

Interestingly, when given an intermediate amount of memory (memory 120 GB, i.e., 60 GB x 2 cores), it prints the same details as the case directly above but then fails (OOM) at the DF-MP2 step.

NB: This is all on v1.3.2.
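
For what it’s worth, here is the back-of-envelope arithmetic I used to make sense of the “AOs need 45.145 GiB” line. I’m assuming (without having checked the source) that the in-core algorithm stores the three-index tensor over unique basis-function pairs in double precision, and that the sparsity mask is what brings the printed number below this estimate:

# Rough upper bound on the in-core (Q|mn) tensor for the NBF=1963 / NAUX=4358 case.
nbf, naux = 1963, 4358
pairs = nbf * (nbf + 1) // 2        # unique (m <= n) pairs, no screening
gib = pairs * naux * 8 / 1024**3    # 8 bytes per double
print(f"{gib:.1f} GiB")             # ~62.6 GiB; Psi4 reports 45.145 GiB after screening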

What is the virtual memory limit for your slurm? I noticed that a parallel psi4 has a much higher virtual memory allocation than a single-core run.
That will need some time to investigate; I’m not sure it was always like this.

With a current dev build and 4GB memory request:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5808 kruse     20   0  0.098t 6.851g 702484 R 766.3  5.4  33:00.27 psi4

Hmm, I cannot seem to find that on the wiki for our cluster. Let me get back to you when I hear back from the admin team.

The DF-MP2 recommendations section contains some potentially useful information:

“If you notice DFMP2 using more memory than allowed, it is possible that the threaded three-index ERI computers are using too much overhead memory. Set the DF_INTS_NUM_THREADS to a smaller number to prevent this in this section (does not affect threaded efficiency in the rest of the code).”

Sorry for the extended delay.

Currently, I’ve been able to “work around” this “issue” by giving psi4 fewer threads and lots of memory. I have not had time to truly diagnose whether it is behaving correctly or not, but I will give it some attention in the coming weeks, especially considering the insightful documentation linked by Zach.

Thank you, Zach! This is exactly the information I was looking for regarding DF-MP2 scaling. I will see if changing the DF_INTS_NUM_THREADS tames the memory usage.
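
For anyone who finds this thread later, the change I plan to test is simply adding something like the following to the input before the energy call (the value 2 is a guess, not a recommendation):

set df_ints_num_threads 2  # limit the threads used to build the three-index ERIs
E_dubhyb, wfn = energy('dsd-blyp', return_wfn=True)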