Conventional CCSD(T) disk/memory usage?

I’m looking for a simple formula for the disk/memory usage for CCSD(T) calculations, as I’m planning to run some very large calculations and might need to get a bigger disk. I see formulas to calculate this for DF-CCSD(T) in the source code but couldn’t find anything for conventional CCSD(T). Is there any such formula someone could direct me to?

Thanks!

Using Psi’s conventional CCSD(T) is a terrible, terrible idea. The (T) correction takes far longer than it needs to. Have you considered using CFOUR? Assuming you want closed-shell species, their CCSD(T) using the NCC module is amazingly fast.

qc_module fnocc has a fine conventional CCSD(T).

For large calculations the AO-MO transform will take far too long to be worth it. I tried it for something large. I’d recommend DF-CCSD(T).

A bit more context… I’m trying to do some atomization energies, where the atoms can’t be done with DF-CCSD(T). The largest calculation has 1300 basis functions, but the rest should be pretty easy… so I don’t really mind. I just want to know how much disk space I’ll need.

OK.

Have you already run a CCSD(T)? With the default module, the output will print the integral types it generates, along with their disk usage.

The largest and usually dominating one is <VV|VV> (V = virtual, O = occupied).
Perhaps also include <OV|VV>.

So ints = (v^4 + v^3*o)*8/1024^2 MiB would perhaps be a quick, simple check for the CCSD integrals.
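As a minimal sketch of that estimate in plain Python (the o = 10, v = 237 values are just the ones that come up later in this thread; substitute your own):

    # Rough disk estimate for the dominant conventional CCSD integral blocks:
    # <VV|VV> is v^4 words and <OV|VV> is o*v^3 words, at 8 bytes per word.
    def ccsd_integral_estimate_mib(o, v):
        words = v**4 + o * v**3
        return words * 8 / 1024**2  # MiB

    print(f"{ccsd_integral_estimate_mib(10, 237):.1f} MiB")  # ~25000 MiB

Note this only counts the two largest integral classes; the actual files also hold smaller blocks and amplitude storage.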

I think the fnocc module is overall faster. If you exceed 800 virtuals… good luck.
Optionally, think about using the FNO approximation, even a very conservative one with OCC_TOLERANCE=1e-7.
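If it helps, a sketch of what that might look like in a Psi4 Python input (the geometry and basis are placeholders, and I'm assuming the usual fnocc option names here; check the fnocc documentation):

    import psi4

    # Placeholder molecule; substitute your own system.
    psi4.geometry("""
    0 1
    O
    H 1 0.96
    H 1 0.96 2 104.5
    """)

    psi4.set_options({
        "basis": "cc-pvtz",
        "freeze_core": True,
        "occ_tolerance": 1e-7,  # very conservative FNO cutoff, as suggested above
    })

    # fno-ccsd(t) routes the calculation through fnocc with the FNO truncation.
    psi4.energy("fno-ccsd(t)")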

The conventional RHF-based (T) correction in PSI4 is based on the Rendell and Lee formulation, which is very efficient. We use a minimum number of v^3 arrays and a minimum number of sorts in between. In addition, Rollin King added threading to the triples algorithms, which gives further opportunities for speed. At one time our code was slightly faster than that in CFOUR, though I've not made a direct comparison with their latest implementation. I can certainly believe that additional optimizations are warranted, but I have to take issue with the sentiment behind your comment. Can you explain what you mean by “takes far longer than it needs to”?

Thanks. I still have one point of confusion though…

In the output I see the line:

    Size of irrep 0 of <ab|cd> integrals: 3154.957 (MW) / 25239.652 (MB)

which matches the simple formula 8*v^4/1000^2.

However, the file written to disk (psi.436384.mol.103) is taking up 36G of space. <ia|bc> is only 1064.964 MB, so nothing else should be comparable to <VV|VV>, right? I don’t see what I’m missing…

(c1 symmetry)

Not sure.
File 103 holds the PSIF_CC_BINTS integrals in the code (see psifiles.h).

Could it be the pre-sorted and spin-adapted integrals? That's done here: https://github.com/psi4/psi4/blob/master/psi4/src/psi4/cctransort/b_spinad.cc
Though I have trouble telling how much is being written there.

The fnocc module is a bit different because it splits its files more, but the disk usage is similar in the end. Though it seems the 2-e AO integrals from the SCF are not deleted, which could be sizable (not 100% sure on this).

For RHF-CC calculations, file 103 includes the <ab|cd> integrals (with no index packing), but it also includes two linear combinations that are packed: <ab|cd> + <ab|dc> and <ab|cd> - <ab|dc>. (These are used to streamline the particle-particle ladder terms in the various amplitude and EOM-CC equations.) For these two tensors, the row and column indices have dimension V*(V+1)/2, so the total word sizes of each tensor are (V*(V+1)/2)^2. If you have 237 virtual orbitals, then each of these tensors requires about 6.3 GB, which would account for your observation of a 36 GB file 103.
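To make the bookkeeping concrete, here is a quick back-of-the-envelope check in plain Python (8 bytes per word, V = 237 as above):

    v = 237                        # number of virtual orbitals
    full = v**4                    # <ab|cd> with no index packing
    packed = v * (v + 1) // 2      # packed row/column dimension
    combos = 2 * packed**2         # <ab|cd> + <ab|dc> and <ab|cd> - <ab|dc>

    total = (full + combos) * 8    # bytes
    print(f"full <ab|cd>: {full * 8 / 1000**3:.1f} GB")                  # ~25.2 GB
    print(f"each packed combination: {packed**2 * 8 / 1000**3:.1f} GB")  # ~6.4 GB
    print(f"file 103 total: {total / 1024**3:.1f} GiB")                  # ~35 GiB, i.e. the 36G on disk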

Sure, I have an example from my current project. I ran the exact same CCSD(T) computation in Psi4 and CFOUR. Psi says (T) takes 1326 seconds of walltime beyond CCSD. CFOUR’s ECC and NCC modules finish just a minute apart from each other. I don’t believe ECC’s reported (T) timing, but NCC reports 881 walltime seconds. Psi’s cc module RHF (T) computation takes an extra 50% walltime compared to CFOUR’s (T).

PsiOutput: https://gist.github.com/JonathonMisiewicz/3021aa26f97567921498b6b9d0352118
PsiTimer: https://gist.github.com/JonathonMisiewicz/2b3edbe20c2ac559a2c68dfab4bf6867
ECC Output: https://gist.github.com/JonathonMisiewicz/5a1e7b782417a7f4facac5f76255abf5
NCC Output: https://gist.github.com/JonathonMisiewicz/690119020be57ea857dd311d391bd0f3
GENBAS: https://gist.github.com/JonathonMisiewicz/e96aadcc52894c9a43af87e94a01ea16

Are you taking full advantage of optimizations in PSI4’s implementation (including threading)? Is this an apples-to-apples comparison?

If it is, then we should try to learn what they’re doing to make their code faster. Perhaps they’re keeping more of their tensors in core (something we can also do in PSI4, but isn’t the default), thus reducing I/O wait times, or perhaps they’ve got a better threading scheme.

If it’s not a fair comparison, then we should adjust our default parameters to improve our performance on your hardware. In which case, it wouldn’t be a “terrible, terrible idea” to use PSI4. :slight_smile:

That course of action sounds good to me.

As best as we can tell, it’s apples to apples. I ran this on the CCQC cluster, which Jet set up to allow threading on Psi. My CFOUR output says

    Running on 4 MPI processes
    Running with 1 threads/proc

    Memory limit is: 6.51926GB

    One-particle lists are cached
    Two-particle lists are cached
    T1 and T2 DIIS vectors are cached
    ABCI is not cached
    ABCD is done in the AO basis

    ABC and ABCD transposes are coarse-threaded
    An out-of-core algorithm is used for <Ab|Ci>
    DIIS is used to accelerate convergence of T1 and T2

Psi4 output says

    AO Basis        =     NONE
    ABCD            =     NEW
    Cache Level     =     2
    Cache Type      =     LOW

and

    Number of threads for explicit ijk threading:    4   

    MKL num_threads set to 1 for explicit threading.

The time falls from 43 to 36 minutes upon boosting Psi’s cache level to 3, but the difference between cache levels 2 and 3 is the ABCI, which CFOUR says it doesn’t cache. So that comparison is better for Psi but is no longer apples to apples.
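For reference, boosting the cache level is just an option change; this is how I'd expect to set it from the Python API, assuming the standard CACHELEVEL option:

    import psi4

    # The output above shows the default of 2; per the discussion, level 3
    # additionally caches the abci-class integrals.
    psi4.set_options({"cachelevel": 3})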

Let me know if there’s anything else I can check.

I would say that “apples to apples” is to use the highest level of optimization that each program can provide. Just because CFOUR doesn’t cache ABCI doesn’t mean you shouldn’t let PSI4 cache those integrals, since it can.

Also, what MPI parallelization are they using?