Is CCSD threading broken in 1.1?

So, I have just installed the 1.1 conda package (btw great job, probably the most painless and straightforward installation of a quantum chemistry program I have ever seen) and decided to give the CCSD(T) analytic gradients a try.
I gave it 4 threads on the command line (-n4) and 12 GB of RAM, but I noticed that during the CC iterations only 1 thread is running. The docs claim that both CCSD and CCSD(T) are threaded (http://psicode.org/psi4manual/1.1/introduction.html#f3), with the caveat that some parts may only thread via threaded BLAS. Is there an issue with threaded BLAS, or is it something else?

If I look at the thread activity inside the CC iterations for my input file after providing 4 threads (psi4 -n 4), I see 4 threads remaining active throughout the iterations. As you rightly pointed out, we have threaded BLAS operations in the conventional CC modules. Also, while the CCSD(T) energy code is explicitly parallelized using pthreads (and sets mkl_num_threads to 1), our analytic CCSD(T) gradients again rely on BLAS for parallelism. However, we are going to parallelize the gradients code explicitly soon.

That being said, could you provide the input file that you are using, so that I can confirm my observations?

Thanks!

@Diazonium what OS are you on? The conda package for mac won’t be threaded if I recall correctly.

Good point. AppleClang doesn’t have OpenMP yet (grr). And because clang builds psi4 so much faster than gcc, and we figure people aren’t doing production calculations on laptops, we accept the single-threadedness of clang for the conda packages.

Nevertheless, we have been having some threading issues on Linux lately. They happen when you set the threads programmatically and the psi4 and numpy math libraries get into a fight about whose MKL symbols get resolved, not for ordinary psi4-controlled threading, so I don’t think that’s your issue.

If copying the below into a file tu1.py and running time psi4 tu1.py and time psi4 -n 4 tu1.py shows some evidence of speedup, you’re fine.

tu1.py

import psi4
#psi4.set_num_threads(6)

def test_psi4_basic():
    """tu1-h2o-energy"""
    #! Sample HF H2O computation in the aug-cc-pV5Z basis (big enough to show threading)

    h2o = psi4.geometry("""
      O
      H 1 0.96
      H 1 0.96 2 104.5
    """)

    psi4.set_options({'basis': "aug-cc-pV5Z"})
    psi4.energy('scf')

if __name__ == '__main__':
    test_psi4_basic()

I am running psi4 on Linux (Debian 8), and I think for some reason BLAS threading is broken. CC iterations are running on 1 thread, the (T) energy calculation uses all 4, and the gradient code goes back to using only one.

memory 12 GB
set basis 6-31G*
set freeze_core false
molecule {
0 1
N -1.16541 0.51673 -0.13539
C 0.02720 1.27841 0.14118
C 1.04100 0.20818 0.32544
C 0.40764 -0.99888 0.15481
C -0.91628 -0.73942 -0.11951
H -0.09463 1.87718 1.06832
H 2.09034 0.34069 0.55287
H 0.87559 -1.97162 0.22569
H -1.67228 -1.49097 -0.30278
H 0.28621 1.93011 -0.71973
}
optimize('CCSD(T)')

I tested your suggestion, and indeed there is some speedup, so something else must be the problem.

Maybe you see similar issues to those described here.

I have done some additional testing; hopefully the results end up being useful.
The 1.1 conda package does not seem to make use of threaded BLAS; I tested on Debian 8, Debian 9, and Fedora 26.

I have also grabbed a tarball from git and compiled it using OpenBLAS (on Debian 9). This was very useful, because OpenBLAS can be controlled through the OPENBLAS_NUM_THREADS environment variable.

Testing with this has revealed that, when it comes to the CCSD iterations of a DF-CCSD single-point energy calculation, psi4 does not set MKL threading properly (or at least the conda package doesn’t). Not only did running the conda package with -n2 give the same timings as running with -n1, but the git build with OpenBLAS also produced practically identical timings to the conda package, and -n2 yet again did nothing for the DF-CCSD iteration times. However, with the OpenBLAS build I saw very significant speedups when I increased the BLAS thread count via the environment variable.
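
For reference, the comparison I ran looked roughly like the sketch below (ccsd_input.dat stands in for the single-point version of the input file I posted above, and psi4 is assumed to be on PATH):

import os
import subprocess
import time

# Run the same input with different OpenBLAS thread counts and compare wall times.
# OPENBLAS_NUM_THREADS only has an effect on the OpenBLAS-linked git build, of course.
for nblas in (1, 2, 4):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(nblas))
    start = time.time()
    subprocess.run(["psi4", "-n", "1", "ccsd_input.dat"], env=env, check=True)
    print("OPENBLAS_NUM_THREADS=%d: %.1f s" % (nblas, time.time() - start))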

So while I cannot say what or where the problem is, I can say for certain that it is there somewhere.

Thanks for your investigations. After 1.0 we switched threading to ignore environment variables in favor of -nN. It sounds like OpenBLAS didn’t get the message. Poking around, I think this snippet will need an openblas_set_num_threads(int num_threads); section, since internet rumor is that OpenBLAS favors OPENBLAS_NUM_THREADS over a plain omp_set_num_threads(nth).
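
For anyone who wants to check from the Python side whether a -n/set_num_threads request ever reaches OpenBLAS, here is a rough sketch (the soname "libopenblas.so.0" is an assumption about what your build actually linked):

import ctypes
import psi4

# Handle to the already-loaded OpenBLAS (assumes psi4 links this same soname).
blas = ctypes.CDLL("libopenblas.so.0")

psi4.set_num_threads(4)
print("requested psi4 threads:", 4)
print("OpenBLAS threads seen:", blas.openblas_get_num_threads())

# The proposed fix boils down to psi4 also making this call when OpenBLAS is the BLAS:
blas.openblas_set_num_threads(4)
print("OpenBLAS threads now:", blas.openblas_get_num_threads())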

Regarding the conda packages, 1.1 still used statically linked MKL, which had some threading problems. Dev conda packages (conda install psi4 -c psi4/label/dev -c psi4) switched to dynamically linked MKL in July, which helped. But then the MKL library linkage pattern (mkl_rt vs. mkl_intel_lp64 + mkl_intel_thread + mkl_core) between psi4 and numpy also mattered, as did which was loaded first, so there can be threading differences even amongst numpy builds from different conda channels. So I’m not too surprised by your conda results (though it looks like you passed the threading test in an earlier post).
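
If you want to see what the MKL in a given conda environment actually resolves to at runtime, a quick ctypes poke is enough (a sketch; "libmkl_rt.so" is an assumption about the library name, and the answer can depend on whether numpy or psi4 gets imported first):

import ctypes

# Ask the MKL runtime this process ended up with how many threads it plans to use.
mkl = ctypes.CDLL("libmkl_rt.so")
print("MKL max threads before imports:", mkl.mkl_get_max_threads())

# Import order can decide whose MKL symbols win (psi4 vs numpy), which is the
# linkage effect described above.
import numpy
import psi4
print("MKL max threads after imports:", mkl.mkl_get_max_threads())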

I also won’t rule out that CC is threading differently from other parts of the code.

While I used CC iteration threading as an example, because it is the easiest to test unambiguously, I am reasonably confident that BLAS threading is broken globally in the 1.1 conda package: DF-SCF also exhibited some single-threaded behavior. That is harder for me to test well, though, since it seems to be affected both by the built-in OpenMP/pthreads parallelism, which does work well, and by threaded BLAS.

In some tests with threaded BLAS I saw negative scaling when both native and BLAS-side threading were enabled and the total number of threads exceeded the number of cores. So thread congestion/nesting is certainly something to be avoided, but some calculations absolutely need threaded BLAS to get any meaningful scaling. The only solution I can think of is turning BLAS threading on and off dynamically: for example, turn it on for the CC iterations, then turn it off for the (T) calculation, because that uses native threading.
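
At the Python level the toggling pattern I have in mind would look roughly like the sketch below; for a single optimize('CCSD(T)') call it would of course have to happen inside the C++ modules themselves, and "libopenblas.so.0" is again an assumption about which BLAS the build linked:

import ctypes
import psi4

blas = ctypes.CDLL("libopenblas.so.0")  # handle to the BLAS psi4 is assumed to use
psi4.set_num_threads(4)                 # native OpenMP/pthreads side

h2o = psi4.geometry("""
  O
  H 1 0.96
  H 1 0.96 2 104.5
""")
psi4.set_options({"basis": "cc-pvdz", "freeze_core": False})

# Phase 1: conventional CCSD iterations lean on threaded BLAS, so turn it up.
blas.openblas_set_num_threads(4)
psi4.energy("ccsd")

# Phase 2: a natively threaded step like the (T) triples would want BLAS held
# to a single thread, to avoid nested oversubscription.
blas.openblas_set_num_threads(1)
psi4.energy("ccsd(t)")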

I am seeing that a significant amount of time in DF-SCF before any iterations is single-threaded. But I don’t know if this is related.

Hi Ioriab,

I am not sure if it’s too late to reply to this thread, but I’ve tried your suggestion and got:
time psi4 tu1.py:
real 0m5.520s
user 0m5.252s
sys 0m0.228s

time psi4 -n 4 tu1.py:
real 0m3.207s
user 0m9.783s
sys 0m0.333s

Does this mean it actually used more total time when I threaded it up?

Best,

Yang

“real” time is the total elapsed time, i.e. wall-clock time.
“user” is the total time spent by all threads in user mode; this is effectively “core time”. If the program uses 1 core at 100% utilization, this equals the real time; if 4 cores are used at 100%, it should be roughly 4 times the real time.
“sys” is the time spent in system calls, synchronization between threads, etc. It should be very small compared to the other two; if sys is comparable to user, then you have some sort of problem that is hurting performance.

In your case, using 4 threads instead of 1 resulted in a net speedup (run time down to 3.2 s from 5.5 s).
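
In numbers, using the timings you posted:

# Plain-Python check of the timings above.
real_1thread = 5.520   # real time for `psi4 tu1.py`
real_4thread = 3.207   # real time for `psi4 -n 4 tu1.py`
user_4thread = 9.783   # user time for `psi4 -n 4 tu1.py`

print("speedup from -n 4:       %.2fx" % (real_1thread / real_4thread))  # about 1.7x
print("effective cores working: %.2f" % (user_4thread / real_4thread))   # about 3 of 4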

Also, you want a job that runs for several minutes to really compare timings.

That’s a really clear explanation. Thanks a lot.

Thanks also! I will try a bigger job.

Whatever the cause was, I have tested 1.3 and found that the problem has disappeared. While conventional CCSD threading is still not stellar (it jumps around between 1 thread and N threads), it does work reasonably well now, and DF-CCSD seems to be completely fixed and well parallelized.
Yay!

I think we’re planning to rewrite the conventional CC code anyway. At the time that code was written, I don’t think threading was much of a concern, so I’m not surprised it doesn’t parallelize well.

I’ll be closing this. If there are any further reports of threading issues, those belong in a new topic, as we are now well past 1.1.
