Is there parallelization for CCSD calculations in Psi4?

Hi,

I’ve seen some weird behavior while running CCSD energy jobs with Psi4. No matter how many threads I request, Psi4 uses only 1 core (1 thread) during the integral transformation step. The SCF step looks fine. Is that the expected behavior? Or did I install Psi4 incorrectly? Hoping someone can help.

Thanks!

Is that what it should be?

The integral transformation calls into BLAS under the hood, so provided that the linked BLAS library is threaded, the integral transformation should be as well. The SCF code is explicitly threaded (OpenMP pragmas are used directly), so if the SCF step runs on multiple threads then OpenMP is available and linked correctly. It may still be that the BLAS/LAPACK library in use is not threaded.
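
One way to check which BLAS a build links against is to run `ldd` on the Psi4 core shared library (its location depends on the install) and look at the library names. The sketch below only classifies `ldd` output text by name. It relies on the MKL naming convention, where `libmkl_sequential` is the serial layer and `libmkl_intel_thread` / `libmkl_gnu_thread` are the threaded layers. The function name `classify_blas` and the sample text are hypothetical, and the sample is made up for illustration, not taken from a real Psi4 install.

```python
def classify_blas(ldd_output):
    """Guess the BLAS threading layer from `ldd` output text (name-based heuristic)."""
    names = []
    for line in ldd_output.splitlines():
        parts = line.split()
        if parts:
            names.append(parts[0].lower())  # first token on each line is the library name

    if any("mkl_sequential" in n for n in names):
        return "MKL, sequential layer (BLAS calls will not be threaded)"
    if any("mkl_intel_thread" in n or "mkl_gnu_thread" in n for n in names):
        return "MKL, threaded layer"
    if any("mkl_rt" in n for n in names):
        return "MKL single dynamic library (threading layer chosen at run time)"
    if any("openblas" in n for n in names):
        return "OpenBLAS (threading depends on how it was built)"
    if any("blas" in n or "lapack" in n for n in names):
        return "other BLAS/LAPACK (threading unknown)"
    return "no BLAS library recognized"


# Made-up sample for illustration only; paths and addresses are placeholders.
sample = """\
    libmkl_intel_lp64.so => /opt/intel/mkl/lib/libmkl_intel_lp64.so (0x0000)
    libmkl_sequential.so => /opt/intel/mkl/lib/libmkl_sequential.so (0x0000)
    libmkl_core.so => /opt/intel/mkl/lib/libmkl_core.so (0x0000)
"""
print(classify_blas(sample))
```

A statically linked BLAS will not show up in `ldd` output at all, so "no BLAS library recognized" does not by itself mean anything is wrong.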

Or did I install Psi4 incorrectly?

It is difficult to say from the information you have given. If psi4 is running then it looks like the install was successful, but you may have configured it in a way you did not want.

Some questions that will help figure this out:

  1. How did you install psi4? (i.e., conda package or build from source)
    1.1 If you installed from source, what configuration options did you pass to cmake?
  2. What version of psi4 are you using? You should be able to see that at the top of the output file.
  3. What type of system are you running on? (Linux/mac + distro/version as it applies)
  4. How did you actually determine that the integral transformation is only using one thread? The SCF code shows the number of threads in the output file, but the transformation does not.
  5. Did you observe one thread or multiple being used during the CCSD iterations?

Answers to these questions will help narrow down a reason for the behavior you are seeing.

Hi! Thank you for replying. Here is my answer to your question.

  1. I installed the 1.2 version from source.
    1.1 Options I used are:
    cmake -H. -Bbuild -DCMAKE_INSTALL_PREFIX=/opt/software/psi4/psi4-github-install \
      -DCMAKE_CXX_COMPILER=icpc \
      -DCMAKE_C_COMPILER=icc \
      -DCMAKE_CXX_FLAGS=-xCORE-AVX512 \
      -DCMAKE_C_FLAGS=-xCORE-AVX512
  2. 1.2 version
  3. CentOS 7
  4. I determined that by watching top, which shows a CPU usage of only 100% during the transformation, while the SCF part gives a CPU usage of 4800% (I set 48 threads).
  5. I didn’t use density fitting in my input; the scf_type is just direct. I did see a little parallelization during the CC amplitude calculation (the CPU usage jumps from 100% to 500% or 600% from time to time).

Well, the jump during the CCSD iterations tells me that you are using a threaded BLAS, so that rules out an unthreaded BLAS as the cause. How large is this system? The transformation step does the integral transformation but also sorts the MO integrals into the classes used by the CC code. The sort is not threaded and can easily become the bottleneck. That would explain why the step appears single-threaded in top: it really is running on one thread most of the time.
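
To get a feel for how much an unthreaded sort can limit the whole step, Amdahl's law gives the best possible speedup when only part of the work is parallel. This is a minimal sketch; the parallel fractions below are illustrative assumptions, not measurements from this job, and `amdahl_speedup` is just a name chosen for the example.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Upper bound on speedup when only `parallel_fraction` of the runtime is threaded."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Assumed fractions of the step that are threaded (the rest being the serial sort).
for frac in (0.2, 0.5, 0.9):
    speedup = amdahl_speedup(frac, 48)
    print(f"parallel fraction {frac:.0%}: at most {speedup:.2f}x on 48 threads")
```

If the serial sort takes most of the wall time, the bound stays close to 1x regardless of the thread count, which is consistent with top showing about 100% for most of the step.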

Thanks for replying! The system is large with about 630 basis functions. Is that the reason why the integral step uses only one core?

BLAS threading is really wonky in my experience, and I have had similar issues. See this thread for relevant information: http://forum.psicode.org/t/is-ccsd-threading-broken-in-1-1/584

So the answer is yes, but it is sometimes broken in mysterious ways, resulting in all BLAS/LAPACK operations (matrix multiply, matrix-vector product, eigenvalue calculation, etc.) being run single-threaded. Since a lot of the heavy lifting is done via BLAS, large parts of Psi4 can become effectively single-threaded if BLAS parallelism is not engaged properly for whatever reason.
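
One thing that is cheap to rule out is the environment the job runs in. The sketch below prints the usual thread-count variables (`OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `MKL_DYNAMIC`, `OPENBLAS_NUM_THREADS`) and flags any thread count pinned to 1. Which of them matter depends on the BLAS your build links against, which is an assumption here, and the helper names are made up for this example. Psi4 can also set the thread count itself (for example through the `-n` command-line option), so treat this as a sanity check rather than a diagnosis.

```python
import os

# Common thread-control variables; which ones apply depends on the linked BLAS.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "MKL_DYNAMIC", "OPENBLAS_NUM_THREADS")


def thread_env_report(env=None):
    """Return the current value of each thread-control variable, or '<unset>'."""
    env = os.environ if env is None else env
    return {name: env.get(name, "<unset>") for name in THREAD_VARS}


def pinned_to_one(env=None):
    """List the *_NUM_THREADS variables that are explicitly set to 1."""
    report = thread_env_report(env)
    return [name for name in THREAD_VARS
            if name.endswith("NUM_THREADS") and report[name].strip() == "1"]


for name, value in thread_env_report().items():
    print(f"{name} = {value}")
for name in pinned_to_one():
    print(f"note: {name} is set to 1, which may force single-threaded BLAS")
```

Running this from the same shell or batch script that launches Psi4 shows what the job actually inherits, which can differ from an interactive login.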

Thanks for replying!