Multithreading in downloaded binary distribution

Hello again, newbie here,
I installed psi4 using the conda distribution. Among others, we are running DFMP2 and SAPT calculations. I noticed that much of the time psi4 is running a single-thread (100% vs X00%). Running a test I see that test_threaded_blas gives an xfail.
My assumption was that the distributed binary comes with its own linked multithreaded libraries (mkl perhaps). 1) is this true?
2) during the calculation above is it normal to have >50% of the time be single-threaded? And, can I see the threading information in the log files?
2) Our operating system is quite old (centos 6.8) - sorry - and the various more modern libraries, compilers, etc. are installed in special directories. Do we need to have a new-ish installation of MKL somewhere for the threading to work properly?

Thank you.

This is a question for @loriab.

  1. yes, there’s a good multithreaded MKL that ships with conda psi4 that should be doing the crunching.
    2a) you can see some timing/threading info in the timer.dat file that runs cummulatively in the submission directory.
    2b) no problem on the old OS – the conda is set up to be self contained so if it can run at all (new enough glibc), it can run with good performance.

That threading test is good, but running it through pytest isn’t so good. You’ll find a script version of the test in share/psi4/scripts/. try it with --ldd, --psithon, and --psiapi invocations. Better post the results here.

It is normal to have some single-threaded parts, but SAPT and DFMP2 are some of the best-threaded methods in psi4, so something could be wrong.

Thank you that’s a fantastic answer.

I installed the py2.7 version of psi4 (there may have been a reason but I doubt it…) and we’ve been running it for a while, so I cannot check the script above without reinstalling with py3.X - which I will do.
Meanwhile looking at our timers (representative attached) there is excellent scaling for most components so perhaps I was just being impatient. I’ll follow up with the thread checking results after I install a python3 version of psi4.

******************* timers.dat for run with 4 threads ******************

Host: compute-0-151.local

Timers On : Mon Aug 12 14:09:55 2019
Timers Off: Mon Aug 12 15:13:58 2019

Wall Time: 3843.00 seconds

JK: (A|mn) : 434.28u 0.55s 108.92w 15 calls
JK: (A|Q)^-1/2: 15.42u 0.08s 9.905021w 13 calls
JK: (Q|mn) : 530.71u 0.01s 529.55w 15 calls
HF: Form H : 0.35u 0.07s 0.340183w 10 calls
HF: Form S/X: 13.91u 0.13s 7.077274w 10 calls
HF: Guess : 2.46u 0.01s 0.640778w 10 calls
SAD Guess : 2.33u 0.01s 0.586342w 2 calls
JK: D : 3.58u 0.02s 1.015655w 160 calls
JK: USO2AO : 1.33u 0.02s 0.399976w 160 calls
JK: JK : 4849.57u 249.25s 1861.91w 160 calls
JK: J : 210.54u 0.32s 177.02w 205 calls
JK: J1 : 115.36u 0.17s 89.74w 243 calls
JK: J2 : 92.73u 0.13s 85.51w 243 calls
JK: K : 4582.29u 3.04s 1419.01w 181 calls
JK: K1 : 4116.96u 2.16s 1030.76w 240 calls
JK: K2 : 465.09u 0.88s 387.99w 215 calls
JK: AO2USO : 0.00u 0.00s 0.000246w 160 calls
DIISManager::add_entry: 4.71u 0.88s 1.995340w 118 calls
DIISManager::extrapolate: 9.98u 1.98s 3.139046w 108 calls
DIISManager::extrapolate: bMatrix setup: 4.88u 1.21s 1.562985w 108 calls
DIISManager::extrapolate: bMatrix pseudoinverse: 0.04u 0.00s 0.007677w 108 calls
DIISManager::extrapolate: form new data: 5.05u 0.77s 1.567206w 108 calls
HF: Form G : 3621.51u 208.01s 1397.01w 93 calls
HF: Form F : 1.09u 0.01s 0.372158w 93 calls
HF: DIIS : 69.62u 3.00s 30.77w 93 calls
HF: Form C : 99.20u 0.17s 70.36w 93 calls
HF: Form D : 0.40u 0.00s 0.293118w 93 calls
DFMP2 Singles: 1.55u 0.00s 1.552932w 8 calls
DFMP2 Aia : 1222.37u 28.82s 430.58w 8 calls
DFMP2 (A|mn): 1011.83u 0.10s 253.11w 17 calls
DFMP2 (A|mn)C_mi: 81.80u 0.05s 72.36w 17 calls
DFMP2 (A|mi)C_na: 62.72u 0.14s 15.73w 17 calls
DFMP2 Aia Write: 5.08u 2.05s 2.005388w 17 calls
DFMP2 iaQ : 13.87u 1.02s 13.78w 2 calls
DFMP2 Metric: 27.64u 0.22s 24.39w 8 calls
DFMP2 Aia Read: 0.01u 1.35s 1.351316w 8 calls
DFMP2 (Q|A)(A|ia): 53.27u 0.07s 53.34w 10 calls
DFMP2 Qia Write: 0.01u 1.95s 1.971972w 8 calls
DFMP2 Qia Read: 0.69u 0.90s 1.614037w 18 calls
DFMP2 Bia Write: 0.00u 0.18s 0.182224w 2 calls
DFMP2 aiQ : 0.03u 0.37s 0.395772w 2 calls
DFMP2 Tij : 84.94u 2.79s 84.29w 2 calls
DFMP2 Cia Read: 0.05u 0.20s 0.218588w 4 calls
DFMP2 I : 65.06u 0.00s 65.05w 4 calls
DFMP2 T2 : 11.42u 0.00s 2.872846w 4 calls
DFMP2 G : 31.48u 0.02s 30.33w 2 calls
DFMP2 Pab : 16.54u 0.00s 16.54w 2 calls
DFMP2 Gia Write: 0.00u 0.19s 0.196234w 2 calls
DFMP2 Tab : 43.01u 1.13s 36.98w 2 calls
DFMP2 Qai Read: 0.06u 0.08s 0.136110w 4 calls
DFMP2 Pij : 2.06u 0.01s 1.187400w 2 calls
DFMP2 gamma : 3.79u 0.30s 3.792650w 2 calls
DFMP2 Gia Read: 0.13u 0.18s 0.171819w 4 calls
DFMP2 g : 3.45u 0.00s 3.443438w 2 calls
DFMP2 Gia : 0.21u 0.49s 0.699099w 2 calls
DFMP2 aiG Write: 0.00u 0.13s 0.126127w 2 calls
DFMP2 AB^x : 1.04u 0.02s 0.338477w 2 calls
DFMP2 Amn^x : 416.25u 5.09s 114.46w 2 calls
DFMP2 L : 189.41u 4.52s 60.67w 2 calls
DFMP2 P : 0.02u 0.03s 0.041545w 2 calls
DFMP2 W : 0.09u 0.01s 0.105290w 2 calls
DFMP2 Z : 1269.33u 8.19s 472.19w 2 calls
DFMP2 Qia : 67.47u 4.63s 70.06w 6 calls
DFMP2 Energy: 297.59u 3.01s 77.45w 6 calls
JK: (A|Q)^-1: 11.52u 13.27s 13.46w 1 call
JK: (Q|mn) Write: 0.03u 12.96s 12.99w 2 calls
JK: (Q|mn) Read: 0.48u 150.87s 151.36w 72 calls
DF Integrals : 219.83u 11.60s 101.98w 1 call
W Integrals : 0.08u 0.25s 0.34180600w 1 call
Elst10 : 0.00u 0.00s 0.00012300w 1 call
Exch10 : 0.18u 0.02s 0.05957900w 1 call
Exch10 S^2 : 0.18u 0.04s 0.05338000w 1 call
Ind20 : 64.50u 18.48s 30.09w 1 call
Exch-Ind20 : 2.38u 0.36s 0.68302900w 1 call
Exch-Disp20 N^5 : 19.26u 1.34s 5.90863400w 1 call
Exch-Disp20 N^4 : 80.97u 2.09s 51.89w 1 call


The last py27 version was summer 2018 with v1.2.1, so definitely worth trying out the summer 2019 v1.3.2 .

K python 3.6 version running (from Psi4conda-latest-py36-Linux-x86_64.sh). Note I did not conda update this version.
I ran the threading script with the --ldd, --psithon, --psiapi and plain with no arguments. I quote them all below. Meanwhile, grepping the desired and achieved scaling I have:

psiapi_threading.txt: NumPy@n4 : Psi4@n4 ratio (want ~1): 0.27
psiapi_threading.txt: Psi4@n1 : Psi4@n4 ratio (want ~4): 1.05
psithon_threading.txt: Psi4@n1 : Psi4@n4 ratio (want ~4): 1.77
threading.txt: NumPy@n4 : Psi4@n4 ratio (want ~1): 0.26
threading.txt: Psi4@n1 : Psi4@n4 ratio (want ~4): 0.97
threading.txt: Psi4@n1 : Psi4@n4 ratio (want ~4): 1.83

Does it all look reasonable? (at first glance no but we can’t always get what we want in scaling).

-------------------- full output without any options -------------

/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Threads set to 1 by Python driver.
Threads set to 4 by Python driver.
/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
Traceback (most recent call last):
File “/share/apps/psi4conda_36/bin/psi4”, line 247, in
exec(content)
File “”, line 42, in
File “/share/apps/psi4conda_36/lib//python3.6/site-packages/psi4/driver/p4util/util.py”, line 225, in compare_values
raise TestComparisonError(message)

TestComparisonError: Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
Traceback (most recent call last):
File “/share/apps/psi4conda_36/bin/psi4”, line 247, in
exec(content)
File “”, line 42, in
File “/share/apps/psi4conda_36/lib//python3.6/site-packages/psi4/driver/p4util/util.py”, line 225, in compare_values
raise TestComparisonError(message)

TestComparisonError: Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
libiomp5.so => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so (0x00007fc3f735c000)
Version information:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/core.so:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/lib64/libpthread.so.0:
/lib64/libc.so.6:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libpcm.so.1:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libgdma.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libderiv.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libint.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/liberd.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libsimint.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libefp.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so:
/lib64/libm.so.6:
/lib64/libdl.so.2:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libchemps2.so.2:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libz.so.1:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libgcc_s.so.1:
/lib64/librt.so.1:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libhdf5.so.10:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libhdf5_hl.so.10:
Time for threads 1, size 200: Psi4: 0.000786 NumPy: 0.000831
Time for threads 1, size 500: Psi4: 0.012445 NumPy: 0.012611
Time for threads 1, size 2000: Psi4: 0.725494 NumPy: 1.311480
Time for threads 1, size 4000: Psi4: 5.531336 NumPy: 5.353334
Time for threads 4, size 200: Psi4: 0.000725 NumPy: 0.000294
Time for threads 4, size 500: Psi4: 0.011711 NumPy: 0.003782
Time for threads 4, size 2000: Psi4: 0.682204 NumPy: 0.190112
Time for threads 4, size 4000: Psi4: 5.684884 NumPy: 1.480672
NumPy@n4 : Psi4@n4 ratio (want ~1): 0.26
Psi4@n1 : Psi4@n4 ratio (want ~4): 0.97
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n1.out -n1 …
Time for threads 1: Psi4: 180.558416
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n4.out -n4 …
Time for threads 4: Psi4: 98.666945
Psi4@n1 : Psi4@n4 ratio (want ~4): 1.83
Running ldd -v /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/core.so | grep -e ‘:’ -e ‘mkl’ -e ‘openblas’ -e ‘iomp5’ -e ‘gomp’ -e ‘l
ibomp’ …
{‘mkl’: 0, ‘iomp5’: 9, ‘openblas’: 0, ‘omp’: 0, ‘gomp’: 0}
{‘mkl’: 0, ‘iomp5’: 2, ‘openblas’: 0, ‘omp’: 0, ‘gomp’: 0}

--------------------------- --ldd option --------------------------------

/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
libiomp5.so => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so (0x00007fb2d6c47000)
Version information:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/core.so:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/lib64/libpthread.so.0:
/lib64/libc.so.6:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libpcm.so.1:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libgdma.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libderiv.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libint.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/liberd.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libsimint.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libefp.so:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so:
/lib64/libm.so.6:
/lib64/libdl.so.2:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libchemps2.so.2:
libiomp5.so (VERSION) => /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/libiomp5.so
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libz.so.1:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libgcc_s.so.1:
/lib64/librt.so.1:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libhdf5.so.10:
/share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/…/…/…/./libhdf5_hl.so.10:
Running ldd -v /share/apps/psi4conda_36/lib/python3.6/site-packages/psi4/core.so | grep -e ‘:’ -e ‘mkl’ -e ‘openblas’ -e ‘iomp5’ -e ‘gomp’ -e ‘l
ibomp’ …
{‘mkl’: 0, ‘iomp5’: 9, ‘openblas’: 0, ‘omp’: 0, ‘gomp’: 0}
{‘mkl’: 0, ‘iomp5’: 2, ‘openblas’: 0, ‘omp’: 0, ‘gomp’: 0}

----------------------- --psithon option -----------------------------------

/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
Traceback (most recent call last):
File “/share/apps/psi4conda_36/bin/psi4”, line 247, in
exec(content)
File “”, line 42, in
File “/share/apps/psi4conda_36/lib//python3.6/site-packages/psi4/driver/p4util/util.py”, line 225, in compare_values
raise TestComparisonError(message)

TestComparisonError: Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
Traceback (most recent call last):
File “/share/apps/psi4conda_36/bin/psi4”, line 247, in
exec(content)
File “”, line 42, in
File “/share/apps/psi4conda_36/lib//python3.6/site-packages/psi4/driver/p4util/util.py”, line 225, in compare_values
raise TestComparisonError(message)

TestComparisonError: Nuclear Repulsion Energy: computed value (85.1890641964) does not match (85.1890645313) to 9 digits.
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n1.out -n1 …
Time for threads 1: Psi4: 204.670653
Running psi4 -i _thread_test_input_psi4_yo.in -o _thread_test_input_psi4_yo_n4.out -n4 …
Time for threads 4: Psi4: 115.619684
Psi4@n1 : Psi4@n4 ratio (want ~4): 1.77

---------------- – psiapi option --------------------------------

/share/apps/psi4conda_36/lib/python3.6/site-packages/v2rdm_casscf/v2rdm_casscf.so loaded.
Threads set to 1 by Python driver.
Threads set to 4 by Python driver.
Time for threads 1, size 200: Psi4: 0.000767 NumPy: 0.000800
Time for threads 1, size 500: Psi4: 0.011045 NumPy: 0.011701
Time for threads 1, size 2000: Psi4: 0.689397 NumPy: 1.261678
Time for threads 1, size 4000: Psi4: 5.518337 NumPy: 5.416282
Time for threads 4, size 200: Psi4: 0.000747 NumPy: 0.000298
Time for threads 4, size 500: Psi4: 0.011578 NumPy: 0.003516
Time for threads 4, size 2000: Psi4: 0.668255 NumPy: 0.186691
Time for threads 4, size 4000: Psi4: 5.265579 NumPy: 1.427655
NumPy@n4 : Psi4@n4 ratio (want ~1): 0.27
Psi4@n1 : Psi4@n4 ratio (want ~4): 1.05